Analytical vs Gradient Descent methods for solving linear regression
Gradient descent offers an iterative method for fitting linear models. However, there is also a traditional, direct way of solving them, called the normal equation. For the normal equation, you build a matrix in which each observed record becomes a row (\(m\) rows) and each feature becomes a column, prefixing an additional column of ones to represent the constant term (\(n+1\) columns in total). This matrix, denoted \(X\), has dimension \(m \times (n+1)\). You represent the response variable as a vector \(y\) of dimension \(m \times 1\).
The formula to calculate the optimal coefficients is \(\theta = (X^{T}X)^{-1}X^{T}y\), where \(\theta\) is a vector of shape \((n+1) \times 1\) containing \([\theta_{0}, \theta_{1}, \ldots, \theta_{n}]\).
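As a concrete sketch (the toy data and variable names here are purely illustrative), this is one way to build the design matrix and compute \(\theta\) with NumPy:

```python
import numpy as np

# Toy data: m = 5 observations, n = 1 feature.
X_raw = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Prepend a column of ones for the intercept theta_0,
# giving X the shape m x (n+1).
m = X_raw.shape[0]
X = np.hstack([np.ones((m, 1)), X_raw])

# Normal equation: theta = (X^T X)^{-1} X^T y.
# Solving the linear system is numerically preferable
# to forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [theta_0, theta_1]
```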
Caveats when applying the analytical technique
- In the analytical (normal equation) method, there is no iteration to arrive at the optimal \(\theta\); you simply calculate it.
- You do not have to scale features. It is fine to leave them in their native units.
Guidelines for choosing between GD and the normal equation
- GD needs you to tune \(\alpha\) (the learning rate), while the normal equation does not.
- GD is an iterative process, while the normal equation is not (see the gradient descent sketch after this list).
- GD shines when you have a large number of attributes / features / independent variables. Its time complexity is \(O(kn^{2})\) for \(n\) features and \(k\) iterations.
- The normal equation needs to invert a matrix, which is an expensive operation; its time complexity is \(O(n^{3})\).
- If you have more than 10,000 independent variables, or if the number of observations / rows is less than the number of independent variables (\(m < n+1\)), the normal equation may not produce an invertible matrix. You are better off with gradient descent regression.
- If you have highly correlated features (multicollinearity), or more features than observations, you might end up with a non-invertible matrix for the normal equation. In these cases you can switch to GD, or you can delete some features or apply regularization techniques if you want to continue with the normal equation (a small reproduction of the singular case follows this list).
- GD is an approximation technique, while the normal equation is a deterministic approach. GD might settle in a local minimum rather than the global minimum, although for linear regression the loss function is convex, so there are no local minima, only a single global minimum.
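To make the contrast with the closed-form solution concrete, here is a minimal sketch of batch gradient descent for the same linear model (the choices of \(\alpha\), the iteration count, and the toy data are illustrative assumptions, not prescriptions):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=5000):
    """Batch gradient descent for linear regression.

    X is the m x (n+1) design matrix (bias column included);
    y is the response vector of length m.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iterations):
        # Gradient of the mean-squared-error cost: (1/m) * X^T (X theta - y)
        gradient = (X.T @ (X @ theta - y)) / m
        theta -= alpha * gradient
    return theta

# Toy usage with the same design-matrix convention as above.
X = np.column_stack([np.ones(5), np.arange(1.0, 6.0)])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
print(gradient_descent(X, y))  # converges toward the normal-equation theta
```

With a well-chosen \(\alpha\) this converges to essentially the same \(\theta\) as the normal equation; too large an \(\alpha\) diverges, and too small an \(\alpha\) converges slowly.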
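The non-invertibility caveat is easy to reproduce: with a perfectly collinear duplicate feature, \(X^{T}X\) is singular, and the Moore-Penrose pseudo-inverse is one common fallback (a sketch of one remedy, not the only one):

```python
import numpy as np

m = 5
x1 = np.linspace(1.0, 5.0, m)
x2 = 2.0 * x1  # perfectly collinear with x1
X = np.column_stack([np.ones(m), x1, x2])
y = np.array([2.0, 4.1, 5.9, 8.2, 10.0])

# X^T X is singular here, so its determinant is (numerically) zero
# and plain inversion fails or is unstable.
print(np.linalg.det(X.T @ X))  # ~0

# The pseudo-inverse still yields a least-squares solution.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)
```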