Analytical vs Gradient Descent methods of solving linear regression

Gradient descent offers an iterative way to fit a linear model. There is also a traditional, direct way of solving it, called the normal equation. To use it, you build a matrix in which each observed record becomes a row (m rows) and each feature becomes a column, and you prefix an additional column of ones to represent the constant term (n+1 columns). This matrix, denoted X, has dimension m x (n+1). You represent the response variable as a vector y of dimension m x 1.

The optimal coefficients are given by $\theta = (X^{T}X)^{-1}X^{T}y$, where $\theta$ is a vector of dimension (n+1) x 1 containing $[\theta_{0}, \theta_{1}, \ldots, \theta_{n}]$.
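The formula above can be sketched directly in NumPy. The data here is a toy set invented for illustration (a noise-free line $y = 1 + 2x$), so the recovered coefficients are exact:

```python
import numpy as np

# Hypothetical toy data: m = 5 observations, n = 1 feature, y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Prefix the constant column of ones: X has shape m x (n+1).
X = np.column_stack([np.ones_like(x), x])

# theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # -> approximately [1. 2.]
```

In production code you would normally call `np.linalg.solve(X.T @ X, X.T @ y)` or `np.linalg.lstsq(X, y, rcond=None)` instead of forming the explicit inverse, which is slower and less numerically stable; the explicit inverse is used here only to mirror the formula.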

Caveats when applying analytical technique

  • In the analytical, normal-equation method there is no iteration to arrive at the optimal $\theta$; you simply calculate it.
  • You do not have to scale features. It is fine to leave them in their native units.
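The second caveat can be demonstrated with features on wildly different scales. The data below is hypothetical (a noise-free model $y = 1 + 2x_1 + 0.001x_2$, with $x_2$ in the thousands); the normal equation recovers the coefficients without any scaling:

```python
import numpy as np

# Hypothetical data: x1 is small, x2 is in the thousands.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1000.0, 3000.0, 2000.0, 5000.0, 4000.0])
y = 1.0 + 2.0 * x1 + 0.001 * x2  # true model, no noise

# No feature scaling applied: columns stay in their native units.
X = np.column_stack([np.ones_like(x1), x1, x2])
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # -> approximately [1. 2. 0.001]
```

Gradient descent on the same unscaled data would converge very slowly, because the loss surface is badly elongated along the large-scale feature; this is exactly why scaling matters for GD but not for the normal equation.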

Guidelines for choosing between GD and Normal equation

  • GD requires you to tune the learning rate $\alpha$, while the normal equation does not.
  • GD is an iterative process, while the normal equation is not.
  • GD shines when you have a large number of attributes / features / independent variables. Its cost is $O(kn^{2})$ for n features and k iterations.
  • The normal equation needs to invert an (n+1) x (n+1) matrix, which is an expensive operation; its time complexity is $O(n^{3})$.
  • If you have more than roughly 10,000 independent variables, or if the number of observations is less than the number of columns (m < n+1), the normal equation becomes impractical: in the first case the $O(n^{3})$ inversion is too expensive, and in the second $X^{T}X$ is not invertible. You are better off with gradient descent.
  • Highly correlated features (multicollinearity) or having more features than observations can also leave $X^{T}X$ non-invertible. In these cases you can switch to GD, or you can delete some redundant features or apply regularization if you want to continue with the normal equation.
  • GD is an iterative approximation technique, while the normal equation is deterministic. In general GD might settle in a local minimum rather than the global one, although for linear regression the squared-error loss is convex, so there is no local minimum other than the global one.
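For contrast with the normal-equation sketch above, here is a minimal gradient-descent version on the same toy data ($y = 1 + 2x$). The learning rate and iteration count are hand-picked assumptions, which is precisely the tuning burden the guidelines mention:

```python
import numpy as np

# Same hypothetical toy data: y = 1 + 2x, noise-free.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
X = np.column_stack([np.ones_like(x), x])  # m x (n+1)
m = len(y)

theta = np.zeros(2)    # start at the origin
alpha = 0.05           # learning rate, chosen by trial and error
for _ in range(5000):  # fixed iteration budget
    # Gradient of the mean squared error: (1/m) * X^T (X theta - y)
    gradient = (X.T @ (X @ theta - y)) / m
    theta -= alpha * gradient

print(theta)  # -> close to [1. 2.]
```

Unlike the closed-form solution, the answer here is only approximate, and a poorly chosen $\alpha$ (too large) would make the iterates diverge instead of converge.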