# Analytical vs Gradient Descent methods of solving linear regression

Gradient descent offers an iterative method for fitting linear models. However, there is also a traditional, direct way of solving them, called the normal equation. To use it, you build a matrix in which each observation becomes a row (m rows) and each feature becomes a column (n columns). You then prepend an additional column of ones to represent the constant (intercept) term, giving n+1 columns. This matrix, denoted X, has dimension m x (n+1). The response variable is represented as a vector y of dimension m x 1. The optimal coefficients are given by $$\theta = (X^{T}X)^{-1}X^{T}y$$, where $$\theta$$ is a vector of length n+1 containing $$[\theta_{0}, \theta_{1}, \ldots, \theta_{n}]$$.
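As a minimal sketch of the formula above, the following NumPy snippet builds X with a leading column of ones and computes $$\theta$$. The synthetic data, the seed, and the names `features` and `true_theta` are assumptions made for illustration only.

```python
import numpy as np

# Synthetic data (illustrative assumption): m observations, n features.
rng = np.random.default_rng(0)
m, n = 100, 2
features = rng.uniform(0, 10, size=(m, n))

# Prepend a column of ones for the constant term, so X is m x (n+1).
X = np.column_stack([np.ones(m), features])

# Hypothetical "true" coefficients used only to generate y.
true_theta = np.array([1.5, 2.0, -0.5])
y = X @ true_theta + rng.normal(0, 0.01, m)

# Normal equation, written exactly as in the formula.
theta = np.linalg.inv(X.T @ X) @ X.T @ y

# Equivalent result without forming the inverse explicitly.
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(theta)
```

Both lines compute the same coefficients. Solving the linear system with `np.linalg.solve` is usually preferred in practice because it avoids computing the inverse explicitly.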

## Caveats when applying the analytical technique

• In the analytical (normal equation) method, there is no iteration to arrive at the optimal $$\theta$$. You simply calculate it.
• You do not have to scale features. It is fine to leave them in their native units.
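The point about feature scaling can be checked numerically. The sketch below uses made-up data and a hypothetical helper named `normal_equation`. It fits the same model once on raw features with very different ranges and once on standardized features, then compares the predictions.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta (hypothetical helper)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up data: two features on very different scales.
rng = np.random.default_rng(0)
m = 50
raw = np.column_stack([rng.uniform(0, 1000, m), rng.uniform(0, 1, m)])
y = 3.0 + 0.5 * raw[:, 0] - 2.0 * raw[:, 1] + rng.normal(0, 0.1, m)

# Standardized copy of the same features (zero mean, unit variance).
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Design matrices with the leading column of ones.
X_raw = np.column_stack([np.ones(m), raw])
X_scaled = np.column_stack([np.ones(m), scaled])

pred_raw = X_raw @ normal_equation(X_raw, y)
pred_scaled = X_scaled @ normal_equation(X_scaled, y)

# The coefficients differ between the two fits, but the predictions agree.
print(np.allclose(pred_raw, pred_scaled))
```

The coefficients change with the units of each feature, but the fitted values do not, which is why scaling is optional here.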

## Guidelines for choosing between GD and Normal equation

• GD needs you to tune $$\alpha$$ (the learning rate), while the normal equation does not.
• GD is an iterative process, while the normal equation is not.
• GD works well when you have a large number of attributes / features / independent variables. Its cost is often quoted as $$O(kn^{2})$$ for n features and k iterations; counting observations as well, each batch iteration costs $$O(mn)$$, for $$O(kmn)$$ in total.
• The normal equation needs to invert the (n+1) x (n+1) matrix $$X^{T}X$$, which is an expensive operation. Its time complexity is $$O(n^{3})$$.
• If you have a very large number of independent variables (more than about 10,000), the matrix inversion in the normal equation becomes slow. Separately, if the number of observations / rows is less than the number of columns (m < (n+1)), then $$X^{T}X$$ is not invertible. In either case, you are better off with gradient descent.
• If you have highly correlated features (multicollinearity) or more features than observations, you might end up with a non-invertible matrix for the normal equation. In these cases, you can choose GD or, if you want to continue with the normal equation, you can delete some features or apply regularization techniques.
• GD is an iterative approximation technique, while the normal equation gives an exact, closed-form solution. In general, GD might settle in a local minimum rather than the global minimum. For linear regression, however, the loss function is convex, so there are no local minima other than the global minimum.
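To make the comparison concrete, here is a minimal sketch of batch gradient descent on the mean squared error, run next to the normal equation on the same data. The data, the learning rate `alpha=0.1`, the iteration count, and the function name `gradient_descent` are all illustrative assumptions, not tuned recommendations.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent for linear regression (illustrative sketch)."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iterations):
        # Gradient of the loss (1 / 2m) * ||X theta - y||^2 w.r.t. theta.
        gradient = X.T @ (X @ theta - y) / m
        theta -= alpha * gradient
    return theta

# Made-up data with features already on comparable scales.
rng = np.random.default_rng(42)
m = 200
features = rng.normal(size=(m, 2))
X = np.column_stack([np.ones(m), features])
y = X @ np.array([4.0, 2.0, -1.0]) + rng.normal(0, 0.1, m)

theta_gd = gradient_descent(X, y)              # iterative, needs alpha
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)   # direct, no alpha

print(theta_gd)
print(theta_ne)
```

Because the loss is convex, GD with a suitable learning rate approaches the same coefficients that the normal equation computes directly. With poorly scaled features or a badly chosen $$\alpha$$, GD may need many more iterations or may diverge.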