Solving multivariate linear regression using Gradient Descent
Note: This is a continuation of the Gradient Descent topic. The context and equations used here derive from that article.
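When we regress for y using multiple predictors x_1, ..., x_n, the hypothesis function becomes:
h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \dots + \theta_{n}x_{n}
If we consider x_0 = 1, this can be written compactly as:
h_{\theta}(x) = \sum_{j=0}^{n}\theta_{j}x_{j} = \theta^{T}x
Here the dimension of x is n+1, as the index goes from 0 to n.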
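To make the x_0 = 1 convention concrete, here is a minimal NumPy sketch of the vectorized hypothesis. The helper names `add_intercept` and `hypothesis` and the small data set are made up for illustration; they are not from the original article.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so that x_0 = 1 for every example."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def hypothesis(theta, X):
    """Return theta^T x for each row of X (X must already contain x_0)."""
    return X @ theta

# Hypothetical data: m = 3 examples, n = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
X = add_intercept(X_raw)            # shape (m, n + 1) = (3, 3)
theta = np.array([0.5, 1.0, 2.0])   # theta_0, theta_1, theta_2
predictions = hypothesis(theta, X)  # one prediction per example
```

Storing the examples as rows lets a single matrix product evaluate the hypothesis for all m examples at once.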
Loss function of multivariate linear regression
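The loss function is given by
J(\theta_{0}, \theta_{1}, \dots, \theta_{n}) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^{2}
which you can simplify to
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(\theta^{T}x^{(i)} - y^{(i)}\right)^{2}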
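As a rough sketch, the loss can be computed in a few lines of NumPy. This assumes the common squared-error form with a 1/(2m) factor and a design matrix that already includes the x_0 = 1 column. The function name `loss` and the toy data are illustrative only.

```python
import numpy as np

def loss(theta, X, y):
    """Squared-error loss with a 1/(2m) factor (assumed convention).

    X is expected to include the x_0 = 1 column.
    """
    m = len(y)
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * m)

# Hypothetical data generated from y = 2 * x_1 (intercept 0).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

perfect = loss(np.array([0.0, 2.0]), X, y)  # parameters that fit exactly
off = loss(np.array([0.0, 1.0]), X, y)      # slope too small, so loss > 0
```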
The gradient descent update for this loss function is now
\theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial\theta_{j}}J(\theta)
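Note: Here j indexes the n+1 parameters, one per feature (attribute) including x_0, and i goes from 1 to m, indexing the m training examples.
Simplifying the partial derivative, we get the n+1 update rules as follows:
\theta_{j} := \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_{j}^{(i)} \quad \text{for } j = 0, 1, \dots, n
All n+1 parameters are updated simultaneously in each iteration.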
The equations above are very similar to the ones from simple linear regression.
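The update rules can be sketched as a short batch gradient descent loop. This is one possible implementation, assuming a 1/(2m) squared-error loss, a design matrix with the x_0 = 1 column, and zero-initialized parameters. The function name, the synthetic data and the chosen learning rate are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for linear regression.

    X must include the x_0 = 1 column. All n + 1 parameters are
    updated simultaneously from the same error vector.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iterations):
        errors = X @ theta - y            # h_theta(x^(i)) - y^(i) for every i
        gradient = X.T @ errors / m       # partial derivative for every theta_j
        theta = theta - alpha * gradient  # simultaneous update
    return theta

# Synthetic, noise-free data from y = 1 + 2*x_1 + 3*x_2 (hypothetical).
rng = np.random.default_rng(0)
features = rng.uniform(-1, 1, size=(50, 2))
X = np.hstack([np.ones((50, 1)), features])
y = X @ np.array([1.0, 2.0, 3.0])

theta = gradient_descent(X, y, alpha=0.5, iterations=2000)
```

Computing the gradient as `X.T @ errors / m` evaluates all n+1 partial derivatives at once, which keeps the update simultaneous by construction.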
Impact of scaling on Gradient Descent
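When the ranges of the features differ widely from each other, the loss surface that GD traverses is highly skewed (see figure).
This is because a single learning rate has to serve every parameter: the surface is steep along features with large ranges and shallow along features with small ranges, so the updates oscillate across the steep direction and creep along the shallow one. Bringing every feature into roughly the range -1 to 1 avoids this.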
Scaling methods
Feature scaling simply divides each value by the range of its feature. Normalization transforms the values so that each feature has a mean of 0.
Mean normalization
scaled \ x_{j} = \frac{(x_{j} - \mu_{j})}{s_{j}}
where \mu_{j} is the mean of feature j and s_{j} is its range (max - min).
Standard normalization is similar to the above, except that s_{j} is the standard deviation.
The exact range of normalization is less important than having all features fall within a similar range.
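Both scaling methods fit in a few lines of NumPy. The sketch below assumes the features are stored column-wise and that no column is constant (a constant column would give a divisor of 0). The function names and the house-style data are hypothetical.

```python
import numpy as np

def mean_normalize(X):
    """(x_j - mu_j) / s_j with s_j = range (max - min) of feature j."""
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s

def standardize(X):
    """(x_j - mu_j) / s_j with s_j = standard deviation of feature j."""
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    return (X - mu) / s

# Hypothetical features: square footage and number of bedrooms.
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [852.0, 2.0]])

X_mean_norm = mean_normalize(X)  # each column: mean 0, range 1
X_standard = standardize(X)      # each column: mean 0, std 1
```

Scale only the real features. The x_0 = 1 column should be added after scaling, since it is constant.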
Debugging Gradient Descent
The general premise is that the loss should decrease as the number of iterations increases. You can also declare a threshold and, if the loss decreases by less than that threshold for n consecutive iterations, declare convergence. However, Andrew Ng advises against this and suggests plotting the loss against the number of iterations to judge convergence and to pick the learning rate (LR).
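One way to debug along these lines is to record the loss at every iteration and inspect the resulting history. The sketch below also includes a simple threshold check. The `patience` parameter, the default threshold and the toy data are assumptions made for illustration, not values from the course.

```python
import numpy as np

def gradient_descent_with_history(X, y, alpha, iterations):
    """Run batch gradient descent and record the loss before each update."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iterations):
        errors = X @ theta - y
        history.append(float(errors @ errors) / (2 * m))
        theta = theta - alpha * (X.T @ errors) / m
    return theta, history

def has_converged(history, threshold=1e-6, patience=5):
    """True if the loss changed by less than `threshold` in each of the
    last `patience` iterations (hypothetical stopping rule)."""
    if len(history) <= patience:
        return False
    recent = history[-(patience + 1):]
    changes = [abs(a - b) for a, b in zip(recent, recent[1:])]
    return all(change < threshold for change in changes)

# Hypothetical data from y = 1 + 2*x_1, with the x_0 = 1 column included.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta, history = gradient_descent_with_history(X, y, alpha=0.1, iterations=2000)
```

Plotting `history` against the iteration index (for example with matplotlib's `plt.plot(history)`) gives the chart described above. A curve that falls and then flattens is the healthy case.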
When LR is too high: If the graph diverges (the loss increases steadily) or the loss oscillates (pic below), it is likely that the learning rate is too high. In the case of oscillation, the weights sporadically land near the minimum but keep overshooting it.
Iterating through a number of LRs: Andrew suggests picking a range of LRs such as 0.001, 0.01, 0.1, 1, ... and iterating through them. He typically bumps the rate by a factor of 10. For convenience, he picks ..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, ... where each rate is roughly 3 times the previous one, which is also effective.
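A sweep over such a list of rates might look like the sketch below. It runs a fixed number of iterations for each rate and compares the final loss. The helper `final_loss`, the iteration budget and the toy data are illustrative assumptions.

```python
import numpy as np

def final_loss(X, y, alpha, iterations=100):
    """Loss reached after a fixed number of gradient descent iterations."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        errors = X @ theta - y
        theta = theta - alpha * (X.T @ errors) / m
    errors = X @ theta - y
    return float(errors @ errors) / (2 * m)

# Hypothetical data from y = 1 + 2*x_1, with the x_0 = 1 column included.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Rates spaced by a factor of roughly 3, as described above.
learning_rates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
results = {alpha: final_loss(X, y, alpha) for alpha in learning_rates}

# The rate with the lowest final loss under this iteration budget.
best = min(results, key=results.get)
```

On this toy data the largest rate diverges, so its final loss is far above the starting loss. Comparing only the final loss is a shortcut, and plotting the full loss curve per rate gives a clearer picture.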
Non-linear functions vs non-linear models
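A linear function is one which produces a straight line. It is typically of the form
y = \theta_{0} + \theta_{1}x
A non-linear function is one which produces a curve. It typically takes a form such as
y = \theta_{0} + \theta_{1}x^{2}
A model is still linear as long as it is linear in its parameters \theta, even when its features are non-linear functions of the inputs.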
Representing non-linearity using Polynomial Regression
Sometimes, when you plot the response variable against one of the predictors, the relationship may not be linear. You might want an order 2 or 3 curve. You can still represent such curves using linear models. Consider the case where square footage is one of the features in predicting house price and you notice a non-linear relationship. Based on the plot, you might try a quadratic model such as:
h_{\theta}(x) = \theta_{0} + \theta_{1}x + \theta_{2}x^{2}
The way to represent non-linearity is to sequentially raise the power (order) of the feature and represent each power as an additional feature. This is a step in feature engineering, and the method is called polynomial regression. When you raise the power, the range of that feature grows rapidly: a feature spanning 1 to 1,000 spans 1 to 1,000,000 once squared. Thus your model's loss surface might become highly skewed, so it is vital to scale features in a polynomial regression.
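The sketch below shows how the polynomial terms can be built as extra feature columns and then scaled. It uses standardization as the scaling step, which is one option among several. The function names and the square-footage values are hypothetical.

```python
import numpy as np

def polynomial_features(x, degree):
    """Stack x, x^2, ..., x^degree as separate feature columns."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** p for p in range(1, degree + 1)])

def standardize(F):
    """Scale each column to mean 0 and standard deviation 1."""
    return (F - F.mean(axis=0)) / F.std(axis=0)

# Hypothetical square-footage values.
sqft = np.array([500.0, 1000.0, 1500.0, 2000.0, 2500.0])

F = polynomial_features(sqft, degree=2)     # columns: sqft, sqft^2
raw_ranges = F.max(axis=0) - F.min(axis=0)  # 2,000 vs 6,000,000

F_scaled = standardize(F)                   # comparable ranges again
X = np.hstack([np.ones((len(sqft), 1)), F_scaled])  # add the x_0 = 1 column
```

The resulting `X` can be passed to any ordinary linear regression routine, because the model is still linear in its parameters.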
Another option is, instead of raising the power, to take square roots or nth roots, such as:
h_{\theta}(x) = \theta_{0} + \theta_{1}x + \theta_{2}\sqrt{x}