# Solving multivariate linear regression using Gradient Descent

**Note**: This is a continuation of Gradient Descent topic. The context and equations used here derive from that article.

When we regress for `y`

using multiple predictors of `x`

, the hypothesis function becomes:

$$ h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \theta_{3}x_{3} + … + \theta_{n}x_{n} $$ If we consider $x_{0} = 1$, then the above can be represented as matrix multiplication using linear algebra.

$$ x = \begin{bmatrix} x_{0}\ x_{1}\ \vdots\ x_{n} \end{bmatrix} \ and \ \theta=\begin{bmatrix} \theta_{0}\ \theta_{1}\ \vdots\ \theta_{n} \end{bmatrix} $$

Thus,
$$
h_{\theta}(x) = \theta^{T}x
$$
Here the dimensions of $\theta$ and `x`

is `n+1`

as this goes from `0`

to `n`

.

### Loss function of multivariate linear regression¶

The loss function is given by

$$ J(\theta_{0}, \theta_{1},…,\theta_{n}) = \frac{1}{2m}\sum_{i=1}^{m}[h_{\theta}(x^{(i)}) - y^{(i)}]^{2} $$

which you can simplify to

$$ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}[h_{\theta}(x^{(i)}) - y^{(i)}]^{2} $$

The gradient descent of the loss function is now

$$
\theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial\theta_{j}}J(\theta)
$$
**Note**: Here `j`

represents the `n+1`

features (attributes) and `i`

goes from `1 -> m`

representing the `m`

records.

Simplifying the partial differential equation, we get the `n+1`

update rules as follows

$$ \theta_{0} := \theta_{0} - \alpha \frac{1}{m}\sum_{i=1}^{m}[h_{\theta}(x^{(i)}) - y^{(i)}]x_{0}^{{(i)}} $$ $$ \theta_{1} := \theta_{1} - \alpha \frac{1}{m}\sum_{i=1}^{m}[h_{\theta}(x^{(i)}) - y^{(i)}]x_{1}^{{(i)}} $$ $$ \vdots $$ $$ \theta_{n} := \theta_{n} - \alpha \frac{1}{m}\sum_{i=1}^{m}[h_{\theta}(x^{(i)}) - y^{(i)}]x_{n}^{{(i)}} $$

The equations above are very similar to ones from simple linear equations.

### Impact of scaling on Gradient Descent¶

When the data ranges of features varies quite a bit from each other, the surface of GD is highly skewed as shown below:

This is because $\theta$ (which is our weights) will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven. When scaled, the surface takes a healthier shape, allowing the algorithm to converge faster. Ideally scale values so they fall within `-1 to 1`

.

#### Scaling methods¶

**Feature Scaling** is simply dividing values by **range**. **Normalization** is when you transform them to have a `mean = 0`

.

**Mean normalization**
$$
scaled \ x_{j} = \frac{(x_{j} - \mu_{j})}{s_{j}}
$$
where $\mu$ is mean and `s`

is range.

**Standard normalization** is similar to above, except, `s`

is standard deviation.

The exact range of normalization is less important than having all features follow a particular range.

### Debugging Gradient Descent¶

The general premise is, as number of iterations increase, the loss should reduce. You can also declare a threshold and if the loss reduces below that for `n`

number of iterations, then you can declare **convergence**. However, Andrew Ng suggests against this and suggests visualizing the loss on a chart to pick LR.

**When LR is too high**: If you have a diverging graph - loss increases steadily or if the loss is oscillating (pic below), it is likely the the rate is too high. In case of oscillation, the weights sporadically hit the local minima but continue to overshoot.

**Iterating through a number of LRs**: Andrew suggests picking a range of LRs `0.001, 0.01, 0.1, 1, ...`

and iterating through them. He typically bumps rates by a factor of `10`

. For convenience, he picks `..0.001, 0.003, 0.01, 0.03, 0.1, 0.3..`

where he bumps by `~3`

which is also effective.

### Non-linear functions vs non-linear models¶

A **linear function** is one which produces a straight line. It is typically of the form $y = \theta_{0} + \theta_{1}x_{1}$. A **non-linear function** is something that produces a curve. It is typically of the from $y = \theta_{0} + \theta_{1}x^{k}$. A **linear model** is when the model parameters are **additive**, even though individual parameters are non-linear. It takes form $y = \theta_{0} + \theta_{1}x_{1} + … + \theta_{n}x_{n}$. A **non-linear** model is when the model parameters are **multiplicative** even though they are of order `1`

. It typically takes form $y = \theta_{0}x_{1}theta_{1}x_{2}^{k}x_{3}$ etc.

### Representing non-linearity using Polynomial Regression¶

Sometimes, when you plot the response variable with one of the predictors, it may not take a linear form. You might want an order `2`

or `3`

curve. You can still represent them using linear models. Consider the case where square footage is one of the parameters in predicting house price and you notice a non-linear relationship. From the graphic below, you might try a **quadratic model** as $h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2}^{2}$. But this model will eventually taper off. Instead, you may try a **cubic** model as $h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2}^{2} + \theta_{3}x_{3}^{3}$.

The way to represent non-linearity is to sequentially raise the power / order of the parameter, represent them as additional features. This is a step in **feature engineering**. This method is called polynomial regression. When you raise the power, the range of that parameter also increases exponentially. Thus you model might become highly skewed. It is **vital to scale features** in a polynomial regression.

Another option here is, instead of raising power, you take **square roots** or **nth roots**, such as: $h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}\sqrt{x_{2}} + \theta_{3}\sqrt[3]{x_{3}}$