Understanding Gradient Descent
Linear regression
Cost functions
The linear regression estimation function (hypothesis function) can be written as

$$
h_{\theta}(x) = \theta_{0} + \theta_{1}x
$$
The core of the cost function is the squared difference between prediction and truth:

$$
J(\theta_{0}, \theta_{1}) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^{2}
$$

We square the error because it is a common way of measuring loss that penalizes over- and under-prediction equally. We sum the loss over all training samples and divide by $2m$; the extra 2 in the denominator gets cancelled out when we later take the derivative. The objective of the estimation function is choosing values of $\theta_{0}$ and $\theta_{1}$ that minimize this cost.
Minimizing cost functions
To understand how to minimize the cost function, let us simplify the above regression to the case where the intercept $\theta_{0}$ is 0. Thus the estimation / hypothesis function changes to

$$
h_{\theta}(x) = \theta_{1}x
$$
Next, we evaluate the cost function $J(\theta_{1})$ for different values of $\theta_{1}$.
The cost function takes the shape of a parabola, with a clear minimum.
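To make this concrete, here is a minimal sketch that evaluates the single-parameter cost on a toy dataset. The data $y = 2x$ is an assumed example (not from the text), chosen so the cost bottoms out at $\theta_{1} = 2$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x                     # toy data generated with theta1 = 2

def cost(theta1):
    # J(theta1) = 1/(2m) * sum((theta1 * x_i - y_i)^2), with the intercept fixed at 0
    m = len(x)
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

for t in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(f"theta1 = {t}: J = {cost(t):.3f}")
```

The printed costs trace out the parabola: largest far from $\theta_{1} = 2$, zero at the minimum, and symmetric on either side of it.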
Minimizing multidimensional cost functions
In the previous example, we assumed $\theta_{0} = 0$, so the cost was a function of $\theta_{1}$ alone. With both $\theta_{0}$ and $\theta_{1}$ free, the cost becomes a surface over two dimensions.
Another way to represent the cost function is via contour plots. Points along the same contour have the same value of error/loss for different combinations of $\theta_{0}$ and $\theta_{1}$.
In a multiple regression problem, there are several predictor variables. The loss function is then hard to visualize, as there are now multiple dimensions: one for each predictor variable's coefficient, plus one for the intercept term. An algorithmic way of minimizing the cost function is gradient descent.
Gradient descent
Gradient descent is an algorithm used to minimize the loss function, and it is widely used across machine learning. The idea is to start with arbitrary values for $\theta_{0}$ and $\theta_{1}$ and repeatedly adjust them in the direction that reduces the cost, until a minimum is reached.
The GD algorithm starts at an arbitrary point $(\theta_{0}, \theta_{1})$ on the loss surface and takes steps downhill in the direction of steepest descent.
An interesting feature of GD is that if you start at a different point, you might end up at a different local minimum. The definition of gradient descent for an arbitrary cost function is:
$$
\theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial\theta_{j}}J(\theta_{0}, \theta_{1}) \quad \text{(for } j=0 \text{ and } j=1\text{)}
$$
We repeat the above update until convergence. The $:=$ symbol denotes assignment, and $\alpha$ is the learning rate, which controls the size of each step.
Thus, when you have two coefficients, you would compute the updates as:
$$
temp0 := \theta_{0} - \alpha\frac{\partial}{\partial\theta_{0}}J(\theta_{0}, \theta_{1})
$$
$$
temp1 := \theta_{1} - \alpha\frac{\partial}{\partial\theta_{1}}J(\theta_{0}, \theta_{1})
$$
$$
\theta_{0} := temp0
$$
$$
\theta_{1}:= temp1
$$
Note: It is important to compute both $temp0$ and $temp1$ before assigning either of them, i.e., to update $\theta_{0}$ and $\theta_{1}$ simultaneously. Using an already-updated $\theta_{0}$ while computing $\theta_{1}$'s update would change the result.
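The simultaneous update can be sketched as a small helper. This is an illustrative snippet rather than code from the text; `grad0` and `grad1` stand in for the two partial derivatives of $J$, and the quadratic loss in the usage example is a made-up toy:

```python
def gd_step(theta0, theta1, alpha, grad0, grad1):
    # Evaluate both partial derivatives at the *current* (theta0, theta1)
    # before assigning, so theta1's update never sees an updated theta0.
    temp0 = theta0 - alpha * grad0(theta0, theta1)
    temp1 = theta1 - alpha * grad1(theta0, theta1)
    return temp0, temp1

# Toy quadratic loss J = theta0^2 + theta1^2, so dJ/dtheta0 = 2*theta0 etc.
t0, t1 = gd_step(1.0, 1.0, 0.25,
                 lambda a, b: 2 * a,
                 lambda a, b: 2 * b)   # -> (0.5, 0.5)
```

Writing the updates into temporaries first is exactly what the $temp0$/$temp1$ equations above prescribe.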
Gradient descent intuition
For simplicity, let us simplify our solver to minimize over just one coefficient, $\theta_{1}$, with $\theta_{0}$ fixed at 0.
The shape of $J(\theta_{1})$ is a parabola.
The derivative term $\frac{\partial}{\partial\theta_{1}}J(\theta_{1})$ is the slope at the current point: when the slope is positive, the update decreases $\theta_{1}$; when it is negative, the update increases $\theta_{1}$. Either way, $\theta_{1}$ moves toward the minimum.
The learning rate $\alpha$ controls the step size. If $\alpha$ is too small, convergence is slow; if it is too large, the updates can overshoot the minimum and even diverge.
Further, as we approach the local minimum, the slope decreases, so even for a fixed $\alpha$ the steps get smaller. At the minimum itself the slope is 0. Thus, once reached, the second term of the equation (right of the minus sign) turns to 0 and the iteration has converged.
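This shrinking-step behaviour can be checked numerically. A minimal sketch, using the illustrative cost $J(\theta_{1}) = \theta_{1}^{2}$ (whose derivative is $2\theta_{1}$) and a fixed learning rate:

```python
def update_term(theta1, alpha=0.1):
    # alpha * dJ/dtheta1 for the illustrative cost J(theta1) = theta1 ** 2
    return alpha * 2 * theta1

theta1 = 1.0
steps = []
for _ in range(5):
    delta = update_term(theta1)   # step shrinks as the slope shrinks
    theta1 -= delta
    steps.append(delta)

# Each recorded step is smaller than the last even though alpha is fixed,
# and at theta1 = 0 the update term is exactly 0.
```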
Gradient descent for linear regression
The loss function for linear regression is the squared-error cost $J(\theta_{0}, \theta_{1})$ defined above. Substituting it into the update rule and taking the partial derivatives gives:

$$
\theta_{0} := \theta_{0} - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)
$$

$$
\theta_{1} := \theta_{1} - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x^{(i)}
$$

Here, $m$ is the number of training samples, and $(x^{(i)}, y^{(i)})$ is the $i$-th training sample.
This process of using the full training sample to calculate each GD update is called batch gradient descent. In the case of linear regression, the loss function is convex and looks like a bowl: there is only a global minimum and no local minima, so gradient descent cannot get stuck in the wrong valley.
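Putting the two update rules together, here is a minimal batch gradient descent sketch in NumPy. The dataset and the hyperparameters ($\alpha = 0.1$, 2000 iterations) are illustrative choices, not values from the text:

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.1, iters=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0              # arbitrary starting point
    for _ in range(iters):
        err = theta0 + theta1 * x - y      # h(x_i) - y_i for all m samples at once
        grad0 = err.sum() / m              # dJ/dtheta0
        grad1 = (err * x).sum() / m        # dJ/dtheta1
        # Simultaneous update: both gradients were computed from the old thetas.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                          # toy data generated with theta0=1, theta1=2
t0, t1 = batch_gradient_descent(x, y)      # converges near (1.0, 2.0)
```

Because every iteration sums the error over all $m$ samples, this is the "batch" variant; since the linear regression cost is convex, the run lands at the global minimum regardless of the starting point.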