Linear Regression

In regression problems, we attempt to take input variables and trying to fit the output onto a continuous expected result function.


  • n = number of training examples
  • $p$ - denote the number of variables that are available for use in making predictions
  • $X_{ij}$ - is an index reference which refers to the element in the ith row and $jth$ column of matrix X. where $i = 1,2,\cdots,n$ and $j = 1,2,\cdots,p$
  • $v_{i}$ is an index reference which refers to the element in the ith row of the vector $v$.
  • x = “input” variables, or features
  • y = “output” variable, or “target” variable
  • $\theta$ = regression parameters

Key Terms

  • Univariate linear regression - Linear regression with one variable is also known as It is used when you want to predict a single output value $y$ from a single input value $x$.

Hypothesis function

$$ \hat{y} = h_\theta(x) = \theta_0 + \theta_1 x \cdot \theta_{n + 1} $$

With only two features ($\theta_0$ & $\theta_1$) it is the equivalent of straight line by its independent and dependent variables $y = mx + c$. Linear regression uses the value for $\theta_0$ and $theta_1$ as arguments for each feature.

We can succinctly write this as $h_\theta(x) = \theta^T x$

Cost Function

We can measure the accuracy of our hypothesis function by using a cost function called the Squared Error Function. This takes an average of all the results of the hypothesis with inputs from x's compared to the actual output y's.

$$ J( \theta_0, \theta_1 ) = \dfrac {1}{2n} \displaystyle \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) $$ $$ \mathrm{Cost}(h_\theta(x),y) = = \sum_{i=1}^n \left (h_{\theta}(x_{i}) - y_{i} \right)^2 $$

We can refer the squared error as $\bar{x}$ is the mean of the squares of $h_\theta(x_i) − y_i$, or the difference between the predicted value and the actual value. $\frac{1}{2}$ is the average and a convenience for gradient descent.

We can use this cost function with different inputs of $\theta$ to generate an output. The lower the output the lower the "squared difference" and the closer our $h_θ(x)$) passes through this scattered set of data.

Our objective is to get the best possible line. The best possible line will be such so that the average squared vertical distances of the scattered points from the line will be the least. In the best case, the line should pass through all the points of our training data set. In such a case the value of $J(\theta_0, \theta_1)$ will be 0.

Gradient Descent

$$ \theta_j := \theta_j - \alpha\frac{\partial}{\partial \theta_j} J(\theta) $$ or $$ \theta_j := \theta_j - \alpha\frac{1}{n} \sum\limits_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} $$

where $j$ is the index of the feature $for j := 0..p$.

A partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant (as opposed to the total derivative, in which all variables are allowed to vary). Partial derivatives are used in vector calculus and differential geometry.

If we put $θ_0$ on the x axis and $θ_1$ on the y axis, with the ouput of the cost function on the vertical $z$ axis we would end up with a graph similar to the one shown.

The $\alpha$ in above is called the learning rate. If the learning rate is too small, the convergence for gradient descent will be too slow. On the other hand, if the learning rate is too large, then each iteration of gradient descent might overshoot the minimum, eventually failing to converge, or even diverging. Note that as the gradient descent algorithm approaches a local minimum, the second term in the update equation will automatically be smaller and hence gradient descent will automatically take smaller steps, so there is no need to decrease α over time.

Gradient Descent

(x, y in the diagram or $\theta_0 \theta_1$ in our case against the z axis $Error$ ($J(\theta)$))

Gradient descent is an optimization algorithm that finds the optimal weights for the parameters of $\theta$. Gradient descent's goal is to find the lowest point on this graph, i.e the point at which $\theta$ (our independent variables are lowest) produce a least error.

The way gradient descent works is by taking the derivative of the cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction with the steepest descent, and the size of each step is determined by the parameter $a$, which is called the learning rate.

In Linear Regression

When specifically applied to the case of linear regression, a further form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function.

In the example below we have the case of univariate linear regression. In univariate linear regression there is only one degree of freedom which is feature $\theta_1$ and one constant $\theta_0$.

One way to do this is to use the batch gradient descent algorithm. In batch gradient descent, each iteration performs the update

$$ \theta_0 := \theta_0 - \alpha \frac{1}{n} \sum\limits_{i=1}^{n}(h_{\theta}(x_{i}) - y_{i}) $$ $$ \theta_1 := \theta_1 - \alpha \frac{1}{n} \sum\limits_{i=1}^{n}\left((h_{\theta}(x_{i}) - y_{i}) x_{i}\right) $$

Since there are $p+1$ features (including $\theta_0$) we need to have $p+1$ gradient descent functions (one for each feature + 1).

The Gradient Descent Algorithm Iteratively

  1. Initialize the weights($\theta_0$ & $\theta_1$) with random values and calculate Error (Squared error)

  2. Calculate the gradient i.e. change in Error when the weights ($\theta_0$ & $\theta_1$) are changed by a very small value from their original randomly initialized value. This helps us move the values of $\theta_0$ & $\theta_1$ in the direction in which Squared error is minimized.

    • Intuitively in a one dimensional regression i.e. $\theta_1 = \theta_1 - \alpha\frac{\partial}{\partial \theta_1} J(\theta_1)$ that the derivative $\frac{\partial}{\partial \theta_1} J(\theta_1)$ will produce a 2d graph similar to the equation $y = x^2$. If the lowest point of the graph represented the optimal $\theta_1$ starting with a $\theta_1$ to the left of the optimal would produce a derivative with a negative gradient (a negative derivative) and any $\theta_1$ to the right would produce a positive derivative. Therefore one iteration of gradient descent would move $\theta_1$ to the right when our $\theta_1$ starts to the left of our optimum (since we subtract the derivative) or to the left when we start to the right of our optimum.
  3. Adjust the weights with the gradients to reach the optimal values where Squared error is minimized

  4. Use the new weights for prediction and to calculate the new Squared error

  5. Repeat steps 2 and 3 till further adjustments to weights doesn’t significantly reduce the Error

Vectorised Notation

Gradient Descent can be shorthanded as

$$ \theta := \theta - \alpha \nabla J(\theta) \\ \text{where} \\ \nabla J(\theta) = \begin{bmatrix}\frac{\partial J(\theta)}{\partial \theta_0} \\ \frac{\partial J(\theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_n} \end{bmatrix} $$

If you are able to express one value of gradient descent as: $$ \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{n} \sum\limits_{i=1}^{} x_{1j} \cdot \left(h_\theta(x_{i}) - y_i \right) $$

You are able to shorten $x_{1j}$ to $\vec{x_j}$ since it represents one column of $X$ (see here for more details).

The term $\left(h_\theta(x^{(i)}) - y^{(i)} \right)$ can also be shortened to $\vec{y}$ since it also references the column of y.

$$ \frac{\partial J(\theta)}{\partial \theta_j} = \frac1m \vec{x_j}^{T} (X\theta - \vec{y}) $$

$$ \nabla J(\theta) = \frac 1m X^{T} (X\theta - \vec{y}) $$

Therefore gradient descent can be written as $\theta := \theta - \frac{\alpha}{m} X^{T} (X\theta - \vec{y})$

Improving Gradient Descent

Feature Normalisation

$$ x_i := \dfrac{x_i - \mu_i}{s_i} $$

We can speed up gradient descent by having each of our input values in roughly the same range. This is because $\theta$ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.

  • Feature scaling dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.
  • Mean normalization involves subtracting the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:

The final value is called the standard score.

Polynomial Regression

Our hypothesis function need not be linear (a straight line) if that does not fit the data well. We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).

Normal Equation

$$ \theta = (X^T X)^{-1}X^T y $$

The "Normal Equation" is a method of finding the optimum theta without iteration.

Gradient Descent Normal Equation
Need to choose alpha No need to choose alpha
Needs many iterations No need to iterate
O (kn2) "O (n3)
Works well when n is large Slow if n is very large

Issues with Gradient Descent

Non Invertability

Usually due to

  1. Redundant features, where two features are very closely related (i.e. they are linearly dependent)
  2. Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization" (to be explained in a later lesson).

Regularized Linear Regression

We can apply regularization to both linear regression and logistic regression. We will approach linear regression first.

Gradient Descent

We will modify our gradient descent function to separate out θ0 from the rest of the parameters because we do not want to penalize θ0.

$$ \text{Repeat}\ \lbrace \\ \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \\ \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right]\ \ \ \ \ \ \ \ \ \ j \in \lbrace 1,2...p\rbrace \\ \rbrace $$

The regularization term is $\frac{\lambda}{m}\theta_j$ With some manipulation our update rule can also be represented as:

$$ \theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} $$

The first term in the above equation, $1 - \alpha\frac{\lambda}{m}$ will always be less than 1. Intuitively you can see it as reducing the value of $\theta_j$ by some amount on every update.

Notice that the second term is now exactly the same as it was before.