Logistic Regression

In a logistic regression the values want to predict take on only a small number of discrete values. In binary regression $y$ being a continuous range of values, it will only be 0 or 1 where 0 is usually taken as the "negative class" and 1 as the "positive class" i.e. $y \in {0,1}$.

To compute a function which predicts trends in our data we need an initial hypothesis function (maybe a straight line like $y = \theta_0 + \theta_1 x$). To find the values of $\theta$ we need to use an algorithm like gradient descent. We first generate a cost function (the difference between our values and the training set given $\theta$) shows how far off our guesses are to the data. We use the gradient descent algorithm to iterately find values of $\theta$ with the lowest total error given our cost function.


  • $n$ - the number of distinct data points, or observations, in a sample.
  • $p$ - denote the number of variables that are available for use in making predictions
  • $X_{ij}$ - is an index reference which refers to the element in the ith row and $jth$ column of matrix X. where $i = 1,2,\cdots,n$ and $j = 1,2,\cdots,p$
  • $v_{i}$ is an index reference which refers to the element in the ith row of the vector $v$.
  • $i$ is be used to index the samples or observations (from 1 to n)
  • $j$ is be used to index the variables (from 1 to p)

Hypothesis - Sigmoid Function

$$ h_\theta (x) = g ( \theta^T x ) \\ z = \theta^T x \\ g(z) = \dfrac{1}{1 + e^{-z}} $$

Independent of the number of classifications, our hypothesis function for logistic regression $h_\theta (x)$ should not take values larger than 1 or smaller than 0. We therefore assign our hypothesis to a sigmoid function where our independent variables $\theta^T x$ are transformed by the sigmoid function to produce a value within the range $0 \geq g(\theta^Tx) \leq 1$. The function $g(z)$ maps any real number to the (0, 1) interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification.

The output of the hypothesis $h_\theta(x)$ will give us the probability that our output is 1. For example, $h_\theta (x) = 0.8$ gives us a probability of 80% that our output is 1. The probability that our prediction is 0 is just the complement of our probability that it is 1 (e.g. if probability that it is 1 is 80%, then the probability that it is 0 is 20%).

$$ h_\theta(x) = P(y=1 | x ; \theta) = 1 - P(y=0 | x ; \theta) \\ P(y = 0 | x;\theta) + P(y = 1 | x ; \theta) = 1 $$

This can be read as The probability of $y$ being one given $x$ and our input parameter $\theta$ added to the probability of $y$ being zero given $x$ and our input parameter $\theta$ must be 1 (or 100% certain). A fancier way of saying y must always be equal to either 1 or 0 ($y \in {0,1}$). In logsitic classification all probabilities must add up to 1.

It can also be written more succintly

$$ p(y | x; θ) = (h_\theta(x))^y (1 − h_\theta(x))^{1−y} $$

Decision Boundary

The decision boundary is the line that separates the general area where $y = 0$ and where $y = 1$ and is the ultimate goal of logistic regression. It is created by the logistic regression hypothesis function.

In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:

$$ g(z) \geq 0.5 \\ when \; z \geq 0 $$

Logistic Function

Looking at a graph of the logistic hypothesis function we can see that $y = 1$ when $z$ is positive or alternatively $y = 0$ when $z$ is negative.

Since $z$ is the vector $\theta^Tx$ we are also able to derive our decision boundary by algebraicly solving $\theta^Tx \geq 0$


In this example we have a straight line input to the sigmoid function and values of $\theta$ like so $$ z = \theta_0 + \theta_1(x) + \theta_2(x) $$ $$ \theta = \begin{bmatrix}5 \\ -1 \\ 0\end{bmatrix} $$

We can use this information to plot a decision boundary for our parameters. $$ 5 + (-1) x_1 + 0 x_2 \geq 0 $$ $$ 5 - x_1 \geq 0 $$ $$ - x_1 \geq -5 $$ $$ x_1 \leq 5 $$

In this case, our decision boundary is a straight vertical line placed on the graph where $x_1 = 5$, and everything to the left of that denotes $y = 1$, while everything to the right denotes $y = 0$. The input to the sigmoid function $z \text{or} $\theta^Tx) doesn't need to be linear.

Cost Function

In order for gradient descent to reach the global minimum error the cost function would ideally need to produce a smooth convex curve. A curve with multiple dips could lead to gradient descent falling in to local optima and will distort the final classifications. The Squared Error leads to a cost function with many local optima. For this reason we need an alternative.

We use two different Cost functions, one for $y = 0$ and another for

$$ J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \\ \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \text{if y = 1} \\ \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \text{if y = 0} $$

This can be succinctly written as

$$ J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))] $$

Or in vector form

$$ h = g(X\theta) \\ J(\theta) = \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right) $$

This has the distinct benefit that

$$ \mathrm{Cost}(h_\theta(x),y) = 0 \text{ if } h_\theta(x) = y \\ \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 0 \; \mathrm{and} \; h_\theta(x) \rightarrow 1 \\ \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 1 \; \mathrm{and} \; h_\theta(x) \rightarrow 0 \\ $$

Or in concrete terms 1. For $y = 1$ as $\theta$ approaches 1 the cost function with output 0. As it approaches 0 it will output $\infty$ 2. For $y = 0$ as $\theta$ approaches 0 the cost function with output 0. As it approaches 1 it will output $\infty$

Gradient Descent

The sigmoid function (logistic regression function) is

$$ h_\theta (x) = g ( \theta^T x ) \\ z = \theta^T x \\ g(z) = \dfrac{1}{1 + e^{-z}} $$

the derivative of which is $g(z)(1 − g(z))$, $g(z)$ can be written as $\sigma(x)$ in a more general function. We then calculate the derivative of the cost function

$$ J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))] $$

when differentiated with respect to $J(\theta)_j$ becomes

$$ (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} $$

Which leads to a gradient descent update rule of

$$ \text{Repeat}\ \lbrace \\ \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \\ \rbrace $$

If we compare this to the Linear Regression update rule, we see that it looks identical; but this is not the same algorithm, because $h(\theta)_x)$ is now defined as a non-linear function of $\theta^T x_i$. Nonetheless, it’s a little surprising that we end up with the same update rule for a rather different algorithm and learning problem.

Multiclass Classification

Now we will approach the classification of data into more than two categories. Instead of $y \in {0,1}$ we will expand our definition so that $y \in {0,1, \cdots, n}$. In this case we divide our problem into $n+1$ (+1 because the index starts at 0) binary classification problems; in each one, we predict the probability that 'y' is a member of one of our classes.

$$ y \in \lbrace0, 1 ... n \rbrace $$ $$ h_{\theta 0}(x) = P(y = 0 | x ; \theta) $$ $$ h_{\theta 1}(x) = P(y = 1 | x ; \theta) $$ $$ \cdots $$ $$ h_{\theta n}(x) = P(y = n | x ; \theta) $$

Where $h_{\theta 0}(x)$ is the classifier of the first feature $0$. We then use the hypothesis that returned the highest value as our prediction. $$ \mathrm{prediction} = \max_i( h_{\theta i}(x) ) $$

We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.

  • One-vs-all Classification
    • For each class $i$, train a logistic regression classifier $h_{i\theta}(x)$ to predict probability that $y = i$.
    • On new input $x$, to make a prediction, pick $i$ that maximizes the classifier $h_{i\theta}(x)$

Advanced Optimization

There are other optimization algorithms besides gradient descent. – E.g. Conjugate gradient; BFGS; L-BFGS. Pros: No need to choose $\alpha$, and often faster than gradient descent. Cons: More complex.


Regularization is designed to address the problem of overfitting.


Underfitting - or having a 'high bias' is hen the form of our hypothesis function $h$ maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features. eg. by trying to use a linear model to fit a more complex set of data will mis-classify many data points

Overfitting - Introduces a 'high variance' is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

Naively, it might seem that the more features we add, the better. However, there is also a danger in adding too many features: The rightmost figure is the result of fitting a 5th order polynomial $y = \sum_{j=0} ^5 \theta_j x^j$.


This terminology is applied to both linear and logistic regression.


There are two main options to address the issue of overfitting:

  1. Reduce the number of features:
    1. Manually select which features to keep.
    2. Use a model selection algorithm (studied later in the course).
  2. Regularization
    1. Keep all the features, but reduce the parameters $\theta_j$.

Regularization works well when we have a lot of slightly useful features.

Cost Function

If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.

For instance to reduce the influence of $\theta_3$ in this function $\theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4$ you can use the Cost function $min_\theta\ \dfrac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + 1000\cdot\theta_3^2$ to minimise the impact of $\theta_3$. This is useful when you know which feature you want to reduce the influence of.

You can also use a regularization parameter $lambda$ to add a weight to all paramteres:

$$ min_\theta\ \dfrac{1}{2m}\ \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\ \sum_{j=1}^n \theta_j^2 \right] $$

$\lambda$ determines how much the costs of our theta parameters are inflated. Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting if it is too small we will still experience overfittingp.

Regularized Logistic Regression

We can regularize logistic regression.

Regularized Cost Function

We can regularize the above cost function equation by adding a term to the end:

$$ J(\theta) = - \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2 $$

Note Well: The second sum, $\sum_{j=1}^n \theta_j^2$ means to explicitly exclude the bias term, $\theta_0$. I.e. the $\theta$ vector is indexed from 0 to n (holding n+1 values, $\theta_0$ through $$\theta_n$), and this sum explicitly skips $\theta_0$, by running from 1 to n, skipping 0.

Gradient Descent

Just like with linear regression, we will want to separately update $\theta_0$ and the rest of the parameters because we do not want to regularize $\theta_0$.

$$ \text{Repeat}\ \lbrace \\ \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \\ \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right]\ \ \ \ \ \ \ \ \ \ j \in \lbrace 1,2...p\rbrace \\ \rbrace $$

This is identical to the gradient descent function presented for regularized linear regression.