Artificial neural networks are limited imitations of how our own brains work. In the remainder of this chapter, the term "neural network" stands for "artificial neural network" unless explicitly stated otherwise.
There is evidence that the brain uses only one "learning algorithm" for all its different functions. In experiments on mice, cutting the connection between the ears and the auditory cortex and rewiring the optic nerve to the auditory cortex led the auditory cortex to literally learn to see.
This principle is called "neuroplasticity" and is supported by many examples and much experimental evidence. At a very simple level, neurons are computational units that take electrical inputs (called "spikes") through their dendrites and channel them to outputs through their axons.
Key Terms
- Input Layer: the first layer of a neural network
- Hidden Layer: an intermediate layer of the neural network
- Output Layer: the final layer, which gives the final value of the hypothesis
- $a_i^{(j)}$: "activation" of unit $i$ in layer $j$
- $\Theta^{(j)}$: matrix of weights controlling the function mapping from layer $j$ to layer $j+1$
- $L$: total number of layers in the network
- $s_l$: number of units (not counting the bias unit) in layer $l$
- $K$: number of output units/classes
Implementation
Neural networks are made up of layers of nodes. Each node has an activation function similar to the one used in logistic regression (since both deal with classification). The activation function can be written as $g(\theta^Tx) = \frac{1}{1 + e^{-\theta^Tx}}$, which maps its argument into the range $0 \leq g(\theta^Tx) \leq 1$.
Our "theta" parameters are sometimes called "weights" in the neural network model. Each layer of the neural network has its own matrix of values for $\theta$, written as $\Theta^{(j)}$, where $j$ is the layer index.
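As a minimal sketch of the logistic activation described above, the following uses NumPy. The weight and input values are made up for illustration and do not come from the text.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values only (assumptions, not taken from the notes).
theta = np.array([-1.0, 2.0, 0.5])   # weights, theta[0] multiplies the bias
x = np.array([1.0, 0.3, -0.8])       # x[0] = 1 is the bias input

print(sigmoid(0.0))         # 0.5, the midpoint of the range
print(sigmoid(theta @ x))   # a value strictly between 0 and 1
```

Large negative arguments push the output toward 0 and large positive arguments push it toward 1, which is why the output can be read as a classification score.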
$$ \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ \end{bmatrix}
\rightarrow
\begin{bmatrix} a_1^{(2)} = g(z_1^{(2)}) \\ a_2^{(2)} = g(z_2^{(2)}) \\ a_3^{(2)} = g(z_3^{(2)}) \\ \end{bmatrix}
\rightarrow
h_\Theta(x)
$$

The first layer is called the "input layer". The number of nodes in this layer is the number of features plus one ($n + 1$). The extra input node $x_0$ is sometimes called the "bias unit" and is always equal to 1.

Intermediate layers are called the "hidden layers"; in the diagram above, layer 2 is the only hidden layer. There can be many hidden layers, each with a different number of nodes.

The final layer is the "output layer", which gives the final value computed by the hypothesis.
The activations of the hidden layer are calculated from the inputs and the weights $\Theta^{(1)}$ like so:
$$ \begin{aligned}
a_1^{(2)} &= g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3) \\
a_2^{(2)} &= g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3) \\
a_3^{(2)} &= g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3)
\end{aligned} $$
Each node in the hidden layer has an attached "activation" value $a_i^{(j)}$, which can also be written as $g(z_i^{(j)})$, where $z_i^{(j)}$ is the weighted sum passed to our function $g$. The final value of the hypothesis function is therefore:
$$
h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}) $$
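If a network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$. The $+1$ comes from the "bias nodes": $\Theta^{(j)}$ has an extra column $\Theta_{i0}^{(j)}$ that multiplies the bias unit ($x_0$ in the input layer, $a_0^{(j)}$ in later layers). In other words, the output nodes will not include the bias nodes while the inputs will. For example, the network above has $s_1 = 3$ and $s_2 = 3$, so $\Theta^{(1)}$ is $3 \times 4$.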
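The forward computation and the dimension rule can be sketched in NumPy as follows. The helper `forward`, the random weights, and the input vector are illustrative assumptions, sized to match the three-input, three-hidden-unit, one-output network above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Return the activations of every layer for one input x (given without bias)."""
    a = np.asarray(x, dtype=float)
    activations = [a]
    for theta in thetas:
        a_with_bias = np.concatenate(([1.0], a))   # prepend bias unit a_0 = 1
        a = sigmoid(theta @ a_with_bias)           # a^(j+1) = g(Theta^(j) a^(j))
        activations.append(a)
    return activations

# Random weights for illustration only (assumption, not from the notes).
rng = np.random.default_rng(0)
theta1 = rng.normal(size=(3, 4))   # layer 1 -> 2: s_2 x (s_1 + 1) = 3 x 4
theta2 = rng.normal(size=(1, 4))   # layer 2 -> 3: s_3 x (s_2 + 1) = 1 x 4

acts = forward([0.5, -1.2, 2.0], [theta1, theta2])
h = acts[-1]                        # value of the hypothesis
print(theta1.shape, theta2.shape, h.shape)   # (3, 4) (1, 4) (1,)
```

Each weight matrix has one more column than the layer feeding it has units, because the bias unit is prepended before every multiplication but is never itself an output.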
Multiclass Classification
To classify data into multiple classes, we let our hypothesis function return a vector of values, with one entry per possible class. Say we wanted to classify our data into one of four final classes ($K = 4$). The neural network would be written like this:
$$ \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \rightarrow \begin{bmatrix} a_0^{(2)} \\ a_1^{(2)} \\ a_2^{(2)} \\ \vdots \end{bmatrix} \rightarrow \begin{bmatrix} a_0^{(3)} \\ a_1^{(3)} \\ a_2^{(3)} \\ \vdots \end{bmatrix} \rightarrow \begin{bmatrix} h_\Theta(x)_1 \\ h_\Theta(x)_2 \\ h_\Theta(x)_3 \\ h_\Theta(x)_4 \end{bmatrix} $$
The activations of the last hidden layer, when multiplied by the final theta matrix, result in another vector, to which we apply the logistic function $g$ to get a vector of hypothesis values.
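One way to use such an output vector is sketched below. It assumes labels are encoded as one-hot vectors with 0-based class indices and that the predicted class is the output unit with the largest value. The helper names and the sample output are hypothetical.

```python
import numpy as np

K = 4  # number of classes, as in the example above

def one_hot(label, num_classes):
    """Encode a 0-based class index as a vector with a single 1."""
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

def predict_class(h):
    """Pick the class whose output unit has the largest hypothesis value."""
    return int(np.argmax(h))

h = np.array([0.05, 0.10, 0.80, 0.20])   # made-up network output
print(one_hot(2, K))      # [0. 0. 1. 0.]
print(predict_class(h))   # 2
```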
Cost Function
A regularized cost function for logistic regression can be written as:
$$ J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log (h_\theta (x^{(i)})) + (1 - y^{(i)}) \log (1 - h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2 $$
Generalising this for neural networks:
$$ J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[ y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2 $$
We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, between the square brackets, we have an additional nested summation that loops through the number of output nodes.
In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term.
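The neural-network cost can be sketched in NumPy as follows. The function name `nn_cost` and the small arrays are illustrative assumptions. The weight matrices in the example only exercise the regularization term and are not sized for a real network.

```python
import numpy as np

def nn_cost(H, Y, thetas, lam):
    """Regularized cross-entropy cost J(Theta).

    H: (m, K) hypothesis outputs, Y: (m, K) one-hot labels,
    thetas: list of weight matrices with the bias weights in column 0.
    """
    m = Y.shape[0]
    # Double sum over examples i and output units k.
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Triple sum over layers, rows and columns; column 0 (bias) is skipped.
    reg_term = lam / (2 * m) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)
    return data_term + reg_term

# Made-up values for illustration.
H = np.array([[0.9, 0.1], [0.2, 0.8]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
thetas = [np.array([[0.5, 1.0, -1.0]]), np.array([[0.0, 2.0]])]

print(nn_cost(H, Y, thetas, lam=0.0))   # data term only
print(nn_cost(H, Y, thetas, lam=1.0))   # data term plus regularization
```

Skipping column 0 mirrors the summation index $i$ starting at 1 in the formula, so the bias weights are not penalized.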
Backpropagation
"Backpropagation" is neural-network terminology for computing the gradient of our cost function so that we can minimize it, just as we did with gradient descent in logistic and linear regression. It is called backpropagation because you work backwards from the hypothesis node.
The goal is to compute $\min_\Theta J(\Theta)$. That is, we want to find the set of parameters in theta that minimizes our cost function $J$. We therefore need to compute $\dfrac{\partial}{\partial \Theta_{i,j}^{(l)}}J(\Theta)$, the partial derivative of the cost function with respect to each weight.
For each node we will calculate $\delta_j^{(l)}$, the "error" of node $j$ in layer $l$.
For the last layer, we can compute the vector of delta values with:
$$ \delta^{(L)} = a^{(L)} - y $$
where $L$ is our total number of layers and $a^{(L)}$ is the vector of outputs of the activation units for the last layer. So our "error values" for the last layer are simply the differences between our actual results in the last layer and the correct outputs in $y$.
To compute the delta values of the intermediate layers, we use the delta values of the layers to the right, where $.*$ denotes element-wise multiplication:
$$ \begin{aligned}
\delta^{(l)} &= ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ g'(z^{(l)}), \quad \text{where } g'(u) = g(u)\ .*\ (1 - g(u)) \\
\delta^{(l)} &= ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)})
\end{aligned} $$
Intuitively, the delta values of layer $l$ are calculated by multiplying the delta values in the next layer by the theta matrix of layer $l$. We then element-wise multiply that with a function called $g'$, or g-prime, which is the derivative of the activation function $g$ evaluated at the input values given by $z^{(l)}$.
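The identity $g'(u) = g(u)(1 - g(u))$ can be checked numerically. The sketch below compares it against a central finite difference. The function names and test points are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gradient(z):
    """Derivative of the logistic function: g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

z = np.array([-2.0, 0.0, 3.0])   # arbitrary test points
eps = 1e-6
# Central finite-difference approximation of the derivative.
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(sigmoid_gradient(0.0))                       # 0.25
print(np.allclose(sigmoid_gradient(z), numeric))   # True
```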
Algorithmically
Given training set $\lbrace (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\rbrace$:
- Set $\Delta^{(l)}_{i,j} := 0$ for all $(l, i, j)$ (where $\Delta$ accumulates the single-node errors $\delta$ over the training set)
- For each training example $t = 1, \ldots, m$:
  - Set the input layer variables: $a^{(1)} := x^{(t)}$
  - Forward propagation: compute $a^{(l)}$ for $l = 2, 3, \ldots, L$
  - Backward propagation:
    - Using $y^{(t)}$, compute $\delta^{(L)} = a^{(L)} - y^{(t)}$ (the error of the final hypothesis)
    - Compute the delta for each hidden layer $l = L-1, \ldots, 2$ using $\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)})$
    - Accumulate the layer error: $\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)}$
- Calculate the gradient $D$ from the accumulator $\Delta$, where $D_{i,j}^{(l)} = \dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}}$:
  - $D^{(l)}_{i,j} := \dfrac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right)$ if $j \neq 0$
  - $D^{(l)}_{i,j} := \dfrac{1}{m}\Delta^{(l)}_{i,j}$ if $j = 0$ (the bias weights are not regularized)
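The procedure above can be sketched in NumPy as shown below, together with a finite-difference check of the resulting gradient. This is an illustrative sketch under stated assumptions, not a reference implementation: the function names, the network sizes, and the random data are all made up, and labels are assumed to be one-hot rows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(thetas, X, Y, lam):
    """Regularized cost J(Theta) for inputs X (m, n) and one-hot labels Y (m, K)."""
    m = X.shape[0]
    A = X
    for theta in thetas:
        A = np.hstack([np.ones((m, 1)), A])   # add the bias column
        A = sigmoid(A @ theta.T)
    data = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    reg = lam / (2 * m) * sum(np.sum(th[:, 1:] ** 2) for th in thetas)
    return data + reg

def backprop(thetas, X, Y, lam):
    """Return the gradients D^(l), one per weight matrix."""
    m = X.shape[0]
    Deltas = [np.zeros_like(th) for th in thetas]   # accumulators start at 0
    for t in range(m):
        # Forward propagation, keeping each layer's activations with bias.
        activations = []
        a = X[t]
        for theta in thetas:
            a = np.concatenate(([1.0], a))
            activations.append(a)
            a = sigmoid(theta @ a)
        delta = a - Y[t]                            # error of the output layer
        for l in range(len(thetas) - 1, -1, -1):
            a_l = activations[l]
            Deltas[l] += np.outer(delta, a_l)       # Delta += delta^(l+1) (a^(l))^T
            if l > 0:
                # Propagate the error back, then drop the bias component.
                delta = ((thetas[l].T @ delta) * a_l * (1 - a_l))[1:]
    grads = []
    for theta, Delta in zip(thetas, Deltas):
        D = Delta / m
        D[:, 1:] += lam / m * theta[:, 1:]          # regularize only j != 0
        grads.append(D)
    return grads

def numeric_grad(thetas, X, Y, lam, l, i, j, eps=1e-5):
    """Central finite-difference estimate of dJ/dTheta^(l)_{i,j}."""
    plus = [th.copy() for th in thetas]
    minus = [th.copy() for th in thetas]
    plus[l][i, j] += eps
    minus[l][i, j] -= eps
    return (cost(plus, X, Y, lam) - cost(minus, X, Y, lam)) / (2 * eps)

# Made-up network (3 inputs, 3 hidden units, 2 outputs) and data.
rng = np.random.default_rng(0)
thetas = [rng.normal(scale=0.5, size=(3, 4)), rng.normal(scale=0.5, size=(2, 4))]
X = rng.normal(size=(5, 3))
Y = np.eye(2)[rng.integers(0, 2, size=5)]
lam = 1.0

grads = backprop(thetas, X, Y, lam)
diff = abs(numeric_grad(thetas, X, Y, lam, 0, 1, 2) - grads[0][1, 2])
print(diff)   # expected to be very small if the gradient is right
```

The sketch accumulates with an outer product per example, which is the vectorized form of the element-wise update $\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)}$. Comparing against a finite difference is a common sanity check before handing the gradient to an optimizer.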