Epoch  One full pass over all of your training data
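Cross entropy as a form of error function? $ -\sum_i Y'_i \cdot \log(Y_i) $, where $Y'_i$ is the true (one-hot) label and $Y_i$ is the predicted value for class $i$  The predictions $Y_i$ are numbers between 0 and 1 (interpret as probability)  Activation function still used for hidden layers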
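A minimal sketch of the cross-entropy computation for a single example, assuming NumPy, a one-hot true label, and predictions that already sum to 1. The function name `cross_entropy` and the clipping constant `eps` are illustrative choices, not something these notes specify.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross entropy -sum_i y_true[i] * log(y_pred[i]) for one example."""
    # Clip so that log(0) never occurs; eps is an illustrative assumption.
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.sum(y_true * np.log(y_pred)))

y_true = np.array([0.0, 1.0, 0.0])        # one-hot: the true class is index 1
confident = np.array([0.05, 0.90, 0.05])  # most probability on the true class
unsure = np.array([0.40, 0.30, 0.30])     # probability spread across classes

loss_confident = cross_entropy(y_true, confident)  # equals -log(0.90)
loss_unsure = cross_entropy(y_true, unsure)        # equals -log(0.30)
```

With a one-hot label only the term for the true class survives, so the loss is the negative log of the probability given to the correct class: small when the prediction is confident and right, large otherwise.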
Sigmoid used to be the most used activation function; now it is ReLU  Mitigates the vanishing gradient problem (very small gradients)  Simulates the biological neuron more closely than a sigmoid function (above a threshold, the output is proportional to the input)
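To see why the sigmoid gives very small gradients, here is a rough comparison of the two derivatives, assuming NumPy. The helper names are hypothetical, and taking the ReLU derivative at exactly 0 to be 0 is a convention chosen for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # never larger than 0.25, tiny for large |x|

def relu(x):
    return np.maximum(0.0, x)     # zero below the threshold, identity above it

def relu_grad(x):
    return (x > 0).astype(float)  # 1 for positive inputs, 0 otherwise

x = np.array([-10.0, -1.0, 0.5, 10.0])
sig_g = sigmoid_grad(x)
relu_g = relu_grad(x)
```

The sigmoid gradient is at most 0.25 and shrinks towards 0 at both ends, so multiplying many of them together across layers gives a vanishing product. The ReLU gradient stays at 1 for any positive input.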
Softmax as a form of neural network?  Why is "softmax" called softmax?  The exponential is a steeply increasing function. It will increase differences between the elements of the vector. It also quickly produces large values. Then, as you normalise the vector, the largest element, which dominates the norm, will be normalised to a value close to 1 while all the other elements will end up divided by a large value and normalised to something close to 0. The resulting vector clearly shows which was its largest element, the "max", but retains the original relative order of its values, hence the "soft".  Produces values from 0 to 1 that sum to 1  Here softmax is serving as an "activation" or "link" function, shaping the output of our linear function into the form we want: in this case, a probability distribution over 10 cases. You can think of it as converting tallies of evidence into probabilities of our input being in each class.  It is helpful to think of softmax as exponentiating its inputs and then normalizing them. The exponentiation means that one more unit of evidence increases the weight given to any hypothesis multiplicatively. And conversely, having one less unit of evidence means that a hypothesis gets a fraction of its earlier weight. No hypothesis ever has zero or negative weight. Softmax then normalizes these weights, so that they add up to one.
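The "exponentiate, then normalise" description can be sketched as follows, assuming NumPy. Subtracting the maximum before exponentiating is a common numerical-stability step added here as an assumption; it does not change the result.

```python
import numpy as np

def softmax(z):
    # Subtracting the max leaves the result unchanged but avoids overflow.
    e = np.exp(z - np.max(z))
    # Normalise so the weights add up to one.
    return e / np.sum(e)

evidence = np.array([1.0, 2.0, 5.0])  # illustrative tallies of evidence
probs = softmax(evidence)
```

The largest input ends up with most of the probability, the order of the elements is preserved, and each extra unit of evidence multiplies a class's weight by a factor of e.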
Dropout?  At each training iteration, randomly switch off some neurons (set their output to zero) for that iteration only
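A rough sketch of dropout on one layer's activations, assuming NumPy. It uses the "inverted dropout" variant, which rescales the surviving units so their expected value is unchanged; that variant, the function name, and the 0.25 drop probability are assumptions for illustration.

```python
import numpy as np

def dropout(activations, drop_prob, rng):
    # Each unit is kept independently with probability 1 - drop_prob.
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    # Inverted dropout: rescale survivors so the expected value is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones(1000)  # stand-in for one layer's outputs
dropped = dropout(hidden, drop_prob=0.25, rng=rng)
```

A fresh mask is drawn on every call, so different neurons are switched off at each iteration. At test time the function is simply not applied.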
Overfitting manifests as a difference between test loss and training loss  Too many degrees of freedom  The network can sometimes memorise individual training examples in its weights
Convolutional Neural Networks  For 2D data where locality is important
XYZ  Cartesian coordinate system

Softmax as a form of neural network
Now that our neural network produces predictions from input images, we need to measure how good they are, i.e. the distance between what the network tells us and what we know to be the truth. Remember that we have true labels for all the images in this dataset.
Any distance would work; the ordinary Euclidean distance is fine, but for classification problems one distance, called the "cross-entropy", is more efficient.
We also add some extra evidence called a bias. Basically, we want to be able to say that some things are more likely independent of the input. The result is that the evidence for a class i given an input x is:
$$ \text{evidence}_i = \sum_j W_{i,j} x_j + b_i $$
Vectorised, this is $ y = \text{softmax}(Wx + b) $
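As a sketch of how the per-class sum and the vectorised form line up, assuming NumPy: the 10 classes come from the notes above, while the 784 inputs (a flattened 28 x 28 image), the random weights, and the random input are placeholders chosen for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # max subtracted for numerical stability
    return e / np.sum(e)

rng = np.random.default_rng(0)
n_classes, n_inputs = 10, 784  # 784 = 28 * 28 pixels is an assumption
W = rng.normal(scale=0.01, size=(n_classes, n_inputs))  # placeholder weights
b = np.zeros(n_classes)                                 # placeholder biases
x = rng.random(n_inputs)                                # placeholder input

# Per-class form: evidence_i = sum_j W[i, j] * x[j] + b[i]
evidence_loop = np.array([
    sum(W[i, j] * x[j] for j in range(n_inputs)) + b[i]
    for i in range(n_classes)
])

# Vectorised form: one matrix-vector product plus the bias vector
evidence = W @ x + b
y = softmax(evidence)
```

Both forms give the same evidence vector, and `y` is a probability distribution over the 10 classes.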