Neural Nets¶
For the online “lectures”, watch these 4 NN videos from YouTuber 3Blue1Brown: https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
For each video, a quiz will be created that asks about important details of the videos.
But What is a Neural Network? Deep learning, chapter 1¶
As you are watching the video (or after), please answer these questions. NN is short for “neural network”.
What problem is the NN discussed throughout the video trying to solve? Answer: Recognizing hand-written digits.
According to the video, what kind of NN is best for:
- image recognition? Answer: convolutional NN
- speech recognition? Answer: long short-term memory NN
What is the name of the kind of neural network discussed in the video? Answer: Multi-layer perceptron.
In simple mathematical terms, what does a neuron do? Answer: It holds a number. Later in the video, this is generalized to say that a neuron is a function.
What is the number in a neuron called? Answer: The neuron’s activation.
What is the range of possible activations for a neuron? Answer: 0 (least activated) to 1 (most activated).
In the NN example in the video:
- What are the activations of the neurons in the first layer? Answer: The activations of the first layer neurons are set to be the grayscale values of the image pixels.
- How many neurons are in the last layer of the NN? Why that many? What do the activations in neurons of the last layer mean? Answer: The last layer consists of 10 neurons, one for each of the possible digits 0 to 9. The activation of these last-layer neurons corresponds to what digit the network “thinks” the image contains, i.e. whatever last-layer neuron has the highest activation is what the NN thinks the digit is.
- What are the layers between the input layer and last layer called? How many such layers are there? Why? How many neurons are in these layers? Why? Answer: The layers between the input and last layers are called hidden layers, and in this NN there are 2 hidden layers with 16 neurons each. The choice of 2 and 16 is somewhat arbitrary, and different choices might work just as well (or worse, or better). People often experiment with different sizes and numbers of hidden layers.
Describe how neurons in one layer connect to other neurons. Answer: Each neuron in one layer is connected to every neuron in the next layer, and only to neurons in the next layer.
What are the possible values for weights on the edges between neurons? Answer: They are real numbers that could be positive, negative, or 0.
What is the sigmoid function? Answer: \(\sigma(x) = \frac{1}{1 + e^{-x}}\)
Why is the sigmoid function used at all? What is its purpose? Answer: The input to a neuron is the weighted sum of the activations of other neurons, and that weighted sum could be any real number. The sigmoid function is used to squish this number into the range 0 to 1, which makes it easier to work with.
What is the purpose of a bias in a neuron? Answer: It helps control when the neuron is activated, e.g. a large negative bias means the weighted sum must be large before the neuron activates. The bias tells you how big the weighted sum must be before the neuron is meaningfully active.
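As a concrete illustration (not from the video, just a minimal sketch with made-up weights, inputs, and bias), here is what a single neuron computes:

```python
import math

def sigmoid(x):
    # squishes any real number into the range 0 to 1
    return 1 / (1 + math.exp(-x))

# made-up example values
inputs  = [0.0, 0.6, 1.0]    # activations of neurons in the previous layer
weights = [1.5, -2.0, 0.7]   # one weight per incoming edge
bias    = -1.0               # controls how big the weighted sum must be before the neuron activates

weighted_sum = sum(w * a for w, a in zip(weights, inputs))
activation = sigmoid(weighted_sum + bias)
print(activation)  # a number between 0 and 1
```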
In a NN, what is learning? Answer: Learning in a NN is the process of setting the weights and biases to values that make the network work the way you want it to work (i.e. in this case to recognize hand-written digits)
Explain each part of the following formula from the video:
\[a^{(1)} = \sigma (Wa^{(0)} + b)\]
- \(a^{(0)}\) is a column vector that stores all the activations of the neurons in the input layer
- \(W\) is a matrix of edge weights; the first row of \(W\) contains all the weights on the edges from \(a^{(0)}\) neurons to the first neuron of \(a^{(1)}\); the second row of \(W\) contains all the edge weights to the second neuron of \(a^{(1)}\); and so on.
- \(Wa^{(0)}\) calculates the weighted sums of inputs to all the neurons in layer \(a^{(1)}\)
- \(b\) is a column vector of the biases for all the neurons in layer \(a^{(1)}\)
- \(\sigma\) is the sigmoid function, and is used to squish the value of \(Wa^{(0)} + b\) into the range 0 to 1; the intention is that \(\sigma\) is applied element-wise to each of the numbers in the vector that \(Wa^{(0)} + b\) evaluates to
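Here is a minimal NumPy sketch of that formula, using the layer sizes from the video (784 input pixels, 16 neurons in the first hidden layer); the random values simply stand in for whatever weights and biases a trained network would have:

```python
import numpy as np

def sigmoid(x):
    # applied element-wise when given a vector
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)

a0 = rng.random(784)                # activations of the 784 input-layer neurons (pixel values)
W = rng.standard_normal((16, 784))  # row i holds the weights on edges into neuron i of the next layer
b = rng.standard_normal(16)         # one bias per neuron in the next layer

a1 = sigmoid(W @ a0 + b)            # activations of the 16 neurons in the first hidden layer
print(a1.shape)                     # (16,)
```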
What squishing function do modern NNs often use instead of the sigmoid function? Why?
Answer: ReLU, which is called the rectified linear unit, and is defined like this:
\[\textrm{ReLU}(a) = \max(0, a)\]
ReLU is faster to compute and easier to train with than the sigmoid function.
Near the end of the video, it is stated that a NN can be thought of as a kind of function. If you treat the NN in this video as a function, what is its input and output? Answer: The input is an image, and the output is a column vector with 10 elements that contains numbers indicating what the NN thinks the image is. Usually the answer is taken to be the digit with the highest output value.
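Putting the layers together, the whole network is just a function from 784 pixel values to a 10-element vector. A minimal sketch (again with made-up random weights, and the layer sizes from the video):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
sizes = [784, 16, 16, 10]  # input layer, two hidden layers, output layer

# one (weights, biases) pair per connection between consecutive layers
params = [(rng.standard_normal((n_out, n_in)), rng.standard_normal(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def network(pixels):
    a = pixels
    for W, b in params:
        a = sigmoid(W @ a + b)
    return a  # 10 numbers, one per digit

output = network(rng.random(784))
print(output)
print("digit the network 'thinks' it sees:", int(np.argmax(output)))
```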
Question¶
Suppose you created a multi-layer NN, the same kind as in the video, to recognize both hand-written digits and alphabetic letters. Each symbol is in a 200 x 200 pixel grayscale image, and the letters are both lowercase a-z and uppercase A-Z (there are no punctuation or other symbols). Suppose there are two hidden layers in the network, both with 25 neurons.
- How many neurons are in the:
- input layer? Answer: 200 * 200 = 40,000
- output layer? Answer: 10 + 26 + 26 = 62
- two hidden layers? Answer: 50
- entire network? Answer: 40000 + 62 + 50 = 40112
- How many edge weights are there from:
- The input layer to the first hidden layer? Answer: 40,000 * 25 = 1,000,000 (1 million)
- The first hidden layer to the second hidden layer? Answer: 25 * 25 = 625
- The second hidden layer to the output layer? Answer: 25 * 62 = 1550
- In total, how many edge weights does this neural network have? Answer: 1,000,000 + 625 + 1550 = 1,002,175
- How many biases does this neural network have? Answer: one for each neuron that is not in the input layer, so 62 + 50 = 112 biases; the input layer neurons have no biases because they are set to be the grayscale values of pixels
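A quick sketch that re-derives these counts from the layer sizes in the question:

```python
# 200x200 input image; two hidden layers of 25 neurons;
# output layer for 10 digits + 26 lowercase + 26 uppercase letters
sizes = [200 * 200, 25, 25, 10 + 26 + 26]

neurons = sum(sizes)
weights = sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
biases = sum(sizes[1:])  # input-layer neurons have no bias

print(neurons)  # 40112
print(weights)  # 1002175
print(biases)   # 112
```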
Gradient Descent, how neural networks learn, Deep learning, chapter 2¶
What is gradient descent?
What do the hidden layers look for?
28x28 pixel grid, each pixel a grayscale value from 0.0 (black) to 1.0 (white)
Training data: images of hand-written digits, along with labels of which digit they’re supposed to be
The training phase sets the weights and biases of the net. After training, you then test the net on images it hasn’t seen before to see if it correctly classifies those new images (i.e. if it can correctly recognize hand-written digits).
Step 1 of training: set weights and biases to random values
Cost function: when you run an image through the network, its output layer ends up having 10 different values, which may or may not be correct. To determine how close the network is to the correct answer, we calculate the square of the difference between the actual value of each output neuron and its correct value (which we know from the label of the training data), and then add those squares up to get a number that tells us how close the network is to recognizing the digit.
e.g. Suppose we give the network an image containing a “2”. Since we know it’s a 2, the correct output of the network should be this:
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0] // correct output for a 2
But suppose the output layer of the network is this:
[0.1, 0.05, 0.86, 0.02, 0.15, 0.06, 0.1, 0.52, 0.08, 0.2]
Now we square the difference of all these values, and add them up:
(0-0.1)^2 + (0-0.05)^2 + (1-0.86)^2 + (0-0.02)^2 + (0-0.15)^2
+ (0-0.06)^2 + (0-0.1)^2 + (0-0.52)^2 + (0-0.08)^2 + (0-0.2)^2
= 0.3854
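A tiny sketch that checks this cost for the example output above:

```python
correct = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]                           # what a "2" should produce
actual = [0.1, 0.05, 0.86, 0.02, 0.15, 0.06, 0.1, 0.52, 0.08, 0.2]  # what the network produced

cost = sum((c - a) ** 2 for c, a in zip(correct, actual))
print(cost)  # approximately 0.3854
```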
The overall measure of the quality of the network is the average of the individual costs of all the training images (each calculated as in the example). The overall cost function takes all the weights and biases as inputs and returns this average, a single number that describes how well the network recognizes hand-written digits (the lower the better).
If w is a vector with all the weights and biases of the network, then C(w) is its cost (with respect to a training set of images). The goal is to find a w that minimizes C(w). C(w) is a big, complicated function, so there is no simple formula for its minimum. Instead, we have to search for one.
The video asks you to imagine a ball rolling down a hill: where the ball comes to rest is a local minimum of the function.
In multi-variable calculus, the gradient of a function gives you the direction of steepest ascent. The negative of the gradient is the direction to go to decrease the function most quickly.
\(\nabla C(w)\) is the gradient of the cost function.
The idea is to compute \(\nabla C(w)\), take a step in the direction \(-\nabla C(w)\), and repeat until you find a local minimum.
In the neural network, \(-\nabla C(w)\) ends up giving a vector that, when a small multiple of it is added to w, gives us a new w that results in a smaller cost.
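Gradient descent is the same idea for any function whose gradient you can compute. A minimal one-variable sketch (not the network’s cost function, just a simple made-up example where the gradient is easy to write down):

```python
# minimize C(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
def grad_C(w):
    return 2 * (w - 3)

w = 10.0          # start somewhere, like the randomly initialized weights
step_size = 0.1   # how far to move along the negative gradient each step

for _ in range(100):
    w = w - step_size * grad_C(w)  # step in the direction -grad C(w)

print(w)  # very close to 3, the minimum of C
```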
The algorithm that efficiently computes \(\nabla C(w)\) is called backpropagation, and it’s the key idea in neural network learning: it is what makes repeatedly stepping downhill on the cost function computationally feasible.
Why do artificial neurons in a neural network have a continuous value, instead of just a discrete 0/1 (i.e. not activated, or activated) value? Answer: the cost function is smoother when the neuron activations are continuous, and a smooth cost function is what lets gradient descent find a minimum by taking small steps downhill.
Gradient descent is the general name given to the algorithm of following the gradient to minimize a function.
96% success rate for neural network in the video (98% with some tweaking)
When the neural network in the video is given a totally random input image, what kind of answer does it give? Answer: It “confidently” chooses some digit as the answer, even though this makes no sense (roughly equal activations for all ten digits would seem more sensible).
What do the neurons in the second layer (the first hidden layer) appear to recognize? Answer: Nothing obvious … they look like more or less random patterns. Importantly, the network does not seem to be recognizing edges, loops, or other such visual pieces of a digit.
What is backpropagation really doing? Deep learning, chapter 3¶
backpropagation is the core algorithm for NN learning
backpropagation computes the gradient of the cost function, i.e. it finds the changes to the weights and biases that decrease the cost most quickly
the video gives a good intuitive overview of backpropagation
to get the right outputs for a given input, the weights and biases of the network are “nudged” in a way that makes the desired output neurons more active, and the other output neurons less active
stochastic gradient descent
- all training examples are randomly divided into mini-batches, e.g. 100 examples per mini-batch
- the gradient step is computed (via backpropagation) on one mini-batch at a time, rather than on the entire training set, which greatly improves efficiency
- each downward step is then usually not the best possible downward step, but it is often good enough for practical purposes (and much less computationally intensive); a minimal sketch of the mini-batch idea is below
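A minimal sketch of splitting training data into mini-batches (made-up data; in practice each example would be a labelled image):

```python
import random

training_examples = list(range(1000))  # stand-ins for 1000 labelled training images
batch_size = 100

random.shuffle(training_examples)      # shuffle so each mini-batch is a random sample
mini_batches = [training_examples[i:i + batch_size]
                for i in range(0, len(training_examples), batch_size)]

for batch in mini_batches:
    # here you would compute the gradient using only this batch,
    # then take one gradient-descent step
    print(len(batch))                  # each mini-batch has 100 examples
```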
a lot of labelled training data is needed for this to work
- often, people are needed to do the labelling … expensive!
Backpropagation calculus, Deep learning, chapter 4¶
calculus!
the chain rule from calculus, in networks
there’s a lot of symbol-chasing, and knowledge of partial derivatives, in this video; some students might not have the prerequisites (or stomach!) for such in-depth math; the textbook might be better, i.e. the same calculus is written down in the textbook chapter on neural networks
In the simple 4-neuron network given at the start of the video, what does the notation \(a^{(L)}\) mean?
- Correct: the activation of the neuron in level L
- the activation, a, of a neuron to the power of the level, L, that it’s in
- the input weight to the neuron a of level L
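For reference, here is the core chain-rule computation for that simple chain network, written in (roughly) the notation used in the video, where \(z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}\), \(a^{(L)} = \sigma(z^{(L)})\), and the cost for a single training example is \(C_0 = (a^{(L)} - y)^2\):
\[\frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}} \cdot \frac{\partial C_0}{\partial a^{(L)}} = a^{(L-1)} \, \sigma'(z^{(L)}) \, 2(a^{(L)} - y)\]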