# 1.7: Implementing our network to classify digits


Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is to get the MNIST data. If you're a `git` user then you can obtain the data by cloning the code repository for this book,

```
git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
```

If you don't use `git` then you can download the data and code here.

Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images, and 10,000 test images. That's the official MNIST description. Actually, we're going to split the data a little differently. We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000-image *validation set*. We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain *hyper-parameters* of the neural network - things like the learning rate, and so on, which aren't directly selected by our learning algorithm. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I'll be referring to our 50,000-image data set, not the original 60,000-image data set*

*As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology. To construct MNIST the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. See this link for more details. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal (link).

Apart from the MNIST data we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you can get it here.

Let me explain the core features of the neural networks code, before giving a full listing, below. The centerpiece is a `Network` class, which we use to represent a neural network. Here's the code we use to initialize a `Network` object:

```python
class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
```

In this code, the list `sizes` contains the number of neurons in the respective layers. So, for example, if we want to create a `Network` object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code:

```python
net = Network([2, 3, 1])
```

The biases and weights in the `Network` object are all initialized randomly, using the Numpy `np.random.randn` function to generate Gaussian distributions with mean \(0\) and standard deviation \(1\). This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we'll find better ways of initializing the weights and biases, but this will do for now. Note that the `Network` initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.
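To make the shapes concrete, here is a minimal standalone sketch (my own, assuming Numpy is imported as `np`) of what the initialization produces for a `[2, 3, 1]` network:

```python
import numpy as np

sizes = [2, 3, 1]  # 2 input neurons, 3 hidden neurons, 1 output neuron

# One bias column vector per non-input layer.
biases = [np.random.randn(y, 1) for y in sizes[1:]]
# One weight matrix per pair of adjacent layers.
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

print([b.shape for b in biases])   # [(3, 1), (1, 1)]
print([w.shape for w in weights])  # [(3, 2), (1, 3)]
```

Each weight matrix has one row per neuron in the later layer and one column per neuron in the earlier layer, which is exactly the ordering discussed next.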

Note also that the biases and weights are stored as lists of Numpy matrices. So, for example `net.weights[1]` is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It's not the first and second layers, since Python's list indexing starts at `0`.) Since `net.weights[1]` is rather verbose, let's just denote that matrix \(w\). It's a matrix such that \(w_{jk}\) is the weight for the connection between the \(k^{\rm th}\) neuron in the second layer, and the \(j^{\rm th}\) neuron in the third layer. This ordering of the \(j\) and \(k\) indices may seem strange - surely it'd make more sense to swap the \(j\) and \(k\) indices around? The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is:

\[ a' = \sigma(w a + b) \tag{1.7.1} \]

There's quite a bit going on in this equation, so let's unpack it piece by piece. \(a\) is the vector of activations of the second layer of neurons. To obtain \(a'\) we multiply \(a\) by the weight matrix \(w\), and add the vector \(b\) of biases. We then apply the function \(σ\) elementwise to every entry in the vector \(wa+b\). (This is called *vectorizing* the function \(σ\).) It's easy to verify that Equation (1.7.1) gives the same result as our earlier rule, Equation (1.3.2), for computing the output of a sigmoid neuron.
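As a quick numerical sanity check (my own sketch, with arbitrary toy values), we can confirm that the vectorized rule matches the component form \(a'_j = \sigma\bigl(\sum_k w_{jk} a_k + b_j\bigr)\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A toy layer: 3 neurons fed by 2 inputs (values chosen arbitrarily).
w = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
b = np.array([[0.1], [0.2], [0.3]])
a = np.array([[1.0], [2.0]])

# Vectorized form: a' = sigmoid(w a + b)
vectorized = sigmoid(np.dot(w, a) + b)

# Component form: a'_j = sigmoid(sum_k w_jk * a_k + b_j)
componentwise = np.array(
    [[sigmoid(sum(w[j, k] * a[k, 0] for k in range(2)) + b[j, 0])]
     for j in range(3)])

print(np.allclose(vectorized, componentwise))  # True
```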

## Exercise

- Write out Equation (1.7.1) in component form, and verify that it gives the same result as the rule (1.3.2) for computing the output of a sigmoid neuron.

With all this in mind, it's easy to write code computing the output from a `Network` instance. We begin by defining the sigmoid function:

```python
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))
```

Note that when the input `z` is a vector or Numpy array, Numpy automatically applies the function `sigmoid` elementwise, that is, in vectorized form.

We then add a `feedforward` method to the `Network` class, which, given an input `a` for the network, returns the corresponding output*

*It is assumed that the input `a` is an `(n, 1)` Numpy ndarray, not an `(n,)` vector. Here, `n` is the number of inputs to the network. If you try to use an `(n,)` vector as input you'll get strange results. Although using an `(n,)` vector appears the more natural choice, using an `(n, 1)` ndarray makes it particularly easy to modify the code to feedforward multiple inputs at once, and that is sometimes convenient. All the method does is apply Equation (1.7.1) for each layer:

```python
def feedforward(self, a):
    """Return the output of the network if "a" is input."""
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a)+b)
    return a
```
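The footnote's warning about shapes is easy to demonstrate. Here's a small sketch of my own contrasting an `(n, 1)` column vector with a flat `(n,)` array when the bias column is added:

```python
import numpy as np

w = np.random.randn(3, 2)   # weights: 3 outputs, 2 inputs
b = np.random.randn(3, 1)   # biases stored as a (3, 1) column

a_col = np.ones((2, 1))     # the recommended (n, 1) shape
a_flat = np.ones(2)         # the "natural" but troublesome (n,) shape

out_col = np.dot(w, a_col) + b    # shape (3, 1): what we want
out_flat = np.dot(w, a_flat) + b  # broadcasts out to shape (3, 3)!

print(out_col.shape)   # (3, 1)
print(out_flat.shape)  # (3, 3) -- the "strange results"
```

With the flat input, `np.dot(w, a_flat)` has shape `(3,)`, and adding the `(3, 1)` bias triggers Numpy broadcasting into a `(3, 3)` matrix rather than raising an error, which is why the bug can go unnoticed.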

Of course, the main thing we want our `Network` objects to do is to learn. To that end we'll give them an `SGD` method which implements stochastic gradient descent. Here's the code. It's a little mysterious in a few places, but I'll break it down below, after the listing.

```python
def SGD(self, training_data, epochs, mini_batch_size, eta,
        test_data=None):
    """Train the neural network using mini-batch stochastic
    gradient descent.  The "training_data" is a list of tuples
    "(x, y)" representing the training inputs and the desired
    outputs.  The other non-optional parameters are
    self-explanatory.  If "test_data" is provided then the
    network will be evaluated against the test data after each
    epoch, and partial progress printed out.  This is useful for
    tracking progress, but slows things down substantially."""
    if test_data: n_test = len(test_data)
    n = len(training_data)
    for j in xrange(epochs):
        random.shuffle(training_data)
        mini_batches = [
            training_data[k:k+mini_batch_size]
            for k in xrange(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)
        if test_data:
            print "Epoch {0}: {1} / {2}".format(
                j, self.evaluate(test_data), n_test)
        else:
            print "Epoch {0} complete".format(j)
```

The `training_data` is a list of tuples `(x, y)` representing the training inputs and corresponding desired outputs. The variables `epochs` and `mini_batch_size` are what you'd expect - the number of epochs to train for, and the size of the mini-batches to use when sampling. `eta` is the learning rate, \(η\). If the optional argument `test_data` is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress. This is useful for tracking progress, but slows things down substantially.

The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each `mini_batch` we apply a single step of gradient descent. This is done by the code`self.update_mini_batch(mini_batch, eta)`, which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in `mini_batch`. Here's the code for the `update_mini_batch` method:
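The shuffle-and-partition step can be seen in isolation. Here is a standalone sketch of mine (Python 3, so `range` stands in for the book's `xrange`), using integers as stand-ins for `(x, y)` pairs:

```python
import random

training_data = list(range(10))  # stand-in for 10 (x, y) training pairs
mini_batch_size = 3

random.shuffle(training_data)
n = len(training_data)
mini_batches = [training_data[k:k+mini_batch_size]
                for k in range(0, n, mini_batch_size)]

# Every example appears exactly once across the mini-batches; the last
# batch may be smaller when n is not a multiple of mini_batch_size.
print([len(mb) for mb in mini_batches])  # [3, 3, 3, 1]
```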

```python
def update_mini_batch(self, mini_batch, eta):
    """Update the network's weights and biases by applying
    gradient descent using backpropagation to a single mini batch.
    The "mini_batch" is a list of tuples "(x, y)", and "eta"
    is the learning rate."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    self.weights = [w-(eta/len(mini_batch))*nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b-(eta/len(mini_batch))*nb
                   for b, nb in zip(self.biases, nabla_b)]
```

Most of the work is done by the line

```python
delta_nabla_b, delta_nabla_w = self.backprop(x, y)
```

This invokes something called the *backpropagation* algorithm, which is a fast way of computing the gradient of the cost function. So `update_mini_batch` works simply by computing these gradients for every training example in the `mini_batch`, and then updating `self.weights` and `self.biases` appropriately.
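The update itself is just the averaged gradient step \(w \rightarrow w - \frac{\eta}{m} \sum_x \nabla C_x\), where \(m\) is the mini-batch size. A tiny numeric sketch (the values are mine, purely illustrative):

```python
import numpy as np

eta = 3.0
w = np.array([[1.0, -1.0]])                  # one small weight matrix

# Pretend backprop returned these gradients for a 2-example mini-batch.
per_example_grads = [np.array([[0.2, 0.4]]),
                     np.array([[0.0, 0.2]])]

m = len(per_example_grads)
nabla_w = sum(per_example_grads)             # accumulate, as update_mini_batch does
w_new = w - (eta / m) * nabla_w              # averaged gradient descent step

print(w_new)  # [[ 0.7 -1.9]]
```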

I'm not going to show the code for `self.backprop` right now. We'll study how backpropagation works in the next chapter, including the code for `self.backprop`. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated to the training example `x`.

Let's look at the full program, including the documentation strings, which I omitted above. Apart from `self.backprop` the program is self-explanatory - all the heavy lifting is done in `self.SGD` and `self.update_mini_batch`, which we've already discussed. The `self.backprop` method makes use of a few extra functions to help in computing the gradient, namely `sigmoid_prime`, which computes the derivative of the \(\sigma\) function, and `self.cost_derivative`, which I won't describe here. You can get the gist of these (and perhaps the details) just by looking at the code and documentation strings. We'll look at them in detail in the next chapter. Note that while the program appears lengthy, much of the code is documentation strings intended to make the code easy to understand. In fact, the program contains just 74 lines of non-whitespace, non-comment code. All the code may be found on GitHub here.

"""network.py~~~~~~~~~~A module to implement the stochastic gradient descent learningalgorithm for a feedforward neural network. Gradients are calculatedusing backpropagation. Note that I have focused on making the codesimple, easily readable, and easily modifiable. It is not optimized,and omits many desirable features."""#### Libraries# Standard libraryimportrandom# Third-party librariesimportnumpyasnpclassNetwork(object):def__init__(self, sizes):"""The list ``sizes`` contains the number of neurons in therespective layers of the network. For example, if the listwas [2, 3, 1] then it would be a three-layer network, with thefirst layer containing 2 neurons, the second layer 3 neurons,and the third layer 1 neuron. The biases and weights for thenetwork are initialized randomly, using a Gaussiandistribution with mean 0, and variance 1. Note that the firstlayer is assumed to be an input layer, and by convention wewon't set any biases for those neurons, since biases are onlyever used in computing the outputs from later layers."""self.num_layers = len(sizes) self.sizes = sizes self.biases = [np.random.randn(y, 1)foryinsizes[1:]] self.weights = [np.random.randn(y, x)forx, yinzip(sizes[:-1], sizes[1:])]deffeedforward(self, a):"""Return the output of the network if ``a`` is input."""forb, winzip(self.biases, self.weights): a = sigmoid(np.dot(w, a)+b)returnadefSGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):"""Train the neural network using mini-batch stochasticgradient descent. The ``training_data`` is a list of tuples``(x, y)`` representing the training inputs and the desiredoutputs. The other non-optional parameters areself-explanatory. If ``test_data`` is provided then thenetwork will be evaluated against the test data after eachepoch, and partial progress printed out. 
This is useful fortracking progress, but slows things down substantially."""iftest_data: n_test = len(test_data) n = len(training_data)forjinxrange(epochs): random.shuffle(training_data) mini_batches = [ training_data[k:k+mini_batch_size]forkinxrange(0, n, mini_batch_size)]formini_batchinmini_batches: self.update_mini_batch(mini_batch, eta)iftest_data:else:defupdate_mini_batch(self, mini_batch, eta):"""Update the network's weights and biases by applyinggradient descent using backpropagation to a single mini batch.The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``is the learning rate."""nabla_b = [np.zeros(b.shape)forbinself.biases] nabla_w = [np.zeros(w.shape)forwinself.weights]forx, yinmini_batch: delta_nabla_b, delta_nabla_w = self.backprop(x, y) nabla_b = [nb+dnbfornb, dnbinzip(nabla_b, delta_nabla_b)] nabla_w = [nw+dnwfornw, dnwinzip(nabla_w, delta_nabla_w)] self.weights = [w-(eta/len(mini_batch))*nwforw, nwinzip(self.weights, nabla_w)] self.biases = [b-(eta/len(mini_batch))*nbforb, nbinzip(self.biases, nabla_b)]defbackprop(self, x, y):"""Return a tuple ``(nabla_b, nabla_w)`` representing thegradient for the cost function C_x. ``nabla_b`` and``nabla_w`` are layer-by-layer lists of numpy arrays, similarto ``self.biases`` and ``self.weights``."""nabla_b = [np.zeros(b.shape)forbinself.biases] nabla_w = [np.zeros(w.shape)forwinself.weights]# feedforwardactivation = x activations = [x]# list to store all the activations, layer by layerzs = []# list to store all the z vectors, layer by layerforb, winzip(self.biases, self.weights): z = np.dot(w, activation)+b zs.append(z) activation = sigmoid(z) activations.append(activation)# backward passdelta = self.cost_derivative(activations[-1], y) * \ sigmoid_prime(zs[-1]) nabla_b[-1] = delta nabla_w[-1] = np.dot(delta, activations[-2].transpose())# Note that the variable l in the loop below is used a little# differently to the notation in Chapter 2 of the book. 
Here,# l = 1 means the last layer of neurons, l = 2 is the# second-last layer, and so on. It's a renumbering of the# scheme in the book, used here to take advantage of the fact# that Python can use negative indices in lists.forlinxrange(2, self.num_layers): z = zs[-l] sp = sigmoid_prime(z) delta = np.dot(self.weights[-l+1].transpose(), delta) * sp nabla_b[-l] = delta nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())return(nabla_b, nabla_w)defevaluate(self, test_data):"""Return the number of test inputs for which the neuralnetwork outputs the correct result. Note that the neuralnetwork's output is assumed to be the index of whicheverneuron in the final layer has the highest activation."""test_results = [(np.argmax(self.feedforward(x)), y)for(x, y)intest_data]returnsum(int(x == y)for(x, y)intest_results)defcost_derivative(self, output_activations, y):"""Return the vector of partial derivatives \partial C_x /\partial a for the output activations."""return(output_activations-y)#### Miscellaneous functionsdefsigmoid(z):"""The sigmoid function."""return1.0/(1.0+np.exp(-z))defsigmoid_prime(z):"""Derivative of the sigmoid function."""returnsigmoid(z)*(1-sigmoid(z))

How well does the program recognize handwritten digits? Well, let's start by loading in the MNIST data. I'll do this using a little helper program, `mnist_loader.py`, to be described below. We execute the following commands in a Python shell,

```
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
```

Of course, this could also be done in a separate Python program, but if you're following along it's probably easiest to do in a Python shell.

After loading the MNIST data, we'll set up a `Network` with \(30\) hidden neurons. We do this after importing the Python program listed above, which is named `network`,

```
>>> import network
>>> net = network.Network([784, 30, 10])
```

Finally, we'll use stochastic gradient descent to learn from the MNIST `training_data` over 30 epochs, with a mini-batch size of 10, and a learning rate of \(η=3.0\),

```
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
```

Note that if you're running the code as you read along, it will take some time to execute - for a typical machine (as of 2015) it will likely take a few minutes to run. I suggest you set things running, continue to read, and periodically check the output from the code. If you're in a rush you can speed things up by decreasing the number of epochs, by decreasing the number of hidden neurons, or by using only part of the training data. Note that production code would be much, much faster: these Python scripts are intended to help you understand how neural nets work, not to be high-performance code! And, of course, once we've trained a network it can be run very quickly indeed, on almost any computing platform. For example, once we've learned a good set of weights and biases for a network, it can easily be ported to run in Javascript in a web browser, or as a native app on a mobile device.

In any case, here is a partial transcript of the output of one training run of the neural network. The transcript shows the number of test images correctly recognized by the neural network after each epoch of training. As you can see, after just a single epoch this has reached 9,129 out of 10,000, and the number continues to grow,

```
Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000
```

That is, the trained network gives us a classification rate of about \(95\) percent - \(95.42\) percent at its peak ("Epoch 28")! That's quite encouraging as a first attempt. I should warn you, however, that if you run the code then your results are not necessarily going to be quite the same as mine, since we'll be initializing our network using (different) random weights and biases. To generate results in this chapter I've taken best-of-three runs.

Let's rerun the above experiment, changing the number of hidden neurons to \(100\). As was the case earlier, if you're running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it's wise to continue reading in parallel while the code executes.

```
>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
```

Sure enough, this improves the results to \(96.59\) percent. At least in this case, using more hidden neurons helps us get better results*

*Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse. Using the techniques introduced in chapter 3 will greatly reduce the variation in performance across different training runs for our networks.

Of course, to obtain these accuracies I had to make specific choices for the number of epochs of training, the mini-batch size, and the learning rate, \(η\). As I mentioned above, these are known as hyper-parameters for our neural network, in order to distinguish them from the parameters (weights and biases) learnt by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we'd chosen the learning rate to be \(η=0.001\),

```
>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)
```

The results are much less encouraging,

```
Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000
```

However, you can see that the performance of the network is getting slowly better over time. That suggests increasing the learning rate, say to \(η=0.01\). If we do that, we get better results, which suggests increasing the learning rate again. (If making a change improves things, try doing more!) If we do that several times over, we'll end up with a learning rate of something like \(η=1.0\) (and perhaps fine tune to \(3.0\)), which is close to our earlier experiments. So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters.

In general, debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise. Suppose we try the successful 30 hidden neuron network architecture from earlier, but with the learning rate changed to \(η=100.0\):

```
>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)
```

At this point we've actually gone too far, and the learning rate is too high:

```
Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000
```

Now imagine that we were coming to this problem for the first time. Of course, we *know* from our earlier experiments that the right thing to do is to decrease the learning rate. But if we were coming to this problem for the first time then there wouldn't be much in the output to guide us on what to do. We might worry not only about the learning rate, but about every other aspect of our neural network. We might wonder if we've initialized the weights and biases in a way that makes it hard for the network to learn? Or maybe we don't have enough training data to get meaningful learning? Perhaps we haven't run for enough epochs? Or maybe it's impossible for a neural network with this architecture to learn to recognize handwritten digits? Maybe the learning rate is too *low*? Or, maybe, the learning rate is too high? When you're coming to a problem for the first time, you're not always sure.

The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. We'll discuss all these at length through the book, including how I chose the hyper-parameters above.

## Exercise

- Try creating a network with just two layers - an input and an output layer, no hidden layer - with 784 and 10 neurons, respectively. Train the network using stochastic gradient descent. What classification accuracy can you achieve?

Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward. For completeness, here's the code. The data structures used to store the MNIST data are described in the documentation strings - it's straightforward stuff, tuples and lists of Numpy `ndarray` objects (think of them as vectors if you're not familiar with `ndarray`s):

"""mnist_loader~~~~~~~~~~~~A library to load the MNIST image data. For details of the datastructures that are returned, see the doc strings for ``load_data``and ``load_data_wrapper``. In practice, ``load_data_wrapper`` is thefunction usually called by our neural network code."""#### Libraries# Standard libraryimportcPickleimportgzip# Third-party librariesimportnumpyasnpdefload_data():"""Return the MNIST data as a tuple containing the training data,the validation data, and the test data.The ``training_data`` is returned as a tuple with two entries.The first entry contains the actual training images. This is anumpy ndarray with 50,000 entries. Each entry is, in turn, anumpy ndarray with 784 values, representing the 28 * 28 = 784pixels in a single MNIST image.The second entry in the ``training_data`` tuple is a numpy ndarraycontaining 50,000 entries. Those entries are just the digitvalues (0...9) for the corresponding images contained in the firstentry of the tuple.The ``validation_data`` and ``test_data`` are similar, excepteach contains only 10,000 images.This is a nice data format, but for use in neural networks it'shelpful to modify the format of the ``training_data`` a little.That's done in the wrapper function ``load_data_wrapper()``, seebelow."""f = gzip.open('../data/mnist.pkl.gz', 'rb') training_data, validation_data, test_data = cPickle.load(f) f.close()return(training_data, validation_data, test_data)defload_data_wrapper():"""Return a tuple containing ``(training_data, validation_data,test_data)``. Based on ``load_data``, but the format is moreconvenient for use in our implementation of neural networks.In particular, ``training_data`` is a list containing 50,0002-tuples ``(x, y)``. ``x`` is a 784-dimensional numpy.ndarraycontaining the input image. ``y`` is a 10-dimensionalnumpy.ndarray representing the unit vector corresponding to thecorrect digit for ``x``.``validation_data`` and ``test_data`` are lists containing 10,0002-tuples ``(x, y)``. 
In each case, ``x`` is a 784-dimensionalnumpy.ndarry containing the input image, and ``y`` is thecorresponding classification, i.e., the digit values (integers)corresponding to ``x``.Obviously, this means we're using slightly different formats forthe training data and the validation / test data. These formatsturn out to be the most convenient for use in our neural networkcode."""tr_d, va_d, te_d = load_data() training_inputs = [np.reshape(x, (784, 1))forxintr_d[0]] training_results = [vectorized_result(y)foryintr_d[1]] training_data = zip(training_inputs, training_results) validation_inputs = [np.reshape(x, (784, 1))forxinva_d[0]] validation_data = zip(validation_inputs, va_d[1]) test_inputs = [np.reshape(x, (784, 1))forxinte_d[0]] test_data = zip(test_inputs, te_d[1])return(training_data, validation_data, test_data)defvectorized_result(j):"""Return a 10-dimensional unit vector with a 1.0 in the jthposition and zeroes elsewhere. This is used to convert a digit(0...9) into a corresponding desired output from the neuralnetwork."""e = np.zeros((10, 1)) e[j] = 1.0returne
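The one-hot encoding done by `vectorized_result` is worth seeing in isolation. Here it is as a standalone Python 3 sketch:

```python
import numpy as np

def vectorized_result(j):
    """Return a (10, 1) unit vector with 1.0 in the jth position."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

# The digit 3 becomes a column vector with a single 1.0 at index 3.
print(vectorized_result(3).ravel())
# [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```

This is exactly the output format the network is trained against: ten output neurons, with the "correct" neuron's desired activation set to 1.0.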

I said above that our program gets pretty good results. What does that mean? Good compared to what? It's informative to have some simple (non-neural-network) baseline tests to compare against, to understand what it means to perform well. The simplest baseline of all, of course, is to randomly guess the digit. That'll be right about ten percent of the time. We're doing much better than that!

What about a less trivial baseline? Let's try an extremely simple idea: we'll look at how *dark* an image is. For instance, an image of a \(2\) will typically be quite a bit darker than an image of a \(1\), just because more pixels are blackened out, as the following examples illustrate:

This suggests using the training data to compute average darknesses for each digit, \(0,1,2,…,9\). When presented with a new image, we compute how dark the image is, and then guess that it's whichever digit has the closest average darkness. This is a simple procedure, and is easy to code up, so I won't explicitly write out the code - if you're interested it's in the GitHub repository. But it's a big improvement over random guessing, getting \(2,225\) of the \(10,000\) test images correct, i.e., \(22.25\) percent accuracy.
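The repository has the real implementation; here is a simplified sketch of my own, illustrating the same idea on made-up miniature "images" (flat arrays standing in for pixel data):

```python
import numpy as np

def darkness(image):
    """Total pixel intensity of an image (a flat array of pixel values)."""
    return image.sum()

# Made-up training data: (image, digit) pairs where the "0" images
# happen to be much darker than the "1" images.
training = [(np.array([0.9, 0.8, 0.7]), 0),
            (np.array([0.8, 0.9, 0.8]), 0),
            (np.array([0.1, 0.3, 0.0]), 1),
            (np.array([0.2, 0.1, 0.1]), 1)]

# Average darkness per digit class, computed from the training data.
avg_darkness = {}
for digit in {y for _, y in training}:
    vals = [darkness(x) for x, y in training if y == digit]
    avg_darkness[digit] = sum(vals) / len(vals)

def classify(image):
    """Guess the digit whose average darkness is closest."""
    d = darkness(image)
    return min(avg_darkness, key=lambda digit: abs(avg_darkness[digit] - d))

print(classify(np.array([0.9, 0.9, 0.6])))  # 0 (a dark image)
print(classify(np.array([0.0, 0.2, 0.1])))  # 1 (a light image)
```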

It's not difficult to find other ideas which achieve accuracies in the \(20\) to \(50\) percent range. If you work a bit harder you can get up over \(50\) percent. But to get much higher accuracies it helps to use established machine learning algorithms. Let's try using one of the best known algorithms, the *support vector machine* or *SVM*. If you're not familiar with SVMs, not to worry, we're not going to need to understand the details of how SVMs work. Instead, we'll use a Python library called scikit-learn, which provides a simple Python interface to a fast C-based library for SVMs known as LIBSVM.

If we run scikit-learn's SVM classifier using the default settings, then it gets 9,435 of 10,000 test images correct. (The code is available here.) That's a big improvement over our naive approach of classifying an image based on how dark it is. Indeed, it means that the SVM is performing roughly as well as our neural networks, just a little worse. In later chapters we'll introduce new techniques that enable us to improve our neural networks so that they perform much better than the SVM.
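If you'd like to try something similar without downloading MNIST, here is a sketch using scikit-learn's built-in small `digits` dataset (1,797 8×8 images, not the 28×28 MNIST images the text reports results for, so the numbers won't match the 9,435 figure):

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()  # small 8x8 digit images
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = svm.SVC()            # default settings, as in the text
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print("accuracy: %.3f" % accuracy)
```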

That's not the end of the story, however. The 9,435 of 10,000 result is for scikit-learn's default settings for SVMs. SVMs have a number of tunable parameters, and it's possible to search for parameters which improve this out-of-the-box performance. I won't explicitly do this search, but instead refer you to this blog post by Andreas Mueller if you'd like to know more. Mueller shows that with some work optimizing the SVM's parameters it's possible to get the performance up above 98.5 percent accuracy. In other words, a well-tuned SVM only makes an error on about one digit in 70. That's pretty good! Can neural networks do better?

In fact, they can. At present, well-designed neural networks outperform every other technique for solving MNIST, including SVMs. The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. We'll see most of the techniques they used later in the book. At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence, for example:

I trust you'll agree that those are tough to classify! With images like these in the MNIST data set it's remarkable that neural networks can accurately classify all but 21 of the 10,000 test images. Usually, when programming we believe that solving a complicated problem like recognizing the MNIST digits requires a sophisticated algorithm. But even the neural networks in the Wan *et al* paper just mentioned involve quite simple algorithms, variations on the algorithm we've seen in this chapter. All the complexity is learned, automatically, from the training data. In some sense, the moral of both our results and those in more sophisticated papers, is that for some problems: