# 3.4: Handwriting recognition revisited: the code


Let's implement the ideas we've discussed in this chapter. We'll develop a new program, `network2.py`, which is an improved version of the program `network.py` we developed in Chapter 1. If you haven't looked at `network.py` in a while then you may find it helpful to spend a few minutes quickly reading over the earlier discussion. It's only 74 lines of code, and is easily understood.

As was the case in `network.py`, the star of `network2.py` is the `Network` class, which we use to represent our neural networks. We initialize an instance of `Network` with a list of `sizes` for the respective layers in the network, and a choice for the `cost` to use, defaulting to the cross-entropy:

```python
class Network(object):

    def __init__(self, sizes, cost=CrossEntropyCost):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.default_weight_initializer()
        self.cost = cost
```

The first couple of lines of the `__init__` method are the same as in `network.py`, and are pretty self-explanatory. But the next two lines are new, and we need to understand what they're doing in detail.

Let's start by examining the `default_weight_initializer` method. This makes use of our new and improved approach to weight initialization. As we've seen, in that approach the weights input to a neuron are initialized as Gaussian random variables with mean \(0\) and standard deviation \(1\) divided by the square root of the number of connections input to the neuron. Also in this method we'll initialize the biases, using Gaussian random variables with mean \(0\) and standard deviation \(1\). Here's the code:

```python
    def default_weight_initializer(self):
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)/np.sqrt(x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]
```

To understand the code, it may help to recall that `np` is the Numpy library for doing linear algebra. We'll `import` Numpy at the beginning of our program. Also, notice that we don't initialize any biases for the first layer of neurons. We avoid doing this because the first layer is an input layer, and so any biases would not be used. We did exactly the same thing in `network.py`.
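As a quick numerical sanity check (a standalone sketch, not part of `network2.py`; the array sizes are made up for illustration), we can confirm that dividing by \(\sqrt{n_{\rm in}}\) keeps the weighted input to a neuron at roughly unit standard deviation, where plain \(N(0, 1)\) weights would blow it up to roughly \(\sqrt{n_{\rm in}}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 1000  # fan-in: number of connections into the neuron

# Old initialization: weights ~ N(0, 1).  The weighted sum of n_in
# unit-variance inputs then has standard deviation ~ sqrt(n_in).
w_old = rng.standard_normal(n_in)
# New initialization: rescaling by 1/sqrt(n_in) keeps it ~ 1.
w_new = w_old / np.sqrt(n_in)

# 10,000 random unit-Gaussian input vectors, one per column.
x = rng.standard_normal((n_in, 10000))
std_old = (w_old @ x).std()  # roughly sqrt(1000) ~ 31.6
std_new = (w_new @ x).std()  # roughly 1
```

With 1,000 inputs the old scheme gives a weighted sum with standard deviation near \(\sqrt{1000} \approx 31.6\), so the neuron saturates easily, while the rescaled weights keep it near \(1\).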

Complementing the `default_weight_initializer` we'll also include a `large_weight_initializer` method. This method initializes the weights and biases using the old approach from Chapter 1, with both weights and biases initialized as Gaussian random variables with mean \(0\) and standard deviation \(1\). The code is, of course, only a tiny bit different from the `default_weight_initializer`:

```python
    def large_weight_initializer(self):
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]
```

I've included the `large_weight_initializer` method mostly as a convenience to make it easier to compare the results in this chapter to those in Chapter 1. I can't think of many practical situations where I would recommend using it!

The second new thing in `Network`'s `__init__` method is that we now initialize a `cost` attribute. To understand how that works, let's look at the class we use to represent the cross-entropy cost*:

*If you're not familiar with Python's static methods you can ignore the `@staticmethod` decorators, and just treat `fn` and `delta` as ordinary methods. If you're curious about details, all `@staticmethod` does is tell the Python interpreter that the method which follows doesn't depend on the object in any way. That's why `self` isn't passed as a parameter to the `fn` and `delta` methods.
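(A minimal illustration of the point in the footnote, using a hypothetical class rather than anything from `network2.py`:)

```python
class Doubler(object):
    @staticmethod
    def fn(x):
        # No ``self`` parameter: the method never touches instance state.
        return 2 * x

# A static method can be called on the class or on an instance alike.
print(Doubler.fn(3))    # -> 6
print(Doubler().fn(3))  # -> 6
```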

```python
class CrossEntropyCost(object):

    @staticmethod
    def fn(a, y):
        return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

    @staticmethod
    def delta(z, a, y):
        return (a-y)
```

Let's break this down. The first thing to observe is that even though the cross-entropy is, mathematically speaking, a function, we've implemented it as a Python class, not a Python function. Why have I made that choice? The reason is that the cost plays two different roles in our network. The obvious role is that it's a measure of how well an output activation, `a`, matches the desired output, `y`. This role is captured by the `CrossEntropyCost.fn` method. (Note, by the way, that the `np.nan_to_num` call inside `CrossEntropyCost.fn` ensures that Numpy deals correctly with the log of numbers very close to zero.) But there's also a second way the cost function enters our network. Recall from Chapter 2 that when running the backpropagation algorithm we need to compute the network's output error, \(δ^L\). The form of the output error depends on the choice of cost function: different cost function, different form for the output error. For the cross-entropy the output error is, as we saw in Equation (66),

\[ \delta^L = a^L - y. \label{99}\tag{99} \]

For this reason we define a second method, `CrossEntropyCost.delta`, whose purpose is to tell our network how to compute the output error. And then we bundle these two methods up into a single class containing everything our networks need to know about the cost function.

In a similar way, `network2.py` also contains a class to represent the quadratic cost function. This is included for comparison with the results of Chapter 1, since going forward we'll mostly use the cross entropy. The code is just below. The `QuadraticCost.fn` method is a straightforward computation of the quadratic cost associated to the actual output, `a`, and the desired output, `y`. The value returned by `QuadraticCost.delta` is based on the expression (30) for the output error for the quadratic cost, which we derived back in Chapter 2.

```python
class QuadraticCost(object):

    @staticmethod
    def fn(a, y):
        return 0.5*np.linalg.norm(a-y)**2

    @staticmethod
    def delta(z, a, y):
        return (a-y) * sigmoid_prime(z)
```
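To see concretely what the extra `sigmoid_prime(z)` factor does, here's a standalone numerical sketch (it redefines `sigmoid` and `sigmoid_prime` locally, matching the definitions at the bottom of `network2.py`). For a badly saturated output neuron the quadratic cost's error signal is crushed, while the cross-entropy's is not:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

z = np.array([8.0])   # a badly saturated weighted input
a = sigmoid(z)        # output very close to 1...
y = np.array([0.0])   # ...but the desired output is 0

delta_quadratic = (a - y) * sigmoid_prime(z)  # crushed by sigma'(z) ~ 0
delta_crossentropy = (a - y)                  # stays near 1: fast learning
```

This is the learning-slowdown story from earlier in the chapter, reduced to two lines of arithmetic: the neuron is as wrong as it can be, yet the quadratic error signal is a few ten-thousandths in size.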

We've now understood the main differences between `network2.py` and `network.py`. It's all pretty simple stuff. There are a number of smaller changes, which I'll discuss below, including the implementation of L2 regularization. Before getting to that, let's look at the complete code for `network2.py`. You don't need to read all the code in detail, but it is worth understanding the broad structure, and in particular reading the documentation strings, so you understand what each piece of the program is doing. Of course, you're also welcome to delve as deeply as you wish! If you get lost, you may wish to continue reading the prose below, and return to the code later. Anyway, here's the code:

"""network2.py~~~~~~~~~~~~~~An improved version of network.py, implementing the stochasticgradient descent learning algorithm for a feedforward neural network.Improvements include the addition of the cross-entropy cost function,regularization, and better initialization of network weights. Notethat I have focused on making the code simple, easily readable, andeasily modifiable. It is not optimized, and omits many desirablefeatures."""#### Libraries# Standard libraryimportjsonimportrandomimportsys# Third-party librariesimportnumpyasnp#### Define the quadratic and cross-entropy cost functionsclassQuadraticCost(object): @staticmethoddeffn(a, y):"""Return the cost associated with an output ``a`` and desired output``y``."""return0.5*np.linalg.norm(a-y)**2 @staticmethoddefdelta(z, a, y):"""Return the error delta from the output layer."""return(a-y) * sigmoid_prime(z)classCrossEntropyCost(object): @staticmethoddeffn(a, y):"""Return the cost associated with an output ``a`` and desired output``y``. Note that np.nan_to_num is used to ensure numericalstability. In particular, if both ``a`` and ``y`` have a 1.0in the same slot, then the expression (1-y)*np.log(1-a)returns nan. The np.nan_to_num ensures that that is convertedto the correct value (0.0)."""returnnp.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a))) @staticmethoddefdelta(z, a, y):"""Return the error delta from the output layer. Note that theparameter ``z`` is not used by the method. It is included inthe method's parameters in order to make the interfaceconsistent with the delta method for other cost classes."""return(a-y)#### Main Network classclassNetwork(object):def__init__(self, sizes, cost=CrossEntropyCost):"""The list ``sizes`` contains the number of neurons in the respectivelayers of the network. For example, if the list was [2, 3, 1]then it would be a three-layer network, with the first layercontaining 2 neurons, the second layer 3 neurons, and thethird layer 1 neuron. 
The biases and weights for the networkare initialized randomly, using``self.default_weight_initializer`` (see docstring for thatmethod)."""self.num_layers = len(sizes) self.sizes = sizes self.default_weight_initializer() self.cost=costdefdefault_weight_initializer(self):"""Initialize each weight using a Gaussian distribution with mean 0and standard deviation 1 over the square root of the number ofweights connecting to the same neuron. Initialize the biasesusing a Gaussian distribution with mean 0 and standarddeviation 1.Note that the first layer is assumed to be an input layer, andby convention we won't set any biases for those neurons, sincebiases are only ever used in computing the outputs from laterlayers."""self.biases = [np.random.randn(y, 1)foryinself.sizes[1:]] self.weights = [np.random.randn(y, x)/np.sqrt(x)forx, yinzip(self.sizes[:-1], self.sizes[1:])]deflarge_weight_initializer(self):"""Initialize the weights using a Gaussian distribution with mean 0and standard deviation 1. Initialize the biases using aGaussian distribution with mean 0 and standard deviation 1.Note that the first layer is assumed to be an input layer, andby convention we won't set any biases for those neurons, sincebiases are only ever used in computing the outputs from laterlayers.This weight and bias initializer uses the same approach as inChapter 1, and is included for purposes of comparison. 
Itwill usually be better to use the default weight initializerinstead."""self.biases = [np.random.randn(y, 1)foryinself.sizes[1:]] self.weights = [np.random.randn(y, x)forx, yinzip(self.sizes[:-1], self.sizes[1:])]deffeedforward(self, a):"""Return the output of the network if ``a`` is input."""forb, winzip(self.biases, self.weights): a = sigmoid(np.dot(w, a)+b)returnadefSGD(self, training_data, epochs, mini_batch_size, eta, lmbda = 0.0, evaluation_data=None, monitor_evaluation_cost=False, monitor_evaluation_accuracy=False, monitor_training_cost=False, monitor_training_accuracy=False):"""Train the neural network using mini-batch stochastic gradientdescent. The ``training_data`` is a list of tuples ``(x, y)``representing the training inputs and the desired outputs. Theother non-optional parameters are self-explanatory, as is theregularization parameter ``lmbda``. The method also accepts``evaluation_data``, usually either the validation or testdata. We can monitor the cost and accuracy on either theevaluation data or the training data, by setting theappropriate flags. The method returns a tuple containing fourlists: the (per-epoch) costs on the evaluation data, theaccuracies on the evaluation data, the costs on the trainingdata, and the accuracies on the training data. All values areevaluated at the end of each training epoch. So, for example,if we train for 30 epochs, then the first element of the tuplewill be a 30-element list containing the cost on theevaluation data at the end of each epoch. 
Note that the listsare empty if the corresponding flag is not set."""ifevaluation_data: n_data = len(evaluation_data) n = len(training_data) evaluation_cost, evaluation_accuracy = [], [] training_cost, training_accuracy = [], []forjinxrange(epochs): random.shuffle(training_data) mini_batches = [ training_data[k:k+mini_batch_size]forkinxrange(0, n, mini_batch_size)]formini_batchinmini_batches: self.update_mini_batch( mini_batch, eta, lmbda, len(training_data))%straining complete" % jifmonitor_training_cost: cost = self.total_cost(training_data, lmbda) training_cost.append(cost)ifmonitor_training_accuracy: accuracy = self.accuracy(training_data, convert=True) training_accuracy.append(accuracy)ifmonitor_evaluation_cost: cost = self.total_cost(evaluation_data, lmbda, convert=True) evaluation_cost.append(cost)ifmonitor_evaluation_accuracy: accuracy = self.accuracy(evaluation_data) evaluation_accuracy.append(accuracy)returnevaluation_cost, evaluation_accuracy, \ training_cost, training_accuracydefupdate_mini_batch(self, mini_batch, eta, lmbda, n):"""Update the network's weights and biases by applying gradientdescent using backpropagation to a single mini batch. The``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is thelearning rate, ``lmbda`` is the regularization parameter, and``n`` is the total size of the training data set."""nabla_b = [np.zeros(b.shape)forbinself.biases] nabla_w = [np.zeros(w.shape)forwinself.weights]forx, yinmini_batch: delta_nabla_b, delta_nabla_w = self.backprop(x, y) nabla_b = [nb+dnbfornb, dnbinzip(nabla_b, delta_nabla_b)] nabla_w = [nw+dnwfornw, dnwinzip(nabla_w, delta_nabla_w)] self.weights = [(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*nwforw, nwinzip(self.weights, nabla_w)] self.biases = [b-(eta/len(mini_batch))*nbforb, nbinzip(self.biases, nabla_b)]defbackprop(self, x, y):"""Return a tuple ``(nabla_b, nabla_w)`` representing thegradient for the cost function C_x. 
``nabla_b`` and``nabla_w`` are layer-by-layer lists of numpy arrays, similarto ``self.biases`` and ``self.weights``."""nabla_b = [np.zeros(b.shape)forbinself.biases] nabla_w = [np.zeros(w.shape)forwinself.weights]# feedforwardactivation = x activations = [x]# list to store all the activations, layer by layerzs = []# list to store all the z vectors, layer by layerforb, winzip(self.biases, self.weights): z = np.dot(w, activation)+b zs.append(z) activation = sigmoid(z) activations.append(activation)# backward passdelta = (self.cost).delta(zs[-1], activations[-1], y) nabla_b[-1] = delta nabla_w[-1] = np.dot(delta, activations[-2].transpose())# Note that the variable l in the loop below is used a little# differently to the notation in Chapter 2 of the book. Here,# l = 1 means the last layer of neurons, l = 2 is the# second-last layer, and so on. It's a renumbering of the# scheme in the book, used here to take advantage of the fact# that Python can use negative indices in lists.forlinxrange(2, self.num_layers): z = zs[-l] sp = sigmoid_prime(z) delta = np.dot(self.weights[-l+1].transpose(), delta) * sp nabla_b[-l] = delta nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())return(nabla_b, nabla_w)defaccuracy(self, data, convert=False):"""Return the number of inputs in ``data`` for which the neuralnetwork outputs the correct result. The neural network'soutput is assumed to be the index of whichever neuron in thefinal layer has the highest activation.The flag ``convert`` should be set to False if the data set isvalidation or test data (the usual case), and to True if thedata set is the training data. The need for this flag arisesdue to differences in the way the results ``y`` arerepresented in the different data sets. In particular, itflags whether we need to convert between the differentrepresentations. It may seem strange to use differentrepresentations for the different data sets. Why not use thesame representation for all three data sets? 
It's done forefficiency reasons -- the program usually evaluates the coston the training data and the accuracy on other data sets.These are different types of computations, and using differentrepresentations speeds things up. More details on therepresentations can be found inmnist_loader.load_data_wrapper."""ifconvert: results = [(np.argmax(self.feedforward(x)), np.argmax(y))for(x, y)indata]else: results = [(np.argmax(self.feedforward(x)), y)for(x, y)indata]returnsum(int(x == y)for(x, y)inresults)deftotal_cost(self, data, lmbda, convert=False):"""Return the total cost for the data set ``data``. The flag``convert`` should be set to False if the data set is thetraining data (the usual case), and to True if the data set isthe validation or test data. See comments on the similar (butreversed) convention for the ``accuracy`` method, above."""cost = 0.0forx, yindata: a = self.feedforward(x)ifconvert: y = vectorized_result(y) cost += self.cost.fn(a, y)/len(data) cost += 0.5*(lmbda/len(data))*sum( np.linalg.norm(w)**2forwinself.weights)returncostdefsave(self, filename):"""Save the neural network to the file ``filename``."""data = {"sizes": self.sizes, "weights": [w.tolist()forwinself.weights], "biases": [b.tolist()forbinself.biases], "cost": str(self.cost.__name__)} f = open(filename, "w") json.dump(data, f) f.close()#### Loading a Networkdefload(filename):"""Load a neural network from the file ``filename``. Returns aninstance of Network."""f = open(filename, "r") data = json.load(f) f.close() cost = getattr(sys.modules[__name__], data["cost"]) net = Network(data["sizes"], cost=cost) net.weights = [np.array(w)forwindata["weights"]] net.biases = [np.array(b)forbindata["biases"]]returnnet#### Miscellaneous functionsdefvectorized_result(j):"""Return a 10-dimensional unit vector with a 1.0 in the j'th positionand zeroes elsewhere. 
This is used to convert a digit (0...9)into a corresponding desired output from the neural network."""e = np.zeros((10, 1)) e[j] = 1.0returnedefsigmoid(z):"""The sigmoid function."""return1.0/(1.0+np.exp(-z))defsigmoid_prime(z):"""Derivative of the sigmoid function."""returnsigmoid(z)*(1-sigmoid(z))

One of the more interesting changes in the code is to include L2 regularization. Although this is a major conceptual change, it's so trivial to implement that it's easy to miss in the code. For the most part it just involves passing the parameter `lmbda` to various methods, notably the `Network.SGD` method. The real work is done in a single line of the program, the fourth-last line of the `Network.update_mini_batch` method. That's where we modify the gradient descent update rule to include weight decay. But although the modification is tiny, it has a big impact on results!
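In isolation, that line implements the regularized update rule \(w \rightarrow (1-\eta\lambda/n)w - (\eta/m)\nabla C\), where \(m\) is the mini-batch size. Here's a standalone sketch with made-up numbers (not a call into `network2.py`) showing the weight-decay factor at work:

```python
import numpy as np

eta, lmbda, n = 0.5, 5.0, 50000  # learning rate, regularization, training-set size
mini_batch_size = 10

w = np.array([[0.8, -1.2]])        # some current weights
nabla_w = np.array([[0.1, 0.3]])   # summed gradient over the mini-batch

# The regularized update: first rescale w by (1 - eta*lmbda/n),
# the weight-decay factor, then take the usual gradient step.
decay = 1 - eta * (lmbda / n)      # 0.99995 here: a very gentle shrink
w_new = decay * w - (eta / mini_batch_size) * nabla_w
```

Note how gentle the decay is for realistic hyper-parameters: with \(\eta = 0.5\), \(\lambda = 5.0\) and \(n = 50{,}000\) each update shrinks the weights by just 0.005 percent, but over tens of thousands of updates the cumulative pressure toward small weights is substantial.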

This is, by the way, common when implementing new techniques in neural networks. We've spent thousands of words discussing regularization. It's conceptually quite subtle and difficult to understand. And yet it was trivial to add to our program! It occurs surprisingly often that sophisticated techniques can be implemented with small changes to code.

Another small but important change to our code is the addition of several optional flags to the stochastic gradient descent method, `Network.SGD`. These flags make it possible to monitor the cost and accuracy either on the `training_data` or on a set of `evaluation_data` which can be passed to `Network.SGD`. We've used these flags often earlier in the chapter, but let me give an example of how it works, just to remind you:

```python
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.SGD(training_data, 30, 10, 0.5,
... lmbda = 5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True,
... monitor_evaluation_cost=True,
... monitor_training_accuracy=True,
... monitor_training_cost=True)
```

Here, we're setting the `evaluation_data` to be the `validation_data`. But we could also have monitored performance on the `test_data` or any other data set. We also have four flags telling us to monitor the cost and accuracy on both the `evaluation_data` and the `training_data`. Those flags are `False` by default, but they've been turned on here in order to monitor our `Network`'s performance. Furthermore, `network2.py`'s `Network.SGD` method returns a four-element tuple representing the results of the monitoring. We can use this as follows:

```python
>>> evaluation_cost, evaluation_accuracy, \
... training_cost, training_accuracy = net.SGD(training_data, 30, 10, 0.5,
... lmbda = 5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True,
... monitor_evaluation_cost=True,
... monitor_training_accuracy=True,
... monitor_training_cost=True)
```

So, for example, `evaluation_cost` will be a 30-element list containing the cost on the evaluation data at the end of each epoch. This sort of information is extremely useful in understanding a network's behaviour. It can, for example, be used to draw graphs showing how the network learns over time. Indeed, that's exactly how I constructed all the graphs earlier in the chapter. Note, however, that if any of the monitoring flags are not set, then the corresponding element in the tuple will be the empty list.
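For instance, with the monitoring lists in hand it's a one-liner to pick out the best epoch (a standalone sketch with made-up accuracy numbers, since the real lists require a full training run):

```python
# Hypothetical per-epoch accuracies, as might appear in
# evaluation_accuracy: counts of correctly classified images
# out of 10,000 validation images.
evaluation_accuracy = [9212, 9350, 9411, 9480, 9465, 9502, 9497]

# Index of the epoch with the highest validation accuracy.
best_epoch = max(range(len(evaluation_accuracy)),
                 key=lambda j: evaluation_accuracy[j])
best_accuracy = evaluation_accuracy[best_epoch] / 10000.0
```

This kind of per-epoch bookkeeping is also what you'd feed to a plotting library to reproduce the learning curves shown earlier in the chapter.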

Other additions to the code include a `Network.save` method, to save `Network` objects to disk, and a function to `load` them back in again later. Note that the saving and loading is done using JSON, not Python's `pickle` or `cPickle` modules, which are the usual way we save and load objects to and from disk in Python. Using JSON requires more code than `pickle` or `cPickle` would. To understand why I've used JSON, imagine that at some time in the future we decided to change our `Network` class to allow neurons other than sigmoid neurons. To implement that change we'd most likely change the attributes defined in the `Network.__init__` method. If we've simply pickled the objects that would cause our `load` function to fail. Using JSON to do the serialization explicitly makes it easy to ensure that old `Network`s will still `load`.
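The key trick in those methods is that `tolist()` converts a Numpy array into nested Python lists, which JSON can encode, and `np.array` rebuilds the array on the way back in. In miniature (a standalone sketch, not the actual `save` and `load` functions):

```python
import json
import numpy as np

w = np.random.randn(3, 2)  # a small weight matrix

# Serialize: tolist() turns the array into nested Python lists,
# which json can encode; the cost class is recorded by name.
blob = json.dumps({"weights": w.tolist(), "cost": "CrossEntropyCost"})

# Deserialize: rebuild the array from the nested lists.
data = json.loads(blob)
w_restored = np.array(data["weights"])

# JSON round-trips Python floats exactly, so nothing is lost.
round_trip_ok = np.array_equal(w, w_restored)
```

The cost class is stored as a string and looked up by name at load time, which is exactly why old saved networks keep loading even if the `Network` class itself later grows new attributes.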

There are many other minor changes in the code for `network2.py`, but they're all simple variations on `network.py`. The net result is to expand our 74-line program to a far more capable 152 lines.

## Problems

- Modify the code above to implement L1 regularization, and use L1 regularization to classify MNIST digits using a \(30\) hidden neuron network. Can you find a regularization parameter that enables you to do better than running unregularized?
- Take a look at the `Network.cost_derivative` method in `network.py`. That method was written for the quadratic cost. How would you rewrite the method for the cross-entropy cost? Can you think of a problem that might arise in the cross-entropy version? In `network2.py` we've eliminated the `Network.cost_derivative` method entirely, instead incorporating its functionality into the `CrossEntropyCost.delta` method. How does this solve the problem you've just identified?