# 6.3: The code for our convolutional networks

- Page ID
- 3775

Alright, let's take a look at the code for our program, `network3.py`. Structurally, it's similar to `network2.py`, the program we developed in Chapter 3, although the details differ, due to the use of Theano. We'll start by looking at the `FullyConnectedLayer` class, which is similar to the layers studied earlier in the book. Here's the code (discussion below)**Note added November 2016: several readers have noted that in the line initializing `self.w`, I set `scale=np.sqrt(1.0/n_out)`, when the arguments of Chapter 3 suggest a better initialization may be `scale``=np.sqrt(1.0/n_in)`. This was simply a mistake on my part. In an ideal world I'd rerun all the examples in this chapter with the correct code. Still, I've moved on to other projects, so am going to let the error go.:

classFullyConnectedLayer(object):def__init__(self, n_in, n_out, activation_fn=sigmoid, p_dropout=0.0): self.n_in = n_in self.n_out = n_out self.activation_fn = activation_fn self.p_dropout = p_dropout# Initialize weights and biasesself.w = theano.shared( np.asarray( np.random.normal( loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)), dtype=theano.config.floatX), name='w', borrow=True) self.b = theano.shared( np.asarray(np.random.normal(loc=0.0, scale=1.0, size=(n_out,)), dtype=theano.config.floatX), name='b', borrow=True) self.params = [self.w, self.b]defset_inpt(self, inpt, inpt_dropout, mini_batch_size): self.inpt = inpt.reshape((mini_batch_size, self.n_in)) self.output = self.activation_fn( (1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b) self.y_out = T.argmax(self.output, axis=1) self.inpt_dropout = dropout_layer( inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout) self.output_dropout = self.activation_fn( T.dot(self.inpt_dropout, self.w) + self.b)defaccuracy(self, y): "Return the accuracy for the mini-batch."returnT.mean(T.eq(y, self.y_out))

Much of the `__init__` method is self-explanatory, but a few remarks may help clarify the code. As per usual, we randomly initialize the weights and biases as normal random variables with suitable standard deviations. The lines doing this look a little forbidding. However, most of the complication is just loading the weights and biases into what Theano calls shared variables. This ensures that these variables can be processed on the GPU, if one is available. We won't get too much into the details of this. If you're interested, you can dig into the Theano documentation. Note also that this weight and bias initialization is designed for the sigmoid activation function (as discussed earlier). Ideally, we'd initialize the weights and biases somewhat differently for activation functions such as the tanh and rectified linear function. This is discussed further in problems below. The `__init__` method finishes with `self.params = [self.w, self.b]`. This is a handy way to bundle up all the learnable parameters associated to the layer. Later on, the `Network.SGD` method will use `params` attributes to figure out what variables in a `Network `instance can learn.

The `set_inpt` method is used to set the input to the layer, and to compute the corresponding output. I use the name `inpt` rather than `input` because `input` is a built-in function in Python, and messing with built-ins tends to cause unpredictable behavior and difficult-to-diagnose bugs. Note that we actually set the input in two separate ways: as `self.inpt` and `self.inpt_dropout`. This is done because during training we may want to use dropout. If that's the case then we want to remove a fraction `self.p_dropout` of the neurons. That's what the function `dropout_layer` in the second-last line of the `set_inpt` method is doing. So `self.inpt_dropout` and`self.output_dropout` are used during training, while `self.inpt` and `self.output` are used for all other purposes, e.g., evaluating accuracy on the validation and test data.

The `ConvPoolLayer` and `SoftmaxLayer` class definitions are similar to `FullyConnectedLayer`. Indeed, they're so close that I won't excerpt the code here. If you're interested you can look at the full listing for `network3.py`, later in this section.

However, a couple of minor differences of detail are worth mentioning. Most obviously, in both `ConvPoolLayer` and `SoftmaxLayer`we compute the output activations in the way appropriate to that layer type. Fortunately, Theano makes that easy, providing built-in operations to compute convolutions, max-pooling, and the softmax function.

Less obviously, when we introduced the softmax layer, we never discussed how to initialize the weights and biases. Elsewhere we've argued that for sigmoid layers we should initialize the weights using suitably parameterized normal random variables. But that heuristic argument was specific to sigmoid neurons (and, with some amendment, to tanh neurons). However, there's no particular reason the argument should apply to softmax layers. So there's no *a priori* reason to apply that initialization again. Rather than do that, I shall initialize all the weights and biases to be 00. This is a rather *ad hoc* procedure, but works well enough in practice.

Okay, we've looked at all the layer classes. What about the `Network`class? Let's start by looking at the `__init__` method:

classNetwork(object):def__init__(self, layers, mini_batch_size):"""Takes a list of `layers`, describing the network architecture, anda value for the `mini_batch_size` to be used during trainingby stochastic gradient descent."""self.layers = layers self.mini_batch_size = mini_batch_size self.params = [paramforlayerinself.layersforparaminlayer.params] self.x = T.matrix("x") self.y = T.ivector("y") init_layer = self.layers[0] init_layer.set_inpt(self.x, self.x, self.mini_batch_size)forjinxrange(1, len(self.layers)): prev_layer, layer = self.layers[j-1], self.layers[j] layer.set_inpt( prev_layer.output, prev_layer.output_dropout, self.mini_batch_size) self.output = self.layers[-1].output self.output_dropout = self.layers[-1].output_dropout

Most of this is self-explanatory, or nearly so. The line `self.params = [param for layer in ...]` bundles up the parameters for each layer into a single list. As anticipated above, the `Network.SGD` method will use `self.params` to figure out what variables in the `Network` can learn. The lines `self.x = T.matrix("x")` and `self.y = T.ivector("y")` define Theano symbolic variables named `x` and `y`. These will be used to represent the input and desired output from the network.

Now, this isn't a Theano tutorial, and so we won't get too deeply into what it means that these are symbolic variables*

*The Theano documentation provides a good introduction to Theano. And if you get stuck, you may find it helpful to look at one of the other tutorials available online. For instance, this tutorial covers many basics.. But the rough idea is that these represent mathematical variables, *not* explicit values. We can do all the usual things one would do with such variables: add, subtract, and multiply them, apply functions, and so on. Indeed, Theano provides many ways of manipulating such symbolic variables, doing things like convolutions, max-pooling, and so on. But the big win is the ability to do fast symbolic differentiation, using a very general form of the backpropagation algorithm. This is extremely useful for applying stochastic gradient descent to a wide variety of network architectures. In particular, the next few lines of code define symbolic outputs from the network. We start by setting the input to the initial layer, with the line

init_layer.set_inpt(self.x, self.x, self.mini_batch_size)

Note that the inputs are set one mini-batch at a time, which is why the mini-batch size is there. Note also that we pass the input `self.x`in twice: this is because we may use the network in two different ways (with or without dropout). The `for` loop then propagates the symbolic variable `self.x` forward through the layers of the `Network`. This allows us to define the final `output` and `output_``dropout `attributes, which symbolically represent the output from the `Network`.

Now that we've understood how a `Network` is initialized, let's look at how it is trained, using the `SGD` method. The code looks lengthy, but its structure is actually rather simple. Explanatory comments after the code.

defSGD(self, training_data, epochs, mini_batch_size, eta, validation_data, test_data, lmbda=0.0):"""Train the network using mini-batch stochastic gradient descent."""training_x, training_y = training_data validation_x, validation_y = validation_data test_x, test_y = test_data# compute number of minibatches for training, validation and testingnum_training_batches = size(training_data)/mini_batch_size num_validation_batches = size(validation_data)/mini_batch_size num_test_batches = size(test_data)/mini_batch_size# define the (regularized) cost function, symbolic gradients, and updatesl2_norm_squared = sum([(layer.w**2).sum()forlayerinself.layers]) cost = self.layers[-1].cost(self)+\ 0.5*lmbda*l2_norm_squared/num_training_batches grads = T.grad(cost, self.params) updates = [(param, param-eta*grad)forparam, gradinzip(self.params, grads)]# define functions to train a mini-batch, and to compute the# accuracy in validation and test mini-batches.i = T.lscalar()# mini-batch indextrain_mb = theano.function( [i], cost, updates=updates, givens={ self.x: training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size], self.y: training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size] }) validate_mb_accuracy = theano.function( [i], self.layers[-1].accuracy(self.y), givens={ self.x: validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size], self.y: validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size] }) test_mb_accuracy = theano.function( [i], self.layers[-1].accuracy(self.y), givens={ self.x: test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size], self.y: test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size] }) self.test_mb_predictions = theano.function( [i], self.layers[-1].y_out, givens={ self.x: test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size] })# Do the actual trainingbest_validation_accuracy = 0.0forepochinxrange(epochs):forminibatch_indexinxrange(num_training_batches): iteration = num_training_batches*epoch+minibatch_indexifiterationif(iteration+1) validation_accuracy = np.mean( [validate_mb_accuracy(j)forjinxrange(num_validation_batches)])ifvalidation_accuracy >= best_validation_accuracy:iftest_data: test_accuracy = np.mean( [test_mb_accuracy(j)forjinxrange(num_test_batches)])

The first few lines are straightforward, separating the datasets into x and y components, and computing the number of mini-batches used in each dataset. The next few lines are more interesting, and show some of what makes Theano fun to work with. Let's explicitly excerpt the lines here:

# define the (regularized) cost function, symbolic gradients, and updatesl2_norm_squared = sum([(layer.w**2).sum()forlayerinself.layers]) cost = self.layers[-1].cost(self)+\ 0.5*lmbda*l2_norm_squared/num_training_batches grads = T.grad(cost, self.params) updates = [(param, param-eta*grad)forparam, gradinzip(self.params, grads)]

In these lines we symbolically set up the regularized log-likelihood cost function, compute the corresponding derivatives in the gradient function, as well as the corresponding parameter updates. Theano lets us achieve all of this in just these few lines. The only thing hidden is that computing the `cost` involves a call to the `cost `method for the output layer; that code is elsewhere in `network3.py`. But that code is short and simple, anyway. With all these things defined, the stage is set to define the `train_mb` function, a Theano symbolic function which uses the `updates` to update the `Network `parameters, given a mini-batch index. Similarly, `validate_mb_accuracy` and `test_mb_accuracy` compute the accuracy of the `Network` on any given mini-batch of validation or test data. By averaging over these functions, we will be able to compute accuracies on the entire validation and test data sets.

The remainder of the `SGD` method is self-explanatory - we simply iterate over the epochs, repeatedly training the network on mini-batches of training data, and computing the validation and test accuracies.

Okay, we've now understood the most important pieces of code in`network3.py`. Let's take a brief look at the entire program. You don't need to read through this in detail, but you may enjoy glancing over it, and perhaps diving down into any pieces that strike your fancy. The best way to really understand it is, of course, by modifying it, adding extra features, or refactoring anything you think could be done more elegantly. After the code, there are some problems which contain a few starter suggestions for things to do. Here's the code*

*Using Theano on a GPU can be a little tricky. In particular, it's easy to make the mistake of pulling data off the GPU, which can slow things down a lot. I've tried to avoid this. With that said, this code can certainly be sped up quite a bit further with careful optimization of Theano's configuration. See the Theano documentation for more details.:

"""network3.py~~~~~~~~~~~~~~A Theano-based program for training and running simple neuralnetworks.Supports several layer types (fully connected, convolutional, maxpooling, softmax), and activation functions (sigmoid, tanh, andrectified linear units, with more easily added).When run on a CPU, this program is much faster than network.py andnetwork2.py. However, unlike network.py and network2.py it can alsobe run on a GPU, which makes it faster still.Because the code is based on Theano, the code is different in manyways from network.py and network2.py. However, where possible I havetried to maintain consistency with the earlier programs. Inparticular, the API is similar to network2.py. Note that I havefocused on making the code simple, easily readable, and easilymodifiable. It is not optimized, and omits many desirable features.This program incorporates ideas from the Theano documentation onconvolutional neural nets (notably,http://deeplearning.net/tutorial/lenet.html ), from Misha Denil'simplementation of dropout (https://github.com/mdenil/dropout ), andfrom Chris Olah (http://colah.github.io ).Written for Theano 0.6 and 0.7, needs some changes for more recentversions of Theano."""#### Libraries# Standard libraryimportcPickleimportgzip# Third-party librariesimportnumpyasnpimporttheanoimporttheano.tensorasTfromtheano.tensor.nnetimportconvfromtheano.tensor.nnetimportsoftmaxfromtheano.tensorimportshared_randomstreamsfromtheano.tensor.signalimportdownsample# Activation functions for neuronsdeflinear(z):returnzdefReLU(z):returnT.maximum(0.0, z)fromtheano.tensor.nnetimportsigmoidfromtheano.tensorimporttanh#### ConstantsGPU = TrueifGPU:\nto set the GPU flag to False."try: theano.config.device = 'gpu'except:pass# it's already settheano.config.floatX = 'float32'else:\nthe GPU flag to True."#### Load the MNIST datadefload_data_shared(filename="../data/mnist.pkl.gz"): f = gzip.open(filename, 'rb') training_data, validation_data, test_data = cPickle.load(f) f.close()defshared(data):"""Place the data into shared variables. This allows Theano to copythe data to the GPU, if one is available."""shared_x = theano.shared( np.asarray(data[0], dtype=theano.config.floatX), borrow=True) shared_y = theano.shared( np.asarray(data[1], dtype=theano.config.floatX), borrow=True)returnshared_x, T.cast(shared_y, "int32")return[shared(training_data), shared(validation_data), shared(test_data)]#### Main class used to construct and train networksclassNetwork(object):def__init__(self, layers, mini_batch_size):"""Takes a list of `layers`, describing the network architecture, anda value for the `mini_batch_size` to be used during trainingby stochastic gradient descent."""self.layers = layers self.mini_batch_size = mini_batch_size self.params = [paramforlayerinself.layersforparaminlayer.params] self.x = T.matrix("x") self.y = T.ivector("y") init_layer = self.layers[0] init_layer.set_inpt(self.x, self.x, self.mini_batch_size)forjinxrange(1, len(self.layers)): prev_layer, layer = self.layers[j-1], self.layers[j] layer.set_inpt( prev_layer.output, prev_layer.output_dropout, self.mini_batch_size) self.output = self.layers[-1].output self.output_dropout = self.layers[-1].output_dropoutdefSGD(self, training_data, epochs, mini_batch_size, eta, validation_data, test_data, lmbda=0.0):"""Train the network using mini-batch stochastic gradient descent."""training_x, training_y = training_data validation_x, validation_y = validation_data test_x, test_y = test_data# compute number of minibatches for training, validation and testingnum_training_batches = size(training_data)/mini_batch_size num_validation_batches = size(validation_data)/mini_batch_size num_test_batches = size(test_data)/mini_batch_size# define the (regularized) cost function, symbolic gradients, and updatesl2_norm_squared = sum([(layer.w**2).sum()forlayerinself.layers]) cost = self.layers[-1].cost(self)+\ 0.5*lmbda*l2_norm_squared/num_training_batches grads = T.grad(cost, self.params) updates = [(param, param-eta*grad)forparam, gradinzip(self.params, grads)]# define functions to train a mini-batch, and to compute the# accuracy in validation and test mini-batches.i = T.lscalar()# mini-batch indextrain_mb = theano.function( [i], cost, updates=updates, givens={ self.x: training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size], self.y: training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size] }) validate_mb_accuracy = theano.function( [i], self.layers[-1].accuracy(self.y), givens={ self.x: validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size], self.y: validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size] }) test_mb_accuracy = theano.function( [i], self.layers[-1].accuracy(self.y), givens={ self.x: test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size], self.y: test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size] }) self.test_mb_predictions = theano.function( [i], self.layers[-1].y_out, givens={ self.x: test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size] })# Do the actual trainingbest_validation_accuracy = 0.0forepochinxrange(epochs):forminibatch_indexinxrange(num_training_batches): iteration = num_training_batches*epoch+minibatch_indexifiteration % 1000 == 0:if(iteration+1) % num_training_batches == 0: validation_accuracy = np.mean( [validate_mb_accuracy(j)forjinxrange(num_validation_batches)])ifvalidation_accuracy >= best_validation_accuracy:iftest_data: test_accuracy = np.mean( [test_mb_accuracy(j)forjinxrange(num_test_batches)])#### Define layer typesclassConvPoolLayer(object):"""Used to create a combination of a convolutional and a max-poolinglayer. A more sophisticated implementation would separate thetwo, but for our purposes we'll always use them together, and itsimplifies the code, so it makes sense to combine them."""def__init__(self, filter_shape, image_shape, poolsize=(2, 2), activation_fn=sigmoid):"""`filter_shape` is a tuple of length 4, whose entries are the numberof filters, the number of input feature maps, the filter height, and thefilter width.`image_shape` is a tuple of length 4, whose entries are themini-batch size, the number of input feature maps, the imageheight, and the image width.`poolsize` is a tuple of length 2, whose entries are the y andx pooling sizes."""self.filter_shape = filter_shape self.image_shape = image_shape self.poolsize = poolsize self.activation_fn=activation_fn# initialize weights and biasesn_out = (filter_shape[0]*np.prod(filter_shape[2:])/np.prod(poolsize)) self.w = theano.shared( np.asarray( np.random.normal(loc=0, scale=np.sqrt(1.0/n_out), size=filter_shape), dtype=theano.config.floatX), borrow=True) self.b = theano.shared( np.asarray( np.random.normal(loc=0, scale=1.0, size=(filter_shape[0],)), dtype=theano.config.floatX), borrow=True) self.params = [self.w, self.b]defset_inpt(self, inpt, inpt_dropout, mini_batch_size): self.inpt = inpt.reshape(self.image_shape) conv_out = conv.conv2d( input=self.inpt, filters=self.w, filter_shape=self.filter_shape, image_shape=self.image_shape) pooled_out = downsample.max_pool_2d( input=conv_out, ds=self.poolsize, ignore_border=True) self.output = self.activation_fn( pooled_out + self.b.dimshuffle('x', 0, 'x', 'x')) self.output_dropout = self.output# no dropout in the convolutional layersclassFullyConnectedLayer(object):def__init__(self, n_in, n_out, activation_fn=sigmoid, p_dropout=0.0): self.n_in = n_in self.n_out = n_out self.activation_fn = activation_fn self.p_dropout = p_dropout# Initialize weights and biasesself.w = theano.shared( np.asarray( np.random.normal( loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)), dtype=theano.config.floatX), name='w', borrow=True) self.b = theano.shared( np.asarray(np.random.normal(loc=0.0, scale=1.0, size=(n_out,)), dtype=theano.config.floatX), name='b', borrow=True) self.params = [self.w, self.b]defset_inpt(self, inpt, inpt_dropout, mini_batch_size): self.inpt = inpt.reshape((mini_batch_size, self.n_in)) self.output = self.activation_fn( (1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b) self.y_out = T.argmax(self.output, axis=1) self.inpt_dropout = dropout_layer( inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout) self.output_dropout = self.activation_fn( T.dot(self.inpt_dropout, self.w) + self.b)defaccuracy(self, y): "Return the accuracy for the mini-batch."returnT.mean(T.eq(y, self.y_out))classSoftmaxLayer(object):def__init__(self, n_in, n_out, p_dropout=0.0): self.n_in = n_in self.n_out = n_out self.p_dropout = p_dropout# Initialize weights and biasesself.w = theano.shared( np.zeros((n_in, n_out), dtype=theano.config.floatX), name='w', borrow=True) self.b = theano.shared( np.zeros((n_out,), dtype=theano.config.floatX), name='b', borrow=True) self.params = [self.w, self.b]defset_inpt(self, inpt, inpt_dropout, mini_batch_size): self.inpt = inpt.reshape((mini_batch_size, self.n_in)) self.output = softmax((1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b) self.y_out = T.argmax(self.output, axis=1) self.inpt_dropout = dropout_layer( inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout) self.output_dropout = softmax(T.dot(self.inpt_dropout, self.w) + self.b)defcost(self, net): "Return the log-likelihood cost."return-T.mean(T.log(self.output_dropout)[T.arange(net.y.shape[0]), net.y])defaccuracy(self, y): "Return the accuracy for the mini-batch."returnT.mean(T.eq(y, self.y_out))#### Miscellaneadefsize(data): "Return the size of the dataset `data`."returndata[0].get_value(borrow=True).shape[0]defdropout_layer(layer, p_dropout): srng = shared_randomstreams.RandomStreams( np.random.RandomState(0).randint(999999)) mask = srng.binomial(n=1, p=1-p_dropout, size=layer.shape)returnlayer*T.cast(mask, theano.config.floatX)

## Problems

- At present, the
`SGD`method requires the user to manually choose the number of epochs to train for. Earlier in the book we discussed an automated way of selecting the number of epochs to train for, known as early stopping. Modify`network3.py`to implement early stopping. - Add a
`Network`method to return the accuracy on an arbitrary data set. - Modify the
`SGD`method to allow the learning rate ηη to be a function of the epoch number.*Hint: After working on this problem for a while, you may find it useful to see the discussion at this link.* - Earlier in the chapter I described a technique for expanding the training data by applying (small) rotations, skewing, and translation. Modify
`network3.py`to incorporate all these techniques.*Note: Unless you have a tremendous amount of memory, it is not practical to explicitly generate the entire expanded data set. So you should consider alternate approaches.* - Add the ability to load and save networks to
`network3.py`. - A shortcoming of the current code is that it provides few diagnostic tools. Can you think of any diagnostics to add that would make it easier to understand to what extent a network is overfitting? Add them.
- We've used the same initialization procedure for rectified linear units as for sigmoid (and tanh) neurons. Our argument for that initialization was specific to the sigmoid function. Consider a network made entirely of rectified linear units (including outputs). Show that rescaling all the weights in the network by a constant factor \(c>0\) simply rescales the outputs by a factor \(c^{L−1}\), where \(L\) is the number of layers. How does this change if the final layer is a softmax? What do you think of using the sigmoid initialization procedure for the rectified linear units? Can you think of a better initialization procedure?
*Note: This is a very open-ended problem, not something with a simple self-contained answer. Still, considering the problem will help you better understand networks containing rectified linear units.* - Our analysis of the unstable gradient problem was for sigmoid neurons. How does the analysis change for networks made up of rectified linear units? Can you think of a good way of modifying such a network so it doesn't suffer from the unstable gradient problem?
*Note: The word good in the second part of this makes the problem a research problem. It's actually easy to think of ways of making such modifications. But I haven't investigated in enough depth to know of a really good technique.*