The recent history of deep neural networks features an impressive number of new methods and technological improvements to allow the training of deeper and more powerful networks.
The models themselves have had a reputation for being black boxes, and they still have that reputation. Neural networks are criticized for their lack of interpretability, a tradeoff that we accept because of their amazing performance on many tasks. Efforts have been made to identify the role played by each layer, but it can be hard to find a meaning to assign to individual layers.
There are good arguments to support the claim that the first layers of a convolutional network for image recognition contain filters that are relatively “general”, in the sense that they would work well even if we switched to an entirely different dataset of images. The last layers are specific to the dataset being used, and have to be retrained when using a different dataset. In Yosinski et al. (2014) the authors try to pinpoint the layer at which this transition occurs, but they show that the exact transition is spread across multiple layers.
In this paper, we introduce the concept of the linear classifier probe, referred to as a “probe” for short when the context is clear. We start from the concept of Shannon entropy, which is the classic way to describe the information content of a random variable. We then seek to apply that concept to understand the roles of the intermediate layers of a neural network, and to measure how much information is gained at every layer (answer: technically, none). We show that entropy fails to apply here, and so we propose an alternative framework for asking the same question. This time around, we ask what the performance of an optimal linear classifier would be if it were trained on the inputs of a given layer from our model. We demonstrate how this powerful concept can be very useful for understanding the dynamics involved in a deep neural network during and after training.
2 Information theory
In section 2.1 we do a quick review of the basic concept of Shannon entropy. In section 2.2 we then give Proposition 1, which demonstrates how entropy fails to capture the essence of many interesting questions regarding neural networks:
- What happens when we add more layers?
- Where does information flow in a neural network with multiple branches?
- Does having multiple auxiliary losses help? (e.g. the Inception model)
This paves the way for our concept of the linear classifier probe that we introduce in section 3.1.
2.1 Basics of Shannon entropy
It was a great discovery when Claude Shannon repurposed the notion of entropy to represent information content in a formal way. It laid the foundations for the discipline of information theory. We refer the reader to the first chapters of MacKay (2003) for a good exposition on the matter.
One of the particularly interesting aspects of information theory is the ability to quantify in a useful manner the amount of uncertainty left in a discrete random variable X. The entropy is defined as

    H(X) = -\sum_{x} p(x) \log p(x),

and it is always non-negative. It is equal to zero if and only if X has all its probability mass at a single value. That is, the entropy is zero only when X is a deterministic value, when there is no uncertainty left in X.
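As a concrete illustration, the entropy of a discrete distribution can be computed in a few lines. This is a minimal sketch in Python; the function name and the choice of bits as the default unit are our own:

```python
import numpy as np

def entropy(p, base=2.0):
    """Shannon entropy H(X) = -sum_x p(x) log p(x), in bits by default.
    Terms with p(x) = 0 contribute nothing (using the convention 0 log 0 = 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability outcomes
    return float(-np.sum(p * np.log(p)) / np.log(base))

# A fair coin carries one bit of uncertainty; a deterministic
# variable carries none.
print(entropy([0.5, 0.5]))  # 1.0
print(entropy([1.0, 0.0]))  # 0.0
```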
Part of the genius of the notion of entropy is that it distills the essence of information down to a quantity that does not depend on the particular representation.
One can also define the conditional entropy of Y given X as

    H(Y|X) = -\sum_{x, y} p(x, y) \log p(y|x).
Now, those formulas can be translated directly to the case of a continuous random variable, using integrals instead of summations. However, some of the nice properties no longer hold. In the continuous setting, the entropy is no longer scale-invariant, nor is it even guaranteed to be non-negative (a good criticism of this can be found in Marsh (2013)).
Despite this, it still makes sense to compare the conditional entropy H(Y|X) to the conditional entropy H(Y|Z), for random variables X and Z. Moreover, relative entropy, in the form of the KL-divergence, is also very useful when dealing with continuous variables.
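For discrete distributions, the KL-divergence mentioned here is equally simple to compute. A minimal sketch; the function name is our own, and we assume q puts mass wherever p does:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)), in nats.
    Assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# KL is zero iff the distributions match, and it is not symmetric.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
```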
2.2 Entropy with intermediate layers of a neural network
Question 1. Given an MLP with data X and labels Y, do any of the intermediate layers contain more information than X about Y?

To rephrase this question in terms of entropy, we are asking whether the conditional entropy H(Y|Z_k) is ever smaller than H(Y|X), where Z_k refers to any intermediate layer of the MLP.
The answer to that is No.
We present the following proposition which can be applied directly to an MLP to settle this case. It applies also to an MLP where certain layers have a component of randomness (e.g. Dropout during training).
Proposition 1. Let X, Z_1, \ldots, Z_K be a set of continuous (or discrete) random variables such that their joint distribution factors according to the graphical model X \to Z_1 \to Z_2 \to \cdots \to Z_K. Then the conditional entropies of X given each Z_k are ordered as follows:

    H(X|Z_1) \le H(X|Z_2) \le \cdots \le H(X|Z_K).
Going deeper into the multi-layer neural network, we are pretty much only destroying information. We start with the raw data X, and from there it is all downhill. If X contains an image of the savannah, and Y refers to whether it contains a lion or not, then none of the subsequent layers are truly more informative than X itself. Refer to figure 1.
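The ordering of conditional entropies can be checked numerically on a tiny deterministic chain. This is a hypothetical sketch of our own where the first layer keeps only the parity of X and the second layer keeps nothing:

```python
import numpy as np

def cond_entropy_bits(joint):
    """H(X | Z) in bits, from a joint table p(x, z) with x on the rows."""
    pz = joint.sum(axis=0)
    h = 0.0
    for j, pzj in enumerate(pz):
        if pzj > 0:
            px_given_z = joint[:, j] / pzj
            px_given_z = px_given_z[px_given_z > 0]
            h -= pzj * np.sum(px_given_z * np.log2(px_given_z))
    return float(h)

# X uniform on {0,1,2,3}; Z1 = X mod 2; Z2 = constant.
# This forms a chain X -> Z1 -> Z2: each step can only discard information.
px = np.full(4, 0.25)
joint_xz1 = np.zeros((4, 2))
for x in range(4):
    joint_xz1[x, x % 2] = px[x]
joint_xz2 = px.reshape(4, 1)  # Z2 carries nothing about X

print(cond_entropy_bits(joint_xz1))  # 1.0 bit of uncertainty left after Z1
print(cond_entropy_bits(joint_xz2))  # 2.0 bits left after Z2
```

As expected, the uncertainty about X can only grow (or stay equal) as we move down the chain.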
3 Linear classifier probes
In section 3.1 we present the main concept from this paper. We illustrate the concept in section 3.2. We then present a basic experiment in section 3.3. In section 3.4 we modify a very deep network in two different ways and we show how probes allow us to visualize the (sometimes disastrous) consequences of our design choices.
Despite what we highlighted in section 2, there is indeed a good reason to use many deterministic layers: they perform useful transformations to the data with the goal of ultimately fitting a linear classifier at the very end. That is the purpose of the many layers. They are a tool to massage the data into a form that can be fed to a boring linear classifier.
With this in mind, it is natural to ask if that transformation is sudden or progressive, and whether the intermediate layers already have a representation that is immediately useful to a linear classifier. We refer the reader to figure 2 for a little diagram of probes being inserted in the usual deep neural network.
The conceptual framework that we propose is one where the intuitive notion of information is equated with immediate suitability for a linear classifier (instead of being related to entropy). It is with that notion that we study multiple scenarios in sections 3.2, 3.3 and 3.4.
Just to be absolutely clear about what we call a linear classifier, we mean a function

    f(h) = \mathrm{softmax}(W h + b),

where h are the features of some hidden layer, the outputs live in the space of one-hot encodings of the target classes, and (W, b) are the probe weights and biases to be learned so as to minimize the usual cross-entropy loss.
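A minimal sketch of such a probe in Python; the dimensions and the name linear_probe are our own illustrative choices:

```python
import numpy as np

def linear_probe(h, W, b):
    """f(h) = softmax(W h + b): maps hidden features to class probabilities."""
    logits = W @ h + b
    logits -= logits.max()          # subtract the max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_classes = 16, 10               # hypothetical feature and class counts
h = rng.standard_normal(d)          # stand-in for a hidden layer's activations
W = rng.standard_normal((n_classes, d)) * 0.1
b = np.zeros(n_classes)
p = linear_probe(h, W, b)
print(p.sum())                      # the probabilities sum to 1
```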
Over the course of training a model, the parameters of the model change. However, probes only make sense when we refer to a given training step. We can talk about the probes at iteration n of training, when the model parameters are θ_n. These parameters are not affected by the probes. We prevent backpropagation through the model either by stopping the gradient flow (done with tf.stop_gradient in TensorFlow), or simply by specifying that the only variables to be updated are the probe parameters, while we keep θ_n frozen.
We train the probe parameters up to convergence (or with early stopping). The concept of the probe makes sense only when we assume that it is the optimal possible linear classifier (or very close to that in practice).
Note that training those probes represents a convex optimization problem. In practice, this does not mean that it is a trivial thing to accomplish, but it is reassuring, because it means that probes taken at iteration n could be used as initialization for probes at iteration n+1, if we were not concerned about overfitting.
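To make the setup concrete, here is a hedged numpy sketch of training a probe by plain gradient descent on frozen features. The synthetic features, learning rate, and iteration count are our own choices; treating the features as constants plays the role of tf.stop_gradient, since no gradient ever reaches the model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "hidden layer" features for a 3-class toy problem: the model's
# parameters never appear below, so only the probe (W, b) gets updated.
n, d, k = 600, 8, 3
labels = rng.integers(0, k, size=n)
class_means = rng.standard_normal((k, d))
features = rng.standard_normal((n, d)) + 2.0 * np.eye(k)[labels] @ class_means

W = np.zeros((k, d))
b = np.zeros(k)
lr = 0.5
for _ in range(300):                        # plain batch gradient descent
    logits = features @ W.T + b
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(n), labels] -= 1.0          # dL/dlogits for cross-entropy
    W -= lr * (p.T @ features) / n
    b -= lr * p.mean(axis=0)

acc = (np.argmax(features @ W.T + b, axis=1) == labels).mean()
print(acc)  # well above the 1/3 chance level on this separable-ish toy data
```

Because the loss is convex in (W, b), any reasonable first-order method converges to the same optimum here.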
3.2 Probes on untrained model
We start with a toy example to illustrate what kind of plots we expect from probes. We use a 32-layer MLP with 128 hidden units per layer. All the layers are fully-connected and we use leaky ReLU activations.
We run the same experiment multiple times, with a different toy dataset each time. The goal is to use a data distribution where the inputs X are drawn randomly and the labels Y are linearly separable (i.e. super easy to classify with a one-layer neural network). To do this, we just pick a separating vector w for each experiment, and let the label be the sign of w^T x.
We initialize this 32-layer MLP using glorot_normal initialization, we do not perform any training on the model, and we add one probe at every layer. We optimize the probes with RMSProp and a sufficiently small learning rate.
In figure 3, we show the prediction error rate for every probe, averaged over the experiments. The graph includes a probe directly on the inputs X, where we naturally have an error rate that is essentially zero (as expected given the way we constructed our data), and which serves as a kind of sanity check. Given that we have only two possible labels, we also show a dotted horizontal line at 0.50, which is essentially the prediction error that we would get by flipping a coin. We can see that the prediction error rate climbs up towards 0.50 as we go deeper into the MLP (with untrained parameters).
This illustrates the idea that the input signal is getting mangled by the successive layers, so much so that it becomes rather useless by the time we reach the final layer. We checked the mean activation norm of the hidden units at layer 32 to be sure that numerical underflow was not the cause for the degradation.
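The toy experiment can be reproduced in spirit with a short numpy script. This is a sketch under our own assumptions: a leak coefficient of 0.5, a closed-form least-squares probe instead of one trained with RMSProp, and our own sample counts:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, alpha=0.5):          # the slope for x < 0 is an assumption
    return np.where(x > 0, x, alpha * x)

def glorot(fan_in, fan_out):
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / (fan_in + fan_out))

def probe_error(H, y, n_train=1000):
    """Closed-form least-squares linear probe; report held-out error rate."""
    A = np.hstack([H, np.ones((len(H), 1))])          # append a bias column
    coef, *_ = np.linalg.lstsq(A[:n_train], y[:n_train], rcond=None)
    pred = np.sign(A[n_train:] @ coef)
    return float(np.mean(pred != y[n_train:]))

# Linearly separable toy data: y = sign(w . x).
n, d, width, depth = 2000, 16, 128, 32
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)
y = np.sign(X @ w)

# Push the data through an untrained 32-layer MLP, probing every layer.
H = X
errors = [probe_error(H, y)]
layers = [glorot(d, width)] + [glorot(width, width) for _ in range(depth - 1)]
for Wl in layers:
    H = leaky_relu(H @ Wl)
    errors.append(probe_error(H, y))

print(errors[0], errors[-1])  # near zero at the input, worse after 32 random layers
```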
One of the popular explanations for training difficulties in very deep models is that of the exploding/vanishing gradient (Hochreiter, 1991; Bengio et al., 1993). Here we would like to offer another, complementary explanation, based on the observations from figure 3. That is, at the beginning of training, the usefulness of layers decays with depth, to the point where the deepest layers are utterly useless. The values contained in the last layer are then used by the final softmax classifier, and the loss backpropagates the values of the derivatives. Since those derivatives are based on garbage activations, the backpropagated quantities are also garbage, which means that all the weights are going to be updated based on garbage. The weights stay bad, and we fail to train the model. We like to refer to this phenomenon as garbage forwardprop, garbage backprop, in reference to the popular concept of garbage in, garbage out in computer science.
The works of Pascanu et al. (2013) and Montufar et al. (2014) explore the possibility of counting the exponentially-many linear regions of input space defined by the composition of many sequential layers in a deep neural network. In a way, the ability of an untrained neural network to “fold” a nice manifold into something unusable should not come as a surprise. Just a few randomly-initialized layers are sufficient to ruin a linearly-separable dataset. We leave the idea of adversarially-tuned layers for another day.
3.3 Probes on MNIST convnet
In this section we run the MNIST convolutional model provided by the tensorflow github repo (tensorflow/models/image/mnist/convolutional.py). We selected that model for reproducibility and to demonstrate how easily one can peek into popular models by using probes.
We start by sketching the model in figure 4. We report the results at the beginning and the end of training in figure 5. One of the interesting dynamics to be observed there is how useful the first layers are, despite the fact that the model is completely untrained.
The probes are prevented from overfitting the training data. We can see that, at the beginning of training (on the left), the randomly-initialized layers were still providing useful transformations: the test prediction error goes from 8% to 2% simply by using those random features. The biggest impact comes from the first ReLU. At the end of training (on the right), the test prediction error improves at every layer (with the exception of a minor kink on fc1_preact).
3.4 Auxiliary loss branches and skip connections
Here we investigate two ways to modify a deep model in order to facilitate training. Our goal is not to convince the reader that they should implement these suggestions in their own models. Rather, we want to demonstrate the usefulness of the linear classifier probes as a way to better understand what is happening in their deep networks.
In both cases we are going to use a toy model with 128 fully-connected layers with 128 hidden units in each layer. We train on MNIST, and we use Glorot initialization along with leaky ReLUs.
We chose this model because we wanted a pathologically deep model without getting bogged down in architecture details. The model is pathological in the sense that smaller models can easily be designed to achieve better performance, but also in the sense that the model is so deep that it is very hard to train with gradient descent methods. From our experiments, depth 64 was where things started to break down, hence the choice here of depth 128.
In the first scenario, we add one linear classifier every 16 layers. These classifiers contribute to the loss being minimized; they are not probes. This is very similar to what happens in the famous Inception model, where “auxiliary heads” are used (Szegedy et al., 2015). This is illustrated in figure 5(a), and it works nicely. The untrainable model is now made trainable through a judicious use of auxiliary classifier losses. The results are shown in figure 7.
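A hypothetical sketch of how such auxiliary heads enter the training loss; the logits below are random stand-ins and the weighting aux_weight is our own assumption. Unlike probes, the gradients of these terms reach the model:

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy loss from raw logits (rows) and integer labels."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(labels)), labels].mean())

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=32)                    # a toy batch
# One set of logits per head: 7 auxiliary heads plus the final classifier.
head_logits = [rng.standard_normal((32, 10)) for _ in range(8)]
aux_weight = 0.3                                         # assumed weighting

# The total loss is the final classifier's loss plus the weighted
# auxiliary losses; minimizing it trains all heads jointly.
total = cross_entropy(head_logits[-1], labels) \
      + aux_weight * sum(cross_entropy(lg, labels) for lg in head_logits[:-1])
print(total)
```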
In the second scenario, we look at adding a bridge (a skip connection) between layer 0 and layer 64. This means that the input features to layer 64 are obtained by concatenating the output of layer 63 with the features of layer 0. The idea here is that we might observe that the model would effectively train a submodel of depth 64, using the skip connection, and shift gears later to use the whole depth of 128 layers. This is illustrated in figure 5(b), and the results are shown in figure 8. It does not work as expected, but the failure of this approach is visualized very nicely with probes and serves as a great example of their usefulness in diagnosing problems with models.
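The concatenation described here is straightforward; a tiny sketch with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
h0 = rng.standard_normal((4, 128))    # layer-0 features (the raw input block)
h63 = rng.standard_normal((4, 128))   # output of layer 63

# The input to layer 64 is the concatenation of layer 63's output with
# the layer-0 features, so layer 64 sees the raw signal directly.
h64_in = np.concatenate([h63, h0], axis=1)
print(h64_in.shape)  # (4, 256)
```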
Firstly, at the beginning of training, we can see how the raw data is directly useful for linear classification, and how this usefulness degrades as more layers are added. In the case of the skip connection in figure 8, this has the effect of creating two bumps. This is because layer 64 also has the input data as a direct parent, so it can fit a probe to that signal.

Secondly, the classification error goes down from the first probe to the last probe. This is even more apparent in the full video (instead of the 3 frames provided here). There is a ripple effect, where the prediction error in figure 5(b) visually spreads from the left of the plot to the right. That is, during most of the training, the prediction error goes down for a parent layer before it comes down for the layer itself.
4 Discussion and future work
We presented results here for some toy models. While we think that this is sufficient to demonstrate the concept of the linear classifier probe and its usefulness, we really wanted to show the same kind of plots (and videos) on the Inception model running on the huge ImageNet dataset (~150GB).
After training the model for 2-3 weeks, saving a checkpoint every hour, we realized that training ~20 probes for every checkpoint was absolutely impossible. Realistically, given the size of the training set, it is even challenging to train a single probe on a single checkpoint.
Training the probes is equivalent to solving a convex optimization problem. This means that we can make use of many convex optimization algorithms that the deep learning community is not currently using to train deep networks themselves. We have not yet tried those methods, but we are looking at the Stochastic Average Gradient and Stochastic Variance-Reduced Gradient methods (Roux et al., 2012; Schmidt et al., 2013; Harikandeh et al., 2015).
In this paper we introduced the concept of the linear classifier probe as a conceptual tool to better understand the dynamics inside a neural network and the role played by the individual intermediate layers. We are now able to ask new questions and explore new areas. We have demonstrated how these probes can be used to identify certain problematic behaviors in models that might not be apparent when we traditionally have access to only the prediction loss and error.
We hope that the notions presented in this paper can contribute to the understanding of deep neural networks and guide the intuition of researchers that design them.
The authors would like to acknowledge the support of the following agencies for research funding and computing support: NSERC, FQRNT, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR.
- Bengio et al. (1993) Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages 1183–1188. IEEE.
- Harikandeh et al. (2015) Harikandeh, R., Ahmed, M. O., Virani, A., Schmidt, M., Konečnỳ, J., and Sallinen, S. (2015). Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pages 2251–2259.
- Hochreiter (1991) Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen netzen. Diploma, Technische Universität München, page 91.
- MacKay (2003) MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press.
- Marsh (2013) Marsh, C. (2013). Introduction to continuous entropy.
- Montufar et al. (2014) Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear regions of deep neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2924–2932. Curran Associates, Inc.
- Pascanu et al. (2013) Pascanu, R., Montufar, G., and Bengio, Y. (2013). On the number of response regions of deep feed forward networks with piece-wise linear activations. In International Conference on Learning Representations 2014 (ICLR 2014), Banff, Alberta, Canada.
- Roux, N. L., Schmidt, M., and Bach, F. R. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671.
- Schmidt et al. (2013) Schmidt, M., Roux, N. L., and Bach, F. (2013). Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388.
- Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9.
- Yosinski et al. (2014) Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328.