Deep neural networks (NNs) are one of the most successful machine learning tools, especially in the domain of supervised learning. For a long time, however, the successes of deep NNs were not well understood theoretically, and tools to understand the functionality or to interpret the results of a NN were lacking. Recent years have thus seen an increased effort in explaining NNs, leading to methods such as deconvolution, network dissection, sensitivity analysis, and layer-wise relevance propagation. Even more recently, information theory has been employed to open the black box of deep learning, mainly by investigating deep NNs from the perspective of the information bottleneck principle [21, 20, 1, 23].
Another – seemingly unrelated – recent trend is pruning, i.e., the removal of neurons from a large, trained NN. Many pruning methods have been presented for feed-forward and convolutional NNs, some of which we review in Section 2. The main motivation for pruning is a reduction of the computational complexity of inference; pruning is thus important in IoT applications where NNs must be implemented on devices with limited resources. Furthermore, pruning is applied in the following two scenarios: 1) In transfer learning, an existing NN trained on a large, general training set is fine-tuned to operate on more specific data. Pruning can thus reduce the computational complexity while maintaining state-of-the-art performance on the narrower classification task. 2) Training large NNs is simpler than training small NNs, but after training a large portion of the NN does not contribute to classification performance. Pruning admits removing this irrelevant portion of the NN.
A connection between the aim to understand NNs and pruning, also known as cumulative ablation, has been made recently in . There, the authors investigated the effect of individual neurons as well as groups of neurons on classification and generalization performance by ablation analysis, i.e., by removing the neurons from the network via setting their output to a constant. They evaluated the selectivity of each neuron, a quantity that measures how strongly the behavior of a neuron output to data samples from one class differs from the behavior for data samples from all other classes; neurons with high selectivity are sometimes called “cat neurons”. The authors observed that the selectivity of a neuron is not a good indicator of the effect this neuron has on classification performance and that, especially in shallow layers, selective neurons may even harm classification performance. They obtained similar results by replacing selectivity by mutual information.
Our work is in the spirit of , with a focus on feed-forward NNs. We propose information-theoretic quantities to measure variability, class selectivity, and class information of a neuron output (Section 4), and investigate how these quantities connect with classification performance when said neuron is ablated (Section 5). We show that neither class selectivity nor class information are good performance indicators when ablating neurons across layers, thus confirming the results in . However, by performing ablation analysis for each layer separately, we observe that 1) class information and class selectivity values differ greatly from layer to layer and 2) class information and class selectivity are good performance indicators in shallow and deep layers, respectively. This observation puts the results of  in a new light and resolves their counterintuitivity. We finally discuss implications of these results on pruning in Section 6. Specifically, we argue that pruning techniques based on importance measures for individual neurons should 1) be applied layer-wise rather than on the NN as a whole and 2) potentially benefit by using different importance measures in different layers. Moreover, we show that retraining after pruning (cf. [16, 13, 10]) can be replaced by a small surgery step without incurring severe performance degradation as long as the pruning is not too severe.
For our experiments, we trained a NN with two hidden layers on the MNIST dataset, which has evolved into a benchmark dataset for which the results are easy to understand intuitively. In our future investigations in this direction we will extend our results to include experiments with state-of-the-art convolutional neural networks and other regularization techniques.
Of course, quantities computed from individual neuron outputs are not capable of drawing a complete picture. In Section 7 we briefly discuss scenarios in which such a picture is greatly misleading and outline ideas how this shortcoming can be removed. Specifically, we believe that partial information decomposition [25, 18], a decomposition of mutual information into unique, redundant, and synergistic contributions, can be used to both extend this work and , as well as the works in the spirit of the information bottleneck principle [21, 20, 1].
2 Related Work
Using an information-theoretic perspective, the authors of  claimed that training a NN consists of two phases, characterized by layers learning about the class label and forgetting the input features, respectively. This work was critically reviewed in , where the authors discussed, among other things, the influence of the type of activation function on qualitative results. Using partial information decomposition, the authors of  discovered two distinct phases during training a NN with a single hidden layer, characterized by large amounts of redundant and unique information, respectively.
Recently, significant effort has been made to reduce the computational complexity of large NNs by reducing the degrees of freedom in weight matrices (e.g., by weight pruning or low-rank approximations) [24, 11], using binary or ternary weights instead of floating point weights [6, 19], and pruning neurons or filters from feed-forward and convolutional NNs, respectively [10, 22, 15, 13, 16].
For example, the authors of  proposed pruning neurons based on their output entropy or on the magnitude of incoming and outgoing weights. They achieved satisfactory performance only after retraining the NN. Retraining is also necessary in [13, 16], which suggest pruning filters from convolutional NNs. Rather than pruning neurons, the authors of [22, 15] suggest merging neurons that behave similarly in a well-defined sense; in order to account for the merging step, a surgery step to update the weight matrices of the retained neurons can be used instead of retraining.
3 Setup and Preliminaries
We consider the problem of classification via feed-forward NNs, i.e., of assigning a data sample $x$ to a class in a finite set $\mathcal{C}$. We assume that the parameters of the NN have been learned from labeled data. We moreover assume that we have access to a labeled validation set that was left out during training. We denote this dataset by $\mathcal{D} = \{(x_1, c_1), \dots, (x_N, c_N)\}$, in which $x_n$ is the $n$-th data sample and $c_n$ the corresponding class label. We assume throughout that $N \gg |\mathcal{C}|$.
Let $t_j^{(i)}(x)$ denote the output of the $j$-th neuron in the $i$-th layer of the NN if $x$ is the data sample at the input. With $w_{k,j}^{(i)}$ denoting the weight connecting the $k$-th neuron in the $(i-1)$-th layer to the $j$-th neuron in the $i$-th layer, $b_j^{(i)}$ denoting the bias term of the $j$-th neuron in the $i$-th layer, and $\sigma$ denoting an activation function, we obtain $t_j^{(i)}(x)$ by setting
$$t_j^{(i)}(x) = \sigma\Big(b_j^{(i)} + \sum_{k} w_{k,j}^{(i)}\, t_k^{(i-1)}(x)\Big) \qquad (1)$$
and by setting $t_j^{(0)}(x)$ to the $j$-th coordinate of $x$. The output of the network is a softmax layer with $|\mathcal{C}|$ neurons, each corresponding to one of the classes.
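The layer recursion described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the names `sigmoid`, `layer_outputs`, and `forward` are hypothetical, the weight layout `weights[k][j]` (from neuron `k` of the previous layer to neuron `j` of the current layer) is a convention chosen for this sketch, and the final softmax layer is omitted.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_outputs(prev, weights, biases, activation=sigmoid):
    """Outputs of one layer given the outputs `prev` of the previous layer.

    weights[k][j] is the weight from neuron k in the previous layer to
    neuron j in this layer (an assumed layout); biases[j] is the bias of
    neuron j.
    """
    return [
        activation(biases[j] + sum(weights[k][j] * prev[k] for k in range(len(prev))))
        for j in range(len(biases))
    ]


def forward(x, layers):
    """Return the outputs of every hidden layer for the input vector x.

    `layers` is a list of (weights, biases) pairs; layer 0 is the input itself.
    """
    t = list(x)
    hidden = []
    for weights, biases in layers:
        t = layer_outputs(t, weights, biases)
        hidden.append(t)
    return hidden
```

With all weights and biases set to zero, every sigmoid neuron outputs 0.5 regardless of the input, which is a quick sanity check for the recursion.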
We assume that the readers are familiar with information-theoretic quantities such as entropy, mutual information, and Kullback-Leibler (KL) divergence, cf. [7, Ch. 2]. To be able to use such quantities to measure the importance of individual neurons in the NN, we treat class labels, data samples, and neuron outputs as random variables (RVs). To this end, let $q$ be a quantizer that maps neuron outputs to a finite set $\mathcal{T}$. Now let $C$ be a RV over the set of classes $\mathcal{C}$ and $T_j^{(i)}$ a RV over $\mathcal{T}$, corresponding to the quantized output of the $j$-th neuron in the $i$-th layer. We define the joint distribution of $C$ and $T_j^{(i)}$ via the joint frequencies of $\big(c_n, q(t_j^{(i)}(x_n))\big)$ in the validation set, where $t_j^{(i)}(x_n)$ is the output of this neuron for the $n$-th of the $N$ validation samples and $c_n$ is the corresponding class label, i.e.,
$$P_{C,T_j^{(i)}}(c,t) = \frac{1}{N}\sum_{n=1}^{N} \mathbb{1}\big[c_n = c,\ q(t_j^{(i)}(x_n)) = t\big] \qquad (2)$$
where $\mathbb{1}[\cdot]$ is the indicator function. The assumptions that $N \gg |\mathcal{C}|$ and that $|\mathcal{T}|$ is small obviate the need for more sophisticated estimators for the distribution, such as Laplacian smoothing.
Designing the quantizers for estimating information-theoretic quantities is challenging in general (cf. recent discussions in [20, 21]). Nevertheless, for the task at hand this appears to be unproblematic: We observed in our experiments that using more than two quantization bins did not yield significantly different results (see Appendix 0.C); we therefore apply one-bit quantization, i.e., $|\mathcal{T}| = 2$, unless stated otherwise. Specifically, the quantizer threshold lies at 0.5 and 0 for sigmoid and ReLU activation functions, respectively.
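The frequency-based estimate of the joint distribution can be illustrated with a short sketch. The function names `quantize` and `joint_distribution` are hypothetical, and the choice to map values strictly above the threshold to bin 1 is an assumption of this sketch (the text only states where the threshold lies).

```python
from collections import Counter


def quantize(value, threshold=0.5):
    """One-bit quantizer; values strictly above the threshold map to bin 1 (assumed convention)."""
    return 1 if value > threshold else 0


def joint_distribution(outputs, labels, threshold=0.5):
    """Empirical joint distribution of (class label, quantized neuron output).

    outputs[n] is the neuron output for the n-th validation sample and
    labels[n] is its class label. Returns a dict {(c, t): probability}.
    """
    n = len(labels)
    counts = Counter((c, quantize(t, threshold)) for t, c in zip(outputs, labels))
    return {pair: count / n for pair, count in counts.items()}
```

For example, a sigmoid neuron with outputs `[0.9, 0.8, 0.1, 0.2]` on samples labeled `[0, 0, 1, 1]` yields probability 0.5 for each of the pairs `(0, 1)` and `(1, 0)`. For ReLU activations the threshold would be set to 0, as stated above.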
4 Information-Theoretic Neuron Importance Functions
In this section we propose information-theoretic quantities as importance measures for neurons in a NN; as we show in Appendix 0.B.4, each of these measures can be computed from the validation set with a complexity of . A selection of additional information-theoretic importance functions is available in Appendix 0.B, together with a discussion of relations among them. It is worth mentioning that information-theoretic measures of neuron redundancy, i.e., of possibly nonlinear correlations in neuron outputs of a certain layer, may yield additional insight (see Section 7). Such an analysis, however, is outside the scope of this work and is thus deferred to future investigation. We consider a hypothetical classification task with classes in this section to explain the importance functions in the context of a NN.
4.1 Entropy
Entropy quantifies the uncertainty of a RV. In the context of a NN, the entropy
$$H(T_j^{(i)}) = -\sum_{t\in\mathcal{T}} P_{T_j^{(i)}}(t)\,\log P_{T_j^{(i)}}(t) \qquad (3)$$
of the quantized output $T_j^{(i)}$ of the $j$-th neuron in the $i$-th layer has been proposed as an importance function for pruning in  (for one-bit quantization). Specifically, the entropy indicates if the neuron output varies enough to fall into different quantization bins for different data samples. In our hypothetical classification task, a neuron will be assigned maximum importance if half of the data samples cause the neuron output to fall into one quantization bin and the other data samples cause it to fall into the other quantization bin. In contrast, a neuron for which the outputs for all data samples fall in the same quantization bin will have least importance, corresponding to zero entropy $H(T_j^{(i)}) = 0$. Assuming sigmoid activation functions and saturated neuron outputs, the former case corresponds to each saturation region being active for half of the data samples, while the latter case corresponds to only one saturation region being active. In the latter case, the neuron is uninformative about the data sample and the class.
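As a minimal sketch of how the entropy of a quantized neuron output could be estimated from samples, consider the following. The function name `entropy_bits` is hypothetical, and the use of base-2 logarithms (entropy in bits) is an assumption of this sketch.

```python
import math
from collections import Counter


def entropy_bits(quantized):
    """Empirical entropy (in bits) of a sequence of quantized neuron outputs."""
    n = len(quantized)
    return -sum((k / n) * math.log2(k / n) for k in Counter(quantized).values())
```

A one-bit neuron whose output falls into each bin for half of the samples, e.g. `[0, 1, 0, 1]`, attains the maximum of one bit, whereas a neuron that always falls into the same bin has zero entropy.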
4.2 Mutual Information
While small or zero entropy of a neuron output suggests that it has little influence on classification performance, the converse is not true, i.e., a large value of $H(T_j^{(i)})$ does not imply that the neuron is important for classification. Indeed, the neuron may capture a highly varying feature of the input that is irrelevant for classification. As a remedy, we consider the mutual information between the quantized neuron output $T_j^{(i)}$ and the class variable $C$, i.e.,
$$I(T_j^{(i)};C) = H(T_j^{(i)}) - H(T_j^{(i)}|C). \qquad (4)$$
This quantity measures how the knowledge of $T_j^{(i)}$ helps us in predicting $C$ and appears in corresponding classification error bounds . In our hypothetical classification task with saturated sigmoid activation functions, a neuron will be assigned maximum importance if the neuron output is in each saturation region for half of the data samples (which maximizes the first term in (4)) such that the class label determines the saturation region (which minimizes the second term in (4)). In contrast, mutual information assigns the least importance to a neuron output that falls in different saturation regions independently of the class labels. In this case, knowing the value of $T_j^{(i)}$ does not help in predicting $C$. It can be shown that neurons with small $H(T_j^{(i)})$ also have small $I(T_j^{(i)};C)$, cf. [7, Th. 2.6.5].
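The decomposition of mutual information into an entropy and a conditional entropy can be sketched as follows. The helper names are hypothetical, and logarithms are taken to base 2 as an assumption of this sketch.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Empirical entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())


def mutual_information_bits(quantized, labels):
    """Empirical mutual information between quantized outputs and class labels.

    Computed as the entropy of the outputs minus the class-weighted average
    of the class-conditional entropies.
    """
    n = len(labels)
    conditional = 0.0
    for c in set(labels):
        subset = [t for t, lab in zip(quantized, labels) if lab == c]
        conditional += (len(subset) / n) * entropy_bits(subset)
    return entropy_bits(quantized) - conditional
```

With labels `[0, 0, 1, 1]`, the outputs `[1, 1, 0, 0]` (class determines the bin) give one bit, while the outputs `[0, 1, 0, 1]` (bin independent of the class) give zero, even though both neurons have the same entropy.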
4.3 Kullback-Leibler Selectivity
It has been observed that, especially at deeper layers, the activity of individual neurons admits distinguishing one class from all others. Mathematically, for such a neuron there exists a class $c$ such that the class-conditional distribution $P_{T_j^{(i)}|C=c}$ differs significantly from the marginal distribution $P_{T_j^{(i)}}$, i.e., the specific information $D\big(P_{T_j^{(i)}|C=c}\,\|\,P_{T_j^{(i)}}\big)$ (cf. ) is large, where $D(\cdot\|\cdot)$ denotes the KL divergence. Neurons with large specific information for at least one class may be useful for the classification task (see Section 5.3 below), but may nevertheless be characterized by low entropy $H(T_j^{(i)})$ and low mutual information $I(T_j^{(i)};C)$, especially if the number of classes is large. We therefore propose the maximum specific information over all classes as a measure of neuron importance:
$$\max_{c\in\mathcal{C}} D\big(P_{T_j^{(i)}|C=c}\,\|\,P_{T_j^{(i)}}\big). \qquad (5)$$
This quantity assigns high importance to cat neurons and can thus be seen as an information-theoretic counterpart of the selectivity measure used in . We thus call the quantity defined in (5) Kullback-Leibler selectivity. Specifically, KL selectivity is maximized if all data samples of a specific class label are mapped to one value of $T_j^{(i)}$ and all the other data samples (corresponding to other class labels) are mapped to other values of $T_j^{(i)}$. In this case, $T_j^{(i)}$ can be used to distinguish this class label from the rest. In contrast, KL selectivity is zero if and only if the mutual information is zero; in general, KL selectivity is an upper bound on mutual information (see Appendix 0.A).
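The maximum specific information can be estimated from quantized outputs and labels along the following lines. The function name `kl_selectivity_bits` is hypothetical, and the base-2 logarithm is an assumption of this sketch.

```python
import math
from collections import Counter


def kl_selectivity_bits(quantized, labels):
    """Maximum over classes of the KL divergence (in bits) between the
    class-conditional and the marginal distribution of the quantized output."""
    n = len(labels)
    marginal = {t: k / n for t, k in Counter(quantized).items()}
    best = 0.0
    for c in set(labels):
        subset = [t for t, lab in zip(quantized, labels) if lab == c]
        conditional = {t: k / len(subset) for t, k in Counter(subset).items()}
        divergence = sum(p * math.log2(p / marginal[t]) for t, p in conditional.items())
        best = max(best, divergence)
    return best
```

As a worked example, take four equally likely classes with labels `[0, 0, 1, 1, 2, 2, 3, 3]` and a neuron that is active only for class 0, i.e., quantized outputs `[1, 1, 0, 0, 0, 0, 0, 0]`. The specific information of class 0 is $\log_2(1/0.25) = 2$ bits, whereas the mutual information equals the entropy of a binary RV with probabilities 0.25 and 0.75, roughly 0.81 bits. This illustrates that a "cat neuron" can have large KL selectivity despite moderate mutual information.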
Of course, taking the maximum in (5) aggregates specific information values to a single number, thus losing information. The specific information spectrum, , may be a more relevant measure of neuron importance, especially when considering multiple neurons of the same layer simultaneously. While we do not follow this path in this work, we discuss this as a possible direction for future work in Section 7. Furthermore, note that the mutual information is conceptually similar to specific information and shares similar corner cases. We briefly discuss this quantity in Appendix 0.B.
5 Understanding Individual Neuron Importance via Cumulative Ablation
In this section, we connect the proposed information-theoretic measures of neuron importance to classification performance. To this end, suppose a trained NN and a validation set are given. We then compute the proposed importance functions (3), (4), and (5) for each neuron in the NN. This admits ranking the neurons of each layer or of the NN as a whole. Subsequently, we ablate the lowest- or the highest-ranking neurons and compute the classification error on the test dataset. We conclude that an information-theoretic importance measure is adequate if cumulatively ablating neurons with low (high) values leads to small (large) drops in classification performance. Adequate importance measures are not only relevant for understanding NNs, but may also be used to reduce the computational complexity required for inference in a NN by ablating neurons of low importance (cf. Section 6).
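The two ranking options mentioned above (per layer or across the whole NN) can be sketched as follows. The function names and the dictionary layout `importance[layer][neuron]` are hypothetical choices made for this illustration; the functions return the order in which neurons would be ablated when starting with the least important ones.

```python
def layer_wise_order(importance):
    """Rank neurons within each layer separately.

    importance maps a layer index to a list of importance values, one per
    neuron. Returns {layer: [neuron indices, lowest importance first]}.
    """
    return {
        layer: sorted(range(len(values)), key=values.__getitem__)
        for layer, values in importance.items()
    }


def whole_network_order(importance):
    """Rank all neurons of the NN jointly.

    Returns a list of (layer, neuron) pairs, lowest importance first.
    """
    flat = [
        (value, layer, neuron)
        for layer, values in importance.items()
        for neuron, value in enumerate(values)
    ]
    return [(layer, neuron) for _, layer, neuron in sorted(flat)]
```

If importance values are systematically smaller in one layer, the whole-network ranking places all neurons of that layer first, whereas the layer-wise ranking treats each layer on its own scale. Reversing the returned orders gives the ablation order for the highest-ranking neurons.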
To keep the analysis simple and intuitive, we demonstrate our ideas using the MNIST digits dataset. This dataset is divided into 60,000 training samples and 10,000 test samples. We further split off a portion of the training samples as a labeled validation set $\mathcal{D}$, from which we compute the importance functions from Section 4. With the remaining training samples we train a fully connected feed-forward NN with two hidden layers, 100 neurons each, and sigmoid activation functions. The data samples are 784-dimensional vectors, each entry assuming a grayscale value of a $28\times 28$ image. The network is trained in order to minimize cross-entropy with regularization. Cross-entropy is defined as
$$\mathcal{L}(\theta) = -\frac{1}{M}\sum_{m=1}^{M} \log \hat{y}_{c_m}(x_m;\theta)$$
where the sum runs over the $M$ training samples $(x_m, c_m)$, $\hat{y}_{c}(x;\theta)$ is the response of the NN to data sample $x$ in the $c$-th output neuron, and $\theta$ is the set of parameters of the NN. We train the NN using the ADAM algorithm  with a learning rate of and a batch size of . In order to get consistent results, we train 20 NNs with different random initializations, perform cumulative ablation separately, and present the averaged classification accuracies. The models are implemented in Python using the Keras library. (The source code of our experiments can be downloaded from https://email@example.com/raa2463/info_neuron_importance.git.)
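For illustration, the cross-entropy of a set of softmax outputs can be computed as in the following sketch. The function name `cross_entropy`, the averaging over samples, and the use of the natural logarithm are assumptions of this sketch; the regularization term is omitted.

```python
import math


def cross_entropy(probabilities, labels):
    """Average cross-entropy (natural logarithm) of softmax outputs.

    probabilities[n][c] is the network response in the c-th output neuron
    for the n-th sample; labels[n] is the true class of that sample.
    """
    return -sum(math.log(p[c]) for p, c in zip(probabilities, labels)) / len(labels)
```

A network that assigns probability one to the correct class for every sample attains zero cross-entropy, and the loss grows as less probability mass is placed on the correct class.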
One may ask to which value a neuron shall be ablated in order to minimize the effect on neurons in subsequent layers. Note that ablating the output $t_j^{(i)}$ of the $j$-th neuron in the $i$-th layer to a constant value is equivalent to removing said neuron and adapting the bias terms for neurons in the $(i+1)$-th layer. We consider two options for adapting the bias terms $b_k^{(i+1)}$: leaving the bias terms unchanged, or performing bias balancing by replacing $b_k^{(i+1)}$ by
$$b_k^{(i+1)} + w_{j,k}^{(i+1)}\,\frac{1}{N}\sum_{n=1}^{N} t_j^{(i)}(x_n),$$
where $w_{j,k}^{(i+1)}$ is the weight connecting the ablated neuron to the $k$-th neuron in the $(i+1)$-th layer and the average runs over the $N$ data samples $x_n$.
In ablation analysis , the former option is equivalent to assuming that the neuron output $t_j^{(i)}(x)$ is zero for all data samples, while bias balancing assumes that $t_j^{(i)}(x)$ is equal to its average value over all data samples. (Note that neither needs to be the case even if the ablated neuron has low information-theoretic importance. Indeed, even if $H(T_j^{(i)}) = 0$, we can only conclude that $t_j^{(i)}(x)$ is constant to within the resolution of the quantizer $q$.)
We show the effect of bias balancing in Fig. 1 for NNs with sigmoid activation functions, where either a random set of neurons or a set of neurons with low mutual information is ablated. It can be seen that bias balancing significantly improves performance. Similar observations were made for weight adaptation after neuron merging in [22, 15]. We believe that bias balancing, if done correctly, can partially obviate the need for retraining the NN after pruning, cf. [10, 13, 16]. In contrast, it was shown in [17, Appendix A] that bias balancing for ReLU activation functions leads to reduced performance (see Appendix 0.E). Since we focus on sigmoid activation functions, we assume bias balancing throughout the remainder of this work.
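Bias balancing as described above can be sketched as a small surgery step on the weights and biases of the following layer. The function name `ablate_with_bias_balancing` and the weight layout `weights_out[j][k]` (from neuron `j` of the current layer to neuron `k` of the next layer) are hypothetical choices for this illustration, not the authors' implementation.

```python
def ablate_with_bias_balancing(weights_out, biases_next, neuron_outputs, j):
    """Remove neuron j and fold its average contribution into the next layer's biases.

    weights_out[j][k]: weight from neuron j in this layer to neuron k in the
    next layer (assumed layout).
    biases_next[k]: bias of neuron k in the next layer.
    neuron_outputs: outputs of neuron j over a set of data samples.
    Returns the outgoing weight matrix without row j and the adapted biases.
    """
    mean_output = sum(neuron_outputs) / len(neuron_outputs)
    new_biases = [b + weights_out[j][k] * mean_output for k, b in enumerate(biases_next)]
    new_weights = [row for index, row in enumerate(weights_out) if index != j]
    return new_weights, new_biases
```

Leaving the biases unchanged corresponds to calling the same routine with `mean_output` forced to zero. If the ablated neuron was in fact constant, bias balancing leaves the inputs of the next layer unchanged.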
5.1 Dependence of Importance Measures on Layer Number
Fig. 2 shows the empirical distribution of neuron importance measures for different layers. It can be observed, for example, that both mutual information and KL selectivity are larger in the second layer than in the first. This is in agreement with observations in [17, Figs. A2 & A4.a], and with previous observations that features in shallow layers are general, i.e., not related to a specific class, whereas features in deeper layers are more and more class-specific.
The behavior of mutual information in Fig. 2(b) is in contrast with the behavior of the mutual information between the class and the complete layer, i.e., with $I(T_1^{(i)}, T_2^{(i)}, \dots; C)$. The data processing inequality (cf. [7, Th. 2.8.1]) dictates that this latter quantity should decrease towards deeper layers; proper training reduces this decrease, as empirically observed in [21, 20]. That the mutual information terms $I(T_j^{(i)};C)$, corresponding to the individual quantized outputs, in contrast increase towards deeper layers reflects that individual neurons gain importance relative to the collective of neurons in a given layer. In terms of partial information decomposition [25, 18], we expect that the information provided by neurons in deeper layers is mainly unique or redundant, while information provided by neurons in the first few layers is mainly synergistic.
Interestingly, entropy also increases from the first to the second layer. Neurons in the first layer are saturated more strongly and their output distribution is more heavily skewed, as discussed in Appendix 0.D, leading to degenerate distributions with small entropy values. In the second layer, the saturation effect is less pronounced, leading to larger entropy values. Even more interestingly, such an increase of entropy from the first to the second layer is not present for ReLU activation functions (see Appendix 0.D).
5.2 Whole-Network Cumulative Ablation Analysis
We next investigate the adequacy of the proposed importance measures. Specifically, we rank all neurons in the NN based on their information-theoretic importance measures and ablate those with lowest values (i.e., we perform cumulative ablation analysis across all layers simultaneously). The results are shown in Fig. 3. It can be seen that ablation according to small mutual information values, for example, performs worse than ablating nodes randomly. The same was observed in [17, Fig. A4.a], where the authors claimed that neurons with large mutual information have adverse effects on classification performance. A similar, although less pronounced situation appears for KL selectivity, again leading to the same conclusion as made in .
These observations can be explained as follows. With reference to Fig. 2(b), ablating neurons with lowest mutual information mostly ablates neurons in the first hidden layer. These neurons extract general features that are combined to class-specific features in the second hidden layer. By ablation, these generic features are removed, thus deeper layers are not able to extract class-specific features anymore and classification fails. The adequacy of the proposed importance measures thus cannot be evaluated by performing cumulative ablation analysis across all layers simultaneously.
5.3 Layer-Wise Cumulative Ablation Analysis
Since the conclusion that neurons with large mutual information adversely affect classification performance appears counterintuitive, we next perform ablation analysis in each layer separately. The results are shown in Fig. 4. First of all, it can be seen that ablating neurons in the first layer has stronger negative effects than ablating neurons in the second layer. Indeed, ablating up to 50 of the 100 neurons in the second hidden layer has negligible effect on classification performance. We believe that this is because the neurons in the second layer are highly redundant; the assumption that many of the neurons in the second layer are irrelevant for classification can be ruled out because of large mutual information and KL selectivity values (cf. Figs. 2(b) and 2(c)).
The entropy of neurons in the first layer seems to be uncorrelated with classification performance, while it seems to be mildly negatively correlated for the second layer. That is, it is advisable to ablate those neurons in the second layer that have the largest entropy, rather than those with the smallest entropy. This can be seen by comparing the curves for entropy in Figs. 4(b) and 4(d), respectively: Results for ablating high-entropy neurons were slightly better than those for low-entropy neurons. The effect is only small, however, if compared to ablating neurons randomly.
In contrast, it can be seen that ablating neurons with low (high) mutual information or KL selectivity leads to better (worse) performance than ablating neurons randomly. In the second layer, ablating neurons with the smallest KL selectivity values performs best, while ablating neurons with the largest mutual information values performs worst. In other words, it appears as if highly selective neurons are sufficient for good classification performance, while neurons with high mutual information are necessary – at least in this specific example; see also Section 7. This complements the discussion in [4, Sec. 3.2], mentioning that interpretability (which is related to selectivity) is neither necessary for nor a consequence of good classification performance. On the other hand, in light of our results, low selectivity is nevertheless a good indicator of the impact of ablation on classification performance. In summary, the situation in the second layer, although slightly more nuanced, is similar to the situation in the first layer. Thus, we conclude that mutual information and KL selectivity are adequate importance measures for classification.
On the surface, this contradicts the claims in . Under closer scrutiny, however, our results are not surprising. The mutual information values of neurons in the second layer are large compared to those of the first layer; simultaneously, neurons in the second layer seem to be highly redundant, i.e., can be removed without affecting classification performance. Combining these facts, it follows that ablating neurons with large mutual information values affects classification performance less than ablating neurons with small mutual information values, leading to the negative correlation reported in . We believe, however, that the correlation between mutual information and impact on classification performance becomes positive if this correlation is evaluated layer-by-layer. We even claim that similar conclusions can be drawn from a closer inspection of [17, Figs. 7.a, A4.a, A4.c, A4.e]. Thus, the superficial counterintuitivity of the results in  can be resolved by recognizing that it is ill-advised to compare mutual information and (KL) selectivity values across different layers.
6 Implications for Complexity Reduction via Pruning
Pruning is often used to reduce the computational complexity of performing inference in a NN. Moreover, it can lead to better generalization results as it reduces the representational capacity of the NN, hence enforcing simpler learned models. Most of the recent work on pruning retrains the NN after the pruning procedure (e.g., [13, 16, 10]). When dealing with large pre-trained networks and fine-tuning them for transfer learning, retraining itself may incur significant complexity. Our analysis in Fig. 1 and the analysis in [17, Appendix A.1] suggest that performing bias balancing for sigmoid activation functions and doing nothing for ReLU activation functions, respectively, can lead to acceptable performance without requiring retraining. That bias balancing improves performance without retraining parallels the observation that a simple update of the weight matrices improves performance after merging neurons [15, Fig. 4].
Concluding from  and Sections 5.2 and 5.3, pruning based on the ranking of entropy, mutual information, or (KL) selectivity should not be performed for all layers jointly, since this would remove mostly neurons from shallow layers. In contrast, pruning layers separately shows promising performance. This opens the question how the number (or percentage) of pruned neurons should be distributed among the different layers. While we do not have an answer to this question, we wish to point out that it will not only be linked to importance measures for individual neurons, but also to measures of neuron redundancy. Our discussion in Sections 5.1 and 5.3 indicates that deeper layers have more redundancy and hence can be more severely pruned without impacting the performance significantly.
Finally, we observed that our information-theoretic importance measures not only differ greatly between layers (see Section 5.1 and [17, Fig. A2]) but also have different meanings (see Section 4 and Appendix 0.A). This suggests that it may be useful to employ different importance measures when pruning different layers. To demonstrate this on a simple case, we performed pruning on the NN from Section 5. We fixed the ratio of neurons pruned from the first layer and the second layer to 1:2, i.e., twice as many neurons are pruned from the second layer as are pruned from the first layer. Moreover, we pruned neurons from the first layer if their mutual information values were low; we pruned neurons from the second layer if their KL selectivity was low. Fig. 5 shows the results of this experiment with and without bias balancing and compares it to pruning all neurons based on mutual information. Furthermore, we compare the results with the data-free pruning method proposed in . It can be seen that bias balancing recovers a significant portion of the performance loss caused by pruning, which matches our observations in Fig. 1. It can also be seen that pruning neurons with low mutual information from the first layer and neurons with low KL selectivity from the second layer performs slightly better than pruning all neurons based on mutual information only, although the difference is small. This warrants a more detailed investigation of using different importance measures at different layers but, based on these experiments, one can make an educated guess that it leads to results at least as good as (or even slightly better than) using the same measure for all layers. Finally, this hybrid scheme has similar performance as the data-free pruning method from  and thus presents a competitive alternative. Note, however, that this comparison is problematic, since the method in  prunes neurons that are redundant, while our scheme prunes neurons that are uninformative about the class. The two methods may even be used together since they approach different aspects of pruning. A more detailed comparison with  is presented in Appendix 0.F.
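The hybrid selection rule with a 1:2 pruning ratio can be sketched as follows. The function name `hybrid_prune_selection` and its arguments are hypothetical; the sketch only selects which neurons to prune and assumes that the per-neuron mutual information (first layer) and KL selectivity (second layer) have already been computed.

```python
def hybrid_prune_selection(mi_layer1, kl_layer2, n_first):
    """Select neurons to prune with a 1:2 ratio between the two hidden layers.

    mi_layer1[j]: mutual information of neuron j in the first hidden layer.
    kl_layer2[j]: KL selectivity of neuron j in the second hidden layer.
    n_first: number of neurons to prune from the first layer; twice as many
    are pruned from the second layer.
    Returns (indices pruned in layer 1, indices pruned in layer 2).
    """
    first = sorted(range(len(mi_layer1)), key=mi_layer1.__getitem__)[:n_first]
    second = sorted(range(len(kl_layer2)), key=kl_layer2.__getitem__)[:2 * n_first]
    return first, second
```

The selected neurons would then be removed, optionally followed by a bias-balancing step on the subsequent layer.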
7 Discussion, Limitations, and Outlook
By considering training data and neuron outputs as realizations of RVs, one can define and calculate information-theoretic quantities to measure neuron importance. This, in turn, admits investigating the connection between these importance measures and the effect of cumulative ablation on classification performance. We summarize our main findings:
- The distribution of importance measures changes from layer to layer, with deeper layers in general having larger importance measures (especially mutual information). It therefore seems ill-advised to compare the importance of neurons of different layers.
- In deeper layers, ablation has smaller effects on classification performance. This may be explained by an increased redundancy in deeper layers.
- In deeper layers, KL selectivity seems to be the most adequate importance measure. In other words, deeper layers profit from “cat neurons”.
- Class-dependent importance measures, such as mutual information or KL selectivity, are connected more strongly to classification performance than class-independent ones, such as entropy.
- The connection between importance measures and classification performance depends on the activation function, as does the benefit of bias balancing.
On the one hand, we must admit that we obtained these conclusions by doing experiments with a comparably small dataset. The question whether the qualitative claims hold more generally, e.g., for convolutional NNs and deep NNs with more than two hidden layers, shall be answered in future work. On the other hand, the MNIST dataset is a well-understood benchmark for NNs, and our results admit an interesting perspective on the contribution of individual neurons on classification performance. Specifically, our work resolves the counterintuitivity present in  without contradicting it and we believe similar results should hold for larger datasets and neural networks.
Yet, the picture is not complete: Information-theoretic importance functions depending on the distribution of an individual neuron output are not sufficient. Neither are, in our opinion, mutual information measures between the class variable and complete layers, as were recently used in [21, 20, 1]. Consider the following three examples:
1. Suppose that $|\mathcal{C}| > 2$ and that class $c^\star$ is particularly easy to predict from the neuron outputs in the $i$-th layer. Suppose further that we use KL selectivity to measure neuron importance. It may happen that $c^\star$ is the maximizer of (5) for all neurons in the layer. Thus, neuron importance is evaluated only based on the ability to distinguish class $c^\star$ from the rest, which ignores separating the remaining classes. Pruning neurons based on KL selectivity may thus result in a NN unable to correctly classify classes other than $c^\star$.
2. Suppose that the $j$-th and the $k$-th neuron in the $i$-th layer have the same output for every input $x$, i.e., $t_j^{(i)}(x) = t_k^{(i)}(x)$, and that $I(T_j^{(i)};C) = I(T_k^{(i)};C)$ attains the highest possible value. In this case, if we use mutual information for pruning, the two neurons will be given very high importance, although one of them can be pruned without having any effect on the performance of the neural network (if the outgoing weights are adjusted accordingly after pruning).
3. Suppose that $|\mathcal{C}| = 2$ and that we use mutual information to measure neuron importance. Suppose further that the $j$-th and the $k$-th neuron in the $i$-th layer are binary. It may happen that both $T_j^{(i)}$ and $T_k^{(i)}$ are independent of $C$, but that $C$ equals the exclusive or of $T_j^{(i)}$ and $T_k^{(i)}$. Thus, $I(T_j^{(i)};C) = I(T_k^{(i)};C) = 0$, although both neuron outputs jointly determine the class.
In the first example, the neurons are individually informative, but KL selectivity may declare a set of neurons as important that is redundant (in the sense of determining class ) but insufficient (for determining other classes). The second example presents a similar situation but with mutual information. In the third example, the neurons are individually uninformative but jointly informative. The first two examples can possibly be accounted for by introducing importance measures that take the redundancy of a layer into account, such as those proposed by [22, 2]. For the first example, another option is to replace the KL selectivity of a neuron by a table of specific information values, evaluated for each class and each neuron in a given layer. The resulting table then admits selecting a subset of neurons such that each class in is represented.
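The second and third examples can be checked numerically. The following Python sketch assumes binary neuron outputs and a binary, uniformly distributed class label; the helper `mutual_information` and all variable names are hypothetical and serve only as an illustration. For the second example we additionally assume that each of the two identical neurons equals the class label.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information in bits between the row and column variable of p(a, b)."""
    joint = np.asarray(joint, dtype=float)
    p_rows = joint.sum(axis=1, keepdims=True)   # marginal of the row variable
    p_cols = joint.sum(axis=0, keepdims=True)   # marginal of the column variable
    product = p_rows @ p_cols                   # product of the marginals
    mask = joint > 0                            # zero-probability cells contribute nothing
    return float(np.sum(joint[mask] * np.log2(joint[mask] / product[mask])))

# Third example: two independent, uniformly distributed binary neurons whose
# exclusive or is the class. Rows index the pair (t1, t2) as 2*t1 + t2.
pair_xor = np.zeros((4, 2))
for t1 in (0, 1):
    for t2 in (0, 1):
        pair_xor[2 * t1 + t2, t1 ^ t2] = 0.25
cube = pair_xor.reshape(2, 2, 2)                 # axes: (t1, t2, y)
t1_xor = cube.sum(axis=1)                        # joint table p(t1, y)
t2_xor = cube.sum(axis=0)                        # joint table p(t2, y)

# Second example: both neurons always agree and (assumption) equal the class.
single_dup = np.array([[0.5, 0.0], [0.0, 0.5]])  # p(t1, y) = p(t2, y)
pair_dup = np.zeros((4, 2))
pair_dup[0, 0] = 0.5                             # (t1, t2) = (0, 0) with y = 0
pair_dup[3, 1] = 0.5                             # (t1, t2) = (1, 1) with y = 1

results = {
    "xor_single_1": mutual_information(t1_xor),
    "xor_single_2": mutual_information(t2_xor),
    "xor_pair": mutual_information(pair_xor),
    "dup_single": mutual_information(single_dup),
    "dup_pair": mutual_information(pair_dup),
}
for name, value in results.items():
    print(f"{name}: {value:.3f} bit")
```

In the exclusive-or case, each neuron alone carries zero bits about the class while the pair carries one bit. In the duplicate case, each neuron alone carries one bit, and the pair carries no more than that.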
More generally, all examples can be treated by investigating how the information of a complete layer on the class variable splits into redundant, unique, and synergistic parts. Neurons that contain only redundant information shall be assigned little importance; in deeper layers, unique information may be given higher value than synergistic information, whereas the contrary may be true for shallow layers. This line of thinking suggests that, in addition to the measures proposed so far, partial information decomposition [25, 18] may be used to shed more light on the behavior of neural networks.
References
-  Amjad, R.A., Geiger, B.C.: How (not) to train your neural network using the information bottleneck principle (2018), submitted to ICML; preprint available: arXiv:1802.09766 [cs.LG]
-  Babaeizadeh, M., Smaragdis, P., Campbell, R.H.: NoiseOut: A simple way to prune neural networks. arXiv:1611.06211 (2016)
-  Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE 10, 1–46 (07 2015). https://doi.org/10.1371/journal.pone.0130140
-  Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: Quantifying interpretability of deep visual representations. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 3319–3327. Honolulu (Jul 2017)
-  Chollet, F., et al.: Keras. https://github.com/keras-team/keras (2015)
-  Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv:1602.02830 (2016)
-  Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley & Sons, Inc., New York, NY, 1 edn. (1991)
-  Frankle, J., Carbin, M.: The lottery ticket hypothesis: Training pruned neural networks. arXiv:1803.03635v1 [cs.LG] (Mar 2018)
-  Han, T.S., Verdú, S.: Generalizing the Fano inequality. IEEE Transactions on Information Theory 40(4), 1247–1251 (Jul 1994)
-  He, T., Fan, Y., Qian, Y., Tan, T., Yu, K.: Reshaping deep neural network for fast decoding by node-pruning. In: Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). pp. 245–249 (2014)
-  Kim, Y.D., Park, E., Yoo, S., Choi, T., Yang, L., Shin, D.: Compression of deep convolutional neural networks for fast and low power mobile applications. In: Proc. Int. Conf. on Learning Representations (ICLR). San Juan (May 2016)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv:1412.6980 (2014)
-  Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. In: Proc. Int. Conf. on Learning Representations (ICLR). Toulon (Apr 2017), arXiv:1608.08710 [cs.CV]
-  Lin, J.: Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37(1), 145–151 (Jan 1991)
-  Mariet, Z., Sra, S.: Diversity networks. In: Proc. Int. Conf. on Learning Representations (ICLR). San Juan (May 2016), arXiv:1511.05077v6 [cs.LG]
-  Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient inference. In: Proc. Int. Conf. on Learning Representations (ICLR). Toulon (Apr 2017), arXiv:1611.06440v2 [cs.LG]
-  Morcos, A.S., Barrett, D.G., Rabinowitz, N.C., Botvinick, M.: On the importance of single directions for generalization. arXiv:1803.06959v1 [stat.ML] (May 2018), accepted for publication
-  Rauh, J., Banerjee, P., Olbrich, E., Jost, J., Bertschinger, N.: On extractable shared information. Entropy 19(7), 328 (Jul 2017)
-  Roth, W., Pernkopf, F.: Discrete-valued neural networks using variational inference, openreview.net/forum?id=r1h2DllAW
-  Saxe, A.M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B.D., Cox, D.D.: On the information bottleneck theory of deep learning. In: Proc. Int. Conf. on Learning Representations (ICLR). Vancouver (May 2018), openreview.net/forum?id=ry_WPG-A-, accepted for publication
-  Shwartz-Ziv, R., Tishby, N.: Opening the black box of deep neural networks via information. arXiv:1703.00810 [cs.LG] (Mar 2017)
-  Srinivas, S., Babu, R.V.: Data-free parameter pruning for deep neural networks. arXiv:1507.06149 (2015)
-  Tax, T., Mediano, P., Shanahan, M.: The partial information decomposition of generative neural network models. Entropy 19(9), 474 (Sep 2017)
-  Tu, M., Berisha, V., Cao, Y., Seo, J.s.: Reducing the model order of deep neural networks using information theory. In: Proc. IEEE Computer Society Annual Sym. on VLSI (ISVLSI). pp. 93–98 (2016)
-  Williams, P.L., Beer, R.D.: Nonnegative decomposition of multivariate information. arXiv:1004.2515 [cs.IT] (Apr 2010)
-  Zurada, J.M., Malinowski, A., Cloete, I.: Sensitivity analysis for minimization of input data dimension for feedforward neural network. In: Proceedings of IEEE International Symposium on Circuits and Systems - ISCAS ’94. vol. 6, pp. 447–450 (May 1994)
Appendix 0.A Properties of KL Selectivity
KL selectivity is large not only if a neuron output helps distinguish one class from the rest, but also if it helps distinguish a set of classes from its complement. This is the main conclusion of the following lemma.
Note that the distribution is a convex combination of distributions , . Indeed,
Since KL divergence is convex [7, Th. 2.7.2], we thus have
with equality if consists of a single element. Thus, maximizing over all sets is equivalent to maximizing over all class labels. This completes the proof.
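The statement of the lemma can be checked by brute force on a small example. The sketch below assumes that KL selectivity is the maximum, over class labels, of the KL divergence between the class-conditional and the marginal distribution of the neuron output; the randomly drawn joint table and all function names are hypothetical.

```python
import itertools
import numpy as np

def kl_divergence(p, q):
    """KL divergence D(p || q) in bits on a finite alphabet."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def output_given_subset(joint, subset):
    """Distribution of the neuron output given that the class lies in `subset`."""
    mass = joint[:, list(subset)].sum(axis=1)
    return mass / mass.sum()

# Hypothetical joint table p(t, y): 4 quantization levels, 5 classes.
rng = np.random.default_rng(0)
num_levels, num_classes = 4, 5
joint = rng.random((num_levels, num_classes))
joint /= joint.sum()
p_output = joint.sum(axis=1)

classes = range(num_classes)
# Maximum over single class labels.
max_over_classes = max(
    kl_divergence(output_given_subset(joint, (y,)), p_output) for y in classes
)
# Maximum over all nonempty subsets of class labels.
max_over_subsets = max(
    kl_divergence(output_given_subset(joint, subset), p_output)
    for size in range(1, num_classes + 1)
    for subset in itertools.combinations(classes, size)
)
print(max_over_classes, max_over_subsets)
```

Both maxima coincide, in line with the convexity argument: no subset of classes can yield a larger divergence than the best single class.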
The second result relates mutual information with KL selectivity.
with equality if and only if .
Note that mutual information can be written as a KL divergence, i.e.,
Since the convex combination on the right-hand side is always bounded from above by its maximum, the inequality is proved. Finally, if , then so is the convex combination on the right-hand side. Therefore, all terms need to be identical to zero, i.e., for all . It thus follows that KL selectivity is zero if mutual information is zero.
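In generic notation, the two steps of this proof can be written out as follows. The symbols are assumptions made for this sketch: $T$ denotes the (quantized) neuron output, $Y$ the class label with prior $p_Y$, and KL selectivity is taken to be the maximum of the specific information over all class labels.

```latex
% Assumed notation: T neuron output, Y class label, p_Y class prior,
% D(.||.) KL divergence, KL selectivity = max_y D(P_{T|Y=y} || P_T).
\begin{align}
I(T;Y) &= \sum_{y} p_Y(y)\, D\!\left(P_{T\mid Y=y} \,\middle\|\, P_T\right) \\
       &\le \max_{y}\, D\!\left(P_{T\mid Y=y} \,\middle\|\, P_T\right).
\end{align}
```

The first line expresses mutual information as a convex combination of specific information values; the second line bounds this combination by its largest term.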
Appendix 0.B Additional Information-Theoretic Neuron Importance Measures
0.B.1 Jensen-Shannon Subset Separation
A consequence of Lemma 2 is that a large mutual information implies that, at least for some class label , the conditional distribution differs from the marginal distribution . More generally, there needs to be a set such that differs from . We measure the difference between these two distributions using the Jensen-Shannon (JS) divergence. Specifically, the JS divergence between two distributions on , on the same finite alphabet and a weight is defined as 
where . JS divergence is nonnegative, symmetric, bounded, and zero if . JS divergence can be used to bound the Bayesian binary classification error from above and below, where and are the class priors and where and are the conditional probabilities of the observation given the respective classes (see Theorems 4 and 5 in ).
Evaluating the JS divergence between and with weights and , respectively, is thus connected to the binary classification problem of deciding whether or not the neuron output is connected to a subset of all classes. If there is at least one nontrivial set such that the JS divergence is large, then the neuron output is useful in separating data samples from classes in from those from classes in . Hence, we consider the following importance measure:
The following lemma gives a clearer picture of this cost function by showing that the JS divergence between these distributions coincides with the mutual information that the neuron output shares with an indicator function on a subset of class labels. The connection between JS divergence and mutual information is known; we reproduce the proof for the convenience of the reader.
In essence, Lemma 3 shows that our importance function (14) can be interpreted as a divergence between two distributions, and as the amount of information the neuron output shares with an indicator variable on a subset of class labels. Hence, this importance function measures the ability of the neuron output to separate class subsets.
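The equality between JS divergence and mutual information with an indicator variable can be verified numerically. The sketch below uses the weighted JS divergence in its entropy form, with the probability of the class subset as the weight; the joint table, the chosen subset, and all function names are hypothetical, and the brute-force maximization is only meant for small numbers of classes.

```python
import itertools
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def js_divergence(p, q, weight):
    """Weighted JS divergence H(w*p + (1-w)*q) - w*H(p) - (1-w)*H(q)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mixture = weight * p + (1.0 - weight) * q
    return entropy(mixture) - weight * entropy(p) - (1.0 - weight) * entropy(q)

def split_by_subset(joint, subset):
    """Return the joint table p(t, 1[y in S]) with columns (inside, outside)."""
    inside = np.zeros(joint.shape[1], dtype=bool)
    inside[list(subset)] = True
    return np.stack(
        [joint[:, inside].sum(axis=1), joint[:, ~inside].sum(axis=1)], axis=1
    )

def js_for_subset(joint, subset):
    """JS divergence between P(T | Y in S) and P(T | Y not in S), weight P(Y in S)."""
    table = split_by_subset(joint, subset)
    weight = table[:, 0].sum()
    return js_divergence(table[:, 0] / weight, table[:, 1] / (1.0 - weight), weight)

def indicator_mutual_information(joint, subset):
    """I(T; 1[Y in S]) = H(T) + H(1[Y in S]) - H(T, 1[Y in S])."""
    table = split_by_subset(joint, subset)
    return entropy(table.sum(axis=1)) + entropy(table.sum(axis=0)) - entropy(table)

def js_subset_separation(joint):
    """Brute-force maximum of the JS divergence over all nontrivial class subsets."""
    classes = range(joint.shape[1])
    return max(
        js_for_subset(joint, subset)
        for size in range(1, joint.shape[1])
        for subset in itertools.combinations(classes, size)
    )

# Hypothetical joint table p(t, y): 3 quantization levels, 4 classes.
rng = np.random.default_rng(1)
joint = rng.random((3, 4))
joint /= joint.sum()

subset = (0, 2)
js_value = js_for_subset(joint, subset)
mi_value = indicator_mutual_information(joint, subset)
best_value = js_subset_separation(joint)
print(js_value, mi_value, best_value)
```

The first two printed values agree up to floating-point error, and the third is at least as large because it maximizes over all nontrivial subsets.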
Note further that, by (4), we have
In case all class labels occur equally often (i.e., has a uniform distribution on ), the right-hand side of the above equation achieves its maximum for sets that contain half of the class labels. Thus, Jensen-Shannon subset separation tends to give higher importance to neurons that separate into equally sized sets than to neurons that separate it in an unbalanced manner.
0.B.2 Labeled Mutual Information
The maximization in (14) has a computational complexity of , which makes it impractical for datasets with many classes. Instead, one can maximize over individual classes rather than over subsets of classes, which yields the definition of labeled mutual information:
With reference to Lemma 2, one can show that
i.e., labeled mutual information contains the same specific information that we used in the definition of KL selectivity. Note, however, that (except in certain corner cases) the maximum in (16) may be achieved for a different class than the maximum in (5). Nevertheless, labeled mutual information and KL selectivity tend to give identical results in special corner cases.
By similar arguments as in the discussion of JS subset separation, we have
Therefore, labeled mutual information in general decreases with the number of possible class labels, i.e., with the cardinality of . This is not the case for KL selectivity.
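The dependence on the number of classes can be illustrated with the binary entropy function. The sketch below assumes that the bound in question is the entropy of the indicator variable, i.e., the binary entropy of the probability that the class lies in the considered set, and that class labels are uniformly distributed; the function and variable names are hypothetical.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a binary variable that equals one with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p))

# Assumed upper bound on labeled mutual information for K uniform classes:
# the indicator of a single class equals one with probability 1/K.
class_counts = (2, 4, 10, 100)
single_class_bounds = [binary_entropy(1.0 / k) for k in class_counts]

# Assumed upper bound on JS subset separation for ten uniform classes,
# as a function of the subset size 1, ..., 9.
subset_sizes = list(range(1, 10))
subset_bounds = [binary_entropy(size / 10.0) for size in subset_sizes]

for k, bound in zip(class_counts, single_class_bounds):
    print(f"{k:4d} classes: bound {bound:.4f} bit")
print("subset size with the largest bound:", subset_sizes[int(np.argmax(subset_bounds))])
```

Under these assumptions, the single-class bound shrinks as the number of classes grows, whereas the subset bound is largest for subsets that contain half of the classes.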
0.B.3 Ordering Between Importance Measures
where follows from (4) and the nonnegativity of entropy, from the data processing inequality, and from the fact that the maximization is performed over a smaller set in (16) than in (14). Second, Lemma 2 shows that KL selectivity is an upper bound on mutual information. Finally, there is no ordering between KL selectivity and entropy.
0.B.4 Complexity of Computing Importance Functions
We assume that the validation set is, in any case, run through the NN, i.e., we ignore the computational complexity of computing neuron outputs. Assuming that and , the most complex step is computing the distribution from the data set ; this can be done with a complexity of . Similarly, the distribution and the set of distributions can be computed with a complexity of .
Entropy can be computed from with a complexity of ; mutual information from with a complexity of ; KL selectivity and labeled mutual information from the set of distributions with a complexity of ; and JS subset separation with a complexity of . These computations have, with the exception of JS subset separation, a complexity negligible compared to .
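The two stages described above, a single pass over the validation set to estimate the joint distribution followed by cheap computations on the resulting table, can be sketched as follows. The function names, the toy data, and the definition of KL selectivity as the maximum specific information over classes are assumptions made for this illustration.

```python
import numpy as np

def joint_distribution(quantized, labels, num_levels, num_classes):
    """Empirical joint distribution of quantized output and class (one pass over the data)."""
    counts = np.zeros((num_levels, num_classes))
    np.add.at(counts, (quantized, labels), 1.0)   # unbuffered accumulation of counts
    return counts / counts.sum()

def entropy(p):
    """Shannon entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def kl_divergence(p, q):
    """KL divergence D(p || q) in bits."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def importance_measures(joint):
    """Entropy, mutual information, and KL selectivity from the joint table."""
    p_output = joint.sum(axis=1)
    p_class = joint.sum(axis=0)
    seen = p_class > 0                              # ignore classes without samples
    conditionals = joint[:, seen] / p_class[seen]   # columns are P(T | Y = y)
    specific = [
        kl_divergence(conditionals[:, k], p_output)
        for k in range(conditionals.shape[1])
    ]
    return {
        "entropy": entropy(p_output),
        "mutual_information": float(np.dot(p_class[seen], specific)),
        "kl_selectivity": max(specific),
    }

# Toy data: six samples, one-bit quantized outputs of one neuron, three classes.
quantized = np.array([0, 0, 1, 1, 1, 0])
labels = np.array([0, 0, 1, 1, 2, 2])
joint = joint_distribution(quantized, labels, num_levels=2, num_classes=3)
measures = importance_measures(joint)
print(measures)
```

Only `joint_distribution` touches the data; all importance measures are then computed from a table whose size does not depend on the number of samples.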
Appendix 0.C Effect of Quantizer Resolution
Fig. 6 shows the effect the quantizer resolution has on cumulative ablation. Specifically, we ablated neurons with low mutual information values from the first layer of the NN. As can be seen, the performance for different quantizer resolutions is similar. Apparently, different quantizer resolutions lead to different, but still strongly correlated rankings. We therefore chose to use the lowest quantizer resolution, i.e., one-bit quantization. First, this minimizes the computational complexity of computing the proposed importance measures (Appendix 0.B.4). Second, it guarantees that , which justifies using (2) as an estimate of the joint distribution of and . Finally, such a coarse quantization ensures that the neuron output can be interpreted by a linear separation; i.e., the fact that mutual information is invariant under bijections is unproblematic in this case, cf. .
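For illustration, a quantizer with adjustable resolution could look as follows. This is only a sketch: it assumes a uniform quantizer on the output range of a sigmoid activation function, which need not coincide with the quantizer used in the experiments, and the function name and sample values are hypothetical.

```python
import numpy as np

def uniform_quantize(outputs, bits, low=0.0, high=1.0):
    """Map outputs in [low, high] to integer levels 0, ..., 2**bits - 1."""
    levels = 2 ** bits
    scaled = (np.asarray(outputs, dtype=float) - low) / (high - low)
    indices = np.floor(scaled * levels).astype(int)
    return np.clip(indices, 0, levels - 1)      # the upper end point joins the top level

sigmoid_outputs = np.array([0.02, 0.40, 0.55, 0.97, 1.00])
one_bit = uniform_quantize(sigmoid_outputs, bits=1)   # threshold at 0.5
two_bit = uniform_quantize(sigmoid_outputs, bits=2)   # thresholds at 0.25, 0.5, 0.75
print(one_bit, two_bit)
```

With one bit, the quantized output only records on which side of a single threshold the neuron output lies, which corresponds to a linear separation of the neuron's input space.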
Appendix 0.D Distribution of Importance Measures for ReLU Activation and the Curious Case of Entropy
Fig. 7 shows the distributions of eight neurons each taken from the first and second hidden layer, respectively. One can see that the neurons in the first hidden layer are saturated more strongly than those in the second hidden layer. Furthermore, one can see that the distributions of the neuron outputs in the first hidden layer are skewed more strongly, i.e., the entropy is lower for neurons in the first hidden layer, explaining the behavior shown in Fig. 2(a).
As can be seen in Fig. 8, the difference between the distributions of information-theoretic importance functions in the first and second hidden layer is not as severe for ReLU activation functions as it is for sigmoid activation functions. The reason is that, apparently, ReLU activation functions distinguish between being active or inactive, where in any case the activation is small. The distributions in the second layer seem to be as skewed as those in the first.
Appendix 0.E Bias Balancing for ReLU Activation Functions
Fig. 9 shows the performance of bias balancing for NNs with ReLU activation functions. For these, it was shown in [17, Fig. A1] that bias balancing (“ablation to the empirical mean”) performs worse than doing nothing (“ablation to zero”). Indeed, our results in Fig. 9 confirm this. While the effect is weak in the first layer, bias balancing reduces classification performance in the second layer compared to doing nothing. A possible explanation is that, for ReLU activation functions, classification seems to be linked to whether the neuron is active or inactive, i.e., whether or not is zero. Doing nothing is equivalent to making zero for every data sample, which negatively affects classification performance only for those data samples for which would be positive. In contrast, replacing by its empirical mean negatively affects all data samples.
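The difference between the two ablation variants can be made concrete with a small numerical sketch. It assumes that bias balancing removes the neuron and adds its empirical mean output, scaled by the outgoing weights, to the biases of the next layer; the function `ablate_neuron`, the random toy layer, and all shapes are hypothetical.

```python
import numpy as np

def ablate_neuron(hidden, weights_out, bias_out, neuron, balance_bias):
    """Remove one hidden neuron; optionally fold its mean output into the next bias."""
    keep = np.arange(hidden.shape[1]) != neuron
    new_bias = bias_out.copy()
    if balance_bias:
        new_bias += hidden[:, neuron].mean() * weights_out[neuron]
    return hidden[:, keep], weights_out[keep], new_bias

rng = np.random.default_rng(2)
hidden = np.maximum(rng.normal(size=(200, 6)), 0.0)   # ReLU-like outputs with exact zeros
weights_out = rng.normal(size=(6, 3))                 # shape: (hidden units, next layer)
bias_out = rng.normal(size=3)
neuron = 4

original = hidden @ weights_out + bias_out
h_zero, w_zero, b_zero = ablate_neuron(hidden, weights_out, bias_out, neuron, False)
h_mean, w_mean, b_mean = ablate_neuron(hidden, weights_out, bias_out, neuron, True)
pre_zero = h_zero @ w_zero + b_zero                   # ablation to zero
pre_mean = h_mean @ w_mean + b_mean                   # ablation to the empirical mean

# Samples on which the ablated neuron was inactive in the intact network.
inactive = hidden[:, neuron] == 0.0
changed_zero = np.abs(pre_zero - original).max(axis=1) > 1e-12
changed_mean = np.abs(pre_mean - original).max(axis=1) > 1e-12
print("inactive samples changed by zero ablation:", int(changed_zero[inactive].sum()))
print("inactive samples changed by mean ablation:", int(changed_mean[inactive].sum()))
```

In this toy layer, ablation to zero leaves the next-layer inputs untouched for every sample on which the neuron was inactive, whereas ablation to the empirical mean shifts them for those samples as well.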
Appendix 0.F Comparison of Neuron Pruning with Neuron Merging
We next compare the data-free pruning method from  with pruning neurons based on information-theoretic importance measures. Note that such a comparison is problematic for multiple reasons. 1) The method proposed in  is data-free, i.e., depends only on the network parameters , while pruning based on statistics of neuron outputs requires a validation set for estimation (estimation itself has complexity linear in the size of the validation set, cf. Appendix 0.B.4). 2) While pruning based on neuron importance removes neurons that carry little information about the class in some well-defined sense, merging neurons that behave similarly, such as proposed in , places focus on redundancy (cf. Section 7).
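The idea behind merging with a weight update can be illustrated as follows. This is a simplified sketch and not the exact procedure of the cited method: it assumes two hidden neurons with identical incoming weights and biases, removes one of them, and adds its outgoing weights to the remaining one. All names and the toy network are hypothetical.

```python
import numpy as np

def forward(inputs, w_in, b_in, w_out, b_out):
    """One sigmoid hidden layer followed by a linear output layer."""
    hidden = 1.0 / (1.0 + np.exp(-(inputs @ w_in + b_in)))
    return hidden @ w_out + b_out

def merge_neurons(w_in, b_in, w_out, keep, remove):
    """Remove neuron `remove` and add its outgoing weights to neuron `keep`."""
    w_out = w_out.copy()
    w_out[keep] += w_out[remove]
    mask = np.arange(w_in.shape[1]) != remove
    return w_in[:, mask], b_in[mask], w_out[mask]

rng = np.random.default_rng(3)
w_in = rng.normal(size=(5, 4))
b_in = rng.normal(size=4)
w_in[:, 3] = w_in[:, 1]                 # neuron 3 duplicates neuron 1
b_in[3] = b_in[1]
w_out = rng.normal(size=(4, 2))
b_out = rng.normal(size=2)
inputs = rng.normal(size=(10, 5))

original = forward(inputs, w_in, b_in, w_out, b_out)
merged = forward(inputs, *merge_neurons(w_in, b_in, w_out, keep=1, remove=3), b_out)
dropped = forward(inputs, w_in[:, :3], b_in[:3], w_out[:3], b_out)  # no weight update

print("max deviation with weight update:   ", np.abs(merged - original).max())
print("max deviation without weight update:", np.abs(dropped - original).max())
```

For exactly identical neurons, the weight update reproduces the original network output, while simply dropping the neuron does not. For neurons that are only similar, the update would be an approximation.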
The results are shown in Fig. 10. First of all, one can see that data-free pruning with weight updates outperforms pruning without weight updates for sigmoid activation functions, paralleling the results from . Surprisingly, the picture seems to be reversed for ReLU activation functions, which in some sense parallels our discussion in Appendix 0.E. Judging from Figs. 10(a) and 10(c), pruning neurons with low mutual information and low KL selectivity in the first and second hidden layer, respectively, is a competitive alternative to the pruning method from .
For ReLU activation functions, the situation seems to be more complicated. While all pruning techniques seem to perform similarly for the first hidden layer (cf. Fig. 10(b)), in the second hidden layer it seems as if KL selectivity is not an adequate measure of neuron importance. This suggests that even qualitative results depend strongly on the type of activation function used.