1 Introduction
Classification models are usually trained with a softmax loss, which is quite successful in many scenarios. This loss typically helps the model learn discriminative features for the target task and ignore irrelevant features. However, if several discriminative features are correlated within a category, the model may pick up only the most discriminative ones (e.g., color and texture) and ignore the rest (e.g., object structure). The reason is that the most discriminative features are those that produce the steepest descent in the loss function. As training continues, these features come to dominate the final feature representation, and the remaining discriminative features are discarded along with the irrelevant ones. Similar evidence can be found in a recent study on ImageNet-trained CNNs, which shows that those models are biased towards texture rather than object shape
[5]. Learning only part of the discriminative features does not make the most of the dataset and thus reduces the generalization capability of the model.

Information theory has been widely used to improve the representation capability of deep neural networks [2, 4, 6, 11, 12, 14]. In this work, we focus on how to apply mutual information to find correlated features in image classification tasks. According to the Data Processing Inequality (DPI), the mutual information between the input data and the hidden layers decreases as the layers go deeper [13]. The main idea of the information bottleneck (IB) trade-off is to find the optimal achievable representations of the input data by minimizing the mutual information between the input data and the hidden representation while maximizing the mutual information between the hidden representation and the label [13]. In contrast, we find that when discriminative features are correlated, maximization, instead of minimization, of the mutual information between hidden representations provides extra benefits for representation learning. We call this strategy information flow maximization (IFM); it is achieved by estimating and maximizing the mutual information between convolutional layers simultaneously. IFM is implemented as a multi-layer fully connected neural network that serves as a plug-in to a conventional CNN during training. At test time, the IFM blocks are removed, so there is no extra computation cost.
2 Related work
There is much work concentrating on information maximization for deep networks. In [2], Chen et al. introduce InfoGAN, a generative adversarial network that maximizes the mutual information between a small subset of the latent variables and the observation. In [1], Belghazi et al. present the Mutual Information Neural Estimator (MINE), which estimates mutual information between high-dimensional continuous random variables by gradient descent over neural networks. In [6], Hjelm et al. introduce Deep InfoMax (DIM), which maximizes the mutual information between a representation and the output of a deep neural network encoder to improve the representation's suitability for downstream tasks. In [9], Jacobsen et al. propose an invertible network architecture and an alternative objective that extracts overall discriminative knowledge in the prediction model.

The difference between our work and [6] is that we concentrate on maximizing the mutual information between adjacent layers so that the information loss can be reduced, while [6] maximizes the mutual information between the final representation and the output convolutional feature maps. The work in [9] is closely related to ours; the main difference is that we apply IFM blocks instead of flow-based models to reduce the information loss.
3 Method
The pipeline of the proposed method is shown in Figure 1. The backbone network is a vanilla convolutional neural network. The IFM blocks are plugged in between adjacent convolutional layers. Note that the IFM blocks are only used in the training stage. In the test stage, the IFM blocks are removed, so there is no extra computation cost.
3.1 Mutual information estimation
To be self-contained, in this section we introduce how to estimate mutual information. Formally, the mutual information between two random variables $X$ and $Y$ is calculated as

$$I(X;Y) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}, \quad (1)$$

where $p(x,y)$ is the joint probability mass function of $X$ and $Y$, and $p(x)$ and $p(y)$ are the marginal probability mass functions of $X$ and $Y$ respectively. From Equation 1, we can see that maximizing the mutual information between $X$ and $Y$ is equivalent to maximizing the Kullback-Leibler divergence between the joint distribution $P_{XY}$ and the product of the marginal distributions $P_X P_Y$. Following [10], the general form of an $f$-divergence can be lower-bounded by

$$D_f(P \,\|\, Q) \ge \sup_{T \in \mathcal{T}} \left( \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}[f^*(T(x))] \right), \quad (2)$$

where $P$ is the joint distribution $P_{XY}$ and $Q$ is the product of the marginal distributions $P_X P_Y$. $\mathcal{T}$ is an arbitrary class of functions and $f^*$ is the convex conjugate of the generator function $f$. Since $D_f(P \,\|\, Q)$ can be approximated by the supremum of the difference between the two expectations, we can choose to maximize

$$\mathcal{J}(\theta) = \mathbb{E}_{x \sim P}[T_\theta(x)] - \mathbb{E}_{x \sim Q}[f^*(T_\theta(x))], \quad (3)$$

where $T_\theta$ is a neural network parametrized by $\theta$. More specifically, $T_\theta$ can be represented in the form $T_\theta(x) = g_f(V_\theta(x))$, where $g_f$ is specific to the $f$-divergence used. Since the Kullback-Leibler divergence is not upper-bounded, we use the Jensen-Shannon divergence as a surrogate to estimate the mutual information. Thus, we can replace $f^*$ with $f^*_{JS}(t) = -\log(2 - e^t)$ and choose $g_f(v) = \log 2 - \log(1 + e^{-v})$. Then we obtain

$$\mathcal{J}(\theta) = \mathbb{E}_{x \sim P}\left[\log \frac{2}{1 + e^{-V_\theta(x)}}\right] + \mathbb{E}_{x \sim Q}\left[\log\left(2 - \frac{2}{1 + e^{-V_\theta(x)}}\right)\right]. \quad (4)$$

Let $D_\theta(x) = \sigma(V_\theta(x))$, where $\sigma$ is the sigmoid function. Dropping the constant $\log 4$, we have

$$\mathcal{J}(\theta) = \mathbb{E}_{x \sim P}[\log D_\theta(x)] + \mathbb{E}_{x \sim Q}[\log(1 - D_\theta(x))]. \quad (5)$$

$D_\theta$ is represented by network D in Figure 2. From Equation 5, we can see that maximizing $\mathcal{J}(\theta)$ drives network D to output one for samples from the joint distribution and zero for samples from the product of the marginal distributions.
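As a concrete check of Equation 1, the following sketch computes the mutual information of a discrete joint pmf directly from the definition; the example distributions are illustrative, not from the paper:

```python
import math

def mutual_information(joint):
    """I(X;Y) for a discrete joint pmf given as a dict {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # Equation 1: sum of p(x,y) * log( p(x,y) / (p(x) p(y)) )
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Perfectly correlated pair: X = Y, so I(X;Y) = H(X) = log 2
correlated = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(correlated))   # about 0.6931

# Independent pair: the joint equals the product of marginals, so I = 0
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(mutual_information(independent))  # 0.0
```

The correlated case also illustrates why Equation 1 is a KL divergence: the log-ratio compares the joint against the product of marginals at every cell.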
3.2 Constructing sample pairs
In Equation 5, we still need to estimate two expectations: in the first term the samples are drawn from the joint distribution, and in the second term from the product of the marginal distributions. Since we estimate the mutual information between adjacent convolutional layers, we first upsample the smaller feature map so that the two maps share the same spatial size. Sampling from the joint distribution can then be achieved by sampling feature vectors at the same spatial location on the two convolutional feature maps. To sample from the product of the marginal distributions, we first sample a random feature vector from one feature map and then randomly sample another feature vector from the other. For each sample pair, the two feature vectors are concatenated into a single vector. The details are shown in Figure 2. $D_\theta$ is represented by network D, and maximizing Equation 5 optimizes network D to distinguish the sample pairs of the two distributions.

3.3 Information flow maximization
When stacking convolutional layers, we are potentially losing information. According to the DPI, we have $I(X; L_1) \ge I(X; L_2) \ge \cdots \ge I(X; L_n)$ for the input $X$ and successive layers $L_1, \dots, L_n$. Suppose we are given a training dataset for classification and the data representation can be decomposed into three disentangled features $f_1$, $f_2$ and $f_v$, where $f_1$ and $f_2$ can be used for classification and $f_v$ describes random variations that are shared across categories. Ideally, these three features could perfectly reconstruct the input data. When we train a model for the target classification task, the information about $f_v$ is gradually discarded from the information flow, as expected. However, if the classification task is biased towards one of the discriminative features, say $f_1$, we may unexpectedly lose the information of $f_2$ from the information flow as well. This is because during training the gradient with respect to $f_1$ will be much larger than that with respect to $f_2$, so $f_1$ is strengthened more and more relative to $f_2$. Finally, the model relies only on $f_1$ for classification. This behavior undermines the generalization capability of the model, especially when the test task depends on $f_2$ for classification.
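The dominance argument can be illustrated with a toy experiment: two perfectly correlated discriminative features, one scaled by a hypothetical factor s, so it receives an s-times larger gradient. Plain gradient descent on a logistic model then strengthens its weight s times faster. This is only a sketch of the intuition, not the paper's training setup:

```python
import math
import random

random.seed(0)
s = 10.0                                    # hypothetical scale of feature 1
zs = [random.choice([-1.0, 1.0]) for _ in range(200)]
# x1 and x2 carry the same information; only their scales differ
samples = [(s * z, z, 1.0 if z > 0 else 0.0) for z in zs]

w1 = w2 = 0.0                               # weights for the two features
lr = 0.1
for _ in range(100):
    g1 = g2 = 0.0
    for x1, x2, y in samples:
        p = 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2)))
        g1 += (p - y) * x1                  # s times larger than g2
        g2 += (p - y) * x2
    w1 -= lr * g1 / len(samples)
    w2 -= lr * g2 / len(samples)

# w1 ends up s times larger than w2: the steeper feature dominates
print(w1, w2)
```

Since x1 = s * x2 on every sample, each update to w1 is exactly s times the update to w2, which mirrors the argument that the feature with the steepest loss descent comes to dominate the representation.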
In order to reduce the loss of information in the information flow, we propose to maximize the mutual information between adjacent convolutional layers. The overall objective function is

$$\mathcal{L} = \mathcal{L}_{cls} - \sum_{i=1}^{n-1} I(L_i; L_{i+1}), \quad (6)$$

where $\mathcal{L}_{cls}$ is the classification loss (the softmax loss) and $n$ is the number of layers used to calculate the information flow.
Although some taskirrelevant information may also be involved in the final representation, the training process will let the discriminative information dominate the representation. Thus the classifier can make predictions based on more informative features.
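Putting Sections 3.1-3.3 together, the following sketch builds the two kinds of sample pairs and evaluates the Equation 5 objective for one pair of adjacent feature maps. Network D is replaced here by a fixed random linear score, and the channel counts and map size are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pairs(f_lo, f_hi):
    """Build sample pairs from two adjacent feature maps of shape (C, H, W).

    f_lo is assumed to be already upsampled to f_hi's spatial size.
    Positive pairs: feature vectors at the same spatial location
    (samples from the joint). Negative pairs: vectors from independently
    chosen locations (samples from the product of marginals).
    """
    c1, h, w = f_lo.shape
    c2 = f_hi.shape[0]
    lo = f_lo.reshape(c1, h * w).T              # (HW, C1) feature vectors
    hi = f_hi.reshape(c2, h * w).T              # (HW, C2) feature vectors
    pos = np.concatenate([lo, hi], axis=1)      # same spatial location
    perm = rng.permutation(h * w)
    neg = np.concatenate([lo, hi[perm]], axis=1)  # shuffled locations
    return pos, neg

def jsd_objective(score, pos, neg):
    """Equation 5: E_joint[log D] + E_marginal[log(1 - D)], D = sigmoid(score)."""
    d_pos = 1.0 / (1.0 + np.exp(-score(pos)))
    d_neg = 1.0 / (1.0 + np.exp(-score(neg)))
    return np.mean(np.log(d_pos)) + np.mean(np.log(1.0 - d_neg))

# Hypothetical 32- and 64-channel feature maps at 8x8 resolution
f_lo = rng.normal(size=(32, 8, 8))
f_hi = rng.normal(size=(64, 8, 8))
pos, neg = make_pairs(f_lo, f_hi)
w_d = rng.normal(size=96) * 0.01                # stand-in for network D
print(jsd_objective(lambda x: x @ w_d, pos, neg))
```

In the actual method, the linear score would be the multi-layer network D of an IFM block, trained to maximize this objective, and the result for every adjacent layer pair would enter Equation 6 as the mutual information estimate.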
4 Experiments
4.1 Dataset
The dataset we use for evaluation is the shiftMNIST dataset introduced in [9], a modified version of MNIST. For the ten digits, ten texture images are randomly selected from a texture dataset [3] and applied to the digits as their backgrounds. We split off one-fifth of the original MNIST training set to construct the validation set. In the training set, each digit is associated with a fixed type of texture: for example, the background patch of digit 1 is sampled from texture 1, the background patch of digit 2 from texture 2, and so on. In the validation and test sets, however, each digit is associated with a random texture. In other words, the texture id and the digit id are the same for a given training image, while they are not necessarily the same for a given validation or test image. Some examples from the shiftMNIST dataset are shown in Figure 3.
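The coupling scheme above can be sketched as follows. The "digits" and "textures" here are synthetic stand-ins (random masks over constant-intensity patches), since only the train/test coupling matters for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the ten texture patches of [3]: each "texture" is just
# a distinct constant-intensity 32x32 background here.
textures = [np.full((32, 32), t / 10.0) for t in range(10)]

def make_example(digit_id, split):
    """Overlay a (fake) digit mask on a texture background.

    Training: texture id == digit id (the biased coupling).
    Validation/test: texture id drawn at random.
    """
    tex_id = digit_id if split == "train" else int(rng.integers(10))
    img = textures[tex_id].copy()
    digit_mask = rng.random((32, 32)) < 0.1     # placeholder digit pixels
    img[digit_mask] = 1.0
    return img, digit_id, tex_id

img, d, t = make_example(3, "train")
print(d, t)   # identical labels in the training split
```

A model fitting only the background can therefore reach perfect training accuracy without ever using digit shape, which is exactly the failure mode the evaluation probes.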
4.2 Implementation details
Table 1: Details of the classification network.

Layer | Network details
conv1 | Conv(32,3,3)-BN-LeakyReLU
      | Maxpool(2,2)
conv2 | Conv(64,3,3)-BN-LeakyReLU
      | Maxpool(2,2)
conv3 | Conv(128,3,3)-BN-LeakyReLU
      | Maxpool(2,2)
conv4 | Conv(128,3,3)-BN-LeakyReLU
      | Maxpool(2,2)
fc    | FC(128x2x2, 10)

"Maxpool(2,2)" means max-pooling with a 2x2 window and stride 2. "BN" indicates batch normalization [8]. All LeakyReLUs share the same slope of 0.2 in the negative region. "FC" is the fully connected layer.

Table 2: Details of network D. The input dimension d is the length of the concatenated feature vector.

Layer | Network details
fc1   | FC(d, 256)-BN-LeakyReLU
fc2   | FC(256, 128)-BN-LeakyReLU
fc3   | FC(128, 64)-BN-LeakyReLU
fc4   | FC(64, 1)-sigmoid
The details of the classification network and network D are shown in Table 1 and Table 2 respectively. The learning rate is 0.01. Mutual information is estimated for the pairs (conv1, conv2), (conv2, conv3) and (conv3, conv4). For each pair of convolutional feature maps, the upsampling step uses nearest-neighbor interpolation. The size of the input image is 32x32. The classification network and network D are trained end-to-end simultaneously.

4.3 Evaluation protocol
For the shiftMNIST dataset, one may argue that only the digit feature should be considered the correct feature for label prediction in the training set. However, as stated in [7], preferring digit features can be viewed as a kind of human prior. Our model has no such prior, so both the digit feature and the texture feature may be viewed as discriminative features; it is left to the optimization dynamics to choose which feature becomes the final predictor. Note that the digit label and the texture label are identical for a given training image. In the training stage, we select the model with the best digit validation accuracy and the model with the best texture validation accuracy to observe how the optimization dynamics influence what is learned. The optimal test classification accuracy should be around 50%, since the classification model is not aware of whether the test task is digit classification or texture classification and should therefore learn both features equally.
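The selection rule above can be written as a small helper; the accuracy values in the history are illustrative, not the paper's actual training logs:

```python
def select_checkpoints(history):
    """Pick the epochs with the best digit and the best texture
    validation accuracy (the two model snapshots compared in Table 3).

    `history` is a list of (epoch, digit_val_acc, texture_val_acc) tuples.
    """
    best_digit = max(history, key=lambda h: h[1])
    best_texture = max(history, key=lambda h: h[2])
    return best_digit[0], best_texture[0]

history = [(1, 0.30, 0.80), (2, 0.45, 0.90), (3, 0.40, 0.95)]
print(select_checkpoints(history))  # -> (2, 3)
```

Because texture features are learned faster, the two criteria typically select different epochs, which is what makes the comparison between the two snapshots informative.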
4.4 Results
The classification results are shown in Table 3. We first train a baseline model without IFM blocks and keep two snapshots: the one with the best digit validation accuracy and the one with the best texture validation accuracy. Their test accuracies are shown in the first two rows of Table 3. For both snapshots, the prediction accuracy on digits is only slightly above 10%, which is close to random guessing. This means that both models ignore the digit structure as a discriminative feature. The prediction accuracies on texture are above 95%, which means the final representations are dominated by texture features. The results of the baseline models demonstrate that a model trained in the vanilla way learns only part of the discriminative features and ignores other correlated features. In these experiments, the baseline models are sensitive to texture features, which is in accordance with the observations in [5].
The benefit of applying IFM is shown in the bottom two rows. The digit classification accuracies are much higher than those of the baseline models, which indicates that our models indeed learn the digit structure as a discriminative feature. The test digit accuracy of the snapshot selected by texture validation accuracy is lower than that of the snapshot selected by digit validation accuracy because digit structure features are harder to learn than texture features: when the texture features are well learned (high texture validation accuracy), the learning of digit features may still be halfway. Our IFM model also outperforms the model in [9], which is a flow-based model with no information loss. This implies that IFM can be viewed as a potential alternative to flow-based models for reducing information loss in deep networks.
Table 3: Test accuracies on shiftMNIST.

Model | acc (digit) | acc (texture)
baseline (best digit val.)   | 12.44% | 95.07%
baseline (best texture val.) | 12.05% | 96.44%
iCE fi-RevNet [9]            | 40.01% | -
IFM (best digit val.)        | 54.54% | 40.41%
IFM (best texture val.)      | 31.78% | 69.00%
5 Conclusion
In this work, we propose to maximize the information flow in convolutional neural networks as a form of regularization. The benefit of this regularization is that the model can find correlated features that are difficult to disentangle. Thus, the learned representations are more informative and generalizable than representations learned by conventional training without this information regularization term. Our future work will focus on applying the proposed information flow maximization to natural image classification tasks.
References
[1] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
[2] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
[3] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
[4] M. Gabrié, A. Manoel, C. Luneau, N. Macris, F. Krzakala, L. Zdeborová, et al. Entropy and mutual information in models of deep neural networks. In Advances in Neural Information Processing Systems, pages 1821–1831, 2018.
[5] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[6] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
[7] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry. Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175, 2019.
[8] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 448–456, 2015.
[9] J.-H. Jacobsen, J. Behrmann, R. Zemel, and M. Bethge. Excessive invariance causes adversarial vulnerability. arXiv preprint arXiv:1811.00401, 2018.
[10] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
[11] R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
[12] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
[13] N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2015.
[14] S. Zhao, J. Song, and S. Ermon. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.