1 Introduction
Hyperparameter selection in deep neural networks is mostly done by experimentation across datasets and models. A key example of one such hyperparameter, and the subject of this paper, is the learning rate. High learning rates, particularly in early training stages, can result in instabilities and fluctuations in the parameter search process. Established procedures that set the learning rate to a low value at the beginning and then gradually “warm up” to the desired learning rate have been used effectively (He et al., 2016; Goyal et al., 2017). These approaches require the a priori definition of a policy or schedule, and the learning rate changes according to that fixed policy. The fixed policy may not suit different datasets or model architectures, which may differ greatly in complexity. The same policy may also not suffice if the available compute resources, or other factors, necessitate changes in other training hyperparameters (e.g. batch size) during training. Dynamically setting the learning rate through the training cycle is one way of handling such issues; this paper studies the feasibility of Mutual Information (MI) (Cover and Thomas, 2006) as a metric to realize this objective.
2 Related work
Adaptive learning rate schedules based on gradients have been proposed in various gradient-descent based optimization algorithms used for training deep neural networks (Ruder, 2016). These include the likes of AdaGrad, AdaDelta, RMSprop, Adam and some more recent algorithms. These set learning rates at the level of individual parameters by considering the frequency or magnitude of updates; slow or infrequent updates, characterized by smaller past gradients, get more importance than fast or frequent updates, characterized by larger past gradients. Depending on the dataset and model complexity, careful initial selection of the learning rate may still be required.
Recent works of Tishby et al. (Tishby and Zaslavsky, 2015; Shwartz-Ziv and Tishby, 2017) have attempted to explain deep learning on the basis of the Information Bottleneck (IB) (Tishby et al., 1999). In particular, the recent paper of Shwartz-Ziv and Tishby (2017) made strong and wide-ranging claims on aspects relating to phases in deep learning, the causal relationship between compression and generalization, and the basis for compression in deep learning. Some of these claims were subsequently countered in Saxe et al. (2018), while acknowledging the potential of the more general MI and IB concepts. The current paper builds on this body of literature.

An important and related problem is that of estimating MI. The problem has been widely studied and several estimation methods exist (Walters-Williams and Li, 2009). One such class of methods comprises nearest-neighbor approaches; these have been shown (Doquire and Verleysen, 2012; Khan et al., 2007) to be effective with high-dimensional data and at large sample sizes. A widely cited example of this class of algorithms is the Kraskov-Stögbauer-Grassberger (KSG) estimator (Kraskov et al., 2004). This approach is used for MI estimation in this paper; other algorithms can also be used. Recent developments in the area include Kolchinsky and Tracey (2017), which estimates MI using pairwise distances between Gaussian mixture components, and Belghazi et al. (2018), which estimates MI through gradient descent over neural networks. Their properties and suitability to the context of this paper will be studied in future work.

This work seeks to understand the operational utility of MI as a metric for deep learning and specifically, for dynamic setting of hyperparameters (in this paper, the learning rate). The essential core of a typical deep neural network training pipeline remains unchanged, i.e. the use of mini-batch Stochastic Gradient Descent (SGD) to optimize (minimize) the training cost (e.g. misclassification error) is maintained, but the learning rate is set adaptively considering the MI of hidden layer activations with the true output. The paper effectively demonstrates that an information-driven “warmup” and subsequent “cooldown” of the learning rate can produce competitive outcomes on standard datasets; it does not attempt to prescribe a specific learning rate policy.
3 Approach
In (deep) neural network models, MI lends a layer-specific measure that may be utilized in multiple ways. It may potentially be used as a metric for parameter optimization through standard optimization methods (Tishby and Zaslavsky, 2015); works such as Shamir et al. (2010) suggested MI as providing an upper bound for prediction error. It may serve as a basis for dynamic tuning of network-level hyperparameters. Further, as a layer-specific measure, it may be utilized for layer-wise dynamic tuning of hyperparameters. Based on the information metric per epoch, interventions in hyperparameters may be used to steer the learning process towards efficient and effective deep learning.
Computing MI is generally computationally expensive; performing MI computation after each epoch in deep neural net training can prove to be infeasible. This paper relies on two ideas to use MI effectively when training with large datasets: (1) use a randomly selected subset of data for MI computation; plotting the MI vs sample size curve for different datasets (see Figure 1) enables informed selection of an appropriate subset sample size for per-epoch MI computation; and (2) this approximate MI value may suffice if relative measures can be utilized for the problem being addressed.
Figure 1: MI (of input and output training data) vs sample size for MNIST (left) and CIFAR-10 (right), as computed using the KSG estimator. The figures show the estimated mean and standard deviation (error bar) for each sample size tested. Experiments in this paper use a sample size of 1000.
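The subsampling idea in (1) above can be sketched as follows. This is a minimal sketch, not the paper's implementation: it uses a KSG-style k-nearest-neighbour estimator adapted for a discrete target (continuous activations, discrete class labels), and the function and parameter names are our own.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def mi_continuous_discrete(x, y, k=3, sample_size=None, seed=0):
    """Estimate I(X;Y) for continuous x (e.g. activations) and discrete
    labels y with a KSG-style k-nearest-neighbour estimator, optionally
    on a random subset of the data. Assumes each class has > k samples."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y)
    if sample_size is not None and sample_size < len(x):
        idx = np.random.default_rng(seed).choice(len(x), sample_size, replace=False)
        x, y = x[idx], y[idx]
    n = len(x)
    full_tree = cKDTree(x)
    n_label = np.empty(n)  # number of samples sharing each point's label
    m = np.empty(n)        # neighbours (any label) within the class-kNN radius
    for label in np.unique(y):
        mask = y == label
        xc = x[mask]
        # distance to the k-th same-label neighbour (k+1 because the
        # query point itself is returned at distance 0)
        d, _ = cKDTree(xc).query(xc, k=k + 1)
        radius = np.nextafter(d[:, -1], 0)  # shrink slightly for a strict count
        m[mask] = full_tree.query_ball_point(xc, radius, return_length=True)
        n_label[mask] = mask.sum()
    mi = digamma(n) - np.mean(digamma(n_label)) + digamma(k) - np.mean(digamma(m))
    return max(0.0, mi)  # MI is non-negative; clip estimator noise
```

For two well-separated, balanced classes the estimate approaches ln 2 ≈ 0.69 nats, and it approaches 0 when labels carry no information about x; running this once per epoch on a fixed-size subset (e.g. 1000 samples, as in the paper's experiments) keeps the per-epoch cost bounded.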
Given input X, output Y and hidden layer activations T_i, the MI between input and output is denoted as I(X;Y). The MI between hidden layer activations and output is denoted as I(T_i;Y); specifically, the MI between the last (output) layer activations T_L and the output is denoted as I(T_L;Y). Finally, the MI between hidden layer activations and input is denoted as I(T_i;X).
In the context of neural networks, the Data Processing Inequality (DPI) (Cover and Thomas, 2006) effectively provides an upper bound to the information that each layer (including the output layer) of a neural network can capture. It states that successive layers operating on the input data cannot increase its information content relative to the output data, i.e. I(X;Y) ≥ I(T_1;Y) ≥ ... ≥ I(T_L;Y); as a specific case, I(X;Y) ≥ I(T_L;Y).
With I(X;Y) computed from input data X and output data Y, this inequality may hold true for dense neural networks but will not hold true for convolutional neural networks (CNNs). The use of multiple (say k) convolution filters in CNNs is akin to treating image inputs tiled together, with the respective convolution filter weights being mapped to the weights of a much larger dense neural network. Thus, a reasonable estimate of the upper bound may be obtained by repeating or tiling X and Y, k times, and then computing the MI estimate on the tiled data.

Shwartz-Ziv and Tishby (2017) suggest the ratio I(T_L;Y)/I(X;Y) as a measure of the amount of information captured by the model. The KSG estimator (Kraskov et al., 2004) incorporates a small amount of noise (a jitter) in MI computation to overcome degeneracies in data. Saxe et al. (2018) observe that the DPI will not hold when noise is added for the purposes of measuring MI. As a consequence of the DPI not being valid, I(X;Y) may not be an upper bound and I(T_L;Y) > I(X;Y) is a possible outcome. This paper proposes that the upper bound may instead be used as a “soft” criterion for dynamic learning rate setting using MI. Specifically, this paper proposes to increase the learning rate towards achieving the soft upper bound of I(X;Y), and to use the DPI violation condition (I(T_L;Y) > I(X;Y)) as a signal to reduce the learning rate.
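As a concrete reading of the tiling idea above, a sketch is given below. The paper does not spell out the axis along which X is tiled; tiling each flattened input along the feature dimension, to mimic the "larger dense network" view, is our assumption, and the helper name is hypothetical.

```python
import numpy as np

def tiled_inputs(x, k_filters):
    """Tile each flattened input k_filters times along the feature axis,
    mimicking the larger dense network implied by a CNN with k_filters
    convolution filters. The result can be fed to any I(X;Y) estimator
    together with the (unchanged) labels y."""
    x = np.asarray(x).reshape(len(x), -1)
    return np.tile(x, (1, k_filters))
```

For example, a batch of 3 images of shape 2x2 tiled with k_filters = 4 becomes a (3, 16) array, on which the reference estimate I(X;Y) can then be computed.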
While adaptive learning rate schedules have been developed using gradients (Ruder, 2016) and can, in principle, be developed using other measures such as validation accuracy, the use of MI as a criterion for dynamic tuning of hyperparameters is motivated by this measure being able to capture both linear and nonlinear dependence between the quantities of interest (in our case, hidden layer activations and the true output) and by its offering a layer-wise measure of optimality (I(T_i;Y)) with respect to a reference measure (I(X;Y)). This paper uses standard deep neural net training design choices (e.g. mini-batch SGD to minimize misclassification error) while setting the learning rate to maximize information with respect to the true outcomes, i.e. maximize I(T_L;Y).
Given the soft upper bound of I(X;Y), training the neural network increases the information of the last layer with respect to the true outcomes until it finally saturates at a maximum. This trend is also observed for previous layers, though the change over the training cycle may be less dramatic and the value is typically lower. This observation serves as the basis for dynamically setting the learning rate (LR) in this paper. Two approaches are explored:

1. Tracking the change in I(T_L;Y), denoted by ΔI(T_L;Y), relative to its value. This uses the behavior that when the information measure saturates, the change in the measure diminishes. The relative change measure ΔI(T_L;Y)/I(T_L;Y) → ε as I(T_L;Y) saturates; ε is a small number.

2. Tracking I(T_L;Y) and ΔI(T_L;Y) relative to I(X;Y). LR increases and decreases are set in terms of the relative measure so as to maximize I(T_L;Y) relative to I(X;Y); this measure, coupled with the relative change between epochs, is used to decide on increases or decreases in LR.

Both approaches require the specification of a minimum and maximum LR (lower and upper bounds) and begin from the minimum value; experiments in this paper set these bounds around the desired LR. The first approach increases LR while the relative change in I(T_L;Y) is significant (relative to a threshold ε); thereafter LR decreases. The second approach tracks both the value and the change in I(T_L;Y) between epochs, relative to I(X;Y). LR increases while significant changes in I(T_L;Y) occur and I(T_L;Y) ≤ I(X;Y); thereafter, it decreases. In both cases, LR changes incrementally to enable gradual changes. LR increases may be performed at the same rate as decreases or may be dampened to control the maximum LR reached. The LR policies used in the experiments are provided in the appendix. Note that this paper does not propose a specific learning rate policy; it focuses on demonstrating that MI can be used to dynamically set the LR to achieve competitive outcomes.
4 Experiments
Two standard datasets were used to demonstrate dynamic LR setting using MI: MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky, 2009). The emphasis (model and design choices) of these experiments was not on maximizing accuracy for the dataset but on understanding the relative performance of dynamic LR setting (using MI) in comparison to alternatives. The model used for the experiments with MNIST data is shown in Figure 2; these experiments used standard off-the-shelf mini-batch SGD to minimize misclassification error (categorical cross-entropy). For CIFAR-10 experiments, the model used was based on Springenberg et al. (2015). The specific model implementation used dropout (50%) only after max-pooling layers, no L2 regularization for weights, a fixed momentum value of 0.9 and Nesterov acceleration. With a continuously decaying LR beginning at 0.01, it resulted in maximum test accuracies of 88.21% and 91.23%, without and with data augmentation respectively, sufficiently close to the corresponding outcomes (90.92% and 92.75%) over 350 epochs in the referenced paper. Experiments of this paper used the implementation without any data augmentation to enable a fair comparison between methods.
Figures 3 and 4 show the outcomes of various LR selection methods applied to the MNIST and CIFAR-10 datasets respectively; this experiment compared the use of a fixed (desired) LR, a warmup beginning from a lower LR and increasing to the desired LR, and the dynamic LR methods tracking either the change in I(T_L;Y) relative to its value or both the value and the change in I(T_L;Y) relative to the reference measure I(X;Y). The CIFAR-10 experiment also adds a decaying LR policy to the comparison. The following trends were observed:

Training deep neural networks results in increasing I(T_L;Y). This is an intuitive and expected outcome of successful training.

There is a nonlinear and increasing relationship between test accuracy and I(T_L;Y); increasing I(T_L;Y) results in higher test accuracy.

Achieving maximum I(T_L;Y) does not necessarily guarantee (from the existing plots) maximum test accuracy but does give a very competitive test accuracy. Per the Information Bottleneck approach, the same value of I(T_L;Y) may be associated with different I(T_L;X), and it is likely that a more compressed (lower I(T_L;X)) model generalizes better.

It appears that an effective warmup should ideally result in a significant increase in I(T_L;Y) towards I(X;Y). The warmup policy used (linear increase to the desired learning rate in 5 epochs) seems effective for MNIST but insufficient for CIFAR-10.

Experiments reported in this paper, and other attempts, suggest that MNIST was able to produce good outcomes with an aggressive (relative to CIFAR-10) warmup and cooldown LR policy; CIFAR-10, on the other hand, required a slow warmup and lower LR values overall.

It is clear that the mutual information of hidden layer activations with respect to the output may be useful for dynamically setting the learning rate through the training process to obtain competitive or better test accuracies in competitive or better time. A policy involving both the change in I(T_L;Y) and its value relative to I(X;Y) produces better outcomes than tracking the former alone. In both MNIST (Figure 3) and CIFAR-10 (Figure 4), dynamic LR using both the change and the value of I(T_L;Y) resulted in top accuracy levels being achieved in roughly half the number of epochs compared to the corresponding fixed LR policy.

A dynamic LR policy based on MI makes training easier in the sense that it moves this hyperparameter selection problem one level up; the problems of specifying a single optimal LR for the entire training cycle, or of specifying an optimal warmup policy to a "good" LR, are overcome by automatically adjusting the LR every epoch, between bounds, to effectively realize an information-driven warmup and cooldown of the LR. Starting from a low value of the learning rate results in a stable search process and outcomes. Competitive outcomes are achieved by exploring a larger space of learning rates than the fixed and warmup strategies, which are both essentially fixed learning rate policies; it is also possible to achieve competitive outcomes using a smaller learning rate for a longer length of time.
Figure 5 shows an application experiment where the dynamic LR concept may be useful. During a training run, if the situation (e.g. availability of compute resources or simply a training design choice) requires an increase in one hyperparameter such as the batch size (BS), the LR would have to be suitably increased or a drop in accuracy may occur. There are guidelines on managing the LR in such scenarios. This experiment, however, demonstrates that dynamic LR based on MI can be used to automatically adjust the LR in such scenarios. The LR policy used here tracked only the value of I(T_L;Y) relative to I(X;Y) for a few epochs; this was done to enable the growth of both the LR and I(T_L;Y) as a consequence of the increased batch size, before resuming the tracking of both the change and the value. Note the higher LR reached in this experiment as compared to the fixed batch size run in Figure 3. Competitive or better outcomes were achieved in competitive or better time. The availability of a reference measure (a soft upper bound) makes MI particularly suited to handle such scenarios, as compared to other more readily available measures.
The use of MI of the last layer alone provides a network-level intervention, i.e. dynamic LR setting for all layers. The key property that mutual information affords is a layer-wise measure of optimality. The methods demonstrated thus far were extended to a layer-wise intervention, i.e. dynamic LR setting of individual layers; this is demonstrated in Figure 6, where each layer's MI with the true outcomes, relative to the reference measure, enables the setting of a layer-specific LR. Competitive outcomes were obtained in competitive time, compared to the outcomes of Figure 3.
5 Conclusion
This paper demonstrated that using Mutual Information (MI) to dynamically set the learning rate through the training cycle, in the context of deep neural networks, is both feasible and produces competitive or better outcomes in competitive or better time. The paper also demonstrated the application of this idea to automatically respond to changes in other hyperparameters such as the batch size, and the extension of the idea to layer-wise dynamic LR tuning through the training cycle. MI lends a layer-wise measure of optimality with respect to a reference value that can be leveraged to effectively steer deep neural network training to competitive or better outcomes.
Appendix
The following policies were used in experiments of this paper. Note that the paper does not attempt to prescribe a specific learning rate policy.
Dynamic LR policy based on the change in I(T_L;Y)

The minimum and maximum LR bounds are selected by the user for each dataset. The LR for the current epoch is decided based on that of the previous epoch and the relative change in I(T_L;Y) with respect to its value. ε is a small number, e.g. 0.01. Separate increase and decrease step parameters allow dampening LR increases relative to decreases, if required; values of 0.1 and 1 were used for MNIST, and 0.003 and 0.003 for CIFAR-10.
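A minimal sketch of this first policy is given below. The multiplicative step form and the parameter names (step_up, step_down) are our assumptions; the paper does not publish its exact update rule.

```python
def lr_policy_change(lr_prev, mi_curr, mi_prev, lr_min, lr_max,
                     eps=0.01, step_up=0.1, step_down=0.1):
    """Sketch of the change-based policy: raise the LR while the relative
    change in I(T_L;Y) with respect to its own value stays above eps,
    and lower it once the measure saturates. LR stays within bounds."""
    rel_change = (mi_curr - mi_prev) / max(mi_curr, 1e-12)
    if rel_change > eps:
        lr = lr_prev * (1.0 + step_up)    # warmup: information still rising
    else:
        lr = lr_prev * (1.0 - step_down)  # cooldown: I(T_L;Y) has saturated
    return min(max(lr, lr_min), lr_max)
```

Setting step_up below step_down reproduces the dampening of increases relative to decreases described above.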
Dynamic LR policy based on the change and value of I(T_L;Y) relative to I(X;Y)

The terms are defined as before. There are effectively two LR regimes, governed by I(T_L;Y) being below or above I(X;Y); in the former, the LR may increase or decrease depending on whether I(T_L;Y) saturates (relative change no greater than ε) or not; the latter case involves LR reductions only. For both MNIST and CIFAR-10, LR increases occurred at the same rate as decreases, with dataset-specific step values; for both datasets, the saturation threshold was set to 0.1.
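The two regimes of this second policy can be sketched as follows; again, the multiplicative step form and the parameter names are our assumptions, not the paper's exact formulation.

```python
def lr_policy_relative(lr_prev, mi_curr, mi_prev, mi_xy, lr_min, lr_max,
                       eps=0.1, step=0.1):
    """Sketch of the relative policy: while I(T_L;Y) is below the soft
    bound I(X;Y), raise the LR as long as the change relative to I(X;Y)
    exceeds eps, and lower it once it saturates; once the bound is
    crossed (DPI violation under noisy estimation), only reduce the LR.
    Increases and decreases use the same step, as in the experiments."""
    rel_change = (mi_curr - mi_prev) / mi_xy
    if mi_curr < mi_xy and rel_change > eps:
        lr = lr_prev * (1.0 + step)  # warmup towards the soft upper bound
    else:
        lr = lr_prev * (1.0 - step)  # saturation or I(T_L;Y) >= I(X;Y)
    return min(max(lr, lr_min), lr_max)
```

Here mi_xy would be the reference estimate I(X;Y) computed once from the (sub-sampled) training data, while mi_curr and mi_prev are the per-epoch estimates of I(T_L;Y).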
References
 He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016. URL http://arxiv.org/abs/1603.05027.
 Goyal et al. [2017] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017. URL http://arxiv.org/abs/1706.02677.
 Cover and Thomas [July 2006] T.M. Cover and J.A. Thomas. Elements of Information Theory, 2nd edition. Wiley, July 2006.
 Ruder [2016] Sebastian Ruder. An overview of gradient descent optimization algorithms. CoRR, abs/1609.04747, 2016. URL http://arxiv.org/abs/1609.04747.
 Tishby and Zaslavsky [2015] N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop (ITW), 2015.
 Shwartz-Ziv and Tishby [2017] R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. CoRR, abs/1703.00810, 2017. URL http://arxiv.org/abs/1703.00810.
 Tishby et al. [1999] N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In The 37th Allerton Conference on Communication, Control, and Computing, 1999.
 Saxe et al. [2018] A.M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B.D. Tracey, and D.D. Cox. On the information bottleneck theory of deep learning. In International Conference on Learning Representations (ICLR), 2018.
 Walters-Williams and Li [2009] J. Walters-Williams and Y. Li. Estimation of mutual information: A survey. In Rough Sets and Knowledge Technology. Springer Berlin Heidelberg, 2009.
 Doquire and Verleysen [2012] G. Doquire and M. Verleysen. A comparison of multivariate mutual information estimators for feature selection. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, 2012.
 Khan et al. [2007] S. Khan, S. Bandyopadhyay, A.R. Ganguly, S. Saigal, D.J. Erickson III, V. Protopopescu, and G. Ostrouchov. Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Physical Review E, 76:026209, 2007.
 Kraskov et al. [2004] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69:066138, 2004.
 Kolchinsky and Tracey [2017] A. Kolchinsky and B.D. Tracey. Estimating mixture entropy with pairwise distances. CoRR, abs/1706.02419, 2017. URL http://arxiv.org/abs/1706.02419.
 Belghazi et al. [2018] I. Belghazi, S. Rajeswar, A. Baratin, R.D. Hjelm, and A.C. Courville. MINE: mutual information neural estimation. CoRR, abs/1801.04062, 2018. URL http://arxiv.org/abs/1801.04062.
 Shamir et al. [2010] O. Shamir, S. Sabato, and N. Tishby. Learning and generalization with the information bottleneck. Theor. Comput. Sci., 411(29-30):2696–2711, 2010.
 LeCun et al. [November 1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
 Krizhevsky [2009] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 Springenberg et al. [2015] J.T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The All Convolutional Net. CoRR, abs/1412.6806, 2015. URL http://arxiv.org/abs/1412.6806.