Neural networks perform remarkably well on numerous tasks even when initialized randomly and trained with Stochastic Gradient Descent (SGD) (see Krizhevsky et al. (2012)). Deeper models, like GoogLeNet (Szegedy et al. (2015)) and Deep Residual Networks (He et al. (2015a)), are released each year, providing impressive results and even surpassing human performance on well-known datasets such as ImageNet (Russakovsky et al. (2015)). This would not have been possible without the help of regularization and initialization techniques, which address the overfitting and convergence problems usually caused by data scarcity and the growing size of the architectures.
From the literature, two different regularization strategies can be distinguished. The first consists in reducing the complexity of the model by (i) reducing the effective number of parameters with weight decay (Nowlan & Hinton (1992)), and (ii) randomly dropping activations with Dropout (Srivastava et al. (2014)) or dropping weights with DropConnect (Wan et al. (2013)) so as to prevent feature co-adaptation. Although this set of strategies has proved to be very effective, by its nature it does not leverage the full capacity of the models it regularizes.
The second group comprises regularizers that improve the effectiveness and generality of the trained model without reducing its capacity. In this group, the most relevant approaches decorrelate the weights or feature maps; e.g., Bengio & Bergstra (2009) introduced a new criterion to learn slow, decorrelated features while pre-training models. In the same line, Bao et al. (2013) presented "incoherent training", a regularizer that reduces the correlation of the network activations or feature maps in the context of speech recognition. Although regularizers in this second group are promising and have already been used to reduce overfitting in different tasks, even in the presence of Dropout (as shown by Cogswell et al. (2016)), they are seldom used in the large-scale image recognition domain because of the small improvement margins they provide and the computational overhead they introduce.
Although not directly presented as regularizers, there are other strategies that reduce overfitting, such as Batch Normalization (Ioffe & Szegedy (2015)), which decreases overfitting by reducing the internal covariate shift. In the same line, initialization strategies such as "Xavier" (Glorot & Bengio (2010)) or "He" (He et al. (2015b)) keep the same variance at both the input and the output of each layer in order to preserve propagated signals in deep neural networks. Orthogonal initialization techniques are another family, which set the weights in a decorrelated initial state so as to condition the network training to converge to better representations. For instance, Mishkin & Matas (2016) propose to initialize the network with decorrelated features using orthonormal initialization (Saxe et al. (2013)) while also normalizing the variance of the outputs.
In this work we hypothesize that regularizing negatively correlated features is an obstacle to achieving better results, and we introduce OrthoReg, a novel regularization technique that addresses the performance-margin issue by regularizing only positively correlated feature weights. Moreover, OrthoReg is computationally efficient since it regularizes only the feature weights, which makes it very suitable for the latest CNN models. We verify our hypothesis through a series of experiments: first using MNIST as a proof of concept, and second by regularizing wide residual networks on CIFAR-10, CIFAR-100, and SVHN (Netzer et al. (2011)), achieving, to the best of our knowledge, the lowest error rates reported on these datasets.
2 Dealing with weight redundancies
Deep Neural Networks (DNNs) are very expressive models which can easily have millions of parameters. However, with limited data, they tend to overfit. There are abundant techniques to deal with this problem, from L1 and L2 regularization (Nowlan & Hinton (1992)) to early stopping, Dropout, or DropConnect. Models presenting high levels of overfitting usually have a lot of redundancy in their feature weights, capturing similar patterns with slight differences that usually correspond to noise in the training data. A particular case where this is evident is AlexNet (Krizhevsky et al. (2012)), which presents very similar convolution filters and even "dead" ones, as remarked by Zeiler & Fergus (2014).
In fact, given a set of parameters connecting a set of inputs to a neuron, two neurons i and j will be positively correlated, and thus always fire together, if their weight vectors satisfy θi = θj, and negatively correlated if θi = −θj. In other words, two neurons with the same or only slightly different weights will produce very similar outputs. In order to reduce the redundancy present in the network parameters, one should maximize the amount of information encoded by each neuron. From an information-theoretic point of view, this means one should not be able to predict the output of a neuron given the outputs of the rest of the neurons of the layer. However, this measure requires batch statistics and huge joint probability tables, and it would have a high computational cost.
In this paper, we focus on weight correlation rather than activation independence, since it is still an open problem in many neural network models and it can be addressed without introducing too much overhead; see Table 1. We then show that models generalize better when different feature detectors are enforced to be dissimilar. Although it might seem contradictory, CNNs can benefit from having repeated filter weights with different biases, as shown by Li et al. (2016). However, those repeated filters must be shared copies: adding too many unshared filter weights to CNNs increases overfitting and the need for stronger regularization (Zagoruyko & Komodakis (May 2016)). Thus, our proposed method and multi-bias neural networks are complementary, since they jointly increase the representation power of the network with fewer parameters.
In order to find a good optimization target for reducing the correlation between weights, it is first necessary to choose a metric to measure it. In this paper, we propose to use the cosine similarity between feature detectors to express the strength of their relationship. Note that the cosine similarity is equivalent to the Pearson correlation for mean-centered normalized vectors, but we will use the term correlation for the sake of clarity.
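As a concrete illustration of this metric (a minimal NumPy sketch with made-up weight values, not the paper's code):

```python
import numpy as np

def cosine_similarity_matrix(w):
    """Pairwise cosine similarity between the rows of w,
    where each row is one feature detector's weight vector."""
    w_norm = w / np.linalg.norm(w, axis=1, keepdims=True)
    return w_norm @ w_norm.T

# Three toy detectors: two orthogonal ones and a redundant diagonal one.
w = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
s = cosine_similarity_matrix(w)
# s[0, 1] == 0 (orthogonal), s[0, 2] == cos(45 deg) ~ 0.707 (correlated)
```

Note that the similarity is invariant to the magnitude of each detector, which is the property exploited below to regularize only the angles.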
2.1 Orthogonal weight regularization
This section introduces the orthogonal weight regularization, a regularization technique that aims to reduce feature-detector correlation by enforcing local orthogonality between all pairs of weight vectors. In order to keep the magnitudes of the detectors unaffected, we choose the cosine similarity between the vector pairs so as to focus solely on the vectors' angle. Then, given any pair of feature vectors θi, θj of the same size, the cosine of their relative angle is:

cos(θi, θj) = ⟨θi, θj⟩ / (‖θi‖ ‖θj‖)

where ⟨θi, θj⟩ denotes the inner product between θi and θj. We then square the cosine similarity in order to define a regularization cost function for steepest descent that has its local minima when vectors are orthogonal:

C(θ) = Σi Σj≠i cos²(θi, θj) = Σi Σj≠i ( ⟨θi, θj⟩ / (‖θi‖ ‖θj‖) )²
where θi are the weights connecting the output of a layer to the i-th neuron of the next layer, which has n hidden units. Interestingly, minimizing this cost function relates to minimizing the Frobenius norm of the cross-covariance matrix without its diagonal. This cost is added to the global cost of the model L(x, y), where x are the inputs and y the labels or targets, obtaining L'(x, y) = L(x, y) + γ C(θ). Note that γ is a hyperparameter that weights the relative contribution of the regularization term. We can now define the gradient of C with respect to the parameters.
The derivative of the cosine introduces a second term due to the magnitude normalization. As magnitudes are not relevant for the vector-angle problem, the update can be simplified by assuming normalized feature detectors, yielding a step of the form θi ← θi − η ∇θi L − γ Σj≠i ⟨θi, θj⟩ θj, where η is the global learning rate coefficient and L is any target loss function for the backpropagation algorithm.
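The loss and its simplified gradient can be sketched in NumPy as follows (our own illustration, assuming unit-norm detectors; the constant factor in the gradient would be absorbed by the regularization rate γ in practice):

```python
import numpy as np

def global_orthoreg_loss(w):
    """Sum of squared cosine similarities over all ordered pairs i != j,
    assuming the rows of w (the feature detectors) are L2-normalized."""
    s = w @ w.T
    np.fill_diagonal(s, 0.0)   # ignore each detector against itself
    return np.sum(s ** 2)

def global_orthoreg_grad(w):
    """Exact gradient of the loss above with respect to w; the constant
    factor would fold into the regularization rate gamma in practice."""
    s = w @ w.T
    np.fill_diagonal(s, 0.0)
    return 4.0 * s @ w         # each unordered pair appears twice in the sum

# Two unit detectors at 45 degrees.
w = np.array([[1.0, 0.0],
              [0.5 ** 0.5, 0.5 ** 0.5]])
loss = global_orthoreg_loss(w)   # 2 * cos(45 deg)**2 = 1.0
```

The gradient for detector i is a sum of the other detectors weighted by their cosine similarity to it, i.e. correlated detectors push each other apart.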
Although this update can be done sequentially for each feature-detector pair, it can be vectorized to speed up computation. Let Θ be a matrix in which each row is a feature detector, corresponding to the normalized weights connecting the whole input of the layer to one neuron. Then ΘΘᵀ contains the inner product of each pair of detectors ⟨θi, θj⟩ in position (i, j). Subsequently, we subtract the diagonal so as to ignore the angle of each feature with respect to itself, and multiply by Θ to compute the final value corresponding to the sum in eq. 5, obtaining a regularization gradient of the form (ΘΘᵀ − I)Θ = ΘΘᵀΘ − Θ, where the second term, Θ, results from the diagonal subtraction. Algorithm 1 summarizes the steps needed to apply OrthoReg.
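One regularization step of Algorithm 1 might then look as follows (a plausible NumPy sketch under our reading of the vectorized update; the magnitude recovery mirrors the procedure used in the verification experiments of Section 3.1):

```python
import numpy as np

def orthoreg_global_step(w, gamma):
    """One vectorized step of the global regularizer.

    w: (n, d) matrix, one feature detector per row. The step is applied on
    normalized detectors and the original magnitudes are restored afterwards,
    so only the angles between detectors change."""
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    wn = w / norms
    s = wn @ wn.T
    np.fill_diagonal(s, 0.0)        # drop each detector's self-similarity
    wn = wn - gamma * (s @ wn)      # push correlated detectors apart
    wn /= np.linalg.norm(wn, axis=1, keepdims=True)
    return wn * norms               # recover the original magnitudes

# Two strongly correlated toy detectors.
w = np.array([[2.0, 0.0],
              [1.5, 0.5]])
for _ in range(200):
    w = orthoreg_global_step(w, gamma=0.1)
# The detectors end up (nearly) orthogonal while keeping their magnitudes.
```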
2.2 Negative Correlations
Note that the presented algorithm, based on the cosine similarity, penalizes any kind of correlation between all pairs of feature detectors, i.e. both positive and negative correlations; see Figure 1. However, negative correlations are related to inhibitory connections, competitive learning, and self-organization. In fact, there is evidence that negative correlations can help a neural population to increase the signal-to-noise ratio in V1 (Chelaru & Dragoi (2016)). In order to exploit the advantages of keeping negative correlations, we propose to use an exponential to squash the gradients for angles greater than 90°. Here λ is a coefficient that controls the minimum angle of influence of the regularizer, i.e. the minimum angle between two feature weights at which there still exists a gradient pushing them apart; see Figure 1. We empirically found a value of λ for which the regularizer works well; see Figure 2. Note that for sufficiently large λ, the loss and the gradients approach zero when vectors are at more than 90° (orthogonal or beyond). As a result of incorporating the squashing function into the cosine similarity, negatively correlated feature weights are not regularized. This differs from all previous approaches and from the loss presented in eq. 2, where all pairs of weight vectors influence each other. Thus, from now on, the loss in eq. 2 is referred to as the global loss and the loss in eq. 7 as the local loss.
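The squashing behavior can be sketched as follows (a hedged NumPy illustration; the softplus form and the default λ are our assumptions for illustration, not a verbatim reproduction of eq. 7):

```python
import numpy as np

def squashed_cos_penalty(c, lam=10.0):
    """Softplus-squashed penalty on a cosine similarity c: large for strongly
    positively correlated detectors, (near) zero at and beyond orthogonality."""
    return np.log1p(np.exp(lam * (c - 1.0)))

def squashed_cos_grad(c, lam=10.0):
    """Derivative of the penalty w.r.t. c: lam * sigmoid(lam * (c - 1))."""
    return lam / (1.0 + np.exp(-lam * (c - 1.0)))

g_pos = squashed_cos_grad(0.9)    # nearly aligned detectors: strong push
g_orth = squashed_cos_grad(0.0)   # orthogonal detectors: almost no push
g_neg = squashed_cos_grad(-0.9)   # negatively correlated: left untouched
```

Larger λ sharpens the cutoff, shrinking the regularizer's angle of influence toward strongly positively correlated pairs only.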
The derivative of eq. 7 squashes each pairwise gradient term by an exponential factor of the corresponding cosine similarity. Then, using the element-wise exponential operator, we define an auxiliary matrix of squashing coefficients in order to simplify the formulas, and the gradient ∂C/∂Θ in vectorial form can be formulated as the product of this coefficient matrix (with zeroed diagonal) and Θ.
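A possible vectorized implementation, again under our softplus reading rather than the paper's exact expression (constant factors would fold into γ):

```python
import numpy as np

def local_orthoreg_grad(w, lam=10.0):
    """Vectorized gradient of the locally constrained regularizer for a matrix
    w of L2-normalized feature detectors (one per row): each pairwise push is
    weighted by lam * sigmoid(lam * (cos - 1)), so only positively correlated
    detectors interact."""
    s = w @ w.T                                    # pairwise cosine similarities
    coef = lam / (1.0 + np.exp(-lam * (s - 1.0)))  # squashing coefficients
    np.fill_diagonal(coef, 0.0)                    # no self-interaction
    return coef @ w

w = np.array([[1.0, 0.0],
              [0.8, 0.6],
              [-1.0, 0.0]])   # the third detector opposes the first
g = local_orthoreg_grad(w)
# Rows 0 and 1 (positively correlated) receive a substantial gradient;
# row 2, negatively correlated with both, receives almost none.
```

A descent step would then be w ← w − γ g, followed by row re-normalization if the angles-only behavior is desired.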
In order to provide a visual example, we created a toy dataset and applied the previous equations with positive and negative regularization rates; see Figure 2. As expected, the angle between all pairs of adjacent feature weights becomes more uniform after regularization. Note that Figure 2 shows that regularization with the global loss (eq. 2) results in less uniform angles than regularization with the local loss (eq. 7), because vectors in opposite quadrants still influence each other. This is why, in Figure 2, the mean nearest-neighbor angle obtained with the global loss (b) is more unstable than with the local loss (c). As a proof of concept, we also performed gradient ascent, which minimizes the angle between the vectors. In this case, Figure 2 shows that the locality introduced by the local loss reaches a stable configuration in which feature weights at more than 90° are too far apart to attract each other.
The effects of the global and local regularizations on AlexNet, VGG-16, and a 50-layer ResNet are shown in Figure 3. As can be seen, OrthoReg reaches higher decorrelation bounds. Lower decorrelation peaks are still observed when the input dimensionality of a layer is smaller than its output dimensionality, since in that case all vectors cannot be orthogonal at the same time. Here, local regularization largely outperforms global regularization, since it removes the interference caused by negatively correlated feature weights. This suggests why increasing the size of fully connected layers has not improved network performance.
3 Experiments
In this section we provide a set of experiments that verify that (i) training with the proposed regularization increases the performance of unregularized baseline models, (ii) negatively correlated feature weights are useful, and (iii) the proposed regularization improves the performance of state-of-the-art models.
3.1 Verification experiments
As a sanity check, we first train a three-hidden-layer Multi-Layer Perceptron (MLP) with ReLU non-linearities on the MNIST dataset (LeCun et al. (1998)). Our code is based on the train-a-digit-classifier example included in torch/demos (https://github.com/torch/demos), which uses an upsampled version of the dataset. The only pre-processing applied to the data is a global standardization. The model is trained with mini-batch SGD; neither momentum nor weight decay is applied. By default, the magnitude of the weights in these experiments is recovered after each regularization step in order to ensure the regularization affects only their angle.
Sensitivity to hyperparameters. We train a three-hidden-layer MLP with 1024 hidden units per layer and different γ and λ values so as to verify how they affect the performance of the model. Figure 4 shows that the model achieves the best error rate for the highest γ value, confirming the advantages of the regularization. In Figure 4 we verify that higher regularization rates produce more general models. Figure 5 depicts the sensitivity of the model to λ. As expected, the best value is found when λ corresponds to orthogonality.
Negative correlations. Figure 5 highlights the difference between regularizing with the global and the local regularizer. Although both reach better error rates than the unregularized counterpart, the local regularization outperforms the global one. This confirms the hypothesis that negative correlations are useful and, thus, that performance decreases when we reduce them.
Compatibility with initialization and dropout. To demonstrate that the proposed regularization can help even when other regularizations are present, we trained a CNN with (i) dropout (c32-c64-l512-d0.5-l10, where cN denotes a convolution with N filters, lN a fully-connected layer with N units, and dp dropout with probability p) or (ii) LSUV initialization (Mishkin & Matas (2016)). Table 2 shows that the best results are obtained when orthogonal regularization is present. The results are consistent with the hypothesis that OrthoReg, like Dropout and LSUV, focuses on reducing model redundancy. Thus, when one of them is present, the margin of improvement for the others is reduced.
3.2 Regularization on CIFAR-10 and CIFAR-100
We show that the proposed OrthoReg can help to improve the performance of state-of-the-art models such as deep residual networks (He et al. (2015a)). In order to show that the regularization is suitable for deep CNNs, we successfully regularize a 110-layer ResNet (implementation: https://github.com/gcr/torch-residual-networks) on CIFAR-10, decreasing its error from 6.55% to 6.29% without data augmentation.
In order to compare with the most recent state of the art, we train a wide residual network (Zagoruyko & Komodakis (November 2016)) on CIFAR-10 and CIFAR-100. The experiment is based on a Torch implementation of the 28-layer, width-factor-10 wide residual model, whose reference median error rates on CIFAR-10 and CIFAR-100 are reported by the authors (https://github.com/szagoruyko/wide-residual-networks). As can be seen in Figure 6, regularizing with OrthoReg yields the best test error rates compared to the baselines.
The regularization coefficient was chosen by grid search, although similar values were found for all the experiments, especially when the regularization gradients are normalized before being added to the weights. The regularization was applied equally to all the convolution layers of the (wide) ResNets. We found that, although the regularized models were already using weight decay, dropout, and batch normalization, the best error rates were always achieved with OrthoReg.
Table 3: Test error (%) on CIFAR-10 and CIFAR-100 compared with the state of the art.

| Method | CIFAR-10 | CIFAR-100 | Augmented |
| Maxout (Goodfellow et al. (2013)) | 9.38 | 38.57 | Yes |
| NiN (Lin et al. (2014)) | 8.81 | 35.68 | Yes |
| DSN (Lee et al. (2015)) | 7.97 | 34.57 | Yes |
| Highway Network (Srivastava et al. (2015)) | 7.60 | 32.24 | Yes |
| All-CNN (Springenberg et al. (2015)) | 7.25 | 33.71 | No |
| 110-Layer ResNet (He et al. (2015a)) | 6.61 | 28.4 | No |
| ELU-Network (Clevert et al. (2016)) | 6.55 | 24.28 | No |
| OrthoReg on 110-Layer ResNet* | – | – | No |
| LSUV (Mishkin & Matas (2016)) | 5.84 | – | Yes |
| Fract. Max-Pooling (Graham (2014)) | – | – | – |
| Wide ResNet v1 (Zagoruyko & Komodakis (May 2016))* | – | – | Yes |
| OrthoReg on Wide ResNet v1* | – | – | Yes |
| Wide ResNet v2 (Zagoruyko & Komodakis (November 2016))* | – | – | Yes |
| OrthoReg on Wide ResNet v2* | – | – | Yes |
Table 3 compares the performance of the regularized models with other state-of-the-art results. As can be seen, the regularized model surpasses the state of the art, with relative error improvements on both CIFAR-10 and CIFAR-100.
3.3 Regularization on SVHN
Table 4: Test error (%) on SVHN.

| Method | SVHN |
| NiN (Lin et al. (2014)) | 2.35 |
| DSN (Lee et al. (2015)) | 1.92 |
| Stochastic Depth ResNet (Huang et al. (2016)) | 1.75 |
| Wide ResNet (Zagoruyko & Komodakis (May 2016)) | 1.64 |
| OrthoReg on Wide ResNet | 1.54 |
For SVHN we follow the procedure described in Zagoruyko & Komodakis (May 2016), training a wide residual network of depth 28 and width 4 with dropout. Results are shown in Table 4. As can be seen, we reduce the error rate from 1.64% to 1.54%, which is, to the best of our knowledge, the lowest value reported on this dataset.
4 Conclusions
Regularization by feature decorrelation can reduce neural network overfitting even in the presence of other kinds of regularization. However, especially when the number of feature detectors is higher than the input dimensionality, its decorrelation capacity is limited by the effects of negatively correlated features. We showed that imposing locality constraints on feature decorrelation removes interference between negatively correlated feature weights, allowing regularizers to reach higher decorrelation bounds and reducing overfitting more effectively.
In particular, we showed that models trained with the constrained regularization present less overfitting even when batch normalization and dropout are present. Moreover, since our regularization is performed directly on the weights, it is especially suitable for fully convolutional neural networks, where the weight space is constant compared to the feature-map space. As a result, we are able to reduce the overfitting of 110-layer ResNets and wide ResNets on CIFAR-10, CIFAR-100, and SVHN, improving their performance. Note that although OrthoReg consistently improves state-of-the-art ReLU networks, the choice of activation function could affect regularizers like the one presented in this work. In this sense, the effect of asymmetric activations on feature correlations and regularizers should be further investigated in the future.
The authors acknowledge the support of the Spanish project TIN2015-65464-R (MINECO/FEDER), the 2016FI_B 01163 grant of Generalitat de Catalunya, and the COST Action IC1307 iV&L Net (European Network on Integrating Vision and Language) supported by COST (European Cooperation in Science and Technology). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU and a GTX TITAN GPU used for this research.
- Bao et al. (2013) Yebo Bao, Hui Jiang, Lirong Dai, and Cong Liu. Incoherent training of deep neural networks to de-correlate bottleneck features for speech recognition. In 2013 IEEE ICASSP, pp. 6980–6984. IEEE, 2013.
- Bengio & Bergstra (2009) Yoshua Bengio and James S Bergstra. Slow, decorrelated features for pretraining complex cell-like networks. In NIPS, pp. 99–107, 2009.
- Chelaru & Dragoi (2016) Mircea I Chelaru and Valentin Dragoi. Negative correlations in visual cortical networks. Cerebral Cortex, 26(1):246–256, 2016.
- Clevert et al. (2016) Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). ICLR, 2016.
- Cogswell et al. (2016) Michael Cogswell, Faruk Ahmed, Ross Girshick, Larry Zitnick, and Dhruv Batra. Reducing overfitting in deep networks by decorrelating representations. ICLR, 2016.
- Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
- Goodfellow et al. (2013) Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In ICML, pp. 1319–1327, 2013.
- Graham (2014) Benjamin Graham. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014.
- He et al. (2015a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015a.
- He et al. (2015b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, pp. 1026–1034, 2015b.
- Huang et al. (2016) Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382, 2016.
- Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456, 2015.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, pp. 1106–1114, 2012.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Lee et al. (2015) Chen-Yu Lee, Saining Xie, Patrick W. Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Guy Lebanon and S. V. N. Vishwanathan (eds.), AISTATS, volume 38 of JMLR Proceedings. JMLR.org, 2015.
- Li et al. (2016) Hongyang Li, Wanli Ouyang, and Xiaogang Wang. Multi-bias non-linear activation in deep neural networks. arXiv preprint arXiv:1604.00676, 2016.
- Lin et al. (2014) Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, March 2014.
- Mishkin & Matas (2016) Dmytro Mishkin and Jiri Matas. All you need is a good init. ICLR, 2016.
- Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
- Nowlan & Hinton (1992) Steven J. Nowlan and Geoffrey E. Hinton. Simplifying neural networks by soft weight-sharing. Neural computation, 4(4):473–493, 1992.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
- Saxe et al. (2013) Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, December 2013.
- Springenberg et al. (2015) Jost T. Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. In ICLR (workshop track), 2015.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
- Srivastava et al. (2015) Rupesh K. Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In NIPS, pp. 2368–2376, 2015.
- Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pp. 1–9, 2015.
- Wan et al. (2013) Li Wan, Matthew D Zeiler, Sixin Zhang, Yann L Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In ICML, pp. 1058–1066, 2013.
- Zagoruyko & Komodakis (May 2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, May 2016.
- Zagoruyko & Komodakis (November 2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, November 2016.
- Zeiler & Fergus (2014) Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, pp. 818–833. 2014.