As the number of layers in a neural network increases, effectively training its parameters becomes a fundamental problem (Larochelle et al. (2009)). Many obstacles challenge the training of neural networks, including vanishing/exploding gradients (Bengio et al. (1994)), saturating activation functions (Xu et al. (2016)) and poor weight initialization (Glorot & Bengio (2010)). Techniques such as unsupervised pre-training (Bengio et al. (2007)), non-saturating activation functions (Nair & Hinton (2010)) and normalization (Ioffe & Szegedy (2015)) target these issues and enable the training of deeper networks. However, stacking more than a dozen layers still leads to a hard-to-train model.
Recently, models such as Residual Networks (He et al. (2015b)) and Highway Neural Networks (Srivastava et al. (2015)) have permitted the design of networks with hundreds of layers. A key idea of these models is to allow information to flow more freely through the layers, by using shortcut connections between each layer’s input and output. This layer design greatly facilitates training, due to shorter paths between the lower layers and the network’s error function. In particular, these models can more easily learn identity mappings in the layers, thus allowing the network to be deeper and learn more abstract representations. Such networks have been highly successful in many computer vision tasks.
On the theoretical side, it is suggested that depth contributes exponentially more to the representational capacity of networks than width (Eldan & Shamir (2015); Telgarsky (2016); Bianchini & Scarselli (2014); Montúfar et al. (2014)). This agrees with the increasing depth of winning architectures on challenges such as ImageNet (He et al. (2015b); Szegedy et al. (2014)).
Increasing the depth of a network significantly increases its representational capacity and consequently its performance, an observation supported by theory (Eldan & Shamir (2015); Telgarsky (2016); Bianchini & Scarselli (2014); Montúfar et al. (2014)) and practice (He et al. (2015b); Szegedy et al. (2014)). Moreover, He et al. (2015b) showed that, by construction, one can increase a network’s depth while preserving its performance. These two observations suggest that it suffices to stack more layers onto a network in order to increase its performance. However, this behavior is not observed in practice even with recently proposed models, in part due to the challenge of training ever deeper networks.
In this work we aim to improve the training of deep networks by proposing a layer design that builds on Residual Networks and Highway Neural Networks. The key idea is to facilitate the learning of identity mappings by introducing a gating mechanism to the shortcut connection, as illustrated in Figure 1. Note that the shortcut connection is controlled by a gate that is parameterized with a scalar, $k$. This is a key difference from Highway Networks, where a tensor is used to regulate the shortcut connection, along with the incoming data. The idea of using a scalar is simple: it is easier to learn $k = 0$ than to learn $W_g = 0$ for a weight tensor $W_g$ controlling the gate. Indeed, this single scalar allows for stronger supervision on lower layers, by making gradients flow more smoothly in the optimization.
We apply our proposed network design to Residual Networks, as illustrated in Figure 2. Note that in this case the layer becomes simply $u = g(k) f_r(x, W) + x$, where $f_r$ denotes the layer’s residual function and $g(k)$ is the scalar gate. Thus, the shortcut connection allows the input to flow freely through the layer without any interference of $g(k)$. We call this model the Gated Residual Network, or GResNet. Note that learning identity mappings is again much easier than in the original ResNets.
Note that layers that have degenerated into identity mappings have no impact on the signal propagating through the network, and thus can be removed without affecting performance. The removal of such layers can be seen as a transposed application of sparse encoding (Glorot et al. (2011)): transposing the sparsity from neurons to layers provides a way to prune them entirely from the network. Indeed, we show that performance decays more slowly in GResNets than in ResNets when layers are removed.
We evaluate the performance of the proposed design in two experiments. First, we evaluate fully-connected GResNets on MNIST and compare them with fully-connected ResNets, showing superior performance and robustness to layer removal. Second, we apply our model to Wide ResNets (Zagoruyko & Komodakis (2016)) and test its performance on CIFAR, obtaining results that are, to the best of our knowledge, superior to all previously published results. These findings indicate that learning identity mappings is a fundamental aspect of learning in deep networks, and that designing models where this is easier seems highly effective.
2 Augmentation with Residual Gates
2.1 Theoretical Intuition
Recall that a network’s depth can always be increased without affecting its performance: it suffices to add layers that perform identity mappings. Consider a classic fully-connected ReLU network with layers defined as $u = \mathrm{ReLU}(\langle x, W \rangle)$. When adding a new layer, if we initialize $W$ to the identity matrix $I$, we have:
$$u = \mathrm{ReLU}(\langle x, I \rangle) = \mathrm{ReLU}(x) = x$$
The last step holds since $x$ is an output of a previous ReLU layer, and $\mathrm{ReLU}(\mathrm{ReLU}(x)) = \mathrm{ReLU}(x)$. Thus, adding more layers should only improve performance. However, how can a network with more layers learn to yield performance superior to that of a network with fewer layers? A key observation is that if learning identity mappings is easy, then the network with more layers is more likely to yield superior performance, as it can more easily recover the performance of a smaller network through identity mappings.
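The layer design of Residual Networks allows deeper models to be trained due to its shortcut connections. Note that in ResNets the identity mapping is learned when $W = 0$ instead of $W = I$. Considering a residual layer $u = \mathrm{ReLU}(\langle x, W \rangle) + x$, we have:
$$u = \mathrm{ReLU}(\langle x, 0 \rangle) + x = \mathrm{ReLU}(0) + x = x$$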
Intuitively, residual layers can degenerate into identity mappings more effectively since learning an all-zero matrix is easier than learning the identity matrix. To support this argument, consider weight parameters randomly initialized with zero mean. Hence, the point $W = 0$ is located exactly in the center of the probability mass distribution used to initialize the weights.
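The two identity conditions discussed above can be checked numerically. The following is a minimal pure-Python sketch, assuming a plain layer of the form ReLU(xW) and a residual layer of the form ReLU(xW) + x; the helper names (`relu`, `matvec`) and the example vector are illustrative and not taken from the original implementation.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def matvec(x, W):
    """Row vector x times matrix W (given as a list of rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

n = 4
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
zeros = [[0.0] * n for _ in range(n)]

# x plays the role of a previous ReLU layer's output, hence non-negative.
x = relu([0.5, -1.2, 3.0, 0.0])

# Plain layer with identity weights: ReLU(x I) = ReLU(x) = x.
u_plain = relu(matvec(x, identity))

# Residual layer with all-zero weights: ReLU(x 0) + x = x.
u_residual = [r + a for r, a in zip(relu(matvec(x, zeros)), x)]
```

In both cases the output equals the input, but the plain layer needs a specific non-zero weight matrix while the residual layer only needs its weights to vanish.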
However, assuming that residual layers can trivially learn the parameter set $W = 0$ implies ignoring the randomness when initializing the weights. We demonstrate this by calculating the expected component-wise distance between $W_o$ and the origin. Here, $W_o$ denotes the weight tensor after initialization and prior to any optimization. Note that the distance between $W_o$ and the origin captures the effort for a network to learn identity mappings:
$$\mathbb{E}\left[(W_o - 0)^2\right] = \mathbb{E}\left[W_o^2\right] = \mathrm{Var}\left[W_o\right]$$
Note that the distance is given by the distribution’s variance, and there is no reason to assume it to be negligible. Additionally, the fact that Residual Networks still suffer from optimization issues caused by depth (Huang et al. (2016)) further supports this claim.
Some initialization schemes propose a variance in the order of $O(1/n)$ (Glorot & Bengio (2010), He et al. (2015a)); however, this represents the distance for each individual parameter in $W$. For tensors with $O(n^2)$ parameters, the total distance, either absolute or Euclidean, between $W_o$ and the origin will be in the order of $O(n)$.
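To make the growth of this distance concrete, the sketch below samples an n × n weight matrix and sums the squared entries. It is only an illustration under stated assumptions: weights are drawn from a zero-mean Gaussian with per-weight variance 1/n, and the function name `squared_distance_from_origin` is hypothetical.

```python
import random

def squared_distance_from_origin(n, seed=0):
    """Sum of squared entries of an n x n matrix with entries drawn from N(0, 1/n)."""
    rng = random.Random(seed)
    std = (1.0 / n) ** 0.5  # assumed per-weight variance of 1/n
    return sum(rng.gauss(0.0, std) ** 2 for _ in range(n * n))

for n in (10, 50, 200):
    total = squared_distance_from_origin(n)
    # The ratio total / n stays close to 1, so the summed distance grows with n.
    print(n, round(total, 2), round(total / n, 3))
```

Although each individual weight starts close to zero, the summed squared distance grows roughly linearly with the layer width, so wider residual layers start further from the identity mapping.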
2.2 Residual Gates
As previously mentioned, the key contribution in this work is the proposal of a layer design where learning a single scalar parameter suffices in order for the layer to degenerate into an identity mapping. As in Highway Networks, we propose the addition of gated shortcut connections. Our gates, however, are parametrized by a single scalar value, being easier to analyze and learn. In our model, the effort required to learn identity mappings does not depend on any parameter, such as the layer width, in sharp contrast to prior models.
Our design is as follows: a layer $u = f(x, W)$ becomes $u = g(k) f(x, W) + (1 - g(k)) x$, where $k$ is a scalar parameter. This design is illustrated in Figure 1. Note that such a layer can quickly degenerate into an identity mapping by setting $g(k)$ to $0$. Using the ReLU activation function as $g$, it suffices that $k \le 0$ for $g(k) = 0$.
By adding an extra parameter, the dimensionality of the cost surface also grows by one. This new dimension, however, can be easily understood due to the specific nature of the layer reformulation. The original surface is maintained on the $k = 1$ slice, since the gated model becomes equivalent to the original one. On the $k = 0$ slice we have an identity mapping, and the associated cost for all points in such a slice is the same cost associated with the point $\{k = 1, W = I\}$: this follows since both parameter configurations correspond to identity mappings, therefore being equivalent. Lastly, due to the linear nature of $g(k)$ and consequently of the gates, all other slices will be a linear combination of the slices $k = 0$ and $k = 1$.
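The behavior of the gate on the different slices can be sketched in a few lines. This is a hedged illustration rather than the original implementation: it assumes the gate blends the layer output and its input as g(k) f(x) + (1 - g(k)) x with g = ReLU, and the stand-in layer `f` and the name `gated_layer` are hypothetical.

```python
def relu(v):
    return max(0.0, v)

def gated_layer(x, f, k, g=relu):
    """Blend the layer output f(x) with its input x using the scalar gate g(k)."""
    gate = g(k)
    return [gate * a + (1.0 - gate) * b for a, b in zip(f(x), x)]

def f(x):
    """Hypothetical stand-in for a layer f(x, W); any shape-preserving map works."""
    return [2.0 * a + 1.0 for a in x]

x = [0.5, 1.5, 3.0]
u_identity = gated_layer(x, f, k=-0.3)  # g(k) = 0: identity mapping
u_original = gated_layer(x, f, k=1.0)   # g(k) = 1: the original layer
u_halfway = gated_layer(x, f, k=0.5)    # linear blend of the two slices
```

A single scalar update that pushes k to zero or below is enough to turn the whole layer into an identity mapping, regardless of how many weights `f` contains.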
We proceed to use residual layers as the basis for our design, for two reasons. First, they are the current standard for computer vision tasks. Second, ResNets lack means to regulate the residuals, therefore a linear gating mechanism might not only allow deeper models, but could also improve performance. Thus, the residual layer is given by:
$$u = f(x, W) = f_r(x, W) + x$$
where $f_r$ is the layer’s residual function, in our case BN-ReLU-Conv-BN-ReLU-Conv. Our approach changes this layer by adding a linear gate, yielding:
$$u = g(k) f(x, W) + (1 - g(k)) x = g(k) \left(f_r(x, W) + x\right) + (1 - g(k)) x = g(k) f_r(x, W) + x$$
Our approach applied to residual layers is shown in Figure 2. The resulting layer maintains the shortcut connection unaltered, which according to He et al. (2016) is a desired property when designing residual blocks. As $(1 - g(k))$ vanishes from the formulation, $g(k)$ stops acting as a dual gating mechanism and can be interpreted as a flow regulator. Note that this model introduces a single scalar parameter per layer block. This new dimension can be interpreted as discussed above, except that the $k = 0$ slice is equivalent to $\{k = 1, W = 0\}$, since an identity mapping is learned when $W = 0$ in ResNets.
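The simplification of the gated residual layer can be verified numerically. The sketch below assumes g = ReLU and uses a hypothetical element-wise `residual_fn` in place of the BN-ReLU-Conv stack; it compares the gate applied to the full residual layer with the simplified form in which the shortcut is untouched.

```python
import math

def gated_residual_layer(x, residual_fn, k):
    """Simplified form: g(k) * f_r(x) + x, with the shortcut left unaltered."""
    gate = max(0.0, k)
    return [gate * r + a for r, a in zip(residual_fn(x), x)]

def expanded_form(x, residual_fn, k):
    """Unsimplified form: g(k) * (f_r(x) + x) + (1 - g(k)) * x."""
    gate = max(0.0, k)
    return [gate * (r + a) + (1.0 - gate) * a for r, a in zip(residual_fn(x), x)]

def residual_fn(x):
    """Hypothetical stand-in for the residual function f_r."""
    return [0.1 * a - 0.2 for a in x]

x = [1.0, 2.0, 4.0]
u_identity = gated_residual_layer(x, residual_fn, k=0.0)  # identity mapping
u_resnet = gated_residual_layer(x, residual_fn, k=1.0)    # plain residual layer
forms_agree = all(
    math.isclose(p, q)
    for p, q in zip(gated_residual_layer(x, residual_fn, 0.7),
                    expanded_form(x, residual_fn, 0.7))
)
```

Here the gate only scales the residual branch, which is why it reads as a flow regulator rather than a switch between two paths.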
All models were implemented in Keras (Chollet (2015)) or in Torch, and were executed on a GeForce GTX 1070. Larger models or more complex datasets, such as ImageNet (Russakovsky et al. (2015)), were not explored due to hardware limitations.
The MNIST dataset (Lecun et al. (1998)) is composed of greyscale images with $28 \times 28$ pixels. Images represent handwritten digits, resulting in a total of 10 classes. We trained three types of fully-connected models: classical plain networks, ResNets and GResNets.
The networks consist of a linear layer with 50 neurons, followed by $d$ layers with 50 neurons each, and lastly a softmax layer for classification. Only the $d$ middle layers differ between the three architectures: the first linear layer and the softmax layer are the same in all experiments.
For plain networks, each layer performs a dot product, followed by Batch Normalization and a ReLU activation function.
Initial tests with pre-activations (He et al. (2016)) resulted in poor performance on the validation set, therefore we opted for the traditional Dot-BN-ReLU layer when designing Residual Networks. Each residual block consists of two layers, as is conventional.
All networks were trained using Adam (Kingma & Ba (2014)) with Nesterov momentum (Dozat) for a total of 100 epochs using mini-batches of size 128. No learning rate decay was used: we kept the learning rate and momentum fixed during the whole training.
For preprocessing, we divided each pixel value by 255, normalizing the values to $[0, 1]$.
The training curves for classical plain networks, ResNets and GResNets with varying depth are shown in Figure 4. The distance between the curves increases with the depth, showing that the augmentation helps the training of deeper models.
Table 1 shows the test error for each depth and architecture. ResNets converge in experiments with and ( and layers, respectively), while classical models do not.
Gated Residual Networks perform better in all settings, and the performance boost is more noticeable with increased depths. The relative error decreased approximately for , for and for .
As observed in Table 2, the mean values of $k$ decrease as the model gets deeper, showing that shortcut connections have less impact on shallow networks. This agrees with empirical results that ResNets perform better than classical plain networks as the depth increases.
We also analyzed how layer removal affects ResNets and GResNets. We compared how the deepest networks behave as residual blocks composed of 2 layers are completely removed from the models. The final value of each $k$ parameter, according to its corresponding residual block, is shown in Figure 5. We can observe that layers close to the middle of the network have a smaller $k$ than those at the beginning or the end. Therefore, the middle layers are less important, being closer to identity mappings.
Results are shown in Figure 5. For Gated Residual Networks, we prune pairs of layers following two strategies. One consists of pruning layers in a greedy fashion, where blocks with the smallest $k$ are removed first. In the other we remove blocks randomly. We present results using both strategies for GResNets, and only random pruning for ResNets since they lack the $k$ parameter.
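The two pruning strategies can be sketched as orderings over block indices. The code below is a minimal illustration, not the original experimental code: the gate values in `k_values`, the block labels, and the helper names are all hypothetical.

```python
import random

def greedy_prune_order(k_values):
    """Block indices sorted so that the block with the smallest k is removed first."""
    return sorted(range(len(k_values)), key=lambda i: k_values[i])

def random_prune_order(num_blocks, seed=0):
    """Block indices in a random removal order."""
    order = list(range(num_blocks))
    random.Random(seed).shuffle(order)
    return order

def prune(blocks, order, num_to_remove):
    """Drop the first `num_to_remove` blocks listed in `order`; keep the rest in place."""
    removed = set(order[:num_to_remove])
    return [b for i, b in enumerate(blocks) if i not in removed]

# Hypothetical final gate values, one per residual block.
k_values = [0.9, 0.2, 0.05, 0.4, 0.8]
blocks = ["block0", "block1", "block2", "block3", "block4"]

kept_greedy = prune(blocks, greedy_prune_order(k_values), num_to_remove=2)
kept_random = prune(blocks, random_prune_order(len(blocks)), num_to_remove=2)
```

Because a removed block is replaced by its shortcut, the surviving blocks keep their relative order and no retraining step is implied by this sketch.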
The greedy strategy is slightly better for Gated Residual Networks, showing that the $k$ parameter is indeed a good indicator of a layer’s importance for the model, but that layers tend to assume the same level of significance. In a fair comparison, where both models are pruned randomly, GResNets retain a satisfactory performance even after half of their layers have been removed, while ResNets suffer a performance decrease after just a few layers.
Therefore augmented models are not only more robust to layer removal, but can have a fair share of their layers pruned and still perform well. Faster predictions can be generated by using a pruned version of an original model.
The CIFAR datasets (Krizhevsky (2009)) consist of color images with $32 \times 32$ pixels each. CIFAR-10 has a total of 10 classes, including pictures of cats, birds and airplanes. The CIFAR-100 dataset is composed of the same number of images, but with a total of 100 classes.
Residual Networks have surpassed state-of-the-art results on CIFAR. We test GResNets, Wide GResNets (Zagoruyko & Komodakis (2016)) and compare them with their original, non-augmented models.
For pre-activation ResNets, as described in He et al. (2016), we follow the original implementation details. We set an initial learning rate of 0.1, and decrease it by a factor of 10 after 50% and 75% of the epochs. SGD with Nesterov momentum of 0.9 is used for optimization, and the only pre-processing consists of mean subtraction. Weight decay of 0.0001 is used for regularization, and Batch Normalization’s momentum is set to 0.9.
We follow the implementation from Zagoruyko & Komodakis (2016) for Wide ResNets. The learning rate is initialized as 0.1, and decreases by a factor of 5 after 30%, 60% and 80% of the epochs. Images are mean/std normalized, and a weight decay of 0.0005 is used for regularization. When dropout is specified, we apply 0.3 dropout between convolutions. All other details are the same as for ResNets.
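The two step schedules described above can be written as one small function. This is a sketch under assumptions: the total number of epochs is not stated here, so `total_epochs` is a free parameter, and the function names are hypothetical.

```python
def step_schedule(epoch, total_epochs, base_lr, factor, milestones):
    """Learning rate at a 0-indexed epoch: divide by `factor` at each milestone reached."""
    drops = sum(1 for m in milestones if epoch >= round(m * total_epochs))
    return base_lr / (factor ** drops)

def resnet_lr(epoch, total_epochs):
    # Start at 0.1, divide by 10 after 50% and 75% of the epochs.
    return step_schedule(epoch, total_epochs, 0.1, 10, (0.5, 0.75))

def wide_resnet_lr(epoch, total_epochs):
    # Start at 0.1, divide by 5 after 30%, 60% and 80% of the epochs.
    return step_schedule(epoch, total_epochs, 0.1, 5, (0.3, 0.6, 0.8))

# Example with an assumed budget of 200 epochs.
for epoch in (0, 61, 101, 121, 151, 161):
    print(epoch, resnet_lr(epoch, 200), wide_resnet_lr(epoch, 200))
```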
For both architectures we use moderate data augmentation: images are padded with 4 pixels, and we take random crops of size $32 \times 32$ during training. Additionally, each image is horizontally flipped with probability 0.5. We use batch size 128 for all experiments.
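The augmentation pipeline can be sketched on a single-channel image stored as a list of rows. Several details are assumptions made for illustration: zero padding, a crop that restores the original image size, a flip probability of 0.5, and the function name `augment`.

```python
import random

def augment(image, pad=4, flip_prob=0.5, rng=random):
    """Zero-pad an H x W image, take a random H x W crop, then maybe flip it horizontally."""
    h, w = len(image), len(image[0])
    zero_row = [0] * (w + 2 * pad)
    padded = ([list(zero_row) for _ in range(pad)]
              + [[0] * pad + list(row) + [0] * pad for row in image]
              + [list(zero_row) for _ in range(pad)])
    top = rng.randint(0, 2 * pad)   # randint bounds are inclusive
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + w] for row in padded[top:top + h]]
    if rng.random() < flip_prob:
        crop = [row[::-1] for row in crop]
    return crop

rng = random.Random(0)
image = [[r * 32 + c for c in range(32)] for r in range(32)]
out = augment(image, rng=rng)
```

For color images the same crop offsets and flip decision would be applied to every channel, so that the channels stay aligned.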
For all gated networks, we initialize $k$ with a constant value. One crucial question is whether weight decay should be applied to the $k$ parameters. We call this “$k$ decay”, and also compare GResNets and Wide GResNets when it is applied with the same magnitude as the weight decay: 0.0001 for GResNet and 0.0005 for Wide GResNet.
| Model (test error, %) | Original | Gated | Gated ($k$ decay) |
|---|---|---|---|
| Wide ResNet (4,10) + Dropout | 3.89 | 3.65 | 3.74 |
Table 3 shows the test error for two architectures: a ResNet and a Wide ResNet. Augmenting each model adds 15 and 12 parameters, respectively. We observe that $k$ decay hurts performance in both cases, indicating that the $k$ parameters should either remain unregularized or receive a more subtle regularization than the weight parameters. Due to its direct connection to layer degeneration, regularizing $k$ results in enforcing identity mappings, which might harm the model.
As in the previous experiment, in Figure 7 we present the final $k$ values for each block. We can observe that the $k$ values follow an intriguing pattern: the lowest values occur exactly at the blocks that increase the feature map dimension. This indicates that, in such residual blocks, the convolution performed in the shortcut connection to increase dimension is more important than the residual block itself. Additionally, the peak value for the last residual block suggests that its shortcut connection is of little importance, and could as well be fully removed without greatly impacting the model.
| Method | Params | CIFAR-10 | CIFAR-100 |
|---|---|---|---|
| Network in Network | - | 8.81 | - |
| Highway Neural Network (Srivastava et al. (2015)) | 2.3M | 7.76 | 32.39 |
| ResNet-110 (He et al. (2015b)) | 1.7M | 6.61 | - |
| ResNet in ResNet | 1.7M | 5.01 | 22.90 |
| Stochastic Depth (Huang et al. (2016)) | 10.2M | 4.91 | - |
| ResNet-1001 (He et al. (2016)) | 10.2M | 4.62 | 22.71 |
| FractalNet (Larsson et al. (2016)) | 38.6M | 4.60 | 23.73 |
| Wide ResNet (4,10) (Zagoruyko & Komodakis (2016)) | 36.5M | 3.89 | 18.85 |
| Wide GatedResNet (4,10) + Dropout | 36.5M | 3.65 | 18.27 |
We have proposed a novel layer design based on Highway Neural Networks, which can be applied to give general layers a quick way to learn identity mappings. Unlike Highway or Residual Networks, layers generated by our technique require optimizing only one parameter to degenerate into identity. By designing our method such that randomly initialized parameter sets are always close to identity mappings, our design suffers less from optimization issues caused by depth.
We have shown that applying our technique to ResNets yields a model that can regulate the residuals, named Gated Residual Networks. This model performed better in all our experiments with negligible extra training time and parameters. Lastly, we have shown how it can be used for layer pruning, effectively removing large numbers of parameters from a network without necessarily harming its performance.
- Bengio et al. (1994) Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 1994.
- Bengio et al. (2007) Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. NIPS, 2007.
- Bianchini & Scarselli (2014) Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553–1565, 2014. doi: 10.1109/TNNLS.2013.2293637.
- Chollet (2015) François Chollet. keras. https://github.com/fchollet/keras, 2015.
- Dozat Timothy Dozat. Incorporating Nesterov momentum into Adam.
- Eldan & Shamir (2015) R. Eldan and O. Shamir. The Power of Depth for Feedforward Neural Networks. ArXiv e-prints, December 2015.
- Glorot & Bengio (2010) X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.
- Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey J. Gordon and David B. Dunson (eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), volume 15, pp. 315–323. Journal of Machine Learning Research - Workshop and Conference Proceedings, 2011. URL http://www.jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf.
- He et al. (2015a) K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ArXiv e-prints, February 2015a.
- He et al. (2015b) K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. ArXiv e-prints, December 2015b.
- He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. ArXiv e-prints, March 2016.
- Huang et al. (2016) G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger. Deep Networks with Stochastic Depth. ArXiv e-prints, March 2016.
- Ioffe & Szegedy (2015) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
- Kingma & Ba (2014) D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ArXiv e-prints, December 2014.
- Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
- Larochelle et al. (2009) Hugo Larochelle, Yoshua Bengio, Jérôme Louradour, and Pascal Lamblin. Exploring strategies for training deep neural networks. J. Mach. Learn. Res., 10:1–40, June 2009. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1577069.1577070.
- Larsson et al. (2016) G. Larsson, M. Maire, and G. Shakhnarovich. FractalNet: Ultra-Deep Neural Networks without Residuals. ArXiv e-prints, May 2016.
- Lecun et al. (1998) Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324, 1998.
- Montúfar et al. (2014) G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the Number of Linear Regions of Deep Neural Networks. ArXiv e-prints, February 2014.
- Nair & Hinton (2010) Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In Johannes Fürnkranz and Thorsten Joachims (eds.), Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814. Omnipress, 2010. URL http://www.icml2010.org/papers/432.pdf.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
- Srivastava et al. (2015) Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. CoRR, abs/1507.06228, 2015. URL http://arxiv.org/abs/1507.06228.
- Szegedy et al. (2014) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. URL http://arxiv.org/abs/1409.4842.
- Telgarsky (2016) M. Telgarsky. Benefits of depth in neural networks. ArXiv e-prints, February 2016.
- Xu et al. (2016) B. Xu, R. Huang, and M. Li. Revise Saturated Activation Functions. ArXiv e-prints, February 2016.
- Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016. URL http://arxiv.org/abs/1605.07146.