Training Very Deep Neural Networks Without Skip-Connections
Deep neural networks with skip-connections, such as ResNet, show excellent performance in various image classification benchmarks. It has been observed, though, that the initial motivation behind them - training deeper networks - does not actually hold true, and the benefits come from increased capacity rather than from depth. Motivated by this, and inspired by ResNet, we propose a simple Dirac weight parameterization, which allows us to train very deep plain networks without skip-connections and achieve nearly the same performance. This parameterization has a minor computational cost at training time and no cost at all at inference. We are able to achieve 95.5% accuracy with a 34-layer deep plain network, surpassing 1001-layer deep ResNet and approaching Wide ResNet. Our parameterization also mostly eliminates the need for careful initialization in residual and non-residual networks. The code and models for our experiments are available at https://github.com/szagoruyko/diracnets
There have been many attempts at training very deep networks. In image classification, after the success of Krizhevsky et al. (2012) with AlexNet (8 layers), the major improvement was brought by Simonyan & Zisserman (2015) with VGG (16-19 layers) and later by Szegedy et al. (2015) with Inception (22 layers). In recurrent neural networks, LSTM by Hochreiter & Schmidhuber (1997) allowed training deeper networks by introducing gated memory cells, leading to a major increase in parameter capacity and network performance. Recently, a similar idea was applied to image classification by Srivastava et al. (2015), who proposed Highway Networks, later improved by He et al. (2015a) with Residual Networks, resulting in a simple architecture with skip-connections, which was shown to generalize to many other tasks. Several other ways of adding skip-connections were also proposed, such as DenseNet by Huang et al. (2016), which passes all previous activations to each new layer.
Despite the success of ResNet, a number of recent works showed that the original motivation of training deeper networks does not actually hold true, e.g. it might just be an ensemble of shallower networks Veit et al. (2016), and ResNet widening is more effective than deepening Zagoruyko & Komodakis (2016), meaning that there is no benefit from increasing depth to more than 50 layers. It is also known that deeper networks can be more efficient than shallower and wider ones, so various methods were proposed to train deeper networks, such as well-designed initialization strategies and special nonlinearities Bengio & Glorot (2010); He et al. (2015b); Clevert et al. (2015), additional mid-network losses Lee et al. (2014), better optimizers Sutskever et al. (2013), knowledge transfer Romero et al. (2014); Chen et al. (2016) and layer-wise training Schmidhuber (1992).
To summarize, deep networks with skip-connections have the following problems:
Feature reuse problem: upper layers might not learn useful representations given previous activations;
Widening is more effective than deepening: there is no benefit from increasing depth;
Actual depth is not clear: it might be determined by the shortest path.
However, the features learned by such networks are generic, and they are able to train with a massive number of parameters without negative effects of overfitting. We are thus interested in a better understanding of networks with skip-connections, which would allow us to train very deep plain networks (without skip-connections) and obtain the benefits they could bring, such as higher parameter efficiency, better generalization, and improved computational efficiency.
Motivated by this, we propose a novel weight parameterization for neural networks, which we call Dirac parameterization, applicable to a wide range of network architectures. Furthermore, using this parameterization, we propose novel plain VGG-like and ResNet-like architectures without explicit skip-connections, which we call DiracNet. These networks are able to train with hundreds of layers, surpass 1001-layer ResNet while having only 28 layers, and approach Wide ResNet (WRN) accuracy. We should note that we train DiracNets end-to-end, without any need for layer-wise pretraining. We believe that our work is an important step towards simpler and more efficient deep neural networks.
Overall, our contributions are the following:
We propose generic Dirac weight parameterization, applicable to a wide range of neural network architectures;
Our plain Dirac parameterized networks are able to train end-to-end with hundreds of layers. Furthermore, they are able to train with massive number of parameters and still generalize well without negative effects of overfitting;
Dirac parameterization can be used in combination with explicit skip-connections like ResNet, in which case it eliminates the need of careful initialization.
In a trained network Dirac-parameterized filters can be folded into a single vector, resulting in a simple and easily interpretable VGG-like network, a chain of convolution-ReLU pairs.
Inspired by ResNet, we parameterize weights as a residual of the Dirac function, instead of adding an explicit skip connection. Because convolving any input with the Dirac function results in the same input, this helps propagate information deeper in the network. Similarly, on backpropagation it helps alleviate the vanishing gradients problem.
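Let $\mathrm{I}$ be the identity in the algebra of discrete convolutional operators, i.e. convolving it with an input $x$ results in the same output $x$ ($\odot$ denotes convolution):

$$\mathrm{I} \odot x = x \tag{1}$$

In the two-dimensional case convolution can be expressed as matrix multiplication, so $\mathrm{I}$ is simply an identity matrix, or a Kronecker delta $\delta$. We generalize this operator to the case of a convolutional layer, where an input $x \in \mathbb{R}^{M, N_1, N_2, \dots, N_L}$ (that consists of $M$ channels of spatial dimensions $(N_1, N_2, \dots, N_L)$) is convolved with a weight $\hat{W} \in \mathbb{R}^{M, M, K_1, K_2, \dots, K_L}$ (combining $M$ filters; outputs are over the first dimension of $\hat{W}$, inputs are over the second dimension of $\hat{W}$) to produce an output $y$ of $M$ channels, i.e. $y = \hat{W} \odot x$. In this case we define the Dirac delta $\mathrm{I} \in \mathbb{R}^{M, M, K_1, K_2, \dots, K_L}$, preserving eq. (1), as the following:

$$\mathrm{I}(i, j, l_1, l_2, \dots, l_L) = \begin{cases} 1 & \text{if } i = j \text{ and every } l_m \text{ is the central index of its kernel dimension } K_m, \\ 0 & \text{otherwise.} \end{cases}$$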
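As an illustration of the identity property in eq. (1), the following minimal sketch builds a Dirac kernel for a one-dimensional convolution and checks that convolving with it returns the input unchanged. The helper names `dirac_kernel` and `conv1d_same`, the zero padding, and the odd kernel size are assumptions made for this example and are not taken from the paper's code.

```python
def dirac_kernel(channels, ksize):
    """Kernel of shape [out][in][k]: 1 at the kernel centre when out == in, else 0.

    Assumes an odd ksize so that the centre index is well defined.
    """
    center = ksize // 2
    return [[[1.0 if (i == j and l == center) else 0.0 for l in range(ksize)]
             for j in range(channels)]
            for i in range(channels)]


def conv1d_same(weight, x):
    """Zero-padded 1D cross-correlation, as used by deep learning frameworks.

    weight: [out][in][k], x: [in][n]  ->  result: [out][n]
    """
    out_ch, in_ch, k = len(weight), len(weight[0]), len(weight[0][0])
    n = len(x[0])
    pad = k // 2
    y = [[0.0] * n for _ in range(out_ch)]
    for o in range(out_ch):
        for c in range(in_ch):
            for t in range(n):
                for l in range(k):
                    s = t + l - pad  # input position read by kernel tap l
                    if 0 <= s < n:
                        y[o][t] += weight[o][c][l] * x[c][s]
    return y


x = [[1.0, 2.0, 3.0, 4.0], [-1.0, 0.5, 0.0, 2.0]]  # 2 channels, length 4
identity = dirac_kernel(channels=2, ksize=3)
y = conv1d_same(identity, x)
print(y == x)
```

The same construction extends to more spatial dimensions by placing the single non-zero entry at the centre of every kernel axis.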
Given the above definition, for a convolutional layer $y = \hat{W} \odot x$ we propose the following parameterization for the weight $\hat{W}$ (hereafter we omit bias for simplicity):

$$\hat{W} = \operatorname{diag}(a)\,\mathrm{I} + W \tag{3}$$

where $a \in \mathbb{R}^M$ is a scaling vector learned during training, and $W$ is a weight vector. Each $i$-th element of $a$ corresponds to the scaling of the $i$-th filter of $W$. When all elements of $a$ are close to zero, it reduces to a simple linear layer $W \odot x$. When they are higher than 1 and $W$ is small, the Dirac term dominates, and the output is close to the input.
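The two limiting regimes described above can be checked numerically. The sketch below uses a single-channel, one-dimensional layer, so the scaling vector reduces to a scalar `a`; the helper names and the filter values are illustrative assumptions, not values from the paper.

```python
def conv_same(w, x):
    """Single-channel zero-padded 1D cross-correlation; len(w) must be odd."""
    pad = len(w) // 2
    n = len(x)
    return [sum(w[l] * x[t + l - pad]
                for l in range(len(w)) if 0 <= t + l - pad < n)
            for t in range(n)]


def dirac_weight(a, w):
    """Single-channel case of W_hat = a * I + W: add a to the centre tap of w."""
    center = len(w) // 2
    return [wi + a if l == center else wi for l, wi in enumerate(w)]


x = [1.0, -2.0, 3.0, -1.0]
w = [0.5, -0.25, 0.125]        # arbitrary illustrative filter
w_small = [0.01, -0.02, 0.01]  # small weights, so the Dirac term dominates

# a = 0: the layer is an ordinary convolution with W.
plain = conv_same(dirac_weight(0.0, w), x)

# a = 1 with small W: the output stays close to the input.
near_identity = conv_same(dirac_weight(1.0, w_small), x)
max_dev = max(abs(yi - xi) for yi, xi in zip(near_identity, x))
print(plain == conv_same(w, x), max_dev)
```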
We also use weight normalization Salimans & Kingma (2016) for $W$, which we find useful for stabilizing training of very deep networks with more than 30 layers:

$$\hat{W} = \operatorname{diag}(a)\,\mathrm{I} + \operatorname{diag}(b)\,W_{\text{norm}} \tag{5}$$

where $b \in \mathbb{R}^M$ is another scaling vector (to be learned during training), and $W_{\text{norm}}$ is a normalized weight vector where each filter is normalized by its Euclidean norm. We initialize $a$ to 1.0 and $b$ to 0.1, and do not $\ell_2$-regularize them during training, as it would lead to degenerate solutions when their values are close to zero. We initialize $W$ from a normal distribution. Gradients of (5
Overall, this adds a negligible number of parameters to the network (just two scaling multipliers per channel) during training, which can be folded into filters at test time.
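To make the folding step concrete, here is a minimal sketch that combines per-channel scaling vectors `a` and `b` with a per-filter normalized weight into a single kernel, as would be done once after training. The function names, the one-dimensional kernel layout `[out][in][k]`, the unit-variance Gaussian draw, and the fixed seed are assumptions for illustration only.

```python
import math
import random


def dirac_kernel(channels, ksize):
    """Kernel [out][in][k] with 1 at the centre tap when out == in (odd ksize)."""
    center = ksize // 2
    return [[[1.0 if (i == j and l == center) else 0.0 for l in range(ksize)]
             for j in range(channels)]
            for i in range(channels)]


def normalize_filters(w):
    """Scale each output filter w[i] (shape [in][k]) to unit Euclidean norm."""
    normalized = []
    for filt in w:
        norm = math.sqrt(sum(v * v for row in filt for v in row))
        normalized.append([[v / norm for v in row] for row in filt])
    return normalized


def fold(a, b, w):
    """Return W_hat = diag(a) I + diag(b) W_norm as one ordinary kernel."""
    channels, ksize = len(w), len(w[0][0])
    eye = dirac_kernel(channels, ksize)
    w_norm = normalize_filters(w)
    return [[[a[i] * eye[i][j][l] + b[i] * w_norm[i][j][l]
              for l in range(ksize)]
             for j in range(channels)]
            for i in range(channels)]


random.seed(0)  # fixed seed only to make the example reproducible
channels, ksize = 3, 3
w = [[[random.gauss(0.0, 1.0) for _ in range(ksize)]
      for _ in range(channels)]
     for _ in range(channels)]
a = [1.0] * channels  # initial value of a stated in the text
b = [0.1] * channels  # initial value of b stated in the text
w_hat = fold(a, b, w)
```

After folding, `w_hat` has the same shape as `w`, so the layer can be evaluated as a plain convolution with no extra cost at inference.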
Let us discuss the connection of Dirac parameterization to ResNet. Due to distributivity of convolution, eq. (3) can be rewritten to show that the skip-connection in Dirac parameterization is implicit (we omit $a$ for simplicity):

$$y = \sigma\big((\mathrm{I} + W) \odot x\big) = \sigma(x + W \odot x)$$

where $\sigma(x)$ is a function combining nonlinearity and batch normalization. The skip connection in ResNet is explicit:

$$y = x + \sigma(W \odot x)$$
This means that Dirac parameterization and ResNet differ only by the order of nonlinearities. Each delta parameterized layer adds complexity by having unavoidable nonlinearity, which is not the case for ResNet. Additionally, Dirac parameterization can be folded into a single weight vector on inference.
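The difference in the order of nonlinearities can be seen on a small numeric example. This is a sketch under simplifying assumptions: a single channel, a one-dimensional convolution, and a plain ReLU standing in for the combined nonlinearity and batch normalization; the filter and input values are arbitrary.

```python
def conv_same(w, x):
    """Single-channel zero-padded 1D cross-correlation; len(w) must be odd."""
    pad = len(w) // 2
    n = len(x)
    return [sum(w[l] * x[t + l - pad]
                for l in range(len(w)) if 0 <= t + l - pad < n)
            for t in range(n)]


def relu(v):
    return [max(0.0, e) for e in v]


x = [1.0, -2.0, 3.0, -1.0]
w = [0.5, -0.25, 0.125]  # arbitrary illustrative filter
delta = [0.0, 1.0, 0.0]  # single-channel Dirac delta

# Dirac layer with the skip folded into the weight: relu(conv(I + W, x))
folded = [d + wi for d, wi in zip(delta, w)]
dirac_implicit = relu(conv_same(folded, x))

# The same layer expanded by distributivity: relu(x + conv(W, x))
dirac_expanded = relu([xi + ci for xi, ci in zip(x, conv_same(w, x))])

# ResNet-style block: the skip bypasses the nonlinearity, x + relu(conv(W, x))
resnet = [xi + ri for xi, ri in zip(x, relu(conv_same(w, x)))]

print(dirac_implicit)  # [0.5, 0.0, 1.125, 0.75]
print(dirac_expanded)  # [0.5, 0.0, 1.125, 0.75]
print(resnet)          # [1.0, -0.625, 3.0, 0.75]
```

The first two results coincide, while the ResNet-style output lets negative input values pass through the skip path untouched by the nonlinearity.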
We adopt an architecture similar to ResNet and VGG, and instead of explicit skip-connections use Dirac parameterization (see table 1). The architecture consists of three groups, where each group has $2N$ convolutional layers ($2N$ is used for easier comparison with basic-block ResNet and WRN, which have $N$ blocks of pairs of convolutional layers per group). For simplicity we use max-pooling between groups to reduce spatial resolution. We also define width $k$ as in WRN to control the number of parameters.
We chose CIFAR and ImageNet for our experiments. As baselines, we chose Wide ResNet with identity mapping in the residual block He et al. (2016) and basic block (two $3 \times 3$ convolutions per block). We used the same training hyperparameters as WRN for both CIFAR and ImageNet.
The experimental section is composed as follows. First, we provide a detailed experimental comparison between plain and plain-Dirac networks, and compare them with ResNet and WRN on CIFAR. Also, we analyze evolution of scaling coefficients during training and their final values. Then, we present ImageNet results. Lastly, we apply Dirac parameterization to ResNet and show that it eliminates the need of careful initialization.
|name||output size||layer type|
In this section we compare plain networks with plain DiracNets. To do that, we trained both with 10-52 layers and the same number of parameters at the same depth (fig. 1). As expected, at 10 and 16 layers there is no difficulty in training plain networks, and both plain and plain-Dirac networks achieve the same accuracy. After that, accuracy of plain networks quickly drops, and with 52 layers only achieves 88%, whereas for Dirac parameterized networks it keeps growing. DiracNet with 34 layers achieves 92.8% validation accuracy, whereas simple plain only 91.2%. Plain 100-layer network does not converge and only achieves 40% train/validation accuracy, whereas DiracNet achieves 92.4% validation accuracy.
To compare plain Dirac parameterized networks with WRN we trained them with different widths from 1 to 4 and depths from 10 to 100 (fig. 1). As observed by the WRN authors, accuracy of ResNet is mainly determined by the number of parameters, and we even notice that wider networks achieve better performance than deeper ones. DiracNets, however, benefit from depth, and deeper networks with the same accuracy as wider ones have fewer parameters. In general, DiracNets need more parameters than ResNet to achieve top accuracy, and we were able to achieve 95.25% accuracy with DiracNet-28-10 with 36.5M parameters, which is close to WRN-28-10 with 96.0% and 36.5M parameters as well. We do not observe validation accuracy degradation when increasing width; the networks still perform well despite the massive number of parameters, just like WRN. Interestingly, plain DiracNet with only 28 layers is able to closely match ResNet with 1001 layers (table 2).
|NIN, Lin et al. (2013)||8.81||35.67|
|ELU, Clevert et al. (2015)||6.55||24.28|
As we leave $a$ and $b$ free of $\ell_2$-regularization, we can visualize the significance of various layers and how it changes during training by plotting their averages $\bar{a}$ and $\bar{b}$, which we did for DiracNet-34 trained on CIFAR-10 in fig. 2. Interestingly, the behaviour changes from lower to higher groups of the network with increasing dimensionality. We also note that no layers exhibit a degraded $a$ to $b$ ratio, meaning that all layers are involved in training. We also investigate these ratios in individual feature planes, and find that the number of degraded planes is low too.
As expected, Dirac parameterization does not bring accuracy improvements to ResNet on CIFAR, but eliminates the need for careful initialization. To test that, instead of the usually used MSRA init He et al. (2015b), we parameterize weights as:

$$\hat{W} = \mathrm{I} + W$$

omitting other terms of eq. (5) for simplicity, and initialize all weights from a normal distribution $\mathcal{N}(0, s^2)$, ignoring filter shapes. Then, we vary $s$ and observe that ResNet-28 converges to the same validation accuracy with statistically insignificant deviations, even for very small values of $s$, and only gives slightly worse results when $s$ is around 1. It does not converge when all weights are zeros, as expected. Additionally, we tried to use the same orthogonal initialization as for DiracNet and vary its scaling, in which case the range of the scaling gain is even wider.
We trained DiracNets with 18 and 34 layers and their ResNet equivalents on the ILSVRC2012 image classification dataset. We used the same setup as for ResNet training, and kept the same number of blocks per group. Unlike on CIFAR, DiracNet almost matches ResNet in accuracy (table 3), with very similar convergence curves (fig. 3) and the same number of parameters. Compared to simple plain VGG networks, DiracNets achieve the same accuracy with 10 times fewer parameters, similar to ResNet.
Our ImageNet pretrained models and their simpler folded convolution-ReLU chain variants are available at https://github.com/szagoruyko/diracnets.
| ||Network||# parameters||top-1 error||top-5 error|
|plain||VGG-CNN-S Chatfield et al. (2014)||102.9M||36.94||15.40|
|plain||VGG-16 Simonyan & Zisserman (2015)||138.4M||29.38||-|
|residual||ResNet-18 [our baseline]||11.7M||29.62||10.62|
|residual||ResNet-34 [our baseline]||21.8M||27.17||8.91|
We presented Dirac-parameterized networks, a simple and efficient way to train very deep networks with nearly state-of-the-art accuracy. Even though they are able to successfully train with hundreds of layers, after a certain number of layers there seems to be very small or no benefit in terms of accuracy for both ResNets and DiracNets. This is likely caused by underuse of parameters in deeper layers, and both architectures are prone to this issue to a different extent.
Even though on the large ImageNet dataset DiracNets are able to closely match ResNet in accuracy with the same number of parameters and a simpler architecture, they are significantly behind on the smaller CIFAR datasets, which we think is due to a lack of regularization, which matters more on small amounts of data. Due to the use of weight normalization and free scaling parameters, DiracNet is less regularized than ResNet, which we plan to investigate in the future.
We also observe that DiracNets share with WRN the ability to train with a massive number of parameters and still generalize well without negative effects of overfitting, which was initially thought to be due to residual connections. We now hypothesize that it is due to a combination of SGD with momentum at a high learning rate, which has a lot of noise, and stabilizing factors, such as residual or Dirac parameterization, batch normalization, etc.
Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pp. 1139–1147. JMLR Workshop and Conference Proceedings, May 2013.