DiracNets: Training Very Deep Neural Networks Without Skip-Connections

06/01/2017 ∙ by Sergey Zagoruyko, et al. ∙ Ecole nationale des Ponts et Chaussées

Deep neural networks with skip-connections, such as ResNet, show excellent performance in various image classification benchmarks. It has been observed, though, that the initial motivation behind them - training deeper networks - does not actually hold true, and that the benefits come from increased capacity rather than from depth. Motivated by this, and inspired by ResNet, we propose a simple Dirac weight parameterization, which allows us to train very deep plain networks without skip-connections and achieve nearly the same performance. This parameterization has a minor computational cost at training time and no cost at all at inference. We are able to achieve 95.5% accuracy on CIFAR-10 with a 34-layer deep plain network, surpassing 1001-layer deep ResNet and approaching Wide ResNet. Our parameterization also mostly eliminates the need for careful initialization in residual and non-residual networks. The code and models for our experiments are available at https://github.com/szagoruyko/diracnets





1 Introduction

There have been many attempts to train very deep networks. In image classification, after the success of Krizhevsky et al. (2012) with AlexNet (8 layers), the major improvement was brought by Simonyan & Zisserman (2015) with VGG (16-19 layers) and later by Szegedy et al. (2015) with Inception (22 layers). In recurrent neural networks, LSTM by Hochreiter & Schmidhuber (1997) allowed training deeper networks by introducing gated memory cells, leading to a major increase in parameter capacity and network performance. Recently, a similar idea was applied to image classification by Srivastava et al. (2015), who proposed Highway Networks, later improved by He et al. (2015a) with Residual Networks, resulting in a simple architecture with skip-connections, which was shown to generalize to many other tasks. Several other ways of adding skip-connections were also proposed, such as DenseNet by Huang et al. (2016), which passes all previous activations to each new layer.

Despite the success of ResNet, a number of recent works showed that the original motivation of training deeper networks does not actually hold true: for example, a ResNet might just be an ensemble of shallower networks (Veit et al., 2016), and ResNet widening is more effective than deepening (Zagoruyko & Komodakis, 2016), meaning that there is no benefit from increasing depth to more than 50 layers. It is also known that deeper networks can be more efficient than shallower and wider ones, so various methods were proposed to train deeper networks, such as well-designed initialization strategies and special nonlinearities (Bengio & Glorot, 2010; He et al., 2015b; Clevert et al., 2015), additional mid-network losses (Lee et al., 2014), better optimizers (Sutskever et al., 2013), knowledge transfer (Romero et al., 2014; Chen et al., 2016) and layer-wise training (Schmidhuber, 1992).

To summarize, deep networks with skip-connections have the following problems:

  • Feature reuse problem: upper layers might not learn useful representations given previous activations;

  • Widening is more effective than deepening: there is no benefit from increasing depth;

  • Actual depth is not clear: it might be determined by the shortest path.

However, the features learned by such networks are generic, and they are able to train with a massive number of parameters without negative effects of overfitting. We are thus interested in a better understanding of networks with skip-connections, which would allow us to train very deep plain networks (without skip-connections) and obtain the benefits they could bring, such as higher parameter efficiency, better generalization, and improved computational efficiency.

Motivated by this, we propose a novel weight parameterization for neural networks, which we call Dirac parameterization, applicable to a wide range of network architectures. Furthermore, using this parameterization, we propose a novel plain VGG- and ResNet-like architecture without explicit skip-connections, which we call DiracNet. These networks are able to train with hundreds of layers, surpass 1001-layer ResNet while having only 28 layers, and approach Wide ResNet (WRN) accuracy. We should note that we train DiracNets end-to-end, without any need for layer-wise pretraining. We believe that our work is an important step towards simpler and more efficient deep neural networks.

Overall, our contributions are the following:

  • We propose generic Dirac weight parameterization, applicable to a wide range of neural network architectures;

  • Our plain Dirac parameterized networks are able to train end-to-end with hundreds of layers. Furthermore, they are able to train with a massive number of parameters and still generalize well without negative effects of overfitting;

  • Dirac parameterization can be used in combination with explicit skip-connections like ResNet, in which case it eliminates the need for careful initialization;

  • In a trained network Dirac-parameterized filters can be folded into a single vector, resulting in a simple and easily interpretable VGG-like network, a chain of convolution-ReLU pairs.

2 Dirac parameterization

Inspired by ResNet, we parameterize weights as a residual of the Dirac function instead of adding an explicit skip-connection. Because convolving any input with the Dirac function results in the same input, this helps propagate information deeper into the network. Similarly, during backpropagation it helps alleviate the vanishing gradients problem.

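Let $\mathbf{I}$ be the identity in the algebra of discrete convolutional operators, i.e. convolving it with input $\mathbf{x}$ results in the same output $\mathbf{x}$ ($\odot$ denotes convolution):

$$\mathbf{I} \odot \mathbf{x} = \mathbf{x} \qquad (1)$$

In the two-dimensional case convolution can be expressed as matrix multiplication, so $\mathbf{I}$ is simply an identity matrix, or a Kronecker delta $\delta$. We generalize this operator to the case of a convolutional layer, where input $\mathbf{x} \in \mathbb{R}^{M, N_1, N_2, \dots, N_L}$ (that consists of $M$ channels of spatial dimensions $(N_1, N_2, \dots, N_L)$) is convolved with weight $\hat{\mathbf{W}} \in \mathbb{R}^{M, M, K_1, K_2, \dots, K_L}$ (combining $M$ filters; outputs are over the first dimension of $\hat{\mathbf{W}}$, inputs are over the second dimension of $\hat{\mathbf{W}}$) to produce an output $\mathbf{y}$ of $M$ channels, i.e. $\mathbf{y} = \hat{\mathbf{W}} \odot \mathbf{x}$. In this case we define Dirac delta $\mathbf{I} \in \mathbb{R}^{M, M, K_1, K_2, \dots, K_L}$, preserving eq. (1), as the following:

$$\mathbf{I}(i, j, l_1, l_2, \dots, l_L) = \begin{cases} 1 & \text{if } i = j \text{ and } l_m = \lceil K_m / 2 \rceil \text{ for } m = 1, \dots, L, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$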
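Given the above definition, for a convolutional layer $\mathbf{y} = \hat{\mathbf{W}} \odot \mathbf{x}$ we propose the following parameterization for the weight $\hat{\mathbf{W}}$ (hereafter we omit bias for simplicity):

$$\mathbf{y} = \hat{\mathbf{W}} \odot \mathbf{x}, \qquad (3)$$

$$\hat{\mathbf{W}} = \mathrm{diag}(\mathbf{a})\,\mathbf{I} + \mathbf{W}, \qquad (4)$$

where $\mathbf{a} \in \mathbb{R}^{M}$ is a scaling vector learned during training, $\mathbf{I}$ is the Dirac delta, and $\mathbf{W}$ is a weight vector. Each $i$-th element of $\mathbf{a}$ corresponds to the scaling of the $i$-th filter of $\mathbf{W}$. When all elements of $\mathbf{a}$ are close to zero, it reduces to a simple linear layer $\mathbf{W} \odot \mathbf{x}$. When they are higher than 1 and $\mathbf{W}$ is small, Dirac dominates, and the output is close to the input.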
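The identity property of the Dirac delta and the weight parameterization above can be checked numerically. The following is a minimal pure-Python sketch, not the authors' implementation: it is restricted to one spatial dimension and odd kernel sizes, it implements convolution as zero-padded cross-correlation (the convention of deep learning layers), and it assumes the delta sits at the centre tap of the kernel. The function names `dirac_kernel`, `conv1d_same` and `dirac_weight` are hypothetical.

```python
def dirac_kernel(channels, kernel_size):
    """Identity kernel: 1 at the centre tap when output channel i equals
    input channel j, 0 elsewhere (centre placement is an assumption)."""
    centre = kernel_size // 2
    return [[[1.0 if (i == j and l == centre) else 0.0
              for l in range(kernel_size)]
             for j in range(channels)]
            for i in range(channels)]


def conv1d_same(weight, x):
    """Zero-padded 'same' cross-correlation.

    weight[i][j][l]: output channel i, input channel j, tap l (odd size).
    x[j][t]: input channel j, position t.
    """
    out_ch, in_ch, k = len(weight), len(weight[0]), len(weight[0][0])
    n, pad = len(x[0]), k // 2
    y = [[0.0] * n for _ in range(out_ch)]
    for i in range(out_ch):
        for j in range(in_ch):
            for t in range(n):
                for l in range(k):
                    s = t + l - pad  # input position seen by tap l
                    if 0 <= s < n:
                        y[i][t] += weight[i][j][l] * x[j][s]
    return y


def dirac_weight(a, w):
    """Scaled Dirac delta plus free weight: a[i] scales the i-th filter
    of the delta, and w is added on top."""
    channels, k = len(w), len(w[0][0])
    delta = dirac_kernel(channels, k)
    return [[[a[i] * delta[i][j][l] + w[i][j][l] for l in range(k)]
             for j in range(channels)]
            for i in range(channels)]


# Convolving with the Dirac kernel returns the input unchanged.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]]
assert conv1d_same(dirac_kernel(2, 3), x) == x

# With w = 0 the layer only rescales each channel by a[i].
zero_w = [[[0.0] * 3 for _ in range(2)] for _ in range(2)]
scaled = conv1d_same(dirac_weight([2.0, 0.5], zero_w), x)
```

With a zero weight tensor the second call simply multiplies channel 0 by 2.0 and channel 1 by 0.5, which matches the regime described above where the Dirac term dominates.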

We also use weight normalization Salimans & Kingma (2016) for $\mathbf{W}$, which we find useful for stabilizing training of very deep networks with more than 30 layers:

$$\hat{\mathbf{W}} = \mathrm{diag}(\mathbf{a})\,\mathbf{I} + \mathrm{diag}(\mathbf{b})\,\mathbf{W}_{\mathrm{norm}}, \qquad (5)$$

where $\mathbf{b} \in \mathbb{R}^{M}$ is another scaling vector (to be learned during training), and $\mathbf{W}_{\mathrm{norm}}$ is a normalized weight vector where each filter is normalized by its Euclidean norm. We initialize $\mathbf{a}$ to 1.0 and $\mathbf{b}$ to 0.1, and do not $l_2$-regularize them during training, as it would lead to degenerate solutions when their values are close to zero. We initialize $\mathbf{W}$ from a normal distribution. Gradients of (5) can be easily calculated via the chain rule. We rely on automatic differentiation, available in all major modern deep learning frameworks (PyTorch, Tensorflow, Theano), to implement it.

Overall, this adds a negligible number of parameters to the network (just two scaling multipliers per channel) during training, which can be folded into filters at test time.
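Folding the two scaling vectors into the filters amounts to computing the weight of eq. (5) once and then using it as an ordinary convolution kernel. The sketch below illustrates this in pure Python for one spatial dimension and an odd kernel size. The names `normalize_filters` and `fold_dirac` are hypothetical, the delta is assumed to sit at the centre tap, and every filter is assumed to have a non-zero norm.

```python
import math


def normalize_filters(w):
    """Divide each output filter w[i] by its Euclidean norm, taken over
    all input channels and taps. Assumes no filter is all zeros."""
    normalized = []
    for filt in w:
        norm = math.sqrt(sum(v * v for row in filt for v in row))
        normalized.append([[v / norm for v in row] for row in filt])
    return normalized


def fold_dirac(a, b, w):
    """Return one plain kernel: a[i] * delta + b[i] * normalized w[i].

    w[i][j][l]: output channel i, input channel j, tap l.
    """
    channels, k = len(w), len(w[0][0])
    centre = k // 2  # assumed position of the Dirac delta
    w_norm = normalize_filters(w)
    return [[[(a[i] if (i == j and l == centre) else 0.0)
              + b[i] * w_norm[i][j][l]
              for l in range(k)]
             for j in range(channels)]
            for i in range(channels)]


# Two channels, kernel size 3, with a = 1.0 and b = 0.1 as at initialization.
w = [[[3.0, 0.0, 0.0], [0.0, 4.0, 0.0]],
     [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]]
folded = fold_dirac([1.0, 1.0], [0.1, 0.1], w)
```

After folding, inference only needs `folded`; the vectors `a` and `b` and the normalization step are no longer evaluated, which is why the parameterization adds no cost at test time.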

2.1 Connection to ResNet

Let us discuss the connection of Dirac parameterization to ResNet. Due to distributivity of convolution, eq. (3) can be rewritten to show that the skip-connection in Dirac parameterization is implicit (we omit $\mathbf{a}$ for simplicity):

$$\mathbf{y} = \sigma\big((\mathbf{I} + \mathbf{W}) \odot \mathbf{x}\big) = \sigma(\mathbf{x} + \mathbf{W} \odot \mathbf{x}), \qquad (6)$$

where $\sigma(\mathbf{x})$ is a function combining nonlinearity and batch normalization. The skip connection in ResNet is explicit:

$$\mathbf{y} = \mathbf{x} + \sigma(\mathbf{W} \odot \mathbf{x}). \qquad (7)$$

This means that Dirac parameterization and ResNet differ only by the order of nonlinearities. Each Dirac-parameterized layer adds complexity by having an unavoidable nonlinearity, which is not the case for ResNet. Additionally, Dirac parameterization can be folded into a single weight vector at inference.
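The difference in the order of nonlinearities can be seen on a small example. The sketch below uses the matrix case, where convolution reduces to a matrix-vector product and the Dirac delta is the identity matrix. As a simplifying assumption, the nonlinearity is a plain ReLU with no batch normalization; all function names are hypothetical.

```python
def matvec(w, x):
    """Matrix-vector product, standing in for convolution."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def relu(v):
    return [max(0.0, t) for t in v]


def dirac_layer(w, x):
    """Identity folded into the weight, then the nonlinearity."""
    n = len(x)
    w_hat = [[w[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return relu(matvec(w_hat, x))


def implicit_skip_layer(w, x):
    """The same layer written with the skip made visible: relu(x + W x)."""
    return relu([xi + yi for xi, yi in zip(x, matvec(w, x))])


def resnet_layer(w, x):
    """Explicit skip-connection: x + relu(W x)."""
    return [xi + yi for xi, yi in zip(x, relu(matvec(w, x)))]


w = [[0.5, 0.0], [0.0, -2.0]]
x = [1.0, 1.0]
print(dirac_layer(w, x))          # [1.5, 0.0]
print(implicit_skip_layer(w, x))  # [1.5, 0.0]
print(resnet_layer(w, x))         # [1.5, 1.0]
```

The two Dirac forms agree, while the ResNet form differs in the second component: there the input passes around the nonlinearity, whereas in the Dirac layer the nonlinearity is applied to the sum.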

3 Experimental results

We adopt an architecture similar to ResNet and VGG, and instead of explicit skip-connections use Dirac parameterization (see table 1). The architecture consists of three groups, where each group has $2N$ convolutional layers ($2N$ is used for easier comparison with basic-block ResNet and WRN, which have $N$ blocks of pairs of convolutional layers per group). For simplicity we use max-pooling between groups to reduce spatial resolution. We also define width $k$ as in WRN to control the number of parameters.

We chose CIFAR and ImageNet for our experiments. As baselines, we chose Wide ResNet with identity mapping in the residual block He et al. (2016) and basic block (two $3 \times 3$ convolutions per block). We used the same training hyperparameters as WRN for both CIFAR and ImageNet.

The experimental section is composed as follows. First, we provide a detailed experimental comparison between plain and plain-Dirac networks, and compare them with ResNet and WRN on CIFAR. Also, we analyze evolution of scaling coefficients during training and their final values. Then, we present ImageNet results. Lastly, we apply Dirac parameterization to ResNet and show that it eliminates the need of careful initialization.

name      output size  layer type
conv1                  [3×3, 16]
group1    32×32        × 2N
max-pool  16×16
group2    16×16        × 2N
max-pool  8×8
group3    8×8          × 2N
avg-pool               []
Table 1: Structure of DiracNets. Network width is determined by factor $k$. Groups of convolutions are shown in brackets as [kernel shape, number of input channels, number of output channels], where $2N$ is the number of layers in a group. The final classification layer and dimensionality-changing layers are omitted for clarity.
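The group structure described above can be enumerated programmatically. The sketch below is illustrative only: the helper name `diracnet_layout` is hypothetical, and the channel counts of 16k, 32k and 64k for the three groups are an assumption borrowed from the WRN convention, as are the 32×32 input size and the 1×1 output of the final average pooling.

```python
def diracnet_layout(n, k, input_size=32, base_channels=16):
    """Return rows of (name, output size, channels, number of conv layers).

    n: number of layer pairs per group, so each group has 2 * n layers.
    k: width factor.
    Channel counts base_channels * 2**g * k are an assumption (WRN-style).
    """
    rows = [("conv1", input_size, base_channels, 1)]
    size = input_size
    for g in range(3):
        if g > 0:
            size //= 2  # max-pooling halves the spatial resolution
            rows.append(("max-pool", size, None, 0))
        rows.append((f"group{g + 1}", size, base_channels * 2 ** g * k, 2 * n))
    rows.append(("avg-pool", 1, None, 0))
    return rows


for row in diracnet_layout(n=4, k=10):
    print(row)
```

For `n=4` and `k=10` this lists three groups of 8 layers each at resolutions 32, 16 and 8. Dimensionality-changing layers and the final classifier are left out, as in the table.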

3.1 Plain networks with Dirac parameterization

In this section we compare plain networks with plain DiracNets. To do that, we trained both with 10-52 layers and the same number of parameters at the same depth (fig. 1). As expected, at 10 and 16 layers there is no difficulty in training plain networks, and both plain and plain-Dirac networks achieve the same accuracy. After that, the accuracy of plain networks quickly drops, reaching only 88% with 52 layers, whereas for Dirac-parameterized networks it keeps growing. DiracNet with 34 layers achieves 92.8% validation accuracy, whereas the simple plain network achieves only 91.2%. A plain 100-layer network does not converge and only achieves 40% train/validation accuracy, whereas DiracNet achieves 92.4% validation accuracy.

Figure 1: DiracNet and ResNet with different depth/width, each circle area is proportional to number of parameters. DiracNet needs more width (i.e. parameters) to match ResNet accuracy. Accuracy is calculated as median of 5 runs.

3.2 Plain Dirac networks and residual networks

To compare plain Dirac-parameterized networks with WRN we trained them with width from 1 to 4 and depth from 10 to 100 (fig. 1). As observed by the WRN authors, the accuracy of ResNet is mainly determined by the number of parameters, and we even notice that wider networks achieve better performance than deeper ones. DiracNets, however, benefit from depth, and deeper networks with the same accuracy as wider ones have fewer parameters. In general, DiracNets need more parameters than ResNet to achieve top accuracy, and we were able to achieve 95.25% accuracy with DiracNet-28-10 with 36.5M parameters, which is close to WRN-28-10 with 96.0% and 36.5M parameters as well. We do not observe validation accuracy degradation when increasing width: the networks still perform well despite the massive number of parameters, just like WRN. Interestingly, plain DiracNet with only 28 layers is able to closely match ResNet with 1001 layers (table 2).

                            depth-width  # params  CIFAR-10     CIFAR-100
NIN, Lin et al. (2013)                             8.81         35.67
ELU, Clevert et al. (2015)                         6.55         24.28
VGG                         16           20M       6.09±0.11    25.92±0.09
DiracNet (ours)             28-5         9.1M      5.16±0.14    23.44±0.14
                            28-10        36.5M     4.75±0.16    21.54±0.18
ResNet                      1001-1       10.2M     4.92         22.71
Wide ResNet                 28-10        36.5M     4.00         19.25
Table 2: CIFAR performance (error, %) of plain (top part) and residual (bottom part) networks with horizontal flips and crops data augmentation. DiracNets outperform all other plain networks by a large margin, and approach residual architectures. No dropout is used. For VGG and DiracNets we report mean±std of 5 runs.

3.3 Analysis of scaling coefficients

As we leave $\mathbf{a}$ and $\mathbf{b}$ free of $l_2$-regularization, we can visualize the significance of various layers and how it changes during training by plotting their averages $\bar{a}$ and $\bar{b}$, which we did for DiracNet-34 trained on CIFAR-10 in fig. 2. Interestingly, the behaviour changes from lower to higher groups of the network with increasing dimensionality. We also note that no layers exhibit a degraded $\mathbf{a}$ to $\mathbf{b}$ ratio, meaning that all layers are involved in training. We also investigate these ratios in individual feature planes, and find that the number of degraded planes is low too.

Figure 2: Average values of $\mathbf{a}$ and $\mathbf{b}$ during training for different layers of DiracNet-34. Deeper color means deeper layer in a group of blocks.

3.4 Dirac parameterization for ResNet weight initialization

As expected, Dirac parameterization does not bring accuracy improvements to ResNet on CIFAR, but eliminates the need for careful initialization. To test that, instead of the commonly used MSRA init He et al. (2015b), we parameterize weights as:

$$\hat{\mathbf{W}} = \mathbf{I} + \mathbf{W},$$

omitting other terms of eq. (5) for simplicity, and initialize all weights from a normal distribution $\mathcal{N}(0, \sigma^2)$, ignoring filter shapes. Then, we vary $\sigma$ and observe that ResNet-28 converges to the same validation accuracy with statistically insignificant deviations, even for very small values of $\sigma$, and only gives slightly worse results when $\sigma$ is around 1. It does not converge when all weights are zeros, as expected. Additionally, we tried to use the same orthogonal initialization as for DiracNet and vary its scaling, in which case the range of the scaling gain is even wider.
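One way to build intuition for this robustness is to look at signal propagation at initialization. The sketch below does not reproduce the ResNet-28 experiment: it is a toy illustration, under the assumption of a plain stack of fully connected ReLU layers (the matrix case of convolution) with weights drawn from a normal distribution with a small standard deviation. The function name `forward_norm` and the chosen sizes are hypothetical.

```python
import math
import random


def forward_norm(depth, n, sigma, use_dirac, seed=0):
    """Push a vector of ones through `depth` random ReLU layers and
    return the Euclidean norm of the final activation.

    Each layer weight is drawn from N(0, sigma^2); with use_dirac=True
    the identity matrix is added, i.e. the weight is I + W.
    """
    rng = random.Random(seed)
    x = [1.0] * n
    for _ in range(depth):
        w = [[rng.gauss(0.0, sigma) + (1.0 if use_dirac and i == j else 0.0)
              for j in range(n)]
             for i in range(n)]
        x = [max(0.0, sum(w[i][j] * x[j] for j in range(n)))
             for i in range(n)]
    return math.sqrt(sum(v * v for v in x))


plain = forward_norm(depth=30, n=8, sigma=1e-3, use_dirac=False)
dirac = forward_norm(depth=30, n=8, sigma=1e-3, use_dirac=True)
print(plain, dirac)
```

With a small standard deviation the plain stack shrinks the signal towards zero within a few layers, while the stack with the added identity keeps its norm close to that of the input. This only concerns the forward pass at initialization and says nothing about final accuracy.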

3.5 ImageNet results

We trained DiracNets with 18 and 34 layers and their ResNet equivalents on the ILSVRC2012 image classification dataset. We used the same setup as for ResNet training, and kept the same number of blocks per group. Unlike on CIFAR, DiracNet almost matches ResNet in accuracy (table 3), with very similar convergence curves (fig. 3) and the same number of parameters. Compared to simple plain VGG networks, DiracNets achieve the same accuracy with 10 times fewer parameters, similar to ResNet.

Our ImageNet pretrained models and their simpler folded convolution-ReLU chain variants are available at https://github.com/szagoruyko/diracnets.

          Network                              # parameters  top-1 error  top-5 error
plain     VGG-CNN-S Chatfield et al. (2014)    102.9M        36.94        15.40
          VGG-16 Simonyan & Zisserman (2015)   138.4M        29.38        -
          DiracNet-18                          11.7M         30.37        10.88
          DiracNet-34                          21.8M         27.79        9.34
residual  ResNet-18 [our baseline]             11.7M         29.62        10.62
          ResNet-34 [our baseline]             21.8M         27.17        8.91
Table 3: Single crop top-1 and top-5 error on ILSVRC2012 validation set for plain (top) and residual (bottom) networks.

Figure 3: Convergence of DiracNet and ResNet on ImageNet. Training top-5 error is shown with dashed lines, validation - with solid. All networks are trained using the same optimization hyperparameters. DiracNet closely matches ResNet accuracy with the same number of parameters.

4 Discussion

We presented Dirac-parameterized networks, a simple and efficient way to train very deep networks with nearly state-of-the-art accuracy. Even though they are able to successfully train with hundreds of layers, after a certain number of layers there seems to be very small or no benefit in terms of accuracy for both ResNets and DiracNets. This is likely caused by underuse of parameters in deeper layers, and both architectures are prone to this issue to a different extent.

Even though on the large ImageNet dataset DiracNets are able to closely match ResNet in accuracy with the same number of parameters and a simpler architecture, they are significantly behind on the smaller CIFAR datasets, which we think is due to a lack of regularization, which matters more on small amounts of data. Due to the use of weight normalization and free scaling parameters, DiracNet is less regularized than ResNet, which we plan to investigate in the future.

We also observe that DiracNets share with WRN the ability to train with a massive number of parameters and still generalize well without negative effects of overfitting, which was initially thought to be due to residual connections. We now hypothesize that it is due to a combination of SGD with momentum at a high learning rate, which has a lot of noise, and stabilizing factors, such as residual or Dirac parameterization, batch normalization, etc.