Gradual DropIn of Layers to Train Very Deep Neural Networks

11/22/2015 ∙ by Leslie N. Smith, et al. ∙ University of Maryland, U.S. Navy

We introduce the concept of dynamically growing a neural network during training. In particular, an untrainable deep network starts as a trainable shallow network and new layers are slowly, organically added during training, thereby increasing the network's depth. This is accomplished by a new layer, which we call DropIn. The DropIn layer starts by passing the output from a previous layer (effectively skipping over the newly added layers), then increasingly includes units from the new layers for both feedforward and backpropagation. We show that deep networks, which are untrainable with conventional methods, will converge with DropIn layers interspersed in the architecture. In addition, we demonstrate that DropIn provides regularization during training in a manner analogous to dropout. Experiments are described with the MNIST dataset and various expanded LeNet architectures, the CIFAR-10 dataset with its architecture expanded from 3 to 11 layers, and the ImageNet dataset with the AlexNet architecture expanded to 13 layers and the VGG 16-layer architecture.




1 Introduction

Over the past few years, state-of-the-art results for image recognition [13, 19, 26], object detection [5], face recognition [27], speech recognition [7], machine translation [25], image caption generation [28], driverless car technology [11], and other applications [14] have required increasingly deeper neural networks.

Network depth refers to the number of layers in the architecture. It is well known that adding layers to neural networks makes them more expressive [15]. Each year, the ImageNet Challenge [18] is held in which teams are expected, given an image, to detect, localize, or recognize an object in the image. Deep convolutional neural networks (CNNs) have dominated the competition since Krizhevsky et al. won in 2012 [13], and each year since, the winner of the competition used a deeper network than the previous year’s winner [18, 19, 26].

However, training a very deep network is a difficult and open research problem [4, 6, 22]. It is difficult to train very deep networks because the error norm during backpropagation can grow or vanish exponentially. In addition, very large training datasets are necessary when the network has millions of weights.

Here we suggest a dynamic architecture that grows during the training process and allows for the training of very deep networks. We illustrate this with our DropIn layer, where new layers are skipped at the start of the training, as though they were not present. This allows the weights of the included layers to start converging. Over a number of iterations the DropIn layer increasingly includes activations from the inserted layers, which gradually trains the weights in these added layers.

DropIn follows the philosophy embedded within curriculum learning [2]. With curriculum learning one starts with an easier problem and incrementally increases the difficulty. Here too, one starts training a shallow architecture and after convergence begins, DropIn incrementally modifies the architecture to slowly include units from the new layers.

In addition, DropIn can be used in a mode analogous to dropout [20] for the regularization of a deep neural network during training. Instead of setting random activations to zero, as is done in dropout, DropIn sets these activations to the activations from a previous layer. We demonstrate that the “noise” from mixing the activations from previous layers provides regularization during training. In addition, both DropIn and dropout can be viewed as training a large collection of networks with varied architectures and extensive weight sharing.

The contributions of this paper are:

A dynamic architecture that can grow during training.

The details of a DropIn layer for enabling the training of very deep networks and for regularization during training.

Examples of successfully training deep architectures that cannot be trained with conventional methods on MNIST, CIFAR-10, and ImageNet.

2 Related work

Methods for training very deep networks have centered on either initializing the network weights or developing new architectures; DropIn falls in the latter category.

2.1 Initialization of network weights

Sutskever et al. [24] investigate the difficulty in training deep networks and conclude that both proper initialization and momentum are necessary. Glorot and Bengio [6] recommend an initialization method called normalized initialization to allow the training of deep networks. He et al. [8] recently improved upon the “normalized initialization” method by changing the distribution to take into account ReLU layers.

Hinton et al. [9] proposed first training layer by layer in an unsupervised fashion so that a transformed version of the input could be realized. Erhan [4] later characterized the mathematics of the unsupervised pre-training and offered an explanation for its success.

Sussillo and Abbott [23] suggest an initialization scheme called Random Walk Initialization based on scaling the initial random matrices correctly. By multiplying the error gradient by a correctly scaled random matrix at each layer, an unbiased random walk is formed. This is one of only a few papers that show the results of experiments with networks consisting of hundreds of layers.

2.2 Developing new architecture

Raiko et al. [16] introduce the concept of skip connections by adding a linear transformation to the usual non-linear transformation of the input to a unit. Skip connections separate the linear and non-linear portions of the activations and allow the linear part to “skip” to higher layers. This is similar to DropIn in some ways, but the purpose of DropIn differs from that of skip connections, and DropIn does not need to learn any parameters.

Romero et al. [17] suggest training a thin, deep student network (called a fitnet) from a larger but shallower teacher network. The authors accomplish this by utilizing the output of the teacher’s hidden layers as a hint for the student’s hidden layers.

Srivastava et al. [21, 22] propose a new architecture, which they named Highway Networks, where the output of a layer’s neuron contains a combination of the input and the output. Highway networks use carry gates inspired by long short-term memory (LSTM) recurrent neural networks (RNNs) to regulate how much of the input is carried to the next layer. The authors demonstrate that their structure permits training networks of hundreds of layers (up to 900 layers) [21, 22]. The gate parameters are learned along with the other parameters of the network. Zhang et al. [30] applied highway networks to LSTM recurrent neural networks. DropIn is a simpler approach than highway networks as it does not contain gate parameters that need to be learned.

Breuel [3] discusses a dynamic network that he describes as a biologically plausible “reconfigurable” network. In this network different units are weighted dynamically to produce different configurations. This allows a single network to perform multiple tasks. DropIn represents a different type of dynamic network that grows during training rather than reconfigures for each task.

2.3 Regularization during training

The well-known dropout [10, 20] method is an effective means to improve the training of deep neural networks. During training dropout randomly zeros a neuron’s output activation with a probability $p$, called the dropout ratio, so that the network cannot rely on a particular configuration. This reduces overfitting to the training data and the resulting network is more robust and better generalizes to unseen data. While dropout “samples from an exponential number of different ‘thinned’ networks” [20], DropIn samples from an exponential number of different thinner and shallower sub-networks. Like dropout, DropIn randomly changes the configuration so that the network cannot rely on a particular configuration.

Baldi and Sadowski [1] provide a theoretical basis for understanding dropout, demonstrating that dropout regulates the training and prevents overfitting by approximating an average of a large ensemble of networks. A similar theoretical understanding (and benefits) can also apply to DropIn.

Figure 1: Diagram of traditional vs DropIn training method. The DropIn method sends activations from Layer $l$ to Layer $l+1$ with a ratio $1-p$ and from Layer $l-1$ to Layer $l+1$ (thus skipping Layer $l$) with a ratio $p$.

3 DropIn method

In this section we provide a mathematical basis for DropIn as well as some implementation details.

3.1 Model description

There are two modes of running DropIn: first to gradually include skipped layers, which we refer to as gradual DropIn, and second as a regularizer, which we named regularizing DropIn. Figure 1 provides a visual reference as to how the DropIn unit works.

Gradual DropIn initially passes on only the activations from the previous layer, effectively skipping the new layers. For each iteration number, $t$, the ratio $p$ is computed as $p = \max(0, 1 - t/d)$ for DropIn length $d$, which is the number of iterations over which $p$ reduces from 1 to 0. Then the number of activations copied from layer $l-1$ drops as $p \cdot n$, where $n$ is the total number of activations in the layer $l$. The remaining activations are accepted from the new layer and backpropagation trains the weights of these newly added units.

For regularizing DropIn, the DropIn probability ratio $p$ is set to a static value in $[0, 1]$. In this case, DropIn works analogously to dropout but instead of setting values to zero, they are set to the activations of a previous layer (e.g., $l-1$). The choice of which activations come from which layer is made randomly and changes at each iteration.

We follow the notation in the dropout paper [20] to show this more formally. Namely, we start with a neural network composed of some number of layers, $L$, where $l \in \{1, \dots, L\}$ is the layer index. Also, $y^{(l)}$ represents the vector of outputs from layer $l$ and is the input to the next layer $l+1$. Let $y^{(0)} = x$ be the data input to the first layer. In addition, $W^{(l)}$ and $b^{(l)}$ are the weights and biases at layer $l$. To allow us to track the evolving nature of the network, we include the training iteration number, $t$, and the layer’s unit index number, $j$.

The first equation for gradual DropIn is a vector of zeros then ones, which is designated as:

$$r_j^{(l)}(t) = \begin{cases} 0, & j \le p(t)\, n \\ 1, & \text{otherwise} \end{cases}$$

For regularizing DropIn, the equation for $r_j^{(l)}(t)$ with a probability ratio $p$ is:

$$r_j^{(l)}(t) \sim \mathrm{Bernoulli}(1 - p)$$

i.e., a 0-1 vector where each value is distributed as a Bernoulli random variable with probability $1 - p$.

Once $r^{(l)}$ is set, the remaining equations (dropping $t$ and $j$ for simplicity) are the same for both modes, namely for layer $l$:

$$\tilde{y}^{(l)} = r^{(l)} * y^{(l)} + \left(1 - r^{(l)}\right) * y^{(k)}$$

$$z^{(l+1)} = W^{(l+1)} \tilde{y}^{(l)} + b^{(l+1)}$$

$$y^{(l+1)} = f\left(z^{(l+1)}\right)$$

where $*$ denotes the element-wise product, $f$ is the activation function, and $k$ is any layer less than layer $l$. These equations are similar to those for dropout, except instead of some of the outputs being zero, they are set to the values from a previous layer, $y^{(k)}$.
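The two modes described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' Caffe layer: the function names are hypothetical, the linear schedule and the convention that a mask value of 0 means "copied from the earlier layer" are assumptions taken from the description in this section, and both layers are assumed to have the same number of units.

```python
import random

def dropin_ratio(iteration, dropin_length):
    """Fraction p of units still copied from the earlier layer; falls from 1 to 0."""
    return max(0.0, 1.0 - iteration / dropin_length)

def gradual_mask(num_units, iteration, dropin_length):
    """Zeros then ones: the first p * n units are copied, the rest use the new layer."""
    num_copied = round(dropin_ratio(iteration, dropin_length) * num_units)
    return [0.0] * num_copied + [1.0] * (num_units - num_copied)

def regularizing_mask(num_units, p, rng):
    """Each unit independently uses the new layer with probability 1 - p."""
    return [0.0 if rng.random() < p else 1.0 for _ in range(num_units)]

def dropin_forward(y_new, y_prev, mask):
    """Mix the two layers' activations unit by unit."""
    return [r * a + (1.0 - r) * b for r, a, b in zip(mask, y_new, y_prev)]

def dropin_backward(grad_out, mask):
    """Route the gradient to whichever layer supplied each unit."""
    grad_new = [r * g for r, g in zip(mask, grad_out)]
    grad_prev = [(1.0 - r) * g for r, g in zip(mask, grad_out)]
    return grad_new, grad_prev

# Halfway through a DropIn length of 100 iterations, half of 8 units are copied.
mask = gradual_mask(8, iteration=50, dropin_length=100)
mixed = dropin_forward([2.0] * 8, [5.0] * 8, mask)
```

In this sketch the backward pass sends gradient to the new layer only for units whose mask value is 1, which matches the statement that backpropagation trains the weights of the newly included units.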

3.2 Implementation

We implemented our method in Caffe [12] by creating a new layer called DropIn. The parameters for the DropIn layer include a DropIn ratio (Figure 1) and a DropIn length, as described in Section 3.1.

DropIn requires that the size of both the new layer and the previous layer be the same. Hence, we also implemented a Resize layer to allow reshaping a layer’s output to a user-specified size. The Resize layer modifies its input to a user-specified height, width, and number of channels/filters. The Resize layer allows DropIn to work with any two layers, even when the sizes of their outputs are different.

4 Experiments

The purpose of this section is to demonstrate the effectiveness of DropIn on several standard datasets but with architectures that are not trainable with standard methods. No attempt was made to optimize the architecture or hyper-parameters for higher accuracy because our main objective was to show that a deep architecture that will not converge without DropIn will converge with it. However, the results in Sections 4.3 and 4.4 also demonstrate an increase in accuracy by using a deeper network for ImageNet.

All of the following experiments were run with Caffe (downloaded August 31, 2015) using CUDA 7.0 and Nvidia’s CuDNN. These experiments were run on a 64-node cluster with 8 Nvidia Titan Black GPUs, 128 GB memory, and dual Intel Xeon E5-2620 v2 CPUs per node.

The following subsections depict, in table form, the structure of several networks. We use the naming convention {layer type}{layer number}-{number of outputs}(filter size). For example, conv1_2-32 represents a convolutional layer numbered 1_2 with 32 outputs and filters sized . DropIn layers are denoted as dropin , as depicted in Figure 1.
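As a reading aid for the tables, the naming convention can be unpacked with a small parser. This is a hypothetical helper, not part of the authors' code; it handles only the {layer type}{layer number}-{number of outputs} part of a name and ignores the optional filter size.

```python
import re

# Layer type (letters), layer number (digits with an optional _suffix), output count.
LAYER_NAME = re.compile(r"^(?P<kind>[a-z]+)(?P<number>\d+(?:_\d+)?)-(?P<outputs>\d+)$")

def parse_layer_name(name):
    """Split a table entry such as 'conv1_2-32' into (type, number, outputs)."""
    match = LAYER_NAME.match(name)
    if match is None:
        raise ValueError(f"not a layer name in the expected form: {name!r}")
    return match["kind"], match["number"], int(match["outputs"])

example = parse_layer_name("conv1_2-32")  # ('conv', '1_2', 32)
```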

In Section 4.1 we show that very deep networks are trainable when using gradual DropIn on expanded LeNet with MNIST data. In Section 4.2 we show the effect of DropIn length on training accuracy for the expanded CIFAR-10 network and that a small performance gain is possible with added layers. In Section 4.3 we show that an expanded AlexNet architecture increases accuracy and is trainable only with gradual DropIn. In Section 4.4 the VGG16 network is trained using gradual DropIn without the need to transfer weights from a shallower network.

LeNet LeNet(2N) + DropIn

conv1_1-20 conv1_1-20
dropin (1_1 + 1_2)
dropin (1_2 + 1_3)
dropin (1_(N-1) + 1_N)

dropin (2_1 + 2_2)
dropin (2_2 + 2_3)
dropin (2_(N-1) + 2_N)
Table 1: Network architecture for LeNet and LeNet(2N) + DropIn.

Figure 2: Classification accuracy while training the LeNet(10) + DropIn architecture with MNIST data. Curves represent different DropIn lengths. (Best viewed in color)

Figure 3: Classification accuracy while training LeNet(2N) + DropIn, for N = 5, 15, and 25, with MNIST data. Curves represent different network depths. (Best viewed in color)

4.1 MNIST

This dataset consists of 70,000 grey-scale images with a resolution of 28x28. Of these, 60,000 are for training and 10,000 are for testing. There are ten classes, each a different handwritten digit from zero to nine, with 7,000 images per class.

The standard network architecture for the classification of MNIST, provided in the Caffe package, is the 4-layer LeNet consisting of 2 convolutional/max-pooling layers followed by 2 fully-connected layers (see the first column of Table 1 for details). Inspired by the work in [22], we increased the number of convolutional layers from two to 2N, which we denote as LeNet(2N). These added layers (as seen in the second column of Table 1, minus the DropIn layers shown in red) learned a convolution filter but did not change the size of the outputs. We then added DropIn layers between each of the convolutional layers (as seen in the second column of Table 1) and called this network LeNet(2N) + DropIn.
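The layer-name pattern of the expanded convolutional blocks can be generated mechanically, which may help when reading Table 1. The sketch below is an illustration only: `dropin_stack` is a hypothetical helper, it lists names without output counts, and it does not model where the pooling or fully-connected layers sit.

```python
def dropin_stack(block, depth):
    """Names for one block: conv layers block_1..block_depth with DropIn units between."""
    names = [f"conv{block}_1"]
    for k in range(2, depth + 1):
        names.append(f"conv{block}_{k}")
        # Each DropIn unit mixes the outputs of two consecutive conv layers.
        names.append(f"dropin ({block}_{k - 1} + {block}_{k})")
    return names

def lenet_2n_with_dropin(n):
    """Both convolutional blocks of LeNet(2N) + DropIn, each expanded to N layers."""
    return dropin_stack(1, n) + dropin_stack(2, n)

layers = lenet_2n_with_dropin(5)
num_conv = sum(name.startswith("conv") for name in layers)      # 2N = 10
num_dropin = sum(name.startswith("dropin") for name in layers)  # 2(N - 1) = 8
```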

We first looked at N = 5 and created LeNet(10) and LeNet(10) + DropIn architectures. LeNet(10) did not converge in the standard training time of 10,000 iterations given multiple realizations of the training process. However, utilizing DropIn units we were able to have LeNet(10) + DropIn converge within 10,000 iterations with the same hyper-parameters. In Figure 2 we show results for several different DropIn lengths for this network. These results indicate that training is robust to the DropIn length for simpler networks and that, in general, shorter DropIn lengths provide marginally better results. We note for this case that the added layers do not increase the overall accuracy of the network: the MNIST data is quite simple compared with other classification tasks, so the added layers do not provide any extra discriminative power.

We now look at how the number of layers affects the training with DropIn. In Figure 3 there are two different plots, one with a DropIn length of 2,500 iterations and the other with a DropIn length of 7,500 iterations. For each plot we present three different networks with 10, 30, and 50 convolutional layers (equating to N = 5, 15, 25). For both DropIn lengths and all three network depths, the gradual DropIn method allowed the networks to converge. The deeper networks require a greater number of iterations to reach the same level of accuracy as the shallower networks, which is to be expected as they have a greater number of weights to train. We also see that networks converge more quickly with the shorter DropIn length, indicating that shorter DropIn lengths are desirable.

| CIFAR-10 | CIFAR-10(11 layers) + DropIn |
|---|---|
| conv1-32 | conv1_1-32 + LRN |
| maxpool | conv1_2-32 + LRN |
| LRN | dropin (1_1 + 1_2) |
| conv2-32 | conv2_1-32 + LRN |
| maxpool | conv2_2-32 + LRN |
| LRN | dropin (2_1 + 2_2) |
| | conv3_1-32 + LRN |
| | conv3_2-32 + LRN |
| | dropin (3_1 + 3_2) |
| | conv4_1-32 + LRN |
| | conv4_2-32 + LRN |
| | dropin (4_1 + 4_2) |
| | conv5_1-32 + LRN |
| | conv5_2-32 + LRN |
| | dropin (5_1 + 5_2) |
| conv3-64 | conv6_1-64 |

Table 2: CIFAR-10 11-layer architecture, including DropIn units.

Figure 4: Test data classification accuracy while training the 11-layer CIFAR-10 architecture with DropIn. The curves show classification accuracies for different dropin_lengths. (Best viewed in color)

| Architecture | dropin_length | Accuracy (%) |
|---|---|---|
| 3-layer net | | 81.4 |
| 11-layer net | 8,000 | 81.7 |
| 11-layer net | 16,000 | 82.3 |
| 11-layer net | 24,000 | 82.3 |

Table 3: Final accuracy (average of last three values) results for the CIFAR-10 dataset on test data at the end of the training. Comparison of DropIn and dropin_lengths.

4.2 CIFAR-10

This dataset consists of 60,000 color images with a resolution of 32x32. Of these, 50,000 are for training and 10,000 are for testing. There are ten classes with 6,000 images per class.

The Caffe [12] website provides the architecture and hyper-parameter settings as part of the CIFAR-10 tutorial. The three convolutional layer architecture trains quickly and attains good accuracies. The convolutional layers were replicated to obtain an 11-layer model, which corresponds to the depth of one of the CIFAR-10 models in the experiments for highway networks [22]. The detailed architectures are compared in Table 2. As shown in the table, the sizes of each of the layers entering the DropIn layer were kept the same for simplicity. For every convolutional layer, the weight initialization was Gaussian with standard deviation of 0.01 and the bias initialization was constant, set to 0. Each convolutional layer was followed by a rectified linear unit and local normalization. The length of the training, the learning rates, and schedule were modified to run over 32,000 iterations. This modification trained satisfactorily and provided a reasonable comparison.

Numerous attempts at training this 11-layer network without the DropIn layers failed to converge. Similar attempts to train this network with the DropIn layers did successfully converge, which is a primary result of this study.

Experiments were performed varying the DropIn length. Figure 4 shows the accuracy curves for several DropIn lengths, and Table 3 compares the final accuracies. The final accuracies show a marginal improvement for longer versus shorter lengths but for CIFAR-10 the results are relatively independent of the length value. Furthermore, the final accuracies from the 11-layer architecture are less than 1% better than the original 3-layer architecture, which implies that for the CIFAR-10 dataset, the deeper network provides only marginal improvement.
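The "less than 1%" statement can be checked directly against the accuracies in Table 3. The short calculation below uses only the reported values; the variable names are illustrative, and the differences are in percentage points of accuracy.

```python
# Final accuracies (%) copied from Table 3.
baseline_3_layer = 81.4
deep_11_layer = {8_000: 81.7, 16_000: 82.3, 24_000: 82.3}

# Gain of the 11-layer network over the 3-layer network for each dropin_length.
gains = {length: round(acc - baseline_3_layer, 1) for length, acc in deep_11_layer.items()}
largest_gain = max(gains.values())
```

The gains come out to 0.3, 0.9, and 0.9 percentage points, so the largest improvement stays below one point.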

AlexNet AlexNet (13 layers) + DropIn

conv1_1-96 conv1_1-96
dropin (1_1 + 1_2)
maxpool + LocalNorm
conv2_1-256 conv2_1-256
dropin (2_1 + 2_2)
maxpool + LocalNorm
conv3_1-384 conv3_1-384
dropin (3_1 + 3_2)
conv4_1-384 conv4_1-384
dropin (4_1 + 4_2)
conv5_1-256 conv5_1-256
dropin (5_1 + 5_2)
Table 4: Network architecture for AlexNet and the modified version of AlexNet, AlexNet (13 layers) + DropIn.

Figure 5: Comparison of various DropIn lengths. Validation data classification accuracy while training the AlexNet (13 layers) + DropIn architecture with ImageNet data. (Best viewed in color)

| Architecture | dropin_length | Accuracy (%) |
|---|---|---|
| AlexNet | | 58.0 |
| 13 layers + DropIn | 25,000 | 62.2 |
| 13 layers + DropIn | 75,000 | 62.1 |
| 13 layers + DropIn | 150,000 | 60.8 |
| 13 layers + DropIn | 300,000 | 59.3 |

Table 5: Comparison of DropIn and dropin_lengths. The table shows final accuracy (average of last three values) results for the ImageNet dataset on validation data at the end of the training.

4.3 ImageNet / AlexNet

ImageNet [18] is a large image database based on the nouns in WordNet. This image database, used for the ImageNet Large Scale Visual Recognition Challenge, is commonly used as a basis of comparison in the deep learning literature. The database contains 1.2 million training and 50,000 testing images covering 1,000 categories.

The Caffe website provides the architecture and hyper-parameter files for a slightly modified AlexNet. We downloaded the architecture and hyper-parameter files from the website and we expanded the architecture from 8 layers to 13 layers by duplicating each of the convolutional layers, which is shown (minus the DropIn layers shown in red) in columns 1 and 2, respectively, of Table 4. The AlexNet (13 layers) + DropIn includes a DropIn layer between every duplicated layer used to create AlexNet (13 layers). Multiple attempts at training the AlexNet (13 layers) architecture in the conventional manner did not converge. In the tests with the expanded architecture, the hyper-parameters were kept the same as provided by the Caffe website (even though our experiments with DropIn indicate that tuning them could improve the results).

Experiments were run varying the DropIn hyper-parameter dropin_length. Table 5 shows final accuracy results after training for 450,000 iterations with a range of lengths. Figure 5 compares the accuracy during training of these experiments. In contrast to the results with CIFAR-10, the DropIn length makes a difference with ImageNet. We believe that this is because the deeper architecture increases the classification accuracy for larger datasets, hence the improvement with smaller DropIn lengths is more prominent.

From Figure 5 and Table 5, we can conclude that shorter lengths are better than the longer ones. If the length is less than the first scheduled drop in the learning rate at iteration 100,000, then the network is better trained. However, the difference between 75,000 and 25,000 is negligible, implying that lengths less than the first scheduled learning rate drop are equivalent.
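The relationship between DropIn length and the first learning rate drop can be made explicit by grouping the Table 5 results. This is a small illustrative calculation on the reported numbers, not an additional experiment; the variable names are hypothetical.

```python
FIRST_LR_DROP = 100_000  # iteration of the first scheduled learning rate drop
ALEXNET_BASELINE = 58.0  # accuracy (%) of the original AlexNet from Table 5

# dropin_length -> final accuracy (%) for AlexNet (13 layers) + DropIn, from Table 5.
results = {25_000: 62.2, 75_000: 62.1, 150_000: 60.8, 300_000: 59.3}

finished_before_drop = {d: a for d, a in results.items() if d < FIRST_LR_DROP}
finished_after_drop = {d: a for d, a in results.items() if d >= FIRST_LR_DROP}

# Spread among lengths that finish before the drop, and their margin over the rest.
spread_before = round(max(finished_before_drop.values()) - min(finished_before_drop.values()), 1)
margin = round(min(finished_before_drop.values()) - max(finished_after_drop.values()), 1)
```

On these numbers the two lengths that finish before iteration 100,000 differ by 0.1 points, while both beat every longer length by at least 1.3 points, and all four configurations exceed the 58.0% baseline.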

VGG8 VGG16 + DropIn

conv1_1-64 conv1_1-64
dropin (1_1 + 1_2)
conv2_1-128 conv2_1-128
dropin (2_1 + 2_2)
conv3_1-256 conv3_1-256
dropin (3_1 + 3_2)
dropin (3_2 + 3_3)
conv4_1-512 conv4_1-512
dropin (4_1 + 4_2)
dropin (4_2 + 4_3)
conv5_1-512 conv5_1-512
dropin (5_1 + 5_2)
dropin (5_2 + 5_3)

Table 6: Network architectures for VGG8 and VGG16 + DropIn. See the text for additional settings.
Figure 6: Validation data classification accuracy while training the VGG16 + DropIn architecture with ImageNet data. (Best viewed in color)

4.4 ImageNet / VGG

VGG, a set of networks created by the Visual Geometry Group [19], won second place in the image classification category of the 2014 ImageNet contest. These networks were trained on the same database as the AlexNet architecture discussed in Section 4.3. In Table 6 we see the VGG16 architecture (minus the DropIn layers shown in red) alongside what we will refer to as VGG8 (not contained in the original paper). All convolutional layers have a stride and padding of 1 and max-pooling layers have a stride of 2. In their paper, the authors describe the difficulty of training these deep networks and utilized a weight transfer method to enable the network to converge during training.


While it is possible to train a deep neural network by first training a shallow network and using those weights to initialize the deeper network, we believe that in addition to being easier, training the full network with all the layers in place leads to a better trained network. This is supported by research on feature visualization, such as in Zeiler and Fergus [29], where they demonstrate that higher layers have more abstract representations. Training in place means that the learned representations will conform well to the representation at a given layer, while training a shallow network and initializing the weights of a deeper network might not.

Instead of training smaller networks, we propose to use our gradual DropIn method. For our studies, we utilized the VGG16 prototxt file referenced on the Caffe website and set up the solver file with the appropriate parameters from the authors’ paper. Using traditional training methods, we were only able to train the VGG8 architecture; the VGG16 failed to begin converging for multiple realizations. Using VGG8 as a template, we augment VGG16 with DropIn layers to create VGG16 + DropIn (see Table 6).

Based on the evidence presented in Section 4.3, we chose to test VGG16 with a DropIn length of 60,000. We found that other lengths (100,000, 150,000, and 200,000) began to converge as well, but with limited time and resources we chose to report only this length for this paper. The results of training VGG16 + DropIn are shown in Figure 6 alongside VGG8. We see that with gradual DropIn the difficult-to-train VGG16 network does converge. Here we see the real power of the gradual DropIn method: without training an additional shallower network we are able to directly train VGG16, thus saving effort for the practitioner.

| Case | fc6 | fc7 |
|---|---|---|
| 1 | dropout | dropout |
| 2 | dropout | |
| 3 | dropout | DropIn |

Table 7: The three regularization experiments, showing which layers use dropout or DropIn. The fully connected layers 6 and 7 are called fc6 and fc7, respectively.

Figure 7: Test of DropIn regularization with AlexNet. Validation data classification accuracy while training AlexNet with ImageNet data. (Best viewed in color)

4.5 Using DropIn for regularization

The original AlexNet architecture uses dropout for regularization during training in both fully connected layers and it provides a substantial increase in the network’s accuracy. AlexNet (with 8 layers) provides a means to test DropIn regularization. For this experiment, three cases were run as shown in Table 7. Case 1 is the original AlexNet.

The results from this experiment are shown in Figure 7, where both DropIn and dropout probability ratios were for all of these tests and all the other hyper-parameters were the same. This figure shows that removing dropout from fc7 causes visible degrading of the accuracy between iterations 150,000 and 200,000 (green curve). This kind of degradation does not happen with DropIn. Instead, the accuracy curve is similar to the curve with dropout (red versus blue curve) but with a small degradation in overall performance. We believe this degradation is because a DropIn network is more difficult to train than a dropout network. However, the final accuracy for the network with DropIn in fc7 is higher than from an architecture without dropout (red versus green curve). This experiment demonstrates that DropIn provides some regularization since the degradation found in the case without dropout is absent.

5 How to determine a good architecture

One of the challenges for deep learning practitioners is to determine good choices for the hyper-parameter values and the architecture for a given application and dataset. DropIn and dropout provide an easier way to test choices for the architecture than running a set of experiments with many different architectures.

DropIn and dropout can allow one to test a range of architecture depths and widths, respectively. Since adding layers does not necessarily increase accuracy, one can run with the gradual DropIn mode to see if there is little effect, such as in Figures 2 and 4, or visible effect, such as in Figure 5. Substantial improvement implies that there will be benefit from the additional depth.

Similarly, making a run where the dropout ratio varies from perhaps 0.9 to 0.1 (using a slightly modified dropout) provides guidance on the minimum number of neurons per layer. When decreasing the probability that neurons are retained (as shown in Figure 9 of Srivastava et al. [20]), the error typically has a range of the probability ratios where the error plateaus but at some threshold probability the error increases. By multiplying the number of neurons in a layer by this threshold probability, one can approximately determine the minimum number of neurons one must retain where there is marginal harm to the accuracy.
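The width estimate described above is a single multiplication. The sketch below works through it with made-up numbers: the layer width of 4096 and the threshold retention probability of 0.3 are hypothetical values chosen only for illustration, and `min_neurons` is not a function from the paper.

```python
import math

def min_neurons(layer_width, threshold_retain_prob):
    """Approximate minimum layer width: neurons times the threshold retention probability."""
    return math.ceil(layer_width * threshold_retain_prob)

# Hypothetical example: a 4096-unit layer whose error starts rising
# once fewer than 30% of its neurons are retained.
estimate = min_neurons(4096, 0.3)  # 1229
```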

6 Conclusion

The major result of this paper is that deeper architectures that cannot converge using standard training methods become trainable by slowly adding in the new layers during the training. In addition, there are indications that DropIn layers help regularize the training of a network. We found in general that if the shallow network is trainable, then the deeper network, where additional layers are added by a DropIn layer, is also trainable. With a large dataset like ImageNet, adding layers increases accuracy.

We have not yet explored training with tailored DropIn lengths for different DropIn layers in a network. In addition, comparing DropIn to initializing the weights from training a separate shallow network has not yet been tested; these are planned for future work and will be reported elsewhere. Also we plan to test DropIn within other architectures such as recurrent neural networks. Future work also includes training networks with hundreds of layers using asynchronous DropIn, where layers are added starting at different iterations. In addition, we wish to test training where the entire very deep network is initially very thin (few parameters to train) and units are added to all the layers during the training. Furthermore, we plan to study if a methodology can be developed to learn from the data how to automatically optimize the architecture during training and thus learn to adapt to an application based on its data.