In the past few years, deep Convolutional Neural Networks (CNN) (LeCun et al., 1998)
have become the ubiquitous tool for solving computer vision problems such as object detection, image classification, and image segmentation (Ren et al., 2015). Such success can be traced back to the work of Krizhevsky et al. (2012), whose 8-layer CNN won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), showing that multi-layer architectures are able to capture the large variability present in real-world data.
Works by Simonyan & Zisserman (2014) have shown that increasing the number of layers consistently improves classification accuracy in this same task. This has led to the proposal of new architectural improvements (Szegedy et al., 2015; He et al., 2016) that allowed network depth to increase from a few (Krizhevsky et al., 2012) to hundreds of layers (He et al., 2016). However, an increase in a network's depth comes at the price of longer training times (Glorot & Bengio, 2010), mainly caused by the computationally intensive convolution operations.
In this paper, we show that the overall training time of a target CNN architecture can be reduced by exploiting the spatial scaling property of convolutions during the early stages of learning. This is done by first training a pre-train CNN with smaller kernel resolutions for a few epochs, then properly rescaling its kernels to the target's original dimensions and continuing training at full resolution.
Moreover, by rescaling the kernels at different epochs, we identify a trade-off between total training time and maximum obtainable accuracy. Finally, we propose a method for choosing when to rescale kernels and evaluate our approach on recent architectures showing savings in training times of nearly while test set accuracy is preserved.
2 Related Work
Different attempts to reduce CNN training time using the standard back-propagation technique (Rumelhart et al., 1986) have been proposed in the literature.
Convolutional layers have benefited from the Fast Fourier Transform (FFT) algorithm, which reduces the number of multiplications in each 2D convolution (Mathieu et al., 2014), while the current preference for smaller kernels has revived interest in minimal filtering algorithms (Winograd, 1980), as seen in recent, unpublished work by Lavin et al.
Very recently, Ioffe & Szegedy (2015)
were able to reduce the total number of iterations (epochs) required to fully train a network using a technique called Batch Normalization. The authors reduced the internal covariate shift, inherently present in the back-propagation technique, by applying mean and variance normalization at each layer's input.
Most relevant to our approach is the work of Chen et al. (2016), who suggested the use of function-preserving transformations to train deeper (more layers) and wider (more channels) networks starting from shallower and narrower ones. Our approach, on the other hand, relies on scaling the spatial dimensions of convolution kernels and input images. This allows us to keep the levels of representation that are usually associated with the number of layers and kernels per layer.
Finally, it is worth mentioning that much of the effort towards speeding up CNNs has been focused on inference only. Works by Lebedev et al. (2015); Jaderberg et al. (2014); Denton et al. (2014) have used low-rank approximations to greatly reduce the computational complexity of CNNs. These approaches, however, still require a network to be fully trained, and they can all profit from our approach.
3 Proposed Method
This section describes the rationale behind spatially scaling kernels to speed up the overall CNN training procedure.
3.1 Spatially Scaling Convolutions
The time-scaling property of convolutions states that the convolution between two time-scaled signals x(at) and h(at) can be obtained by time-scaling the result y(t) of convolving the original inputs x(t) and h(t), followed by an amplitude scaling of 1/|a|.
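In the continuous domain, the property can be sketched as follows; the symbols x, h, y, and a are generic stand-ins, since the paper's own notation is not reproduced here:

```latex
% Time-scaling property of convolution (continuous domain):
% if y(t) = (x * h)(t), then for a scale factor a \neq 0
\begin{align*}
x(at) * h(at)
  &= \int_{-\infty}^{\infty} x(a\tau)\, h\bigl(a(t-\tau)\bigr)\, d\tau \\
  &= \frac{1}{|a|} \int_{-\infty}^{\infty} x(u)\, h(at - u)\, du
     \qquad (u = a\tau) \\
  &= \frac{1}{|a|}\, y(at).
\end{align*}
```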
If applied to the context of CNNs, this property would suggest that the output of a convolutional layer could also be obtained from the spatially downsized versions of both the layer’s input and its convolution kernels.
The benefits of this possibility can be seen in Equation 3, which gives the number of multiplications performed by a convolutional layer, with its given numbers of input and output channels, when both its input and its convolution kernels are spatially scaled by a common factor.
When compared to the unscaled version, one can establish the bounds in Equation 4 by considering both extremes of the scaling factor, which in turn guarantees a minimum reduction in the number of multiplications.
The spatial scaling property is, of course, valid only in the continuous domain. Working with downsized versions of inputs will usually result in an irreversible loss of spatial resolution and accuracy. However, as shown in the following sections, for moderate scaling factors this property can still be exploited during the early stages of training, when the network is still learning the basic structures of its kernels. This can be done by first training an otherwise identical network with smaller kernel resolutions, followed by an upscaling to the target kernel resolution and continued training.
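As an illustration of why scaling helps, the sketch below counts multiplications for a stride-1 "valid" convolution; the layer sizes are loosely modeled on the OverFeat-fast setup of Section 4, and the specific kernel and input resolutions used here are assumptions:

```python
def conv_mults(c_in, c_out, k, h, w):
    """Multiplications in a stride-1 'valid' 2D convolution:
    one k*k*c_in dot product per output pixel and output channel."""
    out_h, out_w = h - k + 1, w - k + 1
    return c_in * c_out * k * k * out_h * out_w

# Both the input and the kernels spatially downscaled for pre-training.
full = conv_mults(3, 96, 11, 231, 231)   # assumed target first layer
scaled = conv_mults(3, 96, 7, 147, 147)  # assumed pre-train counterpart
ratio = scaled / full                    # well below 1
```

The ratio falls well below one because both the kernel area and the output feature-map area shrink with the scaling factor.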
3.2 Pre-training Setup
During the pre-training phase, a pre-train network, similar in architecture to the target one but with downscaled kernel resolutions, is trained. Equation 3 guarantees that during this phase the training process runs faster.
Generating this pre-train network from a target network architecture requires choosing new spatial resolutions for each convolutional layer as well as making the necessary adjustments so that fully-connected layers will have compatible input-output dimensions.
3.2.1 Convolution Kernels
When deciding on the new kernel resolutions, a trade-off between speed and accuracy must be considered.
Selecting kernels much smaller than the originals will cripple the layer's ability to extract and forward high-frequency information, while too conservative a downscaling will lead to insignificant improvements in training speed. Therefore, we suggest the use of Table 1 for choosing the pre-train kernel resolution given a target kernel resolution. We have found these values to provide a good compromise between the two factors over the complete training process.
Once the new kernel sizes have been chosen, it is necessary to adjust the network’s internal parameters so that each convolutional layer closely satisfies Equation 2. In order to do so, it is important to observe that a CNN architecture for image classification usually reflects two distinct stages of processing. The first stage contains various layers of convolution and pooling that act as feature extractors. They output feature-maps
whose spatial resolution depends on that layer's input resolution. The second stage, on the other hand, acts as a classifier and can be identified by the presence of fully-connected layers of fixed input and output dimensions.
Solving the input-output dependencies present in the feature extraction stage should be done starting from the input image itself, since it has no constraints from any previous layer. Input images also have a large spatial resolution, so it should be straightforward to choose a smaller integer length whose ratio with respect to the original image closely approximates the first chosen scaling factor.
Upper layers will generally not have such flexibility, since the following feature-maps will usually have lower spatial resolution. For these layers, the input-output scaling requirements are met by spatially padding or cropping the incoming feature-map.
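A minimal sketch of this padding/cropping adjustment, assuming NumPy arrays in (channels, height, width) layout; the function name is our own:

```python
import numpy as np

def fit_spatial(fmap, target_h, target_w):
    """Center-crop or zero-pad a (channels, h, w) feature map so its
    spatial size matches the one required by the next layer."""
    c, h, w = fmap.shape
    out = np.zeros((c, target_h, target_w), dtype=fmap.dtype)
    # overlap region shared by source and destination
    ch, cw = min(h, target_h), min(w, target_w)
    src_y, src_x = (h - ch) // 2, (w - cw) // 2
    dst_y, dst_x = (target_h - ch) // 2, (target_w - cw) // 2
    out[:, dst_y:dst_y + ch, dst_x:dst_x + cw] = \
        fmap[:, src_y:src_y + ch, src_x:src_x + cw]
    return out
```

Padding inserts zeros around the map while cropping keeps its central region, so either direction of mismatch is handled by the same routine.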
3.2.2 Fully-connected Interface
Special attention must be paid to the interface between convolutional layers and fully-connected ones since the latter require fixed input sizes.
Activations in a fully-connected layer can be represented as a matrix-vector multiplication where each column in the weight matrix is associated with a particular neuron and the number of rows defines the layer's input size. In this interpretation, before entering a fully-connected layer, feature-maps, with their multiple channels and spatial resolution, must be reshaped into a single vector representation.
Since the new spatial dimensions in the pre-train architecture will produce feature-maps of smaller resolution, the weight matrix in the fully-connected layer must be adapted accordingly. In our representation, this means that the number of inputs (rows) in the weight matrix shall be reduced correspondingly, while the number of output neurons (columns) is kept invariant. A visual representation of this procedure is seen in Figure 2.
Subsequent layers should need no further modification and the pre-train network can be trained until a given stopping criterion is met, e.g. classification accuracy on a validation set starts to plateau.
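One possible form of such a stopping criterion, sketched below with hypothetical patience and margin values (the paper does not specify these):

```python
def plateau_reached(val_acc_history, patience=3, min_delta=0.1):
    """True when the last `patience` epochs improved validation
    accuracy by less than `min_delta` (percentage points) overall."""
    if len(val_acc_history) <= patience:
        return False
    recent = val_acc_history[-patience:]
    baseline = val_acc_history[-patience - 1]
    return max(recent) - baseline < min_delta
```

The check compares the best recent accuracy against the value just before the window, so a single noisy epoch does not trigger the stop.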
3.3 Resizing and Continuing Training
Once the stopping criterion has been reached by the pre-train network, its structure must be modified back to the original target network.
3.3.1 Convolution Kernels
As seen in Equation 2, just spatially resizing both input and kernels would result in an amplitude-scaled version of the expected convolution, meaning that the scaling factor would propagate to all subsequent layers. This problem can be avoided simply by scaling the amplitude of the resized kernels by a compensating factor. That is, given a pre-trained kernel matrix that represents the weights of a convolution kernel from a given layer, the corresponding weights to be used in the target network are obtained by rescaling its amplitude so that the gain caused by the convolution operation cancels out.
Moreover, associated to each convolution operation is a bias component that need not be scaled since it has a constant value. This resizing procedure should be carried out for every channel in every convolutional layer as described in Equations 5 and 6.
Kernels in our experiments were spatially upscaled using bilinear interpolation. Although other interpolation methods were tested, pre-train kernel resolutions were too small to benefit from higher-order interpolation such as bicubic.
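The upscaling step can be sketched as follows. The bilinear resize uses align-corners sampling, and the area-ratio amplitude correction is an assumption standing in for the exact factor of Equations 5 and 6, which is not reproduced here:

```python
import numpy as np

def resize_bilinear(kernel, out_h, out_w):
    """Bilinear resize of a 2D kernel (align-corners sampling)."""
    in_h, in_w = kernel.shape
    xs = np.linspace(0, in_w - 1, out_w)
    ys = np.linspace(0, in_h - 1, out_h)
    # interpolate along width first, then along height
    rows = np.stack([np.interp(xs, np.arange(in_w), r) for r in kernel])
    return np.stack([np.interp(ys, np.arange(in_h), c) for c in rows.T]).T

def upscale_kernel(kernel, out_h, out_w):
    in_h, in_w = kernel.shape
    up = resize_bilinear(kernel, out_h, out_w)
    # Amplitude correction so the convolution gain cancels out; the
    # area ratio here is an assumed stand-in for the paper's factor.
    return up * (in_h * in_w) / (out_h * out_w)
```

Bias terms would be copied unchanged, matching the observation above that they need no scaling.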
3.3.2 Fully-connected Interface
Again, special attention should be paid to the interface between convolutional and fully-connected layers.
According to the usual interpretation described in Section 3.2.2, the output of a convolutional layer must be vectorized before serving as input to a fully-connected layer, implying a loss of its explicit spatial representation. However, since the incoming feature-maps do contain intra-channel correlation, such information is still present and is captured by the weights of the fully-connected layer.
In order to exploit this correlation and be able to apply our method, the fully-connected layer shall be reinterpreted as a convolutional layer. In other words, each column in the weight matrix must first be reshaped into a third-order tensor of the same dimensions as the original incoming feature-maps, so that we can apply the same resizing rule defined in Equations 5 and 6. This will produce a new tensor that must then be vectorized into the new weight matrix, whose dimensions are consistent with the target network. Figure 1 illustrates the overall training process.
Finally, weights from successive fully-connected layers must simply be copied to the target model.
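A sketch of the fully-connected resize described above, using nearest-neighbor sampling for brevity where the paper uses bilinear interpolation; the channel-major flattening order and the area-ratio amplitude correction are assumptions:

```python
import numpy as np

def resize_fc_weights(W, c, pre_hw, tgt_hw):
    """W has one column per output neuron and c*pre_h*pre_w rows.
    Reshape each column to (c, pre_h, pre_w), resize spatially,
    and flatten back to c*tgt_h*tgt_w rows."""
    (ph, pw), (th, tw) = pre_hw, tgt_hw
    out = np.empty((c * th * tw, W.shape[1]))
    # nearest-neighbor index maps (bilinear in the paper)
    yi = np.round(np.linspace(0, ph - 1, th)).astype(int)
    xi = np.round(np.linspace(0, pw - 1, tw)).astype(int)
    for j in range(W.shape[1]):
        t = W[:, j].reshape(c, ph, pw)     # column -> feature-map tensor
        t = t[:, yi][:, :, xi]             # spatial resize
        # amplitude correction analogous to the kernel case (assumed)
        out[:, j] = (t * (ph * pw) / (th * tw)).ravel()
    return out
```

The number of columns (output neurons) is untouched, matching the requirement that only the input dimension of the interface layer changes.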
4 Preliminary Experiments
In order to assess the proposed approach, we must first estimate upper and lower bounds, in terms of accuracy and training time, set by the target network and its pre-train counterpart. To do this we use as the baseline for our investigation the fast variant of the 2013 ImageNet localization winner OverFeat (Sermanet et al., 2014).
The original OverFeat-fast contains five convolutional layers followed by three fully-connected ones that classify RGB images among the 1000 classes defined by the ImageNet dataset. By following the steps set in Subsection 3.2, we generate a pre-train model of input resolution 147×147 and do not apply padding at the last convolutional layer.
During our experiments we use an Nvidia Tesla K80 GPU to train and test both networks with ImageNet 2012 CLS-LOC training and validation datasets. We use mini-batches of 128 images and 10k mini-batches per epoch. We use an initial learning rate of and lower it to , , , and at the end of epochs 18, 29, 43, and 52 respectively, until epoch 65 when training is halted. A weight decay of is also applied until the end of epoch 29 and a momentum of is used during the entire training. For both networks, weights in each layer are initialized uniformly at random in the interval , where is the layer’s number of weights.
For the two networks we obtain the train and test accuracies with respect to the number of training epochs and training hours, seen in Figures 4 and 5. Representing accuracy in terms of epochs allows one to measure how fast the network is learning as data is presented to it, while representation in terms of training time reflects the variable to be optimized.
We notice in Figure 4 that test accuracy stops increasing a few epochs after the last change in learning rate. For this reason, we consider both networks to have been fully trained at the end of 55 epochs, resulting in best test accuracies of 59.25% (epoch 55) for the target network and 55.55% (epoch 53) for the pre-train network. We also observe from Figure 4 that, during the first epochs, both test and training accuracies follow the same pattern for the two networks, which suggests that the information being learnt by the models is both generalizable and adequate to be represented by the smaller, faster network.
On the other hand, Figure 5 highlights the effect of using spatially smaller kernels on training time. OverFeat-fast took considerably longer to perform 55 epochs of training than the pre-train network took on the same amount of data. This reduction in training time largely agrees with the upper bounds set by Equation 3 and the values in Table 1 for the kernel resolutions found in the original architecture.
5 Experiments on Pre-training
In this section we evaluate the effects of rescaling the pre-train network at different points in time. Our goal is to maximize the number of epochs trained using the smaller network, in order to reduce the overall training time.
5.1 Resize-and-Continue Scheduled Training
Ideally, one would like to be able to fully train a smaller network, upsize its kernels, and immediately obtain the test accuracy of the target network. However, as seen in Figure 4, the decrease in learning rate and removal of weight decay lead to increased overfitting, which in turn imposes some constraints on this straightforward approach.
In this experiment we evaluate the effect of upscaling kernels at different epochs and continuing the scheduled training rule. Since changes in the learning rules had a clear effect on accuracy, we focus on resizing the network before and after these changes. Accuracy curves for each starting epoch are reported in Figure 6 including threshold lines for the accuracies obtained by the two baseline networks described in Section 4. Although we still consider a 55 epoch training schedule, the process is carried out until epoch 65 in order to verify possible gains due to continuing training.
Curves in Figure 6 reveal some interesting behaviour. Each resized model shows a lower starting accuracy when compared to the input network test curve. This pattern is expected since interpolation will give an imperfect estimate of the desired kernels. On the other hand, the fact that accuracy does not drop too much indicates that knowledge can, at least partially, be transferred using this method.
The same figure also shows a saturation effect. Networks resized at early stages (epochs 17-20) are able to achieve levels of accuracy similar to the input network, while networks resized at late stages (epochs 51-54) can only achieve accuracies below the input network threshold. Intermediate accuracies were obtained when upsizing the pre-train network at epochs 28 to 31, while accuracies close to the input network baseline were obtained by upsizing the pre-train networks at epochs 42-45. Restarting training in the vicinity of the last change in learning rate resulted in test accuracies below the pre-train network baseline threshold.
Table 2 summarizes training times and accuracies for those networks that were able to closely approximate the final accuracy of the target network. From this experiment we notice that resizing the pre-train network at epoch 17 produced the same accuracy as the target network while taking fewer hours to finish training. These results show the necessity of upscaling early during training in order to achieve the target's maximum accuracy.
| Network | Best Accuracy (Epoch) | Total Training Time |
|---|---|---|
| Pre-train (Input 147×147) | 55.55% (53) | |
| Target (Input 231×231) | 59.25% (55) | |
| Resized at Epoch 17 | 59.25% (54) | 220.0 h |
| Resized at Epoch 18 | 59.01% (55) | |
| Resized at Epoch 19 | 58.84% (55) | |
| Resized at Epoch 20 | 58.91% (54) | |
5.2 Resize-and-Continue with Extra Training
It can be observed in Figure 6 that when resizing at epochs 17, 28, and 42, the subsequent epoch still shows a relevant increase in accuracy. This does not happen at the same epochs for the baseline networks since, at those points, test accuracy plateaus, raising the need to change the learning rate. From this observation, we consider maintaining the same learning rule after resizing the networks until there is a drop in test accuracy, from which point on we continue with the predefined learning schedule.
Again we try to maximize the number of epochs run using the pre-train network, so we resize and continue the new training procedure at the end of epochs 18, 29 and 43 since these starting points achieved accuracies above the input network threshold during the previous experiment. Effects of continuing training using current learning rules for an extra number of epochs are reported in Figures 7 and 8 along with the curves produced in the previous experiment for the same restarting points.
Accuracies and times for both pre-training approaches are reported in Table 3. For each starting point, it can be seen that training for a number of extra epochs does increase the final accuracy. However, this extra training comes at the cost of slowing the overall training procedure.
Moreover, we observe that continuing training from epoch 19 for 5 extra epochs resulted in a test accuracy slightly above the upper bound defined by the target network in Section 4. Although the difference is too small to be considered an actual improvement, it does show that the upper bound is achievable using the proposed method while still saving a substantial fraction of the original training time.
| Network | Extra Epochs | Accuracy (Epoch) | Training Time |
|---|---|---|---|
| Resized at Epoch 19 (Continued) | 5 | 59.36% (59) | |
| Resized at Epoch 30 (Continued) | 9 | 58.52% (64) | |
| Resized at Epoch 43 (Continued) | 10 | 55.80% (64) | |
5.3 Residual Networks
To demonstrate that our approach can be used on different architectures along with other optimization techniques, we apply our method to the more recent 34-layer Residual Network (He et al., 2016) architecture. As suggested by the previous results, we resize the pre-train network one and two epochs before each change in learning rate and verify the possible gains in training time.
For this experiment, training was performed for 90 epochs using mini-batches of 128, weight decay of , and momentum equal to . Learning rate is initially set to and it is reduced to and before epochs 31 and 61. All experiments were run on a single NVIDIA Titan-X using the CuDNN library for FFT based convolutions. Original images crop resolutions were for the target network and for pre-train.
| Network | Accuracy (Epoch) | Training Time |
|---|---|---|
| Pre-train () | 69.05% (85) | |
| Target () | 72.61% (89) | |
| Resized at Epoch 29 | 72.91% (86) | 145.60 h |
| Resized at Epoch 30 | 72.79% (86) | |
| Resized at Epoch 59 | 71.36% (90) | |
| Resized at Epoch 60 | 71.01% (85) | |
As seen in Figures 9 and 10, resizing at early epochs (29 and 30) allowed the networks to achieve the expected maximum accuracy, while resizing at late epochs (59 and 60) prevented them from doing so. Moreover, when compared to the original architecture, resizing the pre-train ResNet at epoch 29 allowed it to avoid hours of training and gave slightly better accuracy. A summary of these results is reported in Table 4.
6 Conclusions
In this work, we have presented a fast way of training CNNs that exploits the spatial scaling property of convolutions. Ideally, the scaling property would allow a target model to be trained from a fully trained pre-train network. In practice, however, we have observed that there is an intrinsic saturation process that prevents such a naïve implementation from succeeding: the longer the pre-train network is trained, the less likely it is to achieve the performance of the target network. Although further investigation is required, we believe this happens because, as the pre-train network is trained, the learnt set of weights moves towards a deep local minimum, making it difficult to locally find better weights with lower learning rates.
However, we observe that this effect is mitigated at early stages of learning, where testing and training accuracies are similar for both networks. This leads to the conclusion that both networks are learning information that generalizes and that can be effectively exploited at both kernel resolutions. This allowed us to use the proposed approach as a pre-training technique where, by resizing the network a couple of epochs before the first scheduled change in learning rate, we were able to obtain the expected target accuracy for both the OverFeat and ResNet architectures while avoiding a substantial number of training hours for each.
- Chen et al. (2016) Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. In International Conference on Learning Representations, 2016.
- Denton et al. (2014) Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269–1277, 2014.
- Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition, June 2016.
- Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, pp. 448–456, 2015.
- Jaderberg et al. (2014) Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
- Lebedev et al. (2015) Vadim Lebedev, Yaroslav Ganin, Victor Lempitsky, Maksim Rakhuba, and Ivan Oseledets. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. In International Conference on Learning Representations, 2015.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Mathieu et al. (2014) Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through ffts. In International Conference on Learning Representations. CBLS, April 2014.
- Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99, 2015.
- Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
- Sermanet et al. (2014) Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. In International Conference on Learning Representations. CBLS, apr 2014.
- Simonyan & Zisserman (2014) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
- Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
- Winograd (1980) Shmuel Winograd. Arithmetic complexity of computations, volume 33. Siam, 1980.