Fast Training of Convolutional Neural Networks via Kernel Rescaling

Training deep Convolutional Neural Networks (CNN) is a time-consuming task that may take weeks to complete. In this article we propose a novel, theoretically founded method for reducing CNN training time without incurring any loss in accuracy. The basic idea is to begin training with a pre-train network using lower-resolution kernels and input images, and then refine the results at the full resolution by exploiting the spatial scaling property of convolutions. We apply our method to the ImageNet winner OverFeat and to the more recent ResNet architecture and show a reduction in training time of nearly 20% while test accuracy is preserved.


1 Introduction

In the past few years, deep Convolutional Neural Networks (CNN) (LeCun et al., 1998) have become the ubiquitous tool for solving computer vision problems such as object detection, image classification, and image segmentation (Ren et al., 2015). Such success can be traced back to the work of Krizhevsky et al. (2012), whose 8-layer CNN won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), showing that multi-layer architectures are able to capture the large variability present in real-world data.

Work by Simonyan & Zisserman (2014) has shown that increasing the number of layers consistently improves classification accuracy on this same task. This has led to the proposal of new architectural improvements (Szegedy et al., 2015; He et al., 2016) that allowed network depth to increase from a few layers (Krizhevsky et al., 2012) to hundreds of layers (He et al., 2016). However, an increase in a network's depth comes at the price of longer training times (Glorot & Bengio, 2010), mainly caused by the computationally intensive convolution operations.

In this paper, we show that the overall training time of a target CNN architecture can be reduced by exploiting the spatial scaling property of convolutions during the early stages of learning. This is done by first training a pre-train CNN of smaller kernel resolution for a few epochs, followed by properly rescaling its kernels to the target's original dimensions and continuing training at full resolution.

Moreover, by rescaling the kernels at different epochs, we identify a trade-off between total training time and maximum obtainable accuracy. Finally, we propose a method for choosing when to rescale kernels and evaluate our approach on recent architectures, showing savings in training time of nearly 20% while test set accuracy is preserved.

2 Related Work

Different attempts to reduce CNN training time using the standard back-propagation technique (Rumelhart et al., 1986) have been proposed in the literature.

Regarding how convolutions are implemented, architectures with large convolution kernels (Krizhevsky et al., 2012; Sermanet et al., 2014) have benefited from the Fast Fourier Transform (FFT) algorithm, which reduces the number of multiplications in each 2D convolution (Mathieu et al., 2014), while the current preference for smaller kernels has revived interest in minimal filtering algorithms (Winograd, 1980), as seen in recent, unpublished work by Lavin et al.

Very recently, Ioffe & Szegedy (2015) were able to reduce the total number of iterations (epochs) required to fully train a network using a technique called Batch Normalization. The authors were able to reduce the internal covariate shift, inherently present in the back-propagation technique, by applying mean and variance normalization at each layer's input.

Most relevant to our approach is the work of Chen et al. (2016), who suggested the use of function-preserving transformations to train deeper (more layers) and wider (more channels) networks starting from shallower and narrower ones. Our approach, on the other hand, relies on scaling the spatial dimensions of convolution kernels and input images. This allows us to keep the levels of representation that are usually associated with the number of layers and the number of kernels per layer.

Finally, it is worth mentioning that much of the effort towards speeding up CNNs has focused on inference only. Works by Lebedev et al. (2015); Jaderberg et al. (2014); Denton et al. (2014) have used low-rank approximations to greatly reduce the computational complexity of CNNs. These approaches, however, still require a network to be fully trained, and they can all profit from our approach.

3 Proposed Method

This section describes the rationale behind spatially scaling kernels to speed up the overall CNN training procedure.

Figure 1: Training starts with a pre-train network of smaller convolution kernels and input images. After a number of epochs, kernels are resized to the target’s resolution and training continues as scheduled.

3.1 Spatially Scaling Convolutions

The time-scaling property of convolutions states that the convolution between two time-scaled signals $x(t/\alpha)$ and $h(t/\alpha)$ can be obtained by time-scaling the result $y(t)$ of convolving the original inputs $x(t)$ and $h(t)$, followed by an amplitude scaling of $|\alpha|$.

This property extends to continuous 2D signals (Equations 1 and 2), in which case it is better denoted as the spatial scaling property.

$y(u, v) = (x * h)(u, v) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} x(\tau_1, \tau_2)\, h(u - \tau_1, v - \tau_2)\, d\tau_1\, d\tau_2$   (1)
$x\!\left(\tfrac{u}{\alpha}, \tfrac{v}{\alpha}\right) * h\!\left(\tfrac{u}{\alpha}, \tfrac{v}{\alpha}\right) = \alpha^{2}\, y\!\left(\tfrac{u}{\alpha}, \tfrac{v}{\alpha}\right)$   (2)

If applied to the context of CNNs, this property would suggest that the output of a convolutional layer could also be obtained from the spatially downsized versions of both the layer’s input and its convolution kernels.

Benefits of this possibility can be seen in Equation 3, which represents the number of multiplications $M_l(\alpha)$ performed by a convolutional layer $l$, having $n_i$ input and $n_o$ output channels, a kernel resolution of $k_l \times k_l$, and an input resolution of $w_l \times w_l$, when both the input and the convolution kernels are spatially scaled by a factor $\alpha$.

$M_l(\alpha) = n_i\, n_o\, (\alpha k_l)^{2}\, (\alpha w_l - \alpha k_l + 1)^{2}$   (3)

When compared to its unscaled version, i.e. $M_l(1)$, one can establish the bounds in Equation 4 by considering the two extremes $k_l \ll w_l$ and $k_l = w_l$, which in turn guarantees a minimum reduction in the number of multiplications proportional to $\alpha^2$.

$\alpha^{4} \,\leq\, M_l(\alpha)/M_l(1) \,\leq\, \alpha^{2}$   (4)

The spatial scaling property is, of course, valid only in the continuous domain. Working with downsized versions of the inputs will usually result in an irreversible loss of spatial resolution and accuracy. However, as shown in the following sections, for moderate values of $\alpha$ this property can still be exploited during the early stages of training, when the network is still learning the basic structures of its kernels. This can be done by first training an otherwise identical network with smaller kernel resolutions, followed by upscaling to the target kernel resolution and continuing training.
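As a quick numerical check of Equation 2, the following sketch (not part of the original paper; it assumes NumPy/SciPy and uses smoothed random arrays as stand-ins for a layer input and a kernel) compares the convolution of downscaled operands against the downscaled full-resolution convolution. Up to resampling and boundary error, the two should differ by a gain of roughly $\alpha^2$.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom
from scipy.signal import convolve2d

alpha = 0.5                                   # spatial size-scaling factor
rng = np.random.default_rng(0)

# Smooth (roughly band-limited) stand-ins for a layer input and a kernel,
# so that downsampling does not introduce severe aliasing.
x = gaussian_filter(rng.standard_normal((64, 64)), sigma=3)
h = gaussian_filter(rng.standard_normal((16, 16)), sigma=2)

# Convolve at full resolution, then downscale the result.
y = convolve2d(x, h, mode='same')
y_small = zoom(y, alpha, order=1)

# Downscale both operands first, then convolve them.
x_small, h_small = zoom(x, alpha, order=1), zoom(h, alpha, order=1)
y_from_small = convolve2d(x_small, h_small, mode='same')

# Equation 2 (discretized) predicts y_from_small ~= alpha**2 * y_small.
# The least-squares gain below should come out roughly equal to alpha**2.
gain = np.vdot(y_small, y_from_small) / np.vdot(y_small, y_small)
print(f"empirical gain {gain:.3f} vs predicted alpha^2 = {alpha**2:.3f}")
```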

3.2 Pre-training Setup

During the pre-training phase, a pre-train network with an architecture similar to the target one, but with downscaled kernel resolutions, is trained. Equation 3 guarantees that during this phase the training process runs faster.

Generating this pre-train network from a target network architecture requires choosing new spatial resolutions for each convolutional layer as well as making the necessary adjustments so that fully-connected layers will have compatible input-output dimensions.

3.2.1 Convolution Kernels

When deciding on the new kernel resolutions, a trade-off between speed and accuracy must be considered.

Selecting kernels much smaller than the originals will cripple the layer's ability to extract and forward high-frequency information, while an overly conservative downscaling will lead to insignificant improvements in training speed. Therefore, we suggest the use of Table 1 for choosing the pre-train kernel resolution given a target kernel resolution. We have found those values to provide a good compromise between these factors after the complete training process.

Target Pre-train
Table 1: Suggested kernel resolution conversions with relative resize factors and bounds.

Once the new kernel sizes have been chosen, it is necessary to adjust the network's internal parameters so that each convolutional layer closely satisfies Equation 2. In order to do so, it is important to observe that a CNN architecture for image classification usually reflects two distinct stages of processing. The first stage contains various layers of convolution and pooling that act as feature extractors; they output feature maps whose spatial resolution depends on that layer's input resolution. The second stage, on the other hand, acts as a classifier and can be identified by the presence of fully-connected layers of fixed input and output dimensions.

Solving the input-output dependencies present in the feature extraction stage should be done starting from the input image itself, since it has no constraints with any previous layer. Input images also have a large spatial resolution, so it should be straightforward to choose a smaller integer side length whose ratio with respect to the original image closely approximates the first chosen scaling factor $\alpha$.

Upper layers will generally not have such flexibility, since the following feature maps will usually have lower spatial resolution. For these layers, the input-output scaling requirement is met by spatially padding or cropping the incoming feature map.
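As an illustration of this adjustment, the sketch below (assuming PyTorch feature maps in (N, C, H, W) layout; the helper name and the zero-padding choice are ours, not prescribed by the paper) pads or center-crops a feature map to the spatial resolution expected by the next layer.

```python
import torch
import torch.nn.functional as F

def pad_or_crop(feature_map: torch.Tensor, out_h: int, out_w: int) -> torch.Tensor:
    """Zero-pad or center-crop an (N, C, H, W) feature map so its spatial
    resolution matches what the next layer of the pre-train network expects.
    Illustrative helper; any reasonable padding/cropping policy would do."""
    _, _, h, w = feature_map.shape
    dh, dw = out_h - h, out_w - w
    if dh > 0 or dw > 0:                       # pad symmetrically when too small
        pad = (max(dw, 0) // 2, max(dw, 0) - max(dw, 0) // 2,
               max(dh, 0) // 2, max(dh, 0) - max(dh, 0) // 2)
        feature_map = F.pad(feature_map, pad)
        _, _, h, w = feature_map.shape
    top, left = (h - out_h) // 2, (w - out_w) // 2   # center-crop when too large
    return feature_map[:, :, top:top + out_h, left:left + out_w]
```

For instance, if the downscaled input resolution leaves a feature map one row and one column short of the size required by the next kernel, a single row and column of zero padding restores the expected dimensions.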

3.2.2 Fully-connected Interface

Special attention must be paid to the interface between convolutional layers and fully-connected ones since the latter require fixed input sizes.

Activations in a fully-connected layer can be represented as a matrix-vector multiplication where each column in the weight matrix is associated with a particular neuron and the number of rows defines the layer's input size. In this interpretation, before entering a fully-connected layer, feature maps having $c$ channels and spatial resolution $h \times w$ must be reshaped into a vector representation of length $c \cdot h \cdot w$.

Since the new spatial dimensions in the pre-train architecture will produce feature maps of smaller resolution $h_p \times w_p$, the weight matrix in the fully-connected layer must be adapted accordingly. In our representation, this means that the number of inputs (rows) in the weight matrix shall be reduced from $c \cdot h \cdot w$ to $c \cdot h_p \cdot w_p$, while the number of output neurons (columns) is kept invariant. A visual representation of this procedure is shown in Figure 2.

Figure 2: Visual representation of the interface between convolutional and fully-connected layers. Feature maps from a convolutional layer are first vectorized before entering a fully-connected layer, whose weights are usually represented in matrix form. The number of inputs must be selected according to the new feature-map spatial resolution ($h_p \times w_p$), while the number of output neurons is kept invariant.

Subsequent layers should need no further modification and the pre-train network can be trained until a given stopping criterion is met, e.g. classification accuracy on a validation set starts to plateau.
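A minimal way to build this reduced interface, sketched here under the assumption of a PyTorch model split into a convolutional `features` stage and a classifier (the helper and argument names are illustrative, not the authors' code), is to measure the flattened size of the smaller feature maps with a dummy forward pass and keep the number of output neurons unchanged.

```python
import torch
import torch.nn as nn

def make_pretrain_fc(features: nn.Module, target_fc1: nn.Linear,
                     in_res: int) -> nn.Linear:
    """Build the first fully-connected layer of the pre-train network.

    features   : the pre-train convolutional/pooling stage
    target_fc1 : the first fully-connected layer of the target architecture
    in_res     : the smaller pre-train input resolution (e.g. 147 in Table 2)

    A dummy forward pass measures the flattened size c * h_p * w_p of the
    smaller feature maps (the rows of the weight matrix); the number of
    output neurons is kept unchanged.
    """
    with torch.no_grad():
        dummy = torch.zeros(1, 3, in_res, in_res)
        n_inputs = features(dummy).flatten(1).shape[1]
    return nn.Linear(n_inputs, target_fc1.out_features)
```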

3.3 Resizing and Continuing Training

Once the stopping criterion has been reached by the pre-train network, its structure must be modified back to the original target network.

3.3.1 Convolution Kernels

As seen in Equation 2, simply resizing both input and kernels spatially would result in an amplitude-scaled version of the expected convolution, meaning that the scaling factor $1/\alpha^2$ would propagate to all subsequent layers. This problem can be avoided simply by scaling the amplitude of the resized kernels by $\alpha^2$. That is, given a pre-trained kernel matrix $W_l^{\,i,j}$ that represents the weights of the convolution kernel of layer $l$ connecting input channel $i$ to output channel $j$, the corresponding weights to be used in the target network are $\widetilde{W}_l^{\,i,j} = \alpha^2 \cdot \mathrm{resize}(W_l^{\,i,j},\, 1/\alpha)$, where $\mathrm{resize}(\cdot,\, 1/\alpha)$ denotes spatial upscaling by a factor $1/\alpha$, so that the amplitude gain caused by the convolution operation cancels out.

Moreover, associated with each convolution operation is a bias component, which need not be scaled since it has a constant value. This resizing procedure should be carried out for every channel in every convolutional layer, as described in Equations 5 and 6.

$\widetilde{W}_l^{\,i,j} = \alpha^{2} \cdot \mathrm{resize}\!\left(W_l^{\,i,j},\, 1/\alpha\right)$   (5)
$\widetilde{b}_l^{\,j} = b_l^{\,j}$   (6)

Kernels in our experiments were spatially upscaled using bilinear interpolation. Although other interpolation methods were tested, pre-train kernel resolutions were too small to benefit from higher-order interpolation such as bicubic.
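A minimal sketch of this kernel-resizing step is given below. It assumes PyTorch weight tensors of shape (out_channels, in_channels, k, k) and the amplitude compensation written in Equation 5 as reconstructed above (bilinear upscaling followed by multiplication by $\alpha^2$); the function name is illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def rescale_conv_weights(w_pre: torch.Tensor, alpha: float, k_target: int) -> torch.Tensor:
    """Upscale pre-train convolution kernels to the target resolution.

    w_pre    : pre-train weights, shape (out_ch, in_ch, k_pre, k_pre)
    alpha    : spatial scaling factor used to build the pre-train network (0 < alpha <= 1)
    k_target : kernel resolution of the target architecture

    Kernels are resized with bilinear interpolation and their amplitude is
    multiplied by alpha**2 so that the gain introduced by convolving at the
    larger resolution cancels out; biases are left untouched (Equation 6).
    """
    w_up = F.interpolate(w_pre, size=(k_target, k_target),
                         mode='bilinear', align_corners=False)
    return alpha ** 2 * w_up
```

In practice one would then copy the result into the target layer, e.g. `target_conv.weight.data.copy_(rescale_conv_weights(pre_conv.weight.data, alpha, k_target))`, and copy the bias unchanged.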

3.3.2 Fully-connected Interface

Again, special attention should be paid to the interface between convolutional and fully-connected layers.

According to the usual interpretation described in Section 3.2.2, the output of a convolutional layer must be vectorized before serving as input to a fully-connected layer, implying a loss of its explicit spatial representation. However, since the incoming feature maps do contain intra-channel spatial correlation, such information is still present and is captured by the weights of the fully-connected layer.

Figure 3: Rescaling weights in the fully-connected layer back to the target's dimensions. Each column in the fully-connected weight matrix is reshaped to match the pre-train feature-map dimensions. Rescaling is applied in the same fashion as for regular convolutional kernels, and the weights are then vectorized into the target's new weight matrix.

In order to exploit this correlation and be able to apply our method, the fully-connected layer shall be reinterpreted as a convolutional layer. In other words, each column in the weight matrix must first be reshaped into a third-order tensor of the same dimensions as the original incoming feature maps, so that we can apply the same resizing rule defined in Equations 5 and 6. This produces a new tensor that must then be vectorized into the new weight matrix, whose dimensions are consistent with the target network. Figure 1 illustrates the overall training process.

Finally, weights from successive fully-connected layers must simply be copied to the target model.
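The same rule can be sketched for the fully-connected interface. The fragment below is again an illustrative PyTorch sketch rather than the authors' code, and it reuses the $\alpha^2$ amplitude compensation from the reconstructed Equation 5; note that PyTorch's nn.Linear stores one neuron per row, i.e. the transpose of the column convention used in Figure 3.

```python
import torch
import torch.nn.functional as F

def rescale_fc_weights(w_fc: torch.Tensor, alpha: float,
                       c: int, h_p: int, w_p: int,
                       h_t: int, w_t: int) -> torch.Tensor:
    """Rescale the first fully-connected layer from pre-train to target size.

    w_fc : pre-train weight matrix of shape (n_neurons, c * h_p * w_p)

    Each neuron's weights are reshaped into a (c, h_p, w_p) tensor, resized
    exactly like a convolution kernel (bilinear upscaling plus alpha**2 gain),
    and vectorized back into a (n_neurons, c * h_t * w_t) matrix consistent
    with the target architecture.
    """
    n_neurons = w_fc.shape[0]
    w = w_fc.reshape(n_neurons, c, h_p, w_p)
    w = F.interpolate(w, size=(h_t, w_t), mode='bilinear', align_corners=False)
    return alpha ** 2 * w.reshape(n_neurons, c * h_t * w_t)
```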

4 Preliminary Experiments

In order to assess the proposed approach, we must first estimate the upper and lower bounds, in terms of accuracy and training time, set by the target network and its pre-train counterpart. To do this, we use as the baseline for our investigation the fast variant of the 2013 ImageNet localization winner OverFeat (Sermanet et al., 2014).

The original OverFeat-fast contains five convolutional layers followed by three fully-connected ones that classify RGB images among the 1000 classes defined by the ImageNet dataset. By following the steps set out in Subsection 3.2, we generate a pre-train model with an input resolution of 147×147 and do not apply padding at the last convolutional layer.

During our experiments we use an Nvidia Tesla K80 GPU to train and test both networks on the ImageNet 2012 CLS-LOC training and validation datasets. We use mini-batches of 128 images and 10k mini-batches per epoch. The learning rate starts at an initial value and is lowered at the end of epochs 18, 29, 43, and 52, until epoch 65 when training is halted. Weight decay is applied until the end of epoch 29 and momentum is used during the entire training. For both networks, weights in each layer are initialized uniformly at random in an interval determined by the layer's number of weights.

For the two networks we obtain the train and test accuracies with respect to the number of training epochs and training hours, seen in Figures 4 and 5. Representing accuracy in terms of epochs allows one to measure how fast the network is learning as data is presented to it, while representation in terms of training time reflects the variable to be optimized.

We notice in Figure 4 that test accuracy stops increasing a few epochs after the last change in learning rate. For this reason, we consider both networks to have been fully trained at the end of 55 epochs, resulting in best test accuracies of 59.25% (epoch 55) for the target network and 55.55% (epoch 53) for the pre-train network. We also observe from Figure 4 that, during the first epochs, both test and training accuracies follow the same pattern for the two networks, which suggests that the information being learnt by the models is both generalizable and adequate to be represented by the smaller, faster network.

On the other hand, Figure 5 highlights the effect of using spatially smaller kernels on training time: the pre-train network needed considerably fewer hours than OverFeat-fast to perform 55 epochs of training on the same amount of data. This reduction in training time largely agrees with the upper bounds set by Equation 3 and the values of $\alpha$ in Table 1 for the 11×11, 5×5, and 3×3 kernel resolutions found in the original architecture.

Figure 4: Accuracy as a function of epochs obtained using both the original OverFeat-fast with 231×231 input resolution and its pre-train counterpart with 147×147 input resolution.
Figure 5: Accuracy as a function of time obtained using both the original OverFeat-fast with 231×231 input resolution and its pre-train counterpart with 147×147 input resolution.

5 Experiments on Pre-training

In this Section we evaluate the effects of rescaling the pre-train network at different points in time. Our goal is to maximize the number of epochs trained using the smaller network in order to reduce the overall time necessary for training the network.

5.1 Resize-and-Continue Scheduled Training

Ideally, one would like to be able to fully train a smaller network, upsize its kernels, and immediately obtain the test accuracy of the target network. However, as seen in Figure 4, the decrease in learning rate and the removal of weight decay lead to increased overfitting, which in turn imposes some constraints on this straightforward approach.

In this experiment we evaluate the effect of upscaling kernels at different epochs and continuing the scheduled training rule. Since changes in the learning rules had a clear effect on accuracy, we focus on resizing the network before and after these changes. Accuracy curves for each starting epoch are reported in Figure 6 including threshold lines for the accuracies obtained by the two baseline networks described in Section 4. Although we still consider a 55 epoch training schedule, the process is carried out until epoch 65 in order to verify possible gains due to continuing training.

Figure 6: Effects of rescaling kernels at different epochs. Lower and upper horizontal lines define the maximum accuracies obtained with pre-train and target networks, respectively.

Curves in Figure 6 reveal some interesting behaviour. Each resized model shows a lower starting accuracy when compared to the pre-train network test curve. This pattern is expected, since interpolation gives an imperfect estimate of the desired kernels. On the other hand, the fact that accuracy does not drop too much indicates that knowledge can, at least partially, be transferred using this method.

The same figure also shows a saturation effect. Networks resized at early stages (epochs 17-20) are able to achieve levels of accuracy similar to the target network, while networks resized at late stages (epochs 51-54) can only achieve accuracies below the pre-train network threshold. Intermediate accuracies were obtained when upsizing the pre-train network at epochs 28 to 31, while accuracies close to the pre-train network baseline were obtained by upsizing at epochs 42-45. Restarting training in the vicinity of the last change in learning rate resulted in test accuracies below the pre-train network baseline threshold.

Table 2 summarizes training times and accuracies for those networks that were able to closely approximate the final accuracy of the target network. From this experiment we notice that resizing the pre-train network at epoch 17 produced the same accuracy as the target network even though it takes fewer hours to finish training, a relative gain of nearly 20% in training time. These results show the necessity of upscaling early during training in order to achieve the target's maximum accuracy.

Network                        Best Accuracy (Epoch)    Total Training Time
Pre-train (Input 147×147)      55.55% (53)
Target (Input 231×231)         59.25% (55)
Resized at Epoch 17            59.25% (54)              220.0 h
Resized at Epoch 18            59.01% (55)
Resized at Epoch 19            58.84% (55)
Resized at Epoch 20            58.91% (54)
Table 2: Final accuracy and training times for resized networks after a total of 55 epochs. Lower and upper bound accuracies are set by the pre-train and target networks, respectively.

5.2 Resize-and-Continue with Extra Training

It can be observed in Figure 6 that, when resizing at epochs 17, 28, and 42, the subsequent epoch still shows a relevant increase in accuracy. This does not happen at the same epochs for the baseline networks since, at those points, test accuracy plateaus, raising the need to change the learning rate. From this observation, we consider maintaining the same learning rule after resizing the networks until there is a drop in test accuracy, from which point on we continue with the predefined learning schedule.
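A high-level sketch of this policy is given below. It is an illustrative reconstruction only: `schedule`, `resize_to_target`, `train_one_epoch`, and `evaluate` are assumed callables standing in for the original training pipeline, and the exact bookkeeping of the schedule after the fallback is our interpretation.

```python
def resize_and_continue(model, schedule, resize_epoch, total_epochs,
                        resize_to_target, train_one_epoch, evaluate):
    """Sketch of the resize-and-continue policy with extra training.

    schedule(e)          -> learning rate prescribed for epoch e by the original plan
    resize_to_target(m)  -> m with kernels and FC weights upscaled (Section 3.3)
    train_one_epoch(m, lr), evaluate(m) -> test accuracy of m
    """
    best_acc = 0.0
    frozen_lr = None                          # non-None while extra training is active
    for epoch in range(total_epochs):
        if epoch == resize_epoch:
            model = resize_to_target(model)
            frozen_lr = schedule(epoch)       # keep the current learning rate ...
        lr = frozen_lr if frozen_lr is not None else schedule(epoch)
        train_one_epoch(model, lr)
        acc = evaluate(model)
        if frozen_lr is not None and acc < best_acc:
            frozen_lr = None                  # ... until test accuracy drops, then
                                              # fall back to the scheduled rate
        best_acc = max(best_acc, acc)
    return model
```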

Again we try to maximize the number of epochs run using the pre-train network, so we resize and continue the new training procedure at the end of epochs 18, 29, and 43, since these starting points achieved accuracies above the pre-train network threshold in the previous experiment. The effects of continuing training with the current learning rules for an extra number of epochs are reported in Figures 7 and 8, along with the curves produced in the previous experiment for the same restarting points.

Figure 7: Accuracy as a function of epochs when training is allowed to continue using current learning rules for a few extra epochs. Learning rule is updated as soon as there is a drop in test accuracy.
Figure 8: Accuracy as a function of time when training is allowed to continue using current learning rules for a few extra epochs. Learning rule is updated as soon as there is a drop in test accuracy.

Accuracies and times for both pre-training approaches are reported in Table 3. For each starting point, it can be seen that training for a number of extra epochs does increase the final accuracy. However, this extra training comes at the cost of slowing the overall training procedure.

Moreover, we observe that continuing training from epoch 19 for 5 extra epochs resulted in a test accuracy slightly above the upper bound defined by the target network in Section 4. Although the difference is too small to be considered an actual improvement (0.11%), it does prove that the upper bound is achievable using the proposed method while still avoiding a substantial number of training hours with respect to the original schedule.

Network Extra Epochs Accuracy (Epoch) Training Time
Resized at Epoch 19 (Continued) 5 59.36% (59)
Resized at Epoch 30 (Continued) 9 58.52% (64)
Resized at Epoch 43 (Continued) 10 55.80% (64)
Table 3: Best accuracy and total training times for resized networks with extra training.

5.3 Residual Networks

To prove that our approach can be used on different architectures along with other optimization techniques, we apply our method to the more recent 34-layer Residual Network (ResNet-34) architecture (He et al., 2016). As suggested by the previous results, we resize the pre-train network one and two epochs before each change in learning rate and verify possible gains in training time.

For this experiment, training was performed for 90 epochs using mini-batches of 128 images, together with weight decay and momentum. The learning rate is reduced twice, before epochs 31 and 61. All experiments were run on a single NVIDIA Titan X using the cuDNN library with FFT-based convolutions. Input crops for the pre-train network were proportionally smaller than those used for the target network.

Figure 9: Accuracy curves obtained using ResNet-34 as a function of epochs. Lower and upper horizontal lines define the best accuracies obtained for the new baseline networks.
Figure 10: Accuracy curves obtained using ResNet-34 as a function of time. Lower and upper horizontal lines define the best accuracies obtained for the new baseline networks.
Network                  Accuracy (Epoch)    Training Time
Pre-train                69.05% (85)
Target                   72.61% (89)
Resized at Epoch 29      72.91% (86)         145.60 h
Resized at Epoch 30      72.79% (86)
Resized at Epoch 59      71.36% (90)
Resized at Epoch 60      71.01% (85)
Table 4: Best accuracy and training times for ResNet-34. Training time is reduced when upscaling two epochs before changing the learning rate.

As seen in Figures 9 and 10, resizing at early epochs (29 and 30) allowed the networks to achieve the expected maximum accuracy, while resizing at late epochs (59 and 60) prevented them from doing so. Moreover, when compared to the original architecture, resizing the pre-train ResNet at epoch 29 allowed it to avoid a substantial number of training hours and resulted in slightly better accuracy. A summary of these results is reported in Table 4.

6 Conclusion

In this work, we have presented a fast way of training CNNs that exploits the spatial scaling property of convolutions. Ideally, the scaling property would allow a target model to be trained from a fully trained pre-train network. In practice, however, we have observed that there is an intrinsic saturation process that prevents such a naïve implementation from succeeding: the longer the pre-train network is trained, the less likely it is to achieve the performance of the target network. Although further investigation is required, to the best of our knowledge this happens because, as the pre-train network is trained, the learnt set of weights moves towards a deep local minimum, making it difficult to locally find better weights with lower learning rates.

However, we observe that this effect is mitigated at early stages of learning, where testing and training accuracies are similar for both networks. This leads to the conclusion that both networks are learning information that can be generalized and that can be effectively exploited at both kernel resolutions. This allowed us to use the proposed approach as a pre-training technique where, by resizing the network a couple of epochs before the first scheduled change in learning rate, we were able to obtain the expected target accuracy for both the OverFeat and ResNet architectures while reducing training time by nearly 20% in both cases.

References

  • Chen et al. (2016) Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. In International Conference on Learning Representations, 2016.
  • Denton et al. (2014) Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269–1277, 2014.
  • Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition, June 2016.
  • Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, pp. 448–456, 2015.
  • Jaderberg et al. (2014) Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
  • Lebedev et al. (2015) Vadim Lebedev, Yaroslav Ganin, Victor Lempitsky, Maksim Rakhuba, and Ivan Oseledets. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. In International Conference on Learning Representations, 2015.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Mathieu et al. (2014) Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through ffts. In International Conference on Learning Representations. CBLS, April 2014.
  • Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99, 2015.
  • Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  • Sermanet et al. (2014) Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. In International Conference on Learning Representations. CBLS, April 2014.
  • Simonyan & Zisserman (2014) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
  • Winograd (1980) Shmuel Winograd. Arithmetic complexity of computations, volume 33. SIAM, 1980.