1 Introduction
In the past few years, deep Convolutional Neural Networks (CNNs) (LeCun et al., 1998)
have become the ubiquitous tool for solving computer vision problems such as object detection, image classification, and image segmentation
(Ren et al., 2015). Such success can be traced back to the work of Krizhevsky et al. (2012), whose 8-layer CNN won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), showing that multilayer architectures are able to capture the large variability present in real-world data. Work by Simonyan & Zisserman (2014) has shown that increasing the number of layers consistently improves classification accuracy on this same task. This has led to the proposal of new architectural improvements (Szegedy et al., 2015; He et al., 2016) that allowed network depth to increase from a few (Krizhevsky et al., 2012) to hundreds of layers (He et al., 2016). However, the increase in a network's depth comes at the price of longer training times (Glorot & Bengio, 2010), mainly caused by the computationally intensive convolution operations.
In this paper, we show that the overall training time of a target CNN architecture can be reduced by exploiting the spatial scaling property of convolutions during the early stages of learning. This is done by first training a pretrain CNN with kernels of smaller spatial resolution for a few epochs, followed by properly rescaling its kernels to the target's original dimensions and continuing training at full resolution. Moreover, by rescaling the kernels at different epochs, we identify a trade-off between total training time and maximum obtainable accuracy. Finally, we propose a method for choosing when to rescale kernels and evaluate our approach on recent architectures, showing savings in training time while test set accuracy is preserved.
2 Related Work
Different attempts to reduce CNN training time using the standard backpropagation technique (Rumelhart et al., 1986) have been proposed in the literature.
Regarding how convolutions are implemented, architectures with large convolution kernels (Krizhevsky et al., 2012; Sermanet et al., 2014)
have benefited from the Fast Fourier Transform (FFT) algorithm and reduced the number of multiplications in each 2D convolution
(Mathieu et al., 2014), while the current preference for smaller kernels has revived interest in minimal filtering algorithms (Winograd, 1980), as seen in recent, unpublished works by Levin et al. Very recently, Ioffe & Szegedy (2015) were able to reduce the total number of iterations (epochs) required to fully train a network using a technique called Batch Normalization. The authors were able to reduce the internal covariate shift, inherently present in the backpropagation technique, by applying mean and variance normalization at each layer's input.
Most relevant to our approach is the work of Chen et al. (2016), who suggested the use of function-preserving transformations to train deeper (more layers) and wider (more channels) networks starting from shallower and narrower ones. Our approach, on the other hand, relies on scaling the spatial dimensions of convolution kernels and input images. This allows us to keep the levels of representation that are usually associated with the number of layers and kernels per layer.
Finally, it is worth mentioning that much of the effort towards speeding up CNNs has focused on inference only. Works by Lebedev et al. (2015); Jaderberg et al. (2014); Denton et al. (2014) have used low-rank approximations to greatly reduce the computational complexity of CNNs. These approaches, however, still require a network to be fully trained, and they can all profit from our approach.
3 Proposed Method
This section describes the rationale behind spatially scaling kernels to speed up the overall CNN training procedure.
3.1 Spatially Scaling Convolutions
The time scaling property of convolutions states that the convolution of two time-scaled signals $x(t/s)$ and $h(t/s)$ can be obtained by time-scaling the result $y(t)$ of convolving the original inputs $x(t)$ and $h(t)$, followed by an amplitude scaling by $s$.
This property can be extended to continuous 2D signals (Equations 1 and 2), where it is better denoted as the spatial scaling property.
(1)  $y(t_1, t_2) = x(t_1, t_2) * h(t_1, t_2)$
(2)  $x(t_1/s,\, t_2/s) * h(t_1/s,\, t_2/s) = s^2\, y(t_1/s,\, t_2/s)$
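The 1D version of this property is easy to check numerically. The sketch below (our illustration, not taken from the paper) samples a Gaussian pulse and a box kernel on a fine grid and compares the convolution of the two downscaled signals against $s\, y(t/s)$:

```python
import numpy as np

# Fine sampling of a Gaussian pulse and a box kernel.
dt = 1e-3
t = np.arange(-2.0, 2.0, dt)
x = np.exp(-t**2 / 0.02)                       # x(t)
h = ((t > -0.1) & (t < 0.1)).astype(float)     # h(t)
y = np.convolve(x, h, mode="same") * dt        # y(t) = (x * h)(t)

s = 0.5  # downscale factor: x(t/s) squeezes the signal by s

# Convolving the two scaled signals should equal s * y(t/s)
# (the 1D version of the property; in 2D the gain becomes s**2).
xs = np.exp(-((t / s) ** 2) / 0.02)
hs = (((t / s) > -0.1) & ((t / s) < 0.1)).astype(float)
lhs = np.convolve(xs, hs, mode="same") * dt
rhs = s * np.interp(t / s, t, y)               # s * y(t/s)

print(float(np.max(np.abs(lhs - rhs))))        # small discretization error
```

The two sides agree up to the discretization error of the Riemann-sum convolution.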
If applied to the context of CNNs, this property would suggest that the output of a convolutional layer could also be obtained from the spatially downsized versions of both the layer’s input and its convolution kernels.
The benefits of this possibility can be seen in Equation 3, which represents the number of multiplications $M_l(s)$ performed by a convolutional layer $l$ with $c_{l-1}$ input and $c_l$ output channels, kernels of side $k_l$, and output feature maps of resolution $H_l \times W_l$, when both the input and the convolution kernels are spatially scaled by a factor $s$.
(3)  $M_l(s) = c_{l-1}\, c_l\, \lceil s k_l \rceil^2\, (s H_l)(s W_l)$
When compared to its unscaled version, i.e. $M_l(1)$, one can establish the bounds in Equation 4 by considering the two extremes $\lceil s k_l \rceil = s k_l$ and $\lceil s k_l \rceil = k_l$, which in turn guarantees a minimum reduction in the number of multiplications proportional to $s^2$.
(4)  $s^4\, M_l(1) \le M_l(s) \le s^2\, M_l(1)$
The spatial scaling property is, of course, valid only in the continuous domain. Working with downsized versions of inputs will usually result in irreversible loss of spatial resolution and accuracy. However, as shown in the following sections, for moderate values of $s$ this property can still be exploited during early stages of training, when the network is still learning the basic structures of its kernels. This can be done by first training an otherwise identical network with kernels of smaller resolution, followed by upscaling to the target kernel resolution and continuing training.
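These bounds can be illustrated numerically. The sketch below (the layer sizes are hypothetical, not taken from the paper) counts multiplications for one layer at a few scaling factors; because spatial sizes are rounded up to whole pixels, the measured ratio can drift slightly above $s^2$:

```python
import math

def mults(c_in, c_out, k, h, w):
    # One k*k dot product per output position, per (input, output)
    # channel pair: c_in * c_out * k^2 * h * w multiplications.
    return c_in * c_out * (k * k) * (h * w)

def scaled_mults(c_in, c_out, k, h, w, s):
    # Same count when kernels and feature maps are scaled by s,
    # with spatial sizes rounded up to stay at least one pixel.
    ks = max(1, math.ceil(s * k))
    hs, ws = max(1, math.ceil(s * h)), max(1, math.ceil(s * w))
    return c_in * c_out * (ks * ks) * (hs * ws)

# Hypothetical layer: 96 -> 256 channels, 5x5 kernels, 24x24 output.
full = mults(96, 256, 5, 24, 24)
for s in (0.9, 0.75, 0.5):
    ratio = scaled_mults(96, 256, 5, 24, 24, s) / full
    # The ratio is bounded below by s^4; rounding the kernel size up
    # can push it somewhat above s^2, but it stays well below 1.
    print(f"s={s}: ratio={ratio:.3f} (s^4={s**4:.3f}, s^2={s**2:.3f})")
```

For $s = 0.5$ this hypothetical layer needs only 9% of the original multiplications.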
3.2 Pretraining Setup
During the pretraining phase, a pretrain network, with an architecture similar to the target one but with downscaled kernel resolutions, is trained. Equation 3 guarantees that during this phase the training process will run faster.
Generating this pretrain network from a target network architecture requires choosing new spatial resolutions for each convolutional layer, as well as making the necessary adjustments so that fully-connected layers have compatible input-output dimensions.
3.2.1 Convolution Kernels
When deciding on the new kernel resolutions, a trade-off between speed and accuracy must be considered.
Selecting kernels much smaller than the originals will cripple the layer's ability to extract and forward high-frequency information, while too conservative a downscaling will lead to insignificant improvements in training speed. Therefore, we suggest the use of Table 1 for choosing the pretrain kernel resolution given a target kernel resolution. We have found these values to provide a good compromise between the two factors after the complete training process.
Target  Pretrain  

Once the new kernel sizes have been chosen, it is necessary to adjust the network's internal parameters so that each convolutional layer closely satisfies Equation 2. In order to do so, it is important to observe that a CNN architecture for image classification usually reflects two distinct stages of processing. The first stage contains various layers of convolution and pooling that act as feature extractors. They output feature maps whose spatial resolution depends on that layer's input resolution. The second stage, on the other hand, acts as a classifier and can be identified by the presence of fully-connected layers of fixed input and output dimensions.
Solving the input-output dependencies present in the feature extraction stage should be done starting from the input image itself, since it has no constraints with any previous layer. Input images also have large spatial resolution, so it should be straightforward to choose a smaller integer length whose ratio with respect to the original image closely approximates the first chosen scaling factor. Upper layers will generally not have such flexibility, since the subsequent feature maps will usually have lower spatial resolution. For these layers, the input-output scaling requirements are met by spatially padding or cropping the incoming feature map.
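The resolution bookkeeping above can be sketched as follows; the three-layer stack, its strides, and the rounding choices are our own illustrative assumptions, not the paper's architecture:

```python
import math

def conv_out(n, k, stride=1, pad=0):
    # Spatial size after a (possibly padded) convolution.
    return (n + 2 * pad - k) // stride + 1

def plan_pretrain(input_res, layers, s):
    """For each (kernel, stride) layer of a hypothetical net, compute the
    pretrain feature-map sizes and the pad/crop needed so each layer's
    input stays close to s times the target size."""
    target = input_res
    pre = round(input_res * s)          # smaller input image, free to choose
    plan = [("input", target, pre, 0)]
    for i, (k, stride) in enumerate(layers):
        target = conv_out(target, k, stride)
        ks = max(1, math.ceil(s * k))   # downscaled kernel
        got = conv_out(pre, ks, stride)
        want = max(1, round(s * target))
        adjust = want - got             # >0: pad the feature map, <0: crop
        pre = want
        plan.append((f"conv{i + 1}", target, pre, adjust))
    return plan

# Hypothetical 3-layer stack, s = 0.65.
for name, tgt, pre, adj in plan_pretrain(224, [(11, 4), (5, 1), (3, 1)], 0.65):
    print(name, tgt, pre, adj)
```

In this particular example every layer happens to land exactly on the desired size (adjust = 0); in general a nonzero adjust means spatially padding (positive) or cropping (negative) the incoming feature map.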
3.2.2 Fully-connected Interface
Special attention must be paid to the interface between convolutional layers and fully-connected ones, since the latter require fixed input sizes.
Activations in a fully-connected layer can be represented as a matrix-vector multiplication where each column in the weight matrix is associated with a particular neuron and the number of rows defines the layer's input size. In this interpretation, before entering a fully-connected layer, feature maps having $c$ channels and spatial resolution $h \times w$ must be reshaped into a vector representation of length $c \cdot h \cdot w$. Since the new spatial dimensions in the pretrain architecture will produce feature maps of smaller resolution $h' \times w'$, the weight matrix in the fully-connected layer must be adapted accordingly. In our representation, this means that the number of inputs (rows) in the weight matrix shall be reduced from $c \cdot h \cdot w$ to $c \cdot h' \cdot w'$, while the number of output neurons (columns) is kept invariant. A visual representation of this procedure is seen in Figure 2.
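At the shape level (all sizes here are hypothetical), the adaptation amounts to keeping the columns of the weight matrix while shrinking its rows to match the smaller vectorized feature maps:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target interface: 256 channels of 6x6 maps into 4096 units.
c, h, w, n_out = 256, 6, 6, 4096
target_rows = c * h * w                     # 9216 inputs per neuron

# Pretrain maps are smaller, e.g. 4x4, so the matrix keeps its 4096
# columns (neurons) but its rows shrink to match the vectorized input.
hp, wp = 4, 4
W_pre = 0.01 * rng.standard_normal((c * hp * wp, n_out))

fmap = rng.standard_normal((c, hp, wp))     # incoming feature maps
out = fmap.ravel() @ W_pre                  # one activation per neuron
print(target_rows, W_pre.shape, out.shape)
```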
Subsequent layers should need no further modification and the pretrain network can be trained until a given stopping criterion is met, e.g. classification accuracy on a validation set starts to plateau.
3.3 Resizing and Continuing Training
Once the stopping criterion has been reached by the pretrain network, its structure must be modified back to the original target network.
3.3.1 Convolution Kernels
As seen in Equation 2, simply resizing both input and kernels spatially would result in an amplitude-scaled version of the expected convolution, meaning that the scaling factor would propagate to all subsequent layers. This problem can be avoided simply by scaling the amplitude of the resized kernels by $s^2$. That is, given a pretrained kernel matrix $W_l^{(i,j)}$ that represents the weights of a convolution kernel from layer $l$, the corresponding weights to be used in the target network are $s^2$ times its spatially resized version, so that the amplitude gain caused by the convolution operation cancels out.
Moreover, associated with each convolution operation is a bias component that need not be scaled, since it has a constant value. This resizing procedure should be carried out for every channel in every convolutional layer, as described in Equations 5 and 6.
(5)  $\hat{W}_l^{(i,j)} = s^2 \cdot \mathrm{resize}\big(W_l^{(i,j)},\, 1/s\big)$
(6)  $\hat{b}_l^{(j)} = b_l^{(j)}$
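A minimal numpy sketch of this resizing rule, assuming separable linear (bilinear) interpolation and our reading of the $s^2$ amplitude correction:

```python
import numpy as np

def bilinear_resize(m, h, w):
    # Separable linear interpolation: rows first, then columns.
    ih, iw = m.shape
    xs = np.linspace(0, iw - 1, w)
    ys = np.linspace(0, ih - 1, h)
    tmp = np.stack([np.interp(xs, np.arange(iw), row) for row in m])
    return np.stack([np.interp(ys, np.arange(ih), col) for col in tmp.T]).T

def upscale_kernel(k_pre, k_target, s):
    """Resize a pretrain kernel to the target resolution and multiply
    by s**2 so the convolution's amplitude gain cancels out
    (the bias is left untouched)."""
    return s**2 * bilinear_resize(k_pre, k_target, k_target)

# A constant 3x3 kernel grown to 5x5 with s = 3/5 keeps its total
# weight, so responses to smooth inputs are preserved.
k = np.full((3, 3), 0.5)
ku = upscale_kernel(k, 5, 3 / 5)
print(ku.shape, round(k.sum(), 6), round(ku.sum(), 6))
```

The constant-kernel check works because bilinear upscaling multiplies the total weight by roughly $(1/s)^2$, which the $s^2$ factor cancels.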
Kernels in our experiments were spatially upscaled using bilinear interpolation. Although other interpolation methods were tested, pretrain kernel resolutions were too small to benefit from higher-order interpolation such as bicubic.
3.3.2 Fully-connected Interface
Again, special attention should be paid to the interface between convolutional and fully-connected layers.
According to the usual interpretation described in Section 3.2.2, the output of a convolutional layer must be vectorized before serving as input to a fully-connected layer, implying a loss of its explicit spatial representation. However, since the incoming feature maps do contain intra-channel correlation, such information is still present, and it is captured by the weights of the fully-connected layer.
In order to exploit this correlation and be able to apply our method, the fully-connected layer shall be reinterpreted as a convolutional layer. In other words, each column in the weight matrix must first be reshaped into a third-order tensor of the same dimensions as the original incoming feature maps, so that we can apply the same resizing rule defined in Equations 5 and 6. This produces a new tensor that must then be vectorized into the new weight matrix whose dimensions are consistent with the target network. Figure 1 illustrates the overall training process. Finally, weights from successive fully-connected layers must simply be copied to the target model.
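This reinterpretation can be sketched as follows; the channel counts and resolutions are hypothetical, and the resize is the same simple separable linear interpolation one might use for the convolution kernels:

```python
import numpy as np

def bilinear_resize(m, h, w):
    # Separable linear interpolation: rows first, then columns.
    ih, iw = m.shape
    xs = np.linspace(0, iw - 1, w)
    ys = np.linspace(0, ih - 1, h)
    tmp = np.stack([np.interp(xs, np.arange(iw), row) for row in m])
    return np.stack([np.interp(ys, np.arange(ih), col) for col in tmp.T]).T

def upscale_fc(W_pre, c, hp, wp, h, w, s):
    """Reshape each column into a (c, hp, wp) tensor, resize every
    channel to (h, w) with the s**2 gain correction, re-vectorize."""
    n_out = W_pre.shape[1]
    W = np.empty((c * h * w, n_out))
    for j in range(n_out):
        maps = W_pre[:, j].reshape(c, hp, wp)
        resized = np.stack([s**2 * bilinear_resize(m, h, w) for m in maps])
        W[:, j] = resized.ravel()
    return W

# Hypothetical interface: 8 channels of 4x4 pretrain maps grown to 6x6.
rng = np.random.default_rng(0)
W_pre = rng.standard_normal((8 * 4 * 4, 10))
W = upscale_fc(W_pre, 8, 4, 4, 6, 6, 4 / 6)
print(W_pre.shape, "->", W.shape)
```

The number of output neurons (columns) stays fixed; only the rows grow to match the full-resolution vectorized feature maps.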
4 Preliminary Experiments
In order to assess the proposed approach, we must first estimate the upper and lower bounds, in terms of accuracy and training time, set by the target network and its pretrain counterpart. To do this, we use as the baseline for our investigation the fast variant of the 2013 ImageNet localization winner OverFeat (Sermanet et al., 2014). The original OverFeat-fast contains five convolutional layers followed by three fully-connected ones that classify RGB images among the 1000 classes defined by the ImageNet dataset. By following the steps set out in Subsection 3.2, we generate a pretrain model of input resolution 147×147 and do not apply padding at the last convolutional layer.
During our experiments we use an Nvidia Tesla K80 GPU to train and test both networks with ImageNet 2012 CLSLOC training and validation datasets. We use minibatches of 128 images and 10k minibatches per epoch. We use an initial learning rate of and lower it to , , , and at the end of epochs 18, 29, 43, and 52 respectively, until epoch 65 when training is halted. A weight decay of is also applied until the end of epoch 29 and a momentum of is used during the entire training. For both networks, weights in each layer are initialized uniformly at random in the interval , where is the layer’s number of weights.
For the two networks we obtain the train and test accuracies with respect to the number of training epochs and training hours, seen in Figures 4 and 5. Representing accuracy in terms of epochs allows one to measure how fast the network is learning as data is presented to it, while representation in terms of training time reflects the variable to be optimized.
We notice in Figure 4 that test accuracy stops increasing a few epochs after the last change in learning rate. For this reason, we consider both networks to have been fully trained at the end of 55 epochs, resulting in best test accuracies of 59.25% (epoch 55) for the target network and 55.55% (epoch 53) for the pretrain network. We also observe from Figure 4 that, during the first epochs, both test and training accuracies follow the same pattern for the two networks, which suggests that the information being learnt by the models is both generalizable and adequate to be represented by the smaller, faster network.
On the other hand, Figure 5 highlights the effect of using spatially smaller kernels on training time: the pretrain network took considerably fewer hours than OverFeat-fast to perform 55 epochs of training on the same amount of data. This reduction in training time largely agrees with the upper bounds set by Equation 3 and the values of $s$ in Table 1 for the 11×11, 5×5, and 3×3 kernel resolutions found in the original architecture.
5 Experiments on Pretraining
In this section, we evaluate the effects of rescaling the pretrain network at different points in time. Our goal is to maximize the number of epochs trained using the smaller network in order to reduce the overall time necessary for training.
5.1 Resize-and-Continue Scheduled Training
Ideally, one would like to be able to fully train a smaller network, upsize its kernels, and immediately obtain the test accuracy of the target network. However, as seen in Figure 4, the decrease in learning rate and the removal of weight decay lead to increased overfitting, which in turn imposes some constraints on this straightforward approach.
In this experiment, we evaluate the effect of upscaling kernels at different epochs and continuing with the scheduled training rule. Since changes in the learning rules had a clear effect on accuracy, we focus on resizing the network before and after these changes. Accuracy curves for each starting epoch are reported in Figure 6, including threshold lines for the accuracies obtained by the two baseline networks described in Section 4. Although we still consider a 55-epoch training schedule, the process is carried out until epoch 65 in order to verify possible gains from continued training.
Curves in Figure 6 reveal some interesting behaviour. Each resized model shows a lower starting accuracy when compared to the input network's test curve. This pattern is expected, since interpolation gives an imperfect estimate of the desired kernels. On the other hand, the fact that accuracy does not drop too much indicates that knowledge can, at least partially, be transferred using this method.
The same figure also shows a saturation effect. Networks resized at early stages (epochs 17-20) are able to achieve levels of accuracy similar to the input network, while networks resized at late stages (epochs 51-54) can only achieve accuracies below the input network threshold. Intermediate accuracies were obtained when upsizing the pretrain network at epochs 28 to 31, while accuracies close to the input network baseline were obtained by upsizing at epochs 42-45. Restarting training in the vicinity of the last change in learning rate resulted in test accuracies below the pretrain network baseline threshold.
Table 2 summarizes training times and accuracies for the networks that were able to closely approximate the final accuracy of the target network. From this experiment, we notice that resizing the pretrain network at epoch 17 produced the same accuracy as the target network even though it takes fewer hours to finish training, a relative gain in training time. These results show the necessity of upscaling early during training in order to achieve the target's maximum accuracy.
Network  Best Accuracy (Epoch)  Total Training Time 
Pretrain (Input 147×147)  55.55% (53)  
Target (Input 231×231)  59.25% (55)  
Resized at Epoch 17  59.25% (54)  220.0 h 
Resized at Epoch 18  59.01% (55)  
Resized at Epoch 19  58.84% (55)  
Resized at Epoch 20  58.91% (54) 
5.2 Resize-and-Continue with Extra Training
It can be observed in Figure 6 that, when resizing at epochs 17, 28, and 42, the subsequent epoch still shows a relevant increase in accuracy. This does not happen at the same epochs for the baseline networks since, at those points, test accuracy plateaus, raising the need to change the learning rate. From this observation, we consider maintaining the same learning rule after resizing the networks until there is a drop in test accuracy, from which point on we continue with the predefined learning schedule.
Again, we try to maximize the number of epochs run using the pretrain network, so we resize and continue the new training procedure at the end of epochs 18, 29, and 43, since these starting points achieved accuracies above the input network threshold in the previous experiment. The effects of continuing training under the current learning rules for an extra number of epochs are reported in Figures 7 and 8, along with the curves produced in the previous experiment for the same restarting points.
Accuracies and times for both pretraining approaches are reported in Table 3. For each starting point, it can be seen that training for a number of extra epochs does increase the final accuracy. However, this extra training comes at the cost of slowing down the overall training procedure.
Moreover, we observe that continuing training from epoch 19 for 5 extra epochs resulted in a test accuracy slightly above the upper bound defined by the target network in Section 4. Although the difference is too small to be considered an actual improvement, it does prove that the upper bound is achievable using the proposed method while still avoiding many hours of training with respect to the original time.
Network  Extra Epochs  Accuracy (Epoch)  Training Time 

Resized at Epoch 19 (Continued)  5  59.36% (59)  
Resized at Epoch 30 (Continued)  9  58.52% (64)  
Resized at Epoch 43 (Continued)  10  55.80% (64) 
5.3 Residual Networks
To show that our approach can be used on different architectures along with other optimization techniques, we apply our method to the more recent 34-layer Residual Network (He et al., 2016) architecture. As suggested by the previous results, we resize the pretrain network one and two epochs before each change in learning rate and verify possible gains in training times.
For this experiment, training was performed for 90 epochs using minibatches of 128 images, together with weight decay and momentum. The learning rate is reduced before epochs 31 and 61. All experiments were run on a single NVIDIA Titan X using the cuDNN library for FFT-based convolutions. Crop resolutions of the original images differed between the target network and the pretrain network.
Network  Accuracy (Epoch)  Training Time 
Pretrain ()  69.05% (85)  
Target ()  72.61% (89)  
Resized at Epoch 29  72.91% (86)  145.60 h 
Resized at Epoch 30  72.79% (86)  
Resized at Epoch 59  71.36% (90)  
Resized at Epoch 60  71.01% (85) 
As seen in Figures 9 and 10, resizing at early epochs (29 and 30) allowed the networks to achieve the expected maximum accuracy, while resizing at late epochs (59 and 60) prevented them from doing so. Moreover, when compared to the original architecture, resizing the pretrain ResNet at epoch 29 allowed it to avoid many hours of training and gave slightly better accuracy. A summary of these results is reported in Table 4.
6 Conclusion
In this work, we have presented a fast way of training CNNs that exploits the spatial scaling property of convolutions. Ideally, the scaling property would allow a target model to be obtained from a fully trained pretrain network. In practice, however, we have observed that there is an intrinsic saturation process that prevents such a naïve implementation from succeeding: the longer the pretrain network is trained, the less likely it is to achieve the performance of the target network. Although further investigation is required, to the best of our knowledge this happens because, as the pretrain network is trained, the learnt set of weights moves towards a deep local minimum, making it difficult to locally find better weights with lower learning rates.
However, we observe that this effect is mitigated at early stages of learning, where testing and training accuracies are similar for both networks. This leads to the conclusion that both networks are learning information that generalizes and that can be effectively exploited at both kernel resolutions. This allowed us to use the proposed approach as a pretraining technique where, by resizing the network a couple of epochs before the first scheduled change in learning rate, we were able to obtain the expected target accuracy for both the OverFeat and ResNet architectures while avoiding many hours of training for each.
References
 Chen et al. (2016) Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. In International Conference on Learning Representations, 2016.
 Denton et al. (2014) Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269–1277, 2014.

 Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition, June 2016.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, pp. 448–456, 2015.
 Jaderberg et al. (2014) Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
 Lebedev et al. (2015) Vadim Lebedev, Yaroslav Ganin, Victor Lempitsky, Maksim Rakhuba, and Ivan Oseledets. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations, 2015.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Mathieu et al. (2014) Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through FFTs. In International Conference on Learning Representations. CBLS, April 2014.
 Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99, 2015.
 Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
 Sermanet et al. (2014) Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations. CBLS, April 2014.
 Simonyan & Zisserman (2014) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
 Winograd (1980) Shmuel Winograd. Arithmetic Complexity of Computations, volume 33. SIAM, 1980.