1 Introduction
The depth of a DCNN plays a vital role in discovering intricate structures, both in theory [haastad1987computational, haastad1991power, lecun2015deep, szegedy2015going] and in practice [sun2014deep, guo2014deep, sindhwani2015structured]. The original DCNN, LeNet-5, contains 4 convolutional layers; since then, the development of computer hardware and improvements in network topology have enabled DCNNs to go deeper and deeper. Recent deep models, including ResNet and DenseNet, have already surpassed the 100-layer barrier at 152 and 264 layers, respectively. Although deeper networks have strong power in discovering intricate structures in image tasks, these approaches share a key characteristic: all of the state-of-the-art DCNNs adopt SGD [bengio2007greedy, bottou2010large] and its variants [kingma2014adam, duchi2011adaptive] as the foundation for training. Such a training strategy leaves the network prone to being trapped in local minima and requires a large amount of training time. Thus, a novel learning pipeline is needed in order to boost generalization performance and speed up learning.
In recent years, the MP inverse technique has been utilized to train DCNNs to achieve better generalization performance [yang2019recomputation]. Essentially, the dense layers of a DCNN (with a linear activation function) can be reduced to a linear system, whose approximately optimal parameters are the least squares (LS) solutions at the minimum error. It is well established that the MP inverse is the most widely known generalization of the inverse matrix for finding the unique solution of an LS problem. More importantly, the work in [schmidt1992feedforward] has already proved that the output layer vector can be regarded as a Fisher vector if the weights are solved by the standard MP inverse. Following this, increased focus has been placed on hierarchical networks with the MP inverse [yang2019features, zhang2020width]. Compared to the parameters calculated with SGD, the unique solution obtained by the MP inverse corresponds to the maximum likelihood estimation. To the best of our knowledge, the study in [yang2019recomputation] is the state-of-the-art work that utilizes the MP inverse in DCNN training. In each training epoch, the DCNN is first trained with the SGD optimizer; then, the parameters in the dense layers are refined through an MP inverse-based approach.
Despite its advantages, the training procedure for DCNN provided in [yang2019recomputation] is not as widespread as it could be, for two crucial reasons. On the one hand, the retraining process adds to the computational workload of each epoch. In particular, for a large dataset such as ImageNet [deng2009imagenet], researchers can only refine the parameters of the dense layers on a CPU instead of with GPU acceleration, because the calculation of the MP inverse occupies huge computational resources. As the training cost in the CPU environment increases dramatically, the DCNN trained with the MP inverse [yang2019recomputation] is uneconomical when handling large-scale samples without access to industrial-scale computational resources. On the other hand, before retraining the dense layer weights, the process in [yang2019recomputation] still requires the SGD technique to optimize the parameters of all layers. Following some successful techniques [huang2016deep, gastaldi2017shake], we hypothesize in this paper that it is not necessary for [yang2019recomputation] to involve every convolutional layer in the training process at each epoch, because the refinement of parameters in the dense layers can provide more clues. The existing methods [huang2016deep, gastaldi2017shake] achieve accelerated training by adjusting a set of hyperparameters on top of the basic SGD training pipeline. However, they have drawbacks, such as a slight degradation in performance and a lack of robustness against various environmental conditions, resulting in unstable testing accuracy.
In this paper, we focus on providing a unified training pipeline for DCNNs with better generalization performance that does not incur much additional training burden. We achieve this goal by training the DCNN over several general epochs. Each general epoch contains two simple but straightforward steps that can be implemented in a pure GPU environment: the first is SGD with random learning, and the second is an MP inverse-based batch-by-batch strategy. For the first step, a freeze-learning algorithm that reduces the workload and speeds up the DCNN is provided. We shorten the network by randomly activating a portion of the convolutional layers in each general epoch. The activation rate is preset and progressively decreased: in the first several general epochs, it is set to 1 to start the network, and all of the parameters are updated; then, it is gradually decreased and finally reaches 0, which means that all of the convolutional layers are "frozen" without updating. Hence, the only hyperparameter that users need to adjust is the activation rate. As for the second step, a batch-by-batch MP inverse retraining strategy that can be processed by a GPU is proposed. Instead of training on all of the loaded data at once, the data is processed sequentially; by doing so, the data volume of each batch is dramatically reduced and does not consume many computational resources. Thus, the proposed training pipeline can be implemented with GPU acceleration.
In extensive experiments, several state-of-the-art deep learning architectures for pattern recognition, such as AlexNet [krizhevsky2012imagenet], VGG-16 [simonyan2014very], Inception-v3 [szegedy2016rethinking], ResNet [he2016deep], and DenseNet [huang2017densely], are utilized to verify the effectiveness of this method. We show across 8 datasets, including 2 large datasets (ImageNet and Place365), that the proposed method almost always improves the generalization performance without increasing the training burden. For instance, on the CIFAR-100 and Place365-1 datasets, fast retraining with ResNet trains 21.0% and 24.8% faster, respectively, than the method in [yang2019recomputation].
2 Related Works
The training procedures of DCNNs have been widely studied. However, most prior studies focus on improving only one aspect of performance, either the training efficiency [hinton2012improving, huang2016deep, brock2017freezeout] or the generalization performance [yang2019recomputation, huang2017snapshot]; few of them address both concerns.
Many successful learning schedules, such as Dropout [hinton2012improving], Stochastic Depth [huang2016deep], and FreezeOut [brock2017freezeout], have already achieved a computational speedup by excluding some convolutional layers from the backward pass, as the early layers of a DCNN only detect simple edge details while taking up most of the time budget. Stochastic Depth [huang2016deep] reduces the training time by removing a set of convolutional layers for each mini-batch, while FreezeOut [brock2017freezeout] reduces computational costs by freezing convolutional layers with cosine annealing [loshchilov2016sgdr].
In [yang2019recomputation], a standard SGD-with-MP-inverse pipeline that can boost the testing performance of a DCNN was provided. It is motivated by the fact that the performance boost achievable through network topology optimization is approaching its limit, as shown by the minimal improvement in the ILSVRC competition results in recent years [he2016deep, huang2017densely]. In other words, while network depth has increased dramatically, testing performance has improved only marginally. After each SGD training epoch, the authors adopt the MP inverse to pull the residual error back from the output layer to each fully-connected (FC) layer in order to update the parameters; thus, approximately optimal parameters of the FC layers can be generated. Formally, if the DCNN contains ReLU layers and the Dropout operation, the updated weights can be obtained via the KKT theorem [Kuhn1951Nonlinear] and the optimization solution of [schmidt1992feedforward]. The parameters of the last ($n$-th) FC layer are updated through [yang2019recomputation]:

$W_n^{new} = W_n + \lambda H_n^{\dagger} e, \qquad H_n^{\dagger} = \left( H_n^T H_n + \frac{I}{C} \right)^{-1} H_n^T, \qquad (1)$

where $H_n^{\dagger}$ is the MP inverse of $H_n$, $\lambda$ is the retraining rate, $C$ is the regularization term, $W_n$ contains the parameters of the $n$-th FC layer, $W_n^{new}$ is the updated weights, $H_n$ is the input feature of the $n$-th FC layer, and $e$ is the output layer residual error.
The earlier $(n-1)$-th FC layer can be updated by:

$W_{n-1}^{new} = W_{n-1} + \lambda H_{n-1}^{\dagger}\, \sigma^{-1}\!\left( d^{-1}\!\left( e\, W_n^{\dagger} \right) \right), \qquad (2)$

where $H_{n-1}^{\dagger}$ and $W_n^{\dagger}$ are the MP inverses of $H_{n-1}$ and $W_n$, respectively, $d(\cdot)$ is the dropout operation, and $\sigma(\cdot)$ is the ReLU operation. After each SGD training epoch, the weights of each FC layer are updated in this manner.
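The last-layer refinement can be sketched in NumPy as a regularized least-squares correction: the output residual error is pulled back through the MP inverse of the layer's input features and scaled by the retraining rate. This is an illustrative reconstruction under our reading of the update, not the authors' code; the function names `mp_inverse` and `retrain_last_fc` are our own.

```python
import numpy as np

def mp_inverse(H, C=100.0):
    """Regularized Moore-Penrose inverse: (H^T H + I/C)^{-1} H^T."""
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + np.eye(d) / C, H.T)

def retrain_last_fc(W, H, e, lam=1.0, C=100.0):
    """Refine the last FC layer: pull the output residual e back through
    the MP inverse of the layer input H, scaled by the retraining rate lam."""
    return W + lam * mp_inverse(H, C) @ e
```

With weak regularization (large C), the refined weights approach the exact least-squares fit, so the output residual shrinks after one retraining step.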
While the method provides strong recognition accuracy on image classification datasets, it can only be employed in a CPU environment rather than with high-speed GPU acceleration, as computing the parameters via Eq. (1) and Eq. (2) occupies a large amount of computational resources. This work is motivated by [huang2016deep, brock2017freezeout, yang2019recomputation], aiming to craft a fast retraining scheme that improves both training speed and testing performance for all existing DCNN models.
3 DCNN with Fast Retraining Strategy
Fast retraining tunes the DCNN parameters over several general epochs to achieve better generalization performance and boost training efficiency. Each general epoch contains two steps: Step 1, convolutional layer random learning with SGD, and Step 2, dense layer retraining with an MP inverse-based batch-by-batch strategy.
3.1 Step 1 - Convolutional Layer Random Learning with Stochastic Gradient Descent
In this paper, we provide a simple accelerated training algorithm, as depicted in Fig. 1(a), to speed up the training of a DCNN by randomly dropping hidden layers in each epoch with a preset activation rate $\gamma$. As this method contains only one hyperparameter, $\gamma$, it is relatively easy for users to tune in practice. Note that $\gamma$ keeps updating as the training epoch changes. Initially, the activation rate $\gamma$ is set to 1 in the first several training epochs in order to "warm up" the DCNN, and all of the parameters in the network are tuned and updated in the backward pass. After the "warm-up" stage, the earlier layers are able to extract low-level features that can be used by later layers to build high-level features, and they are reliable enough to represent the raw images. Therefore, $\gamma$ is decreased to both accelerate network training and avoid overfitting; the inactivated layers are excluded from the backward pass. Suppose that a designed DCNN contains $L$ convolutional layers; in a certain training epoch, the total numbers of activated ($L_a$) and inactivated ($L_i$) layers are:
$L_a = \lceil \gamma \cdot L \rceil, \qquad L_i = L - L_a, \qquad (3)$

where $\gamma$ is the activation rate.
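The random freezing step can be sketched as follows: given the total number of convolutional layers and the current activation rate, a random subset of layers is kept active while the rest are frozen for the epoch. This is a minimal sketch; the ceiling rule and the function name `split_layers` are our own assumptions about the selection.

```python
import math
import random

def split_layers(num_layers, activation_rate, seed=None):
    """Randomly select ceil(activation_rate * num_layers) convolutional layers
    to stay active this epoch; the remainder are frozen (no backward pass)."""
    rng = random.Random(seed)
    n_active = math.ceil(activation_rate * num_layers)
    active = sorted(rng.sample(range(num_layers), n_active))
    frozen = [i for i in range(num_layers) if i not in set(active)]
    return active, frozen
```

In a framework such as PyTorch, the frozen indices would simply have their parameters' gradient tracking disabled for the epoch, so they contribute to the forward pass but not to the backward pass.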
3.2 Step 2 - Dense Layer Retraining with MP Inverse-based Batch-by-batch Strategy
In order to implement the retraining schedule on efficient GPUs, the feature $H$ and the error $e$ in Eq. (1) are processed chunk by chunk in $k$ pieces, i.e., $H = [H_1; H_2; \dots; H_k]$ and $e = [e_1; e_2; \dots; e_k]$. First, the initial data $H_1$ and $e_1$ are given, and the weights are calculated via the one-batch learning strategy (1). Then, the weights are updated with $H_2, e_2$ through $H_k, e_k$ in an iterative way.
Suppose we have $A_j$ and $B_j$, which are defined as equation (4):

$A_j = \sum_{i=1}^{j} H_i^T H_i + \frac{I}{C}, \qquad B_j = \sum_{i=1}^{j} H_i^T e_i, \qquad j = 1, \dots, k. \qquad (4)$

From (1), with $k$ batches of features, the updated weights of one dense layer are considered as:

$W^{new} = W + \lambda \Delta W_k, \qquad (5)$

where $\Delta W_k = A_k^{-1} B_k = H^{\dagger} e$. According to (4), the following equation can be drawn:

$A_j = A_{j-1} + H_j^T H_j, \qquad (6)$

where $A_0 = I/C$. Based on the Sherman-Morrison-Woodbury (SMW) formula [golub2012matrix], the inverse of $A_j$ can be attained:

$A_j^{-1} = A_{j-1}^{-1} - A_{j-1}^{-1} H_j^T \left( I + H_j A_{j-1}^{-1} H_j^T \right)^{-1} H_j A_{j-1}^{-1}. \qquad (7)$

Equation (5) can be rewritten as:

$\Delta W_j = A_j^{-1} \left( B_{j-1} + H_j^T e_j \right). \qquad (8)$

Furthermore, for simplicity, we denote $P_j$ as:

$P_j = A_j^{-1} = P_{j-1} - P_{j-1} H_j^T \left( I + H_j P_{j-1} H_j^T \right)^{-1} H_j P_{j-1}. \qquad (9)$

Substituting $P_j$ into equation (8), the weight correction can be simplified to the following recursion:

$\Delta W_j = \Delta W_{j-1} + P_j H_j^T \left( e_j - H_j \Delta W_{j-1} \right). \qquad (10)$

In the case of new training data $(H_{k+1}, e_{k+1})$ being available, the updated weight correction can be written as:

$\Delta W_{k+1} = \Delta W_k + P_{k+1} H_{k+1}^T \left( e_{k+1} - H_{k+1} \Delta W_k \right). \qquad (11)$

Above all, the parameters in the last FC layer can be updated with the batch-by-batch strategy as (12):

$W_n^{new} = W_n + \lambda \Delta W_k, \qquad (12)$

where $\Delta W_k$ is accumulated via (9) and (10). The parameters in the earlier $(n-1)$-th FC layer with $k$ batches of data can be updated via (13):

$W_{n-1}^{new} = W_{n-1} + \lambda \Delta W_{n-1,k}, \qquad (13)$

where $\Delta W_{n-1,k}$ is obtained by the same recursion with $H_{n-1}$ and the pulled-back error $\sigma^{-1}(d^{-1}(e\, W_n^{\dagger}))$, and $d(\cdot)$ is the dropout operation.
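The batch-by-batch accumulation can be sketched in NumPy: a running inverse $P$ is maintained via the SMW identity alongside a running weight correction, so only one batch of features is ever resident in memory, yet the result matches the one-batch regularized MP-inverse solution. Variable names and the recursion layout are our reconstruction of the scheme, not the authors' code.

```python
import numpy as np

def batch_retrain_delta(H_batches, e_batches, C=100.0):
    """Accumulate the weight correction H^+ e batch by batch.  P tracks the
    running inverse (sum_i H_i^T H_i + I/C)^{-1} via the SMW identity, so
    only one batch of features is processed at a time."""
    d = H_batches[0].shape[1]
    P = C * np.eye(d)                       # P_0 = (I/C)^{-1}
    delta = np.zeros((d, e_batches[0].shape[1]))
    for Hj, ej in zip(H_batches, e_batches):
        m = Hj.shape[0]
        S = np.eye(m) + Hj @ P @ Hj.T       # SMW inner matrix
        P = P - P @ Hj.T @ np.linalg.solve(S, Hj @ P)
        delta = delta + P @ Hj.T @ (ej - Hj @ delta)
    return delta
```

A quick sanity check is to compare the batched recursion against the direct regularized solution $(H^T H + I/C)^{-1} H^T e$ computed in one shot; the two agree to numerical precision, while the batched version never forms the full feature matrix product at once.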
The proposed training procedure for a DCNN is presented as Algorithm 1. In each general epoch, the training process is divided into two consecutive steps: Step 1, convolutional layer random learning with SGD (Lines 2-5), and Step 2, dense layer retraining with an MP inverse-based batch-by-batch strategy (Lines 6-16).
4 Experiments
In this paper, we apply 8 datasets and 4 state-of-the-art DCNNs to demonstrate the efficiency of the fast retraining strategy. The experiments in this section were conducted on a workstation with 256 GB of memory and an Intel Xeon E5-2650 processor, and all of the DCNNs were trained on an NVIDIA GTX 1080 Ti GPU.
4.1 Dataset and Experimental Settings
I. Datasets. The details of the datasets are shown in Table 1. For Caltech101 [fei2006one] and Caltech256 [griffin2007caltech], we randomly selected 30 images per category to form the training set, using the rest for testing. For CIFAR-100 [krizhevsky2009learning], 50,000 images were used for training and 10,000 for testing. For the Food251 dataset [kaur2019foodx], a food classification dataset created in 2019, the training set (118,475 images) and validation set (11,994 images) were used for training and testing, respectively. Besides these datasets, the commonly used large-scale datasets Place365 [zhou2017places] and ImageNet [deng2009imagenet] were used to evaluate the proposed work. For a comprehensive comparison, we randomly selected 200 and 500 images per category to create ImageNet-1 and Place365-1, respectively, while the validation sets were used for testing.
II. Architectures. We evaluated fast retraining with several stateoftheart DCNNs, such as AlexNet [krizhevsky2012imagenet], VGG [simonyan2014very], InceptionV3 [szegedy2016rethinking], ResNet [he2016deep], and DenseNet [huang2017densely]. For the first three frameworks, we utilized the 16layer VGG, the 48layer InceptionV3, and the 50layer ResNet, respectively. For the DenseNet, we evaluated the fast retraining scheme on two structures: the 121layer DenseNet and the 201layer DenseNet.
III. Settings. In this paper, we tested the proposed method against the original strategy under two different conditions, i.e., transfer learning and training from scratch. The experimental settings were as follows. For transfer learning, the initial learning rate was divided by 10 every 3 training epochs. The initial activation rate $\gamma$ was 1, and it was set to 0.8, 0.6, and 0.4 at 25%, 50%, and 75% of the total number of training epochs, respectively. Other settings, including the total number of training epochs, the regularization term in retraining, and the mini-batch size, are described in Table 1. As for DCNN training from scratch, we trained the model for 90 epochs. The learning rate was set to 0.1 and was lowered by a factor of 10 at epochs 30 and 60. The activation rate $\gamma$ was first set to 1 and was decreased to 0.9 and 0.6 at 50% and 75% of the total number of training epochs.

Table 1: Dataset details and training configurations.

Dataset      # classes   # training samples   # testing samples   batch size (SGD)   batch size (MP inverse)   # max epoch^a   ^b
Caltech101   102         3,060                6,084               32                 10                        8/90            6/4/4/2
Caltech256   257         7,710                22,898              32                 10                        8/90            4/4/4/2
CIFAR-100    100         50,000               10,000              32                 10                        8/90            4/2/2/2
Food251      251         118,475              11,994              32                 10                        8/90            4/2/2/2
Place365-1   365         182,500              36,500              32                 10                        12/90           2/2/2/1
Place365     365         1,803,460            36,500              32                 10                        12/90           2/2/2/1
ImageNet-1   1,000       200,000              50,000              32                 10                        NA/90           2/2/2/1
ImageNet     1,000       14,197,122           50,000              32                 10                        NA/90           2/2/2/1

^a Configurations for transfer learning / training from scratch.
^b Configurations for VGG / Inception-v3 / ResNet / DenseNet.
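The activation-rate schedule described in the settings can be sketched as a small piecewise-constant function. This is an illustrative sketch only: the function name and the milestone/rate parameterization are our own, with defaults matching the transfer-learning schedule in the text.

```python
def activation_rate(epoch, total_epochs,
                    milestones=(0.25, 0.50, 0.75),
                    rates=(1.0, 0.8, 0.6, 0.4)):
    """Piecewise-constant activation-rate schedule: start at rates[0] and
    drop to the next rate at each milestone fraction of total training."""
    progress = epoch / total_epochs
    idx = sum(progress >= m for m in milestones)  # milestones already passed
    return rates[idx]
```

The training-from-scratch schedule (1, dropping to 0.9 at 50% and 0.6 at 75% of 90 epochs) is obtained by passing `milestones=(0.50, 0.75)` and `rates=(1.0, 0.9, 0.6)`.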
4.2 Step-by-step Quantitative Analysis
Table 2: Top-1 testing accuracy (%) with transfer learning, comparing SGD [bottou2010large], retraining (R.) [yang2019recomputation], and the proposed fast retraining (FR.).

Dataset      VGG-16              ResNet-50           Inception-v3        DenseNet-201
             SGD   R.    FR.     SGD   R.    FR.     SGD   R.    FR.     SGD   R.    FR.
Caltech101   90.4  90.1  90.1    90.8  91.3  91.5    89.7  90.6  90.5    92.5  91.7  91.4
Caltech256   69.3  73.0  73.7    77.1  78.2  78.3    77.9  78.4  78.6    79.7  81.0  81.2
CIFAR-100    77.4  79.5  79.2    84.3  83.5  83.7    83.8  84.2  84.3    86.1  84.4  85.6
Food251      52.7  56.9  57.4    59.4  61.8  61.8    59.3  60.8  61.8    61.6  62.4  62.8
Place365-1   42.9  44.3  44.5    47.4  48.4  48.6    47.5  48.1  48.1    47.7  47.7  47.9
Place365     50.9  51.6  51.9    52.7  53.3  53.6    53.8  53.9  54.1    54.7  55.4  55.4
Average      63.9  65.9  66.2    68.6  69.4  69.6    68.7  69.4  69.6    70.4  70.4  70.7
In this paper, all of the experiments are evaluated with Top-1 testing accuracy, and the reported results are the mean of at least three runs. The best results are in boldface.
I. The Effectiveness Analysis of the Batch-by-batch Strategy. To verify the effectiveness of applying an MP inverse-based batch-by-batch scheme in dense layer retraining, we conducted a sanity check comparing this strategy with the one-batch schedule [yang2019recomputation], as depicted in Fig. 2a. In particular, the peak memory usage (PMU) during training was used to empirically evaluate the different training modes. The investigation reveals that the batch-by-batch strategy significantly reduces the memory use of retraining DCNNs. Hence, we summarize the first conclusion: the provided batch-by-batch method reduces the computational burden and can be accelerated in a GPU environment, which overcomes the main drawback of [yang2019recomputation].
II. The Effectiveness Analysis of Random Learning. To validate the effectiveness of random layer learning, experiments were conducted on the Place365-1 dataset. Note that, in this part, the retraining strategy was excluded from the training epoch (so that only random learning remains), and the DCNNs were initialized from ImageNet-pretrained networks and trained for 8 epochs. The experiments were conducted under three different configurations: DCNNs trained with i) the original SGD baseline, ii) the FreezeOut [brock2017freezeout] learning scheme, and iii) the proposed random learning strategy. Figs. 2b and 2c compare the results. Through this analysis, we reach the second conclusion: the DCNN with the proposed random learning trains faster than the DCNN with traditional SGD, and the scheme has a positive impact on generalization performance.
III. Comparison on Transfer Learning. Taking the outcomes of Sections I and II as the foundation, more experiments were carried out to compare the proposed learning procedure with the retraining algorithm [yang2019recomputation]. All of the results are tabulated in Table 2. Unlike most recent works that prefer to boost testing performance with novel network topologies, the proposed method does not contain any network modification, yet it still achieves a slight improvement (0.1% to 1.0%) in testing accuracy over the state-of-the-art MP inverse-based learning scheme [yang2019recomputation]. While a 0.1% to 1% Top-1 accuracy boost may seem marginal, such improvements are not easy to obtain at the current stage, as DCNN optimization is approaching its limit. For example, VGG-16 and ResNet are the ILSVRC winners of 2014 and 2015, respectively, yet ResNet provides only a 1.2% boost over VGG-16 on the CIFAR-100 set and is 1% lower on the SUN397 set.
Furthermore, the total training times of the fast retraining method, the recomputation method [yang2019recomputation], and the original SGD method are tabulated in Table 3. Note that all of these experiments were conducted with 8 training epochs. Fig. 3 plots the generalization performance on these datasets as the number of training epochs increases. We can easily find that the proposed strategy provides a speedup of up to 25% compared to the existing retraining strategy [yang2019recomputation], and that it needs only 3 to 4 epochs to reach the optimal results, whereas the original DCNN needs at least 6 epochs. From Table 2, Table 3, and Fig. 3, the last conclusion that can be drawn is: the fast retraining scheme improves the generalization performance of a DCNN while reducing the learning time by 15% to 25% compared with the existing MP inverse-based learning paradigm [yang2019recomputation].
IV. Comparison Results of Training from Scratch. To test the fast retraining method more extensively, we employed another set of experiments under the condition of training from scratch. Table 4 shows the comparison results with Inception-v3 and DenseNet-121 on the ImageNet-1 and ImageNet datasets. From Table 4, we find that Inception-v3 and DenseNet-121 with fast retraining achieve 1.9% and 1.1% improvements over those with the traditional SGD scheme, and a 0.6% and 0.3% boost over the training pipeline in [yang2019recomputation]. Thus, the effectiveness of the proposed fast retraining is verified.
Table 3: Total training time of SGD [bottou2010large], retraining [yang2019recomputation], and the proposed fast retraining, with the relative improvement (Imp.) of fast retraining over each.

Dataset      DCNN        SGD   Retraining   Fast Retraining   Imp. vs SGD (%)   Imp. vs R. (%)
CIFAR-100    Inception   282   308          262               8.7               15.0
             ResNet      161   176          139               16.3              21.0
Place365-1   Inception   860   968          772               12.1              20.2
             ResNet      481   589          445               10.8              24.8
Table 4: Top-1 testing accuracy (%) when training from scratch.

Method                                                        Dataset      Accuracy (%)
Inception-v3 with SGD [bottou2010large]                       ImageNet-1   42.2
DenseNet-121 with SGD [bottou2010large]                       ImageNet     69.1
Inception-v3 with retraining scheme [yang2019recomputation]   ImageNet-1   43.5
DenseNet-121 with retraining scheme [yang2019recomputation]   ImageNet     69.9
Inception-v3 with fast retraining scheme (proposed)           ImageNet-1   44.1
DenseNet-121 with fast retraining scheme (proposed)           ImageNet     70.2
5 Conclusion
In this paper, a unified fast retraining procedure for DCNNs is proposed. Compared to the state-of-the-art DCNN training strategy [yang2019recomputation], this method achieves better testing performance without occupying many additional computational resources. In particular, it provides a random learning schedule to speed up convolutional layer learning and a batch-by-batch Moore-Penrose inverse-based retraining strategy to optimize the parameters of the dense layers. This scheme can be applied to all DCNNs, and the batch-by-batch solution of the Moore-Penrose inverse allows the proposed training pipeline to be accelerated in a pure GPU environment. The experimental results on benchmark datasets prove the effectiveness and efficiency of the proposed fast retraining algorithm.