The depth of a DCNN plays a vital role in discovering intricate structures, both in theory [haastad1987computational, haastad1991power, lecun2015deep, szegedy2015going] and in practice [sun2014deep, guo2014deep, sindhwani2015structured]. The original DCNN, LeNet5, contains 4 convolutional layers; since then, the development of computer hardware and improvements in network topology have enabled DCNNs to go deeper and deeper. Recent deep models, including ResNet and DenseNet, have already surpassed the 100-layer barrier at 152 and 264 layers, respectively. Although deeper networks have strong power in discovering intricate structures in image tasks, these approaches share a key characteristic: all of the up-to-date DCNNs adopt SGD [bengio2007greedy, bottou2010large] and its variants [kingma2014adam, duchi2011adaptive] as the foundation for training the networks. Such a training strategy leaves the network prone to being trapped in local minima and requires a large amount of training time. Thus, a novel learning pipeline is needed in order to boost generalization performance and speed up learning.
In recent years, the Moore-Penrose (MP) inverse technique has been utilized to train DCNNs to achieve better generalization performance [yang2019recomputation]. Essentially, the dense layers of a DCNN (with a linear activation function) can be reduced to a linear system, whose approximately optimal parameters are the least squares (LS) solutions when the minimum error is achieved. It is well established that the MP inverse is, among other techniques, the most widely known generalization of the inverse matrix for finding the unique solution of an LS problem. More importantly, the work in [schmidt1992feedforward] has already proved that the output layer vector can be regarded as a Fisher vector if the weights are solved by the standard MP inverse. Following this, an increased focus has been placed on hierarchical networks with the MP inverse [yang2019features, zhang2020width]. Compared to the parameters calculated with SGD, the unique solution obtained by the MP inverse corresponds to the maximum likelihood estimation. To the best of our knowledge, the study in [yang2019recomputation] is the state-of-the-art work that utilizes the MP inverse in DCNN training: in each training epoch, the DCNN is first trained with the SGD optimizer; then, the parameters of the dense layers are refined through the MP inverse-based approach.
Despite its advantages, the training procedure for DCNNs provided in [yang2019recomputation] is not as widespread as it could be, for two crucial reasons. On the one hand, the retraining process adds to the computational workload of each epoch. In particular, for a large dataset such as ImageNet [deng2009imagenet], researchers can only refine the parameters of the dense layers on the CPU instead of with GPU acceleration, because the calculation of the MP inverse occupies huge computational resources. As the training cost in the CPU environment increases dramatically, the DCNN trained with the MP inverse [yang2019recomputation] is uneconomical when handling large-scale samples without access to industrial-scale computational resources.
On the other hand, before retraining the dense layer weights, the process in [yang2019recomputation] still requires the SGD technique to optimize the parameters of all layers. Motivated by some successful techniques [huang2016deep, gastaldi2017shake], we hypothesize in this paper that it is not necessary for the work [yang2019recomputation] to involve every convolutional layer in the training process at each epoch, because the refinement of the parameters in the dense layers can provide additional clues. The existing methods [huang2016deep, gastaldi2017shake] accelerate training by adjusting a set of hyperparameters on top of the basic SGD training pipeline. However, they do have drawbacks, such as a slight degradation in performance and a lack of robustness against various environmental conditions, resulting in unstable testing accuracy.
In this paper, we focus on providing a unified training pipeline for DCNNs with better generalization performance but without incurring much additional training burden. We achieve this goal by training the DCNN over several general epochs. Each general epoch employs two simple steps that can be implemented in a pure GPU environment: the first is SGD with random layer learning, and the second is an MP inverse-based batch-by-batch strategy. For the first step, a freeze-learning algorithm that reduces the workload and speeds up DCNN training is provided. We shorten the network by randomly activating a portion of the convolutional layers in each general epoch. The activation rate is preset and progressively decreased: in the first several general epochs it is set to 1 to warm up the network, and all of the parameters are updated; it is then gradually decreased and finally reaches 0, meaning that all of the convolutional layers are "frozen" without updating. Hence, the only parameter that users need to adjust is the activation rate. As for the second step, a batch-by-batch MP inverse retraining strategy that can be processed by the GPU is proposed. Instead of training on all of the loaded data at once, the data is processed sequentially. By doing so, the data volume of each batch is dramatically reduced and does not consume many computational resources. Thus, the proposed training pipeline can be implemented with GPU acceleration.
Extensive experiments are conducted to verify the effectiveness of this method. We show across 8 datasets, including 2 large datasets (ImageNet and Place365), that the proposed method almost always improves the generalization performance without increasing the training burden. For instance, on the Food251 and Place365 datasets, fast retraining with ResNet trains 24.8% and 21.0% faster, respectively, than the method in [yang2019recomputation].
2 Related Works
The training procedures of DCNNs have been widely studied. However, most prior studies focus on improving one aspect of performance, either the training efficiency [hinton2012improving, huang2016deep, brock2017freezeout] or the generalization performance [yang2019recomputation, huang2017snapshot]. Few of them address both concerns.
Many successful learning schedules, such as Dropout [hinton2012improving], Stochastic Depth [huang2016deep], and FreezeOut [brock2017freezeout], have already achieved computational speedups by excluding some convolutional layers from the backward pass, as the early layers of a DCNN only detect simple edge details while taking up most of the time budget. Stochastic Depth [huang2016deep] reduces training time by removing a set of convolutional layers for each mini-batch, while FreezeOut [brock2017freezeout] reduces computational costs by freezing convolutional layers with cosine annealing [loshchilov2016sgdr].
In study [yang2019recomputation], a standard SGD with an MP inverse pipeline that can boost the testing performance of a DCNN was provided. It is motivated by the fact that the performance boost through network topology optimization is almost approaching its limit, as shown by the minimal improvement in the results of the ILSVRC competition in recent years [he2016deep, huang2017densely]. In other words, while network depth has increased dramatically, testing performance has shown only limited improvement. After each SGD training epoch, the authors adopt the MP inverse to pull the residual error back from the output layer to each fully-connected (FC) layer in order to update the parameters. Thus, approximately optimal parameters of the FC layers can be generated. Formally, if the DCNN contains the ReLU layer and the dropout operation, the updated weight $\bar{\mathbf{W}}$ can be obtained via the KKT theorem [Kuhn1951Nonlinear] and the optimization solution in [schmidt1992feedforward]. The parameters of the last ($n$-th) FC layer are updated through [yang2019recomputation]:

$$\bar{\mathbf{W}}_n = \mathbf{W}_n + r\left(\lambda\mathbf{I} + \mathbf{H}_n^{T}\mathbf{H}_n\right)^{-1}\mathbf{H}_n^{T}\mathbf{e} = \mathbf{W}_n + r\,\mathbf{H}_n^{\dagger}\mathbf{e}, \quad (1)$$

where $\mathbf{H}_n^{\dagger}$ is the MP inverse of $\mathbf{H}_n$, $r$ is the retraining rate, $\lambda$ is the regularization term, $\mathbf{W}_n$ denotes the parameters of the $n$-th FC layer, $\bar{\mathbf{W}}_n$ the updated weights, $\mathbf{H}_n$ the input feature of the $n$-th FC layer, and $\mathbf{e}$ the output layer residual error.
The earlier ($n{-}1$)-th FC layer can be updated by:

$$\bar{\mathbf{W}}_{n-1} = \mathbf{W}_{n-1} + r\,\mathbf{H}_{n-1}^{\dagger}\,\sigma^{-1}\!\left(d^{-1}\!\left(\mathbf{e}\,\mathbf{W}_n^{\dagger}\right)\right), \quad (2)$$

where $\mathbf{H}_{n-1}^{\dagger}$ and $\mathbf{W}_n^{\dagger}$ are the MP inverses of $\mathbf{H}_{n-1}$ and $\mathbf{W}_n$, respectively, $d(\cdot)$ is the dropout operation, and $\sigma(\cdot)$ is the ReLU operation. After each SGD training epoch, the weights of each FC layer are updated.
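To make the MP inverse-based refinement of a single dense layer concrete, the following NumPy sketch applies a ridge-regularized MP-inverse update. The function name, array shapes, retraining rate `r`, and regularization term `lam` are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def mp_inverse_refine(W, H, e, r=0.5, lam=1e-3):
    """Refine dense-layer weights W (d x c) given input features H (N x d)
    and the output residual error e (N x c), using the regularized
    MP-inverse solution (lam*I + H^T H)^{-1} H^T."""
    d = H.shape[1]
    # Solve (lam*I + H^T H) X = H^T instead of forming the inverse explicitly.
    H_pinv = np.linalg.solve(lam * np.eye(d) + H.T @ H, H.T)  # (d x N)
    return W + r * (H_pinv @ e)                               # updated weights

# Toy usage: 32 samples, 8 input features, 4 output classes
rng = np.random.default_rng(0)
H = rng.standard_normal((32, 8))
W = rng.standard_normal((8, 4))
e = rng.standard_normal((32, 4))
W_new = mp_inverse_refine(W, H, e)
```

Solving the regularized normal equations with `np.linalg.solve` is numerically preferable to computing an explicit pseudo-inverse with `np.linalg.pinv` when the regularization term is present anyway.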
While the method provides strong recognition accuracy on image classification datasets, it can only be employed in a CPU environment instead of with high-speed GPU acceleration, as computing the parameters via Eq. 1 and Eq. 2 occupies a large amount of computational resources. This work is motivated by [huang2016deep, brock2017freezeout, yang2019recomputation], aiming to craft a fast retraining scheme that leads to improvements in both training speed and testing performance for all existing DCNN models.
3 DCNN with Fast Retraining Strategy
Fast retraining tunes the DCNN parameters over several general epochs to achieve better generalization performance and boost training efficiency. Each general epoch contains two steps: Step 1, convolutional layer random learning with SGD, and Step 2, dense layer retraining with an MP inverse-based batch-by-batch strategy.
3.1 Step 1 - Convolutional Layer Random Learning with Stochastic Gradient Descent
In this paper, we provide a simple accelerated training algorithm, as depicted in Fig. 0(a), to speed up the training of a DCNN by randomly dropping hidden layers in each epoch with a preset activation rate $\gamma$. As this method contains only one hyperparameter, $\gamma$, it is relatively easy for users to tune in practical deployment. Note that $\gamma$ keeps updating as training progresses. Initially, the activation rate is set to 1 in the first several training epochs in order to "warm up" the DCNN, so all of the parameters in the network are tuned and updated in the backward pass. After the warm-up stage, the earlier layers are able to extract low-level features that can be used by later layers to build high-level features, and they are reliable enough to represent the raw images. Therefore, $\gamma$ is gradually decreased to both accelerate network training and avoid over-fitting. In this sense, the inactivated layers are excluded from the backward pass. Suppose that a designed DCNN contains $L$ convolutional layers; in a certain training epoch, the total numbers of activated ($L_a$) and inactivated ($L_i$) layers are:

$$L_a = \lceil \gamma L \rceil, \qquad L_i = L - L_a. \quad (3)$$
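The activated/inactivated split of Step 1 can be sketched as follows. The helper name is hypothetical; in a real framework such as PyTorch, the inactive layers would be excluded from the backward pass, e.g. by disabling their gradients:

```python
import math
import random

def split_active_layers(num_layers, gamma, rng=random):
    """Randomly choose ceil(gamma * num_layers) convolutional layers to
    activate for this general epoch; the rest are frozen (excluded from
    the backward pass)."""
    num_active = math.ceil(gamma * num_layers)
    active = set(rng.sample(range(num_layers), num_active))
    inactive = set(range(num_layers)) - active
    return active, inactive

# gamma = 1.0 warms up the network: every layer is updated.
active, inactive = split_active_layers(16, 1.0)
assert len(active) == 16 and not inactive

# gamma = 0.4 freezes most layers and shortens the backward pass.
active, inactive = split_active_layers(16, 0.4)
assert len(active) == 7  # ceil(0.4 * 16) = 7
```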
3.2 Step 2 - Dense Layer Retraining with MP inverse-based Batch-by-batch Strategy
In order to implement the retraining schedule on efficient GPUs, the feature $\mathbf{H}$ and error $\mathbf{e}$ in Eq. 1 are processed chunk by chunk in $k$ pieces, i.e., $\mathbf{H} = [\mathbf{H}_1^{T}, \mathbf{H}_2^{T}, \ldots, \mathbf{H}_k^{T}]^{T}$ and $\mathbf{e} = [\mathbf{e}_1^{T}, \mathbf{e}_2^{T}, \ldots, \mathbf{e}_k^{T}]^{T}$. First, the initial data $\mathbf{H}_1$ and $\mathbf{e}_1$ are given, and the weights are calculated via the one-batch learning strategy (1). Then, the weights are updated via $\mathbf{H}_2, \ldots, \mathbf{H}_k$ and $\mathbf{e}_2, \ldots, \mathbf{e}_k$ in an iterative way.

Suppose we have $\mathbf{K}_k$, defined as in equation (4):

$$\mathbf{K}_k = \lambda\mathbf{I} + \sum_{i=1}^{k}\mathbf{H}_i^{T}\mathbf{H}_i. \quad (4)$$

From (1), with $k$ batches of features, the updated weights of one dense layer are considered as:

$$\bar{\mathbf{W}} = \mathbf{W} + r\,\mathbf{K}_k^{-1}\mathbf{H}^{T}\mathbf{e}, \quad (5)$$

where $\mathbf{H}^{T}\mathbf{e} = \sum_{i=1}^{k}\mathbf{H}_i^{T}\mathbf{e}_i$. According to (4), the following equation can be drawn:

$$\mathbf{K}_k = \mathbf{K}_{k-1} + \mathbf{H}_k^{T}\mathbf{H}_k, \quad (6)$$

where $\mathbf{K}_0 = \lambda\mathbf{I}$. Based on the Sherman-Morrison-Woodbury (SMW) formula [golub2012matrix], the inverse of $\mathbf{K}_k$ can be attained:

$$\mathbf{K}_k^{-1} = \mathbf{K}_{k-1}^{-1} - \mathbf{K}_{k-1}^{-1}\mathbf{H}_k^{T}\left(\mathbf{I} + \mathbf{H}_k\mathbf{K}_{k-1}^{-1}\mathbf{H}_k^{T}\right)^{-1}\mathbf{H}_k\mathbf{K}_{k-1}^{-1}. \quad (7)$$

The equation (5) can be rewritten as

$$\bar{\mathbf{W}} = \mathbf{W} + r\,\mathbf{K}_k^{-1}\sum_{i=1}^{k}\mathbf{H}_i^{T}\mathbf{e}_i. \quad (8)$$

Furthermore, for simplicity, we denote $\mathbf{P}_k$ as:

$$\mathbf{P}_k = \mathbf{K}_k^{-1}. \quad (9)$$

Substituting $\mathbf{P}_k$ into equation (8), the weight can be simplified to the following equation:

$$\bar{\mathbf{W}}^{k} = \mathbf{W} + r\,\mathbf{P}_k\sum_{i=1}^{k}\mathbf{H}_i^{T}\mathbf{e}_i. \quad (10)$$

In the case of new training data $(\mathbf{H}_{k+1}, \mathbf{e}_{k+1})$ becoming available, the updated weight can be written as:

$$\bar{\mathbf{W}}^{k+1} = \mathbf{W} + r\,\mathbf{P}_{k+1}\sum_{i=1}^{k+1}\mathbf{H}_i^{T}\mathbf{e}_i, \qquad \mathbf{P}_{k+1} = \mathbf{P}_k - \mathbf{P}_k\mathbf{H}_{k+1}^{T}\left(\mathbf{I} + \mathbf{H}_{k+1}\mathbf{P}_k\mathbf{H}_{k+1}^{T}\right)^{-1}\mathbf{H}_{k+1}\mathbf{P}_k. \quad (11)$$

Above all, the parameters in the last ($n$-th) FC layer can be updated with the batch-by-batch strategy as (12):

$$\bar{\mathbf{W}}_n = \mathbf{W}_n + r\,\mathbf{P}_k\sum_{i=1}^{k}\mathbf{H}_{n,i}^{T}\mathbf{e}_i. \quad (12)$$

The parameters in the earlier ($n{-}1$)-th FC layer with $k$ batches of data can be updated via (13):

$$\bar{\mathbf{W}}_{n-1} = \mathbf{W}_{n-1} + r\,\mathbf{P}_k\sum_{i=1}^{k}\mathbf{H}_{n-1,i}^{T}\,\sigma^{-1}\!\left(d^{-1}\!\left(\mathbf{e}_i\,\mathbf{W}_n^{\dagger}\right)\right), \quad (13)$$

where $d(\cdot)$ is the dropout operation and $\sigma(\cdot)$ is the ReLU operation.
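As a sanity check of the derivation above, the following NumPy sketch accumulates the inverse $\mathbf{P}_k$ batch by batch via the SMW identity and verifies that the result matches the one-batch MP-inverse solution. The function name and the scalar values of the retraining rate and regularization term are illustrative assumptions:

```python
import numpy as np

def batch_by_batch_refine(W, H_batches, e_batches, r=0.5, lam=1e-3):
    """Iteratively maintain P_k = (lam*I + sum_i H_i^T H_i)^{-1} with the
    Sherman-Morrison-Woodbury identity, then apply the MP-inverse update
    without ever forming the full feature matrix H."""
    d = W.shape[0]
    P = np.eye(d) / lam           # P_0 = (lam * I)^{-1}
    q = np.zeros_like(W)          # running sum of H_i^T e_i
    for H_k, e_k in zip(H_batches, e_batches):
        # SMW rank update: P_k = P_{k-1} - P H^T (I + H P H^T)^{-1} H P
        S = np.linalg.inv(np.eye(H_k.shape[0]) + H_k @ P @ H_k.T)
        P = P - P @ H_k.T @ S @ H_k @ P
        q = q + H_k.T @ e_k
    return W + r * (P @ q)

rng = np.random.default_rng(1)
H = rng.standard_normal((40, 6))
e = rng.standard_normal((40, 3))
W = rng.standard_normal((6, 3))

# One-batch reference: W + r * (lam*I + H^T H)^{-1} H^T e
ref = W + 0.5 * np.linalg.solve(1e-3 * np.eye(6) + H.T @ H, H.T @ e)
out = batch_by_batch_refine(W, np.split(H, 4), np.split(e, 4))
assert np.allclose(out, ref, atol=1e-6)
```

Each SMW step inverts only a (batch-size x batch-size) matrix, which is why the per-batch memory footprint stays small enough for GPU execution.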
The proposed training procedure for DCNNs is presented as Algorithm 1. In each general epoch, the training process is divided into two consecutive steps: Step 1, convolutional layer random learning with SGD (Lines 2-5), and Step 2, dense layer retraining with an MP inverse-based batch-by-batch strategy (Lines 6-16).
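The structure of the general-epoch loop can be sketched as follows. All callables here are hypothetical placeholders standing in for framework code, not the authors' implementation:

```python
def fast_retrain(num_general_epochs, batches, gamma_schedule,
                 freeze_layers, sgd_step, refine_dense):
    """One 'general epoch' = Step 1 (SGD with random conv-layer freezing)
    followed by Step 2 (batch-by-batch MP-inverse dense-layer retraining).
    All callables are hypothetical hooks for framework-specific code."""
    for epoch in range(num_general_epochs):
        gamma = gamma_schedule(epoch)
        freeze_layers(gamma)           # Step 1a: activate ceil(gamma*L) layers
        for batch in batches:
            sgd_step(batch)            # Step 1b: SGD on the active layers
        for batch in batches:
            refine_dense(batch)        # Step 2: MP-inverse refinement per batch

# Toy usage with counting stubs to show the call pattern
calls = {"freeze": 0, "sgd": 0, "refine": 0}
fast_retrain(
    num_general_epochs=3, batches=[0, 1],
    gamma_schedule=lambda e: 1.0 if e == 0 else 0.5,
    freeze_layers=lambda g: calls.__setitem__("freeze", calls["freeze"] + 1),
    sgd_step=lambda b: calls.__setitem__("sgd", calls["sgd"] + 1),
    refine_dense=lambda b: calls.__setitem__("refine", calls["refine"] + 1),
)
assert calls == {"freeze": 3, "sgd": 6, "refine": 6}
```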
4 Experiments

In this paper, we use 8 datasets and 4 state-of-the-art DCNNs to demonstrate the efficiency of the fast retraining strategy. The experiments in this section were conducted on a workstation with 256GB of memory and an E5-2650 processor, and all of the DCNNs were trained on an NVIDIA 1080Ti GPU.
4.1 Dataset and Experimental Settings
I. Datasets. The details of the datasets are shown in Table 1. For Caltech101 [fei2006one] and Caltech256 [griffin2007caltech], we randomly selected 30 images per category to form the training set, using the rest for testing. As for CIFAR100 [krizhevsky2009learning], 50,000 images were used for training and 10,000 for testing. For the Food251 dataset [kaur2019foodx], a food classification dataset created in 2019, the training set (118,475 images) and validation set (11,994 images) were used for training and testing, respectively. Besides these datasets, the commonly used large-scale datasets Place365 [zhou2017places] and ImageNet [deng2009imagenet] were used to evaluate the proposed work. For a comprehensive comparison, we randomly selected 200 and 500 images per category to create ImageNet-1 and Place365-1, respectively, while the validation set was used for testing.
II. Architectures. We evaluated fast retraining with several state-of-the-art DCNNs: AlexNet [krizhevsky2012imagenet], VGG [simonyan2014very], Inception-V3 [szegedy2016rethinking], ResNet [he2016deep], and DenseNet [huang2017densely]. For VGG, Inception-V3, and ResNet, we utilized the 16-layer, 48-layer, and 50-layer versions, respectively. For DenseNet, we evaluated the fast retraining scheme on two structures: the 121-layer DenseNet and the 201-layer DenseNet.
In this paper, we tested the proposed method against the original strategy under two different conditions, i.e., transfer learning and training from scratch. The experimental settings were as follows. For transfer learning, the learning rate was divided by 10 every 3 training epochs. The initial activation rate was 1, and it was set to 0.8, 0.6, and 0.4 at 25%, 50%, and 75% of the total number of training epochs. Other settings, including the total number of training epochs, the regularization term in retraining, and the mini-batch size, are described in Table 1. As for DCNN training from scratch, we trained the model for 90 epochs. The learning rate was set to 0.1 and was lowered by a factor of 10 at epochs 30 and 60. The activation rate was first set to 1 and was decreased to 0.9 and 0.6 at 50% and 75% of the total number of training epochs.
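The piecewise activation-rate schedules described above can be sketched as a simple lookup. The function name is illustrative; the thresholds follow the settings stated in the text:

```python
def activation_rate(epoch, total_epochs, transfer=True):
    """Return the activation rate for a given epoch, following the
    piecewise schedules for transfer learning / training from scratch."""
    frac = epoch / total_epochs
    if transfer:
        # transfer learning: 1 -> 0.8 -> 0.6 -> 0.4 at 25%/50%/75%
        if frac < 0.25:
            return 1.0
        if frac < 0.50:
            return 0.8
        if frac < 0.75:
            return 0.6
        return 0.4
    # training from scratch: 1 -> 0.9 -> 0.6 at 50%/75%
    if frac < 0.50:
        return 1.0
    if frac < 0.75:
        return 0.9
    return 0.6

assert activation_rate(0, 8) == 1.0            # warm-up phase
assert activation_rate(6, 8) == 0.4            # last quarter of transfer run
assert activation_rate(30, 90, transfer=False) == 1.0
assert activation_rate(80, 90, transfer=False) == 0.6
```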
[Table 1: Dataset statistics (# classes, # training images, # testing images) and configurations (mini-batch size, # max epochs) for transfer learning / training from scratch, for VGG / Inception-V3 / ResNet / DenseNet. Data rows not recoverable from the extraction.]
4.2 Step-by-step Quantitative Analysis
[Table 2: Top-1 testing accuracy (%) of SGD [bottou2010large], retraining (R.) [yang2019recomputation], and the proposed fast retraining (FR.) across the evaluated datasets and DCNNs. Data rows not recoverable from the extraction.]
In this paper, all of the experiments are evaluated with Top-1 testing accuracy, and the recorded results are the average of at least three runs. The best results are shown in boldface.
I. The Effectiveness Analysis of the batch-by-batch strategy. To verify the effectiveness of applying an MP inverse-based batch-by-batch scheme in dense layer retraining, we conducted a sanity check comparing this strategy with the one-batch schedule [yang2019recomputation], as described in Fig. 2a. In particular, the peak memory usage (PMU) during training was used to empirically evaluate the different training modes. The investigation reveals that the batch-by-batch strategy significantly reduces the memory use of retraining DCNNs. Hence, we summarize the first conclusion as follows: the provided batch-by-batch method reduces the computational burden and can be accelerated in a GPU environment, which overcomes the main drawback of the work in [yang2019recomputation].
II. The Effectiveness Analysis of random learning. To validate the effectiveness of random layer learning, experiments were conducted on the Place365-1 dataset. Note that, in this part, the retraining strategy was excluded from the training epoch (so that only random learning remained), and the DCNNs were fine-tuned from ImageNet pre-trained networks for 8 epochs. The experiments were conducted under three different configurations: the DCNN trained with i) the original SGD baseline, ii) the FreezeOut [brock2017freezeout] learning scheme, and iii) the proposed random learning strategy. Figs. 2b and 2c compare the results. Through this analysis, we reach the second conclusion: the DCNN with the proposed random learning trains faster than the DCNN with traditional SGD, and random learning has a positive impact on generalization performance.
III. Comparison of transfer learning. Taking the outcomes of Sections I and II above as the foundation, more experiments were carried out to compare the proposed learning procedure with the retraining algorithm [yang2019recomputation]. All of the results are tabulated in Table 2. Unlike most recent works that boost testing performance with novel network topologies, the proposed method does not contain any network modification, yet it achieves a slight improvement (0.1% to 1.0%) in testing accuracy over the state-of-the-art MP inverse-based learning scheme [yang2019recomputation]. While a 0.1% to 1% Top-1 accuracy boost may seem marginal, such improvements are not easy to obtain at the current stage, as DCNN optimization is almost reaching its limit. For example, VGG-16 and ResNet were the ILSVRC winners in 2014 and 2015, respectively, yet ResNet provides only a 1.2% boost over VGG-16 on the CIFAR100 set, and is 1% lower on the SUN397 set.
Furthermore, the total training times of the fast retraining method, the recomputation method [yang2019recomputation], and the original SGD method are tabulated in Table 3. Note that all of the experiments were conducted with 8 training epochs. Figure 3 plots the generalization performance on these datasets as the number of training epochs increases. We can easily find that the proposed strategy provides a speedup of up to 25% compared to the existing retraining strategy [yang2019recomputation], and that it needs only 3 to 4 epochs to reach the optimal results, whereas the original DCNN needs at least 6 epochs. Through Table 2, Table 3, and Fig. 3, the last conclusion can be drawn: the fast retraining scheme improves the generalization performance of a DCNN while reducing the learning time by 15% to 25% compared with the existing MP inverse-based learning paradigm [yang2019recomputation].
IV. Comparison results of training from scratch. In order to test the fast retraining method more extensively, we ran another set of experiments under the condition of training from scratch. Table 4 shows the comparison results with InceptionNet and DenseNet on the ImageNet-1 and ImageNet datasets. Through Table 4, we find that Inception-v3 and DenseNet-121 with fast retraining achieve 1.9% and 0.9% improvements over those with the traditional SGD scheme, and a 0.6% and 0.3% boost over the training pipeline in [yang2019recomputation]. Thus, the effectiveness of the proposed fast retraining is verified.
| Dataset | DCNN | SGD [bottou2010large] | Retraining [yang2019recomputation] | Fast Retraining | Imp. - SGD (%) | Imp. - R. (%) |
|---|---|---|---|---|---|---|
| ImageNet-1 | Inception-v3 | 42.2 | 43.5 | 44.1 | 1.9 | 0.6 |
| ImageNet | DenseNet-121 | 69.1 | 69.9 | 70.2 | 1.1 | 0.3 |
5 Conclusion

In this paper, a unified fast retraining procedure for DCNNs is proposed. Compared to the state-of-the-art DCNN training strategy [yang2019recomputation], this method achieves better testing performance without occupying many computational resources. In particular, it provides a random learning schedule to speed up convolutional layer learning and a batch-by-batch Moore-Penrose inverse-based retraining strategy to optimize the parameters of the dense layers. The scheme can be applied to all DCNNs, and the batch-by-batch solution of the Moore-Penrose inverse allows the proposed training pipeline to be accelerated in a pure GPU environment. The experimental results on benchmark datasets demonstrate the effectiveness and efficiency of the proposed fast retraining algorithm.