Deep Networks with Fast Retraining

08/13/2020 ∙ by Wandong Zhang, et al. ∙ University of Windsor ∙ Lakehead University

Recent work [1] has utilized the Moore-Penrose (MP) inverse in deep convolutional neural network (DCNN) training, which achieves better generalization performance than a DCNN trained with the stochastic gradient descent (SGD) pipeline. However, the MP technique cannot be processed in a GPU environment due to its high demand for computational resources. This paper proposes a fast DCNN learning strategy with the MP inverse to achieve better testing performance without introducing a large calculation burden. We achieve this goal through an SGD and MP inverse-based two-stage training procedure. In each training epoch, a random learning strategy that controls the number of convolutional layers trained in the backward pass is utilized, and an MP inverse-based batch-by-batch learning strategy is developed that enables the network to be implemented with GPU acceleration and to refine the parameters of the dense layers. Through experiments on image classification datasets with training-set sizes ranging from 3,060 (Caltech101) to 1,803,460 (Place365), we empirically demonstrate that fast retraining is a unified strategy that can be utilized in all DCNNs. Our method obtains an accuracy improvement of up to 1% over the state-of-the-art DCNN learning pipeline, yielding a savings in training time of 15% to 25%.

[1] … Thangarajah, "Recomputation of dense layers for the performance improvement of DCNN," IEEE Trans. Pattern Anal. Mach. Intell., 2019.




1 Introduction

The depth of a DCNN plays a vital role in discovering intricate structure, both in theory [haastad1987computational, haastad1991power, lecun2015deep, szegedy2015going] and in practice [sun2014deep, guo2014deep, sindhwani2015structured]. The original DCNN, LeNet5, contains 4 convolutional layers; since then, the development of computer hardware and improvements in network topology have enabled DCNNs to go deeper and deeper. Recent deep models, including ResNet and DenseNet, have already surpassed the 100-layer barrier at 152 and 264 layers, respectively. Although deeper networks are powerful at discovering intricate structure in image tasks, these approaches share a key characteristic: all of the up-to-date DCNNs adopt SGD [bengio2007greedy, bottou2010large] and its variants [kingma2014adam, duchi2011adaptive] as the foundation for training. Such a training strategy leaves the network prone to being trapped in local minima and requires a large amount of training time. Thus, a novel learning pipeline is needed to boost generalization performance and speed up learning.

In recent years, the MP inverse technique has been utilized to train a DCNN to achieve better generalization performance [yang2019recomputation]. Essentially, the dense layers of a DCNN (with a linear activation function) can be reduced to a linear system, whose approximately optimal parameters are the least squares (LS) solutions at the minimum error. It is well established that the MP inverse, among other techniques, is the most widely known generalization of the inverse matrix for finding the unique solution of an LS problem. More importantly, earlier work has already proved that the output-layer vector can be called the Fisher vector if the weights are solved by the standard MP inverse. Following this, an increased focus has been placed on hierarchical networks with the MP inverse [yang2019features, zhang2020width]. Compared to the parameters calculated with SGD, the unique solution obtained by the MP inverse corresponds to the maximum likelihood estimation. To the best of our knowledge, the study in [yang2019recomputation] is the state-of-the-art work that utilizes the MP inverse in DCNN training: in each training epoch, the DCNN is first trained with the SGD optimizer; then, the parameters of the dense layers are refined through the MP inverse-based approach.

Despite its advantages, the training procedure for DCNN provided in [yang2019recomputation] is not as widespread as it could be. The crucial reasons may be as follows. On the one hand, the retraining process adds to the computational workload of each epoch. In particular, for a large dataset such as ImageNet [deng2009imagenet], researchers can only refine the dense-layer parameters on the CPU instead of with GPU acceleration, because the calculation of the MP inverse occupies huge computational resources. As the training cost in the CPU environment increases dramatically, a DCNN trained with the MP inverse [yang2019recomputation] is uneconomical when handling large-scale samples without access to industrial-scale computational resources.

On the other hand, before retraining the dense-layer weights, the process in [yang2019recomputation] still requires the SGD technique to optimize the parameters of all layers. Following some successful techniques [huang2016deep, gastaldi2017shake], we hypothesize in this paper that it is not necessary for the work in [yang2019recomputation] to involve every convolutional layer in the training process at each epoch, because the refinement of the dense-layer parameters can provide more clues. The existing methods [huang2016deep, gastaldi2017shake] accelerate training by adjusting a set of hyperparameters on top of the basic SGD training pipeline. However, they do have drawbacks, such as a slight degradation in performance and a lack of robustness against various environmental conditions, resulting in unstable testing accuracy.

In this paper, we focus on providing a unified training pipeline for DCNNs with better generalization performance but without incurring much additional training burden. We achieve this goal by training the DCNN over several general epochs. Each general epoch employs two simple but straightforward steps that can be implemented in a pure GPU environment: the first is SGD with random learning, and the second is an MP inverse-based batch-by-batch strategy. For the first step, a freeze-learning algorithm that reduces the workload and speeds up the DCNN is provided. We shorten the network by randomly activating a portion of the convolutional layers in each general epoch. The activation rate is preset and progressively decreased: in the first several general epochs it is set to 1 to start the network, and all of the parameters are updated; it is then gradually decreased and finally comes to 0, which means that all of the convolutional layers are "frozen" without updating. Hence, the only parameter that users need to adjust is the activation rate. As for the second step, a batch-by-batch MP inverse retraining strategy that can be processed by the GPU is proposed. Instead of training on all of the loaded data at once, the data is processed sequentially. By doing so, the data volume of each batch is dramatically reduced and does not consume many computational resources. Thus, the proposed training pipeline can be implemented with GPU acceleration.

In extensive experiments, several state-of-the-art deep learning architectures for pattern recognition, such as AlexNet [krizhevsky2012imagenet], VGG-16 [simonyan2014very], Inception-v3 [szegedy2016rethinking], ResNet [he2016deep], and DenseNet [huang2017densely], are utilized to verify the effectiveness of this method. We show across 8 datasets, including 2 large datasets (ImageNet and Place365), that the proposed method almost always improves the generalization performance without increasing the training burden. For instance, on the Food251 and Place365 datasets, fast retraining with ResNet trains 24.8% and 21.0% faster, respectively, than the method in [yang2019recomputation].


2 Related Works

The training procedures of DCNNs have been widely studied. However, most of these prior studies focus on improving one aspect of performance, either training efficiency [hinton2012improving, huang2016deep, brock2017freezeout] or generalization performance [yang2019recomputation, huang2017snapshot]. Few of them address both concerns.

Many successful learning schedules, such as Dropout [hinton2012improving], Stochastic Depth [huang2016deep], and FreezeOut [brock2017freezeout], have already achieved a computational speedup by excluding some convolutional layers from the backward pass, as the early layers of a DCNN only detect simple edge details while taking up most of the time budget. Stochastic Depth [huang2016deep] reduces the training time by removing a set of convolutional layers for each mini-batch, while FreezeOut [brock2017freezeout] reduces computational costs by freezing convolutional layers with cosine annealing [loshchilov2016sgdr].
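As a rough sketch of this freezing idea (not FreezeOut's exact implementation; the function name, base rate, and freeze times below are illustrative assumptions), a layer's learning rate can be cosine-annealed to zero and the layer excluded from the backward pass thereafter:

```python
import math

def freezeout_lr(base_lr, t, t_freeze):
    """Half-cosine annealing of a layer's learning rate over normalized
    training time t; the layer is frozen (lr = 0, no backward pass)
    once t reaches its freeze time t_freeze."""
    if t >= t_freeze:
        return 0.0
    return 0.5 * base_lr * (1 + math.cos(math.pi * t / t_freeze))

# Earlier layers freeze sooner; later layers keep training longer.
lrs = [freezeout_lr(0.1, t=0.5, t_freeze=tf) for tf in (0.4, 0.8, 1.0)]
assert lrs[0] == 0.0                  # earliest layer: already frozen
assert 0 < lrs[1] < lrs[2] < 0.1      # later layers: still annealing
```

Because a frozen layer no longer needs gradients, its backward computation can be skipped entirely, which is where the speedup comes from.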

In the study [yang2019recomputation], a standard SGD with MP inverse pipeline that can boost the testing performance of a DCNN was provided. It is motivated by the fact that the performance boost achievable through network topology optimization is almost approaching its limit, as shown by the minimal improvement in the results of the ILSVRC competition in recent years [he2016deep, huang2017densely]. In other words, while network depth has increased dramatically, testing performance has seen only limited improvement. After each SGD training epoch, the authors adopt the MP inverse to pull back the residual error from the output layer to each fully-connected (FC) layer in order to update the parameters. Thus, approximately optimal parameters of the FC layers can be generated. Formally, if the DCNN contains ReLU layers and the DropOut operation, the updated weights can be obtained via the KKT theorem [Kuhn1951Nonlinear] and the optimization solution of [schmidt1992feedforward]. The parameters of the last ($n$-th) FC layer are updated through [yang2019recomputation]:

$$\mathbf{W}_n^{new} = \mathbf{W}_n + \lambda\,\mathbf{H}_n^{\dagger}\,\mathbf{e}, \qquad \mathbf{H}_n^{\dagger} = \left(\mathbf{H}_n^{T}\mathbf{H}_n + \frac{\mathbf{I}}{C}\right)^{-1}\mathbf{H}_n^{T}, \tag{1}$$

where $\mathbf{H}_n^{\dagger}$ is the MP inverse of $\mathbf{H}_n$, $\lambda$ is the retraining rate, $C$ is the regularization term, $\mathbf{W}_n$ denotes the parameters of the $n$-th FC layer, $\mathbf{W}_n^{new}$ denotes the updated weights, $\mathbf{H}_n$ is the input feature of the $n$-th FC layer, and $\mathbf{e}$ is the output-layer residual error.
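As a small numerical sketch of this refinement (a ridge-regularized LS solve; the shapes, random data, and constants below are made up for illustration, with H the input features, e the residual error, lam the retraining rate, and C the regularization term):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_feat, n_out = 64, 32, 10

H = rng.standard_normal((n_samples, n_feat))   # input feature of the FC layer
W = rng.standard_normal((n_feat, n_out))       # current FC-layer weights
T = rng.standard_normal((n_samples, n_out))    # target outputs
e = T - H @ W                                  # output-layer residual error

lam, C = 1.0, 100.0                            # retraining rate, regularization
# Regularized MP inverse: (H^T H + I/C)^(-1) H^T, via a linear solve
H_pinv = np.linalg.solve(H.T @ H + np.eye(n_feat) / C, H.T)
W_new = W + lam * H_pinv @ e                   # the refinement step

# The refined weights reduce the residual on the training features
assert np.linalg.norm(T - H @ W_new) < np.linalg.norm(T - H @ W)
```

Using `np.linalg.solve` on the regularized normal equations avoids forming an explicit pseudoinverse and keeps the computation in dense matrix products, which is what makes a GPU implementation attractive.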

The earlier ($n{-}1$)-th FC layer can be updated by:

$$\mathbf{W}_{n-1}^{new} = \mathbf{W}_{n-1} + \lambda\,\mathbf{H}_{n-1}^{\dagger}\, g\!\left(d\!\left(\mathbf{e}\,\mathbf{W}_{n}^{\dagger}\right)\right), \tag{2}$$

where $\mathbf{H}_{n-1}^{\dagger}$ and $\mathbf{W}_{n}^{\dagger}$ are the MP inverses of $\mathbf{H}_{n-1}$ and $\mathbf{W}_{n}$, respectively, $d(\cdot)$ is the dropout operation, and $g(\cdot)$ is the ReLU operation. After each SGD training epoch, the weights of each FC layer are updated in this way.

While the method provides strong recognition accuracy on image classification datasets, it can only be employed in a CPU environment instead of with high-speed GPU acceleration, as calculating the parameters via Eq. 1 and Eq. 2 occupies a large amount of computational resources. This work is motivated by [huang2016deep, brock2017freezeout, yang2019recomputation], aiming to craft a fast retraining scheme that improves both training speed and testing performance for all existing DCNN models.

3 DCNN with Fast Retraining Strategy

Fast retraining tunes the DCNN parameters over general epochs to achieve better generalization performance and boost training efficiency. Each general epoch contains two steps: Step 1, convolutional layer random learning with SGD, and Step 2, dense layer retraining with an MP inverse-based batch-by-batch strategy.

3.1 Step 1 - Convolutional Layer Random Learning with Stochastic Gradient Descent

(a) Step 1 - Random learning with SGD. In each epoch, a randomly chosen subset of convolutional layers is activated, while the remaining layers are excluded from the backward pass; the numbers of activated and inactivated layers are determined by a predefined activation rate.
(b) Step 2 - Retraining with the MP inverse-based batch-by-batch strategy. The intermediate variables are obtained by Procedure I and Procedure II; the details of both procedures can be found in Algorithm 1.
Figure 1: The proposed training procedure. The DCNN is trained with general epochs containing two successive steps: Step 1 for random convolutional layer learning, and Step 2 for dense layer retraining.

In this paper, we provide a simple accelerated training algorithm, as depicted in Fig. 1(a), to speed up the training of a DCNN by randomly dropping hidden layers in each epoch with a preset activation rate. As this method contains only one hyperparameter, the activation rate, it is relatively easy for users to tune in practical deployment. Note that the activation rate keeps updating as training progresses. Initially, it is set to 1 in the first several training epochs in order to "warm up" the DCNN network: all of the parameters in the network are tuned and updated in the backward pass. After the "warm-up" stage, the earlier layers are able to extract low-level features that later layers can build into high-level features, and they are reliable enough to represent the raw images. Therefore, the activation rate is decreased to both accelerate network training and avoid over-fitting; in this sense, the inactivated layers are excluded from the backward pass. Suppose that a designed DCNN contains $L$ convolutional layers; in a certain training epoch with activation rate $r$, the total numbers of activated ($L_a$) and inactivated ($L_i$) layers are:

$$L_a = \lceil r \cdot L \rceil, \qquad L_i = L - L_a. \tag{3}$$
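As an illustration, the random learning step can be sketched in Python (the layer count, epoch milestones, and rounding rule are illustrative assumptions; the milestone values follow the transfer-learning setting described in Sec. 4.1):

```python
import random

def activation_rate(epoch, total_epochs):
    """Piecewise-constant schedule: 1.0 during warm-up, then 0.8/0.6/0.4
    at 25%/50%/75% of training (the transfer-learning setting)."""
    frac = epoch / total_epochs
    if frac < 0.25:
        return 1.0
    if frac < 0.50:
        return 0.8
    if frac < 0.75:
        return 0.6
    return 0.4

def split_layers(layers, rate, rng):
    """Randomly pick the activated layers for this epoch; the rest are
    frozen, i.e., excluded from the backward pass."""
    n_active = round(rate * len(layers))
    active = set(rng.sample(layers, n_active))
    frozen = [l for l in layers if l not in active]
    return sorted(active), frozen

layers = list(range(16))                          # e.g., a 16-layer backbone
rng = random.Random(0)
rate = activation_rate(epoch=6, total_epochs=8)   # 75% into training -> 0.4
active, frozen = split_layers(layers, rate, rng)
assert len(active) + len(frozen) == len(layers)
```

In a real training loop, the frozen layers would simply have their gradient computation disabled for the epoch, so only the activated subset participates in the backward pass.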
3.2 Step 2 - Dense Layer Retraining with MP inverse-based Batch-by-batch Strategy

In order to implement the retraining schedule on efficient GPUs, the feature $\mathbf{H}$ and error $\mathbf{e}$ in Eq. 1 are processed chunk-by-chunk in $k$ pieces, i.e., $\mathbf{H} = [\mathbf{H}_1^{T}, \mathbf{H}_2^{T}, \ldots, \mathbf{H}_k^{T}]^{T}$ and $\mathbf{e} = [\mathbf{e}_1^{T}, \mathbf{e}_2^{T}, \ldots, \mathbf{e}_k^{T}]^{T}$. First, the initial data $\mathbf{H}_1$ and $\mathbf{e}_1$ are given, and the weight increment $\Delta\mathbf{W}_1$ is calculated via the one-batch learning strategy (1). Then, the weights are updated with $\mathbf{H}_2, \ldots, \mathbf{H}_k$ and $\mathbf{e}_2, \ldots, \mathbf{e}_k$ in an iterative way.

Suppose we have $\mathbf{A}_k$ and $\mathbf{B}_k$, which are defined as in equation (4):

$$\mathbf{A}_k = \frac{\mathbf{I}}{C} + \sum_{i=1}^{k}\mathbf{H}_i^{T}\mathbf{H}_i, \qquad \mathbf{B}_k = \sum_{i=1}^{k}\mathbf{H}_i^{T}\mathbf{e}_i. \tag{4}$$

From (1), with $k$ batches of feature, the updated weights of one dense layer are considered as:

$$\mathbf{W}^{new} = \mathbf{W} + \lambda\,\Delta\mathbf{W}_k, \tag{5}$$

where $\Delta\mathbf{W}_k = \mathbf{A}_k^{-1}\mathbf{B}_k$. According to (4), the following recursions can be drawn:

$$\mathbf{A}_k = \mathbf{A}_{k-1} + \mathbf{H}_k^{T}\mathbf{H}_k, \qquad \mathbf{B}_k = \mathbf{B}_{k-1} + \mathbf{H}_k^{T}\mathbf{e}_k, \tag{6}$$

where $\mathbf{A}_0 = \mathbf{I}/C$ and $\mathbf{B}_0 = \mathbf{0}$. Based on the Sherman-Morrison-Woodbury (SMW) formula [golub2012matrix], the inverse of $\mathbf{A}_k$ can be attained:

$$\mathbf{A}_k^{-1} = \mathbf{A}_{k-1}^{-1} - \mathbf{A}_{k-1}^{-1}\mathbf{H}_k^{T}\left(\mathbf{I} + \mathbf{H}_k\mathbf{A}_{k-1}^{-1}\mathbf{H}_k^{T}\right)^{-1}\mathbf{H}_k\mathbf{A}_{k-1}^{-1}. \tag{7}$$

Equation (5) can then be rewritten as

$$\Delta\mathbf{W}_k = \mathbf{A}_k^{-1}\left(\mathbf{B}_{k-1} + \mathbf{H}_k^{T}\mathbf{e}_k\right). \tag{8}$$

Furthermore, for simplicity, we denote $\mathbf{P}_k$ as:

$$\mathbf{P}_k = \mathbf{A}_k^{-1} = \mathbf{P}_{k-1} - \mathbf{P}_{k-1}\mathbf{H}_k^{T}\left(\mathbf{I} + \mathbf{H}_k\mathbf{P}_{k-1}\mathbf{H}_k^{T}\right)^{-1}\mathbf{H}_k\mathbf{P}_{k-1}. \tag{9}$$

Substituting $\mathbf{P}_k$ into equation (8), the weight increment can be simplified to the following recursion:

$$\Delta\mathbf{W}_k = \Delta\mathbf{W}_{k-1} + \mathbf{P}_k\mathbf{H}_k^{T}\left(\mathbf{e}_k - \mathbf{H}_k\,\Delta\mathbf{W}_{k-1}\right). \tag{10}$$

In the case of new training data $(\mathbf{H}_{k+1}, \mathbf{e}_{k+1})$ being available, the updated weight can be written as:

$$\Delta\mathbf{W}_{k+1} = \Delta\mathbf{W}_{k} + \mathbf{P}_{k+1}\mathbf{H}_{k+1}^{T}\left(\mathbf{e}_{k+1} - \mathbf{H}_{k+1}\,\Delta\mathbf{W}_{k}\right). \tag{11}$$

Above all, the parameters in the last ($n$-th) FC layer can be updated with the batch-by-batch strategy as (12):

$$\mathbf{W}_n^{new} = \mathbf{W}_n + \lambda\,\Delta\mathbf{W}_k. \tag{12}$$

The parameters in the earlier ($n{-}1$)-th FC layer with $k$ batches of data can be updated via (13):

$$\mathbf{W}_{n-1}^{new} = \mathbf{W}_{n-1} + \lambda\,\Delta\mathbf{W}_k^{(n-1)}, \tag{13}$$

where $\Delta\mathbf{W}_k^{(n-1)}$ is obtained by the same batch-by-batch recursion applied to $\mathbf{H}_{n-1}$ and the pulled-back error $d\!\left(\mathbf{e}\,\mathbf{W}_n^{\dagger}\right)$, and $d(\cdot)$ is the dropout operation.
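The batch-by-batch derivation above can be checked numerically. The following NumPy sketch (shapes, batch count, and constants are illustrative) verifies that processing the feature and error chunk-by-chunk with the SMW recursion reproduces the one-batch regularized LS solution:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, k = 120, 20, 5, 4          # samples, features, outputs, batches
H = rng.standard_normal((n, d))     # dense-layer input features
e = rng.standard_normal((n, m))     # output-layer residual error
C = 100.0                           # regularization term

# One-batch solution: (H^T H + I/C)^(-1) H^T e
dW_one = np.linalg.solve(H.T @ H + np.eye(d) / C, H.T @ e)

# Batch-by-batch via the SMW recursion: P_0 = (I/C)^(-1), dW_0 = 0
P = C * np.eye(d)
dW = np.zeros((d, m))
for Hk, ek in zip(np.split(H, k), np.split(e, k)):
    S = np.linalg.inv(np.eye(Hk.shape[0]) + Hk @ P @ Hk.T)
    P = P - P @ Hk.T @ S @ Hk @ P        # SMW update of the inverse
    dW = dW + P @ Hk.T @ (ek - Hk @ dW)  # recursive weight increment

assert np.allclose(dW, dW_one)
```

Because each iteration only touches one chunk of the data, the per-step memory footprint is bounded by the chunk size rather than the full feature matrix, which is what allows the retraining to fit on a GPU.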

The proposed training procedure for DCNN is presented as Algorithm 1. In each general epoch, the training process is divided into two successive steps: Step 1, convolutional layer random learning with SGD (Lines 2-5), and Step 2, dense layer retraining with the MP inverse-based batch-by-batch strategy (Lines 6-16).

4 Experiments

In this paper, we apply 8 datasets and 4 state-of-the-art DCNNs to demonstrate the efficiency of the fast retraining strategy. The experiments in this section were conducted on a workstation with 256 GB of memory and an E5-2650 processor, and all of the DCNNs were trained on an NVIDIA 1080Ti GPU.

4.1 Dataset and Experimental Settings

I. Datasets. The details of the datasets are shown in Table 1. For Caltech101 [fei2006one] and Caltech256 [griffin2007caltech], we randomly selected 30 images per category to form the training set, using the rest for testing. For CIFAR100 [krizhevsky2009learning], 50,000 images were used for training and 10,000 for testing. The Food251 dataset [kaur2019foodx] is a food classification dataset created in 2019; its training set (118,475 images) and validation set (11,994 images) were used for training and testing, respectively. Besides these datasets, the commonly used large-scale datasets Place365 [zhou2017places] and ImageNet [deng2009imagenet] were used to evaluate the proposed work. For a comprehensive comparison, we randomly selected 200 and 500 images per category to create ImageNet-1 and Place365-1, respectively, while the validation sets were used for testing.

II. Architectures. We evaluated fast retraining with several state-of-the-art DCNNs, such as AlexNet [krizhevsky2012imagenet], VGG [simonyan2014very], Inception-V3 [szegedy2016rethinking], ResNet [he2016deep], and DenseNet [huang2017densely]. For VGG, Inception-V3, and ResNet, we utilized the 16-layer, 48-layer, and 50-layer variants, respectively. For DenseNet, we evaluated the fast retraining scheme on two structures: the 121-layer DenseNet and the 201-layer DenseNet.

III. Setting.

In this paper, we compared the proposed method with the original strategy under two different conditions, i.e., transfer learning and training from scratch. The experimental settings were as follows. For transfer learning, the initial learning rate was divided by 10 every 3 training epochs. The initial activation rate was 1, and it was set to 0.8, 0.6, and 0.4 at 25%, 50%, and 75% of the total number of training epochs. Other settings, including the total number of training epochs, the regularization term in retraining, and the mini-batch size, are described in Table 1. As for DCNN training from scratch, we trained the model for 90 epochs. The learning rate was set to 0.1 and was lowered by a factor of 10 at epochs 30 and 60. The activation rate was first set to 1 and was decreased to 0.9 and 0.6 at 50% and 75% of the total number of training epochs.

| Dataset | # classes | # training samples | # testing samples | batch size (SGD) | batch size (MP inverse) | # max epoch (a) | (b) |
|---|---|---|---|---|---|---|---|
| Caltech101 | 102 | 3,060 | 6,084 | 32 | 10 | 8 / 90 | 6 / 4 / 4 / 2 |
| Caltech256 | 257 | 7,710 | 22,898 | 32 | 10 | 8 / 90 | 4 / 4 / 4 / 2 |
| CIFAR100 | 100 | 50,000 | 10,000 | 32 | 10 | 8 / 90 | 4 / 2 / 2 / 2 |
| Food251 | 251 | 118,475 | 11,994 | 32 | 10 | 8 / 90 | 4 / 2 / 2 / 2 |
| Place365-1 | 365 | 182,500 | 36,500 | 32 | 10 | 12 / 90 | 2 / 2 / 2 / 1 |
| Place365 | 365 | 1,803,460 | 36,500 | 32 | 10 | 12 / 90 | 2 / 2 / 2 / 1 |
| ImageNet-1 | 1,000 | 200,000 | 50,000 | 32 | 10 | NA / 90 | 2 / 2 / 2 / 1 |
| ImageNet | 1,000 | 14,197,122 | 50,000 | 32 | 10 | NA / 90 | 2 / 2 / 2 / 1 |

(a) Configurations for transfer learning / training from scratch. (b) Configurations for VGG / Inception-V3 / ResNet / DenseNet.

Table 1: Summary of the Datasets

4.2 Step-by-step Quantitative Analysis

Figure 2: The effectiveness analysis of the MP inverse-based batch-by-batch strategy and the convolutional layer random learning schedule. (a) The peak memory usage of one-batch and batch-by-batch retraining on the Place365 dataset. (b) Training time comparison of SGD [bottou2010large], FreezeOut [brock2017freezeout], and random learning on Place365-1. (c) The Top-1 testing accuracy comparison on the Place365-1 dataset.
| Dataset | VGG-16 (SGD / R. / FR.) | ResNet-50 (SGD / R. / FR.) | Inception-v3 (SGD / R. / FR.) | DenseNet-201 (SGD / R. / FR.) |
|---|---|---|---|---|
| Caltech101 | 90.4 / 90.1 / 90.1 | 90.8 / 91.3 / 91.5 | 89.7 / 90.6 / 90.5 | 92.5 / 91.7 / 91.4 |
| Caltech256 | 69.3 / 73.0 / 73.7 | 77.1 / 78.2 / 78.3 | 77.9 / 78.4 / 78.6 | 79.7 / 81.0 / 81.2 |
| CIFAR100 | 77.4 / 79.5 / 79.2 | 84.3 / 83.5 / 83.7 | 83.8 / 84.2 / 84.3 | 86.1 / 84.4 / 85.6 |
| Food251 | 52.7 / 56.9 / 57.4 | 59.4 / 61.8 / 61.8 | 59.3 / 60.8 / 61.8 | 61.6 / 62.4 / 62.8 |
| Place365-1 | 42.9 / 44.3 / 44.5 | 47.4 / 48.4 / 48.6 | 47.5 / 48.1 / 48.1 | 47.7 / 47.7 / 47.9 |
| Place365 | 50.9 / 51.6 / 51.9 | 52.7 / 53.3 / 53.6 | 53.8 / 53.9 / 54.1 | 54.7 / 55.4 / 55.4 |
| Average | 63.9 / 65.9 / 66.2 | 68.6 / 69.4 / 69.6 | 68.7 / 69.4 / 69.6 | 70.4 / 70.4 / 70.7 |

Table 2: Top-1 testing accuracy comparison (SGD [bottou2010large] - standard SGD training; R. - DCNN with the learning strategy in [yang2019recomputation]; FR. - DCNN with the proposed fast retraining strategy).

In this paper, all of the experiments are evaluated with Top-1 testing accuracy, and the reported results are the mean of at least three runs. The best results are shown in boldface.

I. The Effectiveness Analysis of the batch-by-batch strategy. To verify the effectiveness of applying an MP inverse-based batch-by-batch scheme in dense layer retraining, we conducted a sanity check comparing this strategy with the one-batch schedule [yang2019recomputation], as shown in Fig. 2(a). In particular, the peak memory usage (PMU) during training was used to empirically evaluate the different training modes. The investigation reveals that the batch-by-batch strategy significantly reduces the memory use of retraining DCNNs. Hence, we summarize the first conclusion as follows: the provided batch-by-batch method reduces the computational burden and can be accelerated in a GPU environment, which overcomes the main drawback of the work [yang2019recomputation].

II. The Effectiveness Analysis of random learning. To validate the effectiveness of random layer learning, experiments were conducted on the Place365-1 dataset. Note that, in this part, the retraining strategy was excluded from the training epoch (so that only random learning remained), and the DCNNs were fine-tuned from ImageNet pre-trained networks for 8 epochs. The experiments were conducted under three different configurations: DCNN trained with i) the original SGD baseline, ii) the FreezeOut [brock2017freezeout] learning scheme, and iii) the proposed random learning strategy. Figs. 2(b) and 2(c) compare the results. Through this analysis, we reach the second conclusion: the DCNN with the proposed random learning trains faster than the DCNN with traditional SGD, and random learning has a positive impact on generalization performance.

III. Comparison of transfer learning. Taking the outcomes of parts I and II as the foundation, more experiments were carried out to compare the proposed learning procedure with the retraining algorithm [yang2019recomputation]. All of the results are tabulated in Table 2. Unlike most recent works that boost testing performance with novel network topologies, the proposed method does not contain any network modification, yet it still yields a slight improvement (0.1% to 1.0%) in testing accuracy over the state-of-the-art MP inverse-based learning scheme [yang2019recomputation]. While a 0.1% to 1% Top-1 accuracy boost may seem marginal, such improvements are not easy to obtain at the current stage, as DCNN optimization is approaching its limit. For example, VGG-16 and ResNet were among the top ILSVRC entries of 2014 and 2015, respectively, yet ResNet provides only a 1.2% boost over VGG-16 on the CIFAR100 set and is 1% lower on the SUN397 set.

Furthermore, the total training times of the fast retraining method, the recomputation method [yang2019recomputation], and the original SGD method are tabulated in Table 3. Note that all of these experiments were conducted with 8 training epochs. Figure 3 plots the generalization performance on these datasets as the number of training epochs increases. Two observations follow: the proposed strategy presents a speedup of up to 25% compared to the existing retraining strategy [yang2019recomputation], and it needs only 3 to 4 epochs to reach optimal results, whereas the original DCNN needs at least 6 epochs. Through Table 2, Table 3, and Fig. 3, the last conclusion that can be drawn is: the fast retraining scheme improves the generalization performance of a DCNN while reducing the learning time by 15% to 25% compared to the existing MP inverse-based learning paradigm [yang2019recomputation].

Figure 3: Top-1 testing accuracy of InceptionNet on the CIFAR100 and Place365-1 datasets.

IV. Comparison results of training from scratch. To test the fast retraining method more extensively, we ran another set of experiments under the condition of training from scratch. Table 4 shows the comparison results with InceptionNet and DenseNet on the ImageNet-1 and ImageNet datasets. Through Table 4, we find that Inception-v3 and DenseNet-121 with fast retraining achieve 1.9% and 1.1% improvements over those with the traditional SGD scheme, and 0.6% and 0.3% boosts over the training pipeline in [yang2019recomputation]. Thus, the effectiveness of the proposed fast retraining is verified.

| Dataset | DCNN | SGD [bottou2010large] | Retraining [yang2019recomputation] | Fast Retraining | Imp. - SGD (%) | Imp. - R. (%) |
|---|---|---|---|---|---|---|
| CIFAR100 | Inception | 282 | 308 | 262 | 8.7 | 15.0 |
| CIFAR100 | ResNet | 161 | 176 | 139 | 16.3 | 21.0 |
| Place365-1 | Inception | 860 | 968 | 772 | 12.1 | 20.2 |
| Place365-1 | ResNet | 481 | 589 | 445 | 10.8 | 24.8 |

Table 3: Comparison of total training time (in minutes) with SGD, the retraining strategy, and the proposed fast retraining on the CIFAR100 and Place365-1 datasets (Imp. - SGD (%): improvement over the SGD optimizer; Imp. - R. (%): improvement over the retraining schedule).
| Method | Dataset | Accuracy (%) |
|---|---|---|
| Inception-v3 with SGD [bottou2010large] | ImageNet-1 | 42.2 |
| DenseNet-121 with SGD [bottou2010large] | ImageNet | 69.1 |
| Inception-v3 with retraining scheme [yang2019recomputation] | ImageNet-1 | 43.5 |
| DenseNet-121 with retraining scheme [yang2019recomputation] | ImageNet | 69.9 |
| Inception-v3 with fast retraining scheme | ImageNet-1 | 44.1 |
| DenseNet-121 with fast retraining scheme | ImageNet | 70.2 |

Table 4: Top-1 testing accuracy comparison under the condition of training from scratch.

5 Conclusion

In this paper, a unified fast retraining procedure for DCNNs is proposed. Compared to the state-of-the-art DCNN training strategy [yang2019recomputation], this method achieves better testing performance without occupying much computational resources. In particular, it provides a random learning schedule to speed up convolutional layer learning and a batch-by-batch Moore-Penrose inverse-based retraining strategy to optimize the parameters of the dense layers. This scheme can be applied to all DCNNs, and the batch-by-batch solution of the Moore-Penrose inverse allows the proposed training pipeline to be accelerated in a pure GPU environment. The experimental results on benchmark datasets prove the effectiveness and efficiency of the proposed fast retraining algorithm.