The record-breaking predictive performance recently achieved by deep neural networks (DNNs) motivates a rapidly growing demand to bring DNN-powered intelligence into numerous applications. However, the excellent performance of modern DNNs comes at an often prohibitive training cost, due to the vast volume of training data and model parameters required. As an illustrative example of the computational complexity of DNN training, one forward pass of the ResNet50 (he2016deep) model requires 4 GFLOPs (FLOPs: floating-point operations) of computation, and training takes 14 days on one state-of-the-art NVIDIA M40 GPU (you2018imagenet). As a result, training a state-of-the-art DNN model often demands considerable energy, along with the associated financial and environmental costs. For example, a recent report shows that training a single DNN can cost over $10K and emit as much carbon as five cars in their lifetimes (Strubell2019Energy), limiting the rapid development of DNN innovations and raising various environmental concerns.
Recent efforts to improve DNN efficiency mostly focus on compressing models and accelerating inference. An empirically adopted practice is the so-called progressive pruning and training routine, i.e., training a large model fully, pruning it, and then retraining the pruned model to restore the performance (the process can be iterated for several rounds). While this has been a standard practice for model compression (han2015deep), some recent efforts have started to empirically link it to the potential of more efficient training. Notably, a recent series of works (frankle2019lottery; liu2018rethinking) reveals that dense, randomly-initialized networks contain small subnetworks which can match the test accuracy of the original networks when trained in isolation. These subnetworks are called winning tickets. Despite these insightful findings, there remains a major gap between the winning ticket observation and the goal of more efficient training, since winning tickets were only identified by pruning unimportant connections in a fully trained dense network.
This paper closes this gap by demonstrating the Early-Bird (EB) tickets phenomenon: the winning tickets can be drawn very early in training, and with aggressively low-cost training algorithms. Through a range of experiments on different DNNs and datasets, we observe the consistent existence of EB tickets and the cheap costs needed to reliably draw them, and we develop a novel mask distance metric to detect their emergence. After being identified, re-training those EB tickets (using standard training) leads to comparable or even superior final accuracies, compared to either standard training or re-training the “ground-truth” winning tickets drawn after full training as in (frankle2019lottery). Our observations seem to coincide with the recent findings by (achille2018critical; li2019towards) about the two-stage optimization trajectory in training. Taking advantage of EB tickets, we propose an efficient DNN training scheme termed EB training. To the best of our knowledge, this is the first step taken towards exploiting winning tickets for a realistic efficient training goal.
Our contributions can be summarized as follows:
We discover the Early-Bird (EB) tickets, and show they 1) consistently exist across models and datasets; 2) can emerge very early in training; and 3) stay robust under (and sometimes even favor) various aggressive and low-cost training schemes (in addition to early stopping), including large learning rates and low-precision training.
We propose a practical, easy-to-compute mask distance as an indicator of when to draw EB tickets without accessing the “ground-truth” winning tickets (drawn after full training), resolving a major paradox in connecting winning tickets with the efficient training goal.
We design a novel efficient training framework based on EB tickets (EB training). Experiments on state-of-the-art benchmarks and models show that EB training can achieve up to 4.7× energy savings, while maintaining the same or even better accuracy, compared to training with the original winning tickets.
2 Related work
Winning Ticket Hypothesis.
The lottery ticket hypothesis (frankle2019lottery) first points out that a small subnetwork, called the winning ticket, can be identified by pruning a fully trained dense network; when it is trained in isolation with the same weight initialization once assigned to the corresponding weights in the dense network, it can restore a test accuracy comparable to that of the dense network. However, finding winning tickets hinged on costly (iterative) pruning and retraining. (morcos2019ticket) studies the reuse of winning tickets, which are transferable across different datasets. (dtl) discovers the existence of supermasks that can be applied to an untrained, randomly-initialized network. (liu2018rethinking) argues that the weight initialization might make less difference when trained with a large learning rate, while the searched connectivity is more of the winning ticket’s core value. It also explores the usage of both unstructured and (more hardware-friendly) structured pruning and shows that both lead to the emergence of winning tickets.
Another related work (lee2018snip) prunes a network in a single shot with one mini-batch, in which the irrelevant connections are identified by a connection sensitivity criterion. Compared to (frankle2019lottery), the authors showed their method to be more efficient in finding a good subnetwork (not the winning ticket), although its re-training accuracy/efficiency was found to be inferior to training the “ground-truth” winning ticket.
Other Relevant Observations in Training.
(pmlr-v97-rahaman19a; xu2019frequency) argue that deep networks first learn low-complexity (lower-frequency) functional components before absorbing high-frequency features, the former being more robust to perturbations. An important hint can be found in (achille2018critical): the early stage of training seems to first discover the important connections and the connectivity patterns between layers, which become relatively fixed in the later training stage. That seems to imply that the critical sub-network (connectivity) can be identified independent of, and seemingly also ahead of, the (final best) weights. Finally, (li2019towards) demonstrates that training a deep network with a large initial learning rate helps the model focus on memorizing easier-to-fit, more generalizable patterns faster and better, which directly inspired us to try drawing EB tickets using large learning rates.
Efficient Inference and Training.
Model compression has been extensively studied for lighter-weight inference. Popular means include pruning (li2017pruning; liu2017learning; he2018soft; wen2016learning; luo2017thinet), weight factorization (NIPS2014_5544), weight sharing (wu2018deep), quantization (hubara2017quantized), and more recently network architecture search (nas). On the other hand, the literature on efficient training appears to be much sparser. A handful of works (goyal; Cho2017PowerAID; you2018imagenet; Akiba2017ExtremelyLM; jia2018highly; gupta2015deep) focus on reducing the total training time in parallel, communication-efficient distributed settings. In contrast, our goal is to shrink the total resource cost for in-situ, resource-constrained training. (banner2018scalable; wang2018training) presented low-precision training, which is aligned with our goal and can be incorporated into EB training (see later).
3 Drawing Early-bird Tickets: Hypothesis and Experiments
We hypothesize that winning tickets can emerge at a very early training stage; we term such a ticket an early-bird (EB) ticket. Consider a dense, randomly-initialized network $f(x;\theta)$ that reaches a minimum validation loss at the $i$-th iteration with a test accuracy $f_{acc}$, when optimized with stochastic gradient descent (SGD) on a training set. In addition, consider subnetworks $f(x; m \odot \theta)$, where a mask $m \in \{0,1\}^{|\theta|}$ indicates the pruned and unpruned connections in $f(x;\theta)$. When optimized with SGD on the same training set, $f(x; m \odot \theta)$ reaches a minimum validation loss at the $i'$-th iteration with a test accuracy $f'_{acc}$. The EB tickets hypothesis states that there exists $m$ such that $f'_{acc} \approx f_{acc}$ (or even $f'_{acc} \geq f_{acc}$), i.e., same or better generalization, with $i' \ll i$ (e.g., early stopping) and sparse $m$ (i.e., much reduced parameters).
Section 3 addresses three key questions pertaining to the EB ticket hypothesis. We first show, via an extensive set of experiments, that EB tickets can be observed across popular models and datasets (Section 3.1). We then try to be more aggressive, to see if high-quality EB tickets still emerge under “cheaper” training (Section 3.2). We finally reveal that EB tickets can be identified by comparing a novel mask distance between consecutive epochs, so that no full training is needed (Section 3.3).
3.1 Do Early-bird Tickets Always Exist?
We perform ablation simulations using two representative deep models: VGG16 (simonyan2014very) and the pre-activation residual network PreResNet101 (he2016identity), on two popular datasets: CIFAR-10 and CIFAR-100. For drawing the tickets (submodels), we adopt a standard training protocol for both CIFAR-10 and CIFAR-100: the training takes 160 epochs in total with a batch size of 256; the initial learning rate is set to 0.1, and is divided by 10 at the 80th and 120th epochs respectively; the SGD solver is adopted with a momentum of 0.9 and weight decay. For retraining the tickets, we keep the same setting by default.
We follow the main idea of (frankle2019lottery), but instead prune networks trained at earlier points (before the accuracies reach their final top values), to see if reliable tickets can still be drawn. We adopt the same channel pruning as in (liu2017learning) for all experiments, since it is hardware friendly and aligns with our end goal of efficient training (wen2016learning). Fig. 1 reports the accuracies achieved by re-training the tickets drawn from different early epochs. All results consistently show that there exist high-quality tickets, as early as epoch 20 (out of 160 epochs in total), that can achieve strong re-training accuracies. Comparing among different pruning ratios p, it is not too surprising to see that over-pruning (e.g., p = 70%) makes drawing good tickets harder, indicating a balance that we need to calibrate between accuracy and training efficiency.
Two more striking observations from Fig. 1: 1) there consistently exist EB tickets drawn at certain early epoch ranges that outperform those drawn in later stages, including the “ground-truth” winning tickets drawn at epoch 160. This intriguing phenomenon implies a possible “over-cooking” when networks try to identify connectivity patterns first (achille2018critical), which might hamper generalization; 2) some EB tickets are able to outperform even their unpruned, fully-trained models (e.g., the dashed lines), potentially thanks to the sparse regularization learned by EB tickets.
[Table 1. Columns: LR schedule; retrain acc. (CIFAR-100); retrain acc. (CIFAR-10).]
3.2 Do Early-bird Tickets Still Emerge under Lower-Cost Training?
For EB tickets, only the important connections they find (the connectivity) matter, while the weights are to be re-trained anyway. One might hence make a bold guess: more aggressive and “cheaper” training methods might be applicable to further shrink the cost of finding EB tickets (which already implies early stopping of training), as long as the same significant connections still emerge. We experimentally investigate the impacts of two schemes, using large learning rates and training in lower precision, and observe that EB tickets survive well under both.
Large learning rates favor the emergence of EB tickets. We first vary the learning rate schedule in the above experiments. The original schedule starts at 0.1, decays to 0.01 at epoch 80, and further decays to 0.001 at epoch 120. In comparison, we test a new, smaller learning rate schedule: starting at 0.01 (epoch 0) and decaying to 0.001 at epoch 100. After drawing tickets, we re-train them all using the same learning rate schedule, for sufficient training and fair comparison. We see from Table 1 that high-quality EB tickets always emerge earlier when searching with the larger schedule, and their final accuracies are also better. Note that the earlier emergence of good EB tickets contributes to lowering the training costs. It has already been observed that larger learning rates are beneficial for training the dense model fully to draw the winning tickets (frankle2019lottery; liu2018rethinking): we show that this benefit extends to EB tickets too.
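To make the two schedules concrete, the following is a minimal Python sketch of a piecewise-constant learning rate schedule. The names `step_lr`, `LARGE_LR`, and `SMALL_LR` are illustrative and not taken from the paper. The sketch assumes that epochs are counted from 0 and that a decay "at epoch 80" takes effect starting with epoch 80.

```python
from bisect import bisect_right

# Each schedule is (start_epochs, rates): rates[k] applies from start_epochs[k] onward.
# Assumption: epochs are 0-indexed and a decay "at epoch E" applies from epoch E.
LARGE_LR = ([0, 80, 120], [0.1, 0.01, 0.001])   # original schedule
SMALL_LR = ([0, 100], [0.01, 0.001])            # smaller schedule used for comparison


def step_lr(epoch, start_epochs, rates):
    """Return the learning rate in effect at `epoch` for a step schedule."""
    if epoch < start_epochs[0]:
        raise ValueError("epoch precedes the start of the schedule")
    # bisect_right counts how many start epochs are <= epoch.
    return rates[bisect_right(start_epochs, epoch) - 1]


if __name__ == "__main__":
    for epoch in (0, 79, 80, 119, 120, 159):
        print(epoch, step_lr(epoch, *LARGE_LR), step_lr(epoch, *SMALL_LR))
```

Keeping the schedule as plain data makes it easy to search for tickets under one schedule and retrain under another, as the experiment above does.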
Low-precision training does not destroy EB tickets. We next examine the emergence of EB tickets within a state-of-the-art low-precision training algorithm (wu2018training), where model weights, activations, gradients, and errors are all quantized to 8 bits throughout training. Note that we only apply low-precision training to the stage of identifying EB tickets; afterwards, the tickets are trained in the same full precision as in Section 3.1. Fig. 2 shows the retraining accuracy and total number of FLOPs (EB ticket search (8 bits) + retraining (32-bit floating point)) for VGG16 on CIFAR-10/100. We can see that EB tickets still emerge when using aggressively low precision to identify them: the general trends resemble the full-precision training cases in Fig. 1, except that good EB tickets seem to emerge even earlier, down to the initial epochs. This leads to cost savings in finding EB tickets, since low-precision updates can aggressively save energy per epoch compared to full precision, as shown in Fig. 5.
3.3 How to Identify Early-Bird Tickets Practically?
Distance between Ticket Masks. Each time we prune, we define a binary mask of the drawn ticket (pruned subnetwork) w.r.t. the full dense network. We follow (liu2017learning) and treat the scaling factors in batch normalization (BN) layers as indicators of the corresponding channels’ significance. Given a target pruning ratio p, we then prune the p percent of channels with the smallest scaling factors. Denoting the pruned channels as 0 and the kept ones as 1, the original network can be mapped into a binary “ticket mask”. For any two sub-networks pruned from the same dense model, we calculate their mask distance as the Hamming distance between their two ticket masks.
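The ticket mask and mask distance described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the BN scaling factors of all layers are assumed to be flattened into one list, channels are ranked by the absolute value of their scaling factor, and the Hamming distance is divided by the mask length so that it falls in [0, 1].

```python
def ticket_mask(scales, prune_ratio):
    """Binary channel mask: 0 for the `prune_ratio` fraction of channels with
    the smallest |scaling factor|, 1 for the channels that are kept.

    Assumption: `scales` is a flat list of BN scaling factors for the whole net.
    """
    n_prune = int(len(scales) * prune_ratio)
    # Channel indices ordered from least to most significant.
    order = sorted(range(len(scales)), key=lambda i: abs(scales[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(scales))]


def mask_distance(mask_a, mask_b):
    """Hamming distance between two ticket masks, normalized to [0, 1]."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must come from the same dense model")
    return sum(a != b for a, b in zip(mask_a, mask_b)) / len(mask_a)


if __name__ == "__main__":
    # Toy scaling factors for 5 channels at two different epochs.
    m1 = ticket_mask([0.9, 0.1, 0.5, 0.05, 0.7], prune_ratio=0.4)
    m2 = ticket_mask([0.9, 0.6, 0.08, 0.05, 0.7], prune_ratio=0.4)
    print(m1, m2, mask_distance(m1, m2))  # 2 of 5 channels differ -> 0.4
```

Because the mask only records which channels survive, the distance is insensitive to how the surviving weights themselves change between epochs.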
Detecting EB Tickets via Mask Distances. We first visualize and observe the global behaviors of mask distances between consecutive epochs. Fig. 3 plots the pairwise mask distance matrices (160 × 160) of the VGG16 and PreResNet101 experiments on CIFAR-100 (from Section 3.1), at different pruning ratios p, where the (i, j)-th element in a matrix denotes the mask distance between epochs i and j in the corresponding experiment (160 epochs in total for all). For ease of visualization, all elements in a matrix are first linearly normalized between 0 and 1; we then subtract the normalized mask distance matrix from an all-ones matrix. Therefore, in the resulting matrix, a higher value (closer to 1) indicates a smaller mask distance and is colored warmer (same hereinafter).
Fig. 3 demonstrates fairly consistent behaviors. Taking VGG16 on CIFAR-100 (p = 0.2) as an example: 1) in the beginning, the mask distances change rapidly between epochs, manifested by the quickly “cooling” colors (from yellow to green) when moving from the diagonal (highest, since that compares an epoch’s mask with itself) to the off-diagonal elements (which compare different epochs; the farther an element is from the diagonal, the more distant the two epochs are from each other); 2) after 10 epochs, the off-diagonal elements become “yellow” too, and the color transition from diagonal to off-diagonal becomes much smoother, indicating that the masks change only very mildly after this point; 3) after 80 epochs, the mask distances see almost no change across epochs. Similar trends are observed in the other plots too. This seems to concur with the hypothesis in (achille2018critical) that a network first learns important connectivity patterns and then fixes them to tune the weights further.
Our observation that the ticket masks quickly become stable and hardly change in early training stages supports drawing EB tickets. We therefore measure the mask distance between two consecutive epochs, and draw EB tickets when this distance is smaller than a threshold. In practice, to improve the reliability of EB tickets, we stop and draw EB tickets only when the last five recorded mask distances are all smaller than the threshold, to avoid irregular fluctuations in early training stages. In Fig. 3, the red lines indicate where EB tickets are identified under this criterion.
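The normalization used for this visualization can be sketched as follows. The function name `similarity_matrix` is hypothetical, and min-max scaling is assumed as the meaning of "linearly normalized". Since the diagonal of a distance matrix is 0, this coincides with dividing by the maximum distance.

```python
def similarity_matrix(dist):
    """Map a pairwise mask-distance matrix to values in [0, 1], where 1 means
    identical masks (warmest color) and 0 means the largest observed distance.

    Assumption: "linearly normalized" is taken to mean min-max scaling.
    """
    flat = [v for row in dist for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # All masks identical: every entry is maximally similar.
        return [[1.0 for _ in row] for row in dist]
    # Normalize to [0, 1], then subtract from an all-ones matrix.
    return [[1.0 - (v - lo) / span for v in row] for row in dist]


if __name__ == "__main__":
    # Toy 3-epoch example with raw Hamming distances.
    d = [[0, 2, 4],
         [2, 0, 1],
         [4, 1, 0]]
    for row in similarity_matrix(d):
        print(row)
    # [1.0, 0.5, 0.0]
    # [0.5, 1.0, 0.75]
    # [0.0, 0.75, 1.0]
```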
4 Efficient Training via Early Bird Tickets
4.1 Why is EB Training More Efficient?
EB Training vs. Progressive Pruning and Training. Fig. 4 illustrates an overview of our proposed EB training and the progressive pruning and training scheme, e.g., as in (frankle2019lottery). In particular, the progressive pruning and training scheme adopts a three-step routine of 1) training a large and dense model, 2) pruning it, and 3) retraining the pruned model to restore performance; these three steps can be iterated (han2015deep). The first step often dominates (e.g., 75% when using the PreResNet101 model and the CIFAR-10 dataset) in terms of training energy and time costs, as it involves training a large and dense model. The key feature of our EB training scheme is that it replaces steps 1 and 2 with a lower-cost step of detecting the EB tickets, i.e., it enables early stopping during the time- and energy-dominant step of training the large and dense model, thus promising large savings in both training time and energy. For example, assuming the first step of the progressive pruning and training scheme requires $N$ epochs to sufficiently train the model to the target accuracy, the proposed EB training needs only $N_{EB}$ epochs to identify the EB tickets, using the mask distance to detect their emergence, where $N_{EB} \ll N$ (see the experiments summarized in Fig. 1 and Table 1).
4.2 How to Implement EB Training?
From Fig. 4, we can see that the EB training scheme consists of a training (searching) step to identify EB tickets and a fine-tuning step to retrain the EB tickets to the target accuracy. As the fine-tuning step is the same as that of the progressive pruning and training baseline, we here only elaborate on the first step of searching for EB tickets, as summarized in Algorithm 1. Specifically, the EB searching process 1) first initializes the weights and scaling factors; 2) iterates the structured pruning process as in (liu2017learning) to calculate the mask distances between consecutive resulting subnetworks, storing them in a first-in-first-out (FIFO) queue of maximum length 5; and 3) exits when the maximum mask distance in the FIFO queue is smaller than a specified threshold (default 0.1 with the [0,1] normalized distances). The output is the resulting EB ticket, which will be retrained further to reach the target accuracy.
While low-precision ticket search has already been shown to be feasible and favorable (see Fig. 2), we do NOT include it in the current efficient training experiments. This is to purposely avoid a confounding factor, since other training algorithms can also be implemented in the same low-precision regimes. We do, however, note that EB training can further aggressively shrink its computation/energy costs, without sacrificing accuracy, by adopting the low-precision ticket search described in Section 3.2. We leave this as promising future work. In the following experiments, both EB training and its competitors use 32-bit floating point implementations to ensure fairness.
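The stopping rule of the search step can be sketched in Python as below. This is a simplified illustration, not Algorithm 1 itself: the function name `find_eb_epoch` is hypothetical, the per-epoch ticket masks are passed in as a precomputed sequence rather than produced by training and pruning, and the mask distance is assumed to be the Hamming distance divided by the mask length.

```python
from collections import deque


def find_eb_epoch(masks_per_epoch, threshold=0.1, queue_len=5):
    """Return the first epoch index at which the last `queue_len` consecutive
    mask distances are all below `threshold`, or None if that never happens.

    `masks_per_epoch` yields one binary ticket mask per training epoch
    (assumption: in real training, each mask comes from pruning the current model).
    """
    recent = deque(maxlen=queue_len)  # FIFO queue: oldest distance drops out
    prev = None
    for epoch, mask in enumerate(masks_per_epoch):
        if prev is not None:
            # Normalized Hamming distance between consecutive epochs' masks.
            dist = sum(a != b for a, b in zip(prev, mask)) / len(mask)
            recent.append(dist)
            if len(recent) == queue_len and max(recent) < threshold:
                return epoch  # masks have stabilized: draw the EB ticket here
        prev = mask
    return None


if __name__ == "__main__":
    stable = [1] * 10 + [0] * 10
    flipped = [0] * 10 + [1] * 10
    # Masks fluctuate for the first epochs, then stay fixed.
    history = [stable, flipped] + [stable] * 6
    print(find_eb_epoch(history))  # 7
```

Requiring the whole queue to fall below the threshold, rather than a single distance, is what guards against a one-off quiet epoch during the unstable early phase.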
4.3 How does EB Training Compare to State-of-the-art Techniques?
Experiment Setting. We consider training VGG16 and PreResNet101 models on both CIFAR-10 and CIFAR-100. We measure the training energy costs from real-device operations: the energy cost consists of both computational and data movement costs, and the latter is often dominant but cannot be captured by commonly used metrics such as the number of FLOPs (Eyeriss_JSSC). We therefore evaluate the proposed techniques against the baselines in terms of both accuracy and real-measured energy consumption. Specifically, all energy consumption figures are obtained by training the corresponding models on an embedded GPU (NVIDIA Jetson TX2). The GPU measurement setup can be found in the Appendix. Note that the energy measurement results include the overhead of using the mask distance to detect the emergence of EB tickets, which is found to be negligible compared to the total training cost in terms of memory storage size, FLOPs, and real-measured energy costs (e.g., for the VGG16 model). This is because 1) the mask distance evaluation only involves simple Hamming distance calculations and 2) the distance is calculated only once per epoch.
Results and Analysis. We first compare the computational savings (total number of FLOPs) with the baseline using a progressive pruning and training scheme (liu2017learning), when drawing subnetworks at different epochs. Fig. 5 summarizes the results of the PreResNet101/VGG16 models and CIFAR-10/100 datasets, corresponding to the same set of experiments as in Fig. 1. We see that EB training can reduce FLOPs compared to the common baseline, while leading to comparable or even better accuracy (from -0.02% to +0.99% relative to the baseline).
Table 2 compares the retraining accuracy and consumed FLOPs/energy of EB training with four state-of-the-art progressive pruning and training techniques: the original lottery ticket (LT) training (frankle2019lottery), network slimming (NS) (liu2017learning), ThiNet (luo2017thinet), and SNIP (lee2018snip). While all of them involve the process of training a dense network, pruning it, and retraining, they apply different pruning criteria, e.g., NS imposes L1-sparsity on channel-wise scaling factors from BN layers, and ThiNet greedily prunes the channel that has the smallest effect on the next layer’s activation values. For EB training, we by default follow (frankle2019lottery) and inherit the same original weight initialization when re-training the searched ticket. We also note the existing debate (liu2018rethinking) on initialization re-use, and thus also compare with a variant that re-trains the ticket from a new random initialization, termed EB training (re-init).
|Setting||Methods||Retrain acc. (%), three pruning ratios||Energy cost (KJ)/FLOPs (G), three pruning ratios|
|PreResNet-101 CIFAR-10||LT (one-shot)||93.70, 93.21, 92.78||6322/298, 6322/298, 6322/298|
|VGG16 CIFAR-10||LT (one-shot)||93.18, 93.25, 93.28||746.2/605, 746.2/605, 746.2/605|
| ||EB-Train Improv.||0.19, 0.01, -0.57||1.2×/1.2×, 1.7×/1.4×, 2.0×/1.5×|
|PreResNet-101 CIFAR-100||LT (one-shot)||71.90, 71.60, 69.95||6095/298, 6095/298, 6095/298|
|VGG16 CIFAR-100||LT (one-shot)||72.62, 71.31, 70.96||741.2/605, 741.2/605, 741.2/605|
| ||EB-Train Improv.||-0.81, 0.86, 0.32||1.3×/1.1×, 1.4×/1.2×, 1.7×/1.5×|
Table 2 demonstrates that EB training consistently outperforms all competitors in terms of saving training energy and computation costs, while also improving the final accuracy in most cases. We use EB-Train Improv. to record the performance margin (in accuracy, or in energy/computation) between EB training and the strongest competitor among the four state-of-the-art baselines. Specifically, EB training reduces both energy consumption and computational FLOPs relative to those methods, while achieving comparable or even better accuracies, across three pruning ratios, two DNN models, and two datasets. Moreover, the comparison with the re-init variant supports the effectiveness of initialization inheritance in EB training. As an additional highlight, EB training naturally leads to pruned DNN models whose inference is efficient too, unifying the efficiency boost throughout the full learning lifecycle.
5 Discussion and conclusion
We have demonstrated that winning tickets can be drawn at a very early training stage, i.e., EB tickets exist, in both standard training and several lower-cost training schemes. That motivates the practical application of EB tickets to efficient training, whose results compare favorably against the state of the art. We believe the current work is “not the end, not even the beginning of the end, but perhaps the end of the beginning” (Sir Winston Churchill, 1942): many promising problems remain open. An immediate direction for future work is to implement low-precision EB training algorithms and to test on larger models/datasets. We are also curious whether more lower-cost training techniques could be associated with EB tickets. Finally, high pruning ratios (e.g., 0.7) can sometimes hurt the quality of EB tickets and the retrained networks. We look forward to automating the choice of the pruning ratio p for different models/datasets, unleashing higher efficiency.
Appendix A
Fig. 6 shows our GPU measurement setup, in which the GPU board is connected to a laptop and a power meter. In particular, the training settings are downloaded from the laptop to the GPU board, and the real-measured energy consumption is obtained via the power meter and runtime measurement for the whole training process.