1 Introduction
The increasing penetration of intelligent sensors has revolutionized how the Internet of Things (IoT) works. For visual data analytics, we have witnessed the record-breaking predictive performance achieved by convolutional neural networks (CNNs) [AlexNet, Object, DeepFace]. Although such high-performance CNN models are initially learned in data centers and then deployed to IoT devices, there is an increasing need for the model to continue learning and updating itself in situ, such as for personalization to different users or for incremental/lifelong learning. Ideally, this learning/retraining process should take place on device. Compared to cloud-based retraining, training locally avoids transferring data back and forth between data centers and IoT devices, reduces communication cost/latency, and enhances privacy.
However, training on IoT devices is non-trivial, more resource-consuming, and yet much less explored than inference. IoT devices, such as smartphones and wearables, have limited computation and energy resources that are stringent even for inference. Training CNNs consumes orders of magnitude more computation than one inference. For example, training ResNet50 on only one 224 × 224 image can take up to 12 GFLOPs (vs. 4 GFLOPs for inference), which can easily drain a mobile phone battery when training with batches of images [Yang_2017]. This mismatch between the limited resources of IoT devices and the high complexity of CNNs is only getting worse, because network structures are becoming more complex as they are designed to solve harder and larger-scale tasks [Google_multimodel].
This paper considers the most standard CNN training setting, assuming both the model structure and the dataset to be given. This “basic” training setting is usually not the realistic IoT case, but we address it as a starting point (with familiar benchmarks) and as an opening door towards a toolbox that may later be extended to online/transfer learning (see Section 5). Our goal is to reduce the total energy cost of training, which is complicated by a myriad of factors: from per-sample (minibatch) complexity (both feedforward and backward computations), to the empirical convergence rate (how many epochs it takes to converge), and more broadly, hardware/architecture factors such as data access and movement [Eyeriss_JSSC, Boris]. Despite a handful of works on efficient, accelerated CNN training [goyal, Cho2017PowerAID, You_2018, Akiba2017ExtremelyLM, jia2018highly], they mostly focus on reducing the total training time in resource-rich settings, such as by distributed training in large-scale GPU clusters. In contrast, our focus is to trim down the total energy cost of in-situ, resource-constrained training. This represents an orthogonal (and less studied) direction to [goyal, Cho2017PowerAID, You_2018, Akiba2017ExtremelyLM, jia2018highly, gupta2015deep, wang2018training], although the two can certainly be combined.
To unleash the potential of more energy-efficient in-situ training, we look closely at the full CNN training lifecycle. With the goal of “squeezing out” unnecessary costs, we raise three curious questions:

Q1: Are all samples always required throughout training: is it necessary to use all training samples in all epochs?

Q2: Are all parts of the entire model equally important during training: does every layer or filter have to be updated every time?

Q3: Are precise gradients indispensable for training: can we efficiently compute and update the model with approximate gradients?
The above three questions represent only our “first stab” at exploring energy-efficient training, whose full scope is much more profound. By no means do they represent all possible directions. We envision that many other recipes can be blended in too, such as training at lower bit precision or input resolution [banner2018scalable, chin2019adascale]. We also recognize that energy-efficient CNN training should be jointly considered with hardware/architecture co-design [wu2018l1, hoffer2018norm], which is beyond the scope of the current work.
Motivated by the above questions, this paper proposes a novel energy-efficient CNN training framework dubbed ETrain. It consists of three complementary efforts to trim down unnecessary training computations and data movements, each addressing one of the above questions:

Data-Level: Stochastic minibatch dropping (SMD). We show that CNN training can be accelerated by a “frustratingly easy” strategy: randomly skipping minibatches with probability 0.5 throughout training. This can be interpreted as data sampling with (limited) replacement, and is found to incur minimal accuracy loss (and sometimes even an accuracy increase).

Model-Level: Input-dependent selective layer update (SLU). For each minibatch, we select a different subset of CNN layers to be updated. The input-adaptive selection is based on a low-cost gating function jointly learned during training. While similar ideas were explored for efficient inference [wang2018skipnet], this is the first time they are applied to and evaluated for training.

Algorithm-Level: Predictive sign gradient descent (PSG). We explore the usage of an extremely low-precision gradient descent algorithm called SignSGD, which has recently found both theoretical and experimental grounds [signSGD]. The original algorithm still requires the full gradient computation and therefore does not save energy. We create a novel “predictive” variant that obtains the sign without computing the full gradient, via low-cost, bit-level prediction. Combined with a mixed-precision design, it decreases both computation and data movement costs.
Beyond these mainly experimental explorations, we find that ETrain has many interesting links to recent CNN training theories, e.g., [chaudhari2018stochastic, samples_equal, zhang2019all, lottery]. We evaluate ETrain in comparison with its closest state-of-the-art competitors. To measure its actual performance, ETrain is also implemented and evaluated on an FPGA board. The results show that CNN models trained with ETrain consistently achieve higher training energy efficiency with marginal accuracy drops.
2 Related Work
Accelerated CNN training. A number of works have been devoted to accelerating training in a resource-rich setting, by utilizing communication-efficient distributed optimization and larger minibatch sizes [goyal, Cho2017PowerAID, You_2018, Akiba2017ExtremelyLM]. The latest work [jia2018highly] combined distributed training with a mixed-precision framework, leading to training AlexNet within 4 minutes. However, their goals and settings are distinct from ours: while the distributed training strategy can reduce time, it actually incurs more total energy overhead, and is clearly not applicable to on-device, resource-constrained training.
Low-precision training. It is well known that CNN training can be performed under substantially lower precision [banner2018scalable, wang2018training, gupta2015deep], rather than using full-precision floats. Specifically, training with quantized gradients has been well studied in distributed learning, where the main motivation is to reduce the communication cost during gradient aggregation between workers [seide20141, alistarh2016qsgd, zhang2017zipml, de2017understanding, terngrad, signSGD]. A few works considered transmitting only the coordinates of large magnitude [aji2017sparse, lin2017deep, wangni2018gradient]. Recently, the SignSGD algorithm [seide20141, signSGD] even showed the feasibility of using one-bit gradients (signs) during training, without notably hampering the convergence rate or final result. However, most algorithms are optimized for distributed communication efficiency, rather than for reducing training energy costs. Many of them, including [signSGD], need to first compute full-precision gradients and then quantize them.
Efficient CNN inference: Static and Dynamic. Compressing CNNs and speeding up their inference have attracted major research interest in recent years. Representative methods include weight pruning, weight sharing, layer factorization, and bit quantization, to name just a few [han2015deep, he2017channel, yu2017scalpel, DeepkMeans, chen2019collaborative].
While model compression presents “static” solutions for improving inference efficiency, a more interesting recent trend looks at dynamic inference [wang2018skipnet, blockdrop, convnetaig, rnp, gaternet] to reduce latency, i.e., selectively executing subsets of layers in the network conditioned on each input. That sequential decision-making process is usually controlled by low-cost gating or policy networks. This mechanism was also applied to improve inference energy efficiency [energynet, wang2019dual].
In [PredictiveNet], a unique bit-level prediction framework called PredictiveNet was presented to accelerate CNN inference at a lower level. Since CNN layer-wise activations are usually highly sparse, the authors proposed to predict those zero locations using low-cost bit predictors, thereby bypassing a large fraction of energy-dominant convolutions without modifying the CNN structure.
Energy-efficient training is different from, and more complicated than, its inference counterpart. However, many insights gained from the latter can be lent to the former. For example, the recent work [prunetrain] showed that performing active channel pruning during training can accelerate the empirical convergence. Our proposed model-level SLU is inspired by [wang2018skipnet]. The algorithm-level PSG also inherits the idea of bit-level low-cost prediction from [PredictiveNet].
3 The Proposed Framework
3.1 Data-Level: Stochastic minibatch dropping (SMD)
We first adopt a straightforward, seemingly naive, yet surprisingly effective stochastic minibatch dropping (SMD) strategy (see Fig. 1) to aggressively reduce the training cost by letting the model see fewer minibatches. At each epoch, SMD simply skips every minibatch with a default probability of 0.5. All other training protocols, such as the learning rate schedule, remain unchanged. Compared to normal training, SMD directly halves the training cost if both are trained for the same number of epochs. Yet in our experiments SMD usually leads to a negligible accuracy decrease, and sometimes even an increase (see Sec. 4). Why? We discuss possible explanations below.
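The skipping step itself is small enough to sketch. The following is a minimal illustration, not the authors' implementation: the helper name `smd_batches`, the dummy list of minibatch indices, and the seeded generator are all assumptions made for the example.

```python
import random

def smd_batches(batches, drop_prob=0.5, rng=None):
    """Yield minibatches, skipping each one independently with drop_prob."""
    rng = rng or random.Random()
    for batch in batches:
        if rng.random() < drop_prob:
            continue  # skipped: neither forward nor backward pass is run
        yield batch

# One epoch over 1000 dummy minibatches (indices stand in for real data).
epoch = list(range(1000))
kept = list(smd_batches(epoch, drop_prob=0.5, rng=random.Random(0)))
kept_fraction = len(kept) / len(epoch)
```

Because a skipped minibatch triggers no computation at all, the expected cost per epoch scales with the fraction of minibatches kept, which is why the rest of the training loop can stay unchanged.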
SMD can be interpreted as sampling with limited replacement. To understand this, think of combining two consecutive SMD-enforced epochs into one: in expectation it has the same number of minibatches as one full epoch, but within it each training sample now has probability 0.25, 0.5, and 0.25 of being sampled 2, 1, and 0 times, respectively. The conventional wisdom is that for stochastic gradient descent (SGD), in each epoch, the minibatches are sampled i.i.d. from data without replacement (i.e., each sample occurs exactly once per epoch) [bertsekas2011incremental, shamir2016without, bengio2012practical, recht2012beneath, gurbuzbalaban2015random]. However, [chaudhari2018stochastic] proved that sampling minibatches with replacement has a larger variance than sampling without replacement, and consequently SGD may have better regularization properties.
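The 0.25 / 0.5 / 0.25 split quoted above is a binomial distribution, which can be checked directly. This sketch assumes each sample belongs to exactly one minibatch per epoch and that drops are independent across epochs; the function name `appearance_distribution` is hypothetical.

```python
from math import comb

def appearance_distribution(drop_prob, epochs):
    """P(a sample is seen k times in `epochs` SMD epochs), for k = 0..epochs."""
    keep = 1.0 - drop_prob
    return {k: comb(epochs, k) * keep**k * drop_prob**(epochs - k)
            for k in range(epochs + 1)}

dist = appearance_distribution(drop_prob=0.5, epochs=2)
# Expected number of visits over the two merged epochs.
expected_visits = sum(k * p for k, p in dist.items())
```

With a drop probability of 0.5 the expected number of visits over two epochs is exactly one, matching the cost of a single full epoch, while individual samples may be seen twice or not at all.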
Alternatively, SMD could also be viewed as a special form of data augmentation that injects more sampling noise to perturb the training distribution in every epoch. Past works [ge2015escaping, keskar2016large, daneshmand2018escaping] have shown that specific kinds of random noise aid convergence by helping escape from saddle points or less generalizable minima. The structured sampling noise caused by SMD might aid this exploration.
Besides, [samples_equal, johnson2018training, coleman2019select] also showed that an importance sampling scheme that focuses on training more with “informative” examples leads to faster convergence under resource budgets. They implied that minibatch dropping can be selective, based on a certain information criterion, instead of stochastic. We use SMD because it has zero overhead, but more effective dropping options might be available if low-cost indicators of minibatch importance can be identified; we leave this as future work.
3.2 Model-Level: Input-dependent selective layer update (SLU)
[wang2018skipnet] proposed to dynamically skip a subset of layers for different inputs, in order to adaptively accelerate feedforward inference. However, [wang2018skipnet] called for a post-processing step after supervised training, i.e., refining the dynamic skipping policy via reinforcement learning, thus causing undesired extra training overhead. We propose to extend the idea of dynamic inference to the training stage, i.e., dynamically skipping a subset of layers during both feedforward and backpropagation. Crucially, we show that by adding an auxiliary regularization, such dynamic skipping can be learned from scratch and obtain satisfactory performance: neither post-refinement nor extra training iterations are required. That is critical for dynamic layer skipping to be useful for energy-efficient training. We term this extended scheme input-dependent selective layer update (SLU).
As depicted in Fig. 1, given a base CNN to be trained, we follow [wang2018skipnet] and add a lightweight RNN gating network per layer block. Each gate takes the same input as its corresponding layer and outputs a soft gating value in [0, 1] for that layer, which is then used as the selection probability: the higher the value, the more likely the layer is to be selected. Each layer is therefore adaptively selected or skipped depending on the input, and only the layers activated by their gates are executed and updated. These RNN gates cost less than 0.04% of the feedforward FLOPs of the base models; hence their energy overhead is negligible. More details can be found in the supplementary material.
[wang2018skipnet] first trained the gates in a supervised way together with the base model. Observing that such learned routing policies were often not sufficiently efficient, they used reinforcement learning post-processing to learn more aggressive skipping afterwards. While this is fine for the end goal of dynamic inference, we hope to get rid of the post-processing overhead. To overcome this hurdle, we incorporate a computational complexity regularization into the objective function, defined as
$$\min_{W, G} \; L(W, G) + \alpha \, C(W, G) \qquad (1)$$
Here, $\alpha$ is a weighting coefficient of the computational complexity regularization, and $W$ and $G$ denote the parameters of the base model and the gating network, respectively. Also, $L(W, G)$ denotes the prediction loss, and $C(W, G)$ is calculated by accumulating the computational cost (FLOPs) of the layers that are selected. The regularization explicitly encourages learning more “parsimonious” selections throughout training. We find that such SLU-regularized training takes almost the same number of epochs to converge as standard training, i.e., SLU does not sacrifice empirical convergence speed. As a side effect, SLU naturally yields CNNs with dynamic inference capability. Though not the focus of this paper, we find that a CNN trained with SLU reaches an accuracy-efficiency trade-off comparable to that of one trained with the approach in [wang2018skipnet].
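To make the regularized objective concrete, the following sketch evaluates a prediction loss plus a weighted FLOPs penalty for a given set of gate decisions. It is an illustration under simplifying assumptions: gates are treated as hard 0/1 decisions (a trainable version would need differentiable gate outputs), and the function name, per-block FLOPs values, and weighting coefficient are all hypothetical.

```python
def slu_objective(prediction_loss, gate_decisions, layer_flops, alpha):
    """Prediction loss plus alpha times the FLOPs of the selected layers."""
    compute_cost = sum(f for g, f in zip(gate_decisions, layer_flops) if g)
    return prediction_loss + alpha * compute_cost

layer_flops = [2.0, 2.0, 4.0, 4.0]   # hypothetical per-block costs (MFLOPs)
all_layers = slu_objective(0.75, [1, 1, 1, 1], layer_flops, alpha=0.125)
half_layers = slu_objective(0.75, [1, 0, 1, 0], layer_flops, alpha=0.125)
```

At equal prediction loss, the configuration that selects fewer layers attains the lower objective, which is the pressure towards sparser selections that the regularizer is meant to create.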
The practice of SLU seems to align with several recent theories on CNN training. In [layers_equal], the authors suggested that “not all layers are created equal” for training. Specifically, some layers need to be updated intensively to improve final predictions, while others are insensitive throughout training. There exist “non-critical” layers that barely change their weights throughout training: even resetting those layers in a trained model to their initial values has few negative consequences. The more recent work [lottery] further confirmed the phenomenon, though how to identify those non-critical model parts at the early training stage remains unclear. [veit2016residual, greff2016highway] also observed that different samples might activate different sub-models. Those inspiring theories, combined with the dynamic inference practice, motivate us to propose SLU for more efficient training.
3.3 Algorithm-Level: Predictive sign gradient descent (PSG)
It is well recognized that low-precision fixed-point implementation is a very effective knob for achieving energy-efficient CNNs, because both the computational and data movement costs of CNNs are approximately a quadratic function of their employed precision. For example, a state-of-the-art design [horowitz20141] shows that adopting 8-bit precision for a multiplication, an addition, and data movement can reduce the energy cost by 95%, 97%, and 75%, respectively, compared to a 32-bit floating-point design when evaluated in a commercial 45 nm CMOS technology.
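As a rough sanity check on how energy scales with precision, the sketch below evaluates a simple power-law model in which energy is proportional to the bit-width raised to some exponent. The model and the function name are assumptions for illustration; real hardware figures, like those cited above, also depend on the fixed-point versus floating-point distinction.

```python
def relative_energy(bits, ref_bits=32, exponent=2):
    """Energy relative to the reference bit-width under a power-law model."""
    return (bits / ref_bits) ** exponent

quadratic_saving = 1.0 - relative_energy(8, exponent=2)  # 0.9375
linear_saving = 1.0 - relative_energy(8, exponent=1)     # 0.75
```

Under the quadratic assumption, moving from 32 to 8 bits predicts a saving of about 94%, in the same range as the multiplication figure quoted above, while an exponent of 1 gives 75%.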
The successful adoption of extremely low-precision (binary) gradients in SignSGD [signSGD] is appealing, as it might reduce both weight update computation and data movement. However, directly applying the original SignSGD algorithm for training will not save energy, because it actually computes the full-precision gradient first before taking the signs. We propose a novel predictive sign gradient descent (PSG) algorithm, which predicts the sign of gradients using low-cost bit-level predictors, thereby bypassing the costly full-gradient computation.
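For reference, a plain sign-based update can be sketched as follows. This is a minimal illustration of the baseline being discussed, with hypothetical values; note that `full_grads` must already have been computed at full cost before any sign is taken, which is exactly the expense a predictive variant aims to avoid.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def signsgd_step(weights, full_grads, lr):
    """Plain sign update: only the signs of full-precision gradients are used."""
    return [w - lr * sign(g) for w, g in zip(weights, full_grads)]

weights = [0.5, -0.25, 1.0]
full_grads = [0.03, -2.0, 0.0]   # assumed already computed at full precision
new_weights = signsgd_step(weights, full_grads, lr=0.25)
```

Every weight moves by the same step size regardless of gradient magnitude, so only one bit per gradient entry is needed once the sign is known.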
We next introduce how the gradients of weights are updated in PSG. Assume the following notation: the bit-widths of the full-precision representation and of its most significant bits (the MSB part is what PSG’s low-cost predictors use) are denoted as ($B_x$, $B_x^{msb}$) for the input $x$ and ($B_{g_y}$, $B_{g_y}^{msb}$) for the gradient of the output $g_y$, and the corresponding input and output gradient seen by PSG’s predictors are denoted as $x^{msb}$ and $g_y^{msb}$, respectively. As such, the quantization noise for the input and the gradient of the output are $q_x = x - x^{msb}$ and $q_{g_y} = g_y - g_y^{msb}$, respectively. Similarly, after backpropagation, we denote the full-precision and low-precision (i.e., MSB-based) gradients of the weight as $g_w$ and $g_w^{msb}$, respectively, the latter of which is computed using $x^{msb}$ and $g_y^{msb}$. Then, with an empirically pre-selected threshold $\tau$, PSG updates the $i$-th weight gradient as follows:
$$\tilde{g}_w[i] = \begin{cases} \mathrm{sign}(g_w^{msb}[i]), & \text{if } |g_w^{msb}[i]| \geq \tau \\ \mathrm{sign}(g_w[i]), & \text{otherwise} \end{cases} \qquad (2)$$
Note that in a hardware implementation, the computation to obtain $g_w^{msb}$ is embedded within that of $g_w$. Therefore, PSG’s predictors do not incur energy overhead.
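A thresholded sign-prediction rule of this kind can be sketched as below. The sketch assumes the predictor's sign is trusted only when the magnitude of the low-precision gradient reaches the threshold, with a fallback to the full-precision sign otherwise; the names `psg_signs` and `full_grad_fn`, and all numeric values, are hypothetical.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def psg_signs(msb_grads, full_grad_fn, tau):
    """Use the predictor's sign when confident, else fall back to the full gradient."""
    signs, fallbacks = [], 0
    for i, g_msb in enumerate(msb_grads):
        if abs(g_msb) >= tau:
            signs.append(sign(g_msb))            # cheap path: predictor only
        else:
            signs.append(sign(full_grad_fn(i)))  # costly path: full precision
            fallbacks += 1
    return signs, fallbacks

full = [0.9, -0.02, 0.4, 0.01]       # hypothetical full-precision gradients
msb = [1.0, 0.0, 0.5, -0.0625]       # hypothetical low-precision predictions
signs, fallbacks = psg_signs(msb, lambda i: full[i], tau=0.25)
```

In this toy input the last prediction has the wrong sign but a small magnitude, so the threshold routes it to the full-precision path and the final sign is still correct. Passing the full gradient as a callable mirrors the intent that it is evaluated only for the entries that need it.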
PSG for energy-efficient training. Recent work [banner2018scalable] has shown that most of the training process is robust to reduced precision (e.g., 8 bits instead of 32 bits), except for the weight gradient calculations and updates. Following their finding, we similarly adopt a higher precision for the gradients than for the inputs and weights (16 bits vs. 8 bits in our experiments; see Section 4.4). Specifically, when training with PSG, we first compute the predictors using the MSB parts of the input (e.g., 4 bits) and of the output gradient (e.g., 10 bits), and then update the weights’ gradients following Eq. (2). The further energy savings of training with PSG over fixed-point training [banner2018scalable] result from the fact that the predictors, computed on these MSB parts, require exponentially less computational and data movement energy.
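Extracting the MSB part of a fixed-point value amounts to zeroing its low-order bits. The sketch below assumes unsigned fixed-point codes for simplicity (signed values would need two's-complement handling); the helper name `msb_part` and the example value are hypothetical.

```python
def msb_part(value, total_bits, msb_bits):
    """Keep the top msb_bits of an unsigned total_bits fixed-point code."""
    shift = total_bits - msb_bits
    return (value >> shift) << shift

x = 0b10110111                                  # 183 as an 8-bit code
x_msb = msb_part(x, total_bits=8, msb_bits=4)   # 0b10110000 == 176
q_x = x - x_msb                                 # 0b0111 == 7
```

The discarded low-order bits are the quantization noise of the predictor operand, and with 4 of 8 bits kept that noise is always smaller than 16 code units.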
Prediction guarantee of PSG. We analyze the probability of PSG’s prediction failure to discuss its performance guarantee. Specifically, it can be proved that the probability of a sign prediction failure produced by Eq. (2) is upper-bounded as follows,
(3) 
where the bound depends on the quantization noise step sizes of the two predictor operands (the MSB parts of the input and of the output gradient) and on two terms that are given in the Appendix along with the proof of Eq. (3). Eq. (3) shows that the prediction failure probability of PSG is upper-bounded by a term that decays exponentially with the precision assigned to the predictors, indicating that this failure probability can be very small if the predictors are designed properly.
Adaptive threshold. Training with PSG might lead to sign flips in the weight gradients relative to the floating-point ones. Such flips occur only when the floating-point gradient has a small magnitude, so that the quantization noise of the predictors can change its sign. Therefore, it is important to properly select the threshold in Eq. (2) so that it balances this sign-flip probability against the achieved energy savings. We adopt an adaptive threshold selection strategy because the dynamic range of gradients differs significantly from layer to layer: instead of using a fixed number, we tune a ratio that yields an adaptive threshold for each layer.
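One way to turn a single tuned ratio into a per-layer threshold is to scale it by a statistic of that layer's predicted gradients. The sketch below assumes the largest predictor magnitude is used as that statistic; this choice, the function name, and the values are assumptions for illustration rather than a statement of the exact rule used in the experiments.

```python
def adaptive_threshold(msb_grads, ratio):
    """Per-layer threshold: a fixed ratio of the largest predictor magnitude."""
    return ratio * max(abs(g) for g in msb_grads)

layer_a = [0.5, -2.0, 0.25]        # hypothetical layer with large gradients
layer_b = [0.005, -0.02, 0.0025]   # hypothetical layer with small gradients
tau_a = adaptive_threshold(layer_a, ratio=0.2)
tau_b = adaptive_threshold(layer_b, ratio=0.2)

# Number of entries whose predictor sign would be trusted in each layer.
trusted_a = sum(abs(g) >= tau_a for g in layer_a)
trusted_b = sum(abs(g) >= tau_b for g in layer_b)
```

Although the two layers differ in scale by a factor of 100, the same ratio trusts the same number of entries in each, which a single fixed threshold could not do.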
4 Experiments
4.1 Experiment setup
Datasets: We evaluate our proposed techniques on two datasets: CIFAR10 and CIFAR100. Common data augmentation methods (e.g., mirroring/shifting) are adopted, and data are normalized as in [krizhevsky2009learning].
Models: Three popular backbones, ResNet74, ResNet110 [he2016deep], and MobileNetV2 [s2018mobilenetv2], are considered. For evaluating each of the three proposed techniques (i.e., SMD, SLU, and PSG), we consider various experimental settings using ResNet74 and the CIFAR10 dataset for the ablation study, as described in Sections 4.2-4.5. ResNet110 and MobileNetV2 results are reported in Section 4.6. Top-1 accuracies are measured for CIFAR10, and both top-1 and top-5 accuracies for CIFAR100.
Training settings: We adopt the training settings in [he2016deep] for the baseline default configurations. Specifically, we use SGD with a momentum of 0.9 and a weight decay factor of 0.0001, and the initialization introduced in [he2015delving]. Models are trained for 64k iterations. For experiments where PSG is used, the initial learning rate is lowered, as SignSGD [signSGD] suggested that small learning rates benefit convergence. For the others, the learning rate is initially set to 0.1 and then decayed by a factor of 10 at the 32k and 48k iterations. We also employ the stochastic weight averaging (SWA) technique [yang2019swalp] when PSG is adopted, which was found to notably stabilize training.
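The baseline step schedule described above can be written as a small function. This is a sketch under one assumption about the boundary: the decay is taken to apply from the milestone iteration onward. The function name is hypothetical.

```python
def learning_rate(iteration, base_lr=0.1, milestones=(32_000, 48_000), factor=10.0):
    """Step schedule: divide base_lr by `factor` for each milestone reached."""
    passed = sum(iteration >= m for m in milestones)
    return base_lr / (factor ** passed)

schedule = [learning_rate(i) for i in (0, 31_999, 32_000, 48_000, 63_999)]
```

When the total number of iterations is scaled, the milestones would be scaled by the same factor so that the shape of the schedule is preserved.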
Real energy measurements using FPGA: The energy cost of CNN inference/training consists of both computational and data movement costs; the latter is often dominant but cannot be captured by commonly used metrics such as the number of FLOPs [Eyeriss_JSSC]. We therefore evaluate the proposed techniques against the baselines in terms of accuracy and real measured energy consumption. Specifically, unless otherwise specified, all energy figures and energy savings are obtained through real measurements, by training the corresponding models and datasets on a state-of-the-art FPGA [zed], a Digilent ZedBoard Zynq-7000 ARM/FPGA SoC Development Board. Fig. 2 shows our FPGA measurement setup, in which the FPGA board is connected to a laptop through a serial port and a power meter. In particular, the training settings are downloaded from the laptop to the FPGA board, and the energy consumption of the whole training process is measured via the power meter and then sent back to the laptop.
4.2 Evaluating stochastic minibatch dropping
We first validate the energy saving achieved by SMD against a few “off-the-shelf” options: (1) can we train with the standard algorithm, using fewer iterations and otherwise the same training protocol? (2) can we train with the standard algorithm, using fewer iterations but properly increased learning rates? Two sets of carefully designed experiments are presented below to address them.
Training with SMD vs. standard minibatch (SMB): We first evaluate SMD against standard minibatch (SMB) training, which uses all (vs. 50% in SMD) minibatch samples. As shown in panel (a) of the SMD ablation figure, when the energy ratio is 1 (i.e., training with SMB + 64k iterations vs. SMD + 128k iterations), the proposed SMD technique boosts the inference accuracy by 0.39% over the standard way.
We next “naively” suppress the energy cost of SMB by reducing the number of training iterations to fractions of the original one. Note that the learning rate schedule (e.g., when to reduce learning rates) is scaled proportionally with the total iteration number too. For comparison, we train with SMD such that the number of equivalent training iterations is the same as in the SMB cases. Panel (a) of the SMD ablation figure shows that training with SMD consistently achieves a higher inference accuracy than SMB, with the margin ranging from 0.39% to 0.86%. Furthermore, training with SMD at an energy ratio of 0.67 reduces the training energy cost by about a third while still improving the inference accuracy over SMB at an energy ratio of 1. We adopt SMD under this energy ratio of 0.67 in all the remaining experiments.
Dataset     Backbone     Accuracy (SMB)    Accuracy (SMD)
CIFAR10     ResNet110    92.75%            93.05%
CIFAR100    ResNet74     71.11%            71.37%
We repeated training ResNet74 on CIFAR10 using SMD 10 times with different random initializations. The accuracy standard deviation is only 0.132%, showing high stability. We also conducted more experiments with different backbones and datasets. As shown in the table above, SMD is consistently better than SMB.
Training with SMD vs. SMB + increased learning rates: We further compare with SMB using tuned/larger learning rates, conjecturing that larger learning rates would accelerate convergence and possibly reduce the number of training epochs needed. Results are summarized in panel (b) of the SMD ablation figure. Specifically, when the number of iterations is reduced, we perform a grid search over learning rates. All compared methods are given the same training energy budget. The results demonstrate that while increasing the learning rate seems to improve SMB’s energy efficiency over sticking to the original protocol, our proposed SMD still maintains a clear advantage.
4.3 Evaluating selective layer update
Our current SLU experiments are based on CNNs with residual connections, partially because they dominate in SOTA CNNs. We will extend SLU to other model structures in future work. We evaluate the proposed SLU by comparing it with stochastic depth (SD) [huang2016deep], a technique originally developed for training very deep networks effectively by updating only a random subset of layers at each minibatch. It could be viewed as a “random” version of SLU (which uses learned layer selection). We follow all suggested settings in [huang2016deep]. For a fair comparison, we adjust the hyperparameter of [huang2016deep] so that the SD dropping ratio is always the same as that of SLU.
From Fig. 6, training with SLU consistently achieves higher inference accuracy than SD when their training energy costs are the same. It is further encouraging to observe that training with SLU can sometimes even achieve higher accuracy in addition to saving energy. For example, comparing SLU at an energy ratio of 0.3 (i.e., 70% energy saving) with SD at an energy ratio of 0.5, SLU reduces the training energy cost while also boosting the inference accuracy. These results endorse the use of data-driven gates instead of random dropping in the context of energy-efficient training. Training with SLU + SMD combined further boosts the accuracy while reducing the energy cost. Furthermore, 20 trials of SLU experiments with ResNet38 on CIFAR10 show that, at the 95% confidence level, the confidence intervals for the mean top-1 accuracy and the mean energy saving are [92.47%, 92.58%] (baseline: 92.50%) and [39.55%, 40.52%], respectively, verifying that SLU is reliably effective.
4.4 Evaluating predictive sign gradient descent
We evaluate PSG against two alternatives: (1) the 8-bit fixed-point training proposed in [banner2018scalable]; and (2) the original SignSGD [signSGD]. For all experiments in Sections 4.4 and 4.5, we adopt 8-bit precision for the activations/weights and 16-bit precision for the gradients. The corresponding precisions of the predictors are 4-bit and 10-bit, respectively. We use the adaptive threshold described in Section 3.3. More experimental details are in the Appendix.
As shown in Table 3, the 8-bit fixed-point training in [banner2018scalable] saves training energy with a marginal accuracy loss compared to 32-bit SGD; its saving is smaller than what going from 32-bit to 8-bit would generally give, because of the 32-bit gradients it employs. The proposed PSG almost doubles the training energy savings of [banner2018scalable] with a negligible accuracy loss. Interestingly, compared to SignSGD [signSGD], PSG slightly boosts the inference accuracy while saving energy, i.e., it offers better training energy efficiency with slightly better inference accuracy. Besides, we observed that the fraction of weight gradients whose sign is taken from the low-cost predictor typically remains at least 60% throughout the training process under the adaptive threshold.
4.5 Evaluating ETrain: SMD + SLU + PSG
We now evaluate the proposed ETrain framework, which combines the SMD, SLU, and PSG techniques. As shown in Table 3, ETrain: (1) can further boost performance compared to training with SMD+SLU at the same energy savings (see Fig. 6 at the energy ratio of 0.2 for SMD+SLU); and (2) can achieve extremely aggressive energy savings while incurring only a small accuracy loss compared to 32-bit floating-point SGD (see Table 3).
Impact on empirical convergence speed. We plot the training convergence curves of different methods in Fig. 7, with the x-axis expressed in terms of training energy cost (up to the current iteration). We observe that ETrain does not slow down the empirical convergence. In fact, it even makes the training loss decrease faster in the early stage.
Experiments on adapting a pre-trained model. We perform a proof-of-concept experiment for CNN fine-tuning by splitting the CIFAR10 training set in half, where each class was i.i.d. split evenly. We first pre-train ResNet74 on the first half, then fine-tune it on the second half. During fine-tuning, we compare two energy-efficient options: (1) fine-tuning only the last FC layer using standard training; (2) fine-tuning all layers using ETrain. With all hyperparameters tuned to our best efforts, the two fine-tuning methods improve over the pre-trained model’s top-1 accuracy by 0.30% and 1.37%, respectively, while (2) saves 61.58% more energy (FPGA-measured) than (1). This shows that ETrain is the preferred option, giving higher accuracy and more energy savings.
Table 4 evaluates ETrain and its ablation baselines on various models and more datasets. The conclusions are aligned with the ResNet74 cases. Remarkably, on CIFAR10 with ResNet110, ETrain saves over 83% energy with only 0.56% accuracy loss. When saving over 91% (i.e., more than 10×), the accuracy drop is still less than 2%. On CIFAR100 with ResNet110, ETrain can even surpass the baseline on both top-1 and top-5 accuracy while saving over 84% energy. More notably, ETrain is also effective even for compact networks: it saves about 90% of the energy cost while achieving comparable accuracy when adopted for training MobileNetV2.
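The relation between a fractional energy saving and the corresponding efficiency multiplier is a one-line conversion, sketched here for the savings quoted above. The function name is hypothetical.

```python
def efficiency_gain(energy_saving):
    """Convert a fractional energy saving into an energy-efficiency multiplier."""
    if not 0.0 <= energy_saving < 1.0:
        raise ValueError("energy_saving must be in [0, 1)")
    return 1.0 / (1.0 - energy_saving)

gain_83 = efficiency_gain(0.83)   # about 5.9x
gain_91 = efficiency_gain(0.91)   # about 11.1x
```

The multiplier grows quickly as the saving approaches 100%, so the step from 83% to 91% nearly doubles the efficiency gain.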
5 Discussion of Limitations and Future Work
We propose the ETrain framework to achieve energy-efficient CNN training in resource-constrained settings. Three complementary efforts to trim down training costs, at the data, model, and algorithm levels, respectively, are carefully designed, justified, and integrated. Experiments both in simulation and on a real FPGA demonstrate the promise of ETrain. Despite the preliminary success, we are aware of several limitations of ETrain, which also point to our future road map. For example, ETrain is currently designed and evaluated for standard offline CNN training, with all training data presented in batch, for simplicity. This does not scale to many real-world IoT scenarios, where new training data arrives sequentially as a stream, with limited or no data buffer/storage, leading to the open challenge of “on-the-fly” CNN training [sahoo2017online]. In this case, while both SLU and PSG are still applicable, SMD needs to be modified, e.g., by one-pass active selection of streamed-in data samples. Besides, SLU is not yet straightforward to extend to plain CNNs without residual connections. We expect finer-grained selective model updates, such as online channel pruning [prunetrain], to be useful alternatives here. We also plan to optimize ETrain for continuous adaptation or lifelong learning.
Acknowledgments
The work is in part supported by the NSF RTML grant (1937592, 1937588). The authors would like to thank all anonymous reviewers for their tremendously useful comments to help improve our work.
References
Appendix A PSG Prediction Error Rate Bound Analysis
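Weight gradient calculation during back propagation. Consider a convolutional layer with weight $\mathbf{w}$ and no bias (as is usual for modern deep CNNs), whose input is $\mathbf{x}$ and whose output is $\mathbf{y}$. During one pass of back propagation, the gradient propagated from its succeeding layer is $\mathbf{g}_y$, and we compute the gradient of the weight, $\mathbf{g}_w$. Any single entry of $\mathbf{g}_w$ can be represented as a sum of inner products between the corresponding locations in $\mathbf{x}$ and $\mathbf{g}_y$. For simplicity, and with a slight abuse of notation, one entry of the gradient can be written as:
$$g_w = \sum_{i=1}^{N} \mathbf{x}_i^{\top} \mathbf{g}_{y,i}, \qquad (4)$$
where $i$ iterates over the mini-batch and $N$ is the mini-batch size. The MSB parts used to predict the gradient signs are denoted as $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{g}}_y$, with precisions $B_x$ and $B_{g_y}$. The corresponding quantization noise terms are $\mathbf{q}_x = \mathbf{x} - \tilde{\mathbf{x}}$ and $\mathbf{q}_{g_y} = \mathbf{g}_y - \tilde{\mathbf{g}}_y$. The gradient calculated using (4) with $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{g}}_y$ is denoted as $\tilde{g}_w$. Then the gradient error, denoted as $q_{g_w} = g_w - \tilde{g}_w$, can be approximated with
$$q_{g_w} \approx \sum_{i=1}^{N} \left( \tilde{\mathbf{x}}_i^{\top} \mathbf{q}_{g_y,i} + \mathbf{q}_{x,i}^{\top} \tilde{\mathbf{g}}_{y,i} \right). \qquad (5)$$
Here the second-order noise term $\sum_{i=1}^{N} \mathbf{q}_{x,i}^{\top} \mathbf{q}_{g_y,i}$ is neglected because it is small.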
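The decomposition of the gradient error into first-order and second-order noise terms can be checked numerically. The following is a minimal sketch under stated assumptions: data lie in [-1, 1], the MSB part is obtained with a simple uniform rounding quantizer, and the mini-batch size, vector length, and bit widths are hypothetical values chosen only for illustration. It is not the hardware predictor used in the experiments.

```python
import random

def quantize_msb(v: float, bits: int) -> float:
    """Round v in [-1, 1] to a grid with step 2^-(bits-1) (an assumed uniform quantizer)."""
    step = 2.0 ** -(bits - 1)
    return round(v / step) * step

def gradient_entry(xs, gys) -> float:
    """One weight-gradient entry: sum over the mini-batch of inner products x_i . g_{y,i}."""
    return sum(sum(a * b for a, b in zip(x, gy)) for x, gy in zip(xs, gys))

def sign(v: float) -> int:
    """Return -1, 0, or 1."""
    return (v > 0) - (v < 0)

random.seed(0)
N, D = 8, 16             # hypothetical mini-batch size and inner-product length
X_BITS, GY_BITS = 4, 8   # hypothetical MSB precisions

xs = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(N)]
gys = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(N)]

xs_msb = [[quantize_msb(v, X_BITS) for v in x] for x in xs]
gys_msb = [[quantize_msb(v, GY_BITS) for v in gy] for gy in gys]

# Quantization noise: full-precision value minus its MSB part.
qx = [[a - b for a, b in zip(x, xm)] for x, xm in zip(xs, xs_msb)]
qgy = [[a - b for a, b in zip(g, gm)] for g, gm in zip(gys, gys_msb)]

g_full = gradient_entry(xs, gys)
g_msb = gradient_entry(xs_msb, gys_msb)
first_order = gradient_entry(xs_msb, qgy) + gradient_entry(qx, gys_msb)
second_order = gradient_entry(qx, qgy)

print("full-precision gradient: ", g_full)
print("MSB-predicted gradient:  ", g_msb)
print("exact error:             ", g_full - g_msb)
print("first-order estimate:    ", first_order)
print("neglected 2nd-order term:", second_order)
print("signs agree:", sign(g_full) == sign(g_msb))
```

The exact error equals the sum of the first-order and second-order terms. Because each noise value is at most half a quantization step, the second-order term involves products of two small quantities, which is the reason given for neglecting it.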
Sign prediction error probability bound. Denote the sign prediction failure, given a t event as , which has three subcases as shown in Table 5:
Event  Condition 
Consider Case :
(6) 
where is the conditional distribution of given , and is the variance of . The inequality comes from Chebyshev’s inequality and the fact that is symmetrically distributed. Plug into (A), we have:
(7) 
Consider Case and : following similar derivations, we can have:
(8)  
(9) 
where and are defined as:
Discussion of the data range. In (3) the data range is assumed to be . When the data range changes, however, the bound does not change, because a change of range is equivalent to scaling the numerators and denominators in the derivations above by the same factor; this corresponds to the adaptive threshold scheme we introduce in Section 3.3.
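Two generic ingredients of this appendix can be illustrated numerically: the Chebyshev step for a symmetrically distributed error, and the invariance of such a bound under rescaling of the data range. The sketch below does not reproduce bounds (6) to (9). It only checks the standard fact that, for a zero-mean symmetric variable with a given variance, the probability of falling below -a is at most the variance divided by 2a². Uniform noise, the value of `a`, and the scale factor are assumptions made for the demonstration, and the function names are hypothetical.

```python
import random

def one_sided_chebyshev_bound(variance: float, a: float) -> float:
    """Upper bound on P(q <= -a) for zero-mean, symmetrically distributed q and a > 0."""
    return min(1.0, variance / (2.0 * a * a))

def empirical_tail(samples, a: float) -> float:
    """Fraction of samples at or below -a."""
    return sum(1 for q in samples if q <= -a) / len(samples)

random.seed(0)
half_width = 0.5                          # assumed: noise uniform on [-0.5, 0.5]
variance = (2 * half_width) ** 2 / 12.0   # variance of a uniform distribution
samples = [random.uniform(-half_width, half_width) for _ in range(100_000)]
a = 0.3                                   # assumed magnitude of the predicted value

tail = empirical_tail(samples, a)
bound = one_sided_chebyshev_bound(variance, a)

# Rescale the data range: noise and predicted value grow by the same factor.
scale = 8.0
scaled_tail = empirical_tail([scale * q for q in samples], scale * a)
scaled_bound = one_sided_chebyshev_bound(scale ** 2 * variance, scale * a)

print("empirical tail:", tail, "bound:", bound)
print("after rescaling:", scaled_tail, "bound:", scaled_bound)
```

Scaling multiplies the variance and the squared magnitude by the same factor, so the ratio, and therefore the bound, is unchanged.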
Appendix B Experiment Settings for PSG in Section 4.4
Instead of using the default training settings described in Section 4.1, we use a learning rate of 0.03 and a weight decay of 0.0005 for SignSGD [signSGD] and PSG in Section 4.4, which we found optimal for most cases where SignSGD was involved (PSG also uses SignSGD because it predicts the sign to replace weight gradients). During the experiments, we found it somewhat tricky to find a suitable learning rate. Because both SignSGD and PSG use the sign of the gradients to update weights, they demand a smaller learning rate, especially as performance improves and the gradients approach zero. This setting is consistent with the observation in [signSGD] that the learning rate for SignSGD should be appropriately smaller than that for the baseline algorithm.
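To see why sign-based updates call for a small learning rate, consider a minimal sketch of one SignSGD step with the settings above. How weight decay enters the update is not specified here, so folding it into the gradient before taking the sign is an assumption of this sketch, and `signsgd_step` is a hypothetical helper name.

```python
def sign(v: float) -> int:
    """Return -1, 0, or 1."""
    return (v > 0) - (v < 0)

def signsgd_step(weights, grads, lr=0.03, weight_decay=0.0005):
    """One SignSGD update. Weight decay is added to the gradient before the sign (assumed)."""
    return [w - lr * sign(g + weight_decay * w) for w, g in zip(weights, grads)]

weights = [1.0, -1.0, 0.5]
grads = [0.2, -0.3, 1e-6]   # the last gradient is nearly zero
print(signsgd_step(weights, grads))
```

Every weight moves by exactly the learning rate, regardless of how small its gradient is. Near convergence, where gradients approach zero, this fixed step size can overshoot, which is consistent with the need for a smaller learning rate noted above.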
Influence of the adaptive threshold. We use a ratio of 0.05 in the experiments where PSG is applied, i.e., the actual threshold used for one layer to screen out small, and therefore unreliable, gradient predictions is chosen as 0.05 times the maximum among all gradient predictions in that layer. This is an empirical choice. We find that the effectiveness of PSG is insensitive to the choice of this ratio when it is constrained to a proper range, e.g. . If the ratio is too small, however, it will result in too-frequent coarse gradients and might hurt convergence.
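The per-layer screening rule can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: the "maximum among all gradient predictions" is taken over magnitudes, and a screened-out prediction falls back to the sign of the full-precision gradient. The function names are hypothetical.

```python
def sign(v: float) -> int:
    """Return -1, 0, or 1."""
    return (v > 0) - (v < 0)

def reliable_mask(predicted_grads, ratio=0.05):
    """Return (threshold, mask); mask[i] is True when prediction i passes the layer threshold."""
    if not predicted_grads:
        return 0.0, []
    threshold = ratio * max(abs(g) for g in predicted_grads)
    return threshold, [abs(g) >= threshold for g in predicted_grads]

def screened_signs(predicted_grads, full_grads, ratio=0.05):
    """Use the predicted sign where reliable, else the full-precision sign (assumed fallback)."""
    _, mask = reliable_mask(predicted_grads, ratio)
    return [sign(p) if ok else sign(f)
            for p, f, ok in zip(predicted_grads, full_grads, mask)]

predicted = [0.8, -0.02, 0.05, -0.4, 0.0]     # hypothetical MSB-based predictions for one layer
full = [0.81, 0.01, 0.06, -0.41, -0.003]      # hypothetical full-precision gradients

threshold, mask = reliable_mask(predicted)
print("threshold:", threshold)
print("reliable: ", mask)
print("signs:    ", screened_signs(predicted, full))
```

With a smaller ratio, more low-magnitude predictions pass the threshold and their coarse signs are used directly, which matches the caution above about choosing the ratio too small.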
Appendix C SLU Implementation Details
In our implementation of SLU (Figure 8), we adopt the recurrent gates (RNNGates) of [wang2018skipnet]. Each gate is composed of a global average pooling followed by a linear projection that reduces the features to a 10-dimensional vector, as depicted in Figure 9. A Long Short-Term Memory (LSTM) [gers1999learning] network that contains a single layer of dimension 10 is then applied to generate a binary scalar. As mentioned in [wang2018skipnet], this RNN gating network design incurs a negligible overhead compared to its feed-forward counterpart (0.04% vs. 12.5% of the computation of the residual blocks when the baseline architecture is a ResNet). To further reduce the energy cost of loading parameters into memory, all RNN gates in SLU share the same weights.
Training with SLU + SMD: We further evaluate the performance of combining the SLU and SMD techniques. As shown in Fig. 7, training with SLU + SMD consistently boosts the inference accuracy further while reducing the training energy cost. For example, compared to the SD baseline, SLU + SMD can improve the inference accuracy by , while costing lower energy.
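The gate structure described at the start of this appendix (global average pooling, a projection to 10 dimensions, and a single-layer LSTM of dimension 10 whose weights are shared across blocks) can be sketched in plain Python. This is an illustrative sketch with random, untrained weights, not the implementation used in the experiments. The class name `SharedRNNGate`, the omission of bias terms, and the final projection followed by a 0.5 cut-off to obtain the binary scalar are all assumptions.

```python
import math
import random

HIDDEN = 10  # gate dimension stated in the text

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def global_avg_pool(feature_map):
    """feature_map: list of channels, each an H x W nested list -> one mean per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]

class SharedRNNGate:
    """Hypothetical gate: pooled features -> 10-d projection -> LSTM cell -> binary decision."""

    def __init__(self, channels, seed=0):
        rng = random.Random(seed)
        self.proj = rand_matrix(HIDDEN, channels, rng)
        # One matrix per LSTM gate (input, forget, cell, output), acting on [x; h]; biases omitted.
        self.w = {k: rand_matrix(HIDDEN, 2 * HIDDEN, rng) for k in "ifgo"}
        self.out = [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
        self.h = [0.0] * HIDDEN   # hidden state carried from block to block
        self.c = [0.0] * HIDDEN   # cell state carried from block to block

    def __call__(self, feature_map):
        x = matvec(self.proj, global_avg_pool(feature_map))
        xh = x + self.h           # list concatenation: [x; h]
        i = [sigmoid(z) for z in matvec(self.w["i"], xh)]
        f = [sigmoid(z) for z in matvec(self.w["f"], xh)]
        g = [math.tanh(z) for z in matvec(self.w["g"], xh)]
        o = [sigmoid(z) for z in matvec(self.w["o"], xh)]
        self.c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, self.c, i, g)]
        self.h = [ok * math.tanh(ck) for ok, ck in zip(o, self.c)]
        score = sigmoid(sum(a * b for a, b in zip(self.out, self.h)))
        return 1 if score >= 0.5 else 0   # 1: run the block, 0: skip it (assumed convention)

rng = random.Random(1)
gate = SharedRNNGate(channels=16)         # the same gate object is reused for every block
decisions = []
for _ in range(3):                        # three hypothetical residual blocks
    fmap = [[[rng.random() for _ in range(4)] for _ in range(4)] for _ in range(16)]
    decisions.append(gate(fmap))
print(decisions)
```

Reusing one gate object for all blocks mirrors the weight sharing described above: only a single set of gate parameters has to be loaded, while the recurrent state carries information from earlier blocks to later ones.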