The increasing penetration of intelligent sensors has revolutionized how the Internet of Things (IoT) works. For visual data analytics, we have witnessed the record-breaking predictive performance achieved by convolutional neural networks (CNNs) [AlexNet, Object, DeepFace]. Although such high-performance CNN models are initially learned in data centers and then deployed to IoT devices, there is an increasing need for the model to continue learning and updating itself in situ, e.g., for personalization to different users, or for incremental/lifelong learning. Ideally, this learning/retraining process should take place on device. Compared to cloud-based retraining, training locally avoids transferring data back and forth between data centers and IoT devices, reduces communication cost/latency, and enhances privacy.
However, training on IoT devices is non-trivial: it is more resource-consuming yet much less explored than inference. IoT devices, such as smartphones and wearables, have limited computation and energy resources, which are stringent even for inference. Training a CNN consumes orders of magnitude more computation than a single inference. For example, training ResNet-50 on only one 224×224 image can take up to 12 GFLOPs (vs. 4 GFLOPs for inference), which can easily drain a mobile phone battery when training with batches of images [Yang_2017]. This mismatch between the limited resources of IoT devices and the high complexity of CNNs is only getting worse, because network structures are becoming more complex as they are designed to solve harder and larger-scale tasks [Google_multimodel].
This paper considers the most standard CNN training setting, assuming both the model structure and the dataset to be given. This “basic” training setting is not usually the realistic IoT case, but we address it as a starting point (with familiar benchmarks), and as an opening door towards a toolbox that may later be extended to online/transfer learning too (see Section 5). Our goal is to reduce the total energy cost in training, which is complicated by a myriad of factors: from per-sample (mini-batch) complexity (both feed-forward and backward computations), to the empirical convergence rate (how many epochs it takes to converge), and more broadly, hardware/architecture factors such as data access and movements [Eyeriss_JSSC, Boris]. Despite a handful of works on efficient, accelerated CNN training [goyal, Cho2017PowerAID, You_2018, Akiba2017ExtremelyLM, jia2018highly], they mostly focus on reducing the total training time in resource-rich settings, such as by distributed training in large-scale GPU clusters. In contrast, our focus is to trim down the total energy cost of in-situ, resource-constrained training. This represents an orthogonal (and less studied) direction to [goyal, Cho2017PowerAID, You_2018, Akiba2017ExtremelyLM, jia2018highly, gupta2015deep, wang2018training], although the two can certainly be combined.
To unleash the potential of more energy-efficient in-situ training, we look at the full CNN training lifecycle closely. With the goal to “squeeze out” unnecessary costs, we raise three curious questions:
Q1: Are all samples always required throughout training: is it necessary to use all training samples in all epochs?
Q2: Are all parts of the entire model equally important during training: does every layer or filter have to be updated every time?
Q3: Are precise gradients indispensable for training: can we efficiently compute and update the model with approximate gradients?
The above three questions only represent our “first stab” at exploring energy-efficient training, whose full scope is much broader. By no means do these questions cover all possible directions. We envision that many other recipes can be blended in too, such as training at lower bit precision or input resolution [banner2018scalable, chin2019adascale]. We also recognize that energy-efficient CNN training should be jointly considered with hardware/architecture co-design [wu2018l1, hoffer2018norm], which is beyond the scope of the current work.
Motivated by the above questions, this paper proposes a novel energy efficient CNN training framework dubbed E-Train. It consists of three complementary aspects of efforts to trim down unnecessary training computations and data movements, each addressing one of the above questions:
Data-Level: Stochastic mini-batch dropping (SMD).
We show that CNN training can be accelerated by a “frustratingly easy” strategy: randomly skipping mini-batches with probability 0.5 throughout training. This can be interpreted as data sampling with (limited) replacement, and is found to incur minimal accuracy loss (sometimes even a gain).
Model-Level: Input-dependent selective layer update (SLU). For each mini-batch, we select a different subset of CNN layers to be updated. The input-adaptive selection is based on a low-cost gating function jointly learned during training. While similar ideas were explored for efficient inference [wang2018skipnet], this is the first time the idea is applied and evaluated for training.
Algorithm-Level: Predictive sign gradient descent (PSG). We explore the usage of an extremely low-precision gradient descent algorithm called SignSGD, which has recently found both theoretical and experimental grounding [signSGD]. The original algorithm still requires the full gradient computation and therefore does not save energy. We create a novel “predictive” variant that obtains the sign without computing the full gradient, via low-cost, bit-level prediction. Combined with a mixed-precision design, it decreases both computation and data movement costs.
Beyond these mainly experimental explorations, we find E-Train has many interesting links to recent CNN training theories, e.g., [chaudhari2018stochastic, samples_equal, zhang2019all, lottery]. We evaluate E-Train in comparison with its closest state-of-the-art competitors. To measure its actual performance, E-Train is also implemented and evaluated on an FPGA board. The results show that models trained with E-Train consistently achieve higher training energy efficiency with marginal accuracy drops.
2 Related Work
Accelerated CNN training. A number of works have been devoted to accelerating training, in a resource-rich setting, by utilizing communication-efficient distributed optimization and larger mini-batch sizes [goyal, Cho2017PowerAID, You_2018, Akiba2017ExtremelyLM]. The latest work [jia2018highly] combined distributed training with a mixed-precision framework, training AlexNet within 4 minutes. However, their goals and settings are distinct from ours: while the distributed training strategy can reduce time, it actually incurs more total energy overhead, and is clearly not applicable to on-device, resource-constrained training.
Low-precision training. It is well known that CNN training can be performed at substantially lower precision [banner2018scalable, wang2018training, gupta2015deep], rather than using full-precision floats. Specifically, training with quantized gradients has been well studied in distributed learning, where the main motivation is to reduce the communication cost of gradient aggregation between workers [seide20141, alistarh2016qsgd, zhang2017zipml, de2017understanding, terngrad, signSGD]. A few works considered transmitting only the gradient coordinates with large magnitudes [aji2017sparse, lin2017deep, wangni2018gradient]. Recently, the SignSGD algorithm [seide20141, signSGD] even showed the feasibility of using one-bit gradients (signs) during training without notably hampering the convergence rate or final result. However, most of these algorithms are optimized for distributed communication efficiency, rather than for reducing training energy costs. Many of them, including [signSGD], need to first compute full-precision gradients and then quantize them.
Efficient CNN inference: Static and Dynamic. Compressing CNNs and speeding up their inference have attracted major research interest in recent years. Representative methods include weight pruning, weight sharing, layer factorization, and bit quantization, to name just a few [han2015deep, he2017channel, yu2017scalpel, DeepkMeans, chen2019collaborative].
While model compression presents “static” solutions for improving inference efficiency, a more interesting recent trend looks at dynamic inference [wang2018skipnet, blockdrop, convnet-aig, rnp, gaternet] to reduce latency, i.e., selectively executing subsets of layers in the network conditioned on each input. That sequential decision-making process is usually controlled by low-cost gating or policy networks. This mechanism was also applied to improve inference energy efficiency [energynet, wang2019dual].
In [PredictiveNet], a unique bit-level prediction framework called PredictiveNet was presented to accelerate CNN inference at a lower level. Since CNN layer-wise activations are usually highly sparse, the authors proposed to predict those zero locations using low-cost bit predictors, thereby bypassing a large fraction of energy-dominant convolutions without modifying the CNN structure.
Energy-efficient training is different from and more complicated than its inference counterpart. However, many insights gained from the latter can be lent to the former. For example, the recent work [prunetrain] showed that performing active channel pruning during training can accelerate the empirical convergence. Our proposed model-level SLU is inspired by [wang2018skipnet]. The algorithm-level PSG also inherits the idea of bit-level low-cost prediction from [PredictiveNet].
3 The Proposed Framework
3.1 Data-Level: Stochastic mini-batch dropping (SMD)
We first adopt a straightforward, seemingly naive, yet surprisingly effective stochastic mini-batch dropping (SMD) strategy (see Fig. 1) to aggressively reduce the training cost by letting training see fewer mini-batches. At each epoch, SMD simply skips each mini-batch with a default probability of 0.5. All other training protocols, such as the learning rate schedule, remain unchanged. Compared to normal training, SMD directly halves the training cost if both are trained for the same number of epochs. Yet surprisingly, we observe in our experiments that SMD usually leads to a negligible accuracy decrease, and sometimes even an increase (see Sec. 4). Why? We discuss possible explanations below.
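The whole strategy fits in a few lines. Below is a minimal sketch (the function and `train_step` names are ours, not from the paper):

```python
import random

def smd_epoch(minibatches, train_step, p_drop=0.5, rng=None):
    """One SMD epoch: independently skip each mini-batch with probability
    p_drop (default 0.5); all other training protocol stays unchanged."""
    rng = rng or random.Random()
    processed = 0
    for batch in minibatches:
        if rng.random() < p_drop:
            continue  # skipped: no forward pass, no backward pass, no update
        train_step(batch)  # the usual SGD step on this mini-batch
        processed += 1
    return processed
```

In expectation, only half the mini-batches are processed per epoch, which is where the 2× per-epoch cost reduction comes from.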
SMD can be interpreted as sampling with limited replacement. To understand this, think of combining two consecutive SMD-enforced epochs into one; it then has the same number of mini-batches as one full epoch, but within it each training sample now has probability 0.25, 0.5, and 0.25 of being sampled 2, 1, and 0 times, respectively. The conventional wisdom is that for stochastic gradient descent (SGD), in each epoch, the mini-batches are sampled from the data without replacement (i.e., each sample occurs exactly once per epoch) [bertsekas2011incremental, shamir2016without, bengio2012practical, recht2012beneath, gurbuzbalaban2015random]. However, [chaudhari2018stochastic] proved that sampling mini-batches with replacement has a larger variance than sampling without replacement, and consequently SGD may have better regularization properties.
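The 0.25/0.5/0.25 split can be checked numerically: a sample's appearance count over two SMD epochs is Binomial(2, 0.5). A small simulation (names are ours):

```python
import random

def sample_counts_two_epochs(trials=100000, p_keep=0.5, seed=0):
    """Simulate one training sample across two SMD epochs: in each epoch its
    mini-batch survives with probability p_keep, so the total appearance
    count over the two epochs is Binomial(2, p_keep)."""
    rng = random.Random(seed)
    counts = {0: 0, 1: 0, 2: 0}
    for _ in range(trials):
        c = (rng.random() < p_keep) + (rng.random() < p_keep)  # bool + bool
        counts[c] += 1
    return {k: v / trials for k, v in counts.items()}
```

With p_keep = 0.5, the empirical frequencies converge to 0.25, 0.5, and 0.25 for counts 0, 1, and 2, matching the argument above.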
Alternatively, SMD can also be viewed as a special form of data augmentation that injects more sampling noise, perturbing the training distribution every epoch. Past works [ge2015escaping, keskar2016large, daneshmand2018escaping] have shown that certain kinds of random noise aid convergence by escaping saddle points or poorly generalizing minima. The structured sampling noise caused by SMD might similarly aid exploration.
Besides, [samples_equal, johnson2018training, coleman2019select] also showed that an importance sampling scheme focusing training on more “informative” examples leads to faster convergence under resource budgets. This implies that mini-batch dropping could be made selective, based on some informativeness criterion, instead of stochastic. We use SMD because it has zero overhead, but more effective dropping options might be available if low-cost indicators of mini-batch importance can be identified; we leave this as future work.
3.2 Model-Level: Input-dependent selective layer update (SLU)
[wang2018skipnet] proposed to dynamically skip a subset of layers for different inputs, in order to adaptively accelerate feed-forward inference. However, [wang2018skipnet] called for a post-process after supervised training, i.e., refining the dynamic skipping policy via reinforcement learning, thus causing undesired extra training overhead. We propose to extend the idea of dynamic inference to the training stage, i.e., dynamically skipping a subset of layers during both feed-forward and back-propagation. Crucially, we show that by adding an auxiliary regularization, such dynamic skipping can be learned from scratch and obtain satisfactory performance: no post refinement nor extra training iterations are required. That is critical for dynamic layer skipping to be useful for energy-efficient training. We term this extended scheme input-dependent selective layer update (SLU).
As depicted in Fig. 1, given a base CNN to be trained, we follow [wang2018skipnet] and add a light-weight RNN gating network per layer block. Each gate takes the same input as its corresponding layer and outputs a soft gating value in [0,1] for that layer, which is then used as the selection probability: the higher the value, the more likely that layer will be selected. Therefore, each layer is adaptively selected or skipped depending on the input, and we only update the layers activated by their gates. These RNN gates cost less than 0.04% of the base models' feed-forward FLOPs, hence their energy overheads are negligible. More details can be found in the supplementary material.
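A minimal sketch of the gating mechanism (names are ours; the real gates are small RNNs over feature maps, reduced here to plain callables, and the residual structure is assumed):

```python
import random

def slu_step(x, blocks, gates, rng=None):
    """Input-dependent selective layer update (sketch). Each low-cost gate
    maps the block's input to a selection probability in [0, 1]; only
    selected residual blocks execute (and would later be updated)."""
    rng = rng or random.Random()
    selected = []
    for block, gate in zip(blocks, gates):
        p = gate(x)                  # soft gating output in [0, 1]
        if rng.random() < p:         # higher p => layer more likely selected
            x = x + block(x)         # residual block runs feed-forward
            selected.append(True)
        else:
            selected.append(False)   # skipped: no compute, no gradient
    return x, selected
```

The `selected` mask is what back-propagation would consult: skipped blocks contribute neither forward compute nor weight updates for this mini-batch.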
[wang2018skipnet] first trained the gates in a supervised way together with the base model. Observing that such learned routing policies were often not sufficiently efficient, they used reinforcement learning post-processing to learn more aggressive skipping afterwards. While this is fine for the end goal of dynamic inference, we hope to get rid of the post-processing overhead. To overcome this hurdle, we incorporate a computational complexity regularization into the objective function, defined as

L(θ, φ) = L_pred(θ, φ) + λ L_flops(φ).   (1)

Here, λ is a weighting coefficient of the computational complexity regularization, and θ and φ denote the parameters of the base model and the gating network, respectively. Also, L_pred denotes the prediction loss, and L_flops is calculated by accumulating the computational cost (FLOPs) of the layers that are selected. The regularization explicitly encourages learning more “parsimonious” selections throughout training. We find that such SLU-regularized training takes almost the same number of epochs to converge as standard training, i.e., SLU does not sacrifice empirical convergence speed. As a side effect, SLU naturally yields CNNs with dynamic inference capability. Though not the focus of this paper, we find the CNN trained with SLU reaches a comparable accuracy-efficiency trade-off to one trained with the approach in [wang2018skipnet].
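The regularized objective can be sketched as follows (a minimal sketch; `pred_loss`, the per-layer FLOP counts, and the selection mask are hypothetical stand-ins):

```python
def slu_objective(pred_loss, layer_flops, selected, lam):
    """Prediction loss plus a FLOPs regularizer accumulated over the layers
    the gates selected this step; lam weights the compute penalty."""
    flops_cost = sum(f for f, s in zip(layer_flops, selected) if s)
    return pred_loss + lam * flops_cost
```

A larger lam pushes the gates toward selecting fewer (cheaper) layers, trading a little accuracy for training energy.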
The practice of SLU seems to align with several recent theories on CNN training. In [layers_equal], the authors suggested that “not all layers are created equal” for training: some layers are critical and must be intensively updated to improve final predictions, while others are insensitive throughout training. There exist “non-critical” layers that barely change their weights during training: even resetting those layers in a trained model to their initial values has few negative consequences. The more recent work [lottery] further confirmed this phenomenon, though how to identify non-critical model parts at an early training stage remains unclear. [veit2016residual, greff2016highway] also observed that different samples might activate different sub-models. These inspiring theories, combined with dynamic inference practice, motivate us to propose SLU for more efficient training.
3.3 Algorithm-Level: Predictive sign gradient descent (PSG)
It is well recognized that low-precision fixed-point implementation is a very effective knob for achieving energy-efficient CNNs, because both CNNs' computational and data movement costs are approximately quadratic functions of the employed precision. For example, a state-of-the-art design [horowitz20141] shows that adopting 8-bit precision for a multiplication, an addition, and a data movement can reduce the energy cost by 95%, 97%, and 75%, respectively, compared to a 32-bit floating-point design evaluated in a commercial 45nm CMOS technology.
The successful adoption of extremely low-precision (binary) gradients in SignSGD [signSGD] is appealing, as it can reduce both weight update computation and data movement. However, directly applying the original SignSGD algorithm for training will not save energy, because it computes the full-precision gradient first before taking the signs. We propose a novel predictive sign gradient descent (PSG) algorithm, which predicts the signs of gradients using low-cost bit-level predictors, thereby bypassing the costly full-gradient computation.
We next introduce how the gradients of weights are updated in PSG. Assume the following notations: the full-precision input and the gradient of the output are denoted a and ∂g, and their most significant bits (the MSB parts, which PSG's low-cost predictors adopt) are denoted a_msb and ∂g_msb, respectively. As such, the quantization noises of the input and of the gradient of the output are ε_a = a − a_msb and ε_g = ∂g − ∂g_msb, respectively. Similarly, after back-propagation, we denote the full-precision and low-precision (i.e., taking the most significant bits (MSBs)) gradient of the weight as ∂w and ∂w_msb, respectively, the latter of which is computed using a_msb and ∂g_msb. Then, with an empirically pre-selected threshold δ, PSG updates the i-th weight gradient as follows:

sign(∂w_i) ≈ sign(∂w_msb,i), if |∂w_msb,i| > δ; otherwise, compute ∂w_i at full precision and take its sign.   (2)
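The update in Eq. (2) can be sketched as follows (a minimal scalar sketch with names of our own; `full_grad_fn` stands in for the costlier higher-precision gradient computation that is only triggered on fallback):

```python
def psg_signs(grad_msb, full_grad_fn, delta):
    """Predictive sign gradient (sketch): take the sign from the low-cost MSB
    predictor when its magnitude clears the threshold delta; otherwise fall
    back to the higher-precision gradient and use its sign."""
    signs, fallbacks = [], 0
    for i, g_m in enumerate(grad_msb):
        if abs(g_m) > delta:
            s = 1 if g_m > 0 else -1   # cheap predicted sign, no full gradient
        else:
            g = full_grad_fn(i)        # costly precise gradient, rarely needed
            s = (g > 0) - (g < 0)
            fallbacks += 1
        signs.append(s)
    return signs, fallbacks
```

The energy saving comes from the first branch dominating: the experiments in Sec. 4.4 report the predictor's sign being used at least ~60% of the time.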
Note that in the hardware implementation, the computation to obtain ∂w_msb is embedded within that of ∂w. Therefore, PSG's predictors do not incur energy overhead.
PSG for energy-efficient training. Recent work [banner2018scalable] has shown that most of the training process is robust to reduced precision (e.g., 8 bits instead of 32 bits), except for the weight gradient calculations and updates. Following this finding, we similarly adopt a higher precision for the gradients than for the inputs and weights. Specifically, when training with PSG, we first compute the predictors using a_msb (e.g., 4-bit) and ∂g_msb (e.g., 10-bit), and then update the weights' gradients following Eq. (2). The further energy savings of training with PSG over fixed-point training [banner2018scalable] result from the fact that the predictors computed using a_msb and ∂g_msb require exponentially less computational and data movement energy.
Prediction guarantee of PSG. We analyze the probability of PSG's prediction failure to establish its performance guarantee. Specifically, denoting the probability of a sign prediction failure produced by Eq. (2) as P_e, it can be proved that this probability is upper bounded by a term that decays exponentially with the precision assigned to the predictors; the bound of Eq. (3), in which the quantization noise step sizes of a_msb and ∂g_msb appear, is given in the Appendix along with its constants and proof. This indicates that the failure probability can be made very small if the predictors are designed properly.
Adaptive threshold. Training with PSG might lead to sign flips in the weight gradients compared to floating-point training; these occur only when the floating-point gradient has a small magnitude, so that the quantization noise of the predictors flips the sign. It is therefore important to properly select a threshold (δ in Eq. (2)) that optimally balances this sign-flip probability against the achieved energy savings. Because the dynamic range of gradients differs significantly from layer to layer, we adopt an adaptive threshold selection strategy: instead of using a fixed number, we tune a ratio that yields a per-layer adaptive threshold δ scaled to that layer's gradient range.
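A minimal sketch of one plausible per-layer threshold rule consistent with the description above (the proportional-to-range scaling and all names are our assumptions, not necessarily the paper's exact rule):

```python
def adaptive_threshold(layer_grad_msb, ratio):
    """Hypothetical per-layer threshold: scale a single tuned ratio by this
    layer's predicted-gradient magnitude range, so layers with different
    gradient dynamic ranges get proportionally different thresholds."""
    return ratio * max(abs(g) for g in layer_grad_msb)
```

One tuned ratio then adapts automatically across layers, which is the point of the strategy: no per-layer hand tuning is needed.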
4 Experiments
4.1 Experiment setup
Datasets: We evaluate our proposed techniques on two datasets: CIFAR-10 and CIFAR-100. Common data augmentation methods (e.g., mirroring/shifting) are adopted, and data are normalized as in [krizhevsky2009learning]. Models: Three popular backbones, ResNet-74, ResNet-110 [he2016deep], and MobileNetV2 [s2018mobilenetv2], are considered. For evaluating each of the three proposed techniques (i.e., SMD, SLU, and PSG), we consider various experimental settings using ResNet-74 and the CIFAR-10 dataset for ablation studies, as described in Sections 4.2-4.5. ResNet-110 and MobileNetV2 results are reported in Section 4.6. Top-1 accuracies are measured for CIFAR-10, and both top-1 and top-5 accuracies for CIFAR-100. Training settings: We adopt the training settings in [he2016deep] for the baseline default configurations. Specifically, we use SGD with a momentum of 0.9 and a weight decay factor of 0.0001, and the initialization introduced in [he2015delving]. Models are trained for 64k iterations. For experiments where PSG is used, the initial learning rate is reduced, as SignSGD [signSGD] suggested that small learning rates benefit convergence. For the others, the learning rate is initially set to 0.1 and then decayed by 10 at the 32k and 48k iterations, respectively. When PSG is adopted, we also employ the stochastic weight averaging (SWA) technique [yang2019swalp], which we found to notably stabilize training.
Real energy measurements using FPGA: The energy cost of CNN inference/training consists of both computational and data movement costs; the latter is often dominant but cannot be captured by commonly used metrics such as the number of FLOPs [Eyeriss_JSSC]. We therefore evaluate the proposed techniques against the baselines in terms of accuracy and real measured energy consumption. Specifically, unless otherwise specified, all energy figures and energy savings are obtained through real measurements, by training the corresponding models and datasets on a state-of-the-art FPGA [zed], a Digilent ZedBoard Zynq-7000 ARM/FPGA SoC development board. Fig. 2 shows our FPGA measurement setup, in which the FPGA board is connected to a laptop through a serial port and a power meter. In particular, the training settings are downloaded from the laptop to the FPGA board, the real-measured energy consumption of the whole training process is obtained via the power meter, and the result is then sent back to the laptop.
4.2 Evaluating stochastic mini-batch dropping
We first validate the energy savings achieved by SMD against a few “off-the-shelf” options: (1) can we train with the standard algorithm, using fewer iterations and an otherwise unchanged training protocol? (2) can we train with the standard algorithm, using fewer iterations but properly increased learning rates? Two sets of carefully designed experiments are presented below to address these questions.
Training with SMD vs. standard mini-batch (SMB): We first evaluate SMD against standard mini-batch (SMB) training, which uses all (vs. 50% in SMD) mini-batch samples. As shown in Fig. LABEL:fig:SMDablation_a, when the energy ratio is 1 (i.e., training with SMB for 64k iterations vs. SMD for 128k iterations), the proposed SMD technique boosts the inference accuracy by 0.39% over the standard approach.
We next “naively” suppress the energy cost of SMB by reducing its training iterations to a fraction of the original. Note that the learning rate schedule (e.g., when to reduce learning rates) is scaled proportionally with the total iteration number. For comparison, we conduct experiments of training with SMD where the number of equivalent training iterations matches those of the SMB cases. Fig. LABEL:fig:SMDablation_a shows that training with SMD consistently achieves a higher inference accuracy than SMB, with a margin ranging from 0.39% to 0.86%. Furthermore, training with SMD reduces the training energy cost while boosting the inference accuracy (compare SMD under the energy ratio of 0.67 vs. SMB under the energy ratio of 1 in Fig. LABEL:fig:SMDablation_a). We adopt SMD under this energy ratio in all the remaining experiments.
Dataset     Backbone     Accuracy (SMB)   Accuracy (SMD)
CIFAR-10    ResNet-110   92.75%           93.05%
CIFAR-100   ResNet-74    71.11%           71.37%
We repeated training ResNet-74 on CIFAR-10 using SMD 10 times with different random initializations. The accuracy standard deviation is only 0.132%, showing high stability. We also conducted more experiments with different backbones and datasets. As shown in Tab. 4.2, SMD is consistently better than SMB.
Training with SMD vs. SMB + increased learning rates: We further compare with SMB using tuned, larger learning rates, conjecturing that this would accelerate convergence by reducing the needed training epochs. Results are summarized in Fig. LABEL:fig:SMDablation_b. Specifically, when the number of iterations is reduced, we do a grid search over learning rates. All compared methods are given the same training energy budget. Fig. LABEL:fig:SMDablation_b demonstrates that while increasing learning rates does improve SMB's energy efficiency over sticking to the original protocol, our proposed SMD still maintains a clear advantage.
4.3 Evaluating selective layer update
Our current SLU experiments are based on CNNs with residual connections, partially because they dominate SOTA CNNs; we will extend SLU to other model structures in future work. We evaluate the proposed SLU by comparing it with stochastic depth (SD) [huang2016deep], a technique originally developed for effectively training very deep networks by updating only a random subset of layers at each mini-batch. It can be viewed as a “random” version of SLU (which uses learned layer selection). We follow all suggested settings in [huang2016deep]. For a fair comparison, we adjust SD's hyper-parameter [huang2016deep] so that its dropping ratio always matches SLU's.
From Fig. 6, training with SLU consistently achieves higher inference accuracies than SD when their training energy costs are the same. It is further encouraging to observe that training with SLU can sometimes even achieve higher accuracy in addition to saving energy. For example, comparing training with SLU at an energy ratio of 0.3 against SD at an energy ratio of 0.5, the proposed SLU technique reduces the training energy cost while boosting the inference accuracy.
These results endorse the usage of data-driven gates instead of random dropping in the context of energy-efficient training. Training with SLU + SMD combined further boosts the accuracy while reducing the energy cost. Furthermore, 20 trials of SLU experiments applying ResNet-38 to CIFAR-10 conclude that, at a 95% confidence level, the confidence intervals for the mean top-1 accuracy and energy saving are [92.47%, 92.58%] (baseline: 92.50%) and [39.55%, 40.52%], respectively, verifying SLU's effectiveness.
4.4 Evaluating predictive sign gradient descent
We evaluate PSG against two alternatives: (1) the 8-bit fixed-point training proposed in [banner2018scalable]; and (2) the original SignSGD [signSGD]. For all experiments in Sections 4.4 and 4.5, we adopt 8-bit precision for the activations/weights and 16-bit for the gradients. The corresponding precisions of the predictors are 4-bit and 10-bit, respectively. We use an adaptive threshold ratio (see Section 3.3). More experiment details are in the Appendix.
As shown in Table 3, the 8-bit fixed-point training in [banner2018scalable] saves training energy (going from 32-bit to 8-bit in general yields large energy savings, which are compromised here by its employed 32-bit gradients) with a marginal accuracy loss compared to 32-bit SGD; the proposed PSG almost doubles those training energy savings with a negligible accuracy loss. Interestingly, compared to SignSGD [signSGD], PSG slightly boosts the inference accuracy while saving energy, i.e., better training energy efficiency with a slightly better inference accuracy. Besides, we observed that the ratio of gradients whose signs are predicted using ∂w_msb typically remains at least 60% throughout the training process, given the adaptive threshold.
4.5 Evaluating E-Train: SMD + SLU + PSG
We now evaluate the proposed E-Train framework, which combines the SMD, SLU, and PSG techniques. As shown in Table 3, E-Train: (1) can further boost performance compared to training with SMD+SLU alone (e.g., E-Train achieves a higher accuracy than SMD+SLU at the same energy savings; see Fig. 6 at the energy ratio of 0.2); and (2) can achieve extremely aggressive energy savings while incurring only small accuracy losses compared to 32-bit floating-point SGD (see Table 3), i.e., much better training energy efficiency with small accuracy loss.
Impact on empirical convergence speed. We plot the training convergence curves of the different methods in Fig. 7, with the x-axis showing the cumulative training energy cost up to the current iteration. We observe that E-Train does not slow down empirical convergence; in fact, it even makes the training loss decrease faster in the early stage.
Experiments on adapting a pre-trained model. We perform a proof-of-concept experiment for CNN fine-tuning by splitting the CIFAR-10 training set into two halves, with each class split evenly and i.i.d. We first pre-train ResNet-74 on the first half, then fine-tune it on the second half. During fine-tuning, we compare two energy-efficient options: (1) fine-tuning only the last FC layer using standard training; and (2) fine-tuning all layers using E-Train. With all hyperparameters tuned to best efforts, the two fine-tuning methods improve the pre-trained model's top-1 accuracy by 0.30% and 1.37%, respectively, while (2) saves 61.58% more energy (FPGA-measured) than (1). This shows that E-Train is the preferred option: higher accuracy and more energy savings.
Table 4 evaluates E-Train and its ablation baselines on more models and datasets. The conclusions align with the ResNet-74 cases. Remarkably, on CIFAR-10 with ResNet-110, E-Train saves over 83% energy with only 0.56% accuracy loss. When saving over 91% of the energy (i.e., more than 10×), the accuracy drop is still less than 2%. On CIFAR-100 with ResNet-110, E-Train even surpasses the baseline on both top-1 and top-5 accuracy while saving over 84% energy. More notably, E-Train is also effective for compact networks: it saves about 90% of the energy cost while achieving comparable accuracy when adopted for training MobileNetV2.
5 Discussion of Limitations and Future Work
We propose the E-Train framework to achieve energy-efficient CNN training in resource-constrained settings. Three complementary efforts to trim down training costs, at the data, model, and algorithm levels respectively, are carefully designed, justified, and integrated. Experiments in both simulation and on a real FPGA demonstrate the promise of E-Train. Despite this preliminary success, we are aware of several limitations of E-Train, which also point us to a future road map. For example, E-Train is currently designed and evaluated for standard off-line CNN training, with all training data presented in batch, for simplicity. This is not scalable to many real-world IoT scenarios, where new training data arrives sequentially as a stream, with limited or no data buffer/storage, leading to the open challenge of "on-the-fly" CNN training [sahoo2017online]. In this case, while both SLU and PSG are still applicable, SMD needs to be modified, e.g., by one-pass active selection of streamed-in data samples. Besides, extending SLU to plain CNNs without residual connections is not yet straightforward. We expect finer-grained selective model updates, such as online channel pruning [prunetrain], to be useful alternatives here. We also plan to optimize E-Train for continuous adaptation or lifelong learning.
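As a concrete illustration of the "one-pass active selection" direction mentioned above, the simplest possible baseline is uniform reservoir sampling over the stream; a practical SMD-style selector would likely weight samples by informativeness, but the one-pass buffering skeleton would look similar. This sketch is ours, not part of E-Train:

```python
import random

def reservoir_select(stream, buffer_size, seed=0):
    """One-pass uniform selection of `buffer_size` samples from a stream
    of unknown length (classic reservoir sampling). Each item is seen
    exactly once and either buffered or discarded, matching the
    limited-storage constraint of on-the-fly training."""
    rng = random.Random(seed)
    buffer = []
    for i, sample in enumerate(stream):
        if len(buffer) < buffer_size:
            buffer.append(sample)           # fill the buffer first
        else:
            j = rng.randint(0, i)           # keep sample with prob size/(i+1)
            if j < buffer_size:
                buffer[j] = sample
    return buffer
```

An importance-aware variant would replace the uniform acceptance probability with one driven by, e.g., per-sample loss, at the cost of an extra forward pass per streamed sample.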
The work is in part supported by the NSF RTML grant (1937592, 1937588). The authors would like to thank all anonymous reviewers for their tremendously useful comments to help improve our work.
Appendix A PSG Prediction Error Rate Bound Analysis
Weight gradient calculation during back-propagation. Consider a convolutional layer with weights and no bias (as is the usual case for modern deep CNNs), with a given input and output. During one pass of back-propagation, given the gradient propagated from its succeeding layer, we compute the gradient of the weights. A single entry of the weight gradient can be represented as the sum of a series of inner products over the corresponding locations of the layer input and the propagated gradient. For simplicity, and with a slight abuse of notation, one entry of the gradient can be represented as:
where the sum iterates over the mini-batch and N is the mini-batch size. The MSB parts used to predict the gradient signs are denoted with hats, at their respective reduced precisions; the corresponding quantization noise terms are the residuals between the full-precision values and their MSB parts. The gradient calculated using (4) from the MSB parts is the predicted gradient. The gradient error, i.e., the difference between the predicted and exact gradients, can then be approximated with
Here the second order noise term is neglected because it is small.
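Spelling out the approximation (a reconstruction from the definitions above; the typeset equations are missing from this copy, so the symbols here, x_n, δ_n, and the ε noise terms, are our own labels):

```latex
% g: one exact weight-gradient entry; hats denote MSB (low-precision) parts;
% \epsilon_{x,n}, \epsilon_{\delta,n}: the corresponding quantization noises.
g = \frac{1}{N}\sum_{n=1}^{N} x_n \delta_n, \qquad
\hat{g} = \frac{1}{N}\sum_{n=1}^{N} \hat{x}_n \hat{\delta}_n
        = \frac{1}{N}\sum_{n=1}^{N} (x_n - \epsilon_{x,n})(\delta_n - \epsilon_{\delta,n}),

e = \hat{g} - g
  \approx -\frac{1}{N}\sum_{n=1}^{N}
    \left( x_n\,\epsilon_{\delta,n} + \delta_n\,\epsilon_{x,n} \right),
```

where the dropped second-order term is \(\frac{1}{N}\sum_n \epsilon_{x,n}\epsilon_{\delta,n}\), the product of two small quantization noises.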
Sign prediction error probability bound. Denote the sign-prediction failure event given a threshold t; it has three subcases, as shown in Table 5:
Consider Case :
where is the conditional distribution of given , and is the variance of . The inequality follows from Chebyshev's inequality and the fact that is symmetrically distributed. Plugging into (A), we have:
Consider Cases and : following similar derivations, we have:
where and are defined as:
Discussion of the data range. In (3) the data range is assumed to be . When the data range changes, however, the bound does not change, because rescaling the data is equivalent to scaling the numerators and denominators in the derivations above; this corresponds to the adaptive threshold scheme we introduce in Section 3.3.
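For reference, the workhorse inequality in the case analysis above is Chebyshev's bound, which for a zero-mean (here, symmetrically distributed) error e with variance \(\sigma^2\) reads:

```latex
\Pr\left( |e| \ge t \right) \le \frac{\sigma^2}{t^2},
```

so a larger screening threshold t directly tightens the bound on the probability that the quantization error is large enough to flip the predicted sign.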
Appendix B Experiment Settings for PSG in Section 4.4
Instead of using the default training settings described in Section 4.1, we use a learning rate of 0.03 and a weight decay of 0.0005 for SignSGD [signSGD] and PSG in Section 4.4, which we found optimal for most cases when SignSGD was involved (PSG also uses SignSGD, because it predicts the sign to replace weight gradients). During the experiments, we found it somewhat tricky to find a suitable learning rate. Because both SignSGD and PSG use the sign of the gradients to update weights, they demand a smaller learning rate, especially once the performance improves and the gradients approach zero. The above setting is consistent with the observation in [signSGD] that the learning rate for SignSGD should be appropriately smaller than that of the baseline algorithm.
Influence of Adaptive Threshold. We set the adaptive threshold to 0.05 for experiments where PSG is applied, i.e., the real threshold used in one layer to screen out small and therefore unreliable gradient predictions is chosen as 0.05 times the maximum magnitude among all gradient predictions in that layer. This is an empirical choice. We find that the effectiveness of PSG is insensitive to the choice of the threshold when it is constrained to a proper range. If it is too small, however, it will result in overly frequent coarse gradients and might hurt convergence.
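The layer-wise screening rule itself is a one-liner; the following minimal sketch (our own illustration, with a hypothetical `grad_pred` array of per-entry sign predictions) shows what the threshold of 0.05 does:

```python
import numpy as np

def adaptive_mask(grad_pred, eps=0.05):
    """Flag reliable gradient-sign predictions in one layer.

    The per-layer threshold is eps times the largest-magnitude
    prediction in that layer; entries below it are screened out so the
    training step can fall back to a safer update for them.
    """
    threshold = eps * np.max(np.abs(grad_pred))
    return np.abs(grad_pred) >= threshold

# With eps = 0.05 the threshold tracks each layer's own scale, so no
# per-layer hand tuning is needed.
mask = adaptive_mask(np.array([0.9, -0.5, 0.01, 0.0]))  # -> [True, True, False, False]
```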
Appendix C SLU Implementation Details
In our implementation of SLU (Figure 8), we adopt the recurrent gates (RNNGates) from [wang2018skipnet]. Each gate is composed of a global average pooling followed by a linear projection that reduces the features to a 10-dimensional vector. A Long Short-Term Memory (LSTM) [gers1999learning] network containing a single layer of dimension 10 is then applied to generate a binary scalar. As mentioned in [wang2018skipnet], this RNN gating design incurs a negligible overhead compared to its feed-forward counterpart (0.04% vs. 12.5% of the computation of the residual blocks when the baseline architecture is a ResNet). To further reduce the energy cost of loading parameters into memory, all RNNGates in SLU share the same weights.
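The gate's shape bookkeeping (pool, project to 10-d, one LSTM step, binary output) can be sketched in plain NumPy; this is our own shape-level illustration with random weights, not the trained gate or the paper's actual implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class RNNGate:
    """Shared recurrent gate: global-average-pool a feature map, project
    it to a 10-d vector, run one LSTM step, and emit a binary keep/skip
    decision. Weights are random here purely for shape illustration; in
    SLU all gates share one trained set of weights."""

    def __init__(self, in_channels, hidden=10, seed=0):
        rng = np.random.default_rng(seed)
        self.W_proj = rng.normal(scale=0.1, size=(in_channels, hidden))
        # One weight matrix per LSTM gate (input, forget, cell, output),
        # each mapping [projected features; previous hidden] -> hidden.
        self.W = rng.normal(scale=0.1, size=(4, 2 * hidden, hidden))
        self.w_out = rng.normal(scale=0.1, size=hidden)
        self.h = np.zeros(hidden)
        self.c = np.zeros(hidden)

    def __call__(self, feature_map):  # feature_map: (C, H, W)
        v = feature_map.mean(axis=(1, 2)) @ self.W_proj   # pool + project
        z = np.concatenate([v, self.h])
        i, f, g, o = (z @ self.W[k] for k in range(4))    # LSTM cell step
        self.c = sigmoid(f) * self.c + sigmoid(i) * np.tanh(g)
        self.h = sigmoid(o) * np.tanh(self.c)
        return int(sigmoid(self.h @ self.w_out) > 0.5)    # binary scalar
```

Because the hidden state persists across calls, one gate instance can be reused layer-by-layer, which is what makes weight sharing across all gates cheap.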
Training with SLU + SMD: We further evaluate the performance of combining the SLU and SMD techniques. As shown in Fig. 7, training with SLU + SMD consistently boosts the inference accuracy further while reducing the training energy cost, e.g., compared to the SGD baseline, SLU + SMD improves the inference accuracy while costing lower energy.