I Introduction
The maturation of deep learning has enabled on-device intelligence for Internet of Things (IoT) devices. Convolutional neural networks (CNNs), as effective deep learning models, have been intensively deployed on IoT devices to extract information from sensed data in domains such as smart cities [36], smart agriculture [43], and wearable devices [4]. The models are initially trained on high-performance computers (HPCs) and then deployed to IoT devices for inference. However, in the physical world, a statically trained model cannot adapt to dynamic environments and may yield low accuracy on new input instances. On-device training has the potential to learn from the environment and update the model in situ. This enables incremental/lifelong learning [31], which trains an existing model to update its knowledge, and device personalization [32], which learns features from a specific user to improve model accuracy. Federated learning [25] is another application scenario of on-device training, where a large number of devices (typically mobile phones) collaboratively learn a shared model while keeping the training data on personal devices to protect privacy. Since each device still computes the full model update through an expensive training process, the computation cost of training needs to be greatly reduced to make federated learning realistic.
While the efficiency of training on HPCs can always be improved by allocating more computing resources, such as 1024 GPUs [1], training on resource-constrained IoT devices remains prohibitive. The main problem is the large gap between the high computation and energy demands of training and the limited computing resources and battery capacity of IoT devices. For example, training ResNet-110 [10] on a 32x32 input image takes 780M FLOPs, which is prohibitive for IoT devices. Besides, since computation directly translates into energy consumption and IoT devices are usually battery-constrained [8], the high computation demand of training will quickly drain the battery. While existing works [13, 29, 41]
effectively reduce the computation cost of inference by assigning input instances to different classifiers according to their difficulty, the computation cost of training is not reduced.
To address this challenge, this work aims to enable on-device training by significantly reducing the computation cost of training while preserving the desired accuracy. Meanwhile, the proposed techniques can also be adopted to improve training efficiency on HPCs. To achieve this goal, we investigate the computation cost of the entire training cycle, aiming to eliminate unnecessary computations while keeping full accuracy. We make the following two observations. First, not all input instances are important for improving model accuracy. Some instances are similar to ones the model has already been trained with and can be dropped entirely to save computation.
Therefore, an approach that filters out unimportant instances can greatly reduce the computation cost. Second, for the important instances, not all of the computation in the training cycle is necessary. In the backward pass of training, some channels in the error maps have small values. Pruning these insignificant channels and the corresponding computation has a marginal influence on the final accuracy while saving a large portion of the computation.
Based on these two observations, we propose a novel framework consisting of two complementary approaches that reduce the computation cost of training while preserving full accuracy. The first approach is an early instance filter that selects important instances from the input stream to train the network and drops trivial ones. The second approach is error map pruning, which prunes insignificant computations in the backward pass when training with the selected instances.
In summary, the main contributions of this paper include:

A framework to enable on-device training. We propose a framework consisting of two approaches to eliminate unnecessary computation in training CNNs while preserving full network accuracy. The first approach improves the training efficiency of both the forward and backward passes, and the second approach further reduces the computation cost of the backward pass.

Self-supervised early instance filtering (EIF) on the data level. We propose an instance filter that predicts the loss of each instance and develop a self-supervised algorithm to train the filter. Instances with predicted low loss are dropped before the training cycle starts to save computation. To train the filter simultaneously with the main network, we propose a self-supervised training algorithm including an adaptive-threshold-based labeling strategy, an uncertainty-sampling-based instance selection algorithm, and a weighted loss for the biased high-loss ratio.

Error map pruning (EMP) on the algorithm level. We propose an algorithm that prunes insignificant channels in error maps to reduce the computation cost of the backward pass. The channel selection strategy considers the importance of each channel to both the error propagation and the computation of the weight gradients to minimize the influence of pruning on the final accuracy.
We evaluate the proposed approaches on networks of different scales. ResNet and VGG represent on-device training for mobile devices, and LeNet represents tiny sensor-node-level devices. Experimental results show that the proposed approaches effectively reduce the computation cost of training with no or marginal accuracy loss.
II Background and Related Work
II-A Background of CNN Training
The training of CNNs is most commonly conducted with the mini-batch stochastic gradient descent (SGD) algorithm
[5]. It updates the model weights iteration by iteration using a mini-batch (e.g., 128) of input instances. For each instance in the mini-batch, a forward pass and a backward pass are conducted. The forward pass attempts to predict the correct outputs using the current model weights. The backward pass then backpropagates the loss through the layers, which generates the error maps for each layer. Using the error maps, the gradients of the loss w.r.t. the model weights are computed. Finally, the model weights are updated using the weight gradients and an optimization algorithm such as SGD.
To provide labeled data for on-device training, labeling strategies from existing works can be used. For example, labels can come from aggregating inference results from neighboring devices [20] (e.g., voting), employing spatial context information as the supervisory signal [27, 7], or be naturally inferred from user interaction [26, 9], such as next-word prediction in keyboard typing.
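The forward/backward steps above can be sketched with a minimal NumPy example for a single linear layer standing in for the convolutional layers discussed here; function and variable names are illustrative, not from the paper.

```python
import numpy as np

def sgd_step(W, x, y, lr=0.1):
    """One mini-batch SGD iteration for a single linear layer with MSE loss.

    W: (d_in, d_out) weights; x: (B, d_in) mini-batch; y: (B, d_out) targets.
    """
    # Forward pass: predict outputs with the current weights.
    out = x @ W
    loss = 0.5 * np.mean(np.sum((out - y) ** 2, axis=1))
    # Backward pass: the "error map" of this layer is dL/d(out).
    err_out = (out - y) / x.shape[0]
    # Error propagated to the layer input (would feed the previous layer).
    err_in = err_out @ W.T
    # Weight gradients are computed from the layer input and the error map.
    grad_W = x.T @ err_out
    # Update the weights with plain SGD.
    return W - lr * grad_W, loss, err_in
```

Repeating `sgd_step` over mini-batches drives the loss down; the two matrix products in the backward pass correspond to the error propagation and gradient computation that Section V later prunes.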
II-B Related Work
Accelerated Training. There are a number of works on accelerating network training. Stochastic depth [11] accelerates training by randomly bypassing layers through residual connections. E2-Train [37] randomly drops mini-batches and selectively skips layers by using residual connections to save computation. Different from [37], which randomly drops mini-batches, we investigate the importance of each instance before keeping or dropping it. Input data from the real world is not ideally shuffled, and instances valuable for training can concentrate within one mini-batch; simply dropping mini-batches can miss instances that are important for training the network. Besides, the layer skipping in these two works relies on the ResNet architecture [10] and cannot be naturally extended to general CNNs. In contrast, our approaches are applicable to general CNNs. OHEM [34] selects high-loss instances and drops low-loss ones to improve training efficiency. It computes the loss values of all instances in the forward pass and only keeps high-loss instances for the backward pass. The main drawback is that the forward-pass computation of low-loss instances is wasted. Different from this, our approach predicts the loss of each instance and drops low-loss instances before the forward pass starts, which eliminates the computation cost of low-loss instances entirely.
Distributed Training. Another way to accelerate training is to leverage distributed training with abundant computing resources and large batch sizes. [1] employs an extremely large batch size of 32K with 1024 GPUs to train ResNet-50 in 20 minutes. [14] integrates a mixed-precision method into distributed training and pushes the time to 6.6 minutes. However, these works target highly parallel computing resources to reduce training time and actually increase the total computation cost, which is infeasible for training on resource-constrained IoT devices.
Network Pruning during Training. Some works aim to train and prune the network architecture simultaneously. [38] proposes a pruning approach to sparsify weights during training. The goal is to generate a compact network for inference rather than to improve training efficiency. In fact, it requires more training time, since it first trains the backbone and then prunes it. Similarly, [42] prunes the sparse network during training to obtain a compact network for inference. However, these works only improve inference efficiency; the training computation cost is not reduced. [2, 24] aim to accelerate training by reconfiguring the network into a smaller one during training. The main drawback is that the network is pruned on the offline training dataset, so the pruned network's capacity for further on-device learning is compromised. Instead, we focus on reducing the computation cost of online training, and the entire network architecture is preserved to keep its full capacity for learning in an uncertain future.
Network Compression and Neural Architecture Search. There are extensive explorations of network compression and neural architecture search (NAS). [21, 23] prune networks to generate compact models for efficient inference. [16, 39, 40, 15] search neural architectures for hardware-friendly inference. [22] further considers quantization during NAS for efficient inference. However, these works only aim to design network architectures for efficient inference; the computation cost of training is not considered.
III Framework Overview
The overview of the proposed framework is shown in Fig. 1. On top of the main neural network, a small instance filter network is proposed to select important instances from the input stream for training and drop trivial ones. When input instances arrive, the early instance filter predicts the loss value for each instance as if the instance were fed into the main network and makes a binary decision to drop or preserve it. If the predicted loss is high and the instance is preserved, the main network is invoked to run the forward and backward passes for training. Since the loss prediction is made for the main network, once the main network is updated, the instance filter also needs to be trained to keep its loss predictions accurate. The training of the instance filter is self-supervised, based on a labeling strategy with an adaptive loss threshold, instance selection by uncertainty sampling, and a weighted loss for the biased high-loss ratio, which will be introduced in Section IV. Once important instances are selected, error map pruning further reduces the computation cost of the backward pass. It prunes channels in the error maps that have small contributions to the error propagation and gradient computation, which will be introduced in Section V.
IV Self-Supervised Early Instance Filtering
The early instance filter (EIF) is used to select important instances for training the main network and drop trivial ones to reduce the computation cost of training. Since the main network is constantly being updated during training, it is essential to tune the EIF every time the main network is updated. In this way, the EIF can accurately select important instances based on the latest state of the main network. In this section, we first introduce the workflow of EIF for selecting instances to train the main model. Then we discuss the challenges in updating the EIF. After that, we present three techniques to address these challenges so that the EIF can be effectively updated.
To select important instances and drop trivial ones on the fly during training, the EIF predicts the loss value of each instance from the input stream without actually feeding the instance into the main network. Trivial instances with predicted low loss are dropped before the forward pass, which eliminates the computation of the forward pass and of the more computationally intensive backward pass of the main network. Important instances with predicted high loss are preserved to complete the forward pass, calculate the loss, and finish the backward pass to compute the weight gradients for updating the main network. Note that the instances are not preselected before training starts. Instead, they are selected on the fly during training, based on what the main network has and has not learned at its current state.
Fig. 2 shows the workflow of EIF. The user first predefines a high-loss ratio (e.g., 10%) so that this fraction of the instances in the input stream will be predicted as high-loss and the rest as low-loss. Only instances predicted as high-loss are used for training the main network. As instances arrive sequentially, the early instance filter predicts the loss value of each instance as binary high or low for the main network so that the predefined high-loss ratio is satisfied. The filter also produces the confidence of each loss prediction, represented by the entropy of the prediction. Since the loss prediction by the EIF network is made for the main network and the main network is constantly being updated, it is essential to retrain the EIF network every time the main network is updated to keep the loss prediction accurate. However, there are several challenges in realizing automatic self-supervised training of the EIF network. In this section, we first present three major challenges. Then we present three techniques to address them: the adaptive loss threshold, uncertainty sampling, and the weighted loss, as shown in Fig. 2.
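The gating behavior of the filter can be sketched as follows; `predict_high_prob` is a hypothetical stand-in for the EIF network, and the names are illustrative rather than from the paper.

```python
def filter_stream(instances, predict_high_prob, threshold=0.5):
    """Gate a stream of instances with a loss-prediction filter.

    predict_high_prob(x) -> the filter's probability that x would be a
    high-loss instance for the current main network.  Instances predicted
    low-loss are dropped before the main network's forward pass ever runs.
    Returns (preserved, dropped_with_prob).
    """
    preserved, dropped = [], []
    for x in instances:
        p_high = predict_high_prob(x)
        if p_high >= threshold:
            preserved.append(x)            # run full forward + backward pass
        else:
            dropped.append((x, p_high))    # no main-network computation
    return preserved, dropped
```

The dropped list keeps the predicted probabilities so that the uncertainty-sampling step described later can revisit the least confident decisions.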
Challenges: During on-device training, instances with predicted low loss are dropped before being fed to the main network to compute the actual loss, so their true loss values are unknown. We can only know the true loss values of instances with predicted high loss, which brings two challenges. The first challenge is how to label instances as high-loss or low-loss for training the EIF according to the predefined high-loss ratio. For example, if we knew the loss values of all instances, defining a loss threshold that separates the 10% of instances with the highest loss values would simply be a matter of sorting all the loss values and finding the separating value. Since the loss values of dropped instances are unknown, defining a loss threshold remains a challenge.
The second challenge is that the EIF network can choose which instances will be used to train itself, which is not possible in normal CNN training. As long as the EIF network is not 100% accurate, it will make wrong predictions. To avoid being penalized, instead of adjusting its own weights to make accurate loss predictions, the filter can learn a shortcut: predicting all new input instances as low-loss and dropping them. Since dropped instances are never fed to the main network, the EIF network never learns the ground truth of their losses and thus is never penalized for doing so. In this way, the EIF believes it makes perfect predictions. Dropping all new instances prevents further training of both the filter and the main network.
The third challenge is how to correctly train the filter when the numbers of high-loss and low-loss instances in the input stream are extremely unbalanced. This differs from standard training datasets such as CIFAR-10 and ImageNet, in which the number of instances in each class is balanced. The unbalanced numbers of high-loss and low-loss instances make EIF training ineffective. For example, when the predefined high-loss ratio is relatively low (e.g., 10%), simply predicting all instances as low-loss yields a high filter accuracy of 90%, which the filter takes as a good result. However, this prediction is useless since it does not find any important instances to train the main network.
We will present three techniques to address these challenges.
IV-A Adaptive Loss Threshold Based Labeling Strategy
The adaptive loss threshold is used to provide the ground truth (labels) for training the EIF. With the adaptive loss threshold, we can label instances as high-loss or low-loss to train the EIF. During the training of the EIF and the main network, $r$ percent of the instances will be predicted as high-loss by the EIF, where $r$ is the predefined high-loss ratio. The true loss values of instances predicted as high-loss can be obtained from the main network. However, we do not know the true loss values of instances predicted as low-loss, since they are dropped before being fed into the main network. With only partial loss values, defining an exact loss threshold (e.g., by sorting all loss values and finding the separating value) is infeasible. Therefore, we aim to approximate the threshold. To achieve this, we first define true high (TH) instances as instances with predicted high loss by the filter that are also labeled as high-loss by the loss threshold. We monitor the number of TH instances over the last $M$ mini-batches and calculate the ratio $r_{TH}$ as the number of TH instances among the preserved ones over all instances in those $M$ mini-batches. By comparing $r_{TH}$ with the predefined ratio $r$, the loss threshold is adjusted to draw $r_{TH}$ to $r$.
Formally, with adaptive loss threshold $\delta$, instance $i$ is labeled as high-loss or low-loss as follows:

$$y_i = \begin{cases} 1\ (\text{high-loss}), & l_i > \delta \\ 0\ (\text{low-loss}), & l_i \le \delta \end{cases} \tag{1}$$

where $l_i$ is the loss value of instance $i$ computed by the main network and $\delta$ is the adaptive loss threshold.
The true high (TH) loss instance ratio achieved by the filter is defined as:

$$r_{TH} = \frac{1}{MB} \sum_{i=1}^{MB} \mathbb{1}\left[\hat{y}_i = 1 \wedge y_i = 1\right] \tag{2}$$

where $\mathbb{1}[\cdot]$ is an indicator function that equals 1 if its argument is true and 0 otherwise. $\hat{y}_i$ is the binary prediction by the filter for instance $i$, and $y_i$ is the loss label from Eq. (1). $B$ is the batch size, and $M$ is the number of mini-batches monitored for one update of the loss threshold.
Based on the computed $r_{TH}$ and the predefined $r$, the loss threshold $\delta$ is adjusted to draw $r_{TH}$ to $r$. When $r_{TH}$ is larger than $r$, too many instances are labeled and predicted as high-loss, which indicates $\delta$ is too small. Therefore, $\delta$ is increased by multiplying it with a factor larger than 1. Similarly, when $r_{TH}$ is smaller than $r$, $\delta$ is too large and is attenuated. The loss threshold is adjusted as:

$$\delta \leftarrow \begin{cases} \alpha\,\delta, & r_{TH} > r \\ \beta\,\delta, & r_{TH} < r \end{cases} \tag{3}$$

where $\alpha$ and $\beta$ are two hyperparameters with $\alpha > 1$ and $\beta < 1$ that define the step sizes.
The computed $r_{TH}$ is essential to the self-supervised training of the EIF. More specifically, $r_{TH}$ controls the loss threshold $\delta$ through Eq. (3), which in turn controls the instance labels $y_i$ in Eq. (1) for training the instance filter. With the labels $y_i$, the filter is trained to predict high-loss instances. The number of instances both predicted and labeled as high-loss is then used to compute the new $r_{TH}$ by Eq. (2), which further adjusts $\delta$. This process continues for each mini-batch, forming the self-supervised training of the instance filter. Leveraging this self-supervision, the loss threshold is properly adjusted and the instance filter is well trained to track the latest state of the main network. In this way, the true high-loss ratio, affected by both the filter and the loss threshold, is kept at the set ratio $r$, and the filter effectively selects $r$ percent of the instances as important for training the main network.
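The labeling and threshold-update rules of Eqs. (1)-(3) can be sketched as follows; the step-size values are illustrative defaults, not the paper's tuned hyperparameters.

```python
def label_instance(loss_value, delta):
    """Eq. (1): label an instance high-loss (1) if its true loss exceeds
    the adaptive threshold delta, and low-loss (0) otherwise."""
    return 1 if loss_value > delta else 0

def update_threshold(delta, n_true_high, n_total, r, alpha=1.1, beta=0.9):
    """Eqs. (2)-(3): adjust delta so the true-high ratio tracks r.

    n_true_high: TH instances observed over the last M mini-batches;
    n_total: all instances in those mini-batches (M * B);
    alpha > 1 and beta < 1 are illustrative step-size hyperparameters.
    """
    r_th = n_true_high / n_total          # Eq. (2)
    if r_th > r:      # too many high-loss labels -> threshold too small
        return delta * alpha
    if r_th < r:      # too few high-loss labels -> threshold too large
        return delta * beta
    return delta
```

Calling `update_threshold` once per monitoring window and relabeling with the new threshold closes the self-supervision loop described above.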
IV-B Instance Selection by Uncertainty Sampling
The root cause of the second challenge is that once an instance is dropped, it is never fed to the main network, so the EIF network never learns the ground truth of its loss. The labels (i.e., high-loss or low-loss) of the dropped instances are thus unknown, and the EIF cannot be correctly trained on them. To address this problem, we keep some instances with predicted low loss, which would otherwise be dropped, to augment the preserved instances for training the filter. In this way, wrong loss predictions on dropped instances also penalize the filter, which forces it to actually learn to find important instances. To decide which instances to keep while minimizing their number, we employ uncertainty sampling [17]. The dropped instances that the filter is least confident about are fed into the main network to compute their loss values. To measure the confidence of the filter's loss prediction, we use the entropy defined as:
$$H_i = -p_i \log p_i - (1 - p_i)\log(1 - p_i) \tag{4}$$

where $p_i$ is the probability computed by the filter of instance $i$ being high-loss ($\hat{y}_i = 1$), and $1 - p_i$ of being low-loss ($\hat{y}_i = 0$). The smaller the entropy, the more confident the filter is about its prediction.

Based on the entropy, we select from the dropped instances those whose entropy is above an entropy threshold to augment the preserved instances for training the filter. The set of selected instances is defined as:

$$S = \{\, i \mid \hat{y}_i = 0,\ H_i > th_e \,\} \tag{5}$$

where $th_e$ is the entropy threshold.
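A small sketch of Eqs. (4)-(5), assuming dropped instances arrive as (id, predicted-probability) pairs; the names are illustrative.

```python
import math

def prediction_entropy(p_high):
    """Eq. (4): binary entropy of the filter's loss prediction."""
    eps = 1e-12
    p = min(max(p_high, eps), 1.0 - eps)   # avoid log(0)
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def uncertainty_sample(dropped, th_e):
    """Eq. (5): from the predicted-low-loss (dropped) instances, keep those
    whose prediction entropy exceeds th_e; their true losses are then
    computed on the main network to supervise the filter.
    `dropped` is a list of (instance_id, p_high) pairs."""
    return [i for i, p in dropped if prediction_entropy(p) > th_e]
```

Predictions near 0.5 have entropy near log 2 and are the ones revisited; confident drops (p close to 0) stay dropped and cost nothing.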
IV-C Weighted Loss for Biased High-Loss Ratio
To address the third challenge, we propose a weighted loss function that makes the EIF training process treat high-loss instances fairly when their ratio is low. In this way, the EIF can be trained to make accurate loss predictions and select important instances for training the main network.
Traditionally, for datasets with balanced classes, the average per-instance loss is used as the loss function of a mini-batch. In our case, based on the binary loss label $y_i$ in Eq. (1) and the binary loss prediction by the filter, the loss function for instance $i$ is defined by cross-entropy as:

$$L_i = -y_i \log p_i - (1 - y_i)\log(1 - p_i) \tag{6}$$

where $p_i$, as in Eq. (4), is the probability computed by the filter of instance $i$ being high-loss. $L_i$ measures how well the loss prediction approximates the true loss label and is minimized during training. The average loss would be the mean of $L_i$ over the preserved instances in a mini-batch. However, when the predefined high-loss ratio $r$ is not 50% and the numbers of high-loss and low-loss instances are unbalanced, directly using the average loss results in ineffective training of the EIF.
To understand the inefficiency of training with the average loss, we define the weighted loss over the preserved instances in a mini-batch for training the filter as:

$$L = w_h \sum_{i \in TH} L_i + w_l \sum_{i \in FH} L_i + w_l \sum_{i \in TL} L_i + w_h \sum_{i \in FL} L_i \tag{7}$$

where $TH$, $FH$, $TL$, and $FL$ represent true high, false high, true low, and false low loss instances, respectively. $TH$ and $FH$ are instances with predicted high loss and labeled as $y_i = 1$ and $y_i = 0$ by Eq. (1), respectively. $TL$ and $FL$ are instances with predicted low loss and selected by uncertainty sampling in Eq. (5), with loss labels $y_i = 0$ and $y_i = 1$, respectively. The weights $w_h$ and $w_l$ represent how important the true high-loss instances (loss label 1, including $TH$ and $FL$) and true low-loss instances (loss label 0, including $FH$ and $TL$) are, respectively. $w_h$ and $w_l$ are normalized such that the weights of all instances in Eq. (7) sum up to 1.
When the predefined high-loss ratio $r$ is not 50%, the numbers of high-loss and low-loss instances in the input stream are unequal, which makes training the EIF with the average loss ineffective. For example, when $r$ is set to 10%, only 10% of the streamed instances are labeled as high-loss by the adaptive loss threshold. As a result, 90% of the terms in Eq. (7) correspond to low-loss instances and dominate the loss. With the average loss, all the weights are the same. To minimize the loss when training the filter, simply predicting all instances as low-loss produces small loss values on the dominating second and third sums in Eq. (7), and hence on the total loss, which prevents effective training of the filter.
To address this problem, we make the weights biased by setting $w_h = \frac{1}{2N_h}$ and $w_l = \frac{1}{2N_l}$, where $N_h$ and $N_l$ are the numbers of preserved instances with loss labels 1 and 0, respectively. In this way, we have $w_h > w_l$ when high-loss instances are the minority. The first and fourth sums in Eq. (7) correspond to the high-loss ($y_i = 1$) instances, and the second and third sums to the low-loss ($y_i = 0$) instances. With these weights, the high-loss and low-loss instances contribute equally to the total loss and are treated fairly in training. In the above example, while the first and fourth sums contribute only 10% of the terms, the higher weight $w_h$ makes them as important as the second and third sums, which have the lower weight $w_l$. Therefore, the instance filter can be correctly trained with unbalanced numbers of high-loss and low-loss instances and can accurately predict high-loss ones.
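The weighted loss can be sketched as follows. The specific choice $w_h = 1/(2N_h)$, $w_l = 1/(2N_l)$ is an assumption: it is one way to satisfy the two constraints stated in the text (weights summing to 1, with equal contributions from the two label groups).

```python
import math

def weighted_filter_loss(preds, labels):
    """Eq. (7): weighted cross-entropy over the preserved instances.

    preds: filter probabilities p_i of each instance being high-loss;
    labels: loss labels y_i in {0, 1} from the adaptive threshold.
    Label-1 instances (TH and FL) share weight w_h; label-0 instances
    (FH and TL) share weight w_l.  The normalization below is an
    assumed choice consistent with the constraints in the text.
    """
    n_h = sum(labels)
    n_l = len(labels) - n_h
    w_h = 1.0 / (2 * n_h) if n_h else 0.0
    w_l = 1.0 / (2 * n_l) if n_l else 0.0
    eps = 1e-12
    total = 0.0
    for p, y in zip(preds, labels):
        # Eq. (6): binary cross-entropy between prediction and loss label.
        ce = -y * math.log(p + eps) - (1 - y) * math.log(1.0 - p + eps)
        total += (w_h if y == 1 else w_l) * ce
    return total
```

With a 10% high-loss ratio, the single label-1 group still contributes half of the total loss, so predicting everything as low-loss no longer minimizes it.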
With the high-loss instances predicted by the filter, the instances selected by uncertainty sampling, and the weighted loss function, the filter is effectively trained to predict high-loss instances for training the main network.
V Error Map Pruning in Backward Pass
When training with the selected instances, the computation in the backward pass can be further reduced by error map pruning (EMP). Since the backward pass accounts for about 2/3 of the computation cost of training, reducing its computation effectively reduces the total cost. As shown in Fig. 3, the backward pass propagates the errors layer by layer from the last layer to the first. We focus on pruning convolutional layers because they dominate the computation cost of the backward pass. Within one convolutional layer, the input error map is generated from the output error map of the same layer. The output error map consists of many channels, and we aim to prune the insignificant ones to reduce the computation cost of training.
Given a pruning ratio, we need to keep the most representative channels in the error map to retain as much information as possible so that training accuracy is maintained. The main idea of the proposed channel selection strategy is to prune the channels that have the least influence on both the error propagation and the computation of the weight gradients.
V-A Channel Selection to Minimize Reconstruction Error on Error Propagation
The first criterion for selecting the channels to prune is to minimize the reconstruction error on error propagation. The error propagation for one convolutional layer is shown at the top of Fig. 4. Within one layer, the error propagation starts from the output error map $E_{out}$ shown on the right, convolves it with the rotated kernel weights, and generates the input error map $E_{in}$ on the left. The error propagation with pruned $E_{out}$ is shown at the bottom of Fig. 4. The number of channels in $E_{out}$ is pruned from $n$ to $n'$. When computing $E_{in}$, the computations corresponding to the pruned channels, which are convolutions between those channels and the rotated weights, are removed. To maintain training accuracy, we want to keep the input error map as similar as possible before and after pruning. In other words, we want to minimize the reconstruction error on the input error map.
Formally, without channel pruning of $E_{out}$, $E_{in}$ is computed as follows:

$$E_{in} = \sum_{j=1}^{n} \mathrm{rot}(W_j) * E_{out,j} \tag{8}$$

where $E_{in}$ is the input error map consisting of $c$ channels, each with shape $h \times w$. $\mathrm{rot}(W_j)$ is the rotated weights of the $j$-th convolutional kernel $W_j$ with shape $c \times k \times k$. $E_{out,j}$ is the $j$-th channel of the output error map, with shape $h' \times w'$.
Given a pruning ratio $p$ and an output error map $E_{out}$, we aim to reduce the number of channels in $E_{out}$ from $n$ to $n'$ such that $n' = (1-p)\,n$. To minimize the reconstruction error on $E_{in}$, the channel selection problem is formulated as follows:

$$\min_{\beta}\ \Big\| E_{in} - \sum_{j=1}^{n} \beta_j\, \mathrm{rot}(W_j) * E_{out,j} \Big\|_2^2 \tag{9}$$

$$\text{s.t.}\quad \|\beta\|_0 \le n' \tag{10}$$

where $\beta$ is the error map selection strategy, represented as a binary vector of length $n$. $\beta_j$ is the $j$-th entry of $\beta$, and $\beta_j = 0$ means the $j$-th channel of $E_{out}$ is pruned. The $\ell_2$ norm measures the reconstruction error on $E_{in}$.

However, directly solving this minimization problem is prohibitive: $E_{in}$ in the problem is computed by Eq. (8), which completes all the computation of the error propagation and defeats the purpose of saving computation. To select channels to prune before starting the actual error propagation, we define an importance score that indicates how much each channel influences the value of $E_{in}$ and prune the least important channels to minimize the reconstruction error on $E_{in}$.
Importance Score. In Eq. (9), when a channel $j$ is pruned, the computation error on $E_{in}$ is caused by the pruned term $\mathrm{rot}(W_j) * E_{out,j}$. As a fast and accurate estimate of the magnitude of this term, we define the importance score of channel $j$ as follows:

$$s_j = \lambda_1 \|W_j\|_1 + \lambda_2 \|E_{out,j}\|_1 \tag{11}$$

where $\|W_j\|_1$ is the $\ell_1$ norm of convolutional kernel $W_j$, computed as the sum of the absolute values of its weights. Here we drop the rotation on $W_j$ since it does not change the norm. $\|E_{out,j}\|_1$ is the $\ell_1$ norm of channel $j$ in the output error map, computed as the sum of its absolute values. $\lambda_1$ and $\lambda_2$ are two hyperparameters that adjust the weight of each norm.

The importance score gives an expectation of the magnitude that a channel of $E_{out}$ contributes to $E_{in}$. Channels with small magnitudes in $E_{out}$ and in the corresponding kernel weights tend to produce trivial values in the input error map $E_{in}$ and can be pruned with minimal influence on $E_{in}$.
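Computing the score of Eq. (11) reduces to two cheap reductions over the kernels and the error map; a NumPy sketch with illustrative default values for the two hyperparameters:

```python
import numpy as np

def channel_importance(W, E_out, lam1=1.0, lam2=1.0):
    """Eq. (11): importance score of each output-error-map channel.

    W: the layer's kernels, shape (n, c, k, k);
    E_out: output error map for one instance, shape (n, h, w);
    lam1/lam2: the two norm-weighting hyperparameters (illustrative
    defaults).  Returns an array of n scores, one per channel.
    """
    w_norm = np.abs(W).sum(axis=(1, 2, 3))    # L1 norm of each kernel
    e_norm = np.abs(E_out).sum(axis=(1, 2))   # L1 norm of each error channel
    return lam1 * w_norm + lam2 * e_norm
```

The kernel norms are fixed within an iteration and could be cached, so the per-iteration cost is essentially one absolute-value sum over the error map.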
V-B Channel Selection to Minimize Reconstruction Error on Gradient Computation
The second criterion for selecting the channels to prune is to minimize the reconstruction error on the weight gradients. The computation of the weight gradients without pruning is shown at the top of Fig. 5. The output feature map $X$ of the previous layer convolves with one channel of the output error map to produce the gradient of one kernel. When some channels in $E_{out}$ are pruned, the computation of the weight gradients corresponding to the pruned channels is removed. To retain training accuracy, we want to keep the weight gradients as similar as possible before and after pruning. Without channel pruning of $E_{out}$, the weight gradients of kernel $W_j$ are computed as follows:

$$\nabla W_j = X * E_{out,j} \tag{12}$$

where $\nabla W_j$ is the weight gradient of kernel $W_j$ with shape $c \times k \times k$. $X$ is the output feature map of the previous layer with shape $c \times h \times w$. $E_{out,j}$ is the $j$-th channel of the output error map of this layer, with shape $h' \times w'$.
To determine the channel selection strategy while minimizing the reconstruction error on the gradient computation, the channel selection problem is formulated as follows:

$$\min_{\beta}\ \sum_{j=1}^{n} \big\| \nabla W_j - \beta_j\, X * E_{out,j} \big\|_2^2 \quad \text{s.t.}\ \|\beta\|_0 \le n' \tag{13}$$

Similar to Eq. (9), we use the $\ell_2$ norm to measure the reconstruction error incurred by pruning on the computation of the weight gradients of all kernels. As before, solving this problem requires completing all the gradient computation in Eq. (12) to obtain $\nabla W_j$, which contradicts the goal of saving computation. Thus, we define an importance score of each channel in $E_{out}$ for the weight gradients and prune the least important ones to minimize the reconstruction error on $\nabla W_j$.
Importance Score. In Eq. (13), when a channel $j$ is pruned, the computation error is caused by the pruned term $X * E_{out,j}$. Since $X$ is independent of $j$ and can be considered a constant when measuring the importance of each channel, we ignore $X$ and only include $E_{out,j}$ in the importance score of channel $j$, which is defined as follows:

$$s_j = \|E_{out,j}\|_1 \tag{14}$$
V-C Mini-Batch Pruning with Combined Importance Score
To make the pruned channels for error propagation and gradient computation consistent with each other, we combine the importance scores of the two processes. Then we scale the score from instance-wise to batch-wise for mini-batch training.
The importance score for gradient computation in Eq. (14) is a reduced form of Eq. (11), obtained by setting $\lambda_1 = 0$ and $\lambda_2 = 1$. Therefore, we combine the two criteria into Eq. (11). Based on the per-instance importance score of each channel, we can prune channels for a mini-batch of instances to reduce the computation while maintaining accuracy. For a mini-batch, we prune the same channels for all instances. The batch-wise importance score of one channel is calculated as $S_j = \sum_{b=1}^{B} s_{j,b}$, where $B$ is the batch size and $s_{j,b}$ is the importance score of channel $j$ for instance $b$.
With the batch-wise importance score, the error map pruning process for one convolutional layer is as follows. Given a pruning ratio $p$, $p \cdot n$ channels in the output error map $E_{out}$ need to be pruned. First, for each channel $j$ in $E_{out}$, we calculate the batch-wise importance score $S_j$. Then the importance scores of all channels are sorted, and the $p \cdot n$ channels with the smallest $S_j$ are marked as pruned. Finally, the error propagation and the computation of the weight gradients corresponding to the pruned channels are skipped to save computation.
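The batch-wise selection step can be sketched as follows; the per-instance scores are assumed to come from a scoring function such as Eq. (11), and names are illustrative.

```python
import numpy as np

def prune_error_map(E_out_batch, scores, p):
    """Prune the same error-map channels for every instance in a mini-batch.

    E_out_batch: (B, n, h, w) output error maps;
    scores: (B, n) per-instance channel importance scores (e.g. Eq. (11));
    p: pruning ratio.  Returns (kept_channel_indices, pruned_error_map).
    """
    n = E_out_batch.shape[1]
    n_prune = int(p * n)
    batch_scores = scores.sum(axis=0)        # batch-wise score per channel
    order = np.argsort(batch_scores)         # ascending importance
    keep = np.sort(order[n_prune:])          # drop the n_prune smallest
    return keep, E_out_batch[:, keep]
```

Because the pruning is structured (whole channels), the downstream convolutions simply iterate over `keep` instead of all channels, so no sparse bookkeeping is needed.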
Computation Reduction. With error map pruning, the computation cost of both the error propagation and the weight-gradient computation is effectively reduced. With a pruning ratio r, a fraction r of the computation in the error propagation and gradient computation is skipped, which saves about the same fraction of the computation in the backward pass of training. More specifically, for one instance, the per-layer cost of both the error propagation and the weight-gradient computation, measured in floating-point operations (FLOPs), is proportional to the number of channels in the output error map; pruning the channel count from C to (1-r)C therefore reduces each cost by a factor of (1-r). In this way, the computation cost of the backward pass of convolutional layers is reduced.
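The proportionality argument can be checked with a small FLOPs model. The cost formula below (one forward-convolution-equivalent each for error propagation and weight gradients, 2·H·W·C_in·C_out·k² FLOPs apiece) is a common approximation and an assumption of this sketch, not the paper's exact accounting:

```python
def conv_backward_flops(h, w, c_in, c_out, k):
    # Approximate per-instance FLOPs of one conv layer's backward pass:
    # error propagation (gradient w.r.t. the input) plus weight-gradient
    # computation, each costing roughly one forward convolution.
    one_conv = 2 * h * w * c_in * c_out * k * k
    return one_conv + one_conv

# Pruning a fraction r of the output error map's channels scales both
# terms by (1 - r), so the layer's backward cost drops by exactly r.
full = conv_backward_flops(32, 32, 64, 64, 3)
pruned = conv_backward_flops(32, 32, 64, 48, 3)  # r = 0.25
ratio = pruned / full  # 0.75
```

Since both backward terms are linear in the number of output-error channels, the computation saving tracks the pruning ratio one-for-one.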
Overhead Analysis. The computation overhead of error map pruning is negligible. It comes from the channel selection and from skipping the pruned channels. When using the norm-based strategy in Eq.(11) for channel selection, the overhead is negligible because summing over the kernel weights and over each channel is cheap compared with the expensive convolution operations in the backward pass. For example, the channel selection for ResNet-110 consumes a marginal 0.53% of the FLOPs of the backward pass. As for the overhead of skipping, since we employ structured pruning, skipping a pruned channel simply omits all the computation involving that channel, which incurs negligible overhead.
VI Experiments
We conduct extensive experiments to demonstrate the effectiveness of our approaches in terms of computation reduction, energy saving, accuracy, and convergence speed, and provide detailed analysis. The evaluation covers four network architectures and four datasets. We first evaluate EIF alone and then the combined EIF+EMP approach. After that, we evaluate the practical energy savings on two edge devices.
VI-A Experimental Setup
Datasets and Networks. We evaluate the proposed approaches on four datasets: CIFAR-10, CIFAR-100 [18], MNIST [19], and ImageNet [6]. We use networks with different capacities to show the scalability of the proposed approaches, including large-scale networks for mobile devices and small networks for tiny sensor nodes. For large-scale networks, we employ two kinds of CNNs: residual networks (ResNet [10]) and plain networks (VGG [35]). ResNet-110, ResNet-74, and VGG-16 are evaluated on CIFAR-10/100. ResNet-18 and VGG-11 are evaluated on ImageNet. For small networks, we use LeNet on MNIST.
Architectures of Instance Filter.
We use different networks as the instance filter for different datasets. For CIFAR-10/100, we use ResNet-8. It has 7 convolutional layers and 1 fully-connected layer. The first layer is a 3x3 convolution with 16 filters, followed by a stack of 3 residual blocks. Each block has 2 convolutional layers with kernel size 3x3. The numbers of filters in the blocks are {16, 32, 64}, respectively. The network ends with a 10/100-way fully-connected layer. For ImageNet, we use ResNet-10. It has 9 convolutional layers and 1 fully-connected layer. The first layer is a 7x7 convolution with 64 filters; additional downsampling with a stride of 4 reduces the computation cost. Then there is a stack of 4 residual blocks. Each block has 2 convolutional layers with kernel size 3x3. The numbers of filters in the blocks are {64, 128, 256, 512}, respectively. The network ends with a 1000-way fully-connected layer. For MNIST, we use a slimmed LeNet with kernel size 3x3 and {6, 16} filters in its two convolutional layers, respectively.
The computation overhead of the EIF is negligible compared with the main networks. For CIFAR-10/100, the computation required for the inference of the EIF is 5.0% of ResNet-110 and 4.1% of VGG-16, respectively. The computation required for training the EIF is 5.9% of ResNet-110 and 4.8% of VGG-16, respectively. For ImageNet, the computation required for the inference and training of the EIF network is 3.4% and 3.9% of ResNet-18, and 0.81% and 1.05% of VGG-11, respectively. For MNIST, the computation required for the inference and training of the EIF is 9.5% and 7.8% of LeNet, respectively.
Training Details. We train both the main network and the instance filter simultaneously from scratch. For ResNet-110, ResNet-74, and VGG-16, we employ the training settings in [10]: SGD with momentum 0.9, weight decay 0.0001, and batch size 128. The models are trained for 64k iterations. The initial learning rate is 0.1 and is decayed by a factor of 10 at 32k and 48k iterations. For the instance filter, the learning rate is set to 0.1. For ResNet-18 and VGG-11, similar training settings are employed except that the batch size is 256, the models are trained for 450k iterations, and the learning rate is decayed by a factor of 10 at 150k and 300k iterations. For LeNet, the learning rate is 0.01 and the momentum is 0.5; the model is trained for 18.7k iterations with batch size 64. For its instance filter, the initial learning rate is 0.1 and is decayed to 0.05 after 0.94k iterations.
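The step schedule for ResNet-110/74 and VGG-16 can be expressed compactly; this is a direct transcription of the settings above, not additional methodology:

```python
def learning_rate(iteration, base_lr=0.1, milestones=(32_000, 48_000), gamma=0.1):
    # Step schedule: start at 0.1, divide by 10 at 32k and 48k iterations
    # (for ResNet-18/VGG-11 the milestones would be 150k and 300k).
    lr = base_lr
    for m in milestones:
        if iteration >= m:
            lr *= gamma
    return lr

lr_start = learning_rate(0)        # 0.1
lr_mid = learning_rate(40_000)     # first decay applied
lr_late = learning_rate(60_000)    # both decays applied
```

The same function with different `base_lr` and `milestones` covers the other configurations listed above.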
Metrics. We evaluate the proposed approaches with two closely related but distinct metrics: computation reduction and practical energy saving. The computation cost is measured in FLOPs, which is independent of the specific device and a commonly used metric of computation cost [33]. The evaluation of the computation cost is conducted on an NVIDIA P100 GPU with PyTorch 1.1 and measured by the THOP library [30]; these results are presented in Sections VI-B to VI-E. The second metric, practical energy saving, depends on the device and is measured on two edge devices (NVIDIA Jetson TX2 mobile GPU and MSP432 MCU); these results are presented in Section VI-F.
VI-B Evaluating Early Instance Filtering (EIF)
To show that the proposed early instance filtering (EIF) can effectively reduce the computation cost while maintaining or even boosting the accuracy, we compare it with two state-of-the-art (SOTA) baselines and a standard training approach. Online hard example mining (OHEM) [34] selects hard examples for training by computing their loss values. Stochastic mini-batch dropping (SMD) [37] is a SOTA on-device training approach that randomly skips each mini-batch with a fixed probability. SMB is the standard mini-batch training method with stochastic gradient descent (SGD), whose computation cost is adjusted by reducing the number of training iterations.
Computation Reduction while Boosting Accuracy. The proposed EIF substantially outperforms the baselines in terms of both accuracy and computation reduction. As shown in Fig. 6, when training ResNet-110 on CIFAR-10, EIF consistently outperforms the baselines by a large margin across different remaining computation ratios. Compared with the full accuracy of SGD (i.e. SMB with remaining computation ratio 1.0), with only 36.50% remaining computation, EIF boosts the accuracy by 0.16% (93.73% vs. 93.57%); with only 55.45% computation, EIF boosts the accuracy by 0.52% (94.09% vs. 93.57%). Compared with SMB and SMD, under different computation ratios, EIF achieves consistently higher accuracy by margins in the ranges [0.84%, 2.32%] and [0.83%, 2.28%], respectively. This significant improvement is achieved because EIF selects instances by predicting the true loss value, instead of randomly dropping instances. Compared with OHEM, EIF consistently achieves higher accuracy by margins in the range [0.31%, 0.98%] under different computation ratios. The improved accuracy and reduced computation cost show that the proposed instance filter effectively selects important instances for training.
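The selection step at the heart of EIF can be sketched as follows. The list-based interface and the `predicted_losses` input are simplifications of this sketch; the actual filter network emits a binary high/low-loss decision per instance:

```python
def select_training_instances(batch, predicted_losses, loss_threshold):
    # Early instance filtering: keep only instances whose loss, as
    # predicted by the lightweight filter network, exceeds the adaptive
    # threshold; dropped instances skip the expensive training
    # computation of the main network.
    return [x for x, loss in zip(batch, predicted_losses)
            if loss > loss_threshold]

batch = ["img0", "img1", "img2", "img3"]
losses = [0.2, 1.4, 0.7, 2.1]           # stand-in filter predictions
selected = select_training_instances(batch, losses, loss_threshold=1.0)
# selected == ["img1", "img3"]
```

Because the filter is far cheaper than the main network (Section VI-A), dropping low-loss instances converts directly into computation savings.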
To further evaluate EIF, we conduct experiments on training ResNet-74, VGG-16, and LeNet. Consistent accuracy improvements over the SOTA baselines are observed, as shown in Fig. 7(a)(b)(c). For ResNet-74, with 63.74% computation reduction, EIF incurs only 0.02% accuracy loss, while OHEM has a much larger accuracy loss of 5.56% with a smaller computation reduction of 60.59%. SMD and SMB incur accuracy losses of 2.25% and 1.98% with a smaller computation reduction of 60%. Similar results are observed on VGG-16. For LeNet, EIF boosts the accuracy by 0.21% (99.45% vs. 99.24%) with a computation reduction of 77.19%.
VI-C Evaluating EIF + EMP
We evaluate the proposed framework EIF+EMP, consisting of EIF and EMP, and compare it with the SOTA baselines. Our approach effectively reduces the computation cost and achieves significantly better accuracy than the SOTA baselines. Fig. 8 shows the accuracy of ResNet-110 on CIFAR-10 when trained by EIF+EMP and the baselines under different remaining computation ratios. Compared with EIF or EMP alone, EIF+EMP achieves more aggressive computation reduction while preserving or even boosting the accuracy. With EIF only, we achieve 63.50% computation reduction without accuracy loss. With EMP only, we achieve 35.56% computation reduction in the backward pass without accuracy loss, and 62.22% computation reduction with a slight accuracy loss of 0.72%. With the combined EIF+EMP, at up to 67.84% computation reduction, we achieve no accuracy loss and boost the accuracy by up to 0.84% (94.41% vs. 93.57%).
TABLE I

| Network | Method | Comp. Reduce | Accuracy |
|---|---|---|---|
| ResNet-110 | SGD (original) | – | 93.57% |
| | EIF+EMP | 67.84% | 93.66% |
| | OHEM [34] | 60.45% | 85.11% |
| | SD [11] | 60.00% | 91.96% |
| | EIF+EMP+Q | 95.71% | 93.54% |
| | E2Train (+Q) [37] | 90.13% | 91.68% |
| ResNet-74 | SGD (original) | – | 93.46% |
| | EIF+EMP | 63.91% | 93.48% |
| | OHEM | 60.59% | 87.90% |
| | SD | 60.00% | 90.99% |
| | EIF+EMP+Q | 95.41% | 93.00% |
| | E2Train (+Q) | 90.13% | 91.36% |
| VGG-16 | SGD (original) | – | 93.25% |
| | EIF+EMP | 67.33% | 93.15% |
| | OHEM | 60.41% | 71.81% |
| | EIF+EMP+Q | 95.54% | 92.69% |
| | E2Train (+Q) | – | – |
| LeNet | SGD (original) | – | 99.23% |
| | EIF+EMP | 78.60% | 99.47% |
| | OHEM | 65.24% | 99.33% |
We further evaluate EIF+EMP with more network architectures and datasets. Our approach substantially outperforms the baselines in terms of computation reduction and accuracy. We evaluate our approach with ResNet-110, ResNet-74, and VGG-16 on CIFAR-10, and LeNet on MNIST. For a fair comparison with E2Train [37], which employs quantization [3], we use the same quantization scheme; when comparing with other baselines, we do not use quantization. The experimental results are shown in Table I. When training ResNet-74, our approach achieves 63.91% computation saving without accuracy loss. With quantization, our approach achieves 95.41% computation saving with a marginal accuracy loss of 0.46%, while E2Train achieves a smaller computation saving of 90.13% with a much higher accuracy loss of 2.10%. Similar results are observed on ResNet-110, VGG-16, and LeNet. SD and E2Train rely on the residual connections in ResNet and cannot be applied to VGG-16 and LeNet. These results show that the proposed framework EIF+EMP achieves superior computation saving and significantly higher accuracy than the baselines on different networks.
TABLE II

| Network | Method | Comp. Reduce | Accuracy |
|---|---|---|---|
| ResNet-110 | SGD (original) | – | 71.60% |
| | EIF+EMP | 50.02% | 72.02% |
| | EIF+EMP | 56.24% | 71.63% |
| | OHEM | 47.01% | 69.98% |
| | SD | 50.00% | 70.44% |
| | SMB | 50.00% | 67.28% |
| | EIF+EMP+Q | 92.92% | 71.29% |
| | E2Train (+Q) | 90.13% | 67.94% |
| VGG-16 | SGD (original) | – | 71.56% |
| | EIF+EMP | 50.49% | 71.59% |
| | EIF+EMP | 53.86% | 70.92% |
| | OHEM | 46.99% | 65.17% |
| | SMB | 50.00% | 68.76% |
Experiments on CIFAR-100. We further evaluate the proposed approaches on CIFAR-100 with ResNet-110 and VGG-16. EIF+EMP substantially outperforms the baselines in both computation reduction and accuracy. As shown in Table II, with ResNet-110, EIF+EMP achieves 56.24% computation reduction while preserving the full network accuracy, and 50.02% computation reduction while boosting the accuracy by 0.42%. The baselines OHEM, SD, and SMB achieve much lower accuracy even with less computation reduction. With quantization, EIF+EMP achieves 92.92% computation reduction with a marginal accuracy loss of 0.31%, while E2Train has a much larger accuracy loss of 3.66% with a smaller computation reduction of 90.13%. For VGG-16, EIF+EMP achieves 50.49% computation reduction without accuracy loss, and 53.86% computation reduction with only 0.64% accuracy loss. The baselines OHEM and SMB incur much larger accuracy losses of 6.39% and 2.80% with less computation reduction. SD and E2Train cannot be applied to VGG-16, which has no residual connections.
Experiments on ImageNet. We evaluate the proposed approaches on the large-scale dataset ImageNet [26], which consists of 1.2M training images in 1000 classes. The main networks are ResNet-18 and VGG-11.
TABLE III

| Network | Method | Comp. Reduce | Top-1 Acc. | Top-5 Acc. |
|---|---|---|---|---|
| ResNet-18 | SGD (original) | – | 69.76% | 89.08% |
| | EIF+EMP | 58.91% | 70.27% | 89.63% |
| | EIF+EMP | 64.71% | 68.98% | 89.35% |
| | OHEM | 46.67% | 62.09% | 87.08% |
| | SD | 50.00% | 65.36% | 86.41% |
| | SMB | 50.00% | 65.94% | 87.50% |
| VGG-11 | SGD (original) | – | 70.38% | 89.81% |
| | EIF+EMP | 51.63% | 70.36% | 89.98% |
| | EIF+EMP | 60.59% | 70.01% | 89.83% |
| | OHEM | 46.59% | 56.39% | 85.62% |
| | SMB | 50.00% | 63.76% | 86.49% |
The proposed EIF+EMP effectively reduces the computation cost of training while preserving the accuracy on this large-scale dataset, and it significantly outperforms the baselines. As shown in Table III, when training ResNet-18, with 58.91% computation reduction, EIF+EMP boosts the top-1 accuracy by 0.51% (70.27% vs. 69.76%) and the top-5 accuracy by 0.55%. With a more aggressive computation reduction of 64.71%, EIF+EMP still boosts the top-5 accuracy by 0.27% (89.35% vs. 89.08%). EIF+EMP consistently outperforms the SOTA baselines by a large margin: with larger computation reduction, it achieves higher top-1 accuracy by margins in the range [4.33%, 8.18%] and higher top-5 accuracy in the range [2.13%, 3.22%], respectively. SD relies on residual connections and cannot be applied to VGG-11. Similar results are observed on VGG-11, as shown in Table III.
VI-D Convergence Speed
The proposed approaches improve the convergence speed of training. The test error (i.e. 100% minus the accuracy on the test dataset) over the computation cost during training is shown in Fig. 9. The proposed EIF, EMP, and combined EIF+EMP approaches converge faster than the baselines, i.e. they reach a lower test error (higher accuracy) at the same computation cost. More specifically, EIF+EMP achieves 3.1x faster convergence and a 0.09% accuracy improvement compared with the standard mini-batch approach (SMB). The SOTA baselines OHEM and SD converge more slowly and incur larger accuracy losses of 8.46% and 1.61%, respectively.
VI-E Quantitative and Qualitative Analysis
Effectiveness of Adaptive Loss Threshold. The proposed early instance filter effectively predicts a predefined percentage of input instances as high-loss, and the adaptive loss threshold effectively adjusts the labeling threshold used to train the filter. In Fig. 10(a), the predefined high-loss ratio is 40% for training ResNet-110 on CIFAR-10. The number of predicted high-loss instances, averaged every 390 iterations, stabilizes at about 51, which corresponds to selecting 40% of the 128 instances in each mini-batch on average. As the average loss decreases, the adaptive loss threshold decreases in a similar pattern to closely track the latest state of the main network.
We further compare the proposed adaptive loss threshold with a static loss threshold. With a static loss threshold of 1.0, the number of predicted high-loss instances per mini-batch and the average loss of the main model during training are shown in Fig. 10(b). The goal of training is to minimize the loss of the main model. However, the static loss threshold cannot effectively decrease the loss of the main model, as shown by the blue line, and results in low accuracy. This is because the static threshold cannot track the latest state of the main model and therefore cannot stabilize the number of predicted high-loss instances used to train it. The static loss threshold achieves only 80.83% final accuracy on the main model. In contrast, the proposed adaptive loss threshold effectively minimizes the loss of the main model and achieves a high accuracy of 94.24%.
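The tracking behavior described above can be illustrated with a small simulation. The concrete update rule here (nudging the threshold toward the batch loss quantile at the target high-loss ratio) is an assumption of this sketch; the paper's exact adaptive rule is not reproduced:

```python
import random

def update_threshold(threshold, batch_losses, target_ratio, step=0.05):
    # Move the threshold a small step toward the loss value below which
    # (1 - target_ratio) of the batch falls, so that on average
    # target_ratio of instances are labeled high-loss.
    ordered = sorted(batch_losses)
    idx = min(int((1.0 - target_ratio) * len(ordered)), len(ordered) - 1)
    return threshold + step * (ordered[idx] - threshold)

random.seed(0)
threshold, target = 1.0, 0.4
for _ in range(500):
    batch = [random.random() for _ in range(128)]  # stand-in loss values
    threshold = update_threshold(threshold, batch, target)
# The threshold settles near the 60th-percentile loss, so roughly 40%
# of instances in each batch are labeled high-loss -- unlike a static
# threshold, it would keep tracking if the loss distribution shifted.
```

A static threshold corresponds to `step=0`, which is exactly the failure mode shown in Fig. 10(b): the labeling no longer follows the main model's state.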
Effectiveness of Weighted Loss for Training EIF. The weighted loss in Eq.(7) effectively trains the EIF network to make accurate loss predictions, which eventually results in higher accuracy of the main model. As shown in Fig. 11, when the weighted loss is employed, the ratio of wrong loss predictions by the EIF is much lower than without it. The predefined high-loss ratio is 30%, and the corresponding low-loss ratio is 70%; this ratio makes the numbers of high-loss and low-loss instances unbalanced in the input stream.
When the weighted loss is used for training EIF, the average wrong loss prediction ratio by EIF is reduced from 20.31% to 8.59%. This accurate loss prediction effectively selects highloss instances to train the main model and results in significantly higher accuracy of the main model, which is 94.05% with weighted loss vs. 90.58% without weighted loss.
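One way to counter such label imbalance is inverse-frequency class weighting; the sketch below uses a weighted binary cross-entropy and is an assumption of this illustration, since the exact form of the paper's Eq.(7) is not reproduced here:

```python
import math

def weighted_filter_loss(pred_high_prob, is_high_loss, high_loss_ratio=0.3):
    # Class-weighted binary cross-entropy for training the instance
    # filter. With a 30% high-loss ratio the labels are unbalanced, so
    # high-loss instances are up-weighted by 1/0.3 and low-loss ones by
    # 1/0.7, preventing the filter from collapsing to the majority label.
    if is_high_loss:
        return -math.log(pred_high_prob) / high_loss_ratio
    return -math.log(1.0 - pred_high_prob) / (1.0 - high_loss_ratio)

# At equal prediction confidence, a misclassified high-loss instance
# costs 7/3 times as much as a misclassified low-loss one.
loss_hi = weighted_filter_loss(0.5, True)
loss_lo = weighted_filter_loss(0.5, False)
```

Without the weighting, the filter could reach 70% labeling accuracy by always predicting "low-loss", which is consistent with the much higher wrong-prediction ratio observed above.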
Overhead of EIF. The proposed early instance filter has marginal energy and computation overhead. Fig. 12 shows the average energy and computation overhead of the EIF network per training iteration (i.e. one mini-batch of 128 instances) when training ResNet-110 on CIFAR-10. As shown by the yellow bar in Fig. 12(a), the energy overhead of the EIF network (measured on an NVIDIA Jetson TX2) is 0.43J per iteration, which is 10.22% of the total energy cost of 4.18J when training with EIF; without EIF, the energy cost is 12.90J per iteration. As shown in Fig. 12(b), the computation overhead of EIF is 3.88 GFLOPs, which is 11.65% of the total computation cost of 33.21 GFLOPs when training with EIF; without EIF, the computation cost is 99.91 GFLOPs per iteration. The detailed EIF computation overhead across all training iterations is shown in Fig. 12(c). While the overhead of EIF is not zero, the proposed approach achieves 67.60% energy saving and 66.76% computation saving while fully preserving the accuracy.
Preserved and Dropped Instances by EIF. To better understand the instances selected by the early instance filter, we cluster the instances that the filter preserves and drops when training ResNet-110 on CIFAR-10 and LeNet on MNIST, as shown in Fig. 13. We find that the dropped instances show full objects with typical characteristics, while the preserved instances either show only part of the object or show atypical characteristics that are hard even for humans to recognize. This result shows that the early instance filter effectively finds important instances to train the network.
Analysis of Error Map Pruning. To better understand the channels pruned and preserved in the backward pass by error map pruning, we visualize them to analyze the effectiveness of the proposed channel selection approach. Fig. 14 shows the preserved and pruned channels in the error map and the corresponding kernel weights in the conv2 layer of VGG-16. The 16 channels with the highest/lowest importance scores are shown on the top left and bottom left, respectively, with their corresponding convolutional kernel weights on the right. The pruned channels are darker, with smaller values than the preserved channels, which are brighter with larger values. Similarly, the kernel weights corresponding to the pruned channels have smaller values than those of the preserved channels. Therefore, the pruned channels have the least influence on both the error propagation and the computation of the weight gradients. This result shows that the proposed error map pruning approach effectively selects channels to prune so as to minimize the influence on training.
VI-F Practical Energy Saving on Hardware Platforms
The energy cost of training consists of both the computation cost and the memory access cost. While the former dominates the energy cost and is captured by the commonly used FLOPs metric [33], the energy saving ratio can differ slightly from the computation reduction ratio. To evaluate the practical energy saving, we conduct extensive experiments on two edge platforms and evaluate the proposed approaches in terms of practical energy saving and accuracy.
Hardware Setup. We apply the proposed training approach on two edge platforms to evaluate realistic energy saving. For mobile-level devices, we train ResNet-110, ResNet-74, and VGG-16 on an NVIDIA Jetson TX2 mobile GPU [28] with the CIFAR-10 and CIFAR-100 datasets using PyTorch 1.1, and measure the energy cost with an energy meter, as shown on the top of Fig. 15. For sensor-node-level devices, we train LeNet on the MSP432 MCU [12], with the training process implemented in C. Since the MCU cannot store the entire dataset, a computer feeds the training data into the MCU via UART during training. We use a Keysight N6705C power analyzer to measure the energy cost on the MCU, as shown on the bottom of Fig. 15.
Energy Saving of Training on Mobile GPU. We evaluate the energy saving by EIF+EMP on mobile-level devices. We repeat all the experiments in Tables I and II on the mobile GPU to measure the practical energy saving, except for LeNet, which is evaluated on the MCU. Our approach effectively reduces the energy cost of on-device training. Compared with the original SGD, the proposed EIF+EMP achieves energy savings of 67.60%, 63.57%, and 60.02% when training ResNet-110, ResNet-74, and VGG-16 on CIFAR-10, respectively, as shown in Fig. 16 (the ResNet-74 result is omitted for conciseness). These energy savings prolong battery life by 3.1x, 2.7x, and 2.5x while improving the accuracy or incurring only a slight 0.1% accuracy loss. Compared with the SOTA baselines OHEM and SD, our approach achieves significantly higher accuracy at similar energy savings. SD relies on residual connections and cannot be applied to VGG-16. Moreover, the practical energy saving ratios are very close to the computation reduction ratios in FLOPs, which shows that the computation reduction generalizes well to energy saving on hardware platforms. Similar results are observed on CIFAR-100, where we achieve 54.22% and 46.64% energy saving (2.2x and 1.9x battery life) for ResNet-110 and VGG-16, respectively, without any accuracy loss.
Energy Saving of Training on MCU.
We evaluate the energy saving by EIF+EMP on sensor-node-level devices (i.e. MCUs). We train LeNet on the MSP432 MCU for one epoch of 60,000 instances and measure the energy cost and accuracy. Due to the limited runtime memory, we set the batch size to 1. Since the original SGD approach would take too long (about 50 days) to complete on the MCU, we run 10% of one epoch's training iterations on the MCU and estimate the total energy cost by multiplying the measured energy cost by 10. The accuracy of the original SGD is measured on the P100 GPU after finishing one epoch. OHEM cannot be applied to MCUs because it needs batch-wise loss values for instance selection; to compare with it, we measure its energy cost on the MCU by completing its computation while ignoring the accuracy, and evaluate its accuracy on the P100 GPU.
EIF+EMP significantly reduces the energy cost of training on MCUs and effectively prolongs battery life. As shown in Fig. 17, when training LeNet on the MSP432 MCU, EIF+EMP reduces the energy cost by 74.09% while improving the accuracy by 0.33%, which prolongs battery life by 3.9x. OHEM, while not fully feasible on the MCU, achieves a much smaller energy saving of 59.78% with an accuracy loss of 0.32%. This result shows that EIF+EMP greatly improves the battery life of tiny sensor nodes and outperforms the baselines.
VII Conclusion
This work aims to enable on-device training of convolutional neural networks by reducing the computation cost at training time. We propose two complementary approaches: early instance filtering (EIF), which selects important instances for training the network and drops trivial ones, and error map pruning (EMP), which prunes insignificant channels in the error maps during backpropagation. Experimental results show superior computation reduction with higher accuracy compared with state-of-the-art techniques.
References
[1] (2017) Extremely large minibatch SGD: training ResNet-50 on ImageNet in 15 minutes.
[2] (2017) Compression-aware training of deep networks. In Advances in Neural Information Processing Systems.
[3] (2018) Scalable methods for 8-bit training of neural networks. In Advances in Neural Information Processing Systems, pp. 5145–5153.
[4] (2016) Sparsification and separation of deep learning layers for constrained resource inference on wearables. In Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems, pp. 176–189.
[5] (2012) Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pp. 421–436.
[6] (2009) ImageNet: a large-scale hierarchical image database. In CVPR 2009.
[7] (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430.
[8] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
[9] (2018) Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604.
[10] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[11] (2016) Deep networks with stochastic depth. In European Conference on Computer Vision, pp. 646–661.
[12] (2020) MSP432 microcontroller (website).
[13] (2018) Trading-off accuracy and energy of deep inference on embedded systems: a co-design approach. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37 (11).
[14] (2018) Highly scalable deep learning training system with mixed-precision: training ImageNet in four minutes.
[15] (2020) Hardware/software co-exploration of neural architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[16] (2019) Accuracy vs. efficiency: achieving both through FPGA-implementation aware neural architecture search. In Proceedings of the 56th Annual Design Automation Conference, pp. 1–6.
[17] Learning active learning from data. In Advances in Neural Information Processing Systems.
[18] CIFAR-10 dataset (website).
[19] (2010) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
[20] (2019) Neuro.ZERO: a zero-energy neural network accelerator for embedded sensing and inference systems. In Proceedings of the 17th Conference on Embedded Networked Sensor Systems.
[21] (2016) Pruning filters for efficient ConvNets. arXiv preprint arXiv:1608.08710.
[22] (2019) On neural architecture search for resource-constrained hardware platforms. arXiv preprint arXiv:1911.00105.
[23] (2017) ThiNet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066.
[24] (2019) PruneTrain: gradual structured pruning from scratch for faster neural network training. arXiv preprint arXiv:1901.09290.
[25] (2017) Federated learning: collaborative machine learning without centralized training data.
[26] (2016) Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.
[27] (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84.
[28] (2020) NVIDIA Jetson TX2 (website).
[29] (2016) Conditional deep learning for energy-efficient and enhanced pattern recognition. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[30] (2020) THOP library (website).
[31] (2008) Incremental learning for robust visual tracking. International Journal of Computer Vision 77 (1–3), pp. 125–141.
[32] (2018) Personalized machine learning for robot perception of affect and engagement in autism therapy. Science Robotics 3 (19), eaao6760.
[33] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[34] (2016) Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769.
[35] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[36] (2018) In-situ AI: towards autonomous and incremental deep learning for IoT systems. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 92–103.
[37] (2019) E2-Train: training state-of-the-art CNNs with over 80% energy savings. In Advances in Neural Information Processing Systems.
[38] (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082.
[39] (2020) Co-exploring neural architecture and network-on-chip design for real-time artificial intelligence. In 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 85–90.
[40] (2020) Co-exploration of neural architectures and heterogeneous ASIC accelerator designs targeting multiple tasks. arXiv preprint arXiv:2002.04116.
[41] (2020) Intermittent inference with nonuniformly compressed multi-exit neural network for energy harvesting powered devices. arXiv preprint arXiv:2004.11293.
[42] (2016) Less is more: towards compact CNNs. In European Conference on Computer Vision.
[43] (2018) Deep learning for smart agriculture: concepts, tools, applications, and opportunities. International Journal of Agricultural and Biological Engineering 11 (4), pp. 32–44.