1 Introduction
Deep learning approaches, such as convolutional neural networks (CNNs), have achieved tremendous success in versatile applications. However, deploying deep learning models on resource-constrained systems, such as drones and self-driving cars, is challenging because of their huge computation and energy cost. Diving into the performance of individual inference cases, researchers [teerapittayanon2016branchynet; figurnov2017spatially; wang2018skipnet] found that the significant growth in model complexity only helps to classify a handful of complicated inputs correctly and becomes "wasteful" for simple inputs. Motivated by this observation, several works
[panda2016conditional] have tackled the problem of input-dependent dynamic inference. As one dynamic inference technique, early exit attaches additional side-branch classifiers (exiting layers) to some of the network layers. As shown in Fig. 1, compared with the classic deep learning network in (a), the additional exiting layer in (b) allows a large portion of test samples to exit the network early once they have been inferred with high confidence. Despite being employed in burgeoning efforts to reduce computation cost and energy consumption in inference, early exits [scardapane2020should] still cannot address the challenges listed below. First, there is a dilemma between fine-grained and coarse-grained placements of exiting points. Fine-grained exiting points lead to significant performance and energy overheads due to frequent execution of the exiting layers [enzo2020optimized]. In contrast, coarse-grained exiting points may miss opportunities to exit earlier. Second, the exiting layers have different topologies at different exiting points, which adds an extra burden when mapping the computation to hardware resources. Third, energy-efficient computing platforms using early exits can only reduce the computing voltage and frequency after exiting; during the inference process, runtime computing configuration adjustments, such as dynamic voltage and frequency scaling (DVFS), cannot be applied ahead of time for energy saving.
To provide an efficient early exit for resource-constrained computing platforms, we propose Predictive Exit: prediction of fine-grained early exits for computation- and energy-efficient inference. To the best of our knowledge, Predictive Exit is the first work to forecast the exiting point and adjust the computing configuration (i.e., DVFS) as the inference proceeds, which achieves significant computation and energy savings. The contributions of this work are threefold.


A novel Predictive Exit framework for deep learning networks is introduced, which is shown in Fig. 1 (c). Fine-grained exiting layers sharing the same topology are potentially placed after each network layer. Selective execution of the exiting layers captures opportunities to exit on time while reducing computation and energy overheads.

A low-cost prediction engine is proposed to predict the point where the network will exit, selectively execute the exiting layer at runtime based on this prediction, and optionally adjust the computing configuration via DVFS for energy saving during each inference.

Extensive experiments with floating-point (FP32) and fixed-point quantized (INT8) operations are conducted to demonstrate that Predictive Exit achieves computation cost reduction and energy savings with tiny accuracy loss compared with the network without early exit, and with no accuracy loss compared with early exits using state-of-the-art placement strategies.
2 Background and Related Work
2.1 Latency, Power, and Energy
In many applications, such as robots and self-driving cars, deep learning networks are released and executed periodically. The time interval between the release times of two consecutive inferences is called the inference period $T$. The actual execution time of each inference is called the latency, which rarely exceeds the period. Given a computing platform and a neural network with fixed computation, the inference latency is inversely proportional to the operating frequency of the digital logic inside the computing platform, called the computing frequency $f$. Following the basic rules of semiconductors, a higher $f$ indicates faster computing but requires a linearly increased processor supply voltage $V$. When a task is executed, the power consumption in computing, called active power $P_{active}$, is the sum of dynamic, static, and constant power. When the computing platform is idle, the power consumption, called idle power $P_{idle}$, mainly comprises static and constant power. The energy of a computing platform integrates both active and idle power over time [tambe2021edgebert],
$E = P_{active}\, t_{exe} + P_{idle}\, t_{idle}$ (1)
where $t_{exe}$ is the time spent executing the task, and $t_{idle}$ is the time during which the computing platform is idle, i.e., the interval between finishing this inference and starting the next one, $t_{idle} = T - t_{exe}$. The dynamic power consumption originates from the activity of logic gates inside a processor, which can be derived by
$P_{dyn} = C V^{2} f$ (2)
where $C$ is the capacitance of the switched logic gates. The static power dissipation originates from the transistors' static current (including leakage and short-circuit current) when the processor is powered on, which is described by
$P_{static} = N V I_{static}$ (3)
where $N$ is the number of logic gates, and $I_{static}$ is the normalized static current of each logic gate. The constant power is consumed by the auxiliary devices of a computing platform, such as board fans and peripheral circuitry.
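The power and energy model of Eqs. (1)-(3) can be sketched in a few lines; all constants in the example are illustrative assumptions, not measured values.

```python
# Sketch of the power and energy model in Eqs. (1)-(3). All constants passed
# in are illustrative assumptions; real values depend on the processor.

def dynamic_power(c_switched, voltage, frequency):
    """Eq. (2): dynamic power of the switched logic gates, P_dyn = C V^2 f."""
    return c_switched * voltage ** 2 * frequency

def static_power(n_gates, voltage, i_static):
    """Eq. (3): static power, P_static = N V I_static."""
    return n_gates * voltage * i_static

def energy(p_active, t_exe, p_idle, t_idle):
    """Eq. (1): energy over one period, E = P_active t_exe + P_idle t_idle."""
    return p_active * t_exe + p_idle * t_idle
```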
2.2 Early Exit with Dynamic Voltage and Frequency Scaling
DVFS is a power management technique in computer architecture whereby the voltage $V$ and frequency $f$ of a microprocessor can be automatically adjusted on the fly, depending on actual needs, to conserve the power and energy cost of the computing platform [tang2019impact]. Given the same inference period $T$, combining early exit and DVFS can effectively reduce the cost of computation resources for executing a deep learning network.
Originally, deep learning networks are executed by running all network layers at the default high frequency and voltage without early exit or DVFS. We assume the inference latency (execution time) equals the inference period $T$. In this case, the energy consumption within $T$ is $E = P_{active}(f_{high}) \cdot T$. With early exits deployed, the network layers located before the early exit run at the default high frequency and voltage. After the network exits at time $t_{exit}$, the computing platform can reduce the voltage and frequency to the lowest level by DVFS [tambe2021edgebert] until the time reaches $T$. Therefore, the energy consumption can be described by $E = P_{active}(f_{high}) \cdot t_{exit} + P_{idle} \cdot (T - t_{exit})$.
Since the processor voltage-frequency setting scales cubically with dynamic power but only linearly with static power and latency according to Eqs. (2) and (3), it is highly desirable to further reduce the energy spent before the exit. A direct intuition is to adjust the voltage and frequency to a proper middle level and run the network until it exits exactly at time $T$. However, predicting where the network will exit and adjusting the voltage-frequency pair during the inference is non-trivial. In this work, we propose a solution that adjusts the voltage and frequency to the proper "middle level" at runtime based on a prediction of early exits, reducing the energy cost by better utilizing the inference period.
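The three execution regimes above can be compared with a hedged numeric illustration, using made-up power numbers and the linear voltage-frequency assumption from Section 2.1:

```python
# Illustrative comparison of the three execution regimes discussed above,
# with made-up power numbers; assumes voltage scales linearly with frequency
# and latency scales inversely with it.

T = 1.0                       # inference period (s)
P_HIGH, P_IDLE = 200.0, 40.0  # active power at high V-f and idle power (W)
t_exit = 0.4                  # time at which the network exits (s)

# (a) No early exit, no DVFS: high V-f for the whole period.
e_no_exit = P_HIGH * T

# (b) Early exit at t_exit, then drop to the lowest V-f (~idle) until T.
e_exit_then_dvfs = P_HIGH * t_exit + P_IDLE * (T - t_exit)

# (c) "Middle-level" V-f chosen so the inference finishes exactly at T:
# the required frequency ratio is t_exit / T, and dynamic power scales
# cubically with it (treating P_HIGH as dynamic and P_IDLE as baseline).
ratio = t_exit / T
e_middle = (P_HIGH * ratio ** 3 + P_IDLE) * T

assert e_middle < e_exit_then_dvfs < e_no_exit
```

Even in this crude model, stretching the computation over the whole period beats racing to the exit and idling, because of the cubic dynamic-power term.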
2.3 Related Work
Early exits have attracted considerable attention as an important branch of dynamic inference in the past few years. Since they were first studied in [panda2016conditional; teerapittayanon2016branchynet], many approaches have been proposed to improve the accuracy or reduce the computation cost of exit decisions. Wang et al. [wang2019dynexit] offered a dynamic loss-weight adjustment early-exit strategy for ResNets together with a hardware-friendly architecture for early-exit branches. Laskaridis et al. [laskaridis2020spinn] presented SPINN, a distributed progressive inference mechanism that maintains reliable execution of CNN inference across device-server setups. To determine the placement of exiting layers, Kaya et al. [kaya2019shallow] explored the distribution of computation cost; Bonato et al. [vanderlei2021class] classified the exit point for each inference object; Panda et al. [panda2017energy] and Baccarelli et al. [baccarelli2020optimized] calculated the benefit and the additional computation and energy cost of adding each exiting layer. To reduce the average and longest-path inference times, Sabih et al. [sabih2022dyfip] introduced a strategy that combines early exit with dynamic filter pruning. To further balance the accuracy, inference time, and energy trade-offs, Wołczyk et al. [maciej2021zero] proposed a Zero Time Waste (ZTW) method that adds direct connections between exiting layers and combines previous outputs in an ensemble-like manner, and Ghodrati et al. [ghodrati2021frame] proposed an on-the-fly supervision training mechanism to reach a dynamic trade-off between accuracy and power. Although the multi-exit networks in these studies reduce inference time and computational cost, their efficiency relies on selecting a placement of exiting layers that balances cost savings against inference accuracy.
However, studies focused on choosing the best placement usually result in fixed locations and thus neglect the possibility of exiting between two exiting layers. Our work provides a predictive design that dynamically changes the placement of the exiting layer, which can further reduce inference time and computational cost.
Along with this theoretical progress, early exits have started to be deployed in hardware and IoT systems [odema2021eexnas; samikwa2022adaptive]. Li et al. [li2020edge] and Zeng et al. [zeng2019boom] adopted device-edge synergy optimized by DNN partitioning and DNN right-sizing through early exit for on-demand collaborative DNN inference; the strategy is feasible and effective for low-latency edge intelligence. Xu et al. [xu2018efficient] presented a compressed CeNN framework optimized by five different incremental quantization methods and early exit; an FPGA implementation shows that two of the optimizations achieve 7.8x and 8.3x speedups, respectively, with almost no performance loss. On the hardware side, a low-cost early exit decision hardware unit was designed in [kim2020low] to reduce the cost of early exiting, and Tambe et al. [tambe2021edgebert] exploited early exits in BERT hardware accelerator designs to reduce computation power and energy consumption. Meanwhile, Laskaridis et al. [laskaridis2021adaptive] and Scardapane et al. [scardapane2020should] provided thorough overviews of the current architectures, state-of-the-art methods, and future challenges of early-exit networks. These studies have contributed greatly to the power savings of early exits on embedded CNNs. While most of their strategies demand a redesign of either the hardware or the network, our work proposes a DVFS-based method that avoids such reconstruction.
3 The Framework of Predictive Exit
Targeting computation- and energy-efficient inference with early exits, we propose a Predictive Exit framework for deep learning networks, taking the CNN, the most popular deep learning algorithm [alzubaidi2021review], as an example. Predictive Exit combines the following schemes into one continuous design, as shown in Fig. 2.
Unified exiting layer:
The neural networks run on graphics processing units (GPUs), tensor processing units (TPUs), or application-specific integrated circuits (ASICs) on local or edge devices. To relieve the burden of executing versatile computations, we make all exiting layers used in the network share the same topology.
Fine-grained exiting points: Exiting layers are potentially placed after each convolution layer to fully exploit early exit opportunities. To avoid the computation overhead of running every exiting layer, only the layer at the predicted exiting point is executed.
Low-cost prediction engine: The core of the framework is the low-cost prediction engine, which forecasts the possible exiting point and activates the corresponding early exiting layer. Based on the remaining computation workload before the expected exiting point, a proper computing frequency and voltage are selected to run the inference and further save energy.
3.1 Unified Exiting Layer
In this work, the design of the exiting layer is based on existing work [passalis2020efficient]. All exiting layers share the same topology: a Bag-of-Features (BoF) pooling layer followed by a Fully-Connected (FC) layer. Let $x_l$ be the intermediate result at the $l$-th layer and $N_C$ be the number of object classes in the dataset. The BoF pooling layer performs feature aggregation to extract a compact representation from $x_l$. In the BoF pooling, a set of feature vectors called the codebook is used to describe $x_l$; the weight of each codebook entry is generated by measuring the similarity between the codebook entry and $x_l$. Since the size of the codebook weight vector is usually larger than $N_C$, the FC layer further works as a classifier that maps the result of BoF pooling to $N_C$ classes, which estimates the final output of the exiting layer. Details can be found in [passalis2020efficient]. The function of the exiting layer is denoted as

$\hat{y}_l = \mathrm{FC}(\mathrm{BoF}(x_l))$ (4)

where $x_l$ stands for the intermediate result after the calculation of the 1st to $l$-th layers.
After training the early exiting layers, the average feature weight $\bar{w}$ is first calculated according to Eq. (5); it serves as a parameter for the exit decision. One consistent reading of this average, given the exit test below, is the mean of the maximum feature weights over the training set,

$\bar{w} = \frac{1}{|D|}\sum_{x \in D} \max\big(w(x)\big)$ (5)

where $|D|$ is the size of the training dataset $D$. During the inference, at each early exiting layer, the weight ratio $r_l$ is calculated as the maximum feature weight over $\bar{w}$ multiplied by a hyperparameter $\gamma$ specified by the user. Once $r_l$ in Eq. (6) is larger than 1, the inference is terminated and the result at this early exiting layer is used as the final result,

$r_l = \frac{\max(w_l)}{\gamma\,\bar{w}} > 1$ (6)
The key idea behind this implementation is that the larger the maximum feature weight, the more confident the classification is regarded to be. The parameter $\gamma$ strikes a balance between accuracy and the extent of acceleration: if the user expects a more accurate result, $\gamma$ should be set higher, and vice versa.
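The exit test of Eq. (6) reduces to a one-line check. In the sketch below, `avg_weight` stands for the trained average feature weight of Eq. (5) and `gamma` is the user-set confidence hyperparameter; both names are ours, not the paper's.

```python
# One-line form of the exit test in Eq. (6): terminate the inference when
# the weight ratio r exceeds 1. The feature weights would come from the
# BoF + FC exiting layer; here they are a plain list of floats.

def should_exit(feature_weights, avg_weight, gamma):
    """Exit when r = max(w) / (gamma * avg_weight) > 1."""
    r = max(feature_weights) / (gamma * avg_weight)
    return r > 1.0
```

With the same weights, a larger `gamma` makes the test stricter and thus favors accuracy over acceleration, matching the trade-off described above.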
3.2 FineGrained Exiting Points
In the classic early exit design, a limited number of exiting layers are placed at fixed locations and distances in the neural network. For example, in the inference model proposed in [classicexample], the two early exits are placed at fixed depths of the network. If the first exiting layer fails, the inference must run through the following network layers until it reaches the next exiting layer to check whether it can exit. A more sensible approach, proposed by Kaya et al. [kaya2019shallow], is to place the exiting layers based on the percentage of the total cost after a rough estimation of the computational cost of each layer. However, this "fixed-place" design ignores the possibility of exiting at layers between two early exits. It is non-trivial to choose where and how to place early exits to satisfy both computation time and inference accuracy.
In this work, we propose a fine-grained exiting point design: potential exiting layers are placed at every network layer starting from layer $L_0$. $L_0$ is a hyperparameter specified by the user, working as an adjuster between computation time and inference accuracy. During the inference process, the first trial of early exit happens at $L_0$. If the trial succeeds, the early exit is triggered. If it fails, instead of running every exiting layer after $L_0$, the position of the next trial is $L_1 = \mathrm{Predict}(L_0)$, where Predict is the function of the low-cost prediction engine that forecasts the next location to exit, introduced in the following subsection. If the early exit does not succeed at $L_i$, we start another prediction based on the exiting layer result at $L_i$. That is,
$L_{i+1} = \mathrm{Predict}(L_i)$ (7)
Therefore, the expected exit location can be expressed as
$\mathbb{E}[L_{exit}] = \sum_{i=0}^{k} p^{i}(1-p)\,L_i$ (8)
where $L_0$ is the location of the first trial, $k$ stands for the number of predictions, and $p$ denotes the probability of prediction failure.
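The control flow of this subsection can be sketched as follows. `run_layers`, `try_exit`, and `predict_next` are hypothetical callables standing in for the network body, the unified exiting layer (Eq. (6)), and the Predict function (Eq. (7)); they are not APIs from the paper.

```python
# Sketch of the fine-grained exit loop: try the exit at L0; on failure, jump
# straight to the layer returned by the prediction engine instead of testing
# every intermediate exit.

def infer_with_predictive_exit(x, n_layers, l0, run_layers, try_exit, predict_next):
    layer = l0
    state = run_layers(x, start=0, stop=layer)    # run layers 1..L0
    while True:
        exited, result = try_exit(state, layer)   # weight-ratio test, Eq. (6)
        if exited or layer >= n_layers:
            return result, layer
        next_layer = min(predict_next(state, layer), n_layers)
        state = run_layers(state, start=layer, stop=next_layer)
        layer = next_layer

# Tiny demonstration with stubs: the "network" counts executed layers, the
# exit succeeds once layer 5 is reached, and Predict always jumps 2 ahead.
result, exit_layer = infer_with_predictive_exit(
    0, 10, 1,
    run_layers=lambda s, start, stop: s + (stop - start),
    try_exit=lambda s, layer: (layer >= 5, ("out", layer)),
    predict_next=lambda s, layer: layer + 2,
)
```

Note that exiting layers between the predicted points are simply skipped, which is where the computation saving over naive fine-grained placement comes from.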
3.3 LowCost Prediction Engine
3.3.1 Prediction of exiting points
The Predict function forecasts the exiting point starting from layer $L_0$. To fully use the information from the intermediate results during the inference process, the key operation of Predict is a one-dimensional convolution over the exiting-layer feature weights computed from the intermediate result at $L_i$, which is summarized in Algorithm 1. To perform the convolution, we first extend the feature-weight vector $w_{L_i}$ by zero padding, which allows the filter to fully cover it,
$\tilde{w}_{L_i} = \mathrm{ZeroPad}(w_{L_i})$ (9)
Afterwards, a vector $k$ (with all entries equal to 1 in our work) is generated as the filter of the convolution. Based on $\tilde{w}_{L_i}$ and $k$, a new set of feature weights is generated through one-dimensional convolution, which estimates the result of an exiting layer placed at layer $L_i + 1$,
$\hat{w}_{L_i+1} = \tilde{w}_{L_i} * k$ (10)
By recursively replacing the feature weights in Eq. (9) with $\hat{w}_{L_i+1}$ and repeating the computation of Eqs. (9) and (10), the predicted result of an exiting layer placed at $L_i + 2$ can be obtained and denoted as $\hat{w}_{L_i+2}$. Following the above steps, the predicted result of any exiting layer placed after $L_i$ can be calculated. With these predicted results, the Predict function is described as follows:
$\mathrm{Predict}(L_i) = \min\{\, l > L_i : \hat{r}_l > 1 \,\}$ (11)
where the result is the predicted exiting point after $L_i$, i.e., the smallest integer $l > L_i$ that satisfies
$\hat{r}_l = \frac{\max(\hat{w}_l)}{\gamma\,\bar{w}} > 1$ (12)
where $\hat{r}_l$ indicates the predicted exiting confidence at layer $l$. In case no such $l$ can be found, a hyperparameter $\Delta L$ is further introduced, i.e., $l \le L_i + \Delta L$. $L_i + \Delta L$ should be no more than $n$, where $n$ stands for the number of layers in the model, meaning that a prediction result beyond the last layer is forbidden. If no integer in $(L_i, L_i + \Delta L]$ satisfies Eq. (12), we assume $\mathrm{Predict}(L_i) = L_i + \Delta L$. The prediction engine is simple enough to avoid adding notable computation cost to the network: each prediction step performs one fixed-size one-dimensional convolution, so the cost of Algorithm 1 grows only linearly with the number of layers covered by the prediction. Meanwhile, the hyperparameters $L_0$, $\gamma$, and $\Delta L$ introduced in the prediction can be tuned by advanced users to balance prediction accuracy and computation cost for different application scenarios.
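The recursion of Eqs. (9)-(12) can be sketched as below. The filter length, padding width, and per-step renormalization are assumptions consistent with the description above, not the paper's exact settings.

```python
import numpy as np

# Hedged sketch of the prediction recursion: zero-pad the current
# feature-weight vector (Eq. 9), convolve it with an all-ones 1-D filter to
# estimate the next exiting layer's weights (Eq. 10), and return the first
# layer whose predicted weight ratio exceeds 1 (Eqs. 11-12).

def predict_exit_point(weights, layer, avg_weight, gamma, n_layers, max_ahead, k=3):
    kernel = np.ones(k)
    w = np.asarray(weights, dtype=float)
    for ahead in range(1, max_ahead + 1):
        padded = np.pad(w, (k // 2, k // 2))           # Eq. (9): zero padding
        w = np.convolve(padded, kernel, mode="valid")  # Eq. (10): 1-D convolution
        w /= w.sum()                                   # keep weights normalized
        r_hat = w.max() / (gamma * avg_weight)         # Eq. (12): predicted ratio
        if r_hat > 1.0:                                # Eq. (11): first confident layer
            return min(layer + ahead, n_layers)
    return min(layer + max_ahead, n_layers)            # fallback: L_i + delta L
```

Because the convolution smooths the weight vector, a distribution that is already slightly peaked sharpens relative to the trained average within a few steps, which is what drives the predicted exit forward.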
Discussion of Weight-Ratio- and Cross-Entropy-Based Prediction. We discuss predicting early exits using the weight ratio $r$ and the cross entropy $H$ (defined later in Eq. (15)) because these two parameters are indicators of confidence in early exits: more precisely, a higher weight ratio or a lower cross entropy indicates more confidence in exiting. As shown in Fig. 3, we record the values of the weight ratio and the cross entropy (x-axis) at the prediction starting point and the actual exiting layer (y-axis) during the inference process. The test case results of the ResNet34 model on the SVHN dataset are presented in the figure. For weight ratios in [0.2, 0.6), the exiting points are distributed between the 1st and 12th layers, and the same spreading trend is observed when the weight ratio lies in [0.6, 0.8) and [0.8, 1.0). Similarly, the exiting points are widely distributed for the same cross entropy. Therefore, it is hard to predict the exiting point purely based on the weight ratio or the cross entropy, even though they are indicators of confidence in early exits.


3.3.2 DVFS for Predictive Exit
Based on the predicted exiting point, the prediction engine adjusts the voltage-frequency pair to a proper "middle level" and runs the network until it exits at time $T$. Given a network with $n$ layers and a prediction engine that starts the prediction at $L_0$ and predicts the exiting point $L_{pred}$, the prediction engine reduces the computing frequency (and its corresponding voltage) to
$f = \frac{L_{pred} - L_0}{n - L_0}\, f_{high}$ (13)
where $f_{high}$ is the default high frequency of the computing platform. Therefore, the energy consumed in this inference is
$E = P_{active}(f_{high})\, t_0 + P_{active}(f)\,(T - t_0)$ (14)
where $t_0$ is the time to finish the prediction starting layer $L_0$ and its previous network layers. As the computing frequency is selected based on the predicted remaining workload, the prediction engine makes the inference finish by time $T$ when predictions are correct. Since processor voltage and frequency scale linearly with computing performance (inversely proportional to inference latency) but cubically with dynamic power and linearly with static power, the predictive exit effectively reduces energy consumption.
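Eqs. (13) and (14) amount to a proportional frequency scale-down plus an energy integral; a minimal sketch follows, with a hypothetical `p_active` frequency-to-power model standing in for measured data.

```python
# Minimal sketch of Eqs. (13)-(14): scale the frequency to the predicted
# remaining workload so the inference still finishes within the period T.

def scaled_frequency(f_high, n_layers, l0, l_pred):
    """Eq. (13): f = f_high * (L_pred - L0) / (n - L0)."""
    return f_high * (l_pred - l0) / (n_layers - l0)

def inference_energy(p_active, f_high, f_mid, t0, period):
    """Eq. (14): E = P_active(f_high) * t0 + P_active(f_mid) * (T - t0)."""
    return p_active(f_high) * t0 + p_active(f_mid) * (period - t0)
```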
4 Training the Network with Predictive Exit
The learning objective of Predictive Exit is to maintain the network's inference accuracy given the predictive exit functions. Therefore, the objective function is the cross entropy loss for the entire network,
$H(p, q) = -\sum_{x} p(x)\log q(x)$ (15)
where $p$ and $q$ are the true class distribution and the predicted class distribution of each object.
Our model is trained through batch gradient descent: the data are fed into the model to optimize its parameters. Let $N$ be the size of the training set $D$, and let the target vector $y$ be the correct result. At first, the model is trained without any exiting layer by the following equation,
$W \leftarrow W - \eta \nabla_{W} H(p, q)$ (16)
where $\eta$ is the learning rate; if the accuracy fails to meet the requirement, $\eta$ can be adjusted. Afterwards, with the exiting layers attached, the model is trained again to optimize the parameters $\theta$ in the BoF pooling and FC layers,
$\theta \leftarrow \theta - \eta \nabla_{\theta} H(p, q)$ (17)
In our design, the original model parameters are not fixed during the training of the exiting layers. Although this may sacrifice some accuracy of the original output layer, it leads to higher accuracy at each exiting layer: the output of the original model can be regarded as another exiting layer and trained jointly to compensate for the accuracy loss. Although the training procedure requires two steps, the advantage brought to the inference process far outweighs this disadvantage, especially when training is done on a server and only the inference is deployed on resource-constrained platforms.
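The two-step schedule can be illustrated with a toy quadratic loss standing in for the cross entropy of Eq. (15); this is a sketch of the training structure only, and for brevity the backbone is frozen in step 2, unlike our design, which co-trains it with the exiting layers.

```python
# Toy illustration of the two-step training schedule around Eqs. (16)-(17),
# with quadratic losses standing in for the cross entropy of Eq. (15).

def sgd_step(param, grad, eta):
    """One gradient-descent update, the shape of Eqs. (16)-(17)."""
    return param - eta * grad(param)

# Step 1 (Eq. 16): train the backbone weight w on the toy loss (w - 2)^2.
w = 10.0
for _ in range(200):
    w = sgd_step(w, lambda p: 2 * (p - 2.0), eta=0.1)

# Step 2 (Eq. 17): train the exit-layer parameter theta on (theta - w)^2,
# i.e., fit the exiting layer to the already-trained backbone.
theta = -5.0
for _ in range(200):
    theta = sgd_step(theta, lambda p: 2 * (p - w), eta=0.1)
```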
5 Evaluation
5.1 Experimental Setup
We evaluate Predictive Exit using VGG19 and ResNet34 as the backbone models on the commonly used CIFAR10, CIFAR100 [cifar10], SVHN [svhn], and STL10 [stl10] datasets. For low-computation and low-energy applications, both floating-point (FP32) and fixed-point quantized (INT8) operations are tested. The training follows the procedure in Section 4. For the classic network and the exiting layers, we set the learning rates to 0.00025 and 0.001, respectively, and train each for 100 iterations. The selection of $L_0$ is based on the prediction accuracy detailed in Section 5.3. We compare the proposed Predictive Exit with:

Classic CNN, which adopts VGG19 [simonyan2014very] or ResNet34 [he2016deep];

Hierarchical Early Exits [passalis2020efficient], which is specifically designed for CNNs;

Placement Early Exit, proposed in [panda2017energy] and improved in [baccarelli2020optimized] with tunable hyperparameters, which identifies the best place to attach an early exit by weighing its benefit against the extra computation and energy cost.
5.2 Inference Accuracy and Computation Cost
The inference accuracy and computation cost (the number of floating-point or integer operations, normalized by that of the Classic CNN) of the different inference approaches and datasets are shown in Table 1. On the VGG19 model, Predictive Exit achieves the same inference accuracy as the other early exit approaches (1%-3% lower than Classic CNN). Because Predictive Exit continues with the next prediction instead of forcing the inference to exit when a prediction is wrong, it introduces no extra inference accuracy loss. The early-exit-based approaches significantly reduce the computation cost of Classic CNN. Compared with Hierarchical and Placement, Predictive Exit further reduces computation cost by up to 12.8% by leveraging opportunities to exit earlier and avoiding frequent execution of the exiting layers. On the ResNet34 model, compared with Hierarchical and Placement, Predictive Exit reduces computation cost by up to 12.1% without extra accuracy loss.
Model  VGG19  ResNet34
Approach  Classic CNN  Hierarchical  Placement  Predictive Exit  Classic CNN  Hierarchical  Placement  Predictive Exit
Each cell: inference accuracy / normalized computation cost.
FP32 CIFAR10  89% / 100%  88% / 59.8%  88% / 46.9%  88% / 46.2%  90% / 100%  89% / 28.7%  89% / 12.1%  89% / 7.7%
FP32 CIFAR100  64% / 100%  62% / 86.6%  62% / 80.6%  62% / 76.2%  66% / 100%  63% / 36.6%  63% / 21.2%  63% / 16.1%
FP32 SVHN  89% / 100%  87% / 55.6%  87% / 56.5%  87% / 46.8%  91% / 100%  89% / 6.9%  89% / 5.4%  89% / 3.8%
FP32 STL10  75% / 100%  74% / 67.6%  74% / 64.2%  74% / 51.5%  76% / 100%  73% / 61.8%  73% / 17.8%  73% / 5.7%
Int8 CIFAR10  88% / 100%  87% / 62.3%  87% / 46.4%  87% / 46.1%  85% / 100%  82% / 29.6%  82% / 14.9%  82% / 8.4%
Int8 CIFAR100  64% / 100%  62% / 95.0%  62% / 79.5%  62% / 76.4%  62% / 100%  60% / 44.4%  60% / 22.3%  60% / 18.5%
Int8 SVHN  89% / 100%  86% / 57.2%  86% / 47.4%  86% / 46.7%  88% / 100%  85% / 7.1%  85% / 5.6%  85% / 4.0%
Int8 STL10  74% / 100%  74% / 66.9%  74% / 64.3%  74% / 51.5%  71% / 100%  70% / 78.1%  70% / 17.8%  70% / 5.8%
For the training process, compared with Classic CNN, which requires only training the original network layers, the Hierarchical structure demands additional training of the exiting layers placed at specified positions in the network. While both the Placement and Predictive designs need to train exiting layers placed after each layer of the network, Placement needs one more step to determine the placement indexes through an exhaustive search for the most profitable exiting layers.
5.3 Where to Start Prediction: Hyperparameter L0
The prediction engine starts the exiting prediction from the starting layer $L_0$. For each model and dataset, $L_0$ should be chosen before the deep learning model is deployed, as it directly determines Predictive Exit's performance. This section quantitatively compares the prediction accuracy of different $L_0$ values during the training process. To let the prediction engine cover a wide range of network layers, $L_0$ should be among the first few layers of the network. Therefore, in the VGG19 network, we test $L_0$ placed at the 1st-10th layers, and in the ResNet34 network, we test $L_0$ placed at the 6th-20th layers. Fig. 4 and Fig. 5 present the prediction accuracy under FP32 and INT8 operations across different datasets. The prediction accuracy indicates the percentage of inferences that successfully exit at the first predicted exiting point. In the VGG19 network, if $L_0$ is the first network layer, the prediction achieves over 81% accuracy across all tested datasets and operations. Most notably, if $L_0$ is the 6th network layer, the prediction achieves over 99.8% accuracy. Therefore, $L_0$ for the VGG19 network and the test datasets is set to the 6th network layer. Similarly, for the ResNet34 network, the desired $L_0$ is the 6th, 14th, 10th, and 6th layer for the CIFAR10, CIFAR100, SVHN, and STL10 datasets, respectively. Comparing VGG19 and ResNet34, Predictive Exit achieves higher exiting prediction accuracy on VGG19, which has a shallower network structure.
5.4 Potential Energy Benefit by DVFS
To illustrate the potential energy benefit of Predictive Exit, we calculate the energy consumption using offline-measured active and idle power of an NVIDIA Quadro GV100 GPU at different frequency-voltage pairs [kandiah2021accelwattch] (shown in Table 2). In a network with $n$ layers, when Predictive Exit predicts the network will exit at layer $L_{pred}$, the processor selects the lowest candidate frequency in Table 2 that is higher than or equal to $\frac{L_{pred}-L_0}{n-L_0} f_{high}$ to execute the network. We count the inference workload (including the exiting layer) under each frequency-voltage pair. Based on the power consumption of each frequency-voltage pair, we compare the energy consumption and normalize it to that of the Classic CNN model in Table 3. Unsurprisingly, early exit achieves tremendous energy savings compared with Classic CNN models. Predictive Exit further reduces energy consumption (compared with the best cases of Hierarchical and Placement) by 19.9% to 37.6% on the VGG19 network and 19.7% to 37.3% on the ResNet34 network by predicting the exit and adjusting the computing configuration during the inference process.
Frequency (GHz)  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00  1.05  1.10  1.15  1.20  1.25  1.30  1.35  1.40  1.45 
Active Power (W)  59.2  67.4  73.5  81.1  85.3  90.4  97.5  104.9  112.8  119.5  130.1  139.5  148.9  161.1  170.2  180.6  199.1  218.5 
Idle Power (W)  35.1  35.9  37.0  37.8  38.9  39.8  41.1  41.3  43.8  44.2  45.0  45.8  46.5  47.8  49.3  50.5  52.3  55.1 
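The frequency-selection rule used to produce Table 3 can be sketched as follows, reusing the candidate frequencies of Table 2; the layer numbers in the usage example are hypothetical.

```python
# Sketch of the frequency-selection rule: pick the lowest candidate
# frequency from Table 2 that meets the Eq. (13) target.

FREQS_GHZ = [0.60 + 0.05 * i for i in range(18)]  # 0.60 .. 1.45, as in Table 2

def select_frequency(f_high, n_layers, l0, l_pred, candidates=FREQS_GHZ):
    target = f_high * (l_pred - l0) / (n_layers - l0)  # Eq. (13)
    for f in sorted(candidates):                        # lowest qualifying first
        if f >= target:
            return f
    return max(candidates)                              # cap at the highest V-f
```

For instance, with $f_{high} = 1.45$ GHz, $n = 20$, $L_0 = 6$, and $L_{pred} = 13$, the Eq. (13) target is 0.725 GHz and the rule selects the 0.75 GHz candidate.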
Model  VGG19  ResNet34  
Approach  Classic CNN  Hierarchical  Placement  Predictive Exit  Classic CNN  Hierarchical  Placement  Predictive Exit 
FP32 CIFAR10  100%  62.2%  47.9%  27.1%  100%  58.6%  59.3%  30.3% 
FP32 CIFAR100  100%  85.1%  72.8%  44.1%  100%  69.9%  67.3%  39.9% 
FP32 SVHN  100%  58.9%  54.0%  27.1%  100%  46.8%  49.1%  27.1% 
FP32 STL10  100%  65.7%  64.7%  27.1%  100%  73.6%  64.5%  27.2% 
Int8 CIFAR10  100%  76.8%  47.0%  27.1%  100%  58.8%  60.7%  30.6% 
Int8 CIFAR100  100%  85.2%  73.1%  45.1%  100%  75.0%  68.1%  41.9% 
Int8 SVHN  100%  54.0%  47.7%  27.1%  100%  46.9%  49.1%  27.1% 
Int8 STL10  100%  66.0%  64.7%  27.1%  100%  85.3%  64.5%  27.2% 
6 Conclusion
The proposed Predictive Exit accurately predicts where the network will exit, serving as a computation- and energy-efficient inference technique. By activating the exiting layer at the expected exiting point, Predictive Exit reduces network computation costs by exiting on time without running every pre-placed exiting layer, and it significantly reduces inference energy consumption by selecting proper computing configurations during the inference process. Since our method is a plug-in for existing deep learning networks without model modification, Predictive Exit could be applied to general deep learning networks, which we will explore in future work.
References

[1]
Surat Teerapittayanon, Bradley McDanel, and HsiangTsung Kung.
Branchynet: Fast inference via early exiting from deep neural
networks.
In
2016 23rd International Conference on Pattern Recognition (ICPR)
, pages 2464–2469. IEEE, 2016. 
[2]
Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang,
Dmitry Vetrov, and Ruslan Salakhutdinov.
Spatially adaptive computation time for residual networks.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pages 1039–1048, 2017.  [3] Xin Wang, Fisher Yu, ZiYi Dou, Trevor Darrell, and Joseph E Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 409–424, 2018.
[4] Priyadarshini Panda, Abhronil Sengupta, and Kaushik Roy. Conditional deep learning for energy-efficient and enhanced pattern recognition. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 475–480. IEEE, 2016.
[5] Simone Scardapane, Michele Scarpiniti, Enzo Baccarelli, and Aurelio Uncini. Why should we add early exits to neural networks? Cognitive Computation, 12(5):954–966, 2020.
[6] Enzo Baccarelli, Simone Scardapane, Michele Scarpiniti, Alireza Momenzadeh, and Aurelio Uncini. Optimized training and scalable implementation of conditional deep neural networks with early exits for fog-supported IoT applications. Information Sciences, 521:107–143, 2020.
[7] Thierry Tambe, Coleman Hooper, Lillian Pentecost, Tianyu Jia, En-Yu Yang, Marco Donato, Victor Sanh, Paul Whatmough, Alexander M Rush, David Brooks, et al. EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pages 830–844, 2021.
[8] Zhenheng Tang, Yuxin Wang, Qiang Wang, and Xiaowen Chu. The impact of GPU DVFS on the energy and performance of deep learning: An empirical study. In Proceedings of the Tenth ACM International Conference on Future Energy Systems, pages 315–325, 2019.
[9] Meiqi Wang, Jianqiao Mo, Jun Lin, Zhongfeng Wang, and Li Du. DynExit: A dynamic early-exit strategy for deep residual networks. In 2019 IEEE International Workshop on Signal Processing Systems (SiPS), pages 178–183, 2019.
[10] Stefanos Laskaridis, Stylianos I. Venieris, Mario Almeida, Ilias Leontiadis, and Nicholas D. Lane. SPINN: Synergistic progressive inference of neural networks over device and cloud. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, MobiCom '20, New York, NY, USA, 2020. Association for Computing Machinery.

[11] Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. Shallow-deep networks: Understanding and mitigating network overthinking. In International Conference on Machine Learning, pages 3301–3310. PMLR, 2019.
[12] Vanderlei Bonato and Christos-Savvas Bouganis. Class-specific early exit design methodology for convolutional neural networks. Applied Soft Computing, 107:107316, 2021.
[13] Priyadarshini Panda, Abhronil Sengupta, and Kaushik Roy. Energy-efficient and improved image recognition with conditional deep learning. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):1–21, 2017.
[14] Enzo Baccarelli, Simone Scardapane, Michele Scarpiniti, Alireza Momenzadeh, and Aurelio Uncini. Optimized training and scalable implementation of conditional deep neural networks with early exits for fog-supported IoT applications. Information Sciences, 521:107–143, 2020.
[15] Muhammad Sabih, Frank Hannig, and Jürgen Teich. DyFiP: Explainable AI-based dynamic filter pruning of convolutional neural networks. In Proceedings of the 2nd European Workshop on Machine Learning and Systems, EuroMLSys '22, pages 109–115, New York, NY, USA, 2022. Association for Computing Machinery.
[16] Maciej Wołczyk, Bartosz Wójcik, Klaudia Bałazy, Igor T Podolak, Jacek Tabor, Marek Śmieja, and Tomasz Trzcinski. Zero time waste: Recycling predictions in early exit neural networks. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 2516–2528. Curran Associates, Inc., 2021.
[17] Amir Ghodrati, Babak Ehteshami Bejnordi, and Amirhossein Habibian. FrameExit: Conditional early exiting for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15608–15618, June 2021.
[18] Mohanad Odema, Nafiul Rashid, and Mohammad Abdullah Al Faruque. EExNAS: Early-exit neural architecture search solutions for low-power wearable devices. In 2021 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pages 1–6, 2021.
[19] Eric Samikwa, Antonio Di Maio, and Torsten Braun. Adaptive early exit of computation for energy-efficient and low-latency machine learning over IoT networks. In 2022 IEEE 19th Annual Consumer Communications & Networking Conference (CCNC), pages 200–206, 2022.
[20] En Li, Liekang Zeng, Zhi Zhou, and Xu Chen. Edge AI: On-demand accelerating deep neural network inference via edge computing. IEEE Transactions on Wireless Communications, 19(1):447–457, 2020.
[21] Liekang Zeng, En Li, Zhi Zhou, and Xu Chen. Boomerang: On-demand cooperative deep neural network inference for edge intelligence on the industrial internet of things. IEEE Network, 33(5):96–103, 2019.
[22] Xiaowei Xu, Qing Lu, Tianchen Wang, Yu Hu, Chen Zhuo, Jinglan Liu, and Yiyu Shi. Efficient hardware implementation of cellular neural networks with incremental quantization and early exit. J. Emerg. Technol. Comput. Syst., 14(4), December 2018.
[23] Geonho Kim and Jongsun Park. Low cost early exit decision unit design for CNN accelerator. In 2020 International SoC Design Conference (ISOCC), pages 127–128. IEEE, 2020.
[24] Stefanos Laskaridis, Alexandros Kouris, and Nicholas D. Lane. Adaptive inference through early-exit networks: Design, challenges and directions. In Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, EMDL '21, page 11, New York, NY, USA, 2021. Association for Computing Machinery.
[25] Laith Alzubaidi, Jinglan Zhang, Amjad J Humaidi, Ayad Al-Dujaili, Ye Duan, Omran Al-Shamma, J Santamaría, Mohammed A Fadhel, Muthana Al-Amidie, and Laith Farhan. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data, 8(1):1–74, 2021.
[26] Nikolaos Passalis, Jenni Raitoharju, Anastasios Tefas, and Moncef Gabbouj. Efficient adaptive inference for deep convolutional neural networks using hierarchical early exits. Pattern Recognition, 105:107346, 2020.
[27] Surat Teerapittayanon, Bradley McDanel, and H.T. Kung. Distributed deep neural networks over the cloud, the edge and end devices. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 328–339, 2017.
[28] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[29] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.

[30] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011.
[31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [32] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[33] Vijay Kandiah, Scott Peverelle, Mahmoud Khairy, Junrui Pan, Amogh Manjunath, Timothy G Rogers, Tor M Aamodt, and Nikos Hardavellas. AccelWattch: A power modeling framework for modern GPUs. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pages 738–753, 2021.