
Predictive Exit: Prediction of Fine-Grained Early Exits for Computation- and Energy-Efficient Inference

by Xiangjie Li, et al.

By adding exiting layers to deep learning networks, early exit can terminate the inference earlier with accurate results. The passive decision-making of whether to exit or continue to the next layer has to go through every pre-placed exiting layer until it exits. In addition, it is hard to adjust the configurations of the computing platform as the inference proceeds. By incorporating a low-cost prediction engine, we propose a Predictive Exit framework for computation- and energy-efficient deep learning applications. Predictive Exit can forecast where the network will exit (i.e., establish the number of remaining layers to finish the inference), which effectively reduces the network computation cost by exiting on time without running every pre-placed exiting layer. Moreover, according to the number of remaining layers, proper computing configurations (i.e., frequency and voltage) are selected to execute the network to further save energy. Extensive experimental results demonstrate that Predictive Exit achieves up to 96.2% computation reduction and 72.9% energy saving compared with classic deep learning networks, and up to 12.8% computation reduction and 37.6% energy saving compared with the early exit under state-of-the-art exiting strategies, given the same inference accuracy and latency.



1 Introduction

Deep learning approaches, such as convolutional neural networks (CNNs), have achieved tremendous success in versatile applications. However, deploying deep learning models on resource-constrained systems, such as drones and self-driving cars, is challenging because of their huge computation and energy cost. Diving into the performance of each inference case, researchers teerapittayanon2016branchynet ; figurnov2017spatially ; wang2018skipnet found that the significant growth in model complexity is only helpful for classifying a handful of complicated inputs correctly, and it might become "wasteful" for simple inputs. Motivated by this observation, several works panda2016conditional have tackled the problem of input-dependent dynamic inference. As one of the dynamic inference techniques, early exit adds side branch classifiers (exiting layers) to some of the network layers. As shown in Fig. 1, compared with the classic deep learning network in (a), the additional exiting layer in (b) allows a large portion of test samples to exit the network early once they have been inferred with high confidence.

Despite being employed in burgeoning efforts to reduce computation cost and energy consumption in inference, early exits scardapane2020should still face the challenges listed below. First, there is a dilemma between fine-grained and coarse-grained placement of exiting points. Fine-grained exiting points lead to significant performance and energy overheads due to frequent execution of the exiting layers enzo2020optimized . In contrast, coarse-grained exiting points may miss opportunities to exit earlier. Second, the exiting layers have different topologies at different exiting points, which adds extra burden to mapping the computation onto hardware resources. Third, energy-efficient computing platforms using early exits can only reduce the computing voltage and frequency after exiting; during the inference process, run-time computing configuration adjustments, such as dynamic voltage and frequency scaling (DVFS), cannot be applied ahead of time for energy saving.

To provide an efficient early exit for resource-constrained computing platforms, we propose Predictive Exit: prediction of fine-grained early exits for computation- and energy-efficient inference. To the best of our knowledge, Predictive Exit is the first work to forecast the exiting point and adjust the computing configuration (i.e., DVFS) as the inference proceeds, which achieves significant computation and energy savings. The contributions of this work are three-fold.


  • A novel Predictive Exit framework for deep learning networks is introduced, as shown in Fig. 1 (c). Fine-grained exiting layers sharing the same topology are potentially placed after each network layer. Selective execution of the exiting layers captures opportunities to exit on time and reduces computation and energy overheads.

  • A low-cost prediction engine is proposed to predict the point where the network will exit, execute the exiting layer selectively at run-time based on the prediction, and optionally adjust the computing configurations for energy saving during each inference based on DVFS.

  • Extensive experiments with floating-point (FP32) and fixed-point quantization (INT8) operations are conducted to demonstrate that Predictive Exit achieves computation cost reduction and energy saving with tiny accuracy loss compared with the network without early exit and no accuracy loss compared with the early exit using state-of-the-art placement strategies.

Figure 1: The architecture of classic deep learning network, early exit, and proposed Predictive Exit.

2 Background and Related Work

2.1 Latency, Power, and Energy

In many applications, such as robots and self-driving cars, deep learning networks are released and executed periodically. The time interval between the release times of two consecutive inferences is called the inference period $T_p$. The actual execution time of each inference is called the latency $t$, which rarely exceeds the period. Given a computing platform and a neural network with fixed computation, the inference latency $t$ is inversely proportional to the operating frequency of the digital logic inside the computing platform, which is called the computing frequency $f$. Following the basic rules of semiconductors, a higher $f$ indicates faster computing but requires a linearly increased processor supply voltage $V$.

When a task is executed, the power consumption in computing, called the active power $P_{act}$, is the summation of dynamic, static, and constant power. When the computing platform is idle, the power consumption, called the idle power $P_{idle}$, mainly comprises static and constant power. The energy of a computing platform includes both active and idle power integrated over time tambe2021edgebert ,

$$E = P_{act} \cdot t_{act} + P_{idle} \cdot t_{idle}, \quad (1)$$

where $t_{act}$ is the time spent on executing the task, and $t_{idle}$ is the time duration when the computing platform is idle, i.e., the interval between finishing this inference and starting the next one, $t_{idle} = T_p - t_{act}$. The dynamic power consumption originates from the activity of logic gates inside a processor, which can be derived by

$$P_{dyn} = C V^2 f, \quad (2)$$

where $C$ is the capacitance of the switched logic gates. The static power dissipation originates from transistor static current (including leakage and short-circuit current) when the processor is powered on, which is described by

$$P_{sta} = N_g V I_{sta}, \quad (3)$$

where $N_g$ is the number of logic gates, and $I_{sta}$ is the normalized static current for each logic gate. The constant power is the power consumed by the auxiliary devices of a computing platform, such as board fans and peripheral circuitry.
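As a quick numerical illustration of the power and energy terms above, the following sketch encodes Eqs. (1)-(3); the function names and the constants used are ours, for illustration only:

```python
def dynamic_power(capacitance, voltage, frequency):
    """Eq. (2): dynamic power of the switched logic gates, P_dyn = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency

def static_power(num_gates, voltage, static_current):
    """Eq. (3): static power, P_sta = N_g * V * I_sta."""
    return num_gates * voltage * static_current

def total_energy(p_active, t_active, p_idle, t_idle):
    """Eq. (1): active and idle power integrated over their time spans."""
    return p_active * t_active + p_idle * t_idle
```

Note that scaling frequency and voltage together (since $V$ rises roughly linearly with $f$) makes dynamic power grow cubically: doubling both $V$ and $f$ multiplies `dynamic_power` by eight.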

2.2 Early Exit with Dynamic Voltage and Frequency Scaling

DVFS is a power management technique in computer architecture whereby the voltage $V$ and frequency $f$ of a microprocessor can be automatically adjusted on the fly, depending on actual needs, to conserve the power and energy cost of the computing platform tang2019impact . Given the same inference period $T_p$, combining early exit and DVFS can effectively reduce the computation resource cost of executing a deep learning network.

Originally, deep learning networks are executed by running all network layers at the default high frequency $f_H$ and voltage $V_H$, without early exit or DVFS. We assume the inference latency (execution time) is then equal to the inference period $T_p$; in this case, the energy consumption within $T_p$ is $E = P_{act}(f_H) \cdot T_p$. With early exits deployed, the network layers located ahead of the early exit run at the default high frequency and voltage. After the network exits at time $t_{exit}$, the computing platform can reduce the voltage and frequency to the lowest level by DVFS tambe2021edgebert until the time reaches $T_p$. Therefore, the energy consumption can be described by $E = P_{act}(f_H) \cdot t_{exit} + P_{idle} \cdot (T_p - t_{exit})$.

Since the processor voltage-frequency pair setting contributes cubically to dynamic power but only linearly to static power and (inversely) to latency according to Eqs. (2) and (3), it is highly desirable to further reduce the active-power cost. A direct intuition to achieve this is to adjust the voltage and frequency to a proper middle level and run the network until it exits at time $T_p$. However, predicting where the network will exit and adjusting the voltage-frequency pair during the inference is not trivial. In this work, we propose a solution that adjusts the voltage and frequency to the proper "middle level" at run-time based on the prediction of early exits. The energy cost is reduced by better utilizing the inference period.
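The intuition above can be checked with a toy comparison of the two strategies: run fast until the exit and then idle, versus run the same workload at a middle-level frequency that finishes exactly at the end of the period. The power model and all constants below are illustrative assumptions of ours, not measurements from the paper:

```python
def active_power(f, c=1.0, k_static=0.2):
    """Toy power model: assume V scales linearly with f, so dynamic power
    grows as f^3 (c * V^2 * f with V = f) and static power as f."""
    v = f  # assumption: supply voltage proportional to frequency
    return c * v ** 2 * f + k_static * v

def energy_exit_then_idle(t_exit, period, f_high=1.0, p_idle=0.05):
    """Run at f_high until the exit at t_exit, then idle until the period ends."""
    return active_power(f_high) * t_exit + p_idle * (period - t_exit)

def energy_middle_level(t_exit, period, f_high=1.0):
    """Stretch the same work (f_high * t_exit cycles) over the whole period
    at a proportionally lower frequency."""
    f_mid = f_high * t_exit / period
    return active_power(f_mid) * period
```

With, e.g., an exit halfway through the period, the middle-level strategy consumes markedly less energy because the cubic dynamic-power term shrinks faster than the execution time grows.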

2.3 Related Work

Early exits have attracted considerable attention as an important branch of dynamic inference in the past few years. Since they were first studied in panda2016conditional ; teerapittayanon2016branchynet , many approaches have been proposed to improve the accuracy or reduce the computation cost of exit decisions. Wang et al. wang2019dynexit offered a dynamic loss-weight adjustment early-exit strategy for ResNets together with a hardware-friendly architecture of early-exit branches. Laskaridis et al. laskaridis2020spinn presented SPINN, a distributed progressive inference mechanism that maintains reliable execution of CNN inference across device-server setups. To determine the placement of exiting layers, Kaya et al. kaya2019shallow explored the distribution of computation cost; Bonato et al. vanderlei2021class classified the exit point for each inference object; Panda et al. panda2017energy and Baccarelli et al. baccarelli2020optimized calculated the benefit and the additional computation and energy cost of adding each exiting layer. To reduce the average and longest-path inference times, Sabih et al. sabih2022dyfip introduced a strategy that combines early exit with dynamic pruning. To further balance the accuracy, inference time, and energy tradeoffs, Wołczyk et al. maciej2021zero proposed a Zero Time Waste (ZTW) method that adds direct connections between exiting layers and combines previous outputs in an ensemble-like manner, and Ghodrati et al. ghodrati2021frame proposed an on-the-fly supervision training mechanism to reach a dynamic tradeoff between accuracy and power. Although the multi-exit networks in these studies can reduce inference time and computational cost, their efficiency lies in selecting a placement of exiting layers that balances cost savings against inference accuracy.
However, studies focused on choosing the best placement usually result in fixed places and thus neglect the possibility of exiting between exiting layers. Our work provides a predictive design that can dynamically change the placement of the exiting layer, which can further reduce inference time and computational cost.

Alongside the progress of theoretical models, early exits have started being deployed in hardware and IoT systems odema2021eexnas ; samikwa2022adaptive . Li et al. li2020edge and Zeng et al. zeng2019boom adopted device-edge synergy, optimized by DNN partitioning and DNN right-sizing through early exit, for on-demand DNN collaborative inference; the strategy is feasible and effective for low-latency edge intelligence. Xu et al. xu2018efficient presented a compressed CeNN framework optimized by five different incremental quantization methods and early exit; their FPGA implementation shows that two optimizations achieve 7.8x and 8.3x speedup, respectively, with almost no performance loss. On the hardware side, a low-cost early-exit decision hardware unit was designed by kim2020low to reduce the cost of early exiting, and Tambe et al. tambe2021edgebert exploited early exits in BERT hardware accelerator designs to reduce computation power and energy consumption. Meanwhile, Laskaridis et al. laskaridis2021adaptive and Scardapane et al. scardapane2020should provided thorough overviews of the current architectures, state-of-the-art methods, and future challenges of early-exit networks. These studies have made a great contribution to the power savings of early exits on embedded CNNs. However, most of their strategies demand a redesign of either the hardware or the network, whereas our work proposes a method based on DVFS that avoids such reconstruction.

3 The Framework of Predictive Exit

Targeting computation- and energy-efficient inference with early exits, we propose a Predictive Exit framework for deep learning networks, taking the CNN, the most popular deep learning algorithm alzubaidi2021review , as an example. Predictive Exit combines the following schemes into one continuous spectrum, as shown in Fig. 2.

Figure 2: The framework of Predictive Exit.

Unified exiting layer:

The neural networks run on graphics processing units (GPUs), tensor processing units (TPUs), or application-specific integrated circuits (ASICs) on local or edge devices. To relieve the burden of executing versatile computation, we make all exiting layers used in the network share the same topology.

Fine-grained exiting points: Exiting layers are potentially placed after each convolution layer to fully utilize the early exit opportunities. To avoid the computation overhead from running every exiting layer, only the layer at the predicted exiting point will be executed.

Low-cost prediction engine: The core of the framework is the low-cost prediction engine, which forecasts the possible exiting point and activates the early exiting layer. Based on the remaining computation workload before the expected exiting point, the proper computing frequency and voltage will be selected to run the inference and to further save energy.

3.1 Unified Exiting Layer

In this work, the design of the exiting layer is based on the existing work passalis2020efficient . All exiting layers share the same topology: a Bag-of-Features (BoF) pooling layer followed by a Fully-Connected (FC) layer. Let $x_l$ be the intermediate result at the $l$-th layer and $N_c$ be the number of object classes in the dataset. The BoF pooling layer functions as a feature aggregator that extracts features from $x_l$. In BoF pooling, a set of feature vectors called the codebook is used to describe $x_l$; the weight of each codebook vector is generated by measuring its similarity to $x_l$. Since the size of the codebook weight vector is usually larger than $N_c$, the FC layer further works as a classifier that reduces the result of BoF pooling to $N_c$ dimensions, which estimates the final output of the exiting layer. Details can be found in passalis2020efficient . The function of the exiting layer is denoted as

$$y_l = FC(BoF(x_l)), \quad (4)$$

where $x_l$ stands for the intermediate result after the computation of the 1st to the $l$-th layer.

After the early exiting layers are trained, the average feature weight $\bar{w}$ is first calculated according to Eq. (5), which serves as a parameter for the exit decision:

$$\bar{w} = \frac{1}{S} \sum_{i=1}^{S} w_i, \quad (5)$$

where $S$ is the size of the training dataset and $w_i$ is the feature weight of the $i$-th training sample. During the inference, at each early exiting layer, the weight ratio $r$ is calculated as the maximum feature weight over $\bar{w}$ multiplied by a hyperparameter $\gamma$ specified by the user. Once $r$ in Eq. (6) is larger than 1, the inference is terminated, and the result at this early exiting layer is used as the final result:

$$r = \frac{\max(w)}{\gamma \cdot \bar{w}}. \quad (6)$$

The key idea behind this implementation is that the larger the maximum feature weight, the more confident the classification is regarded to be. The parameter $\gamma$ strikes a balance between accuracy and the extent of acceleration: if the user expects a more accurate result, $\gamma$ should be higher, and vice versa.
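A minimal sketch of this exit test (the function and argument names are ours, not from the paper):

```python
def should_exit(feature_weights, avg_weight, gamma):
    """Exit test of Eq. (6): compute the weight ratio
    r = max(w) / (gamma * avg_weight) and exit once r exceeds 1."""
    r = max(feature_weights) / (gamma * avg_weight)
    return r > 1.0
```

With $\gamma = 1$, a sample whose maximum feature weight is well above the training-set average exits immediately; raising $\gamma$ makes the test stricter, trading acceleration for accuracy.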

3.2 Fine-Grained Exiting Points

In the classic early exit design, a limited number of exiting layers are settled at fixed locations and distances within the neural network. For example, in the inference model proposed in classicexample , the two early exits are placed at fixed fractions of the network depth. If the first exiting layer fails, the inference must run through the following network layers until it reaches the next exiting layer to check whether it can exit. A more sensible way, proposed by Kaya et al. kaya2019shallow , is to settle the exiting layers based on the percentage of the total cost after a rough estimation of the computational cost of each layer. However, their "fixed-place" design ignores the possibility of exiting between the two early exits. It is nontrivial to choose where and how to place early exits so as to satisfy both computation time and inference accuracy.

In this work, we propose a fine-grained exiting point design, that is, placing potential exiting layers on every network layer starting from the layer $L_0$. $L_0$ is a hyperparameter specified by the user, working as an adjuster between computation time and inference accuracy. During the inference process, the first trial of early exit happens at $L_0$. If the trial succeeds, the early exit is triggered. If it fails, instead of running through every exiting layer after $L_0$, the position of the next trial is $L_1 = \mathrm{Predict}(L_0)$, where Predict is the function of the low-cost prediction engine that forecasts the next location to exit, introduced in the following subsection. If the early exit does not succeed at $L_i$, we start another prediction based on the exiting layer result at $L_i$, that is,

$$L_{i+1} = \mathrm{Predict}(L_i). \quad (7)$$

Therefore, the expected exit location can be expressed as

$$\mathbb{E}[L_{exit}] = \sum_{n=0}^{\infty} (1 - p)\, p^{n} L_n, \quad (8)$$

where $L_0$ is the location of the first trial, $n$ stands for the number of predictions, and $p$ denotes the probability of prediction failure.
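Assuming each trial fails independently with probability $p$, this expectation can be evaluated numerically; the sketch below (names and the absorbing last trial are our modeling choices) treats the final trial location as where the network exits if every earlier trial fails:

```python
def expected_exit_location(trial_locations, p_fail):
    """Expected exit location when trial n succeeds with probability
    (1 - p_fail): E[L] = sum_n (1 - p_fail) * p_fail**n * L_n,
    with the last trial absorbing the remaining probability mass."""
    e = 0.0
    for n, loc in enumerate(trial_locations[:-1]):
        e += (1 - p_fail) * p_fail ** n * loc
    # if every earlier trial fails, the network exits at the final trial
    e += p_fail ** (len(trial_locations) - 1) * trial_locations[-1]
    return e
```

For example, with trials at layers 2 and 4 and a 50% failure probability, the expected exit layer is halfway between the two.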

3.3 Low-Cost Prediction Engine

3.3.1 Prediction of exiting points

The Predict function starts to forecast the exiting point from the current layer $L_i$. To fully use the information from the intermediate results during the inference process, the key function of Predict is realized by a one-dimensional convolution on the intermediate result and the exiting layer result at $L_i$, as summarized in Algorithm 1.

To perform the convolution, we first extend the intermediate result $x_l$ to $\hat{x}_l$ by zero padding, which allows more space for the filter to cover the intermediate result:

$$\hat{x}_l = \mathrm{ZeroPad}(x_l). \quad (9)$$

Afterwards, a vector $k$ (with all entries equal to 1 in our work) is generated as the filter of the convolution. Based on $\hat{x}_l$ and $k$, a new set of feature weights $\tilde{w}_{l+1}$ is generated through a one-dimensional convolution, which estimates the result of an exiting layer placed at the $(l+1)$-th layer:

$$\tilde{w}_{l+1} = \mathrm{Conv1D}(\hat{x}_l, k). \quad (10)$$

By recursively replacing the intermediate result in Eq. (9) with $\tilde{w}_{l+1}$ and repeating the computation of Eqs. (9) and (10), the predicted result of an exiting layer placed at the $(l+2)$-th layer can be obtained and denoted $\tilde{w}_{l+2}$. Following the above steps, the predicted result of any exiting layer placed after $L_i$ can be calculated.

With the predicted results of an exiting layer placed at any layer after $L_i$, the Predict function is described as follows:

$$\mathrm{Predict}(L_i) = L_i + j^{*}, \quad (11)$$

where $L_i + j^{*}$ represents the predicted exiting point after $L_i$, and $j^{*}$ is the smallest positive integer that satisfies

$$\tilde{r}_{L_i + j} = \frac{\max(\tilde{w}_{L_i + j})}{\gamma \cdot \bar{w}} > 1. \quad (12)$$

Input: starting layer $L_0$, hyperparameters $\gamma$ and $\Delta$; exiting layer result $w_{L_i}$ at layer $L_i$; number of network layers $N$
Output: predicted exiting point $L_{i+1}$
//The prediction procedure:
1 for $j = 1, \dots, N - L_i$ do
2     zero-pad the current feature weights into $\hat{w}$  // Eq. (9)
3     $\tilde{w}_{L_i + j} = \mathrm{Conv1D}(\hat{w}, k)$  // one-dimensional convolution, Eq. (10)
4     if $\tilde{r}_{L_i + j} > 1$ then  // check prediction confidence
5         return $L_{i+1} = L_i + j$
6 return $L_{i+1} = \min(L_i + \Delta, N)$, where $\Delta$ is a predefined hyperparameter
Algorithm 1: Low-cost Prediction Engine

In Eq. (12), $\tilde{r}_{L_i + j}$ indicates the predicted exiting confidence at layer $L_i + j$; therefore, $L_i + j^{*}$ is the predicted exiting point returned by Predict. In case no such $j^{*}$ can be found, a hyperparameter $\Delta$ is further introduced, that is, $L_{i+1} = L_i + \Delta$. $L_{i+1}$ should be no more than $N$, where $N$ stands for the number of layers in the model, meaning that a prediction result beyond the last layer is forbidden. If no integer in $[1, N - L_i]$ satisfies Eq. (12), we assume that $L_{i+1} = \min(L_i + \Delta, N)$. The prediction engine is simple enough to avoid adding notable computation cost to the network; the time complexity of Algorithm 1 grows only linearly with the number of remaining layers examined. Meanwhile, the hyperparameters $L_0$, $\gamma$, and $\Delta$ introduced in the prediction can be tuned by advanced users to balance prediction accuracy and computation cost for different application scenarios.
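The steps of Algorithm 1 can be sketched as follows. This is a simplified reconstruction under our own assumptions: the feature weights are renormalized after each convolution so that the confidence test stays on the same scale, and all names are ours:

```python
import numpy as np

def predict_exit(w_i, gamma, avg_weight, l_i, n_layers, delta, kernel_size=3):
    """Sketch of Algorithm 1: roll the exiting-layer feature weights forward
    with zero padding + 1-D convolution (all-ones filter, Eqs. (9)-(10)) and
    return the first layer whose predicted weight ratio exceeds 1.
    Falls back to min(l_i + delta, n_layers)."""
    k = np.ones(kernel_size)
    w = np.asarray(w_i, dtype=float)
    for j in range(1, n_layers - l_i + 1):
        padded = np.pad(w, kernel_size // 2)       # zero padding, Eq. (9)
        w = np.convolve(padded, k, mode="valid")   # 1-D convolution, Eq. (10)
        w = w / w.sum()                            # renormalize (our assumption)
        r = w.max() / (gamma * avg_weight)         # predicted exiting confidence
        if r > 1.0:
            return l_i + j
    return min(l_i + delta, n_layers)              # fallback hyperparameter delta
```

The loop performs one vector convolution per remaining layer, which is why the engine adds little overhead compared with executing a real exiting layer at every point.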

Discussion of Weight Ratio- and Cross-Entropy-Based Prediction. We discuss prediction for early exits using the weight ratio $r$ and the cross entropy (defined in Eq. (15)), because these two parameters are the indicators of confidence in early exits. More precisely, a higher weight ratio or a lower cross entropy indicates more confidence in exiting. As shown in Fig. 3, we record the values of the weight ratio and cross entropy (x axis) at the prediction starting point and the actual exiting layer (y axis) during the inference process. The test-case results for the ResNet-34 model on the SVHN dataset are presented in the figure. For weight ratios in [0.2, 0.6), the exiting points are distributed between the 1st and 12th layers; these spreading trends are also observed when the weight ratio lies in [0.6, 0.8) and [0.8, 1.0). Similarly, the exiting points are widely distributed for the same cross entropy. Therefore, it is hard to predict the exiting point purely based on the weight ratio or cross entropy, even though they are indicators of confidence in early exits.

(a) Weight ratio
(b) Cross entropy
Figure 3: Exit prediction with weight ratio and cross entropy.

3.3.2 DVFS for Predictive Exit

Based on the predicted exiting point, the prediction engine adjusts the voltage-frequency pair to the proper "middle level" and runs the network until it exits at time $T_p$. Given a network with $N$ layers and a prediction engine that starts the prediction at $L_0$ and predicts the exiting point $L_p$, the prediction engine reduces the computing frequency (and its corresponding voltage) to

$$f_m = f_H \cdot \frac{L_p - L_0}{N - L_0}, \quad (13)$$

where $f_H$ is the default high frequency of the computing platform. Therefore, the energy consumption used in this inference will be

$$E = P_{act}(f_H) \cdot t_0 + P_{act}(f_m) \cdot (T_p - t_0), \quad (14)$$

where $t_0$ is the time spent finishing the prediction starting layer $L_0$ and its previous network layers. As the selection of the computing frequency is based on the predicted remaining workload of the inference, the prediction engine makes the inference finish by time $T_p$ given correct predictions. Since the processor voltage and frequency scale linearly with computing performance (inversely with inference latency) but cubically with dynamic power and linearly with static power, Predictive Exit effectively reduces energy consumption.
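Assuming roughly uniform per-layer cost, the middle-level frequency selection and the resulting energy can be sketched as follows (function names and the uniform-cost assumption are ours):

```python
def middle_frequency(f_high, l0, l_pred, n_layers):
    """Scale the default high frequency by the ratio of predicted remaining
    work (l_pred - l0 layers) to the remaining time budget, i.e. the
    n_layers - l0 layers that would otherwise run at f_high."""
    return f_high * (l_pred - l0) / (n_layers - l0)

def inference_energy(p_act_high, t0, p_act_mid, period):
    """Energy split as in the text: the first t0 seconds run at the high
    setting, the rest of the period at the middle-level setting."""
    return p_act_high * t0 + p_act_mid * (period - t0)
```

For example, with a 20-layer network, prediction starting at layer 6, and a predicted exit at layer 13, the frequency halves: half the remaining layers fill the full remaining time budget.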

4 Training the Network with Predictive Exit

The learning objective of Predictive Exit is to keep the network inference accuracy given the predictive exit functions. Therefore, the objective function is the cross-entropy loss over the entire network:

$$\mathcal{L} = -\sum_{i} y_i \log \hat{y}_i, \quad (15)$$

where $y$ and $\hat{y}$ are the true class distribution and the predicted class distribution for each object.
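The cross-entropy loss can be computed directly; a minimal sketch (names are ours):

```python
import math

def cross_entropy(true_dist, pred_dist):
    """Cross-entropy between the true class distribution y and the predicted
    distribution y_hat: L = -sum_i y_i * log(y_hat_i)."""
    return -sum(y * math.log(y_hat)
                for y, y_hat in zip(true_dist, pred_dist) if y > 0)
```

For a one-hot label, the loss reduces to the negative log-probability assigned to the correct class, so a perfectly confident correct prediction incurs zero loss.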

Our model is trained through batch gradient descent. The data are fed into the model to optimize the parameters $\theta$. Let $S$ be the size of the training set, and let the target vector $y$ be the correct result. At first, the model is trained without any exiting layer by

$$\theta \leftarrow \theta - \eta \nabla_{\theta} \mathcal{L}, \quad (16)$$

where $\eta$ is the learning rate. If the accuracy fails to meet the requirement, $\eta$ can be adjusted. Afterwards, with the parameters $\theta$ of the original model fixed and the exiting layers attached, the model is trained again to optimize the parameters $\theta_e$ in the BoF pooling and FC layers:

$$\theta_e \leftarrow \theta_e - \eta \nabla_{\theta_e} \mathcal{L}. \quad (17)$$

In our design, the original model parameters are not fixed during the training of the exiting layers. Although this may sacrifice the accuracy of the original layers, it leads to higher accuracy at each exiting layer: the output of the original model can be regarded as another exiting layer and trained together to compensate for the accuracy loss. Although the training procedure requires two steps, the advantage brought to the inference process far outweighs this disadvantage, especially when training is done on a server and only the inference is deployed on resource-constrained platforms.

5 Evaluation

5.1 Experimental Setup

We evaluate Predictive Exit using VGG-19 and ResNet-34 as the backbone models on the commonly used CIFAR-10 and CIFAR-100 cifar10 , SVHN svhn , and STL10 stl10 datasets. Considering low-computation and low-energy applications, both floating-point (FP32) and fixed-point quantization (INT8) operations are tested. The training follows the procedure in Section 4. For the classic network and the exiting layers, we set the learning rates to 0.00025 and 0.001, respectively, and train for 100 iterations. The selection of $L_0$ is based on the prediction accuracy detailed in Section 5.3. We compare the proposed Predictive Exit with Classic CNN (no early exit), Hierarchical (exiting layers at fixed positions of the network), and Placement (exiting layers placed by a state-of-the-art placement strategy).

5.2 Inference Accuracy and Computation Cost

The inference accuracy and computation cost (the number of floating-point or integer operations, normalized by that of the classic CNN) of different inference approaches and datasets are shown in Table 1. On the VGG-19 model, Predictive Exit achieves the same inference accuracy as the other early exit approaches (which is 1% - 3% lower than Classic CNN). Because Predictive Exit continues with the next prediction instead of forcing the inference to exit when a prediction is wrong, it introduces no extra inference accuracy loss. The early exit-based approaches significantly reduce the computation cost of Classic CNN. Compared with Hierarchical and Placement, Predictive Exit further reduces the computation cost by up to 12.8% by leveraging opportunities to exit earlier and avoiding frequent execution of the exiting layers. On the ResNet-34 model, compared with Hierarchical and Placement, Predictive Exit reduces the computation cost by up to 12.1% without extra accuracy loss.

Model VGG-19 ResNet-34
Approach Classic CNN Hierarchical Placement Predictive Exit Classic CNN Hierarchical Placement Predictive Exit
FP32 CIFAR-10 89% | 100% 88% | 59.8% 88% | 46.9% 88% | 46.2% 90% | 100% 89% | 28.7% 89% | 12.1% 89% | 7.7%
FP32 CIFAR-100 64% | 100% 62% | 86.6% 62% | 80.6% 62% | 76.2% 66% | 100% 63% | 36.6% 63% | 21.2% 63% | 16.1%
FP32 SVHN 89% | 100% 87% | 55.6% 87% | 56.5% 87% | 46.8% 91% | 100% 89% | 6.9% 89% | 5.4% 89% | 3.8%
FP32 STL10 75% | 100% 74% | 67.6% 74% | 64.2% 74% | 51.5% 76% | 100% 73% | 61.8% 73% | 17.8% 73% | 5.7%
Int8 CIFAR-10 88% | 100% 87% | 62.3% 87% | 46.4% 87% | 46.1% 85% | 100% 82% | 29.6% 82% | 14.9% 82% | 8.4%
Int8 CIFAR-100 64% | 100% 62% | 95.0% 62% | 79.5% 62% | 76.4% 62% | 100% 60% | 44.4% 60% | 22.3% 60% | 18.5%
Int8 SVHN 89% | 100% 86% | 57.2% 86% | 47.4% 86% | 46.7% 88% | 100% 85% | 7.1% 85% | 5.6% 85% | 4.0%
Int8 STL10 74% | 100% 74% | 66.9% 74% | 64.3% 74% | 51.5% 71% | 100% 70% | 78.1% 70% | 17.8% 70% | 5.8%
Table 1: Inference accuracy and computation cost in the VGG-19 and ResNet-34 network

For the training process, compared with Classic CNN, which requires only the training of the original network layers, the Hierarchical structure demands additional training of the exiting layers placed at specified positions of the network. While both the Placement and Predictive designs need to train the exiting layers placed after each layer of the network, Placement further needs one more step: determining placement indexes by an exhaustive search for the most profitable exiting layers.

5.3 Where to Start Prediction: Hyperparameter L0

Figure 4: Prediction accuracy with hyperparameter in VGG-19 network.
Figure 5: Prediction accuracy with hyperparameter in ResNet-34 network.

The prediction engine starts the exiting prediction from the starting layer $L_0$. For each model and dataset, $L_0$ should be chosen before the deep learning model is deployed, as it directly determines Predictive Exit's performance. This section quantitatively compares the prediction accuracy of different $L_0$ values during the training process. To let the prediction engine cover a wide range of network layers, $L_0$ should be among the first few layers of the network. Therefore, in the VGG-19 network, we test $L_0$ placed at the 1st-10th layers, and in the ResNet-34 network, we test $L_0$ placed at the 6th-20th layers. Fig. 4 and Fig. 5 present the prediction accuracy under FP32 and INT8 operations across different datasets. The prediction accuracy indicates the percentage of inferences that successfully exit at the first predicted exiting point. In the VGG-19 network, if $L_0$ is the first network layer, the prediction achieves over 81% accuracy across all tested datasets and operations; most notably, if $L_0$ is the 6th network layer, the prediction achieves over 99.8% accuracy. Therefore, $L_0$ for the VGG-19 network and the test datasets will be the 6th network layer. Similarly, for the ResNet-34 network, the desired $L_0$ will be the 6th, 14th, 10th, and 6th layers for the CIFAR-10, CIFAR-100, SVHN, and STL10 datasets, respectively. Comparing VGG-19 and ResNet-34, Predictive Exit achieves higher exiting prediction accuracy on VGG-19, which has a shallower network structure.

5.4 Potential Energy Benefit by DVFS

To illustrate the potential energy benefit of Predictive Exit, we calculate the energy consumption with offline-measured active and idle power of an NVIDIA Quadro GV100 GPU at different frequency-voltage pairs kandiah2021accelwattch (shown in Table 2). In a network with $N$ layers, when Predictive Exit predicts the network will exit at layer $L_p$, the processor selects the lowest candidate frequency in Table 2 that is higher than or equal to $f_m$ in Eq. (13) to execute this network. We count the inference workload (including the exiting layer) under each frequency-voltage pair. Based on the power consumption at each frequency-voltage pair, we compare the energy consumption and normalize it to the energy consumption of the Classic CNN model in Table 3. Unsurprisingly, early exit achieves tremendous energy savings compared with Classic CNN models. Predictive Exit further reduces the energy consumption (compared with the best cases of Hierarchical and Placement) by 19.9% to 37.6% on the VGG-19 network and 19.7% to 37.3% on the ResNet-34 network, by predicting the exit and adjusting the computing configuration along with the inference process.
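The candidate-frequency selection described above can be sketched as follows; the frequency list mirrors Table 2, while the function and variable names are ours:

```python
def pick_candidate_frequency(f_target, candidates):
    """Select the lowest candidate frequency that is >= the target
    middle-level frequency; fall back to the highest candidate if the
    target exceeds every available frequency."""
    eligible = [f for f in candidates if f >= f_target]
    return min(eligible) if eligible else max(candidates)

# Candidate frequencies (GHz) of the measured GPU, as listed in Table 2.
GV100_FREQS = [0.60 + 0.05 * i for i in range(18)]  # 0.60 ... 1.45
```

Rounding up to the nearest candidate guarantees the inference still finishes within the period; the cost is a slightly higher frequency (and power) than the ideal continuous setting.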

Frequency (GHz) 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45
Active Power (W) 59.2 67.4 73.5 81.1 85.3 90.4 97.5 104.9 112.8 119.5 130.1 139.5 148.9 161.1 170.2 180.6 199.1 218.5
Idle Power (W) 35.1 35.9 37.0 37.8 38.9 39.8 41.1 41.3 43.8 44.2 45.0 45.8 46.5 47.8 49.3 50.5 52.3 55.1
Table 2: Measured power consumption of the NVIDIA Quadro GV100 GPU system
Model VGG-19 ResNet-34
Approach Classic CNN Hierarchical Placement Predictive Exit Classic CNN Hierarchical Placement Predictive Exit
FP32 CIFAR-10 100% 62.2% 47.9% 27.1% 100% 58.6% 59.3% 30.3%
FP32 CIFAR-100 100% 85.1% 72.8% 44.1% 100% 69.9% 67.3% 39.9%
FP32 SVHN 100% 58.9% 54.0% 27.1% 100% 46.8% 49.1% 27.1%
FP32 STL10 100% 65.7% 64.7% 27.1% 100% 73.6% 64.5% 27.2%
Int8 CIFAR-10 100% 76.8% 47.0% 27.1% 100% 58.8% 60.7% 30.6%
Int8 CIFAR-100 100% 85.2% 73.1% 45.1% 100% 75.0% 68.1% 41.9%
Int8 SVHN 100% 54.0% 47.7% 27.1% 100% 46.9% 49.1% 27.1%
Int8 STL10 100% 66.0% 64.7% 27.1% 100% 85.3% 64.5% 27.2%
Table 3: Normalized energy consumption

6 Conclusion

The proposed Predictive Exit accurately predicts where the network will exit, serving as a computation- and energy-efficient inference technique. By activating the exiting layer only at the expected exit point, Predictive Exit reduces network computation cost by exiting on time without running every pre-placed exiting layer, and it further reduces inference energy consumption by selecting proper computing configurations during the inference process. Since our method is a plug-in for existing deep learning networks that requires no model modification, Predictive Exit could be applied to general deep learning networks, which we will explore in future work.


  • [1] Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. Branchynet: Fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 2464–2469. IEEE, 2016.
  • [2] Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1039–1048, 2017.
  • [3] Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 409–424, 2018.
  • [4] Priyadarshini Panda, Abhronil Sengupta, and Kaushik Roy. Conditional deep learning for energy-efficient and enhanced pattern recognition. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 475–480. IEEE, 2016.
  • [5] Simone Scardapane, Michele Scarpiniti, Enzo Baccarelli, and Aurelio Uncini. Why should we add early exits to neural networks? Cognitive Computation, 12(5):954–966, 2020.
  • [6] Enzo Baccarelli, Simone Scardapane, Michele Scarpiniti, Alireza Momenzadeh, and Aurelio Uncini. Optimized training and scalable implementation of conditional deep neural networks with early exits for fog-supported iot applications. Information Sciences, 521:107–143, 2020.
  • [7] Thierry Tambe, Coleman Hooper, Lillian Pentecost, Tianyu Jia, En-Yu Yang, Marco Donato, Victor Sanh, Paul Whatmough, Alexander M Rush, David Brooks, et al. Edgebert: Sentence-level energy optimizations for latency-aware multi-task nlp inference. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pages 830–844, 2021.
  • [8] Zhenheng Tang, Yuxin Wang, Qiang Wang, and Xiaowen Chu. The impact of gpu dvfs on the energy and performance of deep learning: An empirical study. In Proceedings of the Tenth ACM International Conference on Future Energy Systems, pages 315–325, 2019.
  • [9] Meiqi Wang, Jianqiao Mo, Jun Lin, Zhongfeng Wang, and Li Du. Dynexit: A dynamic early-exit strategy for deep residual networks. In 2019 IEEE International Workshop on Signal Processing Systems (SiPS), pages 178–183, 2019.
  • [10] Stefanos Laskaridis, Stylianos I. Venieris, Mario Almeida, Ilias Leontiadis, and Nicholas D. Lane. Spinn: Synergistic progressive inference of neural networks over device and cloud. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, MobiCom ’20, New York, NY, USA, 2020. Association for Computing Machinery.
  • [11] Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. Shallow-deep networks: Understanding and mitigating network overthinking. In International Conference on Machine Learning, pages 3301–3310. PMLR, 2019.
  • [12] Vanderlei Bonato and Christos-Savvas Bouganis. Class-specific early exit design methodology for convolutional neural networks. Applied Soft Computing, 107:107316, 2021.
  • [13] Priyadarshini Panda, Abhronil Sengupta, and Kaushik Roy. Energy-efficient and improved image recognition with conditional deep learning. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):1–21, 2017.
  • [14] Enzo Baccarelli, Simone Scardapane, Michele Scarpiniti, Alireza Momenzadeh, and Aurelio Uncini. Optimized training and scalable implementation of conditional deep neural networks with early exits for fog-supported iot applications. Information Sciences, 521:107–143, 2020.
  • [15] Muhammad Sabih, Frank Hannig, and Jürgen Teich. Dyfip: Explainable ai-based dynamic filter pruning of convolutional neural networks. In Proceedings of the 2nd European Workshop on Machine Learning and Systems, EuroMLSys ’22, pages 109–115, New York, NY, USA, 2022. Association for Computing Machinery.
  • [16] Maciej Wołczyk, Bartosz Wójcik, Klaudia Bałazy, Igor T Podolak, Jacek Tabor, Marek Śmieja, and Tomasz Trzcinski. Zero time waste: Recycling predictions in early exit neural networks. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 2516–2528. Curran Associates, Inc., 2021.
  • [17] Amir Ghodrati, Babak Ehteshami Bejnordi, and Amirhossein Habibian. Frameexit: Conditional early exiting for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15608–15618, June 2021.
  • [18] Mohanad Odema, Nafiul Rashid, and Mohammad Abdullah Al Faruque. Eexnas: Early-exit neural architecture search solutions for low-power wearable devices. In 2021 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pages 1–6, 2021.
  • [19] Eric Samikwa, Antonio Di Maio, and Torsten Braun. Adaptive early exit of computation for energy-efficient and low-latency machine learning over iot networks. In 2022 IEEE 19th Annual Consumer Communications Networking Conference (CCNC), pages 200–206, 2022.
  • [20] En Li, Liekang Zeng, Zhi Zhou, and Xu Chen. Edge ai: On-demand accelerating deep neural network inference via edge computing. IEEE Transactions on Wireless Communications, 19(1):447–457, 2020.
  • [21] Liekang Zeng, En Li, Zhi Zhou, and Xu Chen. Boomerang: On-demand cooperative deep neural network inference for edge intelligence on the industrial internet of things. IEEE Network, 33(5):96–103, 2019.
  • [22] Xiaowei Xu, Qing Lu, Tianchen Wang, Yu Hu, Chen Zhuo, Jinglan Liu, and Yiyu Shi. Efficient hardware implementation of cellular neural networks with incremental quantization and early exit. J. Emerg. Technol. Comput. Syst., 14(4), dec 2018.
  • [23] Geonho Kim and Jongsun Park. Low cost early exit decision unit design for cnn accelerator. In 2020 International SoC Design Conference (ISOCC), pages 127–128. IEEE, 2020.
  • [24] Stefanos Laskaridis, Alexandros Kouris, and Nicholas D. Lane. Adaptive inference through early-exit networks: Design, challenges and directions. In Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, EMDL’21, page 11, New York, NY, USA, 2021. Association for Computing Machinery.
  • [25] Laith Alzubaidi, Jinglan Zhang, Amjad J Humaidi, Ayad Al-Dujaili, Ye Duan, Omran Al-Shamma, J Santamaría, Mohammed A Fadhel, Muthana Al-Amidie, and Laith Farhan. Review of deep learning: Concepts, cnn architectures, challenges, applications, future directions. Journal of big Data, 8(1):1–74, 2021.
  • [26] Nikolaos Passalis, Jenni Raitoharju, Anastasios Tefas, and Moncef Gabbouj. Efficient adaptive inference for deep convolutional neural networks using hierarchical early exits. Pattern Recognition, 105:107346, 2020.
  • [27] Surat Teerapittayanon, Bradley McDanel, and H.T. Kung. Distributed deep neural networks over the cloud, the edge and end devices. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 328–339, 2017.
  • [28] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [29] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • [30] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011.
  • [31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [32] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [33] Vijay Kandiah, Scott Peverelle, Mahmoud Khairy, Junrui Pan, Amogh Manjunath, Timothy G Rogers, Tor M Aamodt, and Nikos Hardavellas. Accelwattch: A power modeling framework for modern gpus. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pages 738–753, 2021.