## 1 Introduction

In Natural Language Processing (NLP), pretraining and fine-tuning have become a new norm for many tasks. Pretrained language models (PLMs) (e.g., BERT devlin2018bert, XLNet yang2019xlnet, RoBERTa liu2019roberta, ALBERT lan2019albert) contain many layers and millions or even billions of parameters, making them computationally expensive and inefficient regarding both memory consumption and latency. This drawback hinders their application in scenarios where inference speed and computational costs are crucial. Another bottleneck of overparameterized PLMs that stack dozens of Transformer layers is the “overthinking” problem kaya2018shallow during their decision-making process. That is, for many input samples, their shallow representations at an earlier layer are adequate to make a correct classification, whereas the representations in the final layer may be otherwise distracted by over-complicated or irrelevant features that do not generalize well. The overthinking problem in PLMs leads to wasted computation, hinders model generalization, and may also make them vulnerable to adversarial attacks jin2019bert.

In this paper, we propose a novel Patience-based Early Exit (PABEE) mechanism to enable models to stop inference dynamically. PABEE is inspired by the widely used Early Stopping morgan1990generalization; prechelt1998early strategy for model training. It enables better input-adaptive inference of PLMs to address the aforementioned limitations. Specifically, our approach couples an internal classifier with each layer of a PLM and dynamically stops inference when the intermediate predictions of the internal classifiers remain unchanged for $t$ times consecutively (see Figure 0(b)), where $t$ is a pre-defined patience. We first show that our method is able to improve the accuracy compared to conventional inference under certain assumptions. Then we conduct extensive experiments on the GLUE benchmark and show that PABEE outperforms existing prediction probability distribution-based exit criteria by a large margin. In addition, PABEE can simultaneously improve inference speed and adversarial robustness of the original model while retaining or even improving its original accuracy with minor additional effort in terms of model size and training time. Also, our method can dynamically adjust the accuracy-efficiency trade-off to fit different devices and resource constraints by tuning the patience hyperparameter without retraining the model, which is favored in real-world applications Cai2020Once-for-All. Although we focus on PLMs in this paper, we have also conducted experiments on image classification tasks with the popular ResNet he2016deep as the backbone model and present the results in Appendix A to verify the generalization ability of PABEE.

To summarize, our contribution is two-fold: (1) We propose Patience-based Early Exit, a novel and effective inference mechanism, and show with theoretical analysis that it can improve both the efficiency and the accuracy of deep neural networks. (2) Our empirical results on the GLUE benchmark highlight that our approach can simultaneously improve the accuracy and robustness of a competitive ALBERT model, while speeding up inference across different tasks with trivial additional training resources in terms of both time and parameters.

## 2 Related Work

Existing research in improving the efficiency of deep neural networks can be categorized into two streams: (1) Static approaches design compact models or compress heavy models, while the models remain static for all instances at inference (i.e., the input goes through the same layers); (2) Dynamic approaches allow the model to choose different computational paths according to different instances when doing inference. In this way, the simpler inputs usually require less calculation to make predictions. Our proposed PABEE falls into the second category.

#### Static Approaches: Compact Network Design and Model Compression

Many lightweight neural network architectures have been specifically designed for resource-constrained applications, including MobileNet howard2017mobilenets, ShuffleNet zhang2018shufflenet, EfficientNet tan2019efficientnet, and ALBERT lan2019albert, to name a few. For model compression, han2015deep first proposed to sparsify deep models by removing non-significant synapses and then re-training to restore performance. Weight Quantization wu2016quantized and Knowledge Distillation hinton2015distilling have also proved to be effective for compressing neural models. Recent studies employ Knowledge Distillation sanh2019distilbert; sun2019patient; jiao2019tinybert, Weight Pruning michel2019sixteen; voita2019analyzing; fan2019reducing and Module Replacing xu2020bert to accelerate PLMs.

#### Dynamic Approaches: Input-Adaptive Inference

A parallel line of research for improving the efficiency of neural networks is to enable adaptive inference for various input instances. Adaptive Computation Time graves2016adaptive proposed to use a trainable halting mechanism to perform input-adaptive inference. However, training the halting model requires extra effort and also introduces additional parameters and inference cost. To alleviate this problem, BranchyNet teerapittayanon2016branchynet calculated the entropy of the prediction probability distribution as a proxy for the confidence of branch classifiers to enable early exit. Shallow-Deep Nets kaya2018shallow leveraged the softmax scores of predictions of branch classifiers to mitigate the overthinking problem of DNNs. More recently, hu2020triple leveraged this approach in adversarial training to improve the adversarial robustness of DNNs. In addition, existing approaches graves2016adaptive; wang2018skipnet trained separate models to determine passing through or skipping each layer. Very recently, FastBERT liu2020fastbert and DeeBERT xin2020deebert adapted confidence-based BranchyNet teerapittayanon2016branchynet for PLMs while RightTool Schwartz:2020 leveraged the same early-exit criterion as in the Shallow-Deep Network kaya2018shallow.

However, Schwartz:2020 recently revealed that prediction probability based methods often lead to a substantial performance drop compared to an oracle that identifies the smallest model needed to solve a given instance. In addition, these methods only support classification and leave out regression, which limits their applications. Different from these recent works, which directly employ existing efficient inference methods on top of PLMs, PABEE is a novel early-exit criterion that captures the inner-agreement between earlier and later internal classifiers and exploits multiple classifiers for inference, leading to better accuracy both theoretically and empirically.

## 3 Patience-based Early Exit

Patience-based Early Exit (PABEE) is a plug-and-play method that can work well with minimal adjustment on training.

### 3.1 Motivation

We first conduct experiments to investigate the overthinking problem in PLMs. As shown in Figure 1(b), we illustrate the prediction distribution entropy teerapittayanon2016branchynet and the error rate of the model on the development set as more layers join the prediction. Although the model becomes more “confident” (lower entropy indicates higher confidence in BranchyNet teerapittayanon2016branchynet) with its prediction as more layers join, the actual error rate instead increases after 10 layers. This phenomenon was discovered and named “overthinking” by kaya2018shallow. Similarly, as shown in Figure 1(a), after 2.5 epochs of training, the model continues to get better accuracy on the training set but begins to deteriorate on the development set. This is the well-known overfitting problem, which can be resolved by applying an early stopping mechanism morgan1990generalization; prechelt1998early. From this aspect, overfitting in training and overthinking in inference are naturally alike, inspiring us to adopt an approach similar to early stopping for inference.

### 3.2 Inference

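The inference process of PABEE is illustrated in Figure 0(b). Formally, we define a common inference process as the input instance $\mathbf{x}$ goes through layers $L_1 \ldots L_n$ and the classifier/regressor $C_n$ to predict a class label distribution $\mathbf{y}$ (for classification) or a value $y$ (for regression, we assume the output dimension is $1$ for brevity). We couple an internal classifier/regressor $C_1 \ldots C_{n-1}$ with each layer of $L_1 \ldots L_{n-1}$, respectively. For each layer $L_i$, we first calculate its hidden state $\mathbf{h}_i$:

$$\mathbf{h}_i = L_i(\mathbf{h}_{i-1}), \qquad \mathbf{h}_0 = \mathrm{Embedding}(\mathbf{x}) \tag{1}$$

Then, we use its internal classifier/regressor to output a distribution or value as a per-layer prediction $\mathbf{y}_i = C_i(\mathbf{h}_i)$ or $y_i = C_i(\mathbf{h}_i)$. We use a counter $cnt$ to store the number of times that the predictions remain “unchanged”. For classification, $cnt_i$ is calculated by:

$$cnt_i = \begin{cases} cnt_{i-1} + 1 & \text{if } \arg\max(\mathbf{y}_i) = \arg\max(\mathbf{y}_{i-1}), \\ 0 & \text{if } \arg\max(\mathbf{y}_i) \neq \arg\max(\mathbf{y}_{i-1}) \lor i = 0. \end{cases} \tag{2}$$

While for regression, $cnt_i$ is calculated by:

$$cnt_i = \begin{cases} cnt_{i-1} + 1 & \text{if } |y_i - y_{i-1}| < \tau, \\ 0 & \text{if } |y_i - y_{i-1}| \geq \tau \lor i = 0. \end{cases} \tag{3}$$

where $\tau$ is a pre-defined threshold. We stop inference early at layer $L_j$ when $cnt_j = t$. If this condition is never fulfilled, we use the final classifier $C_n$ for prediction. In this way, the model can exit early without passing through all layers to make a prediction.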
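The classification case of this procedure can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `pabee_classify`, the identity "layers", and the scripted classifier outputs are all hypothetical stand-ins for a real PLM. The sketch assumes the counter is reset at the first layer, so that exiting with patience $t$ requires $t+1$ consecutive classifiers to agree.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def argmax(scores: Sequence[float]) -> int:
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda k: scores[k])


def pabee_classify(
    x: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    classifiers: Sequence[Callable[[Vector], Vector]],
    patience: int,
) -> Tuple[int, int]:
    """Return (predicted label, number of layers executed)."""
    hidden = x         # stands in for the embedding of the input
    prev_label = None  # there is no prediction before the first layer
    cnt = 0            # consecutive "unchanged" predictions so far
    label = -1
    for i, (layer, clf) in enumerate(zip(layers, classifiers), start=1):
        hidden = layer(hidden)
        label = argmax(clf(hidden))
        # Increment on agreement with the previous layer, otherwise reset.
        cnt = cnt + 1 if label == prev_label else 0
        if cnt == patience:
            return label, i  # early exit at layer i
        prev_label = label
    # Patience never reached: fall back to the last (final) classifier.
    return label, len(layers)


# Toy setup (hypothetical): identity layers and scripted per-layer scores
# whose argmax labels are 0, 1, 1, 1, 0, 0.
identity = lambda h: h
scripted = [[2, 1], [0, 3], [1, 2], [0, 5], [4, 1], [3, 0]]
classifiers = [lambda h, s=s: s for s in scripted]
layers = [identity] * len(scripted)

print(pabee_classify([0.0], layers, classifiers, patience=2))
print(pabee_classify([0.0], layers, classifiers, patience=3))
```

With a patience of 2, the toy model exits once three consecutive classifiers agree; with a patience of 3, no run is long enough and the prediction of the last classifier is used. The regression case would only replace the label comparison with a test of the form $|y_i - y_{i-1}| < \tau$.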

As shown in Figure 0(a), prediction score-based early exit relies on the softmax score. As revealed by prior work szegedy2013intriguing; jiang2018trust, predicted probability distributions (i.e., softmax scores) tend to be over-confident in one class, making them an unreliable metric to represent confidence. Moreover, the capability of a low layer may not match its high confidence score. In Figure 0(a), the second classifier outputs a high confidence score and incorrectly terminates inference. With Patience-based Early Exit, the stopping criterion works in a cross-layer fashion, preventing errors from one single classifier. Also, since PABEE comprehensively considers results from multiple classifiers, it can also benefit from an Ensemble Learning krogh1994ensemble effect.
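For contrast, the two score-based exit criteria discussed above can be sketched as follows. This is an illustrative sketch only: the function names and threshold values are hypothetical, and the original BranchyNet and Shallow-Deep implementations may differ in detail. The point is that both decisions depend on a single classifier's softmax output.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def exit_by_entropy(logits: Sequence[float], threshold: float) -> bool:
    """Entropy-style criterion: exit when the distribution is peaked."""
    return entropy(softmax(logits)) < threshold


def exit_by_max_prob(logits: Sequence[float], threshold: float) -> bool:
    """Softmax-score criterion: exit when the top probability is high."""
    return max(softmax(logits)) > threshold


# A single over-confident internal classifier triggers an exit under both
# criteria, whether or not its prediction is correct.
print(exit_by_entropy([10.0, 0.0], threshold=0.1))
print(exit_by_max_prob([10.0, 0.0], threshold=0.9))
```

Under a patience-based criterion, the same over-confident output would not terminate inference on its own, because the decision depends on agreement across consecutive layers rather than on one score.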

### 3.3 Training

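PABEE requires that we train internal classifiers to predict based on their corresponding layers’ hidden states. For classification, the loss function $\mathcal{L}_i$ for classifier $C_i$ is calculated with Cross Entropy:

$$\mathcal{L}_i = -\sum_{z \in Z} \left[ \mathbb{1}\left[\mathbf{y}_i = z\right] \cdot \log P\left(\mathbf{y}_i = z \mid \mathbf{h}_i\right) \right] \tag{4}$$

where $z$ and $Z$ denote a class label and the set of class labels, respectively. For regression, the loss is instead calculated by a (mean) squared error:

$$\mathcal{L}_i = \left(y_i - \hat{y}_i\right)^2 \tag{5}$$

where $\hat{y}$ is the ground truth. Then, we calculate and train the model to minimize the total loss $\mathcal{L}$ by a weighted average following kaya2018shallow:

$$\mathcal{L} = \frac{\sum_{j=1}^{n} j \cdot \mathcal{L}_j}{\sum_{j=1}^{n} j} \tag{6}$$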

In this way, every possible inference branch has been covered in the training process. Also, the weighted average can correspond to the relative inference cost of each internal classifier.
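The weighted average can be illustrated with a short sketch. It assumes that the loss of the $j$-th classifier is weighted by its depth $j$, which is one way for the weights to track relative inference cost; the function name `weighted_total_loss` is hypothetical and the per-layer losses are plain numbers rather than tensors.

```python
from typing import Sequence


def weighted_total_loss(per_layer_losses: Sequence[float]) -> float:
    """Depth-weighted average of per-classifier losses.

    per_layer_losses[j - 1] is the loss of the classifier attached to
    layer j; deeper classifiers receive proportionally larger weights.
    """
    n = len(per_layer_losses)
    weighted = sum(j * loss for j, loss in enumerate(per_layer_losses, start=1))
    return weighted / sum(range(1, n + 1))


# The same loss value contributes more when it comes from a deeper layer.
print(weighted_total_loss([3.0, 0.0, 0.0]))
print(weighted_total_loss([0.0, 0.0, 3.0]))
```

In a real training loop the same expression would be applied to differentiable loss tensors so that all internal classifiers and the shared backbone are updated jointly.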

### 3.4 Theoretical Analysis

It is straightforward to see that Patience-based Early Exit is able to reduce inference latency. To understand whether and under what conditions it can also improve accuracy, we conduct a theoretical comparison of a model’s accuracy with and without PABEE. We consider the case of binary classification for simplicity and conclude that:

###### Theorem 1

Assuming the patience of PABEE inference is $t$, the total number of internal classifiers (IC) is $n$, the misclassification probability (i.e., error rate) of all internal classifiers (excluding the final classifier) is $q$, and the misclassification probability of the final classifier and the original classifier (without ICs) is $p$. Then the PABEE mechanism improves the accuracy of conventional inference as long as

$$n - t < \left(\frac{1}{2q}\right)^{t} \left(\frac{p}{q}\right) - p$$

(the proof is detailed in Appendix B).

We can see that the above inequality can be easily satisfied in practice. For instance, when $n = 12$, $q = 0.2$, and $p = 0.1$, the above inequality is satisfied as long as the patience $t \geq 4$. Additionally, we verify the statistical feasibility of PABEE with Monte Carlo simulation in Appendix C. To further test PABEE with real data and tasks, we also conduct extensive experiments in the following section.
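The numerical example can be checked directly. The sketch below assumes the condition of Theorem 1 reads $n - t < (1/(2q))^{t} (p/q) - p$; the helper names are hypothetical and serve only this illustration.

```python
from typing import Optional


def pabee_condition_holds(n: int, t: int, q: float, p: float) -> bool:
    """Evaluate n - t < (1 / (2q))^t * (p / q) - p."""
    return (n - t) < (1.0 / (2.0 * q)) ** t * (p / q) - p


def min_patience(n: int, q: float, p: float) -> Optional[int]:
    """Smallest patience t in [1, n - 1] satisfying the condition, if any."""
    for t in range(1, n):
        if pabee_condition_holds(n, t, q, p):
            return t
    return None


# Setting from the text: 12 classifiers, internal error rate 0.2,
# final-classifier error rate 0.1.
print(min_patience(12, 0.2, 0.1))
```

For $t = 3$ the right-hand side evaluates to about 7.71, which is below $n - t = 9$, while for $t = 4$ it is about 19.43, which exceeds $n - t = 8$. The smallest admissible patience in this setting is therefore 4.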

## 4 Experiments

### 4.1 Tasks and Datasets

We evaluate our proposed approach on the GLUE benchmark glue. Specifically, we test on Microsoft Research Paraphrase Matching (MRPC) mrpc, Quora Question Pairs (QQP) (https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) and STS-B senteval for Paraphrase Similarity Matching; Stanford Sentiment Treebank (SST-2) sst for Sentiment Classification; Multi-Genre Natural Language Inference Matched (MNLI-m), Multi-Genre Natural Language Inference Mismatched (MNLI-mm) mnli, Question Natural Language Inference (QNLI) qnli and Recognizing Textual Entailment (RTE) glue for the Natural Language Inference (NLI) task; The Corpus of Linguistic Acceptability (CoLA) cola for Linguistic Acceptability. We exclude WNLI wnli from GLUE following previous work devlin2018bert; jiao2019tinybert; xu2020bert.

### 4.2 Baselines

For GLUE tasks, we compare our approach with four types of baselines: (1) Backbone models: We choose ALBERT-base and BERT-base, which have approximately the same inference latency and accuracy. (2) Directly reducing layers: We experiment with the first 6 and 9 layers of the original (AL)BERT with a single output layer on the top, denoted by (AL)BERT-6L and (AL)BERT-9L, respectively. These two baselines help to set a lower bound for methods that do not employ any technique. (3) Static model compression approaches: For pruning, we include the results of LayerDrop fan2019reducing and attention head pruning michel2019sixteen on ALBERT. For reference, we also report the performance of state-of-the-art methods on compressing the BERT-base model with knowledge distillation or module replacing, including DistilBERT sanh2019distilbert, BERT-PKD sun2019patient and BERT-of-Theseus xu2020bert. (4) Input-adaptive inference: Following the settings in concurrent studies Schwartz:2020; liu2020fastbert; xin2020deebert, we add internal classifiers after each layer and apply different early exit criteria, including those employed by BranchyNet teerapittayanon2016branchynet and Shallow-Deep kaya2018shallow. We also add DeeBERT xin2020deebert, a BranchyNet variant on BERT, alongside our BranchyNet implementation. To make a fair comparison, the internal classifiers and their insertion points are exactly the same in both baselines and Patience-based Early Exit. We search over a set of thresholds to find the one delivering the best accuracy for the baselines while targeting a speed-up ratio between 1.30× and 1.96× (the speed-up ratios of (AL)BERT-9L and -6L, respectively).

### 4.3 Experimental Setting

Training. We add a linear output layer after each intermediate layer of the pretrained BERT/ALBERT model as the internal classifiers. We perform grid search over batch sizes of {16, 32, 128}, and learning rates of {1e-5, 2e-5, 3e-5, 5e-5} with an Adam optimizer. We apply an early stopping mechanism and select the model with the best performance on the development set. We conduct our experiments on a single Nvidia V100 16GB GPU.

Inference. Following prior work on input-adaptive inference teerapittayanon2016branchynet; kaya2018shallow, inference is on a per-instance basis, i.e., the batch size for inference is set to 1. This is a common latency-sensitive production scenario when processing individual requests from different users Schwartz:2020. We report the median performance over 5 runs with different random seeds because the performance on relatively small datasets such as CoLA and RTE usually has large variance. For PABEE, we set the patience $t = 6$ in the overall comparison to keep the speed-up ratio between 1.30× and 1.96× while obtaining good performance following Figure 4. We further analyze the behavior of the PABEE mechanism with different patience settings in Section 4.5.

Method | #Param | Speed-up | CoLA (8.5K) | MNLI (393K) | MRPC (3.7K) | QNLI (105K) | QQP (364K) | RTE (2.5K) | SST-2 (67K) | STS-B (5.7K) | Macro Score |
---|---|---|---|---|---|---|---|---|---|---|---|
Dev. Set | | | | | | | | | | | |
ALBERT-base lan2019albert | 12M | 1.00 | 58.9 | 84.6 | 89.5 | 91.7 | 89.6 | 78.6 | 92.8 | 89.5 | 84.4 |
ALBERT-6L | 12M | 1.96 | 53.4 | 80.2 | 85.8 | 87.2 | 86.8 | 73.6 | 89.8 | 83.4 | 80.0 |
ALBERT-9L | 12M | 1.30 | 55.2 | 81.2 | 87.1 | 88.7 | 88.3 | 75.9 | 91.3 | 87.1 | 81.9 |
LayerDrop fan2019reducing | 12M | 1.96 | 53.6 | 79.8 | 85.9 | 87.0 | 87.3 | 74.3 | 90.7 | 86.5 | 80.6 |
HeadPrune michel2019sixteen | 12M | 1.22 | 54.1 | 80.3 | 86.2 | 86.8 | 88.0 | 75.1 | 90.5 | 87.4 | 81.1 |
BranchyNet teerapittayanon2016branchynet | 12M | 1.88 | 55.2 | 81.7 | 87.2 | 88.9 | 87.4 | 75.4 | 91.6 | - | - |
Shallow-Deep kaya2018shallow | 12M | 1.95 | 55.5 | 81.5 | 87.1 | 89.2 | 87.8 | 75.2 | 91.7 | - | - |
PABEE (ours) | 12M | 1.57 | 61.2 | 85.1 | 90.0 | 91.8 | 89.6 | 80.1 | 93.0 | 90.1 | 85.1 |
Test Set | | | | | | | | | | | |
ALBERT-base lan2019albert | 12M | 1.00 | 54.1 | 84.3 | 87.0 | 90.8 | 71.1 | 76.4 | 94.1 | 85.5 | 80.4 |
PABEE (ours) | 12M | 1.57 | 55.7 | 84.8 | 87.4 | 91.0 | 71.2 | 77.3 | 94.1 | 85.7 | 80.9 |

Method | #Param | Speed-up | MNLI (393K) | SST-2 (67K) | STS-B (5.7K) |
---|---|---|---|---|---|
BERT-base devlin2018bert | 108M | 1.00 | 84.5 | 92.1 | 88.9 |
BERT-6L | 66M | 1.96 | 80.1 | 89.6 | 81.2 |
BERT-9L | 87M | 1.30 | 81.4 | 90.5 | 85.0 |
DistilBERT sanh2019distilbert | 66M | 1.96 | 79.0 | 90.7 | 81.2 |
BERT-PKD sun2019patient | 66M | 1.96 | 81.3 | 91.3 | 86.2 |
BERT-of-Theseus xu2020bert | 66M | 1.96 | 82.3 | 91.5 | 88.7 |
BranchyNet teerapittayanon2016branchynet | 108M | 1.87 | 80.3 | 90.4 | - |
DeeBERT xin2020deebert | 108M | 1.59 | 80.7 | 90.0 | - |
Shallow-Deep kaya2018shallow | 108M | 1.91 | 80.5 | 90.6 | - |
PABEE (ours) | 108M | 1.62 | 83.6 | 92.0 | 88.7 |

Method | #Param (MNLI) | #Param (SST-2) | Train. time in min (MNLI) | Train. time in min (SST-2) |
---|---|---|---|---|
ALBERT w/o PABEE | 12M | 12M | 234 | 113 |
ALBERT w/ PABEE | +36K | +24K | 227 | 108 |
BERT w/o PABEE | 108M | 108M | 247 | 121 |
BERT w/ PABEE | +36K | +24K | 242 | 120 |

### 4.4 Overall Comparison

We first report our main result on GLUE with ALBERT as the backbone model in Table 1. This choice is made because: (1) ALBERT is a state-of-the-art PLM for natural language understanding. (2) ALBERT is already very efficient in terms of the number of parameters and memory use because of its layer sharing mechanism, but still suffers from the problem of high inference latency. We can see that our approach significantly outperforms all compared approaches on improving the inference efficiency of PLMs, demonstrating the effectiveness of the proposed PABEE mechanism. Surprisingly, our approach consistently improves the performance of the original ALBERT model by a relatively large margin while speeding up inference by 1.57×. This is, to the best of our knowledge, the first inference strategy that can improve both the speed and performance of a fine-tuned PLM.

To better compare the efficiency of PABEE with the method employed in BranchyNet and Shallow-Deep, we illustrate speed-accuracy curves in Figure 3 with different trade-off hyperparameters (i.e., threshold for BranchyNet and Shallow-Deep, patience for PABEE). Notably, PABEE retains higher accuracy than BranchyNet and Shallow-Deep under the same speed-up ratio, showing its superiority over prediction score based methods.

To demonstrate the versatility of our method with different PLMs, we report the results on a representative subset of GLUE with BERT devlin2018bert as the backbone model in Table 2. We can see that our BERT-based model significantly outperforms the other compared methods, whether they rely on knowledge distillation or on prediction probability based input-adaptive inference. Notably, the performance is slightly lower than that of the original BERT model, whereas PABEE improves the accuracy on ALBERT. We suspect that this is because the intermediate layers of BERT have never been connected to an output layer during pretraining, which leads to a mismatch between pretraining and fine-tuning when adding the internal classifiers. However, PABEE still has a higher accuracy than various knowledge distillation-based approaches as well as prediction probability distribution based models, showing its potential as a generic method for deep neural networks of different kinds.

As for the cost of training, we present parameter numbers and training time with and without PABEE with both BERT and ALBERT backbones in Table 3. Although more classifiers need to be trained, training PABEE is no slower (even slightly faster) than conventional fine-tuning, which may be attributed to the additional loss functions of added internal classifiers. This makes our approach appealing compared with other approaches for accelerating inference such as pruning or distillation because they require separately training another model for each speed-up ratio in addition to training the full model. Also, PABEE introduces fewer than 40K parameters (about 0.33% of the original 12M parameters).

### 4.5 Analysis

#### Impact of Patience

As illustrated in Figure 4, different patience values lead to different speed-up ratios and performance. For a 12-layer ALBERT model, PABEE reaches peak performance with a patience of 6 or 7. On MNLI, SST-2 and STS-B, PABEE can always outperform the baseline with patience between 5 and 8. Notably, unlike BranchyNet and Shallow-Deep, whose accuracy drops as the inference speed goes up, PABEE has an inverted-U curve. We confirm this observation statistically with Monte Carlo simulation in Appendix C. To analyze, when the patience is set too large, a later internal classifier may suffer from the overthinking problem and make a wrong prediction that breaks the stable state among previous internal classifiers, which have not met the early-exit criterion because $t$ is large. This makes PABEE leave more samples to be classified by the final classifier, which suffers from the aforementioned overthinking problem. Thus, the effect of Ensemble Learning vanishes, which undermines performance. Similarly, when $t$ is relatively small, more samples may meet the early-exit criterion by accident before actually reaching the stable state where consecutive internal classifiers agree with each other.

#### Impact of Model Depth

We also investigate the impact of model depth on the performance of PABEE. We apply PABEE to a 24-layer ALBERT-large model. As shown in Table 4, our approach consistently improves the accuracy as more layers and classifiers are added while producing an even larger speed-up ratio. This finding demonstrates the potential of PABEE for burgeoning deeper PLMs shoeybi2019megatron; raffel2019exploring; brown2020language.

Method | #Param | #Layer | Speed-up | MNLI (393K) | SST-2 (67K) | STS-B (5.7K) |
---|---|---|---|---|---|---|
ALBERT-base lan2019albert | 12M | 12 | 1.00 | 84.6 | 92.8 | 89.5 |
+ PABEE | 12M | 12 | 1.57 | 85.1 | 93.0 | 90.1 |
ALBERT-large lan2019albert | 18M | 24 | 1.00 | 86.4 | 94.9 | 90.4 |
+ PABEE | 18M | 24 | 2.42 | 86.8 | 95.2 | 90.6 |

Metric (↑ better) | ALBERT: SNLI | ALBERT: MNLI-m/-mm | ALBERT: Yelp | + Shallow-Deep kaya2018shallow: SNLI | + Shallow-Deep kaya2018shallow: MNLI-m/-mm | + Shallow-Deep kaya2018shallow: Yelp | + PABEE (ours): SNLI | + PABEE (ours): MNLI-m/-mm | + PABEE (ours): Yelp |
---|---|---|---|---|---|---|---|---|---|
Original Acc. | 89.6 | 84.1 / 83.2 | 97.2 | 89.4 | 82.2 / 80.5 | 97.2 | 89.9 | 85.0 / 84.8 | 97.4 |
After-Attack Acc. | 5.5 | 9.8 / 7.9 | 7.3 | 9.2 | 15.4 / 13.8 | 11.4 | 19.3 | 30.2 / 25.6 | 18.1 |
Query Number | 58 | 80 / 86 | 841 | 64 | 82 / 86 | 870 | 75 | 88 / 93 | 897 |

### 4.6 Defending Against Adversarial Attack

Deep Learning models have been found to be vulnerable to adversarial examples that are slightly altered with perturbations often indistinguishable to humans kurakin2017adversarial. jin2019bert revealed that PLMs can also be attacked with a high success rate. Recent studies kaya2018shallow; hu2020triple attribute the vulnerability partially to the overthinking problem, arguing that it can be mitigated by early exit mechanisms.

In our experiments, we use a state-of-the-art adversarial attack method, TextFooler jin2019bert, which demonstrates effectiveness on attacking BERT. We conduct black-box attacks on three datasets: SNLI snli, MNLI mnli and Yelp yelp. Note that since we use the pre-tokenized data provided by jin2019bert, the results on MNLI differ slightly from the ones in Table 1. We attack the original ALBERT-base model, ALBERT-base with Shallow-Deep kaya2018shallow and with Patience-based Early Exit.

As shown in Table 5, we report the original accuracy, after-attack accuracy and the number of queries needed by TextFooler to attack each model. Our approach successfully defends more than 3× as many attacks as the original ALBERT on NLI tasks, and more than 2× as many on the Yelp sentiment analysis task. Also, PABEE increases the number of queries needed to attack by a large margin, providing more protection to the model. Compared to Shallow-Deep kaya2018shallow, our model demonstrates significant robustness improvements. To analyze, although the early exit mechanism of Shallow-Deep can prevent the aforementioned overthinking problem, it still relies on a single classifier to make the final prediction, which makes it vulnerable to adversarial attacks. In comparison, since Patience-based Early Exit exploits multiple layers and classifiers, the attacker has to fool multiple classifiers (which may exploit different features) at the same time, making it much more difficult to attack the model. This effect is similar to the merits of Ensemble Learning against adversarial attack, discussed in previous studies strauss2017ensemble; tramer2018ensemble; pang2019improving.

## 5 Discussion

In this paper, we proposed PABEE, a novel efficient inference method that can yield a better accuracy-speed trade-off than existing methods. We verify its effectiveness and efficiency on GLUE and provide theoretical analysis. Empirical results show that PABEE can simultaneously improve the efficiency, accuracy, and adversarial robustness of a competitive ALBERT model. However, some limitations should be noted. First, PABEE requires a relatively deep model to effectively apply the patience mechanism, making it inapplicable to shallow models. Second, PABEE cannot work on multi-branch networks (e.g., NASNet zoph2018learning) but only on models with a single branch (e.g., ResNet, Transformer). For future work, we would like to explore our method on more tasks and settings. Also, since PABEE is orthogonal to prediction distribution based early exit approaches, it would be interesting to see if we can combine them with PABEE for better performance.

## Acknowledgments

We would like to thank the authors of TextFooler jin2019bert, Di Jin and Zhijing Jin, for their help with the data for adversarial attack.

## References

## Appendix A Image Classification

To verify the effectiveness of PABEE on Computer Vision, we follow the experimental settings in Shallow-Deep kaya2018shallow and conduct experiments on two image classification datasets, CIFAR-10 and CIFAR-100 krizhevsky2009learning. We use ResNet-56 he2016deep as the backbone and compare PABEE with BranchyNet teerapittayanon2016branchynet and Shallow-Deep kaya2018shallow. After every two convolutional layers, an internal classifier is added. We set the batch size to 128 and use the SGD optimizer.

Method | CIFAR-10 Speed-up | CIFAR-10 Acc. | CIFAR-100 Speed-up | CIFAR-100 Acc. |
---|---|---|---|---|
ResNet-56 he2016deep | 1.00 | 91.8 | 1.00 | 68.6 |
BranchyNet teerapittayanon2016branchynet | 1.33 | 91.4 | 1.29 | 68.2 |
Shallow-Deep kaya2018shallow | 1.35 | 91.6 | 1.32 | 68.8 |
PABEE (ours) | 1.26 | 92.0 | 1.22 | 69.1 |

The experimental results on CIFAR are reported in Table 6. PABEE outperforms the original ResNet model by 0.2 and 0.5 points in terms of accuracy while speeding up inference by 1.26× and 1.22× on CIFAR-10 and CIFAR-100, respectively. Also, PABEE demonstrates better performance at a similar speed-up ratio compared to both baselines.

## Appendix B Proof of Theorem 1

###### Proof B.1

Recall that we are in the case of binary classification. We denote the patience of PABEE as $t$, the total number of internal classifiers (IC) as $n$, the misclassification probability (i.e., error rate) of all internal classifiers as $q$, and the misclassification probability of the final classifier and the original classifier as $p$. We want to prove that the PABEE mechanism improves the accuracy of conventional inference as long as $n - t < \left(\frac{1}{2q}\right)^{t} \left(\frac{p}{q}\right) - p$.

For the examples that are not early-stopped, the misclassification probability with and without PABEE is the same. Therefore, we only need to consider the ratio between the probability that a sample is early-stopped and misclassified (denoted as $p_{misc}$) and the probability that a sample is early-stopped (denoted as $p_{stop}$). We want to find the condition on $n$ and $t$ which makes $\frac{p_{misc}}{p_{stop}} < p$.

First, considering only the probability mass of the model consecutively outputting the same label from the first position, we have

$$p_{stop} > q^{t+1} + (1-q)^{t+1} \tag{7}$$

which is a lower bound of $p_{stop}$ that only considers the probability that a sample is early-stopped by being consecutively predicted to have the same label from the first internal classifier. We then take its derivative with respect to $q$ and find that it obtains its minimum when $q = 0.5$. This corresponds to the case where the classifier is performing random guessing (i.e., the classification probabilities for class 0 and class 1 both equal 0.5). Intuitively, in the random guessing case the internal classification results are most unstable, so the probability that a sample is early-stopped is the smallest.

Therefore, we have $p_{stop} > \left(\frac{1}{2}\right)^{t}$.

Then for $p_{misc}$, we have

$$p_{misc} < q^{t+1} + \sum_{i=2}^{n-t} (1-q)\, q^{t+1} = q^{t+1} + (n-t-1)(1-q)\, q^{t+1} \tag{8}$$

where $q^{t+1}$ is the probability that the example is consecutively misclassified for $t+1$ times from the first IC. The term $\sum_{i=2}^{n-t} (1-q)\, q^{t+1}$ is the summation of the probabilities that the example is consecutively misclassified for $t+1$ times from the $i$-th IC but correctly classified at the previous IC, without considering the cases where the inference may have already finished (whether correctly or not) before that IC. The summation of these two terms is an upper bound of $p_{misc}$.

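So we need to have

$$\frac{p_{misc}}{p_{stop}} < \frac{q^{t+1} + (n-t-1)(1-q)\, q^{t+1}}{\left(\frac{1}{2}\right)^{t}} < p \tag{9}$$

which is equivalent to

$$(n-t-1)(1-q) < \left(\frac{1}{2q}\right)^{t} \left(\frac{p}{q}\right) - 1 \tag{10}$$

which is equivalent to

$$n - t < \frac{\left(\frac{1}{2q}\right)^{t} \left(\frac{p}{q}\right) - q}{1-q} \tag{11}$$

Because $t < n$, the condition $n - t < \left(\frac{1}{2q}\right)^{t} \left(\frac{p}{q}\right) - p$ stated in Theorem 1 forces $\left(\frac{1}{2q}\right)^{t} \left(\frac{p}{q}\right) > 1$, in which case the right-hand side of (11) is larger than $\left(\frac{1}{2q}\right)^{t} \left(\frac{p}{q}\right) - p$. The condition of Theorem 1 is therefore sufficient for (11).

Specially, when $q = p$, the condition becomes $n - t < \frac{\left(\frac{1}{2q}\right)^{t} - q}{1-q}$.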

## Appendix C Monte Carlo Simulation

To verify the theoretical feasibility of Patience-based Early Exit, we conduct a Monte Carlo simulation. We simplify the task to a binary classification with a 12-layer model whose classifiers all have the same probability of predicting correctly.

In Figure 4(a), we illustrate the accuracy lower bound that each single classifier needs for PABEE to reach the same accuracy as the original model without PABEE. We run the simulation 10,000 times with random Bernoulli sampling for each setting of the original accuracy and the patience $t$. The result shows that Patience-based Early Exit can effectively reduce the accuracy needed for each classifier. Additionally, we illustrate the accuracy requirement reduction in Figure 4(b). We can see a valley along the patience axis, which matches our observation in Section 4.5. However, the best patience in favor of accuracy in our simulation differs from the patience setting of 6 or 7 suggested by our experiments on real models and data (Section 4.5). To analyze, in the simulation we assume all classifiers have the same accuracy, while in reality the accuracy is monotonically increasing with more layers involved in the calculation.
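A simulation of this kind can be sketched as follows. This is a simplified illustration under stated assumptions rather than the exact protocol used for the figures: each classifier errs independently, in the binary setting two classifiers agree exactly when both are right or both are wrong, and the function name, seed, and parameter values are hypothetical. The sketch also allows the final classifier to have its own error rate `p`, which reduces to the equal-accuracy setting when `p == q`.

```python
import random


def simulate_pabee_accuracy(
    n: int, patience: int, q: float, p: float, trials: int, seed: int = 0
) -> float:
    """Estimate PABEE accuracy for n classifiers on a binary task.

    Classifiers 1..n-1 err independently with probability q and the
    final classifier errs with probability p.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        prev = None  # correctness of the previous classifier's prediction
        cnt = 0
        is_correct = False
        for i in range(1, n + 1):
            err = p if i == n else q
            is_correct = rng.random() >= err  # Bernoulli draw
            # In binary classification, equal correctness means equal labels.
            cnt = cnt + 1 if is_correct == prev else 0
            if cnt == patience:
                break  # early exit with the current prediction
            prev = is_correct
        correct += is_correct
    return correct / trials


# Hypothetical setting: 12 classifiers that all err with probability 0.2.
for t in (2, 4, 6, 8):
    print(t, simulate_pabee_accuracy(12, t, q=0.2, p=0.2, trials=10_000))
```

Sweeping `patience` and the error rates in this way makes it possible to compare the simulated accuracy against the accuracy `1 - p` of conventional inference.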