1 Introduction
Recent years have witnessed exciting achievements in the development of highly capable deep neural networks (DNNs), to the extent that new state-of-the-art (SOTA) results are published frequently. However, achieving this level of performance requires either extremely large architectures such as GPT-3 [gpt3] or SEER [seer] with billions of parameters (175B parameters and 350GB of memory in the case of GPT-3), or ensembles of many models. Consequently, inference is inefficient compared to lightweight models.
To alleviate the problem of slow inference on large architectures, a natural solution is to apply some form of model compression. The model compression literature is rich and mature, and covers various techniques such as network quantization [hawq, Jacob2018QuantizationAT, zeroq, haq, jin2020neural], knowledge distillation [hinton2015distilling, zhou2000m], pruning [cheng2017survey, he2017channel, gao2020rethinking, le2020paying], or a combination of multiple techniques [polino2018model, cheng2017survey, han2015deep]. After compression, the output DNN may have fewer parameters or may operate at lower bit precision. However, there is a trade-off between the compression ratio and the accuracy of a model: aggressive compression leads to a significant performance drop that defeats the purpose. Moreover, compression applies one solution to all data and is deterministic (static) at inference time, with no flexibility over different data samples. That said, compression techniques are commonly orthogonal to other approaches, in that a degree of compression can be added on top of other methods.
On the other hand, adaptive inference approaches propose to route inputs to different branches of a DNN either stochastically, or based on some decision criteria over the input data [branchynet, blockdrop, msdnet, ranet, yu2018slimmable, yu2019universally, yang2020mutualnet, wang2020resolution]. These methods are mostly based on architecture redesign, i.e., the model needs to be built in a specific way to support dynamic inference. This makes their training more complex and imposes additional non-trivial hyper-parameter tuning. Adaptive inference methods can broadly be categorized into redundancy-based and multi-exit structures. The redundancy-based approaches exploit the parameter redundancy in neural networks. To this end, [lccl] designed a convolution-based controller layer, which reduces the computations in practice, even though it increases the overall network size. Others [liu2018dynamic, blockdrop, veit2018convolutional, wang2018skipnet, sact, cnmm] dynamically skip some layers or blocks on the fly via selective layer execution.
Multi-exit or multi-stage approaches, however, are based on architectures in which a network can exit from different paths based on some confidence criteria. Earlier techniques such as BranchyNet [branchynet] incorporated an entropy-based threshold for routing. A similar approach was taken by [panda2016conditional, berestizshevsky2019dynamically], which train side classifiers to navigate inputs to different paths. [msdnet] proposed a multi-scale dense network to reuse feature maps of different scales, which was further improved in [ranet] by designing a resolution adaptive network (RANet) that identifies low-resolution inputs as easy cases and processes them with cheaper computations. There are also works based on architecture search for dynamic inference models [yuan2020enas4d]. It is also worth noting that the majority of existing methods focus on the task of image classification and fail to study other applications. [adaptivefeeding] is an example where adaptive inference was investigated for the task of object detection, by leveraging a Support Vector Machine (SVM) classifier to route the workload. A downside of [adaptivefeeding], however, is that dynamically changing the routing traffic between the fast and slow branches requires retraining.
Although the redundancy-based and multi-exit methods have made significant progress and work well in practice, we will show that they do not reach the levels of performance provided by our energy-based strategy. In addition, most of these methods require training models in a specific way necessitated by their architecture design. In contrast, our method works with out-of-the-box, already-trained models without the need for retraining.
In this paper, we propose an adaptive inference strategy that combines a large, deep, accurate model (called the Teacher) with a small, shallow, fast one (called the Student). Our method is based on an effective energy-based decision-making module for routing different samples to the deep or shallow model. In this way, certain examples are sent to the Student model, which yields high-speed inference, while other examples go to the Teacher model, which is slower but highly accurate. Our method provides an inference-time trade-off between inference latency and task accuracy. This can be thought of as a knob that lets users dynamically choose a desired point on the trade-off curve based on their required accuracy or latency. Figure 1 shows a high-level schematic of the proposed framework.
In addition to our main adaptive inference strategy, we provide an extension called specialized EBJR, which provides more accurate and efficient inference by training the Student in a way that it only learns to perform the downstream task partially (details in Section 2.4).
The main contributions of this paper are summarized as follows:

Combining small/shallow models (low accuracy and latency) with large/deep models (high accuracy and latency) to achieve high accuracy and low latency. Our method is easy to build and deploy, is architecture agnostic, applicable to different downstream tasks (e.g., classification and object detection), and can be applied to existing pretrained models (with no need for retraining).

An energy-based routing mechanism for directing examples to the small (Student) or large (Teacher) model. This allows a dynamic trade-off between accuracy and computational cost that outperforms previous works in adaptive inference (with zero overhead for real-time adjustment of speed/accuracy).

Creating a small Student model specialized for a subset of tasks (e.g., top-C classes only) with high accuracy; along with a plus-one (+1) mechanism to distinguish the top-C-class data from the others.
2 Energy-Based Joint Reasoning (EBJR)
We introduce EBJR, a novel energy-based joint reasoning approach for adaptive inference. Our method is inspired by the fact that smaller (shallower/narrower) models typically have lower accuracy but very fast inference, while larger (deeper/wider) models are highly accurate but very slow. We combine the small model (denoted the Student) and the large model (denoted the Teacher) in an efficient and effective way to provide fast inference while maintaining high accuracy. A schematic of our framework is shown in Figure 1.
The main challenge here is to design an effective routing mechanism (denoted by Router) to decide which model to use for each input. To support adaptive inference, the Router should also provide the option of dynamic trade-offs between accuracy and latency at inference time. The Router module essentially operates like a binary classifier that directs easy samples to the Student and hard ones to the Teacher. In some ways, this problem is also similar to the out-of-distribution (OOD) detection problem [liu2020energy], in which in- and out-of-distribution data are differentiated. OOD detection is generally relevant when a model sees input test data that differ from its training data (in-distribution data); consequently, the predictions of the model on OOD samples would be unreliable. For our case, the Router should identify whether or not the input data fit the distribution with which the Student has been trained (i.e., whether there is a high probability that the Student can make accurate predictions for that input). If not, the data is labelled as hard for the Student and forwarded to the Teacher, which has higher capability. In our work, we investigate the energy characteristics of data samples to route them effectively.
Energy definitions.
Given an input data point $\mathbf{x}$, the energy function $E(\mathbf{x}, y): \mathbb{R}^D \times \mathbb{R} \rightarrow \mathbb{R}$ is defined to map the input $\mathbf{x}$ to a scalar, non-probabilistic energy value. The probability distribution over a collection of energy values can be defined according to the Gibbs distribution [hinton1994autoencoders, lecun2006tutorial]:
$$p(y|\mathbf{x}) = \frac{e^{-E(\mathbf{x},y)/T}}{\int_{y'} e^{-E(\mathbf{x},y')/T}},$$
where $T$ is the temperature parameter and the denominator $\int_{y'} e^{-E(\mathbf{x},y')/T}$ is the partition function. The free energy $F(\mathbf{x})$ [lecun2006tutorial] of $\mathbf{x}$ can then be expressed as the negative log of the partition function:
$$F(\mathbf{x}) = -T \log \int_{y'} e^{-E(\mathbf{x},y')/T}. \quad (1)$$
In the following subsections, we will describe our energybased joint reasoning method, and give formulations for classification, regression, and object detection problems.
2.1 Classification
The Student classifier is defined as a function $S(\mathbf{x}): \mathbb{R}^D \rightarrow \mathbb{R}^C$ for mapping the input $\mathbf{x}$ to $C$ real-valued logits (where $C$ is the number of class labels). In probability theory, we can use the output of the softmax function to represent a categorical distribution, that is, a probability distribution over the $C$ different possible outcomes [liu2020energy]. A categorical distribution using the softmax function is expressed by:
$$p(y|\mathbf{x}) = \frac{e^{S_y(\mathbf{x})/T}}{\sum_{i=1}^{C} e^{S_i(\mathbf{x})/T}}, \quad (2)$$
where $S_y(\mathbf{x})$ denotes the logit of the $y$-th class label. The energy for a given input $(\mathbf{x}, y)$ in this case is defined as $E(\mathbf{x}, y) = -S_y(\mathbf{x})$ [liu2020energy]. The free energy function $F(\mathbf{x}; S)$ is then expressed similar to (1) as:
$$F(\mathbf{x}; S) = -T \log \sum_{i=1}^{C} e^{S_i(\mathbf{x})/T}. \quad (3)$$
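The free-energy score in (3) is straightforward to compute from a model's logits. The following is a minimal NumPy sketch (the function name and toy logit values are illustrative, not from the paper):

```python
import numpy as np

def free_energy(logits, T=1.0):
    """F(x; S) = -T * log(sum_i exp(S_i(x) / T)), as in Eq. (3),
    computed with the usual max-shift trick for numerical stability."""
    z = np.asarray(logits, dtype=float) / T
    m = z.max()
    return -T * (m + np.log(np.exp(z - m).sum()))

# A dominant logit yields a lower free energy (higher density) than flat,
# small-magnitude logits:
confident = free_energy([8.0, 0.5, -1.0])
uncertain = free_energy([0.3, 0.2, 0.1])
```

In practice this is a single log-sum-exp over the Student's output layer, so the routing score adds negligible cost on top of the Student's forward pass.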
Problem.
We seek to identify samples suitable for the Student and direct the others to the Teacher. A natural solution to this problem is to use the data density function $p(\mathbf{x})$ and consider the inputs with low likelihood as hard (or unfit) samples. To this end, an energy-based density function for the Student can be defined as:
$$p(\mathbf{x}) = \frac{e^{-F(\mathbf{x}; S)/T}}{\int_{\mathbf{x}'} e^{-F(\mathbf{x}'; S)/T}}, \quad (4)$$
where the denominator $Z = \int_{\mathbf{x}'} e^{-F(\mathbf{x}'; S)/T}$ normalizes the densities, and can be intractable to compute or estimate. By taking the logarithm of both sides, we obtain:
$$\log p(\mathbf{x}) = -\frac{F(\mathbf{x}; S)}{T} - \log Z. \quad (5)$$
Solution.
The term $\log Z$ is constant for all $\mathbf{x}$, and does not affect the overall distribution of energy values. Thus, the negative free energy $-F(\mathbf{x}; S)$ is linearly aligned with the log-likelihood $\log p(\mathbf{x})$, which makes it a suitable solution to our problem of detecting easy and hard samples. In this case, higher energy means lower likelihood, which represents harder (or more unfit) samples with respect to the Student's training distribution.
More precisely, for a threshold $\delta$ on the density function such that $p(\mathbf{x}) \geq \delta$, a threshold $\tau$ on the negative free energy can be calculated based on (5) as $\tau = T(\log \delta + \log Z)$. In practice, for a given input, an energy function is applied to the Student outputs at inference time to compute the energy score. Then, if the negative energy value is smaller than the threshold, the input is identified as a bad sample for the Student, and is sent to the Teacher.
Therefore, given the input data $\mathbf{x}$, the Student $S$, and the threshold $\tau$, our energy-based Router can simply be defined as:
$$R(\mathbf{x}; S, \tau) = \begin{cases} \text{Student} & \text{if } -F(\mathbf{x}; S) \geq \tau, \\ \text{Teacher} & \text{otherwise.} \end{cases} \quad (6)$$
Let the Teacher classifier be $\mathcal{T}(\mathbf{x}): \mathbb{R}^D \rightarrow \mathbb{R}^C$, with the same number of class labels $C$ as in the Student. Our joint reasoning classification function can then be written as:
$$J(\mathbf{x}) = \begin{cases} S(\mathbf{x}) & \text{if } -F(\mathbf{x}; S) \geq \tau, \\ \mathcal{T}(\mathbf{x}) & \text{otherwise.} \end{cases} \quad (7)$$
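The routing rule of (6)-(7) can be sketched in a few lines of Python. The function names and the lambda "models" below are illustrative stand-ins for real networks, not part of the paper's implementation:

```python
import numpy as np

def neg_free_energy(logits, T=1.0):
    """-F(x; S): higher values indicate the sample looks in-distribution."""
    z = np.asarray(logits, dtype=float) / T
    m = z.max()
    return T * (m + np.log(np.exp(z - m).sum()))

def joint_reasoning(x, student, teacher, tau, T=1.0):
    """Sketch of Eqs. (6)-(7): run the Student once; keep its prediction if
    the negative free energy of its logits clears the threshold tau,
    otherwise fall back to the Teacher."""
    s_logits = student(x)
    if neg_free_energy(s_logits, T) >= tau:
        return s_logits, "student"
    return teacher(x), "teacher"

# Toy stand-ins: a confident Student output for 'easy' inputs, flat otherwise.
student = lambda x: np.array([6.0, 0.1, 0.2]) if x == "easy" else np.array([0.1, 0.2, 0.3])
teacher = lambda x: np.array([0.0, 5.0, 0.0])
```

Note that the Teacher is only invoked on the fallback path, so Student-routed samples never pay the Teacher's cost; the threshold `tau` is the runtime knob described in the introduction.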
2.2 Regression
A regressor maps an input $\mathbf{x}$ to a target scalar $y \in \mathbb{R}$, and is scored by a function $g(\mathbf{x}, y): \mathbb{R}^D \times \mathbb{R} \rightarrow \mathbb{R}$ over input-target pairs. For a given input $(\mathbf{x}, y)$, the energy function for a regressor is simply defined as $E(\mathbf{x}, y) = -g(\mathbf{x}, y)$. The regression problem can then be expressed by creating an energy-based model of the conditional density $p(y|\mathbf{x})$ as:
$$p(y|\mathbf{x}) = \frac{e^{g(\mathbf{x},y)/T}}{\int_{y'} e^{g(\mathbf{x},y')/T}}, \quad (8)$$
where the denominator is the normalizing partition function, which involves a computationally intractable integral. One solution is to approximate it using the Monte Carlo importance sampling method described in [gustafsson2020energy]. The free energy in this case is defined by:
$$F(\mathbf{x}; g) = -T \log \int_{y'} e^{g(\mathbf{x},y')/T}. \quad (9)$$
Similar to (4), the density function for a regressor using the energy-based model can be obtained as follows:
$$p(\mathbf{x}) = \frac{e^{-F(\mathbf{x}; g)/T}}{Z_g}, \quad (10)$$
where the denominator $Z_g = \int_{\mathbf{x}'} e^{-F(\mathbf{x}'; g)/T}$ normalizes the densities. By taking the log of both sides:
$$\log p(\mathbf{x}) = -\frac{F(\mathbf{x}; g)}{T} - \log Z_g, \quad (11)$$
which, as in the classification problem, shows that $-F(\mathbf{x}; g)$ has a linear alignment with the log-likelihood, considering the fact that $\log Z_g$ is constant for all $\mathbf{x}$; this makes it desirable for our problem.
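The intractable partition integral in (8) can be approximated by Monte Carlo importance sampling, in the spirit of [gustafsson2020energy]. Below is a hedged toy sketch: the function names, the Gaussian proposal, and the closed-form test case are all illustrative choices, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def partition_mc(g, sample, logpdf, n=200_000):
    """Importance-sampling estimate of Z = integral of exp(g(y)) dy, the
    intractable denominator in Eq. (8). 'sample' draws from a proposal
    distribution q and 'logpdf' evaluates log q(y)."""
    y = sample(n)
    log_w = g(y) - logpdf(y)          # log importance weights
    m = log_w.max()                   # max-shift for numerical stability
    return np.exp(m) * np.mean(np.exp(log_w - m))

# Toy check against a closed form: g(y) = -y^2/2 gives Z = sqrt(2*pi).
g = lambda y: -0.5 * y ** 2
sample = lambda n: rng.normal(0.0, 2.0, size=n)                     # N(0, 2^2) proposal
logpdf = lambda y: -0.5 * (y / 2.0) ** 2 - np.log(2.0 * np.sqrt(2.0 * np.pi))
z_hat = partition_mc(g, sample, logpdf)
```

A wider-than-target proposal keeps the importance weights bounded, which keeps the estimator's variance low; the estimate `z_hat` converges to $\sqrt{2\pi} \approx 2.507$ here.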
2.3 Object detection
For the object detection task, which combines classification and regression, we can define the total free energy score as $F_{det}(\mathbf{x}) = F_{cls}(\mathbf{x}) + F_{reg}(\mathbf{x})$, where the regressor for predicting the 4 points of each bounding box is defined as $r(\mathbf{x}): \mathbb{R}^D \rightarrow \mathbb{R}^{B \times 4}$. With $B$ detected boxes and $C$ labels, the classifier's free energy score is formulated as:
$$F_{cls}(\mathbf{x}; S) = -T \sum_{j=1}^{B} \log \sum_{i=1}^{C} e^{S_{i,j}(\mathbf{x})/T}, \quad (12)$$
where $S_{i,j}(\mathbf{x})$ is the classifier's output for the $i$-th class label of the $j$-th bounding box, with $i \in \{1, \dots, C\}$ and $j \in \{1, \dots, B\}$. The regressor's free energy is also given by:
$$F_{reg}(\mathbf{x}; r) = -T \sum_{j=1}^{B} \log \sum_{k=1}^{4} e^{r_{k,j}(\mathbf{x})/T}, \quad (13)$$
where $r_{k,j}(\mathbf{x})$ is the regression output for the $k$-th point of the $j$-th bounding box, with $k \in \{1, \dots, 4\}$.
The energy-based joint reasoning function for the object detection task is finally defined as:
$$J(\mathbf{x}) = \begin{cases} S(\mathbf{x}) & \text{if } -F_{det}(\mathbf{x}; S) \geq \tau, \\ \mathcal{T}(\mathbf{x}) & \text{otherwise,} \end{cases} \quad (14)$$
where $S$ and $\mathcal{T}$ denote the Student and Teacher object detection models.
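The per-box sums of (12)-(13) reduce to one log-sum-exp term per box for each head. A minimal sketch (function names and shapes are illustrative; a real detector would supply the logits and box outputs):

```python
import numpy as np

def logsumexp(v):
    # Numerically stable log-sum-exp.
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def detection_free_energy(cls_logits, box_outputs, T=1.0):
    """Total detection free energy F_cls + F_reg, sketching Eqs. (12)-(13):
    cls_logits has shape (B, C) and box_outputs has shape (B, 4) for B
    detected boxes; each box contributes one log-sum-exp term per head."""
    f_cls = -T * sum(logsumexp(np.asarray(row, dtype=float) / T) for row in cls_logits)
    f_reg = -T * sum(logsumexp(np.asarray(row, dtype=float) / T) for row in box_outputs)
    return f_cls + f_reg
```

Because the score is a sum over boxes, each additional detected box contributes additively to the total free energy used by the Router in (14).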
2.4 Specialized EBJR
In Section 2.1, it was assumed that the Student and Teacher models have an equal number of classes, $C$. As proved in [abramovich2019classification], in order to achieve good performance for a classifier with a large number of classes, a significantly large number of features is required. Since the Teacher is assumed to be a very large model with a significant number of features, it is capable of handling more difficult tasks with a large $C$. On the other hand, the small Student model may lack enough features to effectively deal with a large $C$.
In addition, in inference services such as public clouds, the majority of input data usually belongs to a small, popular subset of classes that are used frequently, for example, "people", "cat", "dog", "car", etc. (the supplementary materials contain example-per-class histogram plots for public datasets, which confirm this intuition). Considering this fact, the Student can be trained and specialized to be highly accurate on this specific/popular subset (with a small number of classes $C' < C$). Consequently, in our joint reasoning scheme, most of the input data can be handled by the Student in a very accurate and computationally efficient way.
Let the specialized Student be $S'(\mathbf{x}): \mathbb{R}^D \rightarrow \mathbb{R}^{C'+1}$, where $C' < C$. To make sure the model can still exploit and learn from all data at training time, we label the data that do not belong to the top $C'$ classes as an additional class (i.e., the 'other' class). The extra class is also utilized as a supplementary mechanism in our Router to evaluate the performance of $S'$ on a given input at inference time. Similar to a binary classifier, it is used for distinguishing the data carrying the top-$C'$ labels from the others.
The specialized Student has another benefit for our energy-based Router. Since only a subset of class labels is used for training the Student, the energy difference between in- and out-of-distribution data tends to be larger:
$$F(\mathbf{x}_{out}; S') - F(\mathbf{x}_{in}; S') \geq F(\mathbf{x}_{out}; S) - F(\mathbf{x}_{in}; S), \quad (15)$$
where $\mathbf{x}_{in} \in \mathcal{D}_{in}$ and $\mathbf{x}_{out} \in \mathcal{D}_{out}$ denote in- and out-of-distribution samples, respectively. The larger the energy difference, the better the Router can distinguish the fit and unfit data for the Student, which results in more accurate and efficient adaptive inference. Given the input data $\mathbf{x}$, the specialized Student $S'$, and a threshold $\tau'$, our specialized energy-based Router is expressed as:
$$R'(\mathbf{x}; S', \tau') = \begin{cases} \text{Student} & \text{if } -F(\mathbf{x}; S') \geq \tau' \text{ and } \arg\max_i S'_i(\mathbf{x}) \neq c_{C'+1}, \\ \text{Teacher} & \text{otherwise,} \end{cases} \quad (16)$$
where $c_{C'+1}$ denotes the extra class defined in $S'$. The free energy for the specialized Student is calculated only over the top $C'$ classes, not the extra class, as follows:
$$F(\mathbf{x}; S') = -T \log \sum_{i=1}^{C'} e^{S'_i(\mathbf{x})/T}. \quad (17)$$
Let the Teacher be $\mathcal{T}(\mathbf{x}): \mathbb{R}^D \rightarrow \mathbb{R}^C$ with $C > C'$. Then, the specialized joint reasoning function for making predictions on $\mathbf{x}$ can be given by:
$$J'(\mathbf{x}) = \begin{cases} S'(\mathbf{x}) & \text{if } R'(\mathbf{x}; S', \tau') = \text{Student}, \\ \mathcal{T}(\mathbf{x}) & \text{otherwise.} \end{cases} \quad (18)$$
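The specialized Router combines two checks: the energy threshold over the top-$C'$ logits (17) and the plus-one 'other'-class test (16). A minimal sketch, with an illustrative function name and toy logits (the last logit plays the role of the extra class):

```python
import numpy as np

def specialized_router(student_logits, tau, T=1.0):
    """Sketch of the specialized Router, Eqs. (16)-(17). The Student emits
    C'+1 logits, the last one being the extra 'other' class. Route to the
    Student only if (a) the argmax is not 'other' and (b) the negative free
    energy over the first C' logits clears the threshold tau."""
    logits = np.asarray(student_logits, dtype=float)
    if logits.argmax() == len(logits) - 1:       # predicted the 'other' class
        return "teacher"
    top = logits[:-1] / T                        # Eq. (17): exclude extra class
    m = top.max()
    neg_f = T * (m + np.log(np.exp(top - m).sum()))
    return "student" if neg_f >= tau else "teacher"
```

The 'other'-class test catches inputs that the specialized Student confidently recognizes as outside its subset, even before the energy threshold is consulted.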
3 Experiments
In this section, we evaluate and discuss the performance of our EBJR approach along with the other related methods on image classification and object detection tasks on different benchmarks. We provide more results and ablation studies in the supplementary materials.
3.1 Adaptive inference results
Figures 2 and 3 show the classification results for EBJR and the SOTA in adaptive inference on the CIFAR-10, CIFAR-100, ImageNet, and Caltech-256 [caltech256] datasets. We use multiple datasets not only to evaluate the generality of our method, but also because not all other methods published results on a single standard dataset. For all the datasets, we use DenseNet models [densenet] for our Student and Teacher, except Caltech-256, for which ResNet models are used. Table 1 shows and compares the details of the Student, Teacher, and EBJR models and their accuracy, floating point operations (FLOPs), and average inference time (latency).
Note that many previous approaches are based on the DenseNet architecture, adaptively dropping connections for inference speedup. Thus, we also choose DenseNet as the main architecture to establish a fair comparison, although our method does not rely on any specific network design and can work with any black-box architecture. Moreover, we follow the standard practice in the previous works and analyze the results with FLOPs [ranet, msdnet, cnmm, blockdrop]. For our method, the total FLOPs count is measured as a weighted average of the Teacher and Student FLOPs based on their usage frequency: $F_{total} = (N_S F_S + N_T F_T)/(N_S + N_T)$, where $N_S$ and $N_T$ are respectively the number of samples processed by the Student (with $F_S$ FLOPs) and the Teacher (with $F_T$ FLOPs). Note that the metric used in [ranet, msdnet, cnmm] is multiply-accumulates (MACs), i.e., half the FLOPs used in this work.
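The usage-weighted FLOPs average described above is a one-liner; a small sketch follows (whether Teacher-routed samples additionally pay the Student's routing cost is not specified here, so this takes the simple reading; the example numbers are illustrative, borrowed from the CIFAR-10 column of Table 1):

```python
def average_flops(n_student, n_teacher, f_student, f_teacher):
    """Usage-weighted average cost of the joint model:
    (N_S * F_S + N_T * F_T) / (N_S + N_T)."""
    return (n_student * f_student + n_teacher * f_teacher) / (n_student + n_teacher)

# E.g., 70% of samples on a 0.54-FLOPs Student and 30% on a 2.92-FLOPs Teacher:
avg = average_flops(70, 30, 0.54, 2.92)
```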
Evaluation of EBJR on the ImageNet (left) and Caltech-256 (right) datasets. The numbers on the EBJR (random) curve show the percentage of samples processed by the Teacher.
In Figures 2 and 3, the trade-off between accuracy and computational cost is adaptively achieved in our method by choosing different values for the threshold parameter $\tau$ defined in (6) and (7). The larger the threshold, the more input data are routed to the Teacher model, which results in more accurate, but slower, inference. As the Student is able to make accurate predictions for the majority of input data, adaptive inference with an appropriately small $\tau$ can almost reach the Teacher's accuracy, but at a much lower computational cost. For CIFAR-10, this strategy achieves the Teacher's accuracy with 2.2× fewer FLOPs. It can also lead to approximately 3× fewer FLOPs at an accuracy of 94.5% (i.e., only 0.2% lower than the Teacher). The speedup for Caltech-256 is about 2×, while maintaining the Teacher's accuracy of 89.87%. For CIFAR-100 and ImageNet, which are more complicated benchmarks, the Teacher's top-1 accuracy is almost achieved with approximately 1.5× savings in computation. Moreover, as illustrated in Figures 2 and 3, our method outperforms previous works such as RANet [ranet] and MSDNet [msdnet] on these benchmarks across a variety of accuracy and cost combinations.
Table 1: Details of the Student (S), Teacher (T), and EBJR models.

              CIFAR-10              CIFAR-100             ImageNet               Caltech-256
              S      T      EBJR    S      T      EBJR    S      T      EBJR     S      T      EBJR
Depth         52     64     -       58     88     -       121    201    -        18     152    -
Growth rate   6      12     -       6      8      -       12     32     -        -      -      -
Accuracy (%)  91.81  94.76  94.74   69.28  74.94  74.87   66.28  76.92  76.62    83.16  89.87  89.87
FLOPs         0.54   2.92   1.36    0.64   2.14   1.57    11.51  86.37  58.1     54.0   340.0  170.1
Latency (ms)  14.0   35.0   23.78   26.0   51.0   42.1    84.0   225.0  196.8    25.0   200.0  113.6
To investigate the performance of the energy-based routing mechanism compared to other alternatives, we perform an ablation study on Caltech-256, where the energy score is replaced by the softmax confidence or entropy [branchynet] scores. We also include a random baseline in this experiment, where the input samples are randomly distributed between the Student and Teacher models (the experiment was run multiple times and the best result is reported). The corresponding adaptive inference results are presented in Figure 3 (right). It is observed that the softmax- and entropy-based mechanisms reach the Teacher's accuracy with 1.4× and 1.7× fewer FLOPs, respectively, which is lower than the energy-based strategy with its 2× speedup. A theoretical analysis of the entropy score is given in the supplementary materials.
Figure 4 illustrates the energy-score distributions for the samples processed by the Student (i.e., in-distribution data) and the Teacher (i.e., out-of-distribution data). As observed, the in-distribution samples (suitable for the Student) tend to have higher negative free energy scores. Based on our experiments, the optimal setup for EBJR is achieved by choosing the threshold at the crossing point of the two distributions. As a consequence, by choosing $\tau = 12.0$ for CIFAR-10, 70% of the samples are handled by the Student with an accuracy of 99.0%, and only 30% are routed to the Teacher, which results in 3× fewer total FLOPs. For CIFAR-100 (with $\tau = 15.0$) and ImageNet (with $\tau = 15.5$), 50% are processed by the Student (with an accuracy of 91.0%), which achieves about 1.5× fewer FLOPs.
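Choosing the threshold at the crossing point of the two score distributions can be approximated from held-out scores with a simple histogram sweep. This is a hedged sketch: the function name, binning, and the synthetic Gaussian scores standing in for Figure 4's distributions are all illustrative:

```python
import numpy as np

def crossing_threshold(scores_in, scores_out, bins=100):
    """Estimate tau at the crossing point of the in-distribution
    (Student-suitable) and out-of-distribution (Teacher-bound) score
    histograms, as suggested by Figure 4."""
    lo = min(scores_in.min(), scores_out.min())
    hi = max(scores_in.max(), scores_out.max())
    edges = np.linspace(lo, hi, bins + 1)
    h_in, _ = np.histogram(scores_in, bins=edges, density=True)
    h_out, _ = np.histogram(scores_out, bins=edges, density=True)
    # Scanning left to right, the first bin where the in-distribution density
    # overtakes the out-of-distribution density approximates the crossing.
    idx = int(np.argmax(h_in > h_out))
    return 0.5 * (edges[idx] + edges[idx + 1])

# Synthetic negative-energy scores standing in for the two distributions:
rng = np.random.default_rng(1)
tau = crossing_threshold(rng.normal(14.0, 1.0, 5000), rng.normal(8.0, 1.0, 5000))
```

With well-separated distributions like these, the estimate lands near the midpoint between the two modes; in practice one would compute the scores on a held-out split and sweep `tau` around this value.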
Table 2: Power-accuracy comparison.

                    CIFAR-10                    ImageNet
                    EBJR   [tann2016runtime]    EBJR           BLNet [park2015big]
Accuracy loss (%)   0.0    0.96                 0.9 (0.0)      0.9
Power savings (%)   64.03  58.74                56.63 (32.93)  53.7
Power-Accuracy Trade-off.
In the literature, there are also some adaptive inference methods proposed for an efficient power-accuracy trade-off, for example, [tann2016runtime] and BLNet [park2015big]. In order to compare EBJR with these approaches, we use the strategy in [lee2019energy] to calculate the power (or energy) consumption per image. As summarized in Table 2, the method in [tann2016runtime] reduces power consumption by 58.74% with a 0.96% accuracy loss on CIFAR-10, while EBJR (Figure 2) achieves 64.03% power savings without any accuracy loss. Moreover, BLNet [park2015big] achieves a 53.7% reduction in power consumption with an accuracy loss of 0.9% on ImageNet. EBJR (Figure 5), on the other hand, provides a 56.63% reduction in power consumption with the same accuracy drop. Unlike BLNet, which does not reach the big model's accuracy, our method achieves the Teacher's accuracy with 32.93% less power consumption.
MobileNetV2-Based EBJR.
In addition to DenseNet, there exist SOTA methods based on other architectures such as MobileNetV2 [sandler2018mobilenetv2], for example, S-Net [yu2018slimmable], US-Net [yu2019universally], MutualNet [yang2020mutualnet], and RS-Net [wang2020resolution]. In order to compare EBJR with these approaches, we run another set of experiments on ImageNet, where MobileNetV2 models with 128×128 and 224×224 input resolutions are respectively used as our Student and Teacher. As shown in Figure 5 (left), EBJR achieves better performance than S-Net, US-Net, and MutualNet across all FLOPs regimes, and also better than RS-Net at high FLOPs. RS-Net provides better results than EBJR at low FLOPs, which is due to the less accurate Student used in EBJR. However, when EBJR and RS-Net are integrated and the RS-Net's 128×128 path is employed as the Student, the results improve and EBJR outperforms RS-Net at all trade-off points.
The performance of EBJR for object detection on MS-COCO (compared with EfficientDet [efficientdet]).

Significance Test.
In order to evaluate the statistical significance of the results, we perform McNemar's test [dietterich1998approximate] over EBJR and the SOTA, including RANet and RS-Net. McNemar's test is interpreted based on a given significance level $\alpha$ (commonly set to 0.05, indicating 95% confidence) as well as the $p$-value and odds ratio calculated by the test. The default assumption (null hypothesis), i.e., $p > \alpha$, states that the two classifiers should have the same error rate, or that there should be no difference in the disagreements between them. However, if the null hypothesis is rejected, i.e., if $p \leq \alpha$, it suggests that the two classifiers disagree in different ways. After running the test over the EBJR vs. RANet predictions on CIFAR-10 and the EBJR vs. RS-Net predictions on ImageNet, very low $p$-values ($p \ll 0.05$) are obtained in both cases. These reject the null hypothesis and strongly confirm that there is a significant difference in the disagreements between EBJR and the other two models. Also, odds ratios of 1.42 and 1.14 are respectively obtained, which give an estimate of how much better EBJR is compared to RANet and RS-Net.

Unlike image classification, adaptive inference for the object detection task has rarely been explored. We analyze the performance of EBJR on object detection (formulated and described in Section 2.3) on the MS-COCO dataset [coco]. We employ EfficientDet-D0 and EfficientDet-D4 [efficientdet] as the Student and Teacher, respectively. Figure 5 (right) shows the adaptive inference results compared to the EfficientDet models (D0, D1, D2, D3, and D4). As shown in the figure, EBJR outperforms the standard EfficientDet models, reaching 97% of the Teacher's mAP on MS-COCO with a 1.8× speedup. For the same mAP level, the adaptive feeding method of [adaptivefeeding] reports only a 1.3× speedup.
3.2 Specialized EBJR
In Section 2.4, we argued that creating a specialized Student targeted to handle only the popular categories can make joint inference more efficient. To study this case, we run a set of experiments on a subset of the Open Images dataset (OID) [oid] that has been labeled using the 256 class labels of the Caltech-256 dataset. We train the Student with 20% of the class labels (i.e., $C' = 50$ out of 256 labels) along with an extra label reserved for the other classes. In this setup, we choose the top-50 class labels with the largest number of samples in the OID training set. For testing, we randomly select a new set of size 3K from the OID validation set, where 75% of the data carry the top-50 labels. This is done to ensure the initial assumption of 'having the majority of samples from the popular classes' remains valid.
Figure 6 (left) shows the results of this experiment. We see that, compared to the general cases of EBJR, the specialized EBJR provides the best performance under the assumption that the majority of input data belong to a small subset of classes. For example, compared to the Teacher, the specialized EBJR achieves 1.5× fewer FLOPs with the same accuracy.
Figure 6 (right) shows the effect of the percentage of data belonging to the top $C'$ classes. As expected, the more data in the top classes, the faster the joint model, since more load is directed to the Student, which is faster than the Teacher. We observe that when $C'$ is too low or too high, e.g., $C' = 10$ or $C' = 100$, adaptive inference with the specialized Student becomes less efficient, even with large percentages of data in the top classes. For moderate values such as $C' = 50$, the specialized EBJR becomes more efficient, especially when a sufficiently large share of the data belongs to the top classes. More analysis is given in the supplementary materials.
Note that EBJR is orthogonal to SOTA dynamic inference approaches, including the weight-sharing ones. In Figure 5 (left), we applied EBJR on top of RS-Net and showed improved performance. More results are given in the supplementary materials.
One limitation of EBJR is the memory overhead due to needing both the Student and the Teacher at inference time. One solution to this problem is to run the largest possible Student on the edge device, and the Teacher on the cloud. If a desired accuracy is not met on the edge, the Router sends certain samples to the cloud for higher accuracy. Since the Student is the largest model that can fit on the device, it is expected to handle most cases, while the cloud is used only sparingly in accuracy-sensitive applications. In this setup, the overall accuracy is not bounded by what can run on the edge; the upper bound is what can run on the cloud.
4 Conclusion
In this paper, we presented an adaptive inference method that combines large, accurate models with small, fast models. We proposed an effective energy-based routing module for directing different samples to the deep or shallow model. Our method provides a trade-off between inference latency and accuracy, which in practice is a useful knob for users to adjust based on their required accuracy or latency, without the need for retraining. In addition, we provided an extension to our method that increases inference efficiency by training the shallow models in a way that they only learn to perform the downstream tasks partially. We presented theoretical and experimental evaluations in support of our method. We hope our work can help facilitate building efficient multi-model inference systems.
References
5 Supplementary materials
This section contains the supplementary materials.
5.1 Demo
In addition to the code, we also include a 'Demo.mp4' video file that contains a demonstration of our framework. This is based on a screen recording of a web application we built to showcase the use cases of our method in real-world scenarios. Figure 7 shows a screenshot of the demo application.
5.2 Ablation studies on CIFAR-10, CIFAR-100, and ImageNet
Figure 8 shows the results of ablation studies of our EBJR method with different architectures for the Student and Teacher models on CIFAR-10, CIFAR-100, and ImageNet. We observe that the results do not vary excessively, which shows the robustness of the proposed method.
5.3 More experiments with RANet
In this experiment, we evaluate the performance of EBJR when SOTA architectures are used as our Student and Teacher models. In other words, we investigate whether our method can be added on top of other efficient methods such as RANet, to benefit both from their designs and from our joint inference. To this end, we trained the RANet architecture with three scales (as suggested in the RANet work) on CIFAR-10, CIFAR-100, and ImageNet. The accuracy and computational cost of the Student and Teacher models used for the three datasets are summarized in Table 3. For the Student, we employed the RANet's first classifier from the first scale with 0.316 FLOPs. For the Teacher, the last classifier from the last scale with 1.89 FLOPs was used. Figure 9 shows the corresponding adaptive inference results compared with the RANet baseline on CIFAR-10, CIFAR-100, and ImageNet. We observe that our method is orthogonal to RANet, and can improve it further.
Table 3: RANet-based Student (S) and Teacher (T) models.

               CIFAR-10         CIFAR-100        ImageNet
               S       T        S       T        S      T
Accuracy (%)   91.18   93.61    67.28   74.73    56.18  71.69
FLOPs          0.3162  1.898    0.3166  1.9      3.36   33.62
5.4 Alternative routing mechanisms: Softmax and Entropy
In Section 3.1, an ablation study (Figure 3, right) was presented to analyze the softmax and entropy scores as alternative routing signals for the Student. Here, we study their mathematical connection to the energy score and their potential to solve the routing problem.
5.4.1 Softmax-based Router
The softmax (confidence) score for a classifier is expressed by:
$$\max_y p(y|\mathbf{x}) = \max_y \frac{e^{S_y(\mathbf{x})/T}}{\sum_{i=1}^{C} e^{S_i(\mathbf{x})/T}} = \frac{e^{S^{max}(\mathbf{x})/T}}{\sum_{i=1}^{C} e^{S_i(\mathbf{x})/T}}. \quad (19)$$
By taking the logarithm of both sides, we start to see the connection between the log of the softmax score and the free energy score formulated in (3):
$$\log \max_y p(y|\mathbf{x}) = \frac{1}{T} F(\mathbf{x}; S - S^{max}), \quad (20)$$
where all logits are shifted by their maximum logit $S^{max}(\mathbf{x})$, i.e., $F(\mathbf{x}; S - S^{max}) = -T \log \sum_{i=1}^{C} e^{(S_i(\mathbf{x}) - S^{max}(\mathbf{x}))/T}$. Plugging the free energy term from (5) into (20) yields:
$$\log \max_y p(y|\mathbf{x}) = -\log p(\mathbf{x}) - \log Z + \frac{S^{max}(\mathbf{x})}{T}. \quad (21)$$
It is observed that for samples with a high likelihood of being in the Student's distribution, the free energy goes lower, but the maximum logit tends to go higher. Due to this shifting, unlike the energy score, the softmax confidence score is not well aligned with the probability density $p(\mathbf{x})$. As a result, the confidence score is less reliable for our Router when analyzing the performance of the Student.
5.4.2 Entropy-based Router
The entropy score measures the randomness of the information being processed, and is calculated as follows:
$$H(\mathbf{x}) = -\sum_{i=1}^{C} p_i \log p_i, \quad (22)$$
where $p_i = p(i|\mathbf{x})$ is the softmax probability corresponding to the $i$-th class label.
Let $U(\mathbf{x})$ be the internal energy (i.e., the expected value of the energy function [oh2020entropy]), defined by:
$$U(\mathbf{x}) = \sum_{i=1}^{C} p_i E(\mathbf{x}, i) = -\sum_{i=1}^{C} p_i S_i(\mathbf{x}). \quad (23)$$
According to [oh2020entropy], the entropy can be expressed in terms of the internal and free energy functions as:
$$H(\mathbf{x}) = \frac{U(\mathbf{x}) - F(\mathbf{x}; S)}{T}, \quad (24)$$
where all logits are effectively shifted by the internal energy $U(\mathbf{x})$.
Substituting the free energy term from (5) yields:
$$H(\mathbf{x}) = \frac{U(\mathbf{x})}{T} + \log p(\mathbf{x}) + \log Z, \quad (25)$$
which shows that, due to the shift caused by the internal energy, the entropy score is not reliably aligned with the probability density $p(\mathbf{x})$. Thus, it is a less suitable routing mechanism for our Router, as opposed to the energy score.
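The identity (24) is easy to verify numerically: computing the entropy directly from (22) and via the internal/free-energy decomposition gives the same value. A small sketch with illustrative function names:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy_direct(logits, T=1.0):
    p = softmax(logits, T)
    return -np.sum(p * np.log(p))                    # Eq. (22)

def entropy_via_energies(logits, T=1.0):
    z = np.asarray(logits, dtype=float)
    p = softmax(z, T)
    u = -np.sum(p * z)                               # internal energy, Eq. (23)
    m = (z / T).max()
    f = -T * (m + np.log(np.exp(z / T - m).sum()))   # free energy, Eq. (3)
    return (u - f) / T                               # Eq. (24)
```

The agreement holds for any temperature, since $\log p_i = S_i/T + F/T$ makes (22) collapse algebraically to $(U - F)/T$.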
5.5 Imbalance in class distributions
In Section 2.4, it was mentioned that in many practical applications, training or testing datasets are imbalanced. For example, consider a cloud inference API that receives images as input, where most of the input images belong to a limited number of popular classes or categories. This motivated the specialized EBJR case. We studied the class distributions of the Caltech-256, OID, and MS-COCO datasets in Figures 10 and 11, and the statistics confirm our intuition.
5.6 More results on the specialized EBJR
Figure 12 shows the adaptive inference results for the specialized EBJR case. This figure shows the top1 classification accuracy of joint models when top=10 or 20 popular classes are used. For top=10, we choose the top10 class labels with the largest number of samples in the OID training set, and for testing, we randomly select a new set of size 1.7K from the OID validation set, where 70% of the data belong to the top10 class labels. For top=20, the size of the corresponding randomly selected validation set is 2K, where 75% of the samples belong to the top20 labels. It should be noted that the Teacher accuracies over the top10 and top20 validation sets are not the same because the validation sets are not identical (different data/label distributions).
It is observed from Figure 12 that top=20 results in a better overall performance, and can achieve the same accuracy as the Teacher but with 1.35× faster inference. Reducing the number of classes to 10 degrades the performance, to the point where almost no speedup is achieved over the Teacher. This suggests that limiting the majority of popular categories to a very small number of classes may hurt the performance.
5.7 More insights on inequality (15)
The free energy of the Student in (3) can be broken into the logarithm of two terms as:
(26)  $E(x; f) = -\log \sum_{i=1}^{C} e^{f_i(x)} = -\log\Big( e^{f_y(x)} + \sum_{i \neq y} e^{f_i(x)} \Big)$
where $y$ denotes the top-scoring class label and $i$ ranges over the remaining classes. Factoring out the $e^{f_y(x)}$ term from inside the logarithm yields:
(27)  $E(x; f) = -f_y(x) - \log\Big( 1 + \sum_{i \neq y} e^{f_i(x) - f_y(x)} \Big)$
By denoting the second term as $\Delta(x) = \log\big( 1 + \sum_{i \neq y} e^{f_i(x) - f_y(x)} \big)$, we will have:
(28)  $E(x; f) = -f_y(x) - \Delta(x)$
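The decomposition in (28) is exact and easy to verify numerically. In the NumPy sketch below (variable names are ours; $y$ is taken as the top-scoring class for illustration), the free energy equals $-f_y(x) - \Delta(x)$ to machine precision:

```python
import numpy as np

logits = np.array([3.0, 0.2, -1.1, 1.5])
y = int(logits.argmax())  # top-scoring class index

E = -np.log(np.exp(logits).sum())  # free energy E(x; f) as in (3), T = 1
# Delta(x) = log(1 + sum_{i != y} exp(f_i - f_y)) as in (28).
delta = np.log1p(np.exp(np.delete(logits, y) - logits[y]).sum())
# Decomposition (28): E(x; f) = -f_y(x) - Delta(x).
assert np.isclose(E, -logits[y] - delta)
```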
Table 4: Accuracy and FLOPs of the supervised Student (S-sup.), unsupervised Student (S-unsup.), and Teacher (T) on the Caltech256 and OID validation sets.

                |        Caltech256        |           OID
                | S-sup. | S-unsup. |  T   | S-sup. | S-unsup. |  T
Accuracy (%)    | 83.16  |  70.71   | 89.87|  75.0  |   74.1   | 86.64
FLOPs (x10^9)   |  5.4   |   5.4    | 34.0 |  5.4   |   5.4    | 34.0
Let $x_{in}$ be an in-distribution and $x_{out}$ be an out-of-distribution sample, where $p(x_{in})$ is high and $p(x_{out})$ is low. Based on (28), the inequality (15) can be reformulated as:
(29)  $E(x_{out}; f) - E(x_{in}; f) > \Delta(x_{in}) - \Delta(x_{out})$
where the following can be observed for the left side of this inequality:

- Since $x_{in}$ is an in-distribution sample (high likelihood) and $E(x_{in}; f) = -\log p(x_{in}) - \log Z$, the free energy $E(x_{in}; f)$ tends to be lower.

- Since $x_{out}$ is an out-of-distribution sample (low likelihood), the free energy $E(x_{out}; f)$ tends to be higher.
And for the right side:
- For $x_{in}$, the $y$-th logit value $f_y(x_{in})$ tends to increase (high likelihood), which makes $\Delta(x_{in})$ decrease.

- For $x_{out}$, the $y$-th logit value $f_y(x_{out})$ tends to decrease (low likelihood), which makes $\Delta(x_{out})$ increase.
The two terms on the left side tend to go in the opposite directions, thereby enlarging the energy difference. On the other hand, the two terms on the right side of (29) do not show a similar behaviour, and thus their gap does not necessarily increase.
5.8 Unsupervised EBJR
So far, it was assumed that already-trained Teacher and Student models are given, with which we created a joint inference model. This assumption may not always hold. Suppose we have a large model that is highly accurate, but also very slow, and no dataset with ground-truth labels is available to train a small and fast Student model. In this scenario, to obtain an efficient joint reasoning model, we can distill the Teacher into a small and fast Student architecture in a completely unsupervised manner. Unsupervised knowledge distillation is an emerging technique for leveraging the abundance of unlabeled data for label-free model training, and our framework is flexible enough to incorporate it organically.
The most straightforward application of unsupervised EBJR is for cloud services, which serve very large models for different machine learning tasks through cloud APIs. Such inference services can be replaced by our EBJR architecture, in which a side Student model is created for each large model. In this case, there is no need to retrain the large models or acquire data labels. By replacing the current large models behind the APIs with their joint reasoning equivalents, a speedup can be achieved without a considerable loss in accuracy. For the classification problem, as an example, the commonly used cross-entropy loss function for training the Student is given by:
(30)  $\mathcal{L}_{CE} = -\sum_{i=1}^{C} \hat{y}_i \log p_i(x)$
where the pseudo-labels generated by the Teacher model are utilized as the targets (denoted by $\hat{y}$) in the loss function.
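A minimal sketch of this unsupervised distillation step, using NumPy and two hypothetical linear classifiers as stand-ins for the Teacher and Student (the actual models in our experiments are ResNets; all names, shapes, and the learning rate here are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Row-wise numerically stable softmax.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical stand-ins: a fixed "Teacher" and a trainable "Student".
W_teacher = rng.normal(size=(8, 4))
W_student = np.zeros((8, 4))

x = rng.normal(size=(16, 8))                  # a batch of unlabeled inputs
pseudo = softmax(x @ W_teacher).argmax(-1)    # hard pseudo-labels from the Teacher
onehot = np.eye(4)[pseudo]                    # targets y-hat in (30)

p = softmax(x @ W_student)
loss = -(onehot * np.log(p + 1e-12)).sum(axis=-1).mean()  # cross-entropy (30)
grad = x.T @ (p - onehot) / len(x)            # gradient for a linear softmax classifier
W_student -= 0.1 * grad                       # one SGD step on the pseudo-labels
```

Repeating the last four lines over a stream of unlabeled batches trains the Student purely on Teacher pseudo-labels, which is the essence of the unsupervised setting described above.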
5.8.1 Experimental results  image classification
In this section, we study the performance of the unsupervised version of our method (see Section 5.8). To this end, we perform unsupervised distillation on the Student using a set of unlabeled examples, which are passed to the Teacher to obtain pseudo-labels. The Student is then trained purely with these pseudo-labels. In this experiment, we use a ResNet152 pretrained on the Caltech256 training set (22K examples in 256 classes) as the Teacher. The Student is a ResNet18 trained on a random 56K unlabeled subset of OID.
For testing, we evaluated our approach on two validation sets: Caltech256 (7.8K images) and a subset of the OID validation set (12K images). The accuracies and computational costs of the Student (supervised and unsupervised) and Teacher on both validation sets are reported in Table 4. Note that we study these two validation sets since they can both be valid measures depending on the target application. One represents the case when a user provides a large Teacher model with some validation data for which the joint model needs to attain a high accuracy. The other represents the case where a user provides a large Teacher model, and the joint model is supposed to work well for data that hit the cloud API, which are similar to the unlabeled data used to train the unsupervised joint model.
Figure 13 presents the adaptive inference results with the unsupervised EBJR, compared with the supervised case, on both validation sets. For the supervised EBJR, the Student is trained on Caltech256 (like the Teacher). As observed in Figure 13-left, the unsupervised EBJR does not perform as well as the supervised case, because the distributions of the training and testing sets are different (OID vs. Caltech256). However, when evaluated on the OID validation set, which follows the same distribution as the Student's training data, a better performance is achieved (Figure 13-right).
It is shown in [xie2020self, sohn2020simple, zoph2020rethinking] that pseudo-label self-training with large amounts of unlabeled data can achieve results even better than supervised models. In agreement with this observation, we will see later in this section that the performance of the unsupervised joint model tends to improve as larger amounts of unlabeled data are used, and in some cases may even surpass the performance of the supervised model (see Figure 14). That being said, the results in Figure 13 are excellent for the supervised case, and still very promising for the unsupervised case, as the latter does not use any labels for training the joint model.
5.8.2 Experimental results  object detection
We also analyze the performance of the unsupervised variant of EBJR on the task of object detection on the MSCOCO dataset, where we employ the EfficientDetD0 and EfficientDetD4 architectures [efficientdet] for the Student and Teacher, respectively. For the unsupervised EBJR, the OID training set is used as the unlabeled set.
Table 5: mAP and FLOPs of the Student (supervised, and unsupervised with different amounts of unlabeled training data) and the Teacher on MSCOCO.

                 |                 Student                 | Teacher
Mode             | Supervised |        Unsupervised        | Supervised
Train-set size   |    118K    |  160K  |  320K  |   1.7M   |   118K
mAP              |   0.359    | 0.329  | 0.350  |  0.373   |   0.514
FLOPs (x10^9)    |    2.54    |  2.54  |  2.54  |   2.54   |   55.2
Table 5 reports the performance of the Student model trained in the supervised and unsupervised settings, compared to the Teacher. For the unsupervised case, we tested different amounts of unlabeled data from OID. We observe that when sufficient unlabeled data (e.g., 1.7M in Table 5) are provided, the unsupervised Student can perform even better than the supervised one.
Moreover, Figure 14 shows the adaptive inference results for both supervised and unsupervised (with 1.7M samples from OID) cases compared to the EfficientDet models (D0, D1, D2, D3, and D4). Both supervised and unsupervised EBJR outperform the standard EfficientDet models.