EARLIN: Early Out-of-Distribution Detection for Resource-efficient Collaborative Inference

06/25/2021 ∙ by Sumaiya Tabassum Nimi, et al.

Collaborative inference enables resource-constrained edge devices to make inferences by uploading inputs (e.g., images) to a server (i.e., cloud) where the heavy deep learning models run. While this setup works cost-effectively for successful inferences, it severely underperforms when the model faces input samples on which the model was not trained (known as Out-of-Distribution (OOD) samples). If the edge devices could, at least, detect that an input sample is an OOD, that could potentially save communication and computation resources by not uploading those inputs to the server for inference workload. In this paper, we propose a novel lightweight OOD detection approach that mines important features from the shallow layers of a pretrained CNN model and detects an input sample as ID (In-Distribution) or OOD based on a distance function defined on the reduced feature space. Our technique (a) works on pretrained models without any retraining of those models, and (b) does not expose itself to any OOD dataset (all detection parameters are obtained from the ID training dataset). To this end, we develop EARLIN (EARLy OOD detection for Collaborative INference) that takes a pretrained model and partitions the model at the OOD detection layer and deploys the considerably small OOD part on an edge device and the rest on the cloud. By experimenting using real datasets and a prototype implementation, we show that our technique achieves better results than other approaches in terms of overall accuracy and cost when tested against popular OOD datasets on top of popular deep learning models pretrained on benchmark datasets.




1 Introduction

With the emergence of Artificial Intelligence (AI), applications and services that use deep learning models, especially Convolutional Neural Networks (CNNs), for intelligent tasks such as image classification have become prevalent. However, several issues arise when deploying deep learning models in real-life applications. First, since the models tend to be very large (hundreds of MB in many cases), they require substantial computation, memory, and storage to run, which makes it difficult to deploy them on end-user/edge devices. Second, these models usually predict with high confidence even for input samples that are supposed to be unknown to the models (called out-of-distribution (OOD) samples) [nguyen2015deep, szegedy2013intriguing]. Since both in-distribution (ID) and OOD input samples are likely to appear in real-life settings, OOD detection has emerged as a challenging research problem.

The first issue, the deployment of deep learning models on end/edge devices, has been studied well in the literature [laskaridis2020spinn]. One solution is to run collaborative inference, in which the end devices do not run the heavy model on board but instead offload the inference task by uploading the input to a nearby server (or to the cloud in appropriate cases) and obtain the inference/prediction results from there. Other recent works propose edge-cloud collaboration [gazzaz2019collaborative], model compression [schindler2018towards], or model splitting [kang2017neurosurgeon] for faster inference. The second issue, OOD detection, has received much attention in the deep learning research community [cardoso2017weightless, chalapathy2017robust, xie2019slsgd, deecke2018image, hendrycks2016baseline, liang2018enhancing]. We note several gaps in these works, particularly regarding their suitability for deployment in a collaborative inference setup. Firstly, most of these works detect an input as an OOD sample using the outputs from the last [liang2018enhancing, lee2020multi, mohseni2020self, hsu2020generalized] or penultimate [lee2018simple] layer of the deep learning classifier. We argue that detecting an input sample as OOD after so much computation has already been done by the model is inefficient. Secondly, most OOD detection approaches rely on fully retraining the original classifier model to enable OOD detection [mohseni2020self, lee2017training, lee2020multi], which is computationally very expensive. Thirdly, in most of these works [liang2018enhancing, lee2018simple, mohseni2020self], several hyperparameters for the detection task need to be tuned on a validation dataset of OOD samples. The fitted model is then tested on the same OOD datasets, thereby inducing bias towards those datasets. Finally, some OOD detectors require computationally expensive pre-processing of the input samples [liang2018enhancing, lee2018simple].

In this paper, we tackle the two issues discussed above jointly. We propose a novel OOD detection approach, particularly for Convolutional Neural Network (CNN) models, that detects an input sample as OOD early in the network pipeline, using the portion of the feature space obtained from the shallow layers of the pretrained model. It is documented that early layers in CNN models usually pick up salient features representing the overall input space, whereas the deeper layers progressively capture more discriminant features for classifying the input samples into their respective classes [matthew2014visualizing]. This suggests that the salient feature maps extracted from a designated early layer will be different for ID and OOD samples. This is the principal observation on which we build our OOD detector. However, the space spanned by the obtained feature maps is in most cases too big to allow any significant partitioning between ID and OOD samples. Hence we compress the high dimensional feature space by mining the most significant information out of the space. We apply a series of “feature selection” operations on the extracted feature maps, namely indexed-pooling and max-pooling, to reduce the large feature space to a manageable size. After the reduction, we construct a distance function defined on the reduced feature space so that the distance measure can differentiate ID and OOD samples.

For deployment in an edge-cloud collaboration setup, we partition the model around the selected layer to obtain a very small OOD detection model and deploy this lightweight model on an edge device. With that, the edge device can detect an incoming input sample as OOD and, if so, does not upload the sample to the server/cloud (thus saving communication and computation resources). To this end, we develop EARLIN (EARLy OOD Detection for Collaborative INference) based on our proposed OOD detection technique. We evaluate EARLIN on a set of popular CNN models for image classification, namely DenseNet, ResNet34, ResNet44, and VGG16 models pretrained on the benchmark image datasets CIFAR-10 and CIFAR-100 [cifar]. We also compare our OOD detection algorithm with other state-of-the-art OOD detection techniques discussed in the literature. Furthermore, we design and develop an OOD-aware collaborative inference system and show that this setup results in faster and more precise inference on edge devices. To the best of our knowledge, ours is the first work to propose such an OOD-aware collaborative inference framework. In addition, we define a novel performance metric, the joint accuracy of a model combined with its detector, to quantify the performance of the model and detector combination, and formally characterize EARLIN’s performance and cost using that metric. We summarize our contributions as follows:

  • We propose a novel OOD detection approach called EARLIN that enables detection of OOD samples early in the computation pipeline, with minimal computation.

  • Our technique does not require retraining the neural network classifier and thus can be implemented as an external module on top of available pretrained classifiers.

  • We do not use samples from the unknown set of OOD data for tuning hyperparameters, thereby reducing bias towards any subset of the unknown set of OOD samples.

  • We propose a novel OOD-aware edge-cloud collaborative setup based on our proposed detector for precise and resource-efficient inference at edge devices, along with characterizations of its performance and cost.

2 Related Work

Deep learning based methods have achieved huge success in recognition tasks in recent years, but they have their limitations. The problem of reporting high confidence for all input samples, even those outside the domain of the training data, is inherent in the general construct of the popular deep learning models. In order to deploy deep learning models in real-life applications, this issue should be mitigated. Hence, in recent years, a large number of research works have been conducted in this direction. In [hendrycks2016baseline], the confidence of the deep learning classifier, in the form of the output softmax probability for the predicted class, was used to differentiate between ID and OOD samples. Later, in [liang2018enhancing, lee2018simple], OOD detection approaches were proposed that worked without making any change to the original trained deep learning models. We note several limitations in these works. Firstly, samples are detected as OOD at the very last layer of the classifier, thereby wasting computational resources on input samples that are eventually going to be identified as unsuitable for classification. Secondly, the hyperparameters were tuned while being exposed to a subset of the OOD samples that the approach was tested on, inducing bias towards those samples. Also, due to this exposure, it cannot be guaranteed that the approach will be as successful on a completely different set of OOD data. Thirdly, these approaches required computationally heavy preprocessing of the input samples. The preprocessing involved two forward passes and one backward pass over the classifier model, rendering the approach unsuitable for real-time deployment. In the more recent literature [lee2020multi, mohseni2020self, hsu2020generalized], OOD samples were likewise not detected on top of readily available deep learning models pretrained with the traditional cross entropy loss; instead, retraining was required. A better approach was proposed in [yu2020convolutional], which did not require retraining the classifier. Although this approach achieved good performance on OOD detection, we note that the best reported results were obtained when feature maps from the deeper layers were used.

Approaches compared: Baseline [hendrycks2016baseline], ODIN [liang2018enhancing], Mahalanobis [lee2018simple], MALCOM [yu2020convolutional], DeConf [hsu2020generalized], EARLIN.
Criteria: Without Retraining? Before Last Layer? Use One Layer Output? Without OOD Exposed? Without Input Preprocessing?

Table 1: Comparison of approaches.

3 Proposed OOD Detection Approach: EARLIN

Figure 1: Framework of the training and inference using EARLIN.

We propose an OOD detection approach, called EARLIN (EARLy OOD detection for Collaborative INference), that enables OOD detection from the shallow layers of a neural network classifier, without retraining the classifier and without exposure to any OOD sample during training. EARLIN infers a test input sample to be ID or OOD as follows. It first feeds the input to the classification CNN model, computes up to a designated shallow layer of the model, and extracts the feature maps from that layer. The output of this intermediate layer is a stack of 2D feature maps, out of which a small subset is selected. The selected maps are the ones that supposedly contain the most information among all of the maps. This process is called indexed-pooling, the parameters of which (i.e., the positions of the 2D maps to be selected) are determined from the training ID dataset during the training phase. We then apply max-pooling to downsample the feature space even further. With that, we obtain a vector representation of the original input in some high dimensional feature space. During training, we do this for a large number of samples drawn from the training ID dataset, aggregate them in a single cluster, and find the centroid of the ID space. Consequently, we define a distance function from the ID samples to the centroid such that, at a certain level of confidence, it can be asserted that a sample is ID if its distance is less than a threshold. Since the threshold is a measure of distance, its value is expected to be low for ID samples. During inference, we use this threshold to differentiate between ID and OOD samples. The framework of our proposed OOD detection approach is shown in Figure 1.

Feature Selection and Downsampling: Indexed-Pooling and Max-Pooling: We first select a subset of 2D feature maps from a designated shallow intermediate layer of the pretrained neural network classifier, based on a quantification of the amount of information each 2D feature map contains. We denote the chosen layer by $\ell$. From this layer, we choose the $\kappa$ most informative feature maps. We know from previous studies that feature maps at the shallow layers of the classifier capture useful properties of input images [matthew2014visualizing], but the space spanned by the feature maps is too big to capture the properties inherent to the ID images. Hence we reduce the feature space by selecting a subset of the features. Visual observation reveals that some of the maps, the ones for which we obtain almost monochromatic plots, do not carry significant information about the input image, whereas other maps capture useful salient features from the image. We therefore consider the variance of a 2D feature map as the quantification of the amount of information contained in the map. Suppose that at layer $\ell$ the feature map $\mathcal{M}^{\ell}(x) \in \mathbb{R}^{N \times d \times d}$ is obtained. So we have a total of $N$ 2D maps (also known as channels), each with a dimension of $d \times d$. Our goal is to select the $\kappa$ most informative 2D feature maps out of these $N$ maps. To find the most informative feature maps under this assumption, we choose a subset of the ID training data, $X^{\mathrm{ID}}_{\mathrm{train}}$. Using each data sample $x$ from $X^{\mathrm{ID}}_{\mathrm{train}}$, we calculate the feature map $\mathcal{M}^{\ell}(x)$, with shape $N \times d \times d$, from layer $\ell$ and finally obtain a collection of feature maps with shape $S \times N \times d \times d$ for $S = |X^{\mathrm{ID}}_{\mathrm{train}}|$. We define the information contained in each feature map $j$, denoted as $\Phi^{\ell}_{j}$, as the summation of the variance (aggregate variance) of the 2D maps obtained from all input samples in the training ID dataset ($X^{\mathrm{ID}}_{\mathrm{train}}$). This collectively measures how important map $j$ in layer $\ell$ is with respect to the entire ID population. Formally, we compute:

$$\Phi^{\ell}_{j} = \sum_{x \in X^{\mathrm{ID}}_{\mathrm{train}}} \mathrm{Var}\big(\mathcal{M}^{\ell}_{j}(x)\big), \qquad j = 1, \dots, N \tag{1}$$

where $\mathcal{M}^{\ell}(x)$ denotes the output of layer $\ell$ of model $\mathcal{M}$, a tensor of size $N \times d \times d$, and $\mathcal{M}^{\ell}_{j}(x)$ denotes the $j$-th 2D map in that layer, of size $d \times d$. Once the $\Phi^{\ell}_{j}$ values are obtained for all maps, we find the order statistics of the values (sort them in descending order) as such:

$$\Phi^{\ell}_{(1)} \ge \Phi^{\ell}_{(2)} \ge \dots \ge \Phi^{\ell}_{(N)}$$

We then find the indices of the top $\kappa$ channels that have the largest aggregate variance across the ID training dataset and populate a binary index vector $\beta \in \{0, 1\}^{N}$ to denote whether a certain map from that layer is selected or not. More precisely,

$$\beta_{j} = \begin{cases} 1, & \text{if } \Phi^{\ell}_{j} \ge \Phi^{\ell}_{(\kappa)} \\ 0, & \text{otherwise} \end{cases} \qquad j = 1, \dots, N$$

where $\Phi^{\ell}_{j}$ is the aggregate variance of map $j$ and $\Phi^{\ell}_{(\kappa)}$ is the $\kappa$-th largest of these values. Obviously, $\sum_{j=1}^{N} \beta_{j} = \kappa$. Given this binary index vector, $\beta$, and the layer output $\mathcal{M}^{\ell}(x)$ for an input sample x, the indexed-pooling operation takes out only those feature maps (channels) specified by the index vector, thus effectively reducing the feature space dimension from $N \times d \times d$ to $\kappa \times d \times d$. Consequently, we define the indexed-pooling operator $\Pi_{\beta}$ as follows:

$$\Pi_{\beta}\big(\mathcal{M}^{\ell}(x)\big) = \big\Vert_{j \,:\, \beta_{j} = 1} \mathcal{M}^{\ell}_{j}(x)$$

where $\Vert$ indicates concatenation. We note that the feature space spanned by the chosen feature maps from layer $\ell$ is still too large to capture useful information. Hence, we downsample the space by max-pooling. Max-pooling is an operation that is traditionally done within deep learning model architectures for downsampling the feature space, so that only the most relevant information out of a group of neighboring values is retained. We follow the same practice for downsampling our feature space. The max-pooling operator is denoted as $\Omega_{k}$, where $k$ is the pooling window size; we usually use a fixed value of $k$. Let $\psi^{\ell}(x)$ be the vector representation for an input x obtained from layer $\ell$, constructed by applying the two pooling operators, indexed-pooling and max-pooling, on the extracted feature maps. Using the two pooling operators $\Pi_{\beta}$ and $\Omega_{k}$, the construction of $\psi^{\ell}(x)$ for an input x can be written as:

$$\psi^{\ell}(x) = \big(\Omega_{k} \circ \Pi_{\beta}\big)\big(\mathcal{M}^{\ell}(x)\big)$$

where $\circ$ means function composition. We show in Figure 2 the segregation between ID and OOD samples obtained in the feature space after executing the above feature selection operations.
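As a rough illustration of the two pooling steps described above, the following NumPy sketch computes the aggregate variance of each channel, builds the binary index vector, and applies indexed-pooling followed by max-pooling. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `(S, N, d, d)` array layout, and the choice of non-overlapping pooling windows of size `k` are all introduced here for illustration only.

```python
import numpy as np


def aggregate_variance(feature_maps: np.ndarray) -> np.ndarray:
    """Sum, over all samples, of the spatial variance of each 2D map.

    feature_maps has the assumed layout (S, N, d, d): S ID training samples,
    N channels, and d x d spatial positions. Returns an array of shape (N,).
    """
    per_sample_variance = feature_maps.var(axis=(2, 3))  # shape (S, N)
    return per_sample_variance.sum(axis=0)               # shape (N,)


def select_top_maps(scores: np.ndarray, kappa: int) -> np.ndarray:
    """Boolean index vector marking the kappa channels with the largest score."""
    index_vector = np.zeros(scores.shape[0], dtype=bool)
    top = np.argsort(scores)[::-1][:kappa]  # channel indices by descending score
    index_vector[top] = True
    return index_vector


def indexed_pool(maps: np.ndarray, index_vector: np.ndarray) -> np.ndarray:
    """Keep only the selected channels of one sample: (N, d, d) -> (kappa, d, d)."""
    return maps[index_vector]


def max_pool(maps: np.ndarray, k: int) -> np.ndarray:
    """Non-overlapping k x k max-pooling (assumed variant).

    Trailing rows/columns that do not fill a complete window are dropped.
    """
    n, height, width = maps.shape
    out_h, out_w = height // k, width // k
    trimmed = maps[:, : out_h * k, : out_w * k]
    windows = trimmed.reshape(n, out_h, k, out_w, k)
    return windows.max(axis=(2, 4))


def pooled_vector(maps: np.ndarray, index_vector: np.ndarray, k: int) -> np.ndarray:
    """Indexed-pooling followed by max-pooling, flattened to a 1D vector."""
    return max_pool(indexed_pool(maps, index_vector), k).ravel()
```

In this sketch the index vector is computed once from the ID training subset and then reused unchanged for every input at inference time, which mirrors the description in the text.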

Figure 2: Effect of the proposed Feature Pooling strategy on differentiating between ID and OOD (LSUN) in feature space of ResNet model, visualized in 2D using PCA. (a) Original feature space and (b) Pooled feature space obtained for ID dataset CIFAR-10, (c) Original feature space and (d) Pooled feature space obtained for ID dataset CIFAR-100. The black dots represent the center of the feature space.

Feature Aggregation: We compute $\psi^{\ell}(x)$ for all inputs $x \in X^{\mathrm{ID}}_{\mathrm{train}}$ to represent the entire ID space as $\Psi^{\ell} = \{\psi^{\ell}(x) : x \in X^{\mathrm{ID}}_{\mathrm{train}}\}$. We then find a set of aggregated information for the entire ID space. For that, we first find the aggregated cluster centroid, denoted by c, of the feature space. The centroid is defined as follows:

$$c = \mathrm{MEAN}\big(\Psi^{\ell}\big)$$

where MEAN computes the element-wise mean of the collection of vectors in $\Psi^{\ell}$. The centroid, c, designates a center position of the ID space around which all ID samples position themselves in close proximity. Accordingly, the distance between the center and $\psi^{\ell}(x)$ for any ID sample x should follow a low-variance distribution. This distance is denoted as $\delta^{\ell}(x)$, which is defined as the Euclidean distance between $\psi^{\ell}(x)$ and c:

$$\delta^{\ell}(x) = \big\lVert \psi^{\ell}(x) - c \big\rVert_{2}$$

We hypothesize that, since the centroid is a pre-determined value calculated using features of ID samples, the distance $\delta^{\ell}(x)$ will take a smaller value for an ID sample than for an OOD sample. We observe from Figure 2 that the feature space $\Psi^{\ell}$, when visualized in 2D, validates our hypothesis. This requires us to find a suitable threshold on the distance, based on which ID and OOD samples can be separated. This is what we do next. We empirically find a threshold value, $\theta$, that detects ID samples with some confidence $p$, such that a fraction $p$ of the ID training samples have $\delta^{\ell}(x) < \theta$. That is:

$$\frac{\big|\{x \in X^{\mathrm{ID}}_{\mathrm{train}} : \delta^{\ell}(x) < \theta\}\big|}{\big|X^{\mathrm{ID}}_{\mathrm{train}}\big|} = p$$

We usually set $p = 0.95$. This is the TPR (True Positive Rate) that we expect of the OOD detector (the detector’s capability to detect a true ID sample as ID). We note that tuning the value of this hyperparameter does not require exposure to any a priori known OOD samples.

OOD Detection during Inference: During inference, for an incoming input sample x, we first pass the input into the model up to layer $\ell$ and extract the intermediate output from that layer. We then choose the best $\kappa$ maps from that intermediate output (specified by the binary index vector $\beta$). Then we apply max-pooling on that space to find $\psi^{\ell}(x)$. After that, we find the distance of $\psi^{\ell}(x)$ from the centroid, c, as $\delta^{\ell}(x)$ and compare this value with the predefined threshold $\theta$ to detect whether the sample is ID or OOD. Let $\mathcal{D}(x)$ denote the detector output for input x, which can be obtained as:

$$\mathcal{D}(x) = \begin{cases} 1, & \text{if } \delta^{\ell}(x) < \theta \\ 0, & \text{otherwise} \end{cases}$$

So the inference is very fast, and since our chosen layer is very shallow (unlike [hendrycks2016baseline, liang2018enhancing, lee2018simple], which detect OOD samples at the last or penultimate layer), we can reject OOD samples well before a large amount of unnecessary computation is done on them. Hence our OOD detection approach gives higher throughput during batch inference. Besides, we do not retrain the classifier model, unlike [hsu2020generalized]. Also, we detect using features collected from a single layer only, unlike [yu2020convolutional, lee2018simple], without preprocessing input samples, unlike [hsu2020generalized, liang2018enhancing, lee2018simple], and without using OOD samples for validation, unlike [lee2018simple, liang2018enhancing]. The comparison of EARLIN with other approaches is summarized in Table 1.
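The aggregation and detection steps described above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the pooled feature vectors of the ID training samples are already available as rows of a 2D array; the function names `fit_detector` and `detect` are hypothetical, and using an empirical quantile to pick the threshold is one plausible reading of "a threshold such that a given fraction of ID training samples fall below it", not necessarily the authors' exact procedure.

```python
import numpy as np


def fit_detector(id_vectors: np.ndarray, p: float = 0.95):
    """Estimate the centroid and the distance threshold from ID training vectors.

    id_vectors has shape (S, D): one pooled feature vector per ID training sample.
    No OOD samples and no class labels are needed here.
    """
    centroid = id_vectors.mean(axis=0)                         # element-wise mean
    distances = np.linalg.norm(id_vectors - centroid, axis=1)  # Euclidean distances
    threshold = float(np.quantile(distances, p))               # empirical p-quantile
    return centroid, threshold


def detect(vector: np.ndarray, centroid: np.ndarray, threshold: float) -> int:
    """Return 1 if the sample is detected as ID (distance below threshold), else 0."""
    distance = float(np.linalg.norm(vector - centroid))
    return int(distance < threshold)
```

With a finite training set, the fraction of ID training samples strictly below the quantile-based threshold is only approximately `p`, so the observed TPR on held-out ID data may differ slightly from the target.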

4 Collaborative Inference based on EARLIN

Figure 3: Collaborative Inference Scheme.

Based on our proposed OOD detection technique, we develop a setup for collaborative inference as a collaboration between an edge device and a server (this server can be in the cloud or can be a nearby edge resource, such as a Cloudlet [verbelen2012cloudlets] or Mobile Edge Cloud (MEC) [liu2017mobile]; we generically refer to it as “server”). Deep learning models usually have large memory and storage requirements, and hence are difficult to deploy in the constrained environment of edge devices. Thus, edge devices make remote calls to the server for inference. If the incoming image is Out-of-Distribution, making such a call is useless since the model would not be able to classify the image. Hence, we both save resources and make recognition more precise by not making the call when the input image is OOD. Since our detection model, consisting of the first few layers of the network architecture, is very lightweight, we deploy the detection pipeline on the edge device. Then, if the image is detected as ID, we send the image to the server for classification. Otherwise, we report that the image is OOD and hence not classifiable by the model. We thus save resources by not sending OOD images to the server. The schematic diagram of the framework appears in Figure 3. We note that, when a sample is detected as ID, we send the original image to the server instead of the intermediate layer output. This is because the outputs of the shallower layers of deep learning models are often significantly higher in dimension than the original images, so sending the image saves considerable upload bandwidth. As the servers are usually high-end machines, repeating the same computation up to the detection layer $\ell$ adds very nominal overhead compared to the volume of data that would otherwise be uploaded. Moreover, $\ell$ is a very shallow layer (within the first 10% of layers from the input), as reported in Table 2. We also note that all detector parameters are estimated/trained in the cloud using the base model and the training datasets, and the resultant detector model is then deployed on the edge device.
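The control flow of this edge-side gating can be summarized by the short sketch below. It is only an illustration of the decision logic under simplifying assumptions: `detect_on_edge` and `classify_on_server` are hypothetical callables standing in for the on-device detector and the remote classification call, and networking, batching, and error handling are left out.

```python
from typing import Any, Callable, Optional, Tuple


def collaborative_infer(
    sample: Any,
    detect_on_edge: Callable[[Any], int],
    classify_on_server: Callable[[Any], int],
) -> Tuple[bool, Optional[int]]:
    """Run the lightweight detector first and contact the server only for ID inputs.

    Returns (is_id, label). For inputs flagged as OOD the label is None and
    the server is never called, so no upload or server-side compute is spent.
    """
    if detect_on_edge(sample) == 0:
        # Detected as OOD: report it locally and skip the remote call.
        return False, None
    # Detected as ID: upload the original input and return the server's label.
    return True, classify_on_server(sample)
```

Passing the two stages in as callables keeps the gating logic independent of any particular model or transport, which makes it easy to exercise with stubs.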

Overall Accuracy of OOD Detector and Deep Learning Classifier: Traditionally, the performance of deep learning classification models is reported in terms of accuracy. Accuracy is defined as the ratio or percentage of samples classified correctly by the model, given that every sample comes from the In-Distribution (ID). Let us write $x \in \mathrm{ID}$ to indicate that an input x truly belongs to ID and $x \in \mathrm{OOD}$ to denote that x truly belongs to OOD (ideally, $x \in \mathrm{OOD}$ is logically equivalent to $x \notin \mathrm{ID}$). In terms of probability, the accuracy (written as acc in short) of a model, $\mathcal{M}$, can be written as:

$$\mathrm{acc} = P\big(\mathcal{M}(x) = y_{x} \mid x \in \mathrm{ID}\big)$$

where $\mathcal{M}(x)$ represents the classification output of model $\mathcal{M}$, $y_{x}$ represents the true class label of input x, and $P(E)$ denotes the probability of event $E$. On the other hand, the performance of an OOD detector can be expressed in terms of two metrics: True Positive Rate (TPR) and True Negative Rate (TNR). TPR is the ratio of ID samples correctly detected as ID by the detector, whereas TNR is the ratio of true OOD samples detected as OOD. Let $\mathcal{D}$ denote a detector and $\mathcal{D}(x)$ denote the binary output of the detector indicating whether input x is detected as ID or OOD ($\mathcal{D}(x) = 1$ if detected as ID, else 0). Consequently, the TPR and TNR values of a detector, $\mathcal{D}$, can be expressed as:

$$\mathrm{TPR} = P\big(\mathcal{D}(x) = 1 \mid x \in \mathrm{ID}\big), \qquad \mathrm{TNR} = P\big(\mathcal{D}(x) = 0 \mid x \in \mathrm{OOD}\big)$$

While the capability of the classification model ($\mathcal{M}$) and the OOD detector ($\mathcal{D}$) can be expressed individually in terms of their respective performance metrics (that is, model classification accuracy, TPR, and TNR), it is interesting to note how these three terms play a role in measuring the accuracy of the model and the detector combined. We refer to this as the joint accuracy or overall accuracy. We define the Overall Accuracy as the success rate of assigning correct class labels to test inputs. That is, for an ID sample, this corresponds to assigning the correct class label to the input, whereas for an OOD sample, this corresponds to detecting it as OOD (OOD samples do not have any correct class label other than being flagged as OOD). Let us use $\mathcal{M}_{\mathcal{D}}$ to denote the classification model and detector combined; we are interested in determining the accuracy of $\mathcal{M}_{\mathcal{D}}$ as a function of its constituents. We observe that, in addition to the above three metrics, the overall accuracy of the model and OOD detector combined depends on what fraction of the inputs passed to the model are actually OOD as opposed to ID. Let this ratio be denoted as $\rho$. Formally,

$$\rho = P\big(x \in \mathrm{OOD}\big)$$

More specifically, given the accuracy acc of model $\mathcal{M}$ and the TPR and TNR values of the associated OOD detector, $\mathcal{D}$, the overall accuracy of $\mathcal{M}_{\mathcal{D}}$ is given by:

$$\mathrm{acc}\big(\mathcal{M}_{\mathcal{D}}\big) = (1 - \rho) \cdot \mathrm{TPR} \cdot \mathrm{acc} + \rho \cdot \mathrm{TNR} \tag{10}$$

The proof of the above equation is based on the fact that a correct output occurs when either of two mutually exclusive events happens with respect to an input sample: (a) the input sample truly belongs to ID, the detector also detects it as ID, and the model correctly classifies it, or (b) the input sample belongs to OOD and the detector detects it as OOD (details appear in the supplementary document).

Performance and Cost Characteristics of the Collaborative Setup: As per Eq (10), the overall accuracy depends on four quantities: the accuracy of the original model, the TPR and TNR of the detector, and $\rho$ (the fraction of samples being OOD in the inference workload). Without any detector in place (when TPR becomes 1 and TNR is 0), the overall accuracy of the model, $(1 - \rho) \cdot \mathrm{acc}$, sharply declines with $\rho$. With the detector combined, the overall accuracy of the model changes at a rate of $\mathrm{TNR} - \mathrm{TPR} \cdot \mathrm{acc}$ with respect to $\rho$ (the accuracy grows only when this slope is positive, that is, when $\mathrm{TNR} > \mathrm{TPR} \cdot \mathrm{acc}$). In Section 6, we demonstrate this.
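To make the two-event argument concrete, the sketch below evaluates the overall accuracy as the sum of the probability of event (a), an ID input that is detected as ID and classified correctly, and event (b), an OOD input that is detected as OOD. It is a minimal sketch assuming that the classifier's accuracy on ID inputs does not depend on the detector's decision; the function and argument names are illustrative, not taken from the paper.

```python
def overall_accuracy(acc: float, tpr: float, tnr: float, rho: float) -> float:
    """Overall accuracy of classifier plus detector for an OOD fraction rho.

    Event (a): input is ID (prob. 1 - rho), detected as ID (tpr), classified correctly (acc).
    Event (b): input is OOD (prob. rho) and detected as OOD (tnr).
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be a fraction between 0 and 1")
    return (1.0 - rho) * tpr * acc + rho * tnr


def accuracy_slope(acc: float, tpr: float, tnr: float) -> float:
    """Rate of change of the overall accuracy with respect to rho."""
    return tnr - tpr * acc
```

Setting `tpr=1.0` and `tnr=0.0` reproduces the no-detector case, in which the overall accuracy reduces to `(1 - rho) * acc` and falls linearly as the share of OOD inputs grows.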

Figure 4: (a) Histogram of the distance to the centroid, $\delta^{\ell}(x)$, that differentiates between ID and OOD samples (drawn from benchmark datasets) for DenseNet pretrained on CIFAR-10. (b) CDF of $\delta^{\ell}(x)$ that differentiates between ID and OOD samples, with CIFAR-10 (C10) and CIFAR-100 (C100) as ID and the TinyImageNet (TIN), LSUN, and iSUN datasets as OOD.

In EARLIN, as shown in Figure 3, we send inputs to the server only when they are detected as ID by the lightweight OOD detector deployed at the edge. Let $t_{e}$ be the time required for OOD detection at the edge, $t_{c}$ be the round-trip communication delay between the edge and the server, and $t_{s}$ be the time required for classifying the image at the server when sent. When we encounter a sample that is detected as OOD (when $\mathcal{D}(x) = 0$), the time required is only $t_{e}$ (no communication to the server and no processing at the server). On the other hand, when an incoming sample is detected as ID (when $\mathcal{D}(x) = 1$), the inference latency becomes $t_{e} + t_{c} + t_{s}$. So, the time required for inference is closely associated with the ratio of OOD samples, $\rho$, and the precision with which the detector detects input samples as ID vs OOD. We can characterize the cost, in terms of expected latency, involved with each inference using these quantities as follows:

$$C = t_{e} + \big[(1 - \rho) \cdot \mathrm{TPR} + \rho \cdot (1 - \mathrm{TNR})\big] \cdot (t_{c} + t_{s})$$

Similar to our performance indicator, the overall accuracy, the cost characteristics of the setup, $C$, can also be approximated as a linear function of $\rho$ (the OOD ratio). In general, the end-to-end inference latency declines as $\rho$ grows, since OOD samples are intercepted by the OOD detector at the edge, thus reducing inference latency and saving communication resources. In particular, the inference latency declines at a rate of $(\mathrm{TPR} + \mathrm{TNR} - 1) \cdot (t_{c} + t_{s})$ with respect to $\rho$. A more detailed performance characterization can be found in the supplementary section.
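The latency argument above can be checked numerically with a small helper. This is a sketch under assumptions: the expected cost is modeled as the edge detection time plus the communication and server time weighted by the probability that an input is forwarded, and the names `t_edge`, `t_comm`, and `t_server` are introduced here for illustration.

```python
def expected_latency(
    t_edge: float,
    t_comm: float,
    t_server: float,
    tpr: float,
    tnr: float,
    rho: float,
) -> float:
    """Expected per-input latency when only inputs detected as ID are uploaded.

    An input is forwarded if it is ID and detected as ID, or if it is OOD
    and wrongly detected as ID.
    """
    p_forwarded = (1.0 - rho) * tpr + rho * (1.0 - tnr)
    return t_edge + p_forwarded * (t_comm + t_server)
```

Under this model the latency falls as the OOD fraction grows whenever `tpr + tnr > 1`, since each additional OOD input is more likely to be stopped at the edge than an ID input is.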

5 Experimental Evaluation of EARLIN

In this section, we show how our proposed OOD detector, EARLIN, performs on standard pretrained models and benchmark datasets compared to previously proposed approaches for OOD detection.

Evaluation Metrics of OOD Detection: TNR and FPR at 95% TPR: TNR is the rate of detecting an OOD sample as OOD. Hence, $\mathrm{TNR} = \mathrm{TN} / (\mathrm{TN} + \mathrm{FP})$, where FP is the number of OOD samples detected as ID and TN is the number of OOD samples detected as OOD. We report TNR values obtained when $\mathrm{TPR} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$, TP being the number of ID samples detected as ID and FN the number of ID samples detected as OOD, is as high as 95%. FPR is defined as (1 − TNR).

Model     # of Layers  Chosen Layer  Size of Detector Model (in KB)  Training Dataset
ResNet    34           BN ()         112                             CIFAR-10
ResNet    34           BN ()         55                              CIFAR-100
DenseNet  100          BN ()         256                             CIFAR-10
DenseNet  100          BN ()         256                             CIFAR-100

Table 2: Chosen Layer and Size of corresponding OOD detection models in pretrained Models

Detection Accuracy and Detection Error: Detection Accuracy depicts the overall accuracy of detection and is calculated using the formula $0.5 \times (\mathrm{TPR} + \mathrm{TNR})$, assuming that ID and OOD samples are equally likely to be encountered by the classifier during inference. Detection Error is (1 − Detection Accuracy).

AUROC: This is the area under the Receiver Operating Characteristic (ROC) curve, which plots TPR against FPR as the detection threshold varies.

Results: We conduct experiments on DenseNet with 100 layers (growth rate = 12) and ResNet with 34 layers, pretrained on the CIFAR-10 and CIFAR-100 datasets. Each of the ID datasets contains 50,000 training images and 10,000 test images. A summary of the pretrained models used, in terms of their total number of layers, the chosen layer for OOD detection, the size of the detector model, the ID dataset on which the model was trained, and the classification accuracy of the corresponding model, is shown in Table 2.
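The evaluation metrics defined in this section (TNR at a target TPR, Detection Accuracy, and AUROC) can all be computed from the centroid distances of test ID and test OOD samples. The following NumPy sketch shows one way to do so; it is an illustrative assumption rather than the evaluation code used for the reported results. In particular, the threshold is taken as an empirical quantile of the ID distances, and AUROC is computed by exhaustive pairwise comparison, which is simple but quadratic in the number of samples.

```python
import numpy as np


def tpr_tnr_at_target(id_dist, ood_dist, target_tpr: float = 0.95):
    """Pick the threshold from ID distances, then measure TPR and TNR.

    A sample is detected as ID when its distance is below the threshold.
    """
    id_dist = np.asarray(id_dist, dtype=float)
    ood_dist = np.asarray(ood_dist, dtype=float)
    threshold = np.quantile(id_dist, target_tpr)
    tpr = float(np.mean(id_dist < threshold))    # ID samples detected as ID
    tnr = float(np.mean(ood_dist >= threshold))  # OOD samples detected as OOD
    return tpr, tnr


def detection_accuracy(tpr: float, tnr: float) -> float:
    """Accuracy of detection when ID and OOD inputs are equally likely."""
    return 0.5 * (tpr + tnr)


def auroc(id_dist, ood_dist) -> float:
    """Probability that a random OOD sample lies farther from the centroid
    than a random ID sample, counting ties as one half."""
    id_dist = np.asarray(id_dist, dtype=float)
    ood_dist = np.asarray(ood_dist, dtype=float)
    greater = np.mean(ood_dist[:, None] > id_dist[None, :])
    ties = np.mean(ood_dist[:, None] == id_dist[None, :])
    return float(greater + 0.5 * ties)
```

On small test sets the measured TPR can fall noticeably below the target because the quantile is interpolated between a few observed distances; with thousands of ID test samples the gap becomes negligible.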

ID Dataset Model     OOD           TNR at 95% TPR                      Detection Accuracy                  AUROC
                                   MLCM   BASE   ODIN   MAHA   EARLIN  MLCM   BASE   ODIN   MAHA   EARLIN  BASE   ODIN   MAHA   MLCM   EARLIN
CIFAR-10   Densenet  TinyImagenet  95.50  81.20  87.59  93.61  97.50   95.33  88.10  92.34  94.38  96.25   94.10  97.69  98.29  99.06  99.14
CIFAR-10   Densenet  LSUN          96.78  85.40  94.53  96.21  99.30   96.07  90.20  94.91  95.78  97.15   95.50  98.85  98.91  99.23  99.85
CIFAR-10   Densenet  iSUN          95.59  83.30  91.81  93.21  97.60   95.41  89.15  93.82  94.17  96.30   94.80  98.40  97.98  99.04  99.37
CIFAR-10   Resnet34  TinyImagenet  98.10  71.60  70.39  97.53  93.92   96.92  83.30  85.80  96.55  94.46   91.00  91.88  99.43  99.56  97.54
CIFAR-10   Resnet34  LSUN          99.04  71.70  81.94  98.83  98.00   97.65  83.35  90.01  97.58  96.5    91.10  95.55  99.64  99.70  99.55
CIFAR-10   Resnet34  iSUN          98.25  71.90  77.89  97.64  95.19   96.94  83.45  88.4   96.66  95.09   91.00  94.26  99.47  99.59  98.74
CIFAR-100  Densenet  TinyImagenet  87.12  47.90  53.88  80.37  92.60   91.65  61.45  81.32  88.40  93.80   71.60  89.16  93.64  97.21  98.04
CIFAR-100  Densenet  LSUN          90.46  49.70  60.77  85.74  98.10   92.87  62.35  84.51  90.85  96.55   70.80  92.06  95.82  97.61  99.96
CIFAR-100  Densenet  iSUN          88.29  47.30  54.85  81.78  94.00   92.04  61.15  82.51  89.30  94.50   69.60  90.29  94.81  97.34  98.61
CIFAR-100  Resnet34  TinyImagenet  92.88  31.00  64.48  91.76  95.40   94.10  58.00  85.77  93.56  95.20   67.10  93.06  98.28  98.54  98.55
CIFAR-100  Resnet34  LSUN          94.76  35.30  64.95  95.31  99.20   94.92  55.15  86.09  95.22  97.10   65.60  93.39  98.81  98.71  99.74
CIFAR-100  Resnet34  iSUN          92.36  36.70  63.03  91.98  97.10   93.81  55.85  85.33  93.76  96.05   65.60  92.76  98.27  98.24  99.22

Table 3: OOD detection performance on different datasets and pretrained models. Here MLCM stands for MALCOM [yu2020convolutional], BASE for Baseline [hendrycks2016baseline], ODIN for ODIN [liang2018enhancing] and MAHA for Mahalanobis [lee2018simple]. bold indicates best result.
Figure 5: TNR at 95% TPR for different combinations of feature selection. (a) DenseNet and (b) ResNet34 pretrained on CIFAR-10, (c) DenseNet and (d) ResNet34 pretrained on CIFAR-100

In Table 3, we show the TNR (at 95% TPR), Detection Accuracy, and AUROC of our approach. We compare our results with those obtained using previously proposed approaches, Baseline [hendrycks2016baseline], ODIN [liang2018enhancing], Mahalanobis Detector [lee2018simple], and MALCOM [yu2020convolutional], on the benchmark datasets [ooddata] TinyImagenet, LSUN, and iSUN, which are popularly used for testing OOD detection techniques. It is to be noted that we did not implement the earlier approaches (except Baseline); rather, we compare with the results reported in [yu2020convolutional] by using the same experimental setup. We see from the results in Table 3 that EARLIN performs better than the previous approaches in most of the cases. We report another set of results in Table 4, where we compare the performance of EARLIN against DeConf [hsu2020generalized] on DenseNet models pretrained on the CIFAR-10 and CIFAR-100 datasets, in terms of TNR at 95% TPR and AUROC. We note that [hsu2020generalized] does not report results for ResNet34 pretrained models in this experimental setting. We see from the results in Table 4 also that EARLIN performs better than the previous approach in most of the cases. We report yet another set of experimental results on VGG16 and ResNet44 models pretrained on CIFAR-10 and CIFAR-100 in the supplementary file. To demonstrate the clear separation of ID and OOD samples based on the estimated distance to the centroid, $\delta^{\ell}(x)$, in Figure 4 we show the density and the corresponding CDF of $\delta^{\ell}(x)$ obtained from various test ID and test OOD datasets. We observe that ID and OOD samples have separable distributions based on $\delta^{\ell}(x)$.

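| ID Dataset (Model) | OOD | TNR at 95% TPR: DeConf [hsu2020generalized] | TNR at 95% TPR: EARLIN | AUROC: DeConf [hsu2020generalized] | AUROC: EARLIN |
|---|---|---|---|---|---|
| CIFAR-10 (DenseNet) | TinyImagenet | 95.80 | **97.50** | 99.10 | **99.14** |
| CIFAR-10 (DenseNet) | LSUN | 97.60 | **99.30** | 99.40 | **99.85** |
| CIFAR-10 (DenseNet) | iSUN | 97.50 | **97.60** | **99.40** | 99.37 |
| CIFAR-100 (DenseNet) | TinyImagenet | **93.30** | 92.60 | **98.60** | 98.04 |
| CIFAR-100 (DenseNet) | LSUN | 93.80 | **98.10** | 98.70 | **99.96** |
| CIFAR-100 (DenseNet) | iSUN | 92.50 | **94.00** | 98.40 | **98.61** |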

Table 4: OOD detection performance of EARLIN compared to DeConf [hsu2020generalized]. Bold indicates the best result.

Ablation Studies: In order to detect samples as OOD as early as possible, we explore the top (shallowest) 10% of the layers of the pretrained models to find the separation between ID and OOD samples, and report the layer that performed best. Note that we consider only the Batch Normalization (BN) layers of the pretrained models, since parameters sensitive to the ID dataset are learned in these layers during training. Figure 6 shows how our end result, TNR at 95% TPR, varies for different choices of the shallow BN layers in ResNet34 and DenseNet models pretrained on CIFAR-10. Table 2 shows our choice of layers for the models we considered. We observe that in all cases the chosen layer is quite early in the network pipeline. To find the other hyperparameters, namely the number of selected maps, the centroid c, and the threshold, for each pretrained model, 20% of the training ID samples were used, without their corresponding classification labels. For each pretrained model, we set the number of selected maps to half of the number of 2D feature maps at the chosen layer. The threshold is set to 95% ID detection confidence. In Figure 5, we show the effect of selecting a different number of feature maps than the default 50% (half). Figure 5 shows the TNR values at 95% TPR for different datasets on different pretrained models, for different combinations of selected 2D feature maps: the best 50%, best 25%, best 75%, and worst 50% as ranked by the criterion in Eq. (1), as well as all 100%. We see that in almost all cases, selecting the worst 50% leads to the worst TNR for all datasets (more noticeably for the ResNet34 models). Selecting the top 50% of the maps leads to either better or equivalent TNR compared to selecting the top 25%, the top 75%, or all 100%. The choice of the top 50% of the 2D feature maps thus appears to produce the best results.
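The calibration described above (centroid and threshold obtained from unlabeled ID samples only) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the selected maps are reduced by spatial average pooling, the distance is Euclidean, and the indices in `selected` are taken as given because the ranking criterion of Eq. (1) is not reproduced in this section. All function names are hypothetical.

```python
import numpy as np

def pool_selected_maps(feature_maps, selected):
    """Average each selected 2D map over space: (N, C, H, W) -> (N, k)."""
    return feature_maps[:, selected].mean(axis=(2, 3))

def calibrate(id_feature_maps, selected, id_confidence=0.95):
    """Estimate centroid and threshold from ID samples only (no labels, no OOD data)."""
    feats = pool_selected_maps(id_feature_maps, selected)
    centroid = feats.mean(axis=0)
    dists = np.linalg.norm(feats - centroid, axis=1)
    # Threshold chosen so that `id_confidence` of the ID samples fall inside it.
    threshold = np.quantile(dists, id_confidence)
    return centroid, threshold

def is_ood(feature_maps, selected, centroid, threshold):
    """Flag samples whose distance from the ID centroid exceeds the threshold."""
    feats = pool_selected_maps(feature_maps, selected)
    return np.linalg.norm(feats - centroid, axis=1) > threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    id_maps = rng.normal(0.0, 1.0, size=(200, 8, 4, 4))  # stand-in for BN-layer outputs
    selected = np.arange(4)                               # assumed: half of the 8 maps
    centroid, threshold = calibrate(id_maps, selected)
    shifted = id_maps + 10.0                              # synthetic "OOD" inputs
    print("flagged ID fraction:", is_ood(id_maps, selected, centroid, threshold).mean())
    print("flagged OOD fraction:", is_ood(shifted, selected, centroid, threshold).mean())
```

Setting the threshold at the 95th percentile of the ID distances fixes the TPR at roughly 95% on the calibration data, which is why the detector needs no OOD samples to choose its operating point.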

Figure 6: TNR at 95% TPR for different choices of shallow BN layers in DenseNet and ResNet34 models pretrained on CIFAR-10. A random set of 1000 iSUN samples is used as validation OOD data. Our chosen layer is shown in red in each case.

6 Prototype Implementation and Results


Figure 7: Change in (a) performance and (b) cost vs. performance of the collaborative setup with the ratio of OOD samples, using models pretrained on the CIFAR-100 dataset and TinyImagenet as the OOD dataset.

Experimental Setup: We build a collaborative inference testbed where a client program with our EARLIN OOD detector runs on an edge device and the deep learning models are deployed on a server machine. Our client program runs on a desktop computer with a moderate CPU-only configuration (Intel® Core™ i7-9750H @ 2.60 GHz CPU and 32 GB RAM), similar to the edge setup described in [canel2019scaling]. The server program, developed in Python using the Flask [flask] and TensorFlow [abadi2016tensorflow] frameworks, is deployed on Google Cloud and is powered by Nvidia K80 GPU devices. To demonstrate the effectiveness of EARLIN, we deploy two CNN models in the cloud: (a) DenseNet with 100 layers and (b) ResNet with 34 layers (both pretrained on CIFAR-100 with 70% classification accuracy). We deploy their corresponding OOD detection parts on the edge device. In all experiments, the TinyImageNet dataset is used as OOD. We set the threshold to have 95% prediction confidence on ID samples, the same condition we considered while reporting the results on EARLIN in Section 5. Hence all the TPR, TNR, and Accuracy (detector accuracy) values match those reported in Table 3. We measured the mean latency of the computations done at the edge and at the server, as well as the mean communication delay. We observe that the latency at both the edge and the server is quite small. At the edge, we deploy only a small portion of the model, hence the latency is low. The server, on the other hand, runs the models on GPU resources, so the inference time is small there as well. The communication delay to the server, which accounts for the round-trip delay between the edge and the server and all other request processing delays before reaching the inference model, appears to be the heaviest part of the latency. In our EARLIN-based setup, we improve this latency by not sending inputs to the server when not required, thus avoiding the communication delay.
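The edge-side control flow of this setup can be sketched as below. This is a minimal illustration, not the prototype's code: `detect_ood` and `classify_remote` are hypothetical callables standing in for the on-device EARLIN detector and the request to the cloud-hosted classifier, respectively.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class InferenceResult:
    is_ood: bool           # True if the edge detector rejected the sample
    label: Optional[Any]   # class label from the server, or None for OOD samples
    uploaded: bool         # whether the sample was sent to the server

def collaborative_infer(
    sample: Any,
    detect_ood: Callable[[Any], bool],
    classify_remote: Callable[[Any], Any],
) -> InferenceResult:
    """Run the OOD detector at the edge and upload only samples judged to be ID."""
    if detect_ood(sample):
        # OOD: report locally and skip the upload, saving communication and server time.
        return InferenceResult(is_ood=True, label=None, uploaded=False)
    # ID: pay the communication and server cost to obtain a class label.
    return InferenceResult(is_ood=False, label=classify_remote(sample), uploaded=True)

if __name__ == "__main__":
    # Toy stand-ins: numbers above 5 are treated as OOD; the "server" returns a fixed label.
    results = [collaborative_infer(x, lambda v: v > 5, lambda v: "cat") for x in (1, 9, 3)]
    print(sum(r.uploaded for r in results), "of", len(results), "samples uploaded")
```

Because the communication delay dominates the end-to-end latency, every sample rejected at the edge avoids the most expensive step of the pipeline.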
Experimental Results: We show a set of aggregated results in Figure 7. We show the accuracy for varying ratios of OOD samples for EARLIN, Baseline, and "no detector". We observe that as the OOD ratio rises, the accuracy drops sharply if no OOD detector is applied. The overall accuracy of Baseline also declines, whereas the accuracy of EARLIN grows as the OOD ratio grows. This is because EARLIN has a considerably higher TNR and a higher detection accuracy than the Baseline detector. Note that when the OOD ratio is close to 0 (very few samples are OOD compared to ID), the accuracy of EARLIN is slightly worse than when no detector is used. This is because EARLIN detects, in the worst case, 5% of the ID samples as OOD (since the TPR is 95%), which reduces the overall accuracy. Figure 7 shows the performance of EARLIN as we increase the OOD ratio. We observe that, as the OOD ratio increases, the overall accuracy increases and the inference latency decreases. The decline in inference latency is due to the fact that, as more OOD inputs are fed, more inputs are detected as OOD at the edge. The samples detected as ID are uploaded to the cloud, incurring all three components of delay, and the number of those samples declines as the OOD ratio grows. We note that the time required per inference when the model is not associated with any detector is equivalent to the case where OOD samples are detected at the last layer of the model, as in both cases the input is sent to the server for classification and OOD detection. In Figure 7, we show the average time required per inference in this case. In Section 4, we showed that the overall accuracy of a model increases linearly with the OOD ratio. Figure 7 shows how well that characterization fits the experimental results. As we can see, our obtained curve closely matches the linear curve for the expected accuracy obtained from our formulation. The same is true for cost (latency): the inference latency decreases linearly with the OOD ratio, as expected.
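The linear trends discussed above can be illustrated with a simple model. The expressions below are an assumed formulation, not necessarily the one derived in Section 4: an ID sample counts as correct if it is accepted by the detector and classified correctly, an OOD sample counts as correct if it is rejected at the edge, and every uploaded sample incurs the communication and server delays. The TNR and latency values in the usage example are placeholders, not measurements from the paper.

```python
def expected_accuracy(ood_ratio, tpr, tnr, cls_acc):
    """Overall accuracy with an edge detector; linear in `ood_ratio`."""
    return (1.0 - ood_ratio) * tpr * cls_acc + ood_ratio * tnr

def accuracy_without_detector(ood_ratio, cls_acc):
    """Without a detector, every OOD sample is necessarily misclassified."""
    return (1.0 - ood_ratio) * cls_acc

def expected_latency(ood_ratio, tpr, tnr, t_edge, t_comm, t_server):
    """Mean latency per inference; only samples judged ID are uploaded."""
    upload_fraction = (1.0 - ood_ratio) * tpr + ood_ratio * (1.0 - tnr)
    return t_edge + upload_fraction * (t_comm + t_server)

if __name__ == "__main__":
    # TPR = 0.95 and 70% classifier accuracy follow the setup above;
    # TNR = 0.90 and the latencies (in ms) are illustrative placeholders.
    for r in (0.0, 0.25, 0.5, 0.75, 1.0):
        acc = expected_accuracy(r, tpr=0.95, tnr=0.90, cls_acc=0.70)
        base = accuracy_without_detector(r, cls_acc=0.70)
        lat = expected_latency(r, tpr=0.95, tnr=0.90, t_edge=1.0, t_comm=10.0, t_server=2.0)
        print(f"r={r:.2f}  with detector={acc:.3f}  no detector={base:.3f}  latency={lat:.2f} ms")
```

Under this model the accuracy slope with respect to the OOD ratio is TNR minus TPR times the classifier accuracy, and the latency slope is negative whenever TPR plus TNR exceeds 1. At an OOD ratio of 0 the detector costs a little accuracy, since up to 5% of the ID samples are rejected, which is consistent with the behaviour described above.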

7 Conclusion and Future Works

In this paper, we propose a novel edge-cloud collaborative inference system, EARLIN, based on a proposed Out-of-Distribution (OOD) detection technique. EARLIN enables the detection of OOD samples using feature maps obtained from the shallow layers of pretrained deep learning classifiers. We exploit the advantage of early detection to design an OOD-aware edge-cloud collaborative inference framework, in which we deploy the small-footprint detector part on an edge device and the full model in the cloud. During inference, the edge detects whether an input sample is ID. If it is, the sample is sent to the cloud for classification. Otherwise, the sample is reported as OOD and the edge starts processing the next sample in the pipeline. In this way, we make the inference at the edge faster and more precise. We characterize the performance and cost of the setup. Experimental results on benchmark datasets show that EARLIN performs well on OOD detection. Moreover, results from a prototype implementation show that the expected improvement in cost and performance is achieved with the proposed EARLIN-based setup. In the future, we plan to investigate building a context-aware adaptive OOD detection setup that chooses from multiple candidate OOD detectors based on the desired cost-accuracy trade-offs.