1 Introduction
Out-of-distribution (OOD) detection is critical for deploying machine learning models in safety-critical applications [amodei2016concrete]. Much progress has been made in improving OOD detection by training complicated generative models [bishop1994novelty; nalisnick2019detecting; ren2019likelihood; morningstar2021density], modifying objective functions [zhang2020hybrid], and exposing the model to OOD samples during training [hendrycks2018deep]. Although such methods show promising results, they may require training and deploying a separate model in addition to the classifier, or rely on OOD data for training and/or hyperparameter selection, which are not available in some applications. A Mahalanobis distance (MD) based OOD detection method [lee2018simple] is a simpler approach: it does not involve retraining the model and works out of the box for any trained model, which has made it popular.

Although MD based methods are highly effective in identifying far-OOD samples (samples which are semantically and stylistically very different from the in-distribution samples, e.g., CIFAR10 vs. SVHN), we identify that they often fail for near-OOD samples (samples which are semantically similar to the in-distribution samples [winkens2020contrastive], e.g., CIFAR100 vs. CIFAR10) that are more challenging to detect. In this paper, we focus primarily on the near-OOD detection task and investigate why the MD method fails in these cases. We propose relative Mahalanobis distance (RMD), a simple fix to the MD, and demonstrate its effectiveness in multiple near-OOD tasks. Our solution is as simple to use as MD, and it does not involve any complicated retraining or training on OOD data.

2 Methods
In this section, we briefly review the Mahalanobis distance method and introduce our proposed modifications to make it effective for near-OOD detection tasks.
Mahalanobis distance based OOD detection
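The Mahalanobis distance (MD) [lee2018simple] method uses intermediate feature maps of a trained deep neural network. The most common choice for the feature map is the output of the penultimate layer just before the classification layer. Let us denote this feature map as $z_i = f(x_i)$ for an input $x_i$. For an in-distribution dataset with $K$ unique classes, the MD method fits $K$ class-conditional Gaussian distributions $\mathcal{N}(\mu_k, \Sigma)$, $k = 1, \dots, K$, to each of the in-distribution classes based on the feature maps. We estimate the mean vectors and covariance matrix as:

$$\mu_k = \frac{1}{N_k} \sum_{i: y_i = k} z_i, \qquad \Sigma = \frac{1}{N} \sum_{k=1}^{K} \sum_{i: y_i = k} (z_i - \mu_k)(z_i - \mu_k)^T,$$

for $k = 1, \dots, K$, where $N_k$ is the number of training samples in class $k$ and $N$ is the total number of training samples. Note that the class-conditional means are estimated independently for each class, while the covariance matrix is shared by all classes to avoid under-fitting errors.

For a test input $x'$, the method computes the Mahalanobis distances from the feature map $z' = f(x')$ of the test input to each of the fitted in-distribution Gaussian distributions $\mathcal{N}(\mu_k, \Sigma)$, $k \in \{1, \dots, K\}$, given by $\mathrm{MD}_k(z')$. The minimum of the distances over all classes indicates the uncertainty score $\mathcal{U}(x')$ and its negative indicates the confidence score $\mathcal{C}(x') = -\mathcal{U}(x')$. These are computed as

$$\mathrm{MD}_k(z') = (z' - \mu_k)^T \Sigma^{-1} (z' - \mu_k), \qquad (1)$$

$$\mathcal{C}(x') = -\min_k \{\mathrm{MD}_k(z')\}. \qquad (2)$$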
This confidence score is used as a signal to classify a test input as an in-distribution or OOD sample.
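The fitting and scoring steps described in this subsection can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `fit_class_gaussians` and `md_confidence` and the synthetic 4-dimensional features are assumptions made for the example and are not taken from the paper's implementation.

```python
import numpy as np


def fit_class_gaussians(features, labels):
    """Estimate per-class means and one covariance matrix shared by all classes."""
    classes = np.unique(labels)
    means = np.stack([features[labels == k].mean(axis=0) for k in classes])
    # Center every sample on its own class mean, then pool the outer products.
    centered = features - means[np.searchsorted(classes, labels)]
    shared_cov = centered.T @ centered / len(features)
    return means, shared_cov


def md_confidence(test_features, means, shared_cov):
    """Negative of the minimum squared Mahalanobis distance over classes."""
    # The pseudo-inverse is an assumption that guards against a singular covariance.
    precision = np.linalg.pinv(shared_cov)
    diffs = test_features[:, None, :] - means[None, :, :]  # (n_test, n_classes, dim)
    dists = np.einsum("nkd,de,nke->nk", diffs, precision, diffs)
    return -dists.min(axis=1)


# Synthetic stand-in for feature maps: two classes in 4 dimensions (hypothetical data).
rng = np.random.default_rng(0)
train_z = np.concatenate([rng.normal(-2.0, 1.0, (500, 4)), rng.normal(2.0, 1.0, (500, 4))])
train_y = np.repeat([0, 1], 500)
means, cov = fit_class_gaussians(train_z, train_y)

conf_in = md_confidence(rng.normal(2.0, 1.0, (200, 4)), means, cov)
conf_ood = md_confidence(rng.normal(10.0, 1.0, (200, 4)), means, cov)
print("mean confidence, in-distribution:", conf_in.mean())
print("mean confidence, far OOD:        ", conf_ood.mean())
```

In this toy setup the far-OOD inputs receive much lower confidence than the in-distribution inputs, which is the regime where MD is reported to work well.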
Our proposed Relative Mahalanobis distance
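As we will demonstrate in Section 3 and Appendix D, OOD detection performance using MD degrades for near-OOD scenarios. We draw our inspiration from the prior work [ren2019likelihood], which shows that the raw density from deep generative models may fail at OOD detection and proposes to fix this by using a likelihood ratio between two generative models (one modeling the sophisticated foreground distribution and the other modeling the background distribution) as the confidence score. Similarly, we propose Relative Mahalanobis distance (RMD) defined as

$$\mathrm{RMD}_k(z') = \mathrm{MD}_k(z') - \mathrm{MD}_0(z'),$$

where $z'$ is the feature map of a test input $x'$ and $\mathrm{MD}_k(z')$ is its Mahalanobis distance to the Gaussian fitted to class $k$. Here, $\mathrm{MD}_0(z')$ indicates the Mahalanobis distance of sample $z'$ to a distribution fitted to the entire training data not considering the class labels: $\mathcal{N}(\mu_0, \Sigma_0)$, where $\mu_0 = \frac{1}{N} \sum_{i=1}^{N} z_i$ and $\Sigma_0 = \frac{1}{N} \sum_{i=1}^{N} (z_i - \mu_0)(z_i - \mu_0)^T$. This is a good proxy for the background distribution. The confidence score using RMD is given by

$$\mathcal{C}^{\mathrm{RMD}}(x') = -\min_k \{\mathrm{RMD}_k(z')\}. \qquad (3)$$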
See Appendix A for the pseudocode.
RMD is equivalent to computing a likelihood ratio $\max_k \log\left(p_k(z') / p_0(z')\right)$ for the feature map $z'$ of a test input, where $p_k$ is a Gaussian fit using class-specific data and $p_0$ is a Gaussian fit using data from all classes. Note that this can easily be extended to the case where $p_k$ and $p_0$ are represented by more powerful generative models such as flows [papamakarios2017masked; papamakarios2019normalizing].
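To make the likelihood-ratio reading concrete, the short derivation below sketches why the two scores rank inputs identically. It assumes, as described earlier in this section, that the class-conditional Gaussians $p_k$ share one covariance $\Sigma$ while the background Gaussian $p_0$ has its own covariance $\Sigma_0$; $z'$ denotes the feature map of a test input $x'$, and $\mathcal{C}^{\mathrm{RMD}}$ is the RMD confidence score.

```latex
\begin{align*}
\log p_k(z') &= -\tfrac{1}{2}\,\mathrm{MD}_k(z') - \tfrac{1}{2}\log\det(2\pi\Sigma), \\
\log p_0(z') &= -\tfrac{1}{2}\,\mathrm{MD}_0(z') - \tfrac{1}{2}\log\det(2\pi\Sigma_0), \\
\log\frac{p_k(z')}{p_0(z')} &= -\tfrac{1}{2}\,\mathrm{RMD}_k(z')
    + \tfrac{1}{2}\log\frac{\det\Sigma_0}{\det\Sigma}, \\
\max_k \log\frac{p_k(z')}{p_0(z')} &= \tfrac{1}{2}\,\mathcal{C}^{\mathrm{RMD}}(x') + \text{const}.
\end{align*}
```

Because the constant depends on neither $z'$ nor $k$, thresholding the log-likelihood ratio and thresholding the RMD confidence score give the same detector under these assumptions.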
Previous literature [kamoi2020mahalanobis] discussed a similar topic. However, that work mainly focused on far-OOD, and its proposed method, called Partial Mahalanobis distance (PMD), requires a hyperparameter (the number of eigenbases to consider), while our method performs better for near-OOD and is hyperparameter-free. See Appendix C for the comparison of PMD and RMD.
3 Failure Modes of Mahalanobis distance
Figure 1: (a) Mahalanobis distance (top) and Relative Mahalanobis distance (bottom) along each eigenbasis. The solid lines represent the means over the IND and OOD test data respectively. The shading indicates the [10%, 90%] quantiles. The top 120 dimensions (before the red threshold) have distinct Mahalanobis distances between IND and OOD, while the later dimensions have similar Mahalanobis distances between IND and OOD, confounding the final score. (b) Histograms of the Mahalanobis distance and Relative Mahalanobis distance for IND and OOD.
To better understand the failure mode of Mahalanobis distance and to visualize its difference from the Relative Mahalanobis distance, we perform an eigen-analysis to understand how these methods weight each dimension [kamoi2020mahalanobis]. Specifically, we rewrite the Mahalanobis distance using the eigenvectors $v_d$ of the covariance matrix $\Sigma$ as $\mathrm{MD}(z') = \sum_{d=1}^{D} l_d^2 / \lambda_d$, where $D$ is the dimension of the feature map, $\lambda_d$ is the $d$th eigenvalue, and $l_d = v_d^T (z' - \mu)$ is the projected coordinate of $(z' - \mu)$ onto the $d$th eigenbasis, such that $l_d^2 / \lambda_d$ can be regarded as the 1D Mahalanobis distance from the projected coordinate $l_d$ to the 1D Gaussian distribution $\mathcal{N}(0, \lambda_d)$. The eigenbases are independent of each other.

In the CIFAR100 vs CIFAR10 experiment, we found that OOD inputs have a significantly greater mean distance (i.e. the average distance over the test samples) in the top 120 dimensions with the largest eigenvalues, while in the remaining dimensions the OOD inputs have a mean distance similar to that of the IND inputs (see Figure 1(a), top). Since the final Mahalanobis distance is the sum of the distances per dimension (this can be visualized as the area under the curve in Figure 1(a)), we see that the later dimensions contribute a significant portion of the final score, overwhelming the top dimensions and making it harder to distinguish OOD from IND (AUROC=74.98%).
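The per-dimension decomposition used in this eigen-analysis can be checked numerically. The sketch below uses a randomly generated covariance matrix, mean, and feature vector (all hypothetical stand-ins for fitted quantities) and confirms that summing the 1D distances along the eigenbases recovers the usual quadratic form.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 6
# Hypothetical Gaussian parameters standing in for a fitted class mean and covariance.
A = rng.normal(size=(dim, dim))
cov = A @ A.T + 0.1 * np.eye(dim)
mu = rng.normal(size=dim)
z = rng.normal(size=dim)

# Direct form: (z - mu)^T cov^{-1} (z - mu).
md_direct = (z - mu) @ np.linalg.solve(cov, z - mu)

# Eigen form: project onto each eigenvector, then sum the 1D distances l_d^2 / lambda_d.
eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues come back in ascending order
order = np.argsort(eigvals)[::-1]             # reorder so the largest eigenvalue is first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
proj = eigvecs.T @ (z - mu)                   # projected coordinate l_d for every eigenbasis
per_dim = proj ** 2 / eigvals                 # contribution of each dimension to the distance
md_eigen = per_dim.sum()

print("direct:", md_direct, "eigen-sum:", md_eigen)
```

The array `per_dim` is the per-dimension quantity whose mean over test samples is plotted along the eigenbases in the analysis above.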
Next we fit a class-independent 1D Gaussian as the background model in each dimension and compute RMD per dimension. As shown in Figure 1(a) (bottom), using RMD, the contributions of the later dimensions are reduced to nearly zero, while the top dimensions still provide a good distinction between IND and OOD. As a result, the AUROC using RMD improves to 81.08%.

We conjecture that the first 120 dimensions are discriminative features that carry different semantic meanings for different IND classes and OOD, while the remaining dimensions are common features shared by the IND and OOD. To support our conjecture, we simulated a simple dataset following a high-dimensional Gaussian with a diagonal covariance matrix and different means for different classes. In particular, we set IND and OOD to have distinct means in the first dimension (discriminative feature) and the same mean in the remaining dimensions (non-discriminative features). Since MD is the sum over all the dimensions, the sum along the non-discriminative dimensions can overwhelm that of the discriminative dimension. As a result, the AUROC is only 83.13%. Using RMD, we remove the effect of the non-discriminative dimensions, since for those dimensions the fitted class-conditional and class-independent Gaussians nearly coincide and their distances cancel, and we detect OOD perfectly with 100% AUROC.
4 Experiments and Results
As indicated in the previous section, in this work we primarily focus on near-OOD detection tasks. We choose the following established near-OOD setups: (i) CIFAR100 vs. CIFAR10, (ii) CIFAR10 vs. CIFAR100, (iii) Genomics OOD benchmark [ren2019likelihood] and (iv) CLINC Intent OOD benchmark [larson2019evaluation; liu2020simple]. As baselines, we compare our proposed RMD to traditional MD and maximum of softmax probability (MSP) [hendrycks2016baseline], both working directly with out-of-the-box trained models. Note that most OOD detection methods require retraining of the models and complicated hyperparameter tuning, which we do not consider for comparison. We also ablate over different choices of model architectures with and without large-scale pretrained networks. The results are presented in the following sections.

4.1 Models without pretraining
In this section, we train our models from scratch using the in-distribution data. For the CIFAR10/100 tasks we use a Wide ResNet 28-10 architecture as the backbone. For the genomics OOD benchmark we use a 1D CNN architecture consistent with [ren2019likelihood]. For all benchmarks, at the end of training, we extract the feature maps for test IND and OOD inputs, evaluate the OOD performance of our proposed RMD, and compare it with MD and MSP. As seen in Table 1, contrasting MD and RMD, we observe a consistent improvement in AUROC for all benchmarks, with gains ranging from 1.2 points to 15.8 points. Comparing RMD to MSP, we observe a significant gain of 2.5 points for the Genomics OOD benchmark and smaller gains for the CIFAR10/100 benchmarks. This substantiates our claim that our proposed RMD boosts near-OOD detection performance.
Table 1: AUROC of MD, RMD, and MSP for models trained without pretraining.
Benchmark            MD          RMD     MSP
CIFAR100 vs CIFAR10  74.91%      81.01%  80.14%
CIFAR10 vs CIFAR100  88.49%      89.71%  89.27%
Genomics OOD         53.10%^{1}  68.98%  66.53%
^{1} We observed that the AUROC for MD changes a lot during training of the 1D CNN genomics model. We report the performance based on the model checkpoint at the end of the training without any hyperparameter tuning using a validation set. See Section 4.3 for details.
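For readers less familiar with the metric, AUROC can be read as the probability that a randomly chosen IND input receives a higher confidence score than a randomly chosen OOD input. The pure-Python sketch below computes it by direct pairwise comparison; the function name `auroc` and the toy scores are illustrative assumptions, not the evaluation code behind the reported numbers, which are given in percent.

```python
def auroc(conf_in, conf_ood):
    """Fraction of (IND, OOD) pairs in which the IND sample has the higher confidence."""
    wins = 0.0
    for c_in in conf_in:
        for c_ood in conf_ood:
            if c_in > c_ood:
                wins += 1.0
            elif c_in == c_ood:
                wins += 0.5  # ties count as half a correctly ordered pair
    return wins / (len(conf_in) * len(conf_ood))


# Toy confidence scores (hypothetical): higher means "more in-distribution".
toy_in = [-1.0, -2.0, -0.5]
toy_ood = [-3.0, -1.5]
print(f"AUROC = {100 * auroc(toy_in, toy_ood):.2f}%")
```

The quadratic loop is chosen for clarity; a rank-based formulation gives the same value more efficiently on large test sets.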
Using flows for $p_k$ and $p_0$
To demonstrate that our proposed idea can be extended to more powerful density models, we fit the feature maps using a one-layer masked autoregressive flow [papamakarios2017masked] for the CIFAR100 vs CIFAR10 benchmark. The AUROCs for using the class-conditional density $p_k$ alone and the ratio $p_k / p_0$ are 76.10% and 78.34% respectively, showing that our proposal works for non-Gaussian density models as well.
4.2 Models with pretraining
Using massive models pretrained on large-scale datasets is becoming standard practice in modern image recognition and language classification tasks. It has been shown that the high-quality features learnt during this pretraining stage can be very useful in boosting the performance of the downstream task [hendrycks2019using; paul2021vision; fort2021exploring]. In this section, we investigate whether such high-quality representations also aid OOD detection and how our proposed RMD performs in such a setting, using different pretrained models as the architectural backbone for OOD detection. Specifically, we consider Vision Transformer (ViT) [dosovitskiy2020image], Big Transfer (BiT) [kolesnikov2019big], and CLIP [radford2021learning] for the CIFAR10/100 benchmarks, and the unsupervised BERT-style pretraining model [devlin2018bert] for the genomics and CLINC benchmarks (the BERT model used for the genomics benchmark is pretrained on the genomics data with the standard masked language modeling method).
We investigate two settings: (i) directly using pretrained models for OOD detection and (ii) fine-tuning the pretrained model on the in-distribution dataset for OOD detection.
Pretrained models without fine-tuning
We present our results in Table 2, comparing MD and RMD for all benchmarks using different pretrained models. Note that here we cannot evaluate MSP, as the network was never trained to produce predictive probabilities. First, we observe that, even without task-specific fine-tuning, the AUROC scores are close to or better than those in Table 1, indicating that pretrained models work well for OOD detection out of the box. Second, we observe that RMD outperforms MD for all benchmarks and all pretrained models, with margins ranging from 3.17 points to 16.5 points. For the CIFAR100 vs CIFAR10 benchmark, BiT models provide the best performance, followed by CLIP and Vision Transformer. BiT with RMD achieves a significantly higher AUROC (84.60%) than the Wide ResNet baseline model (81.01%). For CIFAR10 vs CIFAR100, using pretrained CLIP, RMD achieves 91.19% AUROC, higher than any of the other methods considered. Finally, it is worth noting that the gains provided by RMD are very prominent for the genomics and CLINC intent benchmarks when using BERT pretrained features.
Table 2: AUROC of MD and RMD using pretrained models without fine-tuning.
Benchmark            MD      RMD
ViT-B_16 Pretrained
CIFAR100 vs CIFAR10  67.19%  79.91%
CIFAR10 vs CIFAR100  84.88%  89.73%
BiT R50x1 Pretrained
CIFAR100 vs CIFAR10  81.37%  84.60%
CIFAR10 vs CIFAR100  86.70%  89.87%
CLIP Pretrained
CIFAR100 vs CIFAR10  71.40%  81.83%
CIFAR10 vs CIFAR100  83.57%  91.19%
BERT Pretrained
Genomics OOD         48.46%  60.36%
CLINC Intent OOD     75.48%  91.98%
Pretrained models with fine-tuning
We now explicitly fine-tune the pretrained model on the in-distribution dataset, optimizing for classification accuracy. Using the fine-tuned models for the different benchmarks, we report the performance in Table 3, comparing RMD with the MD and MSP baselines. We see that the performance of MD improves significantly after fine-tuning (comparing Tables 2 and 3), suggesting that fine-tuning removes the disruptive non-discriminative features that existed in the pretrained models. MD achieves close or competitive AUROC when compared to RMD for most of the tasks evaluated, with the notable exception of genomics OOD (see Section 4.3). In light of the discussion in Section 3, we conjecture that after task-specific fine-tuning using labeled data, most of the features become discriminative between IND and OOD. It is also possible that the pretraining and fine-tuning regimes end up at better local minima, and that the resulting features are capable of modelling the foreground and background implicitly (without our explicit normalization using RMD). Therefore the effectiveness of RMD in such cases is limited.
Table 3: AUROC of MD, RMD, and MSP using fine-tuned pretrained models.
Benchmark            MD          RMD     MSP
ViT-B_16 Fine-tuned
CIFAR100 vs CIFAR10  94.42%      93.09%  92.30%
CIFAR10 vs CIFAR100  99.87%      98.82%  99.50%
BiT-M R50x1 Fine-tuned
CIFAR100 vs CIFAR10  81.37%      84.60%  81.04%
CIFAR10 vs CIFAR100  94.57%      94.94%  85.65%
BERT Fine-tuned
Genomics OOD         55.87%^{3}  72.04%  72.02%
CLINC Intent OOD     97.92%      97.62%  96.99%
^{3} We observed that the AUROC for MD changes a lot during fine-tuning. We report the performance based on the model checkpoint at the end of the training. See Section 4.3 for details.
4.3 Relative Mahalanobis is more robust
In the genomics experiments, we noticed that the OOD performance of MD is quite unstable during training of the 1D CNN model and during fine-tuning of the BERT pretrained model. The AUROC of MD increases during the early stages of training and then decreases at later stages. Figure 2 shows the change of AUROCs for MD and RMD during the training of the 1D CNN model. The AUROC of MD quickly increases to 66.19% at step 50k, when the model is not well trained yet, with training and test accuracies being 88.59% and 82.20% respectively. As the model trains further and achieves a higher training accuracy of 99.96% and a higher test accuracy of 85.71% at step 500k, the AUROC for MD drops to 53.10%. On the other hand, the AUROC of RMD increases as the training and test accuracies increase, and stabilizes as the accuracy stabilizes, which is a more desirable property to have. We observed the same phenomenon in the fine-tuning of the BERT genomics model. At the early training stage, the AUROC for MD reaches a peak of 77.49%, while the model is not yet well trained, with the training and test accuracies being only 82.62% and 83.97% respectively.
Acknowledgements
We thank Zack Nado and D. Sculley for helpful feedback.
References
Appendix A Pseudocode for Relative Mahalanobis distance
The pseudocode for our method is shown in Algorithm 1.
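As a complement, the procedure described in Section 2 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions rather than the authors' Algorithm 1: the function names (`squared_md`, `fit_rmd`, `rmd_confidence`), the use of a pseudo-inverse, and the synthetic 5-dimensional features are all choices made for illustration.

```python
import numpy as np


def squared_md(x, mean, cov):
    """Squared Mahalanobis distance of each row of x to N(mean, cov)."""
    diff = x - mean
    return ((diff @ np.linalg.pinv(cov)) * diff).sum(axis=1)


def fit_rmd(train_z, train_y):
    """Fit class-conditional Gaussians (shared covariance) and one background Gaussian."""
    classes = np.unique(train_y)
    means = [train_z[train_y == k].mean(axis=0) for k in classes]
    centered = np.concatenate([train_z[train_y == k] - m for k, m in zip(classes, means)])
    shared_cov = centered.T @ centered / len(train_z)
    # Background Gaussian: fitted to all training features, ignoring the labels.
    mean0 = train_z.mean(axis=0)
    cov0 = (train_z - mean0).T @ (train_z - mean0) / len(train_z)
    return means, shared_cov, mean0, cov0


def rmd_confidence(test_z, means, shared_cov, mean0, cov0):
    """Confidence score: negative of min over classes of MD_k(z') - MD_0(z')."""
    md0 = squared_md(test_z, mean0, cov0)
    rmd = np.stack([squared_md(test_z, m, shared_cov) - md0 for m in means], axis=1)
    return -rmd.min(axis=1)


# Hypothetical features: two classes in 5 dimensions.
rng = np.random.default_rng(0)
train_z = np.concatenate([rng.normal(-3.0, 1.0, (300, 5)), rng.normal(3.0, 1.0, (300, 5))])
train_y = np.repeat([0, 1], 300)
params = fit_rmd(train_z, train_y)

conf_in = rmd_confidence(rng.normal(3.0, 1.0, (10, 5)), *params)
conf_ood = rmd_confidence(rng.normal(30.0, 1.0, (10, 5)), *params)
print("in-distribution:", conf_in.mean(), " OOD:", conf_ood.mean())
```

Compared with plain MD, the only additional cost is fitting one extra Gaussian and evaluating one extra distance per test input.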
Appendix B Additional Experimental Details
For the CIFAR10/100 experiments, we first train a Wide ResNet 28-10 model (https://github.com/google/uncertainty-baselines/blob/master/baselines/cifar/deterministic.py) from scratch using the in-distribution data. Next we use the publicly available pretrained models ViT-B_16 (https://github.com/google-research/vision_transformer), BiT R50x1 (https://github.com/google-research/big_transfer), and CLIP (https://github.com/openai/CLIP), replace the last layer with a classification head, and fine-tune the full models using in-distribution data. We do not fine-tune the CLIP model, since CLIP requires paired (text, image) data for training. The fine-tuned ViT model has an in-distribution test accuracy of 89.91% for CIFAR100 and 97.48% for CIFAR10. The fine-tuned BiT model has an in-distribution test accuracy of 86.89% for CIFAR100 and 97.66% for CIFAR10.

For the genomics OOD benchmark, the dataset is available at TensorFlow Datasets (https://www.tensorflow.org/datasets/catalog/genomics_ood). The dataset contains 10 in-distribution bacteria classes and 60 OOD classes, and the input is a fixed-length sequence of 250 base pairs composed of the letters A, C, G and T. We first train a 1D CNN of 2000 filters of length 20 from scratch using the in-distribution data. We train the model for 1 million steps using the learning rate of and Adam optimizer. Next we pretrain a BERT-style model by randomly masking input tokens and predicting the masked tokens using the output of the transformer encoder. The model is trained using the unlabeled training and validation data. The prediction accuracy for the masked token is 48.35%. At the fine-tuning stage, the model is fine-tuned using the in-distribution training data for 100,000 steps at the learning rate of , and the classification accuracy is 89.84%.

For CLINC Intent OOD, we use a standard BERT pretrained model (https://github.com/google/uncertainty-baselines/blob/master/baselines/clinc_intent/deterministic.py), and fine-tune it using the in-distribution CLINC data for 3 epochs with the learning rate of . The classification accuracy is 96.53%.

Appendix C Performance of Partial Mahalanobis distance
We compare our method with the Partial Mahalanobis distance (PMD) proposed in [kamoi2020mahalanobis]. PMD uses a subset $S$ of eigenbases to compute the distance score, $\mathrm{PMD}(z') = \sum_{d \in S} l_d^2 / \lambda_d$, where $l_d$ is the projection of the centered feature map onto the $d$th eigenvector of the covariance matrix and $\lambda_d$ is the corresponding eigenvalue. Although $S$ can be any subset of $\{1, \dots, D\}$, it was recommended to use $S = \{1, \dots, d\}$ or $S = \{d, \dots, D\}$, corresponding to the largest or smallest eigenvalues respectively. We compare our RMD method with the two versions of PMD using the benchmark task of CIFAR100 vs CIFAR10. Since there is a hyperparameter $d$ involved in PMD, we search from . Figure 3(a) shows the AUROC when using the top $d$ eigenbases to compute PMD. The AUROC increases as $d$ increases and reaches a peak of 79.72% at , and then decreases when more dimensions are included. Therefore the performance of the PMD method depends on the choice of $d$, while our method RMD is hyperparameter-free. Our method also achieves a slightly higher AUROC, 81.08%, than the peak value for PMD.
We also investigate the performance of PMD when using eigenbases corresponding to the smallest eigenvalues (Figure 3(b)). The AUROC decreases as we exclude the top eigenbases from the set, suggesting that the top eigenbases are more important for near-OOD detection. This observation supports our conjecture in Section 3 that the top eigenbases are discriminative features and the rest are common features shared by the IND and OOD.
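The two PMD variants compared here can be sketched as a single function that keeps either the leading or the trailing eigenbases. The function name `partial_md`, its parameters `num_dims` and `use_top`, and the random test covariance are hypothetical choices for this illustration; in particular, parameterizing by the number of retained dimensions is an assumption and may differ from the index convention in [kamoi2020mahalanobis].

```python
import numpy as np


def partial_md(z, mean, cov, num_dims, use_top=True):
    """Sum l_d^2 / lambda_d over num_dims eigenbases with largest (or smallest) eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(cov)             # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = slice(0, num_dims) if use_top else slice(len(eigvals) - num_dims, None)
    proj = (z - mean) @ eigvecs[:, keep]               # projected coordinates l_d
    return (proj ** 2 / eigvals[keep]).sum(axis=-1)


rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
cov = A @ A.T + 0.1 * np.eye(5)    # hypothetical covariance matrix
mean = np.zeros(5)
z = rng.normal(size=(3, 5))

full = partial_md(z, mean, cov, 5)                     # all eigenbases: the ordinary MD
top2 = partial_md(z, mean, cov, 2, use_top=True)
bottom3 = partial_md(z, mean, cov, 3, use_top=False)
print(full, top2 + bottom3)
```

Keeping all eigenbases recovers the ordinary Mahalanobis distance, and the top and bottom parts always add up to it, which is why the choice of the cut-off matters for PMD.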
Another variant of the Mahalanobis distance, called Marginal Mahalanobis distance (MMD), was also proposed in [kamoi2020mahalanobis]. It fits a single Gaussian distribution to all the training data ignoring class, in the same way as we define the background model in our RMD. Though it performs well for far-OOD tasks (e.g. CIFAR10 vs SVHN) [kamoi2020mahalanobis], it does not perform well for the near-OOD tasks, with AUROC being only 52.88% for CIFAR100 vs CIFAR10, and 83.81% for CIFAR10 vs CIFAR100.
Appendix D Simulation study for the failure mode of Mahalanobis distance
We use a simple simulation to demonstrate the failure mode of Mahalanobis distance. We simulate a binary classification problem where the two classes follow a high-dimensional Gaussian distribution with different means. Specifically, $x \sim \mathcal{N}(\mu_k, \Sigma)$, where the covariance matrix $\Sigma$ is a fixed diagonal matrix with the scalar . The mean vector $\mu_k$ has only the first dimension non-zero. To distinguish the two classes, we set for the first class, for the second class, and . The key idea is that only the first dimension is a discriminative feature that is class-specific, whereas the remaining dimensions are non-discriminative common features that are shared by all classes. We set the number of dimensions . To simplify the problem, we set the covariance matrix to be diagonal such that the feature dimensions are independent.
For each of the classes, we randomly sample data points from the given distribution for training data. For test data, we sample data points from each class as the test IND data. For test OOD data, we set and and sample data points from each of them. Figure 4(a) shows the histograms of the first dimension of IND and OOD data. The IND and OOD data points are well separated by the first dimension feature. Figure 4(b) shows the histogram of the remaining dimensions. The IND and OOD data points are not separable there, since they follow the Gaussian distribution with the same mean.
For simplicity, we first treat the input $x$ itself as the feature map $z$. We fit a class-conditional Gaussian using the training data, and compute the MD for each of the test inputs. We find that although OOD inputs in general have a greater distance than IND inputs, the two are largely overlapping. See Figure 4(c) for details.
The reason behind the failure mode is simple. Since the dimensions are independent, the log-likelihood of an input is the sum of the log-likelihoods of each individual dimension, i.e. $\log p(x) = \sum_{d=1}^{D} \log p(x_d)$. For the discriminative feature $x_1$, the distributions of IND and OOD are different, so approximately $\log p(x_1^{\mathrm{IND}}) > \log p(x_1^{\mathrm{OOD}})$. However, the remaining non-discriminative features are class-independent and both IND and OOD inputs follow the same distribution. Thus the likelihood of IND inputs based on those features will be indistinguishable from that of OOD inputs, i.e. $\log p(x_d^{\mathrm{IND}}) \approx \log p(x_d^{\mathrm{OOD}})$ for $d > 1$. When the number of non-discriminative features is much greater than the number of discriminative features, the log-likelihood of the former will overwhelm the latter.
Next we compute the RMD. We fit a class-independent Gaussian distribution using the training data regardless of the class labels, and compute the Relative Mahalanobis distance based on $p_k$ (class-conditional Gaussian distribution) and $p_0$ (class-independent Gaussian distribution) for each of the test inputs. Using our proposed method, we are able to perfectly separate IND and OOD test inputs. See Figure 4(d).
The class-independent Gaussian helps to remove the effect of the non-discriminative features. Specifically, since those non-discriminative features are class-independent, the fitted class-conditional Gaussian is close to the fitted class-independent Gaussian, i.e. $p_k(x_d) \approx p_0(x_d)$ for $d > 1$. Thus the two values cancel each other in the RMD computation, resulting in $\mathrm{RMD}(x_d) \approx 0$. For the discriminative feature, the fitted class-conditional Gaussian is very different from the fitted class-independent Gaussian. For IND inputs, $p_k(x_1) > p_0(x_1)$, since the class-conditional Gaussian fits better to the IND data. For the OOD input, the difference between the two is nearly zero, since neither of the two distributions fits OOD. Therefore RMD provides a better separation between IND and OOD, as we have seen in Figure 4(d).
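A simulation in the spirit of this appendix can be sketched as follows. All numeric settings (300 dimensions, unit variance, first-dimension means of plus or minus 5 for IND and plus or minus 11 for OOD, 2000 training and 1000 test points per class) are assumptions picked for this illustration and are not the values used in the paper, so the resulting AUROCs are not expected to match the reported ones.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_train, n_test = 300, 2000, 1000    # assumed sizes, not the paper's
ind_shift, ood_shift = 5.0, 11.0          # assumed first-dimension means (used as +/-)


def sample(first_dim_mean, n):
    """Draw n points from N(mu, I) where only the first entry of mu is non-zero."""
    x = rng.normal(size=(n, dim))
    x[:, 0] += first_dim_mean
    return x


def squared_md(x, mean, precision):
    diff = x - mean
    return ((diff @ precision) * diff).sum(axis=1)


def auroc(conf_in, conf_ood):
    """Fraction of (IND, OOD) pairs ranked correctly; ties count as half."""
    greater = (conf_in[:, None] > conf_ood[None, :]).mean()
    equal = (conf_in[:, None] == conf_ood[None, :]).mean()
    return greater + 0.5 * equal


train = [sample(-ind_shift, n_train), sample(ind_shift, n_train)]
test_in = np.concatenate([sample(-ind_shift, n_test), sample(ind_shift, n_test)])
test_ood = np.concatenate([sample(-ood_shift, n_test), sample(ood_shift, n_test)])

# Class-conditional Gaussians with a shared covariance, plus one background Gaussian.
means = [t.mean(axis=0) for t in train]
centered = np.concatenate([t - m for t, m in zip(train, means)])
shared_prec = np.linalg.pinv(centered.T @ centered / len(centered))
all_train = np.concatenate(train)
mean0 = all_train.mean(axis=0)
prec0 = np.linalg.pinv(np.cov(all_train, rowvar=False, bias=True))


def scores(x):
    """Return (MD confidence, RMD confidence) for each row of x."""
    md = np.stack([squared_md(x, m, shared_prec) for m in means], axis=1)
    rmd = md - squared_md(x, mean0, prec0)[:, None]
    return -md.min(axis=1), -rmd.min(axis=1)


md_in, rmd_in = scores(test_in)
md_ood, rmd_ood = scores(test_ood)
print(f"AUROC with MD:  {auroc(md_in, md_ood):.3f}")
print(f"AUROC with RMD: {auroc(rmd_in, rmd_ood):.3f}")
```

Under these assumed settings the many non-discriminative dimensions add a large, noisy term to MD that is shared by IND and OOD inputs, while the same term cancels in RMD, so the RMD score separates the two groups far more cleanly.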
To mimic the real scenario where the feature maps are the features extracted from neural networks, we train simple one-layer neural networks for this binary classification task. We retrieve the feature maps of the training data, fit a class-conditional Gaussian and compute MD for the test inputs. We observed the same failure mode in this case; the distributions of MD for IND and OOD largely overlap. Then we fit a class-independent Gaussian and compute RMD. Using RMD, we again recover the perfect separation between the two. We expect that the intermediate layers of image, text, and genomics models also contain non-discriminative features. Therefore our proposed method is useful for overcoming this effect and improving the performance of near-OOD detection.