Out-of-distribution (OOD) detection is critical for deploying machine learning models in safety-critical applications [amodei2016concrete]. Much progress has been made in improving OOD detection by training complicated generative models [bishop1994novelty; nalisnick2019detecting; ren2019likelihood; morningstar2021density], modifying objective functions [zhang2020hybrid], and exposing the model to OOD samples during training [hendrycks2018deep]. Although such methods show promising results, they may require training and deploying a separate model in addition to the classifier, or rely on OOD data for training and/or hyper-parameter selection, which is not available in some applications. The Mahalanobis distance (MD) based OOD detection method [lee2018simple] is a simpler alternative: it does not involve re-training the model and works out-of-the-box for any trained model, which has made it a popular approach.
Although MD based methods are highly effective in identifying far OOD samples (samples which are semantically and stylistically very different from the in-distribution samples, e.g., CIFAR-10 vs. SVHN), we find that they often fail for near OOD samples (samples which are semantically similar to the in-distribution samples [winkens2020contrastive], e.g., CIFAR-100 vs. CIFAR-10), which are more challenging to detect. In this paper, we focus primarily on the near OOD detection task and investigate why the MD method fails in these cases. We propose relative Mahalanobis distance (RMD), a simple fix to the MD, and demonstrate its effectiveness in multiple near-OOD tasks. Our solution is as simple to use as MD, and it does not require any complicated re-training or OOD training data.
In this section, we briefly review the Mahalanobis distance method and introduce our proposed modifications to make it effective for near-OOD detection tasks.
Mahalanobis distance based OOD detection
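The Mahalanobis distance (MD) method [lee2018simple] uses intermediate feature maps of a trained deep neural network. The most common choice for the feature map is the output of the penultimate layer, just before the classification layer. Let us denote this feature map as $\mathbf{z}_i = f(\mathbf{x}_i)$ for an input $\mathbf{x}_i$. For an in-distribution dataset with $K$ unique classes, the MD method fits $K$ class-conditional Gaussian distributions $\mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})$, $k = 1, \dots, K$, to each of the $K$ in-distribution classes based on the feature maps $\mathbf{z}_i$. The parameters are estimated as $\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{i: y_i = k} \mathbf{z}_i$ for $k = 1, \dots, K$, and $\boldsymbol{\Sigma} = \frac{1}{N} \sum_{k=1}^{K} \sum_{i: y_i = k} (\mathbf{z}_i - \boldsymbol{\mu}_k)(\mathbf{z}_i - \boldsymbol{\mu}_k)^\top$, where $N_k$ is the number of training inputs in class $k$ and $N = \sum_k N_k$. Note that the class-conditional means are independent for each class, while the covariance matrix is shared by all classes to avoid under-fitting errors.
For a test input $\mathbf{x}'$, the method computes the Mahalanobis distances from the feature map $\mathbf{z}' = f(\mathbf{x}')$ to each of the $K$ fitted in-distribution Gaussian distributions $\mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})$, $k \in \{1, \dots, K\}$. The minimum of the distances over all classes indicates the uncertainty score $\mathcal{U}(\mathbf{x}')$ and its negative indicates the confidence score $\mathcal{C}(\mathbf{x}') = -\mathcal{U}(\mathbf{x}')$. These are computed as
$$\mathrm{MD}_k(\mathbf{z}') = (\mathbf{z}' - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1} (\mathbf{z}' - \boldsymbol{\mu}_k), \qquad \mathcal{C}(\mathbf{x}') = -\min_k \{\mathrm{MD}_k(\mathbf{z}')\}.$$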
This confidence score is used as a signal to classify a test input as an in-distribution or OOD sample.
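The fitting and scoring steps described above can be sketched in NumPy as follows. This is a minimal illustration rather than the reference implementation: it assumes the feature maps have already been extracted into arrays, the function names `fit_class_gaussians` and `md_confidence` are hypothetical, and the toy data at the bottom is synthetic.

```python
import numpy as np

def fit_class_gaussians(feats, labels):
    """Per-class means and a single covariance matrix shared by all classes."""
    classes = np.unique(labels)
    means = np.stack([feats[labels == k].mean(axis=0) for k in classes])
    # Centre every sample on its own class mean, then pool over classes.
    centred = feats - means[np.searchsorted(classes, labels)]
    cov = centred.T @ centred / len(feats)
    return means, cov

def md_confidence(test_feats, means, cov):
    """Confidence = negative of the smallest class-wise Mahalanobis distance."""
    prec = np.linalg.pinv(cov)
    diff = test_feats[:, None, :] - means[None, :, :]    # shape (n, K, D)
    md = ((diff @ prec) * diff).sum(axis=-1)              # shape (n, K)
    return -md.min(axis=1)

# Synthetic toy data (assumption): two well-separated classes in 4 dimensions.
rng = np.random.default_rng(0)
train = np.concatenate([rng.normal(-3, 1, (200, 4)), rng.normal(3, 1, (200, 4))])
labels = np.repeat([0, 1], 200)
means, cov = fit_class_gaussians(train, labels)
ind_conf = md_confidence(rng.normal(3, 1, (100, 4)), means, cov)
ood_conf = md_confidence(rng.normal(12, 1, (100, 4)), means, cov)
```

In this far-OOD toy case the in-distribution inputs receive clearly higher confidence than the shifted ones, which is the regime where MD is expected to work well.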
Our proposed Relative Mahalanobis distance
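As we will demonstrate in Section 3 and Appendix D, OOD detection performance using MD degrades in near-OOD scenarios. We draw our inspiration from the prior work of [ren2019likelihood], which showed that the raw density from deep generative models may fail at OOD detection and proposed to fix this by using a likelihood ratio between two generative models (one modeling the sophisticated foreground distribution and the other modeling the background distribution) as the confidence score. Similarly, we propose the Relative Mahalanobis distance (RMD), defined as
$$\mathrm{RMD}_k(\mathbf{z}') = \mathrm{MD}_k(\mathbf{z}') - \mathrm{MD}_0(\mathbf{z}').$$
Here, $\mathrm{MD}_k(\mathbf{z}')$ is the Mahalanobis distance of the test feature map $\mathbf{z}'$ to the Gaussian fitted to class $k$, and $\mathrm{MD}_0(\mathbf{z}')$ indicates the Mahalanobis distance of $\mathbf{z}'$ to a distribution fitted to the entire training data without considering the class labels: $\mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$, where $\boldsymbol{\mu}_0 = \frac{1}{N} \sum_{i=1}^{N} \mathbf{z}_i$ and $\boldsymbol{\Sigma}_0 = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{z}_i - \boldsymbol{\mu}_0)(\mathbf{z}_i - \boldsymbol{\mu}_0)^\top$. This is a good proxy for the background distribution. The confidence score using RMD is given by
$$\mathcal{C}^{\mathrm{RMD}}(\mathbf{x}') = -\min_k \{\mathrm{RMD}_k(\mathbf{z}')\}.$$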
See Appendix A for the pseudocode.
RMD is equivalent to computing a likelihood ratio $\max_k \log\left(p_k(\mathbf{z}') / p_0(\mathbf{z}')\right)$, where $p_k$ is a Gaussian fit using class-specific data and $p_0$ is a Gaussian fit using data from all classes. Note that this can easily be extended to the case where $p_k$ and $p_0$ are represented by more powerful generative models such as flows [papamakarios2017masked; papamakarios2019normalizing].
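The equivalence can be made explicit with a short derivation. The sketch below assumes a class-conditional Gaussian $p_k = \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})$, a background Gaussian $p_0 = \mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$, and writes $D$ for the feature dimension (a symbol introduced here for illustration). Under these assumptions, RMD and the log likelihood ratio differ only by a factor of 2 and an additive constant that does not depend on the input, so they rank test inputs identically.

```latex
\begin{align}
-2 \log p_k(\mathbf{z}') &= \mathrm{MD}_k(\mathbf{z}') + \log|\boldsymbol{\Sigma}| + D \log 2\pi, \\
-2 \log p_0(\mathbf{z}') &= \mathrm{MD}_0(\mathbf{z}') + \log|\boldsymbol{\Sigma}_0| + D \log 2\pi, \\
\mathrm{RMD}_k(\mathbf{z}') = \mathrm{MD}_k(\mathbf{z}') - \mathrm{MD}_0(\mathbf{z}')
  &= -2 \log \frac{p_k(\mathbf{z}')}{p_0(\mathbf{z}')} + \log \frac{|\boldsymbol{\Sigma}_0|}{|\boldsymbol{\Sigma}|}, \\
-\min_k \mathrm{RMD}_k(\mathbf{z}')
  &= 2 \max_k \log \frac{p_k(\mathbf{z}')}{p_0(\mathbf{z}')} - \log \frac{|\boldsymbol{\Sigma}_0|}{|\boldsymbol{\Sigma}|}.
\end{align}
```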
Previous literature [kamoi2020mahalanobis] discussed a similar topic. However, that work mainly focused on far-OOD, and its proposed method, Partial Mahalanobis distance (PMD), requires a hyper-parameter (the number of eigen-bases to consider), while our method performs better for near-OOD and is hyper-parameter free. See Appendix C for the comparison of PMD and RMD.
3 Failure Modes of Mahalanobis distance
Figure 1: (a) Per-dimension Mahalanobis distance (top) and Relative Mahalanobis distance (bottom) along each eigen-basis. The solid lines represent the means over the IND and OOD test data respectively. The shading indicates the [10%, 90%] quantiles. The 120 top dimensions (before the red threshold) have distinct Mahalanobis distance between IND and OOD, while the later dimensions have similar Mahalanobis distances between IND and OOD, confounding the final score. (b) Histograms of the Mahalanobis distance and Relative Mahalanobis distance for IND and OOD.
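To better understand the failure mode of the Mahalanobis distance and to visualize its difference from the Relative Mahalanobis distance, we perform an eigen-analysis to understand how these methods weight each dimension [kamoi2020mahalanobis]. Specifically, we rewrite the Mahalanobis distance using the eigenvectors $\mathbf{v}_d$ of the covariance matrix $\boldsymbol{\Sigma}$ as $\mathrm{MD}(\mathbf{z}') = (\mathbf{z}' - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{z}' - \boldsymbol{\mu}) = \sum_{d=1}^{D} l_d^2 / \lambda_d$, where $D$ is the dimension of the feature map, $\lambda_d$ is the $d$-th eigenvalue, and $l_d = \mathbf{v}_d^\top (\mathbf{z}' - \boldsymbol{\mu})$ is the projected coordinate of $(\mathbf{z}' - \boldsymbol{\mu})$ onto the $d$-th eigen-basis, such that $l_d^2 / \lambda_d$ can be regarded as the 1D Mahalanobis distance from the projected coordinate $l_d$ to the 1D Gaussian distribution $\mathcal{N}(0, \lambda_d)$. The eigen-bases are independent of each other.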
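The per-dimension decomposition used in this eigen-analysis can be sketched as below. The helper name `md_per_eigen_dimension` and the random test matrix are illustrative assumptions; the final line computes the full Mahalanobis distance directly so that the two can be compared.

```python
import numpy as np

def md_per_eigen_dimension(z, mean, cov):
    """Split the Mahalanobis distance of z into one term per eigen-basis."""
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder: largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    proj = eigvecs.T @ (z - mean)                # projected coordinates
    return proj**2 / eigvals                     # 1D Mahalanobis distance per basis

# Synthetic example (assumption): a random symmetric positive-definite covariance.
rng = np.random.default_rng(0)
a = rng.normal(size=(5, 5))
cov = a @ a.T + 5 * np.eye(5)
mean = rng.normal(size=5)
z = rng.normal(size=5)

terms = md_per_eigen_dimension(z, mean, cov)
full_md = (z - mean) @ np.linalg.inv(cov) @ (z - mean)
```

Summing `terms` recovers `full_md` up to floating-point error, which is why the total distance can be read as the area under the per-dimension curve.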
In the CIFAR-100 vs CIFAR-10 experiment, we found that OOD inputs have a significantly greater mean distance (i.e. the average distance over the test samples) in the top 120 dimensions with the largest eigenvalues, while in the remaining dimensions the OOD inputs have a mean distance similar to that of the IND inputs (see Figure 1(a), top). Since the final Mahalanobis distance is the sum of the distances per dimension (this can be visualized as the area under the curve in Figure 1(a)), we see that the later dimensions contribute a significant portion of the final score, overwhelming the top dimensions and making it harder to distinguish OOD from IND (AUROC=74.98%).
Next we fit a class-independent 1D Gaussian as the background model in each dimension and compute RMD per dimension. As shown in Figure 1(a) (bottom), using RMD, the contributions of the later dimensions are reduced to nearly zero, while the top dimensions still provide a good distinction between IND and OOD. As a result, the AUROC using RMD improves to 81.08%.
We conjecture that the first 120 dimensions are discriminative features that contain different semantic meanings for different IND classes and OOD, while the remaining dimensions are common features shared by IND and OOD. To support our conjecture, we simulated a simple dataset following a high-dimensional Gaussian with a diagonal covariance matrix and different means for different classes. In particular, we set IND and OOD to have distinct means in the first dimension (discriminative feature) and the same mean in the remaining dimensions (non-discriminative features). Since MD is the sum over all the dimensions, the sum along the non-discriminative dimensions can overwhelm that of the discriminative dimension. As a result, the AUROC is only 83.13%. Using RMD, we remove the effect of the non-discriminative dimensions, since for those dimensions the estimated class-conditional and class-independent Gaussians nearly coincide and cancel each other, and we detect OOD perfectly with 100% AUROC.
4 Experiments and Results
As indicated in the previous section, in this work we primarily focus on near-OOD detection tasks. We choose the following established near-OOD setups: (i) CIFAR-100 vs. CIFAR-10, (ii) CIFAR-10 vs. CIFAR-100, (iii) the Genomics OOD benchmark [ren2019likelihood], and (iv) the CLINC Intent OOD benchmark [larson2019evaluation; liu2020simple]. As baselines, we compare our proposed RMD to the traditional MD and to the maximum of softmax probability (MSP) [hendrycks2016baseline], both of which work directly with out-of-the-box trained models. Note that most other OOD detection methods require re-training of the models and complicated hyper-parameter tuning, so we do not consider them for comparison. We also ablate over different choices of model architectures, with and without large-scale pre-trained networks. The results are presented in the following sections.
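For concreteness, the MSP baseline and the AUROC metric used throughout the experiments can be sketched as follows. This is an illustrative NumPy version under our own assumptions, not the evaluation code used for the reported numbers: the names `msp_confidence` and `auroc` are hypothetical, and AUROC is computed by direct pairwise comparison, which is only practical for moderately sized test sets.

```python
import numpy as np

def msp_confidence(logits):
    """Maximum softmax probability for each row of logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def auroc(ind_conf, ood_conf):
    """Probability that a random IND input scores higher than a random OOD input.

    Ties count as one half. Memory use grows with len(ind_conf) * len(ood_conf).
    """
    ind = np.asarray(ind_conf, dtype=float)[:, None]
    ood = np.asarray(ood_conf, dtype=float)[None, :]
    return float((ind > ood).mean() + 0.5 * (ind == ood).mean())

# Toy logits (assumption): one confident prediction, one uniform prediction.
logits = np.array([[4.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
conf = msp_confidence(logits)
```

The same `auroc` helper applies unchanged to MD and RMD confidence scores, since all three methods produce one scalar confidence per test input.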
4.1 Models without pre-training
In this section, we train our models from scratch using the in-distribution data. For the CIFAR-10/100 tasks we use a Wide ResNet 28-10 architecture as the backbone. For the Genomics OOD benchmark we use a 1D CNN architecture consistent with [ren2019likelihood]. For all benchmarks, at the end of training, we extract the feature maps for the test IND and OOD inputs, evaluate the OOD performance of our proposed RMD, and compare it with MD and MSP. As seen in Table 1, contrasting MD and RMD, we observe a consistent improvement in AUROC for all benchmarks, with gains ranging from 1.2 points to 15.8 points. Comparing RMD to MSP, we observe a significant gain of 2.5 points for the Genomics OOD benchmark and smaller gains for the CIFAR-10/100 benchmarks. This substantiates our claim that our proposed RMD boosts near-OOD detection performance.
Table 1: AUROC for near-OOD detection using models trained from scratch.
|Benchmark||MD||RMD||MSP|
|CIFAR-100 vs CIFAR-10||74.91%||81.01%||80.14%|
|CIFAR-10 vs CIFAR-100||88.49%||89.71%||89.27%|
|Genomics OOD||53.10%¹||68.98%||66.53%|
¹ We observed that the AUROC for MD changes a lot during training of the 1D CNN genomics model. We report the performance based on the model checkpoint at the end of training, without any hyperparameter tuning using the validation set. See Section 4.3 for details.
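Using flows for $p_k$ and $p_0$
To demonstrate that our proposed idea can be extended to more powerful density models, we fit the feature maps using a one-layer masked auto-regressive flow [papamakarios2017masked] for the CIFAR-100 vs CIFAR-10 benchmark, with $p_k$ the class-conditional density and $p_0$ the density fit to data from all classes. The AUROCs for using $\max_k \log p_k(\mathbf{z}')$ and $\max_k \log\left(p_k(\mathbf{z}') / p_0(\mathbf{z}')\right)$ are 76.10% and 78.34% respectively, showing that our proposal works for non-Gaussian density models as well.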
4.2 Models with pre-training
Massive models pre-trained on large-scale datasets are becoming standard practice in modern image recognition and language classification tasks. It has been shown that the high-quality features learnt during this pre-training stage can be very useful in boosting the performance of the downstream task [hendrycks2019using; paul2021vision; fort2021exploring]. In this section, we investigate whether such high-quality representations also aid OOD detection and how our proposed RMD performs in such a setting, using different pre-trained models as the architectural backbone for OOD detection. Specifically, we consider Vision Transformer (ViT) [dosovitskiy2020image], Big Transfer (BiT) [kolesnikov2019big], and CLIP [radford2021learning] for the CIFAR-10/100 benchmarks, and the unsupervised BERT-style pre-training model [devlin2018bert] for the genomics and CLINC benchmarks (the BERT model used for the genomics benchmark is pre-trained on the genomics data with the standard masked language modeling method).
We investigate two settings: (i) directly using pre-trained models for OOD detection and (ii) fine-tuning the pre-trained model on the in-distribution dataset for OOD detection.
Pre-trained models without fine-tuning
We present our results in Table 2, comparing MD and RMD for all benchmarks using different pre-trained models. Note that here we cannot evaluate MSP, as the network was never trained to produce predictive probabilities. First, we observe that, even without task-specific fine-tuning, the AUROC scores are close to or better than those in Table 1, indicating that pre-trained models work well for OOD detection out of the box. Second, we observe that RMD outperforms MD for all benchmarks and all pre-trained models, with margins ranging from 3.17 points to 16.5 points. For the CIFAR-100 vs CIFAR-10 benchmark, BiT models provide the best performance, followed by CLIP and Vision Transformer. BiT with RMD achieves a significantly higher AUROC (84.60%) than the Wide ResNet baseline model (81.01%). For CIFAR-10 vs CIFAR-100, using pre-trained CLIP, RMD achieves 91.19% AUROC, higher than any of the other methods considered. Finally, it is worth noting that the gains provided by RMD are very prominent for the genomics and CLINC intent benchmarks when using BERT pre-trained features.
Table 2: AUROC for pre-trained models without fine-tuning.
|Benchmark||MD||RMD|
|ViT-B16 Pre-trained|
|CIFAR-100 vs CIFAR-10||67.19%||79.91%|
|CIFAR-10 vs CIFAR-100||84.88%||89.73%|
|BiT R50x1 Pre-trained|
|CIFAR-100 vs CIFAR-10||81.37%||84.60%|
|CIFAR-10 vs CIFAR-100||86.70%||89.87%|
|CLIP Pre-trained|
|CIFAR-100 vs CIFAR-10||71.40%||81.83%|
|CIFAR-10 vs CIFAR-100||83.57%||91.19%|
|BERT Pre-trained|
|CLINC Intent OOD||75.48%||91.98%|
Pre-trained models with fine-tuning
We now explicitly fine-tune the pre-trained model on the in-distribution dataset, optimizing for classification accuracy. Using the fine-tuned models for the different benchmarks, we report the performance in Table 3, comparing RMD with the MD and MSP baselines. We see that the performance of MD improves significantly after fine-tuning (comparing Tables 2 and 3), suggesting that fine-tuning removes disruptive non-discriminative features that existed in the pre-trained models. MD achieves close or competitive AUROC when compared to RMD for most of the tasks evaluated, with the notable exception of genomics OOD (see Section 4.3). In light of the discussion in Section 3, we conjecture that after task-specific fine-tuning using labeled data, most of the features become discriminative between IND and OOD. It is also possible that the pre-training and fine-tuning regimes end up at better local minima, and that the resulting features are capable of modelling the foreground and background implicitly (without our explicit normalization using RMD). Therefore the effectiveness of RMD in such cases is limited.
Table 3: AUROC for pre-trained models after fine-tuning on the in-distribution data.
|Benchmark||MD||RMD||MSP|
|ViT-B16 Fine-tuned|
|CIFAR-100 vs CIFAR-10||94.42%||93.09%||92.30%|
|CIFAR-10 vs CIFAR-100||99.87%||98.82%||99.50%|
|BiT-M R50x1 Fine-tuned|
|CIFAR-100 vs CIFAR-10||81.37%||84.60%||81.04%|
|CIFAR-10 vs CIFAR-100||94.57%||94.94%||85.65%|
|BERT Fine-tuned|
|Genomics OOD||55.87%³||72.04%||72.02%|
|CLINC Intent OOD||97.92%||97.62%||96.99%|
³ We observed that the AUROC for MD changes a lot during fine-tuning. We report the performance based on the model checkpoint at the end of training. See Section 4.3 for details.
4.3 Relative Mahalanobis is more robust
In the genomics experiments, we noticed that the OOD performance of MD is quite unstable during training of the 1D CNN model and during fine-tuning of the BERT pre-trained model. The AUROC of MD increases during the early stages of training and then decreases at later stages. Figure 2 shows the change of AUROCs for MD and RMD during the training of the 1D CNN model. The AUROC of MD quickly increases to 66.19% at step 50k, when the model is not yet well trained, with training and test accuracies of 88.59% and 82.20% respectively. As the model trains further and achieves a higher training accuracy of 99.96% and a higher test accuracy of 85.71% at step 500k, the AUROC for MD drops to 53.10%. On the other hand, the AUROC of RMD increases as the training and test accuracies increase, and stabilizes as the accuracy stabilizes, which is a more desirable property. We observed the same phenomenon in the fine-tuning of the BERT genomics model. At the early training stage, the AUROC for MD reaches a peak of 77.49%, while the model is not yet well trained, with training and test accuracies of only 82.62% and 83.97% respectively.
We thank Zack Nado and D. Sculley for helpful feedback.
Appendix A Pseudocode for Relative Mahalanobis distance
The pseudocode for our method is shown in Algorithm 1.
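As a companion to the pseudocode, the following NumPy sketch follows the steps described in Section 2: fit class-conditional Gaussians with a shared covariance, fit one background Gaussian to all training features, and score test features by the negative minimum relative distance. It is a minimal illustration under our own assumptions (pre-extracted features, hypothetical function names `fit_rmd` and `rmd_confidence`, pseudo-inverse for the precision matrices, synthetic toy data), not the authors' released code.

```python
import numpy as np

def fit_rmd(feats, labels):
    """Fit K class Gaussians (shared covariance) and one background Gaussian."""
    classes = np.unique(labels)
    means = np.stack([feats[labels == k].mean(axis=0) for k in classes])
    centred = feats - means[np.searchsorted(classes, labels)]
    cov = centred.T @ centred / len(feats)                  # shared class covariance
    mean0 = feats.mean(axis=0)                              # background mean
    centred0 = feats - mean0
    cov0 = centred0.T @ centred0 / len(feats)               # background covariance
    return means, np.linalg.pinv(cov), mean0, np.linalg.pinv(cov0)

def rmd_confidence(test_feats, means, prec, mean0, prec0):
    """Confidence = -min_k (MD_k - MD_0) for every test feature vector."""
    diff = test_feats[:, None, :] - means[None, :, :]       # shape (n, K, D)
    md_k = ((diff @ prec) * diff).sum(axis=-1)              # shape (n, K)
    diff0 = test_feats - mean0
    md_0 = ((diff0 @ prec0) * diff0).sum(axis=-1)           # shape (n,)
    rmd = md_k - md_0[:, None]
    return -rmd.min(axis=1)

# Synthetic toy features (assumption): two classes in 3 dimensions.
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(-2, 1, (300, 3)), rng.normal(2, 1, (300, 3))])
labels = np.repeat([0, 1], 300)
params = fit_rmd(feats, labels)
scores = rmd_confidence(rng.normal(2, 1, (50, 3)), *params)
```

Because the background term does not depend on the class, subtracting it leaves the predicted class unchanged and only shifts the per-input confidence.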
Appendix B Additional Experimental Details
For the CIFAR-10/100 experiments, we first train a Wide ResNet 28-10 model (https://github.com/google/uncertainty-baselines/blob/master/baselines/cifar/deterministic.py) from scratch using the in-distribution data. Next we use the publicly available pre-trained models ViT-B16 (https://github.com/google-research/vision_transformer), BiT R50x1 (https://github.com/google-research/big_transfer), and CLIP (https://github.com/openai/CLIP), replace the last layer with a classification head, and fine-tune the full models using in-distribution data. We do not fine-tune the CLIP model since CLIP requires paired (text, image) data for training. The fine-tuned ViT model has an in-distribution test accuracy of 89.91% for CIFAR-100 and 97.48% for CIFAR-10. The fine-tuned BiT model has an in-distribution test accuracy of 86.89% for CIFAR-100 and 97.66% for CIFAR-10.
For the genomics OOD benchmark, the dataset is available at Tensorflow Datasets (https://www.tensorflow.org/datasets/catalog/genomics_ood). The dataset contains 10 in-distribution bacteria classes and 60 OOD classes, and the input is a fixed-length sequence of 250 base pairs composed of the letters A, C, G and T. We first train a 1D CNN of 2000 filters of length 20 from scratch using the in-distribution data. We train the model for 1 million steps using the learning rate of and Adam optimizer. Next we pre-train a BERT-style model by randomly masking the input tokens and predicting the masked tokens using the output of the transformer encoder. The model is trained using the unlabeled training and validation data. The prediction accuracy for the masked token is 48.35%. At the fine-tuning stage, the model is fine-tuned using the in-distribution training data for 100,000 steps at the learning rate of , and the classification accuracy is 89.84%.
For CLINC Intent OOD, we use a standard BERT pre-trained model (https://github.com/google/uncertainty-baselines/blob/master/baselines/clinc_intent/deterministic.py), and fine-tune it using the in-distribution CLINC data for 3 epochs with the learning rate of. The classification accuracy is 96.53%.
Appendix C Performance of Partial Mahalanobis distance
We compare our method with the Partial Mahalanobis distance (PMD) proposed in [kamoi2020mahalanobis]. PMD uses a subset $S$ of the eigen-bases to compute the distance score, $\mathrm{PMD}(\mathbf{z}') = \sum_{d \in S} l_d^2 / \lambda_d$, where $\lambda_d$ is the $d$-th eigenvalue of the covariance matrix and $l_d$ is the projection of the centred feature map onto the $d$-th eigen-basis. Although $S$ can be any subset of $\{1, \dots, D\}$, it was recommended to use the eigen-bases corresponding to the largest or the smallest eigenvalues. We compare our RMD method with the two versions of PMD using the benchmark task of CIFAR-100 vs CIFAR-10. Since there is a hyperparameter involved in PMD, we search from . Figure 3(a) shows the AUROC when using the top eigen-bases to compute PMD. The AUROC increases as the number of eigen-bases increases and reaches a peak of 79.72% at , and then decreases when including more dimensions. Therefore the performance of the PMD method depends on the choice of the number of eigen-bases, while our method RMD is hyperparameter-free. Our method also achieves a slightly higher AUROC of 81.08% than the peak value for PMD.
We also investigate the performance of PMD when using the eigen-bases corresponding to the smallest eigenvalues (Figure 3(b)). The AUROC decreases as we exclude the top eigen-bases from the set, suggesting that the top eigen-bases are more important for near-OOD detection. This observation supports our conjecture in Section 3 that the top eigen-bases are discriminative features and the rest are common features shared by IND and OOD.
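The top-eigen-basis variant of PMD discussed above can be sketched as a truncated version of the eigen-decomposition of the Mahalanobis distance. The function name `partial_md` and the parameter `num_top` are hypothetical names introduced for this illustration, and the diagonal covariance is a hand-picked toy example.

```python
import numpy as np

def partial_md(z, mean, cov, num_top):
    """Sum the per-basis distances over the num_top largest-eigenvalue bases."""
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:num_top]       # indices of the top bases
    proj = eigvecs[:, order].T @ (z - mean)
    return float(np.sum(proj**2 / eigvals[order]))

# Toy example (assumption): diagonal covariance with eigenvalues 4, 1 and 0.25.
cov = np.diag([4.0, 1.0, 0.25])
mean = np.zeros(3)
z = np.array([2.0, 1.0, 1.0])

pmd_top1 = partial_md(z, mean, cov, 1)   # keeps only the eigenvalue-4 direction
pmd_all = partial_md(z, mean, cov, 3)    # all bases: the full Mahalanobis distance
```

In this toy case the smallest-eigenvalue direction contributes most of the full distance, which mirrors how low-variance dimensions can dominate the MD score.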
Another variant of the Mahalanobis distance called Marginal Mahalanobis distance (MMD) was also proposed in [kamoi2020mahalanobis]. It fits a single Gaussian distribution to all the training data ignoring class, the same as we define the background model in our RMD. Though it has a good performance for far-OOD tasks (e.g. CIFAR-10 vs SVHN) [kamoi2020mahalanobis], it does not perform well for the near-OOD tasks, with AUROC being only 52.88% for CIFAR-100 vs CIFAR-10, and 83.81% for CIFAR-10 vs CIFAR-100.
Appendix D Simulation study for the failure mode of Mahalanobis distance
We use a simple simulation to demonstrate the failure mode of Mahalanobis distance. We simulate a binary classification problem where the two classes follow a high-dimensional Gaussian distribution with different means. Specifically, , where the covariance matrix is a fixed diagonal matrix with the scalar . The mean vector has only the first dimension non-zero. To distinguish the two classes, we set for the first class, for the second class, and . The key idea is that only the first dimension is a discriminative feature that is class-specific, whereas the remaining dimensions are non-discriminative common features that are shared by all classes. We set the number of dimensions . To simplify the problem, we set the covariance matrix to be diagonal such that the feature dimensions are independent.
For each of the classes, we randomly sample data points from the given distribution for training data. For test data, we sample data points from each class as the test IND data. For test OOD data, we set and and sample data points from each of them. Figure 4(a) shows the histograms of the first dimension of the IND and OOD data. The IND and OOD data points are well separated by the first-dimension feature. Figure 4(b) shows the histogram of the remaining dimensions . The IND and OOD data points are not separable there, since they follow the Gaussian distribution with the same mean.
For simplicity, we first treat the input $\mathbf{x}$ as the feature map $\mathbf{z}$. We fit a class-conditional Gaussian using the training data, and compute the MD for each of the test inputs. We find that although OOD inputs in general have a greater distance than IND inputs, the two distributions largely overlap. See Figure 4(c) for details.
The reason behind the failure mode is simple. Since the dimensions are independent, the log-likelihood of an input is the sum of the log-likelihoods of each individual dimension, i.e., $\log p(\mathbf{x}) = \sum_{j} \log p(x_j)$. For the discriminative feature $x_1$, the distributions of IND and OOD are different, so approximately $p(x_1^{\mathrm{IND}}) > p(x_1^{\mathrm{OOD}})$. However, the remaining non-discriminative features are class-independent, and both IND and OOD inputs follow the same distribution. Thus the likelihood of IND inputs based on those features will be indistinguishable from that of OOD inputs, i.e., $p(x_j^{\mathrm{IND}}) \approx p(x_j^{\mathrm{OOD}})$ for $j \geq 2$. When the number of non-discriminative features is much greater than the number of discriminative features, the log-likelihood of the former will overwhelm the latter.
Next we compute the RMD. We fit a class-independent Gaussian distribution using the training data regardless of the class labels, and compute the Relative Mahalanobis distance based on $p_k$ (class-conditional Gaussian distribution) and $p_0$ (class-independent Gaussian distribution) for each of the test inputs. Using our proposed method, we are able to perfectly separate IND and OOD test inputs. See Figure 4(d).
The class-independent Gaussian helps to remove the effect of the non-discriminative features. Specifically, since those non-discriminative features are class independent, the fitted class-conditional Gaussian is close to the fitted class-independent Gaussian, i.e., $p_k(x_j) \approx p_0(x_j)$ for $j \geq 2$. Thus the two values cancel each other in the RMD computation, resulting in a contribution close to zero. For the discriminative feature, the fitted class-conditional Gaussian is very different from the fitted class-independent Gaussian. For IND inputs, $p_k(x_1) > p_0(x_1)$, since the class-conditional Gaussian fits the IND data better. For OOD inputs, the difference between the two is nearly zero, since neither of the two distributions fits the OOD data. Therefore RMD provides a better separation between IND and OOD, as we have seen in Figure 4(d).
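The Gaussian simulation described above can be reproduced in spirit with the sketch below. All numeric settings (400 dimensions, unit variance, class means at ±4 and OOD means at ±10 in the first dimension, 2000 training and 1000 test points per group) are assumptions chosen for illustration and are not the values used for the reported results; the helper names are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, A, B = 400, 4.0, 10.0          # assumed: dimensions, IND offset, OOD offset
N_TRAIN, N_TEST = 2000, 1000      # assumed: samples per class / per OOD group

def sample(first_dim_mean, n):
    """Unit-variance Gaussian whose mean is non-zero only in the first dimension."""
    x = rng.normal(size=(n, D))
    x[:, 0] += first_dim_mean
    return x

def sq_dist(x, mean, prec):
    """Mahalanobis distance of each row of x to a Gaussian (mean, precision)."""
    diff = x - mean
    return ((diff @ prec) * diff).sum(axis=1)

def auroc(ind_conf, ood_conf):
    """Fraction of (IND, OOD) pairs in which the IND input is more confident."""
    return float((ind_conf[:, None] > ood_conf[None, :]).mean())

train = np.concatenate([sample(-A, N_TRAIN), sample(A, N_TRAIN)])
labels = np.repeat([0, 1], N_TRAIN)
test_ind = np.concatenate([sample(-A, N_TEST), sample(A, N_TEST)])
test_ood = np.concatenate([sample(-B, N_TEST), sample(B, N_TEST)])

# Class-conditional Gaussians with a shared covariance.
means = [train[labels == k].mean(axis=0) for k in (0, 1)]
centred = np.concatenate([train[labels == k] - means[k] for k in (0, 1)])
prec = np.linalg.inv(centred.T @ centred / len(train))
# Class-independent (background) Gaussian fitted to all training data.
mean0 = train.mean(axis=0)
prec0 = np.linalg.inv(np.cov(train.T, bias=True))

def scores(x):
    """Return the MD and RMD confidence scores for each row of x."""
    md = np.stack([sq_dist(x, m, prec) for m in means], axis=1)
    rmd = md - sq_dist(x, mean0, prec0)[:, None]
    return -md.min(axis=1), -rmd.min(axis=1)

md_ind, rmd_ind = scores(test_ind)
md_ood, rmd_ood = scores(test_ood)
md_auroc, rmd_auroc = auroc(md_ind, md_ood), auroc(rmd_ind, rmd_ood)
print(f"MD AUROC  = {md_auroc:.3f}")
print(f"RMD AUROC = {rmd_auroc:.3f}")
```

Under these assumed settings the noise accumulated over the 399 shared dimensions blurs the MD scores, whereas the RMD score depends almost entirely on the first dimension and separates the two groups much more cleanly.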
To mimic the real scenario where the feature maps are the extracted features from the neural networks, we train simple one-layer neural networks for this binary classification task. We retrieve the feature maps of the training data, fit a class conditional Gaussian and compute MD for the test inputs. We observed the same failure mode for this case; the distributions of MD between IND and OOD largely overlap. Then we fit a class independent Gaussian and compute RMD. Using RMD, we again recover the perfect separation between the two. We expect that the intermediate layer for image, text, and genomics models also contain non-discriminative features. Therefore our proposed method is useful for overcoming this effect and improving the performance of near-OOD detection.