Detecting semantic anomalies

08/13/2019 ∙ by Faruk Ahmed, et al. ∙ Université de Montréal

We critically appraise the recent interest in out-of-distribution (OOD) detection, questioning the practical relevance of existing benchmarks. While the currently prevalent trend is to consider different datasets as OOD, we posit that out-distributions of practical interest are ones where the distinction is semantic in nature, and evaluative tasks should reflect this more closely. Assuming a context of computer vision object recognition problems, we then recommend a set of benchmarks which we motivate by referencing practical applications of anomaly detection. Finally, we explore a multi-task learning based approach which suggests that auxiliary objectives for improved semantic awareness can result in improved semantic anomaly detection, with accompanying generalization benefits.

1 Introduction

In recent years, concerns have been raised about modern neural-network-based classification systems providing incorrect predictions with equally high confidence Guo et al. (2017). A possibly related finding is that classification-trained CNNs find it much easier to overfit to low-level properties such as texture Geirhos et al. (2019), canonical pose Alcorn et al. (2019), or contextual cues Beery et al. (2018), rather than learning globally coherent characteristics of objects. A consequent worry is that such a classifier, trained on data sampled from a particular distribution, is likely to be misleading when it encounters novel situations in deployment. For example, silent failure might occur when an unknown object is categorized, with equal confidence, into a known category because the classifier has overfit to local biases. This last concern is one of the primary motivations for wanting to detect when test data comes from a different distribution than the training data. This problem has recently been dubbed out-of-distribution (OOD) detection in Hendrycks and Gimpel (2017), and is also synonymously referred to as anomaly detection or novelty detection in the contemporary deep learning context. Evaluation is typically carried out with benchmarks of the style proposed in Hendrycks and Gimpel (2017), where different datasets are treated as OOD after training on a particular in-distribution dataset. This area of research has since developed steadily Liang et al. (2018); Hendrycks et al. (2019); DeVries and Taylor (2018); Shalev et al. (2018); Lee et al. (2018), with similarly themed additions to the tasks and improved results in OOD detection on them.

Current benchmarks are ill-motivated

Despite such tasks rapidly becoming the standard benchmark for OOD detection in the community, we suggest that, taken as a whole, they are not very well motivated. For example, a classifier trained on the object recognition dataset CIFAR-10 (consisting of images with objects in the foreground) is typically tested against noise, or against different datasets such as downsampled LSUN (a dataset of scenes), SVHN (a dataset of house numbers), or tiny-Imagenet (a different dataset of objects). For the simpler cases of noise, scenes, or numbers, low-level image statistics are sufficient to tell the distributions apart. While the latter choice of tiny-Imagenet might seem reasonable, it has been noted that particular datasets carry particular biases tied to their data collection and curation quirks Torralba and Efros (2011); Tommasi et al. (2017), which renders the practice of treating different datasets as OOD questionable. It is possible that we are only getting better at distinguishing such idiosyncrasies [28] (also see Appendix C). Perhaps because these tasks are intrinsically trivial, most approaches report very flattering performance on them.

Semanticity is relevant

Therefore, we call into question the practical relevance of these benchmarks, which are currently treated as standard by the community. While they might have some value as a testbed for diagnosing peculiar pathologies Nalisnick et al. (2019), we argue that in practical contexts of detecting novelty, the kind of novelty that is of primary concern is semantic in nature. The implicit goal of the current style of benchmarks is detecting a wide variety of covariate shifts, most of which consist of extraneous factors. We suggest that this is misguided: non-semantic covariate shift is something we should be robust to, while semantic covariate shift should be flagged down. For example, a shift in low-level properties, such as lighting, or even a contextual one, such as a shift in background detail, should not deter an object recognizer from performing a correct categorization. Quite the contrary, we would like our classifiers to systematically generalize Fodor and Pylyshyn (1988); Lake and Baroni (2018); Bahdanau et al. (2019), arriving at decisions by assimilating a globally coherent understanding of the relevant semantics for the task. For example, object recognizers should successfully recognize a cow even when it is sitting on a seashore, without getting confused by the atypical context Beery et al. (2018). Similarly, we would like action classifiers to correctly identify an activity as running even when the running person is wearing boxing gloves, and not misclassify the action due to having associated boxing only with the presence of boxing gloves; a scene should be correctly categorized as a bedroom by a scene classifier even if there is a large picture of a church on the wall.

Context matters

The connotations carried by the terms anomaly detection and novelty detection are unclear without context, and convey meaning only after acknowledging the semantics of interest for a given task. For example, in scene classification, the semantic of interest is the composition of components from across the entire frame. In the context of such a task, a kitchen with a bed in the middle is an anomalous observation (if such a composition has not been previously observed). However, when performing object recognition, the relevant semantic is no longer the composition of components but the identity of the foreground object, and now the unusual context should not prevent correct object recognition. Similarly, in activity recognition from videos, the relevant semantic is the temporal evolution of the action across frames. The task of out-of-distribution detection as presented in Hendrycks and Gimpel (2017), presumably intended to be broader and more flexible, suffers from imprecision due to this context-free breadth. Being unclear about the nature of the out-distribution leads to a corresponding lack of clarity about the implications of the methodologies underlying proposed approaches. The current benchmarks and approaches therefore carry a risk of misalignment between evaluative performance and field performance in practical anomaly detection problems.

In this paper, we offer the following contributions.

1. Semantic anomalies are what matter: We provide a motivating discussion of the relevance of semanticity in the context of a task when thinking about practical anomaly detection problems.

2. More meaningful benchmarks for anomaly detection: Based on this discussion, and on real-life examples of anomaly detection (Section 2), we recommend a style of benchmarks that is more pragmatic than the currently adopted set of tasks. In our application-specific discussions, we presume the context of an object recognition task, but the theme applies generally.

3. Improved semantic awareness improves semantic anomaly detection: Following our discussion of the importance of semanticity, we investigate the effectiveness of multi-task learning with auxiliary self-supervised objectives (Sections 4 and 5). We evaluate representative anomaly detection methods on our recommended benchmarks, and demonstrate that such augmented objectives can result in improved anomaly detection, as well as accompanying generalization benefits.

2 Motivation and proposed tasks

In order to develop meaningful benchmarks, we begin by considering some practical applications where being able to detect anomalies, in the context of classification tasks, would find use.

Nature studies and monitoring: Biodiversity scientists want to keep track of the variety and statistics of species across the world. Online tools such as iNaturalist [27] enable photo-based classification of pictures uploaded by naturalists and subsequent cataloguing in data repositories. In such automated detection tools, a potentially novel species should result in a request for expert help rather than misclassification into a known species, and detection of undiscovered species is in fact a task of interest. A similar practical application is camera-trap monitoring of members of an ecosystem, notifying caretakers upon detection of invasive species Fedor et al. (2009); Willi et al. (2019). Taxonomy of collected specimens is often backlogged due to the human labour involved. Automating digitization and identification can help catch up, and new species are often brought to light through the process Carranza-Rojas et al. (2017), which obviously depends on effective detection of novel specimens.

Medical diagnosis and clinical microbiology: Online medical diagnosis tools such as Chester Cohen et al. (2019) can be impactful at improving healthcare levels worldwide. Such tools should be especially adept at knowing when they are faced with a novel pathology, rather than categorizing it into a known subtype. Similar desiderata apply to quickly detecting new strains of pathogens when machine learning systems are used to automate clinical identification in the microbiology lab Zieliński et al. (2017).

AI safety: Amodei et al. (2016) discuss the problem of distributional shift in the context of autonomous agents operating in our midst, with examples of actions that do not translate well across domains. A similar example in the vein of Amodei et al. (2016), grounded in a computer vision classification task, is the contrived scenario of encountering a novel vehicle (one that follows different dynamics of motion), which might lead to a dangerous decision by a self-driving car that fails to recognize its unfamiliarity.

Having compiled the examples above, we can now try to come up with a set of tasks more aligned with realistic applications. The basic assumptions we make about possible evaluative tasks are: (i) that anomalies of practical interest are semantic in nature; (ii) that they are relatively rare events whose correct detection is of greater relevance than minimizing false positives; and (iii) that we do not have access to examples of anomalies (as some existing works assume Liang et al. (2018); Lee et al. (2018)). These assumptions guide our choice of benchmarks and evaluation.

Subset Number of members Total training images Total test images
Dog (hound dog) 12 14864 600
Car 10 13000 500
Snake (colubrid snake) 9 11700 450
Spider 6 7800 300
Fungus 6 7800 300
Table 1: Sizes of proposed benchmark subsets from ILSVRC2012. Sample images are in the Appendix. The training set consists of roughly 1300 images per member, and 50 images per member in the test set (which come from the validation set images in the ILSVRC2012 dataset).

Proposed benchmarks

A very small number of recent works Akcay et al. (2018); Zenati et al. (2018) have considered a case that is more aligned with the goals stated above. Namely, for a choice of dataset, for example MNIST, train as many versions of the classifier as there are classes, holding out one class each time. At evaluation time, score the ability to detect the held-out class as anomalous. This setup is more clearly related to the task of detecting semantic anomalies: it holds dataset-bias factors invariant to a significantly greater extent, and it incorporates sufficient variability (every class is held out once) to be maximally informative and diagnostic. In this paper, we explore this setting with CIFAR-10 and STL-10, and recommend it as the default benchmark for evaluating anomaly detection in the context of object recognition. Similar setups apply to different contexts.
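A minimal sketch of this hold-out-class protocol, assuming torchvision's CIFAR-10 loader, is given below; the function and variable names are ours and are not taken from the released code.

```python
# Hypothetical sketch of constructing one hold-out-class split of CIFAR-10.
import numpy as np
from torchvision import datasets

def holdout_split(root, held_out_class):
    train = datasets.CIFAR10(root, train=True, download=True)
    test = datasets.CIFAR10(root, train=False, download=True)

    # Train only on the nine remaining classes (labels would be remapped to
    # 0..8 before fitting a 9-way classifier).
    train_targets = np.array(train.targets)
    keep = train_targets != held_out_class
    train_images, train_labels = train.data[keep], train_targets[keep]

    # At test time, the held-out class plays the role of the anomaly (label 1).
    anomaly_labels = (np.array(test.targets) == held_out_class).astype(int)
    return (train_images, train_labels), (test.data, anomaly_labels)

# One run per held-out class; detection scores are then averaged over all ten splits.
splits = [holdout_split("./data", c) for c in range(10)]
```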

While the hold-out-class setting for CIFAR-10 and STL-10 is a good setup for testing anomaly detection of disparate objects, a lot of applications, including some of the ones we described earlier, require detection of more fine-grained anomalies. For such situations, we propose a suite of tasks composed of subsets of Imagenet (ILSVRC2012 Russakovsky et al. (2015)) with fine-grained subcategories. For example, the spider subset consists of the members tarantula, Argiope aurantia, barn spider, black widow, garden spider, and wolf spider. We also propose fungus, dog, snake, and car subsets. These subsets have varied sizes, with some of them being fairly small (see Table 1). Although this is a significantly harder task, we believe this setting is aligned with the practical situations described above, where large quantities of labelled data are not always available and a particular fine-grained selection of categories is of interest. We emphasize that the point of such benchmarks, as with most benchmarks, is to provide a reasonably well-aligned and general setup for researchers to develop and investigate approaches to anomaly detection, while not being overwhelmingly expensive to run. For very particular tasks, such as the motivating examples described above, we advise practitioners to curate specific benchmarks on which to evaluate fine-tuned models.

Evaluation

Current works tend to use both the area under the Receiver Operating Characteristic curve (AUROC) and the area under the Precision-Recall curve (AUPRC) to evaluate performance at anomaly detection. The AUROC score is more easily interpretable, being a measure of how likely a randomly chosen positive example (an anomaly in our case) is to be scored higher than a randomly chosen negative one. However, in situations where positive examples are not only much rarer but also of primary interest for detection, AUROC scores are a poor reflection of detection performance; precision, which is the proportion of true positives among positive detections, is more relevant Fawcett (2006); Davis and Goadrich (2006); Avati et al. (2018). We do not inspect AUROC scores because in all of our settings normal examples significantly outnumber anomalous examples, and AUROC scores are known to be overly optimistic, and hence potentially misleading, on imbalanced datasets Davis and Goadrich (2006). Additionally, it has been shown in Davis and Goadrich (2006) that optimizing for performance under AUROC does not imply optimizing for AUPRC. Precision and recall are calculated as

$\mathrm{precision}(\tau) = \frac{TP(\tau)}{TP(\tau) + FP(\tau)}$, (1)

$\mathrm{recall}(\tau) = \frac{TP(\tau)}{TP(\tau) + FN(\tau)}$, (2)

where $TP$, $FP$, and $FN$ count true positives, false positives, and false negatives among examples whose anomaly score exceeds a threshold $\tau$, and a precision-recall curve is then defined as the set of precision-recall points

$\mathrm{PR} = \{(\mathrm{recall}(\tau), \mathrm{precision}(\tau)),\ -\infty < \tau < \infty\}$, (3)

where $\tau$ is a threshold parameter.

The area under the precision-recall curve is calculated by varying the threshold $\tau$ over a range spanning the data, creating a finite set of points on the PR curve. One alternative is to interpolate these points, producing a continuous curve as an approximation to the true curve, and computing the area under the interpolation by, for example, the trapezoid rule. Interpolation of a precision-recall curve can sometimes be misleading, as studied in Boyd et al. (2013), who recommend a number of more robust estimators. Here we use the standard approximation to average precision as the weighted mean of precisions at successive thresholds, weighted by the increase in recall from the previous threshold,

$\mathrm{AP} = \sum_n (R_n - R_{n-1}) P_n$, (4)

where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold; this is the implementation in Scikit-Learn Pedregosa et al. (2011).
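For concreteness, the sketch below computes this score with Scikit-Learn's average_precision_score; the MSP-style anomaly score and the toy inputs are illustrative rather than part of our pipeline.

```python
# Example: average precision (Equation 4) for an MSP-style anomaly score.
import numpy as np
from sklearn.metrics import average_precision_score

def msp_average_precision(softmax_probs, is_anomaly):
    """softmax_probs: (N, K) predictive distributions; is_anomaly: (N,) in {0, 1}."""
    # Lower maximum class probability => higher anomaly score.
    anomaly_score = 1.0 - softmax_probs.max(axis=1)
    return average_precision_score(is_anomaly, anomaly_score)

# Toy usage with random predictions; real inputs come from a trained classifier.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(9), size=1000)   # pretend 9-way classifier outputs
labels = rng.integers(0, 2, size=1000)         # pretend anomaly labels
print(msp_average_precision(probs, labels))
```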

3 Related work

Evaluation

As we have discussed earlier, the style of benchmarks widely adopted today follows the recommendation in Hendrycks and Gimpel (2017). Among follow-ups, the most significant successor has been Liang et al. (2018), which augmented the suite of tests with slightly more reasonable choices: for example, they consider tiny-Imagenet as out-of-distribution for in-distribution datasets such as CIFAR-10, which makes more sense than only comparing to noise and scene datasets, as in Hendrycks and Gimpel (2017). However, on closer inspection, we find that tiny-Imagenet shares semantic categories with CIFAR-10, such as species of {dogs, cats, frogs, birds}, so it is unclear how such choices of evaluative tasks correspond to realistic anomaly detection problems. Some work in the area of open-set recognition has come close to a realistic setup: in Bendale and Boult (2016), detection of novel categories is tested with a set of images corresponding to classes that were discontinued in later versions of Imagenet, but later work Dhamija et al. (2018) relapsed into treating very different datasets as novel. Also, using one particular split of a collection of unseen classes as OOD does not provide as varied and as finely diagnostic a view as holding out each class in turn, which is what we recommend. As mentioned earlier, a small number of works in OOD detection Akcay et al. (2018); Zenati et al. (2018) have already used the hold-out-class style of tasks for evaluation. Unfortunately, for lack of a motivated discussion of the choice, the community has ignored this and continues to adopt the set of benchmarks in Hendrycks and Gimpel (2017); Liang et al. (2018).

Approaches to OOD detection

In Hendrycks and Gimpel (2017), the most natural baseline for a trained classifier is presented, where the detection score is simply given by the predictive confidence of the classifier (MSP). Follow-up work in Liang et al. (2018) proposed adding a small amount of adversarial perturbation, followed by temperature scaling of the softmax (ODIN). There is no ablation study, but we have found temperature scaling to be responsible for most of the gains over Hendrycks and Gimpel (2017). Methodologically, the approach suffers from having to pick a temperature and a perturbation weight per anomaly-dataset. Confidence calibration has also been explored in DeVries and Taylor (2018), and was shown to improve complementary approaches like MSP and ODIN.

Using auxiliary datasets as surrogate anomalies has been shown to improve performance on existing benchmarks in Hendrycks et al. (2019). This approach is limited by its reliance on other datasets, but a more convincing variant has been explored in Lee et al. (2018), where a GAN is used to generate negative samples. However, Lee et al. (2018) suffers from the methodological issue discussed earlier, with hyperparameters being optimized per anomaly-dataset. We believe that such contentious practices arise from a lack of a clear discussion of the nature of the tasks we should be concerned with, and a lack of grounding in practical applications which would dictate proper methodology. The primary goal of our paper is to help fill this gap.

Among existing works, the closest in terms of approaching the essence of our discussion about semantic properties being the discriminative factor of interest is Shalev et al. (2018), where the training set is augmented with semantically similar labels. While we approve of the spirit, it is not always practical to assume access to a corpus providing such labels. This is especially true for the motivating practical applications we described earlier. In the next part of the paper, we explore a way to induce greater semantic awareness in a classifier, with the hope that such improved representations would lead to corresponding improvements in semantic anomaly detection and generalization.

4 Encouraging semantic representation with multi-task learning

We hypothesize that classifiers that learn representations oriented more toward capturing semantic properties, rather than overfitting to low-level properties, would naturally perform better at detecting semantic anomalies. Overfitting to low-level features such as colour or texture, without consideration of global coherence, can cause confusion in situations where the training data is not representative and unbiased. For a lot of existing datasets, it is quite possible to achieve good generalization performance without learning semantic distinctions, a possibility that spurs the search for removing algorithmic bias Zemel et al. (2013), and which is often exposed in embarrassing ways [29, 13]. As an extreme example, if the training and testing data consist of only one kind of animal, which happens to be furry, the classifier only needs to learn to detect fur texture, and can ignore other meaningful characteristics such as the number of legs. Such a system would fail to recognize another furry, but nine-legged, animal as anomalous, while achieving good test performance. Motivated by this line of thinking, we ask how we might encourage classifiers to learn more semantically meaningful representations.

Caruana (1993) describes how sharing parameters for learning multiple tasks, which are related in the sense of requiring similar features, can be a powerful tool for inducing domain-specific inductive biases in a learner. Hand-designing inductive biases requires complicated engineering, while using the training signal from a related task can be a much easier way to achieve similar goals. Even when related tasks are not explicitly available, it is often possible to construct one. A recent example of an application of this idea is Rei (2017), where the task of sequence labelling is augmented with the auxiliary self-supervised task of predicting neighbouring words, with shared representation-learning parameters. This was found to improve performance on the original task of sequence labelling.

CIFAR-10 STL-10
Classification only
Classification+rotation
Table 2: Multi-task augmentation with the self-supervised objective of predicting rotation improves generalization. Subsequently, we investigate whether this improvement also correlates with improved semantic understanding, as measured indirectly through the task of detecting semantic anomalies. (Results are averages over 3 trials.)

We explore such a framework for augmenting object recognition classifiers with auxiliary tasks that require semantic understanding. Recently, there has been significant interest in self-supervision applied to vision Doersch et al. (2015); Pathak et al. (2016); Noroozi and Favaro (2016); Zhang et al. (2017); van den Oord et al. (2018); Gidaris et al. (2018); Caron et al. (2018), exploring tasks that induce semantic representations without requiring expensive annotations. These tasks naturally lend themselves as auxiliary tasks for encouraging inductive biases towards semantic representations. Here, we use the recently introduced rotation task of Gidaris et al. (2018), which asks the learner to predict the orientation of a rotated input. This blends most readily into the standard multi-task learning setup, where representation-learning parameters are hard-shared across a common input space, with an extra task appended at the end. A shortcut solution is possible in such a setting, where the network learns to use one subset of the parameters in the shared representation to predict rotation and another to predict object category, leading to no improvement. Empirically, this does not appear to happen in our settings: in Table 2, we show significantly improved generalization performance of classifiers on CIFAR-10 and STL-10 when augmented with the auxiliary task of predicting rotation. Details of experimental settings, and the correspondence to anomaly detection, are in the next section.
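A minimal sketch of this hard-parameter-sharing setup, assuming a PyTorch-style encoder, is given below; the class and layer names are illustrative and not the exact Wide ResNet configuration used in our experiments.

```python
# Hypothetical sketch: shared encoder with a classification head and an auxiliary rotation head.
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, encoder, feat_dim, num_classes, num_rotations=4):
        super().__init__()
        self.encoder = encoder                                    # shared representation trunk
        self.classifier = nn.Linear(feat_dim, num_classes)        # object-recognition head
        self.rotation_head = nn.Linear(feat_dim, num_rotations)   # predicts 0/90/180/270 degrees

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h), self.rotation_head(h)
```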

We note that the application of strategies for encouraging more semantic representations through multi-task learning is complementary to the choice of scoring anomalies, which has been the focus of most existing work. Additionally, as in standard multi-task learning setups, it enables further augmentation with more auxiliary tasks Doersch and Zisserman (2017), which we leave for future empirical exploration.

5 Evaluation

We study the two existing representative baselines of maximum softmax probability (MSP) Hendrycks and Gimpel (2017) and ODIN Liang et al. (2018) on the proposed benchmarks. For ODIN, it is unclear how to choose the hyperparameters for temperature scaling and the weight of the adversarial perturbation without assuming access to anomalous examples, an assumption we consider unrealistic in most practical settings. We fix the temperature, using the most common setting in Liang et al. (2018), and the perturbation weight, using the setting there for CIFAR-80/20. We initially used the value suggested in Liang et al. (2018) for CIFAR-80/20, but found that this setting underperforms for our hold-out-class experiments on CIFAR-10.
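For reference, hedged sketches of the two scoring rules are shown below, assuming model(x) returns class logits; the temperature and perturbation defaults are illustrative placeholders rather than the exact values used in our experiments, and this is not the reference implementation of either method.

```python
# Illustrative MSP and ODIN-style anomaly scores (higher score => more anomalous).
import torch
import torch.nn.functional as F

def msp_score(model, x):
    # Maximum softmax probability baseline: low predictive confidence flags an anomaly.
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)
    return 1.0 - probs.max(dim=1).values

def odin_score(model, x, temperature=1000.0, epsilon=1e-3):
    # Temperature-scaled softmax plus a small input perturbation that increases
    # the confidence of the predicted class (hyperparameters are fixed once,
    # not tuned per anomaly set). Defaults here are illustrative.
    x = x.clone().requires_grad_(True)
    logits = model(x) / temperature
    nll = F.cross_entropy(logits, logits.argmax(dim=1))
    nll.backward()
    with torch.no_grad():
        x_perturbed = x - epsilon * x.grad.sign()
        probs = F.softmax(model(x_perturbed) / temperature, dim=1)
    return 1.0 - probs.max(dim=1).values
```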

Figure 1: Plots of training and testing costs, accuracies, and average precision corresponding to hold-out-class experiments with three categories each from CIFAR-10 (top) and STL-10 (bottom), using the MSP method Hendrycks and Gimpel (2017). While classification performance is not correlated with performance at anomaly detection (compare absolute test accuracy numbers with average precision scores across columns), the “pattern” of improvement in anomaly detection appears roughly related to generalization (compare the coarse shape of test accuracy curves with that of average precision curves across iterations), indicating that there is some connection between generalization and the ability to detect semantic anomalies. In general, we should expect greater semantic understanding to lead to improvements in both generalization as well as anomaly detection.
CIFAR-10 Classification-only Rotation-augmented
Anomaly MSP ODIN Accuracy MSP ODIN Accuracy
airplane 43.30 ± 1.13 48.23 ± 1.90 96.00 ± 0.16 46.87 ± 2.10 49.75 ± 2.30 96.91 ± 0.02
automobile 14.13 ± 1.33 13.47 ± 1.50 95.78 ± 0.12 17.39 ± 1.26 17.35 ± 1.12 96.66 ± 0.03
bird 46.55 ± 1.27 50.59 ± 0.95 95.90 ± 0.17 51.49 ± 1.07 54.62 ± 1.10 96.79 ± 0.06
cat 38.06 ± 1.31 38.97 ± 1.43 97.05 ± 0.12 53.12 ± 0.92 55.80 ± 0.76 97.46 ± 0.07
deer 49.11 ± 0.53 53.03 ± 0.50 95.87 ± 0.12 50.35 ± 2.57 52.82 ± 2.96 96.76 ± 0.09
dog 25.39 ± 1.17 24.41 ± 1.05 96.64 ± 0.13 32.11 ± 0.82 32.46 ± 1.39 97.36 ± 0.06
frog 40.91 ± 0.81 42.21 ± 0.48 95.65 ± 0.09 52.39 ± 4.58 54.44 ± 5.80 96.51 ± 0.12
horse 36.18 ± 0.77 36.78 ± 0.82 95.64 ± 0.08 39.93 ± 2.30 39.65 ± 4.31 96.27 ± 0.07
ship 28.35 ± 0.81 30.61 ± 1.46 95.70 ± 0.15 29.36 ± 3.16 28.82 ± 4.63 96.66 ± 0.17
truck 27.17 ± 0.73 28.01 ± 1.06 96.04 ± 0.24 29.22 ± 2.87 29.93 ± 3.86 96.91 ± 0.12
Average 34.92 ± 0.41 36.63 ± 0.61 96.03 ± 0.00 40.22 ± 0.16 41.56 ± 0.15 96.83 ± 0.02
Table 3: We train ResNet classifiers on CIFAR-10 holding out each class per run, and score detection with average precision for the maximum softmax probability (MSP) baseline in Hendrycks and Gimpel (2017) and ODIN Liang et al. (2018). We find that augmenting with rotation results in complementary improvements to both scoring methods for anomaly detection (contrast columns in the right half with those in the left half). All results are reported as mean ± standard deviation over 3 trials.

5.1 Experimental settings

Architectural/training choices for CIFAR-10 and STL-10

Our base network for all CIFAR-10 experiments is a Wide ResNet Zagoruyko and Komodakis (2016) with 28 convolutional layers and a widening factor of 10 (WRN-28-10), with the recommended dropout rate of 0.3. Following Zagoruyko and Komodakis (2016), we train for 200 epochs, with an initial learning rate of 0.1 which is scaled down by 5 at the 60th, 120th, and 160th epochs, using stochastic gradient descent with Nesterov's momentum at 0.9. We train in parallel on 4 Pascal V100 GPUs with batches of size 128 on each. For STL-10, we use the same architecture but append an extra group of 4 residual blocks with the same layer widths as in the previous group. We also use a widening factor of 4 instead of 10, and batches of size 64 on each of the 4 GPUs. We use the same optimizer settings as with CIFAR-10. In both cases, we apply standard data augmentation of random crops (after padding) and random horizontal reflections.

Architectural/training choices for Imagenet

For experiments with the proposed subsets of Imagenet, we replicate the architecture we use for STL-10, but add a downsampling average-pooling layer after the first convolution on the images. We do not use dropout, and we use a batch size of 64; otherwise all other details follow the setup from the experiments on STL-10. The standard data augmentation steps of random crops and random horizontal reflections are applied.

Auxiliary rotation prediction

For adding rotation prediction as an auxiliary task, we follow the standard recipe from typical multi-task learning setups, where the representation space is shared through hard parameter sharing. This means that all we do is append an extra linear layer alongside the one responsible for object recognition. We weight the rotation loss with a coefficient (0.5 for CIFAR-10, 1.0 for STL-10, and a mix of 0.5 and 1.0 for Imagenet), which we chose by performing a coarse hyperparameter search for the best combined performance at predicting object category and rotation on the validation set. The optimizer and regularizer settings are kept the same, with the learning rate for the auxiliary head decayed along with that of the classifier, at the same scales.

STL-10 Classification-only Rotation-augmented
Anomaly MSP ODIN Accuracy MSP ODIN Accuracy
airplane 19.21 ± 1.05 23.46 ± 1.65 85.18 ± 0.20 22.21 ± 0.76 23.37 ± 1.71 89.24 ± 0.12
bird 29.05 ± 0.69 33.51 ± 0.36 85.91 ± 0.36 36.12 ± 2.08 40.08 ± 3.30 89.91 ± 0.29
car 14.52 ± 0.37 16.14 ± 0.83 84.32 ± 0.55 15.95 ± 2.20 16.87 ± 2.94 89.52 ± 0.44
cat 25.21 ± 0.93 27.92 ± 0.84 86.95 ± 0.36 29.34 ± 1.30 31.35 ± 1.88 90.89 ± 0.26
deer 24.29 ± 0.53 25.94 ± 0.49 85.34 ± 0.35 27.60 ± 2.22 29.71 ± 2.55 89.20 ± 0.17
dog 23.42 ± 0.60 23.44 ± 1.18 87.78 ± 0.45 26.78 ± 0.71 26.14 ± 0.62 91.37 ± 0.33
horse 21.31 ± 1.01 22.19 ± 0.75 85.52 ± 0.21 23.79 ± 1.46 23.59 ± 1.63 89.60 ± 0.11
monkey 23.67 ± 0.83 21.98 ± 0.91 86.66 ± 0.31 28.43 ± 1.67 28.32 ± 1.20 90.07 ± 0.23
ship 14.61 ± 0.12 13.78 ± 0.63 84.65 ± 0.21 16.79 ± 1.20 15.37 ± 1.22 89.33 ± 0.15
truck 15.43 ± 0.17 14.35 ± 0.12 85.34 ± 0.17 17.05 ± 0.50 16.59 ± 0.60 90.08 ± 0.38
Average 21.07 ± 0.25 22.27 ± 0.29 85.77 ± 0.13 24.41 ± 0.23 25.14 ± 0.45 89.92 ± 0.08
Table 4: Average precision scores for hold-out-class experiments with STL-10. We observe that the same trends hold as with the previous experiments on CIFAR-10.

We emphasize that this procedure is not equivalent to data augmentation, since we do not optimize the linear classification layer for rotated images. Only the rotation prediction linear layer gets updated for inputs corresponding to the rotation task, and only the linear classification layer gets updated for non-rotated, object-labelled images. Asking the classifier to be rotation-invariant would require the auxiliary task to develop a disjoint subset in the shared representation that is not rotation-invariant, so that it can succeed at predicting rotations. This encourages an internally split representation, thus diminishing the potential advantage we hope to achieve from a shared, mutually beneficial space.
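To make this routing concrete, a hypothetical training step is sketched below; the rotate_batch helper and the rotation_weight default are ours (0.5 was the coefficient we used for CIFAR-10), and the model is assumed to return (class_logits, rotation_logits) as in the earlier sketch.

```python
# Hypothetical training step: each head only receives gradients from its own inputs.
import torch
import torch.nn.functional as F

def rotate_batch(x):
    # Four rotated copies (0, 90, 180, 270 degrees) and their rotation labels.
    rotations = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(x.size(0))
    return torch.cat(rotations, dim=0), labels

def training_step(model, optimizer, images, targets, rotation_weight=0.5):
    class_logits, _ = model(images)                # object head sees only upright, labelled images
    cls_loss = F.cross_entropy(class_logits, targets)

    rot_images, rot_labels = rotate_batch(images)
    _, rot_logits = model(rot_images)              # rotation head sees only the rotated copies
    rot_loss = F.cross_entropy(rot_logits, rot_labels)

    loss = cls_loss + rotation_weight * rot_loss   # weighted auxiliary objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The shared encoder receives gradients from both losses, which is the intended mutual benefit, while each linear head is only ever optimized on its own inputs.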

5.2 Discussion

Self-supervised multi-task learning is promising

In Tables 3 and 4 we report average precision scores on CIFAR-10 and STL-10 for the baseline scoring methods MSP Hendrycks and Gimpel (2017) and ODIN Liang et al. (2018). We note that ODIN, with the hyperparameter settings we have fixed for all experiments, continues to outperform MSP most of the time. We then augment our classifiers with the auxiliary rotation-prediction task and find that anomaly detection as well as test set accuracy are markedly improved for both scoring methods. As we have remarked earlier, a representation space with greater semanticity should be expected to bring improvements on both fronts. All results are reported as mean ± standard deviation over 3 trials. In Table 5, we repeat the same process for all of our proposed Imagenet subsets. Full results, corresponding to individual members of the subsets, are in the Appendix; here we only show the average performance across all members of each subset. The results indicate that multi-task learning with self-supervised auxiliary tasks is a promising complement for anomaly detection, while also improving generalization.

High test set accuracy does not imply semantic representation

By our hypothesis, training methods developed solely to improve generalization, without consideration of the effect on semantic understanding, might perform worse at detecting semantic anomalies. This is because it is often possible to overfit to low-level or contextual discriminatory patterns, which are almost surely biased in small datasets for complex domains such as natural images, and still perform reasonably well on the test set. To illustrate this, we run an experiment where we randomly mask out a region within the central area of CIFAR-10 images. We find that while this leads to improved test accuracy, anomaly detection suffers (numbers are averages across hold-out-class trials):

Method Accuracy Av. Prec. with MSP
Base model 96.03 ± 0.00 34.92 ± 0.41
Random-center-masked 96.27 ± 0.05 34.41 ± 0.74
Rotation-augmented 96.83 ± 0.02 40.22 ± 0.16

This hints that while the masking strategy is effective as a regularizer, it might come at the cost of a less semantic representation. Such training choices can therefore result in models with seemingly improved generalization but a poorer understanding of object coherence, due to potentially overfitting on biases in local statistics or contextual cues in the dataset. For comparison, the rotation-augmented network achieves both a higher test set accuracy and an improved average precision. This example might be yet another cautionary tale about developing techniques that inadvertently lead to neural networks achieving reassuring test set performance while following an internal modus operandi badly misaligned with the pattern of reasoning we hope they discover. This can have unexpected consequences when such models are deployed in real-life applications.

Classification-only Rotation-augmented
Subset Skew MSP ODIN Accuracy MSP ODIN Accuracy
dog 8.33 23.92 ± 0.49 25.85 ± 0.09 85.09 ± 0.14 24.66 ± 0.58 25.73 ± 0.87 85.25 ± 0.17
car 10.00 21.54 ± 0.62 22.49 ± 0.54 77.17 ± 0.10 21.66 ± 0.19 22.38 ± 0.46 76.72 ± 0.19
snake 11.11 18.62 ± 0.93 19.18 ± 0.79 69.74 ± 1.63 20.23 ± 0.18 21.17 ± 0.12 70.51 ± 0.48
spider 16.67 21.20 ± 0.56 24.15 ± 0.72 68.40 ± 0.21 22.90 ± 1.29 25.10 ± 1.78 68.68 ± 0.77
fungus 16.67 42.56 ± 0.49 44.59 ± 1.46 88.23 ± 0.45 44.19 ± 1.86 46.86 ± 1.13 88.47 ± 0.43
Table 5: Averaged average precisions for the proposed subsets of Imagenet. Each row shows averaged performance across all members of the subset. A random-guessing baseline would score at the skew rate. Expanded rows are in the Appendix.

6 Conclusion

We provided a motivated discussion of what the goals of anomaly detection should be, grounded in the context of machine learning classification systems. While there is significant recent interest in the area, current research suffers from questionable benchmarks and methodology. In this paper, we propose a set of benchmarks which are better aligned with realistic applications.

We also explored the effectiveness of a multi-task learning framework with an auxiliary self-supervised objective that has been shown to induce semantic representations. Our results demonstrate improved anomaly detection along with improved generalization under the augmented objective. This suggests that such semantic-learning inductive biases could have an important role to play in developing neural networks with improved semantic representations, in the context of the problem they are tasked with.

Additionally, the ability to detect semantic anomalies also provides us with an indirect notion of semanticity in the representations learned by our mostly opaque deep models.

Acknowledgements

We thank Rachel Rolland for referencing and discussing the motivating examples of anomaly detection in nature studies. Ishaan Gulrajani and anonymous reviewer R1 provided useful feedback.

This work was enabled by the computational resources provided by Compute Canada.


References

  • [1] S. Akcay, A. Atapour-Abarghouei, and T. P. Breckon (2018) GANomaly: semi-supervised anomaly detection via adversarial training. ArXiv e-prints. Cited by: §2, §3.
  • [2] M. Alcorn, Q. Li, Z. Gong, C. Wang, L. Mai, W. Ku, and A. Nguyen (2019) Strike (with) a pose: neural networks are easily fooled by strange poses of familiar objects. CVPR. Cited by: §1.
  • [3] D. Amodei, C. Olah, J. Steinhardt, P. F. Christiano, J. Schulman, and D. Mané (2016) Concrete problems in AI safety. CoRR abs/1606.06565. Cited by: §2.
  • [4] A. Avati, K. Jung, S. Harman, L. Downing, A. Ng, and N. H. Shah (2018-12-12) Improving palliative care with deep learning. BMC Medical Informatics and Decision Making 18 (4), pp. 122. Cited by: §2.
  • [5] D. Bahdanau, S. Murty, M. Noukhovitch, T. H. Nguyen, H. de Vries, and A. Courville (2019) Systematic generalization: what is required and can it be learned?. In ICLR, Cited by: §1.
  • [6] S. Beery, G. V. Horn, and P. Perona (2018) Recognition in terra incognita. CoRR. Cited by: §1, §1.
  • [7] A. Bendale and T. E. Boult (2016) Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1563–1572. Cited by: §3.
  • [8] K. Boyd, K. H. Eng, and C. D. Page (2013) Area under the precision-recall curve: point estimates and confidence intervals. In Machine Learning and Knowledge Discovery in Databases, pp. 451–466. Cited by: §2.
  • [9] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149. Cited by: §4.
  • [10] J. Carranza-Rojas, H. Goeau, P. Bonnet, E. Mata-Montero, and A. Joly (2017-08-11) Going deeper in the automated identification of herbarium specimens. BMC Evolutionary Biology 17 (1), pp. 181. Cited by: §2.
  • [11] R. Caruana (1993) Multitask learning: a knowledge-based source of inductive bias. In ICML, Cited by: §4.
  • [12] J. P. Cohen, P. Bertin, and V. Frappier (2019) Chester: A web delivered locally computed chest x-ray disease prediction system. CoRR abs/1901.11210. Cited by: §2.
  • [13] A. Datta, M. C. Tschantz, and A. Datta (2015) Automated experiments on ad privacy settings: a tale of opacity, choice, and discrimination. Proceedings on Privacy Enhancing Technologies, pp. 92–112. Cited by: §4.
  • [14] J. Davis and M. Goadrich (2006) The relationship between precision-recall and ROC curves. In ICML, pp. 233–240. Cited by: §2.
  • [15] T. DeVries and G. W. Taylor (2018) Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865. Cited by: §1, §3.
  • [16] A. R. Dhamija, M. Günther, and T. Boult (2018) Reducing network agnostophobia. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 9157–9168. External Links: Link Cited by: §3.
  • [17] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. ICCV. Cited by: §4.
  • [18] C. Doersch and A. Zisserman (2017) Multi-task self-supervised visual learning. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2070–2079. Cited by: §4.
  • [19] T. Fawcett (2006) An introduction to roc analysis. Pattern Recognition Letters 27 (8), pp. 861 – 874. Cited by: §2.
  • [20] P. Fedor, J. Vanhara, J. Havel, I. Malenovsky, and I. Spellerberg (2009) Artificial intelligence in pest insect monitoring. Systematic Entomology 34 (2), pp. 398–400. Cited by: §2.
  • [21] J. A. Fodor and Z. W. Pylyshyn (1988) Connectionism and cognitive architecture: a critical analysis. Cognition 28 (1), pp. 3 – 71. Cited by: §1.
  • [22] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel (2019) ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. ICLR. Cited by: §1.
  • [23] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. ICLR. Cited by: §4.
  • [24] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. ICML, pp. 1321–1330. Cited by: §1.
  • [25] D. Hendrycks and K. Gimpel (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. ICLR. Cited by: §1, §1, §3, §3, Figure 1, §5.2, Table 3, §5.
  • [26] D. Hendrycks, M. Mazeika, and T. Dietterich (2019) Deep anomaly detection with outlier exposure. ICLR. Cited by: §1, §3.
  • [27] (2019) https://news.developer.nvidia.com/ai-app-identifies-plants-and-animals-in-seconds, accessed on 17 may 2019. Cited by: §2.
  • [28] (2019) https://openreview.net/forum?id=rkgpCoRctm, accessed on 17 may 2019. Cited by: §1.
  • [29] (2019) https://www.forbes.com/sites/cognitiveworld/2019/01/20/can-artificial-intelligence-be-biased/, accessed on 18 may 2019. Cited by: §4.
  • [30] B. M. Lake and M. Baroni (2018) Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. In ICML, Cited by: §1.
  • [31] K. Lee, H. Lee, K. Lee, and J. Shin (2018) Training confidence-calibrated classifiers for detecting out-of-distribution samples. ICLR. Cited by: §1, §2, §3.
  • [32] S. Liang, Y. Li, and R. Srikant (2018) Enhancing the reliability of out-of-distribution detection in neural networks. ICLR. Cited by: Appendix C, §1, §2, §3, §3, §5.2, Table 3, §5.
  • [33] E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan (2019) Do deep generative models know what they don’t know?. ICLR. Cited by: Appendix C, §1.
  • [34] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. ECCV. Cited by: §4.
  • [35] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros (2016) Context encoders: feature learning by inpainting. CVPR. Cited by: §4.
  • [36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §2.
  • [37] M. Rei (2017-07) Semi-supervised multitask learning for sequence labeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 2121–2130. Cited by: §4.
  • [38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §2.
  • [39] G. Shalev, Y. Adi, and J. Keshet (2018) Out-of-distribution detection using multiple semantic label representations. NeurIPS. Cited by: §1, §3.
  • [40] T. Tommasi, N. Patricia, B. Caputo, and T. Tuytelaars (2017) A deeper look at dataset bias. In Domain Adaptation in Computer Vision Applications, pp. 37–55. Cited by: §1.
  • [41] A. Torralba and A. A. Efros (2011-06) Unbiased look at dataset bias. In CVPR, pp. 1521–1528. Cited by: §1.
  • [42] A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. CoRR abs/1807.03748. Cited by: §4.
  • [43] M. Willi, R. T. Pitman, A. W. Cardoso, C. Locke, A. Swanson, A. Boyer, M. Veldthuis, and L. Fortson (2019) Identifying animal species in camera trap images using deep learning and citizen science. Methods in Ecology and Evolution 10 (1), pp. 80–91. Cited by: §2.
  • [44] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In BMVC, Cited by: §5.1.
  • [45] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork (2013-17–19 Jun) Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester (Eds.), Proceedings of Machine Learning Research, Vol. 28, Atlanta, Georgia, USA, pp. 325–333. Cited by: §4.
  • [46] H. Zenati, C. S. Foo, B. Lecouat, G. Manek, and V. R. Chandrasekhar (2018) Efficient gan-based anomaly detection. CoRR abs/1802.06222. Cited by: §2, §3.
  • [47] R. Zhang, P. Isola, and A. A. Efros (2017) Split-brain autoencoders: unsupervised learning by cross-channel prediction. CVPR. Cited by: §4.
  • [48] B. Zieliński, A. Plichta, K. Misztal, P. Spurek, M. Brzychczy-Włoch, and D. Ochońska (2017) Deep learning approach to bacterial colony classification. PLoS One 12(9). Cited by: §2.

Appendix A Imagenet benchmarks

In this section, we present details of the Imagenet-based benchmark we proposed. For constructing these datasets, we first sorted all subsets by the number of members, as structured in the Imagenet hierarchy. We then picked from among the list of the top twenty subsets, with a preference for subsets more closely aligned with the theme of the motivating practical applications we provided. We also manually inspected the data to check for inconsistencies, and performed some pruning. For example, the beetle subset, while seeming ideal, has some issues with labelling: leaf beetle and ladybug appear to overlap in some cases. Finally, we settled on our choice of 5 subsets. The following table lists the members under every proposed subset.

Dog (hound) Car Snake (colubrid) Spider Fungus
Ibizan hound Model T ringneck snake tarantula stinkhorn
bluetick race car vine snake Argiope aurantia bolete
beagle sports car hognose snake barn spider hen-of-the-woods
Afghan hound minivan thunder snake black widow earthstar
Weimaraner ambulance garter snake garden spider gyromitra
Saluki cab king snake wolf spider coral fungus
redbone beach wagon night snake
otterhound jeep green snake
Norwegian elkhound convertible water snake
basset hound limo
Scottish deerhound
bloodhound

In Figures 2 to 6 we show sample images. The sets are collected by first resizing images so that the shorter side is 256 pixels, followed by a center crop. It is obvious that, due to intrinsic dataset bias, some categories may be viewed as anomalous without careful inspection of the object of interest. For example, owing to their smaller size, ringneck snakes are most often photographed when held in human hands, and race cars are usually pictured on race tracks. Such dataset biases have historically been hard to account for, and we recommend more thoughtful curation of specific datasets for specific tasks, so that our proposed style of benchmarks can be more reflective of field performance.

Figure 2: Dog (hound dog). Sample images of: Ibizan hound, bluetick, beagle, Afghan hound, Weimaraner, Saluki, redbone, otterhound, Norwegian elkhound, basset hound, Scottish deerhound, bloodhound.
Figure 3: Car. Sample images of: Model T, race car, sports car, minivan, ambulance, cab, beach wagon, jeep, convertible, limousine.
Figure 4: Snake (colubrid). Sample images of: ringneck snake, vine snake, hognose snake, thunder snake, garter snake, king snake, night snake, green snake, water snake.
Figure 5: Spider. Sample images of: tarantula, Argiope aurantia, barn spider, black widow, garden spider, wolf spider.
Figure 6: Fungus. Sample images of: stinkhorn, bolete, hen-of-the-woods, earthstar, gyromitra, coral fungus.

Appendix B Expanded results for Imagenet-subset experiments

In this section, we show expanded results for the Imagenet experiments, corresponding to every hold-out-class experiment.

Dog Classification-only Rotation-augmented
Anomalous dogs MSP ODIN Accuracy MSP ODIN Accuracy
Ibizan hound 25.56 2.43 26.34 2.65 85.19 0.65 22.85 2.54 24.27 2.60 85.99 0.73
bluetick 34.37 4.32 39.19 1.43 85.50 0.82 29.60 4.39 31.70 1.99 85.36 0.59
beagle 18.29 1.54 17.05 1.21 86.33 1.18 19.79 2.02 18.52 1.69 86.37 0.71
Afghan hound 20.05 3.07 18.69 1.22 84.16 0.59 20.62 1.26 20.56 1.37 83.44 0.98
Weimaraner 31.04 2.22 36.87 2.90 83.68 0.36 27.65 2.52 30.59 1.03 83.62 1.40
Saluki 26.64 2.50 31.75 2.01 83.76 1.08 28.20 1.46 29.27 1.30 84.74 0.19
redbone 17.93 0.59 18.66 0.59 86.61 1.48 19.14 0.39 19.54 1.15 86.01 1.10
otterhound 22.71 0.77 23.31 0.83 84.50 0.54 21.32 3.71 22.90 4.28 84.24 0.56
Norwegian elkhound 28.82 2.16 36.55 0.61 83.91 1.35 34.64 6.16 41.61 6.55 84.33 0.84
basset hound 18.39 0.91 16.23 0.58 86.34 0.76 21.33 3.10 19.46 1.72 86.45 0.65
Scottish deerhound 26.83 2.97 26.52 2.23 83.95 0.61 26.64 1.17 24.87 0.52 83.80 0.02
bloodhound 16.43 1.17 19.04 0.91 87.17 0.24 24.19 6.23 25.42 6.60 88.69 0.84
Average 23.92 0.49 25.85 0.09 85.09 0.14 24.66 0.58 25.73 0.87 85.25 0.17
Car Classification-only Rotation-augmented
Anomalous cars MSP ODIN Accuracy MSP ODIN Accuracy
Model T 26.77 1.21 31.20 1.22 72.92 0.49 32.09 0.86 36.10 1.84 72.52 0.62
race car 22.48 2.53 27.12 3.90 79.65 1.85 20.32 5.47 22.41 6.41 74.67 3.39
sports car 16.20 1.77 13.86 0.44 80.97 1.97 16.80 0.93 15.58 0.96 81.00 0.48
minivan 17.19 2.57 17.78 1.68 79.25 1.89 17.32 3.08 18.45 2.67 80.01 0.98
ambulance 11.13 1.78 9.51 0.97 75.71 2.44 11.24 0.84 10.61 1.25 75.78 0.31
cab 26.17 2.42 27.93 2.30 75.92 3.77 28.57 1.91 29.39 2.52 76.74 3.09
beach wagon 24.82 0.85 26.30 2.00 78.75 1.09 24.50 1.64 25.22 1.89 79.81 1.27
jeep 25.47 0.38 26.99 2.70 74.67 1.37 27.92 5.01 27.74 3.28 72.84 0.47
convertible 20.00 2.63 18.35 1.17 76.86 0.81 15.32 2.01 14.79 2.24 76.26 1.74
limo 25.16 1.43 25.87 0.83 77.04 1.52 22.53 2.08 23.49 1.17 77.54 1.19
Average 21.54 0.62 22.49 0.54 77.17 0.10 21.66 0.19 22.38 0.46 76.72 0.19
Snake Classification-only Rotation-augmented
Anomalous snakes MSP ODIN Accuracy MSP ODIN Accuracy
ringneck snake 20.18 2.98 20.56 2.78 71.08 3.13 20.84 0.77 23.22 1.07 69.03 0.46
vine snake 16.07 4.51 17.19 3.91 67.94 3.96 15.94 1.05 16.15 0.96 72.65 6.44
hognose snake 16.82 0.38 16.65 0.46 67.95 2.81 19.70 1.22 19.32 0.77 69.85 2.50
thunder snake 17.06 2.94 19.18 3.31 71.86 7.58 21.26 0.35 23.08 0.70 69.34 7.08
garter snake 21.45 4.35 22.16 3.81 67.26 2.19 22.67 1.13 23.12 1.83 68.29 5.90
king snake 17.37 0.39 16.55 1.19 66.45 5.38 19.47 3.72 17.96 2.74 68.13 2.74
night snake 21.70 4.01 20.50 3.36 76.56 0.78 23.28 0.79 24.12 1.26 78.71 3.91
green snake 12.42 3.31 13.49 3.57 71.07 6.41 13.15 1.11 13.94 0.23 71.74 2.75
water snake 24.50 3.10 26.36 3.24 67.46 6.52 25.77 3.59 29.62 5.04 66.85 0.70
Average 18.62 0.93 19.18 0.79 69.74 1.63 20.23 0.18 21.17 0.12 70.51 0.48
Spider Classification-only Rotation-augmented
Anomalous spiders MSP ODIN Accuracy MSP ODIN Accuracy
tarantula 19.45 0.73 22.91 2.37 60.67 0.66 24.27 3.25 26.07 2.73 60.07 0.86
Argiope aurantia 12.97 0.39 12.49 0.48 69.70 2.17 12.82 0.51 12.01 0.21 69.17 1.57
barn spider 23.03 3.03 23.83 2.95 75.69 0.55 21.41 1.41 23.54 1.21 76.56 1.85
black widow 29.24 4.39 37.96 5.68 61.79 0.87 37.08 7.50 42.64 8.87 62.63 0.09
garden spider 17.36 1.33 15.51 0.88 77.81 2.29 16.57 1.58 15.88 1.42 76.38 1.94
wolf spider 25.15 2.98 32.23 1.91 64.73 0.45 25.48 1.75 30.69 1.28 67.19 0.37
Average 21.20 0.56 24.15 0.72 68.40 0.21 22.90 1.29 25.10 1.78 68.68 0.77
Fungus Classification-only Rotation-augmented
Anomalous fungi MSP ODIN Accuracy MSP ODIN Accuracy
stinkhorn 52.43 1.15 56.37 1.98 90.91 0.54 54.37 4.65 59.10 5.71 92.27 0.97
bolete 51.04 0.42 52.82 3.10 89.19 0.94 49.43 2.05 53.07 3.48 89.22 1.09
hen-of-the-woods 44.83 1.52 48.04 0.84 89.41 1.64 48.87 2.00 51.37 2.44 90.13 0.33
earthstar 34.90 3.26 36.79 2.16 86.70 1.91 41.96 7.66 43.24 4.92 86.46 0.62
gyromitra 46.75 0.42 49.06 2.64 86.79 1.66 44.90 1.94 49.20 1.51 86.39 0.18
coral fungus 25.42 2.60 24.44 3.04 86.36 1.25 25.58 1.15 25.22 2.81 86.35 0.80
Average 42.56 0.49 44.59 1.46 88.23 0.45 44.19 1.86 46.86 1.13 88.47 0.43

Appendix C Trivial baseline for OOD detection on existing benchmarks

To demonstrate that the current benchmarks can largely be solved with low-level statistics, we tested OOD detection with CIFAR-10 as the in-distribution by simply looking at likelihoods under a mixture of 3 Gaussians, trained at the pixel level, channel-wise. We find that this simple baseline performs quite well on the benchmarks in [32], as we show below:

OOD dataset Average precision
TinyImagenet (crop) 96.84
TinyImagenet (resize) 99.03
LSUN 58.06
LSUN (resize) 99.77
iSUN 99.21

We see that this method does not do well on LSUN. When we inspect LSUN, we find that the images are cropped patches from scene images, and are mostly of uniform colour, with little variation and structure in them. We believe that this results in the phenomenon reported in [33], where a distribution that “sits inside” another because of a similar mean but lower variance ends up being more likely under the wider distribution.

We found that this simple baseline underperforms severely on the hold-out-class experiments on CIFAR-10, achieving an average precision of a mere 11.17% across the 10 tests, indicating that trivial statistics might not be as useful at telling apart semantic anomalies, which are of greater practical interest.
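A minimal sketch of this pixel-level mixture baseline is given below; pooling the RGB values of training pixels and scoring a test image by its mean log-likelihood is our reading of the setup, so treat the preprocessing details as assumptions rather than the exact procedure.

```python
# Sketch of the trivial baseline: a 3-component Gaussian mixture over channel-wise pixel values.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_pixel_gmm(train_images, n_components=3, max_pixels=500_000, seed=0):
    # train_images: (N, H, W, 3) uint8 array, e.g. the CIFAR-10 training set.
    pixels = train_images.reshape(-1, 3).astype(np.float32) / 255.0
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(pixels), size=min(max_pixels, len(pixels)), replace=False)
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(pixels[idx])               # subsample pixels to keep fitting tractable
    return gmm

def anomaly_scores(gmm, images):
    # Lower likelihood under the in-distribution pixel model => more anomalous.
    pixels = images.reshape(images.shape[0], -1, 3).astype(np.float32) / 255.0
    return -np.array([gmm.score_samples(p).mean() for p in pixels])
```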