Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift

10/29/2018 ∙ by Stephan Rabanser, et al. ∙ Carnegie Mellon University

We might hope that when faced with unexpected inputs, well-designed software systems would fire off warnings. Machine learning (ML) systems, however, which depend strongly on properties of their inputs (e.g. the i.i.d. assumption), tend to fail silently. This paper explores the problem of building ML systems that fail loudly, investigating methods for detecting dataset shift and identifying exemplars that most typify the shift. We focus on several datasets and various perturbations to both covariates and label distributions, with varying magnitudes and fractions of data affected. Interestingly, we show that while classifier-based methods perform well in high-data settings, they perform poorly in low-data settings. Moreover, across the dataset shifts that we explore, a two-sample-testing-based approach, using pretrained classifiers for dimensionality reduction, performs best.


1 Introduction

Even subtle changes in the data distribution can destroy the performance of otherwise state-of-the-art classifiers, a phenomenon exemplified by adversarial examples (Szegedy et al., 2013; Zügner et al., 2018). When decisions are made under uncertainty, even shifts in the label distribution can significantly compromise accuracy (Zhang et al., 2013; Lipton et al., 2018). Unfortunately, in practice, ML pipelines rarely inspect incoming data for signs of distribution shift, and for detecting shift in high-dimensional real-world data, best practices have not yet been established (TensorFlow's data validation tools, for example, only compare summary statistics of training vs. incoming data: https://www.tensorflow.org/tfx/data_validation/get_started#checking_data_skew_and_drift). The first indications that something has gone awry might come when customers complain.

This paper investigates methods for efficiently detecting distribution shift, a problem naturally cast as two-sample testing. We wish to test the equivalence of the source distribution p (from which training data is sampled) and the target distribution q (from which real-world data is sampled). For simple univariate distributions, such hypothesis testing is a mature science. One might be tempted to use off-the-shelf multivariate two-sample tests to handle high-dimensional data, but these kernel-based approaches do not scale with the dataset size and their statistical power decays badly when the ambient dimension is high (Ramdas et al., 2015).

For ML practitioners, another intuitive approach might be to train a classifier to distinguish between examples from the source and target distributions. We can then check whether the classifier achieves accuracy significantly greater than chance. Analyzing the simple case where one wishes to test the means of two Gaussians, Ramdas et al. (2016) recently made the intriguing discovery that a classification-based strategy using Fisher's Linear Discriminant Analysis (LDA) achieves minimax rate-optimal power. However, to date, we know of no rigorous empirical investigation characterizing classifier-based approaches for recognizing dataset shift on the real, high-dimensional data distributions with no known parametric form on which modern machine learning is routinely deployed. Providing this analysis is a key contribution of this paper. To avoid confusion, we call any source-vs-target classifier a domain classifier and refer to the original classifier (trained on source data to predict the class labels) as the label classifier.

A key benefit of the classifier-based approach is that the domain classifier reduces dimensionality to a single dimension, learned precisely for the purpose of discriminating between source and target data. However, a major drawback is that deep neural networks, precisely the classifiers that are effective on the high-dimensional data that interests us, require large amounts of training data. Compounding the problem, the domain-classifier approach requires partitioning our (scarce) target data, using, e.g., half for training and leaving only the remainder for two-sample testing. Thus, as an alternative, we also explore the black box shift detection (BBSD) approach due to Lipton et al. (2018), which addresses shift detection under the label shift assumption. They show that if one possesses an off-the-shelf label classifier with an invertible confusion matrix (verifiable on training data), then detecting that the source distribution p differs from the target distribution q requires only detecting that the distributions of the classifier's predictions differ between source and target. This insight enables efficient shift detection, using a pretrained (label) classifier for dimensionality reduction.
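As an illustration (not the implementation used in our experiments), the following sketch extracts BBSDs representations with PyTorch, assuming a pretrained label classifier model and an input tensor X are given:

import torch
import torch.nn.functional as F

def bbsds_representation(model, X, batch_size=256):
    # Softmax outputs of a pretrained label classifier, used as a
    # low-dimensional representation for subsequent two-sample testing (BBSDs).
    model.eval()
    outputs = []
    with torch.no_grad():
        for i in range(0, len(X), batch_size):
            logits = model(X[i:i + batch_size])
            outputs.append(F.softmax(logits, dim=1))
    return torch.cat(outputs).cpu().numpy()

# BBSDh instead keeps only the arg-max class predictions:
# preds = bbsds_representation(model, X).argmax(axis=1)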

Building on these ideas of combining (black-box) dimensionality reduction with subsequent two-sample testing, we explore a range of dimensionality reduction techniques and compare them under a wide variety of shifts (Figure 1 illustrates our general framework). Interestingly, we show (empirically) that BBSD works surprisingly well under a broad set of shifts, outperforming other methods, even when its assumptions are not met.
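Concretely, the pipeline in Figure 1 amounts to composing a dimensionality-reduction function with a two-sample test; a minimal illustrative sketch (in Python, our illustration rather than a definitive implementation) is:

def detect_shift(reduce_fn, test_fn, X_source, X_target):
    # Figure 1 as code: map both samples into a (low-dimensional)
    # representation, then run a two-sample test on the representations.
    # Returns whatever the test returns, e.g. (shift_detected, p_values).
    return test_fn(reduce_fn(X_source), reduce_fn(X_target))

# Example wiring, using the illustrative helpers sketched later in this document:
#   detect_shift(lambda X: bbsds_representation(model, X),
#                aggregated_ks_test, X_source, X_target)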

Figure 1: Our pipeline for detecting dataset shift. We consider various choices for how to represent the data and how to perform two-sample tests.
Related work

Given just one example from the test data, our problem simplifies to anomaly detection, surveyed thoroughly by Chandola et al. (2009) and Markou and Singh (2003). Popular approaches to anomaly detection include density estimation (Breunig et al., 2000), margin-based approaches such as one-class SVMs (Schölkopf et al., 2000), and the tree-based isolation forest method (Liu et al., 2008). Recently, GANs have been explored for this task (Schlegl et al., 2017). Given simple streams of data arriving in a time-dependent fashion, where the signal is piece-wise stationary with stationary periods separated by abrupt changes, the problem of detecting a dataset shift becomes the classic time-series problem of change point detection, with existing methods summarized succinctly in an excellent survey by Truong et al. (2018). An extensive literature addresses dataset shift in machine learning, typically in the larger context of domain adaptation, often through importance-weighted risk minimization. Owing to the impossibility of correcting for shift absent assumptions (Ben-David et al., 2010), these papers often assume either covariate shift (Shimodaira, 2000; Sugiyama et al., 2008; Gretton et al., 2009) or label shift (Saerens et al., 2002; Chan and Ng, 2005; Storkey, 2009; Zhang et al., 2013; Lipton et al., 2018). Schölkopf et al. (2012) provide a unifying view of these shifts, showing how assumed invariances in conditional probabilities correspond to causal assumptions about the inputs and outputs.

2 Shift Detection Techniques

Given labeled source data drawn from a distribution p and unlabeled target data drawn from a distribution q, our task is to determine whether p equals q. Formally, we test the null hypothesis H_0: p = q against the alternative H_A: p ≠ q. Chiefly, we explore the following design considerations: (i) what representation to run the test on; (ii) which two-sample test to run; (iii) when the representation is multidimensional, whether to run multivariate or multiple univariate two-sample tests; and (iv) how to combine their results. Additionally, we share some preliminary work on qualitatively characterizing the shift, e.g., by presenting exemplars or identifying salient features.

2.1 Dimensionality Reduction

Building on recent results in Lipton et al. (2018) and Ramdas et al. (2015) suggesting the benefits of low-dimensional two-sample testing, we consider the following representations: (i) No Reduction: to justify any dimensionality reduction techniques, we include tests on the original raw features; (ii) SRP: sparse random projections (implemented in scikit-learn with default parameters); (iii) PCA: principal components analysis (a brief sketch of SRP and PCA appears after this list); (iv) TAE: representations extracted by an autoencoder trained on source data; (v) UAE: the same approach but with an untrained autoencoder; (vi) BBSD: here, we adopt the approach of Lipton et al. (2018), using the outputs of a label classifier trained on source data as our representation for subsequent two-sample testing. Two variations of this approach are to use the hard-thresholded predictions of the label classifier (BBSDh), enabling a likelihood ratio test, or to use the softmax outputs (BBSDs), requiring a subsequent multivariate test; and (vii) Classifier: we partition both the source data and the target data into two halves, training a domain classifier to distinguish source (class 0) from target (class 1), with balanced classes. We then apply this model to the remaining data, performing a subsequent binomial test on its hard-thresholded predictions.
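As referenced above, the following sketch illustrates the simpler reductions using scikit-learn's PCA and SparseRandomProjection; the latent dimension K and the explicit n_components for SRP are illustrative choices for this sketch (the list above reports scikit-learn defaults for SRP):

from sklearn.decomposition import PCA
from sklearn.random_projection import SparseRandomProjection

def fit_reducers(X_source, K=32, seed=0):
    # Fit PCA and sparse-random-projection reducers on source data only,
    # then reuse the fitted transforms on both source and target samples.
    X_flat = X_source.reshape(len(X_source), -1)
    pca = PCA(n_components=K).fit(X_flat)
    srp = SparseRandomProjection(n_components=K, random_state=seed).fit(X_flat)
    return {"PCA": lambda X: pca.transform(X.reshape(len(X), -1)),
            "SRP": lambda X: srp.transform(X.reshape(len(X), -1))}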

In this paper, we use the following heuristic for choosing the latent dimension: we first decide on a fraction of the variance in the original data that we would like the latent representation to explain, and then choose the latent dimension to be the minimal number of principal components required to explain that fraction of the variance. Guided by this heuristic, and taking the liberty of rounding for convenience, we used the same latent dimension for all experiments on all datasets.
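An illustrative sketch of this heuristic, with a placeholder variance fraction of 0.8 (not necessarily the fraction used in our experiments):

import numpy as np
from sklearn.decomposition import PCA

def choose_latent_dim(X_source, variance_fraction=0.8):
    # Smallest number of principal components whose cumulative explained
    # variance reaches the desired fraction of the total variance.
    X_flat = X_source.reshape(len(X_source), -1)
    ratios = PCA().fit(X_flat).explained_variance_ratio_
    return int(np.argmax(np.cumsum(ratios) >= variance_fraction) + 1)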

2.2 Two-sample testing

The dimensionality reduction techniques each yield a representation, either uni- or multidimensional and either continuous or discrete. Among categorical outputs, we may have binary outputs (as from the domain classifier) or multiple categories (as from the hard-thresholded label classifier). The next step is to choose a suitable two-sample test. In all experiments, we adopt a fixed significance level for hypothesis rejection. For representation methods that yield multidimensional outputs, we have two choices: to perform a multivariate two-sample test, such as the kernel two-sample test due to Gretton et al. (2012), or to perform univariate tests separately on each component. In the latter case, we must subsequently combine the p-values from each test, encountering the problem of multiple hypothesis testing. Unable to make strong assumptions about the dependence among the tests, we must rely on a conservative aggregation method, such as the Bonferroni correction (Bland and Altman, 1995). While a number of less conservative aggregation methods have been proposed (Simes, 1986; Zaykin et al., 2002; Loughin, 2004; Heard and Rubin-Delanchy, 2018; Vovk and Wang, 2018), we found that even using the Bonferroni method, dimensionality reduction plus aggregation generally outperformed kernel two-sample testing on those same representations.
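For reference, here is a minimal illustrative sketch of a kernel two-sample test in the spirit of Gretton et al. (2012), using an RBF kernel with a median-heuristic bandwidth and a permutation p-value; this is not the exact estimator or implementation used in our experiments:

import numpy as np
from scipy.spatial.distance import cdist

def mmd_permutation_test(X, Y, n_perm=500, seed=0):
    # Biased MMD^2 estimate between samples X and Y, with a p-value
    # obtained by permuting the pooled sample and recomputing the statistic.
    rng = np.random.default_rng(seed)
    Z = np.vstack([X.reshape(len(X), -1), Y.reshape(len(Y), -1)])
    n = len(X)
    d2 = cdist(Z, Z, "sqeuclidean")
    K = np.exp(-d2 / (2.0 * np.median(d2)))   # median-heuristic bandwidth

    def mmd2(a, b):
        return (K[np.ix_(a, a)].mean() + K[np.ix_(b, b)].mean()
                - 2.0 * K[np.ix_(a, b)].mean())

    idx = np.arange(len(Z))
    observed = mmd2(idx[:n], idx[n:])
    count = sum(mmd2(p[:n], p[n:]) >= observed
                for p in (rng.permutation(len(Z)) for _ in range(n_perm)))
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value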

When performing univariate tests on continuous variables, we adopt the Kolmogorov-Smirnov (KS) test, a non-parametric test whose statistic is the supremum over all values z of the difference between the empirical cumulative distribution functions (CDFs): D = sup_z |F_p(z) − F_q(z)|, where F_p and F_q are the empirical CDFs of the source and target data, respectively. Each value of the KS statistic corresponds to a p-value. When aggregating K univariate tests, we apply the Bonferroni correction, rejecting the null hypothesis if the minimum p-value among all tests is less than the significance level divided by K.
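An illustrative sketch of the aggregated univariate test, using SciPy's KS implementation (the significance level alpha here is a parameter of the sketch):

import numpy as np
from scipy.stats import ks_2samp

def aggregated_ks_test(source, target, alpha=0.05):
    # Per-dimension Kolmogorov-Smirnov tests with Bonferroni aggregation:
    # reject H0: p = q if the smallest of the K p-values is below alpha / K.
    K = source.shape[1]
    p_values = np.array([ks_2samp(source[:, k], target[:, k]).pvalue
                         for k in range(K)])
    return bool(p_values.min() < alpha / K), p_values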

For all methods yielding multidimensional representations (NoRed, PCA, SRP, UAE, TAE, and BBSDs), we tried both the kernel two-sample test and Bonferroni-corrected univariate KS tests, finding, to our surprise, that the aggregated KS tests provided superior shift detection in most cases (see experiments). For BBSDh (using the hard-thresholded label classifier predictions) we employ a likelihood ratio test, known to be optimal in the parametric case. For the domain classifier, we simply compare its accuracy on held-out data to random chance via a binomial test.
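The following sketch illustrates both of these final tests, substituting logistic regression for the deep domain classifier and a chi-squared contingency test for the likelihood-ratio test on hard (integer) predictions; the helper names and defaults are illustrative:

import numpy as np
from scipy.stats import binomtest, chi2_contingency
from sklearn.linear_model import LogisticRegression

def domain_classifier_test(X_source, X_target, seed=0):
    # Train a source-vs-target classifier on half of the pooled data and ask,
    # via a one-sided binomial test, whether its held-out accuracy beats chance
    # (this assumes source and target samples of similar size, so chance = 0.5).
    rng = np.random.default_rng(seed)
    X = np.vstack([X_source.reshape(len(X_source), -1),
                   X_target.reshape(len(X_target), -1)])
    y = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    idx = rng.permutation(len(y))
    train, test = idx[:len(y) // 2], idx[len(y) // 2:]
    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    correct = int((clf.predict(X[test]) == y[test]).sum())
    return binomtest(correct, n=len(test), p=0.5, alternative="greater").pvalue

def hard_prediction_test(preds_source, preds_target, n_classes=10):
    # Compare the label classifier's hard predictions on source vs. target
    # via a chi-squared test on the contingency table of predicted classes.
    table = np.stack([np.bincount(preds_source, minlength=n_classes),
                      np.bincount(preds_target, minlength=n_classes)])
    table = table[:, table.sum(axis=0) > 0]   # drop classes never predicted
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value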

3 Experiments

Figure 2: Shift detection accuracy, averaged over all tested shifts (a) and shift detection performance on each tested shift averaged over all dimensionality reduction techniques (b). For full results by method and shift type, see Appendix A.

We briefly summarize our experimental setup. Our experiments address the MNIST and CIFAR-10 datasets. For the autoencoder (UAE and TAE) experiments, we employ a convolutional architecture with convolutional and fully-connected layers. For the label classifier and the domain classifier we use a ResNet-18 (He et al., 2016). We train all networks (TAE, BBSDs, BBSDh, Classif) with stochastic gradient descent (SGD) with momentum, decaying the learning rate over the course of training and applying early stopping; the domain classifier is trained for fewer epochs than the other models. In these experiments, we simulate a number of varieties of shift, affecting both the covariates and the label proportions. For all shifts, we evaluate the various methods' abilities to detect shift (including no-shift cases to check against false positives). We evaluate the models with varying numbers of samples drawn from the target dataset. Because of the unfavorable dependence of kernel methods on the dataset size, we run these methods only up to a moderate number of target samples.

For each shift type (as appropriate), we explored three levels of shift intensity (e.g., the magnitude of added noise) and various percentages of affected data. We explore the following types of shift: Adversarial examples: as introduced by Szegedy et al. (2013) and created via the FGSM method due to Goodfellow et al. (2014); Knock-out shift (KO): a form of label shift introduced by Lipton et al. (2018), where a fraction of data points from a single class are removed, creating class imbalance; Gaussian noise shift (GN): covariates corrupted by Gaussian noise of varying standard deviation; Image shift (IMG): we simulated more natural shifts to images, modifying a fraction of photos with combinations of random amounts of rotation, translation, and zoom, and we also explored combinations of image shift with label shift; Original splits: as a sanity check, we also evaluated our detectors on the original source/target splits provided with the MNIST, CIFAR-10, Fashion MNIST, and SVHN datasets, typically regarded as being i.i.d.; and Domain adaptation datasets: we tested our detection method on the domain adaptation task from MNIST (source) to USPS (target) (Long et al., 2013). A sketch of how the knock-out and Gaussian noise shifts can be simulated follows below.
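As referenced above, an illustrative sketch of two of these simulated shifts (the parameter defaults are placeholders, not the intensities used in our experiments):

import numpy as np

def gaussian_noise_shift(X, delta=0.5, sigma=10.0, seed=0):
    # Corrupt a random fraction delta of the target covariates with
    # additive Gaussian noise of standard deviation sigma.
    rng = np.random.default_rng(seed)
    X_shift = X.astype(float).copy()
    idx = rng.choice(len(X), size=int(delta * len(X)), replace=False)
    X_shift[idx] += rng.normal(0.0, sigma, size=X_shift[idx].shape)
    return X_shift

def knock_out_shift(X, y, knock_class=0, delta=0.5, seed=0):
    # Remove a fraction delta of the examples from one class,
    # inducing label shift (class imbalance) in the target sample.
    rng = np.random.default_rng(seed)
    cls_idx = np.flatnonzero(y == knock_class)
    drop = rng.choice(cls_idx, size=int(delta * len(cls_idx)), replace=False)
    keep = np.setdiff1d(np.arange(len(y)), drop)
    return X[keep], y[keep]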

4 Discussion

Aggregating results across the broad spectrum of explored shifts (Figure 2), we see that BBSDs and UAE perform best. Interestingly, the simple no-reduction baseline, with univariate tests on each input feature aggregated via the Bonferroni correction (which is severe here, owing to the large ambient dimension), performs in the middle of the pack. The domain-classifier approach (denoted Classif in Figure 2) struggles the most to detect shift in the low-sample regime, but performs better with more data. One benefit of the classifier-based approach is that it gives us an intuitive way to quantify the amount of shift (the accuracy of the classifier), and also yields a mechanism for identifying the exemplars most likely to occur in either the source or the target distribution. In Appendix A, we break out all results by shift type, and in Appendix B we depict the exemplars identified most confidently by the classifier approach. We were surprised to find that, across our dimensionality-reduction methods, aggregated univariate tests performed separately on each component of the latent vectors outperformed multivariate tests. The multivariate test also performed poorly in the no-reduction case, but this was expected.

One surprising finding, discovered early in this study, is that the original MNIST train/test split appears not to be i.i.d., a shift detected by nearly all methods. As a sanity check, we reran the test on a random split and did not detect a shift. Closer inspection revealed significant differences involving the 6's. Corroborating this finding, the examples flagged most confidently by the domain classifier singled out test-set 6's.

We see several promising paths for future work: (i) we must extend our work to account intelligently for repeated two-sample tests over time as data streams in, exploiting the high degree of correlation between adjacent time steps; and (ii) in reality, all datasets shift, so the bigger challenge is to quantify and characterize the shift, and to decide when practitioners should be alarmed and what actions they should take.

References

  • Ben-David et al. (2010) Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
  • Bland and Altman (1995) J Martin Bland and Douglas G Altman. Multiple significance tests: the Bonferroni method. BMJ, 1995.
  • Breunig et al. (2000) Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. LOF: identifying density-based local outliers. In ACM SIGMOD Record. ACM, 2000.
  • Chan and Ng (2005) Yee Seng Chan and Hwee Tou Ng. Word sense disambiguation with distribution estimation. In International Joint Conference on Artificial Intelligence (IJCAI), 2005.
  • Chandola et al. (2009) Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR), 2009.
  • Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2014.
  • Gretton et al. (2009) Arthur Gretton, Alexander J Smola, Jiayuan Huang, Marcel Schmittfull, Karsten M Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Journal of Machine Learning Research (JMLR), 2009.
  • Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research (JMLR), 2012.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer vision and pattern recognition (CVPR), 2016.
  • Heard and Rubin-Delanchy (2018) Nicholas A Heard and Patrick Rubin-Delanchy. Choosing between methods of combining p-values. Biometrika, 2018.
  • Lipton et al. (2018) Zachary C Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning (ICML), 2018.
  • Liu et al. (2008) Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In International Conference on Data Mining (ICDM), 2008.
  • Long et al. (2013) Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. Transfer feature learning with joint distribution adaptation. In International conference on computer vision (ICCV), 2013.
  • Loughin (2004) Thomas M Loughin. A systematic comparison of methods for combining p-values from independent tests. Computational statistics & data analysis, 2004.
  • Markou and Singh (2003) Markos Markou and Sameer Singh. Novelty detection: a review—part 1: statistical approaches. Signal processing, 2003.
  • Ramdas et al. (2015) Aaditya Ramdas, Sashank Jakkam Reddi, Barnabás Póczos, Aarti Singh, and Larry A Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Association for the Advancement of Artificial Intelligence (AAAI), 2015.
  • Ramdas et al. (2016) Aaditya Ramdas, Aarti Singh, and Larry Wasserman. Classification accuracy as a proxy for two sample testing. arXiv preprint arXiv:1602.02210, 2016.
  • Saerens et al. (2002) Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural computation, 2002.
  • Schlegl et al. (2017) Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, 2017.
  • Schölkopf et al. (2000) Bernhard Schölkopf, Robert C Williamson, Alex J Smola, John Shawe-Taylor, and John C Platt. Support vector method for novelty detection. In Advances in neural information processing systems, 2000.
  • Schölkopf et al. (2012) Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In International Conference on Machine Learning (ICML), 2012.
  • Shimodaira (2000) Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 2000.
  • Simes (1986) R John Simes. An improved bonferroni procedure for multiple tests of significance. Biometrika, 1986.
  • Storkey (2009) Amos Storkey. When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, 2009.
  • Sugiyama et al. (2008) Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, 2008.
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Truong et al. (2018) Charles Truong, Laurent Oudre, and Nicolas Vayatis. A review of change point detection methods. arXiv preprint arXiv:1801.00718, 2018.
  • Vovk and Wang (2018) Vladimir Vovk and Ruodu Wang. Combining p-values via averaging. 2018.
  • Zaykin et al. (2002) Dmitri V Zaykin, Lev A Zhivotovsky, Peter H Westfall, and Bruce S Weir. Truncated product method for combining p-values. Genetic Epidemiology: The Official Publication of the International Genetic Epidemiology Society, 2002.
  • Zhang et al. (2013) Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning (ICML), 2013.
  • Zügner et al. (2018) Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. Adversarial attacks on neural networks for graph data. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 2847–2856, 2018.

Appendix A Shift detection results

Our complete shift detection results, in which we evaluate different kinds of target shifts on MNIST and CIFAR-10 using the proposed methods, are documented below. In addition to our artificially generated shifts, we also evaluated our testing procedure on the original splits provided by MNIST, Fashion MNIST, CIFAR-10, and SVHN.

A.1 Artificially generated shifts

A.1.1 MNIST

(a) 10% adversarial samples.
(b) 50% adversarial samples.
(c) 100% adversarial samples.
Figure 3: MNIST adversarial shift, univariate two-sample tests + Bonferroni aggregation.
(a) 10% adversarial samples.
(b) 50% adversarial samples.
(c) 100% adversarial samples.
Figure 4: MNIST adversarial shift, multivariate two-sample tests
(a) Knock out 10% of class 0.
(b) Knock out 50% of class 0.
(c) Knock out 100% of class 0.
Figure 5: MNIST knock-out shift, univariate two-sample tests + Bonferroni aggregation.
(a) Knock out 10% of class 0.
(b) Knock out 50% of class 0.
(c) Knock out 100% of class 0.
Figure 6: MNIST knock-out shift, multivariate two-sample tests
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 7: MNIST large Gaussian noise shift, univariate two-sample tests + Bonferroni aggregation.
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 8: MNIST large Gaussian noise shift, multivariate two-sample tests
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 9: MNIST medium Gaussian noise shift, univariate two-sample tests + Bonferroni aggregation.
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 10: MNIST medium Gaussian noise shift, multivariate two-sample tests
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 11: MNIST small Gaussian noise shift, univariate two-sample tests + Bonferroni aggregation.
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 12: MNIST small Gaussian noise shift, multivariate two-sample tests
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 13: MNIST large image shift, univariate two-sample tests + Bonferroni aggregation.
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 14: MNIST large image shift, multivariate two-sample tests
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 15: MNIST medium image shift, univariate two-sample tests + Bonferroni aggregation.
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 16: MNIST medium image shift, multivariate two-sample tests
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 17: MNIST small image shift, univariate two-sample tests + Bonferroni aggregation.
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 18: MNIST small image shift, multivariate two-sample tests
(a) Knock out 10% of class 0.
(b) Knock out 50% of class 0.
(c) Knock out 100% of class 0.
Figure 19: MNIST medium image shift (50%, fixed) plus knock-out shift (variable), univariate two-sample tests + Bonferroni aggregation.
(a) Knock out 10% of class 0.
(b) Knock out 50% of class 0.
(c) Knock out 100% of class 0.
Figure 20: MNIST medium image shift (50%, fixed) plus knock-out shift (variable), multivariate two-sample tests
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 21: MNIST only-zero shift (fixed) plus medium image shift (variable), univariate two-sample tests + Bonferroni aggregation.
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 22: MNIST only-zero shift (fixed) plus medium image shift (variable), multivariate two-sample tests
(a) Randomly shuffled dataset with same split proportions as original dataset.
(b) Original split.
Figure 23: MNIST to USPS domain adaptation, univariate two-sample tests + Bonferroni aggregation.
(a) Randomly shuffled dataset with same split proportions as original dataset.
(b) Original split.
Figure 24: MNIST to USPS domain adaptation, multivariate two-sample tests.

A.1.2 CIFAR-10

(a) 10% adversarial samples.
(b) 50% adversarial samples.
(c) 100% adversarial samples.
Figure 25: CIFAR-10 adversarial shift, univariate two-sample tests + Bonferroni aggregation.
(a) 10% adversarial samples.
(b) 50% adversarial samples.
(c) 100% adversarial samples.
Figure 26: CIFAR-10 adversarial shift, multivariate two-sample tests
(a) Knock out 10% of class 0.
(b) Knock out 50% of class 0.
(c) Knock out 100% of class 0.
Figure 27: CIFAR-10 knock-out shift, univariate two-sample tests + Bonferroni aggregation.
(a) Knock out 10% of class 0.
(b) Knock out 50% of class 0.
(c) Knock out 100% of class 0.
Figure 28: CIFAR-10 knock-out shift, multivariate two-sample tests
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 29: CIFAR-10 large Gaussian noise shift, univariate two-sample tests + Bonferroni aggregation.
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 30: CIFAR-10 large Gaussian noise shift, multivariate two-sample tests
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 31: CIFAR-10 medium Gaussian noise shift, univariate two-sample tests + Bonferroni aggregation.
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 32: CIFAR-10 medium Gaussian noise shift, multivariate two-sample tests
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 33: CIFAR-10 small Gaussian noise shift, univariate two-sample tests + Bonferroni aggregation.
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 34: CIFAR-10 small Gaussian noise shift, multivariate two-sample tests
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 35: CIFAR-10 large image shift, univariate two-sample tests + Bonferroni aggregation.
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 36: CIFAR-10 large image shift, multivariate two-sample tests
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 37: CIFAR-10 medium image shift, univariate two-sample tests + Bonferroni aggregation.
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 38: CIFAR-10 medium image shift, multivariate two-sample tests
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 39: CIFAR-10 small image shift, univariate two-sample tests + Bonferroni aggregation.
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 40: CIFAR-10 small image shift, multivariate two-sample tests
(a) Knock out 10% of class 0.
(b) Knock out 50% of class 0.
(c) Knock out 100% of class 0.
Figure 41: CIFAR-10 medium image shift (50%, fixed) plus knock-out shift (variable), univariate two-sample tests + Bonferroni aggregation.
(a) Knock out 10% of class 0.
(b) Knock out 50% of class 0.
(c) Knock out 100% of class 0.
Figure 42: CIFAR-10 medium image shift (50%, fixed) plus knock-out shift (variable), multivariate two-sample tests
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 43: CIFAR-10 only-zero shift (fixed) plus medium image shift (variable), univariate two-sample tests + Bonferroni aggregation.
(a) 10% perturbed samples.
(b) 50% perturbed samples.
(c) 100% perturbed samples.
Figure 44: CIFAR-10 only-zero shift (fixed) plus medium image shift (variable), multivariate two-sample tests

A.2 Original splits

A.2.1 MNIST

(a) Randomly shuffled dataset with same split proportions as original dataset.
(b) Original split.
Figure 45: MNIST randomized and original split, univariate two-sample tests + Bonferroni aggregation.
(a) Randomly shuffled dataset with same split proportions as original dataset.
(b) Original split.
Figure 46: MNIST randomized and original split, multivariate two-sample tests.

A.2.2 Fashion MNIST

(a) Randomly shuffled dataset with same split proportions as original dataset.
(b) Original split.
Figure 47: Fashion MNIST randomized and original split, univariate two-sample tests + Bonferroni aggregation.
(a) Randomly shuffled dataset with same split proportions as original dataset.
(b) Original split.
Figure 48: Fashion MNIST randomized and original split, multivariate two-sample tests.

A.2.3 CIFAR-10

(a) Randomly shuffled dataset with same split proportions as original dataset.
(b) Original split.
Figure 49: CIFAR-10 randomized and original split, univariate two-sample tests + Bonferroni aggregation.
(a) Randomly shuffled dataset with same split proportions as original dataset.
(b) Original split.
Figure 50: CIFAR-10 randomized and original split, multivariate two-sample tests.

A.2.4 SVHN

(a) Randomly shuffled dataset with same split proportions as original dataset.
(b) Original split.
Figure 51: SVHN randomized and original split, univariate two-sample tests + Bonferroni aggregation.
(a) Randomly shuffled dataset with same split proportions as original dataset.
(b) Original split.
Figure 52: SVHN randomized and original split, multivariate two-sample tests.

Appendix B Domain classifier samples

We further report the top most-different samples found by the domain classifier in some of our MNIST experiments.

B.1 Artificially generated shifts

Adversarial shift, Gaussian noise, and image shift are clearly detectable from the domain classifier's most-different samples.

Figure 53: Top different samples in target set with adversarial shift.
Figure 54: Top different samples in target set with Gaussian noise.
Figure 55: Top different samples in target set with image shift.

B.2 Original split

In the original MNIST split, we see that an unusually high number of 6's is returned by the domain classifier.

Figure 56: Top different samples in target set from original split.