1 Introduction
Even subtle changes in the data distribution can destroy the performance of otherwise state-of-the-art classifiers, a phenomenon exemplified by adversarial examples (Szegedy et al., 2013; Zügner et al., 2018). When decisions are made under uncertainty, even shifts in the label distribution can significantly compromise accuracy (Zhang et al., 2013; Lipton et al., 2018). Unfortunately, in practice, ML pipelines rarely inspect incoming data for signs of distribution shift, and best practices for detecting shift in high-dimensional real-world data have not yet been established.^1 The first indications that something has gone awry might come only when customers complain.

^1 TensorFlow's data validation tools only compare summary statistics of training vs. incoming data; see https://www.tensorflow.org/tfx/data_validation/get_started#checking_data_skew_and_drift.
This paper investigates methods for efficiently detecting distribution shift, a problem naturally cast as two-sample testing. We wish to test the equivalence of the source distribution (from which training data is sampled) and the target distribution (from which real-world data is sampled). For simple univariate distributions, such hypothesis testing is a mature science. One might be tempted to use off-the-shelf multivariate two-sample tests to handle high-dimensional data, but these kernel-based approaches do not scale with the dataset size, and their statistical power decays badly when the ambient dimension is high (Ramdas et al., 2015). For ML practitioners, another intuitive approach might be to train a classifier to distinguish between examples from the source and target distributions. We can then check whether this classifier achieves accuracy significantly greater than chance. Analyzing the simple case of testing the means of two Gaussians, Ramdas et al. (2016) recently made the intriguing discovery that the power of a classification-based strategy using Fisher's Linear Discriminant Analysis (LDA) achieves minimax rate-optimal performance. However, to date, we know of no rigorous empirical investigation characterizing classifier-based approaches to recognizing dataset shift in the real, high-dimensional data distributions with no known parametric form on which modern machine learning is routinely deployed. Providing this analysis is a key contribution of this paper. To avoid confusion, we call any source-vs-target classifier a domain classifier and refer to the original classifier (trained on source data to predict the classes) as the label classifier.
A key benefit of the classifier-based approach is that the domain classifier reduces dimensionality to a single dimension, learned precisely for the purpose of discriminating between source and target data. However, a major drawback is that deep neural networks, the very classifiers that are effective on the high-dimensional data that interests us, require large amounts of training data. Adding to the problem, the domain-classifier approach requires partitioning our (scarce) target data, using, e.g., half for training and leaving only the remainder for two-sample testing. Thus, as an alternative, we also explore the black box shift detection (BBSD) approach due to Lipton et al. (2018), which addresses shift detection under the label shift assumption. They show that if one possesses an off-the-shelf label classifier with an invertible confusion matrix (verifiable on training data), then detecting that the source distribution $p$ differs from the target distribution $q$ requires only detecting that $p(\hat{y}) \neq q(\hat{y})$, where $\hat{y}$ denotes the classifier's predictions. This insight enables efficient shift detection, using a pretrained (label) classifier for dimensionality reduction. Building on these ideas of combining (black-box) dimensionality reduction with subsequent two-sample testing, we explore a range of dimensionality reduction techniques and compare them under a wide variety of shifts (Figure 1 illustrates our general framework). Interestingly, we show (empirically) that BBSD works surprisingly well under a broad set of shifts, outperforming other methods even when its assumptions are not met.
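To make the domain-classifier idea concrete, the following is a minimal sketch of the accuracy-vs-chance test it implies. The data and the deliberately simple nearest-centroid classifier are illustrative stand-ins (the paper itself uses deep networks); the one-sided exact binomial test is computed directly from the binomial tail.

```python
# Sketch of the domain-classifier shift test. Hypothetical synthetic data and
# a toy nearest-centroid classifier stand in for a trained deep model.
import math
import numpy as np

def binomial_pvalue(correct, n, p=0.5):
    """One-sided exact binomial tail: P(X >= correct) under chance accuracy p."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(correct, n + 1))

def domain_classifier_test(source, target, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Use half of each sample for training, reserve the rest for testing.
    def split(x):
        x = rng.permutation(x)
        m = len(x) // 2
        return x[:m], x[m:]
    s_tr, s_te = split(source)
    t_tr, t_te = split(target)
    # "Train" a nearest-centroid domain classifier: source = 0, target = 1.
    c0, c1 = s_tr.mean(axis=0), t_tr.mean(axis=0)
    def predict(x):
        d0 = np.linalg.norm(x - c0, axis=1)
        d1 = np.linalg.norm(x - c1, axis=1)
        return (d1 < d0).astype(int)
    correct = int((predict(s_te) == 0).sum() + (predict(t_te) == 1).sum())
    n = len(s_te) + len(t_te)
    p_val = binomial_pvalue(correct, n)
    return correct / n, p_val, p_val < alpha

rng = np.random.default_rng(1)
source = rng.normal(0.0, 1.0, size=(500, 10))
shifted = rng.normal(0.5, 1.0, size=(500, 10))   # mean-shifted target
acc, p, reject = domain_classifier_test(source, shifted)
```

Under the no-shift null, held-out accuracy hovers near 50%, so the binomial tail is large and the test does not reject; under the mean shift above, accuracy is well above chance and the null is rejected.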
Related work
Given just one example from the test data, our problem simplifies to anomaly detection, surveyed thoroughly by Chandola et al. (2009) and Markou and Singh (2003). Popular approaches to anomaly detection include density estimation (Breunig et al., 2000), margin-based approaches such as one-class SVMs (Schölkopf et al., 2000), and the tree-based isolation forest method due to Liu et al. (2008). Recently, GANs have been explored for this task (Schlegl et al., 2017). Given simple streams of data arriving in a time-dependent fashion, where the signal is piecewise stationary with stationary periods separated by abrupt changes, the problem of detecting dataset shift becomes the classic time-series problem of change point detection, with existing methods summarized succinctly in an excellent survey by Truong et al. (2018). An extensive literature addresses dataset shift in machine learning, typically in the larger context of domain adaptation, often through importance-weighted risk minimization. Owing to the impossibility of correcting for shift absent assumptions (Ben-David et al., 2010), these papers often assume either covariate shift (Shimodaira, 2000; Sugiyama et al., 2008; Gretton et al., 2009) or label shift (Saerens et al., 2002; Chan and Ng, 2005; Storkey, 2009; Zhang et al., 2013; Lipton et al., 2018). Schölkopf et al. (2012) provide a unifying view of these shifts, showing how assumed invariances in conditional probabilities correspond to causal assumptions about the inputs and outputs.
2 Shift Detection Techniques
Given labeled data $(x_1, y_1), \ldots, (x_n, y_n) \sim p$ and unlabeled data $x'_1, \ldots, x'_m \sim q$, our task is to determine whether the source distribution $p$ equals the target distribution $q$. Formally, we test $H_0: p = q$ vs. $H_1: p \neq q$. Chiefly, we explore the following design considerations: (i) what representation to run the test on; (ii) which two-sample test to run; (iii) when the representation is multidimensional, whether to run multivariate or multiple univariate two-sample tests; and (iv) how to combine their results. Additionally, we share some preliminary work on qualitatively characterizing the shift, e.g., by presenting exemplars or identifying salient features.
2.1 Dimensionality Reduction
Building on recent results of Lipton et al. (2018) and Ramdas et al. (2015) suggesting the benefits of low-dimensional two-sample testing, we consider the following representations: (i) No Reduction (NoRed): to justify any dimensionality reduction technique, we include tests on the original raw features; (ii) SRP: sparse random projections (implemented in scikit-learn with default parameters); (iii) PCA: principal components analysis; (iv) TAE: representations extracted by an autoencoder trained on source data; (v) UAE: the same approach but with an untrained autoencoder; (vi) BBSD: here we adopt the approach of Lipton et al. (2018), using the outputs of a label classifier trained on source data as our representation for subsequent two-sample testing. Two variations of this approach are to use the hard-thresholded predictions of the label classifier (BBSDh), enabling a likelihood ratio test, or to use the softmax outputs (BBSDs), requiring a subsequent multivariate test; and (vii) Classifier: we partition both the source data and the target data into two halves, training a domain classifier to distinguish source (class 0) from target (class 1), trained with balanced classes. We then apply this model to the remaining data, performing a subsequent binomial test on the hard-thresholded predictions.

In this paper, we use the following heuristic for choosing the latent dimension $K$: we first fix a fraction of the variance in the original data that we would like the latent representation to explain (held constant across our experiments). We then choose $K$ to be the minimal number of principal components required to explain this fraction of the variance. Guided by this heuristic, and taking the liberty of rounding for convenience, we used the same latent dimension for all experiments on all datasets.

2.2 Two-sample testing
The dimensionality reduction techniques each yield a representation, either uni- or multidimensional, and either continuous or discrete. Among categorical outputs, we may have binary outputs (as from the domain classifier) or multiple categories (the predictions of the hard label classifier). The next step is to choose a suitable two-sample test. In all experiments, we adopt a fixed significance level $\alpha$ for hypothesis rejection. For representation methods that yield multidimensional outputs, we have two choices: to perform a multivariate two-sample test, such as the kernel two-sample test due to Gretton et al. (2012), or to perform univariate tests separately on each component. In the latter case, we must subsequently combine the $p$-values from each test, encountering the problem of multiple hypothesis testing. Unable to make strong assumptions about the dependence among the tests, we must rely on a conservative aggregation method, such as the Bonferroni correction (Bland and Altman, 1995). While a number of less conservative aggregation methods have been proposed (Simes, 1986; Zaykin et al., 2002; Loughin, 2004; Heard and Rubin-Delanchy, 2018; Vovk and Wang, 2018), we found that, even using the Bonferroni method, dimensionality reduction plus aggregation generally outperformed kernel two-sample testing on those same representations.
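For the multivariate option, a standard instantiation of the kernel two-sample test of Gretton et al. (2012) is the maximum mean discrepancy (MMD) with a permutation-based $p$-value. The following is a minimal sketch under common defaults (RBF kernel, median-heuristic bandwidth, biased MMD estimator); the synthetic data is illustrative.

```python
# Sketch of a kernel two-sample (MMD) test with a permutation p-value.
import numpy as np

def rbf_mmd2(x, y, gamma):
    """Biased estimate of squared MMD with an RBF kernel exp(-gamma * d^2)."""
    z = np.vstack([x, y])
    sq = np.sum(z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * z @ z.T     # pairwise squared distances
    k = np.exp(-gamma * d2)
    n = len(x)
    kxx, kyy, kxy = k[:n, :n], k[n:, n:], k[:n, n:]
    return kxx.mean() + kyy.mean() - 2 * kxy.mean()

def mmd_permutation_test(x, y, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    z = np.vstack([x, y])
    # Median heuristic for the bandwidth (a common default, not the only choice).
    d2 = np.sum((z[:, None, :] - z[None, :, :])**2, axis=-1)
    gamma = 1.0 / np.median(d2[d2 > 0])
    observed = rbf_mmd2(x, y, gamma)
    n = len(x)
    # Null distribution: recompute the statistic under random relabelings.
    count = sum(
        rbf_mmd2(zp[:n], zp[n:], gamma) >= observed
        for zp in (z[rng.permutation(len(z))] for _ in range(n_perm))
    )
    return observed, (count + 1) / (n_perm + 1)      # permutation p-value

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(100, 5))
y = rng.normal(1.0, 1.0, size=(100, 5))              # clearly shifted target
stat, p = mmd_permutation_test(x, y)
```

The quadratic cost in the pooled sample size is visible here, which is exactly the unfavorable scaling with dataset size noted above.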
When performing univariate tests on continuous variables, we adopt the Kolmogorov-Smirnov (KS) test, a nonparametric test whose statistic is calculated by taking the supremum over all values $z$ of the absolute differences of the cumulative distribution functions (CDFs): $D = \sup_z |F_s(z) - F_t(z)|$, where $F_s$ and $F_t$ are the empirical CDFs of the source and target data, respectively. Each value of the KS statistic corresponds to a $p$-value. When aggregating $K$ univariate tests, we apply the Bonferroni correction, rejecting the null hypothesis if the minimum $p$-value among all tests is less than $\alpha / K$. For all methods yielding multidimensional representations (NoRed, PCA, SRP, UAE, TAE, and BBSDs), we tried both the kernel two-sample test and Bonferroni-corrected univariate KS tests, finding, to our surprise, that the aggregated KS tests provided superior shift detection in most cases (see experiments). For BBSDh (using the hard-thresholded label classifier predictions), we employ a likelihood ratio test, known to be optimal in the parametric case. For the domain classifier, we simply compare its accuracy on held-out data to random chance via a binomial test.
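The Bonferroni-aggregated KS procedure is short enough to state directly in code. The following sketch uses `scipy.stats.ks_2samp` for the per-feature tests; the synthetic data (one shifted feature out of eight) is illustrative.

```python
# Sketch of per-feature KS tests aggregated with the Bonferroni correction.
import numpy as np
from scipy.stats import ks_2samp

def aggregated_ks_test(source, target, alpha=0.05):
    """Reject the no-shift null iff the minimum per-feature KS p-value
    falls below alpha / K, where K is the number of features."""
    k = source.shape[1]
    p_values = np.array([ks_2samp(source[:, i], target[:, i]).pvalue
                         for i in range(k)])
    return p_values, bool(p_values.min() < alpha / k)

rng = np.random.default_rng(0)
source = rng.normal(size=(400, 8))
target = rng.normal(size=(400, 8))
target[:, 0] += 1.0                      # shift a single feature by one std
p_values, reject = aggregated_ks_test(source, target)
```

Even though only one of the eight features is shifted, its tiny $p$-value clears the corrected threshold $\alpha / K$, so the aggregate test rejects.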
3 Experiments
We briefly summarize our experimental setup. Our experiments address the MNIST and CIFAR10 datasets. For the autoencoder (UAE & TAE) experiments, we employ a convolutional architecture with convolutional and fully-connected layers. For the label classifier and the domain classifier, we use a ResNet-18 (He et al., 2016). We train all networks (TAE, BBSDs, BBSDh, Classifier) with stochastic gradient descent (SGD) with momentum, decaying the learning rate over training and applying early stopping (the domain classifier is trained for fewer epochs). In these experiments, we simulate a number of varieties of shift, affecting both the covariates and the label proportions. For all shifts, we evaluate the various methods' abilities to detect shift (including no-shift cases to check against false positives). We evaluate the models with various numbers of samples from the target dataset. Because of the unfavorable dependence of kernel methods on the dataset size, we run these methods only up to a bounded number of target samples.

For each shift type (as appropriate), we explored three levels of shift intensity (e.g., the magnitude of added noise) and various percentages of affected data. We explore the following types of shift: Adversarial examples: as introduced by Szegedy et al. (2013) and created via the FGSM method due to Goodfellow et al. (2014); Knockout shift (KO): a form of label shift introduced by Lipton et al. (2018), where a fraction of the data points from a single class are removed, creating class imbalance; Gaussian shift (GN): covariates corrupted by Gaussian noise of a given standard deviation; Image shift (IMG): we simulated more natural shifts to images, modifying a fraction of photos with combinations of random amounts of rotation, translation, and zoom-in, and we also explored combinations of image shift with label shift; Original splits: as a sanity check, we also evaluated our detectors on the original source/target splits provided by the MNIST, CIFAR10, Fashion MNIST, and SVHN datasets, typically regarded as being i.i.d.; and Domain adaptation datasets: we tested our detection method on the domain adaptation task from MNIST (source) to USPS (target) (Long et al., 2013).

4 Discussion
Aggregating results across the broad spectrum of explored shifts (Figure 2), we see that BBSDs and UAE perform best. Interestingly, the simple no-reduction baseline, with univariate tests on each input feature aggregated via the Bonferroni correction (which is severe here, owing to the large ambient dimension), performs in the middle of the pack. The domain-classifier approach (denoted Classif in Figure 2) struggles the most to detect shift in the low-sample regime, but performs better with more samples. One benefit of the classifier-based approach is that it gives us an intuitive way to quantify the amount of shift (the accuracy of the classifier), and it also yields a mechanism for identifying the exemplars most likely to occur in either the source or target distribution. In Appendix A, we break out all results by shift type, and we depict the exemplars identified most confidently by the classifier approach in Appendix B. We were surprised to find that, across our dimensionality-reduction methods, aggregated univariate tests performed separately on each component of the latent vectors outperformed multivariate tests. The multivariate test also performed poorly in the no-reduction case, but this was expected.
One surprising finding, discovered early in this study, was that the original MNIST train/test split appears not to be i.i.d., a shift detected by nearly all methods. As a sanity check, we reran the test on a random split, not detecting a shift. Closer inspection revealed significant differences in the means of the 6's. Corroborating this finding, the most confidently predicted classifier scores singled out test-set 6's.
We see several promising paths for future work: (i) we must extend our work to account intelligently for repeated two-sample tests over time as data streams in, exploiting the high degree of correlation between adjacent time steps; and (ii) in reality, all datasets shift, so the bigger challenge is to quantify and characterize the shift, and to decide when practitioners should be alarmed and what actions they should take.
References

Ben-David et al. (2010) Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
Bland and Altman (1995) J Martin Bland and Douglas G Altman. Multiple significance tests: the Bonferroni method. BMJ, 1995.

Breunig et al. (2000) Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. LOF: identifying density-based local outliers. In ACM SIGMOD Record. ACM, 2000.
Chan and Ng (2005) Yee Seng Chan and Hwee Tou Ng. Word sense disambiguation with distribution estimation. In International Joint Conference on Artificial Intelligence (IJCAI), 2005.
 Chandola et al. (2009) Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR), 2009.
 Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2014.
 Gretton et al. (2009) Arthur Gretton, Alexander J Smola, Jiayuan Huang, Marcel Schmittfull, Karsten M Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Journal of Machine Learning Research (JMLR), 2009.
 Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel twosample test. Journal of Machine Learning Research (JMLR), 2012.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer vision and pattern recognition (CVPR), 2016.
 Heard and Rubin-Delanchy (2018) Nicholas A Heard and Patrick Rubin-Delanchy. Choosing between methods of combining p-values. Biometrika, 2018.
 Lipton et al. (2018) Zachary C Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning (ICML), 2018.
 Liu et al. (2008) Fei Tony Liu, Kai Ming Ting, and ZhiHua Zhou. Isolation forest. In International Conference on Data Mining (ICDM), 2008.

Long et al. (2013) Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. Transfer feature learning with joint distribution adaptation. In International Conference on Computer Vision (ICCV), 2013.
Loughin (2004) Thomas M Loughin. A systematic comparison of methods for combining p-values from independent tests. Computational Statistics & Data Analysis, 2004.
 Markou and Singh (2003) Markos Markou and Sameer Singh. Novelty detection: a review—part 1: statistical approaches. Signal processing, 2003.
 Ramdas et al. (2015) Aaditya Ramdas, Sashank Jakkam Reddi, Barnabás Póczos, Aarti Singh, and Larry A Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Association for the Advancement of Artificial Intelligence (AAAI), 2015.
 Ramdas et al. (2016) Aaditya Ramdas, Aarti Singh, and Larry Wasserman. Classification accuracy as a proxy for two sample testing. arXiv preprint arXiv:1602.02210, 2016.
 Saerens et al. (2002) Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural computation, 2002.
 Schlegl et al. (2017) Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula SchmidtErfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, 2017.
 Schölkopf et al. (2000) Bernhard Schölkopf, Robert C Williamson, Alex J Smola, John ShaweTaylor, and John C Platt. Support vector method for novelty detection. In Advances in neural information processing systems, 2000.
 Schölkopf et al. (2012) Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In International Conference on Machine Learning (ICML), 2012.
 Shimodaira (2000) Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the loglikelihood function. Journal of statistical planning and inference, 2000.
 Simes (1986) R John Simes. An improved Bonferroni procedure for multiple tests of significance. Biometrika, 1986.
 Storkey (2009) Amos Storkey. When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, 2009.
 Sugiyama et al. (2008) Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, 2008.
 Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 Truong et al. (2018) Charles Truong, Laurent Oudre, and Nicolas Vayatis. A review of change point detection methods. arXiv preprint arXiv:1801.00718, 2018.
 Vovk and Wang (2018) Vladimir Vovk and Ruodu Wang. Combining p-values via averaging. 2018.
 Zaykin et al. (2002) Dmitri V Zaykin, Lev A Zhivotovsky, Peter H Westfall, and Bruce S Weir. Truncated product method for combining p-values. Genetic Epidemiology, 2002.
 Zhang et al. (2013) Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning (ICML), 2013.
 Zügner et al. (2018) Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. Adversarial attacks on neural networks for graph data. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 2847–2856, 2018.
Appendix A Shift detection results
Our complete shift detection results in which we evaluate different kinds of target shifts on MNIST and CIFAR10 using the proposed methods are documented below. In addition to our artificially generated shifts, we also evaluated our testing procedure on the original splits provided by MNIST, Fashion MNIST, CIFAR10, and SVHN.
A.1 Artificially generated shifts
A.1.1 MNIST
A.1.2 CIFAR10
A.2 Original splits
A.2.1 MNIST
A.2.2 Fashion MNIST
A.2.3 CIFAR10
A.2.4 SVHN
Appendix B Difference classifier samples
We further report the samples ranked as most different by the difference classifier in some of our MNIST experiments.
B.1 Artificially generated shifts
Adversarial shifts, Gaussian noise, and image shifts are clearly detectable using the difference classifier’s samples.
B.2 Original split
In the original MNIST split, we see that an unusually high number of 6's is returned by the difference classifier.