Test for non-negligible adverse shifts

07/07/2021 ∙ by Vathy M. Kamulete, et al.

Statistical tests for dataset shift are susceptible to false alarms: they are sensitive to minor differences where there is in fact adequate sample coverage and predictive performance. We propose instead a robust framework for tests of dataset shift based on outlier scores, D-SOS for short. D-SOS detects adverse shifts and can identify false alarms caused by benign ones. It posits that a new (test) sample is not substantively worse than an old (training) sample, and not that the two are equal. The key idea is to reduce observations to outlier scores and compare contamination rates. Beyond comparing distributions, users can define what worse means in terms of predictive performance and other relevant notions. We show how versatile and practical D-SOS is for a wide range of real and simulated datasets. Unlike tests of equal distribution and of goodness-of-fit, the D-SOS tests are uniquely tailored to serve as robust performance metrics to monitor model drift and dataset shift.




1 Introduction

Suppose we fit a predictive model on a training set and predict on a test set. Dataset shift (Quionero-Candela et al. 2009; Jose G Moreno-Torres et al. 2012; Kelly, Hand, and Adams 1999), also known as data or population drift, occurs when training and test distributions are not alike. This is essentially a sample mismatch problem. Some regions of the data space are either too sparse or absent during training and gain importance at test time. We want methods that alert users to the presence of unexpected inputs in the test set (Rabanser, Günnemann, and Lipton 2019). To do so, a measure of divergence between training and test set is required. Can we not simply use the many modern off-the-shelf multivariate tests of equal distributions for this?

One reason for moving beyond tests of equal distributions is that they are often too strict. They require high fidelity between training and test set everywhere in the input domain. However, not all changes in distribution are a cause for concern – some changes are benign. Practitioners distrust these tests because of false alarms. Polyzotis et al. (2019) comment:

statistical tests for detecting changes in the data distribution […] are too sensitive and also uninformative for the typical scale of data in machine learning pipelines, which led us to seek alternative methods to quantify changes between data distributions.

Even when the difference is small or negligible, tests of equal distributions reject the null hypothesis of no difference. An alarm should only be raised if a shift warrants intervention. Retraining models when distribution changes are benign is both costly and ineffective. To tackle these challenges, we propose D-SOS instead. D-SOS provides robust performance metrics for monitoring model drift and validating data quality.

In comparing the test set to the training set, D-SOS pays more attention to the regions – typically, the outlying regions – where we are most vulnerable. To confront false alarms, it uses a robust test statistic, namely the weighted area under the receiver operating characteristic curve (WAUC). The weights in the WAUC (Li and Fine 2010) discount the safe regions of the distribution. The goal of D-SOS is to detect non-negligible adverse shifts. This is reminiscent of noninferiority tests (Wellek 2010), widely used in healthcare to determine that a new treatment is in fact not inferior to an older one. Colloquially, the D-SOS null hypothesis holds that the new sample is not substantively worse than the old sample, not that the two are equal. D-SOS moves beyond tests of equal distributions and lets users specify which notions of outlyingness to probe. The choice of the score function plays a central role in formalizing what we mean by ‘worse’. D-SOS avoids the prevailing overreliance on tests of equal distributions to detect and assess the impact of dataset shift. We argue throughout that other notions of shift or outlyingness are, if not more informative, at the very least complementary.

In this paper, we make the following contributions:

  1. We derive D-SOS, a novel multivariate two-sample test for no adverse shift, from tests of goodness-of-fit. D-SOS computes its p-values based on sample splitting, permutation and/or out-of-bag predictions. In our experience, the out-of-bag variant is the most convenient. This approach improves on sample splitting, which sacrifices calibration accuracy for inferential robustness (Rinaldo et al. 2019).

  2. We compare D-SOS performance to two modern tests of equal distributions. To do so, we transform p-values to s-values (Greenland 2019). s-values give us the means to define a region of practical equivalence. We show via simulations that D-SOS matches or exceeds the performance of these tests when the outlier score reflects whether the observation belongs to the training or test set. These results underscore the point that tests of equal distributions are suboptimal when we could instead opt for tests of goodness-of-fit, which explicitly account for the training set being the reference distribution.

  3. Based on 62 real-world classification tasks from the OpenML-CC18 benchmark (Casalicchio et al. 2017; Bischl et al. 2017), we show that different notions of outlyingness are complementary for detecting partition-induced dataset shift. In supervised settings, D-SOS extends to notions of outlyingness such as residuals and prediction intervals. These help to determine whether the shift has an adverse impact on predictive performance. We estimate the correlations between these complementary notions.

The main takeaway is that given a generic method that assigns an outlier score to a data point, D-SOS uplifts these scores and turns them into a two-sample test for no adverse shift. The scores can come from anomaly detection, two-sample classification, uncertainty quantification, residual diagnostics, density estimation, dimension reduction, and more. We have created an R package, dsos, for our method. In addition, all code and data used in this paper are publicly available.

2 Statistical framework

The theoretical framework builds on Zhang (2002), hereafter referred to as the Zhang test. Take an i.i.d. training set $X^{tr}$ and a test set $X^{te}$. Each dataset with origin $o \in \{tr, te\}$ lies in a $d$-dimensional domain $\mathcal{X} \subseteq \mathbb{R}^{d}$ with sample size $n^{o}$, cumulative distribution function (CDF) $F^{o}$ and probability density function (PDF) $f^{o}$. Let $\phi : \mathcal{X} \to \mathbb{R}$ be a score function and define binary instances such that the binary instance for $x$ is 1 when the score $\phi(x)$ exceeds the threshold $s$ and 0, otherwise. The proportion above the threshold in dataset $o$ is in effect the contamination rate $C^{o}(s)$, defined as $C^{o}(s) = \Pr\big(\phi(x^{o}) \geq s\big)$. The contamination rate is the complementary CDF of the scores, $C^{o}(s) = 1 - F^{o}_{\phi}(s)$. As before, $F^{o}_{\phi}$ and $f^{o}_{\phi}$ denote the CDF and PDF of the scores.
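As a minimal sketch (in Python rather than the paper's R, and not the authors' code), the binary instances and the contamination rate can be computed directly from outlier scores:

```python
# Sketch: contamination rate of a sample at a threshold, i.e. the fraction
# of outlier scores at or above the threshold (the empirical complementary
# CDF of the scores).
def contamination(scores, threshold):
    flags = [1 if s >= threshold else 0 for s in scores]  # binary instances
    return sum(flags) / len(flags)

train_scores = [0.1, 0.2, 0.3, 0.4, 0.9]
test_scores = [0.1, 0.5, 0.8, 0.9, 0.95]
# The test sample is more contaminated at a high threshold:
assert contamination(test_scores, 0.5) > contamination(train_scores, 0.5)
```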

Consider the null hypothesis of equal distribution $H_0 : F^{te} = F^{tr}$ against the alternative $H_1 : F^{te} \neq F^{tr}$. In tandem, consider the null of equal contamination $H_0(s) : C^{te}(s) = C^{tr}(s)$ against the alternative $H_1(s) : C^{te}(s) \neq C^{tr}(s)$ at a given threshold score $s$. To evaluate goodness-of-fit, Zhang (2002) shows that testing $H_0$ is equivalent to testing $H_0(s)$ for all $s$. If $T(s)$ is the relevant test statistic for equal contamination, a global test statistic for goodness-of-fit can be constructed from $T(s)$, its local counterpart. One such is

$$T = \int_{-\infty}^{\infty} T(s)\, w(s)\, ds \qquad (1)$$

where $w(s)$ are threshold-dependent weights. For concision, we sometimes suppress the dependence on the threshold score $s$ and denote weights, contamination statistics and rates as $w$, $T$ and $C^{o}$ respectively. D-SOS differs from the statistic (1) in three particular aspects: the score function $\phi$, the weights $w$ and the contamination statistic $T$. We address each in turn.

D-SOS scores instances from least to most abnormal according to a specified notion of outlyingness. To be concrete, for density estimation (in that case, the negative log density, a measure of surprise, is a natural score for outlyingness), this property of $\phi$ can be expressed as

$$f^{tr}(x_i) \geq f^{tr}(x_j) \;\Longleftrightarrow\; \phi(x_i) \leq \phi(x_j) + \epsilon \qquad (2)$$

for all $x_i, x_j \in \mathcal{X}$ and $\epsilon \geq 0$, a (sufficiently small) approximation error. Accordingly, instances in high-density regions of the training set (nominal points) score low; those in low-density regions (outliers) score high. Here, the score function $\phi$ can be thought of as a density-preserving projection. More generally, higher scores, e.g. wider prediction intervals and larger residuals, indicate worse outcomes; the higher the score, the more unusual the observation. The structure in $\phi$ is the catalyst for adjusting the weights $w$ and the statistic $T$.
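To make the density-based score concrete, here is an illustrative one-dimensional kernel density sketch (not the paper's implementation) that uses the negative log density as the outlier score:

```python
import math

# Illustrative only: a 1-D Gaussian kernel density estimate, with the
# negative log density as the outlier score (higher = more surprising).
def neg_log_density(x, sample, bandwidth=0.5):
    n = len(sample)
    dens = sum(
        math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
    ) / (n * bandwidth * math.sqrt(2 * math.pi))
    return -math.log(dens)

train = [0.0, 0.1, -0.2, 0.05, 0.15]
# A point far from the training mass scores higher (more outlying):
assert neg_log_density(5.0, train) > neg_log_density(0.0, train)
```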

D-SOS updates the weights $w$ to be congruent with the score function $\phi$. When projecting to the outlier subspace, high scores imply unusual points, whereas both tails are viewed as extremes in the Zhang test. Univariate tests such as the Anderson-Darling and the Cramér-von Mises tests fit the framework in (1): they make different choices for $\phi$, $w$ and $T$. These classical tests place more weight at the tails to reflect the severity of tail exceedances relative to deviations in the center of the distribution. D-SOS corrects for outliers being projected to the upper (right) tail of scores via the score function. The D-SOS weights are specified as

$$w(s) \propto \big(1 - C(s)\big)^{2} \qquad (3)$$

where $C(s)$ is the contamination rate at threshold $s$. The weights in (3) shift most of the mass from low-threshold to high-threshold regions. As a result, D-SOS is highly tolerant of negligible shifts associated with low scores, and conversely, it is attentive to the shifts associated with high scores. Zhang (2002) posits other functional forms, but the quadratic relationship between weights and contamination rates gives rise to one of the most powerful variants of the test and so, we follow suit.

D-SOS constructs a test statistic based on score ranks, not on levels. The weighted area under the receiver operating characteristic curve, weighted AUC (WAUC) for short, is a robust statistic that is invariant to changes in levels so long as the underlying ranking is unaffected. The WAUC, denoted $T_{w}$, can be written as

$$T_{w} = \int_{-\infty}^{\infty} w(s)\, C^{te}(s)\, dC^{tr}(s) \qquad (4)$$

assuming the test set is the positive class and gets labeled as 1s (0s for the training set). See Hand (2009) for this derivation of the AUC: the weights are tacked on to obtain the WAUC. Formally, $T_{w}$ in (4) is the D-SOS test statistic for dataset shift. $T_{w}$ is also clearly a member of the class of tests in (1).

The D-SOS null hypothesis is that most instances in the test set resemble the training set. The alternative is that the test set contains more outliers than expected, given the training set as the reference distribution. D-SOS specifies its null as $H_0 : T_{w} \leq T_{w}^{0}$ against the alternative $H_1 : T_{w} > T_{w}^{0}$, where $T_{w}$ is the observed WAUC and $T_{w}^{0}$, the WAUC under the null of exchangeable samples. Note that D-SOS is a one-tailed test. To have no or disproportionately fewer outliers in the test set is desirable. These shifts do not trigger the alarm as they would had D-SOS been two-tailed. Tests of goodness-of-fit and of equal distributions reject the null if the training and test set are different, even if the test set is less abnormal, i.e. better; D-SOS does not. Through the mapping to the outlier subspace, D-SOS reduces a multivariate two-sample test to a univariate representation.
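The WAUC statistic can be sketched as follows. This is not the dsos implementation: the quadratic weights `w(u) = 3 * (1 - u)**2`, with `u` the training-side false positive rate, are an illustrative stand-in for the exact weights in (3):

```python
# Sketch of the WAUC with the test set as the positive class. The weights
# emphasize the low-FPR (high-threshold, outlying) region; they are an
# assumed stand-in, not the exact weights from the paper.
def wauc(train_scores, test_scores, weight=lambda u: 3 * (1 - u) ** 2):
    thresholds = sorted(set(train_scores) | set(test_scores), reverse=True)
    area, prev_fpr = 0.0, 0.0
    for t in thresholds:
        fpr = sum(s >= t for s in train_scores) / len(train_scores)
        tpr = sum(s >= t for s in test_scores) / len(test_scores)
        if fpr > prev_fpr:  # weighted Riemann step along the FPR axis
            area += weight((fpr + prev_fpr) / 2) * tpr * (fpr - prev_fpr)
            prev_fpr = fpr
    return area
```

In practice the observed WAUC is compared against its distribution under shuffled train/test labels (Section 5), not against a fixed constant.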

3 Motivation

For illustration, we apply D-SOS to the canonical iris dataset (Anderson 1935). The task is to classify the species of Iris flowers based on 4 covariates (features) and 50 observations for each species. We show how D-SOS helps diagnose false alarms. Specifically, we highlight the following: (1) changes in distribution do not necessarily hurt predictive performance, and (2) data points in the densest regions of the data distribution can be the most difficult – unsafe – to predict. For the subsequent tests, we split iris into 2/3 training and 1/3 test set. Figure 1 displays train-test pairs, which are split according to different sampling strategies. The first two principal components of iris show that the species cluster together.

Figure 1: Principal components (PC) for split samples with different partitioning strategies. Each subplot is a train-test pair from iris: (1) random sampling, (2) stratified sampling by species, (3) in-distribution examples in the test set, and (4) out-of-distribution examples in the test set.

Consider four notions of outlyingness. For two-sample classification, define the outlier score as the probability of belonging to the training or test set. For anomaly or out-of-distribution detection, the outlier score is the isolation score, a proxy for local density. For residual diagnostics, the score is the out-of-sample error. Finally, for uncertainty quantification (resampling uncertainty), it is the standard error of the mean prediction. This is the same score as in RUE (Schulam and Saria 2019); RUE depends on pointwise confidence intervals to establish model trust. Only the first notion of outlyingness – two-sample classification – pertains to modern tests of equal distributions; the others capture other meaningful notions of adverse shifts. For all these scores, higher is worse: higher scores indicate that the observation is diverging from the desired outcome or that it does not conform to the training set. Section 5 covers the implementation details.

Suppose we want to test for dataset shift. How do the splits in Figure 1 fare with respect to the tests for no adverse shifts given by these notions of outlyingness? Let $p$ and $s$ denote the p-value and the s-value. The test results are reported on the s-value scale because it is intuitive and lends itself to comparison. We return to the advantages of using the s-value for comparison later. An s-value of $s$ can be interpreted as seeing $s$ independent coin flips with the same outcome – all heads or all tails – if the coin is “fair” (Greenland 2019). This conveys how incompatible the data is with the null hypothesis. For plotting, we winsorize (clip) s-values to a low and high of 1 and 10 respectively. We also display a secondary y-axis with the p-value as a cognitive bridge.
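The p-to-s conversion is just a change of scale, $s = -\log_2(p)$ (Greenland 2019); the winsorization bounds below mirror the plotting choice in the text:

```python
import math

# s-value (surprisal): s = -log2(p), i.e. the number of consecutive
# identical fair-coin flips as surprising as observing p.
def s_value(p, low=1.0, high=10.0):
    s = -math.log2(p)
    return max(low, min(high, s))  # winsorize for plotting, as in the text

assert s_value(0.5) == 1.0               # one coin flip
assert abs(s_value(0.05) - 4.32) < 0.01  # the conventional 0.05 threshold
```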

Figure 2: Tests of no adverse shifts for iris. The tests cover 4 notions of outlyingness for the sample splits in Figure 1. The dotted black line is the common – and commonly abused – p-value threshold of 0.05.

In Figure 2, the case with (1) random sampling exemplifies the type of false alarm we want to avoid. Two-sample classification, representing tests of equal distributions, is incompatible with the null of no adverse shift (an s-value of around 8). But this shift does not carry over to the other tests. Anomaly detection, residual diagnostics and confidence intervals are all compatible with the view that the test set is not worse. Had we been entirely reliant on two-sample classification, we may not have realized that this shift is essentially benign. Tests of equal distributions alone give a narrow perspective on dataset shift. Contrast (1) random with (2) stratified sampling. When stratified by species, all the tests are compatible with the null of no adverse shift.

We also expect the tests based on anomaly detection to let through the (3) in-distribution test set and flag the (4) out-of-distribution one. Indeed, the results in Figure 2 concur. We might be tempted to conclude that the in-distribution observations are safe, and yet, the tests based on residual diagnostics and confidence intervals are fairly incompatible with this view. This is because some of the in-distribution (densest) points are concentrated in a region close to the decision boundary where the classifier does not discriminate well; Figure 1 shows where Iris versicolor (triangles) and Iris virginica (squares) partially overlap. That is, the densest observations are not necessarily safe. Out-of-distribution detection glosses over this, in that the culprits may very well be the in-distribution points. D-SOS offers a more holistic perspective of dataset shift because it borrows strength from these complementary notions of outlyingness.

4 Related work

Outlier scores. Density ratios and class probabilities serve as scores in comparing distributions (Menon and Ong 2016). These scores, however, do not directly account for predictive performance. Prediction intervals and residuals, on the other hand, do. Intuitively, both prediction intervals and residuals reflect adverse predictive performance in some regions of the feature space. Prediction (confidence) intervals typically widen with increasing dataset shift (Snoek et al. 2019). Methods like MD3 (Sethi and Kantardzic 2017) and RUE (Schulam and Saria 2019) lean on this insight: they essentially track instances in the uncertain regions of the predictive model. Similarly, out-of-distribution examples often have larger residuals or errors (Hendrycks and Gimpel 2017). This points to a dual approach: the first is based on residual diagnostics (the classical approach) and the second is based on out-of-distribution detection. For recent examples of these research areas with a modern machine learning twist, see Janková et al. (2020) for the former and Morningstar et al. (2021) for the latter. Residuals traditionally underpin misspecification tests in regression. Other approaches such as trust scores (Jiang et al. 2018) can also flag unreliable predictions. Because all these scores reflect distinct notions of outlyingness, the contrasts are insightful even if, admittedly, some are related. (Aguinis, Gottfredson, and Joo (2013) list “14 unique and mutually exclusive outlier definitions” – see Table 1 therein.) In this respect, D-SOS is in some sense a unifying framework. Bring your own outlier scores and D-SOS morphs into the corresponding two-sample test.

Dimension reduction. In practice, reconstruction errors from dimension reduction can also separate inliers from outliers (Ruff et al. 2021). Recently, Rabanser, Günnemann, and Lipton (2019) combine dimension reduction, as a preprocessing step, with tests of equal distributions to detect dataset shift. Kirchler et al. (2020) push this further and construct powerful two-sample tests based on an intermediate low-rank representation of the data using deep learning. Sticking to the theme of projecting to a more informative subspace, some methods, e.g. Cieslak and Chawla (2009) and Lipton, Wang, and Smola (2018), gainfully rely on model predictions for their low-rank representation. This last approach, to add a cautionary note, entangles the predictive model, subject to its own sources of errors such as misspecification and overfitting, with dataset shift (Wen, Yu, and Greiner 2014). But it also points to the effectiveness of projecting to a lower subspace to circumvent issues related to high dimensions. Indeed, the classifier two-sample test (Friedman 2004; Clémençon, Depecker, and Vayatis 2009; Lopez-Paz and Oquab 2017; Cai, Goggin, and Jiang 2020) leverages univariate scores that discriminate between training and test set to detect changes in distribution for multivariate data. Inspired by this approach, D-SOS effectively uses outlier scores as a device for dimension reduction.

Test statistic. The area under the receiver operating characteristic curve (AUC) has a long tradition of being used as a robust test statistic in nonparametric tests. Within the framework of classifier tests, Clémençon, Depecker, and Vayatis (2009) pair a high-capacity classifier with the AUC as a test statistic, via the Mann-Whitney-Wilcoxon test. Demidenko (2016) proposes an AUC-like test statistic as an alternative to the classical tests at scale because the latter “do not make sense with big data: everything becomes statistically significant” while the former attenuates the strong bias toward large sample size. As a generalization of the AUC, the weighted AUC (WAUC) can put non-uniform weights on decision thresholds (Li and Fine 2010). D-SOS seizes on this to give more weight to the outlying regions of the data distribution. Weng and Poon (2008) also advocate for the WAUC as a performance metric for imbalanced datasets because it can capture disparate misclassification costs. D-SOS duly recognizes disparate costs in other contexts and reflects that the outlying regions suffer the most. To the best of our knowledge, this is the first time the WAUC has been used as a test statistic.

5 Implementation

In this section, we turn our attention to deriving valid p-values for inference when the score function is data dependent. Without loss of generality, let $X = X^{tr} \cup X^{te}$ be the pooled training and test set. The score function is estimated from the data $X$ and hyperparameters $\theta$. This calibration procedure returns the requisite score function $\widehat{\phi}$. The asymptotic null distribution of the WAUC, however, is invalid when the same data is used both for calibration and inference (scoring). We circumvent this issue with permutations, sample splitting, and/or out-of-bag predictions.

Permutations use the empirical, rather than the asymptotic, null distribution. The maximum number of permutations is set following Marozzi (2004). This procedure is outlined in Algorithm 1. We refer to this variant as DSOS-PT in Section 6; unless stated otherwise, DSOS-PT is the default used in the experiments in Section 6. For speed, it is implemented as a sequential Monte Carlo test (Gandy 2009), which stops early when the result is fairly obvious. Even with computational tricks to increase the speed, permutations can be computationally expensive, sometimes prohibitively so.

Data: Pooled dataset $X$, calibration procedure $\mathcal{A}$, number of permutations $B$ and hyperparameters $\theta$
1 Label: Assign origin labels $y_i = 0$ for training instances and $y_i = 1$ for test instances;
2 Calibrate: Fit the score function $\widehat{\phi} = \mathcal{A}(X, \theta)$;
3 Score: Score the observations $\widehat{\phi}(x_i)$ for all $x_i \in X$;
4 Test: Compute the observed WAUC using (4);
5 Shuffle: Permute the labels $y$ to get $y^{(b)}$ for each permutation $b = 1, \dots, B$;
6 Estimate null: Repeat steps 2-4 with the matching $y^{(b)}$ to get the empirical null values of the WAUC;
Algorithm 1 Permutation-based test (DSOS-PT)
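Algorithm 1 can be sketched as below. Two hypothetical simplifications keep the example self-contained: a plain mean-score gap stands in for the WAUC, and the scores are held fixed rather than re-fit per permutation:

```python
import random

# Sketch of a permutation test: build an empirical null for a two-sample
# statistic by shuffling the train/test labels. A mean-score gap stands in
# for the WAUC, and scores are not re-calibrated per shuffle (simplified).
def perm_pvalue(train_scores, test_scores, n_perm=999, seed=0):
    rng = random.Random(seed)
    stat = lambda a, b: sum(b) / len(b) - sum(a) / len(a)
    observed = stat(train_scores, test_scores)
    pooled, n_tr = train_scores + test_scores, len(train_scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if stat(pooled[:n_tr], pooled[n_tr:]) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)  # one-sided, add-one smoothed
```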

A faster alternative, based on sample splitting, relies on the asymptotic null distribution but incurs a cost in calibration accuracy because it sets aside some data. This tradeoff is common in classifier two-sample tests, for example. Assume that we split each dataset in half: the first half is used for calibration, and the second for scoring (inference). We describe this split-sample procedure in Algorithm 2, and refer to it as DSOS-SS in Section 6. Given the weights in (3), Li and Fine (2010) show that under some mild regularity assumptions, the test statistic under the null is asymptotically normally distributed,

$$T_{w} \sim \mathcal{N}\big(\mu_{0}, \sigma_{0}^{2}\big) \qquad (5)$$

with mean $\mu_{0}$ and standard deviation $\sigma_{0}$, which depends on the sample sizes. The constants in (5) follow from the chosen weights and the sample sizes. Li and Fine (2010) give the general form of the asymptotic distribution of the WAUC, ready for use with weights different from the ones chosen in (3).

A third option, a natural extension of sample splitting, is to use cross-validation instead. In $k$-fold cross-validation, the number of folds $k$ mediates between calibration accuracy and inferential robustness. At the expense of refitting the score function $k$ times, this approach uses most of the data for calibration and can leverage the asymptotic null distribution, provided that the scores are out-of-sample predictions. In other words, it combines the best of Algorithms 1 and 2, namely calibration accuracy from the first and inferential speed from the second. We refer to this variant as DSOS-AT in Section 6. We show in simulations that this approach either matches or exceeds the performance of DSOS-PT and DSOS-SS.

Data: Calibration set $X_{c}$, inference set $X_{e}$, calibration procedure $\mathcal{A}$ and hyperparameters $\theta$
1 Label: Assign origin labels $y_i = 0$ for training instances and $y_i = 1$ for test instances;
2 Calibrate: Fit the score function $\widehat{\phi} = \mathcal{A}(X_{c}, \theta)$;
3 Score: Score the observations $\widehat{\phi}(x_i)$ for all $x_i \in X_{e}$;
4 Test: Compute the observed WAUC using (4);
5 Asymptotic null: Under the null, the WAUC is asymptotically normally distributed as in (5);
Algorithm 2 Split-sample test (DSOS-SS)
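The final step of Algorithm 2 reduces to a one-sided normal tail probability. In this sketch, `mu0` and `sigma0` are placeholders for the null mean and standard deviation from (5), which depend on the sample sizes and weights:

```python
import math

# One-sided p-value from a normal null: P(Z >= z) for the standardized
# observed WAUC. mu0 and sigma0 are placeholders for the constants in (5).
def asymptotic_pvalue(wauc_obs, mu0, sigma0):
    z = (wauc_obs - mu0) / sigma0
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))
```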

We make the following pragmatic choices for ease of use; typically, the selected score functions perform well out-of-the-box with little to no costly hyperparameter tuning. For anomaly or out-of-distribution detection, the outlier scores are obtained with an isolation forest using isotree (Cortes 2020). We take the default hyperparameters from isotree as given. To investigate other notions of outlyingness, we use random forests from ranger (Wright and Ziegler 2017). We take its default hyperparameters from Probst, Boulesteix, and Bischl (2019). As in Hediger, Michel, and Näf (2019), a random forest allows us to use the out-of-sample variant of D-SOS (DSOS-AT) for free, so to speak, since out-of-bag predictions are viable surrogates for out-of-sample scores. For calibration, random forest is to two-sample classification, residual diagnostics and resampling uncertainty (uncertainty quantification) what isolation forest is to anomaly detection. Of course, we can plug in other algorithms – choose your favourites – to obtain reasonable outlier scores.
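The paper pairs isolation forests (isotree) and random forests (ranger) with D-SOS; to illustrate the plug-in nature of the scores without those libraries, a simple stand-in such as the distance to the k-th nearest training point also works as an outlier score:

```python
# Stand-in outlier scorer (not used in the paper): distance to the k-th
# nearest training point. Higher = farther from the training mass.
def knn_outlier_score(x, train, k=3):
    dists = sorted(
        sum((a - b) ** 2 for a, b in zip(x, row)) ** 0.5 for row in train
    )
    return dists[min(k, len(dists)) - 1]

train = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0]]
# A far-away point scores higher than a point inside the training mass:
assert knn_outlier_score([5.0, 5.0], train) > knn_outlier_score([0.0, 0.0], train)
```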

6 Experiments

All experiments were run on a commodity desktop computer with a 12-core Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz processor and 64 GB RAM in R version 3.6.1 (2019-07-05). We stress that no hyperparameter tuning was performed – we set the hyperparameters to reasonable defaults as previously discussed. To avoid ambiguity, we explicitly state when the results pertain to DSOS-SS, DSOS-PT or DSOS-AT.

6.1 Simulated shifts

We compare D-SOS to two modern tests of equal distributions based on simulated data. The first is a classifier two-sample test, ctst for short. ctst tests whether a classifier can reliably distinguish training from test instances. If so, this is taken as evidence against the null. Lopez-Paz and Oquab (2017) and Cai, Goggin, and Jiang (2020) show that ctst can match the performance of kernel-based tests. The second is the energy test (Székely, Rizzo, and others 2004), a type of kernel test (Sejdinovic et al. 2012). To be consistent with D-SOS, we implement ctst with the same classifier and hyperparameters. Following Clémençon, Depecker, and Vayatis (2009), this ctst variant uses sample splitting, as does DSOS-SS, and the AUC as a test statistic. Both ctst and D-SOS use the classifier’s predicted probability as the score. As this score increases, so does the likelihood that an observation belongs to the test set, as opposed to the training set.

We simulate shifts from a two-component multivariate Gaussian mixture model (GMM). The training and test set are drawn from

$$x \sim \pi_{1}\, \mathcal{N}(\mu_{1}, \Sigma_{1}) + \pi_{2}\, \mathcal{N}(\mu_{2}, \Sigma_{2})$$

Omitting subscripts and superscripts for brevity, $\pi$, $\mu$ and $\Sigma$ are the component weight, mean vector and covariance matrix respectively. The baseline specifies the training and test sample size $n$, the number of dimensions $d$, the component weights $(\pi_{1}, \pi_{2})$, the mean vectors $\mu_{1}$ and $\mu_{2}$, and the covariance matrices $\Sigma_{1} = \Sigma_{2} = I_{d}$, where $I_{d}$ is the identity matrix and $\mathbf{1}_{d}$ is the $d$-dimensional all-ones vector. The baseline configuration enforces that training and test set are drawn from the same distribution, i.e. no shift.

We generally construct “fair” alternatives so that the dimension of change is fixed as the ambient dimension increases. The power of multivariate tests based on kernels and distances decays with increasing dimension when differences only exist along a few intrinsic dimensions (Ramdas et al. 2015). We vary one or more parameters, namely $\pi$, $\mu$ and $\Sigma$, to simulate the desired shifts, all else constant. We change the following settings to pre-set intensity levels:

  1. Label shift – We flip the component weights so that $\pi_{1}$ and $\pi_{2}$ are swapped. The majority component in training becomes the minority in the test sample.

  2. Corrupted sample – We draw a fraction of the examples in the test set from the component that is absent in training.

  3. Mean shift – We change the mean vector in the test set so that $\mu^{te} = \mu^{tr} + \delta$, where the displacement $\delta$ sets the intensity of the shift.

  4. Noise shift – We change the covariance matrix in the test set by scaling the diagonal elements of the $d$-by-$d$ covariance matrix by a factor $\lambda > 1$.

  5. Dependency shift – We induce a positive relationship between the first two covariates. We change the covariance structure in the test set so that $\mathrm{Corr}(x_{1}, x_{2}) = \rho$, where $\rho > 0$.
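One of the settings above, the mean shift, can be simulated along these lines. The sample sizes, dimensions, weights and shift size `delta` here are arbitrary stand-ins, not the paper's configuration:

```python
import random

def draw_gmm(n, d, weights, comp_means, rng):
    """Draw n points in d dimensions from a two-component Gaussian
    mixture with isotropic unit covariance (illustrative setup)."""
    out = []
    for _ in range(n):
        k = 0 if rng.random() < weights[0] else 1
        out.append([rng.gauss(comp_means[k], 1.0) for _ in range(d)])
    return out

rng = random.Random(1)
base = dict(n=200, d=5, weights=(0.7, 0.3), comp_means=(0.0, 3.0))
train = draw_gmm(rng=rng, **base)
delta = 1.0  # mean-shift intensity (arbitrary)
test = [[x + delta for x in row] for row in draw_gmm(rng=rng, **base)]
```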

For each type and intensity of shift, we repeat the experiment 500 times. To compare D-SOS to both ctst and energy, we employ the Bayesian signed rank test (Benavoli et al. 2017, 2014). To do so, we specify a region of practical equivalence (ROPE) on the s-value scale and a suitable prior. We deem two statistical tests practically equivalent if the absolute difference in s-value is at most 1. This difference corresponds to one more coin flip turning up as heads (or tails), keeping the streak alive. Anything greater arouses suspicion that one test is more powerful, and that the difference is not negligible. Specifying a ROPE on the p-value scale is a lot more cumbersome because p-values, let alone p-value differences, are notoriously difficult to interpret (Wasserstein, Lazar, and others 2016). The Bayesian signed rank test yields the posterior probability that one method is better, practically equivalent or worse than the other. The prior for these experiments adds one pseudo experiment, distinct from the 500 real simulations, where the difference is 0, i.e. no practical difference exists out of the gate.

Contender (1)  Contender (2)  Tie/Draw¹  Win (1)  Win (2)
ctst           energy         89         12       43
ctst           DSOS-SS        110        0        34
ctst           DSOS-PT        84         0        60
ctst           DSOS-AT        80         0        64
DSOS-AT        energy         81         37       26
DSOS-SS        DSOS-AT        108        0        36
DSOS-SS        DSOS-PT        112        0        32
Table 1: Test comparisons based on simulations.

¹ Tied if the absolute difference in s-value (the ROPE) is <= 1.

To remain concise, the full tables for comparison are provided as supplementary material. Table 1 summarizes these findings across all settings: dimension, sample size, type and intensity of shift. For simplicity, we say that one method dominates (wins) if the posterior probability that it is better is sufficiently high; similarly, the two draw if the posterior probability that they are tied (practically equivalent) is sufficiently high. We make several observations:

  1. Across all settings, DSOS-SS is either better than or equal to ctst. This forms the basis for our suggestion that tests of goodness-of-fit, of which D-SOS is a member, supplant tests of equal distributions when a shift is suspected. The latter do not consider the training set as the reference distribution and treat both training and test set as if they were equally important. They ignore that the training set precedes the test set, and that the predictive model is built on the training set, not vice versa. Note, for example, that all the tests explored in Rabanser, Günnemann, and Lipton (2019) for shift detection are tests of equal distributions.

  2. Across all settings, DSOS-PT is either better than or equal to DSOS-SS. The same applies to DSOS-AT. DSOS-SS pays a hefty price in calibration accuracy for the privilege of using the asymptotic null distribution. DSOS-AT and DSOS-PT are practically equivalent. If feasible, we recommend these two over DSOS-SS. Sample splitting stunts the performance of DSOS-SS and ctst relative to the field.

  3. DSOS-AT (or DSOS-PT) is often better than or equal to energy except in settings with (1) label shift and (2) corrupted samples. As expected, the WAUC is not unduly influenced by outliers in the way that nonrobust statistics are. This resilience is desirable to combat false alarms. While still attentive to outlying regions, our approach does not break down easily.

  4. DSOS-AT (or DSOS-PT) dominates energy for dependency shifts. This is consistent with Cai, Goggin, and Jiang (2020), but somewhat at odds with Hediger, Michel, and Näf (2019) per se. In the latter, they induce a positive association between many (all) variables, whereas here, in the spirit of ‘fair’ alternatives, only the correlation between the first two variables is affected.

6.2 Partition-induced shifts

To investigate how notions of outlyingness are correlated, we analyze 62 real-world and influential classification tasks from the OpenML-CC18 benchmark (Casalicchio et al. 2017; Bischl et al. 2017); we exclude datasets with more than 500 features because of long runtimes. Similar to the motivating example in Section 3, we look at tests of no adverse shifts based on two-sample classification, residual diagnostics, and resampling uncertainty (but with no outlier detection in this round). For each dataset in the OpenML-CC18 benchmark, we perform stratified 10-fold cross-validation, repeated twice. We end up with 20 train-test splits per task. In total, we run 3720 tests of no adverse shifts (62 datasets, 20 random splits, and 3 tests). Summary statistics and granular test results for these datasets are in the supplementary material.
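The bookkeeping above can be sketched as follows. This is a minimal stand-in with made-up labels (the real experiment uses the OpenML-CC18 tasks): folds are assigned per class to preserve class proportions, and the whole procedure is repeated twice.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed):
    """Assign each index to one of k folds, keeping class proportions balanced."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

# Toy binary task: 120 negatives, 80 positives.
labels = [0] * 120 + [1] * 80

# Stratified 10-fold CV, repeated twice -> 20 train-test splits per task.
splits = []
for repeat in range(2):
    for fold in stratified_folds(labels, 10, seed=repeat):
        held_out = set(fold)
        train = [i for i in range(len(labels)) if i not in held_out]
        splits.append((train, fold))

print(len(splits))            # 20 splits per task
print(62 * len(splits) * 3)   # 3720 tests in total
```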

Figure 3: Heatmap of fixed effects for 62 datasets from the OpenML-CC18 benchmark, ordered via hierarchical clustering. Fixed effects are reported on the s-value scale for 3 different D-SOS tests.

We expect the D-SOS results to be correlated within but not across datasets. To formalize this setup, we use the following model:

log s_ij = β_i + ε_ij,   (6)

where for each dataset i and test j, the value s_ij is lognormally distributed: positive and skewed to the right. The value log s_ij consists of a dataset-specific (fixed) effect β_i, subject to noise in ε_ij; ε_ij accounts for within-dataset correlation. The specification in (6) gives rise to the so-called fixed effects model with clustered covariance, which is widely used in econometrics and biostatistics. We fit such a linear regression using the clubSandwich package (Pustejovsky and Tipton 2018) to obtain robust estimates of the fixed effects even with arbitrary heteroskedasticity and autocorrelation in ε_ij left unspecified. All fixed effects are statistically significant.
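The point estimates of the fixed effects in this model reduce to per-dataset means of the log values, which a plain least-squares fit on dataset dummies recovers. The numpy sketch below shows this with made-up data and our own names; the cluster-robust standard errors, which clubSandwich supplies in R, are not reproduced here.

```python
import numpy as np

# Toy stand-in for the fixed-effects model log s_ij = beta_i + eps_ij:
# three hypothetical datasets ("a", "b", "c"), 20 replicate values each.
rng = np.random.default_rng(1)
true_beta = {"a": 0.5, "b": 1.5, "c": 3.0}
groups = np.repeat(list(true_beta), 20)
log_s = np.array([true_beta[g] for g in groups]) + rng.normal(0, 0.3, 60)

# One dummy column per dataset, no intercept: the OLS coefficients
# are exactly the per-dataset means of log s.
X = (groups[:, None] == np.array(list(true_beta))).astype(float)
beta_hat, *_ = np.linalg.lstsq(X, log_s, rcond=None)
print(dict(zip(true_beta, beta_hat.round(2))))
```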

Figure 4: Correlation matrix of fixed effects for 62 datasets from the OpenML-CC18 benchmark across 3 notions of outlyingness.

In stark contrast with worst-case (say, adversarial) shifts used to assess or build stable models, e.g. Subbaswamy, Adams, and Saria (2021) and Duchi and Namkoong (2018), cross-validation generates partition-induced shifts (Jose García Moreno-Torres, Sáez, and Herrera 2012). These shifts supposedly mimic the natural variation in the data and are less harsh or extreme than worst-case shifts. The fixed effects in (6) measure how sensitive or robust a dataset is to these partition-induced shifts. In other words, they are interpretable and quantitative metrics for data monitoring or validation. Recall that data validation via statistical tests, to echo Polyzotis et al. (2019), is our main concern.

At a glance, the heatmap along with the clustering dendrogram in Figure 3, created with superheat (Barter and Yu 2018), tells the whole story quite succinctly. Across 62 datasets, we see that tests of no adverse shifts based on two-sample classification, residual diagnostics, and resampling uncertainty are indeed highly correlated. Figure 4 shows this correlation matrix, estimated with the method in Schäfer and Strimmer (2005). These three notions of outlyingness are decidedly not independent: they leak information about one another. Residual diagnostics and resampling uncertainty, both of which incorporate facets of the predictive performance, are the most correlated. This suggests that they can be used interchangeably. Two-sample classification is not as strongly associated with the other two. This connects back to and further supports the point made in Section 3; namely, tests for distributional shifts alone are ill-equipped, by definition, to detect whether a shift is benign or harmful for predictive tasks. If the key criterion is predictive performance, D-SOS tests based on residual diagnostics and resampling uncertainty are more informative than those solely based on comparing distributions.
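The correlation matrix relies on a shrinkage estimator. As a rough Python sketch of the idea (not the paper's corpcor implementation; the shrinkage intensity below is a fixed, illustrative value rather than the data-driven choice of Schäfer and Strimmer), shrink the sample correlation matrix toward the identity:

```python
import numpy as np

# Toy stand-in: fixed effects for 62 datasets under 3 notions of
# outlyingness, built to share a common signal (hence correlated).
rng = np.random.default_rng(2)
base = rng.normal(size=(62, 1))
scores = base + 0.5 * rng.normal(size=(62, 3))

# Shrink the sample correlation matrix toward the identity; lam = 0.1
# is illustrative, not an estimated intensity.
sample_corr = np.corrcoef(scores, rowvar=False)
lam = 0.1
shrunk_corr = (1 - lam) * sample_corr + lam * np.eye(3)
print(shrunk_corr.round(2))
```

Shrinking toward the identity pulls noisy off-diagonal entries toward zero, which stabilizes the estimate when the number of observations is modest relative to the number of variables.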

7 Conclusion

D-SOS is a framework derived from goodness-of-fit testing to detect adverse shifts. It can confront the data with more informative hypotheses than tests of equal distributions. This works well when the mapping to the relevant subspace is based on distinguishing between safe and unsafe regions of the data distribution. Our method accommodates different notions of outlyingness. It lets users define, via a score function, what is meant by ‘worse off’. Besides the outlier scores explored in this paper, we stress that other sensible choices for the score function and the weights abound. These choices can be adjusted to reflect prior domain knowledge and serve the needs of the data scientist. Looking ahead, future research could investigate how different weighting schemes affect the power of the test. The functional form of the postulated weights could be a hyperparameter worth tuning. Moreover, composite score functions, which combine several notions of outlyingness, would enrich the types of hypotheses that can be tested.

8 Acknowledgements

The author would like to thank the reviewers for suggesting numerous improvements. This work was supported by the Royal Bank of Canada (RBC). The views expressed here are those of the author, not of RBC.

9 References

Aguinis, Herman, Ryan K Gottfredson, and Harry Joo. 2013. “Best-Practice Recommendations for Defining, Identifying, and Handling Outliers.” Organizational Research Methods 16 (2): 270–301.

Anderson, Edgar. 1935. “The Irises of the Gaspe Peninsula.” Bull. Am. Iris Soc. 59: 2–5.

Barter, Rebecca L, and Bin Yu. 2018. “Superheat: An R Package for Creating Beautiful and Extendable Heatmaps for Visualizing Complex Data.” Journal of Computational and Graphical Statistics 27 (4): 910–22.

Benavoli, Alessio, Giorgio Corani, Janez Demšar, and Marco Zaffalon. 2017. “Time for a Change: A Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis.” The Journal of Machine Learning Research 18 (1): 2653–88.

Benavoli, Alessio, Giorgio Corani, Francesca Mangili, Marco Zaffalon, and Fabrizio Ruggeri. 2014. “A Bayesian Wilcoxon Signed-Rank Test Based on the Dirichlet Process.” In International Conference on Machine Learning, 1026–34. PMLR.

Bischl, Bernd, Giuseppe Casalicchio, Matthias Feurer, Frank Hutter, Michel Lang, Rafael G Mantovani, Jan N van Rijn, and Joaquin Vanschoren. 2017. “OpenML Benchmarking Suites.” arXiv Preprint arXiv:1708.03731.

Cai, Haiyan, Bryan Goggin, and Qingtang Jiang. 2020. “Two-Sample Test Based on Classification Probability.” Statistical Analysis and Data Mining: The ASA Data Science Journal 13 (1): 5–13.

Casalicchio, Giuseppe, Jakob Bossek, Michel Lang, Dominik Kirchhoff, Pascal Kerschke, Benjamin Hofner, Heidi Seibold, Joaquin Vanschoren, and Bernd Bischl. 2017. “OpenML: An R Package to Connect to the Machine Learning Platform OpenML.” Computational Statistics 32 (3): 1–15. https://doi.org/10.1007/s00180-017-0742-2.

Cieslak, David A, and Nitesh V Chawla. 2009. “A Framework for Monitoring Classifiers’ Performance: When and Why Failure Occurs?” Knowledge and Information Systems 18 (1): 83–108.

Clémençon, Stéphan, Marine Depecker, and Nicolas Vayatis. 2009. “AUC Optimization and the Two-Sample Problem.” In Proceedings of the 22nd International Conference on Neural Information Processing Systems, 360–68.

Cortes, David. 2020. isotree: Isolation-Based Outlier Detection. https://CRAN.R-project.org/package=isotree.

Demidenko, Eugene. 2016. “The P-Value You Can’t Buy.” The American Statistician 70 (1): 33–38.

Duchi, John, and Hongseok Namkoong. 2018. “Learning Models with Uniform Performance via Distributionally Robust Optimization.” arXiv Preprint arXiv:1810.08750.

Friedman, Jerome. 2004. “On Multivariate Goodness-of-Fit and Two-Sample Testing.” Stanford Linear Accelerator Center, Menlo Park, CA (US).

Gandy, Axel. 2009. “Sequential Implementation of Monte Carlo Tests with Uniformly Bounded Resampling Risk.” Journal of the American Statistical Association 104 (488): 1504–11.

Greenland, Sander. 2019. “Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution with S-Values.” The American Statistician 73 (sup1): 106–14.

Hand, David J. 2009. “Measuring Classifier Performance: A Coherent Alternative to the Area Under the Roc Curve.” Machine Learning 77 (1): 103–23.

Hediger, Simon, Loris Michel, and Jeffrey Näf. 2019. “On the Use of Random Forest for Two-Sample Testing.” arXiv Preprint arXiv:1903.06287.

Hendrycks, Dan, and Kevin Gimpel. 2017. “A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks.” In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=Hkg4TI9xl.

Janková, Jana, Rajen D Shah, Peter Bühlmann, and Richard J Samworth. 2020. “Goodness-of-Fit Testing in High Dimensional Generalized Linear Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82 (3): 773–95.

Jiang, Heinrich, Been Kim, Melody Y Guan, and Maya R Gupta. 2018. “To Trust or Not to Trust a Classifier.” In NeurIPS, 5546–57.

Kelly, Mark G, David J Hand, and Niall M Adams. 1999. “The Impact of Changing Populations on Classifier Performance.” In Proceedings of the Fifth Acm Sigkdd International Conference on Knowledge Discovery and Data Mining, 367–71.

Kirchler, Matthias, Shahryar Khorasani, Marius Kloft, and Christoph Lippert. 2020. “Two-Sample Testing Using Deep Learning.” In International Conference on Artificial Intelligence and Statistics, 1387–98. PMLR.

Li, Jialiang, and Jason P Fine. 2010. “Weighted Area Under the Receiver Operating Characteristic Curve and Its Application to Gene Selection.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 59 (4): 673–92.

Lipton, Zachary, Yu-Xiang Wang, and Alexander Smola. 2018. “Detecting and Correcting for Label Shift with Black Box Predictors.” In International Conference on Machine Learning, 3122–30. PMLR.

Lopez-Paz, David, and Maxime Oquab. 2017. “Revisiting Classifier Two-Sample Tests.” In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=SJkXfE5xx.

Marozzi, Marco. 2004. “Some Remarks About the Number of Permutations One Should Consider to Perform a Permutation Test.” Statistica 64 (1): 193–201.

Menon, Aditya, and Cheng Soon Ong. 2016. “Linking Losses for Density Ratio and Class-Probability Estimation.” In International Conference on Machine Learning, 304–13.

Moreno-Torres, Jose García, José A Sáez, and Francisco Herrera. 2012. “Study on the Impact of Partition-Induced Dataset Shift on k-Fold Cross-Validation.” IEEE Transactions on Neural Networks and Learning Systems 23 (8): 1304–12.

Moreno-Torres, Jose G, Troy Raeder, RocíO Alaiz-RodríGuez, Nitesh V Chawla, and Francisco Herrera. 2012. “A Unifying View on Dataset Shift in Classification.” Pattern Recognition 45 (1): 521–30.

Morningstar, Warren, Cusuh Ham, Andrew Gallagher, Balaji Lakshminarayanan, Alex Alemi, and Joshua Dillon. 2021. “Density of States Estimation for Out of Distribution Detection.” In International Conference on Artificial Intelligence and Statistics, 3232–40. PMLR.

Polyzotis, Neoklis, Martin Zinkevich, Sudip Roy, Eric Breck, and Steven Whang. 2019. “Data Validation for Machine Learning.” Proceedings of Machine Learning and Systems 1: 334–47.

Probst, Philipp, Anne-Laure Boulesteix, and Bernd Bischl. 2019. “Tunability: Importance of Hyperparameters of Machine Learning Algorithms.” Journal of Machine Learning Research 20 (53): 1–32.

Pustejovsky, James E, and Elizabeth Tipton. 2018. “Small-Sample Methods for Cluster-Robust Variance Estimation and Hypothesis Testing in Fixed Effects Models.” Journal of Business & Economic Statistics 36 (4): 672–83.

Quionero-Candela, Joaquin, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. 2009. Dataset Shift in Machine Learning. The MIT Press.

Rabanser, Stephan, Stephan Günnemann, and Zachary Lipton. 2019. “Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift.” In Advances in Neural Information Processing Systems, 1394–1406.

Ramdas, Aaditya, Sashank Jakkam Reddi, Barnabás Póczos, Aarti Singh, and Larry Wasserman. 2015. “On the Decreasing Power of Kernel and Distance Based Nonparametric Hypothesis Tests in High Dimensions.” In Twenty-Ninth Aaai Conference on Artificial Intelligence.

Rinaldo, Alessandro, Larry Wasserman, Max G’Sell, and others. 2019. “Bootstrapping and Sample Splitting for High-Dimensional, Assumption-Lean Inference.” The Annals of Statistics 47 (6): 3438–69.

Ruff, Lukas, Jacob R Kauffmann, Robert A Vandermeulen, Grégoire Montavon, Wojciech Samek, Marius Kloft, Thomas G Dietterich, and Klaus-Robert Müller. 2021. “A Unifying Review of Deep and Shallow Anomaly Detection.” Proceedings of the IEEE.

Schäfer, Juliane, and Korbinian Strimmer. 2005. “A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics.” Statistical Applications in Genetics and Molecular Biology 4 (1).

Schulam, Peter, and Suchi Saria. 2019. “Can You Trust This Prediction? Auditing Pointwise Reliability After Learning.” In The 22nd International Conference on Artificial Intelligence and Statistics, 1022–31. PMLR.

Sejdinovic, Dino, Arthur Gretton, Bharath Sriperumbudur, and Kenji Fukumizu. 2012. “Hypothesis Testing Using Pairwise Distances and Associated Kernels.” In Proceedings of the 29th International Coference on International Conference on Machine Learning, 787–94.

Sethi, Tegjyot Singh, and Mehmed Kantardzic. 2017. “On the Reliable Detection of Concept Drift from Streaming Unlabeled Data.” Expert Systems with Applications 82: 77–99.

Snoek, Jasper, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. 2019. “Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift.” In Advances in Neural Information Processing Systems, 13969–80.

Subbaswamy, Adarsh, Roy Adams, and Suchi Saria. 2021. “Evaluating Model Robustness and Stability to Dataset Shift.” In International Conference on Artificial Intelligence and Statistics, 2611–9. PMLR.

Székely, Gábor J, Maria L Rizzo, and others. 2004. “Testing for Equal Distributions in High Dimension.” InterStat 5 (16.10): 1249–72.

Wasserstein, Ronald L, Nicole A Lazar, and others. 2016. “The ASA’s Statement on P-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33.

Wellek, Stefan. 2010. Testing Statistical Hypotheses of Equivalence and Noninferiority. CRC Press.

Wen, Junfeng, Chun-Nam Yu, and Russell Greiner. 2014. “Robust Learning Under Uncertain Test Distributions: Relating Covariate Shift to Model Misspecification.” In ICML, 631–39.

Weng, Cheng G, and Josiah Poon. 2008. “A New Evaluation Measure for Imbalanced Datasets.” In Proceedings of the 7th Australasian Data Mining Conference-Volume 87, 27–32.

Wright, Marvin N., and Andreas Ziegler. 2017. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77 (1): 1–17. https://doi.org/10.18637/jss.v077.i01.

Zhang, Jin. 2002. “Powerful Goodness-of-Fit Tests Based on the Likelihood Ratio.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (2): 281–94.