1 Introduction
Suppose we fit a predictive model on a training set and predict on a test set. Dataset shift (Quiñonero-Candela et al. 2009; Jose G. Moreno-Torres et al. 2012; Kelly, Hand, and Adams 1999), also known as data or population drift, occurs when training and test distributions are not alike. This is essentially a sample mismatch problem: some regions of the data space are either too sparse or absent during training and gain importance at test time. We want methods that alert users to the presence of unexpected inputs in the test set (Rabanser, Günnemann, and Lipton 2019). To do so, a measure of divergence between training and test set is required. Can we not simply use the many modern off-the-shelf multivariate tests of equal distributions for this?
One reason for moving beyond tests of equal distributions is that they are often too strict. They require high fidelity between training and test set everywhere in the input domain. However, not all changes in distribution are a cause for concern – some changes are benign. Practitioners distrust these tests because of false alarms. Polyzotis et al. (2019) comment:
statistical tests for detecting changes in the data distribution […] are too sensitive and also uninformative for the typical scale of data in machine learning pipelines, which led us to seek alternative methods to quantify changes between data distributions.
Even when the difference is small or negligible, tests of equal distributions reject the null hypothesis of no difference. An alarm should only be raised if a shift warrants intervention: retraining models when distribution changes are benign is both costly and ineffective. To tackle these challenges, we propose D-SOS instead. D-SOS provides robust metrics for monitoring model drift and validating data quality. In comparing the test set to the training set, D-SOS
pays more attention to the regions — typically, the outlying regions — where we are most vulnerable. To confront false alarms, it uses a robust test statistic, namely the weighted area under the receiver operating characteristic curve (WAUC). The weights in the WAUC (Li and Fine
2010) discount the safe regions of the distribution. The goal of D-SOS is to detect non-negligible adverse shifts. This is reminiscent of non-inferiority tests (Wellek 2010), widely used in healthcare to determine that a new treatment is in fact not inferior to an older one. Colloquially, the D-SOS null hypothesis holds that the new sample is not substantively worse than the old sample, not that the two are equal. D-SOS moves beyond tests of equal distributions and lets users specify which notions of outlyingness to probe. The choice of the score function plays a central role in formalizing what we mean by 'worse'. D-SOS avoids the prevailing overreliance on tests of equal distributions to detect and assess the impact of dataset shift. We argue throughout that other notions of shift or outlyingness are, if not more informative, at the very least complementary. In this paper, we make the following contributions:

We derive D-SOS, a novel multivariate two-sample test for no adverse shift, from tests of goodness-of-fit. D-SOS computes its p-values based on sample splitting, permutations, and/or out-of-bag predictions. In our experience, the out-of-bag variant is the most convenient: it improves on sample splitting, which sacrifices calibration accuracy for inferential robustness (Rinaldo et al. 2019).

We compare D-SOS's performance to that of two modern tests of equal distributions. To do so, we transform p-values to s-values (Greenland 2019). s-values give us the means to define a region of practical equivalence. We show via simulations that D-SOS matches or exceeds the performance of these tests when the outlier score reflects whether an observation belongs to the training or the test set. These results underscore the point that tests of equal distributions are suboptimal when we could instead opt for tests of goodness-of-fit, which explicitly account for the training set being the reference distribution.

Based on 62 real-world classification tasks from the OpenML-CC18 benchmark (Casalicchio et al. 2017; Bischl et al. 2017), we show that different notions of outlyingness are complementary for detecting partition-induced dataset shift. In supervised settings, D-SOS extends to notions of outlyingness such as residuals and prediction intervals. These help determine whether the shift has an adverse impact on predictive performance. We estimate the correlations between these complementary notions.
The main takeaway is that, given a generic method that assigns an outlier score to a data point, D-SOS uplifts these scores and turns them into a two-sample test for no adverse shift. The scores can come from anomaly detection, two-sample classification, uncertainty quantification, residual diagnostics, density estimation, dimension reduction, and more. We have created an R package, dsos, for our method. In addition, all code and data used in this paper are publicly available.
2 Statistical framework
The theoretical framework builds on Zhang (2002), hereafter referred to as the Zhang test. Take an i.i.d. training set X^tr and a test set X^te. Each dataset X^o with origin o ∈ {tr, te} lies in the d-dimensional domain and has sample size n_o, cumulative distribution function (CDF) F^o and probability density function (PDF) f^o. Let φ be a score function and define binary instances z_i such that z_i is 1 when the score φ(x_i) exceeds the threshold t, and 0 otherwise. The proportion above the threshold in dataset o is in effect the contamination rate C_t^o, defined as C_t^o = Pr(φ(x) > t | o). The contamination rate is the complementary CDF of the scores, C_t^o = 1 − F_φ^o(t). As before, F_φ^o and f_φ^o denote the CDF and PDF of the scores.
Consider the null hypothesis of equal distributions H_0 : F^tr = F^te against the alternative H_1 : F^tr ≠ F^te. In tandem, consider the null of equal contamination H_0^t : C_t^tr = C_t^te against the alternative H_1^t : C_t^tr ≠ C_t^te at a given threshold score t. To evaluate goodness-of-fit, Zhang (2002) shows that testing H_0 is equivalent to testing H_0^t for all t. If Z_t is the relevant test statistic for equal contamination at threshold t, a global test statistic Z for goodness-of-fit can be constructed from Z_t, its local counterpart. One such Z is
Z = ∫ Z_t · w_t dt    (1)
where w_t are threshold-dependent weights. For concision, we sometimes suppress the dependence on the threshold score and denote the weights, contamination statistics and contamination rates as w, Z and C respectively. D-SOS differs from the statistic in (1) in three particular aspects: the score function φ, the weights w_t and the contamination statistic Z_t. We address each in turn.
D-SOS scores instances from least to most abnormal according to a specified notion of outlyingness. To be concrete, for density estimation (if so, the negative log density, a measure of surprise, is a natural score for outlyingness), this property of φ can be expressed as
f^tr(x_i) ≥ f^tr(x_j) ⟹ φ(x_i) ≤ φ(x_j) + ε    (2)
for any pair of points x_i, x_j and ε, a (sufficiently small) approximation error. Accordingly, instances in high-density regions of the training set (nominal points) score low; those in low-density regions (outliers) score high. Here, the score function φ can be thought of as a density-preserving projection. More generally, higher scores, e.g. wider prediction intervals and larger residuals, indicate worse outcomes; the higher the score, the more unusual the observation. The structure in (2) is the catalyst for adjusting the weights w_t and the statistic Z_t.
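For intuition, the density-based score above can be sketched in a few lines. This is an illustrative sketch only, assuming a Gaussian kernel density estimate as the density model; the function name and bandwidth are not part of the method.

```python
import numpy as np

def neg_log_density_score(train, query, bandwidth=0.5):
    """Outlier score: negative log of a Gaussian KDE fit on the training set.

    Higher scores mean lower estimated density, i.e. more outlying.
    """
    train = np.asarray(train, dtype=float)
    query = np.asarray(query, dtype=float)
    # Pairwise squared distances between query and training points.
    d2 = ((query[:, None, :] - train[None, :, :]) ** 2).sum(-1)
    # Average Gaussian kernels; normalizing constants cancel for ranking.
    dens = np.exp(-d2 / (2 * bandwidth**2)).mean(axis=1)
    return -np.log(dens + 1e-300)

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 2))       # nominal cloud around the origin
inlier = np.array([[0.0, 0.0]])         # high-density point: low score
outlier = np.array([[6.0, 6.0]])        # far from the training mass: high score
assert neg_log_density_score(train, inlier)[0] < neg_log_density_score(train, outlier)[0]
```

Any density estimator with the monotonicity property in (2) would do; the kernel estimate merely keeps the sketch self-contained.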
D-SOS updates the weights w_t to be congruent with the score function φ. When projecting to the outlier subspace, high scores imply unusual points, whereas both tails are viewed as extremes in the Zhang test. Univariate tests such as the Anderson-Darling and the Cramér-von Mises tests fit the framework in (1): they make different choices for Z_t, w_t and φ. These classical tests place more weight on the tails to reflect the severity of tail exceedances relative to deviations in the center of the distribution. D-SOS corrects for outliers being projected to the upper (right) tail of the scores via the score function. The D-SOS weights are specified as
w_t ∝ (1 − C_t)²    (3)
The weights in (3) shift most of the mass from low-threshold to high-threshold regions. As a result, D-SOS is highly tolerant of negligible shifts associated with low scores and, conversely, attentive to shifts associated with high scores. Zhang (2002) posits other functional forms, but the quadratic relationship between weights and contamination rates gives rise to one of the most powerful variants of the test, so we follow suit.
D-SOS constructs a test statistic based on score ranks, not on levels. The weighted area under the receiver operating characteristic curve, weighted AUC (WAUC) for short, is a robust statistic that is invariant to changes in levels so long as the underlying ranking is unaffected. The WAUC, denoted T, can be written as:
T = ∫ w_t · (1 − C_t^tr) · f_φ^te(t) dt    (4)
assuming the test set is the positive class and gets labeled as 1s (0s for the training set). See Hand (2009) for this derivation of the AUC; the weights w_t are tacked on to obtain the WAUC. Formally, T in (4) is the D-SOS test statistic for dataset shift. T is also clearly a member of the class of tests in (1).
The D-SOS null hypothesis is that most instances in the test set resemble those in the training set. The alternative is that the test set contains more outliers than expected if the training set is the reference distribution. D-SOS specifies its null as H_0 : T ≤ T_0 against the alternative H_1 : T > T_0, where T is the observed WAUC and T_0 the WAUC under the null of exchangeable samples. Note that D-SOS is a one-tailed test: having no outliers in the test set, or disproportionately fewer, is desirable, and such shifts do not trigger the alarm as they would had D-SOS been two-tailed. Tests of goodness-of-fit and of equal distributions reject the null if the training and test set differ even when the test set is less abnormal, i.e. better; D-SOS does not. Through the mapping to the outlier subspace, D-SOS reduces a multivariate two-sample test to a univariate representation.
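A plug-in estimate of the WAUC can be sketched as follows. This is an illustrative Python sketch (the paper's implementation is in R): each training score serves as a threshold, thresholds are weighted by an increasing function of their rank to emphasize the outlying regions, and the statistic averages the proportion of test scores exceeding each threshold. The quadratic rank weight below is illustrative; uniform weights recover the ordinary AUC.

```python
import numpy as np

def wauc(train_scores, test_scores, weight=None):
    """Plug-in weighted AUC: the test set is the positive class.

    Each training score acts as a threshold; `weight` maps a threshold's
    rank (in (0, 1), 1 = most outlying) to its weight. Uniform weights
    recover the ordinary AUC, up to tie handling.
    """
    tr = np.sort(np.asarray(train_scores, dtype=float))
    te = np.asarray(test_scores, dtype=float)
    ranks = (np.arange(len(tr)) + 0.5) / len(tr)    # threshold ranks in (0, 1)
    w = np.ones_like(ranks) if weight is None else weight(ranks)
    # Fraction of test scores strictly above each training threshold.
    exceed = (te[None, :] > tr[:, None]).mean(axis=1)
    return float((w * exceed).sum() / w.sum())

rng = np.random.default_rng(1)
same = rng.normal(size=1000)
other = rng.normal(size=1000)              # exchangeable with `same`
shifted = rng.normal(loc=1.0, size=1000)   # test scores shifted upward
# Exchangeable samples sit near the null value; a shifted test set sits well above it.
assert wauc(same, shifted) > wauc(same, other) + 0.1
# Quadratic rank weights emphasize the outlying (high-threshold) regions.
assert wauc(same, shifted, weight=lambda u: u**2) > wauc(same, other, weight=lambda u: u**2) + 0.1
```

Note that the null value of the statistic depends on the weights: it is 1/2 only in the uniform case, which is why the sketch compares against an exchangeable draw rather than a fixed constant.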
3 Motivation
For illustration, we apply D-SOS to the canonical iris dataset (Anderson 1935). The task is to classify the species of Iris flowers based on d = 4 covariates (features), with 50 observations for each of the three species. We show how D-SOS helps diagnose false alarms. Specifically, we highlight the following: (1) changes in distribution do not necessarily hurt predictive performance, and (2) data points in the densest regions of the data distribution can be the most difficult – unsafe – to predict. For the subsequent tests, we split iris into 2/3 training and 1/3 test set. Figure 1 displays the train-test pairs, which are split according to different sampling strategies. The first two principal components of iris show that the species cluster together.
Consider four notions of outlyingness. For two-sample classification, the outlier score is the probability of belonging to the training or test set. For anomaly or out-of-distribution detection, the outlier score is the isolation score, a proxy for local density. For residual diagnostics, the score is the out-of-sample error. Finally, for uncertainty quantification (resampling uncertainty), it is the standard error of the mean prediction
(this is the same score as in RUE (Schulam and Saria 2019), which relies on pointwise confidence intervals to establish model trust). Only the first notion of outlyingness – two-sample classification – pertains to modern tests of equal distributions; the others capture other meaningful notions of adverse shift. For all these scores, higher is worse: a higher score indicates that the observation diverges from the desired outcome or does not conform to the training set (Section 5 covers the implementation details).
Suppose we want to test for dataset shift. How do the splits in Figure 1 fare with respect to the tests for no adverse shift given by these notions of outlyingness? Let p and s denote the p-value and the s-value, where s = −log₂(p). The test results are reported on the s-value scale because it is intuitive and lends itself to comparison; we return to the advantages of using s-values for comparison later. An s-value of s can be interpreted as seeing s independent coin flips with the same outcome – all heads or all tails – if the coin is fair (Greenland 2019). This conveys how incompatible the data is with the null hypothesis. For plotting, we winsorize (clip) s-values to a low of 1 and a high of 10. We also display a secondary y-axis with the p-value as a cognitive bridge.
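The p-to-s conversion is just a change of scale, s = −log₂(p); a tiny sketch:

```python
import math

def s_value(p):
    """Surprisal (s-value): s = -log2(p), in bits (Greenland 2019).

    An s-value of s is as surprising as s fair coin flips all landing
    the same way.
    """
    return -math.log2(p)

assert s_value(0.5) == 1.0                  # one coin flip: 1 bit
assert s_value(0.05) > 4.3                  # the conventional 0.05 is ~4.32 bits
assert round(s_value(2 ** -8), 6) == 8.0    # p = 1/256: 8 heads in a row
```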
In Figure 2, the case with (1) random sampling exemplifies the type of false alarm we want to avoid. Two-sample classification, representing tests of equal distributions, is incompatible with the null of no adverse shift (an s-value of around 8). But this shift does not carry over to the other tests: anomaly detection, residual diagnostics and confidence intervals are all compatible with the view that the test set is not worse. Had we relied entirely on two-sample classification, we might not have realized that this shift is essentially benign. Tests of equal distributions alone give a narrow perspective on dataset shift. Contrast (1) random with (2) stratified sampling: when stratified by species, all the tests are compatible with the null of no adverse shift.
We also expect the tests based on anomaly detection to let through the (3) in-distribution test set and flag the (4) out-of-distribution one. Indeed, the results in Figure 2 concur. We might be tempted to conclude that the in-distribution observations are safe, and yet the tests based on residual diagnostics and confidence intervals are fairly incompatible with this view. This is because some of the in-distribution (densest) points are concentrated in a region close to the decision boundary where the classifier does not discriminate well; Figure 1 shows where Iris versicolor (triangles) and Iris virginica (squares) partially overlap. That is, the densest observations are not necessarily safe. Out-of-distribution detection glosses over this, in that the culprits may very well be in-distribution points. D-SOS offers a more holistic perspective on dataset shift because it borrows strength from these complementary notions of outlyingness.
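For readers unfamiliar with isolation scores, here is a toy NumPy rendition of the idea (the paper uses the isotree R package; this miniature version is illustrative only): outliers are isolated by few random splits, so short average path lengths signal outlyingness.

```python
import numpy as np

def isolation_depth(x, X, rng, max_depth=10):
    """Path length of point x in one randomly grown isolation tree."""
    depth = 0
    while depth < max_depth and len(X) > 1:
        j = rng.integers(X.shape[1])              # random split feature
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:
            break
        split = rng.uniform(lo, hi)               # random split value
        # Follow x down the tree; keep only training points on x's side.
        X = X[X[:, j] < split] if x[j] < split else X[X[:, j] >= split]
        depth += 1
    return depth

def isolation_scores(train, query, n_trees=50, seed=0):
    """Outlier score: negative average isolation depth over many trees.

    Outliers are isolated quickly (short paths), so they score high.
    """
    rng = np.random.default_rng(seed)
    depths = np.zeros(len(query))
    for _ in range(n_trees):
        sub = train[rng.choice(len(train), size=min(128, len(train)), replace=False)]
        depths += [isolation_depth(q, sub, rng) for q in query]
    return -depths / n_trees

rng = np.random.default_rng(7)
train = rng.normal(size=(500, 2))
scores = isolation_scores(train, np.array([[0.0, 0.0], [8.0, 8.0]]))
assert scores[1] > scores[0]   # the far-away point is more outlying
```

Production-grade isolation forests add subtleties (path-length normalization, extended splits) that this sketch omits; only the ranking behavior matters here.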
4 Related work
Outlier scores. Density ratios and class probabilities serve as scores for comparing distributions (Menon and Ong 2016). These scores, however, do not directly account for predictive performance. Prediction intervals and residuals, on the other hand, do: both reflect adverse predictive performance in some regions of the feature space. Prediction (confidence) intervals typically widen with increasing dataset shift (Snoek et al. 2019). Methods like MD3 (Sethi and Kantardzic 2017) and RUE (Schulam and Saria 2019) lean on this insight: they essentially track instances in the uncertain regions of the predictive model. Similarly, out-of-distribution examples often have larger residuals or errors (Hendrycks and Gimpel 2017). This points to a dual approach: the first is based on residual diagnostics (the classical approach) and the second on out-of-distribution detection. For recent examples of these research areas with a modern machine learning twist, see Janková et al. (2020) for the former and Morningstar et al. (2021) for the latter. Residuals traditionally underpin misspecification tests in regression. Other approaches such as trust scores (Jiang et al. 2018) can also flag unreliable predictions. Because all these scores reflect distinct notions of outlyingness, the contrasts are insightful even if, admittedly, some are related (Aguinis, Gottfredson, and Joo (2013) list “14 unique and mutually exclusive outlier definitions”; see Table 1 therein). In this respect, D-SOS is in some sense a unifying framework: bring your own outlier scores, and D-SOS morphs into the corresponding two-sample test.
Dimension reduction. In practice, reconstruction errors from dimension reduction can also separate inliers from outliers (Ruff et al. 2021). Recently, Rabanser, Günnemann, and Lipton (2019) combine dimension reduction, as a preprocessing step, with tests of equal distributions to detect dataset shift. Kirchler et al. (2020) push this further and construct powerful two-sample tests based on an intermediate low-rank representation of the data learned via deep learning. Sticking to the theme of projecting to a more informative subspace, some methods, e.g. Cieslak and Chawla (2009) and Lipton, Wang, and Smola (2018), gainfully rely on model predictions for their low-rank representation. This last approach, to add a cautionary note, entangles the predictive model, subject to its own sources of error such as misspecification and overfitting, with dataset shift (Wen, Yu, and Greiner 2014). But it also points to the effectiveness of projecting to a lower subspace to circumvent issues related to high dimensions. Indeed, the classifier two-sample test (Friedman 2004; Clémençon, Depecker, and Vayatis 2009; Lopez-Paz and Oquab 2017; Cai, Goggin, and Jiang 2020) leverages univariate scores that discriminate between training and test set to detect changes in distribution for multivariate data. Inspired by this approach, D-SOS effectively uses outlier scores as a device for dimension reduction.
Test statistic. The area under the receiver operating characteristic curve (AUC) has a long tradition as a robust test statistic in nonparametric tests. Within the framework of classifier tests, Clémençon, Depecker, and Vayatis (2009) pair a high-capacity classifier with the AUC as a test statistic via the Mann-Whitney-Wilcoxon test. Demidenko (2016) proposes an AUC-like test statistic as an alternative to the classical tests at scale because the latter “do not make sense with big data: everything becomes statistically significant” while the former attenuates the strong bias toward large sample sizes. As a generalization of the AUC, the weighted AUC (WAUC) can put nonuniform weights on decision thresholds (Li and Fine 2010). D-SOS seizes on this to give more weight to the outlying regions of the data distribution. Weng and Poon (2008) also advocate for the WAUC as a performance metric for imbalanced datasets because it can capture disparate misclassification costs. D-SOS duly recognizes disparate costs in other contexts and reflects that the outlying regions suffer the most.
To the best of our knowledge, this is the first time that the WAUC is being used as a test statistic.
5 Implementation
In this section, we turn our attention to deriving valid p-values for inference when the score function is data-dependent. Without loss of generality, consider the pooled training and test set. The score function is calibrated on the data given a set of hyperparameters; this calibration procedure returns the requisite score function φ. The asymptotic null distribution of the WAUC, however, is invalid when the same data is used both for calibration and for inference (scoring). We circumvent this issue with permutations, sample splitting, and/or out-of-bag predictions.
Permutations use the empirical, rather than the asymptotic, null distribution. The maximum number of permutations follows Marozzi (2004). This procedure is outlined in Algorithm 1. We refer to this variant as D-SOS-PT in Section 6; unless stated otherwise, it is the default in the experiments there. For speed, it is implemented as a sequential Monte Carlo test (Gandy 2009), which stops early when the result is fairly obvious. Even with computational tricks to increase the speed, permutations can be computationally expensive, sometimes prohibitively so.
A faster alternative, based on sample splitting, relies on the asymptotic null distribution but incurs a cost in calibration accuracy because it sets aside some data. This tradeoff is common in classifier two-sample tests, for example. Assume that we split each dataset in half: the first half is used for calibration, and the second for scoring (inference). We describe this split-sample procedure in Algorithm 2 and refer to it as D-SOS-SS in Section 6. Given the weights in (3), Li and Fine (2010) show that under some mild regularity assumptions, the test statistic T under the null is asymptotically normally distributed,
T ∼ N(μ, σ²)    (5)
with mean μ and variance σ², which depend on the sample sizes. Li and Fine (2010) give the general form of the asymptotic distribution of the WAUC, ready for use with weights different from the ones chosen in (3).
A third option, a natural extension of sample splitting, is to use cross-validation instead. In k-fold cross-validation, the number of folds k mediates between calibration accuracy and inferential robustness. At the expense of refitting the score function k times, this approach uses most of the data for calibration and can still leverage the asymptotic null distribution, provided that the scores are out-of-sample predictions. In other words, it combines the best of Algorithms 1 and 2, namely the calibration accuracy of the first and the inferential speed of the second. We refer to this variant as D-SOS-AT in Section 6. We show in simulations that this approach either matches or exceeds the performance of D-SOS-PT and D-SOS-SS.
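The out-of-sample scoring idea behind the cross-validated variant can be sketched as follows. A toy nearest-centroid two-sample classifier stands in for the random forest; the point is the bookkeeping: every observation is scored exactly once, by a model fit without it.

```python
import numpy as np

def out_of_fold_scores(X, y, k=5, seed=0):
    """K-fold out-of-sample scores: each point is scored by a model
    fit on the other folds, so the scores stay valid for inference.

    Toy scorer: distance to the training-class centroid minus distance
    to the test-class centroid (higher = more test-like).
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = np.empty(len(X))
    for hold in folds:
        fit = np.setdiff1d(idx, hold)               # all points not held out
        c_tr = X[fit][y[fit] == 0].mean(axis=0)     # training-class centroid
        c_te = X[fit][y[fit] == 1].mean(axis=0)     # test-class centroid
        d_tr = np.linalg.norm(X[hold] - c_tr, axis=1)
        d_te = np.linalg.norm(X[hold] - c_te, axis=1)
        scores[hold] = d_tr - d_te
    return scores

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(2, 1, (300, 2))])
y = np.repeat([0, 1], 300)                          # 0 = train, 1 = test origin
s = out_of_fold_scores(X, y)
# Test-origin points should receive higher out-of-fold scores on average.
assert s[y == 1].mean() > s[y == 0].mean()
```

With a random forest, out-of-bag predictions play the same role without explicit refitting, which is what makes the out-of-bag variant essentially free.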
We make the following pragmatic choices for ease of use; typically, the selected score functions perform well out of the box with little to no costly hyperparameter tuning. For anomaly or out-of-distribution detection, the outlier scores are obtained with an isolation forest using isotree (Cortes 2020); we take the default hyperparameters from isotree as given. To investigate the other notions of outlyingness, we use random forests from ranger (Wright and Ziegler 2017), with default hyperparameters from Probst, Boulesteix, and Bischl (2019). As in Hediger, Michel, and Näf (2019), a random forest lets us use the out-of-sample variant of D-SOS (D-SOS-AT) for free, so to speak, since out-of-bag predictions are viable surrogates for out-of-sample scores. For calibration, the random forest is to two-sample classification, residual diagnostics and resampling uncertainty (uncertainty quantification) what the isolation forest is to anomaly detection. Of course, we can plug in other algorithms – choose your favourites – to obtain reasonable outlier scores.
6 Experiments
All experiments were run on a commodity desktop computer with a 12-core Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz and 64 GB RAM, in R version 3.6.1 (2019-07-05). We stress that no hyperparameter tuning was performed – we set the hyperparameters to reasonable defaults as previously discussed. To avoid ambiguity, we explicitly state whether results pertain to D-SOS-SS, D-SOS-PT or D-SOS-AT.
6.1 Simulated shifts
We compare D-SOS to two modern tests of equal distributions on simulated data. The first is the classifier two-sample test, ctst for short. ctst tests whether a classifier can reliably distinguish training from test instances; if so, this is taken as evidence against the null. Lopez-Paz and Oquab (2017) and Cai, Goggin, and Jiang (2020) show that ctst can match the performance of kernel-based tests. The second is the energy test (Székely, Rizzo, and others 2004), a type of kernel test (Sejdinovic et al. 2012). To be consistent with D-SOS, we implement ctst with the same classifier and hyperparameters. Following Clémençon, Depecker, and Vayatis (2009), this ctst variant uses sample splitting, as does D-SOS-SS, and the AUC as its test statistic. Both ctst and D-SOS use the classifier's predicted probability as the score: as this score increases, so does the likelihood that an observation belongs to the test set, as opposed to the training set.
We simulate shifts from a two-component multivariate Gaussian mixture model (GMM). The training and test set are drawn from
x ∼ π₁ N(μ₁, Σ₁) + π₂ N(μ₂, Σ₂)
where, omitting subscripts and superscripts for brevity, π, μ and Σ are the component weight, mean vector and covariance matrix respectively. The baseline specifies the training and test sample sizes, the number of dimensions d, the component weights π₁ and π₂, the mean vectors μ₁ and μ₂, and the covariance matrices Σ₁ = Σ₂ = I_d, where I_d is the d-dimensional identity matrix and 1_d is the d-dimensional all-ones vector. The baseline configurations enforce that training and test set are drawn from the same distribution, i.e. no shift.
We generally construct “fair” alternatives so that the dimension of change is fixed as the ambient dimension increases; the power of multivariate tests based on kernels and distances decays with increasing dimension when differences only exist along a few intrinsic dimensions (Ramdas et al. 2015). We vary one or more of the parameters π, μ and Σ to simulate the desired shifts, all else constant. We change the following settings to preset intensity levels:

Label shift – We flip the component weights so that π₁ and π₂ trade places. The majority component in training becomes the minority in the test sample.

Corrupted sample – We draw a fraction of the examples in the test set from a component that is absent in training.

Mean shift – We change the mean vector in the test set by displacing it a preset amount.

Noise shift – We change the covariance matrix in the test set by inflating the diagonal elements (variances) of the d-by-d covariance matrix by a preset factor.

Dependency shift – We induce a positive relationship between the first two covariates: we change the covariance structure in the test set so that these two covariates are positively correlated, leaving the rest unchanged.
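The simulation setup above can be sketched as follows, with illustrative parameter values (the paper's exact baseline sample sizes, weights and intensity levels are not reproduced here):

```python
import numpy as np

def sample_gmm(n, d, weights, means, covs, rng):
    """Draw n points from a d-dimensional two-component Gaussian mixture."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return np.stack([rng.multivariate_normal(means[k], covs[k]) for k in comp])

rng = np.random.default_rng(0)
d = 3
# Illustrative baseline: unequal component weights, unit covariances.
base = dict(weights=[0.7, 0.3], means=[np.zeros(d), 2 * np.ones(d)],
            covs=[np.eye(d), np.eye(d)])
train = sample_gmm(500, d, rng=rng, **base)
test_null = sample_gmm(500, d, rng=rng, **base)                     # no shift

# Label shift: swap the component weights.
test_label = sample_gmm(500, d, rng=rng, **dict(base, weights=[0.3, 0.7]))

# Mean shift: displace the component means by a preset amount.
means_shift = [m + 1.0 for m in base["means"]]
test_mean = sample_gmm(500, d, rng=rng, **dict(base, means=means_shift))

# Dependency shift: correlate the first two covariates only.
cov_dep = np.eye(d)
cov_dep[0, 1] = cov_dep[1, 0] = 0.8
test_dep = sample_gmm(500, d, rng=rng, **dict(base, covs=[cov_dep, cov_dep]))

# The label-shifted sample puts more mass on the second component...
assert (test_label.mean(axis=0) > train.mean(axis=0)).all()
# ...and the dependency shift raises the correlation of the first two dimensions.
assert (np.corrcoef(test_dep[:, 0], test_dep[:, 1])[0, 1]
        > np.corrcoef(test_null[:, 0], test_null[:, 1])[0, 1])
```

Each shifted test set is then paired with `train` and fed to the competing tests; repeating this over many draws traces out the power curves.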
For each type and intensity of shift, we repeat the experiment 500 times. To compare D-SOS to both ctst and energy, we employ the Bayesian signed rank test (Benavoli et al. 2017, 2014). To do so, we specify a region of practical equivalence (ROPE) on the s-value scale and a suitable prior. We deem two statistical tests practically equivalent if the absolute difference in s-value is at most 1. This difference corresponds to one more coin flip turning up heads (or tails), keeping the streak alive; anything greater arouses suspicion that one test is more powerful and that the difference is not negligible. Specifying a ROPE on the p-value scale would be a lot more cumbersome because p-values, let alone p-value differences, are notoriously difficult to interpret (Wasserstein, Lazar, and others 2016). The Bayesian signed rank test yields the posterior probability that one method is better than, practically equivalent to, or worse than the other. The prior for these experiments adds one pseudo-experiment, distinct from the 500 real simulations, where the difference is 0, i.e. no practical difference exists out of the gate.

Contender (1)   Contender (2)   Tie/Draw^1   Win (1)   Win (2)
ctst            energy          89           12        43
ctst            D-SOS-SS        110          0         34
ctst            D-SOS-PT        84           0         60
ctst            D-SOS-AT        80           0         64
D-SOS-AT        energy          81           37        26
D-SOS-SS        D-SOS-AT        108          0         36
D-SOS-SS        D-SOS-PT        112          0         32
D-SOS-AT        D-SOS-PT        144          0         0
^1 Tied if the absolute difference in s-value (the ROPE) is <= 1.
To remain concise, the full comparison tables are provided as supplementary material. Table 1 summarizes the findings across all settings: dimension, sample size, and type and intensity of shift. For simplicity, we say that one method dominates (wins) if the corresponding posterior probability is sufficiently high; similarly, the two draw if the posterior probability that they are practically equivalent (tied) is sufficiently high. We make several observations:

Across all settings, D-SOS-SS is either better than or equal to ctst. This forms the basis for our suggestion that tests of goodness-of-fit, of which D-SOS is a member, supplant tests of equal distributions when a shift is suspected. The latter do not consider the training set as the reference distribution and treat both training and test set as if they were equally important: they ignore that the training set precedes the test set and that the predictive model is built on the training set, not vice versa. Note, for example, that all the tests explored in Rabanser, Günnemann, and Lipton (2019) for shift detection are tests of equal distributions.

Across all settings, D-SOS-PT is either better than or equal to D-SOS-SS; the same applies to D-SOS-AT. D-SOS-SS pays a hefty price in calibration accuracy for the privilege of using the asymptotic null distribution. D-SOS-AT and D-SOS-PT are practically equivalent; if feasible, we recommend these two over D-SOS-SS. Sample splitting stunts the performance of D-SOS-SS and ctst relative to the field.

D-SOS-AT (or D-SOS-PT) is often better than or equal to energy, except in settings with (1) label shift and (2) corrupted samples. As expected, the WAUC is not unduly influenced by outliers in the way that non-robust statistics are. This resilience is desirable to combat false alarms: while still attentive to outlying regions, our approach does not break down easily.

D-SOS-AT (or D-SOS-PT) dominates energy for dependency shifts. This is consistent with Cai, Goggin, and Jiang (2020), but somewhat at odds with Hediger, Michel, and Näf (2019). In the latter, they induce positive association between many (all) variables, whereas here, in the spirit of 'fair' alternatives, only the correlation between the first two variables is affected.
6.2 Partitioninduced shifts
To investigate how notions of outlyingness are correlated, we analyze 62 real-world and influential classification tasks from the OpenML-CC18 benchmark (Casalicchio et al. 2017; Bischl et al. 2017) (we exclude datasets with more than 500 features because of long runtimes). Similar to the motivating example in Section 3, we look at tests of no adverse shift based on two-sample classification, residual diagnostics, and resampling uncertainty (but with no outlier detection in this round). For each dataset in the OpenML-CC18 benchmark, we perform stratified 10-fold cross-validation, repeated twice, ending up with 20 train-test splits per task. In total, we run 3720 tests of no adverse shift (62 datasets × 20 random splits × 3 tests). Summary statistics and granular test results for these datasets are in the supplementary material.
We expect the D-SOS results to be correlated within, but not across, datasets. To formalize this setup, we use the following model:
log s_ij = α_j + ε_ij    (6)
where, for each dataset j and test i, the s-value s_ij is log-normally distributed: positive and skewed to the right. The s-value consists of a dataset-specific (fixed) effect α_j, subject to noise ε_ij; ε accounts for within-dataset correlation. The specification in (6) gives rise to the so-called fixed effects model with clustered covariance, widely used in econometrics and biostatistics. We fit such a linear regression using the clubSandwich package (Pustejovsky and Tipton 2018) to obtain robust estimates of the fixed effects even with arbitrary heteroskedasticity and autocorrelation in ε left unspecified. All fixed effects are statistically significant.
In stark contrast with the worst-case (say, adversarial) shifts used to assess or build stable models, e.g. Subbaswamy, Adams, and Saria (2021) and Duchi and Namkoong (2018), cross-validation generates partition-induced shifts (Jose García Moreno-Torres, Sáez, and Herrera 2012). These shifts supposedly mimic the natural variation in the data and are less harsh or extreme than the worst-case shifts. The fixed effects in (6) measure how sensitive or robust a dataset is to these partition-induced shifts; in other words, they are interpretable and quantitative metrics for data monitoring or validation. Recall that data validation via statistical tests, to echo Polyzotis et al. (2019), is our main concern.
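The fixed-effects idea reduces, in its simplest form, to one intercept per dataset on the log s-value scale. A minimal sketch follows (the paper uses clubSandwich in R for cluster-robust covariance; the naive standard errors below do not replicate that adjustment):

```python
import numpy as np

def dataset_fixed_effects(log_s, dataset_ids):
    """Fixed-effects estimates: one intercept per dataset, i.e. the
    within-dataset mean of the log s-values, with a naive standard
    error from the within-dataset spread (no cluster-robust correction).
    """
    effects = {}
    for j in np.unique(dataset_ids):
        vals = log_s[dataset_ids == j]
        effects[j] = (vals.mean(), vals.std(ddof=1) / np.sqrt(len(vals)))
    return effects

rng = np.random.default_rng(0)
# Two synthetic "datasets", 20 splits each; dataset B is more shift-sensitive.
log_s = np.concatenate([rng.normal(1.0, 0.3, 20), rng.normal(2.5, 0.3, 20)])
ids = np.array(["A"] * 20 + ["B"] * 20)
fx = dataset_fixed_effects(log_s, ids)
assert fx["B"][0] > fx["A"][0]   # B's fixed effect (sensitivity) is larger
```

A larger fixed effect marks a dataset whose splits are systematically more surprising under the null of no adverse shift, i.e. a dataset more sensitive to partition-induced shift.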
At a glance, the heatmap and clustering dendrogram in Figure 3, created with superheat (Barter and Yu 2018), tell the story quite succinctly. Across the 62 datasets, tests of no adverse shift based on two-sample classification, residual diagnostics, and resampling uncertainty are indeed highly correlated. Figure 4 shows this correlation matrix (estimated with the method in Schäfer and Strimmer 2005). These three distinct notions of outlyingness are distinctly not independent: they leak information about one another. Residual diagnostics and resampling uncertainty, both of which incorporate facets of predictive performance, are the most correlated, which suggests that they can be used interchangeably. Two-sample classification is not as strongly associated with the other two. This connects back to, and further supports, the point made in Section 3; namely, tests for distributional shifts alone are ill-equipped, by definition, to detect whether a shift is benign or harmful for predictive tasks. If the key criterion is predictive performance, D-SOS tests based on residual diagnostics and resampling uncertainty are more informative than those based solely on comparing distributions.
7 Conclusion
DSOS is a framework derived from goodness-of-fit testing to detect adverse shifts. It can confront the data with more informative hypotheses than tests of equal distributions. This works well when the mapping to the relevant subspace distinguishes between safe and unsafe regions of the data distribution. Our method accommodates different notions of outlyingness. It lets users define, via a score function, what is meant by ‘worse off’. Besides the outlier scores explored in this paper, we stress that other sensible choices for the score function and the weights abound. These choices can be adjusted to reflect prior domain knowledge and serve the needs of the data scientist. Looking ahead, future research could investigate how different weighting schemes affect the power of the test. The functional form of the postulated weights could be a hyperparameter worth tuning. Moreover, composite score functions, combining several notions of outlyingness, would enrich the types of hypotheses that can be tested.
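To illustrate how a weighting scheme enters the test statistic, here is a minimal sketch of a weighted AUC over outlier scores, with a quadratic weight that emphasizes the outlying (high false-positive-rate) end of the ROC curve. The weight function and the simulated scores are illustrative assumptions, not the paper's exact choices.

```python
# Sketch: weighted area under the ROC curve (WAUC) comparing outlier
# scores on a test set against those on a training set. The quadratic
# weight w(t) = 3 t^2 up-weights the outlying region; other weights
# could encode different prior domain knowledge.
import numpy as np

def wauc(scores_train, scores_test, weight=lambda fpr: 3 * fpr ** 2):
    """Weighted AUC for test-vs-train outlier scores."""
    scores = np.concatenate([scores_train, scores_test])
    labels = np.concatenate([np.zeros(len(scores_train)),
                             np.ones(len(scores_test))])
    order = np.argsort(-scores, kind="stable")  # descending thresholds
    labels = labels[order]
    tpr = np.cumsum(labels) / labels.sum()          # test-set recall
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()
    # Riemann sum over FPR increments, weighted by w(fpr).
    dfpr = np.diff(np.concatenate([[0.0], fpr]))
    return float(np.sum(weight(fpr) * tpr * dfpr))

rng = np.random.default_rng(2)
train = rng.normal(0, 1, 500)     # outlier scores on the training set
same = rng.normal(0, 1, 500)      # no shift
shifted = rng.normal(1, 1, 500)   # adverse shift: systematically higher
```

Under no shift the ROC curve is the diagonal and this statistic concentrates around its null value; an adverse shift pushes it upward, so only score increases in the weighted (outlying) region move the test.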
8 Acknowledgements
The author would like to thank the reviewers for suggesting numerous improvements. This work was supported by the Royal Bank of Canada (RBC). The views expressed here are those of the author, not of RBC.
9 References
Aguinis, Herman, Ryan K Gottfredson, and Harry Joo. 2013. “Best-Practice Recommendations for Defining, Identifying, and Handling Outliers.” Organizational Research Methods 16 (2): 270–301.
Barter, Rebecca L, and Bin Yu. 2018. “Superheat: An R Package for Creating Beautiful and Extendable Heatmaps for Visualizing Complex Data.” Journal of Computational and Graphical Statistics 27 (4): 910–22.
Benavoli, Alessio, Giorgio Corani, Janez Demšar, and Marco Zaffalon. 2017. “Time for a Change: A Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis.” The Journal of Machine Learning Research 18 (1): 2653–88.
Benavoli, Alessio, Giorgio Corani, Francesca Mangili, Marco Zaffalon, and Fabrizio Ruggeri. 2014. “A Bayesian Wilcoxon Signed-Rank Test Based on the Dirichlet Process.” In International Conference on Machine Learning, 1026–34. PMLR.
Bischl, Bernd, Giuseppe Casalicchio, Matthias Feurer, Frank Hutter, Michel Lang, Rafael G Mantovani, Jan N van Rijn, and Joaquin Vanschoren. 2017. “OpenML Benchmarking Suites.” arXiv Preprint arXiv:1708.03731.
Cai, Haiyan, Bryan Goggin, and Qingtang Jiang. 2020. “Two-Sample Test Based on Classification Probability.” Statistical Analysis and Data Mining: The ASA Data Science Journal 13 (1): 5–13.
Casalicchio, Giuseppe, Jakob Bossek, Michel Lang, Dominik Kirchhoff, Pascal Kerschke, Benjamin Hofner, Heidi Seibold, Joaquin Vanschoren, and Bernd Bischl. 2017. “OpenML: An R Package to Connect to the Machine Learning Platform OpenML.” Computational Statistics 32 (3): 1–15. https://doi.org/10.1007/s00180-017-0742-2.
Cieslak, David A, and Nitesh V Chawla. 2009. “A Framework for Monitoring Classifiers’ Performance: When and Why Failure Occurs?” Knowledge and Information Systems 18 (1): 83–108.
Clémençon, Stéphan, Marine Depecker, and Nicolas Vayatis. 2009. “AUC Optimization and the Two-Sample Problem.” In Proceedings of the 22nd International Conference on Neural Information Processing Systems, 360–68.
Cortes, David. 2020. Isotree: Isolation-Based Outlier Detection. https://CRAN.R-project.org/package=isotree.
Duchi, John, and Hongseok Namkoong. 2018. “Learning Models with Uniform Performance via Distributionally Robust Optimization.” arXiv Preprint arXiv:1810.08750.
Friedman, Jerome. 2004. “On Multivariate Goodness-of-Fit and Two-Sample Testing.” Stanford Linear Accelerator Center, Menlo Park, CA (US).
Gandy, Axel. 2009. “Sequential Implementation of Monte Carlo Tests with Uniformly Bounded Resampling Risk.” Journal of the American Statistical Association 104 (488): 1504–11.
Greenland, Sander. 2019. “Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution with S-Values.” The American Statistician 73 (sup1): 106–14.
Hand, David J. 2009. “Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve.” Machine Learning 77 (1): 103–23.
Hediger, Simon, Loris Michel, and Jeffrey Näf. 2019. “On the Use of Random Forest for Two-Sample Testing.” arXiv Preprint arXiv:1903.06287.
Hendrycks, Dan, and Kevin Gimpel. 2017. “A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks.” In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=Hkg4TI9xl.
Janková, Jana, Rajen D Shah, Peter Bühlmann, and Richard J Samworth. 2020. “Goodness-of-Fit Testing in High Dimensional Generalized Linear Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82 (3): 773–95.
Jiang, Heinrich, Been Kim, Melody Y Guan, and Maya R Gupta. 2018. “To Trust or Not to Trust a Classifier.” In NeurIPS, 5546–57.
Kelly, Mark G, David J Hand, and Niall M Adams. 1999. “The Impact of Changing Populations on Classifier Performance.” In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 367–71.
Kirchler, Matthias, Shahryar Khorasani, Marius Kloft, and Christoph Lippert. 2020. “Two-Sample Testing Using Deep Learning.” In International Conference on Artificial Intelligence and Statistics, 1387–98. PMLR.
Li, Jialiang, and Jason P Fine. 2010. “Weighted Area Under the Receiver Operating Characteristic Curve and Its Application to Gene Selection.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 59 (4): 673–92.
Lipton, Zachary, YuXiang Wang, and Alexander Smola. 2018. “Detecting and Correcting for Label Shift with Black Box Predictors.” In International Conference on Machine Learning, 3122–30. PMLR.
Lopez-Paz, David, and Maxime Oquab. 2017. “Revisiting Classifier Two-Sample Tests.” In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=SJkXfE5xx.
Marozzi, Marco. 2004. “Some Remarks About the Number of Permutations One Should Consider to Perform a Permutation Test.” Statistica 64 (1): 193–201.
Menon, Aditya, and Cheng Soon Ong. 2016. “Linking Losses for Density Ratio and Class-Probability Estimation.” In International Conference on Machine Learning, 304–13.
Moreno-Torres, Jose García, José A Sáez, and Francisco Herrera. 2012. “Study on the Impact of Partition-Induced Dataset Shift on k-Fold Cross-Validation.” IEEE Transactions on Neural Networks and Learning Systems 23 (8): 1304–12.
Moreno-Torres, Jose G, Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V Chawla, and Francisco Herrera. 2012. “A Unifying View on Dataset Shift in Classification.” Pattern Recognition 45 (1): 521–30.
Morningstar, Warren, Cusuh Ham, Andrew Gallagher, Balaji Lakshminarayanan, Alex Alemi, and Joshua Dillon. 2021. “Density of States Estimation for Out of Distribution Detection.” In International Conference on Artificial Intelligence and Statistics, 3232–40. PMLR.
Polyzotis, Neoklis, Martin Zinkevich, Sudip Roy, Eric Breck, and Steven Whang. 2019. “Data Validation for Machine Learning.” Proceedings of Machine Learning and Systems 1: 334–47.
Probst, Philipp, Anne-Laure Boulesteix, and Bernd Bischl. 2019. “Tunability: Importance of Hyperparameters of Machine Learning Algorithms.” Journal of Machine Learning Research 20 (53): 1–32.
Pustejovsky, James E, and Elizabeth Tipton. 2018. “Small-Sample Methods for Cluster-Robust Variance Estimation and Hypothesis Testing in Fixed Effects Models.” Journal of Business & Economic Statistics 36 (4): 672–83.
Quionero-Candela, Joaquin, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. 2009. Dataset Shift in Machine Learning. The MIT Press.
Rabanser, Stephan, Stephan Günnemann, and Zachary Lipton. 2019. “Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift.” In Advances in Neural Information Processing Systems, 1394–1406.
Ramdas, Aaditya, Sashank Jakkam Reddi, Barnabás Póczos, Aarti Singh, and Larry Wasserman. 2015. “On the Decreasing Power of Kernel and Distance Based Nonparametric Hypothesis Tests in High Dimensions.” In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Rinaldo, Alessandro, Larry Wasserman, Max G’Sell, and others. 2019. “Bootstrapping and Sample Splitting for High-Dimensional, Assumption-Lean Inference.” The Annals of Statistics 47 (6): 3438–69.
Ruff, Lukas, Jacob R Kauffmann, Robert A Vandermeulen, Grégoire Montavon, Wojciech Samek, Marius Kloft, Thomas G Dietterich, and Klaus-Robert Müller. 2021. “A Unifying Review of Deep and Shallow Anomaly Detection.” Proceedings of the IEEE.
Schäfer, Juliane, and Korbinian Strimmer. 2005. “A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics.” Statistical Applications in Genetics and Molecular Biology 4 (1).
Schulam, Peter, and Suchi Saria. 2019. “Can You Trust This Prediction? Auditing Pointwise Reliability After Learning.” In The 22nd International Conference on Artificial Intelligence and Statistics, 1022–31. PMLR.
Sejdinovic, Dino, Arthur Gretton, Bharath Sriperumbudur, and Kenji Fukumizu. 2012. “Hypothesis Testing Using Pairwise Distances and Associated Kernels.” In Proceedings of the 29th International Conference on Machine Learning, 787–94.
Sethi, Tegjyot Singh, and Mehmed Kantardzic. 2017. “On the Reliable Detection of Concept Drift from Streaming Unlabeled Data.” Expert Systems with Applications 82: 77–99.
Snoek, Jasper, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. 2019. “Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift.” In Advances in Neural Information Processing Systems, 13969–80.
Subbaswamy, Adarsh, Roy Adams, and Suchi Saria. 2021. “Evaluating Model Robustness and Stability to Dataset Shift.” In International Conference on Artificial Intelligence and Statistics, 2611–9. PMLR.
Székely, Gábor J, Maria L Rizzo, and others. 2004. “Testing for Equal Distributions in High Dimension.” InterStat 5 (16.10): 1249–72.
Wasserstein, Ronald L, Nicole A Lazar, and others. 2016. “The ASA’s Statement on P-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33.
Wen, Junfeng, Chun-Nam Yu, and Russell Greiner. 2014. “Robust Learning Under Uncertain Test Distributions: Relating Covariate Shift to Model Misspecification.” In ICML, 631–39.
Weng, Cheng G, and Josiah Poon. 2008. “A New Evaluation Measure for Imbalanced Datasets.” In Proceedings of the 7th Australasian Data Mining Conference - Volume 87, 27–32.
Wright, Marvin N., and Andreas Ziegler. 2017. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.”
Journal of Statistical Software 77 (1): 1–17. https://doi.org/10.18637/jss.v077.i01.