Sequential Multivariate Change Detection with Calibrated and Memoryless False Detection Rates

08/02/2021 ∙ by Oliver Cobb, et al. ∙ 0

Responding appropriately to the detections of a sequential change detector requires knowledge of the rate at which false positives occur in the absence of change. When the pre-change and post-change distributions are unknown, setting detection thresholds to achieve a desired false positive rate is challenging, even when there exists a large number of samples from the reference distribution. Existing works resort to setting time-invariant thresholds that focus on the expected runtime of the detector in the absence of change, either bounding it loosely from below or targeting it directly but with asymptotic arguments that we show cause significant miscalibration in practice. We present a simulation-based approach to setting time-varying thresholds that allows a desired expected runtime to be targeted with a 20x reduction in miscalibration whilst additionally keeping the false positive rate constant across time steps. Whilst the approach to threshold setting is metric agnostic, we show that when using the popular and powerful quadratic time MMD estimator, thoughtful structuring of the computation can reduce the cost during configuration from O(N^2B) to O(N^2+NB) and during operation from O(N^2) to O(N), where N is the number of reference samples and B the number of bootstrap samples. Code is made available as part of the open-source Python library .



There are no comments yet.


page 18

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Algorithms that can detect change in the distribution governing the generation of a data stream have long played an important role in applications such as quality assurance, cybersecurity and financial market analysis. However, as society becomes increasingly digitised and processes increasingly automated by machine learning models, there is a growing need for algorithms addressing the more specific problem of detecting when the distribution underlying a stream of live deployment data changes from that which generated a historical training set.

Deploying a trained model to make decisions or predictions of real-world impact carries risk. The usual justification is that by evaluating the performance of the model on held out training instances we can get an accurate and unbiased estimate of how the model will perform on a stream of future deployment data. However, this justification is predicated on the assumption that the process underlying the generation of the deployment data remains identical to the process that underlay the historical training data. In practice, not only can seemingly benign changes in the underlying process cause catastrophic deterioration in model performance, when feedback is delayed this can occur silently and damage can accumulate over time.

Attempts to make machine learning models robust to such changes have so far had limited success [Taori et al., 2020]. One might hope that probabilistic models would respond to distributional change by reporting increased uncertainty in their predictions that reflect their unfamiliarity with the drifted distribution. However Ovadia et al. [2019] show that model uncertainties become wildly miscalibrated in the face of such changes; not only are models more frequently wrong, but they are confident in their wrong predictions. It therefore remains to detect such changes directly so that alerts can be raised and adaptation or retraining processes can be triggered.

When performing change detection false positives are unavoidable. Responding appropriately therefore requires knowledge of the rate at which false detections occur in the absence of change. For example suppose that change is detected at time . A proportionate response given false detections are expected every 500 time steps would be entirely different to a proportionate response given false detections are expected every 5000 time steps. It is therefore of enormous value to practitioners to be able to choose a false positive rate (FPR) that is suitable for their use case and then calibrate their detectors such that they operate at this known rate.

Despite this, most existing approaches to sequential change detection either make no effort to calibrate the detector to operate at a known FPR, or calibrate it to operate at some unknown FPR below a specifiable upper bound. Other approaches recognise the importance of targeting a specifiable FPRs, but the calibration process is motivated by asymptotic arguments which we later show often result in significantly miscalibrated detectors in practice.

In this paper we focus on the sequential detection of change from the distribution that underlay the generation of a large reference set. Our main contributions are to:

  1. Present a novel, simulation-based and metric-agnostic approach to threshold setting that results in calibrated detectors where the false detection probabilities are known and kept constant across time steps.

  2. Show how an estimator of maximum mean discrepancy (MMD) can be used to define drift detectors that are flexible enough to operate on streams of data residing in any domain on which a kernel can be defined.

  3. Show how to structure computations such that the cost of leveraging the minimum variance unbiased estimator of MMD is reduced from

    to during configuration and from to per timestep during operation, where is the number of reference samples and the number of bootstrap samples.

The algorithms introduced in this paper have been successfully integrated into the open source Python library alibi-detect111 developed by Van Looveren et al. [2019], which can therefore be referred to for code implementations.

2 Background

2.1 Problem Statement and Notation

Let , 

,  denote an evolving data stream of independent random variables generated by the process


where denotes a pre-change distribution over , denotes a post-change distribution over and is an unknown change point222 corresponds to the stream generated entirely according to .. We focus on the setting where both and are assumed unknown but there exists a large set of i.i.d. reference instances from .

To refer to finite windows of the data stream we use interval notation in the subscript such that, for example, denotes the window of size starting at time and ending at time . We will use to denote a collection of random variables distributed i.i.d. according to and to denote a random subsample of size from where we will specify whether the sampling takes place with or without replacement. We will use a dash to distinguish random variables when necessary such that, for example, and denote two independent and identically distributed random variables. We will use and to denote expectations and probabilities corresponding to the stream with change point .

Now consider the sequential change detection problem where at each time we test whether significant evidence exists to suggest that change has already occurred (i.e. whether ). Letting denote the time at which change is detected, we consider the problem of designing detection algorithms that minimise Pollak [1985]’s formulation of the worst case expected detection delay

subject to operating at a known expected runtime

in the absence of change.

2.2 Not All False Positive Rates Are Equal

Note how the ERT corresponds to a rate of false positives. In the absence of change we expect a false positive every ERT time steps on average. However, it is important to note that this doesn’t mean that false positives occur at a constant rate. Assuming that a detector calibrated to does not make a false detection at times , the probability of a false detection at time is not necessarily .

For reasons we later consider, detectors based on overlapping test windows of size are usually incapable of making detections for , have disproportionately large probabilities of making false detections at times and then a disproportionately low probability thereafter. Detectors based on letting evidence accumulate indefinitely are more likely to make false detections later in the stream than earlier. These irregularities complicate the interpretation of the significance of detections when they occur.

We therefore consider it useful to not only target a specifiable , but to do so in a manner such that false detections occur at a constant rate in the absence of change. In other words, conditional on , we want the detection time

to be distributed according to the geometric distribution

with . We would then have the memoryless property


for all .

2.3 Related Work

With roots tracing back to the sequential hypothesis testing framework introduced by Wald [1945], the problem of sequential change point detection has been well studied in various contexts such as statistical process control [Lai, 1995, Roberts, 1966, Harris and Ross, 1991] and intrusion detection [Polunchenko et al., 2012, Tartakovsky et al., 2006]. We consider the traditional framework where the false detection probability at each time step is of interest, not the context of increasing interest [Yu et al., 2020, Kirch and Stoehr, 2019] that considers the probability (of asymptotic nature) that a false detection occurs at any time . Under the assumption that both the pre-change and post-change distributions are known Lai [1995] show that, subject to operating at an expected runtime , the CUSUM algorithm of Page [1954] asymptotically achieves the lowest possible EDD as . However, knowledge of the pre-change distribution is rare and knowledge of the post-change distribution even more so. Significant work has gone into proposing approaches that don’t require such knowledge.

Most sequential change detection methods are designed for univariate data streams and do not extend straightforwardly to the multivariate case [Kifer et al., 2004, Bifet and Gavalda, 2007]. Whilst some authors advocate leveraging an ensemble of univariate detectors to deal with multivariate data streams [Faithfull et al., 2019], such approaches can only detect changes to the marginal distributions and require a multiple testing correction which only allows false positive rates to be loosely bounded rather than targeted. Designing detectors that are flexible enough to detect any change in the distribution governing a multivariate data stream has been an area of recent focus [Bu et al., 2017, Li et al., 2019, Liu et al., 2018, Mozaffari and Yilmaz, 2019, Hinder et al., 2020, Chen et al., 2019, Kurt et al., 2020, Dasu et al., 2006].

Change detectors are often central components within more application-specific drift detection and adaptation schemes [Gama et al., 2004, Baena-Garcıa et al., 2006, Lu et al., 2018]. Here the random variables decompose into features and targets and there is particular interest in detecting change that might lead to deterioration in the performance of a model of interest . When the targets are observable the drift detection is supervised and the problem reduces to detecting change in the univariate stream of model losses . More commonly however the targets are unobservable and the drift detection is unsupervised. It is then necessary to detect changes directly in the feature stream , motivating the development of change detectors flexible enough to operate on data streams of various dimensionalities and modalities.

Change detectors typically work by (perhaps implicitly) testing for changes between a window of reference data and a window of test data and can be usefully categorised by their windowing strategy. Whilst the reference window can be dynamic (e.g. Chen et al. [2019]), we focus on the setting where the reference window is fixed, as is most useful for detecting change from a distribution that generated a model’s training set. There are then three ways in which test windows can be defined.

The simplest way is to adopt a disjoint window strategy that partitions the stream into disjoint windows of size , with tests being performed every time steps. Another is to use an overlapping window strategy where a window of fixed size iteratively receives the newest observation and releases the -th oldest. A final way, which often isn’t thought of explicitly in terms of windows, is to accumulate evidence for change continuously, perhaps with a forgetting mechanism that ensures evidence doesn’t accumulate in favour of no change. This can be thought of as an adaptive window strategy with adaptive windows that grow when there is evidence for change and reset to size when there is no such evidence.

Approaches leveraging disjoint windows can use more expensive procedures for performing tests (as they are performed only every time steps) and have uncorrelated test outcomes which makes calibration simple. However they are incapable of strong responses to both slight and severe drift. Detecting slight drift requires a large in order for the slight deviation to be confirmed as significant, but a large prohibits a low EDD, regardless of how severe and obvious the drift is. Hence, disjoint windows are rarely considered for sequential change detection.

Approaches leveraging adaptive windows, usually adaptations of the CUSUM algorithm of Page [1954], work well when the post-change distribution is known. Prior to change the test window will usually remain small which means that if severe drift occurs it heavily influences detection decisions straight away and if it is slight the window size can grow to accommodate. However, in the absence of post-change distribution knowledge approaches either have cost that grows with time (e.g. Bifet and Gavalda [2007]

) or have test statistics based on how outlying each data point in the test window is without considering their collective similarity (e.g.

Flynn and Yoo [2019], Kurt et al. [2020]). The adaptivity also complicates calibration to a desired ERT.

Approaches leveraging overlapping windows strike an appealing middle ground. The window size can be made large enough to facilitate slight drift and when severe drift occurs instances filter into the window, and therefore start to influence test outcomes, immediately. Testing at every time step typically requires a test statistic that can be updated incrementally at low cost. Such methods will be the focus of this work.

Evaluation metrics for sequential change detectors tend to vary from paper to paper. Sometimes in the drift detection context the performance of change detection components are evaluated in terms of how well they contribute to maintaining model performance within some overarching model adaptation framework. In pure change detection contexts a notion of power, such as average detection delay or detection rate across a fixed time span, is often compared between detectors guaranteed to be operating above a specified ERT in the absence of change (or, equivalently, with false detection probabilities below a specified ). In this case actual ERTs (or false detection probabilities) are often not reported or considered. Other times TP/TN/FP/FN rates are computed under various threshold values and ROC/AUC-like metrics are reported. Another strategy is to perform experiments with synthetic data streams with known distributions that allow thresholds to be configured to target specified ERTs. All of these evaluation strategies circumnavigate the fact that the detectors being evaluated can’t be configured to target a desired ERT when the distribution is not known.

2.4 B-statistic and LSDD-Inc

In this section we look at two nonparametric overlapping window methods that tackle the same type of change detection as we do. Both approaches consider the context whereby a large set of reference data is available from the unknown pre-change distribution and an overlapping test window strategy is adopted. At each time , a test window with unknown underlying distribution is considered and a test statistic is computed as a sample-based estimate of a notion of distance between the unknown and .

As the notion of distance Bu et al. [2017] use the least squares density difference (LSDD), defined as

Li et al. [2019] instead use the maximum mean discrepancy (MMD) [Gretton et al., 2012], which can be defined in multiple ways but perhaps most simply as

for some kernel and , .

Crucially, both of these notions of distance admit two-sample estimators that can be incrementally updated at low cost. To estimate LSDD Bu et al. [2017] adopt a regularised estimator which involves fitting a Gaussian kernel model to the difference function . It can be updated in time with respect to the number of reference samples . To update an estimate of MMD at low cost Li et al. [2019] use the linear time333When we refer to algorithms having linear or quadratic time we mean with respect to the size of the reference set , which is assumed to be large. Scaling with respect to the size of the test window , which is chosen independently of , is not considered important.

B-statistic instead of the more powerful and commonly used quadratic time estimator. Another motivation for this linear time estimator was that it made (asymptotic) analysis of runtimes more tractable. We later show, however, that there is a way of incrementally updating estimates from the quadratic time estimator in linear time such that resorting to the linear time estimator is unnecessary.

Having defined a test statistic that can be updated at low cost we can monitor the test statistic trajectory . The difficult part is then to determine when the trajectory has deviated significantly from its expectation under the no-change null. The strategy adopted by both Bu et al. [2017] and Li et al. [2019] is to detect change whenever for some time-invariant threshold . Li et al. [2019] define to be an estimate of , the unknown threshold that achieves a desired ERT. However their estimate is only accurate asymptotically (holds as ) and leads to significant miscalibration in practice as we later show. Bu et al. [2017] take a simulation based approach which is more similar to ours and is outlined in Algorithm 1, where superscripted pairs are used to denote quantities corresponding to a reference window of size and test window of size .

1 Input reference set , window size and number of bootstrap samples .
2 for  : do
3        Sample, with replacement from , a reference window and test window .
4        Compute the corresponding estimate of .
5Let be the empirical quantile of .
Output threshold , where the expectations have known values.
Algorithm 1 LSDD-Inc threshold configuration

Within this algorithm the procedure up until step 5 is relatively standard and estimates , the -quantile of , with an estimate bootstrapped using with-replacement sampling from . However step 6 is less standard and the justification is that the distribution of will be more tightly centered on its expected value than that of and therefore will correspond to a quantile of that is greater than .

This procedure therefore aims to bound the expected runtime from below rather than target it. The bound can be very loose if and the distribution of is much more diffuse than that of . Moreover, no correction is made for correlated test outcomes, introducing a correlated-test bias which loosens the bound further. In combination these effects result in a detector that operates at some unknown ERT that is often orders of magnitude greater than that which was desired. This results in an EDD that is potentially much larger than would be achieved with a detector targeting the desired ERT.

3 Calm

Note that we are not in fact interested in the distribution of . Given that is known prior to both configuration and operation, we wish to configure the detector to operate at a desired ERT conditional on the particular realisation of the reference data that shall be used to compute test statistics – not conditional on the reference data being drawn from the same underlying distribution. In other words, we are instead interested in the distribution of .

Ideally we would estimate the distribution of by sampling from , whilst keeping the reference set fixed as . Without the ability to sample , a naive bootstrap approach would be to instead use with-replacement samples from . However this results in biasing quantile estimates of downward to the fact that the samples in the test window are also in the reference window. We refer to this as the window-sharing bias.

This explains why Li et al. [2019] instead use as an intermediary for LSDD-Inc and target it using a standard bootstrap approach; assuming then the probability of shared samples between the reference and test windows, and therefore the resulting window-sharing bias, is very low. However, as mentioned earlier, corresponding thresholds can then only be used to lower-bound the ERT rather than target it.

We instead recommend using a reference window , sampled without replacement from during the configuration phase and then kept fixed throughout the operational phase. Test statistics then take the form and the quantity of interest is , the -quantile of , which encompasses the randomness resulting from the sampling of the reference window. This quantity can then be estimated using the procedure described in Algorithm 2.

1 Input reference set , window size and number of bootstrap samples .
2 for : do
3        Sample, without replacement from , a reference window of size and let the unsampled points act as a corresponding test window.
4        Compute the corresponding estimate of .
5Let be the empirical -quantile of .
Output threshold
Algorithm 2 Time-invariant threshold configuration

Within this procedure the process for sampling reference windows corresponds exactly to the process used during operation and does not introduce any bias. The process for sampling test windows corresponds to estimating the distribution of using that of sampled without replacement from . The fact that we are inferring a property of a function of random variables using a pool of size of , where , means that the bias introduced here is actually much less than that typically introduced by bootstrap approximations of functions of random variables444This is evidenced by the existence of the subsampling bootstrap and m-out-of-n bootstrap that approaches the case by first performing inference for a function of fewer variables and then applying corrections.. For , sampling without replacement (subsampling) is considered preferable and, crucially, allows the window-sharing bias to be completely eliminated by keeping the reference and test windows disjoint.

To demonstrate this reduction in bias we consider a simple example where the reference distribution is the -dimensional isotropic Gaussian , we have a reference set of size and we wish to estimate the distribution of for some two-sample distance estimator . We let , and consider the estimators of both MMD and LSDD.

We are interested in comparing the accuracy of bootstrap estimates using with-replacement and without-replacement sampling. For the former the distribution of , with sampled with replacement from , is estimated using bootstrap samples from where both the reference and test windows are sampled with-replacement. For the latter the distribution of , with sampled without replacement from , is estimated using bootstrap samples from where both the reference and test windows are sampled without replacement such that they are disjoint. For both cases we compare the empirical distribution obtained using 25000 bootstrap samples with an empirical distribution formed from 25000 samples from the true distribution being targeted, where the test window is sampled from .

Figure 0(b) shows the targeted and estimated distributions for a range of dimensions for the MMD case. We see that in the univariate setting both with-replacement and without-replacement sampling fares well and the distributions align. However as the dimension increases the alignment noticeably deteriorates when sampling with replacement as window sharing exerts a downwards bias on the estimates. This is unsurprising because MMD is based on pairwise similarities and in high dimensions fewer pairs of samples are similar to each other. Therefore the bias introduced by the high similarity between multiple draws of the same instance is higher. Figure 0(a), which plots the Kolmogorov-Smirnov two-sample distance between the targeted and estimated distributions against dimension, demonstrates this effect is true also for the LSDD distance (albeit to a lesser extent).

(a) Kolmogorov-Smirnov distance between bootstrap samples and a sample from the distribution being targeted, plotted against the dimension of the Gaussian.
(b) Probability density functions of the bootstrap samples and a sample from the distribution being targeted, corresponding to the MMD distance and a range of dimensions for the Gaussian.
Figure 1: A comparison of the accuracy of with-replacement and without-replacement bootstrap approximations of the distribution of two-sample distance estimators. Estimators of both MMD and LSDD are considered and both reference and test samples are drawn from a -dimensional Gaussian.

3.1 Adjusting for Correlated Test Outcomes

The above process aims to configure such that the first test statistic 555Note how prior to time the test window is not full and a test statistic can therefore not be computed. satisfies . However, the second test statistic is computed using almost the same data ( rather than ) and is therefore highly correlated with . Consequently, in the absence of change the probability of a false detection at time conditional on no false detection at time is some unknown value that is less than , and often considerably so. If we let denote the false detection probability at time conditional on no prior false detections, even if is perfectly estimated we have for all and therefore an expected runtime that takes some unknown value above that which was desired. We would instead like to estimate time-varying thresholds that result in for all

In estimating these time-varying thresholds we now consider sampling and fixing a reference window of slightly reduced size . At the first test time we wish to compare the test statistic against the threshold that satisfies . This value is the -quantile of which we can estimate using the bootstrap approach described above adjusted to account for the reduced size of the reference window.

But assuming the detector makes it to time without making a false detection, we then wish to compare the test statistic against the threshold that satisfies

This problem has been considered in the context of adaptive window methods and Verdier et al. [2008] propose a solution for contexts where sampling from is possible. However we now show how the same idea can be applied to the overlapping window context where only the finite sample from is available. What makes this possible is the fact that in the overlapping window context


That is, the thresholds differ for the first tests but then remain constant. To see this note that for the first test at time the points in the test window have collectively contributed to 0 passed tests. For the second test we have points that have contributed to a passed test each and a new point that hasn’t passed a test – so the test is performed with a test window that has collectively contributed times to passed tests. For the third test we have passed test contributions as some have contributed to two passed tests. Up until the -th test we receive more and more information that makes a passed test more likely. But then the information available for the -th test (at time ) is the same as that which was available for the -th test – every single test point has maximally contributed to passed tests (i.e. contributions). In other words, observing a passed test at time doesn’t introduce any information that causes to differ from . The same principle applies thereafter meaning that the problem is reduced to estimating the thresholds .

Like Verdier et al. [2008] we simulate trajectories under the no-change null and sequentially estimate quantiles such that the trajectories used in the estimation of are only those which have remained below all previous thresholds (in our case ), thereby enacting the conditioning. More concretely, defining for compactness, our approach is described in Algorithm 3.

1 Input reference set , window size and number of bootstrap samples .
2 for : do
3        Sample, without replacement from , a reference window of size .
4        Let act as a corresponding mini data stream of length .
5Set .
6 for : do
7        Compute , the empirical -quantile of , where .
8        Set .
9        Set .
Output thresholds .
Algorithm 3 Calibrated and memoryless (CALM) threshold configuration

Note how the number of bootstrap samples remaining with which to compute quantile estimates gradually decreases such that approximately remain for the final estimate of . However, assuming an ERT of and recalling we can note

for all reasonable values of , and therefore the number of bootstrap samples remains on the same order of magnitude throughout the configuration process.

During configuration a single reference window should be sampled and then kept fixed throughout operation. The test statistic should be compared to for and to thereafter. We will later see that this leads to a detector which, in the absence of change, has post- runtimes that accurately target the desired geometric distribution, i.e. .

If desired, an adjustment can be made such that instead the absolute runtime , in the absence of change, targets the desired geometric distribution. In other words, we can start performing tests straight away at time rather than . This can be achieved by initialising the detector with a test window full of reference instances that weren’t sampled in the reference window . We think of this as prepending such instances to the datastream as observations . In order for the first test outcome to depend on the extremity of rather than that of we check that and resample if not. This allows us to compare to the conditional threshold . More generally should be compared to for and to thereafter.

4 Calm-Mmd

The approach to threshold setting described in the previous section applies to any test statistic that takes the form of a two-sample estimator of distance between underlying distributions, where is a without-replacement sample of size from the reference set of size . In this section we consider more specifically the case where is an estimator of the squared maximum mean discrepancy for some kernel .

The minimum variance unbiased estimator of this distance is given by the quadratic time estimator, which for samples from and from is given by


Computing this test statistic from scratch has a cost of , and therefore a naive implementation of Algorithm 3 during a configuration phase would have cost , which is problematic assuming both and are large. Moreover the cost during the operational phase would be per time step. However, thoughtful structuring of the computation can reduce the cost during configuration from to and during operation from to .

To keep subsequent notation compact we will allow the kernel to operate on collections of observations such that , for example, denotes the matrix with -th entry . Also, for a matrix , we use the shorthand notation to denote the sum of off-diagonal entries and to denote the sum of all entries.

4.1 Configuration

During configuration, for each bootstrap and time we are required to compute where . Let

denote the kernel matrix with its rows and columns permuted such that , and . Further define the submatrix of and of . We can then write


By noting the relation and amortizing the cost of initially computing across all bootstrap samples we can compute for all using only the submatrices and . This results in a cost per bootstrap sample of . In addition to the cost of computing , this results in a total configuration cost of .

4.2 Operation

During operation we have a fixed reference window and at each time step need to compute . Let

where , and . We can then write


We can compute just once during the configuration phase and therefore update the test statistic at cost during operation. Although we have focused on scaling with respect to rather than we additionally note that caching certain computations (similarly to Li et al. [2019]) allows to be computed from in time (rather than ) and to be computed from in time (rather than ).

5 Experiments

In this section we first compare, across a range of desired expected runtimes and pre-change and post-change distributions, the performance of the proposed change detectors with the MMD-based B-statistic detector of Li et al. [2019] and LSDD-Inc detector of Bu et al. [2017]. The main focus will be on comparing the degree to which detectors can be calibrated to operate with known expected behaviour, or in other words, how closely average runtimes (ARTs) match desired expected runtimes (ERTs). However we will additionally explore whether calibration comes at the expense of detection power, or in other words, the ability to respond to actual changes with short delays. Obtaining accurate estimates of average runtimes and delays requires a large number of runs and therefore smaller scale experiments, so we then demonstrate the scalability of the proposed method and suitability for problems of practical interest by exploring the performance on a medical imaging problem.

5.1 Calibration and Power Comparison

5.1.1 Datasets

For the comparisons we generated data synthetically from known pre-change and post-change distributions. We chose the 4 combinations described below and visualised in Figure 2 with the first two taken directly from Li et al. [2019] and the second two chosen to be similar in nature to D3-D6 in Bu et al. [2017] but adapted to be less Gaussian to provide diversity from the first two.

  • D1 (Gaussian mean shift): Pre-change distribution of and post-change distribution of , where 1

    is a vector of ones.

  • D2 (Gaussian covariance change): Pre-change distribution of and post-change distribution of with diagonal covariance matrix satisfying for and for .

  • D3 (Uniform square to uniform diamond): Pre-change distribution is the uniform distribution on the two dimensional square with coordinates

    . Post-change distribution is the uniform distribution on the diamond with coordinates , , and .

  • D4 (Hollowing of uniform square): Pre-change distribution is the uniform distribution on the two dimensional square with coordinates . Post-change distribution is the uniform distribution on the same square but with the inner square with coordinates removed.

Figure 2: Plots of samples from the pre-change () and post-change () distributions for the four problems D1-D4. For D1 and D2 the plots corresponds to the 2-dimensional equivalent of the 20-dimensional problem under consideration.

5.1.2 Experimental setup

The results we report are averages obtained over a large number of runs. Each run involved a reference set of size simulated from the pre-change distribution and a stream defined by Equation 1 for some change point . The detector was configured using window size with the aim of targeting some desired .

In order to allow a fair comparison to the B-statistic and LSDD-Inc detectors, which can’t test until the test window is full, we did not use the modification described at the end of Section

3.1 and focused on targeting a post- runtime of ERT. To obtain worst case average detection delays in the presence of change we used a change point of such that the detector is required to respond starting from a test window full of data from the pre-change distribution. To report average runtimes in the absence of change the post-change distribution was set to the pre-change distribution. The results are reported as averages over 100 configurations (each with different reference sets) and 500 runtimes per configuration, resulting in averages over 50000 runtimes.

For the B-statistic detector we configured thresholds as in the paper using the recommended skewness correction. For the LSDD-Inc detector, as discussed at the end of Section

2.4, we found that not accounting for correlation between test outcomes and estimating thresholds corresponding to a reference set of size with bootstraps using reference sets of size , resulted in average runtimes orders of magnitude greater than desired666Note that this is not necessarily unexpected behaviour as Bu et al. [2017] consider lower false positive rates to be desirable whereas we prefer them to be known and controllable.. In order for experiments to complete and comparable results to be obtained we used reference sets of size during the operational phase in order to correspond to the size of the bootstrapped reference sets. We compared these two detectors to CALM-MMD and CALM-LSDD using the same distance estimators but within the framework set out in Section 3

. For the MMD-based detectors we use the Gaussian radial basis function (RBF)

as the kernel where the bandwidth is taken to be the median of the pairwise distances between reference samples.

5.1.3 Results

Method Miscalibration Reduction
D1 & D2 D3 & D4 D1 D2 D3 D4
B-statistic 0.253 0.201 0.951 0.910 0.903 0.562
CALM-MMD 0.010 0.010 0.951 0.909 0.903 0.560
LSDD-Inc 0.202 0.187 0.936 0.866 0.888 0.522
CALM-LSDD 0.010 0.014 0.950 0.921 0.933 0.700
Table 1: Comparison of miscalibrations and reductions on problems D1-D4, averaged over expected runtimes of 128, 256, 512 and 1024. Miscalibration is defined as the relative difference between average and expected runtimes in the absence of change. Reduction is defined as the relative reduction in detection time that results from a change occurring.
(a) Gaussian pre-change distribution
(D1 & D2)
(b) Uniform pre-change distribution
(D3 & D4)
Figure 3: Desired expected runtimes (ERTs) plotted against actual average runtimes (ARTs) for each detector and pre-change distribution, with corresponding miscalibration metrics plotted beneath.

For this comparison we fixed the size of the reference set to , the window size to and used bootstrap samples to obtain quantile and skewness estimates. To compare calibration, for each detector we report the average runtime (ART) in the absence of change under configurations targeting ERTs of 128, 256, 512, and 1024 respectively. When plotted as a graph of ART vs ERT a well calibrated detector should lie close to the diagonal. We consider the relative error as a metric of miscalibration, where lower values represent better calibrated detectors.

Figure 3 visualises the miscalibration of the four detectors for both pre-change distributions and across the four specified ERTs, the average across which is recorded in Table 1. We see that the B-statistic and LSDD-Inc detectors typically operate at expected runtimes that deviate from the desired expected runtimes by between 15% and 25%. Moreover, the miscalibration varies depending on the problem and desired ERT. For the Gaussian pre-change distribution the miscalibration leads to average runtimes which are 20-25% higher than desired across a range of ERTs. However on the uniform square the B-statistic detector operated at around 30% above the desired level when an ERT of 128 is desired but around 30% below the desired level when an ERT of 1024 is desired. This suggests that practitioners deploying such a detector should be aware that the actual false positive rate could be anywhere between 30% below and 30% above the level they desired.

By contrast, the detectors configured using the CALM methodology invariably operate at ERTs within approximately 1% of the desired level, representing a 20 reduction in miscalibration. Moreover, by using the CALM methodology not only do we control the mean of the distribution of runtimes in the absence of change, but the entire distribution. The Q-Q plots in Figure 4 confirm that the distribution of runtimes is memoryless, with only a handful of the most extreme of the 50000 points lying away from the diagonal in each plot. By contrast using MMD within the framework of Li et al. [2019] results in detectors with some unknown runtime distribution in the absence of change where short runtimes are underrepresented and large runtimes overrepresented relative to the corresponding memoryless distribution. Note that the Q-Q plots are adjusted for the ARTs that are actually obtained such that deviation from the diagonal represents deviation from the best fitting geometric distribution, not that with mean equal to the ERT. Therefore we are not doubly penalising miscalibration of the mean.

Figure 4: Q-Q plots of runtimes obtained in the absence of change, with theoretical quantiles corresponding to the geometric (memoryless) distribution with mean equal to the sample average. Top row: MMD-based detector of Li et al. [2019]. Bottom row: CALM-MMD. Each plot contains 50000 runtimes.

We now consider whether this improved calibration comes at the cost of less powerful detectors. However, comparing the power of detectors operating at different ERTs is difficult. Merely comparing average detection delays (ADDs) favours detectors operating at lower ERTs and therefore miscalibrated detectors operating above the desired ERT get doubly penalised. In order to provide a fair comparison we performed additional experiments where we reconfigured the CALM-MMD and CALM-LSDD detectors to operate at the ERTs achieved by the corresponding detectors of Li et al. [2019] and Bu et al. [2017] respectively. We then recorded new average detection delays and can make direct comparisons between detectors using the same notion of distance and operating at the same ERT. We report as a notion of power the relative reduction in runtime that results when a change occurs, i.e. , such that a perfect detector would score a value of 1 and a useless detector would achieve a value of 0. Figure 5 allows, for each dataset D1-4, the power of the detectors to be compared. The average reduction across ERTs is recorded in Table 1.

At least for the problems considered, there is no discernible difference in power between the B-statistic detector and CALM-MMD, with average reductions matching to two decimal places on all four problems, as shown by the overlapping curves in Figure 5. This is despite the latter using (at linear time cost during operation) the quadratic time estimator of MMD. The difference between the LSDD-Inc detector and CALM-LSDD is much greater, which is unsurprising due to the fact that it was necessary to sample test windows of size from the reference data in order to obtain a detector for which the ERT could be controlled at all. Although we should be careful when comparing detectors operating at different ERTs, we also note that by considering the full curves in Figure 5 we can see that within the CALM framework using the LSDD estimator resulted in more powerful detectors that the MMD estimator for these problems.

Figure 5: Plots of average runtimes in the absence of change against relative reductions resulting from change (ART-ADD/ART), where higher reductions represent more powerful detectors.

5.2 Case Study

5.2.1 Dataset

We now demonstrate that the CALM methodology extends beyond simple toy problems to larger scale problems of real practical interest. We consider the Camelyon17-WILDS [Koh et al., 2020] dataset, adapted from Bandi et al. [2018] and pictured in Figure 7, of tissue scans taken from different hospitals for classification as either benign or cancerous. The idea is that in medical imaging it is often desirable to train models on data from one hospital/demographic and deploy the model more broadly across a wider pool of hospitals/demographics. As long as the data distribution remains the same as that which generated the training data, this is likely to be safe practice. However if new hospitals are introduced for which the distribution appears different then the performance of the model can no longer be relied upon. With respect to the Camelyon dataset, Koh et al. [2020] show that models trained on the training data, taken from three hospitals, achieve an average accuracy of 93.2% on unseen scans from the same three hospitals, but only 70.3% accuracy on scans in the test set, corresponding to a different hospital.

5.2.2 Experimental Setup

This time we used a larger reference set of size and a desired ERT of . We additionally scaled the window size to and the number of bootstrap samples to . To demonstrate scalability we used the more expensive CALM-MMD detector. Adhering to what we consider a recommended workflow in settings where the raw data is unstructured and of dimensionality likely far large than the intrinsic dimensionality of the data, we performed a preprocessing step to project the image patches onto a more structured lower dimensional representation.

To do this we trained an autoencoder to reconstruct the patches whilst passing them through a lower dimensional space of dimension

. For the encoder we used five convolutional layers, each separated by ReLU nonlinearities, which gradually reduce the spatial dimension from

to and increase the number of channels from to . The decoder is of symmetric form, mapping the 32 dimensional encoding vector back onto a image. Crucially, the autoencoder was trained using a split of the data which then no longer serves as part of the reference set. We split the data such that half (10000) of the instances were used to learn the representation and the other half were used for testing. The autoencoder was trained using the Adam optimizer with a learning rate of 0.001 on batches of size for epochs. Drift detection is then performed on the 32-dimensional vectors that result from passing the image patches through the trained encoder.

5.2.3 Results

The results are reported as averages over 150 configurations (each with different reference sets and autoencoders) and 100 runtimes per configuration, resulting in averages over 15000 runtimes. In the absence of change the average runtime was 4949, representing a miscalibration of 1.0%. The Q-Q plot in Figure 7 shows that this was achieved with runtimes following the desired memoryless distribution. By contrast the average detection delay in the presence of change was 52.8, representing a relative reduction of 98.9% from the ART in the absence of change.

Figure 6: Scans from four hospitals contained in the Camelyon17-WILDS dataset. The first three are from the hospitals included in the training data whereas the fourth is from a hospital not represented in the training data.
Figure 7: Q-Q plot of the runtimes obtained in the absence of change on the Camelyon17 dataset, with theoretical quantiles corresponding to the geometric distribution with mean .

6 Conclusion

In this paper we focused on the problem of automatically configuring thresholds of change detection algorithms such that, in the absence of change, the distribution of runtimes follows the geometric distribution with the desired mean. We described a metric agnostic method for achieving this property and show that when used with the two-sample estimators of MMD and LSDD adopted by recent works the resulting detectors can then operate with known behaviour without compromising on responsiveness to change.


  • Taori et al. [2020] Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33, 2020.
  • Ovadia et al. [2019] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua V Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. arXiv preprint arXiv:1906.02530, 2019.
  • Van Looveren et al. [2019] Arnaud Van Looveren, Giovanni Vacanti, Janis Klaise, Alexandru Coca, and Oliver Cobb.

    Alibi detect: Algorithms for outlier, adversarial and drift detection, 2019.

  • Pollak [1985] Moshe Pollak. Optimal detection of a change in distribution. The Annals of Statistics, pages 206–227, 1985.
  • Wald [1945] Abraham Wald. Sequential tests of statistical hypotheses. The annals of mathematical statistics, 16(2):117–186, 1945.
  • Lai [1995] Tze Leung Lai. Sequential changepoint detection in quality control and dynamical systems. Journal of the Royal Statistical Society: Series B (Methodological), 57(4):613–644, 1995.
  • Roberts [1966] SW Roberts. A comparison of some control chart procedures. Technometrics, 8(3):411–430, 1966.
  • Harris and Ross [1991] Thomas J Harris and William H Ross. Statistical process control procedures for correlated observations. The canadian journal of chemical engineering, 69(1):48–57, 1991.
  • Polunchenko et al. [2012] Aleksey S Polunchenko, Alexander G Tartakovsky, and Nitis Mukhopadhyay. Nearly optimal change-point detection with an application to cybersecurity. Sequential Analysis, 31(3):409–435, 2012.
  • Tartakovsky et al. [2006] Alexander G Tartakovsky, Boris L Rozovskii, Rudolf B Blažek, and Hongjoong Kim. Detection of intrusions in information systems by sequential change-point methods. Statistical methodology, 3(3):252–293, 2006.
  • Yu et al. [2020] Yi Yu, Oscar Hernan Madrid Padilla, Daren Wang, and Alessandro Rinaldo. A note on online change point detection. arXiv preprint arXiv:2006.03283, 2020.
  • Kirch and Stoehr [2019] Claudia Kirch and Christina Stoehr. Sequential change point tests based on u-statistics. arXiv preprint arXiv:1912.08580, 2019.
  • Page [1954] Ewan S Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954.
  • Kifer et al. [2004] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In VLDB, volume 4, pages 180–191. Toronto, Canada, 2004.
  • Bifet and Gavalda [2007] Albert Bifet and Ricard Gavalda. Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM international conference on data mining, pages 443–448. SIAM, 2007.
  • Faithfull et al. [2019] William J Faithfull, Juan J Rodríguez, and Ludmila I Kuncheva. Combining univariate approaches for ensemble change detection in multivariate data. Information Fusion, 45:202–214, 2019.
  • Bu et al. [2017] Li Bu, Dongbin Zhao, and Cesare Alippi. An incremental change detection test based on density difference estimation. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(10):2714–2726, 2017.
  • Li et al. [2019] Shuang Li, Yao Xie, Hanjun Dai, and Le Song. Scan b-statistic for kernel change-point detection. Sequential Analysis, 38(4):503–544, 2019.
  • Liu et al. [2018] Anjin Liu, Jie Lu, Feng Liu, and Guangquan Zhang. Accumulating regional density dissimilarity for concept drift detection in data streams. Pattern Recognition, 76:256–272, 2018.
  • Mozaffari and Yilmaz [2019] Mahsa Mozaffari and Yasin Yilmaz. Online multivariate anomaly detection and localization for high-dimensional settings. arXiv preprint arXiv:1905.07107, 2019.
  • Hinder et al. [2020] Fabian Hinder, André Artelt, and Barbara Hammer. Towards non-parametric drift detection via dynamic adapting window independence drift detection (dawidd). In International Conference on Machine Learning, pages 4249–4259. PMLR, 2020.
  • Chen et al. [2019] Hao Chen et al. Sequential change-point detection based on nearest neighbors. The Annals of Statistics, 47(3):1381–1407, 2019.
  • Kurt et al. [2020] Mehmet Necip Kurt, Yasin Yilmaz, and Xiaodong Wang.

    Real-time nonparametric anomaly detection in high-dimensional settings.

    IEEE transactions on pattern analysis and machine intelligence, 2020.
  • Dasu et al. [2006] Tamraparni Dasu, Shankar Krishnan, Suresh Venkatasubramanian, and Ke Yi. An information-theoretic approach to detecting changes in multi-dimensional data streams. In In Proc. Symp. on the Interface of Statistics, Computing Science, and Applications

    . Citeseer, 2006.

  • Gama et al. [2004] Joao Gama, Pedro Medas, Gladys Castillo, and Pedro Rodrigues. Learning with drift detection. In

    Brazilian symposium on artificial intelligence

    , pages 286–295. Springer, 2004.
  • Baena-Garcıa et al. [2006] Manuel Baena-Garcıa, José del Campo-Ávila, Raúl Fidalgo, Albert Bifet, R Gavalda, and R Morales-Bueno. Early drift detection method. In Fourth international workshop on knowledge discovery from data streams, volume 6, pages 77–86, 2006.
  • Lu et al. [2018] Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan Zhang. Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12):2346–2363, 2018.
  • Flynn and Yoo [2019] Thomas Flynn and Shinjae Yoo. Change detection with the kernel cumulative sum algorithm. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 6092–6099. IEEE, 2019.
  • Gretton et al. [2012] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
  • Verdier et al. [2008] Ghislain Verdier, Nadine Hilgert, and Jean-Pierre Vila. Adaptive threshold computation for cusum-type procedures in change detection and isolation problems. Computational Statistics & Data Analysis, 52(9):4161–4174, 2008.
  • Koh et al. [2020] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. arXiv preprint arXiv:2012.07421, 2020.
  • Bandi et al. [2018] Peter Bandi, Oscar Geessink, Quirine Manson, Marcory Van Dijk, Maschenka Balkenhol, Meyke Hermsen, Babak Ehteshami Bejnordi, Byungjae Lee, Kyunghyun Paeng, Aoxiao Zhong, et al. From detection of individual metastases to classification of lymph node status at the patient level: the camelyon17 challenge. IEEE transactions on medical imaging, 38(2):550–560, 2018.