Counterfactual Explanations of Concept Drift

by Fabian Hinder, et al.
Bielefeld University

The notion of concept drift refers to the phenomenon that the distribution underlying the observed data changes over time; as a consequence, machine learning models may become inaccurate and need adjustment. While methods exist to detect concept drift or to adjust models in the presence of observed drift, the question of explaining drift has hardly been considered so far. This problem is important, since an explanation enables an inspection of the most prominent features where drift manifests itself; hence it enables human understanding of the necessity of change and increases acceptance of life-long learning models. In this paper we present a novel technology which characterizes concept drift in terms of the characteristic change of spatial features represented by typical examples based on counterfactual explanations. We establish a formal definition of this problem, derive an efficient algorithmic solution based on counterfactual explanations, and demonstrate its usefulness in several examples.




1 Introduction

One fundamental assumption in classical machine learning is that observed data are i.i.d. according to some unknown probability distribution, i.e. the data generating process is stationary. Yet, this assumption is often violated in real-world problems: models are subject to seasonal changes, changed demands of individual customers, ageing of sensors, etc. In such settings, life-long model adaptation rather than classical batch learning is required. Since drift or covariate change is a major issue in real-world applications, many attempts were made to deal with this setting [8, 10].

Depending on the domain of data and application, the presence of drift is modelled in different ways. As an example, covariate shift refers to different marginal distributions of training and test set [17]. Learning for data streams extends this setting to an unlimited (but usually countable) stream of observed data, mostly in supervised learning scenarios [14, 31]. Here one distinguishes between virtual and real drift, i.e. non-stationarity of the marginal distribution only or also of the posterior. Learning technologies for such situations often rely on windowing techniques, and adapt the model based on the characteristics of the data in an observed time window. Active methods explicitly detect drift, while passive methods continuously adjust the model [10, 22, 25, 29].

Interestingly, a majority of approaches deals with supervised scenarios, aiming for a small interleaved train-test error; this is accompanied by first approaches to identify particularly relevant features where drift occurs [30], and a large number of methods aims for a detection of drift, an identification of change points in given data sets, or a characterization of overarching types of drift [1, 16]. However, none of those methods aims for an explanation of the observed drift by means of a characterization of the observed change in an intuitive way. Unlike the vast literature on explainability of AI models [7, 11, 15, 18], only few approaches address explainability in the context of drift. A first approach for explaining drift highlights the features with most variance [30]; yet this approach is restricted to an inspection of drift in single features. The purpose of our contribution is to provide a novel formalization of how to explain observed drift, such that an informed monitoring of the underlying process becomes possible. For this purpose, we characterize the underlying distribution in terms of typical representatives, and we describe drift by the evolution of these characteristic samples over time. Besides a formal mathematical characterization of this objective, we provide an efficient algorithm to describe the form of drift and we show its usefulness in benchmarks.

This paper is organized as follows: In the first part (sections 2 and 3) we describe the setup of our problem and give a formal definition (see Definitions 1 and 2). In section 3.1 we derive an efficient algorithm as a realization of the problem. In the second part we quantitatively evaluate the resulting algorithms and demonstrate their behavior in several benchmarks (see section 5).

2 Problem Setup

In the classical batch setup of machine learning one considers a generative process $P_X$, i.e. a probability measure, on a data space $\mathcal{X}$. In this context one views the realizations of i.i.d. random variables $X_1, \dots, X_n$ as samples. Depending on the objective, learning algorithms try to infer the data distribution based on these samples or, in the supervised setting, a posterior distribution. We will only consider distributions in general, thereby subsuming both real drift and virtual drift.

Many processes in real-world applications are online with data arriving consecutively as drawn from a possibly changing distribution, hence it is reasonable to incorporate temporal aspects. One prominent way to do so is to consider an index set $\mathcal{T}$, representing time, and a collection of probability measures $p_t$ on the data space $\mathcal{X}$ indexed over $\mathcal{T}$, which describe the underlying probability at time point $t$ and which may change over time [14]. In the following we investigate the relationship of those $p_t$. Drift refers to the fact that $p_t$ is different for different time points $t$, i.e. $\exists\, s, t \in \mathcal{T}: p_s \neq p_t$.

A relevant problem is to explain concept drift, i.e. to characterize the difference of those pairs $p_s, p_t$. A typical use case is the monitoring of processes. While drift detection technologies enable automatic drift identification [3, 4, 6, 9, 13, 26, 28], it is often unclear how to react to such drift, i.e. to decide whether a model change is due. This challenge is in general ill-posed and requires expert insight; hence an explanation would enable a human to initiate an appropriate reaction. A drift characterization is particularly demanding for high dimensional data or when clear semantic features are lacking.

In this contribution, we propose to describe the drift characteristics by contrasting suitable representatives of the underlying distributions [24, 27]. Intuitively, we identify critical samples of the system, and we monitor their development over time, such that the user can grasp the characteristic changes as induced by the observed drift. This leads to the overall algorithmic scheme:

  1. Choose characteristic samples that cover the set of observations / samples (over data and time).

  2. For each characteristic sample, find corresponding samples at the other time points, i.e. extend it to a time series of its corresponding points under drift.

  3. Present the evolution of each characteristic sample, or its most relevant changes, respectively, to the user.

In this intuitive form, however, this problem is still ill-posed. In the following, we formalize the notion of "characteristic points" of the distribution via optima of a characterizing function, and we define the problem of "correspondences" of samples within different time slices; these definitions will reflect our intuition and lead to efficient algorithmic solutions.

3 Characteristic Samples

To make the term "characteristic sample" formally tractable, we describe the process in terms of dependent random variables $X$ and $T$ representing data and time. This allows us to identify those values of $X$ that are "characteristic" for a given time $T$ and hence yields a notion of characteristic sample using information theoretic techniques. To start with, we restrict ourselves to the case of discrete time, which is a particularly natural choice in the context of data streams or time series [14]. Even for continuous time, it is possible to find a meaningful discretization induced by change points by applying drift detection methods [4, 9]. For simplicity, we assume finitely many time points, i.e. $\mathcal{T} = \{t_1, \dots, t_n\}$. This allows us to construct a pair of random variables $X$ and $T$, representing data and time respectively, which enable a reconstruction of the original distributions by the conditional distributions of $X$ given $T$, i.e. for $t \in \mathcal{T}$ it holds $P(X \in \cdot \mid T = t) = p_t$. This corresponds to the joint distribution

$$(X, T) \sim \frac{1}{n} \sum_{t \in \mathcal{T}} p_t \times \delta_t,$$

where $\delta_t$ denotes the Dirac measure concentrated at $t$ and $\times$ denotes the product measure. This notion has the side effect that, if we keep track of the associated time points of observations, i.e. we consider pairs $(x, t)$ rather than just $x$, we may consider observations as i.i.d. realizations of $(X, T)$. In particular, we may apply well-known analysis techniques from the batch setting.
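The construction above can be illustrated with a minimal sketch: samples from each time slice are stored together with their time index, so that the pooled pairs can be treated as one batch data set. The function name `pooled_samples`, the two Gaussian time slices and the equal slice sizes (standing in for a uniform distribution over time) are assumptions made for this illustration, not part of the original text.

```python
import random

def pooled_samples(samplers, n_per_slice, seed=0):
    """Draw n_per_slice points from each time slice and pool them as (x, t) pairs.

    Equal slice sizes mimic a uniform distribution over the time points.
    """
    rng = random.Random(seed)
    pooled = [(sampler(rng), t)
              for t, sampler in enumerate(samplers)
              for _ in range(n_per_slice)]
    rng.shuffle(pooled)  # order no longer matters once the time label is kept
    return pooled

# Hypothetical drifting process: the mean moves from 0 to 3 between two time points.
samplers = [lambda rng: rng.gauss(0.0, 1.0),
            lambda rng: rng.gauss(3.0, 1.0)]
data = pooled_samples(samplers, n_per_slice=500)
```

Each entry of `data` carries both the observation and its time of origin, which is what later allows the time to be predicted from the observation.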

(a) Schematic illustration of two one-dimensional distributions (density; red and blue graphs) and their characteristic regions (hatched boxes).
(b) Schematic illustration of two data sets (red crosses and blue X's) and their characteristic and similar regions (red and blue, resp. black, hatched).
Figure 1: Schematic illustration of characteristic samples.

The term "characteristic" refers to two properties: the likelihood of observing such samples at all, and the identifiability, which refers to the capability of identifying their origin, such as the generating latent variable, e.g. a certain point in time. We quantify the latter by means of entropy. We illustrate this behaviour in Figure 1. Here $X$, as defined above, is distributed according to a mixture model, where the origin is given by the corresponding mixture component. Each of those components corresponds to a certain time point $t$. Informally, we say that an observation or property identifies a certain time if it only occurs at this time point. Using Bayes' theorem we can characterize identifiability via $P(T = t \mid X = x)$ for a given $x$, i.e. the probability that a certain data point $x$ was observed at time $t$. Measuring identifiability in terms of the entropy, we obtain the following definition:

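Definition 1.

The identifiability function induced by $p_t$ is defined as

$$i(x) := 1 - \frac{1}{\log n} H\big(P(T = \cdot \mid X = x)\big),$$

where $P(T = \cdot \mid X = x)$ is induced by

$$P(T = t \mid X = x) = \frac{\mathrm{d} p_t}{\mathrm{d} \sum_{s \in \mathcal{T}} p_s}(x)$$

over the uniform distribution on $\mathcal{T}$, where $\frac{\mathrm{d} p_t}{\mathrm{d} \sum_{s} p_s}$ is the Radon-Nikodým density, $n$ is the number of time points, and $H$ denotes the entropy.

Obviously, $i$ has values in $[0, 1]$. The identifiability function indicates the existence of drift as follows: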
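A small numerical sketch may help to read Definition 1. It assumes that the identifiability function is one minus the entropy of the time posterior, normalized by the logarithm of the number of time points, with a uniform prior over time; the function names `posterior_from_densities` and `identifiability` are hypothetical and chosen for this example only.

```python
import math

def posterior_from_densities(densities_at_x):
    """Bayes' theorem with a uniform prior over time: normalise the densities p_t(x)."""
    total = sum(densities_at_x)
    return [d / total for d in densities_at_x]

def identifiability(posterior):
    """One minus the normalised entropy of a posterior over n >= 2 time points."""
    n = len(posterior)
    entropy = -sum(p * math.log(p) for p in posterior if p > 0.0)  # 0 log 0 := 0
    return 1.0 - entropy / math.log(n)

# A point that is equally likely at both times carries no information about time,
# a point that occurs at one time only identifies that time completely.
ambiguous = identifiability(posterior_from_densities([0.2, 0.2]))
unique = identifiability([1.0, 0.0])
partial = identifiability([0.9, 0.1])
```

Under these assumptions `ambiguous` is zero, `unique` is one, and `partial` lies strictly in between.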

Theorem 1.

$p_t$ has drift if and only if $\mathbb{E}[i(X)] \neq 0$.

Theorem 1 shows that the identifiability function $i$ captures important properties of $p_t$ regarding drift. It is important to notice that the identifiability function turns time characteristics into spatial properties: while drift is defined globally in the data space and locally in time, $i$ encodes drift locally in the data space and globally in time. This will allow us to localize drift, a phenomenon of time, in space, i.e. to point towards spatial regions where drift manifests itself. These regions can then serve as a starting point for an explanation of drift under the assumption that data points or features have a semantic meaning.
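As a worked illustration, consider the special case of two time points. The block below is a sketch under the assumption that the identifiability function has the normalized-entropy form $1 - H / \log n$ with a uniform prior over time; the symbols $p_1$, $p_2$ (densities of the two time slices) and $\pi$ are notation introduced for this example only.

```latex
\begin{align*}
\pi(x) &= P(T = t_1 \mid X = x) = \frac{p_1(x)}{p_1(x) + p_2(x)}, \\
i(x) &= 1 + \pi(x)\log_2 \pi(x) + \bigl(1 - \pi(x)\bigr)\log_2\bigl(1 - \pi(x)\bigr), \\
p_1 = p_2 \;&\Rightarrow\; \pi \equiv \tfrac{1}{2} \;\Rightarrow\; i \equiv 0
  \quad \text{(no drift, no region identifies a time)}, \\
\pi(x) \in \{0, 1\} \;&\Rightarrow\; i(x) = 1
  \quad \text{(with } 0 \log 0 := 0\text{)}, \\
\pi(x) = 0.9 \;&\Rightarrow\; i(x) \approx 1 - 0.469 = 0.531 .
\end{align*}
```

The first implication matches the "no drift" direction of Theorem 1: if the two distributions coincide, the posterior over time is uniform everywhere and the identifiability vanishes.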

The identifiability function per se, however, does not take the overall probability into account. Hence, unlikely samples can be considered identifying as long as they occur only at a single point in time. To overcome this problem, we extend the identifiability function $i$ to the characterizing function:

Definition 2.

Let $P_X$ denote the (density of the) marginal distribution of $X$. The characterizing function is defined as

$$c(x) := i(x) \cdot P_X(x).$$

We say that $x$ is a characteristic sample iff it is a (local) maximum of $c$.

In contrast to the identifiability function, the characterizing function also takes the likelihood of observing $x$ at any time into account. This reflects the idea that a characteristic example is not only particularly pronounced with respect to samples of another distribution, and hence identifiable, but also likely to be observed. We illustrate the behaviour of $i$ and $c$ in Figures 1 and 2. Obviously, $c$ finds exactly those regions which mark the presence of drift in the naive sense.
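The interplay of identifiability and likelihood can be sketched numerically. The example below assumes that the characterizing function is the product of the identifiability (normalized-entropy form, uniform prior over time) and the marginal density; the two unit-variance Gaussian time slices with means 0 and 3, the grid, and all function names are hypothetical choices for this illustration.

```python
import math

def gauss_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def identifiability(posterior):
    entropy = -sum(p * math.log(p) for p in posterior if p > 0.0)
    return 1.0 - entropy / math.log(len(posterior))

def characterizing(x, means):
    """Identifiability times marginal density for equally weighted Gaussian time slices."""
    dens = [gauss_pdf(x, m) for m in means]
    total = sum(dens)
    marginal = total / len(means)            # uniform weights over the time points
    posterior = [d / total for d in dens]    # Bayes' theorem with uniform prior
    return identifiability(posterior) * marginal

means = [0.0, 3.0]
grid = [k / 100.0 for k in range(-400, 701)]
values = [characterizing(x, means) for x in grid]

# Characteristic samples are local maxima of the characterizing function.
local_maxima = [grid[k] for k in range(1, len(grid) - 1)
                if values[k - 1] < values[k] > values[k + 1]]
```

In this toy setting the function vanishes at the midpoint between the two slices, where both times are equally likely, and has one local maximum on either side of it, i.e. one characteristic sample per time slice.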

Figure 2: Heat map of two distributions (also presented in Figure 1) and their identifiability and characterizing maps as defined in Definitions 1 and 2.

3.1 Find Characteristic Samples given Data

We are interested in an efficient algorithm which enables us to find characteristic samples from given data. Unlike classical function optimization, we face the problem that the characterizing function $c$ itself is unknown, and we cannot observe it directly. Rather, $c$ is given as a product of two components, each of which requires a different estimation scheme. We will rely on the strategy to estimate the identifiability function first. Then, we can reduce the problem to an estimation of a (weighted) density function, rather than estimating the identifiability function and the marginal density separately and then optimizing their product.

The problem of finding local maxima of a density function from given samples is a well studied problem, which can be addressed by prototype-based clustering algorithms such as mean shift [12], which identifies local maxima based on kernel density estimators. Efficient deterministic counterparts such as $k$-means often yield acceptable solutions [5]. Since the characterizing function constitutes a "weighted" density function rather than a pure one, we rely on a weighted version of a prototype-based clustering algorithm, which applies weighting/resampling of samples according to the estimated identifiability function. The following theorem shows that this procedure yields a valid estimate.

Theorem 2.

Let be a probability space, a measure space and a sequence of -valued, i.i.d. random variables. Let be a bounded, measurable map with . Denote by the weighted version of , where denotes the indicator function. For every let be a sequence of independent -valued random variables with (or iff all ) for all and . If is a Glivenko–Cantelli class of then it holds

in almost surely, where we can take the limits in arbitrary order.

Theorem 2 implies that samples obtained by resampling from the observations according to the estimated identifiability function are distributed according to the distribution associated with the characterizing function. This induces an obvious algorithmic scheme: apply prototype-based clustering to the reweighted samples. This is particularly beneficial since some algorithms, like mean shift, do not scale well with the number of samples. It remains to find a suitable method to estimate the identifiability function: we need to estimate the probability of a certain time given a data point. Since we consider discrete time, this can be modelled as a probabilistic classification problem which maps observations to a probability of the corresponding time, $x \mapsto P(T = t \mid X = x)$. Hence popular classification methods such as $k$-nearest neighbour, random forest, or Gaussian processes can be used. We will evaluate the suitability of these methods in section 5.
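The estimation scheme described in this section can be sketched end to end in a few lines. The following is a dependency-free toy version for one-dimensional data, not the implementation used in the experiments: the $k$-NN posterior, the resampling step and the plain Lloyd-style $k$-means are simplified stand-ins, the normalized-entropy form of the identifiability is assumed, and all function names and parameter values are hypothetical.

```python
import math
import random

def knn_posterior(x, data, n_times, k=15):
    """Estimate P(T = t | X = x) as label frequencies among the k nearest samples."""
    neighbours = sorted(data, key=lambda s: abs(s[0] - x))[:k]
    counts = [0] * n_times
    for _, t in neighbours:
        counts[t] += 1
    return [c / k for c in counts]

def identifiability(posterior):
    entropy = -sum(p * math.log(p) for p in posterior if p > 0.0)
    return 1.0 - entropy / math.log(len(posterior))

def resample_by_identifiability(data, n_times, n_draws, rng, k=15):
    """Resample observations with probability proportional to estimated identifiability."""
    weights = [max(0.0, identifiability(knn_posterior(x, data, n_times, k)))
               for x, _ in data]
    return rng.choices(data, weights=weights, k=n_draws)

def kmeans_1d(points, n_clusters, rng, n_iter=50):
    """Plain Lloyd iterations; the returned centres serve as prototypes."""
    centers = rng.sample(points, n_clusters)
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(n_clusters), key=lambda j: abs(p - centers[j]))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)

rng = random.Random(0)
# Hypothetical drift: the mean moves from 0 (time 0) to 3 (time 1).
data = ([(rng.gauss(0.0, 1.0), 0) for _ in range(300)]
        + [(rng.gauss(3.0, 1.0), 1) for _ in range(300)])
resampled = resample_by_identifiability(data, n_times=2, n_draws=400, rng=rng)
prototypes = kmeans_1d([x for x, _ in resampled], n_clusters=2, rng=rng)
```

Because points from the overlap region receive low weights, the resampled set concentrates on the regions that are specific to one time slice, and the prototypes end up on either side of the overlap.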


4 Explanation via Examples: Counterfactual Explanations

So far we have discussed the problem of finding characteristic samples, which can be modelled as probabilistic classification. This links the problem of explaining the difference between drifting distributions to the task of explaining machine learning models by means of examples. One particularly prominent explanation in this context is offered by counterfactual explanations: these contrast samples with counterparts that differ minimally in appearance but carry a different class label (see section 2). First, we briefly recapitulate counterfactual explanations.

4.1 Counterfactual Explanations

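Counterfactuals explain the decision of a model regarding a given sample by contrasting it with a similar one, which is classified differently.

Definition 3.

Let $h: \mathcal{X} \to \mathcal{Y}$ be a classifier, $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ a loss function, and $d: \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$ a dissimilarity. For a constant $C > 0$ and target class $y' \in \mathcal{Y}$ a counterfactual $x'$ for a sample $x \in \mathcal{X}$ is defined as

$$x' \in \arg\min_{z \in \mathcal{X}} \; \ell(h(z), y') + C \cdot d(z, x).$$

Hence a counterfactual $x'$ of a given sample $x$ is an element that is similar to $x$ but classified differently by $h$. Common choices for $d$ include $p$-norms or the Mahalanobis distance with a symmetric positive definite matrix.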
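A minimal sketch of this definition, restricted to a finite set of candidate points, is given below. It assumes the usual regularized form (loss towards the target class plus a weighted dissimilarity to the original sample); the zero-one loss, the squared Euclidean dissimilarity, the threshold classifier and the name `counterfactual` are choices made for this illustration only.

```python
def counterfactual(x, candidates, classify, target, C=1.0):
    """Pick the candidate minimising loss(classify(z), target) + C * d(z, x)."""
    def zero_one_loss(predicted, wanted):
        return 0.0 if predicted == wanted else 1.0

    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(candidates,
               key=lambda z: zero_one_loss(classify(z), target) + C * squared_distance(z, x))

# Hypothetical classifier: class 1 iff the first feature exceeds 1.5.
classify = lambda z: 1 if z[0] > 1.5 else 0

x = (0.0, 0.0)                                    # classified as 0
candidates = [(0.2, 0.0), (1.6, 0.1), (3.0, 0.0)]
x_cf = counterfactual(x, candidates, classify, target=1, C=0.1)
```

Here the nearest candidate keeps the original class and pays the full loss, the farthest one reaches the target class at a high distance cost, and the selected counterfactual is the closest candidate that is classified as the target class.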

As discussed in [2, 21], this initial definition suffers from the problem that counterfactuals might be implausible. To overcome this problem, the proposal [21] suggests to allow only those samples that lie on the data manifold. This can be achieved by enforcing a lower threshold on the probability (density) of counterfactuals.

In the work [2], the density is chosen as a mixture model and approximated such that the optimization problem becomes convex for a number of popular models.

Input: data stream
while  do
   if  then
      for all  do
      end for
      for all  do
      end for
      for all  do
      end for
      Present , to User
   end if
end while
Algorithm 1 Explaining drift

4.2 Explaining Drift by Means of Counterfactuals

In section 3.1 we connected the problem of identifying relevant information of observed drift to a probabilistic classification problem, mapping representative samples to their time of occurrence via the estimated posterior $P(T \mid X)$. This connection enables us to link the problem of understanding drift to the problem of explaining this mapping by counterfactuals. We identify characteristic samples as local optima of the characterizing function, as described above, and show how they contrast with similar points, as computed by counterfactuals, which are associated with a different time.

Since we are interested in an overall explanation of the ongoing drift, we can also restrict ourselves to finding counterfactuals within the set of given training samples, skipping the step of density estimation to generate reasonable counterfactuals. It is advisable to coordinate the assignment of subsequent counterfactuals by minimizing the overall costs induced by the similarity matrix. We refer to the resulting samples as associated samples.

The algorithmic scheme presented in section 2 gives rise to Algorithm 1. The explaining routine is run if drift was detected. Depending on the chosen sub-algorithms (we use the Hungarian method, a $k$-NN classifier, affinity propagation or $k$-means) we obtain a run time complexity of , with the number of samples and the number of displayed representative samples for the processing of a drift event. Since we therefore obtain a run time complexity of .
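The coordinated assignment of associated samples can be illustrated as follows. The sketch minimizes the total dissimilarity over all one-to-one assignments by brute force; this is a stand-in that returns the same optimum as the Hungarian method on small inputs but scales factorially, and the cost matrix as well as the name `assign_min_cost` are hypothetical.

```python
from itertools import permutations

def assign_min_cost(cost):
    """Return perm with perm[i] = candidate assigned to characteristic sample i.

    cost[i][j] is the dissimilarity between characteristic sample i and
    candidate j (square matrix). Brute force over all permutations.
    """
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(best)

cost = [[1.0, 2.0, 9.0],
        [1.5, 8.0, 9.0],
        [9.0, 9.0, 1.0]]
assignment = assign_min_cost(cost)
total = sum(cost[i][j] for i, j in enumerate(assignment))
```

A greedy choice would give the first sample its cheapest candidate (column 0) and force the second sample onto an expensive one; the coordinated assignment instead accepts a slightly worse match for the first sample to lower the overall cost.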

Table 1: MSE for the estimation of the identifiability function and final value of the optimization of the identifiability function using different models/methods. Column groups: Estimation (MSE) with $k$-NN and RF; Optimization (mean value) with $k$-M, AP and MS. The first three data sets are theoretical, encoding (dimension / complexity of distribution / complexity of component overlap). Estimation over 30 runs. Standard deviation is only shown if . All results and details are given in the supplement.

5 Experiments

In this section, we quantitatively evaluate the method. This includes the evaluation of the single components, and an application to realistic benchmark data sets.

5.1 Quantitative Evaluation

We evaluate the following components: estimation of the identifiability map, identification of characteristic samples, and plausibility of explanation for a known generating mechanism. To reduce the complexity, we restrict ourselves to two time points, since multiple time points can be dealt with by an iteration of this scenario. We evaluate the estimation capabilities of different machine learning models, namely $k$-nearest neighbour ($k$-NN), Gaussian process classification (GP with Matern kernel), artificial neural network (ANN, 1-layer MLP) and random forest (RF), and of prototype-based clustering algorithms, namely $k$-means ($k$-M), affinity propagation (AP) and mean shift (MS). We evaluate on both theoretical data with known ground truth generated by mixture distributions, as well as common benchmark data sets for regression and classification for more realistic data, where its occurrence is induced by the output component. We present a part of the results in Table 1. Details are in the supplemental material.

Evaluation of identifiability map

For the theoretical data, we evaluate how a) dimensionality, b) complexity of distribution, and c) complexity of component overlap influence the model performance. As it turns out, the overlap is a crucial parameter, regardless of the chosen model. Further, $k$-NN is the least vulnerable method with the best results; random forests perform second best. For the benchmark data sets we found that $k$-NN performs quite well in most cases and is very similar to the random forest. The Gaussian process only works well on the regression data sets.

Evaluation of characteristic samples

We compared different prototype-based clustering algorithms with regard to their ability to identify characteristic samples. We applied the resampling scheme from section 3.1 and also considered the weighted version of $k$-means as well as the standard version of $k$-means as baseline. It turns out that the resampling method performs best. Data parameters such as overlap and dimensionality have no significant influence. For the benchmark data sets we only evaluate the identifiability. We find that AP performs best, followed by $k$-means with resampling.

(a) Evolving distribution with known components per time slot. Distributions are given as mixtures of equally weighted uniform distributions.
(b) Visualization of the identified characteristic samples over 50 runs ($k$-NN & mean shift). Underlying distributions are shown in Figure 3.
(c) Performance of different instantiations of the algorithmic pipeline. Numbers refer to the mean percentage of misclassified cells.
Figure 3: Evaluation of the correct identification of spatial components provided by counterfactuals

Evaluation of explainability

We evaluate the explainability by measuring the capability to detect the vanishing of parts of the distribution. We generate a checkerboard data set (see Figure 3) and evaluate the explanations provided by the technology with regard to their capability to identify parts which vanish/appear in subsequent time windows (see Figure 3). A quantitative evaluation can be based on the number of incorrectly identified components, averaged over 30 runs, as shown in Figure 3, using random distributions and samples. GP combined with AP performs best.

5.2 Explanation of Drift Data Sets

We apply the technology ($k$-NN + $k$-means) to the electricity market benchmark data set [19], which is a well studied drift data set [30], and to a new data set derived from MNIST [20] by inducing drift in the occurrence of classes. To obtain an overall impression we use the dimensionality reduction technique UMAP [23] to project the data to the two dimensional space (Figures 4 and 5). The color displays the identifiability. The chosen characteristic samples, as well as the associated samples, are emphasized.

(a) Color represents origin (red/blue) and identifiability (saturation). Cards mark considered samples (black: characteristic sample; white: associated sample), projection via UMAP.
(b) Illustration of resulting pairs of samples: Columns (left to right): characteristic sample, associated sample and difference of samples, Rows: from old to new (upper row), from new to old (lower row).
Figure 4: Our method applied to electricity market data set with split at 2nd of May 1997.

Electricity market

The Electricity Market data set [19] describes electricity pricing in South-East Australia. It records price and demand in New South Wales and Victoria as well as the amount of power transferred between those states. All time related features have been cleaned. We try to explain the difference between the data set before and after the 2nd of May 1997, when a new national electricity market was introduced, which allows the trading of electricity between the states of New South Wales, Victoria, the Australian Capital Territory, and South Australia. Three new features were introduced to the data set (vicprice, vicdemand, transfer), old samples were extended by filling up with constant values. The data set consists of instances, with 5 features each. We randomly selected instances before and after the drift (which we consider to take place at the sample) to create the visualization shown in Figure 4.

As can be seen in Figure 4 only the last two features (vicdemand, transfer) are relevant for drift (see Figure 4 Diff. – white columns mean no drift in this feature). A further analysis showed that [30]. Furthermore it can be seen that the distribution of those attributes was extended as there exist samples after the drift comparable to those before the drift, but not the other way around (see Figure 4 Diff. – only is not white).

(a) Overview image created using UMAP, same colors / markers as in Fig. 4.
(b) Illustration of resulting pairs of digits: characteristic sample (Char. / top row), associated sample (Ass. / middle row) and difference (Diff. / lower row)
Figure 5: Our method applied to artificial data set created using MNIST data


MNIST

The data set consists of sample digits 1, 3, 4, 7, 8 from the MNIST data set. The digits 1 and 3 are present before the drift, the digits 7 and 8 after the drift. The digit 4 occurs before and after the drift alike. Each data point is a black-and-white image of a digit. We randomly selected a subset of those images (aligned as described above) to create the visualization shown in Figure 5.

As can be seen in Figure 5, only the digits 1, 3, 7, 8 are considered to be relevant to the drift. The blob of data points on the left side of Figure 5 that are marked as unidentifiable are indeed "4"-digits. Furthermore, we observe that there is some tendency to associate "1"- and "7"-digits and "3"- and "8"-digits, as can be seen in Figure 5.

6 Discussion and Further Work

We introduced a new method to formalize an explanation of drift observed in a distribution by means of characteristic examples, as quantified in terms of the identifiability function. We derived a new algorithm to estimate these characteristics and demonstrated its relation to intuitive notions of change as well as statistical notions, respectively. We demonstrated the behavior in several examples, and the empirical results demonstrate that this proposal constitutes a promising approach to explaining drift in an intuitive fashion. The technology is so far restricted to discrete time points with well defined change points or points of drift. An extension to continuous drift is the subject of ongoing work.


  • [1] S. Aminikhanghahi and D. J. Cook. A survey of methods for time series change point detection. Knowl. Inf. Syst., 51(2):339–367, May 2017.
  • [2] A. Artelt and B. Hammer. Convex density constraints for computing plausible counterfactual explanations, 2020.
  • [3] M. Baena-García, J. Campo-Ávila, R. Fidalgo-Merino, A. Bifet, R. Gavaldà, and R. Morales-Bueno. Early drift detection method. 2006.
  • [4] A. Bifet and R. Gavaldà. Learning from time-changing data with adaptive windowing. In Proceedings of the Seventh SIAM International Conference on Data Mining, April 26-28, 2007, Minneapolis, Minnesota, USA, pages 443–448, 2007.
  • [5] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.
  • [6] L. Bu, C. Alippi, and D. Zhao. A pdf-free change detection test based on density difference estimation. IEEE Transactions on Neural Networks and Learning Systems, 29(2):324–334, Feb 2018.
  • [7] R. M. J. Byrne. Counterfactuals in explainable artificial intelligence (XAI): Evidence from human reasoning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 6276–6282. International Joint Conferences on Artificial Intelligence Organization, 7 2019.
  • [8] R. F. de Mello, Y. Vaz, C. H. G. Ferreira, and A. Bifet. On learning guarantees to unsupervised concept drift detection on data streams. Expert Syst. Appl., 117:90–102, 2019.
  • [9] G. Ditzler and R. Polikar. Hellinger distance based drift detection for nonstationary environments. In 2011 IEEE Symposium on Computational Intelligence in Dynamic and Uncertain Environments, CIDUE 2011, Paris, France, April 13, 2011, pages 41–48, 2011.
  • [10] G. Ditzler, M. Roveri, C. Alippi, and R. Polikar. Learning in nonstationary environments: A survey. IEEE Comp. Int. Mag., 10(4):12–25, 2015.
  • [11] A.-K. Dombrowski, M. Alber, C. Anders, M. Ackermann, K.-R. Müller, and P. Kessel. Explanations can be manipulated and geometry is to blame. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 13589–13600. Curran Associates, Inc., 2019.
  • [12] K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Inf. Theory, 21:32–40, 1975.
  • [13] J. Gama, P. Medas, G. Castillo, and P. P. Rodrigues. Learning with drift detection. In Advances in Artificial Intelligence - SBIA 2004, 17th Brazilian Symposium on Artificial Intelligence, São Luis, Maranhão, Brazil, September 29 - October 1, 2004, Proceedings, pages 286–295, 2004.
  • [14] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia. A survey on concept drift adaptation. ACM Comput. Surv., 46(4):44:1–44:37, Mar. 2014.
  • [15] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal. Explaining explanations: An overview of interpretability of machine learning. In F. Bonchi, F. J. Provost, T. Eliassi-Rad, W. Wang, C. Cattuto, and R. Ghani, editors, 5th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2018, Turin, Italy, October 1-3, 2018, pages 80–89. IEEE, 2018.
  • [16] I. Goldenberg and G. I. Webb. Survey of distance measures for quantifying concept drift and shift in numeric data. Knowl. Inf. Syst., 60(2):591–615, 2019.
  • [17] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate shift and local learning by distribution matching, pages 131–160. MIT Press, Cambridge, MA, USA, 2009.
  • [18] D. Gunning, M. Stefik, J. Choi, T. Miller, S. Stumpf, and G.-Z. Yang. Xai—explainable artificial intelligence. Science Robotics, 4(37), 2019.
  • [19] M. Harries. Splice-2 comparative evaluation: Electricity pricing. Technical report, University of New South Wales, 1999.
  • [20] Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010.
  • [21] A. V. Looveren and J. Klaise. Interpretable counterfactual explanations guided by prototypes, 2019.
  • [22] V. Losing, B. Hammer, and H. Wersing. Tackling heterogeneous concept drift with the self-adjusting memory (SAM). Knowl. Inf. Syst., 54(1):171–201, 2018.
  • [23] L. McInnes, J. Healy, and J. Melville. Umap: Uniform manifold approximation and projection for dimension reduction, 2018.
  • [24] C. Molnar. Interpretable machine learning. 2019. URL https://christophm.github.io/interpretable-ml-book/.
  • [25] J. Montiel, J. Read, A. Bifet, and T. Abdessalem. Scikit-multiflow: A multi-output streaming framework. Journal of Machine Learning Research, 19(72):1–5, 2018.
  • [26] E. S. Page. Continuous inspection schemes. Biometrika, 41(1-2):100–115, 06 1954.
  • [27] S. Wachter, B. D. Mittelstadt, and C. Russell. Counterfactual explanations without opening the black box: Automated decisions and the gdpr. ArXiv, abs/1711.00399, 2017.
  • [28] A. Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117–186, 1945.
  • [29] S. Wang, L. L. Minku, N. V. Chawla, and X. Yao. Learning from data streams and class imbalance. Connect. Sci., 31(2):103–104, 2019.
  • [30] G. I. Webb, L. K. Lee, F. Petitjean, and B. Goethals. Understanding concept drift. CoRR, abs/1704.00362, 2017.
  • [31] D. Zambon, C. Alippi, and L. Livi. Concept drift and anomaly detection in graph streams. IEEE Trans. Neural Networks Learn. Syst., 29(11):5592–5605, 2018.

Appendix A Proofs

In this section we give complete proofs of the stated theorems. The numbering of the theorems coincides with that in the paper. The lemmas needed here do not appear in the paper itself and are numbered separately.

Lemma 1.

is a well-defined measurable map and holds -a.s..


Since is a Radon-Nikodým density, it is a well-defined map in .

Let us first show that is indeed a probability measure. To begin, notice that for it holds

where ! follows from the linearity of Radon-Nikodým densities. Furthermore, for two probability measures we have that

where the first statement follows from the fact that, if and then so that and hence . So by writing we see that is a probability measure on , so that we can speak of the entropy of .

Now let us show that is measurable. Since is measurable and is measurable, and since sums of measurable functions are measurable, it follows that and hence is measurable, too.

Now, since for every probability measure on it holds it follows that

Theorem 1.

It holds that has drift if and only if .


Suppose has no drift. Then

is a valid choice. Hence it follows that is the uniform distribution for all and hence since .

Suppose then holds -a.s. since a.s.. Now, since for any probability measure on it holds if and only if the uniform distribution on it follows that

where follows since is monotone (in the second case the null sets may depend on ) and follows from the uniqueness of Radon-Nikodým densities.∎
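The statement can be checked numerically on a toy case. The sketch below assumes a finite set of time points with a uniform prior, one-dimensional Gaussian densities per time point, and takes the identifiability of a point to be the normalized entropy gap 1 - H(P(T | X = x)) / log n. The function names, the Gaussian setting, and this normalization are assumptions made for illustration, not taken from the proof above.

```python
import math

def gaussian_pdf(x, mean, std=1.0):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def time_posterior(x, means):
    """P(T = t | X = x) under a uniform prior over the time points."""
    dens = [gaussian_pdf(x, m) for m in means]
    total = sum(dens)
    return [d / total for d in dens]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def identifiability(x, means):
    """Assumed form: 1 - H(P(T | X = x)) / log(number of time points)."""
    return 1.0 - entropy(time_posterior(x, means)) / math.log(len(means))

no_drift = [0.0, 0.0, 0.0]   # same distribution at every time point
drift = [0.0, 0.0, 3.0]      # the distribution shifts at the last time point
xs = [-2.0, 0.0, 1.5, 4.0]
ident_no_drift = [identifiability(x, no_drift) for x in xs]
ident_drift = [identifiability(x, drift) for x in xs]
```

Without drift the posterior over time is uniform at every point, so the assumed identifiability vanishes everywhere; with drift it becomes strictly positive, most visibly in the region that only the shifted distribution covers.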

Recall the following definition:

Definition 1.

Let be a measurable space. For a set we define a pseudonorm on the space of all finite measures

Let be a probability space and a sequence of -valued, i.i.d. random variables. We say that is a Glivenko–Cantelli class of iff
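The classical instance of a Glivenko–Cantelli class is the set of indicator functions of half-lines on the real line, for which the pseudonorm of the difference between empirical and true measure is the largest gap between the empirical and the true distribution function. The following sketch illustrates this for uniform samples on [0, 1]; the choice of class, distribution, and sample sizes is an assumption made purely for illustration.

```python
import random

def sup_deviation(samples):
    """sup_t |F_n(t) - t| for samples assumed to come from Uniform(0, 1).

    The supremum is attained at a sample point, just before or at the jump
    of the empirical distribution function F_n.
    """
    xs = sorted(samples)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(0)
deviations = {
    n: sup_deviation([rng.random() for _ in range(n)])
    for n in (100, 1000, 20000)
}
```

Inspecting `deviations` shows the uniform gap shrinking as the sample size grows, which is the behaviour the definition requires almost surely.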

Lemma 2.

Let be a probability space, a measure space and a sequence of -valued, i.i.d. random variables. Let be a bounded, measurable map with . Then for any set it holds

where denotes the Dirac measure concentrated at (we use the convention ).


Denote by

We hence may rewrite the statement as

Since for any we have that implies and implies we have that if there exists no such that we have that for all . On the other hand, if there exists a such that then the sequence and converges to by the law of large numbers, so that converges to and hence we see that a.s. since a.s.. ∎

Lemma 3.

Let be a probability space, a measure space and a sequence of -valued, i.i.d. random variables. Let be a bounded, measurable map. Denote by the weighted version of , where denotes the indicator function. If is a Glivenko–Cantelli class of then it holds


We will prove the statement using monotone class techniques. Let be the set of all functions such that


Clearly and by the triangle inequality it follows that if , then . Now, let be a bounded, increasing, pointwise converging sequence with .

Since is bounded, it is integrable for every finite measure (so in particular for or ), too. Therefore, by the dominated convergence theorem, it holds that for every we may find an such that for all it holds

so we see that by an -argument. We have shown that

is a monotone vector space. Once we have shown that for any

we have , the statement follows.

Without loss of generality we may assume . Denote by . Consider

By lemma 2 we see that the first summand converges to 0 almost surely. On the other hand, consider as a discrete stochastic process and fix its induced filtration. Define and a sequence of stopping times, so for every fixed we have that is the subsequence of that lies within . Since we have . Then is a sequence of i.i.d. random variables with distributed according to . Since it follows that the second summand converges to 0 almost surely. ∎

Theorem 2.

Let be a probability space, a measure space and a sequence of -valued, i.i.d. random variables. Let be a bounded, measurable map with . Denote by the weighted version of , where denotes the indicator function. For every let be a sequence of independent -valued random variables with (or iff all ) for all and . If is a Glivenko–Cantelli class of then it holds

in almost surely, where we can take the limits in arbitrary order.


Denote by the theoretical measure for a fixed set of observations and by . It holds

Notice that only the first summand depends on and . However, since we approximate a distribution on we see that by Kolmogorov’s theorem is uniformly bounded by .

By lemma 3 we see that a.s. and hence it remains to show that a.s.: Denote by , . We have that

Def. and

so the statement follows by lemma 2. ∎

Appendix B Experiments

In this section we give additional details on the evaluations and experiments. This includes a precise description of how the data was generated and how the experiments were run, as well as the results obtained and our interpretation of them.

B.1 Experimental setup

In this subsection we discuss how we generated our data and how we evaluated the results. For clarity, theoretical and benchmark data are treated in separate paragraphs.

Theoretical data

As discussed in the paper we were interested in understanding which of the following parameters is relevant for the quality of our prediction:

  1. Dimension (Dimensionality of data)

  2. Complexity of distribution (How complex/fractal/fine grained is )

  3. Complexity of overlap (How complex/fractal/fine grained are the regions where and have weight)

We used a mixture of Gaussians with uniformly distributed means and constant variance. We controlled the dimensionality in the obvious way (). We controlled the complexity of the distributions by the number of used Gaussians with equal degree of overlap (). We controlled the complexity of overlap by controlling the number of degrees of overlap ().
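A generator of this kind might look as follows. This is only a sketch: the box from which the means are drawn, the particular overlap degrees, the restriction to two time points, and all function and parameter names are assumptions introduced for illustration, since the exact construction is not reproduced in the text above.

```python
import random

def make_mixture(dim, n_levels, per_level, rng, box=10.0):
    """Build Gaussian components with means drawn uniformly from [0, box]^dim.

    Each component carries an assumed degree of overlap, encoded here as the
    probability that a sample from it is observed at time 1 rather than time 0.
    """
    levels = [(j + 0.5) / n_levels for j in range(n_levels)]  # assumed degrees
    components = []
    for level in levels:
        for _ in range(per_level):
            mean = [rng.uniform(0.0, box) for _ in range(dim)]
            components.append((mean, level))
    return components

def sample(components, n_samples, rng, std=1.0):
    """Draw (x, t) pairs: pick a component, then a point and a time point."""
    data = []
    for _ in range(n_samples):
        mean, level = rng.choice(components)
        x = [rng.gauss(m, std) for m in mean]   # constant variance std^2
        t = 1 if rng.random() < level else 0
        data.append((x, t))
    return data

rng = random.Random(0)
components = make_mixture(dim=2, n_levels=3, per_level=2, rng=rng)
dataset = sample(components, n_samples=500, rng=rng)
```

Here `dim` plays the role of the dimensionality, `per_level` the number of Gaussians sharing a degree of overlap, and `n_levels` the number of distinct degrees of overlap.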

We therefore obtain and

Notice that is a distribution on with . In this case and can be computed analytically given .

We generated 500 samples according to .

For the evaluation of the estimation of we trained our models on the data to solve the probabilistic classification task , i.e. the classification rule for a sample is given by . We evaluated the resulting models by estimating the MSE between the estimation (based on ) and the real using 1,500 samples distributed according to , 1,500 samples distributed according to a Gaussian mixture equivalent to except that we used and 1,500 samples distributed according to . We repeated the process 30 times for every considered combination of and report the mean and standard deviation.

Notice that the classifier is not trained on data which contains ! Instead it is trained to predict the time point given . Since we consider probabilistic models this allows us to use them to estimate , but the actual value of is never presented to the model.
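The evaluation idea can be sketched in a few lines. A simple k-nearest-neighbour vote stands in for the probabilistic models, the data is a two-time-point, one-dimensional Gaussian toy problem for which the posterior over time is known in closed form, and the identifiability is assumed to be 1 - H(p) / log 2. All of these choices, including the names and the value of k, are assumptions for illustration rather than the setup used in the experiments.

```python
import math
import random

def binary_identifiability(p1):
    """Assumed form: 1 - H(p) / log 2 for a two-point time distribution."""
    h = -sum(p * math.log(p) for p in (p1, 1.0 - p1) if p > 0.0)
    return 1.0 - h / math.log(2.0)

def knn_time_probability(x, train, k):
    """Estimate P(T = 1 | X = x) as the share of T = 1 among the k nearest samples."""
    nearest = sorted(train, key=lambda s: abs(s[0] - x))[:k]
    return sum(t for _, t in nearest) / k

def draw(n, rng):
    """Samples (x, t): t is a fair coin, x ~ N(0, 1) if t = 0 and N(2, 1) if t = 1."""
    out = []
    for _ in range(n):
        t = rng.randint(0, 1)
        out.append((rng.gauss(2.0 * t, 1.0), t))
    return out

rng = random.Random(0)
train = draw(1000, rng)                    # the model only ever sees (x, t)
test_x = [x for x, _ in draw(200, rng)]

squared_errors = []
for x in test_x:
    true_p1 = 1.0 / (1.0 + math.exp(-(2.0 * x - 2.0)))  # analytic posterior
    estimate = binary_identifiability(knn_time_probability(x, train, k=50))
    squared_errors.append((estimate - binary_identifiability(true_p1)) ** 2)
mse = sum(squared_errors) / len(squared_errors)
```

The training data contains only points and their time stamps; the identifiability enters solely when the estimate is compared with the analytic value.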

For the evaluation of the estimation of we applied the (modified) clustering methods to the generated samples. The ground truth value of was used by the methods. We evaluated the resulting models using the ground truth value of and . If a clustering produced more than one prototype, the mean value over all prototypes was taken as the achieved value for the maximization of and . We repeated the process 30 times for every considered combination of and report the mean and standard deviation.

Benchmark data

We considered both regression data () and classification data (). We processed the regression data as follows: We normalized the data sets' output, i.e. we have . For every sample we randomly generated an occurrence time with , i.e. the chance that is higher if the original prediction value is close to 0. Accordingly we computed the identifiability as . The new samples are given by .
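The regression preprocessing could be sketched as below. The exact probability and identifiability formulas are not reproduced in the text above, so the sketch assumes min-max normalization of the output to [0, 1], two time points with P(T = 0 | y) = 1 - y, and the identifiability 1 - H(y) / log 2 of the resulting Bernoulli distribution. The function names and the tiny data set are hypothetical.

```python
import math
import random

def normalize(values):
    """Min-max normalize the regression outputs to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def identifiability(p):
    """Assumed form: 1 - H(p) / log 2 for a Bernoulli(p) time distribution."""
    h = -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)
    return 1.0 - h / math.log(2.0)

def assign_times(xs, ys, rng):
    """Attach a random occurrence time and the matching identifiability.

    Time 0 is drawn with probability 1 - y, so it is more likely when the
    normalized output y is close to 0 (an assumed concrete choice).
    """
    processed = []
    for x, y in zip(xs, normalize(ys)):
        t = 0 if rng.random() < 1.0 - y else 1
        processed.append((x, t, identifiability(y)))
    return processed

rng = random.Random(0)
xs = [[0.1], [0.4], [0.9], [1.3]]     # hypothetical inputs
ys = [10.0, 12.5, 15.0, 20.0]         # hypothetical regression outputs
processed = assign_times(xs, ys, rng)
```

Samples whose normalized output is 0 or 1 receive a deterministic time and maximal identifiability, while an output of 0.5 yields a fair coin flip and identifiability 0.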

We processed the classification data as follows: For every class