1 Introduction
One fundamental assumption in classical machine learning is that the observed data are i.i.d. according to some unknown probability measure, i.e. the data generating process is stationary. Yet, this assumption is often violated in real-world problems: models are subject to seasonal changes, changing demands of individual customers, ageing of sensors, etc. In such settings, lifelong model adaptation rather than classical batch learning is required. Since drift or covariate change is a major issue in real-world applications, many attempts have been made to deal with this setting [8, 10]. Depending on the domain of data and application, the presence of drift is modelled in different ways. As an example, covariate shift refers to different marginal distributions of training and test set [17]. Learning for data streams extends this setting to an unlimited (but usually countable) stream of observed data, mostly in supervised learning scenarios
[14, 31]. Here one distinguishes between virtual and real drift, i.e. non-stationarity of the marginal distribution only or also of the posterior. Learning technologies for such situations often rely on windowing techniques, and adapt the model based on the characteristics of the data in an observed time window. Active methods explicitly detect drift, while passive methods continuously adjust the model [10, 22, 25, 29]. Interestingly, the majority of approaches deals with supervised scenarios, aiming for a small interleaved train-test error; this is accompanied by first approaches to identify particularly relevant features where drift occurs [30], and a large number of methods aims for a detection of drift, an identification of change points in given data sets, or a characterization of overarching types of drift [1, 16]. However, none of these methods aims to explain the observed drift by characterizing the observed change in an intuitive way. Unlike the vast literature on explainability of AI models [7, 11, 15, 18]
, only a few approaches address explainability in the context of drift. A first approach for explaining drift highlights the features with the most variance [30]; yet this approach is restricted to an inspection of drift in single features. The purpose of our contribution is to provide a novel formalization of how to explain observed drift, such that an informed monitoring of the underlying process becomes possible. For this purpose, we characterize the underlying distribution in terms of typical representatives, and we describe drift by the evolution of these characteristic samples over time. Besides a formal mathematical characterization of this objective, we provide an efficient algorithm to describe the form of drift, and we show its usefulness in benchmarks.

This paper is organized as follows: In the first part (sections 2 and 3) we describe the setup of our problem and give a formal definition (see Definitions 1 and 2). In section 3.1 we derive an efficient algorithm as a realization of the problem. In the second part we quantitatively evaluate the resulting algorithms and demonstrate their behavior in several benchmarks (see section 5).
2 Problem Setup
In the classical batch setup of machine learning one considers a generative process $\mathcal{D}$, i.e. a probability measure, on a data space $\mathcal{X}$. In this context one views the realizations $x_1, \dots, x_n$ of i.i.d. random variables $X_1, \dots, X_n \sim \mathcal{D}$ as samples. Depending on the objective, learning algorithms try to infer the data distribution based on these samples or, in the supervised setting, a posterior distribution. We will only consider distributions in general, this way subsuming the notion of both real drift and virtual drift.

Many processes in real-world applications are online, with data arriving consecutively as drawn from a possibly changing distribution; hence it is reasonable to incorporate temporal aspects. One prominent way to do so is to consider an index set $\mathcal{T}$, representing time, and a collection of probability measures $\mathcal{D}_t$ on $\mathcal{X}$ indexed over $t \in \mathcal{T}$, which describe the underlying probability at time point $t$ and which may change over time [14]. In the following we investigate the relationship of those $\mathcal{D}_t$. Drift refers to the fact that $\mathcal{D}_t$ is different for different time points, i.e.
$$\exists\, t_0, t_1 \in \mathcal{T}: \mathcal{D}_{t_0} \neq \mathcal{D}_{t_1}.$$
A relevant problem is to explain concept drift, i.e. to characterize the difference of such pairs $\mathcal{D}_{t_0}, \mathcal{D}_{t_1}$. A typical use case is the monitoring of processes. While drift detection technologies enable automatic drift identification [3, 4, 6, 9, 13, 26, 28], it is often unclear how to react to such drift, i.e. to decide whether a model change is due. This challenge is in general ill-posed and requires expert insight; hence an explanation would enable a human to initiate an appropriate reaction. A drift characterization is particularly demanding for high-dimensional data or in the absence of clear semantic features.
In this contribution, we propose to describe the drift characteristics by contrasting suitable representatives of the underlying distributions [24, 27]. Intuitively, we identify critical samples of the system, and we monitor their development over time, such that the user can grasp the characteristic changes as induced by the observed drift. This leads to the overall algorithmic scheme:

Choose characteristic samples that cover the set of observations / samples (over data and time).

For each characteristic sample, find a corresponding counterpart in every other time slice, i.e. extend the sample to a time series of its corresponding points under drift.

Present the evolution of these samples, or their most relevant changes, respectively, to the user.
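The three steps above can be sketched in code. The following is a minimal illustration only: the random pick in step 1 and the nearest-neighbour correspondence in step 2 are placeholders for the characterizing-function maxima and counterfactual correspondences developed later; the function name `explain_drift` is ours.

```python
import numpy as np

def explain_drift(X, t, n_representatives=3, rng=None):
    """Sketch of the three-step scheme: pick representatives of the first
    time slice and chain each one to its closest counterpart in every
    slice, yielding a short time series of corresponding points."""
    gen = np.random.default_rng(rng)
    times = np.unique(t)
    # Step 1: choose characteristic samples of the first slice.  A random
    # pick serves as a placeholder; the paper instead uses local maxima
    # of the characterizing function defined in section 3.
    X0 = X[t == times[0]]
    reps = X0[gen.choice(len(X0), size=n_representatives, replace=False)]
    # Step 2: extend each representative to corresponding points over time
    # (nearest neighbour stands in for the counterfactual correspondence
    # of section 4).
    series = []
    for r in reps:
        series.append([X[t == s][np.argmin(np.linalg.norm(X[t == s] - r, axis=1))]
                       for s in times])
    # Step 3: return the evolution for presentation to the user.
    return np.asarray(series)  # shape (n_representatives, n_times, dim)
```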
In this intuitive form, however, this problem is still ill-posed. In the following, we formalize the notion of "characteristic points" of the underlying distributions via optima of a characterizing function, and we define the problem of "correspondences" of samples within different time slices; these definitions will reflect our intuition and lead to efficient algorithmic solutions.
3 Characteristic Samples
To make the term "characteristic sample" formally tractable, we describe the process in terms of dependent random variables $X$ and $T$ representing data and time. This allows us to identify those values of $X$ that are "characteristic" for a given time and hence yields a notion of characteristic sample using information theoretic techniques. To start with, we restrict ourselves to the case of discrete time, which is a particularly natural choice in the context of data streams or time series [14]. Even for continuous time, it is possible to find a meaningful discretization induced by change points by applying drift detection methods [4, 9]. For simplicity, we assume finitely many time points, i.e. $\mathcal{T} = \{1, \dots, n\}$. This allows us to construct a pair of random variables $X$ and $T$, representing data and time respectively, which enable a reconstruction of the original distributions by the conditional distributions of $X$ given $T$, i.e. for $t \in \mathcal{T}$ it holds $\mathbb{P}_{X \mid T = t} = \mathcal{D}_t$. This corresponds to the joint distribution
$$\mathbb{P}_{(X,T)} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathcal{D}_t \otimes \delta_t,$$
where $\delta_t$ denotes the Dirac measure concentrated at $t$ and $\otimes$ denotes the product measure. This notion has the side effect that, if we keep track of the associated time points of observations, i.e. we consider pairs $(x_i, t_i)$ rather than just $x_i$, we may consider the observations as i.i.d. realizations of $(X, T)$. In particular, we may apply well-known analysis techniques from the batch setting.
The term "characteristic" refers to two properties: the likelihood of observing such samples at all, and the identifiability, i.e. the capability of identifying the origin of a sample, such as its generating latent variable, e.g. a certain point in time; the latter we quantify by means of entropy. We illustrate this behaviour in Figure 1. Here $X$, as defined above, is distributed according to a mixture model, where the origin is given by the corresponding mixture component. Each of those components corresponds to a certain time point. Informally, we say that an observation or property identifies a certain time if it only occurs during this time point. Using Bayes' theorem we can characterize identifiability via $\mathbb{P}(T = t \mid X = x)$ for a given $x$ – the probability that a certain data point was observed at time $t$. Measuring its identifiability in terms of the entropy, we obtain the following definition:

Definition 1.
The identifiability function induced by $(X, T)$ is defined as
$$ i(x) := 1 - \frac{H\left(\mathbb{P}_{T \mid X = x}\right)}{H\left(\mathcal{U}(\mathcal{T})\right)}, $$
where $\mathbb{P}_{T \mid X = x}$ is induced by the Radon–Nikodým density over the uniform distribution $\mathcal{U}(\mathcal{T})$ on $\mathcal{T}$, and $H$ denotes the entropy. Obviously, $i$ has values in $[0, 1]$. The identifiability function indicates the existence of drift as follows:
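For finitely many time points, the normalized-entropy form of the identifiability assumed above can be computed directly from the posterior probabilities of time given a data point. The following helper is an illustrative sketch (the function name is ours):

```python
import numpy as np

def identifiability(p_t_given_x):
    """Normalized-entropy identifiability for a single data point.

    p_t_given_x: posterior probabilities p(T = t | X = x) over the
    finitely many time points.  Returns a value in [0, 1]: 0 for the
    uniform posterior (x tells us nothing about time), 1 if x pins the
    time down completely.  The normalization by the entropy of the
    uniform distribution is an assumption made for illustration."""
    p = np.asarray(p_t_given_x, dtype=float)
    p = p / p.sum()
    nz = p[p > 0]                      # 0 * log 0 is taken to be 0
    entropy = -(nz * np.log(nz)).sum()
    return 1.0 - entropy / np.log(len(p))
```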
Theorem 1.
$(\mathcal{D}_t)_{t \in \mathcal{T}}$ has drift if and only if $i(X) \neq 0$ with positive probability.
Theorem 1 shows that $i$ captures important properties of $(\mathcal{D}_t)_{t \in \mathcal{T}}$ regarding drift. It is important to notice that the identifiability function turns time characteristics into spatial properties: while drift is defined globally in the data space and locally in time, $i$ encodes drift locally in the data space and globally in time. This will allow us to localize drift, a phenomenon of time, in space, i.e. to point towards spatial regions where drift manifests itself; these can then serve as a starting point for an explanation of drift under the assumption that data points or features have a semantic meaning.

The identifiability function per se, however, does not take the overall probability into account. Thus, unlikely samples can be considered as identifying as long as they occur only at a single point in time. To overcome this problem, we extend $i$ to the characterizing function:
Definition 2.
Let $p_X$ denote the (density of the) marginal distribution of $X$. The characterizing function is defined as
$$ c(x) := i(x) \cdot p_X(x). $$
We say that $x$ is a characteristic sample iff it is a (local) maximum of $c$.
In contrast to the identifiability function, the characterizing function also takes the likelihood of observing $x$ at any time into account. This reflects the idea that a characteristic example is not only particularly pronounced with respect to samples of another distribution, and hence identifiable, but also likely to be observed. We illustrate the behaviour of $i$ and $c$ in Figures 1 and 2. Obviously $c$ finds exactly those regions which mark the presence of drift in the naive sense.
3.1 Finding Characteristic Samples Given Data
We are interested in an efficient algorithm which enables us to find characteristic samples from given data. Unlike classical function optimization, we face the problem that $c$ itself is unknown and cannot be observed directly. Rather, $c$ is given as a product of two components, each of which requires a different estimation scheme. Our strategy is to estimate the identifiability function first. Then, we can reduce the problem to the estimation of a (weighted) density function, rather than estimating both components separately and then optimizing their product. Finding local maxima of a density function from given samples is a well-studied problem, which can be addressed by prototype-based clustering algorithms such as mean shift [12], which identifies local maxima based on kernel density estimators. Efficient deterministic counterparts such as $k$-means often yield acceptable solutions [5]. Since $c$ constitutes a "weighted" density function rather than a pure one, we rely on a weighted version of a prototype-based clustering algorithm, which applies weighting/resampling of the samples according to the estimated identifiability function. The following theorem shows that this procedure yields a valid estimate.

Theorem 2.
Let be a probability space, a measure space and a sequence of valued, i.i.d. random variables. Let be a bounded, measurable map with . Denote by the weighted version of , where denotes the indicator function. For every let be a sequence of independent valued random variables with (or iff all ) for all and . If is a Glivenko–Cantelli class of then it holds
in almost surely, where we can take the limits in arbitrary order.
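The procedure suggested by Theorem 2 can be sketched as follows: estimate the time posterior with a probabilistic classifier, convert its entropy into identifiability weights, resample accordingly, and cluster. The concrete model choices (random forest, $k$-means) and the function name are illustrative, not the paper's fixed pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

def characteristic_samples(X, t, k=3, rng=0):
    """Sketch: estimate p(T | X) with a probabilistic classifier, turn its
    entropy into an identifiability weight per sample, resample according
    to these weights, and cluster -- the cluster centres of the reweighted
    sample approximate maxima of the characterizing function."""
    clf = RandomForestClassifier(random_state=rng).fit(X, t)
    P = np.clip(clf.predict_proba(X), 1e-12, 1.0)  # estimate of p(T | X=x_i)
    H = -(P * np.log(P)).sum(axis=1)               # entropy per sample
    w = 1.0 - H / np.log(P.shape[1])               # identifiability weight
    w = (w + 1e-12) / (w + 1e-12).sum()
    gen = np.random.default_rng(rng)
    idx = gen.choice(len(X), size=len(X), p=w)     # resampling (Theorem 2)
    km = KMeans(n_clusters=k, n_init=10, random_state=rng).fit(X[idx])
    return km.cluster_centers_
```

Mean shift or affinity propagation can be substituted for `KMeans` without changing the resampling step.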
Theorem 2 implies that samples obtained by resampling the observations according to $i$ are distributed according to the distribution associated to $c$. This induces an obvious algorithmic scheme: apply prototype-based clustering to the reweighted samples. This is particularly beneficial since some algorithms, like mean shift, do not scale well with the number of samples. It remains to find a suitable method to estimate $i$: We need to estimate the probability of a certain time given a data point. Since we consider discrete time, this can be modeled as a probabilistic classification problem which maps observations to a probability distribution over the corresponding time. Hence popular classification methods such as $k$-nearest neighbour, random forest, or Gaussian processes can be used. We will evaluate the suitability of these methods in section
5.1.

4 Explanation via Examples: Counterfactual Explanations
So far we have discussed the problem of finding characteristic samples, which can be modelled as probabilistic classification. This links the problem of explaining the difference between drifting distributions to the task of explaining machine learning models by means of examples. One particularly prominent explanation in this context is offered by counterfactual explanations: these contrast samples with counterparts that show a minimum change of appearance but a different class label (see section 2). First, we briefly recapitulate counterfactual explanations.
4.1 Counterfactual Explanations
Counterfactuals explain the decision of a model on a given sample by contrasting it with a similar one which is classified differently [27]:

Definition 3.
Let $h$ be a classifier, $\ell$ a loss function, and $d$ a dissimilarity measure. For a constant $C > 0$ and target class $y'$, a counterfactual for a sample $x$ is defined as
$$ \operatorname*{arg\,min}_{x'} \; \ell\left(h(x'), y'\right) + C \cdot d(x', x). $$
Hence a counterfactual of a given sample $x$ is an element $x'$ that is similar to $x$ but classified differently by $h$. Common choices for $d$ include norms or the Mahalanobis distance $d(x', x) = (x' - x)^\top \Omega \, (x' - x)$ with $\Omega$ a symmetric positive definite matrix.
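A minimal sketch of a counterfactual search in the spirit of Definition 3 is given below. Instead of solving the optimization problem exactly, it restricts the search to a candidate pool (one cheap way to keep counterfactuals plausible, anticipating section 4.2); the function name is ours.

```python
import numpy as np

def counterfactual(x, predict, candidates, target, dissim=None):
    """Among all candidates the classifier assigns to the target class,
    return the one most similar to x; None if no candidate qualifies.
    `dissim` defaults to the Euclidean norm."""
    if dissim is None:
        dissim = lambda a, b: np.linalg.norm(a - b)
    pool = [c for c in candidates if predict(c) == target]
    if not pool:
        return None
    return min(pool, key=lambda c: dissim(x, c))
```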
As discussed in [2, 21], this initial definition suffers from the problem that counterfactuals might be implausible. To overcome this problem, the proposal [21] suggests to allow only those samples that lie on the data manifold. This can be achieved by enforcing a lower threshold on the (modelled) probability of the counterfactual, i.e. adding a constraint of the form $\hat{p}(x') \geq \delta$. In [2], $\hat{p}$ is chosen as a mixture model and approximated such that the optimization problem becomes convex for a number of popular models $h$.
4.2 Explaining Drift by Means of Counterfactuals
In section 3.1 we connected the problem of identifying relevant information of observed drift to a probabilistic classification problem, mapping representative samples to their time of occurrence. This connection enables us to link the problem of understanding drift to the problem of explaining this mapping by counterfactuals. We identify characteristic samples as local optima of the characterizing function, as described above, and show how they contrast with similar points, as computed by counterfactuals, which are associated with a different time.
Since we are interested in an overall explanation of the ongoing drift, we can restrict ourselves to finding counterfactuals within the set of given training samples, skipping the step of density estimation to generate reasonable counterfactuals. It is advisable to coordinate the assignment of subsequent counterfactuals by minimizing the overall costs induced by the similarity matrix; we refer to the resulting samples as associated samples.
The algorithmic scheme presented in section 2 gives rise to Algorithm 1. The explanation routine is run whenever drift is detected. Depending on the chosen sub-algorithms (we use the Hungarian method, a $k$-NN classifier, affinity propagation or $k$-means) we obtain a corresponding run time complexity for the processing of a drift event, with the number of samples and the number of displayed representative samples as the dominating quantities.
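The coordinated assignment of counterfactual counterparts can be sketched with SciPy's Hungarian-method solver; the helper name and the Euclidean cost are illustrative choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(samples_a, samples_b):
    """Pair samples of two time slices by minimizing the total cost of
    the pairwise dissimilarity matrix (Hungarian method), as suggested
    for computing associated samples in section 4.2."""
    cost = np.linalg.norm(samples_a[:, None, :] - samples_b[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))
```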
Table 1: Estimation of the identifiability function (MSE) and optimization of the characterizing function (mean value) for the data sets 2/2/2, 100/8/2, 2/2/10 (dimension / complexity of distribution / complexity of component overlap), diabetes, and faces, comparing $k$-NN, RF, $k$-means (KM), affinity propagation (AP), and mean shift (MS). Estimation over 30 runs; standard deviations and all numeric results are given in the supplement.

5 Experiments
In this section, we quantitatively evaluate the method. This includes the evaluation of the single components, and an application to realistic benchmark data sets.
5.1 Quantitative Evaluation
We evaluate the following components: estimation of the identifiability map, identification of characteristic samples, and plausibility of the explanation for a known generating mechanism. To reduce the complexity, we restrict ourselves to two time points, since multiple time points can be dealt with by an iteration of this scenario. We evaluate the estimation capabilities of different machine learning models – $k$-nearest neighbour ($k$-NN), Gaussian process classification (GP with Matern kernel), artificial neural network (ANN, 1-layer MLP) and random forest (RF) – and prototype-based clustering algorithms – $k$-means (KM), affinity propagation (AP) and mean shift (MS). We evaluate on both theoretical data with known ground truth generated by mixture distributions, as well as common benchmark data sets for regression and classification for more realistic data, where drift is induced via the output component. We present a part of the results in Table 1; details are given in the supplemental material.

Evaluation of the identifiability map
For the theoretical data, we evaluate how a) dimensionality, b) complexity of the distribution, and c) complexity of the component overlap influence the model performance. As it turns out, the overlap is a crucial parameter, regardless of the chosen model. Further, $k$-NN is the least vulnerable method with the best results; random forests perform second best. For the benchmark data sets we found that $k$-NN performs quite well in most cases and is very similar to the random forest. The Gaussian process only works well on the regression data sets.
Evaluation of characteristic samples
We compared different prototype-based clustering algorithms with regard to their ability to identify representatives of the drifting distribution. We applied the resampling scheme from section 3.1 and also considered the weighted version of $k$-means as well as the standard version of $k$-means as a baseline. It turns out that the resampling method performs best. Data parameters such as overlap and dimensionality have no significant influence. For the benchmark data sets we only evaluate the identifiability. We find that AP performs best, followed by $k$-means with resampling.
Evaluation of explainability
We evaluate the explainability by measuring the capability to detect the vanishing of parts of the distribution. We generate a checkerboard data set (see Figure 3) and evaluate the explanations as provided by the technology with regard to their capability to identify parts which vanish/appear in subsequent time windows (see Figure 3). A quantitative evaluation can be based on the number of incorrectly identified components, averaged over 30 runs, as shown in Figure 3, using random distributions and samples. GP combined with AP performs best.
5.2 Explanation of Drift Data Sets
We apply the technology ($k$-NN + $k$-means) to the electricity market benchmark data set [19], which is a well-studied drift data set [30], and to a new data set derived from MNIST [20] by inducing drift in the occurrence of classes. To obtain an overall impression we use the dimensionality reduction technique UMAP [23] to project the data to two-dimensional space (Figures 4 and 5). The color displays the identifiability. The chosen characteristic samples, as well as the associated samples, are emphasized.
Electricity market
The Electricity Market data set [19] describes electricity pricing in South-East Australia. It records price and demand in New South Wales and Victoria as well as the amount of power transferred between those states. All time-related features have been cleaned. We try to explain the difference between the data set before and after the 2nd of May 1997, when a new national electricity market was introduced, which allows the trading of electricity between the states of New South Wales, Victoria, the Australian Capital Territory, and South Australia. Three new features were introduced to the data set (vicprice, vicdemand, transfer); old samples were extended by filling up with constant values. The data set consists of instances with 5 features each. We randomly selected instances before and after the drift (which we consider to take place at the corresponding sample) to create the visualization shown in Figure 4.
As can be seen in Figure 4, only the last two features (vicdemand, transfer) are relevant for the drift (see Figure 4 Diff. – white columns mean no drift in this feature); a further analysis can be found in [30]. Furthermore, it can be seen that the distribution of those attributes was extended, as there exist samples after the drift comparable to those before the drift, but not the other way around (see Figure 4 Diff.).
MNIST
The data set consists of sample digits 1, 3, 4, 7, 8 from the MNIST data set. The digits 1 and 3 are present before the drift, the digits 7 and 8 after the drift; the digit 4 occurs before and after the drift alike. Each data point consists of a 28×28 pixel black-and-white image of a digit. We randomly selected a subset of those images (aligned as described above) to create the visualization shown in Figure 5.
As can be seen in Figure 5, only the digits 1, 3, 7, 8 are considered to be relevant to the drift. The blob of data points on the left side of Figure 5, which are marked as unidentifiable, are indeed "4" digits. Furthermore, we observe that there is some tendency to associate "1" and "7" digits as well as "3" and "8" digits, as can be seen in Figure 5.
6 Discussion and Further Work
We introduced a new method to formalize an explanation of drift observed in a distribution by means of characteristic examples, as quantified in terms of the identifiability function. We derived a new algorithm to estimate these characteristics and demonstrated their relation to intuitive notions of change as well as to statistical notions. We demonstrated the behavior in several examples, and the empirical results show that this proposal constitutes a promising approach to explaining drift in an intuitive fashion. The technology is as yet restricted to discrete time points with well-defined change points or points of drift. An extension to continuous drift is the subject of ongoing work.
References
 [1] S. Aminikhanghahi and D. J. Cook. A survey of methods for time series change point detection. Knowl. Inf. Syst., 51(2):339–367, May 2017.
 [2] A. Artelt and B. Hammer. Convex density constraints for computing plausible counterfactual explanations, 2020.
 [3] M. Baena-García, J. Campo-Ávila, R. Fidalgo-Merino, A. Bifet, R. Gavaldà, and R. Morales-Bueno. Early drift detection method. 2006.
 [4] A. Bifet and R. Gavaldà. Learning from time-changing data with adaptive windowing. In Proceedings of the Seventh SIAM International Conference on Data Mining, April 26-28, 2007, Minneapolis, Minnesota, USA, pages 443–448, 2007.
 [5] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.
 [6] L. Bu, C. Alippi, and D. Zhao. A pdf-free change detection test based on density difference estimation. IEEE Transactions on Neural Networks and Learning Systems, 29(2):324–334, Feb 2018.

 [7] R. M. J. Byrne. Counterfactuals in explainable artificial intelligence (XAI): Evidence from human reasoning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 6276–6282. International Joint Conferences on Artificial Intelligence Organization, 2019.
 [8] R. F. de Mello, Y. Vaz, C. H. G. Ferreira, and A. Bifet. On learning guarantees to unsupervised concept drift detection on data streams. Expert Syst. Appl., 117:90–102, 2019.
 [9] G. Ditzler and R. Polikar. Hellinger distance based drift detection for nonstationary environments. In 2011 IEEE Symposium on Computational Intelligence in Dynamic and Uncertain Environments, CIDUE 2011, Paris, France, April 13, 2011, pages 41–48, 2011.
 [10] G. Ditzler, M. Roveri, C. Alippi, and R. Polikar. Learning in nonstationary environments: A survey. IEEE Comp. Int. Mag., 10(4):12–25, 2015.
 [11] A.K. Dombrowski, M. Alber, C. Anders, M. Ackermann, K.R. Müller, and P. Kessel. Explanations can be manipulated and geometry is to blame. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'AlchéBuc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 13589–13600. Curran Associates, Inc., 2019.
 [12] K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Inf. Theory, 21:32–40, 1975.
 [13] J. Gama, P. Medas, G. Castillo, and P. P. Rodrigues. Learning with drift detection. In Advances in Artificial Intelligence  SBIA 2004, 17th Brazilian Symposium on Artificial Intelligence, São Luis, Maranhão, Brazil, September 29  October 1, 2004, Proceedings, pages 286–295, 2004.
 [14] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia. A survey on concept drift adaptation. ACM Comput. Surv., 46(4):44:1–44:37, Mar. 2014.
 [15] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal. Explaining explanations: An overview of interpretability of machine learning. In F. Bonchi, F. J. Provost, T. Eliassi-Rad, W. Wang, C. Cattuto, and R. Ghani, editors, 5th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2018, Turin, Italy, October 1-3, 2018, pages 80–89. IEEE, 2018.
 [16] I. Goldenberg and G. I. Webb. Survey of distance measures for quantifying concept drift and shift in numeric data. Knowl. Inf. Syst., 60(2):591–615, 2019.
 [17] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate shift and local learning by distribution matching, pages 131–160. MIT Press, Cambridge, MA, USA, 2009.
 [18] D. Gunning, M. Stefik, J. Choi, T. Miller, S. Stumpf, and G.-Z. Yang. XAI—Explainable artificial intelligence. Science Robotics, 4(37), 2019.
 [19] M. Harries. SPLICE-2 comparative evaluation: Electricity pricing. Technical report, University of New South Wales, 1999.
 [20] Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010.
 [21] A. V. Looveren and J. Klaise. Interpretable counterfactual explanations guided by prototypes, 2019.
 [22] V. Losing, B. Hammer, and H. Wersing. Tackling heterogeneous concept drift with the selfadjusting memory (SAM). Knowl. Inf. Syst., 54(1):171–201, 2018.
 [23] L. McInnes, J. Healy, and J. Melville. Umap: Uniform manifold approximation and projection for dimension reduction, 2018.
 [24] C. Molnar. Interpretable Machine Learning. 2019. URL: https://christophm.github.io/interpretable-ml-book/.
 [25] J. Montiel, J. Read, A. Bifet, and T. Abdessalem. Scikit-multiflow: A multi-output streaming framework. Journal of Machine Learning Research, 19(72):1–5, 2018.
 [26] E. S. Page. Continuous inspection schemes. Biometrika, 41(1-2):100–115, 1954.
 [27] S. Wachter, B. D. Mittelstadt, and C. Russell. Counterfactual explanations without opening the black box: Automated decisions and the gdpr. ArXiv, abs/1711.00399, 2017.
 [28] A. Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117–186, 1945.
 [29] S. Wang, L. L. Minku, N. V. Chawla, and X. Yao. Learning from data streams and class imbalance. Connect. Sci., 31(2):103–104, 2019.
 [30] G. I. Webb, L. K. Lee, F. Petitjean, and B. Goethals. Understanding concept drift. CoRR, abs/1704.00362, 2017.

 [31] D. Zambon, C. Alippi, and L. Livi. Concept drift and anomaly detection in graph streams. IEEE Trans. Neural Networks Learn. Syst., 29(11):5592–5605, 2018.
Appendix A Proofs
In this section we give complete proofs of the stated theorems. The numbering of the theorems coincides with the one given in the paper. The required lemmas are not contained in the paper itself and follow a separate numbering.
Lemma 1.
is a welldefined measurable map and holds a.s..
Proof.
Since is a Radon–Nikodým density, it is a well-defined map in .
Let us start by showing that is a probability measure, indeed. To start with notice that for it holds
where (!) follows by the linearity of Radon–Nikodým densities. Furthermore, for two probability measures we have that
where the first statement follows from the fact that, if and then so that and hence . So by writing we see that is a probability measure on , so that we can speak of the entropy of .
Now let us show that is measurable. Since is measurable and is measurable, and since the sum of measurable functions is measurable, it follows that , and hence , is measurable, too.
Now, since for all probability measure on it holds it follows that
∎
Theorem 1.
It holds that has drift if and only if .
Proof.
Suppose has no drift; then it holds that
is a valid choice. Hence it follows that is the uniform distribution for all and hence since .
Suppose ; then holds a.s. since a.s.. Now, since for any probability measure on it holds if and only if it is the uniform distribution, it follows that
where follows since is monotone (in the second case the null sets may depend on ) and follows from the uniqueness of Radon–Nikodým densities. ∎
Recall the following definition:
Definition 1.
Let be a measurable space. For a set we define a pseudo-norm on the space of all finite measures
Let be a probability space and a sequence of valued, i.i.d. random variables. We say that is a Glivenko–Cantelli class of iff
Lemma 2.
Let be a probability space, a measure space and a sequence of valued, i.i.d. random variables. Let be a bounded, measurable map with . Then for any set it holds
where denotes the Dirac measure concentrated at (we use the convention ).
Proof.
Denote by
We hence may rewrite the statement as
Since for any we have that implies and implies , we see that if there exists no such that , then for all ; on the other hand, if there exists such a , then the sequence converges by the law of large numbers, so that converges, and hence a.s., since a.s.. ∎
Lemma 3.
Let be a probability space, a measure space and a sequence of valued, i.i.d. random variables. Let be a bounded, measurable map. Denote by the weighted version of , where denotes the indicator function. If is a Glivenko–Cantelli class of then it holds
Proof.
We will prove the statement using monotonous class techniques. Let be the set of all functions such that
(1) 
Clearly , and by the triangle inequality it follows that if , then . Now, let be a bounded, increasing, pointwise converging sequence with .
Since is bounded, it is integrable for every finite measure (so in particular for or ), too. Therefore, by the dominated convergence theorem, it holds that for every we may find an such that for all it holds
so we see that by an argument. We have shown that
is a monotone vector space; once we have shown that for any
we have , the statement follows. W.l.o.g. we may assume . Denote by . Consider
By Lemma 2 we see that the first summand converges to 0 almost surely. On the other hand, consider as a discrete stochastic process and fix its induced filtration. Define and a sequence of stopping times, so that for every fixed we have that is the subsequence of that lies within . Since , we have . Then is a sequence of i.i.d. random variables with distributed according to . Since , it follows that the second summand converges to 0 almost surely. ∎
Theorem 2.
Let be a probability space, a measure space and a sequence of valued, i.i.d. random variables. Let be a bounded, measurable map with . Denote by the weighted version of , where denotes the indicator function. For every let be a sequence of independent valued random variables with (or iff all ) for all and . If is a Glivenko–Cantelli class of then it holds
in almost surely, where we can take the limits in arbitrary order.
Proof.
Denote by the theoretical measure for a fixed set of observations and by . It holds
Notice that only the first summand depends on and . However, since we approximate a distribution on , we see by Kolmogorov's theorem that it is uniformly bounded by .
Appendix B Experiments
In this section we will give additional details on the evaluations and experiments. This includes a precise setup of how the used data was generated and how the experiments were run as well as the obtained results/measurements and our interpretations.
B.1 Experimental setup
In this subsection we discuss how we generated our data and how we evaluated the results. For simplicity, we use separate paragraphs for theoretical and benchmark data.
Theoretical data
As discussed in the paper we were interested in understanding which of the following parameters is relevant for the quality of our prediction:

Dimension (Dimensionality of data)

Complexity of distribution (How complex/fractal/fine grained is )

Complexity of overlap (How complex/fractal/fine grained are the regions where and have weight)
We used a mixture of Gaussians with uniformly distributed means and constant variance. We controlled the dimensionality in the obvious way (). We controlled the complexity of the distributions by the number of used Gaussians with equal degree of overlap (). We controlled the complexity of overlap by controlling the number of degrees of overlap ().
We therefore obtain and
Notice that is a distribution on with . In this case and can be computed analytically given .
We generated 500 samples according to .
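A hedged sketch of this generation scheme is given below: two Gaussian mixtures with uniformly drawn means and constant variance, where a number of shared components controls the degree of overlap. The concrete parameter values and the function name are illustrative, not the paper's exact configuration.

```python
import numpy as np

def make_drift_mixture(dim, n_components, n_shared, n_samples, rng=0):
    """Generate two time slices from Gaussian mixtures with uniformly
    distributed means and unit variance; `n_shared` components are
    common to both slices and control the component overlap."""
    gen = np.random.default_rng(rng)
    means = gen.uniform(-5, 5, size=(2 * n_components - n_shared, dim))
    m0 = means[:n_components]                 # components of the first slice
    m1 = means[n_components - n_shared:]      # second slice, sharing n_shared
    X, t = [], []
    for label, m in ((0, m0), (1, m1)):
        comp = gen.integers(0, len(m), size=n_samples)
        X.append(m[comp] + gen.normal(0.0, 1.0, size=(n_samples, dim)))
        t.append(np.full(n_samples, label))
    return np.vstack(X), np.concatenate(t)
```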
For the evaluation of the estimation of we trained our models on the data to solve the probabilistic classification task , i.e. the classification rule for a sample is given by . We evaluated the resulting models by estimating the MSE between the estimate (based on ) and the real using 1,500 samples distributed according to , 1,500 samples distributed according to a Gaussian mixture equivalent to except that we used , and 1,500 samples distributed according to . We repeated the process 30 times for every considered combination and report the mean value and standard deviation.
Notice that the classifier is not trained on data which contains ! Instead it is trained to predict the time point given . Since we consider probabilistic models this allows us to use them to estimate , but the actual value of is never presented to the model.
For the evaluation of the estimation of we applied the (modified) clustering methods to the generated samples. The ground truth value of was used by the methods. We evaluated the resulting models using the ground truth values of and . If a clustering produced more than one prototype, the mean value over all prototypes was considered as the achieved value for the maximization of and . We repeated the process 30 times for every considered combination and report the mean value and standard deviation.
Benchmark data
We considered both regression data () and classification data (). We processed the regression data as follows: We normalized the data set's output, i.e. we have . For every sample we randomly generated an occurrence time with , i.e. the chance that is higher if the original prediction value is close to 0. Accordingly, we computed the identifiability as . The new samples are given by .
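The regression-to-drift processing can be sketched as follows. The concrete link between the normalized output and the occurrence-time probability is not recoverable from the text above; the logistic link in the absolute normalized output used here is purely an illustrative assumption, as is the function name.

```python
import numpy as np

def regression_to_drift(X, y, rng=0):
    """Turn a regression data set into a two-time-point drift data set:
    the normalized output drives a random occurrence time.  The link
    p(T=1 | y) below is an assumed placeholder, chosen so that samples
    with output close to 0 tend to receive time 0."""
    gen = np.random.default_rng(rng)
    y_norm = (y - y.mean()) / y.std()              # normalize the output
    p1 = 1.0 / (1.0 + np.exp(-np.abs(y_norm)))     # assumed, illustrative link
    t = (gen.uniform(size=len(y)) < p1).astype(int)
    return X, t, p1
```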
We processed the classification data as follows: For every class