This paper discusses testing the exchangeability of a given random sequence using a martingale-based approach. The results can be applied to detecting change-points in streaming data, covering both abrupt changes and concept drift, i.e. changes that happen in a smooth and incremental manner.
In real world applications, systems often exhibit complex behaviour and the data distribution is unknown. The martingale based approach is well suited in these scenarios for change-point detection since it does not rely on any distributional knowledge about the data, which is in contrast to the traditional sequential change-point detection methods such as Sequential Probability Ratio Test [Wald1945], and the CUmulative SUM control chart [Page1954].
The idea of using martingales for exchangeability testing dates back to [Vovk et al.2003], building on the theory of the Transductive Confidence Machine [Vovk et al.2005], where the concept of ’exchangeability martingales’ was introduced to implement a test that works in an online manner. Later, it was established in [Fedorova et al.2012] that, to maximize the logarithmic growth rate of the multiplicative martingale, the betting function used for constructing the martingale should be chosen as the empirical probability density function (p.d.f.) of the p-values. The construction of this empirical p.d.f. suggested therein uses a modified kernel density estimator. In [Ho and Wechsler2005] and [Ho2005], the authors applied various concentration inequalities to the multiplicative martingale sequence to design tests for detecting changes in data streams. However, due to the high variability of the multiplicative martingale sequence, it is hard to design tests based on concentration inequalities. In addition, the multiplicative martingale values in the log scale exhibit an undesirable decreasing trend even when no change happens (this is illustrated by an example in the Evaluation section).
To address these shortcomings, we propose a new type of martingale, which we call the additive martingale, for developing the exchangeability test; it can also be implemented in an online fashion. Different betting functions for constructing the additive martingale are discussed as well. Interestingly, similar to the multiplicative martingale case, it is shown that choosing the betting function as the underlying p.d.f. of the p-values gives a satisfactory balance between the smoothness of the martingale sequence and its expected one-step increment when a change-point appears (in which case the generated p-values are no longer uniformly distributed). Based on a Beta distribution parametrization, a computationally efficient way of constructing this betting function is discussed. We also discuss how to design tests for change-point detection based on different concentration inequalities.
Definition 1 (Martingale).
A sequence of random variables $\{M_n\}_{n \ge 0}$ is a martingale if, for any $n \ge 0$, it satisfies $E[|M_n|] < \infty$ and
$$E[M_{n+1} \mid M_0, \dots, M_n] = M_n.$$
Definition 2 (Exchangeability).
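A set of random variables $\{Z_1, \dots, Z_n\}$ is exchangeable if it holds that
$$P(Z_1, \dots, Z_n) = P(Z_{\pi(1)}, \dots, Z_{\pi(n)}),$$
in which $\pi$ denotes any permutation of $\{1, \dots, n\}$. A series of random variables $Z_1, Z_2, \dots$ is exchangeable if $\{Z_1, \dots, Z_n\}$ is exchangeable for any natural number $n$.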
Let $\{z_1, \dots, z_n\}$ denote a sequence of data samples. For each sample $z_i$, the ’nonconformity measure’ $\alpha_i$ quantifies the strangeness of $z_i$ with respect to the other data samples:
$$\alpha_i = A\big(z_i, \{z_1, \dots, z_n\} \setminus \{z_i\}\big).$$
The operator $A$ represents an algorithm which takes $z_i$ and the other data samples as inputs, and returns a value reflecting the ’nonconformity’ of $z_i$ with respect to the other data samples. For example, one way to obtain the nonconformity measure is based on the Nearest Neighbour algorithm as follows:
$$\alpha_i = \min_{j \ne i} d(z_i, z_j),$$
where $d(\cdot, \cdot)$ denotes the Euclidean distance.
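The nearest-neighbour nonconformity measure described above can be sketched in a few lines. This is a minimal illustration only: the function name `nn_nonconformity` and the toy samples are assumptions made for the example, not part of the original work.

```python
import math

def nn_nonconformity(samples, i):
    """Euclidean distance from samples[i] to its nearest neighbour among the others."""
    zi = samples[i]
    return min(math.dist(zi, zj) for j, zj in enumerate(samples) if j != i)

# Toy data (hypothetical): three points close together and one far away.
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (3.0, 4.0)]
scores = [nn_nonconformity(samples, i) for i in range(len(samples))]
```

The isolated point receives the largest score, which is the sense in which the measure reflects strangeness.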
Once the nonconformity measures of all data samples are calculated, the sequence of p-values $p_1, p_2, \dots$ can be calculated using Algorithm 1 [Fedorova et al.2012], in which $\theta_i$ denotes a random number uniformly distributed in $[0, 1]$ and $|\cdot|$ denotes the cardinality of a set.
Theorem 1.
If the data samples satisfy the exchangeability assumption, Algorithm 1 will produce p-values that are independent and uniformly distributed in $[0, 1]$.
In Theorem 1, the p-values reflect the strangeness of the corresponding data points: a smaller p-value means a larger strangeness of the corresponding data sample. Note that computing p-values according to Algorithm 1 is very time consuming: whenever a new sample is obtained, the nonconformity measures for all the previous data samples have to be recalculated. To avoid these expensive computations, we apply the ’inductive’ version to compute the p-values, as given in Algorithm 2. The ’inductive’ version assumes a fixed training set; the nonconformity value of each sample is computed against this training set only, so it needs to be calculated just once [Vovk et al.2005, Denis et al.2017], which is much more computationally efficient.
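The inductive p-value computation can be sketched as follows. This is a hedged illustration rather than a transcription of Algorithm 2: it assumes scalar samples, nearest-neighbour distance to the fixed training set as the nonconformity score, and the usual randomized p-value with a uniform tie-breaking number. The function name `inductive_p_values` is hypothetical.

```python
import random

def inductive_p_values(train, stream, rng=None):
    """Randomized p-values; each score is computed once against a fixed training set."""
    rng = rng or random.Random(0)
    alphas, p_values = [], []
    for z in stream:
        alpha = min(abs(z - t) for t in train)    # nonconformity w.r.t. training set only
        alphas.append(alpha)                      # earlier scores are never recomputed
        greater = sum(a > alpha for a in alphas)
        equal = sum(a == alpha for a in alphas)   # includes the new sample itself
        theta = rng.random()                      # uniform tie-breaking number
        p_values.append((greater + theta * equal) / len(alphas))
    return p_values

rng = random.Random(42)
train = [rng.gauss(0.0, 1.0) for _ in range(100)]
stream = [rng.gauss(0.0, 1.0) for _ in range(50)]
p_values = inductive_p_values(train, stream, rng)
```

Because the training set is fixed, each new sample costs one pass over the training set plus one pass over the stored scores.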
To prepare for the next few sections, the main idea of the martingale based approach for the exchangeability test is briefly summarized here. Once the p-value sequence is obtained by running Algorithm 1, a new sequence can be constructed from it through a proper ’betting function’ (which must satisfy some special properties). When no change happens, the newly constructed sequence is very likely to stay in a bounded region since it is a valid martingale. Otherwise, the p-values are no longer uniformly distributed in $[0, 1]$ (as implied by Theorem 1, lack of exchangeability gives non-uniformly distributed p-values) and start concentrating around a small region, so the sequence shows a growing or decreasing trend.
2.1 Multiplicative martingale
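In [Vovk et al.2003, Ho2005, Fedorova et al.2012], the authors proposed the exchangeability martingale (which we will refer to as the multiplicative martingale in this work) for the exchangeability test. The idea is summarized as follows. For the p-value sequence $p_1, p_2, \dots$ generated by Algorithm 1, consider the following random sequence
$$M_n = \prod_{i=1}^{n} f_i(p_i), \qquad M_0 = 1,$$
where $f_i : [0, 1] \to [0, \infty)$ is called the betting function, which satisfies
$$\int_0^1 f_i(p)\, dp = 1.$$
From this, since the p-values are independent and uniformly distributed in $[0, 1]$ under exchangeability, it follows that
$$E[M_{n+1} \mid M_0, \dots, M_n] = M_n \int_0^1 f_{n+1}(p)\, dp = M_n.$$
Therefore $\{M_n\}$ is a valid martingale sequence according to Definition 1. Different betting functions have been suggested for multiplicative martingales; three typical ones are summarized as follows.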
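Power Martingale. It uses a fixed power function as betting function
$$f(p) = \epsilon\, p^{\epsilon - 1},$$
where $\epsilon \in [0, 1]$. Therefore, the power martingale for a given $\epsilon$ is written as
$$M_n^{(\epsilon)} = \prod_{i=1}^{n} \epsilon\, p_i^{\epsilon - 1}.$$
Mixture Power Martingale. It uses a mixture of power martingales based on different $\epsilon$ values:
$$M_n = \int_0^1 M_n^{(\epsilon)}\, d\epsilon.$$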
Plug-In Martingale. It uses an empirical p.d.f. of the p-values as betting function. It has been shown in [Fedorova et al.2012] that the plug-in martingale is more efficient, in the sense that the martingale value changes more rapidly when a change-point happens. To construct the empirical p.d.f., a modified kernel density estimator is used therein.
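The first two constructions can be sketched in the log scale, which avoids numerical overflow of the product. This is an illustrative sketch under assumptions: the power betting function is taken as $\epsilon p^{\epsilon-1}$, the mixture integral over $\epsilon$ is approximated on a midpoint grid (a choice made here, not taken from the cited papers), and all p-values are assumed strictly positive.

```python
import math

def log_power_martingale(p_values, eps):
    """log of prod_i eps * p_i**(eps - 1); requires p_i > 0."""
    return sum(math.log(eps) + (eps - 1.0) * math.log(p) for p in p_values)

def log_mixture_martingale(p_values, grid_size=100):
    """log of the average of power martingales over a midpoint grid of eps in (0, 1)."""
    logs = [log_power_martingale(p_values, (k + 0.5) / grid_size)
            for k in range(grid_size)]
    m = max(logs)                                   # log-sum-exp for stability
    return m + math.log(sum(math.exp(v - m) for v in logs) / grid_size)

small = [0.01] * 20     # p-values concentrated near zero (change present)
typical = [0.5] * 20    # p-values away from zero

log_small = log_power_martingale(small, eps=0.5)
log_typical = log_power_martingale(typical, eps=0.5)
log_mix_small = log_mixture_martingale(small)
```

Small p-values make the log martingale grow, while p-values around 0.5 make it shrink for this choice of `eps`.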
Due to the unboundedness of the power function and the multiplicative construction, it is inconvenient to apply Hoeffding-Azuma type concentration inequalities to the multiplicative martingale when designing tests for detecting changes in data streams; see the discussions in [Ho and Wechsler2005] and [Ho2005]. These shortcomings of the multiplicative martingale motivate us to consider the ’additive martingale’ as an alternative, which is elaborated in the next section.
3 Additive martingale
As pointed out in [Denis et al.2017], ’it is interesting whether there are any other test exchangeability martingales apart from the conformal martingales’ (i.e. the one defined in eq. (4)). In this section, we present the additive martingale, which addresses some of the issues of the multiplicative martingale.
3.1 Basic idea
The additive martingale is inherently related to the multiplicative martingale, and their connection can be seen through the following reasoning. Taking the logarithm of both sides of eq. (4), i.e. of $M_n = \prod_{i=1}^{n} f_i(p_i)$, we get
$$\log M_n = \sum_{i=1}^{n} \log f_i(p_i).$$
What we hope for is that $\{\log M_n\}$ is a valid martingale sequence (since we want to obtain a martingale in the ’additive’ sense), or equivalently, that
$$\int_0^1 \log f_i(p)\, dp = 0.$$
However, in the multiplicative martingale case, the betting function is chosen to satisfy
$$\int_0^1 f_i(p)\, dp = 1$$
and $f_i(p) \ge 0$. Since the function $\log(\cdot)$ is concave, Jensen's inequality gives
$$\int_0^1 \log f_i(p)\, dp \le \log \int_0^1 f_i(p)\, dp = 0.$$
This shows that, instead of being a martingale, the sequence $\{\log M_n\}$ is actually a supermartingale. This explains the decreasing trend of the multiplicative martingale value (in the log scale), as illustrated in Figure 1(d).
To mitigate this problem (i.e. to get a valid martingale), we can directly require the betting functions to integrate to zero, that is, for $n \ge 1$, let $S_n$ be defined as
$$S_n = \sum_{i=1}^{n} g_i(p_i), \qquad \int_0^1 g_i(p)\, dp = 0.$$
Then, for independent p-values uniformly distributed in $[0, 1]$, we have:
$$E[S_{n+1} \mid S_1, \dots, S_n] = S_n + \int_0^1 g_{n+1}(p)\, dp = S_n,$$
therefore $\{S_n\}$ becomes a valid martingale.
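The contrast between the two constructions can be checked numerically. The sketch below is illustrative only: it feeds uniform p-values to a log-scale power martingale with $\epsilon = 0.5$ and to an additive sequence with the zero-integral function $g(p) = 1/2 - p$; both choices are assumptions made for the example. For the power betting function the expected one-step log increment is $\log \epsilon + 1 - \epsilon$, which is negative for any $\epsilon \in (0, 1)$.

```python
import math
import random

def simulate(n=2000, eps=0.5, seed=0):
    """Run both sequences on the same uniform p-values (no change-point)."""
    rng = random.Random(seed)
    log_m = 0.0   # log of the multiplicative (power) martingale
    add_m = 0.0   # additive martingale
    for _ in range(n):
        p = 1.0 - rng.random()                              # uniform on (0, 1], avoids log(0)
        log_m += math.log(eps) + (eps - 1.0) * math.log(p)
        add_m += 0.5 - p                                    # integrates to zero on [0, 1]
    return log_m, add_m

log_m, add_m = simulate()
expected_drift = math.log(0.5) + 1.0 - 0.5   # expected log increment per step, eps = 0.5
```

With no change in the data, the log-scale sequence drifts steadily downwards, whereas the additive sequence only fluctuates around zero.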
3.2 Betting functions for additive martingale
In what follows, we will give two betting function constructions to get valid additive martingales.
3.2.1 Shifted odd functions
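By definition, any odd function $h$ satisfies
$$h(-x) = -h(x),$$
from which it follows that
$$\int_0^1 h\big(p - \tfrac{1}{2}\big)\, dp = \int_{-1/2}^{1/2} h(x)\, dx = 0.$$
This simple fact implies that $g(p) = h(p - \tfrac{1}{2})$ is a valid betting function for any odd function $h$. The linear choice $h(x) = -x$, i.e. $g(p) = \tfrac{1}{2} - p$, is one such example; more betting functions can easily be constructed by picking different odd functions.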
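The zero-integral property of shifted odd functions is easy to verify numerically. In the sketch below, the particular odd functions and the midpoint-rule integrator are illustrative assumptions, not choices taken from the original work.

```python
import math

def shifted(h):
    """Turn an odd function h on [-1/2, 1/2] into a betting function on [0, 1]."""
    return lambda p: h(p - 0.5)

def integral_01(g, n=10_000):
    """Midpoint-rule approximation of the integral of g over [0, 1]."""
    return sum(g((k + 0.5) / n) for k in range(n)) / n

# A few odd functions (hypothetical examples).
candidates = {
    "linear": lambda x: -x,
    "cubic": lambda x: x ** 3,
    "sine": math.sin,
    "tanh": lambda x: math.tanh(5.0 * x),
}
integrals = {name: integral_01(shifted(h)) for name, h in candidates.items()}
```

Each entry of `integrals` is zero up to floating-point error, so each shifted function yields a valid additive martingale under uniform p-values.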
3.2.2 Shifted empirical probability density function
From the p-values calculated by Algorithm 1, an empirical probability density function of the p-values can be obtained (one computationally efficient way will be discussed later on), which we denote as $\hat{f}_n$ at time $n$. Since $\hat{f}_n$ integrates to one over $[0, 1]$, it can be readily checked that a valid betting function can be formulated as
$$g_n(p) = \hat{f}_n(p) - 1.$$
This construction is not only valid; it also gives a rapid and smooth change in the martingale sequence when a change-point happens in the data sequence (see the experiment result in Figure 1(b)). Next we explain this observation. To this end, we first define the following optimization problem.
The objective function in eq. consists of two parts: the first part represents the expected increment of the martingale sequence value at each step, when betting function is used, and given the underlying p.d.f. of p-values as ; the second term represents the ’flatness’ (or ’regularness’) of the betting function.
To make better sense of the optimization problem, we analyze the following two extreme cases: when and .
When , since and are both non-negative functions, problem in eq. can be reduced to
Assume that is upper-bounded by and at point we have , then it follows that
and the maximum value can be obtained when is set to be , where denotes the Dirac delta function, which is extremely peaked. More concretely, when , we have
Figure 1 illustrates how the martingale
changes over time, when the betting function is a Dirac delta function (approximated by a Gaussian p.d.f. with a very small variance). As shown in Figure 1(b), the martingale can reach very high values when the p-values are not uniformly distributed. However, even when the p-values are uniformly distributed, the martingale sequence still has high variation and may end up far from its initial point, as can be observed from Figure 1(a). This is not ideal for change-point detection, since it may increase the probability of false alarms.
Let’s discuss the case when . In this situation, the problem in eq. reduces to
By the Cauchy-Schwarz inequality, we have that
where the equality holds when , . From these calculations, in the case when , the optimal solution to eq. is given by the uniform density on the interval $[0, 1]$, which is the most ’regular’ function.
The previous discussion implies that a proper choice of will give a satisfactory balance between the one-step increment and the ’regularness’ (by which we mean that the sequence does not include big jumps) of the martingale sequence. Next, we show that, when choosing
the optimal solution to eq. is given by , which gives that the corresponding betting function is .
Note that when is chosen as , again by the Cauchy-Schwarz inequality, we will have
where the equality holds when and is a constant. When the equality holds, the maximum of the objective (which is zero) in eq. is achieved. Given the fact that both and integrate to 1 in , we get , and it implies that the optimum to the optimization problem in eq. is .
4 Estimating p-value distribution with Beta distribution
Given the importance of the p-value density function in constructing efficient additive martingales, in what follows, we will discuss a computationally efficient way to build up an approximation of the p-value density function.
According to Theorem 1, we know that when a change-point happens, the p-values will not be uniformly distributed within $[0, 1]$. Typically, the distribution will be skewed with a single mode. This observation inspires us to model the p-value density function with a Beta distribution, which is defined as follows.
4.1 Beta distribution
Definition 3 (Beta distribution).
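The beta distribution $\mathrm{Beta}(\alpha, \beta)$, parametrized by two positive shape parameters $\alpha$ and $\beta$, defines a family of continuous probability distributions on $[0, 1]$, given as
$$f(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1},$$
where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ and $\Gamma$ denotes the Gamma function.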
Note that when $\alpha = \beta = 1$, it gives the uniform distribution on $[0, 1]$. When both $\alpha$ and $\beta$ are greater than one, an imbalanced choice of $\alpha$, $\beta$ will give a skewed density function, which is of particular interest to us since it will be useful to model the skewed p-value distribution with a single mode.
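To connect the Beta density with the shifted p.d.f. construction of Section 3.2.2, the sketch below evaluates a Beta density and subtracts one to obtain a zero-integral betting function. The function names and the parameter values $(2, 5)$ are assumptions made for illustration.

```python
import math

def beta_pdf(p, a, b):
    """Beta(a, b) density at p in (0, 1), computed in the log domain for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p))

def beta_betting(p, a, b):
    """Shifted density: integrates to zero over [0, 1] because the density integrates to one."""
    return beta_pdf(p, a, b) - 1.0

# Midpoint-rule check of the zero-integral property for a skewed density.
n = 10_000
integral = sum(beta_betting((k + 0.5) / n, 2.0, 5.0) for k in range(n)) / n
```

With these parameters the betting function is positive where the skewed density exceeds one (small p-values) and negative elsewhere, so p-values that follow the modelled density push the martingale upwards on average.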
There exist non-parametric approaches [Tsybakov2009] for estimating density functions, for example histogram and kernel based density estimators. For these estimators, the optimal choice of the number of bins or the kernel bandwidth depends on knowledge of the underlying p.d.f., which is often unknown. The Beta parametric approach presents an alternative for the case when a single mode appears in the p-value distribution. The parameters are easy to estimate, and the estimation can be done in an online fashion, as explained in the next part.
4.2 Parameter estimation
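In [Bain and Engelhardt1992], a moment-matching based method for estimating the shape parameters ($\alpha$ and $\beta$) was proposed. A notable feature of this approach is that it only involves the calculation of the sample mean and variance.
More concretely, assume we are given a set of p-values $\{p_1, \dots, p_n\}$, then according to [Bain and Engelhardt1992], $\hat{\alpha}$ and $\hat{\beta}$ (the estimated parameters) can be calculated as follows:
$$\hat{\alpha} = \bar{p}_n \left( \frac{\bar{p}_n (1 - \bar{p}_n)}{s_n^2} - 1 \right), \qquad \hat{\beta} = (1 - \bar{p}_n) \left( \frac{\bar{p}_n (1 - \bar{p}_n)}{s_n^2} - 1 \right),$$
in which $\bar{p}_n$ and $s_n^2$ denote the sample mean and variance respectively, which are defined as
$$\bar{p}_n = \frac{1}{n} \sum_{i=1}^{n} p_i, \qquad s_n^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (p_i - \bar{p}_n)^2.$$
Online calculation of $\bar{p}_n$ and $s_n^2$ can be found in [Welford1962]. Concretely,
$$\bar{p}_n = \bar{p}_{n-1} + \frac{p_n - \bar{p}_{n-1}}{n}.$$
And for updating the sample variance, we can first update $Q_n = \sum_{i=1}^{n} (p_i - \bar{p}_n)^2$ recursively as follows
$$Q_n = Q_{n-1} + (p_n - \bar{p}_{n-1})(p_n - \bar{p}_n),$$
with which we get
$$s_n^2 = \frac{Q_n}{n - 1}.$$
Note that when a sliding window (with size $W$) is introduced, the sample mean $\bar{p}_{n,W}$ and sample variance $s_{n,W}^2$, defined as
$$\bar{p}_{n,W} = \frac{1}{W} \sum_{i=n-W+1}^{n} p_i, \qquad s_{n,W}^2 = \frac{1}{W - 1} \sum_{i=n-W+1}^{n} (p_i - \bar{p}_{n,W})^2,$$
in which $n \ge W$, can similarly be calculated in an online manner; the main steps can be found in the Appendix.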
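The online estimation can be sketched by combining Welford-style running moments with the moment-matching formulas for the Beta shape parameters. This is a minimal sketch under assumptions: the unbiased ($n - 1$) sample variance is used, the class and function names are hypothetical, and the toy p-values are made up for the example. Moment matching yields positive parameters only when the variance is smaller than mean times (1 - mean).

```python
import statistics

class RunningMoments:
    """Welford-style online update of the sample mean and variance."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.q = 0.0                      # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean             # deviation from the old mean
        self.mean += delta / self.n
        self.q += delta * (x - self.mean) # uses both old and new mean

    @property
    def variance(self):
        return self.q / (self.n - 1) if self.n > 1 else float("nan")

def beta_moment_match(mean, var):
    """Moment-matching estimates of the Beta shape parameters."""
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

ps = [0.12, 0.05, 0.31, 0.08, 0.22, 0.15, 0.02, 0.40]   # toy p-values skewed towards 0
rm = RunningMoments()
for p in ps:
    rm.update(p)
a_hat, b_hat = beta_moment_match(rm.mean, rm.variance)
```

Each update costs constant time and memory, which is what makes the Beta parametrization attractive for streaming data.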
5 Designing tests for change-point detection
In this section, we apply the Hoeffding-Azuma inequality and the Doob-Kolmogorov inequality to the additive martingale sequence to develop statistical tests for change-point detection. The general idea is that, when no change-point appears, the martingale sequence stays within a certain bounded region with high probability. When the sequence leaves this region, it is very likely that a change-point has occurred, hence an alarm needs to be triggered.
Theorem 2 (Hoeffding-Azuma inequality).
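Let $c_1, \dots, c_n$ be constants and let $d_1, \dots, d_n$ be a martingale difference sequence with $|d_i| \le c_i$ for each $i$. Then for any $t > 0$, we have
$$P\left( \Big| \sum_{i=1}^{n} d_i \Big| \ge t \right) \le 2 \exp\left( - \frac{t^2}{2 \sum_{i=1}^{n} c_i^2} \right).$$
In the additive martingale case, when the betting function is chosen as a shifted odd function $g$ with $|g(p)| \le c$ for all $p \in [0, 1]$, the inequality above reduces to
$$P\left( |S_n| \ge t \right) \le 2 \exp\left( - \frac{t^2}{2 n c^2} \right), \qquad S_n = \sum_{i=1}^{n} g(p_i).$$
This fact can be used to design statistical tests for $H_0$ ($H_0$: the p-values follow a uniform distribution in $[0, 1]$, i.e. no change-point appearing). More specifically, given the significance level $\delta$, when
$$|S_n| \ge c \sqrt{2 n \ln(2 / \delta)}$$
we reject the hypothesis $H_0$ and trigger an alarm.
In practice, a sliding window (assume the window size is $W$) can be introduced to track the change more rapidly. Given the significance level $\delta$ and after similar calculations as before, when
$$|S_n - S_{n-W}| \ge c \sqrt{2 W \ln(2 / \delta)}$$
the hypothesis $H_0$ will be rejected with the significance level $\delta$.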
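A sliding-window test of this kind can be sketched as follows. The sketch assumes the bounded betting function $g(p) = 1/2 - p$ (so that $|g| \le c = 1/2$) and a threshold obtained by setting the two-sided Hoeffding-Azuma bound $2\exp(-t^2 / (2 W c^2))$ equal to the significance level; the function names and the toy p-value sequences are hypothetical.

```python
import math

def azuma_threshold(n_steps, c, delta):
    """Smallest t with 2 * exp(-t**2 / (2 * n_steps * c**2)) <= delta."""
    return c * math.sqrt(2.0 * n_steps * math.log(2.0 / delta))

def first_alarm(p_values, window, delta, g=lambda p: 0.5 - p, c=0.5):
    """Return the 1-based time of the first alarm, or None if no alarm is raised."""
    threshold = azuma_threshold(window, c, delta)
    s = [0.0]                                   # s[n] is the additive martingale after n steps
    for n, p in enumerate(p_values, start=1):
        s.append(s[-1] + g(p))
        if n >= window and abs(s[n] - s[n - window]) >= threshold:
            return n
    return None

alarm_changed = first_alarm([0.05] * 100, window=50, delta=0.01)
alarm_balanced = first_alarm([0.25, 0.75] * 50, window=50, delta=0.01)
```

Persistently small p-values trigger an alarm as soon as a full window is available, while p-values balanced around 0.5 never do.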
It can be observed that, due to the additive structure of the constructed martingale (in contrast to the multiplicative martingale case), it becomes more convenient to apply the Hoeffding-Azuma inequality for the test design.
Inspired by [Ho2005], we can also design tests based on the following inequality.
Theorem 3 (Doob-Kolmogorov inequality).
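Let $d_1$, $d_2$, $\dots$, $d_n$ be a martingale difference sequence, and $S_k = \sum_{i=1}^{k} d_i$ for $k = 1, \dots, n$. Then for any $\lambda > 0$ it follows that
$$P\left( \max_{1 \le k \le n} |S_k| \ge \lambda \right) \le \frac{E[S_n^2]}{\lambda^2}.$$
Recall from Theorem 1 that the p-values generated from Algorithm 1 are independent of each other. Therefore, if the betting function $g_i$ is chosen independently of $p_j$, where $j < i$, for instance when $g(p) = \tfrac{1}{2} - p$, then by calculating $E[g(p_i)^2] = \tfrac{1}{12}$, the inequality above can be reduced to (assuming a sliding window with size $W$ is used):
$$P\left( \max_{1 \le k \le W} |S_{n-W+k} - S_{n-W}| \ge \lambda \right) \le \frac{W}{12 \lambda^2},$$
which gives that $H_0$ will be rejected (with significance level $\delta$) when
$$\max_{1 \le k \le W} |S_{n-W+k} - S_{n-W}| \ge \sqrt{\frac{W}{12 \delta}}.$$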
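The rejection threshold implied by the Doob-Kolmogorov inequality can be computed and compared with a Hoeffding-Azuma threshold. The sketch assumes the betting function $g(p) = 1/2 - p$, whose second moment under uniform p-values is $1/12$ and whose absolute value is bounded by $1/2$; the function names, window size and significance levels are illustrative assumptions.

```python
import math

def doob_threshold(window, delta, second_moment=1.0 / 12.0):
    """Smallest lam with window * second_moment / lam**2 <= delta."""
    return math.sqrt(window * second_moment / delta)

def azuma_threshold(window, delta, c=0.5):
    """Two-sided Hoeffding-Azuma threshold for increments bounded by c."""
    return c * math.sqrt(2.0 * window * math.log(2.0 / delta))

# (Doob-Kolmogorov, Hoeffding-Azuma) thresholds for a window of 50 samples.
comparison = {d: (doob_threshold(50, d), azuma_threshold(50, d))
              for d in (0.1, 0.01, 0.001)}
```

Under these assumptions the Doob-Kolmogorov threshold is the smaller of the two at a significance level of 0.1, while the Hoeffding-Azuma threshold is smaller at 0.01 and 0.001, since it grows only logarithmically as the level decreases.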
6 Evaluation
To showcase properties of the proposed method, we conduct an experiment to illustrate how both martingale sequences behave in the exchangeable and non-exchangeable cases. The setup is as follows: data in the first part of the sequence (before the vertical dashed line in Figure 2) are drawn i.i.d. from a Gaussian distribution with zero mean and unit variance; data in the second part are also drawn from a Gaussian distribution, but with a mean different from that of the first Gaussian. The second part of the sequence models the data after a change has happened. Distance to the nearest neighbour is used as the nonconformity measure, and Algorithm 2 is used for p-value calculation.
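A small simulation in the spirit of this setup can be sketched as follows. It is not the experiment reported in the paper: the sample sizes, the mean shift of 4, the random seed, the scalar data and the betting function $g(p) = 1/2 - p$ are all assumptions chosen to keep the example short.

```python
import random

def run_experiment(n_train=200, n_before=300, n_after=300, shift=4.0, seed=1):
    """Additive martingale path on a Gaussian stream with a mean shift."""
    rng = random.Random(seed)
    train = [rng.gauss(0.0, 1.0) for _ in range(n_train)]
    stream = ([rng.gauss(0.0, 1.0) for _ in range(n_before)]
              + [rng.gauss(shift, 1.0) for _ in range(n_after)])
    alphas, path, s = [], [], 0.0
    for z in stream:
        alpha = min(abs(z - t) for t in train)      # nearest-neighbour nonconformity
        alphas.append(alpha)
        greater = sum(a > alpha for a in alphas)
        equal = sum(a == alpha for a in alphas)
        p = (greater + rng.random() * equal) / len(alphas)   # inductive p-value
        s += 0.5 - p                                # additive martingale increment
        path.append(s)
    return path

path = run_experiment()
value_at_change = path[299]     # last value before the mean shift
value_at_end = path[-1]
```

Before the shift the path stays close to zero; after the shift the p-values concentrate near zero and the path climbs steadily.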
The results are reported in Figure 2. In all these figures, the x-axis represents time and the y-axis represents the martingale values (in the multiplicative martingale case, the martingale values in the log scale). Comparing subfigures (a) and (b), we observe that the curves obtained with the estimated p.d.f. as the betting function achieve higher martingale values (which makes the change-point detection more confident) than those obtained with a shifted odd function as the betting function. The reason is that the approach based on the estimated p.d.f. can adapt to the change and gain a higher one-step increment in the curve. In subfigure (b), we also observe that, right after the change-point appears, the curve increases slowly during the initial time steps, while the algorithm is gaining knowledge of the changed p.d.f. of the p-values; after this phase, the curves increase very rapidly.
In subfigures (c) and (d), obtained using the multiplicative martingales, the curves exhibit a decreasing trend in the period with no change-point; after the change-point appears, the curves start to increase but take significant time to return to a high value. It is possible to post-process the curve, for example by applying a one-step finite difference filter to transform the ’v’-shaped curve into a step-like curve similar to the ones in subfigures (a) and (b), or by using another trick introduced in [Denis et al.2017]. However, such post-processing introduces additional complications for designing tests (for example, consider applying the Hoeffding-Azuma type inequalities to the transformed sequence).
In addition, in the period with no deviation, the curves in subfigure (b) have small variations compared to the curves in the other subfigures, which makes them less prone to triggering false alarms.
7 Future work
The proposed framework gives an alternative way for change-point detection, with some advantages over existing ones. There are still many questions left open for further research: 1) as discussed in the previous section, the change of the martingale sequence in subfigure (b) of Figure 2 is slow in the initial time steps when the size of the sliding window is large. However, as can be observed from the same figure, when the sliding window size is smaller, the curve changes more rapidly in response to the deviation. This observation leads us to consider designing adaptive strategies, for example by using a smaller sliding window in the initial steps to increase the response speed to change, and gradually increasing the window size to gain more accurate information about the changed distribution in order to have a larger one-step increment; 2) there exist improvements over the basic Hoeffding-Azuma inequality, for example some are presented in chapter 2 of [Maxim and Sason2015]. It will be interesting to see whether these more advanced concentration inequalities can give tighter bounds than the one in eq. (21) for a given significance level.
The update rule for the sample mean is given as
For sample variance, we have
which gives that
with which we can conclude that
- [Bain and Engelhardt1992] Lee J. Bain and Max Engelhardt. Introduction to Probability and Mathematical Statistics. Duxbury Press, California, 1992.
- [Denis et al.2017] Volkhonskiy Denis, Ilia Nouretdinov, Alexander Gammerman, Vladimir Vovk, et al. Inductive conformal martingales for change-point detection. Proceedings of Machine Learning Research - Conformal and Probabilistic Prediction and Applications, pages 60:1–22, 2017.
- [Fedorova et al.2012] Valentina Fedorova, Alex Gammerman, Ilia Nouretdinov, and Vladimir Vovk. Plug-in martingales for testing exchangeability on-line. In Proceedings of the 29th International Conference on Machine Learning, pages 923–930, 2012.
- [Ho and Wechsler2005] Shen-Shyang Ho and Harry Wechsler. On the detection of concept changes in time-varying data stream by … In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 267–274. AUAI Press, 2005.
- [Ho2005] Shen-Shyang Ho. A martingale framework for concept change detection in time-varying data streams. In Proceedings of the 22nd International Conference on Machine Learning, pages 321–327. ACM, 2005.
- [Maxim and Sason2015] Maxim Raginsky and Igal Sason. Concentration of measure inequalities in information theory. Now Publishers, 2015.
- [Page1954] E.S. Page. Continuous inspection scheme. Biometrika, 41(1/2):100–115, 1954.
- [Tsybakov2009] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, New York, 2009.
- [Vovk et al.2003] Vladimir Vovk, Ilia Nouretdinov, and Alexander Gammerman. Testing exchangeability on-line. In Proceedings of the 20th International Conference on Machine Learning, pages 768–775, 2003.
- [Vovk et al.2005] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer, New York, 2005.
- [Wald1945] Abraham Wald. Sequential tests of statistical hypotheses. Annals of Mathematical Statistics, 16(2):117–186, 1945.
- [Welford1962] B. P. Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419–420, 1962.