1 Introduction
This paper discusses testing the exchangeability of a given random sequence using a martingale-based approach. The results can be applied to changepoint detection in streaming data, covering both abrupt changes and concept drift, i.e. changes that happen in a smooth and incremental manner.
In real-world applications, systems often exhibit complex behaviour and the data distribution is unknown. The martingale-based approach is well suited to these scenarios for changepoint detection, since it does not rely on any distributional knowledge of the data, in contrast to traditional sequential changepoint detection methods such as the Sequential Probability Ratio Test [Wald1945] and the CUmulative SUM (CUSUM) control chart [Page1954].
The idea of using martingales for exchangeability testing dates back to the work in [Vovk et al.2003], building on the theory of the Transductive Confidence Machine [Vovk et al.2005], where the concept of 'exchangeability martingales' was introduced to implement the test in an online manner. It was later established in [Fedorova et al.2012] that, to maximize the logarithmic growth rate of the multiplicative martingale, the betting function used for constructing the martingale should be chosen as the empirical probability density function (p.d.f.) of the p-values. The construction of this empirical p.d.f. suggested therein uses a modified kernel density estimator.
In [Ho and Wechsler2005] and [Ho2005], the authors applied various concentration inequalities to the multiplicative martingale sequence to design tests for detecting changes in data streams. However, due to the high variability of the multiplicative martingale sequence, it is hard to design such tests from concentration inequalities. In addition, the multiplicative martingale values exhibit an undesirable behaviour in the log scale: a decreasing trend even when no change happens (which will be illustrated by an example in the Evaluation section). To address these shortcomings, we propose a new type of martingale, which we call the additive martingale, for developing the exchangeability test; it can also be implemented in an online fashion. Different betting functions for constructing the additive martingale are discussed as well. Interestingly, similar to the multiplicative martingale case, it is shown that choosing the betting function based on the underlying p.d.f. of the p-values (which are no longer uniformly distributed once a changepoint appears) yields a satisfactory balance between the smoothness and the expected one-step increment of the martingale sequence. Based on a Beta distribution parametrization, a computationally efficient way to construct this betting function is discussed. We also discuss how to design tests for changepoint detection based on different concentration inequalities.
2 Background
Definition 1 (Martingale).
A sequence of random variables $\{S_n\}_{n \ge 1}$ is a martingale with respect to a sequence $\{X_n\}_{n \ge 1}$ if, for all $n$, $\mathbb{E}[|S_n|] < \infty$ and
$$\mathbb{E}[S_{n+1} \mid X_1, \dots, X_n] = S_n. \tag{1}$$
Definition 2 (Exchangeability).
A set of random variables $X_1, \dots, X_n$ is exchangeable if it holds that
$$P(X_1, \dots, X_n) = P(X_{\pi(1)}, \dots, X_{\pi(n)}), \tag{2}$$
in which $\pi$ denotes any permutation of $\{1, \dots, n\}$. A series of random variables $X_1, X_2, \dots$ is exchangeable if $X_1, \dots, X_n$ is exchangeable for any natural number $n$.
Let $\{z_1, z_2, \dots, z_n\}$ denote a sequence of data samples. For each sample $z_i$, the 'nonconformity measure' $\alpha_i$ quantifies the strangeness of $z_i$ with respect to the other data samples:
$$\alpha_i = A\big(z_i,\ \{z_1, \dots, z_n\} \setminus \{z_i\}\big). \tag{3}$$
The operator $A$ in eq. (3) represents a certain algorithm which takes $z_i$ and the other data samples as inputs, and returns a value reflecting the 'nonconformity' of $z_i$ with respect to the other data samples. For example, one way to obtain the nonconformity measure is based on the Nearest Neighbour algorithm as follows:
$$\alpha_i = \min_{j \ne i} d(z_i, z_j),$$
where $d(\cdot, \cdot)$ denotes the Euclidean distance.
Once the nonconformity measures of all data samples are calculated, the sequence of p-values can be calculated using Algorithm 1 [Fedorova et al.2012], which at time $n$ outputs
$$p_n = \frac{\#\{i : \alpha_i > \alpha_n\} + \theta_n\,\#\{i : \alpha_i = \alpha_n\}}{n},$$
in which $\theta_n$ denotes a random number uniformly distributed in $[0,1]$ and $\#$ denotes the cardinality of a set.
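As an illustration, the p-value computation above can be sketched as follows; this is a minimal Python sketch assuming the nearest-neighbour nonconformity measure, with function names and synthetic data of our own choosing:

```python
import numpy as np

def nn_nonconformity(z, others):
    # Strangeness of z: Euclidean distance to its nearest neighbour.
    return np.min(np.linalg.norm(others - z, axis=1))

def conformal_pvalue(alphas, rng):
    # Randomized p-value of the last sample:
    # p_n = (#{i: a_i > a_n} + theta * #{i: a_i == a_n}) / n,  theta ~ U(0,1).
    a_n = alphas[-1]
    theta = rng.uniform()
    return (np.sum(alphas > a_n) + theta * np.sum(alphas == a_n)) / len(alphas)

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 2))
pvals = []
for n in range(2, len(data) + 1):
    zs = data[:n]
    # Algorithm 1 recomputes every nonconformity score at each step.
    alphas = np.array([nn_nonconformity(zs[i], np.delete(zs, i, axis=0))
                       for i in range(n)])
    pvals.append(conformal_pvalue(alphas, rng))
```

Note the nested loop: every new sample forces a recomputation of all scores, which motivates the inductive variant discussed below.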
The following Theorem 1, from [Vovk et al.2003, Fedorova et al.2012], plays a pivotal role in developing the martingale-based test for exchangeability.
Theorem 1.
If the data samples $z_1, z_2, \dots$ satisfy the exchangeability assumption, Algorithm 1 will produce p-values $p_1, p_2, \dots$ that are independent and uniformly distributed in $[0,1]$.
In Theorem 1, the values $p_i$ reflect the strangeness of the corresponding data points: a smaller $p_i$ means a larger strangeness of the corresponding data sample. Note that computing p-values according to Algorithm 1 is computationally expensive: whenever a new sample is obtained, the nonconformity measures of all the previous data samples have to be recalculated. To avoid this expensive computation, we will apply its 'inductive' version to compute the p-values, as given in Algorithm 2. In the inductive version, a fixed training set is assumed, and based on this fixed training set the nonconformity values of all samples only need to be calculated once [Vovk et al.2005, Denis et al.2017]; hence it is much more computationally efficient.
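The inductive variant can be sketched in the same spirit (a hedged illustration, not the exact Algorithm 2: we again assume the nearest-neighbour nonconformity measure, and each new sample is scored once against a fixed training set):

```python
import numpy as np

def train_nn_scorer(train):
    # Fixed training set; each incoming sample is scored against it once.
    def score(z):
        return np.min(np.linalg.norm(train - z, axis=1))
    return score

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 2))
score = train_nn_scorer(train)

seen_alphas = []   # nonconformity scores of previously observed samples
pvals = []
for z in rng.normal(size=(50, 2)):
    a = score(z)                       # computed once, never revisited
    seen_alphas.append(a)
    alphas = np.array(seen_alphas)
    theta = rng.uniform()              # theta ~ U(0,1) breaks ties
    pvals.append((np.sum(alphas > a) + theta * np.sum(alphas == a))
                 / len(alphas))
```

The per-step cost is now a single distance query against the training set, instead of a full recomputation over all past samples.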
To prepare for the next few sections, the main idea of the martingale-based approach to exchangeability testing is briefly summarized here. From the p-value sequence obtained by running Algorithm 1, a new sequence can be constructed through a proper 'betting function' (which satisfies certain properties specified below). When no change happens, the newly constructed sequence is very likely to stay in a bounded region, since it is a valid martingale; otherwise the sequence will exhibit a growing or decreasing trend, because the p-values are no longer uniformly distributed in $[0,1]$ (as implied by Theorem 1, lack of exchangeability yields non-uniformly distributed p-values) and start concentrating around a small region.
2.1 Multiplicative martingale
In [Vovk et al.2003, Ho2005, Fedorova et al.2012], the authors proposed the exchangeability martingale (which we will refer to as the multiplicative martingale in this work) for the exchangeability test. The idea is summarized as follows. For the sequence $p_1, p_2, \dots$ generated by Algorithm 1, consider the following random sequence
$$M_n = \prod_{i=1}^{n} f_i(p_i), \tag{4}$$
where $f_i(\cdot)$ is called the betting function, which satisfies
$$f_i(p) \ge 0, \qquad \int_0^1 f_i(p)\,dp = 1.$$
From this, it follows that
$$\mathbb{E}[M_n \mid M_1, \dots, M_{n-1}] = M_{n-1} \int_0^1 f_n(p)\,dp = M_{n-1}. \tag{5}$$
Therefore $\{M_n\}$ is a valid martingale sequence according to Definition 1. Different betting functions have been suggested for multiplicative martingales; three typical ones are summarized as follows.
Power Martingale. It uses a fixed power function as the betting function
$$f_\epsilon(p) = \epsilon\, p^{\epsilon - 1},$$
where $\epsilon \in [0,1]$. Therefore, the power martingale for a given $\epsilon$ is written as
$$M_n^{(\epsilon)} = \prod_{i=1}^{n} \epsilon\, p_i^{\epsilon - 1}. \tag{6}$$
Mixture Power Martingale. It uses a mixture of power martingales over different $\epsilon$ values:
$$M_n = \int_0^1 M_n^{(\epsilon)}\,d\epsilon. \tag{7}$$
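For concreteness, eqs. (6) and (7) might be computed as follows. This is a sketch under our own numerical choices: the computation is done in the log scale for stability, and the mixture integral is approximated by an average over a grid of $\epsilon$ values:

```python
import numpy as np

def log_power_martingale(pvals, eps):
    # log M_n^(eps) = sum_{i<=n} [log(eps) + (eps - 1) * log(p_i)]   (eq. 6)
    p = np.asarray(pvals)
    return np.cumsum(np.log(eps) + (eps - 1.0) * np.log(p))

def log_mixture_martingale(pvals, n_grid=99):
    # Approximate the log of eq. (7) by averaging M_n^(eps) over a grid,
    # using a log-sum-exp trick so large martingale values do not overflow.
    eps_grid = np.linspace(0.01, 0.99, n_grid)
    logms = np.stack([log_power_martingale(pvals, e) for e in eps_grid])
    m = logms.max(axis=0)
    return m + np.log(np.mean(np.exp(logms - m), axis=0))

rng = np.random.default_rng(1)
uniform_p = rng.uniform(size=200)         # exchangeable case
skewed_p = rng.beta(0.3, 1.0, size=200)   # small p-values, as after a change
m_unif = log_mixture_martingale(uniform_p)
m_skew = log_mixture_martingale(skewed_p)
```

With skewed (small) p-values the log-martingale grows rapidly, while under uniform p-values it stays near its starting level.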
Plug-In Martingale. It uses an empirical p.d.f. of the p-values as the betting function. It has been shown in [Fedorova et al.2012] that the plug-in martingale is more efficient, in the sense that the martingale value changes more rapidly when a changepoint happens. To construct the empirical p.d.f., a modified kernel density estimator is used therein.
Due to the unboundedness of the power function and the multiplicative construction, it is inconvenient to adapt Hoeffding-Azuma type concentration inequalities to the multiplicative martingale when designing tests for detecting changes in data streams; see the discussions in [Ho and Wechsler2005] and [Ho2005]. These shortcomings of the multiplicative martingale motivate us to consider the 'additive martingale' as an alternative, which will be elaborated in the next section.
3 Additive martingale
As pointed out in [Denis et al.2017], 'it is interesting whether there are any other test exchangeability martingales apart from the conformal martingales' (i.e. the one defined in eq. (4)). In this section, we present the additive martingale, which addresses some of the issues of the multiplicative martingale.
3.1 Basic idea
The additive martingale is inherently related to the multiplicative martingale, and their connection can be elucidated through the following reasoning. Taking the logarithm of both sides of eq. (4), we get
$$\log M_n = \sum_{i=1}^{n} \log f_i(p_i). \tag{8}$$
What we hope for is that $\log M_n$ will be a valid martingale sequence (since we want to obtain a martingale in the 'additive' sense), or equivalently, that
$$\mathbb{E}[\log f_n(p_n)] = \int_0^1 \log f_n(p)\,dp = 0. \tag{9}$$
However, in the multiplicative martingale case, the betting function is chosen to satisfy
$$\int_0^1 f_n(p)\,dp = 1 \tag{10}$$
and $f_n(p) \ge 0$. Noting that the function $\log(\cdot)$ is concave, by Jensen's inequality we get
$$\mathbb{E}[\log f_n(p_n)] \le \log \mathbb{E}[f_n(p_n)] = \log \int_0^1 f_n(p)\,dp = 0.$$
Remark.
The previous reasoning shows that, instead of being a martingale, the sequence $\{\log M_n\}$ is actually a supermartingale. This explains the decreasing trend of the multiplicative martingale values (in the log scale) when no change happens, as illustrated in Figure 1(d).
To mitigate this problem (i.e. to obtain a valid martingale), we can directly enforce the betting functions to integrate to zero: for $n = 1, 2, \dots$, let $g_n : [0,1] \to \mathbb{R}$ be defined such that
$$\int_0^1 g_n(p)\,dp = 0. \tag{11}$$
Then, defining $S_n = \sum_{i=1}^{n} g_i(p_i)$, we have:
$$\mathbb{E}[S_n \mid S_1, \dots, S_{n-1}] = S_{n-1} + \int_0^1 g_n(p)\,dp = S_{n-1},$$
therefore $\{S_n\}$ becomes a valid martingale.
3.2 Betting functions for additive martingale
In what follows, we give two betting function constructions that yield valid additive martingales.
3.2.1 Shifted odd functions
By definition, any odd function $h(\cdot)$, i.e. one with $h(-x) = -h(x)$, will satisfy
$$\int_{-1/2}^{1/2} h(x)\,dx = 0,$$
from which it follows that
$$\int_0^1 h\Big(p - \frac12\Big)\,dp = 0.$$
This simple fact implies that $g(p) = h(p - \tfrac12)$ will be a valid betting function for any odd function $h(\cdot)$. One example is to let $h(x) = x$, which gives $g(p) = p - \tfrac12$; more betting functions can easily be constructed by picking different odd functions.
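A minimal sketch of the resulting additive martingale with the betting function $g(p) = p - 1/2$ (the simulated p-value streams are our own illustrative stand-ins for Algorithm 1's output):

```python
import numpy as np

def additive_martingale(pvals, g=lambda p: p - 0.5):
    # S_n = sum_{i<=n} g(p_i); since g integrates to 0 on [0,1],
    # S_n is a martingale while the p-values stay uniform.
    return np.cumsum(g(np.asarray(pvals)))

rng = np.random.default_rng(2)
p_before = rng.uniform(size=300)          # exchangeable segment
p_after = rng.beta(0.3, 1.0, size=300)    # post-change: p-values near 0
S = additive_martingale(np.concatenate([p_before, p_after]))
# S hovers around 0 on the first segment and drifts once the change begins.
```

Because the increments are bounded (here $|g(p)| \le 1/2$), this construction is directly amenable to the concentration inequalities used in Section 5.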
3.2.2 Shifted empirical probability density function
From the p-values calculated by Algorithm 1, an empirical probability density function of the p-values can be obtained (a computationally efficient way will be discussed later on), which we denote as $\hat\rho_n(\cdot)$ at time $n$. Based on it, since $\hat\rho_n$ integrates to one on $[0,1]$, it can be readily checked that a valid betting function can be formulated as
$$g_n(p) = \hat\rho_n(p) - 1. \tag{12}$$
This construction is not only valid; in fact, it will give a rapid and smooth change in the martingale sequence when a changepoint happens in the data sequence (see the experimental result in Figure 1(b)). Next we explain this observation. To this end, we consider betting functions of the form $g(p) = q(p) - 1$, where $q(\cdot)$ is a p.d.f. on $[0,1]$, and define the following optimization problem:
$$\max_{q \,\ge\, 0,\ \int_0^1 q(p)\,dp \,=\, 1} \quad \int_0^1 \rho(p)\,q(p)\,dp \;-\; 1 \;-\; \lambda \int_0^1 q^2(p)\,dp. \tag{13}$$
The objective function in eq. (13) consists of two parts: the first part, $\int_0^1 \rho(p)q(p)\,dp - 1 = \int_0^1 \rho(p)g(p)\,dp$, represents the expected increment of the martingale sequence value at each step, when the betting function $g = q - 1$ is used and the underlying p.d.f. of the p-values is $\rho(\cdot)$; the second term, $\int_0^1 q^2(p)\,dp$, penalizes the lack of 'flatness' (or 'regularness') of the betting function, with the trade-off controlled by $\lambda \ge 0$.
To make better sense of the optimization problem, we analyze the following two extreme cases: $\lambda \to 0$ and $\lambda \to \infty$.
When $\lambda \to 0$, since $\rho(\cdot)$ and $q(\cdot)$ are both nonnegative functions, the problem in eq. (13) can be reduced to
$$\max_{q \,\ge\, 0,\ \int_0^1 q(p)\,dp\,=\,1} \ \int_0^1 \rho(p)\,q(p)\,dp. \tag{14}$$
Assume that $\rho(\cdot)$ is upper-bounded by $\rho_{\max}$ and that at the point $p^*$ we have $\rho(p^*) = \rho_{\max}$; then it follows that
$$\int_0^1 \rho(p)\,q(p)\,dp \ \le\ \rho_{\max} \int_0^1 q(p)\,dp \ =\ \rho_{\max}, \tag{15}$$
and the maximum value can be obtained when $q(p)$ is set to be $\delta(p - p^*)$, where $\delta(\cdot)$ denotes the Dirac delta function, which is extremely peaky. More concretely, when $q(p) = \delta(p - p^*)$, the increment $g(p_i) = q(p_i) - 1$ is very large whenever $p_i$ falls near $p^*$ and equals $-1$ otherwise.
Figure 1 illustrates how the martingale $S_n$ changes over time when the betting function is built from a Dirac delta function (approximated by a Gaussian p.d.f. with a very small variance). As shown in Figure 1(b), the martingale can reach very high values when the p-values are not uniformly distributed. However, even when the p-values are uniformly distributed, the martingale sequence still has high variation and may end up far from its initial point, as can be observed from Figure 1(a). This is not ideal for changepoint detection, since it may increase the chance of false alarms.

Let us now discuss the case $\lambda \to \infty$. In this situation, the problem in eq. (13) reduces to
$$\min_{q \,\ge\, 0,\ \int_0^1 q(p)\,dp\,=\,1} \ \int_0^1 q^2(p)\,dp. \tag{16}$$
By the Cauchy-Schwarz inequality, we have that
$$\int_0^1 q^2(p)\,dp \ \ge\ \Big(\int_0^1 q(p)\,dp\Big)^2 \ =\ 1,$$
where the equality holds when $q(p) = 1$ for all $p \in [0,1]$. Given these calculations, in the case $\lambda \to \infty$, the optimal solution to eq. (13) is given by the uniform density on the interval $[0,1]$, which is the most 'regular' function.
The previous discussion implies that a proper choice of $\lambda$ will give a satisfactory balance between the one-step increment and the 'regularness' (by which we mean that the sequence does not make big jumps) of the martingale sequence. Next, we show that, when choosing
$$\lambda = \frac12,$$
the optimal solution to eq. (13) is given by $q^* = \rho$, which gives that the corresponding betting function is $g^*(p) = \rho(p) - 1$.
Note that when $\lambda$ is chosen as $\tfrac12$, completing the square (equivalently, applying the Cauchy-Schwarz inequality) yields
$$\int_0^1 \rho(p)q(p)\,dp - 1 - \frac12 \int_0^1 q^2(p)\,dp \;=\; \frac12 \int_0^1 \rho^2(p)\,dp \;-\; 1 \;-\; \frac12 \int_0^1 \big(q(p) - \rho(p)\big)^2\,dp, \tag{17}$$
where the last term is nonpositive and attains its maximum (which is zero) exactly when $q = \rho$. Given the fact that both $q$ and $\rho$ integrate to 1 on $[0,1]$, $q = \rho$ is a feasible choice, and it implies that the optimum of the optimization problem in eq. (13) is $q^* = \rho$.
4 Estimating the p-value distribution with a Beta distribution
Given the importance of the pvalue density function in constructing efficient additive martingales, in what follows, we will discuss a computationally efficient way to build up an approximation of the pvalue density function.
According to Theorem 1, we know that when a changepoint happens, the p-values will not be uniformly distributed within $[0,1]$. Typically, the distribution will be skewed with a single mode. This observation inspires us to model the p-value density function with a Beta distribution, which is defined as follows.
4.1 Beta distribution
Definition 3 (Beta distribution).
The beta distribution $\mathrm{Beta}(\alpha, \beta)$, parametrized by two positive shape parameters $\alpha$ and $\beta$, defines a family of continuous probability distributions on $[0,1]$, given as
$$f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)},$$
where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ and $\Gamma(\cdot)$ denotes the Gamma function.
Note that when $\alpha = \beta = 1$, it gives the uniform distribution on $[0,1]$. When both $\alpha$ and $\beta$ are greater than one, an imbalanced choice of $\alpha$ and $\beta$ will give a skewed, unimodal density function, which is of particular interest to us since it is useful for modelling the skewed p-value distribution with a single mode.
Remark.
There exist nonparametric approaches [Tsybakov2009] for estimating density functions, for example histogram and kernel based density estimators. For these estimators, the optimal choice of the number of bins or the kernel bandwidth depends on knowledge of the underlying p.d.f., which is often unavailable. The Beta parametric approach presents an alternative for the case when a single mode appears in the p-value distribution: the parameters are easy to estimate, and the estimation can be done in an online fashion, as explained in the next part.
4.2 Parameter estimation
In [Bain and Engelhardt1992], a moment-matching based method for estimating the shape parameters $\alpha$ and $\beta$ was proposed. A notable feature of this approach is that it only involves the calculation of the sample mean and variance. More concretely, assume we are given a set of p-values $\{p_1, \dots, p_n\}$; then $\hat\alpha$ and $\hat\beta$ (the estimated parameters) can be calculated as follows:
$$\hat\alpha = \bar p \left(\frac{\bar p(1-\bar p)}{s^2} - 1\right), \tag{18}$$
$$\hat\beta = (1-\bar p) \left(\frac{\bar p(1-\bar p)}{s^2} - 1\right), \tag{19}$$
in which $\bar p$ and $s^2$ denote the sample mean and variance respectively, which are defined as
$$\bar p = \frac{1}{n}\sum_{i=1}^{n} p_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (p_i - \bar p)^2.$$
Online calculation of $\bar p$ and $s^2$ can be found in [Welford1962]. Concretely, the sample mean is updated as
$$\bar p_n = \bar p_{n-1} + \frac{p_n - \bar p_{n-1}}{n}.$$
And for updating the sample variance, we can first update the sum of squared deviations $T_n = \sum_{i=1}^{n}(p_i - \bar p_n)^2$ recursively as follows
$$T_n = T_{n-1} + (p_n - \bar p_{n-1})(p_n - \bar p_n), \tag{20}$$
with which we get
$$s_n^2 = \frac{T_n}{n-1}.$$
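Eqs. (18)-(20) can be combined into a small online estimator; the following is a sketch (the class name and the simulated test stream are our own):

```python
import numpy as np

class OnlineBetaEstimator:
    """Running moment-matching estimate of Beta(alpha, beta) from streaming
    p-values, using Welford's recurrences for the sample mean and variance."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.T = 0.0   # running sum of squared deviations from the mean

    def update(self, p):
        self.n += 1
        delta = p - self.mean
        self.mean += delta / self.n          # mean update
        self.T += delta * (p - self.mean)    # eq. (20)

    def params(self):
        var = self.T / (self.n - 1)
        common = self.mean * (1.0 - self.mean) / var - 1.0
        return self.mean * common, (1.0 - self.mean) * common  # eqs. (18)-(19)

rng = np.random.default_rng(3)
est = OnlineBetaEstimator()
for p in rng.beta(2.0, 5.0, size=20000):
    est.update(p)
a_hat, b_hat = est.params()
```

Each update is O(1) in time and memory, which is what makes the plug-in betting function of eq. (12) cheap to maintain on a stream.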
Remark.
Note that when a sliding window (with size $W$) is introduced, the sample mean $\bar p_n$ and sample variance $s_n^2$, defined as
$$\bar p_n = \frac{1}{W}\sum_{i=n-W+1}^{n} p_i, \qquad s_n^2 = \frac{1}{W-1}\sum_{i=n-W+1}^{n} (p_i - \bar p_n)^2,$$
in which $n \ge W$, can similarly be calculated in an online manner; the main steps can be found in the Appendix.
5 Designing tests for changepoint detection
In this section, we will apply the Hoeffding-Azuma inequality and the Doob-Kolmogorov inequality to the additive martingale sequence to develop statistical tests for changepoint detection. The general idea is that, when no changepoint appears, the martingale sequence will stay within a certain region with high probability. When the sequence exceeds this region, it is very likely that a changepoint has occurred, and an alarm is triggered.
Theorem 2 (Hoeffding-Azuma inequality).
Let $c_1, \dots, c_n$ be constants and let $Y_1, \dots, Y_n$ be a martingale difference sequence with $|Y_i| \le c_i$ for each $i$. Then for any $t \ge 0$, we have
$$P\left(\Big|\sum_{i=1}^{n} Y_i\Big| \ge t\right) \le 2\exp\left(\frac{-t^2}{2\sum_{i=1}^{n} c_i^2}\right). \tag{21}$$
In the additive martingale case, when the betting function is chosen as the shifted odd function with $h(x) = x$, i.e. $g(p) = p - \tfrac12$, we have $|g(p_i)| \le \tfrac12$, and eq. (21) reduces to
$$P\big(|S_n| \ge t\big) \le 2\exp\left(\frac{-2t^2}{n}\right),$$
where
$$S_n = \sum_{i=1}^{n} \Big(p_i - \frac12\Big).$$
This fact can be used to design statistical tests for $H_0$ ($H_0$: the p-values follow a uniform distribution in $[0,1]$, i.e. no changepoint appears). More specifically, given the significance level $\delta$, when
$$|S_n| \ \ge\ \sqrt{\frac{n}{2}\ln\frac{2}{\delta}},$$
we reject the hypothesis $H_0$ and trigger an alarm.
In practice, a sliding window (assume the window size is $W$) can be introduced to track the change more rapidly. Given the significance level $\delta$, and after similar calculations as done before, we have that when
$$\left|\sum_{i=n-W+1}^{n} \Big(p_i - \frac12\Big)\right| \ \ge\ \sqrt{\frac{W}{2}\ln\frac{2}{\delta}}, \tag{22}$$
the hypothesis $H_0$ will be rejected at significance level $\delta$.
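A sketch of the resulting sliding-window test follows; the window size, significance level, and simulated p-value streams are illustrative choices of ours:

```python
import numpy as np

def hoeffding_azuma_alarms(pvals, W=100, delta=0.01):
    # Sliding-window sums of g(p) = p - 1/2; each increment is bounded by
    # c_i = 1/2, so eq. (22) gives the threshold sqrt(W/2 * ln(2/delta)).
    g = np.asarray(pvals) - 0.5
    csum = np.concatenate([[0.0], np.cumsum(g)])
    window_sums = csum[W:] - csum[:-W]
    threshold = np.sqrt(0.5 * W * np.log(2.0 / delta))
    return np.abs(window_sums) >= threshold   # True entries trigger an alarm

rng = np.random.default_rng(4)
p = np.concatenate([rng.uniform(size=500),          # exchangeable segment
                    rng.beta(0.2, 1.0, size=500)])  # post-change segment
alarms = hoeffding_azuma_alarms(p)
```

The cumulative-sum trick makes each window sum O(1), so the whole test runs in linear time over the stream.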
Remark.
It can be observed that, due to the additive structure of the constructed martingale (in contrast to the multiplicative martingale case), it becomes much more convenient to apply the Hoeffding-Azuma inequality in the test design.
Inspired by [Ho2005], we can also design tests based on the following inequality.
Theorem 3 (Doob-Kolmogorov inequality).
Let $Y_1, Y_2, \dots, Y_n$ be a martingale difference sequence, and let $S_i = \sum_{j=1}^{i} Y_j$ for $i = 1, \dots, n$. Then for any $\epsilon > 0$, it follows that
$$P\left(\max_{1 \le i \le n} |S_i| \ge \epsilon\right) \le \frac{\mathbb{E}[S_n^2]}{\epsilon^2}. \tag{23}$$
Recall from Theorem 1 that the p-values generated by Algorithm 1 are independent of each other. Therefore, if the betting function is chosen independently of the p-values $p_i$, $i = n-W+1, \dots, n$, for instance $g(p) = p - \tfrac12$, then by calculating $\mathbb{E}[g^2(p_i)] = \int_0^1 (p - \tfrac12)^2\,dp = \tfrac{1}{12}$, the inequality in eq. (23) can be reduced to (assuming a sliding window with size $W$ is used)
$$P\left(\max_{n-W+1 \le i \le n} |S_i - S_{n-W}| \ge \epsilon\right) \le \frac{W}{12\,\epsilon^2},$$
which gives that $H_0$ will be rejected (at significance level $\delta$) when
$$\max_{n-W+1 \le i \le n} |S_i - S_{n-W}| \ \ge\ \sqrt{\frac{W}{12\,\delta}}.$$
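This test might be sketched as follows (window size, $\delta$, and the simulated windows are illustrative assumptions of ours):

```python
import numpy as np

def doob_kolmogorov_alarm(pvals_window, delta=0.05):
    # Within one window of size W, E[S_W^2] = W * E[(p - 1/2)^2] = W/12 under
    # uniform p-values, so eq. (23) yields the threshold sqrt(W / (12*delta)).
    S = np.cumsum(np.asarray(pvals_window) - 0.5)
    W = len(pvals_window)
    threshold = np.sqrt(W / (12.0 * delta))
    return bool(np.max(np.abs(S)) >= threshold)

rng = np.random.default_rng(5)
no_change = rng.uniform(size=200)        # exchangeable window
changed = rng.beta(0.2, 1.0, size=200)   # window of small p-values
alarm_quiet = doob_kolmogorov_alarm(no_change)
alarm_changed = doob_kolmogorov_alarm(changed)
```

Unlike the Hoeffding-Azuma test, this bound controls the running maximum over the whole window, so an excursion at any intermediate step can trigger the alarm.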
6 Evaluation
To showcase the properties of the proposed method, we conduct an experiment illustrating how both martingale sequences behave in the exchangeable and non-exchangeable cases. The setup is as follows: data in the first part of the sequence (before the vertical dashed line in Figure 2) are drawn i.i.d. from a Gaussian distribution with zero mean and unit variance; data in the second part are also drawn from a Gaussian distribution, but with a different mean from the first Gaussian. The second part of the sequence models the data after a change has happened. The distance to the nearest neighbour is used as the nonconformity measure, and Algorithm 2 is used for p-value calculation.

The results are reported in Figure 2. In all these figures, the x-axis represents time and the y-axis represents the martingale values (in the multiplicative martingale case, the martingale values in the log scale). Comparing subfigures (a) and (b), we can observe that the curves obtained with the estimated p.d.f. as the betting function achieve higher martingale values (which makes the changepoint detection more confident) than the ones with the shifted odd function as the betting function. Again, the reason is that the approach based on the estimated p.d.f. can adapt to the change and thus gain a larger one-step increment in the curve. In subfigure (b), we can also observe that, just after the changepoint appears, the curve increases slowly during the initial time steps, during which the algorithm is gaining knowledge of the changed p.d.f. of the p-values; after this phase, the curves increase very rapidly. In subfigures (c) and (d), obtained using the multiplicative martingales, the curves exhibit a decreasing trend during the period with no changepoint; after the changepoint appears, the curves start to increase, but take significant time to return to a high value. It is possible to post-process the curves, for example by applying a one-step finite difference filter to transform the 'v'-shaped curve into a step-like curve similar to the ones in subfigures (a) and (b), or by using another trick introduced in [Denis et al.2017]; however, such post-processing introduces additional complications for designing tests (for example, consider applying Hoeffding-Azuma type inequalities to the transformed sequence). In addition, during the period with no deviation, the curves in subfigure (b) show small variations compared to the curves in the other subfigures, which makes them less prone to triggering false alarms.
7 Future work
The proposed framework gives an alternative way for changepoint detection, with some advantages over existing approaches. There are still many questions left open for further research: 1) as discussed in the previous section, the change of the martingale sequence in subfigure (b) of Figure 2 is slow in the initial time steps when the size of the sliding window is large. However, as can be observed from the same figure, when the sliding window size is smaller, the curve responds more rapidly to the deviation. This observation leads us to consider designing adaptive strategies, for example using a smaller sliding window in the initial steps to increase the response speed to change, and gradually increasing the window size to gain more accurate information about the changed distribution in order to obtain a larger one-step increment; 2) there exist refinements of the basic Hoeffding-Azuma inequality, for example those presented in chapter 2 of [Maxim and Sason2015]. It will be interesting to see whether these more advanced concentration inequalities can give tighter bounds than the one in eq. (21) for a given significance level.
8 Appendix
For a sliding window of size $W$, the update rule for the sample mean is given as
$$\bar p_n = \bar p_{n-1} + \frac{p_n - p_{n-W}}{W}.$$
For the sample variance, we have
$$s_n^2 = \frac{1}{W-1}\sum_{i=n-W+1}^{n} (p_i - \bar p_n)^2 = \frac{1}{W-1}\left(\sum_{i=n-W+1}^{n} p_i^2 - W\bar p_n^2\right),$$
which gives that the windowed sum of squares $T_n = \sum_{i=n-W+1}^{n} p_i^2$ can be updated recursively as
$$T_n = T_{n-1} + p_n^2 - p_{n-W}^2,$$
with which we can conclude that
$$s_n^2 = \frac{1}{W-1}\left(T_n - W\bar p_n^2\right).$$
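The sliding-window recursions above can be checked numerically against the direct definitions; a quick sketch with synthetic data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(6)
p = rng.uniform(size=1000)
W = 100

# Recursive sliding-window mean and windowed sum of squares.
mean = p[:W].mean()
T = np.sum(p[:W] ** 2)
for n in range(W, len(p)):
    mean += (p[n] - p[n - W]) / W      # mean update
    T += p[n] ** 2 - p[n - W] ** 2     # sum-of-squares update
var = (T - W * mean ** 2) / (W - 1)    # windowed sample variance

# Direct computation on the last window for comparison.
window = p[-W:]
direct_mean = window.mean()
direct_var = window.var(ddof=1)
```

Each update is O(1), at the cost of keeping the last $W$ p-values in memory.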
References
[Bain and Engelhardt1992] Lee J. Bain and Max Engelhardt. Introduction to Probability and Mathematical Statistics. Duxbury Press, California, 1992.

[Denis et al.2017] Volkhonskiy Denis, Ilia Nouretdinov, Alexander Gammerman, Vladimir Vovk, and Evgeny Burnaev. Inductive conformal martingales for change-point detection. In Proceedings of Machine Learning Research — Conformal and Probabilistic Prediction and Applications, 60:1–22, 2017.
[Fedorova et al.2012] Valentina Fedorova, Alex Gammerman, Ilia Nouretdinov, and Vladimir Vovk. Plug-in martingales for testing exchangeability on-line. In Proceedings of the 29th International Conference on Machine Learning, pages 923–930, 2012.

[Ho and Wechsler2005] Shen-Shyang Ho and Harry Wechsler. On the detection of concept changes in time-varying data stream by testing exchangeability. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 267–274. AUAI Press, 2005.
[Ho2005] Shen-Shyang Ho. A martingale framework for concept change detection in time-varying data streams. In Proceedings of the 22nd International Conference on Machine Learning, pages 321–327. ACM, 2005.
[Maxim and Sason2015] Maxim Raginsky and Igal Sason. Concentration of measure inequalities in information theory. Now Publishers, 2015.
[Page1954] E. S. Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954.
 [Tsybakov2009] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, New York, 2009.
 [Vovk et al.2003] Vladimir Vovk, Ilia Nouretdinov, and Alexander Gammerman. Testing exchangeability online. In Proceedings of the 20th International Conference on Machine Learning, pages 768–775, 2003.
 [Vovk et al.2005] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer, New York, 2005.
 [Wald1945] Abraham Wald. Sequential tests of statistical hypotheses. Annals of Mathematical Statistics, 16(2):117–186, 1945.
 [Welford1962] B. P. Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419–420, 1962.