Change Detection with the Kernel Cumulative Sum Algorithm

03/05/2019 ∙ Thomas Flynn, et al. ∙ Brookhaven National Laboratory

Online change detection involves monitoring a stream of data for changes in the statistical properties of incoming observations. A good change detector will detect any changes shortly after they occur, while raising few false alarms. Although there are algorithms with confirmed optimality properties for this task, they rely on the exact specifications of the probability distributions involved and this limits their practicality. In this work we describe a kernel-based variant of the Cumulative Sum (CUSUM) change detection algorithm that can detect changes under less restrictive assumptions. Instead of using the likelihood ratio, which is a parametric quantity, the Kernel CUSUM (KCUSUM) algorithm relies on a statistic based on the Maximum Mean Discrepancy (MMD) non-parametric testing framework. The KCUSUM algorithm is applicable in settings where there is a large amount of background data available and it is desirable to detect a change away from this background setting. Exploiting the random-walk structure of the test statistic, we derive bounds on the performance of the algorithm, including the expected delay and the average time to false alarm.


I Introduction

In this work we are interested in the problem of detecting abrupt changes in streams of data. This could mean detecting a change in the average value of the observations, or a change in variance, or, more generally, finding a change in any other distributional property. In particular we are interested in online change detection, where the algorithm should determine that a change has occurred soon after it happens, without waiting for the entire data stream to be observed. Examples where this is relevant include intrusion detection and industrial quality control, among others.

In cases where sufficient prior knowledge of the change is available, there are known optimal algorithms for online change detection. If the probability distributions before and after the change are known, then the CUSUM procedure (shown in Algorithm 1) is known to be optimal for an objective function that takes into account the magnitude of delays and the frequency of false alarms [11, 14]. Aside from optimality, the CUSUM is also simple to program and has an intuitive interpretation in terms of maximum likelihood. However, there are many situations where the probability distributions involved cannot be modeled precisely, and it would be difficult to use the CUSUM in these cases.

Closely related to change detection is statistical hypothesis testing. In particular, in this work we attempt to leverage tools that have been developed for the problem of two-sample testing and adapt them for use in change detection. Two-sample testing is a non-parametric hypothesis testing task where the goal is to determine if two data sets come from the same distribution. For this problem, an approach based on kernel embeddings has been developed, termed Maximum Mean Discrepancy (MMD) [8]. In the MMD approach, two datasets are compared by computing the distance between the corresponding empirical measures, using a distance which is induced by a positive definite kernel function (see Section III for formal definitions). Compared to other distance measures, MMD distances are appealing because they admit very simple unbiased estimators. Furthermore, being defined through kernels, the methods are not restricted to Euclidean datasets, and are applicable to hypothesis testing problems involving strings, graphs, and other structured data [15].

Motivated by prior work using MMD in hypothesis testing, we introduce the Kernel Cumulative Sum (KCUSUM) algorithm (Algorithm 2). Unlike the CUSUM, the KCUSUM does not require exact specifications of the pre- and post-change distributions. Instead, it relies on a database of samples from the pre-change distribution, and continuously compares incoming observations with samples from the database using a kernel function chosen by the user. In this way, the KCUSUM is able to detect a change to any distribution whose distance from the pre-change distribution is above a user-supplied threshold. Our main theoretical results (Theorem IV.2 and Corollary IV.3) concern the delays and false alarms of the KCUSUM. We derive an upper bound on the time to detect a change (that is, the delay) and a lower bound on the time until a false alarm occurs when there is no change. The analysis builds on existing theory for the CUSUM [12, 11].

The rest of this paper is structured as follows. In Section II we review the basic notions of the CUSUM algorithm and in Section III we review the MMD framework. We introduce the KCUSUM algorithm in Section IV, where we also present the analysis. In Section V we present the results of a numerical evaluation.

II Cumulative Sum Algorithm

We consider a sequence of random variables $x_1, x_2, \ldots$ and assume that there is an index $\nu$ such that for $n < \nu$ the variables are distributed according to a distribution $P$, and for $n \geq \nu$, the $x_n$ are distributed according to the distribution $Q$. Presently we assume that the $x_n$ take values in some Euclidean space $\mathbb{R}^d$, although the kernel methods that we shall introduce are not restricted to this scenario. The index $\nu$ is referred to as the change point. An online change detection algorithm tries to identify this change point in real-time, and bases the decision of whether or not a change has occurred by time $n$ on the data available up to time $n$.

Fig. 1: Illustration of the CUSUM. For each observation (left) the log-likelihood ratio is computed (middle) and added to a running sum (right). See main text for details.

The Cumulative Sum algorithm is an online change detection procedure introduced by Page [16]. For the purposes of introducing the CUSUM, assume that the distributions $P$ and $Q$ have densities $p$ and $q$ respectively. The steps of the CUSUM are presented in Algorithm 1. At step $n$ of the procedure a data point $x_n$ is observed, the log-likelihood ratio $\log(q(x_n)/p(x_n))$ is calculated, and the result is added onto the statistic $z_{n-1}$. If the result would be negative then $z_n$ is set to zero, which has the effect of restarting the algorithm. If $z_n$ crosses a threshold $h$ then a change is declared at time $n$.

For some insight into why the CUSUM works, consider the behavior of the log-likelihood ratio before and after the change. Before the change, it has mean $-d_{KL}(P, Q)$, where $d_{KL}(P, Q)$ is the Kullback-Leibler divergence between the distributions $P$ and $Q$. Since $d_{KL}(P, Q)$ is positive when $P \neq Q$, the increment term will have a negative mean. This drift combined with the barrier at zero causes the statistic $z_n$ to stay near zero before the change. After the change, the increment has a positive mean equal to $d_{KL}(Q, P)$, and $z_n$ begins to increase, eventually crossing any positive threshold $h$ with probability one, which will cause the algorithm to end. Beyond these heuristic arguments, the CUSUM can be shown to be optimal in a certain sense, as we review below.

input: Data $x_1, x_2, \ldots$ and threshold $h$.
initialize $z_0 = 0$
for $n = 1, 2, \ldots$ do
  $z_n = \max\{0, z_{n-1} + \log(q(x_n)/p(x_n))\}$.
  if $z_n \geq h$ then set $T = n$ and exit.
  else continue.
end
Algorithm 1 CUSUM Algorithm [16]
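
To make the recursion concrete, here is a minimal Python sketch of Algorithm 1 (our own illustration; the densities, threshold, and sequence below are assumptions chosen for demonstration, not the paper's experimental settings):

import numpy as np

def cusum(xs, llr, h):
    # Algorithm 1: z_n = max(0, z_{n-1} + llr(x_n)); alarm when z_n >= h.
    z = 0.0
    for n, x in enumerate(xs, start=1):
        z = max(0.0, z + llr(x))
        if z >= h:
            return n  # declared change time T
    return None  # no alarm on this sequence

# Illustrative use: variance change in a Gaussian stream (cf. Example II.1).
# Pre-change N(0, sigma0^2), post-change N(0, sigma1^2); the log-likelihood
# ratio is log(sigma0/sigma1) + 0.5 * (1/sigma0^2 - 1/sigma1^2) * x^2.
sigma0, sigma1 = 1.0, 1.5
llr = lambda x: np.log(sigma0 / sigma1) + 0.5 * (1 / sigma0**2 - 1 / sigma1**2) * x**2
rng = np.random.default_rng(0)
xs = np.concatenate([rng.normal(0, sigma0, 500), rng.normal(0, sigma1, 500)])
print(cusum(xs, llr, h=10.0))  # alarm time; typically soon after the change at n = 501
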
Example II.1

Denote by $\mathcal{N}(\mu, \sigma^2)$ the normal distribution with mean $\mu$ and variance $\sigma^2$. Consider detecting a change in variance of normally distributed random variables, where the pre-change distribution is $\mathcal{N}(0, \sigma_0^2)$ and the post-change distribution is $\mathcal{N}(0, \sigma_1^2)$ with $\sigma_1 > \sigma_0$. A sample sequence with a change point part way through is shown on the left in Fig. 1, with the true change time marked by a dashed line. The log-likelihood ratio in this case is $\log(\sigma_0/\sigma_1) + \frac{x^2}{2}(\sigma_0^{-2} - \sigma_1^{-2})$. Its values are plotted in the middle of Fig. 1. We can see that the log-likelihood ratio has a negative mean before the change and a positive mean after the change. The resulting CUSUM statistic $z_n$ is shown on the right of Fig. 1, and the change is detected at the first time $z_n$ crosses the threshold $h$, shortly after the true change time.

Next we review the performance characteristics of the CUSUM. Each possible change time $\nu$ defines a different distribution on the sequences $(x_1, x_2, \ldots)$. If there is no change, then the variables are independent and identically distributed (i.i.d.) with $x_n \sim P$ for all $n$. We denote this distribution on sequences by $P_\infty$, and denote expectations with respect to this distribution by $E_\infty$. In general, a change at time $\nu$ means that for $n < \nu$ the $x_n$ are i.i.d. with $x_n \sim P$ and for $n \geq \nu$ they are i.i.d. with $x_n \sim Q$. We let $P_\nu$ denote the probability distribution on sequences under the assumption of a change at time $\nu$, and $E_\nu$ represents the expectation under this distribution. For $n \geq 1$ let $\mathcal{F}_n$ be the $\sigma$-algebra $\sigma(x_1, \ldots, x_n)$. Intuitively, $\mathcal{F}_n$ represents the information contained in the observations up to and including time $n$. Formally, an online change detector can be represented as a stopping time with respect to the filtration $\{\mathcal{F}_n\}_{n \geq 1}$, that we denote by $T$, with the interpretation that the value of $T$ is an estimate of the change point.

When running a change detector on a particular sequence, two types of errors may occur. There may be a false alarm, which means the change is detected too early, or there may be a delay, meaning the change was detected late. We formalize the levels of false alarm and delay using the metrics of average time to false alarm (TTFA) and worst case delay (WCD). These are standard metrics for evaluating change detectors [11, 19, 13].

For a change detector $T$, the time to false alarm (TTFA) is defined as

$\mathrm{TTFA}(T) = E_\infty[T].$ (1)

That is, the TTFA is the average amount of time until a change is declared, given a sequence of observations with no change.

We measure delay using Lorden's criterion [11]. If $T$ is a change detector and there is a change at time $\nu$, then the expected delay given the history of the observations up to time $\nu - 1$ is the random variable $E_\nu[(T - \nu + 1)^+ \mid \mathcal{F}_{\nu-1}]$. (The notation $(a)^+$ refers to the positive part function: $(a)^+ = \max\{a, 0\}$.) The worst case delay for a change at time $\nu$ is obtained by taking an essential supremum over all possible sequences of length $\nu - 1$, denoted $\operatorname{ess\,sup}$. Finally, taking the supremum over all change times we obtain the worst case delay (WCD):

$\mathrm{WCD}(T) = \sup_{\nu \geq 1}\, \operatorname{ess\,sup}\, E_\nu[(T - \nu + 1)^+ \mid \mathcal{F}_{\nu-1}].$ (2)

Notably, the CUSUM provides the optimal trade-off between the time to false alarm and the worst case delay. This was first proved in an asymptotic form in [11], and the result was later proved in full non-asymptotic form in [14]; proofs of optimality can be found in [14, 19]. Further information on the derivation and properties of the CUSUM may be found in [2] or [19].

The precise relation between the threshold $h$ and the performance levels TTFA and WCD is non-trivial and involves solving numerically intractable integral equations [16, 13]. However, it is possible to derive some upper and lower bounds that may be useful in practice. For the sake of comparison with the analysis of the Kernel CUSUM, it will be interesting to consider the following quantitative bound on the performance of the CUSUM.

Proposition II.2

The performance of the CUSUM stopping time $T$ (Algorithm 1) can be bounded as follows. The time to false alarm satisfies

$\mathrm{TTFA}(T) \geq e^h.$

If it also holds that $E_Q[((\log(q(x)/p(x)))^+)^2] < \infty$ then

$\mathrm{WCD}(T) \leq \frac{h}{d_{KL}(Q, P)} + \frac{E_Q[((\log(q(x)/p(x)))^+)^2]}{d_{KL}(Q, P)^2}.$

See the appendix for the proof. The intuitive interpretation of these equations is that increasing the threshold $h$ causes an increase in the time until false alarm, but it also leads to increased detection delay. From the second equation, we can see that the detection delay increases when the distributions get closer, that is, as $d_{KL}(Q, P)$ shrinks. The term involving the positive part of the log-likelihood ratio is related to the variance of the CUSUM statistic. In Corollary IV.3 below we shall obtain somewhat analogous bounds for the Kernel CUSUM.
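
As a concrete illustration of these bounds (a worked example of our own, not from the original analysis), take a unit-variance Gaussian mean shift: $P = \mathcal{N}(0, 1)$ and $Q = \mathcal{N}(\mu, 1)$. Then

$\ell(x) = \log\frac{q(x)}{p(x)} = \mu x - \frac{\mu^2}{2}, \qquad d_{KL}(Q, P) = \frac{\mu^2}{2}, \qquad E_Q[\ell(x)^2] = \mu^2 + \frac{\mu^4}{4}.$

To guarantee $\mathrm{TTFA}(T) \geq \gamma$ it suffices to take $h = \log \gamma$, and since $E_Q[(\ell(x)^+)^2] \leq E_Q[\ell(x)^2]$, the delay bound becomes

$\mathrm{WCD}(T) \leq \frac{2 \log \gamma}{\mu^2} + \frac{\mu^2 + \mu^4/4}{\mu^4/4} = \frac{2 \log \gamma + 4}{\mu^2} + 1,$

which makes explicit both the logarithmic dependence on the false alarm level $\gamma$ and the deterioration of the delay as $\mu \to 0$.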

III Maximum Mean Discrepancy

Two-sample testing refers to the problem of determining whether two datasets are drawn from the same distribution. One approach to this problem is to consider the empirical measures defined by the datasets, and to compute the distance between the empirical measures using a probability metric. If enough data points are used, then the empirical distance should be close to the true distance. If the distance is large then we can be confident that the datasets are generated by different distributions. This is the idea underlying several classical tests, such as the Kolmogorov-Smirnov test [4], the Cramér-von Mises test [1] and the Anderson-Darling test [17].

The Maximum Mean Discrepancy (MMD) approach is also based on computing the distance between empirical distributions. In MMD, the datasets are implicitly embedded in a Reproducing Kernel Hilbert Space (RKHS) corresponding to a user-supplied kernel function $k$, and the distance between the embeddings is computed [8]. Compared to the classical approaches, there are several features of MMD that make it appealing for non-parametric statistics. First, the MMD distance has a range of simple unbiased estimators (see the definition of $\mathrm{MMD}_l^2$ below for one such example). Second, there is the flexibility offered by the choice of kernel, and using kernels means the test can be applied to datasets without a natural Euclidean representation, such as strings, graphs and other structured data [7, 21].

Let $\mathcal{X}$ be a set, and let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a kernel on this set; this is a symmetric, positive definite function that we regard intuitively as a similarity measure. (Symmetric means that $k(x, y) = k(y, x)$, and positive definite means that for any choice of elements $x_1, \ldots, x_n$ and real numbers $c_1, \ldots, c_n$, we have $\sum_{i,j} c_i c_j k(x_i, x_j) \geq 0$.) The reader may have in mind the set $\mathcal{X} = \mathbb{R}^d$ and the Gaussian kernel

$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right).$ (3)

Further suppose that $\mathcal{X}$ has the structure of a measurable space and that $k$ is a measurable function on $\mathcal{X} \times \mathcal{X}$ with the product $\sigma$-algebra. Define $\mathcal{P}$ to be the set of all probability measures on $\mathcal{X}$, and, using the kernel $k$, define the subset $\mathcal{P}_k = \{P \in \mathcal{P} : E_{x \sim P}[\sqrt{k(x, x)}] < \infty\}$. If the kernel has the additional property of being characteristic (we refer the reader to [8] or [15] for a precise definition of characteristic kernel; for instance, if $\mathcal{X} = \mathbb{R}^d$ then the Gaussian kernel is characteristic) then we may define the MMD metric on $\mathcal{P}_k$, denoted $d_k$. This metric can be defined through kernel expectations:

$d_k(P, Q)^2 = E[k(x, x')] - 2\,E[k(x, y)] + E[k(y, y')],$

where $x, x' \sim P$ and $y, y' \sim Q$ are drawn independently. See [8] for more details.

One of the unbiased estimators of $d_k(P, Q)^2$ presented in [8] is the linear statistic $\mathrm{MMD}_l^2$ defined below. For convenience, the estimator is expressed using the following function $g$:

$g(x, x', y, y') = k(x, x') + k(y, y') - k(x, y') - k(x', y).$ (4)

Consider two data sets $\{x_1, \ldots, x_n\}$ and $\{y_1, \ldots, y_n\}$, where $n$ is even. Then $\mathrm{MMD}_l^2$ is

$\mathrm{MMD}_l^2 = \frac{2}{n} \sum_{i=1}^{n/2} g(x_{2i-1}, x_{2i}, y_{2i-1}, y_{2i}).$ (5)

The linear statistic is interesting as it is a sum of i.i.d. terms, which means the central limit theorem may be used to approximate its distributional properties; this can help in tuning the thresholds when MMD is used for hypothesis testing [8]. Furthermore, in the online setting we can study the trajectory of the statistic (that is, as a function of $n$) using the theory of sums of i.i.d. random variables, greatly facilitating analysis. We will exploit this structure in the analysis of the KCUSUM below.
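
The estimator translates directly into code. The following Python sketch (our illustration; the Gaussian bandwidth and the test data are assumptions) computes the statistic of Equation (5):

import numpy as np

def gauss_kernel(x, y, sigma=1.0):
    # Gaussian kernel of Equation (3): k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y))**2) / (2 * sigma**2))

def g(x1, x2, y1, y2, k=gauss_kernel):
    # Core function of Equation (4); E[g] = d_k(P, Q)^2 for independent draws.
    return k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)

def mmd_linear(xs, ys, k=gauss_kernel):
    # Linear statistic of Equation (5): an unbiased estimate of d_k(P, Q)^2
    # built from n/2 disjoint (hence i.i.d.) blocks of two samples each.
    n = (min(len(xs), len(ys)) // 2) * 2
    return np.mean([g(xs[2*i], xs[2*i+1], ys[2*i], ys[2*i+1], k)
                    for i in range(n // 2)])

# Sanity check: the estimate is near zero when both samples share a distribution.
rng = np.random.default_rng(1)
print(mmd_linear(rng.normal(0, 1, (1000, 2)), rng.normal(0, 1, (1000, 2))))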

III-A Other related work

The field of online change detection has its roots in industrial quality control [20], and sequential analysis [22]. In particular, the CUSUM procedure is closely related to the sequential probability ratio test (SPRT), which is a foundational online hypothesis testing algorithm [22].

An alternative to online change detection is offline detection, where the algorithms do not run until the entire sequence is observed, and all data is used to make a decision. Kernel offline change point analysis was explored in [9]. In that scheme, every instance of time is considered as a possible change point. Each hypothetical change time is used to partition the dataset into two groups, consisting of prior observations and later observations, and the two resulting datasets are compared using a kernel based test. This is repeated for each possible change time. If one of the comparisons yields a significant discrepancy between the pre- and post- observation datasets, then a change is declared at that observation where the difference was largest. This procedure is of interest as it does not use reference data, instead basing its decisions on comparisons between disjoint sets of observations.

A number of approaches to non-parametric online change detection have been proposed in [3]. As given, they apply only to the case of detecting changes in mean. One straightforward method of applying kernel non-parametric tests to online change detection is a sliding window approach, as pursued in [10]. In this approach, at each time a decision regarding the change is made based on the MMD distance between the most recent fixed-size block of data and a block from the reference distribution. This can be seen as a kernel-based generalization of the Shewhart control charts [20]. A simplified sketch appears below.
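
The following is a simplified caricature of such a sliding-window scheme (our own sketch, reusing the mmd_linear estimator from Section III; the window size and threshold are illustrative assumptions, and [10] develops a properly calibrated statistic rather than this raw thresholding):

import numpy as np  # assumes mmd_linear from the Section III sketch is in scope

def sliding_window_detector(stream, reference, window, threshold):
    # Compare the most recent `window` observations against a same-sized
    # block of reference data; alarm when the MMD estimate exceeds the threshold.
    buf = []
    for n, x in enumerate(stream, start=1):
        buf.append(x)
        if len(buf) > window:
            buf.pop(0)
        if len(buf) == window:
            if mmd_linear(np.array(buf), np.asarray(reference)[:window]) > threshold:
                return n
    return None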

Besides the CUSUM, there are other algorithms with optimality properties, notably the Shiryaev-Roberts (SR) change detector [18]. While the CUSUM minimizes the worst case delay (WCD), the SR detector minimizes a form of average delay. Like the CUSUM, the SR test statistic admits a simple recursive form and hence it may be possible to extend our algorithmic construction to SR-type detectors as well, but this is outside the scope of this paper.

IV The Kernel CUSUM Algorithm

Fig. 2: Illustration of the Kernel CUSUM. The observations (left upper) are compared with reference data (left lower) by computing the MMD statistic (middle). A change is detected when the cumulative sum of the MMD comparisons crosses a threshold. See the main text for details.

The Kernel CUSUM (KCUSUM) algorithm blends features of the CUSUM procedure with the MMD framework. The basic idea is that instead of using each new observation $x_n$ to compute the log-likelihood ratio $\log(q(x_n)/p(x_n))$, which is an estimator of the KL-divergence $d_{KL}(Q, P)$, we will use the new observation and some other random samples to compute an estimate of the squared MMD distance $d_k(P, Q)^2$.

The KCUSUM algorithm is defined with the help of a shifted version of the function $g$ used to define the MMD statistics: given $\delta > 0$ (the role of $\delta$ is explained in detail below), define $g_\delta$ as

$g_\delta(x, x', y, y') = g(x, x', y, y') - \delta.$ (6)

The details of the Kernel CUSUM are listed in Algorithm 2. At even numbered iterations, the most recent observation $x_n$ is paired with the previous observation $x_{n-1}$, and these two points are compared with two reference points $y_{n-1}, y_n$ using MMD. We subtract a constant $\delta$ from the result to get $g_\delta(x_{n-1}, x_n, y_{n-1}, y_n)$. This quantity is then added onto the statistic $z_{n-1}$ to get the next value $z_n$. If the new value would be negative, then it resets to zero, effectively restarting the algorithm. At odd numbered iterations, the statistic is unchanged.

The reason for subtracting a positive amount $\delta$ at each step of Algorithm 2 is to guarantee that the increments have negative drift in the pre-change regime and positive drift in the post-change regime. This is a consistency property that enables us to formulate non-trivial bounds on the time to false alarm. Using the definitions in Algorithm 2, it is evident that $E_\infty[g_\delta(x_{n-1}, x_n, y_{n-1}, y_n)] = -\delta$ before a change, and after a change, $E_1[g_\delta(x_{n-1}, x_n, y_{n-1}, y_n)] = d_k(P, Q)^2 - \delta$, where $E_1$ denotes expectation when every observation is post-change (the reference samples are always drawn from $P$). In other words, the algorithm can detect a change to any distribution $Q$ whose squared MMD distance from the reference distribution $P$ exceeds the user-supplied drift $\delta$.

input: Threshold $h$, drift $\delta > 0$, and data $x_1, x_2, \ldots$
initialize $z_0 = 0$
for $n = 1, 2, \ldots$ do
  sample $y_n$ from reference measure $P$
  if $n$ is even then
    $z_n = \max\{0, z_{n-1} + g_\delta(x_{n-1}, x_n, y_{n-1}, y_n)\}$.
  else
    $z_n = z_{n-1}$
  end
  if $z_n \geq h$ then set $T = n$ and exit
  else continue.
end
Algorithm 2 Kernel CUSUM (KCUSUM)
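
A minimal Python sketch of Algorithm 2 follows (our illustration, reusing the function g from the Section III sketch; the drift, threshold, and distributions below are assumptions chosen for demonstration):

import numpy as np

def kcusum(stream, sample_ref, g, delta, h):
    # Algorithm 2: on even steps, compare (x_{n-1}, x_n) against two fresh
    # reference points (y_{n-1}, y_n) and set z_n = max(0, z_{n-1} + g - delta);
    # on odd steps the statistic is unchanged. Alarm when z_n >= h.
    z, prev_x, prev_y = 0.0, None, None
    for n, x in enumerate(stream, start=1):
        y = sample_ref()
        if n % 2 == 0:
            z = max(0.0, z + g(prev_x, x, prev_y, y) - delta)
            if z >= h:
                return n  # declared change time T
        prev_x, prev_y = x, y
    return None

# Illustrative use: variance change as in Example IV.1, reference measure N(0, 1).
rng = np.random.default_rng(2)
stream = np.concatenate([rng.normal(0, 1.0, 2000), rng.normal(0, 3.0, 2000)])
print(kcusum(stream, lambda: rng.normal(0, 1.0), g, delta=0.1, h=10.0))
# Prints the alarm time; typically a couple hundred steps after the change at n = 2001.
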
Example IV.1

Consider the problem of detecting a change in variance of normally distributed random variables, as in Example II.1. The upper left plot of Fig. 2 shows the stream of observations, which are normally distributed with a change in variance part way through the sequence. The KCUSUM is based on the linear statistic $\mathrm{MMD}_l^2$, so at each even time $n$ the increment $g_\delta(x_{n-1}, x_n, y_{n-1}, y_n)$ is computed, and this quantity is plotted in the middle of Fig. 2. On the right is the resulting sequence $z_n$. As in the CUSUM, a simple threshold is used to decide that a change has occurred. The drift $\delta$ was set to a small positive value, and for this particular realization of the variables the statistic crosses the threshold, and a change is declared, shortly after the true change time.

Next we consider the delay and false alarm rate of the KCUSUM. As above, we consider the worst case delay and the time to false alarm. The time to false alarm is defined as in Equation (1). The worst case delay is defined as in Equation (2), using the filtration $\{\mathcal{F}_n\}_{n \geq 1}$ where $\mathcal{F}_n = \sigma(x_1, y_1, \ldots, x_n, y_n)$. To prove the bounds we adapt the methods of [11], which allow one to reduce the problem of analyzing the CUSUM to that of analyzing a random walk with i.i.d. increments.

For the analysis it will be convenient to group the variables together as $v_1, v_2, \ldots$, where for $i \geq 1$,

$v_i = (x_{2i-1}, x_{2i}, y_{2i-1}, y_{2i}).$ (7)

Write $\Delta_i = g_\delta(v_i)$ for the corresponding increments. Associated to the grouped sequence, define auxiliary stopping times as follows: for $k \geq 1$,

$\eta_k = \inf\{n \geq k : \Delta_k + \Delta_{k+1} + \cdots + \Delta_n \geq h\},$

and write $\eta = \eta_1$.

Theorem IV.2

Let $T$ be the change detector corresponding to the Kernel CUSUM (Algorithm 2). Let $P$ be the pre-change, or reference, distribution. Then, using a threshold $h > 0$, the time to false alarm is at least

$\mathrm{TTFA}(T) \geq \frac{2}{P_\infty(\eta < \infty)}.$ (8)

If $Q$ is a distribution with $d_k(P, Q)^2 > \delta$, then the worst case detection delay is at most

$\mathrm{WCD}(T) \leq 2\,E_1[\eta] + 1.$ (9)

Let $S_0 = 0$ and for $n \geq 1$ let $S_n = \Delta_1 + \cdots + \Delta_n$. Note that the stopping time $\eta$ can be written as

$\eta = \inf\{n \geq 1 : S_n \geq h\}.$ (10)

Let $T_v$ denote the stopping time of the CUSUM recursion applied to the grouped increments $\Delta_1, \Delta_2, \ldots$. Since the KCUSUM statistic only moves at even times, the relation between $T_v$ and the KCUSUM stopping time $T$ is

$T = 2\,T_v.$ (11)

Note that, as discussed in [11], the stopping time $T_v$ can be represented as

$T_v = \inf_{k \geq 1} \eta_k,$ (12)

and each stopping time $\eta_k$ uses the same decision rule, the only difference being that they operate on shifted versions of the input sequence $v_k, v_{k+1}, \ldots$. In this setting Theorem 2 from [11] is applicable, which yields a lower bound on $E_\infty[T_v]$ under $P_\infty$:

$E_\infty[T_v] \geq \frac{1}{P_\infty(\eta < \infty)}.$ (13)

Combining Equations (11) and (13) yields the claim (8).

Now we consider bounding the worst case delay. If the sequence $x_1, x_2, \ldots$ has a change at an odd valued time, say $\nu = 2m + 1$ for $m \geq 0$, then the grouped sequence has a change at time $m + 1$. Explicitly, the groups $v_1, \ldots, v_m$ are drawn from the pre-change regime and the groups $v_{m+1}, v_{m+2}, \ldots$ from the post-change regime.

From here we reason as in Theorem 2 of [11]:

$E_\nu[(T - \nu + 1)^+ \mid \mathcal{F}_{\nu-1}] \overset{A}{=} 2\,E_\nu[(T_v - m)^+ \mid \mathcal{F}_{\nu-1}] \overset{B}{\leq} 2\,E_\nu[(\eta_{m+1} - m)^+ \mid \mathcal{F}_{\nu-1}] \overset{C}{=} 2\,E_\nu[(\eta_{m+1} - m)^+] \overset{D}{=} 2\,E_1[\eta].$ (14)

Step A follows from Equation (11) and Step B follows since $T_v$ is the infimum of the $\eta_k$. Step C follows from the independence of $\eta_{m+1}$ from $\mathcal{F}_{\nu-1}$, and finally Step D follows from the fact that the distribution of $\eta_{m+1} - m$ under $P_\nu$ is the same as the distribution of $\eta$ under $P_1$.

The situation is slightly more complex if the change occurs at a time that is even, say $\nu = 2m$ for some $m \geq 1$. In this case the grouped sequence does not experience an abrupt change, and instead there are 3 distributions involved. Specifically, the groups $v_1, \ldots, v_{m-1}$ are drawn from the pre-change regime, the group $v_m$ mixes observations from before and after the change, and the groups $v_{m+1}, v_{m+2}, \ldots$ are drawn from the post-change regime.

Reasoning as in Equation (14), but bypassing the mixed group $v_m$, then,

$E_\nu[(T - \nu + 1)^+ \mid \mathcal{F}_{\nu-1}] \leq 2\,E_\nu[(\eta_{m+1} - m)^+ \mid \mathcal{F}_{\nu-1}] + 1 = 2\,E_1[\eta] + 1.$ (15)

Combining (14) and (15), we see that for all $\nu \geq 1$,

$E_\nu[(T - \nu + 1)^+ \mid \mathcal{F}_{\nu-1}] \leq 2\,E_1[\eta] + 1.$ (16)

Combining Equation (16) with the definition of worst case delay (2) yields the inequality (9).

We can combine Theorem IV.2 with certain facts about random walks (Lemma A.2 and Proposition A.1) to get more specific bounds on the delays and false alarms, as shown in the following corollary. In this corollary, we assume that the kernel is bounded in absolute value by a constant $K$, so that $|g| \leq 4K$, and also assume that $\delta \leq 4K$. This is a necessary assumption when the kernel is bounded, since if $\delta > 4K$ then $g_\delta(v_i) < 0$ for all $i$, and it will not be possible to detect any changes.

Fig. 3: The logarithmic relation between time to false alarm and average delay in the kernel CUSUM as implied by our analysis. See text for details.
Corollary IV.3

Let the assumptions of Theorem IV.2 hold. Further assume that the kernel is bounded in absolute value by a constant $K$ and let $\lambda^* = \delta/(192 K^2)$. Then the time to false alarm satisfies

$\mathrm{TTFA}(T) \geq 2\,e^{\lambda^* h}.$ (17)

If $Q$ is a distribution with $d_k(P, Q)^2 > \delta$, then the worst case detection delay is at most

$\mathrm{WCD}(T) \leq \frac{2h}{d_k(P, Q)^2 - \delta} + \frac{32 K^2}{(d_k(P, Q)^2 - \delta)^2} + 1.$ (18)

To start, note that

$P_\infty(\eta < \infty) = P_\infty\Big(\sup_{n \geq 1} S_n \geq h\Big).$ (19)

To upper-bound the last term in this equation, we will apply Lemma A.2. Note that under our assumption that the kernel is bounded, the moment generating function

$\phi(\lambda) = E_\infty[e^{\lambda \Delta_1}]$

is guaranteed to be well defined for all $\lambda$. Therefore

$P_\infty\Big(\sup_{n \geq 1} S_n \geq h\Big) \leq e^{-\lambda h},$ (20)

where $\lambda$ is any number satisfying $\lambda > 0$ and $\phi(\lambda) \leq 1$. To identify such a $\lambda$, start with a second order expansion of $\phi$:

$\phi(\lambda) \leq 1 - \lambda\delta + \frac{\lambda^2}{2} E_\infty\big[\Delta_1^2 e^{\lambda |\Delta_1|}\big].$

Under the assumption that $\lambda \leq 1/(8K)$ it holds that $|\Delta_1| \leq 4K + \delta \leq 8K$ and

$\phi(\lambda) \leq 1 - \lambda\delta + 96 K^2 \lambda^2.$

Minimizing the right hand side of the final equation above with respect to $\lambda$ yields

$\lambda^* = \frac{\delta}{192 K^2}.$

Combining (8) and (19) with (20) and using this definition of $\lambda^*$ yields the claim (17).

For the delay, note that $E_1[\eta]$ is the expected amount of time until a random walk with positive drift crosses an upper boundary. Hence we may apply Proposition A.1. This leads to

$E_1[\eta] \leq \frac{h}{d_k(P, Q)^2 - \delta} + \frac{E_1[(\Delta_1^+)^2]}{(d_k(P, Q)^2 - \delta)^2}.$ (21)

Combining (9), (21), and the bound $E_1[(\Delta_1^+)^2] \leq (4K - \delta)^2 \leq 16 K^2$, we obtain the relation (18).

Figure 3 shows the logarithmic relation between false alarm time and delay specified by the theorem. For each level of false alarm $\gamma$, we computed the smallest value of $h$ guaranteed to achieve false alarm level $\gamma$ according to Equation (17), and plugged in this threshold to compute the delay according to Equation (18). The computations were performed for a hypothetical problem with fixed values of $K$, $\delta$, and $d_k(P, Q)^2$.
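
A computation of this kind takes only a few lines of Python, shown below under the constants we reconstructed above (the values of K, delta, and the post-change squared MMD d2 are illustrative assumptions):

import numpy as np

K, delta, d2 = 1.0, 0.05, 0.2        # kernel bound, drift, post-change MMD^2
lam = delta / (192 * K**2)           # exponent lambda* from Corollary IV.3
mu = d2 - delta                      # post-change drift of the grouped walk
for gamma in [1e2, 1e4, 1e6, 1e8]:   # target time-to-false-alarm levels
    h = np.log(gamma / 2) / lam      # smallest h with 2*exp(lam*h) >= gamma
    wcd = 2 * h / mu + 32 * K**2 / mu**2 + 1   # delay bound of Equation (18)
    print(f"TTFA >= {gamma:.0e}: h = {h:.0f}, WCD <= {wcd:.0f}")

# The bounds are conservative in absolute terms; the point is the shape:
# h, and hence the delay bound, grows only logarithmically in gamma.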

V Empirical Results

In this section we evaluate the KCUSUM on several change detection tasks. In each case, the observations consisted of vectors in $\mathbb{R}^d$, and the Gaussian kernel (defined in Equation (3)) was used. The pre-change distribution in each task is the normal distribution on $\mathbb{R}^d$ with mean zero and a covariance matrix equal to a scaled identity. The four possible post-change distributions were as follows:

  1. Change in mean: A normal distribution with a shifted mean and the pre-change covariance matrix.

  2. Change in variance (all components): A zero-mean normal distribution with a covariance matrix equal to the identity scaled by a larger factor.

  3. Change in variance (random component): The distribution obtained by sampling from the pre-change distribution and scaling one randomly chosen component.

  4. Change to uniform: The distribution on $\mathbb{R}^d$ where each component is sampled independently from a uniform distribution on a fixed interval.
Fig. 4: The performance of the KCUSUM on several change detection tasks. See text for details.

Note that the interval in Problem 4 was chosen so that the resulting distribution has the same mean and variance as the pre-change distribution.

For each task we used a Monte Carlo approach to estimate the time to false alarm and delay. The time to false alarm was estimated by generating sequences with no change and running the KCUSUM until a false alarm occurred. We record the time at which the false alarm occurs and average these values to get the estimate of the TTFA. For the delay, we generated sequences that had a change at a fixed time, and recorded the amount of time until the alarm goes off as an estimate of the delay. In the examples, the drift parameter $\delta$ was set to one value in tasks 1-3 and to a different value in task 4.
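
A sketch of this Monte Carlo procedure, reusing the kcusum and g functions from the sketches above (the trial counts, horizons, post-change sampler, and placement of the change at the first time step are our own illustrative assumptions):

import numpy as np

def estimate_ttfa(h, delta, trials=100, horizon=100_000):
    # Average alarm time over change-free streams (x_n ~ P throughout).
    rng = np.random.default_rng(3)
    times = []
    for _ in range(trials):
        stream = rng.normal(0, 1.0, horizon)
        t = kcusum(stream, lambda: rng.normal(0, 1.0), g, delta, h)
        times.append(t if t is not None else horizon)  # censored at the horizon
    return np.mean(times)

def estimate_delay(h, delta, post_std, trials=100, horizon=10_000):
    # Average alarm time when every observation is post-change (a change
    # at the first time step), giving an estimate of the detection delay.
    rng = np.random.default_rng(4)
    times = []
    for _ in range(trials):
        stream = rng.normal(0, post_std, horizon)
        t = kcusum(stream, lambda: rng.normal(0, 1.0), g, delta, h)
        times.append(t if t is not None else horizon)
    return np.mean(times)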

Based on the results in Figure 4 we see that the change detection tasks increase in difficulty as we go from problem 1 to 4. In all of the problems we observed a similar logarithmic growth rate in the delay as the time to false alarm is allowed to increase. However, the scale of this growth can vary dramatically. For instance, the first two problems are relatively easy, while the fourth problem seems to be quite difficult for the KCUSUM. Note that in each case, the only data used by the algorithm is the incoming observations and samples from the pre-change distribution; no information or samples about the post-change distribution are used. Overall, the results suggest that the KCUSUM may be a promising approach for change detection problems where less is known about the type of changes involved.

VI Conclusion

This work introduced the Kernel Cumulative Sum (KCUSUM) algorithm, a new approach for online change detection. Unlike the CUSUM algorithm, this approach does not require knowledge of the probability density ratio for its implementation. Instead, it uses incoming observations and samples from the pre-change distribution. The result is that the same algorithm works for detecting many types of changes. Our theoretical analysis establishes the algorithm's ability to detect changes, and shows a relation between the delay and the MMD distance of the two distributions. These bounds should also be useful in the analysis of other non-parametric change detectors. Finally, we would like to suggest two avenues for future work. First, there are likely variants of the KCUSUM that leverage more complex, non-i.i.d. statistics, which may lead to improved detection performance. Second, the CUSUM has been investigated for detecting changes in scenarios with more complex dependencies among observations [5], and it may also be possible to extend the kernel methods developed in this paper to detect changes in these cases.

Appendix

Proposition A.1 (Corollary 1, [12])

Let $y_1, y_2, \ldots$ be i.i.d. real-valued random variables such that $E[y_1] = \mu > 0$ and $E[(y_1^+)^2] < \infty$. Define $S_n = y_1 + \cdots + y_n$ and for $a > 0$ let $T_a$ be the stopping time $T_a = \inf\{n \geq 1 : S_n \geq a\}$. Then

$E[T_a] \leq \frac{a}{\mu} + \frac{E[(y_1^+)^2]}{\mu^2},$

where $y_1^+ = \max\{y_1, 0\}$.

Lemma A.2

Let $y_1, y_2, \ldots$ be a sequence of i.i.d. real-valued random variables and for $n \geq 1$ define the partial sums $S_n = y_1 + \cdots + y_n$. Let $\phi(\lambda) = E[e^{\lambda y_1}]$ be the moment generating function of the $y_i$ and suppose there is a $\lambda_0 > 0$ such that $\phi(\lambda_0) \leq 1$. Then for any $a > 0$,

$P\Big(\sup_{n \geq 1} S_n \geq a\Big) \leq e^{-\lambda_0 a}.$

For $n \geq 0$ define $M_n = e^{\lambda_0 S_n}$, with $S_0 = 0$. Then for $n \geq 1$,

$E[M_n \mid M_0, \ldots, M_{n-1}] = M_{n-1}\,\phi(\lambda_0) \leq M_{n-1}.$

Furthermore, for all $n$ it holds that $M_n \geq 0$ and $E[M_n] \leq 1$. Therefore $(M_n)_{n \geq 0}$ is a non-negative supermartingale. Hence

$P\Big(\sup_{n \geq 1} S_n \geq a\Big) = P\Big(\sup_{n \geq 1} M_n \geq e^{\lambda_0 a}\Big) \leq e^{-\lambda_0 a}\,E[M_0] = e^{-\lambda_0 a}.$

The first step in the above derivation follows from the monotonicity of the function $x \mapsto e^{\lambda_0 x}$, the second step follows from Theorem 7.8 in [6], and the final step follows since $S_0 = 0$ implies $E[M_0] = 1$; the assumption on $\lambda_0$ was used to establish the supermartingale property.

Proof of Proposition II.2

Define $\ell(x)$ as

$\ell(x) = \log\frac{q(x)}{p(x)},$

and let $\eta = \inf\{n \geq 1 : \ell(x_1) + \cdots + \ell(x_n) \geq h\}$. It follows from Theorem 2 of [11] that the CUSUM obeys

$\mathrm{TTFA}(T) \geq \frac{1}{P_\infty(\eta < \infty)}, \qquad \mathrm{WCD}(T) \leq E_1[\eta].$

Note that $E_\infty[\ell(x_1)] = -d_{KL}(P, Q) < 0$. Hence $\ell(x_1) + \cdots + \ell(x_n)$ is a random walk with negative drift under $P_\infty$. Furthermore, the moment generating function of the increments under $P_\infty$ can be expressed as

$\phi(\lambda) = \int p(x)^{1-\lambda}\, q(x)^{\lambda}\, dx,$

and it is evident that $\phi(1) = 1$. Then we may apply Lemma A.2 with $\lambda_0 = 1$ to conclude that $P_\infty(\eta < \infty) \leq e^{-h}$. This establishes our claim on the time to false alarm.

Under the assumption of a change point at $\nu = 1$, the variables $\ell(x_n)$ are i.i.d. with positive mean $d_{KL}(Q, P)$, and therefore $\ell(x_1) + \cdots + \ell(x_n)$ is a random walk with positive drift. The bound on WCD then follows from applying Proposition A.1 to the stopping time $\eta$.

References

  • [1] T. W. Anderson. On the distribution of the two-sample Cramér-von Mises criterion. Annals of Mathematical Statistics, 33(3):1148–1159, 1962.
  • [2] Michèle Basseville and Igor V. Nikiforov. Detection of Abrupt Changes: Theory and Application. Prentice-Hall, Upper Saddle River, NJ, 1993.
  • [3] E. Brodsky and B. S. Darkhovsky. Nonparametric Methods in Change Point Problems. Mathematics and Its Applications. Springer Netherlands, 1993.
  • [4] W. Feller. On the Kolmogorov-Smirnov limit theorems for empirical distributions. Annals of Mathematical Statistics, 19(2):177–189, 1948.
  • [5] Cheng-Der Fuh. SPRT and CUSUM in hidden Markov models. The Annals of Statistics, 31(3):942–977, 2003.
  • [6] R. G. Gallager. Discrete Stochastic Processes. The Springer International Series in Engineering and Computer Science. Springer US, 1995.
  • [7] Thomas Gärtner. A survey of kernels for structured data. ACM SIGKDD Explorations Newsletter, 5(1):49–58, 2003.
  • [8] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
  • [9] Zaïd Harchaoui, Eric Moulines, and Francis R. Bach. Kernel change-point analysis. In Advances in Neural Information Processing Systems, pages 609–616, 2009.
  • [10] Shuang Li, Yao Xie, Hanjun Dai, and Le Song. M-statistic for kernel change-point detection. In Advances in Neural Information Processing Systems 28, pages 3366–3374. Curran Associates, Inc., 2015.
  • [11] G. Lorden. Procedures for reacting to a change in distribution. Annals of Mathematical Statistics, 42(6):1897–1908, 1971.
  • [12] Gary Lorden. On excess over the boundary. The Annals of Mathematical Statistics, pages 520–527, 1970.
  • [13] George Moustakides, Aleksey Polunchenko, and Alexander Tartakovsky. Numerical comparison of CUSUM and Shiryaev-Roberts procedures for detecting changes in distributions. Communications in Statistics - Theory and Methods, 38(16-17):3225–3239, 2009.
  • [14] George V. Moustakides. Optimal stopping times for detecting changes in distributions. The Annals of Statistics, pages 1379–1387, 1986.
  • [15] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.
  • [16] Ewan S. Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954.
  • [17] A. N. Pettitt. A two-sample Anderson-Darling rank statistic. Biometrika, 63(1):161–168, 1976.
  • [18] Aleksey S. Polunchenko and Alexander G. Tartakovsky. On optimality of the Shiryaev-Roberts procedure for detecting a change in distribution. The Annals of Statistics, 38(6):3445–3457, 2010.
  • [19] H. Vincent Poor and Olympia Hadjiliadis. Quickest Detection. Cambridge University Press, 2008.
  • [20] W. A. Shewhart. Economic Control of Quality of Manufactured Product. D. Van Nostrand Company, 1931.
  • [21] S. V. N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, and Karsten M. Borgwardt. Graph kernels. Journal of Machine Learning Research, 11:1201–1242, 2010.
  • [22] A. Wald. Sequential Analysis. Wiley, 1947.