1 Introduction
Stationarity is crucial in the analysis of random sequences because statistical inference usually requires a probabilistic mechanism that is constant over at least a segment of observations. It is therefore important to detect whether changes occur in a sequence of observations prior to statistical inference. Such changepoint problems arise in many applications: abrupt events in video surveillance (Mayer and Mundy, 2015); deterioration of product quality in quality control (Lai, 1995); credit card fraud in finance (Bolton and Hand, 2002); alteration of genetic regions in cancer research (Erdman and Emerson, 2008); and so on.
There is a large and rapidly growing literature on changepoint detection (Aminikhanghahi and Cook, 2016; Niu et al., 2016; Sharma et al., 2016; Fryzlewicz et al., 2014). Many methods rely on assumed parametric models to detect specific change types, such as changes in location, in scale, or within a presumed distribution family.
Page (1954) introduced a method based on examining the ratio of log-likelihood functions. Lavielle and Teyssiere (2006) detected changepoints by maximizing a log-likelihood function. Yau and Zhao (2016) proposed a likelihood ratio scan method for piecewise stationary autoregressive time series. Some Bayesian changepoint detection methods assume that the observations are normally distributed and calculate the probability of a changepoint at each point
(Barry and Hartigan, 1993; Zhang, 1995; Wang and Emerson, 2015; Maheu and Song, 2018), to name a few. Since parametric methods potentially suffer from model misspecification, other methods have been developed to detect general distributional changes under more relaxed assumptions. Kawahara and Sugiyama (2012) provided an algorithm that relies heavily on estimating the ratio of probability densities. Lung-Yut-Fong et al. (2015) identified changepoints via the well-known Wilcoxon rank statistic. Matteson and James (2014) proposed a nonparametric method using the concept of energy distance for independent observations. There are also several binary segmentation methods (Fryzlewicz et al., 2014; Cho and Fryzlewicz, 2015; Eichinger et al., 2018). Two advantages of binary segmentation procedures are their simplicity and computational efficiency, but their false discovery rates may be hard to control because they are ‘greedy’ procedures. Zou et al. (2014) introduced a nonparametric empirical likelihood approach to detecting multiple changepoints in independent sequences, and estimated the locations of the changepoints by using a dynamic programming algorithm and the intrinsic order structure of the likelihood function.
Automatically detecting the number of changepoints is also important. Some methods detect only a single changepoint (Ryabko and Ryabko, 2008), while others require a known number of changepoints with unknown locations (Hawkins, 2001; Lung-Yut-Fong et al., 2015). In real data analysis, however, we usually do not know the number of changepoints.
With the increasing richness of data types, non-Euclidean data, such as shape data, functional data, and spatial data, commonly arise in applications. For example, one problem of interest to us concerns changes in the monsoon direction, which is defined on a circle, a simple Riemannian manifold. Methods developed for Hilbert spaces are not effective for this type of problem, as our analysis of the data from the Yunnan-Guizhou Plateau (E, N) collected from 2015/06/01 to 2015/10/30 illustrates below. To the best of our knowledge, few methods exist to detect changepoints in a non-Euclidean sequence. Chen et al. (2015) and Chu et al. (2019) proposed a series of graph-based nonparametric approaches that can be applied to non-Euclidean data of arbitrary dimension. However, their methods apply to iid observations only and are restricted to one or two changepoints. Therefore, it remains an open and challenging problem to develop methods that detect arbitrary distributional changes in non-Euclidean sequences, including both the changepoint locations and the number of changepoints.
To address this challenge, we introduce a novel concept of a Ball detection function via Ball divergence (Pan et al., 2018). Ball divergence is a recently developed measure of the divergence between two probabilities in separable Banach spaces; it is zero if and only if the two probability measures are identical. Since its sample statistic is constructed from metric ranks, the test procedure for an identical distribution is robust to heavy-tailed data or outliers, consistent against the alternative hypothesis, and applicable to imbalanced data. The empirical Ball divergence is therefore an ideal statistic for testing whether or not a change has occurred. Unfortunately, it does not tell us where the change occurs, because in theory the probability measures before and after any time point are always different if there exists a changepoint in the sequence. It is therefore imperative to observe how the probability measures before and after any given time vary with time, and then to develop a proper criterion to detect the changepoint location. We introduce a Ball detection function as an effective choice, which reaches its maximum at the changepoint if a sequence has only one changepoint. We further develop a hierarchical algorithm to detect multiple changepoints using the statistic based on the Ball detection function. The advantages of our procedure are threefold: it can estimate the number of changepoints and detect their locations; it can detect any type of changepoint; and, both uniquely and importantly, it can handle complex stochastic sequences, for example, non-Euclidean sequences.
The rest of this article is organized as follows. In Section 2, we review the notion of Ball divergence, introduce a novel changepoint detection function, i.e., a Ball detection function based on Ball divergence with a scale parameter for weakly dependent sequences, and establish its asymptotic properties. We show how to use the Ball detection function to detect changepoints and establish the consistency properties of our method in Section 3. In Section 4, we compare the performance of our method with several existing methods in various simulation settings. In Section 5, two real data analyses demonstrate the utility of our proposed method. We make some concluding remarks in Section 6. All technical details are deferred to the Appendix.
2 Changepoint Detection in Dependent Sequences
2.1 Review of Ball Divergence
Ball divergence (BD; Pan et al. (2018)) is a measure of the difference between two probabilities in a separable Banach space $(V, \|\cdot\|)$. For $u, v \in V$, the distance between $u$ and $v$ deduced from the norm is $\rho(u, v) = \|u - v\|$. Denote by $\bar{B}(u, r) = \{v \in V : \rho(u, v) \le r\}$ a closed ball with center $u$ and radius $r$. Let $\mathscr{B}$ be the smallest $\sigma$-algebra in $V$ that contains all closed (or open) subsets of $V$, and let $\mu$ and $\nu$ be two probabilities on $\mathscr{B}$. Ball divergence (Pan et al., 2018) is defined as follows.
Definition 2.1.1
The Ball divergence of two Borel probabilities $\mu$ and $\nu$ in $V$ is defined as an integral of the square of the measure difference between $\mu$ and $\nu$ over arbitrary closed balls,
$$D(\mu, \nu) = \iint_{V \times V} \left[\mu(\bar{B}(u, \rho(u, v))) - \nu(\bar{B}(u, \rho(u, v)))\right]^2 \left[\mu(du)\,\mu(dv) + \nu(du)\,\nu(dv)\right].$$
Let $S_\mu$ and $S_\nu$ be the support sets of $\mu$ and $\nu$, respectively. BD has the following important property (Pan et al., 2018):
Theorem 2.1.1
Given two Borel probabilities $\mu$ and $\nu$ in a finite dimensional Banach space $V$, $D(\mu, \nu) \ge 0$, where the equality holds if and only if $\mu = \nu$. The result extends to separable Banach spaces if $S_\mu \subseteq S_\nu$ or $S_\nu \subseteq S_\mu$.
2.2 Ball Divergence with a Scale Parameter
The Ball divergence introduced above cannot locate changepoints accurately enough when comparing the distributions of the sequences before and after candidate changepoints. We therefore introduce a Ball divergence associated with a scale parameter, as follows.
Definition 2.2.1
A Ball divergence of two Borel measures $\mu$ and $\nu$ in $V$ with scale parameter $\alpha \in [0, 1]$ is defined as
$$D_\alpha(\mu, \nu) = \iint_{V \times V} \left[\mu(\bar{B}(u, \rho(u, v))) - \nu(\bar{B}(u, \rho(u, v)))\right]^2 \eta_\alpha(du)\,\eta_\alpha(dv), \qquad (1)$$
where $\eta_\alpha = \alpha\mu + (1 - \alpha)\nu$ is the mixture distribution measure with the scale parameter $\alpha$.
$D_\alpha$ also has the equivalence property below, which is critical to the comparison of the distributions of any two sequences.
Theorem 2.2.1
Given two Borel probabilities $\mu$ and $\nu$ in a finite dimensional Banach space $V$, $D_\alpha(\mu, \nu) \ge 0$, where the equality holds if and only if $\mu = \nu$. The result also holds on separable Banach spaces if $S_\mu \subseteq S_\nu$ or $S_\nu \subseteq S_\mu$.
Theorem 2.2.1 assures that $D_\alpha$, for any $\alpha \in [0, 1]$, possesses the most important property of $D$ in terms of testing the distributional difference between two sequences. Importantly, with the introduction of $\alpha$, we can consistently estimate the locations of the changepoints. Here, we highlight the relationship and difference between $D_\alpha$ and $D$.
When $\alpha = 1$, $D_1(\mu, \nu)$ is the measure difference over the balls whose centers and radius endpoints follow the measure $\mu$; when $\alpha = 0$, $D_0(\mu, \nu)$ is the measure difference over the balls whose centers and radius endpoints follow the measure $\nu$. Moreover, for $\alpha \in (0, 1)$, $D_\alpha(\mu, \nu)$ is a weighted mean of the measure differences between the two samples over the balls whose centers and radius endpoints follow the four possible pairs of measures $(\mu, \mu)$, $(\mu, \nu)$, $(\nu, \mu)$, and $(\nu, \nu)$, where the ratio of the two measures is $\alpha : (1 - \alpha)$.
Ball divergence with a scale parameter can also be defined on a general metric space, following the generalized Banach-Mazur theorem (Kleiber and Pervin, 1969), as stated in the Supplementary Material.
2.3 Ball Detection Function
Now, we introduce a Ball detection function, which is maximized at the changepoint if there exists one, and hence can be used to determine the location of the changepoint. For clarity, let us consider a conceptual sequence with a changepoint at $\tau \in (0, 1)$, where the probability measures before and after $\tau$ are $\mu$ and $\nu$, respectively. Denote the indicator function by $I(\cdot)$. For a "time" $t \in (0, 1)$, define
Without loss of generality, suppose that $t \le \tau$; the probability measures before and after $t$ are then $\mu$ and a mixture of $\mu$ and $\nu$ with weights proportional to $\tau - t$ and $1 - \tau$. By the definition of Ball divergence (1), we have
Therefore, in general,
(2) 
The maximum is attained at $t = \tau$ if there exists a changepoint $\tau$. In this case, we can find the changepoint by maximizing the Ball divergence in equation (2). But we still need to test whether a changepoint has occurred at all. Next, we introduce a Ball detection function to simultaneously test the existence of a changepoint and determine its location:
Note that the maximum of the Ball detection function is also attained at $t = \tau$, allowing us to find the changepoint by maximizing it. In the next subsection, we discuss how this function is used to construct a changepoint test statistic.
2.4 Sample Ball Detection Function
Suppose that a sequence of observations is comprised of two multivariate stationary sequences, one with probability measure $\mu$ and the other with probability measure $\nu$, where both $\mu$ and $\nu$ are unknown. We estimate the Ball detection function from the observations. Let $\delta(x, y, z) = I\big(z \in \bar{B}(x, \rho(x, y))\big)$, which identifies whether the point $z$ falls into the closed ball with $x$ as the center and $\rho(x, y)$ as the radius, and let $\xi(x, y; u, v) = \delta(x, y, u)\,\delta(x, y, v)$, which determines whether the two points $u$ and $v$ fall into the ball together. A consistent estimator of the Ball divergence of $\mu$ and $\nu$ with the scale parameter $\alpha$ is
as summarized in Theorem 2.4.1. We also prove that it has a limiting distribution under the null hypothesis in Theorem 2.4.2. For this reason, we choose it as the statistic to detect changepoints.
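Since the empirical Ball divergence is built from indicator counts over closed balls, a small self-contained sketch may help fix ideas. This is not the authors' exact estimator (the scale parameter and the dependence adjustments are omitted, and the plain Euclidean metric and all function names are our assumptions): balls are centred at the points of each sample in turn, with radii given by within-sample pairwise distances, and the statistic averages the squared differences between the two samples' empirical measures of those balls.

```python
import numpy as np

def _half_divergence(centers, X, Y):
    """Average squared difference between the two samples' empirical
    measures of the closed balls B(center_i, ||center_i - center_j||)."""
    r = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    d_to_X = np.linalg.norm(centers[:, None, :] - X[None, :, :], axis=2)
    d_to_Y = np.linalg.norm(centers[:, None, :] - Y[None, :, :], axis=2)
    # A[i, j]: fraction of X inside ball (i, j); B[i, j]: fraction of Y inside it
    A = (d_to_X[:, None, :] <= r[:, :, None]).mean(axis=2)
    B = (d_to_Y[:, None, :] <= r[:, :, None]).mean(axis=2)
    return ((A - B) ** 2).mean()

def empirical_ball_divergence(X, Y):
    """Two-sample empirical Ball divergence sketch (our simplification):
    zero when the samples coincide, positive when they differ."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    if Y.ndim == 1:
        Y = Y[:, None]
    return _half_divergence(X, X, Y) + _half_divergence(Y, X, Y)
```

Because only the ranks of the pairwise distances enter through the indicator comparisons, the statistic is insensitive to heavy tails, which is the robustness property noted above.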
To investigate the asymptotic properties of the statistic, we introduce two concepts for random sequences: absolute regularity and ergodic stationarity.
Given the probability space and two sub-$\sigma$-fields of it, let
where the supremum is taken over all partitions of into sets , all partitions of into sets , and all . A stochastic sequence is called absolutely regular ((Dehling and Fried, 2012); also called weakly Bernoulli (Aaronson et al., 1996)) if
as . Here, denotes the $\sigma$-field generated by the random variables . In this paper, we suppose that for any . The concept of an absolutely regular sequence is wide enough to cover all relevant examples from statistics except for long-memory sequences.
Recall that an ergodic, stationary sequence (ESS) (Aaronson et al., 1996) is a random sequence of the form , where is an ergodic, probability-preserving transformation on the probability space and is a measurable function. In essence, an ESS implies that the random sequence will not change its statistical properties with time (stationarity) and that its statistical properties can be deduced from a single, sufficiently long sample of the sequence (ergodicity).
We have the following theorem for an absolutely regular sequence comprised of two ergodic stationary sequences:
Theorem 2.4.1
Suppose that is an absolutely regular sequence, and are both ergodic stationary with marginal probability measure respectively. When , for some , then
Theorem 2.4.1 shows that the sample statistic converges to the Ball detection function almost surely. We further investigate its asymptotic distribution. Under the null hypothesis, the sample Ball detection function is the sum of four degenerate V-statistics. As in Pan et al. (2018), we denote by the second component in the H-decomposition of . Then we have the spectral decomposition:
where and are the eigenvalues and eigenfunctions of . Let be an independent copy of . For , are assumed to be iid , and let

Theorem 2.4.2
Under the null hypothesis , suppose is a stationary absolutely regular sequence with coefficients satisfying for . If , for some , then we have
Under the alternative hypothesis, the sample Ball detection function is asymptotically normal because it is a sum of non-degenerate V-statistics. Let and be the first components in the H-decomposition of and
We can obtain the asymptotic distribution under the alternative hypothesis.
Theorem 2.4.3
Suppose is an absolutely regular sequence with coefficients satisfying for . Under , if , and for some , then we have
We now show that the test based on the sample Ball detection function is consistent against general alternatives. Our new detection function can also handle imbalanced sample sizes: as shown in the following theorem, the asymptotic power of the test does not go to zero even if the fraction of one sample goes to 0 or 1.
Theorem 2.4.4
The test based on is consistent against any general alternative . More specifically,
and
3 Detection of Changepoints
3.1 Hierarchical Algorithm
Next, we use the sample Ball detection function to detect changepoints in a sequence. For simplicity, suppose first that the sequence contains at most one changepoint. The possible changepoint location is then estimated by maximizing the detection function:
(3) 
We use the bootstrap method to estimate the probability that the maximized statistic exceeds a threshold. If the estimated probability is high enough, the maximizer is the estimated changepoint. Otherwise, we proceed as if there were no changepoint in the sequence.
It is more complicated if the sequence has multiple changepoints. In this case, we estimate the first changepoint by
(4) 
From (4), we can see that the introduction of this term alleviates a weakness of the bisection algorithm (Matteson and James, 2014): each segment may still contain multiple changepoints, and without it, the value of the detection statistic may be lower than required for detection.
Suppose that changepoints have been estimated at locations , and , . These changepoints partition the sequence into segments . In segment , let . The sample Ball detection function of segment is denoted as
Now let
(5) 
Then is the th possible changepoint, located within segment . This hierarchical algorithm for estimating multiple changepoints is outlined below.
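The hierarchical splitting can be sketched in a few lines. The detection statistic and the bootstrap significance test are abstracted as callables here; `detect_stat`, `is_significant`, and `min_seg` (a stand-in for the paper's minimum segment size) are our names, not the authors'.

```python
def hierarchical_changepoints(x, detect_stat, is_significant, min_seg=10):
    """Recursively split the sequence at the maximiser of a detection
    statistic, keeping a split only if the significance test accepts it.

    detect_stat(segment, t): detection statistic for a split at position t
    is_significant(segment, t): bootstrap-style accept/reject decision
    """
    found = []

    def split(lo, hi):
        if hi - lo < 2 * min_seg:
            return                      # segment too short to split further
        seg = x[lo:hi]
        candidates = range(min_seg, hi - lo - min_seg + 1)
        t_best = max(candidates, key=lambda t: detect_stat(seg, t))
        if not is_significant(seg, t_best):
            return                      # no changepoint detected here
        found.append(lo + t_best)
        split(lo, lo + t_best)          # recurse on the left segment
        split(lo + t_best, hi)          # recurse on the right segment

    split(0, len(x))
    return sorted(found)
```

With a mean-difference statistic and a fixed acceptance rule plugged in, the recursion stops as soon as a segment yields no significant split, so the number of changepoints is determined automatically.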
3.2 Hierarchical Significance Testing
Here, we elaborate on the use of the bootstrap method mentioned above.
Theorem 2.4.2 shows that the asymptotic null distribution of the statistic is a mixture of distributions. In practice, it is difficult to take direct advantage of the asymptotic null distribution, so we use the moving block bootstrap (Kunsch, 1989) to obtain empirical probabilities.
Given a set of observations and the block size , we draw a bootstrap resample as follows: (i) define the -dimensional vector ; (ii) resample from the block data with replacement to get pseudo data satisfying , where denotes the integer part of , and take the first elements of the pseudo data as the bootstrap resample ; (iii) repeat steps (i) and (ii) times, and for the th repetition, denote the maximum value in equation (4) based on the resample by ; (iv) the approximate probability is estimated by . Denote the threshold of the estimated probability by ; if the estimated probability exceeds the threshold, then the candidate is declared a changepoint.
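Steps (i)-(ii) of the resampling scheme can be sketched as follows; this is a generic moving block bootstrap in the spirit of Kunsch (1989), with variable names of our choosing.

```python
import numpy as np

def moving_block_bootstrap(x, block_size, rng=None):
    """Draw one moving-block bootstrap resample of the same length as x.

    Overlapping blocks (x[s], ..., x[s + block_size - 1]) preserve local
    time dependence; block starting points are drawn with replacement,
    the blocks are concatenated, and the result is truncated to len(x)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    n = len(x)
    n_blocks = -(-n // block_size)           # ceil(n / block_size)
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    resample = np.concatenate([x[s:s + block_size] for s in starts])
    return resample[:n]
```

Repeating this draw, recomputing the maximized detection statistic on each resample, and counting how often it exceeds the observed value gives the empirical probability used in step (iv).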
In applications, the choice of the block size involves a tradeoff. If the block size is too small, the moving block bootstrap destroys the time dependency of the data and accuracy deteriorates; if it is too large, few blocks are available. In other words, increasing the block size reduces the bias and captures more persistent dependence, while decreasing the block size reduces the variance because more subsamples are available. A reasonable tradeoff is thus to take the mean squared error (MSE) as the objective criterion balancing bias and variance. For linear time series, as proved in Carlstein (1986), the value of the block size that minimizes the MSE is
where is the first-order autocorrelation. Because the construction of the MSE depends on knowledge of the underlying data-generating sequence, no optimal result is available in general. In this paper, we follow Hong et al. (2017) and Xiao and Lima (2007) to choose , where
(6) 
where is the estimator of the first-order autocorrelation of , and is the same as (6) except that the estimated first-order autocorrelation of is used instead. The choice of the block size thus accounts for both linear and nonlinear dependence.
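As an illustration of this data-driven choice, the sketch below estimates the lag-one autocorrelation and maps it to a cube-root block length. The constant in the MSE-optimal formula depends on the data-generating process, so this is only a rule of thumb, not the exact expression in (6); both function names are ours.

```python
import numpy as np

def lag1_autocorrelation(x):
    """Sample lag-one autocorrelation of a univariate series."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    return float((xc[:-1] * xc[1:]).sum() / (xc ** 2).sum())

def choose_block_size(x):
    """Cube-root rule of thumb: stronger lag-one dependence gives longer
    blocks; nearly independent data gives the minimum block length of 1.
    (A sketch only; the MSE-optimal constant is process-dependent.)"""
    n = len(x)
    rho = min(abs(lag1_autocorrelation(x)), 0.99)   # guard the denominator
    factor = (2.0 * rho / (1.0 - rho ** 2)) ** (2.0 / 3.0)
    return max(1, int(np.ceil(factor * n ** (1.0 / 3.0))))
```

For an iid series the estimated autocorrelation is near zero and the rule returns a short block, while a smooth, strongly dependent series is assigned a block length growing like the cube root of the sample size.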
3.3 Consistency
The next theorem shows the consistency of the estimated changepoint locations under the following assumption.
Assumption 3.1
Suppose that is an absolutely regular sequence comprised of two ergodic stationary sequences. Let denote the fraction of the observations such that the first part is an ergodic stationary sequence with marginal probability measure , and the second is an ergodic stationary sequence with marginal probability measure . Finally, let be a sequence of positive numbers such that and as .
Theorem 3.3.1
This theorem shows that consistency only requires that the size of each segment increase to infinity, but not necessarily at the same rate. Under Assumption 3.1, the fraction can be close to 0 or 1, which is an imbalanced case.
For the multiple changepoint situation, we make the following assumption.
Assumption 3.2
Suppose that is an absolutely regular sequence. Let , and , with and . For , is an ergodic stationary sequence with marginal probability measure and . Furthermore, let be a sequence of positive numbers, such that and as .
It is worth noting that we do not assume an upper bound on the number of changepoints ; instead, we control the problem by specifying the minimum sample size in each segment. In other words, under Assumption 3.2, the number of changepoints is allowed to grow as the sample size increases.
The analysis of multiple changepoints can be reduced to the analysis of only two changepoints under Assumption 3.2. Let , for any . The observations can be seen as a random sample from a mixture of probability measures , denoted as . Similarly, the observations are a sample from a mixture of probability measures , denoted here as . The remaining observations are distributed according to some probability measure . Furthermore, and . If one of the previous two inequalities does not hold, we are back in the single changepoint setting.
Consider any such that . Then, this choice of creates two mixture probability measures: one with component probability measures and , and the other with component probability measures and . The Ball detection function between these two mixture probability measures is equal to
(7) 
Theorem 3.3.2
By Theorem 3.3.2, is maximized when or for . Additionally, define
Let . Let . Then, we have the following theorem.
Theorem 3.3.3
Repeated applications of Theorem 3.3.3 show that as the sample size grows, the first estimated changepoints converge to the true changepoint locations in the manner described above. With a fixed threshold for the estimated probability, all of the changepoints will be estimated. However, with probability approaching 1 as the sample size increases, the number of changepoints determined in this way will exceed the true number of changepoints, since any given nominal level of significance implies a nonzero probability of rejecting the null hypothesis when it holds. The hierarchical procedure can be made consistent by adopting a threshold for the test that decreases to zero, at a suitable rate, as the sample size increases (Bai and Perron, 1998). This is illustrated by the following theorem.
Theorem 3.3.4
Although the hierarchical algorithm tends to estimate more changepoints asymptotically when the threshold is fixed, this has little effect in practice. For example, the asymptotic probability of selecting changepoints is given by , which decreases rapidly. Furthermore, if there is no changepoint, that is, , the probability of selecting at least one changepoint in our algorithm is
Hence the total rate of type I errors remains under control. This is a distinct feature of our hierarchical procedure, because controlling type I errors is a challenging issue in multiple testing.

4 Simulation Studies
In this section, we present the numerical performance of the proposed method (BDCP) with and compare it with several typical methods: the Bayesian method BCP (Barry and Hartigan, 1993), the WBS method (Fryzlewicz et al., 2014), the graph-based method gSeg (Chen et al., 2015; Chu et al., 2019), and the energy distance based method ECP (Matteson and James, 2014). BCP, WBS, ECP, and BDCP can estimate the number of changepoints automatically, while gSeg can detect only one changepoint or an interval.
Four criteria are commonly used to assess the performance of these methods: the adjusted Rand index, the over-segmentation error, the under-segmentation error, and the Hausdorff distance.
Suppose that the true changepoint set is and the estimated changepoint set is . Then denote the true segments of the series by and the estimated segments by .
Consider the pairs of observations that fall into one of the following two sets:
{pairs of observations in the same segment under both the true and the estimated segmentations};
{pairs of observations in different segments under both the true and the estimated segmentations}. Denote by and the numbers of pairs of observations in these two sets. The Rand index is defined as
The adjusted Rand index is the corrected-for-chance version of the Rand index, which is defined as
in which 1 corresponds to the maximum Rand index value.
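Computed from the contingency table of the two partitions, the adjusted Rand index can be sketched in a few lines of self-contained Python (per-point segment labels as input; the function name is ours):

```python
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two segmentations given as per-point
    segment labels.  1 means identical partitions; 0 is the chance level."""
    n = len(labels_a)
    # contingency counts n_ij, row sums a_i, column sums b_j
    pairs, a, b = {}, {}, {}
    for la, lb in zip(labels_a, labels_b):
        pairs[(la, lb)] = pairs.get((la, lb), 0) + 1
        a[la] = a.get(la, 0) + 1
        b[lb] = b.get(lb, 0) + 1
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)     # chance level of the index
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Because only the partition structure enters, relabelling the segments leaves the index unchanged, which is why it is a fair comparison across methods that number their segments differently.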
On the other hand, we also calculate the distance between and by
which quantify the over-segmentation error and the under-segmentation error, respectively (Boysen et al., 2009; Zou et al., 2014). The Hausdorff distance (Harchaoui and Lévy-Leduc, 2010) between and is defined as
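The Hausdorff distance between two non-empty changepoint sets is the larger of the two directed nearest-neighbour distances; a minimal sketch (function name ours):

```python
def hausdorff_distance(tau_true, tau_est):
    """Hausdorff distance between two non-empty sets of changepoint
    locations: the worst-case distance from a point in either set to
    the nearest point of the other set."""
    d_true_to_est = max(min(abs(a - b) for b in tau_est) for a in tau_true)
    d_est_to_true = max(min(abs(a - b) for a in tau_true) for b in tau_est)
    return max(d_true_to_est, d_est_to_true)
```

Note that it penalizes both missed changepoints (a true location far from every estimate) and spurious ones (an estimate far from every true location).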
Here, we only report the results based on the adjusted Rand index; the results under the other criteria are deferred to the supplementary material.
Three scenarios are used for comparison: univariate sequences, multivariate sequences, and manifold sequences. In each scenario, we consider two types of examples, one without a changepoint and one with two changepoints, as follows:
The sample sizes are set to be , . We repeat each model 400 times and set the threshold at 0.05. To save space, some results for univariate sequences are available in the supplementary material.
4.1 Multivariate sequence
In this subsection, we consider the -dimensional sequences. Examples 4.1.1-4.1.7 are sequences with no changepoint, and Examples 4.1.8-4.1.15 are models with two changepoints.

Examples 4.1.1-4.1.3:
for Example 4.1.1, for Example 4.1.2 and for Example 4.1.3.

Examples 4.1.4-4.1.5:
for Example 4.1.4 and for Example 4.1.5.

Examples 4.1.6-4.1.7:
for Example 4.1.6 and for Example 4.1.7.

Example 4.1.8:

Example 4.1.9:

Example 4.1.10: