α-Ball divergence and its applications to change-point problems for Banach-valued sequences

08/05/2018 ∙ by Qiang Zhang et al. ∙ Yale University ∙ Sun Yat-sen University

In this paper, we extend Ball divergence, a measure of divergence between two distributions, to a new notion: α-Ball divergence. With this new notion, we propose a sample statistic that can be used to test whether two weakly dependent sequences of Banach-valued random vectors have the same distribution. We show that α-Ball divergence and its sample statistic retain the properties of Ball divergence, and that these properties hold for random sequences that are functionals of absolutely regular sequences. We further apply the sample statistic to change-point problems for sequences of weakly dependent Banach-valued observations with multiple possible change-points. Our procedure requires no assumption about the type of change, and it detects both the number of change-points and their locations. We also prove the consistency of the estimated change-point locations. Extensive simulation studies and analyses of two interesting real data sets, on wind direction and bitcoin prices, illustrate that our procedure has considerable advantages over existing competitors, especially when observations are non-Euclidean or there are distributional changes in the variance.

1 Introduction

Stationarity is crucial in analyzing random sequences because statistical inference usually requires a probabilistic mechanism that is constant in, at least, a segment of observations. Therefore, it is important to detect whether changes occur in a sequence of observations prior to statistical inference. Such change-point problems arise in many applications: abrupt events in video surveillance (Mayer and Mundy, 2015); deterioration of product quality in quality control (Lai, 1995); credit card fraud in finance (Bolton and Hand, 2002); and alteration of genetic regions in cancer research (Erdman and Emerson, 2008).

There is a large and rapidly growing literature on change-point detection (Aminikhanghahi and Cook, 2016; Niu et al., 2016; Sharma et al., 2016; Fryzlewicz et al., 2014). Many methods rely on assumed parametric models to detect specific change types such as location, scale, or a presumed distribution family.

Page (1954) introduced a method based on the ratio of log-likelihood functions. Lavielle and Teyssiere (2006) detected change-points by maximizing a log-likelihood function. Yau and Zhao (2016) proposed a likelihood ratio scan method for piecewise stationary autoregressive time series. Some Bayesian change-point detection methods assume that the observations are normally distributed and calculate the probability of a change-point at each location (Barry and Hartigan, 1993; Zhang, 1995; Wang and Emerson, 2015; Maheu and Song, 2018), to name a few. Since parametric methods potentially suffer from model misspecification, other methods have been developed to detect general distributional changes under more relaxed assumptions. Kawahara and Sugiyama (2012) provided an algorithm that relies heavily on estimating the ratio of probability densities. Lung-Yut-Fong et al. (2015) identified change-points via the well-known Wilcoxon rank statistic. Matteson and James (2014) proposed a nonparametric method using the concept of energy distance for independent observations. There are also binary segmentation methods (Fryzlewicz et al., 2014; Cho and Fryzlewicz, 2015; Eichinger et al., 2018). Two advantages of binary segmentation procedures are their simplicity and computational efficiency, but their false discovery rates may be hard to control because they are 'greedy' procedures. Zou et al. (2014) introduced a nonparametric empirical likelihood approach to detecting multiple change-points in independent sequences, estimating the locations of the change-points with a dynamic programming algorithm and the intrinsic order structure of the likelihood function.

Automatically determining the number of change-points is also important. Some methods detect only a single change-point (Ryabko and Ryabko, 2008), while others require a known number of change-points with unknown locations (Hawkins, 2001; Lung-Yut-Fong et al., 2015). In real data analysis, however, we usually do not know the number of change-points.

With the increasing richness of data types, non-Euclidean data, such as shape data, functional data, and spatial data, commonly arise in applications. For example, one problem of interest to us involves changes in the monsoon direction, which is defined on the circle, a simple Riemannian manifold. Methods developed for Hilbert spaces are not effective for this type of problem, as our analysis of the data from the Yunnan-Guizhou Plateau (E, N) collected from 2015/06/01 to 2015/10/30 illustrates below. To the best of our knowledge, few methods exist to detect change-points in a non-Euclidean sequence. Chen et al. (2015) and Chu et al. (2019) proposed a series of graph-based nonparametric approaches that can be applied to non-Euclidean data of arbitrary dimension. However, their methods apply to iid observations only and are restricted to one or two change-points. Therefore, it remains an open and challenging problem to develop methods that detect arbitrary distributional changes in non-Euclidean sequences, including both the number of change-points and their locations.

To address this challenge, we introduce a novel concept of a Ball detection function via Ball divergence (Pan et al., 2018). Ball divergence is a recently developed measure of divergence between two probabilities in separable Banach spaces; it is zero if and only if the two probability measures are identical. Since its sample statistic is constructed from metric ranks, the associated test for an identical distribution is robust to heavy-tailed data and outliers, consistent against the alternative hypothesis, and applicable to imbalanced data. The empirical Ball divergence is therefore an ideal statistic for testing whether a change has occurred. Unfortunately, it does not tell us where the change occurs, because in theory the probability measures before and after any time point always differ whenever the sequence contains a change point. It is therefore imperative to observe how the probability measures before and after a given time vary with that time, and then to develop a proper criterion for locating the change-point. We introduce a Ball detection function as an effective choice: it reaches its maximum at the change point if the sequence has exactly one change point. We further develop a hierarchical algorithm to detect multiple change-points using a statistic based on the Ball detection function. The advantages of our procedure are threefold: it estimates the number of change-points as well as their locations; it detects any type of change; and, uniquely and importantly, it can handle complex stochastic sequences, for example, non-Euclidean sequences.

The rest of this article is organized as follows. In Section 2, we review the notion of Ball divergence, introduce a novel change-point detection function, namely a Ball detection function based on Ball divergence with a scale parameter for weakly dependent sequences, and establish its asymptotic properties. In Section 3, we show how to use the Ball detection function to detect change-points and establish the consistency of our method. In Section 4, we compare the performance of our method with existing methods in various simulation settings. In Section 5, two real data analyses demonstrate the utility of the proposed method. We make concluding remarks in Section 6. All technical details are deferred to the Appendix.

2 Change-point Detection in Dependent Sequences

2.1 Review of Ball Divergence

Ball divergence (BD, Pan et al. (2018)) is a measure of the difference between two probabilities in a separable Banach space $(V, \|\cdot\|)$. For $u, v \in V$, the distance between $u$ and $v$ induced by the norm is $\rho(u, v) = \|u - v\|$. Denote by $\bar{B}(u, r) = \{v \in V : \rho(u, v) \le r\}$ a closed ball with center $u$ and radius $r$. Let $\mathscr{B}$ be the smallest $\sigma$-algebra in $V$ that contains all closed (or open) subsets of $V$, and let $\mu$ and $\nu$ be two probabilities on $\mathscr{B}$. Ball divergence (Pan et al., 2018) is defined as follows.

Definition 2.1.1

The Ball divergence of two Borel probabilities $\mu$ and $\nu$ in $V$ is defined as an integral of the square of the measure difference between $\mu$ and $\nu$ over arbitrary closed balls,

$$D(\mu, \nu) = \iint_{V \times V} [\mu - \nu]^2\big(\bar{B}(u, \rho(u, v))\big)\, \big(\mu(du)\mu(dv) + \nu(du)\nu(dv)\big).$$

Let $S_\mu$ and $S_\nu$ be the support sets of $\mu$ and $\nu$, respectively. BD has the following important property (Pan et al., 2018):

Theorem 2.1.1

Given two Borel probabilities $\mu$ and $\nu$ in a finite dimensional Banach space $V$, $D(\mu, \nu) \ge 0$, where the equality holds if and only if $\mu = \nu$. The result extends to separable Banach spaces if $S_\mu \subseteq S_\nu$ or $S_\nu \subseteq S_\mu$.

2.2 Ball Divergence with a Scale Parameter

The Ball divergence introduced above cannot locate change-points accurately enough when comparing the distributions of the sequences before and after a candidate change-point. We therefore introduce a Ball divergence associated with a scale parameter $\alpha$, as follows.

Definition 2.2.1

An $\alpha$-Ball divergence of two Borel measures $\mu$ and $\nu$ in $V$ is defined as

$$D_\alpha(\mu, \nu) = \iint_{V \times V} [\mu - \nu]^2\big(\bar{B}(u, \rho(u, v))\big)\, \eta_\alpha(du)\, \eta_\alpha(dv), \qquad (1)$$

where $\eta_\alpha = \alpha \mu + (1 - \alpha)\nu$, $\alpha \in [0, 1]$, is the mixture distribution measure with the scale parameter $\alpha$.

$D_\alpha$ also has the equivalence property below, which is critical to comparing the distributions of any two sequences.

Theorem 2.2.1

Given two Borel probabilities $\mu$ and $\nu$ in a finite dimensional Banach space $V$, $D_\alpha(\mu, \nu) \ge 0$ for any $\alpha \in [0, 1]$, where the equality holds if and only if $\mu = \nu$. The result also holds on separable Banach spaces if $S_\mu \subseteq S_\nu$ or $S_\nu \subseteq S_\mu$.

Theorem 2.2.1 assures that $D_\alpha$, for any $\alpha \in [0, 1]$, possesses the most important property of $D$ in terms of testing the distributional difference between two sequences. Importantly, with the introduction of $\alpha$, we can consistently estimate the locations of the change-points. Here, we highlight the relationship and difference between $D_\alpha$ and $D$.

When $\alpha = 1$, $D_1$ integrates the measure difference over the balls whose centers and radius endpoints follow the measure $\mu$; when $\alpha = 0$, $D_0$ integrates the measure difference over the balls whose centers and radius endpoints follow the measure $\nu$. Moreover, $D = D_1 + D_0$.

For $\alpha \in (0, 1)$, $D_\alpha$ is a weighted mean of the measure differences over the balls whose centers and radius endpoints follow the four possible pairs of measures $(\mu, \mu)$, $(\mu, \nu)$, $(\nu, \mu)$ and $(\nu, \nu)$, where the two measures are mixed in the ratio $\alpha : (1 - \alpha)$.
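To make the four pairs explicit, one can expand the product measure $\eta_\alpha \otimes \eta_\alpha$ in (1). The display below is a sketch of this algebraic expansion under the mixture form $\eta_\alpha = \alpha\mu + (1 - \alpha)\nu$ reconstructed above, writing $\Delta(u, v) = [\mu - \nu]^2(\bar{B}(u, \rho(u, v)))$ for brevity:

$$D_\alpha(\mu, \nu) = \alpha^2 \iint \Delta\, d\mu\, d\mu + \alpha(1 - \alpha) \iint \Delta\, d\mu\, d\nu + (1 - \alpha)\alpha \iint \Delta\, d\nu\, d\mu + (1 - \alpha)^2 \iint \Delta\, d\nu\, d\nu.$$

In particular, the endpoint cases $\alpha = 1$ and $\alpha = 0$ keep only the $(\mu, \mu)$ and $(\nu, \nu)$ terms, so $D_1 + D_0$ recovers the original Ball divergence $D$.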

Ball divergence with a scale parameter can be defined in a general metric space, following the generalized Banach-Mazur theorem (Kleiber and Pervin, 1969), as stated in the Supplementary Material.

2.3 Ball Detection Function

Now, we introduce a Ball detection function that is maximized at the change point, if one exists, and hence can be used to determine the location of the change point. For clarity, consider a conceptual sequence with a change point at the fraction $\tau \in (0, 1)$, where the probability measures before and after $\tau$ are $\mu$ and $\nu$, respectively. Denote the indicator function by $I(\cdot)$. For a "time" $t \in (0, 1)$, define $\mu_t$ and $\nu_t$ as the probability measures of the observations before and after $t$.

Without loss of generality, suppose that $t \le \tau$; then the probability measures before and after $t$ are

$$\mu_t = \mu \quad \text{and} \quad \nu_t = \frac{(\tau - t)\mu + (1 - \tau)\nu}{1 - t}.$$

Therefore, in general, the $\alpha$-Ball divergence (1) between the measures before and after time $t$ defines a function

$$g(t) = D_\alpha(\mu_t, \nu_t), \qquad t \in (0, 1). \qquad (2)$$

The maximum of $g$ is attained at $t = \tau$ if there exists a change-point $\tau$. In this case, we can find the change point by maximizing the Ball divergence in equation (2). But we still need to test whether a change point has occurred at all. Next, we introduce a Ball detection function $V(t)$, built from (2), to simultaneously test the existence of a change-point and determine its location.

Note that the maximum of $V(t)$ is also attained at $t = \tau$, allowing us to find the change point by maximizing $V(t)$. In the next subsection, we discuss how this function is used to construct a change-point test statistic.

2.4 Ball Detection Function in Sample

Suppose that a sequence of observations $Z_1, \dots, Z_n$ is comprised of two multivariate stationary subsequences: $Z_1, \dots, Z_{\lfloor n\tau \rfloor}$ with probability measure $\mu$ and $Z_{\lfloor n\tau \rfloor + 1}, \dots, Z_n$ with probability measure $\nu$, where both $\mu$ and $\nu$ are unknown. We estimate the detection function at a candidate time $t$ from the empirical measures of the observations before and after $t$. Let $\delta(x, y, z) = I\big(z \in \bar{B}(x, \rho(x, y))\big)$, which identifies whether the point $z$ falls into the closed ball with $x$ as the center and $\rho(x, y)$ as the radius, and let $\delta(x, y; z, w) = \delta(x, y, z)\,\delta(x, y, w)$, which determines whether the two points $z$ and $w$ fall into that ball together. Averaging these indicators over the observations before and after $t$ yields a consistent estimator $V_n(t)$ of the Ball detection function with the scale parameter $\alpha$, as summarized in Theorem 2.4.1.
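As a concrete illustration, the following Python sketch computes a plug-in version of the $\alpha$-Ball divergence between two samples by substituting empirical measures for $\mu$ and $\nu$ in the reconstructed definition (1). The function name and the direct $O(n^3)$ evaluation are our illustrative choices, not the authors' implementation:

  import numpy as np

  def empirical_alpha_ball_divergence(X, Y, alpha=0.5):
      """Plug-in estimate of D_alpha(mu, nu) under the reconstructed (1).

      Empirical measures of the samples X and Y stand in for mu and nu,
      and eta_alpha is the correspondingly weighted empirical measure of
      the pooled sample. A sketch: O(n^3) time and memory.
      """
      X = np.asarray(X, dtype=float)
      Y = np.asarray(Y, dtype=float)
      if X.ndim == 1:
          X = X[:, None]
      if Y.ndim == 1:
          Y = Y[:, None]
      n1, n2 = len(X), len(Y)
      Z = np.concatenate([X, Y])
      # weights of eta_alpha = alpha * mu_hat + (1 - alpha) * nu_hat
      w = np.concatenate([np.full(n1, alpha / n1), np.full(n2, (1 - alpha) / n2)])
      dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
      # in_ball[u, v, k]: does Z_k fall in the closed ball B(Z_u, rho(Z_u, Z_v))?
      in_ball = dist[:, None, :] <= dist[:, :, None]
      mu_ball = in_ball[:, :, :n1].mean(axis=2)   # empirical mu of each ball
      nu_ball = in_ball[:, :, n1:].mean(axis=2)   # empirical nu of each ball
      # integrate the squared measure difference against eta_alpha x eta_alpha
      return float(w @ (mu_ball - nu_ball) ** 2 @ w)

For two samples from the same distribution this statistic is close to zero, while it stabilizes at a positive value under a distributional difference, in line with Theorem 2.2.1.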

We also prove that the sample statistic has a limiting distribution under the null hypothesis (Theorem 2.4.2). For this reason, we choose it as the statistic to detect change-points.

To investigate the asymptotic properties of the sample statistic, we introduce two concepts for random sequences: absolute regularity and ergodic stationarity.

Given the probability space $(\Omega, \mathcal{F}, P)$ and two sub-$\sigma$-fields $\mathcal{A}$ and $\mathcal{B}$ of $\mathcal{F}$, let

$$\beta(\mathcal{A}, \mathcal{B}) = \frac{1}{2}\, \sup \sum_{i=1}^{I} \sum_{j=1}^{J} \big|P(A_i \cap B_j) - P(A_i)P(B_j)\big|,$$

where the supremum is taken over all partitions $A_1, \dots, A_I$ of $\Omega$ into sets $A_i \in \mathcal{A}$, all partitions $B_1, \dots, B_J$ of $\Omega$ into sets $B_j \in \mathcal{B}$, and all $I, J \ge 1$. A stochastic sequence $(Z_k)_{k \ge 1}$ is called absolutely regular ((Dehling and Fried, 2012), also called weakly Bernoulli (Aaronson et al., 1996)) if

$$\beta_k = \sup_{m \ge 1} \beta\big(\sigma(Z_1, \dots, Z_m),\ \sigma(Z_{m+k}, Z_{m+k+1}, \dots)\big) \longrightarrow 0$$

as $k \to \infty$. Here $\sigma(\cdot)$ denotes the $\sigma$-field generated by the random variables in its argument. In this paper, we assume the observed sequence is absolutely regular, with the decay rate of $\beta_k$ specified in each theorem below. The concept of an absolutely regular sequence is wide enough to cover all relevant examples from statistics except long-memory sequences.

Recall that an ergodic, stationary sequence (ESS) (Aaronson et al., 1996) is a random sequence of the form $Z_k = f(T^k \omega)$, where $T$ is an ergodic, probability-preserving transformation on the probability space $(\Omega, \mathcal{F}, P)$ and $f$ is a measurable function. In essence, an ESS implies that the random sequence does not change its statistical properties with time (stationarity) and that its statistical properties can be deduced from a single, sufficiently long sample of the sequence (ergodicity). For example, a stationary AR(1) process with absolutely continuous iid innovations is an ESS and is absolutely regular with geometrically decaying $\beta_k$.

We have the following theorem for an absolutely regular sequence comprised of two ergodic stationary sequences:

Theorem 2.4.1

Suppose that $Z_1, \dots, Z_n$ is an absolutely regular sequence, and that $Z_1, \dots, Z_{\lfloor n\tau \rfloor}$ and $Z_{\lfloor n\tau \rfloor + 1}, \dots, Z_n$ are both ergodic stationary with marginal probability measures $\mu$ and $\nu$, respectively. Then, under suitable moment and mixing-rate conditions, the sample statistic $V_n(t)$ converges to the Ball detection function $V(t)$ almost surely as $n \to \infty$.

Theorem 2.4.1 means that $V_n(t)$ converges to the Ball detection function almost surely. We further investigate the asymptotic distribution of $V_n(t)$. Under the null hypothesis, the sample Ball detection function is the sum of four degenerate V-statistics. As in Pan et al. (2018), denote by $h_2$ the second component in the H-decomposition of the underlying kernel. Then we have the spectral decomposition

$$h_2(x, y) = \sum_{k=1}^{\infty} \lambda_k \phi_k(x)\phi_k(y),$$

where the $\lambda_k$ and $\phi_k$ are the eigenvalues and eigenfunctions of $h_2$. Let $Z'$ be an independent copy of $Z$, and let $\xi_k$, $k \ge 1$, denote the (possibly dependent) centered Gaussian random variables associated with the eigenfunctions $\phi_k$ that appear in the limit below.

Theorem 2.4.2

Under the null hypothesis $H_0: \mu = \nu$, suppose that $\{Z_i\}$ is a stationary absolutely regular sequence whose coefficients $\beta_k$ decay sufficiently fast. Then, under suitable moment conditions, the suitably normalized sample statistic converges in distribution to a weighted sum of the form $\sum_{k} \lambda_k \xi_k^2$ determined by the spectral decomposition above.

Under the alternative hypothesis, the sample Ball detection function is asymptotically normal because it is a sum of non-degenerate V-statistics. Let $h_1$ denote the first component in the H-decomposition of the underlying kernel, and let $\sigma^2$ denote the corresponding long-run variance. We can then obtain the asymptotic distribution under the alternative hypothesis.

Theorem 2.4.3

Suppose that $\{Z_i\}$ is an absolutely regular sequence whose coefficients $\beta_k$ decay sufficiently fast. Under $H_1: \mu \ne \nu$ and suitable moment conditions, the sample statistic, centered at the Ball detection function and scaled by $\sqrt{n}$, converges in distribution to a normal law with variance determined by $h_1$.

We now show that the sample Ball detection function is consistent against general alternatives. Our new detection function can also handle imbalanced sample sizes: as the following theorem shows, the asymptotic power of the test does not go to zero even if the fraction $t$ of observations before the split goes to $0$ or $1$.

Theorem 2.4.4

The test based on $V_n(t)$ is consistent against any general alternative $H_1: \mu \ne \nu$; more specifically, its asymptotic power remains bounded away from zero as $t \to 0$ and as $t \to 1$.

3 Detection of Change-points

3.1 Hierarchical Algorithm

Next, we use the sample Ball detection function to detect change-points in a sequence. For simplicity, suppose first that the sequence contains at most one change-point. The possible change-point location is then estimated by maximizing the detection function:

$$\hat{\tau} = \operatorname*{arg\,max}_{t} V_n(t). \qquad (3)$$

We use the bootstrap method to estimate the probability that the observed maximum exceeds its counterpart computed on resampled data. If the estimated probability is high enough, $\hat{\tau}$ is the estimated change-point. Otherwise, we proceed as if there is no change-point in the sequence.
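A minimal Python sketch of the single change-point scan in (3), reusing the plug-in divergence above as a stand-in for the sample detection function $V_n(t)$; the function name and the minimum segment size are our illustrative choices:

  def detect_single_change_point(Z, alpha=0.5, min_seg=10):
      """Return the split index and value maximizing the statistic in (3)."""
      n = len(Z)
      best_t, best_v = None, -np.inf
      for t in range(min_seg, n - min_seg + 1):
          # divergence between the observations before and after split t
          v = empirical_alpha_ball_divergence(Z[:t], Z[t:], alpha)
          if v > best_v:
              best_t, best_v = t, v
      return best_t, best_v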

It is more complicated if the sequence has multiple change-points. In this case, we estimate the first change-point by jointly maximizing over a candidate split point $t$ and a candidate segment endpoint $\kappa$:

$$(\hat{\tau}, \hat{\kappa}) = \operatorname*{arg\,max}_{t < \kappa} V_n(t; \kappa), \qquad (4)$$

where $V_n(t; \kappa)$ is the sample detection function computed from the observations up to $\kappa$ with split point $t$. From (4), we can see that $\kappa$ is introduced to alleviate a weakness of the bisection algorithm (Matteson and James, 2014): each segment may contain multiple change-points, and without $\kappa$ the value of the detection statistic at the true first change-point may be lower than the detection threshold.

Suppose that $k$ change-points have been estimated at locations $\hat{\tau}_1 < \dots < \hat{\tau}_k$, with $\hat{\tau}_0 = 0$ and $\hat{\tau}_{k+1} = n$. These change-points partition the sequence into $k + 1$ segments $S_1, \dots, S_{k+1}$. The sample Ball detection function within segment $S_i$ is denoted by $V_n^{(i)}(t; \kappa)$. Now let

$$(\hat{i}, \hat{\tau}, \hat{\kappa}) = \operatorname*{arg\,max}_{i}\ \max_{t < \kappa} V_n^{(i)}(t; \kappa). \qquad (5)$$

Then $\hat{\tau}$ is the $(k + 1)$-th possible change-point, located within segment $S_{\hat{i}}$. This hierarchical algorithm for estimating multiple change-points is outlined below.

  Let the minimum segment size be m and the change-point set C = ∅.
  Suppose that k change-points have been estimated. This decomposes the observations into k + 1 segments.
  for each segment i = 1, ..., k + 1 do
     if the i-th segment is a new segment, then
        V_i ← −∞
        for each candidate split point t in segment i do
           for each candidate endpoint κ in segment i with κ − t ≥ m do
              Compute V_n^(i)(t; κ);
              if V_n^(i)(t; κ) > V_i then
                 V_i ← V_n^(i)(t; κ);
              end if
           end for
        end for
     else
        V_i had been calculated.
     end if
  end for
  (i*, t*, κ*) ← arg max_i V_i
  if the estimated probability for V_{i*} exceeds a threshold, then
     put t* into C;
  else
     there does not exist a new change-point.
  end if
Algorithm 1 Multiple change-points Algorithm
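The following Python sketch mirrors Algorithm 1 under the reconstruction above. For simplicity it recomputes the best split within every current segment instead of caching previously computed segment maxima, and it delegates significance to a user-supplied test (for instance, the moving block bootstrap of Section 3.2); all names are our illustrative choices:

  def hierarchical_change_points(Z, is_significant, alpha=0.5, min_seg=10):
      """Sketch of Algorithm 1: greedily add the best candidate split
      among all current segments while the significance test passes."""
      change_points = []                        # the accepted set C
      while True:
          bounds = sorted([0, len(Z)] + change_points)
          best = None                           # (statistic, location, segment)
          for lo, hi in zip(bounds[:-1], bounds[1:]):
              if hi - lo < 2 * min_seg:
                  continue                      # segment too short to split
              t, v = detect_single_change_point(Z[lo:hi], alpha, min_seg)
              if best is None or v > best[0]:
                  best = (v, lo + t, (lo, hi))
          if best is None or not is_significant(Z[best[2][0]:best[2][1]], best[0]):
              return sorted(change_points)      # no significant new change-point
          change_points.append(best[1])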

3.2 Hierarchical Significance Testing

Here, we elaborate on the use of the bootstrap method mentioned above.

Theorem 2.4.2 shows that the asymptotic null distribution of the detection statistic is a mixture of weighted $\chi^2$-type distributions. In practice, it is difficult to use the asymptotic null distribution directly, so we use the moving block bootstrap (Kunsch, 1989) to obtain the empirical probabilities.

Given a set of observations $Z_1, \dots, Z_n$ and a block size $b$, we draw a bootstrap resample as follows: (i) define the $b$-dimensional vectors $B_j = (Z_j, \dots, Z_{j+b-1})$, $j = 1, \dots, n - b + 1$; (ii) resample $\lfloor n/b \rfloor + 1$ blocks from the block data with replacement, where $\lfloor \cdot \rfloor$ denotes the integer part, and denote the first $n$ elements of the concatenated pseudo data as the bootstrap resample $Z_1^*, \dots, Z_n^*$; (iii) repeat steps (i) and (ii) $R$ times, and for the $r$-th repetition denote the maximum value in equation (4) based on the resample by $V^{*(r)}$; (iv) the approximate probability is estimated by $\hat{p} = \frac{1}{R} \sum_{r=1}^{R} I\big(V^{*(r)} < V_n(\hat{\tau}; \hat{\kappa})\big)$. Denote the threshold of the estimated probability by $p_0$; if $\hat{p} > p_0$, then $\hat{\tau}$ is declared a change-point.
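A compact Python sketch of this moving block bootstrap test, assuming stat_fn returns the maximized detection statistic of a series (for example, the scan above); the decision rule p_hat > p0 follows the description just given, and all names are our illustrative choices:

  def moving_block_bootstrap_test(Z, stat_fn, block_size, n_boot=199, p0=0.95, seed=0):
      """Return (is_change_point, p_hat) for the observed maximum statistic."""
      rng = np.random.default_rng(seed)
      n = len(Z)
      # all overlapping blocks of the given size
      blocks = np.array([Z[j:j + block_size] for j in range(n - block_size + 1)])
      observed = stat_fn(Z)
      below = 0
      for _ in range(n_boot):
          idx = rng.integers(0, len(blocks), size=n // block_size + 1)
          pseudo = np.concatenate(blocks[idx])[:n]   # keep the first n elements
          below += stat_fn(pseudo) < observed
      p_hat = below / n_boot
      return p_hat > p0, p_hat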

In applications, the choice of the block size involves a trade-off. If the block size is too small, the moving block bootstrap destroys the time dependency of the data and accuracy deteriorates; if it is too large, few blocks are available. In other words, increasing the block size reduces the bias and captures more persistent dependence, while decreasing the block size reduces the variance because more subsamples are available. A reasonable trade-off is to take the mean squared error (MSE) as the objective criterion balancing bias and variance. For linear time series, as proved in Carlstein (1986), the value of the block size that minimizes the MSE is

$$b^{\mathrm{opt}}(\rho) = \Big(\frac{\sqrt{6}\,\rho}{1 - \rho^2}\Big)^{2/3} n^{1/3},$$

where $\rho$ is the first-order autocorrelation. Because the construction of the MSE depends on knowledge of the underlying data-generating sequence, no optimal result is available in general. In this paper, we follow Hong et al. (2017) and Xiao and Lima (2007) and choose $b = \max(b_1, b_2)$, where

$$b_1 = b^{\mathrm{opt}}(\hat{\rho}_1), \qquad (6)$$

$\hat{\rho}_1$ is the estimated first-order autocorrelation of $\{Z_i\}$, and $b_2$ is the same as (6) except that $\hat{\rho}_1$ is replaced by the estimated first-order autocorrelation of $\{Z_i^2\}$. The choice of $b$ thus accounts for both linear and non-linear dependence.
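A small Python helper implementing this rule for a scalar series (a sketch under the Carlstein-type formula reconstructed above; for multivariate data one would first reduce to a scalar summary):

  def select_block_size(x):
      """Block size from the first-order autocorrelations of x and x**2."""
      x = np.asarray(x, dtype=float)
      n = len(x)
      def rho1(v):
          v = v - v.mean()
          return float(np.sum(v[1:] * v[:-1]) / np.sum(v * v))
      def b_opt(rho):
          return (np.sqrt(6) * abs(rho) / (1 - rho ** 2)) ** (2 / 3) * n ** (1 / 3)
      # take the larger of the linear- and nonlinear-dependence choices
      return max(1, int(np.ceil(max(b_opt(rho1(x)), b_opt(rho1(x ** 2))))))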

3.3 Consistency

The next theorem shows the consistency of the estimated change-point locations under the following assumption.

Assumption 3.1

Suppose that $Z_1, \dots, Z_n$ is an absolutely regular sequence comprised of two ergodic stationary subsequences. Let $\tau \in (0, 1)$ denote the change fraction of the observations, such that $Z_1, \dots, Z_{\lfloor n\tau \rfloor}$ is an ergodic stationary sequence with marginal probability measure $\mu$, and the second ergodic stationary sequence $Z_{\lfloor n\tau \rfloor + 1}, \dots, Z_n$ has marginal distribution $\nu$. Finally, let $\{m_n\}$ be a sequence of positive numbers such that $m_n \to \infty$ and $m_n / n \to 0$ as $n \to \infty$.

Theorem 3.3.1

Suppose Assumption 3.1 holds. Let $\hat{\tau}_n$ be the estimated change-point location from equation (3) for a sample of size $n$. Then, for every $\epsilon > 0$ and all $n$ large enough, $|\hat{\tau}_n - \tau| < \epsilon$ with probability tending to one.

This theorem shows that consistency requires only that the size of each segment increases to infinity, not necessarily at the same rate. Under Assumption 3.1, $\tau$ can be close to $0$ or $1$ when $n \to \infty$, which corresponds to the imbalanced case.

In the multiple change-point situation, we need the following assumption.

Assumption 3.2

Suppose that $Z_1, \dots, Z_n$ is an absolutely regular sequence. Let $\tau_0 = 0 < \tau_1 < \dots < \tau_k < \tau_{k+1} = 1$ denote the change fractions. For $i = 0, \dots, k$, $Z_{\lfloor n\tau_i \rfloor + 1}, \dots, Z_{\lfloor n\tau_{i+1} \rfloor}$ is an ergodic stationary sequence with marginal probability measure $\mu_i$, and $\mu_i \ne \mu_{i+1}$. Furthermore, let $\{m_n\}$ be a sequence of positive numbers such that $m_n \to \infty$ and $m_n / n \to 0$ as $n \to \infty$.

It is worth noting that we do not directly assume an upper bound on the number of change-points; instead, we specify the minimum sample size in each segment. In other words, under Assumption 3.2, the number of change-points may grow as $n \to \infty$.

The analysis of multiple change points can be reduced to the analysis of only two change points under Assumption 3.2. Fix any pair of adjacent change fractions $\tau_i$ and $\tau_{i+1}$. The observations before $\lfloor n\tau_i \rfloor$ can be seen as a random sample from a mixture of the probability measures $\mu_0, \dots, \mu_{i-1}$, denoted as $\bar{\mu}$. Similarly, the observations after $\lfloor n\tau_{i+1} \rfloor$ are a sample from a mixture of the probability measures $\mu_{i+1}, \dots, \mu_k$, denoted here as $\bar{\nu}$. The remaining observations are distributed according to the probability measure $\mu_i$. Furthermore, $\bar{\mu} \ne \mu_i$ and $\bar{\nu} \ne \mu_i$; if one of these two inequalities fails, we are back in the single change-point setting.

Consider any $t$ such that $\tau_i < t < \tau_{i+1}$. This choice of $t$ creates two mixture probability measures: one with component probability measures $\bar{\mu}$ and $\mu_i$, and the other with component probability measures $\mu_i$ and $\bar{\nu}$. The Ball detection function between these two mixture probability measures is then a function of $t$, which we write as

$$V(t), \qquad \tau_i < t < \tau_{i+1}. \qquad (7)$$
Theorem 3.3.2

Suppose that Assumption 3.2 holds. Then the Ball detection function in equation (7) is maximized when either $t = \tau_i$ or $t = \tau_{i+1}$.

By Theorem 3.3.2, $V(t)$ is maximized at $t = \tau_i$ or $t = \tau_{i+1}$ for some $i$. Additionally, define the corresponding maximal value of the detection function over all pairs of adjacent change-points. Then, we have the following theorem.

Theorem 3.3.3

Suppose that Assumption 3.2 holds, together with suitable conditions on the minimum segment size $m_n$. Let $\hat{\tau}$ be the estimated change point as defined by equation (4). Then $\hat{\tau}/n$ converges in probability to one of the true change fractions as $n \to \infty$.

Repeated applications of Theorem 3.3.3 show that, as $n \to \infty$, the first $k$ estimated change points converge to the true change point locations in the manner described above. With a fixed threshold $p_0$ for the estimated probability, all of the change-points will be found. However, with probability approaching 1 as the sample size increases, the number of change-points determined in this way will exceed the true number, since any given nominal level of significance implies a nonzero probability of rejecting the null hypothesis when it holds. The hierarchical procedure can be made consistent by adopting a significance threshold that decreases to zero at a suitable rate as the sample size increases (Bai and Perron, 1998). This is formalized in the following theorem.

Theorem 3.3.4

Let $\hat{k}_n$ be the number of change-points obtained using the hierarchical method based on the statistic (5) applied with a significance threshold $a_n$, and let $k$ be the true number of change-points. If $a_n \to 0$ at a suitable rate, then under Assumption 3.2, $\hat{k}_n \to k$ in probability.

Although the hierarchical algorithm tends to estimate more change-points asymptotically when the threshold is fixed, this has little effect in practice. For example, the asymptotic probability of selecting $j$ spurious change-points beyond the true $k$ decreases rapidly in $j$, since each spurious change-point requires an additional false rejection. Furthermore, if there is no change point, that is, $k = 0$, the probability of selecting at least one change point in our algorithm equals the nominal level of the test. Hence the total rate of type I errors remains controlled at the nominal level. This is a distinct feature of our hierarchical procedure, because controlling type I errors is a challenging issue in multiple testing.

4 Simulation studies

In this section, we examine the numerical performance of the proposed method (BDCP) and compare it with several typical methods: the Bayesian method BCP (Barry and Hartigan, 1993), the WBS method (Fryzlewicz et al., 2014), the graph-based method gSeg (Chen et al., 2015; Chu et al., 2019) and the energy distance based method ECP (Matteson and James, 2014). BCP, WBS, ECP and BDCP can estimate the number of change-points automatically, while gSeg can detect only one change-point or an interval.

We evaluate the methods using four common criteria: the adjusted Rand index, the over-segmentation error, the under-segmentation error and the Hausdorff distance. Suppose that the true change-point set is $\mathcal{C}$ and the estimated change-point set is $\hat{\mathcal{C}}$. Denote the true segmentation of the series by $G$ and the estimated segmentation by $\hat{G}$. Consider the pairs of observations that fall into one of the following two sets:
{pairs of observations in the same segment under $G$ and in the same segment under $\hat{G}$};
{pairs of observations in different segments under $G$ and in different segments under $\hat{G}$}.
Denote by $a$ and $b$ the numbers of pairs of observations in these two sets, respectively. The Rand index is defined as

$$\mathrm{RI} = \frac{a + b}{\binom{n}{2}}.$$

The adjusted Rand index is the corrected-for-chance version of the Rand index, defined as

$$\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]},$$

in which 1 corresponds to the maximum Rand index value.

On the other hand, we also measure the distances between $\mathcal{C}$ and $\hat{\mathcal{C}}$ by

$$\zeta(\hat{\mathcal{C}} \mid \mathcal{C}) = \sup_{\hat{c} \in \hat{\mathcal{C}}}\ \inf_{c \in \mathcal{C}} |\hat{c} - c| \quad \text{and} \quad \zeta(\mathcal{C} \mid \hat{\mathcal{C}}) = \sup_{c \in \mathcal{C}}\ \inf_{\hat{c} \in \hat{\mathcal{C}}} |c - \hat{c}|,$$

which quantify the over-segmentation error and the under-segmentation error, respectively (Boysen et al., 2009; Zou et al., 2014). The Hausdorff distance (Harchaoui and Lévy-Leduc, 2010) between $\mathcal{C}$ and $\hat{\mathcal{C}}$ is their maximum,

$$d_H(\mathcal{C}, \hat{\mathcal{C}}) = \max\big\{\zeta(\hat{\mathcal{C}} \mid \mathcal{C}),\ \zeta(\mathcal{C} \mid \hat{\mathcal{C}})\big\}.$$

Here, we only report the results based on the adjusted Rand index. The results under the other criteria are deferred to the supplementary material.
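For reference, these criteria are straightforward to compute. The Python sketch below evaluates the Rand index and the two segmentation errors from change-point index sets; the chance-corrected ARI can be obtained by passing the same segment labels to sklearn.metrics.adjusted_rand_score. All function names are our illustrative choices:

  from itertools import combinations

  def segment_labels(n, cps):
      """Label each time index 0..n-1 by the segment it belongs to."""
      cps, labels, seg = sorted(cps), [], 0
      for i in range(n):
          if seg < len(cps) and i >= cps[seg]:
              seg += 1
          labels.append(seg)
      return labels

  def rand_index(n, true_cps, est_cps):
      g = segment_labels(n, true_cps)
      h = segment_labels(n, est_cps)
      agree = sum((g[i] == g[j]) == (h[i] == h[j])
                  for i, j in combinations(range(n), 2))
      return agree / (n * (n - 1) / 2)

  def segmentation_errors(true_cps, est_cps):
      """Over- and under-segmentation errors (both sets assumed non-empty);
      their maximum is the Hausdorff distance."""
      over = max(min(abs(e - t) for t in true_cps) for e in est_cps)
      under = max(min(abs(e - t) for e in est_cps) for t in true_cps)
      return over, under, max(over, under)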

Three scenarios are used for comparison: univariate sequences, multivariate sequences and manifold sequences. In each scenario, we consider two types of examples, one without change-points and one with two change-points, as follows. Each model is repeated 400 times at each of the chosen sample sizes, and the threshold is set at 0.05. To save space, some results for univariate sequences are given in the supplementary material.
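Before turning to the individual scenarios, the following end-to-end Python sketch ties together the pieces developed above (the plug-in divergence, the scan, the block-size rule and the moving block bootstrap). The three-segment Gaussian model here is a hypothetical stand-in of our own choosing, not one of the numbered examples below, and the small sizes keep the O(n^3) sketch fast:

  # reuses numpy (np) and the sketches defined in Sections 2.4, 3.1 and 3.2
  rng = np.random.default_rng(1)
  d, n_seg = 5, 30
  Z = np.concatenate([
      rng.normal(0.0, 1.0, size=(n_seg, d)),   # segment 1
      rng.normal(1.0, 1.0, size=(n_seg, d)),   # segment 2: mean shift
      rng.normal(1.0, 2.0, size=(n_seg, d)),   # segment 3: variance change
  ])

  b = select_block_size(Z[:, 0])               # block size from one coordinate

  def is_significant(segment, observed_stat):
      # recompute the max statistic on bootstrap pseudo-series of the segment
      stat = lambda W: detect_single_change_point(W)[1]
      decision, _ = moving_block_bootstrap_test(segment, stat, b, n_boot=99)
      return decision

  print(hierarchical_change_points(Z, is_significant))   # expected near [30, 60]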

4.1 Multivariate sequence

In this subsection, we consider multivariate ($d$-dimensional) sequences. Examples 4.1.1-4.1.7 are sequences with no change-point, and Examples 4.1.8-4.1.15 are models with two change-points.

  • Examples 4.1.1-4.1.3:

    for Example 4.1.1, for Example 4.1.2 and for Example 4.1.3.

  • Examples 4.1.4-4.1.5:

    for Example 4.1.4 and for Example 4.1.5.

  • Examples 4.1.6-4.1.7:

    for Example 4.1.6 and for Example 4.1.7.

  • Example 4.1.8:

  • Example 4.1.9:

  • Example 4.1.10: