# A Composite Likelihood-based Approach for Change-point Detection in Spatio-temporal Process

This paper develops a unified, accurate and computationally efficient method for change-point inference in non-stationary spatio-temporal processes. By modeling a non-stationary spatio-temporal process as a piecewise stationary spatio-temporal process, we consider simultaneous estimation of the number and locations of change-points, and model parameters in each segment. A composite likelihood-based criterion is developed for change-point and parameters estimation. Asymptotic theories including consistency and distribution of the estimators are derived under mild conditions. In contrast to classical results in fixed dimensional time series that the asymptotic error of change-point estimator is O_p(1), exact recovery of true change-points is guaranteed in the spatio-temporal setting. More surprisingly, the consistency of change-point estimation can be achieved without any penalty term in the criterion function. A computational efficient pruned dynamic programming algorithm is developed for the challenging criterion optimization problem. Simulation studies and an application to U.S. precipitation data are provided to demonstrate the effectiveness and practicality of the proposed method.

## Authors

• 9 publications
• 2 publications
• 2 publications
• 4 publications
• ### Localised pruning for data segmentation based on multiscale change point procedures

The segmentation of a time series into piecewise stationary segments is ...
10/28/2019 ∙ by Haeran Cho, et al. ∙ 0

• ### Wavelet Spatio-Temporal Change Detection on multi-temporal PolSAR images

We introduce WECS (Wavelet Energies Correlation Sreening), an unsupervis...
03/26/2021 ∙ by Rodney Fonseca, et al. ∙ 0

• ### Generalized Laplace Inference in Multiple Change-Points Models

Under the classical long-span asymptotic framework we develop a class of...
03/28/2018 ∙ by Alessandro Casini, et al. ∙ 0

• ### Regularized Estimation in High-Dimensional Vector Auto-Regressive Models using Spatio-Temporal Information

A Vector Auto-Regressive (VAR) model is commonly used to model multivari...
12/18/2020 ∙ by Zhenzhong Wang, et al. ∙ 0

• ### A Bayesian change point model for spatio-temporal data

Urbanization of an area is known to increase the temperature of the surr...
05/22/2021 ∙ by Candace Berrett, et al. ∙ 0

• ### Multiple change point detection under serial dependence: Wild energy maximisation and gappy Schwarz criterion

We propose a methodology for detecting multiple change points in the mea...
11/27/2020 ∙ by Haeran Cho, et al. ∙ 0

• ### Alternating Dynamic Programming for Multiple Epidemic Change-Point Estimation

In this paper, we study the problem of multiple change-point detection f...
07/16/2019 ∙ by Zifeng Zhao, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In analyzing dependent data, many widely-used statistical models in practice require the assumption of stationarity to establish the validity of statistical methodologies. However, stationary models are often inadequate when the data are collected over a long period. A growing body of empirical evidence reveals that real data do exhibit non-stationarities, such as in climate statistics (Kelly and Ó Gráda, 2014; Nam et al., 2015), financial data (Fryzlewicz, 2014) and genetics (Fearnhead and Liu, 2007; Fearnhead and Rigaill, 2018). Instead of developing complicated models for describing the non-stationary behavior, it is often more intuitive and effective to incorporate a change-point model which segments the data into stationary pieces. With the advance in data collection technology, many long or high frequency data sets are available, and hence change-point analysis is becoming increasingly popular in recent decades.

Change-point estimation has been extensively studied for time series data, see, for example, Davis et al. (2006); Aue et al. (2009); Matteson and James (2014); Preuss et al. (2015); Cho and Fryzlewicz (2015); Yau and Zhao (2016)

. Recently, change-point analysis has gained popularity in high-dimensional statistics

(Wang and Samworth, 2018) and functional data analysis (Berkes et al., 2009; Aston and Kirch, 2012; Aue et al., 2018). However, inference of change-points for spatio-temporal data remains largely unexplored. Only a few literature have considered change-point estimation for spatio-temporal processes. Moreover, the existing methods are subject to various limitations. For example, Bayesian methods such as Majumdar et al. (2005) and Altieri et al. (2015) usually require very specific model structures and are computationally intensive. Gromenko et al. (2017) focus on at most one change-point in the mean function of a spatio-temporal process. Furthermore, all existing literature require the separability of space-time covariance. The major difficulty of change-point analysis for spatio-temporal processes stems from the theoretical and computational challenges of spatio-temporal modeling under the increase in dimension of both space and time with the presence of unknown change-points.

In this paper, we propose a likelihood-based procedure for multiple change-point estimation in spatio-temporal process with spatial locations at a possibly irregular grid. The procedure adopts pairwise likelihood to alleviate the computational difficulty of the full likelihood for spatio-temporal data while maintaining statistical efficiency. The proposed method is a general approach that can handle a wide range of spatio-temporal models including both separable and non-separable space-time covariance. Furthermore, it works under model misspecification and can detect changes beyond first and second moments.

The contribution of this paper is two-folded. First, in terms of statistical theory, we show that, in contrast to the commonly used likelihood-based change-point estimation approach for time series (e.g. Davis et al., 2006; Ma and Yau, 2016), the pairwise likelihood for spatio-temporal data induces an edge effect from the boundary of each change-point. This edge effect is non-ignorable under the spatio-temporal setting as the spatial dimension grows, and causes inconsistency of change-point estimation. To tackle this problem, we carefully modify the pairwise likelihood by introducing a marginal likelihood term to correct the edge effect. Interestingly, unlike traditional criterion functions such as Bayesian information criterion (BIC) and minimum description length (MDL) which involve a penalty term, the consistency of the change-point estimation can be achieved solely by the modified pairwise likelihood. Moreover, in contrast to classical results in fixed dimensional time series that the asymptotic error of change-point estimation is , we show that exact recovery of true change-points is guaranteed in the spatio-temporal setting under mild conditions. To further achieve consistent model selection for each segment and enhance finite sample performance, an MDL-based criterion function is developed. We prove that, even if the model is misspecified, the number and locations of change-points can be consistently estimated under mild assumptions. The asymptotic distributions of the estimated change-points and the spatio-temporal model parameters in each stationary segment are also derived.

Second, in terms of statistical computing, we develop a computationally efficient algorithm for the optimization of the criterion function. Computational feasibility is a major challenge in change-point estimation since it involves optimization over a large number of change-point location configurations. Popular optimization methods, such as binary segmentation (Zhang et al., 2010)

(Davis et al., 2006), are fast but only provide approximate solutions. On the other hand, dynamic programming algorithms (Killick et al., 2012; Maidstone et al., 2017) provide exact solutions with a linear to quadratic computational cost. In particular, the pruned exact linear time (PELT) algorithm (Killick et al., 2012) reduces the computational cost substantially by pruning away irrelevant computational configurations. In this paper, PELT is adapted to the spatio-temporal setting to yield a computationally efficient and asymptotically exact solution to the optimization problem.

The rest of this paper is organized as follows. Section 2 provides the background and derivation of the composite likelihood based criterion for change-point inference. The main results including estimation consistency and asymptotic distribution of the estimators are presented in Section 3. Numerical experiments and an application to U.S. precipitation data are given in Section 4.

## 2 Background

### 2.1 Setting and notations

On a set of spatial locations with cardinality , consider a spatio-temporal process

 Y={yt,s:t∈[1,T]∩N,s∈S}={yt:1≤t≤T,t∈N},

where denotes the observations of all locations at time . There are in total observations. We focus on , while the theory can be easily generalized to with .

We assume that the spatio-temporal process can be partitioned into strictly stationary segments along the time dimension 111Note that potentially there can also be structural breaks across the space , however, since there is no natural order for , the search space is much larger, so here we do not consider this case.. In other words, there are unknown distinct change-points in the observations. Set , and let be the normalized change-points. Assume that . The asymptotic results are based on increasing with the s being fixed.

For , the th stationary segment is modeled by a stationary process with , such that

 x(j)t−τj−1=yt,t=τj−1+1,…,τj. (1)

Let be the length of the th segment. The observed spatio-temporal process can be written as

 Y=(x(1)1,…,x(1)T1,x(2)1,…,x(2)T2,…,x(m+1)1,…,x(m+1)Tm+1).

As in Davis et al. (2008) and Aue et al. (2009), we first assume for simplicity that the data across different segments () are independent. This assumption is commonly found in spatio-temporal change-point literatures, such as Altieri et al. (2015). Also, Gromenko et al. (2017) assume independence between functional observations, which implies the independence between segments. See Remark 1 for discussions on its relaxations.

Given the locations of the change-points, each stationary segment is modeled by a member of a pre-specified finite class of models, . Each element in is a model specified by an integer-valued parameter of dimension that represents the model order. Let be the

-dimensional integer-valued vector that specifies the model order for the

th segment

. Note that, by definition, the form of the parametric model and thus the form of the likelihood function for the

th segment is completely determined by . Given , the model for depends on a real-valued parameter of dimension . That is, the distribution of is completely determined by . Here, is regarded as a parameter for the model order and is the parameter that specifies the exact model. This framework, for example, covers the general spatio-temporal process

 Yt,s=μt,s+εt,% s, (2)

where , and is a zero-mean error process. The mean of can take a constant or a regression form such as , where denotes a -dimensional covariates associated with . The covariance structure can take possible parametric forms of separable or non-separable spatio-temporal dependence. Here, a model class may be specified by , where or indicates whether the th covariate is selected in the model or not, and indicates the parametric form for the covariance structure. Also, contains the coefficients of the selected covariates and the covariance parameters.

Define as the complete model parameter set of the th segment. Assume that is in the interior of a compact parameter space . The change-points and the model parameter vector are denoted by and , respectively. To lighten the notations, we use for and suppress the in and when there is no possibility of confusion.

### 2.2 Composite likelihood and pairwise likelihood

Although likelihood based methods generally achieve high statistical efficiency, when the likelihood function involves high-dimensional inverse covariance matrices or integrals, computations become infeasible. To overcome this limitation, Lindsay (1988) considers the composite likelihood, which is a weighted product of likelihoods for some subsets of the data. By specifying the subsets, different classes of composite likelihood are obtained. One popular class is the pairwise likelihood, which is the product of the bivariate densities of all possible pairs of observations,

 LP(θ;X)=∏i,jLi(θ;xi,xj)wi,j, (3)

where are the weights. Composite likelihood often enjoys computational efficiency while statistical efficiency is retained; see Lindsay (1988); Varin et al. (2011). Owing to its flexibility and attractive asymptotic properties, composite likelihood has been widely used in genetics (Larribe and Fearnhead, 2011), longitudinal data (Bartolucci and Lupparelli, 2016), time series (Davis and Yau, 2011), spatial statistics (Guan et al., 2015) and spatio-temporal statistics (Bevilacqua et al., 2012; Huser and Davison, 2014).

Given a stationary segment with parameter , the pairwise likelihood is defined as the product of the bivariate densities of each distinct pair of observations. However, in many spatio-temporal processes, the dependence between observations diminishes as the time lag or spatial distance increases, so there is a loss of efficiency because most of the pairs are not informative. Therefore, it suffices to consider pairs up to a small time lag and a small spatial distance . Denote as a distance-based neighborhood of location s, where the distance can be Euclidean metric. Note that depends on implicitly. Given observations and , the pairwise log-likelihood is defined as

 PLST(ψ;X)=∑(t,i,s1,s2)∈Dk,Nlogf(xt,s1,xt+i,s2;ψ)=∑(t,i,s1,s2)∈Dk,Nlpair(ψ;xt,s1,xt+i,s2), (4)

where is the collection of pairs of observations that are at most time units and spatial distance apart.

For the choices of and , intuitively, larger neighborhood can be used if there exists strong spatial correlation across , and smaller neighborhood should be favored if the spatial correlation is weak; see Varin and Vidoni (2005) and Bai et al. (2012). On the other hand, similar to Ma and Yau (2016), if the main focus is estimating change-points rather than model parameters, then it usually suffices to use the smallest and that ensure identifiability of the models in the candidate model set . See more discussions below Assumption 3 in Section 3.

### 2.3 Edge effect and a remedial composite likelihood

By the definition of in (4), each data point in a stationary segment does not appear in for the same number of times. For example, can only be paired with , and thus appears approximately half frequently than an ordinary observation , , which can be paired with . This can be viewed as that different weights are implicitly assigned to the observations in , and observations on the edge of a segment receive less weights. We call this phenomenon in pairwise likelihood the “edge effect”.

When we want to decide whether a change point exists in a segment, say , we need to compare two quantities: the pairwise likelihood formed by , and the sum of pairwise likelihoods formed by and , respectively. For the latter quantity, all observations within time units from suffer from the edge effect and receive less weights. Essentially, the latter quantity has fewer terms than the former one. When , the edge effect is non-ignorable and can cause inconsistency of the pairwise likelihood based method, as the following example shows.

###### Example 1.

Consider the simple example of and let the criterion function be in the form

 IC(m)=−m+1∑j=1PLST(^ψj;Xj)+C(m+1)log(ST),

where is the number of change-points, is the pairwise log-likelihood of the th segment evaluated under a parameter estimate , and the penalty is of similar magnitude as some common information criteria such as BIC. We compare between two scenarios:

Scenario 1: No change point is assumed and the pairwise log-likelihood function gives an estimate over the entire observations . We have , where

Scenario 2: A change point is fixed at and the pairwise log-likelihoods give the estimates in the two segments, respectively. Thus, , where

 2∑j=1PLST(^ψj;Xj)=∑(t,i,s1,s2)∈D1k,Nlogf(yt,s1,yt+i,s2;^ψ1)+∑(t,i,s1,s2)∈D2k,Nlogf(yt,s1,yt+i,s2;^ψ2),

, and

Suppose that there is no change-point in the data, i.e., Scenario 1 is true, and the true parameter value is . Then, under some regularity conditions, the estimators and

converge in probability to

with order . Hence, by Taylor’s expansion, it can be seen that

 IC(0)−IC(1) = 2∑j=1PLST(ψo;Xj)−PLST(ψo;Y)−Clog(ST)+Op(1). (5)

Compared to , due to the edge effect, does not contain the pairwise log-likelihoods at , which is of order . In other words,

 2∑j=1PLST(ψo;Xj)−PLST(ψo;Y)=−∑(s1,s2)∈Dck,Nlogf(yT2,s1,yT2+1,s2;ψo). (6)

Note that for a pairwise log-likelihood evaluated at , we have

 −logf(y1,y2;ψo) = −log12π√|Σ|exp(−(y−μ)′Σ−1(y−μ)/2) = log2π+1/2log|Σ|+(y−μ)′Σ−1(y−μ)/2.

If , then the term in (6) is strictly positive and is of order , which dominates the penalty term of order . Combining with (5), with probability approaching one, indicating that a change-point model is wrongly selected.

The non-ignorable edge effect in Example 1 is due to the special feature of in the spatio-temporal setting, in contrast to the fixed cross-sectional dimension in multivariate time series. To fix the edge effect and hence achieve consistency, one should compensate the missing pairs in the edge so that each data point appears the same number of times in the likelihood function. This can be achieved by introducing additional marginal likelihoods for data points observed on the edge , resulting in a new composite likelihood (for the th segment)

 L(j)ST(ψj;Xj) = PLST(ψj;Xj)+∑(i,s)∈Ek,Nlogf(x(j)i,s;ψj)+∑(i,s)∈Ek,Nlogf(x(j)Tj−i+1,s;ψj) =: ∑(t,i,s1,s2)∈D(j)k,Nlpair(ψj;x(j)t,s1,x(j)t+i,s2)+∑(i,s)∈Ek,N[lmarg(ψj;x(j)i,s)+lmarg(ψj;x(j)Tj−i+1,s)],

where and , , and is the collection of sites where marginal likelihoods are included for compensation of the edge effect. Note that both and depend implicitly on , which is suppressed for notational simplicity.

### 2.4 Derivation of the criterion

With the modified composite likelihood, in this section we derive a criterion function for estimating the change-points and model parameters in each segment. The criterion is based on the minimum description length (MDL) principle, which aims to select the best-fitting model that requires the minimum amount of code length to store the data (Rissanen, 2012), and has been shown to have promising performance for change-point detection; see Davis et al. (2006, 2008); Aue and Lee (2011). One classical way to construct the MDL is the two-stage approach (Hansen and Yu, 2001; Lee, 2001), which splits the code length, , into two components:

 CL(Y)=CL(ˆM)+CL(Y∣ˆM),

where is the code length for the fitted model and is the information in unexplained by . Recall that all model parameters in the th segment are specified by , where is the model order and is the model parameter. Given , the composite likelihood estimator of is obtained by , where is defined in (2.3). Since the fitted model can be completely described by , s and s, we have

 CL(ˆM)= CL(m)+CL(λ1)+⋯+CL(λm)+CL(ξ1)+⋯+CL(ξm+1) +CL(^θ1)+⋯+CL(^θm+1).

Since contains information equivalent to the integer-valued vector , and the code lengths for an integer and an estimate of a real-valued parameter from observations are and , respectively, we have

 CL(ˆM)=log2m+m+1∑j=1log2Tj+m+1∑j=1cj∑i=1log2ξi,j+m+1∑j=1dj2(log2Tj+log2S), (8)

where ; see Hansen and Yu (2001) and Lee (2001).

As demonstrated by Rissanen (2012), , where is the maximized full likelihood. By regarding the composite likelihood as an approximation to the full likelihood, and using logarithm base rather than base , the sum of the negative of (2.3) can be used to define . However, by construction, each data point appears in several parts of the composite likelihood function. In particular, the number of times that an observation is used in the composite log-likelihood equals to

 Ck,N=2Card(Dk,N)+2% Card(Ek,N)ST, (9)

where denotes the cardinality of a set. Indeed, if the data are identically and independently distributed, then the composite likelihood is a product of marginal densities, and is exactly times the full likelihood. Thus, we compensate the code length by magnifying by a factor , and define the composite likelihood-MDL (CLMDL) criterion as

 CLMDL(m,Λ,Ψ) = Ck,N{logm+m+1∑j=1cj∑i=1logξi,j+m+1∑j=1[(dj2+1)logTj+dj2logS]} (10) −m+1∑j=1L(j)ST(ψj;Xj).

To ensure identifiability of the change-points, assume that there exists an such that . In other words, we require , where

 Amϵλ={Λ∈(0,1)m:0=λ0<λ1<⋯<λm<λm+1=1,λj−λj−1≥ϵλ,j=1,…,m+1}. (11)

Under this constraint, the number of change-points is bound by . The estimates of the number of change-points, the locations of the change-points and the parameters in each of the segments are given by the vector , where

 (^m,^ΛST,^ΨST)=argminm≤Mλ,ψj∈M,Λ∈AmϵλCLMDL(m,Λ,Ψ). (12)

Denote and , where . Note that

 ^θ(j)ST=argmaxθj∈Θj(^ξj)L(j)ST{(^ξj,θj);^Xj}

is the composite likelihood estimator of the model parameters of the th estimated segment of the spatio-temporal process, .

## 3 Main Results

In this section, we impose some mild regularity conditions on the composite log-likelihood, and on the strong-mixing coefficients of the piecewise stationary spatio-temporal process, and then we present the main results.

For the asymptotic theory, we assume that the spatio-temporal process is generated by a random field in . Also, the data is observed on with and , in other words,

 Y={yt,s:t∈[1,Tn]∩N,s∈Sn,Sn⊂R2}.

The asymptotic theory is based on , where the number of observations whenever . For notational simplicity, in the following we use instead of when there is no possibility of confusion.

We define a metric on by , where , , , are indexes in . Also, the distance between any subsets is defined as

 ρ(U,V)=inf{ρ(d1,d2):d1∈U and d2∈V}.

The asymptotic results follow an increasing domain framework as in Jenish and Prucha (2009) and Bai et al. (2012), as is made explicit by Assumption 1.

###### Assumption 1.

The (possibly unevenly spaced) lattice is countably infinite. All elements in are located at distances of at least from each other, i.e., for all we have .

Assumption 1 allows for unevenly spaced locations and general forms of sample regions, which is often encountered in real data. By Assumption 1, we can assume the maximum cardinality of the neighborhood set is bounded by a constant . Throughout this section, we assume the time lag used in CLMDL is and the maximum cardinality of in CLMDL is .

###### Assumption 2 (r).

There exists an such that for any fixed model order , we have
(i) for and ,

 supS,Tsup(t,i,s1,s2)∈D(j)k,NE[supθ∈Θ(ξ)|l[a]pair{(ξ,θ);x(j)t,s1,x(j)t+i,s2}|r+ϵ]<∞, supS,Tsup(t,s)∈Ek,NE[supθ∈Θ(ξ)|l[a]marg{(ξ,θ);x(j)t,s}|r+ϵ]<∞,

(ii) for , the above moment conditions hold with ,
where and are defined in (2.3) and in stands for the th order derivative w.r.t. .

Assumption 2() requires that the composite likelihood (2.3) is twice differentiable and has a finite th moment, and its first and second order derivatives have a finite th moment. Note that if the data is observed on a regular lattice, then by the piecewise stationarity assumption, the supremum w.r.t. can be dropped in Assumption 2().

Define the expected likelihood under the true model as . The derivatives and are defined similarly.

###### Assumption 3.

For each stationary segment of the random field, where ,

(i) there exists a model with parameters satisfying

 ψoj=(ξoj,θoj)=argmaxξj,θj\lx@overaccentsetL(j)ST{(ξj,θj)}.

The model is uniquely identifiable in the sense that if there exists another model with and , then . Moreover, for every , we have ;

(ii) and , where and are defined in (i).

Similar as above, if the data is observed on a regular lattice, then by the piecewise stationarity assumption, the supremum w.r.t. can be dropped in Assumption 3. Assumption 3 does not require that the stationary process is from the model class . Instead, Assumption 3(i) only assumes the existence of a pseudo-true model in , which is of the simplest form. That is, the model cannot be expressed by another model in , where is of a smaller dimension. The last statement in Assumption 3(i) asserts that for any model order , the point is the unique parameter value that maximizes the expected composite likelihood.

Assumption 3(ii) rules out the degenerate case that the th and th stationary segments are indistinguishable by the composite likelihood. In general, Assumption 3(ii) may fail if the two adjacent segments follow different models but have the same expected composite likelihood. In theory, this situation can always be avoided by using a larger lag and neighborhood when defining (2.3). As noted by Ma and Yau (2016), this situation seldom occurs in practice and or is usually sufficient to distinguish between stationary segments.

The next two assumptions regulate the dependence structure of the spatio-temporal process by imposing strong mixing conditions on the underlying random field. For , let

be the sigma field generated by the random variables in

Define

 α(U,V)=sup{|P(A∩B)−P(A)P(B)|;A∈σ(U),B∈σ(V)}.

The -mixing coefficient for the th stationary random field is defined as

 αXj(m;u,v)=sup{α(U,V);|U|≤u,|V|≤v,ρ(U,V)≥m}.

Also, denote , where upper bounds the number of composite likelihood components that any point can have in .

###### Assumption 4 (r).

For each stationary segment of the random field, where , there exist some and where , such that for all , , , we have

 ∞∑m=1(m+1)3(c−u+1)−1[αXj(m;Mu,Mv)]ϵ/(c+ϵ)<∞.

Assumption 4 requires a polynomial decaying rate for the mixing coefficient of the random field, which is mild and is used for invoking the moment inequality in Doukhan (1994) that controls the asymptotic size of the composite likelihood. Note that the mixing rate in Assumption 4 depends on . This is intuitive since a longer time lag and a larger neighborhood size induce a slightly higher dependence among the composite likelihood and thus require stronger conditions on the mixing coefficients. Define also

 α∗Xj(m)=sup{α((−∞,t1]×S,[t2,∞]×S);t2−t1=m},

which characterizes the dependence of the random field along the time dimension. It can be regarded as an analogue to the classic strong mixing coefficient in the time series setting.

###### Assumption 5.

For each stationary segment of the random field, where , there exist , and , such that for any fixed model order ,

 (i) supS,Tsup(t,i,s1,s2)∈D(j)k,NE[supθ∈Θ(ξ)|l[1]pair{(ξ,θ);x(j)t,s1,x(j)t+i,s2}|r+δ]<∞, (ii) α∗Xj(m)≤Cm−τ for some C>0.

Assumption 5(i) is a moment condition on the first order derivative of the composite likelihood, which is similar to the one in Assumption 2. Assumption 5(ii) is a mild mixing condition along the time dimension and is used for invoking an maximal moment inequality in Yang (2007) that controls the asymptotic size of the first order derivative of the composite likelihood at the true parameter value. Note that for a fixed moment condition <