Change-point detection is an active research area in statistics and has been studied extensively due to its broad applications in fields such as finance, genetics and meteorology. There is a vast literature on change-point detection; see, for example, Yao (1987), Davis et al. (2006), Fearnhead (2006), Killick et al. (2012), Chan et al. (2014), Zou et al. (2014), Matteson and James (2014), Fearnhead and Rigaill (2017) and references therein.
A less studied yet important type of change-point detection problem is epidemic change-point detection, which was first proposed and studied in Levin and Kline (1985). Let $\{Y_t\}_{t=1}^n$ be a sequence of independently distributed univariate observations. Roughly speaking, under the (classical) epidemic change-point setting, there exist two change-points $1 \le \tau_1 < \tau_2 < n$ such that $Y_1,\dots,Y_{\tau_1}$ and $Y_{\tau_2+1},\dots,Y_n$ follow the same distribution and $Y_{\tau_1+1},\dots,Y_{\tau_2}$ follow a different distribution. The two segments on the sides, $Y_1,\dots,Y_{\tau_1}$ and $Y_{\tau_2+1},\dots,Y_n$, are referred to as the normal state and the central segment $Y_{\tau_1+1},\dots,Y_{\tau_2}$ is referred to as the epidemic state. We call this setting the single epidemic change-point setting.
In the literature, epidemic change-point detection is typically formulated as a hypothesis testing problem, where different test statistics have been proposed to test the null hypothesis of no change-point against the above defined epidemic alternative with two change-points; see Yao (1993), Guan (2004), Arias-Castro et al. (2005) and Ning et al. (2012) for examples. Moreover, the existing literature focuses on the single epidemic change-point setting, where the data are assumed to start at the normal state and only a single epidemic state is allowed. The more realistic setting of detecting multiple epidemic change-points, however, has not been explored.
In this paper, we propose a model selection based framework for multiple epidemic change-point estimation. Specifically, we assume that under the epidemic alternative, there exist $m$ unknown change-points $\tau_1 < \tau_2 < \dots < \tau_m$ such that the distribution of $Y_t$ alternates between a (common) normal state and (different) epidemic states. Note that we do not require the assumption that the data start at the normal state. For a concrete example, let the number of change-points $m$ be even and the data start at the normal state. Denote $\tau_0 = 0$ and $\tau_{m+1} = n$. Under the multiple epidemic change-point setting, the odd-numbered segments are at the (common) normal state and the even-numbered segments are at (different) epidemic states. In other words, the data alternate between the normal state and epidemic states.
The multiple epidemic change-point setting incorporates the aforementioned single epidemic setting as a special case and is more realistic in that it allows the observations to move back and forth between the common normal state and epidemic states. One motivating example for the multiple epidemic change-point setting is DNA copy number variation (see e.g. Olshen et al., 2004; Niu and Zhang, 2012), where the observations are the log-ratios of the copy number of genes between the test and reference sequence. For most genes, there is no variation (common normal state) and the mean log-ratio is a common constant (e.g. 0). When there is variation (epidemic state), depending on the duplication or deletion of certain genes, the mean log-ratio can be either larger or smaller than that of the normal state. Another important example is large scale multiple testing with locally clustered signal as considered in Cao and Wu (2015), where a sequence of $p$-values $\{p_t\}_{t=1}^n$ is observed, with $p_t$ being the $p$-value for the $t$th test, and we need to perform $n$ hypothesis tests based on $\{p_t\}_{t=1}^n$. The signal is locally clustered in the sense that the sequence of $p$-values can be partitioned into alternating blocks of signal (epidemic state, where $p$-values do not follow $U(0,1)$) and noise (common normal state, where $p$-values follow $U(0,1)$). The two examples are later discussed in detail in Section 5.
Compared to the conventional multiple change-point detection problem, the unique aspect of the multiple epidemic change-point setting is that there is an underlying alternating structure in the behavior of the observations $\{Y_t\}_{t=1}^n$. As in other change-point detection problems, our primary interest is to recover the unknown number and locations of change-points. In addition, a further interest is to recover the underlying alternating states of the observations $\{Y_t\}_{t=1}^n$. Specifically, the goal is to assign a normal or epidemic label to each estimated segment.
The unique alternating structure of states and the shared common behavior among all normal state segments impose both challenges and opportunities for change-point detection. Specifically, existing efficient multiple change-point detection algorithms such as PELT in Killick et al. (2012) and FPOP in Maidstone et al. (2017) cannot directly recover the underlying alternating states and thus require additional post-analysis on the estimation results. Moreover, intuitively, if an algorithm can explicitly incorporate and exploit the alternating structure and the knowledge that segments at the normal state share the same behavior, improved estimation accuracy should be expected due to the additional information on the structure of the estimation problem.
Motivated by the above observations, in this paper, we propose a novel alternating dynamic programming algorithm, named aPELT, to efficiently solve the multiple epidemic change-point problem. The proposed approach is based on the seminal work of PELT in Killick et al. (2012), but involves an explicit treatment of the alternating structure and common normal state behavior. Similar to PELT, it can be applied to find change-points under a range of statistical criteria such as penalized likelihood, quasi-likelihood and cumulative sum of squares. It also enjoys the same computational efficiency as PELT, so it can be applied to segment large data sets.
The advantages of aPELT are two-fold. First, by incorporating the shape-constraint explicitly, aPELT achieves simultaneous inference on both change-points and alternating states of the sequence, thus does not require any post processing of the estimation result. Moreover, as demonstrated by extensive numerical experiments and real data applications, the explicit treatment further helps to improve accuracy for both change-point estimation and parameter estimation. The proposed aPELT has useful applications in multiple testing problems with locally clustered signals and to DNA copy number variation detection (see Section 5 for more details), where superior performance over existing methods is observed.
A related yet different stream of literature is the constrained dynamic optimization of Hocking et al. (2015) and Hocking et al. (2018), which is motivated by mean changes in ChIP-seq data. The authors propose efficient algorithms to solve a model selection problem under the constraint that a decrease in mean must be followed by an increase, and vice versa. Similar to Hocking et al. (2015) and Hocking et al. (2018), we also face a constrained optimization problem. However, under the multiple epidemic change-point setting, we do not impose a directional relation between the normal state and epidemic state behavior. Moreover, our use of a common normal state parameter poses further difficulty for the optimization, since the constraint is not only on two neighboring segments but on all normal state segments.
The rest of the paper is organized as follows. In Section 2, we formulate the multiple epidemic change-point detection problem and review the model selection approach for general change-point problems. For a tailored and efficient solution, an alternating dynamic programming algorithm (aPELT) is proposed in Section 3. The efficiency and accuracy of the proposed method are demonstrated via extensive numerical experiments in Section 4. Applications of aPELT to DNA copy number variation and multiple testing with locally clustered signals are presented in Section 5, where results show the superior performance of aPELT over existing methods. The paper concludes with a discussion. Additional simulations and technical materials can be found in the supplementary material.
2 Background and Existing Solutions
2.1 Basic setting
Roughly speaking, change-point detection can be considered as the identification of points within a dataset where the statistical properties change. In this paper, we assume that $\{Y_t\}_{t=1}^n$ is a sequence of independently distributed univariate observations. There are $m$ change-points $\tau_1 < \tau_2 < \dots < \tau_m$ that split the data into $m+1$ segments. Defining $\tau_0 = 0$ and $\tau_{m+1} = n$, the $i$th segment contains the data $Y_{\tau_{i-1}+1},\dots,Y_{\tau_i}$ for $i=1,\dots,m+1$. We remark that extensions to settings with multivariate observations or dependence within segments are straightforward.
We assume that the distribution of $Y_t$ belongs to a parametric family $\{f(\cdot \mid \theta, \phi)\}$, where $\theta$ denotes the parameter of interest and $\phi$ denotes the nuisance parameter. For example, in mean change detection for independent Gaussian observations, $\theta$ is the mean and $\phi$ is the variance of $Y_t$, and in change-point detection for independent Poisson counts, $\theta$ is the intensity of $Y_t$ and there is no $\phi$. Denoting the parameters for the $i$th segment as $(\theta_i, \phi_i)$, we have $Y_t \sim f(\cdot \mid \theta_i, \phi_i)$ for $t = \tau_{i-1}+1,\dots,\tau_i$. Note that we do not put any restriction on the nuisance parameters except assuming that they are unknown.
Under the multiple epidemic change-point setting, we further assume that the parameter of interest $\theta_i$ alternates between a common normal state $\theta_0$ and different epidemic states. More formally, for any $i \in \{1,\dots,m\}$, if the $i$th segment has $\theta_i = \theta_0$ (or $\theta_i = \theta$ for some $\theta \neq \theta_0$), then the $(i+1)$th segment has $\theta_{i+1} = \theta'$ for some $\theta' \neq \theta_0$ (or $\theta_{i+1} = \theta_0$). The only requirement for an epidemic state is that $\theta_i \neq \theta_0$, without any directional constraint. For the multiple epidemic change-point setting, our inference interests are two-fold: (1) to recover the unknown number and locations of change-points, and (2) to recover the alternating states of the observations $\{Y_t\}_{t=1}^n$.
As mentioned in Section 1, the existing literature focuses on single epidemic change-point detection with $m = 2$ and assumes that the data start at the normal state. Under such a setting, typically a test statistic in the form of a maximum over all candidate pairs of change-points, $\max_{1 \le \tau_1 < \tau_2 \le n} T(\tau_1, \tau_2)$, is constructed for change-point detection and estimation via hypothesis testing. With $m > 2$ and the initial state of the data being unknown, a direct generalization of this testing procedure to the multiple setting is not obvious. Moreover, the computational cost for obtaining such a test statistic is $O(n^2)$, which makes it less suitable for change-point detection in large data sets. Thus, we instead tackle multiple epidemic change-point detection via a model selection approach.
2.2 Optimal Partitioning and PELT
Multiple epidemic change-point detection is a special type of multiple change-point detection problem. In this section, we review two existing model selection based detection algorithms, which serve as the basis for our proposed alternating change-point detection procedure. For the moment, assume that we are doing classical multiple change-point detection, thus the only requirement is $\theta_i \neq \theta_{i+1}$ for $i = 1,\dots,m$.
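Given the observations $\{Y_t\}_{t=1}^n$, denote

$$\mathcal{T}_n = \{\tau = (\tau_1,\dots,\tau_m) : 0 = \tau_0 < \tau_1 < \dots < \tau_m < \tau_{m+1} = n,\ m \ge 0\}$$

as the candidate set of all possible vectors of change-points. The model selection approach estimates the true change-points by minimizing a penalized loss function

$$\min_{\tau \in \mathcal{T}_n} \sum_{i=1}^{m+1} \Big[ \min_{\theta_i, \phi_i} \mathcal{C}\big(Y_{(\tau_{i-1}+1):\tau_i}; \theta_i, \phi_i\big) + P \Big], \tag{1}$$

where $\mathcal{C}$ denotes the measure of model fit such as twice the negative log-likelihood, $Y_{s:t} = (Y_s,\dots,Y_t)$, and $P$ denotes the penalty for model complexity such as BIC or MDL.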
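The optimization of (1) is in general difficult. Using dynamic programming, Jackson et al. (2005) propose the Optimal Partitioning (OP) algorithm, which obtains the exact solution of (1) with $O(n^2)$ computational complexity. The essential idea is the following recursive relationship. Let $F(t)$ denote the minimum of (1) for the data $Y_{1:t}$, with $F(0) = 0$, and write $\mathcal{C}(Y_{s:t}) = \min_{\theta, \phi} \mathcal{C}(Y_{s:t}; \theta, \phi)$. Then for any $t \ge 1$,

$$F(t) = \min_{0 \le s < t} \big\{ F(s) + \mathcal{C}(Y_{(s+1):t}) + P \big\}. \tag{2}$$

This provides a recursion which gives the minimal cost of $Y_{1:t}$ in terms of the minimal cost of $Y_{1:s}$ for $s < t$, and thus can be solved in turn for $t = 1,\dots,n$. Note that the essential condition for the recursive relationship (2) to hold is that the optimization of $(\theta_i, \phi_i)$ is independent across different segments, which is true under the classical multiple change-point setting.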
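The recursion in (2) can be sketched in a few lines. The following is a minimal illustration, not the implementation of Jackson et al. (2005): it assumes a Gaussian mean-change cost with unit variance (the residual sum of squares of each segment), a constant per-segment penalty, and the convention $F(0) = 0$. The function name `optimal_partitioning` is hypothetical.

```python
import numpy as np

def optimal_partitioning(y, penalty):
    """Exact O(n^2) minimization of a penalized cost via the OP recursion.

    Assumption (not from the paper): the segment cost is the residual sum
    of squares around the segment mean, and every segment pays `penalty`.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Cumulative sums give each segment cost in O(1).
    c1 = np.concatenate(([0.0], np.cumsum(y)))
    c2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(s, t):
        # Cost of the segment y[s:t] (0-based, t exclusive).
        tot = c1[t] - c1[s]
        return (c2[t] - c2[s]) - tot * tot / (t - s)

    F = np.full(n + 1, np.inf)
    F[0] = 0.0
    last = np.zeros(n + 1, dtype=int)  # optimal last change-point before t
    for t in range(1, n + 1):
        for s in range(t):
            cand = F[s] + cost(s, t) + penalty
            if cand < F[t]:
                F[t], last[t] = cand, s

    # Backtrack through the stored last change-points.
    cps, t = [], n
    while last[t] > 0:
        t = int(last[t])
        cps.append(t)
    return sorted(cps), float(F[n])
```

A change-point value `s` here means that a new segment starts after the `s`th observation.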
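Assuming the existence of a constant $K$ such that for all $s < t < T$, $\mathcal{C}(Y_{(s+1):t}) + \mathcal{C}(Y_{(t+1):T}) + K \le \mathcal{C}(Y_{(s+1):T})$, where $\mathcal{C}(Y_{s:t})$ denotes the minimized cost of the segment $Y_{s:t}$, Killick et al. (2012) propose the PELT algorithm, which further reduces the computational complexity and can solve (1) in linear time under mild conditions. The central observation is that for the calculation of $F(t)$, we do not need to consider all $s < t$ but only a pruned subset $R_t \subseteq \{0, 1, \dots, t-1\}$, and thus achieve a lower computational cost.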
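As a rough sketch of the pruning idea, the code below keeps a candidate $s$ at time $t$ only if $F(s) + \mathcal{C}(Y_{(s+1):t}) + K \le F(t)$, following the rule in Killick et al. (2012). It is an illustration under assumptions, not the reference implementation: the cost is again the Gaussian residual sum of squares (for which $K = 0$ is valid), the penalty is constant per segment with $F(0) = 0$, and the name `pelt` is hypothetical.

```python
import numpy as np

def pelt(y, penalty, K=0.0):
    """PELT-style pruned dynamic programming (illustrative sketch)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    c1 = np.concatenate(([0.0], np.cumsum(y)))
    c2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(s, t):
        # Residual sum of squares of y[s:t] (assumed cost function).
        tot = c1[t] - c1[s]
        return (c2[t] - c2[s]) - tot * tot / (t - s)

    F = np.full(n + 1, np.inf)
    F[0] = 0.0
    last = np.zeros(n + 1, dtype=int)
    candidates = [0]  # pruned set of possible last change-points
    for t in range(1, n + 1):
        vals = [F[s] + cost(s, t) for s in candidates]
        best = int(np.argmin(vals))
        F[t] = vals[best] + penalty
        last[t] = candidates[best]
        # Keep s only if it can still be optimal at a future time;
        # ties are kept, so the rule is conservative.
        candidates = [s for s, v in zip(candidates, vals) if v + K <= F[t]]
        candidates.append(t)

    cps, t = [], n
    while last[t] > 0:
        t = int(last[t])
        cps.append(t)
    return sorted(cps), float(F[n])
```

The only difference from the full recursion is the candidate list: the inner loop runs over the surviving candidates rather than over all earlier time points.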
3 Exact Multiple Epidemic Change-point Detection via Alternating Dynamic Programming
Compared to the classical setting, the epidemic change-point setting imposes an implicit shape-constraint on the model parameter $\theta = (\theta_1,\dots,\theta_{m+1})$, where $\theta_i$ alternates between the normal state and epidemic states, and all normal state $\theta_i$s are the same.
One primary interest of inference is to recover the label of each segment. In other words, we would like the estimated parameter $\hat{\theta}$ to possess the alternating structure. However, neither OP nor PELT can directly recover the alternating structure since the shape-constraint is not explicitly considered in the penalized loss function in (1), where $\tau$ only determines the number and locations of the change-points but does not restrict the state of each segment.
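To impose the alternating structure of $\theta$, we propose to modify the penalized loss function in (1) by explicitly assigning states to segments. Note that due to the alternating structure, for any given $\tau$, once the state of the last segment is determined, all the other states are fixed. Given $\tau$ and $m$, we define four index sets

$$\mathcal{N}_N(\tau) = \{1 \le i \le m+1 : m+1-i \text{ is even}\}, \quad \mathcal{E}_N(\tau) = \{1 \le i \le m+1 : m+1-i \text{ is odd}\},$$

$$\mathcal{N}_E(\tau) = \mathcal{E}_N(\tau), \quad \mathcal{E}_E(\tau) = \mathcal{N}_N(\tau),$$

where the subscript indicates the state of the last segment and $\mathcal{N}$ and $\mathcal{E}$ collect the normal and epidemic segments, respectively.
Depending on the state of the last segment, the four index sets assign states and group the segments of $\tau$ by normal or epidemic states. For example, for $m = 3$, $\mathcal{N}_N(\tau) = \{2, 4\}$, $\mathcal{E}_N(\tau) = \{1, 3\}$, $\mathcal{N}_E(\tau) = \{1, 3\}$ and $\mathcal{E}_E(\tau) = \{2, 4\}$.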
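The observation that the state of the last segment fixes all other states can be illustrated with a short helper. This is only a sketch: the function names `assign_states` and `index_sets` and the labels `"N"` and `"E"` are hypothetical, and segments are numbered from 1 as in the text.

```python
def assign_states(num_segments, last_state):
    """Label segments 1..num_segments as 'N' (normal) or 'E' (epidemic),
    given the state of the last segment."""
    if last_state not in ("N", "E"):
        raise ValueError("last_state must be 'N' or 'E'")
    other = "E" if last_state == "N" else "N"
    # Segments at an even distance from the last one share its state.
    return [last_state if (num_segments - i) % 2 == 0 else other
            for i in range(1, num_segments + 1)]

def index_sets(num_segments, last_state):
    """Return (normal indices, epidemic indices) for the given last state."""
    labels = assign_states(num_segments, last_state)
    normal = [i for i, lab in enumerate(labels, start=1) if lab == "N"]
    epidemic = [i for i, lab in enumerate(labels, start=1) if lab == "E"]
    return normal, epidemic
```

With four segments, a normal last segment yields the normal indices 2 and 4, while an epidemic last segment yields the normal indices 1 and 3.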
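Based on the four index sets, we further define two penalized loss functions

$$L_N(\tau, \theta_0) = \sum_{i \in \mathcal{N}_N(\tau)} \big[ \mathcal{C}_N(Y_{(\tau_{i-1}+1):\tau_i}; \theta_0) + P_N \big] + \sum_{i \in \mathcal{E}_N(\tau)} \big[ \mathcal{C}_E(Y_{(\tau_{i-1}+1):\tau_i}) + P_E \big], \tag{3}$$

$$L_E(\tau, \theta_0) = \sum_{i \in \mathcal{N}_E(\tau)} \big[ \mathcal{C}_N(Y_{(\tau_{i-1}+1):\tau_i}; \theta_0) + P_N \big] + \sum_{i \in \mathcal{E}_E(\tau)} \big[ \mathcal{C}_E(Y_{(\tau_{i-1}+1):\tau_i}) + P_E \big], \tag{4}$$

where $\mathcal{C}_N(Y_{s:t}; \theta_0) = \min_{\phi} \mathcal{C}(Y_{s:t}; \theta_0, \phi)$ is the cost of a normal state segment, $\mathcal{C}_E(Y_{s:t}) = \min_{\theta, \phi} \mathcal{C}(Y_{s:t}; \theta, \phi)$ is the cost of an epidemic state segment, $P_N$ denotes the penalty for a segment at the normal state and $P_E$ denotes the penalty for a segment at an epidemic state. Note that $P_N$ and $P_E$ may take different values, since for a normal state segment, there is no penalty for the parameter estimation of $\theta_i$.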
By design, $L_N$ forces the last segment of $\{Y_t\}_{t=1}^n$ to be at the normal state and $L_E$ forces the last segment to be at the epidemic state. Moreover, $L_N$ and $L_E$ explicitly incorporate the alternating structure of $\theta$, since the index sets enforce alternating normal and epidemic states among segments and all segments in $\mathcal{N}_N(\tau)$ or $\mathcal{N}_E(\tau)$ share a common $\theta_0$.
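Thus, for the simultaneous inference of change-points and alternating states, we can solve the modified penalized loss function

$$\min_{\theta_0} \min_{\tau \in \mathcal{T}_n} \min\big\{ L_N(\tau, \theta_0),\ L_E(\tau, \theta_0) \big\}, \tag{5}$$

which explicitly incorporates the alternating shape-constraint of $\theta$ and does not require knowledge of the initial state of $\{Y_t\}_{t=1}^n$.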
To solve (5) efficiently with a dynamic programming based algorithm, a recursive relationship similar to (2) is required. However, due to the presence of the common parameter $\theta_0$ across all segments at the normal state, the recursive relationship in (2) no longer holds, since the optimization of $\theta_i$ in $L_N$ and in $L_E$ is no longer independent across segments. Thus, the previous algorithms break down and a new algorithm is needed for computationally feasible inference.
3.1 A two stage optimization procedure
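The key observation is that the optimization of the common $\theta_0$ causes the breakdown of the recursive relationship (2). To bypass this obstacle, we propose a two-stage optimization procedure for (5), which separates the optimization of $\theta_0$ from that of the other model parameters, namely $\tau$ and the segment-specific parameters. This procedure shares the same spirit as profile likelihood. For any fixed $\theta_0$, we define

$$G(\theta_0) = \min_{\tau \in \mathcal{T}_n} \min\big\{ L_N(\tau, \theta_0),\ L_E(\tau, \theta_0) \big\}.$$

Denoting $\hat{\theta}_0 = \arg\min_{\theta_0} G(\theta_0)$, we have

$$\min_{\theta_0} \min_{\tau \in \mathcal{T}_n} \min\big\{ L_N(\tau, \theta_0),\ L_E(\tau, \theta_0) \big\} = G(\hat{\theta}_0).$$

Thus if we can solve $G(\theta_0)$ efficiently for each given $\theta_0$ and $G(\theta_0)$ is a smooth function of $\theta_0$, then we can efficiently solve (5) in a profile-likelihood fashion. In the following two subsections, we describe the two-stage optimization procedure in detail.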
3.2 Alternating PELT (aPELT) under known normal state parameter
In this section, for a given $\theta_0$, we propose an efficient alternating dynamic programming algorithm for solving $G(\theta_0)$, based on an alternating recursion between $F_N$ and $F_E$. Later, we further extend the algorithm to the case where $\theta_0$ is unknown.
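Denote by $F_N(t)$ and $F_E(t)$ the minimal values of $L_N$ and $L_E$, respectively, over all change-point vectors for the data $Y_{1:t}$, with $F_N(0) = F_E(0) = 0$. Under the epidemic change-point setting, a normal state is always followed by an epidemic state and vice versa. Thus, there is an implicit alternating recursion between $F_N$ and $F_E$ where

$$F_N(t) = \min_{0 \le s < t} \big\{ F_E(s) + \mathcal{C}_N(Y_{(s+1):t}; \theta_0) + P_N \big\}, \qquad F_E(t) = \min_{0 \le s < t} \big\{ F_N(s) + \mathcal{C}_E(Y_{(s+1):t}) + P_E \big\}, \tag{6}$$

with $\mathcal{C}_N$ and $\mathcal{C}_E$ denoting the costs of a normal state segment and an epidemic state segment, and $P_N$ and $P_E$ the corresponding penalties.
Equations (6) provide a recursive relationship between the minimal cost $F_N(t)$ and the minimal cost $F_E(s)$ with $s < t$, and similarly between $F_E(t)$ and $F_N(s)$. Thus, to obtain $G(\theta_0) = \min\{F_N(n), F_E(n)\}$, we can solve $F_N(t)$ and $F_E(t)$ simultaneously by recursion in turn for $t = 1,\dots,n$. The computational cost of the algorithm is $O(n^2)$.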
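The alternating recursion can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the paper: Gaussian observations with unit variance, so that a normal segment costs $\sum (Y_t - \theta_0)^2$ and an epidemic segment costs its residual sum of squares, per-segment penalties, and both recursions initialized at 0. The function name `alternating_dp` and the labels `"N"` and `"E"` are hypothetical.

```python
import numpy as np

def alternating_dp(y, theta0, pen_normal, pen_epidemic):
    """O(n^2) alternating dynamic programming for a known normal-state mean.

    Returns the labelled segments (start, end, state), with `end` exclusive,
    and the minimal alternating penalized cost.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    c1 = np.concatenate(([0.0], np.cumsum(y)))
    c2 = np.concatenate(([0.0], np.cumsum(y ** 2)))
    d2 = np.concatenate(([0.0], np.cumsum((y - theta0) ** 2)))

    def cost_normal(s, t):
        # Normal segment: mean fixed at theta0 (assumed Gaussian cost).
        return d2[t] - d2[s]

    def cost_epidemic(s, t):
        # Epidemic segment: mean estimated freely from the segment.
        tot = c1[t] - c1[s]
        return (c2[t] - c2[s]) - tot * tot / (t - s)

    F = {"N": np.full(n + 1, np.inf), "E": np.full(n + 1, np.inf)}
    F["N"][0] = F["E"][0] = 0.0
    last = {"N": np.zeros(n + 1, dtype=int), "E": np.zeros(n + 1, dtype=int)}
    for t in range(1, n + 1):
        for s in range(t):
            # A normal last segment must follow an epidemic one, and vice versa.
            cand_n = F["E"][s] + cost_normal(s, t) + pen_normal
            if cand_n < F["N"][t]:
                F["N"][t], last["N"][t] = cand_n, s
            cand_e = F["N"][s] + cost_epidemic(s, t) + pen_epidemic
            if cand_e < F["E"][t]:
                F["E"][t], last["E"][t] = cand_e, s

    # Pick the better final state, then backtrack while flipping the state.
    state = "N" if F["N"][n] <= F["E"][n] else "E"
    total = float(F[state][n])
    segments, t = [], n
    while t > 0:
        s = int(last[state][t])
        segments.append((s, t, state))
        t, state = s, ("E" if state == "N" else "N")
    return segments[::-1], total
```

Because the state flips at every backtracking step, the change-points and the segment labels are recovered together, without any post-processing.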
To further reduce the computational cost for large data sets, we propose an alternating PELT (aPELT) by extending the idea of PELT in Killick et al. (2012). The central idea is that when calculating $F_N(t)$ and $F_E(t)$ via recursion (6), we do not need to consider all $s < t$. Instead, we only need to consider a subset of $\{0, 1, \dots, t-1\}$ by adding a pruning step. The theoretical guarantee for the pruning step is stated in Theorem 1.
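Theorem 1. Given $\theta_0$, assume that there exists a constant $K_N$ such that for all $s < t < T$,

$$\mathcal{C}_N(Y_{(s+1):t}; \theta_0) + \mathcal{C}_N(Y_{(t+1):T}; \theta_0) + K_N \le \mathcal{C}_N(Y_{(s+1):T}; \theta_0).$$

Then if

$$F_E(s) + \mathcal{C}_N(Y_{(s+1):t}; \theta_0) + K_N \ge F_E(t)$$

holds, at a future time $T > t$, $s$ can never be the optimal last change-point for $F_N$ prior to $T$. Similarly, assume that there exists a constant $K_E$ such that for all $s < t < T$,

$$\mathcal{C}_E(Y_{(s+1):t}) + \mathcal{C}_E(Y_{(t+1):T}) + K_E \le \mathcal{C}_E(Y_{(s+1):T}).$$

Then if

$$F_N(s) + \mathcal{C}_E(Y_{(s+1):t}) + K_E \ge F_N(t)$$

holds, at a future time $T > t$, $s$ can never be the optimal last change-point for $F_E$ prior to $T$.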
Based on Theorem 1, the pseudo-code of aPELT with known normal state parameter $\theta_0$ is given in Algorithm 1 and we name it aPELT($\theta_0$). An interesting phenomenon in Theorem 1 is that the pruning for $F_N$ requires the values of $F_E$ and vice versa, which again requires simultaneous calculation of $F_N$ and $F_E$.
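A pruned version of the alternating recursion might look like the sketch below. It is not Algorithm 1 itself, but an illustration under stated assumptions: Gaussian unit-variance costs (for which both pruning constants can be taken as 0), per-segment penalties, zero initial values, and a keep-rule that mirrors PELT, where a candidate for the normal recursion is compared against the epidemic recursion and vice versa. The name `apelt_known_theta0` and its parameters are hypothetical.

```python
import numpy as np

def apelt_known_theta0(y, theta0, pen_normal, pen_epidemic,
                       k_normal=0.0, k_epidemic=0.0):
    """Alternating dynamic programming with PELT-style pruning (sketch)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    c1 = np.concatenate(([0.0], np.cumsum(y)))
    c2 = np.concatenate(([0.0], np.cumsum(y ** 2)))
    d2 = np.concatenate(([0.0], np.cumsum((y - theta0) ** 2)))

    def cost_normal(s, t):
        return d2[t] - d2[s]

    def cost_epidemic(s, t):
        tot = c1[t] - c1[s]
        return (c2[t] - c2[s]) - tot * tot / (t - s)

    f_n = np.full(n + 1, np.inf)
    f_e = np.full(n + 1, np.inf)
    f_n[0] = f_e[0] = 0.0
    last = {"N": np.zeros(n + 1, dtype=int), "E": np.zeros(n + 1, dtype=int)}
    cand_n, cand_e = [0], [0]  # candidate last change-points per recursion
    for t in range(1, n + 1):
        vals_n = [f_e[s] + cost_normal(s, t) for s in cand_n]
        vals_e = [f_n[s] + cost_epidemic(s, t) for s in cand_e]
        i_n, i_e = int(np.argmin(vals_n)), int(np.argmin(vals_e))
        f_n[t], last["N"][t] = vals_n[i_n] + pen_normal, cand_n[i_n]
        f_e[t], last["E"][t] = vals_e[i_e] + pen_epidemic, cand_e[i_e]
        # Pruning for the normal recursion uses f_e, and vice versa.
        cand_n = [s for s, v in zip(cand_n, vals_n) if v + k_normal <= f_e[t]]
        cand_e = [s for s, v in zip(cand_e, vals_e) if v + k_epidemic <= f_n[t]]
        cand_n.append(t)
        cand_e.append(t)

    state = "N" if f_n[n] <= f_e[n] else "E"
    total = float(f_n[n] if state == "N" else f_e[n])
    segments, t = [], n
    while t > 0:
        s = int(last[state][t])
        segments.append((s, t, state))
        t, state = s, ("E" if state == "N" else "N")
    return segments[::-1], total
```

Both candidate lists have to be updated in the same loop iteration, which reflects the simultaneous calculation of the two recursions.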
Remark 1: If the cost $\mathcal{C}$ is based on the log-likelihood function of $\{Y_t\}_{t=1}^n$, it can be easily shown that the constants $K_N$ and $K_E$ exist and can be set to 0 for any $\theta_0$.
3.3 Alternating PELT under unknown normal state parameter
For many applications, the normal state parameter $\theta_0$ is naturally known and thus aPELT($\theta_0$) proposed in Section 3.2 is sufficient. For example, in DNA copy number variation, the mean log-ratio between the test and reference sequence is typically 0 when there is no variation; in multiple testing with locally clustered signals, the normal state is the uniform distribution. See Section 5 for more details of the above two examples. Nevertheless, for the sake of generality, it is of interest to cover the case of unknown $\theta_0$. In this section, we discuss two extensions of aPELT, namely aPELT_profile and aPELT_plugin, to handle such a situation.
3.3.1 Profile aPELT
The proposed aPELT($\theta_0$) in Section 3.2 can find the exact minimum $G(\theta_0)$ for a given normal state parameter $\theta_0$. Thus, if $G(\theta_0)$ is a smooth function of $\theta_0$, we can solve (5) by a standard optimization algorithm such as gradient descent, and as a byproduct, $\theta_0$ can be estimated by $\hat{\theta}_0 = \arg\min_{\theta_0} G(\theta_0)$. This two-stage procedure shares the same spirit as profile likelihood, thus we name it aPELT_profile.
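The two-stage idea can be illustrated by evaluating the inner minimum for several values of the normal state parameter and keeping the best one. The sketch below deliberately swaps the gradient-based optimizer for a plain grid search, and it uses the unpruned alternating recursion with an assumed Gaussian unit-variance cost. The names `alternating_cost` and `profile_theta0` are hypothetical.

```python
import numpy as np

def alternating_cost(y, theta0, pen_normal, pen_epidemic):
    """Minimal alternating penalized cost for a fixed normal-state mean."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    c1 = np.concatenate(([0.0], np.cumsum(y)))
    c2 = np.concatenate(([0.0], np.cumsum(y ** 2)))
    d2 = np.concatenate(([0.0], np.cumsum((y - theta0) ** 2)))
    f_n = np.zeros(n + 1)
    f_e = np.zeros(n + 1)
    for t in range(1, n + 1):
        s = np.arange(t)  # all candidate last change-points
        tot = c1[t] - c1[s]
        cost_e = (c2[t] - c2[s]) - tot ** 2 / (t - s)
        cost_n = d2[t] - d2[s]
        f_n[t] = np.min(f_e[s] + cost_n) + pen_normal
        f_e[t] = np.min(f_n[s] + cost_e) + pen_epidemic
    return float(min(f_n[n], f_e[n]))

def profile_theta0(y, grid, pen_normal, pen_epidemic):
    """Grid search over theta0 (a stand-in for a gradient-based optimizer)."""
    values = [alternating_cost(y, th, pen_normal, pen_epidemic) for th in grid]
    best = int(np.argmin(values))
    return float(grid[best]), values[best]
```

A grid search only locates the minimizer up to the grid resolution; in practice the grid point could serve as a starting value for a finer local search.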
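To justify aPELT_profile, we investigate the behavior of $G(\theta_0)$ as a function of $\theta_0$ and show that in general a gradient-based algorithm can be used to solve (5). With the index sets, segment costs and penalties defined in Section 3, we have

$$G(\theta_0) = \min_{\tau \in \mathcal{T}_n} \min\big\{ L_N(\tau, \theta_0),\ L_E(\tau, \theta_0) \big\}.$$

For a given $\tau$, the contributions of the epidemic state segments to $L_N$ and $L_E$ are constants with respect to $\theta_0$, thus we denote them by $A_N(\tau) = \sum_{i \in \mathcal{E}_N(\tau)} [\mathcal{C}_E(Y_{(\tau_{i-1}+1):\tau_i}) + P_E]$ and $A_E(\tau) = \sum_{i \in \mathcal{E}_E(\tau)} [\mathcal{C}_E(Y_{(\tau_{i-1}+1):\tau_i}) + P_E]$.
For a given $\tau$, the contributions of the normal state segments are functions of $\theta_0$, thus we denote them by $B_N(\tau, \theta_0) = \sum_{i \in \mathcal{N}_N(\tau)} [\mathcal{C}_N(Y_{(\tau_{i-1}+1):\tau_i}; \theta_0) + P_N]$ and $B_E(\tau, \theta_0) = \sum_{i \in \mathcal{N}_E(\tau)} [\mathcal{C}_N(Y_{(\tau_{i-1}+1):\tau_i}; \theta_0) + P_N]$. Therefore, we have

$$G(\theta_0) = \min_{\tau \in \mathcal{T}_n} \min\big\{ A_N(\tau) + B_N(\tau, \theta_0),\ A_E(\tau) + B_E(\tau, \theta_0) \big\}.$$

In other words, $G(\theta_0)$ is the minimum of $2|\mathcal{T}_n|$ functions of $\theta_0$, where $|\cdot|$ denotes the cardinality of a set. We denote the collection of these functions by $\mathcal{F}$. Intuitively, if for each $\tau$, $B_N(\tau, \theta_0)$ and $B_E(\tau, \theta_0)$ are smooth functions of $\theta_0$, $G(\theta_0)$ should also be a (piecewise) smooth function of $\theta_0$.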
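In the following, denote $\Theta$ as the parameter space of the normal state parameter $\theta_0$ and denote $\Theta^{\circ}$ as the interior of $\Theta$. Before stating Theorem 2, we first state two assumptions on the behavior of the functions in $\mathcal{F}$, the collection of $2|\mathcal{T}_n|$ functions of $\theta_0$ whose minimum is $G(\theta_0)$.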
Assumption 1 (Smoothness).
Any function in $\mathcal{F}$ is a differentiable function of $\theta_0$ and has a unique global minimizer in $\Theta^{\circ}$. WLOG, further assume that the global minimizers and the minimum values of different functions are different.
Assumption 2 (Finite Partition).
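There exists a finite partition $\{\Theta_j\}_{j=1}^{J}$ of $\Theta$ where each $\Theta_j$ is a connected set in $\Theta$ and there is no intersection among functions in $\mathcal{F}$ in the interior of $\Theta_j$, for $j = 1,\dots,J$.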
Both Assumptions 1 and 2 are mild and are expected to hold for common loss functions such as log-likelihood functions. For example, for $\Theta \subseteq \mathbb{R}$, a sufficient condition for Assumption 2 to hold is that all functions in $\mathcal{F}$ intersect each other only finitely many times. Assumption 2 is used to invoke the intermediate value theorem in the proof and to show that $G(\theta_0)$ is "piecewise" differentiable on $\Theta$. In Section 3.5, we verify Assumptions 1 and 2 for some classical change-point settings.