1 Introduction
In this paper, we study change point (CP) detection in multidimensional sequences. This problem arises in many areas. Figure 1 shows a motivating example in which copy number variations of four malignant lymphoma patients are plotted. The purpose of this study is to detect common CPs among patients. The difficulty of this problem lies in the fact that the copy number variations are heterogeneous among patients, i.e., they are commonly observed only in a subset of patients as illustrated in the figure.
The goal of this paper is to develop a method for detecting CPs that are characterized by a subset of dimensions in a multidimensional sequence. A CP detection method for solving such a problem can be formulated as a two-stage method: one stage for selecting dimensions, and another for selecting time points. (Although our target is not restricted to time sequences, we use the term "time" to refer to a point in a sequence.) We call the former the aggregation stage, and the latter the scanning stage. In the aggregation stage, a subset of dimensions is selected and the scores for the selected dimensions are aggregated into a scalar score that represents the plausibility of a common change at the time point. Various forms of aggregation can be considered. In the scanning stage, a time point that maximizes the aggregated score is selected as a CP by scanning the one-dimensional aggregated-score sequence.
In this paper, our goal is not only to detect CPs but also to properly control the false detection probability at the desired significance level, e.g., $\alpha = 0.05$. To this end, we must take into account the fact that the CPs are obtained by selecting the dimensions and the time points in the aggregation and scanning stages, respectively. This means that two types of selection bias must be properly corrected in order to make inferences on the detected CPs. For a one-dimensional sequence, various statistical CP detection methods have been studied (Page, 1954; Yao, 1988; Lee et al., 2003; Yau & Zhao, 2016). In contrast, for multidimensional sequences, there are only a few asymptotic results, and these were developed under rather restrictive assumptions. Deriving the sampling distributions of CP statistics is generally difficult even in asymptotic scenarios because the stochastic singularity of CPs must be properly handled.
In this paper, we introduce a class of CP detection methods that can exactly control the false detection probability. Our key contribution is to interpret a CP detection problem as a selective inference problem. For this class of CP detection methods, the interpretation allows us to derive the exact (non-asymptotic) sampling distribution of the test statistic conditional on the fact that the CP is selected by a particular method. Using a recent result on selective inference (Taylor & Tibshirani, 2015; Lee et al., 2016), we show that inferences based on the derived sampling distribution can correct the two types of selection bias in the aggregation and scanning stages, so that the overall false detection probability can be properly controlled.
The rest of the paper is organized as follows. In §2, we formulate our problem setup and review related work. Here, we focus on the problem of detecting a single mean-structure change in an offline manner. In §3, we introduce a selective inference interpretation of a CP detection problem, and present the main results. Here, we first introduce a class of CP detection methods which includes many existing ones. We then show that the selective type I error of the detected CPs can be exactly (non-asymptotically) controlled. We also show that our hypothesis testing procedure is an approximately unbiased test and derive a lower bound on the power in the sense of selective inference. In §4, the results in the previous section are extended to multiple CP detection problems via a local hypothesis testing framework. §5 is devoted to numerical experiments. Here, we first confirm in a simulation study that the proposed CP detection methods can properly control the false detection probabilities. We then apply these CP detection methods to copy number variation analysis of 46 malignant lymphoma patients.
Notations
We define $[n] := \{1, \ldots, n\}$ for a natural number $n$ and, as a special case, $[0] := \emptyset$. For two matrices $A$ and $B$, $A \otimes B$ denotes the Kronecker product. For a nonnegative integer $n$, the $n$-by-$n$ identity matrix is denoted by $I_n$, while a zero matrix is denoted by $O$, omitting its size as long as no confusion is expected. The sign function, the indicator function and the vectorize operator are denoted by $\mathrm{sign}(\cdot)$, $\mathbf{1}[\cdot]$ and $\mathrm{vec}(\cdot)$, respectively.
2 Problem Setup and Related Work
In this section, we present the problem setup and discuss related work. We first consider the problem of detecting a single CP. Its extension to multiple CP detection is discussed in §4.
2.1 Problem Setup
Let us write a $d$-dimensional sequence of length $T$ as a $d$-by-$T$ matrix $X = (x_{i,t})_{i \in [d], t \in [T]}$, and let $\mu_t$ denote the mean vector at time point $t$. Then, a single CP detection problem for mean shift is formulated as the following hypothesis testing problem:

$\mathrm{H}_0: \mu_1 = \mu_2 = \cdots = \mu_T \quad \text{vs.} \quad \mathrm{H}_1: \exists \tau \in [T-1] \text{ such that } \mu_1 = \cdots = \mu_\tau \neq \mu_{\tau+1} = \cdots = \mu_T$, (1)

where the null hypothesis $\mathrm{H}_0$ states that the mean vector does not change within the entire sequence, whereas the alternative hypothesis $\mathrm{H}_1$ states that there is one CP.
CUSUM score
In order to discuss the test statistic and its sampling distribution for the hypothesis testing problem in (1), let us first consider single CP detection in a one-dimensional sequence denoted as $x_1, \ldots, x_T$. Let $\bar{X}_{1:t} := \frac{1}{t}\sum_{s \le t} x_s$ and $\bar{X}_{(t+1):T} := \frac{1}{T-t}\sum_{s > t} x_s$ be the sample means before and after time point $t$, respectively. Then, we expect the time point $t$ to be a CP when the value $|\bar{X}_{1:t} - \bar{X}_{(t+1):T}|$ is large. Hence, we define a natural estimator of the discrepancy between the two segments as $\bar{X}_{1:t} - \bar{X}_{(t+1):T}$, and its scaled measure

$C_t := \sqrt{\tfrac{t(T-t)}{T}}\left(\bar{X}_{1:t} - \bar{X}_{(t+1):T}\right)$

is known as the CUSUM (cumulative sum) score (Page, 1954). Note that the CUSUM score can be interpreted as a realization of the logarithm of a Gaussian likelihood ratio when we assume that the sequence is mutually independent. A point that maximizes $|C_t|$ is detected as a CP.
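As an illustration, the CUSUM scores defined above can be computed by a single pass over all candidate time points. The following is a minimal sketch (the function name is ours, not from the paper):

```python
import math

def cusum_scores(x):
    """CUSUM scores C_1, ..., C_{T-1} for a 1-D sequence x, following
    C_t = sqrt(t*(T-t)/T) * (mean(x[:t]) - mean(x[t:]))."""
    T = len(x)
    total = sum(x)
    scores = []
    left = 0.0
    for t in range(1, T):          # candidate CP after position t
        left += x[t - 1]
        mean_left = left / t
        mean_right = (total - left) / (T - t)
        scale = math.sqrt(t * (T - t) / T)
        scores.append(scale * (mean_left - mean_right))
    return scores

# a sequence with an upward mean shift after t = 5
x = [0.0] * 5 + [3.0] * 5
scores = cusum_scores(x)
tau_hat = max(range(len(scores)), key=lambda t: abs(scores[t])) + 1
```

Here `tau_hat` recovers the true shift location, since $|C_t|$ peaks at the boundary between the two segments.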
Multivariate CUSUM score and its aggregation
In the case of a multidimensional sequence, it is natural to consider a multivariate version of the CUSUM score:

$C_t := (C_{1,t}, \ldots, C_{d,t})^\top \in \mathbb{R}^d$ (2)

for $t \in [T-1]$, where each $C_{i,t}$ is a CUSUM score corresponding to the $i$-th dimension. Since all the elements of a multivariate CUSUM score cannot be maximized simultaneously, we need to first aggregate the $d$-dimensional vector into a scalar value. We denote an aggregation function as $a: \mathbb{R}^d \to \mathbb{R}$, i.e., $a_t := a(C_t)$ for each $t \in [T-1]$. The aggregated score $a_t$ represents the plausibility of a CP at time point $t$.
Choices of aggregation function
For multidimensional CP detection, various choices of aggregation function can be considered.
$\ell_\infty$ aggregation
Jirak (2015) proposed $\ell_\infty$ aggregation, $a_t := \max_{i \in [d]} |C_{i,t}|$. This aggregation function simply selects the dimension whose absolute CUSUM score is the greatest among the $d$ dimensions. This choice is not appropriate when there are changes in multiple dimensions.
$\ell_1$ aggregation
Another simple aggregation function is $\ell_1$ aggregation, defined as $a_t := \sum_{i \in [d]} |C_{i,t}|$. This aggregation function just sums up the individual absolute CUSUM scores. This choice is not appropriate if changes are observed only in a subset of dimensions.
Top-$k$ aggregation
If changes are observed in a subset of dimensions and the size of the subset is known to be $k$, then the top-$k$ aggregation function, defined as the sum of the $k$ largest values among $|C_{1,t}|, \ldots, |C_{d,t}|$, is appropriate. This aggregation function can be interpreted as a generalization of the $\ell_\infty$ and $\ell_1$ aggregation functions since it reduces to $\ell_\infty$ and $\ell_1$ aggregation when $k = 1$ and $k = d$, respectively.
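For concreteness, the three aggregation functions above can be sketched as follows (a minimal illustration; the function names are ours):

```python
def linf_agg(c):
    """Largest absolute CUSUM score across dimensions."""
    return max(abs(v) for v in c)

def l1_agg(c):
    """Sum of absolute CUSUM scores across dimensions."""
    return sum(abs(v) for v in c)

def topk_agg(c, k):
    """Sum of the k largest absolute CUSUM scores."""
    return sum(sorted((abs(v) for v in c), reverse=True)[:k])

# multivariate CUSUM scores at one time point
c = [2.0, -0.5, 1.5, 0.1]
```

As noted above, `topk_agg(c, 1)` coincides with `linf_agg(c)` and `topk_agg(c, len(c))` with `l1_agg(c)`, mirroring the reduction of top-$k$ aggregation to the two extremes.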
Double CUSUM aggregation
If $k$ is unknown, then it would be desirable to be able to select an appropriate $k$ from the data. Let $C_{(i),t}$, $i \in [d]$, be the $i$-th largest value in $\{|C_{j,t}|\}_{j \in [d]}$, i.e., $C_{(1),t} \ge C_{(2),t} \ge \cdots \ge C_{(d),t}$ is satisfied for each $t \in [T-1]$. Cho (2016) proposed the double CUSUM aggregation function defined as

$a_t := \max_{k \in [d]} \left(\frac{k(2d-k)}{2d}\right)^{\varphi} \left(\frac{1}{k}\sum_{i=1}^{k} C_{(i),t} - \frac{1}{2d-k}\sum_{i=k+1}^{d} C_{(i),t}\right)$, (3)

where $\varphi$ is a predetermined positive constant. This aggregation function returns the CUSUM score of the ordered sequence $C_{(1),t}, \ldots, C_{(d),t}$ for each $k$. The rationale for this choice is that, if there are changes in $k^*$ dimensions, then the top $k^*$ absolute CUSUM values tend to be larger than the remaining CUSUM values, meaning that the CUSUM score for this ordered sequence would be maximized at $k = k^*$. As suggested in Cho (2016), $\varphi = 0.5$ would be optimal in the sense of asymptotic theory.
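The scan over $k$ can be sketched in a few lines; this is our reading of the double CUSUM operator in Cho (2016), and the exact normalization should be checked against that paper:

```python
def double_cusum_agg(c, phi=0.5):
    """Double CUSUM aggregation (our reading of Cho, 2016): scan k and
    return the maximum contrast between the top-k and remaining ordered
    absolute CUSUM scores."""
    d = len(c)
    s = sorted((abs(v) for v in c), reverse=True)   # C_(1),t >= ... >= C_(d),t
    best = -float("inf")
    for k in range(1, d + 1):
        head = sum(s[:k]) / k
        tail = sum(s[k:]) / (2 * d - k)             # 0 when k == d
        weight = (k * (2 * d - k) / (2 * d)) ** phi
        best = max(best, weight * (head - tail))
    return best
```

For a score vector with two clearly changed dimensions, the inner maximum is attained at $k = 2$, in line with the rationale above.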
Test statistic for the problem in (1)
Based on the above discussion, a natural test statistic for the hypothesis testing problem in (1) is

$a_{\max} := \max_{t \in [T-1]} a_t$. (4)

Here, $a_{\max}$ can be interpreted as a realization of the corresponding random variable $A_{\max}$, and thus we could consider $A_{\max}$ as a test statistic. Then, the p-value is defined as the false detection probability under $\mathrm{H}_0$, that is, $P_{\mathrm{H}_0}(A_{\max} \ge a_{\max})$. If the p-value is smaller than the significance level $\alpha$, then we can conclude that there is a single CP in the multidimensional sequence.
As we briefly discuss in the following subsection, it is difficult (even asymptotically) to derive the sampling distribution of the test statistic in (4) for most practical choices of the aggregation function unless rather restrictive assumptions are imposed.
2.2 Related Work
Here, we briefly review existing work on controlling the false detection probabilities in CP detection problems.
Inference for one-dimensional sequences
First, we review statistical inference on CP detection in one-dimensional sequences. As described above, one can regard the point at which the CUSUM score is maximized as the most plausible CP. Hence, the test statistic of the target hypothesis is naturally defined as $\max_{t \in [T-1]} |C_t|$, which corresponds to the log-likelihood ratio test statistic. Then, as is well known, the suitably normalized CUSUM process converges weakly to a Brownian bridge under appropriate moment and weak dependence assumptions on the sequence (Phillips, 1987; Csörgö & Horváth, 1997; Shao & Zhang, 2010), where the normalization involves the so-called long-run variance. In this asymptotic theory, the weak dependence assumption is essential because, otherwise, the long-run variance does not exist or becomes zero.
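As a side illustration, the long-run variance appearing in this asymptotic theory is typically estimated with a kernel-weighted sum of sample autocovariances. A minimal Bartlett-kernel (Newey-West style) sketch, with hypothetical names, is:

```python
def long_run_variance(x, bandwidth):
    """Bartlett-kernel estimate of the long-run variance
    sigma^2 = sum_k gamma(k) of a weakly dependent sequence x."""
    T = len(x)
    mean = sum(x) / T
    xc = [v - mean for v in x]

    def autocov(k):
        # biased (divide-by-T) sample autocovariance at lag k
        return sum(xc[t] * xc[t + k] for t in range(T - k)) / T

    lrv = autocov(0)
    for k in range(1, bandwidth + 1):
        w = 1.0 - k / (bandwidth + 1)   # Bartlett taper
        lrv += 2.0 * w * autocov(k)
    return lrv
```

With `bandwidth=0` this reduces to the (biased) sample variance; positive serial correlation inflates the estimate, as the weak-dependence theory requires.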
Another closely related inference procedure was proposed by Hyun et al. (2016). They used the fused LASSO (Tibshirani et al., 2005) for CP detection, and interpreted inferences on the detected CPs as inferences on the regression coefficients of the features selected by the fused LASSO. Inferences on the regression coefficients of selected features can be done using a recently popularized framework of selective inference (Lee et al., 2016).
Although inference problems on CP detection in onedimensional sequences have been intensively studied in the literature, they cannot be easily generalized to the case of multidimensional sequences.
Inference for multidimensional sequences
Unlike the one-dimensional case, the literature on inference for multidimensional sequences is very scarce. (Several methods with estimation guarantees have been developed (Fryzlewicz et al., 2014; Cho & Fryzlewicz, 2015; Cho, 2016; Wang & Samworth, 2016), but they cannot be used for inference on the detected CPs.) To the best of our knowledge, existing work on this topic can be classified into two types.
Table 1: Choices of the weights $w_i$ in (5) corresponding to the aggregation functions in §2.
method | weight $w_i$
$\ell_\infty$ aggregation | $w_1 = 1$; $w_i = 0$ for $i \ge 2$
$\ell_1$ aggregation | $w_i = 1$ for all $i \in [d]$
top-$k$ aggregation | $w_i = 1$ for $i \le k$; $w_i = 0$ for $i > k$
double CUSUM | $w_i = \left(\frac{k(2d-k)}{2d}\right)^{\varphi}\frac{1}{k}$ for $i \le k$; $w_i = -\left(\frac{k(2d-k)}{2d}\right)^{\varphi}\frac{1}{2d-k}$ for $i > k$, with $k$ chosen by the inner maximization in (3)
The first type comprises likelihood-based methods such as the $\ell_\infty$ aggregation reported by Jirak (2015). They derived the asymptotic distribution of the $\ell_\infty$ aggregation score as an extreme value distribution. To establish the asymptotic distribution, the relation between the length $T$ of the sequence and the dimension $d$ must satisfy a certain growth condition. This condition indicates that if $d$ is relatively large compared to $T$, then one can no longer control the false detection probability even when the underlying distribution is independent for each time point.
The second type uses kernel-based methods. The basic idea of kernel-based CP detection is to regard the problem as a two-sample test in which the two multidimensional subsequences before and after the CP are regarded as the two samples. In this approach, some discrepancy measure between the two samples, such as the kernel Fisher discriminant ratio or the maximum mean discrepancy (MMD), is defined, and the test statistic is the maximum value of the discrepancy measure scanned along the sequence. Harchaoui et al. (2009) first studied kernel CP detection using the kernel Fisher discriminant ratio, while Li et al. (2015) employed MMD as the discrepancy measure. They derived an asymptotic distribution of the test statistic (i.e., the maximum discrepancy along the sequence) under the assumption that the values at different time points are independently distributed.
3 Selective Inference for Multidimensional CP Detection
In this section, we present our main results. As discussed in the previous section, it is difficult to derive the sampling distribution of a test statistic of the form (4). Our basic idea for overcoming this difficulty is to interpret the CP detection problem as a selective inference problem, i.e., the problem of making an inference on the detected CP conditional on the fact that the CP is selected by a particular choice of aggregation function. This interpretation enables us to derive the exact (non-asymptotic) selective sampling distribution of the test statistic for a wide class of practical aggregation functions.
3.1 Proposed Class of Aggregation Functions
Let us first propose a class of aggregation functions for which we can derive the exact selective sampling distribution of the test statistic. Recall that $C_{i,t}$ is the CUSUM score of the $i$-th dimension at the $t$-th time point for $(i,t) \in [d] \times [T-1]$. As in the definition of the double CUSUM aggregation function, $C_{(i),t}$, $i \in [d]$, is defined as the $i$-th largest value in $\{|C_{j,t}|\}_{j \in [d]}$, i.e., $C_{(1),t} \ge \cdots \ge C_{(d),t}$ is satisfied for each $t \in [T-1]$. We define a class of aggregation functions as

$a_t := \sum_{i=1}^{d} w_i C_{(i),t}$, (5)

where $w_1, \ldots, w_d$ are constants. We refer to this class of aggregation functions as weighted rank aggregation (WRAG) functions. Table 1 shows several choices of the constants $\{w_i\}_{i \in [d]}$ corresponding to the aggregation functions discussed in §2. With the use of a WRAG function, we can detect a CP as

$\hat{\tau} := \mathrm{argmax}_{t \in [T-1]} \, a_t$. (6)

We refer to a CP detection method via (6) as a WRAG method.
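A WRAG method is straightforward to implement: sort the absolute CUSUM scores at each time point, take the weighted sum, and scan for the maximizer. A minimal sketch (names are ours):

```python
def wrag_detect(C, w):
    """Detect a CP with a weighted rank aggregation (WRAG) function:
    a_t = sum_i w_i * C_(i),t, where C_(i),t is the i-th largest |C_{j,t}|."""
    agg = []
    for c_t in C:                                   # one CUSUM vector per time point
        ordered = sorted((abs(v) for v in c_t), reverse=True)
        agg.append(sum(wi * ci for wi, ci in zip(w, ordered)))
    tau_hat = max(range(len(agg)), key=lambda t: agg[t])
    return tau_hat, agg

# top-2 aggregation as a WRAG instance: w = (1, 1, 0, 0)
C = [[0.1, -0.2, 0.0, 0.3],     # CUSUM vectors at t = 1, 2, 3
     [2.5, -2.2, 0.1, 0.2],
     [0.4, 0.1, -0.3, 0.2]]
tau_hat, agg = wrag_detect(C, [1, 1, 0, 0])
```

With these hypothetical scores, the aggregated score is maximized at the second time point, where two dimensions carry a large shift.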
3.2 Selective Inference on the CPs by WRAG methods
An inference on the CP detected by a WRAG method can be interpreted as a selective inference, which has been actively studied in the past few years for inference on feature selection problems (Fithian et al., 2014; Yang et al., 2016; Tian & Taylor, 2017; Suzumura et al., 2017). In order to formulate our problem as a selective inference problem, let us define the selection event and the selected test statistic. The selection event in a WRAG method is written as $\{\hat{\mathcal{T}} = \hat{\tau}\}$, where $\hat{\tau}$ and $a_{\hat{\tau}}$ are interpreted as realizations of the corresponding random variables $\hat{\mathcal{T}}$ and $A_{\hat{\mathcal{T}}}$. On the other hand, the selected test statistic is given as

$a_{\hat{\tau}} = \sum_{i=1}^{d} w_i C_{(i),\hat{\tau}}$. (7)

In the context of selective inference, based on the selection event and the selected test statistic, the so-called selective p-value is defined as

$P_{\mathrm{H}_0}\left(A_{\hat{\mathcal{T}}} \ge a_{\hat{\tau}} \mid \hat{\mathcal{T}} = \hat{\tau}\right)$,

where $a_{\hat{\tau}}$ is interpreted as a realization of the corresponding statistic $A_{\hat{\mathcal{T}}}$.
For the purpose of inference, we make a normality assumption on the multidimensional sequence $X$, namely,

$\mathrm{vec}(X) \sim N\left(\mathrm{vec}(M), \Sigma \otimes \Xi\right)$,

where $M$ is the mean matrix, whereas $\Sigma$ and $\Xi$ are covariance matrices representing the time correlation and the variable correlation structures, respectively. In practical CP detection tasks, it is often possible to obtain sequences without CPs, i.e., samples from the null hypothesis. The covariance matrices $\Sigma$ and $\Xi$ can be estimated from such samples or manually specified based on prior knowledge. Denoting the $t$-th column of the mean matrix $M$ as $\mu_t$, the null hypothesis in our selective inference is written as

$\mathrm{H}_0: \frac{1}{\hat{\tau}}\sum_{t \le \hat{\tau}} \mu_t = \frac{1}{T-\hat{\tau}}\sum_{t > \hat{\tau}} \mu_t$, (8)

where, remember, $\hat{\tau}$ is the CP detected via a WRAG method, meaning that the hypothesis itself is a random quantity.
In order to derive the sampling distribution of the selected test statistic, we use a recent seminal result in Lee et al. (2016). We first slightly generalize the key lemma in their work so that it can handle a random matrix $X$.

Lemma 1 (Polyhedral Lemma for a Random Matrix).
Suppose that $\mathrm{vec}(X) \sim N(\mathrm{vec}(M), \Sigma \otimes \Xi)$. Let $\theta := u^\top X v$ for any fixed $u \in \mathbb{R}^d$ and $v \in \mathbb{R}^T$, and let $z := \mathrm{vec}(X) - c\,\theta$ with $c := \frac{(\Sigma \otimes \Xi)(v \otimes u)}{(v \otimes u)^\top (\Sigma \otimes \Xi)(v \otimes u)}$. Then, any event represented in the form of $\{A\,\mathrm{vec}(X) \le b\}$ for a fixed matrix $A$ and a fixed vector $b$ can be written as

$\{L(z) \le \theta \le U(z), \; N(z) \ge 0\}$,

where

$L(z) := \max_{j: (Ac)_j < 0} \frac{b_j - (Az)_j}{(Ac)_j}, \quad U(z) := \min_{j: (Ac)_j > 0} \frac{b_j - (Az)_j}{(Ac)_j}, \quad N(z) := \min_{j: (Ac)_j = 0} \left(b_j - (Az)_j\right)$, (9)

and $\theta = (v \otimes u)^\top \mathrm{vec}(X)$. In addition, $(L(z), U(z), N(z))$ is independent of $\theta$.

The proof of the lemma is presented in Appendix B.1. This lemma states that if the test statistic is expressed as a bilinear function of $X$ in the form of $\theta = u^\top X v$, and the selection event can be expressed as an affine constraint in the form of $\{A\,\mathrm{vec}(X) \le b\}$, then the selected test statistic is restricted to a certain interval. This lemma is a simple extension of Lemma 5.1 in Lee et al. (2016).
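In the vectorized setting, the truncation interval can be computed by the standard polyhedral-lemma recipe of Lee et al. (2016): project each affine constraint onto the direction of the statistic. The following sketch (for a generic Gaussian vector `y` with covariance `S`; names are ours) illustrates the computation behind (9), skipping the rows with zero projection, which constrain only the part independent of the statistic:

```python
def truncation_interval(A, b, y, eta, S):
    """Truncation interval [L, U] for theta = eta' y given the event
    {A y <= b}, in the style of the polyhedral lemma."""
    n = len(y)
    Se = [sum(S[i][j] * eta[j] for j in range(n)) for i in range(n)]   # S eta
    var = sum(eta[i] * Se[i] for i in range(n))                        # eta' S eta
    c = [v / var for v in Se]
    theta = sum(eta[i] * y[i] for i in range(n))
    z = [y[i] - c[i] * theta for i in range(n)]                        # indep. of theta

    L, U = -float("inf"), float("inf")
    for Aj, bj in zip(A, b):
        Ac = sum(Aj[i] * c[i] for i in range(n))
        Az = sum(Aj[i] * z[i] for i in range(n))
        if Ac > 1e-12:
            U = min(U, (bj - Az) / Ac)      # rows bounding theta from above
        elif Ac < -1e-12:
            L = max(L, (bj - Az) / Ac)      # rows bounding theta from below
    return L, theta, U
```

For instance, with identity covariance and the single constraint $-y_1 \le 0$, the statistic $\theta = y_1$ is truncated to $[0, \infty)$.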
The selection event of a WRAG method cannot directly be written in the form of $\{A\,\mathrm{vec}(X) \le b\}$. Here, we consider the signs and the permutations of $\{C_{i,t}\}_{i \in [d]}$ for each $t \in [T-1]$ as additional selection events. Let $S_t$ be a $d$-by-$d$ diagonal matrix whose diagonal elements are the signs of $C_{1,t}, \ldots, C_{d,t}$, and let $P_t$ be a $d$-by-$d$ permutation matrix which maps $(|C_{1,t}|, \ldots, |C_{d,t}|)^\top$ to $(C_{(1),t}, \ldots, C_{(d),t})^\top$. The selection event is then formulated as

$\{\hat{\mathcal{T}} = \hat{\tau}\} \cap \bigcap_{t \in [T-1]} \{\mathcal{S}_t = S_t, \; \mathcal{P}_t = P_t\}$,

where $S_t$ and $P_t$ are interpreted as realizations of the corresponding statistics $\mathcal{S}_t$ and $\mathcal{P}_t$ for $t \in [T-1]$.
The following theorem is the core of the selective p-value computation in our selective inference.
Theorem 2.
Assume that the conditions of Lemma 1 hold. Then, there exist $L$ and $U$ such that

$F_{0,\sigma^2}^{[L,U]}(A_{\hat{\mathcal{T}}}) \mid \{\text{selection event}\} \sim \mathrm{Unif}(0,1)$ (10)

under the null hypothesis (8), where $F_{0,\sigma^2}^{[L,U]}$ is the cumulative distribution function (c.d.f.) of the normal distribution $N(0, \sigma^2)$ truncated to the interval $[L, U]$.

The complete proof of Theorem 2 is presented in Appendix B.2, where we show that, for any choice of aggregation function from the class in (5),

(i) the selection event can be written as an affine-constraint event in the form of $\{A\,\mathrm{vec}(X) \le b\}$,

(ii) the selected test statistic can be written as a bilinear function of $X$ in the form of $u^\top X v$.

Then, by applying Lemma 1 and Theorem 5.2 in Lee et al. (2016), Theorem 2 can be proved.
As described in Appendix A, for any choice of the aggregation function from the class in (5), the values of $L$, $U$, and $\sigma^2$ in Theorem 2 can be computed. By using these values, the selective p-value can be computed as

$1 - F_{0,\sigma^2}^{[L,U]}(a_{\hat{\tau}}) = \frac{\Phi(U/\sigma) - \Phi(a_{\hat{\tau}}/\sigma)}{\Phi(U/\sigma) - \Phi(L/\sigma)}$,

where $\Phi$ is the c.d.f. of the standard normal distribution.
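This truncated-normal formula is easy to evaluate numerically; a sketch using only the standard normal c.d.f. (via `math.erf`; function names are ours) is:

```python
import math

def norm_cdf(x):
    """Standard normal c.d.f. Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def selective_p_value(stat, L, U, sigma):
    """Selective p-value 1 - F_{[L,U]}(stat) for a N(0, sigma^2) statistic
    truncated to [L, U] (stat is assumed to lie inside the interval)."""
    FL = norm_cdf(L / sigma)
    FU = norm_cdf(U / sigma)
    Fs = norm_cdf(stat / sigma)
    return (FU - Fs) / (FU - FL)
```

With $L = -\infty$ and $U = \infty$ (no selection), this reduces to the naive one-sided p-value; a finite lower truncation point makes the p-value larger, which is exactly the selection bias correction.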
Remark 3.
When we do not need to select $k$ from the data, e.g., for $\ell_\infty$, $\ell_1$ and top-$k$ aggregations, the corresponding selection event is redundant for inference and need not be considered. For the same reason, we also do not need to consider some of the signs and/or the permutations, depending on the choice of the aggregation function. Concrete examples of truncation points for each choice are described in Appendix A.
Remark 4.
We can establish the same result as Theorem 2 even if all the signs and the permutations are considered. In this case, instead of a single interval, multiple intervals must be considered, one for each possible choice of the permutations and the signs. However, since the number of all possible combinations of the signs and the permutations is large, this is computationally intractable.
3.3 Power Analysis
Since $a_t \ge 0$ for any $t \in [T-1]$ and any WRAG function, hypothesis (8) can be viewed as a hypothesis on the mean of the selected test statistic. In practice, since $\hat{\tau}$ and the associated signs and permutations are determined by a WRAG method, the hypothesis is also random. Therefore, we consider an alternative hypothesis

$\mathrm{H}_1: E[A_{\hat{\mathcal{T}}}] > 0$, (11)

which is a negation of the null since our aggregated score takes only nonnegative values. Under the alternative, the same argument as in Theorem 2 indicates that the selected test statistic follows a truncated normal distribution with a positive mean. We now consider the power of the test in an asymptotic scenario in which the mean shift shrinks appropriately as the sequence grows. Note that this asymptotic scenario corresponds to a power analysis under the local alternative.
Theorem 5.
Let $z_\alpha$ be the upper $\alpha$ quantile of the null distribution. Then, under the alternative (11), the power of the test is approximated, almost surely, by a truncated-normal tail probability, where $\phi$ denotes the probability density function of the standard normal distribution.

The proof of Theorem 5 is presented in Appendix B.3. The theorem states that our selective inference procedure is an approximately unbiased test. Here, an unbiased test refers to a test whose power becomes $\alpha$ at the boundary of the hypothesis, i.e., as the alternative approaches the null. In addition, the test has a power of at least $\alpha$ since the second term in the last inequality is always positive. Theorem 5 suggests that there may exist better tests in terms of power.
4 Extension to Multiple CP Detection
In this section, we extend the selective inference framework for WRAG methods so that it can detect multiple CPs. To this end, we introduce a sliding window approach. Let $W_t$ be a sliding window centered at $t$ with length $2h$ for each $t$. If we simply conducted single CP detection within each sliding window, too many CPs would be detected due to the overlaps of multiple windows. To circumvent this issue, Hao et al. (2013) considered the so-called local hypothesis testing problem. For each window $W_t$, a local hypothesis test asks whether there is a CP at the center of the window.
In this hypothesis test, even when the null hypothesis is rejected, unless there is a CP at the center of the window, the hypothesis itself is considered to be out of our interest. A natural estimate of the set of CPs in this approach is the set of window centers at which the local null hypothesis is rejected.
In the context of one-dimensional CP detection problems, Yau & Zhao (2016) referred to this type of multiple CP estimate as local change point estimates. In this approach, we only need to consider a set of local hypotheses whose number is usually much smaller than that of all possible windows. Yau & Zhao (2016) also discussed the choice of the window size $h$, and gave a guideline that is appropriate in an asymptotic sense.
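A simplified sketch of the sliding-window scheme is given below, for a one-dimensional sequence and with a plain threshold standing in for the local test (in the full method, the selective p-value of §3 and the center condition would replace this threshold; names and the threshold are ours):

```python
def local_cp_detect(x, h, threshold):
    """Sliding-window local CP detection sketch: score a CP at the center
    of each window [t-h, t+h) and keep centers whose local CUSUM-type
    score exceeds the given threshold."""
    detected = []
    for t in range(h, len(x) - h):
        left = x[t - h:t]
        right = x[t:t + h]
        shift = sum(right) / h - sum(left) / h      # mean shift at center t
        score = abs(shift) * (h / 2) ** 0.5         # sqrt(h*h/(2h)) CUSUM scaling
        if score > threshold:
            detected.append((t, score))
    return detected
```

Because neighboring windows overlap, several adjacent centers may exceed the threshold; keeping only local maxima of the score (or testing only the center hypothesis, as above) avoids this over-detection.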
5 Numerical Experiments
Here, we confirm the performance of the proposed selective inference framework for WRAG methods through numerical experiments with both synthetic and real data.
5.1 Experiments on Synthetic Data
5.1.1 FPRs of selective and naive inferences
First, we confirmed whether the false positive rates (FPRs) are properly controlled in the selective inference framework for WRAG methods with double CUSUM (DC) aggregation and top-$k$ aggregation, where $k = 1$ and $k = d$ correspond to $\ell_\infty$ and $\ell_1$ aggregation, respectively. The synthetic data were generated from a normal distribution of the form $N(\mathrm{vec}(M), \Sigma \otimes \Xi)$ under the null hypothesis. We considered settings without and with correlation among different dimensions, while the time-correlation parameter was changed from 0.0 to 1.0 to simulate various degrees of correlation among different time points. In addition, we also computed the FPRs of naive inference for WRAG methods, i.e., without the selection bias correction procedure of Theorem 2. The significance level was set as $\alpha = 0.05$. In all cases, 1,000 runs with different random seeds were simulated.
Figure 2 shows the FPRs of selective inference (solid lines) and naive inference (dashed lines), where the horizontal and vertical axes indicate the degree of time correlation and the estimated FPRs, respectively. We see that selective inference could control the FPRs appropriately in all cases. On the other hand, in almost all cases, naive inference failed to control the FPRs, especially when the time correlation is small. (The bias of naive inference is large when the "effective" length of the sequence is large. Since the effective length decreases as the degree of correlation increases, the bias is large when the correlation is small.) Although the results in Figure 2 might be interpreted to mean that naive inference could also control the FPRs properly in some settings, this was actually not the case. Under the null hypothesis, the p-value should be uniformly distributed between 0 and 1 (see, e.g., Section 3 in Lehmann & Romano (2006)). Figure 3 shows the distributions of (a) the selective p-values and (b) the naive p-values, where we see that the former are uniformly distributed, while the latter are not. Indeed, the Kolmogorov-Smirnov test for uniformity did not reject uniformity for the selective p-values, but rejected it for the naive p-values.
5.1.2 FPRs of existing methods
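The need for selection bias correction can be reproduced in a few lines: under the null, treating the maximum $|C_t|$ as if it were a single standard normal draw grossly inflates the FPR. An illustrative Monte Carlo sketch (the parameters and function names are ours, not the paper's exact setup):

```python
import math
import random

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def naive_fpr(T, n_trials, alpha, seed=0):
    """Monte Carlo illustration of selection bias: for i.i.d. N(0,1) null
    sequences, apply a naive two-sided N(0,1) p-value to the maximum
    |CUSUM| score (ignoring the max-selection) and count rejections."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(T)]
        total = sum(x)
        left = 0.0
        best = 0.0
        for t in range(1, T):
            left += x[t - 1]
            c = math.sqrt(t * (T - t) / T) * (left / t - (total - left) / (T - t))
            best = max(best, abs(c))
        naive_p = 2.0 * (1.0 - norm_cdf(best))   # ignores the scan over t
        if naive_p < alpha:
            rejections += 1
    return rejections / n_trials
```

Each fixed-$t$ CUSUM score is exactly $N(0,1)$ under this null, so the inflation comes entirely from scanning over $t$, which is the bias the selective p-value corrects.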
As mentioned in §2, there are two types of existing CP detection methods for multidimensional sequences. In both, the asymptotic sampling distribution of a test statistic of the form (4) is derived under certain assumptions. Here, we examine how these existing methods behave when the assumptions are violated.
First, to examine the performance of the method proposed by Jirak (2015), we generated null data with independent time points and investigated the performance as the dimension $d$ varies from 1 to 100 (see Figure 4(a)). We observe that the FPRs increase as $d$ becomes large, i.e., when the assumption in Jirak (2015) is violated. In contrast, the proposed selective inference (with double CUSUM aggregation) could appropriately control the FPRs at the desired significance level of $\alpha = 0.05$ in the same setting (red solid line).
Next, to examine the performance of the method proposed by Li et al. (2015), we generated null data with correlation among time points and investigated the performance as the correlation parameter varies from 0.0 to 0.3 (see Figure 4(b)). We observe that the FPRs increase as the correlation increases, i.e., when the independence assumption in Li et al. (2015) is violated. Again, the proposed selective inference could appropriately control the FPRs at the desired significance level of $\alpha = 0.05$ in the same setting.
5.2 Application to CNV Detection
We applied the proposed selective inference framework for WRAG methods to a copy number variation (CNV) study on malignant lymphoma (Takeuchi et al., 2009). In this study, the CNVs of 46 patients diagnosed with diffuse large B-cell lymphoma (DLBCL) were investigated by an array comparative genomic hybridization (array CGH) technique (Hodgson et al., 2001). The dataset that we analyze here is represented as a real-valued multidimensional sequence with $d = 46$. Each dimension corresponds to a patient, while each time point corresponds to a local genomic region. It is well known that CNVs in DLBCL are heterogeneous because DLBCL has several subtypes. (Identifying and characterizing the genetic properties of disease subtypes is crucially important for precision medicine.) The goal of this medical study is to detect CNVs commonly observed in a subset of patients. Various one-dimensional CP detection methods have been used for analyzing array CGH data of a single patient (Wang et al., 2005; Tibshirani & Wang, 2008; Rapaport et al., 2008). However, there is no existing method for detecting common CPs by analyzing the CNVs of multiple patients altogether, or for providing the statistical significance of the detected CNVs.
Due to space limitations, we only present the results for Chromosome 1. We applied a WRAG method with the double CUSUM aggregation function to this dataset. For detecting multiple CPs, we used the local hypothesis testing framework described in §4, with the window size chosen following the asymptotic guideline discussed there. The covariance structure $\Xi$ was set to be the identity because each dimension in this multidimensional sequence was obtained from an individual patient. On the other hand, the covariance structure $\Sigma$ was estimated from a separate control dataset. (CNV data in array CGH analysis are obtained by comparing the copy numbers between a patient and a healthy reference person. Therefore, a control dataset without any CNVs can easily be obtained by comparing the copy numbers between two healthy reference persons.) The parameter $\varphi$ in double CUSUM was set to 0.5 as suggested in Cho (2016).
We detected 54 CPs, and 11 of them are statistically significant in the sense that the selective p-value is less than 0.05. Table 2 shows the list of the significant CPs. Two examples of the detected CPs are illustrated in Figure 5(b). Note that the numbers of selected dimensions (patients) differ among the detected CPs, which is an advantage of the double CUSUM aggregation function. Our selective inference interpretation of CP detection problems allows us to properly correct the selection bias even when the selection procedure is fairly complicated, as in double CUSUM aggregation.

Table 2: CPs detected in Chromosome 1 with selective p-value below 0.05.
region | gene name | #patients | selective p-value
7 | Q8N7E4 | 5 | 0.010
15 | NM018125 | 5 | 0.028
18 | CDA/KIF17 | 1 | 0.000
22 | PAFAH2 | 4 | 0.000
31 | EIF2C1 | 2 | 0.001
36 | NA | 1 | 0.000
106 | | 13 | 0.038
120 | C1orf9/TNFSF6 | 21 | 0.040
151 | RPS6KC1 | 1 | 0.010
162 | PSEN2 | 1 | 0.000
165 | DISC1 | 23 | 0.044
References
Cho, Haeran. Change-point detection in panel data via double CUSUM statistic. Electronic Journal of Statistics, 10(2):2000–2038, 2016.
Cho, Haeran and Fryzlewicz, Piotr. Multiple-change-point detection for high dimensional time series via sparsified binary segmentation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(2):475–507, 2015.
Csörgö, Miklós and Horváth, Lajos. Limit Theorems in Change-Point Analysis, volume 18. John Wiley & Sons Inc, 1997.
Fithian, William, Sun, Dennis, and Taylor, Jonathan. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.
Fryzlewicz, Piotr. Wild binary segmentation for multiple change-point detection. The Annals of Statistics, 42(6):2243–2281, 2014.
Hao, Ning, Niu, Yue Selena, and Zhang, Heping. Multiple change-point detection via a screening and ranking algorithm. Statistica Sinica, 23(4):1553, 2013.
Harchaoui, Zaid, Moulines, Eric, and Bach, Francis R. Kernel change-point analysis. In Advances in Neural Information Processing Systems, pp. 609–616, 2009.
Hodgson, Graeme, Hager, Jeffrey H, Volik, Stas, Hariono, Sujatmi, Wernick, Meredith, Moore, Dan, Albertson, Donna G, Pinkel, Daniel, Collins, Colin, Hanahan, Douglas, et al. Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics, 29(4):459, 2001.
Hyun, Sangwon, G'Sell, Max, and Tibshirani, Ryan J. Exact post-selection inference for changepoint detection and other generalized lasso problems. arXiv preprint arXiv:1606.03552, 2016.
Jirak, Moritz. Uniform change point tests in high dimension. The Annals of Statistics, 43(6):2451–2483, 2015.
Lee, Jason D, Sun, Dennis L, Sun, Yuekai, and Taylor, Jonathan E. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.
Lee, Sangyeol, Ha, Jeongcheol, Na, Okyoung, and Na, Seongryong. The CUSUM test for parameter change in time series models. Scandinavian Journal of Statistics, 30(4):781–796, 2003.
Lehmann, Erich L and Romano, Joseph P. Testing Statistical Hypotheses. Springer Science & Business Media, 2006.
Li, Shuang, Xie, Yao, Dai, Hanjun, and Song, Le. M-statistic for kernel change-point detection. In Advances in Neural Information Processing Systems, pp. 3366–3374, 2015.
Page, E. S. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954.
Phillips, Peter CB. Time series regression with a unit root. Econometrica, 55(2):277–301, 1987.
Rapaport, Franck, Barillot, Emmanuel, and Vert, Jean-Philippe. Classification of arrayCGH data using fused SVM. Bioinformatics, 24(13):i375–i382, 2008.
Shao, Xiaofeng and Zhang, Xianyang. Testing for change points in time series. Journal of the American Statistical Association, 105(491):1228–1240, 2010.
Suzumura, Shinya, Nakagawa, Kazuya, Umezu, Yuta, Tsuda, Koji, and Takeuchi, Ichiro. Selective inference for sparse high-order interaction models. In International Conference on Machine Learning, pp. 3338–3347, 2017.
Takeuchi, Ichiro, Tagawa, Hiroyuki, Tsujikawa, Akira, Nakagawa, Masao, Katayama-Suguro, Miyuki, Guo, Ying, and Seto, Masao. The potential of copy number gains and losses, detected by array-based comparative genomic hybridization, for computational differential diagnosis of B-cell lymphomas and genetic regions involved in lymphomagenesis. Haematologica, 94(1):61–69, 2009.
Taylor, Jonathan and Tibshirani, Robert J. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.
Tian, Xiaoying and Taylor, Jonathan. Asymptotics of selective inference. Scandinavian Journal of Statistics, 44(2):480–499, 2017.
Tibshirani, Robert and Wang, Pei. Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics, 9(1):18–29, 2008.
Tibshirani, Robert, Saunders, Michael, Rosset, Saharon, Zhu, Ji, and Knight, Keith. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
Wang, Pei, Kim, Young, Pollack, Jonathan, Narasimhan, Balasubramanian, and Tibshirani, Robert. A method for calling gains and losses in array CGH data. Biostatistics, 6(1):45–58, 2005.
Wang, Tengyao and Samworth, Richard J. High-dimensional changepoint estimation via sparse projection. arXiv preprint arXiv:1606.06246, 2016.
Yang, Fan, Barber, Rina Foygel, Jain, Prateek, and Lafferty, John. Selective inference for group-sparse linear models. In Advances in Neural Information Processing Systems, pp. 2469–2477, 2016.
Yao, Yi-Ching. Estimating the number of change-points via Schwarz' criterion. Statistics & Probability Letters, 6(3):181–189, 1988.
Yau, Chun Yip and Zhao, Zifeng. Inference for multiple change points in time series via likelihood ratio scan statistics. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(4):895–916, 2016.
Appendix A Example of Truncation Points
In this section, we evaluate the truncation points $L$ and $U$ in Theorem 2 for several aggregation functions described in §3, i.e., the general WRAG, $\ell_\infty$ aggregation, $\ell_1$ aggregation and top-$k$ aggregation functions.
A.1 General WRAG
To derive the truncation points $L$ and $U$ in Theorem 2, we first show that $a_{\hat{\tau}}$ can be expressed as a bilinear form of $X$. By the definition of the multivariate CUSUM score, we have $C_t = X v_t$, where $v_t \in \mathbb{R}^T$ is the vector whose $s$-th element is $\sqrt{\frac{T-t}{Tt}}$ for $s \le t$ and $-\sqrt{\frac{t}{T(T-t)}}$ for $s > t$.
Let $S_t$ be a $d$-by-$d$ diagonal matrix whose diagonal elements are the signs of $C_{1,t}, \ldots, C_{d,t}$. Then, by the definition of $C_{(i),t}$, there exists a $d$-by-$d$ permutation matrix $P_t$ such that $(C_{(1),t}, \ldots, C_{(d),t})^\top = P_t S_t C_t$. Combining all the above, (7) can be reduced to

$a_{\hat{\tau}} = w^\top P_{\hat{\tau}} S_{\hat{\tau}} X v_{\hat{\tau}} = u_{\hat{\tau}}^\top X v_{\hat{\tau}}$,

where $w := (w_1, \ldots, w_d)^\top$ and $u_{\hat{\tau}} := S_{\hat{\tau}}^\top P_{\hat{\tau}}^\top w$. Then, the selection event can be expressed as an affine constraint with respect to $\mathrm{vec}(X)$ (see Section B.2). Precisely, let $D \in \mathbb{R}^{(d-1) \times d}$ be a first-order difference matrix, that is, the matrix with $D_{i,i} = 1$, $D_{i,i+1} = -1$ and zeros elsewhere.
Then, the selection event reduces to the intersection of affine constraints on $\mathrm{vec}(X)$ for all $t \in [T-1]$, corresponding to the ordering, maximizer and sign conditions.
To derive the truncation points, we apply Lemma 1 with $u = u_{\hat{\tau}}$ and $v = v_{\hat{\tau}}$. Since $a_{\hat{\tau}}$ itself is nonnegative, Lemma 1 implies, by simple calculations, truncation bounds of three types:

(12a) the bounds induced by the ordering constraints,
(12b) the bounds induced by the maximizer constraint, and
(12c) the bounds induced by the sign constraints.
First, it holds that the ordering constraints are linear in $\mathrm{vec}(X)$, and thus (12a) yields a set of linear bounds on the test statistic. In addition, the maximizer constraint (12b) can be reduced to linear bounds in the same way. Finally, the sign constraints (12c) yield the remaining bounds. Combining all the above, the lower truncation point $L$ is obtained as the maximum of the corresponding lower bounds. Similarly, by using the fact that the nonnegativity constraint does not affect the upper truncation point, $U$ is obtained as the minimum of the corresponding upper bounds.
A.2 $\ell_\infty$ aggregation
Recall that the $\ell_\infty$ aggregation score is expressed by $a_t = \max_{i \in [d]} |C_{i,t}|$. Let $u_t := s_t e_{i_t}$, where $i_t$ is a maximizer of $|C_{i,t}|$, $s_t$ is the sign of $C_{i_t,t}$, and $e_i$ is a $d$-dimensional unit vector whose $i$-th element is one. Then we see that $a_t = u_t^\top X v_t$, where $v_t$ is the $T$-dimensional vector defined in Appendix A.1. In $\ell_\infty$ aggregation, we consider the event that $i_t$ is the maximizer of $|C_{i,t}|$, together with its sign, as the selection event. In this case, the constraint on the sign of $C_{i_t,t}$ is equivalent to the nonnegativity of the test statistic. Hence the event can be expressed as an affine constraint on $\mathrm{vec}(X)$. Note that the former event in this expression can be rewritten as a set of linear inequalities comparing $|C_{i_t,t}|$ with the other $|C_{i,t}|$.