Selective Inference for Multi-Dimensional Multiple Change Point Detection

06/01/2017
by   Yuta Umezu, et al.

We consider the problem of multiple change point (CP) detection from a multi-dimensional sequence. We are mainly interested in the situation where changes are observed only in a subset of multiple dimensions at each CP. In such a situation, we need to select not only the time points but also the dimensions where changes actually occur. In this paper we study a class of multi-dimensional multiple CP detection algorithms for this task. Our main contribution is to introduce a statistical framework for controlling the false detection probability of this class of CP detection algorithms. The key idea is to regard a CP detection problem as a selective inference problem, and derive the sampling distribution of the test statistic under the condition that those CPs are detected by applying the algorithm to the data. By using an analytical tool recently developed in the selective inference literature, we show that, for a wide class of multi-dimensional multiple CP detection algorithms, it is possible to exactly (non-asymptotically) control the false detection probability at the desired significance level.

1 Introduction

Figure 1: A motivating example taken from a biomedical study (Takeuchi et al., 2009). The four-dimensional sequence indicates the results of copy number variations of four malignant lymphoma patients in Chromosome 1. The positive/negative values in each dimension indicate the copy number gain/loss at the corresponding genome position. The purpose of this biomedical study is to detect common change points (CPs) in a subset of dimensions which indicates that the copy number variations are relevant to the malignant lymphoma disease. The difficulty lies in the fact that copy number variations are heterogeneous among patients. For example, patients B and D have a common CP at position a, while patients A and B share a CP at position b. In this paper, we propose a class of methods that can detect common CPs shared by a subset of dimensions in a multi-variate sequence. In §5, we analyze this malignant lymphoma dataset in order to demonstrate the effectiveness of the proposed methods.

In this paper, we study change point (CP) detection in multi-dimensional sequences. This problem arises in many areas. Figure 1 shows a motivating example in which copy number variations of four malignant lymphoma patients are plotted. The purpose of this study is to detect common CPs among patients. The difficulty of this problem lies in the fact that the copy number variations are heterogeneous among patients, i.e., they are commonly observed only in a subset of patients as illustrated in the figure.

The goal of this paper is to develop a method for detecting CPs that are characterized by a subset of dimensions in a multi-dimensional sequence. A CP detection method for solving such a problem can be formulated as a two-stage method: one for selecting dimensions, and another for selecting time points. (Although our target is not restricted to time sequences, we use the term "time" to refer to a point in a sequence.) We call the former the aggregation stage, and the latter the scanning stage. In the aggregation stage, a subset of dimensions is selected and the scores for the selected dimensions are aggregated into a scalar score that represents the plausibility of a common change at the time point. Various forms of aggregation can be considered. In the scanning stage, a time point that maximizes the aggregated score is selected as a CP by scanning the one-dimensional aggregated score sequence.

In this paper, our goal is not only to detect CPs but also to properly control the false detection probability at a desired significance level, e.g., $\alpha = 0.05$. To this end, we must take into account the fact that the CPs are obtained by selecting the dimensions and the time points in the aggregation and scanning stages, respectively. This means that two types of selection bias must be properly corrected in order to make inferences on the detected CPs. For a one-dimensional sequence, various statistical CP detection methods have been studied (Page, 1954; Yao, 1988; Lee et al., 2003; Yau & Zhao, 2016). In contrast, for multi-dimensional sequences, there are only a few asymptotic results, and these were developed under rather restrictive assumptions. Deriving the sampling distributions of CP statistics is generally difficult even in asymptotic scenarios because the stochastic singularity of CPs must be properly handled.

In this paper, we introduce a class of CP detection methods that can exactly control the false detection probability. Our key contribution is to interpret a CP detection problem as a selective inference problem. For this class of CP detection methods, the interpretation allows us to derive the exact (non-asymptotic) sampling distribution of the CP test statistic conditional on the fact that the CP is selected by a particular method. Using a recent result on selective inference (Taylor & Tibshirani, 2015; Lee et al., 2016), we show that inferences based on the derived sampling distribution can correct the two types of selection bias in the aggregation and scanning stages, and the overall false detection probability can be properly controlled.

The rest of the paper is organized as follows. In §2, we formulate our problem setup and review related work. Here, we focus on the problem of detecting a single mean structure change in an off-line manner. In §3, we introduce a selective inference interpretation of a CP detection problem and present the main results. Here, we first introduce a class of CP detection methods which includes many existing ones. We then show that the selective type I error of the detected CPs can be exactly (non-asymptotically) controlled. We also show that our hypothesis testing procedure is an approximately unbiased test and derive a lower bound of the power in the sense of selective inference. In §4, the results in the previous section are extended to multiple CP detection problems via a local hypothesis testing framework. §5 is devoted to numerical experiments. Here, we first confirm that the proposed CP detection methods can properly control the false detection probabilities in a simulation study. We then apply these CP detection methods to copy number variation analysis of 46 malignant lymphoma patients.

Notations

We define $[n] := \{1, 2, \dots, n\}$ for $n \in \mathbb{N}$ and, as a special case, $[0] := \emptyset$. For two matrices $A = (a_{ij})$ and $B$, $A \otimes B$ denotes the Kronecker product, i.e., $A \otimes B = (a_{ij} B)$. We write $A^\top$ and $v^\top$ for the transpose of an appropriate matrix $A$ and vector $v$. For a non-negative integer $n$, the $n$-by-$n$ identity matrix is denoted by $I_n$, while a zero matrix is denoted as $O$, omitting its size as long as no confusion is expected. The sign function, the indicator function and the vectorize operator are denoted by $\mathrm{sgn}(\cdot)$, $\mathbb{1}\{\cdot\}$ and $\mathrm{vec}(\cdot)$, respectively.

2 Problem Setup and Related Work

In this section, we present the problem setup and discuss related work. We first consider the problem of detecting a single CP; its extension to multiple CP detection is discussed in §4.

2.1 Problem Setup

Let us write a $d$-dimensional sequence of length $n$ as a $d$-by-$n$ matrix $X = (x_{jt}) \in \mathbb{R}^{d \times n}$, and let $\mu_t \in \mathbb{R}^d$ denote the mean vector at time point $t$. Then, a single CP detection problem for a mean shift is formulated as the following hypothesis testing problem:

(1)  $\mathrm{H}_0: \mu_1 = \mu_2 = \cdots = \mu_n$  vs.  $\mathrm{H}_1: \mu_1 = \cdots = \mu_\tau \neq \mu_{\tau+1} = \cdots = \mu_n$ for some $\tau \in [n-1]$,

where the null hypothesis $\mathrm{H}_0$ states that the mean vector does not change within the entire sequence, whereas the alternative hypothesis $\mathrm{H}_1$ states that there is one CP.

CUSUM score

In order to discuss the test statistic and its sampling distribution for the hypothesis testing problem in (1), let us first consider single CP detection from a one-dimensional sequence denoted as $x_1, \dots, x_n$. Let $\bar{x}_{1:t} = \frac{1}{t}\sum_{i \le t} x_i$ and $\bar{x}_{t+1:n} = \frac{1}{n-t}\sum_{i > t} x_i$ be the sample means of the subsequences before and after a candidate time point $t$. Then, we expect the time point $t$ to be a CP when the value $|\bar{x}_{1:t} - \bar{x}_{t+1:n}|$ is large. Hence, we define a natural estimator of the discrepancy between the two subsequences as $\bar{x}_{1:t} - \bar{x}_{t+1:n}$, and its scaled measure

$C_t = \sqrt{\dfrac{t(n-t)}{n}}\,\bigl(\bar{x}_{1:t} - \bar{x}_{t+1:n}\bigr)$

is known as the CUSUM (cumulative sum) score (Page, 1954). Note that the CUSUM score can be interpreted as a realization of the logarithm of a Gaussian likelihood ratio when we assume that the sequence is mutually independent. A point $t$ that maximizes $|C_t|$ is detected as a CP.

Multi-variate CUSUM score and its aggregation

In the case of a multi-dimensional sequence, it is natural to consider a multi-variate version of the CUSUM score:

(2)  $c_t = (C_{1t}, \dots, C_{dt})^\top \in \mathbb{R}^d$

for $t \in [n-1]$, where each $C_{jt}$ is a CUSUM score corresponding to the $j$-th dimension. Since the elements of a multi-variate CUSUM score cannot all be maximized simultaneously, we need to first aggregate the $d$-dimensional vector into a scalar value. We denote an aggregation function as $g: \mathbb{R}^d \to \mathbb{R}$, i.e., the score $g(c_t)$ is computed for each $t \in [n-1]$. The aggregated score $g(c_t)$ represents the plausibility of a CP at time point $t$.
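As a concrete reference for (2), the following is a minimal NumPy sketch that computes the full matrix of CUSUM scores at once. The function name cusum_matrix and the $d$-by-$n$ array layout are our own choices, not the paper's.

```python
import numpy as np

def cusum_matrix(X):
    """Multi-variate CUSUM scores (2) for a d-by-n sequence X.

    Returns the d-by-(n-1) matrix whose (j, t) entry is
    C_{jt} = sqrt(t(n-t)/n) * (mean(x_{j,1:t}) - mean(x_{j,t+1:n})).
    """
    d, n = X.shape
    t = np.arange(1, n)                        # candidate CPs t = 1, ..., n-1
    csum = np.cumsum(X, axis=1)                # running sums over time
    left_mean = csum[:, :-1] / t               # mean of x_{j,1..t}
    right_mean = (csum[:, -1:] - csum[:, :-1]) / (n - t)  # mean of x_{j,t+1..n}
    return np.sqrt(t * (n - t) / n) * (left_mean - right_mean)
```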

Choices of aggregation function

For multi-dimensional CP detection, various choices of aggregation function can be considered.

$\ell_\infty$-aggregation

Jirak (2015) proposed $\ell_\infty$-aggregation as $g(c_t) = \max_{j \in [d]} |C_{jt}|$. This aggregation function simply selects the dimension whose absolute CUSUM score is greatest among the $d$ dimensions. This choice is not appropriate when there are changes in multiple dimensions.

$\ell_1$-aggregation

Another simple aggregation function is $\ell_1$-aggregation defined as $g(c_t) = \sum_{j \in [d]} |C_{jt}|$. This aggregation function just sums up the individual CUSUM scores. This choice is not appropriate if changes are observed only in a subset of dimensions.

Top-$K$ aggregation

If changes are observed in a subset of dimensions and the size of the subset is known to be $K$, then the top-$K$ aggregation function, defined as the sum of the $K$ largest values among $|C_{1t}|, \dots, |C_{dt}|$, is appropriate. This aggregation function can be interpreted as a generalization of the $\ell_\infty$- and $\ell_1$-aggregation functions since it reduces to $\ell_\infty$- and $\ell_1$-aggregation when $K = 1$ and $K = d$, respectively.

Double CUSUM aggregation

If $K$ is unknown, then it would be desirable to select an appropriate $K$ from the data. Let $|C|_{(k)t}$, $k \in [d]$, be the $k$-th largest value in $\{|C_{jt}|\}_{j \in [d]}$, i.e., $|C|_{(1)t} \ge \cdots \ge |C|_{(d)t}$. Cho (2016) proposed the double CUSUM aggregation function defined as

(3)  $g(c_t) = \max_{k \in [d]} \left(\dfrac{k(2d-k)}{2d}\right)^{\varphi} \left( \dfrac{1}{k}\sum_{j=1}^{k} |C|_{(j)t} - \dfrac{1}{2d-k}\sum_{j=k+1}^{d} |C|_{(j)t} \right)$,

where $\varphi$ is a pre-determined positive constant. This aggregation function returns, for each $k$, the CUSUM score at position $k$ of the ordered sequence $|C|_{(1)t}, \dots, |C|_{(d)t}$ padded with $d$ zeros. The rationale for this choice is that, if there are changes in $k^\ast$ dimensions, then the top $k^\ast$ absolute CUSUM values tend to be larger than the remaining $d - k^\ast$ CUSUM values, meaning that the CUSUM score for this ordered sequence would be maximized at $k = k^\ast$. As suggested in Cho (2016), $\varphi = 0.5$ would be optimal in the sense of asymptotic theory.
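The following sketch implements the double CUSUM aggregation as we read (3), treating the ordered absolute CUSUM sequence as implicitly padded with $d$ zeros. The function name and the vectorized layout are our own; the input is the CUSUM matrix produced by cusum_matrix above.

```python
import numpy as np

def double_cusum_aggregation(C, phi=0.5):
    """Double CUSUM aggregation (3) applied column-wise to a CUSUM
    matrix C of shape (d, n-1); returns g(c_t) and the maximizing k."""
    d, _ = C.shape
    A = -np.sort(-np.abs(C), axis=0)           # |C|_(1)t >= ... >= |C|_(d)t
    k = np.arange(1, d + 1).reshape(-1, 1)     # candidate subset sizes
    csum = np.cumsum(A, axis=0)
    top_mean = csum / k                        # mean of the k largest values
    rest_mean = (csum[-1:, :] - csum) / (2 * d - k)  # remainder, incl. d zero pads
    D = ((k * (2 * d - k)) / (2 * d)) ** phi * (top_mean - rest_mean)
    return D.max(axis=0), D.argmax(axis=0) + 1
```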

Test statistic for the problem in (1)

Based on the above discussion, a natural test statistic for the hypothesis testing in (1) is

(4)  $T = \max_{t \in [n-1]} g(c_t)$.

Here, $g(c_t)$ can be interpreted as a realization of the corresponding random variable, and thus we can consider $T$ as a test statistic. Then, the $p$-value is defined as the false detection probability under $\mathrm{H}_0$, that is, $\Pr_{\mathrm{H}_0}(T \ge t^{\mathrm{obs}})$, where $t^{\mathrm{obs}}$ is the observed value of $T$. If the $p$-value is smaller than the significance level $\alpha$, then we can conclude that there is a single CP in the multi-dimensional sequence.

As we briefly discuss in the following subsection, it is difficult (even asymptotically) to derive the sampling distribution of $T$ in (4) for most practical choices of the aggregation function $g$ unless rather restrictive assumptions are imposed.

2.2 Related Work

Here, we briefly review existing work on controlling the false detection probabilities in CP detection problems.

Inference for one-dimensional sequences

First, we review statistical inference on CP detection in one-dimensional sequences. As described above, one can regard the point at which the CUSUM score is maximized as the most plausible CP. Hence, the test statistic for the target hypothesis is naturally defined as $T = \max_{t \in [n-1]} |C_t|$, and it corresponds to the log-likelihood ratio test statistic. Then, as is well known, the normalized CUSUM process converges weakly to a Brownian bridge under appropriate moment and weak dependence assumptions on the sequence (Phillips, 1987; Csörgö & Horváth, 1997; Shao & Zhang, 2010), where the normalization is by the so-called long-run variance $\sigma^2 = \sum_{h=-\infty}^{\infty} \mathrm{Cov}(x_t, x_{t+h})$. In this asymptotic theory, the weak dependence assumption is essential because, otherwise, the long-run variance does not exist or becomes zero.

Another closely related inference procedure was proposed by Hyun et al. (2016). They used the fused LASSO (Tibshirani et al., 2005) for CP detection, and interpreted the inferences on the detected CPs as inferences on the regression model coefficients for the features selected by the fused LASSO. Inferences on the regression model coefficients for selected features can be done using the recently popularized framework of selective inference (Lee et al., 2016).

Although inference problems on CP detection in one-dimensional sequences have been intensively studied in the literature, these methods cannot be easily generalized to the case of multi-dimensional sequences.

Inference for multi-dimensional sequences

Unlike the one-dimensional case, the literature on inference for multi-dimensional sequences is very scarce. (Several methods with estimation guarantees were developed (Fryzlewicz et al., 2014; Cho & Fryzlewicz, 2015; Cho, 2016; Wang & Samworth, 2016), but they cannot be used for inference on the detected CPs.) To the best of our knowledge, existing work on this topic can be divided into two types.

  method                      | weight $v_{kj}$
  ----------------------------|------------------------------------------------
  $\ell_\infty$-aggregation   | $\mathbb{1}\{j = 1\}$
  $\ell_1$-aggregation        | $1$
  top-$K$ aggregation         | $\mathbb{1}\{j \le K\}$
  double CUSUM                | $\left(\frac{k(2d-k)}{2d}\right)^{\varphi}\left(\frac{\mathbb{1}\{j \le k\}}{k} - \frac{\mathbb{1}\{j > k\}}{2d-k}\right)$

Table 1: Example WRAG functions, expressed as weights $v_{kj}$ in (5).

The first type studies likelihood-based methods such as the $\ell_\infty$-aggregation reported by Jirak (2015), who derived the asymptotic distribution of the $\ell_\infty$-aggregation score as an extreme value distribution. To establish the asymptotic distribution, the length of the sequence $n$ and the dimension $d$ must satisfy a growth condition that bounds how fast $d$ may grow with $n$. This condition indicates that if $d$ is relatively large compared to $n$, then one can no longer control the false detection probability even when the underlying distribution is independent for each time point.

The second type uses kernel-based methods. The basic idea of kernel-based CP detection is to consider the problem as a two-sample test in which the two multi-dimensional subsequences before and after the CP are regarded as the two samples. In this approach, some discrepancy measure between the two samples, such as the kernel Fisher discriminant ratio or the maximum mean discrepancy (MMD), is defined, and the test statistic is the maximum value of the discrepancy measure scanned along the sequence. Harchaoui et al. (2009) first studied kernel CP detection by using the kernel Fisher discriminant ratio, while Li et al. (2015) employed MMD as the discrepancy measure. They derived asymptotic distributions of the test statistic (i.e., the maximum discrepancy along the sequence) under the assumption that the values at different time points are independently distributed.

3 Selective Inference for Multi-dimensional CP Detection

In this section, we present our main results. As discussed in the previous section, it is difficult to derive the sampling distribution of a test statistic of the form (4). Our basic idea for overcoming this difficulty is to interpret the CP detection problem as a selective inference problem, i.e., the problem of making an inference on the detected CP conditional on the fact that the CP is selected by a particular choice of aggregation function. This interpretation enables us to derive the exact (non-asymptotic) selective sampling distribution of the test statistic for a wide class of practical aggregation functions $g$.

3.1 Proposed Class of Aggregation Functions

Let us first propose a class of aggregation functions for which we can derive the exact selective sampling distribution of the test statistic. Recall that $C_{jt}$ is the CUSUM score in the $j$-th dimension at the $t$-th time point for $(j, t) \in [d] \times [n-1]$. As in the definition of the double CUSUM aggregation function, $|C|_{(k)t}$, $k \in [d]$, is defined as the $k$-th largest value in $\{|C_{jt}|\}_{j \in [d]}$, i.e., $|C|_{(1)t} \ge \cdots \ge |C|_{(d)t}$. We define a class of aggregation functions as

(5)  $g(c_t) = \max_{k \in [d]} \sum_{j \in [d]} v_{kj} |C|_{(j)t}$,

where $v_{kj}$, $(k, j) \in [d] \times [d]$, are constants. We refer to this class of aggregation functions as weighted rank aggregation (WRAG) functions. Table 1 shows several choices of the constants $v_{kj}$ corresponding to the choices of aggregation functions discussed in §2. With the use of a WRAG function, we can detect a CP as

(6)  $\hat{\tau} = \mathop{\arg\max}_{t \in [n-1]} g(c_t)$.

We refer to a CP detection method via (6) as a WRAG method.
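Below is a hedged sketch of a WRAG method, assuming the weight-matrix form of (5) as reconstructed above: the $k$-th row of V holds the weights applied to the ordered absolute CUSUM scores. The helper names and the example weight matrices (following Table 1) are our own.

```python
import numpy as np

def wrag_detect(C, V):
    """CP detection (6) with a WRAG function (5).

    C : d-by-(n-1) CUSUM matrix; V : d-by-d weight matrix whose k-th
    row holds v_{k1}, ..., v_{kd}.  Returns the detected CP tau_hat
    (1-based) and the maximizing row k_hat."""
    A = -np.sort(-np.abs(C), axis=0)       # ordered scores |C|_(j)t
    scores = V @ A                         # scores[k-1, t] = sum_j v_kj |C|_(j)t
    g = scores.max(axis=0)                 # WRAG score g(c_t)
    tau_hat = int(np.argmax(g)) + 1
    k_hat = int(np.argmax(scores[:, tau_hat - 1])) + 1
    return tau_hat, k_hat

# Example weights from Table 1 (d = 5, top-K with K = 2):
d, K = 5, 2
V_linf = np.zeros((d, d)); V_linf[:, 0] = 1.0   # ell_inf-aggregation
V_l1 = np.ones((d, d))                          # ell_1-aggregation
V_topK = np.zeros((d, d)); V_topK[:, :K] = 1.0  # top-K aggregation
```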

3.2 Selective Inference on the CPs by WRAG methods

An inference on the CP detected by a WRAG method can be interpreted as a selective inference, which has been actively studied in the past few years for inference on feature selection problems (Fithian et al., 2014; Yang et al., 2016; Tian & Taylor, 2017; Suzumura et al., 2017). In order to formulate our problem as a selective inference problem, let us define the selection event and the selected test statistic. The selection event in a WRAG method is written as $\{\hat{\tau} = \tau, \hat{k} = k\}$, where $\tau$ and $k$ are interpreted as realizations of the corresponding random variables $\hat{\tau}$ and $\hat{k}$ defined as

$\hat{\tau} = \mathop{\arg\max}_{t \in [n-1]} g(c_t), \qquad \hat{k} = \mathop{\arg\max}_{k \in [d]} \sum_{j \in [d]} v_{kj} |C|_{(j)\hat{\tau}}.$

On the other hand, the selected test statistic is given as

(7)  $T = \sum_{j \in [d]} v_{\hat{k}j} |C|_{(j)\hat{\tau}}$.

In the context of selective inference, based on the selection event and the selected test statistic, the so-called selective $p$-value is defined as

$p = \Pr_{\mathrm{H}_0}\bigl(T \ge t^{\mathrm{obs}} \mid \hat{\tau} = \tau,\ \hat{k} = k\bigr)$,

where $t^{\mathrm{obs}}$ is interpreted as a realization of the corresponding statistic $T$.

For the purpose of inference, we make an assumption on the normality of the multi-dimensional sequence $X$, namely,

$\mathrm{vec}(X) \sim \mathrm{N}\bigl(\mathrm{vec}(M),\ \Theta \otimes \Sigma\bigr)$,

where $M \in \mathbb{R}^{d \times n}$ is the mean matrix, whereas $\Theta \in \mathbb{R}^{n \times n}$ and $\Sigma \in \mathbb{R}^{d \times d}$ are covariance matrices representing the time correlation and the variable correlation structures, respectively. In practical CP detection tasks, it is often possible to obtain sequences without CPs, i.e., samples from the null hypothesis. The covariance matrices $\Theta$ and $\Sigma$ can be estimated from such samples or manually specified based on prior knowledge. Denoting the mean matrix as $M = (\mu_1, \dots, \mu_n)$, the null hypothesis in our selective inference is written as

(8)  $\mathrm{H}_0: \eta^\top M u_{\hat{\tau}} = 0$,

where $\eta$ and $u_{\hat{\tau}}$ are the contrast vectors such that $T = \eta^\top X u_{\hat{\tau}}$ (see Appendix A), and where, remember, $\hat{\tau}$ is the detected CP via a WRAG method, meaning that it is a random variable.
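The paper does not prescribe a particular estimator for these covariance matrices; as one natural possibility, the sketch below estimates the time covariance $\Theta$ by the sample covariance over CP-free control sequences. The function name and the (m, n) array layout are assumptions of ours.

```python
import numpy as np

def estimate_time_covariance(null_runs):
    """Sample estimate of the time covariance Theta.

    null_runs : (m, n) array of m independent one-dimensional sequences
    known to contain no CP (e.g., control data), one sequence per row.
    Returns the n-by-n sample covariance across the m replicates.
    """
    centered = null_runs - null_runs.mean(axis=0, keepdims=True)
    return centered.T @ centered / (null_runs.shape[0] - 1)
```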

In order to derive the sampling distribution of the selected test statistic, we use a recent seminal result in Lee et al. (2016). We first slightly generalize the key lemma in their work to handle a random matrix $X$.

Lemma 1 (Polyhedral Lemma for a Random Matrix).

Suppose that $\mathrm{vec}(X) \sim \mathrm{N}(\mathrm{vec}(M),\ \Theta \otimes \Sigma)$ with mean $M$ and covariance $\Theta \otimes \Sigma$. Let $T = \eta^\top X u$ for any $\eta \in \mathbb{R}^d$ and $u \in \mathbb{R}^n$, and let $\sigma^2 = (u^\top \Theta u)(\eta^\top \Sigma \eta)$. Then, any event represented in the form of $\{A\,\mathrm{vec}(X) \le b\}$ for a fixed matrix $A$ and a fixed vector $b$ can be written as

$\{A\,\mathrm{vec}(X) \le b\} = \{L(Z) \le T \le U(Z),\ N(Z) \ge 0\}$,

where

(9)  $L(Z) = \max_{j:\,(Ac)_j < 0} \dfrac{b_j - (AZ)_j}{(Ac)_j}, \qquad U(Z) = \min_{j:\,(Ac)_j > 0} \dfrac{b_j - (AZ)_j}{(Ac)_j}, \qquad N(Z) = \min_{j:\,(Ac)_j = 0} \bigl(b_j - (AZ)_j\bigr)$,

and $Z = \mathrm{vec}(X) - cT$ with $c = (\Theta u \otimes \Sigma \eta)/\sigma^2$. In addition, $(L(Z), U(Z), N(Z))$ is independent of $T$.

The proof of the lemma is presented in Appendix B.1. This lemma states that if the test statistic is expressed as a bi-linear function of the matrix $X$ in the form of $T = \eta^\top X u$, and the selection event can be expressed as an affine constraint in the form of $\{A\,\mathrm{vec}(X) \le b\}$, then the selected test statistic is restricted to a certain interval. This lemma is a simple extension of Lemma 5.1 in Lee et al. (2016).
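Since Lemma 1 reduces to Lemma 5.1 of Lee et al. (2016) applied to $\mathrm{vec}(X)$, the truncation points in (9) can be computed directly from $A$, $b$ and the direction $c$. Below is a minimal sketch of that computation; the function and argument names are our own.

```python
import numpy as np

def truncation_interval(A, b, vecX, c, T):
    """Truncation points of Lemma 1 (cf. Lemma 5.1 of Lee et al., 2016).

    Given the selection event {A vec(X) <= b}, the direction
    c = (Theta u kron Sigma eta) / sigma^2, and the observed statistic
    T = eta' X u, computes L(Z), U(Z), N(Z) with Z = vec(X) - c T, so
    that the event equals {L(Z) <= T <= U(Z), N(Z) >= 0}."""
    Z = vecX - c * T
    Ac, AZ = A @ c, A @ Z
    resid = b - AZ
    neg, pos, zero = Ac < 0, Ac > 0, Ac == 0   # a tolerance may be safer than == 0
    L = np.max(resid[neg] / Ac[neg]) if neg.any() else -np.inf
    U = np.min(resid[pos] / Ac[pos]) if pos.any() else np.inf
    N = np.min(resid[zero]) if zero.any() else np.inf
    return L, U, N
```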

The selection event $\{\hat{\tau} = \tau, \hat{k} = k\}$ cannot directly be written in the form of $\{A\,\mathrm{vec}(X) \le b\}$. Here, we consider the signs and the permutations of $c_t$ for each $t \in [n-1]$ as additional selection events. Let $\hat{S}_t$ be a $d$-by-$d$ diagonal matrix whose diagonal elements are the signs of $C_{jt}$, $j \in [d]$, and let $\hat{P}_t$ be a $d$-by-$d$ permutation matrix which maps $(|C_{1t}|, \dots, |C_{dt}|)^\top$ to $(|C|_{(1)t}, \dots, |C|_{(d)t})^\top$. The selection event is then formulated as

$\{\hat{\tau} = \tau,\ \hat{k} = k,\ \hat{S}_t = S_t,\ \hat{P}_t = P_t,\ t \in [n-1]\}$,

where $S_t$ and $P_t$ are interpreted as realizations of the corresponding statistics for $t \in [n-1]$.

The following theorem is the core of the selective $p$-value computation in our selective inference.

Theorem 2.

Assume that the conditions of Lemma 1 hold. Then, there exist $L$ and $U$ such that

(10)  $F_{0, \sigma^2}^{[L, U]}(T) \,\bigm|\, \{\hat{\tau} = \tau,\ \hat{k} = k,\ \hat{S}_t = S_t,\ \hat{P}_t = P_t,\ t \in [n-1]\} \sim \mathrm{Unif}(0, 1)$

under the null hypothesis (8), where $F_{0, \sigma^2}^{[L, U]}$ is the cumulative distribution function (c.d.f.) of the normal distribution $\mathrm{N}(0, \sigma^2)$ truncated to the interval $[L, U]$.

The complete proof of Theorem 2 is presented in Appendix B.2, where we show that, for any choice of aggregation function from the class in (5),

  • the selection event can be written as an affine constraint event in the form of $\{A\,\mathrm{vec}(X) \le b\}$,

  • the selected test statistic can be written as a bi-linear function of the matrix $X$ in the form of $T = \eta^\top X u$.

Then, by applying Lemma 1 and Theorem 5.2 in Lee et al. (2016), Theorem 2 can be proved.

As described in Appendix A, for any choice of the aggregation function from the class in (5), the values of $L$, $U$, and $\sigma^2$ in Theorem 2 can be computed. By using these values, the selective $p$-value can be computed as

$p = \dfrac{\Phi(U/\sigma) - \Phi(t^{\mathrm{obs}}/\sigma)}{\Phi(U/\sigma) - \Phi(L/\sigma)}$,

where $\Phi$ is the c.d.f. of the standard normal distribution.
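A minimal sketch of this computation using scipy.stats.norm; the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def selective_p_value(t_obs, L, U, sigma):
    """Selective p-value: upper-tail probability of N(0, sigma^2)
    truncated to [L, U], evaluated at the observed statistic t_obs."""
    FL, FU, FT = norm.cdf(np.array([L, U, t_obs]) / sigma)
    return (FU - FT) / (FU - FL)
```

For extreme truncation intervals the two c.d.f. differences can underflow, so a production implementation would work with norm.sf or log-scale c.d.f. values instead.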

Remark 3.

When we do not need to select $k$ from the data, e.g., for the $\ell_\infty$-, $\ell_1$- and top-$K$ aggregations, we do not need to condition on the event $\{\hat{k} = k\}$ since it is redundant for inference. For the same reason, depending on the choice of the aggregation function, we also do not need to condition on some of the signs and/or the permutations. Concrete examples of truncation points for each choice are described in Appendix A.

Remark 4.

We can establish the same result as Theorem 2 even if we do not condition on the signs and the permutations. In this case, instead of a single interval, multiple intervals must be considered for all possible choices of the permutations and the signs. However, since the number of all possible combinations of the signs and the permutations is enormous, this approach is computationally intractable.

3.3 Power Analysis

Since $g(c_t) \ge 0$ for any $t \in [n-1]$ and any WRAG function, hypothesis (8) can be viewed as $\mathrm{H}_0: \theta \le 0$, where $\theta = \eta^\top M u_{\hat{\tau}}$. In practice, since $\eta$ and $u_{\hat{\tau}}$ are determined by a WRAG method, the hypothesis is also random. Therefore, we consider an alternative hypothesis

(11)  $\mathrm{H}_1: \theta > 0$,

which is the negation of the null since our aggregated score takes only positive values. Under the alternative, the same argument as in Theorem 2 indicates that the law of $T$ conditional on the selection event is the normal distribution $\mathrm{N}(\theta, \sigma^2)$ truncated to $[L, U]$.

We now consider the power of the test in an asymptotic scenario that corresponds to the power analysis under the local alternative.

Theorem 5.

Let $q_\alpha$ be the upper $\alpha$-quantile of the null distribution. Then, under the alternative (11), the power of the test admits an almost-sure approximation expressed in terms of $q_\alpha$, the ratio $\theta/\sigma$, and the probability density function $\phi$ of the standard normal distribution.

The proof of Theorem 5 is presented in Appendix B.3. The theorem states that our selective inference procedure is an approximately unbiased test. Here, an unbiased test refers to a test whose power equals $\alpha$ at the boundary of the hypothesis, i.e., at $\theta = 0$. In addition, the test has a power of at least $\alpha$ since the second term in the last inequality is always positive. Theorem 5 also suggests that there may exist better tests in terms of power.

4 Extension to Multiple CP Detection

In this section, we extend the selective inference framework for WRAG methods so that it can detect multiple CPs. To this end, we introduce a sliding window approach. Let $W_t$ be a sliding window of length $2w$ centered at time point $t$. If we simply conducted single CP detection within each sliding window, too many CPs would be detected due to the overlaps of multiple windows. To circumvent this issue, Hao et al. (2013) considered the so-called local hypothesis testing problem. For each window $W_t$, a local hypothesis test is defined by restricting the hypotheses in (1) to the subsequence in $W_t$, with the candidate CP placed at the center $t$ of the window.

In this hypothesis test, even when the null hypothesis is rejected, unless there is a CP at the center of the window, the hypothesis itself is considered to be out of our interest. A natural estimate of the set of CPs by this approach is the set of time points $t$ at which the CP detected within $W_t$ coincides with the window center $t$.

In the context of one-dimensional CP detection problems, Yau & Zhao (2016) referred to this type of multiple CP estimate as local change point estimates. In this approach, we only need to consider the hypotheses at these candidate points, which is usually a much smaller number than the number of windows. Yau & Zhao (2016) also discussed the choice of the window size $w$, and claimed that a certain asymptotic order of $w$ would be appropriate.
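A sketch of the resulting sliding-window procedure, reusing the cusum_matrix and wrag_detect helpers from the earlier sketches; the convention that the window center corresponds to CUSUM index $w$ within each window is our assumption.

```python
import numpy as np

def local_cp_candidates(X, w, V):
    """Sliding-window candidates for multiple CPs (see the text above).

    Keeps a time point t only when the single CP detected inside the
    window W_t of length 2w centered at t coincides with the window
    center; selective inference is then applied to each kept t within
    its own window."""
    d, n = X.shape
    candidates = []
    for t in range(w, n - w):
        C = cusum_matrix(X[:, t - w:t + w])   # CUSUM scores within W_t
        tau_hat, _ = wrag_detect(C, V)        # single CP inside W_t
        if tau_hat == w:                      # CP falls at the window center
            candidates.append(t)
    return candidates
```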

5 Numerical Experiments

Here, we examine the performance of the proposed selective inference framework for WRAG methods through numerical experiments on both synthetic and real data.

5.1 Experiments on Synthetic Data

5.1.1 FPRs of selective and naive inferences

First, we confirmed whether the false positive rates (FPRs) are properly controlled in the selective inference framework for WRAG methods with the double CUSUM (DC) and top-$K$ aggregations, where $K = 1$ and $K = d$ correspond to the $\ell_\infty$- and $\ell_1$-aggregations, respectively. The synthetic data were generated from the normal model in §3.2 under the null hypothesis. We considered two choices of $\Sigma$ for simulating the cases without and with correlation among different dimensions, while a correlation parameter of $\Theta$ was changed from 0.0 to 1.0 to simulate various degrees of correlation among different time points. In addition, we also computed the FPRs of naive inference for WRAG methods, i.e., without the selection bias correction of Theorem 2. The significance level was set as $\alpha = 0.05$. In all cases, 1,000 runs with different random seeds were simulated.

Figure 2 shows the FPRs of selective inference (solid lines) and naive inference (dashed lines), where the horizontal and vertical axes indicate the correlation parameter and the estimated FPRs, respectively. We see that selective inference could control the FPRs appropriately in all cases. On the other hand, in almost all cases, naive inference failed to control the FPRs, especially when the correlation is small. (The bias of naive inference is large when the "effective" length of the sequence is large. Since the effective length decreases as the degree of correlation increases, the bias is large when the correlation is small.) Although the results in Figure 2 might suggest that naive inference could also control the FPRs properly in some settings, this was actually not the case. Under the null hypothesis, the $p$-value should be uniformly distributed between 0 and 1 (see, e.g., Section 3 in Lehmann & Romano (2006)). Figure 3 shows the distributions of (a) the selective $p$-values and (b) the naive $p$-values; the former are uniformly distributed, while the latter are not. Indeed, the Kolmogorov-Smirnov test for uniformity did not reject the selective $p$-values, but clearly rejected the naive $p$-values.

Figure 2: False positive rates (FPRs) of selective inference (solid) and naive inference (dashed). The left and right plots show the results (a) without and (b) with correlation among different dimensions, respectively. In all cases, selective inference could properly control the FPRs, while naive inference failed.

Figure 3: Histograms of (a) selective $p$-values and (b) naive $p$-values. Noting that the $p$-values should be uniformly distributed between 0 and 1 under the null hypothesis, the left plot indicates that selective inference behaves as desired, while the right plot suggests that naive inference behaves incorrectly.

5.1.2 FPRs of existing methods

As mentioned in §2, there are two types of existing CP detection methods for multi-dimensional sequences. In both, the asymptotic sampling distribution of a test statistic of the form (4) is derived under certain assumptions. Here, we see how these existing methods behave when the assumptions are violated.

First, to see the performance of the method proposed in Jirak (2015), we generated null data from the normal model in §3.2 and investigated the performance as the dimension $d$ varies from 1 to 100 (see Figure 4(a)). We observe that the FPRs increase as $d$ becomes large, i.e., when the growth condition in Jirak (2015) is violated. In contrast, the proposed selective inference (with double CUSUM aggregation) could appropriately control the FPRs at the desired significance level in the same setting (red solid line).

Next, to see the performance of the method proposed in Li et al. (2015), we generated null data with correlation among different time points, and investigated the performance as the degree of correlation varies from 0.0 to 0.3 (see Figure 4(b)). We observe that the FPRs increase as the correlation increases, i.e., when the independence assumption in Li et al. (2015) is violated. Again, the proposed selective inference could appropriately control the FPRs at the desired significance level in the same setting.

Figure 4: FPRs of two existing CP detection methods for a multi-dimensional sequence: (a) $\ell_\infty$-aggregation and (b) kernel CP detection. In (a), the FPRs of the method by Jirak (2015) are plotted against the number of dimensions. In (b), the FPRs of the method by Li et al. (2015) are plotted against the degree of correlation among different time points. In both plots, the existing methods failed to control the FPRs at the desired level when the underlying assumptions are violated, whereas the proposed selective inference framework successfully controlled the FPRs in all settings.

5.2 Application to CNV Detection

We applied the proposed selective inference framework for WRAG methods to a copy number variation (CNV) study on malignant lymphoma (Takeuchi et al., 2009). In this study, the CNVs of 46 patients diagnosed with diffuse large B-cell lymphoma (DLBCL) were investigated by an array comparative genomic hybridization (array CGH) technique (Hodgson et al., 2001). The dataset that we analyze here is represented as a real-valued multi-dimensional sequence with $d = 46$ dimensions, where each dimension indicates a patient and each time point indicates a local genomic region. It is well known that CNVs in DLBCL are heterogeneous because DLBCL has several subtypes. (Identifying and characterizing the genetic properties of disease subtypes is crucially important for precision medicine.) The goal of this medical study is to detect CNVs commonly observed in a subset of patients. Various one-dimensional CP detection methods have been used for analyzing array CGH data of a single patient (Wang et al., 2005; Tibshirani & Wang, 2008; Rapaport et al., 2008). However, there was no existing method for detecting common CPs by analyzing the CNVs of multiple patients altogether, or for providing the statistical significance of the detected CNVs.

Due to space limitations, we only present the results for Chromosome 1. We applied a WRAG method with the double CUSUM aggregation function to this dataset. For detecting multiple CPs, we used the local hypothesis testing framework described in §4, with the window size $w$ set to the integer closest to the asymptotically suggested choice. The variable covariance $\Sigma$ was set to the identity matrix because each dimension in this multi-dimensional sequence was obtained from a different patient. On the other hand, the time covariance $\Theta$ was estimated from a separate control dataset. (CNV data in array CGH analysis are obtained by comparing the copy numbers between a patient and a healthy reference person. Therefore, a control dataset without any CNVs can be easily obtained by comparing the copy numbers between two healthy reference persons.) The parameter $\varphi$ in the double CUSUM was set to 0.5 as suggested in Cho (2016).

We detected 54 CPs, and 11 of them were statistically significant in the sense that the selective $p$-value is less than 0.05. Table 2 shows the list of significant CPs. Two examples of the detected CPs are illustrated in Figure 5. Note that the numbers of selected dimensions (patients) differ among the detected CPs, which is an advantage of the double CUSUM aggregation function. Our selective inference interpretation of CP detection problems allows us to properly correct the selection bias even when the selection procedure is fairly complicated, as in double CUSUM aggregation.

  genomic region ID | gene name           | # of selected patients | selective $p$-value
  ------------------|---------------------|------------------------|--------------------
    7               | Q8N7E4              |  5                     | 0.010
   15               | NM018125            |  5                     | 0.028
   18               | CDA/KIF17           |  1                     | 0.000
   22               | PAFAH2              |  4                     | 0.000
   31               | EIF2C1              |  2                     | 0.001
   36               | NA                  |  1                     | 0.000
  106               | SH2D2A/INSRR/NTRK1  | 13                     | 0.038
  120               | C1orf9/TNFSF6       | 21                     | 0.040
  151               | RPS6KC1             |  1                     | 0.010
  162               | PSEN2               |  1                     | 0.000
  165               | DISC1               | 23                     | 0.044

Table 2: Detected copy number variations in Chromosome 1. For each detected CP, the genomic region ID of the array CGH analysis in Takeuchi et al. (2009), the known genes in the genomic region, the number of selected patients (i.e., $\hat{k}$ in the double CUSUM aggregation), and the selective $p$-value are shown.
(a) Changes are observed in patients 3, 4, 21 and 40 at the 22nd genomic region, with selective $p$-value 0.000.
(b) Changes are observed in patients 2 and 21 at the 31st genomic region, with selective $p$-value 0.001.
Figure 5: Examples of the detected CPs.

References

  • Cho (2016) Cho, Haeran. Change-point detection in panel data via double cusum statistic. Electronic Journal of Statistics, 10(2):2000–2038, 2016.
  • Cho & Fryzlewicz (2015) Cho, Haeran and Fryzlewicz, Piotr. Multiple-change-point detection for high dimensional time series via sparsified binary segmentation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(2):475–507, 2015.
  • Csörgö & Horváth (1997) Csörgö, Miklós and Horváth, Lajos. Limit theorems in change-point analysis, volume 18. John Wiley & Sons Inc, 1997.
  • Fithian et al. (2014) Fithian, William, Sun, Dennis, and Taylor, Jonathan. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.
  • Fryzlewicz et al. (2014) Fryzlewicz, Piotr et al. Wild binary segmentation for multiple change-point detection. The Annals of Statistics, 42(6):2243–2281, 2014.
  • Hao et al. (2013) Hao, Ning, Niu, Yue Selena, and Zhang, Heping. Multiple change-point detection via a screening and ranking algorithm. Statistica Sinica, 23(4):1553, 2013.
  • Harchaoui et al. (2009) Harchaoui, Zaid, Moulines, Eric, and Bach, Francis R. Kernel change-point analysis. In Advances in neural information processing systems, pp. 609–616, 2009.
  • Hodgson et al. (2001) Hodgson, Graeme, Hager, Jeffrey H, Volik, Stas, Hariono, Sujatmi, Wernick, Meredith, Moore, Dan, Albertson, Donna G, Pinkel, Daniel, Collins, Colin, Hanahan, Douglas, et al. Genome scanning with array cgh delineates regional alterations in mouse islet carcinomas. Nature genetics, 29(4):459, 2001.
  • Hyun et al. (2016) Hyun, Sangwon, G’Sell, Max, and Tibshirani, Ryan J. Exact post-selection inference for changepoint detection and other generalized lasso problems. arXiv preprint arXiv:1606.03552, 2016.
  • Jirak (2015) Jirak, Moritz. Uniform change point tests in high dimension. The Annals of Statistics, 43(6):2451–2483, 2015.
  • Lee et al. (2016) Lee, Jason D, Sun, Dennis L, Sun, Yuekai, and Taylor, Jonathan E. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.
  • Lee et al. (2003) Lee, Sangyeol, Ha, Jeongcheol, Na, Okyoung, and Na, Seongryong. The cusum test for parameter change in time series models. Scandinavian Journal of Statistics, 30(4):781–796, 2003.
  • Lehmann & Romano (2006) Lehmann, Erich L and Romano, Joseph P. Testing statistical hypotheses. Springer Science & Business Media, 2006.
  • Li et al. (2015) Li, Shuang, Xie, Yao, Dai, Hanjun, and Song, Le. M-statistic for kernel change-point detection. In Advances in Neural Information Processing Systems, pp. 3366–3374, 2015.
  • Page (1954) Page, E. S. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954.
  • Phillips (1987) Phillips, Peter CB. Time series regression with a unit root. Econometrica, 55(2):277–301, 1987.
  • Rapaport et al. (2008) Rapaport, Franck, Barillot, Emmanuel, and Vert, Jean-Philippe. Classification of arraycgh data using fused svm. Bioinformatics, 24(13):i375–i382, 2008.
  • Shao & Zhang (2010) Shao, Xiaofeng and Zhang, Xianyang. Testing for change points in time series. Journal of the American Statistical Association, 105(491):1228–1240, 2010.
  • Suzumura et al. (2017) Suzumura, Shinya, Nakagawa, Kazuya, Umezu, Yuta, Tsuda, Koji, and Takeuchi, Ichiro. Selective inference for sparse high-order interaction models. In International Conference on Machine Learning, pp. 3338–3347, 2017.
  • Takeuchi et al. (2009) Takeuchi, Ichiro, Tagawa, Hiroyuki, Tsujikawa, Akira, Nakagawa, Masao, Katayama-Suguro, Miyuki, Guo, Ying, and Seto, Masao. The potential of copy number gains and losses, detected by array-based comparative genomic hybridization, for computational differential diagnosis of b-cell lymphomas and genetic regions involved in lymphomagenesis. haematologica, 94(1):61–69, 2009.
  • Taylor & Tibshirani (2015) Taylor, Jonathan and Tibshirani, Robert J. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.
  • Tian & Taylor (2017) Tian, Xiaoying and Taylor, Jonathan. Asymptotics of selective inference. Scandinavian Journal of Statistics, 44(2):480–499, 2017.
  • Tibshirani & Wang (2008) Tibshirani, Robert and Wang, Pei. Spatial smoothing and hot spot detection for cgh data using the fused lasso. Biostatistics, 9(1):18–29, 2008.
  • Tibshirani et al. (2005) Tibshirani, Robert, Saunders, Michael, Rosset, Saharon, Zhu, Ji, and Knight, Keith. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
  • Wang et al. (2005) Wang, Pei, Kim, Young, Pollack, Jonathan, Narasimhan, Balasubramanian, and Tibshirani, Robert. A method for calling gains and losses in array cgh data. Biostatistics, 6(1):45–58, 2005.
  • Wang & Samworth (2016) Wang, Tengyao and Samworth, Richard J. High-dimensional changepoint estimation via sparse projection. arXiv preprint arXiv:1606.06246, 2016.
  • Yang et al. (2016) Yang, Fan, Barber, Rina Foygel, Jain, Prateek, and Lafferty, John. Selective inference for group-sparse linear models. In Advances in Neural Information Processing Systems, pp. 2469–2477, 2016.
  • Yao (1988) Yao, Yi-Ching. Estimating the number of change-points via Schwarz' criterion. Statistics & Probability Letters, 6(3):181–189, 1988.
  • Yau & Zhao (2016) Yau, Chun Yip and Zhao, Zifeng. Inference for multiple change points in time series via likelihood ratio scan statistics. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(4):895–916, 2016.

Appendix A Example of Truncation Points

In this section, we evaluate the truncation points $L$ and $U$ in Theorem 2 for several aggregation functions described in §3, i.e., the general WRAG, $\ell_\infty$-aggregation, $\ell_1$-aggregation and top-$K$-aggregation functions.

A.1 General WRAG

To derive the truncation points $L$ and $U$ in Theorem 2, we first show that the selected test statistic can be expressed as a bi-linear form of $X$. By the definition of the multi-variate CUSUM score, we have $c_t = X u_t$, where $u_t \in \mathbb{R}^n$ is the vector that maps the sequence to its CUSUM contrast at time point $t$. Let $\hat{S}_\tau$ be a $d$-by-$d$ diagonal matrix whose diagonal elements are the signs of $C_{j\tau}$, $j \in [d]$. Then, by the definition of $|C|_{(k)\tau}$, there exists a $d$-by-$d$ permutation matrix $\hat{P}_\tau$ such that $(|C|_{(1)\tau}, \dots, |C|_{(d)\tau})^\top = \hat{P}_\tau \hat{S}_\tau c_\tau$. Combining all the above, (7) can be reduced to the bi-linear form $T = \eta^\top X u_\tau$, where $\eta = \hat{S}_\tau \hat{P}_\tau^\top v_{\hat{k}}$ with $v_{\hat{k}} = (v_{\hat{k}1}, \dots, v_{\hat{k}d})^\top$. Then the selection event can be expressed as an affine constraint with respect to $\mathrm{vec}(X)$ (see Section B.2). Precisely, let $D$ be the first-order difference matrix, that is,

$D = \begin{pmatrix} 1 & -1 & & \\ & \ddots & \ddots & \\ & & 1 & -1 \end{pmatrix} \in \mathbb{R}^{(d-1) \times d}.$

Then the event that the ordering of $(|C|_{(1)t}, \dots, |C|_{(d)t})^\top$ is preserved reduces to the intersection of the affine constraints $D \hat{P}_t \hat{S}_t c_t \ge 0$ for all $t \in [n-1]$.

To derive the truncation points, let $Z$ and $c$ be defined as in Lemma 1. Since the test statistic $T$ itself is non-negative, Lemma 1 implies, by simple calculations, that the selection event reduces to three families of constraints, which we label (12a), (12b) and (12c). First, it holds that (12a) yields a set of linear bounds on $T$ in terms of $Z$. In addition, (12b) can be reduced to a second set of such bounds. Finally, the remaining constraints in (12c) imply a third set. Combining all the above, the lower truncation point $L$ can be obtained as the maximum of the resulting lower bounds. Similarly, by using the fact that the non-negativity constraint does not affect the upper truncation point, $U$ can be obtained as the minimum of the resulting upper bounds.

A.2 $\ell_\infty$-aggregation

Recall that the aggregation function of the $\ell_\infty$-aggregation score is expressed by $g(c_t) = \max_{j \in [d]} |C_{jt}|$. Let $\eta = s_{j^\ast} e_{j^\ast}$, where $s_{j^\ast}$ is the sign of $C_{j^\ast \tau}$ and $e_j$ is a $d$-dimensional unit vector whose $j$-th element is one. Then we see that $T = \eta^\top X u_\tau$, where $u_\tau$ is the $n$-dimensional vector defined in Appendix A.1. In $\ell_\infty$-aggregation, we consider the event that $j^\ast$ is a maximizer of $|C_{j\tau}|$ as a selection event. In this case, the constraint on the sign of $C_{j^\ast\tau}$ is equivalent to the non-negativity of the test statistic. Hence the selection event can be expressed as the intersection of the maximizer event and the non-negativity event of the test statistic.

Note that the former event in the above expression can be rewritten by