Chromosome copy number variant (CNV) is a type of structural variation with abnormal copy number changes involving DNA fragments (Freeman et al., 2006; Feuk et al., 2006). CNVs result in gains or losses of the genome, therefore interfering downstream functions of the DNA contents. Accounting for a substantial amount of genetic variation, CNVs are considered to be a risk factor for human diseases. Over the past decade, advances in genomic technologies have revealed that CNVs underlie many human diseases, including autism (Pinto et al., 2010), cancer (Fanale et al., 2013), schizophrenia (Castellani et al., 2014), and major depressive disorder (O’Dushlaine et al., 2014). It is fundamental to develop fast and accurate CNV detection tools.
A variety of statistical tools have been developed to discover structural changes in CNV data during last 20 years. Popular algorithms include circular binary segmentation (Olshen et al., 2004), the fused LASSO (Tibshirani and Wang, 2008), likelihood ratio selection (Jeng et al., 2010), and screening and ranking algorithm (Niu and Zhang, 2012). Some other change-point detection tools such as wild binary segmentation (Fryzlewicz, 2014)
and simultaneous multiscale changepoint estimator(Frick et al., 2014) can be also applied to CNV data. See Niu et al. (2016)
for a recent review on modern change-point analysis techniques. A majority of existing methods are based on Gaussian assumption, although quantile normalization(Xiao et al., 2014) or local median transformation (Cai et al., 2012) can be used for normalization. The computational complexity is also of concern for some of the existing methods as the modern technologies produce extraordinarily big data. In spite of some fast algorithms (Wang et al., 2007), few algorithms are known to possess both computational efficiency and solid theoretical foundation. Moreover, with only a few exceptions (Hao et al., 2013; Frick et al., 2014), the existing methods focus on detection whereas not offering statistical inference. For these reasons, we develop a fast nonparametric method for CNV detection with theoretical foundation and the opportunity of conducting statistical inference.
In this paper, we model the CNVs as short segments with nonzero height parameters, which are sparsely hidden in a long sequence. The goal is to identify those segments with high probability and, moreover, to assess the significance levels for detected segments. In particular, we propose a scalable nonparametric algorithm for short segment detection. It depends on only the ranks of the absolute values of the measurements and hence requires minimal assumptions on the noise distribution. A short segment may be present when there are a large enough number of observations exceeding a certain threshold on a short segment; for instance, 8 on a segment of 10 observations are larger than the 99th percentile of the data. Following this idea, we implement a super scalable short segment (4S) detection algorithm to cluster the points to form a segment when such a phenomenon occurs. The advantages of our method are fourfold. First, this nonparametric method requires minimal assumption on the noise distribution. Second, it is super fast as the core algorithm requires onlyoperations to analyze a sequence of measurements. In particular, it takes less than 2 seconds for our R codes to analyze 272 sequences with a range of about 34,000 measurements. Third, we establish a non-asymptotic theory to ensure the detection of all signal segments. Last but not least, our method can compute the significance level for each detected segment and offer a convenient approach to statistical inference.
2.1 Notations and the main idea
be a sequence of random variables such that
where the height parametervector is sparse and are random noises with median 0. Moreover, we assume that the nonzero entries of are supported in the union of disjoint intervals so that
Here we assume that and for all . Note that such a representation of is unique and used throughout this paper for or its estimator . For convenience and without confusion, we use the interval to imply the set of integers when referred to a set of locations. We call those intervals segments. In particular, a signal segment is a segment where the height parameter is a nonzero constant. Let denote a subvector of restricted to . We denote by the cardinality of a set . In particular, for . and denote vectors, and , respectively.
Naturally, a primary goal for model (2.1) is to identify the set of signal segments . Moreover, while rarely done, it is useful to assign a significance level for each of the detected segments. In this paper, we will study both estimation and related inference problems on segment detection. Our strategy is to cluster “putative locations” using spatial information. In particular, we consider the set of positions , where the observations exceed a threshold . Intuitively, for a properly chosen and some segment , if is big enough compared with , it is likely that .
To illustrate our idea, we consider a game of ball painting. Suppose that we start with white balls in a row, and paint the th ball with black color if . Let be the total number of black balls, which is much smaller than . If we observe that there are a few black balls crowded in a short segment, e.g., ‘segment 1’ illustrated in Figure 1, it is plausible that the height parameter is not zero in the segment. Our proposed algorithm can easily identify those segments. Also it may happen that in a neighborhood there is only a single black ball, e.g., ‘segment 2’ in Figure 1. Then, we may not have strong evidence against . To put this intuition into a sound theoretical framework, it is imperative to evaluate the significance for each pattern. In fact, given the numbers of white and black balls in a short segment, we may calculate how likely a certain pattern appears in a sequence of length with black balls, when white and black balls are actually randomly placed. We will develop a framework of inference based on this idea in Section 2.3.
To estimate , we propose a super scalable short segment (4S) detection algorithm which is described as follows.
Step 1: thresholding. Define . That is, we collect the positions where the observations exceed a threshold.
Step 2: completion. Construct the completion set by the criterion that if and only if there exist , , such that . That is, we add the whole segment into the completion set if the gap between , is small enough.
Write where with . Note that this decomposition is unique.
Step 3: clean up. We delete from if , and obtain our final estimator . That is, we ignore the segments that are too short to be considered.
The whole procedure depends on three parameters , , and . The choice of is crucial and depends on applications. and are relatively more flexible as we can screen false positives using significance levels defined later. We may ignore the subscripts and simply refer to , and when the sets obtained from the three steps above corresponding to , and are clear in the context. Figure 2 illustrates our procedure, where the location set obtained in each step is indicated by the positions of black balls.
2.3 Theory: consistency and inference
Our goal is to identify the set of signal segments with a false positive control. Here we say that is identified by an estimator if there is a unique , such that , and for all and . Such an is a true positive. We define that is identified by an estimator if every is identified by . That is, there is a one-to-one correspondence between and a subset of , and the pairs under this correspondence are the only pairs with nonempty interaction among all segments in and . See Figure 3 for an illustration of the definition.
In our three-step procedure, the first two steps establish an estimator and the last one aims to delete obvious false positives. Our theory proceeds in two main steps. First, we characterize the non-asymptotic probability that the first two steps produce an estimator which successfully identifies . In order to identify , we should ensure two conditions. Condition one is that, after step 1, the black balls are dense enough on so that they do not split into two or more segments in step 2. Condition two is that, in the gap between and , the black balls are sparse enough so that the black balls on and do not connect to a big segment. Theorem 1 addresses how to bound the probabilities of these two conditions for all .
Second, we develop a framework of inference to control false positives. As a rough control, Theorem 2 gives an upper bound for the expected number of false positives if all segments of length one are deleted in step 3. In general, after step 2, it is not optimal to decide the likelihood of a detected segment being a false positive only by its length. Therefore, for each segment in , we check its original color pattern back in step 1 and calculate a
-value of this pattern under null hypothesis. This assigns a significance level for each detected segment which helps control false positive. It is difficult to find the exact -values. Lemma 2 offers a reasonable approximation.
To facilitate theoretical analysis, we assume that, in this subsection, are independent and identically distributed (IID) noises with median 0. Moreover, has a continuous density function that is symmetric with respect to 0. Under this assumption, the black balls are randomly distributed for arbitrary threshold when .
Now we investigate when a signal segment can be detected by our algorithm. Let
be the cumulative distribution function of the noise density. We use to denote the -percentile. Suppose that there is a segment such that , and , where is a segment containing such that is the union of two segments both of which are of length . Without loss of generality, we assume that i.e. . For a threshold , let us continue our game of ball painting and focus on this segment and its neighborhood. Recall that we paint the ball at position with black color if and only if . The following two events together can ensure that the segment is identified by our method.
Event ensures that this segment can be detected as a whole segment while event controls the total length of the detected segment and makes sure that the detected segments are separated from each other. and together guarantee that our algorithm identifies a segment such that , and for any other signal segment . The following lemma gives non-asymptotic bounds for and .
Let be two segments such that and , . is the union of two segments both of which are of length . Let . For and , after the thresholding and completion steps, we have
Let be the minimal signal strength among all ’s, , be the minimal and maximal lengths of signal segments, respectively, and be the minimal gap between two signal segments. Define so . Let . Taking into account all signal segments in , the theorem below gives a lower probability bound for identifying after first two steps.
With , and , can identify all signal segments in with probability at least
The probability (2.2) goes to 1 asymptotically if and as .
Although Theorem 1 gives a theoretical guarantee to recover all signal segments with a large probability, there are some false positives. As an ad-hoc way, we may take or 3 to eliminate some obvious false positives. This clean up step is simple and helpful to delete isolated black balls. The Theorem below gives an upper bound on the number of false positives with a conservative choice .
Assume and . Then with .
The expected number of false positive segments can be well controlled if both and are small. Next, we illustrate how to access the significance levels for the detected segments by our method, which is helpful to control false positives. Recall our estimator where . For each , let , , and . Now consider balls in a row with black and white balls. Let be an event that there exists a segment of length where at least balls are black, in a sequence of balls with blacks ones. The -value of can be defined as the probability of if the balls are randomly placed. This -value can effectively control the false positives. However, it is challenging to find the exact formula to calculate the -value. The lemma below gives an upper bound of .
where follows a hypergeometric distribution with total population size
follows a hypergeometric distribution with total population size, number of success states , and number of draws .
This approximated -value is useful to eliminate false positives.
Our proposed method is nonparametric and depends on only the rank of absolute measurements . For a fixed triplet , it typically needs less than operations to determine when is small, say, less than 0.1. We need operations to compare each measurement with the threshold to determine . Let be a vector of locations in in an ascending order. In the completion step, we compare to a threshold. In particular, we declare that and belong to different segments if and only if . Let ,…, be those indices such as . consists of segments ,…,. We record the start and end points of each segment only. In the deletion step, we delete if its length is not greater than . The total operations can be controlled within .
The choice of threshold is crucial to the 4S algorithm, and may need to be determined on a case-by-case basis. Here we offer a general guideline for parameter selection. Recall that for the Gaussian model, the signal strength of a segment with length and height is usually measured by ; see, e.g. Table 1 in Niu et al. (2016). If there are two segments with the same overall signal strength , however, one with large and small (say, type A), and another one with small and large (say, type B), then it is usually not equally easy to detect both of them by an algorithm of complex . Indeed, for many segment detection algorithms, it is tricky to balance the powers to detect these two types of segments. Intuitively, the threshold parameter controls this tradeoff in our methods. A higher threshold may be more powerful in detecting type A segments but less powerful in detecting type B segments; and vice versa. In practice, we may choose the threshold as a certain sample percentile of the absolute values of the observations based on a pre-specified preference. For example, if we know the signal segments have relatively large height parameters but can be as short as 5 data points, then with a fixed , we can find largest such as and the percentile is chosen as . That is, we want to guarantee that a segment of 5 consecutive black balls is significant enough to stand out. In another scenario, our preference might be longer segments with possibly lower heights. Then we may choose a threshold to include segments of length 10 with at least 6 black balls. In general, such (or ) can be easily determined given , , and , by solving . We illustrate in Figure 4 the relationship between and selected percentile for , , and 0.1. Because our main goal in this paper is to identify short segments, we prefer a large threshold such as 95th sample percentile. An even larger threshold can be used to identify shorter segments, and a smaller threshold can be used for detecting longer segments with lower heights.
3 Numerical Studies
3.1 Simulated data
We use simulation studies to evaluate the performance of our method in terms of the average number of true positives (TP) and false positives (FP) for identifying signal segments. Recall that in our definition, a detected segment is a true positive, if it interacts with only one signal segment , and it is the only one in that interacts .
In Example 1, we show the effectiveness of our inference framework on the false positive control of the 4S algorithm by a null model. As suggested by Figure 4, we choose the 95th percentile of absolute values of the observations as the threshold . We set and . We use various -value thresholds for false positive control and compare them with a vanilla version of 4S, which is the one without -value control.
Example 1 (Null Model)
We generate a sequence based on model (2.1) with and . We consider three scenarios for the error distributions. In the first two scenarios, we consider which are IID from and , respectively. In the last scenario, we consider which are marginally and jointly from an autoregressive (AR) model with autocorrelation 0.2.
As in this example, all detected segments are FPs. We report the average FPs for three versions of 4S (Vanilla, and ) based on 100 replicates in Table 1. We see that our inference framework can effectively control the number of FPs.
|4S (Vanilla)||4S (p=0.05)||4S (p=0.1)|
In Example 2, we compare 4S with three algorithms CBS (Olshen et al., 2004), LRS (Jeng et al., 2010) and WBS (Fryzlewicz, 2014). The CBS and WBS methods, implemented by R packages DNAcopy and wbs respectively, give a segmentation of the sequence which consists of a set of all segments rather than only the signal segments. In order to include their results for comparison, we ignore the long segments (with length greater than 100) detected by CBS or WBS, which decreases their false positives. For LRS, we set the maximum length of signal segments as 50.
We generate a sequence based on model (2.1) with . There are 5 signal segments with lengths 8, 16, 24, 32 and 40 respectively. We use the same error distributions as in Example 1. We consider two levels of height parameter for different signal strengths. In particular, we set height as the 99-, and 97-th percentiles of the marginal error distribution in two scenarios, labeled by S1 and S2. For the standard normal error, the height values are 2.326 and 1.881, respectively.
The threshold we used for 4S is the 95-th sample percentile of absolute values, that is around the 97.5-th percentile of the error distribution, e.g., around 1.96 for the Gaussian case. Therefore, the true height is greater than in S1, but lower in S2. Average numbers of TPs and FPs are reported in Tables 2 and 3.
|S1||4S (p=0.05)||4S (p=0.1)||4S (p=0.5)||CBS||LRS||WBS|
|S2||4S (p=0.05)||4S (p=0.1)||4S (p=0.5)||CBS||LRS||WBS|
|S1||4S (p=0.05)||4S (p=0.1)||4S (p=0.5)||CBS||LRS||WBS|
|S2||4S (p=0.05)||4S (p=0.1)||4S (p=0.5)||CBS||LRS||WBS|
Overall, CBS performs the best for the IID Gaussian case, but suffers from a low power in the heavy-tail case, and high FPs in the correlated case. LRS and WBS perform reasonable well with slightly high FPs in the heavy-tail case. 4S methods are more robust against the error type. When the noise is Gaussian and the signal strength is weak, it is slightly less powerful than the methods based on the Gaussian assumption. In terms of computation time (Table 4), 4S is about 100 times faster than all other methods.
3.2 Real data example
We applied the 4S method to the 272 individuals from HapMap project. In particular, we tried 4S (with -value thresholds 0.05 and 0.5) to the LRR sequence of chromosome 1, which consists of 33991 measurements for each subject. We compared 4S with CBS, which has been a benchmark method in CNV detection. Note that CBS produces a segmentation of the sequence rather than the CNV segments directly. Therefore, we focused on only the short (less than 100 data points) segments detected by CBS because the long segments had means close to zero and are not likely to be CNVs. We found that most of these short segments are separated. But a very small portion of them are connected as CBS sometimes tends to over segment the sequence. Therefore, we merged two short segments detected by CBS if they are next to each other.
The 4S algorithm is extremely fast. It took less than 2 seconds (on a desktop with CPU 3.6 GHz Intel Core i7 and 16GB memory) to complete 272 sequences with -values calculated for all detected segments. CBS algorithm is reasonably fast, but much slower than our algorithm. In Table 5 we list the total number of detected CNVs, average length of CNVs and computation time for all methods.
|4S (p=0.05)||4S (p=0.5)||CBS|
|Number of CNVs||2832||3141||2962|
Overall, the segment detection results were very similar. We further compared the segments detected by two algorithms, i.e., 4S with threshold and CBS. We found that 2753 segments are in common. Here by a common segment we mean a pair of segments, one detected from each algorithm, such that they overlap to each other but do not overlap with other detected segments. Among these common segments, we calculated a similarity measure, called affinity in Arias-Castro et al. (2005), defined as follows.
if two segments and are the same and if they do not overlap. We found that the average value of this similarity measure is 0.9290 among 2753 pairs.
Figure 5 presents the histogram of affinity among 2753 pairs of commonly detected segments by 4S and CBS. We can see that 87.76% of those pairs have affinity values larger than 0.8. We further divided the detected segments into three groups: those detected by both methods (group 1); those detected by only 4S method with (group 2); those detected by only CBS (group 3). For each detected segment, we calculated its length and the sample mean of the measurements on the segment. Figure 6 displays the scatter plots of sample means versus lengths for all the segments in three groups. Most segments in group 1 carries relatively strong signals. So it is not surprised that they were detected by both algorithms. The groups 2 and 3 have much smaller sizes than group 1. In particular, we found that most segments in group 3 (i.e. those detected by only CBS) are very short, consisting of only 2 or 3 data points. Those segments are not significant in our inference framework unless we set a very high threshold
in step 1. Some segments in group 2 (i.e. those detected by only the 4S method) have relatively small sample mean values, which explains why they were not detected by CBS. Some of these segments might be true positives with the sample mean affected by outliers. Overall, the 4S and CBS methods gave similar results. The segments detected by only one method may be prone to false positives, or true positives have weak signal strengths.
We proposed a scalable nonparametric algorithm for segment detection, and applied it to real data for CNV detection. Two main advantages of the 4S algorithm are its computational efficiency and independence of the normal error assumption. We introduced an inference framework to assign significance levels to all detected segments. Our numerical studies demonstrated that our algorithm was much faster than CBS and performed similarly to CBS under the normality assumption and better when the normality assumption was violated. Although our inference framework depended on the assumption of IID noise, our numerical experiments suggested that our algorithm worked well under weakly correlated noises. Hence, the proposed method is faster and more robust against non-normal noises than CBS. Overall, the 4S algorithm is a safe and fast alternative to CBS, which has been a benchmark method in CNV studies.
In the literature, there are two popular classes of change-point models used to study CNV related problems. The first one assumes only a piecewise constant median/mean structure. The second one assumes, in addition, a baseline, which reflects the background information or normal status of the data. Quite often, it assumes that the abnormal part, called signal segments in our paper, are sparse. For the first approach, the goal is to identify the change points. In contrast, the second approach emphasizes more on segment detection rather than change-point detection. The difference is subtle for estimation but might become remarkable for inference. For example, it is technically difficult to define ‘true positive’ in the context of change-point detection (Hao et al., 2013)
. But it is easier to define related concepts for segment detection as we did in this paper. Roughly speaking, the first approach is more general, and the second one is more specific and suitable to model certain CNV data, e.g., SNP array data. In particular, the 4S algorithm aims to solve change-point models in the second class. It can be applied to any data sequence when there is a baseline. When the baseline mean/median is unknown, we suggest that the data should be centered first by the estimated mean/median. Our method can not be applied to data when a baseline does not exist. Besides change-point models, there are other approaches to study CNV such as hidden Markov model(Wang et al., 2007). Due to the space limit, we restricted our comparison to the methods based on change-point models and implemented by R packages.
Most segment detection algorithms involve one or more tuning parameters, whose values are critical to the results. In the study of segment detection, there are two trade-offs that researchers should consider in choosing algorithms as well as their parameters. The first one is the usual type I/type II errors trade-off, which might be tricky sometimes but well-known. The second one is more delicate and quite unique. For a signal segment, both its height and length determine the signal strength. Therefore, segments with weak but detectable signals can be roughly divided into two categories, the ones with small length (say, type A) and the ones with small height (say, type B). Typically a method may detect type A segments more powerfully, but type B segments less powerfully, than the other method. For the proposed 4S algorithm, a choice of a larger threshold parameter in step 1 makes the algorithm more powerful in detecting short and high signal segments (type A), and vice versa. The 4S algorithm can be easily tuned to maximize the power in detecting of a certain type of the signal segments. We may also try different thresholding levels in data analysis in order to detect different types of segments. In general, the choice of the parameters depends on the research goals and balance of two trade-offs mentioned above.
There are various platforms and technologies which produce data for CNV detection. Besides the SNP array data studied in this work, read depth data from next generation sequencing (NGS) technologies are often used in CNV studies. As one referee pointed out, the speed of 4S algorithm would be an advantage when applied to read depth data from whole genome sequencing. This is a wonderful research direction that we will investigate next.
An R package SSSS implementing our proposed method can be download via
The authors are partially supported by National Science Foundation, National Institutes of Health, Simons Foundation and University of Arizona faculty seed grant.
Proof of Lemma 1.
Let be the interval of integers with . For each , , the probability that the ball at is white is as and is symmetric. It is trivial to bound for the case as . Now let us consider the case . Let , be the event that the first segment of consecutive white balls starts from position . Then
Therefore, and . Note that is a constant depending on , and , so a sharper bound than for may be used to bound if more information is available.
Let us consider the segment on the right hand side of . implies that there is at least one black ball in each of the segments , , etc. Note that on these segments so the probability of white ball at is . Consider all segments of length on the right side of . The probability that all these segments contain at least one black ball is . Therefore, .
Proof of Theorem 1.
For , let and be the gap between and . Define
Note that all segments in are identified under event . By Lemma 1, or , which can be bounded by . Moreover, . The conclusion follows Bonferroni inequality.
Proof of Theorem 2.
Let be the locations of black balls after step 1. Note that and will be connected in step 2 if and only if . We aims to count the number of segments with at least 2 consecutive black balls after step 2, as all isolated black balls will be eliminated in step 3. Such a segment starts at only if . So the total number of such segments is at most . Let
follow Bernoulli distribution withif and only if for . When , all black balls are randomly distributed. , i.e., the probability that all balls are white in next positions following is . Therefore, .
Proof of Lemma 2.
We drop the subscript in as it is irrelevant in our derivation below. Under the assumption that black balls are randomly assigned in position, at a position of black ball, we calculate the probability that there are at least black balls in next positions. Let be the count of black balls in those positions. follows a hypergeometric distribution with total population size , number of success states , and number of draws . Therefore, as there are black balls.
- Near-optimal detection of geometric objects by fast multiscale methods. IEEE Transactions on Information Theory 51 (7), pp. 2402–2425. Cited by: §3.2.
- Robust detection and identification of sparse segments in ultrahigh dimensional data analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74 (5), pp. 773–797. Cited by: §1.
- Copy number variation distribution in six monozygotic twin pairs discordant for schizophrenia. Twin Research and Human Genetics 17 (02), pp. 108–120. Cited by: §1.
- Analysis of germline gene copy number variants of patients with sporadic pancreatic adenocarcinoma reveals specific variations. Oncology 85 (5), pp. 306–311. Cited by: §1.
- Structural variation in the human genome. Nature Reviews Genetics 7 (2), pp. 85–97. Cited by: §1.
- Copy number variation: new insights in genome diversity. Genome research 16 (8), pp. 949–961. Cited by: §1.
- Multiscale change point inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (3), pp. 495–580. Cited by: §1.
- Wild binary segmentation for multiple change-point detection. The Annals of Statistics 42 (6), pp. 2243–2281. Cited by: §1, §3.1.
- Multiple change-point detection via a screening and ranking algorithm. Statistica Sinica 23, pp. 1553–1572. Cited by: §1, §4.
- Optimal sparse segment identification with application in copy number variation analysis. Journal of the American Statistical Association 105 (491), pp. 1156–1166. Cited by: §1, §3.1.
- The screening and ranking algorithm to detect DNA copy number variations. The Annals of Applied Statistics 6 (3), pp. 1306–1326. Cited by: §1.
- Multiple change-point detection: a selective overview. Statistical Science 31 (4), pp. 611–623. External Links: Cited by: §1, §2.4.
- Rare copy number variation in treatment-resistant major depressive disorder. Biological psychiatry 76 (7), pp. 536–541. Cited by: §1.
- Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5 (4), pp. 557–572. Cited by: §1, §3.1.
- Functional impact of global rare copy number variation in autism spectrum disorders. Nature 466 (7304), pp. 368–372. Cited by: §1.
- Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics 9 (1), pp. 18–29. Cited by: §1.
- PennCNV: an integrated hidden markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome research 17 (11), pp. 1665–1674. Cited by: §1, §4.
- Modified screening and ranking algorithm for copy number variation detection. Bioinformatics, pp. btu850. Cited by: §1.