1 Introduction
Finding the place (or time) where most or all of a set of one-dimensional signals (or profiles) jointly change in some specific way is an important question in several fields. A common situation is when we want to find change-points in a multidimensional signal, e.g., in audio and image processing [1, 2], to detect intrusion in computer networks [3, 4], or in financial and economic time series analysis [5]. Another important situation is when we are confronted with several one-dimensional signals which we believe share common change-points, e.g., genomic profiles of a set of patients. The latter application is increasingly important in biology and medicine, in particular for the detection of copy-number variation along the genome [6], or the analysis of microarray and genetic linkage studies [7]. The common thread in biological applications is the search for data patterns shared by a set of individuals, such as cancer patients, at precise places on the genome; in particular, sudden changes in measured values. As opposed to the segmentation of multidimensional signals such as speech, where the dimension is fixed and collecting more data means having longer profiles, the length of signals in genomic studies (i.e., the number of probes measured along the genome) is fixed for a given technology, while the number of signals (i.e., the number of individuals) can increase as we collect data about more patients. From a statistical point of view, it is therefore of interest to develop methods for the identification of multiple change-points shared by several signals that can benefit from increasing the number of signals.
There exists a vast literature on the change-point detection problem [8, 9]. Here we focus on computationally efficient methods to segment a multidimensional signal by approximating it with a piecewise-constant one, using quadratic error criteria. It is well-known that, in this case, the optimal segmentation of a $p$-dimensional signal of length $n$ into $k$ segments can be obtained by dynamic programming [10, 11, 12]. However, the quadratic complexity in $n$ of this approach is prohibitive in applications such as genomics, where $n$ can reach millions or more with current technology. An alternative to such global procedures, which estimate change-points as solutions of a global optimization problem, are fast local procedures such as binary segmentation [13], which detect breakpoints by iteratively applying a method for single change-point detection to the segments obtained after the previous change-point is detected. While such recursive methods can be extremely fast, in the order of $O(n \log n)$ when the single change-point detector is $O(n)$, the quality of the segmentation is questionable when compared with global procedures [14].

For $p = 1$ (a single signal), an interesting alternative to these global and local procedures is to express the optimal segmentation as the solution of a convex optimization problem, using the (convex) total variation instead of the (non-convex) number of jumps to penalize a piecewise-constant function approximating the original signal [15, 16]. The resulting piecewise-constant approximation of the signal, defined as the global minimum of the objective function, benefits from theoretical guarantees in terms of correctly detecting change-points [17, 18], and can be implemented efficiently in $O(nk)$ or $O(n \ln n)$ [19, 17, 20].
In this paper we propose an extension of total-variation based methods for single signals to the multidimensional setting, in order to approximate a multidimensional signal by a piecewise-constant signal with multiple shared change-points. We define the approximation as the solution of a convex optimization problem which involves a quadratic approximation error penalized by the sum of the Euclidean norms of the multidimensional increments of the signal. The problem can be reformulated as a group Lasso [21], which we show how to solve exactly and efficiently. Alternatively, we provide an approximate yet often computationally faster solution to the problem using a group LARS procedure [21]. In the latter case, exploiting the particular structure of the design matrix, we can find the first $k$ change-points in $O(npk)$, thus extending the method of [17] to the multidimensional setting.
Unlike most previous theoretical investigations of change-point methods (e.g., [17, 18]), we are not interested in the case where the dimension $p$ is fixed and the length $n$ of the profiles increases, but in the opposite situation where $n$ is fixed and $p$ increases. Indeed, this corresponds to the case in genomics where, for example, $n$ would be the fixed number of probes used to measure a signal along the genome, and $p$ the number of samples or patients analyzed. We want to design a method that benefits from increasing $p$ in order to identify shared change-points, even though the signal-to-noise ratio may be very low within each signal. As a first step towards this question, we give conditions under which our method is able to consistently identify a single change-point as $p$ increases. We also show by simulation that the method is able to correctly identify multiple change-points as $p \to +\infty$, validating its relevance in practical settings.
The paper is organized as follows. After fixing notation in Section 2, we present the group fused Lasso method in Section 3. We propose two efficient algorithms to solve it in Section 4, and discuss its theoretical properties in Section 5. Lastly, we provide an empirical evaluation of the method and a comparison with other methods in the study of copy number variations in cancer in Section 6. A preliminary version of this paper was published in [22].
2 Notation
For any two integers $u \leq v$, we denote by $[u, v]$ the interval $\{u, u+1, \ldots, v\}$. For any $n \times p$ matrix $M$ we note $M_{i,j}$ its $(i,j)$-th entry, and $\| M \| = \left( \sum_{i,j} M_{i,j}^2 \right)^{1/2}$ its Frobenius norm (or Euclidean norm in the case of vectors). For any subsets of indices $R$ and $C$, we denote by $M_{R,C}$ the $|R| \times |C|$ matrix with entries $M_{r,c}$ for $(r, c) \in R \times C$. For simplicity we will use $i$ instead of $\{i\}$, i.e., $M_{i,\bullet}$ is the $i$-th row of $M$ and $M_{\bullet,j}$ is the $j$-th column of $M$. We note $\mathbf{1}_{n,p}$ the $n \times p$ matrix of ones, and $I_n$ the $n \times n$ identity matrix.

3 Formulation
We consider $p$ real-valued profiles of length $n$, stored in an $n \times p$ matrix $Y$. The $j$-th profile $Y_{\bullet,j}$ is the $j$-th column of $Y$. We model each profile as a piecewise-constant signal corrupted by noise, and assume that change-point locations tend to be shared across profiles. Our goal is to detect these shared change-points, and to benefit from the possibly large number of profiles to increase the statistical power of change-point detection.
3.1 Segmentation with a total variation penalty
When $p = 1$ (a single profile), a popular method to find change-points in a signal $Y \in \mathbb{R}^n$ is to approximate it by a piecewise-constant function using a quadratic error criterion, i.e., to solve
$$\min_{U \in \mathbb{R}^{n}} \frac{1}{2} \left\| Y - U \right\|^2 \quad \text{such that} \quad \sum_{i=1}^{n-1} \delta\left( U_{i+1} - U_{i} \right) \leq k \,, \qquad (1)$$
where $\delta(\cdot)$ is the Dirac function, equal to 0 if its argument is null and to 1 otherwise. In other words, (1) expresses the best approximation of $Y$ by a piecewise-constant profile with at most $k$ jumps. It is well-known that (1) can be solved in $O(n^2 k)$ by dynamic programming [10, 11, 12]. Although very fast when $n$ is of moderate size, the quadratic dependency in $n$ renders this approach impractical on current computers when $n$ reaches millions or more, which is often the case in applications such as the segmentation of genomic profiles.
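For concreteness, this quadratic-time dynamic program can be sketched as follows (a minimal Python illustration of problem (1), not the authors' implementation; cumulative sums make each segment cost an $O(1)$ evaluation):

```python
def best_segmentation(y, k):
    """Approximate y by a piecewise-constant signal with at most k
    jumps, minimizing the quadratic error, via the classical
    O(n^2 k) dynamic program.  Returns (sse, breakpoints), where each
    breakpoint is the index starting a new segment."""
    n = len(y)
    s = [0.0] * (n + 1)   # s[i]  = sum of y[:i]
    s2 = [0.0] * (n + 1)  # s2[i] = sum of squares of y[:i]
    for i, v in enumerate(y):
        s[i + 1] = s[i] + v
        s2[i + 1] = s2[i] + v * v

    def sse(i, j):  # error of fitting the mean on y[i:j], in O(1)
        m = (s[j] - s[i]) / (j - i)
        return (s2[j] - s2[i]) - (j - i) * m * m

    # C[m][j-1] = best SSE for y[:j] using at most m jumps
    C = [[sse(0, j) for j in range(1, n + 1)]]
    back = []
    for m in range(1, k + 1):
        row, brow = [], []
        for j in range(1, n + 1):
            best, arg = C[m - 1][j - 1], 0      # option: no new jump
            for t in range(1, j):               # last segment is y[t:j]
                cand = C[m - 1][t - 1] + sse(t, j)
                if cand < best:
                    best, arg = cand, t
            row.append(best)
            brow.append(arg)
        C.append(row)
        back.append(brow)

    bps, j = [], n                              # backtrack the jumps
    for m in range(k - 1, -1, -1):
        t = back[m][j - 1]
        if t > 0:
            bps.append(t)
            j = t
    return C[k][n - 1], sorted(bps)
```

On $y = (0,0,0,5,5,5)$ with $k = 1$, this recovers the single change-point between positions 3 and 4 with zero residual error.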
An alternative to the combinatorial optimization problem (1) is to relax it to a convex optimization problem, by replacing the number of jumps by the convex total variation (TV) [15], i.e., to consider:
$$\min_{U \in \mathbb{R}^{n}} \frac{1}{2} \left\| Y - U \right\|^2 + \lambda \sum_{i=1}^{n-1} \left| U_{i+1} - U_{i} \right| . \qquad (2)$$
For a given $\lambda > 0$, the solution of (2) is again piecewise-constant. Recent work has shown that (2) can be solved much more efficiently than (1): [19] proposed a fast coordinate descent-like method, [17] showed how to find the first $k$ change-points iteratively in $O(nk)$, and [20] proposed a fast method to find all change-points. Adding penalties proportional to the $\ell_1$ or $\ell_2$ norm of $U$ to (2) does not change the position of the change-points detected [16, 23], and the capacity of TV denoising to correctly identify change-points when $n$ increases has been investigated in [17, 18].
Here, we propose to generalize TV denoising to multiple profiles by considering the following convex optimization problem for $Y \in \mathbb{R}^{n \times p}$:
$$\min_{U \in \mathbb{R}^{n \times p}} \frac{1}{2} \left\| Y - U \right\|^2 + \lambda \sum_{i=1}^{n-1} \left\| U_{i+1,\bullet} - U_{i,\bullet} \right\| . \qquad (3)$$
The second term in (3) can be considered a multidimensional TV: it penalizes the sum of the Euclidean norms of the increments of $U$, seen as a time-dependent multidimensional vector, and reduces to the classical one-dimensional TV when $p = 1$. Intuitively, when $\lambda$ increases, this penalty will enforce many increment vectors $U_{i+1,\bullet} - U_{i,\bullet}$ to collapse to zero, just like the total variation in (2) in the case of one-dimensional signals. This implies that the positions of nonzero increments will be the same for all profiles. As a result, the solution of (3) provides an approximation of the profiles by an $n \times p$ matrix of piecewise-constant profiles which share change-points.
While (3) is a natural multidimensional generalization of the classical TV denoising method (2), we more generally investigate the following variant:
$$\min_{U \in \mathbb{R}^{n \times p}} \frac{1}{2} \left\| Y - U \right\|^2 + \lambda \sum_{i=1}^{n-1} d_i \left\| U_{i+1,\bullet} - U_{i,\bullet} \right\| \,, \qquad (4)$$
where $d_1, \ldots, d_{n-1} > 0$ are position-dependent weights which penalize jumps differently at different positions. While (4) boils down to (3) for uniform weights $d_i = 1$, $i = 1, \ldots, n-1$, we will see that the unweighted version suffers from boundary effects and that position-dependent schemes such as:
$$d_i = \sqrt{\frac{i \, (n - i)}{n}} \,, \qquad i = 1, \ldots, n-1 \,, \qquad (5)$$
are both theoretically and empirically better choices.
To illustrate the grouping effect of the penalty in (4), Figure 1 compares the segmentation of three simulated profiles obtained with and without enforced sharing of change-points across profiles. We simulated three piecewise-constant signals corrupted by independent additive Gaussian noise. All profiles have length 500 and share the same change-points, though with different amplitudes, at positions 38, 139, 268, 320 and 397. On the left-hand side, we show the first change-points captured by TV denoising with weights (5) applied to each signal independently. On the right, we show the first change-points captured by formulation (4). We see that the latter formulation finds the correct change-points, whereas treating each profile independently leads to errors. For example, the first two change-points have a small amplitude in the second profile and are therefore very difficult to detect from that profile alone, while they are very apparent in the first and third profiles.
3.2 Reformulation as a group Lasso problem
It is well-known that the one-dimensional TV denoising problem (2) can be reformulated as a Lasso regression problem by an appropriate change of variable [17]. We now show that our generalization (4) can be reformulated as a group Lasso regression problem [21], which will be convenient for both theoretical analysis and implementation. To this end, we make the change of variables $(\beta, \gamma) \in \mathbb{R}^{(n-1) \times p} \times \mathbb{R}^{1 \times p}$ given by:
$$\gamma = U_{1,\bullet} \,, \qquad \beta_{i,\bullet} = U_{i+1,\bullet} - U_{i,\bullet} \quad \text{for } i = 1, \ldots, n-1 \,.$$
In other words, $\beta_{i,j}$ is the jump between the $i$-th and the $(i+1)$-th positions of the $j$-th profile. We immediately get an expression for $U$ as a function of $\beta$ and $\gamma$:
$$U_{i,\bullet} = \gamma + \sum_{j=1}^{i-1} \beta_{j,\bullet} \,, \qquad i = 1, \ldots, n \,.$$
This can be rewritten in matrix form as
$$U = \mathbf{1}_{n,1} \gamma + X \beta \,, \qquad (6)$$
where $X$ is the $n \times (n-1)$ matrix with entries $X_{i,j} = 1$ for $i > j$, and $X_{i,j} = 0$ otherwise. With this change of variable, we can re-express (4) as follows:
$$\min_{\gamma \in \mathbb{R}^{1 \times p}, \, \beta \in \mathbb{R}^{(n-1) \times p}} \frac{1}{2} \left\| Y - \mathbf{1}_{n,1} \gamma - X \beta \right\|^2 + \lambda \sum_{i=1}^{n-1} d_i \left\| \beta_{i,\bullet} \right\| . \qquad (7)$$
For any $\beta$, the minimum in $\gamma$ is attained for $\gamma = \frac{1}{n} \mathbf{1}_{1,n} \left( Y - X \beta \right)$. Plugging this into (7), we get that the matrix of jumps $\hat\beta$ is a solution of
$$\min_{\beta \in \mathbb{R}^{(n-1) \times p}} \frac{1}{2} \left\| \bar Y - \bar X \beta \right\|^2 + \lambda \sum_{i=1}^{n-1} d_i \left\| \beta_{i,\bullet} \right\| \,, \qquad (8)$$
where $\bar Y$ and $\bar X$ are obtained from $Y$ and $X$ by centering each column.
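This change of variable is easy to check numerically. The following sketch (Python, with our own helper name `design_matrix`) builds $X$, verifies the reconstruction (6), and verifies that once $Y$ is column-centered, centering $X$ no longer affects the product $\bar X^\top \bar Y$ — a fact used implicitly by the fast implementations of Section 4:

```python
import numpy as np

def design_matrix(n):
    """The n x (n-1) matrix X with X[i, j] = 1 for i > j (0-based),
    0 otherwise: column j is a step starting at row j + 1."""
    X = np.zeros((n, n - 1))
    for j in range(n - 1):
        X[j + 1:, j] = 1.0
    return X

n, p = 5, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(n, p))                  # any multidimensional signal
gamma = U[0:1, :]                            # gamma = U_{1,.}
beta = np.diff(U, axis=0)                    # beta_{i,.} = U_{i+1,.} - U_{i,.}
X = design_matrix(n)
U_rec = np.ones((n, 1)) @ gamma + X @ beta   # equation (6)
assert np.allclose(U, U_rec)

# Column-centering Y makes the centering of X irrelevant in X^T Y:
Yb = U - U.mean(axis=0)
Xb = X - X.mean(axis=0)
assert np.allclose(Xb.T @ Yb, X.T @ Yb)
```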
4 Implementation
Although (4) and (8) are convex optimization problems that can in principle be solved by general-purpose solvers [24], we want to be able to work in dimensions where $n$ reaches millions or more, making this computationally difficult. In particular, the design matrix $\bar X$ in (8) is a non-sparse $n \times (n-1)$ matrix which cannot even fit in a computer's memory when $n$ is large. Moreover, we would ideally like to obtain solutions for various values of $\lambda$, corresponding to various numbers of change-points, in order to be able to select the optimal number of change-points using statistical criteria. In the single-profile case ($p = 1$), fast implementations have been proposed [19, 17, 20]. However, none of these methods is directly applicable to the $p > 1$ setting, since they all rely on specific properties of the $p = 1$ case, such as the fact that the solution is piecewise-affine in $\lambda$, or that the set of change-points is monotonically decreasing with $\lambda$.
In this section we propose two algorithms to solve (4) efficiently, either exactly or approximately. We adapt the algorithms suggested by [21] for the group Lasso problem (8), and show how they can be implemented very efficiently in our case thanks to the particular structure of the regression problem. Annex A collects several technical lemmas showing how to perform efficiently, with this particular design matrix, several operations that will be used repeatedly in the implementations proposed below.
4.1 Exact solution by block coordinate descent
A first possibility to solve the group Lasso problem (8) is to follow a block coordinate descent approach, where each block $\beta_{i,\bullet}$ is optimized in turn with all other blocks fixed. It can be shown that this strategy converges to the global optimum, and it is reported to be stable and efficient [21, 25]. As shown by [21], it amounts to iteratively applying the following update to each block in turn, until convergence:
$$\beta_{i,\bullet} \leftarrow \frac{1}{\left\| \bar X_{\bullet,i} \right\|^2} \left( 1 - \frac{\lambda d_i}{\left\| S_i \right\|} \right)_+ S_i \,, \qquad (9)$$
where $(x)_+ = \max(x, 0)$ and $S_i = \bar X_{\bullet,i}^\top \left( \bar Y - \bar X \beta^{\neg i} \right)$, and where $\beta^{\neg i}$ denotes the matrix equal to $\beta$ except for the $i$-th row, which is set to zero. The convergence of the procedure can be monitored by the Karush-Kuhn-Tucker (KKT) conditions:
$$\begin{cases} \bar X_{\bullet,i}^\top \left( \bar Y - \bar X \beta \right) = \lambda d_i \, \dfrac{\beta_{i,\bullet}}{\left\| \beta_{i,\bullet} \right\|} \,, & \text{for } \beta_{i,\bullet} \neq 0 \,, \\[1ex] \left\| \bar X_{\bullet,i}^\top \left( \bar Y - \bar X \beta \right) \right\| \leq \lambda d_i \,, & \text{for } \beta_{i,\bullet} = 0 \,. \end{cases} \qquad (10)$$
Since the number of blocks can be very large and we expect only a small fraction of nonzero blocks at the optimum (corresponding to the change-points), we implemented this block coordinate descent with an active-set strategy. In brief, a set $A$ of active groups, corresponding to nonzero rows of $\beta$, is maintained, and the algorithm alternates between optimizing over the active groups in $A$ and updating $A$ by adding or removing groups based on violations of the KKT conditions. The resulting pseudo-code is shown in Algorithm 1. The inner loop (lines 3–7) corresponds to the optimization of $\beta$ over the current active groups, by iteratively applying the block coordinate descent update (9). After convergence, groups that have been shrunk to zero are removed from the active set (line 8), and the KKT conditions are checked outside of the active set (lines 9–10). If they are not fulfilled, the group that most violates them is added to the active set (line 11); otherwise the current solution satisfies all KKT conditions and is therefore the global optimum (line 13).
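The core update (9) can be sketched naively as follows (our own dense Python reference implementation, not Algorithm 1: no active set, no fast Lemma-based updates, and function and variable names are ours):

```python
import numpy as np

def group_fused_bcd(Y, lam, d=None, n_sweeps=200):
    """Naive dense block coordinate descent for the group Lasso (8):
    each row beta[i] is updated in turn by the group soft-thresholding
    update (9), all other rows being fixed.  O(n^2 p) per sweep --
    for illustration only, without the active-set strategy."""
    n, p = Y.shape
    X = np.zeros((n, n - 1))
    for j in range(n - 1):
        X[j + 1:, j] = 1.0
    Xb = X - X.mean(axis=0)          # column-centered design, X bar
    Yb = Y - Y.mean(axis=0)          # column-centered profiles, Y bar
    if d is None:
        d = np.ones(n - 1)           # uniform weights: problem (3)
    beta = np.zeros((n - 1, p))
    R = Yb - Xb @ beta               # current residual
    col_sq = (Xb ** 2).sum(axis=0)   # ||Xb[:, i]||^2
    for _ in range(n_sweeps):
        for i in range(n - 1):
            # S_i = Xb[:, i]^T (Yb - Xb beta^{without i})
            S = Xb[:, i] @ (R + np.outer(Xb[:, i], beta[i]))
            norm_S = np.linalg.norm(S)
            if norm_S <= lam * d[i]:
                new = np.zeros(p)    # group shrunk exactly to zero
            else:
                new = (1 - lam * d[i] / norm_S) * S / col_sq[i]
            R += np.outer(Xb[:, i], beta[i] - new)   # keep residual exact
            beta[i] = new
    return beta

# Two profiles with a shared jump between positions 3 and 4 (1-based):
Y = np.array([[0., 0.], [0., 0.], [0., 0.], [5., 5.], [5., 5.], [5., 5.]])
jumps = np.linalg.norm(group_fused_bcd(Y, lam=0.5), axis=1)
assert int(np.argmax(jumps)) == 2    # row 2 (0-based) = jump after position 3
```

With this small $\lambda$, the solution concentrates on the single true jump row; the weighted variant is obtained by passing a vector `d` of position-dependent weights such as (5).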
Although it is difficult to estimate the number of iterations needed to reach convergence at a given level of precision, we note that by Lemma 5 (Annex A), the computation of $\bar X^\top \bar Y$ in line 1 can be done in $O(np)$, and that each group optimization iteration (lines 3–7) requires computing $S_i$ (line 5), done in $O(|A| p)$ (see Lemma 6 in Annex A), then updating $\beta_{i,\bullet}$ (line 5) and soft-thresholding it (line 6), both in $O(p)$. The overall complexity of each group optimization iteration is therefore $O(|A| p)$. Since each group in $A$ must typically be optimized several times, we expect a complexity that is at least quadratic in $|A|$ and linear in $p$ for each optimization over an active set (lines 3–7). To check the optimality of a solution after optimization over an active set $A$, we need to compute $\bar X^\top \left( \bar Y - \bar X \beta \right)$ (line 9), which takes $O((n + |A|) p)$ (see Lemma 7, Annex A). Although it is difficult to upper-bound the number of iterations needed to optimize over $A$, this shows that a best-case complexity to find $k$ change-points, if we correctly add groups one by one to the active set, would be $O(knp)$ to check $k$ times the KKT conditions and find the next group to add, plus the cost of the $k$ successive optimizations over growing active sets. In Section 6, we provide some empirical results on the behavior of this block coordinate descent strategy.
4.2 Group fused LARS implementation
Since exactly solving the group Lasso with the method described in Section 4.1 can be computationally intensive, it may be of interest to find fast, approximate solutions to (8). We propose to implement a strategy based on the group LARS, proposed in [21] as a good way to approximate the regularization path of the group Lasso. More precisely, the group LARS approximates the solution path of (8) with a piecewise-affine set of solutions and iteratively finds change-points. The resulting algorithm is presented here as Algorithm 2, and is intended to approximately solve (8). Change-points are added one by one (lines 4 and 8), and for a given set of change-points the solution moves straight along a descent direction (line 6) with a given step (line 7) until a new change-point is added (line 8). We refer to [21] for more details and justification of this algorithm.
While the original group LARS method requires storing and manipulating the $n \times (n-1)$ design matrix [21], impractical for large $n$ here, we can again benefit from the computational tricks provided in Annex A to run the fast group LARS method efficiently. Computing $\bar X^\top \bar Y$ in line 1 can be done in $O(np)$ using Lemma 5. To compute the descent direction (line 6), we first compute the direction restricted to the active set in $O(|A| p)$ using Lemma 8, then its correlations with all groups in $O((n + |A|) p)$ using Lemma 7. To find the descent step (line 7), we need to solve $n - 1$ polynomial equations of degree 2, the coefficients of which are computed in $O(np)$, resulting in a $O(np)$ complexity. Overall, the main loop for each new change-point (lines 2–10) takes $O(np)$ in computation and memory, resulting in a complexity of $O(npk)$ in time and $O(np)$ in memory to find the first $k$ change-points. We provide in Section 6 empirical results that confirm this theoretical complexity.
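To give the flavor of these tricks (Annex A is not reproduced in this chunk, so the following is our reconstruction of the simplest one): each column of $X$ is a step function, so $\bar X^\top \bar Y$ reduces to reverse cumulative sums of the centered profiles, computable in $O(np)$ without ever forming $\bar X$:

```python
import numpy as np

def xty_fast(Y):
    """Compute Xbar^T Ybar in O(np): because column i of X is a step,
    row i of the product is the sum of the centered rows of Y strictly
    after position i (and centering X changes nothing once Y is
    column-centered)."""
    Yb = Y - Y.mean(axis=0)
    return np.cumsum(Yb[::-1], axis=0)[::-1][1:]

# Check against the explicit dense product:
n, p = 7, 3
rng = np.random.default_rng(1)
Y = rng.normal(size=(n, p))
X = np.zeros((n, n - 1))
for j in range(n - 1):
    X[j + 1:, j] = 1.0
Xb = X - X.mean(axis=0)
Yb = Y - Y.mean(axis=0)
assert np.allclose(xty_fast(Y), Xb.T @ Yb)
```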
5 Theoretical analysis
In this section, we study theoretically to what extent the estimator (4) recovers the correct change-points. The vast majority of existing theoretical results for offline segmentation and change-point detection consider the setting where $p$ is fixed (usually $p = 1$) and $n$ increases (e.g., [2]). This typically corresponds to cases where we can sample a continuous signal with increasing density, and wish to locate more precisely the underlying change-points as the density increases.
We propose a radically different analysis, motivated notably by applications in genomics. Here, the length $n$ of the profiles is fixed for a given technology, but the number $p$ of profiles can increase as more samples or patients are collected. The property we would like to study is then, for a given change-point detection method, to what extent increasing $p$ for fixed $n$ allows us to locate the change-points more precisely. While this simply translates our intuition that increasing the number of profiles should increase the statistical power of change-point detection, and while this property was empirically observed in [7], we are not aware of previous theoretical results in this setting. In particular, we are interested in the consistency of our method, in the sense that it should correctly detect the true change-points if enough samples are available.
5.1 Consistent estimation of a single change-point
As a first step towards the analysis of this "fixed $n$, increasing $p$" setting, let us assume that the observed centered profiles $\bar Y$ are obtained by adding noise to a set of profiles with a single shared change-point between positions $u$ and $u + 1$, for some $u \in [1, n-1]$. In other words, we assume that
$$\bar Y = \bar X \beta^* + \bar W \,,$$
where $\beta^*$ is an $(n-1) \times p$ matrix of zeros except for the $u$-th row $\beta^*_{u,\bullet}$, and $W$ is a noise matrix whose entries are assumed to be independent and identically distributed with respect to a centered Gaussian distribution with variance $\sigma^2$. In this section we study the probability that the first change-point found by our procedure is the correct one, when $p$ increases. We therefore consider an infinite sequence of jumps $\left( \beta^*_{u,j} \right)_{j \geq 1}$, and letting $\bar\mu_p^2 = \frac{1}{p} \sum_{j=1}^{p} \left( \beta^*_{u,j} \right)^2$, we assume that $\bar\mu^2 = \lim_{p \to +\infty} \bar\mu_p^2$ exists and is finite. We first characterize the first selected change-point as $p$ increases.

Lemma 1.
Assume, without loss of generality, that $u \geq n/2$, and let, for $i \in [1, n-1]$,
$$F_i = \frac{1}{d_i^2} \left( \bar\mu^2 \, \frac{\min(i, u)^2 \left( n - \max(i, u) \right)^2}{n^2} + \sigma^2 \, \frac{i \, (n - i)}{n} \right) . \qquad (11)$$
When $p \to +\infty$, the first change-point selected by the group fused Lasso (4) is in $\arg\max_{i \in [1, n-1]} F_i$ with probability tending to one.
Proof of this result is given in Annex B. From it we easily deduce conditions under which the first change-point is correctly found with probability increasing to one as $p$ increases. Let us first focus on the unweighted group fused Lasso (3), corresponding to the setting $d_i = 1$ for all $i$.
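As a quick numerical sanity check (our simulation, not from the paper): in the unweighted case, the first change-point to enter the group Lasso path as $\lambda$ decreases is the index maximizing $\| ( \bar X^\top \bar Y )_{i,\bullet} \|$, which for large $p$ concentrates at the true position:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, u, sigma = 10, 200, 7, 0.1       # true jump between positions u and u+1
Y = np.zeros((n, p))
Y[u:, :] = 1.0                         # unit jump shared by every profile
Y += sigma * rng.normal(size=(n, p))
Yb = Y - Y.mean(axis=0)
# c_i = sum of centered rows after position i (this is Xbar^T Ybar)
c = np.cumsum(Yb[::-1], axis=0)[::-1][1:]
first = int(np.argmax(np.linalg.norm(c, axis=1))) + 1   # back to 1-based
assert first == u
```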
Theorem 2.
Let $\alpha = u/n$ be the position of the change-point scaled to the interval $[1/2, 1)$, and let
$$\bar\sigma_\alpha^2 = \frac{2 \alpha (1 - \alpha)^2}{2\alpha - 1} \, n \, \bar\mu^2 \,. \qquad (12)$$
If $\sigma < \bar\sigma_\alpha$, the probability that the first change-point selected by the unweighted group fused Lasso (3) is the correct one tends to one as $p \to +\infty$. When $\sigma > \bar\sigma_\alpha$, it is not the correct one with probability tending to one.
This theorem, the proof of which can be found in Annex C, deserves several comments.

- To detect a change-point at position $\alpha$, the noise level $\sigma$ must not be larger than the critical value $\bar\sigma_\alpha$ given by (12); hence the method is not consistent for all positions at all noise levels. $\bar\sigma_\alpha$ decreases monotonically as $\alpha$ goes from $1/2$ to $1$, meaning that change-points near the boundary are more difficult to detect correctly than change-points near the center. The most difficult change-point is the last one ($u = n-1$), which can only be detected consistently if $\sigma^2$ is below a value of order $\bar\mu^2 / n$.

- For a given level of noise $\sigma$, change-point detection is asymptotically correct for any $\alpha \in [1 - \alpha_{\max}, \alpha_{\max}]$, where $\alpha_{\max}$ satisfies $\bar\sigma_{\alpha_{\max}} = \sigma$. This shows in particular that increasing the profile length $n$ increases the relative interval (as a fraction of $n$) where change-points are correctly identified, and that we can get as close as we want to the boundary for $n$ large enough.

- When $\sigma < \bar\sigma_\alpha$, the correct change-point is found consistently when $p$ increases, showing the benefit of the accumulation of many profiles.
Theorem 2 shows that the unweighted group fused Lasso (3) suffers from boundary effects, since it may fail to correctly identify a single change-point near the boundary if the noise is too large. In fact, Lemma 1 tells us that if we miss the correct change-point position, it is because we estimate it more towards the middle of the interval (see the proof of Theorem 2 for details). The larger the noise, the more biased the procedure is. We now show that this issue can be fixed by considering the weighted group fused Lasso (4) with well-chosen weights.
Theorem 3.
Let the weights $(d_i)_{i \in [1, n-1]}$ be given by (5). Then, for any noise level $\sigma$ and any change-point position $u \in [1, n-1]$, the probability that the first change-point selected by the weighted group fused Lasso (4) is the correct one tends to one as $p \to +\infty$.
5.2 Consistent estimation of a single change-point with fluctuating position
An interesting variant of the problem of detecting a change-point common to many profiles is that of detecting a change-point with a similar, but not identical, location in the different profiles. This can be modeled by assuming that the profiles are random, and that the $j$-th profile has a single change-point of value $\beta_j$ at position $U_j$, where the $(U_j)_{j \geq 1}$ are independent and identically distributed according to a distribution $P$ (we assume $U_j$ independent from $\beta_j$). Assuming that the support of $P$ is $[u, v]$ with $1 \leq u \leq v \leq n-1$, the following result extends Theorem 2 by showing that the first change-point discovered by the unweighted group fused Lasso is in the support of $P$ under some condition on the noise level, while the weighted group fused Lasso correctly identifies a change-point in the support of $P$ asymptotically, without conditions on the noise.
Theorem 4.

Let $U$ be the random position of the change-point on $[1, n-1]$, and let $a = u/n$ and $b = v/n$ be the positions of the left and right boundaries of the support of $P$, scaled to $[0, 1]$. If $a \leq 1/2 \leq b$, then for any noise level $\sigma$, the probability that the first change-point selected by the unweighted group fused Lasso (3) is in the support of $P$ tends to one as $p \to +\infty$. If $b < 1/2$ or $a > 1/2$, let $\tilde\sigma$ be the critical noise level defined by
(13)
The probability that the first selected change-point is in the support of $P$ tends to one when $\sigma < \tilde\sigma$. When $\sigma > \tilde\sigma$, it is outside of the support of $P$ with probability tending to one.
This theorem, the proof of which is postponed to Annex E, illustrates the robustness of the method to fluctuations in the precise position of the change-point shared between profiles. Although this situation rarely occurs for classical multidimensional signals such as financial time series or video, it is likely to be the rule when we consider profiles coming from different biological samples, where for example we can expect frequent genomic alterations in the vicinity of important oncogenes or tumor suppressor genes. Although the theorem only gives a condition on the noise level ensuring that the selected change-point lies in the support of the distribution of change-point locations, a precise estimate of the location of the selected change-point as a function of the noise level, which generalizes Lemma 1, is given in the proof.
5.3 The case of multiple change-points
While the theoretical results presented above focus on the detection of a single change-point, the real interest of the method is to estimate multiple change-points. The extension of Theorem 2 to this setting is, however, not straightforward, and we postpone it for future work. We conjecture that the group fused Lasso estimator can, under certain conditions, consistently estimate multiple change-points. More precisely, in order to generalize the proof of Theorem 2, we must analyze the path of the vectors $\hat c_i = \bar X_{\bullet,i}^\top \left( \bar Y - \bar X \hat\beta \right)$ and check that, for some $\lambda$ in (3) or (4), they reach their maximum norm precisely at the true change-points. The situation is more complicated than in the single change-point case since, in order to fulfill the KKT optimality conditions, the vectors $\hat c_i$ must hit a hypersphere at each correct change-point, and must remain strictly within the hypersphere between consecutive change-points. This can probably be ensured if the noise level is not too high (as in the single change-point case), and if the positions corresponding to successive change-points on the hypersphere are far enough from each other, which could be ensured if two successive change-points are not too close to each other and point in sufficiently different directions. Although the weighting scheme (5) ensures consistent estimation of the first change-point independently of the noise level, it may not be sufficient to ensure consistent estimation of subsequent change-points.
Although we propose no theoretical results beyond these conjectures for the case of multiple change-points, we provide experimental results below confirming that, when the noise is not too large, we can indeed correctly identify several change-points, with a probability of success increasing to one as $p$ increases.
5.4 Estimating the number of change-points
The number of change-points detected by the group fused Lasso in the multidimensional signal depends on the choice of $\lambda$ in (3) and (4). In practice, we propose the following scheme to estimate a segmentation and the number of change-points. We first select a $\lambda$ that over-segments the multidimensional signal, that is, that finds more change-points than we would normally expect for the given type of signal or application. Then, on the set of candidate change-points found, we perform a post-processing step based on a simple least-squares criterion. Briefly, for each given subset of change-points, we approximate each signal between successive change-points by the mean value of the points in that interval; then, we calculate the total sum of squared errors (SSE) between the set of real signals and these piecewise-constant approximations. Though it may appear computationally intensive, or even impossible, to do this for all subsets of change-points, a dynamic programming strategy (e.g., [6]) allows the best subset of $k$ change-points to be calculated, for every $k$ up to the number of candidates, in a total time quadratic in the number of candidate change-points.
It then remains to choose the "best" $k$ using, for example, a model-selection strategy. The optimal SSE for $k$ change-points (which we call $J_k$ to ease notation) decreases with $k$, but at a certain point adding a further change-point has no physical reality: it only improves the SSE by fitting random noise. Here, we implemented a method proposed in [26, 6] where we first normalize the $J_k$, $k \in [1, k_{\max}]$, into scores $\tilde J_k$ such that $\tilde J_1 = k_{\max} - 1$ and $\tilde J_{k_{\max}} = 0$, in such a way that the $\tilde J_k$ curve has an average slope of $-1$ between $k = 1$ and $k = k_{\max}$; we then try to detect a kink in the curve by calculating the discrete second derivative of $\tilde J_k$, and selecting the largest $k$ after which this second derivative no longer rises above a fixed threshold $\nu$.
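One plausible reading of this selection rule in code (the exact normalization used in [26, 6] may differ; the function name `select_k` and the default threshold value are our choices):

```python
def select_k(sse, nu=0.5):
    """Given sse[k-1] = optimal SSE with k change-points (k = 1..kmax),
    return the largest k at which the discrete second derivative of the
    normalized score still rises above the threshold nu."""
    kmax = len(sse)
    lo, hi = sse[-1], sse[0]
    # normalize so s[0] = kmax - 1 and s[-1] = 0: average slope -1
    s = [(v - lo) / (hi - lo) * (kmax - 1) for v in sse]
    best = 1
    for k in range(1, kmax - 1):             # interior (0-based) indices
        d2 = s[k - 1] - 2 * s[k] + s[k + 1]  # discrete second derivative
        if d2 >= nu:
            best = k + 1                     # 1-based number of change-points
    return best

# A sharp kink at k = 3: the SSE barely improves afterwards.
assert select_k([100.0, 40.0, 10.0, 9.5, 9.2, 9.0]) == 3
```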
6 Experiments
In this section we test the group fused Lasso on several simulated and real data sets. All experiments were run under Linux on a machine with two 4-core Intel Xeon 3.16 GHz processors and a total of 16 GB of RAM. We have implemented the group fused Lasso in MATLAB; the resulting package, GFLseg, is available at http://cbio.ensmp.fr/GFLseg.
6.1 Speed trials
In a first series of experiments, we tested the behavior of the group fused Lasso in terms of computational efficiency. We simulated multidimensional profiles over a range of lengths $n$, dimensions $p$, and numbers $k$ of shared change-points. In each case, we first ran the iterative weighted group fused LARS (Section 4.2) to detect $k$ successive change-points, and recorded the corresponding values of $\lambda$. We then ran the exact group fused Lasso implementation by block coordinate descent (Section 4.1) on the same $\lambda$ values. Figure 2 shows speed with respect to increasing one of $n$, $p$ and $k$ while keeping the other two variables fixed, for both implementations. The axes are log-log, so the slope of each curve gives the exponent of the empirical complexity in $n$, $p$ and $k$, respectively. For the weighted group fused LARS, linearity is clearest for $k$, whereas for $n$ and $p$ the curves are initially sublinear, then slightly superlinear for extremely large values of $n$ and $p$. As these time trials reach out to the practical limits of current technology, we see that this is not critical: on average, even the longest trials here took less than 200 seconds. The weighted group fused Lasso results are perhaps more interesting, as it is harder to predict in advance the practical time performance of the algorithm. Surprisingly, when increasing $n$ ($p$ and $k$ fixed) or increasing $p$ ($n$ and $k$ fixed), the group fused Lasso eventually becomes as fast as the iterative, deterministic group fused LARS. This suggests that at the limits of current technology, if $k$ is small (say, less than 10), the potentially superior performance of the Lasso version (see later) may not even be punished by a slower runtime with respect to the LARS version. We suggest that this may be due to the Lasso optimization problem becoming relatively "easier" to solve when $n$ or $p$ increases, as we observed that the Lasso algorithm converged quickly to its final set of change-points.
The main difference between the Lasso and LARS performance appears when the number of change-points $k$ increases: with respective empirical complexities cubic and linear in $k$, as predicted by the theoretical analysis, the Lasso is already 1,000 times slower than the LARS when we seek 100 change-points.
6.2 Accuracy for detection of a single change-point
Next, we tested empirically the accuracy of the group fused Lasso for detecting a single change-point. We first generated multidimensional profiles of dimension $p$, with a single shared jump of fixed height at a position $u = \alpha n$, for different values of $\alpha$ and $p$. We added to the signals an i.i.d. Gaussian noise with variance $\bar\sigma_{\alpha_0}^2$, the critical value corresponding to a position $\alpha_0$ in Theorem 2. We ran 1000 trials for each value of $\alpha$ and $p$, and recorded how often the group fused Lasso, with or without weights, correctly identified the change-point. According to Theorem 2, we expect that, for the unweighted group fused Lasso, for $\alpha < \alpha_0$ there is convergence in accuracy to one as $p$ increases, and for $\alpha > \alpha_0$, convergence in accuracy to zero. This is indeed what is seen in Figure 3 (left panel), with $\alpha = \alpha_0$ the limit case between the two modes of convergence. The center panel of Figure 3 shows that when the default weights (5) are used, convergence in accuracy to one occurs for all $\alpha$, as predicted by Theorem 3.
In addition, the right-hand panel of Figure 3 shows results for the same trials, except that the change-point location can vary uniformly in an interval around its nominal position. We see that, as predicted by Theorem 4, the accuracy of the weighted group fused Lasso remains robust against fluctuations in the exact change-point location.
6.3 Accuracy for detecting multiple change-points
To investigate the potential for extending these results to the case of many shared change-points, we further simulated profiles of fixed length with a change-point at each of several fixed positions, for a range of dimensions $p$. Jumps at each change-point of each profile were drawn from a Gaussian with mean 0 and variance 1; we then added centered Gaussian noise of variance $\sigma^2$ to each position in each profile. For each value of $p$ and $\sigma$, we ran one hundred trials of both implementations, with or without weights, and recorded the accuracy of each method, defined as the percentage of trials where the first change-points detected by the method are exactly the true change-points. Results are presented in Figure 4 (from left to right: increasing noise level $\sigma$). Clearly, the group fused Lasso outperforms the group fused LARS, and the weighted version of each algorithm outperforms the unweighted version. Although the group LARS is usually considered a reliable alternative to the exact group Lasso [21], this experiment shows that the exact optimization by block coordinate descent may be worth the computational burden when one is interested in accurate group selection. It also demonstrates that, as conjectured in Section 5.3, the group fused Lasso can consistently estimate multiple change-points as the number of profiles increases.
6.4 Application to gain and loss detection
We now consider a possible application of our method: the detection of regions with frequent gains (positive values) and losses (negative values) among a set of DNA copy number profiles measured by array comparative genomic hybridization (aCGH) technology [27]. We propose a two-step strategy for this purpose: first, find an adequate joint segmentation of the signals; then, check for the presence of a gain or loss on each interval of the segmentation by summarizing each profile by its average value on the interval. Note that we do not assume that all profiles share exactly the same changepoints; we merely see the joint segmentation as an adaptive way to reduce dimension and remove noise from the data.
In practice, we used group fused LARS on each chromosome to identify a set of candidate changepoints, and selected a subset of them by post-processing as described in Section 5.4. Then, on each piecewise-constant interval between successive shared changepoints, we calculated the mean of the positive segments (shown in green in Figures 5(a) and 6(c)) and the mean of the negative segments (shown in red). The larger the mean of the positive segments, the more likely we are to believe that the region harbors an important common gain; the reasoning is analogous for important common losses and the mean of the negative segments. Obviously, many other statistical tests could be carried out to detect frequent gains and losses on each segment once the joint segmentation is performed.
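The scoring step can be sketched as a small helper (hypothetical, with 1-based changepoint positions delimiting the segments):

```python
import numpy as np

def gain_loss_scores(Y, changepoints):
    """For each interval between successive shared changepoints, summarize each
    profile (column of Y) by its average on the interval, then return the mean
    of the positive averages (gain score) and of the negative averages (loss
    score); 0.0 when no profile is positive (resp. negative) on the interval."""
    bounds = [0] + list(changepoints) + [Y.shape[0]]
    scores = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        m = Y[a:b].mean(axis=0)                  # per-profile segment averages
        gain = m[m > 0].mean() if (m > 0).any() else 0.0
        loss = m[m < 0].mean() if (m < 0).any() else 0.0
        scores.append((gain, loss))
    return scores
```

A large gain score on a segment then flags a candidate region of common gain, and symmetrically for losses.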
We compared this method for detecting regions of gain and loss with the state-of-the-art HHMM method [27], which has been shown to outperform several other methods in this setting. Since the authors of [27] provide their algorithm online, together with several of the data sets tested in their article, we ran both our method and theirs (HHMM) on their benchmark data sets.
In the first data set of [27], the goal is to recover two regions (one amplified, one deleted) shared among 8 short profiles, although only 6 of the 8 profiles exhibit each of the amplified and deleted regions. Performance is measured by the area under the ROC curve (AUC), following [27]. Running HHMM with its default parameters, we obtained an AUC (averaged over 10 trials) of , taking on average 60.20 seconds. The weighted group fused LARS, asked to select 100 breakpoints and followed by dynamic programming, took 0.06 seconds and reached an AUC of . Thus, the performance of the two methods was similar, but weighted group fused LARS was around 1000 times faster.
The second data set is a cohort of lung cancer cell lines originally published in [28, 29]. As in [27], we concentrated on the 18 NSCLC adenocarcinoma (NA) cell lines. Figure 5 shows the score statistics obtained on chromosome 8 when using either weighted group fused LARS or HHMM. Weighted group fused LARS first selected candidate changepoints per chromosome, then optimized the number of changepoints by dynamic programming; in total it took 4.7 seconds and finally selected 260 changepoints. In contrast, HHMM took 38 minutes (100 iterations, as set in the code provided by the authors). The HHMM scores should resemble those shown in Figure 4 (top panel) of [27]; the difference is due either to the stochastic nature of the algorithm or to using a different number of iterations than in the authors' sample code. In any case, at the MYC locus (near bp), both methods strongly suggest a common gained region. However, the supposed advantage of HHMM in predicting common gains and losses very sparsely is not apparent here; for example, it assigns high common-gain confidence to several fairly large genomic regions between 9 and 14 bp.
6.5 Application to bladder tumor aCGH profiles
We further considered a publicly available aCGH data set of 57 bladder tumor samples [30]. Each aCGH profile gave the relative quantity of DNA for 2215 probes. We removed the probes corresponding to sex chromosomes because the sex mismatch between some patients and the reference made the computation of copy number less reliable, giving us a final list of 2143 probes.
Results are shown in Figure 6. The weighted group fused LARS selected 97 changepoints, taking 1.1 seconds (Figure 6(c)). The HHMM method (Figure 6(d)) took 13 minutes for 200 iterations (after 100 iterations, convergence had not occurred).
We used the comprehensive catalogue of common genomic alterations in bladder cancer provided in Table 2 of [31] to validate the method and compare it with HHMM. Our method (Figure 6(c)) concurred with the known frequently amplified chromosome arms 20q, 8q, 19q, 1q, 20p, 17q, 19p, 5p, 2p, 10p, 3q and 7p, and the frequently lost 9p, 9q, 11p, 10q, 13q, 8p, 17p, 18q, 2q, 5q, 18p, 14q and 16q. The only known commonly lost region that did not show convincing common loss here was 6q. The HHMM method (Figure 6(d)), on the other hand, selects a small number of very small regions of gain and loss, which are difficult to verify against the well-known frequently amplified arms listed in [31]. As suggested in [27], the method may therefore be useful for pinpointing the precise location of important genes. However, as can be seen in Figures 6(a) and (b), many, though not all, alterations are much larger than those found by HHMM; for example, where there are clearly several localized gains and losses on chromosome 8, HHMM finds nothing at all. Perhaps the complexity of rearrangements on chromosome 8 is not easily accounted for by the HHMM algorithm. Note finally that the weighted group fused LARS was more than 700 times faster than HHMM.
7 Conclusion
We have proposed a framework that extends total-variation-based approximation to the multidimensional setting, and developed two algorithms to solve the resulting convex optimization problem either exactly or approximately. We have shown theoretically and empirically that the group fused Lasso can consistently estimate a single changepoint, and observed experimentally that this property is likely to hold when several changepoints are present as well. In particular, we observed both theoretically and empirically that increasing the number of profiles is highly beneficial for detecting approximately shared changepoints, an encouraging property for biological applications, where the accumulation of data measured on cohorts of patients promises to help in the detection of common genomic alterations.
Although we do not assume that all profiles have the same changepoints, we estimate only shared changepoints; in other words, we try to estimate the union of the changepoints present in the profiles. This can be useful in itself, e.g., for dimension reduction. If we wanted to detect changepoints of individual profiles, we could either post-process the results of the group fused Lasso, or modify the formulation by, e.g., adding a TV penalty to each profile in addition to the group Lasso penalty. Similarly, for some applications we may want to add a norm to the group fused Lasso objective function in order to constrain some or all signals to be frequently null. Finally, from a computational point of view, we have proposed efficient algorithms to solve the optimization problem (4), which is the proximal operator of more general optimization problems in which a smooth convex functional of is minimized under a constraint on the multidimensional TV penalty; this paves the way to the efficient minimization of such functionals using, e.g., accelerated gradient methods [32].
Annex A: Computational lemmas
In this Annex we collect a few results used to carry out the fast implementations claimed in Section 4. Recall that the matrix defined in (6) is given by for , and otherwise. Since the design matrix of the group Lasso problem (8) is obtained by centering each column of to zero mean, its columns are given by:
(14) 
We first show how to compute efficiently for any matrix :
Lemma 5.
For any , we can compute in operations and memory as follows:

Compute the matrix of cumulative sums by the induction:

.

For , .


For , compute .
Proof.
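As a sanity check, the $O(np)$ procedure of Lemma 5 can be compared against the naive $O(n^2p)$ product. The sketch below assumes the unweighted convention $X_{i,j}=1$ for $i>j$ (0 otherwise) with columns centered to zero mean; if the design of (14) also carries the weights $d_j$, each row $j$ of the result would simply be scaled by $d_j$.

```python
import numpy as np

def xbar_t_r_naive(R):
    """O(n^2 p): explicitly build the centered design Xbar and multiply.
    Convention assumed: X[i, j] = 1 for i > j (1-based), 0 otherwise."""
    n = R.shape[0]
    X = (np.arange(n)[:, None] >= np.arange(1, n)[None, :]).astype(float)
    Xbar = X - X.mean(axis=0)          # center each column to zero mean
    return Xbar.T @ R

def xbar_t_r_fast(R):
    """O(np) as in Lemma 5: with S the cumulative row sums of R,
    row j of Xbar^T R equals (j/n) * S_n - S_j."""
    n = R.shape[0]
    S = np.cumsum(R, axis=0)           # S[j-1] = sum of the first j rows of R
    j = np.arange(1, n)[:, None]
    return (j / n) * S[-1] - S[:-1]

rng = np.random.default_rng(0)
R = rng.normal(size=(7, 3))
assert np.allclose(xbar_t_r_naive(R), xbar_t_r_fast(R))
```

Only the cumulative sums and one pass over the $n-1$ rows are needed, hence the linear cost in $n$.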
Next, we show how to compute efficiently submatrices of the matrix .
Lemma 6.
For any two subsets of indices and in , the matrix can be computed in time and memory with the formula:
(15) 
Proof.
Let us denote . For any , denoting and , we easily get from (14) an explicit formula for , namely,
∎
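Under the same unweighted convention as above, the entries of the Gram matrix have the closed form $\min(a,b)\,(n-\max(a,b))/n$, so any submatrix can be filled in $O(|A|\times|B|)$; with weights $d$, entry $(a,b)$ would be further multiplied by $d_a d_b$. A sketch, checked against the explicit centered design:

```python
import numpy as np

def gram_submatrix(A, B, n):
    """(Xbar^T Xbar)_{A,B} for 1-based index sets A, B, unit weights:
    entry (a, b) = min(a, b) * (n - max(a, b)) / n."""
    A = np.asarray(A, dtype=float)[:, None]
    B = np.asarray(B, dtype=float)[None, :]
    return np.minimum(A, B) * (n - np.maximum(A, B)) / n

# check against the explicit centered design for a small n
n = 8
X = (np.arange(n)[:, None] >= np.arange(1, n)[None, :]).astype(float)
Xbar = X - X.mean(axis=0)
G = Xbar.T @ Xbar
A, B = [1, 3, 6], [2, 5]
assert np.allclose(gram_submatrix(A, B, n),
                   G[np.ix_([a - 1 for a in A], [b - 1 for b in B])])
```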
The next lemma provides another useful computational trick to compute efficiently for any matrix :
Lemma 7.
For any , we can compute in by

Compute, for , .

Compute the vector .

Compute the matrix defined by by the induction:

.

for , .


Compute the matrix defined by by the induction:

.

for , .


Compute, for ,
Each step in Lemma 7 has complexity in memory and time, leading to an overall complexity of to compute . We note that if is row-sparse, i.e., if several rows of are null, then the first two steps have complexity , where is the number of nonzero rows of . Although this does not change the overall complexity of computing , it leads to a significant speedup in practice when .
Proof.
Let us denote the diagonal matrix with entries . By Lemma 6, we know that , with , for . Since step 1 computes and step 5 computes , we just need to show that the computed in step 4 satisfies to conclude that . By step 4, is defined by the relation for (with the convention ), therefore we just need to show that for to conclude. For , we note that (with the convention ) and , and therefore . For , we have and and therefore . Combining these expressions we get, for :
where and are defined in steps 2 and 3. This concludes the proof that . ∎
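Keeping the unweighted convention (the weighted case would simply pre- and post-multiply by the diagonal matrix of weights), the product of the Gram matrix with any matrix $R$ can be formed with one forward and one backward cumulative sum, for an $O(np)$ total cost. A sketch:

```python
import numpy as np

def gram_times(R):
    """O(np) computation of G @ R for the (n-1)x(n-1) Gram matrix
    G[a, b] = min(a, b) * (n - max(a, b)) / n (unit weights), using
    C_i = ((n-i)/n) * sum_{j<=i} j*R_j + (i/n) * sum_{j>i} (n-j)*R_j."""
    m, p = R.shape                 # m = n - 1 candidate changepoints
    n = m + 1
    j = np.arange(1, n)[:, None]
    fwd = np.cumsum(j * R, axis=0)                       # sum_{j <= i} j*R_j
    rev = np.cumsum(((n - j) * R)[::-1], axis=0)[::-1]   # sum_{j >= i} (n-j)*R_j
    after = np.vstack([rev[1:], np.zeros((1, p))])       # sum_{j > i} (n-j)*R_j
    return ((n - j) / n) * fwd + (j / n) * after
```

Each of the two cumulative-sum passes touches every entry of $R$ once, which matches the linear complexity claimed by the lemma.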
Next we show that has a tridiagonal structure, resulting in fast matrix multiplication.
Lemma 8.
For any set of distinct indices with , the matrix is invertible, and for any matrix , the matrix
can be computed in time and memory by

For , compute

Compute the successive rows of according to:
(16)
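The tridiagonal structure asserted by Lemma 8 can be checked numerically. With unit weights, $(\bar{X}^\top\bar{X})_{A,A}$ is (up to scale) the covariance of a Brownian bridge sampled at the points of $A$; since the Brownian bridge is a Markov process, the inverse of this covariance vanishes beyond the first off-diagonals. A sketch, with a hypothetical index set $A$:

```python
import numpy as np

def gram_AA(A, n):
    """(Xbar^T Xbar)_{A,A} for a sorted 1-based index set A, unit weights."""
    a = np.asarray(A, dtype=float)
    return np.minimum(a[:, None], a[None, :]) * (n - np.maximum(a[:, None], a[None, :])) / n

n = 20
A = [2, 5, 6, 11, 17]                       # hypothetical changepoint subset
W = np.linalg.inv(gram_AA(A, n))
off = np.abs(np.subtract.outer(np.arange(len(A)), np.arange(len(A)))) > 1
print(np.max(np.abs(W[off])))               # numerically zero off the tridiagonal band
```

This is what allows the rows of the solution to be obtained by the short recursion (16) rather than by a full matrix inversion.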