1 Introduction
In the analysis of complex signals, using a single statistical model with a fixed set of parameters is rarely enough to track data variations over their entire range. In long and/or highdimensional time series for example, the presence of nonstationarity, either in the form of slowly drifting dynamics or of abrupt regime changes, requires that statistical models flexibly account for temporal variations in signal characteristics. To overcome the intrinsic limitations of approaches based on a single model visàvis heterogeneous and nonstationary signals, model segmentation techniques have been successfully employed in various fields including image processing (Alaiz2013; Friedman2007) genetics (Bleakley2011; Tibshirani2007), brain imaging (Beer2019; Xu2015), finance Hallac2019; Nystrup2017, industrial monitoring (Saxen2016), oceanography (Ranalli2018), seismology (Ohlsson2010), and ecology (Alewijnse2018). Model segmentation consists in partitioning the domain of the signal (e.g. the temporal range of a time series or the lattice of a digital image) into a small number of segments or regions such that for each segment, the data are suitably represented with a single model. The models used to segment the data are typically of the same type (e.g. linear model) but differ by their parameters. The task of model segmentation is closely related to change point detection and is commonly referred to as (hybrid or timevarying) system identification in the engineering literature.
This work considers model segmentation in the following setup:

Structured multivariate data.
The observed data are multivariate predictor and response variables measured over a time grid, spatial lattice, or more generally a graph.

Regression.
Predictor and response variables are related through a regression model, e.g. a linear model, generalized linear model, or vector autoregressive model.

High dimension. There are far more predictors than response variables. However, at each measurement point, the responses only depend on a small number of predictors.
For simplicity, we present our methods and results in the context of linear regression with time series data, keeping in mind that our work readily extends to other regression models and graph structures. Let
and be multivariate time series where is a response vector and a predictor matrix. We consider the timevarying multivariate linear model , where is an unknown regression vector and a random vector with mean zero. As noted above, we assume that , that the are sparse, and that for most values of , that is, is a piecewise constant function ofwith few change points. Our goal is to develop efficient computational methods for estimating
and its change points .Before introducing the optimization problem at the core of this study, namely the sparse group fused lasso (SGFL), we review relevant work on model segmentation, change point detection, and structured sparse regression.
Related work
We first introduce some notations. Throughout the paper, denotes the standard norm: if and if for . For convenience, we use the same notation to refer to regression coefficients either as a single vector in or as a sequence of vectors in .
Combinatorial approaches to change point detection
There is an extensive literature on change point detection spanning multiple fields and decades, which we only very partially describe here. For estimating changes in linear regression models, if the number of segments (or equivalently the number of change points) is fixed, the segmentation problem can be expressed as
(1) 
with and . For a given set of change points , the minimizing argument
and associated objective value are obtained by ordinary least squares regression. Accordingly the optimization reduces to a combinatorial problem solvable by dynamic programming
(Bai2003). This technique is computationally demanding as it requires performing linear regressions before carrying out the dynamic program per se; the time spent in linear regression can however be reduced through recursive calculations. A fundamental instance of model segmentation in (1) occurs when the design matrixis the identity matrix. In this case the problem is to approximate the signal
itself with a piecewise constant function.If is not prespecified, one may add a penalty function to (1) so as to strike a compromise between fitting the data and keeping the model complexity low. Examples of penalty functions on
include the Akaike Information Criterion (AIC), Bayes Information Criterion (BIC), as well as more recent variants for highdimensional data
(Yao1988; Chen2008). Another way to select is to add/remove change points based on statistical tests or other criteria in top/down or bottom/up approaches. See Basseville1993 for a classical book on statistical change point detection and Truong2018 for a more recent survey. Readers interested in the popular method of binary segmentation may also consult Bai1997; Fryzlewicz2014; Leonardi2016.Total variation penalty methods
Studying the piecewise constant approximation of 1D signals, Friedman2007 utilize a convex relaxation of (1) called the fused lasso signal approximation (FLSA):
(2) 
Here, hard constraints or penalties on the number of segments are replaced by a penalty on the increments . This total variation penalty promotes flatness in the profile of , that is, a small number of change points. The penalty on , called a lasso penalty in the statistical literature, favors sparsity in . The regularization parameters determine a balance between fidelity to the data, sparsity of , and number of change points. They can be specified by the user or selected from the data, for example by crossvalidation. Friedman2007 derive an efficient coordinate descent method to calculate the solution to (2) along a path of values of . Their method can also be applied to the more general problem of
fused lasso regression
(3) 
where is a vector of predictors, although it is not guaranteed to yield a global minimizer in this case. One may recover the FLSA (2) by setting and taking the as the canonical basis of in (3). More recent approaches to fused lasso regression include Hoefling2010b; Liu2010; Wang2015.
The FLSA and fused lasso can easily be adapted to the multivariate setup as follows:
(4) 
where , , and
. These approaches are however not suitable for segmenting multivariate signals/models as they typically produce change points that are only shared by few predictor variables. This is because the
norm in the total variation penalty affects each of the predictors separately. A simple way to induce change points common to all predictors is to replace this norm by an norm with . Indeed for , the norm of is differentiable everywhere except at the origin, which promotes . Typically, for the model estimate to have a change point at time , a jump of at least modest size must occur in a significant fraction of the timevarying regression coefficients between and . Due to its computational simplicity, the norm is often used in practice. For example, a common approach to denoising multivariate signals is to solve(5) 
where the are positive weights. Bleakley2011 reformulate this problem as a group lasso regression and apply the group LARS algorithm (see Yuan2006) to efficiently find solution paths as varies. Wytock2014 propose Newtontype methods for (5) that extend to multichannel images. These two papers refer to problem (5) as the group fused lasso (GFL).
To segment multivariate regression models with group sparsity structure, Alaiz2013 consider a generalization of (5) that they also call group fused lasso:
(6) 
They handle the optimization with a proximal splitting method similar to Dykstra’s projection algorithm. Songsiri2015 studies (6) in the context of vector autoregressive models, using the wellknown alternative direction method of multipliers (ADMM). See e.g. Combettes2011 for an overview of proximal methods and ADMM.
Sparse Group Fused Lasso
Under our assumptions, the set of regression coefficients in the timevarying model is sparse and piecewise constant with few change points. To enforce these assumptions in fitting the model to data, we propose to solve
(7) 
Problem (7) has common elements with the fused lasso (4) and the group fused lasso (6) but the three problems are distinct and not reducible to one another. For example, (4) uses an TV penalty whereas (7) uses an TV penalty to promote blockwise equality . Also, unlike (6) which exploits an penalty to induce group sparsity in , (7) features a standard lasso penalty. To distinguish (7) from the group fused lasso problems (5)(6), we call it sparse group fused lasso (SGFL). (Problem (7) is referred to as variable fusion in Barbero2011 but we have not found this terminology elsewhere in the literature.) The GFL (5) is a special case of (7) where (identity matrix) for all and .
Remark 1 (Intercept).
A timevarying intercept vector can be added to the regression model, yielding . While intercepts are typically not penalized in lasso regression, one must assume some sparsity in the increments for the extended model to be meaningful. Accordingly, the extended SGFL expresses as
(8) 
For simplicity of exposition, we only consider problem (7) in this paper, noting that all methods and results easily extend to (8).
The objective function in (7) has three components: a smooth function (squared loss), a nonsmooth but separable function (elastic net penalty), and a nonsmooth, nonseparable function (total variation penalty). We recall that a function is said to be (block)separable if it can be expressed as a sum of functions . All three functions are convex. Accordingly, the SGFL (7) is a nonsmooth, nonseparable convex program. Several offtheshelf methods can be found in the convex optimization literature for this type of problem, among which primaldual algorithms take a preeminent place (Condat2013; Yan2018). One could also utilize generalpurpose convex optimization tools such as proximal methods (for instance, the Dykstralike approach of Alaiz2013 can easily be adapted to (7)), ADMM and its variants, or even subgradient methods. However, these approaches do not take full advantage of the structure of (7), which may cause computational inefficiencies. In addition, these approaches aim at function minimization and not model segmentation or change point detection. As a result, they typically produce solutions for which every time is a change point and where the task of recovering the “true” underlying change points (or segments) may be nontrivial. By devising customized methods for SGFL, one may expect substantial gains in computational speed while at the same time producing welldefined model segmentations.
Contributions and organization the paper
We make the following contributions with this paper.

We introduce the sparse group fused lasso (SGFL) for model segmentation in high dimension and develop a hybrid algorithm that efficiently solves the SGFL. The algorithm produces a sequence of solutions that monotonically decrease the objective function and converge to a global minimizer. It yields exact model segmentations, as opposed to generic optimization methods that only provide approximate segmentations. Importantly, the hybrid algorithm does require any complicated selection of tuning parameters from the user.

A key component of the hybrid algorithm is an iterative softthresholding scheme for computing the proximal operator of sums of and norms. This scheme, which is shown to converge linearly, is of independent interest and can serve as a building block in other optimization problems.

We present numerical experiments that compare our hybrid approach to stateoftheart optimization methods (ADMM, linearized ADMM, primaldual methods,…) in terms of computational speed and numerical accuracy. We also illustrate SGFL with an application to air quality monitoring.

We implement the hybrid algorithm in the R package sparseGFL available at https://github.com/ddegras/sparseGFL.
The paper is organized as follows. Section 2 gives an overview of the hybrid algorithm. Section 3 details the calculations involved in each part of the algorithm. Section 4 presents numerical experiments comparing the proposed algorithm to stateoftheart approaches; it also illustrate SGFL with air quality data. Section 5 summarizes our results and outlines directions for future research. Appendix A contains a proof of linear convergence for the iterative softthresholding scheme used in the algorithm.
2 Algorithm overview
Optimization strategy
The proposed algorithm operates at different levels across iterations or cycles. By order of increasing complexity and generality, the optimization of in (7) may be conducted with respect to:

A single block ;

A chain of blocks such that (fusion chain);

All fusion chains;

All blocks.
The rationale for this hybrid optimization is to exploit problem structure for fast calculations while guaranteeing convergence to a global solution. By problem structure, we refer both to the block structure of the regression coefficients and to the piecewise nature of the regression model over the time range . The first two levels of optimization (single block and single chain) involve block coordinate descent methods that can be implemented very quickly in a serial or parallel fashion. The next level (all fusion chains) involves an active set approach: assuming to have identified the optimal model segmentation, the associated fusion chains are fixed and is minimized with respect to these chains. Denoting by the number of chains, the dimension of the search space decreases from variables to where typically . The first three levels of optimization are not sufficient to guarantee convergence to a global solution: they only establish that (i) the current solution is blockwise optimal (Tseng2001), i.e. cannot be further reduced by changing just one block in , and that (ii) the minimum of over the current model segmentation has been attained. The fourth level consists in a single iteration of the subgradient method, which is known to converge (albeit very slowly) to a global minimizer of the objective function (e.g. Bertsekas2015). Of the four optimization levels, this is the most general and most computationally intensive one.
The general strategy of the hybrid algorithm is to identify the optimal model segmentation as early as possible and then solve the associated reduced problem which involves one block of regression coefficients per segment as opposed to one block per time point. Algorithm 1 starts with block coordinate descent cycles and continues until no further progress, i.e. reduction in , is possible. It then switches to the second level and performs fusion cycles on single chains until no more progress is realized. If progress has been made in any fusion cycle, the algorithm reverts to block coordinate descent; otherwise, it moves up one level and optimizes with respect to all fusion chains. And so on so forth. At the fourth level, the only instance when no progress can be achieved is when a global minimizer has been attained, in which case the algorithm terminates. The flow of these operations is presented in Algorithm 1, the main algorithm of the paper. We now give an overview of the algorithm at each level.
2.1 Block coordinate descent
The principle of block coordinate descent is to partition the optimization variables into blocks and to optimize the objective function at each iteration with respect to a given block while keeping the other blocks fixed. In the optimization (7), time provides a natural blocking structure. Given a current solution and a time index , the problem formulates as
Eliminating terms in that do not depend on , this amounts to
(9) 
To accommodate the cases and , we set and .
Problem (9) cannot be solved in closed form. Instead, we solve it using the fast iterative softthresholding algorithm (FISTA) of Beck2009, a proximal gradient method that enjoys the accelerated convergence rate , with the number of iterations. This algorithm is described in section 3.1. The application of FISTA to (9) entails calculating the proximal operator of the sum of the lasso and total variation penalties. As a reminder, the proximal operator of a convex function is defined by . Although the proximal operator of each penalty easily obtains in closed form, determining the proximal operator of their sum is highly nontrivial. For this purpose, we develop an iterative softthresholding algorithm described in section 3.2.
The optimization (9) is repeated over a sequence of blocks and the solution is updated each time until the objective function in (7) cannot be further reduced. The order in which the blocks are selected for optimization is called the sweep pattern. Common examples of sweep patterns include cyclic (e.g. Tseng2001), cyclic permutation, (e.g. Nesterov2012), and greedy selection (e.g. Li2009). The block coordinate descent is summarized in Algorithm 2.
2.2 Fusion cycle: single chain
Because the total variation penalty in is nonsmooth and nonseparable, the block coordinate descent can get stuck in points that are blockwise optimal but not globally optimal; see Tseng2001 for a theoretical justification and Friedman2007 for an example. To overcome this difficulty, one may constrain two or more consecutive blocks to be equal and optimize with respect to their common value while keeping other blocks fixed. This fusion strategy is well suited to segmentation because it either preserves segments or merges them into larger ones. Given a current solution , the time range is partitioned into segments or fusion chains () such that the and that . By convention we set and . If , are the estimated change points of the regression model . The algorithm successively optimizes (7) over each fusion chain while enforcing the equality constraint :
where is repeated times. This works out as
(10) 
The algorithm may also try to merge two consecutive fusion chains to form a larger chain. To be precise, as follows a given sweeping pattern , the algorithm either: (i) solves (10) if and (start of a nonsingleton chain), (ii) solves (10) with each replaced by and by if and (end of a chain), or (iii) skips to the next value of in other cases. The optimization (10) is performed in the same way as the block coordinate descent (9) (FISTA + iterative softthresholding). The fusion cycle for single chains is summarized in Algorithm 3.
2.3 Fusion cycle: all chains
When no further reduction can be achieved in by changing a single block or single fusion chain in the current solution , a logical next step is to optimize with respect to all fusion chains. Specifically, one identifies the fusion chains over which is constant and optimizes with respect to all blocks under the equality constraints induced by the fusion chains:
with each repeated times. Explicitly, this amounts to
(11) 
To solve (11) we employ a version of FISTA slightly different from the one used in (9) and (10). In particular this version (Algorithm 5) operates under the requirement that for all . If two blocks and become equal during the optimization, the corresponding fusion chains and are merged and problem (11) is restarted. Details are given in section 3.3.
2.4 Checking the optimality of a solution
A vector () minimizes a convex function if and only if is a subgradient of at . (The concept of subgradient generalizes the gradient to possibly nondifferentiable convex functions.) This expresses equivalently as the membership of to the subdifferential , that is, the set of all subgradients of at . Definition, basic properties, and examples of subgradients and subdifferentials can be found in textbooks on convex analysis, e.g. Rockafellar2015.
In order to formulate the optimality conditions of the SGFL problem (7), we define the sign operator
for and extend it as a setvalued function from to in a componentwise fashion: (). Now, a vector minimizes if and only if is a subgradient at , that is, if and only if there exist vectors and satisfying
(12a)  
and  
(12b)  
for as well as  
(12c)  
for . 
By convention we take . Conditions (12b)(12c) arise from the facts that the subdifferential of the norm is the sign operator and that the subdifferential of the norm at is the unit ball of .
The optimality conditions (12a)(12b)(12c) can be checked by solving
(13) 
where , , with , and is the differencing matrix given by if , if , and otherwise. (Here we use matrix formalism to express (12a) more simply.) The sets and embody the constraints (12b) and (12c), respectively. If the minimum of (13) is zero, then is a subgradient of at and minimizes . In this case the optimization is over.
A closer examination of (12a)(12b)(12c) reveals that change points in break the global problem (13) into independent subproblems. More precisely, let be the change points induced by (assuming there is at least one) and the associated segmentation of . The constraints (12c) entirely determine the vectors (), which breaks the coupling of the separated by change points in (12a). On the other hand the constraints (12b) clearly affect each block separately. Therefore, problem (13) can be solved separately (and in parallel) on each fusion chain . We tackle (13) on each using gradient projection. We embed this method inside FISTA for faster convergence. The necessary gradient calculation and projections on and are described in section 3.4.
2.5 Subgradient step
If the attained minimum in (13) is greater than zero, then is not a minimizer of . By design of Algorithm 1 this implies that the segmentation associated with is suboptimal and that, starting from , cannot be further reduced at the first three levels of optimization. In this case, arguments that minimize (13) provide a subgradient of minimum norm. Denoting the vectorized version of by , the opposite of is a direction of steepest descent for at (e.g. Shor1985). Accordingly, at the fourth level of optimization, the algorithm takes a step in the direction with step length obtained by exact line search. The updated solution expresses as where . The subgradient step accomplishes two important things: first, it moves the optimization away from the suboptimal segmentation and second, by reducing the objective, it ensures that this segmentation will not be visited again later in the optimization. This is because Algorithm 1 is a descent method and the best solution for the segmentation has already been attained in a previous cycle of optimization – otherwise the optimality check and subgradient step would not have been performed. Since there is a finite number of segmentations of , Algorithm 1 eventually finds an optimal segmentation and an associated minimizer of through the third level of optimization.
Theorem 1.
For any starting point , the sequence generated by Algorithm 1 converges to a (global) minimizer of .
3 Computations
This section gives a detailed account of how optimization is carried out at each level (single block, single fusion chain, all fusion chains, all blocks) in Algorithm 1. We first present the fast iterative softthresholding algorithm (FISTA) of Beck2009 which we extensively use in Algorithm 1.
3.1 Fista
Beck and Teboulle Beck2009 consider the convex program
where is a smooth convex function and is a continuous convex function, possibly nonsmooth. The function is assumed to be differentiable with Lipschitzcontinuous gradient:
for all and some finite Lipschitz constant . The function is assumed to be proximable, that is, its proximal operator should be easy to calculate for all .
FISTA is an iterative method that replaces at each iteration the difficult optimization of the objective by the simpler optimization of a quadratic approximation . Given a suitable vector , the goal is to minimize
(14) 
with respect to . With a few algebraic manipulations and omitting irrelevant additive constants, can be rewritten as
so that . In other words, the minimization of is achieved through a gradient step with respect to followed by a proximal step with respect to . FISTA can thus be viewed as a proximal gradient method, also known as forwardbackward method (e.g. Combettes2011). Observing that majorizes , FISTA can also be viewed as a majorizationminimization method.
Proximal gradient methods are not new: they have been used for decades. The innovation of FISTA is to accelerate the convergence of standard proximal gradient methods by introducing an auxiliary sequence such that is a wellchosen linear combination of and , the main solution iterates. With this technique, the convergence rate of proximal gradient improves from to . Algorithm 4 presents FISTA in the case where a Lipschitz constant is prespecified and kept constant through iterations. Algorithm 5 presents FISTA in the case where is difficult to determine ahead of time and is chosen by backtracking at each iteration. This version of FISTA requires an initial guess for the Lipschitz constant as well as a factor by which to increase the candidate value in backtracking steps.
3.2 Iterative softthresholding
In this section we present a novel iterative softthresholding algorithm for computing the proximal operators required in the application of FISTA to problems (9) and (10). We first examine the case of (9) (block coordinate descent) and then show how to adapt the algorithm to (10) (optimization of with respect to a single fusion chain). Of crucial importance is the softthresholding operator
where and is a threshold. This operator accommodates vector arguments in a componentwise fashion: ().
Checking for simple solutions.
It is advantageous to verify whether or solves (9) before applying FISTA, which is more computationally demanding. The optimality conditions for (9) are very similar to those for the global problem (7), namely (12a)(12b)(12c), although of course the conditions for (9) pertain to a single time . Hereafter we state these conditions in an easily computable form. Let be defined in a componentwise fashion by
If , this vector solves (9) if and only if
Comments
There are no comments yet.