Sparse Group Fused Lasso for Model Segmentation

12/16/2019, by David Degras, et al.

This article introduces the sparse group fused lasso (SGFL) as a statistical framework for segmenting high dimensional regression models. To compute solutions of the SGFL, a nonsmooth and nonseparable convex program, we develop a hybrid optimization method that is fast, requires no tuning parameter selection, and is guaranteed to converge to a global minimizer. In numerical experiments, the hybrid method compares favorably to state-of-the-art techniques both in terms of computation time and accuracy; benefits are particularly substantial in high dimension. The hybrid method is implemented in the R package sparseGFL available on the author's Github page. The SGFL framework, presented here in the context of multivariate time series, can be extended to multichannel images and data collected over graphs.


1 Introduction

In the analysis of complex signals, using a single statistical model with a fixed set of parameters is rarely enough to track data variations over their entire range. In long and/or high-dimensional time series for example, the presence of nonstationarity, either in the form of slowly drifting dynamics or of abrupt regime changes, requires that statistical models flexibly account for temporal variations in signal characteristics. To overcome the intrinsic limitations of approaches based on a single model vis-à-vis heterogeneous and nonstationary signals, model segmentation techniques have been successfully employed in various fields including image processing (Alaiz2013; Friedman2007), genetics (Bleakley2011; Tibshirani2007), brain imaging (Beer2019; Xu2015), finance (Hallac2019; Nystrup2017), industrial monitoring (Saxen2016), oceanography (Ranalli2018), seismology (Ohlsson2010), and ecology (Alewijnse2018). Model segmentation consists in partitioning the domain of the signal (e.g. the temporal range of a time series or the lattice of a digital image) into a small number of segments or regions such that for each segment, the data are suitably represented with a single model. The models used to segment the data are typically of the same type (e.g. linear model) but differ by their parameters. The task of model segmentation is closely related to change point detection and is commonly referred to as (hybrid or time-varying) system identification in the engineering literature.

This work considers model segmentation in the following setup:

  • Structured multivariate data.

    The observed data are multivariate predictor and response variables measured over a time grid, spatial lattice, or more generally a graph.

  • Regression.

    Predictor and response variables are related through a regression model, e.g. a linear model, generalized linear model, or vector autoregressive model.

  • High dimension. There are far more predictors than response variables. However, at each measurement point, the responses only depend on a small number of predictors.

For simplicity, we present our methods and results in the context of linear regression with time series data, keeping in mind that our work readily extends to other regression models and graph structures. Let (y_t) and (X_t), t = 1, …, T, be multivariate time series where y_t is a response vector and X_t a predictor matrix. We consider the time-varying multivariate linear model y_t = X_t β_t + ε_t, where β_t is an unknown regression vector and ε_t a random vector with mean zero. As noted above, we assume that there are far more predictors than responses, that the β_t are sparse, and that β_t = β_{t+1} for most values of t; that is, (β_t) is a piecewise constant function of t with few change points. Our goal is to develop efficient computational methods for estimating β_1, …, β_T and the associated change points.
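To fix ideas, the following minimal R sketch (not taken from the paper or the sparseGFL package) simulates data from such a model; the dimensions, change points, sparsity pattern, and noise level are illustrative assumptions.

  ## Simulate y_t = X_t %*% beta_t + eps_t with sparse, piecewise constant coefficients
  set.seed(1)
  Tn <- 100; d <- 5; p <- 50                 # time points, responses, predictors (illustrative)
  cp <- c(40, 70)                            # assumed change points
  beta <- matrix(0, p, 3)                    # one sparse coefficient vector per segment
  beta[1:5, 1] <- 1; beta[3:8, 2] <- -1; beta[6:10, 3] <- 2
  seg <- findInterval(1:Tn, c(1, cp + 1))    # segment label of each time point
  X <- lapply(1:Tn, function(t) matrix(rnorm(d * p), d, p))
  y <- lapply(1:Tn, function(t) X[[t]] %*% beta[, seg[t]] + rnorm(d, sd = 0.5))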

Before introducing the optimization problem at the core of this study, namely the sparse group fused lasso (SGFL), we review relevant work on model segmentation, change point detection, and structured sparse regression.

Related work

We first introduce some notation. Throughout the paper, ‖·‖_q denotes the standard ℓ_q norm. For convenience, we use the same notation for the regression coefficients whether they are viewed as a single stacked vector or as a sequence of vectors, one per time point.
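For concreteness, the norms used below can be computed in R as follows (a trivial reference sketch, not package code).

  ## l_q norms of a numeric vector x
  l1   <- function(x) sum(abs(x))
  l2   <- function(x) sqrt(sum(x^2))
  linf <- function(x) max(abs(x))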

Combinatorial approaches to change point detection

There is an extensive literature on change point detection spanning multiple fields and decades, which we only very partially describe here. For estimating changes in linear regression models, if the number of segments (or equivalently the number of change points) is fixed, the segmentation problem can be expressed as

(1)

For a given set of change points, the minimizing argument and associated objective value are obtained by ordinary least squares regression. Accordingly, the optimization reduces to a combinatorial problem over the change point locations that is solvable by dynamic programming (Bai2003). This technique is computationally demanding as it requires performing a large number of linear regressions before carrying out the dynamic program per se; the time spent in linear regression can however be reduced through recursive calculations. A fundamental instance of model segmentation in (1) occurs when the design matrix is the identity matrix. In this case the problem is to approximate the signal itself with a piecewise constant function.
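The dynamic program mentioned above can be sketched in R as follows. The sketch assumes a precomputed matrix cost with cost[s, t] equal to the residual sum of squares of an OLS fit on times s through t, and a prespecified number of segments K ≥ 2; it is illustrative and not the implementation of Bai2003.

  ## Best segmentation of 1..Tn into K segments given per-segment costs
  dp_segment <- function(cost, K) {
    Tn <- ncol(cost)                        # number of time points
    V <- matrix(Inf, K, Tn)                 # V[k, t]: best cost of splitting 1..t into k segments
    idx <- matrix(NA_integer_, K, Tn)       # start of the last segment in that best split
    V[1, ] <- cost[1, ]
    for (k in 2:K) for (t in k:Tn) {
      cand <- V[k - 1, (k - 1):(t - 1)] + cost[k:t, t]   # last segment starts at s = k, ..., t
      s <- which.min(cand)
      V[k, t] <- cand[s]
      idx[k, t] <- k - 1 + s
    }
    starts <- integer(K); tt <- Tn          # backtrack the segment starts
    for (k in K:2) { starts[k] <- idx[k, tt]; tt <- starts[k] - 1 }
    starts[1] <- 1
    list(value = V[K, Tn], starts = starts)
  }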

If the number of segments is not prespecified, one may add a penalty function to (1) so as to strike a compromise between fitting the data and keeping the model complexity low. Examples of such penalty functions include the Akaike Information Criterion (AIC), the Bayes Information Criterion (BIC), as well as more recent variants for high-dimensional data (Yao1988; Chen2008). Another way to select the number of change points is to add or remove change points based on statistical tests or other criteria in top-down or bottom-up approaches. See Basseville1993 for a classical book on statistical change point detection and Truong2018 for a more recent survey. Readers interested in the popular method of binary segmentation may also consult Bai1997; Fryzlewicz2014; Leonardi2016.

Total variation penalty methods

Studying the piecewise constant approximation of 1-D signals, Friedman2007 utilize a convex relaxation of (1) called the fused lasso signal approximation (FLSA):

(2)

Here, hard constraints or penalties on the number of segments are replaced by a penalty on the increments between successive coefficients. This total variation penalty promotes flatness in the coefficient profile, that is, a small number of change points. The penalty on the coefficients themselves, called a lasso penalty in the statistical literature, favors sparsity. The regularization parameters determine a balance between fidelity to the data, sparsity of the coefficients, and number of change points. They can be specified by the user or selected from the data, for example by cross-validation. Friedman2007 derive an efficient coordinate descent method to calculate the solution to (2) along a path of values of the regularization parameters. Their method can also be applied to the more general problem of fused lasso regression

(3)

where each response is regressed on its own vector of predictors, although the method is not guaranteed to yield a global minimizer in this case. One may recover the FLSA (2) by taking the predictor vectors in (3) to be the canonical basis vectors. More recent approaches to fused lasso regression include Hoefling2010b, Liu2010, and Wang2015.
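As an illustration, the FLSA objective (2) can be evaluated in R as below, assuming the standard form with a squared-error data term, a lasso penalty, and an ℓ1 total variation penalty on successive differences (the display itself is not reproduced in this extract).

  ## FLSA objective, assumed form:
  ## 0.5 * ||y - beta||^2 + lam1 * sum(|beta_t|) + lam2 * sum(|beta_t - beta_{t-1}|)
  flsa_objective <- function(beta, y, lam1, lam2) {
    0.5 * sum((y - beta)^2) + lam1 * sum(abs(beta)) + lam2 * sum(abs(diff(beta)))
  }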

The FLSA and fused lasso can easily be adapted to the multivariate setup as follows:

(4)

These approaches are however not suitable for segmenting multivariate signals/models as they typically produce change points that are only shared by a few predictor variables. This is because the ℓ1 norm in the total variation penalty affects each of the predictors separately. A simple way to induce change points common to all predictors is to replace this norm by an ℓ_q norm with q > 1. Indeed, for q > 1, the ℓ_q norm is differentiable everywhere except at the origin, which promotes equality of entire consecutive coefficient vectors. Typically, for the model estimate to have a change point at a given time, a jump of at least modest size must occur in a significant fraction of the time-varying regression coefficients between that time and the next. Due to its computational simplicity, the ℓ2 norm is often used in practice. For example, a common approach to denoising multivariate signals is to solve

(5)

where the are positive weights. Bleakley2011 reformulate this problem as a group lasso regression and apply the group LARS algorithm (see Yuan2006) to efficiently find solution paths as varies. Wytock2014 propose Newton-type methods for (5) that extend to multichannel images. These two papers refer to problem (5) as the group fused lasso (GFL).

To segment multivariate regression models with group sparsity structure, Alaiz2013 consider a generalization of (5) that they also call group fused lasso:

(6)

They handle the optimization with a proximal splitting method similar to Dykstra’s projection algorithm. Songsiri2015 studies (6) in the context of vector autoregressive models, using the well-known alternating direction method of multipliers (ADMM). See e.g. Combettes2011 for an overview of proximal methods and ADMM.

Sparse Group Fused Lasso

Under our assumptions, the set of regression coefficients in the time-varying model is sparse and piecewise constant with few change points. To enforce these assumptions in fitting the model to data, we propose to solve

(7)

Problem (7) has common elements with the fused lasso (4) and the group fused lasso (6) but the three problems are distinct and not reducible to one another. For example, (4) uses an ℓ1 total variation penalty whereas (7) uses an ℓ2 total variation penalty to promote blockwise equality of consecutive coefficient vectors. Also, unlike (6), which exploits a group penalty to induce group sparsity in the coefficients, (7) features a standard lasso penalty. To distinguish (7) from the group fused lasso problems (5)-(6), we call it the sparse group fused lasso (SGFL). (Problem (7) is referred to as variable fusion in Barbero2011 but we have not found this terminology elsewhere in the literature.) The GFL (5) is a special case of (7) in which each design matrix is the identity and the lasso penalty is dropped.
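For concreteness, the following R sketch evaluates an SGFL-type objective of the form assumed here: a squared loss per time point, a lasso penalty on the coefficients, and an ℓ2 total variation penalty on successive coefficient differences. Weights and scaling conventions may differ from the paper's exact formulation.

  ## X: list of d x p design matrices, y: list of d-vectors, B: p x Tn coefficient matrix
  sgfl_objective <- function(B, X, y, lam1, lam2) {
    Tn <- ncol(B)
    fit   <- sum(sapply(1:Tn, function(t) 0.5 * sum((y[[t]] - X[[t]] %*% B[, t])^2)))
    lasso <- lam1 * sum(abs(B))
    tv    <- lam2 * sum(sqrt(colSums((B[, -1, drop = FALSE] - B[, -Tn, drop = FALSE])^2)))
    fit + lasso + tv
  }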

Remark 1 (Intercept).

A time-varying intercept vector α_t can be added to the regression model, yielding y_t = α_t + X_t β_t + ε_t. While intercepts are typically not penalized in lasso regression, one must assume some sparsity in the increments α_{t+1} − α_t for the extended model to be meaningful. Accordingly, the extended SGFL can be expressed as

(8)

For simplicity of exposition, we only consider problem (7) in this paper, noting that all methods and results easily extend to (8).

The objective function in (7) has three components: a smooth function (squared loss), a nonsmooth but separable function (elastic net penalty), and a nonsmooth, nonseparable function (total variation penalty). We recall that a function is said to be (block-)separable if it can be expressed as a sum of functions of disjoint blocks of variables. All three functions are convex. Accordingly, the SGFL (7) is a nonsmooth, nonseparable convex program. Several off-the-shelf methods can be found in the convex optimization literature for this type of problem, among which primal-dual algorithms take a preeminent place (Condat2013; Yan2018). One could also utilize general-purpose convex optimization tools such as proximal methods (for instance, the Dykstra-like approach of Alaiz2013 can easily be adapted to (7)), ADMM and its variants, or even subgradient methods. However, these approaches do not take full advantage of the structure of (7), which may cause computational inefficiencies. In addition, these approaches aim at function minimization rather than model segmentation or change point detection. As a result, they typically produce solutions for which every time point is a change point and where the task of recovering the “true” underlying change points (or segments) may be nontrivial. By devising customized methods for the SGFL, one may expect substantial gains in computational speed while at the same time producing well-defined model segmentations.

Contributions and organization of the paper

We make the following contributions with this paper.

  1. We introduce the sparse group fused lasso (SGFL) for model segmentation in high dimension and develop a hybrid algorithm that efficiently solves the SGFL. The algorithm produces a sequence of solutions that monotonically decrease the objective function and converge to a global minimizer. It yields exact model segmentations, as opposed to generic optimization methods that only provide approximate segmentations. Importantly, the hybrid algorithm does not require any complicated selection of tuning parameters from the user.

  2. A key component of the hybrid algorithm is an iterative soft-thresholding scheme for computing the proximal operator of sums of and norms. This scheme, which is shown to converge linearly, is of independent interest and can serve as a building block in other optimization problems.

  3. We present numerical experiments that compare our hybrid approach to state-of-the-art optimization methods (ADMM, linearized ADMM, primal-dual methods,…) in terms of computational speed and numerical accuracy. We also illustrate SGFL with an application to air quality monitoring.

  4. We implement the hybrid algorithm in the R package sparseGFL available at https://github.com/ddegras/sparseGFL.

The paper is organized as follows. Section 2 gives an overview of the hybrid algorithm. Section 3 details the calculations involved in each part of the algorithm. Section 4 presents numerical experiments comparing the proposed algorithm to state-of-the-art approaches; it also illustrates SGFL with air quality data. Section 5 summarizes our results and outlines directions for future research. Appendix A contains a proof of linear convergence for the iterative soft-thresholding scheme used in the algorithm.

2 Algorithm overview

Optimization strategy

The proposed algorithm operates at different levels across iterations or cycles. By order of increasing complexity and generality, the optimization of in (7) may be conducted with respect to:

  1. A single block ;

  2. A chain of blocks such that (fusion chain);

  3. All fusion chains;

  4. All blocks.

The rationale for this hybrid optimization is to exploit problem structure for fast calculations while guaranteeing convergence to a global solution. By problem structure, we refer both to the block structure of the regression coefficients and to the piecewise nature of the regression model over the time range . The first two levels of optimization (single block and single chain) involve block coordinate descent methods that can be implemented very quickly in a serial or parallel fashion. The next level (all fusion chains) involves an active set approach: assuming to have identified the optimal model segmentation, the associated fusion chains are fixed and is minimized with respect to these chains. Denoting by the number of chains, the dimension of the search space decreases from variables to where typically . The first three levels of optimization are not sufficient to guarantee convergence to a global solution: they only establish that (i) the current solution is blockwise optimal (Tseng2001), i.e. cannot be further reduced by changing just one block in , and that (ii) the minimum of over the current model segmentation has been attained. The fourth level consists in a single iteration of the subgradient method, which is known to converge (albeit very slowly) to a global minimizer of the objective function (e.g. Bertsekas2015). Of the four optimization levels, this is the most general and most computationally intensive one.

The general strategy of the hybrid algorithm is to identify the optimal model segmentation as early as possible and then solve the associated reduced problem, which involves one block of regression coefficients per segment as opposed to one block per time point. Algorithm 1 starts with block coordinate descent cycles and continues until no further progress, i.e. reduction in the objective, is possible. It then switches to the second level and performs fusion cycles on single chains until no more progress is realized. If progress has been made in any fusion cycle, the algorithm reverts to block coordinate descent; otherwise, it moves up one level and optimizes with respect to all fusion chains, and so forth. At the fourth level, the only instance when no progress can be achieved is when a global minimizer has been attained, in which case the algorithm terminates. The flow of these operations is presented in Algorithm 1, the main algorithm of the paper. We now give an overview of the algorithm at each level.

0:  Starting point , regularization parameters , tolerance
0:  
   true, true
  
  repeat
     while  true do
        
        Apply Algorithm 2 to and output {Block descent}
        if  then
            false
        end if
     end while
     while  true do
        
        Apply Algorithm 3 to and output {Fusion: single chains}
        if  then
            true, true
        else
            false
        end if
     end while
     if  false and false then
        
        Apply Algorithm 5 to and output {Fusion: all chains}
        Apply Algorithm 4 to and output subgradient
        if   then
           
           
            {Subgradient step}
        end if
     end if
  until 
Algorithm 1 Sparse Group Fused Lasso
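The R skeleton below mirrors the control flow described above. It is a hedged sketch only: block_descent, fuse_single_chains, fuse_all_chains, min_norm_subgradient, and exact_line_search are placeholders standing in for Algorithms 2-5 and are not functions of the sparseGFL package.

  ## Control-flow sketch of the hybrid strategy (placeholders, not package code)
  sgfl_hybrid <- function(B, obj, tol = 1e-8) {
    repeat {
      repeat {                                  # level 1: block coordinate descent
        out <- block_descent(B)
        if (obj(out) >= obj(B) - tol) break
        B <- out
      }
      progressed <- FALSE
      repeat {                                  # level 2: fusion cycles on single chains
        out <- fuse_single_chains(B)
        if (obj(out) >= obj(B) - tol) break
        B <- out; progressed <- TRUE
      }
      if (progressed) next                      # revert to block coordinate descent
      B <- fuse_all_chains(B)                   # level 3: optimize over all fusion chains
      g <- min_norm_subgradient(B)              # level 4: optimality check, cf. (13)
      if (sqrt(sum(g^2)) <= tol) return(B)      # zero subgradient: global minimizer reached
      B <- B - exact_line_search(B, g) * g      # otherwise take a subgradient step
    }
  }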

2.1 Block coordinate descent

The principle of block coordinate descent is to partition the optimization variables into blocks and to optimize the objective function at each iteration with respect to a given block while keeping the other blocks fixed. In the optimization (7), time provides a natural blocking structure. Given a current solution and a time index, the problem can be formulated as

Eliminating terms in that do not depend on , this amounts to

(9)

To accommodate the cases and , we set and .

Problem (9) cannot be solved in closed form. Instead, we solve it using the fast iterative soft-thresholding algorithm (FISTA) of Beck2009, a proximal gradient method that enjoys the accelerated convergence rate O(1/k²), with k the number of iterations. This algorithm is described in section 3.1. The application of FISTA to (9) entails calculating the proximal operator of the sum of the lasso and total variation penalties. As a reminder, the proximal operator of a convex function g is defined by prox_g(v) = argmin_x { g(x) + ½ ‖x − v‖² }. Although the proximal operator of each penalty is easily obtained in closed form, determining the proximal operator of their sum is highly nontrivial. For this purpose, we develop an iterative soft-thresholding algorithm described in section 3.2.
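As a reference point, the proximal operators of the two individual penalties have the following well-known closed forms in R (soft-thresholding for the lasso penalty, block soft-thresholding for an ℓ2 norm); the contribution of section 3.2 is precisely to handle their sum, which has no such closed form.

  ## prox of lam*||.||_1: componentwise soft-thresholding
  prox_l1 <- function(v, lam) sign(v) * pmax(abs(v) - lam, 0)
  ## prox of lam*||.||_2: shrink the whole vector toward the origin
  prox_l2 <- function(v, lam) {
    nv <- sqrt(sum(v^2))
    if (nv <= lam) v * 0 else (1 - lam / nv) * v
  }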

The optimization (9) is repeated over a sequence of blocks and the solution is updated each time until the objective function in (7) cannot be further reduced. The order in which the blocks are selected for optimization is called the sweep pattern. Common examples of sweep patterns include cyclic (e.g. Tseng2001), cyclic permutation (e.g. Nesterov2012), and greedy selection (e.g. Li2009). The block coordinate descent is summarized in Algorithm 2.

0:  , sweeping pattern
0:  
  
  for  do
     Check (15)-(16)-(17) for a simple solution to (9)
     if simple solution then
         or as required
     else {FISTA}
        Set
        Apply Algorithm 4 to with starting point , Lipschitz constant , and given by (21)-(22). Output
        
     end if
     
  end for
Algorithm 2 Block Coordinate Descent

2.2 Fusion cycle: single chain

Because the total variation penalty in is nonsmooth and nonseparable, the block coordinate descent can get stuck in points that are blockwise optimal but not globally optimal; see Tseng2001 for a theoretical justification and Friedman2007 for an example. To overcome this difficulty, one may constrain two or more consecutive blocks to be equal and optimize with respect to their common value while keeping other blocks fixed. This fusion strategy is well suited to segmentation because it either preserves segments or merges them into larger ones. Given a current solution , the time range is partitioned into segments or fusion chains () such that the and that . By convention we set and . If , are the estimated change points of the regression model . The algorithm successively optimizes (7) over each fusion chain while enforcing the equality constraint :

where is repeated times. This works out as

(10)

The algorithm may also try to merge two consecutive fusion chains to form a larger chain. To be precise, as it follows a given sweeping pattern, the algorithm either: (i) solves (10) if and (start of a non-singleton chain), (ii) solves (10) with each replaced by and by if and (end of a chain), or (iii) skips to the next value of in other cases. The optimization (10) is performed in the same way as the block coordinate descent (9) (FISTA + iterative soft-thresholding). The fusion cycle for single chains is summarized in Algorithm 3.

0:  , sweeping pattern
0:  
  
  Determine fusion chains and chain starts
  for  do
     if  for some and  then {chain start}
        
     else if  for some  then {chain end}
        
     else {chain interior}
        Skip to next
     end if
     Set
     Set
     Check (24)-(25)-(26) for a simple solution to
     if simple solution then
         or as required
     else {FISTA}
        Apply Algorithm 4 to with starting point , Lipschitz constant , and given by (27). Output
     end if
     
     if  and  then {merge and }
        Remove from , set , relabel chain starts as , and set
     end if
      and for
  end for
Algorithm 3 Fusion Cycle: Single Chain
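As a small illustration of the chain structure, the following R sketch recovers the fusion chains (maximal runs of equal consecutive coefficient vectors) from a coefficient matrix with one column per time point; the numerical tolerance is an illustrative choice.

  ## Fusion chains of a p x Tn coefficient matrix B
  fusion_chains <- function(B, tol = 1e-10) {
    Tn <- ncol(B)
    new_chain <- c(TRUE, colSums(abs(B[, -1, drop = FALSE] - B[, -Tn, drop = FALSE])) > tol)
    split(1:Tn, cumsum(new_chain))     # list of time indices, one element per chain
  }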

2.3 Fusion cycle: all chains

When no further reduction can be achieved in by changing a single block or single fusion chain in the current solution , a logical next step is to optimize with respect to all fusion chains. Specifically, one identifies the fusion chains over which is constant and optimizes with respect to all blocks under the equality constraints induced by the fusion chains:

with each repeated times. Explicitly, this amounts to

(11)

To solve (11) we employ a version of FISTA slightly different from the one used in (9) and (10). In particular this version (Algorithm 5) operates under the requirement that for all . If two blocks and become equal during the optimization, the corresponding fusion chains and are merged and problem (11) is restarted. Details are given in section 3.3.

2.4 Checking the optimality of a solution

A vector minimizes a convex function if and only if 0 is a subgradient of the function at this vector. (The concept of subgradient generalizes the gradient to possibly nondifferentiable convex functions.) Equivalently, 0 must belong to the subdifferential, that is, the set of all subgradients of the function at the point considered. Definitions, basic properties, and examples of subgradients and subdifferentials can be found in textbooks on convex analysis, e.g. Rockafellar2015.

In order to formulate the optimality conditions of the SGFL problem (7), we define the sign operator

for and extend it as a set-valued function from to in a componentwise fashion: (). Now, a vector minimizes if and only if is a subgradient at , that is, if and only if there exist vectors and satisfying

(12a)
and
(12b)
for as well as
(12c)
for .

By convention we take . Conditions (12b)-(12c) arise from the facts that the subdifferential of the ℓ1 norm is given by the sign operator and that the subdifferential of the ℓ2 norm at the origin is the unit ℓ2 ball.

The optimality conditions (12a)-(12b)-(12c) can be checked by solving

(13)

where , , with , and is the differencing matrix given by if , if , and otherwise. (Here we use matrix formalism to express (12a) more simply.) The sets and embody the constraints (12b) and (12c), respectively. If the minimum of (13) is zero, then is a subgradient of at and minimizes . In this case the optimization is over.

A closer examination of (12a)-(12b)-(12c) reveals that change points in break the global problem (13) into independent subproblems. More precisely, let be the change points induced by (assuming there is at least one) and the associated segmentation of . The constraints (12c) entirely determine the vectors (), which breaks the coupling of the separated by change points in (12a). On the other hand the constraints (12b) clearly affect each block separately. Therefore, problem (13) can be solved separately (and in parallel) on each fusion chain . We tackle (13) on each using gradient projection. We embed this method inside FISTA for faster convergence. The necessary gradient calculation and projections on and are described in section 3.4.
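Assuming the constraint sets take the standard forms of the ℓ1 and ℓ2 subdifferentials described above, the two projections needed by the gradient projection step can be sketched in R as follows.

  ## Lasso subgradient set: s_j = sign(beta_j) if beta_j != 0, s_j in [-1, 1] otherwise
  project_sign_set <- function(s, beta) {
    ifelse(beta != 0, sign(beta), pmin(pmax(s, -1), 1))
  }
  ## TV subgradient set: unit l2 ball
  project_l2_ball <- function(u) u / max(1, sqrt(sum(u^2)))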

2.5 Subgradient step

If the attained minimum in (13) is greater than zero, then the current solution is not a minimizer of the objective. By design of Algorithm 1 this implies that the segmentation associated with the current solution is suboptimal and that, starting from this solution, the objective cannot be further reduced at the first three levels of optimization. In this case, arguments that minimize (13) provide a subgradient of minimum norm. The opposite of this minimum-norm subgradient is a direction of steepest descent for the objective at the current solution (e.g. Shor1985). Accordingly, at the fourth level of optimization, the algorithm takes a step in this direction, with the step length obtained by exact line search, to produce the updated solution. The subgradient step accomplishes two important things: first, it moves the optimization away from the suboptimal segmentation and second, by reducing the objective, it ensures that this segmentation will not be visited again later in the optimization. This is because Algorithm 1 is a descent method and the best solution for the segmentation has already been attained in a previous cycle of optimization; otherwise the optimality check and subgradient step would not have been performed. Since there is a finite number of segmentations of the time range, Algorithm 1 eventually finds an optimal segmentation and an associated minimizer of the objective through the third level of optimization.
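A schematic version of the subgradient step is sketched below; the paper uses an exact line search, for which the base R optimize() routine serves here only as a simple numerical stand-in, and the upper bound on the step length is an illustrative assumption.

  ## Move from b in the direction -g, choosing the step length by a 1-D search on obj
  subgradient_step <- function(b, g, obj, smax = 10) {
    step <- optimize(function(s) obj(b - s * g), interval = c(0, smax))$minimum
    b - step * g
  }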

Theorem 1.

For any starting point , the sequence generated by Algorithm 1 converges to a (global) minimizer of .

3 Computations

This section gives a detailed account of how optimization is carried out at each level (single block, single fusion chain, all fusion chains, all blocks) in Algorithm 1. We first present the fast iterative soft-thresholding algorithm (FISTA) of Beck2009 which we extensively use in Algorithm 1.

3.1 Fista

Beck and Teboulle Beck2009 consider the convex program

where is a smooth convex function and is a continuous convex function, possibly nonsmooth. The function is assumed to be differentiable with Lipschitz-continuous gradient:

for all and some finite Lipschitz constant . The function is assumed to be proximable, that is, its proximal operator should be easy to calculate for all .

FISTA is an iterative method that replaces at each iteration the difficult optimization of the objective by the simpler optimization of a quadratic approximation . Given a suitable vector , the goal is to minimize

(14)

with respect to . With a few algebraic manipulations and omitting irrelevant additive constants, can be rewritten as

so that . In other words, the minimization of is achieved through a gradient step with respect to followed by a proximal step with respect to . FISTA can thus be viewed as a proximal gradient method, also known as forward-backward method (e.g. Combettes2011). Observing that majorizes , FISTA can also be viewed as a majorization-minimization method.

Proximal gradient methods are not new: they have been used for decades. The innovation of FISTA is to accelerate the convergence of standard proximal gradient methods by introducing an auxiliary sequence whose terms are well-chosen linear combinations of the main solution iterates. With this technique, the convergence rate of proximal gradient improves from O(1/k) to O(1/k²). Algorithm 4 presents FISTA in the case where a Lipschitz constant is prespecified and kept constant through iterations. Algorithm 5 presents FISTA in the case where the Lipschitz constant is difficult to determine ahead of time and is instead chosen by backtracking at each iteration. This version of FISTA requires an initial guess for the Lipschitz constant as well as a factor by which to increase the candidate value in backtracking steps.

0:  , Lipschitz constant
0:  
  ,
  for  do
     
     
     
  end for
Algorithm 4 FISTA with constant step size
0:  , ,
0:  
  ,
  for  do
     
     repeat
        
        
        
     until 
     
     
     
  end for
Algorithm 5 FISTA with backtracking
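For reference, a bare-bones R version of FISTA with a constant step size 1/L might look as follows; grad_f and prox_g are user-supplied functions for the smooth and nonsmooth parts, and the sketch is generic rather than specific to the SGFL subproblems.

  ## FISTA with constant step size 1/L for F(x) = f(x) + g(x)
  fista <- function(x0, grad_f, prox_g, L, niter = 100) {
    x <- x0; z <- x0; tk <- 1
    for (k in 1:niter) {
      x_new <- prox_g(z - grad_f(z) / L, 1 / L)        # gradient step on f, proximal step on g
      t_new <- (1 + sqrt(1 + 4 * tk^2)) / 2            # momentum parameter update
      z     <- x_new + ((tk - 1) / t_new) * (x_new - x)
      x <- x_new; tk <- t_new
    }
    x
  }
  ## Example use for a lasso problem 0.5*||A x - b||^2 + lam*||x||_1:
  ## fista(rep(0, ncol(A)), function(x) crossprod(A, A %*% x - b),
  ##       function(v, s) sign(v) * pmax(abs(v) - s * lam, 0), L = norm(A, "2")^2)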

3.2 Iterative soft-thresholding

In this section we present a novel iterative soft-thresholding algorithm for computing the proximal operators required in the application of FISTA to problems (9) and (10). We first examine the case of (9) (block coordinate descent) and then show how to adapt the algorithm to (10) (optimization of the objective with respect to a single fusion chain). Of crucial importance is the soft-thresholding operator S_λ(x) = sign(x) max(|x| − λ, 0), where x is a real number and λ ≥ 0 is a threshold. This operator accommodates vector arguments in a componentwise fashion.

Checking for simple solutions.

It is advantageous to verify whether or solves (9) before applying FISTA, which is more computationally demanding. The optimality conditions for (9) are very similar to those for the global problem (7), namely (12a)-(12b)-(12c), although of course the conditions for (9) pertain to a single time . Hereafter we state these conditions in an easily computable form. Let be defined in a componentwise fashion by

If , this vector solves (9) if and only if