# Convex and Network Flow Optimization for Structured Sparsity

We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of l_2- or l_infinity-norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of l_infinity-norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches.

## Authors

• 51 publications
• 21 publications
• 18 publications
• 119 publications
• ### Network Flow Algorithms for Structured Sparsity

We consider a class of learning problems that involve a structured spars...
08/31/2010 ∙ by Julien Mairal, et al. ∙ 0

• ### Proximal Methods for Hierarchical Sparse Coding

Sparse coding consists in representing signals as sparse linear combinat...
09/11/2010 ∙ by Rodolphe Jenatton, et al. ∙ 0

• ### A General Framework for Structured Sparsity via Proximal Optimization

We study a generalized framework for structured sparsity. It extends the...
06/26/2011 ∙ by Andreas Argyriou, et al. ∙ 0

• ### Efficient Proximal Mapping Computation for Unitarily Invariant Low-Rank Inducing Norms

Low-rank inducing unitarily invariant norms have been introduced to conv...
10/17/2018 ∙ by Christian Grussler, et al. ∙ 0

• ### Structured sparsity-inducing norms through submodular functions

Sparse methods for supervised learning aim at finding good linear predic...
08/25/2010 ∙ by Francis Bach, et al. ∙ 0

• ### A Convex Functional for Image Denoising based on Patches with Constrained Overlaps and its vectorial application to Low Dose Differential Phase Tomography

We solve the image denoising problem with a dictionary learning techniqu...
05/06/2013 ∙ by Alessandro Mirone, et al. ∙ 0

• ### Structured Sparse Regression via Greedy Hard-Thresholding

Several learning applications require solving high-dimensional regressio...
02/19/2016 ∙ by Prateek Jain, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Sparse linear models have become a popular framework for dealing with various unsupervised and supervised tasks in machine learning and signal processing. In such models, linear combinations of small sets of variables are selected to describe the data. Regularization by the

-norm has emerged as a powerful tool for addressing this variable selection problem, relying on both a well-developed theory (see Tibshirani, 1996; Chen et al., 1999; Mallat, 1999; Bickel et al., 2009; Wainwright, 2009, and references therein) and efficient algorithms (Efron et al., 2004; Nesterov, 2007; Beck and Teboulle, 2009; Needell and Tropp, 2009; Combettes and Pesquet, 2010).

The -norm primarily encourages sparse solutions, regardless of the potential structural relationships (e.g., spatial, temporal or hierarchical) existing between the variables. Much effort has recently been devoted to designing sparsity-inducing regularizations capable of encoding higher-order information about the patterns of non-zero coefficients (Cehver et al., 2008; Jenatton et al., 2009; Jacob et al., 2009; Zhao et al., 2009; He and Carin, 2009; Huang et al., 2009; Baraniuk et al., 2010; Micchelli et al., 2010), with successful applications in bioinformatics (Jacob et al., 2009; Kim and Xing, 2010), topic modeling (Jenatton et al., 2010a, 2011)

(Cehver et al., 2008; Huang et al., 2009; Jenatton et al., 2010b). By considering sums of norms of appropriate subsets, or groups, of variables, these regularizations control the sparsity patterns of the solutions. The underlying optimization is usually difficult, in part because it involves nonsmooth components.

Our first strategy uses proximal gradient methods, which have proven to be effective in this context, essentially because of their fast convergence rates and their ability to deal with large problems (Nesterov, 2007; Beck and Teboulle, 2009)

. They can handle differentiable loss functions with Lipschitz-continuous gradient, and we show in this paper how to use them with a regularization term composed of a sum of

-norms. The second strategy we consider exploits proximal splitting methods (see Combettes and Pesquet, 2008, 2010; Goldfarg and Ma, 2009; Tomioka et al., 2011; Qin and Goldfarb, 2011; Boyd et al., 2011, and references therein), which builds upon an equivalent formulation with non-overlapping groups, but in a higher dimensional space and with additional constraints.111The idea of using this class of algorithms for solving structured sparse problems was first suggested to us by Jean-Christophe Pesquet and Patrick-Louis Combettes. It was also suggested to us later by Ryota Tomioka, who briefly mentioned this possibility in (Tomioka et al., 2011). It can also briefly be found in (Boyd et al., 2011), and in details in the work of Qin and Goldfarb (2011) which was conducted as the same time as ours. It was also used in a related context by Sprechmann et al. (2010) for solving optimization problems with hierarchical norms. More precisely, we make four main contributions:

• We show that the proximal operator associated with the sum of -norms with overlapping groups can be computed efficiently and exactly by solving a quadratic min-cost flow problem, thereby establishing a connection with the network flow optimization literature.222Interestingly, this is not the first time that network flow optimization tools have been used to solve sparse regularized problems with proximal methods. Such a connection was recently established by Chambolle and Darbon (2009) in the context of total variation regularization, and similarly by Hoefling (2010) for the fused Lasso. One can also find the use of maximum flow problems for non-convex penalties in the work of Cehver et al. (2008) which combines Markov random fields and sparsity. This is the main contribution of the paper, which allows us to use proximal gradient methods in the context of structured sparsity.

• We prove that the dual norm of the sum of -norms can also be evaluated efficiently, which enables us to compute duality gaps for the corresponding optimization problems.

• We present proximal splitting methods for solving structured sparse regularized problems.

• We demonstrate that our methods are relevant for various applications whose practical success is made possible by our algorithmic tools and efficient implementations. First, we introduce a new CUR matrix factorization technique exploiting structured sparse regularization, built upon the links drawn by Bien et al. (2010) between CUR decomposition (Mahoney and Drineas, 2009)

and sparse regularization. Then, we illustrate our algorithms with different tasks: video background subtraction, estimation of hierarchical structures for dictionary learning of natural image patches

(Jenatton et al., 2010a, 2011), wavelet image denoising with a structured sparse prior, and topographic dictionary learning of natural image patches (Hyvärinen et al., 2001; Kavukcuoglu et al., 2009; Garrigues and Olshausen, 2010).

Note that this paper extends a shorter version published in Advances in Neural Information Processing Systems (Mairal et al., 2010b), by adding new experiments (CUR matrix factorization, wavelet image denoising and topographic dictionary learning), presenting the proximal splitting methods, providing the full proofs of the optimization results, and adding numerous discussions.

### 1.1 Notation

Vectors are denoted by bold lower case letters and matrices by upper case ones. We define for the -norm of a vector  in  as , where  denotes the -th coordinate of , and . We also define the -pseudo-norm as the number of nonzero elements in a vector:333Note that it would be more proper to write instead of to be consistent with the traditional notation . However, for the sake of simplicity, we will keep this notation unchanged in the rest of the paper. . We consider the Frobenius norm of a matrix  in : , where denotes the entry of  at row and column . Finally, for a scalar , we denote . For an integer , we denote by the powerset composed of the subsets of .

The rest of this paper is organized as follows: Section 2 presents structured sparse models and related work. Section 3 is devoted to proximal gradient algorithms, and Section 4 to proximal splitting methods. Section 5 presents several experiments and applications demonstrating the effectiveness of our approach and Section 6 concludes the paper.

## 2 Structured Sparse Models

We are interested in machine learning problems where the solution is not only known beforehand to be sparse—that is, the solution has only a few non-zero coefficients, but also to form non-zero patterns with a specific structure. It is indeed possible to encode additional knowledge in the regularization other than just sparsity. For instance, one may want the non-zero patterns to be structured in the form of non-overlapping groups (Turlach et al., 2005; Yuan and Lin, 2006; Stojnic et al., 2009; Obozinski et al., 2010), in a tree (Zhao et al., 2009; Bach, 2009; Jenatton et al., 2010a, 2011), or in overlapping groups (Jenatton et al., 2009; Jacob et al., 2009; Huang et al., 2009; Baraniuk et al., 2010; Cehver et al., 2008; He and Carin, 2009), which is the setting we are interested in here.

As for classical non-structured sparse models, there are basically two lines of research, that either (A) deal with nonconvex and combinatorial formulations that are in general computationally intractable and addressed with greedy algorithms or (B) concentrate on convex relaxations solved with convex programming methods.

### 2.1 Nonconvex Approaches

A first approach introduced by Baraniuk et al. (2010) consists in imposing that the sparsity pattern of a solution (i.e., its set of non-zero coefficients) is in a predefined subset of groups of variables . Given this a priori knowledge, a greedy algorithm (Needell and Tropp, 2009) is used to address the following nonconvex structured sparse decomposition problem

 minw∈Rp12∥y−Xw∥22  s.t.  Supp(w)∈G  and  ∥w∥0≤s,

where is a specified sparsity level (number of nonzeros coefficients), in is an observed signal, is a design matrix in and is the support of (set of non-zero entries).

In a different approach motivated by the minimum description length principle (see Barron et al., 1998), Huang et al. (2009) consider a collection of groups , and define a “coding length” for every group in , which in turn is used to define a coding length for every pattern in . Using this tool, they propose a regularization function such that for a vector in , represents the number of bits that are used for encoding . The corresponding optimization problem is also addressed with a greedy procedure:

 minw∈Rp12∥y−Xw∥22  s.t.   cl(w)≤s,

Intuitively, this formulation encourages solutions  whose sparsity patterns have a small coding length, meaning in practice that they can be represented by a union of a small number of groups. Even though they are related, this model is different from the one of Baraniuk et al. (2010).

These two approaches are encoding a priori knowledge on the shape of non-zero patterns that the solution of a regularized problem should have. A different point of view consists of modelling the zero patterns of the solution—that is, define groups of variables that should be encouraged to be set to zero together. After defining a set of such groups of variables, the following penalty can naturally be used as a regularization to induce the desired property

 ψ(w)≜∑g∈Gηgδg(w), with δg(w)≜{1if there exists j∈g such that wj≠0,0otherwise, (1)

where the ’s are positive weights. This penalty was considered by Bach (2010), who showed that the convex envelope of such nonconvex functions (more precisely strictly positive, non-increasing submodular functions of see Fujishige, 2005) when restricted on the unit -ball, are in fact types of structured sparsity-inducing norms which are the topic of the next section.

### 2.2 Convex Approaches with Sparsity-Inducing Norms

In this paper, we are interested in convex regularizations which induce structured sparsity. Generally, we consider the following optimization problem

 minw∈Rpf(w)+λΩ(w), (2)

where is a convex function (usually an empirical risk in machine learning and a data-fitting term in signal processing), and is a structured sparsity-inducing norm, defined as

 Ω(w)≜∑g∈Gηg∥wg∥, (3)

where is a set of groups of variables, the vector  in represents the coefficients of indexed by in , the scalars  are positive weights, and denotes the - or -norm. We now consider different cases:

• When is the set of singletons—that is , and all the are equal to one, is the -norm, which is well known to induce sparsity. This leads for instance to the Lasso (Tibshirani, 1996) or equivalently to basis pursuit (Chen et al., 1999).

• If is a partition of , i.e. the groups do not overlap, variables are selected in groups rather than individually. When the coefficients of the solution are known to be organized in such a way, explicitly encoding the a priori group structure in the regularization can improve the prediction performance and/or interpretability of the learned models (Turlach et al., 2005; Yuan and Lin, 2006; Roth and Fischer, 2008; Stojnic et al., 2009; Huang and Zhang, 2010; Obozinski et al., 2010). Such a penalty is commonly called group-Lasso penalty.

• When the groups overlap, is still a norm and sets groups of variables to zero together (Jenatton et al., 2009). The latter setting has first been considered for hierarchies (Zhao et al., 2009; Kim and Xing, 2010; Bach, 2009; Jenatton et al., 2010a, 2011), and then extended to general group structures (Jenatton et al., 2009). Solving Eq. (2) in this context is a challenging problem which is the topic of this paper.

Note that other types of structured-sparsity inducing norms have also been introduced, notably the approach of Jacob et al. (2009), which penalizes the following quantity

 Ω′(w)≜minξ=(ξg)g∈G∈Rp×|G|∑g∈Gηg∥ξg∥  s.t% .  w=∑g∈Gξg  %and  ∀g, Supp(ξg)⊆g.

This penalty, which is also a norm, can be seen as a convex relaxation of the regularization introduced by Huang et al. (2009), and encourages the sparsity pattern of the solution to be a union of a small number of groups. Even though both  and  appear under the terminology of “structured sparsity with overlapping groups”, they have in fact significantly different purposes and algorithmic treatments. For example, Jacob et al. (2009) consider the problem of selecting genes in a gene network which can be represented as the union of a few predefined pathways in the graph (groups of genes), which overlap. In this case, it is natural to use the norm  instead of . On the other hand, we present a matrix factorization task in Section 5.3, where the set of zero-patterns should be a union of groups, naturally leading to the use of . Dealing with  is therefore relevant, but out of the scope of this paper.

### 2.3 Convex Optimization Methods Proposed in the Literature

Generic approaches to solve Eq. (2) mostly rely on subgradient descent schemes (see Bertsekas, 1999), and interior-point methods (Boyd and Vandenberghe, 2004). These generic tools do not scale well to large problems and/or do not naturally handle sparsity (the solutions they return may have small values but no “true” zeros). These two points prompt the need for dedicated methods.

To the best of our knowledge, only a few recent papers have addressed problem Eq. (2) with dedicated optimization procedures, and in fact, only when is a linear combination of -norms. In this setting, a first line of work deals with the non-smoothness of by expressing the norm as the minimum over a set of smooth functions. At the cost of adding new variables (to describe the set of smooth functions), the problem becomes more amenable to optimization. In particular, reweighted- schemes consist of approximating the norm by successive quadratic upper bounds (Argyriou et al., 2008; Rakotomamonjy et al., 2008; Jenatton et al., 2010b; Micchelli et al., 2010). It is possible to show for instance that

 Ω(w)=min(zg)g∈G∈R|G|+12{∑g∈Gη2g∥wg∥22zg+zg}.

Plugging the previous relationship into Eq. (2), the optimization can then be performed by alternating between the updates of and the additional variables .444Note that such a scheme is interesting only if the optimization with respect to  is simple, which is typically the case with the square loss function (Bach et al., 2011). Moreover, for this alternating scheme to be provably convergent, the variables have to be bounded away from zero, resulting in solutions whose entries may have small values, but not “true” zeros. When the norm is defined as a linear combination of -norms, we are not aware of the existence of such variational formulations.

Problem (2) has also been addressed with working-set algorithms (Bach, 2009; Jenatton et al., 2009; Schmidt and Murphy, 2010). The main idea of these methods is to solve a sequence of increasingly larger subproblems of (2). Each subproblem consists of an instance of Eq. (2) reduced to a specific subset of variables known as the working set. As long as some predefined optimality conditions are not satisfied, the working set is augmented with selected inactive variables (for more details, see Bach et al., 2011).

The last approach we would like to mention is that of  Chen et al. (2010), who used a smoothing technique introduced by Nesterov (2005). A smooth approximation of is used, when is a sum of -norms, and is a parameter controlling the trade-off between smoothness of and quality of the approximation. Then, Eq. (2) is solved with accelerated gradient techniques (Beck and Teboulle, 2009; Nesterov, 2007) but is substituted to the regularization . Depending on the required precision for solving the original problem, this method provides a natural choice for the parameter , with a known convergence rate. A drawback is that it requires to choose the precision of the optimization beforehand. Moreover, since a -norm is added to the smoothed , the solutions returned by the algorithm might be sparse but possibly without respecting the structure encoded by . This should be contrasted with other smoothing techniques, e.g., the reweighted- scheme we mentioned above, where the solutions are only approximately sparse.

## 3 Optimization with Proximal Gradient Methods

We address in this section the problem of solving Eq. (2) under the following assumptions:

• is differentiable with Lipschitz-continuous gradient. For machine learning problems, this hypothesis holds when is for example the square, logistic or multi-class logistic loss (see Shawe-Taylor and Cristianini, 2004).

• is a sum of -norms. Even though the -norm is sometimes used in the literature (Jenatton et al., 2009), and is in fact used later in Section 4, the -norm is piecewise linear, and we take advantage of this property in this work.

To the best of our knowledge, no dedicated optimization method has been developed for this setting. Following Jenatton et al. (2010a, 2011) who tackled the particular case of hierarchical norms, we propose to use proximal gradient methods, which we now introduce.

Proximal methods have drawn increasing attention in the signal processing (e.g., Wright et al., 2009b; Combettes and Pesquet, 2010, and numerous references therein) and the machine learning communities (e.g., Bach et al., 2011, and references therein), especially because of their convergence rates (optimal for the class of first-order techniques) and their ability to deal with large nonsmooth convex problems (e.g., Nesterov, 2007; Beck and Teboulle, 2009).

These methods are iterative procedures that can be seen as an extension of gradient-based techniques when the objective function to minimize has a nonsmooth part. The simplest version of this class of methods linearizes at each iteration the function around the current estimate , and this estimate is updated as the (unique by strong convexity) solution of the proximal problem, defined as:

 minw∈Rpf(~w)+(w−~w)⊤∇f(~w)+λΩ(w)+L2∥w−~w∥22.

The quadratic term keeps the update in a neighborhood where is close to its linear approximation, and is a parameter which is a upper bound on the Lipschitz constant of . This problem can be equivalently rewritten as:

Solving efficiently and exactly this problem allows to attain the fast convergence rates of proximal methods, i.e., reaching a precision of in iterations.555Note, however, that fast convergence rates can also be achieved while solving approximately the proximal problem (see Schmidt et al., 2011, for more details). In addition, when the nonsmooth term  is not present, the previous proximal problem exactly leads to the standard gradient update rule. More generally, we define the proximal operator: [Proximal Operator]
The proximal operator associated with our regularization term , which we denote by , is the function that maps a vector to the unique solution of

 minw∈Rp12∥u−w∥22+λΩ(w). (4)

This operator was initially introduced by Moreau (1962) to generalize the projection operator onto a convex set. What makes proximal methods appealing to solve sparse decomposition problems is that this operator can often be computed in closed form. For instance,

• When is the -norm—that is — the proximal operator is the well-known elementwise soft-thresholding operator,

 ∀j∈{1,…,p},  uj↦sign(uj)(|uj|−λ)+   ={0%if  |uj|≤λsign(uj)(|uj|−λ)% otherwise.
• When is a group-Lasso penalty with -norms—that is, , with being a partition of , the proximal problem is separable in every group, and the solution is a generalization of the soft-thresholding operator to groups of variables:

 ∀g∈G  ,ug↦ug−Π∥.∥2≤λ[ug]=⎧⎨⎩0if  ∥ug∥2≤λ∥ug∥2−λ∥ug∥2ugotherwise,

where denotes the orthogonal projection onto the ball of the -norm of radius .

• When is a group-Lasso penalty with -norms—that is, , with being a partition of , the solution is a different group-thresholding operator:

 ∀g∈G,  ug↦ug−Π∥.∥1≤λ[ug],

where denotes the orthogonal projection onto the -ball of radius , which can be solved in operations (Brucker, 1984; Maculan and de Paula, 1989). Note that when , we have a group-thresholding effect, with .

• When is a tree-structured sum of - or -norms as introduced by Zhao et al. (2009)—meaning that two groups are either disjoint or one is included in the other, the solution admits a closed form. Let be a total order on such that for in , if and only if either or .666For a tree-structured set , such an order exists. Then, if , and if we define as (a) the proximal operator on the subspace corresponding to group and (b) the identity on the orthogonal, Jenatton et al. (2010a, 2011) showed that:

 ProxλΩ=Proxgm∘…∘Proxg1, (5)

which can be computed in operations. It also includes the sparse group Lasso (sum of group-Lasso penalty and -norm) of Friedman et al. (2010) and Sprechmann et al. (2010).

The first contribution of our paper is to address the case of general overlapping groups with -norm.

### 3.2 Dual of the Proximal Operator

We now show that, for a set   of general overlapping groups, a convex dual of the proximal problem (4) can be reformulated as a quadratic min-cost flow problem. We then propose an efficient algorithm to solve it exactly, as well as a related algorithm to compute the dual norm of . We start by considering the dual formulation to problem (4) introduced by Jenatton et al. (2010a, 2011): [Dual of the proximal problem, Jenatton et al., 2010a, 2011]
Given in , consider the problem

 minξ∈Rp×|G|12∥u−∑g∈Gξg∥22  % s.t.   ∀g∈G, ∥ξg∥1≤ληg    and    ξgj=0 if j∉g, (6)

where is in , and denotes the -th coordinate of the vector . Then, every solution of Eq. (6) satisfies , where is the solution of Eq. (4) when  is a weighted sum of -norms. Without loss of generality,777 Let denote a solution of Eq. (6). Optimality conditions of Eq. (6) derived in Jenatton et al. (2010a, 2011) show that for all in , the signs of the non-zero coefficients for in are the same as the signs of the entries . To solve Eq. (6), one can therefore flip the signs of the negative variables , then solve the modified dual formulation (with non-negative variables), which gives the magnitude of the entries (the signs of these being known). we assume from now on that the scalars are all non-negative, and we constrain the entries of  to be so. Such a formulation introduces dual variables which can be much greater than , the number of primal variables, but it removes the issue of overlapping regularization. We now associate a graph with problem (6), on which the variables , for in and  in , can be interpreted as measuring the components of a flow.

### 3.3 Graph Model

Let be a directed graph , where is a set of vertices, a set of arcs, a source, and a sink. For all arcs in , we define a non-negative capacity constant, and as done classically in the network flow literature (Ahuja et al., 1993; Bertsekas, 1998), we define a flow as a non-negative function on arcs that satisfies capacity constraints on all arcs (the value of the flow on an arc is less than or equal to the arc capacity) and conservation constraints on all vertices (the sum of incoming flows at a vertex is equal to the sum of outgoing flows) except for the source and the sink. For every arc in , we also define a real-valued cost function, which depends on the value of the flow on . We now introduce the canonical graph associated with our optimization problem: [Canonical Graph]
Let be a set of groups, and be positive weights. The canonical graph is the unique graph defined as follows:

1. , where is a vertex set of size , one vertex being associated to each index in , and is a vertex set of size , one vertex per group  in . We thus have . For simplicity, we identify groups in and indices in with vertices of the graph, such that one can from now on refer to “vertex ” or “vertex ”.

2. For every group in , contains an arc . These arcs have capacity and zero cost.

3. For every group in , and every index in , contains an arc with zero cost and infinite capacity. We denote by the flow on this arc.

4. For every index in , contains an arc with infinite capacity and a cost , where is the flow on .

Examples of canonical graphs are given in Figures (a)a-LABEL:sub@subfig:graphc for three simple group structures. The flows associated with can now be identified with the variables of problem (6). Since we have assumed the entries of to be non-negative, we can now reformulate Eq. (6) as

 minξ∈Rp×|G|+,¯¯ξ∈Rpp∑j=112(uj−¯¯ξj)2  s.t.  ¯¯ξ=∑g∈Gξg  and  ∀g∈G, {∑j∈gξgj≤ληg  and  Supp(ξg)⊆g}. (7)

Indeed,

• the only arcs with a cost are those leading to the sink, which have the form , where is the index of a variable in . The sum of these costs is , which is the objective function minimized in Eq. (7);

• by flow conservation, we necessarily have in the canonical graph;

• the only arcs with a capacity constraints are those coming out of the source, which have the form , where is a group in . By flow conservation, the flow on an arc is which should be less than by capacity constraints;

• all other arcs have the form , where is in and is in . Thus, .

Therefore we have shown that finding a flow minimizing the sum of the costs on such a graph is equivalent to solving problem (6). When some groups are included in others, the canonical graph can be simplified to yield a graph with a smaller number of edges. Specifically, if and are groups with , the edges for carrying a flow can be removed and replaced by a single edge of infinite capacity and zero cost, carrying the flow . This simplification is illustrated in Figure (d)d, with a graph equivalent to the one of Figure (c)c. This does not change the optimal value of , which is the quantity of interest for computing the optimal primal variable . We present in Appendix A a formal definition of equivalent graphs. These simplifications are useful in practice, since they reduce the number of edges in the graph and improve the speed of our algorithms.

### 3.4 Computation of the Proximal Operator

Quadratic min-cost flow problems have been well studied in the operations research literature (Hochbaum and Hong, 1995). One of the simplest cases, where  contains a single group as in Figure (a)a, is solved by an orthogonal projection on the -ball of radius . It has been shown, both in machine learning (Duchi et al., 2008) and operations research (Hochbaum and Hong, 1995; Brucker, 1984), that such a projection can be computed in operations. When the group structure is a tree as in Figure (d)d, strategies developed in the two communities are also similar (Jenatton et al., 2010a; Hochbaum and Hong, 1995),888Note however that, while Hochbaum and Hong (1995) only consider a tree-structured sum of -norms, the results from Jenatton et al. (2010a) also apply for a sum of -norms. and solve the problem in operations, where is the depth of the tree.

The general case of overlapping groups is more difficult. Hochbaum and Hong (1995) have shown that quadratic min-cost flow problems can be reduced to a specific parametric max-flow problem, for which an efficient algorithm exists (Gallo et al., 1989).999By definition, a parametric max-flow problem consists in solving, for every value of a parameter, a max-flow problem on a graph whose arc capacities depend on this parameter. While this generic approach could be used to solve Eq. (6), we propose to use Algorithm 1 that also exploits the fact that our graphs have non-zero costs only on edges leading to the sink. As shown in Appendix D, it it has a significantly better performance in practice. This algorithm clearly shares some similarities with existing approaches in network flow optimization such as the simplified version of Gallo et al. (1989) presented by Babenko and Goldberg (2006) that uses a divide and conquer strategy. Moreover, an equivalent algorithm exists for minimizing convex functions over polymatroid sets (Groenevelt, 1991). This equivalence, a priori non trivial, is uncovered through a representation of structured sparsity-inducing norms via submodular functions, which was recently proposed by Bach (2010).

The intuition behind our algorithm, computeFlow (see Algorithm 1), is the following: since is the only value of interest to compute the solution of the proximal operator , the first step looks for a candidate value for by solving the following relaxed version of problem (7):

 \operatornamewithlimitsargminγ∈Rp∑j∈Vu12(uj−γj)2  s.t.  ∑j∈Vuγj≤λ∑g∈Vgrηg. (8)

The cost function here is the same as in problem (7), but the constraints are weaker: Any feasible point of problem (7) is also feasible for problem (8). This problem can be solved in linear time (Brucker, 1984). Its solution, which we denote for simplicity, provides the lower bound for the optimal cost of problem (7).

The second step tries to construct a feasible flow , satisfying additional capacity constraints equal to on arc , and whose cost matches this lower bound; this latter problem can be cast as a max-flow problem (Goldberg and Tarjan, 1986). If such a flow exists, the algorithm returns , the cost of the flow reaches the lower bound, and is therefore optimal. If such a flow does not exist, we have , the lower bound is not achievable, and we build a minimum -cut of the graph (Ford and Fulkerson, 1956) defining two disjoints sets of nodes and ; is the part of the graph which is reachable from the source (for every node in , there exists a non-saturated path from to ), whereas all paths going from  to nodes in are saturated. More details about these properties can be found at the beginning of Appendix B. At this point, it is possible to show that the value of the optimal min-cost flow on all arcs between  and  is necessary zero. Thus, removing them yields an equivalent optimization problem, which can be decomposed into two independent problems of smaller sizes and solved recursively by the calls to computeFlow and computeFlow. A formal proof of correctness of Algorithm 1 and further details are relegated to Appendix B.

The approach of Hochbaum and Hong (1995); Gallo et al. (1989) which recasts the quadratic min-cost flow problem as a parametric max-flow is guaranteed to have the same worst-case complexity as a single max-flow algorithm. However, we have experimentally observed a significant discrepancy between the worst case and empirical complexities for these flow problems, essentially because the empirical cost of each max-flow is significantly smaller than its theoretical cost. Despite the fact that the worst-case guarantees for our algorithm is weaker than theirs (up to a factor ), it is more adapted to the structure of our graphs and has proven to be much faster in our experiments (see Appendix D).101010The best theoretical worst-case complexity of a max-flow is achieved by Goldberg and Tarjan (1986) and is . Our algorithm achieves the same worst-case complexity when the cuts are well balanced—that is , but we lose a factor  when it is not the case. The practical speed of such algorithms is however significantly different than their theoretical worst-case complexities (see Boykov and Kolmogorov, 2004). Some implementation details are also crucial to the efficiency of the algorithm:

• Exploiting maximal connected components: When there exists no arc between two subsets of , the solution can be obtained by solving two smaller optimization problems corresponding to the two disjoint subgraphs. It is indeed possible to process them independently to solve the global min-cost flow problem. To that effect, before calling the function computeFlow(), we look for maximal connected components and call sequentially the procedure computeFlow() for in .

• Efficient max-flow algorithm: We have implemented the “push-relabel” algorithm of Goldberg and Tarjan (1986)

to solve our max-flow problems, using classical heuristics that significantly speed it up in practice; see

Goldberg and Tarjan (1986) and Cherkassky and Goldberg (1997). We use the so-called “highest-active vertex selection rule, global and gap heuristics” (Goldberg and Tarjan, 1986; Cherkassky and Goldberg, 1997), which has a worst-case complexity of for a graph . This algorithm leverages the concept of pre-flow that relaxes the definition of flow and allows vertices to have a positive excess.

• Using flow warm-restarts: The max-flow steps in our algorithm can be initialized with any valid pre-flow, enabling warm-restarts. This is also a key concept in the parametric max-flow algorithm of Gallo et al. (1989).

• Improved projection step: The first line of the procedure computeFlow can be replaced by The idea is to build a relaxation of Eq. (7) which is closer to the original problem than the one of Eq. (8), but that still can be solved in linear time. The structure of the graph will indeed not allow  to be greater than after the max-flow step. This modified projection step can still be computed in linear time (Brucker, 1984), and leads to better performance.

### 3.5 Computation of the Dual Norm

The dual norm of , defined for any vector in by

 Ω∗(κ)≜maxΩ(z)≤1z⊤κ,

is a key quantity to study sparsity-inducing regularizations in many respects. For instance, dual norms are central in working-set algorithms (Jenatton et al., 2009; Bach et al., 2011), and arise as well when proving theoretical estimation or prediction guarantees (Negahban et al., 2009).

In our context, we use it to monitor the convergence of the proximal method through a duality gap, hence defining a proper optimality criterion for problem (2). As a brief reminder, the duality gap of a minimization problem is defined as the difference between the primal and dual objective functions, evaluated for a feasible pair of primal/dual variables (see Section 5.5, Boyd and Vandenberghe, 2004). This gap serves as a certificate of (sub)optimality: if it is equal to zero, then the optimum is reached, and provided that strong duality holds, the converse is true as well (see Section 5.5, Boyd and Vandenberghe, 2004). A description of the algorithm we use in the experiments (Beck and Teboulle, 2009) along with the integration of the computation of the duality gap is given in Appendix C.

We now denote by the Fenchel conjugate of  (Borwein and Lewis, 2006), defined by . The duality gap for problem (2) can be derived from standard Fenchel duality arguments (Borwein and Lewis, 2006) and it is equal to

 f(w)+λΩ(w)+f∗(−κ) for w,κ in Rp with Ω∗(κ)≤λ.

Therefore, evaluating the duality gap requires to compute efficiently in order to find a feasible dual variable  (the gap is otherwise equal to and becomes non-informative). This is equivalent to solving another network flow problem, based on the following variational formulation:

 (9)

In the network problem associated with (9), the capacities on the arcs , , are set to , and the capacities on the arcs , in , are fixed to . Solving problem (9) amounts to finding the smallest value of , such that there exists a flow saturating all the capacities on the arcs leading to the sink . Equation (9) and Algorithm 2 are proven to be correct in Appendix B.