The problem of sparse estimation is becoming increasingly important in statistics, machine learning and signal processing. In its simplest form, this problem consists in estimating a regression vector $\beta^* \in \mathbb{R}^n$ from a set of linear measurements $y \in \mathbb{R}^m$, obtained from the model
$$y = X\beta^* + \xi \tag{1.1}$$
where $X$ is an $m \times n$ matrix, which may be fixed or randomly chosen, and $\xi \in \mathbb{R}^m$ is a vector which results from the presence of noise.
An important rationale for sparse estimation comes from the observation that in many practical applications the number of parameters $n$ is much larger than the data size $m$, but the vector $\beta^*$ is known to be sparse, that is, most of its components are equal to zero. Under this sparsity assumption and certain conditions on the data matrix $X$, it has been shown that regularization with the $\ell_1$ norm, commonly referred to as the Lasso method, provides an effective means to estimate the underlying regression vector, see for example [5, 7, 18, 28] and references therein. Moreover, this method can reliably select the sparsity pattern of $\beta^*$, hence providing a valuable tool for feature selection.
In this paper, we are interested in sparse estimation under additional conditions on the sparsity pattern of the vector $\beta^*$. In other words, not only do we expect this vector to be sparse but also that it is structured sparse, namely certain configurations of its nonzero components are to be preferred to others. This problem arises in several applications, ranging from functional magnetic resonance imaging [9, 29], to scene recognition in vision, to multi-task learning [1, 15, 23] and to bioinformatics , see  for a discussion.
The prior knowledge that we consider in this paper is that the vector $|\beta^*|$, whose components are the absolute values of the corresponding components of $\beta^*$, should belong to some prescribed convex subset $\Lambda$ of the positive orthant. For certain choices of $\Lambda$ this implies a constraint on the sparsity pattern as well. For example, the set $\Lambda$ may include vectors with some desired monotonicity constraints, or other constraints on the “shape” of the regression vector. Unfortunately, the constraint that $|\beta^*| \in \Lambda$ is nonconvex and its implementation is computationally challenging. To overcome this difficulty, we propose a family of penalty functions, which are based on an extension of the $\ell_1$ norm used by the Lasso method and involve the solution of a smooth convex optimization problem. These penalty functions favor regression vectors $\beta$ such that $|\beta| \in \Lambda$, thereby incorporating the structured sparsity constraints.
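Precisely, we propose to estimate $\beta^*$ as a solution of the convex optimization problem
$$\inf\left\{\|X\beta - y\|_2^2 + 2\rho\,\Omega(\beta|\Lambda) : \beta \in \mathbb{R}^n\right\} \tag{1.2}$$
where $\|\cdot\|_2$ denotes the Euclidean norm, $\rho$ is a positive parameter and the penalty function takes the form
$$\Omega(\beta|\Lambda) = \inf\left\{\frac{1}{2}\sum_{i=1}^{n}\left(\frac{\beta_i^2}{\lambda_i} + \lambda_i\right) : \lambda \in \Lambda\right\}.$$
As we shall see, a key property of the penalty function is that it exceeds the $\ell_1$ norm of $\beta$ when $|\beta|$ does not belong to the closure of $\Lambda$, and it coincides with the $\ell_1$ norm otherwise. This observation suggests a heuristic interpretation of the method (1.2): among all vectors $\beta$ which have a fixed value of the $\ell_1$ norm, the penalty function $\Omega$ will encourage those for which $|\beta| \in \Lambda$. Moreover, when $|\beta| \in \Lambda$ the function $\Omega$ reduces to the $\ell_1$ norm and, so, the solution of problem (1.2) is expected to be sparse. The penalty function therefore will encourage certain desired sparsity patterns. Indeed, the sparsity pattern of $\beta$ is contained in that of the auxiliary vector $\lambda$ at the optimum and, so, if the set $\Lambda$ allows only for certain sparsity patterns of $\lambda$, the same property will be “transferred” to the regression vector $\beta$.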
There has been some recent research interest in structured sparsity, see [11, 13, 14, 19, 22, 30, 31] and references therein. Closest to our approach are penalty methods built around the idea of mixed $\ell_1$-$\ell_2$ norms. In particular, the group Lasso method assumes that the components of the underlying regression vector $\beta^*$ can be partitioned into prescribed groups, such that the restriction of $\beta^*$ to a group is equal to zero for most of the groups. This idea has been extended in [14, 32] by considering the possibility that the groups overlap according to certain hierarchical or spatially related structures. Although these methods have proved valuable in applications, they have the limitation that they can only handle more restrictive classes of sparsity, for example patterns forming only a single connected region. Our point of view is different from theirs and provides a means of designing more flexible penalty functions which maintain convexity while modeling richer structures. For example, we will demonstrate that our family of penalty functions can model sparsity patterns forming multiple connected regions of coefficients.
The paper is organized in the following manner. In Section 2 we establish some important properties of the penalty function. In Section 3 we address the case in which the set $\Lambda$ is a box. In Section 4 we derive the form of the penalty function corresponding to the wedge with decreasing coordinates and in Section 5 we extend this analysis to the case in which the constraint set is constructed from a directed graph. In Section 6 we discuss useful duality relations and in Section 7 we address the issue of solving the problem (1.2) numerically by means of an alternating minimization algorithm. Finally, in Section 8 we provide numerical simulations with this method, showing the advantage offered by our approach.
A preliminary version of this paper appeared in the proceedings of the Twenty-Fourth Annual Conference on Neural Information Processing Systems (NIPS 2010) . The new version contains Propositions 2.1, 2.3 and 2.4, the description of the graph penalty in Section 5, Section 6, a complete proof of Theorem 7.1 and an experimental comparison with the method of .
2 Penalty function
In this section, we provide some general comments on the penalty function which we study in this paper.
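We first review our notation. We denote with $\mathbb{R}_{+}$ and $\mathbb{R}_{++}$ the nonnegative and positive real line, respectively. For every $\beta \in \mathbb{R}^n$ we define $|\beta| \in \mathbb{R}^n_{+}$ to be the vector formed by the absolute values of the components of $\beta$, that is, $|\beta| = (|\beta_i| : i \in \mathbb{N}_n)$, where $\mathbb{N}_n$ is the set of positive integers up to and including $n$. Finally, we define the $\ell_1$ norm of the vector $\beta$ as $\|\beta\|_1 = \sum_{i \in \mathbb{N}_n} |\beta_i|$ and the $\ell_2$ norm as $\|\beta\|_2 = \sqrt{\sum_{i \in \mathbb{N}_n} \beta_i^2}$.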
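Given an input data matrix $X \in \mathbb{R}^{m \times n}$ and an output vector $y \in \mathbb{R}^m$, obtained from the linear regression model $y = X\beta^* + \xi$ discussed earlier, we consider the convex optimization problem
$$\inf\left\{\|X\beta - y\|_2^2 + 2\rho\,\Gamma(\beta,\lambda) : \beta \in \mathbb{R}^n,\ \lambda \in \Lambda\right\} \tag{2.1}$$
where $\rho$ is a positive parameter, $\Lambda$ is a prescribed convex subset of the positive orthant $\mathbb{R}^n_{++}$ and the function $\Gamma : \mathbb{R}^n \times \mathbb{R}^n_{++} \to \mathbb{R}$ is given by the formula
$$\Gamma(\beta,\lambda) = \frac{1}{2}\sum_{i \in \mathbb{N}_n}\left(\frac{\beta_i^2}{\lambda_i} + \lambda_i\right).$$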
Note that in (2.1), for a fixed $\beta \in \mathbb{R}^n$, the infimum over $\lambda$ is in general not attained. However, for a fixed $\lambda \in \Lambda$, the infimum over $\beta$ is always attained.
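Since the auxiliary vector $\lambda$ appears only in the second term of the objective function of problem (2.1), and our goal is to estimate $\beta^*$, we may also directly consider the regularization problem
$$\inf\left\{\|X\beta - y\|_2^2 + 2\rho\,\Omega(\beta|\Lambda) : \beta \in \mathbb{R}^n\right\} \tag{2.2}$$
where the penalty function takes the form
$$\Omega(\beta|\Lambda) = \inf\left\{\Gamma(\beta,\lambda) : \lambda \in \Lambda\right\}. \tag{2.3}$$
Note that $\Gamma$ is convex on its domain because each of its summands is a convex function. Hence, when the set $\Lambda$ is convex it follows that $\Omega(\cdot|\Lambda)$ is a convex function and (2.2) is a convex optimization problem.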
An essential idea behind our construction of the penalty function is that, for every $\lambda > 0$, the quadratic function $\beta \mapsto \frac{1}{2}\left(\frac{\beta^2}{\lambda} + \lambda\right)$ provides a smooth approximation to $|\beta|$ from above, which is exact at $\beta = \pm\lambda$. We indicate this graphically in Figure 1-a. This fact follows immediately by the arithmetic-geometric mean inequality, which states, for every $a, b \ge 0$, that $\frac{a+b}{2} \ge \sqrt{ab}$.
Indeed, using again the arithmetic-geometric mean inequality it follows that $\Omega(\beta|\mathbb{R}^n_{++}) = \|\beta\|_1$. Moreover, if $\beta_i \neq 0$ for every $i \in \mathbb{N}_n$, then the infimum is attained for $\lambda = |\beta|$. This important special case motivated us to consider the general method described above. The utility of (2.3) is that upon inserting it into (2.2) there results an optimization problem over $\beta$ and $\lambda$ with a continuously differentiable objective function. Hence, we have succeeded in expressing a nondifferentiable convex objective function by one which is continuously differentiable on its domain.
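The special case above can be checked numerically. The following is a minimal sketch, assuming the auxiliary function has the form $\Gamma(\beta,\lambda) = \frac{1}{2}\sum_i(\beta_i^2/\lambda_i + \lambda_i)$; the function names `gamma` and `l1_norm` are hypothetical and introduced only for illustration.

```python
import math

def gamma(beta, lam):
    """Assumed form: Gamma(beta, lam) = 1/2 * sum_i (beta_i^2 / lam_i + lam_i), lam_i > 0."""
    return 0.5 * sum(b * b / l + l for b, l in zip(beta, lam))

def l1_norm(beta):
    """The l1 norm: sum of absolute values of the components."""
    return sum(abs(b) for b in beta)

beta = [3.0, -1.5, 0.5]

# With all components nonzero, the choice lam = |beta| attains the l1 norm.
lam_opt = [abs(b) for b in beta]
assert math.isclose(gamma(beta, lam_opt), l1_norm(beta))

# Any other positive lam gives a value that is at least the l1 norm,
# by the arithmetic-geometric mean inequality applied to each summand.
for scale in (0.5, 0.9, 1.1, 2.0):
    lam = [scale * l for l in lam_opt]
    assert gamma(beta, lam) >= l1_norm(beta)
```

The check only probes a few rescalings of $|\beta|$; it illustrates the inequality rather than proving it.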
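Our first observation concerns the differentiability of $\Omega(\cdot|\Lambda)$. In this regard, we provide a sufficient condition which ensures this property of $\Omega(\cdot|\Lambda)$ and which, although seemingly cumbersome, covers important special cases. To present our result, for any real numbers $0 < a < b$, we define the parallelepiped $[a,b]^n = \{\lambda \in \mathbb{R}^n : a \le \lambda_i \le b,\ i \in \mathbb{N}_n\}$.
We say that the set $\Lambda$ is admissible if it is convex and, for all $a, b \in \mathbb{R}$ with $0 < a < b$, the set $\Lambda \cap [a,b]^n$ is a nonempty, compact subset of the interior of $\Lambda$.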
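If $\beta \in (\mathbb{R} \setminus \{0\})^n$ and $\Lambda$ is an admissible subset of $\mathbb{R}^n_{++}$, then the infimum above is uniquely achieved at a point $\lambda(\beta) \in \Lambda$ and the mapping $\beta \mapsto \lambda(\beta)$ is continuous. Moreover, the function $\Omega(\cdot|\Lambda)$ is continuously differentiable and its partial derivatives are given, for any $i \in \mathbb{N}_n$, by the formula
$$\frac{\partial \Omega(\beta|\Lambda)}{\partial \beta_i} = \frac{\beta_i}{\lambda_i(\beta)}.$$
We postpone the proof of this proposition to the appendix. We note that, since $\Omega(\cdot|\Lambda)$ is continuous, we may compute it at a vector $\beta$, some of whose components are zero, as a limiting process. Moreover, at such a vector the function $\Omega(\cdot|\Lambda)$ is in general not differentiable; for example, consider the case $\Lambda = \mathbb{R}^n_{++}$.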
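The next proposition provides a justification of the penalty function as a means to incorporate structured sparsity and establishes circumstances for which the penalty function is a norm. To state our result, we denote by $\overline{\Lambda}$ the closure of the set $\Lambda$.
For every $\beta \in \mathbb{R}^n$, we have that $\|\beta\|_1 \le \Omega(\beta|\Lambda)$ and the equality holds if and only if $|\beta| \in \overline{\Lambda}$. Moreover, if $\Lambda$ is a nonempty convex cone then the function $\Omega(\cdot|\Lambda)$ is a norm and we have that $\Omega(\beta|\Lambda) \le \omega\|\beta\|_1$, where $\omega = \max\{\Omega(e_k|\Lambda) : k \in \mathbb{N}_n\}$ and $\{e_k : k \in \mathbb{N}_n\}$ is the canonical basis of $\mathbb{R}^n$.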
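By the arithmetic-geometric mean inequality we have that $\|\beta\|_1 \le \Gamma(\beta,\lambda)$ for every $\lambda \in \Lambda$, proving the first assertion. If $|\beta| \in \overline{\Lambda}$, there exists a sequence $\{\lambda^k : k \in \mathbb{N}\}$ in $\Lambda$, such that $\lim_{k \to \infty} \lambda^k = |\beta|$. Since $\Omega(\beta|\Lambda) \le \Gamma(\beta,\lambda^k)$ it readily follows that $\Omega(\beta|\Lambda) \le \|\beta\|_1$. Conversely, if $\Omega(\beta|\Lambda) = \|\beta\|_1$, then there is a sequence $\{\lambda^k : k \in \mathbb{N}\}$ in $\Lambda$, such that $\Gamma(\beta,\lambda^k) \le \|\beta\|_1 + \frac{1}{k}$. This inequality implies that some subsequence of this sequence converges to a $\bar{\lambda} \in \overline{\Lambda}$. Using the arithmetic-geometric mean inequality we conclude that $\bar{\lambda} = |\beta|$ and the result follows. To prove the second part, observe that if $\Lambda$ is a nonempty convex cone, namely, for any $\lambda \in \Lambda$ and $t > 0$ it holds that $t\lambda \in \Lambda$, we have that $\Omega(\cdot|\Lambda)$ is positively homogeneous. Indeed, making the change of variable $\lambda' = \lambda/|t|$ we see that $\Omega(t\beta|\Lambda) = |t|\,\Omega(\beta|\Lambda)$. Moreover, the above inequality, $\Omega(\beta|\Lambda) \ge \|\beta\|_1$, implies that if $\Omega(\beta|\Lambda) = 0$ then $\beta = 0$. The proof of the triangle inequality follows from the homogeneity and convexity of $\Omega(\cdot|\Lambda)$, namely $\Omega(\alpha + \beta|\Lambda) = 2\,\Omega\!\left(\frac{\alpha+\beta}{2}\Big|\Lambda\right) \le \Omega(\alpha|\Lambda) + \Omega(\beta|\Lambda)$.
Finally, note that $\Omega(\beta|\Lambda) \le \omega\|\beta\|_1$ for every $\beta \in \mathbb{R}^n$ if and only if $\omega \ge \max\{\Omega(\beta|\Lambda) : \|\beta\|_1 = 1\}$. Since $\Omega(\cdot|\Lambda)$ is convex the maximum above is achieved at an extreme point of the $\ell_1$ unit ball. ∎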
This proposition indicates a heuristic interpretation of the method (2.2): among all vectors $\beta$ which have a fixed value of the $\ell_1$ norm, the penalty function $\Omega$ will encourage those for which $|\beta| \in \overline{\Lambda}$. Moreover, when $|\beta| \in \overline{\Lambda}$ the function $\Omega$ reduces to the $\ell_1$ norm and, so, the solution of problem (2.2) is expected to be sparse. The penalty function therefore will encourage certain desired sparsity patterns.
The last point can be better understood by looking at problem (2.1). For every solution $(\hat{\beta}, \hat{\lambda})$, the sparsity pattern of $\hat{\beta}$ is contained in the sparsity pattern of $\hat{\lambda}$, that is, the indices associated with nonzero components of $\hat{\beta}$ are a subset of those of $\hat{\lambda}$. Indeed, if $\hat{\lambda}_i = 0$ it must hold that $\hat{\beta}_i = 0$ as well, since the objective would diverge otherwise (because of the ratio $\beta_i^2/\lambda_i$). Therefore, if the set $\Lambda$ favors certain sparse solutions of $\hat{\lambda}$, the same sparsity pattern will be reflected on $\hat{\beta}$. Moreover, the term $\sum_{i \in \mathbb{N}_n} \lambda_i$ appearing in the expression for $\Gamma(\beta,\lambda)$ favors sparse vectors $\lambda$. For example, a constraint of the form $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$ favors consecutive zeros at the end of $\lambda$ and nonzeros everywhere else. This will lead to zeros at the terminal components of $\beta$ as well. Thus, in many cases like this, it is easy to incorporate a convex constraint on $\lambda$, whereas it may not be possible to do the same with $\beta$.
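Next, we note that a normalized version of the group Lasso penalty is included in our setting as a special case. If, for some $k \in \mathbb{N}_n$, $\{J_\ell : \ell \in \mathbb{N}_k\}$ forms a partition of the index set $\mathbb{N}_n$, the corresponding group Lasso penalty is defined as
$$\Omega_{\rm GL}(\beta) = \sum_{\ell \in \mathbb{N}_k} \sqrt{|J_\ell|}\,\|\beta_{J_\ell}\|_2 \tag{2.5}$$
where, for every $J \subseteq \mathbb{N}_n$, we use the notation $\beta_J = (\beta_j : j \in J)$. It is an easy matter to verify that $\Omega_{\rm GL} = \Omega(\cdot|\Lambda)$ for $\Lambda = \{\lambda \in \mathbb{R}^n_{++} : \lambda_j = \theta_\ell,\ j \in J_\ell,\ \ell \in \mathbb{N}_k,\ \theta_\ell > 0\}$.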
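The group Lasso special case can be verified on a small example. The sketch below assumes the normalized penalty $\sum_\ell \sqrt{|J_\ell|}\,\|\beta_{J_\ell}\|_2$ and auxiliary vectors $\lambda$ that are constant on each group; minimizing $\frac{1}{2}(\|\beta_J\|_2^2/\theta + |J|\theta)$ over $\theta > 0$ gives $\theta = \|\beta_J\|_2/\sqrt{|J|}$. Groups are lists of 0-based indices, and all function names are hypothetical.

```python
import math

def group_lasso_penalty(beta, groups):
    """Normalized group Lasso: sum over groups of sqrt(|J|) * ||beta_J||_2."""
    return sum(
        math.sqrt(len(J)) * math.sqrt(sum(beta[j] ** 2 for j in J))
        for J in groups
    )

def gamma_groupwise(beta, groups, theta):
    """Gamma(beta, lam) when lam_j = theta[l] for every index j in group l."""
    total = 0.0
    for J, t in zip(groups, theta):
        total += 0.5 * sum(beta[j] ** 2 / t + t for j in J)
    return total

def optimal_theta(beta, groups):
    """Per-group minimizer theta_l = ||beta_J|| / sqrt(|J|); assumes beta_J != 0."""
    return [math.sqrt(sum(beta[j] ** 2 for j in J) / len(J)) for J in groups]

beta = [3.0, 4.0, 1.0, -2.0, 2.0]
groups = [[0, 1], [2, 3, 4]]  # a partition of the (0-based) index set

theta = optimal_theta(beta, groups)
# The constrained infimum of Gamma matches the group Lasso penalty.
assert math.isclose(gamma_groupwise(beta, groups, theta),
                    group_lasso_penalty(beta, groups))
```

Because each group contributes a one-dimensional convex problem in $\theta_\ell$, the closed-form minimizer suffices and no numerical solver is needed here.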
The next proposition presents a useful construction which may be employed to generate new penalty functions from available ones. It is obtained by composing a set with a linear transformation, modeling the sum of the components of a vector, across the elements of a prescribed partition of . To describe our result we introduce the group average map induced by . It is defined, for each , as .
If , and is a partition of then
The idea of the proof depends on two basic observations. The first uses the set theoretic formula
From this decomposition we obtain that
Next, we write and decompose the inner infimum as the sum
Now, the second essential step in the proof evaluates the infimum in the second sum by the Cauchy-Schwarz inequality to obtain that
We now substitute this formula into the right hand side of equation (2.6) to finish the proof.∎
When the set is a nonempty convex cone, to emphasize that the function is a norm we denote it by . We end this section with the identification of the dual norm of , which is defined as
If is a nonempty convex cone then there holds the equation
By definition, is the smallest constant such that, for every and , it holds that
Minimizing the left hand side of this inequality for yields the equivalent inequality
Since this inequality holds for every , the result follows by taking the supremum of the right hand side of the above inequality over this set. ∎
The formula for the dual norm suggests that we introduce the set . With this notation we see that the dual norm becomes
Moreover, a direct computation yields an alternate form for the original norm given by the equation
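The steps of the argument above can be traced in a short worked derivation. This is a sketch that assumes $\Gamma(\beta,\lambda) = \frac{1}{2}\sum_i(\beta_i^2/\lambda_i + \lambda_i)$ and a nonempty convex cone $\Lambda$; the symbols $c$, $t$, $\mu$ and $\tilde{\Lambda}$ are notation chosen here for illustration and may differ from the displayed equations of the original.

```latex
\begin{align*}
% c is admissible iff <beta,u> <= c * Gamma(beta,lambda) for all beta and all lambda in Lambda
0 &\le \min_{\beta \in \mathbb{R}^n}\Big\{ c\,\Gamma(\beta,\lambda) - \langle \beta, u\rangle \Big\}
   = \frac{c}{2}\sum_{i} \lambda_i - \frac{1}{2c}\sum_{i} \lambda_i u_i^2
   && \text{(minimizer } \beta_i = \lambda_i u_i / c\text{)} \\
\Longleftrightarrow\quad c^2 &\ge \frac{\sum_i \lambda_i u_i^2}{\sum_i \lambda_i}
   \quad \text{for every } \lambda \in \Lambda, \\
\text{so the dual norm of } u \text{ is}\quad
   &\sup\Bigg\{ \sqrt{\frac{\sum_i \lambda_i u_i^2}{\sum_i \lambda_i}} : \lambda \in \Lambda \Bigg\}
   = \sup\Big\{ \sqrt{\textstyle\sum_i \lambda_i u_i^2} : \lambda \in \tilde{\Lambda} \Big\},
   \qquad \tilde{\Lambda} = \Big\{\lambda \in \Lambda : \textstyle\sum_i \lambda_i = 1\Big\}. \\
\text{Writing } \lambda = t\mu,\ t > 0,\ \mu \in \tilde{\Lambda}: \quad
\Omega(\beta|\Lambda) &= \inf_{\mu \in \tilde{\Lambda}} \inf_{t > 0}
   \frac{1}{2}\Big( \frac{1}{t}\sum_i \frac{\beta_i^2}{\mu_i} + t \Big)
   = \inf\Big\{ \sqrt{\textstyle\sum_i \beta_i^2/\mu_i} : \mu \in \tilde{\Lambda} \Big\}.
\end{align*}
```

The normalization $\sum_i \lambda_i = 1$ is possible because the ratio in the second line is invariant under positive rescaling of $\lambda$ and $\Lambda$ is a cone.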
3 Box penalty
We proceed to discuss some examples of the set which may be used in the design of the penalty function .
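The first example, which is presented in this section, corresponds to the prior knowledge that the magnitude of the components of the regression vector should be in some prescribed intervals. We choose $a, b \in \mathbb{R}^n$, $0 < a \le b$, and define the corresponding box as $B[a,b] = \prod_{i \in \mathbb{N}_n} [a_i, b_i]$. The theorem below establishes the form of the box penalty. To state our result, we define, for every $t \in \mathbb{R}$, the function $(t)_+ = \max(0, t)$.
We have that
$$\Omega(\beta|B[a,b]) = \|\beta\|_1 + \sum_{i \in \mathbb{N}_n}\left(\frac{1}{2a_i}(a_i - |\beta_i|)_+^2 + \frac{1}{2b_i}(|\beta_i| - b_i)_+^2\right).$$
Moreover, the components of the vector $\lambda(\beta)$ are given by the equations $\lambda_i(\beta) = |\beta_i| + (a_i - |\beta_i|)_+ - (|\beta_i| - b_i)_+$, $i \in \mathbb{N}_n$.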
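Since $\Omega(\beta|B[a,b]) = \sum_{i \in \mathbb{N}_n} \Omega(\beta_i|[a_i,b_i])$ it suffices to establish the result in the case $n = 1$. We shall show that if $a, b \in \mathbb{R}$, $0 < a \le b$, then
$$\Omega(\beta|[a,b]) = |\beta| + \frac{1}{2a}(a - |\beta|)_+^2 + \frac{1}{2b}(|\beta| - b)_+^2.$$
Since both sides of the above equation are continuous functions of $\beta$ it suffices to prove this equation for $\beta \in \mathbb{R} \setminus \{0\}$. In this case, the function $\Gamma(\beta,\cdot)$ is strictly convex, and so, has a unique minimum in $\mathbb{R}_{++}$ at $\lambda = |\beta|$, see also Figure 1-b. Moreover, if $|\beta| \le a$ the minimum over $[a,b]$ occurs at $\lambda = a$, whereas if $|\beta| \ge b$, it occurs at $\lambda = b$. This establishes the formula for $\lambda(\beta)$. Consequently, we have that
$$\Omega(\beta|[a,b]) = \begin{cases} \frac{1}{2}\left(\frac{\beta^2}{a} + a\right), & |\beta| < a, \\ |\beta|, & a \le |\beta| \le b, \\ \frac{1}{2}\left(\frac{\beta^2}{b} + b\right), & |\beta| > b. \end{cases}$$
Equation (3.1) now follows by a direct computation. ∎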
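The box case is simple enough to check directly. The sketch below assumes the closed form $\|\beta\|_1 + \sum_i \big((a_i - |\beta_i|)_+^2/(2a_i) + (|\beta_i| - b_i)_+^2/(2b_i)\big)$ with minimizer $\lambda_i = |\beta_i| + (a_i - |\beta_i|)_+ - (|\beta_i| - b_i)_+$, which is $|\beta_i|$ clipped to $[a_i, b_i]$, and compares it with a direct evaluation of $\Gamma$ at that minimizer. The function names are hypothetical.

```python
import math

def pos(t):
    """Positive part (t)_+ = max(0, t)."""
    return max(0.0, t)

def gamma(beta, lam):
    """Assumed form: Gamma(beta, lam) = 1/2 * sum_i (beta_i^2 / lam_i + lam_i)."""
    return 0.5 * sum(b * b / l + l for b, l in zip(beta, lam))

def box_minimizer(beta, a, b):
    """lam_i = |beta_i| + (a_i - |beta_i|)_+ - (|beta_i| - b_i)_+ (clipping to [a_i, b_i])."""
    return [abs(x) + pos(lo - abs(x)) - pos(abs(x) - hi)
            for x, lo, hi in zip(beta, a, b)]

def box_penalty(beta, a, b):
    """Closed-form box penalty: l1 norm plus quadratic corrections outside [a_i, b_i]."""
    total = 0.0
    for x, lo, hi in zip(beta, a, b):
        m = abs(x)
        total += m + pos(lo - m) ** 2 / (2 * lo) + pos(m - hi) ** 2 / (2 * hi)
    return total

beta = [0.5, -2.0, 5.0]
a = [1.0, 1.0, 1.0]
b = [3.0, 3.0, 3.0]

lam = box_minimizer(beta, a, b)
# The closed form agrees with Gamma evaluated at the clipped minimizer.
assert math.isclose(box_penalty(beta, a, b), gamma(beta, lam))
```

Components with $a_i \le |\beta_i| \le b_i$ contribute exactly $|\beta_i|$, so the penalty reduces to the $\ell_1$ norm whenever $|\beta|$ lies in the box.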
4 Wedge penalty
In this section, we consider the case that the coordinates of the vector $\lambda \in \Lambda$ are ordered in a nonincreasing fashion. As we shall see, the corresponding penalty function favors regression vectors which are likewise nonincreasing.
We define the wedge
$$W = \{\lambda \in \mathbb{R}^n_{++} : \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n\}.$$
Our next result describes the form of the penalty in this case. To explain this result we require some preparation. We say that a partition of is contiguous if for all , , it holds that . For example, if , partitions and are contiguous but is not.
Given any two disjoint subsets we define the region in
Note that the boundary of this region is determined by the zero set of a homogeneous polynomial of degree two. We also need the following construction.
For every we set and label the elements of in increasing order as . We associate with the set a contiguous partition of , given by , where we define , and set and .
Figure 2 illustrates an example of a contiguous partition along with the set .
A subset of also induces two regions in which play a central role in the identification of the wedge penalty. First, we describe the region which “crosses over” the induced partition . This is defined to be the set
In other words, if the average of the square of its components within each region strictly decreases with . The next region which is essential in our analysis is the “stays within” region, induced by the partition . This region is defined as
where denotes the closure of the set and we use the notation . In other words, all vectors within this region have the property that, for every set , the average of the square of a first segment of components of within this set is not greater than the average over . We note that if is the empty set the above notation should be interpreted as and
From the cross-over and stay-within sets we define the region
Alternatively, we shall describe below the set in terms of two vectors induced by a vector and the set . These vectors play the role of the Lagrange multiplier and the minimizer for the wedge penalty in the theorem below.
For every vector and every subset we let be the induced contiguous partition of and define two vectors and by
Note that the components of are constant on each set , .
For every and we have that
if and only if and ;
If and then .
The first assertion follows directly from the definition of the requisite quantities. The proof of the second assertion is a direct consequence of the fact that the vector is a constant on any element of the partition and strictly decreasing from one element to the next in that partition. ∎
For the theorem below we introduce, for every the sets
We shall establish not only that the collection of sets forms a partition of , that is, their union is and two distinct elements of are disjoint, but also explicitly determine the wedge penalty on each element of .
The collection of sets forms a partition of . For each there is a unique such that , and
where . Moreover, the components of the vector are given by the equations , where
First, let us observe that there are inequality constraints defining . It readily follows that all vectors in this constraint set are regular, in the sense of optimization theory, see [4, p. 279]. Hence, we can appeal to [4, Prop. 3.3.4, p. 316 and Prop. 3.3.6, p. 322], which state that is a solution to the minimum problem determined by the wedge penalty, if and only if there exists a vector with nonnegative components such that
where we set Furthermore, the following complementary slackness conditions hold true
To unravel these equations, we let , which is the subset of indexes corresponding to the constraints that are not tight. When , we express this set in the form where . As explained in Definition 4.2, the set induces the partition of . When our notation should be interpreted to mean that is empty and the partition consists only of . In this case, it is easy to solve equations (4.7) and (4.8). In fact, all components of the vector have a common value, say , and by summing both sides of equation (4.7) over we obtain that
Moreover, summing both sides of the same equation over we obtain that
and, since we conclude that .
We now consider the case that . Hence, the vector has equal components on each subset , which we denote by , . The definition of the set implies that the sequence is strictly decreasing and equation (4.8) implies that , for every . Summing both sides of equation (4.7) over we obtain that
which implies that . Since this holds for every and we conclude that and therefore, it follows that .
In summary, we have shown that , , and . In particular, this implies that the collection of sets covers . Next, we show that the elements of are disjoint. To this end, we observe that, the computation described above can be reversed. That is to say, conversely for any and we conclude that and solve the equations (4.7) and (4.8). Since the wedge penalty function is strictly convex we know that equations (4.7) and (4.8) have a unique solution. Now, if then it must follow that . Consequently, by part (b) in Lemma 4.1 we conclude that . ∎
Note that the set and the associated partition appearing in the theorem are identified by examining the optimality conditions of the optimization problem (2.3) for the wedge. There are $2^{n-1}$ possible contiguous partitions of the index set. Thus, for a given $\beta$, determining the corresponding partition is a challenging problem. We explain how to do this in Section 7.
An interesting property of the wedge penalty, which is indicated by Theorem 4.1, is that it has the form of a group Lasso penalty as in equation (2.5), with groups not fixed a priori but depending on the location of the vector $\beta$. The groups are the elements of the partition and are identified by certain convex constraints on the vector $\beta$. For example, for $n = 2$ we obtain, with $W$ denoting the wedge, that $\Omega(\beta|W) = \|\beta\|_1$ if $|\beta_1| \ge |\beta_2|$ and $\Omega(\beta|W) = \sqrt{2}\,\|\beta\|_2$ otherwise. For $n = 3$, we have that
where we have also displayed the partition involved in each case. We also present a graphical representation of the corresponding unit ball in Figure 3-a. For comparison we also graphically display the unit ball for the hierarchical group Lasso with groups and two group Lasso in Figure 3-b,c,d, respectively.
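The group structure described above can be explored with a small sketch. It is not the algorithm of Section 7: it swaps in the pool-adjacent-violators scheme, applied to the squares $\beta_i^2$ with a nonincreasing fit, on the assumption that the resulting blocks match the partition characterized in the text (block averages of $\beta_i^2$ strictly decreasing across blocks, and initial segments within a block averaging no more than the block). The penalty is then evaluated in group Lasso form, $\sum_\ell \sqrt{|J_\ell|}\,\|\beta_{J_\ell}\|_2$. Function names and 0-based indexing are illustrative choices.

```python
import math

def wedge_blocks(beta):
    """Pool adjacent violators on beta_i^2 for a nonincreasing fit.

    Returns contiguous blocks as (start, end) pairs, end exclusive.
    Each stack entry is [sum of squares, count, start index].
    """
    blocks = []
    for i, b in enumerate(beta):
        blocks.append([b * b, 1, i])
        # Merge while the previous block mean does not exceed the current one
        # (cross-multiplied to avoid division).
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] <= blocks[-1][0] * blocks[-2][1]):
            s, c, _ = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return [(start, start + c) for _, c, start in blocks]

def wedge_penalty(beta):
    """Group-Lasso-form value: sum over blocks of sqrt(|J|) * ||beta_J||_2."""
    total = 0.0
    for start, end in wedge_blocks(beta):
        sq = sum(x * x for x in beta[start:end])
        total += math.sqrt(end - start) * math.sqrt(sq)
    return total

# n = 2: the l1 norm when |beta_1| >= |beta_2|, sqrt(2) * ||beta||_2 otherwise.
assert math.isclose(wedge_penalty([2.0, 1.0]), 3.0)
assert math.isclose(wedge_penalty([1.0, 2.0]), math.sqrt(2) * math.sqrt(5))
```

For a vector whose magnitudes are already strictly decreasing, every block is a singleton and the value coincides with the $\ell_1$ norm, in line with the general property of the penalty.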
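The wedge may equivalently be expressed as the constraint that the difference vector $D^1(\lambda) = (\lambda_{j+1} - \lambda_j : j \in \mathbb{N}_{n-1})$ is less than or equal to zero. This alternative interpretation suggests the $k$-th order difference operator, which is given by the formula
$$D^k(\lambda) = \left(\lambda_{j+k} + \sum_{\ell \in \mathbb{N}_k} (-1)^\ell \binom{k}{\ell} \lambda_{j+k-\ell} : j \in \mathbb{N}_{n-k}\right)$$
and the corresponding $k$-th wedge
$$W^k = \{\lambda \in \mathbb{R}^n_{++} : D^k(\lambda) \ge 0\}.$$
The associated penalty encourages vectors whose sparsity pattern is concentrated on at most $k$ different contiguous regions. Note that $W^1$ is not the wedge $W$ considered earlier. Moreover, the $2$-wedge includes vectors which have a convex “profile” and whose sparsity pattern is concentrated either on the first elements of the vector, on the last, or on both.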
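Membership in a higher-order wedge can be tested by repeated differencing. The sketch below assumes forward differences, $D^1(\lambda)_j = \lambda_{j+1} - \lambda_j$, with the $k$-th order operator obtained by applying $D^1$ $k$ times, and assumes the $k$-th wedge is the set of positive vectors with $D^k(\lambda) \ge 0$; both the sign convention and the function names are assumptions made for illustration.

```python
def diff(lam):
    """First-order forward differences: lam[j+1] - lam[j]."""
    return [lam[j + 1] - lam[j] for j in range(len(lam) - 1)]

def kth_diff(lam, k):
    """k-th order differences, obtained by applying diff k times."""
    out = list(lam)
    for _ in range(k):
        out = diff(out)
    return out

def in_kth_wedge(lam, k):
    """Assumed membership test: all components positive and D^k(lam) >= 0."""
    return all(l > 0 for l in lam) and all(d >= 0 for d in kth_diff(lam, k))

# A convex profile (large at both ends, small in the middle) lies in the 2-wedge.
print(in_kth_wedge([4.0, 1.0, 0.5, 1.0, 4.0], 2))
# A decreasing vector has negative first differences, so it is not in the 1-wedge.
print(in_kth_wedge([3.0, 2.0, 1.0], 1))
```

Under this convention the first-order set contains nondecreasing vectors, which is consistent with the remark that it differs from the wedge with nonincreasing coordinates.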
5 Graph penalty
In this section we present an extension of the wedge set which is inspired by previous work on the group Lasso estimator with hierarchically overlapping groups . It models vectors whose magnitude is ordered according to a graphical structure.
Let be a directed graph, where is the set of vertices in the graph and is the edge set, whose cardinality is denoted by . If we say that there is a directed edge from vertex to vertex . The graph is identified by the incidence matrix, which we define as
We consider the penalty for the convex cone and assume, from now on, that is acyclic (DAG), that is, has no directed loops. In particular, this implies that, if then . The wedge penalty described above is a special case of the graph penalty corresponding to a line graph. Let us now discuss some aspects of the graph penalty for an arbitrary DAG. As we shall see, our remarks lead to an explicit form of the graph penalty when is a tree.
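The graph constraint can be made concrete with a small sketch. It assumes an edge-by-vertex incidence matrix with entry $+1$ at the tail and $-1$ at the head of each directed edge, and a cone consisting of positive vectors $\lambda$ with $A\lambda \ge 0$, so that $\lambda_v \ge \lambda_w$ along every edge $(v, w)$. This sign convention, the 0-based vertex labels and the function names are assumptions made for illustration.

```python
def incidence_matrix(n, edges):
    """Rows index edges, columns index vertices: +1 at the tail v, -1 at the head w."""
    A = []
    for v, w in edges:
        row = [0] * n
        row[v] = 1
        row[w] = -1
        A.append(row)
    return A

def in_graph_cone(lam, edges):
    """Assumed cone: lam > 0 componentwise and (A lam)_e = lam_v - lam_w >= 0 on every edge."""
    A = incidence_matrix(len(lam), edges)
    products = [sum(a * l for a, l in zip(row, lam)) for row in A]
    return all(l > 0 for l in lam) and all(p >= 0 for p in products)

# A line graph 0 -> 1 -> 2 recovers the nonincreasing ordering of the wedge.
line = [(0, 1), (1, 2)]
print(incidence_matrix(3, line))
print(in_graph_cone([3.0, 2.0, 1.0], line))

# A small tree rooted at 0: each child is bounded by its parent.
tree = [(0, 1), (0, 2), (1, 3)]
print(in_graph_cone([5.0, 2.0, 4.0, 1.0], tree))
```

On a tree, siblings are not compared with each other, which is what allows magnitudes to decay independently along different branches.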
If we say that vertex is a child of vertex and is a parent of . For every vertex , we let and be the set of children and parents of , respectively. When is a tree, is the empty set if is the root node and otherwise consists of only one element, the parent of , which we denote by .
Let be the set of descendants of , that is, the set of vertices which are connected to by a directed path starting in , and let be the set of ancestors of , that is, the set of vertices from which a directed path leads to . We use the convention that and .
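Descendant and ancestor sets can be computed by a graph traversal. The sketch below is a plain depth-first search over an edge list; whether a vertex counts as its own descendant or ancestor is left as the hypothetical flag `include_self`, since the convention is a matter of definition. Function names and 0-based vertex labels are illustrative.

```python
def adjacency(edges):
    """Map each vertex to the list of vertices it points to."""
    adj = {}
    for v, w in edges:
        adj.setdefault(v, []).append(w)
    return adj

def reachable(adj, v, include_self):
    """Vertices reachable from v along directed paths (depth-first search)."""
    seen = set()
    stack = list(adj.get(v, []))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(adj.get(u, []))
    if include_self:
        seen.add(v)
    return seen

def descendants(edges, v, include_self=True):
    """Vertices connected to v by a directed path starting in v."""
    return reachable(adjacency(edges), v, include_self)

def ancestors(edges, v, include_self=True):
    """Vertices from which a directed path leads to v (search on reversed edges)."""
    reversed_edges = [(w, u) for u, w in edges]
    return reachable(adjacency(reversed_edges), v, include_self)

tree = [(0, 1), (0, 2), (1, 3)]
print(descendants(tree, 1, include_self=False))
print(ancestors(tree, 3, include_self=False))
```

Because the graph is acyclic, a vertex is never reached from itself, so the flag alone decides whether it appears in its own set.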
Every connected subset induces a subgraph of which is also a DAG. If and are disjoint connected subsets of , we say that they are connected if there is at least one edge connecting a pair of vertices in and , in either one or the other direction. Moreover, we say that is below — written — if and are connected and every edge connecting them departs from a node of .
Let be a DAG. We say that is a cut of if it induces a partition of the vertex set such that if and only if vertices and belong to two different elements of the partition.
In other words, a cut separates a connected graph into two or more connected components such that every pair of vertices corresponding to a disconnected edge, that is an element of , are in two different components. We also denote by the set of cuts of , and by the set of descendants of within set , for every