# Combinatorial Penalties: Which structures are preserved by convex relaxations?

We consider the homogeneous and the non-homogeneous convex relaxations for combinatorial penalty functions defined on support sets. Our study identifies key differences in the tightness of the resulting relaxations through the notion of the lower combinatorial envelope of a set-function along with new necessary conditions for support identification. We then propose a general adaptive estimator for convex monotone regularizers, and derive new sufficient conditions for support recovery in the asymptotic setting.


## 1 Introduction

Over the last years, sparsity has been a key model in machine learning, signal processing, and statistics. While sparsity modelling is powerful, structured sparsity models further exploit domain knowledge by characterizing the interdependency between the non-zero coefficients of an unknown parameter vector. For example, in certain applications domain knowledge may dictate that we should favor non-zero patterns corresponding to unions of groups [obozinski2011group] in cancer prognosis from gene expression data, complements of unions of groups [jacob2009group] in neuroimaging and background subtraction, or rooted connected trees [jenatton2011proximal, zhao2006grouped] in natural image processing. Incorporating such key prior information beyond plain sparsity leads to significant improvements in estimation performance, noise robustness, interpretability, and sample complexity [baraniuk2010model].

Structured sparsity models are naturally encoded by combinatorial functions. However, direct combinatorial treatments often lead to intractable learning problems. Hence, we often use either non-convex greedy methods or continuous convex relaxations, where the combinatorial penalty is replaced by a tractable convex surrogate; cf., [baraniuk2010model, huang2011learning, bach2011learning].

In this paper, we adopt the convex approach because it benefits from a mature set of efficient numerical algorithms as well as strong analysis tools that rely on convex geometry in order to establish statistical efficiency. Convex formulations are also robust to model mis-specifications. Moreover, there is a rich set of convex penalties with structure-inducing properties already studied in the literature [yuan2006model, jacob2009group, jenatton2011structured, jenatton2011proximal, zhao2006grouped, obozinski2011group]. For an overview, we refer the reader to [bach2011learning] and references therein.

For choosing a convex relaxation, a systematic approach, already adopted in [bach2010structured, chandrasekaran2012convex, obozinski2012convex, halabi2015totally], considers the tightest convex relaxation of the combinatorial penalty expressing the desired structure. For instance, [bach2010structured] shows that computing the tightest convex relaxation over the unit $\ell_\infty$-ball is tractable for the ensemble of monotone submodular functions. Similarly, the authors in [halabi2015totally] demonstrate the tractability of such relaxations for combinatorial penalties that can be described via totally unimodular constraints.

A different principled approach to convex relaxations is proposed by [obozinski2012convex], where the authors consider the tightest homogeneous convex relaxation of general set functions regularized by an $\ell_p$-norm. The authors show, for instance, that the resulting norm takes the form of a generalized latent group Lasso norm [obozinski2011group]. The homogeneity imposed in [obozinski2012convex] naturally ensures the invariance of the regularizer to a rescaling of the data. However, such a requirement may come at the cost of a loss of structure, as observed in an example in [halabi2015totally]. This observation begs the question:

When do the homogeneous and non-homogeneous convex relaxations differ and which structures can be encoded by each?

In order to answer this question, we rigorously identify which combinatorial structures are preserved by the non-homogeneous relaxation, in a manner similar to [obozinski2012convex] for the homogeneous one. We further study the statistical properties of both relaxations. In particular, we consider the problem of support recovery in the context of learning problems regularized by these relaxed convex penalties, which has so far been investigated only in special cases, e.g., for norms associated with submodular functions [bach2010structured], or for the latent group Lasso norm [obozinski2011group].

To this end, this paper makes the following contributions:

• We derive formulations of the non-homogeneous tightest convex relaxation of general $\ell_p$-regularized combinatorial penalties (Section 2.1). We show that any monotone set function is preserved by this relaxation, while the homogeneous relaxation only preserves a smaller subset of set-functions (Section 2.2).

• We identify necessary conditions for support recovery in learning problems regularized by general convex penalties (Section 3.1).

• We propose an adaptive weight estimation scheme and provide sufficient conditions for support recovery under the asymptotic regime (Section 3.2). This scheme does not require any irrepresentability condition and is applicable to general monotone convex regularizers.

• We identify sufficient conditions with respect to combinatorial penalties which ensure that the sufficient support recovery conditions hold with respect to the associated convex relaxations (Section 4).

• We illustrate numerically the effect on support recovery of the choice of the relaxation, as well as of the adaptive weights scheme (Section 5).

In the sequel, we defer all proofs to the Appendix.

##### Notation.

Let $V = \{1, \dots, d\}$ be the ground set and $2^V$ be its power set. Given $w \in \mathbb{R}^d$ and a set $S \subseteq V$, $w_S$ denotes the vector in $\mathbb{R}^d$ s.t. $[w_S]_i = w_i$ for $i \in S$ and $[w_S]_i = 0$ otherwise; $X_S$ is defined similarly for a matrix $X$, by keeping only the columns indexed by $S$. Accordingly, we let $1_S \in \{0,1\}^d$ be the indicator vector of the set $S$. We drop the subscript for $S = V$, so that $1$ denotes the vector of all ones. The notation $S^c$ denotes the set complement of $S$ with respect to $V$.

The operations $|w|$ and $\operatorname{sign}(w)$ are applied element-wise. For $p \in (0, \infty)$, the $\ell_p$-(quasi) norm is given by $\|w\|_p = (\sum_{i=1}^d |w_i|^p)^{1/p}$, and $\|w\|_\infty = \max_i |w_i|$. For $p \in [1, \infty]$, we define the conjugate $q \in [1, \infty]$ via $1/p + 1/q = 1$.

We call the set of non-zero elements of a vector $w$ the support, denoted by $\operatorname{supp}(w) = \{i : w_i \neq 0\}$. Following the notation of submodular analysis, we write $F(i)$ for $F(\{i\})$. For a function $f$, we will denote by $f^*$ its Fenchel–Legendre conjugate. We will denote by $\iota_S$ the indicator function of the set $S$, taking value $0$ on the set and $+\infty$ outside it.

## 2 Combinatorial penalties and convex relaxations

We consider positive-valued set functions of the form $F : 2^V \to \mathbb{R}_+ \cup \{+\infty\}$ with $F(\emptyset) = 0$ to encode structured sparsity models. For generality, we do not assume a priori that $F$ is monotone (i.e., that $S \subseteq T$ implies $F(S) \leq F(T)$). However, as we argue in the sequel, seeking tight convex relaxations of non-monotone set functions is hopeless.

The domain of $F$ is defined as $\operatorname{dom}(F) = \{S : F(S) < +\infty\}$. We assume that it covers $V$, i.e., $\bigcup_{S \in \operatorname{dom}(F)} S = V$, which is equivalent to assuming that $F$ is finite at singletons if $F$ is monotone. A finite-valued set function is submodular if and only if, for all $A \subseteq B$ and $i \notin B$, $F(A \cup \{i\}) - F(A) \geq F(B \cup \{i\}) - F(B)$ (see, e.g., [fujishige2005submodular, bach2011learning]). Unless explicitly stated, we do not assume that $F$ is submodular.

We consider the same model as in [obozinski2012convex], with general $\ell_p$-regularized combinatorial penalties, parametrized by $p \in [1, \infty]$ (with conjugate $q$, $1/p + 1/q = 1$):

$$F_p(w) = \frac{1}{q} F(\operatorname{supp}(w)) + \frac{1}{p}\, \|w\|_p^p,$$

where the set function $F$ controls the structure of the model, in terms of allowed/favored non-zero patterns, and the $\ell_p$-norm serves to control the magnitude of the coefficients. Allowing $F$ to take infinite values lets us enforce hard constraints. For $p = \infty$, $F_p$ reduces to $F_\infty(w) = F(\operatorname{supp}(w)) + \iota_{\|w\|_\infty \leq 1}(w)$. Considering the case $p < \infty$ is appealing to avoid the clustering artifacts of the values of the learned vector induced by the $\ell_\infty$-norm.
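For small examples, the penalty $F_p$ can be evaluated directly. The sketch below is a minimal illustration (the function name and the cardinality prior are our own choices, not notation from the paper):

```python
import numpy as np

def combinatorial_penalty(w, F, p=2.0, tol=1e-12):
    """Evaluate F_p(w) = (1/q) F(supp(w)) + (1/p) ||w||_p^p, with 1/p + 1/q = 1."""
    q = p / (p - 1.0)                                    # Hoelder conjugate of p
    support = frozenset(np.flatnonzero(np.abs(w) > tol).tolist())
    return F(support) / q + float(np.sum(np.abs(w) ** p)) / p

# cardinality prior F(S) = |S|: F_2(w) = (1/2)||w||_0 + (1/2)||w||_2^2
w = np.array([0.0, 2.0, -1.0])
value = combinatorial_penalty(w, len, p=2.0)   # (1/2)*2 + (1/2)*(4 + 1) = 3.5
```

Passing `len` as the set function works because Python's `len` on a `frozenset` is exactly the cardinality $|S|$.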

Since such combinatorial regularizers lead to computationally intractable problems, we seek convex surrogate penalties that capture the encoded structure as much as possible. A natural candidate for a convex surrogate of $F_p$ is then its convex envelope (largest convex lower bound), given by the biconjugate $F_p^{**}$ (the Fenchel conjugate of the Fenchel conjugate). Two general approaches are proposed in the literature to achieve just this: one requires the surrogate to also be positively homogeneous [obozinski2012convex] and thus considers the convex envelope of the positively homogeneous envelope of $F_p$, given by $F(\operatorname{supp}(w))^{1/q} \|w\|_p$, which we denote by $\Omega_p$; the other computes instead the convex envelope of $F_p$ directly [halabi2015totally, bach2010structured], which we denote by $\Theta_p$. Note that, from the definition of the convex envelope, it holds that $\Omega_p \leq \Theta_p \leq F_p$.

### 2.1 Homogeneous and non-homogeneous convex envelopes

In [obozinski2012convex], the homogeneous convex envelope $\Omega_p$ of $F_p$ was shown to correspond to the latent group Lasso norm [obozinski2011group] with the groups set to all elements of the power set $2^V$. We recall this form of $\Omega_\infty$ in Lemma 1, as well as a variational form of $\Omega_p$ which highlights the relation between the two. Other variational forms can be found in the Appendix.

###### Lemma 1 ([obozinski2012convex]).

The homogeneous convex envelope of $F_p$ is given by

$$\Omega_p(w) = \inf_{\eta \in \mathbb{R}_+^d} \frac{1}{p}\sum_{j=1}^d \frac{|w_j|^p}{\eta_j^{p-1}} + \frac{1}{q}\,\Omega_\infty(\eta), \qquad (1)$$

$$\Omega_\infty(w) = \min_{\alpha \geq 0} \Big\{ \sum_{S \subseteq V} \alpha_S F(S) \;:\; \sum_{S \subseteq V} \alpha_S 1_S \geq |w| \Big\}. \qquad (2)$$

The non-homogeneous convex envelope of $F_p$ has only been considered thus far in the case $p = \infty$. [halabi2015totally] shows that $\Theta_\infty(w) = f(|w|) + \iota_{\|w\|_\infty \leq 1}(w)$, where $f$ is any proper, lower semi-continuous (l.s.c.) convex extension of $F$, i.e., $f(1_S) = F(S)$ for all $S \subseteq V$ (cf., Lemma 1 in [halabi2015totally]). A natural choice for $f$ is the convex closure of $F$, which corresponds to the tightest convex extension of $F$ on $[0,1]^d$ (cf., Appendix for a more rigorous treatment).

Lemma 2 below presents this choice, deriving a new form of $\Theta_\infty$ that parallels (2). We also derive the non-homogeneous convex envelope of $F_p$ for any $p \in (1, \infty)$ and present the variational form relating it to $\Theta_\infty$ in Lemma 2. For simplicity, the variational form (3) presented below holds only for monotone functions $F$; the general form and other variational forms that parallel the ones known for the homogeneous envelope are presented in the Appendix.

###### Lemma 2.

The non-homogeneous convex envelope of $F_p$, for monotone functions $F$, is given by

$$\Theta_p(w) = \inf_{\eta \in [0,1]^d} \frac{1}{p}\sum_{j=1}^d \frac{|w_j|^p}{\eta_j^{p-1}} + \frac{1}{q}\,\Theta_\infty(\eta), \qquad (3)$$

$$\Theta_\infty(w) = \min_{\alpha \geq 0} \Big\{ \sum_{S \subseteq V} \alpha_S F(S) \;:\; \sum_{S \subseteq V} \alpha_S 1_S \geq |w|,\; \sum_{S \subseteq V} \alpha_S = 1 \Big\}. \qquad (4)$$

The infima in (1) and (3) can be replaced by a minimization if we extend the function $(w_j, \eta_j) \mapsto |w_j|^p / \eta_j^{p-1}$ by continuity at $\eta_j = 0$, setting it to $0$ if $w_j = 0$ and to $+\infty$ otherwise, as suggested in [jenatton2010structured] and [bach2012optimization]. Note that, for $p = 1$, both relaxations reduce to the $\ell_1$-norm. Hence, the $\ell_1$-relaxations essentially lose the combinatorial structure encoded in $F$. We thus focus in what follows on the case $p > 1$.

In order to decide when to employ $\Omega_p$ or $\Theta_p$, it is of interest to study the respective properties of these two relaxations and to identify when they coincide. Remark 1 shows that the homogeneous and non-homogeneous envelopes are identical, for $p = \infty$, for monotone submodular functions.

###### Remark 1.

If $F$ is a monotone submodular function, then $\Theta_\infty(w) = \Omega_\infty(w) + \iota_{\|w\|_\infty \leq 1}(w)$, with $\Omega_\infty(w) = f_L(|w|)$, where $f_L$ denotes the Lovász extension of $F$ [lovasz1983submodular].

The two relaxations do not coincide in general: note the constraint $\eta \in [0,1]^d$ in (3) and the sum constraint on $\alpha$ in (4), both absent from the homogeneous forms. Another clear difference to note is that the $\Omega_p$ are norms that belong to the broad family of H-norms [micchelli2013regularizers, bach2012optimization], as shown in [obozinski2012convex]. On the other hand, by virtue of being non-homogeneous, the $\Theta_p$ are not norms in general. We illustrate below two interesting examples where $\Omega_p$ and $\Theta_p$ differ.

###### Example 1 (Berhu penalty).

Since the cardinality function $F(S) = |S|$ is a monotone submodular function, $\Theta_\infty$ and $\Omega_\infty$ coincide on the unit $\ell_\infty$-ball. However, this is not the case for $p < \infty$. In particular, we consider the $\ell_2$-regularized cardinality function $F_2(w) = \frac{1}{2}\|w\|_0 + \frac{1}{2}\|w\|_2^2$. Figure 1 shows that the non-homogeneous envelope is tighter than the homogeneous one in this case. Indeed, $\Omega_2$ is simply the $\ell_1$-norm, while $\Theta_2$ is given coordinate-wise by $\theta(t) = |t|$ if $|t| \leq 1$ and $\theta(t) = (t^2 + 1)/2$ otherwise. This penalty, called "Berhu," was introduced in [owen2007robust] to produce a robust ridge regression estimator and is shown to be the convex envelope of $\frac{1}{2}\|w\|_0 + \frac{1}{2}\|w\|_2^2$ in [jojic2011convex].
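The coordinate-wise claim can be checked numerically: for the cardinality prior, minimizing the variational form (3) over $\eta_j \in (0, 1]$ reproduces the Berhu penalty. A small sketch (the grid search is purely for illustration):

```python
import numpy as np

def berhu(t):
    """Claimed closed form of Theta_2 for the cardinality prior, per coordinate."""
    t = abs(t)
    return t if t <= 1 else (t * t + 1) / 2

def theta2_coordinate(t, grid=200001):
    """Variational form (3) per coordinate: min over eta in (0,1] of
    t^2/(2*eta) + eta/2, since Theta_inf(eta) = sum_j eta_j for F = |.|."""
    etas = np.linspace(1e-9, 1.0, grid)
    return float(np.min(t * t / (2 * etas) + etas / 2))
```

The unconstrained minimizer is $\eta = |t|$ (giving $|t|$); once $|t| \geq 1$ the constraint $\eta \leq 1$ binds and the penalty becomes quadratic, which is exactly the Berhu transition.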

This kind of behavior, where the non-homogeneous relaxation acts as an $\ell_1$-norm on the small coefficients and as $F_p$ on the large ones, is not limited to the Berhu penalty, but holds for general set functions. However, the point where the penalty moves from one mode to the other depends on the structure of $F$ and is different along each coordinate. This is easier to see via the second variational form of $\Theta_p$ presented in the Appendix. We further illustrate this in the following example.

###### Example 2 (Range penalty).

Consider the range function defined as $\operatorname{range}(S) = \max(S) - \min(S) + 1$ for $S \neq \emptyset$ and $\operatorname{range}(\emptyset) = 0$, where $\max(S)$ ($\min(S)$) denotes the maximal (minimal) element in $S$. This penalty allows us to favor the selection of interval non-zero patterns on a chain or rectangular patterns on grids. It was shown in [obozinski2012convex] that $\Omega_p$ reduces to the $\ell_1$-norm for any $p$. On the other hand, $\Theta_p$ has no closed form, but is different from the $\ell_1$-norm. Figure 2 illustrates the balls of different radii of $\Theta_\infty$ and $\Theta_2$. We can see how the penalty morphs from the $\ell_1$-norm to the $\ell_\infty$-norm and the squared $\ell_2$-norm, respectively, with different "speed" along each coordinate. Looking carefully, for example, at the ball of $\Theta_2$, we can see that the penalty acts as an $\ell_1$-norm along one coordinate plane and as a squared $\ell_2$-norm along another.

We highlight other ways in which the two relaxations differ and their implications in the sequel.

In terms of computational efficiency, note that even though the formulations (1) and (3) are jointly convex in $(w, \eta)$, $\Omega_p$ and $\Theta_p$ can still be intractable to compute and to optimize. However, for certain classes of functions, they are tractable. For example, since $\Omega_\infty$ is the Lovász extension of $F$ for monotone submodular functions, as stated in Remark 1, it can be efficiently computed by the greedy algorithm [bach2011learning]. Moreover, efficient algorithms to compute $\Omega_p$ and the associated proximal operator, and to solve learning problems regularized with $\Omega_p$, are proposed in [obozinski2012convex]. Similarly, if $F$ can be expressed by integer programs over totally unimodular constraints as in [halabi2015totally], then $\Theta_\infty$, $\Omega_\infty$, and their associated Fenchel-type operators can be computed efficiently by linear programs. Hence, we can use conditional gradient algorithms for numerical solutions.

### 2.2 Lower combinatorial envelopes

In this section, we are interested in analyzing which combinatorial structures are preserved by each relaxation. To that end, we generalize the notion of lower combinatorial envelope (LCE) [obozinski2012convex]. The homogeneous LCE of $F$ is defined as the set function $\tilde F$ which agrees with the $\ell_\infty$-homogeneous convex relaxation of $F$ at the vertices of the unit hypercube, i.e., $\tilde F(A) = \Omega_\infty(1_A)$.

For the non-homogeneous relaxation, we define the non-homogeneous LCE similarly as $\tilde F^-(A) = \Theta_\infty(1_A)$. The $\ell_\infty$-relaxation reflects most directly the combinatorial structure of the function $F$. Indeed, the $\ell_p$-relaxations depend on $F$ only through the $\ell_\infty$-relaxation, as expressed in the variational forms (1) and (3).

We say $\Omega_\infty$ is a tight relaxation of $F$ if $\tilde F = F$. Similarly, $\Theta_\infty$ is a tight relaxation of $F$ if $\tilde F^- = F$. $\Omega_\infty$ and $\Theta_\infty$ are then extensions of $F$ from $\{0,1\}^d$ to $\mathbb{R}^d$; in this sense, the relaxation is tight for all $w$ of the form $w = 1_A$, $A \subseteq V$. Moreover, following the definition of the convex envelope, the relaxation $\Omega_p$ (resp. $\Theta_p$) is the same for $F$ and $\tilde F$ (resp. $F$ and $\tilde F^-$), and hence the LCE can be interpreted as the combinatorial structure preserved by each convex relaxation.

The homogeneous relaxation can capture any monotone submodular function [obozinski2012convex]: since $\Omega_\infty$ is then the Lovász extension [bach2010structured], we have $\Omega_\infty(1_A) = F(A)$, and hence $\tilde F = F$. Also, since the two $\ell_\infty$-relaxations are identical for this class of functions, their LCEs are also equal, i.e., $\tilde F^- = \tilde F = F$.

The LCEs, however, are not equal in general. In fact, the non-homogeneous relaxation is tight for a larger class of functions. In particular, the following proposition shows that $\tilde F^-$ is equal to the monotonization of $F$, that is, $\tilde F^-(A) = \min_{S \supseteq A} F(S)$, for all set functions $F$, and is thus equal to the function itself if $F$ is monotone.

###### Proposition 1.

The non-homogeneous lower combinatorial envelope can be written as

$$\tilde F^-(A) = \Theta_\infty(1_A) = \min_{\alpha_S \in \{0,1\}} \Big\{ \sum_{S \subseteq V} \alpha_S F(S) : \sum_{S \subseteq V} \alpha_S 1_S \geq 1_A,\ \sum_{S \subseteq V} \alpha_S = 1 \Big\} = \min_{S \subseteq V} \{ F(S) : A \subseteq S \}.$$
###### Proof.

To see why we can restrict $\alpha$ to be integral, let $\alpha$ be feasible for (4) with $w = 1_A$. For any $i \in A$, the constraints give $\sum_{S : i \in S} \alpha_S \geq 1 = \sum_{S \subseteq V} \alpha_S$, so that $\alpha_S = 0$ for every $S$ with $i \notin S$. Hence every $S$ with $\alpha_S > 0$ satisfies $A \subseteq S$, and $\sum_{S \subseteq V} \alpha_S F(S) \geq \min_{S \supseteq A} F(S)$, with equality achieved by placing all the mass on a minimizer. ∎

Proposition 1 argues that the non-homogeneous convex envelope is tight if and only if is monotone. Two important practical implications follow from this result.
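Proposition 1 can be verified by brute force on toy functions; the sketch below (with an illustrative non-monotone function of our own choosing) checks that the non-homogeneous LCE is exactly the monotonization of $F$:

```python
import itertools

def lce_nonhomogeneous(F, d):
    """F~^-(A) = min_{S : A subset of S} F(S), by exhaustive enumeration (Prop. 1)."""
    subsets = [frozenset(s) for r in range(d + 1)
               for s in itertools.combinations(range(d), r)]
    return {A: min(F(S) for S in subsets if A <= S) for A in subsets}

# a non-monotone toy function: cardinality, except the full set {0, 1} is cheap
def F(S):
    return 0.5 if S == frozenset({0, 1}) else len(S)

lce = lce_nonhomogeneous(F, 2)
# monotonization: F~^-({0}) = min(F({0}), F({0,1})) = 0.5 < F({0}) = 1
```

The LCE agrees with $F$ exactly on the sets where $F$ is already "monotone from above," and strictly lowers it elsewhere.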

Given a target model that cannot be expressed by a monotone function, it is impossible to obtain a tight convex relaxation; non-convex methods can then potentially perform better.

On the other hand, if the model can be expressed by a monotone non-submodular set function, the homogeneous relaxation may not be tight, and hence a non-homogeneous relaxation can be more useful. For instance, [obozinski2012convex] shows that for any set function $F$ with $F(\{i\}) = 1$ for all singletons $i$ and $F(S) \geq |S|$ for all $S$, the homogeneous LCE is the cardinality, and accordingly $\Omega_p$ is the $\ell_1$-norm, thus losing completely the structure encoded in $F$.

We discuss three examples that fall in this class of functions, where the non-homogeneous relaxation is tight while the homogeneous one is not.

###### Example 3 (Range penalty).

Consider $F = \operatorname{range}$. We have $\tilde F = |\cdot|$ (the cardinality), while $\tilde F^- = F$ by Prop. 1.

###### Example 4 (Dispersive ℓ0-penalty).

Given a set of predefined groups $\mathcal{G} = \{G_1, \dots, G_M\}$, consider the dispersive $\ell_0$-penalty, introduced by [halabi2015totally]: $F_{\mathrm{disp}}(S) = |S| + \iota\{A_{\mathcal{G}}^\top 1_S \leq 1\}$, where the columns of $A_{\mathcal{G}} \in \{0,1\}^{d \times M}$ correspond to the indicator vectors of the groups, i.e., $A_{\mathcal{G}} = [1_{G_1}, \dots, 1_{G_M}]$. The dispersive penalty enforces the selection of sparse supports where no two non-zeros are selected from the same group. Neural sparsity models induce such structures [hegde2009compressive]. In this case, we have $\tilde F = |\cdot|$, while $\tilde F^- = F_{\mathrm{disp}}$ by Prop. 1.

###### Example 5 (Weighted graph model).

Given a weighted graph $G = (V, E)$, consider a relaxed version of the weighted graph model of [hegde2015nearly]: $F(S) = |S| + \iota\{\gamma(S) \leq g,\, w(\mathcal{F}_S) \leq B\}$, where $\gamma(S)$ is the number of connected components formed by the forest $\mathcal{F}_S$ corresponding to $S$ and $w(\mathcal{F}_S)$ is the total weight of edges in the forest $\mathcal{F}_S$. This model describes a wide range of structures, including 1D-clustering, tree hierarchies, and the Earth Mover's Distance model. We have $\tilde F = |\cdot|$, while $\tilde F^- = F$ by Prop. 1.

The last two examples belong to a natural class of structured sparsity penalties of the form $F(S) = |S| + \iota_{\mathcal{M}}(S)$, which favors sparse non-zero patterns among a set $\mathcal{M} \subseteq 2^V$ of allowed patterns. If $\mathcal{M}$ is down-monotone, i.e., $S' \subseteq S$ and $S \in \mathcal{M}$ imply $S' \in \mathcal{M}$, then the non-homogeneous relaxation preserves its structure, i.e., $\tilde F^- = F$, while the homogeneous relaxation is oblivious to the hard constraints, with $\tilde F = |\cdot|$.
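This preservation claim can be checked on a small dispersive example (the groups below are a hypothetical choice of ours), reusing the brute-force monotonization of Proposition 1:

```python
import itertools

groups = [{0, 1}, {1, 2}, {3, 4}]       # hypothetical predefined groups

def f_disp(S):
    """Dispersive l0-penalty: |S| if no two elements share a group, else +inf."""
    feasible = all(len(S & g) <= 1 for g in groups)
    return len(S) if feasible else float("inf")

subsets = [frozenset(s) for r in range(6)
           for s in itertools.combinations(range(5), r)]
# non-homogeneous LCE = monotonization (Proposition 1)
lce = {A: min(f_disp(S) for S in subsets if A <= S) for A in subsets}

# since feasibility is down-monotone, the LCE equals F itself: F~^- = F
preserved = all(lce[A] == f_disp(A) for A in subsets)
```

Infeasible sets stay at $+\infty$ (all their supersets are infeasible too), and feasible sets keep their cardinality, so the combinatorial structure survives intact.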

## 3 Sparsity-inducing properties of convex relaxations

The notion of LCE captures the combinatorial structure preserved by convex relaxations in a geometric sense. In this section, we characterize the preserved structure from a statistical perspective.

To this end, we consider the linear regression model $y = Xw^* + \varepsilon$, where $X \in \mathbb{R}^{n \times d}$ is a fixed design matrix, $y \in \mathbb{R}^n$ is the response vector, and $\varepsilon \in \mathbb{R}^n$ is a vector of i.i.d. random variables with mean $0$ and variance $\sigma^2$. Given $\lambda_n > 0$, we define $\hat w$ as a minimizer of the regularized least-squares problem

$$\min_{w \in \mathbb{R}^d} \frac{1}{2}\|y - Xw\|_2^2 + \lambda_n \Phi(w). \qquad (5)$$

We are interested in the sparsity-inducing properties of $\Omega_p$ and $\Theta_p$ on the solutions of (5). In this section, we consider though the more general setting where $\Phi$ is any proper normalized ($\Phi(0) = 0$) convex function which is absolute, i.e., $\Phi(w) = \Phi(|w|)$, and monotonic in the absolute values of $w$, that is, $|w| \leq |w'|$ implies $\Phi(w) \leq \Phi(w')$. In what follows, monotone functions refer to this notion of monotonicity.

We determine in Section 3.1 necessary conditions for support recovery in (5), and in Section 3.2 we provide sufficient conditions for support recovery and consistency of a variant of (5). As both $\Omega_p$ and $\Theta_p$ are normalized absolute monotone convex functions, the results presented in this section apply directly to them as a corollary.

For simplicity, we assume that $Q := \frac{1}{n} X^\top X$ is invertible, so that $\hat w$ is unique. This forbids the high-dimensional setting. We expect, though, the insights developed towards the presented results to contribute to the understanding of the high-dimensional learning setting, which we defer to later work.

### 3.1 Continuous stable supports

Existing results on the consistency of special cases of the estimator (5) typically rely heavily on decomposition properties of $\Phi$ [negahban2011unified, bach2010structured, obozinski2011group, obozinski2012convex]. The notions of decomposability assumed in these prior works are either too strong or too specific to be applicable to the general convex penalties $\Omega_p$ and $\Theta_p$ we are considering. Instead, we introduce a general weak notion of decomposability applicable to any absolute monotone convex regularizer.

###### Definition 1 (Decomposability).

Given $w \in \mathbb{R}^d$ and $J \subseteq V$ with $\operatorname{supp}(w) \subseteq J$, we say that $\Phi$ is decomposable at $w$ w.r.t. $J$ if there exists $M_J > 0$ such that, for all $\Delta \in \mathbb{R}^d$ with $\operatorname{supp}(\Delta) \subseteq J^c$,

$$\Phi(w + \Delta) \geq \Phi(w) + M_J \|\Delta\|_\infty.$$

For example, for the $\ell_1$-norm, this decomposability property holds for any $w$ and $J \supseteq \operatorname{supp}(w)$, with $M_J = 1$.
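A quick randomized sanity check of this inequality for the $\ell_1$-norm (a sketch; since the supports of $w$ and $\Delta$ are disjoint, $\|w+\Delta\|_1 = \|w\|_1 + \|\Delta\|_1 \geq \|w\|_1 + \|\Delta\|_\infty$):

```python
import numpy as np

def l1_decomposable_at(w, delta, tol=1e-12):
    """Check ||w + delta||_1 >= ||w||_1 + ||delta||_inf for disjoint supports."""
    assert not np.any((w != 0) & (delta != 0)), "supports must be disjoint"
    lhs = np.abs(w + delta).sum()
    rhs = np.abs(w).sum() + np.abs(delta).max()
    return lhs >= rhs - tol

rng = np.random.default_rng(0)
for _ in range(1000):
    mask = rng.random(6) < 0.5
    w = rng.normal(size=6) * mask          # supp(w) inside J
    delta = rng.normal(size=6) * ~mask     # supp(delta) inside J^c
    assert l1_decomposable_at(w, delta)
```

The randomized loop is only illustrative; for the $\ell_1$-norm the inequality holds deterministically by the argument in the lead-in.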

It is reasonable to expect this property to hold at the solution $\hat w$ of (5) with $J = \operatorname{supp}(\hat w)$. Theorem 1 shows that this is indeed the case. In Section 3.2, we devise an estimation scheme able to recover supports for which this property holds at any $w$ supported on them. This leads to the following notion of continuous stable supports, which characterizes supports with respect to the continuous penalty $\Phi$. In Section 4, we relate it to the notion of discrete stable supports, which characterizes supports with respect to the combinatorial penalty $F$.

###### Definition 2 (Continuous stability).

We say that $J \subseteq V$ is weakly stable w.r.t. $\Phi$ if there exists $w$ with $\operatorname{supp}(w) = J$ such that $\Phi$ is decomposable at $w$ w.r.t. $J$. Furthermore, we say that $J$ is strongly stable w.r.t. $\Phi$ if, for all $w$ s.t. $\operatorname{supp}(w) \subseteq J$, $\Phi$ is decomposable at $w$ w.r.t. $J$.

Theorem 1 considers slightly more general estimators than (5) and shows that weak stability is a necessary condition for a non-zero pattern to be allowed as a solution.

###### Theorem 1.

The minimizer of $w \mapsto L(w) + \lambda \Phi(w)$, where $L$ is a strongly-convex and smooth loss function and the data $y$ has a continuous density w.r.t. the Lebesgue measure, has a weakly stable support w.r.t. $\Phi$, with probability one.

This new result extends and simplifies the result in [bach2010structured], which considers the special case of quadratic loss functions and $\Phi$ being the $\ell_\infty$-convex relaxation of a submodular function. The proof we present in the Appendix is also shorter and simpler.

###### Corollary 1.

Assume $y$ has a continuous density w.r.t. the Lebesgue measure; then the support of the minimizer of Eq. (5) is weakly stable w.r.t. $\Phi$ with probability one.

Restricting the choice of regularizers in (5) to convex relaxations as surrogates to combinatorial penalties is motivated by computational tractability concerns. However, other non-convex regularizers, such as $\ell_p$-quasi-norms for $p < 1$ [knight2000asymptotics, frank1993statistical] or, more generally, penalties of the form $\sum_{i=1}^d \sigma(|w_i|)$, where $\sigma$ is a monotone concave function [fan2001variable, daubechies2010iteratively, gasso2009recovering], can be more advantageous than the convex $\ell_1$-norm. Such penalties are closer to the $\ell_0$-quasi-norm and penalize small coefficients more aggressively; thus they have a stronger sparsity-inducing effect than the $\ell_1$-norm.

The authors in [jenatton2010structured] extended such concave penalties to the $\ell_\alpha$-$\ell_2$ quasi-norm for some $\alpha \in (0, 1)$, which enforces sparsity at the group level more aggressively. We generalize this to $\Psi(w) = \Phi(|w|^\alpha)$, taken element-wise, where $\Phi$ is any structured sparsity-inducing monotone convex regularizer.

These non-convex penalties lead to intractable estimation problems, but approximate solutions can be obtained by majorization-minimization algorithms, as suggested, e.g., in [figueiredo2007majorization, zou2008one, candes2008enhancing].

###### Lemma 3.

Let $\Phi$ be a monotone convex function and $\alpha \in (0, 1)$. Then $\Psi(w) = \Phi(|w|^\alpha)$ admits the following majorizer: $\Psi(w) \leq \Phi\big(\alpha |w_0|^{\alpha - 1} \circ |w| + (1 - \alpha)|w_0|^\alpha\big)$, for all $w, w_0 \in \mathbb{R}^d$, which is tight at $w = w_0$.

We consider the adaptive weights estimator (6), resulting from applying a one-step majorization-minimization to (5):

$$\min_{w \in \mathbb{R}^d} \frac{1}{2}\|y - Xw\|_2^2 + \lambda_n \Phi\big(|\hat w_0|^{\alpha - 1} \circ |w|\big), \qquad (6)$$

where $\hat w_0$ is a $\sqrt{n}$-consistent estimator of $w^*$, that is, an estimator converging to $w^*$ at rate $1/\sqrt{n}$ (typically obtained by ordinary least squares).
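For $\Phi = \ell_1$, (6) reduces to the adaptive Lasso of [zou2006adaptive]. A self-contained sketch of that special case (ISTA solver; the dimensions, design, and regularization value are illustrative choices of ours, not the paper's experimental setup):

```python
import numpy as np

def adaptive_l1(X, y, lam, alpha=0.0, iters=2000):
    """Estimator (6) with Phi = l1: a weighted Lasso with per-coordinate
    weights |w0|^(alpha - 1), w0 an OLS pilot; solved by ISTA."""
    n, d = X.shape
    w0 = np.linalg.lstsq(X, y, rcond=None)[0]      # sqrt(n)-consistent pilot
    u = np.abs(w0) ** (alpha - 1.0)                # adaptive weights (alpha = 0: 1/|w0|)
    L = np.linalg.norm(X, 2) ** 2                  # Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(iters):
        z = w - X.T @ (X @ w - y) / L              # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lam * u / L, 0.0)  # weighted soft-threshold
    return w

# tiny demo: true support is the interval {1, 2} in d = 5
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_star = np.array([0.0, 1.0, -1.5, 0.0, 0.0])
y = X @ w_star + 0.1 * rng.normal(size=200)
w_hat = adaptive_l1(X, y, lam=5.0)
```

The pilot magnifies the penalty on coordinates whose OLS estimate is near zero and leaves the true non-zeros almost unpenalized, which is the mechanism behind the support-recovery result below.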

We study sufficient support recovery and estimation consistency conditions for (6) for general convex monotone regularizers $\Phi$. Such consistency results have been established for (6), in the classical asymptotic setting, only in the special case of the $\ell_1$-norm [zou2006adaptive]. For the (non-adaptive) estimator (5), they have been established for homogeneous convex envelopes of monotone submodular functions, for $p = \infty$ in [bach2010structured] and for general $p$ in [obozinski2012convex], in the high-dimensional setting, and for the latent group Lasso norm in [obozinski2011group], in the asymptotic setting.

Compared to prior works, the discussion of support recovery is complicated here by the fact that $\Phi$ is not necessarily a norm (e.g., if $\Phi = \Theta_p$) and only satisfies a weak notion of decomposability.

As in [zou2006adaptive], we consider the classical asymptotic regime, in which the model generating the data is of fixed finite dimension $d$ while $n \to \infty$. As before, we assume $\frac{1}{n} X^\top X \to Q$ with $Q$ invertible, and thus the minimizer of (6) is unique; we denote it by $\hat w$.

The following theorem extends the results of [zou2006adaptive] for the $\ell_1$-norm to any normalized absolute monotone convex regularizer, provided the true support satisfies the sufficient condition of strong stability in Definition 2. As we previously remarked, this condition is trivially satisfied for the $\ell_1$-norm.

###### Theorem 2.

[Consistency and Support Recovery] Let $\Phi$ be a proper normalized absolute monotone convex function and denote by $J$ the true support, $J = \operatorname{supp}(w^*)$. If $\lambda_n/\sqrt{n} \to 0$, $J$ is strongly stable with respect to $\Phi$, and $\lambda_n$ satisfies $\lambda_n n^{-\alpha/2} \to \infty$, then the estimator (6) is consistent and asymptotically normal, i.e., it satisfies

$$\sqrt{n}\,(\hat w_J - w^*_J) \xrightarrow{\;d\;} \mathcal{N}\big(0,\, \sigma^2 Q_{JJ}^{-1}\big), \qquad (7)$$

and

$$P\big(\operatorname{supp}(\hat w) = J\big) \to 1. \qquad (8)$$

Consistency results in most existing works are established under various conditions on the design matrix, some of which are difficult to verify in practice, such as the irrepresentability condition (c.f., [zou2006adaptive, bach2010structured, obozinski2011group, obozinski2012convex]). Adding data-dependent weights removes the need for such conditions and allows recovery even in the correlated measurement matrix setup, as illustrated in our numerical results (c.f., Sect. 5).

## 4 Sparsity-inducing properties of combinatorial penalties

In Section 3, we derived necessary and sufficient conditions for support recovery defined with respect to the continuous convex penalties $\Omega_p$ and $\Theta_p$. In this section, we translate these into conditions with respect to the combinatorial penalties themselves. Hence, the results of this section allow one to check which supports one can expect to recover, without the need to compute the corresponding convex relaxation. To that end, we introduce in Section 4.1 discrete counterparts of weak and strong stability, and show in Section 4.2 that discrete strong stability is a sufficient, and in some cases necessary, condition for support recovery.

### 4.1 Discrete stable supports

We recall the concept of discrete stable sets [bach2010structured], also referred to as flat or closed sets [krause2012near]. We refer to such sets as discrete weakly stable sets and introduce a stronger notion of discrete stability.

###### Definition 3 (Discrete stability).

Given a monotone set function $F$, a set $J \subseteq V$ is said to be weakly stable w.r.t. $F$ if $F(J \cup \{i\}) > F(J)$, $\forall i \in J^c$.
A set $J \subseteq V$ is said to be strongly stable w.r.t. $F$ if $F(A) > F(J)$, $\forall A \supsetneq J$.

Note that discrete stability implies in particular feasibility, i.e., $F(J) < +\infty$. Also, if $F$ is a strictly monotone function, such as the cardinality function, then all supports are stable w.r.t. $F$. It is interesting to note that, for monotone submodular functions, weak and strong stability are equivalent. In fact, this equivalence holds for a more general class of functions, which we call $\rho$-submodular.
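Under our reading of Definition 3, the stable sets of a given $F$ can be enumerated exhaustively in small dimensions; for the range function both notions coincide and yield exactly the interval supports. A sketch:

```python
import itertools

def range_f(S):
    return 0 if not S else max(S) - min(S) + 1

def stable_sets(F, d):
    """Weakly stable: adding any single outside element strictly increases F.
    Strongly stable: every strict superset has a strictly larger value."""
    subsets = [frozenset(s) for r in range(d + 1)
               for s in itertools.combinations(range(d), r)]
    weak = {J for J in subsets
            if all(F(J | {i}) > F(J) for i in range(d) if i not in J)}
    strong = {J for J in subsets
              if all(F(A) > F(J) for A in subsets if A > J)}
    return weak, strong

weak, strong = stable_sets(range_f, 4)
intervals = {frozenset(range(a, b)) for a in range(5) for b in range(a, 5)}
```

A set with a gap, such as $\{0, 2\}$, is not stable: filling the gap leaves the range unchanged.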

###### Definition 4.

A function $F$ is $\rho$-submodular iff there exists $\rho \in (0, 1]$ s.t., for all $A \subseteq B \subseteq V$ and $i \in B^c$,

$$\rho\,[F(B \cup \{i\}) - F(B)] \leq F(A \cup \{i\}) - F(A).$$

The notion of $\rho$-submodularity is a special case of the weakly DR-submodular property defined for continuous functions [hassani2017gradient]. It is also related to the notion of weak submodularity (c.f., [das2011submodular, elenberg2016restricted]). We show in the Appendix that $\rho$-submodularity is a stronger condition.

###### Proposition 2.

If $F$ is a finite-valued monotone function, then $F$ is $\rho$-submodular iff discrete weak stability is equivalent to discrete strong stability.

###### Example 6.

The range function is $\rho$-submodular with $\rho = 1/(d-1)$.
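The largest feasible constant in Definition 4 can be recovered by exhaustive search in small dimensions; for the range function this search suggests $\rho = 1/(d-1)$ (the worst case being $A = \emptyset$ against a distant addition to a singleton $B$). A sketch:

```python
import itertools

def range_f(S):
    return 0 if not S else max(S) - min(S) + 1

def best_rho(F, d):
    """Largest rho with rho*[F(B+i) - F(B)] <= F(A+i) - F(A), all A ⊆ B, i ∉ B."""
    subsets = [frozenset(s) for r in range(d + 1)
               for s in itertools.combinations(range(d), r)]
    rho = 1.0
    for B in subsets:
        for i in range(d):
            if i in B:
                continue
            gain_b = F(B | {i}) - F(B)
            if gain_b <= 0:
                continue                       # constraint vacuous for this (B, i)
            for A in subsets:
                if A <= B:
                    rho = min(rho, (F(A | {i}) - F(A)) / gain_b)
    return rho
```

For example, `best_rho(range_f, 4)` searches all $(A, B, i)$ triples on a chain of length 4.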

### 4.2 Relation between discrete and continuous stability

This section provides several technical results relating the discrete and continuous notions of stability. It thus provides us with the necessary tools to characterize which supports can be correctly estimated w.r.t the combinatorial penalty itself, without going through its relaxations.

###### Proposition 3.

Given any monotone set function $F$, all sets strongly stable w.r.t. $F$ are also strongly stable w.r.t. $\Omega_p$ and $\Theta_p$.

It follows then by Theorem 2 that discrete strong stability is a sufficient condition for correct estimation.

###### Corollary 2.

If $\Phi$ is equal to $\Omega_p$ or $\Theta_p$ for $p \in (1, \infty)$ and $J$ is strongly stable w.r.t. $F$, then Theorem 2 holds, i.e., the adaptive estimator (6) is consistent and correctly recovers the support. This also holds for $\Theta_\infty$ if we further assume that $\|w^*\|_\infty < 1$.

Furthermore, if $F$ is $\rho$-submodular, then by Proposition 2 it is enough for $J$ to be weakly stable w.r.t. $F$ for Corollary 2 to hold. Conversely, Proposition 4 below shows that discrete strong stability is also a necessary condition for continuous strong stability, in the case $p = \infty$, when $F$ is equal to its LCE.

###### Proposition 4.

If $F = \tilde F$ and $J$ is strongly stable w.r.t. $\Omega_\infty$, then $J$ is strongly stable w.r.t. $F$. Similarly, for any monotone $F$, if $J$ is strongly stable w.r.t. $\Theta_\infty$, then $J$ is strongly stable w.r.t. $F$.

Finally, in the special case of monotone submodular functions, the following Corollary 3, along with Proposition 3, demonstrates that all definitions of stability become equivalent. We thus recover the result in [bach2010structured] showing that discrete weakly stable supports correspond to the set of allowed sparsity patterns for monotone submodular functions.

###### Corollary 3.

If $F$ is monotone submodular and $J$ is weakly stable w.r.t. $F$, then $J$ is weakly stable w.r.t. $\Omega_\infty$.

### 4.3 Examples

We highlight in this section the supports recovered by the adaptive estimator (AE) (6) with the homogeneous convex relaxation $\Omega_p$ and the non-homogeneous convex relaxation $\Theta_p$ of some examples of structure priors. For simplicity, we will focus on the case $p = \infty$. Also, in all the examples we consider below, weak and strong discrete stability are equivalent, so we omit the weak/strong specification. Note that it is desirable that the regularizer used enforce the recovery of only the non-zero patterns satisfying the desired structure.

Monotone submodular functions: As discussed above, for this class of functions, all stability definitions are equivalent, and $\Theta_\infty$ coincides with $\Omega_\infty$ on the unit $\ell_\infty$-ball. As a result, AE recovers any discrete stable non-zero pattern. This includes the following examples (c.f., [obozinski2012convex] for further examples).

• Cardinality: As a strictly monotone function, all supports are stable w.r.t. it. Thus AE recovers all non-zero patterns with $\Omega_\infty$ and $\Theta_\infty$, given by the $\ell_1$-norm.

• Overlap count function: $F(S) = \sum_{G \in \mathcal{G}} d_G\, 1\{G \cap S \neq \emptyset\}$, where $\mathcal{G}$ is a collection of predefined groups and $d_G > 0$ their associated weights. $\Omega_\infty$ and $\Theta_\infty$ are given by the $\ell_1/\ell_\infty$-group Lasso norm $\sum_{G \in \mathcal{G}} d_G \|w_G\|_\infty$, and stable patterns are complements of unions of groups. For example, for hierarchical groups (i.e., groups consisting of each node and its descendants on a tree), AE recovers rooted connected tree supports.

• Modified range function: The range function can be transformed into a submodular function, if scaled by a constant as suggested in [bach2010structured], yielding a monotone submodular modification of the range and its associated norm. This can actually be written as an instance of the overlap count function with groups defined as the intervals {1, …, i} and {i, …, d} for i ∈ V. This norm was proposed by [jenatton2011structured] to induce interval patterns, and indeed its stable patterns are interval supports. We will compare this function in the experiments with the direct convex relaxations of the range function.
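The overlap count example can be checked numerically: for a monotone submodular F, Ω_∞ computed from the linear program (9) in the appendix coincides with the Lovász extension of F at |w|, which for the overlap count is the weighted ℓ_1/ℓ_∞-group norm. The groups, weights, and test vector below are hypothetical, and the sketch relies on `scipy.optimize.linprog`; it is an illustration, not the paper's code.

```python
import numpy as np
from itertools import combinations
from scipy.optimize import linprog

d = 4
groups = {(0, 1): 1.0, (1, 2, 3): 1.0, (3,): 0.5}  # hypothetical groups and weights

def F(S):
    """Overlap count: total weight of the groups intersecting S."""
    return sum(v for G, v in groups.items() if set(G) & S)

subsets = [set(c) for k in range(1, d + 1) for c in combinations(range(d), k)]
w = np.array([0.3, -1.0, 0.2, 0.7])

# LP (9): Omega_inf(w) = min_a sum_S a_S F(S)  s.t.  sum_S a_S 1_S >= |w|, a >= 0
c = np.array([F(S) for S in subsets])
A = np.array([[1.0 if i in S else 0.0 for S in subsets] for i in range(d)])
res = linprog(c, A_ub=-A, b_ub=-np.abs(w), bounds=(0, None))

# Closed form for the overlap count: the weighted ell_1/ell_inf group norm
group_norm = sum(v * max(abs(w[i]) for i in G) for G, v in groups.items())
assert abs(res.fun - group_norm) < 1e-6
```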

Range function: The range function is p-submodular, so its discrete strongly and weakly stable supports are identical and correspond to interval supports. As a result, AE recovers interval supports with Θ_∞. On the other hand, since the homogeneous LCE of the range function is the cardinality, AE recovers all supports with Ω_∞.
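The interval claim can be made concrete by enumeration, using the discrete notion of stability from [bach2010structured] (a set is stable if adding any outside element strictly increases F); the dimension below is an arbitrary illustrative choice.

```python
from itertools import combinations

d = 6

def rng(S):
    """Range function: max(S) - min(S) + 1, with rng(empty set) = 0."""
    return (max(S) - min(S) + 1) if S else 0

def is_stable(S):
    """Discrete stability: adding any element outside S strictly increases rng."""
    return all(rng(S | {i}) > rng(S) for i in range(d) if i not in S)

def is_interval(S):
    """A non-empty set is an interval iff its range equals its cardinality."""
    return (not S) or rng(S) == len(S)

subsets = [set(c) for k in range(d + 1) for c in combinations(range(d), k)]
stable = [S for S in subsets if is_stable(S)]

# The stable sets are exactly the intervals (plus the empty set)
assert all(is_interval(S) for S in stable)
assert len(stable) == d * (d + 1) // 2 + 1
```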

Down monotone structures: Indicator functions of the form F(S) = 0 if S ∈ 𝔐 and +∞ otherwise, where the family 𝔐 is down-monotone, also have identical discrete strongly and weakly stable supports, given by the feasible sets in 𝔐. These structures include the dispersive and graph models discussed in Examples 4 and 5. Since their homogeneous LCE is also the cardinality, AE recovers all supports with Ω_∞, and only feasible supports with Θ_∞.

## 5 Numerical Illustration

To illustrate the results presented in this paper, we consider the problem of estimating a parameter vector whose support is an interval. It is then natural to choose as combinatorial penalty the range function, whose stable supports are intervals. We aim to study the effect of adaptive weights, as well as the effect of the choice of homogeneous vs. non-homogeneous convex relaxation for regularization, on the quality of support recovery.

As discussed in Section 4.3, the ℓ_∞-homogeneous convex envelope of the range is simply the ℓ_1-norm. Its ℓ_∞ non-homogeneous convex envelope Θ_∞ can be computed using the formulation (3), where only interval sets need to be considered in the constraints, leading to a quadratic number of constraints. We also consider the ℓ_1/ℓ_∞-norm that corresponds to the convex relaxation of the modified range function.
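The reduction to interval constraints can be verified directly on the variational form (13) of the appendix: replacing any set by its interval hull preserves the range while only improving coverage, so restricting the LP to intervals does not change its value. Below is a sketch with hypothetical data, using `scipy.optimize.linprog`.

```python
import numpy as np
from itertools import combinations
from scipy.optimize import linprog

d = 5

def rng(S):
    """Range function: max(S) - min(S) + 1, with rng(empty set) = 0."""
    return (max(S) - min(S) + 1) if S else 0

def theta_inf(w, sets):
    # LP (13): min_a sum_S a_S F(S)  s.t.  sum_S a_S 1_S >= |w|, sum_S a_S = 1, a >= 0
    c = np.array([float(rng(S)) for S in sets])
    A_ub = -np.array([[1.0 if i in S else 0.0 for S in sets] for i in range(d)])
    A_eq = np.ones((1, len(sets)))
    res = linprog(c, A_ub=A_ub, b_ub=-np.abs(w), A_eq=A_eq, b_eq=[1.0],
                  bounds=(0, None))
    return res.fun

all_sets = [set(c) for k in range(d + 1) for c in combinations(range(d), k)]
intervals = [set(range(a, b)) for a in range(d) for b in range(a + 1, d + 1)]
intervals.append(set())  # keep the empty set as well

w = np.array([0.0, 0.4, 1.0, 0.6, 0.0])  # a point in the unit ell_inf ball
assert abs(theta_inf(w, all_sets) - theta_inf(w, intervals)) < 1e-6
```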

We consider a simple regression setting in which w* is a constant signal whose support is an interval. The choice p = ∞ is well suited for constant-valued signals. The design matrix X is either drawn as (1) an i.i.d. Gaussian matrix with normalized columns, or (2) a correlated Gaussian matrix with normalized columns, with the off-diagonal values of the covariance matrix set to a common value. We observe noisy linear measurements y = Xw* + ε, where the noise vector ε is i.i.d. Gaussian with variance σ², and σ is varied over a range of noise levels. We solve problem (6) with and without adaptive weights, where the pilot estimate used for the weights is taken to be the least squares solution.
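To make the adaptive scheme concrete, the following sketch applies it with the plain ℓ_1 relaxation: a least-squares pilot estimate defines per-coordinate weights, and a weighted soft-thresholding (ISTA) loop solves the resulting penalized least-squares problem. All dimensions, the noise level, the regularization parameter, the weight floor, and the support threshold are hypothetical choices for illustration, not the paper's settings for problem (6).

```python
import numpy as np

gen = np.random.default_rng(0)
d, n = 20, 100
w_true = np.zeros(d)
w_true[5:11] = 1.0                      # constant signal on an interval support
X = gen.standard_normal((n, d))
X /= np.linalg.norm(X, axis=0)          # normalized columns
y = X @ w_true + 0.01 * gen.standard_normal(n)

w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares pilot estimate
eta = np.maximum(np.abs(w_ls), 1e-8)           # adaptive weights (floored)

def ista_weighted_l1(X, y, weights, lam, iters=2000):
    """Proximal gradient for 0.5 * ||y - X w||^2 + lam * sum_i |w_i| / weights_i."""
    L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        z = w - X.T @ (X @ w - y) / L
        t = lam / (weights * L)          # per-coordinate soft-threshold
        w = np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
    return w

w_hat = ista_weighted_l1(X, y, eta, lam=0.01)
support = set(np.flatnonzero(np.abs(w_hat) > 1e-3))
```

Coordinates with a small pilot estimate receive a large threshold and are driven to zero, which is the mechanism behind the support recovery guarantees of the adaptive scheme.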

We assess the estimators obtained through the different regularizers both in terms of support recovery and in terms of estimation error. Figure 3 plots (in log scale) these two criteria against the noise level σ. We plot the best achieved error on the regularization path, where the regularization parameter is varied over a grid; the remaining parameters are held fixed.

We observe that the adaptive weight scheme helps support recovery, especially in the correlated design setting. Indeed, the Lasso is only guaranteed to recover the support under an “irrepresentability condition” [zou2006adaptive], which is satisfied with high probability only in the non-correlated design. On the other hand, adaptive weights allow us to recover any strongly stable support, without any additional condition, as shown in Theorem 2. The ℓ_1/ℓ_∞-norm performs poorly in this setup. In fact, the modified range function introduces a gap between non-empty sets and the empty set. This leads to the undesirable behavior, already documented in [bach2010structured, jenatton2011structured], of adding all the variables in one step, as opposed to gradually. Adaptive weights seem to correct for this effect, as seen by the significant improvement in performance. Finally, note that choosing the “tighter” non-homogeneous convex relaxation leads to better support recovery: Θ_∞ performs better than the ℓ_1-norm in all setups.

## 6 Conclusion

We presented an analysis of homogeneous and non-homogeneous convex relaxations of ℓ_p-regularized combinatorial penalties. Our results show that the structure encoded by submodular priors can be equally well expressed by both relaxations, while the non-homogeneous relaxation is able to express the structure of more general monotone set functions. We also identified necessary and sufficient stability conditions for supports to be correctly recovered. Finally, we proposed an adaptive weight scheme that is guaranteed to recover supports satisfying the sufficient stability conditions, in the asymptotic setting, even under correlated design matrices.

#### Acknowledgements

We thank Ya-Ping Hsieh for helpful discussions. This work was supported in part by the European Commission under ERC Future Proof, SNF 200021-146750, SNF CRSII2-147633, NCCR Marvel. Francis Bach acknowledges support from the chaire Economie des nouvelles données with the data science joint research initiative with the fonds AXA pour la recherche, and the Initiative de Recherche “Machine Learning for Large-Scale Insurance” from the Institut Louis Bachelier.

## 7 Appendix

### 7.1 Variational forms of convex envelopes (Proof of Lemma 2 and Remark 1)

In this section, we recall the different variational forms of the homogeneous convex envelope derived in [obozinski2012convex] and derive similar variational forms for the non-homogeneous convex envelope, including the ones stated in Lemma 2. These variational forms will be needed in some of our proofs below.

###### Lemma 4.

The homogeneous convex envelope Ω_p of F admits the following variational forms:

$$
\begin{aligned}
\Omega_\infty(w) &= \min_{\alpha}\Big\{\sum_{S\subseteq V}\alpha_S F(S) \;:\; \sum_{S\subseteq V}\alpha_S 1_S \ge |w|,\ \alpha_S \ge 0\Big\}. &(9)\\
\Omega_p(w) &= \min_{v}\Big\{\sum_{S\subseteq V}F(S)^{1/q}\,\|v_S\|_p \;:\; \sum_{S\subseteq V}v_S = |w|,\ \operatorname{supp}(v_S)\subseteq S\Big\} &(10)\\
&= \max_{\kappa\in\mathbb{R}_+^d}\ \sum_{i=1}^d \kappa_i^{1/q}\,|w_i| \quad\text{s.t.}\quad \kappa(A)\le F(A),\ \forall A\subseteq V &(11)\\
&= \inf_{\eta\in\mathbb{R}_+^d}\ \frac{1}{p}\sum_{j=1}^d \frac{|w_j|^p}{\eta_j^{p-1}} + \frac{1}{q}\,\Omega_\infty(\eta). &(12)
\end{aligned}
$$
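For p = ∞ (so q = 1), forms (9) and (11) are a primal-dual pair of linear programs, and their optimal values coincide by LP duality. The sketch below checks this on a hypothetical monotone set function F(S) = √|S| with a random vector, using `scipy.optimize.linprog`.

```python
import numpy as np
from itertools import combinations
from scipy.optimize import linprog

d = 4
gen = np.random.default_rng(1)
w = gen.standard_normal(d)

def F(S):
    """A hypothetical monotone submodular set function."""
    return np.sqrt(len(S))

subsets = [set(c) for k in range(1, d + 1) for c in combinations(range(d), k)]
ind = np.array([[1.0 if i in S else 0.0 for i in range(d)] for S in subsets])
cost = np.array([F(S) for S in subsets])

# Primal (9): min_a sum_S a_S F(S)  s.t.  sum_S a_S 1_S >= |w|, a >= 0
primal = linprog(cost, A_ub=-ind.T, b_ub=-np.abs(w), bounds=(0, None)).fun

# Dual (11), q = 1: max_k sum_i k_i |w_i|  s.t.  k(A) <= F(A) for all A, k >= 0
dual = -linprog(-np.abs(w), A_ub=ind, b_ub=cost, bounds=(0, None)).fun

assert abs(primal - dual) < 1e-7
```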

The non-homogeneous convex envelope Θ_∞ of a set function F, over the unit ℓ_∞-ball, was derived in [halabi2015totally], where it was shown that it can be expressed in terms of any proper, l.s.c. convex extension f of F (cf. Lemma 1, [halabi2015totally]). A natural choice for f is the convex closure f⁻ of F, which corresponds to the tightest convex extension of F on [0,1]^d. We recall the two equivalent definitions of the convex closure, which we have adjusted to allow for infinite values.

###### Definition 5 (Convex Closure; c.f., [dughmi2009submodular, Def. 3.1]).

Given a set function F : 2^V → ℝ ∪ {+∞}, the convex closure f⁻ : [0,1]^d → ℝ ∪ {+∞} is the point-wise largest convex function that always lower-bounds F, in the sense that f⁻(1_S) ≤ F(S) for all S ⊆ V.

###### Definition 6 (Equivalent definition of Convex Closure; c.f., [Vondrak2010, Def. 1] and [dughmi2009submodular, Def. 3.2]).

Given any set function F, the convex closure f⁻ of F can equivalently be defined as:

$$
f^-(w) = \inf_{\alpha}\Big\{\sum_{S\subseteq V}\alpha_S F(S) \;:\; w = \sum_{S\subseteq V}\alpha_S 1_S,\ \sum_{S\subseteq V}\alpha_S = 1,\ \alpha_S \ge 0\Big\}.
$$

It is interesting to note that f⁻ = f_L, where f_L is the Lovász extension of F, if and only if F is a submodular function [Vondrak2010].
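For a submodular F, this equivalence can be verified numerically by solving the linear program of Definition 6 and comparing against the greedy (Edmonds) formula for the Lovász extension. The coverage function and test point below are hypothetical, and the sketch uses `scipy.optimize.linprog`.

```python
import numpy as np
from itertools import combinations
from scipy.optimize import linprog

d = 4
groups = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 0.7}  # hypothetical coverage weights

def F(S):
    """A submodular coverage function, with F(empty set) = 0."""
    return sum(v for G, v in groups.items() if set(G) & S)

subsets = [set(c) for k in range(d + 1) for c in combinations(range(d), k)]

def convex_closure(w):
    # Definition 6: f^-(w) = inf { sum_S a_S F(S) : w = sum_S a_S 1_S, sum_S a_S = 1 }
    c = np.array([F(S) for S in subsets])
    A_eq = np.vstack([
        np.array([[1.0 if i in S else 0.0 for S in subsets] for i in range(d)]),
        np.ones((1, len(subsets))),
    ])
    b_eq = np.append(w, 1.0)
    return linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun

def lovasz(w):
    """Greedy (Edmonds) formula for the Lovasz extension."""
    order = np.argsort(-w)
    val, prev = 0.0, set()
    for i in order:
        cur = prev | {int(i)}
        val += w[i] * (F(cur) - F(prev))
        prev = cur
    return val

w = np.array([0.2, 0.9, 0.5, 0.0])
assert abs(convex_closure(w) - lovasz(w)) < 1e-7
```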

The following lemma derives variational forms of Θ_p, valid for any set function F, that parallel the ones known for the homogeneous envelope.

###### Lemma 5.

The non-homogeneous convex envelope Θ_p of F admits the following variational forms:

$$
\begin{aligned}
\Theta_\infty(w) &= \inf_{\alpha}\Big\{\sum_{S\subseteq V}\alpha_S F(S) \;:\; \sum_{S\subseteq V}\alpha_S 1_S \ge |w|,\ \sum_{S\subseteq V}\alpha_S = 1,\ \alpha_S \ge 0\Big\}. &(13)\\
\Theta_p(w) &= \max_{\kappa\in\mathbb{R}^d}\ \sum_{j=1}^d \psi_j(\kappa_j, w_j) + \min_{S\subseteq V}\big(F(S) - \kappa(S)\big), \quad \forall w \in \operatorname{dom}(\Theta_p) &(14)\\
&= \inf_{\eta\in[0,1]^d}\ \frac{1}{p}\sum_{j=1}^d \frac{|w_j|^p}{\eta_j^{p-1}} + \frac{1}{q}\,f^-(\eta), &(15)
\end{aligned}
$$

where 1/p + 1/q = 1, and where we define

$$
\psi_j(\kappa_j, w_j) := \begin{cases}
\kappa_j^{1/q}\,|w_j| & \text{if } |w_j| \le \kappa_j^{1/p},\ \kappa_j \ge 0,\\[2pt]
\frac{1}{p}|w_j|^p + \frac{1}{q}\kappa_j & \text{otherwise.}
\end{cases}
$$

If F is monotone, then we can replace the convex closure f⁻ by any monotone convex extension of F in (15), and we can restrict κ ∈ ℝ_+^d in (14).

To prove the variational form (13) in Lemma 5, we first need to show the following property of f⁻.

###### Proposition 5 (c.f., [dughmi2009submodular, Prop. 3.23] ).

The minimum values of a proper set function F and its convex closure f⁻ are equal, i.e.,

$$
\min_{w \in [0,1]^d} f^-(w) = \min_{S \subseteq V} F(S).
$$

If S* is a minimizer of F, then 1_{S*} is a minimizer of f⁻. Moreover, if w* is a minimizer of f⁻, then every set in the support of α*, where α* achieves the infimum in the definition of f⁻(w*), is a minimizer of F.

###### Proof.

First note that f⁻(1_S) ≤ F(S) for all S ⊆ V implies min_{w∈[0,1]^d} f⁻(w) ≤ min_{S⊆V} F(S). On the other hand, by Definition 6, every f⁻(w) is a convex combination Σ_{S⊆V} α_S F(S) ≥ min_{S⊆V} F(S). The rest of the proposition follows directly. ∎

Given the choice of the extension f = f⁻, the variational form (13) of Θ_∞ given in Lemma 5 follows directly from Definition 6 and Proposition 5, as shown in the following corollary.

###### Corollary 4.

Given any set function F and its corresponding convex closure f⁻, the convex envelope Θ_∞ of F over the unit ℓ_∞-ball is given by

$$
\begin{aligned}
\Theta_\infty(w) &= \inf_{\alpha}\Big\{\sum_{S\subseteq V}\alpha_S F(S) \;:\; \sum_{S\subseteq V}\alpha_S 1_S \ge |w|,\ \sum_{S\subseteq V}\alpha_S = 1,\ \alpha_S \ge 0\Big\}\\
&= \inf_{v}\Big\{\sum_{S\subseteq V}F(S)\,\|v_S\|_\infty \;:\; \sum_{S\subseteq V}v_S = |w|,\ \sum_{S\subseteq V}\|v_S\|_\infty = 1,\ \operatorname{supp}(v_S)\subseteq S\Big\}.
\end{aligned}
$$
###### Proof.

f⁻ satisfies the first two assumptions required in Lemma 1 of [halabi2015totally]; namely, f⁻ is a lower semi-continuous convex extension of F which satisfies

$$
\max_{S\subseteq V}\ m(S) - F(S) \;=\; \max_{w\in[0,1]^d}\ m^T w - f^-(w), \quad \forall m \in \mathbb{R}_+^d.
$$

To see this, note that for any w ∈ [0,1]^d written as w = Σ_{S⊆V} α_S 1_S with Σ_{S⊆V} α_S = 1, we have m^T w − f⁻(w) ≤ Σ_{S⊆V} α_S (m(S) − F(S)) ≤ max_{S⊆V} m(S) − F(S). The other inequality is trivial. The corollary then follows directly from Lemma 1 in [halabi2015totally] and Definition 6. ∎

Note that, by (13), Θ_∞(w) = inf{f⁻(η) : η ≥ |w|, η ∈ [0,1]^d}. Note also that Θ_∞ is monotone even if F is not. On the other hand, if F is monotone, then f⁻ is monotone on [0,1]^d and Θ_∞(w) = f⁻(|w|). The proof of Remark 1 then follows, since if F is a monotone submodular function and f_L is its Lovász extension, then Θ_∞(w) = f_L(|w|) = Ω_∞(w), where the last equality was shown in [bach2010structured].

Next, we derive the convex relaxation Θ_p for a general p.

###### Proposition 6.

Given any set function