 # Optimality of 1-norm regularization among weighted 1-norms for sparse recovery: a case study on how to find optimal regularizations

The 1-norm was proven to be a good convex regularizer for the recovery of sparse vectors from under-determined linear measurements. It has been shown that with an appropriate measurement operator, a number of measurements of the order of the sparsity of the signal (up to log factors) is sufficient for stable and robust recovery. More recently, it has been shown that such recovery results can be generalized to more general low-dimensional model sets and (convex) regularizers. These results lead to the following question: to recover a given low-dimensional model set from linear measurements, what is the "best" convex regularizer? To approach this problem, we propose a general framework to define several notions of "best regularizer" with respect to a low-dimensional model. We show in the minimal case of sparse recovery in dimension 3 that the 1-norm is optimal for these notions. However, generalization of such results to the n-dimensional case seems out of reach. To tackle this problem, we propose looser notions of best regularizer and show that the 1-norm is optimal among weighted 1-norms for sparse recovery within this framework.


## 1 Introduction

We consider the observation model in a Hilbert space $\mathcal{H}$ (with associated norm $\|\cdot\|_{\mathcal{H}}$):

$$y = M x_0 \tag{1}$$

where $M$ is an under-determined linear operator, $y$ is a finite-dimensional measurement vector and $x_0$ is the unknown. We suppose that $x_0$ belongs to a low-dimensional model $\Sigma$ (a union of subspaces). We consider the following minimization program:

$$x^* \in \operatorname*{argmin}_{Mx = y} R(x) \tag{2}$$

where $R$ is a regularization function. A huge body of work gives practical regularizers ensuring that $x^* = x_0$ for several low-dimensional models (in particular sparse and low rank models, see  for a most complete review of these results). The operator $M$ is generally required to satisfy some property (e.g., the restricted isometry property) to guarantee recovery. In this work, we aim at finding the “best” regularizer for exact recovery of $x_0$. Ideally we would like to set $R$ to the characteristic function of $\Sigma$, but this is not practical in many cases (sparse and low rank recovery) as it is generally not convex, and even NP-hard to compute as a combinatorial problem. Consequently, we restrict the search for the best regularizer to a class of interesting regularizers $\mathcal{C}$. In our examples, the set $\mathcal{C}$ is a subset of the set of convex functions. Other interesting classes might be considered, such as partly smooth functions .

### 1.1 Best regularizer with respect to a low dimensional model

Defining what is the “best” regularizer in $\mathcal{C}$ for recovery is not immediate. Ideally, to fit the inverse problem, we must define a compliance measure that depends on both the kind of unknown and the measurement operator we consider. If we have some knowledge that $M$ belongs to a given set of linear operators, we want to define a compliance measure that tells us if a regularizer is good in these situations, and maximize it. Such maximization might yield a function $R^*$ that depends on $M$ (e.g., in , when looking for tight continuous relaxation of the $\ell^0$ penalty a dependency on $M$ appears). We aim for a more universal notion of optimal regularizer that does not depend on $M$. Hence, we look for a compliance measure $A_\Sigma(R)$ and its maximization

$$R^* \in \operatorname*{argmax}_{R \in \mathcal{C}} A_\Sigma(R). \tag{3}$$

In the sparse recovery example studied in this article, the existence of a maximum of $A_\Sigma(R)$ is verified. However, we could ask ourselves what conditions on $A_\Sigma(R)$ and $\mathcal{C}$ are necessary and sufficient for the existence of a maximum, which is out of the scope of this article.

### 1.2 Compliance measures

When studying recovery with a regularization function $R$, two types of guarantees are generally used: uniform and non-uniform. To describe these recovery guarantees, we use the following definition of descent vectors.

###### Definition 1.1 (Descent vectors).

For any $x \in \mathcal{H}$, the collection of descent vectors of $R$ at $x$ is

$$\mathcal{T}_R(x) := \{ z \in \mathcal{H} : R(x+z) \le R(x) \}. \tag{4}$$

We write $\mathcal{T}_R(\Sigma) := \bigcup_{x \in \Sigma} \mathcal{T}_R(x)$. Recovery is characterized by descent vectors (recall that $x^*$ is the result of minimization (2)):

• Uniform recovery: Let $M$ be a linear operator. Then “for all $x_0 \in \Sigma$, $x^* = x_0$” is equivalent to $\mathcal{T}_R(\Sigma) \cap \ker M = \{0\}$.

• Non-uniform recovery: Let $M$ be a linear operator and $x_0 \in \Sigma$. Then $x^* = x_0$ is equivalent to $\mathcal{T}_R(x_0) \cap \ker M = \{0\}$.
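The non-uniform characterization can be checked by hand in the smallest setting. The sketch below is only an illustration under stated assumptions, not code from the paper: it takes $R$ to be a weighted $\ell^1$-norm with hypothetical weights `w`, assumes a kernel of dimension 1 spanned by a hypothetical vector `h`, and tests descent directions through the one-sided directional derivative of $R$ (which is enough here because $R$ is piecewise linear). All function names are ours.

```python
def weighted_l1(x, w):
    """Weighted l1 norm sum_i w_i * |x_i| (weights are assumed positive)."""
    return sum(wi * abs(xi) for wi, xi in zip(w, x))


def in_descent_cone(z, x, w, tol=1e-12):
    """True if R(x + t*z) <= R(x) for all small t > 0, R = weighted l1 norm."""
    slope = 0.0
    for zi, xi, wi in zip(z, x, w):
        if xi != 0:
            # On the support of x, R is locally linear with slope sign(x_i) * w_i.
            slope += wi * zi * (1.0 if xi > 0 else -1.0)
        else:
            # Off the support, any move increases |x_i|.
            slope += wi * abs(zi)
    return slope <= tol


def recovers(x0, h, w, tol=1e-12):
    """True if x0 is the unique minimizer of R over the line {x0 + t*h}.

    t -> R(x0 + t*h) is convex and piecewise linear, so it is enough to
    compare R(x0) with the values at the nonzero breakpoints t = -x0_i / h_i.
    """
    base = weighted_l1(x0, w)
    for xi, hi in zip(x0, h):
        if hi == 0:
            continue
        t = -xi / hi
        if t == 0:
            continue
        candidate = [xj + t * hj for xj, hj in zip(x0, h)]
        if weighted_l1(candidate, w) <= base + tol:
            return False
    return True


def kernel_meets_descent_cone(x0, h, w):
    """True if the line spanned by h contains a nonzero descent direction."""
    minus_h = [-hi for hi in h]
    return in_descent_cone(h, x0, w) or in_descent_cone(minus_h, x0, w)
```

For a 1-sparse `x0` and several choices of `h`, `recovers(x0, h, w)` and `not kernel_meets_descent_cone(x0, h, w)` should agree, which is the equivalence stated in the non-uniform recovery item.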

Hence, a regularization function $R$ is “good” if $\mathcal{T}_R(\Sigma)$ leaves a lot of space for $\ker M$ to intersect it only trivially. In dimension $n$, if there is no orientation prior on the kernel of $M$, the amount of space left can be quantified by the “volume” of $\mathcal{T}_R(\Sigma) \cap S(1)$ where $S(1)$ is the unit sphere with respect to $\|\cdot\|_{\mathcal{H}}$. Hence a compliance measure for uniform recovery can be defined as

$$A^{U}_\Sigma(R) := 1 - \frac{\operatorname{vol}\left(\mathcal{T}_R(\Sigma) \cap S(1)\right)}{\operatorname{vol}(S(1))}. \tag{5}$$

More precisely, here, the volume of a set is its measure with respect to the uniform measure on the sphere (i.e. its $(n-1)$-dimensional Hausdorff measure). When maximizing this compliance measure with convex regularizers (proper, coercive and continuous), it has been shown that we can limit ourselves to atomic norms with atoms included in the model [11, Lemma 2.1]. When looking at non-uniform recovery for random Gaussian measurements, the quantity $\operatorname{vol}\left(\mathcal{T}_R(x) \cap S(1)\right) / \operatorname{vol}(S(1))$ represents the probability that a randomly oriented kernel of dimension 1 intersects (non-trivially) $\mathcal{T}_R(x)$. The highest probability of intersection with respect to $x$ quantifies the lack of compliance of $R$, hence we can define:

$$A^{NU}_\Sigma(R) := 1 - \sup_{x \in \Sigma} \frac{\operatorname{vol}\left(\mathcal{T}_R(x) \cap S(1)\right)}{\operatorname{vol}(S(1))} \tag{6}$$

Note that this can be linked with the Gaussian width and statistical dimension theory of sparse recovery [5, 1].
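The non-uniform measure (6) can be estimated numerically without any closed form. The following Monte Carlo sketch is an illustration only, under assumptions that are ours: $R$ is a weighted $\ell^1$-norm with hypothetical weights `w`, the model is the set of 1-sparse vectors, the descent set at $e_i$ is treated as a one-sided cone (directions along which $R$ does not increase), and uniform directions on the sphere are obtained from Gaussian samples. Since cone membership is scale invariant, the samples are not normalized.

```python
import random


def descent_fraction(i, w, n_samples=100_000, seed=0):
    """Estimate vol(T_R(e_i) ∩ S(1)) / vol(S(1)) for R = weighted l1 norm."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # A standard Gaussian vector has a uniformly distributed direction.
        z = [rng.gauss(0.0, 1.0) for _ in w]
        # Directional derivative of R at e_i along z.
        slope = w[i] * z[i] + sum(
            w[j] * abs(z[j]) for j in range(len(w)) if j != i
        )
        if slope <= 0.0:
            hits += 1
    return hits / n_samples


def nonuniform_compliance(w, n_samples=100_000, seed=0):
    """Estimate 1 - sup_i of the descent fraction at e_i, as in (6)."""
    worst = max(descent_fraction(i, w, n_samples, seed) for i in range(len(w)))
    return 1.0 - worst
```

Comparing `nonuniform_compliance([1.0, 1.0, 1.0])` with the value obtained for unequal weights gives a quick numerical view of the optimality question studied in Section 2; the estimates carry the usual Monte Carlo error of order $1/\sqrt{\texttt{n\_samples}}$.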

###### Remark 1.1.

In infinite dimension, the volume of the sphere vanishes, making the measures above uninteresting. However,  and  show that we can often come back to a low-dimensional recovery problem in an intermediate finite (potentially high dimensional) subspace of . Adapting the definition of to this subspace allows to extend these compliance measures.

Another possibility for the $n$-dimensional case, which we develop in this article, is to use recovery results based on the restricted isometry property (RIP). They have been shown to be adequate for multiple models , to be tight in some sense  for sparse and low rank recovery, to be necessary in some sense  and to be well adapted to the study of random operators . In particular, it has been shown that if $M$ has a RIP with constant $\delta < \delta_\Sigma(R)$ on the secant set $\Sigma - \Sigma$, with $\delta_\Sigma(R)$ being fully determined by $\Sigma$ and $R$ , then stable recovery is possible. Hence, by taking

$$A^{RIP}_\Sigma(R) := \delta_\Sigma(R), \tag{7}$$

the larger $A^{RIP}_\Sigma(R)$ is, the less stringent are RIP recovery conditions for recovery of elements of $\Sigma$ with $R$. We develop these ideas more precisely in Section 3 and perform the maximization with such a measure in the sparse recovery case.

### 1.3 Optimality results

Within this framework we show the following results:

1. In Section 2, when $\mathcal{H} = \mathbb{R}^3$, $\Sigma$ is the set of 1-sparse vectors and $\mathcal{C}$ is the set of (symmetric) atomic norms with atoms included in the model, both $A^{U}_\Sigma$ and $A^{NU}_\Sigma$ are uniquely maximized by multiples of the $\ell^1$-norm (Theorem 2.1). In this case, it is possible to exactly compute and maximize the compliance measure. While this study gives good geometrical insight into the quantities at hand, extending these exact calculations to the general $k$-sparse recovery in dimension $n$ seems out of reach.

2. In Section 3, we describe precisely how compliance measures based on RIP recovery can be defined. We then study two of these measures (based on state-of-the-art recovery guarantees) for $\Sigma$ the set of $k$-sparse vectors (vectors with at most $k$ non-zero coordinates) and $\mathcal{C}$ the set of weighted $\ell^1$-norms. We show that both of these measures are uniquely maximized by multiples of the $\ell^1$-norm (Theorem 3.1 and Theorem 3.2).

## 2 The case of 1-sparse vectors in 3D

We investigate in detail the case of 3-dimensional vectors which are 1 sparse, i.e. the simplest interesting case (uniform recovery of 1-sparse vectors in 2 dimensions is impossible if is not invertible). Here, we have and . We consider weighted -norms, i.e., atomic norms of the form (where is the atomic norm generated by atoms ). To maximize (respectively ), we have to compute the surface of the intersections of 3 descent cones with the unit sphere (respectively the surface of the biggest possible intersection of a descent cone with ).

As shown in Figure 1, with symmetries, we need to compute the intersection of 3 descent cones of a weighted -norm at 1-sparse vectors. This is the object of Lemma 2.1. Note that in the case of asymmetrical atoms, we would then need to calculate intersections of the sphere with a tetrahedron. To ease the notations, we introduce the following quantities which represent the cosine of the angles of a tetrahedron of interest. For , set , and .

###### Lemma 2.1.

Let , a 1-sparse vector and three positive real numbers. Then,

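$$\operatorname{vol}\left(\mathcal{T}_{\|\cdot\|_w}(x) \cap S(1)\right) = 4 \tan^{-1}\left(\frac{1}{1 + c_i(\mu)}\right) \tag{8}$$

where

$$c_i(\mu) = 1 + \sum_{j \neq i} \beta_{ij}(\mu) + \prod_{j \neq i} \beta_{ij}(\mu). \tag{9}$$

With Lemma 2.1, we can prove that the $\ell^1$-norm is the best regularizer for 1-sparse signals both for uniform and non-uniform recovery among all weighted $\ell^1$-norms.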

###### Theorem 2.1.

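Let $\Sigma$ be the set of 1-sparse vectors in $\mathbb{R}^3$ and $\mathcal{C}$ the set of weighted $\ell^1$-norms. The $\ell^1$-norm is the best regularizer among the class $\mathcal{C}$ for $\Sigma$, both for the uniform and non-uniform case,

$$\operatorname*{argmax}_{R \in \mathcal{C}} A^{U}_\Sigma(R) = \operatorname*{argmax}_{R \in \mathcal{C}} A^{NU}_\Sigma(R) = \|\cdot\|_1$$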

We expect a similar result if is generalized to asymmetrical atomic norms. To extend these exact calculations to the -dimensional case, we would need to be able to compare intersections of spheres and descent cones of without an analytic formula, which appears to be a very difficult task.

## 3 Compliance measures based on the RIP

In this section, we detail how we can use RIP recovery conditions to build compliance measures that are more easily managed. We start by recalling definitions and results about RIP recovery guarantees, then apply our methodology. We also give lemmas that emphasize the relevant quantity (depending on the geometry of the regularizer and the model) to optimize.

###### Definition 3.1 (RIP constant).

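In a Hilbert space $\mathcal{H}$, let $\Sigma$ be a union of subspaces. Let $M$ be a linear map; the RIP constant of $M$ is defined as

$$\delta(M) = \sup_{x \in \Sigma - \Sigma} \left| \frac{\|Mx\|_{\mathcal{H}}^2}{\|x\|_{\mathcal{H}}^2} - 1 \right|, \tag{10}$$

where $\Sigma - \Sigma$ (differences of elements of $\Sigma$) is called the secant set.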
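As a concrete instance, the RIP constant can be computed exactly in a very small case. The sketch below is ours and purely illustrative: it understands the RIP constant as the worst-case deviation of $\|Mx\|^2/\|x\|^2$ from 1 over the secant set, assumes the model is the set of 1-sparse vectors in $\mathbb{R}^n$ with $n \ge 2$ (so the secant set consists of 2-sparse vectors), and uses the closed-form eigenvalues of 2x2 Gram matrices. The helper `projector_complement` builds the matrix of $I - \Pi_z$, an operator with a kernel of dimension 1.

```python
import math
from itertools import combinations


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rip_constant_1sparse(M):
    """RIP constant of M (given as a list of rows) for the 1-sparse model.

    Differences of 1-sparse vectors are 2-sparse, so for each pair of columns
    the extreme eigenvalues of the 2x2 Gram matrix give the range of
    ||Mx||^2 / ||x||^2 on that support.
    """
    n = len(M[0])
    cols = [[row[j] for row in M] for j in range(n)]
    delta = 0.0
    for i, j in combinations(range(n), 2):
        a = _dot(cols[i], cols[i])
        b = _dot(cols[i], cols[j])
        c = _dot(cols[j], cols[j])
        mean = (a + c) / 2.0
        radius = math.hypot((a - c) / 2.0, b)
        # Eigenvalues of [[a, b], [b, c]] are mean +/- radius.
        delta = max(delta, abs(mean + radius - 1.0), abs(mean - radius - 1.0))
    return delta


def projector_complement(z):
    """Matrix of I - Pi_z, with Pi_z the orthogonal projection onto span(z)."""
    norm2 = _dot(z, z)
    n = len(z)
    return [
        [(1.0 if i == j else 0.0) - z[i] * z[j] / norm2 for j in range(n)]
        for i in range(n)
    ]
```

For example, `rip_constant_1sparse(projector_complement([1.0, 1.0, 1.0]))` evaluates the RIP constant of an operator whose kernel is spanned by the constant vector.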

In , an explicit constant $\delta^{\mathrm{suff}}_\Sigma(R)$ is given, such that $\delta(M) < \delta^{\mathrm{suff}}_\Sigma(R)$ guarantees exact recovery of elements of $\Sigma$ by minimization (2). This constant is only sufficient (and sharp in some sense for sparse and low rank recovery). An ideal RIP based compliance measure would be to use a sharp RIP constant $\delta^{\mathrm{sharp}}_\Sigma(R)$ (which is not explicit; it is an open question to derive closed form formulations of this constant for sparsity and other low-dimensional models) defined as:

$$\delta^{\mathrm{sharp}}_\Sigma(R) := \inf_{M : \ker M \cap \mathcal{T}_R(\Sigma) \neq \{0\}} \delta(M). \tag{11}$$

It is the best RIP constant of measurement operators where uniform recovery fails. The lack of analytic expressions for $\delta^{\mathrm{sharp}}_\Sigma(R)$ limits the possibilities of exact optimization with respect to $R$. We propose to look at two compliance measures:

• Measures based on necessary RIP conditions  which yield sharp recovery constants for particular sets of operators, e.g.,

$$\delta^{\mathrm{nec}}_\Sigma(R) := \inf_{z \in \mathcal{T}_R(\Sigma) \setminus \{0\}} \delta(I - \Pi_z). \tag{12}$$

where $\Pi_z$ is the orthogonal projection onto the one-dimensional subspace $\operatorname{span}(z)$ (other intermediate necessary RIP constants can be defined). Another open question is to determine whether $\delta^{\mathrm{sharp}}_\Sigma(R) = \delta^{\mathrm{nec}}_\Sigma(R)$ generally or for some particular models.

• Measures based on sufficient RIP constants $\delta^{\mathrm{suff}}_\Sigma(R)$ for recovery (e.g., from ).

Note that we have the relation

$$\delta^{\mathrm{suff}}_\Sigma(R) \le \delta^{\mathrm{sharp}}_\Sigma(R) \le \delta^{\mathrm{nec}}_\Sigma(R). \tag{13}$$

To summarize, in the following, instead of considering the most natural RIP based compliance measure (based on $\delta^{\mathrm{sharp}}_\Sigma(R)$), we use the best known bounds of this measure.

### 3.1 Compliance measures based on necessary RIP conditions

In this case, instead of working with actual RIP constants, it is easier to use (equivalently) the restricted conditioning.

###### Definition 3.2 (Restricted conditioning).
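
$$\gamma(M) := \frac{\sup_{x \in (\Sigma - \Sigma) \cap S(1)} \|Mx\|_{\mathcal{H}}^2}{\inf_{x \in (\Sigma - \Sigma) \cap S(1)} \|Mx\|_{\mathcal{H}}^2}. \tag{14}$$

The RIP constant is increasing with respect to $\gamma(M)$. In the following, we consider the compliance measure

$$A^{RIP,\mathrm{nec}}_\Sigma(R) := \gamma_\Sigma(R) = \inf_{z \in \mathcal{T}_R(\Sigma) \setminus \{0\}} \gamma(I - \Pi_z). \tag{15}$$

When maximizing $A^{RIP,\mathrm{nec}}_\Sigma(R)$, we look at optimal regularizers for recovery with tight frames with a kernel of dimension 1.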

From here, we specialize to and . Hence with and (for uniform recovery is not possible for non-invertible ). To show optimality of the -norm, we use the following characterization of .

###### Lemma 3.1.

Let and . Then

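$$A^{RIP,\mathrm{nec}}_\Sigma(R) = \frac{1}{1 - \inf_{z \in \mathcal{T}_R(\Sigma) \setminus \{0\}} \sup_{x \in (\Sigma - \Sigma) \cap S(1)} \dfrac{\langle x, z \rangle^2}{\|z\|_{\mathcal{H}}^2}}. \tag{16}$$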

We consider the set . Note that

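$$\operatorname*{argmax}_{R \in \mathcal{C}} A^{RIP,\mathrm{nec}}_\Sigma(R) = \operatorname*{argmin}_{R \in \mathcal{C}} \sup_{z \in \mathcal{T}_R(\Sigma) \setminus \{0\}} \frac{\|z_{T_2^c}\|_2^2}{\|z_{T_2}\|_2^2} = \operatorname*{argmin}_{R \in \mathcal{C}} B_\Sigma(R) \tag{17}$$

where $T_2$ is a notation for the support of the $2k$ biggest coordinates in $z$, i.e. for all $i \in T_2$, $j \notin T_2$, we have $|z_i| \ge |z_j|$.

Studying the quantity $B_\Sigma(R)$ for $R \in \mathcal{C}$ permits us to conclude.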
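To make the quantities in (16) and (17) tangible, here is a small sketch (ours, for illustration only) that evaluates them for one fixed direction $z$ when $\Sigma$ is the set of $k$-sparse vectors. It assumes the reduction used in the proof of Lemma 3.1: over unit $2k$-sparse vectors $x$, the supremum of $\langle x, z\rangle^2 / \|z\|^2$ equals $\|z_{T_2}\|_2^2 / \|z\|_2^2$ (Cauchy-Schwarz) and the infimum is 0, so that $\gamma(I - \Pi_z) = \|z\|_2^2 / \|z_{T_2^c}\|_2^2$. The function names are hypothetical.

```python
import math


def split_energy(z, k):
    """Return (||z_{T2}||^2, ||z_{T2^c}||^2), T2 = support of the 2k largest |z_i|."""
    squares = sorted((zi * zi for zi in z), reverse=True)
    return sum(squares[:2 * k]), sum(squares[2 * k:])


def restricted_conditioning(z, k):
    """gamma(I - Pi_z) on the secant set of k-sparse vectors.

    Infinite when z is itself 2k-sparse, since I - Pi_z then cancels
    an element of the secant set.
    """
    head, tail = split_energy(z, k)
    return math.inf if tail == 0 else (head + tail) / tail


def tail_to_head_ratio(z, k):
    """The ratio ||z_{T2^c}||^2 / ||z_{T2}||^2 appearing in (17), for z != 0."""
    head, tail = split_energy(z, k)
    return tail / head
```

For any nonzero `z` that is not $2k$-sparse, `restricted_conditioning(z, k)` equals `1 + 1 / tail_to_head_ratio(z, k)`, which is why maximizing the infimum of the conditioning amounts to minimizing the supremum of the ratio.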

###### Theorem 3.1.

Let , and . Suppose , then

$$\|\cdot\|_1 = \operatorname*{argmax}_{R \in \mathcal{C}} A^{RIP,\mathrm{nec}}_\Sigma(R). \tag{18}$$

In the next section, we will see that the optimization of the sufficient RIP constant leads to very similar expressions.

### 3.2 Compliance measures based on sufficient RIP conditions

In , it was shown for a union of subspaces and an arbitrary regularizer , that an explicit RIP constant is sufficient to guarantee reconstruction. From here we set , .

###### Proposition 3.1.

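Let $\Sigma$ be the set of $k$-sparse vectors. Consider the constant $\delta^{\mathrm{suff}}_\Sigma(R)$ from [Eq. (5)]; we have:

$$\delta^{\mathrm{suff}}_\Sigma(R) = \frac{1}{\sqrt{\sup_{z \in \mathcal{T}_R(\Sigma) \setminus \{0\}} \dfrac{\|z_{T^c}\|_\Sigma^2}{\|z_T\|_2^2} + 1}} \tag{19}$$

where $T$ denotes the support of the $k$ biggest coordinates of $z$ and $\|\cdot\|_\Sigma$ is the atomic norm generated by $\Sigma \cap S(1)$ (the convex gauge induced by the convex envelope of $\Sigma \cap S(1)$).

In the following, we use

$$A^{RIP,\mathrm{suff}}_\Sigma(R) := \delta^{\mathrm{suff}}_\Sigma(R). \tag{20}$$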

It must be noted that characterization of Proposition 3.1 was used as a lower bound on in  (hence this part of the proof of the control of in  is exact).

Similarly to the necessary case, from Proposition 3.1, we have

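$$\operatorname*{argmax}_{R \in \mathcal{C}} A^{RIP,\mathrm{suff}}_\Sigma(R) = \operatorname*{argmin}_{R \in \mathcal{C}} \sup_{z \in \mathcal{T}_R(\Sigma) \setminus \{0\}} \frac{\|z_{T^c}\|_\Sigma^2}{\|z_T\|_2^2} = \operatorname*{argmin}_{R \in \mathcal{C}} D_\Sigma(R) \tag{21}$$

where $T$ denotes the support of the $k$ biggest coordinates of $z$ and $D_\Sigma(R)$ denotes the supremum above. Note the similarity between the fundamental quantities to optimize for the necessary case and the sufficient case, $B_\Sigma(R)$ and $D_\Sigma(R)$. Studying the quantity $D_\Sigma(R)$ for $R \in \mathcal{C}$ leads to the result.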

###### Theorem 3.2.

Let , and . Suppose , then

$$\|\cdot\|_1 = \operatorname*{argsup}_{R \in \mathcal{C}} A^{RIP,\mathrm{suff}}_\Sigma(R). \tag{22}$$

## 4 Discussion and future work

We have shown that, not surprisingly, the $\ell^1$-norm is optimal among weighted $\ell^1$-norms for sparse recovery for several notions of compliance. This result was to be expected due to the symmetries of the problem. However, the important point is that we could explicitly quantify the notion of good regularizer. This is promising for the search of optimal regularizers for more complicated low-dimensional models such as “sparse and low rank” models or hierarchical sparse models. For the case of sparsity, we expect to be able to generalize the optimality of the $\ell^1$-norm to the more general case of atomic norms with atoms included in the model. We also expect similar results for low-rank recovery and the nuclear norm, as the technical tools are very similar.

It must be noted that for RIP compliance measures, we did not use a constructive proof (we exhibited the maximum of the compliance measure). A constructive proof, i.e. an exact calculation and optimization of the quantities $B_\Sigma(R)$ and $D_\Sigma(R)$, would be more satisfying as it would not require the knowledge of the optimum, which is our ultimate objective.

We used compliance measures based on (uniform) RIP recovery guarantees to give results for the general sparse recovery case; it would be interesting to carry out such an analysis using (non-uniform) recovery guarantees based on the statistical dimension or Gaussian width of the descent cones [5, 1]. One would need to precisely lower and upper bound these quantities, similarly to our approach with the RIP, to get satisfying results.

Finally, while these compliance measures are designed to make sense with respect to known results in the area of sparse recovery, one might design other compliance measures tailored for particular needs, in this search for optimal regularizers.

## Acknowledgements

This work was partly supported by the CNRS PEPS JC 2018 (project on efficient regularizations). The authors would like to thank Rémi Gribonval for his insights on this problem.

## References

[1] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: phase transitions in convex programs with random data. Information and Inference, 3(3):224–294, 2014.

[2] A. Argyriou, R. Foygel, and N. Srebro. Sparse Prediction with the k-Support Norm. Advances in Neural Information Processing Systems, 25:1457–1465, 2012.

[3] A. Bourrier, M. Davies, T. Peleg, P. Perez, and R. Gribonval. Fundamental performance limits for ideal decoders in high-dimensional linear inverse problems. Information Theory, IEEE Transactions on, 60(12):7928–7946, 2014.

[4] T. Cai and A. Zhang. Sparse representation of a polytope and recovery of sparse signals and low-rank matrices. Information Theory, IEEE Transactions on, 60(1):122–132, 2014.

[5] V. Chandrasekaran, B. Recht, P. Parrilo, and A. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.

[6] M. E. Davies and R. Gribonval. Restricted isometry constants where $\ell^p$ sparse recovery can fail for $0 < p \le 1$. Information Theory, IEEE Transactions on, 55(5):2203–2214, 2009.

[7] S. Foucart and H. Rauhut. A mathematical introduction to compressive sensing. Springer, 2013.

[8] A. V. Oosterom and J. Strackee. The solid angle of a plane triangle. IEEE Transactions on Biomedical Engineering, BME-30(2):125–126, 1983.

[9] G. Puy, M. E. Davies, and R. Gribonval. Recipes for stable linear embeddings from Hilbert spaces to $\mathbb{R}^m$. arXiv preprint arXiv:1509.06947, 2015.

[10] E. Soubies, L. Blanc-Féraud, and G. Aubert. A continuous exact $\ell^0$ penalty (CEL0) for least squares regularized problem. SIAM Journal on Imaging Sciences, 8(3):1607–1639, 2015.

[11] Y. Traonmilin and R. Gribonval. Stable recovery of low-dimensional cones in Hilbert spaces: One RIP to rule them all. Applied And Computational Harmonic Analysis, In Press, 2016.

[12] S. Vaiter, M. Golbabaee, J. Fadili, and G. Peyré. Model selection with low complexity priors. Information and Inference: A Journal of the IMA, 4(3):230–287, 2015.

## 5 Annex

This section describes the tools and proofs used to obtain our results.

### 5.1 Proofs for Section 2

###### Proof of Lemma 2.1.

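We start our proof by observing that the descent cone at $x$ is given by the conic envelope of the rays defined by vectors of the form $\pm \mu_j e_j - \mu_i e_i$ with $j \neq i$, i.e.,

$$\mathcal{T}_{\|\cdot\|_w}(x) = \overline{\operatorname{cone}}\{\pm \mu_j e_j - \mu_i e_i\}_{j \neq i}. \tag{23}$$

Using the symmetry of this set, we can split it as four tetrahedra of equal size, depending on the signs in front of the vectors $\mu_j e_j$,

$$\mathcal{T}_{\|\cdot\|_w}(x) = \bigcup_{s \in \{-1,1\}^2} \overline{\operatorname{cone}}\left\{ (s_j \mu_j e_j - \mu_i e_i)_{j \neq i},\, -\mu_i e_i \right\}. \tag{24}$$

We have reduced the problem to computing the area of the intersection between a tetrahedron and the unit sphere. To ease the notation, let $i = 1$, and consider the cone

$$T = \overline{\operatorname{cone}}\{\mu_2 e_2 - \mu_1 e_1,\ \mu_3 e_3 - \mu_1 e_1,\ -\mu_1 e_1\}, \tag{25}$$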

as depicted in Figure 1. Following [8, Equation (8)], we have that

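$$\tan\left(\tfrac{1}{2} \operatorname{vol}(T \cap S(1))\right) = \frac{1}{1 + \sum_{j=1}^{3} \cos \alpha_{i,j}}, \tag{26}$$

Using the definition of $\beta_{ij}$, we have that $\cos \alpha_{i,j} = \beta_{ij}(\mu)$. Indeed, using standard relations in a triangle, one has

$$\cos \alpha_{12} = \frac{\mu_1}{\sqrt{\mu_1^2 + \mu_2^2}} = \beta_{12}, \qquad \cos \alpha_{13} = \frac{\mu_1}{\sqrt{\mu_1^2 + \mu_3^2}} = \beta_{13}$$

and, using the Al-Kashi relation (law of cosines),

$$\cos \alpha_{11} = \frac{\mu_1^2}{\sqrt{(\mu_1^2 + \mu_2^2)(\mu_1^2 + \mu_3^2)}} = \beta_{12} \beta_{13} = \beta_{11}.$$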

###### Proof of Theorem 2.1.

Assume w.l.o.g that , i.e., .

Non-uniform case. Using Lemma 2.1, we compute the quantity

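$$\begin{aligned}
\sup_{x \in \Sigma} \operatorname{vol}\left(\mathcal{T}_R(x) \cap S(1)\right) &= \sup_{i \in \{1,2,3\}} \operatorname{vol}\left(\mathcal{T}_R(e_i) \cap S(1)\right) && (27) \\
&= \sup_{i \in \{1,2,3\}} 4 \tan^{-1}\left(\frac{1}{1 + c_i(\mu)}\right). && (28)
\end{aligned}$$

Now,

$$\begin{aligned}
\operatorname*{argmax}_{R \in \mathcal{C}} A^{NU}_\Sigma(R) &= \operatorname*{argmax}_{R \in \mathcal{C}} \left\{ 1 - \sup_{x \in \Sigma} \frac{\operatorname{vol}\left(\mathcal{T}_R(x) \cap S(1)\right)}{\operatorname{vol}(S(1))} \right\} && \text{by definition (6)} \\
&= \operatorname*{argmin}_{R \in \mathcal{C}} \sup_{x \in \Sigma} \operatorname{vol}\left(\mathcal{T}_R(x) \cap S(1)\right) \\
&= \operatorname*{argmin}_{1 = \mu_1 \le \mu_2 \le \mu_3} \max_{i \in \{1,2,3\}} 4 \tan^{-1}\left(\frac{1}{1 + c_i(\mu)}\right) && \text{using (28)} \\
&= \operatorname*{argmin}_{1 = \mu_1 \le \mu_2 \le \mu_3} \max_{i \in \{1,2,3\}} \frac{1}{1 + c_i(\mu)} && \tan^{-1} \text{ is increasing}
\end{aligned}$$

Hence,

$$\operatorname*{argmax}_{R \in \mathcal{C}} A^{NU}_\Sigma(R) = \operatorname*{argmin}_{1 = \mu_1 \le \mu_2 \le \mu_3} \min_{i \in \{1,2,3\}} c_i(\mu).$$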

Observe that under the constraint , the minimum of is achieved at . Indeed, we have , , thus and . Since every term is positive, we also have that . Thus, . Hence, we solve the problem

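$$\min_{1 = \mu_1 \le \mu_2 \le \mu_3} 1 + \left(1 + \frac{1}{\mu_3^2}\right)^{-1/2} + \left(1 + \frac{\mu_2^2}{\mu_3^2}\right)^{-1/2} + \left( \left(1 + \frac{1}{\mu_3^2}\right) \left(1 + \frac{\mu_2^2}{\mu_3^2}\right) \right)^{-1/2},$$

which is achieved for $\mu = (1, 1, 1)$.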

Uniform case. The computation is similar, replacing the minimum with the sum of for .

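$$\begin{aligned}
\operatorname{vol}\left(\mathcal{T}_R(\Sigma) \cap S(1)\right) &= \sum_{i=1}^{3} \operatorname{vol}\left(\mathcal{T}_R(e_i) \cap S(1)\right) && \text{no intersection between the cones} \\
&= 4 \sum_{i=1}^{3} \tan^{-1}\left(\frac{1}{1 + c_i(\mu)}\right) && \text{using Lemma 2.1.}
\end{aligned}$$

Again, using the fact that and , we have that

$$3 \min_{1 = \mu_1 \le \mu_2 \le \mu_3} \tan^{-1}\left(\frac{1}{1 + c_1(\mu)}\right) \le \min_{1 = \mu_1 \le \mu_2 \le \mu_3} \sum_{i=1}^{3} \tan^{-1}\left(\frac{1}{1 + c_i(\mu)}\right) \le \sum_{i=1}^{3} \tan^{-1}\left(\frac{1}{1 + c_i(1,1,1)}\right)$$

Since the left term is minimized by $\mu = (1, 1, 1)$, as proved in the non-uniform case, and since

$$\sum_{i=1}^{3} \tan^{-1}\left(\frac{1}{1 + c_i(1,1,1)}\right) = 3 \tan^{-1}\left(\frac{1}{1 + c_1(1,1,1)}\right),$$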

we have that the minimum is achieved at . Moreover, since and are strictly increasing, this minimum is unique. ∎

### 5.2 Proofs for Section 3.1

#### 5.2.1 Link between RIP constants and restricted conditioning.

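Suppose $M$ has RIP with constants $\delta_1$, $\delta_2$ on $\Sigma - \Sigma$, i.e. for all $x \in \Sigma - \Sigma$,

$$(1 - \delta_1) \|x\|_{\mathcal{H}}^2 \le \|Mx\|_2^2 \le (1 + \delta_2) \|x\|_{\mathcal{H}}^2.$$

For any $\lambda > 0$ we have

$$\lambda^2 (1 - \delta_1) \|x\|_{\mathcal{H}}^2 \le \|\lambda M x\|_2^2 \le \lambda^2 (1 + \delta_2) \|x\|_{\mathcal{H}}^2.$$

We look for $\lambda$ such that $\lambda M$ has symmetric RIP $\delta'$:

$$\begin{aligned}
\lambda^2 (1 - \delta_1) &= 1 - \delta' \\
\lambda^2 (1 + \delta_2) &= 1 + \delta'
\end{aligned} \tag{29}$$

It implies

$$\begin{aligned}
\frac{1 - \delta'}{1 - \delta_1} (1 + \delta_2) &= 1 + \delta' \\
\delta' \left( 1 + \frac{1 + \delta_2}{1 - \delta_1} \right) &= \frac{1 + \delta_2}{1 - \delta_1} - 1 \\
\delta' \, \frac{2 + \delta_2 - \delta_1}{1 - \delta_1} &= \frac{\delta_2 + \delta_1}{1 - \delta_1} \\
\delta' &= \frac{\delta_2 + \delta_1}{2 + \delta_2 - \delta_1}
\end{aligned} \tag{30}$$

We can always rescale $M$ such that $\lambda M$ has RIP $\delta'$. In terms of the restricted conditioning $\gamma(M) = \frac{1 + \delta_2}{1 - \delta_1}$ (taking the tightest constants $\delta_1$ and $\delta_2$), this reads $\delta' = \frac{\gamma(M) - 1}{\gamma(M) + 1}$.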
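The rescaling can be verified numerically. The short sketch below is ours and only illustrative: it solves system (29) for $\lambda^2$ and $\delta'$ given hypothetical asymmetric constants `delta1` and `delta2` (assumed to satisfy $0 \le \delta_1 < 1$ and $\delta_2 \ge 0$), and checks both equalities of (29).

```python
import math


def symmetrize_rip(delta1, delta2):
    """Return (lambda_sq, delta_prime) solving system (29).

    Adding the two equations of (29) gives lambda^2 * (2 + delta2 - delta1) = 2,
    and delta_prime follows from (30).
    """
    lambda_sq = 2.0 / (2.0 + delta2 - delta1)
    delta_prime = (delta2 + delta1) / (2.0 + delta2 - delta1)
    return lambda_sq, delta_prime


def check_symmetric(delta1, delta2):
    """True if the rescaled bounds are exactly 1 - delta' and 1 + delta'."""
    lambda_sq, delta_prime = symmetrize_rip(delta1, delta2)
    lower_ok = math.isclose(lambda_sq * (1.0 - delta1), 1.0 - delta_prime)
    upper_ok = math.isclose(lambda_sq * (1.0 + delta2), 1.0 + delta_prime)
    return lower_ok and upper_ok
```

With `delta1 = 0.2` and `delta2 = 0.5`, for instance, `symmetrize_rip` returns $\lambda^2 = 2/2.3$ and $\delta' = 0.7/2.3$, and the same $\delta'$ is obtained from the conditioning $\gamma = 1.5/0.8$ through $(\gamma - 1)/(\gamma + 1)$.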

#### 5.2.2 Proofs

###### Proof of Lemma 3.1.

Let $z \in \mathcal{T}_R(\Sigma) \setminus \{0\}$. Let $M = I - \Pi_z$ where $\Pi_z$ is the orthogonal projection onto the one-dimensional subspace $\operatorname{span}(z)$.

Let , we always have and

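$$\|Mx\|_2^2 = \|x - \Pi_z x\|_{\mathcal{H}}^2 = \|x\|_{\mathcal{H}}^2 - 2 \langle x, \Pi_z x \rangle + \|\Pi_z x\|_{\mathcal{H}}^2 = \|x\|_{\mathcal{H}}^2 - \frac{\langle x, z \rangle^2}{\|z\|_{\mathcal{H}}^2}.$$

Thus

$$\frac{\|Mx\|_2^2}{\|x\|_{\mathcal{H}}^2} = 1 - \frac{\langle x, z \rangle^2}{\|x\|_{\mathcal{H}}^2 \|z\|_{\mathcal{H}}^2} \tag{31}$$

and

$$\gamma(M) = \frac{\sup_{x \in (\Sigma - \Sigma) \cap S(1)} \|Mx\|_{\mathcal{H}}^2}{\inf_{x \in (\Sigma - \Sigma) \cap S(1)} \|Mx\|_{\mathcal{H}}^2} = \frac{1 + \sup_{x \in (\Sigma - \Sigma) \cap S(1)} \left( - \dfrac{\langle x, z \rangle^2}{\|z\|_{\mathcal{H}}^2} \right)}{1 + \inf_{x \in (\Sigma - \Sigma) \cap S(1)} \left( - \dfrac{\langle x, z \rangle^2}{\|z\|_{\mathcal{H}}^2} \right)} = \frac{1 - \inf_{x \in (\Sigma - \Sigma) \cap S(1)} \dfrac{\langle x, z \rangle^2}{\|z\|_{\mathcal{H}}^2}}{1 - \sup_{x \in (\Sigma - \Sigma) \cap S(1)} \dfrac{\langle x, z \rangle^2}{\|z\|_{\mathcal{H}}^2}} \tag{32}$$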

For all , there is such that , hence (just take where is the support of the greatest coordinates of ) . This gives the desired result. ∎

The following Lemma shows that we can simplify the calculation of

###### Lemma 5.1.

Let the set of -sparse vectors. Let ,

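$$B_\Sigma(R) := \sup_{z \in \mathcal{T}_R(\Sigma) \setminus \{0\}} \frac{\|z_{T_2^c}\|_2^2}{\|z_{T_2}\|_2^2} = \sup_{z \neq 0 \,:\, \|z_{H_0}\|_w \ge \|z_{H_0^c}\|_w} \frac{\|z_{T_2^c}\|_2^2}{\|z_{T_2}\|_2^2}. \tag{33}$$

where $H_0$ is the support of the $k$ biggest weights: i.e. for all $i \in H_0$, $j \notin H_0$, we have $w_i \ge w_j$.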

###### Proof.

We have

$$B_\Sigma(R) = \sup_{H : |H| \le k} \ \sup_{z \neq 0 \,:\, \|z_H\|_w \ge \|z_{H^c}\|_w} \frac{\|z_{T_2^c}\|_2^2}{\|z_{T_2}\|_2^2}. \tag{34}$$

Let such that , there is a permutation of (adequately transfer the values of to and the values on to ) such that . We remark that the expression does not depend on the order of the coordinates of . This shows that which concludes the proof. ∎

###### Lemma 5.2.

Let the set of -sparse vectors. Let such that . We have

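$$B_\Sigma(R) = \sup_{z \in \mathcal{T}_R(\Sigma) \setminus \{0\}} \frac{\|z_{T_2^c}\|_2^2}{\|z_{T_2}\|_2^2} = \sup_{z \neq 0 \,:\, \|z_{H_0}\|_w = \|z_{H_0^c}\|_w} \frac{\|z_{T_2^c}\|_2^2}{\|z_{T_2}\|_2^2} \tag{35}$$

where $H_0$ is the support of the $k$ biggest weights.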

###### Proof.

We use Lemma 5.1. Let such that . This implies that is not a constant vector otherwise the hypothesis on weights would be violated. If there is such that