 # A Method for Finding Structured Sparse Solutions to Non-negative Least Squares Problems with Applications

Demixing problems in many areas such as hyperspectral imaging and differential optical absorption spectroscopy (DOAS) often require finding sparse nonnegative linear combinations of dictionary elements that match observed data. We show how aspects of these problems, such as misalignment of DOAS references and uncertainty in hyperspectral endmembers, can be modeled by expanding the dictionary with grouped elements and imposing a structured sparsity assumption that the combinations within each group should be sparse or even 1-sparse. If the dictionary is highly coherent, it is difficult to obtain good solutions using convex or greedy methods, such as non-negative least squares (NNLS) or orthogonal matching pursuit. We use penalties related to the Hoyer measure, which is the ratio of the l_1 and l_2 norms, as sparsity penalties to be added to the objective in NNLS-type models. For solving the resulting nonconvex models, we propose a scaled gradient projection algorithm that requires solving a sequence of strongly convex quadratic programs. We discuss its close connections to convex splitting methods and difference of convex programming. We also present promising numerical results for example DOAS analysis and hyperspectral demixing problems.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

A general demixing problem is to estimate the quantities or concentrations of the individual components of some observed mixture. Often a linear mixture model is assumed

. In this case the observed mixture is modeled as a linear combination of references for each component known to possibly be in the mixture. If we put these references in the columns of a dictionary matrix , then the mixing model is simply . Physical constraints often mean that should be nonnegative, and depending on the application we may also be able to make sparsity assumptions about the unknown coefficients . This can be posed as a basis pursuit problem where we are interested in finding a sparse and perhaps also non-negative linear combination of dictionary elements that match observed data. This is a very well studied problem. Some standard convex models are non-negative least squares (NNLS) [2, 3],

 minx≥012∥Ax−b∥2 (1)

and methods based on minimization [4, 5, 6]. There are also variations that enforce different group sparsity assumptions on [7, 8, 9].

In this paper we are interested in how to deal with uncertainty in the dictionary. The case when the dictionary is unknown is dealt with in sparse coding and non-negative matrix factorization (NMF) problems [10, 11, 12, 13, 14, 15], which require learning both the dictionary and a sparse representation of the data. We are, however, interested in the case where we know the dictionary but are uncertain about each element. One example we will study in this paper is differential optical absorption spectroscopy (DOAS) analysis , for which we know the reference spectra but are uncertain about how to align them with the data because of wavelength misalignment. Another example we will consider is hyperspectral unmixing [17, 18, 19]. Multiple reference spectral signatures, or endmembers, may have been measured for the same material, and they may all be slightly different if they were measured under different conditions. We may not know ahead of time which one to choose that is most consistent with the measured data. Although there is previous work that considers noise in the endmembers 

and represents endmembers as random vectors

, we may not always have a good general model for endmember variability. For the DOAS example, we do have a good model for the unknown misalignment , but even so, incorporating it may significantly complicate the overall model. Therefore for both examples, instead of attempting to model the uncertainty, we propose to expand the dictionary to include a representative group of possible elements for each uncertain element as was done in .

The grouped structure of the expanded dictionary is known by construction, and this allows us to make additional structured sparsity assumptions about the corresponding coefficients. In particular, the coefficients should be extremely sparse within each group of representative elements, and in many cases we would like them to be at most -sparse. We will refer to this as intra group sparsity. If we expected sparsity of the coefficients for the unexpanded dictionary, then this will carry over to an inter group sparsity assumption about the coefficients for the expanded dictionary. By inter group sparsity we mean that with the coefficients split into groups, the number of groups containing nonzero elements should also be sparse. Modeling structured sparsity by applying sparsity penalties separately to overlapping subsets of the variables has been considered in a much more general setting in [8, 23].

The expanded dictionary is usually an underdetermined matrix with the property that it is highly coherent because the added columns tend to be similar to each other. This makes it very challenging to find good sparse representations of the data using standard convex minimization and greedy optimization methods. If satisfies certain properties related to its columns not being too coherent , then sufficiently sparse non-negative solutions are unique and can therefore be found by solving the convex NNLS problem. These assumptions are usually not satisfied for our expanded dictionaries, and while NNLS may still be useful as an initialization, it does not by itself produce sufficiently sparse solutions. Similarly, our expanded dictionaries usually do not satisfy the incoherence assumptions required for minimization or greedy methods like Orthogonal Matching Pursuit (OMP) to recover the sparse solution [25, 26]. However, with an unexpanded dictionary having relatively few columns, these techniques can be effectively used for sparse hyperspectral unmixing .

The coherence of our expanded dictionary means we need to use different tools to find good solutions that satisfy our sparsity assumptions. We would like to use a variational approach as similar as possible to the NNLS model that enforces the additional sparsity while still allowing all the groups to collaborate. We propose adding nonconvex sparsity penalties to the NNLS objective function (1). We can apply these penalties separately to each group of coefficients to enforce intra group sparsity, and we can simultaneously apply them to the vector of all coefficients to enforce additional inter group sparsity. From a modeling perspective, the ideal sparsity penalty is . There is a very interesting recent work that directly deals with constraints and penalties via a quadratic penalty approach . If the variational model is going to be nonconvex, we prefer to work with a differentiable objective when possible. We therefore explore the effectiveness of sparsity penalties based on the Hoyer measure [29, 30], which is essentially the ratio of and norms. In previous works, this has been successfully used to model sparsity in NMF and blind deconvolution applications [29, 31, 32]. We also consider the difference of and norms. By the relationship, , we see that while the ratio of norms is constant in radial directions, the difference increases moving away from the origin except along the axes. Since the Hoyer measure is twice differentiable on the non-negative orthant away from the origin, it can be locally expressed as a difference of convex functions, and convex splitting or difference of convex (DC) methods  can be used to find a local minimum of the nonconvex problem. Some care must be taken, however, to deal with its poor behavior near the origin. It is even easier to apply DC methods when using - as a penalty, since this is already a difference of convex functions, and it is well defined at the origin.

The paper is organized as follows. In Section 2 we define the general model, describe the dictionary structure and show how to use both the ratio and the difference of and norms to model our intra and inter group sparsity assumptions. Section 3 derives a method for solving the general model, discusses connections to existing methods and includes convergence analysis. In Section 4 we discuss specific problem formulations for several examples related to DOAS analysis and hyperspectral demixing. Numerical experiments for comparing methods and applications to example problems are presented in Section 5.

## 2 Problem

For the non-negative linear mixing model

, let , and with . Let the dictionary have normalized columns and consist of groups, each with elements. We can write and , where each and . The general non-negative least squares problem with sparsity constraints that we will consider is

 minx≥0F(x):=12∥Ax−b∥2+R(x) , (2)

where

 R(x)=M∑j=1γjRj(xj)+γ0R0(x) . (3)

The functions represent the intra group sparsity penalties applied to each group of coefficients , , and is the inter group sparsity penalty applied to . If is differentiable, then a necessary condition for to be a local minimum is given by

 (y−x∗)T∇F(x∗)≥0∀y≥0 . (4)

For the applications we will consider, we want to constrain each vector to be at most 1-sparse, which is to say that we want . To accomplish this through the model (2), we will need to choose the parameters to be sufficiently large.

The sparsity penalties and will either be the ratios of and norms defined by

 Hj(xj)=γj∥xj∥1∥xj∥2andH0(x)=γ0∥x∥1∥x∥2 , (5)

or they will be the differences defined by

 Sj(xj)=γj(∥xj∥1−∥xj∥2)andS0(x)=γ0(∥x∥1−∥x∥2) . (6)

A geometric intuition for why minimizing promotes sparsity of is that since it is constant in radial directions, minimizing it tries to reduce without changing . As seen in Figure 1, sparser vectors have smaller norm on the sphere.

Neither or is differentiable at zero, and is not even continuous there. Figure 2 shows a visualization of both penalties in two dimensions.

To obtain a differentiable , we can smooth the sparsity penalties by replacing the norm with the Huber function, defined by the infimal convolution

 (7)

In this way we can define differentiable versions of sparsity penalties and by

 Hϵjj(xj) =γj∥xj∥1ϕ(xj,ϵj)+ϵj2 (8) Hϵ0(x) =γ0∥x∥1ϕ(x,ϵ0)+ϵ02
 Sϵj(xj) =γj(∥xj∥1−ϕ(xj,ϵj)) (9) Sϵ0(x) =γ0(∥x∥1−ϕ(x,ϵ0))

These smoothed sparsity penalties are shown in Figure 3. Figure 3: Visualization of regularized l1/l2 and l1 - l2 penalties

The regularized penalties behave more like near the origin and should tend to shrink that have small norms.

An alternate strategy for obtaining a differentiable objective that doesn’t require smoothing the sparsity penalties is to add

additional dummy variables and modify the convex constraint set. Let

, denote a vector of dummy variables. Consider applying to vectors instead of to . Then if we add the constraints , we are assured that will only be applied to nonzero vectors, even though is still allowed to be zero. Moreover, by requiring that , we can ensure that at least of the vectors have one or more nonzero elements. In particular, this prevents from being zero, so is well defined as well.

The dummy variable strategy is our preferred approach for using the / penalty. The high variability of the regularized version near the origin creates numerical difficulties. It either needs a lot of smoothing, which makes it behave too much like , or its steepness near the origin makes it harder numerically to avoid getting stuck in bad local minima. For the - penalty, the regularized approach is our preferred strategy because it is simpler and not much regularization is required. Smoothing also makes this penalty behave more like near the origin, but a small shrinkage effect there may in fact be useful, especially for promoting inter group sparsity. These two main problem formulations are summarized below as Problem 1 and Problem 2 respectively.

Problem 1:

 minx,dFH(x,d) :=12∥Ax−b∥2+M∑j=1γjHj(xj,dj)+γ0H0(x) such that x>0,d>0,M∑j=1djϵj≤M−r and ∥xj∥1+dj≥ϵj, j=1,...,M .

Problem 2:

 minx≥0FS(x):=12∥Ax−b∥2+M∑j=1γjSϵj(xj)+γ0Sϵ0(x) .

## 3 Algorithm

Both Problems 1 and 2 from Section 2 can be written abstractly as

 minx∈XF(x):=12∥Ax−b∥2+R(x), (10)

where is a convex set. Problem 2 is already of this form with . Problem 1 is also of this form, with . Note that the objective function of Problem 1 can also be written as in (10) if we redefine as and consider an expanded vector of coefficients that includes the dummy variables, . The data fidelity term can still be written as if columns of zeros are inserted into at the indices corresponding to the dummy variables. In this section, we will describe algorithms and convergence analysis for solving (10) under either of two sets of assumptions.

###### Assumption 1.
• is a convex set.

• and the eigenvalues of

are bounded on .

• is coercive on in the sense that for any , is a bounded set. In particular, is bounded below.

###### Assumption 2.
• is concave and differentiable on .

• Same assumptions on and as in Assumption 1

Problem 1 satisfies Assumption 1 and Problem 2 satisfies Assumption 2. We will first consider the case of Assumption 1.

Our approach for solving (10) was originally motivated by a convex splitting technique from [34, 35] that is a semi-implicit method for solving when can be split into a sum of convex and concave functions , both in . Let be an upper bound on the eigenvalues of , and let be a lower bound on the eigenvalues of . Under the assumption that it can be shown that the update defined by

 xn+1=xn+Δt(−∇FC(xn+1)−∇FE(xn)) (11)

doesn’t increase for any time step . This can be seen by using second order Taylor expansions to derive the estimate

 F(xn+1)−F(xn)≤(λEmax−12λC%min−1Δt)∥xn+1−xn∥2. (12)

This convex splitting approach has been shown to be an efficient method much faster than gradient descent for solving phase-field models such as the Cahn-Hilliard equation, which has been used for example to simulate coarsening 

and for image inpainting

.

By the assumptions on , we can achieve a convex concave splitting, , by letting and for an appropriately chosen positive definite matrix . We can also use the fact that is quadratic to improve upon the estimate in (12) when bounding by a quadratic function of . Then instead of choosing a time step and updating according to (11), we can dispense with the time step interpretation altogether and choose an update that reduces the upper bound on as much as possible subject to the constraint. This requires minimizing a strongly convex quadratic function over .

###### Proposition 3.1.

Let Assumption 1 hold. Also let and be lower and upper bounds respectively on the eigenvalues of for . Then for and for any matrix ,

 F(y)−F(x)≤(y−x)T((λR−12λr)I−C)(y−x)+(y−x)T(12ATA+C)(y−x)+(y−x)T∇F(x) . (13)
###### Proof.

The estimate follows from combining several second order Taylor expansions of and with our assumptions. First expanding about and using to simplify notation, we get that

 F(x)=F(y)−hT∇F(y)+12hT∇2F(y−α1h)h

for some . Substituting as defined by (10), we obtain

 F(y)−F(x)=hT(ATAy−ATb+∇R(y))−12hTATAh−12hT∇2R(y−α1h)h (14)

Similarly, we can compute Taylor expansions of about both and .

 R(x)=R(y)−hT∇R(y)+12hT∇2R(y−α2h)h .
 R(y)=R(x)+hT∇R(x)+12hT∇2R(x+α3h)h .

Again, both and are in . Adding these expressions implies that

 hT(∇R(y)−∇R(x))=12hT∇2R(y−α2h)h+12hT∇2R(x+α3h)h .

From the assumption that the eigenvalues of are bounded above by on ,

 hT(∇R(y)−∇R(x))≤λR∥h∥2 . (15)

Adding and subtracting and to (14) yields

 F(y)−F(x) =hTATAh+hT(ATAx−ATb+∇R(x))+hT(∇R(y)−∇R(x)) −12hTATAh−12hT∇2R(y−α1h)h =12hTATAh+hT∇F(x)+hT(∇R(y)−∇R(x))−12hT∇2R(y−α1h)h .

Using (15),

 F(y)−F(x)≤12hTATAh+hT∇F(x)−12hT∇2R(y−α1h)h+λR∥h∥2 .

The assumption that the eigenvalues of are bounded below by on means

 F(y)−F(x)≤(λR−12λr)∥h∥2+12hTATAh+hT∇F(x) .

Since the estimate is unchanged by adding and subtracting for any matrix , the inequality in (13) follows directly. ∎

###### Corollary.

Let be symmetric positive definite and let denote the smallest eigenvalue of . If , then for ,

 F(y)−F(x)≤(y−x)T(12ATA+C)(y−x)+(y−x)T∇F(x) .

A natural strategy for solving (10) is then to iterate

 xn+1=argminx∈X(x−xn)T(12ATA+Cn)(x−xn)+(x−xn)T∇F(xn) (16)

for chosen to guarantee a sufficient decrease in . The method obtained by iterating (16) can be viewed as an instance of scaled gradient projection [37, 38, 39] where the orthogonal projection of onto is computed in the norm . The approach of decreasing by minimizing an upper bound coming from an estimate like (13) can be interpreted as an optimization transfer strategy of defining and minimizing a surrogate function , which is done for related applications in [12, 13]. It can also be interpreted as a special case of difference of convex programming .

Choosing in such a way that guarantees may be numerically inefficient, and it also isn’t strictly necessary for the algorithm to converge. To simplify the description of the algorithm, suppose for some scalar and symmetric positive definite . Then as gets larger, the method becomes more like explicit gradient projection with small time steps. This can be slow to converge as well as more prone to converging to bad local minima. However, the method still converges as long as each is chosen so that the update decreases sufficiently. Therefore we want to dynamically choose to be as small as possible such that the update given by (16) decreases by a sufficient amount, namely

 F(xn+1)−F(xn)≤σ[(xn+1−xn)T(12ATA+Cn)(xn+1−xn)+(xn+1−xn)T∇F(xn)]

for some . Additionally, we want to ensure that the modulus of strong convexity of the quadratic objective in (16) is large enough by requiring the smallest eigenvalue of to be greater than or equal to some . The following is an algorithm for solving (10) and a dymamic update scheme for that is similar to Armijo line search but designed to reduce the number of times that the solution to the quadratic problem has to be rejected for not decreasing sufficiently. 5   Algorithm 1: A Scaled Gradient Projection Method for Solving (10) Under Assumption 1

Define , , , , , and set .

while    or

 y=argminx∈X (x−xn)T(12ATA+cnC)(x−xn)+(x−xn)T∇F(xn)

if

else

end if

end while

It isn’t necessary to impose an upper bound on in Algorithm 3 even though we want it to be bounded. The reason for this is because once , will be sufficiently decreased for any choice of , so is effectively bounded by .

Under Assumption 2 it is much more straightforward to derive an estimate analogous to Proposition 3.1. Concavity of immediately implies

 R(y)≤R(x)+(y−x)T∇R(x) .

 12∥Ay−b∥2=12∥Ax−b∥2+(y−x)T(ATAx−ATb)+12(y−x)TATA(y−x)

yields

 F(y)−F(x)≤(y−x)T12ATA(y−x)+(y−x)T∇F(x) (17)

for . Moreover, the estimate still holds if we add to the right hand side for any positive semi-definite matrix . We are again led to iterating (16) to decrease , and in this case need only be included to ensure that is positive definite. We can let since the dependence on is no longer necessary. We can choose any such that the smallest eigenvalue of is greater than , but it is still preferable to choose as small as is numerically practical. 5   Algorithm 2: A Scaled Gradient Projection Method for Solving (10) Under Assumption 2

Define , symmetric positive definite and .

while    or

 xn+1=argminx∈X (x−xn)T(12ATA+C)(x−xn)+(x−xn)T∇F(xn) (18)

end while   Since the objective in (18) is zero at , the minimum value is less than or equal to zero, and so by (17).

Algorithm 17 is also equivalent to iterating

 xn+1=argminx∈X12∥Ax−b∥2+∥x∥2C+xT(∇R(xn)−2Cxn) ,

which can be seen as an application of the simplified difference of convex algorithm from  to . The DC method in  is more general and doesn’t require the convex and concave functions to be differentiable.

With many connections to classical algorithms, existing convergence results can be applied to argue that limit points of the iterates of Algorithms 17 and 3 are stationary points of (10). We still choose to include a convergence analysis for clarity because our assumptions allow us to give a simple and intuitive argument. The following analysis is for Algorithm 3 under Assumption 1. However, if we replace with and with , then it applies equally well to Algorithm 17 under Assumption 2. We proceed by showing that the sequence is bounded, and limit points of are stationary points of (10) satisfying the necessary local optimality condition (4).

###### Lemma 3.2.

The sequence of iterates generated by Algorithm 3 is bounded.

###### Proof.

Since is non-increasing, , which is a bounded set by assumption. ∎

###### Lemma 3.3.

Let be the sequence of iterates generated by Algorithm 3. Then .

###### Proof.

Since is bounded below and non-increasing, it converges. By construction, satisfies

 −[(xn+1−xn)T(12ATA+Cn)(xn+1−xn)+(xn+1−xn)T∇F(xn)]≤1σ(F(xn)−F(xn+1)) .

By the optimality condition for (16),

 (y−xn+1)T((ATA+2Cn)(xn+1−xn)+∇F(xn))≥0∀y∈X .

In particular, we can take , which implies

 (xn+1−xn)T(ATA+2Cn)(xn+1−xn)≤−(xn+1−xn)T∇F(xn) .

Thus

 (xn+1−xn)T(12ATA+Cn)(xn+1−xn)≤1σ(F(xn)−F(xn+1)) .

Since the eigenvalues of are bounded below by , we have that

 ρ∥xn+1−xn∥2≤1σ(F(xn)−F(xn+1)) .

The result follows from noting that

 limn→∞∥xn+1−xn∥2≤limn→∞1σρ(F(xn)−F(xn+1)),

which equals since converges. ∎

###### Proposition 3.4.

Any limit point of the sequence of iterates generated by Algorithm 3 satisfies for all , which means is a stationary point of (10).

###### Proof.

Let be a limit point of . Since is bounded, such a point exists. Let be a subsequence that converges to . Since , we also have that . Recalling the optimality condition for (16),

 0 ≤(y−xnk+1)T((ATA+2Cnk)(xnk+1−xnk)+∇F(xnk))≤ ∥y−xnk+1∥∥ATA+2Cnk∥∥xnk+1−xnk∥+(y−xnk+1)T∇F(xnk)∀y∈X .

Following , proceed by taking the limit along the subsequence as .

 ∥y−xnk+1∥∥xnk+1−xnk∥∥ATA+2Cnk∥→0

since and is bounded. By continuity of we get that

 (y−x∗)T∇F(x∗)≥0∀y∈X .

Each iteration requires minimizing a strongly convex quadratic function over the set as defined in (16). Many methods can be used to solve this, and we want to choose one that is as robust as possible to poor conditioning of . For example, gradient projection works theoretically and even converges at a linear rate, but it can still be impractically slow. A better choice here is to use the alternating direction method of multipliers (ADMM) [41, 42], which alternately solves a linear system involving and projects onto the constraint set. Applied to Problem 2, this is essentially the same as the application of split Bregman  to solve a NNLS model for hyperspectral demixing in . We consider separately the application of ADMM to Problems 1 and 2. The application to Problem 2 is simpler.

For Problem 2, (16) can be written as

 xn+1=argminx≥0(x−xn)T(12ATA+Cn)(x−xn)+(x−xn)T∇FS(xn) .

To apply ADMM, we can first reformulate the problem as

 minu,vg≥0(v)+(u−xn)T(12ATA+Cn)(u−xn)+(u−xn)T∇FS(xn)such thatu=v , (19)

where is an indicator function for the constraint defined by .

Introduce a Lagrange multiplier and define a Lagrangian

 L(u,v,p)=g≥0(v)+(u−xn)T(12ATA+Cn)(u−xn)+(u−xn)T∇FS(xn)+pT(u−v) (20)

and augmented Lagrangian

 Lδ(u,v,p)=L(u,v,p)+δ2∥u−v∥2 ,

 L(u∗,v∗,p)≤L(u∗,v∗,p∗)≤L(u,v,p∗)∀u,v,p

by alternately minimizing with respect to , minimizing with respect to and updating the dual variable . Having found a saddle point of , will be a solution to (19) and we can take to be the solution to (16). The explicit ADMM iterations are described in the following algorithm. 5   Algorithm 3: ADMM for solving convex subproblem for Problem 2

Define , and arbitrarily and let .

while not converged

 uk+1 =xn+(ATA+2Cn+δI)−1(δ(vk−xn)−pk−∇FS(xn)) vk+1 =Π≥0(uk+1+pkδ) pk+1 =pk+δ(uk+1−vk+1)

k = k + 1

end while

Here denotes the orthogonal projection onto the non-negative orthant. For this application of ADMM to be practical, should not be too expensive to apply, and should be well chosen.

Since (16) is a standard quadratic program, a huge variety of other methods could also be applied. Variants of Newton’s method on a bound constrained KKT system might work well here, especially if we find we need to solve the subproblem to very high accuracy.

For Problem 2, (16) can be written as

 (xn+1,dn+1) =argminx,d(x−xn)T(12ATA+Cxn)(x−xn)+(d−dn)TCdn(d−dn)+ (x−xn)T∇xFH(xn,dn)+(d−dn)T∇dFH(xn,dn) .

Here, and represent the gradients with respect to and respectively. The matrix is assumed to be of the form , with a diagonal matrix. It is helpful to represent the constraints in terms of convex sets defined by

 Xϵj={[xjdj]∈Rmj+1:∥xj∥1+dj≥ϵj,xj≥0,dj≥0} j=1,...,M ,
 Xβ={d∈RM:M∑j=1djβj≤M−r,dj≥0} ,

and indicator functions and for these sets.

Let and represent and . Then by adding splitting variables and we can reformulate the problem as

 minu,w,vx,vd ∑jgXϵj(vxj,vdj)+gXβ(w)+(u−xn)T(12ATA+Cxn)(u−xn)+(w−dn)TCdn(w−dn)+ (x−xn)T∇xFH(xn,dn)+(w−dn)T∇dFH(xn,dn)s.t.vx=u,vd=w .

Adding Lagrange multipliers and for the linear constraints, we can define the augmented Lagrangian

 Lδ(u,w,vx,vd,px,pd) =∑jgXϵj(vxj,vdj)+gX