Stochastic Greedy Algorithms For Multiple Measurement Vectors

Sparse representation of a single measurement vector (SMV) has been explored in a variety of compressive sensing applications. Recently, SMV models have been extended to solve multiple measurement vectors (MMV) problems, where the underlying signal is assumed to have joint sparse structures. To circumvent the NP-hardness of the ℓ_0 minimization problem, many deterministic MMV algorithms solve the convex relaxed models with limited efficiency. In this paper, we develop stochastic greedy algorithms for solving the joint sparse MMV reconstruction problem. In particular, we propose the MMV Stochastic Iterative Hard Thresholding (MStoIHT) and MMV Stochastic Gradient Matching Pursuit (MStoGradMP) algorithms, and we also utilize the mini-batching technique to further improve their performance. Convergence analysis indicates that the proposed algorithms are able to converge faster than their SMV counterparts, i.e., concatenated StoIHT and StoGradMP, under certain conditions. Numerical experiments have illustrated the superior effectiveness of the proposed algorithms over their SMV counterparts.


1 Introduction

Reconstruction of sparse signals from limited measurements has been studied extensively, with applications in imaging science, machine learning, computer vision, and beyond. The central problem is to reconstruct a signal that is sparse, either by itself or in some transformed domain, from a few measurements (or observations) acquired by a sensing device. Let x ∈ ℝⁿ be the signal to be reconstructed. Then the sparse signal reconstruction problem can be formulated as the sparsity-constrained minimization problem

 min_{x∈ℝⁿ} F(x), subject to ∥x∥_0 ≤ k, (1)

where the sparsity ∥x∥_0 counts the nonzeros in x. Here F : ℝⁿ → ℝ is a loss function measuring the discrepancy between the acquired measurements y ∈ ℝᵐ (m ≪ n) and the measurements predicted by the estimated solution. In particular, if the measurements are linearly related to the underlying signal, i.e., there exists a sensing matrix A ∈ ℝ^{m×n} such that y = Ax + e, where e is Gaussian noise, then the least squares loss function is widely used:

 F(x) = (1/2)∥y − Ax∥_2².

In this case, (1) is a single measurement vector (SMV) sparse signal reconstruction problem. The choice of F depends on the generation mechanism of the measurements. Since measurements are typically generated continuously in most imaging techniques, it is important in practice to reconstruct a collection of sparse signals, expressed as a signal matrix, from multiple measurement vectors (MMV). More precisely, the signal matrix X ∈ ℝ^{n×L} with k (k ≤ n) nonzero rows can be obtained by solving the following MMV model

 min_{X∈ℝ^{n×L}} F(X), subject to ∥X∥_{r,0} ≤ k, (2)

where ∥X∥_{r,0} stands for the row-sparsity of X, which counts the nonzero rows of X. Note that it is possible for certain columns of X to have more zero components than X has zero rows. The MMV sparse reconstruction problem was first introduced in magnetoencephalography (MEG) imaging cotter2005sparse ; rao2004diversity , and has been extended to other applications he2008cg ; bazerque2010distributed ; majumdar2011joint ; majumdar2012face ; majumdar2013rank ; davies2012rank ; li2017atomic .
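As a concrete instance, the SMV least squares loss and its MMV analogue can be sketched as follows. This is an illustrative sketch, not the paper's code, and it assumes a single sensing matrix A shared by all L measurement vectors.

```python
import numpy as np

# Illustrative sketch (assumed setup): the SMV least squares loss
# F(x) = (1/2)||y - A x||_2^2 and its MMV analogue F(X) = (1/2)||Y - A X||_F^2,
# with one sensing matrix A shared by all columns.

def smv_loss(A, y, x):
    """F(x) = 0.5 * ||y - A x||_2^2 for one measurement vector y."""
    r = y - A @ x
    return 0.5 * float(r @ r)

def mmv_loss(A, Y, X):
    """F(X) = 0.5 * ||Y - A X||_F^2 for L measurement vectors (columns of Y)."""
    R = Y - A @ X
    return 0.5 * float(np.sum(R * R))
```

In this setting the MMV loss is exactly the sum of the SMV losses over the columns, which is what makes column-wise (concatenated) SMV solvers applicable in the first place.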

Many SMV algorithms can be applied to solve MMV problems. The most straightforward way is to use SMV algorithms to reconstruct each signal vector, sequentially or simultaneously via parallel computing, and then concatenate all resultant signals to form the estimated signal matrix. We call these concatenated SMV algorithms. Alternatively, the MMV problem can be converted to an SMV one by stacking the columns of the unknown signal matrix into a vector and introducing a block diagonal matrix as the new sensing matrix. However, neither approach fully exploits the joint sparse structure of the underlying signal matrix, and both lack computational efficiency. In this paper, we develop MMV algorithms that require neither concatenation of SMV results nor vectorization of the unknown signal matrix.

Since the sparsity constraint in (1) and (2) is non-convex and non-differentiable, many classical convex optimization algorithms fail to produce a satisfactory solution. To handle the NP-hardness of the problem, many convex relaxation methods and their MMV extensions have been developed, e.g., the regularized M-FOCUSS cotter2005sparse , and regularized MMV extensions of the alternating direction method of multipliers lu2011fast ; Jian2015split . By exploiting the relationship between the measurements and the correct atoms, multiple signal classification (MUSIC) feng1996spectrum and its improved variants kim2012compressive ; lee2012subspace have been developed. However, in rank-defective cases, when the rank of the measurement matrix is much smaller than the desired row-sparsity level, MUSIC-type methods mostly fail to identify the correct atoms. The third category of algorithms for solving the sparsity-constrained problem is the class of greedy algorithms, which seek the sparsest solution by updating the support iteratively, including Orthogonal Matching Pursuit (OMP) tropp2004greed , simultaneous OMP (S-OMP) chen2006theoretical ; tropp2006algorithms , Compressive Sampling Matching Pursuit (CoSaMP) needell2009cosamp , Regularized OMP (ROMP) needell2010signal , Subspace Pursuit (SP) dai2009subspace , and Iterative Hard Thresholding (IHT) blumensath2009iterative . It has been shown that CoSaMP and IHT are more efficient than the convex relaxation methods, with strong recovery guarantees. However, most of these algorithms target compressive sensing applications where the loss is a least squares function. Recently, Gradient Matching Pursuit (GradMP) nguyenunified has been proposed to extend CoSaMP to more general loss functions. To further improve computational efficiency and handle non-convex objective functions, Stochastic IHT (StoIHT) and Stochastic GradMP (StoGradMP) have been proposed nguyen2014linear .
Nevertheless, the aforementioned greedy algorithms are designed for the SMV problem, and their concatenated extensions to the MMV setting yield limited performance, especially for large data sets. In this paper, we propose the MMV Stochastic IHT (MStoIHT) and the MMV Stochastic GradMP (MStoGradMP) methods for solving the general MMV joint sparse recovery problem (2). To accelerate convergence, the mini-batching technique is applied to the proposed algorithms. We also show that the proposed algorithms converge faster than their concatenated SMV counterparts under certain conditions. A large variety of numerical experiments on joint sparse matrix recovery and video sequence recovery demonstrate the superior performance of the proposed algorithms over their SMV counterparts in terms of running time and accuracy.

Organization. The rest of the paper is organized as follows. Preliminary knowledge and notation clarifications are provided in Section 2. Section 3 presents the concatenated SMV algorithms and the proposed stochastic greedy algorithms, i.e., MStoIHT and MStoGradMP, in detail. Section 4 discusses how to apply the mini-batching technique to accelerate the proposed algorithms. Convergence analysis is provided in Section 5. By choosing the widely used least squares loss function as F, joint sparse signal recovery in distributed compressive sensing is discussed in Section 6. Extensive numerical results are shown in Section 7. Finally, some concluding remarks are made in Section 8.

2 Preliminaries

To make the paper self-contained, we first introduce some useful notation and definitions, and then briefly describe the related algorithms, i.e., StoIHT and StoGradMP. Let [N] := {1, …, N}, and let |T| be the number of elements in the set T. Consider a finite atom set (a.k.a. the dictionary) D = {d_1, …, d_N} with each atom d_i ∈ ℝⁿ.

2.1 Vector Sparsity

Assume that a vector x ∈ ℝⁿ can be written as a linear combination of the d_i’s, i.e., x = Dα with

 D = [d_1 ⋯ d_N], α = (α_1, …, α_N)^T.

Then the support of x with respect to D and α is defined by

 supp_{α,D}(x) = {i ∈ [N] : α_i ≠ 0} := supp(α).

The ℓ_0-norm of x with respect to D is defined as the minimal support size

 ∥x∥_{0,D} = min_α { |T| : x = Σ_{i∈T} α_i d_i, T ⊆ [N] } = min_α |supp_{α,D}(x)|.

Since absolute homogeneity does not hold in general, i.e., ∥λx∥_{0,D} = |λ|·∥x∥_{0,D} fails for general scalars λ, this ℓ_0-norm is not a norm. Here the smallest support is called the support of x with respect to D, denoted by supp_D(x). Thus |supp_D(x)| = ∥x∥_{0,D}.

Note that the support may not be unique if D is over-complete, in that there could be multiple representations of x with respect to the atom set due to the linear dependence of the atoms in D. The vector x is called k-sparse with respect to D if

 ∥x∥_{0,D} ≤ k.
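For the special case of the canonical basis D = I (a simplifying assumption; for a general dictionary one must minimize over all representations), the support and the ℓ_0-norm reduce to counting nonzero entries:

```python
import numpy as np

# Sketch for the special case D = I (standard basis atoms): the support of x
# is the index set of its nonzero entries and ||x||_0 is the support size.

def support(x, tol=0.0):
    """supp(x) for D = I: indices of entries with magnitude above tol."""
    return {i for i, xi in enumerate(x) if abs(xi) > tol}

def l0_norm(x, tol=0.0):
    """||x||_0 = |supp(x)| for D = I."""
    return len(support(x, tol))
```

Scaling x by a nonzero constant leaves `l0_norm(x)` unchanged, which is exactly the failure of absolute homogeneity noted above.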

2.2 Matrix Sparsity

By extending vector sparsity, we define the row sparsity of a matrix X ∈ ℝ^{n×L} as follows:

 ∥X∥_{r,0,D} = min_Ω { |Ω| : Ω = ⋃_{i=1}^{L} supp_D(X_{⋅,i}) },

where X_{⋅,i} is the i-th column of X. Here the minimal common support is called the (row-wise) joint support of X with respect to D, denoted by supp^r_D(X), which satisfies

 |supp^r_D(X)| = ∥X∥_{r,0,D}.

The matrix X is called k-row sparse with respect to D if all columns of X share a joint support of size at most k with respect to D, i.e.,

 ∥X∥_{r,0,D} ≤ k.
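Again for the illustrative case D = I, the joint support is the union of the column supports and the row sparsity counts nonzero rows; a minimal sketch:

```python
import numpy as np

# Sketch for D = I: the joint support of X is the union of its column
# supports, i.e., the indices of the nonzero rows, and ||X||_{r,0} is its size.

def row_support(X, tol=0.0):
    """Indices of rows of X containing at least one entry above tol."""
    return {i for i in range(X.shape[0]) if np.abs(X[i, :]).max() > tol}

def row_l0(X):
    """||X||_{r,0} = number of nonzero rows of X."""
    return len(row_support(X))
```

Note that a single column may contain more zeros than X has zero rows, matching the remark after (2).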

2.3 Functions Defined on A Matrix Space

Given a function f : ℝ^{n×L} → ℝ, the matrix derivative is defined by concatenating gradients magnus2010concept

 ∂f/∂X = [∂f/∂X_{i,j}]_{n×L} = [∇_{X_{⋅,1}} f ⋯ ∇_{X_{⋅,L}} f], (3)

where X_{i,j} is the (i,j)-th entry of X and X_{i,⋅} is the i-th row vector of X. Let Tr(·) be the trace operator that adds up all the diagonal entries of a matrix. The inner product of any two matrices U, V ∈ ℝ^{n×L} is defined as

 ⟨U, V⟩ = Tr(U^T V).

Note that the equality

 ⟨U, U⟩ = Tr(U^T U) = ∥U∥_F² (4)

and the Cauchy–Schwarz inequality

 ⟨U,V⟩≤∥U∥F∥V∥F (5)

still hold. By generalizing the concepts in nguyen2014linear , we define the restricted strong convexity and restricted strong smoothness (a.k.a. the Lipschitz condition on the gradient) properties for functions defined on a matrix space.
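A quick numerical sanity check (illustrative only) of the trace inner product, the Frobenius identity, and the Cauchy–Schwarz inequality (5):

```python
import numpy as np

# The matrix inner product <U, V> = Tr(U^T V) equals the entrywise sum of
# U * V; <U, U> recovers the squared Frobenius norm, and Cauchy-Schwarz
# bounds <U, V> by the product of the Frobenius norms.

def mat_inner(U, V):
    """<U, V> = Tr(U^T V)."""
    return float(np.trace(U.T @ V))
```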

Definition 1

The function f satisfies the ρ⁻_k-restricted strong convexity (ρ⁻_k-RSC) property if there exists ρ⁻_k > 0 such that

 f(X′) − f(X) − ⟨∂f/∂X(X), X′ − X⟩ ≥ (ρ⁻_k/2) ∥X′ − X∥_F² (6)

for all matrices X, X′ ∈ ℝ^{n×L} with |supp^r_D(X) ∪ supp^r_D(X′)| ≤ k.

Definition 2

The function f satisfies the ρ⁺_k-restricted strong smoothness (ρ⁺_k-RSS) property if there exists ρ⁺_k < ∞ such that

 ∥∂f/∂X(X) − ∂f/∂X(X′)∥_F ≤ ρ⁺_k ∥X − X′∥_F (7)

for all matrices X, X′ ∈ ℝ^{n×L} with |supp^r_D(X) ∪ supp^r_D(X′)| ≤ k.

2.4 Related Work

StoIHT (see Algorithm 1) and StoGradMP (see Algorithm 2) have been proposed to solve the sparsity-constrained SMV problem nguyen2014linear

 min_{x∈ℝⁿ} (1/M) Σ_{i=1}^M f̃_i(x), subject to ∥x∥_{0,D} ≤ k. (8)

At each iteration of StoIHT, one component function f̃_i is first randomly selected with probability p(i). Here the input discrete probability distribution p(·) satisfies

 Σ_{i=1}^M p(i) = 1, and p(i) ≥ 0, i = 1, …, M.

Next, in the “Proxy” step, gradient descent along the selected component is performed. Then the last two steps, i.e., “Identify” and “Estimate”, essentially project the gradient descent result onto its best k-sparse approximation. Given w ∈ ℝⁿ and η ≥ 1, the best k-sparse approximation operator acting on w and η, denoted by approx_k(w, η), constructs an index set Γ with |Γ| ≤ k such that

 ∥P_Γ w − w∥_2 ≤ η ∥w − w^{(k)}∥_2, where w^{(k)} = argmin_{y ∈ R(D_Γ), |Γ| ≤ k} ∥w − y∥_2.

Here P_Γ w is the orthogonal projection of w onto the subspace R(D_Γ) ⊆ ℝⁿ spanned by the atoms with indices in Γ, and w^{(k)} is the best k-sparse approximation of w over all such subspaces. In particular, if η = 1 and D = I, then approx_k(w, 1) returns the index set of the k largest entries of w in absolute value, i.e.,

 approx_k(w, 1) = {i_1, i_2, …, i_k : |w_{i_1}| ≥ … ≥ |w_{i_k}| ≥ … ≥ |w_{i_n}|} := Γ̂.

Then the projection reads in componentwise form as (P_Γ̂ w)_i = w_i if i ∈ Γ̂, and (P_Γ̂ w)_i = 0 otherwise.
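For D = I and η = 1 this operator reduces to plain hard thresholding; a minimal sketch:

```python
import numpy as np

# Sketch of approx_k(w, 1) and the projection P_Gamma for D = I: keep the
# k largest-magnitude entries of w and zero out the rest (hard thresholding).

def approx_k(w, k):
    """Index set of the k largest entries of w in absolute value."""
    return np.argsort(np.abs(w))[::-1][:k]

def hard_threshold(w, k):
    """P_Gamma w: the best k-sparse approximation of w when D = I."""
    out = np.zeros_like(w)
    idx = approx_k(w, k)
    out[idx] = w[idx]
    return out
```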

There are two widely used stopping criteria:

 ∥x^{t+1} − x^t∥_2 / ∥x^t∥_2 < ε, and (1/M) Σ_{i=1}^M f̃_i(x^t) < ε,

where ε > 0 is a small tolerance. It is well known that the first stopping criterion is more robust in practice.

Different from StoIHT, StoGradMP involves a gradient matching process, i.e., it finds the best k-sparse approximation of the gradient rather than of the estimated solution. At the solution estimation step, the original problem is restricted to the components of the estimated support. It has been empirically shown that StoGradMP converges faster than StoIHT due to its more accurate estimation of the support, but StoGradMP additionally requires an upper bound on the sparsity level k.

3 Proposed Stochastic Greedy Algorithms

In this section, we present concatenated SMV algorithms, and develop stochastic greedy algorithms for MMV problems based on StoIHT and StoGradMP. Suppose that there are M differentiable and convex functions f_i : ℝ^{n×L} → ℝ that satisfy the ρ⁺_k-restricted strong smoothness property (see Definition 2), and that their mean

 F(X) = (1/M) Σ_{i=1}^M f_i(X) (9)

satisfies the ρ⁻_k-restricted strong convexity property (see Definition 1). These assumptions will be used extensively throughout the entire paper. Consider the following row-sparsity constrained MMV problem

 min_{X∈ℝ^{n×L}} (1/M) Σ_{i=1}^M f_i(X), subject to ∥X∥_{r,0,D} ≤ k. (10)

By vectorizing X, i.e., stacking the columns of X into a vector, we can relax (10) to a sparsity-constrained SMV problem of the form (8) in which the sparsity level k is replaced by kL. Since sparsity at level kL of the vectorized signal does not necessarily guarantee k-row sparsity of X, the solution to the relaxed problem may not be the same as the vectorization of the solution to (10). On the other hand, iterative stochastic algorithms such as StoIHT and StoGradMP can be developed into concatenated versions for solving (10) under the following assumption on the objective functions f_i.

Assumption 1. The objective functions in (10) are separable, in the sense that there exists a collection of functions g_{i,j} : ℝⁿ → ℝ with

 f_i(X) = Σ_{j=1}^L g_{i,j}(X_{⋅,j}), i = 1, …, M. (11)

Under this assumption, the concatenated algorithms, i.e., CStoIHT in Algorithm 3 and CStoGradMP in Algorithm 4, can be applied to solve (10); they essentially reconstruct each column of X by solving the SMV problem (8). Notice that the outer loops of CStoIHT and CStoGradMP can be executed in parallel on a multi-core computer when the order of the inner loop and the outer loop in Algorithm 3 can be swapped. However, if the sparsity level k is very large, the support sets of the estimated columns tend to overlap less in early iterations, which results in a less accurate estimation of the joint support and larger errors in the initial iterates. In addition, for a nonlinear function f_i that cannot be separated into a sum of functions of the columns of X, it will be challenging to find an appropriate objective function for the corresponding SMV problem.

To circumvent the aforementioned issues, we first propose the MMV Stochastic Iterative Hard Thresholding algorithm (MStoIHT), detailed in Algorithm 5. Compared to StoIHT, MStoIHT replaces the gradient by the matrix derivative (3). The second significant difference lies in the “Identify” and “Estimate” steps, especially the approximation operator, which we now extend from sparse vectors to row-sparse matrices. Given X ∈ ℝ^{n×L} and η ≥ 1, the best k-row sparse approximation operator acting on X and η, denoted by approx^r_k(X, η), constructs a row index set Γ such that

 ∥P_Γ X_{⋅,j} − X_{⋅,j}∥_2 ≤ η ∥X_{⋅,j} − (X_{⋅,j})^{(k)}∥_2, j = 1, …, L,

where (X_{⋅,j})^{(k)} is the best k-sparse approximation of the column vector X_{⋅,j} with respect to D. In particular, if η = 1 and D = I, then approx^r_k(X, 1) returns the row index set of the k largest row norms of X, i.e.,

 approx^r_k(X, 1) = {i_1, i_2, …, i_k : ∥X_{i_1,⋅}∥_2 ≥ ∥X_{i_2,⋅}∥_2 ≥ … ≥ ∥X_{i_n,⋅}∥_2} := Γ̃.

Abusing notation, we define P_Γ̃ X to be the projection of X onto the subspace of all row-sparse matrices whose row indices are restricted to Γ̃. Therefore, we have

 P_Γ̃ X = [P_Γ̃ X_{⋅,1} P_Γ̃ X_{⋅,2} … P_Γ̃ X_{⋅,L}].

Due to the common support Γ̃, the projection can be written in row-wise form as (P_Γ̃ X)_{i,⋅} = X_{i,⋅} if i ∈ Γ̃, and (P_Γ̃ X)_{i,⋅} = 0 otherwise. In other words, P_Γ̃ X is a k-row sparse matrix whose nonzero rows are the k rows of X with the largest row norms.
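For D = I and η = 1 the operator reduces to row-wise hard thresholding; a minimal sketch:

```python
import numpy as np

# Sketch of approx_k^r(X, 1) and P_Gamma for D = I: keep the k rows of X
# with the largest Euclidean norms and zero out all remaining rows.

def approx_rk(X, k):
    """Row indices of the k largest row norms of X."""
    return np.argsort(np.linalg.norm(X, axis=1))[::-1][:k]

def row_hard_threshold(X, k):
    """P_Gamma X: the best k-row-sparse approximation of X when D = I."""
    out = np.zeros_like(X)
    idx = approx_rk(X, k)
    out[idx, :] = X[idx, :]
    return out
```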

Next, we propose the MMV Stochastic Gradient Matching Pursuit (MStoGradMP), detailed in Algorithm 6, where the two gradient matching steps involve the operator approx^r_k. The stopping criteria in all proposed algorithms can be chosen the same as those in Algorithm 1 and Algorithm 2.
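To make the Proxy/Identify/Estimate pattern concrete, here is a compact sketch of an MStoIHT-style iteration for the least squares MMV loss with uniform sampling p(i) = 1/M and D = I. This is an assumed minimal instance for illustration, not the paper's implementation.

```python
import numpy as np

# MStoIHT-style loop (sketch): F(X) = (1/M) sum_i 0.5*||Y_i - A_i X||_F^2,
# uniform sampling, D = I. Each iteration takes a stochastic gradient step
# on one component ("Proxy"), finds the top-k row norms ("Identify"), and
# projects onto the k-row-sparse set ("Estimate").

def mstoiht(As, Ys, k, gamma, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    M = len(As)
    n, L = As[0].shape[1], Ys[0].shape[1]
    X = np.zeros((n, L))
    for _ in range(iters):
        i = rng.integers(M)                    # sample one component function
        G = As[i].T @ (As[i] @ X - Ys[i])      # matrix derivative of f_i
        B = X - gamma * G                      # Proxy: gradient step
        idx = np.argsort(np.linalg.norm(B, axis=1))[::-1][:k]  # Identify
        X = np.zeros_like(B)                   # Estimate: row projection
        X[idx, :] = B[idx, :]
    return X
```

On noiseless, well-conditioned data the iterates typically contract toward the k-row-sparse ground truth, in line with the convergence analysis of Section 5.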

4 Batched Acceleration

To accelerate computations and improve performance, we apply the mini-batching technique to obtain batched variants of Algorithms 5 and 6. We first partition the index set [M] into a collection of d equal-sized batches τ_1, …, τ_d, where the batch size |τ_i| = b for all i. For simplicity, we assume that d = M/b is an integer. Similar to the approach in needell2016batched , we reformulate (9) as

 F(X) = (1/d) Σ_{i=1}^d ( (1/b) Σ_{j∈τ_i} f_j(X) ). (12)

Based on this new formulation, we obtain the batched version of Algorithm 5, termed BMStoIHT and described in Algorithm 7. Here the input probability distribution satisfies

 Σ_{i=1}^d p(i) = 1, and p(i) ≥ 0, i = 1, …, d.

Likewise, we obtain a batched version of MStoGradMP, termed BMStoGradMP, which is detailed in Algorithm 8. It is empirically shown in Section 7 that increasing the batch size greatly speeds up the convergence of both algorithms, and the effect is more pronounced for BMStoIHT. As a by-product, mini-batching can also improve recovery accuracy. However, there is a trade-off between the batch size and the performance improvement needell2016batched .
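The partition underlying (12) can be sketched as follows (illustrative helper names); averaging the batch averages leaves the overall mean objective unchanged:

```python
# Sketch of the mini-batching reformulation (12): split {0,...,M-1} into
# d = M/b equal batches; the mean of the batch means equals the plain mean.

def make_batches(M, b):
    """Partition range(M) into d = M/b consecutive batches of size b."""
    assert M % b == 0, "batch size b must divide M"
    return [list(range(i * b, (i + 1) * b)) for i in range(M // b)]

def batched_mean(values, batches):
    """(1/d) sum_i (1/b) sum_{j in tau_i} values[j] == mean(values)."""
    b = len(batches[0])
    return sum(sum(values[j] for j in tau) / b for tau in batches) / len(batches)
```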

5 Convergence Analysis

In this section, we provide convergence guarantees for the proposed MStoIHT and MStoGradMP, together with their SMV counterparts, i.e., CStoIHT and CStoGradMP. To simplify the discussion, the result at the t-th iteration of CStoIHT/CStoGradMP refers to the result obtained after t inner iterations for each of the outer iterations of Algorithm 3/Algorithm 4, or equivalently, the maximum number of inner iterations is set to t. Furthermore, all convergence results can be extended to the batched versions, i.e., BMStoIHT and BMStoGradMP. Comparison of the contraction coefficients shows that the proposed algorithms converge faster under restricted strong convexity, restricted strong smoothness, and separability of the objective function (see Assumption 1).

By replacing the ℓ_2-norm and the vector inner product with the Frobenius norm and the matrix inner product, respectively, and using (4) and (5), we obtain convergence results for MStoIHT and MStoGradMP similar to those in nguyen2014linear :

Theorem 5.1

Let X* be a feasible solution of (10) and X⁰ be the initial solution. At the (t+1)-th iteration of Algorithm 5, the expectation of the recovery error is bounded by

 E∥X^{t+1} − X*∥_F ≤ κ^{t+1} ∥X⁰ − X*∥_F + σ_{X*}/(1 − κ), (13)

where

 κ = 2√(1 − γ(2 − γα_{3k})ρ⁻_{3k}) + √((η² − 1)(1 + γ²α_{3k}ρ̄⁺_{3k} − 2γρ⁻_{3k})), (14)
 α_k = max_{1≤i≤M} ρ⁺_k(i)/(M p(i)), ρ⁺_k = max_{1≤i≤M} ρ⁺_k(i), ρ̄⁺_k = (1/M) Σ_{i=1}^M ρ⁺_k(i).

Thus Algorithm 7 converges linearly. In particular, if γ = 1 and η = 1, then

 κ = 2√(1 − 2ρ⁻_{3k} + α_{3k}ρ⁻_{3k}). (15)

To solve problem (10), CStoIHT uses StoIHT to recover each column of X separately and then concatenates all column vectors to form a matrix. To analyze the convergence of CStoIHT, we first derive an upper bound on the expected squared recovery error of each column, following the proof in (nguyen2014linear, Section 8.2). Note that we look for the contraction coefficient of the squared error E∥X^t_{⋅,j} − X*_{⋅,j}∥_2² rather than of the error E∥X^t_{⋅,j} − X*_{⋅,j}∥_2.

Lemma 1

Let X* be a feasible solution of (10) and X⁰ be the initial solution. Under Assumption 1, there exist κ_j ∈ (0, 1) and σ_j ≥ 0 such that the expected squared recovery error at the (t+1)-th iteration of Algorithm 3 for estimating the j-th column of X* is bounded by

 E_{I_t}∥X^{t+1}_{⋅,j} − X*_{⋅,j}∥_2² ≤ κ_j^{t+1} ∥X⁰_{⋅,j} − X*_{⋅,j}∥_2² + σ_j/(1 − κ_j), (16)

where X^{t+1}_{⋅,j} is the approximation of X*_{⋅,j} at the (t+1)-th iteration of StoIHT with the initial guess X⁰_{⋅,j}, i.e., the result at the (t+1)-th inner iteration and the j-th outer iteration of Algorithm 3 with the initial guess X⁰. Here I_t is the set of all indices randomly selected at or before the t-th step of the algorithm.

Proof

Due to the separable form of f_i in Assumption 1, we consider problems of the form

 min_w (1/M) Σ_{i=1}^M g_{i,j}(w), ∥w∥_{0,D} ≤ k, j = 1, …, L, (17)

where the g_{i,j} are given in (11). This relaxation is valid since X*_{⋅,j} is a feasible solution of (17) whenever X* is a feasible solution of (10). Write w^t = X^t_{⋅,j} and w* = X*_{⋅,j}, let Ω be a support set containing supp(w^t) ∪ supp(w*), and set x = ∥w^{t+1} − w*∥_2. Let

 u = ∥w^t − w* − (γ/(M p(i_t))) P_Ω(∇g_{i_t,j}(w^t) − ∇g_{i_t,j}(w*))∥_2 + ∥(γ/(M p(i_t))) P_Ω ∇g_{i_t,j}(w*)∥_2 := u_1 + u_2,

and

 v = (η² − 1) ∥w^t − w* − (γ/(M p(i_t))) ∇g_{i_t,j}(w^t)∥_2².

Let

 v_1 = (η² − 1) ∥w^t − w* − (γ/(M p(i_t))) (∇g_{i_t,j}(w^t) − ∇g_{i_t,j}(w*))∥_2²,
 v_2 = (η² − 1) ∥(γ/(M p(i_t))) ∇g_{i_t,j}(w*)∥_2².

The inequality ∥a + b∥² ≤ 2∥a∥² + 2∥b∥² yields that

 v≤2v1+2v2.

By direct computation, we can show that

 x2−2ux−v≤0,

which implies that

 x≤u+√u2+v.

Thus the inequality (a + b)² ≤ 2a² + 2b² yields

 x² ≤ 2u² + 2(u² + v) = 4u² + 2v ≤ 8(u_1² + u_2²) + 4v_1 + 4v_2.

By taking the conditional expectation on both sides, we obtain

 E_{i_t|I_{t−1}} x² ≤ 8E_{i_t|I_{t−1}} u_1² + 8E_{i_t|I_{t−1}} u_2² + 4E_{i_t|I_{t−1}} v_1 + 4E_{i_t|I_{t−1}} v_2
  ≤ (8(1 − (2γ − γ²α_{3k,j})ρ⁻_{3k,j}) + 4(η² − 1)(1 + γ²α_{3k,j}ρ̄⁺_{3k,j} − 2γρ⁻_{3k,j})) ∥w^t − w*∥_2² + σ_j
  := κ_j ∥w^t − w*∥_2² + σ_j.

Note that all coefficients α_{3k,j}, ρ⁻_{3k,j} and ρ̄⁺_{3k,j} depend on the functions g_{i,j} in (11). Therefore

 E_{I_t}∥w^{t+1} − w*∥_2² ≤ κ_j E_{I_{t−1}}∥w^t − w*∥_2² + σ_j,

which, by induction, implies that

 E∥X^{t+1}_{⋅,j} − X*_{⋅,j}∥_2² ≤ κ_j^{t+1} ∥X⁰_{⋅,j} − X*_{⋅,j}∥_2² + σ_j/(1 − κ_j), j = 1, …, L.

Here the contraction coefficient is

 κ_j = 8(1 − (2γ − γ²α_{3k,j})ρ⁻_{3k,j}) + 4(η² − 1)(1 + γ²α_{3k,j}ρ̄⁺_{3k,j} − 2γρ⁻_{3k,j}), (18)

and the tolerance parameter is

 σ_j = 4γ² min_{1≤i≤M} M²(p(i))² (2E_{i_t}∥P_Ω ∇g_{i_t,j}(w*)∥_2² + (η² − 1)E_{i_t}∥∇g_{i_t,j}(w*)∥_2²). (19)

In particular, if γ = 1, η = 1 and p(i) = 1/M, then

 κ_j = 8(1 − 2ρ⁻_{3k,j} + α_{3k,j}ρ⁻_{3k,j}), σ_j = (8/M) Σ_{i=1}^M ∥P_Ω ∇g_{i,j}(X*_{⋅,j})∥_2².
Theorem 5.2

Let X* be a feasible solution of (10) and X⁰ be the initial solution. Under Assumption 1, at the (t+1)-th iteration of Algorithm 3, the expectation of the recovery error is bounded by

 E∥X^{t+1} − X*∥_F ≤ κ̂^{t+1} ∥X⁰ − X*∥_F + δ, (20)

where X^{t+1} is the approximation of X* at the (t+1)-th iteration of Algorithm 3 with the initial guess X⁰. Here the contraction coefficient κ̂ and the tolerance parameter δ are defined as

 κ̂ = √(max_{1≤j≤L} κ_j), δ = √(Σ_{j=1}^L σ_j / (1 − max_{1≤j≤L} κ_j)),

where κ_j is the contraction coefficient of StoIHT for the j-th column, as defined in (18).

For each