## 1 Introduction

Reconstruction of sparse signals from limited measurements has been studied extensively with a variety of applications in various imaging sciences, machine learning, computer vision and so on. The major problem is to reconstruct a signal which is sparse by itself or in some transformed domain from a few measurements (or observations) acquired by a certain sensing machine. Let

be the signal to be reconstructed. Then the sparse signal reconstruction problem can be formulated as an constrained minimization problem(1) |

where the sparsity counts nonzeros in . Here

is a loss function measuring the discrepancy between the acquired measurements

() and the measurements predicted by the estimated solution. In particular, if the measurements are linearly related to the underlying signal, i.e., there exists a sensing matrix

such that where is the Gaussian noise, then the least squares loss function is widely used:In this case, (1) is a single measurement vector (SMV) sparse signal reconstruction problem. The choice of depends on the generation mechanism of the measurements. Since the measurements are typically generated continuously in most imaging techniques, it becomes significantly important in practice to reconstruct a collection of sparse signals, expressed as a signal matrix, from multiple measurement vectors (MMV). More precisely, the signal matrix with () nonzero rows can be obtained by solving the following MMV model

(2) |

where stands for the row-sparsity of which counts nonzero rows in . Note that it is possible that certain columns of have more zero components than zero rows of . The MMV sparse reconstruction problem was first introduced in magnetoencephalography (MEG) imaging cotter2005sparse ; rao2004diversity , and has been extended to other applications he2008cg ; bazerque2010distributed ; majumdar2011joint ; majumdar2012face ; majumdar2013rank ; davies2012rank ; li2017atomic .

Many SMV algorithms can be applied to solve MMV problems. The most straightforward way is to use SMV algorithms to reconstruct each signal vector sequentially or simultaneously via parallel computing, and then concatenate all resultant signals to form the estimated signal matrix. We call these types of SMV algorithms, *concatenated SMV* algorithms. On the other hand, the MMV problem can be converted to an SMV one by columnwise stacking the unknown signal matrix as a vector and introducing a block diagonal matrix as the new sensing matrix . However, both approaches do not fully take advantage of the joint sparse structure of the underlying signal matrix, and lack computational efficiency as well. In this paper, we develop MMV algorithms without concatenation of the SMV results or vectorization of the unknown signal matrix.

Since the term in (1) and (2) is non-convex and non-differentiable, many classical convex optimization algorithms fail to produce a satisfactory solution. To handle the NP-hardness of the problem, many convex relaxation methods and their MMV extensions have been developed, e.g., the -regularized M-FOCUSS cotter2005sparse , and the -regularized MMV extensions of the alternating direction method of multipliers lu2011fast ; Jian2015split . By exploiting the relationship between the measurements and the correct atoms, multiple signal classification (MUSIC) feng1996spectrum and its improved variants kim2012compressive ; lee2012subspace have been developed. However, in the rank defective cases when the rank of the measurement matrix is much smaller than the desired row-sparsity level, the MUSIC type of methods will mostly fail to identify the correct atoms. The third category of algorithms for solving the constrained problem is the class of greedy algorithms that seek the sparsest solution by updating the support iteratively, including Orthogonal Matching Pursuit (OMP) tropp2004greed , simultaneous OMP (S-OMP) chen2006theoretical ; tropp2006algorithms , Compressive Sampling Matching Pursuit (CoSaMP) needell2009cosamp , Regularized OMP (ROMP) needell2010signal , Subspace-Pursuit (SP) dai2009subspace , and Iterative Hard-Thresholding (IHT) blumensath2009iterative . It has been shown that CoSaMP and IHT are more efficient than the convex relaxation methods with strong recovery guarantees. However, most of these algorithms work for compressive sensing applications where is a least squares loss function. Recently, the Gradient Matching Pursuit (GradMP) nguyenunified has been proposed to extend CoSaMP to handle more general loss functions. To further improve the computational efficiency and consider the non-convex objective function case, Stochastic IHT (StoIHT) and Stochastic GradMP (StoGradMP) have been proposed nguyen2014linear . Nevertheless, the aforementioned greedy algorithms are designed for solving the SMV problem and the concatenated extension to the MMV versions will result in limited performance especially for large data sets. In this paper, we propose the MMV Stochastic IHT (MStoIHT) and the MMV Stochastic GradMP (MStoGradMP) methods for solving the general MMV joint sparse recovery problem (2). To accelerate convergence, the mini-batching technique is applied to the proposed algorithms. We also show that the proposed algorithms converge faster than their SMV concatenated counterparts under certain conditions. A large variety of numerical experiments on joint sparse matrix recovery and video sequence recovery have demonstrated the superior performance of the proposed algorithms over their SMV counterparts in terms of running time and accuracy.

Organization. The rest of the paper is organized as follows. Preliminary knowledge and notation clarifications are provided in Section 2. Section 3 presents the concatenated SMV algorithms, and the proposed stochastic greedy algorithms, i.e., MStoIHT and MStoGradMP, in detail. Section 4 discusses how to apply the mini-batching technique to accelerate the proposed algorithms. Convergence analysis is provided in Section 5. By choosing the widely used least squares loss function as , joint sparse signal recovery in distributed compressive sensing is discussed in Section 6. Extensive numerical results are shown in Section 7. Finally, some concluding remarks are made in Section 8.

## 2 Preliminaries

To make the paper self-contained, we first introduce some useful notation and definitions, and then briefly describe the related algorithms, i.e., StoIHT and StoGradMP. Let and be the number of elements in the set . Consider a finite atom set (a.k.a. the dictionary) with each atom .

### 2.1 Vector Sparsity

Assume that a vector can be written as a linear combination of ’s, i.e., with

Then the support of with respect to and is defined by

The -norm of with respect to is defined as the minimal support size

Since absolute homogeneity does not hold in general, i.e., holds if and only if , this -norm is not a norm. Here the smallest support is called the support of with respect to , denoted by . Thus

Note that the support may not be unique if is over-complete in that there could be multiple representations of with respect to the atom set due to the linear dependence of the atoms in . The vector is called -sparse with respect to if

### 2.2 Matrix Sparsity

By extending vector sparsity, we define the row sparsity for a matrix as follows

where is the -th column of . Here the minimal common support is called the (row-wise) *joint support* of with respect to , denoted by , which satisfies

The matrix is called -row sparse with respect to if all columns of share a joint support of size at most with respect to , i.e.,

### 2.3 Functions Defined on A Matrix Space

Given a function , the matrix derivative is defined by concatenating gradients magnus2010concept

(3) |

where is the -th entry of . Notice that

where is the -th row vector of , and is the trace operator to add up all the diagonal entries of a matrix. The inner product for any two matrices is defined as

Note that the equality

(4) |

and the Cauchy-Schwartz inequality

(5) |

still hold. By generalizing the concepts in nguyen2014linear , we define the -restricted strong convexity property and the strong smoothness property (a.k.a. the Lipschitz condition on the gradient) for the functions defined on a matrix space.

###### Definition 1

The function satisfies the -restricted strong convexity (-RSC) if there exists such that

(6) |

for all matrices with .

###### Definition 2

The function satisfies the -restricted strong smoothness (-RSS) if there exists such that

(7) |

for all matrices with .

### 2.4 Related Work

StoIHT (see Algorithm 1) and StoGradMP (see Algorithm 2) have been proposed to solve the constrained SMV problem nguyen2014linear

(8) |

At each iteration of StoIHT, one component function

is first randomly selected with probability

. Here the input discrete probability distribution

’s satisfyNext in the “Proxy” step, gradient descent along the selected component is performed. Then the last two steps, i.e., “Identify” and “Estimate”, essentially project the gradient descent result to its best -sparse approximation. Given and , the best -sparse approximation operator acted on and , denoted by , constructs an index set with such that

Here is the orthogonal projection of onto the subspace in spanned by the atoms with indices in , and is the best -sparse approximation of in the subspace . In particular, if and with , then returns the index set of the first largest entries of in absolute value, i.e.,

Then the projection reads as in componentwise form

There are two widely used stopping criteria:

where is a small tolerance. It is well known that the first stopping criteria is more robust in practice.

Different from StoIHT, StoGradMP involves the gradient matching process, i.e., to find the best -sparse approximation of the gradient rather than the estimated solution. At the solution estimation step, the original problem is restricted to the components from the estimated support. It has been empirically shown that StoGradMP converges faster than StoIHT due to the more accurate estimation of the support. But StoGradMP requires that the sparsity level is no more than .

## 3 Proposed Stochastic Greedy Algorithms

In this section, we present concatenated SMV algorithms, and develop stochastic greedy algorithms for MMV problems based on StoIHT and StoGradMP. Suppose that there are differentiable and convex functions that satisfy the -restricted strong smoothness property (see Definition 2), and their mean

(9) |

satisfies the -restricted strong convexity property (see Definition 1). These assumptions will be used extensively throughout the entire paper. Consider the following row-sparsity constrained MMV problem

(10) |

By vectorizing , i.e., rewriting as a vector by columnwise stacking, we can *relax* (10) to a sparsity constrained SMV problem of the form (8) where the sparsity level is replaced by . Since does not necessarily guarantee , the solution to the relaxed problem may not be the same as the vectorization of the solution to (10). On the other hand, the iterative stochastic algorithms such as StoIHT and StoGradMP, can be developed to the *concatenated* versions for solving (10) under the following assumption on the objective function ’s.

Assumption 1. The objective function in (10) is *separable*, in the sense that a collection of functions exist with

(11) |

Under this assumption, the concatenated algorithms, i.e., CStoIHT in Algorithm 3 and CStoGradMP in Algorithm 4, can be applied to solve (10), which essentially reconstruct each column of by solving the SMV problem (8). Notice that the outer loops of CStoIHT and CStoGradMP can be executed in a parallel manner on a multi-core computer, when the order of the inner loop and the outer loop in Algorithm 3 can be swapped. However, if the sparsity level is very large, then the support sets of ’s are prone to overlap less initially which results in the less accurate estimation of the joint support and larger errors in the initial iterates. In addition, for some nonlinear function which can not be separated as a sum of functions for columns of , it will be challenging to find an appropriate objective function for the corresponding SMV problem.

To circumvent the aforementioned issues, we first propose the MMV Stochastic Iterative Hard Thresholding algorithm (MStoIHT) detailed in Algorithm 5. Compared to StoIHT, MStoIHT replaces the gradient by the matrix derivative (3). The second significant difference lies in the “Identify” and “Estimate” steps, especially the operator . Now we extend the operator from sparse vectors to row-sparse matrices. Given and , the best -row sparse approximation operator acted on and , denoted by , constructs a *row* index set such that

where is the best -sparse approximation of the column vector with respect to . In particular, if and , returns the row index set of the first largest row norms in , i.e.,

By abusing the notation, we define to be the projection of onto the subspace of all row-sparse matrices with row indices restricted to . Therefore, we have

Due to the common support , the projection can be written as in row-wise form

Here returns a -row sparse matrix, whose nonzero rows correspond to the rows of with largest row norms.

## 4 Batched Acceleration

To accelerate computations and improve performance, we apply the mini-batching technique to obtain batched variants of Algorithms 5 and 6. We first partition the index set into a collection of equal-sized batches where the batch size for all . For simplicity, we assume that is an integer. Similar to the approach in needell2016batched , we reformulate (9) as

(12) |

Based on this new formulation, we get the batched version of Algorithm 5, which is termed as BMStoIHT, described in Algorithm 7. Here the input probability satisfies

Likewise, we get a batched version of MStoGradMP, termed as BMStoGradMP, which is detailed in Algorithm 8. It is empirically shown in Section 7 that the increase of the batch size greatly speeds up the convergence of both algorithms, which is more obvious in BMStoIHT. As a by-product, mini-batching can also improve the recovery accuracy. However, there is a trade-off between the batch size and the performance improvement needell2016batched .

## 5 Convergence Analysis

In this section, we provide the convergence guarantees for the proposed MStoIHT and MStoGradMP, together with their SMV counterparts, i.e., CStoIHT and CStoGradMP. To simplify the discussion, the result at the -th iteration of CStoIHT/CStoGradMP refers to the result obtained after inner iterations and outer iterations of Algorithm 3/Algorithm 4, or equivalently the maximum number of inner iterations is set as . Furthermore, all convergence results can be extended to their batched versions, i.e., BStoIHT and BMStoGradMP. Comparison of contraction coefficients shows that the proposed algorithms have faster convergence under the -RSC, -RSS and separability of the objective function (see Assumption 1).

By replacing the -norm and vector inner product with the Frobenius norm and the matrix inner product respectively and using (4) and (5), we get similar convergence results for MStoIHT and MStoGradMP as in nguyen2014linear :

###### Theorem 5.1

To solve the problem (10), CStoIHT uses StoIHT to restore each column of separately and then concatenate all column vectors to form a matrix. To analyze the convergence of CStoIHT, we first derive an upper bound for following the proof in (nguyen2014linear, , Section 8.2). Note that we look for the contraction coefficient of rather than .

###### Lemma 1

Let be a feasible solution of (10) and be the initial solution. Under Assumption 1, there exist such that the expectation of the recovery error squared at the -th iteration of Algorithm 3 for estimating the -th column of is bounded by

(16) |

where is the approximation of at the -th iteration of StoIHT with the initial guess , i.e., the result at the -th inner iteration and -th outer iteration of Algorithm 3 with the initial guess . Here is the set of all indices randomly selected at or before the -th step of the algorithm.

###### Proof

Due to the separable form of in Assumption 1, we consider problems of the form

(17) |

where are given in (11). This relaxation is also valid since is also a feasible solution of (17) if is a feasible solution of (10). Let , , ,

and

Let

The inequality yields that

By the direct computation, we can show that

which implies that

Thus the inequality yields

By taking the conditional expectation on both sides, we obtain

Note that all coefficients , and depend on the function in (11). Therefore

which implies that

Here the contraction coefficient is

(18) |

and the tolerance parameter is

(19) |

In particular, if and , then

###### Theorem 5.2

Let be a feasible solution of (10) and be the initial solution. Under Assumption 1, at the -th iteration of Algorithm 3, the expectation of the recovery error is bounded by

(20) |

where is the approximation of at the -th iteration of Algorithm 3 with the initial guess . Here the contraction coefficient and the tolerance parameter are defined as

where is the contraction coefficient for each StoIHT defined in (18).

###### Proof

For each

Comments

There are no comments yet.