Reconstruction of sparse signals from limited measurements has been studied extensively with a variety of applications in various imaging sciences, machine learning, computer vision and so on. The major problem is to reconstruct a signal which is sparse by itself or in some transformed domain from a few measurements (or observations) acquired by a certain sensing machine. Letbe the signal to be reconstructed. Then the sparse signal reconstruction problem can be formulated as an constrained minimization problem
where the sparsity counts nonzeros in . Here
is a loss function measuring the discrepancy between the acquired measurements(
) and the measurements predicted by the estimated solution. In particular, if the measurements are linearly related to the underlying signal, i.e., there exists a sensing matrixsuch that where is the Gaussian noise, then the least squares loss function is widely used:
In this case, (1) is a single measurement vector (SMV) sparse signal reconstruction problem. The choice of depends on the generation mechanism of the measurements. Since the measurements are typically generated continuously in most imaging techniques, it becomes significantly important in practice to reconstruct a collection of sparse signals, expressed as a signal matrix, from multiple measurement vectors (MMV). More precisely, the signal matrix with () nonzero rows can be obtained by solving the following MMV model
where stands for the row-sparsity of which counts nonzero rows in . Note that it is possible that certain columns of have more zero components than zero rows of . The MMV sparse reconstruction problem was first introduced in magnetoencephalography (MEG) imaging cotter2005sparse ; rao2004diversity , and has been extended to other applications he2008cg ; bazerque2010distributed ; majumdar2011joint ; majumdar2012face ; majumdar2013rank ; davies2012rank ; li2017atomic .
Many SMV algorithms can be applied to solve MMV problems. The most straightforward way is to use SMV algorithms to reconstruct each signal vector sequentially or simultaneously via parallel computing, and then concatenate all resultant signals to form the estimated signal matrix. We call these types of SMV algorithms, concatenated SMV algorithms. On the other hand, the MMV problem can be converted to an SMV one by columnwise stacking the unknown signal matrix as a vector and introducing a block diagonal matrix as the new sensing matrix . However, both approaches do not fully take advantage of the joint sparse structure of the underlying signal matrix, and lack computational efficiency as well. In this paper, we develop MMV algorithms without concatenation of the SMV results or vectorization of the unknown signal matrix.
Since the term in (1) and (2) is non-convex and non-differentiable, many classical convex optimization algorithms fail to produce a satisfactory solution. To handle the NP-hardness of the problem, many convex relaxation methods and their MMV extensions have been developed, e.g., the -regularized M-FOCUSS cotter2005sparse , and the -regularized MMV extensions of the alternating direction method of multipliers lu2011fast ; Jian2015split . By exploiting the relationship between the measurements and the correct atoms, multiple signal classification (MUSIC) feng1996spectrum and its improved variants kim2012compressive ; lee2012subspace have been developed. However, in the rank defective cases when the rank of the measurement matrix is much smaller than the desired row-sparsity level, the MUSIC type of methods will mostly fail to identify the correct atoms. The third category of algorithms for solving the constrained problem is the class of greedy algorithms that seek the sparsest solution by updating the support iteratively, including Orthogonal Matching Pursuit (OMP) tropp2004greed , simultaneous OMP (S-OMP) chen2006theoretical ; tropp2006algorithms , Compressive Sampling Matching Pursuit (CoSaMP) needell2009cosamp , Regularized OMP (ROMP) needell2010signal , Subspace-Pursuit (SP) dai2009subspace , and Iterative Hard-Thresholding (IHT) blumensath2009iterative . It has been shown that CoSaMP and IHT are more efficient than the convex relaxation methods with strong recovery guarantees. However, most of these algorithms work for compressive sensing applications where is a least squares loss function. Recently, the Gradient Matching Pursuit (GradMP) nguyenunified has been proposed to extend CoSaMP to handle more general loss functions. To further improve the computational efficiency and consider the non-convex objective function case, Stochastic IHT (StoIHT) and Stochastic GradMP (StoGradMP) have been proposed nguyen2014linear . Nevertheless, the aforementioned greedy algorithms are designed for solving the SMV problem and the concatenated extension to the MMV versions will result in limited performance especially for large data sets. In this paper, we propose the MMV Stochastic IHT (MStoIHT) and the MMV Stochastic GradMP (MStoGradMP) methods for solving the general MMV joint sparse recovery problem (2). To accelerate convergence, the mini-batching technique is applied to the proposed algorithms. We also show that the proposed algorithms converge faster than their SMV concatenated counterparts under certain conditions. A large variety of numerical experiments on joint sparse matrix recovery and video sequence recovery have demonstrated the superior performance of the proposed algorithms over their SMV counterparts in terms of running time and accuracy.
Organization. The rest of the paper is organized as follows. Preliminary knowledge and notation clarifications are provided in Section 2. Section 3 presents the concatenated SMV algorithms, and the proposed stochastic greedy algorithms, i.e., MStoIHT and MStoGradMP, in detail. Section 4 discusses how to apply the mini-batching technique to accelerate the proposed algorithms. Convergence analysis is provided in Section 5. By choosing the widely used least squares loss function as , joint sparse signal recovery in distributed compressive sensing is discussed in Section 6. Extensive numerical results are shown in Section 7. Finally, some concluding remarks are made in Section 8.
To make the paper self-contained, we first introduce some useful notation and definitions, and then briefly describe the related algorithms, i.e., StoIHT and StoGradMP. Let and be the number of elements in the set . Consider a finite atom set (a.k.a. the dictionary) with each atom .
2.1 Vector Sparsity
Assume that a vector can be written as a linear combination of ’s, i.e., with
Then the support of with respect to and is defined by
The -norm of with respect to is defined as the minimal support size
Since absolute homogeneity does not hold in general, i.e., holds if and only if , this -norm is not a norm. Here the smallest support is called the support of with respect to , denoted by . Thus
Note that the support may not be unique if is over-complete in that there could be multiple representations of with respect to the atom set due to the linear dependence of the atoms in . The vector is called -sparse with respect to if
2.2 Matrix Sparsity
By extending vector sparsity, we define the row sparsity for a matrix as follows
where is the -th column of . Here the minimal common support is called the (row-wise) joint support of with respect to , denoted by , which satisfies
The matrix is called -row sparse with respect to if all columns of share a joint support of size at most with respect to , i.e.,
2.3 Functions Defined on A Matrix Space
Given a function , the matrix derivative is defined by concatenating gradients magnus2010concept
where is the -th entry of . Notice that
where is the -th row vector of , and is the trace operator to add up all the diagonal entries of a matrix. The inner product for any two matrices is defined as
Note that the equality
and the Cauchy-Schwartz inequality
still hold. By generalizing the concepts in nguyen2014linear , we define the -restricted strong convexity property and the strong smoothness property (a.k.a. the Lipschitz condition on the gradient) for the functions defined on a matrix space.
The function satisfies the -restricted strong convexity (-RSC) if there exists such that
for all matrices with .
The function satisfies the -restricted strong smoothness (-RSS) if there exists such that
for all matrices with .
2.4 Related Work
At each iteration of StoIHT, one component function
is first randomly selected with probability
. Here the input discrete probability distribution’s satisfy
Next in the “Proxy” step, gradient descent along the selected component is performed. Then the last two steps, i.e., “Identify” and “Estimate”, essentially project the gradient descent result to its best -sparse approximation. Given and , the best -sparse approximation operator acted on and , denoted by , constructs an index set with such that
Here is the orthogonal projection of onto the subspace in spanned by the atoms with indices in , and is the best -sparse approximation of in the subspace . In particular, if and with , then returns the index set of the first largest entries of in absolute value, i.e.,
Then the projection reads as in componentwise form
There are two widely used stopping criteria:
where is a small tolerance. It is well known that the first stopping criteria is more robust in practice.
Different from StoIHT, StoGradMP involves the gradient matching process, i.e., to find the best -sparse approximation of the gradient rather than the estimated solution. At the solution estimation step, the original problem is restricted to the components from the estimated support. It has been empirically shown that StoGradMP converges faster than StoIHT due to the more accurate estimation of the support. But StoGradMP requires that the sparsity level is no more than .
3 Proposed Stochastic Greedy Algorithms
In this section, we present concatenated SMV algorithms, and develop stochastic greedy algorithms for MMV problems based on StoIHT and StoGradMP. Suppose that there are differentiable and convex functions that satisfy the -restricted strong smoothness property (see Definition 2), and their mean
satisfies the -restricted strong convexity property (see Definition 1). These assumptions will be used extensively throughout the entire paper. Consider the following row-sparsity constrained MMV problem
By vectorizing , i.e., rewriting as a vector by columnwise stacking, we can relax (10) to a sparsity constrained SMV problem of the form (8) where the sparsity level is replaced by . Since does not necessarily guarantee , the solution to the relaxed problem may not be the same as the vectorization of the solution to (10). On the other hand, the iterative stochastic algorithms such as StoIHT and StoGradMP, can be developed to the concatenated versions for solving (10) under the following assumption on the objective function ’s.
Assumption 1. The objective function in (10) is separable, in the sense that a collection of functions exist with
Under this assumption, the concatenated algorithms, i.e., CStoIHT in Algorithm 3 and CStoGradMP in Algorithm 4, can be applied to solve (10), which essentially reconstruct each column of by solving the SMV problem (8). Notice that the outer loops of CStoIHT and CStoGradMP can be executed in a parallel manner on a multi-core computer, when the order of the inner loop and the outer loop in Algorithm 3 can be swapped. However, if the sparsity level is very large, then the support sets of ’s are prone to overlap less initially which results in the less accurate estimation of the joint support and larger errors in the initial iterates. In addition, for some nonlinear function which can not be separated as a sum of functions for columns of , it will be challenging to find an appropriate objective function for the corresponding SMV problem.
To circumvent the aforementioned issues, we first propose the MMV Stochastic Iterative Hard Thresholding algorithm (MStoIHT) detailed in Algorithm 5. Compared to StoIHT, MStoIHT replaces the gradient by the matrix derivative (3). The second significant difference lies in the “Identify” and “Estimate” steps, especially the operator . Now we extend the operator from sparse vectors to row-sparse matrices. Given and , the best -row sparse approximation operator acted on and , denoted by , constructs a row index set such that
where is the best -sparse approximation of the column vector with respect to . In particular, if and , returns the row index set of the first largest row norms in , i.e.,
By abusing the notation, we define to be the projection of onto the subspace of all row-sparse matrices with row indices restricted to . Therefore, we have
Due to the common support , the projection can be written as in row-wise form
Here returns a -row sparse matrix, whose nonzero rows correspond to the rows of with largest row norms.
4 Batched Acceleration
To accelerate computations and improve performance, we apply the mini-batching technique to obtain batched variants of Algorithms 5 and 6. We first partition the index set into a collection of equal-sized batches where the batch size for all . For simplicity, we assume that is an integer. Similar to the approach in needell2016batched , we reformulate (9) as
Likewise, we get a batched version of MStoGradMP, termed as BMStoGradMP, which is detailed in Algorithm 8. It is empirically shown in Section 7 that the increase of the batch size greatly speeds up the convergence of both algorithms, which is more obvious in BMStoIHT. As a by-product, mini-batching can also improve the recovery accuracy. However, there is a trade-off between the batch size and the performance improvement needell2016batched .
5 Convergence Analysis
In this section, we provide the convergence guarantees for the proposed MStoIHT and MStoGradMP, together with their SMV counterparts, i.e., CStoIHT and CStoGradMP. To simplify the discussion, the result at the -th iteration of CStoIHT/CStoGradMP refers to the result obtained after inner iterations and outer iterations of Algorithm 3/Algorithm 4, or equivalently the maximum number of inner iterations is set as . Furthermore, all convergence results can be extended to their batched versions, i.e., BStoIHT and BMStoGradMP. Comparison of contraction coefficients shows that the proposed algorithms have faster convergence under the -RSC, -RSS and separability of the objective function (see Assumption 1).
By replacing the -norm and vector inner product with the Frobenius norm and the matrix inner product respectively and using (4) and (5), we get similar convergence results for MStoIHT and MStoGradMP as in nguyen2014linear :
To solve the problem (10), CStoIHT uses StoIHT to restore each column of separately and then concatenate all column vectors to form a matrix. To analyze the convergence of CStoIHT, we first derive an upper bound for following the proof in (nguyen2014linear, , Section 8.2). Note that we look for the contraction coefficient of rather than .
Let be a feasible solution of (10) and be the initial solution. Under Assumption 1, there exist such that the expectation of the recovery error squared at the -th iteration of Algorithm 3 for estimating the -th column of is bounded by
where is the approximation of at the -th iteration of StoIHT with the initial guess , i.e., the result at the -th inner iteration and -th outer iteration of Algorithm 3 with the initial guess . Here is the set of all indices randomly selected at or before the -th step of the algorithm.
Due to the separable form of in Assumption 1, we consider problems of the form
The inequality yields that
By the direct computation, we can show that
which implies that
Thus the inequality yields
By taking the conditional expectation on both sides, we obtain
Note that all coefficients , and depend on the function in (11). Therefore
which implies that
Here the contraction coefficient is
and the tolerance parameter is
In particular, if and , then
where is the approximation of at the -th iteration of Algorithm 3 with the initial guess . Here the contraction coefficient and the tolerance parameter are defined as
where is the contraction coefficient for each StoIHT defined in (18).