I Introduction
Sparsity techniques have become increasingly popular in machine learning, statistics, medical imaging and computer vision with the emergence of compressive sensing. Compressive sensing theory
[1, 2] shows that a small number of measurements is enough to recover the original data, which provides an alternative to the Shannon/Nyquist sampling theorem for the acquisition of sparse or compressible data.
I-A Standard Sparsity and Algorithms
Suppose Φ ∈ R^{m×n} is the sampling matrix and y ∈ R^m is the measurement vector; the problem is to recover the sparse data x from the underdetermined linear system y = Φx. Sometimes the data is not sparse itself but compressible under some basis such as wavelets, and the corresponding problem is then posed on the vector of wavelet coefficients. Although the problem is underdetermined, the data can be perfectly reconstructed if the sampling matrix satisfies the restricted isometry property (RIP) [3] and the number of measurements is on the order of k log(n/k) for k-sparse data (we say the data is k-sparse if it has at most k nonzero components) [4, 5].
To solve this underdetermined problem, we would ideally find the sparsest solution via ℓ0-norm regularization. However, because the ℓ0 problem is NP-hard [6] and thus impractical for most applications, ℓ1-norm regularization methods such as the lasso [7] and basis pursuit (BP) [8] were first used to pursue the sparse solution. It has been proved that ℓ1-norm regularization exactly recovers the sparse data in the CS inverse problem under mild conditions [9, 1]. Many efficient algorithms have therefore been proposed for standard sparse recovery. Generally speaking, these algorithms can be classified into three groups: greedy algorithms [10, 11], convex programming [12, 13, 14] and probability-based methods [15, 16].
I-B Joint Sparsity and Algorithms
Beyond standard sparsity, the nonzero components of the data often tend to form certain structures. This leads to the concept of structured sparsity or model-based compressive sensing [17, 18, 19]. In contrast to standard sparsity, which relies only on the sparseness of the data, structured sparsity models exploit both the nonzero values and their locations. For example, in the multiple measurement vector (MMV) problem, the data consists of several vectors that share the same support (the set of indices corresponding to the nonzero entries is often called the support). This is called joint sparsity, and it arises widely in cognitive radio networks [20], direction-of-arrival estimation in radar [21], multichannel compressive sensing [22, 23] and medical imaging [24, 25]. If the data consists of T jointly sparse vectors, the measurement bound can be substantially reduced compared with recovering each vector independently under standard sparsity [26, 17, 18, 27]. A common way to implement joint sparsity in convex programming is to replace the ℓ1 norm with the ℓ2,1 norm, which is the summation of the ℓ2 norms of the correlated entries [28, 29]. The ℓ2,1 norm for joint sparsity has been used in many convex solvers and algorithms [30, 25, 31, 32]. In Bayesian sparse learning and approximate message passing [33, 34, 35], data from all channels contribute to the estimation of the parameters or hidden variables in the sparse prior model.
I-C Tree Sparsity and Algorithms
Another common structure is the hierarchical tree structure, which has been successfully utilized in image compression [36], compressed imaging [37, 38, 39, 40] and machine learning [41]. Most natural signals/images are approximately tree-sparse under the wavelet basis. A typical property of tree sparsity is that, if a node on the tree is nonzero, all of its ancestors leading to the root should be nonzero as well. For multichannel data with T tree-sparse channels recovered independently, O(Tk) measurements are required.
Due to the overlapping and intricate structure of tree sparsity, it is much harder to implement. Among greedy algorithms, StructOMP [17] and TOMP [42] exploit the tree structure by updating the coefficients over subtree blocks only, instead of the whole subspace. In statistical models [37, 38], hierarchical inference is used to model the tree structure, where the value of a node is not independent but relies on the distribution or state of its parent. In convex programming [39, 43], due to the trade-off between recovery accuracy and computational complexity, tree sparsity is often approximated as overlapping group sparsity [44], where each node and its parent are assigned to one group.
I-D Forest Sparsity
Although both joint sparsity and tree sparsity have been widely studied, no work so far has studied the benefit of combining them. Yet in many multichannel compressive sensing or MMV problems, the data exhibits joint sparsity across channels while each channel itself is tree-sparse. Note that this differs from C-HiLasso [45], where sparsity is assumed inside the groups. No method has fully exploited both priors, and no theory guarantees the performance. In practical applications, researchers and engineers have had to choose either joint sparsity algorithms, giving up the intra-channel tree-sparse prior, or tree sparsity algorithms, ignoring the inter-channel correlations.
In this paper, we propose a new sparsity model called forest sparsity to bridge this gap. It is a natural extension of existing structured sparsity models, assuming that the data can be represented by a forest of mutually connected trees. We give the mathematical definition of forest sparsity. Based on compressive sensing theory, we prove a measurement bound for a forest of T k-sparse trees that guarantees successful recovery with high probability and is much lower than the bounds of joint sparsity and tree sparsity on the same data. The theory is further extended to block-diagonal measurement matrices in MMV problems, a case ignored by existing structured sparsity theories [17, 18, 19]. Finally, we derive an efficient algorithm to optimize the forest sparsity model. The proposed algorithm is applied to medical imaging applications such as multi-contrast magnetic resonance imaging (MRI) and parallel MRI (pMRI), as well as to color image and multispectral image reconstruction. Extensive experiments demonstrate the advantages of forest sparsity over state-of-the-art methods in these applications.
I-E Paper Organization
The remainder of the paper is organized as follows. Existing work on standard sparsity, joint sparsity and tree sparsity is reviewed in Section II. We propose forest sparsity and analyze its benefit in Section III. An algorithm is presented in Section IV. In Section V we conduct experiments on four applications, comparing against standard sparsity, joint sparsity and tree sparsity algorithms. Finally, conclusions are drawn in Section VI.
II Background and Related Work
In compressive sensing (CS), the capture of a sparse signal and its compression are integrated into a single process [2, 3]. We do not capture the sparse data x directly, but rather capture m linear measurements y = Φx based on a measurement matrix Φ ∈ R^{m×n}. To stably recover the sparse data from the measurements, Φ is required to satisfy the restricted isometry property (RIP) [3]. Let A_K denote the union of K-dimensional subspaces in which x lives.
Definition 1: (RIP) An m×n matrix Φ has the K-restricted isometry property with restricted isometry constant δ_K if, for all x ∈ A_K,
(1 − δ_K) ||x||_2^2 ≤ ||Φx||_2^2 ≤ (1 + δ_K) ||x||_2^2.   (1)
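The RIP can be illustrated numerically: a properly scaled Gaussian random matrix approximately preserves the energy of sparse vectors. The sketch below is an illustration only; the dimensions and sparsity level are arbitrary choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 128, 512, 5                                  # measurements, ambient dim, sparsity
Phi = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))   # sub-Gaussian (Gaussian) matrix

ratios = []
for _ in range(200):
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)     # random k-sparse vector
    x[support] = rng.normal(size=k)
    ratios.append(np.linalg.norm(Phi @ x) ** 2 / np.linalg.norm(x) ** 2)

# Empirically all ratios cluster around 1, i.e. (1 - delta) <= ratio <= (1 + delta)
print(min(ratios), max(ratios))
```

Each ratio concentrates near 1 with deviations on the order of sqrt(1/m), which is the behavior the RIP formalizes uniformly over all sparse vectors.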
CS results show that if m = O(K log(n/K)), a sub-Gaussian random matrix (this class includes Gaussian and Bernoulli random matrices, etc. [46]) satisfies the RIP with high probability [47, 48]. Recently, structured sparsity theories have demonstrated that when there is structured prior information (e.g., group, tree, graph) about x, the measurement bound can be reduced [18, 17]. Suppose x lies in a union of subspaces A; then the RIP can be extended to the A-RIP [48]:
Definition 2: (A-RIP) An m×n matrix Φ has the A-restricted isometry property with restricted isometry constant δ_A if, for all x ∈ A,
(1 − δ_A) ||x||_2^2 ≤ ||Φx||_2^2 ≤ (1 + δ_A) ||x||_2^2.   (2)
The A-RIP has been proved to be sufficient for robust recovery of structured-sparse signals under noisy conditions [18]. The required number of measurements has been quantified for a sub-Gaussian random matrix satisfying the A-RIP [48]:
Theorem 1: (A-RIP) Let A be the union of L subspaces of dimension K in R^n. For any t > 0, if the number of measurements satisfies
m ≥ (2 / (c δ_A)) ( ln(2L) + K ln(12/δ_A) + t ),   (3)
then there exists a constant c > 0 such that a randomly generated sub-Gaussian matrix satisfies the A-RIP with probability at least 1 − e^{−t}.
From (3), we observe that m can be reduced by decreasing the number of subspaces L. This coincides with the intuition that recovery improves when more prior information is utilized. For standard k-sparse data there is no further constraint to reduce the number of possible subspaces; letting L = C(n, k), the CS result for standard sparsity follows from Theorem 1.
Now we consider structured sparse data. Following [18], if sparse data forms a tree, or can be sparsely represented as a tree under an orthogonal sparse basis (e.g., wavelets), with the nonzero components naturally forming a subtree, then it is called tree-sparse data.
Definition 3: Tree-sparse data in R^n is defined as
T_k = { x ∈ R^n : supp(x) = Ω, |Ω| ≤ k, where Ω forms a connected subtree }.
Here supp(x) denotes the support of x, i.e., the set of indices of its nonzero coefficients, and all coefficients outside Ω are zero. The definition implies that, if an entry of x is in the support, all its ancestors on the tree must be in the support as well.
For tree-sparse data, we say it has the tree sparsity property. Most natural signals or images have this property, since they can be sparsely represented with a wavelet tree structure. Specifically, the wavelet coefficients of a 1D signal form a binary tree and those of a 2D image form a quadtree. If the union of all such subspaces is denoted by T_k, it is clearly contained in the union of subspaces of standard k-sparsity, and its number of subspaces L is far smaller than C(n, k).
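The ancestor-closure property is easy to check programmatically. A minimal sketch, assuming (purely for illustration; the text does not fix an indexing) a binary tree stored in heap order, where node i has parent (i - 1) // 2 and the root is node 0:

```python
def is_tree_support(support):
    """Check the tree-sparsity condition on a binary tree in heap order:
    every node in the support must have its parent in the support too,
    which by induction puts every ancestor up to the root in the support."""
    s = set(support)
    for i in s:
        if i > 0 and (i - 1) // 2 not in s:
            return False
    return len(s) == 0 or 0 in s

print(is_tree_support({0, 1, 2, 4}))   # True: 4 -> 1 -> 0 are all present
print(is_tree_support({0, 4}))         # False: node 4's parent (node 1) is missing
```

A per-node parent check suffices because, applied to every node, it forces the whole chain of ancestors to be present.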
Theorem 2: For tree-sparse data, there exists a sub-Gaussian random matrix that has the A-RIP with probability at least 1 − e^{−t} if the number of measurements satisfies:
m ≥ c_1 k + c_2 t,   (4)
where c_1 and c_2 denote absolute constants.
So far, we have reviewed standard sparsity and tree sparsity on single-channel data. For multichannel data containing T channels or vectors, each of which is k-sparse in the standard sense, the measurement bound is O(Tk log(n/k)). If each channel is tree-sparse and recovered independently, the measurement bound for a sub-Gaussian random matrix is O(Tk).
It is important to note that the T-channel sparse data has sparsity Tk, not k. Different from the independent channels above, another case is that all channels of the data are highly correlated, which corresponds to joint-sparse data:
Definition 4: Joint-sparse data in R^{Tn} is defined as
J_k = { x = [x_1; x_2; ...; x_T] : supp(x_1) = supp(x_2) = ... = supp(x_T), |supp(x_i)| ≤ k }.
Similar to tree-sparse data, joint-sparse data has the joint sparsity property. It should be clarified that joint sparsity does not rely on tree sparsity: the former utilizes the structure across different channels, while the latter utilizes the structure within each channel. Previous work implies that the minimum measurement bound for such joint-sparse data is O(Tk + k log(n/k)) [26, 17, 18].
III Forest Sparsity
In practical applications it is common that multichannel images, such as color images, multispectral images and MR images, exhibit joint sparsity and tree sparsity simultaneously. This is because: (a) the wavelet coefficients of each channel naturally form a quadtree; and (b) all channels represent the same physical objects (e.g., natural scenes or human organs), so the wavelet coefficients of the different channels tend to be large/small simultaneously due to the shared object boundaries. Therefore, the support of such data consists of several connected trees, like a forest. Fig. 1 shows the forest structure in multi-contrast MR images: the nonzero coefficients are not randomly distributed but form a connected forest. Unfortunately, existing tree-based algorithms can only recover multichannel data channel-by-channel, and it is unknown how to model the tree structure in existing joint sparsity algorithms. In addition, no theoretical results in previous works show how much the recovery can be improved by fully exploiting this prior information.
Motivated by this limitation, we extend previous work to a more specialized but widely occurring case. For multichannel data, if it is jointly sparse and, more importantly, the common support of the different channels forms a subtree, we call it forest-sparse data:
Definition 5: Forest-sparse data in R^{Tn} is defined as
F_k = { x = [x_1; x_2; ...; x_T] : supp(x_1) = ... = supp(x_T), |supp(x_i)| ≤ k, where supp(x_i) forms a connected subtree }.
Similarly, forest-sparse data has the forest sparsity property. This definition implies that, if the coefficients at the same position across different channels are nonzero, all their ancestor coefficients are nonzero as well. Learning with forest sparsity, we search for the sparsest solution that follows the forest structure in the CS inverse problem; any solution that violates this assumption is penalized. Intuitively, the solution will be more accurate. Our main result is the following theorem:
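Definition 5 combines the joint-support check with the subtree-closure check. A sketch, again under the illustrative assumption of a binary tree in heap order (node i has parent (i - 1) // 2), with the channels stored as columns of a matrix:

```python
import numpy as np

def is_forest_sparse(X, tol=1e-12):
    """Check forest sparsity of X (n coefficients x T channels):
    all channels must share one support, and that support must be
    closed under taking parents on a heap-ordered binary tree."""
    supports = [set(np.flatnonzero(np.abs(X[:, t]) > tol))
                for t in range(X.shape[1])]
    if any(s != supports[0] for s in supports):
        return False                      # not jointly sparse
    s = supports[0]
    return all(i == 0 or (i - 1) // 2 in s for i in s)

X = np.zeros((7, 3))
X[[0, 1, 3], :] = 1.0                     # shared support {0, 1, 3}: a rooted subtree
print(is_forest_sparse(X))                # True
X[5, 0] = 1.0                             # breaks both joint support and closure
print(is_forest_sparse(X))                # False
```

The two failure modes are exactly the two priors the model fuses: a mismatched support across channels, or a nonzero node whose parent is zero.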
Theorem 3: For forest-sparse data, there exists a sub-Gaussian random matrix that has the A-RIP with probability at least 1 − e^{−t} if the number of measurements satisfies:
m ≥ c_1 T k + c_2 t,   (5)
where c_1 and c_2 are absolute constants.
For both the binary-tree and quadtree cases, the bound is reduced to O(Tk). The proofs of Theorems 2, 3 and 4 are included in the appendices. Using the A-RIP, forest-sparse data can be robustly recovered from noisy compressive measurements.
Table I lists the measurement bounds for forest-sparse data under the different models. The standard sparsity model only exploits sparseness; no prior information about the locations of the nonzero elements is involved. It is the classical but worst model for forest-sparse data. These location priors are partially utilized in the joint sparsity and tree sparsity models: one only exploits the correlations across channels, while the other only learns the intra-channel structure. Our result is significantly better than those of joint sparsity and tree sparsity, and far better than that of standard sparsity, especially when the data dimension is large. Only the proposed model fully exploits all these structures.
So far, we have compared the forest sparsity result with previous results. In all of them, the measurement matrix is assumed to be a dense sub-Gaussian matrix. However, in many practical problems each data channel is measured by a distinct compressive matrix Φ_i, i = 1, ..., T; such problems are called multiple measurement vector (MMV) problems or multi-task learning (e.g., [22, 49, 50]). Hereafter we assume that the Φ_i follow the same distribution but may differ from each other. The overall matrix is then block-diagonal rather than dense. The linear system can be written as:
[y_1; y_2; ...; y_T] = diag(Φ_1, Φ_2, ..., Φ_T) [x_1; x_2; ...; x_T].   (6)
The non-diagonal blocks of this matrix are all zeros. Intuitively, such block-diagonal matrices perform no better than the dense matrices discussed above, due to their reduced randomness. Unfortunately, the performance of random block-diagonal matrices on structured sparse data has not been analyzed before, as all existing structured sparsity theories concentrate on dense random matrices [17, 18, 19]. In this article, we extend the theoretical result to block-diagonal matrices in MMV problems.
Theorem 4: For forest-sparse data, there exists a block-diagonal matrix composed of sub-Gaussian random matrices as in (6) that has the A-RIP with high probability if the number of measurements satisfies:
(7) 
where the first two quantities characterize how the energy of the data is distributed across the channels, and the remaining three are absolute constants.
In contrast to previous results on dense matrices with i.i.d. sub-Gaussian entries, this bound also depends on the energy of the data. In the best case, when the energy is spread evenly across the channels, the measurement bound shows performance similar to that of the dense sub-Gaussian matrix in Theorem 3. In the worst case, the energy of the data concentrates on one single channel/task, and the measurement bound is then even worse than that in Theorem 2 for independent tree-sparse channels. Even for the same block-diagonal matrix, the analysis makes clear that its performance may vary significantly depending on the data being measured; in the worst case, the measurement bound can increase by a factor of T. However, this increased factor for block-diagonal matrices also applies to standard sparse, joint sparse and tree sparse data, so for the same measurement matrix and the same data the advantage of forest sparsity still holds. For this reason, we do not evaluate the energy-dependent term in the experiments, but focus on comparing different sparsity models on the same data.
IV Algorithm
In this article, the forest structure is approximated as overlapping group sparsity [44] with a mixed ℓ2,1 norm. Although this may not be the best approximation, it is enough to demonstrate the benefit of forest sparsity, and it lets us compare the different models within a similar framework. From the definition of forest-sparse data, if a coefficient is large/small, its parent and its "neighbors" (the parent denotes the parent node in the same channel, while neighbors are the coefficients at the same position in the other channels) also tend to be large/small. All parent-child pairs at the same position across different channels are therefore assigned to one group, and the problem becomes overlapping group sparsity regularization. A similar scheme has been used to approximate tree sparsity [39, 40], where each node and its parent are assigned to one group. We write the approximated problem as:
min_x (1/2) ||Φx − y||_2^2 + λ Σ_g ||x_g||_2,   (8)
where g indexes the coefficient groups discussed above (an example is demonstrated in Fig. 1(c)), x_g denotes the coefficients in group g, and the sum ranges over all groups.
The mixed ℓ2,1 norm encourages all components in the same group to be zero or nonzero simultaneously; with our group configuration, it encourages forest sparsity. We present an efficient implementation based on the fast iterative shrinkage-thresholding algorithm (FISTA) [12] for this problem, because FISTA can easily be applied to standard sparsity and joint sparsity, which makes validating the benefit of the proposed model more convenient. In addition, formulation (8) can easily be extended to include total variation (TV) via the fast composite splitting algorithm (FCSA) scheme [51]. Note that other algorithms could be used to solve the forest sparsity problem, e.g., [32, 44, 52], but determining the optimal algorithm for forest sparsity is beyond the scope of this article.
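The grouping just described can be made concrete. The sketch below builds the overlapping groups for T channels, assuming (for illustration only; neither layout is specified in the text) a binary tree stored in heap order and coefficients interleaved channel-by-channel, so channel t of coefficient i sits at flat index i*T + t:

```python
def forest_groups(n, T):
    """Build the overlapping groups approximating forest sparsity:
    for every non-root node i of a heap-ordered binary tree, one group
    holds the (parent, child) pair replicated across all T channels."""
    groups = []
    for i in range(1, n):                  # every node except the root
        p = (i - 1) // 2                   # parent within the same channel
        groups.append([p * T + t for t in range(T)] +
                      [i * T + t for t in range(T)])
    return groups

groups = forest_groups(n=7, T=2)
print(len(groups))       # 6 groups, one per non-root node
print(groups[0])         # [0, 1, 2, 3]: node 0 and node 1 in both channels
```

Each coefficient appears in several groups (as a child once and as a parent of up to two children), which is exactly the overlap that prevents a direct application of FISTA and motivates the variable-duplication scheme below.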
FISTA [12] is an accelerated proximal method that minimizes an objective function of the form:
min_x F(x) = f(x) + g(x),   (9)
where f is a convex smooth function whose gradient has Lipschitz constant L_f, and g is a convex but usually nonsmooth function. It reduces to the original FISTA when f(x) = (1/2)||Φx − y||_2^2 and g(x) = λ||x||_1, which is summarized in Algorithm 1, where Φ^T denotes the transpose of Φ.
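Algorithm 1 is the standard FISTA iteration; as a reference sketch (problem sizes and data below are arbitrary illustrations, not from the paper's experiments), the ℓ1 version can be written as:

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (component-wise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista_l1(Phi, y, lam, n_iter=200):
    """FISTA for min_x 0.5*||Phi x - y||^2 + lam*||x||_1,
    with step size 1/L, L the Lipschitz constant of the smooth part."""
    L = np.linalg.norm(Phi, 2) ** 2        # squared spectral norm of Phi
    x = np.zeros(Phi.shape[1])
    r, t = x.copy(), 1.0                   # r: extrapolated point, t: momentum
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ r - y)       # gradient of the smooth term at r
        x_new = soft_threshold(r - grad / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        r = x_new + ((t - 1.0) / t_new) * (x_new - x)  # momentum update
        x, t = x_new, t_new
    return x

rng = np.random.default_rng(1)
Phi = rng.normal(size=(40, 100)) / np.sqrt(40)
x0 = np.zeros(100)
x0[[3, 17, 60]] = [1.0, -2.0, 1.5]         # 3-sparse ground truth
x_hat = fista_l1(Phi, Phi @ x0, lam=0.01)
print(np.linalg.norm(x_hat - x0) / np.linalg.norm(x0))  # small relative error
```

Replacing the soft-thresholding step with a row-wise group shrinkage yields the joint-sparsity variant referred to as FISTA_Joint in the text.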
For the second step of Algorithm 1, there is a closed-form solution given by soft thresholding. For the joint sparsity problem, where the penalty is the ℓ2,1 norm, the second step also has a closed-form solution; we call this version FISTA_Joint. However, for problem (8) with overlapping groups, FISTA cannot be applied directly.
In order to transform problem (8) into a non-overlapping version, we introduce a binary matrix G to duplicate the overlapped coefficients. Each row of G contains exactly one 1, with all other entries 0; a 1 in the j-th column corresponds to the j-th coefficient of x. Intuitively, if a coefficient is included in s groups, G contains s such rows. An auxiliary variable z is introduced with the constraint z = Gx. This scheme is widely utilized in the alternating direction method (ADM) [32]. The alternating formulation becomes:
min_{x,z} (1/2) ||Φx − y||_2^2 + λ Σ_g ||z_g||_2 + (ρ/2) ||z − Gx||_2^2,   (10)
where ρ is another positive parameter. We iteratively solve this formulation by minimizing the z and x subproblems alternately. For the z subproblem:
min_z λ Σ_g ||z_g||_2 + (ρ/2) ||z − Gx||_2^2,   (11)
which has the closed form solution:
z_g = max( ||(Gx)_g||_2 − λ/ρ, 0 ) (Gx)_g / ||(Gx)_g||_2   for each group g,   (12)
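This closed form is the standard proximal operator of the group ℓ2 norm (with threshold λ/ρ in the notation above); a minimal sketch on one group:

```python
import numpy as np

def shrink_group(z, tau):
    """Proximal operator of tau * ||.||_2 on one group:
    argmin_b tau*||b||_2 + 0.5*||b - z||_2^2.
    Shrinks the whole group vector toward zero, clipping at zero."""
    nrm = np.linalg.norm(z)
    if nrm <= tau:
        return np.zeros_like(z)            # also avoids division by zero
    return (1.0 - tau / nrm) * z

z = np.array([3.0, 4.0])                   # ||z|| = 5
print(shrink_group(z, 1.0))                # scaled by (1 - 1/5): [2.4, 3.2]
print(shrink_group(z, 6.0))                # threshold exceeds ||z||: zeros
```

Unlike component-wise soft thresholding, the whole group survives or dies together, which is what couples parents, children and channels in the forest model.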
We denote this as the shrink-group operation. For the x subproblem:
min_x (1/2) ||Φx − y||_2^2 + (ρ/2) ||z − Gx||_2^2,   (13)
The optimal solution is x = (Φ^T Φ + ρ G^T G)^{−1} (Φ^T y + ρ G^T z), which contains a large-scale matrix inversion. In practice, this problem can be solved efficiently by various methods; in order to compare with FISTA and FISTA_Joint, we apply FISTA to solve (13), which demonstrates the benefit of forest sparsity more clearly. Let f(x) = (1/2)||Φx − y||_2^2 + (ρ/2)||z − Gx||_2^2 and g(x) = 0. Supposing its Lipschitz constant is L_f, the whole algorithm is summarized in Algorithm 2.
In the first step, we solve (11) while x stays fixed, so the objective function value of (10) decreases. In the second step, (13) is solved by FISTA while z stays fixed. Therefore, the objective function value of (10) decreases in each iteration and the algorithm converges. Algorithm 2 is also used to implement tree sparsity by recovering the data channel-by-channel; we call that version FISTA_Tree.
In some practical applications the data tends to be forest-sparse, but not strictly so. We can then soften and complement the forest assumption with other penalties, such as a joint ℓ2,1 norm or TV. For example, after combining TV, problem (10) becomes:
min_{x,z} (1/2) ||Φx − y||_2^2 + λ Σ_g ||z_g||_2 + (ρ/2) ||z − Gx||_2^2 + β Σ_{i=1}^{T} TV(x_i),   (14)
where TV(·) is the total variation built from ∇_1 and ∇_2, the forward finite difference operators on the first and second coordinates respectively, and β is a positive parameter. Compared with Algorithm 2, we only need to let g(x) include the TV term, and the corresponding subproblem has already been solved [12, 51, 53]. This TV-combined algorithm, called FCSA_Forest, will be used in the experiments; to avoid repetition, it is not listed.
V Applications and Experiments
We conduct experiments on RGB color images, multi-contrast MR images, MR images from multichannel coils, and multispectral images to validate the benefit of forest sparsity. All experiments are conducted on a desktop with a 3.4GHz Intel Core i7-3770 CPU; the Matlab version is 7.8 (2009a). If the sampling matrix is m by n, the sampling ratio is defined as m/n. All measurements are corrupted with white Gaussian noise of standard deviation 0.01. The signal-to-noise ratio (SNR) is used as the evaluation metric:
SNR = 10 log_10 ( V_s / MSE ),   (15)
where MSE is the mean square error between the original data x_0 and the reconstruction, and V_s denotes the power level of the original data, computed as the variance of the values in x_0.
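A sketch of this metric, assuming (as stated above) that the power level is the variance of the original data; the toy vectors are hypothetical:

```python
import numpy as np

def snr_db(x0, x_rec):
    """SNR in dB: ratio of the original data's variance (its power level)
    to the mean square error of the reconstruction."""
    mse = np.mean((x0 - x_rec) ** 2)
    return 10.0 * np.log10(np.var(x0) / mse)

x0 = np.array([0.0, 1.0, 2.0, 3.0])        # variance = 1.25
x_rec = x0 + 0.05                           # constant error: MSE = 0.0025
print(round(snr_db(x0, x_rec), 2))          # 10*log10(500) = 26.99
```

A 3 dB gain in this metric corresponds to roughly halving the mean square error.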
V-A Multi-contrast MRI
Multi-contrast MRI is a popular technique to aid clinical diagnosis. For example, T1-weighted MR images distinguish fat from water, with water appearing darker and fat brighter; in T2-weighted images fat is darker and water is lighter, which is better suited to imaging edema. Although their intensities differ, T1-, T2- and proton-density-weighted MR images are scanned at the same anatomical position; they are therefore not independent but highly correlated. Multi-contrast MR images are typically forest-sparse under the wavelet basis. Suppose x_1, ..., x_T are the multi-contrast images of the same anatomical cross section and y_1, ..., y_T are the corresponding undersampled data in the Fourier domain; the forest-sparse reconstruction can be formulated as:
(16) 
where x_i is the vectorized data of the i-th image and Φ_i is the measurement matrix for x_i. This is an extension of conventional CS-MRI [54]. Fig. 1 shows an example of the forest structure in multi-contrast MR images.
The data is extracted from the SRI24 Multi-Channel Brain Atlas Dataset [55]. In the Fourier domain, we randomly acquire more samples at low frequencies and fewer at higher frequencies. This sampling scheme has been widely used in CS-MRI [54, 56, 51]. Fig. 2 shows the original multi-contrast MR images and the sampling mask.
We compare four algorithms on this dataset: FISTA, FISTA_Joint, FISTA_Tree and FISTA_Forest. The regularization parameter λ is set to 0.035, and each algorithm is run for 400 iterations. Fig. 3(a) compares the performance of the different algorithms: modeling with forest sparsity achieves the highest SNR after convergence. Although the forest sparsity algorithm takes more time per iteration due to the overlapping structure, it always outperforms the others in terms of accuracy.
In addition, as total variation is very popular in CS-MRI [54, 51, 25], we compare our FCSA_Forest algorithm with FCSA [51] (TV combined with FISTA), FCSA_Joint [25] (TV combined with FISTA_Joint) and FCSA_Tree. The TV parameter is set as in previous works [56, 51]. Fig. 3(b) shows the performance comparison with TV regularization. Compared with Fig. 3(a), all algorithms improve to different degrees, but the ranking does not change, which validates the superiority of forest sparsity. Since FCSA has been shown to outperform other algorithms for general compressive sensing MRI (CS-MRI) [54, 56, 57], and FCSA_Joint [25] to outperform [24, 58] in multi-contrast MRI, the proposed method further improves CS-MRI and makes it more practical than before.


To validate the benefit of forest sparsity in terms of the number of measurements, we reconstruct the multi-contrast MR images at different sampling ratios. Fig. 4 shows the final results of the four algorithms over a range of sampling ratios. With more samples, all algorithms perform better; however, the forest sparsity algorithm always achieves the best reconstruction. For the same reconstruction accuracy (an SNR of 28), FISTA_Forest requires notably fewer measurements than FISTA_Joint, FISTA_Tree and FISTA. More results of forest sparsity on multi-contrast MRI can be found in [59].
V-B Parallel MRI
To improve the scanning speed of MRI, an efficient and feasible approach is to acquire the data in parallel with multichannel coils. The scanning time depends on the number of measurements in the Fourier domain and is significantly reduced when each coil acquires only a small fraction of all measurements. The bottleneck is reconstructing the original MR image efficiently and precisely; this problem is called pMRI in the literature. Sparsity techniques have been used to improve the classical SENSE method [60]. However, when the coil sensitivities cannot be estimated precisely, the final image contains visual artifacts. Unlike the earlier CS-SENSE [61], which reconstructs the images of the coils individually, calibration-less parallel MRI [62, 63] recovers the aliased images of all coils jointly by assuming the data is jointly sparse.
Let T equal the number of coils and let y_i be the measurement vector from coil i; the problem is then the same CS problem as (16). The final result of CaLM-MRI is obtained by a sum-of-squares (SoS) approach without coil sensitivity or SENSE encoding, and it shows results comparable to methods that require a precise coil configuration. As shown in Fig. 5, the images obtained from the different coils are very similar in appearance. This method can be improved with forest sparsity, since the images follow the forest sparsity assumption.




There are two steps in compressive sensing pMRI reconstruction with CaLM-MRI [62]: 1) the aliased images are recovered from the undersampled Fourier signals of the different coil channels by CS methods; 2) the final image for clinical diagnosis is synthesized from the recovered aliased images using the sum-of-squares (SoS) approach. As discussed above, these aliased images should be forest-sparse under the wavelet basis. We compare our algorithm with FISTA_Joint and with SPGL1 [31], which solves the joint ℓ2,1-norm problem in CaLM-MRI. For the second step, all methods use the SoS approach on the aliased images they recover. All algorithms are run until convergence.
Table II and Table III show comprehensive comparisons among these algorithms. For the same algorithm, more measurements or more coils tend to increase the SNRs of the aliased images, although this does not translate into a linear improvement in the final image reconstruction. Another observation is that FISTA_Joint and SPGL1 achieve similar SNRs on this data, because both solve the same joint sparsity problem, albeit with different schemes. Upgrading the model to forest sparsity yields a significant improvement. Finally, it is unknown how to combine TV with SPGL1, whereas both FISTA_Joint and FISTA_Forest can easily incorporate TV, which further enhances the results [25].
V-C Color Image Reconstruction
Color images captured by an optical camera can be represented as combinations of three color channels: red, green and blue; mixtures of these three appear realistic to human eyes. Observing that the color channels are highly correlated, a joint sparsity prior has been utilized in recent recovery work [23]: modeling with ℓ2,1-norm regularization gains additional SNR over standard ℓ1-norm regularization. Furthermore, each color channel tends to be wavelet tree-sparse, so modeling the problem with forest sparsity should yield even better results.
For color images, we compare our algorithm with FISTA, FISTA_Joint and FISTA_Tree. Fig. 6 shows the visual results recovered with the different sparsity penalties. After only 50 iterations, the image recovered by our algorithm is very close to the original, with the fewest artifacts (shown in the zoomed region of interest).
V-D Multispectral Image Reconstruction
Different from common color images, a multispectral or hyperspectral image consists of many more bands, providing both spatial and spectral representations of scenes. It is widely utilized in remote sensing, with applications to agriculture, environment monitoring, etc. However, collecting such a large amount of data costs both imaging time and storage space. With compressive sensing data acquisition, the cost of imaging for remote sensing could be significantly reduced [64]. Like RGB images, the bands of a multispectral image represent the same scene, and each band has the tree sparsity property; therefore, they follow the forest sparsity assumption. Fig. 7 shows bands 6 to 14 of a multispectral image of the 1992 AVIRIS Indian Pine Test Site 3 (the data is downloaded from https://engineering.purdue.edu/biehl/MultiSpec/hyperspectral.html).
For the multispectral experiment, we test the 1992 AVIRIS Indian Pine Test Site 3 dataset (examples shown in Fig. 7), a portion of Northwest Tippecanoe County, Indiana, with 220 bands in total. Each band is recovered separately under standard sparsity and tree sparsity, while every 3 bands are reconstructed simultaneously by the joint-sparse and forest-sparse models. Each image is cropped for convenience, and the number of wavelet decomposition levels is set to 3. The SNRs of the recovered images for bands 6 to 66 are shown in Fig. 8: modeling with forest sparsity always achieves the highest SNRs, which validates the benefit of forest sparsity.
VI Conclusion
In this paper we have proposed forest sparsity, a novel model for sparse learning and compressive sensing. It enriches the family of structured sparsity models and can be applied to a wide range of sparse regularization problems. The benefit of the proposed model has been theoretically proved and empirically validated in practical applications. Under compressive sensing assumptions, forest sparsity achieves a significant reduction in the number of measurements compared with standard sparsity, joint sparsity or independent tree sparsity. A fast algorithm is developed to efficiently solve the forest sparsity problem. Applied to practical problems such as multi-contrast MRI, pMRI, multispectral image and color image reconstruction, extensive experiments demonstrate the superiority of forest sparsity over standard sparsity, joint sparsity and tree sparsity in terms of both accuracy and computational cost.
Appendix A Proof of Theorem 2
The proof is given for the binary-tree case for convenience; the bound for quadtrees can be obtained in the same way. First, we count the number of subtrees of size k of a binary tree of size n, where the root of each subtree must coincide with the binary tree's root.
Case 1: when k is small relative to the tree depth, the number of subtrees of size k is exactly the Catalan number:
(1 / (k + 1)) C(2k, k).   (17)
Case 2: otherwise, the number of subtrees of size k follows the bound of [18]:
(18) 
where the omitted quantities are constants. Therefore we have:
(19) 
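To make the count in Case 1 concrete (this sketch is an illustration, not part of the original proof), the Catalan numbers can be computed directly; the elementary bound C(2k, k) ≤ 4^k shows that the logarithm of the subtree count grows only linearly in k, which is what feeds the ln(2L) term of the measurement bound:

```python
from math import comb

def catalan(k):
    """k-th Catalan number: the number of binary trees with k nodes."""
    return comb(2 * k, k) // (k + 1)       # exact: (k+1) divides comb(2k, k)

for k in [1, 2, 3, 4, 5, 10]:
    assert catalan(k) * (k + 1) <= 4 ** k  # comb(2k, k) <= 4^k
    print(k, catalan(k))                   # 1, 2, 5, 14, 42, ..., 16796
```

So the number of rooted subtrees grows like 4^k up to polynomial factors, far smaller than the C(n, k) supports allowed by standard sparsity.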
Appendix B Proof of Theorem 3
If the data is forest-sparse, the support sets of the different trees are dependent: once the support set of one tree is fixed, the support sets of all the other trees are fixed as well. Accordingly, the number of support combinations is the same as for a single tree. Note that the total sparsity is Tk, as there are T trees. Therefore,
(22) 
In both cases, the bound is reduced accordingly.
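The practical consequence of the reduced support count is fewer required measurements: the standard RIP counting argument needs on the order of the logarithm of the number of candidate supports. A back-of-the-envelope comparison, with illustrative sizes `N` and `K` chosen here purely for the sketch:

```python
from math import comb, log2

def log2_supports_standard(N, K):
    # log of the number of arbitrary K-sparse supports in dimension N
    return log2(comb(N, K))

def log2_supports_tree(K):
    # log of the number of rooted binary subtrees of size K (Catalan number)
    return log2(comb(2 * K, K) // (K + 1))

N, K = 1024, 32  # illustrative sizes
# tree-structured supports are exponentially fewer:
# O(K) bits versus O(K log(N/K)) bits for unstructured sparsity
gap = log2_supports_standard(N, K) - log2_supports_tree(K)
```

The gap grows with the ambient dimension, which is why tree (and forest) models need only O(K) measurements while standard sparsity needs O(K log(N/K)).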
Appendix C Proof of Theorem 4
We first derive a sufficient condition that guarantees the RIP for block-diagonal matrices.
Theorem 5. Let a matrix be composed of sub-Gaussian random blocks as in (6). For any fixed subset with and , we have, with probability exceeding :
(23) 
for all . Here, and are absolute constants, and .
Proof. Denote , so that . Choose a finite set of points such that and for all . Then and the covering number satisfies for any (see Chapter 13 of [65]).
Since the block-diagonal matrix is composed of sub-Gaussian random blocks, we have for each and any :
(24) 
where and are defined above; this probability bound is given in Theorem III.1 of [66].
Taking a union bound, we obtain, with probability exceeding :
(25) 
which gives
(26) 
Now we define as the smallest nonnegative number such that
(27) 
for all and . We have
(28) 
Note that the above result holds for any and . We choose and . Since , it is easy to see that , which proves
(31) 
The other bound can be proved similarly. Finally, we obtain, with probability exceeding :
(32) 
which completes the proof as .
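The concentration inequality at the heart of Theorem 5 can be sanity-checked numerically: for a block-diagonal matrix with Gaussian blocks (one particular sub-Gaussian ensemble), the energy of a sparse vector is approximately preserved. A minimal sketch with arbitrarily chosen sizes, not tied to the constants in the proof:

```python
import numpy as np

rng = np.random.default_rng(0)
C, n, m = 4, 256, 128  # number of blocks, columns and rows per block

# block-diagonal sensing operator as in (6): one Gaussian block per channel,
# normalized so that each block preserves energy in expectation
blocks = [rng.standard_normal((m, n)) / np.sqrt(m) for _ in range(C)]

def apply_block_diag(blocks, x):
    # x stacks the C channel signals; each block acts on its own channel
    parts = np.split(x, len(blocks))
    return np.concatenate([B @ p for B, p in zip(blocks, parts)])

# a sparse test vector with its support spread over the channels
x = np.zeros(C * n)
support = rng.choice(C * n, size=32, replace=False)
x[support] = rng.standard_normal(32)

ratio = np.linalg.norm(apply_block_diag(blocks, x)) ** 2 / np.linalg.norm(x) ** 2
# ratio concentrates around 1, consistent with the two-sided bound in (23)
```

Repeating the experiment over many random draws shows the squared-norm ratio clustering tightly around 1, as the measure-concentration result of [66] predicts.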
Based on this theorem, we know that any sparse vector satisfies
(33) 
with probability exceeding .
Suppose there are combinations of such a support set ; from Appendices A and B we know that
(34) 
for forest sparse data.
Taking the union bound, we know that (23) fails with probability less than
(35) 
which gives
(36) 
From this, Theorem 4 can be derived directly; in both cases, the bound takes the same form.
References
 [1] E. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, 2006.
 [2] D. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
 [3] E. Candès, “Compressive sampling,” in Proc. Int. Congr. Math., 2006, pp. 1433–1452.
 [4] E. Candes and J. Romberg, “Sparsity and incoherence in compressive sampling,” Inverse Problems, vol. 23, no. 3, p. 969, 2007.
 [5] E. Candes, J. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Comm. Pure Appl. Math., vol. 59, no. 8, pp. 1207–1223, 2006.
 [6] B. Natarajan, “Sparse approximate solutions to linear systems,” SIAM J. Comput., vol. 24, no. 2, pp. 227–234, 1995.
 [7] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. R. Stat. Soc. Series B Stat. Methodol., pp. 267–288, 1996.
 [8] S. Chen, D. Donoho, and M. Saunders, “Atomic decomposition by basis pursuit,” SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998.
 [9] D. Donoho and M. Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization,” Proc. Natl. Acad. Sci., vol. 100, no. 5, pp. 2197–2202, 2003.
 [10] J. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Trans. Inf. Theory, vol. 50, no. 10, pp. 2231–2242, 2004.
 [11] D. Needell and J. Tropp, “CoSaMP: Iterative signal recovery from incomplete and inaccurate samples,” Appl. Computat. Harmon. Anal., vol. 26, no. 3, pp. 301–321, 2009.
 [12] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sci., vol. 2, no. 1, pp. 183–202, 2009.
 [13] M. Figueiredo, R. Nowak, and S. Wright, “Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems,” IEEE J. Sel. Topics Signal Process., vol. 1, no. 4, pp. 586–597, 2007.
 [14] K. Koh, S. Kim, and S. Boyd, “An interior-point method for large-scale l1-regularized logistic regression,” J. Mach. Learn. Res., vol. 8, no. 8, pp. 1519–1555, 2007.
 [15] S. Ji, Y. Xue, and L. Carin, “Bayesian compressive sensing,” IEEE Trans. Signal Process., vol. 56, no. 6, pp. 2346–2356, 2008.
 [16] D. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” in Proc. Natl. Acad. Sci., vol. 106, no. 45, 2009, pp. 18914–18919.
 [17] J. Huang, T. Zhang, and D. Metaxas, “Learning with structured sparsity,” J. Mach. Learn. Res., vol. 12, pp. 3371–3412, 2011.
 [18] R. Baraniuk, V. Cevher, M. Duarte, and C. Hegde, “Model-based compressive sensing,” IEEE Trans. Inf. Theory, vol. 56, no. 4, pp. 1982–2001, 2010.
 [19] J. Huang, “Structured sparsity: theorems, algorithms and applications,” Ph.D. dissertation, Rutgers University, 2011.
 [20] J. Meng, W. Yin, H. Li, E. Hossain, and Z. Han, “Collaborative spectrum sensing from sparse observations in cognitive radio networks,” IEEE J. Sel. Areas Commun., vol. 29, no. 2, pp. 327–337, 2011.
 [21] H. Krim and M. Viberg, “Two decades of array signal processing research: the parametric approach,” IEEE Signal Process. Mag., vol. 13, no. 4, pp. 67–94, 1996.
 [22] D. Baron, M. Wakin, M. Duarte, S. Sarvotham, and R. Baraniuk, “Distributed compressed sensing,” arXiv preprint arXiv:0901.3403, 2005.
 [23] A. Majumdar and R. Ward, “Compressive color imaging with group-sparsity on analysis prior,” in Proc. IEEE Int. Conf. Image Process. (ICIP), 2010, pp. 1337–1340.
 [24] B. Bilgic, V. Goyal, and E. Adalsteinsson, “Multi-contrast reconstruction with Bayesian compressed sensing,” Magn. Reson. Med., vol. 66, no. 6, pp. 1601–1615, 2011.
 [25] J. Huang, C. Chen, and L. Axel, “Fast Multi-contrast MRI Reconstruction,” in Proc. Med. Image Comput. Comput. Assist. Interv. (MICCAI), 2012, pp. 281–288.
 [26] J. Huang and T. Zhang, “The benefit of group sparsity,” Ann. Stat., vol. 38, no. 4, pp. 1978–2004, 2010.
 [27] J. Huang, X. Huang, and D. Metaxas, “Learning with dynamic group sparsity,” in Proc. Int. Conf. Comput. Vis. (ICCV), 2009, pp. 64–71.
 [28] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” J. R. Stat. Soc. Series B Stat. Methodol., vol. 68, no. 1, pp. 49–67, 2005.
 [29] F. Bach, “Consistency of the group lasso and multiple kernel learning,” J. Mach. Learn. Res., vol. 9, pp. 1179–1225, 2008.
 [30] S. Cotter, B. Rao, K. Engan, and K. Kreutz-Delgado, “Sparse solutions to linear inverse problems with multiple measurement vectors,” IEEE Trans. Signal Process., vol. 53, no. 7, pp. 2477–2488, 2005.
 [31] E. Van Den Berg and M. Friedlander, “Probing the pareto frontier for basis pursuit solutions,” SIAM J. Sci. Comput., vol. 31, no. 2, pp. 890–912, 2008.
 [32] W. Deng, W. Yin, and Y. Zhang, “Group sparse optimization by alternating direction method,” TR1106, Department of Computational and Applied Mathematics, Rice University, 2011.
 [33] D. Wipf and B. Rao, “An empirical Bayesian strategy for solving the simultaneous sparse approximation problem,” IEEE Trans. Signal Process., vol. 55, no. 7, pp. 3704–3716, 2007.
 [34] S. Ji, D. Dunson, and L. Carin, “Multitask compressive sensing,” IEEE Trans. Signal Process., vol. 57, no. 1, pp. 92–106, 2009.
 [35] J. Ziniel and P. Schniter, “Efficient high-dimensional inference in the multiple measurement vector problem,” arXiv preprint arXiv:1111.5272, 2011.
 [36] A. Manduca and A. Said, “Wavelet compression of medical images with set partitioning in hierarchical trees,” in Proc. Int. Conf. IEEE Engineering in Medicine and Biology Soc. (EMBS), vol. 3, 1996, pp. 1224–1225.
 [37] L. He and L. Carin, “Exploiting structure in wavelet-based Bayesian compressive sensing,” IEEE Trans. Signal Process., vol. 57, no. 9, pp. 3488–3497, 2009.
 [38] S. Som and P. Schniter, “Compressive imaging using approximate message passing and a Markov-tree prior,” IEEE Trans. Signal Process., vol. 60, no. 7, pp. 3439–3448, 2012.
 [39] N. Rao, R. Nowak, S. Wright, and N. Kingsbury, “Convex approaches to model wavelet sparsity patterns,” in Proc. IEEE Int. Conf. Image Process. (ICIP), 2011, pp. 1917–1920.
 [40] C. Chen and J. Huang, “Compressive Sensing MRI with Wavelet Tree Sparsity,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1124–1132.
 [41] S. Kim and E. Xing, “Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping,” Ann. Appl. Stat., vol. 6, no. 3, pp. 1095–1117, 2012.
 [42] C. La and M. Do, “Tree-based orthogonal matching pursuit algorithm for signal reconstruction,” in Proc. IEEE Int. Conf. Image Process. (ICIP), 2006, pp. 1277–1280.
 [43] C. Chen and J. Huang, “The benefit of tree sparsity in accelerated MRI,” Med. Image Anal., 2013.
 [44] L. Jacob, G. Obozinski, and J. Vert, “Group lasso with overlap and graph lasso,” in Proc. Int. Conf. Mach. Learn. (ICML), 2009, pp. 433–440.
 [45] P. Sprechmann, I. Ramirez, G. Sapiro, and Y. C. Eldar, “C-HiLasso: A collaborative hierarchical sparse modeling framework,” IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4183–4198, 2011.
 [46] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann, “Uniform uncertainty principle for Bernoulli and sub-Gaussian ensembles,” Constructive Approximation, vol. 28, no. 3, pp. 277–289, 2008.
 [47] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, “A simple proof of the restricted isometry property for random matrices,” Constructive Approximation, vol. 28, no. 3, pp. 253–263, 2008.
 [48] T. Blumensath and M. Davies, “Sampling theorems for signals from the union of finite-dimensional linear subspaces,” IEEE Trans. Inf. Theory, vol. 55, no. 4, pp. 1872–1882, 2009.
 [49] X. Chen, S. Kim, Q. Lin, J. G. Carbonell, and E. P. Xing, “Graph-structured multi-task regression and an efficient optimization method for general fused lasso,” arXiv preprint arXiv:1005.3579, 2010.
 [50] X. Chen, X. Shi, X. Xu, Z. Wang, R. Mills, C. Lee, and J. Xu, “A two-graph guided multi-task lasso approach for eQTL mapping,” in Proc. Int. Conf. Artif. Intell. Stat. (AISTATS), 2012, pp. 208–217.
 [51] J. Huang, S. Zhang, and D. Metaxas, “Efficient MR image reconstruction for compressed MR imaging,” Med. Image Anal., vol. 15, no. 5, pp. 670–679, 2011.
 [52] M. Kowalski, K. Siedenburg, and M. Dörfler, “Social sparsity! neighborhood systems enrich structured shrinkage operators,” IEEE Trans. Signal Process., vol. 61, no. 10, pp. 2498–2511, 2013.
 [53] J. Huang, S. Zhang, H. Li, and D. Metaxas, “Composite splitting algorithms for convex optimization,” Comput. Vis. Image Und., vol. 115, no. 12, pp. 1610–1622, 2011.
 [54] M. Lustig, D. Donoho, and J. Pauly, “Sparse MRI: The application of compressed sensing for rapid MR imaging,” Magn. Reson. Med., vol. 58, no. 6, pp. 1182–1195, 2007.
 [55] T. Rohlfing, N. Zahr, E. Sullivan, and A. Pfefferbaum, “The SRI24 multichannel atlas of normal adult human brain structure,” Hum. Brain Mapp., vol. 31, no. 5, pp. 798–819, 2009.
 [56] S. Ma, W. Yin, Y. Zhang, and A. Chakraborty, “An efficient algorithm for compressed MR imaging using total variation and wavelets,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2008, pp. 1–8.
 [57] J. Yang, Y. Zhang, and W. Yin, “A fast alternating direction method for TVL1-L2 signal reconstruction from partial Fourier data,” IEEE J. Sel. Topics Signal Process., vol. 4, no. 2, pp. 288–297, 2010.
 [58] A. Majumdar and R. Ward, “Joint reconstruction of multi-echo MR images using correlated sparsity,” Magn. Reson. Imaging, vol. 29, no. 7, pp. 899–906, 2011.
 [59] C. Chen and J. Huang, “Exploiting both intra-quadtree and inter-spatial structures for multi-contrast MRI,” in Proc. IEEE Int. Symp. Biomed. Imaging (ISBI), 2014.
 [60] K. Pruessmann, M. Weiger, M. Scheidegger, P. Boesiger et al., “SENSE: sensitivity encoding for fast MRI,” Magn. Reson. Med., vol. 42, no. 5, pp. 952–962, 1999.
 [61] D. Liang, B. Liu, J. Wang, and L. Ying, “Accelerating SENSE using compressed sensing,” Magn. Reson. Med., vol. 62, no. 6, pp. 1574–1584, 2009.
 [62] A. Majumdar and R. Ward, “Calibrationless multicoil MR image reconstruction,” Magn. Reson. Imaging, 2012.
 [63] C. Chen, Y. Li, and J. Huang, “Calibrationless parallel MRI with joint total variation regularization,” in Proc. Med. Image Comput. Comput. Assist. Interv. (MICCAI), 2013, pp. 106–114.
 [64] J. Ma, “Single-pixel remote sensing,” IEEE Geosci. Remote Sens. Lett., vol. 6, no. 2, pp. 199–203, 2009.
 [65] G. G. Lorentz, M. von Golitschek, and Y. Makovoz, Constructive approximation: advanced problems. Springer Berlin, 1996, vol. 304.
 [66] J. Y. Park, H. L. Yap, C. J. Rozell, and M. B. Wakin, “Concentration of measure for block diagonal matrices with applications to compressive signal processing,” IEEE Trans. Signal Process., vol. 59, no. 12, pp. 5859–5875, 2011.