1 Introduction
This paper aims to recover a sparse vector from its noisy linear measurements:
(1) 
where , , ,
is additive Gaussian white noise, and we have
. (1) is an illposed, highly underdetermined system. However, it becomes easier to solve if is assumed to be sparse, i.e. the cardinality of support of , , is small compared to .A popular approach is to model the problem as the LASSO formulation ( is a scalar):
(2) 
and solve it using iterative algorithms such as the iterative shrinkage thresholding algorithm (ISTA) blumensath2008iterative :
(3) 
where is the softthresholding function^{2}^{2}2Softthresholding function is defined in a componentwise way: and
is usually taken as the largest eigenvalue of
. In general, ISTA converges sublinearly for any given and fixed dictionary and sparse code beck2009fast ., inspired by ISTA, the authors proposed a learningbased model named Learned ISTA (LISTA). They view ISTA as a recurrent neural network (RNN) that is illustrated in Figure
1(a), where , , . LISTA, illustrated in Figure 1(b), unrolls the RNN and truncates it into iterations:(4) 
leading to a
layer feedforward neural network with side connections.
Different from ISTA where no parameter is learnable (except the hyper parameter
to be tuned), LISTA is treated as a specially structured neural network and trained using stochastic gradient descent (SGD), over a given training dataset
sampled from some distribution . All the parameters are subject to learning. The training is modeled as:(5) 
Many empirical results, e.g., gregor2010learning ; wang2016learning ; wang2016d3 ; wang2016learningb ; wang2016learningc , show that a trained layer LISTA (with usually set to ) or its variants can generalize more than well to unseen samples from the same and recover from to the same accuracy within one or two orderofmagnitude fewer iterations than the original ISTA. Moreover, the accuracies of the outputs of the layers gradually improve.
1.1 Related Works
Many recent works sprechmann2015learning ; wang2016sparse ; wang2016learning ; zhangista ; zhou2018sc2net followed the idea of gregor2010learning to construct feedforward networks by unfolding and truncating iterative algorithms, as fast trainable regressors to approximate the solutions of sparse coding models. On the other hand, progress has been slow towards understanding the efficient approximation from a theoretical perspective. The most relevant works are discussed below.
moreau2017understanding attempted to explain the mechanism of LISTA by refactorizing the Gram matrix of dictionary, which tries to nearly diagonalize the Gram matrix with a basis that produces a small perturbation of the ball. They reparameterized LISTA into a new factorized architecture that achieved similar acceleration gain to LISTA. Using an “indirect” proof, moreau2017understanding was able to show that LISTA can converge faster than ISTA, but still sublinearly. Lately, giryes2018tradeoffs tried to relate LISTA to a projected gradient descent descent (PGD) relying on inaccurate projections, where a tradeoff between approximation error and convergence speed was made possible.
xin2016maximal investigated the convergence property of a sibling architecture to LISTA, proposed in wang2016learning , which was obtained by instead unfolding/truncating the iterative hard thresholding (IHT) algorithm rather than ISTA. The authors argued that they can use data to train a transformation of dictionary that can improve its restricted isometry property (RIP) constant, when the original dictionary is highly correlated, causing IHT to fail easily. They moreover showed it beneficial to allow the weights to decouple across layers. However, the analysis in xin2016maximal cannot be straightforwardly extended to ISTA although IHT is linearly convergent blumensath2009iterative under rather strong assumptions.
In borgerding2017amp , a similar learningbased model inspired by another iterative algorithm solve LASSO, approximated message passing (AMP), was studied. The idea was advanced in metzler2017learned to substituting the AMP proximal operator (softthresholding) with a learnable Gaussian denoiser. The resulting model, called Learned Denoising AMP (LDAMP), has theoretical guarantees under the asymptotic assumption named “state evolution.” While the assumption is common in analyzing AMP algorithms, the tool is not directly applicable to ISTA. Moreover, borgerding2017amp shows LDAMP is MMSE optimal, but there is no result on its convergence rate. Besides, we also note the empirical effort in borgerding2016onsager that introduces an Onsager correction to LISTA to make it resemble AMP.
1.2 Motivations and Contributions
We attempt to answer the following questions, which are not fully addressed in the literature yet:

Rather than training LISTA as a conventional “blackbox” network, can we benefit from exploiting certain dependencies among its parameters to simplify the network and improve the recovery results?

Obtained with sufficiently many training samples from the target distribution , LISTA works very well. So, we wonder if there is a theoretical guarantee to ensure that LISTA (4) converges ^{3}^{3}3The convergence of ISTA/FISTA measures how fast the th iterate proceeds; the convergence of LISTA measures how fast the output of the th layer proceeds as increases. faster and/or produces a better solution than ISTA (3) when its parameters are ideal? If the answer is affirmative, can we quantize the amount of acceleration?

Can some of the acceleration techniques such as support detection that were developed for LASSO also be used to improve LISTA?
Our Contributions: this paper aims to introduce more theoretical insights for LISTA and to further unleash its power. To our best knowledge, this is the first attempt to establish a theoretical convergence rate (upper bound) of LISTA directly. We also observe that the weight structure and the thresholds can speedup the convergence of LISTA:

We give a result on asymptotic coupling between the weight matrices and . This result leads us to eliminating one of them, thus reducing the number of trainable parameters. This elimination still retains the theoretical and experimental performance of LISTA.

ISTA is generally sublinearly convergent before its iterates settle on a support. We prove that, however, there exists a sequence of parameters that makes LISTA converge linearly since its first iteration. Our numerical experiments support this theoretical result.

Furthermore, we introduce a thresholding scheme for support selection, which is extremely simple to implement and significantly boosts the practical convergence. The linear convergence results are extended to support detection with an improved rate.
Detailed discussions of the above three points will follow after Theorems 1, 2 and 3, respectively. Our proofs do not rely on any indirect resemblance, e.g., to AMP borgerding2016onsager or PGD giryes2018tradeoffs . The theories are supported by extensive simulation experiments, and substantial performance improvements are observed when applying the weight coupling and support selection schemes. We also evaluated LISTA equipped with those proposed techniques in an image compressive sensing task, obtaining superior performance over several of the stateofthearts.
2 Algorithm Description
We first establish the necessary condition for LISTA convergence, which implies a partial weight coupling structure for training LISTA. We then describe the supportselection technique.
2.1 Necessary Condition for LISTA Convergence and Partial Weight Coupling
Assumption 1 (Basic assumptions).
The signal and the observation noise are sampled from the following set:
(6) 
In other words, is bounded and sparse^{4}^{4}4A signal is sparse if it has no more than nonzero entries. (), and is bounded.
Theorem 1 (Necessary Condition).
Proofs of the results throughout this paper can be found in the supplementary. The conclusion (7) demonstrates that the weights in LISTA asymptotically satisfies the following partial weight coupling structure:
(9) 
We adopt the above partial weight coupling for all layers, letting , thus simplifying LISTA (4) to:
(10) 
where remain as free parameters to train. Empirical results in Fig. 3 illustrate that the structure (9), though having fewer parameters, improves the performance of LISTA.
The coupled structure (9) for softthresholding based algorithms was empirically studied in borgerding2017amp . The similar structure was also theoretically studied in Proposition 1 of xin2016maximal for IHT algorithms using the fixedpoint theory, but they let all layers share the same weights, i.e. .
2.2 LISTA with Support Selection
We introduce a special thresholding scheme to LISTA, called support selection, which is inspired by “kicking” osher2011fast in linearized Bregman iteration. This technique shows advantages on recoverability and convergence. Its impact on improving LISTA convergence rate and reducing recovery errors will be analyzed in Section 3. With support selection, at each LISTA layer before applying soft thresholding, we will select a certain percentage of entries with largest magnitudes, and trust them as “true support” and won’t pass them through thresholding. Those entries that do not go through thresholding will be directly fed into next layer, together with other thresholded entires.
Assume we select of entries as the trusted support at layer . LISTA with support selection can be generally formulated as
(11) 
where is the thresholding operator with support selection, formally defined as:
where includes the elements with the largest magnitudes in vector :
(12) 
To clarify, in (11),
is a hyperparameter to be manually tuned, and
is a parameter to train. We use an empirical formula to select for layer : , where is a positive constant and is an upper bound of the percentage of the support cardinality. Here and are both hyperparameters to be manually tuned.Algorithm abbreviations
For simplicity, hereinafter we will use the abbreviation “CP” for the partial weight coupling in (9), and “SS” for the support selection technique. LISTACP denotes the LISTA model with weights coupling (10). LISTASS denotes the LISTA model with support selection (11). Similarly, LISTACPSS stands for a model using both techniques (13), which has the best performance. Unless otherwise specified, LISTA refers to the baseline LISTA (4).
3 Convergence Analysis
In this section, we formally establish the impacts of (10) and (13) on LISTA’s convergence. The output of the layer depends on the parameters , the observed measurement and the initial point . Strictly speaking, should be written as . By the observation model , since is given and can be taken as , therefore depends on , and . So, we can write . For simplicity, we instead just write .
Theorem 2 (Convergence of LISTACP).
If (noiseless case), (14) reduces to
(15) 
The recovery error converges to at a linear rate as the number of layers goes to infinity. Combined with Theorem 1, we see that the partial weight coupling structure (10) is both necessary and sufficient to guarantee convergence in the noiseless case. Fig. 3 validates (14) and (15) directly.
Discussion: The bound (15) also explains why LISTA (or its variants) can converge faster than ISTA and fast ISTA (FISTA) beck2009fast . With a proper (see (2)), ISTA converges at an rate and FISTA converges at an rate beck2009fast . With a large enough , ISTA achieves a linear rate bredies2008linear ; zhang2017new . With being the solution of LASSO (noiseless case), these results can be summarized as: before the iterates settle on a support^{5}^{5}5After settles on a support, i.e. as large enough such that is fixed, even with small , ISTA reduces to a linear iteration, which has a linear convergence rate tao2016local . ,
Based on the choice of in LASSO, the above observation reflects an inherent tradeoff between convergence rate and approximation accuracy in solving the problem (1), see a similar conclusion in giryes2018tradeoffs : a larger leads to faster convergence but a less accurate solution, and vice versa.
However, if is not constant throughout all iterations/layers, but instead chosen adaptively for each step, more promising tradeoff can arise^{6}^{6}6This point was studied in HaleYinZhang2008_sparse ; xiao2013proximal with classical compressive sensing settings, while our learning settings can learn a good path of parameters without a complicated thresholding rule or any manual tuning.. LISTA and LISTACP, with the thresholds free to train, actually adopt this idea because corresponds to a path of LASSO parameters . With extra free trainable parameters, (LISTACP) or (LISTA), learning based algorithms are able to converge to an accurate solution at a fast convergence rate. Theorem 2 demonstrates the existence of such sequence in LISTACP (10). The experiment results in Fig. 4 show that such can be obtained by training.
Assumption 2.
Signal and observation noise are sampled from the following set:
(16) 
Theorem 3 (Convergence of LISTACPSS).
Given and , let be generated by (13). With the same assumption and parameters as in Theorem 2, the approximation error can be bounded for all :
(17) 
where for all and .
If Assumption 2 holds, is small enough, and (SNR is not too small), then there exists another sequence of parameters that yields the following improved error bound: for all ,
(18) 
where for all , for large enough , and .
The bound in (17) ensures that, with the same assumptions and parameters, LISTACPSS is at least no worse than LISTACP. The bound in (18) shows that, under stronger assumptions, LISTACPSS can be strictly better than LISTACP in both folds: is the better convergence rate of LISTACPSS; means that the LISTACPSS can achieve smaller approximation error than the minimum error that LISTA can achieve.
4 Numerical Results
For all the models reported in this section, including the baseline LISTA and LAMP models , we adopt a stagewise training strategy with learning rate decaying to stabilize the training and to get better performance, which is discussed in the supplementary.
4.1 Simulation Experiments
Experiments Setting. We choose . We sample the entries of
i.i.d. from the standard Gaussian distribution,
and then normalize its columns to have the unit norm. We fix a matrix in each setting where different networks are compared. To generate sparse vectors, we decide each of its entry to be nonzero following the Bernoulli distribution with
. The values of the nonzero entries are sampled from the standard Gaussian distribution. A test set of 1000 samples generated in the above manner is fixed for all tests in our simulations.All the networks have layers. In LISTA models with support selection, we add of entries into support and maximally select in each layer. We manually tune the value of and for the best final performance. With and , we choose for all models in simulation experiments and for LISTASS but for LISTACPSS. The recovery performance is evaluated by NMSE (in dB):
where is the ground truth and
is the estimate obtained by the recovery algorithms (ISTA, FISTA, LISTA, etc.).
Validation of Theorem 1. In Fig 2, we report two values, and , obtained by the baseline LISTA model (4) trained under the noiseless setting. The plot clearly demonstrates that and , as Theorem 1 is directly validated.
Validation of Theorem 2. We report the testset NMSE of LISTACP (10) in Fig. 3. Although (10) fixes the structure between and , the final performance remains the same with the baseline LISTA (4), and outperforms AMP, in both noiseless and noisy cases. Moreover, the output of interior layers in LISTACP are even better than the baseline LISTA. In the noiseless case, NMSE converges exponentially to ; in the noisy case, NMSE converges to a stationary level related with the noiselevel. This supports Theorem 2: there indeed exist a sequence of parameters leading to linear convergence for LISTACP, and they can be obtained by datadriven learning.
Validation of Discussion after Theorem 2. In Fig 4, We compare LISTACP and ISTA with different s (see the LASSO problem (2)) as well as an adaptive threshold rule similar to one in HaleYinZhang2008_sparse , which is described in the supplementary. As we have discussed after Theorem 2, LASSO has an inherent tradeoff based on the choice of . A smaller leads to a more accurate solution but slower convergence. The adaptive thresholding rule fixes this issue: it uses large for small , and gradually reduces it as increases to improve the accuracy HaleYinZhang2008_sparse . Except for adaptive thresholds ( corresponds to in LASSO), LISTACP has adaptive weights , which further greatly accelerate the convergence. Note that we only ran ISTA and FISTA for 16 iterations, just enough and fair to compare them with the learned models. The number of iterations is so small that the difference between ISTA and FISTA is not quite observable.
Validation of Theorem 3. We compare the recovery NMSEs of LISTACP (10) and LISTACPSS (13) in Fig. 5. The result of the noiseless case (Fig. 5(a)) shows that the recovery error of LISTASS converges to at a faster rate than that of LISTACP. The difference is significant with the number of layers , which supports our theoretical result: “ as large enough” in Theorem 3. The result of the noisy case (Fig. 5(b)) shows that LISTACPSS has better recovery error than LISTACP. This point supports in Theorem 3. Notably, LISTACPSS also outperforms LAMP borgerding2017amp , when > 10 in the noiseless case, and even earlier as SNR becomes lower.
Performance with IllConditioned Matrix.
We train LISTA, LAMP, LISTACPSS with illconditioned matrices of condition numbers . As is shown in Fig. 6, as increases, the performance of LISTA remains stable while LAMP becomes worse, and eventually inferior to LISTA when . Although our LISTACPSS also suffers from illconditioning, its performance always stays much better than LISTA and LAMP.4.2 Natural Image Compressive Sensing
Experiments Setting. We perform a compressive sensing (CS) experiment on natural images (patches). We divide the BSD500 MartinFTM01 set into a training set of 400 images, a validation set of 50 images, and a test set of 50 images. For training, we extract 10,000 patches at random positions of each image, with all means removed. We then learn a dictionary from them, using a block proximal gradient method xu2013block . For each testing image, we divide it into nonoverlapping patches. A Gaussian sensing matrices is created in the same manner as in Sec. 4.1, where is the CS ratio.
Since is typically not exactly sparse under the dictionary , Assumptions 1 and 2 no longer strictly hold. The primary goal of this experiment is thus to show that our proposed techniques remain robust and practically useful in nonideal conditions, rather than beating all CS stateofthearts.
Network Extension.
In the real data case, we have no groundtruth sparse code available as the regression target for the loss function (
5). In order to bypass precomputing sparse codes over on the training set, we are inspired by zhou2018sc2net : first using layerwise pretraining with a reconstruction loss w.r.t. dictionary plus an loss, shown in (19), where is the layer index and denotes all parameters in the th and previous layers; then appending another learnable fullyconnected layer (initialized by ) to LISTACPSS and perform an endtoend training with the cost function (20).(19)  
(20) 
Results. The results are reported in Table 1. We build CS models at the sample rates of and test on the standard Set 11 images as in kulkarni2016reconnet . We compare our results with three baselines: the classical iterative CS solver, TVAL3 li2013efficient
; the “blackbox” deep learning CS solver, ReconNet
kulkarni2016reconnet ;a based network unfolded from IHT algorithm blumensath2009iterative , noted as LIHT; and the baseline LISTA network, in terms of PSNR (dB) ^{7}^{7}7 We applied TVAL3, LISTA and LISTACPSS on patches to be fair. For ReconNet, we used their default setting working on patches, which was verified to perform better than using smaller patches.. We build 16layer LIHT, LISTA and LISTACPSS networks and set . For LISTACPSS, we set more entries into the support in each layer for support selection. We also select support w.r.t. a percentage of the largest magnitudes within the whole batch rather than within a single sample as we do in theorems and simulated experiments, which we emprically find is beneficial to the recovery performance. Table 1 confirms LISTACPSS as the best performer among all. The advantage of LISTACPSS and LISTA over ReconNet also endorses the incorporation of the unrolled sparse solver structure into deep networks.Algorithm  20%  30%  40%  50%  60% 

TVAL3  25.37  28.39  29.76  31.51  33.16 
ReconNet  27.18  29.11  30.49  31.39  32.44 
LIHT  25.83  27.83  29.93  31.73  34.00 
LISTA  28.17  30.43  32.75  34.26  35.99 
LISTACPSS  28.25  30.54  32.87  34.60  36.39 
5 Conclusions
In this paper, we have introduced a partial weight coupling structure to LISTA, which reduces the number of trainable parameters but does not hurt the performance. With this structure, unfolded ISTA can attain a linear convergence rate. We have further proposed support selection, which improves the convergence rate both theoretically and empirically. Our theories are endorsed by extensive simulations and a realdata experiment. We believe that the methodology in this paper can be extended to analyzing and enhancing other unfolded iterative algorithms.
Acknowledgments
The work by X. Chen and Z. Wang is supported in part by NSF RI1755701. The work by J. Liu and W. Yin is supported in part by NSF DMS1720237 and ONR N0001417121. We would also like to thank all anonymous reviewers for their tremendously useful comments to help improve our work.
References
 [1] Thomas Blumensath and Mike E Davies. Iterative thresholding for sparse approximations. Journal of Fourier analysis and Applications, 14(56):629–654, 2008.
 [2] Amir Beck and Marc Teboulle. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.

[3]
Karol Gregor and Yann LeCun.
Learning fast approximations of sparse coding.
In
Proceedings of the 27th International Conference on International Conference on Machine Learning
, pages 399–406. Omnipress, 2010. 
[4]
Zhangyang Wang, Qing Ling, and Thomas Huang.
Learning deep l0 encoders.
In
AAAI Conference on Artificial Intelligence
, pages 2194–2200, 2016. 
[5]
Zhangyang Wang, Ding Liu, Shiyu Chang, Qing Ling, Yingzhen Yang, and Thomas S
Huang.
D3: Deep dualdomain based fast restoration of jpegcompressed
images.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pages 2764–2772, 2016.  [6] Zhangyang Wang, Shiyu Chang, Jiayu Zhou, Meng Wang, and Thomas S Huang. Learning a taskspecific deep architecture for clustering. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 369–377. SIAM, 2016.
 [7] Zhangyang Wang, Yingzhen Yang, Shiyu Chang, Qing Ling, and Thomas S Huang. Learning a deep encoder for hashing. pages 2174–2180, 2016.
 [8] Pablo Sprechmann, Alexander M Bronstein, and Guillermo Sapiro. Learning efficient sparse and low rank models. IEEE transactions on pattern analysis and machine intelligence, 2015.
 [9] Zhaowen Wang, Jianchao Yang, Haichao Zhang, Zhangyang Wang, Yingzhen Yang, Ding Liu, and Thomas S Huang. Sparse Coding and its Applications in Computer Vision. World Scientific.
 [10] Jian Zhang and Bernard Ghanem. ISTANet: Interpretable optimizationinspired deep network for image compressive sensing. In IEEE CVPR, 2018.
 [11] Joey Tianyi Zhou, Kai Di, Jiawei Du, Xi Peng, Hao Yang, Sinno Jialin Pan, Ivor W Tsang, Yong Liu, Zheng Qin, and Rick Siow Mong Goh. SC2Net: Sparse LSTMs for sparse coding. In AAAI Conference on Artificial Intelligence, 2018.
 [12] Thomas Moreau and Joan Bruna. Understanding trainable sparse coding with matrix factorization. In ICLR, 2017.
 [13] Raja Giryes, Yonina C Eldar, Alex Bronstein, and Guillermo Sapiro. Tradeoffs between convergence speed and reconstruction accuracy in inverse problems. IEEE Transactions on Signal Processing, 2018.
 [14] Bo Xin, Yizhou Wang, Wen Gao, David Wipf, and Baoyuan Wang. Maximal sparsity with deep networks? In Advances in Neural Information Processing Systems, pages 4340–4348, 2016.
 [15] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
 [16] Mark Borgerding, Philip Schniter, and Sundeep Rangan. AMPinspired deep networks for sparse linear inverse problems. IEEE Transactions on Signal Processing, 2017.
 [17] Christopher A Metzler, Ali Mousavi, and Richard G Baraniuk. Learned DAMP: Principled neural network based compressive image recovery. In Advances in Neural Information Processing Systems, pages 1770–1781, 2017.
 [18] Mark Borgerding and Philip Schniter. Onsagercorrected deep learning for sparse linear inverse problems. In 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP).
 [19] Stanley Osher, Yu Mao, Bin Dong, and Wotao Yin. Fast linearized bregman iteration for compressive sensing and sparse denoising. Communications in Mathematical Sciences, 2010.
 [20] Kristian Bredies and Dirk A Lorenz. Linear convergence of iterative softthresholding. Journal of Fourier Analysis and Applications, 14(56):813–837, 2008.
 [21] Lufang Zhang, Yaohua Hu, Chong Li, and JenChih Yao. A new linear convergence result for the iterative soft thresholding algorithm. Optimization, 66(7):1177–1189, 2017.
 [22] Shaozhe Tao, Daniel Boley, and Shuzhong Zhang. Local linear convergence of ista and fista on the lasso problem. SIAM Journal on Optimization, 26(1):313–336, 2016.
 [23] Elaine T. Hale, Wotao Yin, and Yin Zhang. Fixedpoint continuation for minimization: methodology and convergence. SIAM Journal on Optimization, 19(3):1107–1130, 2008.
 [24] Lin Xiao and Tong Zhang. A proximalgradient homotopy method for the sparse leastsquares problem. SIAM Journal on Optimization, 23(2):1062–1091, 2013.
 [25] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the International Conference on Computer Vision, volume 2, pages 416–423, 2001.

[26]
Yangyang Xu and Wotao Yin.
A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion.
SIAM Journal on imaging sciences, 6(3):1758–1789, 2013.  [27] Kuldeep Kulkarni, Suhas Lohit, Pavan Turaga, Ronan Kerviche, and Amit Ashok. ReconNet: Noniterative reconstruction of images from compressively sensed measurements. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 [28] Chengbo Li, Wotao Yin, Hong Jiang, and Yin Zhang. An efficient augmented lagrangian method with applications to total variation minimization. Computational Optimization and Applications, 56(3):507–530, 2013.
 [29] Dimitris Bertsimas and John N Tsitsiklis. Introduction to linear optimization. Athena Scientific Belmont, MA, 1997.
Appendix A Proof of Theorem 1
Proof.
By LISTA model (4), the output of the th layer depends on parameters, observed signal and initial point : . Since we assume , the noise . Moreover, is fixed and is taken as . Thus, therefore depends on parameters and : In this proof, for simplicity, we use denote .
Step 1
Firstly, we prove as .
We define a subset of given :
Since uniformly for all , so does for all . Then there exists a uniform for all , such that for all and , which implies
(21) 
The relationship between and is
Let . Then, (21) implies that, for any and , we have
The fact (21) means . By the definition , as long as , we have . Thus,
Furthermore, the uniform convergence of tells us, for any and , there exists a large enough constant and such that and . Then
Since the noise is supposed to be zero , . Substituting with in the above equality, we obtain
where , is defined in Theorem 1. Equivalently,
(22) 
For any , holds. Thus, the above argument holds for all if . Substituting with in (22), we get
(23) 
where . Taking the difference between (22) and (23), we have
(24) 
Then can be bounded with
(25) 
The above conclusion holds for all . Moreover, as a threshold in , . Thus, for any as long as large enough. In another word, as .
Step 2
We prove that as .
LISTA model (4) and gives
where is the subgradient of . It is a set defined componentwisely:
(26) 
The uniform convergence of implies, for any and , there exists a large enough constant and such that and . Thus,
By the definition (26) of , every element in has a magnitude less than or equal to . Thus, for any , we have , which implies
Combined with (25), we obtain the following inequality for all :
The above inequality holds for all , which implies, for all ,
Since , uniformly for all with . Then, as . ∎
Appendix B Proof of Theorem 2
Before proving Theorem 2, we introduce some definitions and a lemma.
Definition 1.
Mutual coherence of (each column of is normalized) is defined as:
Comments
There are no comments yet.