1 Introduction
In this paper, we mainly consider the following problem, which is to recover a sparse vector $x^* \in \mathbb{R}^n$ from a noisy observation vector $y \in \mathbb{R}^m$ (e.g., with additive white Gaussian noise $\varepsilon$):

$y = A x^* + \varepsilon$,  (1)

where $A \in \mathbb{R}^{m \times n}$ ($m < n$ in general) is the dictionary matrix. To solve Problem (1), which is generally ill-posed, some prior information such as sparsity or low-rankness needs to be incorporated; here, $x^*$
is sparse. A common way to estimate $x^*$ is to solve the Lasso problem (Tibshirani, 1996):

$\min_{x} \ \tfrac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_1$,  (2)

where $\lambda$ is a regularization parameter. Many methods have been proposed to solve the sparse coding problem, such as least angle regression (Efron et al., 2004), approximate message passing (AMP) (Donoho et al., 2009) and the iterative shrinkage-thresholding algorithm (ISTA) (Daubechies et al., 2004; Blumensath and Davies, 2008). For solving Problem (2), the update rule of ISTA is

$x_{k+1} = \mathrm{ST}_{\lambda/L}\big(x_k + \tfrac{1}{L} A^\top (y - A x_k)\big)$,
where $\mathrm{ST}_{\theta}(v) = \mathrm{sign}(v)\max(|v| - \theta, 0)$ is the soft-thresholding (ST) operator with threshold $\theta$, and $1/L$ is the step size, which should satisfy $L \ge \sigma_{\max}^2(A)$, where $\sigma_{\max}(A)$ is the largest singular value of the dictionary matrix.
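As a concrete reference, the ISTA iteration above can be sketched in a few lines of NumPy (the problem sizes and iteration count below are illustrative, not the settings used in the paper):

```python
import numpy as np

def soft_threshold(v, theta):
    # ST operator: shrink each entry toward zero by theta
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def ista(A, y, lam, n_iters=500):
    # Step size 1/L with L = sigma_max(A)^2, the largest eigenvalue of A^T A
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        x = soft_threshold(x + A.T @ (y - A @ x) / L, lam / L)
    return x
```

With an orthonormal dictionary the iteration reduces to a single soft-thresholding of $y$, which gives an easy sanity check of the implementation.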
Beck and Teboulle (2009) proved that ISTA can only achieve a sublinear convergence rate. Recently, a class of methods that unfold traditional iterative algorithms into deep neural networks (DNNs), called Algorithm Unfolding (Monga et al., 2021) or Deep Unfolding (Hershey et al., 2014), has been proposed and has gradually attracted more and more attention. This idea was first proposed by Gregor and LeCun (2010), who viewed ISTA as a recurrent neural network (RNN), unfolded it, and proposed a learning-based model named Learned ISTA (LISTA):
$x_{k+1} = \mathrm{ST}_{\theta_k}\big(W_1^k y + W_2^k x_k\big)$,  (3)
where $W_1^k$, $W_2^k$ and $\theta_k$ are initialized as $\tfrac{1}{L}A^\top$, $I - \tfrac{1}{L}A^\top A$ and $\tfrac{\lambda}{L}$, respectively. All the parameters are learnable and data-driven. Many empirical and theoretical results, such as (Aberdam et al., 2020; Giryes et al., 2018), have shown that LISTA can recover $x^*$ from $y$ more accurately and with one to two orders of magnitude fewer iterations than the original ISTA. Moreover, the linear convergence of a variant of LISTA (i.e., LISTA-CPSS) was proved for the first time in (Chen et al., 2018b). In addition, these networks are more interpretable than generic networks and can thus provide some explanations for deep networks. Indeed, a deep unfolding algorithm (actually a network) is believed to incorporate the priors of the models and algorithms from traditional optimization while retaining the learning capacity that a network obtains from training data.
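A minimal sketch of one LISTA layer in NumPy may make the recurrence concrete. In practice $W_1^k$, $W_2^k$ and $\theta_k$ are trained by back-propagation; here they are simply set to the ISTA-equivalent initialization described above:

```python
import numpy as np

def soft_threshold(v, theta):
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

class LISTALayer:
    """One unfolded layer: x_{k+1} = ST_theta(W1 @ y + W2 @ x_k).

    W1, W2 and theta are the learnable parameters of the layer; this
    sketch sets them to the ISTA initialization from Eq. (3), so an
    untrained stack of layers reproduces plain ISTA.
    """
    def __init__(self, A, lam):
        L = np.linalg.norm(A, 2) ** 2          # largest eigenvalue of A^T A
        self.W1 = A.T / L
        self.W2 = np.eye(A.shape[1]) - A.T @ A / L
        self.theta = lam / L

    def forward(self, x, y):
        return soft_threshold(self.W1 @ y + self.W2 @ x, self.theta)
```

Training would replace the fixed `W1`, `W2`, `theta` with per-layer parameters optimized on data.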
Due to the advantages of the idea of algorithm unfolding, many works inspired by (Gregor and LeCun, 2010), such as (Wang et al., 2016; Sprechmann et al., 2015; Ito et al., 2019; Borgerding et al., 2017; Sreter and Giryes, 2018), have been proposed and successfully applied in various fields. Moreover, a series of studies on LISTA has attracted increasing attention and inspired many subsequent works in different directions, including learning-based optimization (Xie et al., 2019; Sun et al., 2016), the design of DNNs (Metzler et al., 2017; Zhang and Ghanem, 2018; Zhou et al., 2018; Chen et al., 2020; Rick Chang et al., 2017; Zhang et al., 2020; Simon and Elad, 2019) and the interpretation of DNNs (Zarka et al., 2020; Papyan et al., 2017; Aberdam et al., 2019; Sulam et al., 2018, 2019).
Many works, such as (Xin et al., 2016; Giryes et al., 2018; Moreau and Bruna, 2017; Chen et al., 2018b; Liu et al., 2019; Wu et al., 2020; Ablin et al., 2019), discuss and understand LISTA and its variants from a theoretical perspective. Among them, Chen et al. (2018b) proved that there is a coupling relationship between the two learnable matrices in each layer of LISTA, thereby reducing the number of learnable parameters, and they also proved the linear convergence of LISTA for the first time. Later, many subsequent works (Liu et al., 2019; Wu et al., 2020; Ablin et al., 2019) further improved LISTA in different ways. For instance, Liu et al. (2019) simplified the distinct matrix parameters of each layer into the product of a single matrix shared across the network and a per-layer scalar, and proved that matrix parameters obtained by solving an optimization problem achieve the same performance as learned matrices. Then Wu et al. (2020) observed that the magnitudes of the elements in the estimate produced by LISTA may fall below their expected values and, inspired by the gated recurrent unit (GRU) (Cho et al., 2014; Chung et al., 2015), proposed GLISTA, which adds gate mechanisms to LISTA-related algorithms. Besides, we have also made improvements based on LISTA in an earlier work (Li et al., 2021), of which this paper is a condensed version. However, we find that all existing variants of LISTA with convergence guarantees are serial: the residual network (ResNet) (He et al., 2016), which is highly influential in deep learning, has not been introduced into LISTA. An important reason is that changing the original structure of LISTA would destroy its excellent mathematical interpretability. Can we obtain a new LISTA with an interpretable residual structure and a convergence guarantee?
Our Main Contributions: The main contributions of this paper are listed as follows:
We propose a novel unfolding network, named Extragradient based LISTA (ELISTA), a variant of LISTA with a residual structure obtained by introducing the idea of extragradient into LISTA and establishing its relationship with ResNet; this is an improvement of the network structure for solving sparse coding problems. To the best of our knowledge, this is the first LISTA with a residual structure and theoretical guarantees.
We prove the linear convergence of ELISTA. Moreover, we conduct extensive experiments to verify the effectiveness of our algorithm. The experimental results show that our ELISTA is superior to the state-of-the-art methods.
2 Extragradient Based LISTA
In this section, we first introduce the extragradient technique into LISTA, propose an innovative algorithm named Extragradient based LISTA (ELISTA), and describe it in detail. Moreover, we establish the relationship between ELISTA and ResNet, which is one of the reasons why ELISTA is advantageous.
2.1 Extragradient Method
We note that iterative algorithms such as ISTA can be treated as a proximal gradient descent method, a first-order optimization algorithm, for special objective functions. Thus, we want to introduce the idea of extragradient into the related iterative algorithms. The extragradient method was first proposed by Korpelevich (1976) and is a classical method for variational inequality problems. For optimization problems, the idea of extragradient was first used in (Nguyen et al., 2018), which proposed an extended extragradient method (EEG) by combining this idea with some first-order descent methods. In the $k$-th iteration of EEG, it first computes the gradient at the current point $x_k$ and takes a gradient step to obtain an intermediate point $\tilde{x}_k$; it then computes the gradient at $\tilde{x}_k$ and updates the original point $x_k$ along this new gradient to obtain $x_{k+1}$. This is the key idea of extragradient. Intuitively, the additional step in each iteration of EEG allows us to examine the geometry of the problem and take its curvature information into account, which is one of the most important bottlenecks for first-order methods. Thus, by using the idea of extragradient, we can obtain a better iterate after each iteration. The update rules of EEG for Problem (2) can be written as follows:
$\tilde{x}_k = \mathrm{ST}_{\lambda t_1}\big(x_k - t_1 A^\top (A x_k - y)\big)$, $\quad x_{k+1} = \mathrm{ST}_{\lambda t_2}\big(x_k - t_2 A^\top (A \tilde{x}_k - y)\big)$,  (4)

where $t_1$ and $t_2$ are the step sizes of the two steps.
This form of EEG is similar to ISTA, and thus it can be regarded as a generalization of ISTA.
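The two-step EEG iteration for the Lasso can be sketched as follows (the step sizes `t1`, `t2` passed in below are illustrative; a real implementation would tune them to the problem):

```python
import numpy as np

def soft_threshold(v, theta):
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def eeg(A, y, lam, t1, t2, n_iters=500):
    # Extragradient iteration for the Lasso: a prediction step to an
    # intermediate point x_mid, then a correction step from the original
    # point x using the gradient evaluated at x_mid.
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        x_mid = soft_threshold(x - t1 * A.T @ (A @ x - y), lam * t1)  # prediction
        x = soft_threshold(x - t2 * A.T @ (A @ x_mid - y), lam * t2)  # correction
    return x
```

Note that any fixed point of this iteration satisfies $\tilde{x}_k = x_k$, so the fixed points coincide with those of the proximal gradient step, i.e., with solutions of Problem (2).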
2.2 Extragradient Based LISTA and the Relationship with ResNet
In order to speed up the convergence of EEG, we combine the algorithm with deep networks and regard the two matrices and the two thresholds of the two steps in (4) as learnable parameters, obtaining the following update rules:
$\tilde{x}_k = \mathrm{ST}_{\theta_k^1}\big(x_k - (W_1^k)^\top (A x_k - y)\big)$, $\quad x_{k+1} = \mathrm{ST}_{\theta_k^2}\big(x_k - (W_2^k)^\top (A \tilde{x}_k - y)\big)$.  (5)
However, since the above scheme has two different matrices $W_1^k$ and $W_2^k$ to learn in each layer, the number of network parameters greatly increases and the training of the network slows down significantly. Therefore, to address this issue and further connect the two steps of (5), we convert $W_1^k$ and $W_2^k$ into $t_k^1 W^k$ and $t_k^2 W^k$, respectively, where $t_k^1$ and $t_k^2$ are two learnable scalars. Then, inspired by (Liu et al., 2019), we tie the $W^k$ of each layer to the same matrix $W$ and obtain a tied algorithm, which significantly reduces the number of learnable parameters. Finally, we obtain the following update rules for our Extragradient Based LISTA (ELISTA):
$\tilde{x}_k = \mathrm{ST}_{\theta_k^1}\big(x_k - t_k^1 W^\top (A x_k - y)\big)$, $\quad x_{k+1} = \mathrm{ST}_{\theta_k^2}\big(x_k - t_k^2 W^\top (A \tilde{x}_k - y)\big)$.  (6)
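One ELISTA layer can be sketched as follows. The shared matrix `W`, the scalars `t1`, `t2` and the thresholds `theta1`, `theta2` are shown as plain constructor arguments rather than trained parameters; in the actual network they are learned per layer (with `W` shared across layers):

```python
import numpy as np

def soft_threshold(v, theta):
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

class ELISTALayer:
    """One ELISTA layer: a prediction step to an intermediate point,
    then a correction step from the original point using the residual
    at the intermediate point. W is shared across layers; t1, t2,
    theta1, theta2 are per-layer learnables."""
    def __init__(self, A, W, t1, t2, theta1, theta2):
        self.A, self.W = A, W
        self.t1, self.t2 = t1, t2
        self.theta1, self.theta2 = theta1, theta2

    def forward(self, x, y):
        # Prediction step to the intermediate point x_mid
        x_mid = soft_threshold(
            x - self.t1 * self.W.T @ (self.A @ x - y), self.theta1)
        # Correction step: update x using the residual at x_mid
        return soft_threshold(
            x - self.t2 * self.W.T @ (self.A @ x_mid - y), self.theta2)
```

Since both steps have the skip-connected form "$x$ plus a thresholded update", each layer resembles a residual block, which is the structural correspondence with ResNet discussed next.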
According to (6), we can obtain the network structure diagram of ELISTA, shown in Figure 1. Through observation and comparison, we find that the network structure of ELISTA corresponds to that of ResNet. Since $y$ is already given, we can regard the term involving $y$ as a bias. Thus, from Figure 1, we can see that the structure of each ELISTA block is the same as that of a ResNet block, including the weight layer, the activation function and the identity connection. It is well known that ResNet achieves better performance by improving the network structure; it is therefore meaningful to study the internal mathematical mechanism behind ResNet. On the one hand, to some extent, our algorithm may be regarded as a mathematical explanation of why ResNet is superior. On the other hand, the connection between ELISTA and ResNet may help explain why our algorithm performs better than existing methods. Besides, there is a line of work that interprets networks with ordinary differential equations (ODEs) by viewing an ODE as a continuous counterpart of the residual network (Chen et al., 2018a). However, ODEs can only explain networks with linear connection blocks, while our blocks are nonlinear; on the other hand, the form of our blocks is less general than that of ODEs.
3 Convergence Analysis
In this section, we provide the convergence analysis of our algorithm. We first give a basic assumption and then provide the convergence property of ELISTA. We note that our analysis, like Theorems 3 and 4 of (Wu et al., 2020), is proved allowing the existence of “false positives”, while the theoretical analysis of (Chen et al., 2018b; Liu et al., 2019) was carried out under the assumption of no “false positives”, which is difficult to satisfy in reality.
Assumption 1 (Basic assumption).
The signal $x^*$ is sampled from the following set:

$\mathcal{X}(B, s) \triangleq \{x^* : |x_i^*| \le B \ \forall i, \ \|x^*\|_0 \le s\}$.

In other words, $x^*$ is bounded and $s$-sparse. Furthermore, we assume that the noise $\varepsilon = 0$.
This assumption is a basic assumption for this class of algorithms. Almost all the related algorithms need to satisfy this assumption, e.g., (Liu et al., 2019; Wu et al., 2020).
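For illustration, signals satisfying this assumption can be generated as follows; note that the uniform distribution on the nonzeros is our own choice for the sketch, not part of the assumption itself:

```python
import numpy as np

def sample_signal(n, s, B, rng):
    # Draw a signal from X(B, s): at most s nonzeros, magnitudes bounded by B.
    # The uniform law on the nonzeros is an illustrative assumption.
    x = np.zeros(n)
    support = rng.choice(n, size=s, replace=False)
    x[support] = rng.uniform(-B, B, size=s)
    return x
```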
Based on the assumption, we can get the linear convergence of ELISTA, which can be given by the following theorem.
Theorem 1 (Linear Convergence for ELISTA).
If Assumption 1 holds, the generalized mutual coherence $\tilde{\mu} = \tilde{\mu}(W, A)$ can be attained by selecting $W$ properly, the thresholds

$\theta_k^1 = \tilde{\mu} \sup_{x^* \in \mathcal{X}(B,s)} \|\tilde{x}_k - x^*\|_1$, $\quad \theta_k^2 = \tilde{\mu} \sup_{x^* \in \mathcal{X}(B,s)} \|x_k - x^*\|_1$  (7)

are achieved, and $s$ is small enough, then for the sequences $\{x_k\}$ generated by ELISTA, “false positives” may exist, and

$\|x_k - x^*\|_2 \le sB \exp(-ck)$,  (8)

where $c > 0$ is a constant determined by $\tilde{\mu}$, $s$ and $B$.
The definition of the generalized mutual coherence $\tilde{\mu}(W, A)$ and the associated weight set can be found in Definition 1 of (Liu et al., 2019), and Lemma 1 of (Chen et al., 2018b) guarantees that such a $W$ exists. Besides, the definition of “false positives” can be found by referring to Definition 2 of (Wu et al., 2020). Theorem 1 shows that our ELISTA attains linear convergence. We note that, due to page limits, we do not give the detailed proof of Theorem 1 here; we will provide it in our future work.
4 Experimental Results
In this section, we evaluate our ELISTA in terms of sparse representation performance and 3D geometry recovery via photometric stereo. All the experimental settings are the same as in the previous works (Chen et al., 2018b; Liu et al., 2019; Wu et al., 2020). However, the performance of support selection (SS) (Chen et al., 2018b) is greatly affected by its hyperparameters, and setting them requires knowing the sparsity of $x^*$ in advance, which is difficult to obtain in real situations. Thus, to compare the impact of the network itself on performance more fairly, none of the networks use SS. All training follows (Chen et al., 2018b). For all the methods, the step-size parameters are initialized as 1.0, and the thresholds follow the initialization in (Chen et al., 2018b). All the results are averaged over ten runs.
4.1 Sparse Representation Performance
In this subsection, we compare our ELISTA with the state-of-the-art methods: LISTA, LAMP and GLISTA. We train the networks under two noise levels, SNR (Signal-to-Noise Ratio) = 30 and the noiseless case (SNR = $\infty$), and with three ill-conditioned dictionary matrices with condition numbers $\kappa$ = 5, 50 and 500. For the dimension settings and detailed data generation methods, please see (Li et al., 2021). The results are reported in Table 2.

Table 2: Recovery accuracy of the compared methods (higher is better).

  Setting                      LISTA    LAMP     GLISTA   ELISTA
  $\kappa$ = 5,   SNR = $\infty$   38.658   44.967   65.569   83.997
  $\kappa$ = 50,  SNR = $\infty$   37.471   46.385   63.523   82.848
  $\kappa$ = 500, SNR = $\infty$   31.845   43.097   57.542   77.865
  SNR = 30                     23.593   25.045   32.757   32.832
Table 2 shows that our method clearly outperforms the compared methods in the noiseless case; in particular, the NMSE performance of our method is almost twice that of LISTA. In the presence of noise, our method also achieves state-of-the-art accuracy.
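For reference, a common way to compute the NMSE metric (in dB) used in such comparisons is sketched below; the exact averaging and sign convention reported in the table may differ:

```python
import numpy as np

def nmse_db(x_hat, x_true):
    # Normalized mean squared error in decibels;
    # more negative values mean more accurate recovery.
    return 10.0 * np.log10(np.sum((x_hat - x_true) ** 2)
                           / np.sum(x_true ** 2))
```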
4.2 3D Geometry Recovery via Photometric Stereo
Table 3: Errors of the recovered surface normals (lower is better).

        LISTA     GLISTA    ELISTA
  35    0.06836   0.06249   0.04724
  25    0.09664   0.10033   0.06597
  15    0.69334   0.63967   0.53269
In this subsection, we compare our ELISTA with the state-of-the-art methods LISTA and GLISTA on 3D geometry recovery via photometric stereo, a powerful technique that recovers high-resolution surface normals of a 3D scene from the appearance changes of 2D images under different lighting (Woodham, 1980). In practice, however, the estimation process is often corrupted by non-Lambertian effects, such as highlights, shadows, or image noise. This problem can be addressed by decomposing the observations of the superimposed images under different lighting conditions into an ideal Lambertian component and a sparse error term (Wu et al., 2010; Ikehata et al., 2012), i.e., $o = \rho L n + e$, where $o$ denotes the resulting measurements, $n$ denotes the true surface normal, $L$ contains the lighting directions, $\rho$ is the diffuse albedo, acting here as a scalar multiplier, and $e$ is an unknown sparse vector. By multiplying both sides by the orthogonal complement of $L$, the Lambertian component vanishes, and the sparse error $e$ can be obtained by solving the resulting sparse coding problem; we can then use $e$ to recover $n$. The main experimental settings follow (Xin et al., 2016; Wu et al., 2020; He et al., 2017). Tests are performed using the 32-bit HDR gray-scale images of the object “Bunny” as in (Xin et al., 2016), where 40% of the elements of the sparse noise are nonzero. From Table 3, we can see that our method performs much better than LISTA and GLISTA.
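The projection step described above can be sketched as follows, assuming the per-pixel model $o = \rho L n + e$; here `L` denotes the lighting matrix, not the Lipschitz constant used earlier:

```python
import numpy as np

def lambertian_complement(L):
    """Projector onto the orthogonal complement of the column space of
    the lighting matrix L. Since P @ (L @ n) == 0 for any normal n,
    P @ o depends only on the sparse error term e, which can then be
    recovered by sparse coding with dictionary P."""
    Q, _ = np.linalg.qr(L)          # orthonormal basis of col(L)
    return np.eye(L.shape[0]) - Q @ Q.T
```

Once $e$ is estimated from $P o = P e$, the Lambertian part $o - e$ yields the surface normal by ordinary least squares.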
5 Conclusions
We proposed a novel extragradient-based learned iterative shrinkage-thresholding algorithm (called ELISTA) with an interpretable residual structure. Moreover, we proved that ELISTA achieves linear convergence. Extensive empirical results verified the high efficiency of our method. This could have both theoretical and practical impact on the relationship between new neural network architectures and advanced algorithms, and potentially deepen our understanding of the interpretability of deep learning models. One limitation of this paper is that we use the same assumption as previous work (Chen et al., 2018b; Liu et al., 2019; Wu et al., 2020), namely that the sparsity of $x^*$ is small enough. Removing this common assumption is left for future work.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Nos. 61876221, 61876220 and 61976164), the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (No. 61621005), the Major Research Plan of the National Natural Science Foundation of China (Nos. 91438201 and 91438103), the Program for Cheung Kong Scholars and Innovative Research Team in University (No. IRT_15R53), the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) (No. B07048), and the National Science Basic Research Plan in Shaanxi Province of China (No. 2020JM194).
References
Aberdam et al. (2020). Ada-LISTA: learned solvers adaptive to varying models. arXiv preprint arXiv:2001.08456.
Aberdam et al. (2019). Multi-layer sparse coding: the holistic way. SIAM Journal on Mathematics of Data Science, 1(1):46–77.
Ablin et al. (2019). Learning step sizes for unfolded sparse coding. In Advances in Neural Information Processing Systems, pp. 13100–13110.
Beck and Teboulle (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.
Blumensath and Davies (2008). Iterative thresholding for sparse approximations. Journal of Fourier Analysis and Applications, 14(5-6):629–654.
Borgerding et al. (2017). AMP-inspired deep networks for sparse linear inverse problems. IEEE Transactions on Signal Processing, 65(16):4293–4308.
Chen et al. (2018a). Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583.
Chen et al. (2018b). Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds. In Advances in Neural Information Processing Systems, pp. 9061–9071.
Chen et al. (2020). RNA secondary structure prediction by learning unrolled algorithms. In Proceedings of the International Conference on Learning Representations.
Cho et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Chung et al. (2015). Gated feedback recurrent neural networks. In International Conference on Machine Learning, pp. 2067–2075.
Daubechies et al. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457.
Donoho et al. (2009). Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919.
Efron et al. (2004). Least angle regression. Annals of Statistics, 32(2):407–499.
Giryes et al. (2018). Tradeoffs between convergence speed and reconstruction accuracy in inverse problems. IEEE Transactions on Signal Processing, 66(7):1676–1690.
Gregor and LeCun (2010). Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning, pp. 399–406.
He et al. (2017). From Bayesian sparsity to gated recurrent nets. In Advances in Neural Information Processing Systems, pp. 5554–5564.
He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Hershey et al. (2014). Deep unfolding: model-based inspiration of novel deep architectures. arXiv preprint arXiv:1409.2574.
Ikehata et al. (2012). Robust photometric stereo using sparse regression. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 318–325.
Ito et al. (2019). Trainable ISTA for sparse signal recovery. IEEE Transactions on Signal Processing, 67(12):3113–3125.
Korpelevich (1976). The extragradient method for finding saddle points and other problems. Matecon, 12:747–756.
Li et al. (2021). Learned extragradient ISTA with interpretable residual structures for sparse coding. In Proceedings of the AAAI Conference on Artificial Intelligence.
Liu et al. (2019). ALISTA: analytic weights are as good as learned weights in LISTA. In Proceedings of the International Conference on Learning Representations.
Metzler et al. (2017). Learned D-AMP: principled neural network based compressive image recovery. In Advances in Neural Information Processing Systems, pp. 1772–1783.
Monga et al. (2021). Algorithm unrolling: interpretable, efficient deep learning for signal and image processing. IEEE Signal Processing Magazine, 38(2):18–44.
Moreau and Bruna (2017). Understanding trainable sparse coding via matrix factorization. In Proceedings of the International Conference on Learning Representations.
Nguyen et al. (2018). Extragradient method in optimization: convergence and complexity. Journal of Optimization Theory and Applications, 176(1):137–162.
Papyan et al. (2017). Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research, 18(1):2887–2938.
Rick Chang et al. (2017). One network to solve them all: solving linear inverse problems using deep projection models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5888–5897.
Simon and Elad (2019). Rethinking the CSC model for natural images. In Advances in Neural Information Processing Systems, pp. 2271–2281.
Sprechmann et al. (2015). Learning efficient sparse and low rank models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1821–1833.
Sreter and Giryes (2018). Learned convolutional sparse coding. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2191–2195.
Sulam et al. (2019). On multi-layer basis pursuit, efficient algorithms and convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Sulam et al. (2018). Multi-layer convolutional sparse modeling: pursuit and dictionary learning. IEEE Transactions on Signal Processing, 66(15):4090–4104.
Sun et al. (2016). Deep ADMM-Net for compressive sensing MRI. In Advances in Neural Information Processing Systems, pp. 10–18.
Tibshirani (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
Wang et al. (2016). Learning deep encoders. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.
Woodham (1980). Photometric method for determining surface orientation from multiple images. Optical Engineering, 19(1):139–144.
Wu et al. (2020). Sparse coding with gated learned ISTA. In Proceedings of the International Conference on Learning Representations.
Wu et al. (2010). Robust photometric stereo via low-rank matrix completion and recovery. In Asian Conference on Computer Vision, pp. 703–717.
Xie et al. (2019). Differentiable linearized ADMM. In Proceedings of the International Conference on Machine Learning.
Xin et al. (2016). Maximal sparsity with deep networks? In Advances in Neural Information Processing Systems, pp. 4340–4348.
Zarka et al. (2020). Deep network classification by scattering and homotopy dictionary learning. In Proceedings of the International Conference on Learning Representations.
Zhang and Ghanem (2018). ISTA-Net: interpretable optimization-inspired deep network for image compressive sensing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1828–1837.
Zhang et al. (2020). A novel learnable gradient descent type algorithm for non-convex non-smooth inverse problems. arXiv preprint arXiv:2003.06748.
Zhou et al. (2018). SC2Net: sparse LSTMs for sparse coding. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.