Supervised Learning of Sparsity-Promoting Regularizers for Denoising

06/09/2020 ∙ by Michael T. McCann, et al. ∙ Michigan State University

We present a method for supervised learning of sparsity-promoting regularizers for image denoising. Sparsity-promoting regularization is a key ingredient in solving modern image reconstruction problems; however, the operators underlying these regularizers are usually either designed by hand or learned from data in an unsupervised way. The recent success of supervised learning (mainly convolutional neural networks) in solving image reconstruction problems suggests that it could be a fruitful approach to designing regularizers. As a first experiment in this direction, we propose to denoise images using a variational formulation with a parametric, sparsity-promoting regularizer, where the parameters of the regularizer are learned to minimize the mean squared error of reconstructions on a training set of (ground truth image, measurement) pairs. Training involves solving a challenging bilevel optimization problem; we derive an expression for the gradient of the training loss using Karush-Kuhn-Tucker conditions and provide an accompanying gradient descent algorithm to minimize it. Our experiments on a simple, synthetic denoising problem show that the proposed method can learn an operator that outperforms well-known regularizers (total variation, DCT-sparsity, and unsupervised dictionary learning) and collaborative filtering. While the approach we present is specific to denoising, we believe that it can be adapted to the whole class of inverse problems with linear measurement models, giving it applicability to a wide range of image reconstruction problems.




1 Introduction

Image reconstruction problems, where an image must be recovered from its noisy measurements, appear in a wide range of fields, including computer vision, biomedical imaging, astronomy, nondestructive testing, remote sensing, and geophysical imaging; see Bertero and Boccacci (1998) for a textbook-length introduction, or McCann and Unser (2019); Ravishankar et al. (2020) for tutorial-length introductions. One prominent approach to solving these problems is to formulate reconstruction as an optimization problem: seeking the image that best fits the measured data according to a prescribed model. We refer to this approach as variational reconstruction because it relies on the calculus of variations to find extrema. These models typically comprise a model of the imaging system (including noise) and a model of plausible images. We restrict the current discussion to objective functions of the form $\frac{1}{2}\|Hx - y\|_2^2 + R(x)$, where $x$ is a (vectorized) image, $y$ are its noisy measurements, $H$ is a linear forward model, and $R$ is a regularization functional. While this formulation implicitly assumes a linear imaging system and Gaussian noise, it is nonetheless applicable to many modern imaging problems.

A prominent theme in designing the regularization functional, $R$, has been that of sparsity: the idea that the reconstructed image should admit a sparse (having a small number of nonzero elements) representation in some domain. Examples of these models would be synthesis sparsity ($x = Dz$ and $z$ is sparse) Tropp (2004), analysis sparsity ($Wx$ is sparse) Rubinstein et al. (2013), and transform sparsity ($Wx \approx z$ and $z$ is sparse) Ravishankar and Bresler (2013). A common and successful approach to promoting sparsity in image reconstructions is to use a regularization functional of the form $R(x) = \|Wx\|_1$, where $W$ is called a sparsifying transform, analysis dictionary, or analysis operator; we adopt the last of these terms. The resulting objective functions are usually convex, and results from compressive sensing Candes et al. (2006) show that, under certain conditions, the $\ell_1$ penalty provides solutions with exact sparsity. While there are several choices for $W$ that work well in practice (e.g., wavelets, finite differences, Fourier or discrete cosine transform (DCT)), several authors have also sought to learn $W$ from data, an approach called dictionary (or transform) learning Tosic and Frossard (2011); Rubinstein et al. (2013); Ravishankar and Bresler (2013).

While variational approaches have dominated the field of image reconstruction for the past decades, a recent trend has been to train supervised methods, especially convolutional neural networks (CNNs), to solve image reconstruction problems. Recent reviews McCann et al. (2017); Ongie et al. (2020) give a good picture of this evolving field, but we want to point out a few trends here. Many recent papers attempt to incorporate some aspects of variational reconstruction into a supervised scheme. One way of doing that is by learning some aspect of the regularization, e.g., via learned shrinkage operators Hel-Or and Shaked (2008); Shtok et al. (2013); Kamilov and Mansour (2016); Nguyen et al. (2018), learned plug-and-play denoisers Al-Shabili et al. (2020), using a CNN to enforce a constraint-type regularization  Gupta et al. (2018), or allowing a CNN to take the part of the regularizer in an unrolled iterative algorithm Aggarwal et al. (2019). Broadly speaking, these are attempts to combine the benefits of variational reconstruction (theoretical guarantees, parsimony) with the benefits of supervised learning (data adaptivity, state-of-the-art performance). The approach in this work falls on the same spectrum, but is decidedly on the “shallow learning” end—we aim to bring the benefits of supervision to variational reconstruction, rather than the other way around.

We propose to solve image reconstruction problems using a sparsifying analysis operator that is learned from training data in a supervised way. Our learning problem is

$$\arg\min_{\theta} \; \mathcal{J}(\theta) = \sum_{t=1}^{T} \frac{1}{2} \left\| x^*(W_\theta, y_t) - x_t \right\|_2^2, \qquad \text{(1a)}$$

$$\text{where} \quad x^*(W, y) = \arg\min_{x} \; \frac{1}{2} \| x - y \|_2^2 + \beta \| W x \|_1, \qquad \text{(1b)}$$

where $W_\theta$ is a sparsifying operator parameterized by $\theta$, $x_t$ and $y_t$ are the $t$th training image and its corresponding noise-corrupted version, and $\beta$ is a fixed, scalar parameter that controls the regularization strength during reconstruction. Note that while (1a) refers to a specific parameterization of the sparsifying operator, $W_\theta$, (1b) is defined for any operator $W$, regardless of parameterization; we will continue to use $W_\theta$ when the role of the parameterization is important and $W$ otherwise. Equation (1b) is a variational formulation of image reconstruction for denoising (i.e., $H = I$), where we have made the dependence on the operator explicit. Problem (1a) is the minimum mean squared error (MMSE) minimization problem on the training data. In statistical terms, the interpretation of (1) is that we are seeking a prior (parameterized by $\theta$) so that the MAP estimator (given by (1b)) approximates the MMSE estimator Gribonval and Machart (2013). Because we aim to work on images, we assume that $W$ is too large to fit into memory and is instead implemented via functions for applying $W$ and its transpose to vectors of the appropriate size. This is true, e.g., when $W$ implements convolution(s), as is the case in our experiments.
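To make the idea of an operator that is available only through functions concrete, here is a minimal sketch in plain Python. It assumes a one-dimensional finite-difference operator purely for illustration (the experiments use two-dimensional convolutions), and the names `apply_W`, `apply_Wt`, and `dot` are hypothetical. The adjoint identity $\langle Wx, v \rangle = \langle x, W^T v \rangle$ is a common sanity check for such a pair of functions.

```python
def apply_W(x):
    """Apply a 1D finite-difference operator: (W x)[i] = x[i+1] - x[i]."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def apply_Wt(v):
    """Apply the transpose of the operator above without forming a matrix."""
    out = [0.0] * (len(v) + 1)
    for i, vi in enumerate(v):
        out[i] -= vi      # row i of W has -1 in column i
        out[i + 1] += vi  # and +1 in column i + 1
    return out


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Adjoint check: <W x, v> should equal <x, W^T v>.
x = [1.0, 4.0, 9.0, 16.0]
v = [1.0, 2.0, 3.0]
lhs = dot(apply_W(x), v)
rhs = dot(x, apply_Wt(v))
```

Only these two functions are needed by iterative solvers such as ADMM and CG, which is why the matrix itself never has to be stored.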

Related Work

A 2011 work Peyré and Fadili (2011) poses the problem (1) except with a differentiable version of the $\ell_1$ norm and provides experiments for one-dimensional signals. In Mairal et al. (2012), the authors address a synthesis version of (1), wherein the reconstruction problem involves finding sparse codes, such that is small. This change from the analysis to synthesis formulation means that the optimization techniques used in Mairal et al. (2012) do not apply here. In Sprechmann et al. (2013), the authors derive gradients for a generalization of (1) by relaxing to . This approach gives the gradient in the limit of ; however, the expression requires computing the eigendecomposition of a large matrix. Therefore the authors use the relaxed version in practice. In a brief preprint Chen et al. (2014), the authors derive a gradient for (1) by expressing $W$ in a fixed basis and using a differentiable relaxation. Finally, Chen et al. (2014) provides a nice overview of the topic of analysis operator learning in its various forms, and also tackles (1) using a differentiable sparsity penalty.


The main contribution of this work is to derive a gradient scheme for minimizing (1) without relying on relaxation. Smooth relaxations of the $\ell_1$ penalty term can result in slow convergence Peyré and Fadili (2011), which our formulation avoids. Our formulation has the added advantage of promoting exact sparsity Candes et al. (2006) and also has a close connection to CNNs: the regularization functional in (1b) can be expressed as a shallow CNN with rectified linear unit (ReLU) activation functions. This connection indicates that we may be able to generalize the methods here to learn a more complex CNN-based regularizer. The computational bottleneck of our approach is solving the reconstruction problem (1b) at each step of gradient descent on (1a); we propose a warm-start scheme to ameliorate this cost, allowing training to finish in hours on a consumer-level GPU. In a denoising experiment, we demonstrate that the sparsifying operator learned using our method provides improved performance over popular fixed operators, as well as an operator learned in an unsupervised manner.
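The ReLU connection mentioned above rests on the identity $|a| = \mathrm{ReLU}(a) + \mathrm{ReLU}(-a)$, so that the $\ell_1$ norm of $Wx$ is a sum of ReLU outputs of the linear layers $W$ and $-W$. The following is a minimal sketch of that identity with a small dense matrix; the helper names are hypothetical and a real implementation would use convolutions.

```python
def relu(v):
    """Elementwise rectified linear unit."""
    return [max(a, 0.0) for a in v]


def matvec(W, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def l1_via_relu(W, x):
    """Compute ||W x||_1 as sum(relu(W x)) + sum(relu(-W x))."""
    wx = matvec(W, x)
    return sum(relu(wx)) + sum(relu([-a for a in wx]))


# Finite differences of x = (3, 1, 2) are (2, -1), whose l1 norm is 3.
W = [[1.0, -1.0, 0.0],
     [0.0, 1.0, -1.0]]
x = [3.0, 1.0, 2.0]
value = l1_via_relu(W, x)
```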

2 Methods

In order to apply gradient-based techniques to solving (1), our first goal is to compute the gradient of $\mathcal{J}$ with respect to $\theta$. The challenge lies in computing the partial derivatives (with respect to $W$) of the elements of each vector $x^*(W, y_t)$ (i.e., the Jacobian matrix of $x^*$), after which we can obtain the desired result using the chain rule. Again we note that we expect $W$ to be too large to fit into memory, thus our derivation does not rely on factoring $W$. Our approach is first to develop a closed-form expression (i.e., not including $\arg\min$) for $x^*$, and then to derive the desired gradient.

2.1 Closed-form Expression for $x^*$

Consider the functional

$$\mathcal{L}(x) = \frac{1}{2} \| x - y \|_2^2 + \beta \| W x \|_1. \qquad \text{(2)}$$

It is strictly convex in $x$ (because the $\ell_2$ norm term is strictly convex and the $\ell_1$ norm term is convex) and therefore has a unique global minimizer. Thus we are justified in writing $x^*(W) = \arg\min_x \mathcal{L}(x)$ without the possibility of the minimizer not existing or being a nonsingleton. Note that although $x^*$ depends on $W$, $y$, and $\beta$, the $y$- and $\beta$-dependence is not relevant for this derivation and we will not continue to notate it explicitly.

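Our key insight is that we can have a closed-form expression for $x^*$ in terms of $W$ if we know the sign pattern of $Wx^*$, and we can always find the sign pattern of $Wx^*$ by solving (1b). We first define some notation. Let $s(W)$ denote the sign pattern associated with a given $W$, i.e., $s(W) = \mathrm{sign}(W x^*(W))$, where $[\mathrm{sign}(v)]_i$ is defined to be -1 when $v_i < 0$; 0 when $v_i = 0$; and 1 when $v_i > 0$. Considering a fixed $s_0$, we define matrices that pull out the rows of $W$ that give rise to zero, negative, and positive values in $Wx^*$. Let $k_0$, $k_{\neq 0}$, $k_-$, and $k_+$ denote the number of zero, nonzero, negative, and positive elements of $s_0$, respectively. Similarly, let $\iota_0(i)$, $\iota_-(i)$, and $\iota_+(i)$ denote the index of the $i$th zero, negative, and positive element of $s_0$. Let $S_0 \in \mathbb{R}^{k_0 \times k}$ and $S_\pm \in \mathbb{R}^{k_{\neq 0} \times k}$, where $k$ is the number of rows of $W$, be defined as

$$[S_0]_{i,j} = \begin{cases} 1 & \text{if } j = \iota_0(i), \\ 0 & \text{otherwise}, \end{cases} \qquad [S_\pm]_{i,j} = \begin{cases} -1 & \text{if } i \le k_- \text{ and } j = \iota_-(i), \\ 1 & \text{if } i > k_- \text{ and } j = \iota_+(i - k_-), \\ 0 & \text{otherwise}. \end{cases} \qquad \text{(3)}$$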
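With this notation in place, we can write that

$$x^*(W) = \arg\min_{x} \; \frac{1}{2} \| x - y \|_2^2 + \beta \| W x \|_1 \quad \text{s.t.} \quad S_0 W x = 0, \qquad \text{(4)}$$

for all $W$ such that $s(W) = s_0$. This is true because whenever $s(W) = s_0$, the minimizer of (2) is feasible for (4). Similarly, we use $S_\pm$ to simplify the $\ell_1$ norm term,

$$x^*(W) = \arg\min_{x} \; \frac{1}{2} \| x - y \|_2^2 + \beta \mathbf{1}^T S_\pm W x \quad \text{s.t.} \quad S_0 W x = 0, \qquad \text{(5)}$$

where $\mathbf{1}$ is a vector of ones, and, again, the equality holds for all $W$ such that $s(W) = s_0$.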

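Now that we have transformed the problem into an equality-constrained quadratic minimization, we can use standard results (e.g., see Boyd and Vandenberghe (2004) Section 10.1.1) to state the KKT conditions for (5):

$$\underbrace{\begin{bmatrix} I & W^T S_0^T \\ S_0 W & 0 \end{bmatrix}}_{A} \underbrace{\begin{bmatrix} x^* \\ \nu \end{bmatrix}}_{z} = \underbrace{\begin{bmatrix} y - \beta W^T S_\pm^T \mathbf{1} \\ 0 \end{bmatrix}}_{b}, \qquad \text{(6)}$$

where $\nu$ is the vector of Lagrange multipliers for the equality constraint and the underbraces give names ($A$, $z$, and $b$) to each quantity to simplify the subsequent notation. Because $I$ is nonsingular, $A$ is invertible whenever $S_0 W$ has full row rank Boyd and Vandenberghe (2004).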

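In order to cleanly write the result, we define two more selection matrices that are useful for pulling out the $x^*$ and $\nu$ parts of $z$. Let $P_x$ and $P_\nu$ be defined as $P_x = [\, I_{n \times n} \;\; 0_{n \times k_0} \,]$ and $P_\nu = [\, 0_{k_0 \times n} \;\; I_{k_0 \times k_0} \,]$, where $n$ is the length of $x$. We can then state the following result:

Theorem 1 (Solutions of sparse analysis-form denoising)

For any $s_0$, and for all $W$ such that $s(W) = s_0$ with $S_0 W$ full row rank, the minimizer of the sparse analysis-form denoising problem (2) is given by $x^*(W) = P_x A^{-1} b$, where each term in the right hand side depends on $W$ as described above.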

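Theorem 1 provides a closed-form expression for $x^*(W)$ that is valid in each region where $s(W)$ is constant. Because our formulation results in exact sparsity, these regions can have nonempty interiors, allowing us to compute gradients. As a brief example of this, consider the scalar denoising problem $x^*(w) = \arg\min_x \frac{1}{2}(x - y)^2 + \beta |wx|$. Assuming that $y > 0$, one can show that $x^*(w) = y - \beta|w|$ when $|w| < y/\beta$ and $x^*(w) = 0$ otherwise. As a result, $\partial x^* / \partial w = -\beta$ when $0 < w < y/\beta$, $\partial x^* / \partial w = \beta$ when $-y/\beta < w < 0$, and $\partial x^* / \partial w = 0$ when $|w| > y/\beta$; a similar result holds when $y < 0$. Thus $x^*$ is smooth except at $w \in \{-y/\beta, 0, y/\beta\}$, which form a set of measure 0.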
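The scalar case can be checked numerically. The sketch below assumes the scalar problem is $\min_x \frac{1}{2}(x - y)^2 + \beta|wx|$, whose minimizer is the soft thresholding of $y$ at level $\beta|w|$; the function names are hypothetical. It evaluates the minimizer and a central finite-difference slope with respect to $w$, showing the piecewise-constant derivative on either side of the threshold $|w| = y/\beta$.

```python
def x_star(w, y, beta):
    """Minimizer of 0.5*(x - y)**2 + beta*abs(w*x), i.e. soft thresholding of y."""
    tau = beta * abs(w)
    if y > tau:
        return y - tau
    if y < -tau:
        return y + tau
    return 0.0


def dx_dw(w, y, beta, eps=1e-6):
    """Central finite-difference approximation of d x_star / d w."""
    return (x_star(w + eps, y, beta) - x_star(w - eps, y, beta)) / (2.0 * eps)


y, beta = 1.0, 0.5            # the threshold is |w| = y / beta = 2
inside = dx_dw(1.0, y, beta)   # nonzero sign pattern: slope is -beta
mirror = dx_dw(-1.0, y, beta)  # negative w: slope is +beta
outside = dx_dw(3.0, y, beta)  # x_star is exactly zero: slope is 0
```

Away from the three breakpoints the slope is constant, which is the scalar picture of the regions with fixed sign pattern described above.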

Also, note that Theorem 1 could be extended to cover the case where $S_0 W$ is not full rank by expressing $x^*$ in terms of the pseudoinverse of $A$. In that case, the following derivations would need to be adapted to use the gradient of the pseudoinverse Golub and Pereyra (1973).

2.2 Gradient of $\mathcal{J}$ With Respect to $W$

Even using automatic differentiation software tools (e.g., PyTorch Paszke et al. (2019)), Theorem 1 does not provide a way to compute the gradient of $\mathcal{J}$ in (1) with respect to $W$ because of the presence of $A^{-1}$. Due to the size of $A$ in practice, the inverse needs to be computed iteratively, and tracking gradients through hundreds of iterations would require an impractical amount of memory. To avoid this problem, we use matrix calculus to manipulate the gradients into a form amenable to automatic differentiation. In the following, we use the notation of Minka (2000) (a resource we highly recommend for this type of math), where $df$ is defined to be the part of $f(x + dx) - f(x)$ that is linear in $dx$. As in (1), we use the subscript $t$ to denote quantities that depend on the $t$th training pair. From Theorem 1 and matrix calculus rules (in particular, $d(A^{-1}) = -A^{-1} \, dA \, A^{-1}$), we have

$$d\mathcal{J} = \sum_t (x_t^* - x_t)^T P_x \, d(A_t^{-1} b_t) = \sum_t q_t^T (db_t - dA_t \, z_t), \quad z_t = A_t^{-1} b_t, \quad q_t = A_t^{-T} P_x^T (x_t^* - x_t), \qquad \text{(7)}$$

$$\nabla \mathcal{J} = \sum_t \nabla \left( q_t^T (b_t - A_t z_t) \right), \quad \text{with } q_t \text{ and } z_t \text{ held constant}, \qquad \text{(8)}$$

with $z_t$ and $q_t$ as defined in (7) and where all gradients are with respect to $W$. As we will describe in the next section, it turns out that, with the help of automatic differentiation software, (8) is sufficient to compute the desired gradient.

2.3 Implementation of Gradient Calculation

We now give an outline of how we compute this gradient in practice. As we have stated before, we expect $W$ to be too large to fit in memory. As a result, our desired gradient is not with respect to $W$, but actually with respect to $\theta$; luckily, automatic differentiation software can handle this detail for us. For a given $\theta$ and for each $t$:

  1. Solve the reconstruction problem (1b) to find $x_t^*$. This can be accomplished via established techniques for convex optimization; we use the ADMM Boyd et al. (2011) with a split variable for $Wx$.

  2. Determine the selection matrices $S_0$ and $S_\pm$. Because the $x_t^*$ obtained via ADMM may still have some small error, there may not be exact zeros in $W x_t^*$, which complicates determining $s(W)$. Our approach is to look for zeros in the ADMM split variable corresponding to $Wx$ because it is approximately equal to $W x_t^*$ and because it has exact zeros (it is the result of soft thresholding in ADMM). Alternatively, setting some small threshold on the magnitude would have a similar effect.

  3. Solve the KKT system (6) to find the $z_t$ in (7). We use the conjugate gradient (CG) algorithm Shewchuk (1994). While $A_t$ is symmetric and therefore CG could be applied to solve $A_t z_t = b_t$ directly, we find that, because there could be some inexactness in $S_0$, it is more stable to solve $A_t^T A_t z_t = A_t^T b_t$.

  4. Compute the $P_x^T (x_t^* - x_t)$ in (7) and find $q_t$ by solving a linear system $A_t^T q_t = P_x^T (x_t^* - x_t)$.

  5. Turn on automatic differentiation with respect to $\theta$, compute the scalar quantity $q_t^T (b_t - A_t z_t)$, and perform an automatic gradient calculation. The key here is that there are several quantities in (8) that depend on $\theta$ (through application of $W_\theta$), but we only want automatic differentiation to happen through $A_t$ and $b_t$, as indicated by (8).

The total gradient is then the sum of the gradients for each $t$. (In fact, rather than looping over $t$, the whole process can be done in a single shot by concatenating the $y_t$'s into a long vector when solving the reconstruction problem in Step 1; this may be faster, but requires careful bookkeeping in the code.)

In summary, to compute the gradient at a given $W$ (equivalently, $\theta$), we need to run ADMM once and CG twice. The most expensive operation is the application of $W$, which happens repeatedly during CG and ADMM. In our experiments, we needed a few hundred applications of $W$ during CG and a few thousand in ADMM, making the ADMM the bottleneck.
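The steps above can be sketched on a tiny dense problem. This is an illustrative sketch and not the authors' implementation: it assumes NumPy, uses dense matrices and direct solves in place of matrix-free operators and CG, and differentiates the scalar $q^T (b - A z)$ by hand (with $q$ and $z$ held fixed) instead of calling automatic differentiation. The function names are hypothetical. With $z = [x; \nu]$ and $q = [q_x; q_\nu]$, the hand-derived gradient with respect to $W$ is $-\beta s q_x^T - S_0^T (\nu q_x^T + q_\nu x^T)$, where $s = S_\pm^T \mathbf{1}$ is the sign pattern.

```python
import numpy as np


def soft(v, tau):
    """Soft thresholding, the proximal map of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def admm_denoise(W, y, beta, rho=1.0, iters=500):
    """Solve min_x 0.5||x - y||^2 + beta||W x||_1 with a split variable v = W x."""
    k, n = W.shape
    lhs = np.eye(n) + rho * W.T @ W
    v, u = np.zeros(k), np.zeros(k)
    for _ in range(iters):
        x = np.linalg.solve(lhs, y + rho * W.T @ (v - u))
        v = soft(W @ x + u, beta / rho)  # exact zeros appear here
        u = u + W @ x - v
    return x, v


def gradient_wrt_W(W, y, x_true, beta):
    """Gradient of 0.5||x*(W) - x_true||^2 with respect to W for one training pair."""
    k, n = W.shape
    # Steps 1-2: reconstruct, then read the sign pattern off the split variable.
    _, v = admm_denoise(W, y, beta)
    s = np.sign(v)
    S0 = np.eye(k)[s == 0]  # selects rows of W with zero response
    k0 = S0.shape[0]
    # Step 3: KKT system A z = b with z = [x; nu].
    A = np.block([[np.eye(n), (S0 @ W).T],
                  [S0 @ W, np.zeros((k0, k0))]])
    b = np.concatenate([y - beta * W.T @ s, np.zeros(k0)])
    z = np.linalg.solve(A, b)
    x, nu = z[:n], z[n:]
    # Step 4: adjoint system A^T q = P_x^T (x - x_true).
    q = np.linalg.solve(A.T, np.concatenate([x - x_true, np.zeros(k0)]))
    qx, qnu = q[:n], q[n:]
    # Step 5: gradient of q^T (b - A z) w.r.t. W, with q and z held fixed.
    grad = -beta * np.outer(s, qx) - S0.T @ (np.outer(nu, qx) + np.outer(qnu, x))
    return x, grad


W = np.array([[1.0, -1.0], [0.5, 0.5]])
y = np.array([1.0, 0.9])
x_rec, grad = gradient_wrt_W(W, y, x_true=np.array([1.0, 1.0]), beta=0.2)
```

For this example a hand calculation gives the sign pattern $(0, +1)$ and $x^* = (0.85, 0.85)$. Because the final $x$ comes from the KKT solve, it is exact once the sign pattern is right, even if ADMM itself has only converged approximately.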

2.4 Learning

With the gradient in hand, we can use any of a variety of first-order methods to solve (1), including gradient descent, stochastic gradient descent (SGD), ADAM Kingma and Ba (2014), or BFGS Avriel (2003). We found that SGD worked well in our experiments.

We make two additional implementation notes. First, it is very important to monitor the accuracy of the inverse problems being solved during gradient evaluation, i.e., ADMM and CG, and to ensure that their hyperparameters are such that accurate solutions are computed. If the results are inaccurate, gradients will be inaccurate and, in our experience, learning will fail. Second, we found it useful to store the $x_t^*$'s at each iteration and use them as a warm start in the next iteration. This allows fewer iterations to be used when solving the reconstruction problem, thereby speeding up training. (In the case of using ADMM to solve the reconstruction problem, we need to save not only the $x_t^*$'s, but also the associated split and dual variables.)

3 Experiments

We now present the details of our proof-of-concept denoising experiments.

3.1 Data and Evaluation

Our experiment uses simple, synthetic images (see Figure 1(a) for an example). The images are $64 \times 64$ pixels and generated according to a dead leaves model that produces images that mimic the statistics of natural images Lee et al. (2001). Specifically, an image is formed by superimposing a fixed number of rectangles (100, in our case), each with a random height, width, and intensity (uniform between 0 and 1). For each image, we create a noisy version by adding IID, zero-mean Gaussian noise with a standard deviation of 0.1 (see Figure 1(b) for an example). We use small (compared to typical digital photographs) images to reduce training times; we use the dead leaves model because it contains structure that we expect our approach will be able to capture.
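A data generator along these lines might look like the following sketch. The text fixes the number of rectangles (100), the intensity distribution, and the noise level; the image size of 64 is inferred from the "64 - 2 + 1" in Section 3.2, and the distributions of rectangle size and position, the zero background, and the function names are assumptions made only for illustration.

```python
import random


def dead_leaves(size=64, n_rect=100, seed=0):
    """Superimpose n_rect random rectangles; later rectangles occlude earlier ones."""
    rng = random.Random(seed)
    img = [[0.0] * size for _ in range(size)]
    for _ in range(n_rect):
        height = rng.randint(1, size)   # assumed: uniform over possible sizes
        width = rng.randint(1, size)
        top = rng.randint(0, size - 1)  # assumed: uniform position, clipped at edges
        left = rng.randint(0, size - 1)
        value = rng.random()            # intensity uniform between 0 and 1
        for r in range(top, min(top + height, size)):
            for c in range(left, min(left + width, size)):
                img[r][c] = value
    return img


def add_noise(img, sigma=0.1, seed=1):
    """Add IID zero-mean Gaussian noise with standard deviation sigma."""
    rng = random.Random(seed)
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in img]


clean = dead_leaves()
noisy = add_noise(clean)
```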

Our experiments compare the ability of several methods (Section 3.2) to perform denoising, that is, to restore a clean image from its noisy version. For evaluation, we generate a testing set of ten images, along with their corresponding noisy versions. Our figure of merit is the SNR on the entire testing set, defined as

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_t \sum_i [x_t]_i^2}{\sum_t \sum_i \left( [\hat{x}_t]_i - [x_t]_i \right)^2},$$

where $\hat{x}_t$ is the reconstruction of the $t$th noisy image in the testing set, $x_t$ is the $t$th ground truth image in the testing set, and $[x_t]_i$ is the $i$th pixel of $x_t$. The SNR is expressed in decibels (dB). Note that the SNR as defined here is a monotone function of the training objective (1a), so the supervised method maximizes SNR on the training set. Another common metric is average SNR (i.e., SNR computed separately for each image and averaged), which does not have this property. This detail is unlikely to matter much, especially when the testing set is as homogeneous as ours. In a setting where it does matter, $\mathcal{J}$ could be adapted to account for it. A third common metric is the peak SNR (PSNR). If one considers 1.0 to be the peak, SNR values can be converted to PSNR values by adding $10 \log_{10} \left( N / \sum_t \sum_i [x_t]_i^2 \right)$, where $N$ is the total number of pixels in the testing set; this value is 4.69 dB on our testing set.
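As a small illustration of the metric, the sketch below computes a dataset-level SNR and the SNR-to-PSNR offset. It assumes the SNR is the ratio of total signal energy to total squared error over the whole testing set, with no mean subtraction, and a peak value of 1.0; images are flat lists of pixels and the function names are hypothetical.

```python
import math


def dataset_snr_db(recons, truths):
    """SNR in dB over a whole set of (reconstruction, ground truth) images."""
    signal = sum(p * p for x in truths for p in x)
    error = sum((a - p) ** 2
                for x_hat, x in zip(recons, truths)
                for a, p in zip(x_hat, x))
    return 10.0 * math.log10(signal / error)


def snr_to_psnr_offset_db(truths, peak=1.0):
    """Amount to add to the dataset SNR to obtain PSNR for the given peak."""
    n_pixels = sum(len(x) for x in truths)
    signal = sum(p * p for x in truths for p in x)
    return 10.0 * math.log10(peak ** 2 * n_pixels / signal)


truths = [[1.0, 0.0], [0.5, 0.5]]
recons = [[0.9, 0.0], [0.5, 0.6]]
snr = dataset_snr_db(recons, truths)
psnr = snr + snr_to_psnr_offset_db(truths)
```

Because the numerator depends only on the ground truth, this SNR changes monotonically with the total squared error, which is the property noted above.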

3.2 Methods Compared

We compare the denoising performance of five methods. Here, we briefly describe each method including its parameters; we discuss parameter tuning in the next section.

BM3D Makinen et al. (2019) denoises image patches by combining similar patches into groups and performing sparse representation on the groups. Its main parameter is the standard deviation of the noise, with a higher value giving smoother results. BM3D typically shows excellent performance on Gaussian denoising of natural images, and we expect it to be strong on our dataset, as well. We use the Python implementation from the authors' website.

Total variation (TV) is a variational method that denoises by solving the reconstruction problem (1b) with $W$ fixed to a finite differencing operator. The dimensions of $W$ are $(2 \cdot 63 \cdot 63) \times 64^2$ because there are two filters (vertical and horizontal differences) and the size of the valid convolution along each dimension is 64 - 2 + 1. We use the anisotropic version of TV (with no 2-norm on the finite differences) because it is a good fit for our dataset and to make it more comparable to the other methods. We solve the reconstruction problem using the ADMM Boyd et al. (2011), with 400 outer and 40 inner iterations, which is sufficient for the cost to be stable to 4 significant digits. The TV method has one scalar parameter, $\beta$, which controls the regularization strength. Because TV promotes piecewise-constant reconstructions, we expect it to perform well in the experiments on our dataset.

DCT sparsity, like TV, is a variational method with a fixed $W$. In this case, $W$ performs a 2D DCT on each $3 \times 3$ block of the input (see Figure 2(a) for the corresponding filters). Following recent work with this type of regularizer Rubinstein et al. (2013), we remove the constant filter from the DCT, meaning that $W$ is $(8 \cdot 62 \cdot 62) \times 64^2$. Like TV, the DCT method has one scalar parameter, $\beta$.
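For readers who want to see what such a filter bank looks like, here is a minimal sketch that builds separable, orthonormal 2D DCT-II filters and drops the constant one. The block size of 3 is an inference from the eight nonconstant filters used by the supervised method in this section, and the function names are hypothetical.

```python
import math


def dct_filters(b=3):
    """Return the b*b separable, orthonormal 2D DCT-II filters, each b x b."""
    def basis(k):
        scale = math.sqrt(1.0 / b) if k == 0 else math.sqrt(2.0 / b)
        return [scale * math.cos(math.pi * (2 * i + 1) * k / (2 * b))
                for i in range(b)]

    one_d = [basis(k) for k in range(b)]
    return [[[one_d[u][i] * one_d[v][j] for j in range(b)] for i in range(b)]
            for u in range(b) for v in range(b)]


def nonconstant_filters(b=3):
    """Drop the first (constant) filter, leaving b*b - 1 filters."""
    return dct_filters(b)[1:]


filters = nonconstant_filters(3)
```

Each remaining filter sums to zero, so constant image regions produce no response and are not penalized by the regularizer.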

Unsupervised learned analysis sparsity learns $W$ by minimizing with respect to $W$. The learned $W$ is then used in (1b) to perform reconstruction. We use the term unsupervised because the $y_t$'s are not used during training. We parameterize $W$ by its filters, so $\theta$ is $9 \times 3 \times 3$ and $W$ is $(9 \cdot 62 \cdot 62) \times 64^2$; these filters are initialized to the DCT. To avoid the trivial solution $W = 0$, we constrain the filters of $W$ to be orthogonal during learning Yaghoobi et al. (2012); Ravishankar and Bresler (2015). We minimize the training objective using the ADMM Boyd et al. (2011) with split variables and where orthogonality is enforced by solving an orthogonal Procrustes problem using a singular value decomposition (SVD) each iteration Ravishankar and Bresler (2015). After learning, we remove the first (constant) filter. The main parameter of the algorithm is the $\beta$ used during reconstruction. We expect that this method should be able to learn a good sparsifying analysis operator because our dataset is very structured.

Supervised learned analysis sparsity (proposed) denoises by solving the reconstruction problem (1b) after $W$ has been learned by supervised training, i.e., solving (1) on a training set. We parameterize $W$ as a convolution with eight learnable filters, initialized to the (nonconstant) DCT filters. The parameters of the supervised method are $\beta$ and the training hyperparameters. If $\theta$ were a global minimizer of (1), and if we assume the model generalizes from training to testing, then the supervised method would perform optimal denoising of the testing set (in the MSE sense). Unfortunately, (1) is a challenging, nonconvex problem, where we are likely to find only local optima in practice.

3.3 Training and Parameter Tuning

For training and parameter tuning, we generate a dataset of ten training images along with their corresponding noisy versions. For the methods with only a scalar parameter (BM3D, TV, and DCT), we perform a parameter sweep and select the value that maximizes the signal-to-noise ratio (SNR) of the results on this training set. For the unsupervised method, we use the $\beta$ found for the DCT method and train $W$ to sparsify clean images (as described in the previous section). After training $W$, we perform another 1D parameter sweep on $\beta$, finding the optimal value for denoising the training set.

For the supervised method, we use the $\beta$ found during the DCT parameter sweep, and perform stochastic gradient descent on $\theta$, with the gradient normalized by the number of pixels and batch size and with the step size set to 2.0. Following Smith et al. (2017), we increase the batch size during training to improve convergence. Specifically, we use a batch size of 1 for 5000 iterations, 5 for 2500 more iterations, and 10 for 2500 more iterations. We do not adapt $\beta$ after training as (unlike in the unsupervised case) $\beta$ appears in the training objective and therefore $W$ should adapt to it. Following best practices in reproducibility Pineau and Sinha (2020), we note that we did perform multiple testing evaluations during the development of the algorithm. These were mainly for debugging and exploring different initialization and learning schedules. We did not omit any results that would change our reported findings.
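The batch-size schedule described above is simple to express in code. The sketch below is only an illustration of the stated schedule (1 for 5000 iterations, then 5 for 2500, then 10 for 2500); the function names and the 0-based iteration index are assumptions.

```python
def batch_size_at(iteration):
    """Batch size for a 0-based SGD iteration under the three-stage schedule."""
    schedule = [(5000, 1), (2500, 5), (2500, 10)]  # (number of iterations, batch size)
    start = 0
    for length, batch in schedule:
        if iteration < start + length:
            return batch
        start += length
    raise ValueError("iteration is past the end of the 10000-iteration schedule")


def total_gradient_evaluations():
    """Total number of per-image gradient evaluations over the whole schedule."""
    return sum(batch_size_at(i) for i in range(10000))


total = total_gradient_evaluations()
```

Since each per-image gradient requires one reconstruction solve, the total count gives a rough sense of how many reconstruction problems are solved during training.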

4 Results and Discussion

We report the quantitative results of our denoising experiment in Table 1. All the methods we compared were able to significantly denoise the input image, improving the noisy input by at least 6.5 dB. The reported SNRs are lower than would be typical for natural images with this level of noise (cf. Zhang et al. (2017), where results are around 30 dB); this is both because we report SNR rather than PSNR and because our dataset lacks the large, smooth areas typically found in high resolution natural images. Parameter sweeps for BM3D and the TV- and DCT-based methods took under ten minutes; training for the unsupervised learning method took minutes; and training for the supervised method took eight hours. For all methods, reconstructions took less than one minute.

The proposed supervised learning-based method gave the best result, followed by unsupervised learning, BM3D, TV, and DCT-based denoising. The strong performance of the proposed method shows that learning was successful and that the learned analysis operator generalizes to unseen images. That the supervised method outperforms TV is notable, because, naively, we might have guessed that anisotropic TV is an optimal sparsifying transform for these images (because they comprise only horizontal and vertical edges). Because BM3D and the proposed method exploit different types of image structure (self-similarity vs local sparsity), we conjecture that combining the methods by performing supervised analysis operator learning on matched or grouped patches could further improve performance. Reference Wen et al. (2019) shows that such an approach works in the unsupervised case.

Method                   Testing SNR (dB)
input                    15.29
BM3D                     22.82
TV                       22.59
DCT                      21.90
unsupervised             22.84
supervised (proposed)    23.16

Table 1: Results of the denoising experiment. See Section 3.2 for method details.
Figure 1: Denoising results on a testing image. Each result is shown with the corresponding absolute error map and each subcaption reports the SNR on this image. Panels: (a) truth; (b) noisy, 16.2 dB; (c) BM3D, 22.8 dB; (d) TV, 23.2 dB; (e) DCT, 22.1 dB; (f) unsupervised, 22.9 dB; (g) supervised, 23.2 dB.

Figure 2: Summed absolute filter responses (left) and filters (right). Panels: (a) DCT; (b) unsupervised; (c) supervised.

We can also explore the results by looking at the output from each algorithm on a single image (Figure 1). Qualitatively, the BM3D and DCT results look oversmoothed, probably as a result of patch averaging and penalizing high frequencies, respectively. The TV result has the characteristic stairstep pattern. The results for both the unsupervised and supervised methods are sharper than the DCT and BM3D results and avoid the stairstep pattern of TV, but they otherwise look similar. The error map for the supervised method looks less noisy than for the unsupervised method, which suggests that the supervised method does a better job of smoothing flat areas in the image.

As another way to investigate our results, we display the filters and the corresponding aggregate filter response map (sum of the absolute filter responses over channels) on a clean testing image for the DCT-, unsupervised learning-, and supervised learning-based methods (Figure 2). Both the unsupervised and supervised methods are initialized with DCT filters, and both make visible changes to them during training. Both approaches also result in highly structured filters. However, the unsupervised learning filters are (by construction) orthonormal, while the supervised ones are not. The supervised method seems to use this flexibility in two ways. First, the filters are weighted differently, with norms ranging from 0.75 to 1.54. This effect is apparent in Figure 2(c), where the filters in the first row and first column are less saturated than the others. Second, the filters are not orthogonal (e.g., note the pair of filters in the first row in Figure 2(c)). From looking at the aggregate response maps, we see that the effect of these differences is that aggregate filter responses for the supervised method are smaller on edges and larger on corners than for either of the other methods.

5 Conclusions and Future Work

We have presented a new approach for learning a sparsifying analysis operator for use in variational denoising. The main contribution of the work is an expression for the gradient of the training objective that does not rely on relaxation to a differentiable sparsity penalty.

We are interested in adapting this approach in two main directions. First, we want to include a linear forward model in the objective function in (1b), making the approach applicable to the general class of linear inverse problems. The main challenge in this generalization is that the invertibility of the KKT matrix would depend on the properties of the forward operator. Second, we would like to extend the approach to the case where $W$ is a nonlinear function, e.g., a CNN. We suspect that for a ReLU CNN, our KKT approach can still be used, with the change that the sign pattern must be specified at each ReLU activation function. Moving to a CNN-based regularizer also complicates solving the regularized reconstruction problem itself, because the proximal map used inside ADMM would no longer exist. On the experimental side, we would like to validate the approach on natural images and make a thorough comparison to recent, CNN-based approaches for denoising, e.g. Zhang et al. (2017). These experiments may provide insight into how much of the performance of these methods comes from their architecture and how much comes from the supervised learning per se.

Broader Impact

While our work is primarily of a theoretical and exploratory nature (using no real data), it certainly may eventually lead to societal impacts. We expect these to be mainly positive: our work aims to improve the quality of image reconstructions, which, e.g., could help make MRI images clearer, allowing doctors to make better clinical decisions. However, this technology could also be used in negative ways, e.g., to create better images from intrusive surveillance systems. Using learning in these systems has its own pitfalls. Returning to the MRI example, training data from minority populations may be scarce, resulting in systems that produce spurious results for minority patients at a higher rate than for patients in the majority. As compared to highly-parameterized systems like those used in deep learning, our shallow learning approach requires much less training data to perform well, which may help ameliorate this type of disparity.


  • H. K. Aggarwal, M. P. Mani, and M. Jacob (2019) MoDL: model-based deep learning architecture for inverse problems. IEEE Transactions on Medical Imaging 38 (2), pp. 394–405. External Links: Document Cited by: §1.
  • A. H. Al-Shabili, H. Mansour, and P. T. Boufounos (2020) Learning plug-and-play proximal quasi-newton denoisers. In 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing Proceedings, Barcelona, Spain. External Links: Document Cited by: §1.
  • M. Avriel (2003) Nonlinear programming: analysis and methods. Dover Publications, Mineola, NY. External Links: ISBN 9780486432274 Cited by: §2.4.
  • M. Bertero and P. Boccacci (1998) Introduction to inverse problems in imaging. Institute of Physics Publishing, Philadelphia, Pa. External Links: ISBN 0750304391 Cited by: §1.
  • S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3 (1), pp. 1–122. External Links: ISSN 1935-8237, Document Cited by: item 1, §3.2, §3.2.
  • S. Boyd and L. Vandenberghe (2004) Convex optimization. Cambridge University Press. External Links: ISBN 0521833787, Document Cited by: §2.1.
  • E.J. Candes, J. Romberg, and T. Tao (2006) Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory 52 (2), pp. 489–509. External Links: Document Cited by: §1, §1.
  • Y. Chen, T. Pock, and H. Bischof (2014) Learning ℓ1-based analysis and synthesis sparsity priors using bi-level optimization. arXiv:1401.4105 [cs.CV]. External Links: 1401.4105 Cited by: §1.
  • Y. Chen, R. Ranftl, and T. Pock (2014) Insights into analysis operator learning: from patch-based sparse models to higher order MRFs. IEEE Transactions on Image Processing 23 (3), pp. 1060–1072. External Links: Document Cited by: §1.
  • G. H. Golub and V. Pereyra (1973) The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate. SIAM Journal on Numerical Analysis 10 (2), pp. 413–432. External Links: ISSN 0036-1429, Link Cited by: §2.1.
  • R. Gribonval and P. Machart (2013) Reconciling "priors" & "priors" without prejudice? In Advances in Neural Information Processing Systems 26, Vol. 2, pp. 2193–2201. External Links: Link Cited by: §1.
  • H. Gupta, K. H. Jin, H. Q. Nguyen, M. T. McCann, and M. Unser (2018) CNN-based projected gradient descent for consistent CT image reconstruction. IEEE Transactions on Medical Imaging 37 (6), pp. 1440–1453. External Links: Document Cited by: §1.
  • Y. Hel-Or and D. Shaked (2008) A discriminative approach for wavelet denoising. IEEE Transactions on Image Processing 17 (4), pp. 443–457. External Links: Document Cited by: §1.
  • U. S. Kamilov and H. Mansour (2016) Learning optimal nonlinearities for iterative thresholding algorithms. IEEE Signal Processing Letters 23 (5), pp. 747–751. External Links: Document Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980 [cs.LG]. External Links: 1412.6980 Cited by: §2.4.
  • A. Lee, D. Mumford, and J. Huang (2001) Occlusion models for natural images: a statistical study of a scale invariant dead leaves model. International Journal of Computer Vision 41, pp. 35–59. External Links: Document Cited by: §3.1.
  • J. Mairal, F. Bach, and J. Ponce (2012) Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (4), pp. 791–804. External Links: Document Cited by: §1.
  • Y. Makinen, L. Azzari, and A. Foi (2019) Exact transform-domain noise variance for collaborative filtering of stationary correlated noise. In 2019 IEEE International Conference on Image Processing Proceedings, Taipei, Taiwan. External Links: Document Cited by: §3.2.
  • M. T. McCann, K. H. Jin, and M. Unser (2017) Convolutional neural networks for inverse problems in imaging: a review. IEEE Signal Processing Magazine 34 (6), pp. 85–95. External Links: Document Cited by: §1.
  • M. T. McCann and M. Unser (2019) Biomedical image reconstruction: from the foundations to deep neural networks. Foundations and Trends in Signal Processing 13 (3), pp. 283–359. External Links: Document Cited by: §1.
  • T. P. Minka (2000) Old and new matrix algebra useful for statistics. Technical report MIT Media Lab. External Links: Link Cited by: §2.2.
  • H. Q. Nguyen, E. Bostan, and M. Unser (2018) Learning convex regularizers for optimal bayesian denoising. IEEE Transactions on Signal Processing 66 (4), pp. 1093–1105. External Links: Document Cited by: §1.
  • G. Ongie, A. Jalal, C. A. Metzler, R. G. Baraniuk, A. G. Dimakis, and R. Willett (2020) Deep learning techniques for inverse problems in imaging. arXiv:2005.06001 [eess.IV]. External Links: 2005.06001 Cited by: §1.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. External Links: Link Cited by: §2.2.
  • G. Peyré and J. M. Fadili (2011) Learning analysis sparsity priors. In Sampling Theory and Applications, Singapore, Singapore, pp. 4. External Links: Link Cited by: §1, §1.
  • J. Pineau and K. Sinha (2020) Designing the reproducibility program for NeurIPS 2020. External Links: Link Cited by: §3.3.
  • S. Ravishankar and Y. Bresler (2013) Learning sparsifying transforms. IEEE Transactions on Signal Processing 61 (5), pp. 1072–1086. External Links: Document Cited by: §1.
  • S. Ravishankar and Y. Bresler (2015) Sparsifying transform learning with efficient optimal updates and convergence guarantees. IEEE Transactions on Signal Processing 63 (9), pp. 2389–2404. External Links: Document Cited by: §3.2.
  • S. Ravishankar, J. C. Ye, and J. A. Fessler (2020) Image reconstruction: from sparsity to data-adaptive methods and machine learning. Proceedings of the IEEE 108 (1), pp. 86–109. External Links: Document Cited by: §1.
  • R. Rubinstein, T. Peleg, and M. Elad (2013) Analysis k-SVD: a dictionary-learning algorithm for the analysis sparse model. IEEE Transactions on Signal Processing 61 (3), pp. 661–677. External Links: Document Cited by: §1, §3.2.
  • J. Shewchuk (1994) An introduction to the conjugate gradient method without the agonizing pain. Technical report Carnegie Mellon University. External Links: Link Cited by: item 3.
  • J. Shtok, M. Elad, and M. Zibulevsky (2013) Learned shrinkage approach for low-dose reconstruction in computed tomography. International Journal of Biomedical Imaging 2013, pp. 1–20. External Links: Document Cited by: §1.
  • S. L. Smith, P. Kindermans, C. Ying, and Q. V. Le (2017) Don’t decay the learning rate, increase the batch size. arXiv:1711.00489 [cs.LG]. External Links: 1711.00489 Cited by: §3.3.
  • P. Sprechmann, R. Litman, T. Ben Yakar, A. M. Bronstein, and G. Sapiro (2013) Supervised sparse analysis and synthesis operators. In Advances in Neural Information Processing Systems 26, pp. 908–916. External Links: Link Cited by: §1.
  • I. Tosic and P. Frossard (2011) Dictionary learning. IEEE Signal Processing Magazine 28 (2), pp. 27–38. External Links: ISSN 1053-5888, Document Cited by: §1.
  • J.A. Tropp (2004) Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory 50 (10), pp. 2231–2242. External Links: Document Cited by: §1.
  • B. Wen, S. Ravishankar, and Y. Bresler (2019) VIDOSAT: high-dimensional sparsifying transform learning for online video denoising. IEEE Transactions on Image Processing 28 (4), pp. 1691–1704. External Links: Document Cited by: §4.
  • M. Yaghoobi, S. Nam, R. Gribonval, and M. E. Davies (2012) Noise aware analysis operator learning for approximately cosparse signals. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). External Links: Document Cited by: §3.2.
  • K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang (2017) Beyond a gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing 26 (7), pp. 3142–3155. External Links: Document Cited by: §4, §5.