# Sparse ℓ^q-regularization of inverse problems with deep learning

We propose a sparse reconstruction framework for solving inverse problems. As opposed to existing sparse reconstruction techniques that are based on linear sparsifying transforms, we train an encoder-decoder network D ∘ E with E acting as a nonlinear sparsifying transform. We minimize a Tikhonov functional that uses a learned regularization term formed by the ℓ^q-norm of the encoder coefficients and a penalty for the distance to the data manifold. For this augmented sparse ℓ^q-approach, we present a full convergence analysis, derive convergence rates and describe a training strategy. As a main ingredient of the analysis, we establish the coercivity of the augmented regularization term.


## 1 Introduction

Various applications in medical imaging, remote sensing and elsewhere require solving inverse problems of the form

 y = Ax + z, (1.1)

where A : X → Y is a linear operator between Hilbert spaces X and Y, and z denotes the data distortion. Inverse problems are well analyzed, and several established approaches for their solution exist, including filter-based methods and variational regularization techniques [1, 2]. In recent years, neural networks (NNs) and deep learning have appeared as new paradigms for solving inverse problems and have demonstrated impressive performance. Several approaches have been developed, including two-step networks [3, 4, 5], variational networks [6], iterative networks [7, 8] and regularizing networks [9].

Standard deep learning approaches may lack data consistency for unknowns very different from the training images. To address this issue, in [10] a deep learning approach has been introduced where minimizers

 xα ∈ argmin_x ∥Ax − y∥²_Y + αϕ(E(x)) (1.2)

are investigated. Here E : X → Ξ is a trained NN, Ξ a Hilbert space, ϕ : Ξ → [0, ∞] a functional, and α > 0 the regularization parameter. The resulting reconstruction approach has been named NETT (for network Tikhonov regularization), as it is a generalized form of Tikhonov regularization using a NN as trained regularizer.

In [10] it is shown that, under suitable assumptions, NETT yields a convergent regularization method. Moreover, in that paper a training strategy has been proposed where the regularizer ϕ ∘ E is trained to favor artifact-free reconstructions selected from a set of training images from a certain data manifold M; see [11] for a simplified training strategy.

### Coercive variant of NETT

One of the main assumptions in the analysis of [10] is the coercivity of the regularizer ϕ ∘ E. For the general form used in (1.2), this requires special care in the design and training of the network. In order to overcome this limitation, in this paper we propose a modified form of the regularizer for which we are able to rigorously prove its coercivity. More precisely, we consider

 xα ∈ argmin_x ∥Ax − y∥²_Y + α(ϕ(E(x)) + (β/2)∥x − (D∘E)(x)∥²). (1.3)

Here, D ∘ E is an encoder-decoder network trained such that for any x ∈ M we have (D∘E)(x) ≈ x and that ϕ(E(x)) is small. The term ϕ(E(x)) implements learned prior knowledge. The additional term ∥x − (D∘E)(x)∥² forces x to be close to the data manifold and, as we shall prove, also guarantees coercivity of the regularization functional.

In particular, in this paper we investigate the case where Ξ = ℓ²(Λ) for some index set Λ and ϕ is a weighted ℓ^q-norm used as a sparsity prior. To construct an appropriate network, we train a (modified) tight frame U-net [12] of the form D ∘ E using the ℓ^q-norm of the encoder coefficients during training, and take the encoder part E as analysis network.

### Outline

This paper is organized as follows. In Section 2, we present a convergence analysis for the augmented ℓ^q-NETT (see (2.1)). In particular, as main auxiliary result, we establish the coercivity of the regularization term. In Section 3, we derive convergence rates which provide quantitative estimates for the reconstruction accuracy. In Section 4, we present a suggested network structure using a modified tight frame U-net and a corresponding training strategy. The paper concludes with a short summary and outlook given in Section 5.

## 2 Well-posedness and convergence

### 2.1 Augmented ℓ^q-NETT

To solve the inverse problem (1.1), we propose and analyze the augmented ℓ^q-NETT, which considers minimizers of

 Tα,y(x) ≔ ∥Ax − y∥²_Y + α(∑λ∈Λ wλ|(E(x))λ|^q + (β/2)∥x − (D∘E)(x)∥²). (2.1)

Here α > 0 is the regularization parameter, E : X → ℓ²(Λ) is called the encoder network, D : ℓ²(Λ) → X is called the decoder network, Λ is a countable index set, (wλ)λ∈Λ are positive weights, β > 0 is a tuning parameter, and q describes the used norm. The case of a linear encoder yields the sparsity promoting regularization term frequently studied when E is the analysis operator of a basis or frame [13, 14, 15, 16, 17, 18]. In the present paper, we allow E and D to be non-linear mappings.
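To make the objects in (2.1) concrete, the following sketch evaluates the augmented ℓ^q-NETT functional for a toy problem. The matrix A, the identity encoder-decoder pair and all parameter values are hypothetical placeholders, not the trained networks of Section 4.

```python
import numpy as np

def augmented_lq_nett(x, A, y, E, D, w, q=1.0, alpha=0.1, beta=1.0):
    """Evaluate T_{alpha,y}(x) = ||Ax - y||^2
    + alpha * (sum_l w_l |E(x)_l|^q + (beta/2) ||x - D(E(x))||^2)."""
    coeffs = E(x)
    data_fit = np.sum((A @ x - y) ** 2)                   # ||Ax - y||^2
    sparsity = np.sum(w * np.abs(coeffs) ** q)            # weighted l^q term
    distance = 0.5 * beta * np.sum((x - D(coeffs)) ** 2)  # distance-to-manifold term
    return data_fit + alpha * (sparsity + distance)

# toy example: identity encoder/decoder pair, so the distance term vanishes
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))
x = rng.standard_normal(8)
y = A @ x                      # exact data, so the data-fit term vanishes too
w = np.ones(8)
val = augmented_lq_nett(x, A, y, lambda v: v, lambda c: c, w)
# for this toy setup, T reduces to alpha * sum_l |x_l|
```

With both residual terms zero, only the sparsity term survives, which makes the role of each summand in (2.1) easy to check numerically.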

For our convergence analysis, we use the following assumptions, which we assume to be satisfied throughout this section.

###### Condition 2.1 (Augmented ℓ^q-NETT).

1. (A1) A : X → Y is bounded linear;

2. (A2) E : X → ℓ²(Λ) is weakly sequentially continuous;

3. (A3) D : ℓ²(Λ) → X is weakly sequentially continuous;

4. (A4) 1 ≤ q ≤ 2, β > 0 and infλ∈Λ wλ > 0.

The first term in the considered regularizer

 Rq,w(x) ≔ ∑λ∈Λ wλ|(E(x))λ|^q + (β/2)∥x − (D∘E)(x)∥² (2.2)

was proposed in [10] to impose a sparsity condition on the signal x. In this paper, we add the extra term (β/2)∥x − (D∘E)(x)∥², forcing the minimizers of (2.1) to be close to the solution manifold M. This term also allows us to prove the coercivity of Rq,w (see the argument in the proof of Theorem 2.2), which is essential to our analysis.

### 2.2 Well-posedness

###### Theorem 2.2 (Existence).

For all α, β > 0 and all y ∈ Y, the augmented ℓ^q-NETT functional (2.1) has at least one minimizer.

###### Proof.

Let us first prove that Rq,w is coercive. Indeed, assume to the contrary that there exists a sequence (xk)k∈ℕ such that ∥xk∥ → ∞ and (Rq,w(xk))k∈ℕ is bounded. Then, (E(xk))k∈ℕ is bounded in ℓ^q(Λ). Since q ≤ 2, we obtain

 ∥E(xk)∥²₂ = ∑λ∈Λ |(E(xk))λ|² ≤ (∑λ∈Λ |(E(xk))λ|^q)^{2/q} = ∥E(xk)∥²_q.

Therefore, (E(xk))k∈ℕ is also bounded in ℓ²(Λ). Now, since D is weakly sequentially continuous, this implies that ((D∘E)(xk))k∈ℕ is a bounded sequence. From the estimate

 ∥xk∥² ≤ 2(∥xk − (D∘E)(xk)∥² + ∥(D∘E)(xk)∥²) ≤ (4/β)Rq,w(xk) + 2∥(D∘E)(xk)∥²

it follows that (xk)k∈ℕ is a bounded sequence. This contradicts ∥xk∥ → ∞ and finishes the proof that Rq,w is coercive.

Because the encoder network E is weakly sequentially continuous, the functional Rq,w is weakly sequentially lower semi-continuous. Therefore, Tα,y is weakly sequentially lower semi-continuous, too. Since Tα,y is bounded from below by 0, it has an infimum m. Let (xk)k∈ℕ be a sequence with Tα,y(xk) → m. Since Tα,y is coercive, the sequence (xk)k∈ℕ is bounded and hence has an accumulation point in the weak topology, denoted by x⋆. Because Tα,y is weakly sequentially lower semi-continuous, it follows that Tα,y(x⋆) ≤ lim infk Tα,y(xk) = m. Therefore, x⋆ is a minimizer of Tα,y. ∎

###### Theorem 2.3 (Stability).

Let α, β > 0 and y ∈ Y, let (yk)k∈ℕ be a sequence with yk → y, and choose xk ∈ argmin Tα,yk. Then weak accumulation points of (xk)k∈ℕ exist and are minimizers of Tα,y. For any weak accumulation point x⁺ and corresponding subsequence (xk(ℓ))ℓ∈ℕ of (xk)k∈ℕ, it holds that Rq,w(xk(ℓ)) → Rq,w(x⁺).

###### Proof.

The proof follows the lines of [2, Theorem 3.23]. We note that the convexity of the regularizer assumed in [2] is not needed in that proof. For the sake of completeness, we sketch a proof for the non-convex regularizer Rq,w. Fix x ∈ X. Then, for all k ∈ ℕ, we have Tα,yk(xk) ≤ Tα,yk(x), which implies

 αRq,w(xk) ≤ Tα,yk(x) ≤ 2(∥Ax∥² + ∥yk∥²) + αRq,w(x).

Consequently, (Rq,w(xk))k∈ℕ is bounded, and by the coercivity of Rq,w the sequence (xk)k∈ℕ has a weakly convergent subsequence xk(ℓ) ⇀ x⁺. Let us prove that each such accumulation point satisfies x⁺ ∈ argmin Tα,y. Indeed, given any x ∈ X, we have Tα,yk(ℓ)(xk(ℓ)) ≤ Tα,yk(ℓ)(x), which implies, using yk(ℓ) → y and the weak sequential lower semi-continuity of Tα,y, that Tα,y(x⁺) ≤ lim infℓ Tα,yk(ℓ)(xk(ℓ)) ≤ limℓ Tα,yk(ℓ)(x) = Tα,y(x). Since this holds for all x ∈ X, we obtain x⁺ ∈ argmin Tα,y. It now remains to prove Rq,w(xk(ℓ)) → Rq,w(x⁺). For that purpose, write Tα,y = ∥A(⋅) − y∥² + αRq,w. Then limℓ Tα,yk(ℓ)(xk(ℓ)) = Tα,y(x⁺) and lim infℓ ∥Axk(ℓ) − y∥² ≥ ∥Ax⁺ − y∥², which implies

 lim supℓ αRq,w(xk(ℓ)) ≤ Tα,y(x⁺) − ∥Ax⁺ − y∥² = αRq,w(x⁺).

Together with the weak sequential lower semi-continuity of the regularizer Rq,w, this yields Rq,w(xk(ℓ)) → Rq,w(x⁺) and concludes the proof. ∎

### 2.3 Convergence

We call x⁺ ∈ X an Rq,w-minimizing solution of the equation Ax = y if

 x⁺ ∈ argmin{Rq,w(x) ∣ x ∈ X ∧ Ax = y}.

As in the convex case [2], one shows that an Rq,w-minimizing solution exists whenever Ax = y is solvable.

###### Theorem 2.4 (Weak Convergence).

Let Ax = y be solvable, let (δk)k∈ℕ be a sequence with δk → 0, let (yk)k∈ℕ satisfy ∥yk − y∥ ≤ δk, choose xk ∈ argmin Tα(δk),yk, and let the parameter choice α(⋅) satisfy

 limδ→0 α(δ) = limδ→0 δ²/α(δ) = 0. (2.3)

Then the following hold:

1. (xk)k∈ℕ has at least one weak accumulation point x⁺;

2. Every weak accumulation point x⁺ of (xk)k∈ℕ is an Rq,w-minimizing solution of Ax = y;

3. Every weakly convergent subsequence (xk(ℓ))ℓ∈ℕ satisfies Rq,w(xk(ℓ)) → Rq,w(x⁺);

4. If the Rq,w-minimizing solution x⁺ of Ax = y is unique, then xk ⇀ x⁺.

###### Proof.

This follows along the lines of [2, Theorem 3.26]. ∎

Next we derive strong convergence. For that purpose, let us recall the notions of absolute Bregman distance and total nonlinearity defined in [10].

###### Definition 2.5 (Absolute Bregman distance).

Let F : X → ℝ be Gâteaux differentiable at x ∈ X. The absolute Bregman distance ΔF(⋅, x) : X → [0, ∞) at x with respect to F is defined by

 ΔF(x̃, x) = |F(x̃) − F(x) − F′(x)(x̃ − x)| for x̃ ∈ X.

Here F′(x) denotes the Gâteaux derivative of F at x.
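A quick numerical illustration of Definition 2.5 (a sketch, not part of the analysis): for the smooth functional F(x) = ∥x∥², the absolute Bregman distance reduces to ∥x̃ − x∥², which can be verified directly.

```python
import numpy as np

def absolute_bregman(F, gradF, x_tilde, x):
    """Absolute Bregman distance |F(x~) - F(x) - <F'(x), x~ - x>| of Definition 2.5."""
    return abs(F(x_tilde) - F(x) - gradF(x) @ (x_tilde - x))

# example: F(x) = ||x||^2 has Gateaux derivative F'(x) = 2x, hence
# Delta_F(x~, x) = | ||x~||^2 - ||x||^2 - 2<x, x~ - x> | = ||x~ - x||^2
F = lambda v: v @ v
gradF = lambda v: 2 * v
rng = np.random.default_rng(3)
x, x_tilde = rng.standard_normal(5), rng.standard_normal(5)
d = absolute_bregman(F, gradF, x_tilde, x)
```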

###### Definition 2.6 (Total nonlinearity).

Let F : X → ℝ be Gâteaux differentiable at x ∈ X. We define the modulus of total nonlinearity of F at x as the function νF(x, ⋅) : (0, ∞) → [0, ∞] given by

 νF(x, t) = inf{ΔF(x̃, x) ∣ x̃ ∈ X ∧ ∥x̃ − x∥ = t}.

We call F totally nonlinear at x if νF(x, t) > 0 for all t > 0.

The following convergence result in the norm topology holds.

###### Theorem 2.7 (Strong Convergence).

Assume that Ax = y has a solution, let Rq,w be totally nonlinear at all Rq,w-minimizing solutions of Ax = y, and let (δk)k∈ℕ, (yk)k∈ℕ, (xk)k∈ℕ and α(⋅) be as in Theorem 2.4. Then there is a subsequence (xk(ℓ))ℓ∈ℕ of (xk)k∈ℕ and an Rq,w-minimizing solution x⁺ of Ax = y such that ∥xk(ℓ) − x⁺∥ → 0. Moreover, if the Rq,w-minimizing solution of Ax = y is unique, then xk → x⁺ in the norm topology.

###### Proof.

Follows from [10, Theorem 2.8]. ∎

### 2.4 Example: Sparse analysis regularization with a dictionary

A simple application of the above results is the case where the encoder E : X → ℓ²(Λ) is a bounded linear operator with closed range. We can write E(x) = (⟨eλ, x⟩)λ∈Λ for so-called atoms eλ ∈ X and interpret (eλ)λ∈Λ as an (analysis) dictionary. Moreover, we take the decoder network D as the pseudoinverse E⁺ of E.

We have x − (E⁺ ∘ E)(x) = Pker(E)(x), where Pker(E) denotes the orthogonal projection onto ker(E), and the regularizer takes the form

 Rq,w(x) ≔ ∑λ∈Λ wλ|⟨eλ, x⟩|^q + (β/2)∥Pker(E)(x)∥². (2.4)

Clearly the conditions (A2) and (A3) are satisfied, which implies that existence, stability and weak convergence for sparse analysis dictionary regularization with (2.4) hold. Following [2, Theorem 3.49], one also derives strong convergence.

Note that if (eλ)λ∈Λ is a frame of X, then ker(E) = {0}, in which case (2.4) yields the standard sparse regularizer ∑λ∈Λ wλ|⟨eλ, x⟩|^q. However, for a general trained dictionary we will typically have ker(E) ≠ {0}. This is the case even for overcomplete dictionaries, because the dictionary is only trained on elements in a small subset of X which are supposed to satisfy a sparse analysis prior. In this case, the additional term in (2.4) ensures coercivity of the regularizer, which is essential for the convergence of Tikhonov regularization.
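The identities used in this example are easy to check numerically. The sketch below builds a random analysis dictionary W (a hypothetical stand-in for a trained linear encoder) with non-trivial kernel, uses the pseudoinverse as decoder, and verifies that x − (D∘E)(x) is exactly the projection onto ker(E).

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 4                            # k < n, so ker(E) is non-trivial
W = rng.standard_normal((k, n))        # rows of W play the role of the atoms e_lambda
E = lambda x: W @ x                    # linear analysis encoder
D = lambda c: np.linalg.pinv(W) @ c    # decoder = pseudoinverse of E

x = rng.standard_normal(n)
residual = x - D(E(x))                 # equals P_{ker(E)}(x)

# regularizer (2.4) with q = 1, unit weights and beta = 2 (illustrative choices)
reg = np.sum(np.abs(E(x))) + np.sum(residual ** 2)
```

Since pinv(W) W is the orthogonal projection onto ker(E)⊥, the residual lies in ker(E), matching the form of the second term in (2.4).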

## 3 Convergence rates

Let us now prove a convergence rate in the absolute Bregman distance. For that purpose, we consider general Tikhonov regularization

 Tα,y(x) ≔ ∥Ax − y∥² + αR(x) → minx. (3.1)

Here R : X → [0, ∞] is a general, possibly non-convex, regularizer and A : X → Y the linear forward operator.

The convergence rates will be derived under the following assumptions:

1. (B1) A : X → Y is bounded linear with finite-dimensional range;

2. (B2) R is coercive and weakly sequentially lower semi-continuous;

3. (B3) R is Lipschitz;

4. (B4) R is Gâteaux differentiable.

###### Remark 3.1.

Note that the regularizer Rq,w of the augmented ℓ^q-NETT (1.3) satisfies (B2)–(B4) as long as q > 1 and the activation functions of the encoder-decoder network D ∘ E are differentiable (such as the sigmoid function or smooth versions of ReLU). Condition (B3) can be relaxed to a local Lipschitz property.

The main restriction in the above list of assumptions is that A has finite-dimensional range. However, this assumption holds true in practical applications such as sparse data tomography. Unlike [10], we do not impose assumptions on the regularizer that are quite difficult to validate in practice. Modified provable conditions will be studied in future work.

We start our analysis with the following result.

###### Proposition 3.2.

Let (B1)–(B4) be satisfied and assume that x⁺ is an R-minimizing solution of Ax = y. Then there exists a constant C > 0 such that

 ∀x ∈ X: ΔR(x, x⁺) ≤ R(x) − R(x⁺) + C∥Ax − Ax⁺∥.
###### Proof.

Let us first prove that for some constant γ > 0 it holds that

 ∀x ∈ X: R(x⁺) − R(x) ≤ γ∥Ax⁺ − Ax∥. (3.2)

Indeed, let P denote the orthogonal projection onto ker(A)⊥ and define x₀ ≔ x + P(x⁺ − x). Then A(x₀) = A(x⁺) and x₀ − x ∈ ker(A)⊥. Since the restricted operator A|ker(A)⊥ is injective and has finite-dimensional range, it is bounded from below by a constant γ₀ > 0. Therefore,

 ∥Ax⁺ − Ax∥ = ∥Ax₀ − Ax∥ = ∥A(x₀ − x)∥ ≥ γ₀∥x₀ − x∥. (3.3)

On the other hand, since A(x₀) = A(x⁺) = y, x⁺ is the R-minimizing solution of Ax = y, and R is Lipschitz with some constant L > 0, we have R(x⁺) − R(x) ≤ R(x₀) − R(x) ≤ L∥x₀ − x∥. Together with (3.3) we obtain (3.2) with γ ≔ L/γ₀.

Next we prove that there is a constant γ₁ > 0 such that

 ⟨R′(x⁺), x⁺ − x⟩ ≤ γ₁∥Ax⁺ − Ax∥. (3.4)

Since x⁺ is an R-minimizing solution of Ax = y and R is Gâteaux differentiable, we obtain ⟨R′(x⁺), h⟩ = 0 for h ∈ ker(A). On the other hand, if h ∈ ker(A)⊥, we have ∥h∥ ≤ γ₀⁻¹∥Ah∥ and ⟨R′(x⁺), h⟩ ≤ ∥R′(x⁺)∥∥h∥. Decomposing x⁺ − x into its components in ker(A) and ker(A)⊥ finishes the proof of (3.4) with γ₁ ≔ ∥R′(x⁺)∥/γ₀.

The proof now follows that of [10, Proposition 3.3]. Indeed, we note that ΔR(x, x⁺) ≤ |R(x⁺) − R(x)| + |⟨R′(x⁺), x − x⁺⟩|. Therefore, using (3.2) and (3.4), we obtain

 ΔR(x, x⁺) ≤ |R(x⁺) − R(x)| + |⟨R′(x⁺), x − x⁺⟩| ≤ R(x) − R(x⁺) + (2γ + γ₁)∥Ax − Ax⁺∥,

which concludes the proof with C ≔ 2γ + γ₁. ∎

The following result is our main convergence rates result. It is similar to [10, Theorem 3.1], but uses different assumptions.

###### Theorem 3.3 (Convergence rates results).

Let (B1)–(B4) be satisfied, let x⁺ be an R-minimizing solution of Ax = y, suppose ∥yδ − y∥ ≤ δ, and let xα,δ ∈ argmin Tα,yδ. Then, for the parameter choice α ∼ δ, we have ΔR(xα,δ, x⁺) = O(δ) as δ → 0.

###### Proof.

From Proposition 3.2, we obtain

 αΔR(xα,δ, x⁺) ≤ αR(xα,δ) − αR(x⁺) + Cα∥A(xα,δ) − Ax⁺∥
  = Tα,yδ(xα,δ) − ∥A(xα,δ) − yδ∥² − (Tα,yδ(x⁺) − ∥Ax⁺ − yδ∥²) + Cα∥A(xα,δ) − Ax⁺∥
  ≤ δ² + Cαδ − ∥A(xα,δ) − yδ∥² + Cα∥A(xα,δ) − yδ∥.

Cauchy’s inequality gives Cα∥A(xα,δ) − yδ∥ − ∥A(xα,δ) − yδ∥² ≤ C²α²/4. For α ∼ δ, we easily conclude ΔR(xα,δ, x⁺) = O(δ). ∎

## 4 Network design and training

For the encoder-decoder type network required for the augmented regularizer (2.2) we propose a modified tight frame U-net together with a sparse training strategy.

The tight frame U-net has been introduced in [12] and is less smoothing than the classical U-net [19] in image reconstruction. The tight frame U-net of [12] uses a residual (or bypass) connection, which is not well suited for our purpose. We therefore work with a modified tight frame U-net that has been used in [20] for sparse synthesis regularization with neural networks.

### 4.1 Modified tight-frame U-net

For simplicity we assume that X is already a finite-dimensional space consisting of 2D images of fixed size with a given number of channels.

The architecture of the modified tight frame U-net is shown in Figure 4.1. It uses a hierarchical multi-scale representation defined by

 Nℓ = Gℓ ∘ ([Hh ∘ H⊺h; Hd ∘ H⊺d; Hv ∘ H⊺v; L ∘ Nℓ+1 ∘ L⊺] ∘ Fℓ) for ℓ ∈ {0, …, L−1}, (4.1)

with NL = id. Here

• L ∈ ℕ is the number of used scales;

• Fℓ and Gℓ are convolutional layers followed by a nonlinearity;

• Hh, Hv and Hd are horizontal, vertical and diagonal high-pass filters and L is a low-pass filter such that the tight frame property is satisfied,

 HhH⊺h+HvH⊺v+HdH⊺d+LL⊺=id. (4.2)

Following [12], we define the filters by applying the tensor products HH, HL, LH and LL of the Haar wavelet low-pass and high-pass filters separately in each channel.

We write the tight frame U-net defined by (4.1) in the form D ∘ E, where E is the encoder and D the decoder part. The encoder part

 E(x) = (E(ℓ)(x))ℓ=0,…,L

maps the image x to the high frequency parts at the ℓth scale, denoted by E(ℓ)(x) for ℓ < L, and to the low frequency part at the coarsest scale, denoted by E(L)(x). The decoder D then synthesizes the image from these coefficients recursively via (4.2).

### 4.2 Sparse network training

To enforce sparsity in the encoded domain, we use a combination of the mean squared error and an ℓ^q-penalty of the filter coefficients as loss function for training. The idea is to thereby enhance the sparsity of the high-pass filtered images.

Given a set of training images x₁, …, xN, we aim for an encoder-decoder network reproducing these images. For that purpose, we take the encoder-decoder pair (Eθ, Dη) as a minimizer of the loss function (the empirical risk)

 RN(Eθ, Dη) = (β/(2N)) ∑i=1..N ∥(Dη ∘ Eθ)(xi) − xi∥²₂ + (1/N) ∑i=1..N ∑ℓ=0..L wℓ∥Eθ(ℓ)(xi)∥^q_q + γ₁∥θ∥²₂ + γ₂∥η∥²₂.

Here θ and η are the adjustable parameters of the tight frame U-net architecture (specifically, of the convolutional layers Fℓ and Gℓ). The first term of the loss function trains the network to reproduce the training images xi. Following the sparse regularization strategy, the second term forces the network to learn convolutions such that the high-pass filtered coefficients are sparse. The additional term γ₁∥θ∥²₂ + γ₂∥η∥²₂ ensures the coercivity of the loss function and balances the size of the weights θ and η.
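For concreteness, the empirical risk RN can be written out as follows. The encoder and decoder here are generic callables taking a parameter vector; the tiny linear pair in the usage example is a hypothetical stand-in for the tight frame U-net, just to exercise the formula.

```python
import numpy as np

def empirical_risk(theta, eta, E, D, images, w, q=1.0,
                   beta=1.0, gamma1=1e-4, gamma2=1e-4):
    """Loss R_N: mean squared reconstruction error, weighted l^q penalty on the
    per-scale encoder coefficients, and weight decay on both parameter sets."""
    N = len(images)
    recon = sum(np.sum((D(E(x, theta), eta) - x) ** 2) for x in images)
    sparsity = sum(
        sum(w_l * np.sum(np.abs(c) ** q) for w_l, c in zip(w, E(x, theta)))
        for x in images
    )
    decay = gamma1 * np.sum(theta ** 2) + gamma2 * np.sum(eta ** 2)
    return beta / (2 * N) * recon + sparsity / N + decay

# toy check: a single-"scale" linear encoder/decoder pair that reproduces the input,
# so only the l^q term and the weight decay contribute
E = lambda x, th: [th[0] * x]      # encoder returns a list of per-scale coefficients
D = lambda c, et: et[0] * c[0]
theta, eta = np.array([1.0]), np.array([1.0])
risk = empirical_risk(theta, eta, E, D, [np.ones(4)], w=[1.0])
```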

Results for the sparse network training described above can be found in [20]. The actual application of the trained network and the augmented ℓ^q-NETT to limited-data problems in CT and elsewhere is the subject of current work.

## 5 Conclusion and outlook

In this paper, we proposed and analyzed a regularization method (called augmented ℓ^q-NETT) using the encoder E of an encoder-decoder network D ∘ E as a sparsifying transform. In order to obtain coercivity of the regularizer, we augmented the sparsity term with an additional penalty ∥x − (D∘E)(x)∥², which can be seen as a measure for the distance of x from the ideal data manifold. We were able to prove well-posedness and convergence of the augmented ℓ^q-NETT and derived convergence rates in the absolute Bregman distance. We proposed the modified tight frame U-net as network architecture together with an appropriate sparse training strategy.

Application to sparse data tomography is the subject of current work. A theoretical comparison with frame- and dictionary-based sparse regularization methods will be studied. Moreover, we work on the derivation of additional provable convergence rates for the augmented ℓ^q-NETT.

## Acknowledgments

D.O. and M.H. acknowledge support of the Austrian Science Fund (FWF), project P 30747-N32. The research of L.N. has been supported by the National Science Foundation (NSF) Grants DMS 1212125 and DMS 1616904.

## References

• [1] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of inverse problems, ser. Mathematics and its Applications.   Dordrecht: Kluwer Academic Publishers Group, 1996, vol. 375.
• [2] O. Scherzer, M. Grasmair, H. Grossauer, M. Haltmeier, and F. Lenzen, Variational methods in imaging, ser. Applied Mathematical Sciences.   New York: Springer, 2009, vol. 167.
• [3] D. Lee, J. Yoo, and J. C. Ye, “Deep residual learning for compressed sensing MRI,” in IEEE 14th International Symposium on Biomedical Imaging, 2017, pp. 15–18.
• [4] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Trans. Image Process., vol. 26, no. 9, pp. 4509–4522, 2017.
• [5] S. Antholzer, M. Haltmeier, and J. Schwab, “Deep learning for photoacoustic tomography from sparse data,” Inverse Probl. Sci. and Eng., vol. in press, pp. 1–19, 2018.
• [6] E. Kobler, T. Klatzer, K. Hammernik, and T. Pock, “Variational networks: connecting variational methods and deep learning,” in German Conference on Pattern Recognition.   Springer, 2017, pp. 281–293.
• [7] J. R. Chang, C.-L. Li, B. Poczos, and B. V. Kumar, “One network to solve them all – solving linear inverse problems using deep projection models,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5889–5898.
• [8] J. Adler and O. Öktem, “Solving ill-posed inverse problems using iterative deep neural networks,” Inverse Probl., vol. 33, p. 124007, 2017.
• [9] J. Schwab, S. Antholzer, and M. Haltmeier, “Deep null space learning for inverse problems: convergence analysis and rates,” Inverse Probl., vol. 35, no. 2, p. 025008, 2019.
• [10] H. Li, J. Schwab, S. Antholzer, and M. Haltmeier, “NETT: Solving inverse problems with deep neural networks,” 2018, arXiv:1803.00092.
• [11] S. Antholzer, J. Schwab, J. Bauer-Marschallinger, P. Burgholzer, and M. Haltmeier, “NETT regularization for compressed sensing photoacoustic tomography,” in Photons Plus Ultrasound: Imaging and Sensing 2019, vol. 10878, 2019, p. 108783B.
• [12] Y. Han and J. C. Ye, “Framing U-Net via deep convolutional framelets: Application to sparse-view CT,” IEEE Trans. Med. Imag., vol. 37, pp. 1418–1429, 2018.
• [13] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Comm. Pure Appl. Math., vol. 57, pp. 1413–1457, 2004.
• [14] R. Ramlau and G. Teschke, “A Tikhonov-based projection iteration for nonlinear ill-posed problems with sparsity constraints,” Numerische Mathematik, vol. 104, no. 2, pp. 177–203, 2006.
• [15] M. Grasmair, M. Haltmeier, and O. Scherzer, “Sparse regularization with ℓq penalty term,” Inverse Probl., vol. 24, no. 5, p. 055020, 2008.
• [16] ——, “Necessary and sufficient conditions for linear convergence of ℓ1-regularization,” Comm. Pure Appl. Math., vol. 64, no. 2, pp. 161–182, 2011.
• [17] S. Vaiter, G. Peyré, C. Dossal, and J. Fadili, “Robust sparse analysis regularization,” IEEE Transactions on information theory, vol. 59, no. 4, pp. 2001–2016, 2012.
• [18] M. Burger, J. Flemming, and B. Hofmann, “Convergence rates in ℓ1-regularization if the sparsity assumption fails,” Inverse Probl., vol. 29, no. 2, p. 025013, 2013.
• [19] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
• [20] D. Obmann, J. Schwab, and M. Haltmeier, “Sparse synthesis regularization with deep neural networks,” arXiv:1902.00390, 2019.