# Using auto-encoders for solving ill-posed linear inverse problems

Compressed sensing algorithms recover a signal from its under-determined linear measurements by exploiting its structure. Starting from sparsity, recovery methods have steadily moved towards more complex structures. Recently, the emerging machine learning techniques, especially the generative models based on neural nets, potentially, can learn general complex structures. Inspired by the success of such models in various computer vision tasks, researchers in compressed sensing have recently started to employ them to design efficient recovery methods. Consider a generative model defined as function g: U^k→R^n, U⊂R. Assume that the function g is trained such that it can describe a class of desired signals Q⊂R^n. The standard problem in noiseless compressed sensing is to recover x∈ Q from under-determined linear measurements y=A x, where y∈R^m and m≪ n. A recovery method based on g finds g( u), ∈ U^k, which has the minimum measurement error. In this paper, the performance of such a recovery method is studied and it is proven that, if the number of measurements (m) is larger than twice the dimension of the generative model (k), then x can be recovered from y, with a distortion that is a function of the distortion induced by g in describing x, i.e. _ u∈ U^kg( u)- x. To derive an efficient method, an algorithm based on projected gradient descent is proposed. It is proven that, given enough measurements, the algorithm converges to the optimal solution and is robust to measurement noise. Numerical results showing the effectiveness of the proposed method are presented.

## Authors

• 8 publications
• 59 publications
12/06/2019

### Lower Bounds for Compressed Sensing with Generative Models

The goal of compressed sensing is to learn a structured signal x from a ...
02/26/2019

### GAN-based Projector for Faster Recovery in Compressed Sensing with Convergence Guarantees

A Generative Adversarial Network (GAN) with generator G trained to model...
07/06/2015

### Visual Data Deblocking using Structural Layer Priors

The blocking artifact frequently appears in compressed real-world images...
09/02/2021

### Solving Inverse Problems with Conditional-GAN Prior via Fast Network-Projected Gradient Descent

The projected gradient descent (PGD) method has shown to be effective in...
09/09/2019

### Signal retrieval with measurement system knowledge using variational generative model

Signal retrieval from a series of indirect measurements is a common task...
03/17/2022

### Design of Compressed Sensing Systems via Density-Evolution Framework for Structure Recovery in Graphical Models

It has been shown that the task of learning the structure of Bayesian ne...
03/02/2021

### Structural Sparsity in Multiple Measurements

We propose a novel sparsity model for distributed compressed sensing in ...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

Solving inverse problems is at the core of many data acquisition systems, such as magnetic resonance imaging (MRI) and optical coherence tomography (OCT). In many of such systems, through proper quantization in time or space, the measurement system can be modeled as a system of linear equations as follows. The unknown signal to be measured is a high-dimensional signal , which belongs to , a compact subset of . The measured signal can be modeled as . Here , and

denote the sensing matrix, the measurement vector, and the measurement noise, respectively. The goal is often to design an efficient algorithm that recovers

from the measurements , when . In addition to computational complexity, the efficiency of an algorithm is measured in terms of its required number of measurements, its reconstruction quality, and its robustness to noise.

Compressed sensing, i.e., solving the described ill-posed linear inverse problem, is only possible, if the set is appropriately structured. Initially, the main focus in compressed sensing was on structures, such as sparsity [1, 2]. While many signals of interest are in fact sparse in some transform domain, most of them follow structures beyond sparsity as well. Enabling a recovery algorithm to take advantage of the full structure of the source potentially leads to a higher recovery performance. This has motivated researchers in compressed sensing to explore algorithms that go beyond simple models such as sparsity.

Developing a compressed sensing recovery method involves two major steps: i) studying the class of signals (e.g., natural images, or magnetic resonance imaging (MRI) images) and discovering a common property among them, say property , and ii) devising an efficient algorithm that finds a signal that is consistent with the measurements and also satisfies property . For instance, the well-known iterative hard thresholding algorithm [3] is an algorithm that is developed for the case where property is defined as sparsity.

In order to move beyond simple structures such as sparsity, one approach is to design learning-based recovery methods that are able to learn complex signal models from training data. In this approach, instead of the structure of the desired class of signals being given to us, it is learned from an available training dataset. One of the most recent approaches in this direction is to use generative models that are based on neural networks.

Generative models are commonly used in machine learning to solve classic problems, such as classification. The role of a generative model is to learn the distribution of a class of signals, such that it is able to generate samples from that class. (Refer to Chapters 4 and 12 in [4] to learn more about using generative models in classification.) Modern generative methods achieve this goal typically through employing trained neural networks. Generative adversarial nets (GANs) and variational auto-encoders (VAEs) are examples of methods to design complex generative models [5]. The success of such approaches in solving machine learning problems heavily relies on their ability to learn distributions of various complex signals, such as image and audio files. This success has encouraged researchers from areas such as compression, denoising and compressed sensing to also look into the application of such methods, as tools to capture the structures of signals of interest.

Given a class of signals, , consider a corresponding trained generative function , . Assume that is trained by enough samples from , such that it is able to represent signals from appropriately. In this paper, we study the performance of an optimization-based compressed sensing recovery method that employs as a mechanism to capture the structure of signals in . We derive sharp bounds connecting the properties of function (its dimensions, its error in representing the signals in , and its smoothness level) to the performance of the resulting recovery method.

The organization of the paper is as follows. Section II describes the problem of compressed sensing using generative models and states our main result in this domain. Section III describes an efficient algorithm based on projected gradient descent to approximate the solution of the exhaustive search optimization discussed in the prior sections. Section IV reviews some related work from the literature. Section V presents numerical results showing the performance of the proposed algorithm. Section VI presents the proofs of the main results and Section VII concludes the paper.

### I-a Notations

Vectors are denoted by bold letters, such as and . Sets are denoted by calligraphic letters, such as and . For a set , denotes its cardinality. For and , denotes the bit quantized version of is defined as . For a set and , let denote the set where every member in is quantized in bits, i.e.,

 Ab≜{[x]b:x∈A}.

## Ii Recovery using generative models

Consider a class of signals defined by a compact set . (For example, can be the set of images of human faces, or the set of MRI images of human brains.) Let function denote a generative function trained to represent signals in set .

###### Definition 1.

Function is said to represent with distortion , if

 minu∈Uk1√n∥g(u)−x∥≤δ. (1)

Consider the standard problem of compressed sensing, where instead of explicitly knowing the structure of signals in , we have access to function , which is known to well-represent signals in . In this setup, signal is measured as , where , and denote the sensing matrix, the measurement vector, and the measurement noise, respectively. The goal is to recover from , typically with , via using the function to define the structure of signals in .

To solve this problem, ideally, we need to find a signal that is i) compatible with the measurements , and ii) representable with function . Hence, ignoring the computational complexity issues, we would like to solve the following optimization problem:

 ^u=argminu∈Uk∥Ag(u)−y∥, (2)

After finding , signal

can be estimated as

 ^x=g(^u). (3)

The main goal of this paper is to theoretically study the performance of this optimization-based recovery method. We derive bounds that establish a connection between the ambient dimension of the signal , the parameters of the function , and the number of measurements .

To prove such theoretical results, we put some constraints on function . More precisely, consider and let and , where and . Assume that

1. represents with distortion , where ,

2. is -Lipschitz,

3. is a bounded subset of .

Define and as in (2) and (3), respectively. The following theorem characterizes the connection between the properties of function (input dimension and Lipschitz constant ), the number of measurements () and the reconstruction distortion ().

###### Theorem 1.

Consider and let . Assume that the entries of are i.i.d.  and . Define and as in (2) and (3), respectively. Set free parameters and , such that Then, if , we have

 1√n∥^x−x∥≤L√η(1+2√mn)δ1−2η−υ+α+4δ−2η+2δ−1η√nϵz.

where

, with a probability larger than

 1−e−(1−ζ)kln1δ−e−0.8ηk−e−m2−e−2(1−ζ)kln1δ,

where .

The proof of Theorem 1 is presented in Section VI-C. Theorem 1 essentially states that, as long as the number of measurements is larger than , then the optimization in (3) is able to recover the desired signal, with a distortion, which goes to zero, as converges to zero. The proof employs the following two lemmas, which assume that the optimization is done over a quantized space.

###### Lemma 1.

Consider and let , where and . Assume that the entries of are i.i.d.  and . Given function , define and as

 ^ub=argminu∈Ukb∥Ag(u)−y∥, (4)

and

 ^xb=g(^ub). (5)

Then,

 ∥^xb−x∥≤√1+τ11−τ2minu∈Ukb∥g(u)−x∥+ϵz√1−τ2,

with a probability larger than

 1−|Ub|kem2(τ2+ln(1−τ2))−e−m2(τ1−ln(1+τ1)).

The proof of Lemma 1 is presented in Section VI-A. Lemma 1 does not make any assumptions about the properties of function and its ability to describe signals in our desired set . The following lemma assumes some constraints on and also sets the free parameters and and the quantization level , to derive meaningful connections between the number of measurements and the reconstruction quality.

###### Lemma 2.

Consider the same setup as in Theorem 1. Define and as (4) and (5), respectively. Set free parameters and . Assume that

 m≥ηk, (6)

and

 b=⌈(1−υ)log1δ⌉, (7)

Then, we have

 1√n∥^xb−x∥≤2δ1−υ−1ηL√kn+2δ1−1η+ϵzδ−1η,

with a probability larger than

 1−e−(υ−ζ)kln1δ−e−0.8ηk,

where .

###### Remark 1.

Comparing Lemma 2 with Theorem 1 reveals that there is factor 2 difference between minimum their required number of measurements. Currently, these factor 2 shows up as a price of searching over the continuous space , instead of the discrete space . However, it is not clear to us at this point, if this factor is an artifact of the proof, or a fundamental difference between the two results.

## Iii Iterative projected gradient descent

The optimization described in (2) is a challenging non-convex optimization. Given that for generative models based on neural networks, the cost function is differentiable, [7] proposes applying standard gradient descent to solve this optimization. However, given the non-convexity of the cost function, there is no guarantees that this method succeeds. To address this issue, we propose an efficient algorithm based on projected gradient descent.

For , let

 st+1 =^xt+μAT(y−A^xt) ut+1 =argminu∈Uk∥st+1−g(u)∥ ^xt+1 =g(ut+1). (8)
###### Theorem 2.

Consider , and , where denotes a compact subset of and . Here, are i.i.d. . Assume that function is -Lipschitz and satisfies (1), for some . Define and as (4) and (5), respectively. Choose free parameters and define , and as

 η≜kn(1+(√nm+2)2)L2δ2α, (9)
 γ1≜(2+√nm)2(Lδα√kn+1), (10)

and

 γ2≜√2kn(2+√nm), (11)

respectively. Assume that

 m≥40(1+α+υ)klog1δ. (12)

Let , and for , define as (8). Then, for every , if , then, either , or

 1√n∥^x−^xt+1∥≤0.9+η√n∥^x−^xt∥+⎛⎜ ⎜ ⎜⎝ ⎷6(1+α)(log1δ)km+γ2Lδα⎞⎟ ⎟ ⎟⎠σ√n+γ1δ,

with a probability larger than

 1−2−2kυlog1δ−e−m2−e−0.1(1+α)(log1δ)k+2(ln2)k−e−0.15m.

The proof of Theorem 2 is presented in Section VI-D. Theorem 2 proves that while the original optimization is not convex, given enough measurements, the described projected gradient descent algorithm converges, even in the presence of additive white Gaussian noise.

In order to implement the proposed iterative method, described in (8), the step that might seem challenging is the projection step, i.e., . However, one can build (train) a separate neural network that approximates this step. Concatenating this neural network with the the neural network that approximates yields an “auto-encoder” that maps a high-dimensional signal into low-dimensions, and then back to its original dimension. Using this perspective, the last two steps of the algorithm, basically pass through an autor-encoder.

In Section V, we report simulation results showing the performance of the proposed algorithm applied with a training a neural network to implement the generative function .

## Iv Related work

As mentioned earlier, researchers in compressed sensing have recently started using neural networks to design novel compressed sensing recovery methods [6, 7]. Various networks have been developed for diverse applications of compressed sensing [8, 9, 10, 11, 12, 13, 14].The hope is to design algorithms with superior performance compared to the existing methods. Similar to this paper, the authors of [7] also study application of generative models in CS. For the optimization-based method described in Eqs. (2)-(3), [7] proves that roughly measurements are sufficient for recovery. To compare with our main result, note that our Theorem 1 states that the required number of measurements can get arbitrary close to , regardless of constant and other parameters of the problem.

Another related line of work is the application of compression codes in designing efficient compression-based recovery methods [15]. The goal of such methods it to elevate the scope of structures used by CS algorithms to those used by compression codes. Such an optimization is very similar to Eq. (3). However, the difference between these two approaches is that while a lossy compression code can be represented by a discrete set of codewords, a generative function has a continuous input and therefore it is more challenging to derive theoretical performance bounds.

## V Simulation results

We trained a two-layer AutoEncoder [16] as our generative model, , depicted in Fig. 1, and we use the MNIST [17] digits data in our training and testing (but different sets). We used 35,000 digits to train our network, where the output of the first layer is a latent variable with a dimension of , recalling the input is of dimension . For the second layer, the input is of dimension , and we vary the dimension of the output from to . This is the dimension of our generative function in our theory. The -loss between the input (of the encoder) and output (of the decoder) are used and we imposed the sparsity the both encoder and decoder.

During our testing, we compressively sample digits using various number of measurements and reconstruct them using the trained AutoEncoder with various number of by employing the iterative projected gradient descent algorithm described in Section III. The Peak-Signal-to-Noise-Ratio (PSNR) is employed as the metric to compare our proposed algorithm with the Lasso [18] and the Total-Variation (TV) [19] algorithms. Fig. 2 plots the average PSNR of reconstructed digits against increasing measurements using different algorithms. It can be seen that our proposed algorithm outperforms both Lasso and TV and when the measurement number () is getting larger, the gap between our algorithm and Lasso is getting smaller. This may due to the fact that our trained generative model does not learn a perfect structure. A deeper and more complicated model may perform better. Another interesting finding is that the performance of Lasso is increasing sharply when . This may due to the sparsity of the digits. This also confirms that the Lasso’s performance is bounded by the sparsity of the signal while the performance of our proposed method mostly rely on the dimension of the generative model, , which can be much smaller than the sparsity of the signal.

In Fig. 3, we exploit the performance change of our proposed algorithm with different . It can be seen that results are fluctuating within 0.5dB for increasing from 100 to 1,500. This denotes that the might be sufficient to capture the structure of these digits. From our Theorem 1, might be enough to recover the digits; this corresponds to , which is also a changing point in our PSNR curve in Fig. 2. Therefore, even with some approximation, we can claim that our simulation has verified our theoretical guarantee.

Fig. 4 shows some results of the reconstructed digits by three algorithms, which again verifies the excellent performance of our proposed algorithm.

## Vi Proofs

The following lemma from [20]

on the concentration of Chi-squared random variables is used in the proof.

###### Lemma 3 (Chi-squared concentration).

Assume that are i.i.d. . For any we have

 P(∑mi=1U2i>m(1+τ))≤e−m2(τ−ln(1+τ)), (13)

and for ,

 P(∑mi=1U2i

Also, the following lemma from [21] are used in the proof of Theorem 2.

###### Lemma 4.

Consider and such that . Let . Consider matrix with i.i.d. standard normal entries. Then, for any ,

 P(⟨u,v⟩−1m⟨Au,Av⟩≥τ)≤em((α−τ)s)−m2ln((1+sα)2−s2), (15)

where is a free parameter smaller than . Specifically, for ,

 P(⟨u,v⟩−1m⟨Au,Av⟩≥0.45)≤2−0.05m. (16)
###### Lemma 5.

Consider and , where are i.i.d. . Then the distribution of is the same as the distribution of , where is independent of .

### Vi-a Proof of Lemma 1

Let

 ~ub=argminu∈Ukb∥g(u)−x∥, (17)

and Since is the minimizer of , over all , we have

 ∥Ag(^ub)−y∥ ≤∥Ag(~ub)−y∥. (18)

But, since ,

 ∥A(g(^ub)−x)−z∥ ≤∥A(g(~ub)−x)−z∥. (19)

From the triangle inequality, and . It follows from (19) that

 ∥A(g(^ub)−x)∥ ≤∥A(g(~ub)−x)∥+2∥z∥. (20)

Given , define event as

 E1≜{∥A(g(~ub)−x)∥≤√m(1+τ1)∥g(~ub)−x∥}. (21)

From Lemma 3,
On the other hand, given , define event as

 E2={∥A(g(u)−x)∥ ≥√m(1−τ2)∥g(u)−x∥: ∀u∈Ukb}. (22)

Again by Lemma 3, for a fixed , with a probability larger than ,

 ∥A(g(u)−x)∥≥m(1−τ2)∥g(u)−x∥. (23)

Therefore, applying the union bound, it follows that

 P(Ec2)≤|Ub|kem2(τ2+ln(1−τ2)).

Therefore, conditioned on , from (20), we have

 √m(1−τ2)∥g(^ub)−x∥=√m(1−τ2)∥^xb−x∥ ≤√m(1+τ1)∥~xb−x∥+2∥z∥.

Since by assumption , we have

 ∥^xb−x∥≤√1+τ11−τ2∥~xb−x∥+2ϵz√1−τ2.

### Vi-B Proof of Lemma 2

Define and as done in Lemma 1. Also, let , and . From Lemma 1,

 ∥^xb−x∥≤√1+τ11−τ2minu∈Ukb∥g(u)−x∥+2ϵz√1−τ2, (24)

with a probability larger than On the other hand,

 minu∈Ukb∥g(u)−x∥=∥g(~ub)−x∥ (a)≤∥g([~u]b)−x∥=∥g([~u]b)−g(~u)+g(~u)−x∥ ≤∥g([~u]b)−g(~u)∥+∥g(~u)−x∥ (b)≤L∥[~u]b−~u∥+√nδ≤L√k2−b+√nδ, (25)

where holds because is the minimizer of over all in and as well. holds because is -Lipschitz by assumption and also satisfies (1). But is defined as (45). Therefore,

 L√k2−b+√nδ≤δ1−υL√k+δ√n. (26)

Set and

 τ2=1−δ2η.

Combining (24) and (26), and using the selected parameters, it follows that

 1√n∥^xb−x∥≤2δ1−υ−1ηL√kn+2δ1−1η+2ϵz√nδ−1η. (27)

Since is a compact set, there exist integer numbers and , such that . Therefore, , where . Therefore, since, for , and by assumption, it follows that

 kln|Ub|+m2(τ2+ln(1−τ2)) ≤k(lna+bln2)+ηk2(τ2+ln(1−τ2)) ≤k(lna−(1−v)lnδ)+ηk2(τ2+ln(1−τ2)), (28)

where the last line follows because and hence . Therefore, from (28), inserting the value of , we have

 kln|Ub|+m2(τ2+ln(1−τ2)) ≤k(lna−(1−υ)lnδ)+ηk2(1−δ2η+2ηlnδ) =−k(υ−ζ)ln1δ, (29)

where

 ζ=lna+η2(1−δ2η)ln1δ. (30)

Note that only depend on , and and .

### Vi-C Proof of Theorem 1

As in Lemma 2, define and . Also, given and , define events and as (21) and (22), respectively. Then, as proved in Lemma 2, for and , choosing , , for and , conditioned on , (27) holds. Also, we proved that

 P(E1∩E2)≥1−e−(υ−ζ)kln1δ−e−0.8ηk,

where and comes from and . While this result bounds , to prove our desired result, we need to bound by connecting it to . To derive this, note that

 ∥^x−x∥ =∥g(^u)−g([^u]b)+g([^u]b)+g(^ub)−g(^ub)−x∥ (a)≤L√k2−b+∥g([^u]b)−g(^ub)∥+∥g(^ub)−x∥ (b)≤L√kδ1−υ+∥g([^u]b)−g(^ub)∥+∥^xb−x∥, (31)

where follows from the assumed Lipschitz property of function and the definition of symbol-by-symbol quantization operator. holds since and . To finish the proof, we need to bound .

Since and , it follows that

 ∥Ag(^u)−y∥≤∥Ag(^ub)−y∥≤∥Ag([^u]b)−y∥. (32)

Therefore, using (32) and applying the triangle inequality, it follows that

 ∥Ag([^u]b)−Ag(^ub)∥ =∥Ag([^u]b)−Ag(^u)+Ag(^u)−y+y−Ag(^ub)∥ ≤∥Ag([^u]b)−Ag(^u)∥+∥Ag(^u)−y∥ +∥y−Ag(^ub)∥ ≤Lσmax(A)2−b√k+2∥Ag(^ub)−y∥, (33)

where the last step follows from the Lipschitz property of function and (32). Given , define event as

 {∥A(g(u1)−g(u2))∥≥√m(1−τ3)∥g(u1)−g(u2)∥: ∀(u1,u2)∈Ukb×Ukb}. (34)

Recall that . Conditioned on ,

 ∥Ag([^u]b)−A^xb∥≥√m(1−τ3)∥g([^u]b)−^xb∥. (35)

Combining (33) and (35), it follows that

 ∥g([^u]b)−^xb∥≤ 1√m(1−τ3)(Lσmax(A)2−b√k+2∥Ag(^ub)−y∥). (36)

But, using (20) and (25), conditioned on , we have

 ∥Ag(^ub)−y∥≤√m(1+τ1)∥g(~ub)−x∥+2∥z∥ ≤√m(1+τ1)(L√k2−b+√nδ)+2∥z∥. (37)

Define event as As shown in [22],

Now combining the bound in (37) with the bound in (36), and setting , conditioned on , we have

 1√n∥g([^u]b)−^xb∥ ≤1√1−τ3(L(1+6√mn)2−b√km+4δ)+4ϵz√n(1−τ3) ≤1√1−τ3(L√η(1+2√mn)δ−υ+4)δ+4ϵz√n(1−τ3), (38)

where the last line follows because and . Combining Lemma 3 and the union bound, it follows that

 P(Ec3)≤|Ub|2kem2(τ3+ln(1−τ3)). (39)

Using the same argument as the one used to derive (28), since , and , choosing , we have

 2kln|Ub|+m2(τ3+ln(1−τ3)) ≤2k(lna−(1−v)lnδ)+ηk2(1−δ4η+4ηlnδ) ≤−2(1−ζ)kln1δ, (40)

where . For , from (38), it follows that

 1√n∥g([^uc]b)−g(^u)∥ ≤L√η(1+2√mn)δ1−2η−υ +4δ1−2η+4√nδ−2ηϵz. (41)

Finally, combining (27) and (31) and (41), it follows that, conditioned on , we have

 1√n∥^x−x∥ ≤2δ1−υ−1ηL√kn+