 # Image Restoration from Parametric Transformations using Generative Models

When images are statistically described by a generative model we can use this information to develop optimum techniques for various image restoration problems as inpainting, super-resolution, image coloring, generative model inversion, etc. With the help of the generative model it is possible to formulate, in a natural way, these restoration problems as Statistical estimation problems. Our approach, by combining maximum a-posteriori probability with maximum likelihood estimation, is capable of restoring images that are distorted by transformations even when the latter contain unknown parameters. This must be compared with the current state of the art which requires exact knowledge of the transformations. We should also mention that our method does not contain any regularizer terms with unknown weights that need to be properly selected, as is common practice in all recent generative image restoration techniques. Finally, we extend our method to accommodate combinations of multiple images where each image is described by its own generative model and the participating images are being separated from a single combination.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

As a first step towards the presentation of our methodology for image restoration let us introduce a simple mathematical problem. Assume that we are observing a vector

which is a transformed and noisy version of a hidden vector . We are interested in estimating from the observation when we have available a generative model that captures the statistical behavior of .

A number of well known computer vision problems can be formulated as the estimation problem we just described:

Inpainting, which consists in reconstructing an image when we have available its partially erased version ; Super-resolution, where from a lower resolution image we recover a higher resolution version ; Image coloring, where from a gray-color image we recover the full-color (RGB) version ; finally, Generative model inversion, where from an image we identify the input to the generative model that generates an output which is as close as possible to the available image ; are just a few of the image restoration problems of interest.

Regarding image inpainting, classical techniques can be found in

afonso ; barnes ; efros ; hu ; shen ; for super-resolution an overview of classical methods exists in allebach ; freeman-1 ; freeman-2 ; gu ; li-x and for coloring in horiuchi ; semary . With the advent of generative networks goodfellow , a new class of tools has emerged which is based on generative models. For inpainting such methods are considered in siavelis ; yeh1 ; yeh2 , with the super-resolution and coloring problem also being mentioned in the same articles. The inversion of a generative model enjoys exactly the same formulation and suitable solutions can be found in daras ; lei .

Early efforts based on generative modeling daras ; lei ; siavelis ; yeh1 ; yeh2 were using the generative model partially (only the generator function). Only recently asim ; whang we see techniques that employ the complete model (generator function and input density) improving the performance of the corresponding methods. Even in these recent efforts we observe the existence of weighting parameters that need to be selected. This selection is carried out by applying the corresponding method multiple times with different parameter values and adopting the one delivering the best result. Since our own methodology relies on the well established Statistical estimation theory, we will be able to identify in a well defined way all parameters entering into the problem. In fact the method we are going to develop will be able to treat cases where the transformation responsible for the image distortion is not exactly known as required by all existing methods.

## 2 Statistical estimation theory

Let us recall some very basic elements from Statistical estimation theory. Consider two random vectors where is hidden while is observed. Using , we would like to estimate the hidden vector when

are statistically related, with their relationship captured by the joint probability density function (pdf)

.

The existence of a joint density allows for the applications of the well known Bayesian estimation theory (veeravalli, , pp. 259–279) for the estimation of from . According to this theory, any deterministic function of can play the role of an estimator for . Bayesian estimation provides a solid mathematical mechanism for evaluating these candidate estimators and identifying the one that will play the role of the final optimum estimator. Following this theory, we first need to provide a cost function that places a cost on each combination of estimate and true value. Then, the performance criterion of an estimator is defined as the average cost , where expectation is with respect to and . An estimator will be regarded as optimum if it minimizes the average cost among all deterministic functions of .

In the existing theory, one can find several meaningful cost functions along with their corresponding optimum estimators. From the available possibilities we distinguish two candidates that are of interest to our analysis. Specifically, we focus on the minimum mean square error (MMSE) and the maximum a-posteriori probability (MAP) estimators (veeravalli, , pp. 267-268) which we present next.

MMSE: The cost function for this estimator is . It is then well known veeravalli that the optimum estimator is defined as the conditional mean

 ^Z=E[Z|Y]=∫Zf(Z|Y)dZ=∫Zf(Y,Z)f(Y)dZ=∫Zf(Y,Z)dZ∫f(Y,Z)dZ. (1)

where denotes the conditional probability density of given .

MAP: Here the cost function is somewhat peculiar and defined as

 C(^Z,Z)={1,if ∥^Z−Z∥≥δ0,otherwise,

where denotes a very small quantity (tending to 0). This criterion is known (veeravalli, , page 267) to lead to the well known MAP estimator which is defined as

 ^Z=argmaxZf(Z|Y)=argmaxZf(Y,Z)f(Y)=argmaxZf(Y,Z), (2)

corresponding to the most likely given the observations . There are of course other popular alternatives as, for example, the minimum mean absolute error (MMAE) criterion which leads to the conditional median estimator (veeravalli, , page 267). However, in this work we analyze only the two estimators depicted in (1) and (2) and in the simulations we basically use the MAP estimator that presents clear computational advantages.

### 2.1 Including unknown parameters

The previous well known results are based on the assumption that there is available (known) a joint density that captures the statistical relationship between and . In practice, however, the joint density may also contain a number of parameters that we express with the help of a vector . In other words, the joint pdf of is of the form for some vector . The Bayesian approach treats parameters as random as well, consequently is regarded as the pdf of conditioned on . Since is also random, its statistical behavior is captured by the pdf . This implies that the joint density of all three random vectors has the form .

As before, we assume that we observe and interested in estimating . The question now is how should we treat . There are two possibilities:

Marginalization of : We can compute and then use (1) or (2).

Estimation of : We can apply (1) or (2) with replaced by , thus considering as part of the quantities to be estimated. Since our focus is only on this implies that finding is simply an intermediate step towards our desired goal.

In the applications of interest, as we will see, we have available the density but not . We can overcome this lack of knowledge by following a worst-case scenario, namely assume that is the most uninformative prior. If , where is some known set, then this corresponds to selecting to be the uniform over , provided that the Lebesgue measure is finite. If then we can adopt to be the improper uniform (oconnor, , Page 27). We can easily verify that in both cases we obtain the same results if we consider from the start that for all . This is exactly what we are going to adopt in our subsequent analysis.

MMSE: Here, marginalization and estimation of result in exactly the same estimate for which, under the improper uniform, takes the form

 ¯f(Y,Z)=∫γ∈Cf(Y,Z|γ)dγ,  ^Z=∫Z¯f(Y,Z)dZ∫¯f(Y,Z)dZ. (3)

MAP: For this estimator the two approaches differ. In the case of the marginalization approach we obtain

 ^Z=argmaxZ¯f(Z,Y), (4)

with defined in (3), while the estimation approach yields

 ~f(Z,Y)=maxγ∈Cf(Y,Z|γ),  ^Z=argmaxZ~f(Z,Y). (5)

We observe that (5) is equivalent to first performing a maximum likelihood estimation (veeravalli, , pp. 319–351) of and then a MAP estimation of .

After this very brief presentation of the necessary results from Statistical estimation theory, we are now ready to apply these ideas to image restoration problems.

## 3 Image restoration and generative model

Let us focus on the problems of interest and include the generative model into our formulation. Suppose is a random vector described by the generative model with the input being distributed according to the density . Both functions that comprise the generative model are assumed known. The generator

can be a neural network (deep or shallow), trained with the help of adversarial

arjovsky ; basioti1 ; binkowski ; goodfellow ; nowozin or non-adversarial techniques dziugaite . If a non-adversarial training method is adopted then this clearly suggests that a discriminator function, which plays an important role in the techniques proposed in daras ; siavelis ; yeh1 ; yeh2 , does not exist. For this reason our goal, similarly to asim ; bora ; whang , is to propose estimation techniques that do not rely on discriminator functions.

Every time a realization of occurs, we assume that it undergoes a transformation and its noisy version is observed as a data vector . More specifically, the observation vector and the original vector are related through the following equation

 Y=T(X,α)+W. (6)

is a deterministic transformation with known functional form that can possibly contain unknown parameters with a known set. is a random vector independent from that expresses additive noise and/or modeling errors. For we assume that it is distributed according to the density which has a known functional form and possibly unknown parameters , with a known set. The problem we would like to solve is the recovery of the vector from the observation vector . To achieve this estimate we intend to exploit the generative model in the following way: Instead of finding directly the estimate as in asim ; whang which requires the generator function to be invertible, we propose to obtain the estimate of the input to the generator and then estimate as . This simple variation allows for the adoption of any form for the generator function even a non-invertible one and, as we will see, for efficient treatment of unknown parameters.

To apply the estimation theory presented in the previous section we first need to find the joint density of . It is easy to see that the parameter vector of the general theory corresponds to the combination and

 f(Y,Z|α,β)=g(Y−T(G(Z),α)|β)h(Z). (7)

In (7) is completely known since it is the pdf of the input to the generative model, while is known up to possibly some unknown parameter vector . If we apply the estimators of the previous section then for the MMSE in (3) we must define

 ¯g(W)=∫β∈Bg(W|β)dβ,    ¯¯g(Y,Z)=∫α∈A¯g(Y−T(G(Z),α))dα (8)

which yields the estimate

 ^Z=∫Z¯¯g(Y,Z)h(Z)dZ∫¯¯g(Y,Z)h(Z)dZ=EZ[Z¯¯g(Y,Z)]EZ[¯¯g(Y,Z)]. (9)

The last ratio contains expectations with respect to which is distributed according to . Similarly, for the MAP estimator in (4) and (5) we can write for the marginalization approach that

 ^Z=argmaxZ¯¯g(Y,Z), (10)

with defined in (9), while for the estimation approach we have

 ~g(W)=maxβ∈Bg(W|β),  ~~g(Y,Z)=maxα∈A~g(Y−T(G(Z),α)),  ^Z=argmaxZ~~g(Y,Z)h(Z). (11)

When the transformation does not contain any unknown parameters, the previous expressions simplify to

 (12)

for the MMSE, while for the MAP estimation we have

 ^Z=argmaxZ¯g(Y−T(G(Z)))h(Z), or ^Z=argmaxZ~g(Y−T(G(Z)))h(Z), (13)

with the first corresponding to marginalization and the second to maximum likelihood estimation of . Our approach, because it is based on the very well established Statistical estimation theory, enjoys a number of interesting properties: 1) As in asim ; whang , it uses the complete generator model in order to perform the estimation and does not require any discriminator function. 2) The final optimization problem does not contain terms in the form of regularizers that include unknown weights. 3) We can treat transformations and noise pdfs that contain unknown parameters which are being properly identified using maximum likelihood estimation. The last two properties are unique to our proposed approach and are not present in any other existing generative image restoration methodology.

### 3.1 Gaussian noise and Gaussian input

Let us now specify even further our mathematical model. For the additive noise vector appearing in the data model in (6

), we assume that it has Gaussian elements that are independent and identically distributed with mean zero and variance

. We adopt the Gaussian model only for simplicity. It is possible to resort to more general noise models as for example the Student’s -distribution which was successfully employed in classical (non generative) image restoration techniques chantas . Unfortunately Student’s

distribution does not offer closed form expressions for its parameters as the Gaussian distribution. And this is something we would like to have in order to be able to compare our resulting cost function with the costs employed in the existing literature. Limiting ourselves to Gaussian noise for

where each element has zero mean and variance , yields

 g(W|β)=e−12β2∥W∥2√(2π)Nβ2N,  ¯g(W)=∫∞0g(W|β)dβ=C∥W∥N+1,  ~g(W)=maxβ≥0g(W|β)=C′∥W∥N,

where constants and is the length of the vector . If we assume that the input density is also Gaussian , where

the identity matrix, then for known transformations

the MMSE estimate in (12) becomes

 ^Z=EZ[Z∥Y−T(G(Z))∥−(N+1)]EZ[∥Y−T(G(Z))∥−(N+1)]≈∑Li=1Zi∥Y−T(G(Zi))∥−(N+1)]∑Li=1∥Y−T(G(Zi))∥−(N+1)]. (14)

We note that we generate realizations of

and by evoking the the Law of Large Numbers we approximate the MMSE estimate. For the MAP estimates in (

13) we have

 ^Z=argminZ{(N+1)log∥Y−T(G(Z))∥2+∥Z∥2}, or ^Z=argminZ{Nlog∥Y−T(G(Z))∥2+∥Z∥2}, (15)

with the first corresponding to the marginalization and the second to the estimation of . The two expressions are clearly very similar especially in the case where . We can now compare our optimizations in (15) with

 ^Z=argminZ{∥Y−T(G(Z))∥2+λ∥Z∥} and ^Z=argminZ{∥Y−T(G(Z))∥2+λ∥Z∥2}, (16)

where the first is proposed in asim and the second in bora ; whang . Both approaches in (16) contain an unknown weight which in order to be selected properly we need to solve the corresponding optimization problem several times for different values of this parameter and select the one delivering the best results. In our approach in (15) such parameter is clearly not present. Another notable difference is how the error distance is combined with the input power . In our method we use the logarithm of the distance while in asim ; bora ; whang it is the distance itself combined with . We would like to emphasize that the cost function of our approach is not selected in some arbitrary way but it is the outcome of a theoretical analysis which is based on the Statistical estimation theory.

Even though the MMSE estimator in (14) does not involve any additional optimization when the transformation has no parameters, it can be used only when the length of is small. Indeed, for large the expression may very easily take extremely small or extremely large values which will cause computational problems due to finite precision. Unless is of the order of a few tens this method should not be employed. This observation clearly applies to images where can be several thousands. From now on we focus on MAP estimation and since the difference between the two versions in (15) is minor when is large, we adopt the second version where we estimate instead of marginalizing it.

### 3.2 Parametric transformations

Let us now consider the more challenging problem of a transformation containing unknown parameters . Following our general theory for the case of the noise and the generator input being both Gaussian, the MAP estimator with maximum likelihood estimation of the parameters is equivalent to

 ^Z=argminZ{Nlog(minα∥Y−T(G(Z),α)∥2)+∥Z∥2}. (17)

Additionally, if the transformation is linear111In most restoration problems the transformation enters as a matrix multiplied element-by-element with the ideal image which is also expressed as a matrix. However if we reshape the image into a vector this multiplication becomes the classical matrix/vector multiplication that we adopt in our analysis., that is, , which is the case in most restoration problems, then the optimization problem in (17) becomes

 ^Z=argminZ{Nlog(minα∥Y−T(α)G(Z)∥2)+∥Z∥2}, (18)

where is a matrix parametrized with . Finally we can further advance our analysis if we assume that is linear in its parameters namely it can be decomposed as

 T(α)=α1T1+⋯+αMTM (19)

where are known matrices and is the unknown parameter vector. As an example consider the coloring problem where we have with each matrix isolating the corresponding RGB component from the ideal colored image and the scalar quantities denoting the percentage by which each color component contributes to the final gray level. If these percentages are known before hand then the resulting transformation is also known and the coloring problem can be treated by existing techniques. If, however, are unknown, then we need to estimate these parameters in parallel with , by solving (18).

Regarding the minimization with respect to we can either combine it with the minimization with respect to and use, for example, a gradient descent for the pair or, in the linear case we can find the analytic solution of the minimization with respect to , substitute it, and then minimize only over . The first idea is straightforward and requires no further explanation. For the second if we define the matrix then from (18) and (19) we conclude that

 minα∥Y−T(α)G(Z)∥2=minα∥Y−Xα∥2=∥Y∥2−Y⊺X(X⊺X)−1X⊺Y,

with the last outcome following from the Orthogonality Principle hajek and indicating the projection error of onto the space generated by the columns of . This expression, when substituted in (18), yields

 ^Z=argminZ{Nlog(∥Y∥2−Y⊺X(X⊺X)−1X⊺Y)+∥Z∥2}, (20)

where the only unknown is and, we recall from its definition, that is also a function of .

### 3.3 The image separation problem

In anirudh ; soltani existing single image methods are extended to combinations of multiple images where each image is described by a separate generative model. These extensions experience the same drawbacks as the original methods, namely, a) they contain multiple regularizer terms with unknown weights that need to be properly determined, b) the proposed criteria are not the outcome of some rigorous mathematical analysis, and c) the corresponding methods cannot accommodate combinations involving unknown parameters.

We can overcome the previous weaknesses by generalizing our methodology to cover multiple images as well. For simplicity we only consider the two image case with the analysis of any number of images being very similar. Suppose that we have two images each satisfying a generative model with input density . If the data model follows

 Y=α1X1+α2X2+W (21)

where the additive noise/modeling error has density with parameters then we can combine all parts and produce the joint probability density

 f(Y,Z1,Z2|α1,α2,β)=g(Y−α1G1(Z1)−α2G2(Z2)|β)h1(Z1)h2(Z2). (22)

In (22) we made the assumption that are statistically independent which produced the product of the two input densities. Following the general theory and limiting ourselves to the MAP estimator we need to solve the optimization problem

 maxZ1,Z2maxα1,α2,βf(Y,Z1,Z2|α1,α2,β)=maxZ1,Z2{maxα1,α2,βg(Y−α1G1(Z1)−α2G2(Z2)|β)}h1(Z1)h2(Z2). (23)

If, as before, is Gaussian with mean 0 and covariance and both input vectors are independent Gaussian with mean 0 and unit covariance matrix then from (23), after maximizing over , we conclude that

 {^Z1,^Z2}=argminZ1,Z2{Nlog(minα1,α2∥Y−α1G1(Z1)−α2G2(Z2)∥2)+∥Z1∥2+∥Z2∥2}. (24)

We can either apply gradient descent on the combination or solve analytically for , substitute, and then minimize over . Regarding the latter we have from the Orthogonality Principle hajek

 minα1,α2∥Y−α1G1(Z1)−α2G2(Z2)∥2=∥Y∥2−Y⊺G(G⊺G)−1G⊺Y

where . Substituting in (24) gives rise to

 {^Z1,^Z2}=argminZ1,Z2{Nlog(∥Y∥2−Y⊺G(G⊺G)−1G⊺Y)+∥Z1∥2+∥Z2∥2}, (25)

where we recall that is a function of .

We can also accommodate the case where transformations are applied to each individual image suggesting that each image can experience different deformation before the final combination. This corresponds to replacing each component in (21) with where each transformation can have its own local unknown parameters . Obtaining the necessary equations for this more general setup presents no particular difficulty.

Unlike the classical source separation problem cardoso ; kofidis where we need as many (linear) combinations of the sources as the number of sources we are seeking, here separation can be achieved from a single combination. Of course this is possible because we have available a statistical description of the sources in the form of generative models. We recall that in classical source separation such description is not present and separation is achieved by processing directly the available combinations.

## 4 Experiments

For our experiments we used the CelebA dataset liu which contains 202,599 RGB images that were initially cropped to 64x64x3 and then separated into two sets 202,499 for training and 100 for testing. The training data were used to design a Wasserstein GAN arjovsky using the progressive method developed in karras . We used the following configuration, Generator: input 512 Gaussian . We employ 5 layers. Each layer consists of two convolution operations with kernels except the first layer that has and kernels and the last that has two and one kernel resulting in an output of

. After each convolution a leaky ReLU is applied except for the last

convolution where no nonlinear function is used. The intermediate layers also involve an upsampling operation. Discriminator: Input with 6 layers in total. The first 5 layers have 2 convolutions with two kernels except the first layer which has an additional layer and the last layer which has a and a kernel. After each convolution we apply a leaky ReLU except for the last kernel where no nonlinearity is used. In the intermediate layers we apply downsampling except the last layer. Finally we employ a fully connected part which provides the scalar output of the discriminator.

Perhaps the most common deformation in images corresponds to a linear filter convolved with an image. In particular when the filter is one-dimensional and applied to the image row-by-row then this can model a horizontal motion blur. This idea can clearly be extended to cover blurring in any direction but for simplicity we consider the case of horizontal blurring. We use a finite impulse response filter of length 5 with coefficients that were randomly generated. Examples of the resulting visual effect are depicted in Figure 1.

The goal is from the deformed images to restore the originals. We compare the methods of yeh1 and whang (which in this example coincides with bora ) against our method. The techniques in yeh1 ; whang ; bora require exact knowledge of the filter coefficients. They also require fine-tuning of the weight appearing in (16). This is achieved by solving multiple examples of optimization problems with various weight values and selecting the one delivering the smallest reconstruction error. In the case of yeh1 this turned out to be 0.6 while in bora ; whang 0.2. We should emphasize that these values are filter dependent meaning that if the filter changes we need to repeat the fine-tuning procedure. What is even more crucial is that the fine-tuning requires exact knowledge of the filter which, in case the filter is unknown, fine-tuning will be impossible to perform.

Since our method has no unknown weights it can be applied directly without the need of any fine-tuning procedure. We distinguish two versions in our approach. In the first we assume that we know the filter coefficients exactly in order to make fair comparisons with bora ; yeh1 ; whang . In the second version we assume that the filter coefficients are unknown, therefore, in the second version we simultaneously estimate the filter coefficients and restore the original image by solving 20. Unlike our proposed technique, existing methods do not perform this combined optimization and are therefore unable to restore the original image when the transformations contain unknown parameters.

In all competing methods, we applied the momentum gradient descent qian

with normalized gradients where the momentum hyperparameter was set to 0.999 and the learning rate to 0.001. As in

daras ; yeh1 we ran our simulation for every testing image five times, by starting it from different points and retaining the solution with the smallest reconstruction error. Figure 2, row a) depicts ten examples of original faces; in row b) we see their blurred version when we apply the selected filter; rows c), d), e) present the restoration provided by yeh1 , bora ; whang and proposed method respectively when the filter is known; finally row f) is the restoration results with our method but assuming that the filter is unknown and we need to estimate it at the same time with the restoration process. Table 1 contains the corresponding restoration error and PSNR of each method. Visibly it is difficult to understand the difference in quality between the various techniques and this is also expressed in Table 1 where the restoration errors and PSNRs are comparable. This observation applies even in the case of our method that assumes the filter to be unknown. Of course, we would like to emphasize once more, that our method compared to its existing rivals enjoys certain unique properties a) it does not contain weights that need fine-tuning; b) it is capable of restoring images even when the transformations have unknown parameters and c) our criterion is not ad-hoc but the outcome of a rigorous mathematical analysis. Figure 2: Row a) Original images; b) Horizontally blurred images; c) yeh1 , known transformation; d) bora ; whang , known transformation; e) Proposed, known transformation; f) Proposed, unknown transformation. Table 1: corresponding reconstruction errors and PSNRs.

The second set of experiments consists in recovering an RGB image from one of its chromatic components. For such we select the green channel. This information is passed to the methods of yeh1 ; whang ; bora and the first version of our method. However, in our second version we assume that we do not know which channel generates the observed gray-level data and we attempt to find the right channel at the same time we perform our restoration. We recall that the channel decomposition is a linear process that can be implemented with three matrices . The fact that now our parameter is discrete does not result in any serious change in the optimization problem in (18), which must be modified as

 ^Z=argminZ{Nlog(mini=R,G,B∥Y−TiG(Z)∥2)+∥Z∥2}, (26)

For the solution of (15), (16) and (26) we employed the same algorithm and hyperparameters of the first example, except of course the weight in yeh1 and bora ; whang which we had to fine-tune again. This resulted in 0.1 for yeh1 and 0.5 for bora ; whang . Figure 3 depicts the original RGB images in row a); the green channel in gray-level in row b); rows c), d), e) have the restoration with yeh1 , bora ; whang and the Figure 3: Row a) Original images; b) Green channel in gray; c) yeh1 , known channel; d) bora ; whang , known channel; e) Proposed, known channel; f) Proposed, unknown channel. Table 2: corresponding reconstruction errors and PSNRs.

proposed method respectively but when the channel is known; finally f) contains the proposed method when the channel is unknown and we need to solve (26). We also see in Table 2 the corresponding restoration errors and PSNRs. Again we realize that our proposed methodology provides comparable restoration quality as the existing methods when the transformation is known without the need to fine-tune any weights and, most importantly, it can also deliver comparable quality even if the transformation contains unknown parameters which can take continuous or discrete values.

## 5 Conclusion

We introduced a general image restoration method which is based on a generative model description of the class of images to be restored. Our processing technique relies on the Statistical estimation theory and is capable of restoring images through a mathematically well-defined optimization problem that does not require any tuning of weights of regularizer terms. The most important advantage of our method consists in its ability to restore images even when the transformation responsible for the deformation contains unknown parameters. Experiments using the CelebA dataset show that our method when applied to transformations with unknown parameters it is capable of delivering similar restoration quality as the existing state of the art that needs exact knowledge of the transformations.

#### Acknowledgment

This work was supported by the US National Science Foundation under Grant CIF 1513373, through Rutgers University.