Facial Information Recovery from Heavily Damaged Images using Generative Adversarial Network - PART 1

08/27/2018 ∙ by Pushparaja Murugan, et al. ∙ XRVision

Over the past decades, a large number of techniques have emerged in modern imaging systems to capture the exact information of the original scene regardless of shake, motion, lighting conditions, and so on. These developments have progressively addressed high-speed, high-resolution image acquisition. However, various ineradicable real-time factors still degrade the information and the quality of the acquired images, and the available techniques are not intelligent enough to generalize this complex phenomenon. Hence, it is necessary to develop an intelligent framework to recover the possible information present in the original scene. In this article, we propose a kernel-free framework based on a conditional GAN (cGAN) to recover information from heavily damaged images. The degradation of the images is assumed to be caused by a combination of various blurs. The learning parameters of the cGAN are optimized by a multi-component loss function that combines the improved Wasserstein loss with a regression loss. The generator module of this network is built on a U-Net architecture with local residual connections and a global skip connection; the local connections and the global skip connection are implemented so that features from all stages are utilized. The generated images show that the network has the potential to recover the probable information of blurred images from the learned features. This research work is carried out as a part of the 'Facial recognition module' of our IOP studio software.

Introduction

Images are representations of visual information in digital form, but the image acquisition and formation process degrades the information of the original scene while capturing it. Blur, point-wise non-linearities, and noise are common forms of degradation introduced by the image sensing system. Image blur is an unavoidable form of information degradation; in other words, it is a form of bandwidth reduction of the image caused by the image formation process. One of the easiest possible solutions is to capture the images with shorter exposure intervals during the acquisition process. In that case, however, noise formation is inevitable when capturing images in dark lighting conditions. Another possible solution is information recovery and reconstruction as an off-line process. There have been many developments in digital image processing techniques to sustain the visual information and to increase the quality of images processed off-line. The main objective of information recovery and restoration is to estimate the possible information of an original scene from the degraded images. It serves many areas such as astronomy, medical image processing, and satellite image processing, as well as the commercial photographic industry. Information recovery and reconstruction can be carried out by image inpainting and deblurring processes [1] [2] [3] [4].

Image inpainting is the process of generating plausible information to fill damaged or missing regions in an image by utilizing the available information; applications include the restoration of old images, the removal of scratches, text, and special effects, and the filling of damaged regions [5]. The inpainting process can be broadly classified into two categories: structural inpainting and texture inpainting. Structural inpainting is concerned with propagating structure into the missing region and synthesizing texture in that area, which can be very effective for inpainting small regions. Texture inpainting uses global information from multiple images to fill the missing regions, which is very effective for large missing areas [6]. In the early 1990s, the first image inpainting model was proposed based on non-linear partial differential equations to restore the information in damaged images; in this method, the gray-level information is propagated in the direction of the isophotes to obtain a full image [7]. Bertozzi et al. proposed a method for inpainting images through the two-dimensional fluid-dynamics Navier-Stokes equations [8]. Cheng-Shian Lin and Jin-Jang Leou suggested another four-step approach for inpainting [9]. Marcelo Bertalmío et al. proposed a hybrid approach using both structure and texture inpainting; the idea is to decompose the image into two functions by their character and to apply structure filling and texture filling to those functions respectively [10].

Information degradation caused by blur effects produces visually unattractive images. Fast-moving objects, image acquisition in dim lighting conditions, capturing distant objects, and regions outside the focused area of the image are typical examples of blur generation, where even high-speed, high-resolution sensing systems perform very poorly [11]. Image deblurring is an inverse problem in which the information of the sharp image is reconstructed or recovered from the degraded image [12]. Numerous investigations have been carried out to deblur images based on non-blind and blind deblurring methods. In non-blind methods such as Richardson-Lucy [13] and the Wiener filter [14], the blur kernel is assumed to be known, and the information recovery is carried out using the blurred images and the kernels. In blind methods, the kernel is assumed to be unknown, and the kernel is estimated from the blurred images. Recent developments in deblurring methods tackle both the blind and non-blind problems. However, due to the ill-posed nature of the blur kernels as well as the noise in the kernels, the acquired blurred images do not exactly represent the information of the original scene, and the estimated kernels significantly mismatch the true ones [15].

Recent developments in Convolutional Neural Networks (CNN) provide the possibility of addressing this limitation. The architecture of CNNs, their learning, hyperparameter optimization, and their limitations are elaborately discussed in our previous studies [16] [17] [18] [19]. Another remarkable invention built on the CNN architecture is the Generative Adversarial Network (GAN). Image style transformation, DeblurGAN, and text-to-image synthesis are among the most promising developments in the translation and transformation of image information. A Generative Adversarial Network is an artificial intelligence technique consisting of two networks, a discriminator and a generator, competing with each other in a zero-sum game; a typical GAN architecture is composed of two convolutional neural networks. The idea of this generative model was first introduced by Ian Goodfellow [20]. The generator network learns to map from the latent space and generates images from the data distribution, while the discriminator network discriminates between the real data distribution and the generator's data distribution. The training objective of the GAN is to increase the error rate of the discriminative network. Subsequent developments such as DCGAN and the improved DCGAN enhanced the performance of the vanilla GAN. In DCGAN, the authors highlighted the importance of Batch Normalization in both the generator and discriminator modules, as well as the importance of avoiding fully connected layers and of using strided convolutions instead of pooling [21]. The techniques proposed for the improved DCGAN allow the generator to produce high-resolution images; the authors suggested various training enhancements such as feature matching, historical averaging, one-sided label smoothing, and virtual batch normalization [22]. Another GAN variant is the conditional GAN, known as cGAN, where utilizing label information results in better quality of image generation and more control over the generated images. One of the important inventions in GANs is the Wasserstein GAN, which overcomes the limitations of the vanilla GAN by optimizing the learning parameters using the Wasserstein (Earth Mover's) distance as the objective function [23]. Super-resolution GAN [24], pix2pix GAN, and CycleGAN [25] are notable inventions in generative models that provide the possibility of translating image information from a noisy data distribution toward the real data distribution. In super-resolution GAN, the authors proposed a content loss objective along with the adversarial loss; the content loss is the Euclidean distance between the high-level features of the generated and real image distributions, which allows the generation of images more similar to the high-resolution originals. Yeh et al. proposed an architecture for image inpainting to fill in the information in the missing regions of images [26]. Ramakrishnan et al. proposed a kernel-free blind algorithm to deblur images using pix2pix and fully connected dense layers [27]. From these recent inventions, it is clearly understood that GANs have the potential to preserve the inherent textural and structural information and to generate convincing images that look close to the real data distribution.

1 Proposed Network

Recently, Generative Adversarial Networks have come to play an important role in supervised, semi-supervised, and unsupervised vision tasks, as these generative models implicitly learn the probability density of high-dimensional data distributions and generate natural-looking images. The generator and the discriminator in a GAN compete with each other in a zero-sum game to optimize the learning parameters. The schematic of the GAN is shown in Figure 1. The generator generates natural-looking data samples from noise input to fool the discriminator, while the discriminator attempts to distinguish the generated samples from the real data. Both the forger (generator) and the expert (discriminator) learn simultaneously by minimizing the distance between the probability distributions of the real and generated data. However, while the discriminator has access to both the generated data and the real data, the generator has no direct access to the real data distribution; the discriminator's feedback provides the generator with the information about the ground truth needed to distinguish synthetic data from the real data distribution. The same noise distribution is used throughout training so that the generator learns to produce natural-looking images close to the real data with superior quality. The generator and the discriminator are composed of deep convolutional layers and fully connected dense layers. Since gradients must be propagated through both the generator and the discriminator, both network modules have to be continuous and differentiable everywhere.

Figure 1: GAN architecture

In a typical GAN architecture, the discriminator network maps the generated image distribution to the real data distribution, and the generator learns to map representations $z$ from the latent space to the space of the data distribution, where $z$ represents a sample from the latent space. For a fixed generator $G$, the discriminator $D$ is trained to classify images as fake or real input. Once the discriminator is trained optimally, the generator continues to learn to generate images close to the original images. The captured statistical distribution of the training data is applied to solve a wide variety of problems such as semantic image editing, style transfer, data augmentation, image retrieval, and image translation and transformation. A detailed overview of super-resolution, style transfer, and photo generation is given in [25]. The conditional GAN, known as cGAN, learns the statistical mapping from the training data and the noise vector $z$ to the output by placing a condition on the discriminator. A Markovian discriminator makes it possible to achieve perceptually superior results on the generation of images from label maps, the reconstruction of objects from edge maps, and the colorization of images.
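As a minimal sketch of how a conditional GAN discriminator sees its condition, the blurred input image can be concatenated with the candidate output along the channel axis before scoring; the channel counts and tensor sizes below are illustrative assumptions:

    import torch

    blurred = torch.rand(1, 3, 256, 256)    # condition (e.g. the damaged image)
    candidate = torch.rand(1, 3, 256, 256)  # real sharp image or generator output
    pair = torch.cat([blurred, candidate], dim=1)  # 6-channel discriminator input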

1.1 Generator and Discriminator

The generator architecture of the network is shown in Figure 2. It consists of one convolutional block at the head, two convolutional blocks at the rear, and seven residual blocks in between. Each residual block consists of four sequential convolutional layers, each followed by an instance normalization layer and an activation function. The output of the third activation layer in every residual block is internally connected to the output of the first activation layer in the next residual block. Along with these local connections, a global skip connection is also introduced. Dropout with a probability of 50% is implemented in the residual blocks. InstanceNorm and LeakyReLU are used after every convolutional layer except the last one. The reutilization of features between subsequent layers allows the network to reconstruct the possible information from the learned features. It is also noted that the performance of the architecture remains high even with this comparatively small network.

Figure 2: Generator architecture
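As a rough PyTorch illustration, a residual block of the kind described above (conv, InstanceNorm, LeakyReLU, dropout, with an additive local connection) could look like the sketch below. The number of conv stages shown, the channel width, the kernel size, and the LeakyReLU slope are assumptions for illustration, not the authors' exact configuration:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Conv -> InstanceNorm -> LeakyReLU -> Dropout stages with an
        additive local skip connection, as described in the text."""
        def __init__(self, channels: int = 128, negative_slope: float = 0.2):
            super().__init__()
            stages = []
            for _ in range(2):  # the paper's blocks use four conv layers
                stages += [
                    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                    nn.InstanceNorm2d(channels),
                    nn.LeakyReLU(negative_slope),
                    nn.Dropout(0.5),  # 50% dropout, as stated above
                ]
            self.body = nn.Sequential(*stages)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.body(x)  # local residual connection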

The discriminator in the GAN architecture is the expert that distinguishes real data from generated data; in other words, it pushes the generator to produce more realistic information from the learned data distribution. In this architecture, a Markovian patch discriminator with ten convolutional layers is implemented, which also enforces plausible coloration in the generated natural images.
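A minimal sketch of a Markovian (patch) discriminator is given below: rather than emitting a single real/fake score, it outputs a grid of scores, each judging one local receptive field. The depth shown here is shallower than the ten-layer discriminator described above, and the channel widths are assumptions:

    import torch.nn as nn

    def patch_discriminator(in_channels: int = 3) -> nn.Sequential:
        def block(c_in, c_out, norm=True):
            layers = [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1)]
            if norm:
                layers.append(nn.InstanceNorm2d(c_out))
            layers.append(nn.LeakyReLU(0.2))
            return layers

        return nn.Sequential(
            *block(in_channels, 64, norm=False),
            *block(64, 128),
            *block(128, 256),
            nn.Conv2d(256, 1, kernel_size=4, padding=1),  # per-patch scores
        )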

1.2 Loss function

The objective of the generator is to learn a distribution $p_g$ that approximates the real distribution $p_r$ and to generate samples such that the probability density function of the generated samples, $p_g(x)$, equals the probability density function of the real samples, $p_r(x)$. This can be approached either by learning a differentiable function that models $p_g(x)$ directly and optimizing it through maximum likelihood, or by learning a differentiable transformation function $G(z)$ of an existing common distribution $p_z$, such as a uniform or Gaussian distribution, and optimizing it through maximum likelihood.

The discriminator has to recognize the data from the real data distribution, where $D(x)$ indicates the estimated probability that the data point $x$ is real. In the case of binary classification, if $\hat{y}$ is the estimated probability, $y = 1$ is the positive class and $y = 0$ is the negative class, the cross entropy between $y$ and $\hat{y}$ is,

$H(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i$   (1)

For a given point $x$ and its corresponding label $y$, Eq. (1) can be expressed as,

$L(\hat{y}, y) = -\,y \log \hat{y} - (1 - y) \log(1 - \hat{y})$   (2)

As can be understood from the above equation, one of the two terms is set to zero depending on the value of $y$. For the entire dataset, the above equation can be written as,

$L = -\sum_{i} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$   (3)

In a Generative Adversarial Network, a data point can come either from the real data distribution $p_r$ or from the generator distribution $p_g$. In addition, we expect exactly half of the data to come from each of the two sources. In order to encode this information probabilistically in Eq. (3), the sum is replaced with an expectation and the label $y$ with the value one half. Hence, the loss function is,

$L = -\frac{1}{2} \, \mathbb{E}_{x \sim p_r}[\log D(x)] - \frac{1}{2} \, \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$   (4)

In optimizing this loss, for the given real data distribution $p_r$, the estimated probability over the real data is made accurate by maximizing $\mathbb{E}_{x \sim p_r}[\log D(x)]$, and for the fake distribution produced by the generator, $D(G(z))$ is pushed close to zero by maximizing $\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$. On the other hand, the generator is trained to increase the chance that the discriminator assigns a high probability to the fake data, by minimizing $\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$. Hence, the generator and the discriminator fight each other in a minmax game over the value function,

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_r}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$   (5)
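In code, the value function of Eq. (5) is usually implemented as a pair of binary cross-entropy losses. The sketch below is one common PyTorch rendering, under the assumption that the discriminator outputs probabilities; the non-saturating generator loss shown is the standard practical variant:

    import torch
    import torch.nn.functional as F

    def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
        # maximizing log D(x) + log(1 - D(G(z))) equals minimizing this BCE sum
        real_term = F.binary_cross_entropy(d_real, torch.ones_like(d_real))
        fake_term = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
        return real_term + fake_term

    def generator_loss(d_fake: torch.Tensor) -> torch.Tensor:
        # non-saturating form: maximize log D(G(z)) rather than
        # minimizing log(1 - D(G(z)))
        return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))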

If the discriminator is trained to optimality before each generator parameter update, minimizing the loss function is equivalent to minimizing the Jensen-Shannon divergence between the real data distribution $p_r$ and the generated data distribution $p_g$. However, optimizing the value function of Eq. (5) suffers from vanishing gradients and mode collapse as the discriminator saturates. The authors of the Wasserstein GAN proposed another way to measure the distance between probability distributions, based on the Wasserstein distance, known as the Earth Mover distance. It can be stated as the minimum transportation cost of moving mass from the distribution $p_g$ into the distribution $p_r$, provided both distributions are continuous and differentiable everywhere. Using the Kantorovich-Rubinstein duality, the Wasserstein loss function can be expressed as,

$W(p_r, p_g) = \sup_{\|f\|_L \leq 1} \left( \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)] \right)$   (6)

where the supremum is taken over the set of 1-Lipschitz functions $f$. To ensure that the discriminator's probability estimate stays within the set of 1-Lipschitz functions after the optimization of the network, the authors introduced weight clipping, though it leads to undesirable results. In order to overcome this, a gradient penalty term was introduced,

$L_{GP} = \lambda \, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\!\left[ \left( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \right)^2 \right]$   (7)
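A compact PyTorch sketch of the gradient penalty in Eq. (7) follows, with $\hat{x}$ sampled uniformly along straight lines between pairs of real and generated samples, as in the WGAN-GP paper; the variable names are illustrative:

    import torch

    def gradient_penalty(critic, real: torch.Tensor, fake: torch.Tensor,
                         lam: float = 10.0) -> torch.Tensor:
        # interpolate between real and fake samples
        eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
        x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
        scores = critic(x_hat)
        grads = torch.autograd.grad(
            outputs=scores, inputs=x_hat,
            grad_outputs=torch.ones_like(scores),
            create_graph=True, retain_graph=True,
        )[0]
        # penalize deviation of the gradient norm from 1
        grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
        return lam * ((grad_norm - 1.0) ** 2).mean()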

As stated in the original paper, $\lambda$ is kept at 10 during learning. When the gradient penalty in Eq. (7) is fully optimized by back-propagation, the discriminator acts as a 1-Lipschitz function. However, if this image-latent-information-transforming GAN architecture is trained without the perceptual content loss, the network does not converge to a meaningful state. The perceptual loss is an $L_2$ loss defined as the Euclidean difference between the deep-network feature maps of the real data samples and the generated data samples. The perceptual loss is given in Eq. (8).

$L_X = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I^S)_{x,y} - \phi_{i,j}(G(I^B))_{x,y} \right)^2$   (8)

where $\phi_{i,j}$ represents the feature maps of a deep convolutional neural network taken before a max-pooling layer, $W_{i,j}$ and $H_{i,j}$ are the pixel dimensions of those feature maps, $I^S$ is the sharp image, and $I^B$ is the blurred input. The adversarial network loss function is the combination of the above loss functions, Eqs. (7) and (8). Hence, the total loss is given by the sum of the adversarial term, the gradient penalty of Eq. (7), and the perceptual loss of Eq. (8).
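The sketch below shows one common way to implement such a perceptual loss in PyTorch, using frozen VGG19 features; the specific cut-off layer is an assumption, since the text only specifies a feature map taken before a max-pooling layer:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import vgg19

    class PerceptualLoss(nn.Module):
        """Euclidean distance between deep feature maps of the sharp
        image and the generated image, per Eq. (8)."""
        def __init__(self, cutoff: int = 14):  # around conv3_3 of VGG19
            super().__init__()
            features = vgg19(pretrained=True).features[:cutoff]
            for p in features.parameters():
                p.requires_grad = False  # the loss network stays frozen
            self.features = features.eval()

        def forward(self, sharp: torch.Tensor, generated: torch.Tensor) -> torch.Tensor:
            return F.mse_loss(self.features(generated), self.features(sharp))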

2 Dataset preparation

The real blurs in images are extremely complex and cannot be approximated by a simple parametric model for synthetic generation. It is also very difficult to acquire image pairs of a blurred image and the corresponding sharp image to train the GAN network. Hence, the image pairs are created artificially. The high-resolution sharp images are collected from various mobile phone cameras. A YOLO network is trained to localize the faces in the images, and the cropped face images are later used for training. The main objective of generating the blurred images is to degrade the information present in the original image; to this end, high degrees of motion blur, camera-shake blur, and defocus blur are applied to the original images, and image pairs are created in this way for training. Since we only want to learn the statistical information of the sharp images, a given image may appear repeatedly in the training set, but with different blurs. The original image and the corresponding blurred images are taken for training the GAN network. Many investigations have successfully developed synthetic blur images; these methods propose that synthetically blurred images can be created by convolving the sharp images with a linear motion blur kernel, or by randomly sampling a few points and fitting a spline. Our synthetic generation of blurred images is concerned with varying the direction of the blur kernel.

2.1 Image blur generation

The blur kernel, also known as the point spread function (PSF), causes an image pixel to record light photons from multiple scene points. In real time, many factors can cause image blur that degrades the information and quality of the objects appearing in the scene [28]. Commonly, image blur is induced by object motion, atmospheric turbulence, the physical intrinsics of the optics, camera shake, and defocus. In classical deblurring methods, information recovery from degraded images requires understanding the kernel and appropriately modeling the information present; it is also highly complicated to generate blurred images that match what could occur in real time. Hence, it is necessary to understand the image formation model. Image formation captures radiometric and geometric information by projecting the 3-D world onto the 2-D focal plane. The light rays passing through the camera lens are projected onto the focal plane; this can be modeled as the concatenation of a perspective projection and a geometric distortion. The digital information of the image is formed by discretization of the analog image that the light photons produce [29]. This can be expressed as,

$b = S\!\left( k_i * \left( k_e * P(s) \right) \right)$   (9)

where $b$ represents the observed blurred image as the output of the sampling operator $S$, $P(s)$ represents the perspective projection of the real planar scene, $k_e$ represents the extrinsic blur kernel, $k_i$ represents the intrinsic blur kernel, and $*$ represents the convolution operation. Eq. (9) shows the blurred-image formation. Ignoring the sampling effects, the information recovery from the blurred image can be expressed as,

$b = h * l$   (10)

where $l$ is the latent sharp image and $h$ is the estimated kernel obtained by combining the extrinsic and intrinsic blurs $k_e$ and $k_i$. The general objective is to recover the information $l$ and the kernel $h$ from the degraded observed blurred image $b$. For simplification, the effect of the camera response function can be neglected, and the blur generation can be written as,

$b = h * l + n$   (11)

Considering the whole image, this can be expressed in matrix-vector form as follows,

$\mathbf{b} = \mathbf{H}\,\mathbf{l} + \mathbf{n}$   (12)

where $n$ is the noise that appears in the information of the image due to the sensing system. The noise $n$ can be modeled as Gaussian noise, Poisson noise, or impulse noise. The various degradation models based on these noise assumptions are shown in Figure 3.

Figure 3: Degradation model based on the noise assumption
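As a numerical illustration of Eqs. (11)-(12), the following sketch degrades a single-channel latent image by convolving it with a kernel and adding Gaussian noise; the noise level is an illustrative assumption:

    import numpy as np
    from scipy.signal import convolve2d

    def degrade(latent: np.ndarray, kernel: np.ndarray, sigma: float = 0.01) -> np.ndarray:
        """b = h * l + n for a single-channel image with values in [0, 1]."""
        blurred = convolve2d(latent, kernel, mode="same", boundary="symm")
        noise = np.random.normal(0.0, sigma, size=blurred.shape)  # Gaussian n
        return np.clip(blurred + noise, 0.0, 1.0)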

Though many other factors cause blur in images, the kernel $h$ can be characterized by the specific properties of the blur, which leads to different blur types such as motion blur, camera-shake blur, defocus blur, and atmospheric (intrinsic) blur.

Figure 4: Training data samples - Image pair

2.2 Motion blur

Motion blur commonly occurs when capturing fast-moving objects or when a long exposure time is needed. It is caused by the relative motion between the objects and the sensing system. When the object motion is relatively fast compared with the exposure period, the motion blur occurs as a linear motion blur. This can be represented by 1-D averaging of the neighboring pixels,

$h(x, y) = \begin{cases} \frac{1}{L}, & \text{if } \sqrt{x^2 + y^2} \leq \frac{L}{2} \text{ and } \frac{y}{x} = \tan\theta \\ 0, & \text{otherwise} \end{cases}$

where $(x, y)$ are the coordinates measured from the center of the kernel, $L$ is the moving distance, and $\theta$ is the direction of the object motion. Motion blur can be generated by this equation. However, in real time, a moving object occupies only a part of the image, and the blur generated by such an operation is extremely complicated.
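A small sketch of such a linear motion PSF is given below; the kernel grid size and the rasterization of the line are illustrative choices:

    import numpy as np

    def linear_motion_kernel(length: int, angle_deg: float, size: int = 31) -> np.ndarray:
        """Line-shaped PSF: moving distance `length` at direction
        `angle_deg`, normalized so the kernel sums to 1."""
        kernel = np.zeros((size, size), dtype=np.float64)
        c = size // 2
        theta = np.deg2rad(angle_deg)
        for t in np.linspace(-length / 2.0, length / 2.0, num=2 * length + 1):
            x = int(round(c + t * np.cos(theta)))
            y = int(round(c + t * np.sin(theta)))
            if 0 <= x < size and 0 <= y < size:
                kernel[y, x] = 1.0
        return kernel / kernel.sum()

A training pair could then be produced as degrade(sharp, linear_motion_kernel(15, 45.0)) with a randomly drawn angle, in line with varying the direction of the blur kernel as described above.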

2.3 Camera shake blur

Camera-shake blur is another kind of blur that commonly occurs in many real-time cases. Unlike motion blur, it is caused by the motion of the camera instead of the object, which results in degradation of information. Camera rotation causes the most complicated blur, as it involves in-plane and out-of-plane rotation with respect to the focal plane. In in-plane rotation, the blur kernel varies across the image with distance from the camera's rotation axis, while in out-of-plane rotation the degree of spatial variance across the image depends on the focal length of the camera. Camera shake also commonly occurs in low lighting conditions, where the motion of the camera in irregular directions causes in-plane or out-of-plane motion. When capturing a distant object, the translation of the camera motion is spatially invariant; hence, this case can be modeled as a linear motion blur of the same form as the kernel above. During camera translation, nearer objects undergo a larger shift, so when different objects lie in different focal planes within a single scene, a large degree of information degradation occurs in the image.
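One common way to synthesize a shake PSF (an assumption here, not necessarily the authors' procedure) is to accumulate a short damped random-walk trajectory onto a kernel grid:

    import numpy as np

    def shake_kernel(steps: int = 64, size: int = 31, jitter: float = 1.0) -> np.ndarray:
        """Camera-shake PSF from an irregular random-walk trajectory."""
        kernel = np.zeros((size, size), dtype=np.float64)
        pos = np.array([size / 2.0, size / 2.0])
        velocity = np.zeros(2)
        for _ in range(steps):
            velocity += np.random.normal(0.0, jitter, size=2)  # irregular motion
            velocity *= 0.9                                    # damping
            pos = np.clip(pos + velocity, 0, size - 1)
            kernel[int(pos[1]), int(pos[0])] += 1.0
        return kernel / kernel.sum()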

2.4 Defocus blur

Defocus blur usually arises from improper focusing by the image sensing system. In scenes with different depths, objects outside the focal plane suffer heavily from defocus blur. It may even occur with a single-lens sensing system when acquiring information from an object outside the depth of field. Defocus blur can be approximately modeled by a uniform circular model,

$h(x, y) = \begin{cases} \frac{1}{\pi R^2}, & \text{if } \sqrt{x^2 + y^2} \leq R \\ 0, & \text{otherwise} \end{cases}$

where $R$ is the radius of the circle.
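A direct sketch of this disk-shaped PSF, with the grid size as an illustrative choice:

    import numpy as np

    def defocus_kernel(radius: float, size: int = 31) -> np.ndarray:
        """Uniform circular PSF: constant inside a circle of radius R,
        zero outside, normalized to sum to 1."""
        c = size // 2
        y, x = np.mgrid[:size, :size]
        disk = ((x - c) ** 2 + (y - c) ** 2) <= radius ** 2
        kernel = disk.astype(np.float64)
        return kernel / kernel.sum()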

3 Results and Discussion

The training of the GAN network to recover the image information is carried out on an NVIDIA GTX 980M GPU. Stochastic gradient descent with a batch size of 4 and the Adam optimizer are used to increase the learning speed and drive the network's convergence toward the global minimum.

Figure 5: Generated images from artificially created blurred images (training)
Figure 6: Generated images from artificially created blurred images (testing)

The learning rate and the Adam parameters are set to fixed values, and the network is trained for 240 epochs over one week. In order to capture the probability density of the real data distribution, kernels with a large pixel size are used in the first and second convolutional layers of the generator, and 128 kernels are used in the ResNet blocks. It is important to note that instance normalization is applied after every convolution operator to keep the discriminator loss from quickly approaching zero. LeakyReLU is used as the non-linear activation function to avoid sparse gradients. In the up-sampling module, PixelShuffle and transposed convolution with a stride of two are used, while in the down-sampling module, average pooling and convolution with a stride of two are employed. Without these up-sampling and down-sampling modules, the generator suffers from the generation of undesirable pixel noise. Dropout with a probability of 50% is used after the convolution operations. To optimize the network, as in the image-to-image translation GAN, training is carried out to maximize the probability the discriminator assigns to the generated samples rather than to minimize the probability of them being classified as fake, and alternating gradient steps are used between the generator and the discriminator. The images generated during training are shown in Figure 5: the first column shows the artificially created blurred images, the middle column the ground truth, and the final column the generated images. During training, the network generated increasingly realistic images. Figure 6 shows the generated images on the test data; the left pair of images shows successful information recovery during testing, while the right pair shows failure cases. During testing of the architecture, many cases resulted in failure. The network is capable of recovering the information if it has been trained on the same family of degraded images; however, the network can be made more general if it is trained on a wider variety of image pairs.
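Putting the pieces together, the following condensed sketch shows the alternating update scheme described above, reusing the gradient_penalty sketch from Section 1.2. The tiny stand-in networks and random tensors exist only so the loop is self-contained, and pixel-space L2 stands in for the perceptual term:

    import torch
    import torch.nn as nn

    # Stand-in networks; the real generator/critic are those of Section 1.1.
    generator = nn.Conv2d(3, 3, kernel_size=3, padding=1)
    critic = nn.Sequential(
        nn.Conv2d(3, 1, kernel_size=4, stride=2, padding=1),  # 64x64 -> 32x32
        nn.Flatten(),
        nn.Linear(32 * 32, 1),
    )
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(critic.parameters(), lr=1e-4)

    for step in range(2):  # stand-in for iterating over the paired dataset
        blurred = torch.rand(4, 3, 64, 64)  # batch size 4, as in the paper
        sharp = torch.rand(4, 3, 64, 64)

        # critic step: Wasserstein loss plus gradient penalty (Section 1.2)
        fake = generator(blurred).detach()
        d_loss = critic(fake).mean() - critic(sharp).mean() \
                 + gradient_penalty(critic, sharp, fake)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # generator step: adversarial term plus a content term
        fake = generator(blurred)
        g_loss = -critic(fake).mean() + nn.functional.mse_loss(fake, sharp)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()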

4 Conclusion

A cGAN-based framework for recovering the possible information from heavily blurred images has been proposed. The training of the network is carried out using an adversarial loss and a perceptual loss function. The generated images from various experiments show that the addition of the up-sampling and down-sampling modules to the generator network helps to increase the performance of the network dramatically, as well as to recover the information. Since the primary objective is only to recover the information from blurred face images, only face images with blur were considered for training and testing; hence, no comparative study against state-of-the-art models was conducted. Another important conclusion from the experimentation is that the network is capable of fully recovering the information as long as it is trained on the same variety of degraded images. This research work is carried out as a part of the 'Facial recognition module' of our IOP studio software.

References