CDGAN: Cyclic Discriminative Generative Adversarial Networks for Image-to-Image Transformation

01/15/2020, by Kancharagunta Kishan Babu, et al., IIIT Sri City

Image-to-image transformation is a class of problems in which an input image from one visual representation is transformed into an output image of another visual representation. Since 2014, Generative Adversarial Networks (GANs) have facilitated a new direction to tackle this problem by introducing generator and discriminator networks in their architecture. Many recent works, like Pix2Pix, CycleGAN, DualGAN, PS2MAN and CSGAN, handle this problem with suitable generator and discriminator networks and different choices of losses in the objective function. In spite of these works, there is still a gap to fill in terms of the quality of the generated images, which should look more realistic and be as close as possible to the ground truth images. In this work, we introduce a new image-to-image transformation network named Cyclic Discriminative Generative Adversarial Networks (CDGAN) that fills the above mentioned gap. The proposed CDGAN generates higher quality and more realistic images by incorporating additional discriminator networks for the cycled images on top of the original CycleGAN architecture. To demonstrate the performance of the proposed CDGAN, it is tested over three baseline image-to-image transformation datasets. Quantitative metrics such as pixel-wise similarity, structural-level similarity and perceptual-level similarity are used to judge the performance. Moreover, the qualitative results are also analyzed and compared with the state-of-the-art methods. The proposed CDGAN method clearly outperforms the state-of-the-art methods when compared over the three baseline image-to-image transformation datasets.


I Introduction

There are many real-world applications where images of one particular domain need to be translated into a different target domain. For example, sketch-photo synthesis [39], [41], [24] is required to generate photo images from sketch images, which helps to solve many law enforcement cases where it is difficult to match sketch images with gallery photo images due to the domain disparity. Similar to sketch-photo synthesis, many image processing and computer vision problems need to perform the image-to-image transformation task, such as Image Colorization, where a gray-level image is translated into a colored image [5], [38]; Image in-painting, where lost or deteriorated parts of the image are reconstructed [33], [23]; Image, video and depth map super-resolution, where the resolution of the images is enhanced [12], [20], [31], [9]; Artistic style transfer, where the semantic content of the source image is preserved while the style of the target image is transferred to the source image [10], [4]; and Image denoising, where the original image is reconstructed from the noisy measurement [2]. Some other applications, like video rain removal [18], semantic segmentation [6] and face recognition [34], [3], also need to perform image-to-image transformation. However, traditionally the image-to-image transformation methods are proposed for a particular specified task with a specialized method, which is suited to that task only.

Fig. 1: A few generated samples obtained from our experiments. The 1st, 2nd and 3rd rows represent the Sketch-Photos from the CUHK Face Sketch dataset [30], the Labels-Buildings from the FACADES dataset [26] and the RGB-NIR scenes from the RGB-NIR Scene dataset [1], respectively. The 1st column represents the input images. The 2nd, 3rd and 4th columns show the generated samples using DualGAN [35], CycleGAN [40] and the introduced CDGAN methods, respectively. The last column shows the ground truth images. The generated artifacts are highlighted with red rectangles in the 2nd and 3rd columns, corresponding to DualGAN and CycleGAN, respectively.

I-A CNN-Based Image-to-Image Transformation Methods

Deep Convolutional Neural Networks (CNNs) are used as end-to-end frameworks for image-to-image transformation problems. A CNN consists of a series of convolutional and deconvolutional layers. It minimizes a single objective (loss) function in the training phase while learning the network weights that guide the image-to-image transformation. In the testing phase, a given input image of one visual representation is transformed into another visual representation with the learned weights. Long et al. [19] first showed that convolutional networks can be trained end-to-end for pixel-wise prediction in the semantic segmentation problem. Larsson et al. [16] developed a fully automatic colorization method for translating grayscale images to color images using a deep convolutional architecture. Zhang et al. [36] proposed a novel method for end-to-end photo-to-sketch synthesis using a fully convolutional network. Gatys et al. [7] introduced a neural algorithm for image style transfer that constrains the texture synthesis with learned feature representations from state-of-the-art CNNs. These methods treat each image transformation problem separately and design a CNN suited to that particular problem only. This opens an active research scope to develop a common framework that can work for different image-to-image transformation problems.

Fig. 2: Image-to-image transformation framework based on the proposed Cyclic Discriminative Generative Adversarial Networks (CDGAN) method. G_AB and G_BA are the generators, D_A and D_B are the discriminators, R_A and R_B are the Real images, Syn_A and Syn_B are the Synthesized images, and Cyc_A and Cyc_B are the Cycled images in domain A and domain B, respectively.

I-B GAN-Based Image-to-Image Transformation Methods

In 2014, Goodfellow et al. [8] introduced Generative Adversarial Networks (GAN) as a general purpose solution to the image generation problem. Beyond generating images from a given random noise distribution as in [8], GANs are also used for different computer vision applications like image super-resolution [17], real-time style transfer [12], sketch-to-photo synthesis [14] and domain adaptation [28]. Mirza et al. [22] introduced conditional GANs (cGAN) by conditioning both the generator and the discriminator networks of the basic GAN [8] on class labels as extra information. Conditional GANs have boosted the image generation problem. Since then, active research has been conducted to develop new GAN models that can work for different image generation problems. Still, there is a need to develop common methods that work across different image generation problems, such as sketch-to-face, NIR-to-RGB, etc.

Isola et al. introduced Pix2Pix [11] as a general purpose method providing a common framework for image-to-image transformation problems using Conditional GANs (cGANs). Pix2Pix works only for paired image datasets. Its objective function consists of the generative adversarial loss and the L1 loss. Wang et al. proposed PAN [27], a general framework for image-to-image transformation tasks, by introducing the perceptual adversarial loss and combining it with the generative adversarial loss. Zhu et al. investigated CycleGAN [40], an image-to-image transformation framework that also works over unpaired datasets. Unpaired datasets make it difficult to train the generator network due to the discrepancies between the two domains. This leads to a mode collapse problem, where the majority of the generated images share common properties, resulting in similar outputs for different input images. To overcome this problem, the Cycle-consistency loss is introduced along with the adversarial loss. Yi et al. developed DualGAN [35], a dual learning framework for image-to-image transformation on unsupervised data. DualGAN uses the reconstruction loss (similar to the Cycle-consistency loss in [40]) and the adversarial loss in its objective function. Wang et al. introduced PS2MAN [29], a high quality photo-to-sketch synthesis framework consisting of multiple adversarial networks at different resolutions of the image. PS2MAN uses the Synthesized loss in addition to the Cycle-consistency loss and the adversarial loss in its objective function. Recently, Kancharagunta et al. proposed CSGAN [13], an image-to-image transformation method using GANs. CSGAN considers the Cyclic-Synthesized loss along with the other losses mentioned in [40].

Most of the above mentioned methods consist of two generator networks G_AB and G_BA, which read the Real_Images (R_A and R_B) from the domains A and B, respectively. These generators G_AB and G_BA generate the Synthesized_Images (Syn_B and Syn_A) in domains B and A, respectively. The same generator networks G_BA and G_AB are also used to generate the Cycled_Images (Cyc_A and Cyc_B) in the domains A and B from the Synthesized_Images Syn_B and Syn_A, respectively. In addition to the generator networks, these methods also consist of two discriminator networks D_A and D_B, which are used to distinguish between the Real_Images (R_A and R_B) and the Synthesized_Images (Syn_A and Syn_B) in their respective domains. The losses, namely the Adversarial loss, the Cycle-consistency loss, the Synthesized loss and the Cyclic-Synthesized loss, are used in the objective function.

In this paper, we introduce a new architecture called Cyclic-Discriminative Generative Adversarial Network (CDGAN) for the image-to-image transformation problem. It improves the architectural design by introducing a Cyclic-Discriminative adversarial loss computed between the Real_Images and the Cycled_Images. The CDGAN method also consists of two generators G_AB and G_BA and two discriminators D_A and D_B, similar to the other methods. The generator networks G_AB and G_BA are used to generate the Synthesized_Images (Syn_B and Syn_A) from the Real_Images (R_A and R_B) and the Cycled_Images (Cyc_B and Cyc_A) from the Synthesized_Images (Syn_A and Syn_B) in the two domains B and A, respectively. The two discriminator networks D_A and D_B are used to distinguish between the Real_Images (R_A and R_B) and the Synthesized_Images (Syn_A and Syn_B), and also between the Real_Images (R_A and R_B) and the Cycled_Images (Cyc_A and Cyc_B).

The main contributions of this work are as follows:

  • We propose a new method called Cyclic Discriminative Generative Adversarial Network (CDGAN) that uses a Cyclic-Discriminative (CD) adversarial loss computed between the Real_Images and the Cycled_Images. This loss helps to increase the quality of the generated images and also reduces the artifacts in them.

  • We evaluate the proposed CDGAN method over three benchmark image-to-image transformation datasets with four different benchmark image quality assessment measures.

  • We conduct an ablation study by extending the concept of the proposed Cyclic-Discriminative adversarial loss between the Real_Images and the Cycled_Images to state-of-the-art methods like CycleGAN [40], DualGAN [35], PS2GAN [29] and CSGAN [13].

The remainder of the paper is arranged as follows: Section II presents the proposed method and the losses used in the objective function; the experimental setup with the datasets and evaluation metrics used in the experiments is described in Section III; the result analysis and the ablation study are conducted in Section IV, followed by the conclusions in Section V.

II Proposed CDGAN Method

Consider a paired image dataset between two different domains A and B, represented as {(R_A^i, R_B^i)}_{i=1}^{N}, where N is the number of pairs. The goal of the proposed CDGAN method is to train two generator networks G_AB and G_BA and two discriminator networks D_A and D_B. The generator G_AB is used to translate a given input image from domain A into an output image of domain B, and the generator G_BA is used to transform an input sample from domain B into an output sample in domain A. The discriminator D_A is used to differentiate between the real and the generated images in domain A and, in a similar fashion, the discriminator D_B is used to differentiate between the real and the generated images in domain B. The Real_Images (R_A and R_B) from the domains A and B are given to the generators G_AB and G_BA to generate the Synthesized_Images (Syn_B and Syn_A) in domains B and A, respectively, as,

Syn_B = G_AB(R_A)   (1)
Syn_A = G_BA(R_B)   (2)

The Synthesized_Images (Syn_B and Syn_A) are given to the generators G_BA and G_AB to generate the Cycled_Images (Cyc_A and Cyc_B), respectively, as,

Cyc_A = G_BA(Syn_B)   (3)
Cyc_B = G_AB(Syn_A)   (4)

where Cyc_A is the Cycled_Image in domain A and Cyc_B is the Cycled_Image in domain B. As shown in Fig. 2, the proposed CDGAN method reads two Real_Images (R_A and R_B) as input, one from domain A and the other from domain B. These images R_A and R_B are first translated into the Synthesized_Images (Syn_B and Syn_A) of the other domains B and A by the generators G_AB and G_BA, respectively. The translated Synthesized_Images (Syn_B and Syn_A) from the domains B and A are then given to the generators G_BA and G_AB, respectively, to obtain the Cycled_Images (Cyc_A and Cyc_B) in domains A and B.
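
The forward data flow of Equations (1)-(4) can be summarized in a short PyTorch-style sketch. This is a minimal illustration under our own naming (G_AB, G_BA, real_A, real_B), not the authors' released code; any two image-to-image generator modules can be plugged in.

```python
import torch
import torch.nn as nn

def forward_cycle(G_AB: nn.Module, G_BA: nn.Module,
                  real_A: torch.Tensor, real_B: torch.Tensor):
    """One CDGAN forward pass, following Equations (1)-(4)."""
    syn_B = G_AB(real_A)   # Eq. (1): translate A -> B
    syn_A = G_BA(real_B)   # Eq. (2): translate B -> A
    cyc_A = G_BA(syn_B)    # Eq. (3): reconstruct A from the synthesized B image
    cyc_B = G_AB(syn_A)    # Eq. (4): reconstruct B from the synthesized A image
    return syn_A, syn_B, cyc_A, cyc_B
```

In CDGAN, both the synthesized and the cycled images are later judged by the discriminators, which is what distinguishes it from CycleGAN.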

The proposed CDGAN method translates the input image R_A from domain A into the image Syn_B in domain B, such that Syn_B looks the same as R_B. In a similar fashion, the input image R_B is translated into the image Syn_A, such that it looks the same as R_A. The difference between the input real images and the translated synthesized images should be minimized in order to obtain more realistic generated images. Thus, suitable loss functions are needed.

II-A Objective Functions

The proposed CDGAN method, as shown in Fig. 2, consists of five loss functions, namely the Adversarial loss, the Synthesized loss, the Cycle-consistency loss, the Cyclic-Synthesized loss and the proposed Cyclic-Discriminative (CD) Adversarial loss.

II-A1 Adversarial Loss

The least-squares loss introduced in LSGAN [21] is used as the objective in the Adversarial loss, instead of the negative log-likelihood objective of the vanilla GAN [8], for more stable training. The Adversarial loss is computed between the fake images produced by the generator network and the decision of the discriminator network, which tries to distinguish them from the real images. The generator network tries to generate fake images that look the same as the real images, while the discriminator network tries to tell the real and generated images apart. In the proposed CDGAN, the Synthesized_Image (Syn_B) in domain B is generated by the generator G_AB from the Real_Image (R_A) of domain A. The Real_Image (R_B) and the Synthesized_Image (Syn_B) in domain B are distinguished by the discriminator D_B. It is written as,

L_{LSGAN_B} = E[(D_B(R_B) - 1)^2] + E[(D_B(Syn_B))^2]   (5)

where L_{LSGAN_B} is the Adversarial loss in domain B. In a similar fashion, the Synthesized_Image (Syn_A) in domain A is generated from the Real_Image (R_B) of domain B by the generator network G_BA. The Real_Image (R_A) and the Synthesized_Image (Syn_A) in domain A are differentiated by the discriminator network D_A. It is written as,

L_{LSGAN_A} = E[(D_A(R_A) - 1)^2] + E[(D_A(Syn_A))^2]   (6)

where L_{LSGAN_A} is the Adversarial loss in domain A. The Adversarial loss is used to learn the distribution of the input data during training and to produce real-looking images at test time from the learned distribution. Although the Adversarial loss reduces the problem of blurred outputs, some artifacts are still present.
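
The least-squares criterion of Equations (5) and (6) can be written as two small helpers, sketched below under the usual LSGAN convention of target 1 for real images and 0 for fake images; the function names are ours, not the paper's.

```python
import torch

def lsgan_d_loss(D, real, fake):
    """Discriminator side: push D(real) -> 1 and D(fake) -> 0."""
    return torch.mean((D(real) - 1.0) ** 2) + torch.mean(D(fake.detach()) ** 2)

def lsgan_g_loss(D, fake):
    """Generator side: push D(fake) -> 1 so the fake is judged as real."""
    return torch.mean((D(fake) - 1.0) ** 2)
```

For Equation (5), `fake` is the synthesized image Syn_B = G_AB(R_A) and `D` is D_B; Equation (6) is the mirrored case with Syn_A and D_A.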

II-A2 Synthesized Loss

Image-to-image transformation aims not only to transform the input image from the source domain to the target domain, but also to generate an output image as close as possible to the original image in the target domain. To fulfill the latter, the Synthesized loss is introduced in [29]. It computes the L1 loss in domain B between the Real_Image (R_B) and the Synthesized_Image (Syn_B) and is given as,

L_{S_B} = ||R_B - Syn_B||_1   (7)

where L_{S_B} is the Synthesized loss in domain B, and R_B and Syn_B are the Real and Synthesized images in domain B.

In a similar fashion, the L1 loss in domain A between the Real_Image (R_A) and the Synthesized_Image (Syn_A) is computed as the Synthesized loss and given as,

L_{S_A} = ||R_A - Syn_A||_1   (8)

where L_{S_A} is the Synthesized loss in domain A, and R_A and Syn_A are the Real and the Synthesized images in domain A. The Synthesized loss helps to generate fake output samples closer to the real samples in the target domain.
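
The Synthesized loss is a plain pixel-wise L1 term between paired images; a minimal sketch (variable names follow the earlier forward-pass sketch and are ours):

```python
import torch.nn as nn

l1 = nn.L1Loss()  # mean absolute pixel error

def synthesized_losses(syn_A, syn_B, real_A, real_B):
    """Eqs. (7)-(8): L1 between the synthesized and the paired real images."""
    return l1(syn_A, real_A), l1(syn_B, real_B)
```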

II-A3 Cycle-consistency Loss

To reduce the discrepancy between the two different domains, the Cycle-consistency loss is introduced in [40]. The L1 loss in domain A between the Real_Image (R_A) and the Cycled_Image (Cyc_A) is computed as the Cycle-consistency loss and defined as,

L_{cyc_A} = ||R_A - Cyc_A||_1   (9)

where L_{cyc_A} is the Cycle-consistency loss in domain A, and R_A and Cyc_A are the Real and Cycled images in domain A. In a similar fashion, the L1 loss in domain B between the Real_Image (R_B) and the Cycled_Image (Cyc_B) is computed as the Cycle-consistency loss and defined as,

L_{cyc_B} = ||R_B - Cyc_B||_1   (10)

where L_{cyc_B} is the Cycle-consistency loss in domain B, and R_B and Cyc_B are the Real and the Cycled images in domain B. The Cycle-consistency losses L_{cyc_A} and L_{cyc_B} used in the objective function act as the forward and backward consistencies. These two Cycle-consistency losses are also included in the objective function of the proposed CDGAN method. They reduce the space of possible mapping functions for large networks and also act as regularizers for learning the network parameters.

Methods | Adversarial | Synthesized | Cycle-consistency | Cyclic-Synthesized | Cyclic-Discriminative Adversarial
GAN [8] | ✓ | | | |
Pix2Pix [11] | ✓ | ✓ | | |
**DualGAN [35] | ✓ | | ✓ | |
CycleGAN [40] | ✓ | | ✓ | |
PS2GAN [29] | ✓ | ✓ | ✓ | |
CSGAN [13] | ✓ | | ✓ | ✓ |
CDGAN (Ours) | ✓ | ✓ | ✓ | ✓ | ✓
TABLE I: The relationship between the six benchmark methods and the proposed CDGAN method in terms of the losses used in their objective functions (as summarized in Section I-B). **DualGAN is similar to CycleGAN. A tick mark represents the presence of a loss in a method.

II-A4 Cyclic-Synthesized Loss

The Cyclic-Synthesized loss, introduced in CSGAN [13], is computed as the L1 loss between the Synthesized_Image (Syn_A) and the Cycled_Image (Cyc_A) in domain A and defined as,

L_{CS_A} = ||Syn_A - Cyc_A||_1   (11)

where L_{CS_A} is the Cyclic-Synthesized loss in domain A, and Syn_A and Cyc_A are the Synthesized and Cycled images in domain A.

Similarly, the L1 loss in domain B between the Synthesized_Image (Syn_B) and the Cycled_Image (Cyc_B) is computed as the Cyclic-Synthesized loss and defined as,

L_{CS_B} = ||Syn_B - Cyc_B||_1   (12)

where L_{CS_B} is the Cyclic-Synthesized loss in domain B, and Syn_B and Cyc_B are the Synthesized and the Cycled images in domain B. These two Cyclic-Synthesized losses are also included in the objective function of the proposed CDGAN method. Although all the above mentioned losses are included in the objective function of the proposed CDGAN method, there is still scope to improve the quality of the generated images and to remove the unwanted artifacts produced in the resulting images. To fulfill that scope, a new loss called the Cyclic-Discriminative Adversarial loss is proposed in this paper to generate high quality images with reduced artifacts.
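
The Cycle-consistency and Cyclic-Synthesized terms follow the same L1 pattern as above, only computed on different image pairs; a short sketch reusing the `l1` criterion defined earlier:

```python
def cycle_and_cyclic_synthesized_losses(real_A, real_B, syn_A, syn_B, cyc_A, cyc_B):
    """Eqs. (9)-(12) as plain L1 terms."""
    loss_cyc_A = l1(cyc_A, real_A)  # Eq. (9): forward cycle A -> B -> A
    loss_cyc_B = l1(cyc_B, real_B)  # Eq. (10): backward cycle B -> A -> B
    loss_CS_A = l1(syn_A, cyc_A)    # Eq. (11): both images live in domain A
    loss_CS_B = l1(syn_B, cyc_B)    # Eq. (12): both images live in domain B
    return loss_cyc_A, loss_cyc_B, loss_CS_A, loss_CS_B
```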

II-A5 Proposed Cyclic-Discriminative Adversarial Loss

The Cyclic-Discriminative Adversarial loss proposed in this paper is calculated as the adversarial loss between the Real_Images (R_A and R_B) and the Cycled_Images (Cyc_A and Cyc_B). The adversarial loss in domain A between the Cycled_Image (Cyc_A) and the Real_Image (R_A), computed using the generator G_BA and the discriminator D_A, is the Cyclic-Discriminative adversarial loss in domain A and is defined as,

L_{CD_A} = E[(D_A(R_A) - 1)^2] + E[(D_A(Cyc_A))^2]   (13)

where L_{CD_A} is the Cyclic-Discriminative adversarial loss in domain A, and Cyc_A (= G_BA(Syn_B)) and R_A are the cycled and real images, respectively.

Similarly, the adversarial loss in domain B between the Cycled_Image (Cyc_B) and the Real_Image (R_B), computed using the generator G_AB and the discriminator D_B, is the Cyclic-Discriminative adversarial loss in domain B and is defined as,

L_{CD_B} = E[(D_B(R_B) - 1)^2] + E[(D_B(Cyc_B))^2]   (14)

where L_{CD_B} is the Cyclic-Discriminative adversarial loss in domain B, and Cyc_B (= G_AB(Syn_A)) and R_B are the cycled and real images, respectively. Finally, we combine all the losses to form the CDGAN objective function. The relationship between the proposed CDGAN method and the six benchmark methods is shown in Table I for better understanding.
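
The proposed loss reuses the least-squares criterion of Section II-A1, but feeds the cycled images to the discriminators. A hedged sketch, reusing the `lsgan_d_loss` and `lsgan_g_loss` helpers defined earlier (the names are ours):

```python
def cyclic_discriminative_losses(D_A, D_B, real_A, real_B, cyc_A, cyc_B):
    """Eqs. (13)-(14): adversarial game between the real and the cycled images."""
    d_loss = lsgan_d_loss(D_A, real_A, cyc_A) + lsgan_d_loss(D_B, real_B, cyc_B)
    g_loss = lsgan_g_loss(D_A, cyc_A) + lsgan_g_loss(D_B, cyc_B)
    return d_loss, g_loss
```

Because Cyc_A and Cyc_B pass through both generators, the gradients of this loss reach G_AB and G_BA simultaneously, which is what strengthens the adversarial signal compared to judging the synthesized images alone.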

II-B CDGAN Objective Function

The final objective function of the CDGAN method combines the existing Adversarial loss, Synthesized loss, Cycle-consistency loss and Cyclic-Synthesized loss with the proposed Cyclic-Discriminative Adversarial loss as follows,

L_{CDGAN} = L_{LSGAN_A} + L_{LSGAN_B} + L_{CD_A} + L_{CD_B} + λ_{S_A} L_{S_A} + λ_{S_B} L_{S_B} + λ_{cyc_A} L_{cyc_A} + λ_{cyc_B} L_{cyc_B} + λ_{CS_A} L_{CS_A} + λ_{CS_B} L_{CS_B}   (15)

where L_{CD_A} and L_{CD_B} are the proposed Cyclic-Discriminative Adversarial losses described in subsection II-A5; L_{LSGAN_A} and L_{LSGAN_B} are the adversarial losses, L_{S_A} and L_{S_B} are the Synthesized losses, L_{cyc_A} and L_{cyc_B} are the Cycle-consistency losses, and L_{CS_A} and L_{CS_B} are the Cyclic-Synthesized losses explained in subsections II-A1, II-A2, II-A3 and II-A4, respectively. The λ_{S_A}, λ_{S_B}, λ_{cyc_A}, λ_{cyc_B}, λ_{CS_A} and λ_{CS_B} are the weights for the different losses. The values of these weights are set empirically.
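
For the generator update, Equation (15) amounts to a weighted sum of the individual terms; a sketch under our own naming, with purely illustrative default weights (the paper sets its weights empirically):

```python
def cdgan_generator_objective(losses, lam_S=1.0, lam_cyc=10.0, lam_CS=1.0):
    """Weighted sum of Eq. (15) from the generators' point of view.

    `losses` is a dict holding the terms computed as in the previous sketches;
    the lambda defaults here are illustrative, not the paper's values.
    """
    return (losses["adv_A"] + losses["adv_B"]                 # Eqs. (5)-(6)
            + losses["CD_A"] + losses["CD_B"]                 # Eqs. (13)-(14)
            + lam_S * (losses["S_A"] + losses["S_B"])         # Eqs. (7)-(8)
            + lam_cyc * (losses["cyc_A"] + losses["cyc_B"])   # Eqs. (9)-(10)
            + lam_CS * (losses["CS_A"] + losses["CS_B"]))     # Eqs. (11)-(12)
```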

III Experimental Setup

This section describes the datasets used in the experiments with their train and test partitions, the evaluation metrics used to judge the performance, and the training settings used to train the models. It also describes the network architectures and the baseline GAN models.

III-A Datasets

For the experiments, we use the following three baseline datasets meant for the image-to-image transformation task.

III-A1 CUHK Face Sketch Dataset

The CUHK dataset (http://mmlab.ie.cuhk.edu.hk/archive/facesketch.html) consists of student face photo-sketch image pairs. The cropped version of the data is used in this paper. A subset of the images is used for training and the remaining images are used for testing.

III-A2 CMP Facades Dataset

The CMP Facades dataset (http://cmp.felk.cvut.cz/~tylecr1/facade/) contains architectural label and corresponding facade image pairs. A subset of the images is used for training and the remaining images are used for testing.

III-A3 RGB-NIR Scene Dataset

The RGB-NIR Scene dataset (https://ivrl.epfl.ch/research-2/research-downloads/supplementary_material-cvpr11-index-html/) consists of images from several scene categories captured in both the RGB and Near-Infrared (NIR) domains. A subset of the images is used for training and the remaining images are used for testing.

Fig. 3: Generator and Discriminator Network architectures.

III-B Training Information

The network is trained with input images of a fixed size; each image is resized from its original dataset resolution to this fixed size. The network is initialized with the same setup as in [11]. Both the generator and the discriminator networks are trained from scratch. The learning rate is kept constant for the initial epochs and then linearly decayed to zero over the remaining epochs. The weights of the networks are initialized from a zero-mean Gaussian distribution with a small standard deviation. The networks are optimized with the Adam solver [15], with the first momentum term reduced from its usual default, since as per [25] higher momentum values can cause substandard network stabilization for the image-to-image transformation task. The values of the weight factors λ_{S_A}, λ_{S_B}, λ_{cyc_A}, λ_{cyc_B}, λ_{CS_A} and λ_{CS_B} of the proposed CDGAN method (see Equation 15) are set empirically. For the compared methods, the values of the weight factors are taken from their source papers.
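
A sketch of the optimizer and learning-rate schedule described above, assuming the common pix2pix/CycleGAN defaults (Adam with a reduced first momentum term of 0.5 and a constant learning rate that is then linearly decayed to zero); the concrete learning rate and epoch counts below are placeholders, not the paper's values.

```python
import itertools
import torch

def make_optimizers(G_AB, G_BA, D_A, D_B, lr=2e-4, n_fixed=100, n_decay=100):
    """Adam optimizers with beta1 = 0.5 and a linear learning-rate decay.

    lr, n_fixed and n_decay are placeholder values, not the paper's settings.
    """
    opt_G = torch.optim.Adam(itertools.chain(G_AB.parameters(), G_BA.parameters()),
                             lr=lr, betas=(0.5, 0.999))
    opt_D = torch.optim.Adam(itertools.chain(D_A.parameters(), D_B.parameters()),
                             lr=lr, betas=(0.5, 0.999))

    def linear_decay(epoch):
        # constant for the first n_fixed epochs, then a linear decay towards zero
        return 1.0 - max(0, epoch + 1 - n_fixed) / float(n_decay + 1)

    sched_G = torch.optim.lr_scheduler.LambdaLR(opt_G, lr_lambda=linear_decay)
    sched_D = torch.optim.lr_scheduler.LambdaLR(opt_D, lr_lambda=linear_decay)
    return opt_G, opt_D, sched_G, sched_D
```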

III-C Network Architectures

The network architectures of the generator and the discriminator used in this paper, shown in Fig. 3, are taken from [40]. The generator network with residual blocks is: c7s1-64, d128, d256, 9 x R256, u128, u64, c7s1-3, where c7s1-k represents a 7x7 Convolutional layer with k filters and stride 1, dk represents a Convolution_InstanceNorm_ReLU layer with k filters and stride 2, Rk represents a residual block consisting of two Convolutional layers with k filters each, and uk represents a DeConvolution_InstanceNorm_ReLU layer with k filters and stride 2.

For the discriminator, we use the PatchGAN from [11]. The discriminator network consists of: C64, C128, C256, C512, where Ck represents a Convolution_InstanceNorm_LeakyReLU layer with k filters and stride 2, followed by a final Convolutional layer with 1 filter to produce the one-dimensional output. The activation function used in this work is the Leaky ReLU with a slope of 0.2. The first Convolution layer (C64) does not include InstanceNorm.
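
For reference, below is a compact sketch of the PatchGAN discriminator described above, written after the public CycleGAN/pix2pix design rather than the authors' exact code; the stride-1 choice for the last two layers follows that reference implementation.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=2, norm=True):
    """Ck block of the text: 4x4 Convolution + InstanceNorm + LeakyReLU(0.2)."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1)]
    if norm:
        layers.append(nn.InstanceNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return layers

def build_patchgan_discriminator(in_ch=3):
    """C64-C128-C256-C512 followed by a 1-filter convolution (patch of scores)."""
    return nn.Sequential(
        *conv_block(in_ch, 64, norm=False),        # first layer: no InstanceNorm
        *conv_block(64, 128),
        *conv_block(128, 256),
        *conv_block(256, 512, stride=1),
        nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),
    )
```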

III-D Evaluation Metrics

To better understand the improved performance of the proposed CDGAN method, both quantitative and qualitative evaluations are used in this paper. The most widely used image quality assessment metrics for image-to-image transformation, namely the Peak Signal to Noise Ratio (PSNR), the Mean Squared Error (MSE) and the Structural Similarity Index (SSIM) [32], are used for the quantitative evaluation. The Learned Perceptual Image Patch Similarity (LPIPS) proposed in [37] is also used for calculating the perceptual similarity; the distance between the ground truth and the generated fake image is computed as the LPIPS score. The images generated by the proposed CDGAN method are also compared against the six benchmark methods and the ground truth in the result section.
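
As a sketch of how the four measures can be computed for one generated/ground-truth pair, the snippet below uses scikit-image for SSIM, PSNR and MSE and the lpips package for the perceptual distance of [37]; this is our choice of tooling rather than the authors' evaluation code, and it assumes H x W x 3 uint8 images.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import (structural_similarity, peak_signal_noise_ratio,
                             mean_squared_error)

lpips_net = lpips.LPIPS(net='alex')  # learned perceptual metric of [37]

def evaluate_pair(fake: np.ndarray, real: np.ndarray) -> dict:
    """fake, real: H x W x 3 uint8 images (generated output and ground truth)."""
    ssim = structural_similarity(fake, real, channel_axis=-1, data_range=255)
    psnr = peak_signal_noise_ratio(real, fake, data_range=255)
    mse = mean_squared_error(real, fake)
    # LPIPS expects NCHW float tensors scaled to [-1, 1]
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_net(to_tensor(fake), to_tensor(real)).item()
    return {"SSIM": ssim, "PSNR": psnr, "MSE": mse, "LPIPS": lp}
```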

Datasets | Metrics | GAN [8] | Pix2Pix [11] | DualGAN [35] | CycleGAN [40] | PS2GAN [29] | CSGAN [13] | CDGAN
CUHK | SSIM, MSE, PSNR, LPIPS
FACADES | SSIM, MSE, PSNR, LPIPS
RGB-NIR | SSIM, MSE, PSNR, LPIPS

TABLE II: The quantitative comparison of the results of the proposed CDGAN with different state-of-the-art methods trained on the CUHK, FACADES and RGB-NIR Scene datasets. The average SSIM, MSE, PSNR and LPIPS scores are reported. The best results are highlighted in bold and the second best results are shown in italics.

III-E Baseline Methods

We compare the proposed CDGAN model with six benchmark models, namely GAN [8], Pix2Pix [11], DualGAN [35], CycleGAN [40], PS2MAN [29] and CSGAN [13], to demonstrate its significance. All the above mentioned comparisons are made in the paired setting only.

III-E1 GAN

The original vanilla GAN proposed in [8] is used to generate new samples from the learned distribution for a given noise vector. The GAN used for comparison in this paper is implemented for image-to-image translation from the Pix2Pix code (https://github.com/phillipi/pix2pix) [11] by removing the L1 loss and keeping only the adversarial loss.

III-E2 Pix2Pix

For this method, the code provided by the authors of Pix2Pix [11] is used for generating the result images with the same default settings.

III-E3 DualGAN

For this method, the code provided by the authors of DualGAN (https://github.com/duxingren14/DualGAN) [35] is used for generating the result images with the same default settings as the original code.

III-E4 CycleGAN

For this method, the code provided by the authors of CycleGAN (https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix) [40] is used for generating the result images with the same default settings.

III-E5 PS2GAN

In this method, the code is implemented by adding the Synthesized loss to the existing losses of the CycleGAN [40] method. For a fair comparison with the proposed CDGAN method, the PS2MAN [29] method, originally proposed with multiple adversarial networks at different resolutions, is modified to a single adversarial network, i.e., PS2GAN.

III-E6 CSGAN

For this method, the code provided by the authors of CSGAN (https://github.com/KishanKancharagunta/CSGAN) [13] is used for generating the result images with the same default settings.

Fig. 4: The qualitative comparison of generated faces for sketch-to-photo synthesis over the CUHK Face Sketch dataset. The Input, GAN, Pix2Pix, DualGAN, CycleGAN, PS2GAN, CSGAN, CDGAN and Ground Truth images are shown from left to right, respectively. The faces generated by CDGAN have minimal artifacts and look more realistic and sharper.
Fig. 5: The qualitative comparison of generated building images for label-to-building transformation over the FACADES dataset. The Input, GAN, Pix2Pix, DualGAN, CycleGAN, PS2GAN, CSGAN, CDGAN and Ground Truth images are shown from left to right, respectively. The building images generated by CDGAN have minimal artifacts and look more realistic and sharper.
Fig. 6: The qualitative comparison of generated RGB scenes for NIR-to-RGB scene image transformation over the RGB-NIR Scene dataset. The Input, GAN, Pix2Pix, DualGAN, CycleGAN, PS2GAN, CSGAN, CDGAN and Ground Truth images are shown from left to right, respectively. The RGB scenes generated by CDGAN have minimal artifacts and look more realistic and sharper.
Datasets | Metrics | DualGAN | DualGAN+ | CycleGAN | CycleGAN+ | PS2GAN | PS2GAN+ | CSGAN | CSGAN+ | Our Method
CUHK | SSIM, MSE, PSNR, LPIPS
FACADES | SSIM, MSE, PSNR, LPIPS
RGB-NIR | SSIM, MSE, PSNR, LPIPS
TABLE III: The quantitative comparison between the proposed CDGAN method and different state-of-the-art methods trained on the CUHK, FACADES and RGB-NIR Scene datasets. The average SSIM, MSE, PSNR and LPIPS scores are reported. The '+' symbol represents the presence of the Cyclic-Discriminative Adversarial loss. The results with the Cyclic-Discriminative Adversarial loss are highlighted in italics and the best results, produced by the proposed CDGAN method, are highlighted in bold.

IV Result Analysis

This section analyzes the results produced by the introduced CDGAN method. To better express the improved performance of the proposed CDGAN method, we consider both quantitative and qualitative evaluation. The proposed CDGAN method is compared against the six benchmark methods, namely GAN, Pix2Pix, DualGAN, CycleGAN, PS2GAN and CSGAN. We also perform an ablation study of the proposed Cyclic-Discriminative adversarial loss with DualGAN, CycleGAN, PS2GAN and CSGAN to investigate its suitability with the existing losses.

IV-A Quantitative Evaluation

For the quantitative evaluation of the results, four baseline quantitative measures, namely SSIM, MSE, PSNR and LPIPS, are used. The average scores of these four metrics are calculated for all the above mentioned methods. The results over the CUHK, FACADES and RGB-NIR Scene datasets are shown in Table II. Larger SSIM and PSNR scores and smaller MSE and LPIPS scores indicate generated images of better quality. The following observations are drawn from the results of this experiment:

  • Over the CUHK dataset, the proposed CDGAN method improves the SSIM metric and reduces the MSE as compared to GAN, Pix2Pix, DualGAN, CycleGAN, PS2GAN and CSGAN.

  • For the FACADES dataset, the proposed CDGAN method improves the SSIM metric and reduces the MSE as compared to GAN, Pix2Pix, DualGAN, CycleGAN, PS2GAN and CSGAN.

  • Over the RGB-NIR Scene dataset, the proposed CDGAN method exhibits an improvement in the SSIM score and a reduction in the MSE score as compared to GAN, Pix2Pix, DualGAN, CycleGAN, PS2GAN and CSGAN.

  • The performance of DualGAN is very poor over the FACADES and RGB-NIR Scene datasets because these tasks involve high-level semantic labeling. Similar behavior of DualGAN is also observed by its original authors [35].

  • The PSNR and LPIPS measures over the CUHK, FACADES and RGB-NIR Scene datasets also show a reasonable improvement for the proposed CDGAN method compared to the other methods.

From the above comparisons over three different datasets, it is clear that the proposed CDGAN method generates images that are more structurally similar, have less pixel-to-pixel noise and look perceptually more realistic with reduced artifacts as compared to the state-of-the-art approaches.

IV-B Qualitative Evaluation

In order to show the improved quality of the output images produced by the proposed CDGAN, we compare a few sample images generated by CDGAN against the six state-of-the-art methods. These comparisons over the CUHK, FACADES and RGB-NIR Scene datasets are shown in Fig. 4, 5 and 6, respectively. The following observations and analysis are drawn from these qualitative results:

  • From Fig. 4, it can be observed that the images generated by the CycleGAN, PS2GAN and CSGAN methods contain reflections on the faces for the sample faces of the CUHK dataset. The proposed CDGAN method is able to eliminate this effect due to its increased discriminative capacity. Moreover, facial attributes such as the eyes, hair and face structure are generated without artifacts by the proposed CDGAN method.

  • The proposed CDGAN method generates building images over the FACADES dataset with enhanced textural detail, such as window and door sizes and shapes. As shown in Fig. 5, the buildings generated by the proposed CDGAN method contain more structural information. In the first row of Fig. 5, the building generated by the CDGAN method contains more window information in the top left corner compared to the remaining methods.

  • The qualitative results over the RGB-NIR Scene dataset are illustrated in Fig. 6. The RGB images generated using the proposed CDGAN method contain more semantic, depth-aware and structure-aware information as compared to the state-of-the-art methods. It can be seen in the first row of Fig. 6 that the proposed CDGAN method is able to generate the grass in green color at the bottom portion of the generated RGB image, where the other methods fail. It can also be seen in Fig. 6 that the proposed CDGAN method generates the tree in front of the building, whereas the compared methods fail to do so. The possible reason for such improved performance is the discriminative ability of the proposed method for varying depths of the scene points. It confirms the structure and depth sensitivity of the CDGAN method.

  • As expected, DualGAN completely fails on the high-level semantic labeling based image-to-image transformation tasks over the FACADES (see the 4th column of Fig. 5) and RGB-NIR (see the 4th column of Fig. 6) Scene datasets.

It is evident from the above qualitative results that the images generated by the CDGAN method are more realistic and sharper, with reduced artifacts, as compared to the state-of-the-art GAN models for image-to-image transformation.

IV-C Ablation Study

In order to analyze the importance of the proposed Cyclic-Discriminative adversarial loss, we conduct an ablation study over the losses. We also investigate its dependency on the other loss functions, such as the Adversarial, Synthesized, Cycle-consistency and Cyclic-Synthesized losses. Basically, the Cyclic-Discriminative adversarial loss is added to DualGAN, CycleGAN, PS2GAN and CSGAN and compared with the original results of these methods. The comparison results over the CUHK, FACADES and RGB-NIR Scene datasets are shown in Table III. We observe the following points:

  • The proposed Cyclic-Discriminative adversarial loss, when added to DualGAN and CycleGAN, has a mix of positive and negative impacts on the generated images, as shown in the DualGAN+ and CycleGAN+ columns of Table III. Note that the proposed Cyclic-Discriminative adversarial loss is computed between the Real_Image and the Cycled_Image. Because DualGAN+ and CycleGAN+ do not use the synthesized losses, our cyclic-discriminative loss becomes more powerful in these frameworks. It may lead to a situation where the generator is unable to fool the discriminator; thus, the generator may stop learning after a while.

  • It is also observed that the proposed Cyclic-Discriminative adversarial loss is well suited to the PS2GAN and CSGAN methods, as depicted in Table III. The improved performance is due to a very tough mini-max game between the powerful generator equipped with the synthesized losses and the powerful proposed discriminator. Due to this very competitive adversarial learning, the generator is able to produce very high quality images.

  • It can be seen in the last column of Table III that the CDGAN still outperforms all other combinations of losses. It confirms the importance of the proposed Cyclic-Discriminative adversarial loss in conjunction with the existing loss functions such as Adversarial, Synthesized, Cycle-consistency and Cyclic-Synthesized losses.

This ablation study reveals that the proposed CDGAN method is able to achieve improved performance when the proposed Cyclic-Discriminative adversarial loss is used together with the synthesized losses.

V Conclusion

In this paper, an improved image-to-image transformation method called CDGAN is proposed. A new Cyclic-Discriminative adversarial loss is introduced to increase the adversarial learning complexity. The introduced Cyclic-Discriminative adversarial loss is used in the CDGAN method along with the existing losses. Three different datasets, namely CUHK, FACADES and RGB-NIR Scene, are used for the image-to-image transformation experiments. The quantitative and qualitative experimental results are compared against GAN models including GAN, Pix2Pix, DualGAN, CycleGAN, PS2GAN and CSGAN. It is observed that the proposed method outperforms all the compared methods over all datasets in terms of the different evaluation metrics, such as SSIM, MSE, PSNR and LPIPS. The qualitative results also show that the generated images are more realistic, preserve structure better and contain fewer artifacts. It is also noticed that the proposed method deals better with the varying depths of the scene points. The ablation study over the different losses reveals that the proposed loss is better suited to the synthesized losses, as it increases the competitiveness between the generator and the discriminator to learn more semantic features. It is also noticed that the best performance is gained after combining all the losses, including the Adversarial loss, Synthesized loss, Cycle-consistency loss, Cyclic-Synthesized loss and Cyclic-Discriminative adversarial loss.

Acknowledgement

We are grateful to NVIDIA Corporation for donating the NVIDIA GeForce Titan X Pascal 12GB GPU used for this research.

References

  • [1] M. Brown and S. Süsstrunk (2011) Multispectral SIFT for scene category recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, pp. 177–184.
  • [2] A. Buades, B. Coll, and J. Morel (2005) A non-local algorithm for image denoising. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 60–65.
  • [3] B. Cao, N. Wang, J. Li, and X. Gao (2018) Data augmentation-based joint learning for heterogeneous face recognition. IEEE Transactions on Neural Networks and Learning Systems 30 (6), pp. 1731–1743.
  • [4] J. Chen, X. He, H. Chen, Q. Teng, and L. Qing (2016) Single image super-resolution based on deep learning and gradient transformation. In IEEE International Conference on Signal Processing, pp. 663–667.
  • [5] Z. Cheng, Q. Yang, and B. Sheng (2015) Deep colorization. In IEEE International Conference on Computer Vision.
  • [6] J. Fu, J. Liu, Y. Wang, J. Zhou, C. Wang, and H. Lu (2019) Stacked deconvolutional network for semantic segmentation. IEEE Transactions on Image Processing.
  • [7] L. A. Gatys, A. S. Ecker, and M. Bethge (2016) Image style transfer using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414–2423.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • [9] C. Guo, C. Li, J. Guo, R. Cong, H. Fu, and P. Han (2018) Hierarchical features driven residual learning for depth map super-resolution. IEEE Transactions on Image Processing.
  • [10] T. Guo, H. S. Mousavi, and V. Monga (2016) Deep learning based image super-resolution with coupled backpropagation. In IEEE Global Conference on Signal and Information Processing, pp. 237–241.
  • [11] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5967–5976.
  • [12] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711.
  • [13] K. B. Kancharagunta and S. R. Dubey (2019) CSGAN: cyclic-synthesized generative adversarial networks for image-to-image transformation. arXiv preprint arXiv:1901.03554.
  • [14] H. Kazemi, M. Iranmanesh, A. Dabouei, S. Soleymani, and N. M. Nasrabadi (2018) Facial attributes guided deep sketch-to-photo synthesis. In IEEE Winter Applications of Computer Vision Workshops (WACVW), pp. 1–8.
  • [15] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. International Conference on Learning Representations.
  • [16] G. Larsson, M. Maire, and G. Shakhnarovich (2016) Learning representations for automatic colorization. In European Conference on Computer Vision, pp. 577–593.
  • [17] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In IEEE Conference on Computer Vision and Pattern Recognition.
  • [18] J. Liu, W. Yang, S. Yang, and Z. Guo (2019) D3R-net: dynamic routing residue recurrent network for video rain removal. IEEE Transactions on Image Processing 28 (2), pp. 699–712.
  • [19] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
  • [20] A. Lucas, S. Lopez-Tapia, R. Molina, and A. K. Katsaggelos (2019) Generative adversarial networks and perceptual losses for video super-resolution. IEEE Transactions on Image Processing.
  • [21] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley (2017) Least squares generative adversarial networks. In IEEE International Conference on Computer Vision, pp. 2813–2821.
  • [22] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
  • [23] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544.
  • [24] C. Peng, X. Gao, N. Wang, D. Tao, X. Li, and J. Li (2015) Multiple representations-based face sketch–photo synthesis. IEEE Transactions on Neural Networks and Learning Systems 27 (11), pp. 2201–2215.
  • [25] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • [26] R. Tyleček and R. Šára (2013) Spatial pattern templates for recognition of objects with regular structure. In German Conference on Pattern Recognition.
  • [27] C. Wang, C. Xu, C. Wang, and D. Tao (2018) Perceptual adversarial networks for image-to-image transformation. IEEE Transactions on Image Processing 27 (8), pp. 4066–4079.
  • [28] C. Wang, M. Niepert, and H. Li (2019) RecSys-DAN: discriminative adversarial networks for cross-domain recommender systems. IEEE Transactions on Neural Networks and Learning Systems.
  • [29] L. Wang, V. Sindagi, and V. Patel (2018) High-quality facial photo-sketch synthesis using multi-adversarial networks. In IEEE International Conference on Automatic Face & Gesture Recognition, pp. 83–90.
  • [30] X. Wang and X. Tang (2008) Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (11), pp. 1955–1967.
  • [31] Z. Wang, P. Yi, K. Jiang, J. Jiang, Z. Han, T. Lu, and J. Ma (2018) Multi-memory convolutional neural network for video super-resolution. IEEE Transactions on Image Processing.
  • [32] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
  • [33] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li (2017) High-resolution image inpainting using multi-scale neural patch synthesis. In IEEE Conference on Computer Vision and Pattern Recognition.
  • [34] M. Yang, L. Zhang, S. C. Shiu, and D. Zhang (2013) Robust kernel representation with statistical local features for face recognition. IEEE Transactions on Neural Networks and Learning Systems 24 (6), pp. 900–912.
  • [35] Z. Yi, H. Zhang, P. Tan, and M. Gong (2017) DualGAN: unsupervised dual learning for image-to-image translation. In IEEE International Conference on Computer Vision, pp. 2868–2876.
  • [36] L. Zhang, L. Lin, X. Wu, S. Ding, and L. Zhang (2015) End-to-end photo-sketch generation via fully convolutional representation learning. In ACM International Conference on Multimedia Retrieval, pp. 627–634.
  • [37] R. Zhang, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conference on Computer Vision and Pattern Recognition.
  • [38] R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In European Conference on Computer Vision.
  • [39] S. Zhang, R. Ji, J. Hu, X. Lu, and X. Li (2018) Face sketch synthesis by multidomain adversarial learning. IEEE Transactions on Neural Networks and Learning Systems 30 (5), pp. 1419–1428.
  • [40] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, pp. 2242–2251.
  • [41] M. Zhu, J. Li, N. Wang, and X. Gao (2019) A deep collaborative framework for face photo–sketch synthesis. IEEE Transactions on Neural Networks and Learning Systems 30 (10), pp. 3096–3108.