PCSGAN: Perceptual Cyclic-Synthesized Generative Adversarial Networks for Thermal and NIR to Visible Image Transformation

02/13/2020, by Kancharagunta Kishan Babu, et al., IIIT Sri City

In many real-world scenarios, it is difficult to capture images in the visible light spectrum (VIS) due to bad lighting conditions. However, images can be captured in such scenarios using Near-Infrared (NIR) and Thermal (THM) cameras. The NIR and THM images contain limited details. Thus, there is a need to transform the images from THM/NIR to VIS for better understanding. However, this is a non-trivial task due to the large domain discrepancies and the lack of abundant datasets. Nowadays, the Generative Adversarial Network (GAN) is able to transform images from one domain to another. Most of the available GAN-based methods use a combination of the adversarial and the pixel-wise losses (like L1 or L2) as the objective function for training. The quality of the transformed images in the case of THM/NIR to VIS transformation is still not up to the mark with such objective functions. Thus, better objective functions are needed to improve the quality, fine details and realism of the transformed images. A new model for THM/NIR to VIS image transformation called Perceptual Cyclic-Synthesized Generative Adversarial Network (PCSGAN) is introduced to address these issues. The PCSGAN uses a combination of perceptual (i.e., feature based) losses along with the pixel-wise and the adversarial losses. Both quantitative and qualitative measures are used to judge the performance of the PCSGAN model over the WHU-IIP face and the RGB-NIR scene datasets. The proposed PCSGAN outperforms the state-of-the-art image transformation models, including Pix2pix, DualGAN, CycleGAN, PS2GAN, and PAN, in terms of the SSIM, MSE, PSNR and LPIPS evaluation measures. The code is available at: <https://github.com/KishanKancharagunta/PCSGAN>.


I Introduction

Thermal (THM) and Near-Infrared (NIR) cameras are used to capture images in situations where Visible (VIS) cameras fail to do so. The images captured in the THM/NIR domain are difficult for human examiners to understand due to the lack of information. Moreover, a large domain gap also exists between the VIS and THM/NIR images. Nowadays, the importance of matching images captured in THM/NIR against VIS images is increasing due to their extensive usage in real-world applications, such as military, law enforcement, and commercial applications [Applications].

Fig. 1: The efficacy of the proposed PCSGAN model. The sample input images in the 1st column are taken from the WHU-IIP (thermal images) and the RGB-NIR scene (NIR images) datasets, respectively. The images generated by Pix2pix [cGAN], DualGAN [dualGAN], CycleGAN [CyclicGAN], and the proposed PCSGAN method are shown in the 2nd, 3rd, 4th, and 5th columns, respectively. The ground truth images in the visible domain are shown in the last column. The red rectangles depict the artifacts, blurred and missing parts in the images generated by the existing methods, which are overcome by the proposed PCSGAN method.

The Thermal/NIR to Visible image transformation has been an active research area due to the great demand in real-world applications. Broadly, this problem can be categorized under image generation and transformation. The Generative Adversarial Network (GAN) [gan] was developed by Goodfellow et al. in 2014 to generate images learned from a given data distribution by providing a latent vector as the input. Later on, a variation of GAN called the conditional GAN (cGAN) [cGAN] was proposed for image-to-image translation. GANs have shown a wide variety of applications, such as image manipulation [AIMIM], image super-resolution [DLISR], [SISRDL], image style transfer [PLRTSTSR], [PRSISRGAN], image-to-image transformation [cGAN], [CyclicGAN], [Photo-to-Caricature], [triplefaceimage], image in-painting [multi-image-inpainting], feature detection in images [saliency], etc. Recent developments include the Cyclic Synthesized GAN (CSGAN) [csgan], Cyclic Discriminative GAN (CDGAN) [cdgan], Style-Based Generator GAN (SG-GAN) [karras2019style], and Generative Adversarial Minority Oversampling (GAMO) [mullick2019generative].

Isola et al. have proposed Pix2pix [cGAN], a common framework for image-to-image transformation conditioned on the input image, suitable only for paired datasets. Wang et al. have extended Pix2pix to the Perceptual Adversarial Network (PAN) [PerceptualGAN], which uses a perceptual loss between the features of the generated and target images. Zhu et al. and Yi et al. have introduced CycleGAN [CyclicGAN] and DualGAN [dualGAN], respectively, by adding a constraint between the real and the cycled images. Wang et al. have proposed PS2MAN [ps2man] for synthesizing facial photo images from sketch images by using multi-adversarial networks with a synthesized loss. Other notable works are GAN based visible face synthesis (GAN-VFS) [vfsgan], Thermal-to-Visible GAN (TV-GAN) [tvgan], semantic-guided GAN (SG-GAN) [MTVFISGGAN], etc.

It is evident from the above literature that the vanilla version of GAN is not able to produce very realistic and artifact-free images. The existing GAN based methods rely on dedicated and task-specific losses in their objective functions. Moreover, in order to improve the performance, some methods use additional information computed a priori. Thus, there is a need to find suitable objective/loss functions for the image-to-image transformation problem.

A new method for Thermal (THM)/Near-Infrared (NIR) to Visible (VIS) image transformation called the Perceptual Cyclic-Synthesized Generative Adversarial Network (PCSGAN) is introduced in this paper. The PCSGAN method consists of two different networks, called the generator and the discriminator network, like CycleGAN [CyclicGAN]. It is a two-player mini-max game, where the discriminator network tries to maximize the given objective function by correctly differentiating between the real and the generated VIS images. Meanwhile, the generator network tries to minimize the same objective function by generating real-looking VIS images to fool the discriminator network, as depicted in Fig. 1. In addition to the adversarial losses proposed in GAN [gan] and the pixel-wise similarity losses proposed in [cGAN], [CyclicGAN], we also use the perceptual losses introduced in [PLRTSTSR] for THM/NIR to VIS image synthesis.

The main contributions are summarized as follows:

  • A new method for Thermal/NIR to Visible image transformation called Perceptual Cyclic-Synthesized Generative Adversarial Network (PCSGAN) is proposed.

  • The PCSGAN utilizes two perceptual losses called the Cycled_Perceptual loss and the Synthesized_Perceptual loss in addition to the Adversarial and Pixel-wise losses.

  • The detailed experiments are conducted to show the improved performance of the proposed PCSGAN method over the WHU-IIP face and the RGB-NIR scene datasets.

  • Further, the ablation studies on losses are also conducted to verify the effectiveness of the added losses.

The rest of the paper is organized as follows: the proposed PCSGAN method is described in Section II along with the different loss functions and the architecture; Section III describes the different datasets, evaluation metrics, and state-of-the-art methods; Section IV presents the experimental results with analysis; Section V reports the ablation studies; and Section VI concludes with future directions.

II Proposed Method

In this section, the proposed PCSGAN framework is presented along with the different loss functions, and the network details are explained.

Fig. 2: The PCSGAN framework for Thermal-to-Visible image transformation. G_V and G_T are the generator networks for the Thermal to Visible and Visible to Thermal transformations, respectively. D_V and D_T are the discriminators used to distinguish between the Real_Images (probability 1) and the Synthesized_Images (probability 0) in the Visible and Thermal domains, respectively. L_LSGAN_V and L_LSGAN_T are the Adversarial losses, L_cyc_V and L_cyc_T are the Cycle-consistency losses, L_syn_V and L_syn_T are the Synthesized losses, L_cp_V and L_cp_T are the Cycled_Perceptual losses, and L_sp_V and L_sp_T are the Synthesized_Perceptual losses. φ is the feature extractor used to extract the features from the images.

II-A The PCSGAN Framework

The PCSGAN framework, as shown in Fig. 2, is used to transform the images from the source Thermal/NIR domain T to the target Visible domain V. For ease of understanding, we consider the Thermal to Visible transformation; the same explanation also applies to NIR to Visible. The PCSGAN framework consists of two generator networks, G_V and G_T, to transform the images from Thermal to Visible and from Visible to Thermal, respectively. These generator networks are trained adversarially by using two discriminator networks, D_V and D_T, in domains V and T, respectively.

For illustration, the first generator G_V receives an input image Real_T in domain T and transforms it into the Synthesized_Image Syn_V in domain V, i.e., Syn_V = G_V(Real_T). In this context, the generator network G_V is trained to generate Syn_V to look the same as Real_V in order to fool the discriminator D_V, such that it should not be able to distinguish between Real_V and Syn_V. Whereas, the discriminator D_V is trained to distinguish clearly between Real_V as real and Syn_V as generated. On the other hand, the second generator G_T transforms the Real_Image Real_V from domain V into the Synthesized_Image Syn_T in domain T, i.e., Syn_T = G_T(Real_V). Here, the generator network G_T is trained to generate Syn_T close to Real_T to fool the discriminator D_T into thinking that Syn_T is Real_T. Whereas, the discriminator D_T distinguishes Real_T as real and Syn_T as fake. Thus, the generator networks G_V and G_T are trained adversarially by competing with the discriminator networks D_V and D_T, respectively. We use the adversarial loss functions L_LSGAN_V and L_LSGAN_T introduced in [LSGAN] to train the combined generator and discriminator networks.
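To make the notation above concrete, a minimal, hedged sketch of the PCSGAN data flow is given below (not the authors' released code); the single-convolution "generators" are placeholders for the residual-block networks described in Section II-B1.

```python
# Illustrative sketch of the PCSGAN forward passes; variable names mirror the
# notation above (Real_T, Syn_V, Cyc_T, ...). The stand-in generators are
# placeholders, not the paper's actual architecture.
import torch
import torch.nn as nn

G_V = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # Thermal -> Visible (placeholder)
G_T = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # Visible -> Thermal (placeholder)

real_T = torch.randn(1, 3, 256, 256)  # real thermal image
real_V = torch.randn(1, 3, 256, 256)  # real visible image

syn_V = G_V(real_T)   # Synthesized_Image in domain V
syn_T = G_T(real_V)   # Synthesized_Image in domain T
cyc_T = G_T(syn_V)    # Cycled_Image in domain T (T -> V -> T)
cyc_V = G_V(syn_T)    # Cycled_Image in domain V (V -> T -> V)
```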

These least-squares objectives are given as,

min_{G_V} max_{D_V} L_LSGAN_V(G_V, D_V) = E_{V ~ p_data(V)} [(D_V(Real_V) - 1)^2] + E_{T ~ p_data(T)} [(D_V(G_V(Real_T)))^2],

min_{G_T} max_{D_T} L_LSGAN_T(G_T, D_T) = E_{T ~ p_data(T)} [(D_T(Real_T) - 1)^2] + E_{V ~ p_data(V)} [(D_T(G_T(Real_V)))^2],

where L_LSGAN_V and L_LSGAN_T are the adversarial losses in the domains V and T, respectively. The adversarial losses guide the generators to produce the images in the target domains.
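As a hedged illustration of these least-squares adversarial objectives, the per-term computations can be written as follows; the patch-shaped discriminator outputs are an assumption based on the PatchGAN discriminator described later in Section II-B2.

```python
# Sketch of the LSGAN-style adversarial losses used above (illustrative only).
import torch

def lsgan_discriminator_loss(d_real, d_fake):
    # The discriminator pushes scores on real images towards 1 and on
    # synthesized images towards 0.
    return ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()

def lsgan_generator_loss(d_fake):
    # The generator pushes the discriminator's score on synthesized images
    # towards 1 (i.e., tries to make them look real).
    return ((d_fake - 1) ** 2).mean()

# Dummy patch-wise discriminator outputs of shape (N, 1, 30, 30).
d_real_V = torch.rand(1, 1, 30, 30)
d_fake_V = torch.rand(1, 1, 30, 30)
loss_D_V = lsgan_discriminator_loss(d_real_V, d_fake_V)
loss_G_V = lsgan_generator_loss(d_fake_V)
```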

II-A1 Pixel-wise Similarity Loss Functions

In this work, two pixel-wise similarity loss functions, the Cycle-consistency loss introduced in [CyclicGAN] and the Synthesized loss introduced in [ps2man], are included. The CycleGAN is originally proposed for unpaired datasets, and in this case training the model with only the adversarial losses leads to the mode collapse problem, where most of the images from the source domain are mapped to a single image in the target domain.

The Cycle-consistency loss, calculated as the loss between the Real_Images (Real_T and Real_V) and the Cycled_Images (Cyc_T and Cyc_V), is used to overcome the above mentioned problem. The Cyc_T and Cyc_V are computed as,

Cyc_T = G_T(G_V(Real_T)),    (1)
Cyc_V = G_V(G_T(Real_V)).    (2)

The Cycle-consistency losses in domain T and domain V are represented as L_cyc_T and L_cyc_V, respectively, and given as,

L_cyc_T = ||Real_T - Cyc_T||_1,    (3)
L_cyc_V = ||Real_V - Cyc_V||_1,    (4)

where L_cyc_T is the Cycle-consistency loss calculated between the Real_Image (Real_T) and the Cycled_Image (Cyc_T) in domain T using the L1 loss, and L_cyc_V is the Cycle-consistency loss calculated between the Real_Image (Real_V) and the Cycled_Image (Cyc_V) in domain V using the L1 loss. In the THM to VIS image transformation, these Cycle-consistency losses act as additional regularizers and help to learn the network parameters by reducing the artifacts in the produced images.

TABLE I: The relationship between the different loss functions used in the PCSGAN model and recent models, including GAN [gan], Pix2pix [cGAN], DualGAN [dualGAN], CycleGAN [CyclicGAN], PS2GAN [ps2man], and PAN [PAN], respectively. **The DualGAN objective function is similar to that of the CycleGAN method.

The Synthesized loss is introduced based on the observation that the task of the generator is not only to generate synthesized images that fool the discriminator, but also to make the generated images look realistic and closer to the target domain. This is not possible with only the Adversarial loss and the Cycle-consistency loss. The Synthesized loss is computed between the Real_Images (Real_V and Real_T) and the Synthesized_Images (Syn_V and Syn_T) in domains V and T, respectively. The Synthesized losses in domains V and T are represented as L_syn_V and L_syn_T and defined as,

L_syn_V = ||Real_V - Syn_V||_1,    (5)

where L_syn_V is the Synthesized loss calculated between the Real_Image (Real_V) and the Synthesized_Image (Syn_V) in domain V using the L1 loss, and,

L_syn_T = ||Real_T - Syn_T||_1,    (6)

where L_syn_T is the Synthesized loss calculated between the Real_Image (Real_T) and the Synthesized_Image (Syn_T) in domain T using the L1 loss.
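Both pixel-wise terms reduce to L1 distances between image tensors; the following hedged sketch computes Equations (3)-(6) with dummy tensors standing in for the real, cycled, and synthesized images.

```python
# Sketch of the pixel-wise similarity losses (Equations (3)-(6)) as L1 (mean
# absolute error) distances; tensors are random placeholders for illustration.
import torch
import torch.nn.functional as F

real_T, real_V = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
syn_V, syn_T = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)  # G_V(real_T), G_T(real_V)
cyc_T, cyc_V = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)  # G_T(syn_V), G_V(syn_T)

L_cyc_T = F.l1_loss(cyc_T, real_T)   # Cycle-consistency loss in domain T, Eq. (3)
L_cyc_V = F.l1_loss(cyc_V, real_V)   # Cycle-consistency loss in domain V, Eq. (4)
L_syn_V = F.l1_loss(syn_V, real_V)   # Synthesized loss in domain V, Eq. (5)
L_syn_T = F.l1_loss(syn_T, real_T)   # Synthesized loss in domain T, Eq. (6)
```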

II-A2 Perceptual Similarity Loss Functions

For Thermal/NIR to Visible image transformation, the above discussed pixel-wise losses help the generator to produce images closer to the target domain from the pixel-wise content perspective. However, both the Cycle-consistency loss and the Synthesized loss fail to capture the perceptual information that drives human judgment of image quality [context_enc]. So, when only the pixel-wise similarity losses are used for image transformation, the generated images generally suffer from reduced sharpness and missing fine details in the structures [PAN]. To solve this problem, feature-based loss functions were introduced in [PLRTSTSR] to provide additional constraints that enhance the quality of the transformed image. In this work, we also utilize the power of the feature-based loss by adding two additional perceptual losses, namely, the Cycled_Perceptual loss and the Synthesized_Perceptual loss.

The Cycled_Perceptual loss is calculated by extracting the intermediate features with the help of a pre-trained feature extractor network (φ), like VGG-19 or ResNet-50. It is computed between the Real_Images (Real_V and Real_T) and the Cycled_Images (Cyc_V and Cyc_T) in domains V and T, represented as L_cp_V and L_cp_T, respectively. These losses are given as,

L_cp_V = ||φ(Real_V) - φ(Cyc_V)||_1,    (7)

where L_cp_V is the Cycled_Perceptual loss calculated as the mean absolute error (MAE) between the features extracted from the Real_Image (Real_V) and the Cycled_Image (Cyc_V) in domain V, and,

L_cp_T = ||φ(Real_T) - φ(Cyc_T)||_1,    (8)

where L_cp_T is the Cycled_Perceptual loss calculated as the mean absolute error (MAE) between the features extracted from the Real_Image (Real_T) and the Cycled_Image (Cyc_T) in domain T.

The Synthesized_Perceptual loss is similar to the Synthesized loss; instead of the pixel-wise loss, it is computed using the feature loss. This loss is computed between the Real_Images (Real_V and Real_T) and the Synthesized_Images (Syn_V and Syn_T) in domains V and T, represented as L_sp_V and L_sp_T, respectively. These losses are computed as,

L_sp_V = ||φ(Real_V) - φ(Syn_V)||_1,    (9)

where L_sp_V is the Synthesized_Perceptual loss computed as the mean absolute error (MAE) between the features extracted from the Real_Image (Real_V) and the Synthesized_Image (Syn_V) in domain V, and,

L_sp_T = ||φ(Real_T) - φ(Syn_T)||_1,    (10)

where L_sp_T is the Synthesized_Perceptual loss computed as the mean absolute error (MAE) between the features extracted from the Real_Image (Real_T) and the Synthesized_Image (Syn_T) in domain T.
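A hedged sketch of these feature-based terms, using a frozen ImageNet pre-trained VGG-19 as the feature extractor φ (one of the choices mentioned above); the layer cut-off is an assumption made for illustration, not the paper's exact setting.

```python
# Sketch of the perceptual losses (Equations (7)-(10)): MAE between deep
# features extracted by a frozen pre-trained network. Layer choice is assumed.
import torch
import torch.nn.functional as F
import torchvision.models as models

phi = models.vgg19(pretrained=True).features[:16].eval()  # feature extractor
for p in phi.parameters():
    p.requires_grad = False  # phi is fixed, not trained with the GAN

def perceptual_loss(x, y):
    # Mean absolute error between deep features of two 3-channel images.
    return F.l1_loss(phi(x), phi(y))

real_V = torch.rand(1, 3, 256, 256)
cyc_V = torch.rand(1, 3, 256, 256)
syn_V = torch.rand(1, 3, 256, 256)
L_cp_V = perceptual_loss(real_V, cyc_V)  # Cycled_Perceptual loss in domain V
L_sp_V = perceptual_loss(real_V, syn_V)  # Synthesized_Perceptual loss in domain V
```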

II-A3 Final Objective Function

The final objective function of the PCSGAN framework consists of the Adversarial losses, the Cycle-consistency losses, the Synthesized losses, the Cycled_Perceptual losses and the Synthesized_Perceptual losses. It is given as,

L_PCSGAN = L_LSGAN_V + L_LSGAN_T + λ_cyc_T L_cyc_T + λ_cyc_V L_cyc_V + λ_syn_V L_syn_V + λ_syn_T L_syn_T + λ_cp_V L_cp_V + λ_cp_T L_cp_T + λ_sp_V L_sp_V + λ_sp_T L_sp_T,    (11)

where L_LSGAN_V and L_LSGAN_T are the Adversarial losses, L_cyc_T and L_cyc_V are the Cycle-consistency losses, L_syn_V and L_syn_T are the Synthesized losses, L_cp_V and L_cp_T are the Cycled_Perceptual losses, and L_sp_V and L_sp_T are the Synthesized_Perceptual losses. The weights λ for the different losses used in the final objective function are set empirically. The relation between the different loss functions used in the proposed PCSGAN method and the state-of-the-art methods is summarized in Table I. The perceptual adversarial losses shown in Table I are those introduced in the PAN [PAN] method, calculated in domains V and T, respectively.
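A short sketch of how the terms of Equation (11) can be combined into the generator objective; the loss values and weights below are placeholders, since the paper's empirically chosen values are not reproduced here.

```python
# Sketch of assembling the full PCSGAN objective of Equation (11).
# All loss values and weights below are placeholders for illustration.
import torch

losses = {
    "lsgan_V": torch.tensor(0.50), "lsgan_T": torch.tensor(0.50),
    "cyc_V":   torch.tensor(0.20), "cyc_T":   torch.tensor(0.20),
    "syn_V":   torch.tensor(0.10), "syn_T":   torch.tensor(0.10),
    "cp_V":    torch.tensor(0.30), "cp_T":    torch.tensor(0.30),
    "sp_V":    torch.tensor(0.40), "sp_T":    torch.tensor(0.40),
}
# Placeholder lambda weights (the paper sets these empirically).
weights = {name: 1.0 for name in losses}

total_objective = sum(weights[name] * value for name, value in losses.items())
```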

TABLE II: Quantitative comparison between the PCSGAN method and the state-of-the-art methods (Pix2pix [cGAN], DualGAN [dualGAN], CycleGAN [CyclicGAN], PS2GAN [ps2man], and PAN [PAN]) using the SSIM, MSE, PSNR, LPIPS, and MSSIM scores over the WHU-IIP face dataset.

TABLE III: Quantitative comparison between the PCSGAN method and the state-of-the-art methods (Pix2pix [cGAN], DualGAN [dualGAN], CycleGAN [CyclicGAN], PS2GAN [ps2man], and PAN [PAN]) using the SSIM, MSE, PSNR, and LPIPS scores over the RGB-NIR scene dataset.

II-B Implementation Details

As shown in Fig. 2, the PCSGAN framework contains two generator networks (G_V and G_T) and two discriminator networks (D_V and D_T). One generator and one discriminator operate in the source domain T, whereas the other generator and discriminator operate in the target domain V.

II-B1 Generator Network

Similar to the CycleGAN [CyclicGAN], we adopt the generator network with residual blocks from Johnson et al. [PLRTSTSR]. The generator consists of an initial set of convolution layers, followed by a stack of residual blocks, then transpose convolution (deconvolution) layers, and one final convolution layer. Each residual block consists of two convolution layers with a skip connection. Instance Norm is used after the second convolution and all deconvolution layers, and the ReLU activation function is used. All the input images given to the PCSGAN method for training are resized to a fixed resolution.
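For illustration, a residual block in the style of Johnson et al.'s generator (which the paper adopts) can be sketched as below; the channel count and padding scheme are assumptions, not the paper's exact configuration.

```python
# Sketch of a residual block with instance normalization and ReLU, in the
# style of the Johnson et al. generator; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReflectionPad2d(1),
            nn.Conv2d(channels, channels, kernel_size=3),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1),
            nn.Conv2d(channels, channels, kernel_size=3),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        # Skip connection: the block learns a residual on top of its input.
        return x + self.body(x)

features = torch.randn(1, 256, 64, 64)
out = ResidualBlock(256)(features)  # same shape as the input
```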

II-B2 Discriminator Network

A PatchGAN is used as the discriminator network, similar to [cGAN]. The discriminator consists of a sequence of convolution layers with an increasing number of filters. Instance Norm is used in all layers except the first convolution layer. The final one-dimensional output map is computed by a convolution layer, and the Leaky ReLU activation function is used.
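A minimal PatchGAN-style discriminator sketch is given below; the filter counts, 4x4 kernels, and the Leaky ReLU slope of 0.2 are assumptions borrowed from the pix2pix design the paper follows.

```python
# Sketch of a PatchGAN discriminator that outputs a map of per-patch
# real/fake scores; the layer configuration is an illustrative assumption.
import torch
import torch.nn as nn

def block(in_ch, out_ch, stride, norm=True):
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1)]
    if norm:
        layers.append(nn.InstanceNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return layers

patch_discriminator = nn.Sequential(
    *block(3, 64, stride=2, norm=False),   # no normalization on the first layer
    *block(64, 128, stride=2),
    *block(128, 256, stride=2),
    *block(256, 512, stride=1),
    nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),  # per-patch scores
)

scores = patch_discriminator(torch.randn(1, 3, 256, 256))  # e.g. (1, 1, 30, 30)
```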

II-B3 Training Details

In this work, as the proposed PCSGAN method is an enhancement of the CycleGAN method [CyclicGAN], the training details are the same as in the CycleGAN. Input images of a fixed size are given to the network as mentioned in [CyclicGAN], and the generator network with residual blocks is best suited for this size. We train the generator and the discriminator networks with only one sample per batch. The learning rate is kept constant for the initial epochs and then linearly decayed over the remaining epochs. The networks are initialized from a zero-mean Gaussian distribution with a small standard deviation. The Adam [adam] optimizer with a momentum term is used for optimizing the networks.
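This schedule can be implemented along the lines of the following hedged sketch, assuming the CycleGAN-style defaults referred to above; the learning rate, betas, and epoch counts are placeholders, not the paper's reported values.

```python
# Sketch of a CycleGAN-style optimizer and constant-then-linear-decay learning
# rate schedule; all hyper-parameter values are illustrative placeholders.
import itertools
import torch
import torch.nn as nn

G_V = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # placeholder generator T -> V
G_T = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # placeholder generator V -> T

optimizer_G = torch.optim.Adam(
    itertools.chain(G_V.parameters(), G_T.parameters()),
    lr=2e-4, betas=(0.5, 0.999),  # placeholder values
)

n_constant, n_decay = 100, 100  # placeholder epoch counts

def lr_lambda(epoch):
    # Keep the base rate for n_constant epochs, then decay linearly to zero.
    return 1.0 - max(0, epoch - n_constant) / float(n_decay)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer_G, lr_lambda=lr_lambda)
```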

III Experimental Setup

III-A Data Sets

In this experiment, the WHU-IIP thermal-visible face dataset and the RGB-NIR near-infrared-visible scene dataset are used. The WHU-IIP face dataset (http://iip.whu.edu.cn/projects/IR2Vis_dataset.html) contains paired thermal and visible facial images collected from a number of individuals [TVFITGAN]; the training and test sets contain samples from different individuals. The RGB-NIR scene dataset (https://ivrl.epfl.ch/research-2/research-downloads/supplementary_material-cvpr11-index-html/) consists of image pairs from several scene classes, captured in both the Near-Infrared (NIR) and the Visible (RGB) domains, and is split into training and testing sets.

III-B Evaluation Metrics

In order to show the improved performance of the proposed PCSGAN method, we use qualitative as well as quantitative measures. Baseline image quality evaluation metrics, namely the Structural Similarity Index (SSIM) [SSIM], Peak Signal to Noise Ratio (PSNR), Mean Square Error (MSE), Learned Perceptual Image Patch Similarity (LPIPS) [LPIPS], and Multiscale Structural Similarity Index (MS-SSIM) [MSSSIM], are used as the quantitative measures.
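For reference, the non-learned metrics can be computed with scikit-image as in the hedged sketch below (random arrays stand in for the generated and ground-truth images); LPIPS requires a learned network from the separate lpips package and is therefore omitted here.

```python
# Sketch of computing MSE, PSNR, and SSIM between a generated image and its
# ground truth using scikit-image; arrays are random placeholders.
import numpy as np
from skimage.metrics import (mean_squared_error,
                             peak_signal_noise_ratio,
                             structural_similarity)

generated = np.random.rand(256, 256, 3)
target = np.random.rand(256, 256, 3)

mse = mean_squared_error(target, generated)
psnr = peak_signal_noise_ratio(target, generated, data_range=1.0)
ssim = structural_similarity(target, generated, channel_axis=-1, data_range=1.0)
```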

III-C State-of-the-Art Compared Methods

Five state-of-the-art image-to-image transformation methods are used for comparison, namely, Pix2pix, DualGAN, CycleGAN, PS2GAN, and PAN. In order to have a fair comparison, we evaluate all the methods in the paired setting only.

III-C1 Pix2pix

The publicly available Pix2pix code (https://github.com/phillipi/pix2pix) [cGAN] is used with the default settings.

III-C2 DualGAN

We use DualGAN with the default settings as per the publicly available code (https://github.com/duxingren14/DualGAN) [dualGAN].

III-C3 CycleGAN

We use CycleGAN with the default settings as per the publicly available code (https://github.com/junyanz/pytorch-CycleGAN-and-Pix2pix) [CyclicGAN].

III-C4 PS2GAN

For this method, the original code taken from the authors consists of multiple adversarial networks [ps2man]. It is originally proposed to generate the result images by calculating the losses at different resolutions of the given input images. For a fair comparison with the remaining state-of-the-art methods, we implement PS2MAN with a single adversarial network, i.e., PS2GAN, by adding the Synthesized loss to the CycleGAN [CyclicGAN] along with its existing losses.

III-C5 PAN

The code is taken from the PAN repository (https://github.com/DLHacks/pix2pix_PAN) [PAN], and the same settings mentioned in the original paper are used for the experiments.

Fig. 3: The qualitative comparison of results over the WHU-IIP dataset. From left to right: the input image, the images generated by Pix2pix, DualGAN, CycleGAN, PS2GAN, PAN, and PCSGAN, and the ground truth image, respectively.

IV Experimental Results and Analysis

In the experiments, the proposed PCSGAN method is evaluated in terms of both the quantitative and qualitative measures.

IV-A Quantitative Results

The PCSGAN method is quantitatively evaluated by using five widely used image quality assessment metrics, namely SSIM, MSE, PSNR, LPIPS and MS-SSIM. The PCSGAN method clearly shows improved performance over the state-of-the-art methods, as reported in Tables II and III for the WHU-IIP face dataset and the RGB-NIR scene dataset, respectively. Compared against Pix2pix [cGAN], DualGAN [dualGAN], CycleGAN [CyclicGAN], PS2GAN [ps2man] and PAN [PAN], the PCSGAN shows

  • an increment in the SSIM, PSNR and MS-SSIM scores on the WHU-IIP face dataset,

  • a reduction in the MSE and LPIPS scores on the WHU-IIP face dataset,

  • an increment in the SSIM and PSNR scores on the RGB-NIR scene dataset, and

  • a reduction in the MSE and LPIPS scores on the RGB-NIR scene dataset.

These quantitative results confirm the superiority of the proposed PCSGAN method compared to the state-of-the-art methods.

IV-B Qualitative Results

The qualitative results are also analyzed to better understand the visual quality of the generated images, shown in Fig. 3 for the WHU-IIP face dataset and in Fig. 4 for the RGB-NIR scene dataset. The qualitative comparison of results over the WHU-IIP face dataset is described as follows:

  • It can be observed that the quality and fine details (i.e., facial attributes) of the resulting face images generated by the proposed PCSGAN method are comparatively better than those of the state-of-the-art methods, as shown in Fig. 3.

  • In particular, several rows in Fig. 3 show that the result images generated by the proposed PCSGAN method are much closer (finer facial attribute details and less blurriness) to the ground truth images.

  • It can also be observed that the state-of-the-art methods fail (see the facial attributes and blurriness in the images) to generate result images close to the ground truth images.

Fig. 4: The qualitative comparison of results over the RGB-NIR scene dataset. From left to right: the input image, the images generated by Pix2pix, DualGAN, CycleGAN, PS2GAN, PAN, and PCSGAN, and the ground truth image, respectively.

In a similar way, the qualitative comparison of results over the RGB-NIR scene dataset is described as follows:

  • From Fig. 4, it can be observed that the quality and fine details (i.e., color and texture) of the resulting scene images generated by the proposed PCSGAN method are comparatively better than those of the state-of-the-art methods.

  • In particular, from the 2nd, 3rd and 4th columns of Fig. 4, it can be observed that the Pix2pix, DualGAN and CycleGAN methods completely fail to predict the color and texture of the target domain images. Whereas, it can be seen from the 5th and 6th columns that the PS2GAN and PAN methods are able to generate images somewhat closer to the ground truth images, which can still be improved further.

  • The PCSGAN method generates higher quality and more realistic images compared to the state-of-the-art image-to-image transformation methods, as can be observed from the 7th column.

The qualitative comparisons of the results over the WHU-IIP face and RGB-NIR scene datasets clearly show that the proposed PCSGAN method achieves much better results than the state-of-the-art image-to-image transformation methods.

Fig. 5: The generated images using different losses in the PCSGAN framework over the WHU-IIP face dataset. The first and last columns represent the input and target images, respectively. The 2nd to 7th columns, from left to right, show the generated images using the AL, AL+CL, AL+CL+CPL, AL+CL+SL, AL+CL+SL+SPL, and AL+CL+CPL+SL+SPL losses, respectively.
Fig. 6: The generated images using different losses in the PCSGAN framework over the RGB-NIR scene dataset. The columns are arranged as in Fig. 5.
TABLE IV: Ablation study over the different loss function combinations (AL, AL+CL, AL+CL+CPL, AL+CL+SL, AL+CL+SL+SPL, and AL+CL+CPL+SL+SPL) used in the proposed PCSGAN, evaluated with the SSIM, MSE, PSNR, LPIPS, and MSSIM scores over the WHU-IIP face dataset.

V Ablation Study

In this paper, we conduct an ablation study using different loss functions, namely, the Adversarial losses, the pixel-wise losses and the perceptual losses. It is dedicated to better understanding the impact of the newly added perceptual loss functions. For simplicity, we label the Adversarial Loss as AL, the Cycle-consistency Loss as CL, the Synthesized Loss as SL, the Cycled_Perceptual Loss as CPL, and the Synthesized_Perceptual Loss as SPL. The ablation study is performed over the WHU-IIP face and the RGB-NIR scene datasets in terms of both the quantitative measures (summarized in Table IV and Table V) and the qualitative measures (illustrated in Fig. 5 and Fig. 6), respectively. The observations from this ablation study are described as follows:

V-A AL

In this setting, we use only the Adversarial losses defined in Section II-A as the objective function for image transformation over the WHU-IIP face dataset and the RGB-NIR scene dataset. Over the WHU-IIP dataset, the generated images suffer from severe artifacts and a lack of facial attribute information. Over the RGB-NIR dataset, the results suffer from the mode collapse problem, and the model completely fails to generate meaningful images. Thus, the Adversarial losses alone are unable to generate good quality realistic images and lead to a high domain discrepancy gap.

TABLE V: Ablation study over the different loss function combinations (AL, AL+CL, AL+CL+CPL, AL+CL+SL, AL+CL+SL+SPL, and AL+CL+CPL+SL+SPL) used in the proposed PCSGAN, evaluated with the SSIM, MSE, PSNR, and LPIPS scores over the RGB-NIR scene dataset.

V-B AL+CL

In this setting, we use the Cycle-consistency losses given in Equations (3) and (4) along with the Adversarial losses as the objective function. From the results obtained, it can be seen that the Cycle-consistency loss helps to overcome the mode collapse problem over the RGB-NIR scene dataset. However, the generated images still suffer from color disparity and a lack of fine details over both the WHU-IIP and RGB-NIR scene datasets.

V-C AL+CL+CPL

In this setting, we use the Cycled_Perceptual losses given in Equations (7) and (8) along with the Adversarial losses and the Cycle-consistency losses as the objective function. From the results, it can be observed that the Cycled_Perceptual losses help to generate images with finer details and more consistent colors.

V-D AL+CL+SL

In this setting, we use the Synthesized losses given in Equations (5) and (6) along with the Cycle-consistency losses and the Adversarial losses as the objective function. It is noticed from the results that the Synthesized losses push the generator to produce images that not only fool the discriminator, but also look closer to the target domain. Although the generated images are closer to the target domain images, they still lack fine details in terms of color, texture and shape.

V-E AL+CL+SL+SPL

In this setting, we use the Synthesized_Perceptual losses given in Equations (9) and (10) along with the Synthesized losses, the Cycle-consistency losses and the Adversarial losses as the objective function. The results after adding the Synthesized_Perceptual losses preserve the fine details and generate images closer to the target domain. Still, there is scope to decrease the domain discrepancy gap by adding the Cycled_Perceptual loss.

V-F AL+CL+CPL+SL+SPL

For the proposed PCSGAN, we combine all the above mentioned losses, namely, the Adversarial losses, the Cycle-consistency losses, the Cycled_Perceptual losses, the Synthesized losses and the Synthesized_Perceptual losses, as the objective function. From the results, it is clear that the images generated by the proposed PCSGAN are more realistic with finer details and negligible artifacts when compared to the remaining settings.

We also conducted an experiment by adding the Perceptual Adversarial loss introduced in PAN [PAN] to our proposed PCSGAN method, but we did not observe a noticeable improvement in the quality of the generated images.

From this ablation study, it is observed that the Adversarial losses help to generate the images in the target domains, but the generated images suffer from artifacts, blurred portions, a lack of fine details, and sometimes from the mode collapse problem. The pixel-wise similarity loss functions, namely, the Cycle-consistency losses and the Synthesized losses, help to generate images closer to the target domain and to reduce the artifacts. The perceptual losses, namely, the Cycled_Perceptual and the Synthesized_Perceptual losses, force the network to generate images with semantic and finer details in terms of the consistency of color, texture and shape across different regions.

The results are summarized as follows:

  1. The proposed PCSGAN shows outstanding performance on the problem of Thermal/NIR to Visible image transformation.

  2. The proposed PCSGAN outperforms the existing GAN methods such as Pix2pix, DualGAN, CycleGAN, PS2GAN and PAN in terms of different measures such as SSIM, MSE, PSNR, and LPIPS.

  3. The proposed PCSGAN generates the images with better quality and fine details.

  4. The importance of the different losses used in the proposed PCSGAN is observed through the ablation study with different combinations of losses in the objective function.

VI Conclusion

In this paper, we present a new PCSGAN model for transforming images from the Thermal/NIR domain to the Visible domain. The proposed PCSGAN method uses the perceptual losses in addition to the adversarial and pixel-wise losses that are generally used by the recent state-of-the-art image-to-image transformation methods. The quality of the generated images is greatly improved in terms of finer details, reduced artifacts and semantics after the addition of the perceptual losses, namely, the Cycled_Perceptual losses and the Synthesized_Perceptual losses. The same is observed through both the quantitative and qualitative measures. The proposed method outperforms the other compared state-of-the-art methods for the Thermal-to-Visible and NIR-to-Visible transformation problems. The improved results are observed using the SSIM, PSNR, MSE and LPIPS performance measures. The ablation study confirms the suitability and relevance of the added perceptual losses, which boost the quality of the generated images. The future work includes the extension to more than two modalities. Moreover, we also want to extend this work to find effective losses and constraints for image transformation between heterogeneous datasets, such as visible, sketch, thermal and NIR.

Acknowledgement

We are thankful to NVIDIA Corporation for donating the NVIDIA GeForce Titan X Pascal 12GB GPU used in this research.

References