Matching Thermal to Visible Face Images Using a Semantic-Guided Generative Adversarial Network

03/03/2019 ∙ by Cunjian Chen, et al.

Designing face recognition systems that are capable of matching face images obtained in the thermal spectrum with those obtained in the visible spectrum is a challenging problem. In this work, we propose the use of a semantic-guided generative adversarial network (SG-GAN) to automatically synthesize visible face images from their thermal counterparts. Specifically, semantic labels, extracted by a face parsing network, are used to compute a semantic loss function to regularize the adversarial network during training. These semantic cues denote high-level facial component information associated with each pixel. Further, an identity extraction network is leveraged to generate multi-scale features to compute an identity loss function. To achieve photo-realistic results, a perceptual loss function is introduced during network training to ensure that the synthesized visible face is perceptually similar to the target visible face image. We extensively evaluate the benefits of individual loss functions, and combine them effectively to learn the mapping from thermal to visible face images. Experiments involving two multispectral face datasets show that the proposed method achieves promising results in both face synthesis and cross-spectral face matching.




I Introduction

Matching thermal spectrum (THM) face images against visible spectrum (VIS) face images has received increased attention in the literature, due to its broad applications in the military, commercial, and law enforcement domains [5]. Thermal emissions from the face are less sensitive to changes in ambient lighting. Further, thermal face images can be acquired in dark environments characterized by low ambient lighting, thereby making them suitable for nighttime face recognition. However, images present in legacy face datasets are typically RGB images acquired in the visible spectrum. Therefore, cross-spectral matching of THM face images against VIS face images is of particular importance in delivering nighttime face recognition systems with a high degree of accuracy.

The appearance variation between two face samples of the same subject captured under different illumination conditions can be larger than that of two samples belonging to two different subjects [2]. Existing approaches for cross-spectral face matching can be categorized as follows: (a) photometric normalization of images in each spectral band [2, 11]; (b) projecting THM and VIS images into a common subspace [21, 8]; and (c) mapping THM images to VIS images via image synthesis [19, 18]. These approaches have been effective in reducing inter-spectral differences and have resulted in modest gains in cross-spectral face matching accuracy. However, their performance is still far from satisfactory for practical needs [5], and the synthesized face images often appear unrealistic due to the lack of sufficient facial details [2, 19].

Recent advances in generative adversarial networks (GANs) have led to success in various face-related applications, including face completion [13], frontalization [24, 7, 30], and age progression [29]. GANs are neural networks that consist of two components: a generator that learns how to synthesize some type of data (e.g., images), and a discriminator that learns to discriminate between real data and synthesized data [4]. GANs have been used to learn the mapping from input face images to target face images, such as profile face images to frontal view face images [7]. The mapping function is typically constrained by per-pixel loss functions computed between the output and target face images during training. Loss functions are used during the training phase to measure the disparity between the actual output of the neural network and the expected target output. GANs have also been used to synthesize VIS images from their THM counterparts [31, 28, 27]. These solutions integrate additional loss functions, such as identity loss [31], attribute loss [28] and shape loss [27], into the generative and discriminative components in order to further constrain the mapping function. In spite of these advances, semantic information is not explicitly considered when using these GANs to learn the mapping from THM to VIS face images. We assert that these semantic cues, pertaining to the features around the eyes, nose and mouth regions, may be beneficial for synthesizing identity-preserved face images.

In this regard, we propose a semantic-guided generative adversarial network (SG-GAN) to regularize GAN training with semantic priors in order to effectively synthesize VIS images from THM images. Specifically, the semantic priors are extracted by a face parsing network [14]. After that, these semantic priors are used to compute a semantic loss function. Similar schemes have been explored in the context of face completion [13] and deblurring [22]. However, the semantic priors have yet to be explored in the context of face recognition. Further, we show that extracting multi-scale identity features from a pre-trained face recognition network can boost the performance as well.

The framework of the proposed SG-GAN method is illustrated in Figure 1. The generator network is an encoder-decoder architecture with skip-connections, whose input is a thermal image, and the output is a synthesized visible image. The discriminator network is a CNN classifier that learns to separate the real pairs from the synthesized pairs. The real pair consists of input thermal and target visible images (ground-truth), and the synthesized pair consists of input thermal and synthesized visible images. The identity, perceptual and semantic parsing networks accept a synthesized visible image and the target visible image to compute the identity loss, perceptual loss and semantic loss, respectively. A weighted combination of different loss functions is used to optimize the entire network during training. Only the generator network is required during testing.

Fig. 1: Flowchart depicting the training of the proposed SG-GAN framework. The generator takes a thermal image as an input to produce a synthesized visible image as an output. The discriminator is trained to distinguish between two pairs of images: the real pair, consisting of an input thermal image and a target visible image, and the synthesized pair, consisting of an input thermal image and a synthesized visible image. The identity and perceptual networks accept a synthesized visible image and the target visible image to compute the identity loss and perceptual loss values, respectively. The semantic parsing network accepts both synthesized visible image and the target visible image to compute the semantic loss value.

The main contributions of this work are summarized here:

  • We use a face parsing network to extract semantic labels as priors to regularize the mapping from thermal to visible face images.

  • We use an identity network to extract multi-scale identity features.

  • We investigate different loss functions and demonstrate their individual as well as combined effectiveness.

  • We evaluate the proposed network on two different benchmark multispectral face datasets and achieve promising results.

The rest of the paper is organized as follows. Section II discusses recent advancements in synthesizing VIS face images from THM face images. Section III describes the proposed SG-GAN method used in this work, with a particular emphasis on the various loss functions. Section IV discusses the image synthesis and face matching results on two multispectral face datasets. Conclusions are drawn in Section V.

II Related Work

In this section, we briefly discuss the existing literature on synthesizing visible face images from thermal face images.

Feature-based Synthesis: Chen and Ross [2] demonstrated that a VIS image could be reconstructed from a THM image using hidden factor analysis. Riggan et al. [19] performed VIS image reconstruction using a two-stage process, consisting of feature extraction and feature regression using CNNs. However, these reconstructed results were observed to be blurry. In a subsequent work, Riggan et al. [18] used a fully convolutional neural network to learn a global mapping between THM and VIS images, as well as local mappings for the regions around the eyes, nose and mouth. The final synthesized image was a combination of global and local mapping functions, resulting in better quality output.

GAN-based Synthesis: With the advent of generative adversarial networks, image-to-image synthesis has achieved promising results. Zhang et al. [31] proposed a GAN-based method to synthesize VIS face images from THM images. A combination of identity loss and perceptual loss functions was used to optimize the proposed framework. The identity loss function was computed from the features extracted from a single layer of a fine-tuned VGG model [17]. Similarly, a closed-set face recognition loss function was proposed to regularize the discriminator network during training in [33]. Their discriminator network not only distinguished between real and synthesized samples, but also performed closed-set face recognition. The face recognition loss function was computed using a pre-trained VGG model without any fine-tuning. In addition to the identity loss function, the attribute loss function has also been used to optimize the network [28]. In [28], an attribute predictor was developed by fine-tuning the VGG-Face network using 10 annotated attributes. The experiments demonstrated that the incorporation of the attribute loss function resulted in much better performance than when using the identity loss function alone. Wang et al. [27] incorporated the shape loss function into the CycleGAN consisting of a generative network and a detector network. The shape loss value was computed as the Euclidean distance between the 68 landmarks detected on the synthesized VIS image and those detected on the target VIS image.

However, none of the aforementioned methods explicitly considers the semantic information of the face to regularize network training. The main difference between our proposed method and [28] is that we use a semantic loss function to regularize the training process, whereas the latter uses an attribute loss function to guide the learning process. Further, we demonstrate that the use of the semantic loss function is able to reduce the per-pixel loss value, calculated between the synthesized VIS face and target VIS face. Another notable difference is that SG-GAN extracts multi-scale features using an identity feature extraction network (see Figure 1).

III Proposed Method

In this work, the proposed SG-GAN is developed based on an existing generative adversarial network model, known as pix2pix [9]. Given a thermal image $x$, the objective of our method is to learn the mapping from $x$ to $y$ in a supervised setting, i.e., $G: x \rightarrow y$, where $y$ is the target visible image. For the generator, we used U-Net [9] with skip connections, which concatenates all the channels at layer $i$ with those at layer $n-i$. Here, $n$ is the total number of layers and $i$ is the specific layer number. This allows the low-level information to be shared between the encoder and the decoder layers (see Figure 1). For the discriminator, $D$, we used the PatchGAN [9] classifier that operates on image patches and classifies them as being real or synthesized. Both generator and discriminator use a sequence of convolution blocks characterized by a composite function of three different operations: convolution, Batch Normalization (BN), followed by a Rectified Linear Unit (ReLU). A dropout rate of 50% is used for $G$. To optimize these two networks, the gradient update is alternated after every step between $G$ and $D$.
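The skip-connection pattern described above, where an encoder layer is paired with its mirrored decoder layer, can be sketched as follows. This is only an illustration of the indexing: the function name and the choice of eight layers are assumptions for the example, not details taken from the paper.

```python
def skip_connection_pairs(n):
    """Return (i, n - i) pairs linking encoder layer i to decoder layer n - i.

    Layers are numbered 1..n-1 around the bottleneck; only pairs with
    i < n - i are kept, so the bottleneck itself is not connected to itself.
    """
    return [(i, n - i) for i in range(1, n) if i < n - i]


# Hypothetical 8-layer encoder-decoder (the depth is an assumption).
pairs = skip_connection_pairs(8)
print(pairs)
```

Each pair marks where the encoder's feature channels would be concatenated with the decoder's channels, which is how low-level detail bypasses the bottleneck.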

III-A Identity Extraction Network

To extract identity features from face images, we train a face recognition network based on the VGG-19 network [17] from scratch. Specifically, we utilize a newly created large-scale face dataset termed VGGFace2 [1] to train this network. MTCNN [32] is utilized to automatically detect the face and its landmarks. The detected landmarks are used to geometrically normalize the face image. After that, the input image is cropped based on a randomly generated aspect ratio and resized to the input size of the network. The VGG-19 architecture has 16 convolutional layers and 3 fully-connected layers. In this setting, we do not use batch normalization to train the identity network. Training images were randomly flipped horizontally to facilitate data augmentation. A starting learning rate of 0.001 and a batch size of 32 were used. The maximum number of epochs for training was set to 90.

To better extract the identity-specific features, intermediate features from multiple layers of VGG-19 are concatenated together. Specifically, we have considered features extracted from five layers at increasing depth. These concatenated features are used to compute the identity loss function between the synthesized VIS and target VIS face images. In the end, the final identity loss function is weighted by different coefficients, where larger weights are assigned to the features extracted at deeper layers of the network. The default weights for the individual layers are 1/32, 1/16, 1/8, 1/4, and 1, respectively. A similar approach was used to compute the perceptual loss function for high-resolution image synthesis in [26].
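The layer-wise weighting can be illustrated with a minimal sketch. The weights 1/32, 1/16, 1/8, 1/4 and 1 come from the text; the use of a mean absolute difference per layer, the flat-list representation of features, and the function names are assumptions made only for this example.

```python
# Default per-layer weights from the text, shallowest layer first.
LAYER_WEIGHTS = [1 / 32, 1 / 16, 1 / 8, 1 / 4, 1.0]


def mean_abs_diff(a, b):
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def multi_scale_feature_loss(feats_synth, feats_target, weights=LAYER_WEIGHTS):
    """Weighted sum of per-layer feature distances (deeper layers weigh more).

    feats_synth / feats_target: one flat feature list per layer, for the
    synthesized and the target image respectively.
    """
    return sum(
        w * mean_abs_diff(fs, ft)
        for w, fs, ft in zip(weights, feats_synth, feats_target)
    )


# Toy features: every layer differs by exactly 1.0 per element.
synth = [[1.0, 1.0]] * 5
target = [[0.0, 0.0]] * 5
print(multi_scale_feature_loss(synth, target))
```

With these weights, a unit discrepancy in the deepest layer contributes 32 times more than the same discrepancy in the shallowest layer.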

III-B Semantic Face Parsing Network

Existing literature on synthesizing VIS face images from THM face images does not explicitly consider the semantic information related to different facial components. In this approach, we propose to regularize the adversarial network training by introducing semantic priors as an additional loss function. (The thermal image has been converted from greyscale to RGB so that it conforms to the generator network's input.) Here, the semantic priors are calculated by a face parsing network [14], where the input is the visible face image and the output is the semantic labels that correspond to 11 different classes (see Figure 2). These classes, corresponding to different facial components, are based on the label information provided by the HELEN dataset [14]. These labeled facial components correspond to face skin, left eye, right eye, left brow, right brow, nose, inner mouth, upper lip, lower lip, hair, and background [23]. Since we are interested in salient facial components that are crucial to face recognition, class labels associated with the eyebrows, eyes, nose and mouth regions are grouped together. The resulting semantic label image is used to compute the semantic loss function. Note that the thermal and visible face images are aligned based on the manually annotated eye landmarks. The semantic labels are calculated for both the synthesized visible image and the target visible image. Figure 2 shows the semantic parsing results on visible face images. It must be noted that these parsing results can be further refined to remove outliers (e.g., part of the background is incorrectly classified as face skin in Figure 2), by fine-tuning the pre-trained model on the benchmark datasets used in this work.
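The grouping of the 11 parsing classes into a 2-class label image can be sketched as below. The class names are those listed in the text, but the integer index assigned to each class and the nested-list representation of the label map are assumptions for illustration; the actual parsing network may order its outputs differently.

```python
# Class names from the text; the index order is an assumption.
CLASS_NAMES = [
    "face skin", "left eye", "right eye", "left brow", "right brow",
    "nose", "inner mouth", "upper lip", "lower lip", "hair", "background",
]

# Salient components: eyebrows, eyes, nose and mouth regions.
SALIENT = {
    "left eye", "right eye", "left brow", "right brow",
    "nose", "inner mouth", "upper lip", "lower lip",
}
SALIENT_IDS = {i for i, name in enumerate(CLASS_NAMES) if name in SALIENT}


def to_two_class(label_map):
    """Map an 11-class label image to 1 (salient component) or 0 (other)."""
    return [[1 if label in SALIENT_IDS else 0 for label in row]
            for row in label_map]


# Toy 2x3 label image: skin, left eye, nose / hair, background, upper lip.
print(to_two_class([[0, 1, 5], [9, 10, 7]]))
```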

Fig. 2: Examples of face parsing results. The images in the top row are the visible face images, while the ones in the middle are the 11-class semantic label images. The bottom row is the result of converting the 11-class semantic label images to 2-class semantic label images. The visible face images are from the PCSO dataset [2].

III-C Loss Functions

SG-GAN uses a set of loss functions that consists of adversarial and per-pixel loss values, as well as identity, perceptual and semantic loss values. These loss functions are independently investigated and then combined in an effective manner.

III-C1 Adversarial Loss

The adversarial loss function, $\mathcal{L}_{GAN}(G, D)$, is defined as [9]:

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x, y \sim p_{data}(x, y)}\left[\log D(x, y)\right] + \mathbb{E}_{x \sim p_{data}(x)}\left[\log\left(1 - D(x, G(x))\right)\right],$$

where $G$ is the generator, $D$ is the discriminator, $x$ is the thermal image and $y$ is the target visible image. $x \sim p_{data}(x)$ indicates that $x$ is from the true data distribution and $x, y \sim p_{data}(x, y)$ indicates that both are from the true data distribution. The objective of the generator is to synthesize visually realistic visible face images from thermal face images, while the discriminator is structured to distinguish the target visible face images from the synthesized ones, conditioned on the input thermal image. This min-max game will reach an equilibrium when neither $G$ nor $D$ can further reduce their loss values [12]. In summary, $G$ attempts to minimize the objective function while $D$ attempts to maximize it. Mathematically, this could be described as:

$$G^{*} = \arg\min_{G}\max_{D}\ \mathcal{L}_{GAN}(G, D).$$
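In practice, the generator maximizes the probability of synthesized VIS samples being classified as real by the discriminator. Therefore, the above loss function can be alternatively updated as

$$\mathcal{L}_{G} = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x, G(x))\right],$$

$$\mathcal{L}_{D} = \mathbb{E}_{x, y \sim p_{data}(x, y)}\left[\log D(x, y)\right] + \mathbb{E}_{x \sim p_{data}(x)}\left[\log\left(1 - D(x, G(x))\right)\right].$$

Here, $\mathcal{L}_{G}$ is the generator loss and $\mathcal{L}_{D}$ is the discriminator loss. Maximizing $\mathcal{L}_{D}$ is the same as maximizing $\mathcal{L}_{GAN}(G, D)$. An optimal convergence of the adversarial network will result in $D(x, y) = D(x, G(x)) = 1/2$ [16]. This would indicate that the discriminator is unable to separate the synthesized pairs from the real pairs.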
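To make the equilibrium argument concrete, the sketch below evaluates the standard conditional GAN objectives on a batch of discriminator outputs. It assumes the usual log-likelihood form from the GAN literature, with `d_real` standing for the discriminator's probability on real pairs and `d_fake` for its probability on synthesized pairs; the function names are hypothetical.

```python
import math


def discriminator_objective(d_real, d_fake):
    """Batch mean of log D(real pair) + log(1 - D(synthesized pair))."""
    terms = [math.log(r) + math.log(1.0 - f) for r, f in zip(d_real, d_fake)]
    return sum(terms) / len(terms)


def generator_objective(d_fake):
    """Batch mean of log D(synthesized pair); the generator maximizes this."""
    return sum(math.log(f) for f in d_fake) / len(d_fake)


# A discriminator that cannot tell the pairs apart outputs 0.5 everywhere.
confused = [0.5, 0.5, 0.5]
print(discriminator_objective(confused, confused))  # equals -log(4)
print(generator_objective(confused))                # equals log(0.5)

# A confident, correct discriminator scores higher on its own objective.
print(discriminator_objective([0.9, 0.95, 0.9], [0.1, 0.05, 0.1]))
```

Under these definitions, a discriminator output of 0.5 on every pair gives a discriminator objective of -log 4, the value associated with the equilibrium of the min-max game.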

III-C2 Per-pixel Loss Function

The per-pixel loss function, $\mathcal{L}_{L_1}$, is computed as the mean value of the absolute element-wise difference between the images:

$$\mathcal{L}_{L_1} = \mathbb{E}_{x, y}\left[\lVert y - G(x) \rVert_{1}\right],$$

where $G(x)$ and $y$ are the synthesized and target visible images, respectively. $\mathcal{L}_{L_1}$ is used to reduce the space of all possible mapping functions between THM and VIS images such that the synthesized VIS samples look visually similar to the target VIS images. We also tried replacing $L_1$ with smooth $L_1$ or $L_2$ norms, but observed no significant improvement in matching performance. The per-pixel loss function compares the synthesized and target visible images at the pixel level.
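As a small worked example of a mean absolute per-pixel difference, the sketch below operates on images stored as nested lists of intensities. The representation and the function name are assumptions for illustration; a real implementation would operate on tensors.

```python
def per_pixel_l1(synth, target):
    """Mean absolute element-wise difference between two same-sized images."""
    flat_s = [v for row in synth for v in row]
    flat_t = [v for row in target for v in row]
    return sum(abs(s - t) for s, t in zip(flat_s, flat_t)) / len(flat_s)


# Toy 2x2 single-channel images with intensities in [0, 1].
synth = [[0.0, 1.0], [0.5, 0.5]]
target = [[0.0, 0.0], [0.5, 1.0]]
print(per_pixel_l1(synth, target))  # (0 + 1 + 0 + 0.5) / 4
```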

III-C3 Perceptual Loss Function

The perceptual loss function, $\mathcal{L}_{P}$, is defined as follows:

$$\mathcal{L}_{P} = \mathbb{E}_{x, y}\left[\lVert \phi_{P}(y) - \phi_{P}(G(x)) \rVert\right],$$

where $\phi_{P}(\cdot)$ denotes the features extracted from multiple layers of the VGG-19 network pre-trained on the ImageNet dataset. The perceptual loss function [10] is used to measure the high-level semantic difference between the synthesized visible face image and the target visible face image; this ensures that the synthesized result is smooth and perceptually similar to the target. The features are extracted at multiple layers and concatenated together to form a single feature descriptor.

III-C4 Identity Loss Function

The identity loss function, $\mathcal{L}_{I}$, is defined as follows:

$$\mathcal{L}_{I} = \mathbb{E}_{x, y}\left[\lVert \phi_{I}(y) - \phi_{I}(G(x)) \rVert\right],$$

where $\phi_{I}(\cdot)$ denotes the features extracted from multiple layers of the VGG-19 network pre-trained on a face dataset. The loss function is used to ensure that the synthesized visible face image contains identity-specific features that are similar to the ground-truth target visible image. Although the use of the per-pixel loss function $\mathcal{L}_{L_1}$, as defined earlier, can ensure visual similarity between the synthesized visible image and the real original image, it could produce blurry results since $\mathcal{L}_{L_1}$ tends to smoothen the output image. $\mathcal{L}_{L_1}$ also lacks high-level information about the identity. Hence, it is of particular importance to integrate the identity loss function to extract high-level identity-specific features.

III-C5 Semantic Loss Function

The semantic loss function, $\mathcal{L}_{S}$, is defined as follows:

$$\mathcal{L}_{S} = \mathbb{E}_{x, y}\left[\lVert \phi_{S}(y) - \phi_{S}(G(x)) \rVert\right],$$

where $\phi_{S}(\cdot)$ denotes the semantic labels extracted from the pre-trained semantic parsing network. The loss function is used to ensure that the shape and size of the synthesized visible face image are consistent with those of the ground-truth target visible image. The semantic loss value is measured as the difference of the semantic labels computed between the synthesized visible and target visible images.

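III-D Implementation Details

The loss function used in our proposed SG-GAN framework was formulated as follows:

$$\mathcal{L} = \mathcal{L}_{GAN} + \lambda_{L_1}\mathcal{L}_{L_1} + \lambda_{P}\mathcal{L}_{P} + \lambda_{I}\mathcal{L}_{I} + \lambda_{S}\mathcal{L}_{S},$$

where $\mathcal{L}_{GAN}$ is the adversarial loss consisting of both generator and discriminator losses, $\mathcal{L}_{L_1}$ is the per-pixel loss between the synthesized visible face image and the target visible face image, $\mathcal{L}_{P}$ is the perceptual loss, $\mathcal{L}_{I}$ is the identity loss and $\mathcal{L}_{S}$ is the semantic loss. Based on empirical analysis, we set default values of the weights $\lambda_{L_1}$, $\lambda_{P}$, $\lambda_{I}$, and $\lambda_{S}$ for $\mathcal{L}_{L_1}$, $\mathcal{L}_{P}$, $\mathcal{L}_{I}$, and $\mathcal{L}_{S}$, respectively. The final objective function is:

$$G^{*} = \arg\min_{G}\max_{D}\ \mathcal{L}.$$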
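The weighted combination of the adversarial, per-pixel, perceptual, identity and semantic terms can be sketched as follows. The dictionary keys, the function name and, in particular, every numeric weight and loss value below are placeholders chosen for illustration; they are not the settings used in this work.

```python
def total_generator_loss(losses, weights):
    """Adversarial term plus a weighted sum of the auxiliary loss terms."""
    return losses["gan"] + sum(weights[name] * losses[name] for name in weights)


# Placeholder loss values for one training step (illustrative only).
losses = {"gan": 0.7, "l1": 0.1, "perceptual": 0.2, "identity": 0.3, "semantic": 0.4}

# Placeholder weights (illustrative only, not the paper's defaults).
weights = {"l1": 100.0, "perceptual": 1.0, "identity": 1.0, "semantic": 1.0}

print(total_generator_loss(losses, weights))
```

Setting a weight to zero removes the corresponding term, which is one simple way to run the kind of loss ablation reported in the experiments.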


The proposed framework was implemented in PyTorch. During the training phase, batch normalization with Adam optimization was used. The default number of epochs used in our training was 200. Random cropping and image flipping operations were used for data augmentation during training. We first rescale the images, and then randomly crop them to the input size of the network. The starting learning rate for Adam optimization was 0.0002, which was fixed for the first 100 epochs and decreased by 1/100 of its starting value after each subsequent epoch. The remaining parameters use the default values described in [9].
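One way to read the learning-rate schedule is as a constant phase followed by a linear decay. The sketch below follows that reading: the starting rate of 0.0002, the 100 fixed epochs and the 200 total epochs come from the text, while the exact decay rule (reaching zero at epoch 200, with 1-indexed epochs) is an assumption.

```python
def learning_rate(epoch, base_lr=0.0002, fixed_epochs=100, decay_epochs=100):
    """Learning rate for a 1-indexed epoch: constant, then linear decay."""
    if epoch <= fixed_epochs:
        return base_lr
    steps = epoch - fixed_epochs  # number of decay steps taken so far
    return max(0.0, base_lr * (1.0 - steps / decay_epochs))


for epoch in (1, 100, 101, 150, 200):
    print(epoch, learning_rate(epoch))
```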

IV Experimental Results

To verify the effectiveness of the proposed method, experiments were conducted on the Pinellas County Sheriff’s Office (PCSO) [2] and Army Research Laboratory (ARL) [6] datasets. It must be noted that the PCSO and ARL datasets consist of face images acquired in the middle-wave infrared (MWIR) and long-wave infrared (LWIR) spectra, respectively. To evaluate face matching performance, we choose face recognition matchers that have achieved state-of-the-art results on the LFW dataset [25, 3].

IV-A Face Recognition Matchers

AM-Softmax: The AM-Softmax [25] matcher was trained on the VGGFace2 [1] dataset consisting of 7,773 subjects and 1,428,908 images. AM-Softmax was developed by introducing an additive angular margin for the Softmax loss after performing both feature and weight normalization. We used stochastic gradient descent (SGD) with momentum to update the weights. The momentum was set to 0.9 and the base learning rate was initialized to 0.1. The batch size was 256 and the maximum number of iterations was 30,000. We used the “step” learning rate policy, which drops the learning rate by a factor of 0.1 after 16,000, 24,000 and 28,000 iterations. This results in a trained model of size 106MB. After that, we extracted a 1024-dimensional feature vector from the penultimate fully connected layer to represent the input image. The match score was computed using the cosine similarity metric. We evaluated the trained model on the LFW benchmark dataset and achieved a classification accuracy of 99.35%, thereby suggesting the efficacy of this method for visible spectrum face recognition.
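The cosine similarity match score between two feature vectors can be written in a few lines. The function name is hypothetical and the 3-dimensional vectors are toy stand-ins for the 1024-dimensional features described above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


probe = [0.2, 0.5, 0.1]      # features of a synthesized visible image (toy)
gallery = [0.4, 1.0, 0.2]    # features of a gallery visible image (toy)
print(cosine_similarity(probe, gallery))
```

Because the score depends only on the angle between the vectors, rescaling either feature vector leaves the match score unchanged.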

MobileFaceNet: MobileFaceNet [3], which is reminiscent of MobileNetV2 [20], uses a global depth-wise convolution layer to replace the global average pooling layer. It significantly reduces the model size while maintaining comparable recognition accuracies on the LFW and MegaFace datasets [3]. We employed the MobileFaceNet architecture and trained it using the angular softmax (A-Softmax) loss function [15] on the VGGFace2 dataset. The momentum was set to 0.9 and the base learning rate was initialized to 0.1. The batch size was 128 and the maximum number of iterations was set to 60,000. We used the “step” learning rate policy, which drops the learning rate by a factor of 0.1 after 36,000, 50,000 and 58,000 iterations. This resulted in a trained model size of 8MB. After that, we extracted a 256-dimensional feature vector from the penultimate fully connected layer and used the cosine similarity metric to compare feature vectors. We tested the model on the LFW benchmark dataset and achieved a classification accuracy of 99.30%.
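The “step” learning rate policy used for both matchers can be sketched as a function of the iteration count. The base rate, the factor of 0.1 and the milestones are taken from the text; treating a milestone as already reached when the iteration count equals it is an assumption, as is the function name.

```python
def step_lr(iteration, base_lr=0.1, milestones=(16000, 24000, 28000), gamma=0.1):
    """Multiply the base rate by gamma once for every milestone reached."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** drops


# AM-Softmax schedule from the text.
for it in (0, 16000, 24000, 28000):
    print(it, step_lr(it))

# MobileFaceNet schedule from the text uses different milestones.
print(step_lr(50000, milestones=(36000, 50000, 58000)))
```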

The AM-Softmax and MobileFaceNet matchers deliver strong baselines. Considering that the VGGFace network is employed in this work to extract identity-specific features during SG-GAN training, we use these other matchers to demonstrate the generalization capability of the SG-GAN in synthesizing VIS from THM face images. In this setting, the synthesized visible image is matched against the visible image.

IV-B Evaluation on PCSO Dataset

The PCSO dataset contains data from 1,004 subjects, where 1,003 of the subjects have two visible face images and one thermal face image each. Based on manually localized eye coordinates, the face images were aligned and cropped. Following the evaluation benchmark given in [2], the first 667 subjects were used for training and the rest were used for testing. The training and test subsets consist of 1,333 and 337 THM-VIS pairs, respectively. There is no overlap between the training and test subjects. Examples of thermal, synthesized visible and ground-truth visible face images are shown in Figure 3. The proposed SG-GAN method appears to have produced photo-realistic results. Discriminative features pertaining to the eye, nose and mouth regions are better preserved. This suggests that it is beneficial to embed semantic priors when training the GAN. In this regard, we have considered the semantic class labels related to eyebrows, eyes, nose and mouth regions. Further, the use of the identity extraction network allows identity-specific features to be similar to those of the targets. The corresponding face matching experiments, reported with Area Under the Curve (AUC) and Equal Error Rate (EER), are summarized in Table I.
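For readers unfamiliar with the two metrics, the sketch below computes AUC and an approximate EER from lists of genuine and impostor match scores. It is a simple threshold sweep under the assumption that higher scores mean better matches; the function names and toy scores are hypothetical, and the EER is taken at the sampled operating point where the false accept and false reject rates are closest, without interpolation.

```python
def roc_points(genuine, impostor):
    """(FPR, TPR) points obtained by sweeping a threshold over all scores."""
    thresholds = sorted(set(genuine) | set(impostor), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in genuine) / len(genuine)
        fpr = sum(s >= t for s in impostor) / len(impostor)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def eer(points):
    """Error rate at the point where FPR and FNR (1 - TPR) are closest."""
    fpr, tpr = min(points, key=lambda p: abs(p[0] - (1 - p[1])))
    return (fpr + (1 - tpr)) / 2


# Toy scores: genuine pairs should score higher than impostor pairs.
pts = roc_points(genuine=[0.9, 0.4], impostor=[0.6, 0.1])
print(auc(pts), eer(pts))
```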

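Method | AM-Softmax AUC (%) | AM-Softmax EER (%) | MobileFaceNet AUC (%) | MobileFaceNet EER (%)
Direct Matching | 69.98 | 34.65 | 69.18 | 35.52
$\mathcal{L}_{GAN}+\mathcal{L}_{L_1}$ | 86.73 | 21.36 | 87.59 | 20.77
$\mathcal{L}_{GAN}+\mathcal{L}_{L_1}+\mathcal{L}_{P}$ | 87.53 | 19.82 | 88.47 | 19.09
$\mathcal{L}_{GAN}+\mathcal{L}_{L_1}+\mathcal{L}_{P}+\mathcal{L}_{I}$ | 89.76 | 18.40 | 91.14 | 17.51
Proposed (SG-GAN) | 90.12 | 15.98 | 92.16 | 15.01
TABLE I: Evaluation of face matching using the AM-Softmax and MobileFaceNet matchers on the PCSO dataset. The impact of different loss functions can be seen here, where $\mathcal{L}_{GAN}$, $\mathcal{L}_{L_1}$, $\mathcal{L}_{P}$ and $\mathcal{L}_{I}$ denote the adversarial, per-pixel, perceptual and identity losses. A higher AUC is better while a lower EER is better.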
Fig. 3: Synthesizing VIS face images from THM images on the PCSO dataset. Compared to the use of the “” loss function, the output of SG-GAN is semantically closer to the ground-truth VIS image, especially around the salient facial regions.

Ablation Study: To demonstrate the effectiveness of different loss functions, an ablation study was conducted using the PCSO dataset. As can be seen in Table I and Figure 4, the perceptual loss function contributes slightly to the improvement of face matching performance. The addition of the identity loss function $\mathcal{L}_{I}$ results in a pronounced difference in the AUC value. Here, “Direct Matching” refers to the setting where (a) deep features are directly extracted from THM and VIS face images and (b) the extracted features are compared to produce a match score. “$\mathcal{L}_{GAN}+\mathcal{L}_{L_1}$” (adversarial plus per-pixel loss) denotes the performance of the original pix2pix model [9]. “$\mathcal{L}_{GAN}+\mathcal{L}_{L_1}+\mathcal{L}_{P}$” refers to the performance when the perceptual loss function is added. Similarly, “$\mathcal{L}_{GAN}+\mathcal{L}_{L_1}+\mathcal{L}_{P}+\mathcal{L}_{I}$” is used to denote the matching performance when both the perceptual loss $\mathcal{L}_{P}$ and the identity loss $\mathcal{L}_{I}$ are added. The proposed SG-GAN method builds on “$\mathcal{L}_{GAN}+\mathcal{L}_{L_1}+\mathcal{L}_{P}+\mathcal{L}_{I}$”, further regularized by a semantic loss function that is computed from the semantic labels extracted via the face parsing network.

Fig. 4: An ablation study of different loss functions on the PCSO dataset using different face matchers: (a) AM-Softmax; (b) MobileFaceNet. Our proposed solution, SG-GAN, achieves much better performance than . The figure is best viewed in color.

Fig. 5: Visualizing the convergence of different loss functions with and without the use of semantic regularization when training using the (a) PCSO and (b) ARL datasets. The top row shows the loss and the bottom row shows the loss.

In addition to face matching experiments, we also computed and visualized the output images due to the use of and loss values on the same dataset. Loss values computed from individual batches were averaged during each training epoch. These two loss functions are vital to learning identity-specific features; hence, their convergence reflects how well the GAN model is trained (see Figure 5(a)).

The proposed method produces noticeably more photo-realistic results when synthesizing VIS from THM face images. This is of particular importance when cross-spectral face recognition systems deployed in practice require human intervention. As can be seen from Figure 4, the effectiveness of the proposed SG-GAN method and the loss functions used is consistently observed across different face recognition matchers. This suggests that the proposed SG-GAN method is generalizable across different CNN-based face matchers. For the evaluations below, the AM-Softmax matcher is adopted.

IV-C Evaluation on ARL Dataset

The ARL dataset consists of thermal and visible face images captured from 60 subjects [6]. According to the protocol used in [19], images of 60 subjects are used for the experiment. The interocular distance in these images is approximately 87 pixels. Thirty subjects were randomly chosen for training and the remaining 30 were used for test and evaluation. This results in a total of 480 THM-VIS image pairs in the training and test sets. Face images in this dataset are center-cropped by retaining a scale factor of 0.8 in both vertical and horizontal directions, in order to remove unnecessary background information. Examples of images from the ARL dataset can be seen in Figure 6. In this evaluation, we adopt 2-class semantic labels to compute the semantic loss function, resulting in better face matching performance. This can be attributed to the more robust semantic parsing formulation derived from the pre-trained face parsing network. We set in this setting.
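The center-cropping step with a scale factor of 0.8 can be sketched as a computation of the crop box. The image dimensions in the example are placeholders rather than the dataset's actual sizes, and the function name and the rounding choice are assumptions.

```python
def center_crop_box(width, height, scale=0.8):
    """Return (left, top, right, bottom) of a centered crop keeping `scale`
    of the image extent in both the horizontal and vertical directions."""
    new_w = int(round(width * scale))
    new_h = int(round(height * scale))
    left = (width - new_w) // 2
    top = (height - new_h) // 2
    return left, top, left + new_w, top + new_h


# Placeholder image size (not the ARL image size).
print(center_crop_box(100, 200))
```

The resulting box follows the (left, top, right, bottom) convention used by common image libraries, so it could be passed to a crop routine directly.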

Fig. 6: Generating VIS images from THM images on the ARL dataset using the SG-GAN method.

We also compare the proposed method against state-of-the-art CNN-based [19, 18] and GAN-based [31, 28] synthesis methods that were previously evaluated on the ARL dataset. As seen from Table II, AP-GAN [28] using ground-truth attributes, rather than the estimated attributes, obtains 86.08% AUC and 23.13% EER, while our proposed SG-GAN method achieves 92.51% AUC and 15.25% EER. AP-GAN [28] incorporates the attribute loss function when training its GAN; thus, the focus is on preserving attribute information. On the other hand, our proposed SG-GAN regularizes the training using semantic information; thus, the focus is on preserving facial information around significant components. An ablation study with respect to different combinations of loss functions is also conducted on this dataset (see Figure 7). This further validates the effectiveness of our SG-GAN with semantic regularization. As seen from Figure 5(b), the use of such a regularization scheme ensures that the per-pixel loss converges at a faster speed. Though the convergence speed of the other loss shown in Figure 5(b) is not impacted, the per-pixel loss is typically considered to be more significant for generating visually similar results since it operates at the pixel level.

Algorithm | AUC (%) | EER (%)
Feature-based Synthesis [19] | 68.52 | 34.36
Multi-Region Synthesis [18] | 82.49 | 26.25
GAN-VFS [31] | 79.30 | 27.34
AP-GAN [28] | 84.16 | 23.90
AP-GAN + GT [28] | 86.08 | 23.13
Proposed SG-GAN | 92.51 | 15.25
SG-GAN + Fine-tune | 93.08 | 14.24
TABLE II: Comparison of the proposed SG-GAN method against other state-of-the-art synthesis-based approaches previously reported on the ARL dataset. The AM-Softmax face matcher is used here.

Fig. 7: An ablation study of different loss functions on the ARL dataset. Our proposed solution SG-GAN achieves much better performance than . The face recognition matcher used here is AM-Softmax. The figure is best viewed in color.

The number of subjects in the ARL dataset is not as large as that of the PCSO dataset. This could potentially limit the capacity of the generator network to effectively learn features that are beneficial for thermal-to-visible face synthesis. To address this issue, we take the SG-GAN model trained on the PCSO dataset and fine-tune it using the training partition of the ARL dataset. The trained SG-GAN model consists of two separate models: $G$ and $D$. As described earlier, $G$ is responsible for generating photo-realistic VIS images and $D$ is used to determine whether the generated sample is real or synthesized. Two different fine-tuning strategies were tested. The first strategy performs fine-tuning on both $G$ and $D$ while the second strategy performs fine-tuning on $G$ only. Based on our experimental analysis, the latter is observed to result in better face matching accuracy. In the latter case, although we only use model $G$ for fine-tuning, model $D$ will continue to be trained based on the outputs from $G$. As can be seen from Table II and Figure 7, the fine-tuning approach obtains better results. This offers a practical solution to deal with image synthesis using limited samples. Note that the fine-tuning strategy adopted in this work is different from the one used in image classification, where specific layers are fine-tuned.

V Conclusions

This paper proposes a novel GAN-based synthesis method for matching thermal face images against visible spectrum images. The proposed SG-GAN method utilizes semantic labels extracted by a face parsing network to compute the semantic loss function that regularizes network training, thereby improving both the quality of the synthesized face images and the accuracy of cross-spectral face matching. Experiments on two different datasets indicate that the proposed method is effective in synthesizing VIS images from THM images, which in turn improves cross-spectral face matching accuracy. Future work will involve conducting experiments on a large-scale dataset to further investigate the role of different loss functions in the proposed method.

References


  • [1] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In Proc. of FG, 2018.
  • [2] C. Chen and A. Ross. Matching thermal to visible face images using hidden factor analysis in a cascaded subspace learning framework. Pattern Recognit. Lett., 72:25–32, 2016.
  • [3] S. Chen, Y. Liu, X. Gao, and Z. Han. Mobilefacenets: Efficient CNNs for accurate real-time face verification on mobile devices. In Proc. of CCBR, pages 428–438, 2018.
  • [4] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc. of NIPS, pages 2672–2680, 2014.
  • [5] S. Hu, N. Short, B. S. Riggan, M. Chasse, and M. S. Sarfraz. Heterogeneous face recognition: Recent advances in infrared-to-visible matching. In Proc. of FG, pages 883–890, 2017.
  • [6] S. Hu, N. J. Short, B. S. Riggan, C. Gordon, K. P. Gurton, M. Thielke, P. Gurram, and A. L. Chan. A polarimetric thermal database for face recognition research. In Proc. of CVPR Workshops, pages 187–194, 2016.
  • [7] R. Huang, S. Zhang, T. Li, and R. He. Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis. In Proc. of ICCV, 2017.
  • [8] S. Iranmanesh, A. Dabouei, H. Kazemi, and N. Nasrabadi. Deep cross polarimetric thermal-to-visible face recognition. CoRR, 2018.
  • [9] P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-Image translation with conditional adversarial networks. In Proc. of CVPR, pages 5967–5976, 2017.
  • [10] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proc. of ECCV, pages 694–711, 2016.
  • [11] B. F. Klare and A. K. Jain. Heterogeneous face recognition using kernel prototype similarities. IEEE Trans. on PAMI, 35(6):1410–1422, 2013.
  • [12] K. Kurach, M. Lucic, X. Zhai, M. Michalski, and S. Gelly. The GAN landscape: Losses, architectures, regularization, and normalization. CoRR, 2018.
  • [13] Y. Li, S. Liu, J. Yang, and M. Yang. Generative face completion. In Proc. of CVPR, pages 5892–5900, 2017.
  • [14] S. Liu, J. Yang, C. Huang, and M. Yang. Multi-objective convolutional learning for face labeling. In Proc. of CVPR, pages 3451–3459, 2015.
  • [15] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In Proc. of CVPR, pages 6738–6746, 2017.
  • [16] Y. Luo, Z. Zheng, L. Zheng, T. Guan, J. Yu, and Y. Yang. Macro-Micro adversarial network for human parsing. In Proc. of ECCV, 2018.
  • [17] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In Proc. of BMVC, 2015.
  • [18] B. S. Riggan, N. J. Short, and S. Hu. Thermal to visible synthesis of face images using multiple regions. In Proc. of WACV, pages 30–38, 2018.
  • [19] B. S. Riggan, N. J. Short, S. Hu, and H. Kwon. Estimation of visible spectrum faces from polarimetric thermal faces. In Proc. of BTAS, pages 1–7, 2016.
  • [20] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proc. of CVPR, 2018.
  • [21] M. S. Sarfraz and R. Stiefelhagen. Deep perceptual mapping for cross-modal face recognition. International Journal of Computer Vision, 122(3):426–438, 2017.
  • [22] Z. Shen, W.-S. Lai, T. Xu, J. Kautz, and M.-H. Yang. Deep semantic face deblurring. In Proc. of CVPR, 2018.
  • [23] B. M. Smith, L. Zhang, J. Brandt, Z. Lin, and J. Yang. Exemplar-based face parsing. In Proc. of CVPR, pages 3484–3491, 2013.
  • [24] L. Tran, X. Yin, and X. Liu. Disentangled representation learning GAN for pose-invariant face recognition. In Proc. of CVPR, 2017.
  • [25] F. Wang, J. Cheng, W. Liu, and H. Liu. Additive margin softmax for face verification. IEEE Signal Process. Lett., 25(7):926–930, 2018.
  • [26] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. CoRR, 2017.
  • [27] Z. Wang, Z. Chen, and F. Wu. Thermal to visible facial image translation using generative adversarial networks. IEEE Signal Process. Lett., 25(8):1161–1165, 2018.
  • [28] D. Xing, H. Zhang, and V. Patel. Polarimetric thermal to visible face verification via attribute preserved synthesis. In Proc. of BTAS, 2018.
  • [29] H. Yang, D. Huang, Y. Wang, and A. K. Jain. Learning face age progression: A pyramid architecture of GANs. In Proc. of CVPR, 2018.
  • [30] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker. Towards large-pose face frontalization in the wild. In Proc. of ICCV, 2017.
  • [31] H. Zhang, V. M. Patel, B. S. Riggan, and S. Hu. Generative adversarial network-based synthesis of visible faces from polarimetric thermal faces. In Proc. of IJCB, pages 100–107, 2017.
  • [32] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett., 23(10):1499–1503, 2016.
  • [33] T. Zhang, A. Wiliem, S. Yang, and B. Lovell. TV-GAN: Generative adversarial network based thermal to visible face recognition. In Proc. of ICB, pages 174–181, 2018.