A Large-Scale, Time-Synchronized Visible and Thermal Face Dataset

by Domenick Poster, et al.
West Virginia University

Thermal face imagery, which captures the naturally emitted heat from the face, is limited in availability compared to face imagery in the visible spectrum. To help address this scarcity of thermal face imagery for research and algorithm development, we present the DEVCOM Army Research Laboratory Visible-Thermal Face Dataset (ARL-VTF). With over 500,000 images from 395 subjects, the ARL-VTF dataset represents, to the best of our knowledge, the largest collection of paired visible and thermal face images to date. The data was captured using a modern long wave infrared (LWIR) camera mounted alongside a stereo setup of three visible spectrum cameras. Variability in expressions, pose, and eyewear has been systematically recorded. The dataset has been curated with extensive annotations, metadata, and standardized protocols for evaluation. Furthermore, this paper presents extensive benchmark results and analysis on thermal face landmark detection and thermal-to-visible face verification by evaluating state-of-the-art models on the ARL-VTF dataset.




1 Introduction

Figure 1: A set of images from the RGB (left), stereo monochrome (middle), and LWIR (right) cameras from the baseline (top), expression (middle), and off-pose (bottom) sequences.

The use of thermal imaging has grown steadily over the past several decades, aided by improvements in sensor technology as well as reductions in cost. Thermal infrared sensors capture heat emissions, such as those radiated by the human body, in the 3–5 μm medium wave infrared (MWIR) band and the 8–14 μm longwave infrared (LWIR) band. Thermal imaging of faces has applications in the military and law enforcement for face recognition in low-light and nighttime environments [15][19][33][7] and in healthcare [11][28][35], all of which require robust recognition models in challenging unconstrained operational conditions. However, the majority of MWIR and LWIR face datasets available at the time of this paper's writing consist of lower resolution images from older thermal sensors.

While good rank-1 face recognition rates (around 90%) have been reported using 64×64 cropped face images captured by these older thermal cameras [24], there is still a large gap in meeting the aforementioned requirements for military, law enforcement, and healthcare applications. To help address these requirements, face datasets containing high resolution thermal imagery under varied conditions, such as variable pose, expression, occlusion, and resolution, are needed. Furthermore, it is oftentimes desirable to synchronize and co-register the data collected across multiple sensors to support the development of fusion, domain adaptation, and cross-modal image synthesis approaches.

To this end, we present the Army Research Laboratory Visible-Thermal Face (ARL-VTF) dataset. This dataset is, to the best of our knowledge, the largest thermal face dataset publicly available for scientific research to date. The main contributions of the ARL-VTF dataset are:

  • A multi-modal, time synchronized acquisition of 395 subjects and over 500,000 face images captured using multiple visible cameras for stereo 3D vision and one LWIR sensor (sample images shown in Figure 1).

  • Three image sequences capturing baseline, expression, and pose conditions for each subject. A fourth condition, eye glasses, is captured if a subject wears glasses.

  • Annotations for head pose, eyewear, face bounding box, and 6 face landmarks locations.

  • Standardized protocols for model training and evaluation.

Results and analysis on the tasks of thermal face landmark detection and thermal-to-visible face verification using state-of-the-art deep learning models are presented as a benchmark.

2 Literature Review

Dataset Modalities Subjects Variability IR Resolution Range (m)
UND [5] LWIR, RGB I,E,T Unspecified
IRIS [1] LWIR, RGB P,I,E Unspecified
Terravic [23] LWIR P,G Unspecified
UH [3] MWIR P,E Unspecified
NVIE [34] LWIR, Mono I,E,G
Carl [9] N/LWIR, RGB I,E,T
Eurocom [22] LWIR, RGB P,I,E,G,O
Tufts [26] N/LWIR, RGB P,E
Table 1: Summary statistics of datasets containing MWIR or LWIR face data, ordered (approximately) from least to most recent. Whether controlled or uncontrolled, the presence of the following variable conditions is noted: (P)ose, (I)llumination, (E)xpression, (T)ime-lapse, (G)lasses, and (O)cclusion. Image resolution is written as (w×h).

In this section we provide a thorough comparison of several publicly released MWIR or LWIR face datasets and briefly highlight some notable characteristics of each. Table 1 presents a high-level comparison of the key statistics of different datasets, including the ARL Visible-Thermal Face Dataset (ARL-VTF) presented in this paper.

Collected primarily in 2002 with visible and LWIR cameras, the University of Notre Dame (UND) [5] dataset remains as one of the largest datasets in terms of unique identities (with 241 subjects), but has only four images per subject, and used what is now considered a very low resolution and low sensitivity uncooled microbolometer.

The IRIS [1] dataset has simultaneous recordings of 30 subjects with variable poses and expressions in both LWIR and visible; however, no annotations are included besides the subject ID. The IRIS-M3 [4] dataset, in contrast, contains 88 subjects simultaneously captured under a variety of indoor and outdoor lighting conditions with not only LWIR and visible cameras but also a multi-spectral imaging module.

Two different datasets have both been referred to as the University of Houston (UH) dataset. The more recent version [3] contains 7,590 MWIR images from 138 subjects. A slightly older version [18] contains 88 subjects and simultaneous acquisition of visible, thermal, and range data for 3D model generation. The thermal IR camera is not specified, though it is presumably the same as in [3].

The Natural Visible and Infrared Expression Database (NVIE) [34] captures subjects displaying a wide range of emotions. Sequences of unposed expressions were elicited by having subjects, some of whom wore glasses, observe video clips. Sequences of posed expressions were also captured with all subjects with and without glasses. The LWIR and grayscale visible image streams were simultaneously recorded and manually time-synchronized. While 238 subjects participated in the collection, [34] notes that there is only data for 105-112 subjects in the majority of scenarios.

Similar to [34], the KTFE dataset [25] elicited natural displays of emotion from 26 subjects through the use of video clips. Instrumental music was used between sequences to promote a neutral emotional state. Subjects were allowed to wear glasses during the collection. The data was captured simultaneously with an InfRec R300 camera.

The Carl dataset [9] contains time-lapse data of 41 subjects captured in four separate sessions spaced two days apart in which subjects were allowed uncontrolled natural variations in their expressions. The data was simultaneously recorded using a combined visible/LWIR camera and a separate NIR camera.

The Université Laval Face Motion and Time Lapse (ULFMT) database [12] contains 238 subjects recorded in multiple sequences under variable conditions, including significant time-lapse on the order of two to four years. Although the data was collected from the near, short, medium and long wave infrared bands, only the MWIR data has been released to date.

The ARL Multi-Modal Face Database (MMFD) dataset is composed of two separate collections, first presented in [14] and then extended in [40], both with simultaneously acquired visible, LWIR, and Polarimetric LWIR data. It has a combined total of 111 subjects. Unique to this dataset is the variable distances at which subjects are captured.

The Eurocom dataset[22], with 50 subjects captured using a combined visible/LWIR camera, notably contains a wide variety of acquisition scenarios, including sequences during which the eye and mouth regions are occluded by the subject’s hand.

The RWTH-Aachen [20] dataset contains high resolution LWIR images of 94 subjects. Each subject is captured with variable expressions and head poses (both pitch and yaw), in controlled and uncontrolled sequences. The dataset is well annotated for emotions, discrete facial actions, and face landmarks. It cannot be used on its own for thermal-to-visible face recognition due to an absence of visible data, however it can still be employed to develop thermal landmark detection algorithms.

The Tufts Face Database [26] is a multi-modal dataset with several image acquisition devices and scenarios. The scenarios involve the simultaneous capture of visible and LWIR frontal images as well as visible, NIR, LWIR images acquired with a mobile, multi-camera sensor platform being rotated in front of the subject in an arc. In both scenarios, subjects were asked to pose with a variety of expressions and also sunglasses. Also included in the dataset are images from a 3D light-field camera, 3D point cloud reconstructed facial images, and computer-generated face sketches. The dataset contains 100 subjects.

Compared to the ARL-VTF dataset with 395 subjects, the next largest high-resolution thermal face dataset, ULFMT, contains 238 subjects and features MWIR and RGB video recordings under a comprehensive set of variable conditions but lacks synchronized data. For the RWTH dataset, although it utilized a higher resolution thermal camera and provides annotations for variable expressions, it contains no visible imagery counterpart. In contrast, ARL-VTF’s synchronized acquisition and stereo arrangement supports algorithm development for 3D model learning [6], multi-modal fusion [18], domain adaptation [30], and cross-domain image synthesis [13]. Three such synthesis approaches [8][16][39] for thermal-to-visible face verification are showcased in Section 4.2.

In summary, the ARL-VTF dataset is the only dataset which has all of the following characteristics: a) time-synchronized visible and thermal imagery, b) data collected using a current commercially available uncooled LWIR camera, c) variable expression, pose, and eyewear, d) facial landmark annotations, and e) the largest number of subjects and images to-date.

3 Database Collection

The data collection occurred over the course of 9 days in November 2019. The released dataset contains 395 subjects, each of whom completed an Institutional Review Board (IRB) approved consent form prior to image acquisition. The subjects were seated in front of a thermally neutral background 2.1 meters from the sensor array with their heads at approximately the same height as the sensors. Illumination was provided by the standard fixed overhead room lighting. The collection area setup is pictured in Figure 2.

Figure 2: The collection area showing the sensor array as it collects the baseline (frontal) image sequence.

Subjects’ faces were recorded for approximately 10 seconds under each of the following conditions:

  1. A baseline sequence of frontal images with the subject maintaining a neutral expression. If subjects were wearing glasses, they were asked to remove them.

  2. An expression sequence of frontal images of the subject counting out loud incrementally starting from one.

  3. A pose sequence of images where subjects were asked to slowly turn their heads from left to right. However, a small number of subjects rotated their entire bodies from left to right using the swiveling chair.

  4. If subjects naturally wear glasses (removed for sequences 1-3), they were asked to put them back on for an additional sequence of baseline images.

Camera Modality Resolution (w×h) IPD
FLIR Grasshopper3 {1, 4} Mono visible
FLIR Boson {2} LWIR
Basler Scout {3} RGB color
Table 2: Visible and LWIR camera information. The {} enumeration corresponds to the camera labeling in Figure 3. The Mean (M) and Standard Deviation (SD) of inter-pupil distances (IPDs) are calculated using the baseline image sequence.
Sensors: This dataset was collected with an array of three visible cameras and one LWIR thermal sensor. The visible imagery was recorded using two monochrome FLIR Grasshopper3 CMOS cameras and one RGB Basler Scout CCD camera. The LWIR data was captured by a FLIR Boson uncooled VOx microbolometer with a spectral band in the LWIR range and a thermal sensitivity of 50 mK. Table 2 lists the camera specifications. The sensors were mounted onto a single optical plate as shown in Figure 3. Data from a fifth sensor (a LWIR polarimeter) is omitted from this dataset as it was not time-synchronized with the other cameras.

Figure 3: Sensor array with two FLIR Grasshopper3 cameras {1, 4}, the FLIR Boson LWIR sensor {2}, and the Basler Scout camera {3}. Polarimetric LWIR sensor {5} data not included.

Sensor Calibration and Synchronization: Sensor calibrations were conducted each day of the data collection to enable post-processing for 2D image registration and 3D geometric calibration of the multiple visible and infrared sensors. A checkerboard pattern with 20mm squares was mounted in front of a black body source, which provides contrast for both visible and thermal images. For the thermal camera, a custom designed thermal/visible pattern using 20mm square holes with 10mm spacing was used. The visible and thermal sensor checkerboard calibration patterns are presented in Figure 4. To facilitate the development of 3D-based algorithms, the intrinsic and extrinsic camera parameters are provided with this dataset.

(a) Visible Pattern
(b) Thermal Pattern
Figure 4: Calibration patterns for the visible and thermal sensors.

Using custom software to interface with each camera vendor's respective SDK, the images were captured in a time-synchronized fashion via multithreaded software triggers at 15 frames per second, a rate set by data transfer bandwidth limitations.
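The trigger scheme can be sketched as follows. This is a hypothetical illustration, not the actual collection software: `capture_loop` and `camera_grab` are assumed names, and a shared barrier stands in for the per-SDK software triggers, releasing one frame grab per camera per trigger tick.

```python
import threading
import time

# Hypothetical sketch of multithreaded, software-triggered capture:
# one worker thread per camera blocks on a shared barrier, and the
# barrier's action paces the trigger rate (e.g. 15 frames per second).

def capture_loop(camera_grab, n_cameras, n_frames, fps=15):
    barrier = threading.Barrier(n_cameras, action=lambda: time.sleep(1.0 / fps))
    frames = [[] for _ in range(n_cameras)]

    def worker(idx):
        for t in range(n_frames):
            barrier.wait()  # all cameras released together: one software trigger
            frames[idx].append(camera_grab(idx, t))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_cameras)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return frames
```

Because every worker waits on the same barrier before each grab, no camera can run ahead of the others, which mirrors the time-synchronized acquisition described above.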

3.1 Dataset Details and Usage

In total, the dataset contains 395 subjects and 549,712 images. To provide a sense of face resolution, the average inter-pupil distances (IPDs) of frontal baseline images are tabulated in Table 2. IPDs are calculated as the pixel distance between the left and right eye centers. To facilitate reproducibility and evaluation, the dataset is divided into subject-disjoint development (training and validation) and test sets, with 295 subjects in the development set and the remaining 100 subjects in the test set. The subjects within the development set are sub-divided into training and validation sets using a 5-fold cross-validation scheme for hyper-parameter tuning and model selection. Of the 395 total subjects, 60 subjects were recorded both with and without glasses. These subjects have been evenly divided between the development and test sets, and proportionally divided between the training and validation sets (24 for training and 6 for validation).
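As a rough sketch of such a partition (the function name and seeding are assumptions; the dataset's actual subject assignment is fixed by the released protocols, not re-randomized by the user), a subject-disjoint split with cross-validation folds could look like:

```python
import random

# Hypothetical sketch: subject-disjoint development (295) and test (100)
# sets, with the development subjects dealt round-robin into 5 folds.

def split_subjects(subject_ids, n_test=100, n_folds=5, seed=0):
    ids = sorted(subject_ids)
    random.Random(seed).shuffle(ids)          # deterministic shuffle
    test, dev = ids[:n_test], ids[n_test:]
    folds = [dev[i::n_folds] for i in range(n_folds)]  # round-robin folds
    return dev, test, folds
```

The key property is that no subject identity appears in both the development and test sets, so verification performance is measured on unseen identities.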

3.1.1 Thermal-to-Visible Face Verification Protocols

We use the following grammar to describe the type of images in each gallery and probe set. In order to facilitate detailed analysis, the temporally-disjoint sets of gallery and probe images are defined in terms of a sequence category and an eyewear category. Gallery and probe protocols are designated "G" and "P", respectively. "V" and "T" refer to the visible and thermal spectrum data. The sequence categories "B", "E", and "P" signify the baseline, expression, and pose sequences, respectively. The "*" symbol represents any or all sequence categories. For the purposes of the evaluation protocol, B also includes the glasses image sequence. There are three eyewear categories which describe whether a subject possesses glasses and whether the glasses are being worn in the image. Images of subjects who do not possess glasses use the tag 0, whereas subjects who have their glasses removed or worn are notated - and +, respectively. The eyewear category is omitted when no filtering has been done on the basis of eyewear. In extended Backus–Naur form, the rules for producing descriptive protocol labels are:

set      = "G" | "P" ;
modality = "V" | "T" ;
sequence = "B" | "E" | "P" | "*" ;
eyewear  = "0" | "-" | "+" ;
protocol = set, "_", modality, sequence, { eyewear } ;
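For illustration, a small parser for these labels might look like the following. This is a hypothetical helper, not part of the dataset's tooling; it simply splits a label such as "G_VB0-" into the grammar's components.

```python
import re

# Matches set, modality, sequence, and zero or more eyewear tags,
# per the protocol-label grammar described above.
PROTOCOL_RE = re.compile(
    r"^(?P<set>[GP])_(?P<modality>[VT])"
    r"(?P<sequence>[BEP*])(?P<eyewear>[0+-]*)$"
)

def parse_protocol(label):
    """Split a protocol label (e.g. 'G_VB0-') into its components."""
    m = PROTOCOL_RE.match(label)
    if m is None:
        raise ValueError(f"not a valid protocol label: {label!r}")
    parts = m.groupdict()
    parts["eyewear"] = list(parts["eyewear"])  # "0-" -> ["0", "-"]
    return parts
```

A label can carry several eyewear tags because a set may pool multiple eyewear categories (e.g. G_VB0- contains both no-glasses and glasses-removed images).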

Specific protocols have been developed for the evaluation of thermal-to-visible face verification algorithms. As the collection process yielded a different number of images for each subject, the test data has been selectively sampled to provide an equal number of images per subject and sequence. Additionally, specific images have been further designated as either probe or gallery images in order to standardize evaluation. Gallery images are composed solely of baseline images from the visible cameras. Probes are thermal images from all three sequences. Two distinct galleries are specified: 1) G_VB0- in which no subjects are wearing glasses, and 2) G_VB0+ wherein glasses are worn by the subjects who have them.

The gallery and probe sets were constructed as follows. Seven evenly-spaced timestamps were selected from each subject’s baseline sequence, starting from the first timestamp and ending with the last. The images from each of the three visible cameras corresponding to the first and last timestamp in the sequence are placed into G_VB0-. The images from the LWIR camera corresponding to the remaining five timestamps are designated as probes (P_TB0-). If a glasses sequence was recorded for that subject, then this process is repeated for the images in that sequence, with the resulting images becoming associated with the G_VB0+ and P_TB+ protocols. Next, 25 timestamps for the expression sequence are selected, spaced evenly to cover the span of the sequence. The images corresponding to those timestamps from all four cameras are added to the subject’s set of probe images (P_TE0-). The same is done for the pose sequence (P_TP0-).

In summary, each subject has 6 gallery images (2 timestamps × 3 visible cameras) and 5 baseline probe images (5 timestamps × 1 thermal camera) without any eyewear. The subjects with glasses have an additional set of gallery and baseline probe images where the glasses are worn. This protocol can easily be extended to visible-to-visible or visible-to-thermal face verification by including the remaining images from the other cameras.

Sequence   Probes  Gallery: Mono 1  Mono 2  RGB
Baseline      5             2       2       2
Expression   25             –       –       –
Pose         25             –       –       –
Glasses*      5             2       2       2
Table 3: The number of probe and gallery images in the thermal-to-visible face verification protocol per subject in the test set. Mono 1 and 2 refer to the Grasshopper3 cameras. *Only pertains to subjects who brought glasses to the collection.

However, it should be noted that the development set has not been similarly balanced. All available images of a subject are by default included in the development set. Sub-sampling the development data is left to the user’s discretion.

Annotations: Face bounding box and face landmark coordinates were generated using a commercial off-the-shelf face and landmark detector (Neurotechnology VeriLook SDK) applied independently to the two high-resolution FLIR Grasshopper3 images, assisted by manual supervision and correction of the annotations. Face landmarks follow a 6-point annotation scheme corresponding to the left eye center, right eye center, base of nose, left mouth corner, right mouth corner, and center of mouth. The stereo arrangement of the Grasshopper3 cameras enabled the annotated points to be projected into the coordinate spaces of the LWIR and Scout RGB cameras using 3D geometry.

The stereo setup also allowed for the automatic estimation of head pose using OpenCV's [2] implementation of the Perspective-n-Point (PnP) algorithm with RANSAC. Figure 5 displays the distribution of estimated yaw angles captured during the pose sequence across all subjects. There is some slight asymmetry in the distribution about 0°, partially due to the fact that subjects oftentimes did not complete the full 180° head rotation. Metadata for each image includes the subject ID, camera, timestamp, image sequence, detected face bounding box, detected 6-point face landmarks, and estimated yaw angle.
Requesting the Database: Requests for the database can be made by contacting Matthew Thielke (matthew.d.thielke.civ@mail.mil). Requestors will be asked to sign a database release agreement and each request will be vetted for valid scientific research.

Figure 5: Distribution of head poses in terms of estimated yaw angles from the pose image sequence.

4 Performance Benchmarks

Benchmark results for landmark detection and thermal-to-visible face verification are provided in this section.

4.1 Face Landmark Detection

Sequence Mean Std Median MAD Max Error AUC Failure Rate
Table 4: Landmark detection performance statistics in terms of the NRMSE.

The Deep Alignment Network (DAN) [21] is a multi-stage convolutional neural network (CNN) designed to iteratively update the predicted landmark locations given an initial shape estimate. It has shown promising results for face landmark detection on both visible [37] and thermal [29][20] imagery. The model was trained with thermal face images from all of the recording sequences. The detected face bounding boxes are used to crop the images. The output of the model is the predicted face shape $S \in \mathbb{R}^{L \times 2}$, where $L$ is the number of face landmark locations.

For these benchmarks, we set $L = 5$ and detect the left and right eye centers, the base of the nose, and the left and right mouth corners. Landmark detection performance is evaluated using the Normalized Root Mean Square Error (NRMSE),

$$\mathrm{NRMSE} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{d_i} \left( \frac{1}{L} \sum_{j=1}^{L} \left\lVert \hat{s}_{ij} - s_{ij} \right\rVert_2 \right),$$

where $N$ is the number of samples in the test set and $\hat{s}_{ij}$ and $s_{ij}$ are the predicted and ground-truth landmark coordinates, respectively. The error is normalized by $d_i$, the Euclidean distance between the top left point, $(x_{\min}, y_{\min})$, and bottom right point, $(x_{\max}, y_{\max})$, of the ground-truth shape's rectangular bounds. The face diagonal is used to normalize the error, rather than the IPD, as it is more stable in off-pose conditions [36]. As per [37], in addition to the mean and standard deviation (Std), the median, Median Absolute Deviation (MAD), and maximum NRMSE statistics are tabulated in Table 4. We set a threshold of 0.08 NRMSE for the Failure Rate and the Area Under the Curve (AUC) of the Cumulative Error Distribution (CED).
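The metric can be sketched in a few lines (illustrative code, not the benchmark implementation; `nrmse` and `failure_rate` are assumed names): the per-image mean landmark error is divided by the diagonal of the ground-truth shape's bounding box, and the failure rate counts images above the 0.08 threshold.

```python
import numpy as np

# Illustrative NRMSE: per-image mean landmark error normalized by the
# ground-truth shape's bounding-box diagonal, plus the failure rate at
# the 0.08 threshold used above.

def nrmse(pred, gt):
    """pred, gt: (N, L, 2) arrays of predicted / ground-truth landmarks.
    Returns an (N,) array of per-image normalized errors."""
    per_point = np.linalg.norm(pred - gt, axis=-1)   # (N, L) point errors
    per_image = per_point.mean(axis=-1)              # (N,) mean over landmarks
    lo = gt.min(axis=1)                              # top-left corners
    hi = gt.max(axis=1)                              # bottom-right corners
    diag = np.linalg.norm(hi - lo, axis=-1)          # face diagonals
    return per_image / diag

def failure_rate(errors, threshold=0.08):
    return float((np.asarray(errors) > threshold).mean())
```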

Figure 6: NRMSE of baseline, expression, and glasses sequences.
Figure 7: CED for the baseline, expression, glasses, and pose sequences.
Figure 8: Bivariate distribution, generated using a Gaussian kernel density estimator, of NRMSE across head yaw for the pose sequence. Markers indicate outliers with high NRMSE.
As seen from Figures 6 and 7, the DAN achieves good performance on all frontal images, including images with expressions or glasses. The model fails on the head pose sequence, where performance significantly degrades at yaw angles beyond 20°, as illustrated in Figure 8. Interestingly, while images with glasses have a slightly higher NRMSE on average compared to the other frontal images, they also have tighter performance bounds and a 0% Failure Rate, as shown in Figure 6. This may be due to the distinct visual cues granted by glasses (which absorb heat emissions and appear black in thermal images), or simply by virtue of the small sample size of subjects with glasses.

4.2 Thermal-to-Visible Face Verification

One domain-invariant feature learning approach and three thermal-to-visible synthesis approaches are benchmarked against the ARL-VTF dataset. The verification performance is measured by the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) metrics, as well as the True Accept Rate (TAR) at False Accept Rates (FAR) equaling 1% and 5%.
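These metrics can be computed directly from arrays of genuine (same-identity) and impostor (different-identity) scores. The following is a minimal sketch with assumed function names, not the evaluation code used for the benchmarks:

```python
import numpy as np

# Illustrative verification metrics: TAR at a fixed FAR, and ROC AUC,
# computed from genuine and impostor similarity scores.

def tar_at_far(genuine, impostor, far=0.01):
    """True accept rate at the threshold that yields the requested FAR."""
    thr = np.quantile(np.asarray(impostor), 1.0 - far)  # impostors above thr ~= FAR
    return float((np.asarray(genuine) >= thr).mean())

def roc_auc(genuine, impostor):
    """AUC as the probability a genuine score exceeds an impostor score."""
    g = np.asarray(genuine)[:, None]
    i = np.asarray(impostor)[None, :]
    return float((g > i).mean() + 0.5 * (g == i).mean())
```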

The first method matches thermal and visible face images by learning a domain adaptive feature extractor as proposed in [10]. This framework has four main parts: (1) a truncated version of VGG16 or ResNet to extract common features, (2) a "Residual Spectral Transform" subnetwork that learns a mapping between the visible and thermal features, (3) a cross-domain identification loss to optimize task-level discrimination, and (4) a domain invariance loss which encourages domain unpredictability. The extracted probe and gallery image features are compared using the cosine similarity measure. The results reported in Figure 9 and Table 6 corresponding to this baseline were produced by the VGG16 version of the framework. The images are preprocessed similarly to [31][14] (with bandpass filtering omitted) by first aligning images to a 5-point canonical coordinate scheme via a similarity transformation and then loosely cropping the aligned face images in order to provide enhanced contextual information.

The remaining three methods employ Generative Adversarial Networks (GANs) to learn a mapping from thermal face images to visible face images. Once the visible image is synthesized from the input probe thermal image, a pre-trained VGG-Face model is used to extract deep features (i.e., the output of the relu5_3 layer) from the synthesized visible probe image as well as from the visible gallery image to perform thermal-to-visible face verification. The cosine similarity between the two feature vectors is calculated to produce the verification score. The inputs to these synthesis models are face images cropped according to the annotated bounding boxes. Images from all four sequences are used to train the models. The following GAN-based methods are used for evaluation:

  • Pix2Pix [16]: Conditioned on thermal images, the Pix2Pix model synthesizes visible images using a U-Net based architecture [16][32].

  • GANVFS [39]: GANVFS uses identity loss and perceptual loss [17] to train a synthesis network.

  • Self-attention based CycleGAN (SAGAN) [8]: A self-attention module [38] is adapted with CycleGAN [41] for thermal to visible synthesis.
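The scoring step shared by these methods, cosine similarity between flattened deep features of the synthesized probe and the gallery image, can be sketched as follows (illustrative, with assumed names; any fixed feature extractor could stand in for VGG-Face here):

```python
import numpy as np

# Illustrative verification score: cosine similarity between a probe
# feature map and a gallery feature map, each flattened to a vector.

def cosine_score(probe_feat, gallery_feat, eps=1e-8):
    p = np.asarray(probe_feat).ravel()
    g = np.asarray(gallery_feat).ravel()
    return float(p @ g / (np.linalg.norm(p) * np.linalg.norm(g) + eps))
```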

Figure 9: The ROC curves corresponding to the different methods for gallery G_VB0- and protocols P_T*0-.
Gallery G_VB0- Gallery G_VB0+
Probes Method AUC EER FAR=1% FAR=5% AUC EER FAR=1% FAR=5%
P_TB0 Raw
Pix2Pix [16]
Di [8]
Fondje [10]
GT Vis-to-Vis
P_TB- Raw
Di [8]
Fondje [10]
GT Vis-to-Vis
Table 5: Verification performance comparisons among the baseline and state-of-the-art methods for various settings.
Figure 10: Sample synthesized images corresponding to different methods. First, second, and third rows correspond to baseline, expression, and profile faces.

Additionally, two baseline methods are established to gauge the performance of the GAN-based approaches. As a naive baseline method (labelled “Raw”), the thermal probes and visible gallery images are input directly to the VGG-Face model. In this scenario, no synthesis is performed on the thermal probes, nor is the VGG-Face model trained on the thermal data. As a ground-truth baseline method (labelled “GT”), the thermal probe images are replaced with the corresponding “ground-truth” visible images captured synchronously by the Basler Scout RGB camera.

The cross-modal face verification and synthesis results are shown in Figure 9 and Figure 10, respectively. As can be seen from Figure 9, simply extracting deep features from the raw images does not produce good verification results. This is mainly due to the fact that thermal and visible images have significantly different characteristics. The AUC corresponding to this method is only 61.37%. Pix2Pix, a conditional GAN-based method, provides slightly better results than the simple baseline of extracting features from raw data, producing an AUC of 71.12%. Both the GANVFS and SAGAN methods are more advanced synthesis approaches and perform much better on this dataset, producing AUCs of 97.94% and 99.28%, respectively. The Equal Error Rates (EERs) of the Pix2Pix, GANVFS, and SAGAN models are 33.8%, 8.14%, and 3.97%, respectively. The synthesis results shown in Figure 10 are also consistent with the verification results shown in Figure 9 and Table 6.

In addition to the baseline comparisons, we analyze how the different variations (baseline, expression, pose, eyewear) influence the cross-spectrum matching performance of the different methods. As can be seen from Figure 10, expression slightly degrades the performance of the baseline methods. For instance, the AUC of the SAGAN method drops from 99.28% to 98.46%. We see similar degradation for the GANVFS and Pix2Pix methods on expressive face images as well. From Figures 9 and 10, we can also see that pose affects the performance of the different synthesis methods the most. The performance of the synthesis-based methods is constrained by the VGG-Face model's performance. This is evidenced by a reduction of the AUC from 99.99% on the baseline sequence to 75.76% on the pose sequence when using the ground-truth visible probe images as input. The EERs of the Pix2Pix, GANVFS, and SAGAN models are 47.22%, 41.66%, and 40.24%, respectively. This experiment clearly shows that there is much that needs to be done to handle pose, expression, and occlusion variations in cross-modal synthesis and verification. More advanced methods that specifically address these issues for heterogeneous face synthesis and verification are needed. A complete set of performance metrics for all the models, probe sets, and galleries is included in the supplementary material.

5 Conclusion

A new, large-scale face dataset of time-synchronized visible and LWIR thermal imagery is presented. In order to emulate real-world conditions, variations of expressions, head pose, and eyeglasses have been systematically captured. Furthermore, the dataset is evaluated on the tasks of thermal face landmark detection and thermal-to-visible face verification using multiple state-of-the-art algorithms. Analysis of the results indicates two challenging scenarios. First, the performance of the thermal landmark detection and thermal-to-visible face verification models were severely degraded on off-pose images. Secondly, the thermal-to-visible face verification models encountered an additional challenge when a subject was wearing glasses in one image but not the other. This effect is further exacerbated in the thermal domain due to the occlusion induced by heat absorption in the lenses.

Acknowledgements: The authors would like to acknowledge sponsorship provided by the Defense Forensics & Biometrics Agency to conduct this research, and thank Michelle Giorgilli and Tom Cantwell for the discussions and their guidance. The authors would also like to thank Lars Ericson at IARPA and Chris Nardone, Marcia Patchan, and Stergios Papadakis at the JHU Applied Physics Laboratory for enabling ARL’s participation in the 2019 IARPA ODIN data collection.


  • [1] B. Abidi IRIS Thermal/Visible Face Database, IEEE OTCBVS WS Series Bench. Note: http://vcipl-okstate.org/pbvs/bench/, accessed 2020-06-09 Cited by: Table 1, §2.
  • [2] G. Bradski (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools. Cited by: §3.1.1.
  • [3] P. Buddharaju, I. T. Pavlidis, P. Tsiamyrtzis, and M. Bazakos (2007-04) Physiology-based face recognition in the thermal infrared spectrum. IEEE Trans. Pattern Anal. Mach. Intell. 29 (4), pp. 613–626. External Links: Document, ISSN 01628828 Cited by: Table 1, §2.
  • [4] H. Chang, H. Harishwaran, M. Yi, A. Koschan, B. Abidi, and M. Abidi (2006) An indoor and outdoor, multimodal, multispectral and multi-illuminant database for face recognition. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Vol. 2006. External Links: Document, ISBN 0769526462, ISSN 10636919 Cited by: Table 1, §2.
  • [5] X. Chen, P. J. Flynn, and K. W. Bowyer (2003) Visible-light and infrared face recognition. In ACM Work. Multimodal User Authentication, pp. 48–55. Cited by: Table 1, §2.
  • [6] J. Deng, S. Cheng, N. Xue, Y. Zhou, and S. Zafeiriou (2018) UV-GAN: adversarial facial uv map completion for pose-invariant face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7093–7102. Cited by: §2.
  • [7] X. Di, H. Zhang, and V. M. Patel (2018) Polarimetric thermal to visible face verification via attribute preserved synthesis. In 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS), Vol. , pp. 1–10. Cited by: §1.
  • [8] X. Di, B. S. Riggan, S. Hu, N. J. Short, and V. M. Patel (2019) Polarimetric thermal to visible face verification via self-attention guided synthesis. In International Conference on Biometrics (ICB), pp. 1–8. Cited by: §2, 3rd item, Table 5, Table 6.
  • [9] V. Espinosa-Duró, M. Faundez-Zanuy, and J. Mekyska (2013-03) A New Face Database Simultaneously Acquired in Visible, Near-Infrared and Thermal Spectrums. Cognit. Comput. 5 (1), pp. 119–135. External Links: Document, ISSN 18669956 Cited by: Table 1, §2.
  • [10] C. N. Fondje, S. Hu, N. J. Short, and B. S. Riggan (2020) Cross-domain identification for thermal-to-visible face recognition. arXiv preprint arXiv:2008.08473. Cited by: §4.2, Table 5.
  • [11] T. Gault and A. Farag (2013) A Fully Automatic Method to Extract the Heart Rate from Thermal Video. In 2013 IEEE Conf. Comput. Vis. Pattern Recognit. Work., pp. 336–341. External Links: Document, ISBN 9780769549903 Cited by: §1.
  • [12] R. S. Ghiass, H. Bendada, and X. Maldague (2018) Université Laval Face Motion and Time-Lapse Video Database (UL-FMTV). Technical report Université Laval. External Links: Link Cited by: Table 1, §2.
  • [13] R. He, Y. Li, X. Wu, L. Song, Z. Chai, and X. Wei (2020) Coupled adversarial learning for semi-supervised heterogeneous face recognition. Pattern Recognition, pp. 107618. Cited by: §2.
  • [14] S. Hu, N. J. Short, B. S. Riggan, C. Gordon, K. P. Gurton, M. Thielke, P. Gurram, and A. L. Chan (2016) A Polarimetric Thermal Database for Face Recognition Research. In IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work., pp. 119–126. Cited by: Table 1, §2, §4.2.
  • [15] S. Hu, N. Short, B. S. Riggan, M. Chasse, and M. S. Sarfraz (2017-05) Heterogeneous Face Recognition: Recent Advances in Infrared-to-Visible Matching. In 2017 12th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG 2017), pp. 883–890. External Links: Document, ISBN 978-1-5090-4023-0, Link Cited by: §1.
  • [16] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, Cited by: §2, 1st item, Table 5, Table 6.
  • [17] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), pp. 694–711. Cited by: 2nd item.
  • [18] I. A. Kakadiaris, G. Passalis, T. Theoharis, G. Toderici, I. Konstantinidis, and N. Murtuza (2005) Multimodal face recognition: Combination of geometry with physiological information. In Proc. - 2005 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognition, CVPR 2005, Vol. II, pp. 1022–1029. External Links: Document, ISBN 0769523722 Cited by: §2, §2.
  • [19] B. F. Klare and A. K. Jain (2013) Heterogeneous face recognition using kernel prototype similarities. IEEE Trans. Pattern Anal. Mach. Intell. 35 (6), pp. 1410–1422. External Links: Document, ISSN 01628828 Cited by: §1.
  • [20] M. Kopaczka, R. Kolk, J. Schock, F. Burkhard, and D. Merhof (2019-05) A Thermal Infrared Face Database with Facial Landmarks and Emotion Labels. IEEE Trans. Instrum. Meas. 68 (5), pp. 1389–1401. External Links: Document, ISSN 00189456 Cited by: Table 1, §2, §4.1.
  • [21] M. Kowalski, J. Naruniec, and T. Trzcinski (2017) Deep Alignment Network: A Convolutional Neural Network for Robust Face Alignment. In IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work., Vol. 2017-July, pp. 88–97. External Links: Document, 1706.01789, ISBN 9781538607336, ISSN 21607516 Cited by: §4.1.
  • [22] K. Mallat and J. L. Dugelay (2018-10) A benchmark database of visible and thermal paired face images across multiple variations. In 2018 Int. Conf. Biometrics Spec. Interes. Group, BIOSIG 2018, External Links: Document, ISBN 9783885796763 Cited by: Table 1, §2.
  • [23] R. Miezianko Terravic Facial IR Database. Note: http://vcipl-okstate.org/pbvs/bench/, accessed 2020-06-09 Cited by: Table 1.
  • [24] E. Mostafa, R. Hammoud, A. Ali, and A. Farag (2013-12) Face recognition in low resolution thermal images. Comput. Vis. Image Underst. 117 (12), pp. 1689–1694. External Links: Document, ISSN 10773142 Cited by: §1.
  • [25] H. Nguyen, K. Kotani, F. Chen, and B. Le (2013-10) A thermal facial emotion database and its analysis. In Pacific-Rim Symp. Image Video Technol., Vol. 8333 LNCS, pp. 397–408. External Links: Document, ISBN 9783642538414, ISSN 16113349 Cited by: Table 1, §2.
  • [26] K. Panetta, A. Samani, X. Yuan, Q. Wan, S. Agaian, S. Rajeev, S. Kamath, R. Rajendran, S. P. Rao, A. Kaszowska, and H. A. Taylor (2020-03) A Comprehensive Database for Benchmarking Imaging Systems. IEEE Trans. Pattern Anal. Mach. Intell. 42 (3), pp. 509–520. External Links: Document, ISSN 19393539 Cited by: Table 1, §2.
  • [27] O. M. Parkhi, A. Vedaldi, and A. Zisserman (2015) Deep face recognition. In British Machine Vision Conference (BMVC). Cited by: §4.2.
  • [28] I. Pavlidis, J. Levine, and P. Baukol (2001) Thermal image analysis for anxiety detection. In IEEE Int. Conf. Image Process., Vol. 2, pp. 315–318. External Links: Document Cited by: §1.
  • [29] D. Poster, S. Hu, N. Nasrabadi, and B. Riggan (2019) An Examination of Deep-Learning Based Landmark Detection Methods on Thermal Face Imagery. In IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work., Cited by: §4.1.
  • [30] C. Reale, N. M. Nasrabadi, H. Kwon, and R. Chellappa (2016) Seeing the forest from the trees: a holistic approach to near-infrared heterogeneous face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 54–62. Cited by: §2.
  • [31] B. S. Riggan, N. J. Short, and S. Hu (2016) Optimal feature learning and discriminative framework for polarimetric thermal to visible face recognition. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Vol. , pp. 1–7. External Links: Document Cited by: §4.2.
  • [32] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: 1st item.
  • [33] P. Tsiamyrtzis, J. Dowdall, D. Shastri, I. T. Pavlidis, M. Frank, and P. Ekman (2007) Imaging facial physiology for the detection of deceit. International Journal of Computer Vision 71 (2), pp. 197–214. Cited by: §1.
  • [34] S. Wang, Z. Liu, S. Lv, Y. Lv, G. Wu, P. Peng, F. Chen, and X. Wang (2010-11) A natural visible and infrared facial expression database for expression recognition and emotion inference. IEEE Trans. Multimed. 12 (7), pp. 682–691. External Links: Document, ISSN 15209210 Cited by: Table 1, §2, §2.
  • [35] Z. Wang, G. Horng, T. Hsu, C. Chen, and G. Jong (2020-05) A Novel Facial Thermal Feature Extraction Method for Non-Contact Healthcare System. IEEE Access 8, pp. 86545–86553. External Links: Document, ISSN 21693536 Cited by: §1.
  • [36] Y. Wu and Q. Ji (2019) Facial landmark detection: a literature survey. International Journal of Computer Vision 127 (2), pp. 115–142. Cited by: §4.1.
  • [37] S. Zafeiriou, G. Trigeorgis, G. Chrysos, J. Deng, and J. Shen (2017) The Menpo Facial Landmark Localisation Challenge: A step towards the solution. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Work., pp. 170–179. External Links: Link Cited by: §4.1, §4.1.
  • [38] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In Proceedings of the 36th International Conference on Machine Learning, pp. 7354–7363. Cited by: 3rd item.
  • [39] H. Zhang, V. M. Patel, B. S. Riggan, and S. Hu (2017-01) Generative adversarial network-based synthesis of visible faces from polarimetric thermal faces. In IEEE International Joint Conference on Biometrics (IJCB), Vol. , pp. 100–107. External Links: Document, 1708.02681, ISBN 9781538611241 Cited by: §2, 2nd item, Table 5, Table 6.
  • [40] H. Zhang, B. S. Riggan, S. Hu, N. J. Short, and V. M. Patel (2019-06) Synthesis of High-Quality Visible Faces from Polarimetric Thermal Faces using Generative Adversarial Networks. Int. J. Comput. Vis. 127 (6-7), pp. 845–862. External Links: Document, 1812.05155, ISSN 0920-5691 Cited by: Table 1, §2.
  • [41] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: 3rd item.