IR2VI: Enhanced Night Environmental Perception by Unsupervised Thermal Image Translation

by Shuo Liu, et al.

Context enhancement is critical for night vision (NV) applications, especially for the dark night situation without any artificial lights. In this paper, we present the infrared-to-visual (IR2VI) algorithm, a novel unsupervised thermal-to-visible image translation framework based on generative adversarial networks (GANs). IR2VI is able to learn the intrinsic characteristics from VI images and integrate them into IR images. Since the existing unsupervised GAN-based image translation approaches face several challenges, such as incorrect mapping and lack of fine details, we propose a structure connection module and a region-of-interest (ROI) focal loss method to address the current limitations. Experimental results show the superiority of the IR2VI algorithm over baseline methods.


1 Related Work

1.1 Infrared and Visible Image Fusion

IR and VI image fusion has been an active research area over the last two decades. The objective is to fuse an IR image and a VI image into a composite image that improves imaging quality and thus the visual capability of humans and robots [12]. Image fusion methods can be roughly categorized into spatial-domain and transform-domain methods. Spatial-domain methods, such as weighted averaging and gradient transfer fusion [20], are straightforward to implement. Transform-domain algorithms include the non-subsampled contourlet transform (NSCT) [2], wavelets [24], guided filters [30], etc. These fusion methods are developed under the assumption that the IR and VI images are fully registered. However, a visible camera does not function in most night environments, so only the IR image can be acquired and image fusion cannot be performed.

1.2 Infrared Image Colorization

IR image colorization is a color transfer technique that aims to transform a gray-scale IR image into a multi-channel RGB image. These techniques can be divided into non-parametric and parametric methods. Non-parametric methods [8, 9, 29] generally require color reference images whose structure is similar to that of the source IR image, and use the image analogies framework [10] to transfer color onto the IR image. Parametric methods [16, 25, 22], in contrast, directly estimate chrominance values by training one or more prediction models, such as deep convolutional neural networks (DCNNs) [16] or GANs [25, 22]. However, these colorization approaches either require a paired, pixel-wise aligned training dataset or rely on a color reference image, both of which are hard to acquire in night vision applications. In contrast to IR image colorization methods, IR2VI maps the intrinsic features of VI images onto the IR image and does not need a fully registered dataset.

1.3 Image-to-Image Translation

Image-to-image translation learns a mapping function from a source data distribution to one or multiple target data distributions. Recent progress in this field has been achieved with GANs [7]. These GAN approaches can be categorized into supervised and unsupervised ones. Supervised models [23, 28] commonly adopt the L1 loss function and therefore require paired images. Unsupervised models [5, 17, 14] alleviate the difficulty of obtaining data pairs through different techniques, such as variational auto-encoders (VAEs) [17] or cycle consistency [14]. However, the unsupervised methods can also lead to undesirable problems, such as incorrect mappings, when applied to the IR-to-VI image translation task. In our IR2VI framework, we designed a structure connection module and an ROI focal loss to address these problems.

2 The IR2VI Framework

2.1 Overall Architecture

As shown in Fig. 1, the basic architecture of IR2VI includes a generator, a global discriminator, and an ROI discriminator. The generator translates an IR image into a synthetic VI image that looks similar to a real VI image, while the global discriminator distinguishes translated VI images from real ones. The ROI discriminator distinguishes the ROIs of translated VI images from those of real ones. In this way, the synthetic VI images are trained to be indistinguishable from the real VI images.

Figure 1: An overall architecture of the proposed IR2VI framework. Note that this is a brief illustration of the architecture, which actually needs to be duplicated for training in the CycleGAN way.

Similar to CycleGAN [14] and StarGAN [5], we adopted the residual auto-encoder architecture from Johnson et al. [13] with 9 residual blocks [15] for the generative network. We follow the naming convention used in the image translation community [17, 28, 14], with the network configuration expressed as follows:

where the first symbol denotes a Convolution-BatchNorm-ReLU (CBR) layer with a given number of filters and stride; a superscript on it marks the layer that belongs to the structure connection module, which is introduced in the following subsection. The second symbol denotes a CBR layer with a given number of filters and stride 2. We also employed reflection padding to reduce boundary artifacts. The third symbol denotes a residual block, which consists of two convolutional layers with the same number of filters in both layers. The fourth denotes a fractional-strided CBR layer, again parameterized by its number of filters and its stride. The last symbol denotes the fusion layer, in which two functions are used to fuse the output of the structure connection with that of the residual auto-encoder.

We adopted the PatchGAN [23] with 4 hidden layers for all the discriminative networks, with the network configuration as follows:

where each symbol denotes a Convolution-BatchNorm-LeakyReLU layer with a given number of filters and a common stride (except for the last layer, which uses a different stride). After the last layer, we applied a convolution to produce the discriminator output. BatchNorm is not applied to the first layer, and a fixed slope is used for the LeakyReLU.

To train IR2VI, four loss functions were used: the cycle consistency loss, the global adversarial loss, the ROI cycle-consistency loss, and the ROI adversarial loss. Details about each loss function are provided in the following sections.

The IR2VI framework evolves from CycleGAN [14]. In contrast to CycleGAN, we made two important improvements: (1) a structure connection module is added to the generator to constrain structure deformation; and (2) an ROI focal loss is calculated in the training stage, which lets the translation focus on the critical regions.

2.2 Implementation Details

2.2.1 Structure Connection

Incorrect mapping is a common issue for unsupervised image translation models, which lack direct supervision signals. When objects in the source image are overly bright, which is extremely common for IR images at night, the translation model becomes confused and may map the objects to an arbitrary permutation of objects in the target domain. Fig. 2 shows an example in which CycleGAN wrongly mapped the ground to forest and the vehicle to a different object. To solve the incorrect mapping problem, we added a shortcut to the generator that connects the input image with the generated image, which we call the “structure connection.” A convolution layer extracts the detailed structure information from the IR image, which is then fused with the semantic information generated by the residual auto-encoder. In this way, the deep CNN can focus on the semantic-level task while the structure connection enables the synthetic VI image to keep the original structure information.

Figure 2: An example of the results from the CycleGAN to illustrate the incorrect mapping problem.

2.2.2 Cycle Consistency Loss

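The cycle consistency loss was proposed by Zhu et al. in [14]. The basic idea is to learn two mappings, one from the IR domain to the VI domain and one in the opposite direction, which translate images between the two domains. For an IR image, translating it to the VI domain and back is forced to reproduce the original image; the same constraint applies to a VI image translated to the IR domain and back. Thus, it becomes possible to constrain the cycle consistency and eliminate undesirable mappings. The cycle consistency loss can be formulated as follows: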
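For reference, the cycle consistency loss of CycleGAN [14] can be written as below. The symbol names are assumptions made for this sketch and are not necessarily the notation of the original equation: $G$ maps IR to VI, $F$ maps VI to IR, $x$ is an IR image, and $y$ is a VI image.

```latex
\mathcal{L}_{cyc}(G, F) =
  \mathbb{E}_{x \sim p_{IR}}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y \sim p_{VI}}\big[\lVert G(F(y)) - y \rVert_1\big]
```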


2.2.3 Global Adversarial Loss

The global adversarial loss is derived from the global discriminator, which aims to distinguish full-size images of the real domain from full-size images of the synthetic domain. Because the image fed to the discriminator network is full-sized, this adversarial loss is called the global adversarial loss. As mentioned above, two mapping functions are created to compute the cycle consistency loss, and the global adversarial loss is applied to both of them. Taking the IR-to-VI mapping as an example, its discriminator is the global discriminator for VI images. Thus, the global adversarial loss is formulated as:
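A sketch of this term, assuming the standard GAN objective of [7] in the form used by [14]. Both the exact variant and the symbols are assumptions: $G$ is the IR-to-VI generator, $D_{VI}$ the global discriminator on VI images, $x$ an IR image, and $y$ a real VI image.

```latex
\mathcal{L}_{GAN}(G, D_{VI}) =
  \mathbb{E}_{y \sim p_{VI}}\big[\log D_{VI}(y)\big]
  + \mathbb{E}_{x \sim p_{IR}}\big[\log\big(1 - D_{VI}(G(x))\big)\big]
```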


2.2.4 ROI Focal Loss

Generally, images generated via adversarial training often lack fine details and realistic textures [3, 28]. This is especially evident when the object of interest is extremely small. To this end, we propose a region-of-interest (ROI) focal loss, which consists of an ROI adversarial loss and an ROI cycle-consistency loss. The ROI approach is suitable for training datasets with bounding boxes. In contrast to the cycle consistency loss and the global adversarial loss, which take the full-size image as input, the ROI focal loss operates on the ROIs. To obtain the ROIs from the full-size image, we adopt the ROI pooling layer [6], which was originally proposed for object detection. Based on the provided bounding boxes, the ROI pooling layer crops an arbitrary area and reshapes it to a fixed-size image. In our work, every ROI is pooled to the same fixed size by an ROI pooling function. Like the cycle consistency loss and the global adversarial loss, the ROI focal loss is applied to both mapping functions. Here, the IR-to-VI mapping is used as an example.
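The crop-and-reshape behavior of ROI pooling can be illustrated with a minimal pure-Python sketch in the spirit of [6]. It assumes a single-channel image stored as a list of rows and a box given as `(x0, y0, x1, y1)` with exclusive end coordinates. The function name `roi_max_pool` and its parameters are hypothetical, and the output size is left as an argument rather than fixed to the value used in the paper.

```python
def roi_max_pool(image, box, out_h, out_w):
    """Max-pool the region `box` of a 2D image into an out_h x out_w grid.

    image: list of rows (lists of numbers), single channel.
    box:   (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    roi_h, roi_w = y1 - y0, x1 - x0
    if roi_h <= 0 or roi_w <= 0:
        raise ValueError("box must have positive width and height")

    pooled = []
    for i in range(out_h):
        # Integer floor/ceil bin edges, so every bin holds at least one pixel.
        r_start = y0 + (i * roi_h) // out_h
        r_end = y0 + -(-((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            c_start = x0 + (j * roi_w) // out_w
            c_end = x0 + -(-((j + 1) * roi_w) // out_w)
            row.append(max(image[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled


# A 4x4 image with values 0..15 in row-major order.
demo_image = [[4 * r + c for c in range(4)] for r in range(4)]
demo_full = roi_max_pool(demo_image, (0, 0, 4, 4), 2, 2)
demo_part = roi_max_pool(demo_image, (1, 1, 4, 4), 2, 2)
```

Because the bins are derived from the box size, boxes of different shapes all yield the same output size, which is what allows a single ROI discriminator to process every object.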

The ROI cycle-consistency loss can be formulated as follows:
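By analogy with the full-image cycle consistency loss of [14], a plausible form of this term for the IR-to-VI direction is sketched below. The notation is assumed rather than taken from the original equation: $G$ is the IR-to-VI generator, $F$ the VI-to-IR generator, $x$ an IR image, and $R(\cdot)$ the ROI pooling function applied with the bounding boxes of $x$.

```latex
\mathcal{L}_{cyc}^{roi}(G, F) =
  \mathbb{E}_{x \sim p_{IR}}\big[\lVert R(F(G(x))) - R(x) \rVert_1\big]
```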


The network configuration of the ROI discriminator is the same as that of the global discriminator. The ROI adversarial loss can be expressed as follows:


where the discriminator in this loss is the ROI discriminator for VI images.
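A sketch of the ROI adversarial term for the IR-to-VI direction, assuming the standard GAN objective of [7] applied to pooled ROIs. All symbols are assumptions: $G$ is the IR-to-VI generator, $D_{VI}^{roi}$ the ROI discriminator for VI images, $R(\cdot)$ the ROI pooling function, $x$ an IR image, and $y$ a real VI image.

```latex
\mathcal{L}_{GAN}^{roi}(G, D_{VI}^{roi}) =
  \mathbb{E}_{y \sim p_{VI}}\big[\log D_{VI}^{roi}(R(y))\big]
  + \mathbb{E}_{x \sim p_{IR}}\big[\log\big(1 - D_{VI}^{roi}(R(G(x)))\big)\big]
```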

2.3 Full Objective

Finally, the complete objective function can be written as:


where two hyper-parameters control the relative importance of the cycle consistency loss and the ROI focal loss, respectively. For simplicity, a shorthand notation is used in the objective. Finally, the method solves:
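One way to assemble the terms is sketched below, under stated assumptions only: $\lambda_{cyc}$ and $\lambda_{roi}$ are hypothetical names for the two hyper-parameters, $\mathcal{L}_{roi}$ stands for the ROI focal loss (the ROI adversarial and ROI cycle-consistency terms over both directions), $G$ and $F$ are the IR-to-VI and VI-to-IR generators, and $D$ collectively denotes all discriminators. The exact grouping in the original objective may differ.

```latex
\mathcal{L}(G, F, D) =
  \mathcal{L}_{GAN}(G, D_{VI}) + \mathcal{L}_{GAN}(F, D_{IR})
  + \lambda_{cyc}\, \mathcal{L}_{cyc}(G, F)
  + \lambda_{roi}\, \mathcal{L}_{roi}(G, F, D)

G^{*}, F^{*} = \arg\min_{G, F} \max_{D} \mathcal{L}(G, F, D)
```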


2.4 Evaluation Protocol

As there is no ground truth associated with the translated images, it is hard to evaluate the performance of different image translation methods. In this study, we focused on a dataset with bounding boxes, so different methods can be assessed by performing object detection on the synthesized images. Specifically, we adopted the Faster R-CNN object detector with a ResNet-101 backbone presented in [11] and trained it on the day-time VI image dataset (target domain). Then, the different image translation methods were employed to generate synthetic VI images from the night-time IR images. Finally, the trained object detector was applied to the synthetic VI images. We use the de facto standard average precision (AP) to evaluate the performance of the object detector; AP is the ratio of the area under the Precision-Recall curve (at most 1) to the entire area (which is 1).
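The AP computation can be sketched as follows. This is a minimal, non-interpolated version that accumulates precision times the recall increment over detections sorted by confidence. The interpolation rule of the evaluation toolkit actually used is not stated, so this variant is an assumption, and the function name `average_precision` is hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the Precision-Recall curve for one object class.

    scores:           confidence of each detection.
    is_true_positive: whether each detection matches a ground-truth box.
    num_ground_truth: total number of ground-truth boxes.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")

    # Visit detections from the most to the least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = 0
    false_pos = 0
    prev_recall = 0.0
    area = 0.0
    for i in order:
        if is_true_positive[i]:
            true_pos += 1
        else:
            false_pos += 1
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / num_ground_truth
        # Recall only moves on true positives, so false positives add no area.
        area += precision * (recall - prev_recall)
        prev_recall = recall
    return area


demo_ap = average_precision([0.9, 0.8, 0.7, 0.6],
                            [True, False, True, True],
                            num_ground_truth=4)
```

A detector that finds every object with no false alarms reaches an AP of 1, while a detector whose detections never match a ground-truth box scores 0.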

Figure 3: Comparison of different results on the SENSIAC dataset.

3 Experimental Results

This section first introduces the Military Sensing Information Analysis Center (SENSIAC) dataset that is used in all the experiments. Then the hardware settings and training details of IR2VI are listed. Lastly, IR2VI is compared against state-of-the-art methods through subjective qualitative and objective quantitative analysis.

3.1 Dataset

SENSIAC [1] recently released a large-scale military IR-VI image dataset for automatic target recognition (ATR). In this study, the proposed IR2VI framework is evaluated on the SENSIAC dataset along with the state-of-the-art methods. The SENSIAC dataset contains 207 GB of middle-wave infrared (MWIR) videos and 106 GB of VI (grayscale) videos along with manually labeled bounding boxes. Various types of objects are included in this dataset, for instance, soldiers, military vehicles, and civilian vehicles. It is worth noting that the dataset was collected during both day-time and night-time at multiple observation distances from 500 to 5000 meters. However, it has paired IR-VI videos in day-time but only IR videos in night-time.

The objective of this study is to translate night-time thermal images to day-time VI images, so only the night-time IR videos and the day-time VI videos are used in the experiments. We selected 3 observation distances, i.e., 1000, 1500, and 2000 meters, and split the data into training/testing datasets [18]. For training the image translation models, we sampled keyframes at 3 frames per second (every 10 frames), which gives 2700 night-time IR training images and 2691 day-time VI training images. Note that all the night-time IR images are preprocessed by histogram equalization before being fed into the models. For training the object detector, keyframes are sampled at 6 frames per second (every 5 frames), which gives 4573 day-time VI images and 5400 night-time IR images in the training dataset, and 2812 day-time VI images and 3240 night-time IR images in the testing dataset.
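The histogram equalization step can be sketched as follows. This is the textbook global variant applied to a flat list of integer gray levels. The exact implementation used for the experiments is not specified, and the function name `equalize_histogram` is hypothetical.

```python
def equalize_histogram(pixels, levels=256):
    """Globally equalize a flat list of integer gray levels in [0, levels - 1]."""
    if not pixels:
        return []

    # Histogram of gray levels.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution of gray levels.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:
        # Constant image: there is no contrast to stretch.
        return list(pixels)

    scale = levels - 1
    return [round((cdf[p] - cdf_min) * scale / (total - cdf_min)) for p in pixels]


demo_equalized = equalize_histogram([10, 20, 30, 40])
```

The mapping spreads the occupied gray levels over the full range, which increases the contrast of night-time IR frames whose intensities cluster in a narrow band.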

3.2 Experimental Setup

IR2VI was developed based on CycleGAN [14] using the PyTorch deep learning toolbox [21]. We used a workstation with an NVIDIA GeForce GTX 1080 GPU, an Intel Core i7 CPU, and 32 GB of memory.

The hyper-parameters are those of Equation 5. All the networks were trained from scratch, and the weights were initialized from a Gaussian distribution with zero mean and a fixed standard deviation. The Adam solver was employed with a fixed batch size; the learning rate was held constant for the first 20 epochs and then decayed linearly to zero over the next 20 epochs.
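The learning rate schedule described above can be sketched as follows. The base learning rate is passed in as a parameter rather than fixed to the value used in the experiments. The function name `learning_rate`, the zero-based epoch index, and the exact point at which the rate reaches zero are assumptions.

```python
def learning_rate(epoch, base_lr, constant_epochs=20, decay_epochs=20):
    """Constant rate for `constant_epochs`, then linear decay over `decay_epochs`.

    epoch is zero-based; the rate is exactly zero from epoch
    constant_epochs + decay_epochs onward.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(remaining, 0) / decay_epochs


# Rate at every epoch of a 40-epoch run with a placeholder base rate of 1.0.
demo_schedule = [learning_rate(e, 1.0) for e in range(41)]
```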

For a fair comparison, we did not modify the default settings of the baseline methods except for the image channel, image size, batch size, and number of training epochs. Specifically, the images in the SENSIAC dataset are grayscale, so the number of channels was set to 1 for the input of all the networks and for the output of the generation network. Because of the limited GPU memory, the training epochs and batch size were set separately for the baselines: one setting for CycleGAN [14] and UNIT [17], and another for StarGAN [5]. The images were center-cropped to a fixed size before being fed into the baseline networks. IR2VI is an object-based framework, so its training images were cropped such that each crop contains at least one object. Because the generator of every method is a fully convolutional network, which can take an image of arbitrary size as input, the full-size image is fed to the network in the testing stage.
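Cropping training images so that each crop keeps at least one object could be done as in the sketch below. It samples a random crop window that fully contains a given bounding box. The sampling strategy, the function name `random_crop_with_object`, and the crop size passed as a parameter are all assumptions, since the procedure is not described in detail.

```python
import random


def random_crop_with_object(image_w, image_h, box, crop_w, crop_h, rng=random):
    """Return a crop window (left, top, right, bottom) that contains `box`.

    box is (x0, y0, x1, y1) in pixels with x1 and y1 exclusive, and must lie
    inside the image and fit inside the crop.
    """
    x0, y0, x1, y1 = box
    if crop_w > image_w or crop_h > image_h:
        raise ValueError("crop is larger than the image")
    if x0 < 0 or y0 < 0 or x1 > image_w or y1 > image_h:
        raise ValueError("box lies outside the image")
    if x1 - x0 > crop_w or y1 - y0 > crop_h:
        raise ValueError("box does not fit inside the crop")

    # Valid range of the crop's top-left corner: the crop must stay in the
    # image and cover the whole box.
    left = rng.randint(max(0, x1 - crop_w), min(x0, image_w - crop_w))
    top = rng.randint(max(0, y1 - crop_h), min(y0, image_h - crop_h))
    return left, top, left + crop_w, top + crop_h


demo_rng = random.Random(0)
demo_box = (300, 200, 340, 230)
demo_crops = [random_crop_with_object(640, 480, demo_box, 256, 256, demo_rng)
              for _ in range(50)]
```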

3.3 Results

In this section, we compare IR2VI with the state-of-the-art unsupervised image translation methods: CycleGAN [14], UNIT [17], and StarGAN [5].

3.3.1 Subjective Comparisons

All the methods were trained on the same training set and tested on unseen images. Figure 3 shows the images translated from unseen inputs by the different methods. CycleGAN and UNIT clearly suffer from serious incorrect mapping problems. CycleGAN could not tell the trees from the ground, so it mapped the ground to a forest; in the second testing image, it incorrectly generated two vehicles. The images translated by UNIT look nearly alike and carry little semantic information. StarGAN has few incorrect mapping problems but lacks sharp texture information. We qualitatively observed that IR2VI provided the highest visual quality among the compared methods: it not only conveys the spatial semantic information but also renders the target clearly. We believe that IR2VI benefits from both the structure connection module and the ROI focal loss.

3.3.2 Quantitative Comparisons

For the quantitative objective evaluation, we applied the evaluation protocol introduced in Section 2.4. Figure 4 and Table 1 show the Precision-Recall (PR) curves and Average Precision (AP) scores of the object detector on translated images by different translation methods.

Figure 4: Precision-Recall curve of the object detector on different synthesis images.
AP (%) 91.70
Table 1: Average precision scores of the object detector on the generated testing images by different translation methods.

The results show a large margin between the different methods. IR2VI achieved the best AP score, ahead of the second-ranked method, StarGAN. These results demonstrate that IR2VI is capable of adding both semantic visible information and object shape information to the original thermal images. Even though the images translated by StarGAN lack texture information, the blurred shape information still helps the VI object detector to accomplish detection. In contrast, the incorrect mapping problems in CycleGAN and UNIT made the VI detector fail completely, as indicated by a nearly zero AP score.

4 Conclusion

In this paper, we proposed IR2VI, an unsupervised thermal image translation framework for context enhancement at night. Thanks to the proposed structure connection module in the generative network, IR2VI is able to overcome the incorrect mapping problem commonly faced by state-of-the-art image translation methods. Moreover, the proposed ROI focal loss enables IR2VI to generate synthetic VI images with finer details than the baseline models. The results demonstrate that IR2VI not only broadens the area of context enhancement for night vision but can also be applied to many other related research fields within image fusion.


The authors would like to thank NVIDIA for supporting our research by offering computing equipment. This research was enabled in part by support from WestGrid and Compute Canada. In particular, the authors express their sincere gratitude to Huan Liu and Junchi Bin for the helpful discussions while this work was being carried out.


  • [1] Military sensing information analysis center (SENSIAC), 2008. [Online; accessed 01-November-2017].
  • [2] G. Bhatnagar and Z. Liu. A novel image fusion framework for night-vision navigation and surveillance. Signal, Image and Video Processing, 9(1):165–175, 2015.
  • [3] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [4] T. Chijiiwa, T. Ishibashi, and H. Inomata. Histological study of choroidal melanocytes in animals with tapetum lucidum cellulosum. Graefe’s archive for clinical and experimental ophthalmology, 228(2):161–168, 1990.
  • [5] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. arXiv preprint arXiv:1711.09020, 2017.
  • [6] R. Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, Dec 2015.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems (NIPS), pages 2672–2680, 2014.
  • [8] R. K. Gupta, A. Y.-S. Chia, D. Rajan, E. S. Ng, and H. Zhiyong. Image colorization using similar images. In Proceedings of the 20th ACM international conference on Multimedia (ACMMM), pages 369–378. ACM, 2012.
  • [9] T. Hamam, Y. Dordek, and D. Cohen. Single-band infrared texture-based image colorization. In Electrical & Electronics Engineers in Israel (IEEEI), 2012 IEEE 27th Convention of, pages 1–5. IEEE, 2012.
  • [10] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques (SIGGRAPH), pages 327–340. ACM, 2001.
  • [11] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [12] X. Jin, Q. Jiang, S. Yao, D. Zhou, R. Nie, J. Hai, and K. He. A survey of infrared and visual image fusion methods. Infrared Physics & Technology, 85:478–501, 2017.
  • [13] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), pages 694–711. Springer, 2016.
  • [14] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), pages 2242–2251, Oct 2017.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
  • [16] M. Limmer and H. P. Lensch. Infrared colorization using deep convolutional neural networks. In IEEE International Conference on Machine Learning and Applications (ICMLA), pages 61–68. IEEE, 2016.
  • [17] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
  • [18] S. Liu and Z. Liu. Multi-channel CNN-based object detection for enhanced situation awareness. In Sensors and Electronics Technology (SET) panel Symposium SET-241 on 9th NATO Military Sensing Symposium, Quebec City, QC, Canada, June 2017.
  • [19] Z. Liu, E. Blasch, Z. Xue, J. Zhao, R. Laganiere, and W. Wu. Objective assessment of multiresolution image fusion algorithms for context enhancement in night vision: a comparative study. IEEE transactions on pattern analysis and machine intelligence, 34(1):94–109, 2012.
  • [20] J. Ma, C. Chen, C. Li, and J. Huang. Infrared and visible image fusion via gradient transfer and total variation minimization. Information Fusion, 31:100–109, 2016.
  • [21] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In Advances in Neural Information Processing (NIPS) Workshop Autodiff, 2017.
  • [22] P. L. Suárez, A. D. Sappa, and B. X. Vintimilla. Infrared image colorization based on a triplet DCGAN architecture. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 212–217, July 2017.
  • [23] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976, July 2017.
  • [24] F. Shamsafar, H. Seyedarabi, and A. Aghagolzadeh. Fusing the information in visible light and near-infrared images for iris recognition. Machine Vision and Applications, 25(4):881–899, 2014.
  • [25] P. L. Suárez, A. D. Sappa, and B. X. Vintimilla. Learning to colorize infrared images. In International Conference on Practical Applications of Agents and Multi-Agent Systems, pages 164–172. Springer, 2017.
  • [26] J. M. Sullivan. Assessing the potential benefit of adaptive headlighting using crash databases. University of Michigan, Ann Arbor, Transportation Research Institute, 1999.
  • [27] A. Ulhaq, X. Yin, J. He, and Y. Zhang. FACE: Fully automated context enhancement for night-time video sequences. Journal of Visual Communication and Image Representation, 40:682–693, 2016.
  • [28] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. arXiv preprint arXiv:1711.11585, 2017.
  • [29] Y. Zheng, E. P. Blasch, and Z. Liu. Multispectral image fusion and night vision colorization. International Society for Optics and Photonics, 2018.
  • [30] Z. Zhou, M. Dong, X. Xie, and Z. Gao. Fusion of infrared and visible images for night-vision context enhancement. Applied optics, 55(23):6480–6490, 2016.