Humans see poorly in low-light conditions but have the advantage of rich colour vision in good lighting. Human eyes have cone photoreceptor cells, which are sensitive to colour, and rod photoreceptor cells, which are sensitive to brightness. The cones do not adapt well to low lighting conditions.
Colour vision is very important to the human brain. It helps to identify objects and to understand the surrounding environment. Studies have shown that colour vision improves the accuracy and speed of object detection and recognition by the human brain compared with monochrome or false-colour vision. Because human vision is biologically limited in the dark, artificial night vision has become increasingly important in military missions, pharmaceutical studies, driving in darkness, and security systems.
The use of thermal infrared cameras has increased considerably in many applications, because their long wavelength allows them to capture the invisible heat radiation of objects regardless of lighting conditions. They are robust against some obstacles and illumination variations and can capture objects in total darkness. However, the human visual interpretability of thermal infrared images is limited, and so transforming thermal infrared images to Visible spectrum images is extremely important.
The mapping from monochrome Visible images to colour images is called colorization and has been broadly investigated in computer vision and image processing. However, it is an ill-posed problem because the two images are not directly correlated. A single object has a single representation in the grayscale domain, while it might have several possible colour values in its true colour counterpart. This is also true for thermal images, which pose additional challenges. For instance, a single object under different temperature conditions will have different thermal signatures that correspond to a single colour value, while two objects of identical material at the same temperature will look identical in the thermal infrared images but may have different colour values in their Visible counterparts.
Transforming thermal infrared images to Visible images is a very challenging task, since the two modalities cover different parts of the electromagnetic spectrum and so their representations are different. In grayscale image colorization, the problem is to map the luminance values to the chrominance values only, while in thermal image colorization, both the luminance and the chrominance must be estimated given only the thermal signature. Accordingly, a solution should consider all of these challenges and also preserve the representation of the objects in the thermal spectrum, while predicting the plausible colour of objects that are relatively fixed in space and time, such as the sky, tree leaves, streets and traffic signs.
This paper addresses the problem of transforming thermal images to perceptually consistent Visible images using deep learning models. Our method predicts the low-frequency information of the Visible spectrum images and preserves the high-frequency information from the thermal infrared images. A pansharpening method is then used to merge these two bands and create a plausible Visible image.
2 Related Works
Earlier grayscale image colorization methods required human guidance, either to manually apply colour strokes to a selected region or to provide a reference image with the same colour palette. This helps the model to assume that neighbouring pixels with similar intensity values should be assigned a similar colour, e.g. scribble-based methods or similar-image-based methods. Recently, the successful applications of convolutional neural networks (ConvNets) have encouraged researchers to investigate automatic end-to-end ConvNet-based models for the grayscale colorization problem.
A few researchers have investigated the colorization of near-infrared (NIR) images and have shown high performance, due to the high correlation between the NIR and RGB images. The two wavelength ranges differ only slightly near the red spectrum, and thus NIR images have a representation similar to Visible light that is correlated with the red channel. In contrast, thermal images taken in the long-wavelength infrared (LWIR) spectrum do not correlate with the Visible images, since they measure the emitted radiation linked to the objects' temperature. Therefore, predicting the colour of an object from its thermal signature requires a local and global understanding of the image context.
Recently, Berg et al. and Nyberg et al. presented fully automatic ConvNets for the thermal infrared to RGB image colorization problem using different objective functions. Their models were robust against image pair misalignment. However, the generated images suffer from strong blur and from artefacts in different locations, e.g. objects missing from the scene, object deformations and some failure images. Kuang et al. used a conditional generative adversarial loss to generate realistic Visible images, combined with a perceptual loss based on the VGG-16 model, a TV loss to ensure spatial smoothness, and the MSE as content loss. Their work produced more realistic colour representations with fine details but also suffered from the same artefacts, missing objects and object deformations.
The previous works were trained on the KAIST-MS dataset, which consists of 95,000 thermal-Visible image pairs captured from a device mounted on a moving vehicle. Images were captured during day and night by a thermal camera with an output size of 320x256, interpolated to the same size as the Visible images (640x512) using an unknown method, and normalized using an unknown histogram equalization method. The training procedure in previous work resizes the thermal images back to their original size and then trains the models only on daytime images. Since the frames were extracted from video sequences, many consecutive images are very similar in most of the sets, and it is possible to overfit the dataset. It is also possible that the equalization coupled with the rescaling changed the thermal value distribution. Therefore, the proposed model is also trained on the ULB17-VT dataset, which contains raw thermal images.
For this work, the target is to transform the thermal infrared images from their temperature representations to colour images. For this reason, this work builds on existing works that have looked at the thermal colorization problem and uses the network architecture proposed by Berg et al. with small modifications adapted to our outputs.
Preprocessing steps are assumed necessary when the ULB17-VT dataset is used. Images are normalized using instance normalization, in contrast with the KAIST-MS dataset, which used histogram equalization. Spikes that occur with sharp low/high temperatures are detected and smoothed using a convolution kernel.
The method proposed here transforms the thermal image into low-frequency (LF) information in the colour Visible image space, so that it matches the LF information of the ground truth Visible image. The final colourized image is obtained by applying a post-processing pansharpening step, which merges the predicted Visible LF information with the high-frequency (HF) information extracted from the input thermal image. This step is assumed necessary to maintain an object's appearance from the thermal signature and to preserve it in the predicted colourized images. It also helps to avoid strong artefacts when object representations differ between the two spectra.
3.1 Proposed Model
The proposed model, as illustrated in Fig. 2, takes the thermal image as input and generates a fully colourized Visible image. For this generated output, the L1 content loss is used as an objective function to measure the dissimilarity with the ground truth Visible image. The low-frequency information is then obtained from both the generated colourized image and the ground truth Visible image by applying a Gaussian convolution layer. The dissimilarity between the LF information of the two images is measured using the MSE loss. The total loss is a weighted sum of the L1 and MSE losses, in which the MSE term is multiplied by a scaling factor since the MSE loss value is smaller than L1.
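The weighted objective described above can be sketched as follows. This is a minimal NumPy illustration rather than the training code: the kernel size, the standard deviation and the weight `lf_weight` are placeholder values chosen for the example, and `gaussian_blur` and `lf_weighted_loss` are hypothetical helper names.

```python
import numpy as np

def gaussian_blur(img, size=5, sigma=1.0):
    """Separable Gaussian low-pass filter for (H, W) or (H, W, C) arrays.

    size (odd) and sigma are illustrative values, not taken from the paper.
    """
    ax = np.arange(size) - (size - 1) / 2.0
    kernel = np.exp(-0.5 * (ax / sigma) ** 2)
    kernel /= kernel.sum()  # normalise so flat regions keep their value
    pad = size // 2
    out = np.asarray(img, dtype=np.float64)
    for axis in (0, 1):  # filter rows, then columns
        pad_width = [(0, 0)] * out.ndim
        pad_width[axis] = (pad, pad)
        padded = np.pad(out, pad_width, mode="reflect")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out

def lf_weighted_loss(pred, target, lf_weight=10.0):
    """L1 on the full images plus a scaled MSE on their low-frequency parts."""
    l1 = np.mean(np.abs(pred - target))
    mse_lf = np.mean((gaussian_blur(pred) - gaussian_blur(target)) ** 2)
    return l1 + lf_weight * mse_lf

rng = np.random.default_rng(0)
target = rng.random((8, 8, 3))
loss_same = lf_weighted_loss(target, target)
loss_shift = lf_weighted_loss(target + 0.1, target)
```

For a constant offset of 0.1 the L1 term is 0.1 and the LF MSE term is 0.01, which illustrates why the MSE term needs a scaling factor to contribute on a comparable scale.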
3.2 Representation and Pre-Processing
The pansharpening method is used, as shown in Fig. 2, as a final post-processing step. The thermal low-frequency information is first obtained by applying a Gaussian layer to the input thermal image. The thermal high-frequency information is then extracted by subtracting this low-frequency information from the input thermal image. The thermal image is represented with three channels so that it can be added to the Visible RGB images. The final colourized thermal image is obtained by adding the thermal high-frequency information, weighted by a scalar factor, to the generated low-frequency information.
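A minimal sketch of this fusion step is given here, under stated assumptions: the weight `alpha`, the Gaussian kernel size and its standard deviation are placeholder values, and the function names are hypothetical. It illustrates the general idea (predicted LF plus weighted thermal HF) rather than the exact implementation.

```python
import numpy as np

def gaussian_blur(img, size=5, sigma=1.0):
    """Separable Gaussian low-pass filter; size and sigma are illustrative."""
    ax = np.arange(size) - (size - 1) / 2.0
    kernel = np.exp(-0.5 * (ax / sigma) ** 2)
    kernel /= kernel.sum()
    pad = size // 2
    out = np.asarray(img, dtype=np.float64)
    for axis in (0, 1):
        pad_width = [(0, 0)] * out.ndim
        pad_width[axis] = (pad, pad)
        padded = np.pad(out, pad_width, mode="reflect")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out

def pansharpen(pred_lf_rgb, thermal, alpha=0.5):
    """Fuse predicted Visible LF (H, W, 3) with thermal HF from an (H, W) frame."""
    thermal = np.asarray(thermal, dtype=np.float64)
    thermal_lf = gaussian_blur(thermal)      # low-pass version of the input
    thermal_hf = thermal - thermal_lf        # what the low-pass removed
    thermal_hf_rgb = np.repeat(thermal_hf[..., None], 3, axis=2)  # 1 -> 3 channels
    return pred_lf_rgb + alpha * thermal_hf_rgb

rng = np.random.default_rng(0)
pred_lf = rng.random((8, 8, 3))
fused_flat = pansharpen(pred_lf, np.full((8, 8), 0.3))  # flat thermal: no HF
fused = pansharpen(pred_lf, rng.random((8, 8)))
```

A flat thermal frame carries no high-frequency content, so the fused image equals the predicted LF image in that case; edges and textures in the thermal frame are what the step adds.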
The pansharpening method is first applied to the ground truth Visible images in order to inspect and visualize the pansharpened colourized images before training the model. The thermal signature of the sky in the thermal images is very low with respect to other objects, while humans and other heated objects have a higher thermal signature. The normalization process makes the sky values very close to zero, while in the Visible images this value should be around one. For this reason, the thermal infrared images are inverted before any processing, which results in a value around one for the sky in the thermal images.
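The normalization and inversion can be illustrated with a short sketch. It assumes that normalization means per-image min-max scaling to [0, 1] and that inversion means `1 - x`; the text does not spell out either operation, so both are assumptions, and `normalize_and_invert` is a hypothetical name.

```python
import numpy as np

def normalize_and_invert(thermal):
    """Scale one raw thermal frame to [0, 1], then invert it.

    Assumption: per-image min-max scaling followed by 1 - x, so that cold
    regions such as the sky end up near one.
    """
    t = np.asarray(thermal, dtype=np.float64)
    lo, hi = t.min(), t.max()
    if hi == lo:
        return np.zeros_like(t)  # flat frame: no contrast to preserve
    return 1.0 - (t - lo) / (hi - lo)

# Toy frame: 10 plays the role of a cold sky pixel, 50 of a warm object.
frame = np.array([[10.0, 20.0], [30.0, 50.0]])
inverted = normalize_and_invert(frame)
```

In this toy frame the coldest pixel maps to one and the warmest to zero, matching the behaviour described for the sky.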
The proposed method relies on maintaining the high-frequency information taken from the thermal images, which can reduce the evaluation results compared to the state-of-the-art when pixel-wise measurements are used. For validation purposes, the PSNR between the ground truth Visible images and the corresponding synthesised images was measured for a fixed parameter value, as shown in Fig. 3. This gives an idea of the maximum validation value that can be achieved using the proposed model. The perceptual quality of the synthesised images is visualized in Fig. 4. The parameter value was chosen as a trade-off between better perceptual image quality and a reasonable average PSNR on ULB17-VT.v2 and KAIST-MS. If the value is decreased the PSNR increases, but the images become perceptually less plausible.
When the weighted thermal HF information is added to the Visible LF information, the synthesized image can have values outside the valid range in some areas. This results in a black or white colour effect when the image is clipped to the valid range, as shown in the red rectangle in Fig. 4. Re-normalizing the image instead of clipping can reduce the image contrast or affect the true colour values, since the low-frequency information of the three RGB channels is obtained and added. This problem could be addressed by exploring different normalization methods in the pre-processing step and different merging procedures in the post-processing step.
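The difference between clipping and re-normalizing can be seen on a toy example. The valid range is assumed here to be [0, 1], and the values are made up for illustration.

```python
import numpy as np

# Hypothetical fused pixel values, two of which fall outside [0, 1].
fused = np.array([-0.2, 0.1, 0.5, 0.9, 1.3])

# Clipping saturates out-of-range pixels to pure black or white
# but leaves in-range pixels untouched.
clipped = np.clip(fused, 0.0, 1.0)

# Min-max re-normalization keeps every pixel in range without saturation,
# but it shifts and compresses the in-range values as well.
renormalized = (fused - fused.min()) / (fused.max() - fused.min())
```

Clipping keeps the in-range value 0.5 unchanged, whereas re-normalization moves it, which is the loss of contrast and colour fidelity mentioned above.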
De-spiked thermal images are obtained using a convolution kernel that replaces the centre pixel with the median value of the kernel area if the pixel value is more than three times the standard deviation of the kernel area.
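One possible reading of this de-spiking rule is sketched here. The window size is a placeholder, and the test "the centre pixel deviates from the window median by more than three window standard deviations" is an interpretation of the description rather than a confirmed detail; `despike` is a hypothetical name.

```python
import numpy as np

def despike(img, size=5, n_sigma=3.0):
    """Replace isolated spikes in an (H, W) frame with the local median.

    size (odd) and the deviation-from-median test are assumptions.
    """
    src = np.asarray(img, dtype=np.float64)
    pad = size // 2
    padded = np.pad(src, pad, mode="reflect")
    out = src.copy()
    h, w = src.shape
    for i in range(h):
        for j in range(w):
            window = padded[i:i + size, j:j + size]  # centred on (i, j)
            median = np.median(window)
            if abs(src[i, j] - median) > n_sigma * window.std():
                out[i, j] = median
    return out

# Flat frame at 20 with a single hot spike in the middle.
frame = np.full((7, 7), 20.0)
frame[3, 3] = 500.0
cleaned = despike(frame)
```

The explicit double loop keeps the logic readable; a production version would normally vectorise the window statistics.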
3.3 Networks Architecture
TICPan-BN: The proposed method using the network architecture of Berg et al.
TICPan: The proposed method using the same network architecture, with the batch normalization layers replaced by instance normalization layers. It yields better colour representations and better metric scores.
For this work, the ULB17-VT dataset, which contains 404 Visible-thermal image pairs, was used. The number of images was increased to 749 Visible-thermal image pairs using the same device, and 74 pairs were held out for testing. Thermal images were extracted in their raw format and stored as 16-bit floats per pixel. This new dataset, ULB-VT.v2, is available at http://doi.org/10.5281/zenodo.3578267.
The KAIST-MS dataset was also used, following the existing works on the thermal colorization problem. Training was done only on daytime images, and the thermal images were resized to their original resolution of 320x256 pixels. The images in KAIST-MS were recorded continuously while the car was driving and stopping. This results in a high number of redundant images and explains the over-fitting behaviour and the failure results in previous work. For this reason, only every third image is taken in the training set, which yields a set of 10,027 image pairs, while all of the images in the test set are used.
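Taking every third frame from an ordered sequence amounts to a strided slice. The file names in this sketch are made up for illustration and do not reflect the KAIST-MS naming scheme.

```python
# Hypothetical, already sorted list of consecutive training frames.
frames = [f"frame_{i:05d}.png" for i in range(10)]

# Keep every third frame to reduce redundancy between neighbouring frames.
train_subset = frames[::3]
```

With ten frames this keeps the frames at indices 0, 3, 6 and 9.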
4.2 Training Setup
All experiments were implemented in PyTorch and performed on an NVIDIA TITAN Xp graphics card. TIR2Lab and TIC-CGAN were re-implemented and trained as explained in the original papers.
The proposed model, TICPan, was trained using the Adam optimizer with default PyTorch parameters, and weights were initialized with He normal initialization. In all experiments, the learning rate was decayed after 400 epochs. The network uses LeakyReLU and dropout layers.
In each training batch, 32 image patches were randomly cropped. For each iteration, a random augmentation was applied by flipping horizontally or vertically and by rotating. Since the number of training images in KAIST-MS is 14 times larger than in ULB-VT.v2, the number of iterations for the model trained on ULB-VT.v2 was increased to match the model trained on KAIST-MS.
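A paired crop-and-augment step of this kind can be sketched as follows. The patch size and the restriction to rotations by multiples of 90 degrees are assumptions made for the example, and `random_paired_patch` is a hypothetical name. The key point is that the thermal and Visible images must receive exactly the same crop, flip and rotation so that the pair stays aligned.

```python
import numpy as np

def random_paired_patch(thermal, visible, size, rng):
    """Crop and augment an aligned (H, W) thermal / (H, W, 3) Visible pair."""
    h, w = thermal.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    t = thermal[top:top + size, left:left + size]
    v = visible[top:top + size, left:left + size]
    if rng.random() < 0.5:           # horizontal flip
        t, v = t[:, ::-1], v[:, ::-1]
    if rng.random() < 0.5:           # vertical flip
        t, v = t[::-1], v[::-1]
    k = int(rng.integers(0, 4))      # rotation by k * 90 degrees (assumed range)
    return np.rot90(t, k), np.rot90(v, k)

rng = np.random.default_rng(0)
thermal = rng.random((16, 20))
# Toy Visible image built from the thermal frame so alignment can be checked.
visible = np.stack([thermal] * 3, axis=-1)
t_patch, v_patch = random_paired_patch(thermal, visible, size=8, rng=rng)
```

A batch is then obtained by calling the function once per sample and stacking the results.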
For validation, the peak signal-to-noise ratio (PSNR), structural similarity (SSIM) and root-mean-square error (RMSE) were computed between the generated colorized images and the true images.
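The two pixel-wise metrics can be written in a few lines. This sketch assumes images scaled to [0, 1] (hence `max_val=1.0`) and omits SSIM, which needs local window statistics and is usually taken from an image-processing library.

```python
import numpy as np

def rmse(pred, target):
    """Root-mean-square error between two images of the same shape."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; max_val is the assumed peak value."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

target = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)
error = rmse(pred, target)
quality = psnr(pred, target)
```

Both metrics depend only on the per-pixel difference, which is why an image that keeps thermal high-frequency content is penalised even when it looks plausible.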
4.3 Quantitative Evaluation
The proposed model was evaluated on transforming thermal infrared images to RGB images compared with the state-of-the-art using the measurement metrics shown in Table 1.
The proposed model evaluation was performed on the full colorized thermal image, which is the result of the fusion of the predicted Visible LF information and the input thermal HF information. This resulted in a higher pixel-wise error compared to other models since the HF content of the image was taken from the thermal domain. However, our method achieved comparable results with the synthesized images as shown in Fig. 3.
It is believed that pixel-wise metrics are not suitable for the colorization problem, where the perception of the image plays an important role. TIR2Lab achieved higher evaluation values, while its generated images are uninterpretable. TIC-CGAN has 12.266 million parameters, which explains the over-fitting behaviour in its generated images. TICPan-BN was excluded because it has the lowest evaluation values and lower image quality.
4.4 Qualitative evaluation
Four examples on the ULB17-VT.v2 dataset are presented in Fig. 8. The TIR2Lab model generated approximately correct colour representations for trees, with a blur effect, but failed to produce fine textures and to preserve the image content. On the other hand, the TIC-CGAN model generated images with better colour quality and fine textures that were more realistic. This is most noticeable, as an over-fitting behaviour, when the test image comes from the same distribution as the densely represented images in the training set, such as image number (650).
TICPan generates images that have strong true colour values for objects that are relatively fixed in space and time, such as the sky, tree leaves, streets and buildings. The sky is represented in white or light blue, trees are in different shades of green, and streets and buildings are also represented with approximately true colour values. However, objects like humans are represented in grey or in black due to the clipping effect. Our method ensures that the thermal signature of an object does not disappear or get deformed in the image transformation. The model cannot predict true colour values for varying objects; instead it predicts an averaged colour value, represented in grey, and the final pansharpening process maintains their appearance in the generated colourized images.
In Fig. 9, four examples on the KAIST-MS dataset are presented. The TIR2Lab method produced approximately correct chrominance values, but its images are heavily blurred and it struggles to recover fine textures accurately. The artefacts are very obvious in the generated images, and some objects, such as the walking person in (S6V3I03016), are missing from its outputs. The TIC-CGAN model produced perceptually better colourized thermal images with realistic textures and fine details, but they suffer from the same side effects of missing objects and object deformation. This is due to the GAN adversarial loss, which learns the dataset distribution and estimates what should appear in each location, and also to the large size of the model and its over-fitting behaviour. This is seen in (S8V2I01723) in the falsely generated road surface markings and in the missing person in (S6V3I03016). In contrast, the proposed TICPan model does not generate very plausible colour values on the KAIST-MS dataset, but it generates robust perceptual night vision images that maintain the appearance of objects.
4.4.1 Deformation and missing Objects
Fig. 9 shows missing objects in the TIC-CGAN generated images, such as the person in (S0V0I00601) and the cars in (S0V0I01335). We can also recognize the object deformation in image number (428) and image number (598), while in the TICPan model objects are retained in the generated images.
4.4.2 Overfitting behavior
Fig. 6 illustrates the over-fitting problem in the TIC-CGAN model. With 12M parameters, it is 12 times bigger than the proposed model. This makes it very easy for the model to overfit the dataset and fail to generalise to unseen data. In image number (1250), the model can predict the exact colour of the two cars because a similar image appeared in the training set. In the second image, number (S0V0I00613), whenever an object comes from the left with a size similar to a bus, the model predicts it as a red bus. The TICPan model cannot predict the exact colour of cars, but instead generates an average grey colour.
4.4.3 Night vision
The TIC-CGAN model failed to generate interpretable images from images that were taken at night, because the image distribution and the image contrast were different from those of the training images. However, the TICPan model does not suffer from this failure thanks to the pansharpening process, as shown in Fig. 7. In image number (1784), the true RGB image is completely dark and the TICPan model generates a robust perceptual night vision image compared to the TIC-CGAN model. This is also illustrated in image number (S9V0I00000), where the TICPan model generates a night vision image with fewer artefacts than the TIC-CGAN model. It should be noted that these artefacts are due to the histogram equalization method used in KAIST-MS.
The objective in this study was to address the problem of transforming thermal infrared images to Visible images with robust perceptual night vision quality. In contrast to the existing methods that map images automatically from their thermal signature to chrominance information, our proposed model seeks to maintain the appearance of objects in their thermal representation from the thermal images and to predict possible colour values.
The evaluation showed that the proposed model produces perceptually better images with fewer artefacts and the best representation for night images. This supports the generalization capability of the model. The generated images are robust and reliable, enabling users to better interpret the images while using night vision. In cases where missing or deformed objects can cause serious accidents, the pansharpening process is critically necessary.
This work was supported by the European Regional Development Fund (ERDF) and the Brussels-Capital Region within the framework of the Operational Programme 2014-2020 through the ERDF-2020 project F11-08 ICITY-RDI.BRU. We thank Thermal Focus BVBA for their support.
- Multimodal sensor fusion in single thermal image super-resolution. arXiv preprint arXiv:1812.09276.
- Generating visible spectrum images from thermal infrared. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1143–1152.
- Unsupervised diverse colorization via generative adversarial networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 151–166.
- (1999) The role of color and false color in object recognition with degraded and non-degraded images. Technical report, Naval Postgraduate School, Monterey, CA.
- (2015) Deep colorization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 415–423.
- (2017) Pixcolor: pixel recursive colorization. arXiv preprint arXiv:1705.07208.
- Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
- (2015) Multispectral pedestrian detection: benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1037–1045.
- (2016) Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (TOG) 35 (4), pp. 110.
- (2005) Colorization by example. In Rendering Techniques, pp. 201–210.
- (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
- (2018) Thermal infrared colorization via conditional generative adversarial network. arXiv preprint arXiv:1810.05399.
- (2016) Learning representations for automatic colorization. In European Conference on Computer Vision, pp. 577–593.
- (2004) Colorization using optimization. In ACM Transactions on Graphics (TOG), Vol. 23, pp. 689–694.
- (2016) Infrared colorization using deep convolutional neural networks. In 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 61–68.
- (2018) Transforming thermal images to visible spectrum images using deep learning.
- (1996) An assessment of the impact of fused monochrome and fused color night vision displays on reaction time and accuracy in target detection. Ph.D. Thesis, Naval Postgraduate School, Monterey, California.
- (2002) Transferring color to greyscale images. In ACM Transactions on Graphics (TOG), Vol. 21, pp. 277–280.
- (2016) Colorful image colorization. In European Conference on Computer Vision, pp. 649–666.
- Tv-gan: generative adversarial network based thermal to visible face recognition. In 2018 International Conference on Biometrics (ICB), pp. 174–181.