Multimodal Sensor Fusion in Single Thermal Image Super-Resolution

12/21/2018 ∙ by Feras Almasri, et al. ∙ Université Libre de Bruxelles

With the fast growth in the visual surveillance and security sectors, thermal infrared images have become increasingly necessary in a large variety of industrial applications. This is true even though IR sensors are still more expensive than RGB counterparts of the same resolution. In this paper, we propose a deep learning solution to enhance the thermal image resolution. The following results are given: (I) Introduction of a multimodal, visual-thermal fusion model that addresses thermal image super-resolution by integrating high-frequency information from the visual image. (II) Investigation of different network architecture schemes in the literature, their up-sampling methods, learning procedures, and their optimization functions, showing their beneficial contribution to the super-resolution problem. (III) A benchmark ULB17-VT dataset that contains thermal images and their visual image counterparts is presented. (IV) Presentation of a qualitative evaluation on a large test set with 58 samples and 22 raters, which shows that our proposed model outperforms the state of the art.




1 Introduction

In digital images, what we perceive as details greatly depends on the image resolution: the higher the resolution, the more accurate the measurement. The visible RGB image has rich information, but objects can occur under different conditions of illumination, occlusion and background clutter. These conditions can severely degrade a system's performance. Visible data alone is therefore often insufficient, and thermal images have become a common tool to overcome these problems. Thermal images are used in industrial processes such as heat and gas detection, and they are also used to solve problems such as object detection and self-driving cars.

Integrating information captured by different sensors, such as RGB and thermal, offers rich information to improve system performance. In particular, when the nature of the problem requires this integration, and when the environmental conditions are not optimal for a single-sensor approach, multimodal sensor fusion methods have been proposed [3, 23]. However, thermal sensor cost grows significantly with resolution, so thermal sensors are primarily used at low resolution and low contrast, which introduces the need for a higher-resolution sensor [4]. As a result, a variety of computer vision techniques have been developed to enhance thermal resolution given a low-resolution input.

The single-image super-resolution problem has been widely studied and is well-defined in computer vision [1, 22, 17]. It is defined as the non-linear mapping and prediction of a high-resolution image (HR) given only one low-resolution image (LR). However, this is an ill-posed problem since it is a one-to-many mapping: multiple HR images can produce the same LR image, so mapping from LR to SR amounts to recovering the lost information given only the information in the LR image. Though classical methods achieved high performance, they are limited by their handcrafted feature techniques.

Recently, with the development of the convolutional neural network (ConvNet), several methods have shown the ability to learn highly non-linear transformations. Rather than using handcrafted features, the ConvNet model is capable of automatically learning rich features from the problem domain and adapting its parameters by minimizing a loss function. Most recently, the ConvNet model has been widely used in the SR problem and achieved new state-of-the-art performance. Despite significant progress, the proposed solutions still lack the ability to recover high-frequency information.

Most conventional SR methods focus on measuring the similarity between the SR image and its ground truth via a pixel-wise distance measurement, and the reconstructed images are consequently blurry, missing sharp edges and texture details. This problem stems from the objective function: the classical way to optimize the target is to minimize the content loss via the mean square error (MSE), which by definition finds average values in the HR manifold and consequently maximizes the peak signal-to-noise ratio (PSNR). By applying only the content loss, low-frequency information is restored but high-frequency information is not. Moreover, MSE is limited in preserving human visual perception, and the PSNR measurement cannot indicate SR visual perception [17, 19].
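The inverse relationship between MSE and PSNR can be made concrete with a short, self-contained sketch (pure Python; the toy pixel values are hypothetical):

```python
import math

def psnr(hr, sr, max_val=255.0):
    """Peak signal-to-noise ratio between two equally sized images.

    `hr` and `sr` are flat lists of pixel intensities. PSNR grows as the
    mean square error shrinks, which is why MSE-trained models score well
    on PSNR while still looking blurry and over-smoothed.
    """
    mse = sum((h - s) ** 2 for h, s in zip(hr, sr)) / len(hr)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)

hr = [10, 20, 30, 40]
blurry = [12, 18, 32, 38]  # small pixel-wise error, so the PSNR is high
score = psnr(hr, blurry)
```

A blurry output that is close to every pixel on average scores well, even though it may lack exactly the high-frequency detail a human viewer notices first.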

Different approaches, such as perceptual loss [14] and adversarial loss [20], have been proposed to address this drawback, and have shown important progress in recovering high-frequency details. Instead of only performing a pixel-wise distance measurement, a mixture of these loss functions can generate high-quality SR images. Different model schemes have also yielded higher image quality, such as learning the residual information [15] or gradually up-sampling [18].

The primary focus of this work is to build a deep learning model that applies multimodal sensor fusion using visible (RGB) and thermal images. The model should integrate the two inputs and enhance the thermal image resolution; the latter part is inspired by recent advances in the RGB super-resolution problem. A thermal GAN-based framework is proposed to enhance the LR thermal image by integrating the rich information in the HR visual image. Notably, the HR visual sensor is considerably cheaper than the LR thermal sensor, and it captures extra information from a different domain. We show that HR visual images can help the model fill in the missing values and generate higher-frequency details in the reconstructed SR thermal image. The proposed model uses the content loss to preserve the low-frequency image content, and the adversarial loss to integrate the high-frequency information.

2 Related Works

(a) SRCNN [6]
(b) VDSR [15]
(c) FSRCNN [26]
(d) SRCNN-Dconv
(e) VDSR-Deconv
(f) LAPSRN [18]
Figure 1: Network architecture schemes.

Thus far, a number of studies in computer vision and machine learning have been proposed. This discussion focuses on example-based ConvNet methods. Fig. 1 depicts the different model architecture schemes, their up-sampling methods, and their learning procedures.

A. Resolution Training. The model can be trained to extract features from an up-sampled image, directly mapping these features to the SR image as in [6]. The input is either pre-processed using an interpolation method, as shown in Fig. 1 (a), or up-sampled using trainable parameters, as shown in Fig. 1 (d). In another approach, the model can extract features directly from the low-resolution image and map them into high resolution using up-sampling techniques at the end of the network, as in [26]; this model, shown in Fig. 1 (c), accelerates the model performance.

B. Residual Learning and Supervision. Since the super-resolution output is similar to the low-resolution input, with only the high-frequency information missing, the learning can be made to produce only the residual information. VDSR [15] and DRSN [16] trained models that learn the residual information between LR and HR images. They used a skip connection, which adds the input image to the model's residual output, to produce the SR image. Lai et al. [18] found that reconstructing the SR image immediately at a high up-sampling scale is a challenging problem. They therefore addressed the problem with a gradual up-sampling procedure, using deep supervision at each up-sampling scale and residual learning, as shown in Fig. 1 (f).
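The skip connection at the heart of residual SR learning can be stated in a few lines. This is an illustrative sketch with hypothetical 1-D "images", not the authors' code:

```python
def residual_sr(upsampled_lr, predicted_residual):
    """Residual learning: the network predicts only the missing
    high-frequency detail; the skip connection adds it back onto the
    interpolated low-resolution input to form the SR output."""
    return [p + r for p, r in zip(upsampled_lr, predicted_residual)]

upsampled_lr       = [100, 100, 100, 100]  # smooth, low-frequency content
predicted_residual = [-6, 6, -6, 6]        # high-frequency detail (edges)
sr = residual_sr(upsampled_lr, predicted_residual)
```

Because the network only has to model the residual, its output is close to zero almost everywhere, which makes the optimization easier than regressing full pixel intensities.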

C. Up-sampling Procedure. A mixture of network architectures and learning procedures can be used with different up-sampling methods. In SRCNN [6] and VDSR [15], the network takes an interpolated image as input, using bilinear or bicubic interpolation or a Gaussian-spline kernel [10]. The up-sampling can be made learnable and accelerated by using a transposed convolution (deconvolution) layer, as in Fig. 1 (d) and (e), which are modified versions of SRCNN and VDSR, or added at the end of the network as in FSRCNN, Fig. 1 (c). A trainable sub-pixel convolution (PixelShuffle) layer, as in ESPCN [24], can also be used at the end of the network to up-sample the input features, as in [20].
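A sub-pixel (PixelShuffle) layer is a deterministic rearrangement: r*r low-resolution channels are interleaved into one channel at r times the resolution. A minimal pure-Python sketch of that rearrangement, following the channel ordering used by common PixelShuffle implementations:

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) feature tensor (nested lists) into
    (C, H*r, W*r), as the sub-pixel convolution layer of ESPCN does."""
    cr2 = len(x)
    h, w = len(x[0]), len(x[0][0])
    c = cr2 // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c)]
    for ch in range(c):
        for i in range(h * r):
            for j in range(w * r):
                # the source channel index encodes the sub-pixel
                # position (i % r, j % r) inside each output block
                src = ch * r * r + (i % r) * r + (j % r)
                out[ch][i][j] = x[src][i // r][j // r]
    return out

# four 1x1 feature maps become one 2x2 output map
x = [[[1]], [[2]], [[3]], [[4]]]
hr_map = pixel_shuffle(x, 2)  # -> [[[1, 2], [3, 4]]]
```

Since the rearrangement itself has no parameters, the learning happens in the convolution that produces the r*r channels, which is why this layer is both trainable and cheap.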

D. Optimization Function. The optimization procedure seeks to minimize the distance between the original HR image and the generated SR image. The most used optimization function in the SR problem is the content loss, computed using the MSE as in [15] or the Charbonnier loss as in [18]. SRGAN [19] instead uses the adversarial loss, and [14] uses the perceptual similarity loss to enhance the reconstructed image resolution.

3 Proposed Methods

In this work, the first ConvNet that integrates visual-thermal images to generate thermal image super-resolution is described. Our main contributions are:

  • Unlike the RGB-based super-resolution problem, advancement in thermal super-resolution is still relatively limited. There are therefore few benchmarks in thermal image SR [22], and they are rarely available. To this end, the authors created ULB17-VT, a benchmark multimodal-sensor dataset that contains thermal images and their visual image counterparts.

  • The model is inspired by the work in  [19]. Different modified network architecture schemes, up-sampling methods and optimizing functions are investigated, to verify our model contribution with reference to the current improvement in the super-resolution problem literature.

  • Confirmation is given which shows that this thermal SR model, which integrates the visual-thermal fusion, does generate higher human perceptual quality images. This improvement is due to the rich details in the visual images, and the common relation with their thermal image counterpart.

  • A qualitative evaluation method based on 58 samples, which is a large qualitative evaluation in the SR problem domain, was also used to test how well the model performed. Twenty-two people were asked to rate the better generated image. The results of the study show that the proposed model outperforms state-of-the-art methods.

3.1 ULB17 Thermal Dataset

A FLIR-E60 camera with multimodal sensors (thermal and color) was used. The thermal sensor produces 320 x 240 pixel resolution, with a thermal sensitivity and operating temperature range that provide good quality thermal images. The color sensor is 3 megapixels and produces 2048 x 1536 pixel resolution. This device captures thermal and RGB images simultaneously, aligned with the same geometric information. Thermal images were extracted in their raw format and logged as 16-bit floats per pixel in one channel, in contrast with [11], in which samples are post-processed and compressed into uint8 format. All samples in this study are aligned automatically by the device's supporting software with their associated cropped color images of size 240 x 320 pixels and the same geometry.

Images in our benchmark were captured inside the ULB (Université Libre de Bruxelles) campus. Each image acquisition took approximately 3 seconds, which made the data acquisition process rather slow. The acquisition was made in different scenes and environments (indoor and outdoor, during winter and summer, and with static and moving objects), as shown in Fig. 2. Thermal and RGB images were manually extracted and annotated, for a total of 570 pairs. The dataset is divided into 512 training and validation samples and 58 test samples.

Figure 2: Visual-thermal samples from our ULB17-VT benchmark.

3.2 Proposed framework

This section describes our model methodology, including how different model schemes, up-sampling methods, and optimization functions are essential to producing SR images with better human perceptual quality.

The aim is to estimate the HR thermal image from its LR counterpart at a 4x factor. The LR images are produced by first applying Gaussian pyramids to the samples, which are then down-scaled by a factor of 0.25x, from 240 x 320 pixels to 60 x 80 pixels.
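The LR-generation step can be sketched as follows; average pooling stands in here for the Gaussian-pyramid blur-and-subsample, so this is an approximation of the procedure rather than a reproduction of it:

```python
def downscale(img, factor):
    """Down-scale a 2-D image (nested lists) by averaging each
    factor x factor block -- a crude stand-in for the Gaussian-pyramid
    reduction used to build the LR training pairs."""
    h, w = len(img), len(img[0])
    return [[sum(img[i * factor + di][j * factor + dj]
                 for di in range(factor) for dj in range(factor)) / factor ** 2
             for j in range(w // factor)]
            for i in range(h // factor)]

# a 240 x 320 dummy thermal frame shrinks to 60 x 80, the 0.25x factor above
frame = [[0.0] * 320 for _ in range(240)]
lr = downscale(frame, 4)
```

The low-pass step (here, block averaging) matters: subsampling without it would alias the high frequencies instead of discarding them, changing the statistics the SR model must invert.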

Our proposed model belongs to the model scheme of FSRCNN shown in Fig. 1 (c). The core of our model scheme is to perform feature extraction and feature mapping at the image's original size. The model is constructed using five residual blocks with an identical layout, inspired by SRGAN [19]. The model then up-samples the feature maps with two sub-pixel convolution layers, as proposed by [24]. Starting from this main model as a baseline, the model's gains were investigated as follows:

  • Instead of up-sampling the features at the end of the network, we tested the model scheme of SRCNN [6] shown in Fig. 1 (d). To investigate high-resolution training, the sub-pixel layers are removed from this model, and two Deconv layers, each up-sampling by a factor of 2, are added at the beginning of the network.

  • The residual scheme proposed by VDSR, as shown in Fig. 1 (b) and (e), is tested using (i) bilinear interpolation, (ii) bicubic interpolation, and (iii) different trainable up-sampling layers.

  • Visual images are integrated to investigate whether their texture details can enhance the thermal SR generating process.

  • Recently, the generative adversarial network (GAN) [8] has seen creative application to several tasks [13, 2, 27, 28, 25]. GAN provides a powerful framework, consisting of an image transformer and an image discriminator, that can generate images with high human perceptual quality that are close to the real image distribution. To assess the GAN contribution, our baseline model is re-trained as a GAN-based model.

(a) Generator
(b) Discriminator
Figure 3: Architecture of our generator and discriminator networks with the corresponding kernel size (k), number of channels (n) and stride (s) where it changes. The highlighted area marked (*) indicates the model that merges RGB and thermal channels.

3.3 Network Architecture

The generator baseline network, thermal SRCNN (TSRCNN), shown in Fig. 3, consists of 5 identical residual blocks, inspired by [19] and following the layout proposed by [9]. Each block consists of two convolutional layers with 64 feature maps, each followed by an ELU activation function. Batch-normalization layers [12] were removed from the residual block since, as indicated in [21, 10], they are unnecessary for the SR problem; it has also been reported that batch-normalization layers can harm the generated output. To preserve the feature map size, reflective padding is used around the boundaries before each convolution layer. The feature map resolution is then increased using two sub-pixel convolution layers [24].
Visual RGB images are integrated into the model using two convolution layers with 64 feature maps, each followed by an ELU activation function. Each convolution layer uses a stride of 2, which reduces the feature map size to match the thermal input before the two are fused, forming the visual-thermal SRCNN (VTSRCNN) model as shown in Fig. 3 (*). The fusion is handled by concatenating the two sets of feature maps, followed by a convolution layer that reduces the channel dimensionality to 64 channels. Due to the high correlation between the visual and thermal modes and the rich texture information in the visual image, the network is expected to learn these features and fuse them to produce SR thermal images with high perceptual quality. Given an input thermal image and an input RGB image, the objective function seeks to map the LR thermal image to the HR thermal image.
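The shape bookkeeping of this fusion branch can be checked with a small sketch. It assumes "same"-padded convolutions (so a stride-2 layer halves each spatial dimension); kernel sizes are left out because only channel counts and strides matter here:

```python
def conv_shape(c, h, w, out_c, stride=1):
    """Output shape of a 'same'-padded convolution: channels become
    out_c, spatial dims shrink by the stride (ceiling division)."""
    return out_c, (h + stride - 1) // stride, (w + stride - 1) // stride

# RGB branch: 240 x 320 visual input through two stride-2 convolutions
shape = (3, 240, 320)
for _ in range(2):
    shape = conv_shape(*shape, out_c=64, stride=2)
# now 64 x 60 x 80 -- the same spatial size as the LR thermal features

# fusion: concatenate with the 64 thermal feature maps, then reduce back
concat_c = shape[0] + 64                          # 128 channels after concat
fused = conv_shape(concat_c, shape[1], shape[2], out_c=64)
```

Matching the spatial grid before concatenation is what lets a plain channel-wise concat act as the fusion operator; the reduction convolution then lets the network weigh thermal against visual features per location.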

The last two models are re-trained as GAN-based models to form the proposed GAN models (TSRGAN and VTSRGAN). To do this, a discriminator network, shown in Fig. 3, is added to classify generated images against original images. The network architecture is similar to the work in [19], except for the batch-normalization layers and the ELU activation function. The model consists of eight convolution layers that increase the number of feature channels by a factor of 2, from 64 to 512. The image resolution is reduced using stride-2 convolutions between layers, each of which doubles the channel count. On top of the 512 feature maps, adaptive average pooling is used, followed by two dense layers. Finally, a sigmoid activation function produces the probability of the input being an original HR image or a generated SR thermal image.

3.4 Loss Function

Our baseline models (TSRCNN and VTSRCNN) are trained using only the content loss function, which is the mean square error (MSE), while our GAN-based models (TSRGAN and VTSRGAN) are trained on a weighted sum of the content loss and the adversarial loss, which is obtained from the discriminator network. Using only the adversarial loss, the models are not able to converge. This is most likely due to the lack of overlap between the supports of the original-image and generated-image distributions. Therefore, the content loss was necessary for the GAN-based model. The models that take only thermal images or visual-thermal images are shown in Eq. (1).


Content Loss. The MSE in Eq. (2) is the most used optimization function in the SR image problem [6, 26, 15]. The model is trained to optimize the Euclidean distance between the reconstructed image and the ground truth. Although minimizing the MSE maximizes the PSNR, it lacks high-frequency details, which results in blurred and over-smoothed images. However, it does help the model preserve the low-frequency details from the LR image, and it supports the adversarial loss, which alone could not always ensure convergence.


Adversarial Loss. To recover high-frequency details from the original HR distribution, the adversarial loss is added to the content loss. The models are first pre-trained on the content loss (MSE) and then fine-tuned using the total loss function (Eq. 3), which is a weighted sum of the adversarial loss (Eq. 5) and the content loss (Eq. 1), with a fixed weighting parameter.


The discriminator network is trained using the cross-entropy loss function (Eq. 4), which distinguishes the original images from the generated images. The generator is trained using the recommended formulation (Eq. 13 in [7]).
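Under the standard SRGAN-style formulation the text describes, these losses take the following form. This is a sketch of the standard forms, with $G$ the generator, $D$ the discriminator, $I^{HR}$/$I^{LR}$ the HR/LR thermal images, and $\lambda$ the fixed weight; it is not necessarily term-for-term identical to the original equations:

```latex
% content loss (Eq. 2): pixel-wise MSE between generated SR and ground-truth HR
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}
    \left( I^{HR}_{x,y} - G(I^{LR})_{x,y} \right)^{2}

% total generator loss (Eq. 3): weighted sum of content and adversarial terms
\mathcal{L}_{G} = \mathcal{L}_{\mathrm{MSE}} + \lambda\,\mathcal{L}_{\mathrm{adv}}

% discriminator loss (Eq. 4): cross-entropy on real vs. generated images
\mathcal{L}_{D} = -\log D\!\left(I^{HR}\right)
                  - \log\!\left(1 - D\!\left(G(I^{LR})\right)\right)

% adversarial term (Eq. 5), non-saturating form (Eq. 13 in [7])
\mathcal{L}_{\mathrm{adv}} = -\log D\!\left(G(I^{LR})\right)
```

The non-saturating generator term keeps gradients useful early in training, when the discriminator easily rejects generated samples.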


3.5 Implementation and training details

All of the models are implemented in PyTorch and trained on an NVIDIA TITAN Xp using randomly selected mini-batches of size 12, plus 12 RGB mini-batches when the visual-thermal fusion model is used. The generator uses the RMSprop optimizer with alpha = 0.9. In the GAN-based model, the discriminator is trained using the SGD optimizer. The baseline model, and all other investigated models, are trained using the content loss for 5000 epochs. The pre-trained baseline model is used to initialize the generator in the adversarial model, where D and G are trained for another 5000 epochs. All models are trained with an initial learning rate that is decreased during training.

4 Experiments

4.1 Model analysis

4.1.1 Resolution training size.

Attention here is focused on showing the effect of training the model on LR features versus their up-sampled version, which is the difference between the two network schemes (c) and (d) shown in Fig. 1. The first extracts and optimizes features of the original LR image and up-samples them at the end of the network, while the second up-samples the input features first and then optimizes them along the network. Looking at the trade-off between computation cost and model performance, training on the up-sampled features increased the computation cost without adding significant improvement to the generated images. As shown in Fig. 4, up-sampled training yields a slight increase in PSNR/SSIM values compared to our proposed model (TSRCNN), but the model could not generate some fine texture details, such as the handbag handle in the second image and the person in the background of the first and third images.

Figure 4: (1) HR image. (2) Our proposed model TSRCNN trained on LR image features with PSNR/SSIM (52.353/0.9495). (3) The same model trained on the up-sampled features using 2 Deconv layers with PSNR/SSIM (52.656/0.9510)

4.1.2 Evaluation with the state-of-the-arts.

Before validating the residual-learning model scheme and the up-sampling methods of the proposed models, the proposed baseline model is compared with the state of the art: VDSR [15] and LAPSRN [18], which are based on residual learning. VDSR is implemented in two models: (1) the original VDSR, which takes only thermal images as input, and (2) our extended VDSRex, which takes visual-thermal images as a 4-channel input. LAPSRN is trained using only thermal images. The experiment was run on our ULB17-VT benchmark, using the same model sizes and training procedures described in the original papers. Fig. 5 shows that VDSR failed to produce high-frequency details, while VDSRex produced better results by taking advantage of the visual-thermal fusion. However, the proposed baseline model generates images with sharp details and higher perceptual quality. Table 1 shows that the proposed model also obtained higher PSNR/SSIM values than the state of the art.

Figure 5: Comparison between our baseline model TSRCNN and state-of-the-art, LAPSRN [18], VDSR  [15] and our extended VDSRex.

4.1.3 Residual learning and Up-sampling methods.

Attention was then focused on adapting the residual-learning model scheme to the thermal SR problem. To use the baseline model TSRCNN in this scheme, the input must be rescaled to the same size as the residual output. We trained four different models with four different up-sampling methods: (1) InpDconv-TSRCNNres, which integrates TSRCNN and two deconvolution layers to up-sample the input image; (2) InpBilin-TSRCNNres, which up-samples the input using bilinear interpolation; (3) InpBicub-TSRCNNres, which uses bicubic interpolation; and (4) AllDconv-TSRCNNres, which is similar to (1) but with the two PixelShuffle layers at the end of the network replaced by two deconvolution layers.

Fig. 6 shows that the models trained with bilinear and bicubic interpolation failed to produce comparable perceptual quality and have the lowest PSNR/SSIM values. Models (1) and (4), which use trainable up-sampling methods, produced better perceptual results with high PSNR/SSIM values. However, the proposed model produced sharper edges and finer details: note that the person in the first and third images has sharper details in the proposed model, and the bag handle exists only in our model's output. To this end, and to better evaluate the contribution of the models, all models were included in our qualitative evaluation study.

Figure 6: (1) HR image. (2) Our proposed TSRCNN model with no residual learning and (52.353/0.9495). (3) InpDconv-TSRCNNres with (52.323/0.9491). (4) AllDconv-TSRCNNres with (52.430/0.9495). (5) InpBilin-TSRCNNres with (51.023/0.9415). (6) InpBicub-TSRCNNres with (51.670/0.9442). Values between brackets are PSNR/SSIM.

4.1.4 Visual-Thermal fusion.

To demonstrate the effect of visual-thermal fusion in the SR problem, and to investigate whether the rich information in visual images can help produce better thermal SR images, the baseline thermal SR ConvNet model (TSRCNN) was trained on only thermal images using the network architecture shown in Fig. 3 (a). In addition, the visual-thermal SR ConvNet (VTSRCNN) model was trained on thermal and visual images using the model shown in Fig. 3 (a) with the branch (*) integrated. Fusing visual-thermal images added more details to the produced thermal SR images, but also added some artifacts in parts of the images. In particular, these artifacts appear when there is a displacement of objects due to the camera design and image-capturing mechanism. The integration enhanced the SR images only slightly, which makes direct comparison difficult, but it is visible as sharp, extra details in small regions of the SR images, as shown in Fig. 7. Therefore, a qualitative evaluation study was set up to validate the contribution of the visual-thermal fusion.

4.1.5 Optimization function.

To investigate the contribution of the adversarial loss to producing SR images of high perceptual quality, the two models TSRGAN (which takes only thermal images) and VTSRGAN (which takes visual-thermal images) are trained using the content loss plus the adversarial loss. Fig. 7 shows the proposed models and their contribution to enhancing thermal SR perceptual quality. Models trained with the adversarial loss produced images with high texture detail and high-frequency information. Although they added some small artifacts, they produced images that are sharper and less blurry than images generated using only the content loss. Table 1 shows the relationship between the mean square error and the PSNR validation measurement: the model TSRCNN has the highest PSNR/SSIM values and also the blurriest images. To better validate perceptual quality, our four models were added to the qualitative evaluation study.

Figure 7: Our proposed models trained only on thermal or on visual-thermal fusion, using only content loss or with adversarial loss.
PSNR/SSIM (proposed models): 52.353/0.9495, 51.727/0.9434, 51.094/0.9285, 50.979/0.9289
PSNR/SSIM (state of the art): 45.557/0.8328, 52.027/0.9395, 51.936/0.9526
Table 1: Quantitative evaluation of the proposed models and the state of the art

4.2 Qualitative evaluation study

To evaluate the proposed models and the different investigated schemes against the state of the art, a qualitative evaluation study was conducted to mitigate the PSNR/SSIM impact and to assess human visual perception. A website was created that allows users to choose the image most similar to its original HR counterpart. Twenty-two people, with and without computer vision backgrounds, contributed to this evaluation process. They were asked to vote on a large study case: the test set of the ULB17-VT benchmark, with 58 samples. For each image, the outputs of 9 models were included in the evaluation.

Running a qualitative evaluation on 9 models with 58 images each is exhausting work for the raters. To encourage them, and to reduce the overall number of selections required, three evaluation groups were created. In each group, and for each image, only three models were presented; these models were selected randomly and not repeated. The evaluation page shows the original HR image and the outputs of the three selected models. The user was asked to select the image most similar to the original HR image. Each selection awards +1 in favor of the chosen model against each of the two other models shown. For example, if group 1, image 1 presents the outputs of models [4, 5, 6] and the user selects the second image, the ranking output is +1 for model 5 over model 4 and +1 for model 5 over model 6. Finally, the total votes for and against each pair of models are normalized as shown in Eq. (6), and also normalized by the number of times the pair was presented in the evaluation process.
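The pairwise counting scheme can be sketched in a few lines of Python. This is a toy illustration: the exact normalization of Eq. (6) is not reproduced here, so a simple per-pair frequency stands in for it:

```python
from collections import defaultdict
from itertools import combinations

def tally(selections):
    """Pairwise vote tally for the rater study.

    `selections` is a list of (shown_models, chosen_model) tuples; each
    choice counts +1 for the chosen model against every other model shown.
    """
    wins = defaultdict(int)   # wins[(a, b)]: votes for a over b
    shown = defaultdict(int)  # shown[(a, b)]: times a and b co-appeared
    for models, choice in selections:
        for a, b in combinations(sorted(models), 2):
            shown[(a, b)] += 1
        for other in models:
            if other != choice:
                wins[(choice, other)] += 1
    return wins, shown

def favor(wins, shown, a, b):
    """Fraction of co-appearances in which raters preferred a over b."""
    return wins[(a, b)] / shown[tuple(sorted((a, b)))]

# group 1, image 1: models 4, 5 and 6 shown; two raters pick 5, one picks 4
votes = [([4, 5, 6], 5), ([4, 5, 6], 5), ([4, 5, 6], 4)]
wins, shown = tally(votes)
```

Normalizing by co-appearances rather than raw counts keeps the comparison fair when random grouping shows some pairs more often than others.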


The color-coded vote diagram shown in Fig. 8 shows that the proposed models that integrate visual-thermal fusion are the most selected against almost all other models. The size of a model on the left of the graph indicates the number of times it was voted in favor over all other models; VTSRCNN and VTSRGAN have the largest shares. The size of a model on the right of the graph indicates the number of times it was voted against. The weight of each path indicates the number of times a model was selected in favor against the opposing model. Our human visual perception study shows that the proposed models with visual-thermal fusion have the highest votes in favor and the lowest votes against. This highlights the benefits of integrating visual-thermal fusion into the thermal super-resolution problem.

Figure 8: Color-coded vote-flow diagram of the models: (1) TSRCNN, (2) VTSRCNN, (3) InpDconv-TSRCNNres, (4) AllDconv-TSRCNNres, (5) TSRGAN, (6) VTSRGAN, (7) VDSR, (8) VDSRex, (9) LAPSRN. Left/right represent the models; paths in the middle represent the votes in favor between paired models.

4.3 Limitations

Although the proposed models generate thermal SR images that are better in terms of human visual perception, some artifacts were noticed in the generated images. These artifacts are most likely caused by the device design or by displacement of the device or the object. While the models can preserve the high-frequency details of the visual images, the displacement problem could not be solved simply and needs a better-synchronized device. Due to this displacement, the reconstructed versions of some samples suffer from artifacts around the displaced objects compared to images with no displacement. We leave this problem open for further study and investigation.

5 Conclusion

In this paper, the problem of thermal super-resolution enhancement using the visual image domain was addressed. A deep residual network was proposed that provides a better solution than other network schemes and training methods in the literature. Our results highlight that visual-thermal fusion can enhance thermal SR image quality, further aided by the GAN-based model. Furthermore, a qualitative evaluation study was performed and analyzed; this evaluation provides a better assessment of the problem than the widely used PSNR/SSIM measurements. Lastly, a new visual-thermal benchmark for the super-resolution problem domain was introduced.