Thermal Object Detection using Domain Adaptation through Style Consistency

06/01/2020 ∙ by Farzeen Munir, et al. ∙ Gwangju Institute of Science and Technology SEECS Orientation 0

A recent fatal accident involving an autonomous vehicle has opened a debate about adding infrared technology to the sensor suite for autonomous driving to increase visibility for robust object detection. Thermal imaging has an advantage over lidar, radar, and cameras because it detects the heat differences emitted by objects in the infrared spectrum. In contrast, lidar and cameras capture in the visible spectrum, and adverse weather conditions can degrade their accuracy. The limitations of object detection in images from conventional imaging sensors can be addressed with thermal images. This paper presents a domain adaptation method for object detection in thermal images. We explore multiple ideas of domain adaptation. First, a generative adversarial network is used to transfer low-level features from the visible spectrum to the infrared spectrum domain through style consistency. Second, a cross-domain model with style consistency is used for object detection in the infrared spectrum by transferring the trained visible spectrum model. The proposed strategies are evaluated on publicly available thermal image datasets (FLIR ADAS and KAIST Multi-Spectral). We find that adapting the low-level features from the source domain to the target domain through domain adaptation increases the mean average precision by approximately 10




I Introduction

Autonomous driving is becoming a reality after almost four decades of incubation, and object detection using deep neural networks is a key element in this success. Autonomous vehicles are expected to offer broader access to mobility, and while doing so, the safety of the vehicle and its surroundings is the primary concern. SOTIF (Safety of the Intended Functionality) addresses in detail the safety violations that happen without a technical system failure [1], for example, a failure to perceive an object in the environment or fog occluding the field of view. The autonomous vehicle should be capable of operating safely in such situations. The perception of the environment plays a vital role in the safety of the autonomous vehicle. Environmental perception is generally defined as awareness of, or knowledge about, the surroundings and the understanding of the situation through visual perception [2].

The sensors commonly used for perception in autonomous vehicles include lidar, RGB cameras, and radar. One of the essential aspects of perception is object detection, and all of these sensors are employed for it. Each sensor has its own drawbacks. Lidar gives a sparse 3D map of the environment, but small objects like pedestrians and cyclists are hard to detect at a distance. The RGB camera performs poorly in unfavorable illumination conditions such as low lighting, sun glare, and glare from vehicle headlights. Radar has too low a spatial resolution to detect pedestrians accurately. There is therefore a gap in object detection under adverse lighting conditions [3]. Including a thermal camera in the sensor suite helps fill these blind spots in environmental perception. A thermal camera is robust against illumination variation and can be deployed during both day and night. Object detection and classification are indispensable for visual perception, which provides the basis for perception in an autonomous vehicle.

Fig. 1: (a) Object detection in thermal images through style consistency (ODSC). The visible spectrum (RGB) image is treated as the style image, whereas the thermal image is treated as the content image. The output shows the enhanced image with low-level features adapted from the visible spectrum. (b) Cross-domain model transfer with style transfer. The style from the thermal image is transferred to the visible spectrum (RGB) content image.

Object detection in the visible spectrum (RGB) domain is considered sufficient for conventional AI applications and has resulted in deep neural network models for robust object detection [4] [5] [6]. However, the accuracy of object detection in thermal images has not yet reached the state-of-the-art results achieved in the visible spectrum. The aforementioned object detection algorithms depend on networks that have been trained on sizable RGB datasets such as ImageNet [7], PASCAL-VOC [8], and MS-COCO [9]. There is a comparative scarcity of such large-scale public datasets in the thermal domain. The two primary datasets available for urban thermal imagery are the FLIR ADAS image dataset [10] and the KAIST Multi-Spectral dataset [11]. The KAIST Multi-Spectral dataset only provides annotations for persons, while the FLIR ADAS dataset provides annotations for four classes. To overcome the absence of a large-scale labeled dataset, a domain adaptation technique for object detection in the thermal domain is presented here.

Numerous approaches to domain adaptation have been introduced, which aim to narrow the gap between the source and target domains. Among them, generative adversarial networks (GAN) [12] and domain confusion [13] for feature adaptation are noteworthy. The prospects of domain adaptation in the data-starved thermal image domain motivate this study, which explores closing the gap between the visible and infrared spectra in the context of object detection. Domain adaptation is influenced by generative models, for instance CycleGAN [14], which translates a single instance of the source domain to the target domain without translating the style attributes to the target domain. Low-level visual cues have an implicit impact on the performance of object detection [15]. Transferring these visual cues from the source domain to the target domain can be beneficial for robust object detection in the target domain.

This work explores the translation of low-level features from a source domain (RGB) to a target domain (thermal) using domain adaptation to improve object detection in the target domain. Multi-style transfer is applied to transfer low-level features such as curvatures and edges from the source domain to the target domain. Deep learning-based object detection architectures that rely on classical backbones such as VGG [16] and ResNet [17] are trained from scratch on the multi-style transfer images for robust object detection in the infrared spectrum (target domain). Moreover, we propose a cross-domain model transfer method for object detection in thermal images that supplements the domain adaptation. In cross-domain model transfer, the object detection networks are trained in the source domain (visible spectrum). The trained models, referred to as cross-domain models, are evaluated in the target domain (infrared spectrum) both with and without multi-style transfer images. The proposed techniques are evaluated on FLIR ADAS [10] and KAIST Multi-Spectral [11], and the PASCAL-VOC evaluation is used to determine the mean average precision of the detected objects [8].

Major contributions in this paper are highlighted below:

  1. Improved object detection in the infrared spectrum (thermal images) by exploring the low-level features using style consistency. The proposed object detection framework outperformed existing benchmarks in terms of mean average precision.

  2. The cross-domain model transfer paradigm not only enhances object detection in the infrared spectrum (thermal images) but also provides an alternative yet effective method for labeling unlabelled datasets.

The rest of the paper is organized as follows: Section II discusses the related literature. In Section III, the proposed methodology is discussed. Section IV focuses on experimentation and results. Section V shows the comparison and discussion about the proposed method. Section VI concludes the study.

II Related Work

II-A Object Detection

Human vision is robust at identifying objects in countless challenging conditions, but this is not a trivial task for an autonomous vehicle. The ultimate goal of object detection in images is to localize and identify all instances of the same object or of different objects present in the image. Significant work has been done on person detection in thermal images by taking into account the temperature sensed in the surroundings. Classical image processing techniques can be used for detection; for example, thresholding is used in [18]. HOG features and local binary patterns are used to extract features from thermal images, and the features are used to train SVM classifiers [19][20][21]. Deep neural networks have gained repute in object detection tasks on RGB images and are also used for object detection in thermal images [22][23][24][25]. The feature maps from multispectral images are extracted and fed to an object detector such as Faster-RCNN or YOLO. [26] augment multispectral images with their saliency maps to focus attention on pedestrians during the daytime. They train Faster-RCNN for pedestrian detection and fine-tune it on the extracted feature maps. [27] uses CycleGAN to generate thermal images from RGB images, removing the dependency on paired RGB and thermal images in the dataset. They use a variant of Faster-RCNN that takes both the thermal and RGB images to detect objects.

Fig. 2: The proposed model framework for object detection in thermal images through style consistency. (a) Multi-style generative network architecture for generating the style images. Visible spectrum (RGB) images and thermal images are given to the network as the style and content images, respectively. The siamese network captures the low-level features of the style image, which are transferred to the transformation network through the CoMatch layer. A pre-trained loss network is used for MSGNet learning by computing the difference between the content and style images and the targets. (b) The detection networks (Faster-RCNN with a ResNet-101 backbone; SSD-300 with VGG16, MobileNet, and EfficientNet backbones; SSD-512 with a VGG16 backbone) are trained on the style images and then tested in the target domain (thermal images) for object detection.

II-B Domain Adaptation

Typically neural networks encounter performance degradation when they are tested upon different datasets due to environmental changes. In some cases, the dataset is not large enough to train and optimize a network. Therefore techniques like domain adaptation provide a crucial tool to the research community.

Domain adaptation for object detection includes techniques such as generating synthetic data or augmenting real data to train the network. [28] merge publicly available labeled object detection datasets from various domains and with multiple classes. For example, the fashion dataset Modanet is merged with the MS-COCO dataset by leveraging Faster-RCNN with domain adaptation. In [29], Faster-RCNN is used to perform image-level and instance-level adaptation. [30] introduce a two-step method in which a detector is first optimized for low-level features and then developed into a robust classifier for high-level features by enforcing distance minimization between the content and style images. [31] propose a cross-domain semi-supervised learning structure that takes advantage of pseudo annotations to learn optimal representations of the target domain. They use fine-grained domain transfer, progressive confidence-based annotation augmentation, and an annotation sampling strategy.

II-C Style Transfer

Image style transfer is a process that renders the content of an image from one domain in the style of an image from another domain. [32] demonstrated the use of feature representations from a convolutional neural network for style transfer between two images. They showed that the features obtained from a CNN are separable, and they manipulate the feature representations of the style and content images to generate new, visually meaningful images.

[33] propose style transfer based on a single object. They use patch permutation to train a GAN to learn the style and apply it to the content image. [34] introduce XGAN, which consists of an auto-encoder that captures the shared features of the style and content images in an unsupervised way while learning the translation of the style onto the content image. [35] propose the CoMatch layer, which learns second-order statistics of the features and matches them with the style image. Using the CoMatch layer, they developed the Multi-style Generative Network, which achieves real-time performance.

The rise of deep learning has significantly improved object detection by training neural network models on large datasets in the visible spectrum (RGB images). This study introduces an approach to improve object detection in thermal images by domain adaptation through style transfer. The scarcity or absence of labeled data is a challenge for the research community, and labeling is not an easy task. The proposed approach can also be used to perform domain adaptation for other datasets, for example introducing foggy weather into the KITTI dataset or converting day images to night images.

III Proposed Method

This section presents the proposed methods for thermal object detection through style consistency and cross-domain model transfer for object detection in thermal images.

Fig. 3: An overview of the cross-domain model transfer method. The detection networks are trained on the visible spectrum (RGB images). Afterward, the trained models are tested using cross-domain model transfer both with style transfer (via MSGNet) and without style transfer. (Detection Network*) indicates that the same detection networks are used for testing in the target domain.

III-A Object Detection in Thermal Images through Style Consistency (ODSC)

Recent advances in deep learning have revolutionized object detection in the RGB image domain. In the infrared image domain, however, object detection still lacks accuracy. Deep neural networks for object detection compute features at both low and high levels [39] [15]. In this part of the proposed work, we argue that transferring low-level features from the source domain (RGB) through domain adaptation increases object detection performance in the target domain (thermal).

For the domain adaptation between thermal images (content images) and visible spectrum (RGB) images (style images), we adopt the multi-style generative network (MSGNet) for style transfer [35]. Its ability to translate a specific style from the source to the target domain gives MSGNet an edge over CycleGAN [14]: CycleGAN generates one translated image of a specific style from the source image, whereas MSGNet can translate multiple styles from the source domain to the target domain while closing the gap between the two domains. The network extracts low-level features such as texture and edges from the source domain while keeping the high-level features consistent in the target domain. Fig. 2(a) shows the framework for transferring the style from the visible spectrum (RGB) images to thermal images.

The architecture of MSGNet is shown in Fig. 2(a). MSGNet takes both the content image and the style image as input, whereas previously known architectures such as Neural Style [33] take only the content image and then generate the transferred image. The generator network is composed of an encoder consisting of a siamese network [36], which shares its network weights with the transformation network through the CoMatch layer. The CoMatch layer matches the second-order feature statistics of the content image to those of the style images. For a given content image and style image, the activation of the descriptive network at a given scale represents the content image as a feature map described by its number of channels, its height, and its width. The distribution of features in the style image is represented using the Gram matrix given by Eq. 1, in which a reshaping function is applied to the zero-centered feature data. To find a solution in the CoMatch layer that preserves the semantic content of the source image while matching the feature statistics of the target style, an approximation of the iterative solution is adopted that moves the computational cost into the training stage, as shown in Eq. 2, which involves a learnable matrix.
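For reference, Eq. 1 and Eq. 2 can be sketched as follows, assuming they follow the MSGNet formulation of [35]. All symbols are assumed notation rather than the notation of this paper: $\mathcal{F}^i(x) \in \mathbb{R}^{C_i \times H_i \times W_i}$ is the descriptive-network activation at scale $i$, $x_c$ and $x_s$ are the content and style images, $\Phi$ is the reshaping function, and $W$ is the learnable matrix.

```latex
% Eq. 1 (assumed form): Gram matrix of the reshaped, zero-centered features
\mathcal{G}\big(\mathcal{F}^i(x)\big)
  = \Phi\big(\mathcal{F}^i(x)\big)\,\Phi\big(\mathcal{F}^i(x)\big)^{T},
\qquad
\Phi:\ \mathbb{R}^{C_i \times H_i \times W_i} \to \mathbb{R}^{C_i \times (H_i W_i)}

% Eq. 2 (assumed form): CoMatch output with learnable matrix W
\hat{\mathcal{Y}}^i
  = \Phi^{-1}\Big[\Phi\big(\mathcal{F}^i(x_c)\big)^{T}\, W\,
      \mathcal{G}\big(\mathcal{F}^i(x_s)\big)\Big]^{T}
```

Under these assumptions the shapes are consistent: $\Phi(\mathcal{F}^i(x_c))^{T}$ is $(H_i W_i) \times C_i$, both $W$ and the Gram matrix are $C_i \times C_i$, and $\Phi^{-1}$ restores the $C_i \times H_i \times W_i$ layout.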

The generator is trained by minimizing a weighted combination of the content and style differences between the generator network output and the targets, measured with a given pre-trained loss network. The generator network is parameterized by its weights. Learning is done by sampling content and style images and estimating the generator weights that minimize the loss (Eq. 3), in which two regularization parameters balance the content and style losses. The content image is considered at a single scale and the style image at multiple scales. A total variation regularization term is added for the smoothness of the generated image [40].
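As a sketch, the training objective of Eq. 3 can be written in the form used by MSGNet [35]. The notation is assumed: $G(x_c, x_s)$ is the generator with weights $W_G$, $\mathcal{F}^c$ and $\mathcal{F}^i$ are loss-network activations at the content scale $c$ and the style scales $i = 1, \dots, K$, $\mathcal{G}$ is the Gram matrix, and $\lambda_c$, $\lambda_s$, $\lambda_{TV}$ are the balancing weights.

```latex
% Eq. 3 (assumed form): content loss + multi-scale style loss + total variation
\hat{W}_G = \underset{W_G}{\arg\min}\; \mathbb{E}_{x_c, x_s}\Big\{
    \lambda_c \big\| \mathcal{F}^c\big(G(x_c, x_s)\big) - \mathcal{F}^c(x_c) \big\|_F^2
  + \lambda_s \sum_{i=1}^{K} \big\| \mathcal{G}\big(\mathcal{F}^i(G(x_c, x_s))\big)
      - \mathcal{G}\big(\mathcal{F}^i(x_s)\big) \big\|_F^2
  + \lambda_{TV}\, \ell_{TV}\big(G(x_c, x_s)\big)
\Big\}
```

The first term keeps the generated image close to the content image in feature space, the second matches Gram-matrix statistics of the style image across scales, and the third penalizes non-smooth outputs.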

The proposed framework for object detection through style consistency is presented in Fig. 2. The network consists of two modules. The first module is the multi-style network, which generates the style images by transferring low-level features between the content image (thermal) and the style image (RGB). Compared to the thermal images, the style-transferred images contain the transferred low-level features, while the semantic shapes are preserved, keeping the high-level semantic features consistent. The second module comprises state-of-the-art detection architectures: Faster-RCNN [4] with a ResNet-101 backbone [17], and SSD-300 and SSD-512 [5] with a VGG16 backbone [16], MobileNet [37], and EfficientNet [38]. The networks are trained on the styled images, which bridge the gap between the visible spectrum and thermal images. The backbones in Faster-RCNN and SSD are initialized with pre-trained weights obtained from training on the ImageNet dataset [7]. The trained detection networks are evaluated on style images and thermal images. The accuracy on thermal images shows the efficacy of object detection.

III-B Cross-Domain Model Transfer for Object Detection in Thermal Images (CDMT)

This part of the study uses domain adaptation through style consistency to transfer low-level features from the thermal images (source domain) to the visible spectrum (RGB images, target domain). For cross-domain model transfer, the source and target domains are swapped compared to the first part of the proposed work. Fig. 3 shows the overall framework for cross-domain model transfer for object detection in thermal images. The detection networks (Faster-RCNN with a ResNet-101 backbone; SSD-300 with VGG16, MobileNet, and EfficientNet backbones; SSD-512 with a VGG16 backbone) are trained on the visible spectrum (RGB images), and the trained models are then tested on the thermal images. Because the detection networks are trained on a different domain, in this case visible spectrum (RGB) images, their performance on thermal images is marginal. The efficacy of thermal object detection can be increased by using style consistency. MSGNet is trained with RGB images as the content images, and the style is borrowed from the thermal images. The style-transferred images are then passed to the same detection networks that were trained earlier on the visible spectrum (RGB) images, which improves object detection in thermal images. This cross-domain model transfer can be applied as a weak object detection module for an unlabeled dataset, as in our case for thermal images.
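The weak-labeling use of CDMT could be sketched as follows. This is a minimal illustration and not the authors' pipeline: the detection format (dictionaries with `box`, `label`, and `score`), the function name `pseudo_labels`, and the 0.5 confidence threshold are all assumptions introduced for the example.

```python
# Hypothetical sketch: turn detector outputs on unlabeled thermal images
# into weak (pseudo) annotations by keeping only confident detections.


def pseudo_labels(detections, score_threshold=0.5):
    """Return weak annotations from a list of detections.

    Each detection is assumed to be a dict with the keys
    'box' (x1, y1, x2, y2), 'label' (str) and 'score' (float in [0, 1]).
    The threshold value is an assumption, not taken from the paper.
    """
    return [
        {"box": det["box"], "label": det["label"]}
        for det in detections
        if det["score"] >= score_threshold
    ]


# Example detector output for one thermal frame (made-up values).
frame_detections = [
    {"box": (12, 30, 58, 140), "label": "person", "score": 0.91},
    {"box": (200, 80, 320, 160), "label": "person", "score": 0.34},
]
weak_annotations = pseudo_labels(frame_detections)
print(weak_annotations)
```

Only the first detection survives the threshold here. In practice the threshold trades annotation coverage against label noise, so it would need to be tuned per dataset.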

IV Experimentation and Results

IV-A Datasets

We use two thermal image datasets in this study: the FLIR ADAS dataset [10] and the KAIST Multi-Spectral dataset [11]. The FLIR dataset consists of 9214 images with objects annotated using bounding boxes, which serve as the ground truth for evaluation. The objects are classified into four categories, i.e., car, person, bicycle, and dog. However, the dog category has very few annotations, so it is not considered in this study. The images have a resolution of and obtained from FLIR Tau2 Camera. The dataset consists of day and night images, approximately images are captured during the daytime, and images are captured during nighttime. The dataset contains both visible spectrum (RGB) and thermal images, but annotations are only available for the thermal images. The visible spectrum (RGB) and thermal images are not paired, so the thermal annotations cannot be used with the visible spectrum (RGB) images. Only thermal images with annotations are considered in this study. A standard split of the dataset into training and validation data is used during experimentation. The training dataset consists of images, and the validation contains images.

The KAIST Multi-Spectral dataset contains images from both the visible spectrum (RGB images) and the thermal spectrum, and for each category, the dataset has both daytime and nighttime images. Annotations are only provided for the person class with a given bounding box. The visible spectrum (RGB images) and thermal images are paired, which means annotations for the thermal and the visible spectrum (RGB images) are the same. Images are captured using a FLIR A35 camera with a resolution of . We have applied a standard split of the dataset, using of the images in the dataset in training, and of the images in the dataset for validation.

IV-B Object Detection in Thermal Images through Style Consistency

The proposed method is evaluated using state-of-the-art object detection networks: Faster-RCNN, SSD-300, and SSD-512. These networks are implemented with different backbone architectures: ResNet-101 is used as the backbone in Faster-RCNN; VGG16, MobileNet, and EfficientNet are used with SSD-300; and SSD-512 uses VGG16. The data comprise the FLIR ADAS and KAIST Multi-Spectral datasets. The FLIR ADAS dataset is partitioned into training and testing sets using the standard split, while the KAIST dataset is only used for testing the object detection networks. All the networks are implemented in PyTorch, with the data formatted in PASCAL-VOC format. The standard PASCAL-VOC evaluation criteria are used in this study.
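To make the PASCAL-VOC style evaluation concrete, the following is a minimal sketch of per-class average precision. It is not the evaluation code used in the study. It assumes the common VOC conventions of an IoU threshold of 0.5 and all-point interpolation of the precision-recall curve; the function names and data layout are hypothetical.

```python
# Minimal sketch of VOC-style average precision for one class.
# Assumptions: IoU threshold 0.5, all-point interpolated AP, boxes given
# as (x1, y1, x2, y2) in continuous coordinates.


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_threshold=0.5):
    """detections: list of (image_id, score, box);
    ground_truth: dict mapping image_id to a list of boxes."""
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    if n_gt == 0:
        return 0.0
    matched = {img: [False] * len(b) for img, b in ground_truth.items()}
    tp_flags = []
    # Visit detections from most to least confident.
    for img, _score, box in sorted(detections, key=lambda d: -d[1]):
        best_iou, best_idx = 0.0, -1
        for idx, gt_box in enumerate(ground_truth.get(img, [])):
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        hit = (best_idx >= 0 and best_iou >= iou_threshold
               and not matched[img][best_idx])
        if hit:
            matched[img][best_idx] = True  # each ground truth matches once
        tp_flags.append(hit)
    # Precision and recall after each detection.
    precisions, recalls, tp = [], [], 0
    for rank, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / rank)
        recalls.append(tp / n_gt)
    # Make precision monotonically non-increasing (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


gt = {"img1": [(0, 0, 10, 10)], "img2": [(0, 0, 10, 10)]}
dets = [
    ("img1", 0.9, (0, 0, 10, 10)),    # exact match, true positive
    ("img1", 0.8, (20, 20, 30, 30)),  # no overlap, false positive
    ("img2", 0.7, (1, 1, 10, 10)),    # IoU 0.81, true positive
]
ap = average_precision(dets, gt)
print(round(ap, 4))  # 0.8333
```

The mean average precision reported in the tables is then the mean of such per-class values.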


IV-B1 Baseline

A baseline approach is evaluated first for comparison with the proposed methodology. The object detection networks are trained with their specific training configurations. To train Faster-RCNN, a pre-trained ResNet-101 model is adapted and fine-tuned on the thermal image dataset. The network is trained using the Adam optimizer with a learning rate of and a momentum of for total epochs of


The experimental evaluation with the SSD object detection network covers two architectures, i.e., SSD-300 and SSD-512. For SSD-300, the pre-trained backbone networks are fine-tuned on the training data. The learning rate for VGG16, MobileNet, and EfficientNet used as the backbone network for SSD-300 are ,, and , respectively. For the SSD-512 experimentation, only pre-trained VGG-16 is used as the backbone for training with a learning rate of . All the networks have used a batch size of on the Nvidia-GTX- having GB of computational memory.

IV-B2 Experimental Configuration

In the proposed methodology, MSGNet is trained with thermal images as the content images and RGB images as the style images. VGG16 is used as the loss network, with weights pre-trained on the ImageNet dataset. In a loss network, the balancing weights as referred to in Eq. 3 are and respectively while the total variational regularization for content and style is . In the experimental configuration, the size of the style image is iteratively updated, having a size of , respectively. The size of the content images is resized to . The Adam optimizer is used with a learning rate of in the training configuration. The MSGNet is trained for a total of epochs with a batch of on the Nvidia-GTX-.

The trained MSGNet generates style images, as shown in Fig. 1(a). These style images are used to train the object detection networks. The detection networks trained on style images are evaluated on test data comprising thermal images. The training configuration of these object detection networks is kept the same as the baseline configuration to allow a comparative analysis.

Fig. 4: Object detection in thermal images through style consistency in the top row. (a) Ground-truth provided by FLIR ADAS, (b) Baseline best model (SSD512-VGG16) qualitative result. (c) Qualitative result of the best model (SSD512-VGG16) using our proposed method. Cross-domain model transfer (CDMT) results on the KAIST Multi-Spectral dataset in the bottom row. (a) Ground-truth provided by the KAIST Multi-Spectral dataset. (b) Baseline model (SSD512-VGG16) tested on thermal images using CDMT without style transfer. (c) Qualitative result of proposed CDMT with style transfer on the KAIST Multi-Spectral dataset.

IV-B3 Experimental Results

To evaluate our experimental configuration, we test the baseline and the proposed method on both thermal datasets (FLIR ADAS and KAIST Multi-Spectral). Table-I shows the mean average precision (mAP) scores of the baseline configuration for each detection network, i.e., the networks are trained on thermal images and evaluated on thermal images. Table-II shows the quantitative results of the proposed method. The best model configuration for the proposed method is SSD512+VGG16, and its mAP score is higher than that of the baseline configuration. In contrast, the detection networks trained on thermal images and tested on style images show only marginal efficacy, as shown in Table-III. Fig. 1(a) shows the qualitative result of object detection in thermal images through style consistency. The qualitative results of the best model configuration (SSD512+VGG16) are shown in Fig. 4 (top row).

                                    FLIR ADAS Dataset                      KAIST Multi-Spectral Dataset
Network Architecture  Backbone      car     bicycle  person  Average mAP   person
Faster-RCNN           ResNet-101    0.6799  0.4276   0.548   0.5518        0.3283
SSD-300               VGG-16        0.7561  0.4502   0.6197  0.6087        0.6687
SSD-300               MobileNet-v2  0.4774  0.1943   0.3163  0.3284        0.5998
SSD-300               EfficientNet  0.6809  0.2747   0.4992  0.4849        0.6162
SSD-512               VGG-16        0.8055  0.5399   0.702   0.6825        0.6409
TABLE I: Quantitative analysis using Baseline configuration for object detection networks.

                                    FLIR ADAS Dataset                      KAIST Multi-Spectral Dataset
Network Architecture  Backbone      car     bicycle  person  Average mAP   person
Faster-RCNN           ResNet-101    0.7190  0.4394   0.6201  0.5928        0.5345
SSD-300               VGG-16        0.7991  0.4691   0.6253  0.6312        0.7536
SSD-300               MobileNet-v2  0.5434  0.2798   0.3638  0.3957        0.7465
SSD-300               EfficientNet  0.7405  0.3512   0.5169  0.5362        0.6770
SSD-512               VGG-16        0.8233  0.5553   0.7101  0.6962        0.7725
TABLE II: Quantitative analysis using Proposed Method (ODSC) configuration.

                                    FLIR ADAS Dataset                      KAIST Multi-Spectral Dataset
Network Architecture  Backbone      car     bicycle  person  Average mAP   person
Faster-RCNN           ResNet-101    0.3030  0.1985   0.2115  0.2377        0.1410
SSD-300               VGG-16        0.6824  0.3286   0.5260  0.5123        0.6137
SSD-300               MobileNet-v2  0.4551  0.1363   0.2899  0.2937        0.4773
SSD-300               EfficientNet  0.3637  0.1193   0.2289  0.2373        0.4449
SSD-512               VGG-16        0.6779  0.3736   0.5538  0.5351        0.4961
TABLE III: Quantitative analysis of testing object detection networks trained on thermal images and tested on style images
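As a quick consistency check on Tables I and II, the "Average mAP" column appears to be the unweighted mean of the three per-class values. The sketch below verifies this for the best configuration (SSD-512 with VGG-16) and computes the gain on the KAIST person class. The numbers are copied from the tables; the unweighted-mean interpretation is an assumption that the arithmetic supports.

```python
# Per-class AP on FLIR ADAS for SSD-512 + VGG-16, copied from Tables I and II.
baseline_ap = {"car": 0.8055, "bicycle": 0.5399, "person": 0.7020}
odsc_ap = {"car": 0.8233, "bicycle": 0.5553, "person": 0.7101}


def mean_ap(per_class_ap):
    """Unweighted mean of the per-class average precision values."""
    return sum(per_class_ap.values()) / len(per_class_ap)


baseline_map = round(mean_ap(baseline_ap), 4)
odsc_map = round(mean_ap(odsc_ap), 4)
print(baseline_map, odsc_map)  # 0.6825 0.6962

# KAIST person AP for the same network: baseline 0.6409, ODSC 0.7725.
kaist_gain = round(0.7725 - 0.6409, 4)
print(kaist_gain)  # 0.1316
```

Both computed means match the tabulated values, and the KAIST gain for this network is about 13 points of average precision.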

IV-C Cross-Domain Model Transfer for Object Detection in Thermal Images (CDMT)

The cross-domain model evaluation trains the object detectors on the visible spectrum (RGB images). The KAIST dataset is used in this experiment because labels are available for both domains. The object detection networks in this study are Faster-RCNN, SSD-300, and SSD-512, and the network configuration is similar to ODSC: Faster-RCNN uses a ResNet-101 backbone; SSD-300 is evaluated with VGG16, MobileNet, and EfficientNet backbones; and SSD-512 uses a VGG16 backbone. The learning rate for training all detection networks is except for the SSD-300 with EfficientNet backbone, which is tested with . The batch size is for all the aforementioned detection networks.

Similar to ODSC, MSGNet is used to generate styled images, as shown in Fig. 1(b). In this case, the content images come from the visible domain (RGB images), and the style is transferred from the thermal images; the premise is that style transfer between the content image (RGB) and the style image (thermal) increases object detection efficacy. The hyper-parameters for MSGNet are kept the same as described in the experimental configuration of object detection in thermal images through style consistency. The detection networks are then tested on these generated styled images.

IV-C1 Experimental Results

The method is assessed by evaluating the trained networks on styled and non-styled images. Table-IV shows the quantitative results of cross-domain model transfer. The results show that cross-domain model transfer with style transfer increases object detection efficacy compared to cross-domain model transfer without style transfer. In addition, cross-domain model transfer helps close the annotation gap for unlabelled datasets by serving as a weak detector. The qualitative evaluation of style transfer for CDMT is shown in Fig. 1(b), and Fig. 4 (bottom row) shows the qualitative results of object detection using CDMT with style transfer.

                                    KAIST Multi-Spectral Dataset (person)
Network Architecture  Backbone      CDMT without Style Transfer  CDMT with Style Transfer
Faster-RCNN           ResNet-101    0.5754                       0.7254
SSD-300               VGG-16        0.6098                       0.7598
SSD-300               MobileNet-v2  0.2512                       0.7012
SSD-300               EfficientNet  0.1995                       0.5495
SSD-512               VGG-16        0.6202                       0.7702
TABLE IV: Quantitative analysis of Cross Domain Model Transfer (CDMT)
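The effect of style transfer in Table IV can be summarized as the absolute gain in person average precision per network. The short sketch below computes it from the tabulated values. The dictionary layout is illustrative only, and the SSD-300 + VGG-16 value without style transfer is taken as 0.6098 on the assumption that the doubled "0." in the source is a typo.

```python
# (without style transfer, with style transfer) person AP on KAIST, Table IV.
cdmt_results = {
    "Faster-RCNN + ResNet-101": (0.5754, 0.7254),
    "SSD-300 + VGG-16": (0.6098, 0.7598),  # assumes "0.0.6098" is a typo
    "SSD-300 + MobileNet-v2": (0.2512, 0.7012),
    "SSD-300 + EfficientNet": (0.1995, 0.5495),
    "SSD-512 + VGG-16": (0.6202, 0.7702),
}

# Absolute gain from adding style transfer, rounded to four decimals.
gains = {
    name: round(with_style - without_style, 4)
    for name, (without_style, with_style) in cdmt_results.items()
}

# List networks from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name}: +{gain:.4f}")
```

The lighter backbones, which transfer worst without style transfer, show the largest gains, while the VGG-16 and ResNet-101 configurations gain 0.15 each.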
                                    FLIR ADAS                        KAIST Multi-Spectral
Method                              car     bicycle  person  mAP     person
MMTOD-UNIT [27]                     0.7042  0.4581   0.5945  0.5856  -
MMTOD-CG [27]                       0.6985  0.4396   0.5751  0.5711  0.5226
PiCA-Net [26]                       -       -        -       -       0.658*
R3-Net [26]                         -       -        -       -       0.7085*
Intel [41]                          0.571   0.1312   0.245   0.3157  -
ACF+T+THOG [11]                     -       -        -       -       0.7139
Ours (ODSC)  Faster-RCNN+ResNet101  0.7190  0.4394   0.6201  0.5928  0.5345
             SSD300+VGG16           0.7991  0.4691   0.6253  0.6312  0.7536
             SSD300+MobileNet-v2    0.5434  0.2798   0.3638  0.3957  0.7465
             SSD300+EfficientNet    0.7405  0.3512   0.5169  0.5362  0.6770
             SSD512+VGG16           0.8233  0.5553   0.7101  0.6962  0.7725
Ours (CDMT)  Faster-RCNN+ResNet101  -       -        -       -       0.7254
             SSD300+VGG16           -       -        -       -       0.7598
             SSD300+MobileNet-v2    -       -        -       -       0.7012
             SSD300+EfficientNet    -       -        -       -       0.5495
             SSD512+VGG16           -       -        -       -       0.7702
TABLE V: Comparison of our proposed methods (ODSC and CDMT) with state-of-the-art methods. (*) represents the average (day+night) mean average precision score. (-) indicates that the respective algorithm was not tested on the specified dataset.

V Discussion

To assess the efficacy of the proposed methodology, the proposed methods are compared extensively with state-of-the-art methods. Table-V shows the comparison between the proposed methods (ODSC and CDMT) and state-of-the-art methods. In our analysis, we consider methods that use the standard PASCAL-VOC evaluation on the FLIR ADAS and KAIST Multi-Spectral datasets.

In addition to the mAP scores, per-class scores are also compared with those of state-of-the-art methods. Further, the comparison is not limited to methods that include domain adaptation. The object detection results are also compared with general object detection methods such as PiCA-Net [26] and R3-Net [26], which use saliency maps for object detection. Table-V shows that in most categories, the proposed strategies perform better than the existing benchmarks.

In future work, we aim to improve the perception of autonomous vehicles under low lighting conditions. Lane detection and segmentation are essential aspects, which are challenging to do in the visible domain. Achieving these tasks in the thermal domain will contribute to the enhanced visual perception of autonomous vehicles.

VI Conclusion

This study focuses on improving object detection in low lighting conditions for autonomous vehicles. A new approach is introduced to perform domain adaptation from the visible domain to the thermal domain through style consistency. We use MSGNet to transfer low-level features from the source domain to the target domain while keeping high-level semantic features consistent. The proposed method outperforms the existing benchmarks for object detection in the thermal domain. Moreover, the effectiveness of style transfer is reinforced by cross-domain model transfer between the visible and thermal domains. The proposed approach is applicable to autonomous vehicles under low lighting conditions and to robots in general. Object detection is an integral aspect of perception, and failure to detect an object compromises the safety of the autonomous vehicle. Thermal images provide additional insight into the surroundings by exploring the infrared spectrum, and the proposed techniques improve object detection in thermal images, with a positive impact on the safety of autonomous driving.