Recent technological advancements have enabled autonomous vehicles to be deployed on the road while ensuring safety standards, as indicated by SOTIF (ISO/PAS 21448, Safety of the Intended Functionality; see https://www.daimler.com/innovation/case/autonomous/safety-first-for-automated-driving-2.htm). In this context, perception is an integral element in defining the environment for the autonomous vehicle [rosique2019systematic]. For the environment's perception, the most common exteroceptive sensors include cameras, Lidar, and radar. These sensors have their own merits and demerits; for instance, cameras (visible spectrum) provide a high-resolution view of the environment but suffer from illumination problems and cannot be utilized in night conditions. Lidar uses laser light to model the environment and provides 3D point-cloud data. Despite providing a 3D model of the environment, Lidar is extremely expensive and suffers from resolution problems in adverse weather conditions. Radar enables the autonomous vehicle to identify small objects near the vehicle but has low resolution at range. These sensors thus have limited capability in perceiving the environment for the autonomous vehicle at night [van2018autonomous]. The utilization of a thermal camera in the autonomous vehicle's sensor suite provides a solution for perceiving the environment at night, yet it requires an efficient perception algorithm in terms of object detection [leira2021object].
Object detection is an essential component that forms the basis for the perception of the autonomous vehicle. Although deep neural networks have brought significant improvements in object detection, most of the effort has focused on visible-spectrum images, including YOLO [yolov3], SSD [ssd], Faster-RCNN [faster], RefineDet [refine], and M2Det [m2det]. There is still room for improvement in thermal object detection compared to its visible-spectrum counterpart. Some works in the literature focus on thermal object detection by employing domain adaptation techniques to transfer knowledge from the visible spectrum to the thermal domain. In addition, visible-spectrum object detectors have also been applied directly to thermal images. All these approaches to thermal object detection rely on supervised learning for feature representation. The abundance of unlabelled data can instead be leveraged for feature representation by employing self-supervised learning techniques as a surrogate for supervised learning on labelled datasets.
Learning good representations without human supervision is a fundamental problem in computer vision. The two main classes of possible solutions are generative and discriminative. Generative approaches, in the form of auto-encoders [salakhutdinov2009deep] and generative adversarial networks [goodfellow2014generative], model the pixels in the input space, which makes representation learning computationally expensive [hinton2006fast][kingma2013auto]. Discriminative approaches, in contrast, utilize an objective function defined over a pretext task to train the deep neural network [gidaris2018unsupervised]. This framework of learning representations through pretext tasks relies on heuristics for designing the pretext task, which limits the generalization of the learned representations. Contrastive-learning-based discriminative approaches learn representations in a latent space and thereby overcome the heuristics of pretext tasks [chen2020simple][oord2018representation].
The conjecture of this work relies on the hypothesis that view-invariant representations are encoded by the human brain [den2012prediction][hohwy2013predictive]. Humans view the surrounding environment through different sensory modalities. These modalities are incomplete and noisy, but the prominent factors about the environment, for instance geometry, semantics, and physics, are shared between them, illustrating a powerful model representation invariant to different views [smith2005development]. This study explores this hypothesis for learning representations that capture the information shared between the visible-spectrum and thermal domains. For this purpose, we employ a contrastive learning approach to learn feature embeddings in a latent space such that representations of the same scene from the two sensory domains are projected to nearby points (with distance normally measured as Euclidean distance), while those of different views are projected far apart. Fig. 1 illustrates the pictorial overview of our proposed framework. The proposed neural network, SSTN, is designed as a two-stage network. The first stage, self-supervised contrastive learning, performs feature representation by contrastive learning, maximizing the mutual information between the two domains and transferring the domain knowledge of the visible spectrum to the thermal domain in a self-supervised manner for object detection in the thermal domain. The second stage is the thermal object detector, consisting of a multi-scale encoder-decoder transformer network architecture that incorporates the feature embeddings from the self-supervised contrastive learning stage. The improvement in mean average precision of the proposed method with ResNet101, in comparison to other state-of-the-art methods, is 2.57% for FLIR-ADAS and 2.37% for the KAIST Multi-Spectral dataset.
The main contributions of our work are:
The proposed SSTN network illustrates the utilization of a self-supervised contrastive learning approach to maximize the mutual information between the visible and thermal domains for object detection in the thermal domain.
The contrastive learning approach is incorporated to cater to the scarcity of labelled datasets and learns the feature representation in a self-supervised manner.
In addition, to the best of our knowledge, the proposed work is the first to incorporate self-supervised contrastive learning on multi-sensor data for thermal object detection.
Further, we extend the feature embeddings learned in a self-supervised manner to a multi-scale encoder-decoder transformer network for thermal object detection.
We demonstrate the efficacy of the proposed network (best model) on two public datasets, FLIR-ADAS and KAIST Multi-Spectral, where it improves the mean average precision score by 2.57% on FLIR-ADAS and 2.37% on KAIST Multi-Spectral for thermal object detection in comparison to other state-of-the-art methods.
The rest of the paper is organized as follows: Section II reviews related work, Section III explains the proposed methodology, Section IV discusses the experimentation and results, and Section V concludes the paper.
II Related Work
Deep neural networks have been used as function approximators for predicting and classifying objects in the visible and infrared spectrum domains. In the literature, extensive research has been carried out on designing robust object detectors for the visible spectrum domain [yolov3][m2det][faster][ssd]. For thermal object detection, the research focuses on feature engineering by fusing information. [krivsto2020thermal] have used the visible-spectrum object detector YOLOv3 for person detection in the thermal domain and have benchmarked its performance against Faster-RCNN, SSD, and Cascade R-CNN. [ghose2019pedestrian] address the thermal object detection problem by augmenting thermal image frames with their corresponding saliency maps to provide an attention mechanism for pedestrian detection using Faster-RCNN. Multispectral images (thermal and near-infrared (NIR)) have also been fused to train YOLOv3 for thermal object detection. Besides deep neural networks, classical image processing approaches have been applied to thermal object detection [soundrapandiyan2015adaptive]. [baek2017efficient][li2012effective] have applied HOG and local binary patterns for feature extraction and trained support vector machine (SVM) classifiers for object detection in the thermal domain. Much research has also been done on domain knowledge transfer between the visible and infrared domains. [devaguptapu2019borrow] have employed a generative adversarial approach for modelling a thermal image from the corresponding visible RGB image; the generated thermal images are used to train a variant of Faster-RCNN to detect objects in both RGB and thermal images. Similarly, a cross-domain semi-supervised learning framework has been proposed for feature representation in the target domain [yu2019unsupervised]. In addition, researchers have illustrated the use of transfer learning between domains using generative adversarial networks [zheng2020p][royer2020xgan].
Self-supervised learning provides a way to learn feature embeddings without the explicit availability of a labelled dataset. In the image domain, this representation learning uses predictive approaches that learn feature embeddings by devising artificially handcrafted pretext tasks [doersch2015unsupervised][noroozi2016unsupervised][zhang2016colorful][chen2019self]. Although these pretext tasks yield promising feature representations, the approaches depend on the heuristics of designing the pretext tasks. Another approach to learning feature embeddings is contrastive learning. [hadsell2006dimensionality] proposed learning representations by contrasting negative and positive pairs. To improve computational efficiency, a method employing a memory bank to store instance class representations was proposed in [wu2018unsupervised]. This work was further investigated, and the memory bank was replaced by in-batch negative sampling [ji2019invariant][ye2019unsupervised]. In addition, [chen2020simple] proposed a simple framework to learn feature representations through contrastive learning without a memory bank or specialized architectures; this framework explores data augmentation techniques, a learnable nonlinear transformation layer between the representation and the contrastive loss, and representation learning with a contrastive cross-entropy loss. [tian2019contrastive] explored contrastive predictive coding and proposed contrastive multiview coding for representation learning between different views of the same scene by augmenting them in colour-channel space. Similarly, a supervised contrastive loss was introduced in [khosla2020supervised].
III Proposed Method
Learning effective visual representations without labelled data is a challenging problem. The goal is to learn representations that capture useful features for deep-learning-based object detection. In this work, we explicitly explore the co-occurrence of information in multi-sensor data obtained from thermal and RGB images. A two-stage deep neural network, the Self-Supervised Thermal Network (SSTN), is proposed: the first stage consists of pre-training with self-supervised representation learning via a contrastive learning approach, and the second stage consists of a multi-scale encoder-decoder transformer network for object detection in the thermal domain based on the features learned in the first stage.
III-A Self-supervised Contrastive Learning
The thermal data captures the surrounding information through the heat signatures of different objects and exhibits large variations; thermal images therefore require a robust representation learning mechanism. We have adopted multi-spectrum self-supervised contrastive learning for feature representation. Fig. 2 shows the self-supervised contrastive learning network. Let x_t and x_v represent the thermal and RGB images of the same scene. A stochastic data augmentation module transforms each input into two correlated images. A neural network encoder f(·) maps an input x to a representation vector h = f(x), which is normalized to the unit hypersphere [khosla2020supervised][chen2020simple]. Both the thermal and RGB inputs are fed separately to the same encoder network to obtain a pair of representation vectors. The ResNet network is adopted as the encoder network [a10] due to its applicability and common usage. We instantiate a multi-layer perceptron as a projection network g(·) that maps h to a vector z = g(h) [a3]. The multi-layer perceptron consists of a single hidden layer followed by an output layer. The output vector z is also normalized to lie on the unit hypersphere, facilitating the calculation of inner products in the projection space [tian2019contrastive].
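The encoder-plus-projection mapping described above can be sketched as follows. This is a minimal NumPy illustration, not the trained network: the feature dimension (2048, as in a ResNet), the hidden size, and the 128-dimensional projection output are assumed values in the spirit of [chen2020simple], and random weights stand in for learned ones.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

class ProjectionHead:
    """Single-hidden-layer MLP g(.) mapping encoder features h to z (sizes are illustrative)."""
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.02, (in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0, 0.02, (hidden_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, h):
        hidden = np.maximum(h @ self.w1 + self.b1, 0.0)  # ReLU
        z = hidden @ self.w2 + self.b2
        return l2_normalize(z)  # unit norm, so inner products are cosine similarities
```

Because both outputs lie on the unit hypersphere, the contrastive loss below can compare them directly via inner products.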
For the formulation of the contrastive loss, a set of N random pairs of thermal and RGB images is selected, giving a batch size of 2N to be fed for training the network. x_i^t and x_i^v represent the thermal and RGB samples of the i-th pair. In a multi-spectrum batch, i is the index of a randomly selected thermal sample and j(i) is the index of its corresponding RGB sample. Eq. 1 gives the self-supervised contrastive learning loss function [tian2019contrastive][a5]. Here, τ is a scalar temperature parameter, and z_i · z_j represents the inner product of the two representation vectors z_i and z_j. The index i identifies the anchor, j(i) is called the positive in the batch, and the remaining indexes are the negatives. To clarify, each anchor has one positive pair and 2(N−1) negative pairs.
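A minimal sketch of such a cross-spectrum contrastive loss, assuming an NT-Xent-style formulation as in [chen2020simple][khosla2020supervised]; the temperature value and embedding shapes are illustrative, not the paper's configuration.

```python
import numpy as np

def multispectral_contrastive_loss(z_thermal, z_rgb, temperature=0.07):
    """
    NT-Xent-style loss over a batch of N thermal/RGB pairs (2N samples in total).
    z_thermal, z_rgb: (N, d) L2-normalized embeddings; row i of each encodes
    the same scene, so it is the positive for the corresponding anchor.
    """
    z = np.concatenate([z_thermal, z_rgb], axis=0)   # (2N, d)
    n = z_thermal.shape[0]
    sim = z @ z.T / temperature                      # pairwise inner products
    np.fill_diagonal(sim, -np.inf)                   # exclude self-similarity
    # positive for anchor i is its cross-spectrum partner j(i)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

When the two embeddings of each scene agree, the positive term dominates the softmax and the loss approaches zero; mismatched pairs drive the loss up, which is the gradient signal that pulls the two spectra together in the latent space.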
III-B Encoder-Decoder Transformer Network
The transformer encoder takes as input low-resolution feature maps obtained from a convolutional neural network backbone. In this work, multi-scale feature maps are obtained from the self-supervised contrastive learning stage; they are extracted from the later pre-training stages, i.e. the convolution blocks of the ResNet encoder [a10]. The feature maps are transformed by convolution, and the lowest-resolution feature map is obtained by an additional strided convolution. All multi-scale feature maps have the same number of channels. The encoder inputs and outputs feature maps of the same resolution. Every transformer encoder layer consists of a multi-scale, multi-head self-attention module [a6][a7]. Let Ω_q and Ω_k denote the sets of query and key elements, let q ∈ Ω_q represent a query element with representation feature z_q, and let k ∈ Ω_k represent a key element with feature embedding x_k. The multi-head, multi-scale self-attention features are calculated over the M attention heads and the L input feature-map levels. W_m and W′_m represent weight tensors learned during training. The attention weight tensor A_mqk is normalized over the key elements so that it sums to one for each query, with the projections that produce the attention weights being learnable weight tensors. The spatial position encodings are learned and shared among all the attention layers of the encoder for a given query, key, and feature-map pair [a8].
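For illustration, the sketch below implements plain single-scale multi-head self-attention over flattened feature-map elements; the multi-scale (deformable) variant used here additionally samples keys across feature levels, which is omitted. Random weight initialization stands in for the learned tensors W_m and W′_m.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads=4, seed=0):
    """
    Plain multi-head self-attention over flattened feature-map elements.
    x: (num_elements, d_model). The projection matrices are the learnable
    weight tensors; here they are randomly initialized for illustration.
    """
    n, d = x.shape
    d_head = d // num_heads
    rng = np.random.default_rng(seed)
    w_q, w_k, w_v, w_o = (rng.normal(0, 0.02, (d, d)) for _ in range(4))
    q = (x @ w_q).reshape(n, num_heads, d_head)
    k = (x @ w_k).reshape(n, num_heads, d_head)
    v = (x @ w_v).reshape(n, num_heads, d_head)
    out = np.empty_like(q)
    attn = np.empty((num_heads, n, n))
    for m in range(num_heads):  # one normalized attention map per head
        a = softmax(q[:, m] @ k[:, m].T / np.sqrt(d_head), axis=-1)
        attn[m] = a
        out[:, m] = a @ v[:, m]
    return out.reshape(n, d) @ w_o, attn
```

The softmax guarantees that each query's attention weights over the keys sum to one, which is the normalization property stated above.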
The transformer decoder mirrors the encoder's arrangement of sub-layers. Its layers consist of multi-scale cross-attention and self-attention modules, both of which take object queries as input. The object queries are initially set to zero and are combined with learnt positional encodings and the encoder memory. In the cross-attention module, the object queries derive features from the multi-scale feature maps, with the encoder's output feature maps serving as the key elements. In the self-attention module, attention is computed among the object queries. The decoder transforms the object queries into output embeddings, which are then fed to the feed-forward network that computes the bounding-box coordinates and class labels.
III-B3 Feed-Forward Network
The decoder's output embeddings are given to the feed-forward network, which consists of a three-layer perceptron with ReLU activation and a fixed hidden dimension. A fully connected layer is used as the final layer, performing a linear projection to predict the output. The output consists of class labels and bounding-box coordinates comprising the height, width, and normalized centre coordinates. Moreover, a no-object class label is predicted for images with no object present.
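The prediction head can be sketched as follows, assuming illustrative layer sizes (256-dimensional embeddings) and the three dataset classes plus a no-object label; a sigmoid keeps the box coordinates normalized to [0, 1].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PredictionHead:
    """
    Three-layer perceptron with ReLU for box regression plus a linear
    classifier, following the description above; sizes are illustrative
    and weights are random stand-ins for trained parameters.
    """
    def __init__(self, d_model=256, hidden=256, num_classes=3, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda i, o: rng.normal(0, 0.02, (i, o))
        self.box_w = [init(d_model, hidden), init(hidden, hidden), init(hidden, 4)]
        self.cls_w = init(d_model, num_classes + 1)  # +1 for the no-object label

    def __call__(self, embeddings):
        h = embeddings
        for w in self.box_w[:-1]:
            h = np.maximum(h @ w, 0.0)            # ReLU
        boxes = sigmoid(h @ self.box_w[-1])        # (cx, cy, w, h) in [0, 1]
        logits = embeddings @ self.cls_w           # class scores incl. no-object
        return boxes, logits
```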
IV Experimentation and Results
IV-A Datasets
In this study, we have used two publicly available datasets, i.e. the FLIR-ADAS dataset (https://www.flir.in/oem/adas/adas-dataset-form/) and the KAIST Multi-Spectral dataset [kiast]. The FLIR-ADAS dataset contains 9214 thermal and RGB image pairs, with objects annotated in MS-COCO format, i.e. bounding boxes and class labels. The dataset has three class labels: car, person, and bicycle. The data is collected using the FLIR Tau2 camera and includes both day- and night-time images. We have used the standard training and testing split of the dataset, as illustrated in Table I.
Table I: Dataset splits (total, training, and test images) for FLIR-ADAS and KAIST Multi-Spectral.
The KAIST Multi-Spectral dataset has 95000 thermal and RGB image pairs [kiast]. However, in this dataset only the person class is annotated. The KAIST Multi-Spectral dataset is collected using a FLIR A35 camera and includes both day- and night-time images. We have used the standard training and testing split of the dataset, as illustrated in Table I.
IV-B Evaluation Metric
In this study, we have used the standard MS COCO evaluation metric [coco]. The intersection over union (IoU) is computed between the area covered by the ground-truth bounding box and the area covered by the predicted bounding box, as given by Eq. 3. A detection is counted as a True Positive (TP) when the IoU exceeds the threshold and as a False Positive (FP) when it does not. Based on the TPs and FPs, precision and recall are calculated using Eq. 4.
For each class, the average precision (AP) is computed. The AP is the area under the precision-recall (PR) curve, which is computed using Eq. 5. The MS COCO evaluation metric varies the IoU threshold from 0.5 to 0.95 with an increment of 0.05. In this study, a single fixed IoU threshold is considered for all the experiments.
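The metric computations above can be sketched as follows; the AP uses all-point interpolation over the ranked detections, a simplification of the full MS COCO protocol.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for one class."""
    order = np.argsort(scores)[::-1]                 # rank detections by confidence
    flags = np.asarray(is_tp, dtype=bool)[order]
    tp = np.cumsum(flags)
    fp = np.cumsum(~flags)
    recall = tp / num_gt
    precision = tp / (tp + fp)
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += (r - prev_recall) * p                  # rectangle per recall step
        prev_recall = r
    return ap
```

For example, two correct detections covering both ground truths give an AP of 1.0, while a false positive ranked above the only true positive halves it.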
This section explains the experimentation details of the proposed network (SSTN) and discusses the results.
IV-C1 Self-supervised Contrastive Learning
The self-supervised contrastive network is trained using thermal and RGB image pairs. The input images undergo augmentations that include random crop-and-resize, horizontal flips, and colour jitter. For this study, we have only considered ResNet50 and ResNet101 as encoders for the self-supervised network. The network is trained with fixed settings for the number of epochs, batch size, learning rate, and temperature. The network is implemented using the Pytorch deep learning library, and the weights of the network are optimized using the loss function given by Eq. 1. All the experimentation is performed on a three-GPU machine, where each GPU has 12 GB of memory.
IV-C2 Faster-RCNN Baseline
The Faster-RCNN with a ResNet backbone is employed to establish a baseline for the proposed network [faster]. Two types of experiments are performed. First, the network is trained in a supervised manner on thermal image data with no pre-trained weights. Second, the network backbone is initialized with the features optimized by the self-supervised contrastive network and finetuned on labelled thermal image data. We have adopted the Pytorch implementation of Faster-RCNN for this purpose. The input data is augmented with random scaling, random crop-and-resize, and horizontal flips. Note that the input image augmentation is kept the same for all experiments conducted with Faster-RCNN and the multi-scale encoder-decoder transformer network. The network is trained with a stochastic gradient descent optimizer.
IV-C3 Encoder-Decoder Transformer Network
The transformer encoder-decoder network is trained using an AdamW [a9] optimizer with separate initial learning rates for the transformer and the backbone, together with weight decay. The batch size is set to 2, and the network is trained for 100 epochs. The multi-scale feature maps are extracted from the self-supervised contrastive learning network, as shown in Fig. 2. The transformer uses an equal number of encoder and decoder layers. Absolute position encodings are computed as functions of sine and cosine at different frequencies and then concatenated across the channels to obtain the final position encoding. Note that in all the experimentation, the hyper-parameter values are determined heuristically.
An optimal bipartite matching scheme, inspired by [a8], is selected for the loss calculation. The proposed method outputs a fixed set of N predictions. Eq. 6 defines the bipartite matching loss between the set of ground-truth labels y and the set of predicted labels ŷ, where σ ranges over the permutations of the N elements and the matching term corresponds to the pair-wise matching cost between predicted and ground-truth labels. The Hungarian algorithm is used to compute the assignment between ground truth and predictions. The matching cost aggregates both the class labels and the bounding boxes of the predictions and the ground truth. Each element i of the ground-truth set can be written as y_i = (c_i, b_i), where c_i is the class label and b_i is the normalized bounding-box vector containing the centre coordinates, width, and height. The predicted class probability and bounding box for the assigned prediction σ(i) are denoted p̂_σ(i)(c_i) and b̂_σ(i), respectively. To match the prediction and ground-truth sets, a direct one-to-one correspondence is found without any duplicates. The Hungarian loss function is illustrated by Eq. 7.
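The one-to-one matching can be illustrated with a brute-force search over permutations of a small cost matrix; a practical implementation would use the Hungarian algorithm (e.g. scipy.optimize.linear_sum_assignment) instead, since full enumeration is exponential in N.

```python
from itertools import permutations

import numpy as np

def optimal_assignment(cost):
    """
    Brute-force optimal bipartite matching for a square cost matrix,
    where cost[i, j] is the matching cost of prediction i and ground
    truth j. Returns the assignment (as a permutation) and its cost.
    """
    n = cost.shape[0]
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i, perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost
```

With the assignment fixed, the Hungarian loss of Eq. 7 is then evaluated only on the matched prediction/ground-truth pairs.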
σ̂ indicates the optimal assignment. The bounding-box score is computed as shown in Eq. 8, where an L1 loss combined with the complete IoU (CIoU) loss [a10] is used. ρ(·) denotes the Euclidean distance, measured here between the centres of the predicted bounding box b̂ and the ground-truth bounding box b. c is the diagonal length of the smallest enclosing box that covers the two boxes. α is a positive trade-off parameter, and v illustrates the consistency of aspect ratio, as given by Eq. 10 and Eq. 11.
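A sketch of the CIoU term under the definitions above, for boxes given as normalized (cx, cy, w, h); the eps guard is an implementation detail, not part of the formulation.

```python
import numpy as np

def ciou_loss(pred, gt, eps=1e-9):
    """
    Complete IoU loss between two boxes given as (cx, cy, w, h):
    L_CIoU = 1 - IoU + rho^2(b, b_hat) / c^2 + alpha * v,
    where rho is the centre distance, c the diagonal of the smallest
    enclosing box, and v the aspect-ratio consistency term.
    """
    def corners(b):
        cx, cy, w, h = b
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    px1, py1, px2, py2 = corners(pred)
    gx1, gy1, gx2, gy2 = corners(gt)
    inter = max(0.0, min(px2, gx2) - max(px1, gx1)) * \
            max(0.0, min(py2, gy2) - max(py1, gy1))
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    iou = inter / (union + eps)
    rho2 = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2        # centre distance^2
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + \
         (max(py2, gy2) - min(py1, gy1)) ** 2                     # enclosing diagonal^2
    v = (4 / np.pi ** 2) * (np.arctan(gt[2] / gt[3]) -
                            np.arctan(pred[2] / pred[3])) ** 2    # aspect-ratio term
    alpha = v / (1 - iou + v + eps)                               # trade-off parameter
    return 1 - iou + rho2 / (c2 + eps) + alpha * v
```

Identical boxes yield a loss of (numerically) zero, while non-overlapping boxes are still penalized through the centre-distance term, which is the advantage of CIoU over plain IoU.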
For the training, fixed weighting coefficients are used for the loss terms, and the number of object queries is set to a fixed value. Moreover, an experiment is also conducted in which the self-supervised contrastive learning (pre-training) stage is replaced with a ResNet backbone trained with no pre-training weights, using the same parameter configuration.
| Methods | mAP (FLIR-ADAS) | mAP (KAIST) |
| --- | --- | --- |
| Faster-RCNN + ResNet50 backbone | 56.70 | 59.02 |
| Faster-RCNN + ResNet50 w/ feature activation | 62.45 | 65.61 |
| Faster-RCNN + ResNet101 backbone | 59.03 | 64.31 |
| Faster-RCNN + ResNet101 w/ feature activation | 64.34 | 67.52 |
| Encoder-decoder transformer w/ ResNet50 | 68.77 | 62.08 |
| Encoder-decoder transformer w/ ResNet101 | 72.03 | 68.25 |
| Self-supervised thermal network (SSTN50) | 75.19 | 70.62 |
| Self-supervised thermal network (SSTN101) | 77.57 | 73.22 |
| Methods | mAP (FLIR-ADAS) | mAP (KAIST) |
| --- | --- | --- |
| (Ours) Self-supervised Thermal Network (SSTN101) | 77.57 | 73.22 |
Table II shows the quantitative evaluation of the proposed algorithms. The proposed framework is evaluated on the FLIR-ADAS and KAIST Multi-Spectral datasets, using ResNet50 and ResNet101 as encoders in the self-supervised contrastive learning stage. ResNet101 enables the self-supervised contrastive network to learn a more reliable representation of the data, as the mAP score improves by 2.38% for FLIR-ADAS and 2.6% for KAIST Multi-Spectral compared to ResNet50. The efficacy of the self-supervised contrastive network is visible from the performance of SSTN in comparison with an encoder-decoder transformer using a plain ResNet backbone: an increase of 6.42% mAP on FLIR-ADAS and 8.54% on KAIST Multi-Spectral with ResNet50, and an increase of 5.54% mAP on FLIR-ADAS and 4.97% on KAIST Multi-Spectral with ResNet101. A similar trend is visible in the Faster-RCNN baseline. Using the pre-trained weights and finetuning on thermal images improves the accuracy by 5.75% on FLIR-ADAS and 6.59% on KAIST Multi-Spectral with ResNet50, and by 5.31% on FLIR-ADAS and 3.21% on KAIST Multi-Spectral with ResNet101. Fig. 3 and Fig. 4 illustrate the quantitative comparison of mAP scores between the proposed best model, SSTN101, and the other variants.
Moreover, the proposed algorithm is compared with other state-of-the-art algorithms, as shown in Table III. The self-supervised thermal network outperforms the existing algorithms by 2.57% for FLIR-ADAS and 2.37% for the KAIST Multi-Spectral dataset. Fig. 5 shows the qualitative results on the FLIR-ADAS and KAIST Multi-Spectral datasets.
V Conclusion
This study focuses on thermal object detection, since thermal imagery is a fundamental tool for an autonomous vehicle, particularly at night. We have employed a self-supervised technique to learn enhanced feature representations using unlabelled data. A multi-scale encoder-decoder transformer network uses these enhanced feature embeddings to build a robust thermal-image object detector. The efficacy of the self-supervised thermal network is evaluated on the FLIR-ADAS and KAIST Multi-Spectral datasets: a mean average precision of 77.57% is achieved on FLIR-ADAS and 73.22% on KAIST Multi-Spectral. In future work, we aim to fuse Lidar data with thermal image data to improve object detection in adverse environmental conditions. Moreover, we plan to use the thermal data to understand and classify weather conditions to optimize the perception system of the autonomous vehicle.