ICRA 2020 | Repository for "PST900 RGB-Thermal Calibration, Dataset and Segmentation Network" | C++, Python, PyTorch
In this work we propose long wave infrared (LWIR) imagery as a viable supporting modality for semantic segmentation using learning-based techniques. We first address the problem of RGB-thermal camera calibration by proposing a passive calibration target and procedure that is both portable and easy to use. Second, we present PST900, a dataset of 894 synchronized and calibrated RGB and thermal image pairs with per-pixel human annotations across four distinct classes from the DARPA Subterranean Challenge. Lastly, we propose a CNN architecture for fast semantic segmentation that combines RGB and thermal imagery in a way that leverages the RGB stream independently. We compare our method against the state of the art and show that it outperforms existing methods on our dataset.
The ability to parse raw imagery and ascertain pixel-wise and region-wise semantic information is desirable for environment perception, enabling advanced robot autonomy. Semantic segmentation is a major subject of robotics research, with applications ranging from medicine and agriculture to autonomous vehicles. Most notably, convolutional neural networks (CNNs) have been applied to image classification tasks, where they dramatically outperform their classical counterparts. CNNs have also grown in popularity as highly effective tools for extracting semantic information from color images. The recent growth in autonomous vehicle research has driven the design of datasets, benchmarks and network architectures that focus on semantic segmentation from RGB imagery [8, 36, 1, 15, 9].
Until recently, thermal cameras were primarily used by the military, with their cost and restricted usage making them difficult to acquire. However, over the last few years thermal cameras have become more easily accessible, and their competitive prices have led to an increase in popularity. The primary use case thus far has been surveillance, and most popular thermal cameras, such as FLIR's range of LWIR cameras, are specifically designed to identify human temperature signatures. There is a vast body of research in the field of thermal-camera-based human identification and tracking, which is out of the scope of our work.
We propose the usage of thermal cameras in addition to RGB cameras in challenging environments. Specifically, we look at environments with visibility and illumination limitations, such as in underground tunnels, mines and caves. We show that the additional information from the long-wave infrared spectrum can help to improve overall segmentation accuracy since it is not dependent on visible spectrum illumination which RGB cameras rely heavily upon. In this work, we also show that the segmentation of objects that do not possess very unique thermal signatures, such as hand-drills, also improves with the fusion of thermal information.
Using thermal imagery in addition to RGB for general-purpose semantic segmentation is a growing field of research, with methods such as MFNet and RTFNet currently the latest and most popular CNN-based approaches. However, for these methods to generalize well and achieve state-of-the-art accuracy, they require large amounts of training data. Unlike for RGB imagery, large datasets of annotated thermal imagery for semantic segmentation are hard to find. Ha et al. present the most recent dataset for RGB and thermal segmentation. In our work, we present what we believe is the second dataset containing calibrated RGB and thermal imagery with per-pixel annotations across four different classes.
We additionally propose a dual-stream CNN architecture that combines RGB and Thermal imagery for semantic segmentation. We design the RGB stream to be independently re-usable as it is easier to collect large amounts of RGB data and annotations. We design our Thermal stream to leverage the information learned from the RGB stream to refine and improve class predictions. In contrast to a single or tightly coupled network architecture like MFNet and RTFNet, we are able to leverage both modalities in a way that is able to achieve high accuracy while working in real-time on embedded GPU hardware. In summary, our contributions are as follows:
A method of RGB and LWIR (thermal) camera calibration that uses no heated elements, allowing for faster, portable calibration in the field.
PST900 - Penn Subterranean Thermal 900: a dataset of approximately 900 annotated RGB and LWIR (thermal) images, in both raw 16-bit and FLIR's AGC 8-bit format, from a variety of challenging environments. An additional 3416 annotated RGB images are also provided from these environments.
A dual-stream CNN architecture that is able to fuse RGB information with thermal information in such a way that allows for RGB stream re-usability and fast, real-time inference on embedded GPU platforms such as the NVIDIA Jetson TX2 and AGX Xavier.
Extensive experiments comparing our method to similar approaches on both PST900 and the MFNet dataset.
There has been work on thermal and RGB interaction in the form of cross-modal prediction, where either RGB or thermal imagery is used to predict the other. This idea of cross-modal learning has also been extended to stereo disparity estimation, where matching is done across the two different modalities [29, 4], resulting in the creation of interesting cross-modal datasets such as LITIV and St. Charles. Since thermal cameras still operate at a lower resolution than similarly priced RGB cameras, Choi et al. and Feras et al. propose learning-based enhancement and super-resolution methods for thermal imagery that use RGB imagery as a guide.
Qiao et al. propose a novel level-set method for contour detection, using an edge-based active contour model designed specifically for thermal imagery. Luo et al. use semantic information in an egocentric RGB-D-T SLAM pipeline, applying a variant of YOLO to their RGB-D-T data. Their experiments suggest that the model benefits most heavily from the thermal modality and that thermal residues provide good indicators for action recognition tasks.
Directly relevant to our work is MFNet, an RGB-T semantic segmentation network architecture proposed by Ha et al. They present an RGB-T dataset of urban scenes for autonomous vehicles and a dual-encoder architecture for RGB and thermal image data. They show that this architecture performs better than naively introducing the thermal modality as an extra channel. Additionally, the authors note that with slightly misaligned RGB-T images, introducing thermal as a fourth channel can have detrimental effects on segmentation accuracy, often performing worse than RGB alone.
Recently, Sun et al. proposed RTFNet, a segmentation architecture that uses a dual ResNet encoder with a small decoder. The multimodal fusion is performed by an element-wise summation of feature blocks from the RGB and thermal encoder pathways. Their decoder makes use of a novel Upception block, which alternately preserves and increases spatial resolution while reducing channel count. They evaluate their network against popular semantic segmentation networks such as U-Net, SegNet, PSPNet, DUC-HDC and ERFNet. They also compare their work against MFNet and show that RTFNet outperforms it on the MFNet dataset. From here on, we compare our method to MFNet and RTFNet, as they are the most relevant to our work.
Our primary motivation for designing a calibration procedure was portability and ease of use in resource-constrained subterranean environments. Current calibration methods are either active, where the calibration target is a thermal emitter or is externally heated to retain a thermal signature, or passive, where no explicit heat source is required. Popular among active methods is a paper calibration target heated by an external heat source, such as a flood lamp, to drive the temperature of the black ink higher than the white, resulting in an inverted checkerboard in the thermal camera. We initially used this technique and were able to calibrate RGB-T intrinsics and extrinsics successfully, but the process required significant effort to ensure that the amount of heat imparted to the checkerboard was sufficient to obtain sharp checkerboard corners consistently. This was especially difficult because of the fast cool-down time of the heated elements. Additionally, this required a large heat source (flood lamp), which was a burden to transport and use in the field. This motivated our search for a calibration target that was completely passive and could be used with off-the-shelf calibration tools.
Zoetgnande et al. propose an active method to calibrate a low-cost, low-resolution stereo LWIR thermal camera system. They design a board with 36 bulbs and develop a sub-pixel corner estimation algorithm to detect these heat signatures against a wooden calibration frame. Zalud et al. detail a five-camera RGB-D(ToF)-T system for robotic telepresence. Their calibration target was an actively heated aluminum checkerboard with squares cut out, placed in front of a black insulator. Rangel et al. present a detailed comparison of different targets and methods for calibrating depth and thermal cameras together. They compare active and passive methods and pick a calibration target with circular cutouts that requires minimal heating prior to calibration to appear in both depth and thermal imagery.
Tarek et al. present a visual odometry method that uses stereo thermal cameras. While this work is not directly relevant to ours, they propose an interesting passive calibration method: they design a calibration board from highly polished aluminum with matte black squares applied to it. The board is placed such that the aluminum reflects the cold sky, leading to high contrast between the aluminum and the squares. While this is a step in the right direction for portability and ease of use, we cannot rely on the cold-sky effect when calibrating in subterranean environments. Our proposed method draws inspiration from this approach to create a passive method that works in both settings.
A characteristic of LWIR reflections from metallic surfaces such as aluminum and copper is that detections in this band have much lower emissivities (ε) compared to other bands such as medium wave infrared (MWIR). This leads to strong detections of reflections even if the material is rough and unpolished. We propose a calibration target that uses thermal reflectivity, specifically the reflection of the thermographer, i.e., the person calibrating the system, to illuminate sand-blasted aluminum squares mounted on a black acrylic background to form a checkerboard (Fig 2). We decided against polishing either surface, since we found that highly polished surfaces tended to interfere with corner detection in RGB imagery. In practice, the silver aluminum checkers appear to be at a higher temperature than the black acrylic background in the thermal imagery. We also achieve sufficient contrast in the RGB imagery to use existing checkerboard detectors. Note that since the silver checkers appear with higher intensity values in both modalities, thermal and RGB, the correspondence between the two images is direct, and no correspondence inversions are required, such as those needed when heating black ink on paper. For corner detection, OpenCV's chessboard detector performed poorly in both RGB and thermal, so we used a C++ implementation of libcbdetect instead. Once the checkerboards are detected, we use OpenCV's fisheye camera calibration toolbox to first calibrate RGB and thermal intrinsics, followed by extrinsics.
Let $K_{rgb}$, $D_{rgb}$ and $K_{t}$, $D_{t}$ be the camera matrices and distortion coefficients obtained from intrinsic calibration of the RGB camera and thermal camera respectively. From these parameters, we obtain $K'_{rgb}$ and $K'_{t}$, i.e., the undistorted camera matrices for both cameras. The RGB camera is a Stereolabs ZED Mini, whose intrinsics are modeled with a plumb-bob model, whereas the thermal camera is a FLIR Boson 320 2.3mm, whose intrinsics are captured with a fisheye model.
To register each pixel from the thermal camera to the frame of the RGB camera, we require a mapping of all image coordinates in the thermal image to image coordinates in the RGB image. We achieve this by first projecting all RGB coordinates into 3D points $(X, Y, Z)$ using the depth image acquired from stereo depth estimation. We then identify a mapping to the thermal frame by projecting these points back onto the thermal camera frame, as seen in Fig 3.
For some coordinate $(u, v)$, let $p_{rgb} = [u, v, 1]^T$ be a point in the undistorted RGB image, and let $d(u, v)$ denote the depth provided by the stereo system at that pixel. This point is projected to its 3D location $P$ using $K'_{rgb}$ as follows:

$$P = d(u, v) \, K'^{-1}_{rgb} \, p_{rgb}$$

This 3D coordinate can then be projected onto the thermal frame using the calibrated extrinsics $R$ and $T$:

$$p_{t} \sim K'_{t} \, (R P + T)$$

where $p_{t}$ is the point in the thermal image to which $p_{rgb}$ is mapped. We now have the forward mapping $p_{rgb} \mapsto p_{t}$, from which we find the inverse mapping $p_{t} \mapsto p_{rgb}$. While calculating the inverse mapping, we handle the issue of parallax and the many-to-one mapping by choosing the closest 3D point during re-projection. In our dataset, we provide aligned thermal imagery with holes, along with a simple interpolation script to perform hole filling.
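The per-pixel registration described above can be sketched in a few lines of NumPy. This is a minimal illustration (the function name and array conventions are ours, not from the released toolkit), assuming undistorted pinhole intrinsics for both cameras:

```python
import numpy as np

def register_rgb_to_thermal(depth, K_rgb, K_t, R, T):
    """For every RGB pixel (u, v) with valid depth d(u, v), compute the
    thermal-image coordinate it maps to: p_t ~ K_t (R * d * K_rgb^-1 [u,v,1]^T + T).

    depth : (H, W) metric depth aligned to the undistorted RGB image
    K_rgb : (3, 3) undistorted RGB intrinsics
    K_t   : (3, 3) undistorted thermal intrinsics
    R, T  : extrinsics taking RGB-frame points into the thermal frame
    returns an (H, W, 2) array of (u_t, v_t); NaN where depth is invalid
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW

    # Back-project each RGB pixel to 3D using its stereo depth
    P = (np.linalg.inv(K_rgb) @ pix) * depth.reshape(1, -1)

    # Transform into the thermal frame and project with the thermal intrinsics
    p_t = K_t @ (R @ P + T.reshape(3, 1))
    uv_t = (p_t[:2] / p_t[2]).T.reshape(H, W, 2)
    uv_t[depth <= 0] = np.nan  # no depth -> no correspondence (a "hole")
    return uv_t
```

Inverting this forward map yields the thermal-to-RGB lookup; where several RGB pixels land on the same thermal pixel, the closest 3D point wins, as described above.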
In this section, we present our dataset of 894 aligned pairs of RGB and thermal images with per-pixel human annotations. This dataset was driven by the needs of the DARPA Subterranean Challenge (https://www.subtchallenge.com/), in which a set of four visible artifacts (fire-extinguisher, backpack, hand-drill, and survivor: a thermal mannequin or human) are to be identified in challenging underground environments with no guarantees of environmental illumination or visibility. We therefore equip our robots and data collection platforms with high-intensity LEDs. Our sensor head, as shown in Fig 1, consists of a Stereolabs ZED Mini stereo camera and a FLIR Boson 320 camera. We intentionally opted for a wider field-of-view thermal camera for greater overlap between the two sensors. We design a calibration procedure to obtain camera intrinsics and system extrinsics as described in the previous section. We then collect representative data from multiple environments with varying degrees of lighting, including the Number 9 Coal Mine in Lansford, PA, and cluttered indoor and outdoor spaces, as seen in Fig 4.
In addition to this corpus of labelled and calibrated RGB-Thermal data, we additionally provide a much larger set of similarly annotated RGB-only data, collected over a much larger set of environments. Our proposed method is able to leverage both datasets to produce efficient and accurate predictions.
Labels are acquired from a pool of human annotators within our laboratory. The annotators are briefed on the different artifacts present and on the difficulty of visually identifying small artifacts such as the hand-drill. Annotations are made per pixel, and each set of RGB, thermal and label images is verified by the authors for accuracy. Incorrectly labeled or missed artifacts are sent back for re-labeling and re-verification. With this process in place, we acquired our dataset of 894 aligned and annotated RGB-thermal image pairs and 3416 annotated RGB images. Our dataset is made publicly available to the community along with a basic toolkit here: https://github.com/ShreyasSkandanS/pst900_thermal_rgb.
|Table I: Class imbalance, per pixel and per instance (in %)|
Collecting large amounts of RGB data and acquiring accurate per-pixel human annotations is significantly easier and cheaper than collecting calibrated and aligned RGB-T data. We therefore designed a network that leverages this fact by having an independent RGB stream that can be trained without thermal data. We introduce the thermal modality at the output of this stream to further improve the initial results. We propose a sequential, dual-stream architecture that draws influence from ResNet-18, UNet and ERFNet, and show that our design is efficient, allowing real-time inference on embedded hardware; flexible, since the early exit quickly provides a coarser prediction; and accurate, outperforming other methods on our dataset and showing competitive performance on the MFNet dataset.
The network is first trained on annotated RGB images only. We use a weighted negative log-likelihood loss during training and select the model with the highest mean Intersection-over-Union (mIoU). The weights for the loss function are calculated using the weighting scheme proposed by Paszke et al. On datasets such as ours, a weighting scheme is necessary given the dramatic imbalance between background and foreground classes, as seen in Table I. Once the network is trained, we remove the final softmax operator, which results in what is intuitively a per-pixel confidence volume for the different classes in our dataset. We use this volume, along with the thermal modality, as input to our next fusion stream.
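The weighting scheme of Paszke et al. sets $w_k = 1 / \ln(c + p_k)$, where $p_k$ is the empirical pixel frequency of class $k$ and $c = 1.02$, which bounds the weights to roughly the interval $(1.4, 50.5)$. A minimal sketch (function name ours):

```python
import numpy as np

def class_weights(label_maps, num_classes, c=1.02):
    """Weights for a per-class weighted loss: w_k = 1 / ln(c + p_k),
    where p_k is the empirical per-pixel frequency of class k."""
    counts = np.zeros(num_classes, dtype=np.int64)
    for labels in label_maps:
        counts += np.bincount(labels.ravel(), minlength=num_classes)
    p = counts / counts.sum()
    return 1.0 / np.log(c + p)
```

These weights are then passed to the negative log-likelihood loss so that rare foreground classes (Table I) are not drowned out by the background class.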
The Fusion stream takes as its input the confidence volume along with the input thermal imagery and color image. The information is concatenated and passed to an ERFNet-based encoder-decoder architecture. Our architecture differs in that it has a larger set of initial feature layers to account for the larger input; additionally, we use fewer layers at the end of the encoder. We then freeze the RGB stream and train the entire architecture as a whole using the same loss function as before. Once again, we select as our best model the one with the highest mean IoU value.
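As a concrete sketch of the data flow (array shapes are illustrative, not the exact channel counts of our implementation): the RGB stream's pre-softmax scores are concatenated with the color and thermal channels to form the Fusion-stream input, while an argmax over the same scores gives the early-exit prediction.

```python
import numpy as np

def fusion_input(rgb, thermal, rgb_logits):
    """Concatenate color (3,H,W), thermal (1,H,W) and the RGB stream's
    per-class confidence volume (C,H,W) along the channel axis."""
    return np.concatenate([rgb, thermal, rgb_logits], axis=0)

def early_exit(rgb_logits):
    """Coarse per-pixel prediction available before the Fusion stream runs."""
    return rgb_logits.argmax(axis=0)
```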
|Dataset: MFNet Dataset|
|Network|Mode|Background|Car|Person|Bike|Curve|Car Stop|Guardrail|Color Cone|Bump|mIoU|
|Ours: RGB Stream|RGB|0.9678|0.7673|0.4873|0.5532|0.2917|0.2785|0.1525|0.3580|0.4264|0.4776|
|Dataset: PST900 Dataset|
|Ours: RGB Stream|RGB|0.9883|0.6814|0.6990|0.5151|0.4989|0.6765|18|
In this section, we compare our method against the most relevant methods, MFNet and RTFNet. We also compare against naïve RGB-T fusion implementations of relevant segmentation networks such as ERFNet, MAVNet, Fast-SCNN and UNet. In our naïve implementations of RGB-T segmentation networks, we introduce the thermal modality by concatenating the thermal image as a fourth channel to the original RGB input. For all our experiments, we compare both RGB and RGB-T performance, as seen in Table II and Table III. We measure performance using mean Intersection over Union (mIoU)
across all classes. We train all models in PyTorch on an NVIDIA DGX-1. For MFNet and RTFNet, we use author-recommended batch sizes, loss functions and training scripts where applicable. For the remaining networks, we use a fixed batch size, learning rate and loss function across all experiments. We measure inference latency in milliseconds on an NVIDIA AGX Xavier embedded GPU, which is the central compute unit on board our mobile robot platforms. To allow fair comparison between methods, we use the PST900 RGB-T data and exclude the RGB-only data.
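For reference, mIoU is computed per class from a confusion matrix accumulated over the validation set and then averaged. A minimal sketch (in this version, a class absent from both prediction and ground truth contributes an IoU of 0):

```python
import numpy as np

def mean_iou(preds, gts, num_classes):
    """Mean Intersection-over-Union across all classes.

    preds, gts : iterables of integer label maps of identical shape
    """
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(preds, gts):
        # Row = ground-truth class, column = predicted class
        cm += np.bincount(num_classes * g.ravel() + p.ravel(),
                          minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    return (inter / np.maximum(union, 1)).mean()
```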
As shown in Table II, the best performing method is RTFNet, with our method second. MAVNet achieves the lowest scores on this dataset, reaching a maximum mIoU of 22.26% with naïve fusion of thermal information. Nguyen et al. designed this network with speed as a primary objective, which is reflected in its low inference latency, as seen in Table III. Fast-SCNN achieves an mIoU of 28.88% and 32.83% with the RGB and RGB-T modalities respectively. The largest increase in class IoU between these two models is seen in the Person class, which comports with the intuition that humans have a strong, unique thermal signature that the network can home in on to improve its overall accuracy. However, Fast-SCNN was originally designed for high-resolution data, which could explain its low performance. We were unable to reproduce the results reported by Sun et al. for UNet on this dataset, and therefore refer to their experiments for comparison, since we use the same training and validation split. RTFNet performs better than the other methods in both its ResNet-50 and ResNet-152 variants, followed by our method, which closely outperforms ERFNet with a naïvely added thermal fourth channel. Our network achieves 48.42% mIoU on this dataset and, from a performance perspective, is roughly four times faster than RTFNet-152, as shown in Table III.
All the above networks are trained and evaluated on our PST900 dataset. Our method achieves the best performance at 68.36% mIoU, with RTFNet-152 at 57.61% mIoU; results are shown in Table III and qualitative comparisons in Fig 6. The measured latency of our proposed method on our embedded GPU hardware is approximately 42ms, which is significantly faster than RTFNet-152. For the training and evaluation of the networks compared here, we use the training parameters originally prescribed in their respective works. Aside from our method, the observed trend is similar to the previous experiments on the MFNet dataset, where RTFNet and ERFNet achieve high accuracy. However, RTFNet is outperformed on this dataset by ERFNet, which achieves an mIoU of 62.55% with naïve thermal fusion. Interestingly, the naïve introduction of thermal information to UNet and Fast-SCNN results in lower performance than RGB alone, whereas this is not observed on the MFNet dataset.
We posit that our dataset is significantly more challenging for thermal fusion networks since there is plenty of information available in RGB alone, making it potentially difficult to learn an informative correlation between the two modalities. Additionally, our dataset contains the same object in situations where it is both above and below the ambient air temperature, resulting in an inversion in the thermal imagery. This can be challenging when learning RGB-thermal correlations. Our hypothesis is strengthened by the two cases where performance degrades with the naïve introduction of thermal: the IoU for the Survivor class increases, while objects with stronger color cues than thermal cues, such as the backpack, perform worse.
We are able to achieve very competitive results with RGB alone, and accuracy further improves when the Fusion stream is added. We observe that our RGB Stream performs poorly with a naïve fusion approach. This also supports our hypothesis that for the task of learning correlated representations for RGB and Thermal, a late fusion strategy is highly beneficial as opposed to fusing both modalities early on in the network architecture. This effect is exacerbated by the fact that our dataset contains objects of interest that have very strongly identifiable cues from RGB alone, such as the red backpack and orange hand-drill. When learning to identify these objects, the network may be prone to more heavily weight these attributes in RGB, and neglect the more subtle cues to be learned from Thermal.
In summary, this work explores thermal (LWIR) imagery as a viable supporting modality for general semantic segmentation in challenging environments. We propose an RGB and thermal camera calibration technique that is both portable and easy to use. To further research in this field, we also present PST900, a collection of 894 aligned and annotated RGB and thermal image pairs, and make it publicly available to the community. We compare various existing methods on this dataset and propose a dual-stream CNN architecture for RGB- and thermal-guided semantic segmentation that achieves state-of-the-art performance on it. Our network runs in real time on embedded GPUs and can be used in mobile robotic systems. We also evaluate our method on the MFNet dataset and show that it is competitive with existing methods. Additionally, we highlight the need for late fusion in these architectures by noting poor performance with naïve fusion approaches. We also discuss some of the challenges of RGB-thermal fusion for object identification, such as when objects of interest lack easily discernible thermal signatures but have strong cues in RGB.
Augmented reality meets computer vision: efficient data generation for urban driving scenes. International Journal of Computer Vision 126(9), pp. 961–972.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0.
Counting apples and oranges with deep learning: a data-driven approach. IEEE Robotics and Automation Letters 2(2), pp. 781–788.
The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.