
The Cross-Modality Disparity Problem in Multispectral Pedestrian Detection

by   Lu Zhang, et al.

Aggregating features from an additional modality brings great advantages for building robust pedestrian detectors under adverse illumination conditions. However, misaligned imagery still persists in multispectral scenarios and degrades detector performance in a non-trivial way. In this paper, we first present and explore the cross-modality disparity problem in multispectral pedestrian detection, providing insights into the utilization of multimodal inputs. Then, to address this issue, we propose a novel framework that includes a region feature alignment module and a region of interest (RoI) jittering training strategy. Moreover, dense, high-quality, and modality-independent color-thermal annotation pairs are provided to scrub the large-scale KAIST dataset and benefit future multispectral detection research. Extensive experiments demonstrate that the proposed approach improves the robustness of the detector by a large margin and achieves state-of-the-art performance with high efficiency. Code and data will be publicly available.



1 Introduction

Pedestrian detection is an important research topic in computer vision with various applications, such as video surveillance, autonomous driving, and robotics. In recent years, many works in robot vision [2, 47], pedestrian detection [38, 18, 20, 16, 30, 42], and 3D object detection [4, 37, 7] have shown that adopting a novel modality can improve performance and offer competitive advantages over single-sensor systems. Additionally, as novel sensors (e.g. thermal and depth cameras) become cheaper and more accessible, a wide range of applications can rely on multimodal input sources, including self-driving cars, security monitoring, and military operations. Motivated by this, multispectral pedestrian detection has attracted considerable attention and provides new opportunities for tackling challenging problems such as adverse illumination conditions and occlusions.

In existing multispectral pedestrian datasets [16, 12], color-thermal image pairs are geometrically aligned as well as possible and modality-shared annotations are provided; most state-of-the-art detectors [42, 25, 19, 44] build their frameworks on this assumption.

However, in real-world scenarios, the modality alignment assumption rarely holds, owing to several factors such as the physical properties of different sensors (e.g. parallax, mismatched resolutions and fields of view), imperfect alignment algorithms, external disturbance, and hardware aging. Moreover, even in the automatically aligned dataset [16], non-rigid transformations can be observed in many color-thermal image pairs. The resulting position shift makes pedestrian localization more difficult, yet it has received little study. In this paper, we present this as the cross-modality disparity problem, i.e. the spatial difference between the imagery of two or more modalities, which appears as color-thermal disparity in multispectral pedestrian detection.

In general, the color-thermal disparity problem can degrade the performance of a pedestrian detector in two main ways. First, the features to be fused are inconsistent because the multispectral inputs are mismatched at corresponding positions, which can lead to unstable inference, i.e. classification and localization. In addition, for a given pedestrian instance, it is unclear which modality's imagery should serve as the reference, making it hard to determine a reliable location. Second, the modality-shared annotations can introduce severe label bias on account of the color-thermal disparity. Specifically, since one ground truth annotation is assigned to the color and thermal images simultaneously, the bounding box needs to be wider to encompass both modalities, which introduces bias for each single modality and results in biased regression targets for localization. Besides, for detectors based on deep convolutional neural networks (CNNs) [35, 26], biased labels will taint the mini-batch sampling [11] process, in which the intersection over union (IoU) overlap is used for foreground/background classification.

Moreover, the calibration and alignment process for color-thermal cameras can be tortuous and generally requires particular hardware as well as a special calibration board [16, 39]. Once the devices start to operate, unavoidable external forces such as mechanical vibration and temperature variation may degrade the calibration quality. Therefore, how to robustly localize each individual person on mismatched modalities remains one of the most critical issues for multispectral pedestrian detectors, especially in real applications.

To address this problem, we first analyze the impact of cross-modality disparity and provide dense color-thermal annotation pairs that describe each pedestrian separately in each modality. Based on the annotation pairs, we propose a novel detection framework consisting of two interconnected parts, namely the Region Feature Alignment (RFA) module and the RoI Jittering training strategy. With RFA, the region feature of one modality is shifted to align with the other. Meanwhile, the RoI Jittering training strategy randomly jitters the RoI of the sensed modality, which diversifies the learning targets of RFA. In other words, RFA enables the model to align region-wise features between modalities, and RoI Jittering prevents the model from learning a biased transformation pattern from the original dataset, making the detector more robust to cross-modality disparity. To sum up, the main contributions of this work are as follows:

  • To the best of our knowledge, this is the first work in the literature to present and analyze the cross-modality disparity problem in multispectral pedestrian detection, or even in more general object detection. To benefit future research, we also provide dense color-thermal annotation pairs that describe each individual pedestrian more clearly in both modalities of the popular KAIST [16] dataset.

  • A novel CNN-based framework is proposed to enhance the robustness of the pedestrian detector to the cross-modality disparity. Inspired by the image registration [48, 1] task, we designate a reference image and a sensed image: the former is fixed as the reference and the latter is unconstrained. We propose a Region Feature Alignment (RFA) module to align sensed region features with the reference, and adopt an RoI Jittering training strategy to further improve the robustness to random disparity. Our model can be trained end to end.

  • Extensive evaluation demonstrates that the proposed approach significantly improves the robustness of the pedestrian detector under cross-modality disparity. Meanwhile, our model achieves state-of-the-art performance on the challenging KAIST benchmark with high efficiency.

2 Related Work

Multispectral Pedestrian Detection. As an essential step for various applications (e.g. robotics, autonomous driving, and video surveillance), pedestrian detection has attracted great attention from the computer vision community. Over the years, a wide range of features and algorithms have been proposed, including both traditional detectors (e.g. ICF [10], ACF [8], LDCF [31], Checkerboard [45]) and the recently dominant CNN-based detectors [43, 30, 15, 22, 41, 46].

In recent years, the release of large-scale multispectral pedestrian benchmarks has encouraged the research community to advance the state of the art by efficiently exploiting multispectral input data. Hwang et al. [16] propose the KAIST benchmark and an extended ACF method, leveraging aligned color-thermal image pairs for around-the-clock pedestrian detection. With the recent development of deep learning, CNN-based methods [42, 40, 21, 5, 44] have significantly improved multispectral pedestrian detection performance. Liu et al. [25] adopt the Faster R-CNN architecture and analyze different fusion stages within the CNN. In [19], the Region Proposal Network (RPN) and Boosted Forest (BF) framework is adapted for multispectral input data. Xu et al. [42] design a cross-modality representation learning framework to overcome adverse illumination conditions. Zhang et al. [44] present a single shot detector based on the cross-modality attention mechanism.

Most existing methods operate under the modality alignment assumption and commonly use fusion strategies (e.g. element-wise summation, concatenation and Network-in-Network [24]) that directly fuse features of different modalities at corresponding pixel positions. However, the cross-modality disparity problem is widespread in multispectral scenarios and adversely affects detection performance, yet it remains understudied.

Image Registration. Image registration (i.e. spatial alignment) [48, 1, 6, 29] is a common task which aims to align two or more images captured by different sensors, at different times and from different viewpoints. It geometrically aligns two images: the reference and sensed images, which can be considered as an image-level solution for the cross-modality disparity problem.

In general, image registration methods can be roughly divided into two categories: area-based and feature-based methods. Area-based methods [34, 3, 23, 17] deal directly with the image intensity values rather than detect salient structures such as features. For feature-based methods [28, 32, 27], features are extracted and represented by spatial points, i.e. control points in [48], then the image registration reduces to a feature matching problem.

Though well established, image registration mainly focuses on low-level transformation of the whole image, which introduces time-consuming preprocessing and prevents the CNN-based detector from being trained end to end.

3 Analysis of the Cross-Modality Disparity

To provide insights into the cross-modality disparity problem, in this section, we start with our observation and analysis of the large-scale KAIST multispectral pedestrian benchmark [16]. Then we experimentally study how much the cross-modality disparity impacts pedestrian detection results. Notably, in order to enrich the labeling features and benefit the multispectral pedestrian detection research, we provide novel color-thermal annotation pairs of pedestrians by scrubbing the KAIST dataset.

3.1 Preliminaries

Before delving into the analysis, we first introduce the dataset and baseline detector that we use.

Dataset and Evaluation Metrics.

The KAIST [16] benchmark is a popular and challenging multispectral pedestrian dataset that contains color-thermal image pairs with dense annotations and unique pedestrians. It is recorded in various scenes during day and night to cover changes in light conditions. The image pairs are captured by special beam splitter-based hardware to physically eliminate parallax. For evaluation, we use the widely adopted improved test set annotations provided by Liu et al. [25], since the original test set contains some problematic annotations. The log-average miss rate over a fixed false positives per image (FPPI) range is used to measure detection performance; a lower score indicates better performance. All methods are evaluated under the reasonable configuration [9].
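As an illustration of the metric described above, the following is a minimal sketch of a log-average miss rate computation. It assumes the common pedestrian-detection protocol of sampling nine log-spaced reference points over an FPPI range of [1e-2, 1e0]; the range, the number of points, and the function name `log_average_miss_rate` are assumptions for illustration and are not taken from this text.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, lo=1e-2, hi=1e0, num_points=9):
    """Geometric mean of miss rates sampled at log-spaced FPPI reference points.

    fppi and miss_rate describe one detector's curve; fppi must be sorted in
    increasing order. The range [lo, hi] and num_points are assumed defaults.
    """
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    refs = np.logspace(np.log10(lo), np.log10(hi), num_points)
    sampled = []
    for r in refs:
        # Use the operating point with the largest FPPI not exceeding r;
        # if the curve starts above r, fall back to a miss rate of 1.0.
        idx = np.searchsorted(fppi, r, side="right") - 1
        sampled.append(miss_rate[idx] if idx >= 0 else 1.0)
    sampled = np.maximum(np.array(sampled), 1e-10)  # avoid log(0)
    return float(np.exp(np.mean(np.log(sampled))))
```

Averaging in log space means that a detector is rewarded for low miss rates across the whole FPPI range rather than at a single operating point.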

Baseline Detector. We build our detector on the Faster R-CNN [35] detection framework and adopt the halfway fusion settings [25] for multispectral pedestrian detection. To mitigate the negative effect of harsh quantization on localization, we use RoIAlign [14] instead of the standard RoIPool [35] for region feature pooling. Our baseline detector achieves () on the KAIST reasonable test set, which is better than the reported result () in [25]. Footnote 1: The score in parentheses refers to the evaluation result using the original test annotations; it is shown here for clear comparison.

3.2 Important Observations

Although special hardware and alignment algorithms are employed in the KAIST [16] dataset, the cross-modality disparity problem remains in many color-thermal image pairs. Inspecting the data and annotations in [16], we summarize several common issues.

(a) Poor alignment
(b) Ambiguous categories
Figure 1: The visualization examples of ground truth annotations in the KAIST dataset. Image patches are cropped on the same position of color-thermal image pairs. The original annotations are shared by both modalities. New annotation pairs in green, original ones in red and yellow (red box denotes person and yellow box represents people).

Mismatched Relationship. The KAIST dataset provides modality-shared annotations for the color-thermal image pairs based on the "modality alignment assumption". In fact, many pedestrian objects still suffer from color-thermal disparity. Specifically, as illustrated in Fig. 1(a), all three color-thermal image pairs exhibit a severe position shift. Such undesirable properties cause serious issues for the detector: directly fusing the feature maps of different modalities makes the final feature inconsistent, and the inability to pool matched features increases the chance of misclassification and mislocalization.

Localization Bias. Since each color-thermal pair shares the same ground truth boxes in the original annotations, localization faces a dilemma when the images are not well aligned. Workarounds were employed in the original annotation process, but they proved to be Band-Aid solutions. One is to adopt larger bounding boxes that encompass the pedestrian in both the color and thermal modalities as much as possible, as shown in the first column of Fig. 1(a). Another is to focus on only one particular modality, as shown in the second and third columns of Fig. 1(a). However, with either workaround, the actual pedestrian occupies only a biased portion of the ground truth box in at least one modality. Such compromises lead to localization bias, which is very harmful for mini-batch sampling and for the localization process in the training phase.

Ambiguous Categories. Besides the above issues, some misclassified objects are also found in the original annotations. As shown in the first column of Fig. 1(b), two separable pedestrians are misclassified into the people category, and in the second column, a distinct pedestrian near the car is missed.

Figure 2: Surface plot of the detection performance in the position shifting experiments. Horizontal coordinates indicate the step sizes by which sensed images are shifted along the x-axis and y-axis, and vertical coordinates denote the log-average miss rates measured on the reasonable test set of the KAIST dataset.

Δy \ Δx      -6      -4      -1       0       1
    1      31.91   26.03   22.81   22.60   23.13
    0      32.17   25.76   22.63   22.41   22.66
   -1      33.08   26.14   22.52   22.18   21.86
   -4      35.69   28.76   23.52   22.34   22.65
   -6      37.32   30.20   24.55   24.10   24.05
Table 1: Numerical results of the position shifting experiments. The scores correspond to the results in Fig. 2. Result in the origin is highlighted in blue and other better results around are in red.

3.3 How Does the Cross-Modality Disparity Impact Detection?

To quantitatively evaluate how color-thermal disparity influences multispectral pedestrian detection results, we conduct experiments on the KAIST dataset with the baseline detector by imitating several typical cases of disparity.

Position Shifting. In the testing phase, we fix the thermal input as the reference image and spatially shift the sensed one along both the x-axis and the y-axis. The offset along each axis, in pixels, is selected from a fixed set of values, and each combination of x and y offsets defines one shifting mode. A positive offset denotes shifting to the right or upward, and a negative one to the left or downward. The size of the input image remains unchanged; vacated pixels are padded with zero intensity.
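The shifting operation described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function name `shift_image` and the array layout (rows indexed top to bottom) are assumptions.

```python
import numpy as np

def shift_image(image, dx, dy):
    """Shift an H x W (x C) image by (dx, dy) pixels, padding with zeros.

    Positive dx moves content right; positive dy moves content up, matching
    the sign convention in the text. The output has the same size as the input.
    """
    h, w = image.shape[:2]
    shifted = np.zeros_like(image)
    if abs(dx) >= w or abs(dy) >= h:
        return shifted  # everything is shifted out of frame
    # Row indices grow downward, so an upward shift reads from lower rows.
    src_rows = slice(max(0, dy), min(h, h + dy))
    dst_rows = slice(max(0, -dy), min(h, h - dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted
```

Applying this function to the sensed image only, while leaving the reference image untouched, reproduces the kind of synthetic disparity studied in this experiment.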

As illustrated in Fig. 2, the performance drops sharply as the absolute shift increases. Some numerical results in accord with Fig. 2 are shown in Table 1; for example, with an offset of -6 pixels on both axes the miss rate is 37.32, against 22.41 at the origin.

Moreover, erratic fluctuation can be observed around the origin. Table 1 shows that the origin yields only a suboptimal result: some neighbouring offsets score lower (e.g. 22.18 and 21.86) than the 22.41 obtained at the origin. This offers another perspective on the color-thermal disparity already present in the dataset.

Figure 3: The statistical results of bounding box shifting disparity within color-thermal image pairs. The horizontal axis denotes numerical values of shifting disparity within color-thermal image pairs, and the vertical axis indicates the frequency of measured image pairs corresponding to specific deviation interval.

3.4 Color-Thermal Annotation Pairs

To address the issues in the original annotations and provide better multimodal labels for multispectral pedestrian detection research, we present color-thermal bounding box pairs that are manually annotated according to the following principles:

  • Localize all modalities. The pedestrians are separately localized in both color and thermal images, aiming at clearly formulating the object information of each modality.

  • Place tight boxes around pedestrians. We place relatively tighter bounding boxes which contain less cluttered background information.

In summary, our new annotations differ from the original ones in the following aspects: both color and thermal modalities are annotated separately, more accurate labelings and precise localizations are performed, and some missed but salient pedestrians are also annotated.

Statistics of Mismatched Objects. Benefiting from the proposed color-thermal annotation pairs, we can derive statistics of the original color-thermal disparity in the KAIST dataset. As illustrated in Fig. 3, the shifting disparity within color-thermal image pairs is salient on both the x-axis and the y-axis; Fig. 3 shows the pixel range in which most instances fall. More than half of the pedestrian instances suffer from some degree of shifting disparity.

Figure 4: Overview of our framework. The network adopts Faster R-CNN [35] architecture with two branches to deal with color-thermal inputs. Given a pair of images, cross-modality feature maps are fused in halfway fashion by RPN module, then the RFA module is introduced to align the region features. After alignment, the region features of color and thermal feature maps are pooled respectively, then concatenated and fed to subsequent prediction layers.

Conclusion. The above analysis validates our motivations: the cross-modality disparity problem is common in multispectral scenarios and the pedestrian detectors are tainted by it in a non-trivial way. To address this issue, we next propose a novel CNN-based framework to improve the robustness of detectors to modality mismatched scenes.

4 The Proposed Approach

This section introduces the proposed framework, including the region feature alignment module (Section 4.1) and the RoI jittering training strategy (Section 4.2). Before that, to explicitly formulate the transformation relation between modalities, we introduce the concepts of the reference image and the sensed image into the multispectral setting, inspired by the image registration [48, 1] task. In the training phase, the reference image is fixed, and the learnable feature-level alignment and RoI jittering process are conducted on the sensed one.

4.1 Region Feature Alignment

We propose the Region Feature Alignment (RFA) module, which aims to align sensed region features with the reference. The design of RFA is driven by two main needs: 1) the shift transformation between the two modalities must be predicted for alignment; 2) since the disparity takes various forms on different images and also varies across regions, the transformation prediction and alignment should be performed region-wise.

The proposed framework is illustrated in Fig. 4. For the region proposal network (RPN), we fuse the feature maps in halfway fashion [25] to generate numerous shared proposals so as to keep the potential recall rate on both the reference and sensed modality. However, due to the color-thermal disparity problem, directly pooling shared proposals will lead to inconsistent feature representations, thus the RFA module is inserted to align region features of proposals before the pooling operation.

The concrete connection scheme of the RFA module is shown in Fig. 5. Given several proposals, enlarged contextual RoIs are used to encompass sufficient information about the regions. For each modality, we use the RoIAlign layer to pool the contextual region features into a small feature map with a fixed spatial extent.

Then the residual region feature is computed by subtraction and fed into fully-connected layers to predict the position shift of this region between the two modalities. Since we have access to the ground truth pairs on both modalities, the transformation targets can be calculated as follows:

$$ t_x = \frac{x_s - x_r}{w_r}, \qquad t_y = \frac{y_s - y_r}{h_r} \tag{1} $$

In Eq. 1, $x$ and $y$ denote the box's center coordinates, and $w$ and $h$ indicate the width and height of the box. Subscripts $s$ and $r$ refer to the sensed and reference ground truth boxes respectively, $t_x$ is the transformation target for the $x$ coordinate, and likewise $t_y$ for $y$.
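The transformation targets can be illustrated with a short sketch. It assumes that the center shift between the sensed and reference ground truth boxes is normalized by the reference box's width and height, in the style of the Fast R-CNN box parameterization; that normalization, the `(cx, cy, w, h)` box layout, and the function name `transform_targets` are assumptions for illustration.

```python
def transform_targets(sensed_box, reference_box):
    """Return (t_x, t_y) for boxes given as (cx, cy, w, h).

    Assumption: the center offset is divided by the reference box size,
    so the targets are scale-invariant shift fractions.
    """
    xs, ys, _, _ = sensed_box
    xr, yr, wr, hr = reference_box
    t_x = (xs - xr) / wr
    t_y = (ys - yr) / hr
    return t_x, t_y

# A sensed box whose center lies 2 pixels to the right of a 20-pixel-wide
# reference box gives a horizontal target of 0.1 and no vertical shift.
example = transform_targets((52.0, 100.0, 20.0, 50.0), (50.0, 100.0, 20.0, 50.0))
```

Because only the box centers enter the targets, this formulation models a pure shift between modalities and leaves the box size unchanged.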

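Similar to Fast R-CNN [11], we use the smooth L1 loss as the regression loss to measure the accuracy of the predicted transformation, i.e.,

$$ L_{trans}(\{p_i^*\}, \{t_i\}, \{t_i^*\}) = \frac{1}{N_{trans}} \sum_i p_i^* \, \mathrm{smooth}_{L_1}(t_i - t_i^*) $$

where $i$ is the index of an RoI in a mini-batch, $t_i$ is the predicted coordinates after transformation, and $p_i^*$ and $t_i^*$ are the associated ground truth class label (pedestrian $p_i^* = 1$ vs. background $p_i^* = 0$) and coordinates of the $i$-th sensed RoI. $N_{trans}$ is the total number of ground truth objects to be aligned (i.e. $N_{trans} = \sum_i p_i^*$).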
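A minimal NumPy sketch of such a loss is given below. It uses the standard smooth L1 definition from Fast R-CNN and averages over positive (pedestrian) RoIs only, as the text describes; the function names and the array shapes are assumptions for illustration.

```python
import numpy as np

def smooth_l1(x):
    """Standard smooth L1: quadratic near zero, linear elsewhere."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def transformation_loss(pred, target, labels):
    """Average smooth L1 loss over RoIs labelled as pedestrians.

    pred, target: arrays of shape (N, 2) holding (t_x, t_y) per RoI.
    labels: array of shape (N,) with 1 for pedestrian and 0 for background.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    labels = np.asarray(labels, dtype=float)
    per_roi = smooth_l1(pred - target).sum(axis=1)
    num_pos = labels.sum()
    if num_pos == 0:
        return 0.0  # no object to align in this mini-batch
    return float((labels * per_roi).sum() / num_pos)
```

Masking by the label means that background RoIs, which have no sensed ground truth to align with, contribute nothing to the alignment loss.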

Figure 5: Connection scheme of the RFA module. RF denotes region features; the two region features are combined by element-wise subtraction.
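The data flow of the RFA module (pooled region features, element-wise subtraction, fully-connected layers, predicted shift) can be sketched as below. This is a hedged illustration only: the number and width of the fully-connected layers, the ReLU activation, and the function name `predict_shift` are assumptions and are not specified in this text.

```python
import numpy as np

def predict_shift(feat_sensed, feat_reference, w1, b1, w2, b2):
    """Predict (t_x, t_y) from pooled region features of shape (C, H, W).

    Hypothetical head: one hidden fully-connected layer with ReLU followed by
    a two-unit output layer. Only the subtraction step comes from the text.
    """
    residual = feat_sensed - feat_reference           # residual region feature
    hidden = np.maximum(residual.reshape(-1) @ w1 + b1, 0.0)  # FC + ReLU
    return hidden @ w2 + b2                           # predicted shift

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 7, 7))                 # hypothetical pooled feature
w1, b1 = rng.standard_normal((4 * 7 * 7, 16)), np.zeros(16)
w2, b2 = rng.standard_normal((16, 2)), np.zeros(2)
no_shift = predict_shift(feat, feat, w1, b1, w2, b2)  # identical inputs
```

With zero biases, identical region features give a zero residual and therefore a zero predicted shift, which is the intuition behind feeding the difference rather than the raw features to the prediction head.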

Multi-task Loss. For each training example, we minimize a Fast R-CNN-style multi-task objective defined over the predicted confidence and coordinates of the pedestrian, the associated ground truth label, and the reference ground truth coordinates. Two of the loss terms are weighted by a balancing parameter, which in our current implementation is set so that the two terms are roughly equally weighted. For the RPN module, the loss function is defined as in the literature.
4.2 RoI Jittering

In reality, unexpected disparities can appear under diverse scenarios or over time, posing greater challenges to the multispectral detection system. To reduce the impact of training bias and further enhance the robustness of the RFA module, we propose a novel training strategy, named RoI Jittering, to randomly generate various transformation modes in the training phase.

With the RFA module trained to predict the transformation between the reference and sensed ground truths, the transformation targets are fixed for each pedestrian instance. The purpose of RoI Jittering is to introduce a stochastic disturbance to the sensed RoIs so that they jitter within a certain extent. The transformation targets of RFA change accordingly, which enriches the patterns of cross-modality disparity seen during training. Fig. 6 depicts the jittering process. Specifically, the original sensed RoI is expected to learn the transformation between the reference and sensed ground truths. After jittering, the region features of the jittered RoI are used to predict a new transformation between the reference and sensed modalities.

The jittering targets are randomly generated from a normal distribution,

$$ t_x^j \sim N(0, \sigma_x^2), \qquad t_y^j \sim N(0, \sigma_y^2) $$

where $t_x^j$ and $t_y^j$ denote the transformation targets along the x-axis and y-axis, and $\sigma$ is the hyperparameter that sets the extent of jittering. The RoI is then jittered to its new position using the inverse of the bounding box transformation of Eq. 1.
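The jittering step can be sketched as follows. The sketch assumes zero-mean normal offsets that are applied as fractions of the RoI width and height, i.e. the inverse of a center-shift parameterization; the default standard deviations below are placeholders rather than values from this text, and `jitter_roi` is a hypothetical name.

```python
import numpy as np

def jitter_roi(roi, sigma_x=0.05, sigma_y=0.05, rng=None):
    """Randomly shift an RoI given as (cx, cy, w, h).

    Returns the jittered RoI and the sampled offsets (t_x, t_y). The default
    sigma values are illustrative placeholders.
    """
    rng = np.random.default_rng() if rng is None else rng
    t_x = float(rng.normal(0.0, sigma_x))
    t_y = float(rng.normal(0.0, sigma_y))
    cx, cy, w, h = roi
    # Inverse of the shift parameterization: scale offsets by the box size.
    return (cx + t_x * w, cy + t_y * h, w, h), (t_x, t_y)

rng = np.random.default_rng(0)
roi = (100.0, 50.0, 20.0, 40.0)
jittered, (t_x, t_y) = jitter_roi(roi, rng=rng)
```

Only the sensed RoI would be jittered in this scheme; the reference RoI stays fixed, so each draw presents the alignment module with a different shift to recover.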

(a) Reference image
(b) Sensed image
Figure 6: Visualization of the RoI Jittering strategy. Red boxes denote the ground truths of the reference and sensed modalities. The blue box represents the RoI, i.e. the proposal, which is shared by both modalities. Three feasible proposal instances after jittering are also shown.

Mini-batch Sampling. When training CNN-based detectors, a small set of samples is randomly selected. In this paper, the positive and negative examples are defined with respect to the reference modality. We treat an RoI pair as positive if the reference RoI has an IoU overlap greater than 0.5 with a reference ground truth box, and as negative if the IoU is between 0.1 and 0.5.
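The labelling rule above can be illustrated with a short sketch. Boxes are assumed to be `(x1, y1, x2, y2)` corner coordinates, and RoIs with IoU below 0.1 are assumed to be ignored; both the box layout and the handling of the interval endpoints are assumptions, as are the function names.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_roi(reference_roi, reference_gt):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for one RoI pair.

    The label depends only on the reference modality, as in the text.
    """
    overlap = iou(reference_roi, reference_gt)
    if overlap > 0.5:
        return 1
    if 0.1 <= overlap <= 0.5:
        return 0
    return -1
```

Defining labels on the reference modality alone keeps the foreground/background decision independent of how far the sensed box has shifted.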

5 Experiments


Method       Origin   Config. 1        Config. 2        Config. 3        Config. 4
                      mean    var.     mean    var.     mean    var.     mean    var.
Baseline-C   15.69    17.01   4.15     18.21   10.33    16.22   1.13     18.16   6.92
Baseline-T   15.16    16.79   4.03     18.01   10.97    15.88   1.01     18.08   8.09
RFA          15.36    17.08   1.96     17.64   4.76     16.04   0.63     17.75   3.48
RFA+RoIJ     14.62    16.62   2.67     16.84   3.46     15.13   0.12     16.92   3.35
Table 2: Quantitative results of the robustness of detectors to position shifting on the KAIST dataset, measured by log-average miss rate (lower is better). The first column is the score at the origin; for each of the four measurement configurations, the two columns give the mean and variance of the scores. Baseline-C denotes the baseline detector trained with the proposed color annotations, and Baseline-T is trained with the proposed thermal annotations. RFA refers to the detector that uses the RFA module before the RoI pooling layer. RFA+RoIJ denotes the proposed method, which consists of the RFA module and the RoI Jittering strategy. The top two results are highlighted in red and blue.
Figure 7: Surface plot of the detection performance of the proposed approach in the position shifting experiments.

5.1 Experimental Setup

We carry out experiments on the KAIST multispectral benchmark, following the settings in Section 3. All experiments on KAIST are conducted on the improved annotations unless otherwise stated.

In this section, we set the thermal input as the reference image and the color input as the sensed one, considering that the thermal modality provides more stable imagery in both daytime and nighttime. We also experimentally study the opposite configuration; a more detailed analysis can be found in the supplementary material.

Implementation Details. The VGG16 [36] backbone of our method is pretrained on the ImageNet dataset. The two jittering hyperparameters of RoI Jittering are set to the same default value. One can adjust it to cope with wider or narrower disparity, i.e. the concrete value depends on the system requirements rather than being fixed. All images are horizontally flipped for data augmentation. We train the detector for 3 epochs at the initial learning rate and for one more epoch at a decayed learning rate. The network is optimized using the Stochastic Gradient Descent (SGD) algorithm with 0.9 momentum and 0.0005 weight decay. Multi-scale training and testing are not applied, to ensure fair comparisons with other methods.

5.2 Robustness to the Cross-Modality Disparity

Following the settings in Section 3.3, we conduct comparative position shifting experiments with the proposed approach. Fig. 7 depicts the results as a surface plot. Compared with the baseline results in Fig. 2, the overall performance of the proposed detector is improved and the robustness to color-thermal disparity is significantly enhanced, validating the effectiveness of our method.

Apart from the visual surface plot, a set of quantitative measures is adopted to evaluate the robustness of detectors. There are four configurations, each of which shifts the sensed images in discrete steps along a line at a fixed angle from the x-axis; the offset pixel values are selected from fixed sets. Table 2 shows that the proposed approach achieves the best mean performance and the smallest standard deviation on almost all configurations. This demonstrates the superiority of the proposed method, especially for building robust multispectral pedestrian detectors under diverse color-thermal disparity conditions.
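The per-configuration statistics can be computed along the following lines. The sketch takes the miss rates measured at each offset of one configuration and summarizes them; the use of the population variance (`ddof=0`) and the function name `robustness_summary` are assumptions, and the sample data are simply one row of Table 1 used as a stand-in.

```python
import numpy as np

def robustness_summary(scores):
    """Mean, variance and standard deviation of miss rates across offsets.

    A lower mean indicates better accuracy and a lower spread indicates
    more robustness to the shift. Population statistics are assumed.
    """
    scores = np.asarray(scores, dtype=float)
    return {
        "mean": float(scores.mean()),
        "variance": float(scores.var()),
        "std": float(scores.std()),
    }

# One row of Table 1 (baseline detector, five offsets along one axis).
row = [32.17, 25.76, 22.63, 22.41, 22.66]
summary = robustness_summary(row)
```

Reporting the spread alongside the mean separates a detector that is accurate only when the inputs happen to be aligned from one that stays accurate as the disparity changes.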

(a) Reasonable all-day
(b) Reasonable daytime
(c) Reasonable nighttime
Figure 8: Comparisons with the state-of-the-art methods on the KAIST pedestrian dataset. The scores in the legend are the log-average miss rates of the corresponding methods.

Method                Time (s)
Choi et al. [5]       2.73
Park et al. [33]      0.58
Halfway Fusion [25]   0.43
Fusion RPN+BF [19]    0.80
IAF R-CNN [21]        0.21
Ours                  0.08
Table 3: Comparisons of computation time using an NVIDIA GeForce GTX TITAN X GPU.

Fusion Method   All-day   Daytime   Nighttime
Eltwise Sum     15.87     18.02     10.43
Eltwise Max     15.27     17.61     10.68
Concat+NIN      14.90     16.95     10.33
Eltwise Sub     14.61     16.78     10.21
Table 4: Effects of different fusion strategies for region feature fusion in the RFA module. Metric: log-average miss rate on the KAIST test set. Eltwise Sum, Max, and Sub refer to element-wise summation, maximum, and subtraction respectively. Concat+NIN denotes concatenation and channel reduction using Network-In-Network (NIN) [24]. The top two results of each subset are highlighted in red and blue.

5.3 Ablation Study

Region Feature Alignment. To demonstrate the contribution of the RFA module, we construct a detector with RFA and two baseline detectors (Baseline-C and Baseline-T) trained on the proposed color and thermal annotations respectively. To make a fair comparison, we use the same parameter settings for all detectors. As shown in the first three rows of Table 2, compared with Baseline-C and Baseline-T, the detector with RFA markedly reduces the variance under diverse disparity conditions. For example, in the second configuration listed in Table 2 the variance falls from 10.33 (Baseline-C) and 10.97 (Baseline-T) to 4.76, and a consistent reduction is observed in the other three configurations.

RoI Jittering. For the ablation study on RoI Jittering, we construct a detector that combines the RFA module with the RoI Jittering strategy, denoted as RFA+RoIJ, i.e. the proposed detector. As shown in the third and fourth rows of Table 2, compared with the detector with only the RFA module, RFA+RoIJ further reduces the mean in all four configurations and the variance in three of them, and achieves 14.62 at the origin, confirming the effectiveness of our method.

Region Feature Fusion. In the RFA module, we fuse the color and thermal region features by element-wise subtraction to model the feature disparity between the two modalities; the result is named the residual region feature. We also explore the effects of three other common fusion strategies. The scores of the different fusion strategies are reported in Table 4. Element-wise subtraction achieves the best performance on the all-day test set, followed by concatenation with channel reduction.
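The four fusion strategies compared here can be written compactly as below. This is a minimal NumPy sketch on `(C, H, W)` region features; the subtraction order, the representation of the NIN channel reduction as a single 1x1 convolution weight of shape `(C, 2C)`, and the function name `fuse` are assumptions for illustration.

```python
import numpy as np

def fuse(color, thermal, mode, nin_weight=None):
    """Fuse two (C, H, W) region features with one of four strategies."""
    if mode == "sum":
        return color + thermal
    if mode == "max":
        return np.maximum(color, thermal)
    if mode == "sub":
        return color - thermal  # residual region feature
    if mode == "concat_nin":
        if nin_weight is None:
            raise ValueError("concat_nin requires a (C, 2C) weight matrix")
        stacked = np.concatenate([color, thermal], axis=0)   # (2C, H, W)
        # A 1x1 convolution is a per-pixel linear map over channels.
        return np.einsum("oc,chw->ohw", nin_weight, stacked)
    raise ValueError(f"unknown fusion mode: {mode}")

color = np.ones((3, 7, 7))
thermal = 2.0 * np.ones((3, 7, 7))
residual = fuse(color, thermal, "sub")
```

Of the four, only subtraction produces a feature that vanishes when the two modalities agree, which may be why it suits a module whose job is to measure their disagreement.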

5.4 Comparisons with State-of-the-art Methods

We evaluate our approach and conduct comparisons with other published methods: ACF+T+THOG (optimized) [16, 19], Halfway Fusion [25], Fusion RPN and Fusion RPN+BF [19], IATDNN+IAMSS [13]. The comparative result curves are illustrated in Fig. 8. The proposed approach achieves state-of-the-art performances on the reasonable all-day, daytime, and nighttime subset, with , , and respectively.

Table 3 compares the computational cost of our method with that of the state-of-the-art methods. The proposed approach achieves the best inference speed, at only 0.08s per image pair.

6 Conclusion

In this paper, a novel end-to-end framework with a Region Feature Alignment (RFA) module is proposed to alleviate the negative effects caused by the cross-modality disparity in multispectral pedestrian detection. An RoI jittering training strategy is adopted to further improve the robustness of the detector to random disparity. In addition, we scrubbed the large-scale KAIST multispectral dataset by providing dense, high-quality, and modality-independent annotation pairs. As a result, the proposed approach demonstrates state-of-the-art performance on the challenging KAIST dataset and improves the robustness of the detector by a large margin. It is worth noting that our method is a generic solution for multispectral object detection rather than for the pedestrian problem only. In the future, we plan to explore the generalization of the RFA module and extend it to other tasks, considering that this cross-modality disparity is widespread and hard to avoid completely when multimodal inputs are required.

References


  • [1] L. G. Brown. A survey of image registration techniques. ACM computing surveys (CSUR), 24(4):325–376, 1992.
  • [2] F. Burian, P. Kocmanova, and L. Zalud. Robot mapping with range camera, ccd cameras and thermal imagers. In Methods and Models in Automation and Robotics (MMAR), 2014 19th International Conference On, pages 200–205. IEEE, 2014.
  • [3] Q.-s. Chen, M. Defrise, and F. Deconinck. Symmetric phase-only matched filtering of fourier-mellin transforms for image registration and recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence, (12):1156–1168, 1994.
  • [4] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals using stereo imagery for accurate object class detection. IEEE transactions on pattern analysis and machine intelligence, 40(5):1259–1272, 2018.
  • [5] H. Choi, S. Kim, K. Park, and K. Sohn. Multi-spectral pedestrian detection based on accumulated object proposal with fully convolutional networks. In 23rd International Conference on Pattern Recognition, ICPR 2016. Institute of Electrical and Electronics Engineers Inc., 2017.
  • [6] S. Dawn, V. Saxena, and B. Sharma. Remote sensing image registration techniques: A survey. In International Conference on Image and Signal Processing, pages 103–112. Springer, 2010.
  • [7] Z. Deng and L. J. Latecki. Amodal detection of 3d objects: Inferring 3d bounding boxes from 2d ones in rgb-depth images. In Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, page 2, 2017.
  • [8] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532–1545, 2014.
  • [9] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. IEEE transactions on pattern analysis and machine intelligence, 34(4):743–761, 2012.
  • [10] P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In British Machine Vision Conference, BMVC 2009, London, UK, September 7-10, 2009. Proceedings, 2009.
  • [11] R. Girshick. Fast R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [12] A. González, Z. Fang, Y. Socarras, J. Serrat, D. Vázquez, J. Xu, and A. M. López. Pedestrian detection at day/night time with visible and fir cameras: A comparison. Sensors, 16(6):820, 2016.
  • [13] D. Guan, Y. Cao, J. Liang, Y. Cao, and M. Y. Yang. Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection. arXiv preprint arXiv:1802.09972, 2018.
  • [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
  • [15] J. Hosang, M. Omran, R. Benenson, and B. Schiele. Taking a deeper look at pedestrians. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4073–4082, 2015.
  • [16] S. Hwang, J. Park, N. Kim, Y. Choi, and I. S. Kweon. Multispectral pedestrian detection: Benchmark dataset and baseline. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1037–1045. IEEE, 2015.
  • [17] J. Inglada, V. Muron, D. Pichard, and T. Feuvrier. Analysis of artifacts in subpixel remote sensing image registration. IEEE Transactions on Geoscience and Remote Sensing, 45(1):254–264, 2007.
  • [18] B. C. Ko, J.-Y. Kwak, and J.-Y. Nam. Online learning based multiple pedestrians tracking in thermal imagery for safe driving at night. In Intelligent Vehicles Symposium (IV), 2016 IEEE, pages 78–79. IEEE, 2016.
  • [19] D. König, M. Adam, C. Jarvers, G. Layher, H. Neumann, and M. Teutsch. Fully convolutional region proposal networks for multispectral person detection. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 243–250. IEEE, 2017.
  • [20] A. Leykin, Y. Ran, and R. Hammoud. Thermal-visible video fusion for moving target tracking and pedestrian classification. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
  • [21] C. Li, D. Song, R. Tong, and M. Tang. Illumination-aware faster r-cnn for robust multispectral pedestrian detection. Pattern Recognition, 85:161–171, 2019.
  • [22] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan. Scale-aware fast r-cnn for pedestrian detection. IEEE Transactions on Multimedia, 20(4):985–996, 2018.
  • [23] J. Liang, X. Liu, K. Huang, X. Li, D. Wang, and X. Wang. Automatic registration of multisensor images using an integrated spatial and mutual information (smi) metric. IEEE transactions on geoscience and remote sensing, 52(1):603–615, 2014.
  • [24] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
  • [25] J. Liu, S. Zhang, S. Wang, and D. N. Metaxas. Multispectral deep neural networks for pedestrian detection. In British Machine Vision Conference (BMVC), 2016.
  • [26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  • [27] J. Ma, Y. Ma, J. Zhao, and J. Tian. Image feature matching via progressive vector field consensus. IEEE Signal Processing Letters, 22(6):767–771, 2015.
  • [28] J. Ma, J. Zhao, J. Tian, A. L. Yuille, and Z. Tu. Robust point matching via vector field consensus. IEEE Trans. image processing, 23(4):1706–1721, 2014.
  • [29] J. A. Maintz and M. A. Viergever. A survey of medical image registration. Medical image analysis, 2(1):1–36, 1998.
  • [30] J. Mao, T. Xiao, Y. Jiang, and Z. Cao. What can help pedestrian detection? In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.
  • [31] W. Nam, P. Dollár, and J. H. Han. Local decorrelation for improved pedestrian detection. In Advances in Neural Information Processing Systems, pages 424–432, 2014.
  • [32] S. Pang, J. Xue, Q. Tian, and N. Zheng. Exploiting local linear geometric structure for identifying correct matches. Computer Vision and Image Understanding, 128:51–64, 2014.
  • [33] K. Park, S. Kim, and K. Sohn. Unified multi-spectral pedestrian detection based on probabilistic fusion networks. Pattern Recognition, 80, 2018.
  • [34] B. S. Reddy and B. N. Chatterji. An fft-based technique for translation, rotation, and scale-invariant image registration. IEEE transactions on image processing, 5(8):1266–1271, 1996.
  • [35] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. Computer Science, 2014.
  • [37] S. Song and J. Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 808–816, 2016.
  • [38] A. Torabi, G. Massé, and G.-A. Bilodeau. An iterative integrated framework for thermal–visible image registration, sensor fusion, and people tracking for video surveillance applications. Computer Vision and Image Understanding, 116(2):210–221, 2012.
  • [39] W. Treible, P. Saponaro, S. Sorensen, A. Kolagunda, M. O’Neal, B. Phelan, K. Sherbondy, and C. Kambhamettu. Cats: A color and thermal stereo benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2961–2969, 2017.
  • [40] J. Wagner, V. Fischer, M. Herman, and S. Behnke. Multispectral pedestrian detection using deep fusion convolutional neural networks. In 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), pages 509–514, 2016.
  • [41] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen. Repulsion loss: Detecting pedestrians in a crowd.
  • [42] D. Xu, W. Ouyang, E. Ricci, X. Wang, and N. Sebe. Learning cross-modal deep representations for robust pedestrian detection. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [43] L. Zhang, L. Lin, X. Liang, and K. He. Is faster r-cnn doing well for pedestrian detection? In European Conference on Computer Vision, pages 443–457. Springer, 2016.
  • [44] L. Zhang, Z. Liu, S. Zhang, X. Yang, H. Qiao, K. Huang, and A. Hussain. Cross-modality interactive attention network for multispectral pedestrian detection. Information Fusion, 2018.
  • [45] S. Zhang, R. Benenson, and B. Schiele. Filtered channel features for pedestrian detection. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1751–1760. IEEE, 2015.
  • [46] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Occlusion-aware r-cnn: Detecting pedestrians in a crowd. arXiv preprint arXiv:1807.08407, 2018.
  • [47] S. Zhiwei, W. Yiyan, Z. Changjiu, and Z. Yi. A new sensor fusion framework to deal with false detections for low-cost service robot localization. In Robotics and Biomimetics (ROBIO), 2013 IEEE International Conference on, pages 197–202. IEEE, 2013.
  • [48] B. Zitova and J. Flusser. Image registration methods: a survey. Image and vision computing, 21(11):977–1000, 2003.