Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT Benchmark for Crowd Counting

Lingbo Liu et al., 12/08/2020

Crowd counting is a fundamental yet challenging problem that requires rich information to generate pixel-wise crowd density maps. However, most previous methods only utilize the limited information of RGB images and may fail to discover potential pedestrians in unconstrained environments. In this work, we find that incorporating optical and thermal information greatly helps to recognize pedestrians. To promote future research in this field, we introduce a large-scale RGBT Crowd Counting (RGBT-CC) benchmark, which contains 2,030 pairs of RGB-thermal images with 138,389 annotated people. Furthermore, to facilitate multimodal crowd counting, we propose a cross-modal collaborative representation learning framework, which consists of multiple modality-specific branches, a modality-shared branch, and an Information Aggregation-Distribution Module (IADM) to fully capture the complementary information of different modalities. Specifically, our IADM incorporates two collaborative information transfer components that dynamically enhance the modality-shared and modality-specific representations with a dual information propagation mechanism. Extensive experiments conducted on the RGBT-CC benchmark demonstrate the effectiveness of our framework for RGBT crowd counting. Moreover, the proposed approach is universal for multimodal crowd counting and also achieves superior performance on the ShanghaiTechRGBD dataset.


1 Introduction

Crowd counting [kang2018beyond, gao2020cnn] is a fundamental computer vision task that aims to automatically estimate the number of people in unconstrained scenes. Over the past decade, this task has attracted considerable research interest due to its huge application potential (e.g., traffic management [zhang2017understanding, liu2020dynamic] and video surveillance [xiong2017spatiotemporal]). During the recent COVID-19 pandemic [velavan2020covid], crowd counting has also been widely employed for social distancing monitoring [ghodgaonkar2020analyzing].

Figure 1: Visualization of RGB-thermal images in our RGBT-CC benchmark. When using only the optical information of RGB images, we cannot effectively recognize pedestrians in poor illumination conditions, as shown in (a) and (b). When using only thermal images, some heat-emitting negative objects are hard to distinguish, as shown in (c) and (d).

In the literature, numerous models [zhang2016single, sindagi2017generating, liu2018crowd, zhang2019attentional, bai2020adaptive, li2018csrnet, liu2019crowd, ma2019bayesian, liu2019context, ma2020learning, liu2020semi, He_Error_2021, MA_UOT_2021] have been proposed for visual crowd counting. Despite substantial progress, it remains a very challenging problem that requires rich information to generate pixel-wise crowd density maps. However, most previous methods only utilize the optical information extracted from RGB images and may fail to accurately recognize semantic objects in unconstrained scenarios. For instance, as shown in Fig. 1-(a,b), pedestrians are almost invisible in poor illumination conditions (such as backlight and night) and are hard to detect directly from RGB images. Moreover, some human-shaped objects (e.g., tiny pillars and blurry traffic lights) have appearances similar to pedestrians [zhang2016faster] and are easily mistaken for people when relying solely on optical features. In general, RGB images cannot guarantee high-quality density maps, so more comprehensive information should be explored for crowd counting.

Fortunately, we observe that thermal images greatly facilitate distinguishing potential pedestrians from cluttered backgrounds. Recently, thermal cameras have been extensively popularized due to the COVID-19 pandemic, which increases the feasibility of thermal-based crowd counting. However, thermal images are not perfect. As shown in Fig. 1-(c,d), some hard negative objects (e.g., heated walls and lamps) are also highlighted in thermal images, but they can be eliminated effectively with the aid of optical information. Overall, RGB images and thermal images are highly complementary. To the best of our knowledge, no attempt has been made to simultaneously explore RGB and thermal images for estimating crowd counts. In this work, to promote further research in this field, we propose a large-scale benchmark, "RGBT Crowd Counting (RGBT-CC)", which contains 2,030 pairs of RGB-thermal images and 138,389 annotated pedestrians. Moreover, our benchmark makes significant advances in terms of diversity and difficulty, as these RGBT images were captured in unconstrained scenes (e.g., malls, streets, train stations, etc.) under various illumination conditions (e.g., day and night). The proposed benchmark will be released after peer review.

Nevertheless, capturing the complementarities of multimodal data (i.e., RGB and thermal images) is non-trivial. Conventional methods [lian2019density, zhou2020cascaded, piao2019depth, jiang2020emph, zhai2020bifurcated, sun2019leveraging] either feed the combination of multimodal data into deep neural networks or directly fuse their features, which cannot fully exploit the complementary information. In this work, to facilitate multimodal crowd counting, we introduce a cross-modal collaborative representation learning framework, which incorporates multiple modality-specific branches, a modality-shared branch, and an Information Aggregation-Distribution Module (IADM) to fully capture the complementarities among different modalities. Specifically, our IADM is integrated with two collaborative components: i) an Information Aggregation Transfer that dynamically aggregates the contextual information of all modality-specific features to enhance the modality-shared feature, and ii) an Information Distribution Transfer that propagates the modality-shared information to symmetrically refine every modality-specific feature for further representation learning. Furthermore, the tailor-designed IADM is embedded at different layers to learn cross-modal representations hierarchically. Consequently, the proposed framework generates knowledgeable features with comprehensive information, thereby yielding high-quality crowd density maps.

It is worth noting that our method has three appealing properties. First, thanks to the dual information propagation mechanism, IADM effectively captures multimodal complementarities to facilitate the crowd counting task. Second, as a plug-and-play module, IADM can be easily incorporated into various backbone networks for end-to-end optimization. Third, our framework is universal for multimodal crowd counting. Beyond RGBT counting, the proposed method can also be directly applied to RGB-Depth counting. In summary, the major contributions of this work are three-fold:

  • We introduce a large-scale RGBT benchmark to promote the research in the field of crowd counting, in which 138,389 pedestrians are annotated in 2,030 pairs of RGB-thermal images captured in unconstrained scenarios.

  • We develop a cross-modal collaborative representation learning framework, which is capable of fully learning the complementarities among different modalities with our tailor-designed Information Aggregation-Distribution Module.

  • Extensive experiments conducted on two multimodal benchmarks (i.e., our RGBT-CC and ShanghaiTechRGBD [lian2019density]) demonstrate that the proposed method is effective and universal for multimodal crowd counting.

2 Related Works

Crowd Counting Benchmarks: In recent years, we have witnessed the rapid evolution of crowd counting benchmarks. UCSD [chan2008privacy] and WorldExpo [zhang2015cross] are two early datasets that respectively contain 2,000 and 3,980 video frames with low diversity and low-to-medium densities. To alleviate the limitations of these datasets, Zhang et al. [zhang2016single] collected 1,198 images with 330,165 annotated heads, which offer better diversity in terms of scenes and density levels. Subsequently, three large-scale datasets were proposed in succession. For instance, UCF-QNRF [idrees2018composition] is composed of 1,535 high-density images with a total of 1.25 million pedestrians. JHU-CROWD++ [sindagi2020jhu] contains 4,372 images with 1.51 million annotated heads, while NWPU-Crowd [gao2020nwpu] consists of 2.13 million annotations in 5,109 images. Nevertheless, all the above benchmarks are based on RGB optical images, on which almost all previous methods fail to recognize pedestrians that are invisible in poor illumination conditions. Recently, Lian et al. [lian2019density] utilized a stereo camera to capture 2,193 depth images that are insensitive to illumination. However, these images are coarse in outdoor scenes due to the limited depth range (0–20 meters), which seriously restricts their deployment scope. Fortunately, we find that thermal images are robust to illumination and have a large perception range, and thus can help recognize pedestrians in various scenarios. Therefore, we propose the first RGBT crowd counting dataset in this work, hoping that it will greatly promote future development in this field.

Crowd Counting Approaches: As a classic problem in computer vision, crowd counting has been studied extensively. Early works [Chan2009bayesian, chen2012feature, idrees2013multi] directly predict the crowd count with regression models, while subsequent works usually generate crowd density maps with deep neural networks and then accumulate all pixels' values to obtain the final counts, as sketched below. In terms of network architectures, previous approaches can be divided into three categories: (1) basic CNN based methods [fu2015fast, zhang2015cross, wang2015deep, walach2016learning] that adopt basic convolutional and pooling layers; (2) multi-column based methods [zhang2016single, sam2017switching, sindagi2017generating, liu2018decidenet, liu2019crowd, qiu2019crowd, zhang2019relational, yuan2020crowd] that utilize several CNN columns to capture multi-scale information; and (3) single-column based methods [cao2018scale, li2018csrnet, liu2019adcrowdnet, jiang2019crowd, ma2019bayesian] that deploy single but deeper networks. However, all the aforementioned methods estimate crowd counts only with the optical information of RGB images and are not effective in poor illumination conditions. Recently, depth images have been used as auxiliary information to count and locate human heads [lian2019density]. Nevertheless, depth images are coarse in outdoor scenarios, so depth-based methods have relatively limited deployment scopes.
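To make the density-map paradigm concrete, here is a minimal, hypothetical PyTorch snippet (not tied to any specific method above) showing how a predicted density map is turned into a count; the tensor shape is purely illustrative.

```python
import torch

# Minimal sketch of the density-map counting paradigm: a network predicts a
# per-pixel density map, and the final count is the sum over all pixels.
density_map = torch.rand(1, 1, 60, 80)            # hypothetical network output (B, 1, H', W')
estimated_count = density_map.sum(dim=(1, 2, 3))  # one scalar count per image
print(estimated_count.item())
```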

Multi-Modal Representation Learning: Multi-modal representation learning aims at comprehending and representing cross-modal data through machine learning, and many strategies exist for cross-modal feature fusion. Some simple fusion methods [kiela2014learning, lian2019density, sun2019leveraging, fu2020jl] obtain a fused feature via element-wise multiplication/addition or concatenation in an "Early Fusion" or "Late Fusion" manner. To exploit the advantages of both early and late fusion, various two-stream models [wu2020deepdualmapper, piao2020a2dele, zhao2019contrast, zhang2019attend] fuse hierarchical cross-modal features to obtain a fully representative shared feature. Besides, a few approaches [lu2020cross] explore the use of a shared branch that maps the shared information into a common feature space. Furthermore, some recent works [fan2020bbsnet, HDFNet-ECCV2020, zhang2020uc] address RGBD saliency detection, which is also a cross-modal dense prediction task like RGBT crowd counting. However, most of these works perform one-way information transfer, using the depth modality only as auxiliary information to help the representation learning of the RGB modality. In this work, we propose a symmetric dynamic enhancement mechanism that takes full advantage of the modal complementarities for crowd counting.

Figure 2: Histogram of the people distribution in the proposed RGBT Crowd Counting benchmark.
Figure 3: The architecture of the proposed cross-modal collaborative representation learning framework for multimodal crowd counting. Specifically, our framework is composed of multiple modality-specific branches, a modality-shared branch, and an Information Aggregation-Distribution Module (IADM).
Training Validation Testing
#Bright   510 / 65.66   97 / 63.02 406 / 73.39
#Dark   520 / 62.52 103 / 67.74 394 / 74.88
#Total 1030 / 64.07 200 / 65.45 800 / 74.12
Scene malls, streets, train/metro stations, etc
Table 1: The training, validation and testing sets of our RGBT-CC benchmark. In each grid, the first value is the number of images, while the second value denotes the average count per image.

3 RGBT Crowd Counting Benchmark

To the best of our knowledge, there is currently no public RGBT dataset for crowd counting. To promote future research on this task, we propose a large-scale RGBT Crowd Counting (RGBT-CC) benchmark. Specifically, we first use an optical-thermal camera to take a large number of RGB-thermal images in various scenarios (e.g., malls, streets, playgrounds, train stations, metro stations, etc.). Due to the different types of electronic sensors, the original RGB images have a high resolution of 2,048×1,536 with a wider field of view, while the thermal images have a standard resolution of 640×480 with a smaller field of view. On the basis of the coordinate mapping relation, we crop the corresponding RGB regions and resize them to 640×480. We then choose 2,030 pairs of representative RGB-thermal images for manual annotation. Among these samples, 1,013 pairs are captured in bright conditions and 1,017 pairs in the dark. A total of 138,389 pedestrians are marked with point annotations, an average of 68 people per image. The detailed distribution of people is shown in Fig. 2. Moreover, with a density of 2.22e-4 people per pixel, our RGBT-CC is denser than UCF-QNRF [idrees2018composition] and NWPU-Crowd [gao2020nwpu], whose densities are 1.12e-4 and 5.95e-5 people per pixel, respectively. Finally, the proposed RGBT-CC benchmark is randomly divided into three parts. As shown in Table 1, 1,030 pairs are used for training, 200 for validation, and 800 for testing.
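For reference, the reported density can be reproduced from the statistics above, assuming it is computed over the 640×480 annotated resolution:

$$\frac{138{,}389}{2{,}030 \times 640 \times 480} \approx 2.22 \times 10^{-4} \ \text{people per pixel}.$$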

4 Method

In this work, we propose a cross-modal collaborative representation learning framework for multimodal crowd counting. Specifically, multiple modality-specific branches, a modality-shared branch, and an Information Aggregation-Distribution Module are incorporated to fully capture the complementarities among different modalities with a dual information propagation paradigm. In this section, we adopt the representative CSRNet [li2018csrnet] as a backbone network to develop our framework for RGBT crowd counting. It is worth noting that our framework can be implemented with various backbone networks (e.g., MCNN [zhang2016single], SANet [cao2018scale], and BL [ma2019bayesian]), and is also universal for multimodal crowd counting, as verified in Section 5.4 by directly applying it to the ShanghaiTechRGBD [lian2019density] dataset.

4.1 Overview

As shown in Fig. 3, the proposed RGBT crowd counting framework is composed of three parallel backbones and an Information Aggregation-Distribution Module (IADM). Specifically, the top and bottom backbones are developed for modality-specific (i.e., RGB and thermal) representation learning, while the middle backbone is designed for modality-shared representation learning. To fully exploit the multimodal complementarities, our IADM dynamically transfers specific-shared information to collaboratively enhance the modality-specific and modality-shared representations. Consequently, the final modality-shared feature contains comprehensive information and facilitates generating high-quality crowd density maps.

Given an RGB image and a thermal image, we first feed them into different branches to extract modality-specific features, which maintain the specific information of the individual modalities. The modality-shared branch takes a zero tensor as input and hierarchically aggregates the information of the modality-specific features. As mentioned above, each branch is implemented with CSRNet, which consists of (1) a front-end block with the first ten convolutional layers of VGG16 [simonyan2014very] and (2) a back-end block with six dilated convolutional layers. More specifically, the modality-specific branches are based on the CSRNet front-end block, while the modality-shared branch is based on the last 14 convolutional layers of CSRNet. In our work, the $j$-th dilated convolutional layer of the back-end block is renamed "Conv_$j$". For convenience, the RGB, thermal, and modality-shared features at the Conv_$j$ layer are denoted as $F_r^j$, $F_t^j$, and $F_s^j$, respectively.

Figure 4: (a) Information Aggregation Transfer: we first extract the contextual information $\mathcal{C}(F_r)$/$\mathcal{C}(F_t)$ from the modality-specific features $F_r$/$F_t$, and then propagate it dynamically to enhance the modality-shared feature $F_s$. (b) Information Distribution Transfer: the contextual information of the enhanced feature $\tilde{F}_s$ is distributed adaptively to each modality-specific feature for feedback refinement. "+" denotes element-wise addition and "−" refers to element-wise subtraction.

After feature extraction, we employ the Information Aggregation-Distribution Module described in Section 4.2 to learn cross-modal collaborative representations. To exploit the multimodal information hierarchically, the proposed IADM is embedded after four different Conv_$j$ layers. Specifically, after each such layer, IADM dynamically transfers complementary information among the modality-specific and modality-shared features for mutual enhancement. This process can be formulated as follows:

$$\tilde{F}_r^j,\ \tilde{F}_s^j,\ \tilde{F}_t^j = \mathrm{IADM}\big(F_r^j,\ F_s^j,\ F_t^j\big), \quad (1)$$

where $\tilde{F}_r^j$, $\tilde{F}_s^j$, and $\tilde{F}_t^j$ are the enhanced features of $F_r^j$, $F_s^j$, and $F_t^j$, respectively. These features are then fed into the next layer of each branch to further learn high-level multimodal representations. Thanks to the tailor-designed IADM, the complementary information of the input RGB image and thermal image is progressively transferred into the modality-shared representation, so the final modality-shared feature contains rich information. Finally, we directly feed this feature into a 1×1 convolutional layer to predict the crowd density map.
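As a reading aid, the following is a minimal, hypothetical PyTorch skeleton of this three-branch data flow in Eq. (1). The stage layouts, channel width, class names, and the IADMStub placeholder are illustrative assumptions, not the authors' implementation; the real branches follow CSRNet and the real IADM is detailed in Section 4.2.

```python
import torch
import torch.nn as nn

class IADMStub(nn.Module):
    """Placeholder for the Information Aggregation-Distribution Module (Sec. 4.2).
    It returns its inputs unchanged so this skeleton runs standalone."""
    def forward(self, f_r, f_s, f_t):
        return f_r, f_s, f_t

class RGBTCounterSketch(nn.Module):
    """Hypothetical skeleton: two modality-specific branches, one modality-shared
    branch, and an IADM inserted after each stage, as in Eq. (1)."""
    def __init__(self, channels=64, num_stages=4):
        super().__init__()
        self.channels = channels
        def stage(in_c, out_c):
            return nn.Sequential(nn.Conv2d(in_c, out_c, 3, padding=1), nn.ReLU(inplace=True))
        self.rgb_stages = nn.ModuleList([stage(3 if i == 0 else channels, channels) for i in range(num_stages)])
        self.t_stages = nn.ModuleList([stage(3 if i == 0 else channels, channels) for i in range(num_stages)])
        self.shared_stages = nn.ModuleList([stage(channels, channels) for _ in range(num_stages)])
        self.iadms = nn.ModuleList([IADMStub() for _ in range(num_stages)])
        self.head = nn.Conv2d(channels, 1, kernel_size=1)  # 1x1 conv predicting the density map

    def forward(self, rgb, thermal):
        f_r, f_t = rgb, thermal  # thermal assumed stored as a 3-channel image
        # The modality-shared branch starts from a zero tensor (Sec. 4.1).
        f_s = torch.zeros(rgb.size(0), self.channels, rgb.size(2), rgb.size(3), device=rgb.device)
        for rgb_stage, t_stage, s_stage, iadm in zip(self.rgb_stages, self.t_stages, self.shared_stages, self.iadms):
            f_r, f_t, f_s = rgb_stage(f_r), t_stage(f_t), s_stage(f_s)
            f_r, f_s, f_t = iadm(f_r, f_s, f_t)  # mutual enhancement, Eq. (1)
        return self.head(f_s)  # the final modality-shared feature yields the density map
```

For example, `RGBTCounterSketch()(torch.rand(1, 3, 480, 640), torch.rand(1, 3, 480, 640))` returns a 1×1×480×640 density map in this simplified, stride-free sketch.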

4.2 Collaborative Representation Learning

As analyzed in Section 1, RGB images and thermal images are highly complementary. To fully capture their complementarities, we propose an Information Aggregation and Distribution Module (IADM) to collaboratively learn cross-modal representation with a dual information propagation mechanism. Specifically, our IADM is integrated with two collaborative transfers, which dynamically propagate the contextual information to mutually enhance the modality-specific and modality-shared representations.

1) Contextual Information Extraction: In this module, we propagate contextual information rather than the original features, because the latter would cause excessive mixing of the specific and shared features. To this end, we employ an $L$-level pyramid pooling layer to extract the contextual information of a given feature $F$. Specifically, at the $l$-th level ($l=1,\dots,L$), we apply a max-pooling layer to generate a downsampled feature, which is then upsampled to the original resolution of $F$ with nearest-neighbor interpolation. For convenience, the upsampled feature is denoted as $U^l$. Finally, the contextual information of feature $F$ is computed as:

$$\mathcal{C}(F) = \mathrm{Conv}_{1\times1}\big(\big[F,\ U^1,\ \dots,\ U^L\big]\big), \quad (2)$$

where $[\,\cdot\,]$ denotes feature concatenation and $\mathrm{Conv}_{1\times1}$ is a 1×1 convolutional layer. This extraction has two advantages. First, with a larger receptive field, each position of $\mathcal{C}(F)$ contains richer context. Second, since they are captured by different sensors, RGB images and thermal images are not strictly aligned, as shown in Figure 1. Thanks to the translation invariance of max-pooling layers, the misalignment of RGB-thermal images can be eliminated to some extent.
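A minimal PyTorch sketch of this pyramid-pooling context extraction (Eq. 2) is given below; the class name and the pooling kernel sizes are our assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextExtractor(nn.Module):
    """Sketch of the L-level pyramid-pooling context extraction of Eq. (2).
    Kernel sizes 2, 4, 8, ... per level are assumed, not taken from the paper."""
    def __init__(self, channels, levels=3):
        super().__init__()
        # One max-pooling scale per pyramid level.
        self.pools = nn.ModuleList([nn.MaxPool2d(kernel_size=2 ** (l + 1)) for l in range(levels)])
        # 1x1 convolution fusing the original feature with its upsampled pooled versions.
        self.fuse = nn.Conv2d(channels * (levels + 1), channels, kernel_size=1)

    def forward(self, feat):
        h, w = feat.shape[-2:]
        pyramid = [feat]
        for pool in self.pools:
            pooled = pool(feat)
            # Upsample back to the original resolution with nearest-neighbor interpolation.
            pyramid.append(F.interpolate(pooled, size=(h, w), mode="nearest"))
        return self.fuse(torch.cat(pyramid, dim=1))  # contextual information C(F)
```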

2) Information Aggregation Transfer (IAT): IAT is proposed to aggregate the contextual information of all modality-specific features to enhance the modality-shared feature. As shown in Fig. 4-(a), instead of directly absorbing all information, our IAT transfers the complementary information dynamically with a gating mechanism that adaptively filters useful information. Specifically, given the features $F_r^j$, $F_t^j$, and $F_s^j$, we first extract their contextual information $\mathcal{C}(F_r^j)$, $\mathcal{C}(F_t^j)$, and $\mathcal{C}(F_s^j)$ with Eq. 2. Similar to [zhang2019residual, zhao2019spatiotemporal], we then obtain two residual terms $R_r^j = \mathcal{C}(F_r^j) - \mathcal{C}(F_s^j)$ and $R_t^j = \mathcal{C}(F_t^j) - \mathcal{C}(F_s^j)$ by computing the differences between the modality-specific and modality-shared contexts. Finally, we apply two gating functions to adaptively propagate the complementary information for refining the modality-shared feature $F_s^j$. The enhanced feature $\tilde{F}_s^j$ is formulated as follows:

$$\tilde{F}_s^j = F_s^j + w_r \odot R_r^j + w_t \odot R_t^j, \quad w_r = \mathcal{G}_r(R_r^j), \;\; w_t = \mathcal{G}_t(R_t^j), \quad (3)$$

where the gating functions $\mathcal{G}_r$ and $\mathcal{G}_t$ are implemented with convolutional layers, $w_r$ and $w_t$ are the gating weights, and $\odot$ denotes element-wise multiplication. With such a mechanism, the complementary information is effectively embedded into the modality-shared representation, so our method can better exploit the multimodal data.
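The following hedged sketch illustrates Eq. (3); the 1×1-convolution-plus-sigmoid gate is an assumption consistent with "gating functions implemented with convolutional layers", not a confirmed design detail, and the class name is ours.

```python
import torch
import torch.nn as nn

class InformationAggregationTransfer(nn.Module):
    """Sketch of IAT following Eq. (3): residual contextual information from each
    modality-specific feature is gated and added to the modality-shared feature."""
    def __init__(self, channels):
        super().__init__()
        # Assumed gate design: 1x1 conv followed by a sigmoid, producing per-element weights.
        self.gate_r = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_t = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, ctx_r, ctx_t, ctx_s, f_s):
        # Residual information between modality-specific and modality-shared contexts.
        res_r = ctx_r - ctx_s
        res_t = ctx_t - ctx_s
        # Gated, element-wise propagation of the complementary information (Eq. 3).
        return f_s + self.gate_r(res_r) * res_r + self.gate_t(res_t) * res_t
```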

3) Information Distribution Transfer (IDT): After information aggregation, we distribute the information of the enhanced modality-shared feature to refine each modality-specific feature. As shown in Fig. 4-(b), given the enhanced feature $\tilde{F}_s^j$, we first extract its contextual information $\mathcal{C}(\tilde{F}_s^j)$, which is then dynamically propagated to $F_r^j$ and $F_t^j$. Similar to IAT, two gating functions are used for information filtering. Specifically, the enhanced modality-specific features are computed as follows:

$$\tilde{F}_r^j = F_r^j + w'_r \odot \big(\mathcal{C}(\tilde{F}_s^j) - \mathcal{C}(F_r^j)\big), \qquad \tilde{F}_t^j = F_t^j + w'_t \odot \big(\mathcal{C}(\tilde{F}_s^j) - \mathcal{C}(F_t^j)\big),$$

where $w'_r$ and $w'_t$ are gating weights produced by convolutional gating functions. Finally, all enhanced features $\tilde{F}_r^j$, $\tilde{F}_s^j$, and $\tilde{F}_t^j$ are fed into the following layers of their individual branches for further representation learning.
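Analogously, a sketch of the distribution step under the same assumed gate design could look as follows; chaining a context extractor, the IAT above, and this IDT after a backbone stage would yield one IADM instance.

```python
import torch
import torch.nn as nn

class InformationDistributionTransfer(nn.Module):
    """Sketch of IDT: the context of the enhanced modality-shared feature is gated
    and distributed back to each modality-specific feature (feedback refinement)."""
    def __init__(self, channels):
        super().__init__()
        # Assumed gate design, mirroring the IAT sketch above.
        self.gate_r = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_t = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, f_r, f_t, ctx_r, ctx_t, ctx_s_enhanced):
        # Residual shared-to-specific information, gated and added back to each branch.
        res_r = ctx_s_enhanced - ctx_r
        res_t = ctx_s_enhanced - ctx_t
        f_r_enhanced = f_r + self.gate_r(res_r) * res_r
        f_t_enhanced = f_t + self.gate_t(res_t) * res_t
        return f_r_enhanced, f_t_enhanced
```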

Input Data Representation Learning GAME(0) GAME(1) GAME(2) GAME(3) RMSE
RGB - 33.94 40.76 47.31 57.20 69.59
T - 21.64 26.22 31.65 38.66 37.38
RGBT Early Fusion 20.40 23.58 28.03 35.51 35.26
Late fusion 19.87 25.60 31.93 41.60 35.09
W/O Gating Mechanism 19.76 23.60 28.66 36.21 33.61
W/O Modality-Shared Feature 18.67 22.67 27.95 36.04 33.73
W/O Information Distribution 18.59 23.08 28.73 36.74 32.91
IADM 17.94 21.44 26.17 33.33 30.91
Table 2: The performance of different inputs and different representation learning approaches on our RGBT-CC benchmark.
Illumination Input Data GAME(0) GAME(1) GAME(2) GAME(3) RMSE
Brightness RGB 23.49 30.14 37.47 48.46 45.40
T 25.21 28.98 34.82 42.25 40.60
RGBT 20.36 23.57 28.49 36.29 32.57
Darkness RGB 44.72 51.70 57.45 66.21 87.81
T 17.97 23.38 28.39 34.95 33.74
RGBT 15.44 19.23 23.79 30.28 29.11
Table 3: Performance under different illumination conditions on our RGBT-CC benchmark. The unimodal data is directly fed into CSRNet, while the multimodal data is fed into our proposed framework based on CSRNet. Lower is better for all metrics.

5 Experiments

5.1 Implementation Details & Evaluation Metric

In this work, the proposed method is implemented with PyTorch [paszke2019pytorch]. We take various models (e.g., CSRNet [li2018csrnet], MCNN [zhang2016single], SANet [cao2018scale], and BL [ma2019bayesian]) as backbones to develop multiple instances of our framework. To maintain a similar number of parameters to the original models for fair comparisons, the channel numbers of these backbones in our framework are set to 70%, 60%, 60%, and 60% of their original values, respectively. The kernel parameters are initialized by a Gaussian distribution with zero mean and a standard deviation of 1e-2. At each iteration, a pair of 640×480 RGBT images is fed into the network. The ground-truth density maps are generated with geometry-adaptive Gaussian kernels [zhang2016single]. The learning rate is set to 1e-5 and Adam [kingma2014adam] is used to optimize our framework.
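The initialization and optimizer settings above translate into a few lines of PyTorch; the snippet below is a hedged sketch with a stand-in module, not the authors' training script.

```python
import torch
import torch.nn as nn

def init_weights(module):
    """Gaussian initialization with zero mean and std 1e-2, as stated above."""
    if isinstance(module, nn.Conv2d):
        nn.init.normal_(module.weight, mean=0.0, std=1e-2)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Hypothetical usage with any model instance of the framework; a single conv
# layer stands in for the full network purely for illustration.
model = nn.Conv2d(3, 1, 3, padding=1)
model.apply(init_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # Adam with lr 1e-5, per the text
```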

Following [liu2020weighing, sindagi2019multi, liu2020efficient], we adopt the Root Mean Square Error (RMSE) as an evaluation metric. Moreover, the Grid Average Mean Absolute Error (GAME [guerrero2015extremely]) is utilized to evaluate the performance in different regions. Specifically, for a given level $l$, we divide each image into $4^l$ non-overlapping regions and measure the counting error in each region. The GAME at level $l$ is computed as:

$$\mathrm{GAME}(l) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{4^l} \left| P_i^j - G_i^j \right|, \quad (4)$$

where $N$ is the total number of testing samples, and $P_i^j$ and $G_i^j$ are the estimated count and the corresponding ground-truth count in the $j$-th region of the $i$-th image. Note that GAME(0) is equivalent to the Mean Absolute Error (MAE).
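A straightforward implementation of Eq. (4) for a single image is sketched below; the function name and the point-annotation format are our assumptions. Averaging its output over the $N$ test images gives GAME($l$).

```python
import numpy as np

def game(pred_density, gt_points, level):
    """Per-image term of Eq. (4): split the image into 4**level non-overlapping
    regions (a 2**level x 2**level grid) and sum the absolute count errors.
    pred_density: HxW predicted density map; gt_points: list of (row, col) heads."""
    h, w = pred_density.shape
    cells = 2 ** level
    error = 0.0
    for i in range(cells):
        for j in range(cells):
            r0, r1 = i * h // cells, (i + 1) * h // cells
            c0, c1 = j * w // cells, (j + 1) * w // cells
            pred_count = pred_density[r0:r1, c0:c1].sum()
            gt_count = sum(1 for (r, c) in gt_points if r0 <= r < r1 and c0 <= c < c1)
            error += abs(pred_count - gt_count)
    return error  # average this value over the N test images to obtain GAME(level)

# Example: a uniform 480x640 map summing to 5 vs. 5 annotated points.
demo_pred = np.full((480, 640), 5.0 / (480 * 640))
demo_points = [(100, 100), (200, 300), (240, 320), (400, 500), (50, 600)]
print(game(demo_pred, demo_points, level=0))  # GAME(0) term equals the absolute count error
```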

5.2 Ablation Studies

We perform extensive ablation studies to verify the effectiveness of each component in our framework. In this subsection, CSRNet is utilized as the backbone network to implement our proposed method.

Figure 5: Visualization of the crowd density maps generated in different illumination conditions. (a) and (b) show the input RGB images and thermal images. (c) and (d) are the results of RGB-based CSRNet and thermal-based CSRNet. (e) shows the results of CSRNet that takes the concatenation of RGB and thermal images as input. (f) refers to the results of our CSRNet+IADM. The ground truths are shown in (g). We can observe that our density maps and estimated counts are more accurate than those of other methods. (Best viewed zoomed in.)
Method GAME(0) GAME(1) GAME(2) GAME(3) RMSE
UCNet [zhang2020uc] 33.96 42.42 53.06 65.07 56.31
HDFNet [HDFNet-ECCV2020] 22.36 27.79 33.68 42.48 33.93
BBSNet [fan2020bbsnet] 19.56 25.07 31.25 39.24 32.48
  MCNN 21.89 25.70 30.22 37.19 37.44
  MCNN + IADM 19.77 23.80 28.58 35.11 30.34
  SANet 21.99 24.76 28.52 34.25 41.60
  SANet + IADM 18.18 21.84 26.27 32.95 33.72
CSRNet 20.40 23.58 28.03 35.51 35.26
CSRNet + IADM 17.94 21.44 26.17 33.33 30.91
        BL 18.70 22.55 26.83 34.62 32.67
        BL + IADM 15.61 19.95 24.69 32.89 28.18
Table 4: Performance of different methods on the proposed RGBT-CC benchmark. All the methods in this table utilize both RGB images and thermal images to estimate the crowd counts.

1) Effectiveness of Multimodal Data: We first explore whether the multimodal data (i.e., RGB images and thermal images) is effective for crowd counting. As shown in Table 2, when only feeding RGB images into CSRNet, we obtain less impressive performance (e.g., GAME(0) is 33.94 and RMSE is 69.59), because pedestrians in dark environments cannot be effectively recognized. When utilizing thermal images, GAME(0) and RMSE drop sharply to 21.64 and 37.38, which demonstrates that thermal images are more useful than RGB images. In contrast, the various models in the bottom six rows of Table 2 achieve better performance when considering RGB and thermal images simultaneously. In particular, our CSRNet+IADM has a relative performance improvement of 17.3% on RMSE compared with the thermal-based CSRNet.
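This relative improvement can be checked directly from the RMSE values in Table 2:

$$\frac{37.38 - 30.91}{37.38} \approx 0.173 = 17.3\%.$$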

To further verify the complementarities of multimodal data, the testing set is divided into two parts to measure the performance in different illumination conditions separately. As shown in Table 3, using both RGB and thermal images, our CSRNet+IADM consistently outperforms the unimodal CSRNet in both bright and dark scenarios. This is attributed to the thermal information, which greatly helps to distinguish potential pedestrians from the cluttered background, while the optical information is beneficial for eliminating heat-emitting negative objects in thermal images. Moreover, we also visualize some crowd density maps generated with different modal data in Fig. 5. We can observe that the density maps and estimated counts of our CSRNet+IADM are more accurate. These quantitative and qualitative experiments show that RGBT images are highly effective for crowd counting.

2) Which Representation Learning Method is Better? We implement six methods for multimodal representation learning. Specifically, "Early Fusion" feeds the concatenation of RGB and thermal images into CSRNet. "Late Fusion" extracts the RGB and thermal features with two separate CSRNets and then combines their features to generate density maps. As shown in Table 2, these two models are better than the unimodal models, but their performance still lags far behind the variants of our IADM. For instance, without gating functions, the variant "W/O Gating Mechanism" directly propagates information among different features and obtains an RMSE of 33.61. The variant "W/O Modality-Shared Feature" obtains a GAME(0) of 18.67 and an RMSE of 33.73 when removing the modality-shared branch and directly refining the modality-specific features. When using the modality-shared branch but only aggregating multimodal information, the variant "W/O Information Distribution" obtains a relatively better RMSE of 32.91. When using the full IADM, our method achieves the best performance on all evaluation metrics. This is attributed to our tailor-designed architecture (i.e., specific-shared branches with dual information propagation), which can effectively learn the multimodal collaborative representation and fully capture the complementary information of RGB and thermal images. These experiments demonstrate the effectiveness of the proposed IADM for multimodal representation learning.

#Level GAME(0) GAME(1) GAME(2) GAME(3) RMSE
L=1 18.94 23.05 28.03 35.88 33.01
L=2 18.35 22.56 27.84 35.90 31.94
L=3 17.94 21.44 26.17 33.33 30.91
L=4 17.80 21.39 25.91 33.20 31.48
Table 5: Performance with different numbers of levels $L$ in the pyramid pooling layer of IADM.
Method GAME(0) GAME(1) GAME(2) GAME(3) RMSE
UCNet [zhang2020uc] 10.81 15.24 22.04 32.98 15.70
HDFNet [HDFNet-ECCV2020] 8.32 13.93 17.97 22.62 13.01
BBSNet [fan2020bbsnet] 6.26 8.53 11.80 16.46 9.26
DetNet [liu2018decidenet] 9.74 - - - 13.14
CL [idrees2018composition] 7.32 - - - 10.48
RDNet [lian2019density] 4.96 - - - 7.22
  MCNN 11.12 14.53 18.68 24.49 16.49
  MCNN + IADM 9.61 11.89 15.44 20.69 14.52
        BL 8.94 11.57 15.68 22.49 12.49
        BL + IADM 7.13 9.28 13.00 19.53 10.27
  SANet 5.74 7.84 10.47 14.30 8.66
  SANet + IADM 4.71 6.49 9.02 12.41 7.35
CSRNet 4.92 6.78 9.47 13.06 7.41
CSRNet + IADM 4.38 5.95 8.02 11.02 7.06
Table 6: Performance of different methods on the ShanghaiTechRGBD benchmark. All the methods in this table utilize both RGB images and depth images to estimate the crowd counts.

3) The Effect of the Level Number of the Pyramid Pooling Layer: In the proposed IADM, an $L$-level pyramid pooling layer is utilized to extract contextual information. In this section, we explore the effect of the number of levels. As shown in Table 5, when $L$ is set to 1, the GAME(3) and RMSE are 35.88 and 33.01, respectively. As the level number increases, the performance gradually improves, and we achieve very competitive results when the pyramid pooling layer has three levels; using more than three levels brings no consistent additional gains. Therefore, $L$ is set to 3 throughout this work.

5.3 Comparison with the State of the Art

We compare the proposed method with state-of-the-art methods on the large-scale RGBT-CC benchmark. The compared methods can be divided into two categories. The first category comprises models specially designed for crowd counting, including MCNN [zhang2016single], SANet [cao2018scale], CSRNet [li2018csrnet], and BL [ma2019bayesian]. These methods are reimplemented to take the concatenation of RGB and thermal images as input in an "Early Fusion" manner. The second category consists of several best-performing models for multimodal learning, including UCNet [zhang2020uc], HDFNet [HDFNet-ECCV2020], and BBSNet [fan2020bbsnet]. Based on their official code, these methods are reimplemented to estimate crowd counts on our RGBT-CC dataset. As mentioned above, our IADM can be incorporated into various networks, so here we take CSRNet, MCNN, SANet, and BL as backbones to develop multiple instances of our framework.

The performance of all compared methods is summarized in Table 4. As can be observed, all instances of our method consistently outperform the corresponding backbone networks. For instance, both MCNN+IADM and SANet+IADM have a relative performance improvement of 18.9% on RMSE compared with their "Early Fusion" models. Moreover, our CSRNet+IADM and BL+IADM achieve better performance on all evaluation metrics compared with other advanced methods (i.e., UCNet, HDFNet, and BBSNet). This is because our method learns specific-shared representations explicitly and enhances them mutually, while the others simply fuse multimodal features without mutual enhancement. Thus our method can better capture the complementarities of RGB images and thermal images. This comparison demonstrates the effectiveness of our method for RGBT crowd counting.

5.4 Apply to RGBD Crowd Counting

We apply the proposed method to estimate crowd counts from RGB images and depth images. In this subsection, we also take various crowd counting models as backbones to develop our framework on the ShanghaiTechRGBD [lian2019density] benchmark. The implementation details of the compared methods are similar to those in the previous subsection. As shown in Table 6, all instances of our framework are superior to their corresponding backbone networks by obvious margins. Moreover, our SANet+IADM and CSRNet+IADM outperform three advanced models (i.e., UCNet, HDFNet, and BBSNet) on all evaluation metrics. More importantly, our CSRNet+IADM achieves the lowest GAME(0) of 4.38 and RMSE of 7.06, becoming the new state-of-the-art method on the ShanghaiTechRGBD benchmark. This experiment shows that our approach is universal and effective for RGBD crowd counting.

6 Conclusion

In this work, we propose to incorporate optical and thermal information to estimate crowd counts in unconstrained scenarios. To this end, we introduce the first RGBT crowd counting benchmark with 2,030 pairs of RGB-thermal images and 138,389 annotated people. Moreover, we develop a cross-modal collaborative representation learning framework, which utilizes a tailor-designed Information Aggregation-Distribution Module to fully capture the complementary information of different modalities. Extensive experiments on two real-world benchmarks show the effectiveness and universality of the proposed method for multimodal (e.g., RGBT and RGBD) crowd counting.

Backbone Input Feature Learning GAME(0) GAME(1) GAME(2) GAME(3) RMSE
MCNN [zhang2016single] RGB - 36.83 43.12 49.85 58.60 71.16
T - 22.92 26.65 31.33 37.58 38.92
RGBT Early Fusion 21.89 25.70 30.22 37.19 37.44
IADM 19.77 23.80 28.58 35.11 30.34
SANet [cao2018scale] RGB - 35.97 41.45 46.75 54.89 70.52
T - 22.89 25.83 29.48 36.02 42.33
RGBT Early Fusion 21.99 24.76 28.52 34.25 41.60
IADM 18.18 21.84 26.27 32.95 33.72
CSRNet [li2018csrnet] RGB - 33.94 40.76 47.31 57.20 69.59
T - 21.64 26.22 31.65 38.66 37.38
RGBT Early Fusion 20.40 23.58 28.03 35.51 35.26
IADM 17.94 21.44 26.17 33.33 30.91
BL [ma2019bayesian] RGB - 33.32 39.19 44.58 54.11 67.50
T - 19.93 23.31 27.32 34.64 34.08
RGBT Early Fusion 18.70 22.55 26.83 34.62 32.67
IADM 15.61 19.95 24.69 32.89 28.18
Table 7: Performance of unimodal data and multimodal data on the RGBT-CC benchmark.
Backbone Input Feature Learning GAME(0) GAME(1) GAME(2) GAME(3) RMSE
MCNN [zhang2016single] RGB - 10.76 13.81 19.02 25.15 14.66
D - 28.36 42.95 53.41 64.92 38.74
RGBD Early Fusion 11.12 14.53 18.68 24.49 16.49
IADM 9.61 11.89 15.44 20.69 14.52
BL [ma2019bayesian] RGB - 8.83 11.67 15.85 22.85 12.96
D - 26.19 30.04 34.58 41.56 37.23
RGBD Early Fusion 8.94 11.57 15.68 22.49 12.49
IADM 7.13 9.28 13.00 19.53 10.27
SANet [cao2018scale] RGB - 6.89 8.79 11.89 16.48 9.98
D - 25.62 30.68 37.03 44.31 35.94
RGBD Early Fusion 5.74 7.84 10.47 14.30 8.66
IADM 4.71 6.49 9.02 12.41 7.35
CSRNet [li2018csrnet] RGB - 4.96 7.09 9.97 13.55 7.44
D - 28.53 55.46 67.99 76.41 39.06
RGBD Early Fusion 4.92 6.78 9.47 13.06 7.41
IADM 4.38 5.95 8.02 11.02 7.06
Table 8: Performance of unimodal data and multimodal data on the ShanghaiTechRGBD benchmark.
Figure 6: Visualization of the Conv3_3 features before and after IADM. The first row is the features of the input RGB images, while the third row is the features of the input thermal images. The middle row shows the shared features. We can observe that all modality-specific and modality-shared representations have been enhanced after the proposed IADM. (Best viewed in color.)
Figure 7: Visualization of the Conv4_3 features before and after IADM. The first row is the features of the input RGB images, while the third row is the features of the input thermal images. The middle row shows the shared features. We can observe that all modality-specific and modality-shared representations have been enhanced after the proposed IADM. (Best viewed in color.)

Appendix A More Results of Unimodal Data

In the main text, we have reported the results of different methods that take multimodal data as input. In this supplementary file, we also report the unimodal performance of different backbone networks (e.g., MCNN [zhang2016single], SANet [cao2018scale], CSRNet [li2018csrnet] and BL [ma2019bayesian]).

As shown in Table 7, when only taking RGB images as input, these backbone networks perform poorly on the proposed RGBT-CC benchmark, because they fail to recognize people in poor illumination conditions, such as backlight and night. The performance is greatly improved when using thermal images. Nevertheless, both the RGB results and thermal results are worse than multimodal results. In particular, all backbone networks achieve the best performance when capturing the RGB-thermal complementarities with the proposed Information Aggregation-Distribution Module (IADM). Moreover, we also perform unimodal experiments on the ShanghaiTechRGBD dataset [lian2019density]. As shown in Table 8, the unimodal results of all backbone networks are consistently worse than their multimodal results. These experiments demonstrate the effectiveness of multimodal data for crowd counting.

Appendix B Representation Visualization

In this supplementary file, we also visualize and compare the generated features before and after applying the proposed IADM. Here we take BL [ma2019bayesian] as the backbone network. As shown in Fig. 6 and Fig. 7, after applying IADM, both modality-specific and modality-shared representations have been enhanced in various illumination conditions. This demonstrates that our method can indeed capture the complementary information of multimodal data effectively to facilitate the task of crowd counting.

References