Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting

03/12/2022
by   Fusen Wang, et al.

Fully supervised methods based on density map estimation are currently the mainstream research direction in crowd counting. However, such methods need location-level annotation of every person in an image, which is time-consuming and laborious. Weakly supervised methods that rely only on count-level annotation are therefore urgently needed. Since CNNs are not well suited to modeling the global context and the interactions between image patches, weakly supervised crowd counting with CNNs generally does not perform well. Weakly supervised models based on the Transformer were subsequently proposed to model the global context and learn contrast features. However, the Transformer directly partitions the crowd image into a series of tokens, which may not be a good choice because each pedestrian is an independent individual, and the network has a very large number of parameters. Hence, we propose a Joint CNN and Transformer Network (JCTNet) via weakly supervised learning for crowd counting in this paper. JCTNet consists of three parts: a CNN feature extraction module (CFM), a Transformer feature extraction module (TFM), and a counting regression module (CRM). In particular, the CFM extracts crowd semantic information features and sends their patch partitions to the TFM for modeling global context, and the CRM predicts the number of people. Extensive experiments and visualizations demonstrate that JCTNet can effectively focus on the crowd regions and obtain superior weakly supervised counting performance on five mainstream datasets. The number of parameters of the model can be reduced by about 67 works. We also try to explain the phenomenon that a model constrained only by count-level annotations can still focus on the crowd regions. We believe our work can promote further research in this field.
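The three-stage data flow described in the abstract (CFM, then patch tokens, then TFM, then CRM, trained against a count-level label only) can be illustrated with a toy sketch. This is not the authors' implementation: every function below is a hypothetical stand-in that uses only the Python standard library. Average pooling replaces the learned CNN, a single parameter-free self-attention step replaces the Transformer, and a fixed scalar linear map replaces the regression head. The stride, patch size, weights, and the L1 form of the count loss are all assumptions chosen for illustration.

```python
import math

def cfm(image, stride=2):
    """Stand-in for the CNN feature extraction module (assumption: average
    pooling instead of a learned CNN backbone)."""
    h, w = len(image), len(image[0])
    return [[sum(image[i + di][j + dj]
                 for di in range(stride) for dj in range(stride)) / stride ** 2
             for j in range(0, w - stride + 1, stride)]
            for i in range(0, h - stride + 1, stride)]

def to_tokens(fmap, patch=2):
    """Partition the feature map into non-overlapping patch x patch blocks
    and flatten each block into one token vector."""
    h, w = len(fmap), len(fmap[0])
    return [[fmap[i + di][j + dj] for di in range(patch) for dj in range(patch)]
            for i in range(0, h - patch + 1, patch)
            for j in range(0, w - patch + 1, patch)]

def tfm(tokens):
    """Stand-in for the Transformer module: one self-attention step with
    Q = K = V = tokens (no learned projections) plus a residual connection,
    so every token mixes in context from all other tokens."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)                      # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]         # softmax over all tokens
        ctx = [sum(a * v[c] for a, v in zip(attn, tokens)) for c in range(d)]
        out.append([x + y for x, y in zip(q, ctx)])
    return out

def crm(tokens, weight=1.0, bias=0.0):
    """Stand-in for the counting regression module: mean-pool all token
    values, then apply a scalar linear map (hypothetical fixed weights)."""
    d = len(tokens[0])
    pooled = sum(sum(t) for t in tokens) / (len(tokens) * d)
    return weight * pooled + bias

def count_l1_loss(pred_count, gt_count):
    """Count-level supervision: only the total count is compared, no
    per-person locations are needed (L1 form is an assumption)."""
    return abs(pred_count - gt_count)

def predict_count(image):
    """Full toy pipeline: CFM -> patch tokens -> TFM -> CRM."""
    return crm(tfm(to_tokens(cfm(image))))
```

Calling `predict_count` on a small 2D list of floats returns a single scalar, and `count_l1_loss` compares that scalar with the annotated count. The point of the sketch is the ordering: tokens are cut from CNN features rather than from raw pixels, which is the design choice the abstract contrasts with a pure Transformer.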


research
04/19/2021

TransCrowd: Weakly-Supervised Crowd Counting with Transformer

The mainstream crowd counting methods usually utilize the convolution ne...
research
02/22/2022

Reinforcing Local Feature Representation for Weakly-Supervised Dense Crowd Counting

Fully-supervised crowd counting is a laborious task due to the large amo...
research
05/23/2021

Boosting Crowd Counting with Transformers

Significant progress on the crowd counting problem has been achieved by ...
research
05/29/2022

Glance to Count: Learning to Rank with Anchors for Weakly-supervised Crowd Counting

Crowd image is arguably one of the most laborious data to annotate. In t...
research
08/02/2021

Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer

Crowd localization is a new computer vision task, evolved from crowd cou...
research
06/21/2022

Counting Varying Density Crowds Through Density Guided Adaptive Selection CNN and Transformer Estimation

In real-world crowd counting applications, the crowd densities in an ima...
research
09/04/2021

Audio-Visual Transformer Based Crowd Counting

Crowd estimation is a very challenging problem. The most recent study tr...
