The task of crowd counting in computer vision is to automatically count the number of people in images or videos. With the rapid growth of the world's population, crowd gatherings have become more frequent than ever. Accurate crowd counting is therefore in demand to help with crowd control and public safety.
Early methods count crowds via the detection of individuals [53, 2, 38]. They suffer from heavy occlusions in dense crowds. More importantly, learning such people detectors normally requires bounding box or instance mask annotations for individuals, which makes them undesirable in large-scale applications. Modern methods mainly conduct crowd counting via density estimation [36, 64, 48, 41, 30, 26, 23, 58]. Counting is realized by estimating a density map of an image, whose integral over the image gives the total people count. Given a training image, its density map is obtained via Gaussian blurring at every head center; head centers are thus the required annotations for training. Thanks to powerful deep neural networks (DNNs), density estimation based methods have shown great success in recent years [64, 43, 23, 39, 46, 58, 47].
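As an illustration of how a ground truth density map relates to the people count, the following is a minimal sketch in plain Python. The function names, the fixed kernel width `sigma` and the border renormalisation are assumptions made for this example, not the settings used in the paper.

```python
import math


def gaussian_density_map(height, width, heads, sigma=2.0):
    """Place one unit-mass Gaussian at each annotated head center (row, col).

    Assumes every head center lies inside the image.
    """
    density = [[0.0] * width for _ in range(height)]
    radius = int(3 * sigma)
    for r0, c0 in heads:
        # Kernel weights restricted to the pixels that fall inside the image.
        weights = {}
        for r in range(max(0, r0 - radius), min(height, r0 + radius + 1)):
            for c in range(max(0, c0 - radius), min(width, c0 + radius + 1)):
                d2 = (r - r0) ** 2 + (c - c0) ** 2
                weights[(r, c)] = math.exp(-d2 / (2.0 * sigma ** 2))
        total = sum(weights.values())
        # Renormalise so that each head contributes exactly 1 to the integral.
        for (r, c), w in weights.items():
            density[r][c] += w / total
    return density


def count_from_density(density):
    """The integral (sum) of the density map gives the people count."""
    return sum(sum(row) for row in density)


heads = [(5, 5), (0, 0), (10, 20)]
density = gaussian_density_map(16, 32, heads)
print(round(count_from_density(density), 6))
```

Because each kernel is normalised to unit mass, the sum over the map equals the number of annotated heads, which is what lets a network trained to regress the map be used as a counter.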
Despite this progress, annotating head centers in dense crowds is still a laborious and tedious process. For instance, it can take up to 10 minutes for our annotators to annotate a single image with 500 persons, while the popular counting dataset ShanghaiTech PartA has 300 training images with an average of 501 persons per image. To substantially reduce the annotation cost, we study crowd density estimation in a semi-supervised setting where only a handful of images are labeled while the rest are unlabeled. This setting has not been widely explored in crowd counting: [4, 65] propose to actively annotate the most informative video frames for semi-supervised crowd counting, yet the algorithms are not deep learning based and rely on frame consecutiveness. Recently, some deep learning works propose to leverage additional web data [28, 27] or synthetic data for crowd counting; the images in the existing dataset, or at least many of them, are still assumed to be annotated. Model transferability is also evaluated in some works [11, 58], where a network is trained on a source dataset with full annotations and tested on a target dataset with no or few annotations.
Given an existing dataset and a powerful DNN, we find that 1) when learning from only a small subset, the performance can vary a lot depending on the subset selection; 2) for a specific subset that covers diverse crowd densities, the performance can be quite good (see results in Sec. 4.2). This motivates us to study crowd counting with very limited annotations that still produces very competitive precision. To achieve this goal, we propose an Active Learning framework for Accurate crowd Counting (AL-AC) as illustrated in Fig. 1: given a labeling budget, instead of randomly selecting images to annotate, we first introduce an active labeling strategy to iteratively annotate the most informative images in the dataset and learn the counting model on them. In each cycle we select samples that cover different crowd densities and are also dissimilar to previous selections. Eventually, the large amount of unlabeled data is also included in the network training: we design a classifier with a gradient reversal layer to align the intrinsic distributions of labeled and unlabeled data. Since all training samples contain the same object class, e.g. person, we propose to further align distributions in-between training samples by mixing up the latent representations and distribution labels among labeled and unlabeled data in the network. With very limited labeled data, our model produces very competitive counting results.
To summarize, several new elements are offered:
We introduce an active learning framework for accurate crowd counting with limited supervision.
We propose a partition-based sample selection with weights (PSSW) strategy to actively select and annotate both diverse and dissimilar samples for network training.
We design a distribution alignment branch with latent MixUp to align the distribution between the labeled data and the large amount of unlabeled data in the network.
Extensive experiments are conducted on standard counting benchmarks, i.e. ShanghaiTech, UCF_CC_50, Mall, TRANCOS, and DCC. Results demonstrate that, with a small number of labeled data, our AL-AC reaches levels of performance not far from state-of-the-art fully-supervised methods.
2 Related works
In this section, we mainly survey deep learning based crowd counting methods and discuss semi-supervised learning and active learning in crowd counting.
2.1 Crowd counting
The prevailing crowd counting solution is to estimate a density map of a crowd image, whose integral gives the total person count of that image. A density map encodes the spatial information of an image; regressing it in a DNN has been demonstrated to be more robust than simply regressing a global crowd count [62, 30]. Due to the heavy occlusions and perspective distortions that commonly occur in crowd images, multi-scale or multi-resolution architectures are often exploited in DNNs: Ranjan et al. propose an iterative crowd counting network which produces a low-resolution density map and uses it to generate a high-resolution density map. Cao et al. propose a novel encoder-decoder network, where the encoder extracts multi-scale features with scale aggregation modules and the decoder generates high-resolution density maps by using a set of transposed convolutions. Furthermore, Jiang et al. develop a trellis encoder-decoder network that incorporates multiple decoding paths to hierarchically aggregate features at different encoding stages. In order to better utilize multi-scale features, the attention [26, 47], context [48, 25], or perspective [46, 59] information in crowd images is often incorporated into the network.
Apart from density estimation based methods, some other variants in recent trends try to give the individual location or size information in crowd counting [29, 14, 20, 24]. In order to achieve a good counting accuracy, they often integrate themselves into the density estimation pipeline. Our work is a density estimation based approach.
2.2 Semi-supervised learning
Semi-supervised learning refers to learning with a small amount of labeled data and a large amount of unlabeled data, and has been a popular paradigm in deep learning [56, 40, 19, 61]. It is traditionally studied for classification, where a label represents a class per image [21, 10, 40, 19]. In this work, we focus on semi-supervised learning in crowd counting, where the label of an image means the people count, with individual head points available in most cases. The common semi-supervised crowd counting solution is to leverage both labeled and unlabeled data in the learning procedure: Tan et al. propose a semi-supervised elastic net regression method by utilizing sequential information between unlabeled samples and their temporally neighboring samples as a regularization term; Loy et al. further improve it by utilizing both spatial and temporal regularization in a semi-supervised kernel ridge regression problem; finally, graph Laplacian regularization and spatiotemporal constraints have also been incorporated into the semi-supervised regression. None of these are deep learning works, and they all rely on temporal information among video frames.
Another line of work introduces an almost unsupervised learning method in which only a tiny proportion of the model parameters is trained with labeled data while the vast majority is trained with unlabeled data. Liu et al. [28, 27] propose to learn from unlabeled crowd data via a self-supervised ranking loss in the network. In [28, 27], they mainly assume the existence of a labeled dataset and add extra data from the web; in contrast, our AL-AC seeks a solution for accurate crowd counting with limited labeled data. Our method is also similar to [34, 35] in the spirit of distribution alignment between labeled and unlabeled data. However, [34, 35] need to generate fake images to learn the discriminator in a GAN, which makes the model hard to learn and converge. Our AL-AC instead mixes the representations of labeled and unlabeled data in the network and learns the discriminator against them.
2.3 Active learning
Active learning defines a strategy for determining the data samples that, when added to the training set, improve a previously trained model most effectively. Although it is not possible to obtain a universally good active learning strategy, there exist many heuristics that have proved effective in practice. Active learning has been explored in many applications such as image classification [49, 17] and object detection, while in this paper we focus on crowd counting. Methods in this context normally assume the availability of the whole counting set and choose samples from it, which is the so-called pool-based active learning. Prior works employ a graph-based approach to build an adjacency matrix of all crowd images in the pool; sample selection is therefore cast as a matrix partitioning problem. Our work is also pool-based active learning.
Lately, Liu et al. apply active learning in DNNs, where they measure the informativeness of unlabeled samples via mistakes made by the network on a self-supervised proxy task. The method is conducted iteratively and in each cycle it selects a group of images based on their uncertainties to the model. The diversity of the selected images is however not carefully taken care of in their uncertainty measure, which might result in a biased selection within some specific count range. Our work instead interprets uncertainty from two perspectives: selected samples are diverse in crowd density and dissimilar to previous selections in each learning cycle. It should also be noted that their method mainly focuses on adding extra unlabeled data to an existing labeled dataset, while our AL-AC seeks the limited data to be labeled within a given dataset.
We follow crowd density estimation in the deep learning context, where density maps are regressed pixel-wise in a DNN [64, 23]. A ground truth density map is generated by convolving Gaussian kernels at head centers in an image. The network is optimized through a loss function minimizing the prediction error over the ground truth. In this paper, we place our problem in a semi-supervised setting where we only label several or a few dozen images while the large remaining amount stays unlabeled. Both the labeled and unlabeled data will be exploited in model learning. Below, we introduce our active learning framework for accurate crowd counting (AL-AC).
Our algorithm follows an active learning pipeline in general. It is an iterative process where a model is learnt in each cycle and a set of samples is chosen to be labeled from a pool of unlabeled samples. In the classic setting, only one single sample is chosen in each cycle. This is however not feasible for DNNs: many practical problems of interest are very large-scale, so one cannot train as many models as there are samples. Hence, the commonly used strategy is batch mode selection [54, 27], where a subset is selected and labeled in each cycle. This subset is added to the labeled set to update the model, and the selection is repeated in the next cycle. The procedure continues until a predefined criterion is met, e.g. a fixed budget.
Our method is illustrated in Fig. 2: given a dataset with a labeling budget M (number of images, as in [42, 27]), we start by labeling m samples uniformly at random from the dataset. For each labeled sample, we generate its count label and density map based on the annotated head points. These samples form the labeled set in cycle 1, and the remaining images form the unlabeled set. A DNN regressor is trained on the labeled set for crowd density estimation. Based on the regressor's estimated density maps on the unlabeled set, we apply our partition-based sample selection with weights strategy to select and annotate m further samples from the unlabeled set. These samples are added to the labeled set, which gives the updated labeled and unlabeled sets for cycle 2. The model is further trained on the updated labeled set. Its prediction is better than that of the previous model, as it uses more labeled data; we use the new prediction on the unlabeled set to again select m samples and add them to the labeled set. The process continues until the labeling budget M is met. The unlabeled set is also employed in network training through our proposed distribution alignment with latent MixUp. We only use the unlabeled set in the last learning cycle, as we observe that adding it in every cycle does not bring cumulative benefits but rather additional training cost.
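The batch-mode cycle described above can be sketched as a short control loop. This is only an illustrative outline: `train` and `select` are hypothetical callables standing in for the density regressor training and the sample selection strategy, and the argument names are assumptions of this sketch.

```python
import random


def active_learning_loop(pool, budget_M, per_cycle_m, train, select, seed=0):
    """Batch-mode, pool-based active learning under a fixed labeling budget.

    train(model, labeled, unlabeled_or_None) -> model   (hypothetical callable)
    select(model, labeled, unlabeled, k) -> k items to annotate (hypothetical)
    """
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)
    # Cycle 1 starts from samples labeled uniformly at random.
    labeled = [unlabeled.pop() for _ in range(min(per_cycle_m, budget_M))]
    model, cycles = None, 0
    while True:
        last = len(labeled) >= budget_M or not unlabeled
        # The unlabeled set only takes part in training in the last cycle.
        model = train(model, labeled, unlabeled if last else None)
        cycles += 1
        if last:
            return model, labeled, cycles
        k = min(per_cycle_m, budget_M - len(labeled))
        for item in select(model, labeled, unlabeled, k):
            unlabeled.remove(item)
            labeled.append(item)


if __name__ == "__main__":
    dummy_train = lambda model, labeled, unlabeled: "model"
    dummy_select = lambda model, labeled, unlabeled, k: unlabeled[:k]
    _, chosen, n_cycles = active_learning_loop(range(300), 30, 10, dummy_train, dummy_select)
    print(len(chosen), n_cycles)
```

With M = 30 and m = 10 the loop trains three times, on 10, 20 and 30 labeled images, and only the third training call receives the unlabeled set.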
The backbone network is not specified in Fig. 2 as it can be any standard backbone. We will detail our selection of backbone, M, and m in Sec. 4. Below we introduce our partition-based sample selection with weights and distribution alignment with latent MixUp. The overall loss function is given at the end.
3.3 Partition-based sample selection with weights (PSSW)
In each learning cycle, we want to annotate the most informative/uncertain samples and add them to the network. The informativeness/uncertainty of samples is evaluated from two perspectives: diverse in density and dissimilar to previous selections. It is observed that crowd data often forms a well structured manifold where different crowd densities normally distribute smoothly within the manifold space; diversity means selecting crowd samples that cover different crowd densities in the manifold. This is realized by separating the unlabeled set into different density partitions for diverse selection. Within each partition, we want to select those samples that are dissimilar to previously labeled samples, i.e. samples the model has not seen. The dissimilarity is measured considering both local crowd density and global crowd count: we introduce a grid-based dissimilarity measure (GDSIM) for this purpose. Below, we formulate our partition-based sample selection with weights.
Formally, given the current model, the unlabeled set and the labeled set in a cycle, we obtain the crowd count predicted by the model for every unlabeled image. The histogram of all predicted counts on the unlabeled set discloses the overall density distribution. For the sake of diversity, we want to partition the histogram into m parts and select one sample from each. Since the crowd counts are not evenly distributed (see Fig. 3: Left), sampling images evenly from the histogram can end up with a biased view of the original distribution. We therefore employ the Jenks natural breaks optimization to partition the histogram. Jenks minimizes the variation within each range, so the partitions between ranges reflect the natural breaks of the histogram (Fig. 3).
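To make the partitioning step concrete, here is a small sketch that splits predicted counts into k ranges by minimizing the within-range sum of squared deviations, which is the objective behind Jenks natural breaks. It uses an exact dynamic program (Fisher-style) rather than the iterative Jenks procedure, and the function name `jenks_partition` is an assumption of this example; it is not the paper's implementation.

```python
def jenks_partition(values, k):
    """Split 1-D values into k contiguous ranges with minimal within-range variation."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")
    # Prefix sums of x and x^2 give the squared deviation of any range in O(1).
    p1, p2 = [0.0], [0.0]
    for x in xs:
        p1.append(p1[-1] + x)
        p2.append(p2[-1] + x * x)

    def cost(i, j):
        # Sum of squared deviations from the mean for xs[i:j].
        s, q = p1[j] - p1[i], p2[j] - p2[i]
        return q - s * s / (j - i)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = best[c - 1][i] + cost(i, j)
                if cand < best[c][j]:
                    best[c][j], cut[c][j] = cand, i
    # Walk the stored cut points backwards to recover the ranges.
    parts, j = [], n
    for c in range(k, 0, -1):
        i = cut[c][j]
        parts.append(xs[i:j])
        j = i
    return parts[::-1]


print(jenks_partition([12, 1, 50, 2, 11, 3, 51, 10], 3))
```

Unlike equal-width bins, the ranges follow the gaps in the data, so sparse high-count images end up in their own partition instead of being merged into a wide bin.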
Within each partition, inspired by the grid average mean absolute error (GAME), we propose a grid-based dissimilarity from an unlabeled sample to the labeled samples. GAME was originally introduced as an evaluation measure for density estimation (formula (1)): for a given image, it compares the estimated count in each region of the image with the corresponding ground truth count, where the estimated count is obtained by integrating the predicted density over that region. Given a specific level L, GAME subdivides the image using a grid of 4^L non-overlapping regions which cover the full image (Fig. 3); the difference between the prediction and ground truth is the sum of the mean absolute errors (MAE) in each of these regions. With different L, GAME offers graded ways to compute the dissimilarity between two density maps, taking care of both global counts and local details. Building on GAME, we introduce the grid-based dissimilarity measure GDSIM (formula (2)).
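For reference, formulas (1) and (2) can be sketched as follows, consistent with the description in this section. The notation is assumed for this sketch rather than taken from the original: $\hat{c}$ denotes a count integrated from a predicted density map, $c$ a reference count, $v$ an unlabeled image, $u$ a labeled image in the same partition $\mathcal{P}_k$, and $(j, l)$ the $l$-th region at grid level $j$. Whether the per-level terms carry an extra normalization is not recoverable from the text, so plain sums are used.

```latex
% Formula (1), sketch: GAME for one image at level L
\mathrm{GAME}(L) \;=\; \sum_{l=1}^{4^{L}} \left| \hat{c}^{\,L,l} - c^{\,L,l} \right|

% Formula (2), sketch: GDSIM of an unlabeled image v w.r.t. the labeled images of its partition
\mathrm{GDSIM}(v, L) \;=\; \min_{u \in \mathcal{P}_k} \; \sum_{j=0}^{L} \sum_{l=1}^{4^{j}} \left| \hat{c}_{v}^{\,j,l} - c_{u}^{\,j,l} \right|
```

At $j = 0$ the grid has a single region, so that term is the global count difference; higher levels add progressively more local comparisons.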
In formula (2), the two images being compared come from the unlabeled set and the labeled set, respectively, and both fall into the same partition. The region counts are defined as in formula (1), but for the two different images (see Fig. 3: Right). Given the level L, unlike GAME, we compute the dissimilarity between the two images by traversing all levels from 0 to L (Fig. 3). In this way, the dissimilarity is computed based on both global count (level 0) and local density (higher levels) differences. Afterwards, instead of averaging the dissimilarity scores from the unlabeled image to all the labeled images in the partition, we take the minimum: if the unlabeled image is close to any one of the labeled images, it is regarded as a familiar sample to the model. Ideally, we should choose the most dissimilar sample from each partition; nevertheless, the crowd count in formula (2) is not ground truth. We therefore convert the GDSIM scores to probabilities and adopt weighted random selection to label one sample from each partition.
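The selection inside one partition can be sketched as below. Density maps are plain nested lists, and the function names, the integer grid boundaries and the proportional conversion of scores to probabilities are assumptions of this example; the sketch also assumes the partition already contains at least one labeled image.

```python
import random


def grid_counts(density, level):
    """Sum the density inside each cell of a 2^level x 2^level grid (4^level cells)."""
    h, w = len(density), len(density[0])
    g = 2 ** level
    counts = []
    for gr in range(g):
        r0, r1 = gr * h // g, (gr + 1) * h // g
        for gc in range(g):
            c0, c1 = gc * w // g, (gc + 1) * w // g
            counts.append(sum(sum(row[c0:c1]) for row in density[r0:r1]))
    return counts


def pairwise_dissimilarity(dens_a, dens_b, max_level):
    """Accumulate region-wise count differences over levels 0..max_level."""
    total = 0.0
    for level in range(max_level + 1):
        ca, cb = grid_counts(dens_a, level), grid_counts(dens_b, level)
        total += sum(abs(x - y) for x, y in zip(ca, cb))
    return total


def gdsim(unlabeled_density, labeled_densities, max_level=3):
    """Distance to the closest labeled image of the same partition."""
    return min(pairwise_dissimilarity(unlabeled_density, d, max_level)
               for d in labeled_densities)


def pick_one(candidates, scores, rng=random):
    """Weighted random selection: probability proportional to the GDSIM score."""
    if sum(scores) <= 0:
        return rng.choice(candidates)  # all candidates look familiar: pick uniformly
    return rng.choices(candidates, weights=scores, k=1)[0]


a = [[1.0, 0.0], [0.0, 0.0]]
b = [[0.0, 0.0], [0.0, 1.0]]
print(pairwise_dissimilarity(a, b, 0), pairwise_dissimilarity(a, b, 1))
```

The two toy maps have the same global count, so they are indistinguishable at level 0 and only differ once the grid is refined. Sampling by weight rather than taking the maximum keeps some robustness against errors in the predicted counts.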
3.4 Distribution alignment with latent MixUp
Since labeled data only represents part of the crowd manifold, particularly when it is limited, distribution alignment with the large amount of unlabeled data becomes necessary even within the same domain. In order for the model to learn a proper subspace representation of the entire set, we introduce distribution alignment with latent MixUp.
We assign labeled data the distribution label 0 and unlabeled data the label 1. A distribution classifier branched off from the deep extractor (see Fig. 2) is designed: it is composed of a gradient reversal layer (GRL), a 1×1 convolution layer and a global average pooling (GAP) layer. The GRL multiplies the gradient by a certain negative constant (-1 in this paper) during network back propagation; it enforces that the feature distributions over the labeled and unlabeled data are made as indistinguishable as possible for the distribution classifier, thus aligning them together.
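The behaviour of the gradient reversal layer can be summarized in two lines. The symbols below ($R$ for the layer, $\lambda$ for the magnitude of the negative constant, $I$ for the identity) are notation assumed for this sketch.

```latex
% Forward pass: the GRL acts as the identity on the features x
R(x) = x

% Backward pass: the incoming gradient is multiplied by a negative constant
\frac{\partial R}{\partial x} = -\lambda I, \qquad \lambda = 1 \text{ in this paper}
```

The classifier after the GRL is thus trained to separate labeled from unlabeled features, while the extractor before it receives the opposite gradient and is pushed to make them inseparable.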
The hard distribution labels create hard boundaries between labeled and unlabeled data. To further merge the distributions and particularly align in-between training samples, we adapt an idea from MixUp. MixUp normally trains a model on random convex combinations of raw inputs and their corresponding labels. It encourages the model to behave linearly "between" training samples, as this linear behavior reduces the amount of undesirable oscillations when predicting outside the training samples. It has been widely employed in several semi-supervised classification works [1, 51, 52, 63]. In this work, we integrate it into our distribution alignment branch for semi-supervised crowd counting. We find that mixing raw input images does not work for our problem. Instead we propose to mix their latent representations in the network: suppose we have two images and their respective distribution labels. The latent representations of the two images are produced by the deep extractor as two tensors from the last convolutional layer of the backbone. We mix up the two (representation, label) pairs with a mixing weight, which yields a mixed latent representation and a mixed label. The mixing weight is generated in the same way as in prior MixUp-based work, with its hyper-parameter set to 0.5. Both labeled and unlabeled data can be mixed. For two samples with the same label, their mixed label remains unchanged. We balance the number of labeled and unlabeled data with data augmentation (see Sec. 4.1), so a mixed pair can be composed of labeled or unlabeled data with (almost) the same probability. MixUp enriches the distribution in-between training samples. Together with GRL, it allows the network to elaborately knit the distributions of labeled and unlabeled data. The alignment is only carried out in the last active learning cycle as an efficient practice. The network training proceeds with a multi-task optimization that minimizes the density regression loss on labeled data and the distribution classification loss for all data including mixed ones, as specified below.
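A minimal sketch of the mixing step is given below, with latent representations flattened to plain lists. It assumes the mixing weight is drawn from a Beta(alpha, alpha) distribution as in the original MixUp formulation, with alpha = 0.5; keeping the weight at or above 0.5 via `max` is a further assumption borrowed from MixMatch-style implementations, and the function names are hypothetical.

```python
import random


def sample_mix_weight(alpha=0.5, rng=random):
    """Draw a mixing weight from Beta(alpha, alpha); keep it >= 0.5 (assumed)."""
    lam = rng.betavariate(alpha, alpha)
    return max(lam, 1.0 - lam)


def latent_mixup(h1, y1, h2, y2, lam):
    """Convex combination of two latent representations and their distribution labels."""
    mixed_h = [lam * a + (1.0 - lam) * b for a, b in zip(h1, h2)]
    mixed_y = lam * y1 + (1.0 - lam) * y2
    return mixed_h, mixed_y


# A labeled sample (label 0) mixed with an unlabeled one (label 1).
h, y = latent_mixup([1.0, 2.0], 0.0, [3.0, 6.0], 1.0, lam=0.75)
print(h, y)
```

When both samples carry the same distribution label, the mixed label equals that label, which matches the statement above; only labeled/unlabeled pairs produce soft labels strictly between 0 and 1.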
3.5 Loss function
For the density regression task, we adopt the commonly used pixel-wise MSE loss between the density map prediction and the ground truth of each image, accumulated over the labeled images. For the distribution classification task, since distribution labels for mixed samples can be non-integers, we adopt the binary cross entropy with logits loss, which combines a Sigmoid layer with the binary cross entropy loss. Given an image pair, the classification loss is computed on each individual image as well as on their mixed representation (see Fig. 2). The overall multi-task loss function combines the regression loss and the classification loss through a loss weight.
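The two loss terms can be sketched in plain Python as follows. The normalization of the regression term (here a simple average over images), the averaging of the classification terms, and applying the loss weight to the classification term are all assumptions of this sketch; the default weight of 3 follows the value reported in Sec. 4.1.

```python
import math


def pixelwise_mse(pred_maps, gt_maps):
    """Squared pixel differences summed per image, averaged over labeled images."""
    total = 0.0
    for pred, gt in zip(pred_maps, gt_maps):
        total += sum((p - g) ** 2
                     for pred_row, gt_row in zip(pred, gt)
                     for p, g in zip(pred_row, gt_row))
    return total / len(pred_maps)


def bce_with_logits(logit, target):
    """Numerically stable sigmoid + binary cross entropy; target may be a soft label."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def total_loss(reg_loss, cls_losses, weight=3.0):
    """Multi-task objective: regression loss plus weighted mean classification loss."""
    return reg_loss + weight * sum(cls_losses) / len(cls_losses)


reg = pixelwise_mse([[[1.0, 2.0]]], [[[0.0, 0.0]]])
cls = [bce_with_logits(0.0, 0.0), bce_with_logits(0.0, 1.0), bce_with_logits(0.0, 0.5)]
print(reg, total_loss(reg, cls))
```

The stable form of the cross entropy avoids overflow for large logits and accepts the non-integer labels that latent MixUp produces.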
We begin by introducing five counting datasets: ShanghaiTech, UCF_CC_50, Mall, TRANCOS, and DCC. They cover people [64, 13, 5], vehicle and cell counting to demonstrate the generalization ability of our method.
4.1 Experimental Setup
Datasets. ShanghaiTech consists of 1,198 annotated images with a total of 330,165 people with head center annotations. This dataset is split into SHA and SHB. The average crowd counts are 501.4 and 123.6, respectively. Following prior work, we use 300 images for training and 182 images for testing in SHA; 400 images for training and 316 images for testing in SHB. UCF_CC_50 has 50 images with 63,974 head center annotations in total. The head counts range between 94 and 4,543 per image. The small dataset size and large variance make this a very challenging counting dataset. We call it UCF for short. Following prior work, we perform 5-fold cross validation to report the average test performance. Mall contains 2000 frames collected in a shopping mall. Each frame has only 31 persons on average. The first 800 frames are used as the training set and the remaining 1200 frames as the test set. TRANCOS is a public traffic dataset containing 1244 images of different congested traffic scenes captured by surveillance cameras, with 46,796 annotated vehicles. 800 images are for training and the rest for testing. DCC is a cell microscopy dataset, consisting of 177 images with cell counts from 0 to 100. 100 images are used for training and 77 images are used for testing.
Implementation details. The backbone design follows CSRNet: a VGGNet with 10 convolutional and 6 dilated convolutional layers, pretrained on the ILSVRC classification task. We follow the same setting to generate ground truth density maps. To have a strong baseline, the training set is augmented by randomly cropping patches of 1/4 size of each image. We set a reference number of 1200: both labeled and unlabeled data in each dataset are augmented up to this number to have a balanced distribution. For instance, if we have 30 labeled images, we need to crop 40 patches from each image to augment the set to 1200. We feed the network with a minibatch of two image patches each time. In order for the two patches to have the same size, we further crop them to the shorter width and height of the two. We set the learning rate to 1e-7, momentum to 0.95 and weight decay to 5e-4. We train 100 epochs with the SGD optimizer for each active learning cycle; before the last cycle, the network is trained with only labeled data. In the last cycle, it is trained with both labeled and unlabeled data. In all experiments, the level L is 3 for GDSIM in formula (2) and the loss weight in formula (5) is 3. Network inference is on the entire image.
Evaluation protocol. We evaluate the counting performance via the commonly used mean absolute error (MAE) and mean square error (MSE) [43, 48, 26], which measure the difference between the ground truth and estimated counts. For active learning, we choose to label around 10% of the images of the entire set, which goes along with our setting of limited supervision. m is chosen not too small, so that we can normally reach the labeling budget in about 2-4 active learning cycles. Sec. 5 gives a discussion on the time complexity. M and m are by default 30/40 and 10 on SHA and SHB, 10 and 3 on UCF (initial number is 4), 80 and 20 on Mall and TRANCOS, and 10 and 3 on DCC, respectively. We also evaluate different M and m to show the effectiveness of our method. The baseline is to randomly label M images and train a regression model using the same backbone as our AL-AC but without distribution alignment. As in [4, 65], taking the randomness into account, we repeat each experiment for 10 trials and report both mean and standard deviation, to show the improvement of our method over the baseline.
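The evaluation protocol can be sketched as follows. The helper names are hypothetical. Note that many crowd counting papers report the square root of the mean squared count error under the name MSE; the `root` flag exposes both variants, since the text here does not state which one is used. The use of the sample standard deviation over trials is likewise an assumption.

```python
import math
import statistics


def count_mae(pred_counts, gt_counts):
    """Mean absolute error between predicted and ground truth counts."""
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)


def count_mse(pred_counts, gt_counts, root=True):
    """Mean squared count error; root=True returns its square root."""
    mse = sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)
    return math.sqrt(mse) if root else mse


def summarize_trials(values):
    """Mean and sample standard deviation over repeated trials."""
    return statistics.mean(values), statistics.stdev(values)


print(count_mae([10, 20], [12, 17]), count_mse([10, 20], [12, 17], root=False))
print(summarize_trials([1.0, 2.0, 3.0]))
```

Reporting the spread over trials matters here because both random selection and weighted random selection depend on the random seed.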
We present an ablation study of our AL-AC and its comparison to state-of-the-art fully- and semi-supervised methods.
Ablation study. The proposed partition-based sample selection with weights and distribution alignment with latent MixUp are ablated.
Labeling budget M and m. As mentioned in Sec. 4.1, we set M and m by default. Comparative experiments are offered in two ways. First, keeping m = 10, we vary M from 10 to 40. The results are shown in Table 1. We compare our partition-based sample selection with weights (PSSW) with random selection (RS); distribution alignment is not added in this experiment. For PSSW, the MAE on SHA gradually decreases from 121.2 with M = 10 to 85.4 with M = 40, and the standard deviation also decreases from 9.3 to 2.5. The MAE is in general about 10 points lower than that of RS. With different M, PSSW also produces lower MAE than RS on SHB. For example, with M = 40, PSSW yields an MAE of 14.6 v.s. 17.9 for RS.
|Setting||SHA: PSSW||SHA: RS||SHB: PSSW||SHB: RS|
|M = 10, m = 10||121.2 ± 9.3||121.2 ± 9.3||20.5 ± 4.8||20.5 ± 4.8|
|M = 20, m = 10||96.7 ± 7.3||111.5 ± 7.4||17.0 ± 1.9||19.3 ± 2.2|
|M = 30, m = 10||93.5 ± 2.9||102.1 ± 7.0||15.7 ± 1.5||19.9 ± 3.1|
|M = 40, m = 10||85.4 ± 2.5||93.8 ± 5.6||14.6 ± 1.3||17.9 ± 1.9|
|M = 30, m = 5||92.6 ± 3.1||102.1 ± 7.0||15.1 ± 1.5||19.9 ± 3.1|
|M = 40, m = 5||84.4 ± 2.6||93.8 ± 5.6||14.4 ± 1.2||17.9 ± 1.9|
|M = 30, m = 10||SHA: MAE||SHA: MSE||SHB: MAE||SHB: MSE|
|PSSW||93.5 ± 2.9||151.0 ± 15.1||15.7 ± 1.5||28.3 ± 3.4|
|PSSW + GRL||90.8 ± 2.7||144.9 ± 14.5||14.7 ± 1.3||27.8 ± 2.9|
|PSSW + GRL + MX||87.9 ± 2.3||139.5 ± 12.7||13.9 ± 1.2||26.2 ± 2.5|
|M = 40, m = 10||SHA: MAE||SHA: MSE||SHB: MAE||SHB: MSE|
|PSSW||85.4 ± 2.5||144.7 ± 10.7||14.6 ± 1.3||24.6 ± 3.0|
|PSSW + GRL||82.7 ± 2.4||140.9 ± 11.3||13.7 ± 1.3||23.5 ± 2.2|
|PSSW + GRL + MX||80.4 ± 2.4||138.8 ± 10.1||12.7 ± 1.1||20.4 ± 2.1|
|Method||SHA: MAE||SHB: MAE|
|RS + GRL + MX||87.3||15.1|
|PSSW + GRL + MX||80.4||12.7|
Second, keeping M fixed, we decrease m from 10 to 5 and repeat the experiment. Results show that having a small m indeed works slightly better: for instance, PSSW with M = 40 and m = 5 reduces MAE by 1.0 on SHA compared to PSSW with M = 40 and m = 10. On the other hand, m cannot be too small, as discussed in Sec. 3.2 and Sec. 5. In practice, we still keep m = 10 for both efficiency and effectiveness.
We notice that the number of labeled people may vary over trials and cycles. Since we do not know the ground truth, we can not make the number of labeled people exactly what we want before labeling them. As in [27, 65, 4], the essential idea of active learning based crowd counting is to find the most informative images to label within a small budget of number of images. Labelling more heads or less does not mean a better or worse performance. To give an insight, we conduct an experiment by only labeling images with over 200 heads on SHB, the MAE is 26.7 v.s. 17.9 for RS in Table 1.
Variants of PSSW. Our PSSW has two components: the Jenks-based partition for diversity, and the GDSIM for dissimilarity (Sec. 3). In order to show the effectiveness of each, we present two variants of PSSW: Even Partition and Global Diff. Even Partition means that the Jenks-based partition is replaced by evenly splitting the ranges on the histogram of crowd counts while GDSIM remains; Global Diff means that GDSIM is replaced by the global count difference as the dissimilarity measure while the Jenks-based partition remains. We report MAE on SHA and SHB in Table 1: Right. It can be seen that Even Partition produces MAE 89.6 on SHA and 16.2 on SHB, while Global Diff produces 86.6 and 15.3. Both are clearly inferior to PSSW (84.4 and 14.4). This suggests the importance of the proposed diversity and dissimilarity measures.
Distribution alignment with latent MixUp. Our proposed distribution alignment with latent MixUp is composed of two elements: the distribution classifier with GRL and latent MixUp (Sec. 3.4). To demonstrate their effectiveness, we present the result of PSSW plus the GRL classifier (denoted as PSSW + GRL), and with latent MixUp added (denoted as PSSW + GRL + MX) in Table 2. Taking M = 40 as an example, adding GRL and MX to PSSW contributes a 5.0 point MAE decrease on SHA and a 1.9 point decrease on SHB. Specifically, MX contributes 2.3 and 1.0 points of the decrease on SHA and SHB, respectively. The same observation goes for MSE: by adding GRL and MX, it decreases from 144.7 to 138.8 on SHA, and from 24.6 to 20.4 on SHB.
To make a further comparison, we also add the proposed distribution alignment with latent MixUp to RS in Table 2: Right, where we achieve MAE 87.3 on SHA and 15.1 on SHB. Adding GRL + MX to RS also improves the baseline, so the performance difference between PSSW and RS becomes smaller; yet the absolute difference is still large, which justifies our PSSW. Since PSSW + GRL + MX is the final version of our AL-AC, we use AL-AC hereafter to denote it.
Comparison with fully-supervised methods. We compare our work with prior arts [64, 43, 23, 39, 46, 47, 31]. All these approaches are fully-supervised methods which utilize annotations of the entire dataset (300 images in SHA and 400 in SHB), while in our setting we label only 30/40 images, 10% of the entire set. It can be seen that our method outperforms the representative methods [64, 43] from a few years ago, and is not far from other recent arts, i.e. [23, 39, 46, 47, 31]. A direct comparison to ours is CSRNet, with which we share the same backbone. With about 10% labeled data, our AL-AC retains 85% accuracy on SHA (68.2 / 80.4) and 83% accuracy on SHB (10.6 / 12.7). Compared to our baseline (denoted as RS in Table 1), AL-AC in general produces significantly lower MAE, e.g. 87.9 v.s. 102.1 on SHA with M = 30; 12.7 v.s. 17.9 on SHB with M = 40.
Despite that we only label 10% data, our distribution alignment with latent MixUp indeed enables us to make use of more unlabeled data across datasets: for instance, a simple implementation with M = 40 on SHA, if we add SHB as unlabeled data to AL-AC for distribution alignment, we obtain an even lower MAE 78.6 v.s. 80.4 in Table 4.
Comparison with semi-supervised methods. There are also some semi-supervised crowd counting methods [27, 42, 35] (results of [27, 42] are estimated from their curve plots). For instance, [42, 35] produce MAE 170.0 and 136.9 on SHA, respectively. These MAEs are much higher than ours. Since [42, 35] use different architectures from AL-AC, the comparisons are not straightforward. The remaining method, [27], uses about 50% labeled data on SHA (Fig. 7 therein) to reach performance similar to that of our AL-AC with 10% labeled data. We both adopt VGGNet, yet it utilizes extra web data for the ranking loss while we only use unlabeled data within SHA; we use dilated convolutions while it does not. To make the two more comparable, we instead use its backbone and repeat AL-AC on SHA (implementation details still follow Sec. 4.1): the mean MAE with M = 30, m = 10 on SHA becomes 91.4 (v.s. 87.9 in Table 4), which is still much better than that of the compared method.
Instead of using limited labeled data, in Fig. 4 we keep increasing M up to 280 and report the MAE on SHA. It can be seen that, with about 80-100 labeled images (nearly 30%), AL-AC already reaches performance close to that of the fully-supervised method (Table 4). The performance saturates after some point and converges to that of the baseline. This is also observed in other works [27, 42].
UCF has 40 training images in total. We show in Table 4 that labeling ten of them (M = 10) already produces a very competitive result: the MAE is 351.4 while the MSE is 448.1. The MAE and MSE are significantly lower (by 93.3 and 152.2 points) than those of the baseline. We analyzed the result and found that our AL-AC is able to select the hard samples with thousands of persons and label them for training, while this is not guaranteed in random selection. Compared to the fully-supervised method, our MAE is not far off. We also present the result for another setting of M: MAE/MSE is further reduced.
Different from the ShanghaiTech and UCF datasets, Mall contains images with much sparser crowds, 31 persons per image on average. Following our setup, we label 80 out of 800 images and compare our AL-AC with both the baseline and other fully-supervised methods [37, 26, 12] in Table 5. With 10% labeled data, we achieve an MAE of 3.8 and an MSE of 5.5, each superior to the baseline and to one of the compared fully-supervised methods. This shows the effectiveness of our method on images of sparse crowds.
4.5 TRANCOS and DCC
To test the generalization ability of our AL-AC on other counting tasks, we evaluate it on TRANCOS and DCC for vehicle and cell counting, respectively. The global count error MAE is presented in Table 6. We label 10% of the images for each dataset, that is, M = 80, m = 20 for TRANCOS, and M = 10, m = 3 for DCC. Our MAE is 7.5 on TRANCOS, a decrease of 2.6 points from the baseline, and 4.5 on DCC, a decrease of 2.9 points from the baseline. With 10% labeled data, our AL-AC performs close to the state of the art, particularly on DCC.
|Method||Baseline*||AL-AC*||Lempitsky||Hydra-3s||POCR||CSRNet||CFF|
|TRANCOS||10.1 ± 1.5||7.5 ± 0.8||13.8||11.0||9.7||3.6||2.0|
|DCC||7.4 ± 1.2||4.5 ± 0.4||-||-||8.4||-||3.2|
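For reference, the global count errors reported in the result tables can be computed as in the following minimal sketch. It assumes the convention common in the counting literature, where MAE is the mean absolute count error and MSE denotes the root of the mean squared count error. The function names and the toy counts are hypothetical, and the starred table entries with two numbers are assumed to be a mean and a standard deviation over repeated runs.

```python
import math
import statistics


def count_errors(predicted, ground_truth):
    """Return (MAE, MSE) over a test set of per-image counts.

    MSE is taken here as the root of the mean squared error,
    which is an assumed convention, not one stated in this section.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [p - g for p, g in zip(predicted, ground_truth)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    mse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, mse


def mean_and_std(values):
    """Summarize one metric over several runs as (mean, sample std)."""
    return statistics.mean(values), statistics.stdev(values)


# Toy counts, made up for illustration only.
mae, mse = count_errors([10.0, 20.0, 33.0], [12.0, 20.0, 30.0])
run_mean, run_std = mean_and_std([7.0, 8.0, 9.0])
```

A decrease "from baseline" in the text is then simply the difference between two such MAE values measured on the same test set.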
We present an active learning framework for accurate crowd counting with limited supervision. Given a counting dataset, instead of exhaustively annotating every image, we first introduce a partition-based sample selection with weights to label only a few of the most informative images and learn a crowd regression network upon them. This process is repeated iteratively until the labeling budget is reached. Next, rather than learning from labeled data only, we also exploit the abundant unlabeled data: we introduce a distribution alignment branch with latent MixUp in the network. Experiments conducted on standard benchmarks show that, by labeling only 10% of the entire set, our method already performs close to the recent state of the art.
By choosing an appropriate , we normally reach the labeling budget in three active learning cycles. In our setting, training data in each dataset are augmented to a fixed number. We run our experiments on a GTX 1080 GPU. It takes around three hours to complete each active learning cycle. The total training time is more or less the same as that of fully-supervised training, as in each learning cycle we train for far fewer epochs with a limited number of labeled data. More importantly, compared to the annotation cost for an entire dataset (see Sec. 1 for an estimate on SHA), ours is substantially reduced.
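How the initial labeled set, the per-cycle labeling amount and the labeling budget determine the number of active learning cycles can be sketched as follows. The function name and its arguments are hypothetical. The example values (4 images initially, 3 more per cycle, a budget of 10) follow the M=4, 7, 10 progression listed for UCF_CC_50 in the appendix, and the three-hour figure is the approximate per-cycle time quoted above.

```python
def labeling_schedule(initial, per_cycle, budget):
    """Return the number of labeled images after each active learning cycle."""
    if initial <= 0 or per_cycle <= 0 or budget < initial:
        raise ValueError("need initial > 0, per_cycle > 0 and budget >= initial")
    counts = [initial]
    # Keep labeling per_cycle more images until the budget is reached,
    # never exceeding the budget in the last cycle.
    while counts[-1] < budget:
        counts.append(min(counts[-1] + per_cycle, budget))
    return counts


schedule = labeling_schedule(initial=4, per_cycle=3, budget=10)
num_cycles = len(schedule)

# Rough wall-clock estimate, assuming about three hours per cycle.
approx_hours = 3 * num_cycles
```

With these values the budget is reached in three cycles, which is consistent with the statement above.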
Acknowledgement: This work was supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61828602 and 51475334; as well as National Key Research and Development Program of Science and Technology of China under Grant No. 2018YFB1305304, Shanghai Science and Technology Pilot Project under Grant No. 19511132100.
Appendix 0.A: more results
In this appendix, we offer more results on ShanghaiTech, UCF_CC_50, TRANCOS and DCC datasets.
Variational adversarial active learning is a state-of-the-art active learning method in which a variational autoencoder (VAE) and an adversarial network are trained to play a min-max game, discriminating between unlabeled and labeled data with domain labels 0 and 1, respectively. The samples predicted as “unlabeled” with the lowest probabilities (near 0) are selected for active annotation. We find this min-max idea to be similar to the gradient reversal layer (GRL) in our proposed distribution alignment between labeled and unlabeled data. The GRL also assumes a domain label of 0 for the unlabeled data and 1 for the labeled data. It multiplies the gradient by a negative constant (-1) during backpropagation, which forces the feature distributions over the labeled and unlabeled data to be as indistinguishable as possible for the distribution classifier. We therefore select the unlabeled samples with the lowest probabilities from our domain classifier. In this sense, the distribution alignment with latent MixUp is included in every learning cycle. We denote this variant of AL-AC by AL-AC-v and compare it to the full version of our AL-AC in the default setting (M = 40, m = 10) on SHA and SHB in Table 7 (left). Our AL-AC still works clearly better than this variant.
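The two mechanisms described in this paragraph can be illustrated with a minimal, framework-free sketch. It is not the implementation used in our experiments: the function names, the sample identifiers and the probability values are hypothetical, and the gradient reversal is written as an explicit forward/backward pair rather than as a layer of a deep learning library.

```python
def grl_forward(features):
    """Gradient reversal acts as the identity in the forward pass."""
    return list(features)


def grl_backward(upstream_grad, scale=1.0):
    """In the backward pass the incoming gradient is multiplied by -scale."""
    return [-scale * g for g in upstream_grad]


def select_for_labeling(labeled_prob, num_to_label):
    """Pick the unlabeled samples with the lowest 'labeled' probability.

    labeled_prob maps a sample id to the domain classifier's predicted
    probability of domain label 1 (labeled); values near 0 mean the sample
    is confidently recognized as unlabeled.
    """
    if num_to_label < 0:
        raise ValueError("num_to_label must be non-negative")
    ranked = sorted(labeled_prob, key=labeled_prob.get)
    return ranked[:num_to_label]


# Hypothetical domain-classifier outputs for four unlabeled images.
probs = {"img_a": 0.42, "img_b": 0.05, "img_c": 0.91, "img_d": 0.13}
chosen = select_for_labeling(probs, 2)
reversed_grad = grl_backward([0.5, -2.0])
```

Here `chosen` contains the two images whose domain probability is closest to 0, which is the selection rule of the AL-AC-v variant.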
Next, to test the generalization ability of our method, we report results under the default setting M = 40, m = 10 when training on SHA and testing on SHB (SHA → SHB), and vice versa (SHB → SHA). The MAE and MSE for our proposed AL-AC and the baseline are reported in Table 7 (right). It can be seen that our AL-AC improves on the baseline substantially in this transfer setting.
|M=40, m=10||SHA → SHB||SHB → SHA|
Our proposed AL-AC is mainly composed of two parts: 1) partition-based sample selection with weights (PSSW); and 2) distribution alignment with latent MixUp. We present detailed results of both components on the UCF_CC_50 dataset.
First, we compare our PSSW with random selection (RS) in Table 8. We choose by default M=10 and m=3 (the initial m is 4). The mean MAE at the starting point (M=4) is 645.8 for both PSSW and RS. For PSSW, it reduces to 479.2 with M=7 and 387.3 with M=10; in contrast, the MAE for RS is 505.8 and 444.7 for M=7 and M=10, respectively. PSSW produces clearly lower MAE than RS. We also present the result of PSSW with M=20 and m=10: the MAE is also much lower than that of RS.
Next, we study the effect of the proposed distribution alignment with latent MixUp in Table 8. As in the paper, we add GRL (gradient reversal layer) and MX (latent MixUp) to PSSW and report the result. For M=10, m=3, adding GRL + MX to PSSW further reduces the mean MAE and MSE by 35.9 and 58.8 points, respectively. We also present the result for M=20, m=10, where the contribution of GRL and MX is also significant (e.g. a 27.2-point decrease in MAE). Notice that PSSW + GRL + MX is equivalent to AL-AC in Table 4.
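The reductions quoted in the two paragraphs above can be re-derived from the mean MAE values listed in Table 8, as in the short check below. The dictionary and function names are hypothetical, and the sketch assumes that the two PSSW + GRL + MX rows of the table correspond to M=10 and M=20, an assignment that is consistent with the 35.9 and 27.2 point decreases stated in the text.

```python
# Mean MAE on UCF_CC_50, keyed by the number of labeled images M (Table 8).
pssw_mae = {4: 645.8, 7: 479.2, 10: 387.3, 20: 345.9}
rs_mae = {4: 645.8, 7: 505.8, 10: 444.7, 20: 417.2}
# PSSW + GRL + MX; the mapping of rows to M is an assumption (see above).
full_mae = {10: 351.4, 20: 318.7}


def reduction(before, after):
    """Difference in MAE points, rounded to the table's one decimal."""
    return round(before - after, 1)


# How much PSSW improves over random selection at each labeling level.
gain_over_rs = {m: reduction(rs_mae[m], pssw_mae[m]) for m in pssw_mae}

# How much the distribution alignment with latent MixUp adds on top of PSSW.
gain_from_alignment = {m: reduction(pssw_mae[m], full_mae[m]) for m in full_mae}
```

Under these assumptions the gap between PSSW and RS widens as more images are labeled, while GRL + MX contributes a further reduction at both labeling levels.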
0.A.3 TRANCOS and DCC
Previously, we presented the result of AL-AC by labeling 10% of the images for the TRANCOS and DCC datasets. In Table 9, we present the result of labeling 20% of the data for each; that is, for TRANCOS and for DCC. The mean MAE of AL-AC is 5.9 on TRANCOS, a decrease of 2.9 points from the baseline, and 3.8 on DCC, a decrease of 2.6 points from the baseline. With 20% labeled data, our AL-AC performs quite close to the state of the art [32, 23, 47], which utilize full annotations of the datasets.
|Setting||PSSW (MAE)||RS (MAE)|
|M=4, m=4||645.8 ± 36.5||645.8 ± 36.5|
|M=7, m=3||479.2 ± 32.1||505.8 ± 35.3|
|M=10, m=3||387.3 ± 22.5||444.7 ± 25.9|
|M=20, m=10||345.9 ± 24.6||417.2 ± 29.8|
|Method||MAE||MSE|
|PSSW + GRL + MX (M=10, m=3)||351.4||448.1|
|PSSW + GRL + MX (M=20, m=10)||318.7||421.6|
|Method||Baseline*||AL-AC*||Lempitsky||Hydra-3s||POCR||CSRNet||CFF|
|TRANCOS||8.8 ± 1.4||5.9 ± 0.9||13.8||11.0||9.7||3.6||2.0|
|DCC||6.4 ± 1.1||3.8 ± 0.5||-||-||8.4||-||3.2|
Appendix 0.B: more examples
Several new examples of AL-AC are illustrated in Fig. 6 over different datasets (e.g. ShanghaiTech, UCF_CC_50, TRANCOS, and DCC).
-  (2019) MixMatch: a holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249. Cited by: §3.4.
-  (2006) Unsupervised Bayesian detection of independent motion in crowds. In CVPR, Cited by: §1.
-  (2018) Scale aggregation network for accurate and efficient crowd counting. In ECCV, Cited by: §2.1.
-  (2013) From semi-supervised to transfer counting of crowds. In CVPR, Cited by: §1, §2.2, §2.3, §3.3, §4.1, §4.2.
-  (2012) Feature mining for localised crowd counting. In BMVC, Cited by: §1, §4.1, §4.
-  (2005) Analysis of a greedy active learning strategy. In NIPS, Cited by: §2.3.
-  Unsupervised domain adaptation by backpropagation. In JMLR, Cited by: §1, §3.4.
-  (2015) An active search strategy for efficient object class detection. In CVPR, Cited by: §2.3.
-  Extremely overlapping vehicle counting. In Iberian Conference on Pattern Recognition and Image Analysis, Cited by: §1, §3.3, §4.1, §4.
-  (2016) Semi-supervised deep learning by metric embedding. arXiv preprint arXiv:1611.01449. Cited by: §2.2.
-  (2019) One-shot scene-specific crowd counting. In BMVC, Cited by: §1.
-  (2019) Crowd counting using scale-aware attention networks. In WACV, Cited by: §4.4, Table 5.
-  (2013) Multi-source multi-scale counting in extremely dense crowd images. In CVPR, Cited by: §1, §4.1, §4.
-  (2018) Composition loss for counting, density map estimation and localization in dense crowds. In ECCV, Cited by: §2.1.
-  (1967) The data model concept in statistical mapping. International yearbook of cartography 7, pp. 186–190. Cited by: §3.3.
-  (2019) Crowd counting and density estimation by trellis encoder-decoder networks. In CVPR, Cited by: §2.1.
-  (2009) Multi-class active learning for image classification. In CVPR, Cited by: §2.3.
-  (2012) ImageNet classification with deep convolutional neural networks. In NIPS, Cited by: §1.
-  (2016) Temporal ensembling for semi-supervised learning. In ICLR, Cited by: §2.2.
-  (2018) Where are the blobs: counting by localization with point supervision. In ECCV, Cited by: §2.1.
-  (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In ICMLW, Cited by: §2.2.
-  (2010) Learning to count objects in images. In NIPS, Cited by: Table 9, Table 6.
-  (2018) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In CVPR, Cited by: §0.A.3, Table 9, §1, §3.1, §4.1, §4.2, §4.2, §4.3, Table 4, Table 4, Table 6.
-  (2019) Density map regression guided detection network for RGB-D crowd counting and localization. In CVPR, Cited by: §2.1.
-  (2019) Recurrent attentive zooming for joint crowd counting and precise localization. In CVPR, Cited by: §2.1.
-  (2018) DecideNet: counting varying density crowds through attention guided detection and density estimation. In CVPR, Cited by: §1, §2.1, §4.1, §4.4, Table 5.
-  (2019) Exploiting unlabeled data in CNNs by self-supervised learning to rank. TPAMI. Cited by: §1, §2.2, §2.3, §3.2, §3.2, §4.2, §4.2, §4.2, footnote 1.
-  (2018) Leveraging unlabeled data for crowd counting by learning to rank. In CVPR, Cited by: §1, §2.2.
-  (2019) Point in, box out: beyond counting persons in crowds. In CVPR, Cited by: §2.1.
-  (2018) Crowd counting via scale-adaptive convolutional neural network. In WACV, Cited by: §1, §2.1.
-  (2019) Bayesian loss for crowd count estimation with point supervision. In ICCV, Cited by: §4.2, Table 4, Table 4.
-  (2018) People, penguins and petri dishes: adapting object counting models to new visual domains and object types without forgetting. In CVPR, Cited by: §0.A.3, Table 9, §1, §4.1, Table 6, §4.
-  (2006) Semi-supervised learning. In IEEE Transactions on Neural Networks, Vol. 20, pp. 542–542. Cited by: §2.2.
-  (2018) Crowd counting with minimal data using generative adversarial networks for multiple target regression. In WACV, Cited by: §2.2.
-  (2019) Generalizing semi-supervised generative adversarial networks to regression using feature contrasting. Computer Vision and Image Understanding. Cited by: §2.2, §4.2.
-  (2016) Towards perspective-free object counting with deep learning. In ECCV, Cited by: Table 9, §1, Table 6.
-  Count forest: co-voting uncertain number of targets using random forest for crowd density estimation. In ICCV, Cited by: §4.4, Table 5.
-  (2006) Counting crowded moving objects. In CVPR, Cited by: §1.
-  (2018) Iterative crowd counting. In ECCV, Cited by: §1, §2.1, §4.2, Table 4, Table 4.
-  (2015) Semi-supervised learning with ladder networks. In NIPS, Cited by: §2.2.
-  (2018) Top-down feedback for crowd counting convolutional neural network. In AAAI, Cited by: §1.
-  (2019) Almost unsupervised learning for dense crowd counting. In AAAI, Cited by: §2.2, §3.2, §4.2, §4.2, footnote 1.
-  (2017) Switching convolutional neural network for crowd counting. In CVPR, Cited by: §1, §4.1, §4.2, Table 4, Table 4.
-  (2018) Active learning for convolutional neural networks: a core-set approach. In ICLR, Cited by: §2.3, §3.2.
-  (2009) Active learning literature survey. Technical report University of Wisconsin-Madison Department of Computer Sciences. Cited by: §2.3, §3.2.
-  (2019) Revisiting perspective information for efficient crowd counting. In CVPR, Cited by: §1, §2.1, §4.2, Table 4, Table 4.
-  (2019) Counting with focus for free. In ICCV, Cited by: §0.A.3, Table 9, §1, §2.1, §4.2, Table 4, Table 6.
-  (2017) Generating high-quality crowd density maps using contextual pyramid CNNs. In ICCV, Cited by: §1, §2.1, §4.1, Table 4.
-  (2019) Variational adversarial active learning. In ICCV, Cited by: §0.A.1, Table 7, §2.3.
-  (2011) Semi-supervised elastic net for pedestrian counting. Pattern Recognition 44 (10-11), pp. 2297–2304. Cited by: §2.2.
-  Manifold mixup: better representations by interpolating hidden states. In ICML, Cited by: §3.4.
-  (2019) Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825. Cited by: §3.4.
-  (2003) Detecting pedestrians using patterns of motion and appearance. IJCV 63 (2), pp. 153–161. Cited by: §1.
-  (2016) Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology 27 (12), pp. 2591–2600. Cited by: §3.2.
-  (2019) Learning from synthetic data for crowd counting in the wild. In CVPR, Cited by: §1.
-  (2012) Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655. Cited by: §2.2.
-  (2017) Spatiotemporal modeling for crowd counting in videos. In ICCV, Cited by: §4.4, Table 5.
-  (2019) Learn to scale: generating multipolar normalized density maps for crowd counting. In ICCV, Cited by: §1, §1.
-  (2019) Perspective-guided convolution networks for crowd counting. In ICCV, Cited by: §2.1.
-  (2015) Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision 113 (2), pp. 113–127. Cited by: §2.3.
-  (2019) Training object detectors from few weakly-labeled and many unlabeled images. arXiv preprint arXiv:1912.00384. Cited by: §2.2.
-  (2015) Cross-scene crowd counting via deep convolutional neural networks. In CVPR, Cited by: §2.1.
-  (2018) MixUp: beyond empirical risk minimization. In ICLR, Cited by: §3.4.
-  (2016) Single-image crowd counting via multi-column convolutional neural network. In CVPR, Cited by: §1, §1, §1, §2.1, §3.1, §4.1, §4.2, Table 4, Table 4, §4.
-  (2018) Crowd counting with limited labeling through submodular frame selection. IEEE Transactions on Intelligent Transportation Systems 20 (5), pp. 1728–1738. Cited by: §1, §2.2, §2.3, §4.1, §4.2.
-  (2019) Enhanced 3D convolutional networks for crowd counting. In BMVC, Cited by: Table 5.