We consider the task where a model has to predict the location of each object in an image. This task is important for applications such as public safety, crowd monitoring, and traffic management. Typically, bounding boxes [26, 25] or point-level annotations [16, 17, 21, 19, 35, 24, 8] are provided during training. However, we consider the more challenging problem setup where only count-level annotations are available. These labels are cheaper to acquire than point-level annotations, but they make the localization task significantly more difficult for the model. In dense scenes, the model has to identify which objects in the image correspond to the object count. These objects can heavily overlap and can vary widely in scale, shape, and appearance. Current methods partially address this problem setup but only for datasets where objects are salient and rarely overlap. These methods do not work for dense scene datasets as they are designed to work with training images that have few objects. Thus, we address a novel problem setup of learning to localize objects for dense scenes under count supervision.
Acquiring object count labels in images requires much less human effort than annotating the location of each object. For training images with only a few objects, the annotator can obtain the object count much faster than with point annotations through subitizing. For videos, the annotator can obtain the object count quickly across image frames, as the count changes much less frequently than the object locations in the video. In some cases, object counts can be obtained with no annotation effort at all. These cases include keeping count of products on retail stock shelves, and keeping count of a crowd of people at events where the ticket system registers their actual count. In both cases, identifying object locations is important for safety and logistics.
Many methods exist that can perform object localization, but they need to be trained on point-level annotations [19, 16, 21, 17] or image-level annotations. They fall under two main categories: density-based and segmentation-based localization. Density-based methods [19, 21] transform the point-level annotations into a density map using a Gaussian kernel. Then, they train using a least-squares objective to predict the density map. However, these methods do not provide individual locations of the objects. On the other hand, segmentation-based methods such as LC-FCN train using a loss function that encourages the output to contain a single blob per object. For our framework, we use the individual object locations obtained by LC-FCN to help in generating the pseudo point-level annotations.
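The density-map construction used by density-based methods can be illustrated with a minimal sketch. The function names, the `sigma` value, and the use of plain Python lists are assumptions made for illustration; they are not taken from any of the cited methods.

```python
import math

def density_map(points, height, width, sigma=2.0):
    """Sum one normalized 2-D Gaussian per annotated point (row, col)."""
    dmap = [[0.0] * width for _ in range(height)]
    for (pr, pc) in points:
        # Evaluate the Gaussian on the whole grid, then normalize it so that
        # each annotated object contributes a total mass of exactly 1.
        kernel = [[math.exp(-((r - pr) ** 2 + (c - pc) ** 2) / (2 * sigma ** 2))
                   for c in range(width)] for r in range(height)]
        mass = sum(map(sum, kernel))
        for r in range(height):
            for c in range(width):
                dmap[r][c] += kernel[r][c] / mass
    return dmap

def predicted_count(dmap):
    """The count estimate is the integral (sum) of the density map."""
    return sum(map(sum, dmap))

def least_squares_loss(pred, target):
    """Sum of squared per-pixel differences between two density maps."""
    return sum((p - t) ** 2
               for prow, trow in zip(pred, target)
               for p, t in zip(prow, trow))
```

Summing the map recovers the number of annotated points, which is why such models can count, but the map itself does not separate individual objects.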
Glance is a state-of-the-art method for object counting when only count supervision is provided. This method is an ImageNet pre-trained model such as ResNet50 with a regression layer as its output layer. Unfortunately, Glance is not designed to localize the objects of interest in the image. In this work, we propose a novel approach that uses count supervision to localize objects in dense scenes. Further, we show that our method achieves better count results than Glance.
Most weakly supervised localization methods fall under multiple-instance learning (MIL). In this setup, each image corresponds to a bag of object proposals. Each bag is labeled based on whether an object class exists. Li et al. present a two-step approach. First, they use a mask-out strategy to filter the noisy object proposals; then, they use a Faster R-CNN for detection using bags of instances. Tang et al. use a refinement learning strategy to improve the quality of the proposals. C-MIL introduces a continuation optimization method to avoid getting stuck in a local minimum. C-WSL is the most relevant to our work as it uses count information to obtain the highest scoring proposals. However, it differs from our setup in that it relies on a classification network that is not designed for dense scenes.
We propose LOOC which can learn to Localize Overlapping Objects with Count supervision. It trains by alternating between two stages. In the first stage, LOOC learns to generate pseudo point-level annotations in a semi-supervised learning manner. In the second stage, LOOC uses a fully-supervised localization method that trains on these pseudo labels.
The proposal scores combine the proposal objectness with the probability heat-map obtained from the trained localization method. The proposals that have low scores are considered unlabeled. The localization method uses the pseudo labels and ignores the regions that are unlabeled. The goal for the localization method is to infer the object probabilities in these unlabeled regions. These probabilities are used to re-score the proposals to generate the pseudo labels in the next round. At test time, only the localization method is kept, which can be directly used to predict the locations and count of the objects of interest.
Since no directly relevant work exists for this particular setup, we compare our methods against Glance and the fully supervised LC-FCN. We benchmark our methods on various counting datasets such as Trancos, Penguins, UCSD, and Mall. We observed that LOOC achieves a strong new baseline for localization in the novel problem setup where only count supervision is available. Further, we observed that LOOC outperforms current state-of-the-art methods that only use count as their supervision.
We summarize our contributions as follows: we (1) present LOOC, a novel framework that can count and locate objects with count-level supervision for dense scenes; (2) propose a semi-supervised learning scheme where pseudo labels are inferred for unlabeled regions in the image; and (3) show that LOOC achieves better count accuracy than Glance while also locating objects efficiently.
2 Proposed Method
One of the main challenges of training with only count supervision is to identify which objects of interest in the image correspond to the object count. Object proposals could be used to identify which regions are likely to have the objects of interest [29, 4, 36]. However, proposal methods are class-agnostic as they do not provide the class label. Thus, they might propose the wrong objects.
To alleviate this drawback, we consider a semi-supervised learning methodology where only the centroids of the proposals with the highest saliency score are considered as pseudo point-level annotations. The rest of the proposals represent unlabeled regions. When a localization model is trained on these salient proposals, it can be used to predict a class probability map (CPM) for the objects of interest that are in the unlabeled regions. These probabilities are used as positive feedback to re-score the proposals and obtain better pseudo point-level annotations for the next round.
The pipeline of our framework LOOC consists of three components: a proposal generator, a proposal classifier, and an object localizer. The proposal generator and classifier are used to obtain the pseudo point-level annotations, whereas the object localizer is trained on these annotations to count and localize objects. We explain each of these components below.
2.3 Generating pseudo-labels
In this section, we explain the proposal generator and the classifier and how they can be used to generate pseudo point-level annotations.
First, a proposal generator such as selective search is used to output 1000 proposals that correspond to different objects in the image. Each of these proposals has an associated score obtained from the object localizer (see Section 2.4 for more detail). The proposal classifier uses these scores to obtain labeled and unlabeled regions in the training images. The regions that do not intersect with any proposal are labeled as background, whereas the regions that intersect with the highest-scoring proposals are labeled as foreground. The remaining regions are considered unlabeled.
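The three-way split into background, foreground, and unlabeled regions can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: boxes are `(x1, y1, x2, y2)` tuples with integer coordinates and exclusive upper bounds, the label constants and function names are hypothetical, and the highest-scoring proposals are taken as a plain top-`k` by score.

```python
BACKGROUND, FOREGROUND, UNLABELED = 0, 1, 255  # hypothetical label values

def _fill(labels, box, value):
    """Write `value` into every pixel of `labels` covered by `box`."""
    height, width = len(labels), len(labels[0])
    x1, y1, x2, y2 = box
    for y in range(max(y1, 0), min(y2, height)):
        for x in range(max(x1, 0), min(x2, width)):
            labels[y][x] = value

def label_regions(height, width, boxes, scores, k):
    """Return a per-pixel label map for one training image."""
    # Pixels not touched by any proposal stay background.
    labels = [[BACKGROUND] * width for _ in range(height)]
    # Every pixel covered by some proposal starts as unlabeled ...
    for box in boxes:
        _fill(labels, box, UNLABELED)
    # ... and pixels covered by the k highest-scoring proposals
    # are promoted to foreground.
    ranked = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    for i in ranked[:k]:
        _fill(labels, boxes[i], FOREGROUND)
    return labels
```

For example, with two disjoint boxes and `k = 1`, the higher-scoring box becomes foreground, the other becomes unlabeled, and everything else is background.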
The highest scoring proposals are selected using non-maximum suppression, and their centroids are considered as the pseudo point-level annotations used to train the object localizer.
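A minimal sketch of this selection step is given below, using standard greedy non-maximum suppression followed by centroid extraction. The IoU threshold of 0.5, the `(x1, y1, x2, y2)` box format, and the function names are assumptions for illustration; the text does not specify them.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_pseudo_points(boxes, scores, k, iou_threshold=0.5):
    """Greedy NMS: keep up to k high-scoring, weakly overlapping boxes
    and return their centroids as (x, y) pseudo point annotations."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if len(kept) == k:
            break
        # Drop a box if it overlaps too much with an already kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return [((boxes[i][0] + boxes[i][2]) / 2.0,
             (boxes[i][1] + boxes[i][3]) / 2.0) for i in kept]
```

Suppressing near-duplicate boxes before taking centroids keeps two proposals around the same object from producing two pseudo points.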
2.4 Training a Localization Method
Using the pseudo point-level labels, we can train any fully supervised localization network such as LC-FCN and CSRNet. However, we chose LC-FCN due to its ability to get a location for each object instance rather than a density map. For the point annotations in the labeled regions, LC-FCN is trained using its original loss function described in detail by Laradji et al.
LC-FCN’s predictions on the unlabeled regions are ignored during training. However, the class probability map (CPM) that LC-FCN outputs for those regions is used to re-score the proposals in order to obtain a new set of pseudo point-level annotations.
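The idea of ignoring unlabeled regions during training can be illustrated with a masked per-pixel loss. Note that this sketch uses a plain binary cross-entropy as a stand-in and is not LC-FCN's loss function; the `UNLABELED` marker value and the function name are hypothetical.

```python
import math

UNLABELED = 255  # hypothetical marker for pixels excluded from the loss

def masked_bce(probs, labels, ignore=UNLABELED, eps=1e-7):
    """Mean binary cross-entropy over labeled pixels only.

    probs:  2-D list of predicted foreground probabilities.
    labels: 2-D list with 0 (background), 1 (foreground) or `ignore`.
    """
    total, n = 0.0, 0
    for prow, lrow in zip(probs, labels):
        for p, y in zip(prow, lrow):
            if y == ignore:
                continue  # unlabeled pixels contribute nothing to the loss
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            n += 1
    return total / n if n else 0.0
```

Because unlabeled pixels are skipped, the network is free to predict any probability there, and those predictions are what the re-scoring step later reads off.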
2.5 Overall Pipeline
LOOC is trained in cycles where in each cycle it alternates between generating pseudo point-level annotations and training LC-FCN on those labels (Algorithm 1). Let $C_i$ be the true object count for image $i$. At a given cycle, we only consider the top $k_i$ scoring proposals (where $k_i \le C_i$) to be used for obtaining the pseudo point-level annotations. After training LC-FCN with the top $k_i$ proposals, we use its class probability map (CPM) to re-score the proposals and increase $k_i$ by $\delta$, where the initial $k_i$ and the increment $\delta$ are fixed hyperparameters. The score of each proposal is the mean of the CPM over the region that intersects with that proposal. This allows us to pick a larger number of pseudo point-level annotations and increase the size of the labeled regions. The procedure ends when $k_i$ equals $C_i$ for all images, which closely resembles curriculum learning under an expectation maximization framework.
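The re-scoring rule and the growing number of selected proposals can be sketched as below. The names `rescore`, `cycle_schedule`, `k_init`, and `delta` are hypothetical, and the values passed in the usage are placeholders rather than the settings used in the paper; boxes are assumed to be `(x1, y1, x2, y2)` tuples with exclusive upper bounds that lie inside the image.

```python
def rescore(cpm, boxes):
    """Score each box as the mean class-probability value it covers.

    cpm: 2-D list of per-pixel object probabilities from the localizer.
    """
    scores = []
    for (x1, y1, x2, y2) in boxes:
        vals = [cpm[y][x] for y in range(y1, y2) for x in range(x1, x2)]
        scores.append(sum(vals) / len(vals) if vals else 0.0)
    return scores

def cycle_schedule(true_count, k_init, delta):
    """Number of top proposals used in each cycle for one image.

    Starts at k_init, grows by delta per cycle, and is capped at the
    image's true object count, at which point training stops.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    k = min(k_init, true_count)
    ks = [k]
    while k < true_count:
        k = min(k + delta, true_count)
        ks.append(k)
    return ks
```

For instance, `cycle_schedule(10, 2, 3)` gives `[2, 5, 8, 10]`: each cycle trusts a few more proposals until the number of pseudo points matches the count label.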
In this section, we evaluate LOOC on four dense scene datasets: UCSD, Trancos, Mall, and Penguins. For each of these datasets we only use the count labels instead of the original point-level annotations. For evaluation, we use mean absolute error (MAE) for measuring counting performance, and grid average mean absolute error (GAME) for localization performance.
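For concreteness, the two metrics can be computed as in the following sketch. It follows the usual definition of GAME, where the image is split into a $2^L \times 2^L$ grid and absolute count errors are summed over cells before averaging over images. The function names, the `(x, y)` point format, and the assumption that all images share one size are illustrative choices.

```python
def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and true image counts."""
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / len(true_counts)

def grid_counts(points, height, width, level):
    """Count (x, y) points in each cell of a 2^level x 2^level grid."""
    n = 2 ** level
    counts = [[0] * n for _ in range(n)]
    for (x, y) in points:
        row = min(int(y * n / height), n - 1)
        col = min(int(x * n / width), n - 1)
        counts[row][col] += 1
    return counts

def game(pred_points, true_points, height, width, level):
    """Grid average mean absolute error over a list of images.

    pred_points / true_points: one list of (x, y) points per image.
    With level = 0 the grid is a single cell and GAME reduces to MAE.
    """
    total = 0.0
    for pp, tp in zip(pred_points, true_points):
        pc = grid_counts(pp, height, width, level)
        tc = grid_counts(tp, height, width, level)
        total += sum(abs(a - b)
                     for prow, trow in zip(pc, tc)
                     for a, b in zip(prow, trow))
    return total / len(true_points)
```

A prediction with the right count but a misplaced object scores 0 at level 0 and a positive error at higher levels, which is what makes GAME sensitive to localization.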
For localization, we compare LOOC against a proposed baseline called TopK. The difference between TopK and LOOC is that TopK uses the fixed scores provided by the proposal generator to score the proposals and LOOC uses the dynamic scores provided by the object localizer’s class probability map (CPM).
We also compare LOOC against Glance, a state-of-the-art counting method that also uses count supervision. While Glance does not localize objects, the purpose of this benchmark is to observe whether the location awareness provided by LOOC can help in counting. Our models use the ResNet-50 backbone for feature extraction, and they are optimized using Adam with a learning rate of 1e-5 and weight decay. We also obtained similar results using optimizers that do not require defining a learning rate [32, 22, 31].
UCSD consists of images collected from a video camera at a pedestrian walkway. This dataset is challenging due to the frequent occurrence of overlapping pedestrians, which makes counting and localization difficult. Following Li et al., we resize the frames to 952x632 pixels using bilinear interpolation to make them suitable for our ResNet-based models. We use frames 601-1400 as the training set and the rest as the test set, which is common practice.
Table 1 shows that LOOC outperforms Glance in terms of MAE, suggesting that localization awareness helps in counting as well. Further, LOOC outperforms TopK with respect to MAE and GAME, suggesting that LC-FCN provides an informative class probability map. LOOC's results are also close to those of the fully supervised LC-FCN, which indicates that good performance can be achieved with less costly labels. Qualitatively, LOOC is able to accurately identify pedestrians for UCSD (Figure 3).
Trancos consists of images taken from traffic surveillance cameras for different roads, where the task is to count vehicles, which can highly overlap, making the dataset challenging for localization.
The results shown in Table 1 indicate that LOOC achieves lower MAE than Glance while also localizing well compared to TopK. Compared to the fully supervised LC-FCN, LOOC performs poorly, mainly due to the quality of the pseudo point-level annotations, but the qualitative results appear accurate (Figure 3).
Mall consists of frames collected from a fixed camera installed in a shopping mall. These frames have diverse illumination conditions and crowd densities, and the objects vary widely in size and appearance. The results in Table 1 show that LOOC achieves good localization performance compared to TopK and good counting performance compared to Glance.
The Penguins dataset consists of images of penguin colonies collected from fixed cameras in Antarctica. We train on 500 images and test on 500 unseen images. The quantitative results in Table 1 and qualitative results in Figure 3 show the effectiveness of LOOC in scenes where objects can come in different shapes and sizes, and can densely overlap.
Ablation studies. We evaluate the quality of the pseudo point-level annotations provided by LOOC in Table 2. After training LOOC, we generate the pseudo labels as the centroids of the top scoring proposals on the training set and measure the GAME localization score. We observe that LOOC outperforms TopK, suggesting that relying on LC-FCN's class probability map allows us to score the proposals better. Thus, given count-level supervision, we can use LOOC to obtain high quality point-level annotations and then effectively train a fully supervised localization method on those point labels.
We have proposed LOOC, a method that Localizes Overlapping Objects using Count supervision. LOOC trains by alternating between generating pseudo point-level annotations and training a fully supervised localization method such as LC-FCN. The goal is to progressively improve the localization performance based on pseudo labels. The experiments show that LOOC achieves a strong new baseline in the novel problem setup of localizing objects using only count supervision. They also show that LOOC is a new state-of-the-art for counting in this weakly supervised setup, and that the pseudo point-level annotations obtained by LOOC are of high quality and can be used to train any fully supervised localization method. For future work, we plan to investigate proposal-free methods, perhaps those that rely on topological regularization to identify the regions of interest. Further, we also plan to look into incorporating regularization methods that help when the number of labels is limited [33, 27].
-  (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B 39. Cited by: §2.5.
-  (2016) Counting in the wild. In ECCV, Cited by: §1, §3, §3.
-  (2009) Curriculum learning. In ICML, Cited by: §2.5.
-  (2016) Weakly supervised deep detection networks. In CVPR, Cited by: §2.1.
-  (2008) Privacy preserving crowd monitoring: counting people without people models or tracking. In CVPR, Cited by: §1, §3, §3.
-  (2017) Counting everyday objects in everyday scenes. In CVPR, Cited by: §1, §1, §1, Table 1.
-  (2012) Feature mining for localised crowd counting.. In BMVC, Cited by: §1, §3, §3.
-  (2019) Object counting and instance segmentation with image-level supervision. In CVPR, Cited by: §1.
-  A topological loss function for deep-learning based image segmentation using persistent homology. arXiv preprint arXiv:1910.01877. Cited by: §4.
-  (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §1.
-  (1997) Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence 89 (1-2), pp. 31–71. Cited by: §1.
-  (2018) C-wsl: count-guided weakly supervised localization. In ECCV, Cited by: §1, §1.
-  (2015) Extremely overlapping vehicle counting. In IbPRIA, Cited by: §1, §3, §3.
-  (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1, §3.
-  (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §3.
-  (2018) Where are the blobs: counting by localization with point supervision. In ECCV, Cited by: §1, §1, §1, §2.4, Table 1.
-  (2019) Instance segmentation with point supervision. arXiv preprint arXiv:1906.06392. Cited by: §1, §1.
-  (2019) Where are the masks: instance segmentation with image-level supervision. In BMVC, Cited by: §1.
-  (2010) Learning to count objects in images. In NIPS, Cited by: §1, §1.
-  (2016) Weakly supervised object localization with progressive domain adaptation. In CVPR, Cited by: §1.
-  CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In CVPR, Cited by: §1, §1, §2.4, §3.
-  (2020) Stochastic polyak step-size for sgd: an adaptive learning rate for fast convergence. arXiv preprint arXiv:2002.10542. Cited by: §3.
-  (2006) Efficient non-maximum suppression. In ICPR, Cited by: §2.3.
-  (2016) Towards perspective-free object counting with deep learning. In ECCV, Cited by: §1.
-  (2016) You only look once: unified, real-time object detection. In CVPR, Cited by: §1.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NIPS, Cited by: §1, §1.
-  (2020) Embedding propagation: smoother manifold for few-shot classification. arXiv preprint arXiv:2003.04151. Cited by: §4.
-  (2017) Multiple instance detection network with online instance classifier refinement. In CVPR, Cited by: §1.
-  (2018) Weakly supervised region proposal network and object detection. In ECCV, Cited by: §2.1.
-  (2013) Selective search for object recognition. In ICCV, Cited by: §1, §2.3.
-  (2020) Adaptive gradient methods converge faster with over-parameterization (and you can do a line-search). arXiv preprint arXiv:2006.06835. Cited by: §3.
-  (2019) Painless stochastic gradient: interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems, pp. 3732–3745. Cited by: §3.
-  Manifold mixup: better representations by interpolating hidden states. In International Conference on Machine Learning, pp. 6438–6447. Cited by: §4.
-  (2019) C-mil: continuation multiple instance learning for weakly supervised object detection. In CVPR, Cited by: §1.
-  (2016) Single-image crowd counting via multi-column convolutional neural network. In CVPR, Cited by: §1.
-  (2018) Weakly supervised instance segmentation using class peak response. In CVPR, Cited by: §2.1.