Official code for "Object counting and instance segmentation with image-level supervision", in CVPR 2019 and TPAMI 2020
Common object counting in a natural scene is a challenging problem in computer vision with numerous real-world applications. Existing image-level supervised common object counting approaches only predict the global object count and rely on additional instance-level supervision to also determine object locations. We propose an image-level supervised approach that provides both the global object count and the spatial distribution of object instances by constructing an object category density map. Motivated by psychological studies, we further reduce image-level supervision using limited object count information (up to four). To the best of our knowledge, we are the first to propose image-level supervised density map estimation for common object counting and to demonstrate its effectiveness in image-level supervised instance segmentation. Comprehensive experiments are performed on the PASCAL VOC and COCO datasets. Our approach outperforms existing methods, including those using instance-level supervision, on both datasets for common object counting. Moreover, our approach improves state-of-the-art image-level supervised instance segmentation with a relative gain of 17.8% in terms of average best overlap, on the PASCAL VOC 2012 dataset.
Common object counting, also referred to as generic object counting, is the task of accurately predicting the number of different object category instances present in natural scenes (see Fig. 1). The common object categories in natural scenes can vary from fruits to animals, and the counting must be performed in both indoor and outdoor scenes (e.g., the COCO or PASCAL VOC datasets). Existing works employ a localization-based strategy or utilize regression-based models directly optimized to predict the object count, where the latter has been shown to provide superior results. However, regression-based methods only predict the global object count without determining object locations. Besides the global count, the spatial distribution of objects in the form of a per-category density map is helpful in other tasks, e.g., to delineate adjacent objects in instance segmentation (see Fig. 2).
The problem of density map estimation to preserve the spatial distribution of people is well studied in crowd counting [3, 22, 31, 16, 18]. Here, the global count for the image is obtained by summing over the predicted density map. Standard crowd density map estimation methods are required to predict large person counts in the presence of occlusions, e.g., in surveillance applications. The key challenges of constructing a density map in natural scenes are different from those in crowd density estimation, and include large intra-class variations in generic objects, the co-existence of multiple instances of different objects in a scene (see Fig. 1), and sparsity due to many objects having a zero count in many images.
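As a concrete illustration of the density-map counting principle above, the following sketch (numpy-based, with hypothetical toy values) sums a per-category density map to obtain the global count:

```python
import numpy as np

# Hedged sketch: given a per-category density map D of spatial size H x W,
# the global count for that category is the sum of the density map, as in
# crowd density estimation. The map below is a toy example, not real output.
def global_count(density_map):
    """Sum a per-category density map to obtain the predicted global count."""
    return float(np.sum(density_map))

D = np.zeros((6, 6))
D[1, 1] = 1.0   # unit mass on a first instance
D[4, 4] = 1.0   # unit mass on a second instance
print(global_count(D))  # 2.0
```

Because the count is a simple summation, the same operation applied to any sub-region of the map yields a local count, which is what makes the density map useful beyond global counting.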
Most methods for crowd density estimation use instance-level (point-level or bounding box) supervision, which requires manual annotation of each instance location. Image-level supervised training alleviates the need for such user-intensive annotation by requiring only the count of different object instances in an image. We propose an image-level supervised density map estimation approach for natural scenes that predicts the global object count while preserving the spatial distribution of objects.
Even though image-level supervised object counting reduces the burden of human annotation and is much weaker than instance-level supervision, it still requires each object instance to be counted sequentially. Psychological studies [12, 20, 6, 2] have suggested that humans are capable of counting objects non-sequentially using holistic cues for small object counts, termed the subitizing range (generally 1-4). We utilize this property to further reduce image-level supervision by only using object count annotations within the subitizing range. For short, we call this image-level lower-count (ILC) supervision. Chattopadhyay et al. also investigate common object counting, where object counts (both within and beyond the subitizing range) are used to predict the global object count. Alternatively, instance-level (bounding box) supervision is used to count objects by dividing an image into non-overlapping regions, assuming each region count falls within the subitizing range. Different from these strategies, our ILC supervised approach requires neither bounding box annotation nor information beyond the subitizing range to predict both the count and the spatial distribution of object instances.
In addition to common object counting, the proposed ILC supervised density map estimation is suitable for other scene understanding tasks. Here, we investigate its effectiveness for image-level supervised instance segmentation, where the task is to localize each object instance with pixel-level accuracy, given only image-level category labels. A recent approach, referred to as peak response map (PRM), tackles the problem by boosting the local maxima (peaks) in the class response maps of an image classifier using a peak stimulation module. A scoring metric is then used to rank off-the-shelf object proposals [25, 21] corresponding to each peak for instance mask prediction. However, PRM struggles to delineate spatially adjacent object instances from the same object category (see Fig. 2(b)). We introduce a penalty term into the scoring metric that assigns a higher score to object proposals with a predicted count of one, providing improved results (Fig. 2(c)). The predicted count is obtained by accumulating the density map over the entire object proposal region (Fig. 2(d)).
We propose an ILC supervised density map estimation approach for common object counting. A novel loss function is introduced to construct per-category density maps with explicit terms for predicting the global count and the spatial distribution of objects. We also demonstrate the applicability of the proposed approach for image-level supervised instance segmentation. For common object counting, our ILC supervised approach outperforms state-of-the-art instance-level supervised methods with relative gains of 6.4% and 2.9%, respectively, in terms of mean root mean square error (mRMSE), on the PASCAL VOC 2007 and COCO datasets. For image-level supervised instance segmentation, our approach improves the state of the art from 37.6 to 44.3 in terms of average best overlap (ABO), on the PASCAL VOC 2012 dataset.
Chattopadhyay et al. investigated regression-based common object counting, using image-level (per-category count) and instance-level (bounding box) supervision. The image-level supervised strategy, denoted as glancing, used count annotations from both within and beyond the subitizing range to predict the global count of objects, without providing information about their locations. The instance-level (bounding box) supervised strategy, denoted as subitizing, estimated large object counts by dividing an image into non-overlapping regions, assuming the object count in each region falls within the subitizing range. Instead, our ILC supervised approach requires neither bounding box annotation nor count information beyond the subitizing range during training. It then predicts the global object count, even beyond the subitizing range, together with the spatial distribution of object instances.
Recently, Laradji et al. proposed a localization-based counting approach, trained using instance-level (point) supervision. During inference, the model outputs blobs indicating the predicted locations of objects of interest and estimates object counts from these blobs. In contrast, our approach is image-level supervised and directly predicts the object count through a simple summation of the density map, without requiring any post-processing. Regression-based methods generally perform well in the presence of occlusions [4, 15], while localization-based counting approaches [14, 9] generalize well with a limited number of training images [15, 14]. Our method aims to combine the advantages of both approaches through a novel loss function that jointly optimizes the network to predict object locations and global object counts in a density map.
Reducing object count supervision for salient object subitizing has also been investigated. However, that task is class-agnostic and subitizing is used only to count within the subitizing range. Instead, our approach constructs category-specific density maps and accurately predicts object counts both within and beyond the subitizing range. Common object counting has previously been used to improve object detection [4, 8]. These approaches only use the count information during detector training, with no explicit component for count prediction. In contrast, our approach explicitly learns to predict the global object count.
Here, we present our image-level lower-count (ILC) supervised density map estimation approach. Our approach is built upon an ImageNet pre-trained network backbone (ResNet50). The proposed network architecture has two output branches: an image classification branch and a density branch (see Fig. 3). The image classification branch estimates the presence or absence of objects, whereas the density branch predicts the global object count and the spatial distribution of object instances by constructing a density map. We remove the global pooling layer from the backbone and adapt the fully connected layer into a 1x1 convolution with 2P output channels, which are divided equally between the image classification and density branches. We then add a 1x1 convolution with C output channels in each branch, resulting in a fully convolutional network. Here, C is the number of object categories and P is empirically set to be proportional to C. In each branch, the final convolution is preceded by a batch normalization and a ReLU layer. The first branch provides C object category maps and the second branch produces a density map for each object category.
Let I be a training image and t = (t_1, ..., t_C) be the corresponding vector of ground-truth counts for the C object categories. Instead of using an absolute object count, we employ a lower-count strategy to reduce the amount of image-level supervision. Given an image I, object categories are divided into three non-overlapping sets based on their respective instance counts. The first set, A, contains object categories that are absent from I (i.e., t_c = 0). The second set, S, contains categories within the subitizing range (i.e., 1 <= t_c <= 4). The final set, S~, contains categories beyond the subitizing range (i.e., t_c >= 5).
Let M denote the C object category maps produced by the image classification branch and D the C density maps produced by the density branch, where each map has spatial size H x W. The image classification and density branches are jointly trained, in an end-to-end fashion, given only ILC supervision, with the following loss function:

L = L_class + L_spatial + L_global,    (1)

where the three terms denote the image classification, spatial, and global losses, respectively.
Generally, training a density map requires instance-level supervision, such as point-level annotations. Such information is unavailable in our ILC supervised setting. To address this issue, we propose to generate pseudo ground-truth by exploiting the coarse-level localization capabilities of an image classifier [23, 32] via object category maps. These object category maps are generated from the fully convolutional architecture shown in Fig. 3.
While indicating classification confidence at each image location, class activation maps (CAMs) struggle to delineate multiple instances from the same object category [23, 32]. Recently, the local maxima of CAMs have been further boosted during image-classifier training to produce object category maps for image-level supervised instance segmentation. The boosted local maxima aim to fall on distinct object instances; for details on boosting local maxima, we refer to the PRM approach. Here, we use the local maxima locations to generate pseudo ground-truth for training the density branch.
As described earlier, object categories in I are divided into three non-overlapping sets: A, S, and S~. To train a one-versus-rest image classifier, we derive binary labels from the counts that indicate the presence or absence of each object category. A peak map is derived from each object category map by retaining only the local maxima: a location keeps its value if it is the maximum within a local window of radius r centred at that location, and is set to zero otherwise. Here, r is the radius for the local-maxima computation, and the local maxima are searched at all spatial locations with a stride of one. To train the image classifier, the class confidence score of an object category is computed as the average of the non-zero elements of its peak map. In this work, we use the multi-label soft-margin loss for binary classification.
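The peak-map extraction and class confidence score described above can be sketched as follows; the window radius r = 1 and the toy map values are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Hedged sketch: keep only local maxima of a category map within a
# (2r+1) x (2r+1) window, searched at every location with stride one.
def peak_map(category_map, r=1):
    H, W = category_map.shape
    peaks = np.zeros_like(category_map)
    for i in range(H):
        for j in range(W):
            window = category_map[max(0, i - r):i + r + 1,
                                  max(0, j - r):j + r + 1]
            if category_map[i, j] == window.max():
                peaks[i, j] = category_map[i, j]
    return peaks

# Class confidence score: average of the non-zero peak-map elements.
def class_confidence(peaks):
    nonzero = peaks[peaks > 0]
    return float(nonzero.mean()) if nonzero.size else 0.0

M = np.array([[0.1, 0.2, 0.1],
              [0.2, 0.9, 0.2],
              [0.1, 0.2, 0.1]])
P = peak_map(M)          # only the centre value 0.9 survives as a peak
print(class_confidence(P))
```

In this toy map only the centre location is a local maximum, so the class confidence score equals that single peak value.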
The image classification branch described above predicts the presence or absence of objects using the class confidence scores derived from the peak map. However, it struggles to differentiate between multiple objects and single object parts due to the lack of prior information about the number of object instances (see Fig. 2(b)). This causes a large number of false positives in the peak map. Here, we utilize the count information and introduce a pseudo ground-truth generation scheme that prevents training the density map at those false positive locations.
When constructing a density map, it is desired to estimate accurate object counts at any image sub-region. Our spatial loss term in Eq. 1 ensures that individual object instances are localized while the global term constrains the global object count to that of the ground-truth. This enables preservation of the spatial distribution of object counts in a density map. Later, we show that this property helps to improve instance segmentation.
Spatial Loss: The spatial loss is divided into a positive term, which enhances the peaks corresponding to instances of object categories within the subitizing range (the set S), and a negative term, which suppresses false positives for categories absent from the image (the set A). Due to the unavailability of the absolute object count, the beyond-subitizing set S~ is not used in the spatial loss and is treated separately later. To enable ILC supervised density map training, we generate a pseudo ground-truth binary mask from the peak map.
Pseudo Ground-truth Generation: To compute the positive spatial loss term, a pseudo ground-truth is generated for the set S. For each object category c in S, the t_c-th highest value of its peak map is computed using the heap-max algorithm. This value is then used to threshold the peak map into a pseudo ground-truth binary mask B^c: locations whose peak value reaches the t_c-th highest peak are set to one, and all remaining locations to zero (a unit step function applied to the thresholded peak map). Although the non-zero elements of the pseudo ground-truth mask B^c indicate object locations, its zero elements do not necessarily correspond to background. Therefore, we construct a masked density map that excludes density map values at locations where B^c is zero. Those density map values must also be excluded during the loss computation (Eq. 4) and backpropagation (see Sec. 3.2), due to the risk of introducing false negatives. This is achieved by computing the Hadamard product between the density map and B^c.
The positive spatial loss for object categories within the subitizing range is computed between the masked density map and the pseudo ground-truth mask B^c using a logistic binary cross-entropy (logistic BCE) loss for positive ground-truth labels. The logistic BCE loss passes the network prediction through a sigmoid activation layer and computes the standard BCE loss. The loss is normalized by the cardinality of the set S and by the sum-norm of B^c, where the sum-norm of a matrix is obtained by summing all of its elements; for an H x W matrix B, it can be written as 1_H^T B 1_W, where 1_H and 1_W are all-ones vectors of size H and W, respectively. The t_c highest peaks are assumed to fall on instances of object category c. Due to the unavailability of ground-truth object locations, we rely on this assumption and observe that it holds in most scenarios.
The positive spatial loss enhances peaks corresponding to instances of object categories within the subitizing range. However, false positives of the density map for categories absent from the image are not penalized by this loss. We therefore introduce a negative term into the loss function to address these false positives. For absent categories, positive activations of the density map indicate false detections. A zero-valued mask is used as ground-truth to suppress such false detections, again using the logistic BCE loss.
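The logistic BCE computation underlying both spatial terms can be sketched as follows; this is a generic sigmoid-plus-BCE implementation, not the exact normalization used in the paper:

```python
import numpy as np

# Hedged sketch: logistic BCE = sigmoid activation followed by standard
# binary cross entropy, averaged over all map locations.
def logistic_bce(pred, target):
    p = 1.0 / (1.0 + np.exp(-pred))      # sigmoid of the raw prediction
    eps = 1e-12                           # numerical safety for log(0)
    return float(-np.mean(target * np.log(p + eps) +
                          (1 - target) * np.log(1 - p + eps)))

# Positive term: masked density vs. the binary pseudo ground-truth mask.
# Negative term: the full density map vs. an all-zero mask.
```

A strongly positive prediction against a target of one drives the loss towards zero, while positive activations against the zero mask are penalized, which is exactly how the negative term suppresses false detections.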
Though the spatial loss ensures the preservation of spatial distribution of objects, only relying on local information may result in deviations in the global object count.
Global Loss: The global loss penalizes the deviation of the predicted count from the ground-truth. It has two components: a ranking loss for object categories beyond the subitizing range (the set S~) and a mean-squared error (MSE) loss for the remaining categories. The MSE loss penalizes the predicted density map if the global count prediction does not match the ground-truth count. Here, the predicted count for a category is the accumulation of its density map over the entire spatial region. Note that object categories in S~ are not considered in the computation of the spatial loss or the MSE loss. We therefore introduce a ranking loss with a zero margin that penalizes under-counting for object categories within S~: the density branch is penalized whenever the predicted object count is less than 5, the start of the beyond-subitizing range.
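The zero-margin ranking loss described above can be sketched as follows; the value t_hat = 5 reflects the subitizing range of 1-4 stated earlier, and the predicted counts are illustrative:

```python
# Hedged sketch: a zero-margin ranking loss that penalizes only
# under-counting for categories beyond the subitizing range.
def ranking_loss(predicted_count, t_hat=5.0):
    """Penalty is the shortfall below t_hat; zero once the count reaches it."""
    return max(0.0, t_hat - predicted_count)

print(ranking_loss(3.2))  # 1.8 -> under-counting is penalized
print(ranking_loss(7.0))  # 0.0 -> counts beyond the range are not penalized
```

Note that the loss is one-sided: over-counting beyond t_hat is never penalized by this term, which is why it is later down-weighted in the mini-batch loss.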
Within the subitizing range, the spatial loss term is optimized to locate object instances while the global MSE loss is optimized to accurately predict the corresponding global count. Due to the joint optimization of both terms within the subitizing range, the network learns to correlate the located objects with the global count. As a result, the network is able to locate object instances even beyond the subitizing range (see Fig. 2). Additionally, the ranking loss term in the proposed loss function ensures that under-counting beyond the subitizing range is penalized.
Mini-batch Loss: Normalized loss terms are computed by averaging the respective losses over all images in the mini-batch. The spatial loss is the sum of its positive and negative terms. For categories beyond the subitizing range, the ranking loss can lead to over-estimation of the count. Hence, the global loss is computed by assigning a relatively lower weight, lambda, to the ranking term (see Table 2), i.e., L_global = L_MSE + lambda * L_rank.
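The mini-batch combination of the global loss components can be sketched as follows; the weight lam is a placeholder for the down-weighting factor discussed above, and the per-image loss values are illustrative:

```python
# Hedged sketch: average the MSE and ranking terms over the mini-batch and
# combine them, down-weighting the ranking term to avoid over-estimation.
def global_loss(mse_terms, rank_terms, lam=0.1):
    """mse_terms / rank_terms: per-image loss values; lam: placeholder weight."""
    mse = sum(mse_terms) / len(mse_terms)
    rank = sum(rank_terms) / len(rank_terms)
    return mse + lam * rank
```

For example, with per-image MSE losses [1.0, 3.0], ranking losses [2.0, 0.0], and lam = 0.5, the combined global loss is 2.0 + 0.5 * 1.0 = 2.5.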
Our network is trained in two stages. In the first stage, the density branch is trained with only the MSE and ranking losses. The spatial loss in Eq. 1 is excluded in the first stage, since it requires pseudo ground-truth generated from the image classification branch. The second stage includes the spatial loss.
Backpropagation: We use the pseudo ground-truth mask derived from the image classification branch to train the density branch. Therefore, backpropagation of gradients through this mask to the classifier branch is not required (shown with green arrows in Fig. 3). The image classification branch is backpropagated as in PRM. In the density branch, the Hadamard product of the density map with the pseudo ground-truth mask (Eq. 3) is used to compute the positive spatial loss for categories within the subitizing range; the corresponding gradients for the last convolution layer of the density branch are therefore masked in the same way. Since the MSE, ranking, and negative spatial losses are computed with standard MSE, ranking, and logistic BCE losses on convolution outputs, their respective gradients are computed using off-the-shelf PyTorch implementations.
Inference: The image classification branch outputs a class confidence score for each class, indicating the presence or absence of the corresponding object category (thresholded at zero). The predicted count for a present category is obtained by summing its density map over the entire spatial region. The proposed approach only utilizes subitizing-range annotations (counts up to four) and accurately predicts object counts both within and beyond the subitizing range (see Fig. 6).
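The inference procedure above can be sketched as follows; the confidence threshold of zero follows the text, while the toy density values are illustrative:

```python
import numpy as np

# Hedged sketch of inference: a category is predicted present when its class
# confidence score exceeds the threshold, and its count is the rounded sum
# of the corresponding density map.
def infer(confidence, density_map, threshold=0.0):
    present = confidence > threshold
    count = int(round(float(np.sum(density_map)))) if present else 0
    return present, count

D = np.full((4, 4), 3.0 / 16.0)   # density mass of 3 spread over the map
print(infer(1.2, D))               # (True, 3)
```

Absent categories are assigned a count of zero regardless of the density map, so the classification branch acts as a gate on the density branch.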
The proposed ILC supervised density map estimation approach can also be utilized for instance segmentation. Note that the local summation of an ideal density map over a ground-truth segmentation mask is one. We use this property to improve the state-of-the-art image-level supervised instance segmentation approach (PRM). PRM employs a scoring metric that combines instance-level cues from peak response maps, class-aware information from object category maps, and spatial continuity priors from off-the-shelf object proposals [25, 21]. Here, the peak response maps are generated from the local maxima through a peak back-propagation process. The scoring metric is then used to rank object proposals corresponding to each peak for instance mask prediction. We improve the scoring metric by introducing an additional penalty term. This term penalizes an object proposal if the predicted count over the proposal region, obtained by accumulating the density map over the proposal, deviates from one; the penalty is the absolute difference between this predicted count and one. For each peak, the new scoring metric selects the highest-scoring object proposal.
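The density-based penalty added to the scoring metric can be sketched as follows; base_score stands in for the original PRM score and alpha is a placeholder weight, both illustrative assumptions:

```python
import numpy as np

# Hedged sketch: a proposal whose accumulated density deviates from one
# (i.e., does not cover exactly one instance) is penalized.
def proposal_score(base_score, density_map, proposal_mask, alpha=1.0):
    count_in_proposal = float(np.sum(density_map * proposal_mask))
    return base_score - alpha * abs(count_in_proposal - 1.0)

# Toy density map with two unit-mass instances.
D = np.zeros((4, 4))
D[0, 0] = 1.0
D[3, 3] = 1.0
single = np.zeros((4, 4)); single[:2, :2] = 1.0   # covers one instance
double = np.ones((4, 4))                           # covers both instances
s1 = proposal_score(1.0, D, single)                # count 1 -> no penalty
s2 = proposal_score(1.0, D, double)                # count 2 -> penalized
```

The proposal covering a single instance keeps its full score, while the proposal spanning two adjacent instances is penalized, which is how the density map helps delineate spatially adjacent objects.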
Implementation details: Throughout our experiments, we fix the training parameters. The pre-trained ResNet-50 backbone is trained with a lower initial learning rate than the image classification and density branches. The number of input channels of the 1x1 convolution in each branch is set as described in Sec. 3. A mini-batch size of 16 is used for the SGD optimizer, with a momentum of 0.9 and weight decay. Considering the high imbalance between non-zero and zero counts in the COCO dataset (e.g., 79 negative categories for each positive category), only 10% of the samples in the absent-category set are used to train the density branch. Code will be made public upon publication.
We evaluate common object counting on the PASCAL VOC 2007 and COCO datasets. For a fair comparison, we employ the same splits, named count-train, count-val and count-test, as used in state-of-the-art methods. For the COCO dataset, the training set is used as count-train, the first half of the validation set as count-val and its second half as count-test. On the PASCAL VOC 2007 dataset, we evaluate against the count of non-difficult instances in count-test.
For instance segmentation, we train and report results on the PASCAL VOC 2012 dataset, following the same protocol as PRM.
Evaluation Criteria: The predicted count is rounded to the nearest integer. We evaluate common object counting, as in [4, 14], using the root mean squared error (RMSE) metric and its three variants, namely RMSE non-zero (RMSE-nz), relative RMSE (relRMSE) and relative RMSE non-zero (relRMSE-nz). The RMSE and relRMSE errors for category c are computed as sqrt((1/T) * sum_i (c^_i - c_i)^2) and sqrt((1/T) * sum_i (c^_i - c_i)^2 / (c_i + 1)), respectively. Here, T is the total number of images in the test set and c^_i, c_i are the predicted and ground-truth counts for image i. The errors are then averaged across all categories to obtain mRMSE and m-relRMSE on a dataset. The above metrics are also evaluated only on ground-truth instances with non-zero counts, giving mRMSE-nz and m-relRMSE-nz. For all error metrics, smaller numbers indicate better performance. For instance segmentation, performance is evaluated using Average Best Overlap (ABO) and mean average precision, as in PRM, with intersection-over-union (IoU) thresholds of 0.25, 0.5 and 0.75.
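The per-category counting metrics above can be sketched as follows; the predicted/ground-truth count values are illustrative, and the relRMSE denominator follows the (c_i + 1) form given in the text:

```python
import numpy as np

# Hedged sketch of the per-category counting metrics: predictions are
# rounded to the nearest integer before computing RMSE and relRMSE.
def rmse(pred, gt):
    pred = np.rint(np.asarray(pred, float))
    gt = np.asarray(gt, float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def rel_rmse(pred, gt):
    pred = np.rint(np.asarray(pred, float))
    gt = np.asarray(gt, float)
    return float(np.sqrt(np.mean((pred - gt) ** 2 / (gt + 1.0))))

print(rmse([1.2, 3.0], [1, 5]))  # errors 0 and 2 -> sqrt(2)
```

mRMSE and m-relRMSE would then be the averages of these per-category values across all object categories.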
Supervision Levels: The level of supervision is indicated as SV in Tab. 3 and 4. BB indicates bounding box supervision and PL indicates point-level supervision for each object instance. Image-level supervised methods using only within subitizing range counts are denoted as ILC, while the methods using both within and beyond subitizing range counts are indicated as IC.
Ablation Study: We perform an ablation study on the PASCAL VOC 2007 count-test. First, the impact of our two-branch architecture is analyzed by comparing it with two baselines: class-activation based regression (CAM+regression) and peak-based regression (Peak+regression) using the local-maximum boosting approach of PRM. Both baselines are obtained by end-to-end training of the network, employing the same backbone, using an MSE loss function to directly predict the global count. Tab. 1 shows the comparison. Our approach largely outperforms both baselines, highlighting the importance of a two-branch architecture with explicit terms in the loss function to preserve the spatial distribution of objects. Next, we evaluate the contribution of each term in our loss function towards the final count performance.
Fig. 4 shows the systematic improvement in density map quality (top row: person; bottom row: bicycle) with the incremental addition of (c) the spatial and (d) the ranking loss terms to (b) the MSE loss term. Similar to CAM, the density branch trained with the MSE loss alone gives a coarse location of object instances. However, many background pixels are identified as part of the object (false positives), resulting in an inaccurate spatial distribution of object instances. Further, this inclusion of false positives prevents the delineation of multiple object instances. Incorporating the spatial loss term improves the spatial distribution of objects in both density maps. The density maps are further improved by incorporating the ranking term, which penalizes the under-estimation of counts beyond the subitizing range (top row). Moreover, it also helps to reduce false positives within the subitizing range (bottom row). Tab. 2 shows the systematic improvement, in terms of mRMSE and mRMSE-nz, when integrating different terms in our loss function. The best results are obtained when integrating all three terms (classification, spatial and global). We also evaluate the influence of the weight that controls the relative contribution of the ranking loss; we observe that a small value provides the best results and fix it for all datasets.
State-of-the-art Comparison: Tabs. 3 and 4 show state-of-the-art comparisons for common object counting on the PASCAL VOC 2007 and COCO datasets, respectively. On the PASCAL VOC 2007 dataset (Tab. 3), the glancing approach (glance-noft-2L), using image-level supervision both within and beyond the subitizing range (IC), is the closest in supervision level to ours. Our ILC supervised approach considerably outperforms glance-noft-2L with an absolute gain of 21% in mRMSE. Furthermore, our approach achieves consistent improvements on all error metrics, compared to state-of-the-art point-level and bounding box supervised methods.
Tab. 4 shows the results on the COCO dataset. Among existing methods, the baselines are the two BB supervised approaches (Seq-sub-ft-3x3 and ens), the PL supervised LC-ResFCN approach, and the IC supervised glancing approach (glance-noft-2L). Our approach outperforms the glancing approach with an absolute gain of 8% in mRMSE. Furthermore, our approach also provides consistent improvements over the glancing approach in the other three error metrics and is only below the two BB supervised methods (Seq-sub-ft-3x3 and ens) in m-relRMSE-nz. Fig. 5 shows object counting examples using our approach and the point-level (PL) supervised method. Our approach performs accurate counting on various categories (from fruits to animals) under heavy occlusion. Fig. 6 compares counting performance in terms of RMSE, across all categories, on the COCO count-test; the x-axis shows different ground-truth count values. We compare with different IC, BB and PL supervised methods [4, 14]. Our approach achieves superior results for all count values compared to the glancing method, despite not using beyond-subitizing-range annotations during training. Furthermore, we perform favourably compared to other methods using higher supervision.
Evaluation of the density map: We employ the standard grid average mean absolute error (GAME) metric from crowd counting to evaluate the spatial distribution consistency of the density map. In GAME(n), an image is divided into 4^n non-overlapping grid cells, and the mean absolute error (MAE) between the predicted and ground-truth local counts is accumulated over the cells; we report GAME(n) for several values of n, up to n = 3. We compare our approach with the state-of-the-art PL supervised counting approach (LCFCN) on the 20 categories of the PASCAL VOC 2007 count-test set. Furthermore, we also compare with the recent crowd counting approach CSRNet on the person category of PASCAL VOC 2007 by retraining it on the dataset. The proposed method outperforms both the PL supervised LCFCN and CSRNet on GAME(3) for the person category, demonstrating the capability of our approach to capture the precise spatial distribution of object counts. Moreover, our method outperforms LCFCN for all 20 categories.
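The GAME metric described above can be sketched per image as follows; the toy density maps are illustrative, and the per-image errors would be averaged over the test set to obtain the reported MAE:

```python
import numpy as np

# Hedged sketch of GAME(n) for one image: split the map into 4^n cells
# (2^n per side) and accumulate the absolute local count errors.
def game(pred_density, gt_density, n):
    H, W = pred_density.shape
    k = 2 ** n
    err = 0.0
    for i in range(k):
        for j in range(k):
            ps = pred_density[i * H // k:(i + 1) * H // k,
                              j * W // k:(j + 1) * W // k].sum()
            gs = gt_density[i * H // k:(i + 1) * H // k,
                            j * W // k:(j + 1) * W // k].sum()
            err += abs(ps - gs)
    return float(err)

# Same global count, but the mass sits in different corners:
pred = np.zeros((4, 4)); pred[0, 0] = 1.0
gt = np.zeros((4, 4)); gt[3, 3] = 1.0
print(game(pred, gt, 0), game(pred, gt, 1))  # 0.0 2.0
```

GAME(0) reduces to the global absolute count error, while larger n increasingly penalizes predictions whose count is right but whose spatial distribution is wrong, which is exactly the property being evaluated here.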
Finally, we evaluate the effectiveness of our density map in improving the state-of-the-art image-level supervised instance segmentation approach (PRM) on the PASCAL VOC 2012 dataset (see Sec. 3.3). In addition to PRM, the image-level supervised object detection methods MELM, CAM and SPN, used with MCG masks, are also included in Tab. 5.
The proposed method largely outperforms all the baseline approaches in all four evaluation metrics. Even though our approach marginally increases the level of supervision (lower-count information), it improves the state-of-the-art PRM with a relative gain of 17.8% in terms of average best overlap (ABO). Compared to PRM, the gain obtained at the lower IoU threshold (0.25) highlights the improved location prediction capabilities of the proposed method. Furthermore, the gain obtained at the higher IoU threshold (0.75) indicates the effectiveness of the proposed scoring function in assigning higher scores to the object proposal with the highest overlap with the ground-truth object, as also reflected in the improved ABO performance. Fig. 7 shows a qualitative instance segmentation comparison between our approach and PRM.
We proposed an ILC supervised density map estimation approach for common object counting in natural scenes. Different to existing methods, our approach provides both the global object count and the spatial distribution of object instances with the help of a novel loss function. We further demonstrated the applicability of the proposed density map in instance segmentation. Our approach outperforms existing methods for both common object counting and image-level supervised instance segmentation.
The role of pattern recognition in children's exact enumeration of small numbers. British Journal of Developmental Psychology, 32(2), 2014.
CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In CVPR, 2018.
Is object localization for free? Weakly-supervised learning with convolutional neural networks. In CVPR, 2015.
Learning fine-grained image similarity with deep ranking. In CVPR, 2014.
Learning deep features for discriminative localization. In CVPR, 2016.