
A Few-Shot Sequential Approach for Object Counting

In this work, we address the problem of few-shot multi-class object counting with point-level annotations. The proposed technique leverages a class-agnostic attention mechanism that sequentially attends to objects in the image and extracts their relevant features. This process is employed on an adapted prototypical-based few-shot approach that uses the extracted features to classify each one either as one of the classes present in the support set images or as background. The proposed technique is trained on point-level annotations and uses a novel loss function that disentangles class-dependent and class-agnostic aspects of the model to help with the task of few-shot object counting. We present our results on a variety of object-counting/detection datasets, including FSOD and MS COCO. In addition, we introduce a new dataset that is specifically designed for weakly supervised multi-class object counting/detection and contains considerably different classes and distribution of number of classes/instances per image compared to the existing datasets. We demonstrate the robustness of our approach by testing our system on a totally different distribution of classes from what it has been trained on.



1 Introduction

Object counting is an important task in computer vision motivated by a variety of applications such as traffic monitoring, wildlife conservation and retail inventory tracking. Several methods focus on counting objects of a single class, such as people [36, 60, 73, 75], cars [11, 48] or cells [25, 49, 70]. However, multi-class counting methods are more relevant to real-world applications such as counting items on supermarket shelves, where several items of multiple categories appear in an image.

While deep learning techniques have revolutionized the field of computer vision in the past decade, the performance of such models often comes at the cost of acquiring large amounts of labelled data. This poses a great challenge for object counting in particular, where acquiring per-item labels with possibly many items per image increases the cost of labelling dramatically. This motivates the development of training strategies that enable the models to recognize and count new categories given only a few labeled images.

Figure 1: Given support and query images of the same task, the model sequentially classifies and pays attention to each object in the image.

Unlike most existing deep learning approaches, humans are capable of learning to count new objects from unseen categories relying only on a few examples. Few-shot learning attempts to enable such data efficiency in machine perception, with the goal of training models in low-data regimes where few labelled examples are available for each task. Most existing few-shot learning methods can be categorized into two groups: 1) gradient-based methods [1, 16, 46], which rely on a meta-learner that predicts the parameters of task-specific models, and 2) metric-based methods [29, 62, 64], which learn a similarity measure to compare a query image against a labelled support set. However, most methods focus on image classification, and adapting them to the more complex task of multi-class object counting is nontrivial: object counting requires both localizing and classifying each item instead of merely relying on a global comprehension of the image.

In this work, we address the problem of few-shot multi-class object counting using only point-level annotation, wherein only one pixel from each object is annotated. Such a setup mitigates the challenge of acquiring large amounts of labelled data by reducing not only the number of required training samples but also the cost of labelling each sample.

Experiments on visual cognition in humans suggest that we do not tend to focus our attention on the entire scene at once. Instead, we attend sequentially to different parts in order to extract relevant information [53]. This appears to be particularly effective in higher-level cognitive tasks such as counting objects of multiple categories in a scene, where we tend to zoom in on one object at a time [68]. Similarly, our model uses an attention mechanism that sequentially extracts features of the objects in a query image in a specific order. Those features are then compared to the class prototypes extracted from the support images in order to be classified. We focus on scenarios with average density and a large number of classes per image since, unlike high-density single-class tasks such as crowd counting [60, 73, 75], this setting has rarely been addressed in the existing literature. Our contributions can be summarised as follows.

  1. We propose a novel recurrent attention-based system that sequentially computes one attention map per object in the query image. The maps are then used to weight the feature vectors, which are in turn used to classify each object by comparing it against a set of prototypes extracted from the support images (Figure 1). The labels are sorted in lexicographical order by their coordinates, guiding the model to attend to the objects in the query images in the same order.

  2. We use a novel loss function that consists of a class-agnostic and a class-dependent term. The former helps fit the attention map at each time-step to a Gaussian distribution, hence localizing objects in the image, while the latter encourages the model to classify those items correctly. The ratio of these two losses changes throughout the training, assigning a larger weight to the class-agnostic term at first and exponentially decaying it as the training proceeds.

  3. We introduce a dataset with a distribution of classes and objects that is considerably different from images of natural scenes in publicly available datasets. The objects in our dataset come from various categories of grocery items such as sodas, canned food, etc. The average density of objects and the variety of classes in each image make the dataset suitable for evaluating and benchmarking few-shot techniques.

2 Related Work

Given that our contributions relate both to multi-class object counting and to few-shot learning, in this section we briefly discuss the existing approaches in both domains.

2.1 Object counting

Object counting methods can be roughly divided into two categories: detection-based and regression-based.

Regression-based methods rely on regressors to estimate the object counts. A variety of successful approaches, from heuristic-based to deep learning methods, belong to this category [4, 5, 8, 10, 30, 39, 41, 43, 55, 56, 59, 76], with Glance [6] and density-based methods being among the most successful examples. Glance [6] uses image-level labels, i.e. per-class counts, and learns to estimate the global count in a single forward pass. Glance is efficient with small counts; for large counts, however, it employs a "subitizing" technique, which is hard to train and requires bounding box labels. Density-based approaches learn to count by regressing a density map using a least-squares objective and obtain the total count by integrating over the density maps [3, 9, 36, 45, 48, 60, 61, 65, 72, 75]. Some of the density-based methods use a multi-column architecture [48, 58, 75], which introduces redundant structures, while others [60] use a high-level prior to guide the computation of the density maps. These approaches often assume fixed object sizes defined by Gaussian kernels or constrained environments, which works well for crowd counting but not for objects of varying sizes.
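As a concrete illustration of the density-based family discussed above, the sketch below builds a ground-truth density map from point annotations using normalized Gaussian kernels and recovers the count by integration. It is a minimal numpy illustration of the general idea, not code from any of the cited methods.

```python
import numpy as np

def gaussian_kernel(shape, center, sigma):
    """2-D Gaussian centered on an annotation point, normalized to sum to 1."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    g = np.exp(-((ys - center[0]) ** 2 + (xs - center[1]) ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def density_map(shape, points, sigma=4.0):
    """Ground-truth density map: one unit of mass per annotated object."""
    d = np.zeros(shape)
    for p in points:
        d += gaussian_kernel(shape, p, sigma)
    return d

# Three point-annotated objects -> the map integrates to 3.
d = density_map((64, 64), [(10, 10), (30, 40), (50, 20)])
count = d.sum()
```

Because each kernel is normalized after any border truncation, every annotated object contributes exactly one unit of mass to the map.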

Detection-based methods can generalize better to objects of different sizes. These methods first detect the objects and then count the number of detected instances. Most such approaches [40, 52] rely on bounding box labels, which are not only expensive to acquire but also make the model prone to occlusions in densely packed images. To overcome this issue, [18] proposes a soft-IoU layer to estimate the overlap between predicted and ground truth bounding boxes, along with an expectation-maximization unit that clusters the Gaussians into groups to resolve overlap ambiguities. However, the EM-based step slows the model down significantly, while relying on bounding-box annotation makes labelling costly.

[35] propose employing only point-level annotations to train a model that outputs one blob per instance. This approach mitigates the problem of occlusion in detection-based methods by not relying on bounding boxes. From the object localization perspective, our approach is closely related to the detection-based approaches, and following [35] we rely only on point-level annotations for our counting model, which are relatively cheap to acquire.

2.2 Few-shot learning

Few-shot learning aims to train models that generalize to new tasks using only a few samples, leveraging prior knowledge. Some early methods follow a Bayesian framework that learns to incorporate a prior such as saliency [15] or strokes and object parts [32, 33, 34]. Image hallucination is used in [19, 67, 74] to augment the training data to better generalize to new classes. Broadly speaking, there are two main categories of few-shot learning approaches: (i) gradient-based and (ii) metric-based.

Gradient-based methods aim at training models that generalize well to new tasks/categories with only a few fine-tuning updates [51]. The model-agnostic meta-learning approach, MAML [16], learns to adapt the weights to a new task in a few gradient steps. Many recent approaches have built upon the success of MAML [17, 23, 27, 31, 37, 46, 47, 54]. These approaches require fine-tuning and additional optimization steps. In contrast, our model addresses unseen tasks in a feed-forward manner following metric-based approaches, thus avoiding further gradient computations and model updates, which could create additional complications for complex tasks such as object counting.

Metric-based methods learn a distance metric to compare query images against the support images. [29] uses a siamese network to capture the similarity between images. [62, 64] propose a matching network that learns a differentiable nearest-neighbor model. [63] present a relation network that learns the optimal distance metric. [26, 57] use graph neural networks to model the relationship between support and query images. Due to the simplicity and adaptability of metric-based approaches, we base our approach on this group of work, and in particular on prototypical networks.


Few-shot counting, detection and segmentation: the majority of approaches in few-shot learning focus on the problem of object classification. A few recent approaches have been devoted to addressing problems such as few-shot object detection [7, 24] and segmentation [12, 22, 66]. One of the approaches most relevant to ours is [50], where given sparse point annotations they extract support and query features, which are in turn fed to a decoder that generates segmentation results. Even though not framed as a few-shot counting approach, [42] proposes a class-agnostic counting approach using a matching network which takes as input a query image and an exemplar patch containing the object of interest. The outputs of the network are then fed into a discriminative classifier. Finally, to adapt the network to novel categories, a small fraction of the learned parameters are fine-tuned using a few labelled examples. To the best of our knowledge, no prior work has been done specifically on few-shot object counting. In the next sections, we define this problem and our proposed approach.

3 Problem Definition

We follow the setup common in few-shot classification and segmentation tasks  [50, 62], where meta-test classes are disjoint from the meta-train classes. We use episodic training, where the input data is divided into mini-batches or episodes. In each episode, the input data is divided into a support set of annotated images or shots used for supervision and a query set of images to perform the task on. All the support and query images within an episode share the same task, namely the same subset of classes from the total classes in the dataset. In this paper, we use training and testing to refer to what the model does within each given episode, while meta-training and meta-testing refer to the process of teaching the model to adapt to new tasks. To summarize, the training proceeds in episodes, where each episode is a mini-batch of meta-train samples all taken from the same task.
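The episodic setup described above can be sketched as follows; the `images_by_task` structure and the function name are hypothetical, chosen only to illustrate how a task, its support shots, and its query images might be drawn for one episode.

```python
import random

def sample_episode(images_by_task, n_support, n_query, rng):
    """Draw one episode: pick a task (a subset of classes), then split its
    images into a support set (used for supervision) and a disjoint query
    set. `images_by_task` is a hypothetical mapping from a task id to the
    ids of images whose objects all come from that task's classes."""
    task = rng.choice(sorted(images_by_task))
    imgs = list(images_by_task[task])
    rng.shuffle(imgs)
    support = imgs[:n_support]
    query = imgs[n_support:n_support + n_query]
    return task, support, query

rng = random.Random(0)
data = {"taskA": [f"a{i}" for i in range(10)],
        "taskB": [f"b{i}" for i in range(10)]}
task, support, query = sample_episode(data, n_support=3, n_query=2, rng=rng)
```

During meta-testing the same sampling is applied to the held-out tasks, so support and query images always share one task within an episode.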

We pose the problem as weakly-supervised few-shot object counting, where only point-level annotations are available, i.e. for each object in the image there is only a single annotated pixel anywhere on the area of the object, with a label indicating its class. In addition, we assume that multiple instances of each class may be present in each image, for both support and query images. We also assume the number of instances per class varies by class and image. Following the notation in [50], a few-shot object counting task with point-level supervision is defined as a set of input-output pairs sampled from a task distribution. The task inputs are the support images, the query images, and, for each support image, an annotation set containing the point and class labels of each of its objects, where each class label is either one of the task's classes or the background class. The target outputs are the (point, class) pairs of the objects in each query image, whose number varies per image.

Figure 2: Architecture of our model. The support set images and the query image are fed to a shared feature extractor. The feature maps from each image are passed through the decoder. The decoder uses a sequential attention mechanism to first generate a weighting map for each object in the image and then use the maps to generate a feature vector for each object. For the support images, instead of using the attention module to generate the weighting maps, the Gaussian maps generated from the label points are used as weighting maps to create prototype feature vectors for each class. Class scores are computed by cross-correlating the query feature vectors with the prototype feature vectors.

4 Proposed Method

We approach the problem of object counting from a rather intuitive point of view that fits the problem nicely into a few-shot framework. We treat object counting as a two-step process: first localizing the objects and distinguishing them from the background, and second classifying each of them as one of the classes in the support set. The localization step is fundamentally independent of the class labels and can therefore be learned from the massive publicly available datasets. This allows the model to be as general as possible and mitigates the data-scarcity problem. The second step, on the other hand, is specific to the classes of interest and has to be learned from the few support images. To explain this further, object counting can be thought of as an image captioning task (similar to [71]) wherein the model describes what is present in the image by paying attention to all of the objects sequentially and in a specific predetermined order, outputting the classes of the objects that it attends to as a description of the image. The architecture of our proposed system is illustrated in Figure 2. Our model consists of a fully-convolutional feature extractor and an attention-based recurrent decoder that sequentially outputs the class of the object to which it is attending. We explain the components of the architecture in detail in the following subsections.

4.1 Feature Extractor

We use ResNet-50 [20] up to the fully-connected layers as the backbone of our feature extractor module. In order to improve the recognition of objects at different scales, we concatenate the features from four different layers of the backbone network (conv2_x, conv3_x, conv4_x and conv5_x, as described in [20]) after up-sampling them to a fixed size, namely a fixed fraction of the original image resolution. To make the model more location-aware, we concatenate the encoded location of each pixel in the feature map to the features extracted from the image. This is done by concatenating the one-hot encodings of the x and y coordinates. The constructed feature maps are then processed by a decoder, explained in the next subsection.
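A minimal sketch of the coordinate-encoding step described above, assuming a (H, W, C) feature map and appending one-hot row/column encodings along the channel axis; the shapes and names are illustrative, not the paper's implementation.

```python
import numpy as np

def add_coordinate_encoding(features):
    """Append one-hot encodings of each pixel's row and column index to a
    (H, W, C) feature map, yielding (H, W, C + H + W)."""
    H, W, C = features.shape
    rows = np.repeat(np.eye(H)[:, None, :], W, axis=1)   # (H, W, H): one-hot row index
    cols = np.repeat(np.eye(W)[None, :, :], H, axis=0)   # (H, W, W): one-hot column index
    return np.concatenate([features, rows, cols], axis=-1)

f = np.random.rand(8, 8, 16)
g = add_coordinate_encoding(f)
```

Each pixel thus carries an unambiguous positional signature alongside its appearance features, which helps the decoder follow a spatial ordering of objects.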

4.2 Decoder

Figure 3: Architecture of the decoder. The extracted image features and the previous LSTM state are used to generate an attention map. The generated map is used to weigh the feature maps and compute a feature vector for the current time-step, which is then linearly combined with the predicted class scores at the previous time-step to form the input of the LSTM. The output of the LSTM is then used to compute the current class scores using the class prototypes.

The decoder component of this model, similar to the one proposed in [69], uses the extracted features to sequentially output a class index for each of the objects present in the image. This is done by combining an RNN (specifically an LSTM [21]) and an attention module that interact in a loop (Figure 3). At each time-step t, the attention module generates an attention map a_t by linearly combining the image features F and the previous LSTM state h_{t-1} and passing the output through a non-linearity:

a_t = softmax( w_a^T tanh( W_F F + W_h h_{t-1} ) ),

where W_F, W_h and w_a are trainable weights. The generated attention maps are then used to spatially weigh the image feature maps and thus reduce them into a feature vector z_t:

z_t = \sum_{i,j} a_t[i,j] F[i,j].

This feature vector is then linearly combined with the predicted class-scores vector from the previous time-step, s_{t-1}, to form the next input to the LSTM:

x_t = W_z z_t + W_s s_{t-1},

where W_z and W_s are trainable weights. Finally, the output of the LSTM, h_t = LSTM(x_t, h_{t-1}), is used to generate the class scores for the current time-step:

s_t = softmax( corr(h_t, P) ),

where P denotes the class prototypes described below.
When computing the class prototype features from the support set, instead of using the attention module to generate the weight maps, we use the annotation point of the current object and an estimate of the standard deviation σ (as explained in [75]) to generate a Gaussian kernel G_k. This kernel is centered on the object of interest and is used to weigh the features extracted from the support image. We also generate a weight map for the background class:

G_bg = 1 − Σ_k G_k,

where 1 is an all-ones matrix with the same dimensions as the maps. The feature vectors generated from the support images are averaged over all of the objects of the same class in the support set. The resulting features are used as class prototypes to classify the objects extracted from the query image. In order to construct class-prediction logits for the query objects, the extracted feature vector for each object in the query is cross-correlated with all of the class prototypes from the support set, including the background. All of the query feature vectors and prototype feature vectors are passed through a linear layer prior to cross-correlation. The output of this procedure is a vector with one score per class plus background, which is passed through a softmax to compute class probabilities. It should be noted that using a more sophisticated mechanism for scoring the model outputs against the prototypes is a potential extension of our work.
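The decoder step described in this subsection can be sketched roughly as follows. This is a simplified numpy illustration with invented parameter names and shapes, a plain dot product standing in for the learned cross-correlation, and the LSTM update omitted; it is not the paper's exact implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(F, h_prev, W_F, W_h, w_a, prototypes):
    """One simplified decoder step: (1) an attention map from the image
    features and the previous LSTM state, (2) an attention-weighted
    feature vector, (3) class probabilities against the prototypes."""
    H, W, C = F.shape
    e = np.tanh(F @ W_F + h_prev @ W_h) @ w_a      # (H, W) attention energies
    a = softmax(e.ravel()).reshape(H, W)           # attention map, sums to 1
    z = (F * a[..., None]).sum(axis=(0, 1))        # weighted feature vector, (C,)
    probs = softmax(prototypes @ z)                # one probability per class (+ background)
    return a, z, probs

rng = np.random.default_rng(0)
Hh, Ww, C, D, K, A = 6, 6, 8, 5, 4, 3              # toy sizes: grid, channels, state, classes, attention dim
F = rng.normal(size=(Hh, Ww, C))
a, z, probs = decoder_step(F, rng.normal(size=D),
                           rng.normal(size=(C, A)), rng.normal(size=(D, A)),
                           rng.normal(size=A), rng.normal(size=(K, C)))
```

In the full model, the weighted vector z would also feed the LSTM (together with the previous class scores) so that each step attends to a different object.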

4.3 Training labels

The training labels for each image are a sequence of (point, class) pairs organized in a specific order. The order is arbitrary as long as it is consistent across all the images. We order the object labels from top-left to bottom-right, which is the order in which they would appear if we flattened the image, i.e. lexicographical order by their coordinates. Finally, class indices are assigned randomly for each task (between 0 and C-1).
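The ordering described above amounts to a lexicographic sort on the annotation coordinates; a minimal sketch:

```python
def order_labels(annotations):
    """Sort (point, class) labels top-left to bottom-right: lexicographic
    order on (row, column), i.e. the order in which the points would
    appear if the image were flattened row by row."""
    return sorted(annotations, key=lambda a: (a[0][0], a[0][1]))

labels = [((5, 2), "cup"), ((0, 7), "can"), ((5, 0), "box"), ((0, 3), "can")]
ordered = order_labels(labels)
# top row first (column breaks ties), then lower rows
```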

4.4 Loss

The loss function consists of two terms: a class-agnostic term, which is responsible for the localization of objects by the attention module, and a classification term for classifying the localized objects. For the former, we use the sum of KL-divergences between the generated attention maps for the objects in the query image and the corresponding Gaussian maps centered at the annotation points. For the latter, we use the cross-entropy loss between the predicted class scores and the corresponding labels. For efficient training, these two losses should be carefully weighted with respect to each other. At the beginning of the training, the focus is more on teaching the model to distinguish the objects from the background. As the model gets better at sequentially attending to and localizing the objects, it becomes more important to focus on classification. To encourage this, we use an adaptive weighting scheme to gradually decrease the weight of the class-agnostic loss as the training proceeds. Our experiments confirmed the effectiveness of this strategy. The total loss is given by:


L = Σ_t [ λ_KL(t) · KL(a_t ‖ G_t) + λ_CE(t) · CE(y_t, s_t) ],

where KL stands for the Kullback–Leibler divergence, CE is the cross-entropy loss, t indexes the time-step, a_t and G_t represent the generated attention mask for the query image and the Gaussian kernel from the labels respectively, y_t is the one-hot class label, and λ_KL(t) and λ_CE(t) are the time-varying weights associated with the KL and CE loss terms.
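One plausible way to realize the adaptive weighting described above is an exponential decay of the class-agnostic weight with the training step; the schedule form and constants below are illustrative assumptions, not values from the paper.

```python
import math

def loss_weights(step, lam0=1.0, decay=1e-3):
    """Time-varying loss weights: the class-agnostic (KL) weight starts
    at lam0 and decays exponentially with the training step, while the
    class-dependent (CE) weight stays constant. Constants are illustrative."""
    return lam0 * math.exp(-decay * step), 1.0

def total_loss(kl_terms, ce_terms, step):
    """Weighted sum over time-steps of the per-step KL and CE terms."""
    w_kl, w_ce = loss_weights(step)
    return sum(w_kl * kl + w_ce * ce for kl, ce in zip(kl_terms, ce_terms))

early = total_loss([1.0, 1.0], [2.0, 2.0], step=0)
late = total_loss([1.0, 1.0], [2.0, 2.0], step=5000)
```

Early in training the localization (KL) term dominates; as the step count grows, the same per-step losses yield a total increasingly driven by the classification (CE) term.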

5 Cafeteria: a diverse multi-object counting dataset

Object detection datasets such as PASCAL VOC [13] and COCO [38] have been the most popular datasets used for object counting. The main drawback of these datasets is that most images contain only a small number of instances from few categories. This simplifies the multi-class object counting task, especially in the few-shot context. Additionally, most of the object-counting-specific datasets consist of images with a high density of objects from only a single category: people [4, 8, 75, 76], cars [11, 48], penguins [2] or cells [44, 70]. However, practical multi-class object counting includes more challenging scenarios where each image contains objects from several different classes. In order to address these challenges, more sophisticated datasets are required. As part of our effort to address this problem, we introduce the Cafeteria dataset, a diverse dataset with a suitable number of categories that can be used for few-shot object counting.

The Cafeteria dataset is composed of images of grocery items on shelves, fridges and other surfaces taken with cellphone and security cameras (Figure 5). The images were labelled with point-level annotations by skilled annotators. This is a complex dataset where items appear in different shapes, sizes, colors and orientations with a wide variety of backgrounds. The dataset contains several images where objects are densely-packed, providing samples with stacking and occlusion. Unlike the supermarket dataset presented in [18], our images were taken from significantly different angles and distances, and the objects of the same class are not always grouped together. Moreover, our dataset exhibits a high variety in the number of classes and items per image, as can be seen in Table 1 and Figure 4.

Figure 4: Distribution of the Cafeteria dataset.

                        Cafeteria dataset     FS-COCO dataset
                        Train      Test       Train      Test
# Images                5244       901        7084       5465
# Classes               41         27         38         41
# Tasks                 4520       720        268        165
Avg. Classes / Image    4.61       3.62       2.1        1.8
Avg. Objects / Image    12.98      12.65      4.2        3.7

Table 1: Statistics of the Cafeteria and FS-COCO datasets

Figure 5: Examples of the Cafeteria dataset

6 Experiments

In order to demonstrate the generalizability of the proposed model, we train it on three different datasets separately and evaluate each version of the model on the test sets of all three datasets. We use four different metrics for evaluation: the mean absolute error (MAE) and root mean square error (RMSE), as well as recall and precision (see the supplementary material for formal definitions). For our baseline comparison, we use GMNet [42], a class-agnostic object counting model that uses a general matching architecture. To the best of our knowledge, this is the only work on few-shot object counting available in the literature. Finally, we report the outcome of our ablation studies in Section 6.5.

6.1 Training Details

We followed a simple process for compiling few-shot datasets from existing object counting/detection datasets by defining each unique subset of classes as a task (see Section 3 for more detailed definitions of tasks). In our experiments we set the number of query images in each task to a fixed value, while the number of support images was chosen randomly within a fixed range. We trained the models end-to-end using the Adam optimizer [28]. The learning rate was decreased by a constant factor every time the validation error did not improve for a set number of epochs, and each epoch consists of a fixed number of training episodes. Unless otherwise mentioned, we used a default standard deviation for the Gaussian kernels used to weigh the features.

6.2 Datasets

The three datasets we use in our experiments are the Cafeteria dataset (Section 5), FSOD, and MS COCO.


FS-COCO: in order to evaluate the model on a dataset with a large variance in the appearances and scales of objects in natural scenes, we compiled a relatively small few-shot dataset from MS COCO by splitting its classes into train and test sets and compiling meta-train and meta-test samples from the train and test classes respectively. We only retained images corresponding to tasks with a sufficient number of images with that composition. We refer to this compiled version of MS COCO as FS-COCO. Table 1 shows some statistics of the resulting dataset.

FSOD: the Few-Shot Object Detection dataset (FSOD) was introduced in [14] as a diverse dataset designed specifically for few-shot object detection. It consists of a diverse set of categories split between train and test, and images annotated with bounding boxes. To adapt this diverse and challenging dataset to our weakly-supervised approach, we took the centers of the bounding boxes as the annotation points.

6.3 Evaluation Strategy

Our evaluation method can be described as N-way k-shot, where N, the number of distinct categories in the image, varies across the tasks. We report the results as a function of the number of shots k. Following the few-shot learning principle, we evaluate our models on classes unseen during training. The only exception is the model trained and tested on Cafeteria, for which the train and test classes overlap (the tasks are still new). The results in this case are still interesting, as they show the object counting capability of the model on a challenging dataset. We use the MAE, RMSE, Recall and Precision metrics for evaluation (more details on the evaluation metrics are given in the supplementary material).
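For reference, MAE and RMSE over per-image counts can be computed as below. This follows the common convention; the paper's exact metric definitions (including recall and precision, which require matching predictions to annotations) are in its supplementary material.

```python
import math

def count_errors(pred_counts, true_counts):
    """MAE and RMSE over per-image object counts."""
    diffs = [p - t for p, t in zip(pred_counts, true_counts)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse

mae, rmse = count_errors([3, 5, 7], [4, 5, 5])
```

RMSE penalizes large per-image miscounts more heavily than MAE, which is why both are usually reported together.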

6.4 Results

The results of our models trained on the Cafeteria, FSOD and FS-COCO datasets, evaluated on all of the test sets, are shown in Table 2. We can see that the proposed method outperforms the benchmark model in all nine scenarios, as indicated by all the metrics. Table 2 shows that training on Cafeteria results in a recall of 0.90 and a precision of 0.59 when tested on FSOD. Given the considerable difference between the distributions of the classes in these two datasets, the results highlight the generalizability of the model in dealing with truly different classes, i.e. the true "few-shot" nature of how the model learns.

It is interesting to note that the results for the models trained on Cafeteria show the largest gap between our model and the benchmark. The reason could be that the Cafeteria dataset has many more classes per image on average than the other datasets (see Table 1), which makes the problem much more challenging. This suggests that our model is better equipped to learn from images with a larger number of categories. It is also noteworthy that the model trained on FS-COCO performs better on FSOD than on FS-COCO itself, which may be due to the challenging nature of MS COCO. In order to understand why the results are generally better on the FSOD test-set across the board, we have to remember that, apart from the fact that the number of items per image is smaller for FSOD, the point annotations on FSOD are taken from the bounding boxes, resulting in cleaner labels closer to the centers of objects.

                       Trained on Cafeteria-train         Trained on FSOD-train              Trained on FS-COCO-train
Method  Tested on      MAE    RMSE   Recall  Precision    MAE    RMSE   Recall  Precision    MAE    RMSE   Recall  Precision
GMNet   FS-Cafeteria   14.15  16.19  0.23    0.78         3.32   3.76   0.51    0.75         2.54   2.74   0.74    0.16
        FS-COCO        4.12   4.49   0.41    0.48         2.82   3.01   0.63    0.45         1.92   2.04   0.51    0.87
        FSOD           6.65   6.69   0.36    0.66         2.33   2.35   0.65    0.67         2.10   2.12   0.78    0.17
Ours    FS-Cafeteria   1.53   1.95   0.79    0.78         1.58   1.78   0.57    0.92         1.92   2.15   0.43    0.94
        FS-COCO        2.83   3.48   0.50    0.50         1.65   1.84   0.63    0.79         1.61   1.79   0.59    0.88
        FSOD           2.93   2.95   0.90    0.59         0.86   0.88   0.76    0.99         1.99   2.02   0.92    0.83
Table 2: Performance of models trained on one dataset and tested on another with the corresponding number of shots

6.4.1 Comparison with Few-shot detection approaches

In Table 3 we compare our results with state-of-the-art few-shot object detection approaches [14, 24, 7]. It is important to keep in mind that, unlike our approach, the detection-based approaches require bounding boxes for training. Moreover, in order to convert detection outputs into counting results, we counted every detected object with an IoU higher than 0.5 with a ground-truth box.
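A sketch of one plausible detection-to-count conversion consistent with the description above: greedily match each detection to an unused ground-truth box and count matches with IoU above 0.5. The greedy matching scheme is an assumption; the paper does not specify it beyond the threshold.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def detections_to_count(dets, gts, thr=0.5):
    """Count detections that match an unused ground-truth box with IoU
    above `thr` (greedy matching; at most one detection per ground truth)."""
    used, count = set(), 0
    for d in dets:
        for i, g in enumerate(gts):
            if i not in used and iou(d, g) > thr:
                used.add(i)
                count += 1
                break
    return count

# Two detections overlap true boxes; the third is a spurious detection.
n = detections_to_count([(0, 0, 10, 10), (20, 20, 30, 30), (100, 100, 105, 105)],
                        [(1, 1, 11, 11), (20, 20, 30, 30)])
```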

Dataset  Method          Recall  Precision
FS-COCO  FSOD [14]       0.51    0.41
         FODvFR [24]     0.28    0.12
         LSTD(YOLO) [7]  0.20    0.09
         Ours            0.59    0.88
FSOD     FSOD [14]       0.49    0.44
         Ours            0.76    0.99
Table 3: Performance comparison with few-shot detection-based approaches

These comparison results show that even though the detection approaches rely on bounding boxes during training, they are considerably outperformed by our approach.

6.5 Ablation Studies

Relying on attention for support images: one might wonder what would happen if the features for the class prototypes were extracted by the attention module instead of the Gaussian maps generated from the labels, as is done for the query images. To find out, we trained a version of the model without the label Gaussian maps for the support images. In this case, the two branches of the model in Figure 2 are exactly the same. Table 4 shows the results of this experiment against the original model; both models are trained and tested on FSOD. The significant drop in precision for the model that does not use the Gaussian kernels implies that the original model takes full advantage of them.

Removing the encoded coordinates: concatenating the encoded coordinates (EC) to the extracted feature maps makes the model more location-aware. This is especially helpful in following the predetermined order of objects in the image. Table 4 compares the performance of the model trained on the FSOD train-set and tested on the FSOD test-set with and without the encoded coordinates, indicating that including them improves the performance of the model considerably, especially in terms of the recall rate.

Method         MAE   RMSE  Recall  Precision
Full approach  0.86  0.88  0.76    0.99
Without EC     0.91  0.93  0.71    0.96
Without guide  6.25  6.36  0.93    0.51
Table 4: Effect of the encoded coordinates and of guiding the model when extracting features from the support set

Altering the standard deviation of the Gaussian kernels: for very low-density datasets such as FSOD and COCO, estimating the standard deviation σ of the Gaussian kernel from the point-level annotations using the approach suggested in [75] is very inaccurate, if not impossible, since the number of objects per image is usually very small. Therefore, for these datasets we use a fixed default value for the standard deviation. Table 5 shows the results of varying σ. All the models in this table have been trained and tested on FSOD.

It can be seen from Table 5 that the optimum value of σ for the FSOD dataset is 8. The optimal value varies for each dataset, depending on the scale of the objects and their composition. A potential extension of this work could be having the model predict the value of σ for each class in each image by adding an extra regression head.
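Rendering the label maps with a fixed σ can be sketched as placing a 2-D Gaussian at every point annotation. This is a minimal illustration of the standard construction; the function name and unnormalized kernel are assumptions, not the authors' exact implementation.

```python
import numpy as np

def gaussian_map(points, shape, sigma=8.0):
    """Render an (H, W) map with an unnormalized 2-D Gaussian of
    fixed standard deviation `sigma` centred at each point (x, y)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    out = np.zeros((h, w), dtype=np.float64)
    for (px, py) in points:
        out += np.exp(-((xs - px) ** 2 + (ys - py) ** 2)
                      / (2.0 * sigma ** 2))
    return out
```

A larger σ spreads each object's mass over a wider region: Table 5 suggests that kernels that are too narrow (σ = 4, 6) or too wide (σ = 10) both hurt precision relative to σ = 8 on FSOD.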

σ    MAE   RMSE  Recall  Precision
4    1.83  1.85  0.91    0.64
6    1.74  1.77  0.91    0.65
8    0.86  0.88  0.76    0.99
10   0.91  0.93  0.79    0.93
Table 5: Effect of the standard deviation σ of the Gaussian kernels.

Number of shots: in Figure 6, we observe that increasing the number of shots slightly improves the performance of the model, especially when the number of ways is larger. The reason is that as the number of support images increases, the prototype vectors are averaged over a larger number of objects, smoothing out the noise in the features. We plan to study other approaches for aggregating features from multiple instances of the same class in the support set.
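The prototype averaging described above can be sketched as follows, in the style of prototypical networks [62]. This is a generic illustration with assumed names; the paper's prototypes are built from attention-extracted object features rather than whole-image embeddings.

```python
import numpy as np

def class_prototypes(support_features, support_labels):
    """Average the embeddings of all support objects of each class
    into one prototype vector. With more shots, each prototype
    averages over more instances, smoothing out feature noise."""
    protos = {}
    for label in set(support_labels):
        feats = [f for f, l in zip(support_features, support_labels)
                 if l == label]
        protos[label] = np.mean(feats, axis=0)
    return protos
```

At query time, each attended object feature is classified by comparing it against these prototypes (and a background option), so noisier prototypes translate directly into noisier class assignments.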

Figure 6: Performance as a function of the number of shots and the number of ways. The model was trained and evaluated on Cafeteria.

7 Conclusion

We addressed the little-studied problem of few-shot multi-class object counting with point-level supervision. We presented results on three challenging datasets with diverse distributions: FSOD, a few-shot subset of MS COCO, and Cafeteria, a dataset we introduced specifically for object counting. Our model employs an intuitive approach to object counting in which objects are first sequentially localized through a class-agnostic attention mechanism and then classified via a prototypical-based few-shot scheme. Our loss function, crafted to reflect this two-step approach, combines two terms corresponding to the localization and classification steps.

Our approach mitigates the challenge of multi-class object counting by reducing the dependency on labelled data, while also simplifying the labelling process by relying on point annotations only. We believe that our work motivates further research into a problem of increasing interest in many applications, namely few-shot multi-class object counting. One such application is inventory tracking, where data scarcity due to the rapidly changing composition of items, as well as the complexity of the problem setup, poses a real challenge to existing approaches.


  • [1] Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M.W., Pfau, D., Schaul, T., Shillingford, B., De Freitas, N.: Learning to learn by gradient descent by gradient descent. In: Advances in neural information processing systems. pp. 3981–3989 (2016)
  • [2] Arteta, C., Lempitsky, V., Zisserman, A.: Counting in the wild. In: European Conference on Computer Vision. pp. 483–498. Springer (2016)
  • [3] Boominathan, L., Kruthiventi, S.S., Babu, R.V.: Crowdnet: A deep convolutional network for dense crowd counting. In: Proceedings of the 24th ACM international conference on Multimedia. pp. 640–644. ACM (2016)
  • [4] Chan, A.B., Liang, Z.S.J., Vasconcelos, N.: Privacy preserving crowd monitoring: Counting people without people models or tracking. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–7. IEEE (2008)
  • [5] Chan, A.B., Vasconcelos, N.: Bayesian poisson regression for crowd counting. In: 2009 IEEE 12th international conference on computer vision. pp. 545–551. IEEE (2009)
  • [6] Chattopadhyay, P., Vedantam, R., Selvaraju, R.R., Batra, D., Parikh, D.: Counting everyday objects in everyday scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1135–1144 (2017)
  • [7] Chen, H., Wang, Y., Wang, G., Qiao, Y.: Lstd: A low-shot transfer detector for object detection. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
  • [8] Chen, K., Loy, C.C., Gong, S., Xiang, T.: Feature mining for localised crowd counting. In: BMVC. vol. 1, p. 3 (2012)
  • [9] Cheng, Z.Q., Li, J.X., Dai, Q., Wu, X., Hauptmann, A.G.: Learning spatial awareness to improve crowd counting. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6152–6161 (2019)
  • [10] Cholakkal, H., Sun, G., Khan, F.S., Shao, L.: Object counting and instance segmentation with image-level supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 12397–12405 (2019)
  • [11] De Almeida, P.R., Oliveira, L.S., Britto Jr, A.S., Silva Jr, E.J., Koerich, A.L.: Pklot–a robust dataset for parking lot classification. Expert Systems with Applications 42(11), 4937–4949 (2015)
  • [12] Dong, N., Xing, E.: Few-shot semantic segmentation with prototype learning. In: BMVC. vol. 1, p. 6 (2018)
  • [13] Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88(2), 303–338 (Jun 2010)
  • [14] Fan, Q., Zhuo, W., Tai, Y.W.: Few-shot object detection with attention-rpn and multi-relation detector. arXiv preprint arXiv:1908.01998 (2019)
  • [15] Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28(4), 594–611 (2006)
  • [16] Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 1126–1135. JMLR.org (2017)
  • [17] Finn, C., Xu, K., Levine, S.: Probabilistic model-agnostic meta-learning. In: NeurIPS (2018)
  • [18] Goldman, E., Herzig, R., Eisenschtat, A., Ratzon, O., Levi, I., Goldberger, J., Hassner, T.: Precise detection in densely packed scenes (2019)
  • [19] Hariharan, B., Girshick, R.: Low-shot visual recognition by shrinking and hallucinating features. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3018–3027 (2017)
  • [20] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [21] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
  • [22] Hu, T., Yang, P., Zhang, C., Yu, G., Mu, Y., Snoek, C.: Attention-based multi-context guiding for few-shot semantic segmentation. In: AAAI Conference on Artificial Intelligence (2019)
  • [23] Jiang, X., Havaei, M., Varno, F., Chartrand, G., Chapados, N., Matwin, S.: Learning to learn with conditional class dependencies. In: ICLR (2019)
  • [24] Kang, B., Liu, Z., Wang, X., Yu, F., Feng, J., Darrell, T.: Few-shot object detection via feature reweighting. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 8420–8429 (2019)
  • [25] Khan, A., Gould, S., Salzmann, M.: Deep convolutional neural networks for human embryonic cell counting. In: European Conference on Computer Vision. pp. 339–348. Springer (2016)
  • [26] Kim, J., Kim, T., Kim, S., Yoo, C.D.: Edge-labeling graph neural network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11–20 (2019)
  • [27] Kim, T., Yoon, J., Dia, O., Kim, S., Bengio, Y., Ahn, S.: Bayesian model-agnostic meta-learning. In: NeurIPS (2018)
  • [28] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [29] Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shot image recognition. In: ICML deep learning workshop. vol. 2 (2015)
  • [30] Kong, D., Gray, D., Tao, H.: Counting pedestrians in crowds using viewpoint invariant training. In: BMVC. vol. 1, p. 2. Citeseer (2005)
  • [31] Lacoste, A., Oreshkin, B., Chung, W., Boquet, T., Rostamzadeh, N., Krueger, D.: Uncertainty in multitask transfer learning. arXiv preprint arXiv:1806.07528 (2018)
  • [32] Lake, B., Salakhutdinov, R., Gross, J., Tenenbaum, J.: One shot learning of simple visual concepts. In: Proceedings of the annual meeting of the cognitive science society. vol. 33 (2011)
  • [33] Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: Human-level concept learning through probabilistic program induction. Science 350(6266), 1332–1338 (2015)
  • [34] Lake, B.M., Salakhutdinov, R.R., Tenenbaum, J.: One-shot learning by inverting a compositional causal process. In: Advances in neural information processing systems. pp. 2526–2534 (2013)
  • [35] Laradji, I.H., Rostamzadeh, N., Pinheiro, P.O., Vazquez, D., Schmidt, M.: Where are the blobs: Counting by localization with point supervision. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 547–562 (2018)
  • [36] Li, Y., Zhang, X., Chen, D.: Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1091–1100 (2018)
  • [37] Li, Z., Zhou, F., Chen, F., Li, H.: Meta-sgd: Learning to learn quickly for few shot learning. CoRR abs/1707.09835 (2017)
  • [38] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)
  • [39] Liu, B., Vasconcelos, N.: Bayesian model adaptation for crowd counts. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4175–4183 (2015)
  • [40] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: European conference on computer vision. pp. 21–37. Springer (2016)
  • [41] Liu, Y., Shi, M., Zhao, Q., Wang, X.: Point in, box out: Beyond counting persons in crowds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6469–6478 (2019)
  • [42] Lu, E., Xie, W., Zisserman, A.: Class-agnostic counting. In: Asian Conference on Computer Vision. pp. 669–684. Springer (2018)
  • [43] Marana, A., Costa, L.d.F., Lotufo, R., Velastin, S.: On the efficacy of texture analysis for crowd monitoring. In: Proceedings SIBGRAPI’98. International Symposium on Computer Graphics, Image Processing, and Vision (Cat. No. 98EX237). pp. 354–361. IEEE (1998)
  • [44] Marsden, M., McGuinness, K., Little, S., Keogh, C.E., O’Connor, N.E.: People, penguins and petri dishes: adapting object counting models to new visual domains and object types without forgetting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8070–8079 (2018)
  • [45] Marsden, M., McGuinness, K., Little, S., O’Connor, N.E.: Fully convolutional crowd counting on highly congested scenes. arXiv preprint arXiv:1612.00220 (2016)
  • [46] Nichol, A., Achiam, J., Schulman, J.: On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999 (2018)
  • [47] Nichol, A., Schulman, J.: Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999 2 (2018)
  • [48] Onoro-Rubio, D., López-Sastre, R.J.: Towards perspective-free object counting with deep learning. In: European Conference on Computer Vision. pp. 615–629. Springer (2016)
  • [49] Paul Cohen, J., Boucher, G., Glastonbury, C.A., Lo, H.Z., Bengio, Y.: Count-ception: Counting by fully convolutional redundant counting. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 18–26 (2017)
  • [50] Rakelly, K., Shelhamer, E., Darrell, T., Efros, A.A., Levine, S.: Few-shot segmentation propagation with guided networks. arXiv preprint arXiv:1806.07373 (2018)
  • [51] Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings (2017)
  • [52] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. pp. 91–99 (2015)
  • [53] Rensink, R.A.: The dynamic representation of scenes. Visual cognition (2000)
  • [54] Rusu, A.A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., Hadsell, R.: Meta-learning with latent embedding optimization. In: ICLR (2019)
  • [55] Ryan, D., Denman, S., Fookes, C., Sridharan, S.: Crowd counting using multiple local features. In: 2009 Digital Image Computing: Techniques and Applications. pp. 81–88. IEEE (2009)
  • [56] Sam, D.B., Peri, S.V., Kamath, A., Babu, R.V., et al.: Locate, size and count: Accurately resolving people in dense crowds via detection. arXiv preprint arXiv:1906.07538 (2019)
  • [57] Satorras, V.G., Estrach, J.B.: Few-shot learning with graph neural networks. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings (2018)
  • [58] Shang, C., Ai, H., Bai, B.: End-to-end crowd counting via joint learning local and global count. In: 2016 IEEE International Conference on Image Processing (ICIP). pp. 1215–1219. IEEE (2016)
  • [59] Shi, Z., Mettes, P., Snoek, C.G.M.: Counting with focus for free. In: Proceedings of the IEEE International Conference on Computer Vision. Seoul, Korea (October 2019)
  • [60] Sindagi, V.A., Patel, V.M.: Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In: 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). pp. 1–6. IEEE (2017)
  • [61] Sindagi, V.A., Yasarla, R., Patel, V.M.: Pushing the frontiers of unconstrained crowd counting: New dataset and benchmark method. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1221–1231 (2019)
  • [62] Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advances in Neural Information Processing Systems. pp. 4077–4087 (2017)
  • [63] Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1199–1208 (2018)
  • [64] Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. In: Advances in neural information processing systems. pp. 3630–3638 (2016)
  • [65] Walach, E., Wolf, L.: Learning to count with cnn boosting. In: European Conference on Computer Vision. pp. 660–676. Springer (2016)
  • [66] Wang, K., Liew, J.H., Zou, Y., Zhou, D., Feng, J.: Panet: Few-shot image semantic segmentation with prototype alignment. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 9197–9206 (2019)
  • [67] Wang, Y.X., Girshick, R., Hebert, M., Hariharan, B.: Low-shot learning from imaginary data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7278–7286 (2018)
  • [68] Wilder, J.D., Kowler, E., Schnitzer, B.S., Gersch, T.M., Dosher, B.A.: Attention during active visual tasks: Counting, pointing, or simply looking. Vision Research
  • [69] Wojna, Z., Gorban, A.N., Lee, D.S., Murphy, K., Yu, Q., Li, Y., Ibarz, J.: Attention-based extraction of structured information from street view imagery. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). vol. 1, pp. 844–850. IEEE (2017)
  • [70] Xie, W., Noble, J.A., Zisserman, A.: Microscopy cell counting and detection with fully convolutional regression networks. Computer methods in biomechanics and biomedical engineering: Imaging & Visualization 6(3), 283–292 (2018)
  • [71] Xu, K., Ba, J.L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R.S., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning - Volume 37. pp. 2048–2057. ICML'15 (2015)
  • [72] Yan, Z., Yuan, Y., Zuo, W., Tan, X., Wang, Y., Wen, S., Ding, E.: Perspective-guided convolution networks for crowd counting. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 952–961 (2019)
  • [73] Zhang, C., Li, H., Wang, X., Yang, X.: Cross-scene crowd counting via deep convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 833–841 (2015)
  • [74] Zhang, R., Che, T., Ghahramani, Z., Bengio, Y., Song, Y.: Metagan: An adversarial approach to few-shot learning. In: Advances in Neural Information Processing Systems. pp. 2365–2374 (2018)
  • [75] Zhang, Y., Zhou, D., Chen, S., Gao, S., Ma, Y.: Single-image crowd counting via multi-column convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 589–597 (2016)
  • [76] Zou, Z., Cheng, Y., Qu, X., Ji, S., Guo, X., Zhou, P.: Attend to count: Crowd counting with adaptive capacity multi-scale cnns. Neurocomputing 367, 75–83 (2019)