Spatial Counting Network (SCN) model and Modifying Count Distribution (MCD) protocol
Machine learning models tend to over-rely on statistical shortcuts. These spurious correlations between parts of the input and the output labels do not hold in real-world settings. We target this issue on the recent open-ended visual counting task, which is well suited to study statistical shortcuts. We aim to develop models that learn a proper mechanism of counting regardless of the output label. First, we propose the Modifying Count Distribution (MCD) protocol, which penalizes models that over-rely on statistical shortcuts. It is based on pairs of training and testing sets that do not follow the same count label distribution, such as the odd-even sets. Intuitively, models that have learned a proper mechanism of counting on odd numbers should perform well on even numbers. Secondly, we introduce the Spatial Counting Network (SCN), which is dedicated to visual analysis and counting based on natural language questions. Our model selects relevant image regions, scores them with fusion and self-attention mechanisms, and provides a final counting score. We apply our protocol to the recent TallyQA dataset and show superior performance compared to state-of-the-art models. We also demonstrate the ability of our model to select the correct instances to count in the image. Code and datasets are available: https://github.com/cdancette/spatial-counting-network
Recent advances in computer vision (krizhevsky2012alexnet, ; he2016deep, ) and natural language processing (mikolov2013efficient, ) have allowed the research community to tackle challenging tasks that combine vision and language (kiros2015skipthoughts, ; karpathy2015deep, ; lu2016vrd, ). One of these tasks is open-ended visual counting. Its goal is to count the number of instances in an image given a question formulated in natural language. It extends visual counting tasks, which are focused on one type of instance (sindagi2018survey, ) or on a limited set of instances (e.g., 80 different objects (chattopadhyay2017counting, )). Solving it could pave the way towards the next generation of counting systems that possess interactive interfaces with applications in biology (lempitsky2010learning, ), medicine (briggs2009quality, ), wildlife monitoring (onoro2016towards, ), smart cities (onoro2016towards, ; lempitsky2010learning, ) and more.
Open-ended counting was first introduced as a sub-task of Visual Question Answering (VQA) (antol2015vqa, ; goyal2017vqa2, ; krishna2017vgenome, ; kafle2017tdiuc, ; johnson2017clevr, ), where the goal is to answer any type of question about an image. An important problem of VQA models is that they tend to memorize statistical shortcuts (geirhos2020shortcut, ) (also called spurious or superficial correlations (agrawal2018vqacp, ), unwanted priors or biases (ramakrishnan2018overcoming, ; cadene2019rubi, ; selvaraju2019taking, ; wu2019self, ; jing2020overcoming, )) between parts of the inputs and the output labels instead of learning proper mechanisms. They reach acceptable results on testing sets that follow a distribution similar to that of their training set, but their performance degrades significantly otherwise (agrawal2018vqacp, ). This issue makes them impractical in real-world settings (see also (stock2017imagenet, ; geirhos2018imagenet, ; barbu2019objectnet, ; alcorn2019strike, ; ilyas2019adversarial, ; goodfellow2014explaining, ), which point out this issue on object recognition tasks).
Open-ended counting models are greatly inspired by, and often compared to, VQA models (zhang2018counter, ; benyounes2017mutan, ). While they are developed on specialized datasets for open-ended counting (chattopadhyay2017counting, ; trott2018interpretable, ; acharya2019tallyqa, ), their proximity to VQA models makes them potentially subject to similar problems of statistical shortcuts.
In this paper, we first introduce a novel experimental protocol called Modifying Count Distribution (MCD). It is meant to select design choices that are useful for learning how to count instead of learning the shortcuts. It is inspired by previous works on counting from cognitive science (marcus1998rethinking, ; gross2009number, ) and on statistical shortcuts in VQA (agrawal2018vqacp, ; teney2019actively, ). It consists in evaluating the ability to count of a given model on various training and testing sets that follow different count label distributions. As shown in Figure 1, we evaluate the ability of a counting system trained on odd numbers to generalize on even numbers. In this context, models must generalize to unseen or scarcely seen label counts and are heavily penalized for using shortcuts. Inspired by (trott2018interpretable, ), we also evaluate their ability to correctly ground their final answer in the image. We use a standard object detection metric, and also introduce a new metric more suited to the counting task.
With this experimental protocol in mind, we introduce a novel model, Spatial Counting Network (SCN), dedicated to visual analysis and counting in the open-ended setting. In contrast to state-of-the-art approaches such as RCN (acharya2019tallyqa, ) or Counter (zhang2018counter, ), which are classification models, ours is a regression model. This crucial design choice allows the model to learn more robust mechanisms by taking into account the structure of the output labels (ordered natural numbers), and to output counting values that have been scarcely or never seen in the training set. Another important design choice is that our model assigns individual counting scores to image regions using fusion and self-attention mechanisms, before computing the final count number. While IRLC (trott2018interpretable, ) learns a hard selection of image regions using reinforcement learning, our model learns a soft selection in an end-to-end fashion. In addition, we introduce an entropy regularization term to enforce sparse region scores. Our design choices guarantee a certain level of interpretability and help generalization on different count label distributions.
Our contributions are as follows. We first introduce the MCD protocol based on shifts in count label distribution between train and test sets. Secondly, we introduce an end-to-end learnable model for counting whose design choices allow it to learn robust counting mechanisms. Finally, we apply our experimental protocol to the most recent and biggest open-ended counting dataset, TallyQA (acharya2019tallyqa, ). We conduct extensive experiments and show that our model performs better than current state-of-the-art models. We also validate our design choices by reporting improvements in grounding ability.
An ideal experimental and evaluation protocol for open-ended counting should select models that learn the underlying mechanism of counting, rather than models that rely on statistical shortcuts (geirhos2020shortcut, ). These spurious correlations between parts of image-question inputs and the count labels allow models to perform well on pairs of training and testing sets that follow similar distributions but fail on real-world data due to a shift in distribution.
Real-world datasets often contain hidden statistical shortcuts that can be used to reach impressive performances. Detecting them is challenging. A first approach consists in developing specific baselines that only rely on part of the inputs. For instance, question-only models can be used to assess the existence of shortcuts between the question and the answer in VQA datasets (antol2015vqa, ; goyal2017vqa2, ; agrawal2018vqacp, ). However, it is even more challenging to evaluate if state-of-the-art models over-rely on shortcuts. Common approaches rely on expensive annotations (das2017human, ) or on explainability methods (stock2017imagenet, ; manjunatha2019explicit, ). Humans must then interpret if the displayed correlations are statistical shortcuts or not.
Another approach consists in using testing splits that do not follow the training distribution, to penalize models that learn these shortcuts instead of the proper mechanism. It simulates the kind of shifts in distribution that can potentially be encountered when models are deployed in real-world scenarios. For instance, VQA-CP datasets (agrawal2018vqacp, ) are built by re-organizing the training and testing sets of original VQA datasets, changing the distribution of answers per question type. We propose a similar approach for open-ended counting datasets. We introduce strategies to shift the label count distribution between the original training and testing sets. In this context, models must generalize to unseen or scarcely seen label counts and are heavily penalized for using shortcuts. Finally, we select models according to their robustness to shifts in distribution.
We now describe our experimental protocol, Modifying Count Distribution (MCD). It penalizes models that over-rely on statistical shortcuts without any need for external annotations or human supervision. Its goal is to select models that have learned a more robust counting mechanism.
Given a pair of training and testing sets made of image-question-label triplets following similar distributions, we introduce strategies to produce a shift in distribution of count labels. The Odd-Even-p% strategy generates unbalanced pairs by removing p% of the triplets associated with an even label from the training set and removing the same percentage of triplets associated with an odd label from the testing set. We control the amount of statistical shortcuts that can potentially be learned by varying p from 0 to 100. At the extremes, Odd-Even-0% generates the original pairs, while Odd-Even-100% generates a training set with no even count labels and a testing set with no odd count labels (i.e., a zero-shot setting). Figure 2 displays the shift in distribution obtained when applying the Odd-Even-p% strategy on the TallyQA training and testing sets. Odd-Even-90% is our strategy of choice because it introduces a large shift in distribution of count labels, while allowing classification models to learn from every possible answer. Similarly, we introduce the Even-Odd-p% strategies to generate unbalanced training and testing sets which are mostly composed of triplets associated with even and odd count labels respectively. As shown in the supplementary materials, all of our strategies produce only a small shift in distribution of images and questions, which is important to isolate the impact of a shift in count labels.
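To make the strategy concrete, here is a minimal Python sketch of an Odd-Even-p% style split. It assumes triplets are plain `(image_id, question, count)` tuples and that the removed triplets are drawn uniformly at random with a fixed seed; the function name `odd_even_split` and the sampling scheme are illustrative assumptions, not the released implementation.

```python
import random


def odd_even_split(train, test, p, seed=0):
    """Apply an Odd-Even-p% style shift to (image_id, question, count) triplets.

    Removes p% of the even-count triplets from `train` and p% of the
    odd-count triplets from `test`. Uniform random removal with a fixed
    seed is an assumption made for this sketch.
    """
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    rng = random.Random(seed)

    def drop(triplets, parity):
        # Indices of the triplets whose count label has the targeted parity.
        targeted = [i for i, t in enumerate(triplets) if t[2] % 2 == parity]
        n_remove = round(len(targeted) * p / 100)
        removed = set(rng.sample(targeted, n_remove))
        return [t for i, t in enumerate(triplets) if i not in removed]

    # Parity 0 (even labels) is thinned in training, parity 1 (odd) in testing.
    return drop(train, 0), drop(test, 1)
```

Swapping the two parities in the final line would give the corresponding Even-Odd-p% variant.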
As raised by (teney2020value, ), similar protocols (agrawal2018vqacp, ) often select models based on their performance on the testing set only. This bad practice encourages adaptive over-fitting (dwork2015preserving, ) on the testing set distribution. We address this common issue by introducing a validation set. Given a pair of unbalanced training and testing sets, we build their associated validation set as a held-out subset of the training set. We use it to tune hyper-parameters and perform early-stopping to not reveal any information on the testing set distribution.
We now describe our model, Spatial Counting Network (SCN). It contains inductive biases to encourage the learning of the counting mechanism, and avoid learning statistical shortcuts. Our model uses multi-modal fusion and self-attention to assign counting scores to individual image regions, which allows the final accumulated count number to be spatially grounded. In order to generalize to modified count distributions, we use a regression loss to train our model (as opposed to a classification loss (zhang2018counter, ; acharya2019tallyqa, )), and use entropy regularization to encourage the counting of natural numbers (as opposed to making discrete decisions trained with reinforcement learning (trott2018interpretable, )).
An overview of our model is shown in Figure 3. Formally, given a dataset consisting of triplets $(I, Q, c)$ with an image $I$, a natural language question $Q$ and a count label $c$ corresponding to the (non-negative) number of instances in the image, the goal is to learn a mapping $f_\theta$ from $(I, Q)$ to $c$, where $\theta$ denotes learnable parameters. Our model builds such a mapping by first encoding both inputs and fusing them, which we detail next.
As shown in the first block of Figure 3, the model uses two encoders to produce vectorized representations for the image and the question. For the image, a pre-trained object detector (anderson2018bottom, ) is applied to transform the raw pixels into a set of spatially located vectors, with each vector encoding the semantic content of a region (or bounding box) within the image. We project the coordinates of each region into a vector of the same dimension and add it to the associated region vector. For the question, we use skip-thought vectors (kiros2015skipthoughts, ) to obtain its representation. We then merge each region vector with the question representation using a multi-modal fusion module from (kim2017mlb, ), resulting in a new set of vectors ready for relationship modeling and spatial counting, to be discussed below.
Since the set of bounding boxes used in encoding images can overlap, one core challenge for correct counting is to de-duplicate boxes (zhang2018counter, ; trott2018interpretable, ) that are assigned to the same instance. We address this by modeling general relationships among the fused region vectors using self-attention (vaswani2017attention, ), letting the model learn this mechanism. Specifically, a single-head attention module is applied to the fused vectors, yielding for each region a contextualized representation, which is then summed element-wise with the fused vector of that region. Beyond de-duplication, modeling pair-wise relationships could also be helpful for complex questions that require grouping regions (e.g. ‘How many types of fruits ?’) or spatial reasoning (e.g. ‘How many cats are under the table ?’).
After relationship modeling, the resulting vectors are fused again (kim2017mlb, ) with the question representation, and a sigmoid activation produces a counting score for each region. Finally, the global count output is a simple summation of all the individual counting scores. We name our model Spatial Counting Network because each and every count is explicitly grounded to a spatial region, which allows for easy interpretation and visualization.
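The final counting step can be sketched in a few lines of Python. The per-region logits below are hypothetical stand-ins for the output of the second fusion (the encoders, fusion and self-attention modules are not reproduced here); the sketch only shows how sigmoid scores are accumulated into a count.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def count_from_logits(region_logits):
    """Return the per-region counting scores and their sum (the predicted count)."""
    scores = [sigmoid(z) for z in region_logits]
    return scores, sum(scores)


# Hypothetical logits for four regions: two confident detections, two rejections.
scores, count = count_from_logits([6.0, 5.0, -6.0, -7.0])
```

Because every score is attached to one region, the regions with scores near 1 can be displayed directly as the instances the model counted.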
While the above-described model encapsulates general components like multi-modal fusion and relationship modeling for open-ended counting, we would like to highlight two design choices that are important for improving its generalization to modified count distributions, described next.
First, unlike many state-of-the-art counting models (zhang2018counter, ; acharya2019tallyqa, ) (and general VQA models, including large-scale pretrained vision-and-language models (lu2019vilbert, ; tan2019lxmert, )) that treat count numbers as classification labels, we argue that they should be interpreted as actual numbers, and we directly train the model to regress the final output $\hat{c}$ to the ground truth count label $c$. We choose the standard Mean Squared Error (MSE) as the loss:
$$\mathcal{L}_{\text{MSE}} = (\hat{c} - c)^2 \qquad (1)$$
During testing, we round the fractional value $\hat{c}$ to its nearest integer to complete the mapping to count labels. This loss is well suited to counting, as it takes advantage of the natural order of the count labels. It also allows our model to output count labels that were not seen during training, which is beneficial when the testing set follows a different distribution of count labels.
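As a small illustration of the regression objective and of the test-time rounding, the following Python sketch averages squared errors over a batch and maps a real-valued prediction to a count label. The helper names are hypothetical, and rounding halves upward is an assumption, since the text only specifies rounding to the nearest integer.

```python
import math


def mse_loss(predictions, labels):
    """Mean squared error between real-valued predictions and integer count labels."""
    if len(predictions) != len(labels) or not predictions:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    return sum((p - c) ** 2 for p, c in zip(predictions, labels)) / len(predictions)


def to_count_label(prediction):
    """Round a non-negative real-valued prediction to the nearest integer.

    Ties (x.5) are rounded upward here; this tie-breaking rule is an assumption.
    """
    return math.floor(prediction + 0.5)
```

Unlike a classifier restricted to the labels seen in training, `to_count_label` can return any integer, including counts absent from the training set.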
Second, although regression is a natural choice for number-related tasks, directly applying it to open-ended visual counting can be disadvantageous, because it attempts to model the entire output counting range (i.e., the output can be any real value between $0$ and the number of regions) and does not take advantage of the fact that all the count labels are integers. One way to fix this is through reinforcement learning (trott2018interpretable, ), which selects regions one by one, but the resulting objective function is hard to optimize directly. Here we propose an alternative solution by simply imposing a binary entropy regularization term on the sigmoid counting score $s_i$ of each region $i$:
$$\mathcal{L}_{\text{ent}} = -\sum_{i} \big[ s_i \log s_i + (1 - s_i) \log (1 - s_i) \big] \qquad (2)$$
which essentially encourages each sigmoid output to be close to $0$ or $1$. Intuitively, it means that each region contains either one whole object or none; it cannot hold a fractional value (e.g. 0.5). This regularization not only pushes the final count towards integers (since it is produced by summing up scores that are close to $0$ or $1$), but also benefits grounding the final count in the image (since it significantly reduces the chance of multiple overlapping regions being assigned some fractional value and summing up to an integer count), which in turn helps generalization.
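A short Python sketch may help to see why the entropy term favors one confident region over several fractional ones. It assumes the per-region entropies are summed (the aggregation could equally be a mean) and clamps scores for numerical safety; both choices are assumptions of this sketch.

```python
import math


def binary_entropy(s, eps=1e-12):
    """Binary entropy of a score in [0, 1]; clamped to avoid log(0)."""
    s = min(max(s, eps), 1.0 - eps)
    return -(s * math.log(s) + (1.0 - s) * math.log(1.0 - s))


def entropy_regularizer(scores):
    """Sum of per-region binary entropies (summation is an assumption)."""
    return sum(binary_entropy(s) for s in scores)


# Two ways of predicting a count of 1 with two overlapping regions.
fractional = entropy_regularizer([0.5, 0.5])   # both regions share the object
confident = entropy_regularizer([0.99, 0.01])  # one region owns the object
```

Both configurations sum to a count of 1, but the confident one has a much lower penalty, which is the behavior the regularizer is meant to reward.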
Combining MSE and entropy regularization, our final training loss is defined as $\mathcal{L} = \mathcal{L}_{\text{MSE}} + \lambda \, \mathcal{L}_{\text{ent}}$, where $\mathcal{L}_{\text{MSE}}$ is the regression loss (Eq. 1), $\mathcal{L}_{\text{ent}}$ is the entropy regularization term (Eq. 2), and $\lambda$ is a fixed hyperparameter. We use the value of $\lambda$ that reaches the best performance in our context.
We extensively use TallyQA (acharya2019tallyqa, ), the most recent and biggest open-ended counting dataset. Its training set contains 130K real images from COCO (lin2014microsoftcoco, ) and Visual Genome (krishna2017vgenome, ). Each image is associated with questions and count labels for a total of 250K triplets. It comes with a testing set of 23K simple questions and 16K complex questions. Simple questions only require an object detection ability, while complex questions require abilities to detect relationships between objects, their attributes, spatial reasoning, and more (acharya2019tallyqa, ). We compare our proposed model to state-of-the-art approaches and strong baselines following our experimental protocol, which penalizes models for relying on statistical shortcuts. Importantly, we do not incorporate knowledge about the testing set distribution, such as sampling or weighting triplets based on their count labels. We report only accuracy scores, as the RMSEs follow the same trend. We then further study the impact of entropy regularization on the ability to select image regions that are important for counting. Implementation details are provided in the supplementary materials.
Model | Testing set (Simple) | Testing set (Complex) | Validation set | # of Parameters |
---|---|---|---|---|
Q-Only (acharya2019tallyqa, ) | 13.14 | 28.97 | 53.78 | 28 M |
I-Only (acharya2019tallyqa, ) | 12.63 | 6.05 | 53.77 | 4 M |
Q+I (acharya2019tallyqa, ) | 17.55 | 24.77 | 61.08 | 30 M |
MUTAN (benyounes2017mutan, ) | 18.58 | 26.20 | 64.33 | 58 M |
Counter (zhang2018counter, ) | 17.99 | 20.43 | 67.50 | 12 M |
RCN (acharya2019tallyqa, ) | 30.64 | 26.69 | 69.53 | 47 M |
RCN Regression | 37.25 | 28.55 | 61.41 | 47 M |
SCN (ours) | 48.22 | 32.54 | 54.78 | 52 M |
In Table 1, we compare our model against state-of-the-art approaches and other reported models on TallyQA. Notably, RCN (acharya2019tallyqa, ) and Counter (zhang2018counter, ) are specifically designed to answer counting questions. Scores for SCN are averaged over three runs, with a variance for simple and complex of 0.3 and 1.1 respectively. We train and evaluate each model on a modified version of TallyQA using our Odd-Even-90% strategy. Models that over-rely on statistical shortcuts are expected to perform well on its validation set, since it follows the training set distribution, but to suffer from a large loss in accuracy on the testing sets. Interestingly, RCN and Counter reach a high accuracy of 69.53% and 67.50% on the validation set, but suffer from huge losses of -38.89 and -49.51 accuracy points respectively on the simple testing set. We observe similar losses on the complex testing set. On the contrary, our model reaches the best accuracy of 48.22% on simple questions and 32.54% on complex questions, with gains in accuracy of +17.58 and +5.85 respectively over RCN (acharya2019tallyqa, ) (third last row), which is the state-of-the-art model and has a similar number of parameters (52M vs. 47M).
A notable difference between our model and state-of-the-art models such as RCN and Counter is that they are trained using classification instead of regression. For fair comparisons, we isolate the contribution of this design choice by introducing RCN Regression, which is a modified RCN that outputs a real number before rounding and is trained using the MSE loss. In Table 1 (second last row), we report an accuracy of 37.25% and 28.55% on simple and complex questions from the testing set, with gains of +6.61 and +1.86 accuracy points respectively over the original version of RCN. Compared to RCN, we note a smaller loss in accuracy between the validation and testing sets, with -24.16 accuracy points on the simple questions against -38.89. This good performance indicates that regression models are a better design choice to avoid learning statistical shortcuts. However, other design choices allow our model to reach further gains, with +10.97 and +3.99 accuracy points on simple and complex questions over RCN Regression. These gains are significantly higher than those resulting from the introduction of the regression alone (+6.61 and +1.86).
Gains in accuracy could be due to different patterns such as an important gain on only one count label or small gains on all of them. We study this in Figure 4, where we display a fine-grained comparison between our model and RCN according to their overall accuracy per count label. Interestingly, we report a higher accuracy on even count labels which are less represented in the training set and a lower accuracy on odd count labels which are more represented in the training set. We also report much smaller differences in accuracy between adjacent count labels, compared with RCN. For instance, we report a loss of -29.56 accuracy points between label 1 and 2 compared to -85.15 with RCN. Overall, there is much less variation in our model between even and odd count labels. These results suggest that our design choices are useful to learn a proper mechanism of counting which helps to generalize to a different distribution of count labels.
In Figure 5, we compare our model against the state-of-the-art model RCN for open-ended counting, and its regression version, according to their overall accuracy on a variety of datasets that can be generated with our Odd-Even-p% and Even-Odd-p% strategies. We vary p from 0 to 100 to go from no shift in distribution to the highest shift. We show that while RCN reaches a slightly better accuracy when the shift in distribution is moderate (e.g., p < 60), our model reaches significant and consistent gains when the shift is bigger (e.g. p > 60). As expected, we report larger gains over RCN ranging from +12.78 accuracy points to +34.52 on datasets that possess the most important shift in distributions (e.g. p > 80). We see similar gains over RCN Regression.
We measure the grounding ability as a proxy to evaluate whether models have learned the proper counting mechanism, and to assess the interpretability of our model. To this end, we specifically design a dataset named COCO-Grounding, and a grounding metric for open-ended counting called GroundP. Both are detailed in the supplementary material and are publicly available. Unlike the metric of (trott2018interpretable, ), GroundP can be used in future work where models may rely on visual features different from ours. Intuitively, the GroundP metric measures the proportion of the total score that is correctly located in the ground truth bounding boxes. In Table 2, we first compare our best model against the state-of-the-art on our GroundP metric and on the mean average precision (mAP), which is a standard object detection metric. We compute it with a detection threshold of 0.2. Both models have been trained on the original TallyQA dataset. As expected, we report a lower accuracy since models that over-rely on statistical shortcuts are not penalized on this dataset. We report the best performance on grounding, and a gain of +24.3 and +9.1 points respectively over our retrained version of Counter. We do not compare against RCN (acharya2019tallyqa, ), because it does not internally associate counting numbers to regions of the image. We also report a gain of +8.7 and +4.7 points respectively over our model optimized without entropy regularization. These results demonstrate the effectiveness of our regularization (Eq. 2) for counting.
Model | COCO-Grounding GroundP | COCO-Grounding mAP | TallyQA Simple Acc. | TallyQA Simple RMSE | TallyQA Complex Acc. | TallyQA Complex RMSE |
---|---|---|---|---|---|---|
SCN (ours) | 51.4 | 22.7 | 63.23 | 1.09 | 47.01 | 1.46 |
SCN w/o entropy | 42.7 | 18.0 | 63.88 | 1.07 | 47.03 | 1.45 |
Counter* (zhang2018counter, ) | 27.1 | 13.5 | 65.39 | 1.27 | 53.49 | 1.51 |
RCN (acharya2019tallyqa, ) | - | - | 71.8 | 1.13 | 56.2 | 1.43 |
Counter (zhang2018counter, ) | - | - | 70.5 | 1.15 | 50.9 | 1.58 |
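The intuition behind a grounding score of this kind, the share of the total counting score that falls on ground truth objects, can be sketched in Python as follows. The exact GroundP definition is given in the supplementary material and is not reproduced here; the matching rule below (a region is "correctly located" if its IoU with some ground truth box is at least 0.5) and the function names are assumptions made purely for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounded_score_fraction(regions, gt_boxes, iou_threshold=0.5):
    """Fraction of the total counting score assigned to regions matching a GT box.

    `regions` is a list of (box, score) pairs. The IoU-based matching rule is
    an assumption of this sketch, not the paper's exact metric.
    """
    total = sum(score for _, score in regions)
    if total == 0:
        return 0.0
    grounded = sum(
        score
        for box, score in regions
        if any(iou(box, gt) >= iou_threshold for gt in gt_boxes)
    )
    return grounded / total
```

A score of this form depends only on boxes and scores, so it can be computed for any model that attaches a counting score to image regions.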
In Figure 6, we display representative examples of outputs of our model with (on the left) and without (on the right) entropy regularization. Both versions are the same as those compared in Table 2. We first compare both models on their ability to get the correct prediction for the question ‘How many giraffes are shown?’. We display bolded red bounding boxes around objects when their associated count value is close to 1. We find that our model trained with entropy selects the correct two regions of giraffes. On the other hand, our model without entropy fails to distinguish duplicates and associates fractional values to multiple regions that contain giraffes. It also predicts the wrong count label (2.71 is rounded to 3). We report similar observations for the question ‘How many zebras are shown?’. Our entropy regularization strategy is thus critical to improve the interpretability of our model. More examples can be found in supplementary materials.
We propose an experimental protocol, called Modifying Count Distribution (MCD), to penalize open-ended counting models that over-rely on statistical shortcuts. It generates various modified dataset versions where the distributions of even and odd count labels are different between the training and testing sets, while keeping similar distributions of words and images. We then introduce a model, Spatial Counting Network (SCN), which encompasses important design choices that help to overcome statistical shortcuts. Specifically, it models region relationships and associates a score to each region before summing them to the final predicted count. It is trained with a regression loss and a regularization that minimizes the binary entropy of each score. We evaluate SCN against state-of-the-art models and report more robustness to distribution changes. We also show that our entropy-based regularization strategy has a beneficial impact on grounding ability. For future work, we plan to extend our experimental protocol to more general machine learning problems.
We develop a framework that aims at reducing the undesired learning of statistical shortcuts, or unwanted biases, from the training data. This is a common and important issue in vision-and-language tasks, and more generally in machine learning. Reducing the learning of shortcuts is essential if we aim to use those models in the real world, where the data distribution does not necessarily follow the training data distribution. It is also related to algorithmic fairness, i.e. the development of models that are independent of some sensitive variables. We believe our approach could be used in similar settings where shortcuts can harm fairness. Finally, our work is a step towards better interpretability for counting models. Interpretability is an important characteristic of models that can have a positive impact on trust towards those systems.
We would like to thank Manoj Acharya and Hisham Cholakkal for their availability and kindness when answering our questions.
The effort from Sorbonne University was partly supported within the Labex SMART supported by French state funds managed by the ANR within the Investissements d’Avenir programme under reference ANR-11-LABX-65, and partly funded by grant DeepVision (ANR-15-CE23-0029-02, STPGP-479356-15), a joint French/Canadian call by ANR & NSERC. This work benefited from the Jean-Zay cluster.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR), 2013.
Towards perspective-free object counting with deep learning. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2016.
Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020.
Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 117–126, 2015.
Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
We organize the supplementary materials along the sections of the main paper. In Section 6.1, we provide details about the training, validation and testing sets generated with our Odd-Even-p% and Even-Odd-p% strategies on the TallyQA dataset, and show that they lead to small shifts in distribution of words and visual concepts while allowing big shifts in distribution of count labels. In Section 6.2, we provide implementation details about the models used in this study to ease the reproducibility of our results. In Section 6.3, we provide details about the grounding metrics. In Section 6.4, we provide additional experimental results. We show that our SCN model reaches higher performance than standard baselines. We also display qualitative results. In Section 6.5, we directly address questions on reproducibility from the NeurIPS 2020 community. We also attach the code in a zip file. Details about the code can be found in the README.md file inside the zip.
Before applying our ablation strategies (Odd-Even-p% and Even-Odd-p%) on the TallyQA dataset, we first build a validation set by removing 10% of images from the original training set. All image-question-count triplets that contain those images are set aside to build the validation set. We then apply a chosen ablation strategy to each set so that the training and validation sets follow the same count label distribution while the testing set follows a different one. For instance, the Odd-Even-90% strategy removes 90% of triplets associated with even count labels from the training set, 90% of triplets associated with even count labels from the validation set, and 90% of triplets associated with odd count labels from the testing set. In Table 3, we display the number of odd and even triplets in each set when we apply the Odd-Even-p% strategy on TallyQA with various values of p. In Table 4, we display the number of triplets in each set when we apply the Even-Odd-p% strategy on TallyQA with various values of p.
p% | Training set (Odd) | Training set (Even) | Validation set (Odd) | Validation set (Even) | Testing set (Odd) | Testing set (Even) |
---|---|---|---|---|---|---|
0 % | 87,289 | 137,102 | 9,635 | 15,292 | 23,138 | 15,451 |
50 % | 87,289 | 68,549 | 9,635 | 7,644 | 11,565 | 15,451 |
90 % | 87,289 | 13,707 | 9,635 | 1,525 | 2,328 | 15,451 |
100 % | 87,289 | 0 | 9,635 | 0 | 0 | 15,451 |
Numbers for other values of p can be obtained with a linear interpolation.
p% | Training odd | Training even | Validation odd | Validation even | Testing odd | Testing even |
---|---|---|---|---|---|---|
0 % | 87,289 | 137,102 | 9,635 | 15,292 | 23,138 | 15,451 |
50 % | 43,643 | 137,102 | 4,815 | 15,292 | 23,138 | 7,719 |
90 % | 8,725 | 137,102 | 969 | 15,292 | 23,138 | 1,551 |
100 % | 0 | 137,102 | 0 | 15,292 | 23,138 | 0 |
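The Odd-Even-p% strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the released code: the function names, the dictionary layout of a triplet, the random choice of which triplets to drop, and the treatment of the count label 0 as even are all assumptions.

```python
import random


def drop_parity(triplets, p, parity, seed=0):
    """Remove a fraction p (between 0 and 1) of the triplets whose count
    label has the given parity (0 = even, 1 = odd).

    Assumption: removed triplets are chosen uniformly at random.
    """
    rng = random.Random(seed)
    targeted = [t for t in triplets if t["count"] % 2 == parity]
    untouched = [t for t in triplets if t["count"] % 2 != parity]
    n_keep = round(len(targeted) * (1 - p))
    return untouched + rng.sample(targeted, n_keep)


def odd_even_p(train, val, test, p):
    """Odd-Even-p%: drop even labels from train/val and odd labels from test."""
    return (
        drop_parity(train, p, parity=0),
        drop_parity(val, p, parity=0),
        drop_parity(test, p, parity=1),
    )
```

Under the same assumptions, the Even-Odd-p% strategy would be obtained by swapping the two parity values.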
We compute the distributions of words from the questions and of visual concepts in the images for various Odd-Even-p% training sets, and compare them to the original distributions of TallyQA. To compute the word distribution, we proceed as follows. We first remove the common words how, many, can, you, scene, picture, pictured, image, photo, there, are, seen, see, visible, shown, this, in, the, on, be, of, a, to, so as to keep only the words that are associated with specific concepts in the images. We then compare the distributions using the Bhattacharyya coefficient bhattacharyya1946measure, a similarity metric which reaches 0 when there is no overlap between the distributions, and 1 when both are the same. Similarly, we compute the distributions of visual concepts by using the categories assigned to every bounding box extracted by our pre-trained object detector (anderson2018bottom, ), and compare the distributions using the Bhattacharyya coefficient. In Table 5, we see that all coefficients are very close to 1, even for the datasets generated with the 100% strategies, which confirms that our protocol alters the distributions of words and visual concepts very little.
p% | Word similarity | Visual similarity |
---|---|---|
0 % | 1.0 | 1.0 |
50 % | 0.997 | 0.9999 |
90 % | 0.986 | 0.9996 |
100 % | 0.976 | 0.9995 |
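As a minimal sketch of the comparison described above, the Bhattacharyya coefficient between two discrete distributions p and q is the sum over all items of sqrt(p(x) q(x)). The tokenization and the helper names below are assumptions made for illustration; only the list of removed words comes from the text.

```python
import math
import re
from collections import Counter

# Common words removed before comparing the distributions (list from the text).
STOP_WORDS = {
    "how", "many", "can", "you", "scene", "picture", "pictured", "image",
    "photo", "there", "are", "seen", "see", "visible", "shown", "this",
    "in", "the", "on", "be", "of", "a", "to",
}


def word_counts(questions):
    """Count the remaining words (assumed tokenization: lowercase letter runs)."""
    counts = Counter()
    for question in questions:
        for word in re.findall(r"[a-z]+", question.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts


def bhattacharyya_coefficient(counts_a, counts_b):
    """Similarity of two count dictionaries after normalizing them to sum to 1."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    keys = set(counts_a) | set(counts_b)
    return sum(
        math.sqrt((counts_a.get(k, 0) / total_a) * (counts_b.get(k, 0) / total_b))
        for k in keys
    )
```

The same function applies to visual concepts by passing counts of detected box categories instead of word counts.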
We will release the PyTorch paszke2019pytorch code to generate the datasets and reproduce our results. We will also release pre-trained models and configuration files with the following hyperparameters.
We use the common Faster R-CNN ren2015faster pre-trained by (anderson2018bottom, ) to extract object features from the image, and the common GRU language model pre-trained by (kiros2015skipthoughts, ) to extract language features from the question. To keep the number of parameters similar to that of the state-of-the-art RCN model (acharya2019tallyqa, ), we use hidden dimensions of 1500 for the multimodal embeddings, 500 for the self-attention, 768 for both bilinear fusions, and use only one self-attention head. We train our model for 30 epochs with the Adam optimizer (kingma2015adam, ) and a learning rate of 2e-5 which is decayed by 0.25 every 2 epochs, starting at epoch 15. The learning rate scheduling was tuned on the validation accuracy of the Odd-Even-90% set. Importantly, for all other experiments, we use the exact same hyperparameters. We early stop training based on the highest accuracy computed on the validation set. During training, we fix the weight controlling the influence of the entropy loss to 1 in order to keep a similar order of magnitude between the norm of the gradients computed with the entropy loss and the norm of the gradients computed with the MSE loss. After obtaining our main results, we experimented with values around 1 to assess the robustness of this choice. We observe only small variations in accuracy scores.
For the RCN baseline, we follow the implementation and hyperparameters described in acharya2019tallyqa . We create RCN regression by changing the output dimension of the last linear layer from 15 to 1. This allows us to train the model with an MSE regression loss instead of a classification loss. We use the same hyperparameters as RCN.
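The learning rate schedule used for SCN can be written as a small function. This sketch rests on two assumptions that the text does not settle: "decayed by 0.25" is read as multiplying the learning rate by 0.25, and the first decay is applied at epoch 15 itself. The function name and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=2e-5, decay=0.25, start_epoch=15, step=2):
    """Learning rate at a given epoch.

    Assumptions: the rate is multiplied by `decay` every `step` epochs,
    and the first multiplication happens at `start_epoch`.
    """
    if epoch < start_epoch:
        return base_lr
    n_decays = (epoch - start_epoch) // step + 1
    return base_lr * decay ** n_decays
```

With these assumptions the rate stays at 2e-5 through epoch 14, drops to 5e-6 at epochs 15 and 16, and keeps shrinking by the same factor every two epochs afterwards.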
Similarly to the work done in IRLC trott2018interpretable , we use the grounding ability as a proxy to evaluate the proper counting mechanism, and to assess the interpretability of models. To this end, we build COCO-Grounding, a dataset specifically designed to be usable in future work where visual features may differ from ours. Our dataset is composed of the 4459 images from MSCOCO (lin2014microsoftcoco, ) that cannot be found in Visual Genome (krishna2017vgenome, ) and, importantly, are not in the TallyQA training set. Each MSCOCO image is annotated with bounding boxes around objects, each associated with a category among 80 classes of objects. We use these classes to automatically generate simple questions about a given image using the "How many {class}?" pattern. The answer to a question is a count label obtained by counting the number of bounding boxes associated with the given {class}. We also generate questions associated with the count label 0 by sampling a random class among the 80 that is not present in the image. We generate an equal number of 734 image-question-count triplets for each of the count labels 0, 1 and 2, and generate all possible triplets for higher count labels (with a maximum label of 15), to reach a total of 3311 triplets over 2139 images. This subset will be publicly released.
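The question generation step can be illustrated with a short sketch. The function name, the tuple layout of a triplet, and the choice to skip classes with more than 15 boxes are assumptions; the balancing step that keeps 734 triplets for each of the labels 0, 1 and 2 is not shown.

```python
import random
from collections import Counter


def generate_triplets(image_id, box_categories, all_classes, rng, max_count=15):
    """Build (image, question, count) triplets from the box categories of one image.

    Assumption: classes with more than `max_count` boxes are skipped.
    """
    counts = Counter(box_categories)
    triplets = [
        (image_id, f"How many {cls}?", n)
        for cls, n in sorted(counts.items())
        if n <= max_count
    ]
    # One zero-count question about a class that is absent from the image.
    absent = sorted(set(all_classes) - set(counts))
    if absent:
        triplets.append((image_id, f"How many {rng.choice(absent)}?", 0))
    return triplets
```

For an image with two boxes labeled "dog" and one labeled "cat", this yields one question per present class plus one question with the count label 0 about a sampled absent class.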
Similarly to object detection models, our model can output bounding box predictions. We thus use a standard metric in object detection tasks everingham2015pascal ; lin2014microsoftcoco called mean average precision. It allows us to evaluate the ability of our model to detect the correct instances of objects to count in the image. We also introduce a novel grounding metric specifically designed for open-ended counting. We refer to it as GroundP for Grounding Precision. It is derived from the grounding metric in trott2018interpretable , with the difference that we use the ground truth bounding boxes as references to compute the grounding, instead of the bounding boxes extracted from the object detection model. This enables us to more accurately evaluate the grounding, and to compare models with different object detection models. The metric consists in weighting the scores assigned by our model to each proposed bounding box by the portion of their size that overlaps with the ground truth bounding boxes. More details are given in the next paragraph.
For each input triplet $i$, we have a set of ground truth bounding boxes $\{g_{i,1}, \dots, g_{i,K_i}\}$. We note $G_i = \bigcup_k g_{i,k}$ the union of all those bounding boxes. It represents the total area of counted objects. For this triplet $i$, our object detection model returns us region proposals $\{b_{i,1}, \dots, b_{i,N_i}\}$. Our model returns, for each region $b_{i,j}$, a score $s_{i,j}$. The final score is $\sum_j s_{i,j}$. For every proposed bounding box $b_{i,j}$, we compute its precision $P_{i,j}$ (the intersection with ground truth bounding boxes over its own area). A precision of 1 means that the bounding box is totally in the ground truth area, whereas a precision of 0 means that it is totally disjoint.
$$P_{i,j} = \frac{\mathrm{area}(b_{i,j} \cap G_i)}{\mathrm{area}(b_{i,j})} \tag{3}$$
We then weight the precision by the score assigned by our model to this region and sum over the regions to obtain our final score: $\sum_j s_{i,j} P_{i,j}$. We recall that our model’s prediction is $\sum_j s_{i,j}$. The final metric is computed by summing the results over all the images, and normalizing by the sum of predicted scores.
$$\mathrm{GroundP} = \frac{\sum_i \sum_j s_{i,j} \, P_{i,j}}{\sum_i \sum_j s_{i,j}} \tag{4}$$
The interpretation of this metric is straightforward: it represents the proportion of the final score that is correctly grounded in the image. A value of one would mean that all chosen objects (with a nonzero score) are inside the ground truth bounding boxes.
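The GroundP computation described above can be sketched in plain Python. This is an illustrative implementation, not the released one: boxes are assumed to be axis-aligned tuples (x1, y1, x2, y2), the function names are hypothetical, and the area covered by the union of ground truth boxes is computed with a simple coordinate grid, which is adequate for the small number of boxes per image.

```python
def union_area(rects):
    """Area of the union of axis-aligned rectangles (x1, y1, x2, y2)."""
    xs = sorted({x for r in rects for x in (r[0], r[2])})
    ys = sorted({y for r in rects for y in (r[1], r[3])})
    area = 0.0
    for xa, xb in zip(xs, xs[1:]):
        for ya, yb in zip(ys, ys[1:]):
            # A grid cell counts once if at least one rectangle covers it.
            if any(r[0] <= xa and xb <= r[2] and r[1] <= ya and yb <= r[3] for r in rects):
                area += (xb - xa) * (yb - ya)
    return area


def box_precision(box, gt_boxes):
    """Fraction of the proposal's area that lies inside the ground truth union."""
    x1, y1, x2, y2 = box
    clipped = []
    for gx1, gy1, gx2, gy2 in gt_boxes:
        cx1, cy1 = max(x1, gx1), max(y1, gy1)
        cx2, cy2 = min(x2, gx2), min(y2, gy2)
        if cx1 < cx2 and cy1 < cy2:
            clipped.append((cx1, cy1, cx2, cy2))
    box_area = (x2 - x1) * (y2 - y1)
    return union_area(clipped) / box_area if box_area > 0 else 0.0


def ground_p(samples):
    """samples: iterable of (proposal_boxes, scores, gt_boxes), one per triplet."""
    weighted, total = 0.0, 0.0
    for boxes, scores, gt_boxes in samples:
        for box, score in zip(boxes, scores):
            weighted += score * box_precision(box, gt_boxes)
            total += score
    return weighted / total if total > 0 else 0.0
```

Clipping each ground truth box to the proposal before taking the union avoids counting overlapping ground truth boxes twice.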
In Table 6, we compare additional baselines against our SCN model on datasets generated with the Odd-Even-90% strategy. The Random (train) baseline consists in randomly sampling the predicted count labels following their distribution in the training set. Random (test) samples count labels following their distribution in the testing set. While this baseline leverages knowledge about the testing set distribution, SCN reaches significantly better accuracy scores, with +16.88 accuracy points on simple questions and +2.59 on complex questions. The RCN + Uni. Sampl. baseline is trained with a different triplet sampling strategy than RCN, based on a uniform sampling of the count labels. By doing so, we seek to train an RCN model that is more robust to shifts in distribution between the training and testing sets. As expected, we report lower accuracy than RCN on the validation set, with a loss of 4.09 points. Interestingly, we also report a lower score on the simple testing set, with a loss of 4.14 points, and a slightly better score on the complex testing set, with a gain of 1.09 points. Overall, we report significantly lower scores than our SCN, with 21.72 fewer accuracy points on simple questions. These results show that the uniform sampling strategy is not suited to learning robust counting mechanisms.
Model | Testing set (Simple) | Testing set (Complex) | Validation set |
---|---|---|---|
Random (train distribution) | 11.04 | 9.21 | 32.15 |
Random (test distribution) | 31.34 | 29.95 | 10.37 |
RCN* (acharya2019tallyqa, ) + Uni. Sampl. | 26.50 | 27.78 | 65.44 |
RCN* (acharya2019tallyqa, ) | 30.64 | 26.69 | 69.53 |
SCN (ours) | 48.22 | 32.54 | 54.78 |
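A random baseline of the kind compared above can be sketched as follows. The function names and the seeding are assumptions for illustration; the sketch only shows the idea of sampling predictions according to the empirical count label distribution of a reference set (training or testing).

```python
import random
from collections import Counter


def random_baseline(reference_labels, n_predictions, seed=0):
    """Sample predictions following the label frequencies of a reference set."""
    rng = random.Random(seed)
    counts = Counter(reference_labels)
    labels = sorted(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=n_predictions)


def accuracy(predictions, targets):
    """Percentage of predictions equal to the target count label."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return 100.0 * correct / len(targets)
```

Passing the training labels as the reference set gives the Random (train) variant, and passing the testing labels gives Random (test).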
In Figure 7, we show the influence of our entropy loss. We compare our SCN model and an SCN model trained without the entropy loss. For the question "How many people are in the picture?", we see that our SCN model selects the correct four regions, while the model trained without the entropy loss fails to distinguish duplicates and assigns fractional values to multiple regions that contain people. It also leads to a fractional prediction of the count label. In Figure 8, we display two complex questions on the same image, and show that our SCN model is able to select an object (person) and filter according to an attribute (sport).
The range of hyper-parameters considered, method to select hyper-parameter configuration and specification of all hyper-parameters used to generate results.
See Section 6.2.
The exact number of training and evaluation runs.
In Table 1, we report the mean over 3 runs with different seeds for our main result. We report the variation in the associated paragraph.
All other experiments are launched once.
A clear definition of the specific measure or statistics used to report results.
We report standard Accuracy and RMSE, which are widely used metrics, and grounding metrics, which are detailed in Section 6.3.
A description of results with a central tendency (e.g. mean) & variation (e.g. error bars).
In Table 1, we report the mean over 3 runs with different seeds for our main result. We report the variation in the associated paragraph.
The average runtime for each result, or estimated energy cost.
Our SCN model takes 10 hours to train on the original TallyQA and 5 hours on the dataset generated by the Odd-Even-100% strategy, which is the smallest dataset, with about half the size of the original TallyQA.
A description of the computing infrastructure used.
We train our model on a single GPU. We have access to several 12GB Titan X Pascal and 32GB Tesla V100.