Overcoming Statistical Shortcuts for Open-ended Visual Counting

06/17/2020 ∙ by Corentin Dancette, et al. ∙ Facebook ∙ Laboratoire d'Informatique de Paris 6

Machine learning models tend to over-rely on statistical shortcuts. These spurious correlations between parts of the input and the output labels do not hold in real-world settings. We target this issue on the recent open-ended visual counting task, which is well suited to study statistical shortcuts. We aim to develop models that learn a proper mechanism of counting regardless of the output label. First, we propose the Modifying Count Distribution (MCD) protocol, which penalizes models that over-rely on statistical shortcuts. It is based on pairs of training and testing sets that do not follow the same count label distribution, such as the odd-even sets. Intuitively, models that have learned a proper mechanism of counting on odd numbers should perform well on even numbers. Secondly, we introduce the Spatial Counting Network (SCN), which is dedicated to visual analysis and counting based on natural language questions. Our model selects relevant image regions, scores them with fusion and self-attention mechanisms, and provides a final counting score. We apply our protocol on the recent dataset, TallyQA, and show superior performance compared to state-of-the-art models. We also demonstrate the ability of our model to select the correct instances to count in the image. Code and datasets are available: https://github.com/cdancette/spatial-counting-network

1 Introduction

The recent advances in computer vision (krizhevsky2012alexnet; he2016deep) and natural language processing (mikolov2013efficient) allowed the research community to tackle challenging tasks that combine vision and language (kiros2015skipthoughts; karpathy2015deep; lu2016vrd). One of these tasks is open-ended visual counting. Its goal is to count the number of instances in an image given a question formulated in natural language. It extends visual counting tasks, which are focused on one type of instance (sindagi2018survey) or on a limited set of instances (e.g., 80 different objects (chattopadhyay2017counting)). Solving it could pave the way towards the next generation of counting systems that possess interactive interfaces, with applications in biology (lempitsky2010learning), medicine (briggs2009quality), wildlife monitoring (onoro2016towards), smart cities (onoro2016towards; lempitsky2010learning) and more.

Open-ended counting was first introduced as a sub-task of Visual Question Answering (VQA) (antol2015vqa; goyal2017vqa2; krishna2017vgenome; kafle2017tdiuc; johnson2017clevr), where the goal is to answer any type of question about an image. An important problem of VQA models is that they tend to memorize statistical shortcuts (geirhos2020shortcut) (also called spurious or superficial correlations (agrawal2018vqacp), unwanted priors or biases (ramakrishnan2018overcoming; cadene2019rubi; selvaraju2019taking; wu2019self; jing2020overcoming)) between parts of the inputs and the output labels instead of learning proper mechanisms. They reach acceptable results on testing sets that follow a similar distribution to their training set, but their performance degrades significantly otherwise (agrawal2018vqacp). This issue makes them impractical in real-world settings (see also (stock2017imagenet; geirhos2018imagenet; barbu2019objectnet; alcorn2019strike; ilyas2019adversarial; goodfellow2014explaining) for discussions of this issue on object recognition tasks). Open-ended counting models are greatly inspired by, and often compared to, VQA models (zhang2018counter; benyounes2017mutan). While they are developed on specialized datasets for open-ended counting (chattopadhyay2017counting; trott2018interpretable; acharya2019tallyqa), their proximity to VQA models makes them potentially subject to similar problems of statistical shortcuts.

In this paper, we first introduce a novel experimental protocol called Modifying Count Distribution (MCD). It is meant to select design choices that are useful for learning how to count instead of learning shortcuts. It is inspired by previous works on counting from cognitive science (marcus1998rethinking; gross2009number) and on statistical shortcuts in VQA (agrawal2018vqacp; teney2019actively). It consists in evaluating a given model's ability to count on various training and testing sets that follow different count label distributions. As shown in Figure 1, we evaluate the ability of a counting system trained on odd numbers to generalize to even numbers. In this context, models must generalize to unseen or scarcely seen count labels and are heavily penalized for using shortcuts. Inspired by (trott2018interpretable), we also evaluate their ability to correctly ground their final answer in the image. We use a standard object detection metric and also introduce a new metric better suited to the counting task.

With this experimental protocol in mind, we introduce a novel model, Spatial Counting Network (SCN), dedicated to visual analysis and counting in the open-ended setting. Contrary to state-of-the-art approaches such as RCN (acharya2019tallyqa) or Counter (zhang2018counter), which are classification models, ours is a regression model. This crucial design choice allows the model to learn more robust mechanisms by taking into account the structure of the output labels (ordered natural numbers) and by allowing it to output counting values that have been scarcely or never seen in the training set. Another important design choice is that our model assigns individual counting scores to image regions using fusion and self-attention mechanisms, before computing the final count number. While IRLC (trott2018interpretable) learns a hard selection of image regions using reinforcement learning, our model learns a soft selection in an end-to-end fashion. In addition, we introduce an entropy regularization term to enforce sparse region scores. Our design choices guarantee a certain level of interpretability and help generalization under different count label distributions.

Our paper is organized along the following contributions. We first introduce the MCD protocol, based on shifts in count label distribution between training and testing sets. Secondly, we introduce an end-to-end learnable model for counting which integrates design choices that allow it to learn robust counting mechanisms. Finally, we apply our experimental protocol on the most recent and largest open-ended counting dataset, TallyQA (acharya2019tallyqa). We conduct extensive experiments and show that our model performs better than current state-of-the-art models. We also validate our design choices by reporting improvements in grounding ability.

Figure 1: With the existing experimental protocols for open-ended counting, statistical shortcuts can be used to reach correct predictions without learning the underlying counting mechanisms. We propose a Modifying Count Distribution (MCD) protocol that penalizes models that over-rely on such shortcuts for counting. In this setting, the performance of state-of-the-art models (e.g., RCN (acharya2019tallyqa)) is heavily impacted, while our proposed model, Spatial Counting Network (SCN), is more robust to distribution changes in counting.

2 Experimental protocol for open-ended counting

An ideal experimental and evaluation protocol for open-ended counting should select models that learn the underlying mechanism of counting rather than models that rely on statistical shortcuts (geirhos2020shortcut). These spurious correlations between parts of the image-question inputs and the count labels allow models to perform well on pairs of training and testing sets that follow similar distributions, but to fail on real-world data due to a shift in distribution.

2.1 Challenge of statistical shortcuts

Detecting statistical shortcuts

Real-world datasets often contain hidden statistical shortcuts that can be used to reach impressive performance. Detecting them is challenging. A first approach consists in developing specific baselines that only rely on part of the inputs. For instance, question-only models can be used to assess the existence of shortcuts between the question and the answer in VQA datasets (antol2015vqa; goyal2017vqa2; agrawal2018vqacp). However, it is even more challenging to evaluate whether state-of-the-art models over-rely on shortcuts. Common approaches rely on expensive annotations (das2017human) or on explainability methods (stock2017imagenet; manjunatha2019explicit). Humans must then interpret whether the displayed correlations are statistical shortcuts or not.

Penalizing statistical shortcuts

Another approach consists in using testing splits that do not follow the training distribution to penalize models that learn these shortcuts instead of the proper mechanism. It simulates the kind of distribution shifts that can be encountered when models are deployed in real-world scenarios. For instance, the VQA-CP datasets (agrawal2018vqacp) are built by re-organizing the training and testing sets of original VQA datasets, changing the distribution of answers per question type. We propose a similar approach for open-ended counting datasets. We introduce strategies to shift the count label distribution between the original training and testing sets. In this context, models must generalize to unseen or scarcely seen count labels and are heavily penalized for using shortcuts. Finally, we select models according to their robustness to shifts in distribution.

2.2 Modifying Count Distribution protocol

Figure 2: Number of triplets per count label on the TallyQA training set, testing set of simple questions and testing set of complex questions. Bar plots of the Odd-Even-90% strategy (in strong color) are displayed over the ones of the original TallyQA datasets (in light color). Models that over-rely on statistical shortcuts are penalized when evaluated on the even count labels (in yellow).

We now describe our experimental protocol, Modifying Count Distribution (MCD). It penalizes models that over-rely on statistical shortcuts without any need for external annotations or human supervision. Its goal is to select models that have learned a more robust counting mechanism.

Odd-Even-p% and Even-Odd-p% strategies

Given a pair of training and testing sets made of image-question-label triplets following similar distributions, we introduce strategies to produce a shift in the distribution of count labels. The Odd-Even-p% strategy generates unbalanced pairs by removing a percentage p of the triplets associated with an even label from the training set and removing the same percentage of the triplets associated with an odd label from the testing set. We control the amount of statistical shortcuts that can potentially be learned by varying p from 0 to 100. At the extremes, Odd-Even-0% generates the original pairs, while Odd-Even-100% generates a training set with no even count labels and a testing set with no odd count labels (i.e., a zero-shot setting). Figure 2 displays the shift in distribution obtained when applying the Odd-Even-90% strategy on the TallyQA training and testing sets. Odd-Even-90% is our strategy of choice because it introduces a large shift in the distribution of count labels while still allowing classification models to learn from every possible answer. Similarly, we introduce the Even-Odd-p% strategies to generate unbalanced training and testing sets which are mostly composed of triplets associated with even and odd count labels respectively. As shown in the supplementary materials, all of our strategies produce only a small shift in the distributions of images and questions, which is important to evaluate the impact of a shift in count labels in isolation.
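To make the strategy concrete, the following Python sketch implements the parity-based subsampling described above. It assumes triplets carry an integer 'count' field; the field name, the random seed and the rounding of the subsampling ratio are illustrative choices, not the released dataset-generation code.

```python
import random

def odd_even_split(train_triplets, test_triplets, p, seed=0):
    """Odd-Even-p%: drop p% of the even-count triplets from the training set
    and p% of the odd-count triplets from the testing set (sketch)."""
    rng = random.Random(seed)

    def drop(triplets, parity, ratio):
        # Keep all triplets of the other parity; subsample the targeted parity.
        targeted = [t for t in triplets if t["count"] % 2 == parity]
        others = [t for t in triplets if t["count"] % 2 != parity]
        kept = rng.sample(targeted, int(round((1.0 - ratio) * len(targeted))))
        return others + kept

    ratio = p / 100.0
    new_train = drop(train_triplets, parity=0, ratio=ratio)  # remove even from train
    new_test = drop(test_triplets, parity=1, ratio=ratio)    # remove odd from test
    return new_train, new_test

# Example: Odd-Even-90% on toy triplets.
train = [{"count": c} for c in [0, 1, 1, 2, 2, 2, 3, 4]]
test = [{"count": c} for c in [1, 2, 3, 4, 5, 6]]
tr, te = odd_even_split(train, test, p=90)
```

The Even-Odd-p% variant follows by swapping the two parities.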

Validation set

As raised by (teney2020value), similar protocols (agrawal2018vqacp) often select models based on their performance on the testing set only. This bad practice encourages adaptive overfitting (dwork2015preserving) on the testing set distribution. We address this common issue by introducing a validation set. Given a pair of unbalanced training and testing sets, we build the associated validation set as a held-out subset of the training set. We use it to tune hyper-parameters and perform early stopping, so as not to reveal any information about the testing set distribution.

3 Spatial Counting Network

We now describe our model, Spatial Counting Network (SCN). It contains inductive biases that encourage the learning of the counting mechanism and discourage the learning of statistical shortcuts. Our model uses multi-modal fusion and self-attention to assign counting scores to individual image regions, which allows the final accumulated count to be spatially grounded. In order to generalize to modified count distributions, we train our model with a regression loss (as opposed to a classification loss (zhang2018counter; acharya2019tallyqa)) and use entropy regularization to encourage the counting of natural numbers (as opposed to making discrete decisions trained with reinforcement learning (trott2018interpretable)).

Overview

An overview of our model is shown in Figure 3. Formally, given a dataset of triplets (v, q, y) with an image v, a natural language question q and a count label y corresponding to the (non-negative) number of instances in the image, the goal is to learn a mapping ŷ = f_θ(v, q), where θ denotes the learnable parameters. Our model builds such a mapping by first encoding both inputs and fusing them, which we detail next.

Figure 3: Spatial Counting Network. It takes an image and a counting question as inputs and outputs a count label. It is built upon object detectors to extract a set of spatially localized visual representations based on bounding boxes. Each of them is modified according to the question and its neighborhood until a counting score is obtained. The score indicates the presence (value close to 1) or absence (value close to 0) of a corresponding instance in the bounding box. The final count prediction is produced by simply summing up all the scores.

Encoders and multi-modal fusion

As shown in the first block of Figure 3, the model uses two encoders to produce vectorized representations for the image v and the question q. For the image, a pre-trained object detector (anderson2018bottom) is applied to transform the raw pixels into a set of spatially located vectors, with each vector encoding the semantic content of a region (or bounding box) within the image. We project the coordinates of each region into a vector of the same dimension and add it to the associated region representation. For the question, we use skip-thought vectors (kiros2015skipthoughts) to obtain its representation. We then merge each region vector with the question representation using a multi-modal fusion module from (kim2017mlb), resulting in a new set of fused vectors ready for relationship modeling and spatial counting, to be discussed below.
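As an illustration of this encoding and fusion step, here is a minimal PyTorch sketch assuming MLB-style Hadamard fusion (kim2017mlb). The layer shapes (2048-dimensional detector features, 2400-dimensional skip-thought vectors, 1500-dimensional fused vectors) and the exact projection layers are assumptions made for the sketch, not the released implementation.

```python
import torch
import torch.nn as nn

class RegionQuestionFusion(nn.Module):
    """Sketch of the per-region multi-modal fusion: box coordinates are embedded
    and added to the detector features, then each region is fused with the
    question embedding via a Hadamard (MLB-style) bilinear interaction."""

    def __init__(self, dim_region=2048, dim_question=2400, dim_hidden=1500):
        super().__init__()
        self.coord_proj = nn.Linear(4, dim_region)        # (x1, y1, x2, y2) -> region dim
        self.region_proj = nn.Linear(dim_region, dim_hidden)
        self.question_proj = nn.Linear(dim_question, dim_hidden)

    def forward(self, regions, coords, question):
        # regions: (N, dim_region), coords: (N, 4), question: (dim_question,)
        regions = regions + self.coord_proj(coords)        # inject spatial information
        r = torch.tanh(self.region_proj(regions))          # (N, dim_hidden)
        q = torch.tanh(self.question_proj(question))       # (dim_hidden,)
        return r * q                                       # (N, dim_hidden) fused vectors

fusion = RegionQuestionFusion()
fused = fusion(torch.randn(36, 2048), torch.rand(36, 4), torch.randn(2400))
```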

Relationships modeling

Since the bounding boxes used to encode images can overlap, one core challenge for correct counting is to de-duplicate boxes (zhang2018counter; trott2018interpretable) that are assigned to the same instance. We address this by modeling general relationships among the fused region vectors with self-attention (vaswani2017attention), letting the model learn this mechanism. Specifically, a single-head attention module is applied to the set of fused vectors, yielding for each region a contextualized representation, which is then summed element-wise with the region's input vector. Beyond de-duplication, modeling pair-wise relationships can also be helpful for complex questions that require grouping regions (e.g., 'How many types of fruits?') or spatial reasoning (e.g., 'How many cats are under the table?').
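A minimal sketch of this relationship-modeling step, using PyTorch's built-in MultiheadAttention with a single head; the internal attention dimensions of the released model may differ.

```python
import torch
import torch.nn as nn

class RegionSelfAttention(nn.Module):
    """Sketch of the relationship-modeling step: single-head self-attention over
    the fused region vectors, with the contextualized output summed element-wise
    to its input, so overlapping boxes can exchange information."""

    def __init__(self, dim=1500):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=1)

    def forward(self, x):
        # x: (N, dim) fused region vectors for one image-question pair.
        x_seq = x.unsqueeze(1)                   # (N, 1, dim): N regions, batch of 1
        ctx, _ = self.attn(x_seq, x_seq, x_seq)  # (N, 1, dim) contextualized vectors
        return x + ctx.squeeze(1)                # element-wise sum with the input

attn = RegionSelfAttention()
out = attn(torch.randn(36, 1500))                # 36 regions -> 36 updated vectors
```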

Spatial counting

After relationship modeling, the resulting vectors are again fused (kim2017mlb) with the question representation, and a counting score s_i is produced for each region i via a sigmoid activation. Finally, the global count output is a simple summation of all the individual counting scores, ŷ = Σ_i s_i. We name our model Spatial Counting Network because each and every count is explicitly grounded to a spatial region, which allows for easy interpretation and visualization.
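A sketch of this counting head under the same assumptions as above: a second Hadamard fusion with the question representation, a sigmoid-activated linear scorer per region, and a sum over regions.

```python
import torch
import torch.nn as nn

class SpatialCountingHead(nn.Module):
    """Sketch of the final counting step: each contextualized region vector is
    fused again with the question, mapped to a scalar score in [0, 1] with a
    sigmoid, and the predicted count is the sum of the per-region scores."""

    def __init__(self, dim=1500, dim_question=2400):
        super().__init__()
        self.question_proj = nn.Linear(dim_question, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, regions, question):
        # regions: (N, dim), question: (dim_question,)
        fused = torch.tanh(regions) * torch.tanh(self.question_proj(question))
        scores = torch.sigmoid(self.score(fused)).squeeze(-1)  # (N,) per-region scores
        count = scores.sum()                                   # global count prediction
        return count, scores

head = SpatialCountingHead()
count, scores = head(torch.randn(36, 1500), torch.randn(2400))
```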

While the above-described model encapsulates general components like multi-modal fusion and relationship modeling for open-ended counting, we would like to highlight two design choices that are important for improving its generalization to modified count distributions, described next.

Regression, not classification

First, unlike many state-of-the-art counting models (zhang2018counter; acharya2019tallyqa) (and general VQA models, including large-scale pretrained vision-and-language models (lu2019vilbert; tan2019lxmert)) that treat count numbers as classification labels, we argue they should be interpreted as actual numbers, and we directly train the model to regress the final output ŷ to the ground truth count label y. We choose the standard Mean Squared Error (MSE) as the loss:

L_count(ŷ, y) = (ŷ − y)²    (1)

During testing, we round the fractional value to its nearest integer to complete the mapping to count labels. This loss is well suited to counting, as it takes advantage of the natural order of the count labels. It also allows our model to output count labels that were not seen during training, which is beneficial when the testing set follows a different distribution of count labels.
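A minimal sketch of this regression objective and of the test-time rounding; the tensor shapes and the batch-mean reduction are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def count_regression_loss(scores, target):
    # scores: (B, N) per-region sigmoid outputs; target: (B,) integer count labels.
    pred = scores.sum(dim=1)                   # real-valued predicted counts
    return F.mse_loss(pred, target.float())    # squared error (Eq. 1), averaged over the batch

def predict_count(scores):
    # At test time, round the real-valued sum to the nearest integer count label.
    return torch.round(scores.sum(dim=1)).long()

loss = count_regression_loss(torch.rand(8, 36), torch.randint(0, 10, (8,)))
labels = predict_count(torch.rand(8, 36))
```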

Entropy regularization

Second, although regression is a natural choice for number-related tasks, directly applying it to open-ended visual counting can be disadvantageous, because it attempts to model the entire output counting range (i.e., ŷ can take any real value between 0 and the number of regions) and does not take advantage of the fact that all the count labels are integers. One way to fix this is through reinforcement learning (trott2018interpretable), which selects regions one by one, but the resulting objective function is hard to optimize directly. Here we propose an alternative solution by simply imposing a binary entropy regularization term per region:

L_entropy = Σ_i [ −s_i log(s_i) − (1 − s_i) log(1 − s_i) ]    (2)

which essentially encourages each sigmoid output s_i to be close to 0 or 1. Intuitively, it means that for each region, there is either one whole object or none – it won't be fractional (e.g., 0.5). This regularization not only enforces the final count to be close to an integer (since ŷ is produced by summing up scores that are close to 0 or 1), but also benefits grounding the final count in the image (since it significantly reduces the chance of multiple overlapping regions being assigned fractional values that sum up to an integer count), which in turn helps generalization.

Combining MSE and entropy regularization, our final training loss is defined as L = L_count + λ L_entropy, where λ is a fixed hyperparameter. We use λ = 1, which allows our model to reach the best performance in our context.
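A sketch of the entropy regularizer and of the combined objective; the clamping constant and the reductions over regions and batch are assumptions made for the sketch rather than details taken from the released code.

```python
import torch

def binary_entropy_regularizer(scores, eps=1e-6):
    # scores: (B, N) per-region sigmoid outputs in (0, 1); pushes each towards 0 or 1.
    s = scores.clamp(eps, 1.0 - eps)                 # avoid log(0)
    h = -s * s.log() - (1.0 - s) * (1.0 - s).log()   # Eq. (2), element-wise
    return h.sum(dim=-1).mean()                      # sum over regions, mean over batch

def scn_loss(scores, target, lam=1.0):
    # Combined objective: MSE on the summed scores plus the entropy term (lambda = 1).
    pred = scores.sum(dim=-1)
    mse = torch.nn.functional.mse_loss(pred, target.float())
    return mse + lam * binary_entropy_regularizer(scores)

loss = scn_loss(torch.rand(8, 36), torch.randint(0, 10, (8,)))
```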

4 Experiments

Experimental setup

We extensively use TallyQA (acharya2019tallyqa), the most recent and largest open-ended counting dataset. Its training set contains 130K real images from COCO (lin2014microsoftcoco) and Visual Genome (krishna2017vgenome). Each image is associated with questions and count labels, for a total of 250K triplets. It comes with a testing set of 23K simple questions and 16K complex questions. Simple questions only require an object detection ability, while complex questions require abilities to detect relationships between objects and their attributes, spatial reasoning, and more (acharya2019tallyqa). We compare our proposed model to state-of-the-art approaches and strong baselines following our experimental protocol, which penalizes models for relying on statistical shortcuts. Importantly, we do not incorporate knowledge about the testing set distribution, such as sampling or weighting triplets based on their count labels. We report only accuracy scores, as the RMSEs follow the same trend. We then further study the impact of entropy regularization on the ability to select image regions that are important for counting. Implementation details are provided in the supplementary materials.

4.1 State-of-the-art comparison using our MCD protocol

Model | Testing set (Simple) | Testing set (Complex) | Validation set | # of Parameters
Q-Only (acharya2019tallyqa) | 13.14 | 28.97 | 53.78 | 28 M
I-Only (acharya2019tallyqa) | 12.63 | 6.05 | 53.77 | 4 M
Q+I (acharya2019tallyqa) | 17.55 | 24.77 | 61.08 | 30 M
MUTAN (benyounes2017mutan) | 18.58 | 26.20 | 64.33 | 58 M
Counter (zhang2018counter) | 17.99 | 20.43 | 67.50 | 12 M
RCN (acharya2019tallyqa) | 30.64 | 26.69 | 69.53 | 47 M
RCN Regression | 37.25 | 28.55 | 61.41 | 47 M
SCN (ours) | 48.22 | 32.54 | 54.78 | 52 M
Table 1: State-of-the-art comparison on a modified version of TallyQA using our Odd-Even-90% strategy. 90% of the even labels and odd labels have been removed from the original training and testing sets respectively. We report the final accuracy on the two testing sets: simple or complex. We also report accuracy on the validation set which follows the training distribution and is used for early-stopping. Parameter counts are in millions.

Main results

In Table 1, we compare our model against state-of-the-art approaches and other reported models on TallyQA. Notably, RCN (acharya2019tallyqa) and Counter (zhang2018counter) are specifically designed to answer counting questions. Scores for SCN are averaged over three runs, with a variance for simple and complex of 0.3 and 1.1 respectively. We train and evaluate each model on a modified version of TallyQA using our Odd-Even-90% strategy. Models that over-rely on statistical shortcuts are expected to perform well on its validation set, since it follows the training set distribution, but to suffer from a large loss in accuracy on the testing sets. Interestingly, RCN and Counter reach a high accuracy of 69.53% and 67.50% on the validation set, but suffer from huge losses of -38.89 and -49.51 accuracy points respectively on the simple testing set. We observe similar losses on the complex testing set. On the contrary, our model reaches the best accuracy of 48.22% on simple questions and 32.54% on complex questions, with gains in accuracy of +17.58 and +5.85 respectively over RCN (acharya2019tallyqa) (third-to-last row), which is the state-of-the-art model and has a similar number of parameters (52M vs. 47M).

Impact of regression loss

A notable difference between our model and state-of-the-art models such as RCN and Counter is that they are trained with a classification loss instead of a regression loss. For a fair comparison, we isolate the contribution of this design choice by introducing RCN Regression, a modified RCN that outputs a real number, rounded at test time, and is trained using the MSE loss. In Table 1 (second-to-last row), we report an accuracy of 37.25% and 28.55% on simple and complex questions from the testing set, with gains of +6.61 and +1.86 accuracy points respectively over the original version of RCN. Compared to RCN, we note a smaller loss in accuracy between the validation and testing sets, with -24.16 accuracy points on the simple questions against -38.89. These results indicate that regression models are a better design choice to avoid learning statistical shortcuts. However, other design choices allow our model to reach further gains, with +10.97 and +3.99 accuracy points on simple and complex questions against RCN Regression. These gains are significantly higher than those resulting from the introduction of regression alone (+6.61 and +1.86).

4.2 Detailed comparison using our MCD protocol

Figure 4: Comparison between our model and RCN (acharya2019tallyqa) on our modified version of TallyQA using our Odd-Even-90% strategy. Our model reaches significantly better overall accuracy on even labels (in yellow). These count labels are meant to penalize models that over-rely on statistical shortcuts.

Difference in accuracy per count label

Gains in accuracy could come from different patterns, such as a large gain on a single count label or small gains on all of them. We study this in Figure 4, where we display a fine-grained comparison between our model and RCN according to their overall accuracy per count label. Interestingly, we report a higher accuracy on even count labels, which are less represented in the training set, and a lower accuracy on odd count labels, which are more represented. We also report much smaller differences in accuracy between adjacent count labels compared with RCN. For instance, we report a drop of -29.56 accuracy points between labels 1 and 2, compared to -85.15 with RCN. Overall, there is much less variation in our model between even and odd count labels. These results suggest that our design choices are useful for learning a proper mechanism of counting, which helps generalize to a different distribution of count labels.

Figure 5: Comparison between our model, RCN (acharya2019tallyqa) and its regression variant on various versions of TallyQA using our Odd-Even-p% and Even-Odd-p% strategies. p controls the shift in distribution between the training and testing sets (with the original distribution when p = 0). Models that over-rely on statistical shortcuts (e.g., the original RCN) are strongly penalized when p is high (yellow gradient).

Difference in accuracy on various shifts in distribution

In Figure 5, we compare our model against the state-of-the-art model RCN for open-ended counting, and its regression version, according to their overall accuracy on a variety of datasets that can be generated with our Odd-Even-p% and Even-Odd-p% strategies. We vary p from 0 to 100 to go from no shift in distribution to the highest shift. While RCN reaches a slightly better accuracy when the shift in distribution is moderate (e.g., p < 60), our model reaches significant and consistent gains when the shift is larger (e.g., p > 60). As expected, we report larger gains over RCN, ranging from +12.78 to +34.52 accuracy points, on the datasets with the largest shift in distribution (e.g., p > 80). We see similar gains over RCN Regression.

4.3 Study of the grounding ability

Comparison on COCO-Grounding

We measure the grounding ability as a proxy to evaluate whether models have learned the proper counting mechanism, and to assess the interpretability of our model. To this end, we specifically design a dataset named COCO-Grounding, and a grounding metric for open-ended counting called GroundP. Both are detailed in the supplementary material and are publicly available. Unlike the metric of (trott2018interpretable), GroundP can be used in future work where models may rely on different visual features than ours. Intuitively, the GroundP metric measures the proportion of the total score that is correctly located inside the ground truth bounding boxes. In Table 2, we first compare our best model against the state of the art on our GroundP metric and on the mean average precision (mAP), which is a standard object detection metric. We use a detection threshold of 0.2 for the mAP. Both models have been trained on the original TallyQA dataset. As expected, we report a lower accuracy since models that over-rely on statistical shortcuts are not penalized on this dataset. We report the best performance on grounding, with gains of +24.3 GroundP points and +9.1 mAP points over our retrained version of Counter. We do not compare against RCN (acharya2019tallyqa), because it does not internally associate counting scores with regions of the image. We also report gains of +8.7 GroundP points and +4.7 mAP points over our model optimized without entropy regularization. These results justify the effectiveness of our regularization (Eq. 2) for counting.

Model | COCO-Grounding GroundP | COCO-Grounding mAP | TallyQA Simple Acc. | TallyQA Simple RMSE | TallyQA Complex Acc. | TallyQA Complex RMSE
SCN (ours) | 51.4 | 22.7 | 63.23 | 1.09 | 47.01 | 1.46
SCN w/o entropy | 42.7 | 18.0 | 63.88 | 1.07 | 47.03 | 1.45
Counter* (zhang2018counter) | 27.1 | 13.5 | 65.39 | 1.27 | 53.49 | 1.51
RCN (acharya2019tallyqa) | - | - | 71.8 | 1.13 | 56.2 | 1.43
Counter (zhang2018counter) | - | - | 70.5 | 1.15 | 50.9 | 1.58
Table 2: Grounding ability of models trained on the original TallyQA. We report the standard mAP and our GroundP metric on the COCO-Grounding dataset (see supplementary materials for details). Our retrained Counter* (zhang2018counter) reaches a different balance between the simple and complex sets.

Qualitative study

In Figure 6, we display representative examples of outputs of our model with (on the left) and without (on the right) entropy regularization. Both versions are the same as those compared in Table 2. We first compare both models on their ability to produce the correct prediction for the question 'How many giraffes are shown?'. We display bolded red bounding boxes around objects when their associated counting score is close to 1. We find that our model trained with entropy regularization selects the correct two regions containing giraffes. On the other hand, our model without entropy regularization fails to distinguish duplicates and associates fractional values to multiple regions that contain giraffes. It also predicts the wrong count label (2.71 is rounded to 3). We report similar observations for the question 'How many zebras are shown?'. Our entropy regularization strategy is thus critical to improve the interpretability of our model. More examples can be found in the supplementary materials.

Figure 6: Qualitative comparison between our model with and without entropy regularization regarding their ability to select the correct regions to count and to provide correct predictions. Red bounding boxes are shown with bolded borders when their associated counting score is close to 1.

5 Conclusion

We propose an experimental protocol, called Modifying Count Distribution (MCD), to penalize open-ended counting models that over-rely on statistical shortcuts. It generates various modified dataset versions where the distributions of even and odd count labels are different between the training and testing sets, while keeping similar distributions of words and images. We then introduce a model, Spatial Counting Network (SCN), which encompasses important design choices that help to overcome statistical shortcuts. Specifically, it models region relationships and associates a score to each region before summing them to the final predicted count. It is trained with a regression loss and a regularization that minimizes the binary entropy of each score. We evaluate SCN against state-of-the-art models and report more robustness to distribution changes. We also show that our entropy-based regularization strategy has a beneficial impact on grounding ability. For future work, we plan to extend our experimental protocol to more general machine learning problems.

Broader impact

We develop a framework that aims at reducing the undesired learning of statistical shortcuts, or unwanted biases, from the training data. This is a common and important issue in vision-and-language tasks, and more generally in machine learning. Reducing the learning of shortcuts is essential if we aim to use these models in the real world, where the data distribution does not necessarily follow the training distribution. It is also related to algorithmic fairness, i.e., the development of models whose decisions are independent of some sensitive variables. We believe our approach could be used in similar settings where shortcuts can harm fairness. Finally, our work is a step towards better interpretability for counting models. Interpretability is an important characteristic of models that can have a positive impact on the trust placed in those systems.

Acknowledgment

We would like to thank Manoj Acharya and Hisham Cholakkal for their availability and kindness when answering our questions.

The effort from Sorbonne University was partly supported within the Labex SMART supported by French state funds managed by the ANR within the Investissements d’Avenir programme under reference ANR-11-LABX-65, and partly funded by grant DeepVision (ANR-15-CE23-0029-02, STPGP-479356-15), a joint French/Canadian call by ANR & NSERC. This work benefited from the Jean-Zay cluster.

References

  • (1) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
  • (2) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • (3) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR), 2013.
  • (4) Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems (NIPS), 2015.
  • (5) Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • (6) Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2016.
  • (7) Vishwanath A Sindagi and Vishal M Patel. A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recognition Letters, 107:3–16, 2018.
  • (8) Prithvijit Chattopadhyay, Ramakrishna Vedantam, Ramprasaath R Selvaraju, Dhruv Batra, and Devi Parikh. Counting everyday objects in everyday scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • (9) Victor Lempitsky and Andrew Zisserman. Learning to count objects in images. In Advances in Neural Information Processing Systems (NIPS), 2010.
  • (10) C Briggs. Quality counts: new parameters in blood cell counting. International journal of laboratory hematology, 31(3):277–297, 2009.
  • (11) Daniel Onoro-Rubio and Roberto J López-Sastre. Towards perspective-free object counting with deep learning. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2016.
  • (12) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
  • (13) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • (14) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (IJCV), 123(1):32–73, 2017.
  • (15) Kushal Kafle and Christopher Kanan. An analysis of visual question answering algorithms. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • (16) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • (17) Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. arXiv preprint arXiv:2004.07780, 2020.
  • (18) Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • (19) Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. Overcoming language priors in visual question answering with adversarial regularization. In Advances in Neural Information Processing Systems (NIPS), 2018.
  • (20) Remi Cadene, Corentin Dancette, Hedi Ben-Younes, Matthieu Cord, and Devi Parikh. RUBi: Reducing Unimodal Biases for Visual Question Answering. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • (21) Ramprasaath R Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, and Devi Parikh. Taking a hint: Leveraging explanations to make vision and language models more grounded. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • (22) Jialin Wu and Raymond Mooney. Self-critical reasoning for robust visual question answering. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • (23) Chenchen Jing, Yuwei Wu, Xiaoxun Zhang, Yunde Jia, and Qi Wu. Overcoming language priors in vqa via decomposed linguistic representations. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020.
  • (24) Pierre Stock and Moustapha Cisse. Convnets and imagenet beyond accuracy: Understanding mistakes and uncovering biases. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2018.
  • (25) Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • (26) Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • (27) Michael A Alcorn, Qi Li, Zhitao Gong, Chengfei Wang, Long Mai, Wei-Shinn Ku, and Anh Nguyen. Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • (28) Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • (29) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • (30) Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. Learning to count objects in natural images for visual question answering. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • (31) Hedi Ben-Younes, Remi Cadene, Nicolas Thome, and Matthieu Cord. Mutan: Multimodal tucker fusion for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • (32) Alexander Trott, Caiming Xiong, and Richard Socher. Interpretable counting for visual question answering. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • (33) Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answering complex counting questions. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019.
  • (34) Gary F Marcus. Rethinking eliminative connectionism. Cognitive psychology, 37(3):243–282, 1998.
  • (35) Hans J Gross, Mario Pahl, Aung Si, Hong Zhu, Jürgen Tautz, and Shaowu Zhang. Number-based visual generalisation in the honeybee. PloS one, 4(1), 2009.
  • (36) Damien Teney and Anton van den Hengel. Actively seeking and learning from live data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1940–1949, 2019.
  • (37) Abhishek Das, Harsh Agrawal, Larry Zitnick, Devi Parikh, and Dhruv Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? Computer Vision and Image Understanding, 163:90–100, 2017.
  • (38) Varun Manjunatha, Nirat Saini, and Larry S. Davis. Explicit Bias Discovery in Visual Question Answering Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • (39) Damien Teney, Kushal Kafle, Robik Shrestha, Ehsan Abbasnejad, Christopher Kanan, and Anton van den Hengel. On the value of out-of-distribution testing: An example of goodhart’s law. arXiv preprint arXiv:2005.09241, 2020.
  • (40) Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 117–126, 2015.
  • (41) Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • (42) Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard product for low-rank bilinear pooling. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • (43) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • (44) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • (45) Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.
  • (46) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014.
  • (47) Anil Bhattacharyya. On a measure of divergence between two multinomial populations. Sankhyā: the indian journal of statistics, pages 401–406, 1946.
  • (48) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
  • (49) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
  • (50) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • (51) Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 2015.

6 Supplementary materials

We organize the supplementary materials along the main paper sections. In Section 6.1, we provide details about the training, validation and testing sets generated with our Odd-Even-p% and Even-Odd-p% strategies on the TallyQA dataset, and show that they lead to small shifts in the distributions of words and visual concepts while allowing large shifts in the distribution of count labels. In Section 6.2, we provide implementation details about the models used in this study to ease the reproducibility of our results. In Section 6.3, we provide details about the grounding metrics. In Section 6.4, we provide additional experimental results. We show that our SCN model reaches higher performance than standard baselines. We also display qualitative results. In Section 6.5, we directly address questions on reproducibility from the NeurIPS 2020 community. We also include the code in a zip file. Details about the code can be found in the README.md file inside the zip.

6.1 Details about the MCD protocol

Training, validation and testing sets statistics

Before applying our ablation strategies (Odd-Even-p% and Even-Odd-p%) on the TallyQA dataset, we first build a validation set by removing 10% of the images from the original training set. All image-question-count triplets that contain those images are set aside to build the validation set. We then apply a chosen ablation strategy to each set so that the training and validation sets follow the same count label distribution while the testing set follows a different one. For instance, the Odd-Even-90% strategy removes 90% of the triplets associated with even count labels from the training set, 90% of the triplets associated with even count labels from the validation set, and 90% of the triplets associated with odd count labels from the testing set. In Table 3, we display the number of odd and even triplets in each set when we apply the Odd-Even-p% strategy on TallyQA with various values of p. In Table 4, we display the number of triplets in each set when we apply the Even-Odd-p% strategy on TallyQA with various values of p.

p% | Train Odd | Train Even | Val Odd | Val Even | Test Odd | Test Even
0% | 87,289 | 137,102 | 9,635 | 15,292 | 23,138 | 15,451
50% | 87,289 | 68,549 | 9,635 | 7,644 | 11,565 | 15,451
90% | 87,289 | 13,707 | 9,635 | 1,525 | 2,328 | 15,451
100% | 87,289 | 0 | 9,635 | 0 | 0 | 15,451
Table 3: Number of image-question-count triplets for each set generated by our Odd-Even-p% strategy when applied on the TallyQA dataset (Odd-Even-0% leads to the original TallyQA distribution). Numbers of triplets for intermediate values of p can be obtained by linear interpolation.

p% | Train Odd | Train Even | Val Odd | Val Even | Test Odd | Test Even
0% | 87,289 | 137,102 | 9,635 | 15,292 | 23,138 | 15,451
50% | 43,643 | 137,102 | 4,815 | 15,292 | 23,138 | 7,719
90% | 8,725 | 137,102 | 969 | 15,292 | 23,138 | 1,551
100% | 0 | 137,102 | 0 | 15,292 | 23,138 | 0
Table 4: Number of image-question-count triplets for each set generated by our Even-Odd-p% strategy when applied on the TallyQA dataset (Even-Odd-0% leads to the original TallyQA distribution). Numbers of triplets for intermediate values of p can be obtained by linear interpolation.

Shift in distribution of questions and visual concepts

We compute the distributions of words from the questions and of visual concepts in the images in various Odd-Even-p% training sets, and compare them to the original distributions of TallyQA. To compute the word distribution, we proceed as follows. We first remove the common words how, many, can, you, scene, picture, pictured, image, photo, there, are, seen, see, visible, shown, this, in, the, on, be, of, a, to, in order to only keep words associated with specific concepts in the images. We then compare the distributions using the Bhattacharyya coefficient (bhattacharyya1946measure), a similarity metric which reaches 0 when there is no overlap between the distributions and 1 when both are identical. Similarly, we compute the visual concept distributions using the categories assigned to every bounding box extracted from our pre-trained object detector (anderson2018bottom), and compare them with the same coefficient. In Table 5, we see that all coefficients are very close to 1, even for the datasets generated with the 100% strategies, which confirms that our protocol barely alters the distributions of words and visual concepts.

p% | Word similarity | Visual-concept similarity
0% | 1.0 | 1.0
50% | 0.997 | 0.9999
90% | 0.986 | 0.9996
100% | 0.976 | 0.9995
Table 5: Bhattacharyya coefficients (bhattacharyya1946measure): word and visual-concept similarity between each training set generated with our Odd-Even-p% strategy and the original TallyQA training set.
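For reference, a minimal sketch of the Bhattacharyya coefficient over two empirical token distributions; it assumes the stop words listed above have already been removed, and the token lists shown are illustrative.

```python
from collections import Counter
import math

def bhattacharyya_coefficient(tokens_a, tokens_b):
    """Bhattacharyya coefficient between two empirical word distributions:
    1 when the distributions are identical, 0 when they do not overlap."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = sum(count_a.values()), sum(count_b.values())
    return sum(
        math.sqrt((count_a[w] / total_a) * (count_b[w] / total_b))
        for w in set(count_a) & set(count_b)
    )

# Toy example with stop words already removed.
coef = bhattacharyya_coefficient(["dogs", "cats", "dogs"], ["dogs", "cats", "cars"])
```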

6.2 Implementation details

We will release the PyTorch (paszke2019pytorch) code to generate the datasets and reproduce our results. We will also release pre-trained models and configuration files with the following hyperparameters.

Our SCN model

We use the common Faster R-CNN (ren2015faster) pre-trained by (anderson2018bottom) to extract object features from the image, and the common GRU language model pre-trained by (kiros2015skipthoughts) to extract language features from the question. To keep a number of parameters similar to the state-of-the-art RCN model (acharya2019tallyqa), we use hidden dimensions of 1500 for the multimodal embeddings, 500 for the self-attention, 768 for both bilinear fusions, and use only one self-attention head. We train our model for 30 epochs with the Adam optimizer (kingma2015adam) and a learning rate of 2e-5, which is decayed by 0.25 every 2 epochs, starting at epoch 15. The learning rate schedule was tuned on the validation accuracy of the Odd-Even-90% set. Importantly, for all other experiments, we use the exact same hyper-parameters. We early-stop training based on the highest accuracy computed on the validation set. During training, we fix the weight controlling the influence of the entropy loss to 1 in order to keep a similar order of magnitude between the gradient norm computed from the entropy loss and the gradient norm computed from the MSE loss. After obtaining our main results, we experimented with values around 1 to assess its robustness and report only small variations in accuracy.
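A minimal sketch of this optimization schedule, assuming that "decayed by 0.25" means the learning rate is multiplied by 0.25 at each decay step; the placeholder module stands in for the full SCN.

```python
import torch

# Placeholder module for illustration; in practice this would be the SCN model.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

def lr_factor(epoch, start=15, every=2, gamma=0.25):
    # No decay before `start`; afterwards multiply by gamma every `every` epochs.
    if epoch < start:
        return 1.0
    return gamma ** ((epoch - start) // every + 1)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(30):
    # ... one training pass and one validation pass (for early stopping) go here ...
    scheduler.step()
```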

RCN and RCN regression

We follow the implementation and hyperparameters described in (acharya2019tallyqa). We create RCN Regression by changing the output dimension of the last linear layer from 15 to 1. This allows us to train the model with an MSE regression loss instead of a classification loss. We use the same hyperparameters as RCN.

6.3 Details about grounding experiments

COCO-Grounding dataset

Similarly to the work done for IRLC (trott2018interpretable), we use grounding ability as a proxy to evaluate whether a proper counting mechanism has been learned and to assess the interpretability of models. To this end, we design COCO-Grounding, a dataset specifically built to be usable in future work where visual features may differ from ours. Our dataset is composed of the 4459 images from MSCOCO (lin2014microsoftcoco) that cannot be found in Visual Genome (krishna2017vgenome) and, importantly, not in the TallyQA training set. Each MSCOCO image is annotated with bounding boxes around objects, each associated with a category among 80 object classes. We use these classes to automatically generate simple questions about a given image using the "How many {class}?" pattern. The answer to a question is a count label obtained by counting the number of bounding boxes associated with the given {class}. We also generate questions associated with the count label 0 by sampling a random class among the 80 that is not present in the image. We generate an equal number of 734 image-question-count triplets for each of the count labels 0, 1 and 2, and generate all possible triplets for higher count labels (with a maximum label of 15), reaching a total of 3311 triplets over 2139 images. This subset will be publicly released.
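A minimal sketch of the question-generation procedure, assuming per-image lists of box class names; the balancing of the 0/1/2 count labels and the cap at 15 are omitted, and the class list shown is only a subset of the 80 COCO classes.

```python
import random

COCO_CLASSES = ["person", "dog", "giraffe", "zebra"]  # subset of the 80 COCO classes

def make_grounding_triplets(image_annotations, rng=random.Random(0)):
    """Generate 'How many {class}?' questions from COCO box annotations (sketch).
    `image_annotations` maps image_id -> list of class names, one per box."""
    triplets = []
    for image_id, box_classes in image_annotations.items():
        present = set(box_classes)
        for cls in present:
            count = box_classes.count(cls)             # count label = number of boxes
            triplets.append((image_id, f"How many {cls}?", count))
        # Zero-count question: sample a class absent from the image.
        absent = [c for c in COCO_CLASSES if c not in present]
        if absent:
            triplets.append((image_id, f"How many {rng.choice(absent)}?", 0))
    return triplets

triplets = make_grounding_triplets({"img1": ["giraffe", "giraffe", "person"]})
```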

Evaluation metrics

Similarly to object detection models, our model can output bounding box predictions. We thus use a standard object detection metric (everingham2015pascal; lin2014microsoftcoco), the mean average precision (mAP). It allows us to evaluate the ability of our model to detect the correct instances of objects to count in the image. We also introduce a novel grounding metric specifically designed for open-ended counting. We refer to it as GroundP, for Grounding Precision. It is derived from the grounding metric in (trott2018interpretable), with the difference that we use the ground truth bounding boxes as references to compute the grounding, instead of the bounding boxes extracted from the object detection model. This enables us to evaluate the grounding more accurately and to compare models that use different object detectors. The metric consists in weighting the score assigned by our model to each proposed bounding box by the portion of its area that overlaps with the ground truth bounding boxes. More details are given in the next paragraph.

Details about our GroundP metric

For each input triplet t, we have a set of ground truth bounding boxes. We denote by U_t the union of all those bounding boxes; it represents the total area of counted objects. For this triplet, our object detector returns region proposals b_1, ..., b_N, and our model returns a score s_i for each region, so that the final prediction is ŷ = Σ_i s_i. For every proposed bounding box b_i, we compute its precision, i.e., its intersection with the ground truth area over its own area. A precision of 1 means that the bounding box lies entirely inside the ground truth area, whereas a precision of 0 means that it is entirely disjoint from it.

prec(b_i) = area(b_i ∩ U_t) / area(b_i)    (3)

We then weight each precision by the score assigned by our model to the corresponding region and sum over the regions, yielding a per-triplet grounding score Σ_i s_i · prec(b_i). We recall that our model's count prediction is ŷ = Σ_i s_i. The final metric is computed by summing the results over all the triplets and normalizing by the sum of predicted scores:

GroundP = ( Σ_t Σ_i s_i^t · prec(b_i^t) ) / ( Σ_t Σ_i s_i^t )    (4)

The interpretation of the metric is straightforward: it represents the proportion of the final score that is correctly grounded in the image. A value of one means that all selected objects (those with a nonzero score) lie inside the ground truth bounding boxes.
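A minimal sketch of the GroundP computation following Eq. (3) and (4), approximating box areas with a rasterized mask; the box format and image size are illustrative assumptions.

```python
import numpy as np

def ground_p(examples, height=480, width=640):
    """GroundP sketch: proportion of the total predicted score that falls inside
    the ground-truth boxes. Each example is (scores, proposal_boxes, gt_boxes)
    with boxes as (x1, y1, x2, y2) in pixels; areas are computed on a rasterized
    mask for simplicity (an approximation of exact box geometry)."""
    weighted, total = 0.0, 0.0
    for scores, proposals, gt_boxes in examples:
        gt_mask = np.zeros((height, width), dtype=bool)
        for x1, y1, x2, y2 in gt_boxes:                 # union of ground-truth boxes
            gt_mask[int(y1):int(y2), int(x1):int(x2)] = True
        for s, (x1, y1, x2, y2) in zip(scores, proposals):
            box_mask = np.zeros_like(gt_mask)
            box_mask[int(y1):int(y2), int(x1):int(x2)] = True
            area = box_mask.sum()
            prec = (box_mask & gt_mask).sum() / area if area > 0 else 0.0
            weighted += s * prec                        # Eq. (3) weighted by the score
            total += s
    return weighted / total if total > 0 else 0.0       # Eq. (4)

# Toy usage: one image, two proposals, one ground-truth box.
val = ground_p([([0.9, 0.1], [(0, 0, 100, 100), (200, 200, 300, 300)], [(0, 0, 100, 100)])])
```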

6.4 Additional results

Baselines

In Table 6, we compare additional baselines against our SCN model on the datasets generated with the Odd-Even-90% strategy. The Random (train) baseline consists in randomly sampling the predicted count labels following their distribution in the training set. Random (test) samples count labels following their distribution in the testing set. Even though this baseline leverages knowledge about the testing set distribution, SCN reaches significantly better accuracy, with gains of +16.88 accuracy points on simple questions and +4.76 against it. The RCN + Uni. Sampl. baseline is trained with a different triplet sampling strategy than RCN, based on a uniform sampling of the count labels. By doing so, we seek to train an RCN model that is more robust to shifts in distribution between training and testing sets. As expected, we report lower accuracy than RCN on the validation set, with a loss of -4.09 points. Interestingly, we also report lower scores on the simple testing set, with a loss of -4.14 points, and a slightly better score on the complex testing set, with a gain of +1.09. Overall, we report significantly lower scores than our SCN, with -21.72 accuracy points on simple questions. These results show that the uniform sampling strategy is not suited to learning robust counting mechanisms.

Model | Testing set (Simple) | Testing set (Complex) | Validation set
Random (train distribution) | 11.04 | 9.21 | 32.15
Random (test distribution) | 31.34 | 29.95 | 10.37
RCN* (acharya2019tallyqa) + Uni. Sampl. | 26.50 | 27.78 | 65.44
RCN* (acharya2019tallyqa) | 30.64 | 26.69 | 69.53
SCN (ours) | 48.22 | 32.54 | 54.78
Table 6: Comparison against additional baselines on a modified version of TallyQA using our Odd-Even-90% strategy. 90% of the even labels and odd labels have been removed from the original training and testing sets respectively. We report the final accuracy on the two testing sets: simple or complex. We also report accuracy on the validation set which follows the training distribution and is used for early-stopping.

Qualitative results

In Figure 7, we show the influence of our entropy loss. We compare our SCN model to an SCN model trained without the entropy loss. For the question 'How many people are in the picture?', we see that our SCN model selects the correct four regions, while the model trained without entropy fails to distinguish duplicates and associates fractional values to multiple regions that contain people. It also leads to a fractional prediction of the count label. In Figure 8, we display two complex questions on the same image and show that our SCN model is able to select an object (person) and filter according to an attribute (sport).

Figure 7: Qualitative comparison between our SCN model with and without entropy regularization regarding their ability to select the correct regions to count and to provide correct predictions. Red bounding boxes are shown with bolded borders when their associated counting score is close to 1.
Figure 8: Counting scores produced by our SCN model for two complex questions on the same image. Red bounding boxes are shown with bolded borders when their associated counting score is close to 1.

6.5 ML reproducibility

The range of hyperparameters considered, the method to select the hyperparameter configuration, and the specification of all hyperparameters used to generate results.
See Section 6.2.

The exact number of training and evaluation runs.
In Table 1, we report the mean over 3 runs with different seeds for our main result. We report the variation in the associated paragraph. All other experiments are run once.

A clear definition of the specific measure or statistics used to report results.
We report the standard accuracy and RMSE, which are widely used metrics, as well as grounding metrics, which are detailed in Section 6.3.

A description of results with a central tendency (e.g., mean) and variation (e.g., error bars).
In Table 1, we report the mean over 3 runs with different seeds for our main result. We report the variation in the associated paragraph.

The average runtime for each result, or estimated energy cost.
Our SCN model takes 10 hours to train on the original TallyQA and 5 hours on the dataset generated by the Odd-Even-100%. It is the smallest dataset with about half the size of the original TallyQA.

A description of the computing infrastructure used.
We train our model on a single GPU. We have access to several 12GB Titan X Pascal and 32GB Tesla V100.