Revisiting Modulated Convolutions for Visual Counting and Beyond

04/24/2020 · by Duy-Kien Nguyen, et al.

This paper targets visual counting: estimating the total number of occurrences in a natural image given an input query (e.g. a question or a category). Most existing work on counting focuses on explicit, symbolic models that iteratively examine relevant regions to arrive at the final number, mimicking the intuitive process specific to counting. However, such models can be computationally expensive and, more importantly, their specialization limits generalization to other reasoning tasks. In this paper, we propose a simple and effective alternative for visual counting by revisiting modulated convolutions that fuse the query and the image locally. The modulation is performed on a per-bottleneck basis, fusing the query representation with the input convolutional feature maps of each residual bottleneck; we therefore call our method MoVie, short for Modulated conVolutional bottleneck. Notably, MoVie reasons implicitly and holistically and needs only a single forward pass during inference, yet it shows strong empirical performance. First, it significantly advances the state-of-the-art accuracy on counting-specific Visual Question Answering (VQA) datasets (HowMany-QA and TallyQA). Second, it also works for common object counting, outperforming prior art on difficult benchmarks like COCO. Third, integrated as a module, MoVie improves number-related questions for generic VQA models. Finally, MoVie achieves similar, near-perfect results on CLEVR and competitive ones on GQA, suggesting that modulated convolutions can be a useful mechanism for general reasoning tasks beyond counting.




1 Introduction

We focus on the task of visual counting (or simply counting), where given two inputs – a query and an image – one needs to predict the correct number of occurrences in the image corresponding to that query. The query is generic, in the sense that it can be a natural language question [antol2015vqa] (e.g. ‘how many kids are near the giraffe’), a category name [chattopadhyay2017counting] (e.g. ‘car’), and so on. We also make no assumptions about the input image, for example requiring it to be specialized (e.g. to cells or crowds [lempitsky2010learning]) or simplified for object recognition [johnson2017clevr].

But why study counting? In addition to several practical applications [idrees2013multi, zhang2015cross], the ability to count is considered one of the hallmarks of natural intelligence [kaufman1949discrimination, boysen2014development, gross2009number]. We also argue that visual counting has the potential to serve as a unique, self-contained task [chattopadhyay2017counting, trott2018interpretable] for evaluating a machine’s joint understanding of visual and textual data. Compared to prevailing tasks like multi-class object detection [everingham2015pascal, lin2014microsoft], counting with open-ended questions can go beyond closed-world vocabularies (e.g. 80 categories [lin2014microsoft]) to ‘describe’ the image without localizing objects accurately. Compared to many other types of visual questions (e.g. recognizing a single object or attribute), a counting question is arguably more advanced and challenging [johnson2017clevr], since answering it correctly intuitively requires multiple reasoning steps [zhang2018learning]: 1) depicting what the individual instances look like given the query; 2) finding and discriminating them in the image; and 3) accumulating the results. Naturally, such an intuition-driven reasoning procedure was adopted by most existing counting modules [zhang2018learning, trott2018interpretable] on natural images, which count iteratively by mapping candidate regions (or bounding boxes) [ren2015faster, anderson2018bottom] to ‘symbols’ and counting them explicitly based on their relationships (Figure 1, top-left).

Figure 1: We study visual counting. Different from previous works [trott2018interpretable, zhang2018learning] that perform explicit, symbolic counting (left), we propose an implicit, holistic counter, MoVie, that directly modulates [perez2018film] convolutional feature maps (right) and can outperform state-of-the-art methods on multiple benchmarks. Its simple design also allows potential generalization beyond counting to other visual reasoning tasks (bottom).

While intuitive and interpretable [trott2018interpretable], counting modules with explicit symbolic-style reasoning on regions have several drawbacks. First, modeling regions can be computationally expensive [jiang2020defense], especially when dense, higher-order relationships are concerned. Iterative prediction further complicates the pipeline [zhang2018learning]. More importantly, counting is merely one type of reasoning task. If we look more broadly at the full spectrum of visual reasoning tasks (e.g. logical inference, spatial reasoning, etc.), it is probably infeasible to design specialized modules for every one of them (Figure 1, bottom-left). This motivates the central question of our paper: can we find a simple, feed-forward alternative for visual counting whose applicability is not limited to the counting task itself?

Progress on the diagnostic CLEVR dataset [johnson2017clevr] sheds some light. Specifically, it was shown that the seemingly minor architectural change of using queries to modulate convolutional layers [perez2018film] can lead to major improvements in a model’s reasoning power – almost to the point of ‘solving’ that benchmark by achieving 97% overall accuracy and 94% on counting [perez2018film]. The modulation is applied ‘holistically’ [malinowski2018visual] to the entire convolutional feature map whenever used, and results in a straightforward, fully-convolutional model without explicit symbolic structures or specialized designs [malinowski2018visual] for visual reasoning. Surprising as this finding might be, there is, however, limited evidence [de2017modulating] that modulated convolutions can work beyond the controlled synthetic world for counting in natural scenes – which may be partially due to the dominance of ‘bottom-up attention’ features [anderson2018bottom] that represent natural images with a set of regions rather than plain convolutional feature maps [jiang2020defense].

In this paper we fill this gap and aim to establish a simple and effective counting module for natural images without explicit reasoning. Our work builds on the recent discovery that convolutional grid features can be as powerful as the ‘bottom-up attention’ features for vision and language tasks [jiang2020defense]. Inspired by prior work [perez2018film] and motivated by fusing multiple modalities locally, the central idea behind our approach is to revisit convolutions modulated by query representations. Just as a ConvNet stacks multiple convolutional layers, modulation can take place multiple times and in multiple locations. Following ResNet [he2016deep], we choose the bottleneck as the basic building block for exploring the design space of modulated convolutions, with each bottleneck modulated once; we therefore call our method MoVie: Modulated conVolutional bottleneck. Inference with MoVie is a simple feed-forward pass over the full feature map, without multiple iterations or modeling of regions (Figure 1, top-right).

Despite having no explicit symbolic reasoning, MoVie demonstrates strong counting performance. First, using open-ended questions as input, it significantly improves the state of the art on counting-oriented benchmarks (HowMany-QA [trott2018interpretable] and TallyQA [acharya2019tallyqa]), by up to 4 points in absolute accuracy. More importantly, as a module, MoVie can be easily plugged into general Visual Question Answering (VQA) models [anderson2018bottom, jiang2018pythia, yu2019deep] and improves the ‘number’ category (e.g. by 2% over MCAN [yu2019deep]) on the VQA 2.0 dataset. Furthermore, by simply swapping the question for a pre-defined class as input, we find MoVie also does well at counting common objects [chattopadhyay2017counting], significantly outperforming all previous approaches on challenging datasets like COCO [lin2014microsoft]. To study and better understand this implicit model, we present a detailed ablative analysis of the factors affecting counting performance and provide attention visualizations.

Finally, to validate that MoVie can generalize to visual reasoning tasks beyond counting (Figure 1, bottom-right), we revisit CLEVR [johnson2017clevr], which contains not only counting but also other types of reasoning questions (e.g. comparison). We find that, despite the design differences between our module and previous work [perez2018film] on modulated convolutions, it reaches similar, near-perfect accuracy. We additionally initiate an exploration of natural-image reasoning by testing MoVie on GQA [hudson2019gqa], which again yields competitive performance. These results suggest that modulated convolutions can potentially serve as a simple but effective mechanism for general visual reasoning tasks.

Code will be made available to facilitate future research in this direction.

2 Related Work

Specialized counting.

Counting specialized objects has a number of practical applications [lempitsky2010learning, marsden2018people], including but not limited to cell counting [briggs2009quality, xie2018microscopy], crowd counting [idrees2013multi, zhang2015cross, zhang2016single, sindagi2017survey], vehicle counting [onoro2016towards], wild-life counting [arteta2016counting], etc. While less general, these are important computer vision applications in their respective domains, e.g. medicine and surveillance. Normal convolution filters are used extensively in state-of-the-art models [liu2019point, cheng2019learning] to produce density maps that approximate the target count in a local neighborhood. However, such models are designed to deal exclusively with a single category of interest, and usually require point-level supervision [sindagi2017survey] in addition to the ground-truth overall count for training.

Another line of work on specialized counting is psychology-inspired [kaufman1949discrimination, mandler1982subitizing, gross2009number, cutini2012subitizing] and focuses on the phenomenon coined ‘subitizing’ [kaufman1949discrimination]: humans and animals can immediately tell the number of salient objects in a scene using holistic cues [zhang2015salient]. It is specialized because the number of objects is usually limited to be small (e.g. up to 4 [zhang2015salient]).

General visual counting.

Lifting the restriction of counting one category or a few items at a time, more general task settings for counting have been introduced. Generalizing to multiple semantic classes and more instances, common objects counting [chattopadhyay2017counting] has been explored with a variety of strategies, such as detection [ren2015faster], ensembling [galton1907one], and segmentation [laradji2018blobs]. The most recent development in this direction [cholakkal2019object] also adopts a density-map-based approach, achieving state-of-the-art results with weak, image-level supervision alone. Even more general is the setting of open-ended counting, where the counting target is expressed as a natural language question [antol2015vqa, trott2018interpretable, zhang2018learning, acharya2019tallyqa]. This allows more advanced ‘reasoning’ tasks to be formulated, involving objects, attributes, relationships, and more. Our module is designed for these general counting tasks, with the modulation coming from either a question or a class embedding.

Explicit counting/reasoning modules.

Several counting models have been proposed in the VQA literature [trott2018interpretable, zhang2018learning]. The model published with HowMany-QA [trott2018interpretable] was among the first to treat counting differently from other types of questions, casting the task as a sequential decision-making problem optimized by reinforcement learning to mimic the human reasoning process. A similar argument for this distinction was presented by Zhang et al. [zhang2018learning], who took a step further by showing that their fully-differentiable method can be attached to normal VQA models as a module to boost counting performance. However, the idea of modular design is not new – notably, several seminal works [andreas2016neural, hu2017learning] have described learnable procedures for constructing networks for visual reasoning, with reusable ‘modules’ optimized for particular capabilities (e.g. count, find, compare). Our work differs from these in philosophy: they put more emphasis (and likely more bias) on explicit, interpretable designs, whereas we seek an implicit, data-driven approach in the hope of better generalization to other reasoning tasks.

Implicit reasoning modules.

Like MoVie, this type of module has no explicit, task-oriented design for visual reasoning; instead, it aims to be a general-purpose component. Besides the modulated convolutions that we focus on, the other notable work is the Relation Network [santoro2017simple], which learns to represent pair-wise relationships between features from different locations through simple MLPs, and shows super-human performance on CLEVR [johnson2017clevr]. The counter from TallyQA [acharya2019tallyqa] followed this idea and built two such networks – one among foreground regions, and one between foreground and background. However, their counter is still region-based, and neither generalization as a VQA module nor generalization to other counting/reasoning tasks was demonstrated.

Because the coverage of VQA is broad and existing benchmarks also include questions that require counting ability [antol2015vqa, krishna2017visual], it is important to note that general VQA models [fukui2016multimodal, yu2019deep] without an explicit counter also fall within the scope of ‘implicit’ models when it comes to counting. However, a key distinction of MoVie is that for most VQA models, multi-modal fusion takes effect globally after the visual features [anderson2018bottom, jiang2020defense] have been computed, whereas for MoVie, modulation fuses the query information locally, in a sliding-window fashion, to conditionally update the features. Such a local fusion scheme is intuitively beneficial for counting (discussed in Section 3.1), which our experiments also verify.

3 Counting with Modulations

In this section we describe our approach MoVie. We begin by motivating and introducing our modulated convolutional network for visual counting.

3.1 Modulated ConvNet


Generally speaking, any convolutional network that accepts inputs from other modalities (e.g. text, audio, etc.) in addition to pixels can be viewed as a ‘modulated’ ConvNet. But why choose it for counting? Apart from empirical evidence on synthetic datasets [perez2018film], we believe the fundamental motivation lies in the convolution operation itself. Since ConvNets operate on feature maps with spatial dimensions (height and width), the extra modulation – in our case the query representation – is applied densely to all locations of the map in a fully-convolutional manner. This suits visual counting well for at least two reasons. First, counting (like object detection [ren2015faster]) is a translation-equivariant problem: given a fixed-sized local window, the outcome changes as the input location changes. A local fusion scheme like modulated convolutions is therefore preferable to existing fusion schemes [fukui2016multimodal, yu2017multi], which are typically applied after visual features are pooled into a single global vector. Second, counting requires an exhaustive search over all possible locations, which puts the dominant ‘bottom-up attention’ features [anderson2018bottom] – which sparsely sample regions from the image – at a disadvantage in recall compared to convolutional grid feature maps that output a response for each and every location. With these motivations, we present our simple pipeline for counting next.

Figure 2: Overview of MoVie. We use standard ConvNets (e.g. ResNet [he2016deep]) to produce the input to our counting module, which consists of several modulated convolutional bottlenecks. Each bottleneck is a simple modified version of the residual bottleneck with an additional modulation ‘block’ before the first 1×1 Conv. The last feature map is average-pooled and fed into a two-layer MLP to predict the final answer.

Pipeline and module.

In Figure 2 (a) we show the overall pipeline. We use the output of the last Conv layer of a standard ConvNet (e.g. Conv5 in ResNet [he2016deep]) as the visual input to our counting module.¹ The module consists of 4 modulated convolutional bottlenecks (Figure 2 (b)). Each bottleneck receives the query as an extra input to modulate the feature map, and produces another feature map as output. The final output, after several stages of local fusion, is average-pooled and fed into a two-layer MLP classifier with ReLU non-linearity to predict the answer. Note that we do not apply fusion between the query and the global pooled vector at the end: all interactions between query and image occur inside the modulated bottlenecks. Next, we delve into these bottlenecks.

¹The flexibility of our bottleneck-based design allows us to place the module earlier, or even split it across different ResNet stages [perez2018film, de2017modulating]. However, we did not find this to help much in practice, and it can incur extra computation overhead when used as a module for general VQA; therefore we stick to placing it at the top.


As depicted in Figure 2 (c), we closely follow the bottleneck design of ResNet, which is both lightweight and effective for learning visual representations [he2016deep]. The original ResNet bottleneck contains three layers between the endpoints of the residual shortcut: two 1×1 Conv layers responsible for reducing and recovering the channel dimension, and one 3×3 Conv layer responsible for aggregating spatial information. A nice property of the ResNet block is that the output channel size of the bottleneck can be specified (other than the default 2048-dim), which offers a trade-off between speed and accuracy. Our modulation block is inserted before the first 1×1 Conv, as we describe next.
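As a rough sketch (not the authors' exact implementation), the bottleneck structure with a modulation hook before the first 1×1 Conv might look like the following in PyTorch; the modulation itself is left as a pass-through placeholder here, and all channel sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ModulatedBottleneck(nn.Module):
    """ResNet-style bottleneck (1x1 reduce -> 3x3 -> 1x1 recover, with a
    residual shortcut) that applies a query-conditioned modulation before
    the first 1x1 Conv. The modulation is passed in as a callable; by
    default it is a pass-through placeholder."""
    def __init__(self, in_ch, mid_ch, out_ch, modulation=None):
        super().__init__()
        self.modulation = modulation or (lambda x, q: x)  # placeholder
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.spatial = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1)
        self.recover = nn.Conv2d(mid_ch, out_ch, kernel_size=1)
        self.shortcut = (nn.Conv2d(in_ch, out_ch, kernel_size=1)
                         if in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU()

    def forward(self, x, q):
        h = self.modulation(x, q)            # fuse query before first 1x1
        h = self.relu(self.reduce(h))        # 1x1: reduce channels
        h = self.relu(self.spatial(h))       # 3x3: aggregate spatially
        h = self.recover(h)                  # 1x1: recover channels
        return self.relu(self.shortcut(x) + h)  # residual shortcut

x = torch.randn(2, 64, 7, 7)    # toy feature map
q = torch.randn(2, 32)          # toy query vector
block = ModulatedBottleneck(64, 16, 128)
assert block(x, q).shape == (2, 128, 7, 7)
```

Stacking several such blocks, then average-pooling and classifying with a two-layer MLP, gives the pipeline of Figure 2 (a).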


Let us first briefly revisit the closely related Feature-wise Linear Modulation (FiLM) [perez2018film] and introduce notation for our modulation block. Since the modulation is applied by sweeping over each and every feature vector of the convolutional feature map, we simplify the discussion by focusing on a single local vector x, without loss of generality. FiLM follows the formulation of batch normalization [ioffe2015batch] and proposes to modulate x with linear transformations. Formally, the modulated output for FiLM is

    y = γ ⊙ x ⊕ β,     (1)

where ⊙ is element-wise multiplication and ⊕ is element-wise addition (the same as normal addition, but written as ⊕ for consistency with Figure 2). Intuitively, γ scales the feature vector x and β shifts it; together they perform an affine (linear) transformation of the input. Both γ and β are conditioned on the query representation q through fully connected layers.

One interesting detail in FiLM is to predict the difference Δγ of the scaling vector rather than γ itself [perez2018film], i.e. γ = 1 ⊕ Δγ, where 1 is an all-one vector. This was presented as a way to avoid zero-centered γ values that can zero-out x in FiLM [perez2018film], but it also essentially creates a connection to deep residual learning [he2016deep], since:

    y = γ ⊙ x ⊕ β = x ⊕ (Δγ ⊙ x ⊕ β).

We can then view f(x, q) = Δγ ⊙ x ⊕ β as a residual function for modulation, conditioned jointly on x and q, that is added back to x. This perspective creates opportunities to explore other simple and effective forms of f for counting, detailed below.

The modulation block for MoVie is shown in Figure 2 (d). The residual modulation function is defined as f(x, q) = W(Δγ ⊙ x), where W is a learnable weight matrix. Intuitively, instead of using the output Δγ ⊙ x directly, the block learns to output inner products between Δγ ⊙ x and each column of W. This increased richness allows the model to potentially capture more intricate relationships. On the other hand, β is removed, as we find it yields diminishing gains. The final modulated output is

    y = x ⊕ W(Δγ ⊙ x).     (2)

Note that since we apply the modulation in a convolutional fashion, W becomes a 1×1 Conv layer (without bias) inside the network.
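The modulated output above can be sketched in a few lines of PyTorch; this is a minimal illustration, assuming Δγ is produced from the query vector q by a single fully connected layer (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class MoVieModulation(nn.Module):
    """Sketch of the MoVie modulation block: y = x + W(dgamma * x),
    where dgamma is predicted from the query q by an FC layer and W is
    realized as a 1x1 convolution without bias."""
    def __init__(self, feat_dim: int, query_dim: int):
        super().__init__()
        self.to_dgamma = nn.Linear(query_dim, feat_dim)  # q -> dgamma
        self.W = nn.Conv2d(feat_dim, feat_dim, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) feature map; q: (N, Dq) query representation
        dgamma = self.to_dgamma(q).unsqueeze(-1).unsqueeze(-1)  # (N, C, 1, 1)
        return x + self.W(dgamma * x)  # residual modulation

x = torch.randn(2, 64, 7, 7)
q = torch.randn(2, 32)
mod = MoVieModulation(64, 32)
assert mod(x, q).shape == x.shape
```

Because `dgamma * x` broadcasts over the spatial dimensions, the same query-conditioned scaling is applied at every location, which is the local, sliding-window fusion described above.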

Fixed-sized input and multi-scale training.

One reasonable concern about using grid feature maps directly for counting is their sensitivity to the input image scale. For example, if the image is resized to be twice as large, one would expect the same answer from the model. Region-based counting models [trott2018interpretable, zhang2018learning] are designed to be robust to scale changes, as the region features are computed on RoI-pooled [ren2015faster], fixed-shape features regardless of the sizes of the bounding boxes in pixels. We find two implementation details helpful in remedying this issue. The first is to keep the input size fixed. Similar to the practice in object detection [ren2015faster], we maintain the original aspect ratio of each image while resizing it so the shorter side has a target number of pixels (e.g. 800), and place it in the upper-left corner. But unlike detection, we always pad the image to the global maximum size, rather than the maximum size within the batch. Second, we employ multi-scale training, uniformly sampling the target size from a pre-defined set. Per our analysis, both are useful for MoVie’s robustness to scale.
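A minimal NumPy sketch of these two details – resizing the shorter side to a sampled target and padding to a fixed, global-maximum canvas – is given below; the canvas size and the nearest-neighbor resize are simplifying assumptions, not the paper's exact preprocessing:

```python
import random
import numpy as np

def resize_shorter_side(img: np.ndarray, target: int) -> np.ndarray:
    """Nearest-neighbor resize so the shorter side equals `target`,
    keeping the aspect ratio (a stand-in for a proper image resize)."""
    h, w = img.shape[:2]
    scale = target / min(h, w)
    nh, nw = round(h * scale), round(w * scale)
    rows = (np.arange(nh) * h / nh).astype(int)
    cols = (np.arange(nw) * w / nw).astype(int)
    return img[rows][:, cols]

def pad_to_global_max(img: np.ndarray, max_hw: int) -> np.ndarray:
    """Place the image in the upper-left corner of a fixed canvas whose
    size is the global maximum, independent of the current batch."""
    canvas = np.zeros((max_hw, max_hw) + img.shape[2:], dtype=img.dtype)
    canvas[:img.shape[0], :img.shape[1]] = img
    return canvas

def train_transform(img: np.ndarray, scales=(400, 600, 800)) -> np.ndarray:
    # multi-scale training: sample the target shorter side per image
    target = random.choice(scales)
    resized = resize_shorter_side(img, target)
    # pad to one fixed size; 2x the largest scale is an assumed cap that
    # accommodates aspect ratios up to 2
    return pad_to_global_max(resized, max_hw=max(scales) * 2)

img = np.zeros((300, 500, 3), dtype=np.uint8)
assert resize_shorter_side(img, 400).shape == (400, 667, 3)
assert train_transform(img).shape == (1600, 1600, 3)
```

Because every training and test image ends up on the same canvas, the feature-map geometry seen by the modulated bottlenecks stays constant across scales.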

Figure 3: MoVie as a counting module for VQA. (a) A high-level overview of current VQA systems: an image representation v and a question representation q are fused to predict the answer a. (b) A naïve approach to include MoVie as a counting module: directly add our average-pooled feature m from after our module (with one FC to match dimensions) to v. (c) We train with separate auxiliary losses on v and m, while during testing only the joint branch v ⊕ m is used. Shaded areas are used for both train and test; white areas are train-only.

3.2 Query Representations

There are two typical types of queries for counting: questions [trott2018interpretable, zhang2018learning] and classes [chattopadhyay2017counting]. We summarize how to convert them to a single vector representation below. For more details, please see our supplementary material.

For questions.

We use an LSTM and self-attention layers [vaswani2017attention] for question encoding [yu2019deep]. Specifically, each word in the question is first mapped to its GloVe embedding [pennington2014glove] (300-dim) and then fed into a one-directional LSTM followed by a stack of 4 self-attention layers. Both the LSTM and the self-attention produce 512-dim outputs per word. An attention mechanism [nguyen2018improved] is used to aggregate information from all words, yielding a final 512-dim vector q.
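The encoder described above can be sketched as follows; this is an approximation that substitutes random embeddings for GloVe and PyTorch's stock Transformer encoder layers for the paper's self-attention stack, with hyper-parameters taken from the text:

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Word embeddings -> one-directional LSTM -> 4 self-attention
    layers -> attention pooling to a single 512-dim question vector."""
    def __init__(self, vocab_size=10000, emb_dim=300, hid_dim=512, n_sa=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # GloVe in the paper
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        sa_layer = nn.TransformerEncoderLayer(
            d_model=hid_dim, nhead=8, batch_first=True)
        self.self_attn = nn.TransformerEncoder(sa_layer, num_layers=n_sa)
        self.pool = nn.Linear(hid_dim, 1)  # attention weight per word

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.embed(tokens))    # (N, T, 512)
        h = self.self_attn(h)                   # (N, T, 512)
        w = torch.softmax(self.pool(h), dim=1)  # (N, T, 1)
        return (w * h).sum(dim=1)               # (N, 512) question vector

tokens = torch.randint(0, 10000, (2, 12))  # batch of token-id sequences
assert QuestionEncoder()(tokens).shape == (2, 512)
```

The pooled 512-dim output is exactly the query representation consumed by the modulation blocks.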

For classes.

Category embeddings are learned jointly with the counting model. All embedding vectors are randomly initialized and normalized by their ℓ2-norm to unit vectors during training and inference. The dimension is set to 1024.

3.3 MoVie as Counting Module

An important property of a question-based counting model is whether it can be used as an integral module for general VQA [zhang2018learning]. To this end, we present how to incorporate MoVie as a counting module in this subsection. Before moving forward, we first give an overview of existing VQA models.

At a high level, state-of-the-art models [anderson2018bottom, jiang2018pythia, yu2019deep] follow a common paradigm for the VQA problem: an image representation v is extracted from pixels and a question representation q is obtained from words; a fusion scheme (e.g. bi-linear pooling [fukui2016multimodal]) then produces the answer a, on which a loss against the ground truth is computed (Figure 3 (a)).

Now, the simplest idea for including MoVie is to wire features from the modulated convolutions into the image representation v. We take the average-pooled feature after our module (denoted m) and directly add it (with one FC to match dimensions) to v. This design is illustrated in Figure 3 (b). However, we find it performs on par with the original VQA model, likely because the specific fusion scheme is not fit to handle such a mixed representation (v ⊕ m), resulting in diminished benefit.

We apply a simple solution to fix this issue [wang2019makes, qi2020imvotenet], shown in Figure 3 (c). Instead of training with a single fusion scheme and a single objective, we additionally train two auxiliary branches: the first trains a normal VQA model with v alone, and the second trains a normal MoVie with m and the MLP alone. This setup ‘forces’ the network to learn powerful representations in both v and m, as each must separately minimize the VQA loss; combined as v ⊕ m, the joint features can make the best use of both. During testing we use only the joint branch, leading to significant improvements, especially on ‘number’-related questions, without sacrificing inference speed.
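A sketch of the resulting three-branch training objective from Figure 3 (c), assuming classification-style VQA losses and equal (assumed) weights for the branches:

```python
import torch
import torch.nn.functional as F

def vqa_joint_loss(a_fused: torch.Tensor,
                   a_vqa: torch.Tensor,
                   a_movie: torch.Tensor,
                   target: torch.Tensor) -> torch.Tensor:
    """Joint-branch loss plus two auxiliary branch losses, each computed
    against the same answer target. At test time only the branch that
    produced `a_fused` would be used."""
    return (F.cross_entropy(a_fused, target)
            + F.cross_entropy(a_vqa, target)     # auxiliary: VQA-only branch
            + F.cross_entropy(a_movie, target))  # auxiliary: MoVie-only branch

torch.manual_seed(0)
target = torch.tensor([1, 0])  # toy answer indices for 2 examples
loss = vqa_joint_loss(torch.randn(2, 5), torch.randn(2, 5),
                      torch.randn(2, 5), target)
assert loss.item() > 0.0
```

Since the auxiliary branches exist only to shape the representations, they add training cost but no inference cost, matching the paper's claim of unchanged test-time speed.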

4 Experiments

We conducted a series of experiments to validate the effectiveness of MoVie, both on and beyond counting. Unless otherwise stated, we train our network using Adam [kingma2015adam] with momentum parameters 0.9 and 0.98, and start training by linearly warming up the learning rate [yu2019deep] for 3 epochs. The rate is decayed by a factor of 0.1 after 10 epochs, and we finish training after 13 epochs. The batch size is set to 128 and all batch normalization layers are frozen. By default, we use ResNet-50 [he2016deep] as the ConvNet base for MoVie.

4.1 Visual Counting

For visual counting, we focus on two tasks: open-ended counting with question queries [trott2018interpretable, zhang2018learning], and counting common objects [chattopadhyay2017counting]. We begin with the former.

Question-based counting setup.

Two datasets are used to evaluate question-based counting. HowMany-QA [trott2018interpretable] contains 83,642 train, 17,714 val, and 5,000 test questions. The train questions are extracted from the 47,542 counting questions of the VQA 2.0 [goyal2017making] train set and the 36,100 counting questions of Visual Genome (VG) [krishna2017visual]. The val and test splits are taken from the VQA 2.0 val set. Each ground-truth answer is a number between 0 and 20. Extending HowMany-QA, the TallyQA [acharya2019tallyqa] dataset augments the train set to 249,318 questions by adding synthetic counting questions automatically generated from COCO [lin2014microsoft] images using semantic annotations. It also splits the test set into two parts – 22,991 test-simple questions and 15,598 test-complex questions – based on whether answering requires advanced reasoning capability [acharya2019tallyqa]. The answers range between 0 and 15.

On both datasets, MoVie is built on top of grid features [jiang2020defense] pre-trained on VG object and attribute detection [anderson2018bottom], which has been shown to be beneficial for vision-and-language tasks [jiang2020defense]. We fine-tune only the parameters of the counting module and keep the underlying grid features fixed. Accuracy (ACC, higher is better) and standard RMSE (lower is better) [acharya2019tallyqa] are the metrics used for comparison.
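For reference, the two metrics can be computed as follows (a straightforward sketch; the official evaluation scripts may differ in detail):

```python
import numpy as np

def count_metrics(pred, gt):
    """Accuracy and RMSE for counting: pred and gt are same-length
    arrays of predicted and ground-truth integer counts."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    acc = float(np.mean(pred == gt))                     # exact-match accuracy
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))     # root mean squared error
    return acc, rmse

acc, rmse = count_metrics([2, 3, 5], [2, 3, 4])  # 2 of 3 exact, one off by 1
```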

(a) Modulation design, starting from FiLM [perez2018film]; the variants switch β and the residual connection on or off.

| setup | ACC | RMSE |
|---|---|---|
| FiLM | 57.1 | 2.67 |
| MoVie (default) | 58.4 | 2.60 |
| MoVie variants | 58.2 | 2.63 |
| | 57.9 | 2.64 |
| | 58.5 | 2.63 |

(b) Number of modulated bottlenecks.

| # bottlenecks | ACC | RMSE |
|---|---|---|
| 1 | 57.2 | 2.65 |
| 2 | 58.0 | 2.63 |
| 3 | 58.2 | 2.62 |
| 4 | 58.4 | 2.60 |
| 5 | 58.5 | 2.63 |

(c) Fixed-size vs. batch-varying size for the image input (trained at 800 pixels).

| fixed size | test size | ACC | RMSE |
|---|---|---|---|
| no | 400 | 22.1 | 4.70 |
| no | 600 | 36.0 | 3.16 |
| no | 800 | 56.2 | 2.68 |
| yes | 400 | 53.9 | 2.92 |
| yes | 600 | 57.3 | 2.69 |
| yes | 800 | 58.4 | 2.60 |

(d) Multi-scale ({400, 600, 800} pixels) vs. single-scale (800) training.

| multi-scale | test size | ACC | RMSE |
|---|---|---|---|
| no | 400 | 53.9 | 2.92 |
| no | 600 | 57.3 | 2.69 |
| no | 800 | 58.4 | 2.60 |
| yes | 400 | 56.5 | 2.78 |
| yes | 600 | 58.8 | 2.66 |
| yes | 800 | 58.8 | 2.59 |

Table 1: Analysis on the HowMany-QA [trott2018interpretable] val set. We ablate the modulation design and show how fixed-size input and multi-scale training improve scale robustness.

We first conduct ablative analysis on HowMany-QA for important design choices in MoVie. Here, all the models are trained on HowMany-QA train and evaluated on val. The results are summarized in Table 1.

Modulation design.

In Table 1 (a), we compare our modulation block with the FiLM block [perez2018film] by replacing f(x, q) = W(Δγ ⊙ x) with Δγ ⊙ x ⊕ β in our module (see the definitions in Equations 1 and 2). We also experimented with other variants of f by switching β and the residual connection on and off. The results indicate that: 1) the added linear mapping W is helpful for counting – all MoVie variants outperform the original FiLM block; 2) the residual connection also plays a role and improves accuracy without introducing more parameters; 3) with W and the residual connection, β is less essential for counting performance.

Number of bottlenecks.

We then varied the number of stacked bottlenecks in the module, with results shown in Table 1 (b). Performance saturates around 4 blocks, but stacking multiple bottlenecks is useful compared to using a single one (a single bottleneck is about 1% lower in accuracy).

Scale robustness.

Tables 1 (c) and 1 (d) ablate our strategy for dealing with changes in input scale. To study the effects, in Table 1 (c) we train at a single scale (800 pixels) and vary the test scale from 400 to 800 using the same trained model. If we only pad images to the largest size within the batch, changing the test scale drastically degrades performance (top 3 rows). Padding the input image to the maximum possible, fixed size substantially improves robustness to scale (bottom 3 rows).

Table 1 (d) further adds multi-scale training (bottom 3 rows), where the shorter side of each image is randomly scaled to 400, 600, or 800 pixels instead of a fixed 800. As expected, this boosts scale robustness even more: e.g. a test size of 600 already performs on par in accuracy with 800. We therefore use both fixed-size input and multi-scale training for the rest of the experiments.

| Method | HowMany-QA ACC / RMSE | TallyQA-Simple ACC / RMSE | TallyQA-Complex ACC / RMSE |
|---|---|---|---|
| MUTAN [ben2017mutan] | 45.5 / 2.93 | 56.5 / 1.51 | 49.1 / 1.59 |
| Counting module [zhang2018learning] | 54.7 / 2.59 | 70.5 / 1.15 | 50.9 / 1.58 |
| IRLC [trott2018interpretable] | 56.1 / 2.45 | – | – |
| TallyQA [acharya2019tallyqa] | 60.3 / 2.35 | 71.8 / 1.13 | 56.2 / 1.43 |
| MoVie | 61.2 / 2.36 | 70.8 / 1.09 | 54.1 / 1.52 |
| MoVie (ResNeXt [xie2017aggregated]) | 64.0 / 2.30 | 74.9 / 1.00 | 56.8 / 1.43 |

Table 2: Results on the TallyQA and HowMany-QA test sets. MoVie outperforms prior work and achieves strong performance even with ResNet-50 (second-to-last row).
Figure 4: Activation maps produced by the last modulated bottleneck, and attention over question words, for several complex questions in TallyQA [acharya2019tallyqa]. The first two rows show successful examples; the last two show four failure cases. Best viewed in color on a screen. Please see the text for detailed explanations.

Question-based counting results.

We report test-set results on counting questions for both HowMany-QA and TallyQA in Table 2. Even with a baseline ResNet-50 model (second-to-last row), MoVie already achieves strong results, e.g. outperforming previous work on HowMany-QA in accuracy. When trained with the more powerful ResNeXt-101 backbone [xie2017aggregated, jiang2020defense], we surpass all previous models by a large margin, e.g. 4% in absolute accuracy on HowMany-QA. The same trend holds on TallyQA, where we achieve better performance on both simple and complex questions, with larger gaps on test-simple.


To better understand the behavior of MoVie, we visualize the activation maps produced by the last modulated bottleneck for several complex questions in TallyQA (Figure 4) using our ResNeXt model. Specifically, we compute the per-location ℓ2-norm of the feature map [malinowski2018learning], which is then normalized to values within [0, 1]. The attention on the question is also visualized using attention weights [nguyen2018improved] from the question encoder. The first two rows give successful examples. On the left side, our network is able to focus on the relevant portions of the image to produce the correct count, and can extract key words from the question in order to count in the image. On the right side, we further modified the questions to be more general; despite the changed context and larger numbers to count, MoVie still gives the right answers.
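The visualization step described above reduces to a few lines; a minimal sketch (the function name is ours, and min-max normalization is assumed as the normalization scheme):

```python
import torch


def activation_map(features):
    """Collapse a convolutional feature map of shape (C, H, W) into an
    (H, W) saliency map: take the per-location L2 norm over the channel
    dimension, then min-max normalize the result into [0, 1]."""
    norm = features.norm(p=2, dim=0)   # (H, W): one scalar per spatial location
    norm = norm - norm.min()
    denom = norm.max()
    return norm / denom if denom > 0 else norm
```

The resulting map can be upsampled to the input resolution and overlaid on the image, as in Figure 4.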

Four failure cases are shown in the last two rows of Figure 4. In the third row, the model attends to the correct locations but fails to answer correctly. In the fourth row, the network produces a wrong attention map, likely because novel categories (e.g. ‘seal’) are not recognized by the underlying ConvNet.

Method RMSE RMSE-nz rel-RMSE rel-RMSE-nz
Aso-sub-ft-3x3* [chattopadhyay2017counting] 0.38 2.08 0.24 0.87
Seq-sub-ft-3x3* [chattopadhyay2017counting] 0.35 1.96 0.18 0.82
ens* [chattopadhyay2017counting] 0.36 1.98 0.18 0.81
LC-ResFCN* [laradji2018blobs] 0.38 2.20 0.19 0.99
glance-noft-2L [chattopadhyay2017counting] 0.42 2.25 0.23 0.91
CountSeg [cholakkal2019object] 0.34 1.89 0.18 0.84
Fast R-CNN* [chattopadhyay2017counting] 0.49 2.78 0.20 1.13
Faster R-CNN* [ren2015faster] 0.35 1.88 0.18 0.80
MoVie 0.30 1.49 0.19 0.67
Table 3: Results on COCO test set [lin2014microsoft] with various RMSE metrics proposed by [chattopadhyay2017counting]. (*: uses instance-level supervision)

Common objects counting.

Second, we examine the performance of MoVie on common object counting [chattopadhyay2017counting], where the query is an object category. For this task, popular object detection datasets (e.g. COCO [lin2014microsoft]) were converted into counting benchmarks [chattopadhyay2017counting]. We follow the standard splits to train and evaluate our model on both sets. The ConvNet base is a ResNet-50 pre-trained on ImageNet [deng2009imagenet] (rather than VG, for fair comparison). Following standard practice, we also fine-tune the backbone below our counting module to close the domain gap in image statistics [lin2014microsoft]. Due to the skewed distribution between zero and non-zero answers – most images contain only a few object categories – we perform balanced sampling during training. In addition to RMSE, three variants are proposed in [chattopadhyay2017counting]: RMSE-nz, which focuses on non-zero counts to filter out easy examples; rel-RMSE, which uses relative errors to better suit human perception; and rel-RMSE-nz, which combines both. We report results on all four metrics, averaged over all categories and summarized in Table 3.
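The balanced sampling mentioned above could be implemented with a standard weighted sampler; a hedged sketch under our own assumptions (the paper does not specify the exact scheme, so we assume equal total probability mass for zero-answer and non-zero-answer examples):

```python
import torch
from torch.utils.data import WeightedRandomSampler


def balanced_sampler(counts):
    """Build a sampler that draws zero-count and non-zero-count
    (image, category) training examples with equal probability,
    countering the skew where most answers are zero."""
    counts = torch.as_tensor(counts, dtype=torch.float)
    is_zero = counts == 0
    n_zero, n_nonzero = int(is_zero.sum()), int((~is_zero).sum())
    # Each group gets total mass 1, split evenly among its members.
    weights = torch.where(is_zero,
                          torch.full_like(counts, 1.0 / max(n_zero, 1)),
                          torch.full_like(counts, 1.0 / max(n_nonzero, 1)))
    return WeightedRandomSampler(weights.tolist(), num_samples=len(counts))
```

Such a sampler plugs directly into a `DataLoader` via its `sampler` argument.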

Besides state-of-the-art counting approaches like CountSeg [cholakkal2019object], we also use the latest improved version of Faster R-CNN [ren2015faster] to evaluate object-detection-based counting, which serves as a much stronger baseline (e.g. than the Fast R-CNN reported in [chattopadhyay2017counting]) thanks to its improved detection performance. Without using any instance-level supervision, MoVie outperforms all methods on three metrics and comes a close second on the remaining one. We also include results on VOC [everingham2015pascal] for reference in the supplementary material.

4.2 Visual Question Answering

Method Yes/No Number Other Overall
MoVie 82.48 49.26 54.77 64.46
MCAN-Small [yu2019deep] 83.59 46.71 57.34 65.81
MCAN-Small + MoVie (naïve fusion) 83.25 49.36 57.18 65.95
MCAN-Small + MoVie (three-branch) 84.01 50.45 57.87 66.72
Table 4: We use MCAN-Small to study different strategies to incorporate MoVie as a counting module. Results reported on VQA 2.0 val set.
Method Yes/No Number Other Overall
MCAN-Large [yu2019deep] 88.46 55.68 62.85 72.59
MCAN-Large + MoVie 88.39 57.05 63.28 72.91
Pythia [jiang2018pythia] 84.13 45.98 58.76 67.76
Pythia + MoVie 85.15 53.25 59.31 69.26
Table 5: Accuracies of state-of-the-art models with MoVie on VQA 2.0 test-dev set.

In this subsection we test MoVie as a counting module for state-of-the-art VQA models. We choose MCAN [yu2019deep] and Pythia [jiang2018pythia] as representative works, since they are VQA challenge winners of the past two years with open-sourced code. For fair comparison, all models are trained using the respective setups (e.g. single-scale input, learning schedule) from their code bases. Since VQA is a vision-and-language task, we again use a ResNet-50 pre-trained on VG [jiang2020defense] to provide convolutional features.

Table 4 shows our analysis of the different mechanisms for incorporating MoVie into MCAN-Small [yu2019deep], described in Section 3.3. We train all models on the VQA 2.0 train split and report breakdown scores on the val split. Trained individually, MoVie outperforms MCAN on ‘number’ questions by a decent margin, but lags behind on the other question types. Directly adding features from MoVie brings limited improvement, but our three-branch training scheme is much more effective, increasing accuracy on all types with a strong emphasis on ‘number’ questions.

In Table 5, we further verify the effectiveness of MoVie with three-branch training on the test-dev split, using both MCAN-Large [yu2019deep] and Pythia [jiang2018pythia]. We use a stronger backbone (ResNeXt-101) and more data (VQA 2.0 train+val; MCAN additionally includes VG [krishna2017visual] questions) for training. We find MoVie consistently strengthens the counting ability of the original VQA model, with significantly higher accuracy on ‘number’ questions.

Method  Overall  Binary  Open
CNN+LSTM  46.6  63.3  31.8
BottomUp [anderson2018bottom]  49.7  66.6  34.8
MAC [hudson2018compositional]  54.1  71.2  38.9
NSM* [hudson2019learning]  63.2  78.9  49.3
MoVie  57.1  73.5  42.7
Humans  89.3  91.2  87.4
Table 6: Results on GQA [hudson2019gqa] test set to show the generalization of MoVie beyond counting. (*: uses scene-graph supervision [krishna2017visual])

4.3 Beyond Counting

Finally, to explore the capability of our model beyond counting on general reasoning tasks, we evaluate MoVie on the diagnostic CLEVR dataset [johnson2017clevr]. We train our model with a ResNet-101 backbone (pre-trained on ImageNet) for 45 epochs on CLEVR, and report a test set accuracy of 97.42% – similar to the observation made in FiLM [perez2018film]. This result implies that it is the idea of modulated convolutions that helps achieve near-perfect performance on CLEVR, rather than the specific form presented in the FiLM paper [perez2018film].
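For readers unfamiliar with the mechanism, a generic FiLM-style [perez2018film] modulated convolution can be sketched as below. This is an illustrative PyTorch block under our own assumptions (layer sizes, placement of the modulation, and the class name are ours), not MoVie's exact bottleneck design:

```python
import torch
import torch.nn as nn


class ModulatedConvBlock(nn.Module):
    """FiLM-style modulation: the query vector is projected to per-channel
    scale (gamma) and shift (beta) parameters, which transform the
    convolutional feature map before the nonlinearity, fusing the query
    with the image locally at every spatial position."""

    def __init__(self, channels, query_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.film = nn.Linear(query_dim, 2 * channels)

    def forward(self, x, query):
        gamma, beta = self.film(query).chunk(2, dim=-1)   # (B, C) each
        h = self.conv(x)
        # Broadcast the per-channel modulation over spatial dimensions.
        h = gamma[..., None, None] * h + beta[..., None, None]
        return torch.relu(h)
```

Because the modulation is channel-wise and applied at every location, the block needs only a single forward pass at inference, in line with the paper's emphasis on implicit, holistic reasoning.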

Since CLEVR is synthetic, we also initiate an exploration of MoVie on the recent natural-image reasoning dataset GQA [hudson2019gqa]. We use a ResNeXt-101 pre-trained on VG and report competitive results in Table 6, without using extra supervision such as scene graphs [krishna2017visual]. Despite its simpler architecture compared to models like MAC [hudson2018compositional], MoVie achieves better overall accuracy, with a larger gain on open questions.

5 Conclusion

In this paper, we propose a simple and effective model named MoVie for visual counting, which revisits modulated convolutions that fuse queries with images locally. Different from previous works that perform explicit, symbolic reasoning [trott2018interpretable, zhang2018learning], counting is done implicitly and holistically in MoVie, requiring only a single forward pass during inference. We significantly outperform the state of the art on three major visual counting benchmarks, namely HowMany-QA, TallyQA and COCO, and show that MoVie can be easily incorporated as a module into general VQA models like MCAN and Pythia to improve accuracy on number-related questions on VQA 2.0. More importantly, we show MoVie can be directly extended to perform well on datasets like CLEVR and GQA, suggesting that modulated convolutions as a general mechanism can be useful for reasoning tasks beyond counting.

Appendix 0.A Implementation Details

In our experiments, query representations are of two types: questions and categories. We provide more details on how we compute the query representation for questions below.

Question representation.

We use LSTM and Self-Attention (SA) layers [vaswani2017attention] for question encoding [yu2019deep]. It has been shown in Natural Language Processing (NLP) research that adding SA layers helps produce informative and discriminative language representations [devlin2019bert]; we also empirically observe better results (improvements in accuracy and reductions in RMSE in our analysis on the HowMany-QA [trott2018interpretable] val set). Specifically, a question (or sentence in NLP [devlin2019bert]) consisting of n words is first converted into a sequence of 300-dim GloVe word embeddings [pennington2014glove], which are fed into a one-directional LSTM followed by a stack of 4 self-attention layers:

  f^0 = LSTM(e_1, …, e_n),  f^l = SA_l(f^{l-1}),  l = 1, …, 4,

where f^l = [f^l_1, …, f^l_n] and f^l_i is the embedding of the i-th word in the question after the l-th SA layer. We cap all questions to the same length, as is common practice, and pad shorter questions with all-zero vectors [devlin2019bert].

Our design of the self-attention layer closely follows [vaswani2017attention] and uses multi-head attention, where each head attends with a separate set of keys, queries, and values. Layer normalization and a feed-forward network are included, without position embeddings.

Given the final word representations, we resort to a summary attention mechanism [nguyen2018improved] to obtain the conditional query vector. A two-layer 512-dim MLP with ReLU non-linearity computes an attention score for each word representation. We normalize all scores with a soft-max to derive attention weights, and compute the aggregated representation as the weighted sum of the word representations.
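The summary attention step can be sketched as a small PyTorch module (the class name is ours; the 512-dim hidden size follows the text):

```python
import torch
import torch.nn as nn


class SummaryAttention(nn.Module):
    """Summarize per-word representations of shape (B, n, d) into one
    conditional vector of shape (B, d): a two-layer MLP with ReLU scores
    each word, the scores are soft-max normalized into attention weights,
    and the words are summed with those weights."""

    def __init__(self, dim, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, words):
        scores = self.mlp(words)                 # (B, n, 1): one score per word
        weights = torch.softmax(scores, dim=1)   # attention over the n words
        return (weights * words).sum(dim=1)      # (B, dim): weighted summary
```

In practice, padded positions would additionally be masked out before the soft-max; we omit that here for brevity.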

Object detection based counting.

We train a Faster R-CNN [ren2015faster] with feature pyramid networks [lin2017feature] using the latest implementation in Detectron2. For fair comparison, we also use a ResNet-50 backbone [he2016deep] pre-trained on ImageNet, the same as for our counting module. The detector is trained on the train2014 split of COCO images, which is referred to as the train set for common object counting [chattopadhyay2017counting]. We train the network for 90K iterations with a batch size of 16, starting from a base learning rate of 0.02 and reducing it by 0.1× at 60K and 80K iterations. Both left-right flipping and scale augmentation (randomly sampling the shorter side from {640, 672, 704, 736, 768, 800}) are used. The reference AP [ren2015faster] on the COCO val2017 split is 37.1. We directly convert the testing output of the detector into per-category counts.
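Converting detector output to counts amounts to thresholding and tallying detections per category; a minimal sketch under our own assumptions (the paper does not state the score threshold, so 0.5 here is illustrative):

```python
from collections import Counter


def detections_to_counts(detections, categories, score_thresh=0.5):
    """Convert one image's detector output into per-category counts.

    `detections` is a list of (category, score) pairs; detections below
    the (assumed) score threshold are discarded, and the survivors are
    tallied per category, with 0 for categories never detected."""
    kept = Counter(cat for cat, score in detections if score >= score_thresh)
    return {cat: kept.get(cat, 0) for cat in categories}
```

This is the sense in which a detector "counts": it never reasons about number directly, which is why it needs instance-level supervision to be competitive.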

Appendix 0.B Definition of RMSE Variants

Besides accuracy and RMSE, the object counting work [chattopadhyay2017counting] additionally proposed several variants of RMSE to evaluate a system's counting ability. For convenience, we also include them here. The standard RMSE is defined as:

  RMSE = sqrt( (1/N) Σ_i (c_i − ĉ_i)^2 ),

where c_i is the ground-truth count, ĉ_i is the prediction, and N is the number of examples. Focusing more on non-zero counts, RMSE-nz evaluates a model's counting ability on the harder examples where the answer is at least one:

  RMSE-nz = sqrt( (1/N_nz) Σ_{i: c_i > 0} (c_i − ĉ_i)^2 ),

where N_nz is the number of examples whose ground-truth count is non-zero. To penalize mistakes more when the count is small (a mistake of 1 when the ground-truth is 2 is more serious than when the ground-truth is 100), rel-RMSE is proposed as:

  rel-RMSE = sqrt( (1/N) Σ_i (c_i − ĉ_i)^2 / (c_i + 1)^2 ).

And finally, rel-RMSE-nz calculates the relative RMSE over the non-zero examples – both challenging and aligned with human perception.
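The four metrics can be computed for one category as follows; a minimal sketch assuming the rel- variants divide the error by the ground-truth count plus one, per [chattopadhyay2017counting]:

```python
import math


def rmse_metrics(gt, pred):
    """Compute RMSE, RMSE-nz, rel-RMSE and rel-RMSE-nz over paired
    ground-truth and predicted counts for one category."""
    def rmse(pairs, rel=False):
        if not pairs:
            return 0.0
        # rel variants normalize the error by (ground-truth + 1).
        sq = [((c - p) / (c + 1) if rel else (c - p)) ** 2 for c, p in pairs]
        return math.sqrt(sum(sq) / len(sq))

    pairs = list(zip(gt, pred))
    nz = [(c, p) for c, p in pairs if c > 0]   # non-zero ground-truth only
    return {"RMSE": rmse(pairs),
            "RMSE-nz": rmse(nz),
            "rel-RMSE": rmse(pairs, rel=True),
            "rel-RMSE-nz": rmse(nz, rel=True)}
```

Per-category results would then be averaged over all categories, as in Tables 3 and 7.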

Appendix 0.C Common Objects Counting on VOC

As mentioned in Section 4.1 of the main paper, we here present the performance of MoVie on the test split of the Pascal VOC counting dataset in Table 7. Different from COCO, which has 82,783 train images, 40,504 val images, and 80 categories, the VOC dataset is much smaller, containing 2,501 train images, 2,510 val images, and 4,952 test images over 20 object categories. MoVie achieves results comparable to the state-of-the-art method CountSeg [cholakkal2019object] on the two relative metrics (rel-RMSE and rel-RMSE-nz) and falls behind on RMSE and RMSE-nz. In contrast, as shown in the main paper, MoVie outperforms CountSeg on COCO by a significant margin on three RMSE metrics. The performance difference between the two datasets suggests that MoVie scales better than CountSeg with dataset size and number of categories. Moreover, the maintained advantage on the relative metrics indicates that the output of MoVie is better aligned with human perception [chattopadhyay2017counting].

Method RMSE RMSE-nz rel-RMSE rel-RMSE-nz
Aso-sub-ft-3x3* [chattopadhyay2017counting] 0.43 1.65 0.22 0.68
Seq-sub-ft-3x3* [chattopadhyay2017counting] 0.42 1.65 0.21 0.68
ens* [chattopadhyay2017counting] 0.42 1.68 0.20 0.65
LC-ResFCN* [laradji2018blobs] 0.31 1.20 0.17 0.61
LC-PSPNet* [laradji2018blobs] 0.35 1.32 0.20 0.70
glance-noft-2L [chattopadhyay2017counting] 0.50 1.83 0.27 0.73
CountSeg [cholakkal2019object] 0.29 1.14 0.17 0.61
MoVie 0.36 1.37 0.18 0.56
Table 7: Results on VOC test set [everingham2015pascal] with various RMSE metrics proposed by [chattopadhyay2017counting]. (*: uses instance-level supervision)

Appendix 0.D Visualization of Where MoVie Helps

Figure 5: Visualization of where MoVie helps MCAN-Small [yu2019deep] across question types on the VQA 2.0 val set. We compute the probability of assigning each question to MoVie based on similarity scores (see Section 0.D for detailed explanations). The top contributing question types are counting related, confirming that state-of-the-art VQA models performing global fusion between vision and language do not work well for counting, and showing the value of MoVie's local fusion. Best viewed on a computer screen.
Figure 6: Similar to Figure 5 but with Pythia [jiang2018pythia]. Best viewed on a computer screen.

When MoVie is used as a counting module for general VQA tasks, we fuse features pooled from MoVie with the features used by state-of-the-art VQA models (e.g. MCAN [yu2019deep]) to jointly predict the answer. A natural question then arises: where does MoVie help? To answer it, we visualize how much MoVie and the original VQA features each contribute to the final answer produced by the joint model. We conduct this study for each of the 55 question types listed in the VQA 2.0 dataset [antol2015vqa] for better insight.

Specifically, let the fused representation in the joint branch be z = g(x + y), where x is from the VQA model, y is from MoVie, and g is the function consisting of the layers applied after the features are summed. We compute two variants of this representation: one without y, z_x = g(x), and one without x, z_y = g(y). Similarity scores are then computed via dot-products: s_x = z · z_x and s_y = z · z_y. Given one question, we assign a score of 1 to MoVie if s_y > s_x, and 0 otherwise. The scores within each question type are then averaged, producing the probability that MoVie is chosen over the base VQA model for that particular question type.
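The per-question decision above can be sketched directly from the definitions (symbol names follow the text; the function name is ours):

```python
import torch


def movie_preference(x, y, g):
    """Given the VQA-model feature x, the MoVie feature y, and the joint
    head g, compare how similar the fused representation z = g(x + y) is
    to each single-branch variant (via dot-product), and return 1 if the
    MoVie branch wins, else 0."""
    z = g(x + y)
    s_x = torch.dot(z, g(x))   # similarity to the VQA-only variant
    s_y = torch.dot(z, g(y))   # similarity to the MoVie-only variant
    return 1 if s_y > s_x else 0
```

Averaging this binary score within a question type yields the per-type probability plotted in Figures 5 and 6.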

We take two models as examples. One is MCAN-Small [yu2019deep] + MoVie (three-branch) (corresponding to last row in Table 4 in the main paper), and the other one replaces MCAN-Small with Pythia [jiang2018pythia]. The visualizations are shown in Figure 5 and 6, respectively. We sort the question types based on how much MoVie has contributed, i.e. the ‘probability’. Some observations:

  • MoVie contributes significantly to counting questions for both MCAN and Pythia: the top three question types are consistently ‘how many people are in’, ‘how many people are’, and ‘how many’. This strongly suggests that existing models which fuse vision and language features globally are not well suited for counting questions, and confirms the value of incorporating MoVie (which performs fusion locally) as a counting module for general VQA models;

  • The ‘Yes/No’ questions likely benefit from MoVie as well, since its contribution spreads across several question types belonging to that category (e.g. ‘are’, ‘are they’, ‘do’, etc.). This may be because counting also covers ‘verification-of-existence’ questions such as ‘are there people wearing hats in the image’;

  • For Pythia, MoVie also likely helps ‘color’-related questions (e.g. ‘what color are the’, ‘what color’, etc.).