Revisiting Modulated Convolutions for Visual Counting and Beyond

04/24/2020
by Duy-Kien Nguyen, et al.

This paper targets visual counting, where the task is to estimate the total number of occurrences in a natural image given an input query (e.g., a question or a category). Most existing work on counting focuses on explicit, symbolic models that iteratively examine relevant regions to arrive at the final number, mimicking the intuitive process of counting. However, such models can be computationally expensive and, more importantly, limit generalization to other reasoning tasks. In this paper, we propose a simple and effective alternative for visual counting by revisiting modulated convolutions that fuse query and image locally. The modulation is performed on a per-bottleneck basis by fusing query representations with the input convolutional feature maps of that residual bottleneck. We therefore call our method MoVie, short for Modulated conVolutional bottleneck. Notably, MoVie reasons implicitly and holistically for counting and needs only a single forward pass during inference. Despite this simplicity, MoVie shows strong empirical performance. First, it significantly advances the state-of-the-art accuracies on multiple counting-specific Visual Question Answering (VQA) datasets (i.e., HowMany-QA and TallyQA). Second, it also works on common object counting, outperforming prior art on difficult benchmarks like COCO. Third, integrated as a module, MoVie can be used to improve performance on number-related questions for generic VQA models. Finally, we find that MoVie achieves similar, near-perfect results on CLEVR and competitive ones on GQA, suggesting that modulated convolutions as a mechanism can be useful for more general reasoning tasks beyond counting.
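To make the idea of fusing query and image locally more concrete, here is a minimal sketch of one possible channel-wise modulation. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the function names `project_query` and `modulate`, the linear projection `weight`, the multiplicative fusion, and the residual addition are all hypothetical and may differ from what MoVie actually does.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def project_query(query: Vector, weight: List[Vector]) -> Vector:
    """Map the query embedding to one scale per channel.

    `weight` stands in for a learned linear layer (hypothetical).
    """
    return [sum(w * q for w, q in zip(row, query)) for row in weight]


def modulate(features: FeatureMap, query: Vector, weight: List[Vector]) -> FeatureMap:
    """Scale every spatial location channel-wise by the projected query,
    then add the result back to the input (residual form, an assumption)."""
    scales = project_query(query, weight)
    if len(scales) != len(features):
        raise ValueError("weight must have one row per feature channel")
    return [
        [[v + v * s for v in row] for row in channel]
        for channel, s in zip(features, scales)
    ]


# Toy input: 2 channels on a 2x2 spatial grid, and a 2-dimensional query.
features = [
    [[1.0, 2.0], [3.0, 4.0]],  # channel 0
    [[0.5, 0.5], [0.5, 0.5]],  # channel 1
]
query = [1.0, -1.0]
weight = [
    [1.0, 0.0],  # channel 0 receives scale 1.0
    [0.0, 1.0],  # channel 1 receives scale -1.0
]

modulated = modulate(features, query, weight)
print(modulated)
```

In this toy setup the query amplifies channel 0 and cancels channel 1 at every spatial location. The whole fusion is a single pass over the feature map, in the spirit of the single forward pass described above.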

