Visual Reasoning with Multi-hop Feature Modulation

08/03/2018
by Florian Strub, et al.

Recent breakthroughs in computer vision and natural language processing have spurred interest in challenging multi-modal tasks such as visual question answering and visual dialogue. For such tasks, one successful approach is to condition the computation of an image-based convolutional network on language via Feature-wise Linear Modulation (FiLM) layers, i.e., per-channel scaling and shifting. We propose to generate the parameters of the FiLM layers in a multi-hop fashion, going up the hierarchy of the convolutional network, rather than all at once as in prior work. By alternating between attending to the language input and generating FiLM layer parameters, this approach scales better to settings with longer input sequences, such as dialogue. We demonstrate that multi-hop FiLM generation achieves state-of-the-art results on the short-input-sequence task ReferIt, on par with single-hop FiLM generation, while also significantly outperforming both the prior state of the art and single-hop FiLM generation on the GuessWhat?! visual dialogue task.
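
To make the two ideas in the abstract concrete, the sketch below shows (1) a FiLM layer as per-channel scaling and shifting, and (2) one plausible reading of multi-hop generation, in which a context vector alternately attends over the language states and emits the next layer's FiLM parameters. Only those two behaviours come from the abstract. The dot-product attention, the tanh context update, the `1 + ...` offset on the scale, the random weights standing in for learned ones, and all function names are assumptions made for illustration, not the authors' architecture. Plain Python lists are used to keep the example dependency-free.

```python
import math
import random


def film(features, gamma, beta):
    """FiLM: scale and shift each channel of a [C][H][W] nested list."""
    return [
        [[g * value + b for value in row] for row in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a [rows][cols] matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def multi_hop_film_params(word_states, num_hops, num_channels, rng):
    """Return one (gamma, beta) pair per hop (hypothetical sketch).

    word_states: list of T language vectors, each of length D.
    """
    dim = len(word_states[0])
    # Assumption: random matrices stand in for learned weights.
    w_ctx = [[rng.gauss(0, 0.1) for _ in range(2 * dim)] for _ in range(dim)]
    w_film = [[rng.gauss(0, 0.1) for _ in range(dim)]
              for _ in range(2 * num_channels)]
    # Assumption: the context starts as the mean of the language states.
    context = [sum(col) / len(word_states) for col in zip(*word_states)]
    params = []
    for _ in range(num_hops):
        # Step 1: attend to the language input using the current context.
        scores = [sum(c * s for c, s in zip(context, state))
                  for state in word_states]
        weights = softmax(scores)
        summary = [sum(w * state[i] for w, state in zip(weights, word_states))
                   for i in range(dim)]
        # Step 2: update the context from the attended summary.
        context = [math.tanh(x) for x in matvec(w_ctx, context + summary)]
        # Step 3: generate this hop's FiLM parameters from the context.
        gamma_beta = matvec(w_film, context)
        gamma = [1.0 + g for g in gamma_beta[:num_channels]]
        beta = gamma_beta[num_channels:]
        params.append((gamma, beta))
    return params


# Usage: modulate a toy 4-channel feature map over three hops.
rng = random.Random(0)
word_states = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(12)]
features = [[[rng.gauss(0, 1) for _ in range(5)] for _ in range(5)]
            for _ in range(4)]
for gamma, beta in multi_hop_film_params(word_states, 3, 4, rng):
    # A real network would apply a convolutional block between FiLM layers.
    features = film(features, gamma, beta)
```

In a single-hop variant, all `(gamma, beta)` pairs would instead be produced from one fixed language summary. The loop above differs only in that each hop re-attends to the language before emitting its parameters, which is the property the abstract credits for handling longer inputs such as dialogue.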

