Simple Baseline for Visual Question Answering

12/07/2015
by Bolei Zhou, et al.

We describe a very simple bag-of-words baseline for visual question answering. The baseline concatenates word features from the question with CNN features from the image to predict the answer. When evaluated on the challenging VQA dataset [2], it shows performance comparable to many recent approaches based on recurrent neural networks. To explore the strengths and weaknesses of the trained model, we also provide an interactive web demo and open-source code.
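The baseline in the abstract is simple enough to sketch in a few lines. Below is a minimal PyTorch sketch, not the authors' released code: the class name BowImgBaseline, the vocabulary size, the image-feature dimension (the paper used deep features from a pretrained CNN, GoogLeNet in their setup), and the number of answer classes are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class BowImgBaseline(nn.Module):
    """Bag-of-words + image-feature VQA baseline (illustrative sketch).

    The question becomes a word-count vector over a fixed vocabulary;
    it is concatenated with precomputed CNN image features, and a single
    linear layer maps the joint vector to answer-class logits.
    """

    def __init__(self, vocab_size: int, img_feat_dim: int, num_answers: int):
        super().__init__()
        self.classifier = nn.Linear(vocab_size + img_feat_dim, num_answers)

    def forward(self, bow_question: torch.Tensor,
                img_features: torch.Tensor) -> torch.Tensor:
        # bow_question: (batch, vocab_size) bag-of-words counts
        # img_features: (batch, img_feat_dim) precomputed CNN features
        joint = torch.cat([bow_question, img_features], dim=1)
        return self.classifier(joint)  # (batch, num_answers) logits

# Illustrative usage; all dimensions are made up.
model = BowImgBaseline(vocab_size=10_000, img_feat_dim=1024, num_answers=1000)
bow = torch.zeros(1, 10_000)
bow[0, [12, 345, 678]] = 1.0     # indices of the question's words
img = torch.randn(1, 1024)       # stand-in for precomputed CNN features
logits = model(bow, img)
predicted_answer = logits.argmax(dim=1)
```

Because the whole model is one linear map followed by a softmax, training with cross-entropy amounts to multinomial logistic regression on the concatenated feature, which is what makes this a useful sanity-check baseline against more elaborate recurrent VQA models.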


Related research

10/09/2016 · Open-Ended Visual Question-Answering
This thesis report studies methods to solve Visual Question-Answering (V...

03/23/2017 · Recurrent and Contextual Models for Visual Question Answering
We propose a series of recurrent and contextual neural network models fo...

06/10/2022 · Less Is More: Linear Layers on CLIP Features as Powerful VizWiz Model
Current architectures for multi-modality tasks such as visual question a...

04/08/2019 · Revisiting EmbodiedQA: A Simple Baseline and Beyond
In Embodied Question Answering (EmbodiedQA), an agent interacts with an ...

06/19/2021 · VQA-Aid: Visual Question Answering for Post-Disaster Damage Assessment and Analysis
Visual Question Answering system integrated with Unmanned Aerial Vehicle...

06/19/2018 · Learning Conditioned Graph Structures for Interpretable Visual Question Answering
Visual Question answering is a challenging problem requiring a combinati...

06/16/2016 · No Need to Pay Attention: Simple Recurrent Neural Networks Work! (for Answering "Simple" Questions)
First-order factoid question answering assumes that the question can be ...
