Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering

04/16/2016
by Arun Mallya, et al.

This paper proposes deep convolutional network models that use local and global context to predict human activity labels in still images, achieving state-of-the-art performance on two recent datasets with hundreds of labels each. We use multiple instance learning to handle the lack of supervision at the level of individual person instances, and a weighted loss to handle unbalanced training data. Further, we show how specialized features trained on these datasets can improve accuracy on the Visual Question Answering (VQA) task, in the form of multiple-choice fill-in-the-blank questions (Visual Madlibs). Specifically, we tackle two types of questions, on person activity and on person-object relationships, and show improvements over generic features trained on the ImageNet classification task.
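The two training tricks named in the abstract can be sketched concretely: multiple instance learning treats an image-level activity label as positive if at least one detected person supports it (max-pooling over per-instance scores), and a weighted loss up-weights rare labels to counter class imbalance. The following is a minimal numpy sketch of those two ideas, not the authors' implementation; the function names (`mil_pool`, `weighted_bce`), the toy scores, and the weight values are illustrative assumptions.

```python
import numpy as np

def mil_pool(instance_scores):
    # Multiple instance learning pooling: an image-level label is treated as
    # positive if at least one person instance supports it, so take the max
    # score over instances. instance_scores: (num_instances, num_classes).
    return instance_scores.max(axis=0)

def weighted_bce(logits, labels, pos_weight):
    # Class-weighted binary cross-entropy: rare positive labels contribute
    # more to the loss, countering unbalanced training data.
    # pos_weight: (num_classes,) per-class weight on the positive term.
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12  # numerical floor to avoid log(0)
    loss = -(pos_weight * labels * np.log(probs + eps)
             + (1.0 - labels) * np.log(1.0 - probs + eps))
    return loss.mean()

# Toy example: 3 detected people, 4 candidate activity labels.
scores = np.array([[ 2.0, -1.0,  0.5, -3.0],
                   [-1.0,  3.0, -0.5, -2.0],
                   [ 0.0, -2.0,  1.5, -1.0]])
image_logits = mil_pool(scores)      # -> [2.0, 3.0, 1.5, -1.0]
labels = np.array([1.0, 1.0, 0.0, 0.0])
w = np.array([1.0, 5.0, 1.0, 1.0])   # second class assumed rare, weighted up
loss = weighted_bce(image_logits, labels, w)
```

The max-pooling step is what removes the need for per-person supervision: only the image-level label is compared against the pooled score, so gradients flow to whichever instance scored highest.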



Related research

- 08/27/2019: Visual Question Answering using Deep Learning: A Survey and Performance Analysis. The Visual Question Answering (VQA) task combines challenges for process...
- 11/01/2016: Solving Visual Madlibs with Multiple Cues. This paper presents an approach for answering fill-in-the-blank multiple...
- 09/19/2016: Graph-Structured Representations for Visual Question Answering. This paper proposes to improve visual question answering (VQA) with stru...
- 01/24/2018: Structured Triplet Learning with POS-tag Guided Attention for Visual Question Answering. Visual question answering (VQA) is of significant interest due to its po...
- 04/06/2023: Improving Visual Question Answering Models through Robustness Analysis and In-Context Learning with a Chain of Basic Questions. Deep neural networks have been critical in the task of Visual Question A...
- 10/08/2019: Modulated Self-attention Convolutional Network for VQA. As new data-sets for real-world visual reasoning and compositional quest...
- 04/12/2017: What's in a Question: Using Visual Questions as a Form of Supervision. Collecting fully annotated image datasets is challenging and expensive. ...
