Fusion of Detected Objects in Text for Visual Question Answering

08/14/2019
by Chris Alberti, et al.

To advance models of multimodal context, we introduce a simple yet powerful neural architecture for data that combines vision and natural language. The "Bounding Boxes in Text Transformer" (B2T2) also leverages referential information binding words to portions of the image in a single unified architecture. B2T2 is highly effective on the Visual Commonsense Reasoning benchmark (visualcommonsense.com), achieving a new state of the art with a 25% relative reduction in error rate compared to published baselines and obtaining the best performance to date on the public leaderboard (as of May 13, 2019). A detailed ablation analysis shows that the early integration of the visual features into the text analysis is key to the effectiveness of the new architecture.
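The relative reduction in error rate quoted in the abstract is measured against the baseline's error, not as a difference in accuracy points. The following is a minimal sketch of that arithmetic. The function name and the accuracy values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Return the fraction by which the error rate shrinks relative to the baseline.

    Both arguments are accuracies in [0, 1]; error rate is taken as 1 - accuracy.
    """
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical numbers, NOT taken from the paper:
# a baseline at 60% accuracy (40% error) and a new model at 70% accuracy (30% error).
reduction = relative_error_reduction(0.60, 0.70)
print(f"relative error reduction: {reduction:.2%}")
```

With these illustrative values, a 10-point gain in accuracy corresponds to roughly a quarter of the baseline's errors being eliminated, which is why relative error reduction can look much larger than the raw accuracy difference.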


