AttnGrounder: Talking to Cars with Attention

09/11/2020
by Vivek Mittal, et al.

We propose Attention Grounder (AttnGrounder), a single-stage, end-to-end trainable model for the task of visual grounding. Visual grounding aims to localize a specific object in an image based on a given natural language text query. Unlike previous methods that use the same text representation for every image region, we use a visual-text attention module that relates each word in the given query with every region in the corresponding image to construct a region-dependent text representation. Furthermore, to improve the localization ability of our model, we use the visual-text attention module to generate an attention mask around the referred object. The attention mask is trained as an auxiliary task using a rectangular mask generated from the ground-truth box coordinates. We evaluate AttnGrounder on the Talk2Car dataset and show an improvement of 3.26% over existing methods.
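To make the two mechanisms in the abstract concrete, here is a minimal PyTorch sketch. Everything in it is an illustrative assumption rather than the paper's actual implementation: the module and function names (`VisualTextAttention`, `rectangular_mask`), the feature dimensions, and in particular the max-over-words pooling used to turn word-region scores into per-region mask logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualTextAttention(nn.Module):
    """Relates every word in the query to every image region, yielding a
    region-dependent text representation and coarse per-region mask logits.
    A sketch of the idea, not the paper's exact architecture."""

    def __init__(self, vis_dim: int, txt_dim: int, emb_dim: int):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, emb_dim)  # project region features
        self.txt_proj = nn.Linear(txt_dim, emb_dim)  # project word features

    def forward(self, vis_feats: torch.Tensor, txt_feats: torch.Tensor):
        # vis_feats: (B, R, vis_dim) -- R regions, e.g. cells of a feature map
        # txt_feats: (B, T, txt_dim) -- T word embeddings from a text encoder
        v = self.vis_proj(vis_feats)              # (B, R, E)
        t = self.txt_proj(txt_feats)              # (B, T, E)

        # Similarity of every region with every word.
        scores = torch.bmm(v, t.transpose(1, 2))  # (B, R, T)

        # Per-region attention over words -> a different text vector per region,
        # instead of one shared text representation for all regions.
        word_attn = F.softmax(scores, dim=2)      # (B, R, T)
        region_txt = torch.bmm(word_attn, t)      # (B, R, E)

        # Coarse mask logit per region: strongest word response (an assumed
        # pooling choice made here for illustration).
        mask_logits = scores.max(dim=2).values    # (B, R)
        return region_txt, mask_logits


def rectangular_mask(boxes: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
    """Rasterize normalized ground-truth boxes (x1, y1, x2, y2 in [0, 1])
    into binary grids used as targets for the auxiliary mask loss."""
    mask = torch.zeros(boxes.size(0), grid_h, grid_w)
    for i, (x1, y1, x2, y2) in enumerate(boxes.tolist()):
        c1, r1 = int(x1 * grid_w), int(y1 * grid_h)
        c2, r2 = int(x2 * grid_w) + 1, int(y2 * grid_h) + 1
        mask[i, r1:r2, c1:c2] = 1.0  # cells inside the box are positives
    return mask


# Toy usage: a 14x14 feature grid (R = 196 regions), a 10-word query.
B, H, W, T = 2, 14, 14, 10
module = VisualTextAttention(vis_dim=256, txt_dim=300, emb_dim=128)
region_txt, mask_logits = module(torch.randn(B, H * W, 256),
                                 torch.randn(B, T, 300))

boxes = torch.tensor([[0.20, 0.30, 0.60, 0.80],
                      [0.10, 0.10, 0.50, 0.40]])
target = rectangular_mask(boxes, H, W).view(B, -1)  # (B, R)
aux_loss = F.binary_cross_entropy_with_logits(mask_logits, target)
```

Note that the auxiliary target is only a coarse rectangle derived from the box coordinates, so the mask loss acts as a localization hint for the attention module rather than a segmentation objective; the region-dependent text features would feed the downstream grounding head.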
