Dynamic Attention Networks for Task Oriented Grounding

10/14/2019
by Soumik Dasgupta, et al.

To successfully perform tasks specified by natural language instructions, an artificial agent operating in a visual world must map words, concepts, and actions from the instruction to visual elements in its environment. This association is termed Task-Oriented Grounding. In this work, we propose a novel Dynamic Attention Network architecture for the efficient multi-modal fusion of text and visual representations, which can generate a robust definition of state for the policy learner. Our model assumes no prior knowledge of the visual and textual domains and is end-to-end trainable. In a 3D visual world where the observation changes continuously, the attention on visual elements tends to be highly correlated from one time step to the next; we term this "Dynamic Attention". In this work, we show that Dynamic Attention helps in achieving grounding and also aids the policy-learning objective. Since most practical robotic applications take place in the real world, where the observation space is continuous, our framework can serve as a generalized multi-modal fusion unit for robotic control through natural language. We show the effectiveness of using 1D convolution over the Gated-Attention Hadamard product in terms of the network's rate of convergence. We demonstrate that the cell state of a Long Short-Term Memory (LSTM) is a natural choice for modeling Dynamic Attention, and show through visualization that the generated attention closely matches how humans tend to focus on the environment.
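To make the fusion baseline concrete, the sketch below illustrates the Gated-Attention Hadamard-product fusion that the abstract compares against: a sigmoid gate computed from the instruction embedding is broadcast over the spatial dimensions and multiplied element-wise with the visual feature maps. This is a minimal NumPy illustration, not the paper's implementation; the projection matrix `W`, the function name, and the toy dimensions are all assumptions for demonstration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_fusion(visual_feats, text_embedding, W):
    """Gated-Attention style fusion (illustrative sketch).

    visual_feats:   (C, H, W) convolutional feature maps
    text_embedding: (D,)      instruction representation
    W:              (C, D)    hypothetical projection from text to channel gates
    """
    gate = sigmoid(W @ text_embedding)           # (C,) per-channel attention in (0, 1)
    return visual_feats * gate[:, None, None]    # Hadamard product, broadcast over H, W

# Toy example with random weights (illustrative only).
rng = np.random.default_rng(0)
C, H, Wd, D = 4, 3, 3, 8
fused = gated_attention_fusion(rng.standard_normal((C, H, Wd)),
                               rng.standard_normal(D),
                               rng.standard_normal((C, D)))
print(fused.shape)  # (4, 3, 3)
```

The fused tensor keeps the spatial layout of the visual features, with each channel scaled by how relevant the instruction deems it; the paper's proposal replaces this multiplicative gating with a 1D convolution, which it reports converges faster.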


research
06/22/2017

Gated-Attention Architectures for Task-Oriented Language Grounding

To perform tasks specified by natural language instructions, autonomous ...
research
04/23/2018

Attention Based Natural Language Grounding by Navigating Virtual Environment

In this work, we focus on the problem of grounding language by training ...
research
10/13/2019

Granular Multimodal Attention Networks for Visual Dialog

Vision and language tasks have benefited from attention. There have been...
research
04/17/2021

TransVG: End-to-End Visual Grounding with Transformers

In this paper, we present a neat yet effective transformer-based framewo...
research
01/19/2021

Grounding Language to Entities and Dynamics for Generalization in Reinforcement Learning

In this paper, we consider the problem of leveraging textual description...
research
05/02/2020

Robust and Interpretable Grounding of Spatial References with Relation Networks

Handling spatial references in natural language is a key challenge in ta...
research
02/22/2020

Emergent Communication with World Models

We introduce Language World Models, a class of language-conditional gene...
