Embodied Multimodal Multitask Learning

02/04/2019
by Devendra Singh Chaplot, et al.

Recent efforts on training visual navigation agents conditioned on language using deep reinforcement learning have been successful in learning policies for different multimodal tasks, such as semantic goal navigation and embodied question answering. In this paper, we propose a multitask model capable of jointly learning these multimodal tasks, and transferring knowledge of words and their grounding in visual objects across the tasks. The proposed model uses a novel Dual-Attention unit to disentangle the knowledge of words in the textual representations and visual concepts in the visual representations, and align them with each other. This disentangled task-invariant alignment of representations facilitates grounding and knowledge transfer across both tasks. We show that the proposed model outperforms a range of baselines on both tasks in simulated 3D environments. We also show that this disentanglement of representations makes our model modular, interpretable, and allows for transfer to instructions containing new words by leveraging object detectors.
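The core idea in the abstract — a Dual-Attention unit that keeps one representation channel per word/visual concept so that textual and visual features can be gated against each other — can be illustrated with a minimal sketch. This is not the authors' implementation; the function name `dual_attention`, the shared-vocabulary layout (channel i of the visual features aligned with word i), and the toy shapes are all assumptions made for illustration.

```python
import numpy as np

def dual_attention(text_rep, vis_rep):
    """Illustrative sketch of a disentangled dual-attention step.

    text_rep: (V,) activation per word in a shared vocabulary of concepts.
    vis_rep:  (V, H, W) visual feature maps, channel i aligned with word i
              (the disentangled, task-invariant alignment assumed here).
    """
    # Gated attention: each word's activation gates its matching channel.
    gated = vis_rep * text_rep[:, None, None]              # (V, H, W)
    # Spatial attention: pool over concepts into one map that highlights
    # regions corresponding to words mentioned in the instruction.
    spatial = gated.sum(axis=0)                            # (H, W)
    # Attended concept summary: weight each channel by the spatial map.
    attended = (vis_rep * spatial[None]).sum(axis=(1, 2))  # (V,)
    return gated, spatial, attended

# Toy example: 3 concepts on a 2x2 spatial grid.
text = np.array([1.0, 0.0, 0.0])   # instruction mentions concept 0 only
vis = np.zeros((3, 2, 2))
vis[0, 0, 0] = 1.0                 # concept 0 detected at cell (0, 0)
gated, spatial, attended = dual_attention(text, vis)
```

In this sketch, channels for unmentioned concepts are zeroed out by the gate, which is what makes the alignment interpretable: adding a new word only requires a detector for the corresponding visual channel, not retraining the whole model.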


Related research

- Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding (06/06/2016)
- Multimodal Unified Attention Networks for Vision-and-Language Interactions (08/12/2019)
- A Visual Attention Grounding Neural Model for Multimodal Machine Translation (08/24/2018)
- Does Vision-and-Language Pretraining Improve Lexical Grounding? (09/21/2021)
- Learning deep representation of multityped objects and tasks (03/04/2016)
- Analyzing Visual Representations in Embodied Navigation Tasks (03/12/2020)
- Learnable Visual Words for Interpretable Image Recognition (05/22/2022)
