Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making

07/16/2023
by   Ruipu Luo, et al.

Vision language decision making (VLDM) is a challenging multimodal task: the agent must understand complex human instructions and complete compositional tasks involving environment navigation and object manipulation. However, the long action sequences involved in VLDM make the task difficult to learn. From an environment perspective, we find that task episodes can be divided into fine-grained units, each containing a navigation phase and an interaction phase. Since the environment within a unit stays unchanged, we propose a novel hybrid-training framework that enables active exploration in the environment and reduces exposure bias. This framework leverages the unit-grained configuration and is model-agnostic. Specifically, we design a Unit-Transformer (UT) with an intrinsic recurrent state that maintains a unit-scale cross-modal memory. Through extensive experiments on the TEACH benchmark, we demonstrate that the proposed framework outperforms existing state-of-the-art methods on all evaluation metrics. Overall, our work introduces a novel approach to the VLDM task: breaking it down into smaller, manageable units and training with a hybrid framework. In doing so, we provide a more flexible and effective solution for multimodal decision making.
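The unit decomposition described above can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's actual implementation: the action names and the navigation/interaction split below are hypothetical placeholders, chosen only to show how an episode's action sequence could be grouped into units that each end with an interaction.

```python
# Hypothetical illustration of unit-grained episode decomposition:
# each unit is a run of navigation actions closed by one interaction action.
# Action vocabulary is invented for this sketch, not taken from the paper.

NAV_ACTIONS = {"Forward", "TurnLeft", "TurnRight"}

def split_into_units(actions):
    """Group an action sequence into units; each unit is the navigation
    phase up to and including the next interaction action."""
    units, current = [], []
    for a in actions:
        current.append(a)
        if a not in NAV_ACTIONS:   # an interaction action closes the unit
            units.append(current)
            current = []
    if current:                    # trailing navigation-only segment
        units.append(current)
    return units

episode = ["Forward", "TurnLeft", "Pickup",
           "Forward", "Place",
           "TurnRight"]
print(split_into_units(episode))
# → [['Forward', 'TurnLeft', 'Pickup'], ['Forward', 'Place'], ['TurnRight']]
```

Because the environment state only changes at interaction actions, each such unit is a window in which the scene stays fixed, which is what lets the framework explore actively within a unit.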


