Multimodal Sequential Generative Models for Semi-Supervised Language Instruction Following

12/29/2022
by   Kei Akuzawa, et al.

Agents that can follow language instructions are expected to be useful in a variety of situations such as navigation. However, training neural network-based agents requires numerous paired trajectories and language instructions. This paper proposes using multimodal generative models for semi-supervised learning in instruction-following tasks. The models learn a shared representation of the paired data and enable semi-supervised learning by reconstructing unpaired data through that representation. Key challenges in applying the models to sequence-to-sequence tasks, including instruction following, are learning a shared representation of variable-length multimodal data and incorporating attention mechanisms. To address these problems, this paper proposes a novel network architecture that absorbs the difference in sequence lengths of the multimodal data. In addition, to further improve performance, this paper shows how to combine the generative model-based approach with an existing semi-supervised method called the speaker-follower model, and proposes a regularization term that improves inference using unpaired trajectories. Experiments on the BabyAI and Room-to-Room (R2R) environments show that the proposed method improves instruction-following performance by leveraging unpaired data, and improves the performance of the speaker-follower model by 2% to 4% in R2R.

research
05/01/2017

Towards well-specified semi-supervised model-based classifiers via structural adaptation

Semi-supervised learning plays an important role in large-scale machine ...
research
05/16/2020

Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation

Recently, end-to-end multi-speaker text-to-speech (TTS) systems gain suc...
research
03/30/2022

Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation

Since the rise of vision-language navigation (VLN), great progress has b...
research
06/18/2022

Semi-supervised Time Domain Target Speaker Extraction with Attention

In this work, we propose Exformer, a time-domain architecture for target...
research
11/14/2017

Unified Pragmatic Models for Generating and Following Instructions

We extend models for both following and generating natural language inst...
research
07/23/2019

Pre-Learning Environment Representations for Data-Efficient Neural Instruction Following

We consider the problem of learning to map from natural language instruc...
research
06/01/2023

STEVE-1: A Generative Model for Text-to-Behavior in Minecraft

Constructing AI models that respond to text instructions is challenging,...
