EC^2: Emergent Communication for Embodied Control

04/19/2023
by Yao Mu et al.

Embodied control requires agents to leverage multi-modal pre-training to quickly learn how to act in new environments: video demonstrations contain the visual and motion details needed for low-level perception and control, while language instructions support generalization through abstract, symbolic structures. While recent approaches apply contrastive learning to force alignment between the two modalities, we hypothesize that better modeling their complementary differences can lead to more holistic representations for downstream adaptation. To this end, we propose Emergent Communication for Embodied Control (EC^2), a novel scheme to pre-train video-language representations for few-shot embodied control. The key idea is to learn an unsupervised "language" of videos via emergent communication, which bridges the semantics of video details and the structures of natural language. We learn embodied representations of video trajectories, emergent language, and natural language using a language model, which is then used to finetune a lightweight policy network for downstream control. Through extensive experiments on the Metaworld and Franka Kitchen embodied benchmarks, EC^2 is shown to consistently outperform previous contrastive learning methods with both videos and texts as task inputs. Further ablations confirm the importance of the emergent language, which is beneficial for both video and language learning and significantly superior to using pre-trained video captions. We also present a quantitative and qualitative analysis of the emergent language and discuss future directions toward better understanding and leveraging emergent communication in embodied tasks.
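The abstract does not specify the training setup, but the emergent-communication idea is commonly instantiated as a Lewis-style referential game: a speaker encodes a video embedding into a discrete message, and a listener must retrieve the matching video from that message alone. The sketch below is a minimal, hypothetical PyTorch illustration of such a game; the module names, dimensions, and the Gumbel-softmax/InfoNCE choices are assumptions for illustration, not EC^2's actual architecture.

```python
# Hypothetical sketch of a speaker/listener referential game over video
# embeddings; all hyperparameters and module choices are assumed, not taken
# from the EC^2 paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MSG_LEN, DIM = 64, 8, 128  # assumed vocabulary size, message length, feature dim

class Speaker(nn.Module):
    """Encodes a video embedding into a discrete message (the emergent 'language')."""
    def __init__(self):
        super().__init__()
        self.to_logits = nn.Linear(DIM, MSG_LEN * VOCAB)

    def forward(self, video_emb, tau=1.0):
        logits = self.to_logits(video_emb).view(-1, MSG_LEN, VOCAB)
        # Straight-through Gumbel-softmax keeps the discrete channel differentiable.
        return F.gumbel_softmax(logits, tau=tau, hard=True)

class Listener(nn.Module):
    """Reads a message back into the shared embedding space for retrieval."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)

    def forward(self, message):
        h, _ = self.rnn(self.embed(message))
        return h[:, -1]  # final hidden state summarizes the message

speaker, listener = Speaker(), Listener()
opt = torch.optim.Adam([*speaker.parameters(), *listener.parameters()], lr=1e-4)

video_emb = torch.randn(32, DIM)  # stand-in for a batch of pre-encoded video clips
msg = speaker(video_emb)          # (32, MSG_LEN, VOCAB) one-hot message tokens
msg_emb = listener(msg)           # (32, DIM)

# Referential (InfoNCE-style) loss: the listener must pick out the speaker's
# video from the batch, so messages are forced to carry video-specific semantics.
logits = msg_emb @ video_emb.t() / 0.07
loss = F.cross_entropy(logits, torch.arange(32))
opt.zero_grad(); loss.backward(); opt.step()
print(f"referential loss: {loss.item():.3f}")
```

In this reading, the discrete messages play the role of the emergent "language" of videos; in the full method they would be fed, alongside natural language, to a language model whose representations are then used to finetune the downstream policy.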

Related research

Towards Contrastive Learning in Music Video Domain (09/01/2023)
Contrastive learning is a powerful way of learning multimodal representa...

Contrastive Language-Action Pre-training for Temporal Localization (04/26/2022)
Long-form video understanding requires designing approaches that are abl...

CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations (10/13/2020)
Pre-trained self-supervised models such as BERT have achieved striking s...

Language-Driven Representation Learning for Robotics (02/24/2023)
Recent work in visual representation learning for robotics demonstrates ...

IMU2CLIP: Multimodal Contrastive Learning for IMU Motion Sensors from Egocentric Videos and Text (10/26/2022)
We present IMU2CLIP, a novel pre-training approach to align Inertial Mea...

Exploring Visual Interpretability for Contrastive Language-Image Pre-training (09/15/2022)
Contrastive Language-Image pre-training (CLIP) learns rich representatio...

Learning to Locate Visual Answer in Video Corpus Using Question (10/11/2022)
We introduce a new task, named video corpus visual answer localization (...
