Learning Articulated Motion Models from Visual and Lingual Signals

by Zhengyang Wu, et al.

For robots to operate effectively in homes and workplaces, they must be able to manipulate the articulated objects common within environments built for and by humans. Previous work learns kinematic models that prescribe this manipulation from visual demonstrations. Lingual signals, such as natural language descriptions and instructions, offer a complementary means of conveying knowledge of such manipulation models and are suitable to a wide range of interactions (e.g., remote manipulation). In this paper, we present a multimodal learning framework that incorporates both visual and lingual information to estimate the structure and parameters that define kinematic models of articulated objects. The visual signal takes the form of an RGB-D image stream that opportunistically captures object motion in an unprepared scene. Accompanying natural language descriptions of the motion constitute the lingual signal. We present a probabilistic language model that uses word embeddings to associate lingual verbs with their corresponding kinematic structures. By exploiting the complementary nature of the visual and lingual input, our method infers correct kinematic structures for various multiple-part objects on which the previous state-of-the-art, visual-only system fails. We evaluate our multimodal learning framework on a dataset comprising a variety of household objects, and demonstrate a 36% improvement in model accuracy over the vision-only baseline.


