KITE: Keypoint-Conditioned Policies for Semantic Manipulation

06/29/2023
by Priya Sundaresan, et al.

While natural language offers a convenient shared interface for humans and robots, enabling robots to interpret and follow language commands remains a longstanding challenge in manipulation. A crucial step toward a performant instruction-following robot is achieving semantic manipulation, where a robot interprets language at varying levels of specificity, from high-level instructions like "Pick up the stuffed animal" to more detailed inputs like "Grab the left ear of the elephant." To tackle this, we propose Keypoints + Instructions to Execution (KITE), a two-step framework for semantic manipulation which attends to both scene semantics (distinguishing between different objects in a visual scene) and object semantics (precisely localizing different parts within an object instance). KITE first grounds an input instruction in a visual scene through 2D image keypoints, providing a highly accurate object-centric bias for downstream action inference. Given an RGB-D scene observation, KITE then executes a learned keypoint-conditioned skill to carry out the instruction. The combined precision of keypoints and parameterized skills enables fine-grained manipulation that generalizes to scene and object variations. Empirically, we demonstrate KITE in 3 real-world environments: long-horizon 6-DoF tabletop manipulation, semantic grasping, and a high-precision coffee-making task. In these settings, KITE achieves overall instruction-following success rates of 75% and 70%, and it outperforms frameworks that opt for pre-trained visual language models over keypoint-based grounding, or that omit skills in favor of end-to-end visuomotor control, all while being trained on fewer or comparable amounts of demonstrations. Supplementary material, datasets, code, and videos can be found on our website: http://tinyurl.com/kite-site.
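To make the two-step structure concrete, below is a minimal Python sketch of a KITE-style pipeline: ground the instruction to a 2D image keypoint, then execute a keypoint-conditioned skill on the RGB-D observation. All names here (KeypointGrounder, KeypointConditionedSkill, run_kite_step) are hypothetical placeholders for illustration, not the authors' released code; in the actual system both the grounding model and the skills are learned from demonstrations.

```python
import numpy as np

class KeypointGrounder:
    """Step 1 (sketch): map (RGB image, language instruction) -> 2D pixel keypoint.
    KITE uses a learned grounding model; this stub just takes the argmax of a
    placeholder heatmap."""
    def predict(self, rgb: np.ndarray, instruction: str) -> tuple[int, int]:
        heatmap = np.random.rand(rgb.shape[0], rgb.shape[1])  # placeholder scores
        v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
        return int(u), int(v)  # (column, row) pixel coordinates


class KeypointConditionedSkill:
    """Step 2 (sketch): a parameterized skill (e.g. grasp, pour) executed at a keypoint."""
    def __init__(self, name: str):
        self.name = name

    def execute(self, rgbd: np.ndarray, keypoint_uv: tuple[int, int]) -> bool:
        # A real skill would deproject the keypoint to 3D using the depth channel
        # and camera intrinsics, then run a learned 6-DoF motion. Here we only
        # report what would happen.
        u, v = keypoint_uv
        depth = rgbd[v, u, 3]
        print(f"Executing '{self.name}' at pixel ({u}, {v}), depth {depth:.3f} m")
        return True


def run_kite_step(rgbd: np.ndarray, instruction: str,
                  grounder: KeypointGrounder,
                  skills: dict[str, KeypointConditionedSkill]) -> bool:
    """Ground the instruction to a keypoint, choose a skill, and execute it."""
    rgb = rgbd[..., :3]
    keypoint = grounder.predict(rgb, instruction)
    # Skill selection is also learned in KITE; a keyword lookup stands in here.
    if "pick" in instruction.lower() or "grab" in instruction.lower():
        skill = skills["grasp"]
    else:
        skill = skills["place"]
    return skill.execute(rgbd, keypoint)


# Example usage with a dummy 480x640 RGB-D observation:
obs = np.random.rand(480, 640, 4).astype(np.float32)
skills = {"grasp": KeypointConditionedSkill("grasp"),
          "place": KeypointConditionedSkill("place")}
run_kite_step(obs, "Grab the left ear of the elephant", KeypointGrounder(), skills)
```

The point of the keypoint bottleneck in this sketch is that the language model only has to answer "where in the image", while the skill handles "how to move", which is what lets the same grounding generalize across scene and object variations.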

