A Joint Modeling of Vision-Language-Action for Target-oriented Grasping in Clutter

02/24/2023
by Kechun Xu, et al.

We focus on the task of language-conditioned grasping in clutter, in which a robot is expected to grasp a target object specified by a language instruction. Previous works treat visual grounding and grasp generation as separate stages: they first localize the target object and then generate a grasp for it. However, these works require object labels or visual attributes for grounding, which calls for handcrafted rules in the planner and restricts the range of admissible language instructions. In this paper, we propose to jointly model vision, language, and action with an object-centric representation. Our method handles more flexible language instructions and is not limited by visual grounding errors. Moreover, by leveraging powerful priors from a pre-trained multi-modal model and a grasp model, sample efficiency is effectively improved and the sim-to-real gap is relieved without additional data for transfer. Experiments in simulation and the real world show that our method achieves a higher task success rate with fewer motions under more flexible language instructions. It also generalizes better to scenarios with unseen objects and language instructions.
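To make the object-centric joint modeling concrete, the sketch below is a minimal, hypothetical illustration, not the authors' implementation: per-object visual features and an instruction embedding are fused, and an action head scores each candidate object. The class and layer names (ObjectCentricVLA, the stub encoders, action_head) are assumptions; in practice the stubs would be replaced by the pre-trained multi-modal model and grasp model the abstract refers to.

```python
# Minimal sketch (assumptions, not the paper's code): score candidate
# object crops against a language instruction with a shared action head.
import torch
import torch.nn as nn

class ObjectCentricVLA(nn.Module):
    def __init__(self, dim=512, num_actions=2):
        super().__init__()
        # Stand-ins for frozen pre-trained encoders (e.g., a CLIP-style model).
        self.visual_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.text_encoder = nn.Embedding(10000, dim)  # toy token-embedding stub
        # Action head: scores each object crop against the instruction
        # (num_actions is illustrative, e.g., grasp vs. an auxiliary action).
        self.action_head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_actions)
        )

    def forward(self, object_crops, instruction_tokens):
        # object_crops: (N, C, H, W) crops of candidate objects in the clutter.
        v = self.visual_encoder(object_crops)                  # (N, dim)
        t = self.text_encoder(instruction_tokens).mean(dim=0)  # (dim,)
        joint = torch.cat([v, t.expand_as(v)], dim=-1)         # (N, 2*dim)
        return self.action_head(joint)                         # (N, num_actions)

model = ObjectCentricVLA()
crops = torch.randn(4, 3, 64, 64)        # 4 candidate object crops
tokens = torch.randint(0, 10000, (6,))   # toy instruction token ids
scores = model(crops, tokens)            # per-object action scores
target = scores[:, 0].argmax()           # pick the best-scoring object
```

Fusing the instruction with each object's embedding keeps the action space object-centric: the policy chooses which object to act on, rather than committing to a hard grounding decision whose errors would propagate to grasping.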

Related research:

- Task-Oriented Grasp Prediction with Visual-Language Inputs (02/28/2023)
- Learning 6-DoF Object Poses to Grasp Category-level Objects by Language Instructions (05/09/2022)
- Object-centric Inference for Language Conditioned Placement: A Foundation Model based Approach (04/06/2023)
- Scene-Intuitive Agent for Remote Embodied Visual Grounding (03/24/2021)
- Action Concept Grounding Network for Semantically-Consistent Video Generation (11/23/2020)
- WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model (08/30/2023)
- Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion (08/10/2021)
