Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

by Ryosuke Korekata et al.

This paper describes a domestic service robot (DSR) that fetches everyday objects and carries them to specified destinations according to free-form natural language instructions. Given an instruction such as "Move the bottle on the left side of the plate to the empty chair," the DSR is expected to identify the bottle and the chair from multiple candidates in the environment and carry the target object to the destination. Most existing multimodal language understanding methods are impractical in terms of computational complexity because they require inference over every combination of target object candidates and destination candidates. We propose Switching Head-Tail Funnel UNITER, which solves the task by predicting the target object and the destination individually with a single model. Our method is validated on a newly built dataset consisting of object manipulation instructions and semi-photorealistic images captured in a standard Embodied AI simulator. The results show that our method outperforms the baseline method in terms of language comprehension accuracy. Furthermore, we conduct physical experiments in which a DSR delivers standardized everyday objects in a standardized domestic environment as requested by instructions with referring expressions. The experimental results show that the object grasping and placing actions are achieved with success rates of more than 90%.
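The complexity argument above can be illustrated with a toy sketch: scoring every (target, destination) pair requires a number of model calls that grows multiplicatively with the candidate counts, whereas predicting each slot individually grows only additively. The `score` function below is a hypothetical stand-in for a multimodal model's confidence (simple word overlap here), not the paper's actual architecture.

```python
from itertools import product

def score(instruction, candidate):
    # Hypothetical stand-in for a multimodal model's confidence score;
    # it just counts instruction words that appear in the candidate name.
    return sum(w in candidate for w in instruction.split())

instruction = "move the bottle on the left of the plate to the empty chair"
targets = ["bottle", "plate", "cup"]
destinations = ["chair", "table"]

# Combinatorial approach: one inference per (target, destination) pair -> N*M calls.
pair_calls = 0
best_pair, best_s = None, float("-inf")
for t, d in product(targets, destinations):
    pair_calls += 1
    s = score(instruction, t) + score(instruction, d)
    if s > best_s:
        best_pair, best_s = (t, d), s

# Individual-prediction approach (in the spirit of the proposed method):
# choose the target and the destination separately -> N+M calls.
indiv_calls = 0
def argmax(cands):
    global indiv_calls
    best, bs = None, float("-inf")
    for c in cands:
        indiv_calls += 1
        s = score(instruction, c)
        if s > bs:
            best, bs = c, s
    return best

pred = (argmax(targets), argmax(destinations))
print(pair_calls, indiv_calls, pred)
# 3*2 = 6 pairwise calls vs. 3+2 = 5 individual calls; the gap widens
# multiplicatively as the number of candidates in the scene grows.
```

With N target candidates and M destination candidates, the pairwise scheme needs N×M inferences while individual prediction needs only N+M, which is the practical motivation the abstract gives for a single switching model.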


