Audio-Visual Grounding Referring Expression for Robotic Manipulation

09/22/2021
by Yefei Wang, et al.

Referring expressions are commonly used in everyday dialogue to single out a specific target. In this paper, we introduce a novel task: audio-visual grounding of referring expressions for robotic manipulation. The robot leverages both audio and visual information to understand the referring expression in a given manipulation instruction and then carries out the corresponding manipulation. To solve this task, we propose an audio-visual framework for visual localization and sound recognition. We also establish a dataset containing visual data, auditory data, and manipulation instructions for evaluation. Finally, extensive offline and online experiments verify the effectiveness of the proposed framework and demonstrate that the robot performs better with audio-visual data than with visual data alone.

research
07/12/2023

GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation

Language-Guided Robotic Manipulation (LGRM) is a challenging task as it ...
research
04/05/2021

Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation

There are rich synchronized audio and visual events in our daily life. I...
research
03/10/2020

MQA: Answering the Question via Robotic Manipulation

In this paper, we propose a novel task of Manipulation Question Answering...
research
12/16/2020

Visually Grounding Instruction for History-Dependent Manipulation

This paper emphasizes the importance of robot's ability to refer its tas...
research
03/02/2019

Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring

In this paper, we focus on the challenging perception problem in robotic...
research
02/28/2023

Task-Oriented Grasp Prediction with Visual-Language Inputs

To perform household tasks, assistive robots receive commands in the for...
research
11/30/2020

Detecting expressions with multimodal transformers

Developing machine learning algorithms to understand person-to-person en...
