Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos

06/27/2023
by Chiori Hori, et al.

To realize human-robot collaboration, robots need to execute actions for new tasks according to human instructions, given only finite prior knowledge. Human experts can share their knowledge of how to perform a task with a robot through multi-modal instructions in their demonstrations, showing a sequence of short-horizon steps that achieve a long-horizon goal. This paper introduces a method for generating robot action sequences from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech into a sequence of robot actions called dynamic movement primitives (DMPs) and (2) style-transfer-based training that employs multi-task learning with video captioning and weakly-supervised learning with a semantic classifier to exploit unpaired video-action data. We built a system that accomplishes various cooking actions, in which an arm robot executes a DMP sequence acquired from a cooking video by the audio-visual Transformer. Experiments with the EPIC-KITCHENS-100, YouCookII, QuerYD, and in-house instruction video datasets show that the proposed method improves the quality of DMP sequences, achieving 2.3 times the METEOR score of a baseline video-to-action Transformer. The model achieved 32 object.
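The abstract does not define DMPs, so for context: in the standard formulation (Ijspeert et al.), a one-dimensional dynamic movement primitive is a spring-damper transformation system driven by a learned forcing term and paced by a canonical phase variable. The equations below are that textbook form, not necessarily the exact parameterization used in this paper:

\tau \dot{z} = \alpha_z \bigl( \beta_z (g - y) - z \bigr) + f(x), \qquad \tau \dot{y} = z, \qquad \tau \dot{x} = -\alpha_x x

f(x) = \frac{\sum_i \psi_i(x)\, w_i}{\sum_i \psi_i(x)}\, x \,(g - y_0), \qquad \psi_i(x) = \exp\bigl(-h_i (x - c_i)^2\bigr)

Here y is the robot's state, g the goal, and the basis-function weights w_i are the learned parameters; the DMP sequences discussed above chain such primitives, one per short-horizon step.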
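To make component (1) concrete, the following is a minimal PyTorch sketch of an encoder-decoder Transformer that fuses audio and visual feature streams and autoregressively decodes a discrete sequence of DMP labels. All dimensions, the DMP token vocabulary, and fusion by concatenation are assumptions made for this sketch; the paper's actual architecture, its handling of instruction speech, and the captioning and semantic-classifier heads used for style-transfer training are not shown.

import torch
import torch.nn as nn

class AudioVisualActionTransformer(nn.Module):
    """Illustrative sketch: map audio/visual features to DMP-label logits."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2,
                 audio_dim=128, video_dim=512, n_dmp_tokens=64):
        super().__init__()
        # Project each modality into the shared model dimension.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.token_emb = nn.Embedding(n_dmp_tokens, d_model)
        self.out = nn.Linear(d_model, n_dmp_tokens)

    def forward(self, audio_feats, video_feats, dmp_tokens):
        # Fuse modalities by concatenating projected frames along time
        # (an assumption; cross-modal attention is another common choice).
        memory = self.encoder(torch.cat(
            [self.audio_proj(audio_feats), self.video_proj(video_feats)], dim=1))
        # Causal mask so each DMP token attends only to earlier tokens.
        tgt = self.token_emb(dmp_tokens)
        causal = torch.triu(
            torch.full((tgt.size(1), tgt.size(1)), float("-inf")), diagonal=1)
        return self.out(self.decoder(tgt, memory, tgt_mask=causal))

# Example: 2 clips, 10 audio frames, 30 video frames, 5 tokens decoded so far.
model = AudioVisualActionTransformer()
logits = model(torch.randn(2, 10, 128), torch.randn(2, 30, 512),
               torch.randint(0, 64, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 64])

At inference time such a decoder would be run step by step, feeding back the argmax (or sampled) token until an end-of-sequence label is produced, after which each decoded label would be expanded into the parameters of a primitive for the arm robot to execute.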

