Aligning Step-by-Step Instructional Diagrams to Video Demonstrations

03/24/2023
by Jiahao Zhang, et al.

Multimodal alignment facilitates the retrieval of instances from one modality when queried using another. In this paper, we consider a novel setting where such an alignment is between (i) instruction steps depicted as assembly diagrams (commonly seen in Ikea assembly manuals) and (ii) segments from in-the-wild videos, which show the assembly actions being enacted in the real world. To learn this alignment, we introduce a novel supervised contrastive learning method that learns to align videos with the subtle details in the assembly diagrams, guided by a set of novel losses. To study this problem and demonstrate the effectiveness of our method, we introduce a novel dataset, IAW (Ikea assembly in the wild), consisting of 183 hours of videos from diverse furniture assembly collections and nearly 8,300 illustrations from their associated instruction manuals, annotated with their ground-truth alignments. We define two tasks on this dataset: first, nearest neighbor retrieval between video segments and illustrations, and second, alignment of instruction steps and the segments for each video. Extensive experiments on IAW demonstrate the superior performance of our approach against alternatives.
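
The first task, nearest neighbor retrieval between video segments and illustrations, can be illustrated with a minimal sketch. It assumes that both modalities have already been embedded into a shared vector space and that similarity is measured by cosine similarity; the abstract does not specify either choice. The function names (`cosine_similarity`, `retrieve_nearest`) and the toy three-dimensional embeddings are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_nearest(query, candidates, k=1):
    """Return the indices of the k candidates most similar to the query."""
    scored = [(cosine_similarity(query, c), i) for i, c in enumerate(candidates)]
    # Sort by descending similarity; break ties by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


# Hypothetical toy embeddings: one video segment and three manual illustrations.
video_segment = [0.9, 0.1, 0.0]
illustrations = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]

ranking = retrieve_nearest(video_segment, illustrations, k=3)
print(ranking)
```

The same function works in the opposite direction (querying video segments with an illustration) by swapping the roles of the query and the candidates.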

