StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos

04/26/2023
by Nikita Dvornik, et al.

Instructional videos are an important resource for learning procedural tasks from human demonstrations. However, the instruction steps in such videos are typically short and sparse, with most of the video being irrelevant to the procedure. This motivates the need to temporally localize the instruction steps in such videos, a task known as key-step localization. Traditional methods for key-step localization require video-level human annotations and thus do not scale to large datasets. In this work, we tackle the problem with no human supervision and introduce StepFormer, a self-supervised model that discovers and localizes instruction steps in a video. StepFormer is a transformer decoder that attends to the video with learnable queries and produces a sequence of slots capturing the key-steps in the video. We train our system on a large dataset of instructional videos, using their automatically generated subtitles as the only source of supervision. In particular, we supervise our system with a sequence of text narrations using an order-aware loss function that filters out irrelevant phrases. We show that our model outperforms all previous unsupervised and weakly-supervised approaches on step detection and localization by a large margin on three challenging benchmarks. Moreover, our model demonstrates an emergent ability to perform zero-shot multi-step localization and outperforms all relevant baselines at this task.
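
To make the "learnable queries attending to the video" idea concrete, here is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is an illustration under stated assumptions, not the StepFormer implementation: the function name `cross_attend`, the sizes, and the random vectors standing in for learned queries and frame features are all hypothetical, and a real transformer decoder would add learned projections, multiple heads, self-attention among queries, and feed-forward layers.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frames):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries: K vectors of dimension D (stand-ins for learnable queries).
    frames:  T vectors of dimension D (stand-ins for video frame features).
    Returns (slots, weights): K slot vectors of dimension D, and a K x T
    matrix of attention weights showing which frames each slot draws on.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    slots, weights = [], []
    for query in queries:
        # Similarity of this query to every frame.
        scores = [scale * sum(q * f for q, f in zip(query, frame))
                  for frame in frames]
        attn = softmax(scores)
        # The slot is the attention-weighted average of the frame features.
        slot = [sum(a * frame[d] for a, frame in zip(attn, frames))
                for d in range(dim)]
        slots.append(slot)
        weights.append(attn)
    return slots, weights


# Hypothetical sizes: 3 queries, 8 frames, 4-dimensional features.
rng = random.Random(0)
NUM_QUERIES, NUM_FRAMES, DIM = 3, 8, 4
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
frames = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FRAMES)]

slots, weights = cross_attend(queries, frames)
# A crude localization cue: the frame each slot attends to most strongly.
peak_frames = [row.index(max(row)) for row in weights]
```

Each row of `weights` is a distribution over frames, so its peak gives a rough indication of where in the video a slot is focused. How the actual model turns slots into localized, ordered steps is not specified in the abstract and is not reproduced here.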


research
01/02/2023

STEPs: Self-Supervised Key Step Extraction from Unlabeled Procedural Videos

We address the problem of extracting key steps from unlabeled procedural...
research
03/23/2023

Learning and Verification of Task Structure in Instructional Videos

Given the enormous number of instructional videos available online, lear...
research
06/30/2015

Unsupervised Learning from Narrated Instruction Videos

We address the problem of automatically learning the main steps to compl...
research
05/27/2023

Non-Sequential Graph Script Induction via Multimedia Grounding

Online resources such as WikiHow compile a wide range of scripts for per...
research
06/06/2023

Learning to Ground Instructional Articles in Videos through Narrations

In this paper we present an approach for localizing steps of procedural ...
research
05/04/2022

P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision

In this paper, we study the problem of procedure planning in instruction...
research
03/24/2023

Aligning Step-by-Step Instructional Diagrams to Video Demonstrations

Multimodal alignment facilitates the retrieval of instances from one mod...
