Learning and Verification of Task Structure in Instructional Videos

03/23/2023
by Medhini Narasimhan, et al.

Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling). Unlike prior work, which learns step representations locally, our approach learns them globally, using video of the entire surrounding task as context. From these learned representations, we can verify whether an unseen video correctly executes a given task, as well as forecast which steps are likely to be taken after a given step. We introduce two new benchmarks for detecting mistakes in instructional videos: one checks whether a video contains an anomalous step, and the other checks whether the steps are executed in the right order. We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. Our method outperforms previous baselines on these tasks, and we believe the tasks will be a valuable way for the community to measure the quality of step representations. Additionally, we evaluate VideoTaskformer on three existing benchmarks (procedural activity recognition, step classification, and step forecasting) and demonstrate on each that our method outperforms existing baselines and achieves new state-of-the-art performance.
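To make the masked step modeling objective concrete, the following is a minimal sketch of the data preparation it implies: one step of a video is hidden at random, and its weak text label becomes the prediction target while the remaining steps serve as context. It is not the authors' implementation. The function name `mask_random_step`, the `MASK_TOKEN` placeholder, and the string stand-ins for clips and labels are all assumptions made for illustration.

```python
import random

# Hypothetical placeholder that stands in for a masked-out step.
MASK_TOKEN = "<MASK>"


def mask_random_step(step_clips, step_labels, rng):
    """Hide one randomly chosen step and return its label as the target.

    step_clips  -- per-step clip representations (any objects; assumed input)
    step_labels -- weakly supervised text label for each step (assumed input)
    rng         -- a random.Random instance, passed in for reproducibility

    Returns (masked_clips, masked_index, target_label).
    """
    if not step_clips:
        raise ValueError("need at least one step")
    if len(step_clips) != len(step_labels):
        raise ValueError("each step clip needs exactly one weak label")

    idx = rng.randrange(len(step_clips))  # which step to hide
    masked = list(step_clips)             # copy so the input is untouched
    masked[idx] = MASK_TOKEN              # the model sees only the context
    return masked, idx, step_labels[idx]


# Toy example: strings stand in for video clips and their weak labels.
clips = ["clip_0", "clip_1", "clip_2", "clip_3"]
labels = ["crack eggs", "whisk", "heat pan", "pour mixture"]
masked, idx, target = mask_random_step(clips, labels, random.Random(0))
```

In a real pipeline, a model would take `masked` as input and be trained to predict `target` at position `idx`; the sketch covers only the masking step, not the model or the loss.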


