Procedure-Aware Pretraining for Instructional Video Understanding

03/31/2023
by Honglu Zhou, et al.

Our goal is to learn a video representation that is useful for downstream procedure understanding tasks in instructional videos. Because annotations are scarce, a key challenge in procedure understanding is extracting procedural knowledge from unlabeled videos, such as the identity of the task (e.g., 'make latte'), its steps (e.g., 'pour milk'), or the potential next steps given partial progress in its execution. Our main insight is that instructional videos depict sequences of steps that repeat between instances of the same or different tasks, and that this structure can be well represented by a Procedural Knowledge Graph (PKG), where nodes are discrete steps and edges connect steps that occur sequentially in the instructional activities. This graph can then be used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a more accessible form and generalizes to multiple procedure understanding tasks. We build a PKG by combining information from a text-based procedural knowledge database and an unlabeled instructional video corpus, and then use it to generate training pseudo labels with four novel pre-training objectives. We call this PKG-based pre-training procedure, and the resulting model, Paprika: Procedure-Aware PRe-training for Instructional Knowledge Acquisition. We evaluate Paprika on COIN and CrossTask for procedure understanding tasks such as task recognition, step recognition, and step forecasting. Paprika yields a video representation that improves over the state of the art, with gains of up to 11.23% in accuracy across 12 evaluation settings. The implementation is available at https://github.com/salesforce/paprika.
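
To make the graph idea concrete, here is a minimal sketch of PKG construction and pseudo-label generation in Python. It illustrates the data structure described above rather than the paper's actual pipeline: the step_sequences input, the step names, and the next_steps helper are all invented for this example, whereas the real PKG is built by matching a text-based knowledge database against a large unlabeled video corpus.

    from collections import defaultdict

    # Toy input: ordered step sequences, as they might come from a text
    # knowledge base (task -> steps) and from segmented videos. All step
    # names here are invented for illustration.
    step_sequences = [
        ["add coffee", "pour milk", "stir"],        # e.g. 'make latte'
        ["boil water", "add coffee", "pour milk"],  # another task instance
    ]

    # PKG: one node per discrete step; a directed, counted edge for each
    # observed "step a is immediately followed by step b" transition.
    nodes = set()
    edges = defaultdict(int)
    for seq in step_sequences:
        nodes.update(seq)
        for a, b in zip(seq, seq[1:]):
            edges[(a, b)] += 1

    # One kind of pseudo label read off the graph: the set of plausible
    # next steps for a given step (a hypothetical helper, one of several
    # signals a pre-training objective could be supervised with).
    def next_steps(step):
        return {b for (a, b) in edges if a == step}

    print(next_steps("add coffee"))  # -> {'pour milk'}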
