STEPs: Self-Supervised Key Step Extraction from Unlabeled Procedural Videos

01/02/2023
by Anshul Shah, et al.

We address the problem of extracting key steps from unlabeled procedural videos, motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training and performance. We decompose the problem into two steps: representation learning and key step extraction. For representation learning, we use a self-supervised training strategy that adapts off-the-shelf video features using a temporal module. Training uses self-supervised losses over multiple cues extracted from the videos, such as appearance, motion and pose trajectories, to learn generalizable representations. Our method then extracts key steps via a tunable algorithm that clusters the representations extracted from procedural videos. We quantitatively evaluate our approach on key step localization and also demonstrate the effectiveness of the extracted representations on related downstream tasks such as phase classification. Qualitative results show that the extracted key steps are meaningful and succinctly represent the procedural tasks.
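The abstract does not specify the clustering algorithm, so the following is only a minimal sketch of the general idea of clustering per-frame representations and reading off contiguous segments as candidate key steps. Everything in it is an assumption made for illustration: the use of plain k-means with farthest-point initialization, the function names `kmeans` and `extract_key_steps`, and the two tunable knobs `k` (number of clusters) and `min_len` (minimum segment length in frames). It is not the authors' method.

```python
import numpy as np


def kmeans(features, k, iters=50):
    """Plain k-means with deterministic farthest-point initialization.

    Assumption: stands in for whatever clustering the paper actually uses.
    features: (num_frames, dim) array of per-frame representations.
    """
    features = np.asarray(features, dtype=float)
    # Farthest-point init: start from frame 0, then repeatedly add the frame
    # that is farthest from all centroids chosen so far.
    centroids = [features[0]]
    for _ in range(1, k):
        dists = np.min(
            [np.linalg.norm(features - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(features[dists.argmax()])
    centroids = np.stack(centroids)

    for _ in range(iters):
        # Assign each frame to its nearest centroid.
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each non-empty centroid to the mean of its members.
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)

    # Final assignment against the final centroids.
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1), centroids


def extract_key_steps(features, k, min_len=1):
    """Return (start_frame, end_frame, cluster_id) for each contiguous run of
    frames sharing a cluster label, dropping runs shorter than min_len.

    Both k and min_len are hypothetical tuning knobs for this sketch.
    """
    labels, _ = kmeans(features, k)
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        # Close the current run at the end of the video or on a label change.
        if t == len(labels) or labels[t] != labels[start]:
            if t - start >= min_len:
                segments.append((start, t - 1, int(labels[start])))
            start = t
    return segments
```

Calling `extract_key_steps(features, k=3, min_len=5)` on a `(num_frames, dim)` array returns a list of frame ranges; raising `min_len` suppresses short, noisy segments, while `k` controls how many distinct step types are allowed.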


