P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision

05/04/2022
by He Zhao, et al.

In this paper, we study the problem of procedure planning in instructional videos. Here, an agent must produce a plausible sequence of actions that can transform the environment from a given start to a desired goal state. When learning procedure planning from instructional videos, most recent work leverages intermediate visual observations as supervision, which requires expensive annotation efforts to precisely localize all the instructional steps in training videos. In contrast, we remove the need for expensive temporal video annotations and propose a weakly supervised approach by learning from natural language instructions. Our model is based on a transformer equipped with a memory module, which maps the start and goal observations to a sequence of plausible actions. Furthermore, we augment our model with a probabilistic generative module to capture the uncertainty inherent to procedure planning, an aspect largely overlooked by previous work. We evaluate our model on three datasets and show that our weakly supervised approach outperforms previous fully supervised state-of-the-art models on multiple metrics.
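
The uncertainty mentioned above can be illustrated with a toy example: for the same start and goal state, more than one action sequence may be valid, so a planner that samples plans yields a distribution rather than a single answer. The sketch below is only an illustration under stated assumptions. It is not the transformer or generative module described in the abstract; it swaps in a simple random walk over a hand-written transition table, and every name in it (`TRANSITIONS`, `sample_plan`, `plan_distribution`, the states and actions) is hypothetical.

```python
import random
from collections import Counter

# Hypothetical toy environment (not from the paper): each state maps the
# actions available in it to the state they lead to.
TRANSITIONS = {
    "start": {"pour milk": "milk", "add sugar": "sugar"},
    "milk": {"add sugar": "mixed"},
    "sugar": {"pour milk": "mixed"},
    "mixed": {"stir": "done"},
}


def sample_plan(start, goal, horizon, rng):
    """Sample one plan of exactly `horizon` actions by a random walk.

    Returns a tuple of actions if the walk ends in `goal`, else None.
    The random choice stands in for the noise a generative planner
    would condition on; it is an assumption of this sketch.
    """
    state, plan = start, []
    for _ in range(horizon):
        options = TRANSITIONS.get(state, {})
        if not options:
            return None  # dead end before the horizon was reached
        action = rng.choice(sorted(options))
        plan.append(action)
        state = options[action]
    return tuple(plan) if state == goal else None


def plan_distribution(start, goal, horizon, n_samples=200, seed=0):
    """Count how often each goal-reaching plan is sampled."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        plan = sample_plan(start, goal, horizon, rng)
        if plan is not None:
            counts[plan] += 1
    return counts


if __name__ == "__main__":
    dist = plan_distribution("start", "done", horizon=3)
    for plan, count in dist.most_common():
        print(count, " -> ".join(plan))
```

In this toy setting both orderings of the first two steps reach the goal, which is the kind of ambiguity a deterministic planner cannot represent.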


research
03/26/2023

PDPP: Projected Diffusion for Procedure Planning in Instructional Videos

In this paper, we study the problem of procedure planning in instruction...
research
08/17/2023

Event-Guided Procedure Planning from Instructional Videos with Text Supervision

In this work, we focus on the task of procedure planning from instructio...
research
04/26/2023

StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos

Instructional videos are an important resource to learn procedural tasks...
research
05/07/2020

Learning to Segment Actions from Observation and Narration

We apply a generative segmental model of task structure, guided by narra...
research
11/29/2019

Merging Weak and Active Supervision for Semantic Parsing

A semantic parser maps natural language commands (NLs) from the users to...
research
10/05/2021

Procedure Planning in Instructional Videos via Contextual Modeling and Model-based Policy Learning

Learning new skills by observing humans' behaviors is an essential capab...
research
09/10/2021

PlaTe: Visually-Grounded Planning with Transformers in Procedural Tasks

In this work, we study the problem of how to leverage instructional vide...
