Event-Guided Procedure Planning from Instructional Videos with Text Supervision

08/17/2023
by An-Lan Wang, et al.

In this work, we focus on the task of procedure planning from instructional videos with text supervision, where a model aims to predict the action sequence that transforms an initial visual state into a goal visual state. A critical challenge of this task, which previous works ignore, is the large semantic gap between the observed visual states and the unobserved intermediate actions: the contents of the observed visual states are semantically different from the elements of some action text labels in a procedure. To bridge this gap, we propose a novel event-guided paradigm, which first infers events from the observed states and then plans actions based on both the states and the predicted events. Our inspiration is that planning a procedure from an instructional video amounts to completing a specific event, and a specific event usually involves specific actions. Based on the proposed paradigm, we contribute an Event-guided Prompting-based Procedure Planning (E3P) model, which encodes event information into the sequential modeling process to support procedure planning. To further exploit the strong associations among actions within each event, E3P adopts a mask-and-predict approach for relation mining and incorporates a probabilistic masking scheme for regularization. Extensive experiments on three datasets demonstrate the effectiveness of the proposed model.
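The two ideas in the abstract, inferring an event before planning and masking actions at random for a mask-and-predict objective, can be illustrated with a toy sketch. This is not the E3P model: the lookup tables stand in for learned networks, and every name here (`infer_event`, `plan_actions`, `probabilistic_mask`, `EVENT_SCORES`, `EVENT_TO_ACTIONS`, the example states and actions) is a hypothetical chosen for illustration only.

```python
import random

MASK = "[MASK]"

# Hypothetical toy data: scores a learned event classifier might produce
# for an observed (start state, goal state) pair.
EVENT_SCORES = {
    ("raw eggs", "plated omelette"): {"make omelette": 0.9, "make pancakes": 0.1},
}

# Hypothetical toy data: the actions typically associated with each event.
EVENT_TO_ACTIONS = {
    "make omelette": ["crack eggs", "whisk eggs", "pour into pan", "fold omelette"],
    "make pancakes": ["mix batter", "pour into pan", "flip pancake"],
}


def infer_event(start_state, goal_state, event_scores):
    """Stage 1: pick the highest-scoring event for the observed state pair."""
    scores = event_scores[(start_state, goal_state)]
    return max(scores, key=scores.get)


def plan_actions(event, horizon, event_to_actions):
    """Stage 2: plan conditioned on the event.

    A table lookup stands in for a sequence model here.
    """
    return event_to_actions[event][:horizon]


def probabilistic_mask(actions, mask_prob, rng):
    """Replace each action with MASK independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else a for a in actions]


# Usage: infer the event, plan under it, then mask the plan for training.
event = infer_event("raw eggs", "plated omelette", EVENT_SCORES)
plan = plan_actions(event, 3, EVENT_TO_ACTIONS)
masked_plan = probabilistic_mask(plan, 0.5, random.Random(0))
```

In a mask-and-predict setup, a model would be trained to recover the original actions at the masked positions of `masked_plan`. Drawing the mask independently per position is one simple reading of "probabilistic masking"; the abstract does not specify the actual scheme.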

