Learning Video Models from Text: Zero-Shot Anticipation for Procedural Actions

06/06/2021
by Fadime Sener, et al.

Can we teach a robot to recognize and make predictions for activities it has never seen before? We tackle this problem by learning video models from text. This paper presents a hierarchical model that generalizes instructional knowledge from large-scale text corpora and transfers that knowledge to video. Given a portion of an instructional video, our model recognizes and predicts coherent, plausible actions multiple steps into the future, all in rich natural language. To demonstrate the capabilities of our model, we introduce the Tasty Videos Dataset V2, a collection of 4022 recipes for zero-shot learning, recognition, and anticipation. Extensive experiments with various evaluation metrics demonstrate our method's potential to generalize given only limited video data for training.
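The paper itself is not reproduced on this page, but to make the abstract's idea concrete, here is a minimal sketch of what such a text-to-video anticipation pipeline could look like: a sentence-level encoder turns each recipe step into an embedding, a recurrent model over step embeddings predicts the embedding of the next step, and a projection maps video features into the same space so video segments can stand in for text at inference time. All module names, dimensions, and the GRU-based design below are illustrative assumptions, not the authors' architecture.

```python
# Illustrative sketch only: a hypothetical hierarchical text-to-video
# anticipation model in PyTorch, NOT the architecture from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StepEncoder(nn.Module):
    """Sentence level: encode one tokenized recipe step into an embedding."""
    def __init__(self, vocab_size: int, emb_dim: int = 128, hid_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, words) -> (batch, hid_dim)
        _, h = self.gru(self.embed(tokens))
        return h[-1]

class StepAnticipator(nn.Module):
    """Recipe level: given embeddings of steps 1..t, predict step t+1."""
    def __init__(self, hid_dim: int = 256):
        super().__init__()
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.head = nn.Linear(hid_dim, hid_dim)

    def forward(self, step_embs: torch.Tensor) -> torch.Tensor:
        # step_embs: (batch, steps, hid_dim); the output at position t
        # is the predicted embedding of step t+1.
        out, _ = self.gru(step_embs)
        return self.head(out)

class VideoProjector(nn.Module):
    """Map per-segment video features into the shared text-embedding
    space, so video segments can replace text steps at inference time."""
    def __init__(self, vid_dim: int = 512, hid_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(vid_dim, hid_dim)

    def forward(self, vid_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vid_feats)

# Toy text-only training step: the anticipator is supervised purely on
# text corpora, which is what would enable zero-shot transfer to video.
encoder, anticipator = StepEncoder(vocab_size=5000), StepAnticipator()
steps = torch.randint(0, 5000, (2, 4, 12))  # 2 recipes, 4 steps, 12 tokens
embs = torch.stack([encoder(steps[:, t]) for t in range(4)], dim=1)
pred = anticipator(embs[:, :-1])            # predict steps 2..4 from 1..3
loss = F.mse_loss(pred, embs[:, 1:].detach())
loss.backward()
```

At inference on a video, one would feed VideoProjector outputs in place of the text step embeddings and decode the predicted embeddings back into sentences with a language decoder (omitted here for brevity).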


Related research

12/06/2018 · Zero-Shot Anticipation for Instructional Activities
How can we teach a robot to predict what will happen next for an activit...

08/03/2020 · RareAct: A video dataset of unusual interactions
This paper introduces a manually annotated video dataset of unusual acti...

05/16/2023 · A Video Is Worth 4096 Tokens: Verbalize Story Videos To Understand Them In Zero Shot
Multimedia content, such as advertisements and story videos, exhibit a r...

11/17/2021 · Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval
Schemata are structured representations of complex tasks that can aid ar...

11/07/2018 · Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning
Although promising results have been achieved in video captioning, exist...

05/26/2023 · Discovering Novel Actions in an Open World with Object-Grounded Visual Commonsense Reasoning
Learning to infer labels in an open world, i.e., in an environment where...

01/02/2023 · NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
Searching long egocentric videos with natural language queries (NLQ) has...
