Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval

11/17/2021
by Yue Yang, et al.

Schemata are structured representations of complex tasks that can aid artificial intelligence by allowing models to break such tasks down into intermediate steps. We propose a novel system that induces schemata from web videos and generalizes them to capture unseen tasks, with the goal of improving video retrieval performance. Our system proceeds in three major phases: (1) Given a task with related videos, we construct an initial schema for the task, using a joint video-text model to match video segments with text representing steps from wikiHow; (2) We generalize schemata to unseen tasks by leveraging language models to edit the text within existing schemata. Through generalization, our schemata can cover a wider range of tasks with a small amount of training data; (3) We conduct zero-shot instructional video retrieval with the unseen task names as the queries. Our schema-guided approach outperforms existing methods for video retrieval, and we demonstrate that the schemata induced by our system are better than those generated by other models.
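The three phases can be pictured with a toy sketch. This is not the authors' implementation: it assumes video segments and step texts have already been embedded into a shared vector space, and it replaces language-model editing with plain string substitution. All function names (`induce_schema`, `edit_schema`, `retrieve`) and the scoring rules are hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def induce_schema(segments: List[Vector], steps: Dict[str, Vector], top_k: int = 2) -> List[str]:
    """Phase 1 (sketch): score each candidate step text by its best-matching
    video segment and keep the top_k steps. The scoring rule is an assumption."""
    scores = {
        text: max(cosine(seg, vec) for seg in segments)
        for text, vec in steps.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def edit_schema(schema: List[str], old: str, new: str) -> List[str]:
    """Phase 2 (placeholder): string substitution stands in for the
    language-model editing described in the abstract."""
    return [step.replace(old, new) for step in schema]


def retrieve(schema_vecs: List[Vector], videos: Dict[str, List[Vector]]) -> List[str]:
    """Phase 3 (sketch): rank videos by the mean, over schema steps, of the
    best segment similarity. The aggregation is an assumption."""
    def score(segs: List[Vector]) -> float:
        return sum(max(cosine(s, seg) for seg in segs) for s in schema_vecs) / len(schema_vecs)

    return sorted(videos, key=lambda name: score(videos[name]), reverse=True)
```

In this sketch, a schema induced for one task is edited into a schema for an unseen task, and the edited steps (once embedded) serve as the query representation for retrieval in place of the bare task name.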

research
09/16/2023

In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval

Large-scale noisy web image-text datasets have been proven to be efficie...
research
10/13/2021

SGD-X: A Benchmark for Robust Generalization in Schema-Guided Dialogue Systems

Zero/few-shot transfer to unseen services is a critical challenge in tas...
research
06/06/2021

Learning Video Models from Text: Zero-Shot Anticipation for Procedural Actions

Can we teach a robot to recognize and make predictions for activities th...
research
01/31/2023

Learning Universal Policies via Text-Guided Video Generation

A goal of artificial intelligence is to construct an agent that can solv...
research
08/23/2022

VILT: Video Instructions Linking for Complex Tasks

This work addresses challenges in developing conversational assistants t...
research
04/10/2018

Imagine This! Scripts to Compositions to Videos

Imagining a scene described in natural language with realistic layout an...
research
03/24/2023

Aligning Step-by-Step Instructional Diagrams to Video Demonstrations

Multimodal alignment facilitates the retrieval of instances from one mod...
