
Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval

Schemata are structured representations of complex tasks that can aid artificial intelligence by allowing models to break down complex tasks into intermediate steps. We propose a novel system that induces schemata from web videos and generalizes them to capture unseen tasks with the goal of improving video retrieval performance. Our system proceeds in three major phases: (1) Given a task with related videos, we construct an initial schema for a task using a joint video-text model to match video segments with text representing steps from wikiHow; (2) We generalize schemata to unseen tasks by leveraging language models to edit the text within existing schemata. Through generalization, we can allow our schemata to cover a more extensive range of tasks with a small amount of learning data; (3) We conduct zero-shot instructional video retrieval with the unseen task names as the queries. Our schema-guided approach outperforms existing methods for video retrieval, and we demonstrate that the schemata induced by our system are better than those generated by other models.



1 Introduction

When encountering unfamiliar processes, people leverage knowledge from previous experience and generalize it to new situations. Cognitively, the information people use can be thought of as a schema: a sequence of steps and a set of rules that a person uses to perform everyday tasks [widmayer2004schema]. A schema can form a scaffold for adapting to unfamiliar contexts. For example, a person may know the steps for baking a cake, and when confronted with a new task of baking a cupcake, she may try to modify a familiar cake process. In this work, we study how a vision system can adopt such a reasoning approach and improve video retrieval.

Figure 1: An example from our IER system, which first induces a schema for Bake Chicken using a set of videos. Then it edits the steps in the schema to adapt to the unseen task Bake Fish (the tokens that have been edited are highlighted). Finally, IER relies on the edited schema to help retrieve videos for Bake Fish.

We propose a novel schema induction and generalization approach that we apply to video retrieval called Induce, Edit, Retrieve (IER). Our schemata are represented as sets of natural language sentences describing steps associated with a task. Unlike pre-training approaches that construct implicit representations of procedural knowledge [zellers2021merlot], our knowledge is explicit, interpretable, and easily adapted. Furthermore, while others have tried to derive such knowledge directly from text [regneri2010learning, ostermann2020script, sakaguchi-etal-2021-proscript-partially], we induce it from video. Once induced, our natural language schemata can be adapted to new unseen situations via explicit edit operations driven by BERT-based language models [devlin2018bert]. For example, IER is able to adapt an induced schema about Baking chicken to a novel task of Baking fish, as seen in Figure 1. Edited schemata can then be used to recognize novel situations and improve video retrieval systems.

We induce schemata by finding textual descriptions of videos that are reliably associated with a single task. Our system captions instructional YouTube videos from the Howto100M dataset [miech19howto100m] using candidate sentences from wikiHow [lyu-zhang-wikihow:2020] with pretrained video-text matching models [miech20endtoend]. Sentences with the highest average matching score over all videos available for a task are retained for the schema. The approach is simple and effective, leading to high-quality schemata with only 50 videos per task. In total, we induce 22,000 schemata from 1 million videos with this approach.

While large, our initial set of induced schemata is incomplete. When faced with unseen tasks, we propose to adapt existing schemata using edit operations. Our edits are directly applied to a schema’s textual representation and are primarily guided by language models. Given a novel unseen target task, we pair it with a previously induced source task based on visual and textual similarity. Then, we modify the steps in the source task’s schema using three editing routines, as shown in Figure 1. Broadly, our edit operations first make object replacements to the schema using alignments between task names. For example, in Figure 1, we change all instances of “chicken” to “fish”. Then we use a BERT-based model [devlin2018bert] to both remove and modify the text in the source schema. Sentences that are poorly associated with the target task name according to the model are removed. Then we find low-probability tokens and allow a BERT model to replace the ones with the lowest score with higher-probability tokens [ghazvininejad2019mask]. While this editing approach relies on finding sufficiently similar tasks in our induction set, our experiments show that our initial set of induced schemata can generalize to unseen tasks found in datasets such as COIN [tang2019coin] or Youcook2 [ZhXuCoAAAI18].

The generated schemata can be used to retrieve multi-minute videos with extremely short queries in the form of task names (on average, our queries are 4.4 tokens long). Given a query, we retrieve videos using a schema from our initial induction set to produce a new schema through editing. The new schema is used to expand a short query into a larger set of sentences that can be matched to short clips throughout a long video. We evaluate the utility of our edited schemata for retrieval on Howto100M, COIN, and Youcook2 videos. Results demonstrate that our IER approach is significantly better at retrieving videos than approaches that do not expand task names with schemata, improving nearly 10% on top-1 retrieval precision. Furthermore, our edited schemata significantly outperform those generated from large language models such as GPT-3 [brown2020language] in retrieval. Finally, our extensive analysis shows that using schemata for retrieving instructional videos helps more as the length of the video increases.

2 Related Work

Previous work on schema induction has focused solely on textual resources through statistical methods [chambers2009unsupervised, frermann2014hierarchical, belyy2020script, chambers2013event, pichotta2016statistical] and neural approaches [rudinger2015script, zhang-etal-2020-analogous, weber2018hierarchical, sakaguchi-etal-2021-proscript-partially, belyy2020script, Lyu-et-al:2021, li2020connecting, li2021future]. While [zellers2021merlot] employ multimodal resources to extract procedural knowledge, the output is an implicit vector representation, unlike our work’s explicit and interpretable schema. Another line of research [xu2020benchmark] extracts verb-arguments from video clips without aggregating information from multiple clips on the same topic. [sener2019zero] aligns instructional text to videos in order to predict next steps, but without a schema generation method. To the best of our knowledge, this is the first attempt to extract explicit, human-readable schemata from videos and text.

Prior work has followed the paradigm of template extraction and slot filling [kulkarni2013babytalk, lu2018neural, farhadi2010every, demirel2021detection, hou2019joint] for image/video captioning to generalize to unseen situations and objects. While we draw inspiration from this literature, we instead retrieve human written sentences from wikiHow for captioning and employ language models to automatically modify them for unseen tasks.

Graphical knowledge extraction is not exclusive to script induction. A line of research that extracts graphical representations from visual input is scene graph extraction [zellers2018neural, yang2018graph, wang2019exploring, gu2019scene, ji2020action], i.e., the detection of objects and their relations from an image. Scene graphs have been applied to captioning [chen2020say, 8630068, gu2019unpaired, yang2021reformer] and visual question answering [hudson2018compositional, hudson2019learning]. While those methods rely on the same principle of extracting a graphical structure from visual input, the representations require explicit specification of label space for objects, attributes, and relations. In our work, instead, we let sentences stand in for structure. This allows us to leverage commonsense in language models to adapt our schemata.

For the text-video retrieval task, earlier work has leveraged multimodal representations [mithun2018learning, dong2019dual] to more effectively rank videos. Clip-based [gabeur2020multi, dzabraev2021mdmmt, wang2021t2vlad] and key-frame-based [dong2019dual, peng2004clip] ranking methods have been shown to be effective in improving retrieval performance. However, they rely on implicit multimodal representations rather than explicit, interpretable, and malleable representations as proposed in this work. Moreover, earlier methods generally focus on retrieving short video clips that are only several seconds long [ging2020coot, gabeur2020mmt], while the videos in our retrieval task (see Table 2) can be multiple minutes long.

Figure 2: A detailed example of the IER system. The left panel demonstrates the induction phase, which takes in a set of videos describing the same task and outputs a schema in the format of a bag of sentences. The middle panel shows the schema editing system, which modifies the existing schema for unseen tasks, e.g., editing the schema of Bake Chicken for the unseen task Bake Fish. Finally, in the right panel, we use the edited schema of the unseen task to retrieve its associated videos by matching video segments with sentences in the edited schema.

3 Building a Schemata Library

We create our schemata in two steps, shown in the first two panels of Figure 2: (1) Schema induction, where schemata are generated for a set of tasks based on their associated videos, and (2) Schema editing, where schemata from the first phase are modified to address unseen tasks with no video data available.

3.1 Formal Overview

We assume a set of tasks partitioned into known tasks and unknown tasks. Every task in the known set is associated with a set of videos. We also assume a background textual corpus of candidate steps, made up of sentences describing tasks that are not necessarily in the task set.

Our goal is to construct a schema for every task. We proceed in two steps. First, we use the videos associated with tasks in the known set to align sentences from the candidate-step corpus, using a matching function that scores pairs of short clips and sentences. The highest scoring alignments form the set of sentences in the schema of a known task. Second, given an unknown task, we find a similar source task in the known set and modify its schema to create the schema of the unknown task.


| Object Replacement | Step Deletion | Token Replacement |
|---|---|---|
| Cook Ham → Cook Lamb | Transplant a Young Tree → Remove a Tree | Prepare Fish → Prepare Crabs |
| Put the ham in the oven. | Fill your pot with a balanced fertilizer. | Cut the fins from the fish using kitchen shears. |
| Put the lamb in the oven. | Fill your pot with a balanced fertilizer. | Cut the shells from the crabs using steel scissors. |
| Clean a Guitar → Build a Violin | Fix a Toilet → Remove a Toilet | Make Healthy Donuts → Bake Healthy Cookies |
| Use a polish for particularly dirty guitars. | Test out the new flapper. | Slice your donuts into disks. |
| Use a polish for particularly dirty violins. | Test out the new flapper. | Slice your cookies into squares. |
| Trap a Rat → Trap a Rabbit | Brush a Cat → Brush a Long Haired Dog | Wash Your Bike → Wash a Motorcycle |
| Bait and set snap rat traps. | Comb and groom your pet. | Clean the bike chain with a degreaser. |
| Bait and set snap rabbit traps. | Comb and groom your pet. | Clean the motorcycle thoroughly with a towel. |


Table 1: Examples of the operations performed by the three editing modules. “Source task → Target task” represents the generalization from a source task to a target task with a given task similarity score, and each source step is followed by the target step it is edited into. The yellow words are replaced during the Object Replacement operation and the blue tokens are replaced by the masked language model.

3.2 Schema Induction

Given a known task and its associated videos, we induce its schema by retrieving sentences from the candidate-step corpus that reliably describe steps performed in those videos. Each video can be partitioned into short segments or clips. For each segment, our goal is to find textual descriptions of the step being performed.

We use a pre-trained matching function between video and text to compute the matching score between a segment and a step description. In practice, we use MIL-NCE [miech20endtoend], a model trained on HowTo100M videos, to create video and textual embeddings with high similarity on co-occurring frames and transcripts. For each clip, we retain the 30 highest scoring step descriptions from the corpus. Afterwards, for each step in the union of step descriptions retained for a task, we average its matching score over all videos associated with the task.

We select the top-100 step descriptions for each task based on this average score. Finally, we reduce redundancy by clustering similar descriptions. Paraphrases are very common in wikiHow; e.g., “Remove the chicken from the oven” and “Remove your chicken from the oven” both exist in the corpus. We use the AgglomerativeClustering API from sklearn for clustering and select the step with the highest matching score from each cluster to construct the schema. On average, each schema contains 25.1 sentences.
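The induction procedure can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the input layout, and in particular the choice to take a step's best clip score as its per-video score are assumptions made here for illustration, since the text only states that scores are averaged over videos. The clustering step is omitted.

```python
def induce_schema_scores(videos, per_clip_k=30, schema_k=100):
    """Rank candidate steps for one task.

    videos: list of videos; each video is a non-empty list of clips; each clip
            is a dict mapping a candidate step sentence to its matching score.
    Returns the schema_k best (step, average_score) pairs, best first.
    """
    # 1) Keep the per_clip_k best-scoring step descriptions of every clip.
    retained = set()
    for clips in videos:
        for clip in clips:
            best = sorted(clip, key=clip.get, reverse=True)[:per_clip_k]
            retained.update(best)

    # 2) Average each retained step's score over all videos of the task.
    #    ASSUMPTION: a step's score for a video is its best clip score,
    #    and a step missing from a clip scores 0.0 there.
    averaged = {}
    for step in retained:
        per_video = [max(clip.get(step, 0.0) for clip in clips) for clips in videos]
        averaged[step] = sum(per_video) / len(per_video)

    # 3) Keep the schema_k steps with the highest average score.
    ranked = sorted(averaged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:schema_k]
```

In a full pipeline, the ranked steps would then be clustered (for instance with sklearn's AgglomerativeClustering, as the text mentions) and one representative kept per cluster.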

3.3 Schema Editing

To produce the schema for an unseen target task, we edit the schema of a similar source task in the known set. To achieve this objective, we develop a schema editing pipeline composed of three modules to manipulate the steps of the source schema (see Table 1 for examples). The three modules are applied in sequence, starting from deterministic replacements and ending with token-level edits performed by a language model. (1) Object Replacement: we replace aligned objects from task names. (2) Step Deletion: we remove irrelevant steps using a BERT-based question-answering system. (3) Token Replacement: we adjust steps at the token level by allowing a language model to replace tokens that have low probability.

Object Replacement   Each task name has a main object, e.g., “chicken” in Bake Chicken, found using a part-of-speech tagger. For each task name, we retain the first tagged noun as the main object. We replace all occurrences of the main object in the source schema with the main object of the target task. For example, in the first column of Table 1, we replace “Ham” with “Lamb”.
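A minimal sketch of the replacement step follows. It assumes the main objects have already been identified (the part-of-speech tagging is not shown), and the function name and the naive handling of plural “s” and capitalization are illustrative choices rather than details from the paper.

```python
import re


def replace_main_object(steps, source_obj, target_obj):
    """Replace every whole-word occurrence of source_obj with target_obj.

    A trailing plural "s" and a leading capital letter are carried over;
    this naive morphology is an assumption made for the sketch.
    """
    pattern = re.compile(r"\b" + re.escape(source_obj) + r"(s?)\b", re.IGNORECASE)

    def swap(match):
        word = target_obj + match.group(1)
        # Preserve sentence-initial capitalization of the replaced word.
        return word.capitalize() if match.group(0)[0].isupper() else word

    return [pattern.sub(swap, step) for step in steps]


schema = ["Put the ham in the oven.", "Use a polish for particularly dirty guitars."]
edited = replace_main_object(schema, "ham", "lamb")
edited = replace_main_object(edited, "guitar", "violin")
```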

Step Deletion Some steps are irrelevant for the new target task. For example, the task Bake Chicken has a step “Insert a roasting thermometer into the thigh” which is inappropriate for the target task Bake Fish. Ideally, steps such as “Preheat the oven”, which apply to both Bake Chicken and Bake Fish, will be preserved.

To identify which steps to delete, we utilize a sentence-BERT model [reimers-2019-sentence-bert] fine-tuned on question-answer pairs (multi-qa-mpnet-base-cos-v1). The model computes a compatibility score between a question and an answer. It is trained to embed a question and an answer separately and then use the embedding similarity as the score. We use the model to score pairs of task names and steps, and delete a step when the model scores it as less compatible with the target task than with the source task by a significant margin. The margin is a hyper-parameter determined on validation data. Examples of step deletions performed by our system can be found in the second column of Table 1.
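The deletion rule can be sketched as a simple margin test. The scorer is passed in as a callable so that any question-answer compatibility model could be plugged in; the function name, the argument names, and the toy score table below are hypothetical and only serve to make the example runnable.

```python
def delete_incompatible_steps(steps, source_task, target_task, score, margin):
    """Keep only the steps that remain compatible with the target task.

    score(task_name, step) -> float is any compatibility scorer (in the paper,
    a sentence-BERT QA model; here an injected callable). A step is dropped
    when it fits the source task better than the target task by more than margin.
    """
    kept = []
    for step in steps:
        gap = score(source_task, step) - score(target_task, step)
        if gap > margin:
            continue  # much less compatible with the target task: delete
        kept.append(step)
    return kept


# Toy scores standing in for a real model (values are made up).
TOY_SCORES = {
    ("Bake Chicken", "Insert a roasting thermometer into the thigh."): 0.80,
    ("Bake Fish", "Insert a roasting thermometer into the thigh."): 0.30,
    ("Bake Chicken", "Preheat the oven."): 0.70,
    ("Bake Fish", "Preheat the oven."): 0.68,
}


def toy_score(task, step):
    return TOY_SCORES[(task, step)]
```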

Token Replacement Finally, we adapt elements of the source task’s schema at the token level, allowing a masked language model (we choose distilroberta-base) to replace words in a step with more appropriate alternatives. We build on existing generation work using BERT-based models [ghazvininejad2019mask]. We prompt the language model with a task name and a step, i.e., “How to [TASK]? [STEP]” and then greedily allow it to replace the least likely noun in the step with a higher scoring noun. We repeat this iteratively on modified steps a fixed number of times, determined by the number of nouns in a step. For example, as in the third column of Table 1, we replace the word “fins” from a fish-based source task with “shells” in a crab-based target task.
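The greedy loop can be sketched as follows, with the masked language model abstracted behind two injected callables. Everything named here is hypothetical: `prob` and `best_alternative` stand in for model queries, and visiting each noun position at most once is an assumption, since the text only says the replacement is repeated a number of times determined by the number of nouns.

```python
def greedy_token_replacement(tokens, noun_positions, prob, best_alternative):
    """Greedily replace unlikely nouns in one step.

    tokens:           list of tokens of the step (after the task-name prompt).
    noun_positions:   indices of the nouns that may be replaced.
    prob(tokens, i):  probability the language model gives tokens[i] in context.
    best_alternative(tokens, i): (candidate_token, candidate_probability).
    """
    tokens = list(tokens)  # do not mutate the caller's list
    remaining = set(noun_positions)
    for _ in range(len(noun_positions)):  # one round per noun
        if not remaining:
            break
        # Pick the noun the model currently finds least likely.
        pos = min(remaining, key=lambda i: prob(tokens, i))
        candidate, candidate_prob = best_alternative(tokens, pos)
        if candidate_prob > prob(tokens, pos):
            tokens[pos] = candidate  # accept only strictly better nouns
        remaining.discard(pos)
    return tokens
```

With a real model, `prob` would mask position `i` in the prompt “How to [TASK]? [STEP]” and read off the probability of the original token, and `best_alternative` would return the highest-probability noun for that mask.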

4 Schema Guided Video Retrieval

To test the effectiveness of our schema induction and editing approaches, we formulate a novel video retrieval framework. Given queries in the form of task names, we must retrieve long multi-minute videos of people instructing others on how to execute these tasks. We use induced and edited schemata to retrieve such long videos, with a matching function that combines global information from the task name and step information from the schema. When using edited schemata, we average over multiple possible source tasks, allowing the model to combine information from multiple related tasks.

4.1 Matching Function

Global Matching Previous work on video retrieval largely focuses on short videos [ging2020coot, Luo2020UniVL, Luo2021CLIP4Clip]. These methods work predominantly by matching a single feature vector, representing the entire video, to a query. However, in our retrieval scenario, where videos are several minutes long, such an approach is impractical. Instead, given a query task and a video with its associated segments, we average a local matching score between the task name and each segment to estimate the overall compatibility between the task and the video. This global averaging approach serves as the starting point for our schema-based retrieval function.

Step Aggregation Model Following [yang2021visual], who use sets of steps from wikiHow to match images, we define a video analog. The core idea is to score the compatibility between a schema and a video by finding an alignment between video segments and each sentence of the schema. The alignment is done greedily, selecting the best video segment for each step in the schema. The average quality of these alignments is then interpolated with the global score above to form our final scoring function. The interpolation weight is a hyperparameter tuned on the development data. Our scoring function smoothly interpolates between matching video directly with the task name and aligning video segments with the steps in the schema.
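The two scores and their combination can be sketched in a few lines. The convex-combination form `(1 - weight) * global + weight * step` is an assumption chosen for illustration; the text only says that the two scores are interpolated with a tuned weight. The function names and input layout are likewise hypothetical.

```python
def global_score(task_to_segment):
    """Average matching score between the task name and every video segment."""
    return sum(task_to_segment) / len(task_to_segment)


def step_aggregation_score(step_to_segment):
    """Greedy alignment: each schema step takes its best-matching segment,
    and the alignment qualities are averaged over the steps.

    step_to_segment[i][j] is the score of schema step i against segment j.
    """
    best_per_step = [max(row) for row in step_to_segment]
    return sum(best_per_step) / len(best_per_step)


def final_score(task_to_segment, step_to_segment, weight):
    """Interpolate the global and step-level scores.

    ASSUMPTION: a convex combination with weight in [0, 1]; weight = 0 reduces
    to global matching and weight = 1 relies on the schema alone.
    """
    return ((1.0 - weight) * global_score(task_to_segment)
            + weight * step_aggregation_score(step_to_segment))
```

Ranking a retrieval pool then amounts to computing `final_score` for every candidate video and sorting in descending order.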

4.2 Task Similarity

Our final retrieval system integrates over uncertainty in the schema. Since there are many possible source tasks for an unseen task, each of which can be used to predict a different schema, we average over possibilities. Each possibility is weighted by textual and visual similarity between the source and target task. This allows us to avoid using a schema from a task such as Bake Cake to retrieve Bake Fish videos.

We score the similarity of a source task and a target task by combining the textual and visual similarity between the two tasks.

Textual Similarity is the sentence-level similarity computed by sentence-BERT [reimers-2019-sentence-bert], using all-mpnet-base-v2 as the text encoder. We compute it as the cosine similarity between the sentence-BERT embeddings of the two task names.

Visual Similarity is computed from the image representations of the tasks. For each task, we retrieve images from Google image search, using the simple_image_download package to get the URLs of the Google images. Then we apply an image encoder (clip-ViT-B-32) to each image and average the resultant representations. This aggregate vector is used to represent the visual embedding of the task. We compute the visual similarity as the cosine similarity between the features of the source and target tasks.
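As a rough sketch of the similarity computation, the snippet below takes precomputed embeddings as plain lists of floats. The encoders themselves are not called, and combining the textual and visual similarities by a simple average is purely an assumption for illustration; the combination rule is not recoverable from the text above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_embedding(vectors):
    """Average a list of image embeddings into one task-level visual embedding."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def task_similarity(src_text, tgt_text, src_images, tgt_images):
    """Combine textual and visual similarity of a source and a target task.

    ASSUMPTION: the two similarities are averaged with equal weight.
    """
    text_sim = cosine(src_text, tgt_text)
    visual_sim = cosine(mean_embedding(src_images), mean_embedding(tgt_images))
    return 0.5 * (text_sim + visual_sim)
```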

4.3 Video Retrieval on Unseen Tasks

In order to apply the step aggregation model in retrieving videos of an unseen task, we must find source task schemata to edit. Given a target task, we first retrieve a set of the most similar known tasks using the task similarity from Section 4.2. For each retrieved source task, we construct an edited schema using the routines defined in Section 3.3. Edited schemata are integrated into retrieval by weighting the score obtained with each edited schema by the similarity between its source task and the target task.
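A sketch of this similarity-weighted integration is given below. It assumes that per-video scores have already been computed with each edited schema; the function name, the dictionary layout, and the normalization of the weights by their sum are illustrative assumptions.

```python
def retrieve_with_edited_schemata(similarities, schema_scores, top_k=3):
    """Rank videos for an unseen task using several edited schemata.

    similarities:  {source_task: similarity to the target task}
    schema_scores: {source_task: {video_id: score of the video under the
                    schema edited from that source task}}
    Returns video ids sorted from best to worst.
    """
    # Keep the top_k most similar source tasks.
    sources = sorted(similarities, key=similarities.get, reverse=True)[:top_k]
    total = sum(similarities[s] for s in sources)

    videos = set().union(*(schema_scores[s] for s in sources))
    combined = {
        # ASSUMPTION: similarity-weighted average, normalized by total weight.
        video: sum(similarities[s] * schema_scores[s].get(video, 0.0)
                   for s in sources) / total
        for video in videos
    }
    return sorted(combined, key=combined.get, reverse=True)
```

Setting `top_k=1` reduces this to retrieval with the single most similar source schema.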


5 Experiments

This section introduces the evaluation datasets, the baselines used for comparison, and the implementation details of our IER model.

5.1 Datasets

Howto100M We use the Howto100M [miech19howto100m] dataset for schema induction, as described in Section 3.2. Howto100M consists of 1.22M instructional YouTube videos covering 23k different visual tasks. These visual tasks are selected from wikiHow articles, and each task is described by a set of step-by-step instructions in the article. The number of videos for each Howto100M task varies significantly. We keep the tasks that have at least 20 videos, which results in 21,299 tasks. (The videos of Howto100M are retrieved from YouTube, and each video is associated with a search rank; we delete the videos with a YouTube search rank worse than 150, assuming that they are not closely related to the task.) The task names are annotated with part-of-speech (POS) tags, using the flair POS tagger, in order to identify the main object for the Object Replacement operation during editing.

Howto-GEN To evaluate the schema editing modules, we split Howto100M tasks into two sets of known and unknown tasks. We select the tasks from Howto100M with exactly one noun, resulting in 3,365 tasks with 2,184 unique main objects. Then we randomly select 500 tasks for training and 500 tasks for validation and retain 2,365 tasks for testing. Based on this split, there are 1,088 unseen main objects in the test set. We choose 5 videos for each test task for retrieval (the top-5 videos of the task according to the YouTube search rank) and pair them with a fixed set of 2,495 randomly sampled distractor videos to constitute a retrieval pool of 2,500 videos.

COIN [tang2019coin] is a large-scale instructional video dataset with 11,827 videos for 180 tasks. The COIN tasks contain concepts unseen in Howto100M, such as “Blow Sugar”, “Play Curling”, and “Make Youtiao”. We treat COIN as a zero-shot test set (no task names are shared between COIN and HowTo100M): we randomly pick five videos for every task, constructing a retrieval pool of 900 videos for 180 tasks.

Youcook2 [ZhXuCoAAAI18] contains 2,000 long videos for 89 cooking recipes. We treat recipe names as tasks and use the same split as [miech19howto100m] to guarantee that there is no overlap between the videos in Youcook2 and Howto100M. We finally form a retrieval pool of 436 videos.


| Dataset | # of tasks | # of videos | Avg. video length (s) |
|---|---|---|---|
| Howto-GEN | 2,365 | 11,825 | 392.9 |
| COIN | 180 | 900 | 143.2 |
| Youcook2 | 89 | 436 | 310.9 |


Table 2: Statistics of the evaluation datasets (test set).

5.2 Preprocessing

The video segment boundaries in Howto100M are generated by an Automatic Speech Recognition system and are noisy and redundant. To reduce the number of segments per video, we apply k-means to the S3D features [xie2018rethinking] of the clips, varying k from 5 to 10 and selecting the k with the highest silhouette score [ROUSSEEUW198753]. Then we pick the segment nearest to the center of each cluster to form the sequence of clips for each video. For COIN and Youcook2, we use the human-annotated video segments provided with the datasets.

5.3 Baselines

Global Matching We leverage the MIL-NCE model with the global averaging method described in Section 4 to retrieve procedural videos.

Step Aggregation Model As proposed in Section 4, we use edited schemata to improve video retrieval performance. For comparison, we use alternative methods to expand task names into schemata:

  • T5 [Lyu-et-al:2021] We propose a generation-based schema induction approach and fine-tune a multilingual T5 model [xue-etal-2021-mt5] using the wikiHow scripts. The model can generate a list of steps given a task as the prompt.

  • GPT-2 [radford2019language] Following the same experimental setup as T5, we fine-tune a GPT-2-large model to generate the schemata.

  • GPT-3 [brown2020language] We use the OpenAI GPT-3 (davinci) model to conduct zero-shot schema generation using the prompt - “How to Task Name? Give me several steps.”.

  • GOSC Goal-Oriented Script Construction (GOSC) [Lyu-et-al:2021] is a retrieval-based approach to construct a schema. GOSC utilizes a Step Inference model to gather the set of desired steps from wikiHow given the input task name. We use the off-the-shelf model, so some of the Howto-GEN test tasks have been seen during the training process of GOSC.

  • wikiHow We treat wikiHow as a schema library. For each unseen test task, we find the most similar task in wikiHow based on the similarity score and apply the schema editing modules to obtain the edited schema.

  • Oracle Our oracle schemata are written by humans for all datasets. For Howto-GEN, the oracle schemata are the steps in the exact, corresponding wikiHow articles. COIN provides human-annotated step labels for each task which we consider as the oracle schemata. For Youcook2, we treat the text annotations of the video segments as the oracle schemata.


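| Method | Howto-GEN P@1 ↑ | Howto-GEN R@5 ↑ | Howto-GEN R@10 ↑ | Howto-GEN Med r ↓ | Howto-GEN MRR ↑ | COIN P@1 ↑ | COIN R@5 ↑ | COIN R@10 ↑ | COIN Med r ↓ | COIN MRR ↑ | Youcook2 P@1 ↑ | Youcook2 R@5 ↑ | Youcook2 R@10 ↑ | Youcook2 Med r ↓ | Youcook2 MRR ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MIL-NCE [miech20endtoend] | 45.2 | 31.0 | 43.1 | 15.0 | .198 | 48.3 | 37.1 | 52.8 | 9.5 | .227 | 27.0 | 18.2 | 26.5 | 32.0 | .126 |
| Step Aggregation | | | | | | | | | | | | | | | |
| T5 [Lyu-et-al:2021] | 44.0 | 29.9 | 41.0 | 19.0 | .190 | 46.1 | 35.3 | 50.7 | 10.0 | .219 | 21.3 | 16.0 | 24.7 | 61.5 | .108 |
| GPT-2 [radford2019language] | 46.0 | 31.5 | 43.3 | 16.0 | .200 | 48.9 | 39.2 | 53.4 | 8.0 | .233 | 31.5 | 19.0 | 27.3 | 44.5 | .130 |
| GPT-3 [brown2020language] | 49.3 | 33.3 | 45.7 | 13.0 | .211 | 53.3 | 42.1 | 59.0 | 8.0 | .252 | 37.1 | 22.4 | 34.6 | 27.0 | .160 |
| GOSC [Lyu-et-al:2021] | 54.7 | 37.0 | 49.8 | 11.0 | .231 | 53.9 | 41.6 | 55.1 | 8.0 | .248 | 30.3 | 20.7 | 34.8 | 28.0 | .146 |
| wikiHow | 51.9 | 35.4 | 47.8 | 11.0 | .222 | 53.9 | 40.8 | 56.1 | 7.0 | .246 | 31.5 | 21.0 | 34.2 | 24.5 | .149 |
| IER (Ours) | 54.4 | 37.3 | 50.1 | 10.0 | .231 | 57.2 | 42.2 | 57.8 | 7.0 | .256 | 41.6 | 25.8 | 38.8 | 20.0 | .175 |
| IER (Ours) | 55.0 | 37.4 | 50.6 | 10.0 | .234 | 56.1 | 42.3 | 59.1 | 8.0 | .258 | 40.4 | 25.1 | 38.8 | 20.0 | .172 |
| Oracle | 56.5 | 38.0 | 50.8 | 10.0 | .237 | 60.0 | 43.4 | 59.3 | 7.0 | .262 | 52.8 | 33.5 | 47.1 | 14.0 | .215 |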


Table 3: Retrieval performance on Howto-GEN, COIN and Youcook2. Baselines include retrievals based on global matching, aggregation of steps generated from state-of-the-art language models, goal-oriented script construction (GOSC), and wikiHow. The Oracle upper bound contains human-written step labels for each task. Observe that our IER systems outperform the baselines across all metrics.

5.4 Implementation Details

Hyperparameters We tune the hyperparameters on the validation set of Howto-GEN. These are the threshold in equation 2, which determines which steps to remove, and the weight of the step score in equation 4. The two hyperparameters are fixed for all tests.

IER When evaluated on the Howto-GEN test set, the IER model only has access to the schemata of the 500 training tasks. Meanwhile, for COIN and Youcook2, IER can use all 21,299 schemata learned from Howto100M. As described in equation 6, we can select multiple schemata to assist retrieval. We report the performance of IER with the top-1 schema and with the top-3 schemata in the results.

5.5 Evaluation Metrics

We use the standard metrics to evaluate retrieval performance: Precision@1 (P@1), Recall@K (R@K), Mean rank (Mean r), Median rank (Med r), and Mean Reciprocal Rank (MRR). We use ↑ or ↓ to indicate whether a higher or lower score is better in all tables and figures.
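For concreteness, the metrics can be computed as in the sketch below. The exact definitions used in the paper are not spelled out, so this is one plausible reading: rank-based metrics are taken over every relevant video of every query rather than over the first hit only. That reading is an assumption, motivated by the fact that a first-hit MRR can never be lower than P@1, whereas the reported MRR values are. The function name and input layout are hypothetical.

```python
from statistics import mean, median


def retrieval_metrics(ranked_lists, relevant_sets, ks=(5, 10)):
    """Compute P@1, R@K, Mean r, Med r and MRR over a set of queries.

    ranked_lists:  one ranked list of video ids per query (best first).
    relevant_sets: one set of relevant video ids per query.
    """
    all_ranks = []           # 1-based ranks of every relevant video
    top1_hits = []           # 1.0 if the top-ranked video is relevant
    recalls = {k: [] for k in ks}

    for ranked, relevant in zip(ranked_lists, relevant_sets):
        ranks = [i + 1 for i, video in enumerate(ranked) if video in relevant]
        all_ranks.extend(ranks)
        top1_hits.append(1.0 if ranked[0] in relevant else 0.0)
        for k in ks:
            found = sum(1 for r in ranks if r <= k)
            recalls[k].append(found / len(relevant))

    metrics = {
        "P@1": mean(top1_hits),
        "Mean r": mean(all_ranks),
        "Med r": median(all_ranks),
        "MRR": mean([1.0 / r for r in all_ranks]),
    }
    for k in ks:
        metrics[f"R@{k}"] = mean(recalls[k])
    return metrics
```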

Figure 3: Retrieval performance by video length (in the number of clips). We group the test videos of Youcook2 by the number of clips per video and compute the mean rank for each group.

6 Results

6.1 Main Results

As shown in Table 3, every schema-assisted step aggregation model except T5 outperforms the MIL-NCE model on most metrics. These results suggest that the use of schemata is a promising way to enhance the retrieval of procedural videos. Furthermore, our IER model outperforms the other purely textual schema induction baselines and is close to the performance of the oracle.

We analyze the retrieval performance by video length in Figure 3. The performance of the model without schemata declines rapidly as video length increases. However, when using schemata induced and edited by IER, the performance declines substantially less on long videos.

Figure 4: Retrieval performance by task similarity. We sort the test tasks of Howto-GEN based on their task similarity and compute their mean rank for every batch of 400 tasks.

Figure 5: Qualitative examples of retrieval results for (a) Stain Cabinets and (b) Make Tea. We demonstrate our video retrieval process for two unseen tasks. On the top, we display the induced schemata of existing tasks. Below, we display schemata for the unseen tasks obtained from our editing module. Finally, we display the top-5 videos retrieved using our edited schemata. We show at most 4 segments for each video and each segment is associated with the top-1 matched step in the schema. While many of our videos are correctly retrieved, some videos are not due to two main factors: 1) our joint video-text model propagates some errors, and 2) some of the videos are labeled under different tasks while containing very similar steps, such as video 2 which belongs to the task Apply a Two Tone Finish to Furniture that is very close to the query Stain Cabinets.


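| Dataset | Method | P@1 ↑ | R@5 ↑ | R@10 ↑ | Med r ↓ | MRR ↑ |
|---|---|---|---|---|---|---|
| Howto-GEN | full | 54.4 | 37.3 | 50.1 | 10.0 | .231 |
| Howto-GEN | − mask | 53.7 | 36.3 | 49.3 | 11.0 | .229 |
| Howto-GEN | − deletion | 53.6 | 36.9 | 49.8 | 11.0 | .230 |
| Howto-GEN | − replacement | 51.5 | 34.9 | 47.3 | 12.0 | .220 |
| Howto-GEN | − all | 45.5 | 31.0 | 43.1 | 15.0 | .199 |
| COIN | full | 57.2 | 42.2 | 57.8 | 7.0 | .256 |
| COIN | − mask | 53.9 | 42.3 | 58.3 | 7.0 | .257 |
| COIN | − deletion | 58.3 | 42.0 | 58.0 | 7.0 | .258 |
| COIN | − replacement | 53.8 | 41.0 | 59.2 | 7.5 | .251 |
| COIN | − all | 54.4 | 39.6 | 53.7 | 8.0 | .246 |
| Youcook2 | full | 41.6 | 25.8 | 38.8 | 20.0 | .175 |
| Youcook2 | − mask | 40.4 | 25.4 | 39.3 | 20.0 | .173 |
| Youcook2 | − deletion | 41.6 | 26.0 | 39.1 | 21.0 | .175 |
| Youcook2 | − replacement | 40.4 | 25.8 | 38.5 | 20.0 | .173 |
| Youcook2 | − all | 40.4 | 26.0 | 39.9 | 21.0 | .174 |


Table 4: Ablation study on editing modules. “full” represents using all three modules and “− all” denotes removing all three modules. “− mask”, “− deletion” and “− replacement” are short for removing “Token Replacement”, “Step Deletion” and “Object Replacement” respectively. The numbers with underline are the ones lower than “full”. The highest number of each metric is bold.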

6.2 Editing Module Ablations

To validate whether each editing module benefits the retrieval, we conduct an ablation study where we disable these modules one by one. As shown in Table 4, applying all three editing modules improves every metric over applying none on Howto-GEN and COIN, although on COIN disabling a single module occasionally gives a slightly higher score on an individual metric (e.g., P@1 rises from 57.2 to 58.3 without Step Deletion). However, the editing modules are not necessary for Youcook2 because the tasks of Youcook2 are very close to the ones in Howto100M, and we can always find schemata of similar tasks. As shown in Figure 4, editing is more useful when task similarity is low. The average task similarity is 0.88 for Howto-GEN, 0.92 for COIN, and 0.97 for Youcook2, which explains why the editing modules are not helpful for Youcook2.

6.3 Schemata Transfer

Our schemata can improve video retrieval even when used with representations they were not induced on. For example, we experiment with CLIP [radford2021learning]. Following [portillo2021straightforward, Luo2021CLIP4Clip], which leverage CLIP for video via average-pooling, we convert video clips into frame sequences sampled at 10 FPS. Then we use clip-ViT-B-32 to encode each frame and average over the frame-level features for video representations. This allows us to use CLIP as the matching function.


| Model | P@1 ↑ | R@5 ↑ | R@10 ↑ | Med r ↓ | MRR ↑ |
|---|---|---|---|---|---|
| MIL-NCE | 48.3 | 37.1 | 52.8 | 9.5 | .227 |
| + schema | 57.2 | 42.2 | 57.8 | 7.0 | .256 |
| CLIP [radford2021learning] | 58.9 | 44.9 | 58.8 | 6.0 | .264 |
| + schema | 65.0 | 47.4 | 60.8 | 5.5 | .282 |


Table 5: Retrieval performance on COIN using MIL-NCE and CLIP as the matching functions. “+ schema” represents using the schemata induced by IER (with MIL-NCE as the matching function) for retrieval.

We compute the retrieval performance of CLIP on COIN using the global matching method and the step aggregation method with the same schemata as MIL-NCE. As shown in Table 5, MIL-NCE has a lower performance than CLIP, but with the help of our schemata, it achieves comparable performance to CLIP. In addition, the performance of CLIP also increases significantly by using our schemata. This indicates that our schemata are transferable across different video-text models to improve the video retrieval performance.

7 Conclusion

We propose a schema induction and generalization system that improves instructional video retrieval performance. We demonstrate that the induced schemata benefit video retrieval on unseen tasks, and our IER system outperforms other methods. In the future, we plan to investigate the structure of our schemata, such as the temporal order, and discover other applications of schemata.