When encountering unfamiliar processes, people leverage knowledge from previous experience and generalize it to new situations. Cognitively, the information people use can be thought of as a schema: a sequence of steps and a set of rules that a person uses to perform everyday tasks [widmayer2004schema]. A schema can form a scaffold for adapting to unfamiliar contexts. For example, a person may know the steps for baking a cake, and when confronted with a new task of baking a cupcake, she may try to modify a familiar cake process. In this work, we study how a vision system can adopt such a reasoning approach and improve video retrieval.
We propose a novel schema induction and generalization approach that we apply to video retrieval called Induce, Edit, Retrieve (IER). Our schemata are represented as sets of natural language sentences describing steps associated with a task. Unlike pre-training approaches that construct implicit representations of procedural knowledge [zellers2021merlot], our knowledge is explicit, interpretable, and easily adapted. Furthermore, while others have tried to derive such knowledge directly from text [regneri2010learning, ostermann2020script, sakaguchi-etal-2021-proscript-partially], we induce it from video. Once induced, our natural language schemata can be adapted to new unseen situations via explicit edit operations driven by BERT-based language models [devlin2018bert]. For example, IER is able to adapt an induced schema about Baking chicken to a novel task of Baking fish, as seen in Figure 1. Edited schemata can then be used to recognize novel situations and improve video retrieval systems.
We induce schemata by finding textual descriptions of videos that are reliably associated with a single task. Our system captions instructional YouTube videos from the Howto100M dataset [miech19howto100m] using candidate sentences from wikiHow (www.wikihow.com) [lyu-zhang-wikihow:2020] with pretrained video-text matching models [miech20endtoend]. Sentences with the highest average matching score over all videos available for a task are retained for the schema. The approach is simple and effective, leading to high-quality schemata with only 50 videos per task. In total, we induce 22,000 schemata from 1 million videos with this approach.
While large, our initial set of induced schemata is incomplete. When faced with unseen tasks, we propose to adapt existing schemata using edit operations. Our edits are applied directly to a schema's textual representation and are primarily guided by language models. Given a novel unseen target task, we pair it with a previously induced source task based on visual and textual similarity. Then, we modify the steps in the source task's schema using three editing routines, as shown in Figure 1. Broadly, our edit operations first make object replacements in the schema using alignments between task names. For example, in Figure 1, we change all instances of "chicken" to "fish". Then we use a BERT-based [devlin2018bert] model to both remove and modify the text in the source schema. Sentences that are poorly associated with the target task name according to the model are removed. Then we find low-probability tokens and allow a BERT model to replace the ones with the lowest scores with higher-probability tokens [ghazvininejad2019mask]. While this editing approach relies on finding sufficiently similar tasks in our induction set, our experiments show that our initial set of induced schemata can generalize to unseen tasks found in datasets such as COIN [tang2019coin] or Youcook2 [ZhXuCoAAAI18].
The generated schemata can be used to retrieve multi-minute videos with extremely short queries (on average, 4.4 tokens long) in the form of task names. Given a query, we select a schema from our initial induction set and edit it to produce a new schema. The new schema is used to expand the short query into a larger set of sentences that can be matched to short clips throughout a long video. We evaluate the utility of our edited schemata for retrieval on Howto100M, COIN, and Youcook2 videos. Results demonstrate that our IER approach is significantly better at retrieving videos than approaches that do not expand task names with schemata, improving top-1 retrieval precision by nearly 10%. Furthermore, our edited schemata significantly outperform those generated by large language models such as GPT-3 [brown2020language] in retrieval. Finally, our extensive analysis shows that using schemata for retrieving instructional videos helps more as the length of the video increases.
2 Related Work
Previous work on schema induction has focused solely on textual resources through statistical methods [chambers2009unsupervised, frermann2014hierarchical, belyy2020script, chambers2013event, pichotta2016statistical] and neural approaches [rudinger2015script, zhang-etal-2020-analogous, weber2018hierarchical, sakaguchi-etal-2021-proscript-partially, belyy2020script, Lyu-et-al:2021, li2020connecting, li2021future]. While [zellers2021merlot] employ multimodal resources to extract procedural knowledge, their output is an implicit vector representation, unlike our work's explicit and interpretable schemata. Another line of research [xu2020benchmark] extracts verb-argument structures from video clips without aggregating information from multiple clips on the same topic. [sener2019zero] align instructional text to videos in order to predict next steps, but without a method for generating schemata. To the best of our knowledge, ours is the first attempt to extract explicit, human-readable schemata from videos and text.
Prior work has followed the paradigm of template extraction and slot filling [kulkarni2013babytalk, lu2018neural, farhadi2010every, demirel2021detection, hou2019joint] for image/video captioning to generalize to unseen situations and objects. While we draw inspiration from this literature, we instead retrieve human written sentences from wikiHow for captioning and employ language models to automatically modify them for unseen tasks.
Graphical knowledge extraction is not exclusive to script induction. A line of research that extracts graphical representations from visual input is scene graph extraction [zellers2018neural, yang2018graph, wang2019exploring, gu2019scene, ji2020action], i.e., the detection of objects and their relations from an image. Scene graphs have been applied to captioning [chen2020say, 8630068, gu2019unpaired, yang2021reformer] and visual question answering [hudson2018compositional, hudson2019learning]. While those methods rely on the same principle of extracting a graphical structure from visual input, the representations require explicit specification of a label space for objects, attributes, and relations. In our work, instead, we let sentences stand in for structure. This allows us to leverage the commonsense knowledge in language models to adapt our schemata.
For the text-video retrieval task, earlier work has leveraged multimodal representations [mithun2018learning, dong2019dual] to more effectively rank videos. Clip-based [gabeur2020multi, dzabraev2021mdmmt, wang2021t2vlad] and key-frame-based [dong2019dual, peng2004clip] ranking methods have been shown to improve retrieval performance. However, they rely on implicit multimodal representations rather than the explicit, interpretable, and malleable representations proposed in this work. Moreover, earlier methods generally focus on retrieving short video clips that are only several seconds long [ging2020coot, gabeur2020mmt], while the videos in our retrieval task (see Table 2) can be multiple minutes long.
3 Building a Schemata Library
We create our schemata in two steps, shown in the first two panels of Figure 2: (1) Schema induction, where schemata are generated for a set of tasks based on their associated videos, and (2) Schema editing, where schemata from the first phase are modified to address unseen tasks with no video data available.
3.1 Formal Overview
We assume a set of tasks $\mathcal{T}$ partitioned into known tasks $\mathcal{T}_k$ and unknown tasks $\mathcal{T}_u$. Every task in the known set, $t \in \mathcal{T}_k$, is associated with a set of videos $V_t$. We also assume a background textual corpus of candidate steps, $\mathcal{C}$, made up of sentences describing tasks, not necessarily in $\mathcal{T}$.
Our goal is to construct a schema, $S_t$, for every task $t \in \mathcal{T}$. We proceed in two steps. First, we use the videos associated with tasks in $\mathcal{T}_k$ to align sentences from $\mathcal{C}$ using a matching function $f$ that scores pairs of short clips and sentences. The highest-scoring alignments form the set of sentences in $S_t$. Second, given an unknown task, $t' \in \mathcal{T}_u$, we find a similar source task $t \in \mathcal{T}_k$ and modify its schema $S_t$ to create $S_{t'}$.
Table 1: Examples of the three editing operations (source task → target task, source step, edited step).

| Object Replacement | Step Deletion | Token Replacement |
| --- | --- | --- |
| Cook Ham → Cook Lamb | Transplant a Young Tree → Remove a Tree | Prepare Fish → Prepare Crabs |
| Put the ham in the oven. | Fill your pot with a balanced fertilizer. | Cut the fins from the fish using kitchen shears. |
| Put the lamb in the oven. | (step deleted) | Cut the shells from the crabs using steel scissors. |
| Clean a Guitar → Build a Violin | Fix a Toilet → Remove a Toilet | Make Healthy Donuts → Bake Healthy Cookies |
| Use a polish for particularly dirty guitars. | Test out the new flapper. | Slice your donuts into disks. |
| Use a polish for particularly dirty violins. | (step deleted) | Slice your cookies into squares. |
| Trap a Rat → Trap a Rabbit | Brush a Cat → Brush a Long Haired Dog | Wash Your Bike → Wash a Motorcycle |
| Bait and set snap rat traps. | Comb and groom your pet. | Clean the bike chain with a degreaser. |
| Bait and set snap rabbit traps. | (step deleted) | Clean the motorcycle thoroughly with a towel. |
3.2 Schema Induction
Given a known task $t \in \mathcal{T}_k$ and its associated videos, $V_t$, we induce $S_t$ by retrieving sentences from $\mathcal{C}$ that reliably describe steps performed in $t$. Each video $v \in V_t$ can be partitioned into short segments or clips. For each segment $c \in v$, our goal is to find textual descriptions of the step being performed.
We use a pre-trained matching function $f$ between video and text to compute the matching score $f(c, s)$ between a segment $c$ and a step description $s$. In practice, we use MIL-NCE [miech20endtoend], a model trained on HowTo100M videos, to create video and textual embeddings with high similarity on co-occurring frames and transcripts. For each clip $c$, we retain the 30 highest-scoring step descriptions from $\mathcal{C}$. Afterwards, for each step $s$ in the union of step descriptions retained for a task $t$, we average the matching score over all videos associated with the task:

$$\text{score}(s, t) = \frac{1}{|V_t|} \sum_{v \in V_t} \max_{c \in v} f(c, s)$$
We select the top-100 step descriptions for each task based on the score above. Finally, we reduce redundancy by clustering similar descriptions: paraphrases are very common in wikiHow, e.g., "Remove the chicken from the oven" and "Remove your chicken from the oven" both exist in the corpus $\mathcal{C}$. We use the AgglomerativeClustering API from sklearn for clustering, and we select the step with the highest matching score from each cluster to construct the schema $S_t$ (on average, each schema contains 25.1 sentences).
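The induction step above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the `match` function stands in for MIL-NCE, the clustering step is omitted, and all names (`induce_schema`, `per_clip`, `top_k`) are our own.

```python
from collections import defaultdict

def induce_schema(videos, corpus, match, per_clip=30, top_k=100):
    """Induce a schema: keep the corpus sentences whose matching score,
    averaged over all of a task's videos, is highest.

    videos : list of videos, each a list of clip identifiers
    corpus : list of candidate step sentences (wikiHow in the paper)
    match  : function (clip, sentence) -> float, standing in for MIL-NCE
    """
    retained = defaultdict(list)  # sentence -> one best score per video
    for video in videos:
        best_for_video = defaultdict(float)
        for clip in video:
            # Retain the `per_clip` highest-scoring sentences for this clip.
            scored = sorted(corpus, key=lambda s: match(clip, s), reverse=True)
            for s in scored[:per_clip]:
                best_for_video[s] = max(best_for_video[s], match(clip, s))
        for s, score in best_for_video.items():
            retained[s].append(score)
    # Average over all of the task's videos, then keep the top_k steps.
    avg = {s: sum(v) / len(videos) for s, v in retained.items()}
    return sorted(avg, key=avg.get, reverse=True)[:top_k]
```

In practice the matching scores come from the pretrained video-text model, and the returned list is further deduplicated by clustering.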
3.3 Schema Editing
To produce the schema for an unseen target task $t'$, we edit the schema of a similar source task in the known set $\mathcal{T}_k$. To achieve this, we develop a schema editing pipeline composed of three modules that manipulate the steps of the source schema (see Table 1 for examples). The three edits are performed in sequence, starting from deterministic replacements and ending with token-level edits performed by a language model. (1) Object Replacement: we replace aligned objects from the task names. (2) Step Deletion: we remove irrelevant steps using a BERT-based question-answering model. (3) Token Replacement: we adjust steps at the token level by allowing a language model to replace tokens that have low probability.
Object Replacement Each task name has a main object, e.g., "chicken" in Bake Chicken, which we find using a part-of-speech tagger: for each task name, we retain the first tagged noun as the main object. We replace all occurrences of the source main object in the source schema with the main object of the target task. For example, in the first column of Table 1, we replace "Ham" with "Lamb".
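A sketch of the object-replacement routine follows. The paper uses a POS tagger to find the first noun; as a stand-in here, `main_object` skips a small hand-made stopword/verb list, and both function names and that list are our own assumptions.

```python
import re

# Toy stand-in for POS tagging: common verbs/determiners in task names.
STOP = {"a", "an", "the", "your", "to", "how", "cook", "bake", "clean",
        "build", "trap", "brush", "wash", "fix", "remove", "make"}

def main_object(task_name):
    """Return the main object of a task name (the paper uses the first
    tagged noun; we approximate with the first non-stopword token)."""
    tokens = task_name.lower().split()
    for tok in tokens:
        if tok not in STOP:
            return tok
    return tokens[-1]

def object_replacement(schema, source_task, target_task):
    """Replace every occurrence of the source main object in the schema
    with the target main object."""
    src, tgt = main_object(source_task), main_object(target_task)
    pat = re.compile(rf"\b{re.escape(src)}\b", re.IGNORECASE)
    return [pat.sub(tgt, step) for step in schema]
```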
Step Deletion Some steps are irrelevant for the new target task. For example, the task Bake Chicken has a step “Insert a roasting thermometer into the thigh” which is inappropriate for the target task Bake Fish. Ideally, steps such as “Preheat the oven”, which apply to both Bake Chicken and Bake Fish, will be preserved.
To identify which steps to delete, we utilize a sentence-BERT model [reimers-2019-sentence-bert] fine-tuned on question-answer pairs (multi-qa-mpnet-base-cos-v1). The model, $g$, computes a compatibility score between a question and an answer. It is trained to embed the question and the answer separately and then use the embedding similarity as the score. We use the model to score pairs of task names and steps, and we delete a step $s$ from the edited schema when $g$ scores it as less compatible with the target task $t'$ than with the source task $t$ by a significant margin:

$$g(t, s) - g(t', s) > \delta$$

where $\delta$ is a hyperparameter determined on validation data. Examples of step deletions performed by our system can be found in the second column of Table 1.
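The deletion criterion reduces to a one-line filter. In this sketch the `compat` callable stands in for the fine-tuned sentence-BERT model, and the function name and `delta` default are our own.

```python
def delete_steps(schema, source_task, target_task, compat, delta=0.2):
    """Keep a step unless the compatibility model scores it as worse for
    the target task than for the source task by more than `delta`.

    compat : function (task, step) -> float, standing in for the
             sentence-BERT QA model; delta is tuned on validation data.
    """
    return [s for s in schema
            if compat(source_task, s) - compat(target_task, s) <= delta]
```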
Token Replacement Finally, we adapt elements of the source task's schema at the token level, allowing a masked language model (we choose distilroberta-base) to replace words in a step with more appropriate alternatives. We build on existing generation work using BERT-based models [ghazvininejad2019mask]. We prompt the language model with a task name and a step, i.e., "How to [TASK]? [STEP]", and then greedily allow it to replace the least likely noun in the step with a higher-scoring noun. We repeat this iteratively on the modified steps a fixed number of times, determined by the number of nouns in the step. For example, as in the third column of Table 1, we replace the word "fins" from a fish-based source task with "shells" in a crab-based target task.
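The greedy replacement loop can be sketched as below. This is a single-pass toy version, not the paper's implementation: `score` stands in for the masked language model's token probability, noun positions are given explicitly instead of POS-tagged, and all names are our own.

```python
def token_replacement(task, step, nouns, score, vocab):
    """Greedily replace the least likely nouns in `step` with
    higher-scoring candidates from `vocab`.

    score : function (task, tokens, i, word) -> float, a stand-in for a
            masked LM's probability of `word` at position i given the
            prompt "How to [TASK]? [STEP]".
    """
    tokens = step.split()
    # Visit noun positions from least to most likely under the model.
    positions = sorted((i for i, t in enumerate(tokens) if t in nouns),
                       key=lambda i: score(task, tokens, i, tokens[i]))
    for i in positions:
        # Substitute the highest-scoring candidate (or keep the word).
        tokens[i] = max(vocab | {tokens[i]},
                        key=lambda w: score(task, tokens, i, w))
    return " ".join(tokens)
```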
4 Schema Guided Video Retrieval
To test the effectiveness of our schema induction and editing approaches, we formulate a novel video retrieval framework. Given queries in the form of task names, we must retrieve long, multi-minute videos corresponding to people instructing others on how to execute these tasks. We use induced and edited schemata to retrieve such long videos via a novel matching function that combines global information from the task name with step information from the schema. When using edited schemata, we average over multiple possible source tasks, allowing the model to combine information from multiple related tasks.
4.1 Matching Function
Global Matching Previous work on video retrieval largely focuses on short videos [ging2020coot, Luo2020UniVL, Luo2021CLIP4Clip]. These methods work predominantly by matching a single feature vector, representing the entire video, to a query. However, in our retrieval scenario, where videos are several minutes long, such an approach is impractical. Instead, given a query task, $t$, and a video, $v$, with associated segments $c_1, \ldots, c_n$, we average the local matching scores $f(c_i, t)$ to estimate the overall compatibility between the task and the video:

$$\text{global}(t, v) = \frac{1}{n} \sum_{i=1}^{n} f(c_i, t)$$

This global averaging approach serves as the starting point for our schema-based retrieval function.
Step Aggregation Model Following [yang2021visual], who use sets of steps from wikiHow to match images, we define a video analog. The core idea is to score the compatibility between a schema, $S$, and a video, $v$, by finding an alignment between video segments and each sentence of the schema. The alignment is done greedily, selecting the best video segment for each step in the schema. The average quality of these alignments is then interpolated with the global score above to form our final scoring function:

$$\text{score}(t, v) = \lambda \, \text{global}(t, v) + (1 - \lambda) \, \frac{1}{|S|} \sum_{s \in S} \max_{i} f(c_i, s)$$

where $\lambda$ is a hyperparameter tuned on the development data. Our scoring function smoothly interpolates between matching the video directly with the task name and aligning video segments with the steps in the schema.
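The global score, the greedy step alignment, and their interpolation can be sketched directly. The `f` callable stands in for the video-text matching model, and the function names are our own.

```python
def global_score(task, clips, f):
    """Average local matching between the task name and every clip."""
    return sum(f(c, task) for c in clips) / len(clips)

def step_score(schema, clips, f):
    """Greedy alignment: the best clip for each step, averaged over steps."""
    return sum(max(f(c, s) for c in clips) for s in schema) / len(schema)

def retrieval_score(task, schema, clips, f, lam=0.5):
    """Interpolate the global and step-level signals; `lam` plays the
    role of the hyperparameter tuned on development data."""
    return (lam * global_score(task, clips, f)
            + (1 - lam) * step_score(schema, clips, f))
```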
4.2 Task Similarity
Our final retrieval system integrates over uncertainty in the schema. Since there are many possible source tasks for an unseen task, each of which can be used to predict a different schema, we average over the possibilities. Each possibility is weighted by the textual and visual similarity between the source and target task. This allows us to avoid using a schema from a task such as Bake Cake to retrieve Bake Fish videos.
We score the similarity of tasks $t$ and $t'$ using textual ($\text{sim}_{\text{text}}$) and visual ($\text{sim}_{\text{vis}}$) similarity between the two tasks.

Textual Similarity is the sentence-level similarity computed by sentence-BERT [reimers-2019-sentence-bert] (we use all-mpnet-base-v2 as the text encoder). We compute the cosine similarity between the embeddings of $t$ and $t'$ extracted by sentence-BERT as $\text{sim}_{\text{text}}(t, t')$.
Visual Similarity is computed from image representations of the tasks. For each task, we retrieve images from Google image search (using the simple_image_download package to get the image URLs). Then we apply an image encoder (clip-ViT-B-32) to each image and average the resultant representations. This aggregate vector represents the visual embedding of the task. We compute the cosine similarity between the features of the source and target tasks as $\text{sim}_{\text{vis}}(t, t')$.
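Both similarity terms reduce to cosine similarity over embeddings. The sketch below combines them with an unweighted mean, which is our own assumption: the paper combines textual and visual similarity but the exact weighting is not reproduced here, and both function names are ours.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def task_similarity(text_a, text_b, vis_a, vis_b):
    """Combine textual and visual similarity of two tasks, given their
    sentence embeddings and aggregate image embeddings. The equal
    weighting is an illustrative assumption."""
    return 0.5 * (cosine(text_a, text_b) + cosine(vis_a, vis_b))
```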
4.3 Video Retrieval on Unseen Tasks
In order to apply the step aggregation model to retrieve videos of an unseen task $t'$, we must find source task schemata to edit. Given a target task $t'$, we first retrieve a set of most similar source tasks using the task similarity scores defined in Section 4.2. For each retrieved source task $t$, we construct an edited schema using the routines defined in Section 3.3. Edited schemata are then integrated into retrieval by weighting each schema's score by its task similarity.
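The unseen-task retrieval step can be sketched as a similarity-weighted average over edited source schemata. All names here (`score_unseen`, the callable parameters) are our own, and the normalized weighting is an illustrative assumption.

```python
def score_unseen(target, clips, sources, edit, score_fn, sim):
    """Similarity-weighted average of retrieval scores over edited schemata.

    sources  : list of (source_task, source_schema) pairs (top-k neighbors)
    edit     : (schema, source, target) -> edited schema (Section 3.3 routines)
    score_fn : (task, schema, clips) -> retrieval score (Section 4.1 model)
    sim      : (source, target) -> task similarity weight (Section 4.2)
    """
    total = weight = 0.0
    for source, schema in sources:
        w = sim(source, target)
        total += w * score_fn(target, edit(schema, source, target), clips)
        weight += w
    return total / weight
```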
5 Experimental Setup
This section introduces the evaluation datasets, the baselines used for comparison, and the implementation details of our IER model.
Howto100M We use the Howto100M [miech19howto100m] dataset for schema induction, as described in Section 3.2. Howto100M is collected from YouTube and contains 1.22M instructional videos of 23k different visual tasks. These visual tasks are selected from wikiHow articles, and each task is described by a set of step-by-step instructions in its article. The number of videos per Howto100M task varies significantly; we keep the tasks that have at least 20 videos, which results in 21,299 tasks. (The videos of Howto100M are retrieved from YouTube, and each video is associated with a search rank; we delete videos with a YouTube search rank worse than 150, assuming these videos are not closely related to the task.) The task names are annotated with parts of speech (POS) in order to identify the main object for the Object Replacement operation during editing; we use the flair POS tagger (https://huggingface.co/flair/pos-english).
Howto-GEN To evaluate the schema editing modules, we split the Howto100M tasks into known and unknown sets. We select the tasks from Howto100M whose names contain exactly one noun, resulting in 3,365 tasks with 2,184 unique main objects. We then randomly select 500 tasks for training, 500 tasks for validation, and retain the remaining 2,365 tasks for testing. Based on this split, there are 1,088 unseen main objects in the test set. We choose 5 videos for each test task for retrieval (the top-5 videos of each task based on YouTube search rank) and pair them with a fixed set of 2,495 randomly sampled distractor videos to constitute a retrieval pool of 2,500 videos.
COIN [tang2019coin] is a large-scale instructional video dataset with 11,827 videos for 180 tasks. The COIN tasks contain concepts unseen in Howto100M, such as "Blow Sugar", "Play Curling", and "Make Youtiao". We treat COIN as a zero-shot test set (no task names are shared between COIN and HowTo100M): we randomly pick five videos for every task, constructing a retrieval pool of 900 videos for 180 tasks.
Youcook2 [ZhXuCoAAAI18] contains 2,000 long videos for 89 cooking recipes. We treat recipe names as tasks and use the same split as [miech19howto100m] to guarantee that there is no overlap between the videos in Youcook2 and Howto100M. We finally form a retrieval pool of 436 videos.
[Table 2: dataset statistics — number of tasks, number of videos, and average video length in seconds for each retrieval dataset.]
The video segment boundaries in Howto100M are generated by an Automatic Speech Recognition system and are noisy and redundant. To reduce the number of segments per video, we apply k-means to the S3D features [xie2018rethinking] of the clips, iterating k from 5 to 10 and selecting the k with the highest silhouette score [ROUSSEEUW198753]. We then pick the segment nearest to the center of each cluster to form the sequence of clips for each video. For COIN and Youcook2, we use the human-annotated video segments provided with each dataset.
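The segment-reduction step can be sketched with sklearn, which the paper uses elsewhere for clustering. This is a toy version: the function name is ours, the k range is smaller than the paper's 5–10 so that it runs on tiny synthetic data, and the input is any feature matrix rather than S3D features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def reduce_segments(features, k_range=(2, 4), seed=0):
    """Pick k by silhouette score, then keep the one segment closest to
    each cluster centre. Returns sorted indices of the kept segments."""
    X = np.asarray(features, dtype=float)
    best_k, best_s = None, -1.0
    for k in range(k_range[0], min(k_range[1], len(X) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        s = silhouette_score(X, labels)
        if s > best_s:
            best_k, best_s = k, s
    km = KMeans(n_clusters=best_k, n_init=10, random_state=seed).fit(X)
    # Keep the segment nearest to the centre of each cluster.
    keep = {int(np.argmin(((X - c) ** 2).sum(axis=1)))
            for c in km.cluster_centers_}
    return sorted(keep)
```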
Global Matching We leverage the MIL-NCE model with the global averaging method described in Section 4 to retrieve procedural videos.
Step Aggregation Model As proposed in Section 4, we use edited schemata to improve video retrieval performance. For comparison, we use alternative methods to expand task names into schemata:
T5 [Lyu-et-al:2021] We adopt a generation-based schema induction approach and fine-tune a multilingual T5 model [xue-etal-2021-mt5] on the wikiHow scripts. The model generates a list of steps given a task name as the prompt.
GPT-2 [radford2019language] Following the same experimental setup as T5, we fine-tune a GPT-2-large model to generate the schemata.
GPT-3 [brown2020language] We use the OpenAI GPT-3 (davinci) model to conduct zero-shot schema generation using the prompt "How to Task Name? Give me several steps.".
GOSC Goal-Oriented Script Construction (GOSC) [Lyu-et-al:2021] is a retrieval-based approach to construct a schema. GOSC utilizes a Step Inference model to gather the set of desired steps from wikiHow given the input task name. We use the off-the-shelf model, so some of the Howto-GEN test tasks have been seen during the training process of GOSC.
wikiHow We treat wikiHow as a schema library. For each unseen test task, we find the most similar task in wikiHow based on the similarity score and apply the schema editing modules to obtain the edited schema.
Oracle Our oracle schemata are written by humans for all datasets. For Howto-GEN, the oracle schemata are the steps in the exact, corresponding wikiHow articles. COIN provides human-annotated step labels for each task which we consider as the oracle schemata. For Youcook2, we treat the text annotations of the video segments as the oracle schemata.
[Table 3: retrieval results (P@1, R@5, R@10, Med r, MRR) on each of the evaluation datasets.]
5.4 Implementation Details
Hyperparameters We tune the hyperparameters on the validation set of Howto-GEN. We set the threshold $\delta$ in Equation 2, which determines which steps to remove, and the weight $\lambda$ of the step score in Equation 4. The two hyperparameters are then fixed for all tests.
IER When evaluated on the Howto-GEN test set, the IER model only has access to the schemata of the 500 training tasks. Meanwhile, for COIN and Youcook2, IER can use all 21,299 schemata learned from Howto100M. As described in Equation 6, we can select multiple schemata to assist retrieval; we report the performance of IER with the top-1 schema and with the top-3 schemata in the results.
5.5 Evaluation Metrics
We use standard metrics to evaluate retrieval performance: Precision@1 (P@1), Recall@K (R@K), Mean rank (Mean r), Median rank (Med r), and Mean Reciprocal Rank (MRR). We use ↑ or ↓ to indicate whether a higher or lower score is better in all tables and figures.
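These metrics are all simple functions of the rank of the correct video for each query. A minimal sketch (function name ours; ranks are 1-based):

```python
def retrieval_metrics(ranks):
    """Compute standard retrieval metrics from the 1-based rank of the
    correct video for each query."""
    n = len(ranks)
    return {
        "P@1": sum(r == 1 for r in ranks) / n,     # fraction ranked first
        "R@5": sum(r <= 5 for r in ranks) / n,     # recall within top 5
        "R@10": sum(r <= 10 for r in ranks) / n,   # recall within top 10
        "Mean r": sum(ranks) / n,                  # mean rank
        "Med r": sorted(ranks)[(n - 1) // 2],      # median rank
        "MRR": sum(1.0 / r for r in ranks) / n,    # mean reciprocal rank
    }
```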
6.1 Main Results
As shown in Table 3, almost all step aggregation models assisted with schemata outperform the MIL-NCE model except for T5. These results suggest that the use of schemata is a promising way to enhance the retrieval of procedural videos. Furthermore, our IER model outperforms the other purely textual schema induction baselines and is close to the performance of the oracle.
We analyze the retrieval performance by video length in Figure 3. The performance of the model without schemata declines rapidly as video length increases. However, when using schemata induced and edited by IER, the performance declines substantially less on long videos.
6.2 Editing Module Ablations
To validate whether each editing module benefits retrieval, we conduct an ablation study in which we disable the modules one by one. As shown in Table 4, the editing modules never hurt and often improve retrieval performance on Howto-GEN and COIN. However, the editing modules are not necessary for Youcook2, because the tasks of Youcook2 are very close to those in Howto100M, and we can almost always find schemata of similar tasks. As shown in Figure 4, editing is more useful when task similarity is low. (The average task similarity is 0.88 for Howto-GEN, 0.92 for COIN, and 0.97 for Youcook2, which explains why the editing modules are not helpful for Youcook2.)
6.3 Schemata Transfer
Our schemata can improve video retrieval even when used with representations they were not induced on. For example, we experiment with CLIP [radford2021learning]. Following [portillo2021straightforward, Luo2021CLIP4Clip], which leverage CLIP for video via average pooling, we convert video clips into frame sequences sampled at 10 FPS. Then we use clip-ViT-B-32 to encode each frame and average the frame-level features to obtain video representations. This allows us to use CLIP as the matching function $f$.
We compute the retrieval performance of CLIP on COIN using the global matching method and the step aggregation method with the same schemata as MIL-NCE. As shown in Table 5, MIL-NCE has a lower performance than CLIP, but with the help of our schemata, it achieves comparable performance to CLIP. In addition, the performance of CLIP also increases significantly by using our schemata. This indicates that our schemata are transferable across different video-text models to improve the video retrieval performance.
7 Conclusion
We propose a schema induction and generalization system that improves instructional video retrieval performance. We demonstrate that the induced schemata benefit video retrieval on unseen tasks, and our IER system outperforms other methods. In the future, we plan to investigate the structure of our schemata, such as the temporal order, and discover other applications of schemata.