Imagine being in your kitchen, preparing a sophisticated dish that involves a sequence of complex steps. Fortunately, your J.A.R.V.I.S. (a fictional AI assistant in the Marvel Cinematic Universe) comes to your rescue. It actively recognizes the task you are trying to accomplish and guides you step-by-step through the successful execution of the recipe. The dramatic progress witnessed in activity recognition [TSN, Kinetics, tran2018closer, Damen2018EPICKITCHENS] over the last few years has certainly brought these fictional scenarios a bit closer to reality. Yet, it is clear that in order to attain these goals we must extend existing systems beyond atomic-action classification in trimmed clips to tackle the more challenging problem of understanding procedural activities in long videos spanning several minutes. Furthermore, in order to classify the procedural activity, the system must not only recognize the individual semantic steps in the long video but also model their temporal relations, since many complex activities share several steps but may differ in the order in which these steps appear or are interleaved. For example, “beating eggs” is a common step in many recipes, which are nevertheless likely to differ in the preceding and subsequent steps.
In recent years, the research community has engaged in the creation of several manually-annotated video datasets for the recognition of procedural, multi-step activities. However, in order to make detailed manual annotations possible at the level of both segments (step labels) and videos (task labels), these datasets have been constrained to have a narrow scope or a relatively small scale. Examples include video benchmarks that focus on specific domains, such as recipe preparation or kitchen activities [breakfast, youcook2, Damen2018EPICKITCHENS], as well as collections of instructional videos manually-labeled for step and task recognition [zhukov2019cross, tang2020comprehensive]. Due to the large cost of manually annotating temporal boundaries, these datasets have been limited to a small size both in terms of number of tasks (a few hundred activities at most) as well as amount of video examples (about 10K samples, for roughly 400 hours of video). While these benchmarks have driven early progress in this field, their limited size and narrow scope prevent the training of modern large-capacity video models for recognition of general procedural activities.
On the other end of the scale/scope spectrum, the HowTo100M dataset [miech2019howto100m] stands out as an exceptional resource. It is over 3 orders of magnitude bigger than prior benchmarks in this area along several dimensions: it includes over 100M clips showing humans performing and narrating more than 23,000 complex tasks for a total duration of 134K hours of video. The downside of this massive amount of data is that its scale effectively prevents manual annotation. In fact, all videos in HowTo100M are unverified by human annotators. While this benchmark clearly fulfills the size and scope requirements needed to train large-capacity video models, its lack of segment annotations and the unvalidated nature of the videos impedes the training of accurate step or task classifiers.
In this paper we present a novel approach for training models to recognize procedural steps in instructional video without any form of manual annotation, thus enabling optimization on large-scale unlabeled datasets, such as HowTo100M. We propose a distant supervision
framework that leverages a textual knowledge base as guidance to automatically identify video segments corresponding to different procedural steps. Distant supervision has been used in Natural Language Processing [snow2005, mintz2009distant, riedel2010modeling] to mine relational examples from noisy text corpora using a knowledge base. In our setting, we also aim at relation extraction, albeit in the specific form of identifying video segments that relate to semantic steps. The knowledge base that we use is wikiHow [wikihow]
—a crowdsourced multimedia repository containing over 230,000 “how-to” articles describing and illustrating steps, tips, warnings and requirements to accomplish a wide variety of tasks. Our system uses language models to compare segments of narration automatically transcribed from the videos to the textual descriptions of steps in wikiHow. The matched step descriptions serve as distant supervision to train a video understanding model to learn step-level representations. Thus, our system uses the knowledge base to mine step examples from the noisy, large-scale unlabeled video dataset. To the best of our knowledge, this is the first attempt at learning a step video representation with distant supervision.
We demonstrate that video models trained to recognize these pseudo-labeled steps in a massive corpus of instructional videos provide a general video representation that transfers effectively to four different downstream tasks on new datasets. Specifically, we show that we can apply our model to represent a long video as a sequence of step embeddings extracted from the individual segments. A shallow sequence model (a single Transformer layer [vaswani2017attention]) is then trained on top of this sequence of embeddings to perform temporal reasoning over the steps. Our experiments show that such an approach yields state-of-the-art results for classification of procedural tasks on the labeled COIN dataset, outperforming the best numbers previously reported in the literature. Furthermore, we use this testbed to make additional insightful observations:
- Step labels assigned with our distant supervision framework yield better downstream results than those obtained by using the unverified task labels of HowTo100M.
- Our distantly-supervised video representation outperforms fully-supervised video features trained with action labels on the large-scale Kinetics-400 dataset [Kinetics].
- Our step assignment procedure produces better downstream results than a representation learned by directly matching video to the ASR narration [miech2020end], thus showing the value of the distant supervision framework.
We also evaluate the performance of our system for classification of procedural activities on the Breakfast dataset [breakfast]
. Furthermore, we present transfer learning results on three additional downstream tasks on datasets different from that used to learn our representation (HowTo100M): step classification and step forecasting on COIN, as well as categorization of egocentric videos on EPIC-KITCHENS-100[damen2020epic]. On all of these tasks, our distantly-supervised representation achieves higher accuracy than previous works, as well as additional baselines that we implement based on training with full supervision. These results provide further evidence of the generality and effectiveness of our unsupervised representation for understanding complex procedural activities in videos. We will release the code and the automatic annotations provided by our distant supervision upon publication.
2 Related Work
During the past decade, we have witnessed dramatic progress in action recognition. However, the benchmarks in this field consist of brief videos (usually, a few seconds long) trimmed to contain the individual atomic action to recognize [kuehne2011hmdb, UCF101, kay2017kinetics, goyal2017something]. In this work, we consider the more realistic setting where videos are untrimmed, last several minutes, and contain sequences of steps defining the complex procedural activities to recognize (e.g., a specific recipe, or a particular home improvement task).
Understanding Procedural Videos. Procedural knowledge is an important part of human knowledge [anderson1982acquisition, rasmussen1983skills, tan2021comprehensive] essentially answering “how-to” questions.
Such knowledge is displayed in long procedural videos [rohrbach2012database, breakfast, youcook2, Damen2018EPICKITCHENS, zhukov2019cross, tang2020comprehensive, miech2019howto100m], which have attracted active research in recognition of multi-step activities [hussein2019timeception, hussein2020timegate, zhou2021graph]. Early benchmarks in this field contained manual annotations of steps within the videos [youcook2, zhukov2019cross, tang2020comprehensive] but were relatively small in scope and size. The HowTo100M dataset [miech2019howto100m], on the other hand, does not contain any manual annotations but it is several orders of magnitude bigger and the scope of its “how-to” videos is very broad. An instructional or how-to video contains a human subject demonstrating and narrating how to accomplish a certain task. Early works on HowTo100M have focused on leveraging this large collection for learning models that can be transferred to other tasks, such as action recognition [miech2019howto100m, miech2020end, alayrac2020self], video captioning [youcook2, luo2020univl, huang2020multimodal], or text-video retrieval [xu2021vlm, miech2020end, bain2021frozen]. The problem of recognizing the task performed in the instructional video has been considered by Bertasius et al. [bertasius2021space]. However, their proposed approach does not model the procedural nature of instructional videos.
Learning Video Representations with Limited Supervision. Learning semantic video representations [srivastava2015unsupervised, qiu2017learning, VideoBERT, li2020hero, xu2015discriminative] is a fundamental problem in video understanding research. Representations pretrained on labeled datasets are limited by the pretraining domain and the predefined ontology. Therefore, many attempts have been made to obtain video representations with less human supervision. In the unsupervised setting, the supervision signal is usually constructed by augmenting videos [srivastava2015unsupervised, wei2018learning, feichtenhofer2021large]. For example, Wei et al. [wei2018learning] proposed predicting the temporal order of video frames as supervision to learn order-aware video representations. In the weakly supervised setting, the supervision signals are usually obtained from hashtags [ghadiyaram2019large], ASR transcriptions [miech2019howto100m], or meta-information extracted from the Web [gan2016webly]. Miech et al. [miech2019howto100m] show that ASR sentences extracted from audio can serve as a valuable information source to learn video representations. Previous works [elhamifar2019unsupervised, elhamifar2020self] have also studied learning to localize keyframes using task labels as supervision. This is different from the focus of this paper, which addresses the problem of learning step-level representations from unlabeled instructional videos.
Distant Supervision. Distant supervision [mintz2009distant, zeng2015distant] has been studied in natural language processing and generally refers to a training scheme where supervision is obtained by automatically mining examples from a large noisy corpus utilizing a clean and informative knowledge base. It has been shown to be very successful on the problem of relation extraction. For example, Mintz et al. [mintz2009distant] leverage knowledge from Freebase [bollacker2008freebase] to obtain supervision for relation extraction. However, the concept of distant supervision has not been exploited in video understanding. Huang et al. [huang2020multimodal] have proposed to use wikiHow as a textual dataset to pretrain a video captioning model but the knowledge base is not used to supervise video understanding models.
3 Technical Approach
Our goal is to learn a segment-level representation to express a long procedural video as a sequence of step embeddings. The application of a sequence model, such as a Transformer, on this video representation can then be used to perform temporal reasoning over the individual steps. Most importantly, we want to learn the step-level representation without manual annotations, so as to enable training on large-scale unlabeled data. The key insight leveraged by our framework is that knowledge bases, such as wikiHow, provide detailed textual descriptions of the steps for a wide range of tasks. In this section, we will first describe how to obtain distant supervision from wikiHow, then discuss how the distant supervision can be used for step-level representation learning, and finally, we will introduce how our step-level representation is leveraged to solve several downstream problems.
3.1 Extracting Distant Supervision from wikiHow
The wikiHow repository contains high-quality articles describing the sequence of individual steps needed to complete a wide variety of practical tasks. Formally, we refer to wikiHow as a knowledge base containing textual step descriptions for $T$ tasks: $\mathcal{B} = \{\{s_{t,k}\}_{k=1}^{K_t}\}_{t=1}^{T}$, where $s_{t,k}$ represents the language-based description of step $k$ for task $t$, and $K_t$ is the number of steps involved in the execution of task $t$. We view an instructional video as a sequence of segments $v = (v_1, \dots, v_M)$, with each segment consisting of $F$ RGB frames having spatial resolution $H \times W$, i.e., $v_i \in \mathbb{R}^{F \times H \times W \times 3}$. Each video is accompanied by a paired sequence of text sentences $a = (a_1, \dots, a_M)$ obtained by applying ASR to the audio narration. We note that the narration can be quite noisy due to ASR errors. Furthermore, it may describe the step being executed only implicitly, e.g., by referring to secondary aspects. An example is given in Fig. 1, where the ASR in the second segment describes the type of screws rather than the action of tightening the screws, while the last segment refers to the tone confirming that the air conditioner has been activated rather than the plugging of the cord into the outlet. The idea of our approach is to leverage the knowledge base to de-noise the narration and convert it into a supervisory signal that is more directly related to the steps shown in the video segments. We achieve this goal through the framework of distant supervision, which we apply to approximate the unknown conditional distribution $P(s_{t,k} \mid v_i)$ over the steps executed in the video, without any form of manual labeling. To approximate this distribution we employ a textual similarity measure between $s_{t,k}$ and $a_i$:

$$P(s_{t,k} \mid v_i) \approx P(s_{t,k} \mid a_i) = \frac{\exp\big(\mathrm{sim}(s_{t,k}, a_i)\big)}{\sum_{t',k'} \exp\big(\mathrm{sim}(s_{t',k'}, a_i)\big)}.$$
The textual similarity is computed as a dot product between language embeddings:

$$\mathrm{sim}(s_{t,k}, a_i) = e(s_{t,k})^{\top} e(a_i),$$

where $e(\cdot) \in \mathbb{R}^{D}$ and $D$ is the dimension of the language embedding space. The underlying intuition of our approach is that, compared to the noisy and unstructured narration $a_i$, the distribution $P(s_{t,k} \mid a_i)$ provides a more salient supervisory signal for training models to recognize individual steps of procedural activities in video. The last row of Fig. 1
shows the steps in the knowledge base with the highest conditional probability given the ASR text. We can see that, compared to the ASR narrations, the step sentences provide a more fitting description of the step executed in each segment. Our key insight is that we can leverage modern language models to map noisy and imprecise speech transcriptions onto the clean and informative step descriptions of our knowledge base. Beyond this qualitative illustration (plus additional ones available in the supplementary material), our experiments provide quantitative evidence of the benefits of training video models using this distilled step distribution as supervision, as opposed to the raw narration.
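The matching procedure above can be sketched as follows. This is a minimal illustration with toy 2-D vectors; in the actual system the embeddings would come from a sentence encoder such as MPNet (dimension 768), and the candidate set would span all step descriptions in the knowledge base.

```python
import numpy as np

def step_distribution(asr_emb, step_embs):
    """Approximate P(step | ASR sentence) via a softmax over dot-product
    similarities between the sentence embedding and all step embeddings."""
    sims = step_embs @ asr_emb          # (S,) dot-product similarities
    sims = sims - sims.max()            # subtract max for numerical stability
    probs = np.exp(sims)
    return probs / probs.sum()

# Toy example with 4 hypothetical wikiHow step embeddings.
steps = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
asr = np.array([0.9, 0.1])              # embedding of one narration sentence
p = step_distribution(asr, steps)       # target distribution for this segment
best = int(np.argmax(p))                # pseudo-label (most likely step)
```

The pseudo-label `best` then plays the role of the distantly supervised step annotation for the segment.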
3.2 Learning Step Embeddings from Unlabeled Video
We use the approximated distribution $P(s_{t,k} \mid a_i)$ as supervision to learn a video representation. We consider three different training objectives for learning this representation: (1) step classification, (2) distribution matching, and (3) step regression.
Step Classification. Under this learning objective, we first train a step classification model $f$ to classify each video segment into one of the $S$ possible steps in the knowledge base $\mathcal{B}$, where $S = \sum_{t=1}^{T} K_t$. Specifically, let $\hat{y}_i$ be the index of the step in $\mathcal{B}$ that best describes segment $v_i$ according to our target distribution, i.e.,

$$\hat{y}_i = \arg\max_{(t,k)} P(s_{t,k} \mid a_i).$$
Then, we use the standard cross-entropy loss to train $f$ to classify video segment $v_i$ into class $\hat{y}_i$:

$$\mathcal{L}(\theta) = -\log f_{\hat{y}_i}(v_i; \theta),$$

where $\theta$ denotes the learnable parameters of the video model. The model uses a softmax activation in the last layer to define a proper distribution over the steps, such that $\sum_{y} f_{y}(v_i; \theta) = 1$. Although here we show the loss for one segment only, in practice we optimize the objective by averaging over a mini-batch of video segments sampled from the entire collection in each iteration. After learning, we use $f$ as a feature extractor to capture step-level information from new video segments. Specifically, we use the second-to-last layer of $f$ (before the softmax) as the step embedding representation for classification of procedural activities in long videos.
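A minimal sketch of this objective (NumPy; the logits are illustrative stand-ins for the video model's output, and the pseudo-label index is hypothetical):

```python
import numpy as np

def cross_entropy(logits, target):
    """Cross-entropy against the pseudo-label: -log softmax(logits)[target]."""
    logits = logits - logits.max()                   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[target])

logits = np.array([2.0, 0.5, -1.0])   # stand-in model scores over 3 steps
loss = cross_entropy(logits, target=0)  # pseudo-label says step 0
```

Training then averages this loss over mini-batches of segments, as described above.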
Distribution Matching. Under this objective, we train the step classification model $f$ to minimize the KL-divergence between the predicted distribution $f(v_i; \theta)$ and the target distribution $P(\cdot \mid a_i)$:

$$\mathcal{L}(\theta) = D_{\mathrm{KL}}\big(P(\cdot \mid a_i) \,\|\, f(v_i; \theta)\big).$$
Due to the large step space, in order to optimize this objective effectively we empirically found it beneficial to use only the top-$k$ steps in $P(\cdot \mid a_i)$, with the probabilities of all other steps set to zero.
Step Regression. Under Step Regression, we train the video model to predict the language embedding associated with the pseudo ground-truth step $s_{\hat{y}_i}$. Thus, in this case the model is a regression function mapping to the language embedding space, i.e., $f(v_i; \theta) \in \mathbb{R}^{D}$. We follow [miech2020end] and use the NCE loss as the objective:

$$\mathcal{L}(\theta) = -\log \frac{\exp\big(f(v_i;\theta)^{\top} e(s_{\hat{y}_i})\big)}{\exp\big(f(v_i;\theta)^{\top} e(s_{\hat{y}_i})\big) + \sum_{s^{-} \in \mathcal{N}_i} \exp\big(f(v_i;\theta)^{\top} e(s^{-})\big)},$$

where $\mathcal{N}_i$ denotes a set of negative step descriptions sampled for segment $v_i$.
Because $f$ is trained to predict the language representation of the step, we can directly use its output as the step embedding representation for new video segments, i.e., $z_i = f(v_i; \theta)$.
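The NCE objective for step regression can be sketched as follows (toy 2-D embeddings; the negative set is a hypothetical sample):

```python
import numpy as np

def nce_loss(z, pos_emb, neg_embs):
    """NCE loss: pull the predicted embedding z toward the matched step's
    language embedding, push it away from sampled negative steps."""
    pos = np.exp(z @ pos_emb)
    neg = np.exp(neg_embs @ z).sum()
    return float(-np.log(pos / (pos + neg)))

pos = np.array([1.0, 0.0])                   # embedding of the matched step
negs = np.array([[0.0, 1.0], [-1.0, 0.0]])   # sampled negative steps
aligned = nce_loss(np.array([1.0, 0.0]), pos, negs)
misaligned = nce_loss(np.array([0.0, 1.0]), pos, negs)
```

The loss is lower when the prediction aligns with the matched step's embedding than when it aligns with a negative.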
3.3 Classification of Procedural Activities
In this subsection we discuss how we can leverage our learned step representation to recognize fine-grained procedural activities in long videos spanning up to several minutes. Let $v = (v_1, \dots, v_M)$ be a new input video consisting of a sequence of segments $v_i$ for $i = 1, \dots, M$. The intuition is that we can leverage our pretrained step representation to describe the video as a sequence of step embeddings $(z_1, \dots, z_M)$. Because our step embeddings are trained to reveal semantic information about the individual steps executed in the segments, we use a transformer [vaswani2017attention] to model dependencies over the steps and to classify the procedural activity from this sequence. Since our objective is to demonstrate the effectiveness of our step representation, we choose to include a single transformer layer, which is sufficient to model sequential dependencies among the steps and avoids making the classification model overly complex. We refer to this model as the “Basic Transformer.”
We also demonstrate that our step embeddings enable further beneficial information transfer from the knowledge base to improve the classification of procedural activities at inference time. The idea is to adopt a retrieval approach that finds, for each segment, the step that best explains the segment according to the pretrained video model $f$. For the case of Step Classification and Distribution Matching, where we learn a classification model, we simply select the step class yielding the maximum classification score:

$$\hat{y}_i = \arg\max_{y} f_{y}(v_i; \theta).$$
In the case of Step Regression, since $f$ generates an output in the language embedding space, we choose the step with maximum language embedding similarity:

$$\hat{y}_i = \arg\max_{(t,k)} e(s_{t,k})^{\top} f(v_i; \theta).$$
Let $\hat{s}_i$ denote the step description assigned to segment $v_i$ through this procedure, i.e., $\hat{s}_i = s_{\hat{y}_i}$.
Then, we can incorporate the knowledge retrieved from $\mathcal{B}$ for each segment into the input provided to the transformer, together with the step embeddings extracted from the video: each segment is represented by the concatenation $[z_i;\, e(\hat{s}_i)]$ of its video step embedding and the language embedding of its retrieved step.
This formulation effectively trains the transformer to fuse a representation consisting of video features and step embeddings from the knowledge base to predict the class of the procedural activity. We refer to this variant as “Transformer w/ KB Transfer”.
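The input construction for “Transformer w/ KB Transfer” can be sketched as follows (toy dimensions; the retrieval here follows the Step Regression variant, where the video model's output lives in the language embedding space):

```python
import numpy as np

def kb_transfer_inputs(video_feats, step_embs):
    """For each segment, retrieve the best-matching step by embedding
    similarity and concatenate its language embedding with the segment's
    own feature, forming the per-segment transformer input."""
    sims = video_feats @ step_embs.T     # (M segments, S steps) similarities
    matched = sims.argmax(axis=1)        # retrieved step index per segment
    return np.concatenate([video_feats, step_embs[matched]], axis=1), matched

feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # M=3 segment features
steps = np.array([[1.0, 0.1], [0.2, 1.0]])              # S=2 step embeddings
x, matched = kb_transfer_inputs(feats, steps)           # x: (3, 4) inputs
```

The resulting `x` rows (video feature plus retrieved step embedding) are what the single transformer layer consumes.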
3.4 Step Forecasting
We note that we can easily modify our proposed classification model to address forecasting tasks that require long-term analysis over a sequence of steps to predict future activity. One such problem is the task of “next-step anticipation,” which we consider in our experiments. Given as input a video spanning $M$ segments, $(v_1, \dots, v_M)$, the objective is to predict the step executed in the unobserved $(M{+}1)$-th segment. To address this task we train the transformer on the sequence of step embeddings extracted from the observed segments. In the case of Transformer w/ KB Transfer, for each input segment $v_i$ we include not only $e(\hat{s}_i)$ but also the embedding of the step immediately after $\hat{s}_i$ in the knowledge base. This effectively provides the transformer with information about the likely future steps according to the knowledge base.
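The KB-transfer input for forecasting can be sketched like this (NumPy; the matched step indices are given directly, and the fallback for a matched step with no successor is our assumption):

```python
import numpy as np

def forecasting_inputs(video_feats, step_embs, matched):
    """Append, for each observed segment, the embedding of the step that
    follows the matched one in the knowledge base (a hint at the likely
    future step); a matched final step falls back to itself (assumption)."""
    nxt = np.minimum(np.asarray(matched) + 1, len(step_embs) - 1)
    return np.concatenate([video_feats, step_embs[nxt]], axis=1)

feats = np.ones((2, 3))                  # M=2 observed segments, feature dim 3
steps = np.arange(8.0).reshape(4, 2)     # S=4 step embeddings, dim 2
x = forecasting_inputs(feats, steps, matched=[0, 3])
```

Each row thus pairs the observed segment's features with the successor-step embedding from the knowledge base.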
3.5 Implementation Details
Our implementation uses the wikiHow articles collected and processed by Koupaee and Wang [koupaee2018wikihow], where each article has been parsed into a title and a list of step descriptions. We use MPNet [mpnet] as the language model to extract 768-dimensional language embeddings for both the ASR sentences and the step descriptions. MPNet is currently ranked first by Sentence Transformers [sbert], based on performance across 14 language retrieval tasks [reimers-2019-sentence-bert]. The similarity between two embedding vectors is chosen to be their dot product. We use the steps collected from the wikiHow tasks used in the evaluation of Bertasius et al. [bertasius2021space]. This represents the subset of wikiHow tasks that have at least 100 video samples in the HowTo100M dataset. We note that the HowTo100M videos were collected from YouTube [youtube] by using the wikiHow titles as search keywords. Thus, each task of HowTo100M is represented in the wikiHow knowledge base, except for tasks that have since been deleted or revised.
We implement our video model using the code base of TimeSformer [bertasius2021space]
and we follow its training configuration for HowTo100M, unless otherwise specified. All methods and baselines based on TimeSformer start from a ViT configuration initialized with ImageNet-21K pretraining [dosovitskiy2020image]. Each segment consists of 8 frames uniformly sampled from a time-span of 8 seconds. For pretraining, we sample segments according to the ASR temporal boundaries available in HowTo100M: if the narrated time-span exceeds 8 seconds, we sample an 8-second window randomly within it; otherwise we take the 8-second window centered at its midpoint. For step classification, if the segment exceeds 8 seconds we take the middle 8-second clip; otherwise we use the given segment and sample 8 frames from it uniformly. For classification of procedural activities and step forecasting, we sample 8 uniformly-spaced segments from the input video. For egocentric video classification, we follow [arnab2021vivit]. Although we use TimeSformer as the backbone in our experiments, our framework is general and can be applied to other video segment models.
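The 8-second window selection described above can be sketched as follows (our reading of the rule; boundary handling such as clamping to the video duration is omitted):

```python
import random

def sample_window(start, end, span=8.0, rng=random):
    """Pick an 8 s sampling window for a narrated interval: random placement
    inside intervals longer than 8 s, otherwise the 8 s window centered on
    the interval's midpoint."""
    if end - start > span:
        s = rng.uniform(start, end - span)   # random window fully inside
    else:
        s = (start + end) / 2.0 - span / 2.0  # centered on the midpoint
    return s, s + span
```

For example, a 4-second narrated interval from 10 s to 14 s yields the centered window (8 s, 16 s), while a 100-second interval yields a random 8-second window inside it.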
The evaluations in our experiments are carried out by learning the step representation on HowTo100M (without manual labels) and by assessing the performance of our embeddings on smaller-scale downstream datasets where task and/or step manual annotations are available. To perform classification of multi-step activities on these downstream datasets we use a single transformer layer [vaswani2017attention] trained on top of our fixed embeddings. We use this shallow long-term model without finetuning in order to directly measure the value of the representation learned via distant supervision from the unlabeled instructional videos.
4 Experiments

4.1 Datasets and Evaluation Metrics
Pretraining. HowTo100M (HT100M) [miech2019howto100m] includes over 1M long instructional videos split into about 120M video clips in total. We use the complete HowTo100M dataset only in the final comparison with the state-of-the-art (sec. 4.3). In the ablations, in order to reduce the computational cost, we use a smaller subset corresponding to the collection of 80K long videos defined by Bertasius et al. [bertasius2021space].
Classification of Procedural Activities. Performance on this task is evaluated using two labeled datasets: COIN [tang2019coin, tang2020comprehensive] and Breakfast [breakfast]. COIN contains about 11K instructional videos representing 180 tasks (i.e., classes of procedural activities). Breakfast [breakfast] contains 1,712 videos for 10 complex cooking tasks. In both datasets, each video is manually annotated with a label denoting the task class. We use the standard splits [tang2020comprehensive, hussein2019timeception] for these two datasets and measure performance in terms of task classification accuracy.
Step Classification. This problem requires classifying the step observed in a single video segment (without history), which is a good testbed to evaluate the effectiveness of our step embeddings. To evaluate methods on this problem, we use the step annotations available in COIN, corresponding to a total of 778 step classes representing parts of tasks. The steps are manually annotated within each video with temporal boundaries and step class labels. Classification accuracy [tang2020comprehensive] is used as the metric.
Step Forecasting. For this task we also use the step annotations available in COIN. The objective is to predict the class of the step in the next segment, given as input the sequence of observed video segments up to (but excluding) that step. Note that there is a substantial temporal gap (21 seconds on average) between the end of the last observed segment and the start of the step to be predicted. This makes the problem quite challenging and representative of real-world conditions. We set the history to contain at least one step. We use the classification accuracy of the predicted step as the evaluation metric.
Egocentric Activity Recognition. EPIC-KITCHENS-100 [damen2020epic] is a large-scale egocentric video dataset. It consists of 100 hours of first-person videos, showing humans performing a wide range of procedural activities in the kitchen. The dataset includes manual annotations of 97 verbs and 300 nouns in manually-labeled video segments. We follow the standard protocol [damen2020epic] to train and evaluate our models.
| Segment Model | Pretraining Supervision | Pretraining Dataset | Linear Acc (%) |
|---|---|---|---|
| TSN (RGB+Flow) [tang2020comprehensive] | Supervised: action labels | Kinetics | 36.5* |
| S3D [miech2020end] | Unsupervised: MIL-NCE on ASR | HT100M | 37.5 |
| SlowFast [feichtenhofer2019slowfast] | Supervised: action labels | Kinetics | 32.9 |
| TimeSformer [bertasius2021space] | Supervised: action labels | Kinetics | 48.3 |
| TimeSformer [bertasius2021space] | Unsupervised: k-means on ASR | HT100M | 46.5 |
| TimeSformer | Unsupervised: distant supervision (ours) | HT100M | 54.1 |
| Long-term Model | Segment Model | Pretraining Supervision | Pretraining Dataset | Acc (%) |
|---|---|---|---|---|
| TSN (RGB+Flow) [tang2020comprehensive] | Inception [szegedy2016rethinking] | Supervised: action labels | Kinetics | 73.4* |
| Basic Transformer | S3D [miech2020end] | Unsupervised: MIL-NCE on ASR | HT100M | 70.2 |
| Basic Transformer | SlowFast [feichtenhofer2019slowfast] | Supervised: action labels | Kinetics | 71.6 |
| Basic Transformer | TimeSformer [bertasius2021space] | Supervised: action labels | Kinetics | 83.5 |
| Basic Transformer | TimeSformer [bertasius2021space] | Unsupervised: k-means on ASR | HT100M | 85.3 |
| Basic Transformer | TimeSformer | Unsupervised: distant supervision (ours) | HT100M | 88.9 |
| Transformer w/ KB Transfer | TimeSformer | Unsupervised: distant supervision (ours) | HT100M | 90.0 |
4.2 Ablation Studies
We begin by studying how different design choices in our framework affect the accuracy of task classification on COIN using the basic Transformer as our long-term model.
4.2.1 Different Training Objectives
Fig. 2 shows the accuracy of COIN task classification using the three distant supervision objectives presented in Sec. 3.2. Distribution Matching and Step Classification achieve similar performance, while Step Regression produces substantially lower accuracy. Based on these results, we choose Distribution Matching (Top-3) as the learning objective for all subsequent experiments.
4.2.2 Comparing Different Forms of Supervision
In Fig. 3, we compare the results of different pretrained video representations for the problem of classifying procedural activities on the COIN dataset. We include as baselines several representations learned on the same subset of HowTo100M as our step embeddings, using the same TimeSformer as the video model. MIL-NCE [miech2020end] performs contrastive learning between the video and the narration obtained from ASR. The baseline (HT100M, Task Classification) is a representation learned by training TimeSformer as a classifier using as classes the task ids available in HowTo100M. The task ids are automatically obtained from the keywords used to find the video on YouTube. The baseline (HT100M, Task Labels + Distant Superv.) uses the task ids to narrow down the potential steps considered by distant supervision (only wikiHow steps corresponding to the task id of the video are considered). We also include a representation obtained by training TimeSformer on the fully-supervised Kinetics-400 dataset [Kinetics]. Finally, to show the benefits of distant supervision, we run k-means clustering on the language embeddings of ASR sentences, using as many clusters as there are steps in wikiHow, and then train the video model using the cluster ids as supervision.
We observe several important results in Fig. 3. First, our distant supervision achieves a substantial accuracy gain over MIL-NCE trained on ASR. This suggests that our distant supervision framework provides more explicit supervision for learning step-level representations than directly using the ASR text. This is further confirmed by the performance of ASR Clustering, which is lower than that obtained by leveraging the wikiHow knowledge base.
Moreover, our step-level representation outperforms the weakly-supervised task embeddings (Task Classification) and does even better than the video representation learned with full supervision from the large-scale Kinetics dataset. This is due to the fact that steps typically involve multiple atomic actions. For example, a large fraction of the wikiHow steps contain at least two verbs. Thus, our step embeddings capture a higher-level representation than those based on traditional atomic action labels.
Finally, using the task ids to restrict the space of step labels considered by distant supervision produces the worst results. This indicates that the task ids are quite noisy and that our approach leveraging relevant steps from other tasks can provide more informative supervision. These results further confirm the superior performance of distantly supervised step annotations over existing task or action labels to train representations for classifying procedural activities.
| Long-term Model | Segment Model | Pretraining Supervision | Pretraining Dataset | Acc (%) |
|---|---|---|---|---|
| Timeception [hussein2019timeception] | 3D-ResNet [wang2017non] | Supervised: action labels | Kinetics | 71.3 |
| VideoGraph [hussein2019videograph] | I3D [Kinetics] | Supervised: action labels | Kinetics | 69.5 |
| GHRM [zhou2021graph] | I3D [Kinetics] | Supervised: action labels | Kinetics | 75.5 |
| Basic Transformer | S3D [miech2020end] | Unsupervised: MIL-NCE on ASR | HT100M | 74.4 |
| Basic Transformer | SlowFast [feichtenhofer2019slowfast] | Supervised: action labels | Kinetics | 76.1 |
| Basic Transformer | TimeSformer [bertasius2021space] | Supervised: action labels | Kinetics | 81.1 |
| Basic Transformer | TimeSformer [bertasius2021space] | Unsupervised: k-means on ASR | HT100M | 81.4 |
| Basic Transformer | TimeSformer | Unsupervised: distant supervision (ours) | HT100M | 88.7 |
| Transformer w/ KB Transfer | TimeSformer | Unsupervised: distant supervision (ours) | HT100M | 89.9 |
| Long-term Model | Segment Model | Pretraining Supervision | Pretraining Dataset | Acc (%) |
|---|---|---|---|---|
| Basic Transformer | S3D [miech2020end] | Unsupervised: MIL-NCE on ASR | HT100M | 28.1 |
| Basic Transformer | SlowFast [feichtenhofer2019slowfast] | Supervised: action labels | Kinetics | 25.6 |
| Basic Transformer | TimeSformer [bertasius2021space] | Supervised: action labels | Kinetics | 34.7 |
| Basic Transformer | TimeSformer [bertasius2021space] | Unsupervised: k-means on ASR | HT100M | 34.0 |
| Basic Transformer | TimeSformer | Unsupervised: distant supervision (ours) | HT100M | 38.2 |
| Transformer w/ KB Transfer | TimeSformer | Unsupervised: distant supervision (ours) | HT100M | 39.4 |
| Segment Model | Pretraining Supervision | Pretraining Dataset | Action (%) | Verb (%) | Noun (%) |
|---|---|---|---|---|---|
| TSM [lin2018temporal] | Supervised: action labels | Kinetics | 38.3 | 67.9 | 49.0 |
| SlowFast [feichtenhofer2019slowfast] | Supervised: action labels | Kinetics | 38.5 | 65.6 | 50.0 |
| ViViT-L [arnab2021vivit] | Supervised: action labels | Kinetics | 44.0 | 66.4 | 56.8 |
| TimeSformer [bertasius2021space] | Supervised: action labels | Kinetics | 42.3 | 66.6 | 54.4 |
| TimeSformer | Unsupervised: distant supervision (ours) | HT100M | 44.4 | 67.1 | 58.1 |
4.3 Comparisons to the State-of-the-Art
4.3.1 Step Classification
We study the problem of step classification as it directly measures whether the proposed distant supervision framework provides a useful training signal for recognizing steps in video. For this purpose, we use our distantly supervised model as a frozen feature extractor to extract step-level embeddings for each video segment and then train a linear classifier to recognize the step class in the input segment.
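The probing protocol above, frozen step-level embeddings followed by a linear classifier, can be sketched with a minimal NumPy softmax classifier. This is only an illustration on synthetic features, not the paper's actual training setup; the learning rate, epoch count, and dimensions are placeholder assumptions.

```python
import numpy as np

def train_linear_probe(feats, labels, num_classes, lr=0.1, epochs=200):
    """Train a linear (softmax) classifier on frozen segment embeddings.

    feats: (N, D) array of step-level embeddings from the frozen backbone.
    labels: (N,) integer step classes.
    """
    N, D = feats.shape
    W = np.zeros((D, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / N                  # cross-entropy gradient
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(W, b, feats):
    """Predict the step class of each segment embedding."""
    return (feats @ W + b).argmax(axis=1)
```

In the paper's setting, `feats` would be the embeddings extracted by the distantly supervised model for each video segment, with one class per step label.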
Table 1 shows that our distantly supervised representation achieves the best performance and yields a large gain over several strong baselines. Even on this task, our distant supervision produces better results than a video representation trained with fully-supervised action labels on Kinetics. The significant gain over ASR clustering again demonstrates the importance of using wikiHow knowledge. Finally, our model achieves strong gains over previously reported results on this benchmark based on different backbones, including results obtained by finetuning and by using optical flow as an additional modality [tang2020comprehensive].
4.3.2 Classification of Procedural Activities
Table 2 and Table 3 show the accuracy of classifying procedural activities in long videos on the COIN and Breakfast datasets, respectively. Our model outperforms all previous works on these two benchmarks. For this problem, the accuracy gain on COIN over the representations learned with Kinetics action labels is even larger than the improvement achieved for step classification. This indicates that the distantly supervised representation is indeed highly suitable for recognizing long procedural activities. We also observe a substantial gain over the Kinetics baseline for the problem of recognizing complex cooking activities in the Breakfast dataset. As GHRM also reported the result obtained by finetuning the feature extractor on the Breakfast benchmark (89.0%), we measured the accuracy achieved by finetuning our model and observed a large gain: 91.6%. We also tried replacing the basic transformer with Timeception as the long-term model. Timeception trained on our step embeddings achieves substantially higher accuracy than the same model trained on features learned with action labels from Kinetics. This large gain confirms the superiority of our representation for this task and suggests that our features can be effectively plugged into different long-term models.
4.3.3 Step Forecasting
Table 4 shows that our learned representation combined with a shallow transformer can be used to forecast the next step very effectively. Our representation outperforms the features learned with Kinetics action labels by a large margin. When the step order knowledge is leveraged by stacking the embeddings of the possible next steps, the gain improves further. This shows once more the benefits of incorporating information from the wikiHow knowledge base.
4.3.4 Egocentric Video Understanding
Recognition of activities in EPIC-KITCHENS-100 [damen2020epic] is a relevant testbed for our model since first-person videos in this dataset capture diverse procedural activities from daily human life. To demonstrate the generality of our distantly supervised approach, we finetune our pretrained model for the task of noun, verb, and action recognition in egocentric videos. For comparison purposes, we also include the results of finetuning the same model pretrained on Kinetics-400 with manually annotated action labels. Table 5 shows that the best results are obtained by finetuning our distantly supervised model. This provides further evidence about the transferability of our models to other tasks and datasets.
In this paper, we introduce a distant supervision framework that leverages a textual knowledge base (wikiHow) to effectively learn step-level video representations from instructional videos. We demonstrate the value of the representation on step classification, long procedural video classification, and step forecasting. We further show that our distantly supervised model generalizes well to egocentric video understanding.
Thanks to Karl Ridgeway, Michael Iuzzolino, Jue Wang, Noureldien Hussein, and Effrosyni Mavroudi for valuable discussions.
Appendix A Further Implementation Details
For our pretraining of TimeSformer on the whole set of HowTo100M videos, we use a configuration slightly different from that adopted in [bertasius2021space]. We use a batch size of 256 segments, distributed over 128 GPUs to accelerate the training process. The models are first trained for 15 epochs with the same optimization hyper-parameter settings as [bertasius2021space]. Then the models are trained with AdamW [loshchilov2018decoupled] for another 15 epochs, with an initial learning rate of 0.00005.
The basic transformer consists of a single transformer layer with 768 embedding dimensions and 12 heads. The step embeddings extracted with TimeSformer are augmented with learnable positional embeddings before being fed to the transformer layer.
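A minimal PyTorch sketch of this basic transformer is given below; the mean pooling and the linear classification head are our assumptions, since the text specifies only the single transformer layer (768 embedding dimensions, 12 heads) and the learnable positional embeddings.

```python
import torch
import torch.nn as nn

class BasicTransformer(nn.Module):
    """Single transformer layer over frozen step embeddings (a sketch:
    pooling and classifier head are assumed, not taken from the paper)."""

    def __init__(self, num_classes, max_segments=64, dim=768, heads=12):
        super().__init__()
        # Learnable positional embeddings added to the step embeddings.
        self.pos = nn.Parameter(torch.zeros(1, max_segments, dim))
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):  # x: (batch, num_segments, dim)
        x = self.layer(x + self.pos[:, : x.size(1)])
        return self.head(x.mean(dim=1))  # pool over segments, then classify
```

Here `x` would hold the step embeddings extracted by the frozen TimeSformer for each segment of the long video.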
For the downstream tasks of procedural activity recognition, step classification, and step anticipation, we train the transformer layer on top of the frozen step embedding representation for 75K iterations, starting with a learning rate of 0.005. The learning rate is scaled by 0.1 after 55K and 70K iterations, respectively. The optimizer is SGD. We ensemble predictions from 4, 3, and 4 temporal clips sampled from the input video for the three tasks, respectively.
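The optimization schedule just described maps directly onto a standard PyTorch multi-step scheduler; the sketch below uses a dummy parameter in place of the real classifier weights.

```python
import torch

# Dummy parameter standing in for the transformer classifier's weights.
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=0.005)
# Scale the learning rate by 0.1 after 55K and 70K of the 75K iterations.
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt, milestones=[55_000, 70_000], gamma=0.1)

for it in range(75_000):
    # forward pass, loss computation, and loss.backward() would go here
    opt.step()
    sched.step()
```

After the full run, the learning rate has been decayed twice, ending at 0.005 × 0.1 × 0.1 = 0.00005.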
For egocentric video classification, we adopt the training configuration from [arnab2021vivit], except that we sample 32 frames as input with a frame rate of 2 fps to cover a longer temporal span of 16 seconds.
Appendix B Classification Results with Different Number of Transformer Layers
In the main paper, we presented results for recognition of procedural activities using as classification model a single-layer Transformer trained on top of the video representation learned with our distant supervision framework. In Table 6 we study the potential benefits of additional Transformer layers. We can see that additional Transformer layers in the classifier do not yield significant gains in accuracy. This suggests that our representation enables accurate classification of complex activities with a simple model and does not require additional nonlinear layers to achieve strong recognition performance. We also show the results without any transformer layers, obtained by training a linear classifier on the average-pooled or concatenated features from the pretrained TimeSformer. This yields substantially lower results than using transformer layers for temporal modeling, which indicates that our step-level representation enables powerful temporal reasoning even with a simple model.
Appendix C Representation Learning with Different Video Backbones
Although the experiments in our paper were presented for the case of TimeSformer as the video backbone, our distant supervision framework is general and can be applied to any video architecture. To demonstrate this generality, in this supplementary material we report results obtained with another recently proposed video model, ST-SWIN [wang2021long], using ImageNet-1K pretraining as initialization. We first train the model on HowTo100M using our distant supervision strategy and then evaluate the learned (frozen) representation on the tasks of step classification and procedural activity classification in the COIN dataset. Table 7 and Table 8 show the results for these two tasks. We also include results achieved with a video representation trained with full supervision on Kinetics as well as with video embeddings learned by k-means on ASR text. As we have already shown for the case of TimeSformer in the main paper, even with the ST-SWIN video backbone our distant supervision provides the best accuracy on both benchmarks, outperforming the Kinetics and the k-means baselines by substantial margins. This confirms that our distant supervision framework can work effectively with different video architectures.
| # Transformer Layers | Acc (%) of Basic Transformer | Acc (%) of Transformer w/ KB Transfer |
|---|---|---|
| 0 (Avg Pool) | 81.0 | n/a |
| Segment Model | Pretraining Supervision | Pretraining Dataset | Acc (%) |
|---|---|---|---|
| ST-SWIN | Supervised: action labels | Kinetics | 44.0 |
| ST-SWIN | Unsupervised: k-means on ASR | HT100M | 44.8 |
| ST-SWIN | Unsupervised: distant supervision (ours) | HT100M | 50.3 |
| Long-term Model | Segment Model | Pretraining Supervision | Pretraining Dataset | Acc (%) |
|---|---|---|---|---|
| Basic Transformer | ST-SWIN | Supervised: action labels | Kinetics | 79.6 |
| Basic Transformer | ST-SWIN | Unsupervised: k-means on ASR | HT100M | 82.4 |
| Basic Transformer | ST-SWIN | Unsupervised: distant supervision (ours) | HT100M | 88.3 |
Appendix D Action Segmentation Results on COIN
In the main paper, we use step classification on COIN as one of the downstream tasks to directly measure the quality of the learned step-level representations. We note that some prior works [ActBERT, miech2020end] used the step annotations in COIN to evaluate pretrained models for action segmentation. This task entails densely predicting action labels at each frame, with frame-level accuracy as the evaluation metric. We argue that step classification is a more relevant task for our purpose, since we are interested in understanding the representational power of our features as step descriptors. Nevertheless, in order to compare to prior works, here we present results of using our step embeddings for action segmentation on COIN. Following previous work [ActBERT, miech2020end], we sample adjacent non-overlapping 1-second segments from the long video as input to our model. We use our model pretrained on HowTo100M as a fixed feature extractor to obtain a representation for each of these segments. A linear classifier is then trained to classify each segment into one of the 779 classes (778 steps plus the background class). With the same classification model, our method achieves a substantially higher frame accuracy than the representation learned with full supervision using action labels on Kinetics; we also compare against the accuracies reported in [ActBERT, miech2020end].
Appendix E More Qualitative Results and Discussion
Visualization of Distant Supervision. In Figure 4 we provide visualizations of steps assigned by our distant supervision method for three video examples. We can observe that the matched step descriptions capture high-level semantics about actions and objects, which are often missed by the narrations. An example is given in Figure 4(a), where the narration in the last segment (“really really hot water you can use”) does not correspond to an object or an action directly recognizable in the segment. The language model assigns this narration to a more expressive step description (Open the hot water faucet in your sink or tub). Figure 4(b) shows that the assigned steps capture higher-level information compared to traditional atomic actions. For example, a video segment of pouring oil into a heated pan is matched to Prepare to fry tortillas.
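The step-assignment mechanism visualized here matches each ASR sentence to its nearest wikiHow step description in a text embedding space. The sketch below substitutes a toy bag-of-words embedding for the paper's pretrained language model, so it only illustrates the nearest-neighbor assignment logic, not the actual model.

```python
import numpy as np

def embed(text, vocab):
    """Toy bag-of-words embedding (a stand-in for the pretrained
    language model used by the distant supervision framework)."""
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def assign_step(narration, step_descriptions, vocab):
    """Assign an ASR narration to the most similar wikiHow step."""
    q = embed(narration, vocab)
    sims = [q @ embed(s, vocab) for s in step_descriptions]
    return int(np.argmax(sims))
```

In the actual framework the similarity is computed between sentence-level embeddings, so the noisy narration can still be mapped to a semantically expressive step description.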
Visualization of Step Classes. In order to better understand the variety of video segments that are grouped under a given step, in Figure 5 we show three video clips assigned to three given steps. We can observe that our method can successfully group together video segments that are coherent in terms of the demonstrated step. Note that, at the same time, the segments assigned to a given step exhibit large variations in terms of appearance (e.g., color, viewpoint, object instances). Because our model assigns segments to step descriptions purely based on language information, it is insensitive to these large appearance variations. This invariance is then transferred to the video model: by using these distantly supervised step classes as supervision for the video model, our method trains the video representation to be invariant to these appearance variations and to capture the higher-level semantics represented in each step class.
Limitations and Failure Cases. Our approach may fail in assigning the correct step to a segment due to errors caused by the language model and due to excessive noise or ambiguity in the ASR sentence. Furthermore, the step description may refer to actions or objects not represented in the segment. For example, in Figure 4(c) the assigned step Put in the oven and bake until the cheese is melted provides an accurate semantic description for the segments, but it refers to an object (“oven”) that is not shown in the video frames. On one hand, this visual misalignment may render the training difficult; on the other hand, it may still be beneficial, since it forces the model to use contextual information (e.g., visible objects that tend to co-occur with “oven”, such as the bakeware objects appearing in the frames) to recognize the high-level semantics of the steps. Another potential limitation is the temporal misalignment between speech and visual content. However, this problem can be reduced by expanding the temporal span of the ASR text to increase the probability of including the relevant text for the given video segment, or by adopting a multiple-instance learning scheme [miech2020end] to find the correct temporal alignment between ASR text sentences and video segments.
Complexity of Steps. In our experiments we demonstrated that, on the downstream problems of step and task classification, our distantly-supervised video representation outperforms video descriptors trained with full supervision on traditional action classes. We hypothesize that this is because each step typically consists of multiple actions performed in sequence, unlike traditional action classes, which typically encode a single atomic action (e.g., “drinking”, “jumping”, “punching”). To assess this hypothesis, we analyzed the number of verbs returned by the POS tagger [nltk] for each wikiHow step description as a measure of the complexity of the step. Figure 6 shows the distribution of the number of verbs. The average and the median number of verbs in step descriptions are 10.1 and 8.0, respectively. Furthermore, the large majority of the steps contain at least two verbs. This indeed suggests that steps tend to have a higher level of complexity than traditional atomic actions.
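The verb-counting analysis can be illustrated as follows; the tiny hand-made verb lexicon is a stand-in for the NLTK POS tagger used in the paper, not a faithful reproduction of it.

```python
# Simplified stand-in for a POS tagger: a small hand-made verb lexicon
# (an illustrative assumption, not the actual NLTK tagger).
VERBS = {"beat", "pour", "heat", "stir", "add", "bake", "mix", "open"}

def count_verbs(step_description):
    """Count verbs in a step description as a proxy for its complexity."""
    return sum(1 for w in step_description.lower().split() if w in VERBS)

def fraction_multi_verb(steps):
    """Fraction of step descriptions containing at least two verbs."""
    return sum(count_verbs(s) >= 2 for s in steps) / len(steps)
```

With the actual tagger, one would instead count tokens whose POS tag starts with "VB" in each wikiHow step description.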
Appendix F Further Details about Step Forecasting
We follow the training/validation split in the COIN dataset to train and evaluate our models for step forecasting. By constraining the observed history to contain at least one step, we construct a training set of 22,037 samples and a validation set of 6,721 samples. Figure 7 shows the distribution of the gaps (in seconds) separating the history from the step to predict. The average and the median gap are 21 seconds and 14 seconds, respectively. Thus, the forecasting gaps in this benchmark are substantially longer than those used in other action anticipation tasks [ryoo2011human, hoai2014max, Gammulle_2019_ICCV]. This makes the benchmark particularly challenging, as the model is asked to predict the step of segments far in the future relative to the observed history.
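The sample construction described above, where every prefix with at least one observed step becomes a history and the following step becomes the target, can be sketched as:

```python
def forecasting_samples(segments, min_history=1):
    """Turn a video's ordered step segments into (history, target)
    pairs, requiring at least `min_history` observed steps."""
    return [(segments[:i], segments[i])
            for i in range(min_history, len(segments))]
```

Applying this prefix construction to every video in the COIN training and validation splits produces the forecasting samples described above.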