Learning Video Models from Text: Zero-Shot Anticipation for Procedural Actions

Can we teach a robot to recognize and make predictions for activities that it has never seen before? We tackle this problem by learning models for video from text. This paper presents a hierarchical model that generalizes instructional knowledge from large-scale text-corpora and transfers the knowledge to video. Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language. To demonstrate the capabilities of our model, we introduce the Tasty Videos Dataset V2, a collection of 4022 recipes for zero-shot learning, recognition and anticipation. Extensive experiments with various evaluation metrics demonstrate the potential of our method for generalization, given limited video data for training models.




1 Introduction

Imagine a not-so-distant future where robot chefs service our kitchens. How can we learn and embody cooking as a general skill? Perhaps by reading all the recipes on the web? Or by watching all the cooking videos on YouTube? Learning and generalizing from a set of instructions, be it in text, image, or video form, is a highly challenging and open problem faced by those working in computer vision, natural language understanding and robotics.

In this work, we limit our scope of training the next ‘robochef’ to predicting subsequent steps as it watches a human cook a never-before-seen dish. We frame the problem as action recognition and anticipation in a zero- or few-shot learning scenario. In addition to recognition, it is important for the robot to anticipate what happens in the future to ensure a safe and smooth collaborative experience with the human [koppula2015anticipating, wu2016watch]. We also place importance on zero-shot or few-shot learning, as it likely reflects how service robots will be introduced to the home [Chelsea04905, sunderhauf2018limits]. Models (and robots) can be pre-trained extensively, but they will likely be deployed in never-before-encountered scenarios. Successfully anticipating the next steps of a never-before-seen dish requires leveraging and generalizing from previously learned procedural knowledge.

Instructional data, especially cooking recipes, can be found readily on the web [tang2019coin, Wikihow, zhukov2019cross]. The richest forms are multimodal, e.g., images plus text or videos with narrations. Such data could be used to build automated systems that learn visual models from videos, enabling virtual assistants or service robots to learn new skills. However, learning complex multi-step procedures requires significant amounts of data. Despite the abundance of instructional data online, it is still difficult to find sufficient examples in multimodal form. Furthermore, learning the steps’ visual appearance would require temporally aligned data, which is less common and/or expensive to obtain.

Fig. 1: An overview of our model. We first learn procedural knowledge from large text corpora and transfer it to the visual domain to anticipate the future. Our system comprises four RNNs: a sentence encoder, a sentence decoder, a video encoder, and a recipe network.

Our strategy is to separate the procedural learning from the visual perception problem. We learn procedural knowledge from text corpora, which are readily available and large, on the scale of millions [salvador2017learning, malmaud2014cooking, miech2019howto100m]. Knowledge from text is then transferred to video so that visual perception is simplified to a grounding task done via aligned video and text (Fig. 1). More specifically, we encode text and/or video in a multimodal embedding space. The context vectors, derived either from video or text, are fed to a recipe network that models the recipe’s sequential structure and predicts the following steps. Figure 1 shows an overview.

The use of text and language information to help train video models is not new. Several works use text from accompanying narrations [zhukov2019cross], recipes [malmaud2014cooking] or film scripts [zhu2015aligning] as weak boundaries, and several learn joint text-video models [miech2019howto100m, sun2019videobert]. Our work is similar in spirit in that we also want to leverage these auxiliary sources of information to reduce the labelling effort. However, the previous works mainly focus on using text to minimize the labelling effort for large-scale video datasets. In contrast, our work learns entire models from text and then transfers them from the natural language domain to the visual domain. To the best of our knowledge, we are the first to learn and transfer a cross-domain model.

Our work breaks new ground in procedural activity understanding in two ways. First and foremost, we anticipate upcoming actions under a zero-shot setting, as we aim to make predictions for never-before-seen dishes. We achieve this by generalizing cooking knowledge from large-scale text corpora and then transferring the knowledge to the visual domain. This approach relieves us of the burden and impracticality of providing annotations for a virtually unlimited number of categories (dishes) and sub-categories (instructional steps). Our work is the first to tackle the problem of procedural activity understanding in this form; prior works in recognition are severely limited in the number of categories and steps [alayrac2016unsupervised, kuehne2014language, rohrbach2012database], while works in anticipation rely on strong supervision [abu2018will, lan2014hierarchical, zhou2015temporal].

Our work’s second novelty is that we do not work with a closed set of labels derived from word tags. Instead, we train with and also predict full sentences, e.g., ‘Cook the chicken wing until both sides are golden brown.’ vs. ‘cook chicken’. This design choice makes the problem more challenging but also brings several advantages. First, it adds qualifiers and richness to the instruction since natural language conveys much more information than simple text labels [lin2015generating, zhou2018towards]. Secondly, it allows for anticipation of not only actions but also objects and attributes. Finally, as a byproduct, it facilitates data collection, as the number of class-based annotations grows exponentially with the number of actions, objects, and attributes, leading to very long-tailed distributions [Damen2018EPICKITCHENS].

When transferring procedural knowledge from text recipes to videos, we need to ground the text domain with the video and vice versa. This requires video with temporally aligned captions; to the best of our knowledge, YouCookII [zhou2018towards] is the only dataset with such labels. However, YouCookII lacks diversity in the number of dishes (89 dishes) and therefore the number of possible recipe steps. As such, we collect and present our new Tasty Videos Dataset V2, a diverse set of 4022 different cooking recipes (collected from the website https://tasty.co/), each accompanied by a video, an ingredient list, and temporally aligned recipe steps. Video footage is taken from a fixed bird's-eye view and focuses almost exclusively on the cooking instructions, making it well-suited for procedural understanding.

Our main contributions are summarized as follows:

  • We are the first to explore action anticipation in a zero-shot setting by generalizing knowledge from text-corpora and transferring it to the visual domain.

  • We propose a modular hierarchical model for learning multi-step procedures with text and visual context. Our model generalizes cooking knowledge and predicts coherent and plausible instructions for multiple steps into the future. The rich natural language predictions score higher in NLP metrics than state-of-the-art video captioning methods applied directly to the (future) video.

  • We present a new and highly diverse dataset of cooking recipes. The dataset is publicly available (Tasty Videos Dataset, https://cvml.comp.nus.edu.sg/tasty) and will be of interest to those working in procedural video understanding, action recognition and anticipation, as well as other multimodal research in video and text.

A preliminary version of this paper was published in [sener2019zero]. The current paper extends the previous work’s model by integrating a temporal segment proposal method into the video encoder and adding losses at the recipe encoder to improve convergence. We add experiments comparing against recipe generation networks [salvador2019inverse] and verify that our hierarchical architecture generalizes better to previously unseen dishes or recipes. Finally, we extend the dataset from 2511 to 4022 videos.

2 Related Works

2.1 Procedural Activity Modelling

Understanding procedural activities and their sub-activities has typically been addressed as a supervised temporal video segmentation and recognition problem [rohrbach2012database, kuehne2014language, richard2016temporal]. A variety of models have been used, including conditional random fields [hoai2011joint], hidden Markov models [lea2016segmental], RNNs [singh2016multi] and, more recently, temporal convolutional networks [lea2017temporal, farha2019ms]. Data- and label-wise, these methods require video sequences in which every frame is labelled exhaustively, making it difficult to work at a large scale.

Alternative lines of work are either weakly supervised, using cues from accompanying narrations [alayrac2016unsupervised, malmaud2015s, sener2015unsupervised] or sub-activity orderings [huang2016connectionist, richard2017weakly, chang2019d3tw], or are fully unsupervised [sener2018unsupervised, kukleva2019unsupervised]. Our work is similar to those using text cues; however, we do not rely on aligned video and text modalities for learning the activity models [alayrac2016unsupervised, sener2015unsupervised]. The assumption that videos are accompanied by temporally aligned narrations does not always hold for instructional videos, as it is far more natural for people to talk about an action before or after performing it, or to talk about an alternative action or object. Instead, we use a large corpus of unlabeled data in the text domain and use only a very small set of aligned data for grounding the visual evidence.

2.2 Action Anticipation

Action anticipation is the forecasting of not-yet-observed actions. Let $\tau_a$ be the ‘anticipation time’, i.e., how many seconds in advance to anticipate the next action. Then the task of action anticipation is to predict the upcoming action $\tau_a$ seconds before it starts. In many recent works, $\tau_a$ is set to 1 second [miech2019leveraging, furnari2019rulstm], but it can vary between zero [lan2014hierarchical] and several seconds [koppula2015anticipating].

Early works in forecasting activities have been limited to simple movement primitives such as reaching and placing [koppula2015anticipating] or personal interactions like hand-shaking and hugging [lan2014hierarchical, vondrick2016anticipating]. More recent works anticipate up to thousands of action classes [Damen2018EPICKITCHENS] by defining actions as the composition of a verb and a noun. However, this does not scale as the number of actions grows, leading to a long-tail distribution. For example, in the new Epic-100 Dataset [Damen2020RESCALING], 92% of the actions are tail classes. Methods for anticipation include RNNs for encoding the observations [mahmud2017joint, furnari2019rulstm], predicting future features and performing future action classification on these [gao2017red, zolfaghari2019learning], employing knowledge distillation [tran2019back], and transferring knowledge from other sources such as word embeddings [camporese2020knowledge] or visual attribute classifiers [miech2019leveraging]. Moreover, recent interest in egocentric vision has started a new line of approaches dedicated to such recordings using gaze [shen2018egocentric] or hand-object interaction regions [liu2019forecasting, dessalene2020egocentric].

Dense anticipation extends action anticipation to forecast multiple actions into the future. Examples include [abu2018will, Ke_2019_CVPR], who propose two-stage methods that first perform action segmentation of the observed sequence before using the frame-wise labels as input for anticipation. The work of [sener2020temporal] bypasses the initial segmentation and performs dense anticipation in a single stage directly using a temporal aggregation framework. Our proposed approach can also predict actions multiple steps into the future, but unlike these methods, we do not work in a fully supervised framework. Furthermore, we do not require repetitions of activity sequences for training. Moreover, we are the first to predict the future in the form of sentences instead of category-based predictions as in [abu2018will, Ke_2019_CVPR, sener2020temporal]. Similar to our work, the recent work [mahmud2020captioning] also predicts sentences for subsequent actions by extending their anticipation framework [mahmud2017joint].

2.3 Zero- and Few-Shot Learning in Video

Zero- and few-shot learning is more popular in the image domain, and we refer interested readers to two recent surveys [wang2020generalizing, wang2019survey]. Extensions to the video domain have been less explored. Early works on zero-shot learning on videos rely on attribute-oriented feature representations, which are then used to categorize the unseen videos [gan2016learning]. More recent works train temporal models that map video features to a semantic embedding space of categorical labels [zhu2018towards, hahn2019action2vec, brattoli2020rethinking] or sentence representations [zhang2018cross]. In a similar spirit to the embedding-based approaches, we ground the video to text by mapping video representations to the semantic space of step-wise instructions. Our approach extends the research on zero-shot recognition of simple actions to the complex multi-step activities of procedural cooking videos. ‘Zero-shot’ in our case refers to making predictions for previously unseen recipes.

2.4 Modelling Instructional Text in NLP

Cooking is a popular domain in NLP research since recipes are rich in natural language yet reasonably limited in scope. Cooking recipes are employed in tasks such as food recognition [Herranzfood], recommender systems [Mindel2017] and indexing and retrieval [carvalho2018cross, salvador2017learning]. Modelling the procedural aspects of the text and generating coherent recipes date back several decades [hammond1986chef, DaleRobert88]. Early works focus on parsing the recipes to extract verbs and ingredients [tenorth2010understanding, malmaud2014cooking, kiddon2015mise, jermsurawong2015predicting]. For example, [tenorth2010understanding] generate plans from textual instructions, [kiddon2015mise] map recipes to action graphs and [beetz2011robotic] use parsed instructions to make robots cook pancakes.

More recently, neural network-based solutions have become popular, and they especially target improving the generated recipes’ coherence. For example, [kiddon2016globally] train an encoder-decoder [sutskever2014sequence] with a checklist mechanism to keep track of the ingredients given as input. [bosselut2018discourse] propose a reinforcement learning-based solution with discourse-aware rewards to encourage generating instructions in the correct order. [h2020recipegpt] produce personalized recipes by fusing users’ previously consumed recipes with an attention mechanism.

2.5 Cooking Domain in Vision

In vision, cooking has been explored for procedural and fine-grained activity recognition [Damen2018EPICKITCHENS, kuehne2014language, rohrbach2012database, zhou2018towards], temporal segmentation [kuehne2014language, zhou2018towards], video-text alignment [malmaud2015s, lin2020recipe] and captioning [rohrbach2013translating, regneri2013grounding, zhou2018end]. There are several cooking and kitchen datasets [Damen2018EPICKITCHENS, malmaud2015s, sener2015unsupervised, kuehne2014language, zhou2018towards]; What’s Cooking [malmaud2015s] and YouCookII [zhou2018towards] are the most similar to us, featuring videos and accompanying recipe texts. YouCookII, however, has limited diversity with only 89 dishes; What’s Cooking is large scale (180K recipes) but lacks temporal alignments between the video and recipe texts.

Some recent methods investigate learning image-text embeddings for image-based recipe retrieval. For example, [salvador2017learning] learn a joint embedding space of the recipes encoded with skip-thought vectors [sThought15] and associated food images using a pairwise ranking loss. This baseline is extended by learning with a triplet loss [carvalho2018cross] and hard sample mining [wang2019learning]. Instead of retrieval, [zhu2020cookgan] recently propose an image generation method from recipe text using an instruction encoder.

An alternative application is to generate recipes from images. [salvador2019inverse] predict the ingredients of food images and use the ingredients as input to a transformer-based decoder. However, [salvador2019inverse] generate an entire recipe as one continuous text block, so recipes can only be as long as the allowed maximum length of the decoder (150 words). [wang2020decomposed] split recipes into several chunks and predict the instructions for each chunk guided by position encoders.

Fig. 2: Our system is composed of four RNNs: a sentence encoder and a decoder, a video encoder, and a recipe RNN. Given the ingredients as initial input and context in either text or visual form, the recipe RNN recurrently predicts future steps. The sentence decoder converts predicted future steps back into natural language. We continue predicting future steps by repeatedly feeding the next steps encoded by the sentence or video encoder.

3 Modelling Sequential Instructions

Sequence-to-sequence learning [sutskever2014sequence] has made it possible to successfully generate continuous text and build dialogue systems [cho2014learning, vinyals2015neural]. Recurrent neural networks are used to learn rich representations of sentences [hill2016learning, Ba2016LayerN, sThought15] in an unsupervised manner, using the extensive amount of text that exists in book and web corpora. Examples include skip-thought vectors [sThought15] and FastSent [hill2016learning], both of which are effective for generation tasks. However, for instructional text, such as cooking recipes, such representations do not fully capture the underlying sequential nature of the instruction set, and generations are not always coherent from one step to the next. As such, we propose a hierarchical model and dedicate two RNNs to represent the sentences and the steps of the recipe individually: the sentence encoder and the recipe RNN, respectively. A third RNN decodes predicted recipe steps back into sentence form for human-interpretable results (sentence decoder). These three RNNs are learned jointly as an auto-encoder in an initial training step. A fourth RNN encoding visual evidence (video encoder) is then learned in a subsequent step to replace the sentence encoder to enable interpretation and future prediction from video data. An overview is shown in Figure 2, while details of the RNNs are given in Sections 3.1 to 3.3.

3.1 Sentence Encoder and Decoder

The sentence encoder produces a fixed-length vector representation of each textual recipe step. We use a bi-directional LSTM, but rather than representing a sentence by the last step’s hidden vector, we apply a (temporal) max-pooling over each dimension of the hidden units. This type of architecture and, in particular, the temporal pooling has been shown to be successful in sentence encoding [conneauEMNLP2017]. More formally, let sentence $s_j$ from step $j$ of a recipe (we assume each step is one sentence) be represented by $M_j$ words, i.e., $s_j = \{w_j^t\}_{t=1,\dots,M_j}$, and let $x_j^t$ be the word embedding of word $w_j^t$. For each sentence $s_j$, at each (word) step $t$, the bi-directional LSTM based sentence encoder, SE, outputs $y_j^t$:

$$y_j^t = \mathrm{SE}\big(\{x_j^1,\dots,x_j^{M_j}\}\big) = \Big[\overrightarrow{\mathrm{LSTM}}\big(x_j^1,\dots,x_j^t\big),\ \overleftarrow{\mathrm{LSTM}}\big(x_j^{M_j},\dots,x_j^t\big)\Big], \tag{1}$$

which is a concatenation of the hidden states from the forward and backward pass of the LSTM. The overall sentence representation $r_j$ is determined by a dimension-independent max-pooling over the time steps, i.e.,

$$(r_j)_d = \max_{t \in \{1,\dots,M_j\}} (y_j^t)_d, \tag{2}$$

where $(\cdot)_d$, $d \in \{1,\dots,D\}$, indicates the $d$-th element of the $D$-dimensional bi-directional LSTM outputs $y_j^t$.

The sentence decoder, SD, is an LSTM-based neural language model that converts the fixed-length representation of the steps back into human-interpretable sentences. More specifically, given the vector prediction $\hat{r}_j$ from the recipe RNN of step $j$, it decodes the sentence $\hat{s}_j$:

$$\hat{s}_j = \mathrm{SD}(\hat{r}_j).$$

3.2 Recipe RNN

We model the sequential ordering of recipe steps with a recipe encoder (RE), an LSTM that takes as input $\{r_j\}_{j=1,\dots,N}$, i.e., fixed-length representations of the steps of a recipe with $N$ steps, where $j$ indicates the step index. At each recipe step, the hidden state $h_j$ of the RE can be considered a fixed-length representation of all recipe steps seen up to step $j$; we directly use this hidden state vector as a prediction of the sentence representation for step $j+1$, i.e.,

$$\hat{r}_{j+1} = h_j = \mathrm{RE}\big(\{r_0,\dots,r_j\}\big).$$

The hidden state of the last step, $h_N$, can be considered as a representation of the entire recipe. Due to the standard recursion of the hidden states in an LSTM, each hidden state vector and, therefore, each future step prediction is conditioned on the previous steps. This allows predicting recipe steps that are plausible and coherent with respect to previous steps.

Recipes usually include an ingredient list, a rich source of information that can also serve as a strong modelling cue [kiddon2016globally, salvador2017learning, salvador2019inverse]. To incorporate the ingredients, we form an ingredient vector $I$ for each recipe in the form of a one-hot encoding over a vocabulary of ingredients. $I$ is then transformed with a separate fully connected layer $f$ in the recipe RNN to serve as the initial input, i.e., $r_0 = f(I)$.
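As a rough illustration of the ingredient encoding, the sketch below builds a binary vector over a small ingredient vocabulary. It assumes that the "one-hot encoding" of a recipe sets one entry per listed ingredient, and that ingredients outside the vocabulary are skipped; both are assumptions, as are the names `ingredient_vector` and `toy_vocab`. The fully connected layer that follows is omitted.

```python
from typing import Dict, Iterable, List


def ingredient_vector(ingredients: Iterable[str], vocab: Dict[str, int]) -> List[float]:
    """Binary bag-of-ingredients vector over a fixed vocabulary.

    Ingredients that are not in the vocabulary are ignored (an assumption).
    """
    vec = [0.0] * len(vocab)
    for name in ingredients:
        idx = vocab.get(name.strip().lower())
        if idx is not None:
            vec[idx] = 1.0
    return vec


# Hypothetical five-word vocabulary; the real one comes from Recipe1M.
toy_vocab = {"bacon": 0, "brown sugar": 1, "butter": 2, "breadsticks": 3, "cooking spray": 4}
recipe = ["Bacon", "brown sugar", "cooking spray", "breadsticks", "saffron"]
print(ingredient_vector(recipe, toy_vocab))  # [1.0, 1.0, 0.0, 1.0, 1.0]
```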

3.3 Video Encoder

For inference, we would like the recipe RNN to interpret sentences from text and visual inputs. The modular nature of our model allows us to conveniently replace the sentence encoder with an analogous video encoder, VE. Suppose the video segment $c_j$ is composed of $L_j$ frames, i.e., $c_j = \{f_j^t\}_{t=1,\dots,L_j}$ (we overload the word index $t$ from Eqs. 1 and 2 to also denote the frame index, as the two are directly analogous in our encoders). Each frame $f_j^t$ is represented as a high-level CNN feature vector; we use the last fully connected layer output of ResNet-50 [he2016deep] before the softmax layer. Similar to the sentence encoding, $r_j$, in Eqs. 1 and 2, we determine the video encoding vector, $v_j$, by applying a temporal max pooling over each dimension of the video segment representations, $y_j^t$:

$$y_j^t = \mathrm{VE}\big(\{f_j^1,\dots,f_j^{L_j}\}\big), \qquad (v_j)_d = \max_{t \in \{1,\dots,L_j\}} (y_j^t)_d.$$

The video encoder, VE, is trained such that $v_j$ can directly replace $r_j$. The inputs to our video encoder are frames from the video segments that correspond to individual recipe steps. We train our method on videos using ground truth segments for the video encoder. For testing, we use temporal segments either based on fixed temporal windows or from predicted segment proposals [zhou2018towards].
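The fixed-window option for test-time segmentation can be sketched as follows. This is only an illustration: the helper name `fixed_windows`, the use of frame indices, and the window length are assumptions, since the text here does not state how long the windows are.

```python
from typing import List, Tuple


def fixed_windows(num_frames: int, window: int) -> List[Tuple[int, int]]:
    """Split frames [0, num_frames) into consecutive half-open windows.

    Every window spans `window` frames except possibly the last, shorter one.
    """
    if num_frames < 0 or window <= 0:
        raise ValueError("num_frames must be >= 0 and window must be > 0")
    return [(start, min(start + window, num_frames))
            for start in range(0, num_frames, window)]


print(fixed_windows(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

Each returned `(start, end)` pair would then be encoded as one pseudo-step by the video encoder, in place of a ground truth segment.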

3.4 Model Learning

Our full model is learned in two stages. First, the sentence encoder (SE), recipe RNN (RE) and sentence decoder (SD) are jointly trained end-to-end. Given a recipe of $N$ steps, we define our decoder loss, $L_D$, as the negative log probability of each reconstructed word $w_j^t$ for each step $j$:

$$L_D(s_1,\dots,s_N) = -\sum_{j=1}^{N} \sum_{t=1}^{M_j} \log P\big(w_j^t \mid w_j^{t' < t}, \hat{r}_j\big). \tag{9}$$

$P$ is parameterised by a softmax function at the output layer of the sentence decoder to estimate the distribution over the words, $w$, in our vocabulary $V$. The overall objective is then summed over all recipes in the corpus. The loss is computed only when the LSTM is learning to decode a sentence. This first training stage is unsupervised, as the sentence encoder and decoder and the recipe RNN require only text inputs that can easily be scraped from the web without human annotations.

In a second step, we train the video encoder (VE) while keeping the recipe RNN (RE) and sentence decoder (SD) fixed. We simply replace the sentence encoder with the video encoder while applying the same loss function as defined in Eq. 9. This step is supervised, as it requires video segments of each step temporally aligned with the corresponding sentences.

In addition to the word-based reconstruction loss of Eq. 9, we propose an L2 regularizer that encourages the predicted next-step representation $\hat{r}_j$ from the recipe RNN to be faithful to the observed representation $r_j$, i.e., the recipe loss $L_R$:

$$L_R = \sum_{j=1}^{N} \big\lVert \hat{r}_j - r_j \big\rVert_2^2,$$

where $\hat{r}_j$ is the predicted output for step $j$ and $r_j$ is the input from the sentence encoder for step $j$. We use the decoder loss, $L_D$, and the recipe loss, $L_R$, together, balanced by a weighting hyperparameter.
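To make the two training losses concrete, the sketch below evaluates them on toy numbers. It is a sketch under stated assumptions: the recipe loss is taken as a sum of squared L2 distances, and the two losses are combined as a plain weighted sum with a hypothetical weight `alpha`. The text does not spell out either choice, and all function names are illustrative.

```python
import math
from typing import Sequence


def decoder_loss(word_probs: Sequence[Sequence[float]]) -> float:
    """Negative log-likelihood summed over all words of all steps.

    word_probs[j][t] is the probability the decoder assigns to the
    ground-truth t-th word of step j.
    """
    return -sum(math.log(p) for step in word_probs for p in step)


def recipe_loss(pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Sum over steps of the squared L2 distance between predicted and
    observed step representations (the squared form is an assumption)."""
    return sum(sum((p - q) ** 2 for p, q in zip(p_vec, t_vec))
               for p_vec, t_vec in zip(pred, target))


def total_loss(word_probs, pred, target, alpha: float) -> float:
    """Assumed combination: decoder loss plus alpha times recipe loss."""
    return decoder_loss(word_probs) + alpha * recipe_loss(pred, target)


# Toy recipe with two steps (two words and one word) and 2-D representations.
probs = [[0.5, 0.25], [1.0]]
pred = [[1.0, 0.0], [0.0, 2.0]]
target = [[0.0, 0.0], [0.0, 0.0]]
print(decoder_loss(probs))                   # 3 * ln(2)
print(recipe_loss(pred, target))             # 5.0
print(total_loss(probs, pred, target, 0.1))
```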

3.5 Inference

During inference, we provide the ingredient vector $r_0$ as an initial input to the recipe RNN, which then outputs the predicted vector $\hat{r}_1$ for the first step of the video (see Figure 2). We use the sentence decoder and generate the corresponding first sentence, $\hat{s}_1$. Then, we sample a sequence of frames from the video and apply the video encoder to generate $v_1$, which we again provide as an input to the recipe RNN. The output prediction of the recipe RNN, $\hat{r}_2$, is for the second step of the video. We again use the sentence decoder and generate the corresponding sentence $\hat{s}_2$.

Our model is not limited to one-step-ahead predictions: for further predictions, we can simply apply the predicted output $\hat{r}_j$ as contextual input $r_j$. During training, instead of always feeding in the ground truth $r_j$, we sometimes (with 0.5 probability after the 5th epoch) use our predictions, $\hat{r}_j$, as the input for the next step predictions, which makes the model more robust to being fed bad predictions [bengio2015scheduled].
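The inference procedure and the scheduled-sampling rule described above can be summarised in a short sketch. Everything named here (`anticipate`, `choose_input`, `toy_step`, `toy_decode`) is hypothetical: the recipe RNN and the sentence decoder are replaced by toy stand-ins so that the control flow can be run, and "after the 5th epoch" is read as `epoch > 5` with 1-based epochs.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple

Vector = List[float]
StepFn = Callable[[Vector, Any], Tuple[Vector, Any]]


def anticipate(initial_input: Vector, observed: Sequence[Vector],
               recipe_step: StepFn, decode: Callable[[Vector], str],
               horizon: int) -> List[str]:
    """Feed the ingredient input and the observed step encodings, then roll
    out `horizon` further steps on the model's own predictions."""
    state: Any = None
    pred: Vector = initial_input
    sentences: List[str] = []
    # Each call consumes one input and predicts the representation of the NEXT step.
    for x in [initial_input, *observed]:
        pred, state = recipe_step(x, state)
        sentences.append(decode(pred))
    # Beyond the observed context, the prediction itself becomes the next input.
    for _ in range(horizon):
        pred, state = recipe_step(pred, state)
        sentences.append(decode(pred))
    return sentences


def choose_input(ground_truth: Vector, prediction: Vector, epoch: int,
                 rng: random.Random, p_pred: float = 0.5,
                 start_epoch: int = 5) -> Vector:
    """Scheduled sampling: after `start_epoch`, use the model's own
    prediction with probability `p_pred`, otherwise the ground truth."""
    if epoch > start_epoch and rng.random() < p_pred:
        return prediction
    return ground_truth


# Toy stand-ins for the recipe RNN and the sentence decoder.
def toy_step(x: Vector, state: Any) -> Tuple[Vector, Any]:
    return [v + 1.0 for v in x], (state or 0) + 1


def toy_decode(vec: Vector) -> str:
    return f"<sentence for {vec[0]:.0f}>"


print(anticipate([0.0], [[10.0]], toy_step, toy_decode, horizon=2))
```

With one observed step and `horizon=2`, the sketch returns four sentences: the predictions for steps 1 and 2, followed by two further steps generated without any new observation.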

3.6 Implementation and Training Details

We use a vocabulary of 30171 words provided by the Recipe1M dataset [salvador2017learning]; words are represented by a 256-dimensional embedding shared by the sentence encoder and decoder. We use the ingredient vocabulary from the training set of Recipe1M; the one-hot ingredient encodings are mapped into a 1024-dimensional vector. The RNNs are all single-layer LSTMs implemented in PyTorch; SE, VE, SD have 512 hidden units while RE has 1024. We train our model using the Adam optimizer [kingma2014adam] with a batch size of 50 recipes and a learning rate of 0.001. We train our text-based model (SE, RE, SD) for 50 epochs and the visual model (VE, RE, SD) for 25 epochs. We use for . The text-based model trained with the additional recipe loss converges faster, so we train this variant for only 10 epochs.

| Dataset | #videos | #images / frames | #ingredients | avg. #steps | #segments | avg. segment duration | avg. video duration |
|---|---|---|---|---|---|---|---|
| Tasty V1 | 2511 | 4.1M | 1228 | 9 | 21243 | 5s | 54s |
| Tasty V2 | 4022 | 8.6M | 1542 | 10 | 37530 | 6s | 71s |
| Recipe1M [salvador2017learning] | - | 887K | 3769 | 9 | - | - | - |
| YouCookII [zhou2018towards] | 2000 | 15.8M | 828 | 8 | 15400 | 19.6s | 303s |
| Epic-Kitchens [Damen2018EPICKITCHENS] | 432 | 11.5M | - | - | 39596 | 1.9s | 426s |

TABLE I: Comparison of our Tasty V1 and V2 datasets to relevant datasets. Recipe1M includes textual recipes and recipe images, while YouCookII and Epic-Kitchens are video-based datasets.

4 Tasty Videos Dataset V1 and V2

In our original publication [sener2019zero], we released the Tasty Videos Dataset with 2511 unique recipes, which we will refer to as Tasty V1. We have since extended the dataset to 4022 unique recipes, the Tasty Videos Dataset V2. All text recipes and videos are collected from Buzzfeed’s Tasty website (https://tasty.co). For both versions of the dataset, we make publicly available links to each recipe page, computed features, and temporal annotations (https://cvml.comp.nus.edu.sg/tasty).

In our dataset, each recipe has an ingredient list, step-wise instructions, and a video demonstrating the preparation of the dish. The videos in this dataset are captured with a fixed overhead camera and focus entirely on preparing the dish (see Figure 2). This viewpoint removes the added challenge of distractors and irrelevant actions. This simplification is not reflective of in-the-wild environments but does allow us to focus our scope on modelling the sequential nature of instructional videos, which is already a highly challenging and open research topic. The videos are designed to be sufficiently informative visually without the need for any narrations.

Other datasets feature crowd-sourced text [zhou2018towards, Damen2018EPICKITCHENS]; the recipes in our dataset are written by experts. This ensures specificity and richness in the instruction. For each recipe step, which corresponds to a single sentence, we annotate the temporal boundaries of the step in the video. We omit annotating steps without visual correspondences, such as alternative recommendations, non-visualized instructions like ‘Preheat oven.’  and stylistic statements such as ‘Enjoy!’. For both Tasty V1 and V2, we define a split ratio of 8:1:1 for training, validation, and testing sets.
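An 8:1:1 split over recipes can be produced along the following lines. This is a hypothetical sketch, not the released split: the function name `split_8_1_1`, the seed, and the rounding (integer division, with the remainder going to the test set) are assumptions, and in practice the published split should be preferred so that results stay comparable.

```python
import random
from typing import List, Sequence, Tuple


def split_8_1_1(recipe_ids: Sequence[int], seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle recipe ids and split them 8:1:1 into train/val/test.

    Splitting at the recipe level keeps every test dish unseen in training.
    """
    ids = list(recipe_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 8 // 10
    n_val = len(ids) // 10
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


train, val, test = split_8_1_1(range(4022))
print(len(train), len(val), len(test))  # 3217 402 403
```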

Fig. 3: (top) Distribution of the number of steps/segments (left) & distribution of the number of ingredients (right). (bottom) Distribution of step / segment durations (left) & distribution of video durations (right).

We present the statistics of our datasets in Table I and in Figure 3. In Table I, we compare our dataset to the relevant datasets Recipe1M and YouCookII as well as to the large-scale activity dataset Epic-Kitchens. Recipe1M is a large-scale dataset with, as the name suggests, approximately one million recipes, each with a recipe name, a list of ingredients, a sequence of instructions, and images of the final dish. YouCookII is a collection of cooking videos from YouTube with around 2000 videos of 89 dishes. Videos are captured from a third-person viewpoint. Each dish has an average of 22 videos, each with an average of 8 steps. The videos are annotated with the temporal boundaries of each step and their corresponding descriptions. Epic-Kitchens is a large-scale egocentric activity dataset with 39K action segments with category-based labels and is frequently used for action recognition and anticipation.

Tasty V2 extends the number of visual segments by 76%. It includes around 37K segments, which correspond to single-sentence recipe steps. This number is comparable to the large-scale Epic-Kitchens dataset, which contains 39K segments, and is higher than the number of segments in YouCookII with its 15K segments. Compared to the 1M recipes of Recipe1M, our dataset with 4022 recipes covers 40% of the 3769 ingredients in Recipe1M. Compared to YouCookII, our dataset includes a more diverse list of dishes (4022 vs. 89) and ingredients (1542 vs. 828). One notable difference between these datasets is the segment/step granularity, which is on the order of a few seconds in our dataset and Epic-Kitchens, but is coarser for YouCookII (19.6 seconds on average). The videos in our dataset are short (on average 54/71 seconds for V1/V2) yet contain a challenging number of steps (on average 9/10 for V1/V2). Recipes in YouCookII and Recipe1M have a similar number of steps (8/9 respectively).
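The percentages quoted above follow from the counts in Table I, as the short check below shows. The variable names are illustrative, and the numbers are copied from the table.

```python
# Counts taken from Table I.
v1_segments, v2_segments = 21243, 37530
tasty_v2_ingredients, recipe1m_ingredients = 1542, 3769
v2_videos = 4022

# Relative growth in annotated segments from V1 to V2.
segment_growth = (v2_segments - v1_segments) / v1_segments
# Ratio of the two ingredient counts.
ingredient_ratio = tasty_v2_ingredients / recipe1m_ingredients
# Average number of annotated segments per video in V2.
segments_per_video = v2_segments / v2_videos

print(f"segment growth V1 -> V2: {segment_growth:.1%}")
print(f"ingredient count ratio: {ingredient_ratio:.1%}")
print(f"annotated segments per V2 video: {segments_per_video:.2f}")
```

The average number of annotated segments per video comes out slightly below the average of 10 steps per recipe, which is consistent with steps lacking a visual counterpart being left unannotated.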

We note that the Tasty dataset also has the potential for tasks beyond anticipation, such as temporal action segmentation, dense video captioning, and object state recognition. In this paper, we compute several video captioning baselines, and we encourage the community to further develop models to tackle our challenging zero-shot dataset.

| | Ground Truth (GT) | Prediction | BLEU1 | BLEU4 | MET. | HUM1 | HUM2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ING | bacon, brown sugar, cooking spray, breadsticks | | | | | | |
| step1 | Preheat oven to 325 degrees F (165 degrees C). | Preheat oven to 400 degrees F. | 36.0 | 0.0 | 26.0 | 1.5 | 1.5 |
| step2 | Line 2 baking sheets with aluminum foil or parchment paper and spray with cooking spray. | Line a baking sheet with aluminum foil. | 23.0 | 0.0 | 23.0 | 1.0 | 1.0 |
| step3 | Wrap 1 bacon strip around each breadstick, leaving about 1 inch uncovered on each end. | Place bacon strips in a single layer on the prepared baking sheet. | 13.0 | 0.0 | 9.0 | 0.5 | 1.5 |
| step4 | Place wrapped breadsticks on the prepared baking sheet. | Place rolls on a baking sheet. | 48.0 | 0.0 | 30.0 | 1.5 | 1.5 |
| step5 | Sprinkle brown sugar evenly over breadsticks. | Bake in the preheated oven until breadsticks are golden brown, about 15 minutes. | 15.0 | 0.0 | 13.0 | 0.0 | 1.5 |
| step6 | Bake in the preheated oven until bacon is crisp and browned, 50 to 60 minutes. | Bake in preheated oven until bacon is crisp and breadsticks are golden brown, about 15 minutes. | 63.0 | 43.0 | 36.0 | 1.0 | 1.0 |
| step7 | Cool breadsticks on a piece of parchment paper or waxed paper sprayed with cooking spray. | Remove from oven and let cool for 5 minutes. | 6.0 | 0.0 | 4.0 | 0.5 | 1.5 |

Fig. 4: Predictions of our text-based method for ‘Candied Bacon Sticks’ along with the automated scores and human ratings. For ‘HUMAN1 (HUM1)’ we ask the raters to directly assess how well the predicted steps match the corresponding Ground Truth (GT) sentences, for ‘HUMAN2 (HUM2)’ we ask to judge if the predicted step is still a plausible future prediction (see Sec. 6.6). Our prediction for step6 matches the GT well, while step5 does not. However, according to ‘HUMAN2 (HUM2)’ score, our step5 prediction is still a plausible future action.

5 Experiments: Text

5.1 Datasets and Evaluation Measures

We experiment with Recipe1M [salvador2017learning], YouCookII [zhou2018towards], and Tasty Videos V1 and V2.

We use the ingredients and instructions from the training split of the Recipe1M dataset to learn our sentence encoder (Eq. 1), sentence decoder (Eq. 3), and recipe RNN (Eq. 4). To learn the video encoder (Eq. 6), we use the aligned instructions and video data from the training split of either the YouCookII or the Tasty datasets. We evaluate our model’s prediction capabilities with text inputs from Recipe1M and with video and text inputs from YouCookII and Tasty Videos.

Our predictions are in sentence form; evaluating the quality of generated sentences is known to be difficult in captioning and natural language generation [vedantam2015cider, lopez2008statistical]. We apply a variety of measures to offer a broad assessment. First, we target the matching of ingredients and verb keywords since they indicate the next active objects and actions and are analogous to the assessments of action anticipation [Damen2018EPICKITCHENS]. Second, we evaluate with the sentence-matching scores BLEU (BiLingual Evaluation Understudy) [papineni2002bleu] and METEOR (Metric for Evaluation of Translation with Explicit ORdering) [banerjee2005meteor], which are also used for video captioning methods [regneri2013grounding, rohrbach2013translating, zhou2018end]. BLEU computes an n-gram based precision for predicted sentences with respect to the ground truth sentences. METEOR creates an alignment between the ground truth and predicted sentence using the exact word matches, stems, synonyms, and paraphrases; then, it computes a weighted F-score with an alignment fragmentation penalty.
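To make the n-gram precision idea concrete, the following is a minimal sketch of the clipped (modified) n-gram precision that underlies BLEU. It is not the evaluation code used for the reported scores: the full BLEU score additionally combines several n-gram orders and applies a brevity penalty, and the whitespace tokenization and the helper names `ngrams` and `clipped_precision` are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(prediction, reference, n):
    """Fraction of predicted n-grams found in the reference.

    Each predicted n-gram is credited at most as often as it occurs
    in the reference (the clipping used by BLEU).
    """
    pred_counts = Counter(ngrams(prediction.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
    return matched / total


# Punctuation is removed by hand here; a real tokenizer would handle it.
reference = "place patties on the grill and cook for 5 minutes per side"
prediction = "place on the grill and cook for about 10 minutes turning once"

p1 = clipped_precision(prediction, reference, 1)  # unigram precision
p4 = clipped_precision(prediction, reference, 4)  # 4-gram precision
print(f"unigram precision: {p1:.2f}, 4-gram precision: {p4:.2f}")
```

Because only exact token matches count, a prediction that uses a synonym of a ground truth word receives no credit for it under this measure.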

For the uninformed reader, sentence scores like BLEU and METEOR are best at indicating sentences with precise word matches to the ground truth (GT). There are variations between sentences conveying the same idea in natural language, so automated scores may fail to match sentences a human would consider equivalent. This is true even for text with very specific language, such as cooking recipes. For example, for the ground truth sentence ‘Garnish with the remaining wasabi and sliced green onions.’, our method predicts ‘Transfer to a serving bowl and garnish with reserved scallions.’. For a human reader, this is half correct, since ‘scallions’ and ‘green onions’ are synonyms, yet this example would receive only low BLEU1, BLEU4, and METEOR scores. As another example, for the ground truth sentence ‘Place patties on the grill, and cook for 5 minutes per side.’ versus a prediction by our model ‘Place on the grill, and cook for about 10 minutes, turning once.’, we would have a BLEU1 score of 65.0, BLEU4 of 44.0, and METEOR of 29.0. In this regard, we note that BLEU and METEOR scores offer only a limited ability to evaluate the predicted sentences.

The gold standard to evaluate dialogue generation [LiuLSNCP16] and captioning [li2018jointly] is human subject ratings. Therefore, we conduct a user study and ask people to assess how well the predicted step matches the ground truth in meaning; if it does not match, we ask if the prediction would be plausible for future steps. This gives flexibility in case predictions do not follow the exact aligned order of the ground truth, e.g., due to missing steps not predicted, or steps that are slightly out of order (see Figures 4, 11, and 12).

(a) Ingredient recall.
(b) Verb recall.
(c) Sentence scores.
Fig. 5: Comparisons on Recipe1M’s test set. (a) recall of ingredients predicted by our model (‘Ours’), skip-thought vectors (‘ST’), and our model trained without ingredients (‘Ours noING’). (b) verb recall of our model (‘Ours’) vs. ‘ST’. (c) BLEU1, BLEU4, METEOR scores for our model (‘Ours’) vs. ‘ST’. The x-axes in the plots indicate the step number being predicted in the recipe; each curve begins on the first prediction, i.e., step $j+1$ after having received steps $1$ to $j$ as input.

5.2 Learning of Procedural Knowledge

We first verify the learning of procedural knowledge with a text-only model, i.e., the sentence encoder, sentence decoder, and the recipe RNN, by evaluating on Recipe1M’s test set of 51K recipes. For a recipe of $N$ steps, we evaluate our model’s ability to predict steps $j+1$ to $N$, conditioning on steps $1$ to $j$ as input context. For comparison, we look at the generations from the commonly used sequence-to-sequence model skip-thought (ST) vectors [sThought15]. Skip-thought vectors are trained to decode temporally adjacent sentences from a current encoding, i.e., given step $j$ to the encoder, the decoder predicts steps $j-1$ and $j+1$, and have been shown to be successful in generating continuous text [cho2014learning, vinyals2015neural, kiddon2016globally].

We train the skip-thought vectors on the training set of the Recipe1M dataset; because the model is not designed to accept an ingredient list as a zeroth or initialization step, we make skip-thought predictions only from the second step onwards. We report our results over the recipes in the entire test set of the Recipe1M dataset in Figure 5 (a), (b) and (c). We report scores of the predicted steps averaged over multiple recipes. Only those recipes which have at least $j$ steps contribute to the average for step $j$.
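The step-wise averaging described above can be sketched as follows. This is an illustrative sketch rather than the evaluation script behind Figure 5; the function name `average_by_step` and the toy scores are assumptions. The point it shows is that a recipe only contributes to the average of a step it actually contains, so later steps are averaged over fewer recipes.

```python
from collections import defaultdict


def average_by_step(per_recipe_scores):
    """Average scores per step index over recipes of different lengths.

    per_recipe_scores[r][j - 1] is the score of step j of recipe r.
    A recipe contributes to step j only if it has at least j steps.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for scores in per_recipe_scores:
        for step, score in enumerate(scores, start=1):
            totals[step] += score
            counts[step] += 1
    return {step: totals[step] / counts[step] for step in sorted(totals)}


# Toy example: three recipes with 2, 3, and 1 scored steps (made-up values).
toy_scores = [[0.5, 0.25], [0.25, 0.75, 1.0], [0.75]]
step_means = average_by_step(toy_scores)
print(step_means)
```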

5.2.1 Key Ingredients

We first look at our model’s ability to predict ingredients and verbs on the Recipe1M dataset. For ingredients, we also compare with a variant of our model without any ingredients (‘Ours noING’) where we train our network without ingredient inputs. For evaluating recall, we do not directly cross-reference the ingredient list but instead limit the evaluation to ingredients mentioned explicitly in the recipe steps. This is necessary to avoid ambiguities that may arise from specific instructions such as ‘add chicken, onion, and bell pepper’ versus the more vague ‘add remaining ingredients’. Furthermore, the ingredient lists in Recipe1M are often automatically generated and may be incomplete.
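A minimal sketch of this recall computation is given below, under the assumption that ingredients are detected by simple lowercase substring matching against a known vocabulary. The actual matching procedure is not specified here, so `mentioned`, `ingredient_recall`, and the small vocabulary are hypothetical names and data used only for illustration.

```python
def mentioned(vocabulary, sentence):
    """Return the vocabulary entries that appear verbatim in the sentence."""
    text = sentence.lower()
    return {item for item in vocabulary if item in text}


def ingredient_recall(gt_step, predicted_step, vocabulary):
    """Recall of ingredients explicitly mentioned in the ground truth step.

    Returns None when the ground truth step names no ingredient
    (e.g. 'add remaining ingredients'), so the step can be skipped.
    """
    gt = mentioned(vocabulary, gt_step)
    if not gt:
        return None
    predicted = mentioned(vocabulary, predicted_step)
    return len(gt & predicted) / len(gt)


vocab = {"bacon", "brown sugar", "breadsticks"}  # assumed toy vocabulary
recall = ingredient_recall(
    "Sprinkle brown sugar evenly over breadsticks.",
    "Bake in the preheated oven until breadsticks are golden brown, about 15 minutes.",
    vocab,
)
skipped = ingredient_recall("Add remaining ingredients.", "Add bacon.", vocab)
print(recall, skipped)
```

Restricting the denominator to ingredients named in the ground truth step is what keeps vague steps from being counted as misses.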

In Figure 5(a), we compare the recall of the ingredients detected in our predicted steps versus steps generated by skip-thought vectors and by our model trained without ingredient inputs. We can see that our model’s predictions successfully incorporate relevant ingredients, with recall rates as high as 39.6% for the predicted next step, 31.0% for the second, 24.8% for the third, and 20.2% for the fourth predicted step. The overall recall decreases for the later steps. This is likely due to the increased difficulty once the overall number of ingredient occurrences decreases, which tends to happen in later steps. Based on the ground truth, we observe that the majority of the ingredients occur in the early and middle steps and that their number decreases in the last steps. The last steps are usually related to the already completed dish and do not explicitly mention as many ingredients as the earlier steps.

Compared to skip-thought vectors, our predictions’ ingredient recall is higher regardless of whether or not ingredients are provided as an initial input. Without ingredient input, the overall recall is lower, but after the initial step, i.e., once the model receives some context, its recall increases sharply. Our model without the ingredient input still performs better than the skip-thought predictions. We attribute this to the strength of our model to generalize across related recipes so that it is able to predict relevant co-occurring ingredients. Our predictions include common ingredients such as salt, butter, eggs and water and also recipe-specific ones such as couscous, zucchini, or chocolate chips. While skip-thought vectors predict some common ingredients, they fail to predict recipe-specific ingredients.

5.2.2 Key Verbs

Key verbs indicate the main action for a step and are also cues for future steps both immediate (e.g., ‘mix’ after ‘adding’ ingredients into a bowl) and long-term (e.g., ‘bake’ after ‘preheating’ the oven). We tag the verbs in the training recipes with a natural language toolkit [nltk] and select the 250 most frequent verbs for evaluation. Similar to ingredients, we check for recall only of the verbs appearing in the ground truth steps. In the ground truth steps, there are between 1.55 and 1.85 verbs per step, i.e., steps often include multiple verbs such as ‘add and mix’.

Figure 5(b) shows that our model recalls up to 30.6% of the verbs with the predicted next step. Our model’s performance is poor in the first steps, due to ambiguities when given only the ingredients without any further knowledge of the recipe. After the first steps, our model’s performance quickly increases and stays consistent across the remaining steps. In comparison, the ST model’s best recall is only 20.1% for the next step prediction.

Fig. 6: Ablations on the interchangeability of the sentence encoder and influence of ingredient inputs, evaluated on Recipe1M’s test set. We compare the sentence scores of our joint model (‘Ours’), our joint model without ingredient inputs (‘Ours noING’), and our model where the sentence encoder is replaced with pre-trained skip-thought vectors (‘ST vectors’). ‘% seen’ refers to the percentage of recipe steps the model receives as input while predicting the remaining steps.

5.2.3 Sentences

Key ingredients and verbs alone do not capture the rich instructional nature of recipe steps; compare, e.g., ‘whisk’ and ‘egg’ to ‘Whisk the eggs till light and fluffy’. As such, we also evaluate the quality of the entire predicted sentences based on the BLEU1, BLEU4, and METEOR scores. We compare with skip-thought vectors in Figure 5 (c). Our results for the BLEU1 scores are consistently high, at around 22.7 for the next step predictions, with a slight decrease towards the end of the recipes. Predictions further than the next step have lower scores, though they stay above 12.0. The BLEU4 scores are highest in the very first step, around 5.8, and range between 0.7 and 4.3 over the remaining steps. The high performance at the beginning arises because many of the recipes start with common instructions (e.g., ‘Preheat oven to X degrees’, ‘In a large skillet, heat the oil’). For similar reasons, we also do well towards the end of recipes, where instructions for serving and garnishing are common (e.g., ‘Season with salt and pepper’). Trends for the METEOR score are similar to our BLEU1 scores. METEOR scores are above 13.0 for the next step predictions and do not go lower than 6.5 for the further step predictions.

Our proposed method outperforms skip-thought predictions across the board. In fact, our predictions up to four steps into the future surpass the skip-thought predictions made only one step ahead. This can be attributed to the dedicated long-term modelling of the recipe RNN; as such, we are able to incorporate the context from all sentence inputs up to the present. In contrast, skip-thoughts are Markovian in nature and can only take the current step into account.

Our model predicts coherent and plausible instructional sentences, as shown in Figure 4. One interesting and unexpected outcome of our model is that it also makes recommendations. In cooking recipes, one does not only find strict recipe steps but also suggestions based on the writer’s experience (e.g., ‘If using wooden skewers, make sure to soak in water.’). Our learned model also generates such suggestions. For example, for the ground truth ‘If it’s too loose at this point, place it in the freezer for a little while to let freeze.’, our model predicts ‘If you freeze it, it will be easier to eat’.

5.3 Encoder Modularity

Since our network is modular, we check the interchangeability of the sentence encoder by replacing it with skip-thoughts vectors trained on the Recipe1M dataset, as provided by [salvador2017learning]. For this experiment, we train the recipe RNN and sentence decoder jointly using the pre-trained skip-thoughts vectors as sentence representations. The recipe RNN and sentence decoder have been trained with the same parameter settings as our full model in all the ablation studies.

Figure 6 compares sentence scores of our joint model (‘Ours’), our joint model trained without ingredient inputs (‘Ours noING’), and our model using the pre-trained skip-thought vectors (‘ST vectors’) when different percentages of a recipe are observed. Our sentence encoder performs on par with skip-thought vectors as an encoding. An advantage of our model, though, is that our encoder and decoder can be trained jointly and do not require the separate pre-training of a sentence auto-encoder that is needed when using skip-thought vectors as input. Similar to our observations for ingredient recall, we see that ingredient information is very important for predicting sentences, especially for the initial steps. Once 25% or 50% of the recipe steps are observed, i.e., once there is enough context, the model’s performance starts to improve.

5.4 Amount of Training Data

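| Method | ING | VERB | BLEU1 | BLEU4 | METEOR |
| --- | --- | --- | --- | --- | --- |
| ours text (100%) | 26.09 | 27.19 | 26.78 | 3.30 | 17.97 |
| ours text (50%) | 23.01 | 24.90 | 25.05 | 2.42 | 16.98 |
| ours text (25%) | 19.43 | 23.83 | 23.54 | 2.03 | 16.05 |
| ours text (0%) | 5.80 | 9.42 | 10.58 | 0.24 | 6.80 |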
TABLE II: Evaluations on Tasty V1’s test set for the textual model when the number of training recipes varies. Performance drops when the amount of pre-training decreases.

At the core of our method is the transfer of knowledge from text resources to solve a challenging visual problem. We evaluate the effectiveness of the knowledge transfer by varying the amount of training data from Recipe1M used for pre-training. We train our method with 100%, 50%, and 25% of the Recipe1M training set. We present our results in Table II. Looking at the averaged scores over all the predicted steps on the Tasty Videos dataset, we observe a decrease in all evaluation measures as we limit the amount of data from Recipe1M (see ‘ours text’ 100%, 50%, 25%, and 0%), with the most significant decrease occurring for the BLEU4 score. When using less text data, our method’s BLEU4 score decreases from 3.30 to 2.42 when half of Recipe1M is used and to 2.03 when a quarter of the dataset is used. While we observe a similar decrease in the ingredient detection scores, the decrease in BLEU1, METEOR, and verb scores remains less significant. If there is no pre-training, i.e., when the model is learned only on text from Tasty Videos V1 (‘ours text (0%)’), the decrease in scores is noticeable for all evaluation criteria. These results verify that pre-training has a significant effect on our method’s performance.

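| step | ING $L_d$ | ING $+L_r$ | VERB $L_d$ | VERB $+L_r$ | BLEU1 $L_d$ | BLEU1 $+L_r$ | BLEU4 $L_d$ | BLEU4 $+L_r$ | METEOR $L_d$ | METEOR $+L_r$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| next | 32.3 | 33.5 | 26.1 | 26.7 | 22.1 | 22.8 | 4.3 | 4.4 | 13.4 | 13.7 |
| next+1 | 24.6 | 28.4 | 19.1 | 16.0 | 16.1 | 14.0 | 2.2 | 1.0 | 9.2 | 7.6 |
| next+2 | 19.6 | 24.5 | 16.6 | 12.5 | 13.9 | 11.8 | 1.5 | 0.6 | 7.8 | 6.1 |
| next+3 | 16.5 | 20.9 | 15.2 | 10.3 | 12.9 | 10.5 | 1.3 | 0.4 | 7.2 | 5.2 |
TABLE III: Comparisons for the decoder loss, $L_d$, and the decoder + recipe loss, $L_d + L_r$, on Recipe1M’s test set.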
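| step | ING gr | ING b | VERB gr | VERB b | BLEU1 gr | BLEU1 b | BLEU4 gr | BLEU4 b | METEOR gr | METEOR b |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| next | 32.3 | 33.1 | 26.1 | 26.9 | 22.1 | 22.8 | 4.3 | 4.6 | 13.4 | 13.9 |
| next+1 | 24.6 | 26.2 | 19.1 | 20.2 | 16.1 | 17.2 | 2.2 | 2.4 | 9.2 | 9.9 |
| next+2 | 19.6 | 21.3 | 16.6 | 17.7 | 13.9 | 15.2 | 1.5 | 1.8 | 7.8 | 8.6 |
| next+3 | 16.5 | 18.2 | 15.2 | 16.3 | 12.9 | 14.1 | 1.3 | 1.5 | 7.2 | 7.9 |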
TABLE IV: We compare greedy (gr) and beam (b) search when decoding sentences on the Recipe1M’s test set.

5.5 Loss Variants

We train our recipe network using a decoder loss, $L_d$ (Eq. 7), and additionally propose a recipe loss, $L_r$ (Eq. 8), in Section 3. Table III presents our experiments analyzing the influence of the recipe loss, evaluated on the test set of Recipe1M. Overall, the recipe loss $L_r$, as expected, increases the next step prediction performance. The increase is more than 0.5% for verb and ingredient recall, and 0.7, 0.1, and 0.3 for the BLEU1, BLEU4, and METEOR scores, respectively. However, using the recipe loss significantly decreases the scores for the later steps, particularly the sentence scores. The most significant decrease is observed for the BLEU4 score, which drops to 0.4 for four steps into the future, ‘next+3’. Only the ingredient recall does not decrease; for the later steps it is about 4 to 5 points higher than for our model trained with the decoder loss alone. However, upon closer inspection of individual predictions, we see that the recipe loss has a tendency to promote repetitive outputs. Repetitions are a common but undesirable outcome for natural language generation [holtzman2019curious]. In this case, as consecutive steps are more likely to describe the handling of common items or ingredients, the ingredient recall is not as directly affected.

Beam search [1977cmurep] is known to improve the performance of text-based generation algorithms [freitag2017beam]. In Table IV, we evaluate our model, trained with the decoder loss, $L_d$, using greedy and beam search. Greedy decoding selects the word with the highest probability at every decoding step. Beam search keeps track of a beam of possible generations and updates them at every decoding step by ranking them according to the model likelihood. Although its cost grows with the beam size compared to greedy search, it improves the performance. In our experiments, we use a fixed beam size. Beam search improves the ingredient and verb scores both by 0.8%, and our sentence scores by 0.7, 0.3, and 0.5 for BLEU1, BLEU4, and METEOR, respectively. We observe similar improvements over the later steps as well. We report beam search decoding results only in Table IV, while in our other results tables, out of fairness, we use greedy search.
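The difference between the two decoding strategies can be illustrated with a small sketch. It does not use the paper's sentence decoder: `toy_model` is a hypothetical hand-written next-token distribution, and `greedy_decode` and `beam_decode` are illustrative names. The toy distribution is chosen so that the locally best first word leads to a less likely sentence overall, which is the situation in which keeping several hypotheses helps.

```python
import math

EOS = "</s>"

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
TABLE = {
    (): {"bake": 0.6, "stir": 0.4},
    ("bake",): {EOS: 0.55, "well": 0.45},
    ("stir",): {"well": 0.9, EOS: 0.1},
    ("bake", "well"): {EOS: 1.0},
    ("stir", "well"): {EOS: 1.0},
}


def toy_model(prefix):
    """Return log-probabilities of the next token given a prefix tuple."""
    return {tok: math.log(p) for tok, p in TABLE[prefix].items()}


def greedy_decode(next_log_probs, max_len):
    """Pick the single most likely token at every step."""
    seq, score = [], 0.0
    for _ in range(max_len):
        token, logp = max(next_log_probs(tuple(seq)).items(), key=lambda kv: kv[1])
        seq.append(token)
        score += logp
        if token == EOS:
            break
    return seq, score


def beam_decode(next_log_probs, max_len, beam_size):
    """Keep the beam_size best partial sentences at every step."""
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == EOS:
                candidates.append((seq, score))  # finished: carry over unchanged
                continue
            for token, logp in next_log_probs(tuple(seq)).items():
                candidates.append((seq + [token], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == EOS for seq, _ in beams):
            break
    return beams[0]


greedy_seq, greedy_score = greedy_decode(toy_model, max_len=4)
beam_seq, beam_score = beam_decode(toy_model, max_len=4, beam_size=2)
print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

With a beam of size one, `beam_decode` reduces to greedy decoding; larger beams expand more hypotheses per step, which is where the additional cost comes from.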

5.6 State-of-the-Art Comparisons

Fig. 7: We evaluate our model for recipe generation given ingredients and compare our performance to a recipe generation network, ‘Inverse’ [salvador2019inverse], on a subset of Recipe1M.
Fig. 8: We employ the recipe generation network, ‘Inverse’ [salvador2019inverse], for next step prediction and compare our performance. Both methods are tested on a subset of Recipe1M.

There are several works [kiddon2016globally, salvador2019inverse] that generate instructional text for the cooking domain. Similar to our work, such approaches also target learning procedural knowledge in text. Inverse cooking [salvador2019inverse] generates recipes for food images. They use a two-stage transformer approach: the first transformer predicts ingredients from a recipe picture while the second transformer generates the recipe from these ingredients. Inverse cooking uses a subset of the Recipe1M dataset, obtained by removing the recipes that contain no images and those with fewer than two ingredients or steps. For a fair comparison, we train and test our model using this subset, which features 252.5k and 54.5k recipes for training and testing, respectively. Their network is limited to generating a maximum of 150 words per recipe. Since our model has no such limit on the length of its generations, for fair comparisons, we also truncate our model’s generations after 150 words. We compare our model to the publicly released Inverse cooking model by directly providing the ground truth ingredients to their recipe generation transformer.

We first evaluate our model’s capabilities in recipe generation and compare our generations to Inverse cooking in Figure 7 for individual steps. Both methods are evaluated using GT ingredients. Overall, our model outperforms ‘Inverse’ for recipe generation for individual steps. Our method’s performance is significantly higher for ingredient recall by at least 10% for all steps. Similarly, for BLEU4 score, our model’s first step prediction is 4.8, while Inverse has only 1. However, for the later steps, we perform comparably in this metric. We observe similar gaps between the first step predictions in BLEU1, METEOR, and verb scores.

Next, we evaluate ‘Inverse’ for next step prediction. For each step, we feed the ground truth sentences in the previous steps to Inverse’s transformer-based sentence decoder, which generates a sentence for the next step. Results in Figure 8 show that although the transformer-decoder has access to the complete history of the GT recipe until the next step, our method significantly outperforms this recipe generation network in all scores. For ingredients and verbs, the difference is more than 15% for all steps. Inverse’s performance degrades towards the end of the recipe, while ours stays consistently high for all metrics. The most significant decrease for Inverse is in BLEU4, which decreases to almost zero after the 5th step, whereas ours is invariably above 1.5.

6 Experiments: Video

6.1 Video Predictions on Tasty V1

Fig. 9: We compare our visual model when tested with GT segments, ‘Visual GT’, and our textual model, ‘Text’, on the Tasty Videos V1’s test set for next step predictions using the recall of predicted ingredients and verbs, and sentence scores. Compared to our text-based model, our visual model has lower performance but follows similar trends.

We first evaluate our model for making predictions on video inputs on Tasty Videos V1’s test set. To explore the importance of video partitioning, we consider two settings for inference: one according to ground truth segments, ‘ours visual (GT)’, and one based on fixed temporal windows, ‘ours visual (window)’. For the temporal window experiments, the videos are partitioned into fixed-size windows up to the latest observation. We sequentially feed the representations of these chunks obtained from the video encoder into our recipe RNN. Overall, our method is relatively robust to window size, as shown in Table VI for BLEU4 scores. We empirically select a window of 170 for Tasty Videos. In both settings, every fifth frame in GT or window-based segments is sampled, and their visual features are fed into the video encoder. The representations from the video encoder are then fed to the recipe RNN as context vectors. Through the video encoder, our model can interpret visual evidence and make plausible predictions of the next steps. Examples are shown in Figures 11 and 12, where our visual model corrects itself after observing new evidence.
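The window-based partitioning and frame sampling can be sketched as below. This is an illustration under stated assumptions rather than the actual preprocessing code: the window size is taken to be in frames, the last (shorter) window is kept, and `window_segments` and `sample_frames` are hypothetical helper names.

```python
def window_segments(num_observed_frames, window_size):
    """Split frames [0, num_observed_frames) into consecutive fixed-size windows.

    Returns (start, end) pairs with an exclusive end. The final window may be
    shorter than window_size (an assumption; it could also be dropped).
    """
    return [
        (start, min(start + window_size, num_observed_frames))
        for start in range(0, num_observed_frames, window_size)
    ]


def sample_frames(segment, stride=5):
    """Indices of every stride-th frame inside a (start, end) segment."""
    start, end = segment
    return list(range(start, end, stride))


# 400 observed frames, partitioned with the window of 170 chosen for Tasty Videos.
segments = window_segments(400, 170)
sampled = [sample_frames(seg) for seg in segments]
print(segments)
print([len(frames) for frames in sampled])
```

Each list of sampled frame indices would then be encoded into one context vector for the recipe RNN, in the same way as a ground truth segment.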

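| Method | ING | VERB | BLEU1 | BLEU4 | METEOR |
| --- | --- | --- | --- | --- | --- |
| S2VT [venugopalan2015sequence] (GT) | 7.59 | 19.18 | 18.03 | 1.10 | 9.12 |
| S2VT [venugopalan2015sequence], next (GT) | 1.54 | 10.66 | 9.14 | 0.26 | 5.59 |
| End-to-end [zhou2018end] | - | - | - | 0.54 | 5.48 |
| ours visual (GT) | 20.40 | 19.18 | 19.05 | 1.48 | 11.78 |
| ours visual (window) | 16.66 | 17.08 | 17.59 | 1.23 | 11.00 |
| ours text | 26.09 | 27.19 | 26.78 | 3.30 | 17.97 |
| ours text noING | 9.04 | 22.00 | 20.11 | 0.92 | 13.07 |
| ours video-text | 22.27 | 23.35 | 21.75 | 2.33 | 14.09 |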
TABLE V: Evaluations on Tasty Videos V1 for our visual and text-based model along with comparisons against video captioning methods [venugopalan2015sequence, zhou2018end]. Performance drops when fixed-sized windows based segments, ‘ours visual (window)’ are used compared to using GT segments, ‘ours visual (GT)’. Ingredient inputs are important for our model’s success. Our method performs better than video captioning.
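| window size | 30 | 50 | 70 | 90 | 110 | 130 | 150 | 170 | 190 | 210 | 230 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tasty V1 | 0.75 | 0.90 | 0.93 | 1.06 | 1.18 | 1.09 | 1.07 | 1.23 | 1.09 | 1.19 | 1.06 |
| YouCookII | 0.60 | 1.10 | 1.38 | 1.32 | 1.32 | 1.28 | 1.30 | 1.20 | 1.17 | 1.20 | 1.22 |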
TABLE VI: Window size selection on the Tasty V1 and YouCookII dataset. Reported are BLEU4 scores.

The results are shown in Table V. Compared to using ground truth segments, ‘ours visual (GT)’, using fixed window segments, ‘ours visual (window)’, results in a decrease in performance, with the most extreme relative drops on the most challenging sentence score, BLEU4 (around 17%), and on the ingredient score (around 18%). For the BLEU1 and METEOR scores, the decrease is smaller (below 10%), and for the verb score it is around 11%.
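The relative drops quoted above can be recomputed from the two corresponding rows of Table V. The sketch below assumes that the five columns of Table V are ingredient recall, verb recall, BLEU1, BLEU4, and METEOR, in that order; `relative_drop` is an illustrative helper name.

```python
def relative_drop(reference, value):
    """Relative decrease of value with respect to reference, in percent."""
    return 100.0 * (reference - value) / reference


# Rows 'ours visual (GT)' and 'ours visual (window)' from Table V.
# The column order is an assumption inferred from the surrounding text.
metrics = ["ING", "VERB", "BLEU1", "BLEU4", "METEOR"]
visual_gt = [20.40, 19.18, 19.05, 1.48, 11.78]
visual_window = [16.66, 17.08, 17.59, 1.23, 11.00]

drops = {
    name: relative_drop(gt, win)
    for name, gt, win in zip(metrics, visual_gt, visual_window)
}
for name, drop in drops.items():
    print(f"{name}: {drop:.1f}% lower with fixed windows")
```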

In Table V, our text-based results, ‘ours text’, are presented as an upper bound. Given that our model is first trained on text and then transferred to video, the drop in performance from text to video is as expected. Video results, however, follow similar trends as the text; see for example Figure 9, where we provide step-wise comparisons of our textual and visual models (GT). We further investigate the influence of the ingredients on the performance of our method. When ingredients are not provided, ‘ours text noING’, our method fails to make plausible predictions. The performance decrease is mainly noticeable in the ingredient scores and the BLEU4 scores.

In some instructional scenarios, there may be semi-aligned text that accompanies the video, e.g., narrations. We test such a setting by training the sentence and video encoder, as well as sentence decoder and recipe RNN jointly, for making future step predictions. For this, the sentence and video context vectors are first concatenated and then passed through a linear layer before feeding them as input to the Recipe RNN. Overall, the results are better than our video alone results but not better than our text alone results (see ‘ours video-text’ in Table V). Even with joint training, it is still challenging to make improvements, which we attribute to the diversity in our videos and variations in the text descriptions for similar visual inputs. On the other hand, when there is accompanying text, our model can be adapted easily and improve prediction performance.

6.2 Video Predictions on YouCookII

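| Method | ING | VERB | BLEU1 | BLEU4 | METEOR |
| --- | --- | --- | --- | --- | --- |
| End-to-end (GT) [zhou2018end] | - | - | - | 0.87 | 8.15 |
| TempoAttn (GT) [yao2015describing] | - | - | - | 1.42 | 11.20 |
| ours visual (GT) | 21.36 | 27.55 | 23.71 | 1.66 | 11.54 |
| End-to-end [zhou2018end] | - | - | - | 0.08 | 4.62 |
| TempoAttn [yao2015describing] | - | - | - | 0.30 | 6.58 |
| ours visual (window) | 17.64 | 25.11 | 22.55 | 1.38 | 10.71 |
| ours text | 24.60 | 29.39 | 26.49 | 2.66 | 13.31 |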
TABLE VII: Evaluation for our visual and text-based models and comparison against two video captioning methods [yao2015describing, zhou2018end] on YouCookII’s validation set. Note that in this comparison, we are anticipating the NLP description of the next step, while the captioning methods are applied directly to the future (unseen to us) video footage. Even though both methods, including [zhou2018end] which is state-of-the-art for dense captioning, have access to the video and we do not, we still perform better in both BLEU4 and METEOR scores.

To further validate the effectiveness of our model on publicly available datasets with sentence-level annotations, we also evaluate it on the YouCookII dataset. Table VII compares our visual model evaluated with ground truth segments, ‘ours visual (GT)’, and temporal windows, ‘ours visual (window)’ on YouCookII’s validation set. For ‘ours visual (window)’ experiments, a window of 70 frames is empirically selected, see Table VI. YouCookII includes a large number of background frames that are irrelevant to the recipe steps. We suspect using large windows misses important cues for the YouCookII steps.

Similar to our observations on Tasty V1, using window segments, ‘ours visual (window)’, instead of ground truth segments, ‘ours visual (GT)’, results in decreased accuracy. The largest decreases are observed for the BLEU4 and ingredient scores, by 16% and 17%, respectively, highlighting the importance of achieving high scores for these metrics. The decrease for BLEU1 is the smallest, at 4%. Comparing the performance of our visual model with the textual model, ‘ours text’, the textual results are better overall on Tasty than on YouCookII.

6.3 Video Predictions on Tasty V2

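| Method | ING | VERB | BLEU1 | BLEU4 | METEOR |
| --- | --- | --- | --- | --- | --- |
| video (GT) | 19.33 | 16.86 | 19.34 | 1.71 | 12.64 |
| video (window) | 18.60 | 16.47 | 18.60 | 1.42 | 11.75 |
| video (proposal) | 19.12 | 16.37 | 18.60 | 1.63 | 11.84 |
| text | 28.93 | 27.37 | 26.97 | 4.61 | 18.26 |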
TABLE VIII: Evaluation for visual and text-based models and comparisons using different segments on Tasty V2.
Fig. 10: Example proposals from a proposal decoder [zhou2018end] trained on Tasty V2.

In addition to using fixed window-based segments, we train the transformer-based proposal decoder from Zhou et al. [zhou2018towards] on the Tasty V2 dataset to generate segment proposals. First, using non-maximum suppression, at each iteration, the longest proposal is selected, and those with an IoU with the longest proposal that is greater than a threshold of 0.2 are discarded. Then, to obtain non-overlapping proposals, the overlapping regions among the overlapping proposals are divided w.r.t. their lengths; an example can be seen in Figure 10. We evaluate the quality of proposals using mean intersection over union (IoU), which is 39.59%.
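The suppression step described above can be sketched as follows. This is a minimal illustration, not the implementation used with the proposal decoder: proposals are assumed to be `(start, end)` pairs in seconds, `temporal_iou` and `longest_first_nms` are hypothetical names, and the subsequent division of the remaining overlaps according to proposal lengths is deliberately left out.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def longest_first_nms(proposals, iou_threshold=0.2):
    """Non-maximum suppression that ranks proposals by length.

    At each iteration the longest remaining proposal is kept, and every
    proposal whose IoU with it exceeds the threshold is discarded.
    """
    remaining = sorted(proposals, key=lambda p: p[1] - p[0], reverse=True)
    kept = []
    while remaining:
        longest = remaining.pop(0)
        kept.append(longest)
        remaining = [p for p in remaining if temporal_iou(longest, p) <= iou_threshold]
    return sorted(kept)  # chronological order


# Made-up proposals; (2, 9) overlaps strongly with (0, 10) and is removed.
proposals = [(0, 10), (2, 9), (9, 15), (14, 20)]
kept = longest_first_nms(proposals, iou_threshold=0.2)
print(kept)
```

The kept proposals may still overlap slightly, here by one second at each boundary, which is why a further step is needed to obtain non-overlapping segments.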

The results are shown in Table VIII. To validate our method’s performance on Tasty V2, we use ground truth segments, ‘video (GT)’, fixed temporal windows, ‘video (window)’, and segment proposals, ‘video (proposal)’, as input to our video encoder during inference. Compared to using temporal windows, proposals improve ingredient scores by 2% and BLEU4 by 12%. The other scores show slight differences. This small performance gap between the window and proposal-based segments indicates the difficulty of partitioning our videos and motivates exploring more robust ways of generating segments/proposals. The text-based results, ‘text’, maintain the highest scores in all metrics, encouraging us to further develop visual models to tackle anticipation on our challenging zero-shot dataset.

Fig. 11: Next step predictions from our visual model for ‘Salted Caramel Hot Chocolate’ in blue. Note that our model predicts the next steps without having observed the corresponding video segment.
Fig. 12: Next step predictions for ‘Garlic Knots’ shown in blue. After baking in step 7, our visual model predicts that the dish should be served. Yet when presented with visual evidence of the garlic butter in step 8, it correctly predicts that the knots should be brushed with the mixture in step 9.

6.4 Supervised vs. Zero-Shot Learning

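| Method | ING | VERB | BLEU1 | BLEU4 | METEOR |
| --- | --- | --- | --- | --- | --- |
| sup. visual (GT) | 20.93 | 24.76 | 22.11 | 1.21 | 10.66 |
| sup. visual (window) | 18.90 | 23.15 | 21.09 | 1.03 | 10.22 |
| sup. visual, w/o pre-train | 2.69 | 19.43 | 15.05 | 0.15 | 5.89 |
| sup. text | 24.56 | 27.24 | 24.94 | 1.99 | 12.50 |
| zero visual (GT) | 17.77 | 23.11 | 20.61 | 0.84 | 9.51 |
| zero visual (window) | 6.04 | 23.19 | 20.30 | 0.76 | 9.27 |
| zero visual, w/o pre-train | 1.58 | 17.83 | 14.54 | 0.01 | 5.03 |
| zero text | 19.90 | 24.86 | 23.06 | 1.47 | 10.98 |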
TABLE IX: Comparison of our zero-shot and supervised setting on YouCookII, computed using 4-fold cross-validation. Supervised results are better overall. Without pre-training on Recipe1M, the performance drop is significant.

YouCookII is a suitable dataset to compare the differences between supervised and zero-shot learning. As the provided splits for this dataset are not zero-shot and overlap in the dishes for training and test, we create our own splits based on distinct dishes for this ablation study. We divide the dataset into four splits based on the 89 dishes, 22 dishes per split, and use three splits for training and half of the videos in the fourth split for testing. In the zero-shot setting, the videos from the other half of the fourth split are unused, while in the supervised setting, they are included as part of the training.
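The split protocol described above can be sketched as below. This is a hedged illustration under stated assumptions: `dish_folds` and `make_split` are hypothetical helper names, the held-out videos are halved per dish (the text does not say whether the halving is per dish or per fold), and dishes are dealt round-robin into folds, so a dish count that is not divisible by four yields folds of slightly different sizes.

```python
import random
from typing import Dict, Iterable, List, Tuple


def dish_folds(dishes: Iterable[str], n_folds: int = 4, seed: int = 0) -> List[List[str]]:
    """Partition dishes (not videos) into folds so that no dish is shared."""
    shuffled = sorted(dishes)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def make_split(videos_by_dish: Dict[str, List[str]],
               folds: List[List[str]],
               test_fold: int,
               supervised: bool) -> Tuple[List[str], List[str]]:
    """Return (train_videos, test_videos) for one cross-validation fold.

    Zero-shot: the second half of each held-out dish's videos is unused.
    Supervised: that second half is added to the training set.
    Assumption: the halving is done per dish.
    """
    train, test = [], []
    for i, fold in enumerate(folds):
        for dish in fold:
            vids = sorted(videos_by_dish[dish])
            if i != test_fold:
                train.extend(vids)
                continue
            half = len(vids) // 2
            test.extend(vids[:half])
            if supervised:
                train.extend(vids[half:])
    return train, test
```

Both settings share the same test videos for a given fold, so any difference in scores comes only from whether videos of the test dishes were seen during training.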

We report our results as averages over the four cross-folds in Table IX. As expected, the predictions are better when the model is trained under a supervised setting than under a zero-shot setting. This is true for all inputs, with the same drop as observed previously when moving from text to video inputs and when moving from ground truth video segments to fixed window segments. However, the difference between the supervised and zero-shot settings (‘sup. visual’ vs. ‘zero visual’) is surprisingly much smaller than the difference between a supervised setting with and without pre-training on Recipe1M (‘sup. visual’ vs. ‘sup. visual w/o pre-train’). This suggests that having a large corpus for pre-training is more useful than repeated observations for a specific dish.

Figure 13 shows a detailed transition from the zero-shot scenario (no videos about the evaluated dish in the training set) to a one-shot setting (only one similar video) and the incremental addition of training videos until the fully supervised case (average of 11 videos from the same dish). Performance increases as more videos are added, indicating that the model is learning and suggesting that more than 11 videos (the current supervised setting) would further improve the supervised performance.

We show that knowledge transfer considerably improves our method’s predictions; see Sec. 5.4. To further validate our claims, we compare our method against different video captioning methods in Tables V, X, and VII for the Tasty Videos V1, V2, and YouCookII datasets, respectively. Unlike predicting future steps, captioning methods generate sentences after observing their visual data. In principle, this should be an easier task than predicting the future.

We compare our model on the validation set of YouCookII against two captioning methods [yao2015describing, zhou2018end] in Table VII. End-to-end masked transformer [zhou2018end] performs dense video captioning by both localizing steps and generating descriptions for these steps. Instead of separating the captioning problem into the two stages of proposal generation and captioning, [zhou2018end] produce proposals and descriptions simultaneously. Their work is composed of a transformer-based [vaswani2017attention] video encoder for context-aware features, a proposal decoder similar to [zhou2018towards] that localizes action proposal candidates, and finally, a transformer-based decoder that generates captions. TempoAttn [yao2015describing] is an RNN-based encoder-decoder with attention. A variant of TempoAttn [yao2015describing] is trained on YouCookII by Zhou et al. [zhou2018end] after several changes were made to the model for a fair comparison, including adding a Bi-LSTM context encoder and adding temporal attention.

End-to-end (GT) [zhou2018end] 18.33 1.14 6.73
ours visual (GT) 19.34 1.71 12.64
TABLE X: Captioning results from End-to-end [zhou2018end] vs. our method evaluated for next step prediction on the Tasty V2 dataset. Both methods are evaluated using GT segments.
Fig. 13: Zero-shot (‘Zero’) vs. supervised (‘Sup.’) comparison on YouCookII when the number of training videos from the same dish is increased. When more videos from the same dish are added into the training set, BLEU4 score increases.

6.5 Comparisons to Video Captioning

In Table VII, we see that even though the anticipation task is more difficult than captioning, our method outperforms both of the captioning methods in BLEU4 and METEOR scores. Compared to the state-of-the-art video captioning method, [zhou2018end], our visual model achieves a METEOR score twice as high and a BLEU4 score four times higher. We attribute the better performance of our method compared to the captioning methods to the pre-training on the Recipe1M dataset, which allows our model to generalize. Note that for YouCookII, as we use all the videos in the training set, our training is no longer a zero-shot but a supervised scenario.

Table V compares our model against different captioning methods on the Tasty V1 dataset. We also test S2VT [venugopalan2015sequence], an RNN-based encoder-decoder, on the ground truth segments of Tasty V1 for captioning. Our visual model outperforms this baseline, especially for ingredient recall, by 13%, and with an improvement of 0.3 in BLEU4 score. To highlight the difficulty of predicting future steps compared to captioning, we train S2VT for predicting the next step from the observation of the current step, ‘S2VT [venugopalan2015sequence] next (GT)’. Our visual model outperforms this variation by a significant margin for all scores. We also test a state-of-the-art video captioning method, ‘End-to-end’ [zhou2018end], on Tasty V1 and get BLEU4 and METEOR scores of 0.54 and 5.48, respectively (vs. our future prediction scores of 1.23 / 11.00). The poor performance is likely due to the increased dish diversity and difficulty of our dataset compared to YouCookII.

Finally, we train the End-to-end masked transformer [zhou2018end] on our Tasty V2 dataset for captioning. Table X compares our model to this work using ground truth segments. Similar to our observations on other datasets, although ‘End-to-end [zhou2018end]’ incorporates context into predictions and predicts the current observation, our method, which predicts the next steps, outperforms it for all sentence scores.

6.6 Human Ratings

As automated scores such as BLEU and METEOR are not fully representative of the correctness of the predicted steps, we also ask humans to evaluate our model’s predictions. We invite three volunteers to assess how well the anticipated steps match the ground truth with scores 0 (‘not at all’), 1 (‘somewhat’), or 2 (‘very well’). If the prediction receives a score of 0, we additionally ask the participant to judge if the predicted step is still a plausible future prediction, again with the same scores of 0 (‘not at all’), 1 (‘somewhat’), or 2 (‘very likely’). The study is done on a subset of 30 recipes from Recipe1M’s test set, each with seven steps. Ratings are compared to automated sentence scores in Figure 14.

In Figure 14, the upper graph (a) shows the results of the human raters. In this plot, ‘exact match’ corresponds to humans assessing if the predicted steps match the ground truth. Raters report a score close to 1 for the initial step predictions, indicating that our method, even by only seeing the ingredients, can start predicting plausible steps. Scores increase towards the end of the recipe and are lowest at step 3. ‘Future match’ corresponds to humans assessing if that step is a plausible future prediction. The average score of the predicted steps being a possible future prediction is consistently high across all steps. Even if the predicted step does not exactly match the ground truth, human raters still consider it possible for the future, including the previously low-rated step 3. Overall, the ratings indicate that the predicted steps are plausible.

The lower graph (b) in Figure 14 shows automated scores for the same set of recipes used in our user study. The left plot shows the standard scores for the predicted sentences matching the ground truth. Overall, trends are very similar to the user study, including the low-scoring step 3. To match the second setting of the user study, we compute the sentence scores between the predicted sentence and each of the next four future ground truth steps, and select the step with the maximum score as our future match. These scores are plotted in the lower right plot of Figure 14. Similar to the second setting in the human study, the sentence scores increase overall.
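The future-match computation described above can be sketched as follows. This is a minimal illustration rather than the evaluation code used for Figure 14: `unigram_f1` is a simple stand-in for BLEU or METEOR, `future_match` is a hypothetical helper name, and the sketch assumes that the window of four ground truth steps starts at the step being predicted.

```python
from collections import Counter
from typing import Callable, List, Tuple


def unigram_f1(pred: str, ref: str) -> float:
    """Stand-in sentence score (NOT BLEU or METEOR): unigram-overlap F1."""
    p, r = Counter(pred.lower().split()), Counter(ref.lower().split())
    overlap = sum((p & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def future_match(pred: str,
                 gt_steps: List[str],
                 next_idx: int,
                 horizon: int = 4,
                 score: Callable[[str, str], float] = unigram_f1) -> Tuple[int, float]:
    """Score the prediction against up to `horizon` ground truth steps,
    starting at `next_idx`, and return (best step index, best score)."""
    candidates = gt_steps[next_idx:next_idx + horizon]
    if not candidates:
        raise ValueError("no future ground truth steps left to compare against")
    scores = [score(pred, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return next_idx + best, scores[best]
```

Because the exact-match score is one of the candidates inside the window, the future-match score can never be lower than it, which is consistent with the overall increase reported for the second setting.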

(a) Human ratings.
(b) Sentence scores.
Fig. 14: We conduct a user study and ask human raters to assess how well the predicted sentences match the ground truth sentences. We present the comparison of human ratings (a) versus automated sentence scores (b).

7 Conclusion

In this paper, we posed a new problem setting of zero-shot action anticipation. We presented a model that can generalize instructional knowledge from the text domain and be applied to videos. Using this model, we tackled the challenging task of predicting the steps of complex tasks from visual data. Our model produces coherent and plausible future steps from both text and video inputs. To date, such a task has not otherwise been possible because of the scarcity of annotated training data. Our evaluation shows that our anticipation method is more competitive than all other baselines, even when compared against video captioning methods that have access to the visual data. While we score well for keyword recall, such as ingredients and verbs, our sentence scores, such as the challenging BLEU4, are still poor. We believe this highlights the difficulty of our task, and we aim for improvements in future work.

To complement our new task and model, we presented a diverse dataset of 4022 cooking videos and recipes. All the videos are annotated with the temporal boundaries of the textual recipe steps. Our dataset includes cooking videos covering various dish categories, cookware, and ingredients, and provides researchers with a rich database to study the challenging zero-shot anticipation problem. We also hope that its diversity will motivate researchers to study tasks beyond anticipation, such as dense video captioning, temporal segmentation, visual grounding, or retrieval.

Currently, our method only employs textual recipes and ingredient keys. It can further be improved using additional cues such as the amount of ingredients, which are crucial for real-life instructions. However, this will likely require a dedicated architecture for handling such data to keep track of ingredient amounts. Further improvements could be achieved by aggregating information from multiple similar recipes, or from user feedback and comments.