Video Representations of Goals Emerge from Watching Failure

by Dave Epstein et al.

We introduce a video representation learning framework that models the latent goals behind observable human action. Motivated by how children learn to reason about goals and intentions by experiencing failure, we leverage unconstrained video of unintentional action to learn without direct supervision. Our approach models videos as contextual trajectories that represent both low-level motion and high-level action features. Experiments and visualizations show the model is able to predict underlying goals, detect when action switches from intentional to unintentional, and automatically correct unintentional action. Although the model is trained with minimal supervision, it is competitive with highly-supervised baselines, underscoring the role of failure examples for learning goal-oriented video representations. The project website is available at




1 Introduction

Consider the person in Figure 1, which shows a man heating a wine bottle with a blowtorch. Even if this action is unconventional, we still perceive the action as rational in the context of the goal (to open the bottle). Evidence suggests that this ability to reason about goals is learned before our second birthday [tomasello2009usage, woodward2009infants], and it plays a key role in children's rapid development of communicative skills [tomasello2005understanding] and mental representations of the world [barresi1996intentional]. However, despite the importance of this problem, learning visual representations of human goals has remained challenging.

Visual action recognition has largely focused on learning to recognize action categories [carreira2017quo, ji2019action], which indicate how a person acted, and not why they acted. While this has spurred tremendous progress in video analysis, the resulting video representations do not discriminate the underlying goals of action. We hypothesize that a key missing piece is the lack of examples demonstrating the failure to achieve goals. Similar to how a child learns about goals by experiencing failure, we leverage a large dataset containing both intentional and unintentional real-world action [oops] to learn goal-oriented representations of video.

Figure 1: What is this person’s goal? Although only the action is observable, we are still able to predict the goal behind the action (to open the bottle). In this paper, we introduce a model to learn video representations that encode goals as latent action trajectories.

We present a video model that learns a trajectory representation of action, and encodes goals as the path of the trajectory. We input entire videos to our model by first dividing them into short clips, which are run through a 3D CNN to learn low-level motion features. We then pass the motion features into a Transformer model, which models relations between different periods in its input, and thus represents the entire action as context-aware latent trajectories. The whole model is trained from scratch in an end-to-end manner.

Our experiments show that observing failure is vital for learning representations of goals. We evaluate our model on three visual tasks for goal prediction. First, we experiment on detecting unintentional action in video, and we demonstrate state-of-the-art performance on this task. Second, we evaluate the representation at predicting goals with minimal supervision, which we characterize as structured categories consisting of subject, action, and object triplets. Lastly, we use our representation to automatically “correct” unintentional action and decode these corrections by retrieving from other videos or generating categorical descriptions.

Our main contribution is an approach that (1) models long video sequences as latent-space trajectories with indirect supervision and, in doing so, (2) learns a goal-directed representation of videos. Since the goals are encoded in the path of the trajectory, we also show how to find minimal adjustments to the path to automatically correct unintentional action in video. The remainder of this paper will describe this approach in detail. Code, data, and models are available online.

Figure 2: Learning goal-oriented video representations: We show an overall view of our approach. First, we embed short clips using a 3D CNN to represent short-term motion features. Then, we run the sequence of CNN embeddings through a stack of Transformers, where they interact with each other to finally form a context-adjusted latent action trajectory. The model is trained end-to-end from scratch, with intentionality and temporal coherence losses (depicted top-left). Points along the resultant trajectory are decoded with linear projections into various spaces (top-middle).

2 Related Work

Recognizing action in video: Previous work explores many different approaches to recognizing action in video. Earlier directions develop hand-designed features to process spatio-temporal information for action recognition [laptev2005space, klaser2008spatio, wang2011action, pirsiavash2014parsing]. Popular deep learning architectures for images were extended to operate directly on video by modeling time as a third dimension [hara2018can, carreira2017quo, simonyan2014two, luvizon20182d, ji2019action]. To deal with variable-length or long video input, previous work frequently takes one of two approaches: pooling or recurrent networks. However, pooling loses spatial and/or temporal connections between different moments of video. Recurrent networks, being sequential, must select important video features ahead of time, without viewing the full context; they are also known to struggle to relate far-apart inputs, which creates significant challenges in modeling long-term video.

The approach of sun2019contrastive is most similar to ours: they also run clips through 3D CNNs and Transformers, but they freeze the 3D CNNs and train on a "masked video modeling" task, ultimately discarding contextually learned temporal dynamics across videos, since their goal is an effective cross-modal representation. To address these drawbacks, we propose a 3D-CNN-Transformer model that combines short-term, granular motion detection with a long-term action representation, trained end-to-end from scratch.

Learning about intention: Evidence in developmental psychology quantifies why humans perceive intention [barresi1996intentional], how we perceive it [woodward2001infants, woodward2009emergence, woodward2009infants], when we begin to do so [meltzoff1995understanding, meltzoff1999toddlers], and what allows us to infer the goals behind others' behavior [shultz1980development]. While these questions have been studied in early stages of child development, the same abilities have remained a challenge for machines in unconstrained situations. One possible reason for this is a lack of realistic data. We take advantage of incidental signals in unconstrained videos [oops] to learn video representations.

Leveraging adversarial attacks: We use adversarial gradients [goodfellow2014explaining, kurakin2016adversarial] to find corrections to the trajectory. Previous work has studied adversarial attacks in steganography [hayes2017generating, zhu2018hidden], software bug-finding [she2019neuzz], generating CAPTCHAs [von2003captcha] to fool modern deep nets [osadchy2017no], generating interesting images [simonyan2013deep], creating real-world 3D objects that trick neural networks [zhou2018invisible, athalye2017synthesizing], and in training models more robust to test-time adversarial attacks [miyato2015distributional, goodfellow2014explaining, miyato2016adversarial]. jahanian2019steerability extends this concept to generative models, setting a new image output as a target label and perturbing latent space. In video, jiang2019black and wei2018transferable introduce various methods to fool action recognition networks, often on a 3D CNN backbone. We instead utilize adversarial attacks to manipulate and correct unintentional action.

Figure 3: Labeling goals and failures in video: To evaluate our representation, we annotate the Oops! dataset with short sentences describing the goals and failures. We extract subject-verb-object triples and train a decoder on learned representations. The intentional and unintentional action in the dataset span a diverse range of categories.

3 Method

In this section, we introduce our framework to learn video representations as trajectories, formulate learning objectives, and use the learned representations to predict goals in video.

3.1 Visual Dynamics as Trajectories

The conventional approach to representing video data is to run each clip through a convolutional network and combine clip representations by pooling, in order to run models on entire sequences [feichtenhofer2019slowfast, han2019video, gao2019listen, xu2017r]. However, these methods do not allow for connections between different moments in video and cannot richly capture the temporal relationships that give rise to goal-directed action. While recurrent networks [hochreiter1997long] are more expressive, they require compressing history into a fixed-length vector, which forces models to select relevant visual features without viewing the full context and makes it difficult to reason about connections between different moments, especially when they are far apart.

Temporal streams of visual input are highly contextual with both short- and long-term dependencies. We will represent video as a contextually-adjusted trajectory of latent representations in a learned space. Figure 2 illustrates this architecture, which has both a motion and action level:

Motion Level: First, we separate video into short clips (or tokens) in order to make initial motion-level observations. Let $V$ be a video, and let $x_t$ be a video clip centered at time $t$. We estimate the motion-level features $z_t = \phi(x_t)$, where $\phi$ is a 3D CNN [3dcnn].

Action Level: Second, we model relations between the motion features $z_1, \dots, z_T$ to construct a contextual trajectory $h_1, \dots, h_T = \Phi(z_1, \dots, z_T)$, where $\Phi$ is the Transformer [transformer]. The Transformer architecture captures relations in its input by performing self-attention among the tokens in its input sequence, and outputs a contextual representation across the video. Since it can incorporate contributions from both nearby and far-away moments into its representation of each clip, it is well suited to modeling higher-level connections between the atomic actions recognized at the motion level. The resulting sequence of hidden vectors $h_1, \dots, h_T$ induces a trajectory, which we can use for different downstream tasks.
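
The two-level architecture can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's configuration: module sizes are arbitrary, the tiny 3D CNN stands in for the full video backbone, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class TrajectoryModel(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        # Motion level: a tiny 3D CNN standing in for the full clip backbone.
        self.cnn = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, d_model),
        )
        # Action level: self-attention over the sequence of clip embeddings.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, clips):                  # clips: (B, T, 3, frames, H, W)
        b, t = clips.shape[:2]
        z = self.cnn(clips.flatten(0, 1))      # per-clip motion features z_t
        z = z.view(b, t, -1)
        return self.transformer(z)             # contextual trajectory h_1..h_T

model = TrajectoryModel()
clips = torch.randn(2, 8, 3, 4, 16, 16)        # 2 videos, 8 one-second clips
h = model(clips)                               # shape (2, 8, 128)
```

Each output vector $h_t$ is the context-adjusted embedding of clip $t$, i.e. one point along the latent trajectory.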

3.2 Learning with Indirect Supervision

We train the representation with indirect supervision that is accessible at large scales. We use the following two objectives for learning:

Action Intentionality: We train the model to temporally localize when action is unintentional. Let $t^*$ be the video frame where the action shifts from intentional to unintentional (which we assume is labeled [oops]). For each video clip $x_t$, we set the target label $y_t$ according to whether the labeled $t^*$ happens before, during, or after the clip. The model estimates $\hat{y}_t = W_{\text{int}} h_t$, where $W_{\text{int}}$ is a jointly learned projection matrix to $\mathbb{R}^3$. We train with a cross-entropy loss between $\hat{y}_t$ and $y_t$, where each class weight is set to the inverse frequency of the class label to balance training. We label this loss $\mathcal{L}_{\text{int}}$.
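
A minimal sketch of this class-balanced cross-entropy, assuming three clip labels (before / during / after the failure point) and inverse-frequency weights computed over the batch; the function name and the per-batch weighting granularity are illustrative assumptions.

```python
import numpy as np

def balanced_cross_entropy(logits, labels, n_classes=3):
    """Cross-entropy over {before, during, after} labels, with each class
    weighted by the inverse frequency of its label."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    weights = counts.sum() / np.maximum(counts, 1.0)   # inverse frequency
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]   # per-clip loss
    return float((weights[labels] * nll).mean())

logits = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 2.2]])
labels = np.array([0, 1, 2])                           # one clip per class
loss = balanced_cross_entropy(logits, labels)
```

The inverse-frequency weighting keeps the rare "during" (transitional) label from being drowned out by the far more common "before" and "after" clips.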

Temporal Consistency: We also train the model to learn temporal dynamics with a self-supervised consistency loss [han2019video, misra2016shuffle, fernando2017self, wei2018learning, jayaraman2016slow, BERT]. Let $c = 1$ indicate that the input sequence is consistent. We predict whether the input sequence is temporally consistent as $\hat{c} = w_{\text{coh}}^{\top} h$, where $w_{\text{coh}}$ is a jointly learned projection to $\mathbb{R}$. We train with the binary cross-entropy loss between $\hat{c}$ and $c$. We label this loss $\mathcal{L}_{\text{coh}}$ (next sequence prediction).

Figure 4: Automatically correcting unintentional action: Starting from an initial trajectory, we use model gradients as a signal to correct the course of points representing unintentional action. This corrected trajectory is evaluated by decoding into various feature spaces.

We create inconsistent sequences as follows. For each video sequence in the batch, with some probability we bisect the sequence into two parts at a random index. For each such sequence, we then perturb one or both of the segments: with some probability we swap the order of the two segments, and otherwise we pick a randomly sized subsequence from another video sequence in the batch to replace one of the two segments. (The specific probabilities are hyperparameters.)

Training: To train our model, we set the overall loss to $\mathcal{L} = \mathcal{L}_{\text{int}} + \lambda \mathcal{L}_{\text{coh}}$, where $\lambda$ is a hyperparameter controlling the importance of the coherence loss; we set $\lambda$ to balance the magnitudes of the two losses. We sample sequences of one-second-long clips, run each clip through the motion-level 3D CNN, pass all outputs through the Transformer stack, and calculate the gradients. We optimize the loss with stochastic gradient descent. At inference time, we run entire continuously-sampled videos through our model.

3.3 Completing Goals by Auto-Correcting Trajectories

We use this learned representation to complete the goals of the people in the scene [meltzoff1995understanding, skulmowski2015investigating]. However, since the model is trained with indirect supervision, the trajectories are not supervised with goal states. We propose to formulate goal completion as a latent trajectory prediction problem. Given an observed trajectory of unintentional action $z_{1:T}$, we seek a new, minimally modified trajectory $\tilde{z}_{1:T}$ that is classified as intentional. By analogy to how word processors auto-correct a sentence, we call this process action auto-correct. We illustrate this process in Figure 4.

We find this correction in feature space, not pixel space, to yield interpretable results. We find a gradient to the features that switches the prediction to the "intentional" category for all clips. We formulate an optimization problem with two soft constraints. First, we want to increase the classification score of intentional action, via $\mathcal{L}'_{\text{int}}$. Second, we want the resulting trajectory to be temporally consistent, via $\mathcal{L}'_{\text{coh}}$; without this term, the corrected trajectory is not required to be coherent with the initial part of the original trajectory. We minimize

$$\mathcal{L}'(\tilde{z}) = \mathcal{L}'_{\text{int}}(\tilde{z}) + \beta \, \mathcal{L}'_{\text{coh}}(\tilde{z}),$$

where the primed losses are the original loss functions with target labels overridden to be the intentional class, and $\beta$ is a scalar that balances the two terms. We only modify $\tilde{z}_t$ on the clips which the model classifies as unintentional in the first place, which we denote $U$. The coherence loss is also truncated at its original value, causing the optimization to favor a trajectory that is no less temporally coherent than the original one.

To solve this optimization problem, we use the iterative target-class method [kurakin2016adversarial], which repeatedly runs the input through the model and modifies it in the direction of the desired loss. For every $\tilde{z}_t$ corresponding to a clip where action is unintentional ($t \in U$), we repeat a gradient attack step towards the target:

$$\tilde{z}_t \leftarrow \tilde{z}_t - \alpha \, \mathrm{sign}\!\left(\nabla_{\tilde{z}_t} \mathcal{L}'\right),$$

where $\alpha$ is the attack step size (we found the values we chose to be reasonable). We repeat this process until the network is "fooled" into classifying the input as intentional action, for at most a fixed number of iterations or until the loss drops below a threshold. Once the halting condition is satisfied, we run the modified vectors through the model, yielding a trajectory of corrected action that encodes successful completion of the goal. This trajectory can be read out into various spaces (Section 5.1).
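
The iterative update can be illustrated on a toy, fully linear "intentionality" scorer. Everything here is an illustrative stand-in: the real method backpropagates the combined loss through the 3D CNN and Transformer, whereas this sketch uses a single dot product whose gradient is trivially the weight vector; the step size and iteration cap are arbitrary.

```python
import numpy as np

def auto_correct(z, w, step=0.1, max_iters=100):
    """Nudge a clip feature z with signed-gradient steps until the linear
    scorer classifies it as intentional (w @ z > 0)."""
    z = z.astype(float).copy()
    for _ in range(max_iters):
        if z @ w > 0:                        # halting condition: "fooled"
            break
        # For the linear score w @ z, the gradient w.r.t. z is just w.
        z += step * np.sign(w)               # targeted signed-gradient step
    return z

z0 = -np.ones(4)                             # starts classified "unintentional"
w = np.array([1.0, 2.0, -1.0, 0.5])          # toy intentionality scorer
z_corr = auto_correct(z0, w)                 # now scores as intentional
```

As in the paper's procedure, only the update direction comes from gradients of the target ("intentional") loss; the feature is modified in place until the classifier flips.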

In other words, goals are the adversarial examples [goodfellow2014explaining] of failed action: instead of viewing adversarial examples as a bug, we view them as a feature [NIPS2019_8307].

Method                                              Localization          Classification
                                                    0.25 sec    1 sec     Accuracy
Kinetics supervision [carreira2017quo]              69.2        37.8      53.6
Kinetics supervision [carreira2017quo] + finetune   75.9        46.7      64.0
3D CNN only [oops]                                  68.7        39.8      59.4
Our model:
  Classification only                               64.9        33.6      73.0
    + Pseudo-GT                                     72.4        39.9      77.7
  + Coherence loss                                  63.2        32.4      72.1
    + Pseudo-GT                                     71.8        39.6      77.8
Chance                                              25.9        6.8       33.3
Table 1: Detecting unintentional action: We evaluate models on classifying and localizing unintentional action. Our model is competitive with Kinetics pretraining despite training from scratch, and outperforms it on classification.

4 Unintentional Action and Goals Dataset

Figure 5: Decoding the Trajectories: After estimating the decoder, we read out triplets from different parts of videos. The first row shows intentional action, and the decoder predicts the goal. The second row shows unintentional action, and the decoder now predicts the failure instead. The final row shows unintentional videos that have been auto-corrected, and the decoder returns to predicting goals, suggesting the auto-correct procedure shifts the failed trajectories towards successful ones.

Similar to how children learn about goals by perceiving failed attempts at executing them [meltzoff1999toddlers], we hypothesize that examples of failure are crucial for learning to discriminate between action and goal. We use the recently released Oops! dataset [oops], a large collection of videos containing intentional and unintentional action, to train and evaluate our models. Videos in this dataset are annotated with the moment at which action becomes unintentional. (In addition to the ground-truth annotations provided by [oops], we run their pretrained model on the unlabeled portion of the training set to collect pseudo-ground-truth, which we found improves performance.) Figure 3 shows some example frames. We also use the Kinetics dataset [kinetics] to evaluate models, since it contains a wide range of successful action.

Goal Annotation: Established action datasets in computer vision [gu2018ava, li2020ava] contain annotations about person and object relationships in scenes, but they do not directly annotate the goal, which we need to evaluate goal prediction. We collect unconstrained natural-language descriptions of a subset of videos in the Oops! dataset (4675 training videos and 3404 test videos), prompting Amazon Mechanical Turk workers (restricted to those with high approval rates) to answer "What was the goal in this video?" as well as "What went wrong?". We then process these sentences with a natural-language library to detect lemmatized subject-verb-object (SVO) triples, manually correcting for common constructions such as "tries to X" (where the verb lemma is detected as "try", but we would like "X"). The final vocabulary contains 3615 tokens. Figure 3 shows some example annotations. We use the SVO triples to evaluate the video representations.

5 Experiments

We experiment with our model on two tasks: recognizing intentional action, and predicting goals. We train our method from scratch on a dataset of unintentional action oops.

5.1 Experimental Setup

Baselines: We evaluate the 3D CNN from [oops], which is trained from scratch on the action intentionality loss (Section 3.2). We also evaluate a 3D CNN pre-trained on Kinetics action recognition, which is frozen unless indicated otherwise. We compare goal prediction to a frozen, randomly initialized network (denoted "Scratch"). We also consider several ablations of our model. To evaluate representations, we freeze them and implement different decoders, as described below.

Retrieval: This decoder requires no further training and performs nearest-neighbor retrieval among one-second-long clips in the test sets of the Oops! and Kinetics datasets. While we do not learn a representation using Kinetics, we include it in retrieval to see whether auto-corrected actions match successfully executed goals in Kinetics rather than failed attempts (see Section 5.3). This decoder maintains a lookup table of all clip representations and computes the $k$-nearest neighbors from different videos using cosine distance.
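
The retrieval decoder amounts to a cosine-similarity lookup over a table of clip embeddings. A minimal sketch (function name, shapes, and random data are illustrative):

```python
import numpy as np

def nearest_clips(query, table, k=3):
    """Return indices of the k table rows most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    t = table / np.linalg.norm(table, axis=1, keepdims=True)
    return np.argsort(-(t @ q))[:k]         # highest similarity first

rng = np.random.default_rng(0)
table = rng.normal(size=(10, 16))           # lookup table of 10 clip embeddings
query = table[4] + 0.01 * rng.normal(size=16)
idx = nearest_clips(query, table)           # idx[0] recovers row 4
```

In the paper's setting, the table holds trajectory points from held-out Oops! and Kinetics clips, with neighbors restricted to come from different videos.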

Categorization: We also implement a decoder using the textual labels we gathered for the videos. Here, the task is to describe the goals of the input video using the SVO triplets. We train a decoder to predict the main goal for clips with intentional action, and to predict what went wrong for clips with unintentional action. The estimated decoder should describe the video with descriptions of the goal, for example "athlete wins game" (a goal), and not "woman throws ball" (an action). We train a linear layer to output one vector each for subject, verb, and object. As ground truth, we use BERT word embeddings [BERT], calculating scores with a dot product and running them through a softmax and a cross-entropy loss.
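
The slot-wise scoring can be sketched as below, with a random matrix standing in for the frozen BERT word-embedding table; the function name, dimensions, and data are all illustrative.

```python
import numpy as np

def svo_probs(h, heads, word_embs):
    """h: (d,) trajectory point; heads: slot name -> (d, e) linear projection;
    word_embs: (V, e) word-embedding table (stand-in for frozen BERT vectors).
    Returns a vocabulary distribution per slot."""
    out = {}
    for slot, W in heads.items():
        logits = word_embs @ (h @ W)        # dot product with every vocab word
        p = np.exp(logits - logits.max())   # stable softmax
        out[slot] = p / p.sum()
    return out

rng = np.random.default_rng(1)
h = rng.normal(size=32)                     # one point on the trajectory
heads = {s: rng.normal(size=(32, 8)) for s in ("subject", "verb", "object")}
word_embs = rng.normal(size=(20, 8))        # toy 20-word vocabulary
probs = svo_probs(h, heads, word_embs)
```

Training minimizes cross-entropy between each slot's distribution and the annotated SVO word, so only the three linear heads are learned.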

5.2 Unintentional Action Detection

Figure 6: Retrievals from Auto-corrected Trajectories: We show the nearest neighbors from auto-corrected action trajectories, using our proposed method and a linearization baseline. The retrievals are computed across both the Oops! and Kinetics datasets. The corrected representations yield corrected trajectories that are often embedded close to the goal.
                                Subject        Verb           Object         Average        All three
Method                          Top 1  Top 5   Top 1  Top 5   Top 1  Top 5   Top 1  Top 5   Top 1  Top 5
Kinetics [carreira2017quo]      26.79  72.34   27.33  52.67   36.01  64.64   30.04  63.22   2.07   16.46
3D CNN [oops]                   29.44  72.72   26.42  50.36   44.71  57.89   33.52  60.32   2.86   13.85
Scratch                         23.67  55.73   22.74  45.44   44.82  52.67   30.41  51.28   1.42   8.72
Our Model                       34.31  74.50   29.72  54.17   44.95  58.16   36.32  62.27   3.32   14.39
Chance                          0.14 for each element; <0.01 for all three
Table 2: Comparison of Representations: To evaluate how well representations encode goals, we freeze them and estimate a linear projection to predict labelled subject-verb-object triples.

We first evaluate how well the model detects and localizes when action deviates from its goal. We use labels from the test set of [oops] as ground truth. We process entire videos with our model, sampling continuous one-second clips as tokens, and take the predicted localization to be the center of the clip with the maximum probability of failure. We also classify each clip according to its label (intentional, transitional, or unintentional). We show results in Table 1. On localization, our model is competitive with fine-tuning a fully-supervised Kinetics CNN, despite using less data and less supervision. On classification, our network outperforms the Kinetics network by 14%, showing that representing videos as contextual trajectories is effective.

5.3 Goal Prediction

We next evaluate the model at predicting goal descriptions. We train a decoder on the trajectory to read out subject, verb, object triplets. In training, if a sentence has more than one extracted SVO, we randomly select one as ground truth. In testing, we average-pool predictions separately over all clips with intentional action and all clips with unintentional action, and take the maximum over all sentence SVOs, so each video has two pooled predictions: one for intentional action and one for unintentional action. Table 2 shows our model obtains better top-1 accuracy on all metrics than the baselines, including the Kinetics-pretrained model, and is competitive on top-5 accuracy.


(a) Top neuron-SVO correlations

(b) Trajectories in t-SNE
Figure 7: Analyzing the Representation: We probe the learned trajectories. (a) shows the neurons with highest correlation to the words in the SVO vocabulary, along with their top-5 retrieved clips. Neurons that detect intentions across a wide range of action and scene appear to emerge, despite only training with binary labels on the intentionality of action. (b) We show six randomly sampled video trajectories in t-SNE space, before and after auto-correct, superimposed over the embeddings for intentional and unintentional action. Visualizations suggest our approach tends to adjust unintentional action in the direction of successful, intentional action.

5.4 Analysis of Learned Representation

To evaluate how action and goals are embedded in the trajectory representation, we find the minimal "auto-correction" to the unintentional action sequences and probe the results. As a comparison, we implement a simple baseline that linearly extrapolates the trajectory of observed intentional action: if the unintentional action in a sequence of clips begins at clip $k$, we extend the trajectory for each clip $t \geq k$ by setting $h_t = h_{t-1} + (h_{k-1} - h_{k-2})$.
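
A sketch of this linearization baseline, assuming the repeated step is the last displacement observed before the failure point (the exact extrapolation formula is garbled in this copy of the paper, so treat the update as illustrative):

```python
import numpy as np

def linearize(h, k):
    """h: (T, d) latent trajectory; k: index of the first unintentional clip
    (assumed k >= 2). Overwrite clips t >= k by repeating the last observed
    intentional displacement h[k-1] - h[k-2]."""
    out = h.astype(float).copy()
    step = h[k - 1] - h[k - 2]              # last intentional displacement
    for t in range(k, len(h)):
        out[t] = out[t - 1] + step          # straight-line continuation
    return out

# A 2-D trajectory that veers off after clip 3 gets straightened back out.
traj = np.array([[0.0, 0], [1, 0], [2, 0], [9, 9], [7, -3]])
fixed = linearize(traj, 3)                  # clips 3, 4 become [3,0], [4,0]
```

Unlike the adversarial auto-correct, this baseline ignores the model entirely, which is why it shifts SVO probability mass far less (Table 3).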

Figure 6 shows examples of nearest neighbor retrievals of the corrected latent vectors, computing over the Oops! and Kinetics test sets. Despite not training on Kinetics (i.e. on videos with completed goals), our representation can adjust video trajectories such that their nearest neighbors are goals being successfully executed.

                                Intentional SVO        Unintentional SVO
Method                          Acc.      Rank         Acc.      Rank
Kinetics [carreira2017quo]      +0.4      +0.3M        -0.3      -1.2M
3D CNN [oops]                   +0.3      +0.1M        -0.3      -0.6M
Ours (linearized)               +0.6      +1.0M        -0.5      -1.7M
Ours (adversarial)              +1.6      +15.8M       -3.3      -9.3M
Table 3: Evaluating Autocorrection: We show the effect of auto-correct on the SVO decoder's predictions (change in top-5 accuracy and in the rank assigned to the correct triple). Our model shifts probability mass from unintentional to intentional SVOs.

We also examine the effects of auto-correction on the frozen SVO decoder. Table 3 shows these results. For decoders trained on all models, rankings of intentional action SVOs increase while those of unintentional SVOs decrease. However, the changes are greatest for our model. Figure 5 visualizes the output of a frozen SVO decoder on auto-corrected actions, demonstrating the auto-correct process’ ability to encode completed goals in its output trajectories.

We finally probe the model's learned representation to analyze how trajectories are encoded. We measure Spearman's rho correlation between the activation of neurons in the output vectors and words in the SVO vocabulary. Each video is an observation containing neuron activations and an indicator variable for whether each word is present in the ground truth. Many neurons have significant correlation; we show the top 3 in Figure 7(a), along with the 5 clips that activate them most. These neurons appear to discover common actions in the Oops! dataset, despite being trained without any action labels. We also visualize trajectories of some videos using t-SNE (Figure 7(b)), before and after auto-correct. Our model often adjusts trajectories from unintentional action to the region of embedding space occupied by Kinetics videos, shown in the figure as "at goal" action.
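
Spearman's rho is the Pearson correlation of the rank-transformed variables; since the word-presence indicator is binary, ties must get average ranks. A self-contained sketch of the neuron-word probe (data and function names are illustrative):

```python
import numpy as np

def average_ranks(x):
    """Ranks of x, with tied values assigned their average rank."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(len(x), dtype=float)
    for v in np.unique(x):                  # average ranks over each tie group
        mask = x == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman_rho(activations, word_present):
    """Spearman correlation = Pearson correlation of the rank variables."""
    rx = average_ranks(np.asarray(activations, dtype=float))
    ry = average_ranks(np.asarray(word_present, dtype=float))
    rx -= rx.mean(); ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx**2).sum() * (ry**2).sum()))

acts = [0.1, 0.9, 0.2, 0.8, 0.7]            # one neuron across five videos
present = [0, 1, 0, 1, 1]                   # does the word appear in the label?
rho = spearman_rho(acts, present)           # ~0.87: neuron tracks this word
```

Running this over every (neuron, vocabulary word) pair and keeping the largest correlations yields the neuron-word pairs visualized in Figure 7(a).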

6 Conclusion

We introduce an approach to represent videos as contextual trajectories in a learned latent space, leveraging the Transformer architecture. By encoding action as a trajectory, we are able to perform several different tasks, such as decoding to categorical descriptions or manipulating the trajectory. Our experiments show that learning from failure examples, not just successful action, is crucial for learning rich visual representations of goals.


We thank Dídac Surís, Mia Chiquier, Amogh Gupta, Ruoshi Liu, Ishaan Chandratreya, and Boyuan Chen for helpful comments. Funding was provided by DARPA MCS, NSF NRI 1925157, and an Amazon Research Gift. We thank NVIDIA for donating GPUs.

Broader Impact

Human action recognition is critical for situational awareness applications in robotics, healthcare, and security, which may have a large practical impact on society. For example, predicting the goals of actions could enable machines to better assist and communicate with people. A key limitation of our experiments is that we leverage publicly available video data, which is likely biased toward Western cultures. Consequently, the learned representation likely encodes a Western definition of success, which may not generalize to other demographic groups.