Predicting the Future: A Jointly Learnt Model for Action Anticipation

by   Harshala Gammulle, et al.

Inspired by human neurological structures for action anticipation, we present an action anticipation model that enables the prediction of plausible future actions by forecasting both the visual and temporal future. In contrast to current state-of-the-art methods which first learn a model to predict future video features and then perform action anticipation using these features, the proposed framework jointly learns to perform the two tasks, future visual and temporal representation synthesis, and early action anticipation. The joint learning framework ensures that the predicted future embeddings are informative to the action anticipation task. Furthermore, through extensive experimental evaluations we demonstrate the utility of using both visual and temporal semantics of the scene, and illustrate how this representation synthesis could be achieved through a recurrent Generative Adversarial Network (GAN) framework. Our model outperforms the current state-of-the-art methods on multiple datasets: UCF101, UCF101-24, UT-Interaction and TV Human Interaction.



1 Introduction

(a) Action Recognition
(b) Typical Action Anticipation
(c) Proposed Action Anticipation Method
Figure 1: Action anticipation through future embedding prediction. Action recognition approaches (a) carry out the recognition task via fully observed video sequences while the typical action anticipation methods (b) are based on predicting the action from a small portion of the frames. In our proposed model (c) we jointly learn the future frame embeddings to support the anticipation task.

We propose an action anticipation model that uses visual and temporal data to predict future behaviour, while also predicting a frame-wise future representation to support the learning. Unlike action recognition, where recognition is carried out after the event by observing the full video sequence (Fig. 1(a)), the aim of action anticipation (Fig. 1(b)) is to predict the future action as early as possible by observing only a portion of the action [3]. The prediction must therefore rely on the scarce information available in a small number of frames. Fig. 1(c) shows the intuition behind our proposed model. The action anticipation task is accomplished by jointly learning to predict the future embeddings (both visual and temporal) along with the action anticipation task, where the anticipation task provides cues to help compensate for the missing information from the unobserved frame features. We demonstrate that the two tasks complement each other when learnt jointly.

This approach is inspired by recent theories of how humans achieve the action predictive ability. Recent psychology literature has shown that humans build a mental image of the future, including future actions and interactions (such as interactions between objects) before initiating muscle movements or motor controls [17, 10, 31]. These representations capture both the visual and temporal information of the expected future. Mimicking this biological process, our action anticipation method jointly learns to anticipate future scene representations while predicting the future action, and outperforms current state-of-the-art methods.

Figure 2: Action Anticipation GAN (AA-GAN): The model receives RGB and optical flow streams as the visual and temporal representations of the given scene. Rather than utilising the raw streams we extract the semantic representation of the individual streams by passing them through a pre-trained feature extractor. These streams are merged via an attention mechanism which embeds these low-level feature representations in a high-level context descriptor. This context representation is utilised by two GANs: one for future visual representation synthesis and one for future temporal representation synthesis; and the anticipated future action is obtained by utilising the context descriptor. Hence context descriptor learning is influenced by both the future representation prediction, and the action anticipation task.

In contrast to recent works [50, 45, 3] which rely solely on visual inputs, and inspired by [17, 10, 31], we propose a joint learning process which attends to salient components of both visual and temporal streams, and builds a highly informative context descriptor for future action anticipation. In [50] the authors demonstrate that the context semantics, which capture high-level action-related concepts including environmental details, objects, and historical actions and interactions, are more important when anticipating actions than the actual pixel values of future frames. Furthermore, the semantics captured through pre-trained deep learning models show robustness to background and illumination changes as they tend to capture the overall meaning of the input frame rather than simply using pixel values [8, 26]. Hence in the proposed architecture we extract deep visual and temporal representations from the input streams and predict the future representations of those streams.

Motivated by recent advances in Generative Adversarial Networks (GAN) [16, 33, 1] and their ability to automatically learn a task-specific loss function, we employ a GAN learning framework in our approach as it provides the capability to predict a plausible future action sequence.

Although there exist individual GAN models for anticipation [56, 32], we take a step further in this work. The main contribution is the joint learning of a context descriptor for two tasks, action anticipation and representation prediction, through the joint training of two GANs.

Fig. 2 shows the architecture of our proposed Action Anticipation GAN (AA-GAN) model. The model receives the video frames and optical flow streams as the visual and temporal representations of the scene. We extract a semantic representation of the individual streams by passing them through a pre-trained feature extractor, and fuse them through an attention mechanism. This allows us to provide a varying level of focus to each stream and effectively embed the vital components for different action categories. Through this process low level feature representations are mapped to a high-level context descriptor which is then used by both the future representation synthesis and classification procedures. By coupling the GANs (visual and temporal synthesisers) through a common context descriptor, we optimally utilise all available information and learn a descriptor which better describes the given scene.

Our main contributions are as follows:

  • We propose a joint learning framework for early action anticipation and synthesis of the future representations.

  • We demonstrate how attention can efficiently determine the salient components from the multi-modal information, and generate a single context descriptor which is informative for both tasks.

  • We introduce a novel regularisation method based on the exponential cosine distance, which effectively guides the generator networks in the prediction task.

  • We perform evaluations on several challenging datasets, and through a thorough ablation study, demonstrate the relative importance of each component of the proposed model.

2 Previous Work

Human action recognition is an active research area that has great importance in multiple domains [7, 9, 23]. Since the inception of the field, researchers have focused on making methods more applicable to real-world scenarios. The aim of early works was to develop discrete action recognition methods using image [6, 25] or video inputs [15, 46, 22], and these have been extended to detect actions in fine-grained videos [29, 37]. Although these methods have shown impressive performance, they are still limited for real-world applications as they rely on fully completed action sequences. This motivates the development of action anticipation methods, which accurately predict future actions from a limited number of early frames, thereby providing the ability to predict actions that are in progress.

In [50], a deep network is proposed to predict a representation of the future. The predicted representation is used to classify future actions. However, [50] requires the progress level of the ongoing action to be provided during testing, limiting applicability [19]. Hu et al. [19] introduced a soft regression framework to predict ongoing actions. This method [19] learns soft labels for regression on the subsequences containing partial action executions. Lee et al. [30] proposed a human activity representation method, termed sub-volume co-occurrence matrix, and developed a method to predict partially observed actions with the aid of a pre-trained CNN. The deep network approach of Aliakbarian et al. [3] used a multi-stage LSTM architecture that incorporates context-aware and action-aware features to predict classes as early as possible. The CNN based action anticipation model of [40] predicts the most plausible future motion, and was improved via an effective loss function based on dynamic and classification losses. The dynamic loss is obtained through a dynamic image generator trained to generate class-specific dynamic images. However, performance is limited due to the hand-crafted loss function. A GAN based model can overcome this limitation as it can automatically learn a loss function and has shown promising performance in recent research [35, 54, 38].

In our work we utilise a conditional GAN [36, 12, 13] for deep future representation generation. A limited number of GAN approaches can be found for human action recognition [33, 1]. In [33], a GAN is used to generate masks to detect the actors in the input frame, and action classification is done via a CNN. This method is prone to difficulties with the loss function as noted previously. Considering other GAN methods, [32, 56] require human skeletal data which is not readily available; [56] only synthesises the future skeletal representation; and [55] considers the task of synthesising future gaze points using a single generator and discriminator pair and directly extracting spatio-temporal features from a 3D CNN. In contrast to these, we analyse two modes and utilise an attention mechanism to embed the salient components of each mode into a context descriptor which can be used for multiple tasks; and we learn this descriptor through joint training of two GANs and a classifier.

The authors of [45] have adapted the model of [50] to a GAN setting; using GANs to predict the future visual feature representation. Upon training this representation, they train a classifier on the predicted features to anticipate the future action class. We argue that the approach of [45] is suboptimal, as there is no guarantee that the future action representation is well suited to predicting the action due to the two stage learning. Our approach, which learns the tasks jointly, ensures that a rich multi-modal embedding is learnt that captures the salient information needed for both tasks. Furthermore, by extending this to a multi-modal setting, we demonstrate the importance of attending to both visual and temporal features for the action anticipation task.

3 Action Anticipation Model

Our action anticipation model is designed to predict the future while classifying future actions. The model aims to generate embeddings for future frames, to obtain a complete notion of the ongoing action and to understand how best to classify the action. In Sec. 3.1, we discuss how the context descriptor is generated using the visual and temporal input streams while Sec. 3.2 describes the use of the GAN in the descriptor generation process. The future action classification procedure is described in Sec. 3.3 and we further improve this process with the addition of the cosine distance based regularisation method presented in Sec. 3.4.

3.1 Context Descriptor Formulation

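Inputs to our model are twofold: visual and temporal. The visual inputs are the RGB frames and the temporal inputs are the corresponding optical flow images (computed using [4]). If the number of input video frames is T, then both the visual input ($I^{V}$) and the temporal input ($I^{TP}$) can be represented as follows,

$$I^{V} = \{I^{V}_{1}, I^{V}_{2}, \ldots, I^{V}_{T}\}, \qquad I^{TP} = \{I^{TP}_{1}, I^{TP}_{2}, \ldots, I^{TP}_{T}\}. \tag{1}$$

These inputs are passed through a pre-trained feature extractor, $f$, which extracts features $\theta^{V}$ and $\theta^{TP}$ frame-wise,

$$\theta^{V}_{t} = f(I^{V}_{t}), \qquad \theta^{TP}_{t} = f(I^{TP}_{t}), \qquad t = 1, \ldots, T. \tag{2}$$

Then $\theta^{V}$ and $\theta^{TP}$ are sent through separate LSTM networks to capture the temporal structure of the input features. The LSTM outputs are defined as,

$$h^{V}_{t} = \mathrm{LSTM}^{V}(\theta^{V}_{t}), \qquad h^{TP}_{t} = \mathrm{LSTM}^{TP}(\theta^{TP}_{t}). \tag{3}$$

Attention values are generated for each frame such that,

$$e^{V}_{t} = f^{V}_{ATT}(h^{V}_{t}), \qquad e^{TP}_{t} = f^{TP}_{ATT}(h^{TP}_{t}), \tag{4}$$

where $f^{V}_{ATT}$ and $f^{TP}_{ATT}$ are multilayer perceptrons trained together with the rest of the network. The values $e^{V}_{t}$ and $e^{TP}_{t}$ are passed through a sigmoid function, $\sigma$, to get the score values,

$$\alpha^{V}_{t} = \sigma(e^{V}_{t}), \qquad \alpha^{TP}_{t} = \sigma(e^{TP}_{t}). \tag{5}$$

Then, an attention-weighted output vector is generated for each stream,

$$\mu^{V} = \sum_{t=1}^{T} \alpha^{V}_{t} h^{V}_{t}, \qquad \mu^{TP} = \sum_{t=1}^{T} \alpha^{TP}_{t} h^{TP}_{t}. \tag{6}$$

Finally these output vectors are concatenated (denoted by $[\cdot\,,\cdot]$) to generate the context descriptor ($C$),

$$C = [\mu^{V}, \mu^{TP}]. \tag{7}$$

$C$ encodes the recent history of both inputs, and thus is used to predict future behaviour.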
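The attention-based fusion described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-frame scorer is reduced to a single linear layer standing in for the multilayer perceptron, the LSTM outputs are replaced by random arrays, and all names (`attend`, `context_descriptor`, `params_v`, `params_tp`) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attend(h, w, b):
    """Attention-weighted sum over time.

    h: (T, d) per-frame LSTM outputs for one stream.
    w, b: weights of a one-layer scorer (an assumption; the paper uses an MLP).
    """
    e = h @ w + b          # (T,) raw attention values, one per frame
    alpha = sigmoid(e)     # (T,) scores in (0, 1)
    return alpha @ h       # (d,) weighted sum of the frame outputs

def context_descriptor(h_v, h_tp, params_v, params_tp):
    """Concatenate the attended visual and temporal vectors."""
    mu_v = attend(h_v, *params_v)
    mu_tp = attend(h_tp, *params_tp)
    return np.concatenate([mu_v, mu_tp])

rng = np.random.default_rng(0)
T, d = 10, 300                       # d = 300 matches the stated LSTM width
h_v = rng.standard_normal((T, d))    # stand-in for visual LSTM outputs
h_tp = rng.standard_normal((T, d))   # stand-in for temporal LSTM outputs
params_v = (rng.standard_normal(d) * 0.01, 0.0)
params_tp = (rng.standard_normal(d) * 0.01, 0.0)

C = context_descriptor(h_v, h_tp, params_v, params_tp)
print(C.shape)  # (600,)
```

Because each stream has its own scorer, the two streams can receive different levels of focus for the same frame, which is the behaviour the text attributes to the attention mechanism.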

3.2 Visual and Temporal GANs

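GAN based models are capable of learning an output that is difficult to discriminate from real examples. They learn a mapping from the input to this realistic output while learning a loss function to train the mapping. The context descriptor, $C$, is the input for both GANs (visual and temporal synthesisers, see Fig. 2). The ground truth future visual and temporal frames are denoted $\hat{I}^{V}$ and $\hat{I}^{TP}$, and are given by,

$$\hat{I}^{V} = \{I^{V}_{T+1}, \ldots, I^{V}_{T+F}\}, \qquad \hat{I}^{TP} = \{I^{TP}_{T+1}, \ldots, I^{TP}_{T+F}\}, \tag{8}$$

where $F$ is the number of future frames. We extract features for $\hat{I}^{V}$ and $\hat{I}^{TP}$ similar to Eq. 2,

$$\hat{\theta}^{V}_{t} = f(I^{V}_{t}), \qquad \hat{\theta}^{TP}_{t} = f(I^{TP}_{t}), \qquad t = T+1, \ldots, T+F. \tag{9}$$

These features, $\hat{\theta}^{V}$ and $\hat{\theta}^{TP}$, are utilised during GAN training. The aim of the generator ($G^{V}$ or $G^{TP}$) of each GAN is to synthesise a future deep feature sequence that is sufficiently realistic to fool the discriminator ($D^{V}$ or $D^{TP}$). It should be noted that the GAN models do not learn to predict the future frames, but the deep features of the frames (visual or temporal). As observed in [50] this allows the model to recognise higher-level concepts in the present and anticipate their relationships with future actions. This is learnt through the following loss functions,

$$\mathcal{L}^{V} = \min_{G^{V}} \max_{D^{V}} \; \mathbb{E}\big[\log D^{V}(C, \hat{\theta}^{V})\big] + \mathbb{E}\big[\log\big(1 - D^{V}(C, G^{V}(C))\big)\big], \tag{10}$$

$$\mathcal{L}^{TP} = \min_{G^{TP}} \max_{D^{TP}} \; \mathbb{E}\big[\log D^{TP}(C, \hat{\theta}^{TP})\big] + \mathbb{E}\big[\log\big(1 - D^{TP}(C, G^{TP}(C))\big)\big]. \tag{11}$$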
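To make the adversarial objective concrete, the sketch below evaluates the discriminator and generator losses for one GAN from discriminator output probabilities. It is an illustration only: the function names are hypothetical, the discriminator outputs are hand-picked numbers standing in for a network, and the generator uses the common non-saturating loss (maximising log D of the fake sample) rather than the literal minimax term, which is a substitution made here for the example.

```python
import numpy as np

def bce(p, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities p and 0/1 targets."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))

def discriminator_loss(d_real, d_fake):
    """Real (context, ground-truth feature) pairs should score 1, fake pairs 0."""
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: push D's score on fake pairs towards 1."""
    return bce(d_fake, np.ones_like(d_fake))

# Hypothetical discriminator outputs for a mini-batch of four sequences.
d_real = np.array([0.9, 0.8, 0.95, 0.7])   # D(C, real future features)
d_fake = np.array([0.1, 0.3, 0.2, 0.4])    # D(C, G(C))

print(discriminator_loss(d_real, d_fake))
print(generator_loss(d_fake))
```

In the two-GAN setting described above, the same pair of functions would be applied once for the visual stream and once for the temporal stream, each with its own generator and discriminator.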


3.3 Classification

The deep future sequences are learnt through the two GAN models as described in Sec. 3.2. A naive way to perform the future action classification is to use the trained future feature predictor and pass the synthesised future features to the classifier. However, this is sub-optimal as $G^{V}$ and $G^{TP}$ have no knowledge of this task, and thus the features are likely sub-optimal for it. As such, in this work we investigate joint learning of the embedding prediction and future action anticipation, allowing the model to learn the salient features that are required for action anticipation. Hence, the GANs are able to support learning salient features for both processes. We perform future action classification for the action anticipation task through a classifier, $f^{C}$, the input for which is $C$. Then the classification loss can be defined as,

$$\mathcal{L}_{C} = -\sum_{k} y_{k} \log \hat{y}_{k}, \qquad \hat{y} = f^{C}(C), \tag{12}$$

where $y$ is the one-hot ground truth action label and $\hat{y}$ is the predicted class distribution.

It is important to note that the context descriptor is influenced by both the classification loss, $\mathcal{L}_{C}$, and the GAN losses, $\mathcal{L}^{V}$ and $\mathcal{L}^{TP}$, as $G^{V}$ and $G^{TP}$ utilise the context descriptor to synthesise the future representations.
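As a small worked example of the classification step, the sketch below computes a categorical cross-entropy loss from a batch of context descriptors. It assumes a single linear layer with softmax in place of the LSTM-plus-dense classifier described in Sec. 4.2, uses random arrays as descriptors, and takes 24 classes purely for illustration; the names `softmax` and `categorical_cross_entropy` are local helpers, not library calls.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + eps)))

rng = np.random.default_rng(0)
num_classes = 24                          # illustrative class count
C = rng.standard_normal((4, 600))         # stand-in batch of context descriptors
W = rng.standard_normal((600, num_classes)) * 0.01
b = np.zeros(num_classes)
labels = np.array([0, 5, 23, 7])          # hypothetical ground truth classes

probs = softmax(C @ W + b)
loss = categorical_cross_entropy(probs, labels)
print(loss)
```

Because the descriptor feeds both this loss and the generators, gradients from the classifier and from the adversarial losses all reach the same encoder, which is the coupling the section describes.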

3.4 Regularisation

To stabilise GAN learning, a regularisation term such as a norm-based reconstruction loss is often used [20]. However, the cosine distance has been shown to be more effective when comparing deep embeddings [51, 52]. Furthermore, when generating future sequence forecasts it is more challenging to forecast representations in the distant future than the near future. However, the semantics from the distant future are more informative for the action class anticipation problem, as they carry more information about what the agents are likely to do. Hence we propose a temporal regularisation mechanism which compares the predicted embeddings with the ground truth future embeddings using the cosine distance, and encourages the model to focus more on generating accurate embeddings for the distant future,

$$\mathcal{L}_{R} = \sum_{t=1}^{F} e^{t} \Big[ d\big(G^{V}(C)_{t}, \hat{\theta}^{V}_{T+t}\big) + d\big(G^{TP}(C)_{t}, \hat{\theta}^{TP}_{T+t}\big) \Big], \tag{13}$$

where $d$ represents the cosine distance function and $F$ is the number of future frames. Motivated by [3] we introduce the exponential term, $e^{t}$, encouraging more accurate prediction of distant future embeddings.

Then, the loss for the final model, which learns the context descriptor and is reinforced by both deep future sequence synthesisers (GAN models) and the future action classification, can be written as,

$$\mathcal{L} = w_{V} \mathcal{L}^{V} + w_{TP} \mathcal{L}^{TP} + w_{C} \mathcal{L}_{C} + w_{R} \mathcal{L}_{R}, \tag{14}$$

where $w_{V}$, $w_{TP}$, $w_{C}$ and $w_{R}$ are hyper-parameters which control the contribution of the respective losses.
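The regulariser and the weighted objective can be sketched as follows. This is an assumption-laden illustration rather than the paper's exact formula: the exponential weights are shifted and normalised to sum to one to keep the value bounded (a choice made here), the function names are hypothetical, and the loss weights in the example are placeholders rather than the tuned values.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity, computed row-wise."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return 1.0 - num / den

def exp_cosine_regulariser(pred, target):
    """Exponentially weighted cosine distance over future steps.

    pred, target: (F, d) predicted and ground truth future embeddings.
    Later steps get exponentially larger weights; weights are normalised
    to sum to 1 (an assumption made for this sketch).
    """
    num_steps = pred.shape[0]
    steps = np.arange(1, num_steps + 1)
    w = np.exp(steps - num_steps)    # largest weight on the most distant step
    w = w / w.sum()
    return float(np.sum(w * cosine_distance(pred, target)))

def total_loss(l_gan_v, l_gan_tp, l_cls, l_reg, weights):
    """Weighted sum of the two GAN losses, classification loss and regulariser."""
    w_v, w_tp, w_c, w_r = weights
    return w_v * l_gan_v + w_tp * l_gan_tp + w_c * l_cls + w_r * l_reg

rng = np.random.default_rng(0)
target = rng.standard_normal((5, 128))                 # 5 future steps
pred = target + 0.1 * rng.standard_normal((5, 128))    # slightly noisy prediction
l_reg = exp_cosine_regulariser(pred, target)
print(total_loss(0.7, 0.7, 1.2, l_reg, weights=(1.0, 1.0, 1.0, 1.0)))
```

With this weighting, an error in the last predicted step costs more than the same error in the first step, which matches the stated goal of prioritising the distant future.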

4 Evaluations

4.1 Datasets

Related works on action anticipation or early action prediction typically use discrete action datasets. The four datasets we use to evaluate our work are outlined below.

UCF101 [49] has been widely used for discrete action recognition and recent works for action anticipation due to its size and variety. The dataset includes 101 action classes from 13,320 videos with an average length of 7.2 seconds. In order to perform comparison to the state-of-the-art methods, we utilise the provided three training/testing splits and report the average accuracy over three splits.

UCF101-24 [47] is a subset of the UCF101 dataset. It is composed of 24 action classes in 3207 videos. In order to compare action anticipation results to the state-of-the-art we utilise only the data provided in set1.

UT-Interaction (UTI) [42] is a human interaction dataset, which contains videos of two or more people performing interactions such as handshake, punch etc. in a sequential and/or concurrent manner. The dataset has a total of 120 videos. For the state-of-the-art comparison we utilise a 10-fold leave-one-out cross validation on each set and the mean performance over all sets is obtained, as per [3].

The TV Human Interaction (TV-HI) [39] dataset is a collection of 300 video clips collected from 20 different TV shows. It is composed of four action classes of people performing interactions such as handshake, highfive, hug and kiss, and a fifth action class called ‘none’ which does not contain any of the four actions. The provided train/test splits are utilised with a 25-fold cross validation, as per [50].

4.2 Network Architecture and Training

In the related literature, different numbers of observed frames are used for different datasets [3, 45]. Let T be the number of observed frames; we then extract frames T+1 to T+F as future frames, where F is the number of future frames for embedding prediction. As the temporal input, similar to [46] we use dense optical flow displacements computed using [4]. In addition to the horizontal and vertical components we also use the mean displacement of the horizontal and vertical flow. Both visual and temporal inputs are individually passed through a ResNet50 [18] pre-trained on ImageNet [41], and activations from the ‘activation_23’ layer are used as the input feature representation.
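One way to build the three-channel temporal input described above is sketched below. It assumes the optical flow is already available as an array of horizontal and vertical displacements, and it interprets the "mean displacement" as the per-pixel average of the two components; both the interpretation and the function name `flow_to_three_channels` are assumptions made for illustration.

```python
import numpy as np

def flow_to_three_channels(flow):
    """Stack horizontal, vertical and mean displacement into an image.

    flow: (H, W, 2) array with horizontal (u) and vertical (v) displacements.
    Returns an (H, W, 3) array suitable for a network expecting 3 channels.
    """
    u = flow[..., 0]
    v = flow[..., 1]
    mean = 0.5 * (u + v)     # assumed meaning of "mean displacement"
    return np.stack([u, v, mean], axis=-1)

rng = np.random.default_rng(0)
flow = rng.standard_normal((224, 224, 2))   # stand-in for a computed flow field
image = flow_to_three_channels(flow)
print(image.shape)  # (224, 224, 3)
```

A three-channel layout lets the flow images pass through the same ImageNet-style feature extractor as the RGB frames without changing its input layer.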

The generator network is composed of two LSTM layers followed by a fully connected layer. The generator is fed only with the context input while the discriminator is fed with both the context and the real/fake feature representations. The two inputs of the discriminator are passed through separate LSTM layers and then the merged output is passed through two fully connected layers. The classifier is composed of a single LSTM layer followed by a single fully connected layer. For clarity, we provide model diagrams in the supplementary materials. For all LSTMs, 300 hidden units are used. For the model training procedure we follow the approach of [20], alternating between one gradient descent pass for the discriminators and one for the generators and the classifier, using 32 samples per mini-batch. The Adam optimiser [24] is used with a learning rate of 0.0002 and learning-rate decay, and the model is trained for 40 epochs. The four loss-weighting hyper-parameters introduced in Sec. 3.4 are evaluated experimentally and set to 25, 20, 43 and 15, respectively. Please refer to the supplementary material for these evaluations. When training the proposed model for the UTI and TV-HI datasets, due to the limited availability of training examples we first train the model on UCF101 training data and then fine-tune it on the training data from the specific datasets.

For the implementation of our proposed method we utilised Keras with Theano [2] as the backend.

4.3 Performance Evaluation

4.3.1 Evaluation Protocol

To evaluate our model on each dataset, where possible we consider two settings for the number of input frames, namely the ‘Earliest’ and ‘Latest’ settings. For UCF101 and UTI, similar to [3] we consider 20% and 50% of the frames for the ‘Earliest’ and ‘Latest’ settings, respectively; following [3] we do not use more than 50 frames for the ‘Latest’ setting. For each dataset and setting, we resample the input videos such that all sequences have a constant number of frames. Due to the unavailability of baseline results and following [45], for UCF101-24 we evaluate using 50% of the frames from each video, and for the TV-HI dataset, as in [27, 14], we consider only 1 second's worth of frames.
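The frame selection in this protocol can be illustrated with a short helper. The sketch assumes that the observed portion is the leading fraction of the video, that the 50-frame cap is applied before resampling, and that resampling means picking uniformly spaced frame indices; the function name and the output length of 16 are hypothetical choices for the example.

```python
import numpy as np

def observed_indices(num_frames, fraction, out_len, max_frames=None):
    """Indices of the frames fed to the model for one video.

    num_frames: total frames in the video.
    fraction:   observed portion, e.g. 0.2 ('Earliest') or 0.5 ('Latest').
    out_len:    constant sequence length after resampling (assumed value).
    max_frames: optional cap on the observed prefix (50 for 'Latest').
    """
    n_obs = max(1, int(round(fraction * num_frames)))
    if max_frames is not None:
        n_obs = min(n_obs, max_frames)
    # Uniformly spaced positions inside the observed prefix.
    idx = np.linspace(0, n_obs - 1, out_len)
    return np.round(idx).astype(int)

earliest = observed_indices(180, 0.2, out_len=16)
latest = observed_indices(180, 0.5, out_len=16, max_frames=50)
print(earliest.max(), latest.max())  # 35 49
```

For a 180-frame video this observes frames 0 to 35 in the ‘Earliest’ setting and, because of the cap, frames 0 to 49 rather than 0 to 89 in the ‘Latest’ setting.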

4.3.2 Comparison to the state-of-the-art methods

Evaluations for the UCF101, UCF101-24, UTI and TV-HI datasets are presented in Tables 1 to 4, respectively. Considering the baselines, the authors of Multi_stage LSTM [3] and RED [14] introduce new hand-engineered losses that encourage the early prediction of the action class. The authors of RBF-RNN [45] use a GAN learning process where the loss function is also automatically learnt. Similar to the proposed architecture, the RBF-RNN [45] model also utilises the spatial representation of the scene through a deep CNN model and tries to predict the future scene representations. However, in contrast to the proposed architecture, this method utilises neither temporal features nor joint learning. We learn a context descriptor which effectively combines both spatial and temporal representations, which not only aids the action classification but also anticipates the future representations more accurately. This leads to superior results. In Tab. 2, the results for UCF101-24 show that our model outperforms RBF-RNN [45] by 0.9%, while in Tab. 3 we outperform [45] on the UTI dataset by 1.3% at the earliest setting.

When comparing the performance gap between the earliest and latest settings, our model has a smaller performance drop than the baseline models. The gap for UCF101 on our model is 1.4% while the gap for the Multi_stage LSTM model [3] is 2.9%. The two generators synthesise the future representation of both visual and temporal streams while considering the current context. As such, the proposed model is able to better anticipate future actions, even with fewer frames. Our evaluations on multiple benchmarks further illustrate the generalisability of the proposed architecture, with varying video lengths and dataset sizes.

Method                        Earliest   Latest
Context-aware loss in [21]    30.6       71.1
Context-aware loss in [34]    22.6       73.1
Multi_stage LSTM [3]          80.5       83.4
Proposed                      84.2       85.6

Table 1: Action anticipation results (accuracy, %) for UCF101 considering the ‘Earliest’ 20% of frames and ‘Latest’ 50% of frames.

Method                        Accuracy
Temporal Fusion [11]          86.0
ROAD [47]                     92.0
ROAD + BroxFlow [47]          90.0
RBF-RNN [45]                  98.0
Proposed                      98.9

Table 2: Action anticipation results (accuracy, %) for UCF101-24 considering 50% of frames from each video.

Method                        Earliest   Latest
S-SVM [48]                    11.0       13.4
DP-SVM [48]                   13.0       14.6
CuboidBayes [44]              25.0       71.7
CuboidSVM [43]                31.7       85.0
Context-aware loss in [21]    45.0       65.0
Context-aware loss in [34]    48.0       60.0
I-BoW [44]                    65.0       81.7
BP-SVM [28]                   65.0       83.3
D-BoW [44]                    70.0       85.0
Multi_stage LSTM [3]          84.0       90.0
Future-dynamic [40]           89.2       91.9
RBF-RNN [45]                  97.0       NA
Proposed                      98.3       99.2

Table 3: Action anticipation results (accuracy, %) for UTI considering the ‘Earliest’ 20% of frames and ‘Latest’ 50% of frames.

Method                        Accuracy
Vondrick et al. [50]          43.6
RED [14]                      50.2
Proposed                      55.7

Table 4: Action anticipation results (accuracy, %) for the TV Human Interaction dataset considering 1 second's worth of frames from each video.

4.4 Ablation Experiments

To further analyse the proposed AA-GAN method, we conducted an ablation study by strategically removing components of the proposed system. We evaluated seven non-GAN based model variants and ten GAN-based variants of the proposed AA-GAN model. Non-GAN based models are further broken into two categories: models with and without future representation generators. Similarly, the GAN based models fall into two categories: those that do and do not learn tasks jointly. Diagrams of these ablation models are available in the supplementary materials.

Non-GAN based models: These models do not utilise any future representation generators, and are only trained through classification loss.

  (a) A model trained to classify using the context feature extracted only from the visual input stream (V).

  (b) As per model (a), but using the temporal input stream (TP).

  (c) As per (a), but using both data streams to create the context embedding.

Non-GAN based models with future representation generators: Here, we add future embedding generators to the previous set of models. The generators are trained through mean squared error (i.e. no discriminator and no adversarial loss) while the classification is learnt through categorical cross entropy loss. The purpose of these models is to show how the joint learning can improve performance, and how a common embedding can serve both tasks.

  (d) Model with the future visual representation generator, fed only with the visual input stream to train the classifier.

  (e) As per (d), but receiving and predicting the temporal input stream.

  (f) The model is composed of both the visual and temporal generators, and is fed with both visual and temporal input streams.

  (g) As per (f), but with the use of attention to combine the streams.

GAN based models without joint training: These methods are based on the GAN framework that generates future representations and a classifier that anticipates the action where these two tasks are learnt separately. We first train the GAN model using the adversarial loss and once this model is trained, using the generated future embeddings the classifier anticipates the action.

  (h) Uses the GAN learning framework with only the visual input stream; cosine distance based regularisation is used.

  (i) As per (h), but with the temporal input stream.

  (j) AA-GAN variant: uses the GAN learning framework with both the visual and temporal input streams.

GAN based models with joint training: These models train the deep future representation generators adversarially. The stated model variants are introduced by removing the different components from the proposed model.

  (k) The proposed approach with only the visual input stream and without cosine distance based regularisation.

  (l) The proposed approach with only the temporal input stream and without cosine distance based regularisation.

  (m) The proposed approach with only the visual input stream. Cosine distance based regularisation is used.

  (n) The proposed approach with only the temporal input stream. Cosine distance based regularisation is used.

  (o) AA-GAN variant: the proposed model without cosine distance based regularisation.

  (p) AA-GAN variant: similar to the proposed model, however the two generators predict pixel values for future visual and temporal frames instead of representations extracted from the pre-trained feature extractor.

Method                Accuracy
(a)                   45.1
(b)                   39.8
(c)                   52.0
(d)                   54.7
(e)                   52.4
(f)                   68.1
(g)                   68.8
(h)                   98.1
(i)                   97.9
(j) AA-GAN            98.3
(l)                   95.4
(m)                   98.4
(n)                   98.1
(o) AA-GAN            98.7
(p) AA-GAN
AA-GAN (proposed)     98.9

Table 5: Ablation results (accuracy, %) for the UCF101-24 dataset for the ‘Latest’ setting, which uses 50% of the frames from each video.

Figure 3: Projections of the discriminator hidden states for (a) AA-GAN and (b) ablation model (g) (see Section 4.4), before (in blue) and after (in red) training. Ground truth action classes are in brackets. Inserts show sample frames from the respective videos.

The evaluation results of the ablation models on the UCF101-24 test set are presented in Tab. 5.

Non-GAN based models (a to g): Model performance clearly improves when using both data streams together over either one individually (see (c) vs (a) and (b); and (f) vs (d) and (e)). Hence, it is clear that both streams provide different information cues to facilitate the prediction. Comparing the results of models that do not utilise the future representation generators to (d), we see that predicting the future representation does improve the results.

GAN based models without joint training (h to j): Comparing the non-GAN based methods with ablation model (h), we see that a major performance boost is achieved through the GAN learning process, denoting the importance of the automated loss function learning. Comparing the performance of visual and temporal streams, we observe that the visual stream is dominant; however, combining both streams through the proposed attention mechanism captures complementary information.

GAN based models with joint training (k to p): Comparing models (h) and (i), which are single-modal models that do not use joint training, with models (m) and (n), which do, we can see the clear benefit offered by learning the two complementary tasks together. This contradicts the observation reported in [45], who use a classifier which was connected to the predicted future embeddings. We speculate that by learning a compressed context representation for both tasks we effectively propagate the effect of the action anticipation error through the encoding mechanisms, allowing this representation to be informative for both tasks. Finally, by coupling the GAN loss together with the regularisation loss of Sec. 3.4, where the cosine distance based regularisation is combined with the exponential weighting to encourage accurate long-term predictions, we achieve state-of-the-art results. Furthermore, we compare the proposed AA-GAN model, where the generators synthesise future visual and temporal representations, against ablation model (p), where the generators synthesise pixel values for future frames. It is evident that the latter model fails to capture the semantic relationships between the low-level pixel features and the action class, leading to the derived context descriptor being less informative for action classification and reducing performance.

To demonstrate the discriminative nature of the learnt context embeddings, Fig. 3 (a) visualises the embedding space before (in blue) and after (in red) training of the proposed context descriptor for 30 randomly selected examples of the TV-HI test set. We extracted the learned context descriptor and applied PCA [53] to generate 2D vectors. Ground truth action classes are indicated in brackets.

This clearly shows that the proposed context descriptor learns embeddings which are informative for both future representation generation and the segregation of action classes. The inserts, which show sample frames from the videos, reveal visual similarities between the classes, hence the overlap in the embedding space before training. However, after learning, the context descriptor has been able to maximise the inter-class distance while minimising the distance within each class. Fig. 3 (b) shows an equivalent plot for the ablation model (g). Given the cluttered nature of the embeddings before and after learning, it is clear that the proposed GAN learning process makes a significant contribution to learning discriminative embeddings. Additional qualitative evaluations showing generated future visual and temporal representations are in the supplementary material.
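The 2D projection used for this kind of plot can be reproduced with a few lines of NumPy. The sketch below implements PCA through the singular value decomposition and applies it to random stand-in descriptors; the array sizes and the function name `pca_2d` are illustrative assumptions, and no plotting is included.

```python
import numpy as np

def pca_2d(x):
    """Project the rows of x onto their first two principal components."""
    x_centred = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(x_centred, full_matrices=False)
    return x_centred @ vt[:2].T

rng = np.random.default_rng(0)
descriptors = rng.standard_normal((30, 600))   # stand-in for 30 context descriptors
points = pca_2d(descriptors)
print(points.shape)  # (30, 2)
```

Running the same projection on descriptors taken before and after training, and colouring the points by stage, gives a before/after scatter comparable in spirit to the figure.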

4.5 Time Complexity

We evaluate the computational demands of the proposed AA-GAN model for the UTI dataset’s ‘Earliest’ setting. The model contains 43M trainable parameters, and generates 500 predictions (including future visual and temporal predictions and the action anticipation) in 1.64 seconds using a single core of an Intel E5-2680 2.50 GHz CPU.
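As a quick check of what these figures imply per prediction, the reported totals can be converted into an average latency and throughput. The variable names are illustrative, and the result is an amortised average: the text does not state whether the 500 predictions were generated one at a time or in batches.

```python
# Figures reported in the text above
n_predictions = 500
total_seconds = 1.64

# Amortised per-prediction latency and overall throughput
latency_ms = 1000.0 * total_seconds / n_predictions
throughput = n_predictions / total_seconds

print(f"{latency_ms:.2f} ms per prediction")        # 3.28 ms per prediction
print(f"{throughput:.1f} predictions per second")   # 304.9 predictions per second
```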

5 Conclusion

In this paper we propose a framework which jointly learns to anticipate an action while also synthesising future scene embeddings. We learn a context descriptor which facilitates both of these tasks by systematically attending to individual input streams and effectively extracting the salient features. This method exhibits traits analogous to human neurological behaviour in synthesising the future, and provides an end-to-end learning platform. Additionally, we introduce a cosine distance based regularisation method to guide the generators in the synthesis task. Our evaluations demonstrate the superior performance of the proposed method on multiple public benchmarks.


  • [1] U. Ahsan, C. Sun, and I. Essa (2018) DiscrimNet: semi-supervised action recognition from videos using generative adversarial networks. arXiv preprint arXiv:1801.07230. Cited by: §1, §2.
  • [2] R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Belopolsky, et al. (2016) Theano: a python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688. Cited by: §4.2.
  • [3] M. S. Aliakbarian, F. S. Saleh, M. Salzmann, B. Fernando, L. Petersson, and L. Andersson (2017) Encouraging lstms to anticipate actions very early. In IEEE International Conference on Computer Vision (ICCV), Vol. 1. Cited by: §1, §1, §2, §3.4, §4.1, §4.2, §4.3.1, §4.3.2, §4.3.2, Table 1, Table 3.
  • [4] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert (2004) High accuracy optical flow estimation based on a theory for warping. In European conference on computer vision, pp. 25–36. Cited by: §3.1, §4.2.
  • [5] F. Chollet et al. (2015) Keras. Cited by: §4.2.
  • [6] V. Delaitre, I. Laptev, and J. Sivic (2010) Recognizing human actions in still images: a study of bag-of-features and part-based representations. In British Machine Vision (BMVC) Conference, Note: updated version. Cited by: §2.
  • [7] A. Dix (2009) Human-computer interaction. In Encyclopedia of database systems, pp. 1327–1331. Cited by: §2.
  • [8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell (2013) A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531. Cited by: §1.
  • [9] A. Ekin, A. M. Tekalp, and R. Mehrotra (2003) Automatic soccer video analysis and summarization. IEEE Transactions on Image processing 12 (7), pp. 796–807. Cited by: §2.
  • [10] B. Elsner and B. Hommel (2001) Effect anticipation and action control.. Journal of experimental psychology: human perception and performance 27 (1), pp. 229. Cited by: §1, §1.
  • [11] Z. Fan, T. Lin, X. Zhao, W. Jiang, T. Xu, and M. Yang (2017) An online approach for gesture recognition toward real-world applications. In International Conference on Image and Graphics, pp. 262–272. Cited by: Table 2.
  • [12] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes (2018) Multi-level sequence gan for group activity recognition. In Asian Conference on Computer Vision, pp. 331–346. Cited by: §2.
  • [13] H. Gammulle, T. Fernando, S. Denman, S. Sridharan, and C. Fookes (2019) Coupled generative adversarial network for continuous fine-grained action segmentation. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 200–209. Cited by: §2.
  • [14] J. Gao, Z. Yang, and R. Nevatia (2017) RED: reinforced encoder-decoder networks for action anticipation. arXiv preprint arXiv:1707.04818. Cited by: §4.3.1, §4.3.2, Table 4.
  • [15] G. Gkioxari and J. Malik (2015) Finding action tubes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • [17] P. F. Greve (2015) The role of prediction in mental processing: a process approach. New Ideas in Psychology 39, pp. 45–52. Cited by: §1, §1.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §4.2.
  • [19] J. Hu, W. Zheng, L. Ma, G. Wang, J. Lai, and J. Zhang (2018) Early action prediction by soft regression. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
  • [20] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017-07) Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.4, §4.2.
  • [21] A. Jain, A. Singh, H. S. Koppula, S. Soh, and A. Saxena (2016) Recurrent neural networks for driver activity anticipation via sensory-fusion architecture. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pp. 3118–3125. Cited by: Table 1, Table 3.
  • [22] S. Ji, W. Xu, M. Yang, and K. Yu (2013) 3D convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence 35 (1), pp. 221–231. Cited by: §2.
  • [23] C. G. Keller and D. M. Gavrila (2014) Will the pedestrian cross? a study on pedestrian path prediction. IEEE Transactions on Intelligent Transportation Systems 15 (2), pp. 494–506. Cited by: §2.
  • [24] D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. International Conference on Learning Representations (ICLR). Cited by: §4.2.
  • [25] B. Ko, J. Hong, and J. Nam (2015-06) Human action recognition in still images using action poselets and a two-layer classification model. J. Vis. Lang. Comput. 28 (C), pp. 163–175. External Links: ISSN 1045-926X Cited by: §2.
  • [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [27] T. Lan, T. Chen, and S. Savarese (2014) A hierarchical representation for future action prediction. In European Conference on Computer Vision, pp. 689–704. Cited by: §4.3.1.
  • [28] K. Laviers, G. Sukthankar, D. W. Aha, M. Molineaux, C. Darken, et al. (2009) Improving offensive performance through opponent modeling.. In AIIDE, Cited by: Table 3.
  • [29] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager (2017) Temporal convolutional networks for action segmentation and detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [30] D. Lee and S. Lee (2019) Prediction of partially observed human activity based on pre-trained deep representation. Pattern Recognition 85, pp. 198–206. Cited by: §2.
  • [31] M. Lehne and S. Koelsch (2015) Toward a general psychological model of tension and suspense. Frontiers in Psychology 6, pp. 79. Cited by: §1, §1.
  • [32] C. Li, Z. Zhang, W. Sun Lee, and G. Hee Lee (2018) Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5226–5234. Cited by: §1, §2.
  • [33] X. Li, Y. Zhang, J. Zhang, Y. Chen, H. Li, I. Marsic, and R. S. Burd (2017) Region-based activity recognition using conditional gan. In Proceedings of the 2017 ACM on Multimedia Conference, pp. 1059–1067. Cited by: §1, §2.
  • [34] S. Ma, L. Sigal, and S. Sclaroff (2016) Learning activity progression in lstms for activity detection and early detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1942–1950. Cited by: Table 1, Table 3.
  • [35] M. Mathieu, C. Couprie, and Y. LeCun (2016) Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations (ICLR). Cited by: §2.
  • [36] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §2.
  • [37] B. Ni, X. Yang, and S. Gao (2016) Progressively parsing interactional objects for fine grained action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1020–1028. Cited by: §2.
  • [38] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544. Cited by: §2.
  • [39] A. Patron-Perez, M. Marszalek, I. Reid, and A. Zisserman (2012) Structured learning of human interactions in tv shows. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (12), pp. 2441–2453. Cited by: §4.1.
  • [40] C. Rodriguez, B. Fernando, and H. Li (2018) Action anticipation by predicting future dynamic images. In ECCV’18 workshop on Anticipating Human Behavior, Cited by: §2, Table 3.
  • [41] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §4.2.
  • [42] M. S. Ryoo and J. K. Aggarwal (2010) UT-Interaction Dataset, ICPR contest on Semantic Description of Human Activities (SDHA). Cited by: §4.1.
  • [43] M. S. Ryoo, C. Chen, J. Aggarwal, and A. Roy-Chowdhury (2010) An overview of contest on semantic description of human activities (sdha) 2010. In Recognizing Patterns in Signals, Speech, Images and Videos, pp. 270–285. Cited by: Table 3.
  • [44] M. S. Ryoo (2011) Human activity prediction: early recognition of ongoing activities from streaming videos. In Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 1036–1043. Cited by: Table 3.
  • [45] Y. Shi, B. Fernando, and R. Hartley (2018) Action anticipation with rbf kernelized feature mapping rnn. In European Conference on Computer Vision, pp. 305–322. Cited by: §1, §2, §4.2, §4.3.1, §4.3.2, §4.4, Table 2, Table 3.
  • [46] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. CoRR abs/1406.2199. Cited by: §2, §4.2.
  • [47] G. Singh, S. Saha, M. Sapienza, P. H. Torr, and F. Cuzzolin (2017) Online real-time multiple spatiotemporal action localisation and prediction.. In ICCV, pp. 3657–3666. Cited by: §4.1, Table 2.
  • [48] K. Soomro, H. Idrees, and M. Shah (2018) Online localization and prediction of actions and interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: Table 3.
  • [49] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §4.1.
  • [50] C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Anticipating visual representations from unlabeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 98–106. Cited by: §1, §2, §2, §3.2, §4.1, Table 4.
  • [51] J. Wang, Y. Li, Z. Miao, Y. Xu, and G. Tao (2017) Learning deep discriminative features based on cosine loss function. Electronics Letters 53 (14), pp. 918–920. Cited by: §3.4.
  • [52] N. Wojke and A. Bewley (2018) Deep cosine metric learning for person re-identification. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 748–756. Cited by: §3.4.
  • [53] S. Wold, K. Esbensen, and P. Geladi (1987) Principal component analysis. Chemometrics and intelligent laboratory systems 2 (1-3), pp. 37–52. Cited by: §4.4.
  • [54] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon (2016) Pixel-level domain transfer. In European Conference on Computer Vision, pp. 517–532. Cited by: §2.
  • [55] K. Zeng, W. B. Shen, D. Huang, M. Sun, and J. Carlos Niebles (2017) Visual forecasting by imitating dynamics in natural sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2999–3008. Cited by: §2.
  • [56] M. Zhang, K. Teck Ma, J. Hwee Lim, Q. Zhao, and J. Feng (2017) Deep future gaze: gaze anticipation on egocentric videos using adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4372–4381. Cited by: §1, §2.