1 Introduction
Anticipatory reasoning to model the evolution of action sequences over time is a fundamental challenge in human activity understanding. The crux of the problem in making predictions about the future is the fact that for interesting domains, the future is uncertain – given a history of actions such as those depicted in Fig. 1, the distribution over future actions has substantial entropy.
In this work, we propose a powerful generative approach that can effectively model the categorical and temporal variability of action sequences. Much of the work in this domain has focused on taking frame-level video data as input in order to predict the actions or activities that may occur in the immediate future. There has also been recent interest in the task of predicting the sequence of actions that occur farther into the future [6, 32, 1].
Time series data often involves regularly spaced data points with interesting events occurring sparsely across time. This is true in the case of videos, where we have a regular frame rate but events of interest are present only in a few, infrequent frames. We hypothesize that in order to model future events in such a scenario, it is beneficial to consider the history of sparse events (action categories and their temporal occurrence in the above example) alone, instead of regularly spaced frame data. While the history of frames contains rich information over and above the sparse event history, we can possibly create a model for events occurring farther into the future by choosing to model only the sparse sequence of events. This approach also allows us to model high-level semantic meaning in the time series data that can be difficult to discern from low-level data points that are regular across time.
Our model is formulated in the variational autoencoder (VAE) [15] paradigm, a powerful class of probabilistic models that facilitate generation and the ability to model complex distributions. We present a novel form of VAE for action sequences under a point process approach. This approach has a number of advantages, including a probabilistic treatment of action sequences to allow for likelihood evaluation, generation, and anomaly detection.
Contribution. This work centers on APP-VAE (Action Point Process VAE), a novel generative model for asynchronous-time action sequences. Our contributions include:

A novel formulation for modeling point process data within the variational autoencoder paradigm.

Conditional prior models for encoding asynchronous time data.

A probabilistic model for jointly capturing uncertainty in which actions will occur and when they will happen.
2 Related Work
Activity Prediction. Most activity prediction tasks are frame-based, i.e. the input to the model is a sequence of frames before the action starts and the task is to predict what will happen next. Lan et al. [18] predict future actions from hierarchical representations of short clips, using different classifiers at each level in a max-margin framework. Mahmud et al. [20] jointly predict the future activity and its starting time with a multi-stream framework. Each stream captures different features to build a richer representation for future prediction: one stream for visual information, one for previous activities, and one focusing on the last activity. Farha et al. [1] proposed a framework for predicting the action categories of a sequence of future activities as well as their starting and ending times. They proposed two deterministic models, one combining an RNN and an HMM, and the other a CNN that predicts a matrix in which the future actions are encoded.
Asynchronous Action Prediction. We focus on the task of predicting future actions given a sequence of previous actions that are asynchronous in time. Du et al. [6] proposed a recurrent temporal model for learning the next activity timing and category given the history of previous actions. Their recurrent model learns a nonlinear map from the history to the intensity function of a temporal point process. Zhong et al. [32] also introduced a hierarchical recurrent network model for predicting future action timing and category. Their model takes frame-level information as well as sparse high-level event information in the history to learn the intensity function of a temporal point process. Xiao et al. [28] introduced an intensity-free generative method for temporal point processes. The generative part of their model extends the Wasserstein GAN to temporal point processes in order to learn to generate sequences of actions.
Early Stage Action Prediction. Our work is related to early stage action prediction. This task refers to predicting the action given the initial frames of the activity [19, 10, 25]. Our task differs from early action prediction because the model has no information about the action while predicting it. Recently Yu et al. [31] used a variational autoencoder to learn from the frames in the history and transfer them into the future. Sadegh Aliakbarian et al. [24] combine context and action information using a multi-stage LSTM model to predict future actions. The model is trained with a loss function that encourages it to predict the action from few observations. Gao et al. [7] proposed a Reinforced Encoder-Decoder network for future activity prediction. Bütepage et al. [3] proposed a semi-supervised variational recurrent neural network to model human activity, including classification, prediction, detection and anticipation of human activities.
Video Prediction. Video prediction has recently been studied in several works. Denton and Fergus [5] use a variational autoencoder framework with a learned prior to generate future video frames. He et al. [9] also proposed a generative model for future prediction. They structure the latent space by adding control features, which makes the generation controllable. Vondrick and Torralba [27] use adversarial learning to generate future video by transforming past pixels. Patraucean et al. [23] describe a spatio-temporal autoencoder that predicts optical flow as a dense map, using reconstruction in its learning criterion. Villegas et al. [26] propose a hierarchical approach to pixel-level video generation, reasoning over body pose before rendering into a predicted future frame.
3 Asynchronous Action Sequence Modeling
Figure (panels: Training, Generation): Our proposed recurrent VAE model for asynchronous action sequence modeling. At each time step, the model uses the history of actions and inter-arrival times to generate a distribution over latent codes, a sample of which is then decoded into two probability distributions for the next action: one over possible action labels and one over the inter-arrival time.
We first introduce some notation and the problem definition. Then we review the VAE model and the temporal point process that are used in our model. Subsequently, we present our model in detail and explain how it is trained.
Problem definition.
The input is a sequence of actions $x_{1:n} = (x_1, \ldots, x_n)$ where $x_n$ is the $n$-th action. The action $x_n = (a_n, \tau_n)$ is represented by the action category $a_n \in \{1, 2, \ldots, K\}$ ($K$ discrete action classes) and the inter-arrival time $\tau_n \in \mathbb{R}^{+}$. The inter-arrival time is the difference between the starting times of actions $x_{n-1}$ and $x_n$. We formulate the asynchronous action distribution modeling task as follows: given a sequence of actions $x_{1:n-1}$, the goal is to produce a distribution over what action $a_n$ will happen next, and the inter-arrival time $\tau_n$. We aim to develop probabilistic models to capture the uncertainty over these what and when questions of action sequence modeling.
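As an illustration of this representation, the following minimal sketch converts a list of (category, start time) annotations into (category, inter-arrival time) pairs. The function name `to_interarrival`, the example labels, and the reference time `t0` used for the first action are assumptions made for illustration; the text does not specify how the first inter-arrival time is measured.

```python
from typing import List, Tuple


def to_interarrival(events: List[Tuple[str, float]], t0: float = 0.0) -> List[Tuple[str, float]]:
    """Convert (category, start_time) pairs into (category, inter-arrival time) pairs.

    The inter-arrival time of an action is its start time minus the start time
    of the previous action. Assumption: the first action is measured from t0.
    """
    ordered = sorted(events, key=lambda event: event[1])  # sort by start time
    result = []
    previous_start = t0
    for category, start in ordered:
        result.append((category, start - previous_start))
        previous_start = start
    return result


# Hypothetical annotations: (action label, start time in seconds).
sequence = to_interarrival([("take_cup", 2.0), ("pour_milk", 5.5), ("stir_milk", 9.0)])
print(sequence)
```

The resulting sparse sequence is what the model consumes, in place of regularly spaced frames.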
3.1 Background: Base Models
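Variational Auto-Encoders (VAEs).
A VAE [15] describes a generative process with a simple prior $p_\theta(z)$ (usually chosen to be a multivariate Gaussian) and a complex likelihood $p_\theta(x|z)$ (the parameters of which are produced by neural networks). $x$ and $z$ are observed and latent variables, respectively. Approximating the intractable posterior $p_\theta(z|x)$ with a recognition neural network $q_\phi(z|x)$, the parameters of the generative model $\theta$ as well as the recognition model $\phi$ can be jointly optimized by maximizing the evidence lower bound $\mathcal{L}$ on the marginal likelihood $p_\theta(x)$:
$$\log p_\theta(x) \ge \mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z)\right) \tag{1}$$
Recent works expand VAEs to time-series data including video [2, 5, 9], text [4, 12], or audio [30]. A popular design choice of such models is the integration of a per-time-step VAE with RNN/LSTM temporal modelling. The ELBO thus becomes a summation of time-step-wise variational lower bounds (note that variants exist, depending on the exact form of the recurrent structure and its VAE instantiation):
$$\mathcal{L}(\theta, \phi, \psi) = \sum_{n=1}^{N} \Big[ \mathbb{E}_{q_\phi(z_{1:n}|x_{1:n})}\left[\log p_\theta(x_n | x_{1:n-1}, z_{1:n})\right] - D_{KL}\left(q_\phi(z_n|x_{1:n}) \,\|\, p_\psi(z_n|x_{1:n-1})\right) \Big] \tag{2}$$
with a “prior” $p_\psi(z_n|x_{1:n-1})$ that evolves over the $N$ time steps used.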
Temporal point process.
A temporal point process is a stochastic model used to capture the inter-arrival times of a series of events. A temporal point process is characterized by the conditional intensity function $\lambda(t|x_{1:n-1})$, which is conditioned on the past events $x_{1:n-1}$ (e.g. actions in this work). The conditional intensity encodes instantaneous probabilities at time $t$. Given the history of $n-1$ past actions, the probability density function for the time $t$ of the next action is:
$$f(t) = \lambda(t|x_{1:n-1}) \exp\left(-\int_{t_{n-1}}^{t} \lambda(u|x_{1:n-1})\, du\right) \tag{3}$$
where $t_{n-1}$ denotes the starting time of the most recent action.
The Poisson process [16] is a popular temporal point process, which assumes that events occur independently of one another. Its conditional intensity is $\lambda(t|x_{1:n-1}) = \lambda$, where $\lambda$ is a positive constant. More complex conditional intensities have been proposed, such as the Hawkes Process [8] and the Self-Correcting Process [13]. All these conditional intensity functions seek to capture some form of dependency on the past actions. However, in practice the true model of the dependencies is never known [21] and the performance depends on the design of the conditional intensity. In this work, we learn a recurrent model that estimates the conditional intensity based on the history of actions.
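To make the density of the next event time concrete, the sketch below evaluates the intensity at the query time multiplied by the exponential of the negative integrated intensity since the last event, using a simple midpoint rule for the integral. The function name `next_event_density` and the numerical integration scheme are illustrative assumptions, not part of the described model. For a constant intensity (the Poisson case) the result should agree with the exponential density.

```python
import math
from typing import Callable


def next_event_density(t: float, t_prev: float,
                       intensity: Callable[[float], float],
                       steps: int = 10_000) -> float:
    """Density of the next event time t given the last event at t_prev.

    Computes intensity(t) * exp(-integral of intensity from t_prev to t),
    approximating the integral with the midpoint rule.
    """
    if t < t_prev:
        return 0.0
    width = (t - t_prev) / steps
    integral = sum(intensity(t_prev + (i + 0.5) * width) for i in range(steps)) * width
    return intensity(t) * math.exp(-integral)


# Poisson process: constant intensity, so the density is exponential in (t - t_prev).
rate = 0.5
poisson_density = next_event_density(3.0, 1.0, lambda u: rate)
closed_form = rate * math.exp(-rate * (3.0 - 1.0))
print(poisson_density, closed_form)
```

Any history-dependent intensity can be passed in as the callable; only the constant case has the simple closed form used here as a check.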
3.2 Proposed Approach
We propose a generative model for asynchronous action sequence modeling using the VAE framework. Figure 3 shows the architecture of our model. Overall, the input sequence of actions and inter-arrival times is encoded using a recurrent VAE model. At each step, the model uses the history of actions to produce a distribution over latent codes $z_n$, a sample of which is then decoded into two probability distributions: one over the possible action categories and another over the inter-arrival time for the next action. We now detail our model.
Model.
At time step $n$ during training, the model takes as input the action $x_n$, which is the target of the prediction model, and the history of past actions $x_{1:n-1}$. These inputs are used to compute a conditional distribution $q_\phi(z_n|x_{1:n})$ from which a latent code $z_n$ is sampled. Since the true distribution over latent variables $z_n$ is intractable, we rely on a time-dependent inference network $q_\phi(z_n|x_{1:n})$ that approximates it with a conditional Gaussian distribution $\mathcal{N}(\mu_{\phi_n}, \sigma^2_{\phi_n})$. To prevent $z_n$ from just copying $x_n$, we force $q_\phi(z_n|x_{1:n})$ to be close to the prior distribution $p(z_n)$ using a KL-divergence term. Usually in VAE models, $p(z_n)$ is a fixed Gaussian $\mathcal{N}(0, I)$. But a drawback of using a fixed prior is that samples at each time step are drawn randomly, and thus ignore temporal dependencies present between actions. To overcome this problem, a solution is to learn a prior $p_\psi(z_n|x_{1:n-1})$ that varies across time, being a function of all past actions except the current action $x_n$. Both prior and approximate posterior are modelled as multivariate Gaussian distributions with diagonal covariance, with parameters as shown below:
$$q_\phi(z_n|x_{1:n}) = \mathcal{N}(\mu_{\phi_n}, \sigma^2_{\phi_n}) \tag{4}$$
$$p_\psi(z_{n+1}|x_{1:n}) = \mathcal{N}(\mu_{\psi_{n+1}}, \sigma^2_{\psi_{n+1}}) \tag{5}$$
At step $n$, both posterior and prior networks observe actions $x_{1:n}$, but the posterior network outputs the parameters of a conditional Gaussian distribution for the current action $x_n$ whereas the prior network outputs the parameters of a conditional Gaussian distribution for the next action $x_{n+1}$.
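Because both the approximate posterior and the learned prior are diagonal Gaussians, the KL-divergence term between them has a closed form. The sketch below computes it dimension by dimension; the function name `kl_diag_gaussians` and the plain-list representation of means and variances are assumptions for illustration only.

```python
import math
from typing import Sequence


def kl_diag_gaussians(mu_q: Sequence[float], var_q: Sequence[float],
                      mu_p: Sequence[float], var_p: Sequence[float]) -> float:
    """KL(q || p) for two Gaussians with diagonal covariance.

    Per dimension: 0.5 * (log(var_p / var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1).
    """
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total


# Posterior shifted by one unit from a unit-variance prior in a single dimension.
kl_shifted = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
# Identical distributions have zero divergence.
kl_same = kl_diag_gaussians([0.3, -1.2], [0.5, 2.0], [0.3, -1.2], [0.5, 2.0])
print(kl_shifted, kl_same)
```

With a fixed standard normal prior, `mu_p` and `var_p` would be all zeros and ones; with a learned prior they come from the prior network at each step.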
At each time step during training, a latent variable $z_n$ is drawn from the posterior distribution $q_\phi(z_n|x_{1:n})$. The output action $\hat{x}_n$ is then sampled from the distribution $p_\theta(x_n|z_n)$ of our conditional generative model, which is parameterized by $\theta$. For mathematical convenience, we assume the action category and inter-arrival time are conditionally independent given the latent code $z_n$:
$$p_\theta(x_n|z_n) = p_\theta(a_n, \tau_n|z_n) = p^a_\theta(a_n|z_n)\, p^\tau_\theta(\tau_n|z_n) \tag{6}$$
where $p^a_\theta(a_n|z_n)$ (resp. $p^\tau_\theta(\tau_n|z_n)$) is the conditional generative model for the action category (resp. inter-arrival time). This is a standard assumption in event prediction [6, 32]. The sequence model generates two probability distributions: (i) a categorical distribution over the action categories and (ii) a temporal point process distribution over the inter-arrival times for the next action.
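The distribution over action categories is modeled with a multinomial distribution, since $a_n$ can only take a finite number of values:
$$p^a_\theta(a_n = k|z_n) = p_k(z_n) \quad \text{and} \quad \sum_{k=1}^{K} p_k(z_n) = 1 \tag{7}$$
where $p_k(z_n)$ is the probability of occurrence of action $k$, and $K$ is the total number of action categories.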
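The inter-arrival time is assumed to follow an exponential distribution parameterized by $\lambda(z_n)$, similar to a standard temporal point process model:
$$p^\tau_\theta(\tau_n|z_n) = \begin{cases} \lambda(z_n)\, e^{-\lambda(z_n)\tau_n} & \text{if } \tau_n \ge 0 \\ 0 & \text{if } \tau_n < 0 \end{cases} \tag{8}$$
where $p^\tau_\theta(\tau_n|z_n)$ is a probability density function over the random variable $\tau_n$ and $\lambda(z_n)$ is the intensity of the process, which depends on the latent variable sample $z_n$.
Learning.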
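We train the model by optimizing the variational lower bound over the entire sequence comprised of $N$ steps:
$$\mathcal{L}_{\theta,\phi,\psi}(x_{1:N}) = \sum_{n=1}^{N} \Big( \mathbb{E}_{q_\phi(z_n|x_{1:n})}\left[\log p_\theta(x_n|z_n)\right] - D_{KL}\left(q_\phi(z_n|x_{1:n}) \,\|\, p_\psi(z_n|x_{1:n-1})\right) \Big) \tag{9}$$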
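Because the action category and inter-arrival time are conditionally independent given the latent code $z_n$, the log-likelihood term can be written as follows:
$$\mathbb{E}_{q_\phi(z_n|x_{1:n})}\left[\log p_\theta(x_n|z_n)\right] = \mathbb{E}_{q_\phi(z_n|x_{1:n})}\left[\log p^a_\theta(a_n|z_n)\right] + \mathbb{E}_{q_\phi(z_n|x_{1:n})}\left[\log p^\tau_\theta(\tau_n|z_n)\right] \tag{10}$$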
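Given the form of $p^a_\theta$, the log-likelihood term reduces to a cross entropy between the predicted action category distribution $p^a_\theta(a_n|z_n)$ and the ground truth label $a^*_n$. Given the ground truth inter-arrival time $\tau^*_n$, we compute its log-likelihood over a small time interval $\Delta_\tau$ under the predicted distribution:
$$\log \int_{\tau^*_n}^{\tau^*_n + \Delta_\tau} p^\tau_\theta(\tau|z_n)\, d\tau = \log\left(1 - e^{-\lambda(z_n)\Delta_\tau}\right) - \lambda(z_n)\,\tau^*_n \tag{11}$$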
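The interval log-likelihood of an exponentially distributed inter-arrival time has the closed form log(1 - exp(-rate * delta)) - rate * tau, which follows from integrating the exponential density over the interval from tau to tau + delta. The sketch below is a minimal illustration; the function names and the use of `math.expm1` for numerical stability are choices made here, not details taken from the text.

```python
import math
from typing import Sequence


def interval_log_likelihood(tau: float, rate: float, delta: float) -> float:
    """Log-probability that an Exponential(rate) variable falls in [tau, tau + delta].

    The integral of rate * exp(-rate * t) over the interval equals
    exp(-rate * tau) * (1 - exp(-rate * delta)).
    """
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    return math.log(-math.expm1(-rate * delta)) - rate * tau


def category_log_likelihood(probs: Sequence[float], label: int) -> float:
    """Log-probability of the ground truth category (negative cross entropy)."""
    return math.log(probs[label])


time_term = interval_log_likelihood(tau=2.0, rate=0.5, delta=0.1)
direct = math.log(math.exp(-0.5 * 2.0) - math.exp(-0.5 * 2.1))
action_term = category_log_likelihood([0.2, 0.7, 0.1], label=1)
print(time_term, direct, action_term)
```

Under the conditional independence assumption, the per-step reconstruction term is the sum of the action term and the time term.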
We use the reparameterization trick [15] to sample from the encoder network $q_\phi$.
Generation.
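The goal is to generate the next action $\hat{x}_n = (\hat{a}_n, \hat{\tau}_n)$ given a sequence of past actions $x_{1:n-1}$. The generation process is shown on the bottom of Figure 3. At test time, an action at step $n$ is generated by first sampling $z_n$ from the prior. The parameters of the prior distribution are computed based on the past $n-1$ actions $x_{1:n-1}$. Then, an action category $\hat{a}_n$ and inter-arrival time $\hat{\tau}_n$ are generated as follows:
$$z_n \sim p_\psi(z_n|x_{1:n-1}), \qquad \hat{a}_n \sim p^a_\theta(a_n|z_n), \qquad \hat{\tau}_n \sim p^\tau_\theta(\tau_n|z_n) \tag{12}$$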
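The generation step can be sketched as three draws: a latent code from the (diagonal Gaussian) prior, an action category from the decoded categorical distribution, and an inter-arrival time from the decoded exponential distribution. In the sketch below the two decoders are hypothetical toy functions standing in for the trained networks, and all names are assumptions for illustration.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def sample_next_action(prior_mu: Sequence[float], prior_var: Sequence[float],
                       decode_category: Callable[[List[float]], List[float]],
                       decode_rate: Callable[[List[float]], float],
                       rng: random.Random) -> Tuple[int, float]:
    """Sample (category index, inter-arrival time) for the next action."""
    # Latent code from a diagonal Gaussian prior.
    z = [rng.gauss(m, math.sqrt(v)) for m, v in zip(prior_mu, prior_var)]
    probs = decode_category(z)   # categorical distribution over action classes
    rate = decode_rate(z)        # positive intensity of the exponential distribution
    category = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    tau = rng.expovariate(rate)
    return category, tau


def toy_category_decoder(z: List[float]) -> List[float]:
    """Hypothetical decoder: softmax over three hand-written logits."""
    logits = [z[0], -z[0], 0.0]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def toy_rate_decoder(z: List[float]) -> float:
    """Hypothetical decoder: exponential nonlinearity keeps the rate positive."""
    return math.exp(z[1])


rng = random.Random(0)
category, tau = sample_next_action([0.0, 0.0], [1.0, 1.0],
                                   toy_category_decoder, toy_rate_decoder, rng)
print(category, tau)
```

Repeating the call with different random draws yields diverse next actions for the same history, which is the behavior the stochastic latent code is meant to provide.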
Architecture.
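We now describe the architecture of our model in detail. At step $n$, the current action $x_n$ is embedded into a vector representation $x^{emb}_n$ with a two-step embedding strategy. First, we compute a representation for the action category ($a^{emb}_n$) and the inter-arrival time ($\tau^{emb}_n$) separately. Then, we concatenate these two representations and compute a new representation $x^{emb}_n$ of the action.
$$x^{emb}_n = f^{emb}_{a,\tau}\left([a^{emb}_n, \tau^{emb}_n]\right) \tag{13}$$
$$a^{emb}_n = f^{emb}_a(a_n), \qquad \tau^{emb}_n = f^{emb}_\tau(\tau_n) \tag{14}$$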
We use a 1-hot encoding to represent the action category label $a_n$. Then, we have two branches: one to estimate the parameters of the posterior distribution and another to estimate the parameters of the prior distribution. The network architecture of these two branches is similar, but we use separate networks because the prior and the posterior distribution capture different information. Each branch has a Long Short Term Memory (LSTM) [11] to encode the current action and the past actions into a vector representation:
$$h^{post}_n = \mathrm{LSTM}_\phi\left(x^{emb}_n, h^{post}_{n-1}\right) \tag{15}$$
$$h^{prior}_n = \mathrm{LSTM}_\psi\left(x^{emb}_n, h^{prior}_{n-1}\right) \tag{16}$$
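Recurrent networks turn variable-length sequences into meaningful, fixed-size representations. The output of the posterior LSTM $h^{post}_n$ (resp. prior LSTM $h^{prior}_n$) is passed into a posterior (also called inference) network $f^{post}_\phi$ (resp. prior network $f^{prior}_\psi$) that outputs the parameters of the Gaussian distribution:
$$\mu_{\phi_n}, \sigma^2_{\phi_n} = f^{post}_\phi\left(h^{post}_n\right) \tag{17}$$
$$\mu_{\psi_{n+1}}, \sigma^2_{\psi_{n+1}} = f^{prior}_\psi\left(h^{prior}_n\right) \tag{18}$$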
Then, a latent variable $z_n$ is sampled from the posterior (or prior during testing) distribution and is fed to the decoder networks for generating distributions over the action category $a_n$ and inter-arrival time $\tau_n$.
The decoder network for the action category $f^a_\theta$ is a multi-layer perceptron with a softmax output to generate the probability distribution in Eq. 7:
$$p^a_\theta(a_n|z_n) = f^a_\theta(z_n) \tag{19}$$
The decoder network for the inter-arrival time $f^\tau_\theta$ is another multi-layer perceptron, producing the parameter of the point process model for the temporal distribution in Eq. 8:
$$\lambda(z_n) = f^\tau_\theta(z_n) \tag{20}$$
During training, the parameters of all the networks are jointly learned in an end-to-end fashion.
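Table 1. Log-likelihood (LL) of APP-LSTM and APP-VAE variants.
| Dataset | Model | Stoch. Var. | LL |
|---|---|---|---|
| Breakfast | APP-LSTM | - | -6.668 |
| Breakfast | APP-VAE w/o Learned Prior | ✓ | -9.427 |
| Breakfast | APP-VAE | ✓ | -5.944 |
| MultiTHUMOS | APP-LSTM | - | -4.190 |
| MultiTHUMOS | APP-VAE w/o Learned Prior | ✓ | -5.344 |
| MultiTHUMOS | APP-VAE | ✓ | -3.838 |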
4 Experiments
Datasets.
We performed experiments using APP-VAE on two action recognition datasets. We use the standard training and testing sets for each.
MultiTHUMOS Dataset [29] is a challenging dataset for action recognition, containing 400 videos of 65 different actions. On average, there are 10.5 action class labels per video and 1.5 actions per frame.
Breakfast Dataset [17] contains 1712 videos of breakfast preparation for 48 action classes. The actions are performed by 52 people in 18 different kitchens.
Architecture details.
The APP-VAE model architecture is shown in Fig. 3. Action category and inter-arrival time inputs are each passed through 2-layer MLPs with ReLU activation. They are then concatenated and followed by a linear layer. The hidden state size of the prior and posterior LSTMs is 128. Both prior and posterior networks are 2-layer MLPs, with ReLU activation after the first layer. The dimension of the latent code is 256. The action decoder is a 3-layer MLP with ReLU at the first two layers and softmax at the last one. The time decoder is also a 3-layer MLP with ReLU at the first two layers, with an exponential nonlinearity applied to the output to ensure that the parameter of the point process is positive.
Implementation details.
The models are implemented with PyTorch [22] and are trained using the Adam [14] optimizer for 1,500 epochs with batch size 32 and learning rate 0.001. We split the standard training set of both datasets into training and validation sets containing 70% and 30% of the samples, respectively. We select the best model during training based on the model loss (the variational lower bound of Section 3.2, Eq. 9) on the validation set.
Baselines.
We compare APP-VAE with the following models for action prediction tasks.

Time Deterministic LSTM (TD-LSTM). This is a vanilla LSTM model that is trained to predict the next action category and the inter-arrival time, comparable with the model proposed by Farha et al. [1]. This model directly predicts the inter-arrival time rather than a distribution over it. TD-LSTM uses the same encoder network as APP-VAE. We use a cross-entropy loss for the action category output and perform regression over the inter-arrival time using a mean squared error (MSE) loss, similar to [1].

Action Point Process LSTM (APP-LSTM). This baseline predicts the inter-arrival time distribution, similar to APP-VAE. The model uses the same reconstruction loss function as the VAE model: cross-entropy loss for the action category and negative log-likelihood (NLL) loss for the inter-arrival time. APP-LSTM does not have the stochastic latent code that allows APP-VAE to model diverse distributions over action category and inter-arrival time. Our APP-LSTM baseline encompasses the work of Du et al. [6]. The only difference is the way we model the intensity function (IF). Du et al. [6] define the IF explicitly as a function of time. This design choice has been investigated in Zhong et al. [32]; an implicit intensity function is shown to be superior and is thus adopted in our APP-LSTM baseline.
Metrics.
We use log-likelihood (LL) to compare our model with APP-LSTM. We also report the accuracy of action category prediction and the mean absolute error (MAE) of inter-arrival time prediction. We calculate accuracy by comparing the most probable action category from the model output with the ground truth category. To calculate MAE, we use the expected inter-arrival time under the predicted distribution $p^\tau_\theta(\tau_n|z_n)$:
$$\mathbb{E}_{p^\tau_\theta(\tau_n|z_n)}[\tau_n] = \int_{0}^{\infty} \tau_n\, p^\tau_\theta(\tau_n|z_n)\, d\tau_n = \frac{1}{\lambda(z_n)} \tag{21}$$
The expected value and the ground truth inter-arrival time are used to compute MAE.
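The two metrics can be sketched as follows: accuracy compares the most probable category with the ground truth, and MAE compares the mean of the predicted exponential distribution (the reciprocal of its rate) with the ground truth inter-arrival time. The function name `accuracy_and_mae` and the toy inputs are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def accuracy_and_mae(predicted_probs: Sequence[Sequence[float]],
                     predicted_rates: Sequence[float],
                     true_categories: Sequence[int],
                     true_taus: Sequence[float]) -> Tuple[float, float]:
    """Return (accuracy, mean absolute error) over a set of prediction steps."""
    correct = 0
    absolute_error = 0.0
    for probs, rate, category, tau in zip(predicted_probs, predicted_rates,
                                          true_categories, true_taus):
        predicted_category = max(range(len(probs)), key=probs.__getitem__)
        correct += int(predicted_category == category)
        expected_tau = 1.0 / rate  # mean of an exponential distribution
        absolute_error += abs(expected_tau - tau)
    count = len(true_categories)
    return correct / count, absolute_error / count


probs: List[List[float]] = [[0.1, 0.9], [0.8, 0.2]]
accuracy, mae = accuracy_and_mae(probs, [0.5, 0.25], [1, 1], [2.0, 3.0])
print(accuracy, mae)
```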
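Table 2. Accuracy of action category prediction and MAE of inter-arrival time prediction.
| Dataset | Model | Time Loss | Stoch. Var. | Accuracy (%) | MAE |
|---|---|---|---|---|---|
| Breakfast | TD-LSTM | MSE | - | 53.64 | 173.76 |
| Breakfast | APP-LSTM | NLL | - | 61.39 | 152.17 |
| Breakfast | APP-VAE w/o Learned Prior | NLL | ✓ | 27.09 | 270.75 |
| Breakfast | APP-VAE | NLL | ✓ | 62.20 | 142.65 |
| MultiTHUMOS | TD-LSTM | MSE | - | 29.74 | 2.33 |
| MultiTHUMOS | APP-LSTM | NLL | - | 36.31 | 1.99 |
| MultiTHUMOS | APP-VAE w/o Learned Prior | NLL | ✓ | 8.79 | 2.02 |
| MultiTHUMOS | APP-VAE | NLL | ✓ | 39.30 | 1.89 |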
Table 3. Example MultiTHUMOS test sequences with high and low likelihood.
| Likelihood | # | Test sequence |
|---|---|---|
| High | 1 | NoHuman, CliffDiving, Diving, Jump, BodyRoll, CliffDiving, Diving, Jump, BodyRoll, CliffDiving, Diving, Jump, BodyRoll, BodyContract, Run, CliffDiving, Diving, Jump, …, BodyRoll, CliffDiving, Diving, BodyContract, CliffDiving, Diving, CliffDiving, Diving, CliffDiving, Diving, Jump, CliffDiving, Diving, Walk, Run, Jump, Jump, Run, Jump |
| High | 2 | CleanAndJerk, PickUp, BodyContract, Squat, StandUp, BodyContract, Squat, CleanAndJerk, PickUp, StandUp, BodyContract, Squat, CleanAndJerk, PickUp, StandUp, Drop, BodyContract, Squat, PickUp, …, Squat, StandUp, Drop, BodyContract, Squat, BodyContract, Squat, BodyContract, Squat, BodyContract, Squat, BodyContract, Squat, NoHuman |
| Low | 1 | NoHuman, TalkToCamera, GolfSwing, GolfSwing, GolfSwing, GolfSwing, NoHuman |
| Low | 2 | NoHuman, HammerThrow, TalkToCamera, CloseUpTalkToCamera, HammerThrow, HammerThrow, HammerThrow, TalkToCamera, …, HammerThrow, HammerThrow, HammerThrow, HammerThrow, HammerThrow, HammerThrow, HammerThrow, HammerThrow, HammerThrow, HammerThrow, HammerThrow, HammerThrow, HammerThrow |
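Table 4. Accuracy and MAE of APP-VAE when averaging over all samples (avg) or selecting the mode action (mode).
| Dataset | Model | Strategy | Acc | MAE |
|---|---|---|---|---|
| Breakfast | APP-VAE | avg | 59.02 | 145.95 |
| Breakfast | APP-VAE | mode | 62.20 | 142.65 |
| MultiTHUMOS | APP-VAE | avg | 35.23 | 1.96 |
| MultiTHUMOS | APP-VAE | mode | 39.30 | 1.89 |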
4.1 Experiment Results
We discuss quantitative and qualitative results from our experiments. All quantitative experiments are performed using teacher forcing, i.e. for each step in the sequence of actions, the models are fed the ground truth history of actions, and the likelihood and/or other metrics for the next action are measured.
Quantitative results.
Table 1 shows experimental results that compare APP-VAE with APP-LSTM. To estimate the log-likelihood (LL) of our model, we draw 1500 samples from the approximate posterior distribution, following the standard approach of importance sampling. APP-VAE outperforms APP-LSTM on both the MultiTHUMOS and Breakfast datasets. We believe that this is because the APP-VAE model is better at modeling the complex distribution over future actions.
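An importance-sampling estimate of the log-likelihood averages importance weights over samples drawn from the approximate posterior and takes the logarithm of that average. The sketch below assumes each log-weight is formed as log-likelihood of the observation plus log-prior density minus log-posterior density of the sampled latent code; this composition and the function name `importance_log_likelihood` are assumptions for illustration. The log-sum-exp form avoids numerical underflow.

```python
import math
from typing import Sequence


def importance_log_likelihood(log_weights: Sequence[float]) -> float:
    """Estimate log p(x) as the log of the mean importance weight.

    Each entry is assumed to be
    log p(x | z_s) + log p(z_s | history) - log q(z_s | x, history)
    for a latent sample z_s drawn from the approximate posterior q.
    """
    peak = max(log_weights)
    summed = sum(math.exp(weight - peak) for weight in log_weights)
    return peak + math.log(summed) - math.log(len(log_weights))


# Equal weights: the estimate equals the common value.
equal_case = importance_log_likelihood([-3.0, -3.0, -3.0, -3.0])
# Weights 1 and 3 average to 2.
mixed_case = importance_log_likelihood([math.log(1.0), math.log(3.0)])
print(equal_case, mixed_case)
```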
Table 2 shows accuracy and MAE in predicting the future action given the history of previous actions. APP-VAE outperforms TD-LSTM and APP-LSTM under both metrics. For each step in the sequence we draw 1500 samples from the prior distribution that models the next-step action. For each sample, we select the action category with the maximum probability as the predicted action, and the expected value of the inter-arrival time as the predicted inter-arrival time. Out of the 1500 predictions, we select the most frequent action as the model prediction for that time step, and compute the inter-arrival time by averaging over the corresponding time values.
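The mode-based aggregation over sampled predictions can be sketched as below. It reads "the corresponding time values" as the predicted times of those samples whose predicted action equals the most frequent action; that reading, along with the function name `aggregate_predictions`, is an assumption made for illustration.

```python
from collections import Counter
from typing import Sequence, Tuple


def aggregate_predictions(samples: Sequence[Tuple[int, float]]) -> Tuple[int, float]:
    """Combine per-sample (category, expected inter-arrival time) predictions.

    Returns the most frequent category and the mean predicted time over
    the samples that voted for that category.
    """
    counts = Counter(category for category, _ in samples)
    mode_category = counts.most_common(1)[0][0]
    times = [tau for category, tau in samples if category == mode_category]
    return mode_category, sum(times) / len(times)


# Three hypothetical samples; in the experiments 1500 samples are drawn per step.
prediction = aggregate_predictions([(2, 4.0), (1, 10.0), (2, 6.0)])
print(prediction)
```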
Tables 1 and 2 also show the comparison of our model with the case where the prior is fixed at all time steps. In this experiment, we fixed the prior to the standard normal distribution $\mathcal{N}(0, I)$. We can see that the learned prior variant outperforms the fixed prior variant consistently across all datasets. The model with the fixed prior does not perform well because it learns to predict the majority action class and the average inter-arrival time of the training set, ignoring the history of any input test sequence.
In addition to the above strategy of selecting the mode action at each step, we also report action category accuracy and MAE obtained by averaging over the predictions of all 1500 samples. We summarize these results in Table 4.
We next explore the architecture of our model by varying the size of the latent variable. Table 5 shows the log-likelihood of our model for different sizes of the latent variable. We see that as we increase the size of the latent variable up to 256, we can model a more complex latent distribution, which results in better performance; at size 512 the log-likelihood drops again.
Qualitative Results.
Fig. 4 shows examples of diverse future action sequences that are generated by APP-VAE given the history. For different provided histories, sampled sequences of actions are shown. We note that the overall duration and sequence of actions on the Breakfast dataset are reasonable. Variations, e.g. taking the juice squeezer before using it or adding salt and pepper before cooking eggs, are plausible alternatives generated by our model.
Fig. 5 visualizes a traversal on one of the latent codes for three different sequences by uniformly sampling one dimension over a range while fixing the others to their sampled values. As shown, this dimension correlates closely with the actions add_saltnpepper, stirfry_egg and fry_egg.
We further qualitatively examine the ability of the model to score the likelihood of individual test samples. We sort the test action sequences according to the average per-time-step likelihood, estimated by drawing 1500 samples from the approximate posterior distribution following the importance sampling approach. High scoring sequences should be those that our model deems "normal", while low scoring sequences should be those that are unusual. Tab. 3 shows some examples of sequences with low and high likelihood on the MultiTHUMOS dataset. We note that a regular, structured sequence of actions such as jump, body roll, cliff diving for a diving action, or body contract, squat, clean and jerk for a weightlifting action, receives high likelihood. However, repeated hammer throws or golf swings with no set-up actions receive a low likelihood.
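Table 5. Log-likelihood (LL) of our model for different latent variable sizes.
| Latent size | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|
| LL | -4.486 | -3.947 | -3.940 | -3.838 | -4.098 |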
Finally, we compare the asynchronous APP-LSTM with a synchronous variant (with constant frame rate) on the Breakfast dataset. The synchronous model predicts actions one step at a time, and the sequence is post-processed to infer the duration of each action. The synchronous variant performs significantly worse in both time MAE (1459.99 vs 152.17 for the asynchronous model) and action prediction accuracy (28.24% vs 61.39%). A plausible explanation is that LSTMs cannot deal with very long-term dependencies.
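One simple way to post-process a per-frame label sequence into actions with durations is run-length encoding, sketched below. This is only an illustration of the kind of post-processing a synchronous model needs; the function name `frames_to_events`, the frame rate, and the labels are hypothetical, and the actual procedure used in the experiments is not specified in the text.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def frames_to_events(frame_labels: Sequence[str], fps: float) -> List[Tuple[str, float]]:
    """Collapse consecutive identical frame labels into (label, duration in seconds)."""
    events = []
    for label, run in groupby(frame_labels):
        events.append((label, len(list(run)) / fps))
    return events


# Six hypothetical frames at 2 frames per second.
events = frames_to_events(["pour", "pour", "pour", "stir", "stir", "pour"], fps=2.0)
print(events)
```

The contrast with the asynchronous setting is visible in the sequence lengths: a model stepping frame by frame processes every frame, while the sparse event sequence has one step per action.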
5 Conclusion
We presented a novel probabilistic model for point process data – a variational autoencoder that captures uncertainty in action times and category labels. As a generative model, it can produce action sequences by sampling from a prior distribution, the parameters of which are updated based on neural networks that control the distributions over the next action type and its temporal occurrence. The model can also be used to analyze given input sequences of actions to determine the likelihood of observing particular sequences. We demonstrate empirically that the model is effective for capturing the uncertainty inherent in tasks such as action prediction and anomaly detection.
References

[1] Yazan Abu Farha, Alexander Richard, and Juergen Gall. When Will You Do What? - Anticipating Temporal Occurrences of Activities. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[2] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H. Campbell, and Sergey Levine. Stochastic Variational Video Prediction. In International Conference on Learning Representations (ICLR), 2018.
[3] Judith Bütepage, Hedvig Kjellström, and Danica Kragic. Classify, predict, detect, anticipate and synthesize: Hierarchical recurrent latent variable models for human activity modeling. arXiv preprint arXiv:1809.08875, 2018.
[4] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2980–2988, 2015.
[5] Emily Denton and Rob Fergus. Stochastic Video Generation with a Learned Prior. In International Conference on Machine Learning (ICML), 2018.
[6] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
[7] Jiyang Gao, Zhenheng Yang, and Ram Nevatia. RED: Reinforced encoder-decoder networks for action anticipation. CoRR, abs/1707.04818, 2017.
[8] Alan G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 1971.
[9] Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, and Leonid Sigal. Probabilistic video generation using holistic attribute control. In European Conference on Computer Vision (ECCV), 2018.
[10] M. Hoai and F. De la Torre. Max-margin early event detectors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[11] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 1997.
[12] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. Toward controlled generation of text. In International Conference on Machine Learning (ICML), pages 1587–1596, 2017.
[13] Valerie Isham and Mark Westcott. A self-correcting point process. Stochastic Processes and their Applications, 1979.
[14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[15] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
[16] J. F. C. Kingman. Poisson processes. 1993.
[17] Hilde Kuehne, Ali Arslan, and Thomas Serre. The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[18] Tian Lan, Tsung-Chuan Chen, and Silvio Savarese. A hierarchical representation for future action prediction. In European Conference on Computer Vision (ECCV), 2014.
[19] S. Ma, L. Sigal, and S. Sclaroff. Learning activity progression in LSTMs for activity detection and early detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[20] Tahmida Mahmud, Mahmudul Hasan, and Amit K. Roy-Chowdhury. Joint prediction of activity labels and starting times in untrimmed videos. In IEEE International Conference on Computer Vision (ICCV), 2017.
[21] Hongyuan Mei and Jason Eisner. The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process. In Advances in Neural Information Processing Systems (NIPS), 2017.
[22] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems (NIPS), 2017.
[23] Viorica Pătrăucean, Ankur Handa, and Roberto Cipolla. Spatio-temporal video autoencoder with differentiable memory. In International Conference on Learning Representations (ICLR) Workshop, 2016.
[24] Mohammad Sadegh Aliakbarian, Fatemeh Sadat Saleh, Mathieu Salzmann, Basura Fernando, Lars Petersson, and Lars Andersson. Encouraging LSTMs to Anticipate Actions Very Early. In IEEE International Conference on Computer Vision (ICCV), 2017.
[25] Yuge Shi, Basura Fernando, and Richard Hartley. Action anticipation with RBF kernelized feature mapping RNN. In European Conference on Computer Vision (ECCV), 2018.
[26] Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee. Learning to Generate Long-term Future via Hierarchical Prediction. In International Conference on Machine Learning (ICML), 2017.
[27] Carl Vondrick and Antonio Torralba. Generating the future with adversarial transformers. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[28] Shuai Xiao, Mehrdad Farajtabar, Xiaojing Ye, Junchi Yan, Le Song, and Hongyuan Zha. Wasserstein learning of deep generative point process models. In Advances in Neural Information Processing Systems (NIPS), 2017.
[29] Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, and Li Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. International Journal of Computer Vision (IJCV), 2017.
[30] Li Yingzhen and Stephan Mandt. Disentangled sequential autoencoder. In International Conference on Machine Learning (ICML), 2018.
[31] Runsheng Yu, Zhenyu Shi, and Laiyun Qing. Unsupervised learning aids prediction: Using future representation learning variantial autoencoder for human action prediction. CoRR, abs/1711.09265, 2017.
[32] Y. Zhong, B. Xu, G.-T. Zhou, L. Bornn, and G. Mori. Time Perception Machine: Temporal Point Processes for the When, Where and What of Activity Prediction. ArXiv e-prints, Aug. 2018.