During the past decades, researchers have made substantial progress in computer vision algorithms that can automatically detect [1, 2, 3] and recognize [4, 5, 6, 7] actions in video sequences. However, the ability to go beyond this and estimate how past actions will affect future activities opens exciting possibilities. A good estimate of future behaviour is an essential sensory component for an automated system to fully comprehend the real world. In this paper, we tackle the problem of estimating the prospective occurrence of future activity. Our goal is to predict the timing, spatial location, and category of the next activity given past information. We aim to answer the when, where, and what questions of activity prediction.
Consider the sports video example shown in Fig. 1. In our work, we directly model the occurrence of discrete activity events that occur in a data stream. Within a sports context, these activities could include key moments in a game, such as passes, shots, or goals. More generally, they could correspond to important human actions along a sequence, such as a person leaving a building, stopping to engage in conversation with a friend, or sitting down on a park bench. Predicting where and when these semantically meaningful events occur would enable many applications within robotics, autonomous vehicles, security and surveillance, and other video processing domains.
Let the input be a sequence of T frames. Among these, n (n ≪ T) frames are each marked by an activity, whose timestamps are denoted as t_1, t_2, …, t_n. Our goal is to estimate when and where the next activity (the (n+1)-th) will happen and what type of activity it will be, given the past sequence of activities and frames up to t_n.
Importantly, we are interested in predictions regarding the semantically meaningful, sparsely occurring events within a sequence. This discrete time moment representation for actions is commonplace in numerous applications: e.g., where and when will the next shot take place in this hockey game, where do we need to be to intercept it; from where and when will the next person hail a rideshare, where should we drive to pick him/her up; when is the next nursing home patient going to request assistance, what will he/she request and where will that request be made? Generalizations of this paradigm are possible, where we consider multiple people, such as players in a sports game. We elaborate on this idea and demonstrate that we can model events corresponding to important, actionable inferences.
Following the standard terminology of the point process literature, we use the term arrival pattern to refer to the temporal distribution of activities throughout the paper. We wish to model this distribution and infer when and where the next activity will take place. However, in vision tasks the raw input is a dense stream of frames, whereas we are interested in the moments sparsely distributed over the sequence at which activities commence. Therefore, we need a mechanism to build features from these frames of interest while also preserving information from the other, regular frames. To address this problem, we utilize a hierarchical recurrent neural network with skip connections for multi-resolution temporal data processing.
Similar to variational autoencoders [9, 10], which model the distribution of latent variables with deep learning, our model leverages the expressive power of neural networks to fit the arrival pattern (temporal distribution of activities) in the data. A network is used to learn the conditional intensity of a temporal point process, and the likelihood is maximized during training. In contrast to traditional statistical approaches that demand expert domain knowledge, our model does not require a hand-crafted conditional intensity; instead, the intensity is learned automatically on top of raw data. We name our model the Time Perception Machine (TPM).
Our work has three main contributions:
Proposing a new task – predicting the occurrence of future activity – for human action analysis, which has not been explored before on streaming data such as videos and person trajectories;
Developing a novel hierarchical RNN with skip connections for feature extraction at finer resolution (frames of interest) while preserving information at coarser resolution;
Formulating a generic conditional intensity and extending the model to a joint prediction framework for the when, where and what of activity forecasting.
2 Related Work
2.1 Activity Forecasting
Seminal work on activity forecasting was done by Kitani et al. [11], who modeled the effect of physical surroundings using semantic scene labeling and inverse reinforcement learning to predict plausible future paths and destinations of pedestrians.
Subsequent work [12] reasons about the long-term behaviors and goals of an individual given their first-person visual observations. Similarly, Xie et al. [13] attempted to infer human intents by leveraging agent-based Lagrangian mechanics to model the latent needs that drive people toward functional objects. Park et al. [14] proposed an EgoRetinal map for motion planning from egocentric stereo videos. Vondrick et al. [15] presented a framework for predicting the visual representations of future frames, which is employed to anticipate actions and objects in the future. Unlike previous work on activity forecasting, which focuses on planning paths and predicting intent, our work addresses a different problem: we aim to predict the discrete attributes (the when, where, and what) of future activities.
Recent temporal activity detection / prediction methods build on recurrent neural network architectures. These include connectionist temporal classification (CTC) architectures [16, 17]. CTC models conduct classification by generalizing away from actual time stamps, while prediction methods regress actual temporal values. A variety of temporal neural network structures exist (convolutional [18], GRU, LSTM, Phased LSTM [19]), many of which have been applied to activity recognition. Our contribution is complementary in that it focuses on a novel point process model over distributions of discrete events for activity prediction.
2.2 Temporal Point Processes
A temporal point process is a stochastic model used to capture the arrival pattern of a series of events in time. Temporal point processes are studied in various areas, including health-care analysis [20], electronic commerce [21], and the modeling of earthquakes and aftershocks [22].
A temporal point process model can be fully characterized by the “conditional intensity” quantity, denoted by λ*(t), which is conditioned on the past information H_t. The conditional intensity encodes the expected rate of arrivals within an infinitesimal neighborhood at time t. Once we determine the intensity, we determine a temporal point process. Mathematically, given the history up to the n-th event and the conditional intensity λ*(t), we can formulate the probability density function

f*(t) = λ*(t) exp(−∫_{t_n}^{t} λ*(τ) dτ)   (1)

and the cumulative distribution function

F*(t) = 1 − exp(−∫_{t_n}^{t} λ*(τ) dτ)   (2)

for the time of the next event t_{n+1}. We defer the full derivation of both formulas to Appendix A.
For notational convenience, throughout this paper we use “*” to indicate that a quantity is conditioned on the past. For example, λ*(t) = λ(t | H_t), f*(t) = f(t | H_t), and F*(t) = F(t | H_t). Below we show the conditional intensities of several temporal point process models.
Poisson Process [23]. λ*(t) = λ₀, where λ₀ is a positive constant.
Hawkes Process [24]. λ*(t) = μ + α Σ_{t_j < t} exp(−β(t − t_j)), where μ, α, and β are positive constants. This process is an “aggregated” process: one event is likely to trigger a series of other events in a short period of time, but the likelihood drops exponentially with time.
Self-Correcting Process [25]. λ*(t) = exp(μt − αN(t)), where N(t) is the number of events before time t, and μ and α are positive constants. This process is more “averaged” in time: a previous event is likely to inhibit the occurrence of the next one (by decreasing the intensity), and the intensity then increases again until the next event happens.
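For illustration, the three classical intensities above can be written as plain functions of time and event history. The parameter names and default values below are arbitrary examples, not settings used anywhere in our experiments:

```python
import math

def poisson_intensity(t, history, lam0=1.0):
    # Homogeneous Poisson process: constant rate, independent of history.
    return lam0

def hawkes_intensity(t, history, mu=0.5, alpha=0.8, beta=1.0):
    # Self-exciting: each past event adds an exponentially decaying bump.
    return mu + alpha * sum(math.exp(-beta * (t - tj)) for tj in history if tj < t)

def self_correcting_intensity(t, history, mu=1.0, alpha=0.2):
    # Self-correcting: each past event divides the intensity by exp(alpha),
    # while the exp(mu * t) term steadily raises it between events.
    n_past = sum(1 for tj in history if tj < t)
    return math.exp(mu * t - alpha * n_past)
```

Evaluating these on a short history makes the qualitative difference visible: the Hawkes intensity is highest right after an event and decays, while the self-correcting intensity drops at each event and then climbs.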
Furthermore, recent work by Du et al. [26] explored temporal point process models using neural networks, but only experimented with sparse timestamp data. We extend their approach to dense streaming data with the proposed hierarchical RNN, which extracts features at frames of interest. Additionally, we demonstrate the effectiveness of a more generic intensity function in modeling the arrival pattern, and we show how a more powerful joint estimation framework can be formulated for simultaneous prediction of the timing, spatial location, and category of the next activity event.
We will first introduce the hierarchical RNN structure upon which our model is built. Then we will present in detail the formulation and derivation of the proposed model for predicting the timing of future activities. Finally we show how our model can be extended to a joint estimation framework for the simultaneous prediction of the time, location, and category of the next activity.
3.1 Hierarchical RNN
The input to our model is an entire sequence of frames. In our experiments, these include visual data in the form of bounding boxes cropped around people in video sequences and/or representations of human motion trajectories as 2D coordinates of person location over time.
A typical temporal point process model only takes as input the frames annotated with activities. These n frames are very sparse compared to the entire dense sequence of T frames (n ≪ T). We expect these significant frames to contain important features; however, we do not want to lose any information inherent in the remaining (T − n) frames. To this end, we need a hierarchical RNN capable of feature extraction at different time resolutions. This is similar in spirit to tasks from the natural language processing domain, such as recent work [27, 28, 29] in language modeling with character-to-word and word-to-phrase networks for feature extraction at multiple scales. More generally, this is an instance of the classic multiple-time-scales problem in recurrent neural networks [30].
In our case, we use a hierarchical RNN model composed of two stacked RNNs. The lower-level RNN looks into the details by covering every frame in the input sequence. The higher level RNN fixes its attention only on frames of activities so as to capture the temporal dynamics among these significant times. We implement the RNN with LSTM cells. Fig. 2 shows the model structure.
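A minimal sketch of this two-level recurrence follows, with scalar tanh cells standing in for the LSTM cells; all names and weight values here are illustrative simplifications, not our actual implementation:

```python
import math

def rnn_cell(x, h, w_x=0.5, w_h=0.9, b=0.0):
    # Minimal scalar RNN cell standing in for an LSTM cell.
    return math.tanh(w_x * x + w_h * h + b)

def hierarchical_rnn(frames, event_idx):
    """Lower RNN runs over every frame; the higher RNN updates only at
    frames marked as activities, reading the lower hidden state there."""
    h_low, h_high = 0.0, 0.0
    high_states = []
    for i, x in enumerate(frames):
        h_low = rnn_cell(x, h_low)          # fine temporal resolution
        if i in event_idx:                  # skip connection to coarse level
            h_high = rnn_cell(h_low, h_high)
            high_states.append(h_high)
    return high_states

# Ten frames, with activities at frames 3 and 7: the higher RNN emits
# one hidden state per activity while still seeing all intermediate frames.
states = hierarchical_rnn([0.1 * k for k in range(10)], event_idx={3, 7})
```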
3.2 Conditional Intensity Function
Instead of hand-crafting the conditional intensity λ*(t), we view it as an output of the hierarchical RNN and learn it directly from raw data. However, an arbitrary choice of the conditional intensity could be problematic, because it must characterize a probability distribution. Thus, we need to validate the resultant probability density function in Eq. 1 and the cumulative distribution function in Eq. 2.
Necessity (∫_{t_n}^{∞} λ*(τ) dτ = ∞). Given F*(∞) = 1 and Eq. 2, we have exp(−∫_{t_n}^{∞} λ*(τ) dτ) = 0, from which it follows that ∫_{t_n}^{∞} λ*(τ) dτ = ∞.
Sufficiency (∫_{t_n}^{∞} λ*(τ) dτ = ∞). First, λ*(t) must be positive for it to define a valid probability density by Eq. 1. Since λ*(t) is positive, under the divergence condition it defines a valid probability distribution, hence a well-established temporal point process. Conversely, if ∫_{t_n}^{∞} λ*(τ) dτ = c < ∞, i.e., the integral is a positive constant less than infinity, then F*(∞) = 1 − e^{−c} < 1. This would be an invalid cumulative distribution function, since F*(∞) must equal 1. ∎
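The proposition can also be checked numerically through Eq. 2: an intensity whose integral diverges (e.g., a constant) drives the CDF to 1, while one whose integral converges leaves the CDF stuck below 1. A sketch, with the quadrature step and horizon chosen arbitrarily:

```python
import math

def cdf_from_intensity(intensity, t_end, dt=1e-3):
    # F*(t) = 1 - exp(-∫_0^t λ(τ) dτ), with the integral approximated
    # by the midpoint rule.
    total, t = 0.0, 0.0
    while t < t_end:
        total += intensity(t + dt / 2) * dt
        t += dt
    return 1.0 - math.exp(-total)

# Divergent integral (constant intensity): the CDF approaches 1.
F_const = cdf_from_intensity(lambda t: 1.0, t_end=20.0)
# Convergent integral (λ(t) = e^{-t}, total mass 1): the CDF stalls at 1 - e^{-1}.
F_decay = cdf_from_intensity(lambda t: math.exp(-t), t_end=20.0)
```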
We formally define two forms of conditional intensity as follows.
Explicit time dependence λ₁*(t): The first form is inspired by [26], which models the conditional intensity based on the hidden states h_n and the time t:

λ₁*(t) = exp(vᵀh_n + w(t − t_n) + b), with w > 0.   (3)
Note that we make an important correction to [26]: the conditional intensity without the positive constraint w > 0 in Eq. 3 does not conform to the necessary condition above. By imposing the constraint w > 0, we can prove that the revised intensity in Eq. 3 satisfies the condition in the above proposition.
Implicit time dependence λ₂*(t): Note that the design of λ₁*(t), to some extent, presumes how it is a function of time t. As t is part of the input, we believe it is possible to acquire the time information from the hidden states without any explicit specification of t. We use an exponential activation to ensure the positivity of the resultant conditional intensity. Formally, we have:

λ₂*(t) = exp(vᵀh_n + b).   (4)
3.3 Joint Likelihood
Now we show that our model can be readily plugged into a joint estimation framework by formulating a joint likelihood for the timing, spatial location, and category of activities. Instead of directly modeling the next activity location, we use an incremental approach that models the space shift from the current position. Let L denote the joint likelihood for a sequence of activities; t_n, a_n, and s_n denote the timestamp, action category, and space shift of the n-th activity, respectively. To derive the joint likelihood, we make the following assumption.
For mathematical convenience, we assume that the timing, action category, and space shift of event n+1 are conditionally independent given the history up to event n (H_{t_n}). That is, p(t_{n+1}, a_{n+1}, s_{n+1} | H_{t_n}) = p(t_{n+1} | H_{t_n}) · p(a_{n+1} | H_{t_n}) · p(s_{n+1} | H_{t_n}), or f*(t, a, s) = f*(t) f*(a) f*(s) if we use the “*” notation. Therefore, we have the joint likelihood parameterized by the network weights θ:

L(θ) = ∏_n f*(t_n) f*(a_n) f*(s_n).   (7)
Estimating the Action Category: The action category likelihood f*(a_{n+1}) represents the distribution over the type of action. Since the history is encoded by the RNN hidden states h_n, we have f*(a_{n+1}) = p(a_{n+1} | h_n). Given the hidden states h_n, our model outputs a discrete distribution â_{n+1} over action classes via a softmax layer.
We then model this likelihood with a Gibbs distribution: f*(a_{n+1}) ∝ exp(−E(â_{n+1}, a_{n+1})), where the energy function E is the Kullback–Leibler divergence between the predicted distribution â_{n+1} and the ground-truth distribution a_{n+1} (encoded as a one-hot vector).
Estimating the Space Shift: The space shift likelihood gives the spatial distribution of the next move. Similar to [26], we have f*(s_{n+1}) = p(s_{n+1} | h_n). We model the likelihood using a bivariate Gaussian distribution with mean μ_{n+1} = (μ_x, μ_y) and a 2×2 covariance matrix Σ. We find that learning all the parameters in Σ is unstable, so we assume the shifts along the x and y directions are independent, hence Σ = diag(σ_x², σ_y²). We set σ_x and σ_y to be constants and predict μ_{n+1} from the hidden states h_n through a linear layer with learnable parameters, which parameterizes Eq. 10.
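Under the independence assumption Σ = diag(σ_x², σ_y²), the space-shift log-likelihood decomposes into two univariate Gaussian terms. A sketch (the function name is ours; the default σ values of 2 ft mirror the NBA training setting described in the experiments):

```python
import math

def space_shift_log_likelihood(dx, dy, mu_x, mu_y, sigma_x=2.0, sigma_y=2.0):
    # Log-density of an axis-aligned bivariate Gaussian: the x and y
    # shifts are assumed independent, with fixed standard deviations.
    def log_normal(v, mu, sigma):
        return -0.5 * math.log(2 * math.pi * sigma ** 2) - (v - mu) ** 2 / (2 * sigma ** 2)
    return log_normal(dx, mu_x, sigma_x) + log_normal(dy, mu_y, sigma_y)
```

The likelihood is maximized when the observed shift coincides with the predicted mean, which is what drives the mean toward the true shift during training.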
The model parameters can be learned in a supervised learning framework by maximizing the likelihood of event sequences. To formulate the data (log-)likelihood, we substitute Eqs. 5, 6, 9, and 10 into Eq. 7. Converting this to log-likelihood yields Eq. 12 and Eq. 13 for the intensities λ₁*(t) and λ₂*(t) in Eq. 3 and Eq. 4, respectively.
Here the constant term C absorbs all constants in the derivation above and can be dropped during optimization. The joint likelihood over all sample sequences is obtained by summing the log-likelihood of each sequence. Because the log-likelihood is fully differentiable, we can apply back-propagation to maximize it.
To infer the timing of the next activity, we follow the same inference procedure as in the standard point process literature: given all ground-truth history up to activity n, we predict when activity n+1 will happen; we then predict the timing of activity n+2 given all ground-truth history up to activity n+1. Therefore, errors will not accumulate exponentially. This is a reasonable approach in many practical scenarios (knowing what has happened up to now, predict the next event). While we have a full model of the distribution, to obtain a point estimate we take the expected time as our prediction: t̂_{n+1} = E[t_{n+1}] = ∫_{t_n}^{∞} t f*(t) dt. Eq. 14 is the result obtained using the conditional intensity in Eq. 3, where Γ(·, ·) is an incomplete gamma function whose value can be evaluated using numerical integration algorithms. Eq. 15 is obtained using the conditional intensity in Eq. 4. The derivation makes use of Eq. 1, and we include the full details in the supplementary material.
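For intuition, the expected-time point estimate can also be computed by direct numerical integration of E[t_{n+1}] = ∫ t f*(t) dt, with the density built from any intensity via Eq. 1. With a constant intensity λ, this must recover the exponential mean t_n + 1/λ, which gives a simple sanity check. A sketch (step size and horizon are arbitrary):

```python
import math

def expected_next_time(t_n, intensity, horizon=50.0, dt=1e-3):
    """Point estimate E[t_{n+1}] = ∫ t f*(t) dt, with
    f*(t) = λ*(t) exp(-∫_{t_n}^t λ*(τ) dτ), via numerical integration."""
    cum, expectation, t = 0.0, 0.0, t_n
    while t < t_n + horizon:
        lam = intensity(t + dt / 2)
        cum += lam * dt                      # running ∫ λ dτ
        f = lam * math.exp(-cum)             # density near the midpoint
        expectation += (t + dt / 2) * f * dt
        t += dt
    return expectation

# Constant intensity λ = 0.5: waiting time is exponential, so E = t_n + 1/λ = 4.
est = expected_next_time(t_n=2.0, intensity=lambda t: 0.5)
```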
To predict the category of the next activity, we take the most confident class in the output distribution as the prediction: a_{n+1} = arg max_a f*(a_{n+1}).
To estimate the spatial location of the next activity, we add the expected space shift E[s_{n+1}] to the current position to obtain the result.
We evaluate the model on two challenging datasets collected from real-world sports games. These datasets cover activities in basketball and ice hockey, both involving extremely fast movement.
All of our baselines consist of two components: a Markov chain and a conventional point process. The Markov chain models the action category and space shift distributions; the point process models action timestamps. In our experiments, we compare TPM’s performance in time estimation with three conventional temporal point processes: the Poisson process, Hawkes process, and self-correcting process (Sec. 2.2). We compare TPM’s performance in space and category prediction with k-order Markov chains (k = 1, …, 5). Also note that TPM has two variants, using the conditional intensity functions λ₁*(t) and λ₂*(t) in Eq. 3 and Eq. 4, respectively.
STATS SportVU NBA dataset. This dataset contains the trajectories of 10 players and the ball in court coordinates. During each basketball game possession, there are annotations about when and where a pre-defined activity is performed, such as pass, rebound, shot, etc.
The frame data are obtained by concatenating the 2D court coordinates of the offensive players, defensive players, and the ball. The order of concatenation within each team is determined by each player’s distance to the ball: the closest is the first entry and the farthest is appended last. The frame data are fed into the hierarchical RNN with a single-layer perceptron as the per-frame feature extractor. The maximum number of frames is 150 for each sequence. A basketball possession lasts at most 24 seconds, so this results in an effective frame rate of 6.2 fps. During training, we set both σ_x and σ_y to 2 ft.
SPORTLOGiQ NHL dataset. This dataset includes raw broadcast videos, player bounding boxes, and trajectories with annotations similar to the NBA dataset. However, unlike the NBA dataset, the number of players in each frame may change due to the nature of broadcast videos. To handle this, we fix the number of players used, denoted K. If there are fewer than K players, we zero out the extra entries. If there are more than K players, we select the K players that are most clustered, under the assumption that players cluster around where the actions are. We use closeness centrality to implement this intuition: we build a complete graph over the players in a frame, each player being a node, then compute the closeness centrality of each node using Euclidean distance and choose the K players with the highest closeness scores.
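The closeness-centrality selection can be sketched directly; the function and variable names are ours, and the toy positions are illustrative:

```python
import math

def select_players(positions, k):
    """Pick the k most 'clustered' players by closeness centrality on a
    complete graph, using Euclidean distance as the edge weight."""
    def closeness(i):
        # Closeness = (number of other nodes) / (sum of distances to them):
        # players near the cluster get high scores, outliers get low scores.
        d = sum(math.dist(positions[i], positions[j])
                for j in range(len(positions)) if j != i)
        return (len(positions) - 1) / d if d > 0 else float("inf")
    ranked = sorted(range(len(positions)), key=closeness, reverse=True)
    return sorted(ranked[:k])

pos = [(0, 0), (1, 0), (0, 1), (50, 50)]   # one outlier far from the cluster
chosen = select_players(pos, k=3)           # the outlier is dropped
```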
Given the pixels inside the bounding box and the coordinates of a single player, we feed them into a VGG-16 network [31] and a single-layer perceptron, respectively, and sum the outputs. This is repeated K times (i.e., once for every selected player), and finally we apply element-wise max-pooling over the K feature vectors to obtain a holistic feature representation of the players. Fig. 3 outlines this workflow.
In the experiments, we use a fixed K. For each sequence, we use at most 80 frames for training and 200 frames for evaluation. After down-sampling the videos, the frame rate is 7.5 fps; thus the longest sequence allowed is approximately 10.7 s for training and 26.7 s for evaluation. We again use 2 ft for σ_x and σ_y.
4.2 Performance Measures
We use mean absolute error (mAE) to evaluate the estimation of time and space, and mean average precision (mAP) to measure the performance of action category prediction. However, given the nature of sports games, there are significant variations among the time intervals between neighboring activities (intervals range from milliseconds to seconds). Reporting mAE alone ignores these variations: an error of 100 ms is less significant if the ground-truth time interval is 1 s rather than merely 100 ms. Therefore, we advocate mean deviation rate (mDR) as a better measure. The deviation rate (DR) is calculated as

DR = |t̂_{n+1} − t_{n+1}| / (t_{n+1} − t_n),

where t̂_{n+1} is the predicted time of the next activity; mDR is DR averaged over all time steps.
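The metric is a one-liner per step; a sketch (function name and toy timestamps are ours):

```python
def mean_deviation_rate(pred_times, true_times):
    """DR_n = |predicted - true| / (true inter-event interval);
    mDR averages DR over all prediction steps."""
    rates = []
    for n in range(len(true_times) - 1):
        interval = true_times[n + 1] - true_times[n]
        rates.append(abs(pred_times[n + 1] - true_times[n + 1]) / interval)
    return sum(rates) / len(rates)

# The same 100 ms error counts as DR = 0.1 on a 1 s interval
# but DR = 0.5 on a 200 ms interval, so mDR here is 0.3.
mdr = mean_deviation_rate([0.0, 1.1, 1.3], [0.0, 1.0, 1.2])
```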
The baseline models predict the time of the next activity with conventional temporal point processes: the Poisson process, Hawkes process, and self-correcting process. To predict the category and location of the next activity, we utilize k-order Markov chains with k = 1, …, 5. We do not use higher orders since most sample possessions have a sequence length of no more than 10.
The inference stage of a k-order Markov chain works as follows. Given the k most recent activities, we find the next activity with the highest transition probability. If the number of historical activities at the current time step is less than k, or we are unable to find the exact historical activities in the transition matrix, we relax the dependency requirement by using the k−1 most recent activities. This is repeated until we find a valid transition to the next activity; the worst case is a degenerate 0-order Markov chain, which amounts to majority voting. Given the selected transition, we compute the mean space shift of all such transitions collected during training and add it to the current location, yielding the prediction of the next activity location.
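The back-off scheme above can be sketched as follows; the transition table and activity names are hypothetical:

```python
def markov_predict(history, transitions, k):
    """Back-off inference for a k-order Markov chain: shrink the context
    until a known transition is found; order 0 amounts to majority voting."""
    for order in range(min(k, len(history)), -1, -1):
        context = tuple(history[len(history) - order:])
        if context in transitions:
            counts = transitions[context]
            return max(counts, key=counts.get)
    return None

# Hypothetical transition table: context tuple -> {next activity: count}.
table = {("pass", "shot"): {"rebound": 5, "goal": 1},
         ("shot",): {"rebound": 3},
         (): {"pass": 10, "shot": 4}}    # order-0 fallback (majority vote)
pred = markov_predict(["pass", "shot"], table, k=2)
```

An unseen context (e.g., a history of only "dribble") falls back through order 1 to the order-0 majority vote.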
The results in Tab. I show that the proposed TPMs outperform traditional statistical approaches. Moreover, comparing the two TPM variants, we find that the variant using λ₂*(t) performs better than the one using λ₁*(t). Thus, the proposed conditional intensity λ₂*(t) can be more generic and effective than λ₁*(t).
Table I. Time prediction results: mAE (ms) and mDR (%) on the NBA and NHL datasets.
To see what the model has learned, we visualize the TPM predictions versus the ground-truth annotations in Fig. 4. Our model is generally able to approximate and keep track of the true arrival pattern in the input sequence (e.g., the upper row of each of the four subfigures in Fig. 4). Large gaps between prediction and ground truth occur when a sudden high spike appears in the ground truth. We believe this is due to the inherent randomness of sports games: beyond the past series of activities, the action to be performed depends on many other factors, such as tactics, which have not been explicitly observed or annotated during training and are challenging for the model to learn.
The lower row of each of the four subfigures in Fig. 4 visualizes how the predicted time distribution changes as a basketball possession proceeds. The ability to capture the temporal distribution is a key advantage of the TPM.
Table II. Space prediction mAE (ft); the first two columns are the two TPM variants, followed by k-order Markov chains for k = 1 to 5 (category AP (%) rows omitted).
|NBA space mAE (ft)|3.43|3.28|6.91|6.86|6.73|6.69|6.69|
|NHL space mAE (ft)|56.95|57.01|65.96|66.60|66.85|66.88|67.24|
In terms of space prediction, Tab. II shows quantitative results. We see that TPMs have consistently better performance than Markov chains on both datasets. A sample qualitative result is presented in Fig. 5. Note that the court in NBA games is 94ft by 50ft and the rink in NHL games is 200ft by 85ft.
The space mAE (in Euclidean distance) on the NHL dataset is significantly greater than that on the NBA dataset. We believe this is because, in ice hockey games, players and the puck exhibit extremely quick motions. For example, the puck can be moved from one end of the rink to the other in less than a second, after which a puck reception could happen immediately, making the spatial location hard to predict. In contrast to hockey, our models are more accurate for basketball, where the relatively slower motions make space prediction more precise. Space prediction relies heavily on the speed of motion, but category prediction is not subject to such a constraint, so our models exhibit reasonable performance on inferring the type of the next activity.
An interesting finding is that one particular order of Markov chain achieves surprisingly good mAP on the NHL dataset when compared to Markov chains of other orders. After looking into the precision of each category (provided in the supplementary material), we find that it performs exceptionally well on activities such as carry, dumpout, and dumpin, which are very rare in the training data compared to other types of activities. We did not observe similar behaviour on the NBA dataset, so we believe this results from the highly unbalanced ground-truth annotations in the NHL dataset.
Regression vs. distribution. An intuitive way to predict the next activity time is to train a regression neural network with a mean squared error loss. However, we believe that learning a distribution captures more than regressing a scalar does. We validate this with a simple experiment. We train TPM solely for time prediction. All else being equal, we train a vanilla regression neural network to predict the time interval between the current activity and the next, which is then added to the current timestamp to obtain the predicted time of the next activity. Results are presented in Tab. III. TPM clearly does a better job of predicting the next activity occurrence. Additionally, since TPM is trained explicitly by maximizing the raw likelihood function, it readily enables us to inspect the temporal distribution of predictions as in Fig. 4, whereas this feature is not available in a regression model.
Framework and generality. The proposed TPM is a general framework for predicting and modeling the arrival pattern of an activity sequence. It does not rely on a specific neural network structure. For example, in our experiments we use a simple VGG-16 as the backbone network, but one can use more advanced networks such as [32, 33, 34]. Networks designed exclusively for action recognition [35, 36, 37, 7] can be used as well.
Applicable scenarios. TPM is a powerful model of the arrival pattern of sparsely distributed activities and can forecast the exact time of occurrence of the next activity. Here “sparsely distributed” does not imply any notion of weak supervision/annotation: TPM conforms to a fully supervised learning paradigm. Existing work such as [16] uses sparsely annotated data as well, but it addresses a totally different task from TPM’s. Furthermore, TPM specializes in sequences where activity events can be approximated as mass points in time; activities with a long temporal span do not fit into the TPM framework. Therefore, TPM is positioned in contrast to existing benchmarks such as Breakfast [38] and MPII-Cooking [39], but is useful for the sports analytics, surveillance, and autonomous vehicle scenarios outlined above.
We have presented a novel take on the problem of activity forecasting. Predicting when and where discrete, important activity events will occur is the task we explore. In contrast with previous activity forecasting methods, this emphasizes semantically meaningful action categories and is explicit about when and where they will next take place. We construct a novel hierarchical RNN based temporal point process model for this task. Empirical results on challenging sports action datasets demonstrate the efficacy of the proposed methods.
Appendix A Probability density and cumulative distribution of temporal point processes
The cumulative distribution F*(t) is defined as the probability that at least one event happens by time t since the last event time t_n. The “*” is a reminder that a quantity depends on the past. Let f*(t) denote the probability density function and N(t) the number of events up to time t. Then we have

F*(t) = P(N(t) − N(t_n) ≥ 1 | H_{t_n}).

This is equivalent to

F*(t) = 1 − P(N(t) − N(t_n) = 0 | H_{t_n}).

Because the temporal point process models we are dealing with belong to the general class of non-homogeneous Poisson processes, whose conditional intensity is a function of time t, by definition the number of events in (t_n, t] conforms to a Poisson distribution parameterized by

Λ = ∫_{t_n}^{t} λ*(τ) dτ,

where Λ is the expected number of events in the interval. Hence P(N(t) − N(t_n) = 0 | H_{t_n}) = e^{−Λ}, which yields Eq. 2; differentiating F*(t) with respect to t gives the density in Eq. 1.
Appendix B The validity of conditional intensities
When ∫_{t_n}^{∞} λ*(τ) dτ = ∞, the quantity Λ(t) = ∫_{t_n}^{t} λ*(τ) dτ is monotonically increasing in t, since λ*(t) > 0. As t approaches infinity, Λ(t) approaches infinity as well. Substituting into Eq. 2, we have F*(∞) = 1 − e^{−∞} = 1, so λ*(t) is a valid conditional intensity when ∫_{t_n}^{∞} λ*(τ) dτ = ∞.
Appendix C Inference of time
C.1 When λ*(t) takes the form in Eq. 3
With λ₁*(t) = λ₀ e^{w(t − t_n)}, where λ₀ = exp(vᵀh_n + b), Eq. 1 gives

f*(t) = λ₀ e^{w(t − t_n)} exp(−(λ₀/w)(e^{w(t − t_n)} − 1)).

Substituting u = t − t_n (over u ∈ [0, ∞) the density integrates to 1), then changing variables to s = (λ₀/w) e^{wu} and integrating by parts, the expectation E[t_{n+1}] = ∫_{t_n}^{∞} t f*(t) dt reduces to an expression involving an incomplete gamma function, which we evaluate numerically (Eq. 14).
C.2 When λ*(t) takes the form in Eq. 4
Since λ₂*(t) = exp(vᵀh_n + b) ≜ λ does not actually rely on t, Eq. 1 reduces to f*(t) = λ e^{−λ(t − t_n)}, i.e., the waiting time is exponentially distributed. Substituting u = t − t_n (over u ∈ [0, ∞) the density integrates to 1) and integrating by parts,

E[t_{n+1}] = ∫_{t_n}^{∞} t λ e^{−λ(t − t_n)} dt = t_n + 1/λ,

which yields Eq. 15.
- [1] Z. Shou, D. Wang, and S. Chang, “Action temporal localization in untrimmed videos via multi-stage CNNs,” in CVPR, 2016.
- [2] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang, “CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos,” in CVPR, 2017.
- [3] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng, “Temporal action localization by structured maximal sums,” in CVPR, 2017.
- [4] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in CVPR, 2014.
- [5] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in CVPR, 2015.
- [6] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in CVPR, 2016.
- [7] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3d residual networks,” in CVPR, 2017.
- [8] W. Lian, R. Henao, V. Rao, J. Lucas, and L. Carin, “A multitask point process predictive model,” in ICML, 2015.
- [9] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014.
- [10] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, “DRAW: A recurrent neural network for image generation,” in ICML, 2015.
- [11] K. Kitani, B. Ziebart, J. Bagnell, and M. Hebert, “Activity forecasting,” in ECCV, 2012.
- [12] N. Rhinehart and K. M. Kitani, “First-person activity forecasting with online inverse reinforcement learning,” in ICCV, 2017.
- [13] D. Xie, T. Shu, S. Todorovic, and S.-C. Zhu, “Modeling and inferring human intents and latent functional objects for trajectory prediction,” arXiv preprint arXiv:1606.07827, 2016.
- [14] H. Soo Park, J.-J. Hwang, Y. Niu, and J. Shi, “Egocentric future localization,” in CVPR, 2016.
- [15] C. Vondrick, H. Pirsiavash, and A. Torralba, “Anticipating visual representations from unlabeled video,” in CVPR, 2016.
- [16] D.-A. Huang, L. Fei-Fei, and J. C. Niebles, “Connectionist temporal modeling for weakly supervised action labeling,” in ECCV, 2016.
- [17] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in ICML, 2006.
- [18] S. Bai, J. Zico Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
- [19] D. Neil, M. Pfeiffer, and S.-C. Liu, “Phased LSTM: Accelerating recurrent network training for long or event-based sequences,” in NIPS, 2016.
- [20] T. A. Lasko, “Efficient inference of gaussian-process-modulated renewal processes with application to medical event data,” in UAI, 2014.
- [21] L. Xu, J. A. Duan, and A. Whinston, “Path to purchase: A mutually exciting point process model for online advertising and conversion,” Management Science, vol. 60, no. 6, pp. 1392–1412, 2014.
- [22] Y. Ogata, “Space-time point-process models for earthquake occurrences,” Annals of the Institute of Statistical Mathematics, vol. 50, no. 2, pp. 379–402, 1998.
- [23] J. F. C. Kingman, Poisson Processes. Wiley Online Library, 1993.
- [24] A. G. Hawkes, “Spectra of some self-exciting and mutually exciting point processes,” Biometrika, vol. 58, no. 1, pp. 83–90, 1971.
- [25] V. Isham and M. Westcott, “A self-correcting point process,” Stochastic Processes and Their Applications, vol. 8, no. 3, pp. 335–347, 1979.
- [26] N. Du, H. Dai, R. Trivedi, U. Upadhyay, M. Gomez-Rodriguez, and L. Song, “Recurrent marked temporal point processes: Embedding event history to vector,” in SIGKDD, 2016.
- [27] J. Chung, S. Ahn, and Y. Bengio, “Hierarchical multiscale recurrent neural networks,” in ICLR, 2017.
- [28] A. Sordoni, Y. Bengio, H. Vahabi, C. Lioma, J. Grue Simonsen, and J.-Y. Nie, “A hierarchical recurrent encoder-decoder for generative context-aware query suggestion,” in CIKM, 2015.
- [29] L. Kong, C. Dyer, and N. A. Smith, “Segmental recurrent neural networks,” in ICLR, 2016.
- [30] S. El Hihi and Y. Bengio, “Hierarchical recurrent neural networks for long-term dependencies,” in NIPS, 1996.
- [31] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
- [32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich et al., “Going deeper with convolutions,” in CVPR, 2015.
- [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
- [34] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in CVPR, 2017.
- [35] M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, “A hierarchical deep temporal model for group activity recognition,” in CVPR, 2016.
- [36] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in ICCV, 2015.
- [37] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in NIPS, 2014.
- [38] H. Kuehne, A. B. Arslan, and T. Serre, “The language of actions: Recovering the syntax and semantics of goal-directed human activities,” in CVPR, 2014.
- [39] M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele, “Recognizing fine-grained and composite activities using hand-centric features and script data,” International Journal of Computer Vision, vol. 119, no. 3, pp. 346–373, 2016.