Time Perception Machine: Temporal Point Processes for the When, Where and What of Activity Prediction

08/13/2018 ∙ by Yatao Zhong, et al. ∙ Simon Fraser University

Numerous powerful point process models have been developed to understand temporal patterns in sequential data from fields such as health-care, electronic commerce, social networks, and natural disaster forecasting. In this paper, we develop novel models for learning the temporal distribution of human activities in streaming data (e.g., videos and person trajectories). We propose an integrated framework of neural networks and temporal point processes for predicting when the next activity will happen. Because point processes are limited to taking event frames as input, we propose a simple yet effective mechanism to extract features at frames of interest while also preserving the rich information in the remaining frames. We evaluate our model on two challenging datasets. The results show that our model outperforms traditional statistical point process approaches significantly, demonstrating its effectiveness in capturing the underlying temporal dynamics as well as the correlation within sequential activities. Furthermore, we also extend our model to a joint estimation framework for predicting the timing, spatial location, and category of the activity simultaneously, to answer the when, where, and what of activity prediction.




1 Introduction

During the past decades, researchers have made substantial progress in computer vision algorithms that can automatically detect [1, 2, 3] and recognize [4, 5, 6, 7] actions in video sequences. However, the ability to go beyond this and estimate how past actions will affect future activities opens exciting possibilities. A good estimation of future behaviour is an essential sensory component for an automated system to fully comprehend the real world. In this paper, we tackle the problem of estimating the prospective occurrence of future activity. Our goal is to predict the timing, spatial location, and category of the next activity given past information. We aim to answer the when, where, and what questions of activity prediction.

Consider the sports video example shown in Fig. 1. In our work, we directly model the occurrence of discrete activity events in a data stream. Within a sports context, these activities could include key moments in a game, such as passes, shots, or goals. More generally, they could correspond to important human actions along a sequence, such as a person leaving a building, stopping to engage in conversation with a friend, or sitting down on a park bench. Predicting where and when these semantically meaningful events occur would enable many applications within robotics, autonomous vehicles, security and surveillance, and other video processing domains.

Fig. 1: An ice hockey example: 1) the puck is passed to the player in the red box; 2) the player in the red box receives the puck; 3) the player in the red box carries the puck across the centre line; 4) the player in the red box dumps the puck into the offensive zone. Given the sequence of activities above, we aim to predict what the next activity will be, where it will take place, and when it will occur.
Problem Definition.

Let the input be a sequence of $M$ frames. Among these, $N$ ($N \ll M$) frames are each marked by an activity, whose timestamps are denoted as $t_1, \dots, t_N$. Our goal is to estimate when and where the next activity (the $(N+1)$-th) will happen and what type of activity it will be, given the past sequence of activities and frames up to $t_N$.

Importantly, we are interested in predictions regarding the semantically meaningful, sparsely occurring events within a sequence. This discrete time moment representation for actions is commonplace in numerous applications: e.g., where and when will the next shot take place in this hockey game, where do we need to be to intercept it; from where and when will the next person hail a rideshare, where should we drive to pick him/her up; when is the next nursing home patient going to request assistance, what will he/she request and where will that request be made? Generalizations of this paradigm are possible, where we consider multiple people, such as players in a sports game. We elaborate on this idea and demonstrate that we can model events corresponding to important, actionable inferences.

Following the standard terminology [8], we use the term arrival pattern to refer to the temporal distribution of activities throughout the paper. We wish to model this distribution and infer when and where the next activity will take place. However, in vision tasks the raw input is a dense sequence of frames, whereas we are interested in the moments, sparsely distributed in the sequence, at which activities commence. Therefore, we need a mechanism to build features from the activity frames while also preserving the information in the other, regular frames. To address this problem, we utilize a hierarchical recurrent neural network with skip connections for multi-resolution temporal data processing.

Similar to variational autoencoders [9, 10], which model the distribution of latent variables with deep learning, our model leverages the same advantage of neural networks to fit the arrival pattern (temporal distribution of activities) in the data. A network is used to learn the conditional intensity of a temporal point process and the likelihood is maximized during training. In contrast to traditional statistical approaches that demand expert domain knowledge, our model does not require a hand-crafted conditional intensity. Instead, it is learned automatically from raw data. We name our model the Time Perception Machine (TPM).

Our work has three main contributions:

  1. Proposing a new task – predicting the occurrence of future activity – for human action analysis, which has not been explored before on streaming data such as videos and person trajectories;

  2. Developing a novel hierarchical RNN with skip connections for feature extraction at finer resolution (frames of interest) while preserving information at coarser resolution;

  3. Formulating a generic conditional intensity and extending the model to a joint prediction framework for the when, where and what of activity forecasting.

2 Related Work

2.1 Activity Forecasting

Seminal work on activity forecasting was done by Kitani et al. [11], who modeled the effect of physical surroundings using semantic scene labeling and inverse reinforcement learning to predict plausible future paths and destinations of pedestrians.

Subsequent work [12] reasons about the long-term behaviors and goals of an individual given their first-person visual observations. Similarly, Xie et al. [13] attempted to infer human intents by leveraging agent-based Lagrangian mechanics to model the latent needs that drive people toward functional objects. Park et al. [14] proposed an EgoRetinal map for motion planning from egocentric stereo videos. Vondrick et al. [15] presented a framework for predicting the visual representations of future frames, which is employed to anticipate actions and objects in the future. Unlike the previous work on activity forecasting, which focuses on planning paths and predicting intent, our work addresses a different problem in that we aim to predict the discrete attributes (the when, where, and what) of future activities.

Recent temporal activity detection / prediction methods build on recurrent neural network architectures. These include connectionist temporal classification (CTC) architectures [16, 17]. CTC models conduct classification by generalizing away from actual time stamps, while prediction methods regress actual temporal values. A variety of temporal neural network structures exist (convolutional [18], GRU, LSTM, Phased LSTM [19]), many of which have been applied to activity recognition. Our contribution is complementary in that it focuses on a novel point process model for distributions of discrete events for activity prediction.

2.2 Temporal Point Processes

A temporal point process is a stochastic model used to capture the arrival pattern of a series of events in time. Temporal point processes are studied in various areas including health-care analysis [20], electronic commerce [21], modeling earthquakes and aftershocks [22], etc.

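A temporal point process model can be fully characterized by the “conditional intensity” quantity, denoted by $\lambda(t \mid \mathcal{H}_t)$, which is conditioned on the past information $\mathcal{H}_t$. The conditional intensity encodes the expected rate of arrivals within an infinitesimal neighborhood at time $t$. Once we determine the intensity, we determine a temporal point process. Mathematically, given the history $\mathcal{H}_j$ up to the $j$-th event and the conditional intensity $\lambda(t \mid \mathcal{H}_j)$, we can formulate the probability density function $f(t \mid \mathcal{H}_j)$ and the cumulative distribution function $F(t \mid \mathcal{H}_j)$ for the time of the next event $t_{j+1}$, shown in Eq. 1 and Eq. 2. We defer the full derivation of both formulas to Appendix A.

$$f(t \mid \mathcal{H}_j) = \lambda(t \mid \mathcal{H}_j)\,\exp\Big(-\int_{t_j}^{t} \lambda(\tau \mid \mathcal{H}_j)\,d\tau\Big) \tag{1}$$

$$F(t \mid \mathcal{H}_j) = 1 - \exp\Big(-\int_{t_j}^{t} \lambda(\tau \mid \mathcal{H}_j)\,d\tau\Big) \tag{2}$$

For notational convenience, we use “*” to indicate that a quantity is conditioned on the past throughout this paper. For example, $\lambda^*(t) = \lambda(t \mid \mathcal{H}_j)$, $f^*(t) = f(t \mid \mathcal{H}_j)$, and $F^*(t) = F(t \mid \mathcal{H}_j)$. Below we show the conditional intensities of several temporal point process models.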
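As a rough illustration of Eq. 1 and Eq. 2, the sketch below evaluates the density and the cumulative distribution of the next event time from an arbitrary intensity function by numerical integration. The function names, the trapezoidal rule and the constant test intensity are assumptions made for this example, not part of the paper.

```python
import math

def cumulative_intensity(intensity, t_last, t, steps=2000):
    """Trapezoidal estimate of the integral of the intensity over [t_last, t]."""
    if t <= t_last:
        return 0.0
    dt = (t - t_last) / steps
    total = 0.5 * (intensity(t_last) + intensity(t))
    for i in range(1, steps):
        total += intensity(t_last + i * dt)
    return total * dt

def density(intensity, t_last, t):
    """Eq. 1: intensity at t times exp(-cumulative intensity since t_last)."""
    return intensity(t) * math.exp(-cumulative_intensity(intensity, t_last, t))

def cdf(intensity, t_last, t):
    """Eq. 2: one minus exp(-cumulative intensity since t_last)."""
    return 1.0 - math.exp(-cumulative_intensity(intensity, t_last, t))

# A constant intensity (homogeneous Poisson process) should recover
# the exponential distribution with the same rate.
rate = 2.0
constant = lambda t: rate
print(cdf(constant, 0.0, 1.0), 1.0 - math.exp(-rate))
print(density(constant, 0.0, 1.0), rate * math.exp(-rate))
```

With a constant rate the two printed pairs should agree, which is a quick sanity check that the two formulas are mutually consistent.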

Poisson Process [23]. $\lambda^*(t) = \lambda_0$, where $\lambda_0$ is a positive constant.

Hawkes Process [24]. $\lambda^*(t) = \lambda_0 + \alpha \sum_{t_j < t} \exp(-\gamma (t - t_j))$, where $\lambda_0$, $\alpha$ and $\gamma$ are positive constants. This process is an “aggregated” process, where one event is likely to trigger a series of other events in a short period of time, but the likelihood drops exponentially with regard to time.

Self-Correcting Process [25]. $\lambda^*(t) = \exp\big(\mu t - \sum_{t_j < t} \alpha\big)$, where $\mu$ and $\alpha$ are positive constants. This process is more “averaged” in time. A previous event is likely to inhibit the occurrence of the next one (by decreasing the intensity). Then the intensity will increase again until the next event happens.
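The three classical intensities can be compared side by side with a few lines of code. The sketch below assumes the standard textbook forms (a constant rate, an exponentially decaying excitation kernel, and an exponential of elapsed time minus a per-event penalty); the parameter names and default values are hypothetical and chosen only for illustration.

```python
import math

def poisson_intensity(t, history, lam0=1.0):
    """Constant rate, independent of the history."""
    return lam0

def hawkes_intensity(t, history, lam0=0.5, alpha=0.8, gamma=1.0):
    """Each past event adds an excitation that decays exponentially."""
    return lam0 + alpha * sum(math.exp(-gamma * (t - tj)) for tj in history if tj < t)

def self_correcting_intensity(t, history, mu=1.0, alpha=0.5):
    """Intensity grows with time and drops by a factor exp(-alpha) per past event."""
    n_past = sum(1 for tj in history if tj < t)
    return math.exp(mu * t - alpha * n_past)

events = [1.0, 1.2, 3.0]
for name, fn in [("poisson", poisson_intensity),
                 ("hawkes", hawkes_intensity),
                 ("self-correcting", self_correcting_intensity)]:
    before, after = fn(0.999, events), fn(1.001, events)
    print(f"{name}: just before first event {before:.3f}, just after {after:.3f}")
```

Evaluating just before and just after the first event shows the qualitative behaviour described above: the Hawkes intensity jumps up, the self-correcting intensity drops, and the Poisson intensity does not react.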

Furthermore, a recent work by Du et al. [26] explored temporal process models using neural networks, but only experimented with sparse timestamp data. We extend their approach to dense streaming data with the proposed hierarchical RNN to extract features at frames of interest. Additionally, we demonstrate the effectiveness of a more generic intensity function in modeling the arrival pattern. We also show how a more powerful joint estimation framework can be formulated for simultaneous prediction of the timing, spatial location and category of the next activity event.

3 Model

We will first introduce the hierarchical RNN structure upon which our model is built. Then we will present in detail the formulation and derivation of the proposed model for predicting the timing of future activities. Finally we show how our model can be extended to a joint estimation framework for the simultaneous prediction of the time, location, and category of the next activity.

3.1 Hierarchical RNN

The input to our model is an entire sequence of frames. In our experiments, these include visual data in the form of bounding boxes cropped around people in video sequences and/or representations of human motion trajectories as 2D coordinates of person location over time.

A typical temporal point process model only takes as input the $N$ frames annotated with activities. These are very sparse compared to the entire dense sequence of $M$ frames ($N \ll M$). We expect these significant frames will contain important features. However, we do not want to lose any information inherent in the remaining ($M - N$) frames. To this end, we need a hierarchical RNN capable of feature extraction at different time resolutions. This is similar in vein to tasks from the natural language processing domain, such as recent work [27, 28, 29] in language modeling, with character-to-word and word-to-phrase networks for feature extraction at multiple scales. More generally, this is an instance of the classic multiple-time-scales problem in recurrent neural networks [30].

In our case, we use a hierarchical RNN model composed of two stacked RNNs. The lower-level RNN looks into the details by covering every frame in the input sequence. The higher level RNN fixes its attention only on frames of activities so as to capture the temporal dynamics among these significant times. We implement the RNN with LSTM cells. Fig. 2 shows the model structure.

Fig. 2: The hierarchical RNN structure. The frame level feature extractor can be any network applied to frames (e.g., VGG-16 net [31]). The dense sequence of frames is fed into the lower level LSTM while only the significant frames pass their features to the higher level LSTM for further processing.
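The two-level processing can be sketched in a few lines. This is a simplified illustration only: it swaps the LSTM cells for plain tanh recurrent cells with random weights, omits the frame-level feature extractor and the skip connections, and uses hypothetical names (`make_cell`, `step`, `hierarchical_forward`). It is meant to show the data flow, in which the lower level reads every frame and the higher level is updated only at activity frames.

```python
import math
import random

def make_cell(input_dim, hidden_dim, rng):
    """Random weights for a plain tanh recurrent cell (a stand-in for an LSTM)."""
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)] for _ in range(hidden_dim)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
    return w_in, w_rec

def step(cell, x, h):
    """One recurrent update: h_new = tanh(W_in x + W_rec h)."""
    w_in, w_rec = cell
    return [
        math.tanh(sum(wi * xi for wi, xi in zip(row_in, x))
                  + sum(wr * hi for wr, hi in zip(row_rec, h)))
        for row_in, row_rec in zip(w_in, w_rec)
    ]

def hierarchical_forward(frames, event_indices, lower, upper, hidden_dim):
    """Lower cell sees every frame; upper cell is updated only at event frames."""
    events = set(event_indices)
    h_low = [0.0] * hidden_dim
    h_high = [0.0] * hidden_dim
    event_states = []
    for i, frame in enumerate(frames):
        h_low = step(lower, frame, h_low)
        if i in events:
            h_high = step(upper, h_low, h_high)
            event_states.append(h_high)
    return event_states

rng = random.Random(0)
hidden_dim = 4
lower = make_cell(input_dim=3, hidden_dim=hidden_dim, rng=rng)
upper = make_cell(input_dim=hidden_dim, hidden_dim=hidden_dim, rng=rng)
frames = [[rng.random() for _ in range(3)] for _ in range(10)]
states = hierarchical_forward(frames, [2, 5, 9], lower, upper, hidden_dim)
print(len(states))  # one higher-level state per activity frame
```

Because the lower-level state is carried across all frames, each higher-level update still depends on the frames between two activities, even though those frames never reach the higher level directly.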

3.2 Conditional Intensity Function

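Instead of hand-crafting the conditional intensity $\lambda^*(t)$, we view it as the output of the hierarchical RNN and learn the conditional intensity directly from raw data. However, an arbitrary choice of the conditional intensity $\lambda^*(t)$ could be potentially problematic, because it needs to characterize a probability distribution. Thus, we need to validate the resultant probability density function in Eq. 1 and the cumulative distribution function in Eq. 2.

Proposition. $\lambda^*(t)$ is a valid conditional intensity that defines a temporal point process if and only if it satisfies $\lambda^*(t) > 0$ and $\lim_{t \to \infty} \int_{t_j}^{t} \lambda^*(\tau)\,d\tau = \infty$.

Proof. Sufficiency (the conditions imply validity). Given $\lim_{t \to \infty} \int_{t_j}^{t} \lambda^*(\tau)\,d\tau = \infty$ and Eq. 2, we have $\lim_{t \to \infty} \exp\big(-\int_{t_j}^{t} \lambda^*(\tau)\,d\tau\big) = 0$, from which it follows that $\lim_{t \to \infty} F^*(t) = 1$. Since $\lambda^*(t)$ is positive, the density in Eq. 1 is positive as well, so under these conditions it defines a valid probability distribution, hence a well-defined temporal point process.

Necessity (validity implies the conditions). First, $\lambda^*(t)$ must be positive for it to define a valid probability density by Eq. 1. If $\lim_{t \to \infty} \int_{t_j}^{t} \lambda^*(\tau)\,d\tau = c < \infty$, which means the integral converges to a finite positive number, then $\lim_{t \to \infty} F^*(t) = 1 - e^{-c} < 1$. This would be an invalid cumulative distribution function since $F^*(\infty) \neq 1$. ∎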

We formally define two forms of conditional intensity as follows.

Explicit time dependence $\lambda_A^*(t)$: The first form is inspired by [26], which models the conditional intensity based on the hidden states $h_j$ and the time $t$.

$$\lambda_A^*(t) = \exp\big(v^\top h_j + w\,(t - t_j) + b\big), \quad w > 0 \tag{3}$$

Note that we make an important correction to [26]. The conditional intensity without the positive constraint in Eq. 3 does not conform to the necessary condition above. By imposing the constraint $w > 0$, we can prove that the revised intensity in Eq. 3 satisfies the condition in the above proposition.

Implicit time dependence $\lambda_B^*(t)$: Note that the design of $\lambda_A^*(t)$, to some extent, assumes how it is a function of time $t$. As $t$ is part of the input, we believe it is possible to acquire the time information from the hidden states $h_j$ without any specification about $t$. We use an exponential activation to ensure the positivity of the resultant conditional intensity. Formally, we have:

$$\lambda_B^*(t) = \exp\big(v^\top h_j + b\big) \tag{4}$$


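The proof for the validity of the two intensities in Eq. 3 and Eq. 4 is provided in Appendix B. The analytic form for the likelihood is obtained by substituting Eq. 3 or Eq. 4 into Eq. 1:

$$f_A^*(t) = \exp\Big\{ v^\top h_j + w\,(t - t_j) + b + \frac{1}{w}\exp\big(v^\top h_j + b\big) - \frac{1}{w}\exp\big(v^\top h_j + w\,(t - t_j) + b\big) \Big\} \tag{5}$$

$$f_B^*(t) = \exp\Big\{ v^\top h_j + b - \exp\big(v^\top h_j + b\big)\,(t - t_j) \Big\} \tag{6}$$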
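A quick numerical check can confirm that the two likelihoods are proper densities. The sketch below assumes the exponential intensity forms with an explicit slope `w > 0` (Eq. 3) and with no explicit time term (Eq. 4), writes `score` for the learned scalar that the hidden state contributes, and integrates each density with the trapezoidal rule. The names and the chosen values are hypothetical.

```python
import math

def log_density_explicit(tau, score, w):
    """log f*(t) for intensity exp(score + w * tau), with tau = t - t_j and w > 0."""
    return score + w * tau + (math.exp(score) - math.exp(score + w * tau)) / w

def log_density_implicit(tau, score):
    """log f*(t) for intensity exp(score), which is constant in tau."""
    return score - math.exp(score) * tau

def integrate(fn, upper, steps=20000):
    """Trapezoidal rule on [0, upper]."""
    dt = upper / steps
    total = 0.5 * (fn(0.0) + fn(upper))
    for i in range(1, steps):
        total += fn(i * dt)
    return total * dt

score = 0.3  # hypothetical value of the learned term
mass_a = integrate(lambda s: math.exp(log_density_explicit(s, score, w=0.5)), 20.0)
mass_b = integrate(lambda s: math.exp(log_density_implicit(s, score)), 20.0)
print(mass_a, mass_b)  # both should be close to 1
```

Working in log space, as done here, is also the natural choice for training, since the log-likelihood is what gets maximized.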


3.3 Joint Likelihood

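Now we show that our model can be readily plugged into a joint estimation framework by formulating a joint likelihood for the timing, spatial location and category of activities. However, instead of directly modeling the next activity location, we use an incremental approach that models the space shift from the current position. Let $f\big(\{(t_j, a_j, \mathbf{d}_j)\}_{j=1}^{N}\big)$ be the joint likelihood for a sequence of activities; $t_j$, $a_j$ and $\mathbf{d}_j$ denote the timestamp, action category, and space shift respectively. To derive the joint likelihood, we make the following assumption.

For mathematical convenience, we assume the timing, action category, and space shift of event $j+1$ are conditionally independent given the history up to event $j$ ($\mathcal{H}_j$). That is, $f(t_{j+1}, a_{j+1}, \mathbf{d}_{j+1} \mid \mathcal{H}_j) = f(t_{j+1} \mid \mathcal{H}_j)\, f(a_{j+1} \mid \mathcal{H}_j)\, f(\mathbf{d}_{j+1} \mid \mathcal{H}_j)$, or $f^*(t, a, \mathbf{d}) = f^*(t)\, f^*(a)\, f^*(\mathbf{d})$ if we use the “*” notation. Therefore, we have the joint likelihood parameterized by $\theta$:

$$f_\theta\big(\{(t_j, a_j, \mathbf{d}_j)\}_{j=1}^{N}\big) = \prod_{j=1}^{N} f_\theta^*(t_j)\, f_\theta^*(a_j)\, f_\theta^*(\mathbf{d}_j) \tag{7}$$

where each factor is conditioned on the history preceding event $j$. We drop the subscript “$\theta$” whenever possible for clean notation. Since we have already obtained the form of $f^*(t)$ in Eq. 5 and Eq. 6, in the next section we derive the form of $f^*(a)$ and $f^*(\mathbf{d})$.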

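Estimating the Action Category: The action category likelihood $f^*(a)$ represents the distribution over the type of action. Since the history is encoded by the RNN hidden states $h_j$, we have $f^*(a_{j+1}) = f(a_{j+1} \mid h_j)$. Given the hidden states $h_j$, our model outputs a discrete distribution over action classes:

$$\hat{\mathbf{p}}_{j+1} = \mathrm{softmax}\big(W_a h_j + \mathbf{b}_a\big) \tag{8}$$

We then model this likelihood with a Gibbs distribution:

$$f^*(a_{j+1}) = \frac{1}{Z}\exp\big(-E(\hat{\mathbf{p}}_{j+1}, \mathbf{p}_{j+1})\big) \tag{9}$$

where $Z$ is a normalizing constant and the energy function $E$ is the Kullback-Leibler divergence between the predicted distribution $\hat{\mathbf{p}}_{j+1}$ and the ground-truth distribution $\mathbf{p}_{j+1}$ (encoded as a one-hot vector).

Estimating the Space Shift: The space shift likelihood $f^*(\mathbf{d})$ gives the spatial distribution of the next move. Similar to $f^*(a)$, we have $f^*(\mathbf{d}_{j+1}) = f(\mathbf{d}_{j+1} \mid h_j)$. We model the likelihood using a bivariate Gaussian distribution:

$$f^*(\mathbf{d}_{j+1}) = \frac{1}{2\pi\,|\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(\mathbf{d}_{j+1} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{d}_{j+1} - \boldsymbol{\mu})\Big) \tag{10}$$

where $\boldsymbol{\mu}$ is the mean and $\Sigma$ is a 2x2 covariance matrix. We find that learning all the parameters in $\Sigma$ is unstable, so we assume the shifts along the $x$ and $y$ directions are independent, hence $\Sigma = \mathrm{diag}(\sigma_x^2, \sigma_y^2)$. We set $\sigma_x$ and $\sigma_y$ to be constant and, given the hidden states $h_j$, we use

$$\boldsymbol{\mu} = W_d h_j + \mathbf{b}_d \tag{11}$$

to parameterize Eq. 10, where $W_d$ and $\mathbf{b}_d$ are learnable parameters.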
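The two likelihood terms can be written down compactly. The sketch below makes two assumptions that should be read as such: the class distribution comes from a softmax over hypothetical logits, and the Kullback-Leibler energy is taken from the one-hot ground truth to the prediction, in which case it reduces to the negative log-probability of the true class (the familiar cross-entropy) and the Gibbs normalizer equals one. The space-shift term is a bivariate Gaussian with a diagonal, fixed covariance.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def category_log_likelihood(logits, true_class):
    """-KL(one_hot || predicted) = log of the predicted probability of the true class."""
    return math.log(softmax(logits)[true_class])

def shift_log_likelihood(shift, mean, sigma_x=2.0, sigma_y=2.0):
    """Log-density of a bivariate Gaussian with independent x and y components."""
    dx = shift[0] - mean[0]
    dy = shift[1] - mean[1]
    return (-math.log(2.0 * math.pi * sigma_x * sigma_y)
            - dx * dx / (2.0 * sigma_x ** 2)
            - dy * dy / (2.0 * sigma_y ** 2))

print(category_log_likelihood([2.0, 0.5, -1.0], true_class=0))
print(shift_log_likelihood(shift=(1.0, -0.5), mean=(0.0, 0.0)))
```

With the standard deviations held constant, only the squared-error part of the Gaussian term depends on the network output, so maximizing it is equivalent to a scaled squared-error loss on the predicted mean shift.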

3.4 Training

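The model parameters can be learned in a supervised learning framework, by maximizing the likelihood of event sequences. In order to formulate the data (log-)likelihood, we substitute Eqs. 5, 6, 9 and 10 into Eq. 7. Converting this to log-likelihood yields Eq. 12 and Eq. 13 for the intensities $\lambda_A^*(t)$ and $\lambda_B^*(t)$ in Eq. 3 and Eq. 4, respectively.

$$\log f_A = \sum_{j} \Big[ v^\top h_j + w\,(t_{j+1} - t_j) + b + \frac{1}{w}e^{v^\top h_j + b} - \frac{1}{w}e^{v^\top h_j + w (t_{j+1} - t_j) + b} - E(\hat{\mathbf{p}}_{j+1}, \mathbf{p}_{j+1}) - \frac{1}{2}(\mathbf{d}_{j+1} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{d}_{j+1} - \boldsymbol{\mu}) \Big] + C \tag{12}$$

$$\log f_B = \sum_{j} \Big[ v^\top h_j + b - e^{v^\top h_j + b}\,(t_{j+1} - t_j) - E(\hat{\mathbf{p}}_{j+1}, \mathbf{p}_{j+1}) - \frac{1}{2}(\mathbf{d}_{j+1} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{d}_{j+1} - \boldsymbol{\mu}) \Big] + C \tag{13}$$

Here $C$ absorbs all constants in the derivation above and can be dropped during optimization. The joint log-likelihood for all sample sequences is obtained by summing the log-likelihood of each sequence. Because the log-likelihood is fully differentiable, we can maximize it with back-propagation.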

3.5 Inference

To infer the timing of the next activity, we follow the same inference procedure as in the standard point process literature: given all ground-truth history up to activity $j$, we predict when the next activity $j+1$ will happen. Then we proceed to predict the timing of activity $j+2$ given all ground-truth history up to activity $j+1$. Therefore, the errors do not accumulate over the sequence. This is a reasonable approach in many practical scenarios (knowing what has happened up to now, predict the next event). While we have a full model of the distribution, to obtain a point estimate we take the expected time as our prediction. Eq. 14 is the result obtained using the conditional intensity in Eq. 3, where $\Gamma$ is an incomplete gamma function whose value can be evaluated using numerical integration algorithms. Eq. 15 is obtained using the conditional intensity in Eq. 4. The derivation makes use of Eq. 1, and we include the full details in the supplementary material.

$$\hat{t}_{j+1} = t_j + \frac{1}{w}\,e^{\eta}\,\Gamma(0, \eta), \quad \eta = \frac{1}{w}\exp\big(v^\top h_j + b\big) \tag{14}$$

$$\hat{t}_{j+1} = t_j + \exp\big(-(v^\top h_j + b)\big) \tag{15}$$
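The expected waiting time can also be obtained numerically, which is a useful cross-check on the closed forms. The sketch below assumes an exponential intensity with an optional positive slope `w`, writes `score` for the learned scalar term, and uses the identity that the mean of a non-negative random variable equals the integral of its survival function. All names and values are hypothetical.

```python
import math

def expected_gap_implicit(score):
    """Constant intensity exp(score): the mean gap is the inverse of the rate."""
    return math.exp(-score)

def expected_gap_explicit(score, w, upper=50.0, steps=20000):
    """Intensity exp(score + w * tau), w > 0: integrate the survival function."""
    a = math.exp(score)

    def survival(tau):
        growth = w * tau
        if growth > 700.0:  # exp would overflow; the survival probability is 0 there
            return 0.0
        return math.exp(-(a / w) * math.expm1(growth))

    dt = upper / steps
    total = 0.5 * (survival(0.0) + survival(upper))
    for i in range(1, steps):
        total += survival(i * dt)
    return total * dt

t_last, score = 4.0, 0.3
print(t_last + expected_gap_implicit(score))        # predicted next time, implicit form
print(t_last + expected_gap_explicit(score, 0.5))   # predicted next time, explicit form
```

As the slope `w` shrinks toward zero, the explicit form should approach the implicit one, and for any positive slope it should predict an earlier event because the intensity keeps growing.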


To predict the category of the next activity, we take the most confident class in the output distribution $\hat{\mathbf{p}}_{j+1}$ as the prediction:

$$\hat{a}_{j+1} = \arg\max_{k}\ \hat{\mathbf{p}}_{j+1}(k) \tag{16}$$


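To estimate the spatial location of the next activity, we take the expected space shift added to the current position as the result:

$$\hat{\mathbf{x}}_{j+1} = \mathbf{x}_j + \mathbb{E}[\mathbf{d}_{j+1}] = \mathbf{x}_j + \boldsymbol{\mu} \tag{17}$$

where $\mathbf{x}_j$ denotes the location of the current activity and $\boldsymbol{\mu}$ is the predicted mean space shift.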


4 Experiments

We evaluate the model on two challenging datasets collected from real world sports games. These datasets include activities in basketball and ice hockey with extremely fast movement.

All of our baselines consist of two components: a Markov chain and a conventional point process. The Markov chain models action category and space shift distribution; the point process models action timestamps. In our experiments, we compare TPM’s performance in time estimation with three other typical temporal point processes: Poisson process, Hawkes process and self-correcting process (Sec. 2). We compare TPM’s performance in space and category prediction with $k$-order Markov chains ($k \in \{1, 3, 5, 7, 9\}$). Also note that TPM has two variants, TPM$_A$ and TPM$_B$, using the two conditional intensity functions $\lambda_A^*(t)$ and $\lambda_B^*(t)$ in Eq. 3 and Eq. 4, respectively.

4.1 Datasets

STATS SportVU NBA dataset. This dataset contains the trajectories of 10 players and the ball in court coordinates. During each basketball game possession, there are annotations about when and where a pre-defined activity is performed, such as pass, rebound, shot, etc.

The frame data are obtained by concatenating the $(x, y)$ court coordinates of the offensive players, defensive players and the ball. The order of concatenation within each team is determined by how far a player is from the ball: the closest player is the first entry and the farthest is appended last. The frame data are fed into the hierarchical RNN with a single-layer perceptron as the feature extractor of each frame. The maximum number of frames is 150 for each sequence. A basketball possession is at most 24 seconds, so this results in an effective frame rate of about 6.2 fps. During training, we set both $\sigma_x$ and $\sigma_y$ (the standard deviations of the space-shift Gaussian in Eq. 10) to 2 ft.
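The frame construction described above can be sketched as follows. The function name and the toy coordinates are hypothetical; the only logic taken from the text is the ordering of players within each team by distance to the ball, followed by concatenation of offense, defense and ball.

```python
import math

def build_frame_vector(offense, defense, ball):
    """Concatenate (x, y) coordinates: offense, then defense, then the ball.

    Within each team, players are ordered from closest to farthest from the ball.
    """
    def by_distance(players):
        return sorted(players, key=lambda p: math.dist(p, ball))

    ordered = by_distance(offense) + by_distance(defense) + [ball]
    return [coord for point in ordered for coord in point]

offense = [(10.0, 10.0), (1.0, 1.0)]
defense = [(5.0, 5.0), (2.0, 0.0)]
ball = (0.0, 0.0)
print(build_frame_vector(offense, defense, ball))
```

Sorting by distance gives each slot of the input vector a stable meaning (for example, "closest offensive player") regardless of which individual occupies it.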

SPORTLOGiQ NHL dataset. This dataset includes the raw broadcast videos, player bounding boxes and trajectories with similar annotations to the NBA dataset. However, unlike the NBA dataset, the number of players in each frame may change due to the nature of broadcast videos. To solve this problem, we set a fixed number $K$ of players to use. If there are fewer than $K$ players, we zero out the extra entries. If there are more than $K$ players, we select the $K$ players that are most clustered. We essentially assume the players cluster around where the actions are. We use closeness centrality to implement this intuition. We build a complete graph over the players in a frame, each player being a node in the graph. Then we compute the closeness centrality for each node using Euclidean distance and choose the $K$ players with the highest closeness scores.
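A minimal sketch of this selection step is given below. It assumes the usual definition of closeness centrality, the number of other nodes divided by the sum of shortest-path distances to them; on a complete graph with Euclidean edge weights the shortest path between two players is the direct edge. The function name, the padding value and the toy positions are hypothetical.

```python
import math

def select_players(positions, k):
    """Return exactly k positions: the k most central players, zero-padded if needed."""
    n = len(positions)
    if n <= k:
        # Fewer players than slots: keep them all and zero out the extra entries.
        return list(positions) + [(0.0, 0.0)] * (k - n)

    def closeness(i):
        total = sum(math.dist(positions[i], positions[j]) for j in range(n) if j != i)
        return (n - 1) / total if total > 0 else float("inf")

    ranked = sorted(range(n), key=closeness, reverse=True)
    return [positions[i] for i in ranked[:k]]

players = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (50.0, 50.0)]
print(select_players(players, 3))      # the isolated player at (50, 50) is dropped
print(select_players([(1.0, 2.0)], 3)) # padded with zeros
```

The isolated player has the largest total distance to everyone else and therefore the lowest closeness score, which matches the intuition that the action is where the players cluster.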

Given the pixels inside the bounding box and the coordinates of a single player, we feed them into a VGG-16 network [31] and a single-layer perceptron respectively. The outputs are then summed. This is repeated $K$ times (i.e., for every selected player), and finally we do element-wise max-pooling over the $K$ feature vectors to obtain a holistic feature representation for the players. Fig. 3 outlines this workflow.
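The fusion step can be illustrated with plain lists standing in for network outputs. In the sketch below, `appearance` and `location` are hypothetical stand-ins for the per-player outputs of the image network and of the coordinate perceptron; only the sum followed by element-wise max-pooling is taken from the text.

```python
def fuse_player_features(appearance, location):
    """Sum the two feature vectors of each player, then max-pool across players."""
    summed = [[a + l for a, l in zip(app_vec, loc_vec)]
              for app_vec, loc_vec in zip(appearance, location)]
    # zip(*summed) groups the same feature dimension across all players.
    return [max(column) for column in zip(*summed)]

appearance = [[0.1, 0.9, 0.0], [0.4, 0.2, 0.3]]   # two players, three features each
location = [[0.0, 0.1, 0.5], [0.2, 0.0, 0.1]]
print(fuse_player_features(appearance, location))
```

Max-pooling makes the result independent of the order in which the selected players are processed, which is convenient when player identities are not tracked consistently across frames.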

In the experiments, we use . For each sequence, we use at most 80 frames for training and 200 frames for evaluation. After down-sampling the videos, the frame rate is 7.5 fps. Thus the longest sequence allowed is approximately 10.7 s for training and 26.7 s for evaluation. We again use $\sigma_x = \sigma_y = 2$ ft.

Fig. 3: Frame-level feature extractor for the SPORTLOGiQ dataset.

4.2 Performance Measures

We use mean absolute error (mAE) to evaluate the estimation of time and space, and mean average precision (mAP) to measure the performance of action category prediction. However, given the nature of sports games, there are significant variations among the time intervals between neighboring activities (intervals range from milliseconds to seconds). Reporting mAE alone ignores these variations. For example, an error of 100 ms is less significant if the ground-truth time interval is 1 s than if it is merely 100 ms. Therefore we advocate mean deviation rate (mDR) as a better measure. Deviation rate (DR) is calculated as below, where $\hat{t}_{j+1}$ is the predicted time; mDR is DR averaged over all time steps.

$$\mathrm{DR}_{j+1} = \frac{|\hat{t}_{j+1} - t_{j+1}|}{t_{j+1} - t_j} \times 100\% \tag{18}$$
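The two time-prediction measures can be computed as follows. The sketch assumes that the deviation rate normalizes the absolute error by the ground-truth interval since the previous activity and is reported in percent; the function names and the toy numbers, which mirror the 100 ms example in the text, are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and ground-truth times."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def mean_deviation_rate(previous, predicted, actual):
    """Average of |error| divided by the ground-truth interval, in percent."""
    rates = [abs(p - a) / (a - prev)
             for prev, p, a in zip(previous, predicted, actual)]
    return 100.0 * sum(rates) / len(rates)

# Times in seconds. Both predictions are off by 0.1 s, but the intervals differ.
previous = [0.0, 5.0]
actual = [1.0, 5.1]
predicted = [1.1, 5.2]
print(mean_absolute_error(predicted, actual))            # about 0.1 s
print(mean_deviation_rate(previous, predicted, actual))  # about (10% + 100%) / 2
```

The same absolute error contributes 10% when the true interval is 1 s and 100% when it is 0.1 s, which is the asymmetry the measure is meant to expose.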


4.3 Baselines

The baseline models predict the time of the next activity with a conventional temporal point process, such as the Poisson process, Hawkes process and self-correcting process. In order to predict the category and location of the next activity, we utilize $k$-order Markov chains, where $k \in \{1, 3, 5, 7, 9\}$. We do not use higher orders since most sample possessions do not have a sequence length larger than 10.

The inference stage of a $k$-order Markov chain works as follows. Given the $k$ most recent activities, we find the next activity with the highest transition probability. If the number of historical activities at the current time step is less than $k$, or we are unable to find the exact historical activities in the transition matrix, we relax the dependency requirement by using the $k-1$ most recent activities. This is repeated until we find a valid transition to the next activity. The worst case is a degenerate Markov chain of order 0, which amounts to majority voting. Given the selected transition to the next activity, we compute the mean space shift of all such transitions collected during training and add it to the current location to obtain the predicted location of the next activity.
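The back-off procedure can be sketched with transition counts stored per context. The example below covers only the category prediction (the mean space shift would be stored per transition in the same way); the function names and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict

def fit_markov_chain(sequences, order):
    """Count next-activity occurrences for every context of length 0..order."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, nxt in enumerate(seq):
            for k in range(min(order, i) + 1):
                counts[tuple(seq[i - k:i])][nxt] += 1
    return counts

def predict_next(counts, history, order):
    """Use the longest matching context, backing off to shorter ones if needed."""
    for k in range(min(order, len(history)), -1, -1):
        context = tuple(history[len(history) - k:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None  # only reached if no training data was seen

train = [["pass", "reception", "shot"],
         ["pass", "reception", "pass"],
         ["pass", "reception", "shot"]]
counts = fit_markov_chain(train, order=2)
print(predict_next(counts, ["pass", "reception"], order=2))  # full context found
print(predict_next(counts, ["dribble"], order=2))            # falls back to majority vote
```

The empty context plays the role of the order-0 chain: it holds the overall activity counts, so an unseen history still yields the most frequent activity.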

4.4 Results

The results in Tab. I show that the proposed TPMs outperform traditional statistical approaches. On the other hand, by comparing the two TPM variants, we find that TPM$_B$ (Eq. 4) performs better than TPM$_A$ (Eq. 3) on three of the four measures. Thus, the proposed conditional intensity $\lambda_B^*(t)$ can be more generic and effective than $\lambda_A^*(t)$.

                   NBA                    NHL
                   mAE (ms)   mDR (%)     mAE (ms)   mDR (%)
TPM_A (Eq. 3)      288.1      54.6        527.9      174.5
TPM_B (Eq. 4)      282.1      52.0        530.7      172.0
Poisson            365.5      547.0       645.4      297.6
Hawkes             363.8      541.2       643.7      296.5
Self-Correcting    382.4      522.4       643.0      291.5

TABLE I: Results of time prediction as part of joint estimation.

To see what the model has learned, we visualize the TPM model predictions versus ground-truth annotations in Fig. 4. We find that our model generally is able to approximate and keep track of the true arrival pattern in the input sequence (e.g., the upper row in each of the four subfigures in Fig. 4). There are some large gaps between prediction and ground-truth when there comes a sudden high spike in the ground-truth. We believe this is because of the inherent randomness in sports games. In addition to the past series of activities, the action to be performed depends on many other factors such as tactics, which have not been explicitly observed and annotated during training and are challenging for the model to learn.

The lower row of each of the four subfigures in Fig. 4 visualizes how the predicted time distribution changes as a basketball possession proceeds. The ability to capture the temporal distribution is a key advantage of the TPM.

Fig. 4: Visualization of sample arrival patterns and predicted time distributions on the NBA dataset (TPM). The horizontal axis is the time line for a sequence of activity events within a basketball possession. The upper part of each subfigure plots the predicted and ground-truth time intervals between the current activity and the next activity. The lower part of each subfigure shows the predicted time distribution at each activity event (i.e., the red or blue area). A gray bar indicates the error between the predicted time and the ground-truth time of the next activity. The wider the gray bar, the larger the error and the bluer the corresponding distribution; the thinner the gray bar, the smaller the error and the redder the corresponding distribution. The near-vertical spiky distribution at the end of each subfigure shows how well TPM predicts the sequence end.
Fig. 5: Qualitative results of space prediction on the NBA dataset. Multiple example possessions are shown, each in a different color. Ground-truth locations of the activity sequences are connected with dashed lines. Each arrow points from the ground-truth location of an activity to its location predicted by our model.
                        TPM_A   TPM_B   MC-1    MC-3    MC-5    MC-7    MC-9
NBA
  space mAE (ft)        3.43    3.28    6.91    6.86    6.73    6.69    6.69
  category AP (%)
    shoot               57.9    58.0    10.1    32.9    35.7    37.0    37.4
    dribble             92.4    92.7    86.2    76.2    80.6    82.1    82.6
    pass                44.5    45.9    34.3    21.4    22.6    24.5    24.7
    reception           98.4    98.4    96.2    95.3    95.3    95.2    95.1
    assist              8.7     8.6     2.1     2.5     3.3     3.7     3.7
    end                 99.9    99.9    99.9    99.9    99.9    99.9    99.9
    mAP                 67.0    67.2    54.8    54.7    56.2    57.0    57.3
NHL
  space mAE (ft)        56.95   57.01   65.96   66.60   66.85   66.88   67.24
  category AP (%)
    pass                61.2    61.8    66.9    51.8    52.4    53.1    52.9
    reception           64.4    64.3    78.8    50.8    51.8    52.3    52.1
    carry               21.3    21.2    30.8    20.0    18.7    19.2    18.8
    shoot               11.1    9.6     11.4    10.9    9.9     10.4    10.3
    dumpin              11.3    12.2    30.0    8.6     9.5     9.2     9.3
    protection          32.8    32.8    28.3    24.4    23.6    24.8    24.4
    dumpout             4.7     5.5     22.8    4.6     4.6     4.6     4.6
    check               11.0    11.8    19.5    7.2     7.3     8.0     8.7
    block               25.9    23.0    21.7    15.5    16.2    15.8    15.8
    end                 80.6    79.5    47.0    32.9    26.6    25.0    25.0
    mAP                 32.4    32.2    35.7    22.7    22.1    22.2    22.2

TABLE II: Results of category and space prediction as part of joint estimation. TPM_A and TPM_B use the intensities in Eq. 3 and Eq. 4, respectively; MC-k refers to a k-order Markov chain.

In terms of space prediction, Tab. II shows quantitative results. We see that TPMs have consistently better performance than Markov chains on both datasets. A sample qualitative result is presented in Fig. 5. Note that the court in NBA games is 94ft by 50ft and the rink in NHL games is 200ft by 85ft.

The space mAE (in Euclidean distance) on the NHL dataset is significantly greater than that on the NBA dataset. We believe this is because, in ice hockey games, players and the puck exhibit extremely quick motions. For example, the puck can be moved from one end of the rink to the other in less than a second, after which a puck reception could happen immediately, making the spatial location hard to predict. In contrast to hockey, our models are more accurate for basketball, where the relatively slower motions make space prediction more precise. Space prediction relies heavily on the speed of motion, but category prediction is not subject to such a constraint, so our models exhibit reasonable performance on inferring the type of the next activity.

An interesting finding is that the first-order Markov chain has surprisingly good mAP on the NHL dataset when compared to Markov chains of other orders. After looking into the precision of each category (provided in the supplementary material), we find that it performs markedly better on activities such as carry, dumpout and dumpin, which are very rare in the training data compared to other types of activities. We did not observe similar behaviour on the NBA dataset, so we believe this results from the highly unbalanced ground-truth annotations in the NHL dataset.

        TPM       Regression NN
NBA     51.6%     56.9%
NHL     138.0%    188.2%

TABLE III: Comparison between TPM and a vanilla regression neural network on the task of predicting the time of next activity. Errors are measured in mDR.

5 Discussion

Regression vs. distribution. An intuitive way to predict the next activity time is to train a regression neural network with a mean squared error loss. However, we believe that learning a distribution captures more than regressing a scalar does. We validate this with a simple experiment. We train TPM solely for time prediction. Everything else being equal, we train a vanilla regression neural network to predict the time interval between the current activity and the next activity, which is then added to the current timestamp to obtain the predicted time of the next activity. Results are presented in Tab. III: TPM achieves a lower mDR than the regression network on both datasets. Additionally, since TPM is trained explicitly by maximizing the raw likelihood function, it readily enables us to inspect the temporal distribution of predictions as in Fig. 4, whereas this feature is not available for a regression model.

Framework and generality. The proposed TPM is a general framework for predicting and modeling the arrival pattern of an activity sequence. It does not rely on a specific neural network structure. For example, in our experiments we use a simple VGG-16 as the backbone network, but one can use other, more advanced networks such as [32, 33, 34]. Networks designed exclusively for action recognition [35, 36, 37, 7] can be used as well.

Applicable scenarios. TPM is a powerful model of the arrival pattern of sparsely distributed activities and can forecast the exact next activity time of occurrence. Here “sparsely distributed” does not imply any concepts regarding weak supervision/annotation. TPM conforms to a fully supervised learning paradigm. Existing work such as [16] uses sparsely annotated data as well, but it addresses a totally different task than TPM. Furthermore, TPM specializes in dealing with sequences where activity events can be approximated as mass points in time. Activities with long temporal span do not fit into the TPM framework. Therefore, TPM is positioned in contrast to existing benchmarks such as Breakfast [38] and MPII-Cooking [39], but useful for the sports analytics, surveillance, and autonomous vehicle scenarios outlined above.

6 Conclusion

We have presented a novel take on the problem of activity forecasting. Predicting when and where discrete, important activity events will occur is the task we explore. In contrast with previous activity forecasting methods, this emphasizes semantically meaningful action categories and is explicit about when and where they will next take place. We construct a novel hierarchical RNN based temporal point process model for this task. Empirical results on challenging sports action datasets demonstrate the efficacy of the proposed methods.

Appendix A Probability density and cumulative distribution of temporal point processes

This section presents an intuitive derivation of Eq. 1 and Eq. 2.

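The cumulative distribution $F^*(t)$ is defined as the probability that at least one event happens by time $t$ since the last event time $t_j$. The “*” is a reminder that a quantity depends on the past. Let $f^*(t)$ denote the probability density function and $N(t)$ the number of events up to time $t$. Then we have

$$F^*(t) = \int_{t_j}^{t} f^*(u)\,du = P\big(N(t) - N(t_j) \ge 1\big).$$

This is equivalent to

$$1 - F^*(t) = P\big(N(t) - N(t_j) = 0\big). \tag{20}$$

Because the temporal point process models we are dealing with belong to the general class of non-homogeneous Poisson processes whose conditional intensity $\lambda^*(t)$ is a function of time $t$, by definition the number of events in $[t_j, t)$ conforms to a Poisson distribution parameterized by $\Lambda$:

$$P\big(N(t) - N(t_j) = k\big) = \frac{\Lambda^{k}}{k!}\,e^{-\Lambda}, \tag{21}$$

where $\Lambda$ is the expected number of events in the interval.

Because the conditional intensity is the expected rate of event arrivals, we have $\Lambda = \int_{t_j}^{t} \lambda^*(u)\,du$. Let $k$ in Eq. 21 be zero; then Eq. 21 is equal to Eq. 20. This yields

$$F^*(t) = 1 - \exp\Big(-\int_{t_j}^{t} \lambda^*(u)\,du\Big)$$

and that

$$f^*(t) = \frac{dF^*(t)}{dt} = \lambda^*(t)\,\exp\Big(-\int_{t_j}^{t} \lambda^*(u)\,du\Big).$$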

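The relation between the cumulative distribution and the density derived in this appendix can be checked numerically. The sketch below is only an illustration: the intensity `1 + 0.5 sin(u)` is a hypothetical choice with a closed-form integral, not one of the intensities used in the paper, and the function names are ours. It builds the cumulative distribution from the zero-count Poisson probability and compares its finite-difference derivative with the intensity times the survival term.

```python
import math

def intensity(u):
    """Hypothetical positive intensity, chosen only for illustration."""
    return 1.0 + 0.5 * math.sin(u)

def cumulative_intensity(t_last, t):
    """Closed-form integral of intensity over [t_last, t]."""
    return (t - t_last) + 0.5 * (math.cos(t_last) - math.cos(t))

def poisson_pmf(k, mean):
    """Probability of exactly k events when the expected count is mean."""
    return mean ** k * math.exp(-mean) / math.factorial(k)

def cdf(t_last, t):
    """Probability of at least one event in [t_last, t): one minus the k = 0 term."""
    return 1.0 - poisson_pmf(0, cumulative_intensity(t_last, t))

def pdf(t_last, t):
    """Density of the next event time: intensity times survival probability."""
    return intensity(t) * math.exp(-cumulative_intensity(t_last, t))

t_last, t, eps = 0.0, 1.7, 1e-5
finite_difference = (cdf(t_last, t + eps) - cdf(t_last, t - eps)) / (2 * eps)
print("dF/dt by finite differences:", finite_difference)
print("lambda(t) * exp(-Lambda):   ", pdf(t_last, t))
```

The two printed values should agree up to the finite-difference error, which is what differentiating the cumulative distribution predicts.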
Appendix B The validity of conditional intensities

This section provides the proof that the two conditional intensities (Eq. 3 and Eq. 4) used in our experiments characterize valid temporal point processes.


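The conditional intensity

$$\lambda^*(t) = \exp\big(\mathbf{v}^\top \mathbf{h}_j + w\,(t - t_j) + b\big)$$

takes the form of Eq. 3 if $w \neq 0$ while it takes the form of Eq. 4 if $w = 0$. Let us denote

$$\Lambda^*(t) = \int_{t_j}^{t} \lambda^*(u)\,du =
\begin{cases}
\dfrac{\exp(\mathbf{v}^\top \mathbf{h}_j + b)}{w}\,\big(\exp(w\,(t - t_j)) - 1\big), & w \neq 0, \\[2ex]
\exp(\mathbf{v}^\top \mathbf{h}_j + b)\,(t - t_j), & w = 0.
\end{cases}$$

When $w \ge 0$, the quantity $\Lambda^*(t)$ is monotonically increasing in terms of $t$. As $t$ approaches infinity, $\Lambda^*(t)$ approaches infinity as well. Substituting $\Lambda^*(t)$ into Eq. 2, we have $\lim_{t \to \infty} F^*(t) = 1$, so $\lambda^*(t)$ is a valid conditional intensity when $w \ge 0$.

However, when $w < 0$, we have $\lim_{t \to \infty} \Lambda^*(t) = -\exp(\mathbf{v}^\top \mathbf{h}_j + b)/w < \infty$, hence $\lim_{t \to \infty} F^*(t) = 1 - \exp\big(\exp(\mathbf{v}^\top \mathbf{h}_j + b)/w\big) < 1$. This results in an invalid probability distribution. Therefore, $\lambda^*(t)$, or equivalently Eq. 3 and Eq. 4, is valid if $w \ge 0$. ∎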
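The validity argument can be illustrated numerically. The sketch below assumes an exponential-linear intensity `exp(a + w * tau)`, where `tau` is the time since the last event, the scalar `a` collapses the history-dependent term and the bias, and `w` is the time coefficient; this parameterization and the function names are assumptions made for illustration. It evaluates the cumulative distribution far from the last event for a positive, zero, and negative `w`.

```python
import math

def cumulative_intensity(a, w, tau):
    """Integral of exp(a + w * s) for s in [0, tau]; a and w are hypothetical scalars."""
    if w == 0.0:
        return math.exp(a) * tau
    return math.exp(a) * math.expm1(w * tau) / w

def cdf(a, w, tau):
    """Probability that the next event arrives within tau of the last one."""
    return 1.0 - math.exp(-cumulative_intensity(a, w, tau))

def limiting_cdf(a, w):
    """Value approached by the CDF as tau grows without bound."""
    if w >= 0.0:
        return 1.0
    return 1.0 - math.exp(math.exp(a) / w)

for w in (0.5, 0.0, -0.5):
    print("w =", w, "CDF at tau = 40:", cdf(0.0, w, 40.0),
          "limit:", limiting_cdf(0.0, w))
```

For the negative coefficient the cumulative distribution saturates strictly below one, which means a positive probability that no further event ever occurs; for the other two it reaches one, as a proper distribution over the next event time requires.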

Appendix C Inference of time

In this section, we derive the predicted time for the two conditional intensities (Eq. 3 and Eq. 4) we used.

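C.1 When $\lambda^*(t)$ takes the form in Eq. 3

$$
\begin{aligned}
\hat{t} &= \int_{t_j}^{\infty} t\, f^*(t)\,dt \\
&= \int_{0}^{\infty} (t_j + \tau)\, f^*(t_j + \tau)\,d\tau && \text{(Obtained by letting } \tau = t - t_j\text{)} \\
&= t_j + \int_{0}^{\infty} \tau\, f^*(t_j + \tau)\,d\tau && \text{(} \textstyle\int_{0}^{\infty} f^*(t_j + \tau)\,d\tau \text{ is } F^*(\infty)\text{, so equal to 1)} \\
&= t_j + \frac{e^{\eta}}{w} \int_{\eta}^{\infty} \ln\!\Big(\frac{x}{\eta}\Big)\, e^{-x}\,dx && \text{(Obtained by letting } x = \eta\, e^{w\tau}\text{)} \\
& && \text{(where } \eta = \exp(\mathbf{v}^\top \mathbf{h}_j + b)/w\text{)} \\
&= t_j + \frac{e^{\eta}}{w} \int_{\eta}^{\infty} \frac{e^{-x}}{x}\,dx && \text{(Integrate by parts)} \\
&= t_j + \frac{e^{\eta}}{w}\,\Gamma(0, \eta) && \text{(where } \Gamma(0, \eta) \text{ is an incomplete gamma function)}
\end{aligned}
$$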

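C.2 When $\lambda^*(t)$ takes the form in Eq. 4

$$
\begin{aligned}
\hat{t} &= \int_{t_j}^{\infty} t\, f^*(t)\,dt \\
&= \int_{t_j}^{\infty} t\, \lambda\, e^{-\lambda (t - t_j)}\,dt && \text{(Obtained by letting } \lambda = \lambda^*(t) \text{ since } \lambda^*(t) \text{ does not actually rely on } t\text{)} \\
&= \int_{0}^{\infty} (t_j + \tau)\, \lambda\, e^{-\lambda \tau}\,d\tau && \text{(Obtained by letting } \tau = t - t_j\text{)} \\
&= t_j + \int_{0}^{\infty} \tau\, \lambda\, e^{-\lambda \tau}\,d\tau && \text{(} \textstyle\int_{0}^{\infty} \lambda\, e^{-\lambda \tau}\,d\tau \text{ is } F^*(\infty)\text{, so equal to 1)} \\
&= t_j + \frac{1}{\lambda} && \text{(Integrate by parts)}
\end{aligned}
$$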
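The closed forms for the predicted time can be sanity-checked against direct numerical integration. The sketch below assumes the intensity is `exp(a + w * tau)` with `tau` the time since the last event (`w > 0` for the time-dependent case and `w = 0` for the constant case), where the scalar `a` collapses the history-dependent term and the bias; this parameterization, the series used for the incomplete gamma function, and all function names are assumptions made for illustration. It computes the expected gap, to which the last event time is added to obtain the predicted time.

```python
import math

EULER_GAMMA = 0.5772156649015329

def upper_incomplete_gamma_zero(x, terms=60):
    """Gamma(0, x) for x > 0 via its power series; adequate for small to moderate x."""
    series = 0.0
    term = 1.0  # holds (-x)^k / k! as k advances
    for k in range(1, terms + 1):
        term *= -x / k
        series += term / k
    return -EULER_GAMMA - math.log(x) - series

def expected_gap_time_dependent(a, w):
    """Closed-form expected gap for intensity exp(a + w * tau) with w > 0."""
    eta = math.exp(a) / w
    return math.exp(eta) * upper_incomplete_gamma_zero(eta) / w

def expected_gap_constant(a):
    """Expected gap for the constant intensity exp(a)."""
    return math.exp(-a)

def expected_gap_numeric(a, w, upper=30.0, steps=60000):
    """Trapezoidal estimate of the integral of tau * f(tau) over [0, upper]."""
    def density(tau):
        big_lambda = math.exp(a) * (tau if w == 0.0 else math.expm1(w * tau) / w)
        return math.exp(a + w * tau - big_lambda)
    h = upper / steps
    total = 0.5 * upper * density(upper)  # the tau = 0 endpoint contributes zero
    for i in range(1, steps):
        tau = i * h
        total += tau * density(tau)
    return total * h

print("time-dependent:", expected_gap_time_dependent(0.0, 1.0),
      expected_gap_numeric(0.0, 1.0))
print("constant:      ", expected_gap_constant(0.3),
      expected_gap_numeric(0.3, 0.0))
```

Each printed pair should agree to several decimal places. The power series is a convenience for this check; for large arguments a library routine for the exponential integral would be the safer choice.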


References

  • [1] Z. Shou, D. Wang, and S.-F. Chang, “Action temporal localization in untrimmed videos via multi-stage CNNs,” in CVPR, 2016.
  • [2] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang, “CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos,” in CVPR, 2017.
  • [3] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng, “Temporal action localization by structured maximal sums,” in CVPR, 2017.
  • [4] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in CVPR, 2014.
  • [5] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in CVPR, 2015.
  • [6] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in CVPR, 2016.
  • [7] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3d residual networks,” in CVPR, 2017.
  • [8] W. Lian, R. Henao, V. Rao, J. Lucas, and L. Carin, “A multitask point process predictive model,” in ICML, 2015.
  • [9] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014.
  • [10] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, “DRAW: A recurrent neural network for image generation,” in ICML, 2015.
  • [11] K. Kitani, B. Ziebart, J. Bagnell, and M. Hebert, “Activity forecasting,” in ECCV, 2012.
  • [12] N. Rhinehart and K. M. Kitani, “First-person activity forecasting with online inverse reinforcement learning,” in ICCV, 2017.
  • [13] D. Xie, T. Shu, S. Todorovic, and S.-C. Zhu, “Modeling and inferring human intents and latent functional objects for trajectory prediction,” arXiv preprint arXiv:1606.07827, 2016.
  • [14] H. Soo Park, J.-J. Hwang, Y. Niu, and J. Shi, “Egocentric future localization,” in CVPR, 2016.
  • [15] C. Vondrick, H. Pirsiavash, and A. Torralba, “Anticipating visual representations from unlabeled video,” in CVPR, 2016.
  • [16] D.-A. Huang, L. Fei-Fei, and J. C. Niebles, “Connectionist temporal modeling for weakly supervised action labeling,” in ECCV, 2016.
  • [17] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in ICML, 2006.
  • [18] S. Bai, J. Zico Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
  • [19] D. Neil, M. Pfeiffer, and S.-C. Liu, “Phased LSTM: Accelerating recurrent network training for long or event-based sequences,” in NIPS, 2016.
  • [20] T. A. Lasko, “Efficient inference of gaussian-process-modulated renewal processes with application to medical event data,” in UAI, 2014.
  • [21] L. Xu, J. A. Duan, and A. Whinston, “Path to purchase: A mutually exciting point process model for online advertising and conversion,” Management Science, vol. 60, no. 6, pp. 1392–1412, 2014.
  • [22] Y. Ogata, “Space-time point-process models for earthquake occurrences,” Annals of the Institute of Statistical Mathematics, vol. 50, no. 2, pp. 379–402, 1998.
  • [23] J. F. C. Kingman, Poisson processes. Wiley Online Library, 1993.
  • [24] A. G. Hawkes, “Spectra of some self-exciting and mutually exciting point processes,” Biometrika, vol. 58, no. 1, pp. 83–90, 1971.
  • [25] V. Isham and M. Westcott, “A self-correcting point process,” Stochastic Processes and Their Applications, vol. 8, no. 3, pp. 335–347, 1979.
  • [26] N. Du, H. Dai, R. Trivedi, U. Upadhyay, M. Gomez-Rodriguez, and L. Song, “Recurrent marked temporal point processes: Embedding event history to vector,” in SIGKDD, 2016.
  • [27] J. Chung, S. Ahn, and Y. Bengio, “Hierarchical multiscale recurrent neural networks,” in ICLR, 2017.
  • [28] A. Sordoni, Y. Bengio, H. Vahabi, C. Lioma, J. Grue Simonsen, and J.-Y. Nie, “A hierarchical recurrent encoder-decoder for generative context-aware query suggestion,” in CIKM, 2015.
  • [29] L. Kong, C. Dyer, and N. A. Smith, “Segmental recurrent neural networks,” in ICLR, 2016.
  • [30] S. El Hihi and Y. Bengio, “Hierarchical recurrent neural networks for long-term dependencies,” in NIPS, 1996.
  • [31] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • [32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich et al., “Going deeper with convolutions,” in CVPR, 2015.
  • [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [34] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in CVPR, 2017.
  • [35] M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, “A hierarchical deep temporal model for group activity recognition,” in CVPR, 2016.
  • [36] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in ICCV, 2015.
  • [37] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in NIPS, 2014.
  • [38] H. Kuehne, A. B. Arslan, and T. Serre, “The language of actions: Recovering the syntax and semantics of goal-directed human activities,” in CVPR, 2014.
  • [39] M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele, “Recognizing fine-grained and composite activities using hand-centric features and script data,” International Journal of Computer Vision, vol. 119, no. 3, pp. 346–373, 2016.