I Introduction
Robots are envisioned to coexist with humans in unscripted environments and accomplish a diverse set of objectives. Towards this goal, navigation is an essential task for an autonomous mobile robot. This requires the robot to navigate human crowds not just safely and efficiently, but also in a socially compliant way, i.e., the robot has to collaboratively avoid collisions with surrounding humans and alter its path in a human-predictable manner. To achieve this, the robot needs to accurately predict the future trajectories of humans within the crowd and plan its own path accordingly.
Early works in the domain of social robot navigation modeled individual human motion patterns in crowds to predict future trajectories, as in [1, 2, 3]. However, as shown in [4], such independent modeling does not capture the complex and subtle interactions between humans in the crowd, and the resulting path for the robot is highly suboptimal. For the robot to navigate in a socially compliant way, it is key to capture the human-human interactions observed in a crowd.
More recent approaches such as [4, 5, 6] model the joint distribution of future trajectories of all interacting agents through a spatially local interaction model. Such a joint distribution model is capable of capturing the dependencies between trajectories of interacting humans, and results in socially compliant predictions. However, these approaches assume that only humans in a local neighborhood affect each other's motion, which is not necessarily true in real crowd scenarios. For example, consider a long hallway with two humans at opposite ends moving towards each other. If both of them are walking, the assumption holds, as they do not influence each other over such a long distance. However, if one of them starts running, the other person adapts their own motion to avoid a collision before the runner enters their local neighborhood. This observation leads us to the insight that human-human interactions in a crowd depend not just on relative distance, but also on other features such as velocity, time-to-collision [7], acceleration and heading.
In this work, we propose an approach that addresses this observation through a novel data-driven architecture for predicting future trajectories of humans in crowds. As a first step towards achieving socially acceptable robot navigation, we focus on the problem of human trajectory prediction in a crowd. We use a feedforward, fully differentiable, and jointly trained recurrent neural network (RNN) mixture to model trajectories of all humans in the crowd, addressing both spatial and temporal aspects of the problem. The human-human interactions are modeled using a soft attention model over all humans in the crowd, thereby not restricting the approach with the local neighborhood assumption. The resulting model captures the influence of each person on the other and the nature of their interaction, and predicts their future trajectories. Finally, we demonstrate that our model, Social Attention, predicts human trajectories more accurately than the state-of-the-art approach on two publicly available real-world crowd datasets. We also analyze the trained attention model to understand the nature of human-human interactions learned from the crowd datasets.
II Problem Definition
In this paper, we deal with the problem of human trajectory prediction in crowded spaces. We assume that each scene is preprocessed to track pedestrians in the crowd and obtain their spatial coordinates at successive timesteps. Note that pedestrians enter and leave the scene across timesteps, so trajectories vary in length. Let $(x_i^t, y_i^t)$ represent the spatial location of agent $i$ at timestep $t$.
Following a similar notation as [6], our problem can be formulated as: Given spatial locations $(x_i^t, y_i^t)$ for agents $i = 1, \ldots, N$ from timesteps $t = 1, \ldots, T_{obs}$, predict their future locations from $t = T_{obs}+1$ to $T_{pred}$.
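As an illustration of the input described above, the following is a minimal sketch of how tracked coordinates with pedestrians entering and leaving the scene might be stored and split into an observed and a future part. The names `Frames`, `split_observed_future`, `trajectory_of` and the parameter `t_obs` are assumptions made for this sketch and are not taken from the authors' implementation.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
# frames[t] maps a pedestrian id to its (x, y) location for every
# pedestrian that is visible at timestep t (hypothetical layout).
Frames = List[Dict[int, Point]]


def split_observed_future(frames: Frames, t_obs: int) -> Tuple[Frames, Frames]:
    """Split a scene into observed frames [0, t_obs) and future frames [t_obs, end)."""
    if not 0 < t_obs < len(frames):
        raise ValueError("t_obs must lie strictly inside the sequence")
    return frames[:t_obs], frames[t_obs:]


def trajectory_of(frames: Frames, ped_id: int) -> List[Tuple[int, Point]]:
    """Collect (timestep, location) pairs for one pedestrian.

    Timesteps at which the pedestrian is not in the scene are simply absent,
    so trajectories naturally have varying lengths.
    """
    return [(t, frame[ped_id]) for t, frame in enumerate(frames) if ped_id in frame]


frames = [
    {1: (0.0, 0.0)},
    {1: (0.5, 0.0), 2: (5.0, 1.0)},  # pedestrian 2 enters the scene
    {1: (1.0, 0.0), 2: (4.5, 1.0)},
    {2: (4.0, 1.0)},                 # pedestrian 1 has left the scene
]
observed, future = split_observed_future(frames, t_obs=2)
```

Storing one dictionary per timestep keeps the varying number of pedestrians explicit, which matters later because the number of spatial edges changes from frame to frame.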
III Related Work
Our work is relevant to past literature in the domain of modeling human interactions for navigation, human trajectory prediction and spatiotemporal models.
III-A Modeling Human Interactions for Navigation
To predict the future behavior of pedestrians in crowds, we need to model interactions between pedestrians accurately. An early work by [8] proposed the Social Force model, which models the motion of pedestrians using attractive forces that guide them towards their destination and repulsive forces that ensure collision avoidance. Subsequently, several approaches [9, 10] have extended the Social Force model by fitting the parameters of the force functions to observed crowd behavior. Using attractive and repulsive forces based on relative distances, the Social Force model can capture simple interactions but cannot model complex crowd behavior such as cooperation, as shown in [6].
A pioneering work by [11] introduced a theory of human proximity relationships, which has been used in potential-field-based methods such as [12, 13] to model human-human interactions in crowds for robot navigation. The proximity-based model effectively captures reactive collision avoidance but does not model human-human and human-robot cooperation. However, models of cooperation are essential for safe and efficient robot navigation in dense crowds. As shown by [4], lack of cooperation leads to the freezing robot problem, where the robot believes there is no feasible path in the environment despite the existence of several feasible paths.
More recently, the use of Interacting Gaussian Processes (IGP) was proposed by [4] to model the joint distribution of trajectories of all interacting agents in the crowd using Gaussian Processes with a handcrafted interaction potential term. The potential term captures interactions based on the relative distances of humans in the crowd and results in a probabilistic model that has been shown to capture joint collision avoidance behavior. This has been extended in [5] by replacing the handcrafted potential term with a locally trained interaction model based on occupancy grids. However, these approaches model interactions based on relative distances and orientations, ignoring other features such as velocity and acceleration.
Finally, the works of [14, 15] explicitly model human-human and human-robot interactions and jointly predict the trajectories of all agents, using feature-based representations. They use maximum entropy inverse reinforcement learning (IRL) to learn a distribution of trajectories that results in crowd-like behavior. The features used, such as clearance, velocity, and group membership, are carefully designed. However, their approach has only been tested in scripted environments with no more than four humans and, due to the feature-based joint modeling, it scales poorly with the number of agents considered. Very recently, [16] extended this approach to unseen and unstructured environments using a receding horizon motion planning approach.
III-B Human Trajectory Prediction
In the domain of video surveillance, human trajectory prediction is a significant challenge. The approaches by [17, 18] learn motion patterns of pedestrians in videos using Gaussian Processes and cluster observed trajectories into patterns. These motion patterns capture navigation behavior such as static obstacle avoidance, but they ignore human-human interactions. IRL has also been used for activity forecasting in [19] to predict future trajectories of pedestrians by inferring traversable regions in a scene, modeling human-space interactions using semantic scene information. However, interactions between humans are not modeled. More recently, [6] used Long Short-Term Memory networks (LSTM) to model the joint distribution of future trajectories of interacting agents. This work has been extended in [20, 21] to include static obstacles in the model in addition to dynamic agents. However, these approaches assume that only the dynamic agents in a local discretized neighborhood of a pedestrian affect the pedestrian's motion. As shown in Section I, this is not necessarily true and in our work, we do not make such an assumption. We would also like to point out a very recent work [22], which also considers all agents in the environment, rather than just the local neighborhood, using attention. However, the attention used there is hardwired based on proximity rather than being learned from data.
III-C Spatio-Temporal Models
In this paper, we formulate the task of human trajectory prediction using spatiotemporal graphs. Spatiotemporal graphs have nodes that represent the problem components and edges that capture spatiotemporal interactions between the nodes. This spatiotemporal formulation finds applications in robotics and computer vision [23, 24, 25]. Traditionally, graphical models such as CRFs are used to model such problems [26, 27, 28]. Recently, [29] introduced Structural-RNN (SRNN), a rich RNN mixture that can be jointly trained to model dynamics in spatiotemporal tasks. This has been successfully applied to diverse tasks such as modeling human motion and driver maneuver anticipation. In this paper, we use a variant of SRNN.
IV Approach
Humans navigate crowds by adapting their own trajectories based on the motion of others around them. It is assumed in [5, 6, 20, 21] that this influence is spatially local, i.e., only spatial neighbors influence the motion of a human in the crowd. But as shown in Section I, this is not necessarily true: other features such as velocity, acceleration and heading play an important role, enabling agents who are not spatially local to influence a pedestrian's motion. In this work, we aim to model the influence of all agents in the crowd by learning an attention model over the agents. In other words, we seek to answer the question: Which surrounding agents do humans attend to while navigating a crowd? Our hypothesis is that the representation of trajectories learned by our model lets us reason about the importance of surrounding agents more effectively than considering only spatially local agents.
As argued in Section I, to model interactions among humans, we cannot predict future locations of each human independently. Instead, we need to jointly reason across multiple people and couple their predictions so that interactions among them are captured. Towards this goal, we use a feedforward, fully differentiable, and jointly trained RNN mixture that both predicts their future locations and captures human-human interactions. Our approach builds on the architecture proposed in [29] for this purpose.
IV-A Spatio-Temporal Graph Representation
We use a similar spatiotemporal graph (st-graph) representation as [29], with $\mathcal{G} = (\mathcal{V}, \mathcal{E}_S, \mathcal{E}_T)$, where $\mathcal{G}$ is the st-graph, $\mathcal{V}$ is the set of nodes, $\mathcal{E}_S$ is the set of spatial edges and $\mathcal{E}_T$ is the set of temporal edges. Note that the graph $\mathcal{G}$ is unrolled over time using $\mathcal{E}_T$ to form the unrolled st-graph. Hence, in the unrolled st-graph, different nodes at the same timestep are connected using edges in $\mathcal{E}_S$, whereas the same nodes at adjacent timesteps are connected using edges in $\mathcal{E}_T$. For more details on the general st-graph representation, we refer the reader to [29].
In this work, we formulate the problem of human trajectory prediction as a spatiotemporal graph. The nodes of the st-graph represent the humans in the crowd, the spatial edges connect two different humans at the same timestep, and the temporal edges connect the same human at adjacent timesteps. The spatial edges aim to capture the dynamics of relative orientation and distance between two humans, and the temporal edges capture the dynamics of the human's own trajectory. The feature vector associated with node $v$ at timestep $t$ is $x_v^t$, the spatial location of the corresponding human. The feature vector associated with a spatial edge $(u, v)$ at timestep $t$ is $x_{uv}^t$, the vector from the location of $u$ at time $t$ to the location of $v$ at $t$ (encoding the relative orientation and distance). Similarly, the feature vector associated with a temporal edge $(v, v)$ at timestep $t$ is $x_{vv}^t$, the vector from the location of node $v$ at $t-1$ to its location at $t$. The corresponding st-graph representation (with the unrolled st-graph) is shown in Figure 2.
The factor graph representation of the st-graph associates a factor function with each node and a pairwise factor function with each edge in the graph, as shown in Figure 2. At each timestep, the factors in the st-graph observe node/edge features and perform some computation on those features. Each of these factors has parameters that need to be learned. In our formulation, all the nodes share the same factor, giving the model the scalability to handle more nodes (in dense crowds) without increasing the number of parameters. For similar reasons, all spatial edges share a common factor and all temporal edges share the same factor function. Note that the factors for spatial edges and temporal edges are different, as they capture different aspects of the trajectories. This kind of parameter sharing is necessary to generalize across scenes with a varying number of humans, and keeps the parameterization compact.
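The node and edge features described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of ordered pairs for spatial edges, and the choice to skip the temporal edge of a pedestrian that has just entered the scene are all assumptions of this sketch.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def displacement(src: Point, dst: Point) -> Point:
    """Vector pointing from src to dst."""
    return (dst[0] - src[0], dst[1] - src[1])


def build_features(prev_frame: Dict[int, Point], cur_frame: Dict[int, Point]):
    """Return (node, spatial-edge, temporal-edge) features for the current timestep."""
    # Node feature: the pedestrian's own spatial location.
    node = dict(cur_frame)
    # Spatial edge feature: vector between two different pedestrians at the same timestep.
    spatial = {
        (u, v): displacement(cur_frame[u], cur_frame[v])
        for u in cur_frame
        for v in cur_frame
        if u != v
    }
    # Temporal edge feature: vector from a pedestrian's previous location to the current one.
    # Assumption: a pedestrian first seen in cur_frame gets no temporal edge yet.
    temporal = {
        v: displacement(prev_frame[v], cur_frame[v])
        for v in cur_frame
        if v in prev_frame
    }
    return node, spatial, temporal


prev_frame = {1: (0.0, 0.0), 2: (4.0, 3.0)}
cur_frame = {1: (1.0, 0.0), 2: (4.0, 4.0), 3: (9.0, 9.0)}
node, spatial, temporal = build_features(prev_frame, cur_frame)
```

Because every pair of pedestrians in the frame gets a spatial edge, no local-neighborhood cutoff is applied at this stage; deciding which edges matter is left to the attention module.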
IV-B Model Architecture
The factor graph representation lends itself naturally to the SRNN architecture [29]. We represent each factor with an RNN. Hence, we have a nodeRNN for each node factor and an edgeRNN for each edge factor. Note that all the nodeRNNs, spatial edgeRNNs and temporal edgeRNNs share parameters among themselves. The spatial edgeRNNs model the dynamics of human-human interactions in the crowd and the temporal edgeRNNs model the dynamics of the individual motion of each human in the crowd. The nodeRNNs use the node features and hidden states from the neighboring edgeRNNs to predict the future location of the node at the next timestep. We would like to emphasize that since we share the model parameters across all nodes and edges, the number of parameters is independent of the number of pedestrians at any given time.
Our architecture differs from the SRNN architecture by introducing an attention module that computes a soft attention over the hidden states of the neighboring spatial edgeRNNs for each node, as summarized in Figure 3. We describe each of these components in the following subsections.
IV-B1 EdgeRNN
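Each spatial edgeRNN, at every timestep $t$, takes the corresponding edge's features $x_{uv}^t$, embeds them into a fixed-length vector $e_{uv}^t$, and uses this embedding as the input to the RNN cell as follows:
$$e_{uv}^t = \phi\left(x_{uv}^t;\, W_{spatial}^{e}\right) \tag{1}$$
$$h_{uv}^t = \mathrm{RNN}\left(h_{uv}^{t-1},\, e_{uv}^t;\, W_{spatial}^{rnn}\right) \tag{2}$$
where $\phi$ is an embedding function, $W_{spatial}^{e}$ are the embedding weights, $h_{uv}^t$ is the hidden state of the RNN at time $t$ and $W_{spatial}^{rnn}$ are the weights of the spatial edgeRNN cell.
The temporal edgeRNN is defined in a similar way with its own set of weights $W_{temporal}^{e}$ and $W_{temporal}^{rnn}$ for the embedding and edgeRNN, respectively. Hence, the trainable parameters for edgeRNNs are $W_{spatial}^{e}$, $W_{spatial}^{rnn}$, $W_{temporal}^{e}$ and $W_{temporal}^{rnn}$.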
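The embed-then-recur step of an edgeRNN can be sketched as below. This is a simplified illustration under stated assumptions: a plain tanh RNN cell stands in for the LSTM used in the paper, the dimensions are toy values, and the class name `EdgeRNN` and its fields are hypothetical. The one point it is meant to show is that a single set of weights is applied to every edge of the same type.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class EdgeRNN:
    """Embedding layer followed by a recurrent cell, shared by all edges of one type."""

    def __init__(self, in_dim, embed_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.W_embed = random_matrix(embed_dim, in_dim, rng)
        self.W_h = random_matrix(hidden_dim, hidden_dim, rng)
        self.W_e = random_matrix(hidden_dim, embed_dim, rng)

    def embed(self, x):
        # Linear embedding with a ReLU non-linearity.
        return [max(0.0, z) for z in matvec(self.W_embed, x)]

    def step(self, h_prev, x):
        # Assumption: a vanilla tanh cell replaces the LSTM cell for brevity.
        e = self.embed(x)
        zh = matvec(self.W_h, h_prev)
        ze = matvec(self.W_e, e)
        return [math.tanh(a + b) for a, b in zip(zh, ze)]


# One shared spatial edgeRNN updates the hidden state of every spatial edge.
spatial_rnn = EdgeRNN(in_dim=2, embed_dim=4, hidden_dim=3, seed=0)
features = {("a", "b"): (1.0, 0.5), ("b", "a"): (-1.0, -0.5)}
hidden = {edge: [0.0] * spatial_rnn.hidden_dim for edge in features}
hidden = {edge: spatial_rnn.step(hidden[edge], features[edge]) for edge in features}
```

A temporal edgeRNN would be a second `EdgeRNN` instance with its own weights, applied to the temporal edge features in the same way.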
IV-B2 Attention Module
For each node, the attention module computes a soft attention over the hidden states of the edgeRNNs of the spatial edges that the node belongs to. Observe that this differs from the SRNN architecture of [29], where the edge features of these spatial edges are added and sent to the edgeRNN to compute a single hidden state, which is used as an input to the nodeRNN.
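At each timestep $t$, for each node $v$, we compute a score between the hidden state $h_{vv}^t$ of its corresponding temporal edgeRNN and each hidden state $h_{vu}^t$ of the neighboring spatial edgeRNNs. The score function used is scaled dot product attention [30], given by:
$$\mathrm{Score}\left(h_{vv}^t, h_{vu}^t\right) = \frac{m}{\sqrt{d_e}} \left\langle W_{temporal}^{attn}\, h_{vv}^t,\; W_{spatial}^{attn}\, h_{vu}^t \right\rangle \tag{3}$$
where $m$ is the number of spatial edges the node is associated with, and $W_{temporal}^{attn}$ and $W_{spatial}^{attn}$ are weights that linearly scale and project the hidden states into $d_e$-dimensional vectors. Scaling the dot product by $\frac{m}{\sqrt{d_e}}$ is necessary because dot product attention performs poorly for large values of $d_e$, as found in [30], and because the number of spatial edges changes from frame to frame, depending on the number of agents.
The output vector $H_v^t$ is computed as a weighted sum of the hidden states $h_{vu}^t$, with the weights given by the softmax of the computed scores,
$$H_v^t = \sum_{u} \frac{\exp\left(\mathrm{Score}(h_{vv}^t, h_{vu}^t)\right)}{\sum_{u'} \exp\left(\mathrm{Score}(h_{vv}^t, h_{vu'}^t)\right)}\, h_{vu}^t \tag{4}$$
Hence, the trainable parameters in the attention module are the weights $W_{temporal}^{attn}$ and $W_{spatial}^{attn}$.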
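The attention computation described above (project, score by a scaled dot product, softmax, weighted sum) can be sketched as follows. The function name `attend`, the argument names, and the exact scaling factor `m / sqrt(d_e)` are assumptions of this sketch; the scaling is chosen to depend on both the projection dimension and the number of spatial edges, as the text indicates.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    return [dot(row, x) for row in W]


def softmax(scores):
    # Subtract the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(h_temporal, h_spatial, W_temporal, W_spatial):
    """Soft attention of one node over the hidden states of its spatial edges.

    h_temporal: hidden state of the node's temporal edgeRNN.
    h_spatial:  list of hidden states, one per spatial edge of the node.
    Returns (output vector, attention weights).
    """
    m = len(h_spatial)
    if m == 0:
        raise ValueError("the node has no spatial edges to attend over")
    query = matvec(W_temporal, h_temporal)
    d_e = len(query)
    scale = m / math.sqrt(d_e)  # assumed form of the scaling factor
    scores = [scale * dot(query, matvec(W_spatial, h)) for h in h_spatial]
    weights = softmax(scores)
    dim = len(h_spatial[0])
    output = [sum(w * h[k] for w, h in zip(weights, h_spatial)) for k in range(dim)]
    return output, weights


identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], identity, identity)
```

In this toy call the first spatial edge is aligned with the temporal hidden state, so it receives the larger weight. The returned weights are the quantities one would inspect to see which pedestrians the model attends to.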
IV-B3 NodeRNN
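Finally, the nodeRNN, at every timestep $t$, takes the corresponding node's features $x_v^t$ and embeds them into a fixed-length vector $e_v^t$. It also takes the hidden state $h_{vv}^t$ of the corresponding temporal edgeRNN, concatenates it with the computed attention output $H_v^t$ and embeds the result into a fixed-length vector $a_v^t$. These embeddings are concatenated and sent as input to the RNN cell as follows:
$$e_v^t = \phi\left(x_v^t;\, W_{node}^{e}\right) \tag{5}$$
$$a_v^t = \phi\left(\mathrm{concat}(h_{vv}^t, H_v^t);\, W_{node}^{a}\right) \tag{6}$$
$$h_v^t = \mathrm{RNN}\left(h_v^{t-1},\, \mathrm{concat}(e_v^t, a_v^t);\, W_{node}^{rnn}\right) \tag{7}$$
The hidden state $h_v^t$ of the RNN cell at timestep $t$ is passed through a linear layer to get a 5D vector corresponding to the predicted mean, standard deviation and correlation of a bivariate Gaussian distribution, similar to [31]:
$$\left(\mu_v^{t+1},\, \sigma_v^{t+1},\, \rho_v^{t+1}\right) = W^{out}\, h_v^t \tag{8}$$
Thus, the trainable parameters for a nodeRNN are $W_{node}^{e}$, $W_{node}^{a}$, $W_{node}^{rnn}$ and $W^{out}$.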
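A 5D output vector has to be turned into valid Gaussian parameters: standard deviations must be positive and the correlation must lie in (-1, 1). The sketch below follows the parameterization used by Graves [31] (exponential for the standard deviations, tanh for the correlation); whether Social Attention applies exactly this mapping is an assumption, and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def to_gaussian_params(raw: Sequence[float]) -> Tuple[float, float, float, float, float]:
    """Map an unconstrained 5D vector to (mu_x, mu_y, sigma_x, sigma_y, rho)."""
    if len(raw) != 5:
        raise ValueError("expected a 5D vector")
    mu_x, mu_y, log_sigma_x, log_sigma_y, raw_rho = raw
    sigma_x = math.exp(log_sigma_x)  # exp keeps the standard deviation positive
    sigma_y = math.exp(log_sigma_y)
    rho = math.tanh(raw_rho)         # tanh keeps the correlation in (-1, 1)
    return mu_x, mu_y, sigma_x, sigma_y, rho


params = to_gaussian_params([1.0, 2.0, 0.0, 0.0, 0.0])
```

With this mapping the linear layer can output any real values while the resulting distribution stays well defined.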
IV-C Training the model
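We jointly train the entire model by minimizing the negative log-likelihood loss of the node's true position at all predicted timesteps under the predicted bivariate Gaussian distribution as follows:
$$L_v = -\sum_{t=T_{obs}+1}^{T_{pred}} \log P\left(x_v^t, y_v^t \mid \mu_v^t, \sigma_v^t, \rho_v^t\right)$$
where $(x_v^t, y_v^t)$ is the true position of node $v$ at timestep $t$, and $T_{obs}$ and $T_{pred}$ denote the last observed and the last predicted timestep, respectively.
The loss is computed over the trajectories of all nodes in the training dataset and backpropagated. Note that we jointly backpropagate through the nodeRNN, spatial edgeRNN and temporal edgeRNN, thereby updating all their parameters to minimize the loss.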
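The negative log-likelihood under a bivariate Gaussian can be written out explicitly. The sketch below uses the standard closed form of the bivariate normal density; the function names and the flat tuple layout of the parameters are assumptions made for illustration.

```python
import math


def bivariate_nll(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Negative log-density of (x, y) under a bivariate Gaussian."""
    one_minus_rho2 = 1.0 - rho * rho
    dx = (x - mu_x) / sigma_x
    dy = (y - mu_y) / sigma_y
    z = dx * dx + dy * dy - 2.0 * rho * dx * dy
    log_norm = math.log(2.0 * math.pi * sigma_x * sigma_y * math.sqrt(one_minus_rho2))
    return log_norm + z / (2.0 * one_minus_rho2)


def trajectory_nll(true_positions, predicted_params):
    """Sum the negative log-likelihood over all predicted timesteps of one node."""
    return sum(
        bivariate_nll(x, y, *params)
        for (x, y), params in zip(true_positions, predicted_params)
    )


true_positions = [(0.0, 0.0), (1.0, 0.0)]
predicted_params = [(0.0, 0.0, 1.0, 1.0, 0.0), (1.0, 0.0, 1.0, 1.0, 0.0)]
loss = trajectory_nll(true_positions, predicted_params)
```

Summing this quantity over every node and minimizing it with respect to all RNN weights corresponds to the joint training described above.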
IV-D Inference for path prediction
At test time, we fit the trained model to the observed trajectory at timesteps $t = 1, \ldots, T_{obs}$ and sample from the predicted bivariate Gaussian distribution to get forecasted locations for all the pedestrians, for timesteps $t = T_{obs}+1, \ldots, T_{pred}$. Formally,
$$\left(\hat{x}_v^t, \hat{y}_v^t\right) \sim \mathcal{N}\left(\mu_v^t, \sigma_v^t, \rho_v^t\right) \tag{9}$$
For the predicted timesteps, we use the predicted location at the previous timestep in place of the true coordinates as node features, similar to [6]. The predicted locations are also used to compute the edge features for these timesteps.
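The sampling and feedback loop at test time can be sketched as below. The callback `predict_params` is a hypothetical stand-in for the trained model: given the most recent location it returns the Gaussian parameters for the next timestep. Sampling uses the standard construction of a correlated pair from two independent standard normal draws.

```python
import math
import random


def sample_bivariate(mu_x, mu_y, sigma_x, sigma_y, rho, rng):
    """Draw one (x, y) sample from a bivariate Gaussian with correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    x = mu_x + sigma_x * z1
    y = mu_y + sigma_y * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return x, y


def rollout(last_observed, n_steps, predict_params, rng):
    """Forecast n_steps locations, feeding each sample back as the next input."""
    location = last_observed
    path = []
    for _ in range(n_steps):
        params = predict_params(location)  # stand-in for the trained model
        location = sample_bivariate(*params, rng)
        path.append(location)
    return path


def constant_velocity_stub(location):
    """Hypothetical model stub: move 0.4 along x with small isotropic noise."""
    return (location[0] + 0.4, location[1], 0.05, 0.05, 0.0)


rng = random.Random(0)
forecast = rollout((0.0, 0.0), n_steps=5, predict_params=constant_velocity_stub, rng=rng)
```

In the full model the sampled locations of all pedestrians would also be used to recompute the spatial and temporal edge features before the next step.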
V Evaluation
V-A Datasets and Metrics
We evaluate our model, which we call Social Attention, on two publicly available datasets: ETH [32] and UCY [33]. These two datasets contain 5 crowd sets with a total of pedestrians exhibiting complex interactions such as walking together, groups crossing each other, joint collision avoidance and nonlinear trajectories, as shown in [32]. These datasets are recorded at frames per second, annotated every seconds and contain different scenes. As shown in [6], Social LSTM performs better than other traditional methods such as the linear model, the Social Force model [8] and Interacting Gaussian Processes [4]. Hence, we chose Social LSTM as the baseline against which to compare the performance of our method.
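To compute the prediction error, we consider the following two metrics:
Average Displacement Error: This metric computes the mean Euclidean distance between the predicted and the true locations over all predicted timesteps.
Final Displacement Error: Introduced in [6], this metric computes the mean Euclidean distance between the final predicted location and the final true location at the last predicted timestep.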
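The two displacement metrics reported in the results can be computed for a single trajectory as sketched below. The function names are illustrative; averaging over all pedestrians in a test set is left to the caller.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def average_displacement_error(predicted: List[Point], true: List[Point]) -> float:
    """Mean Euclidean distance between predicted and true locations over all timesteps."""
    if len(predicted) != len(true) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(predicted, true)]
    return sum(distances) / len(distances)


def final_displacement_error(predicted: List[Point], true: List[Point]) -> float:
    """Euclidean distance between the final predicted and final true location."""
    if len(predicted) != len(true) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    p, q = predicted[-1], true[-1]
    return math.hypot(p[0] - q[0], p[1] - q[1])


predicted = [(0.0, 0.0), (3.0, 4.0)]
true = [(0.0, 0.0), (0.0, 0.0)]
ade = average_displacement_error(predicted, true)
fde = final_displacement_error(predicted, true)
```

The final displacement error is never smaller than the smallest per-step error and tends to be larger than the average, since prediction errors accumulate over the horizon.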
Similar to [6], we use a leave-one-out approach where we train and validate our approach on 4 sets, and test on the remaining set. We repeat this for all 5 sets. For validation, within each set we divide the set of trajectories in a split for training and validation data. Our baseline, Social LSTM [6], has also been trained in the same fashion. We observe the trajectory for timesteps (corresponding to seconds) and predict the trajectory for the next timesteps (corresponding to seconds). We also conduct the same experiments for an independent LSTM approach that models each trajectory independently.
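The leave-one-out protocol over the five crowd sets can be sketched as follows. The set names are taken from the results table; the helper names, the shuffling, and the `val_fraction` parameter are assumptions of this sketch, and the split ratio is deliberately left as an argument rather than fixed.

```python
import random
from typing import Iterator, List, Sequence, Tuple

CROWD_SETS = ["ETH-Univ", "ETH-Hotel", "UCY-Zara1", "UCY-Zara2", "UCY-Univ"]


def leave_one_out(sets: Sequence[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training sets, held-out test set) for every crowd set in turn."""
    for held_out in sets:
        yield [s for s in sets if s != held_out], held_out


def train_val_split(items: Sequence, val_fraction: float, seed: int = 0):
    """Shuffle the trajectories of one set and split them into train and validation parts."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


folds = list(leave_one_out(CROWD_SETS))
for train_sets, test_set in folds:
    print(f"train/validate on {train_sets}, test on {test_set}")
```

Each of the five folds trains on four sets and reports the error on the remaining one, which is how the per-set rows of the results table are obtained.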
V-B Implementation Details
We use LSTM as the RNN in our Social Attention model. The dimension of hidden state of nodeRNN is set to and that of edgeRNN to . All the embedding layers in the network embed the input into a dimensional vector with ReLU nonlinearity. The attention dimension, i.e., in Equation 3, is set to . A batch size of is used and the network is trained for epochs using Adam with an initial learning rate of . The global norm of the gradients is clipped at a value of to ensure stable training. The model was trained on a single TitanX GPU. The full implementation of our approach is available at https://github.com/vvanirudh/srnn-pytorch.
For the Social LSTM implementation, we made our best attempt to follow the implementation details specified in [6] and the code can be accessed at https://github.com/vvanirudh/social-lstm-pytorch.
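Clipping by global norm, as mentioned above, rescales all gradients jointly when their combined L2 norm exceeds a threshold. The sketch below shows the arithmetic on plain Python lists; the function name and the threshold passed in the example are illustrative and do not reflect the value used in the paper.

```python
import math
from typing import List, Tuple


def clip_by_global_norm(grads: List[List[float]], max_norm: float) -> Tuple[List[List[float]], float]:
    """Rescale a list of gradient vectors so that their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    if max_norm <= 0.0:
        raise ValueError("max_norm must be positive")
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm <= max_norm:
        return [list(grad) for grad in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in grad] for grad in grads], total_norm


# Two parameter groups whose joint gradient norm is 5.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude is limited, which is what makes this form of clipping suitable for stabilizing RNN training.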
V-C Quantitative Results
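Table I: Prediction errors of all methods on the five crowd sets.
| Metric | Crowd Set | LSTM | Social LSTM | Social Attention |
|---|---|---|---|---|
| Average Displacement Error | ETH-Univ | 0.59 | 0.46 | 0.39 |
| | ETH-Hotel | 0.35 | 0.42 | 0.29 |
| | UCY-Zara 1 | 0.25 | 0.21 | 0.20 |
| | UCY-Zara 2 | 0.38 | 0.41 | 0.30 |
| | UCY-Univ | 0.40 | 0.36 | 0.33 |
| | Average | 0.39 | 0.37 | 0.30 |
| Final Displacement Error | ETH-Univ | 5.28 | 4.55 | 3.74 |
| | ETH-Hotel | 4.42 | 3.57 | 2.64 |
| | UCY-Zara 1 | 1.55 | 0.65 | 0.52 |
| | UCY-Zara 2 | 3.57 | 3.39 | 2.13 |
| | UCY-Univ | 6.39 | 4.45 | 3.92 |
| | Average | 3.84 | 3.32 | 2.59 |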
The prediction errors for all the methods on the 5 crowd sets are presented in Table I. The naive independent LSTM approach generally results in high prediction errors, as it cannot capture human-human interactions, unlike Social LSTM and Social Attention. However, in some cases, the independent LSTM approach performs slightly better than Social LSTM, especially in sparse crowd settings where there are scarcely any interactions. Our model, Social Attention, performs better than Social LSTM consistently across all the crowd sets in both metrics. In particular, in the ETH-Hotel crowd set, our approach outperforms the others by a large margin, supporting our hypothesis on non-local interactions as follows. This crowd set contains many pedestrians who are stationary or who move towards each other with varied velocities and headings. For stationary pedestrians, Social LSTM considers them important if they are within the local neighborhood, whereas Social Attention does not assign importance to these agents, as they do not affect others' motion in a significant way. In the case of pedestrians heading straight towards each other, Social LSTM does not consider them until they enter each other's local neighborhood, whereas Social Attention captures the interactions between them from a far distance based on their velocities and headings. By learning the relative importance of each pedestrian in the crowd from data, Social Attention produces more accurate predictions.
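To put the averaged numbers of Table I in relative terms, the reduction in error of Social Attention with respect to the Social LSTM baseline can be computed directly from the "Average" rows. The helper name below is illustrative; the input values are copied from the table.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction by which `ours` reduces the error of `baseline`."""
    if baseline <= 0.0:
        raise ValueError("baseline error must be positive")
    return (baseline - ours) / baseline


# "Average" rows of Table I: (Social LSTM, Social Attention).
ade_reduction = relative_reduction(0.37, 0.30)
fde_reduction = relative_reduction(3.32, 2.59)
print(f"ADE reduced by {ade_reduction:.1%}, FDE reduced by {fde_reduction:.1%}")
```

On these averages the reduction is roughly 19% for the average displacement error and roughly 22% for the final displacement error.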
In our evaluation, we also included the prediction errors of pedestrians for whom we observed fewer timesteps than the full observation window, as they entered the crowd at a later time. Generally, with fewer observations, the model's accuracy naturally degrades for their predictions. This is one of the primary reasons for the difference between our results for Social LSTM and those in the original paper [6], which disregarded such scenarios. We, on the other hand, consider them to be important since they happen often in real robot navigation.
Accounting for all agents in the crowd increases the computational complexity of our approach, but inference in the model is parallelized on the GPU to ensure real-time performance (10 Hz).
V-D Qualitative Results
The qualitative results for Social Attention are shown in Figure 4. To analyze the learned attention model, we considered several crowd scenarios among the datasets and extracted the predicted attention weights (the softmax of the scores in Equation 4). This lets us observe the relative importance of each pedestrian on the motion of a specific pedestrian, as predicted by Social Attention. Figure 4 (a)-(c) show scenarios where the model successfully identifies important pedestrians and Figure 4 (d)-(f) highlight scenarios where the model fails. In (a), the model attends with a higher weight to the dynamic pedestrian in close proximity compared to others far away. (b) shows a scenario where the model predicts that stationary pedestrians in the local neighborhood are relatively less important than a dynamic pedestrian who is farther away. In (c), the model assigns equal relative importance to each of the dynamic pedestrians, as they are all too far away to exert any influence.
There are several cases where our model incorrectly predicts the relative importance. Figure 4 (d) and (f) show scenarios where the model assigns a high attention weight to pedestrians who are far away and moving in such a way (or stationary, as in (f)) that they cannot exert any influence, while completely ignoring nearby pedestrians who are more important. Finally, in (e), the model predicts equal attention weights for all three dynamic pedestrians, although one of them is clearly more important than the others. Investigating the reason for such prediction failures of our model is left to future work.
VI Conclusion
In this work, we have presented an attention-based trajectory prediction model, Social Attention, that learns the relative influence of each pedestrian in the crowd on the planning behavior of the others, and accurately predicts their future trajectories. We use an RNN mixture to model both the temporal and spatial dynamics of trajectories in human crowds. The resulting model is feedforward, fully differentiable, and jointly trained to capture human-human interactions between pedestrians. We show that our proposed method outperforms the state-of-the-art approach in prediction errors on two publicly available datasets. We also analyze the learned attention model to understand which surrounding agents humans attend to when navigating a crowd, and present qualitative results. Future work can extend the model to include static obstacles in the environment. The SRNN architecture employed in this work can be naturally extended to model different semantic entities, as shown in [29]. We also plan to verify and validate our model on a real robot placed in a human crowd, predicting future trajectories of surrounding humans and planning its own path to reach its destination. In addition, it would be useful to compare the performance of our model with IRL-based approaches such as [14, 15, 16], which currently do not scale well to large crowds.
References
 [1] S. Thompson, T. Horiuchi, and S. Kagami, “A probabilistic model of human motion and navigation intent for mobile robot path planning,” in 4th International Conference on Autonomous Robots and Agents, 2009. IEEE, 2009, pp. 663–668.
 [2] M. Bennewitz, W. Burgard, G. Cielniak, and S. Thrun, “Learning motion patterns of people for compliant robot motion,” The International Journal of Robotics Research, vol. 24, no. 1, pp. 31–48, 2005.
 [3] F. Large, D. Vasquez, T. Fraichard, and C. Laugier, “Avoiding cars and pedestrians using velocity obstacles and motion prediction,” IEEE Intelligent Vehicles Symposium, 2004, pp. 375–379, 2004.
 [4] P. Trautman and A. Krause, “Unfreezing the robot: Navigation in dense, interacting crowds,” 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 797–803, 2010.
 [5] A. Vemula, K. Muelling, and J. Oh, “Modeling cooperative navigation in dense human crowds,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), May 2017, pp. 1685–1692.

 [6] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social LSTM: Human trajectory prediction in crowded spaces,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 961–971, 2016.
 [7] I. Karamouzas, B. Skinner, and S. J. Guy, “Universal power law governing pedestrian interactions,” Physical review letters, vol. 113, no. 23, p. 238701, 2014.
 [8] D. Helbing and P. Molnar, “Social force model for pedestrian dynamics,” Physical review E, vol. 51, no. 5, p. 4282, 1995.
 [9] A. Johansson, D. Helbing, and P. K. Shukla, “Specification of the social force pedestrian model by evolutionary adjustment to video tracking data,” Advances in complex systems, vol. 10, no. supp02, pp. 271–288, 2007.
 [10] R. Mehran, A. Oyama, and M. Shah, “Abnormal crowd behavior detection using social force model,” 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 935–942, 2009.
 [11] E. T. Hall, “A system for the notation of proxemic behavior,” American anthropologist, vol. 65, no. 5, pp. 1003–1026, 1963.
 [12] M. Svenstrup, T. Bak, and H. J. Andersen, “Trajectory planning for robots in dynamic human environments,” 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4293–4298, 2010.
 [13] N. Pradhan, T. Burg, and S. Birchfield, “Robot crowd navigation using predictive position fields in the potential function framework,” in American Control Conference (ACC), 2011. IEEE, 2011, pp. 4628–4633.
 [14] M. Kuderer, H. Kretzschmar, C. Sprunk, and W. Burgard, “Feature-based prediction of trajectories for socially compliant navigation,” in Robotics: science and systems, 2012.
 [15] H. Kretzschmar, M. Spies, C. Sprunk, and W. Burgard, “Socially compliant mobile robot navigation via inverse reinforcement learning,” The International Journal of Robotics Research, vol. 35, no. 11, pp. 1289–1307, 2016.
 [16] M. Pfeiffer, U. Schwesinger, H. Sommer, E. Galceran, and R. Siegwart, “Predicting actions to act predictably: Cooperative partial motion planning with maximum entropy models,” 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2096–2101, 2016.
 [17] K. Kim, D. Lee, and I. A. Essa, “Gaussian process regression flow for analysis of motion trajectories,” 2011 International Conference on Computer Vision, pp. 1164–1171, 2011.
 [18] J. Joseph, F. Doshi-Velez, A. S. Huang, and N. Roy, “A bayesian nonparametric approach to modeling motion patterns,” Autonomous Robots, vol. 31, no. 4, pp. 383–400, 2011.
 [19] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert, “Activity forecasting,” in European Conference on Computer Vision. Springer, 2012, pp. 201–214.
 [20] D. Varshneya and G. Srinivasaraghavan, “Human trajectory prediction using spatially aware deep attention models,” arXiv preprint arXiv:1705.09436, 2017.
 [21] F. Bartoli, G. Lisanti, L. Ballan, and A. Del Bimbo, “Context-aware trajectory prediction,” arXiv preprint arXiv:1705.02503, 2017.
 [22] T. Fernando, S. Denman, S. Sridharan, and C. Fookes, “Soft+ hardwired attention: An lstm framework for human trajectory prediction and abnormal event detection,” arXiv preprint arXiv:1702.05552, 2017.
 [23] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik, “Recurrent network models for human dynamics,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4346–4354.
 [24] C. Sun and R. Nevatia, “Active: Activity concept transitions in video event classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 913–920.
 [25] A. Jain, H. S. Koppula, B. Raghavan, S. Soh, and A. Saxena, “Car that knows before you do: Anticipating maneuvers via learning temporal driving models,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3182–3190.
 [26] Y. Li and R. Nevatia, “Key object driven multi-category object recognition, localization and tracking using spatio-temporal context,” in ECCV (4), 2008, pp. 409–422.
 [27] H. S. Koppula and A. Saxena, “Anticipating human activities using object affordances for reactive robotic response,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 1, pp. 14–29, 2016.
 [28] X. Zhang, P. Jiang, and F. Wang, “Overtaking vehicle detection using a spatio-temporal CRF,” in Intelligent Vehicles Symposium Proceedings, 2014 IEEE. IEEE, 2014, pp. 338–343.

 [29] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, “Structural-RNN: Deep learning on spatio-temporal graphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5308–5317.
 [30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” ArXiv e-prints, June 2017.
 [31] A. Graves, “Generating sequences with recurrent neural networks,” arXiv preprint arXiv:1308.0850, 2013.
 [32] S. Pellegrini, A. Ess, K. Schindler, and L. van Gool, “You’ll never walk alone: Modeling social behavior for multi-target tracking,” in 2009 IEEE 12th International Conference on Computer Vision, Sept 2009, pp. 261–268.
 [33] A. Lerner, Y. Chrysanthou, and D. Lischinski, “Crowds by example,” Computer Graphics Forum, vol. 26, no. 3, pp. 655–664, 2007. [Online]. Available: http://dx.doi.org/10.1111/j.1467-8659.2007.01089.x