1 Introduction
Given an observation of a history trajectory, learning to predict future pedestrian locations is an essential task in many applications, e.g. autonomous driving, robot navigation, etc. Although tremendous research effort has been invested [1, 3, 11], it is still challenging to capture the complex social interactions in crowded scenarios. For example, a person can walk alone or with others in a consistent group. A group might change when a person joins or leaves. As a result, the individual trajectory is usually influenced by others in order to avoid collisions while following reasonable social norms. Furthermore, the future route is often ambiguous, meaning that more than one path is reasonable to reach the same destination.
Motivated by these observations, we highlight two factors that are crucial in trajectory prediction:
1) Social interaction between humans is non-symmetric.
Compared with previous works [1, 9, 25, 29], we propose that the pairwise social interaction should be non-symmetric among pedestrians. For example, a person usually pays attention to pedestrians ahead of him and has little awareness of pedestrians behind him. To simulate this, we model the social topology as a directed graph. We then propose a graph network which accumulates social interactions towards the intrinsic destination, in order to capture a destination-oriented feature enriched with social interaction patterns.
2) The next step is uncertain, depending on both the intrinsic destination and the social selection at each time step.
During walking, a person might adopt various flexible decisions to avoid collisions. To model this uncertainty, previous work introduced one single random variable sampled from either the prior distribution conditioned on the observations [11] or a fixed multivariate normal distribution [9, 20]. The single random variable is then used to generate all future steps. However, in the real world, the selection of the next step might change during walking. For example, a person might try to surpass the person ahead of him at first, but then give up and continue to follow. To model this temporal stochasticity and generate diverse routes, at each time step one latent variable is sampled from a learned distribution, which models all possible selections of the next step up to that moment.
In summary, in this paper, we make two contributions to generate destination-oriented and social-plausible future trajectories. A new social graph network effectively extracts non-symmetric pairwise relationships and social interactions. A stochastic method predicts diverse social-plausible selections for the next step. The final stochastic predictions are generated by progressively integrating both social and individual destination information with a hierarchical LSTM.
2 Related Work
Predicting the future is always a challenging problem in computer vision. It has been widely investigated in many fields, such as video frame prediction [2, 5], motion flow prediction [26], traffic forecasting [13], car trajectory prediction [19], etc. For predicting pedestrian trajectories, many approaches have been proposed to make predictions for the first-person view [15, 28], collaboration with non-homogeneous traffic agents [14], team sports [6], etc. In this paper, we focus on prediction under a fixed camera view given only world-coordinate inputs.
Trajectory Prediction:
The earlier works used heuristic features to model human-human interactions. For example, in the social force model [10], each trajectory is generated by applying both attractive forces towards an intended destination and repulsive forces to avoid collisions. While social forces consider only the individual history, Linear Trajectory Avoidance (LTA) [16] predicts the future trajectory by jointly anticipating the movement of other pedestrians and obstacles in the scene. Over the past several years, data-driven methods based on RNNs have shown a powerful ability in modeling sequential data. Based on RNN, Social LSTM [1] first proposed social pooling, which updates the hidden state of each pedestrian by summing up the states of neighborhood pedestrians within predefined regular grids. In order to remove the limitation of neighborhood grids, the social pooling is extended to a multi-layer perceptron (MLP) network in [9]. SR-LSTM [29] iteratively refines the cell and hidden states of the LSTM at each time step by learning attention over other pedestrians. A similar attention mechanism can also be found in [25]. CIDNN [27] replaced attention weights with spatial affinities, which are the inner products of the embedding representations of the current locations. To improve training efficiency, adversarial training is also introduced in [9, 30].
Stochastic Prediction: In trajectory prediction, most of the previous works use a single stochastic variable to model possible diversity. Based on the conditional variational autoencoder (CVAE, [21]) framework, Lee et al. [11] sample latent variables conditioned on the summary of the observed trajectory given by an RNN, then decode them into a sequence. However, they did not consider pedestrian interactions during generation. SoPhie [20] modifies the decoder LSTM by sampling white noise and concatenating it with the scene information extracted by a CNN as inputs. Su et al. [22] add Gaussian processes on LSTM hidden states in order to generate probabilistic predictions. Similarly, a stochastic extension of LTA [17] obtains a set of possible future states by extending the original energy into a probabilistic form, then estimates the Gibbs potential by fitting a Gaussian mixture model.
VRNN [4] first introduced a temporal stochastic latent variable which depends on both the current LSTM hidden state and the input at each time step. Based on VRNN, [7, 8] extend the stochastic latent variables to be time-dependent and add an auxiliary backward LSTM for training. The key differences among these models are the choices of prior, approximate posterior, and loss function. A recent work [23] is similar to ours: it associates hidden states in a VRNN with a fully connected graph interaction network. Different from us, they introduce a context image as additional input and generate predictions as a weighted combination of a visual decoder and a VRNN decoder. Furthermore, they studied team sports with highly collaborative agents, whereas the social interactions in our work are more flexible, with both collaborative and standalone agents. Our work is also inspired by [5] on learning a dynamic prior model across time. The difference is that we use the output of the social graph for modeling uncertainty and propose hierarchical LSTMs as the decoder to progressively integrate different information. Although the stochastic framework is similar, [5] aims to model frame uncertainty in video prediction, whereas our goal is to capture social interaction uncertainty in the trajectory prediction problem.
3 Method
3.1 Problem Formulation
Assume there are $N$ pedestrians in a scene; the spatial location of the $i$-th pedestrian at time $t$ is denoted as $x_i^t$. The problem is that, given the observed frames $t = 1, \dots, T_{obs}$, we need to predict the trajectories of all pedestrians in the next few frames $t = T_{obs}+1, \dots, T_{pred}$.
The architecture of our model is depicted in Fig. 1. It consists of three modules: 1) Encoder: a social graph network for learning both social interactions and individual representations (see Sec. 3.2); 2) Stochastic: a temporal stochastic model for generating latent variables conditioned on the encoder outputs (see Sec. 3.3); 3) Decoder: a decoder model to predict the speed of each agent. Given the predicted speed and the current location, the next location is obtained by a simple addition.
3.2 Social Graph Network
At each time $t$, a directed graph $G^t = (V, E^t)$, called the social graph, can be constructed. In this graph, each node $v_i \in V$ indicates a pedestrian in the scene, thus $V$ is not changed throughout the sequence. $E^t$ represents the set of directed edges determined by the adjacency matrix $A^t$. An edge from node $j$ to node $i$ exists when the corresponding element in the adjacency matrix ($A^t_{ji}$) equals 1.
As shown in Fig. 2, we derive a view area for each pedestrian and construct the social graph at each time step by inserting edges from all persons inside the view area. For example, the two persons marked in orange and blue are in the visible view of the person marked in purple. This means the future path of the purple person might be affected by these two persons, so two edges are added from the orange and blue nodes to the purple node. However, the green person is out of the scope of the purple one, thus no edge exists from green to purple. Specifically, to build the view area, we use the speed direction as the fixation direction of the eyes and expand the arc area to a predefined view angle. In this paper, the view angle is set larger than the maximum human eye angle because of possible eye or head movements. If a person is standing still (as the orange node in Fig. 2), input edges are inserted from all other nodes; the reason is that a still person can move in any direction and needs to pay attention to all persons in the scene. Furthermore, a more precise view area could be derived by estimating head pose if the context image were given. Because the relative positions of pedestrians might change during walking, the topology of the social graph is not consistent across the whole sequence. At each time step, we update the social graph given the current pedestrians' layout and speed.
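As an illustration, the view-area rule above can be sketched as follows (a minimal sketch; the function name, the 220-degree default angle, and the stillness threshold are our assumptions, not values reported by the paper):

```python
import numpy as np

def view_area_adjacency(pos, vel, view_angle_deg=220.0, still_eps=1e-3):
    """Directed social-graph construction sketch.

    pos: (N, 2) current world coordinates; vel: (N, 2) current speed vectors.
    Returns A with A[j, i] = 1 when pedestrian j is inside i's view area,
    i.e. a directed edge j -> i (j may influence i's future path).
    """
    n = len(pos)
    A = np.zeros((n, n), dtype=int)
    half = np.deg2rad(view_angle_deg) / 2.0
    for i in range(n):
        speed = np.linalg.norm(vel[i])
        for j in range(n):
            if i == j:
                continue
            if speed < still_eps:
                # A still pedestrian may move anywhere: attend to everyone.
                A[j, i] = 1
                continue
            d = pos[j] - pos[i]
            # Angle between i's gaze (speed direction) and the offset to j.
            cos_a = np.dot(vel[i], d) / (speed * np.linalg.norm(d) + 1e-12)
            if np.arccos(np.clip(cos_a, -1.0, 1.0)) <= half:
                A[j, i] = 1
    return A
```

A pedestrian directly ahead of a moving person thus yields an edge towards that person, while one directly behind does not.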
Then at time $t$, the embedding representation for each node ($n_i^t$) and edge ($e_{ij}^t$) can be derived as:

$$n_i^t = f_n(x_i^t, \dot{x}_i^t) \qquad (1)$$
$$e_{ij}^t = f_e(r_{ij}^t) \qquad (2)$$

where $f_n$ and $f_e$ are neural networks (in this paper, we use a one-layer MLP for each), which encode the node state and the pairwise relationship, respectively. The input $r_{ij}^t$ of $f_e$ indicates the pairwise relationship, which could be measured in two different coordinate systems:
Cartesian: $r_{ij}^t = x_j^t - x_i^t$. The input is the pairwise position displacement.

Polar: $r_{ij}^t = (\rho_{ij}^t, \theta_{ij}^t)$. The input is the coordinate of $x_j^t$ in a local polar coordinate system whose reference point is $x_i^t$.
Experimentally, we found that the polar representation performs slightly better than the Cartesian representation. The benefit might come from the disentanglement of the distance and direction factors.
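The two relational encodings can be illustrated as follows (a minimal sketch; the function names are ours, and the inputs follow the Cartesian/polar descriptions above):

```python
import math

def cartesian_rel(xi, xj):
    """Pairwise displacement of j relative to i (Cartesian variant)."""
    return (xj[0] - xi[0], xj[1] - xi[1])

def polar_rel(xi, xj):
    """Position of j in a local polar frame centered at i: (distance, angle).

    This form disentangles the distance and direction factors, which the
    ablation suggests works slightly better than the Cartesian form.
    """
    dx, dy = xj[0] - xi[0], xj[1] - xi[1]
    return (math.hypot(dx, dy), math.atan2(dy, dx))
```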
To obtain social interaction features, a social block is designed to update the node representation by accumulating neighborhood information. Formally, at time $t$, the update equation of the $k$-th social block for the $i$-th node is:

$$h_i^{(k),t} = h_i^{(k-1),t} + f_h\Big(\sum_{j} A_{ji}^t \, m_{ji}^{(k),t}\Big) \qquad (3)$$

where $m_{ji}^{(k),t}$ denotes the message passed from node $j$ to node $i$, and $f_h$ denotes a neural network. Initially, $h_i^{(0),t} = n_i^t$. $A_{ji}^t$ denotes the element in the adjacency matrix of the social graph at time $t$. In Eqn. (3), the feature of each node is updated by aggregating information from its neighborhood nodes.
The message at time $t$ is calculated as:

$$m_{ji} = a_{ji} \, \big(g_{ji} \odot h_j^{(k-1)}\big) \qquad (4)$$
Here we omit the superscripts $(k), t$ for brevity. In Eqn. (4), $a_{ji}$ is a scalar attention value for edge $(j, i)$, $g_{ji}$ is the social gate for element-wise selection, and $\odot$ is the element-wise product operator. Intuitively, the attention value measures the importance of each edge, whereas the social gate acts as an element-wise feature selection, similar to the motion gate in [29]. We adopt a similar attention calculation to [24]:
$$a_{ji} = \operatorname{softmax}_j\Big(\operatorname{LeakyReLU}\big(f_a([h_j \,\|\, h_i \,\|\, e_{ji}])\big)\Big) \qquad (5)$$
The social gate is calculated as

$$g_{ji} = \sigma\big(f_g([h_j \,\|\, h_i \,\|\, e_{ji}])\big) \qquad (6)$$

where $f_g$ is a neural network and $\sigma$ is the sigmoid function.
We can sequentially stack multiple ($K$) social blocks. The final output feature $s_i^t$ of the $i$-th pedestrian at time $t$ is the output of the last social block, $h_i^{(K),t}$, which encodes both the intrinsic destination and social interactions.
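Putting the attention, social gate, and residual update together, one social block can be sketched roughly as follows (an illustrative reading of the update rule, not the authors' released code; the weight shapes and the gated-value form of the message are our assumptions):

```python
import numpy as np

def social_block(H, E, A, Wa, Wg, Wh):
    """One social block update (illustrative sketch).

    H: (N, d) node features, E: (N, N, d) edge features,
    A: (N, N) adjacency with A[j, i] = 1 for a directed edge j -> i.
    Wa: (3d,) attention weights, Wg: (3d, d) gate weights, Wh: (d, d).
    """
    N, d = H.shape
    H_new = H.astype(float).copy()
    for i in range(N):
        nbrs = [j for j in range(N) if A[j, i]]
        if not nbrs:
            continue  # no incoming edges: node feature stays unchanged
        logits, msgs = [], []
        for j in nbrs:
            feat = np.concatenate([H[j], H[i], E[j, i]])  # [h_j || h_i || e_ji]
            z = float(feat @ Wa)
            logits.append(z if z > 0 else 0.2 * z)        # LeakyReLU
            gate = 1.0 / (1.0 + np.exp(-(feat @ Wg)))     # social gate (sigmoid)
            msgs.append(gate * H[j])                      # element-wise selection
        logits = np.array(logits)
        a = np.exp(logits - logits.max())
        a = a / a.sum()                                   # attention over in-edges
        agg = sum(ai * m for ai, m in zip(a, msgs))       # weighted aggregation
        H_new[i] = H[i] + agg @ Wh                        # residual update
    return H_new
```

Stacking this block K times corresponds to K rounds of social refinement over the directed graph.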
3.3 Stochastic Trajectory Prediction
In order to generate stochastic predictions, our temporal model samples one latent variable at each time step. Inspired by [5], we define the following update equations:
$$\mu_\psi^t, \sigma_\psi^t = \mathrm{LSTM}_\psi\big(s^{t-1}\big) \qquad (7)$$
$$\mu_\phi^t, \sigma_\phi^t = \mathrm{LSTM}_\phi\big(s^{t}\big) \qquad (8)$$
$$\hat{y}^t = \mathrm{LSTM}_\theta\big(s^t, z^t, n^t\big) \qquad (9)$$

where $s^t$ denotes the output node features from the social graph network module at time $t$, $z^t \sim \mathcal{N}(\mu^t, \mathrm{diag}((\sigma^t)^2))$ denotes the sampled stochastic latent variable, $n^t$ denotes the node embedding in Eqn. (1), and $\hat{y}^t$ denotes the output speed prediction. In all three equations, LSTM models are used to encode past histories.
The prior model is learned on past trajectories with recursive hidden states, whereas the posterior model for inference additionally encodes the scene at the current time step. The prior model is trained to approximate the posterior model, in order to capture uncertain social interactions. Detailed descriptions can be found in [5].
In the generation step, a hierarchical LSTM is used to gradually decode pedestrian features. The first LSTM, taking social-encoded features as inputs, aims to generate socially-plausible predictions, whereas the second LSTM, taking individual embeddings as inputs, aims to adjust the predicted path towards the individual destination.
Finally, the network is trained end-to-end by maximizing the variational lower bound:

$$\mathcal{L} = \sum_t \Big[ \mathbb{E}_{q_\phi} \log p_\theta\big(y^t \mid z^{\le t}, s^{\le t}\big) \qquad (10)$$
$$\qquad\qquad - \beta \, D_{KL}\big(q_\phi(z^t \mid s^{\le t}) \,\|\, p_\psi(z^t \mid s^{< t})\big) \Big] \qquad (11)$$

where the first likelihood term can be reduced to a reconstruction loss between the predicted results and the ground truth. The hyperparameter $\beta$ balances the reconstruction error and the sample diversity. We use Gaussian distributions for both the prior and posterior models, and apply the reparametrization trick for training with SGD.
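The objective above combines a reconstruction term with a KL term between the posterior and the learned prior, sampled via the reparametrization trick. A minimal numerical sketch (function names are ours; the closed-form KL assumes diagonal Gaussians):

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def elbo_loss(pred, target, mu_q, logvar_q, mu_p, logvar_p, beta=1.0):
    """Negative lower bound for one time step: reconstruction error plus a
    beta-weighted KL between the posterior and the learned prior."""
    recon = np.sum((pred - target) ** 2)
    return recon + beta * gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)

def reparam_sample(mu, logvar, rng):
    """Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I),
    keeping the sampling step differentiable with respect to mu, sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps
```

Larger beta pushes the posterior towards the prior (more diversity), while smaller beta favors reconstruction accuracy.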
Method  Performance (ADE/FDE)  

ETH  Hotel  Zara01  Zara02  Univ  AVG  
Linear  1.33/2.94  0.39/0.72  0.62/1.21  0.77/1.48  0.82/1.59  0.79/1.59 
LSTM  1.14/2.39  0.69/1.47  0.64/1.43  0.54/1.21  0.73/1.60  0.75/1.62 
S-LSTM [1]  0.77/1.60  0.38/0.80  0.51/1.19  0.39/0.89  0.58/1.28  0.53/1.15
SR-LSTM [29]*  0.63/1.25  0.37/0.74  0.42/0.90  0.32/0.70  0.51/1.10  0.45/0.94
CVAE [11]†  0.93/1.94  0.52/1.03  0.41/0.86  0.33/0.72  0.59/1.27  0.53/1.11
SoPhie [20]*†  0.90/1.60  0.87/1.82  0.38/0.73  0.38/0.79  0.49/1.19  0.61/1.22
SGAN [9]*†  1.19/1.62  1.02/1.37  0.43/0.68  0.58/0.84  0.84/1.52  0.81/1.21
Ours†  0.75/1.63  0.63/1.01  0.30/0.65  0.26/0.57  0.48/1.08  0.48/0.99
Components  Performance(ADE/FDE)  

DG  SG  Polar  K  ETH  Hotel  Zara01  Zara02  Univ  AVG 
1  0.98/2.01  0.72/1.45  0.51/1.13  0.32/0.71  0.65/1.35  0.64/1.33  
1  0.84/1.70  0.66/1.23  0.48/1.13  0.31/0.70  0.60/1.34  0.58/1.22  
2  0.84/1.61  0.61/1.11  0.39/0.88  0.34/0.75  0.67/1.48  0.57/1.27  
2  0.81/1.64  0.63/1.01  0.34/0.76  0.26/0.58  0.52/1.17  0.51/1.03  
2  0.75/1.63  0.64/1.11  0.30/0.65  0.26/0.57  0.48/1.08  0.49/1.01 
4 Experiments
4.1 Datasets and Metrics
Datasets: We evaluated our method on two public datasets: ETH [18] and UCY [12], which contain rich real-world human-human interactions. These two datasets cover 5 scenes: 2 scenes (ETH and Hotel) from ETH and 3 scenes (Zara01, Zara02 and Univ) from UCY. The average pedestrian number of a scene is 18.0 for UCY and 5.9 for ETH. All the trajectory coordinates are converted to world coordinates and interpolated to sample coordinates every 0.4 seconds. In total, there are 1536 pedestrians covering complex social interactions. Following prior work [1, 29], we use the leave-one-out strategy for evaluation. We take 8 frames (= 3.2 seconds) as observation, and predict the next 12 time steps (= 4.8 seconds).
Metrics: Following [1, 29], we evaluate with two error metrics in meters.

Average Displacement Error (ADE): Average Euclidean distance between the ground-truth and predicted coordinates over all predicted time steps.

Final Displacement Error (FDE): Euclidean distance between the ground-truth and predicted coordinates at the last frame.
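Both metrics can be computed directly from the predicted and ground-truth coordinates, e.g.:

```python
import numpy as np

def ade_fde(pred, gt):
    """ADE: mean Euclidean error over all predicted steps (in meters);
    FDE: Euclidean error at the final predicted step.

    pred, gt: (T_pred, 2) arrays of world coordinates for one pedestrian.
    """
    dists = np.linalg.norm(pred - gt, axis=-1)  # per-step Euclidean distances
    return dists.mean(), dists[-1]
```

For stochastic methods, these are typically computed for each of the generated samples and the best one is reported.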
4.2 Implementation Details
The dimension of both the embedding ($n_i^t$) and social ($s_i^t$) features is set to 32; the hidden-state dimensions of the prior, posterior and decoder LSTMs are given in Table 3, which details the network configuration with one social block. Each batch contains several scenes with a variable number of pedestrians. The Adam optimizer is adopted, and the epoch number is 300 for all experiments. During prediction, we use the one-step mode, meaning that we iteratively use the previous prediction results as the inputs of the next step. In contrast, during training, the inputs are always the ground truth of the last frame.
4.3 Comparison with Existing Methods
Baselines: Several baselines are used for comparison, including both deterministic and stochastic methods. For deterministic methods, we choose Linear (a linear regressor trained by minimizing the least-squares error) and three LSTM-based methods: vanilla LSTM (denoted as LSTM), Social LSTM (denoted as S-LSTM) and SR-LSTM. For stochastic methods, we choose CVAE (with the same network settings as [11]) and two methods that introduce stochasticity with Gaussian white noise (SoPhie [20] and SGAN [9]). For a fair comparison, we consider the SoPhie model with only trajectory inputs, without the scene images. For stochastic methods, 20 samples are generated for evaluation, whereas the deterministic methods produce only one best prediction.
As shown in Table 1, our method achieves comparable results with the current state-of-the-art methods. In particular, the error reduction is significant on UCY compared with the ETH dataset. Because UCY contains more crowded scenes with complex social interactions, our method demonstrates its superiority when dealing with complicated non-linear trajectories in crowded scenarios, benefiting from our social graph network. For ETH, our results are slightly worse than SR-LSTM, but still better than the other stochastic methods. Because of the simple interactions and few ambiguous paths in ETH, deterministic methods have an advantage by optimizing the reconstruction loss only.
4.4 Ablation Study
4.4.1 Component Analysis
Table 2 gives results on several configurations of our method, varied in whether to use our directed graph (DG), social gate (SG), polar coordinates, and the number of blocks K in the social graph network. When the directed graph is disabled, an undirected fully-connected graph topology is used, in which all elements of the adjacency matrix are 1. From the table, it is worth noting that the directed social graph significantly reduces the error, which indicates that the selection of noticeable pedestrians is important for boosting performance. Another salient error reduction comes from the introduction of the social gate, which indicates that element-wise social feature selection helps to filter information during message passing. In general, more refinement steps (K = 2) perform better than a one-step social calculation.
4.4.2 Qualitative Analysis
Social-aware prediction. In Fig. 5, we illustrate six crowd scenarios where the target person has to adjust his path towards the destination. As shown, our method can learn social norms and has the ability to adjust the path towards the destination. For example, when meeting a group, as shown in Fig. 5 (b) and (f), our predictions make a detour in order to avoid stepping into the group. In Fig. 5 (d), our results give reasonable routes through the crowd without collision. Our method can also capture potential social intentions, such as forming a new group in Fig. 5 (e). More results can be found in the demo videos.
Stochastic movement. Generally, the prediction is uncertain, especially when walking at a low speed or near a road crossing. Fig. 5 shows our prediction results in these cases: at the road corner, our method gives two options, going straight or turning right. It is worth noting that the generated stochastic predictions still do not break the consistency of group walking.
Social attention. Fig. 5 illustrates the attention values of some example pedestrians in the same crowded scene. It shows that our directed graph can help to filter out irrelevant pedestrians (marked with gray circles). The dominant attention is paid to the neighboring person, while the target still notices other pedestrians who might affect his route.
5 Conclusions
In this paper, we propose a temporal stochastic model with a social graph network to address the problem of predicting all socially-plausible trajectories in crowds. We propose a directed social graph and a network to encode both individual and social features. In addition, we utilize a temporal stochastic model which sequentially learns a dynamic prior model at each time step. The final one-step prediction is generated by sampling from the prior model and progressively decoded with a hierarchical LSTM. Our empirical evaluations on real datasets demonstrate our improvement over the current state-of-the-art methods in crowded scenes. In the future, we plan to introduce context images to refine our social graph construction and to add scene semantics, such as obstacles and road paths, derived from context images.
Layer  Input, (Dimensions)  Output, (Dimensions)  Parameters 
Encoder  
Fullyconnected  , (4)  , (32)  act:=ReLU 
Fullyconnected  , (4)  , (32)  act:=ReLU 
Fullyconnected  , (96)  , (32)  act:=ReLU 
Fullyconnected  , (96)  , (32)  act:=LeakyReLU 
Softmax  , (32)  , (1)  
Fullyconnected  , (96)  , (32)  act:=ReLU 
Sigmoid  , (32)  , (32)  
Identity  , (32)  , (32)  
Identity  , (32)  , (32)  
Multiplication  , (32),(1),(32)  ,(32)  
Aggregation  , (32),(32)  , (32)  
Fullyconnected  , (32)  ,(32)  
Addition  , , (32),(32)  , (32)  
Prior  
Fullyconnected  , (32)  , (32)  
LSTM  , (32)  , (32)  
Fullyconnected  , (32)  , (32)  
Fullyconnected  , (32)  , (32)  act:= 
Reparam. trick  , (32),(32)  , (32)  
Inference  
Fullyconnected  , (32)  , (32)  
LSTM  , (32)  , (32)  
Fullyconnected  , (32)  , (32)  
Fullyconnected  , (32)  , (32)  act:= 
Decoder  
LSTM  , (64)  , (32)  
LSTM  , (64)  , (64)  
Fullyconnected  , (64)  , (2) 
References
 [1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In CVPR, June 2016.
 [2] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H. Campbell, and Sergey Levine. Stochastic variational video prediction. In ICLR, 2018.
 [3] Stefan Becker, Ronny Hug, Wolfgang Hübner, and Michael Arens. An evaluation of trajectory prediction approaches and notes on the trajnet benchmark. arXiv:1805.07663, 2018.
 [4] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In NIPS, pages 2980–2988, 2015.
 [5] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In ICML, pages 1174–1183, 2018.

 [6] Panna Felsen, Patrick Lucey, and Sujoy Ganguly. Where will they go? Predicting fine-grained adversarial multi-agent motion using conditional variational autoencoders. In ECCV, pages 761–776, 2018.
 [7] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In NIPS, pages 2207–2215, 2016.
 [8] Anirudh Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. Z-forcing: Training stochastic recurrent networks. In NIPS, pages 6716–6726, 2017.
 [9] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In CVPR, June 2018.
 [10] Dirk Helbing and Peter Molnar. Social force model for pedestrian dynamics. In Physical Review E, volume 51, pages 4282–4286, 1995.
 [11] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B. Choy, Philip H. S. Torr, and Manmohan Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. In CVPR, July 2017.
 [12] Alon Lerner, Yiorgos Chrysanthou, and Dani Lischinski. Crowds by example. Computer Graphics Forum, 26(3):655–664, 2007.

 [13] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In ICLR, 2018.
 [14] Yuexin Ma, Xinge Zhu, Sibo Zhang, Ruigang Yang, Wenping Wang, and Dinesh Manocha. TrafficPredict: Trajectory prediction for heterogeneous traffic-agents. In AAAI, 2019.
 [15] Hyun Soo Park, Jyh-Jing Hwang, Yedong Niu, and Jianbo Shi. Egocentric future localization. In CVPR, 2016.
 [16] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc J. Van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In ICCV, pages 261–268, 2009.
 [17] Stefano Pellegrini, Andreas Ess, Marko Tanaskovic, and Luc Van Gool. Wrong turn - no dead end: A stochastic pedestrian motion model. In International Workshop on Socially Intelligent Surveillance and Monitoring, pages 15–22, 2010.
 [18] Stefano Pellegrini, Andreas Ess, and Luc Van Gool. Improving data association by joint modeling of pedestrian trajectories and groupings. In ECCV, pages 452–465, 2010.
 [19] Nick Rhinehart, Paul Vernaza, and Kris Kitani. R2P2: A reparameterized pushforward policy for diverse, precise generative path forecasting. In ECCV, pages 794–811, 2018.
 [20] Amir Sadeghian, Vineet Kosaraju, Ali Sadeghian, Noriaki Hirose, and Silvio Savarese. SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints. arXiv:1806.01482, 2018.
 [21] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In NIPS, 2015.
 [22] Hang Su, Jun Zhu, Yinpeng Dong, and Bo Zhang. Forecast the plausible paths in crowd scenes. In IJCAI, pages 2772–2778, 2017.
 [23] Chen Sun, Per Karlsson, Jiajun Wu, Joshua B Tenenbaum, and Kevin Murphy. Stochastic prediction of multiagent interactions from partial observations. In ICLR, 2019.
 [24] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. ICLR, 2018.

 [25] Anirudh Vemula, Katharina Muelling, and Jean Oh. Social attention: Modeling attention in human crowds. In ICRA, May 2018.
 [26] Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV, 2016.
 [27] Yanyu Xu, Zhixin Piao, and Shenghua Gao. Encoding crowd interaction with deep neural network for pedestrian trajectory prediction. In CVPR, June 2018.
 [28] Takuma Yagi, Karttikeya Mangalam, Ryo Yonetani, and Yoichi Sato. Future person localization in first-person videos. In CVPR, 2018.
 [29] Pu Zhang, Wanli Ouyang, Pengfei Zhang, Jianru Xue, and Nanning Zheng. SR-LSTM: State refinement for pedestrian trajectory prediction. In CVPR, 2019.
 [30] Haosheng Zou, Hang Su, Shihong Song, and Jun Zhu. Understanding human behaviors in crowds by imitating the decisionmaking process. In AAAI, pages 7648–7656, 2018.