Potential Field: Interpretable and Unified Representation for Trajectory Prediction

by Shan Su, et al.

Predicting an agent's future trajectory is a challenging task given the complicated stimuli (environmental/inertial/social) of motion. Prior works learn individual stimuli with different modules and fuse the representations in an end-to-end manner, which makes it hard to understand what is actually captured and how the features are fused. In this work, we borrow the notion of a potential field from physics as an interpretable and unified representation to model all stimuli. This allows us not only to supervise the intermediate learning process, but also to fuse the information of different sources coherently. From the generated potential fields, we further estimate the future motion direction and speed, which are modeled as Gaussian distributions to account for the multi-modal nature of the problem. The final prediction results are generated by recurrently moving the past location based on the estimated motion direction and speed. We show state-of-the-art results on the ETH, UCY, and Stanford Drone datasets.




1 Introduction

Trajectory prediction is essential for the safe operation of vehicles and robots designed to cooperate with humans. Although intensive research has been conducted, accurate prediction of road agents’ future motion is still a challenging problem given the high complexity of stimuli [25]. To properly model the behavior of humans, three types of stimuli should be considered: (i) Environmental (external) stimulus: humans obey the physical constraints of the environment as they move on the walkable area and avoid collision with stationary obstacles; (ii) Inertial (internal) stimulus: humans’ future motion is driven by their own intention inferred from the past motion (i.e., direction and speed); and (iii) Social stimulus: humans share the physical environment with others, and they interactively negotiate for possession of the road. Meanwhile, prediction of human behavior is inherently multi-modal in nature. Given the past motion, there exist multiple plausible future trajectories due to the unpredictability of the future.

Figure 1: We address the trajectory prediction problem from a bird's-eye view by introducing a potential field. The proposed potential field serves as an interpretable and unified representation which incorporates environmental, inertial and social forces. Yellow and blue represent high and low potential values respectively, and arrows indicate the motion direction. The target agent is marked in red while the neighbor agents are marked in orange.
Figure 2: Overview of the proposed pipeline. Two different potential fields are generated from the road image and the past information, respectively. The corresponding motion fields are then estimated from the potential fields to encode the motion direction. Since the motion fields provide only motion directions, the target agent's speed is predicted from the past trajectory and merged to generate the displacement field. The final predictions are generated by recurrently applying the displacement field to the target agent's locations. The "Merge Layer" module is described in detail in Figure 6. ⊕ denotes the concatenation operation and ⊗ denotes multiplication.

There have been research efforts [19, 5, 27, 35] to model environmental, inertial, and social stimuli for trajectory prediction. They extract features of each individual stimulus with different modules and fuse them in a feature space. Although such methods make it convenient to train the network in an end-to-end manner, end-to-end learning is not optimal for benefiting from the modular design of neural networks [9]. In this view, the current models cannot ensure that the desired stimuli are actually captured (i.e., they lack interpretability) or that the captured features are appropriate to fuse (i.e., they lack a unified representation). Following Vygotsky's zone of proximal development theory [4], which claims the necessity of incremental supervision for learning tasks, we propose to supervise and evaluate the intermediate learning progress of neural networks using interpretable representations that can be easily unified over different modules.

In this work, we present a novel way to model environmental, inertial, and social stimuli using invisible forces. This is inspired by Newton's laws of motion, where force is the governing factor of all interactions and motions. More specifically, we model the stimuli by potential fields induced by forces, as shown in Figure 1. The environmental force is determined by the structural layout of the road. It naturally pushes humans to move along the road while avoiding collision with stationary obstacles such as buildings, walls, and vegetation. The inertial force guides humans to move towards their intended destination. It is caused by the past motion of humans, and indicates their possible motion in the future. The social force comes from the motion of other agents. It makes the predicted future motion of the target agent compatible with other static/dynamic agents.

Although the combined force itself represents the effects of all stimuli on the target agent, force inherently has a deterministic direction at each location. We thus borrow the notion of a potential field to deal with the unpredictability of the future. In physics, an electric charge moves towards any location with a lower potential energy in a potential field. In analogy to this, we define a potential field in the physical world, where the potential values can be inversely estimated by integrating forces along agents' trajectories. The target agent can then travel anywhere with lower potential values. We further tackle the multi-modality of the problem by estimating the distributions of both motion direction and speed. An overview of the proposed framework is shown in Figure 2.

Using the potential field as an interpretable representation, our method is able to seamlessly fuse the three types of stimuli, and predict the future trajectory in a classic manner (potential field → force → motion). Our conjecture is that such a framework enables the network to comprehensively develop its intellectual capabilities [4]. More importantly, this provides a way to link the learning process with domain knowledge, which is a crucial step towards building algorithms that help to model human-level understanding [18].

The main contributions of this work are as follows: (i) To the best of our knowledge, our method is the first to present potential field as an interpretable and unified representation to model multiple stimuli for trajectory prediction. (ii) We develop a novel method to formulate potential field from the trajectory and provide its mathematical derivation with physical concepts. (iii) We generalize our potential field knowledge to unknown scenes using a neural network. (iv) We achieve state-of-the-art performances on three widely used public datasets.

2 Related Work

Classic Models Classic models for future trajectory prediction can be traced back to Newtonian mechanics, where the future trajectory is calculated from forces [10]. Such physics-based models were further developed by introducing statistical models, such as Kalman filters [17] and Gaussian processes [31, 30]. Some approaches such as [13, 21, 34] adopt a conceptual meaning of force to model attractive and repulsive behavior of humans. Classic models were revisited recently by [28], where a constant velocity model is used and achieves state-of-the-art performance. Such methods have high interpretability since the motion features and dynamics are hand-crafted. However, it is not guaranteed that the hand-crafted dynamics rules actually model human motion in real life. It is also hard to incorporate contextual information, which limits the ability to extend these methods to complicated environments. In contrast, we target problems with complex context (e.g., road structure and interaction with other agents) and develop our method by combining classic concepts with a data-driven approach.

Deep Learning Methods

Focusing on social interactions between humans, deep learning-based approaches for trajectory prediction train the network to generate interaction-encoded features. A social pooling module is proposed in [1, 11], where the embeddings of the agents' motion are aggregated in a feature space. These works motivated subsequent efforts to capture internal stimuli using additional input modalities such as the head pose [12] or body pose [33] of humans. These approaches concatenate the feature representations of a pose proxy with motion embeddings, seeking a correlation between gaze direction and the person's destination. Besides these efforts, the structural layout of the road is commonly used to model external stimuli in [19, 32, 35, 7]. They first extract image features from RGB input and simply fuse them with different types of feature embeddings. Although such models conveniently use auxiliary information, they may not be optimal for capturing the desired stimuli and fusing different feature embeddings [28]. Therefore, we introduce interpretable representations for intermediate supervision, in a unified format for sharing domain knowledge.

Potential Field in Robotics Potential field methods have previously been used for planning in robotics, with an emphasis on obstacle avoidance in both static and dynamic environments [2, 23, 8]. They use the potential field to obtain a global representation of the space [15]. However, potential values are heuristically assigned to both robots and obstacles to generate a smooth and viable path. Choosing the hyper-parameters is tedious and, more importantly, the resulting trajectory may be sub-optimal [8]. On the contrary, our potential field is learned from human trajectories in a data-driven manner by linking their behaviors with either the scene context or past positions. As a result, we do not use any heuristic values or hyper-parameters to generate the potential field.

3 Potential Field

We introduce a potential field that is induced by invisible forces and influences pedestrian trajectories. We assign each location a scalar value that represents the potential energy. A pedestrian's motion is thus generated towards locations of lower potential energy. Because human motions do not exhibit large acceleration or deceleration in everyday environments (as also observed in public datasets such as [22, 20]), we assume that our invisible forces are proportional to the velocities of humans.

We define the input road structure image captured from a bird's-eye view as I ∈ R^{W×H×C}, where W and H are the width and height of the image and C is the number of channels. A pedestrian's trajectory on the road structure is defined as a sequence of distinct points X = {x_1, …, x_N} from time t_1 to t_N with a constant sampling interval Δt (in this paper, Δt = 0.4 seconds). Our objective is then to predict the future sequence {x_{T+1}, …, x_N} based on the past trajectory X_past = {x_1, …, x_T}, where T is the number of observed frames. Therefore,

{x̂_{T+1}, …, x̂_N} = f(I, X_past)
To model the behavior of humans influenced by both the environment I and the past trajectory X_past, we generate two separate potential fields, P_env and P_past, respectively. In Section 3.1, we first derive equations to compute potential values from adjacent waypoints of a single trajectory. Then, we extend Eq. 3 to any three points on the trajectory. Given those equations, we build a dense potential field for a road scene in Section 3.2. Lastly, we present how to generate a potential field from the past motion information in Section 3.3, which is used to model the inertial and social stimuli.

Figure 3: Illustration of the potential field generated inversely from trajectory data. Left: Four adjacent points are selected from the trajectory with unknown potential values. Middle: We define the start and end points of the trajectory, with potential values 1 and 0, as reference points to derive the potential values in Eq. 5. Right: We define the width of the trajectory to be w, and a dense potential field can be estimated using Eq. 6.

3.1 Derivation of Potential Field Ground Truth

We aim to estimate a potential field which is compatible with the observed trajectory. In analogy to the motion of electric charges in an electric field, the trajectory is modeled as the movement of an agent towards locations of lower potential value. This means that the potential values along a single trajectory should monotonically decrease. However, this decreasing property alone is not sufficient to generate a unique and stable field (a detailed comparison is shown in the supplementary material). Therefore, we explicitly compute the potential values for each point on the trajectory and infer a dense field from those sparse values. Our key observation is that the potential difference is linearly proportional to the agent's velocity, which can be extracted from the distances among points on the trajectory. This allows us to draw a direct relationship between distance and potential values.

Consider three adjacent points x_i, x_{i+1}, and x_{i+2} on a trajectory X, with corresponding potential values p_i, p_{i+1}, and p_{i+2}. We assume that the velocity within a pair of adjacent sampled positions is constant. Therefore, the velocity between two points is given as follows:

v_i = d_i / Δt    (1)

where Δt is the sampling interval and d_i is the distance between x_i and x_{i+1}. Note that the velocity can differ for other segments of the same trajectory.
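The per-segment constant-velocity assumption of Eq. 1 can be sketched as follows; the function name and the default value of `dt` are illustrative assumptions, not fixed by the text:

```python
import numpy as np

def segment_speeds(points, dt=0.4):
    """Speed within each pair of adjacent sampled positions, assuming
    constant velocity per segment (Eq. 1): v_i = d_i / dt."""
    points = np.asarray(points, dtype=float)              # (N, 2) waypoints
    d = np.linalg.norm(np.diff(points, axis=0), axis=1)   # segment lengths d_i
    return d / dt                                         # one speed per segment

# A trajectory that speeds up: segment lengths 1.0 and 2.0
v = segment_speeds([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)], dt=0.4)
```

Each segment gets its own speed, matching the remark that the velocity can differ between segments of the same trajectory.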

We denote the potential difference between two adjacent points as Δp_i = p_i − p_{i+1}. Similar to the constant velocity assumption, we assume the derivative of the potential energy is constant from x_i to x_{i+1}. The field strength is then denoted as E_i = Δp_i / d_i.

In order to derive the relationship between the velocity v_i and the potential difference Δp_i, we borrow the potential field analogy from physics [10]. In the theory of electricity, Δp is usually referred to as voltage and E is referred to as electric field strength. The corresponding electric force is proportional to the electric field strength, following F = qE, where q is the electric charge of an object. Similarly, since our invisible force is proportional to velocity, we define the field strength to be directly proportional to the velocity. The velocity can then be formulated as follows:

v_i = E_i / c = (p_i − p_{i+1}) / (c · d_i)    (2)

where c is a constant scaling factor that depends on the type and intrinsic properties of the agent, similar to the mass or electric charge of an object. Note that its value does not change throughout the same trajectory. By combining Eq. 1 and Eq. 2, the relationship among the potential values p_i, p_{i+1}, and p_{i+2} is derived as follows:

(p_i − p_{i+1}) / (p_{i+1} − p_{i+2}) = d_i² / d_{i+1}²    (3)
The constant velocity and uniform field strength assumptions require the three points to be adjacently sampled. We further generalize Eq. 3 to the potential values of any triplet (x_i, x_j, x_k), i < j < k, on the same trajectory (the detailed derivations are provided in the supplementary material):

(p_i − p_j) / (p_j − p_k) = D_{i,j} / D_{j,k}    (4)

where D_{i,j} = Σ_{m=i}^{j−1} d_m². If we further constrain that p_1 = 1 and p_N = 0 on this trajectory, p_i for all points can be explicitly calculated as:

p_i = D_{i,N} / D_{1,N}    (5)
Given a single trajectory X and the corresponding potential values p_i for points x_i, we generate a dense potential field P_gt as shown in Figure 3. In the rest of this paper, we use P_gt to denote the generated ground truth potential field, and P̂ to denote an estimated potential field. For every pixel u within a distance w of a segment (x_i, x_{i+1}),

P_gt(u) = (1 − λ_i(u)) · p_i + λ_i(u) · p_{i+1}    (6)

where λ_i(u) = ⟨u − x_i, x_{i+1} − x_i⟩ / ‖x_{i+1} − x_i‖², with ⟨·,·⟩ being the dot product, and dist(u, x_i x_{i+1}) ≤ w uses the distance from pixel u to the segment, as shown in Figure 3.
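For illustration, the densification step can be rasterized by projecting each pixel onto the nearest trajectory segment and linearly interpolating the waypoint potentials; the projection weight and the width test below are assumptions consistent with the caption of Figure 3, not the paper's exact formula:

```python
import numpy as np

def dense_potential_field(points, potentials, shape, width):
    """Rasterize sparse waypoint potentials into a dense field (cf. Eq. 6).
    For every pixel within `width` of segment (x_i, x_{i+1}), interpolate
    p_i -> p_{i+1} by the projection of the pixel onto the segment."""
    field = np.full(shape, np.nan)                  # NaN where no one travels
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    pix = np.stack([xs, ys], axis=-1).astype(float) # pixel coordinates (x, y)
    for a, b, pa, pb in zip(points[:-1], points[1:], potentials[:-1], potentials[1:]):
        a, b = np.asarray(a, float), np.asarray(b, float)
        seg = b - a
        lam = np.clip(((pix - a) @ seg) / (seg @ seg), 0.0, 1.0)  # projection weight
        closest = a + lam[..., None] * seg
        dist = np.linalg.norm(pix - closest, axis=-1)
        inside = dist <= width
        field[inside] = (1 - lam[inside]) * pa + lam[inside] * pb
    return field

# One horizontal segment from (0, 2) to (4, 2) with potentials 1 -> 0
f = dense_potential_field([(0, 2), (4, 2)], [1.0, 0.0], shape=(5, 5), width=1.0)
```

Pixels on the segment take the interpolated potential; pixels farther than the trajectory width remain unassigned.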

3.2 Potential Field from Road Structure

Figure 4: Illustration for potential field from road structure. Left: All the trajectories that traverse the image patch. Color denotes time sequence (from yellow to blue). Right: Potential field for the image patch. Color denotes potential energy value. Best viewed in color.

To estimate a ground truth potential field for a certain road structure I, we collect the trajectories of all agents who traverse the scene, as shown in Figure 4. We denote by T_I a set of trajectories from different agents who travel the given road scene I. For each trajectory X ∈ T_I, we compute the ground truth dense potential field P_gt^X using Eq. 6.

Since the actual ground truth potential field is not available, we approximate it using the per-trajectory fields P_gt^X. We train a network to estimate P̂_env from a bird's-eye view image I as follows:

P̂_env = f_env(I; θ_env)    (7)

where θ_env is a set of trainable parameters. In this paper, we use an encoder-decoder structure [16] to model the function f_env. The loss function to estimate P̂_env from a set of trajectories T_I is given as:

L_env = Σ_{X ∈ T_I} ‖ M_X ⊙ (P̂_env − P_gt^X) ‖² + β ‖ (1 − max_X M_X) ⊙ P̂_env ‖²    (8)

where M_X is a pixel-wise mask that acts as a soft indicator function for trajectory X, given by:

M_X(u) = exp( −dist(u, X)² / σ² )    (9)

where σ is a weight parameter whose value we fix in practice. The loss function enforces P̂_env to be consistent with the different ground truth trajectories in the scene, while the second term regularizes the regions where no one travels. Note that testing trajectories are not used when estimating P̂_env in the training phase.

The road potential field models the influence of the environment on the agent's motion behavior, which corresponds to the environmental stimulus. It constrains the future motion of agents to the movable/drivable area, such as roads or sidewalks in the scene. Note that our potential field is naturally shaped by the agents' trajectories, which makes it more representative of human behaviors than a road structure segmentation.

In fact, agents can traverse the region of interest from left to right, or from right to left. However, the potential fields for these two directions cancel each other out. To ensure the consistency of the potential field, we pre-process the data by rotating the trajectories and the corresponding image patches so that all motions run from one end to the other (e.g., left to right) in the image space.
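The left-to-right canonicalization described above can be sketched as rotating each trajectory about its start point so that the net displacement lies along the +x axis (the full pipeline would apply the same rotation to the image patch; that part, and the exact convention, are assumptions here):

```python
import numpy as np

def canonicalize(points):
    """Rotate a trajectory about its start so its net start->end displacement
    points along the +x axis (all motions run left to right)."""
    points = np.asarray(points, dtype=float)
    start, disp = points[0], points[-1] - points[0]
    theta = np.arctan2(disp[1], disp[0])            # heading of net motion
    c, s = np.cos(-theta), np.sin(-theta)           # rotate by -theta
    rot = np.array([[c, -s], [s, c]])
    return (points - start) @ rot.T + start

traj = canonicalize([(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)])  # upward motion
```

After the transform, an upward trajectory becomes a rightward one, so two agents crossing the same road in opposite directions no longer produce cancelling fields.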

3.3 Potential Field from Past Information

Given past motion information, we generate a potential field that encodes the inertial and social effects. For an agent i with past trajectory X_past^i, we define the set of neighbors' past trajectories as N_i = {X_past^j : ‖x_T^j − x_T^i‖ < r}, where r is a pre-defined radius. The potential field for the target agent is generated from its past trajectory and the trajectories of nearby agents using

P̂_past = f_past(X_past^i, N_i; θ_past)    (10)

where f_past takes as input both agent i's past trajectory X_past^i and its neighbors' past trajectories N_i, the training target is the sum of the ground truth dense potential fields (generated from Eq. 6) of the trajectories in the set, and θ_past is a set of trainable parameters.

The resulting potential field from past information is not only compatible with the past behavior of the target agent, but also considers the social interactions among nearby participants. In Figure 5, we illustrate the inertial effect on the left. The generated potential field shows a distribution of possible motions given the past trajectory. It corresponds to the multi-modal nature of future prediction. On the right, we illustrate the social effect. It is shown that the possibility of moving upward is eliminated by the interaction with a nearby agent of similar velocity.

In practice, we use a neural network to model the function f_past. The loss function for training the network is the squared error between the estimated field P̂_past and the summed ground truth field described above.
Figure 5: Illustration for potential field from past information. Left: Potential field generated without social information. Right: Potential field generated with social information. The possibility of moving upward is eliminated to avoid collision with a nearby agent of similar velocity. Best viewed in color.

4 Trajectory Prediction

With the generated potential field, the future trajectory could be calculated by solving an ordinary differential equation. However, this step converts the potential field back into a force, which overlooks the multi-modal nature of trajectory prediction. Due to the unpredictability of the future, road agents may follow different trajectories even when their past motions are identical. To represent such unpredictability, we use two separate Gaussian distributions to model the target agent's motion direction and speed.

We separate the pedestrian’s velocity into motion direction and speed , where is the number of prediction frames. We model the distributions of motion direction and speed as Gaussians denoted by and respectively. For each future location at time in the environment, we obtain the mean motion direction and speed as and for . Note that the number of channels is two for

since it is a vector field over space


4.1 Motion Field and Speed Prediction

Our road potential field is a universal representation of the road structures. Therefore, it cannot be used to provide agent-specific prediction due to speed differences. We thus disentangle the prediction problem by estimating the future motion direction and speed independently.

The expected value of the direction of future motion is derived from the potential field as

M_env = f_M(P̂_env; θ_M)    (11)

where f_M is modeled by convolutional layers, θ_M are the learnable parameters, and f_M calculates the expected values of the motion direction at each location.

We further merge the motion fields from the environment and the past information into a single field, as shown in Figure 6. Note that M_env and M_past are the resulting motion directions of the three independent stimuli acting on the target agent. Following the additive property of forces, we can thus fuse the two with a weighted sum:

M = W ⊙ M_env + (1 − W) ⊙ M_past    (12)

where W is a pixel-wise weighting mask determined by the two motion fields. We drop the subscripts env and past in later sections.
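The pixel-wise convex combination used by the "Merge Layer" can be sketched as below. In the paper the mask comes from convolutional layers (Figure 6); the fixed mask and the final unit-vector renormalization here are illustrative assumptions:

```python
import numpy as np

def merge_motion_fields(m_env, m_past, weight_mask):
    """Fuse two motion-direction fields with a pixel-wise weighted sum,
    M = W * M_env + (1 - W) * M_past, then renormalize to unit vectors
    (an assumption, so the result stays a pure direction field)."""
    w = weight_mask[..., None]                      # broadcast over the 2 channels
    merged = w * m_env + (1.0 - w) * m_past
    norm = np.linalg.norm(merged, axis=-1, keepdims=True)
    safe = np.where(norm > 0, norm, 1.0)            # avoid division by zero
    return np.where(norm > 0, merged / safe, merged)

# Toy 1x1 "fields": environment says +x, past says +y, equal weighting
m = merge_motion_fields(np.array([[[1.0, 0.0]]]),
                        np.array([[[0.0, 1.0]]]),
                        np.array([[0.5]]))
```

With equal weights, the fused direction bisects the two inputs, which is the behavior the additive-force argument above calls for.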

The expected value of the speed of future motion is derived from the past trajectory by

(ŝ_{T+1}, …, ŝ_N) = f_s(X_past; θ_s)    (13)

where N is the length of the whole trajectory and T is the length of the past trajectory.

In practice, we model the functions f_M and f_s using neural networks. In addition to the mean values of the distributions, the networks also output their variances. The ground truth motion direction and speed are calculated from the trajectory data. The loss function then enforces the network to generate the distributions of the motion direction and speed. More specifically, we maximize the likelihood of the ground truth samples given the generated mean and sigma values, and the loss is given by

L = − Σ_t [ log N( v_t/‖v_t‖ ; μ_m(x_t), Σ_m ) + log N( ‖v_t‖ ; μ_s(t), σ_s² ) ]    (14)

where v_t (as in Eq. 2) is the velocity of the agent at location x_t at time t.

Figure 6: We merge the two motion fields into a single motion field using this module. The pixel-wise mask is generated by convolutional layers and is used to fuse the motion fields from the road structure and from the past trajectory information. ⊕ denotes the concatenation operation, ⊗ denotes multiplication and + denotes addition.

4.2 Single and Multiple Future Prediction

For single future prediction, the mean values μ_m for the motion direction and μ_s for the speed are used to generate the displacement field as follows:

D_t = μ_s(t) · Δt · μ_m    (15)

where D_t is a vector field that scales the unit motion directions by the distance traveled in one time step, and t is the desired prediction time. The set of displacement fields {D_t} provides the complete motion of each pixel at every time step. The trajectory prediction is then given by recurrently moving and updating the previous location by

x̂_{t+1} = x̂_t + D_t(x̂_t)    (16)

where x̂_T = x_T.
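The recurrence of Eq. 16 can be sketched with a nearest-pixel lookup into each displacement field (bilinear sampling would be the smoother choice; both are assumptions about the implementation):

```python
import numpy as np

def rollout(start, disp_fields):
    """Recurrently move a location through a sequence of displacement fields.
    disp_fields: list of (H, W, 2) arrays, one per future time step."""
    x = np.asarray(start, dtype=float)              # (x, y) in pixel coordinates
    path = [x.copy()]
    for d in disp_fields:
        h, w = d.shape[:2]
        col = int(np.clip(round(x[0]), 0, w - 1))   # nearest-pixel lookup
        row = int(np.clip(round(x[1]), 0, h - 1))
        x = x + d[row, col]                          # Eq. 16 update
        path.append(x.copy())
    return np.array(path)

# A constant field that moves everything one pixel to the right each step
field = np.tile(np.array([1.0, 0.0]), (4, 4, 1))
traj = rollout((0.0, 1.0), [field, field, field])
```

The returned path contains the start location followed by one predicted location per time step.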

For multi-modal future prediction, we sample instances m^g of the motion direction and s^g of the speed from the distributions N(μ_m, Σ_m) and N(μ_s, σ_s²), respectively, and generate the displacement field

D_t^g = s^g(t) · Δt · m^g    (17)

where g = 1, …, G is the prediction index, with G being the number of generated predictions. The predicted trajectory is then generated by recurrently applying D_t^g from the previous location by

x̂_{t+1}^g = x̂_t^g + D_t^g(x̂_t^g)    (18)
Note that the predicted trajectories for both single- and multi-modal prediction are generated from previously learned motion field and speed prediction with no extra parameters.

In practice, we observe that recurrently moving a single point (the predicted location) over time using the displacement field is not stable enough, especially when applied to unknown scenes. Inspired by methods such as [3, 7] that generate a heatmap for the future location instead of regressing trajectory coordinates, we place a Gaussian distribution with pre-defined hyperparameters around the observed location, and move the whole distribution together. Recurrent updates of the heatmap allow us to account for the uncertainty of each predicted point of the trajectory.
| Category | Method | 1.0 sec | 2.0 sec | 3.0 sec | 4.0 sec |
|---|---|---|---|---|---|
| | Linear | - / 2.58 | - / 5.37 | - / 8.74 | - / 12.54 |
| Single Prediction (State-of-the-art) | S-LSTM [1] | 1.93 / 3.38 | 3.24 / 5.33 | 4.89 / 9.58 | 6.97 / 14.57 |
| | DESIRE [19] | - / 2.00 | - / 4.41 | - / 7.18 | - / 10.23 |
| | Gated-RN [7] | 1.71 / 2.23 | 2.57 / 3.95 | 3.52 / 6.13 | 4.60 / 8.79 |
| Single Prediction (Ours) | Past Inf. | 1.17 / 1.54 | 1.97 / 3.32 | 2.94 / 5.73 | 4.13 / 8.43 |
| | Past Inf. + Road Struc. | 1.14 / 1.48 | 1.91 / 3.23 | 2.87 / 5.58 | 4.00 / 8.10 |
| Multi-Modality (State-of-the-art) | CVAE [29] | 1.84 | 3.93 | 6.47 | 9.65 |
| | DESIRE [19] | 1.29 | 2.35 | 3.47 | 5.33 |
| Multi-Modality (Ours) | Past Inf. + Road Struc. | 1.10 | 2.33 | 3.62 | 4.92 |

Table 1: Quantitative results on the SDD dataset. ADE / FDE are reported in pixel coordinates at 1/5 resolution as proposed in [19]; multi-modality rows report FDE only. Our method consistently outperforms the previous works.
| Method | ETH-eth | ETH-hotel | UCY-univ | UCY-zara01 | UCY-zara02 | Average |
|---|---|---|---|---|---|---|
| Linear | 0.143 / 0.298 | 0.137 / 0.261 | 0.099 / 0.197 | 0.141 / 0.264 | 0.144 / 0.268 | 0.133 / 0.257 |
| S-LSTM [1] | 0.195 / 0.366 | 0.076 / 0.125 | 0.196 / 0.235 | 0.079 / 0.109 | 0.072 / 0.120 | 0.124 / 0.169 |
| SS-LSTM [32] | 0.095 / 0.235 | 0.070 / 0.123 | 0.081 / 0.131 | 0.050 / 0.084 | 0.054 / 0.091 | 0.070 / 0.133 |
| Gated-RN [7] | 0.052 / 0.100 | 0.018 / 0.033 | 0.064 / 0.127 | 0.044 / 0.086 | 0.030 / 0.059 | 0.044 / 0.086 |
| Ours | 0.047 / 0.082 | 0.027 / 0.051 | 0.051 / 0.104 | 0.036 / 0.078 | 0.023 / 0.047 | 0.037 / 0.072 |

Table 2: Quantitative results on the ETH/UCY datasets. ADE / FDE are reported in normalized pixel coordinates as proposed in [32]. Our method outperforms the previous works on average.

5 Experiments

We evaluate our algorithm on three widely used benchmark datasets ETH [22] / UCY [20] and Stanford Drone Dataset (SDD) [24]. All datasets contain annotated trajectories of real world humans. The ETH / UCY dataset has five scenes: ETH-University, ETH-hotel, UCY-Zara1, UCY-Zara2 and UCY-University. The SDD dataset has eight scenes of 60 videos with more complex road structures. For evaluation, we report the average displacement error (ADE) and the final displacement error (FDE).
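The two reported metrics can be computed directly from a predicted and a ground-truth trajectory; ADE averages the point-wise Euclidean error over all prediction steps, while FDE takes the error at the final step:

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and final displacement error between a predicted and a
    ground-truth trajectory of the same length (arrays of shape (T, 2))."""
    err = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=1)
    return err.mean(), err[-1]

ade, fde = ade_fde([(0, 0), (1, 0), (2, 0)], [(0, 0), (1, 1), (2, 2)])
```

For multi-modal evaluation, these are typically reported for the best of the generated samples per agent.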

We augment the datasets by flipping the trajectory data and the corresponding images with respect to the x-axis, the y-axis and the diagonal. The compositions of these flips yield seven additional variants of each training sample.
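The seven extra flipped copies arise because the three flips compose into the eight symmetries of the dihedral group. A sketch for the trajectory data is below (flipping about the coordinate origin for simplicity; the full pipeline would apply the same transforms to the image patches, which is omitted here):

```python
import numpy as np

def dihedral_variants(points):
    """All 8 symmetries generated by flips about the x-axis, the y-axis and
    the line y = x, applied to a 2D trajectory: the original plus 7 copies."""
    pts = np.asarray(points, dtype=float)
    out, seen = [], set()
    for fx in (1, -1):                  # flip about the y-axis (negate x)
        for fy in (1, -1):              # flip about the x-axis (negate y)
            for swap in (False, True):  # flip about the line y = x
                p = pts * np.array([fx, fy])
                if swap:
                    p = p[:, ::-1]      # swap x and y coordinates
                key = p.tobytes()
                if key not in seen:     # drop duplicates for symmetric inputs
                    seen.add(key)
                    out.append(p)
    return out

variants = dihedral_variants([(1.0, 2.0), (3.0, 5.0)])
```

For a generic (asymmetric) trajectory this yields eight distinct variants, i.e. seven augmentations on top of the original.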

5.1 Implementation Details

Potential Field Generation

We choose the Pix2Pix [16] architecture to model the mapping from the image I to the potential field P̂_env, and from the past information to the potential field P̂_past. We resize the image to 1/4 of the original size, and crop an image patch of fixed size centered at the target agent's location. To generate the potential field from the image, we find VGG-19 weight parameters pre-trained on ImageNet to be beneficial for image-level feature extraction and network convergence. To generate the potential field from past information, both the encoder and the decoder are trained from scratch.

Motion Field Generation The mapping from the potential field to the motion field is trained using four convolutional layers. This is a relatively shallow module, since its functionality resembles the task of taking derivatives. The convolutional layers provide larger receptive fields to tolerate noise in the generated potential fields.

Future Speed Generation

We employ a multi-layer perceptron for future speed prediction due to its structure simplicity. In theory, it can also be implemented by recurrent modules such as GRU 

[6], LSTM [14] or any other neural network architectures that can extract temporal information.

The detailed network structure is explained in the supplementary material. All networks are optimized with the Adam optimizer. The sub-modules are first trained separately using the intermediate losses, and the whole pipeline is then fine-tuned with respect to the loss between the predicted and ground truth trajectories.

Figure 7:

Qualitative results. Here we show some instances of the intermediate and final results. For each instance, we show the following three figures: (1) Potential field and motion field from environmental stimulus. (2) Potential field and motion field from inertial and social stimulus. (3) Final prediction results. Predicted future trajectory is drawn in blue and ground truth future trajectory is drawn in red with past trajectory drawn in black. We also show the future distribution heatmap of the target agent, with red denoting high probability and blue denoting low probability. Left: Our potential field of the road map is able to recognize complicated road structure (roundabout/crossing) and generate reasonable motion field. Right: Our model is able to predict turning according to the context. Best viewed in color.

5.2 Quantitative Results

We quantitatively evaluate our method on both the ETH/UCY and SDD datasets, and compare our model with state-of-the-art methods. For SDD, we randomly divide the 60 videos into training and testing sets with a ratio of 4:1, similar to [7, 19]. We split the trajectories into 7.2-second segments, and use 3.2 seconds for observation and 4 seconds for prediction and evaluation. Raw RGB road images are used with no additional annotation. The average displacement error and final displacement error are reported at 1s, 2s, 3s and 4s into the future, in pixel coordinates at 1/5 resolution.

Table 1 shows the quantitative comparison for both single-modal and multi-modal prediction. For single-modal prediction, we compare our model with (1) S-LSTM [1], which models human-human interaction with social pooling in the feature space; (2) DESIRE [19], which incorporates both the static scene and dynamic agents for future trajectory prediction; and (3) Gated-RN [7], which extracts the relations of the target agent with other road users and the environment. We test our model with past information only (Past Inf.), which uses the inertial and social stimuli, and with Past Inf. + Road Struc., which additionally uses the environmental stimulus. As shown in Table 1, our method already outperforms previous methods with Past Inf. alone. This demonstrates that intermediate supervision using interpretable representations is beneficial for information extraction. Adding the road structure to the pipeline makes the predicted trajectories compatible with the environmental context, and thus further improves the performance. It further validates our use of unified representations to merge different domain knowledge. We additionally compare our multi-modal prediction with CVAE [29] and DESIRE [19], and report the FDE in Table 1. We predict multiple possible trajectories for evaluation. The improvement from single-modal to multi-modal prediction shows that the generated distributions capture the unpredictability of the future.

For the ETH/UCY dataset, we adopt the same experimental setting as [32, 7], where we split the trajectories into 8-second segments, and use 3.2 seconds for observation and 4.8 seconds for prediction and evaluation. We use four scenes for training and the remaining scene for testing, in a leave-one-out cross-validation fashion. ADE and FDE are reported at 4.8s into the future. Since we work in pixel coordinates, both error metrics are reported as normalized pixel errors. The ground truth labels are generated by converting world coordinates to pixel coordinates using a homography matrix. Table 2 shows the quantitative comparison with S-LSTM [1], SS-LSTM [32], and Gated-RN [7]. Our method outperforms the previous methods in most scenes.

5.3 Qualitative Results

We qualitatively evaluate our method in Figure 7. It shows that our model can handle challenging road structures (open area / straight road / crossing / roundabout) and diverse motions (standing still / going straight / taking a turn). As shown in the top-right case, our potential field not only delineates the walkable area, but also learns human walking habits (e.g., keeping to the right side of the road) automatically in a data-driven manner. Such information cannot be obtained from other labels such as road segmentation. The environmental and past information can be merged reasonably and complement each other to generate plausible future trajectories. We also demonstrate that our method can handle interaction-intensive scenarios such as those in Figure 8, where the target agent moves and turns within a crowded group without colliding with others.

Figure 8: Social behavior. Left: Walking with a group of pedestrians. Right: Turning with a group of pedestrians. For the target agent, we use the same color legend as in Figure 7.

6 Conclusion

Predicting the future motion of road agents is a crucial and challenging task. We propose to use the potential field as an interpretable and unified representation for human trajectory prediction. This enables us not only to fuse information from different stimuli more coherently, but also to supervise and evaluate the intermediate learning process of the neural networks. Potential fields are generated to represent the effects of the environmental, inertial, and social forces on the target agent. From the potential fields we further estimate the future velocity direction and magnitude, which are modeled as Gaussian distributions to account for the unpredictability of the future. The predicted future trajectory is generated by recurrently moving the past location along the displacement field, which is computed from the motion direction and speed. We test our model on three challenging benchmark datasets, and the results show that our method handles complicated context while achieving state-of-the-art performance.
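The recurrent decoding step described above can be sketched as follows: at each step, a motion direction and speed are sampled (here from placeholder Gaussians standing in for the distributions estimated from the potential fields) and the position is advanced accordingly. All names and parameters are illustrative, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(last_pos, sample_motion, n_steps=12):
    """Recurrently advance the last observed position.

    At each step, sample a heading theta and speed s from the estimated
    motion distributions, then move the current position by the resulting
    displacement s * (cos theta, sin theta).
    """
    traj, pos = [], np.asarray(last_pos, dtype=float)
    for _ in range(n_steps):
        theta, speed = sample_motion(pos)
        pos = pos + speed * np.array([np.cos(theta), np.sin(theta)])
        traj.append(pos.copy())
    return np.stack(traj)  # (n_steps, 2)

# Illustrative sampler: heading ~ N(0, 0.05), speed ~ N(1.0, 0.1) --
# placeholders for the direction/speed Gaussians read off the fields.
sample = lambda pos: (rng.normal(0.0, 0.05), rng.normal(1.0, 0.1))
future = rollout([0.0, 0.0], sample)
print(future.shape)  # (12, 2)
```

Sampling repeatedly from `rollout` yields the multiple trajectory hypotheses used in the multi-modal evaluation; taking the distribution means instead recovers a single-modal prediction.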


  • [1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 961–971, 2016.
  • [2] Jerome Barraquand, Bruno Langlois, and J-C Latombe. Numerical potential field techniques for robot path planning. IEEE transactions on systems, man, and cybernetics, 22(2):224–241, 1992.
  • [3] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction, 2019.
  • [4] Seth Chaiklin. The zone of proximal development in vygotsky’s analysis of learning and instruction. Vygotsky’s educational theory in cultural context, 1:39–64, 2003.
  • [5] Rohan Chandra, Uttaran Bhattacharya, Aniket Bera, and Dinesh Manocha. Traphic: Trajectory prediction in dense and heterogeneous traffic using weighted interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8483–8492, 2019.
  • [6] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • [7] Chiho Choi and Behzad Dariush. Looking to relations for future trajectory forecast. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [8] Shuzhi Sam Ge and Yun J Cui. Dynamic motion planning for mobile robots using potential field method. Autonomous robots, 13(3):207–222, 2002.
  • [9] Tobias Glasmachers. Limits of end-to-end learning. In Proceedings of the Ninth Asian Conference on Machine Learning, volume 77 of Proceedings of Machine Learning Research, pages 17–32. PMLR, 15–17 Nov 2017.
  • [10] David J Griffiths. Introduction to electrodynamics, 2005.
  • [11] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2255–2264, 2018.
  • [12] Irtiza Hasan, Francesco Setti, Theodore Tsesmelis, Alessio Del Bue, Fabio Galasso, and Marco Cristani. Mx-lstm: mixing tracklets and vislets to jointly forecast trajectories and head poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6067–6076, 2018.
  • [13] Dirk Helbing and Peter Molnar. Social force model for pedestrian dynamics. Physical review E, 51(5):4282, 1995.
  • [14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [15] Yong Koo Hwang and Narendra Ahuja. A potential field approach to path planning. IEEE Transactions on Robotics and Automation, 8(1):23–32, 1992.
  • [16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [17] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. 1960.
  • [18] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and brain sciences, 40, 2017.
  • [19] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B Choy, Philip HS Torr, and Manmohan Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 336–345, 2017.
  • [20] Alon Lerner, Yiorgos Chrysanthou, and Dani Lischinski. Crowds by example. In Computer graphics forum, volume 26, pages 655–664. Wiley Online Library, 2007.
  • [21] Matthias Luber, Johannes A Stork, Gian Diego Tipaldi, and Kai O Arras. People tracking with human motion predictions from social forces. In 2010 IEEE International Conference on Robotics and Automation, pages 464–469. IEEE, 2010.
  • [22] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc Van Gool. You’ll never walk alone: Modeling social behavior for multi-target tracking. In 2009 IEEE 12th International Conference on Computer Vision, pages 261–268. IEEE, 2009.
  • [23] John H Reif and Hongyan Wang. Social potential fields: A distributed behavioral control for autonomous robots. Robotics and Autonomous Systems, 27(3):171–194, 1999.
  • [24] Alexandre Robicquet, Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Learning social etiquette: Human trajectory understanding in crowded scenes. In European conference on computer vision, pages 549–565. Springer, 2016.
  • [25] Andrey Rudenko, Luigi Palmieri, Michael Herman, Kris M. Kitani, Dariu M. Gavrila, and Kai O. Arras. Human motion trajectory prediction: A survey, 2019.
  • [26] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • [27] Amir Sadeghian, Vineet Kosaraju, Ali Sadeghian, Noriaki Hirose, Hamid Rezatofighi, and Silvio Savarese. Sophie: An attentive gan for predicting paths compliant to social and physical constraints. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [28] Christoph Schöller, Vincent Aravantinos, Florian Lay, and Alois Knoll. What the constant velocity model can teach us about pedestrian motion prediction, 2019.
  • [29] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pages 3483–3491, 2015.
  • [30] Jack M Wang, David J Fleet, and Aaron Hertzmann. Gaussian process dynamical models for human motion. IEEE transactions on pattern analysis and machine intelligence, 30(2):283–298, 2007.
  • [31] Christopher KI Williams. Prediction with gaussian processes: From linear regression to linear prediction and beyond. In Learning in graphical models, pages 599–621. Springer, 1998.
  • [32] Hao Xue, Du Q Huynh, and Mark Reynolds. Ss-lstm: A hierarchical lstm model for pedestrian trajectory prediction. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1186–1194. IEEE, 2018.
  • [33] Takuma Yagi, Karttikeya Mangalam, Ryo Yonetani, and Yoichi Sato. Future person localization in first-person videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7593–7602, 2018.
  • [34] Francesco Zanlungo, Tetsushi Ikeda, and Takayuki Kanda. Social force model with explicit collision prediction. EPL (Europhysics Letters), 93(6):68005, 2011.
  • [35] Tianyang Zhao, Yifei Xu, Mathew Monfort, Wongun Choi, Chris Baker, Yibiao Zhao, Yizhou Wang, and Ying Nian Wu. Multi-agent tensor fusion for contextual trajectory prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.