Socially Aware Kalman Neural Networks for Trajectory Prediction

09/14/2018 ∙ by Ce Ju, et al. ∙ Nanyang Technological University ∙ Baidu, Inc.

Trajectory prediction is a critical technique in the navigation of robots and autonomous vehicles. However, complex traffic and dynamic uncertainties make effective and robust modeling challenging. We propose a data-driven approach, socially aware Kalman neural networks (SAKNN), in which an interaction layer and a Kalman layer are embedded in the architecture, resulting in a class of architectures with great potential to learn directly from high-variance sensor input and robustly generate low-variance outcomes. The evaluation of our approach on the NGSIM dataset demonstrates that SAKNN achieves state-of-the-art prediction effectiveness over a relatively long-term horizon and significantly improves the signal-to-noise ratio of the predicted signal.





Forecasting the motion of surrounding vehicles is critical for the navigation of robots and autonomous vehicles. Suppose a surrounding vehicle intends to cut into your lane: it is then necessary to estimate its trajectory over the following few seconds as precisely as possible, as a prior for planning a non-intersecting path. To improve the safety and efficiency of autonomous vehicles, a well-designed autonomous driving system should be able to predict future traffic scenes over a relatively long-term horizon.

In general, forecasting the future trajectory of a vehicle is very challenging, because it is a complex-system problem affected by variable weather, complex traffic scenes, the different driving styles of human drivers and even the massive interaction between neighboring vehicles. Recall the previous example and suppose your response is to speed up in order to prevent your neighbor from cutting into your lane. This will suddenly change the neighbor's original intention, and the neighbor will soon replan the driving path. This is a simple traffic interaction scene, but modeling it mathematically is not an easy task.

Figure 1: Illustration of the SAKNN Model: The graph is composed of the data extraction part, in which we extract four categories of historical data from the NGSIM data set, and the SAKNN model part, in which we feed the data to the interaction layer to obtain the predicted acceleration and then recursively filter and smooth the raw NGSIM data and the predicted acceleration to estimate the underlying states in training. In the trajectory prediction part, the predicted acceleration generated by the interaction layer is integrated into predicted vehicle trajectories according to the motion formula. See section Model for details.

In spite of the challenges above, we exploit the following two considerations in our model:

  • Socially Aware Interactive Effects: The socially aware interactive effect is caused by the social force, which is a measure of the internal motivation of an individual. It was proposed by D. Helbing et al. [Helbing and Molnar1995] to study the motion trajectories of a group of pedestrians. In our model, we assume that the dynamic motion of vehicles in complex traffic is affected by the socially aware interaction between neighboring vehicles.

  • Dynamic Uncertainties: Because of uncertainties, forecasting the trajectories of robots or autonomous vehicles is closely entangled with linear or nonlinear dynamic models. In our model, the neural network is formulated in a dynamic model-based approach, with a filter layer embedded in it.

Inspired by the recent literature on artificial intelligence applied to complex systems and nonlinear dynamic models [Langford, Salakhutdinov, and Zhang2009, Krishnan, Shalit, and Sontag2015], we propose a neural network architecture called Socially Aware Kalman Neural Networks (SAKNN). SAKNN is a deep neural network with an embedded filter layer, which has the following three advantages:

  • Robustness: The robustness is attributed to the powerful ability of deep neural networks to extract complex abstractions [Deng, Yu, and others2014], especially the multiple convolution layers at the front of SAKNN, which resolve a hierarchy of concepts from massive surrounding sensor data and extract unique features from traffic and vehicle dynamics data, a task that is almost impossible for human experts.

  • Effectiveness: Over a relatively long-term horizon, SAKNN outperforms all the baselines in the mean square error (MSE) sense, which is attributed to its embedded filter structure. Specifically, we embed a filter layer into SAKNN as an approximation of the Kalman filter (KF). In training, the filter layer recursively bends the weights of the neural network toward the optimal estimate of the underlying states in the MSE sense.

  • Smoothness: Given the embedded filter structure of SAKNN, we add a smoothing regularization that bends the weights of the neural network, in which we run the Kalman smoother backward over the predicted trajectory.

In the experimental evaluation, we apply SAKNN to the Next Generation Simulation (NGSIM) data set [Colyar and Halkias2007] and the empirical results demonstrate that SAKNN predicts more precisely in a relatively long-term horizon and significantly improves the signal-to-noise ratio of the predicted signal.

Related Work

We adopt the categorization of trajectory prediction methods proposed by S. Lefèvre et al. [Lefèvre, Vasquez, and Laugier2014], in which the level of abstraction gradually increases, i.e. physics-based motion models, maneuver-based motion models and interaction-aware motion models. SAKNN is a combination of a physics-based motion model and an interaction-aware motion model.

Dynamic Uncertainties

Uncertainties need to be taken into account to model trajectory prediction more precisely. Typically, the KF [Kalman1960] is applied to model this uncertainty and recursively estimate the states of robots or autonomous vehicles from noisy sensor measurements. For a more complete review of state estimation, we refer the reader to standard references [Bishop, Welch, and others2001, Thrun, Burgard, and Fox2005, Simon2006]. In recent literature, the neural network approach to state estimation has been widely explored [Bobrowski et al.2008, Wilson and Finkel2009]. H. Coskun et al. [Coskun et al.2017] train three long short-term memory (LSTM) networks [Hochreiter and Schmidhuber1997] to learn the motion model, the process noise, and the measurement noise respectively, in order to estimate the human pose in a camera image. T. Haarnoja et al. [Haarnoja et al.2016] propose a discriminative approach to state estimation in which they train a neural network to learn features from highly complex observations and then filter with the learned features to output the underlying states.

Socially Aware Model

The interaction layer takes into account the influence of neighboring pedestrians or vehicles. Rather than the traditional dynamic model-based approach to modeling these dynamic features, we mainly refer to recent work in the neural network approach. A. Alahi et al. [Alahi et al.2016] propose a deep neural network model to predict the motion dynamics of pedestrians in crowded scenes, in which they build a fully connected layer called social pooling to learn the social tensor and the interaction between pedestrians. However, this approach is designed for pedestrian motion and is not effective in complex traffic configurations. Another work on motion prediction is [Deo and Trivedi2018], in which the authors extract the social tensor with a convolutional social pooling layer and then feed it to a maneuver-based model of trajectory prediction. A special version of the prediction model applies an imitation learning approach to learn human driver behavior in the multi-agent setting [Bhattacharyya et al.2018, Kuefler et al.2017]. The learned policies are able to generate future driving trajectories that better match those of human drivers and can also interact with neighbors in a more stable manner over long horizons in the multi-agent setting.



In the following sections, we use alphabets , , T to represent the Kalman filter starting time, the Kalman filter terminal time and the Kalman prediction terminal time respectively. And, the subscript s:t represents a history of data or states from Time s to Time t. We also use to represent the weights of the neural networks Interaction, LSTM and LSTM in our architecture.

Formulation of Our Problem

We consider the basic kinematic motion formula. Let be the position at Time t in the trajectory of the vehicles. The motion formula is as follows,


where we use big-O notation to represent the higher-order terms of the Taylor expansion of the position. Our approach predicts vehicle trajectories in two sequential steps. The first step is to predict the acceleration with our neural network SAKNN, whose input is the sensor information from the perception space, which mainly includes a history of positions, velocities, accelerations and social data. In the second step, we integrate the predicted acceleration with respect to the kinematic motion formula (1).
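The second step can be illustrated with a minimal sketch, assuming the usual second-order kinematic update with the higher-order terms dropped. The function name `integrate_acceleration`, the argument layout and the fixed time step `dt` are assumptions made for this illustration, not names from the paper.

```python
import numpy as np

def integrate_acceleration(p0, v0, acc, dt):
    """Roll a second-order kinematic update forward from (p0, v0).

    acc has shape (T, d): one predicted acceleration per time step.
    Returns positions and velocities, each of shape (T, d).
    """
    p = np.asarray(p0, dtype=float)
    v = np.asarray(v0, dtype=float)
    positions, velocities = [], []
    for a in np.asarray(acc, dtype=float):
        # Second-order Taylor step; higher-order terms are ignored.
        p = p + v * dt + 0.5 * a * dt ** 2
        v = v + a * dt
        positions.append(p)
        velocities.append(v)
    return np.array(positions), np.array(velocities)

# Hypothetical input: constant acceleration of 2 units/s^2 along x,
# starting at rest, sampled at 10 Hz for one second.
acc = np.tile([2.0, 0.0], (10, 1))
pos, vel = integrate_acceleration([0.0, 0.0], [0.0, 0.0], acc, dt=0.1)
```

For a constant acceleration this update is exact, so the sketch reproduces the closed-form position and velocity at the end of the window.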

In the next section, we create a new model for tackling this setting. In the rest of this section, we give a brief overview of the Kalman filter and smoother, which play an important role in our model. The KF is an optimal state estimator in the MSE sense under a linear dynamic model and Gaussian noise assumptions. Suppose we denote the state as , the control as and observation as , the KF is written as the process model and the measurement model

where the matrices , and are constant and process noise and measurement noise are covariance matrices of zero mean Gaussian white noise with respect to time.

The KF estimates the underlying states of the above equations recursively in two steps. Typically, the prediction and update phases alternate until observations are no longer available. In the prediction step, the current state and the error covariance matrix are estimated by the process model, independently of the current measurement

In the update step, once received the current observations, Kalman gain , the prior estimation and the error covariance matrix is calculated by the measurement model,

After the KF pass, we carry out the recursive steps of the Kalman smoother backward in time as follows

where the new symbol and are the smoothed states and the smoothed error covariance matrix.
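For reference, the textbook form of the linear Kalman filter and of the Rauch-Tung-Striebel smoother, which matches the verbal description above, can be written as follows. The notation is an assumption chosen for this sketch and may differ from the authors' symbols: x_t is the state, u_t the control, z_t the observation, A, B and H are the constant matrices, Q and R the process and measurement noise covariances, K_t the Kalman gain and J_t the smoother gain.

```latex
\begin{align*}
\text{Model:}\quad
  & x_t = A x_{t-1} + B u_t + w_t, \quad w_t \sim \mathcal{N}(0, Q), \\
  & z_t = H x_t + v_t, \quad v_t \sim \mathcal{N}(0, R). \\
\text{Prediction:}\quad
  & \hat{x}_{t|t-1} = A \hat{x}_{t-1|t-1} + B u_t, \\
  & P_{t|t-1} = A P_{t-1|t-1} A^{\top} + Q. \\
\text{Update:}\quad
  & K_t = P_{t|t-1} H^{\top} \left( H P_{t|t-1} H^{\top} + R \right)^{-1}, \\
  & \hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t \left( z_t - H \hat{x}_{t|t-1} \right), \\
  & P_{t|t} = \left( I - K_t H \right) P_{t|t-1}. \\
\text{Smoothing:}\quad
  & J_t = P_{t|t} A^{\top} P_{t+1|t}^{-1}, \\
  & \hat{x}^{s}_{t} = \hat{x}_{t|t} + J_t \left( \hat{x}^{s}_{t+1} - \hat{x}_{t+1|t} \right), \\
  & P^{s}_{t} = P_{t|t} + J_t \left( P^{s}_{t+1} - P_{t+1|t} \right) J_t^{\top}.
\end{align*}
```

The smoothing recursion runs backward from the last filtered estimate, so the final smoothed state equals the final filtered state.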


Model

In this section, we present our model, the Socially Aware Kalman Neural Network (SAKNN). See Figure 1 for a diagram of the detailed architecture.

Architecture of Interaction Layer

SAKNN takes into account the slight influence of the social force between vehicles and simulates such dynamical features. In contrast to the Helbing-Molnar-Farkas-Vicsek social force model [Helbing, Farkas, and Vicsek2000], the social force between vehicles is the intention of drivers not to collide with other vehicles or static obstacles.

According to Newton’s 2nd Law in physics, acceleration is directly proportional to force and thus we decide to pick acceleration to be the output of SAKNN. Thus, the interaction operator is from the perception space to the control space , i.e.

Figure 2: Illustration of the Interaction Layer: Our main architecture of the interaction layer is composed of an encoder part and a decoder part. In the encoder part, we sequentially build CNN layers, which can be regarded as a social tensor extractor, and FCN layers, which can be regarded as a mixer of the social tensor, and then we merge the deep features into the encoder LSTM and the decoder LSTM.

The architecture of the interaction layer refers to Figure 2. Specifically, the input of interaction layer is a history of the recorded velocities , positions , accelerations , widths and lengths of vehicles, and distances between two vehicles and the interaction layer is written as

Inspired by the work of Helbing et al. [Helbing and Molnar1995], we also feed the hand-crafted repulsive interaction force

to our neural network, where the subscript and represent two adjacent vehicle and vehicle .
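As an illustration only, a hand-crafted repulsive feature in the spirit of social force models can be sketched as an exponentially decaying push along the line joining two vehicles. The exponential form and the `strength` and `decay` parameters are assumptions for this sketch; they are not the authors' exact expression, which is not reproduced here.

```python
import numpy as np

def repulsive_force(pos_i, pos_j, strength=1.0, decay=1.0):
    """Repulsion acting on vehicle i due to vehicle j (hypothetical form).

    The magnitude decays exponentially with the distance between the two
    vehicles, and the direction points from j toward i.
    """
    diff = np.asarray(pos_i, dtype=float) - np.asarray(pos_j, dtype=float)
    dist = np.linalg.norm(diff)
    if dist == 0.0:
        # Direction is undefined when the positions coincide.
        return np.zeros_like(diff)
    return strength * np.exp(-dist / decay) * diff / dist

# Vehicle i at the origin; vehicle j one and two units ahead along x.
f_near = repulsive_force([0.0, 0.0], [1.0, 0.0])
f_far = repulsive_force([0.0, 0.0], [2.0, 0.0])
```

A feature of this kind grows as two adjacent vehicles approach each other, which is the qualitative behavior a repulsive interaction term is meant to encode.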

Multi-Agent Kalman filter and smoother

We denote interaction layer as Interaction, process noise neural network as LSTM and measurement noise neural network as LSTM.

The multi-agent KF for N-vehicle motion is illustrated in Figure 3, and we write it recursively as follows,


represents surrounding data; The state vector

and measurement vector consist of positions and velocities respectively written as

The motion matrix and control matrix are determined by motion formula and written as

And, process noise matrix obeys Gaussian and measurement noise matrix also obeys Gaussian .

The prediction step in this N-agent Kalman filter is then written as

And, the update step of Kalman filter is written as

where matrix is the output of Interaction, and are the diagonal matrix of the output of LSTM and LSTM module respectively.

Finally, the recursive step of Kalman smoother is written as

where is smoothing matrix and and are the smoothed state and the smoothed error covariance matrix.
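The recursion described above can be sketched in NumPy under stated assumptions: each vehicle carries the state [x, y, vx, vy], the control is its acceleration [ax, ay], the measurement matrix H is the identity, and the per-step noise covariances Q and R are fixed placeholders standing in for the outputs of the two noise networks. All function names are hypothetical.

```python
import numpy as np

def motion_matrices(n_vehicles, dt):
    """Block-diagonal motion and control matrices for n_vehicles."""
    I2, Z2 = np.eye(2), np.zeros((2, 2))
    A1 = np.block([[I2, dt * I2], [Z2, I2]])       # position += velocity * dt
    B1 = np.vstack([0.5 * dt ** 2 * I2, dt * I2])  # effect of acceleration
    eye_n = np.eye(n_vehicles)
    return np.kron(eye_n, A1), np.kron(eye_n, B1)

def kalman_filter(z, u, A, B, H, Q, R, x0, P0):
    """Forward pass; returns per-step (prior, posterior) estimates."""
    x, P = x0, P0
    priors, posts = [], []
    for z_t, u_t, Q_t, R_t in zip(z, u, Q, R):
        x_pr = A @ x + B @ u_t                      # prediction step
        P_pr = A @ P @ A.T + Q_t
        K = P_pr @ H.T @ np.linalg.inv(H @ P_pr @ H.T + R_t)
        x = x_pr + K @ (z_t - H @ x_pr)             # update step
        P = (np.eye(len(x)) - K @ H) @ P_pr
        priors.append((x_pr, P_pr))
        posts.append((x, P))
    return priors, posts

def rts_smoother(priors, posts, A):
    """Backward pass over the filtered estimates."""
    xs, Ps = posts[-1]
    smoothed = [(xs, Ps)]
    for t in range(len(posts) - 2, -1, -1):
        x_f, P_f = posts[t]
        x_pr, P_pr = priors[t + 1]
        J = P_f @ A.T @ np.linalg.inv(P_pr)         # smoothing matrix
        xs = x_f + J @ (xs - x_pr)
        Ps = P_f + J @ (Ps - P_pr) @ J.T
        smoothed.append((xs, Ps))
    return smoothed[::-1]

# Synthetic usage: two vehicles, 30 steps at 10 Hz, constant acceleration.
rng = np.random.default_rng(0)
N, T, dt = 2, 30, 0.1
A, B = motion_matrices(N, dt)
H = np.eye(4 * N)
u = np.tile([1.0, 0.0] * N, (T, 1))
Q = [1e-3 * np.eye(4 * N)] * T   # placeholder for learned process noise
R = [1e-1 * np.eye(4 * N)] * T   # placeholder for learned measurement noise
x_true, z = np.zeros(4 * N), []
for t in range(T):
    x_true = A @ x_true + B @ u[t]
    z.append(H @ x_true + rng.normal(scale=np.sqrt(1e-1), size=4 * N))
priors, posts = kalman_filter(z, u, A, B, H, Q, R, np.zeros(4 * N), np.eye(4 * N))
smoothed = rts_smoother(priors, posts, A)
```

Stacking the per-vehicle blocks with a Kronecker product keeps the vehicles dynamically independent inside the filter, so in this sketch any coupling between vehicles would enter only through the control input and the noise covariances.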

Figure 3: Illustration of the Kalman Layer: We execute Kalman filter and smoother in the Kalman layer in which we predict priori and update posteriori forwardly, and smooth them backwardly. Note that the acceleration from prior to posterior in this figure is the predicted acceleration from the interaction layer. In the forward process, which is represented by the black arrow, the current underlying is updated by current observation and previous priori . In the backward process, which is represented by the red arrow, the smoothed state is obtained by the current state , the next priori and the smoothed state .

Architecture for Noise Model

LSTM [Hochreiter and Schmidhuber1997] and GRU [Cho et al.2014] are currently the most popular deep neural network architectures for modeling sequential data. These two gated models outperform other models in recognition accuracy in quite a few fields, e.g. natural language text compression, unsegmented connected handwriting recognition, and speech recognition [Graves et al.2009, Graves, Mohamed, and Hinton2013].

In SAKNN, we train two gated-structure neural networks to learn the time-varying sensor noise in filter layer, i.e.

Process model noise covariance: LSTM;

Measurement model noise covariance: LSTM.

Loss Function

The loss in SAKNN serves three goals: to smooth the recorded predicted states in the KF, to fit the predicted acceleration of the network, and to constrain the future predicted states in the KF.

  • Smoothing of predicted states: At each time step of the KF, the filter layer outputs a predicted state . We collect the predicted states and smooth them by the Kalman smoothing backward step. We enlarge the smoothed trajectory and raw sensor data in order to bend the weights of the neural network. The smoothing loss is defined as,

    where is the Kalman smoothing operator.

  • Fitting of accelerations: At each time step of the KF, the interaction layer will produce current and future acceleration. We fit the sign of predicted acceleration and its corresponding sign of the ground truth in L2 norm. In practice, the ground truth of acceleration is directly from raw sensor data .

    where we take an expectation of the fitting loss to normalize the effects of a big number of Kalman time-step results.

  • The Penalty of prediction: After an integration of acceleration with respect to the kinematic motion formula (1), we give a penalty in L2 norm to the predicted trajectory in order to constrain the final displacement. representing the kinematic motion formula starts from observation state and iterates with the predicted acceleration . In practice, the ground truth of positions is directly from raw sensor data .

    where we take an expectation of the penalty loss to normalize the effects of a big number of Kalman time-step results.

The total loss is composed of the above three losses,

where the three coefficients weighting the smoothing, fitting and penalty losses are hyperparameters.
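The weighted combination of the three terms can be sketched as below. Each term is written here as a mean squared L2 distance, which is an assumption for illustration; the default weights 0.2/0.7/0.1 are one of the smoothing/fitting/penalty settings listed in Table 2, and the function and argument names are hypothetical.

```python
import numpy as np

def total_loss(smoothed, observed, acc_pred, acc_true, pos_pred, pos_true,
               weights=(0.2, 0.7, 0.1)):
    """Weighted sum of smoothing, fitting and penalty terms.

    Every array has shape (T, d); each term averages the squared L2
    distance over the T time steps (assumed form, for illustration).
    """
    w_smooth, w_fit, w_pen = weights
    smoothing = np.mean(np.sum((smoothed - observed) ** 2, axis=-1))
    fitting = np.mean(np.sum((acc_pred - acc_true) ** 2, axis=-1))
    penalty = np.mean(np.sum((pos_pred - pos_true) ** 2, axis=-1))
    return w_smooth * smoothing + w_fit * fitting + w_pen * penalty

# Toy check: only the acceleration term is non-zero.
zeros, ones = np.zeros((3, 2)), np.ones((3, 2))
loss = total_loss(zeros, zeros, ones, zeros, zeros, zeros)
```

Averaging over time steps keeps the three terms on comparable scales regardless of how many Kalman steps are unrolled, which is the role the expectation plays in the description above.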



We evaluate our approach on the public datasets US Highway 101 (US-101) [Colyar and Halkias2007] and Interstate 80 (I-80) [Lu and Skabardonis2007] of the NGSIM program. The NGSIM program collected detailed vehicle trajectory data on southbound US-101 and eastbound I-80, using a software application called NG-VIDEO to transcribe the vehicle trajectory data from video. The statistics of the raw US-101 and I-80 data are shown in Table 1.

Dataset   Study area   Lanes   Time span   Sampling rate
US-101    2,100 feet   6       315 min     10 Hz
I-80      1,640 feet   6       315 min     10 Hz
Table 1: Dataset Statistics

The training and testing sets are composed of 100,000 frames of raw data extracted from the NGSIM data set: 70% of the frames are used for training and the rest for testing. In particular, we use the raw data for training but filter it for the evaluation. We extract raw data from NGSIM as follows. We align the raw data by timestamp and group vehicles into one host vehicle and five surrounding vehicles on the adjacent traffic lanes. We then set the window size of the experiments to 12 seconds, in which the first 5 seconds of the trajectory serve as the track history and the remaining 7 seconds as the prediction horizon.
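At the 10 Hz sampling rate of Table 1, a 12-second window corresponds to 120 frames, of which 50 form the track history and 70 the prediction horizon. A minimal sketch of this split follows; the function name `split_window` and the `(frames, features)` array layout are assumptions for illustration.

```python
import numpy as np

SAMPLING_HZ = 10   # sampling rate from Table 1
WINDOW_S = 12      # total window length in seconds
HISTORY_S = 5      # track history length in seconds

def split_window(track, start=0):
    """Cut one 12 s window from a track and split it into history/future."""
    n_window = WINDOW_S * SAMPLING_HZ
    n_history = HISTORY_S * SAMPLING_HZ
    window = track[start:start + n_window]
    if len(window) < n_window:
        raise ValueError("track too short for a full window")
    return window[:n_history], window[n_history:]

# Hypothetical track of 300 frames with two features per frame.
track = np.arange(300 * 2, dtype=float).reshape(300, 2)
history, future = split_window(track, start=20)
```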

Figure 4: Illustration of the Performance of SAKNN: The data sets labeled General include all the traffic scenes, and the data sets labeled LaneChanging include only the lane-changing traffic scenes.

Evaluation Metrics

In this subsection, we propose multiple metrics to verify the effectiveness, the robustness and the smoothness of SAKNN. In particular, we filter the raw NGSIM data to obtain the ground truth.

  • Effectiveness: The two standard metrics in trajectory prediction are the average displacement error (ADE) and the final displacement error (FDE). ADE is the root mean squared prediction error accumulated from the displacement error between the predicted position and the ground truth at each time step, whereas FDE considers only the displacement error at the final time step. A small ADE or FDE means that the model predicts well.

  • Robustness: We test SAKNN on the general data set and the lane-changing data set, and compute ADE and FDE on each to evaluate the robustness. A robust model yields a small ADE and FDE on both.

  • Smoothness: We use the signal-to-noise ratio (SNR) to quantify the smoothness of our predicted signal. SNR compares the level of the underlying signal with the level of the background noise.
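The three metrics can be sketched as follows. ADE is implemented as a root mean squared displacement, following the description above, and SNR is taken as the ratio of mean signal power to mean noise power in decibels. How a predicted sequence is separated into signal and noise is left to the caller and is an assumption of this sketch.

```python
import numpy as np

def displacement(pred, truth):
    """Euclidean displacement error at each time step."""
    return np.linalg.norm(np.asarray(pred) - np.asarray(truth), axis=-1)

def ade(pred, truth):
    """Root mean squared displacement over the prediction horizon."""
    return float(np.sqrt(np.mean(displacement(pred, truth) ** 2)))

def fde(pred, truth):
    """Displacement error at the final time step."""
    return float(displacement(pred, truth)[-1])

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels from mean signal and noise power."""
    power_signal = np.mean(np.square(signal))
    power_noise = np.mean(np.square(noise))
    return float(10.0 * np.log10(power_signal / power_noise))

# Toy example: every prediction is offset by (3, 4) from the ground truth.
truth = np.zeros((4, 2))
pred = truth + np.array([3.0, 4.0])
```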


We compare the following baselines with SAKNN.

  • Constant Velocity (CV): The basic kinematic motion formula with a constant velocity.

  • Vanilla-LSTM (V-LSTM): Referring to [Park et al.2018], V-LSTM feeds in track history and generates the future trajectory sequence.

  • Acceleration-LSTM (ACC-LSTM): We propose this baseline for comparison with SAKNN. Its LSTM takes the track history and predicts the future acceleration, and the predicted trajectory is obtained by accumulating the predicted acceleration according to the basic kinematic motion formula.

  • Convolutional Social Pooling-LSTM (CS-LSTM): CS-LSTM is a maneuver based model generating a multi-modal predictive distribution [Deo and Trivedi2018].
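The simplest of these baselines, constant velocity, can be sketched in a few lines. Estimating the velocity as a finite difference of the last two history positions is an assumption of this sketch, since recorded velocities could be used instead; the function name is hypothetical.

```python
import numpy as np

def constant_velocity_predict(history, n_future, dt=0.1):
    """Extrapolate the last observed velocity over n_future steps.

    history has shape (T, d) with T >= 2; returns shape (n_future, d).
    """
    history = np.asarray(history, dtype=float)
    velocity = (history[-1] - history[-2]) / dt   # finite-difference estimate
    steps = np.arange(1, n_future + 1)[:, None]
    return history[-1] + steps * velocity * dt

# Two observed positions 0.1 s apart, then three predicted steps.
prediction = constant_velocity_predict([[0.0, 0.0], [1.0, 0.5]], n_future=3)
```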


Figure 4 shows the effectiveness and robustness of SAKNN on both the general and the lane-changing data sets. We train SAKNN in almost 500 independent trials, with over 300 training episodes in each case, until it converges. Since SAKNN predicts the future trajectories of a group of six vehicles together, we average the ADEs and FDEs over all vehicles to compare with the results of the baselines. This average performance over the six vehicles is called Average SAKNN.

Hyperparameters of the baselines are fine-tuned to obtain a good score on the validation sets. Two other state-of-the-art trajectory prediction algorithms, Social Attention LSTM [Vemula, Muelling, and Oh2017] and Social LSTM [Alahi et al.2016], were also considered, but they are designed for pedestrian motion and are thus not effective in complex traffic configurations. In fact, their results differ from ours by an order of magnitude, so we do not include them with SAKNN in our figure.

In Figure 4, the Average SAKNN is slightly better or slightly worse than the ACC-LSTM and CS-LSTM algorithms in short-term prediction (1 to 4 seconds). However, the Average SAKNN is superior to all the others after 4 seconds, and its advantage grows substantially as the horizon extends. We also observe that the Average SAKNN performs well on both the general and the lane-changing data sets. These observations confirm that the Average SAKNN is more effective and more robust than the baselines. In particular, the Average SAKNN reduces the displacement errors of the constant velocity model by 30% to 60% in all situations.

We also pick the best-performing of the six vehicles in SAKNN to be the Best SAKNN. The vehicle designated Best SAKNN varies across data sets, but it outperforms the baselines substantially.

Qualitative Analysis and Discussion

In this section, we will qualitatively analyze the deep relationship between performance and structure of SAKNN.

  • Sensitivity of Hyperparameters in Architecture: SAKNN has tens of hyperparameters, from the interaction layer to the Kalman layer. The experimental results show that the coefficients of the three loss terms in the loss function influence our results. In practice, we take into account the orders of magnitude of the three loss terms and prefer to give more importance to the fitting term.

  • Sensitivity of Hyperparameters in Filtering: In the evaluation step, we smooth the NGSIM trajectories with the Savitzky-Golay filter, whose window size is a hyperparameter. ADE and FDE are indeed sensitive to this hyperparameter in the experiments. However, for any fixed window size used to filter the trajectories, SAKNN is always superior to the baselines in performance.

  • Effects of the Interaction Layer and the Kalman Layer in SAKNN: The interaction layer provides an estimate of the underlying acceleration with low-variance Gaussian white noise. The Kalman layer is effective in slightly smoothing the predicted acceleration and further improves the signal-to-noise ratio in experiments; see Table 2. In fact, the effect of the Kalman layer in SAKNN is not inferior to that of the Kalman filter and smoother in robotics.

    Weights        Acceleration x (dB)   Acceleration y (dB)
    Ground Truth   -1.0753               -1.4390
    0.0/0.9/0.1     5.9033                5.4834
    0.1/0.8/0.1     2.8446                7.4998
    0.2/0.7/0.1    11.0210               15.8976
    0.3/0.6/0.1     7.9355               16.7044
    0.4/0.5/0.1    12.9703               10.0643
    Table 2: Signal-to-Noise Ratio of SAKNN: The Weights column lists the smoothing/fitting/penalty weights. We randomly pick experiment samples from the US-101 data set and compute the average SNR of the samples over the next 20 time steps. We raise the smoothing weight in each experiment and find that the average SNR increases quickly but irregularly. The results demonstrate that the Kalman layer and the smoothing term have an effect in our model.

  • Predictive Ability of Regression Model: A major doubt about our approach is whether it is possible to learn acceleration with any neural network architecture. The experimental results show a lack of patterns in the predicted acceleration even after sufficiently long training. We therefore create a heuristic method to learn the sign of the acceleration. The new method turns the regression problem into a classification problem, which is much more tractable. Furthermore, we divide the range of acceleration into range boxes that cover the whole range of acceleration of all vehicles. Then, we set

    to be the loss of predicted range box number with the box number of the ground truth.
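The range-box idea can be sketched by binning accelerations with fixed edges and scoring predicted box numbers against the ground-truth boxes. The bin edges below are hypothetical, and the use of a softmax cross-entropy as the classification loss is an assumption, since the loss itself is not specified above.

```python
import numpy as np

def acceleration_to_box(acc, edges):
    """Map each acceleration value to the index of its range box."""
    return np.digitize(acc, edges)

def cross_entropy(logits, target_box):
    """Mean softmax cross-entropy between box logits and target boxes."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(target_box)), target_box]))

# Hypothetical edges giving five boxes that cover the whole real line.
edges = np.array([-2.0, -0.5, 0.5, 2.0])
boxes = acceleration_to_box(np.array([-3.0, -1.0, 0.0, 1.0, 3.0]), edges)

# Uniform logits over the five boxes give a loss of log(5).
loss = cross_entropy(np.zeros((5, 5)), boxes)
```

With open-ended first and last boxes, every acceleration value falls into exactly one box, which matches the requirement that the boxes cover the whole range of acceleration of all vehicles.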

Conclusions and Future Work

In this work, we propose SAKNN, a new model for trajectory prediction that takes into account two intractable challenges in robotics and autonomous driving: socially aware interactive effects and dynamic uncertainties. We embed the interaction layer and the Kalman layer into the architecture of SAKNN to address these two challenges respectively. In an extensive set of experiments, SAKNN outperforms the baseline models in effectiveness, robustness and smoothness, and achieves state-of-the-art performance on the US-101 and I-80 data sets. Future work will extend SAKNN to a probabilistic formulation and combine SAKNN with a maneuver-based model in which road topology and more traffic information are taken into account as priors.


We would like to thank the Baidu Apollo community. In particular, we thank the L3 prediction lead Yizhi Xu for his support and for helpful suggestions and discussions on motion prediction techniques.


Supplementary Materials