Predicting the future motion of agents from video inputs in indoor and outdoor environments is a fundamental building block in various applications such as autonomous navigation in robotics, autonomous driving and driving assistance technologies, and video surveillance. Although substantial progress has been made in trajectory forecasting, existing approaches have not fully addressed the following important challenges: (i) interactions between agents are modeled without proper consideration of how the agents influence one another in both the spatial and temporal domains, (ii) surrounding environmental constraints are to a large extent ignored, and (iii) perhaps most importantly, the computational complexity of existing methods increases linearly with the number of agents, which places strict limits on their practical use in real-time and safety-critical applications such as autonomous navigation for robotics or driving scenarios.
In this work, we provide a robust and computationally efficient solution to address the aforementioned challenges. Our method primarily aims for single-shot prediction of the trajectories of all agents in the scene. This is undoubtedly important for applications involving intelligent mobility (e.g., automated guide/service robots, self-driving vehicles, and driving assistance technologies), where trajectory prediction is safety-critical and must meet real-time requirements using minimal compute resources. To achieve single-shot prediction, the proposed method makes use of two types of composite fields: (i) a localization field that predicts the future locations of all agents at each time step and (ii) an association field that associates the predicted locations between successive frames. By associating composite fields from the last observation, the identity of each agent directly propagates through future time steps. To the best of our knowledge, this work is the first to predict all agents’ future trajectories in a single-shot manner, running in constant time with respect to the number of agents.
Our framework also models interactions with the surrounding environment, including interactions with other agents in the scene as well as physical constraints such as obstacles or structures. It should be noted that the convolutional and recurrent operations in many existing approaches capture local interactions, either in space [fukushima1982neocognitron, lecun1989backpropagation] or in time [rumelhart1986learning, hochreiter1997long], but not both. Thus, the interaction modules of existing trajectory prediction approaches [alahi2016social, gupta2018social, cui2019multimodal] have inherent limitations in capturing both spatial and temporal relationships. However, recent work on the non-local block [wang2018non] has shown promising results in modeling such spatio-temporal relationships (i.e., in space-time). The non-local block computes the features at a location as a linear weighted combination of features at all locations in space and in time. Inspired by their non-local operator functions, we model social interactions of entities in the space-time domain and further extend the notion of non-locality to capture environmental interactions by constraining the operation with the semantic context of the scene. In this way, our proposed block is able to capture both inter-agent and environmental interactions.
The main contributions of this work are as follows:
The use of composite fields enables us to predict the future trajectory of all road agents with one-shot forward pass computation. Thus, the network run time is constant, O(1), with respect to any number of agents in the scene.
The interactive behavior of entities is jointly captured both spatially and temporally, which models the social interaction between agents in the scene.
The proposed non-local module integrates visual semantics of the physical environment to take into account spatio-temporal interactions with the physical environment.
II Related Work
In this section, we review deep learning-based literature on trajectory prediction, non-local interaction, and convolution LSTMs.
Trajectory Prediction. Early methods focused on modeling social interactions between humans. The social pooling layer was introduced in [alahi2016social] and extended in [gupta2018social]. Both methods use a recurrent operation for individual pedestrians, which aims to capture their temporal relationships. Although they motivated subsequent research such as [lee2017desire, sadeghian2019sophie, sadeghian2018car, yao2019egocentric, malla2019nemo, malla2020titan] to process neighboring agents, spatial relationships are not explicitly modeled. Another research thread [yagi2018future, nikhil2018convolutional, su2019potential] uses the convolution operation to encode the past motion of pedestrians as well as their interactions. However, the fact that these methods do not consider temporal changes makes them susceptible to errors when capturing long-range interactions. The recent work [choi2019looking, choi2020shared] considers both spatial and temporal interactions of agents. However, their successive 2D-3D convolutional operations are computationally inefficient.
Convolutional LSTM. Traditional LSTM has been extended to Convolutional LSTM (Conv-LSTM) for the direct use of images as input for precipitation nowcasting [xingjian2015convolutional]. Subsequently, [finn2016unsupervised] proposed an architecture consisting of several stacks of Conv-LSTM for learning physical interaction through video prediction. The success of such works inspired us to encode past motion of all road agents using images through Conv-LSTM layers. Unlike existing methods that encode a single motion history, we encode the features using a set of binary maps. Each binary map indicates the locations of all road agents at a certain time step.
Single-shot computation. Estimation of multiple locations in a single-shot manner has recently been studied. In pose estimation, the joint locations of the human body and their connections are estimated using a single-shot framework [papandreou2018personlab, kreiss2019pifpaf]. In parallel, an encoder-decoder architecture was used in [ji2018dynamic] to synthesize future motion sequences by producing a set of motion flow fields. Motivated by these works, we propose a framework to predict the future trajectories of all road agents in a single shot. The use of composite fields (i.e., localization and association fields) enables us to locate the future positions of all agents at each time step and to find their associations between successive frames.
Non-local interaction. In contrast to existing methods in trajectory prediction, we model interactions between agents as a non-local response in space-time domains. Non-local algorithms were introduced in [buades2005non] for image de-noising. This idea was recently adapted to video action recognition in [wang2018non] for neural networks. [hussein2019timeception] subsequently used multi-scale temporal convolutions for long-range temporal modeling in videos. For future trajectory prediction applications, we adapt the idea in [wang2018non] that was originally proposed for action recognition. Our framework captures not only the spatio-temporal interactions between agents, but also the interaction with the environment using a novel interaction module, exhibiting improved performance in our experiments.
We observe the trajectories of all agents for $t_{obs}$ time steps. The position of pedestrian $i$ at time $t$ is denoted by $\mathbf{x}_t^i$. We predict the trajectories of all agents for $t_{pred}$ time steps in the form of two composite fields $L$ and $A$, representing localization and association fields, respectively.
III-A Network Architecture
Our trajectory prediction network consists of three sub-modules: a past motion encoder, an interaction module, and a future motion decoder, as shown in Fig. 2. The past motion encoder and future motion decoder consist of a set of convolutional and Conv-LSTM layers. We generate an image-like tensor from the positions of all pedestrians at each past time step, which is used as input to the past motion encoder. The output encoding of the past motion encoder and the semantic segmentation features are provided as input to the interaction module, where the interactions between the agents (social interaction) and with the environment (environmental interaction) are discovered. The encoded interaction features are concatenated with the localization field computed at the previous time step by the future motion decoder (as shown in Fig. 2). Note that at the first future time step, we use the image-like tensor processed from the last observation in place of this field. We send the concatenation to the future motion decoder and produce composite fields that are used to decode the locations of all agents at each future time step.
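The data flow above can be sketched as a minimal PyTorch module. All layer types and widths below are placeholders (plain convolutions stand in for the Conv-LSTM stacks, and the channel counts are assumptions, not the paper's); the point being illustrated is the autoregressive feedback of the previous localization field into the decoder.

```python
import torch
import torch.nn as nn

class TrajectoryForecaster(nn.Module):
    """Schematic encoder / interaction / decoder pipeline.

    Module internals are placeholders meant only to show the data flow
    and the feedback of the previous localization field into the decoder.
    """
    def __init__(self, t_pred=12, hidden=32):
        super().__init__()
        self.t_pred = t_pred
        self.encoder = nn.Conv2d(8, hidden, 3, padding=1)          # past binary maps
        self.interaction = nn.Conv2d(hidden + 5, hidden, 3, padding=1)
        self.decoder = nn.Conv2d(hidden + 1, 3 + 5, 3, padding=1)  # loc + assoc fields

    def forward(self, past_maps, semantics):
        # past_maps: (B, 8, H, W), one binary map per observed step
        # semantics: (B, 5, H, W), one-hot segmentation of the scene
        feat = self.interaction(torch.cat([self.encoder(past_maps), semantics], 1))
        loc = past_maps[:, -1:]                 # last observation seeds the loop
        fields = []
        for _ in range(self.t_pred):            # one decoder pass per future step
            out = self.decoder(torch.cat([feat, loc], 1))
            loc = out[:, 2:3]                   # confidence channel is fed back
            fields.append(out)
        return torch.stack(fields, 1)           # (B, t_pred, 8, H, W)
```

Even in this toy form, one forward call produces the fields for all agents at all future steps, which is what makes the run time independent of the number of agents.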
III-B Composite Fields
The output of the network consists of two types of fields: localization and association fields. The localization field is used to find the locations of all agents at each future time step. In order to identify each agent’s trajectory, the locations at different time steps must be associated across time. We use the association field to associate an agent’s location at time $t$ with its past location at time $t-1$. All fields generated by the network share the same spatial dimensions.
III-B1 Localization field
At each spatial location $(i, j)$ in the field, the network predicts 3 parameters $(\Delta x, \Delta y, c)$, representing a directional offset $(\Delta x, \Delta y)$ and its confidence $c$. If $(i, j)$ is within a certain threshold Manhattan distance $d$ of an agent's location, the triplet represents a prediction of the position of this agent. If a point on the field is not in the vicinity of any agent, then $c = 0$. Thus, each agent’s location is predicted by multiple points in its vicinity defined by $d$. Fig 4 shows an example of a localization field generated by our network. The final location is an ensemble of these predictions. To predict the locations, we create a map $H$ which accumulates all the predictions. For each spatial location $(i, j)$ in the field, we add a Gaussian contribution to $H$, such that

$$H(\mathbf{p}) \mathrel{+}= c \,\mathcal{N}(\mathbf{p};\, \mu, \Sigma), \qquad \mu = (i + \Delta x,\; j + \Delta y), \tag{1}$$

where the mean $\mu$ is shifted from $(i, j)$ using the directional offset predicted by the localization field, and $\Sigma$ represents a constant covariance matrix.
Although the future motion decoder outputs localization fields at a reduced spatial resolution for computational efficiency, they are up-scaled to the input resolution when finding the locations of agents in Eqn. 1. The peaks detected on $H$ are the predicted future locations of all agents. We use thresholding followed by 2D non-maximum suppression to find peaks in $H$.
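Decoding the localization field can be sketched as follows, assuming a `(3, H, W)` field layout `(dx, dy, conf)` and illustrative threshold and covariance values (not the paper's); the Gaussian accumulation mirrors Eqn. 1 and the peak finding uses a simple 3x3 non-maximum suppression.

```python
import numpy as np

def decode_localization(field, threshold=0.3, sigma=1.0):
    """Decode a localization field into agent positions.

    field: (3, H, W) array holding (dx, dy, confidence) per cell.
    Each confident cell votes with a Gaussian centred at its
    offset-shifted position; peaks of the vote map H are the agents.
    Threshold and sigma are illustrative defaults.
    """
    _, h, w = field.shape
    dx, dy, conf = field
    H = np.zeros((h, w))
    ys, xs = np.mgrid[0:h, 0:w]
    for y, x in zip(*np.nonzero(conf > threshold)):
        mu_x, mu_y = x + dx[y, x], y + dy[y, x]
        H += conf[y, x] * np.exp(
            -((xs - mu_x) ** 2 + (ys - mu_y) ** 2) / (2 * sigma ** 2))
    # 2D non-maximum suppression over a 3x3 neighbourhood
    pad = np.pad(H, 1, constant_values=-np.inf)
    neigh = np.stack([pad[i:i + h, j:j + w] for i in range(3) for j in range(3)])
    peaks = (H == neigh.max(axis=0)) & (H > threshold)
    return list(zip(*np.nonzero(peaks)))  # [(row, col), ...]
```

Several nearby cells voting for the same agent reinforce a single peak, which is the ensemble effect described above.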
III-B2 Association fields
At each spatial location $(i, j)$ of the association field, the network predicts 5 parameters $(\mathbf{v}_1, \mathbf{v}_2, c)$, where $\mathbf{v}_1$ and $\mathbf{v}_2$ are 2D directional vectors and $c$ is a confidence. If $(i, j)$ is within a certain threshold Manhattan distance of an agent’s location, the directional vector $\mathbf{v}_1$ originating at $(i, j)$ points to the location of the agent at time $t-1$, and the directional vector $\mathbf{v}_2$ originating at $(i, j)$ points to the location of the agent at time $t$. As with the localization fields, if a point on the field is not in the vicinity of any agent, then all the parameters equal $0$. During testing, suppose we need to associate the location of a pedestrian W at time $t$ (from a set of candidate locations already found using the localization maps), given that we know W’s location at time $t-1$. We first threshold the association fields on the confidence $c$. Among the remaining points, we find the directional vector $\mathbf{v}_1$ that gives the best prediction of W’s position at $t-1$. The corresponding $\mathbf{v}_2$ then points towards W’s estimated position at $t$, and from the list of candidate points we select the one nearest to this estimate.
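A minimal decoding routine for this association step might look as follows, assuming a `(5, H, W)` field layout `(v1x, v1y, v2x, v2y, conf)`; the threshold and coordinate conventions are assumptions for illustration.

```python
import numpy as np

def associate(assoc_field, prev_loc, candidates, conf_thresh=0.5):
    """Link a known location at t-1 to one candidate location at t.

    assoc_field: (5, H, W) array of (v1x, v1y, v2x, v2y, conf), where
    v1 points to the agent's position at t-1 and v2 to its position at t.
    prev_loc: (x, y) of the agent at t-1; candidates: [(x, y), ...] at t.
    """
    v1x, v1y, v2x, v2y, conf = assoc_field
    ys, xs = np.nonzero(conf > conf_thresh)
    if len(ys) == 0:
        return None
    # cell whose v1 best reconstructs the known previous location
    pred_prev = np.stack([xs + v1x[ys, xs], ys + v1y[ys, xs]], axis=1)
    err = np.linalg.norm(pred_prev - np.asarray(prev_loc, float), axis=1)
    k = int(err.argmin())
    # its v2 gives an estimate of the current location
    est = np.array([xs[k] + v2x[ys[k], xs[k]], ys[k] + v2y[ys[k], xs[k]]])
    dists = np.linalg.norm(np.asarray(candidates, float) - est, axis=1)
    return candidates[int(dists.argmin())]
```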
III-C Semantic Context
The future trajectory of road agents depends on their interactions with the environment. For example, agents will change their trajectory to avoid collisions not just with other agents but also with obstacles in the scene. Some types of agents may also be more likely to traverse certain areas of the scene than others; pedestrians are more likely to travel on the sidewalk as opposed to an area covered by grass. Such contextual information should be considered for more accurate and natural motion prediction. Therefore, we model the environmental interactions of road agents using semantic features. To extract such features, we annotate a segmentation map for each scene in the ETH, UCY, and SDD datasets. We define five classes for annotation: walkable area, area covered by grass / bushes / trees, drivable areas, non-drivable areas, and sidewalks. We use a small network with 4 convolutional layers to extract contextual features from this annotated map during training. These extracted features are given as input to the interaction module to encode the agents’ interaction with the environment.
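As a sketch, the semantic branch could one-hot encode the 5-class annotation map and pass it through four convolutional layers; the channel widths and output size below are assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

def semantic_features(class_map, net=None):
    """Extract contextual features from a per-pixel class annotation.

    class_map: (H, W) long tensor of class ids in [0, 5) for the five
    annotated classes (walkable, vegetation, drivable, non-drivable,
    sidewalk). Layer widths are illustrative.
    """
    onehot = nn.functional.one_hot(class_map, 5).permute(2, 0, 1).float()
    net = net or nn.Sequential(
        nn.Conv2d(5, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 8, 3, padding=1),
    )
    return net(onehot.unsqueeze(0))  # (1, 8, H, W)

feats = semantic_features(torch.zeros(32, 32, dtype=torch.long))
```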
III-D Interaction Module
We note the local nature of the convolutional and recurrent operations: they can only capture interactions occurring in a local neighborhood, either in space or in time.
The recently introduced non-local block [wang2018non] overcomes such locality issues. Since the output of the non-local block at a certain location is a linear weighted combination of the features at all other locations, it is well suited to capture interactions in the space-time domain. We redesign the non-local block, retaining its original interaction modeling capability (social interaction between agents) while additionally capturing environmental interactions (between agents and the environment) across different spatial and temporal locations.
Fig. 3 shows the proposed non-local block. The semantic context of the scene guides the interaction module to consider the environmental constraints on the agents’ potential motion. Section V shows the effectiveness of our modification compared to the vanilla non-local block. The modified block is able to capture both inter-agent interactions and environmental interactions. The inputs to the non-local block are $X$ and $S$, where $X$ is obtained from the encoder and $S$ denotes the semantic features extracted from the segmentation maps. We take the hidden states of the last Conv-LSTM of the past motion encoder at each time step and concatenate them to get $X$ of size $B \times C \times t_{obs}hw$, where $h \times w$ is the size of the feature at each time step, $C$ is the number of channels of the feature, and $B$ is the batch size. The interaction module has 3 branches:

$$Z = \mathcal{S}(X) + \mathcal{E}(X, S) + X. \tag{2}$$
The first two terms are respectively the social interaction and the environmental interaction, which correspond to the left and right branches in Fig. 3. The last term represents the residual connection. The social interaction branch models the non-local module as self-attention following [wang2018non]. For an input feature $x$, this is defined as

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), \tag{3}$$
Table I. ADE / FDE on the ETH / UCY splits (state of the art vs. ours).

| Method | ETH | HOTEL | UNIV | ZARA1 | ZARA2 | AVG |
|---|---|---|---|---|---|---|
| Linear | 0.143 / 0.298 | 0.137 / 0.261 | 0.099 / 0.197 | 0.141 / 0.264 | 0.144 / 0.268 | 0.133 / 0.257 |
| S-LSTM [alahi2016social] | 0.195 / 0.366 | 0.076 / 0.125 | 0.196 / 0.235 | 0.079 / 0.109 | 0.072 / 0.120 | 0.124 / 0.169 |
| SS-LSTM [xue2018ss] | 0.095 / 0.235 | 0.070 / 0.123 | 0.081 / 0.131 | 0.050 / 0.084 | 0.054 / 0.091 | 0.070 / 0.133 |
| SGAN-P [gupta2018social] | 0.091 / 0.178 | 0.052 / 0.094 | 0.112 / 0.215 | 0.064 / 0.130 | 0.059 / 0.115 | 0.075 / 0.146 |
| Gated-RN [choi2019looking] | 0.052 / 0.100 | 0.018 / 0.033 | 0.064 / 0.127 | 0.044 / 0.086 | 0.030 / 0.059 | 0.044 / 0.086 |
| ED | 0.051 / 0.085 | 0.024 / 0.039 | 0.079 / 0.156 | 0.058 / 0.121 | 0.056 / 0.113 | 0.054 / 0.103 |
| ED+F | 0.043 / 0.077 | 0.018 / 0.029 | 0.069 / 0.141 | 0.043 / 0.090 | 0.049 / 0.100 | 0.045 / 0.088 |
| ED+F (+ concat.) | 0.042 / 0.072 | 0.019 / 0.034 | 0.064 / 0.128 | 0.040 / 0.081 | 0.050 / 0.102 | 0.043 / 0.083 |
| ED+F (+ 3D conv.) | 0.040 / 0.075 | 0.020 / 0.038 | 0.063 / 0.131 | 0.040 / 0.084 | 0.045 / 0.095 | 0.042 / 0.084 |
| ED+F (+ non-local [wang2018non]) | 0.038 / 0.067 | 0.016 / 0.028 | 0.060 / 0.122 | 0.040 / 0.080 | 0.048 / 0.097 | 0.040 / 0.078 |
| ED+F (+ proposed interaction module) | 0.036 / 0.064 | 0.018 / 0.031 | 0.059 / 0.120 | 0.038 / 0.078 | 0.046 / 0.094 | 0.039 / 0.077 |
where $i$ is the index of an output position (in space-time) and $j$ is the index that enumerates all possible locations. $f$ is a pair-wise function that computes the relationship between $x_i$ and $x_j$, $g$ computes an embedding of the input used to weight the self-attention, and $C(x)$ is the normalizing factor. For our case, we use a concatenation version of $f$:

$$f(x_i, x_j) = \mathrm{ReLU}\big(\mathbf{w}_f^{\top} [\theta(x_i), \phi(x_j)]\big), \tag{4}$$
where $\theta$, $\phi$, and $g$ are convolutional layers as shown in Fig. 3 and $[\cdot, \cdot]$ represents the concatenation operation. The matrix multiplication in Fig. 3 is responsible for the product of $f$ and $g$ and the successive summation over $j$ in Eqn. 3.
The environmental interaction term is obtained by element-wise multiplication of the input feature $X$ with an attention heat-map of matching shape, produced using the segmentation features. The element-wise multiplication allows the network to suppress regions of the input feature maps that are not useful for trajectory prediction (e.g., non-walkable areas).
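A compact PyTorch sketch of such a block is given below, combining a concatenation-style non-local branch (Eqns. 3-4) with a sigmoid attention gate driven by the semantic features and a residual connection. Layer widths, the gating form, and the normalization constant are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SemanticNonLocal(nn.Module):
    """Sketch of a semantically gated non-local block.

    x: (B, C, N) space-time features with N = T*H*W flattened positions.
    s: (B, C, N) semantic features aligned with x.
    """
    def __init__(self, c):
        super().__init__()
        self.theta = nn.Conv1d(c, c // 2, 1)
        self.phi = nn.Conv1d(c, c // 2, 1)
        self.g = nn.Conv1d(c, c // 2, 1)
        self.wf = nn.Conv1d(c, 1, 1)     # scores each concatenated pair
        self.out = nn.Conv1d(c // 2, c, 1)
        self.gate = nn.Conv1d(c, c, 1)   # semantic attention map

    def forward(self, x, s):
        b, c, n = x.shape
        th, ph, g = self.theta(x), self.phi(x), self.g(x)
        # f(x_i, x_j) = ReLU(w_f^T [theta(x_i), phi(x_j)]) for all pairs
        pair = torch.cat([th.unsqueeze(3).expand(b, c // 2, n, n),
                          ph.unsqueeze(2).expand(b, c // 2, n, n)], dim=1)
        f = torch.relu(self.wf(pair.reshape(b, c, n * n)).reshape(b, n, n))
        y = torch.bmm(f, g.transpose(1, 2)) / n      # sum_j f * g, normalised by N
        social = self.out(y.transpose(1, 2))
        env = x * torch.sigmoid(self.gate(s))        # suppress unusable regions
        return social + env + x                      # residual connection
```

Note the quadratic memory cost in the number of positions N, which is why the fields are kept at a reduced resolution inside the network.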
Table II. ADE / FDE on the SDD dataset over different prediction horizons ("-": ADE not reported).

| Pred. time | State of the art |  |  |  | Ours |
|---|---|---|---|---|---|
| 1.0 sec | - / 2.58 | 1.93 / 3.38 | - / 2.00 | 1.71 / 2.23 | 1.57 / 2.06 |
| 2.0 sec | - / 5.37 | 3.24 / 5.33 | - / 4.41 | 2.57 / 3.95 | 2.48 / 3.91 |
| 3.0 sec | - / 8.74 | 4.89 / 9.58 | - / 7.18 | 3.52 / 6.13 | 3.47 / 6.08 |
| 4.0 sec | - / 12.54 | 6.97 / 14.57 | - / 10.23 | 4.60 / 8.79 | 4.53 / 8.43 |
We use three publicly available benchmark datasets (ETH, UCY, and SDD) for our experiments. The ETH [pellegrini2009you] and UCY [lerner2007crowds] datasets contain top-view videos of pedestrians walking in public locations. Combined, these datasets contain 5 scenes (eth, hotel, ucy, zara1, zara2) captured using a stationary camera. The SDD dataset [robicquet2016learning] contains 60 unique top-view videos taken using a drone at a university campus. The dataset consists of pedestrians, cyclists, and cars.
These datasets contain naturalistic interactions between road agents and between agents and the environment. Both datasets have annotated labels of the agents’ locations in world coordinates. In our experiments, we convert all agents’ locations to pixel locations in image space. For the ETH / UCY datasets, we use a leave-one-out cross-validation policy, where we train on four splits and test on the remaining split. For SDD, we follow the same settings used by [lee2017desire].
IV-B Implementation details
We used Nvidia V100 and P100 GPUs for all experiments. Our network is implemented in PyTorch and trained from scratch using He Normal initialization [he2015delving]. For optimization, we use Adam [kingma2014adam] with mini-batches and a step-decayed learning rate. We train our models for 100 epochs. Each Conv-LSTM layer is followed by a normalization layer in order to facilitate convergence of the network (these layers are not shown in Fig. 2). We employ data augmentation to reduce overfitting, whereby training videos are randomly flipped horizontally / vertically and rotated in multiples of 90° about the center. Our network uses a mean squared error (MSE) loss to learn the composite fields.
For the ETH / UCY datasets, our method observes the trajectory of agents for 8 time steps (corresponding to 3.2 sec) and predicts the trajectory for the next 12 time steps (corresponding to 4.8 sec). For the SDD dataset, we observe 3.2 sec and predict 4.0 sec to keep experiments comparable with [lee2017desire]. During training, we prepare ground truth localization and association fields, which our network learns by regression using the MSE loss. The confidence parameter of both the localization and association fields is set to 1 in the ground truth.
The input to the past motion encoder at time $t$ is a binary map $I_t$. If $(x, y)$ is the location of an agent at time step $t$, then all points in $I_t$ within a Manhattan distance $d$ from $(x, y)$ have value 1. All points in $I_t$ which do not satisfy this condition have value 0.
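The binary input map described above can be constructed directly; the Manhattan radius `d` below is an assumed hyper-parameter value.

```python
import numpy as np

def binary_input_map(agent_locs, h, w, d=2):
    """Rasterise agent positions at one time step into the binary map
    fed to the past motion encoder: cells within Manhattan distance d
    of any agent are 1, all others 0.

    agent_locs: iterable of (row, col) agent positions.
    """
    m = np.zeros((h, w), dtype=np.uint8)
    ys, xs = np.mgrid[0:h, 0:w]
    for ay, ax in agent_locs:
        m[np.abs(ys - ay) + np.abs(xs - ax) <= d] = 1
    return m
```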
IV-C Baselines and Evaluation
We compare our work with the state-of-the-art models in the literature. We use Linear, Social-LSTM [alahi2016social], SS-LSTM [xue2018ss], and Gated-RN [choi2019looking] for comparison on the ETH / UCY datasets. For the SDD dataset, we use Linear, Social-LSTM, Gated-RN, and DESIRE [lee2017desire]. Linear is a linear regressor that estimates the future trajectory by minimizing the mean squared error.
We also show ablative studies of our model to emphasize the importance of each module proposed in our framework. The baseline ED is the vanilla encoder-decoder version without any interaction module. This baseline also does not weight the multiple predictions in the localization fields to generate a single prediction; we simply take the prediction with the highest confidence without using Eqn. 1. In the baseline ED+F, we aggregate the multiple predictions on the localization fields using Eqn. 1. Its variations use the hidden states of the last Conv-LSTM layer of the past motion encoder in different ways: the first concatenates all the hidden states over the past time steps, the second applies a small 3D convolutional network to process these temporal features, and the third uses the original non-local interaction module introduced in [wang2018non]. Finally, our best model with the proposed interaction module advances the non-local module to factor in environmental interactions, as shown in Fig. 3.
Following the standard evaluation metrics, experiments are evaluated using average displacement error (ADE) and final displacement error (FDE). ADE is defined as the mean of L2 distances between the prediction and the ground truth for all time-steps. FDE is defined as the L2 distance between the predicted location at the last time step and the corresponding ground truth.
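Both metrics are straightforward to compute from predicted and ground-truth tracks:

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and final displacement errors for one trajectory.

    pred, gt: (T, 2) arrays of (x, y) positions over T predicted steps.
    ADE is the mean L2 error over all steps; FDE is the L2 error at the
    last step.
    """
    err = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=1)
    return err.mean(), err[-1]
```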
V-A Quantitative results
Table I shows a comparison of our approach against state-of-the-art methods as well as our ablation experiments on the ETH and UCY datasets. As expected, Linear, which is a simple linear regressor, has the worst performance. S-LSTM shows improved performance due to its use of social pooling. By adding high-level image features, SS-LSTM reaches lower error rates than S-LSTM. Gated-RN shows further improvement due to its pair-wise interaction encoding.
Our first baseline model, ED, performs better than SS-LSTM. The next model, ED+F, which incorporates a weighted aggregation of the localization fields, generates a significant performance boost across all splits, demonstrating the efficacy of this ensemble of predictions. Note that this baseline model is already comparable to all state-of-the-art methods. Through the interaction modeling ablations, we notice that concatenating the encoder's hidden states improves over the baseline ED+F, suggesting that the decoder benefits from additional temporal features. The further improvement of the 3D convolutional variant suggests that 3D convolutions are more suitable for modeling interactions than naive concatenation. The non-local variant's use of the non-local block allows the system to capture spatio-temporal features, showing improvement in turn. Using the semantic context extracted from the segmentation features makes the network aware of physical boundaries and constraints, which results in further improvement for all splits except hotel. For hotel, we observe that most of the area in the video frames is walkable, meaning that the segmentation features may not provide much additional context. Given the results in Table I, our best model with the proposed interaction module generally outperforms the state-of-the-art methods on most of the splits, including the total ADE / FDE. We also compare our method with existing works on the SDD dataset. In Table II, our approach outperforms these state-of-the-art methods over all time steps. The evaluation on the ETH, UCY, and SDD datasets demonstrates the efficacy of our single-shot framework as well as the proposed interaction module.
Table III. Average run-time (in sec) / speed-up relative to S-LSTM.

| # agents | S-LSTM [alahi2016social] | S-GAN-P [gupta2018social] | Gated-RN [choi2019looking] | Ours |
|---|---|---|---|---|
| 1 (min) | 2.61 / 1x | 0.15 / 17x | 0.12 / 21x | 0.32 / 8x |
| 4.5 (avg) | 11.69 / 1x | 0.67 / 17x | 0.54 / 21x | 0.32 / 36x |
| 21 (max) | 54.80 / 1x | 3.15 / 17x | 2.52 / 21x | 0.32 / 171x |
V-B Speed and Run Time
The computational time of such systems is one of the critical aspects for their practical use in real-world applications such as autonomous navigation for robotics and driving scenarios. Table III shows the comparison of our method against S-LSTM [alahi2016social], S-GAN-P [gupta2018social], and Gated-RN [choi2019looking].
Although the time complexity of existing methods increases linearly as O(n), where n is the number of agents, the proposed approach always runs in constant time, O(1), due to its structure specifically designed for single-shot prediction. In practice, our model runs 171x faster than S-LSTM and 7x faster than Gated-RN with 21 observed agents, while achieving higher performance as reported in Tables I and II.
V-C Qualitative results
Fig. 6 illustrates the impact of the proposed interaction module against the vanilla non-local block. The examples are collected from the ETH and SDD datasets, highlighting the qualitative improvement due to the use of semantic context for environmental interactions. Subsequently, we show the efficacy of the proposed interaction module for inter-agent interactions. Fig. 7 compares the results of future trajectory prediction with and without the interaction module. The examples, drawn from the UCY dataset, clearly show that the predicted future motions become more natural with the interaction module, which accounts for potential collisions between agents.
In this paper, we considered the problem of future trajectory forecast using composite fields for single-shot prediction of all agents’ future trajectories. We demonstrated the efficacy of the proposed models with experimental evaluations on ETH, UCY, and SDD datasets. The results showed that our novel interaction module improves performance by capturing inter-agent and environmental interactions in both spatial and temporal domains. Importantly, unlike previous methods in trajectory forecast, the proposed network is highly efficient in its computation and runs in constant time with respect to any number of agents in the scene. In the future, we plan to extend this work to egocentric video inputs obtained from a moving platform and incorporate additional scene contexts to capture participant interactions with the environment in various autonomous mobility applications.