Repo for the paper 'Multiple Object Forecasting: Predicting Future Object Locations in Diverse Environments'. WACV 2020
This paper introduces the problem of multiple object forecasting (MOF), in which the goal is to predict future bounding boxes of tracked objects. In contrast to existing works on object trajectory forecasting, which primarily consider the problem from a birds-eye perspective, we formulate the problem from an object-level perspective and call for the prediction of full object bounding boxes, rather than trajectories alone. Towards solving this task, we introduce the Citywalks dataset, which consists of over 200k high-resolution video frames. Citywalks comprises footage recorded in 21 cities from 10 European countries in a variety of weather conditions and over 3.5k unique pedestrian trajectories. For evaluation, we adapt existing trajectory forecasting methods for MOF and confirm cross-dataset generalizability on the MOT-17 dataset without fine-tuning. Finally, we present STED, a novel encoder-decoder architecture for MOF. STED combines visual and temporal features to model both object-motion and ego-motion, and outperforms existing approaches for MOF. Code & dataset link: https://github.com/olly-styles/Multiple-Object-Forecasting
Predicting future events in video is a core problem in computer vision that has been studied in several contexts such as human action prediction, semantic forecasting, and road agent trajectory forecasting. In this work, we focus on the task of pedestrian trajectory forecasting from video data, which has seen considerable research attention over recent years [19, 32, 1, 12, 44, 42]. Humans are a particularly challenging class of objects to predict, as they exhibit highly dynamic motion and may change speed or direction rapidly. However, precise human trajectory prediction is critical in numerous application domains including tracking, robotic navigation, and autonomous driving.
Much of the existing work on pedestrian trajectory forecasting considers the problem from a birds-eye view using footage from a fixed overhead camera, often considering each pedestrian as a single point in space [1, 12, 44]. This setting is effective for modeling crowd motion patterns and interactions with the environment. However, by simplifying each pedestrian as a point in space, salient visual features such as person appearance, body language, and individual characteristics are not considered. Prior research has shown that such features are important for trajectory prediction in settings such as anticipating if a pedestrian will cross the road [38, 30]. Furthermore, overhead perspectives are often not available in practical applications. As a result, trajectory forecasting from an object-level perspective, which better facilitates the modeling of visual features, has been studied in recent years. However, trajectory forecasting in this setting suffers from a lack of large, high-quality datasets and standardized evaluation protocols.
Motivated by the above observations, we introduce a new formalization of the trajectory forecasting task: multiple object forecasting (MOF) (Fig.1). MOF follows the same formulation as the popular multiple object tracking (MOT) task, but is concerned with predicting object bounding boxes and tracks in upcoming video frames, rather than in the current frame. Future bounding box prediction has previously been studied in constrained settings such as on-board a moving vehicle with odometry information [3, 43]. In contrast, MOF follows the unconstrained MOT setting, which utilizes only image information where data from other sensors is not available. This setup poses several challenges, such as variations in object scale, non-linear motions, and ego-motion. Similar to MOT, we focus on the pedestrian object class, although the problem formulation is generalizable to other object classes.
To facilitate research on the MOF problem, we construct the Citywalks dataset. Citywalks is a large and diverse dataset collected from a first-person perspective in 21 European cities with considerable variability in many facets such as weather, object appearance, illumination, object scale, and pedestrian density. Citywalks is annotated using automated methods for detection and tracking and is considerably more diverse than existing datasets [29, 35, 28] for trajectory forecasting. We evaluate existing models adapted for MOF on Citywalks and propose a novel encoder-decoder model. Our model, STED, combines visual features extracted from optical flow with temporal features and outperforms existing models on the MOF task.
To summarize, the contributions of this work are as follows:
Our work is primarily driven by a need for better object forecasting models in unconstrained settings and suitable protocols for evaluating such models under varied conditions. In this section, we summarize the main contributions in the fields of pedestrian trajectory forecasting and MOT. We also provide an overview of existing datasets for both tasks and their limitations.
Methods for MOT typically follow a tracking-by-detection paradigm that relies heavily on the accuracy of single-frame detections and on models that associate detections across time. Given high-quality detections, reasonable MOT performance can be obtained with simple constant velocity motion assumptions, and better still when combined with a visual appearance association metric. Constructing more sophisticated methods capable of modeling non-linear motion can improve tracking performance, particularly in scenarios with occlusion. However, trajectory forecasting for improved tracking is challenging due to small datasets, which results in overfitting. Approaches proposed to overcome this issue include treating the future trajectory as a binary classification problem or using explicit external memory to avoid memorization. We adopt a more straightforward approach to address overfitting: building a larger dataset.
Methods typically focus on interactions between pedestrians and social conventions, such as the pioneering Social Long-Short-Term-Memory (Social-LSTM) model, in addition to scene semantics. These methods do not typically consider visual cues, and many simplify each pedestrian to a point in space. Recently, Liang et al. proposed one of the first approaches for trajectory forecasting using visual features. Their method encodes appearance using a person keypoint detector and jointly models future pedestrian trajectory and activity.
Most related to our paper, a small number of works consider trajectory forecasting from an object-level perspective. Predicting object trajectories from on-board moving vehicles, in particular, has been studied extensively [17, 3, 40]. Methods typically use additional information sources specific to a vehicle setting, such as odometry information. In an inspiring work outside of the vehicle domain, Yagi et al. propose a model which uses past locations, ego-motion, and pedestrian keypoints to estimate future trajectory in first-person videos. Their model outperforms existing state-of-the-art approaches; however, accurate pedestrian keypoint estimation is not always practical, especially in low-resolution or low-lighting scenarios. In contrast, our approach does not rely on pedestrian keypoint estimation.
Many large datasets with annotated pedestrian bounding boxes have been released, such as Citypersons, BDD-100K, and EuroCity Persons. However, these datasets do not contain object tracking annotations. Older datasets such as KITTI and Caltech-USA provide full object tracks, although these datasets are considerably smaller and have more limited geographical variety (1 country only) than the proposed dataset (10 countries).
Several datasets have been created explicitly for pedestrian trajectory forecasting, such as UCY, ETH, and Stanford Drone. These datasets are recorded from a birds-eye view, making them suitable for modeling social and environmental factors. However, such datasets are not well suited to the MOF task, as they are captured from a perspective from which extracting visual features is challenging.
Few public datasets exist for object-level view trajectory forecasting. Most similar to ours, the MOT-17 dataset contains annotated pedestrian bounding boxes from both first-person and overhead cameras. However, MOT-17 contains only 14 video sequences. Our dataset, Citywalks, contains 358 video sequences.
Consider a sequence of video frames. Given a single frame, the task of object detection is to associate each identifiable object in the frame with a set of coordinates $(x, y, w, h)$, which represent the centroid $(x, y)$, width $w$, and height $h$ of the object bounding box. Given the framewise detections for all frames, the task of MOT is to associate each detection with a unique object identifier, one per unique object across all frames, such that each object is tracked across the set of frames.
We extend the MOT task to MOF, shown in Fig.1. Given the observed frames with associated object detections and tracks, we define MOF as the joint problem of predicting the future bounding boxes and associated object tracks in the upcoming video frames for each object present in the most recent observed frame. In this work, we predict 60 future frames, corresponding to 2 seconds into the future at 30Hz.
We adopt the average displacement error (ADE) and final displacement error (FDE) metrics from the trajectory forecasting literature. ADE is defined as the mean Euclidean distance between predicted and ground-truth bounding box centroids for all predicted bounding boxes, and FDE is defined similarly for the bounding box at the final timestep only. We also use the average and final intersection-over-union (AIOU and FIOU) metrics. AIOU is defined as the mean IOU of the predicted and ground-truth bounding boxes for all predicted boxes, and FIOU is the IOU for the box at the final timestep only.
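The four metrics above can be illustrated with a short sketch. This is not the authors' evaluation code: the function names `iou` and `forecasting_metrics` are hypothetical, boxes are assumed to be `(cx, cy, w, h)` tuples in pixels, and the metrics are computed for a single track, whereas the paper averages over all predicted boxes.

```python
import math


def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Overlap is clamped at zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def forecasting_metrics(pred, truth):
    """ADE, FDE, AIOU and FIOU for one track.

    pred and truth are equal-length lists of (cx, cy, w, h) boxes,
    one per predicted timestep.
    """
    dists = [math.hypot(p[0] - g[0], p[1] - g[1]) for p, g in zip(pred, truth)]
    ious = [iou(p, g) for p, g in zip(pred, truth)]
    return {
        "ADE": sum(dists) / len(dists),  # mean centroid distance
        "FDE": dists[-1],                # centroid distance at final timestep
        "AIOU": sum(ious) / len(ious),   # mean IOU over all timesteps
        "FIOU": ious[-1],                # IOU at final timestep
    }
```

Note that the IOU values here lie in the range 0 to 1; the result tables appear to report them as percentages.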
Our newly-constructed Citywalks dataset comprises 358 video sequences containing footage from 21 different cities in 10 European countries.
We extract footage from the online video-sharing site YouTube (videos are obtained from https://www.youtube.com/c/poptravelorg). Each original video consists of first-person footage recorded using an Osmo Pocket camera with gimbal stabilizer held by a pedestrian walking in one of the many environments for between 50 and 100 minutes. Videos are recorded in a variety of weather conditions, as well as both indoor and outdoor scenes. Example frames showcasing the variety of the dataset are shown in Fig.2.
|Clip length||20 seconds|
|Time of day||Day/Night|
|Labelled objects per frame||0 - 17|
|Unique tracks (YOLOv3)||2201|
|Unique tracks (Mask-RCNN)||3623|
One of the fundamental challenges of MOF is the bounding box motion caused by both ego-motion and object motion. Large displacements resulting from significant ego-motion pose a problem and may overwhelm the training process. To mitigate the impact of large ego-motions, we filter the dataset by removing high motion segments. Global motion is estimated by extracting dense optical flow and selecting short video clips from windows with a mean optical flow magnitude below a threshold. Specifically, we downsample video frames for faster computation and extract dense optical flow using FlowNet2-S. We then select 20-second clips from longer videos using segments containing frames that do not exceed a mean optical flow magnitude threshold of 1.5.
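The clip selection step can be sketched as follows, assuming the mean optical flow magnitude of every frame has already been computed. The function name `select_low_motion_clips` is hypothetical, and two details are assumptions rather than facts from the paper: that every frame in a clip must individually stay at or below the threshold, and that clips are chosen greedily without overlap. The default clip length of 600 frames corresponds to 20 seconds at 30Hz.

```python
def select_low_motion_clips(flow_mags, clip_len=600, threshold=1.5):
    """Return (start, end) frame indices of non-overlapping low-motion clips.

    flow_mags holds the mean optical flow magnitude of each frame.
    A clip is a run of clip_len consecutive frames whose magnitudes
    are all at or below threshold; end indices are exclusive.
    """
    clips = []
    run_start = 0  # first frame of the current low-motion run
    for i, mag in enumerate(flow_mags):
        if mag > threshold:
            # High-motion frame: the next run can start no earlier than i + 1.
            run_start = i + 1
        elif i - run_start + 1 == clip_len:
            clips.append((run_start, i + 1))
            run_start = i + 1
    return clips
```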
Once clips are selected, pedestrians are detected using an object detection algorithm and tracked using Deepsort. We provide annotations for two object detectors: YOLOv3 and Mask-RCNN. Both detectors are trained using the MS-COCO dataset and generalize well to Citywalks. For the YOLOv3 annotations, images are downsampled before detection to simulate detection quality under low processing time requirements. We use a higher input resolution for detection using Mask-RCNN to obtain the best detection performance. Note that we leave any attempts to combine the two annotation sets for future work. Following the detection and tracking phase, we discard tracks shorter than 3 seconds, as the previous one second of bounding box data is used to predict the next 2 seconds. Dropping short tracks reduces the number of false positives in the annotation set, as we observe that erroneous detections typically do not last longer than 3 seconds. Each video clip is also manually annotated with the city of recording, time of day, and weather condition. Annotation statistics are shown in Fig.3, and metadata are shown in Table 1.
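As a rough illustration of the track filtering and of how one second of history pairs with two seconds of future, the sketch below assumes tracks are stored as a dictionary from track identifier to a list of per-frame boxes at 30Hz. The names `filter_short_tracks` and `make_samples`, the storage format, and the sliding-window `stride` are all assumptions made for this example.

```python
def filter_short_tracks(tracks, fps=30, min_seconds=3.0):
    """Keep only tracks lasting at least min_seconds.

    tracks maps a track id to its list of per-frame bounding boxes.
    """
    min_frames = int(round(fps * min_seconds))
    return {tid: boxes for tid, boxes in tracks.items() if len(boxes) >= min_frames}


def make_samples(boxes, n_past=30, n_future=60, stride=1):
    """Slide a window over one track, yielding (past, future) box lists.

    n_past=30 and n_future=60 correspond to 1 second of history and
    2 seconds of future at 30Hz.
    """
    total = n_past + n_future
    return [
        (boxes[s:s + n_past], boxes[s + n_past:s + total])
        for s in range(0, len(boxes) - total + 1, stride)
    ]
```

A track of exactly 90 frames yields a single sample, which is why anything shorter than 3 seconds cannot be used.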
In this section, we present STED, an encoder-decoder architecture for MOF that combines visual and temporal features. The proposed architecture has three components: (i) a bounding box feature encoder based on a Gated Recurrent Unit (GRU) that extracts temporal features from past object bounding boxes, (ii) a CNN-based encoder that extracts motion features directly from optical flow, and (iii) a decoder implemented in terms of another GRU for generating future bounding box predictions given the learned features. An overview of our model is shown in Fig.4.
Our bounding box encoder extracts features from past bounding box coordinates of each object, represented in terms of its centroid $(x, y)$, width $w$, and height $h$. In addition, we compute the velocity in the $x$ and $y$ directions, the change in width, and the change in height. This results in an 8-dimensional vector associated with each object bounding box.
For each observed timestep, a GRU (GRU-1 in Fig.4) takes the bounding box vector as input and outputs an updated hidden state vector. This update is repeated for all timesteps, resulting in a single hidden state vector at the final timestep which summarizes the entire sequence of bounding boxes. The 256-dimensional feature vector from a fully connected layer (FC-1 in Fig.4) is used as a compact representation of the history of bounding boxes.
We adapt the Dynamic Trajectory Predictor (DTP) to learn features directly from optical flow. Flow frames are extracted from within object bounding boxes obtained using YOLOv3 or Mask-RCNN at each timestep. A stack of 10 frames is sampled uniformly from the observed timesteps, representing one second of motion history. The stack of 10 horizontal and 10 vertical flow frames is used as input to a CNN, which is trained to predict future object bounding boxes. The feature vector from the final fully connected layer (FC-2 in Fig.4) is used as a compact representation of optical flow features. As optical flow captures both object motion and ego-motion, this vector encodes information from these two motion sources. Using optical flow as the input of our encoder rather than features from a person keypoint estimation model avoids the challenges relating to inaccurate keypoint estimations.
Following the feature encoding stage, we use another GRU to generate the estimated sequence of future bounding boxes, enabling the model to generate predictions for an arbitrary number of timesteps into the future. The two feature vectors, one from the bounding box encoder and one from the optical flow encoder, are concatenated, resulting in a single feature vector representing both optical flow and bounding box history. For each future timestep to be predicted, the decoder GRU (GRU-2 in Fig.4) receives two inputs: the concatenated context vector and the internal hidden state. The GRU outputs a new value for the hidden state at each timestep. Given each generated hidden state, a final fully connected layer generates the predicted bounding box for each timestep. Rather than representing object bounding boxes by their absolute location or relative displacement from the previous bounding box, we adopt an existing formulation and represent the bounding box centroid as the relative change in velocity. The decoder generates a vector representing the change in velocity along the $x$ and $y$ axes, and the change in bounding box width and height. The untrained model is initialized to the case where the change in velocity is zero (constant velocity) and the change in width and height is zero (constant scale). This formulation results in a better initialization than absolute or relative locations.
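To make the output parameterization concrete, the sketch below turns a sequence of decoder outputs back into bounding boxes. It reflects one possible reading of the formulation, in which the changes in velocity and size accumulate from one timestep to the next; the paper's exact convention may differ. The function name `decode_boxes` and the tuple layouts are hypothetical.

```python
def decode_boxes(last_box, last_velocity, deltas):
    """Convert decoder outputs into future bounding boxes.

    last_box:      (cx, cy, w, h) of the most recent observed box.
    last_velocity: (vx, vy) of the centroid, in pixels per timestep.
    deltas:        one (dvx, dvy, dw, dh) tuple per future timestep.
    """
    cx, cy, w, h = last_box
    vx, vy = last_velocity
    boxes = []
    for dvx, dvy, dw, dh in deltas:
        vx, vy = vx + dvx, vy + dvy   # update velocity by the predicted change
        cx, cy = cx + vx, cy + vy     # move the centroid by the new velocity
        w, h = w + dw, h + dh         # update the box size
        boxes.append((cx, cy, w, h))
    return boxes
```

With all-zero outputs the function reproduces constant-velocity, constant-scale extrapolation, which matches the initialization described above.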
We adapt the following models, which were originally developed for trajectory forecasting, for MOF. Each model is modified for full bounding box prediction either by assuming that object scale is constant or, for the learning-based approaches, by adding output channels representing bounding box height and width.
Constant Velocity & Constant Scale (CV-CS): We adopt the simple constant velocity model, which is used widely as a baseline for trajectory forecasting models [1, 42, 12] and as a motion model for MOT [48, 39, 33]. We find that using a constant scale performs better than linearly extrapolating a change in width and height.
Linear Kalman Filter (LKF): The LKF is a widely-used method for tracking objects and predicting trajectories under noisy conditions. We use an LKF with initial parameters chosen using cross-validation. The LKF is one of the most popular motion models for MOT [41, 26, 18].
Dynamic Trajectory Predictor (DTP): We adapt DTP, which uses a CNN with past optical flow frames as input to predict future bounding boxes.
Clips from Citywalks are split into 3 folds. For each fold, the held-out clips are further divided, with 50% used for validation and 50% for testing. We use inter-city cross-validation, i.e., footage from cities in the validation and testing sets does not appear in the training set. This challenging evaluation setup ensures that pedestrian identities from the training set do not appear at test time, and prevents models from overfitting to a particular environment.
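An inter-city split of this kind could be built as in the sketch below. The official fold assignment ships with the dataset and should be used for comparable results; here the function name `inter_city_folds`, the random city assignment, and the choice to divide held-out cities (rather than clips) between validation and test are all assumptions for illustration.

```python
import random


def inter_city_folds(clip_cities, n_folds=3, seed=0):
    """Build city-disjoint train/val/test splits.

    clip_cities maps a clip id to the city it was recorded in.
    Returns one dict per fold with "train", "val" and "test" clip lists.
    """
    cities = sorted(set(clip_cities.values()))
    random.Random(seed).shuffle(cities)
    folds = []
    for k in range(n_folds):
        held_out = cities[k::n_folds]  # cities never seen in training for fold k
        half = len(held_out) // 2
        val_cities = set(held_out[:half])
        test_cities = set(held_out[half:])
        split = {"train": [], "val": [], "test": []}
        for clip, city in sorted(clip_cities.items()):
            if city in val_cities:
                split["val"].append(clip)
            elif city in test_cities:
                split["test"].append(clip)
            else:
                split["train"].append(clip)
        folds.append(split)
    return folds
```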
Bounding box feature encoder. Bounding box vectors (defined in Section 5.1) are computed by taking the velocity of the object over the previous 5 timesteps, i.e., as the change in each coordinate between timestep $t-5$ and timestep $t$. Our feature encoder consists of a GRU with 512 hidden units, which uses the bounding box vector and the previous hidden state vector as input and outputs an updated hidden state vector.
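The bounding box vector can be sketched as four box coordinates followed by four differences taken over a 5-timestep lag. The function name `box_features` and the ordering of the entries are hypothetical, and applying the same lag to the width and height changes is an assumption.

```python
def box_features(boxes, t, lag=5):
    """Feature vector for the box at timestep t of one track.

    boxes is a list of (cx, cy, w, h) tuples. The result is
    [cx, cy, w, h, vx, vy, dw, dh], where the last four entries are
    differences between timestep t and timestep t - lag.
    """
    if t < lag:
        raise ValueError("need at least `lag` earlier timesteps")
    cx, cy, w, h = boxes[t]
    px, py, pw, ph = boxes[t - lag]
    return [cx, cy, w, h, cx - px, cy - py, w - pw, h - ph]
```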
Optical flow feature encoder. We compute optical flow for each video frame using FlowNet2. The flow from within each pedestrian bounding box is then cropped, clipped to a fixed range, scaled to a fixed size, and normalized. We perform standard data augmentation during training, taking a random crop and randomly horizontally flipping frames with probability 0.5. We train the optical flow feature encoder to predict future object locations, using ResNet50 as the backbone CNN architecture, for 10k iterations with a batch size of 64, and then freeze the weights to use our flow encoder as a fixed feature extractor.
Decoder. As described in Section 5.3, our decoder takes the concatenated feature vector as input. The decoder consists of another GRU with 512 hidden units. For each of the 60 timesteps to be predicted, the decoder takes the concatenated feature vector and the previous hidden state and outputs a new hidden state. A linear layer takes the hidden state and generates a predicted bounding box for the respective timestep. The optical flow feature encoder is used as a fixed feature extractor, while the bounding box encoder and decoder are trained jointly end-to-end with a learning rate that is halved every 5 epochs. We use a batch size of 1024 and train the model for 20 epochs. The model is optimized using the smooth L1 loss, which we find to be more robust to outliers in the training data than the L2 loss.
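The training objective and learning rate schedule can be sketched in a few lines, assuming the "smooth" loss referred to here is the standard smooth L1 (Huber-style) loss. The transition point `beta=1.0` is an assumed default, the function names are hypothetical, and the initial learning rate is deliberately left as a parameter.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, g in zip(pred, target):
        d = abs(p - g)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def l2(pred, target):
    """Mean squared error, shown for comparison."""
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(pred)


def halved_lr(initial_lr, epoch, step=5):
    """Learning rate at a given epoch when it is halved every `step` epochs."""
    return initial_lr * 0.5 ** (epoch // step)
```

For a single outlier with an error of 10 pixels, the smooth L1 loss is 9.5 while the squared error is 100, which illustrates why the former is less sensitive to outliers.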
|Model||ADE / FDE (pixels)||AIOU / FIOU (%)|
|BB-encoder||29.6 / 53.2||51.5 / 27.9|
|OF-encoder||27.5 / 50.0||53.2 / 28.8|
|Both encoders||26.7 / 48.4||54.3 / 30.2|
We evaluate each model on the Citywalks dataset using both annotation sets and evaluate each component of STED separately. Finally, we evaluate the cross-dataset generalizability of each model on the MOT-17 dataset.
Results on Citywalks. Table 2 shows the ADE / FDE (a displacement of 50 pixels corresponds to 2.5% of the total frame size) and AIOU / FIOU of all methods on Citywalks with both annotation sets. We evaluate the original DTP and FPL models for trajectory forecasting, as well as the versions modified for MOF. STED consistently performs better than existing approaches across all metrics, resulting in more precise bounding box forecasts. Fig.6 shows example bounding box predictions. STED implicitly anticipates both object and ego-motion in a diverse range of environments and situations. Fig.7 shows failure cases. The model performs poorly in challenging conditions such as large ego-motions and when the pedestrian scale is small.
We further break down performance on Citywalks in Fig.5. We find that most models perform better for sequences recorded in cities with clear weather conditions (e.g., Barcelona, Prague) than, in particular, snow (e.g., Tallinn, Helsinki). To confirm this intuition, we further plot the performance in different weather conditions and at different times of the day. Finally, we plot the mean IOU at all predicted timesteps 1 to 60. The IOU of the predicted and ground-truth bounding boxes predictably declines quickly, particularly for earlier timesteps. STED maintains the best IOU throughout the full prediction horizon.
Ablation study. We evaluate the benefits of each component of our proposed model by evaluating them separately. Specifically, we use only the bounding box encoder feature vector as input to the decoder, rather than the concatenated feature vector. We repeat this for the optical flow encoder feature vector. Table 3 shows the results of our ablation study on Citywalks. Both the bounding box and optical flow encoders contribute to the overall performance.
|Model||ADE / FDE (pixels)||AIOU / FIOU (%)|
|CV-CS||58.9 / 104.7||43.8 / 21.5|
|LKF||62.0 / 110.2||41.6 / 20.1|
|FPL||56.9 / 96.3||-|
|DTP||55.2 / 99.0||-|
|FPL-MOF||58.0 / 98.4||41.4 / 20.4|
|DTP-MOF||52.2 / 92.4||47.7 / 26.1|
|STED||51.8 / 91.6||46.7 / 24.4|
Cross-dataset evaluation. In order to evaluate the generalizability of models trained on Citywalks, we use the popular MOT-17 dataset. We use sequences 2, 9, 10 and 11 from the MOT-17 train set and discard sequences 4 and 13, as these sequences are filmed from an overhead perspective. We also discard sequence 5 due to the low image resolution and frame rate. We follow a similar pre-processing setup to Citywalks, discarding tracks shorter than 3 seconds. We also ensure pedestrians are occluded no more than 50% of their total bounding box size using the annotations provided, resulting in 83 unique pedestrian tracks. We take each model trained on Citywalks and evaluate using each of the four sequences. Note that we do not modify the models and, crucially, we do not fine-tune on MOT-17. Table 4 shows encouraging results suggesting that models trained on Citywalks generalize cross-dataset and to human-annotated bounding boxes. However, due to the small size of the MOT-17 dataset, these results should be treated with caution.
We have introduced the task of multiple object forecasting and created the Citywalks dataset to facilitate future research. Crucially, we have shown that models trained on the Citywalks dataset can predict future object bounding boxes on the MOT-17 tracking benchmark more precisely than existing methods used in multiple object tracking. Our encoder-decoder model, STED, forecasts object bounding boxes up to two seconds in the future and anticipates non-linear motions. This development shows promise for building more sophisticated object forecasting models to aid object tracking in order to address common problems such as occlusions and missed detections.
We would like to thank Shanaka Perera and Shuyang Sun for their insightful comments. This work is funded by the UK EPSRC (grant no. EP/L016400/1). Our thanks to NVIDIA corp. for supporting this research with their generous hardware donation. We would also like to thank Daniel Sczepansky for collecting the videos used in this research and sharing them under the CC-BY license.
Realtime multi-person 2d pose estimation using part affinity fields. In Computer Vision and Pattern Recognition.