Top-view Trajectories: A Pedestrian Dataset of Vehicle-Crowd Interaction from Controlled Experiments and Crowded Campus

02/01/2019 · by Dongfang Yang, et al. · Dalian University of Technology, The Ohio State University

Predicting the collective motion of a group of pedestrians (a crowd) under the influence of a vehicle is essential for autonomous vehicles to handle mixed urban scenarios where interpersonal interaction and vehicle-crowd interaction (VCI) are significant. This usually requires a model that can describe individual pedestrian motion under the influence of nearby pedestrians and the vehicle. This study proposes two pedestrian trajectory datasets, the CITR dataset and the DUT dataset, so that pedestrian motion models can be further calibrated and verified, especially when the vehicle influence on pedestrians plays an important role. The CITR dataset consists of experimentally designed fundamental VCI scenarios (front, back, and lateral VCIs) and provides a unique ID for each pedestrian, which makes it suitable for exploring specific aspects of VCI. The DUT dataset covers two ordinary and natural VCI scenarios on a crowded university campus, which can be used for more general-purpose VCI exploration. The trajectories of pedestrians, as well as vehicles, were extracted by processing video frames from a down-facing camera mounted on a hovering drone. The final trajectories were refined by a Kalman filter, in which pedestrian velocity was also estimated. The statistics of the velocity magnitude distribution demonstrate the validity of the proposed datasets. In total, there are approximately 340 pedestrian trajectories in the CITR dataset and 1793 pedestrian trajectories in the DUT dataset. The datasets are available on GitHub.




I Introduction

Intelligent vehicles have to cope with mixed urban scenarios where a certain number of pedestrians walk around a moving vehicle. In such scenarios, it is necessary to understand how vehicles and pedestrians interact with each other. This interaction has been studied for some time, but in most cases the number of pedestrians is small, so the interpersonal interaction is usually ignored. However, in the real world, vehicles may face a large number of pedestrians (a crowd). In this situation, the interpersonal interaction is indispensable. For example, under the same vehicle influence, a large group of pedestrians may behave differently from a small group, because a larger group, i.e., a crowd, plays a more dominant role in the vehicle-pedestrian interaction. This vehicle-crowd interaction (VCI) scenario has been drawing attention in recent years. Specific models [1][2][3] have been designed to describe the motion of individuals in a crowd for certain situations, in which both interpersonal and vehicle-pedestrian interaction were considered. To accurately calibrate and evaluate such models, the availability of ground truth VCI trajectories is becoming increasingly important. However, to the best of the authors' knowledge, there is no dataset that covers VCI, especially in scenarios where the interpersonal interaction is not negligible. To fill this gap, we built two new VCI datasets. One focuses on fundamental VCI scenarios in controlled experiments (the CITR dataset), and the other consists of ordinary VCIs on a crowded university campus (the DUT dataset).

In general, there are two approaches to model pedestrian motion in a crowd. One is the traditional approach, i.e., designing a rule-based model and then calibrating the model parameters using ground truth trajectories [4][1][5]. The other approach applies neural networks, especially long short-term memory (LSTM) networks, in which the model parameters are also trained using ground truth trajectories [6][7]. The utilization of ground truth pedestrian trajectories in both approaches shows the importance of pedestrian/crowd trajectory datasets. Obviously, the more datasets [8][9][10] are available, the better the pedestrian models that can be obtained from the data. However, with the increasing interest in intelligent vehicles, a pedestrian dataset that not only includes interpersonal interaction but also covers VCI is required for vehicles driving in high pedestrian density areas. Our dataset, as an initial exploration, meets this requirement and will benefit the development of intelligent vehicles, and even intelligent transportation systems.

Unlike pure interpersonal interaction, VCI introduces more complexity. To examine such complexity, it is necessary to isolate the vehicle influence on pedestrians. Therefore, in this study, controlled experiments (the CITR dataset) were designed and conducted so that, by comparing pedestrian-only scenarios (pure interpersonal interaction) with VCI scenarios, the vehicle influence can be separated and analyzed. This is possible because, apart from the presence of a vehicle, other factors such as pedestrians' intentions (starting points and destinations), pedestrians' identities (who these pedestrians are), and the environment layout (location, time period, weather, etc.) remain the same throughout the controlled experiments.

To obtain models that accurately describe pedestrian motion, it also helps to consider personal characteristics, i.e., each pedestrian applies a model with a unique parameter set. One advantage of the CITR dataset is that the same pedestrian always has the same ID, which provides additional information for researchers who want to study the effect of personal characteristics on interpersonal or vehicle-pedestrian interaction.

In addition to the controlled experiments, a dataset of ordinary VCI scenarios (the DUT dataset) was constructed from a series of recordings of a crowded university campus. Because the recording equipment was a down-facing camera attached to a drone hovering high above the ground, both the crowd and the vehicle were unaware of being observed, hence their behavior was natural. The DUT dataset can be used for the final verification of VCI models or for end-to-end VCI modeling. The use of a hovering drone also ensured the accuracy of the extracted trajectories and avoided the issue of occlusion, which is a major deficiency in some existing pedestrian datasets.

The trajectories of individual pedestrians and vehicles were extracted by image processing techniques. Due to the unavoidable instability of the camera attached to a hovering drone (even with a gimbal system), the recorded videos had to be stabilized before further processing. A robust tracking algorithm (CSRT [11]) was then applied to automatically track pedestrians and vehicles, although the initial positions had to be selected manually. This approach avoids the tedious manual annotation used for the ETH and UCY datasets [9][8], as well as the possible tracking imprecision of the Stanford dataset [10].

In general, the contribution of the study can be summarized as follows:

  • We built a new pedestrian trajectory dataset that covers both interpersonal interaction and vehicle-crowd interaction.

  • The dataset consists of two parts. One comes from controlled experiments, in which fundamental VCIs are covered and each person has a unique ID. The other comes from crowded university campus scenarios, where the pedestrian reaction to a vehicle is completely natural.

  • The application of a drone camera for video recording eliminated possible occlusions among pedestrians so that the trajectories were extracted as accurately as possible.

In the rest of the paper, section 2 reviews related datasets regarding pedestrian motion and vehicle-pedestrian interaction. Section 3 describes the detailed configuration of both the CITR and DUT datasets. Section 4 provides an overview of the image processing techniques for trajectory extraction. Section 5 presents some statistics of the datasets. Section 6 concludes the study and discusses possible improvements.

II Related Works

Name     | Location                   | # of trajs | Other participants      | Pedestrian density | Video resolution | Pedestrian observation         | Cam. depression angle | Camera view
ETH      | campus, urban street       | 650        | no                      | medium             | 720 x 576        | 2.5 fps                        | ~75 degrees           | bird's eye view at building
UCY      | campus, park, urban street | 909        | no                      | high, low          | 720 x 576        | interpolated by control points | n/a                   | bird's eye view at building
Stanford | campus                     | 3297       | cyclist, bus, cart, car | medium, low        | 595 x 326        | 29.97 fps                      | 90 degrees            | top view on a drone
CITR     | specifically designed      | 340        | cart                    | medium             | 4K               | 29.97 fps                      | 90 degrees            | top view on a drone
DUT      | campus                     | 1793       | car                     | high, medium, low  | 4K               | 23.98 fps                      | 90 degrees            | top view on a drone
TABLE I: Comparison with existing world coordinate based pedestrian trajectory datasets

Pedestrian datasets can in general be divided into two categories: world coordinate (WC) based datasets and vehicle coordinate (VC) based datasets. WC based datasets are usually applied in studies that need to consider interpersonal interaction, because the collective motion of pedestrians is available and easily accessible, while VC based datasets do not contain enough instances of interpersonal interaction. Popular WC based datasets include the UCY Crowds-by-Example dataset [8], the ETH BIWI Walking Pedestrians dataset [9], and the Stanford Drone dataset [10]. They have been widely used for the calibration/training of various rule-based and neural network based pedestrian models [12]. The dataset proposed in this study aims to enrich WC based datasets by incorporating vehicle-crowd interaction. A comparison of the proposed and existing WC based datasets is shown in Table I. VC based datasets are usually used for detecting one or a few pedestrians from a mono camera mounted in front of the vehicle. A couple of datasets, such as the Daimler Pedestrian Path Prediction dataset [13] and the KITTI dataset [14], provide vehicle motion information, so the trajectories of both the vehicle and pedestrians in world coordinates can be estimated by combining vehicle motion and video frames. The estimated trajectories can serve as ground truth data for vehicle-pedestrian interaction, but without interpersonal interaction due to the limited number of pedestrians.

Some existing datasets employ a down-facing camera attached to a hovering drone as the recording equipment. For example, in the Stanford Drone dataset [10], the use of a drone eliminated occlusion so that all participants (pedestrians, cyclists, cars, carts, buses) were individually and clearly tracked. Another dataset, HighD [15], which focuses on vehicle-vehicle interaction in highway driving, also successfully demonstrated the benefit of using a hovering drone to remove occlusion.

III Dataset

III-A CITR Dataset

Fig. 1: Layout of the controlled experiment area (a parking lot near the CITR Lab at OSU). The vehicle (a golf cart) moves back and forth between the two blue areas. The interaction happens in the orange area, which is also the central area of the video.
Fig. 2: EZ-GO golf cart employed in the experiments (left) and markers on top of the vehicle (right). In the vehicle tracking process, 3 markers were continuously tracked, and the middle point between the front point and the center point was used as the vehicle position, as illustrated in the figure.

The controlled experiments were conducted in a parking lot near the facility of the Control and Intelligent Transportation Research (CITR) Lab at The Ohio State University (OSU). Figure 1 shows the layout of the experiment area. A DJI Phantom 3 SE drone with a down-facing camera on a gimbal system was used as the recording equipment. The video resolution is 4K at 29.97 fps. The participants were members of the CITR Lab at OSU. During the experiments, they were instructed only to walk from one small area (starting points) to another small area (destinations). The employed vehicle was an EZ-GO golf cart, as shown in figure 2. Three markers were placed on top of the vehicle to aid vehicle motion tracking, from which the vehicle position was calculated by geometry. The reason for using 3 markers is to reduce the tracking noise as much as possible.

Fig. 3: Designed scenarios of controlled experiments. Red arrows indicate the motion of pedestrians/crowd, while blue arrows indicate vehicle motion.

The designed fundamental scenarios were divided into 6 groups, as shown in figure 3. They were designed such that interpersonal interaction scenarios and VCI scenarios can be compared pairwise, so that individual effects, for example, the presence (or absence) of a vehicle or the walking direction of the crowd, can be identified and analyzed.

After processing, there are 38 video clips in total, which include approximately 340 pedestrian trajectories. The detailed information is presented in Table II.

Scenarios                            | Num. of clips
Pedestrian only (unidirectional)     | 4
Pedestrian only (bidirectional)      | 8
Lateral interaction (unidirectional) | 8
Lateral interaction (bidirectional)  | 10
Front interaction                    | 4
Back interaction                     | 4
TABLE II: Number of clips in each scenario of the CITR dataset

III-B DUT Dataset

Fig. 4: Locations in DUT dataset. Upper: an area of pedestrian crosswalk at an intersection without traffic signals. Lower: a relatively large shared space.

The DUT dataset was collected at two crowded locations on the campus of the Dalian University of Technology (DUT) in China, as shown in figure 4. One location is an area with a pedestrian crosswalk at an intersection without traffic signals. When VCI happens there, in general neither pedestrians nor vehicles have priority. The other location is a relatively large shared space, in which pedestrians and vehicles can move freely. As with the CITR dataset, a DJI Mavic Pro drone with a down-facing camera hovered above the area of interest as the recording equipment, high enough to be unnoticed by pedestrians and vehicles. The video resolution is 4K at 23.98 fps. The pedestrians are primarily college students who had just finished class and were on their way out of the classrooms. The vehicles are regular cars that pass through the campus.

With this configuration, the scenarios of the DUT dataset consist of ordinary VCIs, in which the number of pedestrians varies, introducing some variety into the VCI.

After processing, there are 17 clips of the crosswalk scenario and 11 clips of the shared space scenario, including 1793 trajectories. Some of the clips contain multiple VCIs, i.e., more than 2 vehicles interacting with pedestrians simultaneously, as in the lower picture in figure 4.

Figures 5 and 6 show example processed trajectories from the DUT dataset.

Fig. 5: Trajectories of vehicles (red dashed lines) and pedestrians (colored solid lines) in a clip of the intersection scenario.
Fig. 6: Trajectories of vehicles (red dashed lines) and pedestrians (colored solid lines) in a clip of the shared space scenario.

IV Trajectory Extraction

Four procedures were applied to extract the trajectories of both pedestrians and vehicles from the recorded top-view videos.

IV-A Video Stabilization

First, the raw video was stabilized to remove the noise caused by unstable drone motion. This procedure applies several image processing techniques: the scale-invariant feature transform (SIFT) algorithm for finding key-points, k-nearest neighbors (k-NN) for obtaining matches, and random sample consensus (RANSAC) for calculating the perspective transformation between each video frame and the first video frame (the reference frame). The detailed procedure is given in algorithm 1.

Result: calibrated frames
set the 1st frame as the reference frame F_ref;
for each new frame F_i do
       apply SIFT to find key-points in F_ref and F_i, separately;
       apply k-NN to find matches;
       obtain good matches by removing matches whose pixel positions in F_ref and F_i are far apart;
       apply RANSAC on the good matches to calculate the transformation matrix M_i from F_i to F_ref;
       obtain the calibrated frame by applying the transformation M_i to F_i;
end for
Algorithm 1 Video Stabilization

IV-B Vehicle and Pedestrian Tracking

Once the video was stabilized, pedestrians and vehicles were automatically tracked using the Discriminative Correlation Filter with Channel and Spatial Reliability (CSR-DCF) [11]. In the tracking process, the raw videos were partitioned into small clips, each containing a separate and complete VCI. For pedestrians, once they appeared in the region of interest (ROI), their initial positions were manually given, initializing the trackers. When they exited the ROI, the trackers were stopped. Due to the vehicle size, vehicle tracking was done by individually tracking either the 3 markers on top of the vehicle (CITR dataset) or the four corners of the vehicle (DUT dataset). The vehicle position was then calculated from the geometric relationship of these tracked points.

IV-C Coordinate Transformation

The pedestrian trajectories obtained in the previous step are in image pixel coordinates. A coordinate transformation is necessary to convert the trajectories from image pixels into metric units (meters).

This can be done either by measuring the actual length of a relatively long reference line in the scene or by measuring the distance between the markers on top of the vehicle (if applicable). The assumption here is that, compared with the altitude of the hovering drone, the distance between the ground plane and the tracking plane (the plane of a pedestrian's head or the vehicle's roof) is so small that both planes can be treated as the same plane.
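Under that flat-scene assumption, the conversion reduces to a single meters-per-pixel scale factor. A minimal sketch (the reference-line coordinates and length below are made-up numbers, purely for illustration):

```python
import numpy as np

def pixels_to_meters(traj_px, ref_px_a, ref_px_b, ref_len_m):
    """Scale a pixel trajectory using a reference line of known metric length."""
    # pixel length of the reference line
    ref_px = np.hypot(ref_px_b[0] - ref_px_a[0], ref_px_b[1] - ref_px_a[1])
    scale = ref_len_m / ref_px   # meters per pixel
    return np.asarray(traj_px, float) * scale

# e.g., a 10 m reference line that spans 500 px gives 0.02 m per pixel
traj_m = pixels_to_meters([(100, 0), (150, 0)], (0, 0), (500, 0), 10.0)
```

The same scale applies to every tracked point in the clip, since the stabilized video shares one pixel coordinate system.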

IV-D Trajectory Filtering

In the last step, a Kalman filter [16] was applied to remove the noise and refine the trajectories. The trajectories obtained by the tracking algorithm only provide positions, while the Kalman filter also estimates velocities, so the state is denoted as $x_k = [p_x, p_y, v_x, v_y]^T$. Assuming a linear Gaussian system with a constant-velocity motion model, this particular Kalman filter has the following state transition equation and measurement equation:

$$x_{k+1} = A x_k + w_k, \qquad z_k = H x_k + v_k,$$

where

$$A = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix},$$

$\Delta t$ is the sampling interval, and $w_k \sim \mathcal{N}(0, Q)$ and $v_k \sim \mathcal{N}(0, R)$ are the process and measurement noise, respectively.

Then, by iteratively applying the prediction step,

$$\hat{x}_{k|k-1} = A \hat{x}_{k-1|k-1}, \qquad P_{k|k-1} = A P_{k-1|k-1} A^T + Q,$$

and the update step,

$$K_k = P_{k|k-1} H^T \left( H P_{k|k-1} H^T + R \right)^{-1},$$
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \left( z_k - H \hat{x}_{k|k-1} \right), \qquad P_{k|k} = \left( I - K_k H \right) P_{k|k-1},$$

the final trajectories were generated and are ready to use.
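This prediction/update loop for a constant-velocity model can be sketched in NumPy as follows; the noise covariances Q and R and the state initialization are illustrative choices, not the values used to build the dataset.

```python
import numpy as np

def kalman_smooth(measured_xy, dt, q=1e-2, r=1e-1):
    """Filter noisy (x, y) positions; returns a [px, py, vx, vy] row per step."""
    A = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], float)   # constant-velocity state transition
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], float)   # only positions are measured
    Q = q * np.eye(4)                     # process noise covariance (assumed)
    R = r * np.eye(2)                     # measurement noise covariance (assumed)
    x = np.array([*measured_xy[0], 0.0, 0.0])   # start at the first measurement
    P = np.eye(4)
    estimates = []
    for z in measured_xy:
        # prediction step
        x = A @ x
        P = A @ P @ A.T + Q
        # update step
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (np.asarray(z, float) - H @ x)
        P = (np.eye(4) - K @ H) @ P
        estimates.append(x.copy())
    return np.array(estimates)
```

Because the state includes velocity, running this filter over a tracked position sequence yields both the refined trajectory and the per-step velocity estimates used in the statistics below.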

V Statistics

To give a more detailed description of the above datasets, the magnitudes of the pedestrian velocities (estimated by the Kalman filter) in all video clips were analyzed. The reason for analyzing the velocity magnitude is that pedestrian velocity is the most intuitive way of describing pedestrian motion, and, as argued in [12], if pedestrian trajectories are used to train neural network based pedestrian models, using pedestrian velocity (the offset in motion at the next time step) is better than using absolute position, because different reference systems (how the global coordinates are defined) in different datasets usually cause incompleteness of the training data.

Figures 7 and 8 show the distribution of the velocity magnitude for the CITR dataset and the DUT dataset, respectively. Table III presents the mean velocity magnitude and the mean walking velocity magnitude. The walking velocity excludes velocity magnitudes below a small threshold, under which a pedestrian is considered to be standing or yielding to the vehicle rather than walking. From the velocity distributions and the mean velocity results, it is obvious that the pedestrians in the DUT dataset walk faster than the pedestrians in the CITR dataset. The reason could be that, in the controlled experiments of the CITR dataset, pedestrians were more relaxed, while in the DUT dataset, pedestrians were in a bit of a hurry because they had just come out of class. In general, however, the distribution and the average of the velocity magnitude are in accordance with the preferred walking speeds of pedestrians in various situations [17].
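The two statistics in Table III can be computed as follows. The standing/yielding threshold value used here is a hypothetical placeholder, since the exact figure is elided in the text above.

```python
import numpy as np

def velocity_stats(speeds, walk_threshold=0.3):
    """Mean speed over all samples, and mean over walking samples only.

    `walk_threshold` (m/s) is an assumed cutoff below which a pedestrian is
    treated as standing or yielding rather than walking.
    """
    speeds = np.asarray(speeds, float)
    walking = speeds[speeds >= walk_threshold]   # drop standing/yielding samples
    return speeds.mean(), walking.mean()

# toy sample: two near-zero (standing) speeds and three walking speeds
mean_all, mean_walk = velocity_stats([0.0, 0.1, 1.2, 1.3, 1.4])
```

As in Table III, the walking mean is slightly higher than the overall mean because the near-zero samples are excluded.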

Fig. 7: Distribution of velocity magnitude in CITR dataset
Fig. 8: Distribution of velocity magnitude in DUT dataset
Dataset | Mean velocity (m/s) | Mean walking velocity (m/s)
CITR    | 1.2248              | 1.2379
DUT     | 1.3656              | 1.3815
TABLE III: Mean velocity magnitude

VI Conclusion

Two datasets, the experimentally designed CITR dataset and the natural DUT dataset, were built in this study for the calibration and verification of pedestrian motion models that consider both interpersonal and vehicle-crowd interaction. The trajectories of pedestrians and vehicles were extracted by image processing techniques and further refined by a Kalman filter. The statistics of the velocity magnitude validated the proposed datasets.

This study can be regarded as an initial attempt to incorporate VCI into pedestrian trajectory datasets. The number of trajectories and the variety of VCI scenarios are somewhat limited; therefore, more datasets of various scenarios are expected to be built. Another possible improvement could be automatically detecting/selecting the initial positions of pedestrians when they enter the ROI, thereby completely removing manual operation and expediting data processing. From the perspective of personal characteristics, it would help if the pedestrians in the dataset could be annotated with age, gender, gaze direction, and other features, although manual annotation of these features seems to be the only option at the current stage.


Acknowledgment

The authors would like to thank Xinran Wang for collecting and processing the DUT dataset, and Ekim Yurtsever and John Maroli for conducting the controlled experiments. Many thanks also to the members who supported the dataset building at the Control and Intelligent Transportation Research (CITR) Lab at The Ohio State University (OSU) and at the Dalian University of Technology (DUT).


References

  • [1] W. Zeng, P. Chen, G. Yu, and Y. Wang, “Specification and calibration of a microscopic model for pedestrian dynamic simulation at signalized intersections: A hybrid approach,” Transportation Research Part C: Emerging Technologies, vol. 80, pp. 37–70, 2017.
  • [2] B. Anvari, M. G. Bell, A. Sivakumar, and W. Y. Ochieng, “Modelling shared space users via rule-based social force model,” Transportation Research Part C: Emerging Technologies, vol. 51, pp. 83–103, 2015.
  • [3] D. Yang, A. Kurt, K. Redmill, and Ü. Özgüner, “Agent-based microscopic pedestrian interaction with intelligent vehicles in shared space,” in Proceedings of the 2nd International Workshop on Science of Smart City Operations and Platforms Engineering, pp. 69–74, ACM, 2017.
  • [4] F. Zanlungo, T. Ikeda, and T. Kanda, “Social force model with explicit collision prediction,” EPL (Europhysics Letters), vol. 93, no. 6, p. 68005, 2011.
  • [5] W. Daamen and S. Hoogendoorn, “Calibration of pedestrian simulation model for emergency doors by pedestrian type,” Transportation Research Record, vol. 2316, no. 1, pp. 69–75, 2012.
  • [6] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social LSTM: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971, 2016.
  • [7] M. Pfeiffer, G. Paolo, H. Sommer, J. Nieto, R. Siegwart, and C. Cadena, “A data-driven model for interaction-aware pedestrian motion prediction in object cluttered environments,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8, IEEE, 2018.
  • [8] A. Lerner, Y. Chrysanthou, and D. Lischinski, “Crowds by example,” in Computer Graphics Forum, vol. 26, pp. 655–664, Wiley Online Library, 2007.
  • [9] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool, “You’ll never walk alone: Modeling social behavior for multi-target tracking,” in Computer Vision, 2009 IEEE 12th International Conference on, pp. 261–268, IEEE, 2009.
  • [10] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, “Learning social etiquette: Human trajectory understanding in crowded scenes,” in European conference on computer vision, pp. 549–565, Springer, 2016.
  • [11] A. Lukežič, T. Vojíř, L. Č. Zajc, J. Matas, and M. Kristan, “Discriminative correlation filter tracker with channel and spatial reliability,” International Journal of Computer Vision, vol. 126, no. 7, pp. 671–688, 2018.
  • [12] S. Becker, R. Hug, W. Hübner, and M. Arens, “An evaluation of trajectory prediction approaches and notes on the trajnet benchmark,” arXiv preprint arXiv:1805.07663, 2018.
  • [13] N. Schneider and D. M. Gavrila, “Pedestrian path prediction with recursive Bayesian filters: A comparative study,” in German Conference on Pattern Recognition, vol. 8142, pp. 174–183, 2013.
  • [14] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [15] R. Krajewski, J. Bock, L. Kloeker, and L. Eckstein, “The highD dataset: A drone dataset of naturalistic vehicle trajectories on German highways for validation of highly automated driving systems,” in 2018 IEEE 21st International Conference on Intelligent Transportation Systems (ITSC), 2018.
  • [16] R. Faragher et al., “Understanding the basis of the Kalman filter via a simple and intuitive derivation,” IEEE Signal Processing Magazine, vol. 29, no. 5, pp. 128–132, 2012.
  • [17] B. J. Mohler, W. B. Thompson, S. H. Creem-Regehr, H. L. Pick, and W. H. Warren, “Visual flow influences gait transition speed and preferred walking speed,” Experimental brain research, vol. 181, no. 2, pp. 221–228, 2007.