Predicting the collective motion of a group of pedestrians (a crowd) under vehicle influence is essential for developing autonomous vehicles that can deal with mixed urban scenarios where interpersonal interaction and vehicle-crowd interaction (VCI) are significant. This usually requires a model that can describe individual pedestrian motion under the influence of nearby pedestrians and the vehicle. This study proposes two pedestrian trajectory datasets, the CITR dataset and the DUT dataset, so that pedestrian motion models can be further calibrated and verified, especially when vehicle influence on pedestrians plays an important role. The CITR dataset consists of experimentally designed fundamental VCI scenarios (front, back, and lateral VCIs) and provides a unique ID for each pedestrian, which makes it suitable for exploring specific aspects of VCI. The DUT dataset covers two ordinary and natural VCI scenarios on a crowded university campus, which can be used for more general-purpose VCI exploration. The trajectories of pedestrians, as well as vehicles, were extracted by processing video frames from a down-facing camera mounted on a hovering drone. The final trajectories were refined by a Kalman filter, in which the pedestrian velocity was also estimated. The statistics of the velocity magnitude distribution demonstrate the validity of the proposed datasets. In total, there are approximately 340 pedestrian trajectories in the CITR dataset and 1793 pedestrian trajectories in the DUT dataset. The dataset is available on GitHub.
Intelligent vehicles have to cope with mixed urban scenarios where a number of pedestrians walk around a moving vehicle. In such scenarios, it is necessary to understand how vehicles and pedestrians interact with each other. This interaction has been studied for some time, but in most cases the number of pedestrians is small, so the interpersonal interaction is usually ignored. In the real world, however, vehicles may face a large number of pedestrians (a crowd), and in this situation the interpersonal interaction cannot be neglected. For example, under the same vehicle influence, a large group of pedestrians may behave differently from a small group, because a larger group, i.e., a crowd, plays a more dominant role in the vehicle-pedestrian interaction. This vehicle-crowd interaction (VCI) scenario has been drawing attention in recent years. Specific models have been designed to describe the motion of individuals in a crowd for some specific situations, in which both interpersonal and vehicle-pedestrian interaction were considered. To accurately calibrate and evaluate such models, the availability of ground truth VCI trajectories is becoming increasingly important. However, to the best of the authors' knowledge, there is no dataset that covers VCI, especially in scenarios where the interpersonal interaction is not negligible. To fill this gap, we built two new VCI datasets. One focuses on fundamental VCI scenarios in controlled experiments (CITR dataset) and the other consists of ordinary VCIs on a crowded university campus (DUT dataset).
In general, there are two approaches to modeling pedestrian motion in a crowd. One is the traditional approach, i.e., to design a rule-based model and then calibrate the model parameters using ground truth trajectories. The other is to train a neural network based model on ground truth trajectories. The use of ground truth pedestrian trajectories in both approaches shows the importance of pedestrian/crowd trajectory datasets: the more datasets are available, the better the pedestrian models that can be obtained from the data. However, with the increasing interest in intelligent vehicles, vehicles driving in areas of high pedestrian density require a pedestrian dataset that not only includes interpersonal interaction but also covers VCI. Our dataset, as an initial exploration, meets this requirement and will benefit the development of intelligent vehicles, or even intelligent transportation systems.
Compared with pure interpersonal interaction, VCI introduces more complexity. To examine this complexity, it is necessary to isolate the vehicle influence on pedestrians. Therefore, in this study, controlled experiments (CITR dataset) were designed and conducted so that, by comparing pedestrian-only scenarios (pure interpersonal interaction) with VCI scenarios, the vehicle influence can be separated and analyzed. This is possible because, apart from the presence or absence of a vehicle, other factors such as the pedestrians' intention (starting point and destination), the pedestrians' identity (who these pedestrians are), and the environment layout (location, time period, weather, etc.) remain the same throughout the controlled experiments.
To obtain models that describe pedestrian motion more accurately, it also helps to consider personal characteristics, i.e., each pedestrian applies a model with a unique parameter set. One advantage of the CITR dataset is that the same pedestrian always has the same ID, which provides additional information for researchers who want to study the effect of personal characteristics on interpersonal or vehicle-pedestrian interaction.
In addition to the controlled experiments, a dataset of ordinary VCI scenarios (DUT dataset) was constructed from a series of recordings of a crowded university campus. Because the recording equipment was a down-facing camera attached to a drone hovering high above the ground, both the crowd and the vehicles were unaware of being observed, hence their behavior is natural. The DUT dataset can be used for final verification of VCI models or for end-to-end VCI model design. The use of a hovering drone also helped ensure the accuracy of the extracted trajectories and avoided the issue of occlusion, which is a major deficiency in some existing pedestrian datasets.
The trajectories of individual pedestrians and vehicles were extracted by image processing techniques. Due to the unavoidable instability of a camera attached to a hovering drone (even with a gimbal system), the recorded videos have to be stabilized before further processing. A robust tracking algorithm (CSRT) was then applied to automatically track pedestrians and vehicles, although the initial positions have to be selected manually. This approach avoids the tedious manual annotation used for the ETH and UCY datasets, and the possible tracking imprecision of the Stanford dataset.
In general, the contribution of the study can be summarized as follows:
We built a new pedestrian trajectory dataset that covers both interpersonal interaction and vehicle-crowd interaction.
The dataset includes two portions. One comes from controlled experiments, in which fundamental VCIs are covered and each person has a unique ID. The other comes from crowded university campus scenarios where the pedestrian reaction to a vehicle is completely natural.
The application of a drone camera for video recording eliminated possible occlusions among pedestrians so that the trajectories were extracted as accurately as possible.
In the rest of the paper, section 2 reviews related datasets on pedestrian motion and vehicle-pedestrian interaction. Section 3 describes the detailed configuration of both the CITR and DUT datasets. Section 4 provides an overview of the image processing techniques used for trajectory extraction. Section 5 shows some statistics of our dataset. Section 6 concludes the study and discusses possible improvements.
Table I: Comparison of the proposed datasets with existing WC based datasets.

| Name | Location | # of trajs | Other participants | Pedestrian density | Video resolution | Pedestrian observation | Cam. depression angle | Camera view |
|---|---|---|---|---|---|---|---|---|
| ETH | campus, urban street | 650 | no | medium | 720 x 576 | 2.5 fps | ~75 degrees | bird's eye view at building |
| UCY | campus, park, urban street | 909 | no | high, low | 720 x 576 | interpolated by control points | from a building | bird's eye view at building |
| Stanford | campus | 3297 | cyclist, bus, cart, car | medium, low | 595 x 326 | 29.97 fps | 90 degrees | top view on a drone |
| CITR | specifically designed | 340 | cart | medium | 4K | 29.97 fps | 90 degrees | top view on a drone |
| DUT | campus | 1793 | car | high, medium, low | 4K | 23.98 fps | 90 degrees | top view on a drone |
Pedestrian datasets can in general be divided into two categories: world coordinate (WC) based datasets and vehicle coordinate (VC) based datasets. WC based datasets are usually applied to studies that need to consider interpersonal interaction, because the collective motion of pedestrians is available and easily accessible, while VC based datasets do not contain enough instances of interpersonal interaction. Popular WC based datasets include the UCY Crowds-by-Example dataset, the ETH BIWI Walking Pedestrians dataset, and the Stanford Drone dataset. They have been widely used for the calibration/training of various rule-based and neural network based pedestrian models. The dataset proposed in this study aims to enrich the WC based datasets by incorporating vehicle-crowd interaction. A comparison of the proposed and existing WC based datasets is shown in table I. VC based datasets are usually used for detecting a single pedestrian or a small number of pedestrians from a mono camera mounted at the front of the vehicle. A few datasets, such as the Daimler Pedestrian Path Prediction dataset and the KITTI dataset, provide vehicle motion information, hence the trajectories of both the vehicle and the pedestrians in world coordinates can be estimated by combining vehicle motion and video frames. The estimated trajectories can serve as ground truth data for vehicle-pedestrian interaction, but without interpersonal interaction due to the limited number of pedestrians.
Some existing datasets use a down-facing camera attached to a hovering drone as the recording equipment. For example, in the Stanford Drone dataset, the use of a drone eliminated occlusion so that all participants (pedestrians, cyclists, cars, carts, buses) were individually and clearly tracked. Another dataset, HighD, which focuses on vehicle-vehicle interaction in highway driving, also successfully demonstrated the benefit of using a hovering drone to remove occlusion.
The controlled experiments were conducted in a parking lot near the facility of the Control and Intelligent Transportation Research (CITR) Lab at The Ohio State University (OSU). Figure 1 shows the layout of the experiment area. A DJI Phantom 3 SE drone with a down-facing camera on a gimbal system was used as the recording equipment. The video resolution is 4K at 29.97 fps. The participants were members of the CITR Lab at OSU. During the experiments, they were instructed only to walk from one small area (starting points) to another small area (destinations). The vehicle was an EZ-GO golf cart, as shown in figure 2. Three markers were placed on top of the vehicle to help track its motion, and the vehicle position was calculated from the marker positions by geometry. Three markers were used to reduce the tracking noise as much as possible.
The designed fundamental scenarios were divided into 6 groups, as shown in figure 3. They were designed so that interpersonal interaction scenarios and VCI scenarios can be compared pairwise, which allows the effect of a single factor, for example the presence or absence of a vehicle or the walking direction of the crowd, to be identified and analyzed.
After processing, there are 38 video clips in total, which include approximately 340 pedestrian trajectories. The detailed information is presented in table II.
Table II: Number of clips per scenario in the CITR dataset.

| Scenarios | Num. of clips |
|---|---|
| Pedestrian only (unidirectional) | 4 |
| Pedestrian only (bidirectional) | 8 |
| Lateral interaction (unidirectional) | 8 |
| Lateral interaction (bidirectional) | 10 |
The DUT dataset was collected at two crowded locations on the campus of Dalian University of Technology (DUT) in China, as shown in figure 4. One location includes a pedestrian crosswalk at an intersection without traffic signals. When VCI happens there, in general neither pedestrians nor vehicles have priority. The other location is a relatively large shared space, in which pedestrians and vehicles can move freely. As with the CITR dataset, a drone with a down-facing camera (here a DJI Mavic Pro) hovered above the area of interest as the recording equipment, high enough to go unnoticed by pedestrians and vehicles. The video resolution is 4K at 23.98 fps. The pedestrians are primarily college students who had just finished classes and were on their way out of the classrooms. The vehicles are regular cars passing through the campus.
With this configuration, the scenarios of the DUT dataset consist of ordinary VCIs in which the number of pedestrians varies, which introduces some variety into the VCI.
After processing, there are 17 clips of the crosswalk scenario and 11 clips of the shared space scenario, including 1793 trajectories. Some of the clips contain multiple VCIs, i.e., more than 2 vehicles interacting with pedestrians simultaneously, as in the lower picture in figure 4.
Four steps were carried out to extract the trajectories of both pedestrians and vehicles from the recorded top-view video.
First, the raw video was stabilized to remove the noise caused by unstable drone motion. This procedure applies several image processing techniques, which include scale-invariant feature transform (SIFT) algorithm for finding key-points, k-nearest neighbors (k-NN) for obtaining matches, and random sample consensus (RANSAC) for calculating perspective transformation between each video frame and the first video frame (reference frame). The detailed procedure is illustrated in algorithm 1.
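The perspective transformation estimated for each frame is a 3x3 homography that maps pixel coordinates in that frame onto the reference frame. As a minimal sketch of how such a matrix acts on a single pixel (the function name `warp_point` and the example matrix `H_shift` are illustrative assumptions, not taken from the dataset's processing code):

```python
def warp_point(H, x, y):
    """Map pixel (x, y) through a 3x3 homography H given as nested lists."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to get back to pixel coordinates.
    return xh / w, yh / w


# Hypothetical homography: the drone drifted so that the current frame is
# shifted by (+3, -2) pixels relative to the reference frame.
H_shift = [[1.0, 0.0, -3.0],
           [0.0, 1.0, 2.0],
           [0.0, 0.0, 1.0]]

print(warp_point(H_shift, 100.0, 50.0))  # (97.0, 52.0)
```

In practice the matrix would come from a feature-matching pipeline such as the one described above (for example, with OpenCV, a SIFT detector, a k-NN matcher, and `cv2.findHomography` with the RANSAC flag), and whole frames would be warped with `cv2.warpPerspective` rather than point by point.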
Once the video was stabilized, pedestrians and vehicles were automatically tracked using the Discriminative Correlation Filter with Channel and Spatial Reliability (CSR-DCF). For the tracking process, the raw videos were partitioned into small clips, each containing separate and complete VCIs. For pedestrians, the initial positions were given manually once they appeared in the region of interest (ROI), which initialized the trackers. When they exited the ROI, the trackers were stopped. Due to the vehicle size, vehicle tracking was done by individually tracking either the 3 markers on top of the vehicle (CITR dataset) or the four corners of the vehicle (DUT dataset). The vehicle position was then calculated from the geometric relationship of these tracked points.
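The exact geometric relationship used to obtain the vehicle position is not spelled out here. One plausible choice, shown below purely as an assumption, is to take the centroid of the tracked points as the vehicle position and, when four labeled corners are available, the direction from the rear edge midpoint to the front edge midpoint as the heading. The function names and the corner ordering are hypothetical.

```python
import math


def vehicle_center(points):
    """Centroid of tracked points (markers or corners), each an (x, y) tuple."""
    n = len(points)
    if n == 0:
        raise ValueError("need at least one tracked point")
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def vehicle_heading(front_left, front_right, rear_right, rear_left):
    """Heading in radians, from the rear-edge midpoint to the front-edge midpoint.

    Assumes the four corners are labeled in this order (a hypothetical convention).
    """
    fx = (front_left[0] + front_right[0]) / 2.0
    fy = (front_left[1] + front_right[1]) / 2.0
    rx = (rear_left[0] + rear_right[0]) / 2.0
    ry = (rear_left[1] + rear_right[1]) / 2.0
    return math.atan2(fy - ry, fx - rx)


# Made-up corner positions of a 4 x 2 vehicle pointing along the +x axis.
corners = [(4.0, 1.0), (4.0, -1.0), (0.0, -1.0), (0.0, 1.0)]
center = vehicle_center(corners)
heading = vehicle_heading(*corners)
```

Averaging several tracked points also reduces the effect of tracking noise on the estimated position, which matches the stated motivation for using more than one marker.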
The pedestrian trajectories obtained in the previous step are in image pixel coordinates. A coordinate transformation is necessary to convert the trajectories from image pixels into metric units (meters).
This can be done by either measuring the actual length of a relatively long reference line in the scene or measuring the distance between markers on top of the vehicle (if applicable). The assumption here is that, compared with the altitude of the hovering drone, the distance between the ground plane and the tracking plane (the plane of a pedestrian’s head or the vehicle’s top) is very small so that both planes can be treated as the same plane.
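Under the single-plane assumption, the conversion reduces to a uniform scale factor. The sketch below illustrates the reference-line variant; the function names, the choice of origin, and the numbers in the example are assumptions for illustration only.

```python
import math


def meters_per_pixel(ref_p1, ref_p2, ref_length_m):
    """Scale factor from a reference line with known real-world length."""
    pixel_len = math.hypot(ref_p2[0] - ref_p1[0], ref_p2[1] - ref_p1[1])
    if pixel_len == 0:
        raise ValueError("reference points must be distinct")
    return ref_length_m / pixel_len


def pixels_to_meters(traj_px, scale, origin_px=(0.0, 0.0)):
    """Convert [(x, y), ...] in pixels to meters relative to a chosen pixel origin."""
    return [((x - origin_px[0]) * scale, (y - origin_px[1]) * scale)
            for x, y in traj_px]


# Hypothetical reference line: 500 pixels long in the image, 25 m on the ground.
scale = meters_per_pixel((100.0, 200.0), (400.0, 600.0), 25.0)
traj_m = pixels_to_meters([(100.0, 200.0), (120.0, 200.0)], scale,
                          origin_px=(100.0, 200.0))
```

With these example values the scale is 0.05 m per pixel, so a 20-pixel displacement corresponds to about 1 m on the ground.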
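In the last step, a Kalman filter was applied to remove noise and refine the trajectories. The trajectories obtained by the tracking algorithm only provide positions, whereas the Kalman filter estimates both positions and velocities. The state is denoted as $\mathbf{x}_k = [x_k, y_k, v_{x,k}, v_{y,k}]^T$. This particular Kalman filter has the following state transition equation and measurement equation:

$$\mathbf{x}_k = A\mathbf{x}_{k-1} + \mathbf{w}_{k-1}, \qquad \mathbf{z}_k = H\mathbf{x}_k + \mathbf{v}_k,$$

where, for a frame interval $\Delta t$,

$$A = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix},$$

$\mathbf{w}_k \sim \mathcal{N}(0, Q)$, and $\mathbf{v}_k \sim \mathcal{N}(0, R)$. This assumes a linear Gaussian system.

Then, by iteratively applying the prediction step,

$$\hat{\mathbf{x}}_{k|k-1} = A\hat{\mathbf{x}}_{k-1|k-1}, \qquad P_{k|k-1} = A P_{k-1|k-1} A^T + Q,$$

and the update step,

$$K_k = P_{k|k-1} H^T \left(H P_{k|k-1} H^T + R\right)^{-1},$$

$$\hat{\mathbf{x}}_{k|k} = \hat{\mathbf{x}}_{k|k-1} + K_k\left(\mathbf{z}_k - H\hat{\mathbf{x}}_{k|k-1}\right), \qquad P_{k|k} = \left(I - K_k H\right) P_{k|k-1},$$

the final trajectories were generated and are ready to use.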
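The filtering step can be sketched in a few lines. The version below is a simplified illustration, not the authors' implementation: it assumes a constant-velocity model whose x and y axes are decoupled, so each coordinate is filtered independently with a two-element state (position, velocity). The noise values `accel_var` and `meas_var`, the initial covariance, and the function names are all illustrative assumptions.

```python
def kalman_cv_1d(measurements, dt, accel_var=1.0, meas_var=0.01):
    """Filter one coordinate with a constant-velocity Kalman filter.

    Returns a list of (position, velocity) estimates, one per measurement.
    """
    if not measurements:
        return []
    # Process noise of a white-noise-acceleration model (an assumed choice).
    q00 = accel_var * dt ** 4 / 4.0
    q01 = accel_var * dt ** 3 / 2.0
    q11 = accel_var * dt ** 2

    # Initialize from the first measurement with a very uncertain velocity.
    p, v = measurements[0], 0.0
    P00, P01, P10, P11 = meas_var, 0.0, 0.0, 100.0
    estimates = [(p, v)]

    for z in measurements[1:]:
        # Prediction step: x <- F x and P <- F P F^T + Q, with F = [[1, dt], [0, 1]].
        p = p + dt * v
        P00, P01, P10, P11 = (
            P00 + dt * (P01 + P10) + dt * dt * P11 + q00,
            P01 + dt * P11 + q01,
            P10 + dt * P11 + q01,
            P11 + q11,
        )
        # Update step with measurement matrix H = [1, 0] (position only).
        S = P00 + meas_var
        K0, K1 = P00 / S, P10 / S
        y = z - p
        p, v = p + K0 * y, v + K1 * y
        P00, P01, P10, P11 = (
            (1.0 - K0) * P00,
            (1.0 - K0) * P01,
            P10 - K1 * P00,
            P11 - K1 * P01,
        )
        estimates.append((p, v))
    return estimates


def kalman_cv_2d(xs, ys, dt):
    """Filter x and y independently and return (x, y, vx, vy) per frame."""
    ex = kalman_cv_1d(xs, dt)
    ey = kalman_cv_1d(ys, dt)
    return [(px, py, vx, vy) for (px, vx), (py, vy) in zip(ex, ey)]
```

For a 29.97 fps clip, `dt` would be `1 / 29.97`. The velocity magnitude for each frame is then `math.hypot(vx, vy)` of the returned estimates.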
To give a more detailed description of the above dataset, the magnitude of the pedestrian velocities (estimated by the Kalman filter) in all video clips was analyzed. Velocity magnitude is analyzed because pedestrian velocity is the most intuitive way of describing pedestrian motion. In addition, it has been argued that, if pedestrian trajectories are used to train a neural network based pedestrian model, using pedestrian velocity (the offset in motion at the next time step) is better than using absolute position, because different reference systems (how the global coordinates are defined) in different datasets usually cause incompleteness of the training data.
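The point about reference systems can be illustrated with a small sketch: two trajectories with the same shape but different coordinate origins have different absolute positions yet identical per-step offsets. The function name `to_offsets` and the sample coordinates are assumptions made for this illustration.

```python
def to_offsets(trajectory):
    """Convert [(x, y), ...] positions into per-step displacements."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])]


# The same motion recorded in two hypothetical reference systems.
traj_a = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
traj_b = [(10.0, -5.0), (11.0, -4.5), (12.0, -4.0)]

print(to_offsets(traj_a))  # [(1.0, 0.5), (1.0, 0.5)]
print(to_offsets(traj_a) == to_offsets(traj_b))  # True
```

Dividing each offset by the frame interval gives a finite-difference velocity, which is noisier than the Kalman filter estimate but carries the same origin-independent information.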
Figures 7 and 8 show the distribution of the velocity magnitude for the CITR dataset and the DUT dataset, respectively. Table III presents the mean velocity magnitude and the mean walking velocity magnitude. The walking velocity excludes velocity magnitudes below a threshold, at which the pedestrian is considered to be either standing or yielding to the vehicle instead of walking. The velocity distributions and the mean velocity results show that the pedestrians in the DUT dataset walk faster than the pedestrians in the CITR dataset. A possible reason is that, in the controlled experiments of the CITR dataset, the pedestrians were more relaxed, while in the DUT dataset the pedestrians were in a bit of a hurry because they had just come out of classes. In general, however, the distribution and the average of the velocity magnitude are in accordance with the preferred walking velocity of pedestrians in various situations.
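The two statistics can be computed as in the following sketch. The threshold is left as a parameter because its value is not reproduced here; the value 0.5 and the velocity samples in the example are purely hypothetical, as is the function name.

```python
import math


def mean_speeds(velocities, walking_threshold):
    """Return (mean speed, mean walking speed) for a list of (vx, vy) samples.

    Samples with speed below walking_threshold are treated as standing or
    yielding and are excluded from the walking mean.
    """
    speeds = [math.hypot(vx, vy) for vx, vy in velocities]
    if not speeds:
        raise ValueError("no velocity samples")
    walking = [s for s in speeds if s >= walking_threshold]
    mean_all = sum(speeds) / len(speeds)
    # None signals that no sample qualifies as walking.
    mean_walking = sum(walking) / len(walking) if walking else None
    return mean_all, mean_walking


# Made-up samples in m/s: one nearly stationary, two walking.
samples = [(0.0, 0.1), (1.0, 0.0), (0.0, 2.0)]
mean_all, mean_walking = mean_speeds(samples, walking_threshold=0.5)
```

Here the overall mean is pulled down by the nearly stationary sample, while the walking mean only averages the two samples above the threshold.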
Table III: Mean velocity magnitude and mean walking velocity magnitude.

| Dataset | Mean velocity | Mean walking velocity |
|---|---|---|
Two datasets, the experimentally designed CITR dataset and the natural DUT dataset, were built in this study for the calibration and verification of pedestrian motion models that consider both interpersonal and vehicle-crowd interaction. The trajectories of pedestrians and vehicles were extracted by image processing techniques and further refined by a Kalman filter. The statistics of the velocity magnitude support the validity of the proposed datasets.
This study can be regarded as an initial attempt to incorporate VCI into pedestrian trajectory datasets. The number of trajectories and the variety of VCI scenarios are somewhat limited; therefore, more datasets covering various scenarios are expected to be built. Another possible improvement is to automatically detect and select the initial positions of pedestrians when they enter the ROI, which would remove manual operation entirely and speed up the data processing. In terms of personal characteristics, it would help if the pedestrians in the dataset could be identified by age, gender, gaze direction, and other features, although manual annotation of these features seems to be the only option at the current stage.
The authors would like to thank Xinran Wang for collecting and processing DUT dataset, and Ekim Yurtsever and John Maroli for conducting controlled experiments. Also many thanks to the members who have supported the dataset building at Control and Intelligent Transportation Research (CITR) Lab at The Ohio State University (OSU) and at Dalian University of Technology (DUT).