A major challenge in recent work on autonomous vehicles is deciding how to handle interactions with human-driven vehicles. Such interactions are hard to model directly with equations, so learning-based methods for characterizing human-driver behavior are attractive choices that make it easier to simulate human drivers in simulators such as CARLA and VTD. However, these methods require large amounts of driving data in order to learn human drivers' diverse behavior. For a long time, NGSIM was the only public trajectory-based dataset from which human driver behavior could be extracted. In 2018, highD became available, but it only covers highway scenarios. Moreover, extracting and classifying human driver behavior without manually labeling large amounts of ground-truth data is another time-consuming challenge when dealing with raw driving data.
The current state of the art in acquiring and using such data faces several problems. First, some published work relies on privately collected datasets, whose inaccessibility makes them unusable as benchmarks for comparing algorithms. Second, some datasets are collected by autonomous vehicles from the perspective of the ego vehicle. Although this perspective is ultimately the one available to an autonomous vehicle, it rarely provides full sequences showing the social behavior of surrounding vehicles. To derive models of such behavior, bird's-eye-view datasets are more useful. In response to these problems, this paper constructs a method for benchmarking human driver behavior based on a bird's-eye-view data collection system using a drone. Figure 1 shows the pipeline of the data processing procedure.
In addition, given such a dataset, predicting other vehicles' intentions or trajectories is an essential step in the behavior planning of autonomous vehicles during decision making and trajectory planning. In motion planning practice, accurately predicting human drivers' behavior helps the ego car make better decisions. In Pittsburgh, most traffic lights control Going Straight (GS), Turning Left (TL), and Turning Right (TR) with a single light; as a result, at urban intersections many interactions occur between vehicles approaching from opposite directions with intention pairs of GS and TL or TL and TR. In these situations, "who goes first" between the two interacting vehicles is a key problem even for human drivers.
The main contributions of this work are:
A drone-based data collection and processing system to analyze bird’s-eye view trajectory data of human drivers.
An algorithm that predicts interacting human drivers' intentions as well as their trajectories, based on historical trajectories over a given time window, when approaching an urban intersection.
II Related Work
This section reviews previous work related to this paper, in two categories: 1) papers on algorithms used in the traffic data collection procedure; and 2) papers on predicting the intentions and trajectories of human drivers.
II-A Data Extraction
With the current popularity of autonomous driving, various datasets are available for researchers to develop and test their algorithms. These datasets fall into two classes. The first is traffic-flow-based datasets, which focus on a particular scene and simultaneously capture all the vehicles within it from a bird's-eye view. The NGSIM dataset is the best-known example and includes highway and urban scenarios. In 2018, RWTH Aachen University released the highD dataset, which used advanced computer vision technology to improve on the NGSIM data collection mechanism. The second class is based on sensors mounted on an ego vehicle, with data collected while driving it over a given route. Most such datasets provide vision-based benchmarks for further study. The KITTI dataset offers a vision benchmark for various autonomous-driving tasks. The Oxford RobotCar dataset collected 20 million images from 6 cameras mounted on the vehicle, along with LIDAR, GPS, and INS ground truth. Recently, UC Berkeley released BDD100K, which includes diverse driving videos collected from a camera mounted on the vehicle, with scalable annotation tooling.
In the current work, in order to gain a comprehensive view of the traffic situation, we use a bird's-eye-view method to collect traffic-flow-based data via drone. The portable end-to-end system allows researchers to collect their own data at any site of interest, unlike the NGSIM system, which depended on the installation of fixed cameras. While our data collection method is similar to that of the highD dataset, ours focuses primarily on urban intersections, which are more challenging than the highway scenarios highD covers; compared with highD, this work thus extends drone-based collection to urban scenarios.
Liebner et al. proposed an explicit model that extracts characteristic desired-velocity profiles from real-world data, allowing the Intelligent Driver Model (IDM) to account for turn-related deceleration and thus represent both car-following and turning behavior. Phillips et al. used LSTMs to classify vehicle maneuvers at intersections, predicting whether a driver will turn left, turn right, or continue straight with consistent accuracy up to 150 m before reaching the intersection; the mean cross-validated prediction accuracy averaged over 85% for both three- and four-way intersections. Other works predict complete trajectories using Hidden Markov Models, Gaussian Processes, Dynamic Bayesian Networks, Support Vector Machines, and inverse reinforcement learning.
Compared with these works, our models add multiple factors beyond the velocity profile, such as yaw variation and target motion features, which carry information about environmental changes around the ego vehicle. The work concentrates on ego-target car pairs and studies their interaction with each other. Meanwhile, we introduce the idea of direction intention prediction and use its result to drive a more detailed trajectory prediction. The main difficulty we tackle is that human drivers' intentions and vehicle trajectories are much more variable when approaching an urban intersection with heavy traffic flow than in highway situations.
III Preliminaries

In this section, the preliminary background of the problem is described, including the fundamental algorithms used for video stabilization, object detection, and tracking.
III-A Enhanced Correlation Coefficient
The two main challenges for video stabilization are the robustness and the speed of the alignment. Feature-based alignment is fast and can align images with large displacements, but its robustness depends on the quality and distribution of the detected feature points. On the other hand, pixel-based alignment algorithms such as the Enhanced Correlation Coefficient (ECC) can process every pair of consecutive frames in the video, but each alignment iteration searches over all pixels, which is computationally expensive. Moreover, ECC fails to align frames without a good initial guess for the homography matrix when the two frames have low similarity. As a result, a combination of feature-matching-based and homography-based methods is used to reap the advantages of both.
Feature-based alignment involves detecting key-points, finding key-point correspondences, and computing the image transformation using the Random Sample Consensus (RANSAC) algorithm. The standard ECC alignment uses normalized intensities with zero mean so that the similarity measure is invariant to contrast and brightness changes. Each frame is first warped using the homography from the feature-based method for a rough alignment, then warped with the homography from the ECC alignment.
III-B Object Detection

In the proposed pipeline, RetinaNet is used for detecting vehicles in the images. RetinaNet has a backbone network responsible for computing a convolutional feature map over the entire input image. A classification subnet predicts the class of each object; in our case there is only one class, 'car'. A box regression subnet predicts the location and size of each vehicle. ResNet-50 is used as the backbone for the forward pass of the FPN architecture, since the residual learning framework promotes easier convergence.
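RetinaNet's defining component is the focal loss, which down-weights easy examples so that training on dense anchors is not overwhelmed by background. A minimal sketch of its binary form, with the default values from the RetinaNet paper:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted probability of the 'car' class, y: 0/1 label.
    The (1 - p_t)^gamma factor shrinks the loss of well-classified
    examples, focusing training on hard ones."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```

With gamma = 0 and alpha = 0.5 this reduces to (half of) the ordinary cross-entropy.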
III-C Kalman Filter
The Kalman filter is used for tracking and trajectory smoothing. Based on the car's dynamic model and the characteristics of the system and measurement noise, the measurement variables serve as the input signal, and the estimated state variables are the output of the filter. The filtering process is composed of a prediction step and an update step:

$$\hat{x}_{k|k-1} = F\hat{x}_{k-1|k-1}, \qquad P_{k|k-1} = F P_{k-1|k-1} F^{T} + Q$$
$$K_k = P_{k|k-1} H^{T} \left(H P_{k|k-1} H^{T} + R\right)^{-1}$$
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \left(z_k - H\hat{x}_{k|k-1}\right), \qquad P_{k|k} = (I - K_k H) P_{k|k-1}$$

where $\hat{x}_k$ and $z_k$ are the estimated state variable and measurement variable at frame $k$, respectively, $F$ is the state transition matrix, $H$ is the measurement matrix, and $Q$ and $R$ are the system and measurement noise covariances.
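The prediction and update steps can be sketched in a few lines of NumPy; the constant-velocity model and noise values in the test are generic placeholders, not the paper's tuning:

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict + update cycle of a standard Kalman filter."""
    # Prediction: propagate state and covariance through the dynamic model.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: blend the prediction with the measurement via the Kalman gain.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

For vehicle tracking, the state would hold position (and implicitly velocity) per axis, with one `kalman_step` per video frame.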
IV Methodology

In this section we propose UrbanFlow, a procedure for processing the collected bird's-eye-view data. Then, based on UrbanFlow, we propose a method for predicting human drivers' intentions as well as their trajectories.
IV-A1 Video Stabilization
In this paper, we propose several steps for video stabilization to handle the displacement of the drone during data collection. Figure 2 visualizes the flow of the stabilization method. For each frame at time step $t$, the algorithm chooses a reference frame according to the alignment evaluation score obtained at the previous time step and the corresponding homography matrix, and uses them to produce the stabilized frame. First, a re-alignment is performed when the ECC alignment score drops below a threshold, since ECC takes a long time to converge and cannot align the current frame with a reference frame whose similarity is too low. Second, since alignment is time-consuming, it is only performed when a reference frame needs to be re-chosen; the homography matrix is re-used for subsequent frames until the evaluation score drops to the threshold and a new reference frame is chosen. Third, the homography matrix calculated from the previous ECC alignment initializes the guess for ECC in the next step to speed up convergence. Lastly, images are down-sampled so that ECC uses fewer pixels during the calculation.
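The reference-frame reuse logic above can be sketched abstractly. Here `align`, `warp`, and `score` are hypothetical placeholders for the ECC alignment, the homography warp, and the alignment evaluation score, and the threshold value is illustrative:

```python
def stabilize(frames, align, warp, score, threshold=0.7):
    """Sketch of the reference-frame reuse loop.

    align(ref, frame, H_init) -> H   (ECC alignment, seeded with the last H)
    warp(frame, H)            -> warped frame
    score(ref, warped)        -> alignment quality in [0, 1]
    All three are placeholders; real code would wrap OpenCV calls."""
    ref = frames[0]
    H = align(ref, ref, None)
    out = []
    for frame in frames:
        warped = warp(frame, H)              # re-use the last homography
        if score(ref, warped) < threshold:
            H = align(ref, frame, H)         # re-align, seeding ECC with the old H
            warped = warp(frame, H)
            if score(ref, warped) < threshold:
                ref = frame                  # still poor: choose a new reference
                H = align(ref, frame, None)
                warped = warp(frame, H)
        out.append(warped)
    return out
```

Expensive alignment thus only runs when the cheap score check says the reused homography has gone stale.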
IV-A2 Object Detection
The training dataset contains all the bounding boxes and their corresponding labels for each image. The input images are re-sized so that the objects to detect are larger than 32-by-32 pixels, while keeping the images small enough for the GPU's computational capacity. Images are masked to crop out the roads in order to make detection easier. RetinaNet was fine-tuned from weights pre-trained on the COCO dataset.
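This pre-processing might look like the following sketch; `resize_scale` and `mask_roads` are illustrative helpers, with only the 32-pixel target taken from the text:

```python
import numpy as np

def resize_scale(min_object_px, target_px=32):
    """Scale factor so the smallest vehicle covers at least target_px pixels
    per side (32x32 matches the smallest level of a standard FPN)."""
    return max(1.0, target_px / float(min_object_px))

def mask_roads(image, road_mask):
    """Zero out non-road pixels before detection; road_mask is a boolean
    HxW array (e.g. from the road segmentation described later)."""
    out = image.copy()
    out[~road_mask] = 0
    return out
```

Masking removes rooftops, parking lots, and other clutter so the detector only ever sees candidate road pixels.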
IV-A3 Map Construction and Coordinate Transition
The first step in creating the map is to crop the area of interest, which in this case is the roads. To address this problem, we use the image segmentation network U-Net, described in Ronneberger et al., with a few adjustments based on the work of Iglovikov et al. We preserve the decoder section of the network because its large number of feature channels allows the network to propagate context information to the higher-resolution layers. The main change is in the encoder section, which is replaced by the down-sampling layers of the VGG16 architecture in order to exploit weights pre-trained on ImageNet, given the limited quantity of collected data.
After detecting the road and applying a color filter to detect the lane markings, we transform all detected vehicle positions from the original image-based coordinates to road-based coordinates. Figure 3 shows how the road-based coordinate system is generated for an arbitrary road geometry that may occur in the real world. The method first chooses an origin and then obtains the $x$ axis and $y$ axis along the lane markings that separate the opposite directions of travel. For two given vehicles, the figure lists two examples of how to extract the road-based positions. Finally, each vehicle is represented by the following items:
Local $x$ and $y$ positions in the road-based coordinates
Vehicle length and width
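One plausible implementation of this transformation treats the lane marking between the two driving directions as a polyline and reports arc length along it plus a signed lateral offset. This is a sketch of the idea in Figure 3 under that assumption, not the paper's exact construction:

```python
import numpy as np

def to_road_coords(point, centerline):
    """Convert an image-frame position to road-based (s, d) coordinates.

    centerline: Nx2 polyline along the dividing lane marking, starting at
    the chosen origin. Returns s = arc length along the polyline and
    d = signed lateral offset (positive to the left of travel)."""
    p = np.asarray(point, dtype=float)
    pts = np.asarray(centerline, dtype=float)
    best = (np.inf, 0.0, 0.0)          # (distance, s, d) of the closest segment
    s_acc = 0.0
    for a, b in zip(pts[:-1], pts[1:]):
        seg = b - a
        L = float(np.linalg.norm(seg))
        t = float(np.clip(np.dot(p - a, seg) / (L * L), 0.0, 1.0))
        foot = a + t * seg             # closest point on this segment
        dist = float(np.linalg.norm(p - foot))
        if dist < best[0]:
            # Sign from the 2D cross product of segment direction and offset.
            cross = seg[0] * (p - a)[1] - seg[1] * (p - a)[0]
            sign = 1.0 if cross >= 0 else -1.0
            best = (dist, s_acc + t * L, sign * dist)
        s_acc += L
    return best[1], best[2]
```

Working in $(s, d)$ makes lane keeping, lane position, and distance-to-intersection directly comparable across different road geometries.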
IV-A4 Vehicle Tracking and Trajectory Smoothing
After the vehicle positions have been transformed into the local (road) coordinates, we apply the tracking algorithm to each car and smooth each vehicle's trajectory. In the system, we use the vehicle position as the state variable; $F$ is the state transition matrix and $H$ is the measurement matrix, while $w$ and $v$ represent the system noise and measurement noise, respectively.
IV-B Driving Behavior Prediction
IV-B1 Intention Prediction
For the driving behavior task, we construct a network with an LSTM layer that outputs the Direction Intention and the Yield Intention (see Figure 4). The direction intentions include Going Straight (GS), Turning Left (TL), and Turning Right (TR). The yield intention predicts which car will pass the potential conflict point first. For interacting driver pairs with intentions of GS and TL or TL and TR, the input states include the positions, velocities, heading angles, and relative distances to the intersection center of both cars in each pair. During the interaction, the yield motion also changes based on the counterpart's behavior. This serves as a key input to the next-step motion planning module and helps to generate a safer and more feasible trajectory.
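A minimal sketch of such an intention network in PyTorch; the hidden size and the 12-dimensional pair state are assumptions, since the paper does not specify layer sizes:

```python
import torch
import torch.nn as nn

class IntentionNet(nn.Module):
    """LSTM over the per-frame state of an interacting pair (positions,
    velocities, headings, distances to the intersection center of both
    cars), with two heads: direction (GS/TL/TR) and yield (who goes first).
    Sizes are illustrative, not the paper's."""

    def __init__(self, state_dim=12, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, batch_first=True)
        self.direction_head = nn.Linear(hidden, 3)  # GS / TL / TR logits
        self.yield_head = nn.Linear(hidden, 2)      # ego first / target first

    def forward(self, states):                      # states: (B, T, state_dim)
        _, (h, _) = self.lstm(states)
        h = h[-1]                                   # last layer's final hidden state
        return self.direction_head(h), self.yield_head(h)
```

Both heads share the recurrent encoding, which matches the observation that yield behavior depends on the same interaction history as the direction intention.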
IV-B2 Trajectory Prediction
Based on the direction and yield predictions, a more detailed trajectory prediction procedure provides more information about the future trajectories. In Figure 5, the input includes the velocities and positions of the target car. A reference trajectory is first selected according to the intention prediction results. Given the reference trajectories (see Figure 6), together with the intersection geometry information and the velocities, heading angles, and relative distances to the intersection center of both cars, the network predicts the future trajectories.
V Results

In this section, we show the results of the methods for each data processing procedure.
| Method | Processing Time (s) | SSIM |
| --- | --- | --- |
| ORB + ECC, w/o DS | 1.7609 | 0.8032 |
| ORB + ECC, DS | 0.6724 | 0.7759 |
| ORB + ECC, DS | 0.4779 | 0.7324 |
| ORB + ECC, DS | 0.4071 | 0.7160 |
| SURF + ECC, w/o DS | 0.6599 | 0.81896 |
| SURF + ECC, DS | 0.6627 | 0.7758 |
| SURF + ECC, DS | 0.4960 | 0.7336 |
| SURF + ECC, DS | 0.3750 | 0.7166 |
V-A1 Video Stabilization
In the previous section, we introduced the combination of feature-matching-based and homography-based alignment methods. Here we compare different combinations of the two with various down-sampling ratios. Table I shows the results for the different algorithm choices along with the structural similarity (SSIM) score, which measures the similarity between two images; a higher SSIM score indicates a better stabilization result.
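For reference, SSIM can be computed over a single global window as below; standard implementations (e.g. scikit-image's `structural_similarity`) average the same statistic over small local windows, but the formula is identical:

```python
import numpy as np

def global_ssim(x, y, L=255.0):
    """Single-window SSIM:
    ((2*mu_x*mu_y + c1) * (2*cov_xy + c2)) /
    ((mu_x^2 + mu_y^2 + c1) * (var_x + var_y + c2)),
    with c1 = (0.01 L)^2, c2 = (0.03 L)^2 for dynamic range L."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Identical images score exactly 1; any residual misalignment between reference and stabilized frame lowers the covariance term and hence the score.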
We finally chose the Speeded-Up Robust Features (SURF) detector combined with ECC and down-sampling as a good tradeoff between stabilization quality and computational efficiency. Figure 7 visualizes images with and without stabilization in four sub-figures. The Reference Frame is the anchor frame for the stabilization; ideally, the roads are perfectly aligned between the Reference Frame and the Target Frame. Before stabilization, blending the Reference Frame and the Target Frame produces the Target Blended Frame, in which the two frames are clearly misaligned. After stabilization, the result is shown as the Stabilized Frame, and the new blended result as the Stabilized Blended Frame.
V-A2 Vehicle Detection
| P / T | Car (Train / Test) | No Car (Train / Test) |
| --- | --- | --- |
| Car | 387 / 2369 | 1 / 0 |
| No Car | 3 / 72 | 0 / 0 |
| | 0.95 / 0.994 | |
| | 0.97 / 0.995 | |
| | 0.97 / 0.996 | |
Using RetinaNet, we trained a model that detects vehicles well from a bird's-eye view. In testing, only 97 out of 2322 vehicles were not detected, giving an accuracy of about 95.8%, with a high average intersection over union. This accuracy is high because the vehicles in the test cases are similar to those seen during training. False positives were removed by non-maximum suppression and by thresholding the prediction confidence score. Vehicles such as a bus that appear in testing but never appeared in training will not be detected, since they differ too much in size and color from what the network has learned. Images with incorrect detections were relabeled to fine-tune the network.
For each frame, RetinaNet is applied to detect vehicles. Figure 8 visualizes one of the testing images after applying the vehicle detection method. The original image-based positions of all the red bounding boxes detected as vehicles are saved. Table II shows the quantitative results for training and testing, where $T$ stands for the Ground Truth and $P$ for the Predicted Results.
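The duplicate-removal step can be sketched as greedy non-maximum suppression with a confidence cutoff; the threshold values below are illustrative, not the values used in the paper:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5, score_thresh=0.3):
    """Greedy non-maximum suppression with a confidence cutoff.

    boxes: Nx4 array of (x1, y1, x2, y2); returns indices of kept boxes."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    # Drop low-confidence predictions, then sort the rest by score.
    order = np.argsort(scores)[::-1]
    order = order[scores[order] >= score_thresh]
    kept = []
    while order.size > 0:
        i = order[0]
        kept.append(int(i))
        rest = order[1:]
        # IoU of the highest-scoring box with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]   # suppress heavy overlaps
    return kept
```

Each surviving box is the highest-confidence representative of one physical vehicle, which is what the tracker downstream expects.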
V-A3 Trajectory Smoothing
Most existing vehicle trajectory datasets, such as NGSIM, only provide raw trajectories, which are noisy and jerky and therefore hard to use directly. Figure 9 visualizes the trajectories of one vehicle with and without smoothing. Figure 9(a) shows the result with equal scaling of the two axes, where it is hard to see the difference between the trajectory with smoothing (red) and without (green). However, when the lateral axis is enlarged in Figure 9(b), the trajectory without smoothing (green) is much jerkier than the smoothed one (red).
Finally, the video (https://youtu.be/oTPgLUdN_cU) shows all the dynamic results produced by the pipeline.
We tested the algorithm on the UrbanFlow dataset, selecting pairs of interacting vehicles with driving directions of GS and TL or TL and TR. Figure 10 shows one such pair: the blue rectangle is the ego car and the green rectangle is the target vehicle. The input state includes the velocities, heading angles, and relative distances to the intersection center of both the ego and target cars.
V-B2 Intention Prediction
Given the described state, the intention network predicts the direction intention as well as the yield intention. Figure 11 visualizes the results of both. After the coordinate transformation, all ego cars approach the intersection (whose center is the coordinate origin) from the bottom. Markers of one type, in different colors, show the direction predictions made when the target vehicle reaches each position, while markers of type X show the yield intention predictions. Figure 12 shows the prediction accuracy with respect to the target vehicle's distance to the start of the intersection.
V-B3 Trajectory Prediction
We compared the mean squared error (MSE) of trajectory prediction with and without the intention results and reference trajectories. Table III presents the average MSE for the different methods, and Figure 12 shows the MSE with respect to the target vehicle's distance to the start of the intersection. After a vehicle passes the start of the intersection, the trajectories diverge due to the different direction intentions; as a result, the trajectory prediction MSE increases.
| Method | Average MSE (m) |
| --- | --- |
| LSTM w/ intention | 0.89 |
| LSTM w/ intention and reference trajectory | 0.18 |
In this paper, we propose a pipeline called UrbanFlow for processing traffic data collected by drones in urban environments. The raw data are processed through video stabilization, vehicle detection, map construction and coordinate transformation, vehicle tracking, and trajectory smoothing. Moreover, the paper proposes a method for driving behavior prediction and tests it on the UrbanFlow dataset. Future work on the dataset will focus on increasing its size, and more types of urban scenarios, such as T-intersections, stop-sign intersections, and yield intersections, will be included.
-  A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Proceedings of the 1st Annual Conference on Robot Learning, 2017, pp. 1–16.
-  “VTD homepage.” 2019. [Online]. Available: https://vires.com/vtd-vires-virtual-test-drive
-  “NGSIM homepage,” FHWA, 2005-2006. [Online]. Available: http://ngsim.fhwa.dot.gov
-  R. Krajewski, J. Bock, L. Kloeker, and L. Eckstein, “The highd dataset: A drone dataset of naturalistic vehicle trajectories on german highways for validation of highly automated driving systems,” in 2018 IEEE 21st International Conference on Intelligent Transportation Systems (ITSC), 2018.
-  A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
-  W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 Year, 1000km: The Oxford RobotCar Dataset,” The International Journal of Robotics Research (IJRR), vol. 36, no. 1, pp. 3–15, 2017. [Online]. Available: http://dx.doi.org/10.1177/0278364916679498
-  F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, “Bdd100k: A diverse driving video database with scalable annotation tooling,” arXiv preprint arXiv:1805.04687, 2018.
-  M. Liebner, M. Baumann, F. Klanner, and C. Stiller, “Driver intent inference at urban intersections using the intelligent driver model,” in 2012 IEEE Intelligent Vehicles Symposium. IEEE, 2012, pp. 1162–1167.
-  D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2017, pp. 1665–1670.
-  G. D. Evangelidis and E. Z. Psarakis, “Parametric image alignment using enhanced correlation coefficient maximization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 10, pp. 1858–1865, 2008.
-  M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” in Readings in computer vision. Elsevier, 1987, pp. 726–740.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” IEEE transactions on pattern analysis and machine intelligence, 2018.
-  Y. Chan, A. Hu, and J. Plant, “A kalman filter based tracking scheme with input estimation,” IEEE transactions on Aerospace and Electronic Systems, no. 2, pp. 237–244, 1979.
-  Stack Overflow, “cv2 motion euclidean for the warpmode in ECC image alignment method.”
-  H. Caesar, J. Uijlings, and V. Ferrari, “Coco-stuff: Thing and stuff classes in context,” in Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
-  V. Iglovikov and A. Shvets, “Ternausnet: U-net with vgg11 encoder pre-trained on imagenet for image segmentation,” arXiv preprint arXiv:1801.05746, 2018.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.