I Introduction
Semantic understanding of the dynamic interactions with their surroundings allows autonomous vehicles to make accurate predictions in real time and thus proper decisions. Achieving such tasks is essential but challenging because real-world driving scenarios are complex. Approaches targeting a specific simplified scenario or maneuver, such as lane-keeping, lane-changing, and merging on the highway [1, 2, 3, 4], have been proposed to study the interactions among multiple traffic participants. However, real-world traffic scenarios, such as intersections, are far more complicated. Here we propose a general framework to investigate how experienced human drivers interact with surrounding vehicles. This framework is capable of modeling interactions and learning interactive patterns from cluttered traffic scenarios. We anticipate this framework to be a starting point for analyzing more sophisticated real-world traffic scenarios, including reliable environment recognition, efficient scene understanding, and tractable safety evaluation.
Recorded traffic videos in a specific region-of-interest (ROI) contain informative multi-vehicle interaction patterns. The Next Generation Simulation (NGSIM) [5] is one of the most widely used datasets; it was recorded on highways with explicit traffic rules, e.g., clear lane markers and well-controlled traffic lights. Some public datasets are also provided in [6, 7, 8]. In these datasets, the traffic signals may induce simple traffic patterns, but the resulting models may not extend to more cluttered traffic scenarios without traffic lights and lane markers. In order to study the interaction patterns in cluttered urban traffic scenarios, we analyzed videos recorded over an unsignalized junction. In this paper, we provide a general framework capable of analyzing the more complicated interaction patterns in the Meskel Square Dataset, which has neither traffic lights nor clear lane markers. This dataset comprehensively reveals how human drivers interact with and react to each other in cluttered intersection scenarios without any external control, such as traffic lights.
Many approaches have been used to model interactions among multiple traffic participants. Model-based methods provide a straightforward understanding but may only be suitable for limited scenarios [9]. Dynamic Bayesian networks [2] and deep neural networks [4] are powerful in inferring the hidden states of traffic behavior, but the states must lie in a fixed space. Inverse reinforcement learning (IRL) can model the decision-making process in multi-vehicle interaction behaviors [1], but it may require prior knowledge to design appropriate reward functions. Moreover, IRL is a deterministic approach that only pursues optimal motions. Therefore, these approaches cannot address situations with a time-varying number of interacting vehicles or complicated scenarios without explicit traffic rules.
Towards this end, our proposed general framework (Fig. 1) is capable of analyzing multi-vehicle interaction behaviors from complicated traffic videos without any traffic controls, e.g., traffic lights and lane markers, and we validate this framework on the Meskel Square Dataset. In this framework, we first extract the position and velocity of vehicles from videos by implementing a detection-based tracking algorithm that integrates You Only Look Once (YOLOv3) [10] with optical flow. However, the number of vehicles changes from frame to frame, which poses a major challenge for defining interaction patterns. To overcome this issue, we then utilize a Gaussian velocity field to represent the spatial interaction patterns in each frame and apply a deep autoencoder to encode this high-dimensional representation into a low-dimensional latent space. Finally, we use Bayesian nonparametric learning to cluster the temporal interaction patterns from the encoded latent features. The main contributions are as follows:
Providing a general framework to extract interaction patterns from traffic videos, which can be applied to any multi-agent interactions in videos.

Employing Bayesian nonparametrics to learn interaction patterns in the temporal space automatically.
In addition, this developed framework can also provide a potential way to obtain high-quality data sources from traffic videos for multi-agent interaction analysis.
II Multi-Vehicle Interaction Modeling
In this section, we introduce the Gaussian velocity field for describing the multi-vehicle interactions in each video frame, the autoencoder for learning representations of the Gaussian velocity field, and Bayesian nonparametrics for clustering interaction patterns.
II-A Gaussian Velocity Field
In most traffic ROIs, e.g., intersections, the number of vehicles varies randomly over time, which yields a time-varying dimension problem. There are two ways to deal with this problem: 1) fixing the dimension of the extracted features that describe multi-vehicle interactions, or 2) fixing the number of surrounding vehicles as a constant. Here we learn fixed-dimension features under the assumption that the ego vehicle can perceive the position and relative velocity of its nearby vehicles. Modeling motion patterns as velocity fields rather than trajectories is a robust way to group vehicles that share similar motion characteristics. Therefore, we employ a velocity field [13, 14] to model the interaction patterns of surrounding vehicles with respect to the ego vehicle.
We model the velocity field as a Gaussian process, following Joseph et al. [13]. Given the observed vehicle locations $\{(x_i, y_i)\}_{i=1}^{N}$, their trajectory derivatives $\{(\Delta x_i/\Delta t, \Delta y_i/\Delta t)\}_{i=1}^{N}$ are jointly distributed according to a Gaussian distribution with mean velocity $(\mu_x, \mu_y)$ and covariance $(\Sigma_x, \Sigma_y)$, where $\Sigma_x$ and $\Sigma_y$ denote the covariance in the $x$ and $y$ directions. In the $x$ direction, the components of $\Sigma_x$ can be expressed as $\Sigma_x^{(i,j)} = K_x(x_i, y_i, x_j, y_j)$, where $K_x$ is the standard squared-exponential covariance function

$K_x(x, y, x', y') = \sigma_x^2 \exp\left( -\frac{(x - x')^2}{2\omega_x^2} - \frac{(y - y')^2}{2\omega_y^2} \right)$,  (1)

$\sigma_x$ is the amplification factor, and $\omega_x$ and $\omega_y$ are the length-scale factors governing the range of each vehicle's influence.
Given a new position $(x^*, y^*)$, the predictive distribution over the trajectory derivative $\Delta x^*/\Delta t$ via a Gaussian process can be computed by

$\mu_{\Delta x^*/\Delta t} = K_x(x^*, y^*, X, Y)\, K_x(X, Y, X, Y)^{-1}\, \frac{\Delta X}{\Delta t}$,  (2)

$\sigma^2_{\Delta x^*/\Delta t} = K_x(x^*, y^*, x^*, y^*) - K_x(x^*, y^*, X, Y)\, K_x(X, Y, X, Y)^{-1}\, K_x(X, Y, x^*, y^*)$,  (3)

where $X$ and $Y$ denote the positions of all surrounding vehicles at time $t$. The calculation in the $y$ direction follows the same procedure using $K_y$. Figure 2 illustrates the Gaussian velocity field schematics with the relative velocities $\Delta x/\Delta t$ and $\Delta y/\Delta t$.
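As an illustration, the predictive mean in (2) can be sketched in a few lines of NumPy. This is a minimal toy sketch, not the paper's implementation; the parameters sigma, wx, and wy stand in for the amplification and length-scale factors, and all numeric values are illustrative.

```python
import numpy as np

def se_kernel(P, Q, sigma=1.0, wx=1.0, wy=1.0):
    """Squared-exponential covariance between two sets of 2-D positions."""
    dx = P[:, 0][:, None] - Q[:, 0][None, :]
    dy = P[:, 1][:, None] - Q[:, 1][None, :]
    return sigma**2 * np.exp(-dx**2 / (2 * wx**2) - dy**2 / (2 * wy**2))

def gp_velocity(P_obs, v_obs, P_new, sigma=1.0, wx=1.0, wy=1.0, noise=1e-6):
    """Predictive mean of one velocity component at query positions P_new,
    given observed positions P_obs (N x 2) and velocities v_obs (N,)."""
    K = se_kernel(P_obs, P_obs, sigma, wx, wy) + noise * np.eye(len(P_obs))
    K_star = se_kernel(P_new, P_obs, sigma, wx, wy)
    return K_star @ np.linalg.solve(K, v_obs)

# Two observed vehicles moving right at 1 m/s; query a point between them.
P_obs = np.array([[0.0, 0.0], [4.0, 0.0]])
v_obs = np.array([1.0, 1.0])
mu = gp_velocity(P_obs, v_obs, np.array([[2.0, 0.0]]), wx=2.0, wy=1.0)
```

Querying the field near an observed vehicle's position approximately recovers its observed velocity, while intermediate positions are smoothly interpolated.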
II-B Autoencoders
An autoencoder is composed of an encoder $f$ and a decoder $g$. It reconstructs its input $x$ and sets the target values of the output to $\hat{x} = g(f(x)) \approx x$, where $f$ and $g$ are the activation functions. A deep autoencoder is then formed by hierarchically stacking multiple autoencoders, treating the latent features in the middle hidden layer of one autoencoder as the input of the next.

II-C Traffic Primitives
II-C1 Definition
We define traffic primitives as the fundamental building blocks of multi-vehicle interactions that shape complicated scenarios. More specifically, traffic primitives are regarded as non-overlapping segments of the time series of traffic observations with respect to interaction patterns. In this way, each traffic primitive represents an essential interaction behavior, and various complex scenarios are composed of several primitives.
II-C2 Traffic Primitive Extraction
The problem of multi-vehicle interaction is complicated and has been the subject of few studies, so setting a reasonable number of patterns in advance is tricky. We utilize Bayesian nonparametrics to learn the traffic primitives automatically, which does not require setting the number of traffic primitives initially. A Bayesian nonparametric model is a Bayesian model on an infinite-dimensional space which assumes that the number of mixture-model components increases as more observations arrive [15], in contrast to approaches such as k-means in which the cluster number is predefined. In our research, the number of interaction patterns is formulated by a hierarchical Dirichlet process (HDP), which can optimize the number of interaction patterns. The relationship between sequential frames in the recorded video is formulated by a hidden semi-Markov model (HSMM). The combination of the HDP and the HSMM, called the HDP-HSMM, can then automatically partition the traffic time series into segments, called traffic primitives. In what follows, we introduce the basic theoretical concepts of the HDP-HSMM.
The state transition process of the interaction among multiple vehicles can be modeled as a probabilistic inference process. For each state, a random state duration is drawn from a state-specific distribution, which is given a Poisson prior. The hidden state $z_s$ of the HSMM represents the traffic primitive, and $y_t$ is the observed data at time $t$ [16]. The duration $d_s$ of the primitive is a random variable; the segment is entered at time $t_s$, and $g(\omega_{z_s})$ is the duration probability mass function. The HSMM can be interpreted as a Markov chain (without self-transitions) on a finite primitive set $\mathcal{Z}$. The transition probability from primitive $i$ to $j$ is defined as $\pi_{ij}$, where $K$ is the size of the primitive set $\mathcal{Z}$. The likelihood of the observations given the current primitive and the emission parameter $\theta_i$ of mode $i$ is defined as $F(\theta_i)$. Thus, we can describe the HSMM as

$z_s \sim \pi_{z_{s-1}}$,  (4a)
$d_s \sim g(\omega_{z_s})$,  (4b)
$y_{t_s : t_s + d_s - 1} \sim F(\theta_{z_s})$,  (4c)
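To make this generative process concrete, the following toy NumPy simulation draws a primitive sequence without self-transitions, Poisson-distributed durations, and Gaussian emissions, mirroring the HSMM structure above. All parameter values (three primitives, their duration rates, and emission means) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-primitive HSMM: transition matrix with zero diagonal
# (no self-transitions), Poisson duration rates, Gaussian emission means.
A = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
lam = np.array([5.0, 8.0, 3.0])   # duration rate, one per primitive
mu = np.array([-2.0, 0.0, 2.0])   # emission mean, one per primitive

def sample_hsmm(T):
    """Sample hidden primitives and observations of total length T."""
    states, obs = [], []
    z = rng.integers(3)
    while len(obs) < T:
        d = 1 + rng.poisson(lam[z])            # duration of this segment
        states.extend([z] * d)
        obs.extend(rng.normal(mu[z], 0.5, d))  # emissions for the segment
        z = rng.choice(3, p=A[z])              # next primitive, z' != z
    return np.array(states[:T]), np.array(obs[:T])

states, obs = sample_hsmm(200)
```

Each run of identical states is one segment; inference with the HDP-HSMM inverts exactly this process to recover segment boundaries and labels from observations alone.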
where $F(\cdot)$ is the emission function. The HDP can be formed by the stick-breaking construction as

$v_k \sim \mathrm{Beta}(1, \gamma)$, $\beta_k = v_k \prod_{l<k} (1 - v_l)$,  (5a)
$\pi_j \sim \mathrm{DP}(\alpha, \beta)$,  (5b)
$\theta_k \sim H$,  (5c)
$z_{ji} \sim \pi_j$,  (5d)
$y_{ji} \sim F(\theta_{z_{ji}})$,  (5e)
where $z_{ji}$ denotes the latent variables. By placing an HDP prior over the infinite transition matrices of the HSMM, a robust architecture, the HDP-HSMM [17] (Fig. 3), can be obtained:

$\beta \sim \mathrm{GEM}(\gamma)$,  (6a)
$\pi_i \sim \mathrm{DP}(\alpha, \beta)$, $i = 1, 2, \dots$,  (6b)
$(\theta_i, \omega_i) \sim H \times \Omega$, $i = 1, 2, \dots$,  (6c)
$z_s \sim \bar{\pi}_{z_{s-1}}$,  (6d)
$d_s \sim g(\omega_{z_s})$,  (6e)
$x_{t_s : t_s + d_s - 1} = z_s$,  (6f)
$y_{t_s : t_s + d_s - 1} \sim F(\theta_{x_t})$,  (6g)

where $t_s$ is the start time of segment $s$, and $\bar{\pi}_{ij} := \frac{\pi_{ij}}{1 - \pi_{ii}} (1 - \delta_{ij})$ is added to eliminate self-transitions in the super-state sequence $(z_s)$.
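The GEM (stick-breaking) prior used in the HDP can be sketched as a truncated NumPy draw; the truncation level K and the concentration value below are illustrative.

```python
import numpy as np

def stick_breaking(gamma, K, rng):
    """Truncated draw of beta ~ GEM(gamma):
    beta_k = v_k * prod_{l<k} (1 - v_l),  v_k ~ Beta(1, gamma)."""
    v = rng.beta(1.0, gamma, size=K)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining

rng = np.random.default_rng(1)
beta = stick_breaking(gamma=4.0, K=100, rng=rng)
```

A larger concentration gamma spreads mass over more components, which is how the HDP lets the effective number of interaction patterns grow with the data instead of being fixed in advance.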
II-C3 Learning Procedure
We adopt a weak-limit Gibbs sampling algorithm for the HDP-HSMM [16]; the duration variables are drawn from a Poisson distribution. The observations are generated from a Gaussian emission model, with parameter settings taken according to [18]. The hyperparameters $\gamma$ and $\alpha$ are drawn from a gamma prior, and the hyperparameters for the emission parameters are determined by an Inverse-Wishart (IW) prior [19].

III Experimental Setup for Data Collection
In this section, to test the capability of the method, we introduce a cluttered scenario, the Meskel Square Dataset (see https://youtu.be/UEIn8GJIg0E), of a kind that rarely occurs in regular traffic. We present a detection-based method to track dense moving objects in this low-resolution video, and then transform the tracked bounding boxes into the corresponding bird's-eye view.
III-A Object Tracking
In order to extract the position and velocity of vehicles, two key procedures are necessary: 1) detection, i.e., recognizing objects and tagging them with bounding boxes in each frame, and 2) tracking, i.e., matching the same objects across successive frames. Many approaches build on optical flow, as it can capture motion and segment objects [20]. However, the capacity of optical flow to detect static objects is limited. We therefore use YOLOv3 [10], a powerful detection toolbox based on convolutional neural networks.
Given a set of bounding boxes without IDs, a straightforward way to track is to match the same bounding boxes between adjacent frames using similarity measurements, e.g., Euclidean distance and overlapping ratio. The displacement of the same object between consecutive frames should be very small if the speed of the objects is sufficiently low, implying that a point in frame $t$ has a high probability of matching the closest point in frame $t+1$. Here we use the Euclidean distance (ED) to gauge similarity.
Directly minimizing the ED may lead to incorrect results in some situations: for example, when points in a frame are close to each other (Fig. 4), a point may be matched with the wrong neighbor. To overcome this limitation, we have developed a matching algorithm based on movement prediction. The optical flow method, which calculates the velocities of moving brightness patterns, provides motion trends that can be used to predict the approximate upcoming position of each point. With this optical-flow prediction of each object's position in frame $t+1$, the tracking method in Fig. 4 makes the correct matches. In addition, the state and duration of each vehicle ID are also considered. An object is marked inactive if no bounding box is matched in the next frame, and we stop updating its ID once the duration of the inactive state exceeds a predefined threshold.
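The matching step can be sketched as a greedy nearest-neighbor association on flow-predicted positions. This is a simplified stand-in for the matching algorithm described above (function names and thresholds are ours), but it reproduces the key behavior: with zero flow, two nearby crossing objects get swapped, while the flow-predicted positions resolve the ambiguity.

```python
import numpy as np

def match_tracks(prev_pts, flow, curr_pts, max_dist=30.0):
    """Greedily associate previous detections with current ones.
    prev_pts: (N, 2) centers in frame t; flow: (N, 2) per-object optical-flow
    displacement; curr_pts: (M, 2) centers in frame t+1.
    Returns a list of (prev_index, curr_index) matches."""
    pred = prev_pts + flow                     # predicted positions in frame t+1
    d = np.linalg.norm(pred[:, None, :] - curr_pts[None, :, :], axis=2)
    matches, used = [], set()
    for i in np.argsort(d.min(axis=1)):        # most confident tracks first
        for j in np.argsort(d[i]):
            if j not in used and d[i, j] <= max_dist:
                matches.append((int(i), int(j)))
                used.add(int(j))
                break
    return matches

# Two nearby objects crossing: A at (10,10) moves right, B at (12,10) moves left.
prev_pts = np.array([[10.0, 10.0], [12.0, 10.0]])
flow = np.array([[5.0, 0.0], [-5.0, 0.0]])
curr_pts = np.array([[15.0, 10.0], [7.0, 10.0]])
matches = match_tracks(prev_pts, flow, curr_pts)
```

Running the same matcher with zero flow assigns each object to its neighbor's new position, illustrating why raw ED matching fails for close crossing objects.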
III-B Bird's-Eye View
As introduced above, we set a fixed ED threshold to determine whether a new ID will be assigned to a bounding box. However, the position and orientation of the camera skew the ground slightly, so the size and velocity of objects are biased differently over different regions. Hence, we apply a perspective transformation, manually picking a rectangle as the reference points, to obtain a bird's-eye-view video that removes such biases, as shown in Fig. 5. The most frequent interaction behaviors occur in the red shaded area, so we set this area as the ROI and fix the detection threshold of YOLOv3 accordingly (Fig. 6). Large bounding boxes covering more than a quarter of the screen are removed. Furthermore, we filter on the overlapping ratio between bounding boxes within one frame to avoid the redundant detections caused by the low threshold. Figure 6 shows the filtered results, which are recorded as sets of time series containing the coordinates and IDs of all vehicles in each frame.

IV Results and Analysis
In this section, we present and analyze the experimental results on the Meskel Square Dataset and the NGSIM US Highway Dataset [5].

IV-A Experimental Results

IV-A1 Gaussian Velocity Field
Our focus is mainly on the vehicles that survive longer than a minimum number of frames with at least one surrounding vehicle, which allows us to find challenging and informative interaction patterns. Objects within a fixed radius of each ego vehicle are considered, and the surrounding environment of each vehicle is modeled by extracting relative velocities using the coordinate transformation matrix. Human drivers' decision-making is more sensitive to vehicles ahead than to vehicles on either side, and the lateral speed is much lower than the longitudinal speed in most scenarios; the parameters $\sigma_x$, $\omega_x$, and $\omega_y$ of the Gaussian process are set accordingly. The Gaussian velocity field is expressed as a set of three-dimensional matrices, where the first two dimensions index the $x$ and $y$ positions within the ROI range and the third holds the velocity field in the $x$ and $y$ directions.
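The coordinate transformation into the ego frame can be sketched as a rotation of relative positions and velocities, assuming each ego vehicle's heading is known; the function name and all values here are illustrative.

```python
import numpy as np

def to_ego_frame(ego_pos, ego_heading, pts, vels, ego_vel):
    """Express surrounding vehicles' positions and velocities relative to the
    ego vehicle, rotated so the ego heading points along +x."""
    c, s = np.cos(-ego_heading), np.sin(-ego_heading)
    R = np.array([[c, -s], [s, c]])            # rotation by -heading
    rel_pos = (pts - ego_pos) @ R.T
    rel_vel = (vels - ego_vel) @ R.T
    return rel_pos, rel_vel

# Ego at the origin heading "north" (90 deg); a car 10 m to the north is
# directly ahead and closing velocity is expressed along the ego's +x axis.
rel_p, rel_v = to_ego_frame(
    ego_pos=np.array([0.0, 0.0]), ego_heading=np.pi / 2,
    pts=np.array([[0.0, 10.0]]), vels=np.array([[0.0, 8.0]]),
    ego_vel=np.array([0.0, 5.0]))
```

After this transformation, "ahead" is always +x for every ego vehicle, so the Gaussian velocity fields of different vehicles live in a common frame.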
IV-A2 Representation Learning
We build a deep autoencoder with a symmetric structure using fully connected layers. The deep autoencoder is trained to map the Gaussian velocity field to the hidden representations (i.e., the middle hidden layer). More specifically, the number of hidden units in each encoder layer decreases from the size of the input layer down to the size of the hidden representation, and the decoder mirrors this structure.

IV-A3 Traffic Primitives
We use the latent features to represent the relative velocity field of the surrounding environment and the ego-vehicle states, and use the velocity and acceleration to depict the decisions of the ego vehicle. We focus on the first group of typical vehicles and analyze the vehicles that encounter complex traffic and struggle for a long duration. In addition, we neglect the vehicles that survive fewer than a minimum number of frames or lack enough surrounding vehicles. The total duration in frames represents the summation of the durations of all ego vehicles and differs from the number of video-clip frames; each input is composed of the hidden representations together with the ego vehicle's velocity and acceleration. We train the HDP-HSMM for a number of epochs using pyhsmm [17]. Figure 7 shows nine primitive interaction patterns extracted via our proposed framework. The learned traffic primitives convey scenarios of passing through the opposing traffic flows, encountering traffic flows from other directions, being stuck in perpendicular traffic flows, and being cut off by other vehicles from the traffic flow being followed.
IV-B Experiment on the NGSIM Dataset
Considering the differences between scenarios at intersections and on highways, we set the parameters of the Gaussian process separately for this dataset. In this experiment, we skip the first vehicles, since the traffic is sparse at the very beginning of this dataset, and show the results for the subsequent vehicle IDs. Eight representative interaction patterns on the highway are finally learned, as shown in Fig. 8. These interaction patterns include fundamental highway scenarios, such as overtaking, being overtaken by surrounding vehicles, and lane changing.
V Conclusion
In this paper, we demonstrated a flexible way to analyze complicated multi-vehicle interaction patterns based on traffic primitives extracted from recorded traffic videos. A generic framework was developed, composed of three main modules: object tracking, representation learning, and interaction pattern clustering. The proposed framework can be extended to other scenarios with multiple agents involved. In the tracking module, we combined bounding-box identification with optical-flow estimation to improve the position tracking and velocity estimation performance. In the representation learning module, we introduced Gaussian velocity fields to model the cluttered scene in which the number of vehicles changes over time, and then learned low-dimensional latent features of the velocity field with a deep autoencoder. Finally, in the interaction pattern clustering module, we fed the combination of the velocity fields and the decisions of the ego vehicle into a Bayesian nonparametric model to extract and cluster the interaction patterns in the temporal space. Our experimental results on two different datasets show the appealing performance of the proposed framework in extracting semantically meaningful multi-vehicle interaction patterns. Such performance is desirable in the context of naturalistic driving analysis, especially in highly dynamic scenes. Furthermore, the resulting information-rich representation allows in-depth investigations for various autonomous driving applications, including reliable environment recognition, efficient scene understanding, and tractable safety evaluation.
Acknowledgment
Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
References
 [1] D. S. González, V. RomeroCano, J. S. Dibangoye, and C. Laugier, “Interactionaware driver maneuver inference in highways using realistic driver models,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017, pp. 1–8.
 [2] T. Gindele, S. Brechtel, and R. Dillmann, “A probabilistic model for estimating driver behaviors and vehicle trajectories in traffic environments,” in 2010 IEEE 13th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2010, pp. 1625–1631.
 [3] J. Li, H. Ma, W. Zhan, and M. Tomizuka, “Generic probabilistic interactive situation recognition and prediction: From virtual to real,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 3218–3224.
 [4] A. Sarkar, K. Czarnecki, M. Angus, C. Li, and S. Waslander, “Trajectory prediction of traffic agents at urban intersections through learned interactions,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017, pp. 1–8.
 [5] Next Generation Simulation (NGSIM) dataset, https://www.fhwa.dot.gov/publications/research/operations/07030/index.cfm.
 [6] QMUL Junction dataset, http://personal.ie.cuhk.edu.hk/~ccloy/downloads_qmul_junction.html.

 [7] X. Wang, X. Ma, and E. Grimson, “Unsupervised activity perception by hierarchical Bayesian models,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2007, pp. 1–8.
 [8] W. Hu, X. Xiao, Z. Fu, D. Xie, T. Tan, and S. Maybank, “A system for learning statistical motion patterns,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 9, pp. 1450–1464, 2006.
 [9] S. Klingelschmitt, F. Damerow, V. Willert, and J. Eggert, “Probabilistic situation assessment framework for multiple, interacting traffic participants in generic traffic scenes,” in 2016 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2016, pp. 1141–1148.
 [10] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
 [11] W. Wang, A. Ramesh, and D. Zhao, “Clustering of driving scenarios using connected vehicle trajectories,” arXiv preprint arXiv:1807.08415, 2018.
 [12] W. Wang, W. Zhang, and D. Zhao, “Understanding v2v driving scenarios through traffic primitives,” arXiv preprint arXiv:1807.10422, 2018.
 [13] J. Joseph, F. DoshiVelez, A. S. Huang, and N. Roy, “A bayesian nonparametric approach to modeling motion patterns,” Auto. Rob., vol. 31, no. 4, p. 383, 2011.
 [14] G. S. Aoude, B. D. Luders, J. M. Joseph, N. Roy, and J. P. How, “Probabilistically safe motion planning to avoid dynamic obstacles with uncertain motion patterns,” Auto. Rob., vol. 35, no. 1, pp. 51–76, 2013.

 [15] P. Orbanz and Y. W. Teh, “Bayesian nonparametric models,” in Encyclopedia of Machine Learning. Springer, 2011, pp. 81–89.
 [16] M. J. Johnson and A. S. Willsky, “Bayesian nonparametric hidden semi-Markov models,” J. of Mach. Learn. Res., vol. 14, no. Feb, pp. 673–701, 2013.
 [17] M. J. Johnson, “Bayesian time series models and scalable inference,” Ph.D. dissertation, Massachusetts Institute of Technology, 2014.
 [18] R. Hamada, T. Kubo, K. Ikeda, Z. Zhang, T. Shibata, T. Bando, K. Hitomi, and M. Egawa, “Modeling and prediction of driving behaviors using a nonparametric bayesian method with ar models,” IEEE Trans. Intell. Veh., vol. 1, no. 2, pp. 131–138, 2016.
 [19] W. Wang, J. Xi, and D. Zhao, “Driving style analysis using primitive driving patterns with bayesian nonparametric approaches,” IEEE Transactions on Intelligent Transportation Systems, DOI:10.1109/TITS.2018.2870525.
 [20] N. Dalal, B. Triggs, and C. Schmid, “Human detection using oriented histograms of flow and appearance,” in European conference on computer vision. Springer, 2006, pp. 428–441.