A General Framework of Learning Multi-Vehicle Interaction Patterns from Videos

07/17/2019 ∙ by Chengyuan Zhang, et al. ∙ Carnegie Mellon University

Semantic learning and understanding of multi-vehicle interaction patterns in a cluttered driving environment are essential but challenging for autonomous vehicles to make proper decisions. This paper presents a general framework to gain insights into intricate multi-vehicle interaction patterns from bird's-eye view traffic videos. We adopt a Gaussian velocity field to describe the time-varying multi-vehicle interaction behaviors and then use deep autoencoders to learn associated latent representations for each temporal frame. We then utilize a hidden semi-Markov model with a hierarchical Dirichlet process prior to segment these sequential representations into granular components, also called traffic primitives, corresponding to interaction patterns. Experimental results demonstrate that our proposed framework can extract traffic primitives from videos, thus providing a semantic way to analyze multi-vehicle interaction patterns, even for cluttered driving scenarios far messier than humans can readily interpret.




I Introduction

Semantic understanding of the dynamic interactions with their surroundings allows autonomous vehicles to make accurate real-time predictions and thus proper decisions. Achieving this is essential but challenging because real-world driving scenarios are complex. Approaches targeting a specific simplified scenario or maneuver, such as lane-keeping, lane-changing, and merging on the highway [1, 2, 3, 4], have been proposed to study the interactions among multiple traffic participants. However, real-world traffic scenarios, such as intersections, are far more complicated. Here we propose a general framework to investigate how experienced human drivers interact with surrounding vehicles. This framework is capable of modeling interactions and learning interactive patterns from cluttered traffic scenarios. We anticipate this framework to be a starting point for the analysis of more sophisticated real-world traffic scenarios, including reliable environment recognition, efficient scene understanding, and tractable safety evaluation.

The recorded traffic videos in a specific region-of-interest (ROI) contain informative multi-vehicle interaction patterns. The Next-Generation Simulation (NGSIM) dataset [5] is one of the most widely used; it was recorded on highways with explicit traffic rules, e.g., clear lane markers and well-controlled traffic lights. Some public datasets are also provided in [6, 7, 8]. In these datasets, the traffic signals may induce simple traffic patterns, but the findings may not extend to more cluttered traffic scenarios without traffic lights and lane markers. In order to study interaction patterns in cluttered urban traffic, we analyzed videos recorded over an unsignalized junction. In this paper, we provide a general framework capable of analyzing the more complicated interaction patterns in the Meskel Square Dataset, which has no traffic lights or clear lane markers. This dataset comprehensively reveals how human drivers interact with and react to each other in cluttered intersection scenarios without any external control, such as traffic lights.

Many approaches have been used to model interactions among multiple traffic participants. Model-based methods provide a straightforward understanding but may only be suitable for limited scenarios [9]. Dynamic Bayesian networks and deep neural networks are powerful at inferring hidden states of traffic behavior, but the states must lie in a fixed space. Inverse reinforcement learning (IRL) can model the decision-making process in multi-vehicle interaction behaviors [1], but it may require prior knowledge to design appropriate reward functions. Moreover, IRL is a deterministic approach and pursues only optimal motions. Therefore, these approaches cannot address situations with a time-varying number of interacting vehicles or complicated scenarios without explicit traffic rules.

Towards this end, our proposed general framework (Fig. 1) is capable of analyzing multi-vehicle interaction behaviors from complicated traffic videos without any traffic controls, e.g., traffic lights and lane markers, and we validate it on the Meskel Square Dataset. In this framework, we first extract the position and velocity of vehicles from videos by implementing a detection-based tracking algorithm that integrates You Only Look Once (YOLOv3) [10] with optical flow. However, the number of vehicles changes from frame to frame, which makes defining interaction patterns a top challenge. To overcome this issue, we utilize a Gaussian velocity field to represent the spatial interaction patterns in each frame and apply a deep autoencoder to encode these high-dimensional representations into a low-dimensional latent space. Finally, we use Bayesian nonparametric learning to cluster the temporal interaction patterns from the encoded latent features. The main contributions are as follows:

  1. Providing a general framework to extract interaction patterns from traffic videos, which can be applied to any multi-agent interaction captured on video.

  2. Introducing a Gaussian velocity field to describe interaction patterns in the spatial space, whose dimensions are invariant to the number of vehicles in the ROI, and thus can adapt to various scenarios from two vehicles [11, 12] to multiple vehicles.

  3. Employing Bayesian nonparametrics to learn interaction patterns in the temporal space automatically.

In addition, this framework offers a potential way to obtain high-quality data from traffic videos for multi-agent interaction analysis.

The remainder of this paper is structured as follows. Section II introduces the essentials of our proposed framework. Section III presents the details of motion extraction from videos. Section IV analyzes experimental results. Section V concludes with further discussion.

II Multi-Vehicle Interaction Modeling

In this section, we introduce the Gaussian velocity field for describing the multi-vehicle interactions in each video frame, the autoencoder for learning representations of the Gaussian velocity field, and Bayesian nonparametrics for clustering interaction patterns.

II-A Gaussian Velocity Field

In most traffic ROIs, e.g., intersections, the number of vehicles varies randomly over time, which yields a time-varying dimension problem. There are two ways to deal with this problem: 1) fixing the dimension of the extracted features that describe multi-vehicle interactions, or 2) fixing the number of surrounding vehicles as a constant. Here we learn fixed-dimension features under the assumption that the ego vehicle can perceive the position and relative velocity of its nearby vehicles. Modeling motion patterns as velocity fields rather than trajectories is a robust way to group vehicles that share similar motion characteristics. Therefore, we employ a velocity field [13, 14] to model the interaction patterns of surrounding vehicles with respect to the ego vehicle.

We model the velocity field as a Gaussian process, following Joseph et al. [13]. Given the observed locations of vehicles $\{(x_i, y_i)\}_{i=1}^{N}$, their trajectory derivatives $\{(\Delta x_i/\Delta t, \Delta y_i/\Delta t)\}_{i=1}^{N}$ are jointly distributed according to a Gaussian distribution with mean velocity $(\mu_x, \mu_y)$ and covariances $K_x$ and $K_y$, where $K_x$ and $K_y$ denote the covariance in the $x$ and $y$ direction, respectively. In the $x$ direction, the components of $K_x$ can be expressed as

$K_x(x, y, x', y') = \sigma_x^2 \exp\!\left( -\frac{(x - x')^2}{2 l_x^2} - \frac{(y - y')^2}{2 l_y^2} \right),$

which is the standard squared exponential covariance function, where $\sigma_x$ is the amplification factor, and $l_x$ and $l_y$ are the length-scale factors governing the range of impact on each vehicle.

Fig. 2: The Gaussian velocity field of multi-vehicle scenarios.

Given a new position $(x^*, y^*)$, the predictive distribution over the trajectory derivative $\Delta x^*/\Delta t$ via a Gaussian process can be computed by

$\mathbb{E}\!\left[ \frac{\Delta x^*}{\Delta t} \right] = K_x(x^*, y^*, X, Y)\, K_x(X, Y, X, Y)^{-1}\, \frac{\Delta X}{\Delta t},$

where $X$ and $Y$ denote the positions of all surrounding vehicles at time $t$. The calculation of $\Delta y^*/\Delta t$ is similar to the procedure above, using $K_y$. Figure 2 illustrates the Gaussian velocity field schematics with the relative velocities $\Delta x/\Delta t$ and $\Delta y/\Delta t$.
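The posterior mean above can be sketched numerically. The following is a minimal numpy illustration of one velocity component predicted under the squared-exponential kernel; the vehicle positions, velocities, noise level, and kernel parameters are toy values for illustration, not those used in the paper.

```python
import numpy as np

def se_kernel(P, Q, sigma=1.0, lx=5.0, ly=5.0):
    """Squared-exponential covariance between position sets P (N x 2) and Q (M x 2)."""
    dx = P[:, None, 0] - Q[None, :, 0]
    dy = P[:, None, 1] - Q[None, :, 1]
    return sigma**2 * np.exp(-dx**2 / (2 * lx**2) - dy**2 / (2 * ly**2))

def gp_velocity(query, positions, velocities, noise=1e-4, **kp):
    """GP posterior mean of one velocity component (e.g. dx/dt) at `query`,
    given observed vehicle positions and that component of their velocities."""
    K = se_kernel(positions, positions, **kp) + noise * np.eye(len(positions))
    Ks = se_kernel(query, positions, **kp)
    return Ks @ np.linalg.solve(K, velocities)

# Three observed vehicles: relative positions (m) and x-velocity components (m/s).
pos = np.array([[0.0, 0.0], [4.0, 1.0], [-3.0, 2.0]])
vx = np.array([5.0, 4.0, 6.0])
q = np.array([[1.0, 0.5]])
print(gp_velocity(q, pos, vx))  # interpolated x-velocity near the observations
```

The $y$ component would be predicted the same way with its own kernel parameters, matching the paper's use of separate covariances $K_x$ and $K_y$.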

II-B Autoencoders

An autoencoder is composed of an encoder $f$ and a decoder $g$. It reconstructs its input $\mathbf{x}$ as $\hat{\mathbf{x}} = g(f(\mathbf{x}))$ and sets the target values of the output to be equal to $\mathbf{x}$, where $f$ and $g$ are the activation functions. A deep autoencoder is then formed by hierarchically stacking multiple autoencoders, treating the latent features in the middle hidden layer of the $i$-th autoencoder as the input of the $(i+1)$-th autoencoder.
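To make the reconstruction objective concrete, here is a minimal numpy sketch of a single autoencoder layer trained by gradient descent on synthetic data standing in for flattened velocity-field frames; the data, layer sizes, and learning rate are illustrative, and the paper's actual model stacks several such layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for flattened velocity-field frames: 200 samples, 16 dims.
X = rng.normal(size=(200, 16))

d_in, d_hid, lr = 16, 4, 0.05
W1 = rng.normal(scale=0.1, size=(d_in, d_hid))   # encoder weights (f)
W2 = rng.normal(scale=0.1, size=(d_hid, d_in))   # decoder weights (g)

def forward(X):
    H = np.tanh(X @ W1)        # latent features (middle hidden layer)
    return H, H @ W2           # linear reconstruction x-hat

_, X0 = forward(X)
err0 = np.mean((X0 - X) ** 2)  # reconstruction error before training

for _ in range(500):           # plain batch gradient descent on squared error
    H, Xhat = forward(X)
    G = 2 * (Xhat - X) / len(X)
    gW2 = H.T @ G
    gW1 = X.T @ ((G @ W2.T) * (1 - H**2))  # backprop through tanh
    W1 -= lr * gW1
    W2 -= lr * gW2

_, X1 = forward(X)
err1 = np.mean((X1 - X) ** 2)
print(err0, err1)              # reconstruction error drops after training
```

Stacking follows the same pattern: the latent matrix `H` of one trained layer becomes the training input of the next.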

II-C Traffic Primitives

II-C1 Definition

We define traffic primitives as the fundamental building blocks of the multi-vehicle interactions that shape complicated scenarios. More specifically, traffic primitives are the non-overlapping segments of the time-series traffic observations with respect to interaction patterns. In this way, each traffic primitive represents an essential interaction behavior, and various complex scenarios are composed of several primitives.

II-C2 Traffic Primitive Extraction

Multi-vehicle interaction is a complicated problem on which few studies have been done, so setting a reasonable number of patterns in advance is tricky. We utilize Bayesian nonparametrics to learn the traffic primitives automatically, which does not require setting the number of traffic primitives initially. A Bayesian nonparametric model is a Bayesian model on an infinite-dimensional space which assumes the number of mixture components grows as more observations arrive [15], in contrast to approaches such as $k$-means in which the cluster number is predefined. In our research, the number of interaction patterns is formulated by a hierarchical Dirichlet process (HDP), which can optimize the number of interaction patterns. The relationship between sequential frames in the recorded video is formulated by a hidden semi-Markov model (HSMM). The combination of the HDP and the HSMM, called the HDP-HSMM, can then automatically segment the traffic time series into segments, called traffic primitives. In what follows, we introduce the basic theoretical concepts of the HDP-HSMM.

Fig. 3: Graphical model of HDP-HSMM.

The state transition process of the interaction among multiple vehicles can be modeled as a probabilistic inference process. For each state, a random state duration is drawn from a state-specific distribution, which is set as a Poisson prior. The hidden state $z_s$ of the HSMM represents the traffic primitive, and $y_t$ is the observed data at time $t$ [16]. The duration $D_s$ of the primitive $z_s$, which is entered at time $t_s$, is a random variable, and $g(\omega_{z_s})$ is its probability mass function. The HSMM can be interpreted as a Markov chain (without self-transitions) on a finite primitive set $\mathcal{Z} = \{1, \dots, K\}$. The transition probability from primitive $i$ to $j$ is defined as $\pi_{ij}$, where $K$ is the size of the primitive set $\mathcal{Z}$. The probability of the observations given the current primitive $z_s$ and the emission parameter $\theta_{z_s}$ of mode $z_s$ is defined as $p(y_t \mid z_s, \theta_{z_s})$. Thus, we can describe the HSMM as

$z_s \sim \pi_{z_{s-1}}, \quad D_s \sim g(\omega_{z_s}), \quad y_{t_s : t_s + D_s - 1} \sim F(\theta_{z_s}),$

where $F$ is the emission function. The HDP can be formed by a stick-breaking construction as

$\beta \mid \gamma \sim \mathrm{GEM}(\gamma), \quad \pi_j \mid \alpha, \beta \sim \mathrm{DP}(\alpha, \beta), \quad \theta_j \sim H, \quad j = 1, 2, \dots,$

where $\theta_j$ denotes the latent variables. By placing an HDP prior over the infinite transition matrices of the HSMM, a robust architecture, the HDP-HSMM [17] (Fig. 3), can be obtained:

$z_s \sim \bar{\pi}_{z_{s-1}}, \quad D_s \sim g(\omega_{z_s}), \quad y_{t_s : t_s + D_s - 1} \sim F(\theta_{z_s}),$

where $\bar{\pi}_{ij} := \frac{\pi_{ij}}{1 - \pi_{ii}}\,(1 - \delta_{ij})$, $\delta_{ij}$ is the Kronecker delta, and $\bar{\pi}$ is added to eliminate self-transitions in the sequence $(z_s)$.
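The stick-breaking prior and the self-transition-free transition matrix $\bar{\pi}$ can be sketched under a weak-limit (finite truncation) approximation. The truncation level and concentration values below are illustrative, not the paper's settings, and the Dirichlet row draws are a finite approximation of $\mathrm{DP}(\alpha, \beta)$.

```python
import numpy as np

rng = np.random.default_rng(1)

def gem(gamma, K):
    """Weak-limit stick-breaking: K weights approximating beta ~ GEM(gamma)."""
    v = rng.beta(1.0, gamma, size=K)
    remaining = np.concatenate(([1.0], np.cumprod(1 - v)[:-1]))
    beta = v * remaining
    return beta / beta.sum()        # renormalize the truncated sticks

def hdp_transitions(alpha, beta):
    """Each row pi_j ~ DP(alpha, beta) (finite Dirichlet approximation),
    with self-transitions removed and rows renormalized, as in the HDP-HSMM."""
    K = len(beta)
    # small smoothing keeps all Dirichlet parameters strictly positive
    pi = rng.dirichlet(alpha * beta + 1e-2, size=K)
    pi = np.clip(pi, 1e-12, None)   # guard against numerical underflow
    np.fill_diagonal(pi, 0.0)       # eliminate self-transitions (pi-bar)
    return pi / pi.sum(axis=1, keepdims=True)

beta = gem(gamma=2.0, K=10)
pi_bar = hdp_transitions(alpha=5.0, beta=beta)
print(pi_bar.shape)  # (10, 10): rows sum to 1 with zero diagonal
```

Because every row is drawn around the shared weights $\beta$, states that are globally popular are likely destinations from every primitive, which is what ties the transition rows together under the HDP.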

II-C3 Learning Procedure

We adopt a weak-limit Gibbs sampling algorithm for the HDP-HSMM [16]; the duration variables are drawn from a Poisson distribution. The observations are generated from a Gaussian model $y_t \sim \mathcal{N}(\mu_{z_s}, \Sigma_{z_s})$, with the relevant settings taken according to [18]. The hyperparameters $\gamma$ and $\alpha$ are drawn from a gamma prior, and the hyperparameters for the Gaussian emissions are determined by an Inverse-Wishart (IW) prior [19].

III Experimental Setup for Data Collection

In this section, to test the capability of the proposed method, we introduce a cluttered scenario, the Meskel Square Dataset (see https://youtu.be/UEIn8GJIg0E), which is far messier than regular traffic. We present a detection-based method to track dense moving objects in this low-resolution video, and then transform the tracked bounding boxes into the corresponding bird's-eye view.

III-A Object Tracking

In order to extract the position and velocity of vehicles, two key procedures are necessary: 1) detection – recognizing objects and tagging them with bounding boxes in each frame, and 2) tracking – matching the same objects in successive frames. Many approaches are built on optical flow, as it can capture motion and segment objects [20]. However, the capacity of optical flow to detect static objects is limited. We therefore use YOLOv3 [10], a powerful toolbox based on convolutional neural networks, for detection.

Given a set of bounding boxes without IDs, a straightforward way to track is to match the same bounding boxes between adjacent frames $t$ and $t+1$ using similarity measurements, e.g., Euclidean distance and overlapping ratio. The displacement of the same object between consecutive frames should be very small if the speed of objects is sufficiently low, implying that a point in frame $t$ has a high probability of matching the closest point in frame $t+1$. Here we use the Euclidean distance (ED) to gauge the similarity.

Fig. 4: The object matching method (a) by directly minimizing the ED and (b) with movement prediction.

Directly minimizing the ED may lead to incorrect results in some situations, for example, when points in frame $t+1$ are close to each other, as in Fig. 4(a), where a point is matched with the wrong neighbor. To overcome this limitation, we developed a matching algorithm based on movement prediction. The optical flow method, which calculates the velocities of moving brightness patterns, provides motion trends that can be used to predict the approximate upcoming position of each point. The tracking method in Fig. 4(b) can therefore match the predicted position of each object in frame $t+1$ with the correct detection. In addition, the state and duration of each vehicle ID are also considered: an object is marked as inactive if no bounding box in the next frame is matched, and we stop updating its ID once the duration of the inactive state exceeds a predefined threshold.
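The prediction-based matching can be sketched as a greedy nearest-neighbor assignment on flow-predicted positions. The coordinates, flow vectors, and distance threshold below are hypothetical; in the real pipeline they come from the tracker and the optical flow field.

```python
import numpy as np

def match_tracks(prev_pts, prev_flow, curr_pts, max_dist=30.0):
    """Greedily match objects in frame t to detections in frame t+1 by
    Euclidean distance from the optical-flow-predicted positions."""
    predicted = prev_pts + prev_flow          # predicted positions in frame t+1
    matches, used = {}, set()
    # distance matrix: rows = existing tracks, cols = new detections
    d = np.linalg.norm(predicted[:, None] - curr_pts[None, :], axis=2)
    for i, j in sorted(((i, j) for i in range(len(prev_pts))
                        for j in range(len(curr_pts))), key=lambda ij: d[ij]):
        if i not in matches and j not in used and d[i, j] <= max_dist:
            matches[i] = j
            used.add(j)
    return matches  # {track index in frame t: detection index in frame t+1}

# Two vehicles moving right; nearest-neighbor on raw positions would swap
# them, but the flow prediction resolves the ambiguity.
prev_pts = np.array([[0.0, 0.0], [10.0, 0.0]])
prev_flow = np.array([[12.0, 0.0], [12.0, 0.0]])
curr_pts = np.array([[22.0, 0.0], [12.0, 0.0]])
print(match_tracks(prev_pts, prev_flow, curr_pts))  # {0: 1, 1: 0}
```

Tracks left unmatched by this function would be marked inactive, consistent with the ID lifetime handling described above.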

III-B Bird's-Eye View

Fig. 5: A bird’s-eye view perspective transformation with the line of the curb and crossing lines: (a) original frame, (b) top-down view.
Fig. 6: YOLOv3 detection results with different confidence thresholds (a)–(c), and (d) the filtered result.

As introduced above, we set a fixed ED threshold to determine whether a new ID will be assigned to a bounding box. However, the position and orientation of the camera skew the ground slightly, so the size and velocity of objects would be biased across different regions. Hence, we apply a perspective transformation to obtain a bird's-eye view video that removes such biases, manually picking a rectangle as the reference points for the transformation, as shown in Fig. 5. The most frequent interaction behaviors occur in the red shaded area, so we set this red area as the ROI and set the detection threshold of YOLOv3 accordingly (Fig. 6). Large bounding boxes covering more than a quarter of the screen are removed. Furthermore, we filter on the overlapping ratio between bounding boxes within one frame to avoid redundant detections from a low threshold. Figure 6(d) shows the filtered results, which are recorded as sets of time series containing the coordinates and ID of all vehicles in each frame.
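The perspective transformation from four reference-point correspondences reduces to solving a small linear system for the 3x3 homography (in practice a library routine such as OpenCV's `getPerspectiveTransform` does this). The corner coordinates below are hypothetical; the real values depend on the camera setup.

```python
import numpy as np

def homography(src, dst):
    """Solve the 3x3 perspective transform mapping 4 src points to 4 dst points
    (direct linear solve with the bottom-right entry fixed to 1)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def warp(H, pt):
    """Apply the homography to one image point (homogeneous divide)."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return np.array([x / w, y / w])

# Hypothetical corners of the picked rectangle in the camera image, mapped to
# a top-down rectangle in pixels.
src = [(420, 300), (900, 310), (1100, 700), (150, 690)]
dst = [(0, 0), (400, 0), (400, 600), (0, 600)]
H = homography(src, dst)
print(warp(H, (420, 300)))  # the first corner maps to (0, 0)
```

After warping, the ED threshold behaves consistently across the scene, since equal pixel displacements correspond to roughly equal ground distances.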

IV Results and Analysis

In this section, we will present and analyze the experimental results tested on the Meskel Square Dataset and the NGSIM US Highway Dataset[5].

IV-A Experimental Results

IV-A1 Gaussian Velocity Field

Our focus is mainly on the vehicles that survive longer than a threshold number of frames with at least one surrounding vehicle, which allows us to find challenging and informative interaction patterns. Objects within a fixed radius of each ego vehicle are considered, and the surrounding environment of each vehicle is modeled by extracting relative velocities via the coordinate transformation matrix. Human drivers' decision-making is more sensitive to vehicles ahead than to vehicles on either side, and lateral speed is much lower than longitudinal speed in most scenarios; the length-scale parameters of the Gaussian process are set accordingly. The Gaussian velocity field is expressed as a set of matrices of size $N \times N \times 2$, where $N \times N$ represents the discretized $x$–$y$ grid over the considered range and the last dimension holds the velocity field in the $x$ and $y$ directions.
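Building the $N \times N \times 2$ field tensor can be sketched as follows. The grid size, radius, length scales, and vehicle states are hypothetical, and the kernel-weighted averaging below is a simplified stand-in for the GP posterior mean described in Section II-A.

```python
import numpy as np

# Hypothetical settings: a 41 x 41 grid over +/-20 m around the ego vehicle.
N, R = 41, 20.0
xs = np.linspace(-R, R, N)
gx, gy = np.meshgrid(xs, xs, indexing="ij")

# Relative positions / velocities of surrounding vehicles (ego frame, toy values).
pos = np.array([[8.0, 2.0], [-6.0, -1.0]])
vel = np.array([[-3.0, 0.0], [2.0, 0.5]])

# Kernel-weighted contribution of each vehicle (simplified GP-mean stand-in),
# with a longer longitudinal than lateral length scale, as argued above.
lx, ly = 8.0, 4.0
field = np.zeros((N, N, 2))
for (px, py), v in zip(pos, vel):
    w = np.exp(-(gx - px) ** 2 / (2 * lx**2) - (gy - py) ** 2 / (2 * ly**2))
    field += w[..., None] * v
print(field.shape)  # (41, 41, 2): one x- and one y-velocity per grid cell
```

Flattening this tensor per frame yields the fixed-dimension input consumed by the autoencoder, regardless of how many vehicles are present.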

IV-A2 Representation Learning

We build a deep autoencoder with a symmetric structure using fully connected layers. The deep autoencoder is trained to map the Gaussian velocity field to hidden representations in its middle hidden layer, with the number of hidden units in the encoder decreasing from the input layer down to the layer of hidden representations.

IV-A3 Traffic Primitives

We use the latent features to represent the relative velocity field of the surrounding environment, and use the velocity and acceleration to depict the decisions of the ego vehicle. We focus on vehicles that encounter complex traffic and struggle for a long duration, and we neglect vehicles that survive only a few frames or lack enough surrounding vehicles. The total duration represents the summation of the durations of all ego vehicles and differs from the video clip length; the input at each frame is composed of the hidden representations together with the ego vehicle's velocity and acceleration. We train the HDP-HSMM using pyhsmm [17]. Figure 7 shows nine primitive interaction patterns extracted via our proposed framework. The learned traffic primitives convey scenarios of passing through opposing traffic flows, encountering traffic flows from other directions, being stuck in crossing traffic flows, and being cut off by other vehicles from the traffic flow being followed.

Fig. 7: Nine traffic primitives in the experimental results. The traffic primitives bounded in the same colored boxes are categorized into the same feature label; the left part within each box is the representative velocity field, and the right one is its corresponding scenario.
Fig. 8: The extracted eight traffic primitives from NGSIM datasets.

IV-B Experiment on NGSIM Datasets

Considering the differences between scenarios at intersections and on highways, we adjust the parameters of the Gaussian process accordingly. In this experiment, we skip the earliest vehicles, since traffic is sparse at the very beginning of this dataset, and show results for a subsequent range of vehicle IDs. Eight representative interaction patterns on the highway are finally learned, as shown in Fig. 8. These interaction patterns include fundamental highway scenarios, such as overtaking, being overtaken by surrounding vehicles, and lane changing.

V Conclusion

In this paper, we demonstrated a flexible way to analyze complicated multi-vehicle interaction patterns based on traffic primitives extracted from recorded traffic videos. A generic framework was developed, composed of three main modules: object tracking, representation learning, and interaction pattern clustering. The proposed framework can be extended to other multi-agent scenarios. In the tracking module, we combined bounding-box identification and optical flow to improve position tracking and velocity estimation. In the representation learning module, we introduced Gaussian velocity fields to model cluttered scenes in which the number of vehicles changes over time, and then learned low-dimensional latent features of the velocity field with a deep autoencoder. Finally, in the interaction pattern clustering module, we fed the combination of the velocity fields and the decisions of the ego vehicle into a Bayesian nonparametric model to extract and cluster the interaction patterns in the temporal space. Our experimental results on two different datasets show appealing performance of the proposed framework in extracting semantically meaningful multi-vehicle interaction patterns. Such performance is desirable in the context of naturalistic driving analysis, especially in highly dynamic scenes. Furthermore, the resulting information-rich representation allows in-depth investigation for various autonomous driving applications, including reliable environment recognition, efficient scene understanding, and tractable safety evaluation.


Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.