Interaction Detection Between Vehicles and Vulnerable Road Users: A Deep Generative Approach with Attention

by Hao Cheng, et al.

Intersections where vehicles are permitted to turn and interact with vulnerable road users (VRUs) such as pedestrians and cyclists are among the most challenging locations for automated and accurate recognition of road users' behavior. In this paper, we propose a deep conditional generative model for interaction detection at such locations. It aims to automatically analyze massive video data with respect to the continuity of road users' behavior. This task is essential for many intelligent transportation systems such as traffic safety control and self-driving cars that depend on the understanding of road users' locomotion. A Conditional Variational Auto-Encoder based model with Gaussian latent variables is trained to encode road users' behavior and perform probabilistic and diverse predictions of interactions. The model takes as input the information of road users' type, position and motion automatically extracted by a deep learning object detector and optical flow from videos, and generates frame-wise probabilities that represent the dynamics of interactions between a turning vehicle and any VRUs involved. The model's efficacy was validated by testing on real–world datasets acquired from two different intersections. It achieved an F1-score above 0.96 at a right–turn intersection in Germany and 0.89 at a left–turn intersection in Japan, both with very busy traffic flows.




I Introduction

In real–world traffic situations, it is not uncommon that heterogeneous road users like vehicles and vulnerable road users (VRUs, e. g., pedestrians and cyclists) have to directly interact with each other at particular locations. Especially in city traffic, such locations include the turning areas of so-called Turn-on-Red (TOR) intersections [23] or, more generally, intersections that allow vehicles to turn while other road users are crossing. During the time window of a vehicle’s turn, its behavior is largely guided by social protocols, e. g., right-of-way or courtesy. For example, in Germany, as shown in Fig. 1, a turning vehicle at a permissive right–turn intersection often encounters cyclists that are passing by and pedestrians that are crossing in the conflict areas. In Japan, a similar situation can be found at a permissive left–turn intersection [1] in left–hand traffic.

Fig. 1: A right-turn intersection in Germany. A dedicated lane for cyclists is typically parallel to the crossing zone.

Efficiently and accurately learning how vehicles and VRUs interact with each other at such intersections is important for many applications. As statistics show, accidents often occur at places where vehicles and VRUs confront each other, and there have been reports of VRUs being seriously injured because car drivers overlooked them at turning intersections [4, 9, 35]. Thus, one important application is the analysis of interactions and critical situations. In addition, the foreseeable advent of autonomous driving in urban areas [7], particularly at such locations, requires accurate recognition of road users’ behavior. Another potential application would be an accident warning system for road users.

Nowadays, with the ubiquity of traffic data and the development of computer vision techniques, it has become feasible to automatically recognize road users’ behavior from massive video data. Hence, in this paper we aim to investigate an efficient way to automatically analyze whether the continuity of road users’ behavior is interrupted when vehicles and VRUs meet at busy intersections, which we formalize as interaction detection by automatically extracting the user type, location and motion information from videos.

The concept of interaction represents a changing level of reaction between road users. As defined by [32], an interaction is “a situation in which two or more road users are close enough in space and time and their distance is decreasing”. Similarly, [38] describes an interaction between road users as “a continuum of safety related events”. Moreover, [38, 34] relate interactions to conflicts [26], noting that interactions can range from collisions to negligible conflict risks. However, in everyday traffic, collisions and accidents fortunately account for only a very small share—the tip of the pyramid of interactions [38], see Fig. 2. More frequent events are conflicts with different degrees of severity (serious, slight and potential) and undisturbed passages. Therefore, in this paper, a high-level classification is adopted that divides events into non-interactions (undisturbed passages) and interactions (all other events).

The task of interaction detection is to differentiate interaction and non-interaction levels over the dynamics of a vehicle turning sequence. An interaction occurs if the turning vehicle drives into the intersection while any VRUs are approaching or moving in the intersection space (see Fig. 1). In order to avoid conflicts that might happen at any time during the vehicle’s turning, the road users adapt their movement, i. e., velocity and orientation, accordingly. Otherwise, no interaction occurs if the target vehicle drives in an undisturbed manner, even if there are VRUs in its neighborhood.

Fig. 2: The pyramid of interactions. The figure is partially adapted from [38].

There are many challenges for automated interaction detection between vehicles and VRUs using video data. Road users’ behavior is dynamic and stochastic, as they have to adjust their motion according to the reaction of each other. In addition, mixed types and varying numbers of road users, as well as direct confrontations, greatly complicate this task. The following open questions have to be addressed: (I) How to efficiently acquire, process, and label a large amount of video data for training a deep learning model for interaction detection considering all the relevant road users? (II) How can a system automatically detect the location and motion of the involved road users? (III) How to represent the dynamics of interactions in vehicle turning sequences of varying duration?

To tackle the above challenges, we propose a deep conditional generative model based on the Conditional Variational Auto-Encoder (CVAE) [36] for automated interaction detection. The model is conditioned on the information extracted from video data and performs probabilistic inference for interaction prediction. As opposed to a discriminative model [3] that distinguishes interaction classes (interaction vs. non-interaction) from the observed information, a set of Gaussian latent variables is used to encode the dynamic and stochastic behavior patterns, which enables the generative model to perform diverse predictions at inference time. The contributions of this work are summarized as follows:

  • Various activities among all road user types were recorded using a camera at a right–turn intersection in Germany and a left–turn intersection in Japan, both with very busy traffic flows. They were processed for interaction detection in both right– and left–hand real–world traffic. In the future the data will be released for further research.

  • We combine a deep learning object detector to automatically detect all the relevant road users and optical flow to extract their motion. The combination captures the dynamics of all the road users and circumvents the tremendous work of manual tracking of trajectories.

  • Both sliding window and padding methods are explored to parse vehicle turning sequences of varying lengths.

  • We propose an end-to-end sequence-to-sequence conditional generative model with a self-attention mechanism [40] for interaction detection, which simultaneously takes both the object and motion information sequences and generates probabilities of interaction at each short time interval. The probabilities change accordingly when the intensity of interaction between a turning vehicle and the involved VRUs changes over time.

The remainder of the paper is organized as follows. Sec. II reviews the related studies on road user behavior at intersections. The proposed methodology is introduced in Sec. III. The detailed information of the datasets and evaluation metrics is provided in Sec. IV and Sec. V. The experimental results are presented and analyzed in Sec. VI and the limitations of the model are further discussed in Sec. VII. Finally, the conclusions are drawn in Sec. VIII with potential directions of future work.

II Related Work

Early studies on road user behavior at intersections focused on collision and conflict analyses [2, 5, 18, 31]. For example, [2] manually observed and studied a total of 25 collision scenes taped on video at an intersection over a period of one year. [5] conducted a study of the safety impact of permitting turns on red lights (TOR) based on crash data. Examining actual collision and conflict scenes has many limitations. First, crash and accident events are very rare in daily traffic and vary from case to case, so they cannot represent the majority of road users’ behavior [33], as undisturbed passages are not included. Second, as pointed out by [15], collision–based safety analysis is a reactive approach that requires a significant number of collisions to be collected before an action is warranted. Third, such data are likely to be incompletely documented or protected for legal and privacy reasons, which makes data acquisition complicated or even impossible. Most importantly, the above drawbacks make it almost impossible to automatically analyze the behavior of road users.

The development of computer vision techniques allows for automated analysis of road users’ behavior at intersections. The work by Ismail et al. [15], one of the early studies using computer vision techniques, automatically analyzed pedestrian–vehicle conflicts at an intersection using trajectories extracted from video data. Later on, several works similarly used trajectories extracted from videos to analyze before–and–after vehicle–pedestrian conflicts [16], conflicts in street designs with elements of shared space [17], conflicts in less organized traffic environments [39], and vehicle–bicycle conflicts at intersections [33]. The work carried out by Ni et al. [24] analyzed pedestrian–vehicle interaction patterns using the indicators Time-to-Collision (TTC) [10] and Gap Time (GT) [15]. First, trajectories were extracted with the semi-automated image processing tool Traffic Analyzer [37]; then, according to the TTC and GT values derived from the trajectory speed profiles, interactions were classified into three classes: hard interaction, soft interaction and no interaction. On the one hand, their work is very close to the study carried out in this paper: both interactions and non-interactions are studied at permissive right–turn intersections. On the other hand, their work is not fully automated in terms of trajectory extraction. Acquiring reliable trajectory data is often costly and time–consuming, and the quality of the data is difficult to guarantee. For example, tracking multiple objects from frame to frame is very challenging due to, e. g., abrupt object motion, changes of appearance and occlusions [41]. Errors and failures in detection propagate to the tracking process and later directly lead to wrong conclusions in the analysis step [33]. Moreover, the above works only consider either vehicle–pedestrian or vehicle–cyclist interactions. In real–world traffic situations at big intersections, other heterogeneous road users are often involved at the same time.

In recent years, deep learning methods have been successfully applied to understand road users’ behavior at intersections using video data, though many of them [27, 8, 13, 28] are conducted from the perspective of a self-driving car for pedestrian intent detection. From a third–person perspective, [3] trained an encoder-decoder model to automatically detect interactions using sequences of video frames from a static camera facing a very busy left–turn intersection in a Japanese city. However, this discriminative model is trained to optimize the reconstruction loss between pairs of ground truth and prediction, which tends to learn the “average” behavior of road users; the dynamic and stochastic behavior patterns are not fully captured. Hence, in this paper, we propose a CVAE–based model with Gaussian latent variables to encode various behavior patterns and perform diverse predictions. We test our model not only in left–hand traffic but also in right–hand traffic in different countries for interaction detection between vehicles and all other VRUs.

III Methodology

This section explains the methodology of interaction detection in detail. Sec. III-A formulates the problem, Sec. III-B describes the extraction of the input features, Sec. III-C introduces the detection model, and Sec. III-D provides the estimation of the model’s uncertainty.

Fig. 3: Sequence-to-sequence modeling using sliding window or padding method.
Fig. 4: The pipeline for interaction detection.

III-A Problem formulation

Interaction detection is formulated as a classification problem using the information extracted from videos. Given a set of observed vehicle turning sequences of video frames, the input of the $i$-th sequence is characterized as $X_i = \{x_1, \dots, x_T\}$, where $x_t \in \mathbb{R}^{W \times H \times C}$ is the frame at time step $t$, and $T$ is the total number of observed frames for the sequence. $W$, $H$ and $C$ denote the width, height and the number of channels of each frame. Instead of using raw images, object and optical–flow information (see Sec. III-B for more details) is extracted from the frames and used as the input sequence. In this way, the personal information of road users, e. g., face, gender, age, and license plates, can be protected. $Y_i$ is the corresponding ground truth interaction label and $\hat{Y}_i$ is the prediction.

Moreover, sequence-to-sequence modeling is applied to learn the frame–wise dynamics of interactions over a turning sequence. Similar to [8], the task defined above is a weakly supervised learning problem due to the labelled data structure: the interaction label is a dichotomous class that represents the interaction level of the whole sequence. It does not provide detailed information about how the interaction level changes over time. In fact, it is not feasible to manually label each frame due to the tremendous amount of work. Without knowing the exact fine–grained frame–wise interaction label, the sequence–wise label is duplicated at each frame. Hence, the form of the output is converted to the time steps, denoted as $Y_i = \{y_1, \dots, y_T\}$ for the $i$-th turning sequence. Thereafter, the input and output are aligned at each frame for sequence-to-sequence modeling, denoted by Fig. 3 and Eq. (1):

$\hat{Y}_i = v(f(X_i)), \quad (1)$

where $f$ denotes the detection model and $v$ is a voting scheme that summarizes the frame–wise predictions into the sequence–wise prediction. In this paper, an average voting scheme that weighs the prediction at each frame equally is adopted, and the sequence–wise prediction is the class label voted by the majority [3]. The above conversion is based on the following hypotheses: (1) Over a large dataset, sequence lengths vary from one sequence to another, which provides rich interaction information from both long and short sequences. (2) The prediction error between $Y_i$ and $\hat{Y}_i$ is still computed at the sequence level, because each frame–wise prediction only partially contributes to the sequence–wise prediction via the voting scheme. This mechanism enables the model to automatically learn the frame–wise dynamics at each frame during training.
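The average voting scheme described above can be sketched in a few lines; the two-column softmax layout and variable names below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def sequence_prediction(frame_probs):
    """Average-vote frame-wise class probabilities into one sequence label.

    frame_probs: (T, 2) array of per-frame softmax outputs for the
    non-interaction (column 0) and interaction (column 1) classes.
    Each frame is weighted equally; the majority class wins.
    """
    mean_probs = frame_probs.mean(axis=0)  # equal weight for every frame
    return int(np.argmax(mean_probs))      # 0 = non-interaction, 1 = interaction

# A short synthetic sequence in which the interaction class dominates.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.1, 0.9]])
label = sequence_prediction(probs)  # -> 1 (interaction)
```

Because each frame contributes only a fraction of the final vote, the frame-wise outputs are free to change over the sequence while the loss is still anchored to the sequence-level label.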

In addition, sliding window and padding methods are proposed to deal with varying sequence lengths for training a recurrent neural network (RNN) based CVAE model, as shown in Fig. 3. This is because the most commonly used RNNs for sequence modeling, e. g., Long Short-Term Memory (LSTM) [11], often require a fixed sequence length. However, at an intersection some vehicles can quickly complete the turn if the space happens to be free, whereas other vehicles may have to wait for a long time to let VRUs cross first.

The sliding window method parses each sequence with a fixed window size $w$. Eq. (2) denotes the sliding window method with a stride $s$ equal to $w$:

$X_i^{(k)} = \{x_{(k-1)s+1}, \dots, x_{(k-1)s+w}\}, \quad k = 1, 2, \dots \quad (2)$

Overlap between two consecutive windows is allowed when the stride is set smaller than the window size.

The padding method uses zero-paddings at the end to extend the sequences shorter than a predefined length $T$. The value of $T$ can be adjusted to cover most of the sequences, e. g., $T \geq |X_i|$, where $|X_i|$ denotes the length of an arbitrary vehicle turning sequence. Meanwhile, a padding mask is used to annotate the exact sequence length so that the padded zero values are treated differently to mitigate their negative impact on the learning process.
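Both parsing strategies can be illustrated with a minimal NumPy sketch; the toy window size, maximum length, and the choice to drop a trailing remainder shorter than the window are assumptions for illustration:

```python
import numpy as np

def sliding_windows(frames, window, stride=None):
    """Split a sequence of frames into fixed-size windows.

    With stride == window (the default) the windows do not overlap;
    a smaller stride yields overlapping windows. A trailing remainder
    shorter than `window` is dropped here for simplicity.
    """
    stride = window if stride is None else stride
    return [frames[i:i + window]
            for i in range(0, len(frames) - window + 1, stride)]

def pad_sequence(frames, max_len):
    """Zero-pad a (T, ...) frame array to max_len and return a mask.

    The mask marks real frames with 1 and padded frames with 0, so the
    padded zeros can be treated differently during learning.
    """
    t = frames.shape[0]
    padded = np.zeros((max_len,) + frames.shape[1:], dtype=frames.dtype)
    padded[:t] = frames
    mask = np.zeros(max_len, dtype=np.int64)
    mask[:t] = 1
    return padded, mask

seq = np.ones((7, 4, 4))               # 7 frames of toy 4x4 features
wins = sliding_windows(seq, window=3)  # 2 non-overlapping windows
padded, mask = pad_sequence(seq, max_len=10)
```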


After the formulation of the above sequence-to-sequence problem, we now introduce the pipeline of the proposed method for interaction detection, denoted by Fig. 4. It consists of two components: feature extraction (Sec. III-B) and the sequence-to-sequence CVAE model (Sec. III-C). Each component is explained in detail in the following subsections.

III-B Feature extraction

Object information and optical–flow information extracted from video frames are used as input features for the interaction detection task, as shown in Table I and Fig. 5.

Feature | C1 | C2 | C3 | C4 | Value
Object | pedestrians | bikes/motors | cars/trucks | buses | {0, 1}
Optical flow | orientation | 1 (max. saturation) | velocity | n/a | [0, 1]
The HSV (Hue, Saturation, Value) color representation is used to store the optical–flow information: the hue channel (C1) stores orientation, the saturation channel (C2) is set to its maximum, and the value channel (C3) stores velocity. Note that there are four channels in each object frame and only three channels in each optical–flow frame.
TABLE I: Object and optical–flow information extracted by an object detector and the dense optical flow, respectively.
(a) Object detection
(b) Binary mask
(c) Object information
(d) Optical–flow information
Fig. 5: Input features for interaction detection. Note that (c) only exemplifies three channels with pedestrians denoted in blue, bicycle(s) in green and car(s) in red color. The overlaid bounding boxes in (d) only serve the purpose of showing the location of the objects, including the static ones. They are not integrated into the optical–flow information.

Object information contains road users’ type and location. A deep learning object detector, such as YOLOv3 [29] or M2Det [42], is leveraged to detect all the relevant road users in each frame, namely pedestrians, cyclists, motorbikes, cars, trucks and buses. Different channels are used to store the road–user position information, and each channel is dedicated to one or two similar road user types: because only very few motorbikes were detected in the acquired data, they are stored in the same channel as bicycles, and cars/trucks are stored in one channel given their very similar turning traces, see Table I. The locations of the detected road users (Fig. 5a) are mapped by the corresponding bounding boxes with values of one in each frame (Fig. 5c). Areas with no detected objects are set to zero, as shown in black color.
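The channel layout of Table I can be illustrated by rasterizing detected bounding boxes into a binary object frame; the class-to-channel mapping follows the table, while the function name and box format are hypothetical:

```python
import numpy as np

# Channel layout following Table I: C1 pedestrians, C2 bikes/motors,
# C3 cars/trucks, C4 buses.
CHANNEL = {"pedestrian": 0, "bike": 1, "motorbike": 1,
           "car": 2, "truck": 2, "bus": 3}

def rasterize_objects(detections, height, width):
    """Map bounding boxes to a binary (H, W, 4) object frame.

    detections: list of (class_name, x1, y1, x2, y2) boxes in pixel
    coordinates. Box areas are set to 1 in the channel of their class;
    everything else stays 0 (shown as black in Fig. 5c).
    """
    frame = np.zeros((height, width, 4), dtype=np.uint8)
    for cls, x1, y1, x2, y2 in detections:
        frame[y1:y2, x1:x2, CHANNEL[cls]] = 1
    return frame

# A toy 8x8 frame with one pedestrian and one car.
frame = rasterize_objects([("pedestrian", 2, 2, 5, 6), ("car", 0, 0, 3, 3)],
                          height=8, width=8)
```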

Optical flow is used to capture the motion of road users. It describes the distribution of apparent velocities of brightness patterns between two consecutive images [12]. Moving objects are captured by optical flow, while static objects and the background are ignored. The dense optical–flow algorithm [6] is applied to map the displacement of moving objects and remove the static background information, see Fig. 5d. Similarly, respective frame channels are dedicated to the orientation and velocity information of the moving objects, see Table I.
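The HSV-style encoding of Table I can be sketched without an optical-flow library, assuming the dense flow field is already given as per-pixel displacements (dx, dy); normalizing the magnitude by the frame's maximum is an assumption for illustration:

```python
import numpy as np

def flow_to_hsv(dx, dy):
    """Encode a dense flow field as three normalized HSV-like channels.

    C1 stores orientation (angle rescaled to [0, 1]), C2 is fixed at the
    maximum saturation (1), and C3 stores velocity (magnitude rescaled to
    [0, 1] by the frame's maximum); static pixels stay near 0 in C3.
    """
    angle = np.arctan2(dy, dx)            # [-pi, pi]
    hue = (angle + np.pi) / (2 * np.pi)   # -> [0, 1]
    sat = np.ones_like(hue)               # maximum saturation
    mag = np.hypot(dx, dy)
    val = mag / mag.max() if mag.max() > 0 else mag
    return np.stack([hue, sat, val], axis=-1)

# Toy 2x2 flow field: one slow pixel, one fast pixel, two static pixels.
dx = np.array([[1.0, 0.0], [0.0, -2.0]])
dy = np.array([[0.0, 0.0], [0.0, 0.0]])
hsv = flow_to_hsv(dx, dy)  # shape (2, 2, 3), values in [0, 1]
```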

The area of interest is the turning space of the intersection and is marked by a binary mask (Fig. 5b); the other areas are not considered. As shown in Fig. 5a and 5b, the mask of the area of interest slightly extends into the through lane next to the turning lane. Due to the oblique view of the camera, the upper bodies of the vehicles in the turning lane are partially projected into the through lane, and the extended mask aims to include them so that the turning vehicles are fully detected. The lower middle point of the bounding boxes of the detected vehicles is used to filter out the vehicles in the through lane. However, the extended mask introduces noise into the optical–flow information. For instance, as shown in Fig. 5d, the motion of the vehicles in the through lane is also captured by the optical flow and cannot easily be filtered out given the irregular shapes and occlusion. It later turns out that this noise is not problematic when both the object information and the optical–flow information are combined as the input for training the interaction detection model, since interactions between vehicles and VRUs only happen in the crossing zone.

III-C CVAE model for interaction detection

The model predicting the probabilities of interaction between a turning vehicle and the other crossing road users is denoted as $\hat{Y} = f(X, z)$, where $f$ is a CVAE model that performs probabilistic prediction and $z$ are the Gaussian latent variables. The model encodes the information of interaction into a latent space and predicts the interaction label conditioned on the input $X$ and $z$. The variational lower bound [36] of the model is given as follows:

$\log p_\theta(Y|X) \geq -D_{KL}\big(q_\phi(z|X, Y) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|X, Y)}\big[\log p_\theta(Y|X, z)\big] \quad (4)$
The model jointly trains a recognition model $q_\phi(z|X, Y)$ (a. k. a. encoder) and a generative model $p_\theta(Y|X, z)$ (a. k. a. decoder). In the training phase, the model is optimized via stochastic backpropagation [30]. $q_\phi(z|X, Y)$ encodes the observed information and the ground truth label into the latent variables $z$. In other words, the label inserted during training is combined with the condition to parameterize the Gaussian latent space, which later can be used for structured prediction to map the many possible outputs [36]. $p_\theta(Y|X, z)$ decodes the prediction of the interaction label conditioned on the input and the latent variables.

The term $-D_{KL}(q_\phi(z|X, Y) \,\|\, p_\theta(z))$ is the negative Kullback–Leibler divergence of the approximate posterior from the prior and acts as a regularizer, which pushes the approximate posterior towards the prior distribution $p_\theta(z)$. Note that in our model the prior is relaxed to make the latent variables statistically independent of the input variables, so that $p_\theta(z) = \mathcal{N}(0, I)$ [20]. The prediction loss measures the distance between $Y$ and $\hat{Y}$. The binary cross–entropy loss is used as the prediction loss, as denoted by Eq. (5):

$\mathcal{L}_{pred} = -\frac{1}{T} \sum_{t=1}^{T} \big( y_t \log \hat{y}_t + (1 - y_t) \log(1 - \hat{y}_t) \big) \quad (5)$
In the inference phase, the decoder predicts the interaction label conditioned on the observed information concatenated with a latent variable directly sampled from the Gaussian prior $\mathcal{N}(0, I)$. The sampling process is repeated multiple times to perform diverse predictions [36].
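The two loss terms, the KL regularizer against a standard normal prior and the frame-wise binary cross-entropy of Eq. (5), can be written out numerically. A minimal NumPy sketch, assuming a diagonal Gaussian posterior; the latent dimension of 16 and the toy predictions are arbitrary:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Frame-wise BCE between duplicated sequence labels and predictions."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def cvae_loss(y_true, y_pred, mu, log_var):
    """Negative ELBO: prediction loss plus the KL regularizer."""
    return binary_cross_entropy(y_true, y_pred) + kl_to_standard_normal(mu, log_var)

mu, log_var = np.zeros(16), np.zeros(16)  # posterior equal to the prior -> KL = 0
y = np.array([1.0, 1.0, 1.0, 1.0])        # "interaction" label duplicated per frame
loss = cvae_loss(y, np.array([0.9, 0.8, 0.95, 0.85]), mu, log_var)
```

At inference time the KL term plays no role: a latent variable is simply drawn from the standard normal prior and concatenated with the condition, which is what makes repeated sampling produce diverse predictions.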

Convolutional Neural Networks (CNNs) and RNNs, as well as the self-attention mechanism [40] are employed in the CVAE model to learn the parameters of and . As shown in Fig. 4, the encoder has two branches: X-Encoder and Y-Encoder. They are dedicated to extracting low–level features from the condition (the object and optical–flow information) and the interaction label information, respectively. Each module, i. e., X-Encoder, Y-Encoder, Latent Space, and Decoder of the CVAE model, is explained in detail as follows.

The X-Encoder employs two CNNs for learning spatial features from the object frame sequence and the optical–flow frame sequence, respectively. Without loss of generality, the object frame sequence processed with the sliding window method is taken as an example to explain the learning process. First, each frame from the sliding window is passed to a CNN to learn spatial features. As shown in Fig. 6, the CNN has three 2D convolutional (CONV) layers, each followed by a Maximum Pooling (MP) layer and a Batch Normalization (BN) layer. It takes the frame that contains the object information as input and outputs a flattened feature vector. This process is done frame by frame for all the frames in the sliding window. Then, the output feature vectors of all the frames are distributed in time as a sequence that maintains the same length as the window size, as shown in Fig. 4 for the X-Encoder (this process works in the same way for the padding method, with the predefined sequence length instead of the sliding window size). The optical–flow frame sequence is processed by another CNN in a similar way to obtain the sequence of optical–flow feature vectors. In the end, the object feature vectors and the optical–flow feature vectors are concatenated into a 2D feature vector as the final output of the X-Encoder. Note that the CNN for the optical–flow frame sequence has a similar structure, except for the number of input channels: the CNN for the object frame sequence has four input channels dedicated to the different road user types, whereas the CNN for the optical–flow frame sequence has three input channels dedicated to the motion information (see Table I).
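The spatial size of the flattened feature vector after the three CONV + MP stages of Fig. 6 can be traced with a short shape calculation; the kernel size, stride, padding and input resolution below are illustrative assumptions, since the section does not spell them out:

```python
def conv2d_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a 2D convolution (square input, square kernel)."""
    return (size + 2 * padding - kernel) // stride + 1

def maxpool_out(size, kernel=2, stride=2):
    """Spatial size after a max-pooling layer."""
    return (size - kernel) // stride + 1

size = 128                  # hypothetical input resolution
for stage in range(3):      # three CONV -> MP -> BN stages
    size = maxpool_out(conv2d_out(size))
flat = size * size          # per-channel length of the flattened vector
```

Under these assumptions each pooling stage halves the spatial resolution (128 -> 64 -> 32 -> 16), so the flattening cost is dominated by the number of filters in the last CONV layer rather than the spatial size.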

Fig. 6: The CNN used for learning spatial features from an object frame. CONV stands for 2D convolutional layer, MP for Maximum Pooling layer and BN for Batch Normalization.

The Y-Encoder embeds the interaction label for each sequence. First, the sequence–wise label is replicated to align with the sequence length. Then, a fully connected (FC) layer is used to embed the replicated labels into a label vector. The original dimension of the label is only two after one-hot encoding of the non-interaction and interaction classes, which is much smaller than the combined feature vector; the embedding balances the sizes of the label vector and the combined feature vector. The specific dimensionalities are shown in Fig. 4 and are hyper-parameters that can be changed in the experimental settings.

The prior Gaussian latent variables are modulated by the encoded feature vector and the label vector from the X-Encoder and Y-Encoder, respectively. First, the outputs of the X-Encoder and Y-Encoder are concatenated along the time axis. Then, the concatenated features are passed to an FC layer and a following self-attention layer [40]. The self-attention layer takes all the features along the time axis at the same time and attentively learns the interconnections of the features globally. After that, an LSTM with two stacked hidden layers is used to learn the temporal features into a hidden state. In the end, the hidden state is fully connected by an FC layer and then split by two FC layers side by side. The two FC layers are trained to learn the mean and the variance of the distribution of the latent variables, respectively.

The Decoder is trained conditioned on the encoded feature vector from the X-Encoder and the latent variables. First, the encoded feature vector is concatenated with the latent variables and passed to an FC layer. Then, an LSTM with two stacked layers is used to learn the temporal dynamics. After that, two FC layers are used for fusion and dimension reduction. The Softmax activation function is added to the last FC layer for generating the probability of the interaction class at each frame. The output of the Decoder is the frame–wise predictions of the interaction class. In the end, the average voting scheme is used to summarize the frame–wise predictions into the sequence–wise prediction of the interaction class.

At inference time, the interactions for unseen vehicle turning sequences are classified using the trained CVAE model. First, the object and optical–flow information are encoded by the X-Encoder. A latent variable is sampled from the Gaussian distribution. Then the Decoder generates the probabilities of the interaction class for each sequence conditioned on the output of the X-Encoder and the sampled latent variable. The sampling is repeated multiple times at each step so that the Decoder generates diverse probabilities of the interaction class.

III-D Estimation of uncertainty

Kernel density estimation (KDE) [25, 22] is used to measure the uncertainty of the diverse predictions generated by the above multi-sampling process. At each frame $t$, the $N$ predictions $\{\hat{y}_t^{(n)}\}_{n=1}^{N}$ are assumed to be i.i.d. samples drawn from an unknown density function $f_t$, where $N$ is the total number of predictions, $t \in \{1, \dots, T\}$, and $T$ is the total number of steps of the given sequence $X_i$. The KDE is calculated as:

$\hat{f}_t(y) = \frac{1}{Nh} \sum_{n=1}^{N} K\!\left(\frac{y - \hat{y}_t^{(n)}}{h}\right), \quad (6)$
where $K$ is the Gaussian kernel function and $h$ is the smoothing parameter (also called bandwidth). The log-likelihood of the average prediction at step $t$ is determined by $\ell_t = \log \hat{f}_t(\bar{y}_t)$, where $\bar{y}_t$ is the average prediction. The uncertainty is defined as the residual of the normalized log-likelihood averaged over all the steps of the given sequence $X_i$, as denoted by Eq. (7):

$U_i = \frac{1}{T} \sum_{t=1}^{T} \left( 1 - \frac{\ell_t}{c} \right), \quad (7)$

where $c$ is the normalization parameter that scales the values to $[0, 1]$ and $U_i$ stands for the degree of uncertainty of the prediction over the sequence $X_i$.
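The idea, tightly clustered predictions give a high KDE likelihood at their average and hence low uncertainty, can be sketched in NumPy; the bandwidth, the normalization constant `c`, and the clipping to [0, 1] are hypothetical choices, not the paper's exact parameters:

```python
import numpy as np

def gaussian_kde_loglik(samples, x, h=0.1):
    """Log-likelihood of x under a Gaussian KDE fitted on the samples."""
    n = len(samples)
    k = np.exp(-0.5 * ((x - samples) / h) ** 2) / np.sqrt(2 * np.pi)
    return np.log(np.sum(k) / (n * h))

def sequence_uncertainty(predictions, h=0.1, c=10.0):
    """Residual of the normalized log-likelihood, averaged over steps.

    predictions: (N, T) interaction probabilities from N latent samples.
    c is a hypothetical normalization constant scaling values towards
    [0, 1]; tightly clustered predictions yield low uncertainty.
    """
    n_samples, t_steps = predictions.shape
    residuals = []
    for t in range(t_steps):
        avg = predictions[:, t].mean()          # average prediction at step t
        ll = gaussian_kde_loglik(predictions[:, t], avg, h)
        residuals.append(1.0 - ll / c)          # residual of normalized log-lik
    return float(np.clip(np.mean(residuals), 0.0, 1.0))

# 20 sampled predictions over 5 steps: one confident model, one scattered one.
tight = np.full((20, 5), 0.9) + np.random.default_rng(0).normal(0, 0.01, (20, 5))
loose = np.random.default_rng(1).uniform(0, 1, (20, 5))
u_tight = sequence_uncertainty(tight)
u_loose = sequence_uncertainty(loose)
```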

IV Data Acquisition and Pre-processing

(a) The KoW right–turn intersection in Germany
(b) The NGY left–turn intersection in Japan
Fig. 7: The screenshots of KoW and NGY intersections. Vehicle turning sequences are constrained in the yellow contours.

Real–world datasets were acquired to test the performance of the proposed model for interaction detection. Fig. 7 shows the screenshots of the two intersections where various traffic scenes were recorded. The KoW dataset was acquired by [21] from a very busy right–turn intersection in a German city. The videos recorded traffic conditions from 00:02 a. m. to 11:58 p. m. on November 8th and 9th, 2019 in Hannover. The videos were recorded in pixels at by a camera module (Raspberry Pi Camera Module v2) installed inside a building (ca. ground elevation) facing the intersection and stored in .h264 format. We use an approximately 14-hour sub-footage from two seven–hour segments (8 a. m. to 3 p. m. on both the 8th and 9th), when there was enough traffic and adequate ambient light to perform stable image processing for feature extraction. The NGY dataset was provided by Nagoya Toyopet Corporation. It was acquired from an extremely busy left–turn intersection in a Japanese city. In total, approximately 24 hours of traffic footage from an oblique view at one of the major intersections in Nagoya were recorded from 11 a. m. to 11 a. m. on April 23rd and 24th, 2019. The videos were recorded in pixels at using a camera (Panasonic WV-SF781L) installed inside a building (ca. ground elevation) adjacent to the intersection and stored in .mp4 format. Similarly, we use a twelve-hour sub-footage recorded from 11 a. m. to 6 p. m. on the 23rd and from 6 a. m. to 11 a. m. on the 24th.

Both datasets were pre-processed for later usage. Due to the missing camera intrinsic and extrinsic parameters, no projection was done for extracting trajectory data. The pre-processing aimed to identify vehicle turning sequences and extract all the road users’ type, position and motion information. First, two annotators for each dataset manually detected the scenes where a vehicle turned right at the KoW intersection or left at the NGY intersection, and extracted the time intervals during which the vehicle stayed in the yellow contours (see Fig. 7). The annotators independently determined whether or not interactions occurred in each scene; afterwards they revised their decisions, reached agreement (only a small share of the sequences were initially annotated differently) and labeled each scene as “non-interaction” or “interaction”. Then, YOLOv3 [29] and M2Det [42] were used to detect all the traffic related objects at the original frame rates of the KoW and NGY datasets, respectively. Note that these two sources of data were from different providers, so the camera settings and the object detection algorithms were not unified. Considering that the change between two consecutive frames is small, a possibly failed detection in the current frame is supplemented by the detection in the previous or the next frame if either of them is available. Otherwise, sequences with failed detection and no supplementation were discarded. In addition, the dense optical–flow algorithm [6] was used to extract the optical–flow information from the sequences. Different from the object detection, the frame rate was down–sampled to half of the original frame rate of each dataset. This aims to reduce the computational cost and increase the offset of moving pixels between two consecutive frames, so as to improve the extraction performance of optical flow [6].
In the end, both the object and the optical–flow sequences were aligned with the down–sampled frame rate in each dataset, which is used as the time step for interaction detection.
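The two pre-processing rules described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the list-of-detections structure and the helper names are assumptions. A failed detection in frame t is supplemented by the detection from the previous or next frame if available, and the frame sequence is down-sampled to half the original rate for optical-flow extraction.

```python
def supplement_detections(detections):
    """detections: list with one entry per frame, where an entry is the
    per-frame detection result or None when the object detector failed."""
    filled = list(detections)
    for t, det in enumerate(filled):
        if det is None:
            if t > 0 and filled[t - 1] is not None:
                filled[t] = filled[t - 1]          # borrow the previous frame
            elif t + 1 < len(filled) and filled[t + 1] is not None:
                filled[t] = filled[t + 1]          # borrow the next frame
    # a sequence is kept only if every frame now has a detection
    return filled, all(d is not None for d in filled)

def downsample(frames, factor=2):
    """Keep every `factor`-th frame to halve the frame rate."""
    return frames[::factor]
```

Sequences for which `supplement_detections` returns `False` in the second element would be discarded, matching the rule in the text.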

Fig. 8: Sequence lengths in the KoW and NGY datasets. (a) The standard deviation is denoted by the red error bar. (b) Sequence length measured by the number of frames, together with the length threshold used for the padding method.

The data processing yields over 2000 vehicle turning sequences of varying lengths, as shown in Fig. 8. Within each dataset, sequence lengths measured in seconds differ considerably. The non-interaction sequences are significantly shorter than the interaction ones (U-test, for both KoW and NGY), and the standard deviation within each class of both datasets is large. This indicates that the duration of a sequence is not an accurate feature for the detection task; a short sequence duration does not necessarily imply no interaction. In addition, the sequence lengths over each dataset are very unevenly, i.e., long–tail, distributed, especially for the NGY dataset, see Fig. 8(b). Across the datasets, the sequences in KoW and NGY differ not only in travel direction but also in frame size and rate, as well as in sequence length in general. Although non-interaction sequences from both datasets have similar average lengths, interaction sequences in NGY are on average longer than those in KoW. Due to the higher traffic density at the NGY intersection compared to the KoW intersection, vehicles often had to wait for more pedestrians and cyclists to cross. These differences make cross-dataset validation very difficult (more details in Sec. VII-B).
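The length comparison above uses the Mann-Whitney U-test. For the real analysis one would use a statistics library such as `scipy.stats.mannwhitneyu`; the pure-Python sketch below only illustrates the pairwise definition of the U statistic that underlies the test.

```python
def mann_whitney_u(sample_a, sample_b):
    """U for sample_a: the number of (a, b) pairs with a > b; ties count 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Toy example: short non-interaction lengths vs. longer interaction lengths.
non_interaction = [3, 4, 5]
interaction = [8, 9, 10]
print(mann_whitney_u(non_interaction, interaction))  # 0.0: complete separation
```

A U value near zero (or near the maximum, the product of the two sample sizes) indicates strong separation between the two groups, which is then converted into a p-value by the test.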

Name Input form Max. length Training Validation Test Total
KoW sliding 500 360/360 90/90 192/192 642/642
KoW padding 100 352/352 88/88 188/188 628/628
NGY sliding 500 291/291 74/74 159/159 530/530
NGY padding 100 132/132 33/33 70/70 235/235
TABLE II: Video frame sequences used for interaction detection. Sequence length is measured by the number of frames and the sample sizes of the classes were balanced for each set.

The acquired datasets were further prepared for training the detection models, which involves sample balancing, sequence padding and dataset partitioning. The number of samples in each class was balanced to perform unbiased training. For both datasets, the maximum number of sequences in each class was set to a value close to the capacity of the smaller class. Note that the small number of very long sequences (see Fig. 8(b)) was not used for the experiments. All such sequences belong to the interaction class, i.e., vehicles had to wait for a long time to let other road users cross the intersection. Removing these sequences balances sample size and length in both classes, preventing a model from being biased towards the interaction class. Sequences with fewer frames than the length threshold are padded with zeros for the padding method (see Sec. III-A). However, if the threshold is too large, most of the sequences will be padded with zeros, leading to noisy samples; if it is too small, many long sequences will be excluded. To balance this trade-off, based on the sequence length distributions over the datasets (Fig. 8(b)), the threshold was set to 100 frames for both datasets so that the majority of all sequences were included. Sequences shorter than or equal to the threshold were preserved for the experiments with the models that use the padding method. Sequences longer than the threshold exceed the maximum input length that the models can handle and were therefore discarded. Under the balancing criteria above, both datasets were then randomly split into training and test sets, and part of the training data was separated as an independent validation set to monitor the training process. Table II lists the statistics of the final data used for the experiments after these preparation steps.
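The padding and partitioning steps can be sketched as follows. This is a sketch under assumed data structures: the per-frame feature dimension and the split fraction are placeholders, not the paper's exact values; only the 100-frame threshold comes from the text.

```python
import random

T = 100          # length threshold in frames (from the text)
FEAT_DIM = 4     # assumed per-frame feature size, for illustration only

def pad_sequence(seq, threshold=T, feat_dim=FEAT_DIM):
    """Zero-pad a sequence up to `threshold` frames and build a padding mask.
    Returns None for sequences that exceed the threshold (discarded)."""
    if len(seq) > threshold:
        return None
    pad = [[0.0] * feat_dim] * (threshold - len(seq))
    mask = [1] * len(seq) + [0] * (threshold - len(seq))  # 1 = real frame
    return seq + pad, mask

def split_dataset(samples, train_frac=0.8, seed=0):
    """Randomly split class-balanced samples into training and test sets."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]
```

The mask produced alongside the padded sequence corresponds to the padding mask mentioned later in Sec. VI-C, which tells the model which time steps are real frames.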

V Experiments

V-A Baseline and ablative models

To evaluate the performance of the proposed CVAE model, it is compared with a baseline model. The baseline model is a sequence-to-sequence encoder–decoder model that uses the same input features from the object and motion information for interaction detection [3]. It has the same structure as the X-Encoder and the Decoder that are implemented in the CVAE model (Fig. 4). The difference between these two models is the sample generation process. The baseline model is a discriminative model and does not use the class label information and the conditional information for learning the latent variables that mimic the stochastic behavior in vehicle–VRU interactions. Without randomly sampling from the Gaussian latent variables, the output of the sequence-to-sequence encoder–decoder model is deterministic.

A series of ablative models are designed to analyze the contribution of the object information (ob), the optical–flow information (op), and the self-attention mechanism (att). The ablative models are trained by removing one of the aforementioned parts, as denoted in Table III.

Model name (ob) (op) (att) Sample generation
[S+ob+op+att]† ✓ ✓ ✓ -
[C+op+att] - ✓ ✓ ✓
[C+ob+att] ✓ - ✓ ✓
[C+ob+op] ✓ ✓ - ✓
[C+ob+op+att]‡ ✓ ✓ ✓ ✓
†: the baseline model; ‡: the complete CVAE model
TABLE III: The models with different input structures.

V-B Evaluation metrics

Tested samples are categorized according to the comparison between their ground truth and the predicted labels, as listed in Table IV. Accuracy, Precision, Recall and F1-score are applied to measure the performance of interaction detection on the test data from both the KoW and NGY intersections.

Category name Ground truth Prediction
TP: true positive interaction interaction
TN: true negative non-interaction non-interaction
FP: false positive non-interaction interaction
FN: false negative interaction non-interaction
TABLE IV: Categories of tested samples

Accuracy is the fraction of the number of the correctly predicted samples over the total number of samples.

Precision is the fraction of the number of the TP samples over the number of predicted positive samples.

Recall is the fraction of the number of the TP samples over the number of actual positive samples in the whole dataset.

F1-score is used to provide a measurement of the overall performance of a model. It is defined as the harmonic mean of precision and recall.
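The four metrics defined above follow directly from the confusion counts of Table IV. The sketch below is a straightforward illustration, not the authors' evaluation script; the example counts are invented.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from the confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)           # of predicted positives, how many real
    recall = tp / (tp + fn)              # of real positives, how many found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical counts: 90 TP, 85 TN, 10 FP, 15 FN
acc, prec, rec, f1 = metrics(90, 85, 10, 15)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# 0.875 0.9 0.857 0.878
```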

V-C Experimental Settings

The kernel size of the CNNs in each layer is set to 8, 4, and 2, respectively, with a stride of 2 and the same padding for the borders. The size of the first hidden layer of the LSTM is set to 64 and that of the second hidden layer to 32. The size of the latent variables is set to 64. All the models are trained with a fixed learning rate (zero decay) using the Adam optimizer [19]. The batch size is set to 32, and all the models were trained for 50 epochs on an NVIDIA Quadro T2000 GPU. At inference time, the number of samplings is set to 100 for all the CVAE–based models.

VI Results

This section presents the quantitative and qualitative results for each intersection, as well as the discussion of the results.

VI-A Quantitative results

The quantitative results are summarized in Tables V and VI for the right–turn KoW intersection and the left–turn NGY intersection, respectively. Due to the multi–sampling process, the results of the CVAE–based models are not deterministic; hence the corresponding standard deviations are provided.

Table V shows the results of the interaction detection at the right–turn intersection. (1) Both the sliding window and padding methods yield similar and very accurate results for interaction detection using the combined information from object detection and optical flow. The accuracy and F1-score are both above 0.95. (2) Compared to the baseline models [S2S+ob+op+att], the proposed models [CVAE+ob+op+att] have a slightly better performance using the sliding window method and comparable performance using the padding method. (3) Compared to the ablative models, the combined information improves the performance using the sliding window method. However, the improvement is not obvious using the padding method, especially compared to the ablative model that only uses the object information. On the other hand, regardless of the sliding window method or padding method, the ablative models that merely use optical–flow information only achieve an accuracy below 0.70. (4) The self-attention mechanism does not lead to an obviously better or worse performance using either the sliding window or padding method.

Model shape Accuracy Precision Recall F1-score
[S+ob+op+att] sli. 0.951 0.935 0.969 0.951
[C+op+att] sli. 0.692 0.717 0.635 0.673
[C+ob+att] sli. 0.952 0.934 0.973 0.953
[C+ob+op] sli. 0.965 0.976 0.953 0.964
[C+ob+op+att] sli. 0.961 0.969 0.953 0.961
[S+ob+op+att] pad. 0.963 0.944 0.984 0.964
[C+op+att] pad. 0.610 0.649 0.479 0.551
[C+ob+att] pad. 0.967 0.955 0.980 0.967
[C+ob+op] pad. 0.966 0.946 0.989 0.967
[C+ob+op+att] pad. 0.962 0.952 0.973 0.963
TABLE V: Detection results of the right–turn intersection on the KoW dataset. Best values are highlighted in boldface.

Table VI shows the results of the interaction detection at the left–turn intersection. (1) Both the sliding window and padding methods yield reasonable results for interaction detection using the combined information. However, the predictions of the sliding window method are more accurate than those of the padding method. (2) Compared to the baseline models, the CVAE models using the combined information achieve better performance, especially with the sliding window method (e.g., an increase of about 0.05 in F1-score). (3) Compared to the ablative models, the improvement from using the combined information can be found in both the sliding window and padding methods. (4) The best performance, especially measured by recall (0.916) and F1-score (0.892) on the NGY dataset, is achieved by the proposed CVAE model using the sliding window method with the self-attention mechanism.

Model Shape Accuracy Precision Recall F1-score
[S+ob+op+att] sli. 0.849 0.878 0.811 0.843
[C+op+att] sli. 0.878 0.854 0.912 0.882
[C+ob+att] sli. 0.734 0.698 0.824 0.756
[C+ob+op] sli. 0.882 0.915 0.842 0.887
[C+ob+op+att] sli. 0.889 0.869 0.916 0.892
[S+ob+op+att] pad. 0.721 0.712 0.743 0.727
[C+op+att] pad. 0.764 0.756 0.808 0.781
[C+ob+att] pad. 0.683 0.661 0.751 0.703
[C+ob+op] pad. 0.782 0.763 0.819 0.790
[C+ob+op+att] pad. 0.742 0.750 0.728 0.739
TABLE VI: Detection results of the left–turn intersection on the NGY dataset. Best values are highlighted in boldface.

The Kernel Density Estimation (KDE) function (see Sec. III-D) is used to measure the uncertainty levels of the CVAE–based models with different input structures. The uncertainties of the CVAE–based models are plotted in Fig. 9 and compared by the Mann-Whitney U-test. Fig. 9(a) and 9(b) demonstrate that the CVAE–based models [CVAE+op+att] using only the optical–flow information generate significantly more uncertain predictions than the other models tested on the KoW dataset. This is consistent with their prediction performance: they also yield less accurate predictions. A similar pattern can be observed for the CVAE–based models [CVAE+ob+att] using only the object information (Fig. 9(c) and 9(d)) tested on the NGY dataset. When the uncertainty level in the predictions is high, the accuracy level also drops.
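How a KDE over the multi-sampled interaction probabilities can act as an uncertainty measure is sketched below. This is an illustrative assumption, not the paper's exact formulation (cf. Sec. III-D): a high density peak means the 100 sampled probabilities agree (low uncertainty), a flat density means they are spread out (high uncertainty). The bandwidth value is a placeholder; real code would typically use `scipy.stats.gaussian_kde`.

```python
import math

def gaussian_kde(samples, bandwidth=0.05):
    """Return a Gaussian kernel density estimate over 1-D samples."""
    n = len(samples)
    def density(x):
        return sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
            / (bandwidth * math.sqrt(2 * math.pi))
            for s in samples
        ) / n
    return density

# 100 sampled interaction probabilities for one frame:
confident = [0.9 + 0.001 * i for i in range(100)]   # tightly clustered
uncertain = [0.01 * i for i in range(100)]          # spread over [0, 1]
peak_confident = max(gaussian_kde(confident)(x / 100) for x in range(101))
peak_uncertain = max(gaussian_kde(uncertain)(x / 100) for x in range(101))
print(peak_confident > peak_uncertain)  # True
```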

(a) KoW sliding window
(b) KoW padding
(c) NGY sliding window
(d) NGY padding
Fig. 9: Uncertainty measure of the CVAE–based models tested on the KoW and NGY datasets. The mean value is denoted by the yellow square in each box-plot. The uncertainty levels across the models are compared using the Mann-Whitney U-test; p-values are annotated using asterisks or ns (not significant).

The confusion matrices for the proposed CVAE model using both the object and the optical–flow information are presented in Fig. 10. It can be seen that the model using either the sliding window (true negative rate 0.970 and true positive rate 0.953) or the padding method (true negative rate 0.951 and true positive rate 0.973) achieves high performance for interaction detection tested on the KoW dataset. The proposed model achieves good performance using the sliding window method (true negative rate 0.861 and true positive rate 0.916) on the NGY dataset and maintains a relatively low false negative rate (0.084). Nevertheless, the performance of the proposed CVAE model tested on the NGY dataset is inferior to the one on the KoW dataset. In contrast, the padding method only achieves mediocre performance (true negative rate 0.757 and true positive rate 0.728) tested on the NGY dataset.

(a) KoW sliding window
(b) KoW padding
(c) NGY sliding window
(d) NGY padding
Fig. 10: Confusion matrices for the proposed CVAE model tested on the KoW/NGY dataset using the sliding window (a)/(c) and padding (b)/(d) methods. The confusion matrices are normalized so that they can be compared across sliding window and padding methods, as well as across the datasets.

VI-B Qualitative results

The qualitative results intuitively showcase the process of interaction detection of the models. The fine–grained probability of the predicted interaction at each frame provides a clue of how the interaction intensity evolves over time.

Fig. 11 demonstrates a non-interaction scenario at the KoW intersection between the right–turning target vehicle (in the blue bounding box) and the standstill pedestrian (in the red bounding box). There was no explicit interaction between them, as the continuity of their behavior was not affected when the gap between them closed up, so the sequence was annotated as non-interaction. The sequence-level prediction is the average vote of all the frame-level predictions. At the sequence level, all the models correctly predict this scenario as non-interaction using both the sliding window (Fig. 11(a)) and padding (Fig. 11(b)) methods, except the ablative model [CVAE+op+att] (in cyan) that only uses the optical–flow information. However, all the models predict a high probability of interaction when the vehicle approached the pedestrian. Also, the variance of the CVAE–based models in Fig. 11(b) increases when the probabilities change from below 0.5 to a higher value of interaction. The baseline model [S2S+ob+op+att] (in black) generates a similar pattern in frame–wise predictions, but it is deterministic at each frame and has no mechanism to represent the uncertainty of the predictions.
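The average-voting scheme mentioned above can be sketched in a few lines. The 0.5 decision threshold is an assumption for illustration; the key point is that a brief spike of high frame-wise probability, as in this scenario, does not flip the sequence-level label.

```python
def sequence_label(frame_probs, threshold=0.5):
    """Average the frame-wise interaction probabilities and threshold them
    to obtain the sequence-level class label."""
    mean_prob = sum(frame_probs) / len(frame_probs)
    return "interaction" if mean_prob > threshold else "non-interaction"

# Mostly low probabilities with a short spike while the vehicle passes the
# pedestrian, loosely mimicking the scenario in Fig. 11:
probs = [0.1] * 40 + [0.8] * 10 + [0.2] * 30
print(sequence_label(probs))  # non-interaction
```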

(a) Sliding window method.
(b) Padding Method.
Fig. 11: Examples of interaction probability at the frame level using the sliding window (a) and padding (b) methods, tested on the KoW dataset. The variance of the probabilities is visualized by the marginal shadow for the CVAE–based models. The corresponding video screenshots are aligned from upper left to the lower right at the bottom with a time interval of eight frames. The target vehicle is highlighted by the blue bounding box and the standstill pedestrian involved in the turning sequence is highlighted by the red bounding box.

Fig. 12 demonstrates an interaction scenario at the NGY intersection between the left–turning target vehicle (in the blue bounding box) and the crossing cyclist (in the red bounding box). Interaction was required between them, as the vehicle had to decelerate or even briefly stop to yield the way to the cyclist. All the models correctly predict this sequence as interaction using both the sliding window (Fig. 12(a)) and padding (Fig. 12(b)) methods. Similar to the scenario above, the variance of the probabilities for the CVAE–based models using the sliding window method changes with the distance between the target vehicle and the cyclist: as the distance decreases, the probability of interaction is higher and its variance smaller, and vice versa. The ablative model based on the object information using the padding method shows higher uncertainty levels in the frame–wise predictions than the other models.

(a) Sliding window method.
(b) Padding Method.
Fig. 12: Examples of interaction probability at the frame level using the sliding window (a) and padding (b) methods, tested on the NGY dataset. The variance of the probabilities is visualized by the marginal shadow for the CVAE–based models. The corresponding video screenshots are aligned from upper left to the lower middle at the bottom with a time interval of eight frames. The target vehicle is highlighted by the blue bounding box and the passing cyclist involved in the turning sequence is highlighted by the red bounding box.

VI-C Analysis of the results

The results shown above are analyzed with respect to: (I) the pros and cons between the sliding window and padding methods; (II) the performance between the proposed CVAE model and the baseline model; (III) the contribution of the object information and the optical–flow information via the ablative models; (IV) the impact of the self-attention mechanism.

The performances of the sliding window and padding methods are influenced not only by the size of the training data, but also by the zero–padded values. The sliding window method does not depend on the sequence length and is thus more flexible in dealing with various sequence lengths; the number of training samples was therefore not compromised in the experiments. The padding method, on the other hand, requires a pre-defined fixed sequence length and cannot deal with longer sequences, so the number of training samples was compromised by excluding them. The impact of the training data size is visible in the performance difference across the KoW and NGY datasets. The numbers of KoW training samples for the sliding window and padding methods are similar (see Table II), and their interaction detection performances are comparable (see Table V). In contrast, the number of NGY training samples for the sliding window method is much larger than for the padding method (see Table II), and the sliding window method predicts more accurately than the padding method (see Table VI). In addition, the shorter sequences were padded with zeros, which is problematic for the information extracted by optical flow, because zero values in the optical–flow feature vector represent the background of the intersection or static road users. Even though a padding mask is incorporated into the sequence to indicate the actual sequence length, the negative impact cannot be fully remedied due to the complex learning process in training. This negative impact of padded zeros has been revealed by the impaired performance of the ablative model [CVAE+op+att] with the padding method compared to the sliding window method.
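The flexibility of the sliding window method described above can be sketched as follows. Window size and stride are illustrative values, not the paper's settings: a sequence of arbitrary length is cut into fixed-size overlapping windows, so no sample has to be discarded or zero-padded.

```python
def sliding_windows(seq, window=20, stride=10):
    """Cut a frame sequence into fixed-size overlapping windows."""
    if len(seq) <= window:
        return [seq]                      # short sequences stay whole
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]

windows = sliding_windows(list(range(50)), window=20, stride=10)
print(len(windows), len(windows[0]))  # 4 20
```

Contrast this with the padding method, where a 50-frame sequence either fits under the threshold (and is padded with zeros) or is discarded outright.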

In general, the proposed CVAE model [CVAE+ob+op+att] outperforms the baseline model [S2S+ob+op+att] quantitatively (see Table V and VI) and qualitatively (see Fig. 11 and 12). In the CVAE model, the latent variables are trained to capture the stochastic attributes of road users’ behavior in various traffic situations, which is optimized by the Kullback-Leibler divergence loss against a Gaussian prior. In addition to the Kullback-Leibler divergence loss, the reconstruction loss is trained by minimizing the cross-entropy loss between ground truth and prediction. Optimizing these two losses together enables the CVAE model to generate diverse predictions. With the multi–sampling process of the latent variables, the predicted probabilities of interaction at each frame vary, especially when the probabilities change over time, see Fig. 11 and 12; the variance of the probabilities indicates the uncertainty in the predictions. In contrast, the baseline model is trained only by optimizing the reconstruction loss. It tends to learn the “average” behavior of road users. Predictions by the baseline model are rather deterministic and there is no mechanism to interpret the uncertainty of the predictions.
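The two training terms described above can be written out explicitly. The sketch below is a simplified illustration with assumed shapes and an assumed weighting factor, not the authors' implementation: a cross-entropy reconstruction loss on the frame-wise predictions plus the closed-form KL divergence between a learned diagonal Gaussian and the standard normal prior.

```python
import math

def bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy (reconstruction loss) over frame-wise labels."""
    return -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for t, p in zip(y_true, y_pred)
    ) / len(y_true)

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )

def cvae_loss(y_true, y_pred, mu, log_var, beta=1.0):
    """Combined objective; beta is an assumed weighting factor."""
    return bce(y_true, y_pred) + beta * kl_to_standard_normal(mu, log_var)

# A posterior that exactly matches the prior contributes zero KL:
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
```

Minimizing the KL term pulls the latent distribution toward the prior, which is what makes drawing diverse samples from N(0, I) at inference time meaningful.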

The combined information of object detection and optical flow shows a stable performance for the interaction detection task. The performance of interaction detection depends highly on the quality of the input information extracted from videos, which is often impaired for various reasons, and a single type of information may not be sufficient for this task. As indicated by the limited performance of the ablative models that only use optical–flow information on the KoW dataset, without the object information the noisy optical–flow information from the through lane or the padded zeros may impair the detection performance. Similarly, the ablative models that only use object information on the NGY dataset also achieved limited performance: the distorted object information, especially for road users close to the camera at the NGY intersection, could lead to wrong interaction detections. Combining the extraction techniques increases the likelihood of maintaining good-quality input information and thereby achieving a stable interaction detection performance.

The self-attention mechanism does not show a consistent benefit across the datasets. The CVAE models with and without the self-attention mechanism yield very similar results for interaction detection using both the sliding window and padding methods on the KoW dataset, and using the padding method on the NGY dataset. An improvement with the self-attention mechanism can only be found for the sliding window method on the NGY dataset. First, the self-attention layer is followed by an LSTM (Fig. 4), which may be redundant for learning the interconnections along the time axis. The self-attention layer is likely under-trained due to the small dataset size or redundant layers, whereas the LSTM is already sufficient for learning the temporal patterns of the sequence data from the KoW intersection. The sequence data from the NGY intersection, on the other hand, is more complex, e.g., with longer and more varied sequence lengths (see Fig. 8) and a higher traffic density. There, on top of the LSTM, the self-attention mechanism has turned out to be beneficial for further learning the temporal patterns.

In summary, the sliding window method is more flexible than the padding method in dealing with various sequence lengths. The CVAE models using the combined information of both object detection and optical flow achieve a more stable performance compared to using a single type of information. The multi–sampling process enables the CVAE–based models to mimic the uncertainty of road users’ behavior, and the self–attention mechanism is only beneficial for learning temporal patterns from complex data. Overall the proposed model [CVAE+ob+op+att] using the sliding window method achieves a more desirable performance across the datasets.

VII Discussion

Here we discuss the failed detections by the proposed CVAE model using the sliding window method and the challenges of transferring the model smoothly from one intersection to the other.

VII-A Failed detection

Various reasons can lead to a wrong interaction classification. Table VII categorizes the wrongly detected scenarios, i.e., false negatives (FN) and false positives (FP), tested on both the KoW and NGY datasets. The false negative examples are visualized in Fig. 13 for KoW and in Fig. 14 for NGY, and the false positive examples are visualized in Fig. 15 for KoW and in Fig. 16 for NGY.

Errors Scenario description Category KoW NGY
pedestrian entering the intersection (FN-I) 8 7
FN pedestrian leaving the intersection (FN-II) 1 4
cyclist entering the intersection (FN-III) - 2
car following (FP-I) 4 17
FP pedestrian standing near the intersection (FP-II) - 3
pedestrian approaching from the sidewalk (FP-III) - 1
pedestrian finishing crossing (FP-IV) 1 1
Total 14 35
The total number of the wrongly detected scenarios listed here differs slightly from the above confusion matrices due to the multi–sampling of the CVAE model.
TABLE VII: Categories of the wrongly detected scenarios by [CVAE+ob+op+att] using the sliding window method.

The FN scenarios are associated with VRUs entering (FN-I and FN-III) or leaving (FN-II) the intersection space. Due to their relatively long distance to the target vehicle, but fast travel speed, they are erroneously classified as non-interaction.

(a) FN-I
(b) FN-II
Fig. 13: Examples of the false negative detection on the KoW dataset. The right–turning target vehicles are denoted by the blue bounding boxes and the involved VRUs are denoted by the red bounding boxes.
(a) FN-I
(b) FN-II
(c) FN-III
Fig. 14: Examples of the false negative detection on the NGY dataset. The left–turning target vehicles are denoted by the blue bounding boxes and the involved VRUs are denoted by the red bounding boxes.

Most of the FP scenarios are associated with the target vehicle following a leading vehicle. As exemplified by FP-I in Fig. 15 and 16, only the leading vehicle (in the yellow bounding box) required direct interactions with the involved VRUs. After the leading vehicle finished turning, the pedestrian (in the red bounding box) also completed crossing; afterwards, no interaction was required from the target vehicle (in the blue bounding box) with the VRUs. However, the CVAE model has limited performance in handling this type of situation, because the current model has no explicit information to differentiate the leading and target vehicles and is not specifically trained for car-following situations.

In addition, a short distance from the VRUs to the intersection, e.g., standing on the sidewalk close to the intersection (FP-II, Fig. 16) or having just finished crossing (FP-IV, Fig. 15 and 16), can also lead to an FP case. Distance distortion may lead to an FP case as well. For example, in FP-III in Fig. 16, even though the pedestrian on the sidewalk was relatively far from the turning vehicle, the scene was still classified as an interaction by the model due to the distorted distance at the NGY intersection. The camera at the KoW intersection, however, was installed at a higher elevation than the camera at the NGY intersection, so the distortion is less harmful to the horizontal distance. Among other reasons, this might have contributed to the better performance of the model on the KoW dataset than on the NGY dataset.

(a) FP-I
(b) FP-IV
Fig. 15: Examples of the false positive detection on the KoW dataset. The right–turning target vehicles are denoted by the blue bounding boxes, the leading, but not target vehicle, is denoted by the yellow bounding box, and the involved VRUs are denoted by the red bounding boxes.
(a) FP-I
(b) FP-II
(c) FP-III
(d) FP-IV
Fig. 16: Examples of the false positive on the NGY dataset. The left–turning target vehicles are denoted by the blue bounding boxes, the leading, but not target vehicle, is denoted by the yellow bounding box, and the involved VRUs are denoted by the red bounding boxes.

Based on the discussion of the failed detection scenarios, the limitations of the proposed model are summarized as follows: 1) The definition of interaction only considers the relationship between the target turning vehicle and the involved VRUs; the car–following relationship is not included. Ignoring this relationship often leads to false positive detections between the following car and the VRUs. 2) The crossing directions of VRUs are not used as a factor to differentiate interaction types. For example, interactions between a turning vehicle and pedestrians or cyclists approaching the crossing area from the near side and the far side are labeled as the same interaction type. However, the discussion above indicates that the moving directions of VRUs are important for estimating the interactions between the turning vehicle and VRUs, especially when the VRUs are leaving the intersection. 3) The exact distances between a turning vehicle and the involved VRUs are not measured, so the change in these distances cannot be correctly quantified. Without distance measurements, it is difficult for a model to distinguish the subtle difference between interaction and non-interaction; especially when the image distance is distorted by the camera’s perspective or when an occlusion occurs, the model’s performance will be impaired.

VII-B Challenges of cross dataset generalization

To analyze the generalizability of the above models, the models proposed in this paper are applied for interaction detection at different intersections. In the previous experimental setting, all the models were trained and tested using data from the same intersection. In this section, the models trained on the KoW dataset were tested on the NGY dataset, and vice versa. Frames from the test set were resized to the same size and mirrored to the same turning direction as the training set, so that the trained models could be tested on both datasets without changing the input size.
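The two frame adaptations used for cross-dataset testing can be sketched as below. Real pipelines would use an image library such as OpenCV; this pure-Python version on nested lists only illustrates the operations: a horizontal mirror so right turns and left turns share one direction, and a nearest-neighbor resize to match the trained input size.

```python
def mirror_horizontal(frame):
    """Flip each pixel row left-to-right."""
    return [row[::-1] for row in frame]

def resize_nearest(frame, new_h, new_w):
    """Resize a 2-D frame with nearest-neighbor sampling."""
    h, w = len(frame), len(frame[0])
    return [
        [frame[i * h // new_h][j * w // new_w] for j in range(new_w)]
        for i in range(new_h)
    ]

frame = [[1, 2], [3, 4]]
print(mirror_horizontal(frame))        # [[2, 1], [4, 3]]
print(resize_nearest(frame, 4, 4)[0])  # [1, 1, 2, 2]
```

As the text notes, such resizing distorts the apparent motion and position of dynamic objects, which is one reason cross-dataset performance degrades.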

Table VIII lists the results of the cross-dataset validation. Both the CVAE–based and the sequence-to-sequence encoder–decoder models fail to achieve good performance with either the sliding window or the padding method. This could be because the two datasets (Sec. IV) are very different in terms of, e.g., the vehicle’s travel direction, the camera’s perspective, frame size and rate, sequence length, intersection layout, traffic density, and cultural factors (Germany vs. Japan). Note that, because the camera parameters and reference coordinates were not available for the datasets, no projection is applied in this paper to transform the perspective to a bird’s-eye view. Under the cross-dataset validation setting, the resized frames distort the motion and position of the dynamic objects and confuse the models in predicting the interactions between vehicles and VRUs. This leads to the future research question: how can the models be generalized to different intersections, traffic conditions, and even cultures?

Model Shape Accuracy Precision Recall F1-score
Trained on the NGY dataset and tested on the KoW dataset
[S+ob+op+att] sli. 0.490 0.430 0.490 0.350
[C+ob+op+att] sli. 0.490 0.495 0.934 0.647
[S+ob+op+att] pad. 0.473 0.450 0.470 0.400
[C+ob+op+att] pad. 0.485 0.491 0.804 0.609
Trained on the KoW dataset and tested on the NGY dataset
[S+ob+op+att] sli. 0.535 0.526 0.704 0.602
[C+ob+op+att] sli. 0.541 0.540 0.557 0.548
[S+ob+op+att] pad. 0.464 0.420 0.460 0.380
[C+ob+op+att] pad. 0.490 0.475 0.174 0.255
TABLE VIII: Performance of cross dataset validation for interaction detection on the KoW and NGY datasets.

VIII Conclusion

In this paper, an end-to-end sequence-to-sequence generative model based on CVAE has been proposed to automatically detect interactions between vehicles and VRUs at intersections using video data. All the road users that appear during a vehicle’s turning time are detected by a deep learning object detector, and their motion information is simultaneously captured by optical flow. Together, the sequences of object detection and optical–flow information provide rich information for interaction detection. Both sliding window and padding methods are explored to learn dynamic patterns from turning sequences of varying lengths. The proposed model predicts a fine–grained interaction class label at each frame, which provides a clue of how the intensity of an interaction between a turning vehicle and VRUs evolves as time unfolds. The average voting scheme summarizes the frame–wise predictions to obtain an accurate class label for the overall sequence. In addition, the multi–sampling process generates diverse predictions, and the Kernel Density Estimation function is used to measure the uncertainty level.

The efficacy of the model was validated at a right–turn intersection in Germany and a left–turn intersection in Japan. It achieved an F1-score above 0.96 at the right–turn intersection and 0.89 at the left–turn intersection, and outperformed a sequence-to-sequence encoder–decoder model quantitatively and qualitatively.

Furthermore, a series of ablation studies investigated the effectiveness of the combined information from object detection and optical flow, as well as of the self-attention mechanism for learning temporal patterns from complex sequences. The comparison between the sliding window and padding methods showed that the former is more flexible in coping with sequences of varying length: the number of samples is not restricted by the maximum sequence length a model can handle, in contrast to the padding method. The self-attention mechanism showed a clear positive effect for interaction detection only on the complex NGY dataset.
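The contrast between the two sequence-handling strategies can be made concrete with a small sketch (function names and the padding value are illustrative, not from the paper):

```python
def sliding_windows(seq, size, stride=1):
    """Segment a variable-length sequence into fixed-size windows.

    Yields one training/inference sample per window position, so the
    number of samples grows with sequence length instead of being
    capped by a fixed model input length.
    """
    if len(seq) < size:
        return []
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, stride)]

def pad_sequence(seq, max_len, pad_value=0):
    """Right-pad (or truncate) a sequence to a fixed max_len, the
    alternative strategy: one sample per sequence, length capped."""
    return (seq + [pad_value] * (max_len - len(seq)))[:max_len]
```

For a 100-frame turning sequence and a window size of 30, the sliding window method yields 71 overlapping samples, whereas padding yields exactly one (possibly truncated) sample, which matches the flexibility argument above.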

In future work, several improvements can be made to address the limitations of the detection model. First, the binary classification of interactions should be extended to multi-class classification, e.g., taking the confrontation direction and car-following relationships into consideration. Second, the accuracy of feature extraction can be enhanced by using multiple cameras or tracking. Third, projective transformation techniques, or data recorded by drones with a bird's-eye view, can be explored to reduce the distortion caused by the camera's perspective and to filter the noisy optical-flow information captured from the through lane next to the turning lane. Last but not least, the generalizability of the model for interaction detection at different intersections needs to be studied further.


Acknowledgment

The project is funded by the German Research Foundation (DFG) through the Research Training Group SocialCars (227198829/GRK1931). This work is a collaboration with the Murase Lab at Nagoya University and is supported by Nagoya Toyopet Corporation with the Nagoya intersection dataset.
