Driver Anomaly Detection: A Dataset and Contrastive Learning Approach

09/30/2020 ∙ by Okan Köpüklü, et al. ∙ 1

Distracted drivers are more likely to fail to anticipate hazards, which results in car accidents. Therefore, detecting anomalies in drivers' actions (i.e., any action deviating from normal driving) is of utmost importance for reducing driver-related accidents. However, there are unboundedly many anomalous actions that a driver can perform while driving, which leads to an 'open set recognition' problem. Accordingly, instead of recognizing a set of anomalous actions that are commonly defined by previous dataset providers, in this work, we propose a contrastive learning approach to learn a metric to differentiate normal driving from anomalous driving. For this task, we introduce a new video-based benchmark, the Driver Anomaly Detection (DAD) dataset, which contains normal driving videos together with a set of anomalous actions in its training set. In the test set of the DAD dataset, there are unseen anomalous actions that still need to be winnowed out from normal driving. Our method reaches 0.9673 AUC on the test set, demonstrating the effectiveness of the contrastive learning approach on the anomaly detection task. Our dataset, code and pre-trained models are publicly available.




1 Introduction

Figure 1: Using contrastive learning, a normal driving template vector is learnt during training. At test time, any clip whose embedding deviates from the normal driving template by more than a threshold is considered anomalous driving. Examples are taken from the newly introduced Driver Anomaly Detection (DAD) dataset for front (left) and top (right) views on the depth modality.

Driving has become an indispensable part of modern life, providing a high level of convenient mobility. However, this strong dependency on driving also leads to an increased number of road accidents. According to the World Health Organization's estimates, 1.25 million people die in road accidents per year, and up to 50 million people are injured. Human factors are the main contributing cause in almost 90% of road accidents, with distraction as the main factor in around 68% of them [7]. Accordingly, the development of a reliable Driver Monitoring System (DMS), which can supervise a driver's performance, alertness, and driving intention, is of utmost importance for preventing human-related road accidents.

Due to the increased popularity of deep learning methods in computer vision applications, several datasets have been introduced to facilitate video based driver monitoring [23, 26, 1]. However, all these datasets are partitioned into a finite set of known classes, such as a normal driving class and several distraction classes, with equivalent training and testing distributions. In other words, these datasets are designed for closed set recognition, where all samples in the test set belong to one of the K known classes that the networks are trained with. This raises a very important question: How would the system react if an unknown class is introduced to the network? This uncertainty is a serious problem, since there might be unboundedly many distracting actions that a driver can perform while driving.

Different from available datasets and the majority of research on DMS applications, we propose an open set recognition approach for video based driver monitoring. Since the main purpose of a DMS is to ensure that the driver drives attentively and safely, which is referred to as normal driving in this work, we propose a deep contrastive learning approach to learn a metric in order to distinguish normal driving from anomalous driving. Fig. 1 illustrates the proposed approach.

In order to facilitate further research, we introduce a large scale, multi-view, multi-modal Driver Anomaly Detection (DAD) dataset. The DAD dataset contains a normal driving class together with a set of anomalous driving actions in its training set. However, there are several unseen anomalous actions in the test set of the DAD dataset that still need to be distinguished from normal driving. We believe that the DAD dataset addresses the true nature of driver monitoring.

Overall, the main contributions of this work can be summarized as:

  • We introduce the DAD dataset, which is the first video based open set recognition dataset for vision based driver monitoring applications. The DAD dataset is multi-view (front and top views), multi-modal (depth and infrared modalities) and large enough to train deep Convolutional Neural Network (CNN) architectures from scratch.

  • We propose a deep contrastive learning approach to distinguish normal driving from anomalous driving. Although contrastive learning has recently been popular mainly for unsupervised metric learning, we demonstrate its effectiveness for this task by achieving 0.9673 AUC on the test set of the DAD dataset.

  • We present a detailed ablation study on the DAD dataset and the proposed contrastive learning approach in order to give better insights into them.

2 Related Work

Vision Based Driver Monitoring Datasets. There are several hand-focused datasets such as CVRR-HANDS 3D [24], VIVA-Hands [5] and DriverMHG [19]. Although these datasets aim to facilitate research on hand gesture recognition for human machine interaction, they can be used to detect hand position [21], which is highly correlated with the drivers' ability to drive. Ohn-Bar et al. introduce two additional datasets [24, 25] in order to study hand activity and pose, which can be used to identify the driver's state.

Drivers' face and head information also provides very important cues for identifying the driver's state, such as head pose, gaze direction, fatigue and emotions. There are several datasets such as [2, 27, 8] that provide eye-tracking annotations. This information, together with the interior design of the cabin, helps to identify where the driver is paying attention, as in the DrivFace dataset [6]. In addition, datasets such as DriveAHead [32] and DD-Pose [29] provide head pose annotations of yaw, pitch and roll angles.

There are also datasets that focus on the body actions of the drivers. StateFarm [9] is the first image-based dataset for this purpose, which contains safe driving and 9 additional distracting classes. A similar image-based dataset, AUC Distracted Driver (AUC DD) [1], is proposed using a side-view camera to capture drivers' actions. However, these two datasets are image-based and lack important temporal information. A simple modification of the AUC DD dataset to investigate the importance of spatio-temporal information is presented in [20]. Recently, the Drive&Act dataset is introduced in [23], which is recorded with 5 NIR cameras where subjects perform distraction-related actions for an autonomous driving scenario.

None of the datasets mentioned above is designed for open set recognition scenarios [31], where unknown actions are performed at test time. From this perspective, the introduced DAD dataset is the first available dataset designed for open set recognition.

Contrastive Learning Approaches. Since their initial proposition [11], these approaches learn representations by contrasting positive pairs against negative pairs. In [35], the full softmax distribution is approximated by Noise Contrastive Estimation (NCE) [10]; a memory bank and Proximal Regularization [28] are used in order to stabilize the learning process. Following works use similar approaches with several modifications. In [38], instances that are close to each other in the embedding space are used as positive pairs in addition to the augmented versions of the original images. In [12], a dynamic dictionary with a queue and a moving-average encoder are presented. The authors in [33] try to bring different views of the same scene together in the embedding space, while pushing views of different scenes apart. A projection head is introduced in [4], which improves the quality of the learned representations. It has been shown that models with unsupervised pretraining perform better than models with supervised pretraining in various tasks [4]. Moreover, the performance of supervised contrastive learning is also validated in [16].

Lightweight CNN Architectures. Since DMS applications need to be deployed in the car, it is critical to have a resource efficient architecture. In recent years, several lightweight CNN architectures have been proposed. SqueezeNet [15] is the first and most well-known architecture, which consists of fire modules to achieve AlexNet-level accuracy with 50x fewer parameters. MobileNet [14] contains depthwise separable convolutions with a width multiplier parameter to achieve a thinner or wider network. MobileNetV2 [30] contains inverted residual blocks and the ReLU6 activation function. ShuffleNet [37] proposes to use a channel shuffle operation together with pointwise group convolution. ShuffleNetV2 [22] upgrades it with several principles that are effective in designing lightweight architectures. Networks using Neural Architecture Search (NAS) [39], such as NASNet [40] and FBNet [34], provide another direction for designing lightweight architectures. In this work, we have used 3D versions of several resource efficient architectures, which are introduced in [18].

3 Driver Anomaly Detection (DAD) Dataset

Figure 2: Environment for data collection. (a) Driving simulator with camera placements. (b) Infineon CamBoard pico flexx camera installed for front and top views. Examples of (c) top depth, (d) top infrared, (e) front depth and (f) front infrared recordings.

There are several vision-based driver monitoring datasets that are publicly available, but none targets the task of open set recognition, in which normal driving should still be distinguished from unseen anomalous actions. In order to fill this research gap, we have recorded the Driver Anomaly Detection (DAD) dataset, which has the following properties:

  • The DAD dataset is large enough to train Deep Neural Network architectures from scratch.

  • The DAD dataset is multi-modal, containing depth and infrared modalities such that the system is operable under different lighting conditions.

  • The DAD dataset is multi-view, containing front and top views. These two views are recorded synchronously and complement each other.

  • The videos are recorded at 45 frames per second, providing high temporal resolution.

We have recorded the DAD dataset using a driving simulator that is shown in Fig. 2. The driving simulator contains a real BMW car cockpit, and the subjects are instructed to drive in a computer game that is projected in front of the car. Two Infineon CamBoard pico flexx cameras are placed on top and in front of the driver. The front camera is installed to record the drivers’ head, body and visible part of the hands (left hand is mostly obscured by the driving wheel), while top camera is installed to focus on the drivers’ hand movements. The dataset is recorded in synchronized depth and infrared modalities with the resolution of 224 x 171 pixels and frame rate of 45 fps. Example recordings for the two views and two modalities are shown in Fig. 2.

Figure 3: The DAD dataset statistics.

| Training set | Test set (also in training set) | Test set (unseen) | Test set (unseen) |
|---|---|---|---|
| Talking on the phone-left | Talking on the phone-left | Adjusting side mirror | Wearing glasses |
| Talking on the phone-right | Talking on the phone-right | Adjusting clothes | Taking off glasses |
| Messaging left | Messaging left | Adjusting glasses | Picking up something |
| Messaging right | Messaging right | Adjusting rear-view mirror | Wiping sweat |
| Talking with passengers | Talking with passengers | Adjusting sunroof | Touching face/hair |
| Reaching behind | Reaching behind | Wiping nose | Sneezing |
| Adjusting radio | Adjusting radio | Head dropping (dozing off) | Coughing |
| Drinking | Drinking | Eating | Reading |

Table 1: Anomalous actions in the training and test sets. The 16 actions in the test set that are not available in the training set are listed in the two rightmost columns.

For the dataset recording, 31 subjects are asked to drive in a computer game performing either normal driving or anomalous driving. The training set contains recordings of 25 subjects, and each subject has 6 normal driving and 8 anomalous driving video recordings. Each normal driving video lasts about 3.5 minutes, and each anomalous driving video lasts about 30 seconds and contains a different distracting action. The list of distracting actions recorded in the training set can be found in Table 1. In total, there are around 550 minutes of recording for normal driving and 100 minutes of recording for anomalous driving in the training set.

The test set contains 6 subjects, and each subject has 6 video recordings lasting around 3.5 minutes. Anomalous actions occur randomly during the videos. Most importantly, there are 16 distracting actions in the test set that are not available in the training set, which can be found in Table 1. Because of these additional distracting actions, the networks need to be trained according to the open set recognition task and distinguish normal driving no matter what the distracting action is. The complete test set consists of 88 minutes of recording for normal driving and 45 minutes of recording for anomalous driving. The test set constitutes 17% of the complete DAD dataset, which is around 95 GB. The dataset statistics can be found in Fig. 3.
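As a quick sanity check on the figures quoted above, the following sketch recomputes the nominal recording durations from the protocol (subjects × videos × minutes per video) and the share of the test set. It only uses numbers stated in the text; the variable names are illustrative, and the small gap between the nominal 525 minutes and the reported "around 550 minutes" of normal driving is consistent with each video lasting "about" 3.5 minutes.

```python
# Durations quoted in the text, in minutes of recording.
train_normal_min = 550
train_anomalous_min = 100
test_normal_min = 88
test_anomalous_min = 45

# Nominal durations implied by the recording protocol:
# 25 training subjects, 6 normal videos of ~3.5 min, 8 anomalous videos of ~0.5 min.
nominal_train_normal = 25 * 6 * 3.5
nominal_train_anomalous = 25 * 8 * 0.5

train_total = train_normal_min + train_anomalous_min
test_total = test_normal_min + test_anomalous_min
test_share = test_total / (train_total + test_total)

print(f"nominal normal training minutes:    {nominal_train_normal}")
print(f"nominal anomalous training minutes: {nominal_train_anomalous}")
print(f"test share of all recordings:       {test_share:.1%}")
```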

4 Methodology

4.1 Contrastive Learning Framework

Our motivation is to learn a compact representation for normal driving such that any action deviating from normal driving beyond a threshold can be detected as an anomalous action. Accordingly, inspired by recent progress in contrastive learning algorithms, we try to maximize the similarity between normal driving samples and minimize the similarity between normal driving and anomalous driving samples in the latent space using a contrastive loss. Fig. 4 illustrates the applied framework, which has three major components:

  • Base encoder $f_\theta(\cdot)$ is used to extract vector representations of input clips. $f_\theta(\cdot)$ refers to a 3D-CNN architecture with parameters $\theta$. We performed experiments with ResNet-18 and various resource efficient 3D-CNNs to transform an input clip $x_i$ into $h_i \in \mathbb{R}^{512}$ via $h_i = f_\theta(x_i)$.

  • Projection head $g_\beta(\cdot)$ is used to map $h_i$ into another latent space $v_i$. According to findings in [4], it is beneficial to define the contrastive loss on $v_i$ rather than $h_i$. $g_\beta(\cdot)$ refers to an MLP with one hidden layer with ReLU activation and has parameters $\beta$ to achieve the transformation $v_i = g_\beta(h_i)$. After the MLP, $\ell_2$ normalization is applied to the embedding $v_i$.

  • Contrastive loss is used to impose that normalized embeddings from the normal driving class are closer together than embeddings from different anomalous action classes. For this reason, positive pairs in the contrastive loss are always selected from normal driving clips, whereas anomalous driving clips are used only as negative samples.

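We divide our normal and anomalous videos into clips for training. Within a mini-batch, we have $K$ normal driving clips and $M$ anomalous driving clips with index $i \in \{1, \dots, K+M\}$. The final embeddings of the normal and anomalous driving clips are denoted as $v_{n_i}$ and $v_{a_i}$, respectively. There are in total $K(K-1)$ positive pairs and $KM$ negative pairs in every mini-batch. For the supervised contrastive learning approach that we have applied to the driver anomaly detection task, the loss takes the following final form:

$$\mathcal{L}_{ij} = -\log \frac{\exp(v_{n_i}^\top v_{n_j} / \tau)}{\exp(v_{n_i}^\top v_{n_j} / \tau) + \sum_{m=1}^{M} \exp(v_{n_i}^\top v_{a_m} / \tau)} \quad (1)$$

$$\mathcal{L} = \frac{1}{K(K-1)} \sum_{i=1}^{K} \sum_{j=1}^{K} \mathbb{1}_{j \neq i} \, \mathcal{L}_{ij} \quad (2)$$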

Figure 4: Contrastive learning framework for the driver anomaly detection task. A pair of normal driving clips and a number of anomalous driving clips (2 in this example) are fed to a base encoder $f_\theta(\cdot)$ and projection head $g_\beta(\cdot)$ to extract visual representations $h_i$ and $v_i$, respectively. Once training is completed, the projection head is removed, and only the encoder $f_\theta(\cdot)$ is used for test time recognition.

where $\mathbb{1}_{j \neq i} \in \{0, 1\}$ is an indicator function that returns 1 if $j \neq i$ and 0 otherwise, and $\tau \in (0, \infty)$ is a scalar temperature parameter that controls the concentration level of the distribution [13]. Typically, $\tau$ is chosen between 0 and 1 to amplify the similarity between samples, which is beneficial for training. The inner product of vectors measures the cosine similarity between encoded feature vectors because they are all $\ell_2$ normalized. By optimizing Eq. (2), the encoder is updated to maximize the similarity between the normal driving feature vectors $v_{n_i}$ and $v_{n_j}$ while minimizing the similarity between the normal driving feature vector $v_{n_i}$ and all anomalous driving feature vectors $v_{a_m}$ in the same mini-batch.
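The loss described above can be sketched in plain Python as follows. This is a minimal illustration rather than the released implementation: it assumes that each ordered pair of normal clips forms one positive pair, that all anomalous clips in the mini-batch serve as negatives for that pair, and that the per-pair losses are averaged over the positive pairs. The function names, the toy 2-dimensional embeddings and the temperature value 0.5 are all illustrative assumptions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def contrastive_loss(normal, anomalous, tau):
    """Average softmax-style loss over ordered pairs of normal embeddings.

    normal:    list of K L2-normalized embeddings of normal driving clips
    anomalous: list of M L2-normalized embeddings of anomalous driving clips
    tau:       temperature (an illustrative value is used below)
    """
    k = len(normal)
    total = 0.0
    for i in range(k):
        # Negative term: anchor i against every anomalous clip in the batch.
        neg = sum(math.exp(dot(normal[i], a) / tau) for a in anomalous)
        for j in range(k):
            if j == i:
                continue  # a clip is not paired with itself
            pos = math.exp(dot(normal[i], normal[j]) / tau)
            total += -math.log(pos / (pos + neg))
    return total / (k * (k - 1))


# Toy example: normal clips cluster together, anomalous clips lie elsewhere.
normal_good = [l2_normalize([1.0, 0.1]), l2_normalize([1.0, -0.1])]
anomalous_far = [l2_normalize([-1.0, 0.0]), l2_normalize([0.0, 1.0])]
loss_good = contrastive_loss(normal_good, anomalous_far, tau=0.5)

# Same anomalous clips, but the normal clips are spread apart.
normal_bad = [l2_normalize([1.0, 0.1]), l2_normalize([0.0, 1.0])]
loss_bad = contrastive_loss(normal_bad, anomalous_far, tau=0.5)

print(f"compact normal cluster:   {loss_good:.4f}")
print(f"scattered normal cluster: {loss_bad:.4f}")
```

The loss is lower when normal embeddings are close to each other and far from the anomalous ones, which is the behaviour the text asks the encoder to learn.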

Noise Contrastive Estimation. The representation learnt by Eq. (2) can be improved by introducing many more anomalous driving clips (i.e. negative samples). In the extreme case, we can use all training samples of anomalous driving. However, this is too expensive considering the limited memory of the GPU used. Noise Contrastive Estimation [10] can be used to approximate the full softmax distribution as in [10, 35]. In our implementation, we have used the $M$ negative samples in our mini-batch and applied $(M+1)$-way softmax classification, as also used in [33, 12, 3]. Different from these works, we do not use a memory bank and optimize our framework using only the elements in the mini-batch.

4.2 Test Time Recognition

The common practice to evaluate learned representations is to train a linear classifier on top of the frozen base network [33, 12, 3, 4]. However, this final training is tricky since representations learned by unsupervised and supervised training can be quite different. For example, training of the final linear classifier is performed with a learning rate of 30, although unsupervised learning is performed with an initial learning rate of 0.01. In addition, the authors in [35] apply $k$-nearest neighbours (kNN) classification for the final evaluation. However, kNN requires distance calculation with all training clips for each test clip, which is computationally expensive.

For test time recognition, we propose an evaluation protocol that requires neither further training nor complex computations. After the training phase, we discard the projection head as in [4] and use the trained 3D-CNN model to encode every normal driving training clip $x_i$, $i \in \{1, \dots, N\}$, into a set of $\ell_2$ normalized 512-dimensional feature representations. Afterwards, the normal driving template vector $v_n$ can be calculated with:

$$v_n = \frac{1}{N} \sum_{i=1}^{N} \frac{f_\theta(x_i)}{\|f_\theta(x_i)\|_2} \quad (3)$$

To classify a test video clip $x_i$, we encode it again into an $\ell_2$ normalized 512-dimensional vector and compute the cosine similarity between the encoded clip and $v_n$ by:

$$sim_i = v_n^\top \frac{f_\theta(x_i)}{\|f_\theta(x_i)\|_2} \quad (4)$$

Finally, any clip whose similarity score is below a threshold, $sim_i < \gamma$, is classified as anomalous driving. This way, only a simple vector multiplication is performed for test time evaluation. Moreover, the similarity score of the test clip indicates the severity of the anomalous behavior.
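The test-time protocol can be illustrated with a small sketch. It assumes that the template is the mean of the L2-normalized encoder features of the normal driving training clips and that a clip is scored by the inner product between the template and its own normalized feature. The 2-dimensional toy features stand in for the 512-dimensional encoder outputs, and the helper names and the threshold 0.8 are illustrative assumptions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def normal_template(normal_features):
    """Mean of the L2-normalized features of normal driving training clips."""
    normalized = [l2_normalize(f) for f in normal_features]
    n = len(normalized)
    dim = len(normalized[0])
    return [sum(f[d] for f in normalized) / n for d in range(dim)]


def similarity(template, feature):
    """Similarity score between the template and one encoded clip."""
    return dot(template, l2_normalize(feature))


def is_anomalous(template, feature, threshold):
    """A clip is flagged when its similarity falls below the threshold."""
    return similarity(template, feature) < threshold


# Toy encoder outputs (2-D instead of 512-D) for normal training clips.
template = normal_template([[2.0, 0.0], [1.0, 0.2]])

normal_clip = [3.0, 0.1]
odd_clip = [0.0, 1.0]
print(similarity(template, normal_clip), is_anomalous(template, normal_clip, 0.8))
print(similarity(template, odd_clip), is_anomalous(template, odd_clip, 0.8))
```

Because the template is computed once after training, scoring a new clip costs one encoder pass and one inner product.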

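Fusion of Different Views and Modalities. The DAD dataset contains front and top views, and depth and infrared modalities. We have trained a separate model for each view and modality and fused them later with decision level fusion. As an example, the fused similarity score for the top view depth and infrared modalities is calculated with:

$$sim_i = \frac{1}{2} \left( sim_i^{(D)} + sim_i^{(IR)} \right) \quad (5)$$

where $sim_i^{(D)}$ and $sim_i^{(IR)}$ are the similarity scores of Eq. (4) obtained from the top view depth and infrared models, respectively.
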
It must be noted that each applied view and modality increases the required memory and inference time, which would be critical for autonomous driving applications.
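A minimal sketch of decision level fusion is given below. It assumes that fusion means an equally weighted average of the per-clip similarity scores produced by the separately trained models; the equal weighting, the function name and the example scores are assumptions made for illustration.

```python
def fuse_scores(scores_per_stream):
    """Average per-clip similarity scores across views/modalities.

    scores_per_stream: one list of per-clip scores for each model
    (e.g. top depth, top infrared); all lists must cover the same clips.
    Equal weights are an assumption of this sketch.
    """
    n_streams = len(scores_per_stream)
    n_clips = len(scores_per_stream[0])
    if any(len(scores) != n_clips for scores in scores_per_stream):
        raise ValueError("every stream must score the same clips")
    return [
        sum(scores[i] for scores in scores_per_stream) / n_streams
        for i in range(n_clips)
    ]


# Hypothetical scores for two clips from two models.
top_depth = [0.9, 0.2]
top_infrared = [0.7, 0.4]
fused = fuse_scores([top_depth, top_infrared])
print(fused)
```

Each additional stream adds one more model to run per clip, which is the memory and inference-time cost mentioned above.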

| Model | Loss | Top Depth | Top IR | Top D+IR | Front Depth | Front IR | Front D+IR | Top+Front Depth | Top+Front IR | Top+Front D+IR |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-18 | CE Loss | 0.7982 | 0.8183 | 0.8384 | 0.8416 | 0.8493 | 0.8816 | 0.8783 | 0.8967 | 0.9190 |
| ResNet-18 | Weighted CE Loss | 0.8047 | 0.8169 | 0.8399 | 0.8921 | 0.8808 | 0.9044 | 0.9017 | 0.9070 | 0.9275 |
| ResNet-18 | Contrastive Loss | 0.9128 | 0.8804 | 0.9166 | 0.8996 | 0.8695 | 0.9196 | 0.9609 | 0.9321 | 0.9655 |

Table 2: Performance comparison (AUC) of contrastive loss, CE loss and weighted CE loss for different views and modalities.

4.3 Training Details

We train our models from scratch for 250 epochs using Stochastic Gradient Descent (SGD) with momentum 0.9 and initial learning rate of 0.01. The learning rate is reduced with a factor of 0.1 every 100 epochs. The DAD dataset videos are divided into non-overlapping 32 frames clips. In every mini-batch, we have 10 normal driving clips and 150 anomalous driving clips. We have set the temperature

. Several data augmentation methods are applied: multi-scale random cropping, salt and pepper noise, random rotation, random horizontal flip (only for top view). We have used 16 frames input clips, which are downsampled from 32 frames and resized to resolution. At test time, the output score of a 16 frames clip is assigned to the middle frame of the clip (i.e.

frame). For the evaluation metric, we have mainly used the area under the curve (AUC) of the ROC curve, since it provides a calibration-free measure of detection performance.
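To make the "calibration-free" property concrete, the sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen normal clip receives a higher similarity score than a randomly chosen anomalous clip, with ties counted as one half. The function name, the label convention (1 for normal, 0 for anomalous) and the example scores are illustrative assumptions; no threshold is needed anywhere in the computation.

```python
def roc_auc(scores, labels):
    """AUC from pairwise comparisons of similarity scores.

    scores: similarity score per clip (higher means more normal)
    labels: 1 for normal driving, 0 for anomalous driving
    """
    normal = [s for s, y in zip(scores, labels) if y == 1]
    anomalous = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in normal:
        for q in anomalous:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(normal) * len(anomalous))


labels = [1, 1, 0, 0]
perfect = roc_auc([0.9, 0.8, 0.4, 0.3], labels)
mixed = roc_auc([0.9, 0.3, 0.8, 0.4], labels)
print(perfect, mixed)
```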

We have implemented our code in PyTorch, and all the experiments are done using a single Titan XP GPU. Our code and pretrained models are publicly available.


5 Experiments

Baseline Results. We have used ResNet-18 as base encoder for the baseline results. All the models in the experiments are trained from scratch unless otherwise specified. For every view and modality, a separate model is trained and individual results as well as fusion results are reported in Table 3. The thresholds that are achieving highest classification accuracy are reported in Table 3. However, true positive rate and false positive rates change according to the applied threshold value. Therefore, we have also reported AUC of the ROC curve for baseline evaluation.
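Selecting the threshold that gives the highest classification accuracy can be sketched as a simple sweep over candidate values. The candidate grid, the function name and the example scores are assumptions made for illustration; the decision rule (a clip is anomalous when its similarity score is below the threshold) follows Section 4.2.

```python
def best_threshold(scores, labels, candidates):
    """Return (threshold, accuracy) with the highest accuracy.

    scores: similarity score per clip
    labels: 1 for normal driving, 0 for anomalous driving
    candidates: threshold values to try (the grid is an assumption)
    """
    best_t, best_acc = None, -1.0
    for t in candidates:
        # Clips scoring below the threshold are predicted anomalous (0).
        predictions = [0 if s < t else 1 for s in scores]
        correct = sum(1 for p, y in zip(predictions, labels) if p == y)
        accuracy = correct / len(labels)
        if accuracy > best_acc:
            best_t, best_acc = t, accuracy
    return best_t, best_acc


scores = [0.95, 0.85, 0.60, 0.30]
labels = [1, 1, 0, 0]
threshold, accuracy = best_threshold(scores, labels, [0.2, 0.5, 0.7, 0.9])
print(threshold, accuracy)
```

Accuracy depends on the chosen operating point, which is why the AUC is reported alongside it.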

Fusion of different modalities as well as different views always achieves better performance compared to single modalities and views. This shows that the different views and modalities in the dataset contain complementary information. Fusion of top/front views and depth/infrared modalities achieves the best performance with 0.9655 AUC. Using this fusion network, the visualization for a continuous video stream is illustrated in Fig. 5.

Contrastive Loss or Cross Entropy Loss? We have compared the performance of contrastive loss and cross entropy (CE) loss. We have trained a ResNet-18 with a final fc layer with CE loss to perform binary classification. However, since the data distribution for normal and anomalous driving is unbalanced in the training set of the DAD dataset, we have also experimented with weighted CE loss, where weights are set by inverse class frequency. Comparative results are reported in Table 2. Our findings are in accordance with [16]. Contrastive loss outperforms plain CE loss for every view and modality, and it outperforms weighted CE loss in all cases except the front view infrared modality (0.8695 vs. 0.8808 AUC).
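Inverse class frequency weighting can be illustrated as follows. The sketch uses the training durations quoted in Section 3 (around 550 minutes of normal and 100 minutes of anomalous driving) as a proxy for the number of clips per class, and normalizes the weights as total / (number of classes × class count). Both the proxy and the normalization convention are assumptions of this sketch, as are the function names.

```python
import math


def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {name: total / (n_classes * count) for name, count in class_counts.items()}


def weighted_cross_entropy(true_class_probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the sum of weights.

    true_class_probs: predicted probability of the correct class per sample
    labels:           class name per sample
    """
    weighted = [weights[y] * -math.log(p) for p, y in zip(true_class_probs, labels)]
    return sum(weighted) / sum(weights[y] for y in labels)


# Minutes of training recordings, used here as a proxy for clip counts.
weights = inverse_frequency_weights({"normal": 550, "anomalous": 100})
print(weights)

# A confident mistake on the rare class costs more than one on the frequent class.
loss_rare_wrong = weighted_cross_entropy([0.9, 0.1], ["normal", "anomalous"], weights)
loss_common_wrong = weighted_cross_entropy([0.1, 0.9], ["normal", "anomalous"], weights)
print(loss_rare_wrong, loss_common_wrong)
```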

| Metric | Threshold | Acc. (%) | AUC |
|---|---|---|---|
| Top(D) | 0.89 | 89.13 | 0.9128 |
| Top(IR) | 0.65 | 83.63 | 0.8804 |
| Top(DIR) | 0.76 | 87.75 | 0.9166 |
| Front(D) | 0.75 | 87.21 | 0.8996 |
| Front(IR) | 0.82 | 83.68 | 0.8695 |
| Front(DIR) | 0.81 | 88.68 | 0.9196 |
| Top+Front(D) | 0.83 | 91.60 | 0.9609 |
| Top+Front(IR) | 0.80 | 87.06 | 0.9311 |
| Top+Front(DIR) | 0.81 | 92.34 | 0.9655 |

Table 3: Results obtained by using a ResNet-18 as base encoder. Thresholds that result in the highest classification accuracy are reported.

Figure 5: Illustration of recognition for a continuous video stream using fusion of both views and modalities. Similarity score refers to the cosine similarity between the normal driving template vector and the base encoder embedding of the input clip. The frames are classified as anomalous driving if the similarity score is below the preset threshold.

| Model | Params | MFLOPS | Top Depth | Top IR | Top D+IR | Front Depth | Front IR | Front D+IR | Top+Front Depth | Top+Front IR | Top+Front D+IR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MobileNetV1 2.0x | 13.92M | 499 | 0.9125 | 0.8381 | 0.9097 | 0.9018 | 0.8374 | 0.9057 | 0.9474 | 0.9059 | 0.9533 |
| MobileNetV2 1.0x | 3.01M | 470 | 0.9124 | 0.8531 | 0.9146 | 0.8899 | 0.8355 | 0.8984 | 0.9641 | 0.9154 | 0.9608 |
| ShuffleNetV1 2.0x | 4.59M | 413 | 0.8884 | 0.8567 | 0.8926 | 0.8869 | 0.8398 | 0.9000 | 0.9358 | 0.9023 | 0.9480 |
| ShuffleNetV2 2.0x | 6.46M | 383 | 0.8959 | 0.8570 | 0.9066 | 0.9002 | 0.8371 | 0.9054 | 0.9490 | 0.9131 | 0.9531 |
| ResNet-18 (from scratch) | 32.99M | 6104 | 0.9128 | 0.8804 | 0.9166 | 0.8996 | 0.8695 | 0.9196 | 0.9609 | 0.9311 | 0.9655 |
| ResNet-18 (pre-trained) | 32.99M | 6104 | 0.9200 | 0.8857 | 0.9228 | 0.9020 | 0.8666 | 0.9128 | 0.9646 | 0.9227 | 0.9620 |
| ResNet-18 (post-processed) | 32.99M | 6104 | 0.9143 | 0.8827 | 0.9182 | 0.9020 | 0.8737 | 0.9223 | 0.9628 | 0.9335 | 0.9673 |

Table 4: Comparison of different network architectures over classification accuracy, number of parameters and MFLOPS. The accuracy columns report AUC. All architectures take 16 frames input with spatial resolution.

Resource Efficient Base Encoders. For autonomous applications, it is critical that the deployed systems are designed with resource efficiency in mind. Therefore, we have experimented with different resource efficient 3D CNNs [18] as base encoder. Comparative results are reported in Table 4. Out of all resource efficient 3D CNNs, MobileNetV2 stands out, with performance close to that of the ResNet-18 architecture. More importantly, MobileNetV2 has around 11 times fewer parameters (3.01M vs. 32.99M) and requires 13 times less computation (470 vs. 6104 MFLOPS) compared to ResNet-18. ROC curves for different base encoders are also depicted in Fig. 6, where ResNet-18 and MobileNetV2 again stand out in terms of performance compared to the other networks.

With or Without Pre-training? Transfer learning is a common and effective strategy to improve generalization on small-scale datasets by initially pretraining the network on a large-scale dataset [36]. Therefore, in order to investigate the effect of pretraining, we have pretrained our ResNet-18 base encoder on Kinetics-600 for 100 epochs with a contrastive loss similar to our contrastive learning approach described in Section 4. We have not applied CE loss, which is common for training classification tasks, since feature representations learnt by CE loss and contrastive loss would be quite different, which can hinder the transfer learning performance. Before fine-tuning, we have modified the initial convolution layer of the pretrained network to accommodate single channel input by averaging the weights of the 3 channels. Afterwards, we fine-tune the network using the DAD dataset. Comparative results are reported in Table 4, which show that the pretrained base encoder does not have an apparent advantage over the base encoder trained from scratch. We infer that our DAD dataset is large enough and that networks trained from scratch can already learn all distinctive features without the need for transfer learning.
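The adaptation of the first convolution layer can be sketched without any deep learning framework. Here a weight tensor is represented as nested lists, `weight[o][c]` being the flattened kernel of output channel `o` for input channel `c`; averaging over `c` turns a 3-channel filter into a single-channel one. The data layout, the function name and the toy values are illustrative assumptions, not the released code.

```python
def average_input_channels(weight):
    """Collapse the input-channel axis of a conv weight by averaging.

    weight[o][c] is a flat list of kernel values for output channel o and
    input channel c. The result keeps the same layout with one input channel.
    """
    collapsed = []
    for filt in weight:
        n_in = len(filt)
        # zip(*filt) groups the values at the same kernel position across channels.
        mean_kernel = [sum(values) / n_in for values in zip(*filt)]
        collapsed.append([mean_kernel])
    return collapsed


# One output filter with three input channels and a 2-element kernel.
rgb_weight = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
single_channel_weight = average_input_channels(rgb_weight)
print(single_channel_weight)
```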

Post Processing. It is a common approach to apply post processing in order to prevent fluctuation of detected scores [17]. For instance, the misclassification between frames 6500 and 6750 in Fig. 5 can be prevented by such a post processing. Therefore, we have applied a simple low pass filtering (i.e. averaging) on the predicted scores. Instead of making score predictions considering only the current clip, we have applied a running averaging on the -previous scores. We have experimented with different values and best results are achieved when . Comparative results with and without post processing are reported in Table 4, where post processing slightly improves the performance.
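The low pass filtering step can be sketched as a running average over the most recent scores. This sketch assumes the window contains the current score together with the preceding ones, and that the window is simply shorter at the start of the stream; the window length used in the example is arbitrary and the function name is illustrative.

```python
from collections import deque


def running_average(scores, window_size):
    """Smooth a score sequence with a causal moving average.

    Each output is the mean of the current score and up to
    window_size - 1 preceding scores (fewer at the start of the stream).
    """
    window = deque(maxlen=window_size)
    smoothed = []
    for score in scores:
        window.append(score)  # the oldest score is dropped automatically
        smoothed.append(sum(window) / len(window))
    return smoothed


# A single-clip dip in an otherwise high-similarity stream is damped.
raw = [1.0, 1.0, 0.0, 1.0, 1.0]
print(running_average(raw, window_size=2))
```

Averaging trades a short detection delay for fewer isolated misclassifications.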

How Does Training Data Affect the Performance? The quality and the amount of training data are among the most important factors for the performance of deep learning applications. Therefore, we have investigated the impact of different amounts of training data. First, we have created 5 equal folds, each containing the training data of 5 subjects. Then, keeping all the anomalous driving data in the training set, we have gradually increased the number of folds used for normal driving data. We have applied the same procedure with the roles of the normal and anomalous driving subsets switched. The comparative results are reported in Table 5, where the two ratio columns refer to the proportions of the used training data for the normal driving and anomalous driving subsets, respectively.

The direct interpretation of Table 5 is that as we increase the amount of normal and anomalous driving videos, achieved performance also increases accordingly. This is natural since we need more normal driving data in order to increase the generalization strength of the learned embeddings. On the other hand, we also need enough anomalous driving data in the training set to draw the boundary of the normal driving embedding and increase the compactness of the learned representation.

Figure 6: ROC curves using 5 different base encoders. The curves are drawn for the fusion of both views and modalities.

| Ratio (normal) | Ratio (anomalous) | AUC Top | AUC Front | AUC Top+Front |
|---|---|---|---|---|
| 20% | 100% | 0.7956 | 0.7639 | 0.8513 |
| 40% | 100% | 0.7795 | 0.8111 | 0.8561 |
| 60% | 100% | 0.8599 | 0.8166 | 0.8802 |
| 80% | 100% | 0.8998 | 0.8601 | 0.9382 |
| 100% | 20% | 0.8025 | 0.7873 | 0.8545 |
| 100% | 40% | 0.8103 | 0.8577 | 0.9070 |
| 100% | 60% | 0.8694 | 0.8911 | 0.9335 |
| 100% | 80% | 0.8854 | 0.8921 | 0.9484 |
| 100% | 100% | 0.9128 | 0.8996 | 0.9609 |

Table 5: Performance comparison using different amounts of normal and anomalous driving data in training. Results are reported for the ResNet-18 base encoder on the depth modality.

6 Conclusion

In this paper, we propose an open set recognition based approach for a driver monitoring application. For this objective, we create and share a video based benchmark dataset, the Driver Anomaly Detection (DAD) dataset, which contains unseen anomalous action classes in its test set. Correspondingly, the main task in this dataset is to distinguish normal driving from anomalous driving even when some of the anomalous actions have never been seen. We propose a contrastive learning approach in order to generalize the learned embedding of normal driving videos, which can later be used to detect anomalous actions in the test set.

In our experiments, we have validated that the proposed DAD dataset is large enough to train deep architectures from scratch and has different views and modalities that contain complementary information. Since autonomous applications are limited in terms of hardware, we have also experimented with resource efficient 3D CNN architectures. We specifically note that MobileNetV2 achieves close to ResNet-18 performance, but contains 11 times fewer parameters and requires 13 times less computation than ResNet-18.

We believe that this work will bring a new perspective to the research on driving monitoring systems. We strongly encourage research community to use open set recognition approaches for detecting drivers’ distraction.


We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU, and Infineon Technologies with the donation of Pico Flexx ToF cameras used for this research.