Drivers Drowsiness Detection using Condition-Adaptive Representation Learning Framework

10/22/2019 ∙ by Jongmin Yu, et al.

We propose a condition-adaptive representation learning framework for driver drowsiness detection based on a 3D deep convolutional neural network. The proposed framework consists of four models: spatio-temporal representation learning, scene condition understanding, feature fusion, and drowsiness detection. The spatio-temporal representation learning model extracts features that describe motion and appearance in video simultaneously. The scene condition understanding model classifies scene conditions related to the driver and the driving situation, such as whether the driver is wearing glasses, the illumination condition, and the motion of facial elements such as the head, eyes, and mouth. The feature fusion model generates a condition-adaptive representation from the two features extracted by the above models. The detection model recognizes the driver's drowsiness status using the condition-adaptive representation. Because the condition-adaptive representation learning framework can extract features that are more discriminative for each scene condition than a general representation, the drowsiness detection method provides more accurate results across various driving situations. The proposed framework is evaluated on the NTHU Drowsy Driver Detection video dataset. The experimental results show that our framework outperforms existing drowsiness detection methods based on visual analysis.




I Introduction

Driver drowsiness detection is one of the essential functions of advanced driver assistance systems (ADAS) for preventing fatal road accidents. Many drivers and pedestrians are killed or seriously injured as a result of drowsy driving. The National Sleep Foundation's Sleep in America poll reports that 60% of Americans have driven while drowsy, and 37% have fallen asleep while driving in the past year. According to a report of the National Highway Traffic Safety Administration in the USA, driver fatigue is closely related to 100,000 police-reported car crashes. According to this report, these crashes resulted in 1,550 deaths, 71,000 injuries, and 12.5 billion US dollars in monetary losses [schroeder2013national]. Car crashes caused by driver drowsiness are not unique to the USA: drowsiness contributes to as many as 7% of crashes in the United Kingdom and 3.9% of crashes in Norway [maycock1996sleepiness, sagberg1999road]. The majority of drowsiness-related car accidents, approximately 80%, can be classified as single-vehicle run-off-road crashes, in which a driver loses control of the vehicle and eventually departs the lane or crashes into the rear of the car ahead [pack1995characteristics]. These figures may be only the tip of the iceberg, not only because it is hard to attribute the cause of a crash to drowsiness but also because the criteria for recognizing drowsiness differ from driver to driver [schroeder2013national]. There is no Breathalyzer equivalent for drowsiness. Therefore, developing a driver drowsiness detection method is an important challenge for preventing these losses of life and property.

Approaches to driver drowsiness detection can be classified by the domain they analyze. One approach directly analyzes the driver to identify changes in the driver's behaviour. This approach analyzes facial elements such as the eyes and mouth using visual sensors [garcia2012vision, mbouna2013visual, wang2012method, minkov2012comparison, panning2011color, kurylyak2012detection, suzuki2006measurement], or detects particular patterns in electrophysiological signals that occur when a driver is falling asleep [khushaba2011driver, patel2011applying, tran2010improving, papadelis2007monitoring]. Other approaches indirectly infer the driver's state by analyzing signals extracted from the steering system [ersal2010model, yang2009detection, liu2009predicting, takei2005estimate, wakita2006driver].

The most commonly applied and theoretically rigorous approaches analyze electrical bio-signals, e.g., the electroencephalogram (EEG), or facial elements such as the eyes, based on the percentage of eye closure over a fixed time window (PERCLOS) [dinges1998perclos]. Dinges et al. verified that the approach using PERCLOS achieved over 90% accuracy in recognizing degraded performance during a vigilance task. This figure demonstrated that PERCLOS was more reliable across drivers than EEG, blinks, and head position in the study [dinges1998perclos]. Khushaba et al. proposed a driver drowsiness detection method that employs a fuzzy mutual-information-based wavelet packet transform model to extract drowsiness-related information from a set of EEG, electrooculogram (EOG), and electrocardiogram (ECG) signals [khushaba2011driver]. Papadelis et al. developed a drowsiness monitoring system using on-board electrophysiological recording systems [papadelis2007monitoring]. The aforementioned methods identify changes in signal patterns such as brain activity or heartbeat to measure the degree of driver fatigue. These signals reflect brain electrical activity and can provide more discriminative information than other features for analyzing the driver's condition. For these reasons, methods using biomedical signals captured from drivers have provided relatively more accurate detection results than methods based on visual analysis or on measuring steering signals. Nevertheless, the main disadvantage of these methods is that the sensing equipment for physiological signals such as EEG, ECG, and EOG must be attached to the driver's body. Attaching these sensors can inconvenience drivers while they are driving. Additionally, the high price of the sensors is one reason they cannot be used in a practical drowsiness detection system.
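As a rough illustration of the PERCLOS measure mentioned above, the following sketch computes the fraction of frames in a fixed window during which the eye counts as closed. The per-frame openness values, the function names, and the 0.2 cutoff (an eye that is at least 80% closed is treated as closed) are assumptions made for this example, not details taken from the cited studies.

```python
def perclos(eye_openness, closed_threshold=0.2):
    """Fraction of frames in the window whose eye openness is at or below the cutoff.

    eye_openness: hypothetical per-frame openness in [0, 1] (1.0 = fully open).
    closed_threshold: assumed cutoff below which the eye counts as closed.
    """
    if not eye_openness:
        raise ValueError("window must contain at least one frame")
    closed = sum(1 for value in eye_openness if value <= closed_threshold)
    return closed / len(eye_openness)


def sliding_perclos(eye_openness, window, closed_threshold=0.2):
    """PERCLOS for every full window of `window` consecutive frames."""
    return [
        perclos(eye_openness[start:start + window], closed_threshold)
        for start in range(len(eye_openness) - window + 1)
    ]


# Four frames: open, nearly closed, closed, open.
print(perclos([1.0, 0.1, 0.0, 0.9]))  # 0.5
```

A detector built on this measure would typically compare each window's value against a fatigue threshold; that threshold is application-specific and is not specified here.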

Besides the methods that directly recognize the driver's condition through the analysis of biomedical signals, approaches based on visual analysis of facial elements generally employ computer vision techniques such as object detection and tracking to find objects of interest, such as the eyes or mouth, in an image containing the driver's face [garcia2012vision, mbouna2013visual, wang2012method, minkov2012comparison, panning2011color, kurylyak2012detection, suzuki2006measurement]. Garcia et al. proposed a system which consists of three steps [garcia2012vision]. Their system initially detects and tracks the face and eyes, and then applies image filtering to stabilize the analysis of the eye status under various illumination conditions. The system evaluates eye closure using the PERCLOS measurement. Mbouna et al. provided an analysis method for visual features to understand the eye closure state and head pose. Their method monitors a driver using a single camera without any light source [mbouna2013visual]. Wang et al. presented a solution for the situation in which the driver is wearing glasses by combining two analysis methods for the status of the eyes and mouth [wang2012method]. The method proposed by Dwivedi et al. extracts features using a convolutional neural network and detects eye blinking, eye closure, and yawning [dwivedi2014drowsy]. Generally, these methods assume that the facial expressions of extremely tired drivers, such as eye blinking, yawning, and eye and head movements, differ from those of drivers who are not tired. These approaches classify whether the driver is asleep or not using hand-crafted features such as the histogram of oriented gradients (HoG) [dalal2005histograms] and Haar-like features [lienhart2002extended]. To extract this facial feature information, visual sensors such as an RGB camera or an active infrared sensor must be installed on the vehicle dashboard, sun visor, or overhead console to capture images of the driver's face. However, despite the convenience of installation, methods based solely on video analysis with visual sensors provide unstable detection results in many situations. For example, ordinary cameras cannot capture clear images at night without an illumination system. Developing a drowsiness detection method based on visual analysis that is invariant to lighting conditions is still an open question.

The limitations of the above-mentioned approaches have led researchers to turn to signals from the steering system, such as the deflection of the top of the wheel from the zero point [mcdonald2018contextual]. These signals are similar to electrical bio-signals in that they require significant pre-processing and transformation before they become viable input measures [sayed2001unobtrusive]. Sayed and Eskandarian proposed a steering-wheel-angle-based method that filters the raw steering angle data to eliminate road curvature events and then discretizes it into binary signals that represent steering patterns [sayed2001unobtrusive]. This method detected driver drowsiness with nearly 90% accuracy. Similarly, Krajewski et al. presented an approach that processes raw steering-wheel angle data into features represented by the signal in the time and frequency domains [krajewski2009detecting]. Ersal et al. presented an approach to recognizing driving behaviours [ersal2010model], which is based on support vector machines (SVM) [hearst1998support]. This approach systematically assists in determining whether a driver is asleep or not by interpreting driver behaviour using a linear discriminative model. Takei et al. [takei2005estimate] estimated a driver's fatigue by analyzing steering motions with the fast Fourier transform (FFT) and chaos characteristics. These methods judge whether a driver is becoming drowsy by analyzing signals such as variations in velocity, acceleration, braking, and gear changes that are recorded by sensors embedded in the steering system. These methods do not focus directly on detecting driver drowsiness. Instead, they try to recognize, from the steering signals, unstable vehicle movements that are caused by various intrinsic and extrinsic factors. Consequently, they can provide a more flexible system for detecting unstable movements than systems that focus only on detecting driver drowsiness. However, many automobile manufacturers embed their own particular steering systems in their vehicles. In addition, these signals cannot be a clear basis for distinguishing whether a driver is sleepy, since every driver has not only a different personality but also different driving habits.

Fig. 1: Illustrations of the processes of general representation learning and adaptive representation learning on a classification task

Recently, deep learning architectures have been successfully used to solve various computer vision problems, such as image recognition [simonyan2014very, girshick2016region], object detection [erhan2014scalable, ren2015faster], gesture recognition [molchanov2016online], image segmentation [qi2016dynamic], and action recognition [simonyan2014two, du2015hierarchical]. In particular, deep learning methods [simonyan2014two, du2015hierarchical] perform well in analyzing video streams to recognize specific actions when compared with conventional methods based on hand-crafted features [wang2013action, jiang2015human]. Although various methods [wang2013action, jiang2015human, jain2013better] for extracting superior hand-crafted features have been proposed, the key to these successes is the rich and discriminative representation extracted by the multi-layer nonlinear systems of deep learning approaches [xu2015learning]. In our previous work [yu2016representation], we adopted a convolutional neural network (CNN) and a multi-layer fully connected neural network (a.k.a. deep neural network) to discover significant spatio-temporal features, and showed the potential of deep learning for drowsiness detection. In that work, we proposed a driver drowsiness detection method that exploits an extra scene condition prediction to improve the discriminative properties of the learnt representation. However, despite its strong drowsiness detection performance, the previous method had a critical drawback in generating representations: it could generate an extremely sparse representation that does not contain sufficient information to detect drowsiness. This work improves and extends our earlier work [yu2016representation], and we propose an end-to-end learning framework for a novel representation, called the condition-adaptive representation, for drowsiness detection.

Condition-adaptive representation learning is a representation learning process that obtains features focused on a particular condition by using auxiliary information (a.k.a. meta information). When the training dataset can be divided into several conditions, ordinary representation learning extracts generalized features from the overall training data, whereas condition-adaptive representation learning can extract more specific representations that reflect the given conditions. Figure 1 compares the processes of ordinary representation learning and condition-adaptive representation learning. Auxiliary information has been used to improve the performance of deep learning models in many computer vision studies [hong2016learning, zhang2016learning]. Hong et al. proposed a deep learning system that uses transferable knowledge for scene segmentation in the training phase [hong2016learning]. Zhang et al. proposed a face alignment method that uses the result of landmark detection as auxiliary information [zhang2016learning]. These methods try to improve performance by learning features biased towards extra information that helps to explore useful features in their target domains. As with the methods described above, the condition-adaptive representation can be interpreted as a representation biased towards certain conditions. However, in contrast to the above methods, which use extra information only in the training phase as prior knowledge, the proposed framework can generate information that helps to improve the discriminative power of the learnt representation during both training and testing. With this paradigm, the proposed framework can immediately generate a representation that adapts to the interpreted scene conditions.

The proposed framework is composed of four models: representation learning, scene understanding, feature fusion, and drowsiness detection. The representation learning model discovers a rich and discriminative representation that describes the motion and appearance of an object within consecutive frames simultaneously. The scene understanding model identifies the various scene conditions related to driving, e.g., illumination conditions and whether the driver is wearing glasses. The feature fusion model generates a condition-adaptive representation that, unlike the general spatio-temporal representation, is biased towards a specific scene condition. The proposed framework accurately detects driver drowsiness in various situations by using this condition-adaptive representation. The main contribution of this work is a representation learning framework that adapts to particular scene conditions by understanding the scene and generating the condition-adaptive representation.

The rest of the paper is organized as follows. In Section II, we give an overview of 2D and 3D CNNs. The architectural details of the proposed framework are explained in Section III. We describe the training and inference procedures of the proposed framework in Section IV, and present the data augmentation method in Section V. In Section VI, we show the experimental results and analyze them. The conclusion and discussion are given in Section VII.

Fig. 2: Illustrations of (a) 2D and (b) 3D convolution kernels. The connections sharing the same color denote weight sharing in the convolution layer. In the 3D convolution (b), the temporal dimension is 3.

II 2D and 3D Convolutional neural networks

A convolutional neural network (CNN) (a.k.a. deep convolutional neural network) is a multi-layer weighted filter model introduced by LeCun et al. [lecun1998gradient]. CNNs show outstanding performance in many computer vision tasks such as image classification [krizhevsky2012imagenet] and object detection and recognition [ren2015faster]. The key architectural characteristics of CNNs that ensure some degree of shift, scale, and distortion invariance are local receptive fields, shared weights, and spatial or temporal sub-sampling [lecun1998gradient]. Local connectivity allows CNNs to extract locally meaningful features, and weight sharing allows an elementary feature detector that is useful on one part of an image to be applied across the entire image.

Fig. 3: Overall architecture of the proposed framework. The red boxes with bold lines denote the models, and the black boxes drawn with dotted lines denote the extracted features or outputs of each model.

In general CNNs, convolution is performed in the convolution layers to discover features from spatial neighbourhoods on the feature maps of each layer. Formally, the value of a unit at position $(x, y)$ in the $j$-th feature map in the $i$-th layer, denoted $v_{ij}^{xy}$, is given by

$$v_{ij}^{xy} = f\left(b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} w_{ijm}^{pq}\, v_{(i-1)m}^{(x+p)(y+q)}\right),$$

where $f(\cdot)$ is the activation function, such as the hyperbolic tangent, sigmoid, or rectified linear function, $b_{ij}$ is the bias for the feature map, and $v_{(i-1)m}^{(x+p)(y+q)}$ is the latent representation of the unit at position $(x+p, y+q)$ in the $m$-th feature map in the $(i-1)$-th layer. $w_{ijm}^{pq}$ is the value at position $(p, q)$ of the kernel (local receptive field) connected to the $m$-th feature map, and $P_i$ and $Q_i$ are the width and the height of the kernel, respectively. In the sub-sampling layer, the resolution of the feature maps is reduced by pooling over spatially adjacent neighbourhoods on the feature maps of the previous layer. The features learnt by a 2-dimensional CNN (2D-CNN) not only capture locally useful patterns but also help to understand the entire image.

However, although the spatial features extracted by 2D-CNNs are effective in various computer vision tasks, the 2D-CNN paradigm is an obstacle to learning temporal representations of sequential data such as video. To discover rich information from sequential data using CNNs, Ji et al. proposed the 3D convolution [ji20133d]. The 3D convolution is achieved by convolving a 3D kernel with the 3D volume formed by stacking multiple contiguous frames together. With this construction, the feature maps in the convolution layers can capture the temporal information contained in multiple contiguous frames. The value of a unit at position $(x, y, z)$ in the $j$-th feature map in the $i$-th layer, denoted $v_{ij}^{xyz}$, can be formulated as

$$v_{ij}^{xyz} = f\left(b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\right),$$

where $f(\cdot)$ is the activation function in the 3D convolution, $v_{(i-1)m}^{(x+p)(y+q)(z+r)}$ is the latent representation of the unit at position $(x+p, y+q, z+r)$ in the $m$-th feature map in the $(i-1)$-th layer, $b_{ij}$ is the bias for the feature map, $w_{ijm}^{pqr}$ is the value at position $(p, q, r)$ of the kernel (3D local receptive field) connected to the $m$-th feature map, and $P_i$, $Q_i$, and $R_i$ are the width, the height, and the depth of the kernel, respectively. Figure 2 compares 2D and 3D convolutions. While a 2D convolution extracts a spatial representation from a single image only, a 3D convolution can extract spatial and temporal representations simultaneously from multiple consecutive images because its kernel spans the temporal axis as well as the spatial axes.
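To make the 3D convolution concrete, here is a minimal sketch in plain Python that slides one 3D kernel over a stack of frames. It is a simplification under stated assumptions: a single input feature map (so the sum over input maps is omitted), no padding, stride 1, and ReLU as the activation function. The function name and the toy data are illustrative only.

```python
def conv3d_valid(volume, kernel, bias=0.0):
    """3D convolution (cross-correlation form) of one volume with one kernel, no padding.

    volume: nested list indexed [z][y][x] (frames, rows, columns).
    kernel: nested list indexed [r][q][p] (depth, height, width).
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    k_depth, k_height, k_width = len(kernel), len(kernel[0]), len(kernel[0][0])
    output = []
    for z in range(depth - k_depth + 1):
        plane = []
        for y in range(height - k_height + 1):
            row = []
            for x in range(width - k_width + 1):
                total = bias
                # Weighted sum over the 3D local receptive field.
                for r in range(k_depth):
                    for q in range(k_height):
                        for p in range(k_width):
                            total += kernel[r][q][p] * volume[z + r][y + q][x + p]
                row.append(max(0.0, total))  # ReLU activation (assumed here)
            plane.append(row)
        output.append(plane)
    return output


# Four 3x3 frames and a kernel spanning 3 frames and a 2x2 spatial window.
clip = [[[1.0] * 3 for _ in range(3)] for _ in range(4)]
kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(3)]
features = conv3d_valid(clip, kernel)
print(len(features), len(features[0]), len(features[0][0]))  # 2 2 2
```

The output keeps a temporal axis of length 2 because the kernel covers 3 of the 4 frames at a time. A 2D convolution corresponds to the special case of a single frame and a kernel of depth 1.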

III Architecture

The proposed framework consists of four models: representation learning, scene understanding, feature fusion, and drowsiness detection. The representation learning model, based on a 3D deep convolutional neural network (3D-DCNN), extracts the spatio-temporal representation from the input data. The scene understanding model consists of four sub-models that interpret the conditions of glasses and illumination and the movements of facial elements. The fusion model generates a condition-adaptive representation that adapts to the scene conditions. The detection model determines whether a driver is sleepy or not. Figure 3 shows the overall architecture of the proposed framework. Briefly, the proposed framework generates the condition-adaptive representation and detects driver drowsiness as follows. First, the representation learning model based on the 3D-DCNN extracts a feature that describes motion and appearance in a video clip simultaneously. Second, the scene understanding model predicts five scene conditions, associated with wearing glasses, illumination, and facial elements, from the spatio-temporal feature extracted by the representation learning model. The scene understanding results are represented by vectors defined by one-hot encoding. One-hot encoding indicates the state of a system using binary values: the result is a group of bits in which the only legal combinations are those with a single high (1) bit and all other bits low (0). Then, the feature fusion model learns a condition-adaptive representation by combining the spatio-temporal representation and the one-hot vectors. Finally, the detection model identifies the driver's drowsiness state by analyzing the condition-adaptive representation. In the following, we describe the details of each model and the training scheme of the proposed framework.
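The one-hot encoding described above can be sketched in a few lines of Python. The category lists follow the annotations in Table I; the dictionary keys, label strings, and function names are hypothetical and chosen only for this example.

```python
# Category order follows Table I; the keys and label strings are illustrative.
SCENE_CATEGORIES = {
    "glasses_illumination": [
        "day bare face", "day glasses", "night glasses",
        "night bare face", "day sunglasses",
    ],
    "head": ["normal", "looking at both sides", "nodding"],
    "mouth": ["normal", "talking and laughing", "yawning"],
    "eye": ["sleepiness eye", "normal"],
}


def one_hot(index, size):
    """Return a list with a single 1 at `index` and 0 elsewhere."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1 if position == index else 0 for position in range(size)]


def encode_scene(labels):
    """Map {category: label} to {category: one-hot vector}."""
    return {
        category: one_hot(SCENE_CATEGORIES[category].index(label),
                          len(SCENE_CATEGORIES[category]))
        for category, label in labels.items()
    }


print(encode_scene({"glasses_illumination": "night glasses", "eye": "normal"}))
# {'glasses_illumination': [0, 0, 1, 0, 0], 'eye': [0, 1]}
```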

Fig. 4: Illustration of the 3D-DCNN in the representation learning module. The green box and the red box denote the input data and the extracted spatio-temporal representation, respectively, and the blue boxes represent convolution layers and pooling layers. The numbers above the boxes represent the depth of each layer, and the numbers below the boxes give the dimensionality and structural details of the kernel in each convolutional layer.

III-A Spatio-temporal representation learning

In this section, we describe the representation learning model, which uses the 3D-DCNN to extract a spatio-temporal representation from multiple consecutive frames. The objective of representation learning is to discover rich and discriminative features from the input frames. Videos taken by a front-facing camera in the display unit of a vehicle vary with the conditions inside and outside the vehicle, such as the illumination and the interior design. When drivers feel drowsy, their facial elements change in various ways, and these changes can be interpreted as either a change in shape or a change in motion. Therefore, to detect driver drowsiness, we need a representation that describes spatial information (appearance) and temporal information (motion) simultaneously. Temporal information cannot be estimated from a single frame, since a single frame does not contain change over time. Given this limitation, multiple consecutive frames must be used as input to discover spatial and temporal information simultaneously. In this work, we employ the 3D-DCNN to discover the various spatial and temporal changes in multiple consecutive frames.

Let $X \in \mathbb{R}^{W \times H \times T}$ denote a training video clip, where $W$, $H$, and $T$ are the width, the height, and the temporal length, respectively. For a given input video clip $X$, the representation learning model based on the 3D-DCNN extracts a spatio-temporal representation as

$$F = f_{\mathrm{rep}}(X; \theta_{\mathrm{rep}}),$$

where $\theta_{\mathrm{rep}}$ is the parameter vector of the representation learning model, and $F \in \mathbb{R}^{W' \times H' \times D'}$ is the learnt spatio-temporal representation. The spatio-temporal representation is defined as the activation values of the hidden units in the last convolutional layer of the 3D-DCNN of the representation learning model. $W'$, $H'$, and $D'$ denote the width, the height, and the depth of the spatio-temporal representation. The 3D-DCNN in the representation learning model is composed of six convolutional layers and two pooling layers. Figure 4 shows its architectural details. To discover spatial and temporal features simultaneously, we employ the 3D local receptive field suggested by Tran et al. [tran2015learning]. The convolutional operation based on the 3D local receptive field can be defined as

$$h_{xyz} = \sigma\left(\sum_{i=0}^{K_w-1} \sum_{j=0}^{K_h-1} \sum_{k=0}^{K_d-1} w_{ijk}\, u_{(x+i)(y+j)(z+k)} + b\right),$$

where $h_{xyz}$ is the activation value of the hidden unit, and $u$, $w$, and $b$ are the input value, the weight, and the bias, respectively. $K_w$, $K_h$, and $K_d$ denote the width, the height, and the depth of the 3D local receptive field, and $\sigma(\cdot)$ is the activation function of the convolution layer. We adopt Rectified Linear Units (ReLUs) [krizhevsky2012imagenet] for the proposed 3D-DCNN. While the ordinary 2D kernel (local receptive field) in 2D convolution layers can extract spatial information only, the 3D kernel in 3D convolution layers allows us to capture spatial and temporal features simultaneously. The extracted representations, which contain spatial and temporal features, are passed to the scene understanding model and the feature fusion model to identify the various scene conditions and to generate the condition-adaptive representation.

III-B Scene understanding

The goal of scene understanding is to interpret the scenes containing drivers and to understand the various conditions of drivers, which can be categorized into physiological and environmental conditions such as the movement of facial elements, the wearing of glasses, and the difference between day and night. This interpreted information helps to train the framework to adapt the learnt representation to the various scene conditions. We hypothesize that each video clip is associated with scene conditions and a driver drowsiness status. These are represented by either the ground truth (in the training phase) or the prediction results (in the inference phase).

In this work, the scene condition comprises three categories for the facial elements and one category for the status of glasses and illumination: 1) glasses and illumination $s_g$, 2) head $s_h$, 3) mouth $s_m$, and 4) eye $s_e$. We define the states of the facial elements and the conditions of glasses wearing and illumination using one-hot vectors. The annotation of each scene condition is detailed in Table I. We adopt fully connected neural networks since the given spatio-temporal representations may have complex distributions that cannot be modelled by a linear kernel. The predictions of the scene understanding model are written as

$$\hat{s}_c = f_c(F; \theta_c), \quad c \in \{g, h, m, e\},$$

where $\hat{s}_c \in \mathbb{R}^{d_c}$ are the predicted scene conditions associated with the input data $X$ through its spatio-temporal representation $F$, and $d_g$, $d_h$, $d_m$, and $d_e$ are the dimensions of the annotations for glasses and illumination, head, mouth, and eye, respectively. $\theta_c$ are the parameters of each sub-model, which is defined by a fully connected network in the scene understanding model. Each sub-model is composed of two hidden layers and a corresponding output layer. The aforementioned sub-models are represented as

$$\hat{s}_c = \sigma_o\left(W_o\, \sigma_2\left(W_2\, \sigma_1\left(W_1 v + b_1\right) + b_2\right) + b_o\right),$$

where $\sigma_1$, $\sigma_2$, and $\sigma_o$ are the activation functions of the first and second hidden layers and the output layer, respectively. $v$ is the reshaped spatio-temporal representation extracted by the representation learning model based on the 3D-DCNN. $W_1$, $W_2$, and $W_o$ are the weight parameters of the two hidden layers and the output layer, and $b_1$, $b_2$, and $b_o$ are the bias parameters of each layer. The learning procedure of each sub-model in the scene understanding model is similar to the back-propagation algorithm [le1990handwritten]. Each sub-model estimates the condition corresponding to the given spatio-temporal representation $v$, and then computes the difference between the predicted conditions and the annotations to train the parameters of its network. The output dimensionality of each sub-model corresponds to its prediction target. For example, the output dimensionality of the sub-model for glasses and illumination conditions is five, because the sub-model is designed to identify conditions defined as five classes. For a given spatio-temporal representation as input, the scene understanding model is trained to optimize the objective function defined as follows

$$\min_{\theta_g, \theta_h, \theta_m, \theta_e}\; \lambda \sum_{c \in \{g, h, m, e\}} \mathcal{L}_c\left(s_c, \hat{s}_c\right),$$

where $s_c$ denote the annotations of the input data, and $\mathcal{L}_g$, $\mathcal{L}_h$, $\mathcal{L}_m$, and $\mathcal{L}_e$ denote loss functions defined by the softmax cross-entropy loss between the annotations and the predicted results. $\lambda$ is a hyper-parameter for regularization of the summation of the values of the error functions. The details of the training and inference tasks are given in Section IV. The spatio-temporal representation and the outputs of the scene understanding model are then combined to produce the condition-adaptive representation explained in the following subsections.
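The scene understanding objective, a weighted sum of softmax cross-entropy losses over the four sub-models, can be sketched as follows. This is a minimal illustration that assumes one-hot targets and raw output scores (logits) for each sub-model; the function names and the `weight` argument (standing in for the regularization hyper-parameter) are hypothetical.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log(sum(exp(logits)))."""
    peak = max(logits)
    return peak + math.log(sum(math.exp(value - peak) for value in logits))


def softmax_cross_entropy(one_hot_target, logits):
    """Cross-entropy between a one-hot target and softmax(logits).

    For a one-hot target this equals log_sum_exp(logits) minus the target's logit.
    """
    target_logit = sum(t * value for t, value in zip(one_hot_target, logits))
    return log_sum_exp(logits) - target_logit


def scene_understanding_loss(targets, logits, weight=1.0):
    """Weighted sum of the per-condition losses (keys are condition names)."""
    return weight * sum(
        softmax_cross_entropy(targets[name], logits[name]) for name in targets
    )


targets = {"eye": [0, 1], "head": [1, 0, 0]}
logits = {"eye": [0.0, 0.0], "head": [0.0, 0.0, 0.0]}
loss = scene_understanding_loss(targets, logits)  # log(2) + log(3) for uniform scores
```

With uniform scores each sub-model contributes the log of its number of classes, which is a convenient sanity check before training.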

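Scene condition | Category | One-hot vector | Condition
Glasses and illumination conditions | 1 | 10000 | Day bare face
Glasses and illumination conditions | 2 | 01000 | Day glasses
Glasses and illumination conditions | 3 | 00100 | Night glasses
Glasses and illumination conditions | 4 | 00010 | Night bare face
Glasses and illumination conditions | 5 | 00001 | Day sunglasses
Head condition | 1 | 100 | Normal status
Head condition | 2 | 010 | Looking at both sides
Head condition | 3 | 001 | Nodding
Mouth condition | 1 | 100 | Normal status
Mouth condition | 2 | 010 | Talking and laughing
Mouth condition | 3 | 001 | Yawning
Eye condition | 1 | 10 | Sleepiness eye
Eye condition | 2 | 01 | Normal status

TABLE I: Annotations and corresponding statuses for the sub-models of the scene understanding model.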

Fig. 5: Illustration of the deep spatio-temporal representation and the condition-adaptive representation according to the input data. (a) Input frames, (b) deep spatio-temporal representation, and (c) condition-adaptive representation obtained by the fusion model. The images in (b) and (c) visualize the activations of the hidden units in the representation learning and feature fusion modules. The proposed condition-adaptive representation learning framework adaptively discovers the conditional features in the input volumes depending on the result of the scene understanding model.

III-C Feature fusion

The objective of the feature fusion model is to learn a set of condition-adaptive representations from the given spatio-temporal representation and its associated scene condition annotations. Given the spatio-temporal representation $v$ extracted by the 3D-DCNN and its associated (annotated or predicted) scene conditions $s_g$, $s_h$, $s_m$, and $s_e$, the fusion model discovers a condition-adaptive representation. The condition-adaptive feature vector is generated using the multiplicative interaction approach proposed by Memisevic et al. [memisevic2013learning]. Hong et al. observed that high-order dependencies between relevant features can be captured by element-wise multiplicative interaction between feature maps [hong2015learning]. To train the proposed framework, which generates a combined representation and therefore needs joint learning over multiple sources, we follow the training procedure proposed by Hong et al. [hong2015learning]. The fusion model is defined as follows

$$z = W_v v \odot W_g s_g \odot W_h s_h \odot W_m s_m \odot W_e s_e + b_z,$$

where $z \in \mathbb{R}^{k}$ denotes the unnormalized condition-adaptive representation, $b_z$ is the bias of the fusion model, and $\odot$ denotes element-wise multiplication. The weights are given by $W_v \in \mathbb{R}^{k \times d_v}$, $W_g \in \mathbb{R}^{k \times d_g}$, $W_h \in \mathbb{R}^{k \times d_h}$, $W_m \in \mathbb{R}^{k \times d_m}$, and $W_e \in \mathbb{R}^{k \times d_e}$, where $d_g$, $d_h$, $d_m$, and $d_e$ are set to the dimensions of the associated annotations. Here $d_v$ is the dimensionality of the reshaped spatio-temporal representation and $k$ denotes the number of hidden units in the fusion model. This 5-way tensor product can capture the correlation between the input domains, i.e., the spatio-temporal representation and the scene conditions.
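The multiplicative interaction used by the fusion model can be illustrated with a small sketch: every input is projected into a common space and the projections are multiplied element-wise before a bias is added. The function names and the toy weights below are hypothetical, and the sketch accepts any number of condition vectors rather than exactly four.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fuse(feature, conditions, feature_weights, condition_weights, bias):
    """Element-wise product of the projected feature and projected conditions, plus bias.

    feature: reshaped spatio-temporal representation (list of floats).
    conditions: list of one-hot (or predicted) condition vectors.
    feature_weights / condition_weights: projection matrices with the same number of rows.
    """
    fused = matvec(feature_weights, feature)
    for condition, weights in zip(conditions, condition_weights):
        projected = matvec(weights, condition)
        fused = [a * b for a, b in zip(fused, projected)]
    return [value + offset for value, offset in zip(fused, bias)]


feature = [1.0, 2.0]
feature_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # projects to 3 hidden units
eye_condition = [0, 1]                                   # one-hot "normal status"
eye_weights = [[2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
print(fuse(feature, [eye_condition], feature_weights, [eye_weights], [0.0, 0.0, 1.0]))
# [3.0, 10.0, 22.0]
```

Because a one-hot condition vector selects a single column of its weight matrix, each condition effectively acts as a learned, condition-specific gate on the projected feature.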

However, the element-wise multiplication with the spatio-temporal representation and the outputs of the scene understanding empirically computes values that are close to zero. These computed values can influence not only the result of the fusion model but also computational procedure when the multiplication results exceeded the range that can be represented by computation machine. We adopted a normalization scheme to prevent values close to zero for avoiding the computational errors and finding high-order dependency between the spatio-temporal representation and the identified scene conditions. To prevent computational error and to pay attention to only a scene condition, we normalize to using the softmax function in [qi2016dynamic, xu2015show]. The normalization is formulated as follows


where represents the -th element of the unnormalized joint feature, and is the -th element of the normalized fusion feature. Intuitively, represents a condition-adaptive representation defined over all spatio-temporal representations and the corresponding scene conditions. Figure 5 shows the input images, the spatio-temporal representations, and the condition-adaptive representations. The condition-adaptive representations are then used as inputs to the detection model, which is explained in the next section.
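A minimal sketch of softmax normalization over a vector of near-zero fusion outputs is given below. The input values are hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step that the paper does not state.

```python
import math


def softmax(values):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


unnormalized = [1e-6, -3e-7, 2e-6, 0.0]  # hypothetical near-zero fusion outputs
normalized = softmax(unnormalized)
print(normalized, sum(normalized))
```

After normalization every element is strictly positive and the elements sum to one, so later layers no longer operate on values at the edge of the representable range.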

III-D Drowsiness detection

The fusion model described in the previous subsection generates a set of condition-adaptive representations, which provide scene-adaptive features containing information on the facial elements and illumination of drivers. Drowsiness detection in the proposed framework is carried out by an additional neural network that takes the condition-adaptive representation in Eq. (10) as input. As in the scene understanding model, we place an additional fully connected deep neural network on top of the fusion model as follows:


where denotes the output of the detection model, and is the model parameter. The output of the fully connected network consists of two units, a non-drowsiness unit and a drowsiness unit, which classify the drowsiness of a driver. To compute the likelihood of driver drowsiness, we apply the softmax function, which reflects the degrees of drowsiness and non-drowsiness of the input. Using the softmax function, we can detect driver drowsiness for each input. A high value of the non-drowsiness unit signifies that the driver in the input frames is likely to be awake, and a high value of the drowsiness unit signifies that the driver is falling asleep. An optimization scheme for both and operates under the detection objective. Our detection model is trained to minimize the detection loss using the detection annotation associated with the fusion feature and the representation as follows:


where is the ground-truth value that corresponds to each input , and denotes the objective function of the detection model. We used the softmax cross-entropy function as the objective function for . The objective function is applied to all models embedded in the proposed framework.
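The softmax cross-entropy over the two output units can be sketched as follows. The ordering of the units (index 0 for non-drowsiness, index 1 for drowsiness) and the logit values are assumptions made for illustration.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy between softmax(logits) and a one-hot target."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target_index]


# Assumed ordering: index 0 = non-drowsiness unit, index 1 = drowsiness unit.
confident_and_right = softmax_cross_entropy([-2.0, 3.0], target_index=1)
confident_and_wrong = softmax_cross_entropy([3.0, -2.0], target_index=1)
print(confident_and_right, confident_and_wrong)
```

The loss is small when the unit matching the annotation dominates and grows as the prediction moves toward the wrong unit.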

IV Training and Inference

Training the proposed framework involves two objectives, the scene understanding objective in Eq. (7) and the drowsiness detection objective in Eq. (12), and balancing the two is essential for reaching a good locally optimal solution. Combining Eq. (7) and Eq. (12), the overall objective function is defined by


where is a parameter that balances the scene understanding and drowsiness detection modules during training. The objective function can optimize the four modules of the proposed framework simultaneously. However, at the beginning of training we do not train all models of the proposed framework simultaneously. The overall architecture (see Fig. 2) shows that the output of the representation learning model is shared across the framework, so the representation learning and scene understanding models can considerably influence the other models (feature fusion and drowsiness detection). We therefore first train the representation learning and scene understanding models during steps. After that, we train all models, including the feature fusion and detection models.
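The two-stage schedule described above can be sketched as a small piece of control logic. The weighted-sum form of the combined loss, the placement of the balance parameter, the warm-up length, and all function and module names are assumptions; the paper's exact objective is not reproduced here.

```python
def combined_loss(scene_loss, detection_loss, balance):
    """Weighted sum of the two objectives (an assumed form of the overall objective)."""
    return detection_loss + balance * scene_loss


def trainable_modules(step, warmup_steps):
    """Modules updated at a given training step under the two-stage schedule."""
    first_stage = ("representation_learning", "scene_understanding")
    if step < warmup_steps:
        return first_stage
    return first_stage + ("feature_fusion", "drowsiness_detection")


def loss_for_step(step, warmup_steps, scene_loss, detection_loss, balance):
    """Scene loss only during warm-up, combined loss afterwards."""
    if step < warmup_steps:
        return scene_loss
    return combined_loss(scene_loss, detection_loss, balance)


warmup_steps = 10  # hypothetical warm-up length
for step in (0, 9, 10, 20):
    print(step, trainable_modules(step, warmup_steps))
```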

To detect the drowsiness of drivers from an input video clip, the proposed framework generates spatio-temporal representations using the representation learning model, and the spatio-temporal representation is then used to understand the scene conditions. These two pieces of information are combined to produce the condition-adaptive representation. Drowsiness is detected using this condition-adaptive representation.

V Data augmentation

A common approach to reducing overfitting on a given training dataset is to enlarge the dataset artificially using label-preserving transformations [krizhevsky2012imagenet]. In this work, we apply data augmentation based on horizontal flipping and the image pyramid technique. These transformations require very little computation, so the additional data can be generated without a heavy computational load. We generate horizontally flipped images from the original images, and both the original and the flipped images are then transformed by image filtering based on the Gaussian filter. Figure 6 illustrates the data augmentation procedure. We extract training patches using various values of variations and train the proposed framework on this extended dataset. In our experiments, we used three different variations to generate additional training samples following the image pyramid paradigm. Together, these two types of data augmentation substantially increase the number of training samples. Without this scheme, the proposed framework suffers from substantial overfitting and can converge to a poor local optimum.
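The augmentation procedure can be sketched in pure Python on a small grayscale image stored as nested lists. The three Gaussian standard deviations, the kernel radius rule, the edge clamping, and the choice to keep the unfiltered images are assumptions for this example, not values taken from the paper.

```python
import math


def flip_horizontal(image):
    """Mirror each row of a 2D image."""
    return [row[::-1] for row in image]


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel with a radius of about three sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [[sum(k * row[min(max(x + i - radius, 0), width - 1)]
                 for i, k in enumerate(kernel))
             for x in range(width)]
            for row in image]


def transpose(image):
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, sigma):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))


def augment(image, sigmas=(0.5, 1.0, 2.0)):
    """Return the original and flipped images plus their blurred versions."""
    samples = []
    for base in (image, flip_horizontal(image)):
        samples.append(base)
        samples.extend(gaussian_blur(base, s) for s in sigmas)
    return samples


image = [[0.0, 1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0, 7.0],
         [8.0, 9.0, 10.0, 11.0]]
print(len(augment(image)))
```

Under these assumptions each source image yields eight training samples: two unfiltered images and three filtered versions of each.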

VI Experiments

VI-A Benchmark dataset

Previous studies [khushaba2011driver, takei2005estimate, wakita2006driver] on driver drowsiness detection attempted to recognize a small number of cases in private datasets constructed in their own experimental environments. Abtahi et al. provided a publicly available dataset for yawning detection [abtahi2014yawdd]. However, it is still insufficient for a comprehensive drowsy driver study. We used the NTHU Drowsy Driver Detection Dataset (NTHU-DDD dataset) to demonstrate the effectiveness of the proposed framework for driver drowsiness detection. Constructing a drowsiness detection dataset in real driving situations is too difficult and dangerous. The NTHU-DDD dataset therefore consists of videos of a driver sitting in a car seat and playing a racing game with a driving simulator wheel and pedals. The drivers in the dataset made various facial expressions during recording. The total duration of the dataset is about nine and a half hours.

Fig. 6: Illustration of the data augmentation procedure. The original training sample and its rotated counterpart generate additional training samples through image filtering such as the Gaussian filter.

The NTHU-DDD dataset is composed of three subsets for training, evaluation, and testing, which consist of non-redundant video files. Each subset contains videos of drivers in diverse conditions, captured with visual sensors such as a camera and an active infrared (IR) sensor. The entire dataset, including the training and evaluation sets, contains 36 drivers of different ethnicities recorded with and without glasses or sunglasses under a variety of driving scenarios. The driving scenarios include normal driving, yawning, slow blink rate, falling asleep, and bursting out laughing, under day and night illumination conditions. All videos contain frame-level annotations of the drowsiness condition. The videos have a resolution of 640 × 480 and are stored in AVI format. Figure 7 shows example snapshots of the NTHU-DDD dataset.

Fig. 7: The example snapshots of NTHU Drowsy Driver Detection Dataset (NTHU-DDD Dataset).
Fig. 8: The illustration for the concept of temporal IOU.

The training dataset consists of 18 subject folders. Each subject folder contains videos recorded under various driving conditions. Each subject's videos are classified into five scenarios defined by the glasses and illumination conditions (i.e., glasses, bare face, sunglasses, night glasses, and night bare face). Each scenario contains four videos with different situations and the corresponding annotation files. The evaluation dataset provides four subject folders, and each subject contains five videos with different scenarios and the corresponding annotation files. The training dataset is composed of 360 videos (722,223 frames), and the evaluation dataset contains 20 videos (173,259 frames). In this work, we used only the training and evaluation datasets because the test dataset is not publicly accessible and does not contain annotations for performance evaluation. We used all of the given training data to train the proposed framework. We build small video clips, each consisting of five consecutive frames, and assign to each clip an annotation of the scene conditions and drowsiness status.

| Scenario | Glasses and illumination | Head | Mouth | Eye |
|---|---|---|---|---|
| Day bare face | 0.99 | 0.99 | 0.98 | 0.89 |
| Day glasses | 0.97 | 0.93 | 0.95 | 0.81 |
| Day sunglasses | 0.98 | 0.97 | 0.78 | 0.78 |
| Night bare face | 0.99 | 0.95 | 0.97 | 0.82 |
| Night glasses | 0.97 | 0.96 | 0.88 | 0.92 |
| Total average | 0.98 | 0.96 | 0.912 | 0.844 |

TABLE II: Validation accuracies of the scene understanding model on the evaluation dataset of the NTHU-DDD dataset.

Because the given training data provide only frame-level annotations, we employed the concept of intersection over union (IOU) [farfade2015multi] to convert the frame-level annotations into clip-level annotations. Figure 8 shows the concept of the temporal IOU used in our experiment. We assume that the annotation value of each clip is the value that accounts for more than 50% of its frame-level annotations. Since each clip consists of five frames, the clip annotation is the value observed in at least three frames. In addition, to improve experimental and time efficiency, we downsample all frames to a uniform size of 224 × 224 pixels using bilinear interpolation from the OpenCV library.
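The conversion from frame-level to clip-level annotation can be sketched as a majority vote over each five-frame clip. Using non-overlapping clips and dropping leftover frames at the end of a video are assumptions; the paper does not state the clip stride.

```python
from collections import Counter


def clip_labels(frame_labels, clip_length=5):
    """Split frame labels into non-overlapping clips and keep each clip's majority label.

    A label is accepted only if it covers more than half of the clip;
    otherwise None is returned for that clip.
    """
    labels = []
    for start in range(0, len(frame_labels) - clip_length + 1, clip_length):
        clip = frame_labels[start:start + clip_length]
        label, count = Counter(clip).most_common(1)[0]
        labels.append(label if count > clip_length / 2 else None)
    return labels


# Hypothetical frame-level drowsiness annotations (1 = drowsy, 0 = not drowsy).
frames = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
print(clip_labels(frames))
```

For binary labels and an odd clip length a majority always exists, so the `None` branch only matters for annotations with more than two values, such as the scene conditions.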

| Scenario | LeNet [lecun1998gradient] | AlexNet [krizhevsky2012imagenet] | VGG-FaceNet [parkhi2015deep] | LRCN [donahue2015long] | FlowImageNet [donahue2015long] | DDD-FFA [parkdriver] | DDD-IAA [parkdriver] | Ours |
|---|---|---|---|---|---|---|---|---|
| Day bare face | 0.531 | 0.704 | 0.638 | 0.687 | 0.563 | 0.782 | 0.698 | **0.796** |
| Day glasses | 0.592 | 0.616 | 0.705 | 0.617 | 0.616 | 0.741 | 0.759 | **0.781** |
| Day sunglasses | 0.682 | 0.702 | 0.570 | 0.714 | 0.675 | 0.618 | 0.698 | **0.738** |
| Night bare face | 0.602 | 0.646 | 0.737 | 0.573 | 0.668 | 0.702 | 0.749 | **0.765** |
| Night glasses | 0.599 | 0.627 | 0.741 | 0.556 | 0.551 | 0.683 | **0.747** | 0.734 |
| Average | 0.601 | 0.659 | 0.678 | 0.629 | 0.615 | 0.708 | 0.730 | **0.762** |

TABLE III: Average accuracy comparison of the drowsiness detection approaches in different situations on the evaluation dataset of the NTHU-DDD dataset. The bolded values represent the best accuracies in each scenario and in the averages.

| Scenario | Drowsiness (F) | Non-drowsiness (F) | Accuracy |
|---|---|---|---|
| Day bare face | 0.809 | 0.784 | 0.796 |
| Day glasses | 0.789 | 0.774 | 0.781 |
| Day sunglasses | 0.758 | 0.718 | 0.738 |
| Night bare face | 0.753 | 0.777 | 0.765 |
| Night glasses | 0.718 | 0.750 | 0.734 |
| Average | 0.765 | 0.760 | 0.762 |

TABLE IV: F-measures and accuracies of the drowsiness detection on the evaluation dataset of the NTHU-DDD dataset. The values listed under the drowsiness and non-drowsiness attributes are F-measures.

VI-B Experimental results

We demonstrate the effectiveness of our framework using the evaluation set of the NTHU-DDD dataset. The evaluation dataset covers five scenarios, and each scenario contains four videos (one per subject) that capture various virtual driving situations. The videos in the evaluation dataset do not overlap with the videos in the training dataset. The dataset also includes multiple annotations for the scene conditions and for drowsiness detection. We tested the performance of scene understanding and of drowsiness detection separately.

The scene understanding module is evaluated by validation accuracy, defined as the ratio of the number of correctly classified results of each sub-model in the scene understanding model to the total number of test samples. Table II shows the validation accuracies of the scene understanding model, which is composed of four sub-models: the glasses and illumination model, the head model, the mouth model, and the eye model. The averages are computed as unweighted arithmetic means, so the number of samples falling into each category of the table is not taken into account. The same averaging is applied in the subsequent experiments. The average validation accuracy over all scene conditions and sub-models is 0.924. The results in Table II show that the scene understanding module achieves good results in classifying the glasses and illumination conditions and the status of the head. However, the classification accuracies for the mouth and eye conditions are lower than those of the other categories. The performance gap between the sub-models could be interpreted as a bias of the representation learning. Scene understanding based on our spatio-temporal representation may be influenced by the geometric size and scale of the target object. Since the eyes and mouth occupy a smaller portion of each frame than the glasses, illumination, and head in the NTHU-DDD dataset, the learned representation model may have over-fitted to the conditions for glasses, illumination, and head.
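The unweighted averaging can be checked directly against the per-scenario accuracies reported in Table II. The dictionary keys and the helper names `accuracy` and `mean` are chosen for this example only.

```python
def accuracy(num_correct, num_samples):
    """Validation accuracy: correctly classified samples over all test samples."""
    return num_correct / num_samples


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


# Per-scenario validation accuracies from Table II, in the order:
# day bare face, day glasses, day sunglasses, night bare face, night glasses.
table_ii = {
    "glasses_illumination": [0.99, 0.97, 0.98, 0.99, 0.97],
    "head": [0.99, 0.93, 0.97, 0.95, 0.96],
    "mouth": [0.98, 0.95, 0.78, 0.97, 0.88],
    "eye": [0.89, 0.81, 0.78, 0.82, 0.92],
}

column_means = {name: mean(values) for name, values in table_ii.items()}
overall = mean(list(column_means.values()))
print(column_means, overall)
```

The four column means reproduce the "Total average" row, and their mean matches the overall figure of 0.924 quoted in the text.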

We evaluated the proposed framework quantitatively using the F-measure. The F-measure is the harmonic mean of precision and recall (detection rate), where precision and recall are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

where TP (true positive) is the number of samples correctly detected as the drowsiness state, and FN (false negative) is the number of drowsiness samples incorrectly classified as the non-drowsiness state. FP (false positive) is the number of non-drowsiness samples incorrectly identified as the drowsiness state, and TN (true negative) is the number of samples correctly classified as the non-drowsiness state. The quantitative evaluation reports an average over all videos that belong to the same glasses and illumination category. Table IV shows the F-measures and accuracies of the proposed framework for drowsiness detection. The results show that our proposed framework achieves an average accuracy of 0.762.
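For reference, these measures can be computed from the four confusion counts as in the sketch below. The counts are hypothetical and chosen only to make the arithmetic easy to follow; returning 0.0 for an empty denominator is a convention adopted here, not something the paper specifies.

```python
def precision(tp, fp):
    """Fraction of drowsiness detections that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of true drowsiness samples that are detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion counts for one scenario.
tp, fn, fp, tn = 80, 20, 10, 90

drowsiness_f = f_measure(tp, fp, fn)
# The non-drowsiness F-measure treats non-drowsiness as the positive class,
# so the roles of TP/TN and FP/FN are swapped.
non_drowsiness_f = f_measure(tn, fn, fp)
print(drowsiness_f, non_drowsiness_f, accuracy(tp, tn, fp, fn))
```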

Because few performance comparisons on publicly available drowsiness detection datasets exist, we compared against previous methods that were evaluated on the NTHU-DDD dataset and against methods that we implemented based on well-known multiclass image classification algorithms. We compared our framework to several methods [parkhi2015deep, donahue2015long, parkdriver, krizhevsky2012imagenet, lecun1998gradient]. Parkhi et al. proposed a face recognition method (VGG-FaceNet) using a deep neural network [parkhi2015deep]. The VGG-FaceNet consists of 36 convolution layers, and this network is much deeper than the 3D-DCNN used in the proposed framework. Donahue et al. proposed a method based on long-term recurrent convolutional networks (LRCN) for visual recognition and description of long-term time-series data [donahue2015long]. We modified these methods to evaluate their performance on driver drowsiness detection. Park et al. proposed the deep drowsiness detection (DDD) network for drowsiness detection [parkdriver], with two different fusion strategies: independently-averaged architecture (IAA) and feature-fused architecture (FFA). They report experimental results on the NTHU-DDD dataset. These methods were trained and tested with the same procedure as the proposed framework. Additionally, we compare against the results on the NTHU-DDD dataset listed in Park et al. [parkdriver].

Fig. 9: The ROCs for the driver drowsiness detection. Figures in parentheses indicate the area under curves (AUCs).

Fig. 10: The detection results using NTHU-DDD dataset. The images of the first row show the detection results for the driver drowsiness, and the images of the second row denote the detection results of a normal condition of drivers.

Table III shows the comparison results of driver drowsiness detection on the NTHU-DDD dataset. The experimental results show that the proposed framework outperforms the other methods in most of the scenarios. Only in the night glasses scenario does the proposed method (0.734) fall below other methods, namely DDD-IAA (0.747) and VGG-FaceNet (0.741). Additionally, the results illustrate that the proposed framework achieves higher and more stable performance across scene conditions than the listed methods, even though several of them use deeper network structures. Figure 9 shows the receiver operating characteristic (ROC) curves and the areas under the curves (AUCs), generated from the predictions on the evaluation dataset. The ROC plots in Fig. 9 show that the proposed method has no advantage in the lower region of the curve, where the false positive rate (FPR) is below approximately 0.05, but provides a clear advantage over the other methods [lecun1998gradient, krizhevsky2012imagenet, parkhi2015deep, donahue2015long] for much of the rest of the curve.

The overall experimental results demonstrate that the proposed method provides more accurate and effective driver drowsiness detection than the other drowsiness detection methods based on visual analysis. Driver drowsiness in the real world can appear with many variations of facial elements under diverse illumination conditions. The feature fusion helps to discover a discriminative and rich condition-adaptive representation for detecting drowsiness, and it plays a significant role in providing high-quality drowsiness detection in various situations. Figure 10 shows example snapshots of correct detection results on the NTHU-DDD dataset.

VI-C Computational complexity

Although the computational cost of the framework depends on the size of the input images and on structural details such as the number of layers and the kernel sizes of the neural network, the theoretical computational complexity of the representation learning and feature fusion models based on 3D-CNN is , where and are the index of a convolutional layer and the number of convolutional layers of each model. , , and denote the width, height, and depth of the input data in each convolutional layer. , , and denote the width, height, and depth of the 3D convolutional kernel in the -th layer. The computational complexity of the scene understanding and drowsiness detection models using two-layer neural networks is , where and denote the dimensionalities of each hidden layer and of the target domain for the objectives. We estimated the computational complexity of the proposed framework based on the approaches of He et al. [he2015convolutional] and Notchenko et al. [notchenko2016sparse].
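As a rough illustration of how such an estimate is assembled, the sketch below counts multiply-accumulate operations per layer and sums them. The cost model (input volume times kernel volume times channel counts for a stride-1, same-padded 3D convolution), the layer shapes, and the function names are assumptions for this example; they are not the expressions or the architecture used in the paper.

```python
def conv3d_cost(width, height, depth, k_w, k_h, k_d, in_channels=1, out_channels=1):
    """Approximate multiply-accumulate count of one 3D convolution layer.

    Assumes stride 1 and 'same' padding, so one kernel evaluation per
    input position, input channel, and output channel.
    """
    return width * height * depth * k_w * k_h * k_d * in_channels * out_channels


def dense_cost(input_dim, hidden_dim, output_dim):
    """Approximate multiply-accumulate count of a two-layer fully connected network."""
    return input_dim * hidden_dim + hidden_dim * output_dim


# Hypothetical layer shapes: (width, height, depth, k_w, k_h, k_d, in_ch, out_ch).
conv_layers = [
    (224, 224, 5, 3, 3, 3, 3, 32),
    (112, 112, 5, 3, 3, 3, 32, 64),
]
total = sum(conv3d_cost(*layer) for layer in conv_layers)
total += dense_cost(input_dim=512, hidden_dim=256, output_dim=2)
print(total)
```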

Note that these computational complexities apply to both the training and testing phases; however, the practical execution times differ because the framework follows different workflows in the two phases. The training task consists of three steps: 1) computing the output, 2) computing the error, and 3) updating the parameters. Therefore, execution takes longer in training than in testing. Once training ends, execution in the testing phase is much faster because the framework only needs to compute the output for drowsiness detection. The execution speed obtained in our experimental setting was 38.1 FPS (28.6 ), which is almost real-time. We calculated this value by averaging the execution time of the proposed framework over 300 seconds, excluding the time for displaying the output on a screen.

The proposed framework is implemented with the Google TensorFlow library. Although training the framework takes a long time, once training is finished the entire framework runs in real time with a Python implementation on a PC with a 3.4 GHz Core i7 CPU, 16 GB of RAM, and a GTX TITAN GPU.

VII Discussion and Conclusion

In this paper, we have proposed a condition-adaptive representation learning framework for efficient driver drowsiness detection that is invariant to various driving conditions, including the driving time (day or night) and the driver’s appearance. To this end, we extracted the spatio-temporal representation and merged it with the vectors that represent the scene understanding results, using a feature fusion method based on the tensor product approach. These problems are effectively modelled using a 3D-DCNN and fully connected neural networks based on recent advances in computer vision. The spatio-temporal representation and the estimated scene conditions are merged to enhance the discriminative power and to provide precise driver drowsiness detection under various driving conditions. With the feature fusion properly harnessed, the merged feature is more discriminative than the original spatio-temporal representation, even though the original representation already contains motion and appearance information about the driving situation and the driver’s condition. Experimental results show that the proposed framework outperforms other methods, including methods based on deep learning, in drowsiness detection accuracy.

The limitations of the proposed framework can be summarized as follows. First, although the proposed framework achieves good detection performance, it requires a high-performance GPU to be installed in the vehicle, which may increase the price and weight of the vehicle. Second, the proposed method needs many training samples labelled with the scene conditions and drowsiness state in order to learn a representation that covers the various situations of drivers. Third, since the proposed framework is an off-line method, it cannot guarantee the detection of drowsiness in drivers of entirely different types that are not included in the training samples.

In future work, several directions should be taken into account. First, we will optimize the network structure of the proposed framework for use on an embedded board or microcomputing system, in order to reduce the financial cost and improve the computational efficiency without performance degradation. Second, we will develop an on-line updating method to improve the reliability of drowsiness detection through continuous updating of the model. Third, we will study a data augmentation method based on generative models to improve drowsiness detection performance by enlarging the scale and variety of a given dataset.


This work was supported by the Institute for Information and Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. B0101-15-0525, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis), and the National Strategic Project-Fine particle of the National Research Foundation of Korea(NRF) funded by the Ministry of Science and ICT(MSIT), the Ministry of Environment(ME), and the Ministry of Health and Welfare(MOHW) (NRF-2017M3D8A1092022).