Long-term Multi-granularity Deep Framework for Driver Drowsiness Detection

01/08/2018 ∙ by Jie Lyu, et al. ∙ Xi'an Jiaotong University ∙ Microsoft

For real-world driver drowsiness detection from videos, head pose varies so much (for example, when looking aside or lowering the head) that existing methods operating on the global face cannot extract effective features. Temporal dependencies of variable length, as in yawning and speaking, are also rarely considered by previous approaches. In this paper, we propose a Long-term Multi-granularity Deep Framework to detect driver drowsiness in driving videos containing frontal faces. The framework includes two key components: (1) a Multi-granularity Convolutional Neural Network (MCNN), a novel network that applies a group of parallel CNN extractors to well-aligned facial patches of different granularities and extracts facial representations that remain effective under large variation of head pose; it can also flexibly fuse detailed appearance clues of the main parts with local-to-global spatial constraints; (2) a deep Long Short Term Memory (LSTM) network applied to the facial representations to explore long-term relationships of variable length over sequential frames, which makes it possible to distinguish states with temporal dependencies, such as blinking and closing eyes. Our approach achieves 90.05% accuracy at about 37 fps on the evaluation set of the public NTHU-DDD dataset, which is state-of-the-art for driver drowsiness detection. Moreover, we build a new dataset named FI-DDD, which locates drowsy states more precisely in the temporal dimension.




I Introduction

It is reported that about 1.24 million people die on roads every year, and driver drowsiness accounts for 6% [1] of these deaths. Driver drowsiness indicates that a driver lacks sleep, and it can be detected from variations in physiological signals [2], vehicle trajectory [3, 4] and facial expressions [5]. However, the first two kinds of methods can hardly satisfy the requirements of convenience and timeliness. Drowsiness is reflected in facial expressions, such as nodding, yawning and closing eyes. We therefore aim to develop a video-based drowsiness detection method. A video-based method can give warning prompts and receive the driver's feedback in time, which is of great value in practice.

Video-based drowsiness detection is still full of challenges, mainly stemming from changes in illumination, head pose variation, and temporal dependencies. In particular, large variation of head pose causes serious deformation of the facial shape, which makes it difficult to extract effective spatial representations. Conventionally, an approach based on aligned facial points [5] is a better way to represent drowsy features; however, because it ignores temporal relationships, it cannot distinguish blinking from closing eyes. A spatio-temporal descriptor [6] has been proposed to collect spatial and temporal features, but it is not good at distinguishing states with long-term dependencies, such as yawning and speaking. Besides, these handcrafted descriptors are not powerful enough to describe large variation of head pose or to classify confusing states: looking aside and lowering the head lead to large pose variation, while yawning and laughing look similar but belong to different states. Recently, deep learning methods have been widely used to learn facial spatial representations automatically from the global face [7, 8, 9]. Nevertheless, a global face without good alignment provides weak representations under large pose variation. Moreover, it is not flexible enough to fuse the configurations of local regions and to concentrate representations on the most important parts, such as the eyes, nose and mouth, where the majority of drowsy information is focused. Another challenge is to distinguish easy-to-confuse states, such as blinking and closing eyes. A 3D-CNN with fixed time windows [7] attempts to describe spatial and temporal features, but it lacks the capability to model long-term relationships of variable length.

We propose a Long-term Multi-granularity Deep Framework (LMDF) to detect driver drowsiness from well-aligned facial patches. Our method applies face alignment to obtain well-aligned facial patches over frames, and these patches are mainly located in the informative regions that supply critical drowsy information. A group of parallel convolutional paths is applied to the patches, and the outputs of these paths are fused by a fully connected layer to generate spatial representations; we name this network the Multi-granularity Convolutional Neural Network (MCNN). MCNN is able to fuse the appearance of those well-aligned patches and capture local-to-global constraints. To explore temporal dynamics, a deep Long Short Term Memory (LSTM) network is applied to the spatial representations over sequential frames, which can distinguish states with temporal relationships, such as yawning versus laughing and blinking versus closing eyes. The proposed method can thus not only extract effective facial representations from single-frame images, but also mine temporal clues from videos.

The contributions of our approach are mainly in three aspects: (1) We propose a Long-term Multi-granularity Deep Framework to learn facial spatial features and their long-term temporal dependencies. (2) We propose MCNN to learn the facial representations from the most important parts, which makes the detector robust to large pose variation. (3) We build a Forward Instant Driver Drowsiness Detection (FI-DDD) dataset with higher precision of drowsy locations in temporal dimension, which is a good test bed for evaluating practical systems that are required to detect drowsiness in time.

II Related Work

Driver drowsiness detection is becoming a hot topic in Advanced Driver Assistance Systems (ADAS), and many traditional methods have been applied to this problem. Shirakata et al. [10] used the change of pupil diameter to detect imperceptible drowsiness, which is effective but requires the driver to wear inconvenient equipment. Nakamura et al. [5] used face alignment to estimate the degree of drowsiness via k-NN, which cannot run online. Mahdi et al. [6] proposed spatial-temporal features for driver drowsiness detection based on the Hough transform, which does not work well in practical driving environments. Besides, the representations of those methods are hand-crafted and may not adapt flexibly to the complex situations faced in driving, whereas our method learns facial representations automatically, which is more effective for the practical task.

Deep learning approaches such as CNNs have achieved success in representing information in images [11, 12, 13], and many researchers have also applied CNNs to driver drowsiness detection. Park et al. [14] combined the results of three existing networks with an SVM to classify videos, which cannot detect driver drowsiness online. Yu et al. [7] applied a 3D-CNN to extract spatial and temporal information, but the method can only capture features within a fixed temporal window. Both methods use the global face image and therefore cannot flexibly configure the patches that contain the majority of drowsy information. Moreover, they can hardly capture dependencies of variable temporal length.

Due to the strong performance of LSTMs on sequential data [15, 16, 17], more and more researchers propose combinations of CNNs and LSTMs to learn spatial and temporal representations of sequential frames. Liang M. et al. [18] proposed convolutional layers with intra-layer recurrent connections to integrate context information for object recognition. Jeff D. et al. [19] provided a method that extracts visual features from images with a CNN and learns the long-term dependencies in sequential data with LSTMs. The approach of Jiang W. et al. [20] processes the image with a CNN and models sequential labels with LSTMs concurrently, and then combines the two representations via projection layers. However, none of the above methods applies a multi-granularity method to concentrate representations on important parts and flexibly fuse the configurations of different regions.

Recently, multi-granularity methods have achieved excellent results in other applications of computer vision. Qing Li et al. [21] proposed a temporal multi-granularity approach to action recognition. Their method achieved state-of-the-art performance on action benchmarks, but it cannot capture detailed appearance clues and local-to-global spatial information. Dong C. et al. [22] applied multi-scale patches based on face alignment to face recognition. Dequan W. et al. [23] used multi-granularity regions, detected by convolutional neural networks at three granularities, to generate a multi-granularity descriptor for fine-grained categorization, but this method cannot process sequential frames. Different from the above, our method captures both spatial multi-granularity information and long-term temporal dependencies. In particular, our MCNN learns representations of the most significant regions from well-aligned multi-granularity patches, and the proposed method achieves state-of-the-art accuracy on the NTHU-DDD dataset for driver drowsiness detection.

III Our Approach

The proposed method uses a Multi-granularity Convolutional Neural Network (MCNN) to learn facial representations from single-frame images. The representations, extracted from well-aligned facial patches, contain both detailed appearance information of the main parts and local-to-global constraints. Furthermore, our approach takes advantage of a deep Long Short Term Memory (LSTM) network to explore the dynamic characteristics of the facial representations over sequential frames. The detailed structure of our Long-term Multi-granularity Deep Framework, which combines MCNN and LSTMs, is shown in Fig. 1.

Fig. 1: The long-term multi-granularity deep framework for driver drowsiness detection. The first stage produces well-aligned multi-granularity patches, which consist of local regions, main parts and the global face. In the second stage, parallel convolutional layers process the patches separately, and a fully connected layer fuses local and global clues to generate a representation. The first two stages constitute the Multi-granularity CNN (MCNN). In the third stage, a Recurrent Neural Network (RNN) with multiple LSTM blocks mines clues in the temporal dimension and is followed by a fully connected layer.

III-A Well-aligned Multi-granularity Patches

It is well known that drowsy information is concentrated in several main facial parts, such as the eyes, nose and mouth. Alignment provides an excellent way to extract well-aligned features over frames, which effectively represent facial drowsy states. Besides, the global patch provides rough information about the state of the driver's head and full face, which assists the decision on the driver's drowsy state when the locations of the parts are not precise. Our method takes advantage of local regions and the global face at the same time.

We use face alignment to locate the facial shape points. Given a frame containing a face, we detect the landmark points of the facial shape by regressing local binary features, as proposed by Ren et al. [24]. From those points, it is convenient to obtain the locations of the main parts and the important local regions. According to the center points and specific sizes of all regions, we crop the patches from the original image and resize them to the same size. These well-aligned multi-granularity patches form the input layers of the convolutional neural network.

Those patches, including local regions, main parts, and the global face, are produced by three different mappings. As shown in Fig. 2, the first mapping selects the center points of the eyes, nose, and mouth from the facial shape and crops patches of those parts from the input image at given sizes. The mapping then converts the patches to a unified size, which yields the single-granularity patches of the main parts. The other two mappings operate similarly and differ only in the locations and sizes of the regions. The second mapping selects the corners of the eyes and mouth and the sides of the nose as regions of interest and outputs local patches. The third mapping chooses a global facial region and produces a global facial patch. Formally, the mappings are represented as


Processing the input image by the three mappings, we can obtain a set of well-aligned patches consisting of the main parts, local regions and global face, and it is presented as


where represents all elements of a patch set .

Compared with the original image, the patch set, which includes both detailed appearance clues of the parts and rough information about the full face, is better suited to describing facial states. Meanwhile, the relations between local and global regions are implied in the set, which is the basis for mining useful features. Therefore, we take the set of patches as the input layer of the CNN to learn effective representations.
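The cropping and resizing described in this subsection can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the nearest-neighbour resize, the rule of shifting a box so that it stays inside the image, and the single-channel list-of-rows image format are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]  # single-channel image stored as rows of pixel values


def crop_box(center: Tuple[int, int], size: int, width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a square box around `center`.

    The box is shifted (an assumption of this sketch) so that it stays inside the image.
    """
    cx, cy = center
    size = min(size, width, height)
    left = min(max(cx - size // 2, 0), width - size)
    top = min(max(cy - size // 2, 0), height - size)
    return left, top, left + size, top + size


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def resize_nearest(patch: Image, out_size: int) -> Image:
    """Nearest-neighbour resize of a patch to out_size x out_size."""
    h, w = len(patch), len(patch[0])
    return [[patch[r * h // out_size][c * w // out_size] for c in range(out_size)]
            for r in range(out_size)]


def extract_patches(image: Image, centers: Sequence[Tuple[int, int]],
                    sizes: Sequence[int], out_size: int) -> List[Image]:
    """Crop one patch per (center, size) pair and resize all patches to a unified size."""
    height, width = len(image), len(image[0])
    return [resize_nearest(crop(image, crop_box(c, s, width, height)), out_size)
            for c, s in zip(centers, sizes)]


# Toy 8x8 image whose pixel value encodes its position (10 * row + column).
toy_image = [[10 * r + c for c in range(8)] for r in range(8)]
# Hypothetical regions: a global face, a corner-sized local region, a main part.
toy_patches = extract_patches(toy_image, [(4, 4), (0, 0), (4, 4)], [8, 2, 4], out_size=2)
```

The same routine serves all three granularities; only the list of center points and sizes changes between the global face, the main parts and the local regions.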

Fig. 2: The procedure of extracting multi-granularity facial patches at three granularities: the main parts, local regions and the global face.

III-B Learning Facial Representations

Our approach learns representations with a convolutional neural network rather than hand-crafted descriptors because of its strong performance in learning spatial features. We apply several convolutional layers to each patch in the set independently. To fuse the information of all patches, a fully connected layer is placed after all convolutional operations and generates descriptors that combine local and global clues.

Every patch is first processed by convolutional operations. For each patch in the patch set, three convolutional layers are used to capture the spatial features. The first layer consists of a convolution with rectified linear unit (ReLU) activation followed by a max-pooling operation, and projects a normalized 3-channel image to a higher-dimensional representation. The second layer uses only convolution and ReLU activation to further enlarge the dimension of the representation. The third layer has the same structure as the first but uses different parameters to decrease the dimension. A representation of the patch is generated by a mapping consisting of those convolutional layers with their parameters, which is presented as


where is the -th element of convolutional parameter set .

A fully connected layer is used to combine the representations extracted by this mapping from the set of patches. Before the combining operation, we concatenate those representations into a long vector, formed as


With a specific weight matrix and bias vector, the combined representation can be presented by the fully connected layer as


in which is a zero vector.

The descriptor contains not only detailed appearance information implied in every part, but also the constrained relations between local regions and global face. The effectiveness of the descriptor can be improved by appropriate objective functions and proper training methods.
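As a rough illustration of the fusion step, the sketch below concatenates per-patch feature vectors and applies one fully connected layer. The element-wise maximum with zero (a ReLU) is an assumption inferred from the mention of a zero vector in the text; the function names and the toy weights are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def concatenate(features: List[Vector]) -> Vector:
    """Join the per-patch feature vectors into one long vector."""
    return [value for feature in features for value in feature]


def fully_connected_relu(v: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Compute max(0, W v + b) element-wise (ReLU is assumed, not confirmed by the source)."""
    return [max(0.0, sum(w * x for w, x in zip(row, v)) + b)
            for row, b in zip(weights, bias)]


# Two toy patch features of different lengths, fused into a 2-dimensional descriptor.
long_vector = concatenate([[1.0, 2.0], [3.0]])
descriptor = fully_connected_relu(long_vector,
                                  weights=[[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]],
                                  bias=[0.0, 1.0])
```

Because every output unit sees the features of all patches, the fused descriptor can encode relations between local regions and the global face, which is the role the text assigns to this layer.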

Driver drowsiness detection is a binary classification problem, so the state of an input frame is either drowsy or not. We label drowsiness as the positive sample and the normal state as the negative sample, and each label is expressed as a one-hot vector.

To train the parameters of the convolutional neural network, we project the representation into the probabilities of each category with another fully connected layer that has its own weights and bias vector, and the probability vector is normalized via a softmax layer. The cross entropy, which reflects how well the samples are classified, is selected as the objective function, and we use the Adam optimizer to train the whole convolutional neural network. Once trained, the visual representations are generated by the convolutional layers and the first fully connected layer.
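The softmax normalization and cross-entropy objective can be written out for a single frame as below. This is a generic sketch of the standard formulas; the label encoding (index 1 for drowsy) is an assumption chosen for the example, since the original one-hot values did not survive in this copy.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Normalize class scores into probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], one_hot: List[float]) -> float:
    """Cross entropy between predicted probabilities and a one-hot label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


# Assumed encoding for illustration: [1, 0] = normal, [0, 1] = drowsy.
probs = softmax([0.0, 0.0])
loss = cross_entropy(probs, [0.0, 1.0])
```

With equal scores the two classes each receive probability 0.5 and the loss equals log 2; the loss approaches zero as the probability of the labeled class approaches one.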

III-C Exploring Dynamical Characteristics

The representation is extracted from a single frame, whereas whether a driver is drowsy is judged over a period of time. We therefore apply LSTMs to model the temporal dynamics of the spatial representations for driver drowsiness detection.

An LSTM block consists of an input gate, a forget gate, an output gate and a memory cell. Because of the three gates, the LSTM block can learn long-term dependencies in sequential data, and its parameters are easier to train. The memory cell stores long-term information in its vector, which can be rewritten or otherwise updated at the next time step. Besides, the number of hidden units should be chosen according to the dimension of the input representation.
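One time step of such a block can be sketched with the standard LSTM update equations. The source does not state which LSTM variant the authors use, so the formulation below (no peephole connections) and all parameter names are assumptions for illustration only.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Gate = Tuple[Matrix, Matrix, Vector]  # (input weights W, recurrent weights U, bias b)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(gate: Gate, x: Vector, h: Vector) -> Vector:
    """Compute W x + U h + b for one gate."""
    w, u, b = gate
    return [sum(wi * xi for wi, xi in zip(w_row, x))
            + sum(ui * hi for ui, hi in zip(u_row, h))
            + bi
            for w_row, u_row, bi in zip(w, u, b)]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, Gate]) -> Tuple[Vector, Vector]:
    """One LSTM time step; returns the new hidden state and memory cell."""
    i = [sigmoid(z) for z in affine(params["input"], x, h_prev)]        # input gate
    f = [sigmoid(z) for z in affine(params["forget"], x, h_prev)]       # forget gate
    o = [sigmoid(z) for z in affine(params["output"], x, h_prev)]       # output gate
    g = [math.tanh(z) for z in affine(params["candidate"], x, h_prev)]  # candidate memory
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]  # keep old + write new
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Toy check with all-zero parameters: every gate outputs 0.5 and the candidate is 0.
zero_gate: Gate = ([[0.0, 0.0]], [[0.0]], [0.0])
zero_params = {name: zero_gate for name in ("input", "forget", "output", "candidate")}
h_new, c_new = lstm_step([1.0, -1.0], [0.3], [2.0], zero_params)
```

The forget gate scales the previous memory cell and the input gate scales the newly written candidate, which is how the cell can carry information across many frames.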

We employ multi-layer LSTMs to mine the temporal features of driver drowsiness. A mapping containing three LSTM layers with its parameters is used to explore temporal clues in the representations generated by the MCNN extractor, and it outputs the hidden states of the third layer as a representation containing temporal dependencies, which is presented as


where is the parameter set of these LSTM blocks in the last step.

A fully connected layer with a weight matrix and a bias vector projects the output of the mapping into a two-dimensional vector, which is then decoded by a softmax operation into the probabilities of the two categories. To learn the parameters, we use the Adam optimizer to train the LSTMs with the cross-entropy objective function.

The label of the current frame can be predicted as the class with the maximum probability.


Similarly, the labels of the sequential data can be obtained.
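The decision rule amounts to an argmax over the two class probabilities, applied frame by frame. A small sketch follows; the function names are hypothetical, and ties are resolved in favor of the lower class index, which is a choice made only for this example.

```python
from typing import List


def predict_label(probs: List[float]) -> int:
    """Return the index of the class with the maximum probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def predict_sequence(prob_sequence: List[List[float]]) -> List[int]:
    """Apply the same rule to every frame of a sequence."""
    return [predict_label(p) for p in prob_sequence]


frame_labels = predict_sequence([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
```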

IV Experiments

The National TsingHua University Drowsy Driver Detection (NTHU-DDD) dataset was provided for the driver drowsiness detection challenge of an ACCV 2016 workshop, and we compare our approach with others on it. To bring the sequential labels closer to practical driving environments, we relabel the video set according to an instant detection principle. The relabeled video set forms a new dataset called Forward Instant Driver Drowsiness Detection (FI-DDD), on which we learn parameters and analyze the performance of several subnetworks. Because the performance of the entire approach is evaluated on the original NTHU-DDD dataset, we also train a set of parameters suited to its long-term memory labels. Finally, our Long-term Multi-granularity Deep Framework (LMDF) achieves 90.05% accuracy on the evaluation set of the NTHU-DDD dataset and runs at about 37 fps on a Tesla M40 GPU.

IV-A Dataset

NTHU-DDD Dataset: The NTHU-DDD dataset includes five scenarios: glasses, no glasses, glasses at night, no glasses at night and sunglasses. The training set involves 18 volunteers (10 men and 8 women) who act as drivers in four different states in every scenario, while the evaluation set has four volunteers (two men and two women). Non-sleepy videos contain only the normal state, while sleepy videos combine normal and drowsy states. Besides, the blinking-with-nodding videos and the yawning videos only record drowsy eyes and a drowsy mouth, respectively. The NTHU-DDD dataset offers four annotation files for every video, recording the states of drowsiness, eyes, head and mouth. Table I gives the labels of drowsiness and the three main parts.

class / label   0        1         2
drowsiness      normal   drowsy    -
eyes            normal   sleepy    -
head            normal   nodding   looking aside
mouth           normal   yawning   talking & laughing
TABLE I: The labeled states of each part on the NTHU-DDD dataset

It is worth emphasizing that the labels of the NTHU-DDD dataset have long-term memory, which means that the state of a frame may depend on the frames of the previous several seconds.

FI-DDD Dataset: The long-term memory of the NTHU-DDD labels causes a problem: a driver would still receive warning prompts even after returning from a drowsy state to the normal state for a few seconds. At the same time, those labels cannot locate drowsy states with high precision in the temporal dimension. To solve these problems, we relabel those videos with an instant principle, which means the latency is limited within second namely frames for FPS videos. Typical states, such as closing eyes, yawning and lowering the head, are still considered as evidence for judging whether a frame is drowsy. According to our labels, the videos are cut into clips that alternately contain only the drowsy or only the normal state. To describe the transitional states between normal and drowsy, we reserve ten normal frames at the head and the tail of every drowsy clip. We name the relabeled dataset Forward Instant Driver Drowsiness Detection (FI-DDD); it includes 14 drivers in the trainset and 4 in the testset. The trainset of FI-DDD at day time has clips and the testset has clips, while at night scenarios, the trainset has ones and the testset has clips with about frames on average.
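The clip-cutting step can be sketched as follows, assuming per-frame labels of 0 (normal) and 1 (drowsy). The function name and the return format are hypothetical, and the text does not say whether the ten reserved frames are also removed from the neighboring normal clips; this sketch leaves the normal clips untouched, so a drowsy clip overlaps its neighbors by the margin.

```python
from itertools import groupby
from typing import List, Tuple

Clip = Tuple[int, int, int]  # (label, first frame index, one past the last frame index)


def cut_clips(labels: List[int], margin: int = 10) -> List[Clip]:
    """Split a labeled video into alternating normal/drowsy clips.

    Drowsy clips are extended by `margin` frames on each side (clamped to the video)
    to keep the transition between the normal and the drowsy state.
    """
    clips: List[Clip] = []
    start = 0
    for label, run in groupby(labels):
        end = start + len(list(run))
        if label == 1:
            clips.append((1, max(0, start - margin), min(len(labels), end + margin)))
        else:
            clips.append((0, start, end))
        start = end
    return clips


# 15 normal frames, 5 drowsy frames, then 3 normal frames.
example_clips = cut_clips([0] * 15 + [1] * 5 + [0] * 3)
```

In this toy video the drowsy run covers frames 15 to 19, and its clip is widened to frames 5 to 22 because only three normal frames follow it.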

Static image set: To train the parameters of the CNN and analyze the effects of several factors, we build a static image set by sampling frames from the FI-DDD dataset. The samples in the image set are labeled as drowsy or normal. The labels mostly indicate the true states of the corresponding images, although a small number of images carry wrong labels because a single image lacks temporal context. The static image set has images in day time, and trainset includes images and testset has images. It has images in night scenario, and trainset includes images, testset has images.

IV-B Implementation Details

Face Alignment: We apply face alignment to locate the facial shape points in all videos. Face detection and tracking are combined to increase the detection rate and provide more accurate face positions in videos, and the face alignment algorithm operates on those face positions. The face detector is from OpenCV, and the face tracking approach is the one proposed by Danelljan et al. [25]. We implement the method of Ren et al. [24], retrain the model, and preprocess all videos to obtain the landmark points for every frame. Frames with no face are marked as empty, and their landmark points are filled with zero coordinates.

Multi-granularity: We obtain multi-granularity patches by considering two factors: different positions and different sizes. We choose positions from the facial shape points and divide them into three granularities: 1 global face with size , 4 main parts with size and 10 local regions with size . The specific locations of all patches are shown in Fig. 2. Before being sent to the CNN, the patches are resized to size , normalized to [-0.5, 0.5], and converted to 3 channels to ensure that our framework can process RGB data.
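The normalization and channel conversion can be illustrated as below. The exact formula is not given in the text; mapping 8-bit intensities with v / 255 - 0.5 and replicating a single-channel value across three channels are assumptions that merely satisfy the stated range of [-0.5, 0.5] and the 3-channel input.

```python
from typing import List

GrayPatch = List[List[int]]           # rows of 8-bit intensities
RgbPatch = List[List[List[float]]]    # rows of [r, g, b] values


def normalize_pixel(value: int) -> float:
    """Map an 8-bit intensity in [0, 255] to [-0.5, 0.5] (assumed formula)."""
    return value / 255.0 - 0.5


def to_three_channels(patch: GrayPatch) -> RgbPatch:
    """Normalize a single-channel patch and replicate it across three channels."""
    return [[[normalize_pixel(v)] * 3 for v in row] for row in patch]


toy_input = to_three_channels([[0, 255]])
```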

Dataset Usage: The static image set, required for training the CNN parameters, is sampled from the videos of FI-DDD at a specific frame interval. Since the results of the CNN depend directly on the multi-granularity patches and the CNN parameters, we analyze the effects of those factors on the static image set. All experiments that analyze the effects of the LSTM parameters are carried out on the FI-DDD dataset. To compare with previous methods, we evaluate the proposed method on the evaluation set of the NTHU-DDD dataset.

IV-C Experimental Analysis

To further explain the effects of alignment, multi-granularity and the CNN extractor, several groups of experiments are conducted on the static image set. We also provide experiments on the FI-DDD dataset to verify the effectiveness of LSTMs for detecting drowsiness in video.

IV-C1 The Importance of Alignment

It is essential to carry out experiments to explain the significance of alignment and the effects of locating precision.

Non-alignment vs. alignment: We provide two additional non-alignment methods that sample the multi-granularity patches inside the facial bounding box: Uniform Sampling (US) and Specific Sampling (SS). The patch sizes used by our Aligned Sampling (AS) method and by the two non-alignment methods are the same. Fig. 3 (Left) shows the comparison of AS, US and SS. AS considering alignment achieves the best accuracy on the testset of the static image set, which is higher than SS method and higher than US one. In conclusion, aligning facial patches, which provides aligned representations, is an effective way to improve the accuracy of driver drowsiness detection.

Fig. 3: Left: the comparison of different sampling methods: sampling over a uniform distribution (US), sampling at specific locations (SS) and our proposed sampling with aligned positions (AS); Right: the effect of alignment precision as a function of the standard deviation of the normally distributed noise. The results (Acc) are achieved by the CNN on the testset of the static image set.

Effects of alignment precision: We evaluate the effects of alignment precision quantitatively by adding random Gaussian noise to the well-aligned facial points. Fig. 3 (Right) shows the results on the testset of the static image set, from which we find that the accuracy decreases as the standard deviation of the noise increases, and even less than if px. While the accuracy is more than with less than px, we make a conclusion that the proposed MCNN is robust to the corrupted locations if px.
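The perturbation used in this experiment can be sketched as adding independent zero-mean Gaussian noise to each landmark coordinate. The function name, the per-coordinate independence and the explicit random generator argument are assumptions made for this illustration.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def perturb_landmarks(points: List[Point], sigma: float, rng: random.Random) -> List[Point]:
    """Add zero-mean Gaussian noise with standard deviation `sigma` (in pixels) to each coordinate."""
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)) for x, y in points]


landmarks = [(10.0, 20.0), (30.0, 40.0)]
# A seeded generator keeps the corrupted positions reproducible across runs.
noisy = perturb_landmarks(landmarks, sigma=2.0, rng=random.Random(0))
unchanged = perturb_landmarks(landmarks, sigma=0.0, rng=random.Random(0))
```

Cropping the patches around the perturbed points and re-running the classifier for increasing values of sigma would reproduce the kind of curve described for Fig. 3 (Right).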

IV-C2 The Effects of Multi-granularity Patches

Multi-granularity patches consist of local regions, main parts and global face. It is significant to conduct experiments and explain the importance of those granularities on driver drowsiness detection. We apply a fully connected layer and softmax operation to classify representations presented by MCNN extractor, and analyze the effects of multi-granularity patches by results of the classification.

Fig. 4: The comparison of different granularities: global face, main parts, local regions and multi-granularity patches. The curves of accuracy (Acc) over training time are achieved by the CNN with different granularities on the testset of the static image set.

Learning curve on different granularities: We consider four granularity settings, namely local regions, main parts, the global face and their combination, to analyze the effects of multi-granularity facial patches. Fig. 4 compares them: the method with global face granularity converges the slowest and the one with local regions converges the fastest, while the multi-granularity method performs well in both convergence speed and accuracy. Aligned points achieve higher precision on local regions with abundant boundary texture, which results in better-aligned representations that are easier to classify. Nevertheless, multi-granularity patches, which contain more aligned information, are more effective for driver drowsiness detection.

Effects of positions and sizes: We change the positions and sizes of facial patches respectively. As shown in Fig. 5 (Left), the facial main parts, including the eyes, nose and mouth, obtain the best accuracy among the single-granularity methods, and the combination of the three granularities achieves the best accuracy overall. We conclude that the most effective single-granularity representation is extracted from the main facial parts, while fusing local and global clues is an excellent way to obtain better facial representations.

To study the difference between single-size and multi-granularity methods, we set all patches to the same size and vary that size, keeping the patch locations unchanged. Fig. 5 (Right) shows that regions with different sizes achieve higher accuracy than single-size patches. This is because different physiological parts have different sizes, e.g., the global face is bigger than a single eye. The above analysis shows that the multi-granularity method is an effective way to represent facial features.

Fig. 5: Left: The comparison of patches with different positions, GF-the global face, MP-main parts(eyes, nose and mouth), LR-local regions(the corner of eyes, the sides of nose and the boundary of mouth); Right: the comparison of patches with different sizes at all locations. Mg represents multi-granularity patches.

IV-C3 The Parameters Selection of MCNN Extractor

The structure parameters of the convolutional layers are listed in Table II. A patch processed by those convolutional layers is projected to a tensor, and a representation of the patch is generated by reshaping the tensor into a vector, which is the input of the fully connected layer.

Layers   Operations    Attributes
1st      Convolution   size:
         Activation    ReLU
         Max pooling   strides:
2nd      Convolution   size:
         Activation    ReLU
         Max pooling   Not Used
3rd      Convolution   size:
         Activation    ReLU
         Max pooling   strides:
TABLE II: The parameters of the three convolutional layers. Those structure parameters are the same in all parallel convolutional paths.
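How a patch shrinks to the tensor that is later flattened can be traced with the usual output-size formula for convolution and pooling. The kernel sizes, filter counts and input size used below are purely hypothetical placeholders, since the actual values of Table II did not survive in this copy; only the layer pattern (pooling after the first and third convolutions, none after the second) follows the text.

```python
from typing import Dict, List, Tuple


def output_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size after a convolution or pooling window slides over n pixels."""
    return (n + 2 * padding - kernel) // stride + 1


def path_output_shape(size: int, layers: List[Dict[str, int]]) -> Tuple[int, int, int]:
    """Trace (height, width, channels) of a square patch through one convolutional path."""
    channels = 0
    for layer in layers:
        size = output_size(size, layer["kernel"])
        channels = layer["filters"]
        if "pool" in layer:  # non-overlapping max pooling is assumed
            size = output_size(size, layer["pool"], stride=layer["pool"])
    return size, size, channels


# Hypothetical configuration, NOT the values of Table II.
HYPOTHETICAL_LAYERS = [
    {"kernel": 5, "filters": 32, "pool": 2},  # 1st: convolution + ReLU + max pooling
    {"kernel": 3, "filters": 64},             # 2nd: convolution + ReLU only
    {"kernel": 3, "filters": 32, "pool": 2},  # 3rd: convolution + ReLU + max pooling
]
height, width, depth = path_output_shape(64, HYPOTHETICAL_LAYERS)
flattened_length = height * width * depth
```

With these placeholder values a 64 x 64 patch becomes a 13 x 13 x 32 tensor, which flattens to a 5408-dimensional vector per patch before the fully connected layer.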

A fully connected layer is applied to combine the multi-granularity clues and generate MCNN representations. The number of its hidden units, namely the dimension of the representation, affects the combination of those patches. By changing the number of hidden units, we explore the relation between the dimension of MCNN representations and classification accuracy with well-aligned multi-granularity facial patches. The comparison of different dimensions is shown in Fig. 6, which indicates that the dimension has almost no influence on the convergence speed. But -dimensional representations achieve the highest accuracy. Therefore it is reasonable for us to choose the number of hidden units as .

Fig. 6: The comparison of MCNN representations of different dimensions in terms of accuracy and convergence, achieved by the CNN on the testset of the static image set in daytime.

Methods            Platform   Spatial Features   Sequential Features   Speed      Accuracy
Yu et al. [7]      GPU                           feature fusion        2432 fps   72.60%
Park et al. [14]   -          DDD Network        SVM                   -          73.06%
MSTN [9]           -                             LSTMs                 60 fps     85.52%
Ours               GPU        MCNN               LSTMs                 37 fps     90.05%
TABLE III: The comparison of different methods on the evaluation set of NTHU-DDD dataset with the detailed information of environments.

IV-C4 The Significance of LSTMs

We first apply MCNN alone to detect driver drowsiness in video, but it has no capacity to capture temporal clues. MCNN+LSTMs is introduced to address this drawback. To understand the effects of LSTMs, we compare the setting with LSTMs [26] against the setting without them. All experiments in this part are carried out on the FI-DDD dataset in day-time scenarios.

Parameters setting: The representations given by MCNN extractor are -dimensional, and the number of hidden units in each LSTM block is equal to . The forget gate is enabled and the max memory step is set to frames. We randomly select a batch with samples to train the LSTMs parameters with learning rate . The fully connected layer projects the states of the last LSTM block to a -dimensional vector which is decoded to the probability of drowsiness by a softmax operation.

MCNN-Only vs. MCNN+LSTMs: The experiments are carried out on four different granularities to study the effects of multi-granularity and LSTMs. Fig. 7 shows the accuracy of MCNN only and MCNN+LSTMs for detecting videos on the testset under different granularities. MCNN-Only method obtains accuracy, while the accuracy achieved by MCNN+LSTMs is more than that by MCNN-Only. The reason is that the LSTMs are able to mine clues in the temporal dimension, which is important for recognizing many ambiguous states, such as closing eyes and blinking. Comparing the accuracies of different granularities, we find that the well-aligned multi-granularity facial patches still achieve the best performance. The accuracy of the main parts ranks second, which means that, among the three single granularities, the main parts contribute the most to the effectiveness.

Fig. 7: Comparison of the accuracies achieved by MCNN-Only and MCNN+LSTMs on the test set of the FI-DDD dataset, shown for each granularity.
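To see why temporal context matters for ambiguous states such as blinking and closing eyes, consider the toy example below. It is not the LSTM itself: a hand-written run-length rule stands in for the learned temporal model, and the per-frame "eye closed" flags and the 15-frame threshold are hypothetical. The point is only that a single closed-eye frame looks identical in both sequences, so a frame-level classifier cannot separate them, whereas the duration of the closure can.

```python
def longest_closed_run(closed_flags):
    """Length of the longest run of consecutive closed-eye frames."""
    longest = current = 0
    for closed in closed_flags:
        current = current + 1 if closed else 0
        longest = max(longest, current)
    return longest


def label_sequence(closed_flags, min_closure_frames=15):
    """Label a sequence by how long the eyes stay closed (threshold is hypothetical)."""
    if longest_closed_run(closed_flags) >= min_closure_frames:
        return "closing eyes"
    return "blinking"


# Two hypothetical sequences of per-frame flags (1 = eyes closed).
blink = [0] * 10 + [1] * 4 + [0] * 10
closure = [0] * 5 + [1] * 30 + [0] * 5

# Frame 12 is a closed-eye frame in both sequences.
print(blink[12], closure[12])
print(label_sequence(blink), "/", label_sequence(closure))
```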

IV-D Comparisons with the Previous Methods

We evaluate the whole method on the evaluation set and compare it with the previous methods [7, 14, 9] evaluated on the same dataset. Because drowsy states in the NTHU-DDD dataset span long time ranges, the maximum memory length is set to 120 frames, and the other parameters are kept the same as in the above experiments. For night scenarios in particular, we retrain a model on the night data of NTHU-DDD to detect driver drowsiness in near-infrared videos.
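One simple way to bound the temporal context at inference time is a sliding buffer of frame features. The sketch below assumes that the 120-frame maximum memory length can be realized as such a window; this is an assumption for illustration, and the class name `FeatureWindow` and the one-element feature vectors are hypothetical.

```python
from collections import deque

MAX_MEMORY_FRAMES = 120  # maximum memory length stated in the text


class FeatureWindow:
    """Keeps only the most recent `max_frames` per-frame feature vectors."""

    def __init__(self, max_frames=MAX_MEMORY_FRAMES):
        # A deque with maxlen silently drops the oldest item once it is full.
        self._buffer = deque(maxlen=max_frames)

    def push(self, feature):
        self._buffer.append(feature)

    def sequence(self):
        """Return the buffered features, oldest first, ready for a sequence model."""
        return list(self._buffer)


window = FeatureWindow()
for frame_index in range(300):
    window.push([float(frame_index)])  # placeholder feature vector

print(len(window.sequence()), window.sequence()[0], window.sequence()[-1])
```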

Accuracy: Table III compares our method with the previous work [7, 14, 9]. The proposed method achieves 90.05% accuracy, which is the state of the art in driver drowsiness detection.

Speed: We measure the time consumption of every module of the proposed method. As Table IV shows, the CNN costs the most time on the CPU platform, where the approach achieves about 3 fps. On the GPU platform, the proposed method achieves about 37 fps and satisfies the real-time requirement.

Platform | | CNN | LSTMs | Others | Total
CPU(E5)+GPU(M40) | 11.1 | 10.7 | 0.6 | 4.5 | 26.9
CPU(I7) | 5.6 | 302.3 | 0.6 | 3.2 | 311.7
TABLE IV: Time consumption of each module of the proposed method. Others include reading, writing and some converting operations.
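The frame rates quoted in the text can be checked against Table IV with a short calculation. The sketch assumes the table entries are milliseconds per frame; the unit is not stated in the table, but it is the reading under which the totals reproduce the reported 37 fps and 3 fps. The first timing column has no header in the table, so it is kept here as an unnamed stage.

```python
# Per-frame timings copied from Table IV (assumed to be in milliseconds).
# Stage order: unnamed first column, CNN, LSTMs, Others.
timings_ms = {
    "CPU(E5)+GPU(M40)": {"stages": [11.1, 10.7, 0.6, 4.5], "total": 26.9},
    "CPU(I7)": {"stages": [5.6, 302.3, 0.6, 3.2], "total": 311.7},
}


def frames_per_second(total_ms):
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / total_ms


for platform, row in timings_ms.items():
    stage_sum = sum(row["stages"])
    fps = frames_per_second(row["total"])
    print(f"{platform}: stages sum to {stage_sum:.1f} ms, {fps:.1f} fps")
```

The stage times add up to the reported totals on both platforms, and on the CPU-only platform the CNN accounts for almost all of the per-frame cost.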

V Conclusion

We propose an effective and efficient Long-term Multi-granularity Deep Framework to detect driver drowsiness in videos. Good alignment of the facial patches keeps the representations effective under large pose variation, and the multi-granularity patches concentrate efficiently on the most significant regions. The Multi-granularity Convolutional Neural Network (MCNN) effectively learns both the detailed appearance information of the main parts and the local-to-global spatial constraints. The deep Long Short Term Memory (LSTM) network works well in learning the temporal dependencies of the spatial representations for driver drowsiness detection.

Moreover, we build a dataset named Forward Instant Driver Drowsiness Detection (FI-DDD), which locates drowsy states more precisely in the temporal dimension. The dataset performs well for training model parameters and analyzing the effects of several factors. Finally, we evaluate our method on the evaluation set of the NTHU-DDD dataset and achieve 90.05% accuracy at about 37 fps, which is the state of the art in driver drowsiness detection.

References


  • [1] World Health Organization, “Global status report on road safety 2013: Supporting a decade of action: Summary,” World Health Organization, 2013.
  • [2] J. Wang and Y. Gong, “Recognition of multiple drivers’ emotional state,” in IEEE International Conference on Pattern Recognition, December 8-11, 2008, pp. 1–4.
  • [3] A. Colic, O. Marques, and B. Furht, Driver Drowsiness Detection - Systems and Solutions, ser. Springer Briefs in Computer Science.   Springer, 2014.
  • [4] M. Rezaei and R. Klette, “Look at the driver, look at the road: No distraction! no accident!” in IEEE Conference on Computer Vision and Pattern Recognition, June 23-28, 2014, pp. 129–136.
  • [5] T. Nakamura, A. Maejima, and S. Morishima, “Detection of driver’s drowsy facial expression,” in Asian Conference on Pattern Recognition, Naha, Japan, November 5-8, 2013, pp. 749–753.
  • [6] B. Akrout and W. Mahdi, “Spatio-temporal features for the automatic control of driver drowsiness state and lack of concentration,” Mach. Vis. Appl., vol. 26, no. 1, pp. 1–13, 2015.
  • [7] J. Yu, S. Park, S. Lee, and M. Jeon, “Representation learning, scene understanding, and feature fusion for drowsiness detection,” in Asian Conference on Computer Vision Workshop, 2016.
  • [8] X.-P. Huynh, S.-M. Park, and Y.-G. Kim, “Detection of driver drowsiness using 3D deep neural network and semi-supervised gradient boosting machine,” in Asian Conference on Computer Vision Workshop, 2016.
  • [9] T.-H. Shih and C.-T. Hsu, “MSTN: Multistage spatial-temporal network for driver drowsiness detection,” in Asian Conference on Computer Vision Workshop, 2016.
  • [10] S. T, K. Tanida, J. Nishiyama, and H. Y, “Detect the imperceptible drowsiness,” SAE Int. J. Passeng. Cars - Electron Electr. Syst., vol. 3, no. 1, pp. 98–108, 2010.
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, Lake Tahoe, Nevada, United States, December 3-6, 2012, pp. 1106–1114.
  • [12] Y. Sun, X. Wang, and X. Tang, “Deep learning face representation from predicting 10,000 classes,” in IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, June 23-28, 2014, pp. 1891–1898.
  • [13] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, 2015.
  • [14] S. Park, F. Pan, S. Kang, and C. D. Yoo, “Driver drowsiness detection system based on feature representation learning using various deep networks,” in Asian Conference on Computer Vision Workshop, 2016.
  • [15] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, “Deep captioning with multimodal recurrent neural networks (m-rnn),” CoRR, vol. abs/1412.6632, 2014.
  • [16] A. Graves, A. Mohamed, and G. E. Hinton, “Speech recognition with deep recurrent neural networks,” CoRR, vol. abs/1303.5778, 2013.
  • [17] A. N. Chernodub and D. Nowicki, “Sampling-based gradient regularization for capturing long-term dependencies in recurrent neural networks,” in ICONIP, Kyoto, Japan, October 16-21, Proceedings, Part II, 2016, pp. 90–97.
  • [18] M. Liang and X. Hu, “Recurrent convolutional neural network for object recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, June 7-12, 2015, pp. 3367–3375.
  • [19] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, T. Darrell, and K. Saenko, “Long-term recurrent convolutional networks for visual recognition and description,” in IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, June 7-12, 2015, pp. 2625–2634.
  • [20] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu, “CNN-RNN: A unified framework for multi-label image classification,” CoRR, vol. abs/1604.04573, 2016.
  • [21] Q. Li, Z. Qiu, T. Yao, T. Mei, Y. Rui, and J. Luo, “Action recognition by learning deep multi-granular spatio-temporal video representation,” in ICMR, New York, New York, USA, June 6-9, 2016, pp. 159–166.
  • [22] D. Chen, X. Cao, F. Wen, and J. Sun, “Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3025–3032.
  • [23] D. Wang, Z. Shen, J. Shao, W. Zhang, X. Xue, and Z. Zhang, “Multiple granularity descriptors for fine-grained categorization,” in IEEE International Conference on Computer Vision, Santiago, Chile, December 7-13, 2015, pp. 2399–2406.
  • [24] S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment at 3000 FPS via regressing local binary features,” in IEEE Conference on Computer Vision and Pattern Recognition, June 23-28, 2014, pp. 1685–1692.
  • [25] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, “Accurate scale estimation for robust visual tracking,” in British Machine Vision Conference, Nottingham, UK, September 1-5, 2014.
  • [26] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: A search space odyssey,” CoRR, vol. abs/1503.04069, 2015.