Eyeblink detection is of essential research value for application areas such as deception detection, driver fatigue detection, face anti-spoofing, and dry eye syndrome recovery. During the past decades, numerous efforts [5, 6, 7, 8, 9, 10, 11, 12] have been devoted to this problem. Nevertheless, most of them were proposed without considering eyeblink in the wild. Meanwhile, the existing eyeblink detection datasets [3, 13, 14, 15] are generally captured under constrained indoor conditions with relatively consistent subject and environment setups. However, for some practical application scenarios, eyeblink detection in the wild is actually preferred. For instance, during deception detection the eyeblink visual data may be surreptitiously collected using hidden cameras under unconstrained indoor or outdoor conditions. In this case, an effective and real-time eyeblink detection approach in the wild is required to ensure the performance. Unfortunately, to our knowledge this research problem has not been well studied before. Consequently, our main research motivation is to facilitate this research task in terms of dataset, theory and practices.
To this end, we first establish a challenging labelled eyeblink in the wild dataset termed HUST-LEBW. It consists of 673 eyeblink video clip samples (i.e., 381 positives and 292 negatives) that are captured from unconstrained movies to reveal the characteristics of “in the wild”. Each positive sample covers one whole eyeblink process that corresponds to the eye status sequence of “eye open → eye close → eye open”. To our knowledge, HUST-LEBW is the first eyeblink in the wild dataset that involves spatial-temporal sequence information. Fig. 1 shows some snapshots of the eyeblink samples within HUST-LEBW. It can be observed that dramatic variation exists in human attribute, human pose, illumination, imaging viewpoint, and imaging distance. For instance, from the human attribute perspective the subjects involved in HUST-LEBW are of different ages, genders, races and skin colors. Meanwhile, the subjects may or may not wear glasses. This imposes great challenges to accurate eyeblink detection, both for eye localization and eyeblink verification.
|Dataset||Video clip amount||Resolution||Person No.||Person race||Person age||Person sex||Person sight||Scene||Illumination||Imaging view||Imaging distance|
|Talking face||4||720×576||1||Caucasian||middle-aged||male||frontal||indoor||good||front||fixed|
Next, we formulate eyeblink detection in the wild as a binary spatial-temporal pattern recognition problem. In particular, we propose a data-driven real-time eyeblink detection approach that involves 2 stages: eye localization and eyeblink verification. During the spatial eye localization phase, the eye region is first detected using the off-the-shelf SeetaFace face parsing engine, and then tracked by the KCF tracker to ensure high running speed. Then for eyeblink verification, a Long Short-Term Memory (LSTM) neural network is employed to model the temporal sequential procedure of eyeblink. Since eyeblinks may happen with different time durations, we modify the architecture of LSTM to take multi-scale temporal information of eyeblink into consideration.
Meanwhile, we also propose a feature extraction approach able to capture the appearance and motion information of eyeblink simultaneously. In particular, the uniform Local Binary Pattern (LBP) visual descriptor is extracted to reveal the appearance property of the eye region. The feature difference between the uniform LBPs from 2 consecutive frames is used to encode the motion property of eyeblink. The appearance and motion features are concatenated as the input of LSTM.
Extensive experiments are then carried out on HUST-LEBW. The comparison with the state-of-the-art approaches demonstrates the superiority of our method on eyeblink detection in the wild, as well as its real-time running capacity. We also notice that the overall performance of the existing methods (including ours) on HUST-LEBW is not yet satisfactory. This verifies the great challenge of eyeblink detection in the wild.
The main contributions of this paper include:
HUST-LEBW: the first eyeblink detection dataset that involves temporal sequential information towards “in the wild” cases. It involves 673 video samples (i.e., 381 positives, and 292 negatives);
A modified LSTM architecture able to capture multi-scale temporal information is proposed to model the eyeblink detection task as a spatial-temporal pattern recognition problem;
A uniform LBP-based eyeblink feature extraction method is proposed. It captures the appearance and motion information simultaneously.
HUST-LEBW will be released online upon acceptance to facilitate the related research.
The remainder of this paper is organized as follows. Sec. II discusses the related work. The established HUST-LEBW dataset is introduced in Sec. III. Then, the proposed eyeblink detection method in the wild is illustrated in Sec. IV. The essential implementation details of the proposed eyeblink detection method are given in Sec. V. Experiments and discussions are conducted in Sec. VI. Sec. VII concludes the whole paper.
II Related Work
In this section, we will introduce and discuss the related work towards eyeblink detection in the wild in terms of dataset, eyeblink verification and eye localization respectively.
Eyeblink detection dataset. Although numerous efforts have been devoted to the eyeblink detection problem, the available public datasets are still not abundant. ZJU, Eyeblink8, Talking face and Silesian5 are the representative ones with spatial-temporal video information. Nevertheless, all of the 4 datasets above generally target constrained indoor cases as shown in Fig. 2. The involved samples are captured from a limited number of volunteers, with relatively consistent scene, subject, illumination and imaging setup. Consequently, they cannot reveal the “in the wild” characteristics faced by some challenging application scenarios. In addition, the reported performance on these datasets is somewhat saturated (e.g., the detection rate reported on ZJU and Silesian5). To facilitate the research on eyeblink detection in the wild, a more challenging dataset is required. Accordingly, we construct the HUST-LEBW dataset by collecting samples from unconstrained live movies to involve richer “in the wild” eyeblink information. Compared to ZJU, Eyeblink8, Talking face and Silesian5, the samples in HUST-LEBW are of much higher diversity in scene, subject, illumination and imaging conditions. The detailed comparison among them is listed in Table I to verify this, in terms of the attributes “person number”, “person race”, “person age”, “person sex”, “person sight”, “scene”, “illumination”, “imaging view”, and “imaging distance”. The video clip amount and resolution are also listed. Hence, the severe attribute variation within HUST-LEBW imposes great challenges to accurate eyeblink detection.
Eyeblink verification. We introduce the existing eyeblink verification approaches from the perspectives of pattern recognition model and feature extraction method respectively. First, aiming to solve a binary pattern recognition problem, the existing eyeblink verification methods can be categorized into the heuristic and data-driven paradigms. Specifically, the heuristic way executes eyeblink verification mainly according to pre-defined decision rules. For instance, once the human face has been detected in advance, one approach extracts a variance map of the sequential images to reveal the motion information. Eyeblink verification is then carried out by thresholding this map, in the spirit of computing the salient motion pixel ratio. Another approach first executes template matching to estimate the eye state. By observing the change of the correlation coefficient over time, eyeblink is identified when the correlation coefficient falls below a pre-defined threshold. In another work, KLT trackers are placed over the eye region to extract the motion information of eyeblink. Eyeblink is then determined using a state machine with numerous pre-defined threshold parameters. In a further approach, after acquiring the “open” and “close” status of the eye using SVM, eyeblink is confirmed according to the temporal contextual relationship between the resulting eye statuses. With continuous eye tracking, eyeblink can also be recognized by observing whether the eyes are covered by the eyelids. The effectiveness of most of these approaches highly relies on the adaptability of the pre-defined thresholds for decision making. Consequently, they tend to be sensitive to subject and environment variation. To enhance the generalization capacity, some other researchers resort to the data-driven manner. Incorporated with discriminative measures on eye status, the Conditional Random Field (CRF) has been employed to model the eyeblink procedure for verification. In another data-driven approach, the EAR feature is extracted from eye landmarks to characterize the eye opening degree, and SVM is then used to verify the occurrence of eyeblink. Compared to the heuristic manner, the data-driven approach has been studied relatively seldom. Our proposition falls into the data-driven paradigm: it uses the LSTM framework, with its strong sequential information processing capacity, to model the spatial-temporal procedure of eyeblink.
Besides the pattern recognition model, another essential issue for eyeblink verification is feature extraction. Generally speaking, appearance features (e.g., EAR, LBP, Haar, or HOG) or motion features (e.g., KLT tracker motion or the pixel-wise frame difference between 2 consecutive frames) are extracted to this end. Nevertheless, few approaches take appearance and motion information into consideration simultaneously. To address this, we propose to use uniform LBP as the appearance feature and its difference between 2 consecutive frames as the motion feature to jointly characterize eyeblink.
Eye localization. Accurate eye localization is the key step for eyeblink detection within the spatial domain. Some existing approaches [6, 8, 5] resort to color or spectral characteristics to locate the eye. Another way is to use motion information to detect and track the eye. Nevertheless, their performance is not promising. Most of the state-of-the-art methods [9, 26, 21, 27, 28] resort to detecting facial landmarks via face parsing. To achieve the balance between effectiveness and efficiency, we choose to use the SeetaFace engine for eye detection first, and then track the eye using KCF for high efficiency.
|Idx||Name||Filming location||Style||Premiere time|
|1||A clockwork orange||UK & USA||Crime & thriller||1971-12-19|
|2||The last emperor||CN||Drama||1987-10-23|
|3||Farewell my concubine||CN||Drama & love||1993-01-01|
|5||Léon||FR & USA||Action||1994-09-14|
|6||Ashes of time||CN||Emotional ethics||1994-09-17|
|7||The matrix||USA||Science fiction||1999-04-30|
|9||The matrix reloaded||USA||Science fiction||2003-05-15|
|10||Pirates of the Caribbean||USA||Adventure & magic||2003-07-09|
|11||Kill Bill 1||CN & USA||Action||2003-10-10|
|12||The lord of the rings 3||USA & NZ||Fantasy & action||2003-12-01|
|13||Blood diamond||USA & DE||Adventure||2006-02-06|
|14||Memories of matsuko||JP||Drama & music||2006-05-29|
|15||The bourne ultimatum||USA||Action & suspense||2007-08-03|
|16||Game of thrones||USA||War & fantasy||2011-04-17|
|17||A Chinese fairy tale||CN||Fantasy & love||2011-04-19|
|18||Black mirror||UK||Science & thriller||2011-12-04|
|19||Mad max 4||USA||Action||2015-05-15|
|20||Contratiempo||ES||Crime & suspense||2017-01-06|
III HUST-LEBW: A Labelled Dataset for Eyeblink Detection in The Wild
As shown in Fig. 1, eyeblink detection in the wild suffers from essential challenges caused by variation in human attribute, human pose, illumination, imaging view and distance, etc. Nevertheless, the existing eyeblink detection datasets (e.g., ZJU, Talking face, Eyeblink8, and Silesian5) cannot reveal the “in the wild” characteristics well, as indicated in Table I and Fig. 2. To address this, we build a labelled dataset for eyeblink detection in the wild (termed HUST-LEBW) to shed light on this research field, which has not been well studied before. The essential difference between HUST-LEBW and the existing eyeblink detection datasets is that we collect eyeblink video clips from unconstrained movies instead of from a limited number of volunteers under indoor scene conditions. After capturing the eyeblink video clips from the movies, the face region, point-wise eye location, and eye region are annotated for each frame as shown in Fig. 3. Next, we illustrate the construction procedure and characteristics of HUST-LEBW in detail.
III-A Movie data source
To reveal the “in the wild” characteristics, the eyeblink samples in HUST-LEBW are collected from 20 different commercial movies. Their main attribute information (i.e., name, filming location, style and premiere time) is listed in Table II. It can be observed that the attributes of these movies are of high diversity. This helps to ensure rich “in the wild” variation among the captured eyeblink samples in terms of human attribute, human pose, scene / illumination condition, and imaging configuration as discussed in Table I. For instance, the employed 20 movies are shot in 8 countries from Asia, America, and Europe with varied indoor and outdoor filming locations. Thus, compared to the fixed indoor shooting condition of the existing eyeblink detection datasets [3, 14, 13, 15], the eyeblink samples acquired from these movies exhibit much stronger scene variation and challenges. Meanwhile, the discrepancy in movie style and premiere time also helps to promote the human attribute variation, which is closer to practical applications. For example, the person races in HUST-LEBW include Asian, Caucasian and Melanoderm simultaneously. This cannot be met by the other datasets.
III-B Capturing eyeblink in the wild samples
From the 20 selected movies above, we capture the eyeblink in the wild samples in the form of video clips that cover the whole eye status sequence of “eye open → eye close → eye open” as shown in Fig. 4. Finally, we acquire 381 eyeblink video clips as the positive samples. Meanwhile, 292 non-eyeblink samples are collected as the negative ones. Consequently, the resulting HUST-LEBW dataset consists of 673 samples in all (i.e., 381 positives and 292 negatives).
Due to the high diversity of the employed movie data source, the captured eyeblink in the wild samples reveal dramatic variation in human attribute, human pose, scene condition, imaging view, and imaging distance as illustrated in Fig. 1 and Table II. These “in the wild” factors impose great challenges to effective eyeblink detection. For example, 172 persons with varied human attributes and poses are involved in the HUST-LEBW dataset. Their eye appearance differs strikingly, as shown in Fig. 5. Meanwhile, even within the same eyeblink sample the eye appearance may also vary dramatically due to the change in illumination and imaging distance, as shown in Fig. 7. When the variation of human attribute, human pose, scene and imaging condition is considered simultaneously, accurately locating human eyes and characterizing the eye status for eyeblink detection in the wild is indeed not an easy task.
Since some existing eyeblink detection approaches and our proposed LSTM-based manner require the input eyeblink video clips to be of the same length, we polish the raw captured eyeblink samples to a fixed temporal size. To this end, statistics on the temporal duration of the raw eyeblink samples are computed as shown in Fig. 7. It can be observed that the eyeblink temporal duration (in frames) generally follows the Gaussian distribution with a mean value ($\mu$) of 6.18 and a standard deviation ($\sigma$) of 1.54. To alleviate the outlier effect caused by human labelling bias, we set the fixed temporal duration of an eyeblink sample to 10 frames according to the Pauta criterion (i.e., the $3\sigma$ criterion), as also revealed in Fig. 7. In particular, during the eyeblink sample polish phase we place the fully-closed eye frame around the middle of the eyeblink sample. Then, if the raw eyeblink sample is shorter than 10 frames, the first and last frames are copied uniformly and iteratively for extension. Conversely, if the raw eyeblink sample is longer than 10 frames, the excess frames are cut uniformly from the left and right ends. Meanwhile, since some eyeblink detection approaches require the input sample to be of 13 frames, we also extend or cut the raw eyeblink samples to 13 frames to make the HUST-LEBW dataset applicable to them.
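The length normalization above can be sketched as follows. Note that $\mu + 3\sigma = 6.18 + 3 \times 1.54 = 10.8$, which is consistent with the chosen length of 10 frames. This is only a minimal sketch under stated assumptions: the function name `normalize_clip` is hypothetical, "uniformly" is interpreted as alternating between the two ends, and the re-centering of the fully-closed eye frame is not modelled.

```python
def normalize_clip(frames, target_len=10):
    """Pad or crop a clip to target_len frames (illustrative sketch).

    Short clips: alternately duplicate the first and the last frame.
    Long clips: alternately drop one frame from the left and the right end.
    Starting with the left end is an arbitrary choice of this sketch.
    """
    frames = list(frames)
    if not frames:
        raise ValueError("clip must contain at least one frame")

    pad_front = True
    while len(frames) < target_len:
        if pad_front:
            frames.insert(0, frames[0])   # copy the first frame
        else:
            frames.append(frames[-1])     # copy the last frame
        pad_front = not pad_front

    cut_front = True
    while len(frames) > target_len:
        if cut_front:
            frames.pop(0)                 # drop from the left end
        else:
            frames.pop()                  # drop from the right end
        cut_front = not cut_front

    return frames


short_clip = normalize_clip([1, 2, 3, 4, 5, 6], target_len=10)
long_clip = normalize_clip(list(range(13)), target_len=10)
```

The same function can be called with `target_len=13` for the methods that expect 13-frame inputs.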
III-C Eyeblink sample annotation work
After acquiring the 673 eyeblink and non-eyeblink samples, we carry out annotation work on each frame, namely localizing the face, localizing the eyes and extracting the local eye images, for performance evaluation in practice. Next, we introduce the annotation work in detail.
Face localization. For each of the 8749 sample frames, we first use the SeetaFace face parsing engine to localize the human face in terms of a bounding box. Then, manual refinement is executed to ensure that the face bounding box covers both the right and left eye when they appear.
Eye localization. After face localization, we manually localize the eye center at the point level frame by frame. If only one eye is visible, the coordinate of the invisible eye is labelled with a reserved value that marks it as absent.
Local eye image extraction. Using the acquired face bounding box and eye center position information, the local eye images are extracted as follows. For one person, if both the left and right eye are visible with labelled centers, the height and width of the local eye image are calculated from $d(p_l, p_r)$ according to Eqn. 1 and 2, where $p_l$ and $p_r$ indicate the positions of the left and right eye centers and $d(p_l, p_r)$ represents the Manhattan distance between $p_l$ and $p_r$. Meanwhile, if only one eye is visible, the height and width are determined using the face size information, following a principle proposed in prior work. That is, the height and width of the local eye image are set as a fixed fraction of the face width. Some examples of eyeblink sample annotation are shown in Fig. 8.
It is worth noting that, to ensure that the eyeblink sample annotation result is applicable to all the methods in the experiments, we only localize the eyes and extract the local eye images that are visible for all 13 frames. Consequently, we finally acquire 667 right eye samples and 644 left eye samples.
III-D Dataset split
After the HUST-LEBW dataset has been built, we split it into the training set and the test set. In particular, the training set consists of 448 samples. Among them, 254 samples are positives with 253 labelled right eyes and 243 labelled left eyes; 194 samples are negatives with 190 labelled right eyes and 181 labelled left eyes.
The test set consists of 225 samples. Among them, 127 samples are positives with 126 labelled right eyes and 122 labelled left eyes; 98 samples are negatives with 98 labelled right eyes and 98 labelled left eyes.
IV Eyeblink in The Wild Detection Method: A Real-time Spatial-temporal Manner
As aforementioned, we formulate the eyeblink detection task as a binary spatial-temporal pattern recognition problem. To solve it, eye localization is first executed in the spatial domain. Then, appearance and motion features based on uniform LBP are simultaneously extracted per frame from the corresponding local eye images to characterize eyeblink. A multi-scale (MS) LSTM network able to handle multi-scale temporal information is then proposed to process the resulting time series feature for eyeblink verification. The main technical pipeline of the proposed eyeblink in the wild detection method is shown in Fig. 9. Next, we illustrate it in detail.
IV-A Eyeblink verification using multi-scale LSTM
Eyeblink can be regarded as a human activity on the face, consisting of the time series eye status of “eye open → eye close → eye open”. Thus, eyeblink verification is essentially a binary time series pattern recognition problem that distinguishes eyeblink from non-eyeblink samples. The Long Short-Term Memory network (LSTM) has been demonstrated to be one of the most successful deep learning models for sequential data. It has already been applied to human body activity recognition with promising performance. Inspired by this, we propose to apply LSTM to eyeblink verification.
LSTM is derived from the Recurrent Neural Network (RNN) to model the long-term dependency within time series data. As shown in Fig. 11, an LSTM unit consists of a memory cell ($c_t$), an input gate ($i_t$), a forget gate ($f_t$), and an output gate ($o_t$). $i_t$, $f_t$, and $o_t$ work collaboratively to prevent memory contents from being perturbed by irrelevant inputs and outputs, which ensures long-term memory storage in $c_t$ by controlling the information flow into and out of the LSTM unit. Meanwhile, the gradient vanishing and exploding problem met by RNN can also be alleviated in LSTM accordingly. However, directly applying the original LSTM model to eyeblink verification is not optimal. The insight is that eyeblinks happen with different temporal durations as revealed in Fig. 7, although they have been manually fixed to the same size within the HUST-LEBW dataset. The raw LSTM model cannot handle multiple temporal scales within time series data well. To alleviate this, we propose a multi-scale LSTM (MS-LSTM) model from the 2 perspectives below.
First, instead of only using the output (i.e., the hidden state variable $h_t$ in Fig. 11) of the last LSTM unit as the input feature of the softmax layer, as done for human body activity recognition [37, 38], we employ the outputs of the last several LSTM units jointly by concatenation to involve richer multiple temporal scale information for eyeblink characterization.
Secondly, inspired by the conclusion that the stacked RNN architecture can help to alleviate the multiple temporal scale problem, we transfer this idea to the LSTM case by building stacked LSTM layers within MS-LSTM. Similar to the stacked RNN, within the proposed MS-LSTM model the output of the previous LSTM layer is employed as the input of the next LSTM layer in a parallel manner. Overall, the main structure of the proposed MS-LSTM model is shown in Fig. 11 (within MS-LSTM, the number of concatenated LSTM unit outputs and the number of stacked LSTM layers are both empirically set to 2).
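The two ideas above (stacking LSTM layers and concatenating the hidden states of the last few time steps) can be illustrated with a plain-Python forward pass. This is a minimal sketch rather than the trained model: the weights are random, the dimensions are toy values, the softmax classifier is omitted, and all function names (`init_lstm_layer`, `lstm_layer_forward`, `ms_lstm_features`) are hypothetical.

```python
import math
import random


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_lstm_layer(input_dim, hidden_dim, rng):
    """Random weights for the input (i), forget (f), output (o) gates and the candidate (g)."""
    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
                for _ in range(hidden_dim)]
    return {gate: (matrix(), [0.0] * hidden_dim) for gate in "ifog"}


def lstm_layer_forward(inputs, params, hidden_dim):
    """Run one LSTM layer over a sequence and return the hidden state of every step."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    outputs = []
    for x in inputs:
        z = list(x) + h  # current input concatenated with the previous hidden state
        pre = {gate: [sum(w * v for w, v in zip(row, z)) + b
                      for row, b in zip(weights, bias)]
               for gate, (weights, bias) in params.items()}
        i = [_sigmoid(v) for v in pre["i"]]
        f = [_sigmoid(v) for v in pre["f"]]
        o = [_sigmoid(v) for v in pre["o"]]
        g = [math.tanh(v) for v in pre["g"]]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        outputs.append(h)
    return outputs


def ms_lstm_features(sequence, layers, hidden_dim, last_n=2):
    """Stack the LSTM layers, then concatenate the hidden states of the last `last_n` steps."""
    outputs = sequence
    for params in layers:
        outputs = lstm_layer_forward(outputs, params, hidden_dim)
    return [value for h in outputs[-last_n:] for value in h]


rng = random.Random(0)
feature_dim, hidden_dim = 6, 4                       # toy sizes, not the paper's
layers = [init_lstm_layer(feature_dim, hidden_dim, rng),
          init_lstm_layer(hidden_dim, hidden_dim, rng)]  # 2 stacked layers
clip = [[rng.random() for _ in range(feature_dim)] for _ in range(9)]
features = ms_lstm_features(clip, layers, hidden_dim, last_n=2)
```

With 2 stacked layers and the last 2 hidden states, the vector passed to the classifier has `2 * hidden_dim` entries.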
After the multiple temporal scale feature has been acquired within MS-LSTM, the softmax layer finally plays the role of decision making to judge the type of the input sample (eyeblink or non-eyeblink) as shown in Fig. 9. However, we argue that the original softmax loss is not discriminative enough for eyeblink verification, since it is essentially a fine-grained visual recognition problem. To reveal this, we show one eyeblink sample and one non-eyeblink sample from the same person in Fig. 12. It can be observed that most of the frames within these 2 samples look similar except for the eye-closing part. As a result, the eyeblink and non-eyeblink samples are not easy to distinguish in feature space. To further verify this, we exhibit the distribution of the eyeblink and non-eyeblink samples within the HUST-LEBW dataset in Fig. 13, using the appearance and motion feature illustrated in Sec. IV-B. We can see that in both the left and right eye cases the eyeblink and non-eyeblink samples overlap seriously, which makes them difficult to discriminate. To enhance the discriminative power for eyeblink verification, we propose to use the angular softmax (A-Softmax) loss, which has shown promising performance for face verification. The intuition is that face verification can also be regarded as a fine-grained visual recognition problem. Next, we briefly introduce the key idea of A-Softmax loss.
For the binary pattern recognition problem of eyeblink verification, the decision boundary of the original softmax loss is defined as

$$(W_1 - W_2)\,x + b_1 - b_2 = 0, \tag{3}$$

where $x$ indicates the input feature vector; $W_i$ and $b_i$ represent the weights and bias of class $i$. With the constraints of $\|W_1\| = \|W_2\| = 1$ and $b_1 = b_2 = 0$, the decision boundary becomes

$$\|x\|\,(\cos\theta_1 - \cos\theta_2) = 0, \tag{4}$$

where $\theta_i$ is the angle between $W_i$ and $x$. Consequently, the new 2-class decision boundary is only related to $\theta_1$ and $\theta_2$. The modified softmax loss in Eqn. 4 enables the neural network to learn an angle-based decision boundary. However, it cannot ensure strong discriminative power and generalization capacity. To alleviate this, A-Softmax loss introduces an integer $m$ ($m \geq 1$) to control the angular margin between the 2 classes. Accordingly, the decision boundaries for the 2 classes are defined as

$$\|x\|\,(\cos m\theta_1 - \cos\theta_2) = 0 \quad \text{and} \quad \|x\|\,(\cos\theta_1 - \cos m\theta_2) = 0$$

respectively. In summary, the essential idea of A-Softmax loss is to project the samples from the Euclidean feature space to the angular feature space and guarantee the angular margin between the 2 classes, as shown in Fig. 14. In this way, the discriminative power and generalization capacity can be enhanced for the fine-grained eyeblink verification task. The detailed definition of A-Softmax loss can be found in the original A-Softmax work.
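The effect of the angular margin can be illustrated numerically. The sketch below compares the angle-only softmax criterion ($\theta_{own} < \theta_{other}$) with the stricter A-Softmax criterion ($m\,\theta_{own} < \theta_{other}$) for a sample of a given class. The function names are hypothetical, and the margin $m = 4$ is an assumed illustrative value, not one taken from this paper.

```python
import math


def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def satisfies_softmax(x, w_own, w_other):
    """Angle-only criterion: the sample is closer in angle to its own class weight."""
    return angle_between(x, w_own) < angle_between(x, w_other)


def satisfies_a_softmax(x, w_own, w_other, m=4):
    """Margin criterion: m times the own-class angle must stay below the other angle."""
    return m * angle_between(x, w_own) < angle_between(x, w_other)


w_blink, w_non_blink = (1.0, 0.0), (0.0, 1.0)   # toy, orthogonal class weights
x30 = (math.cos(math.radians(30)), math.sin(math.radians(30)))
x10 = (math.cos(math.radians(10)), math.sin(math.radians(10)))

plain_ok = satisfies_softmax(x30, w_blink, w_non_blink)        # 30 deg vs 60 deg
margin_30 = satisfies_a_softmax(x30, w_blink, w_non_blink)     # 120 deg vs 60 deg
margin_10 = satisfies_a_softmax(x10, w_blink, w_non_blink)     # 40 deg vs 80 deg
```

A sample at 30 degrees from its own class weight passes the plain angular criterion but not the margin criterion, whereas a sample at 10 degrees passes both; training with the margin therefore pushes samples of each class into a tighter angular cone.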
IV-B Low-level appearance and motion feature extraction for eyeblink characterization
As aforementioned, eyeblink can be regarded as a human facial activity. Inspired by the two-stream (i.e., appearance stream and motion stream) human body activity recognition paradigm, we propose to extract low-level appearance and motion features simultaneously per frame as the input of MS-LSTM for eyeblink characterization. Considering the real-time running requirement, we achieve this goal based on the lightweight uniform LBP visual descriptor instead of the high-cost deep Convolutional Neural Network (CNN) and optical flow used in the two-stream paradigm. Another main reason why we use uniform LBP is that it is rotation-insensitive, which is beneficial for eyeblink verification in the wild. As shown in Fig. 5, the eyeblink in the wild samples are of different rotation angles due to the varied human poses or imaging views shown in Fig. 1.
Specifically, for each eyeblink frame a uniform LBP histogram is extracted from the local eye image as the appearance feature. Besides the appearance feature, we also calculate the difference between the uniform LBPs from 2 consecutive frames as the motion feature, to reveal the eye status evolution during the eyeblink. By construction, the appearance and motion features are of the same dimensionality. They are concatenated as the input of MS-LSTM for spatial-temporal eyeblink characterization, for each frame except the first one.
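The appearance-plus-motion feature can be sketched in plain Python as follows. Several choices here are assumptions made only for illustration: 8 neighbours at radius 1 (which yields 58 uniform codes plus one shared bin, i.e., 59 bins), a single histogram over the whole eye image without block partitioning, and histogram normalization. The function names are hypothetical.

```python
def _transitions(code):
    """Count 0/1 transitions in the circular 8-bit pattern."""
    bits = [(code >> k) & 1 for k in range(8)]
    return sum(bits[k] != bits[(k + 1) % 8] for k in range(8))


# Each uniform code (at most 2 transitions) gets its own bin; all others share the last bin.
_UNIFORM_CODES = [code for code in range(256) if _transitions(code) <= 2]
_BIN_OF = {code: idx for idx, code in enumerate(_UNIFORM_CODES)}
N_BINS = len(_UNIFORM_CODES) + 1

# Neighbour offsets (row, col) in circular order around the centre pixel.
_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def uniform_lbp_histogram(image):
    """Normalised uniform-LBP histogram of a grayscale image given as a list of rows."""
    hist = [0.0] * N_BINS
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            code = 0
            for k, (dr, dc) in enumerate(_OFFSETS):
                if image[r + dr][c + dc] >= image[r][c]:
                    code |= 1 << k
            hist[_BIN_OF.get(code, N_BINS - 1)] += 1.0
    total = sum(hist)
    return [v / total for v in hist] if total else hist


def eyeblink_features(eye_images):
    """Per-frame [appearance, motion] feature for every frame except the first."""
    hists = [uniform_lbp_histogram(img) for img in eye_images]
    features = []
    for prev, cur in zip(hists, hists[1:]):
        motion = [a - b for a, b in zip(cur, prev)]  # LBP difference of consecutive frames
        features.append(cur + motion)                # concatenation of the two parts
    return features


# Synthetic 10-frame clip of 5x6 "eye images" (placeholder data only).
frames = [[[(r * c + t) % 7 for c in range(6)] for r in range(5)] for t in range(10)]
feats = eyeblink_features(frames)
```

For a 10-frame clip this yields 9 feature vectors, matching the statement that features are produced for each frame except the first one.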
IV-C Local eye image extraction
As illustrated in Fig. 8 and mentioned in Sec. IV-B, appearance and motion features are extracted from the local eye images for eyeblink characterization. Thus, effective and efficient local eye image extraction is crucial for real-time eyeblink detection. To this end, for one eyeblink sample we localize the center position of the left eye ($p_l$) and right eye ($p_r$) using the off-the-shelf SeetaFace face parsing engine at the first frame. Then the local eye images are extracted using $p_l$ and $p_r$ according to Eqn. 1 and 2, in the same way as in Sec. III-C. For the remaining frames, the local eye images are acquired by directly tracking the local eye regions of the previous frame using the KCF tracker, due to its high running efficiency. The main technical pipeline for local eye image extraction is shown in Fig. 15.
V Implementation details
In this section, the essential implementation details of the proposed eyeblink detection in the wild approach are given as follows.
SeetaFace face parsing engine is implemented using the public code with C/C++ programming language at https://github.com/seetaface/SeetaFaceEngine;
KCF tracker is implemented using the public code with C/C++ programming language at https://github.com/vojirt/kcftracker.
Uniform LBP is implemented by ourselves using C/C++ programming language.
|Learning step||Learning rate|
In the experiments, to reveal the essential challenges of eyeblink detection in the wild and verify the effectiveness of our proposed eyeblink detection approach, we first compare the performance of our method with the other state-of-the-art eyeblink detection methods [21, 8, 10, 12, 13] on the proposed HUST-LEBW dataset in Sec. VI-A. Since the codes of the approaches employed for comparison are not publicly available and cannot be acquired from the authors, we try our best to implement them by ourselves.
Then to demonstrate the superiority of the proposed MS-LSTM based eyeblink verification approach, we compare it with the other state-of-the-art region-level eyeblink verification methods [10, 12, 13] in Sec. VI-B. To remove the impact of eye location for fair comparison, this test is executed under the assumption that the local eye region has already been successfully extracted in the way of using the manual annotation result directly as depicted in Sec. III-C. Since the approaches in [21, 8] cannot take the local eye image as input, they will not be taken into consideration for comparison in this experimental part.
Subsequently, the performance comparison between our eye localization method and the other existing approaches [21, 9, 13, 46, 8] is carried out in Sec. VI-C. Here, 3 face parsing approaches (i.e., SeetaFace, Intraface, and MTCNN) are also compared in terms of both effectiveness and efficiency, to justify why we choose SeetaFace to initially locate the eye center.
The real-time running capacity of our eyeblink detection approach is demonstrated in Sec. VI-D. The ablation studies on MS-LSTM, the A-Softmax loss function, and low-level eyeblink feature extraction within our method are executed in Sec. VI-E, Sec. VI-F and Sec. VI-G respectively to reveal the effectiveness of our propositions. The failure cases are given in Sec. VI-H.
The experiments run on a laptop with an Intel(R) Core(TM) i7-7700HQ CPU @ 2.8GHz (only using one core) and 8 GB RAM, under the Windows 10 operating system. During the training phase of MS-LSTM, a GPU is used for acceleration. For online test, the GPU is not used.
|Method||Eye||FR||Recall||Precision||F1|
|Morris (ver.)||Left||0.9590||0.0164||0.6667||0.0320|
|Morris (hor.)||Left||0.9590||0.0410||0.7143||0.0775|
|Morris (flow)||Left||0.9590||0.0164||0.6667||0.0320|
VI-A Performance comparison among the different eyeblink detection methods
To evaluate the performance of the different eyeblink detection methods on the HUST-LEBW dataset, the criteria of Recall, Precision and $F_1$ score are used as below:

$$\text{Recall} = \frac{TP}{TP + FN}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}},$$

where $TP$ indicates the number of eyeblink samples recognized correctly; $FN$ and $FP$ denote the number of eyeblink and non-eyeblink samples recognized incorrectly, respectively. It is worth noting that the eyeblink samples with a wrong eye localization result are regarded as FNs.
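As a worked example, Recall, Precision and $F_1$ score can be computed from the counts of true positives, false negatives and false positives as sketched below. The function name and the counts in the example are illustrative assumptions and are not reported in the paper.

```python
def recall_precision_f1(tp, fn, fp):
    """Return (recall, precision, f1) from raw counts, guarding against empty denominators."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1


# Illustrative counts: 64 eyeblinks found, 58 missed, 71 false alarms.
recall, precision, f1 = recall_precision_f1(tp=64, fn=58, fp=71)
print(f"Recall={recall:.4f}  Precision={precision:.4f}  F1={f1:.4f}")
# prints: Recall=0.5246  Precision=0.4741  F1=0.4981
```

Because samples with failed eye localization count as false negatives, a high localization failure rate lowers Recall directly even when the verification stage itself is accurate.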
Meanwhile, for eyeblink detection in the wild the failure of eye localization essentially weakens the performance. To reveal the impact of this issue, the failure rate ($FR$) of eye localization towards eyeblink samples is given as

$$FR = \frac{N_{u} + N_{w}}{N},$$

where $N_u$ indicates the number of eyeblink samples in which the eyes cannot be detected at all; $N_w$ denotes the number of eyeblink samples in which the eyes cannot be localized correctly within all the frames; and $N$ represents the total number of eyeblink samples. The criterion for judging whether the eye has been correctly localized is defined in terms of the Manhattan distance function $d(\cdot,\cdot)$, the ground-truth positions $p_l$ and $p_r$ of the left and right eye centers, and the position $\hat{p}$ of the detected eye center together with its ground-truth position $p$. If this criterion is violated, we declare that the eye center has not been correctly localized. According to the evaluation criteria above, the comparison among the different eyeblink detection approaches on the HUST-LEBW dataset is listed in Table IV. It can be observed that:
- None of the eyeblink detection approaches under test (including ours) achieves satisfactory performance. Their F1 scores do not exceed 0.7 (0.6735 at most). This reveals that eyeblink detection in the wild is not a trivial task but a challenging visual recognition problem that has not been well solved yet;
- Although its result is still not satisfactory, the proposed eyeblink detection approach significantly outperforms the other methods on 3 of the 4 evaluation criteria (all except Precision) on both the left and right eye, from the perspectives of eye localization and eyeblink verification. In particular, the gap in F1 score between our approach and the others is at least 0.167. This demonstrates the superiority of our proposition for eyeblink detection in the wild. In some cases, the methods of Chau and Morris (ver.) yield higher Precision than ours. Unfortunately, they suffer from low Recall, mainly due to high FR;
- The challenges of eyeblink detection in the wild derive from eye localization and eyeblink verification simultaneously. In particular, all the methods suffer from high FR (over 0.3) on eye localization. Meanwhile, although our approach performs best, its Recall and Precision are still relatively low.
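The failure rate and the eye localization criterion of Sec. VI-A can be sketched as follows. This is only an illustration under stated assumptions: the localization error is taken to be the Manhattan distance between the detected and ground-truth eye centers, normalized by the Manhattan distance between the two ground-truth eye centers; the default threshold `tau=0.25`, the function names, and all coordinates and counts are hypothetical.

```python
def manhattan(p, q):
    """Manhattan (L1) distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def eye_localized(detected, truth, left_gt, right_gt, tau=0.25):
    """True if the detected eye center is close enough to its ground truth.

    The error is normalized by the distance between the ground-truth left
    and right eye centers (an assumed form); tau=0.25 is a hypothetical
    threshold, not a value taken from the paper.
    """
    return manhattan(detected, truth) / manhattan(left_gt, right_gt) <= tau


def failure_rate(n_not_detected, n_wrongly_localized, n_total):
    """Fraction of eyeblink samples whose eye localization failed."""
    return (n_not_detected + n_wrongly_localized) / n_total


# Toy ground-truth eye centers (hypothetical pixel coordinates).
left_gt, right_gt = (100, 120), (160, 122)
print(eye_localized((104, 121), left_gt, left_gt, right_gt))  # error 5/62
print(eye_localized((130, 140), left_gt, left_gt, right_gt))  # error 50/62
print(failure_rate(n_not_detected=3, n_wrongly_localized=2, n_total=20))
```

Sweeping `tau` over a range of values gives one failure rate per threshold, which is how a threshold-wise comparison of eye localizers could be produced.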
|Method||Eye||Recall||Precision||F1 score|
|Morris (ver.) ||Left||0.5246||0.4741||0.4981|
|Morris (hor.) ||Left||0.6393||0.5342||0.5821|
|Morris (flow) ||Left||0.4918||0.4918||0.4918|
VI-B Performance comparison among the different eyeblink verification methods
Since the result of eyeblink detection is jointly determined by eye localization and eyeblink verification, we isolate the contribution of our MS-LSTM based eyeblink verification approach by comparing the different methods under the assumption that the local eye region has been manually extracted in advance. The performance comparison among the applicable approaches is listed in Table V. We can see that:
- With the impact of eye localization removed, the proposed MS-LSTM based eyeblink verification approach still outperforms the other methods on F1 score by large margins (at least 0.1768), on both the left and right eye. This demonstrates the superiority of our proposition over the other approaches;
- Even when the local eye region has been manually extracted in advance, the performance of the involved approaches is still not promising enough. In particular, the highest F1 score is only 0.8046. This verifies that eyeblink detection can be regarded as a challenging fine-grained spatial-temporal visual pattern recognition problem, as also revealed previously in Fig. 13;
- Our approach is inferior to Chau’s method on Precision. Nevertheless, the Recall and F1 score of Chau’s method are much lower than ours.
VI-C Performance comparison among the different eye localization methods
Eye localization is a vital step in most eyeblink detection methods and strongly affects the final performance. Since the existing eyeblink detection approaches generally suffer from a high failure rate (FR) on eye localization, as revealed in Table IV, we compare our eye localization approach with the others (i.e., Intraface, OpencvFace+TM, OpencvFace+KLT, Skin and Yuzhi) mainly according to FR. The criterion for judging whether the eye has been localized correctly is the same as in Sec. VI-A, according to the threshold in Eqn. 11. The experiments are executed on all the sample frames within the HUST-LEBW dataset. The performance comparison among the different approaches is shown in Fig. 16. For compact comparison, the average FR of the left and right eye is reported. Our eye localization approach, which uses the SeetaFace face parsing engine and the KCF tracker, is consistently and remarkably better than the other approaches across the different thresholds.
On the other hand, within our approach the SeetaFace face parsing engine plays the essential role of localizing the eye center initially, before tracking. To verify its suitability in isolation, we compare it with 2 other state-of-the-art face parsing approaches (i.e., Intraface and MTCNN) in terms of both effectiveness and efficiency. The performance comparison on effectiveness among the 3 face parsing methods is shown in Fig. 17. In most cases SeetaFace is better than Intraface but inferior to MTCNN. Nevertheless, for real-time eyeblink detection, running efficiency should also be taken into consideration. We compare the average time consumption of these 3 approaches in Table VI. SeetaFace has the highest running efficiency (i.e., 26.13 ms per frame), and it is about one order of magnitude faster than MTCNN. Considering the tradeoff between effectiveness and efficiency for real-time application, we choose SeetaFace as our initial eye localizer.
VI-D Real-time online running capacity verification
In this subsection, we verify that our proposed eyeblink detection method can run online in real time on a normal laptop with an Intel(R) Core(TM) i7-7700HQ CPU @ 2.8GHz (using only one core). The average online running time per frame of the main procedures within our method is listed in Table VII. Most of the time is consumed by the SeetaFace engine for initial eye localization, with 33.20 ms. However, it is executed only on the first frame of an eyeblink sample. The procedures of eye tracking, eyeblink feature extraction, and eyeblink verification are extremely fast, taking only 7.87 ms in total. In summary, the initial eye localization procedure can run at over 29 FPS, and once in the eye tracking phase the proposed eyeblink detection method can run at over 127 FPS. Overall, our approach meets the real-time running requirement (i.e., over 25 FPS).
|Procedure||Time per frame (ms)|
|Initial eye localization (SeetaFace)||33.20|
|Eye tracking (KCF)||6.06|
|Eyeblink feature extraction (uniform LBP)||0.32|
|Eyeblink verification (MS-LSTM)||1.49|
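The frame rates quoted in this subsection follow directly from the per-frame times in Table VII. The short Python check below reproduces the arithmetic; the dictionary keys and the helper `fps` are illustrative names only.

```python
# Per-frame time (ms) of each procedure, as listed in Table VII.
TIME_MS = {
    "initial_eye_localization": 33.20,  # SeetaFace, first frame only
    "eye_tracking": 6.06,               # KCF
    "feature_extraction": 0.32,         # uniform LBP
    "eyeblink_verification": 1.49,      # MS-LSTM
}


def fps(ms_per_frame):
    """Convert a per-frame time in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


# Every procedure except the one-off initial localization runs on each frame.
tracking_ms = sum(
    t for name, t in TIME_MS.items() if name != "initial_eye_localization"
)
print(f"tracking phase: {tracking_ms:.2f} ms per frame, {fps(tracking_ms):.1f} FPS")
print(f"initial localization: {fps(TIME_MS['initial_eye_localization']):.1f} FPS")
```

Both rates stay above the 25 FPS real-time requirement stated above.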
VI-E Ablation study 1: MS-LSTM
We propose MS-LSTM to address the problem of eyeblink verification. From the network structure perspective, it makes 2 main modifications to the original LSTM model to alleviate the multiple temporal scale problem within eyeblink. One is to stack multiple LSTM layers, and the other is to involve multiple temporal scale features. Here, we verify the effectiveness of the 2 modifications respectively. The experiments are executed under the assumption that the local eye region has been manually extracted in advance, the same as in Sec. VI-B.
|Eye idx||Layer number||Recall||Precision||F1 score|
Stack multiple LSTM layers. The number of stacked LSTM layers is set from 1 to 4. The performance comparison among them is listed in Table VIII. It can be seen that:
- Compared to the original LSTM model with only 1 layer, increasing the layer number consistently improves the F1 score in all the test cases, although not every individual criterion benefits. Overall, stacking multiple LSTM layers is an effective way to enhance the eyeblink verification result;
- Setting the layer number to 2 achieves the best average F1 score. Accordingly, the layer number within the proposed MS-LSTM model is empirically set to 2 for eyeblink verification.
|Eye idx||Scale number||Recall||Precision||F1 score|
Multiple temporal scale feature. The temporal scale number is set from 1 to 5. The performance comparison among them is listed in Table IX. We can see that:
- Involving multiple temporal scale features improves the performance of eyeblink verification, especially in terms of average Recall, Precision and F1 score. This demonstrates the effectiveness of extracting multiple temporal scale features for eyeblink characterization within the MS-LSTM model;
- Setting the temporal scale number to 2 achieves the best average F1 score. Accordingly, the temporal scale number of the proposed MS-LSTM model is empirically set to 2 for eyeblink verification.
VI-F Ablation study 2: A-softmax loss function
As revealed in Fig. 13, eyeblink verification can be regarded as a fine-grained binary spatial-temporal pattern recognition problem. To enlarge the classification margin between the eyeblink and non-eyeblink classes, the A-softmax loss function is used within the MS-LSTM model. To verify its superiority, we compare it with the original softmax loss function. The experiments are executed under the assumption that the local eye region has been manually extracted in advance, the same as in Sec. VI-B. The performance comparison between these 2 loss functions is listed in Table X. The A-softmax loss function consistently outperforms the original softmax loss function in all test cases, especially on the average F1 score. This demonstrates the effectiveness of applying the A-softmax loss function to eyeblink verification.
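To make the margin idea concrete, the snippet below is a minimal pure-Python sketch of the A-softmax loss for a single sample, following the general formulation of the SphereFace work cited in the references (normalized class weights, no bias, and the target angle $\theta$ replaced by the monotonic function $\psi(\theta) = (-1)^k \cos(m\theta) - 2k$). The margin value `m=4`, the function names and the toy vectors are assumptions for illustration; the setting actually used with MS-LSTM is not stated in this section.

```python
import math


def psi(theta, m):
    """Monotonic angular function of A-softmax for theta in [0, pi]."""
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k


def a_softmax_loss(x, weights, label, m=4):
    """A-softmax loss for one feature vector.

    x: feature vector (list of floats)
    weights: one weight vector per class; each is L2-normalized here,
             and no bias term is used
    label: index of the ground-truth class
    m: integer angular margin (m=1 reduces to softmax on normalized weights)
    """
    x_norm = math.sqrt(sum(v * v for v in x))
    logits = []
    for j, w in enumerate(weights):
        w_norm = math.sqrt(sum(v * v for v in w))
        cos_t = sum(a * b for a, b in zip(x, w)) / (x_norm * w_norm)
        cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding error
        if j == label:
            logits.append(x_norm * psi(math.acos(cos_t), m))
        else:
            logits.append(x_norm * cos_t)
    top = max(logits)  # subtract the maximum for numerical stability
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[label]


# Toy binary problem: class 0 = eyeblink, class 1 = non-eyeblink.
x = [1.0, 0.5]
weights = [[1.0, 0.0], [0.0, 1.0]]
print(a_softmax_loss(x, weights, label=0, m=1))
print(a_softmax_loss(x, weights, label=0, m=4))
```

Because $\psi(\theta) \le \cos\theta$, the loss with `m=4` is never smaller than with `m=1` for the same sample, so the network has to push the feature closer to its class direction to reach the same loss.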
VI-G Ablation study 3: low-level eyeblink feature extraction
To effectively characterize eyeblink, we propose to extract low-level appearance and motion features simultaneously, using uniform LBP, as the input of MS-LSTM. To justify our low-level eyeblink feature extraction method, we conduct experiments from 2 aspects. First, uniform LBP is compared with 2 other well-established visual descriptors (i.e., HOG and Haar). Second, the effectiveness of extracting appearance and motion features simultaneously for eyeblink characterization is verified. The experiments are executed under the assumption that the local eye region has been manually extracted in advance, the same as in Sec. VI-B. The comprehensive performance comparison is listed in Table XI. It can be observed that:
- Among the 3 visual descriptors under test, uniform LBP achieves the best average F1 score. Its average Recall and Precision are also comparable to the best ones. Overall, uniform LBP is the best choice among them for eyeblink detection;
- For all the 3 visual descriptors, extracting appearance and motion features simultaneously improves the performance in most cases, compared to using only one type of feature.
In addition, the running time comparison of the 3 visual descriptors is listed in Table XII. Uniform LBP has the fastest running speed.
The experimental results above indeed demonstrate the effectiveness of our proposed low-level eyeblink feature extraction approach.
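The idea of pairing an appearance feature with a motion feature can be sketched as follows. This is a hedged illustration rather than the paper's implementation: it assumes the 8-neighbor, radius-1 uniform LBP with a 59-bin histogram over the whole eye patch, and it assumes the motion feature is the difference between the histograms of adjacent frames. The function names and the toy patches are hypothetical.

```python
def _transitions(code):
    """Number of 0/1 transitions in the circular 8-bit pattern."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))


# The 58 uniform patterns get their own bin; all others share the last bin.
_UNIFORM = [c for c in range(256) if _transitions(c) <= 2]
_BIN = {c: i for i, c in enumerate(_UNIFORM)}
N_BINS = len(_UNIFORM) + 1  # 59

# 8 neighbors at radius 1, in circular order.
_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def uniform_lbp_histogram(patch):
    """Normalized 59-bin uniform LBP histogram of a grayscale patch.

    patch: list of rows of pixel intensities, at least 3x3 in size.
    """
    h, w = len(patch), len(patch[0])
    hist = [0.0] * N_BINS
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            code = 0
            for i, (dr, dc) in enumerate(_OFFSETS):
                if patch[r + dr][c + dc] >= patch[r][c]:
                    code |= 1 << i
            hist[_BIN.get(code, N_BINS - 1)] += 1.0
    total = sum(hist)
    return [v / total for v in hist]


def eyeblink_features(patches):
    """Per-frame feature: appearance histogram plus its change since the
    previous frame (assumed motion cue; zeros for the first frame)."""
    feats, prev = [], None
    for patch in patches:
        app = uniform_lbp_histogram(patch)
        if prev is None:
            motion = [0.0] * N_BINS
        else:
            motion = [a - b for a, b in zip(app, prev)]
        feats.append(app + motion)
        prev = app
    return feats


# Two toy 4x4 eye patches (hypothetical intensities).
open_eye = [[9, 9, 9, 9], [9, 1, 1, 9], [9, 1, 1, 9], [9, 9, 9, 9]]
closed_eye = [[9, 9, 9, 9], [5, 5, 5, 5], [5, 5, 5, 5], [9, 9, 9, 9]]
features = eyeblink_features([open_eye, closed_eye])
print(len(features), len(features[0]))
```

Each frame then yields one fixed-length vector, so a clip becomes a sequence that a recurrent model can consume frame by frame.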
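|Eye idx||Loss function||Recall||Precision||F1 score|
|Descriptor||Mechanism||Left eye||Right eye||Average|
|||||Recall||Precision||F1 score||Recall||Precision||F1 score||Recall||Precision||F1 score|
|Uniform LBP ||App.||0.7398||0.7459||0.7429||0.7857||0.7444||0.7645||0.7628||0.7452||0.7537|
|Descriptor||Time (ms)|
|Uniform LBP ||0.322|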
VI-H Failure cases of eyeblink detection in the wild
From Sec. VI-A to Sec. VI-G, quantitative performance evaluation is carried out on our proposed eyeblink detection approach to demonstrate its effectiveness and superiority. Here, we conduct a qualitative analysis to show the limitations of our approach in the in-the-wild application scenario. Representative failure cases are given in Fig. 18 from different perspectives, aiming to reveal some insights into eyeblink detection in the wild and to indicate future research avenues. We can see that accurate face detection, eye localization and eye tracking remain challenging visual tasks under unconstrained “in the wild” conditions, although numerous efforts have already been made. The challenges derive from the dramatic variation in human attribute, human pose, illumination, and scene conditions. As shown in Fig. 18(c), fast human movement is also a critical issue that impairs eye tracking. Meanwhile, makeup on the eye may confuse the classifier during eyeblink verification, as shown in Fig. 18(d). What is more challenging is that within some eyeblink samples the eyes are not fully closed, as shown in Fig. 18(e), which may be caused by the relatively low frame rate of the camera. These cases require us to extract more discriminative spatial-temporal features for eyeblink characterization.
In this work, we shed light on the research field of eyeblink detection in the wild, which has not been well studied before, and make several practical and theoretical contributions. First, a labelled dataset for eyeblink detection in the wild (HUST-LEBW) is built, and it will be released online upon acceptance. Secondly, the MS-LSTM model is proposed to address the fine-grained spatial-temporal pattern recognition problem within eyeblink detection. Thirdly, an effective and efficient eyeblink feature extraction approach that captures appearance and motion information simultaneously is proposed. Meanwhile, our eyeblink detection method can run in real time on a normal laptop without using GPU or parallel computing. The extensive experiments verify the challenges of eyeblink detection in the wild, and demonstrate the superiority of the proposed approach.
However, the performance of our method is still not satisfactory and remains far from practical application. In future work, we intend to resort to deep learning technology (e.g., CNN) to facilitate eye localization and eyeblink verification. Additionally, we plan to extend the HUST-LEBW dataset to meet the data-hungry requirement of deep learning.
This work is jointly supported by the National Key R&D Program of China (No. 2018YFB1004600), National Natural Science Foundation of China (Grant No. 61876211, 61702182, and 61602193), the International Science & Technology Cooperation Program of Hubei Province, China (Grant No. 2017AHB051), the HUST Interdisciplinary Innovation Team Foundation (Grant No. 2016JCTD120), Hunan Provincial Natural Science Foundation of China (Grant 2018JJ3254). Joey Tianyi Zhou is supported by Programmatic Grant No. A1687b0033 from the Singapore government’s Research, Innovation and Enterprise 2020 plan (Advanced Manufacturing and Engineering domain).
-  B. S. Perelman, “Detecting deception via eyeblink frequency modulation,” Peerj, vol. 2, no. 2, p. e260, 2014.
-  L. M. Bergasa, J. Nuevo, M. A. Sotelo, R. Barea, and M. E. Lopez, “Real-time system for monitoring driver vigilance,” IEEE Trans. on Intelligent Transportation Systems, vol. 7, no. 1, pp. 63–77, 2006.
-  G. Pan, L. Sun, Z. Wu, and S. Lao, “Eyeblink-based anti-spoofing in face recognition from a generic webcamera,” in Proc. IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
-  M. Rosenfield, “Computer vision syndrome: a review of ocular causes and potential treatments.” Ophthalmic & Physiological Optics, vol. 31, no. 5, pp. 502–515, 2011.
-  Q. Ji and X. Yang, “Real-time eye, gaze, and face pose tracking for monitoring driver vigilance,” Real-Time Imaging, vol. 8, no. 5, pp. 357–377, 2002.
-  J. D. Wu and T. R. Chen, “Development of a drowsiness warning system based on the fuzzy logic images analysis,” Expert Systems with Applications, vol. 34, no. 2, pp. 1556–1561, 2008.
-  W. Dong, P. Qu, and J. Han, “Driver fatigue detection based on fuzzy fusion,” in Proc. Chinese Control and Decision Conference (CCDC), 2008, pp. 2640–2643.
-  P. R. Tabrizi and R. A. Zoroofi, “Open/closed eye analysis for drowsiness detection,” in Proc. Image Processing Theory, Tools and Applications Workshop (IPTAW). IEEE, 2008, pp. 1–7.
-  A. Królak and P. Strumiłło, “Eye-blink detection system for human–computer interaction,” Universal Access in the Information Society, vol. 11, no. 4, pp. 409–419, 2012.
-  M. Chau and M. Betke, “Real time eye tracking and blink detection with usb cameras,” Cas Computer Science Technical Reports, 2005.
-  T. Hong and H. Qin, “Drivers drowsiness detection in embedded system,” in Proc. IEEE International Conference on Vehicular Electronics and Safety (ICVES), 2008, pp. 1–5.
-  T. Morris, P. Blenkhorn, and F. Zaidi, “Blink detection for real-time eye tracking,” Journal of Network & Computer Applications, vol. 25, no. 2, pp. 129–143, 2002.
-  T. Drutarovsky and A. Fogelton, “Eye blink detection using variance of motion vectors,” in Proc. European Conference on Computer Vision Workshop (ECCVW). Springer, 2014, pp. 436–448.
-  “Talking face video,” http://www-prima.inrialpes.fr/FGnet/data/01-TalkingFace/talking_face.html, Face&Gesture Recognition Working Group, IST-2000-26434.
-  K. Radlak, M. Bozek, and B. Smolka, “Silesian deception database: Presentation and analysis,” in Proc. ACM Multimodal Deception Detection Workshop (MDDW), 2015, pp. 29–35.
-  S. Wu, M. Kan, Z. He, S. Shan, and X. Chen, “Funnel-structured cascade for multi-view face detection with alignment-awareness,” Neurocomputing, vol. 221, no. C, pp. 138–145, 2017.
-  J. F. Henriques, C. Rui, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Trans. on Pattern Analysis & Machine Intelligence, vol. 37, no. 3, pp. 583–596, 2014.
-  T. Ahonen, A. Hadid, and M. Pietikainen, “Face description with local binary patterns: Application to face recognition,” IEEE Trans. on Pattern Analysis & Machine Intelligence, no. 12, pp. 2037–2041, 2006.
-  W. O. Lee, E. C. Lee, and R. P. Kang, “Blink detection robust to various facial poses,” Journal of Neuroscience Methods, vol. 193, no. 2, p. 356, 2010.
-  D. Torricelli, M. Goffredo, S. Conforto, and M. Schmid, “An adaptive blink detector to initialize and update a view-based remote eye gaze tracking system in a natural scenario,” Pattern Recognition Letters, vol. 30, no. 12, pp. 1144–1150, 2009.
-  T. Soukupová and J. Cech, “Real-time eye blink detection using facial landmarks,” in Proc. Computer Vision Winter Workshop (CVWW), 2016.
-  R. Sun and Z. Ma, “Robust and efficient eye location and its state detection,” in Proc. International Symposium on Advances in Computation and Intelligence (ISACI), 2009, pp. 318–326.
-  Z. Liu and H. Ai, “Automatic eye state recognition and closed-eye photo correction,” in Proc. International Conference on Pattern Recognition (ICPR), 2012, pp. 1–4.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886–893.
-  H. Tan and Y. J. Zhang, “Detecting eye blink states by tracking iris and eyelids,” Pattern Recognition Letters, vol. 27, no. 6, pp. 667–675, 2006.
-  G. Bradski and A. Kaehler, “Opencv,” Dr. Dobb’s Journal of Software Tools, vol. 3, 2000.
-  X. Yin and X. Liu, “Multi-task convolutional neural network for face recognition,” CoRR, vol. abs/1702.04710, 2017. [Online]. Available: http://arxiv.org/abs/1702.04710
-  K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
-  H. Cao, K. Zhou, X. Chen, and X. Zhang, “Early chatter detection in end milling based on multi-feature fusion and 3σ criterion,” The International Journal of Advanced Manufacturing Technology, vol. 92, no. 9-12, pp. 4387–4397, 2017.
-  S. M. Stigler, The history of statistics: The measurement of uncertainty before 1900. Harvard University Press, 1986.
-  Ö. Oguz, “The proportion of the face in younger adults using the thumb rule of leonardo da vinci,” Surgical and Radiologic Anatomy, vol. 18, no. 2, pp. 111–114, 1996.
-  L. V. Der Maaten and G. E. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  J. Liu, G. Wang, P. Hu, L.-Y. Duan, and A. C. Kot, “Global context-aware attention lstm networks for 3d action recognition,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 7, 2017, p. 43.
-  J. J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proc. of the National Academy of Sciences (NAS), vol. 79, no. 8, pp. 2554–2558, 1982.
-  M. Hermans and B. Schrauwen, “Training and analysing deep recurrent neural networks,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2013, pp. 190–198.
-  W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie, “Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks,” Proc. National Conference on Artificial Intelligence (AAAI), pp. 3697–3703, 2016.
-  S. Zhang, X. Liu, and J. Xiao, “On geometric features for skeleton-based action recognition using multilayer lstm networks,” in Proc. IEEE Winter Conference on Applications of Computer Vision (WACV), 2017, pp. 148–157.
-  W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2017, p. 1.
-  K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2014, pp. 568–576.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
-  T. Brox and J. Malik, “Large displacement optical flow: descriptor matching in variational motion estimation,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 33, no. 3, pp. 500–513, 2011.
-  T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: a system for large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  Z. Tian and H. Qin, “Real-time driver’s eye state detection,” in Proc. IEEE International Conference on Vehicular Electronics and Safety (ICVES), 2005, pp. 285–289.