EEG-Based Driver Drowsiness Estimation Using Feature Weighted Episodic Training

09/25/2019 ∙ by Yuqi Cui, et al. ∙ Huazhong University of Science & Technology

Drowsy driving is pervasive, and also a major cause of traffic accidents. Estimating a driver's drowsiness level by monitoring the electroencephalogram (EEG) signal and taking preventative actions accordingly may improve driving safety. However, individual differences among different drivers make this task very challenging. A calibration session is usually required to collect some subject-specific data and tune the model parameters before applying it to a new subject, which is very inconvenient and not user-friendly. Many approaches have been proposed to reduce the calibration effort, but few can completely eliminate it. This paper proposes a novel approach, feature weighted episodic training (FWET), to completely eliminate the calibration requirement. It integrates two techniques: feature weighting to learn the importance of different features, and episodic training for domain generalization. Experiments on EEG-based driver drowsiness estimation demonstrated that both feature weighting and episodic training are effective, and their integration can further improve the generalization performance. FWET does not need any labelled or unlabelled calibration data from the new subject, and hence could be very useful in plug-and-play brain-computer interfaces.




I Introduction

Driving safety is very important to our everyday life. However, according to the World Health Organization's “Global Status Report on Road Safety 2018”, “the number of road traffic deaths continues to rise steadily, reaching 1.35 million in 2016. … Road traffic injuries are the eighth leading cause of death for all age groups. More people now die as a result of road traffic injuries than from HIV/AIDS, tuberculosis or diarrhoeal diseases. Road traffic injuries are currently the leading cause of death for children and young adults aged 5–29 years.”

In addition to the reliability of the vehicle and the driver’s experience, driving safety is also strongly related to the driver’s alertness (or, drowsiness). Drowsy driving is the fourth major contributor to road crashes, following only alcohol, speeding, and inattention [1]. Drowsiness impairs the driver’s ability to respond quickly and appropriately to road emergencies, and hence may lead to accidents [2]. Therefore, accurate estimation of the driver’s drowsiness level is very important in preventing road accidents.

Many approaches have been reported [3, 4, 5, 6, 7], which can be roughly categorized into two directions: contactless detection and wearable-sensor-based detection. The former uses cameras and/or other sensors, which are not attached to the driver’s body, to monitor the driver’s facial activities and/or driving patterns to estimate the drowsiness level [8, 9, 6]. The latter uses wearable sensors to measure the driver’s physiological signals, e.g., electroencephalogram (EEG) [10], electrocardiography (ECG) [11, 10], electromyography (EMG) [12, 13], etc., and then performs drowsiness estimation. The heart rate and heart rate variability can be easily obtained from ECG signals. They both vary significantly between alertness and drowsiness, and hence can be indicators of drowsiness [14, 15]. EMG signals are usually combined with other signals to determine the drowsiness level. For instance, Lee et al. [16] proposed a driver fatigue detection approach using EMG and galvanic skin responses. Fu et al. [17] proposed to use EEG, EMG and respiration signals to dynamically detect driver fatigue. In this paper, we focus on using only EEG signals for driver drowsiness estimation.

Since EEG directly measures the brain states, it is very suitable for human psychophysiological state evaluation [18]. The power spectrum of EEG has been used to estimate driver drowsiness level [19, 20, 21, 22], especially the theta (4-7Hz) and alpha (8-12Hz) bands [23, 24, 18]. Additionally, different brain regions have different abilities in assessing the driver’s drowsiness level. Previous studies have shown that theta and alpha band activities in the central and occipital regions are more correlated to fatigue [25, 26, 27]. These results indicate that it may be beneficial to give different brain regions different weights in drowsiness estimation.

A major challenge in EEG-based driver drowsiness estimation is that, due to individual differences, it is very difficult to develop a generic estimator whose parameters are fixed and optimal for all subjects. Hence, a subject-specific calibration session is usually required to tune the estimator, which is time-consuming and not user-friendly. Much effort has been made to reduce or eliminate this calibration. One of the most frequently used approaches is transfer learning [28, 29], which uses data from other subjects/sessions (called source domains) to facilitate the learning for a new subject (called target domain). For instance, Lin and Jung [30] proposed a conditional transfer learning framework to promote positive transfer for each individual. It first assesses an individual’s transferability for positive transfer, and then selectively leverages the data from others with comparable feature spaces. This approach has demonstrated promising performance in EEG-based emotion classification. Zanini et al. [31] proposed a Riemannian space transfer learning framework, which uses a reference covariance matrix at the resting state to align data from different domains, before applying a Riemannian space classifier. He and Wu [32] proposed a similar EEG data alignment approach in the Euclidean space, which is more efficient than the Riemannian space data alignment approach, and can be used as a pre-processing step before any Euclidean space classifier. However, all these approaches considered only classification problems, and all required some labeled or unlabeled data from the target subject for calibration. So, they cannot be used in true plug-and-play brain-computer interfaces.

This paper considers a much more realistic, and also more challenging, scenario: there are no calibration data (either labeled or unlabeled) from the target subject at all; we want to build a model from the auxiliary subjects and apply it directly to the target subject. Each auxiliary subject can be viewed as an independent source domain, and this problem setting is called domain generalization in computer vision.

Many neural network based approaches have been proposed in recent years for domain generalization [33, 34, 35, 36, 37, 38], which can be summarized into two categories:

  1. Train a robust cross-domain model using a specially designed neural network architecture to reduce the domain shift. For instance, Ghifary et al. [33] proposed a multi-task auto-encoder, which learns to transform the image in one domain into analogs in multiple related domains. These features, which are robust to variations across domains, are then fed into a classifier. Li et al. [35] proposed a low-rank parameterized convolutional neural network to compensate for the domain shift. Li et al. [34] used adversarial auto-encoders to align the distributions among different domains by minimizing the maximum mean discrepancy (MMD), and matched the aligned distribution to an arbitrary prior distribution via adversarial feature learning. The first step ensures the learned feature representation is universal to the known source domains, and the second step ensures the features can generalize well to the unseen target domain.

  2. Train models with a regularization or meta-learning scheme, regardless of the model structure. Balaji et al. [38] proposed a meta-regularization approach for domain generalization, which encodes domain generalization using a novel regularization function that makes a model trained in one domain perform well in another domain. The regularization function was found in a learning-to-learn (or meta-learning) framework. Li et al. [37] proposed a model-agnostic training procedure for domain generalization. Their algorithm simulated the shift between source and target domains during training by synthesizing virtual target domains within each mini-batch. The meta-optimization objective ensures performance improvements in both domains. Li et al. [36] further proposed an episodic training (ET) procedure that trains a single deep network while exposing it to the domain shift that characterises a novel domain at runtime. Specifically, it decomposes a deep network into two components, a feature extractor and a classifier, and then trains each component by simulating its interaction with a partner that may not be well tuned for the current domain.

This paper extends ET from classification to regression, and applies it to EEG-based driver drowsiness estimation. Our main contributions are:

  1. We propose a feature weighting (FW) scheme that automatically assigns each feature a weight, by taking different importance of different brain regions into consideration.

  2. We extend ET in [36] from classification to regression, and simplify it so that the computational cost is reduced without sacrificing the generalization performance.

  3. We integrate FW and ET into a single learning framework, feature weighted episodic training (FWET), to achieve better generalization performance than each individual module.

The remainder of this paper is organized as follows: Section II introduces our dataset, feature extraction method, and the proposed FWET approach. Section III evaluates the performance of FWET in EEG-based driver drowsiness estimation. Section IV draws conclusions.

II Feature Weighted Episodic Training (FWET)

This section introduces the dataset for EEG-based driver drowsiness estimation, and our proposed FWET approach, whose overall flowchart is shown in Fig. 1, along with several other variants.

Fig. 1: AGG, FW-AGG, ET and FWET for EEG-based driver drowsiness estimation.

II-A Dataset

The data were collected in a simulated driving experiment, which was identical to that used in [21, 19, 39, 40]. Sixteen healthy subjects (age 24.2 ± 3.7, ten males, six females) with normal or corrected-to-normal vision were recruited to participate in a sustained-attention driving experiment [39, 40], which consisted of a real vehicle mounted on a motion platform with six degrees of freedom immersed in a 360-degree virtual reality scene. The experiment simulated driving on an empty highway at 100 km/h. Every 5-10 seconds, a random lane-departure event was activated, which caused the car to drift from the center of the lane. The participants were asked to steer the car back to the center of the lane immediately. The reaction time was calculated as the time difference between the drift and the moment the subject started to act, as shown in Fig. 2. If the participant did not respond to the lane-departure event, e.g., because of falling asleep, the vehicle would hit the boundary of the road and continue moving forward along the boundary. The next lane-departure event happened after the response offset. Each participant performed the experiment for 60-90 minutes in the afternoon when the circadian rhythm of sleepiness reached its peak [41].

The Institutional Review Board of the Taipei Veterans General Hospital approved the experimental protocol. Each participant read and signed an informed consent form before the experiment began.

Fig. 2: Illustration of the way the reaction time was computed.

The reaction time $\tau$ was later converted into a drowsiness index (DI) [21, 22, 42, 43, 44],

$$\mathrm{DI} = \max\left\{0,\ \frac{1 - e^{-(\tau - \tau_0)}}{1 + e^{-(\tau - \tau_0)}}\right\}, \tag{1}$$

where $\tau_0$ was set to 1 in our work. The DIs were then smoothed by a 90s moving-average window. (1) maps the reaction time to $[0, 1]$ and overcomes its long-tail effect (very large reaction times were rare, but they did exist; such extreme values would significantly deteriorate the overall estimation). The fatigue level has been demonstrated to have a strong correlation with the reaction time [45]. Since the DI is positively correlated with the reaction time, DI is also an indicator of the fatigue level.

Note that the value of $\tau_0$ could also be set individually for each subject. For instance, in [42], $\tau_0$ was set to the 5th percentile value of the reaction time in each session. However, in a real-world online plug-and-play brain-computer-interface system, we do not have training data from the target subject, thus setting $\tau_0$ individually is not possible. Nevertheless, to demonstrate the robustness of our proposed approach, we also compare the performances using $\tau_0 = 1$ and individualized $\tau_0$ in Section III-F, which is possible in offline driver drowsiness estimation.
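The conversion from reaction time to DI, followed by smoothing, can be sketched as follows. This is a minimal illustration, assuming the DI takes the bounded form $\max\{0, (1 - e^{-(\tau-\tau_0)})/(1 + e^{-(\tau-\tau_0)})\}$ with $\tau_0 = 1$, and assuming a trailing (causal) 90 s window; the function names, the window alignment, and the reaction times are hypothetical.

```python
import math

def drowsiness_index(reaction_time, tau0=1.0):
    """Map a reaction time in seconds to a DI in [0, 1) (assumed form of (1))."""
    z = math.exp(-(reaction_time - tau0))
    return max(0.0, (1.0 - z) / (1.0 + z))

def smooth(times, values, window=90.0):
    """Trailing moving average: for each time stamp t, average all values
    whose time stamp lies in (t - window, t]. `times` must be sorted."""
    out, start, running = [], 0, 0.0
    for k, t in enumerate(times):
        running += values[k]
        while times[start] <= t - window:   # drop samples that left the window
            running -= values[start]
            start += 1
        out.append(running / (k - start + 1))
    return out

rts = [0.6, 1.0, 1.5, 3.0, 8.0]               # hypothetical reaction times (s)
dis = [drowsiness_index(rt) for rt in rts]    # reaction times <= tau0 map to 0
print(dis)
```

Even the 8 s reaction time stays below 1, which is the long-tail compression described above.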

During the experiment, EEG signals were recorded using a 500Hz 32-channel Neuroscan system (30-channel EEGs plus 2-channel earlobes). Since data from one subject were not recorded correctly, we only used 15 subjects in our paper. To ensure a fair comparison, we used the first 3,600 seconds of data from each subject.

II-B Preprocessing and Feature Extraction

We used EEGLAB [46] for data preprocessing. We first performed 1-50Hz band-pass filtering to remove artifacts and noise, and then down-sampled the data from 500Hz to 250Hz and re-referenced them to the averaged earlobes.

We tried to predict the DI for each subject every 3 seconds, using the 30-second EEG signal before each sample point. We computed the average power spectral density (PSD; absolute rather than relative values were used) in the theta and alpha bands using Welch’s method [47], with a Hamming window, a 1024-point fast Fourier transform, and 50% overlap. The PSDs were then converted into dBs and used as our features. Each feature vector had $30 \times 2 = 60$ dimensions. All algorithms in our experiments used the same PSD features described above.
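The windowing arithmetic behind these features can be sketched as follows. The sketch only computes sample indices and counts; it does not compute a PSD. It assumes the first sample point falls at 30 s and that the Welch segment length equals the 1024-point FFT length, neither of which is stated in the text. All names are hypothetical.

```python
FS = 250                      # sampling rate after down-sampling (Hz)
WINDOW_S = 30                 # EEG window length before each sample point (s)
STEP_S = 3                    # spacing between consecutive sample points (s)
SESSION_S = 3600              # seconds of data used per subject
N_CHANNELS, N_BANDS = 30, 2   # 30 EEG channels; theta and alpha bands

def window_bounds(session_s=SESSION_S, window_s=WINDOW_S, step_s=STEP_S, fs=FS):
    """(start, stop) sample indices of the window ending at each sample point.
    Assumes the first sample point is at t = window_s."""
    bounds = []
    t = window_s
    while t <= session_s:
        bounds.append(((t - window_s) * fs, t * fs))
        t += step_s
    return bounds

def welch_segments(n_samples, seg_len=1024, overlap=0.5):
    """Number of full Welch segments, assuming segment length == FFT length."""
    hop = int(seg_len * (1 - overlap))
    return 0 if n_samples < seg_len else (n_samples - seg_len) // hop + 1

bounds = window_bounds()
print(len(bounds))                      # sample points per subject
print(welch_segments(WINDOW_S * FS))    # segments averaged per 30 s window
print(N_CHANNELS * N_BANDS)             # feature dimension
```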

Each 30-second EEG signal may include brain activities, e.g., visual stimulus of the lane departure event and the wheel steering intention, and interferences from the wheel steering motor execution and other body movements. These brain activities and interferences are inevitably happening in real-world driving scenarios, and a good drowsiness estimation algorithm should be able to cope with them. Moreover, there are some other activities that are normal in realistic driving situations but were not considered in our experiments, e.g., the motor executions of acceleration and braking, talking, etc. These should be considered in the future improved experiment design.

II-C Problem Setting

Assume Subject $s$ has $N_s$ labeled EEG trials $\{(\mathbf{x}_n^s, y_n^s)\}_{n=1}^{N_s}$, where $\mathbf{x}_n^s \in \mathbb{R}^{d}$ is a $d$-dimension feature vector extracted from the $n$-th EEG trial of Subject $s$, and $y_n^s$ is the corresponding DI. Assume also that we have $S$ subjects in our training set, and we want to predict the DI for $\mathbf{x}^t$ from an unseen target subject $t$.

Our model contains two components: $f_\theta$, the feature transformation network, and $g_\phi$, the regression network. Hence the prediction for $\mathbf{x}$ is $g_\phi(f_\theta(\mathbf{x}))$.

II-D Aggregation Training (AGG)

The simplest domain generalization approach is to combine all source subjects’ data to train one single model, which is usually a very strong baseline. This method is called aggregation training (AGG) in [36].

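In this paper, we perform AGG using a multi-layer perceptron (MLP) neural network with one hidden layer and ReLU activation function. The loss function is:

$$\mathcal{L}_{agg} = \sum_{s=1}^{S} \sum_{n=1}^{N_s} \ell\big(g_\phi(f_\theta(\mathbf{x}_n^s)),\, y_n^s\big), \tag{2}$$

where $\ell$ is the squared error in regression. The parameters $\theta$ and $\phi$ are learned through gradient descent optimization.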
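A minimal sketch of the AGG model and its pooled loss is given below. It assumes that the feature transformation is the input-to-hidden layer with ReLU and that the regressor is the hidden-to-output layer; the weights and data are toy values chosen so the result can be checked by hand, and all names are hypothetical.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, x)) + bi for row, bi in zip(W, b)]

def f(theta, x):
    """Feature transformation: first layer followed by ReLU."""
    W1, b1 = theta
    return relu(linear(W1, b1, x))

def g(phi, h):
    """Regression network: maps hidden features to a scalar DI estimate."""
    W2, b2 = phi
    return linear(W2, b2, h)[0]

def agg_loss(theta, phi, subjects):
    """Sum of squared errors over the pooled data of all source subjects."""
    return sum((g(phi, f(theta, x)) - y) ** 2
               for data in subjects for x, y in data)

# Toy check with d = 2 inputs and 2 hidden units (hypothetical numbers).
theta = ([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0])
phi = ([[0.5, 2.0]], [0.1])
subjects = [[([1.0, 2.0], 0.6)],      # Subject 1: one (features, DI) pair
            [([2.0, -1.0], 3.0)]]     # Subject 2: one (features, DI) pair
loss = agg_loss(theta, phi, subjects)
print(loss)
```

AGG ignores which subject a sample came from: the double loop simply flattens all source subjects into one training set.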

II-E Feature Weighting (FW)

Previous studies [25, 26, 27] have shown that EEG features (channel-wise PSD features in this paper) in different brain regions have different correlations to the drowsiness. Thus, we use the following FW scheme to assign different weights to different EEG channels:

$$\mathbf{w}' = \mathrm{softmax}(\mathbf{w}), \tag{3}$$

$$\mathbf{x}' = \mathbf{w}' \odot \mathbf{x}, \tag{4}$$

where $\mathbf{w}$ and $\mathbf{w}'$ are the original and transformed weight vectors, respectively, and $\odot$ denotes element-wise product. We do not use the weight $\mathbf{w}$ directly in (4); instead, we use its softmax version $\mathbf{w}'$, to make sure the weights are non-negative and sum up to 1.
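The FW step can be illustrated with a few lines of code. This is a sketch assuming the transformation of the raw weights is a standard softmax, which is consistent with the requirement that the weights be non-negative and sum to 1; the weight and feature values are hypothetical.

```python
import math

def softmax(w):
    m = max(w)                               # subtract the max for numerical stability
    e = [math.exp(a - m) for a in w]
    s = sum(e)
    return [a / s for a in e]

def feature_weighting(w, x):
    """Element-wise product of softmax(w) and the feature vector x."""
    return [wi * xi for wi, xi in zip(softmax(w), x)]

w = [0.0, 0.0, math.log(2.0)]    # hypothetical raw weights
x = [4.0, 8.0, 2.0]              # hypothetical PSD features (dB)
print(softmax(w))                # approximately [0.25, 0.25, 0.5]
print(feature_weighting(w, x))   # approximately [1.0, 2.0, 1.0]
```

Because the raw weights are unconstrained, they can be learned by ordinary gradient descent while the effective weights always remain a valid distribution over features.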

II-F Episodic Training (ET)

ET for domain generalization was recently proposed by Li et al. [36] for image recognition. We simplify their algorithm and integrate it with FW. The original ET algorithm in [36] contains three regularization terms. In our work, we only adopt the first term (described as epif in Section III-D) for simplicity and speed. As will be shown in Section III-G, our simplification greatly reduces the computational cost of the original ET, without sacrificing its generalization performance.

A common approach in transfer learning to learn domain-invariant features is to train a feature extractor $f_\theta$ that makes the marginal distribution $P(f_\theta(\mathbf{x}))$ consistent for different source domains, i.e., $P(f_\theta(\mathbf{x}^i)) = P(f_\theta(\mathbf{x}^j))$ for $i \neq j$. However, since the DIs of different subjects vary due to individual differences, i.e., the conditional distributions $P(y \mid \mathbf{x})$ are different for different subjects, aligning the marginal distributions only may not lead to satisfactory generalization performance. ET considers the conditional distributions directly, and trains an $f_\theta$ that aligns $P(y \mid f_\theta(\mathbf{x}))$ in all source domains, which usually generalizes well to the unseen target domain $t$.

We first establish a subject-specific feature transformation (FT) model $f_{\theta_s}$ and a subject-specific regression model $g_{\phi_s}$ for each source subject $s$ to learn the domain-specific information. We also want to train an FT model $f_\theta$ that makes the transformed features from Subject $i$ still perform well when applied to a regressor $g_{\phi_j}$ trained on Subject $j$ ($j \neq i$). Hence, the following loss function is used:

$$\mathcal{L}_{epif} = \sum_{n} \ell\big(\bar{g}_{\phi_j}(f_\theta(\mathbf{x}_n^i)),\, y_n^i\big), \tag{5}$$

where $\bar{g}_{\phi_j}$ means that $g_{\phi_j}$ is not updated during back propagation.

The overall loss function of ET, when Subject $i$’s data are fed into Subject $j$’s regressor, is:

$$\mathcal{L} = \mathcal{L}_{agg} + \lambda \mathcal{L}_{epif}, \tag{6}$$

where $\lambda$ is the ET weight.

Note that since there is a (purposeful) mismatch between $f_\theta$ and $\bar{g}_{\phi_j}$ in $\mathcal{L}_{epif}$, the gradient may be unstable and sometimes explode. Therefore, we clipped the gradient to a fixed range.
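The interplay between the shared model, the frozen subject-specific regressor, and gradient clipping can be sketched with a deliberately tiny model. The sketch below assumes a single linear unit for the feature transformation and scalar affine regressors, so that gradients can be written by hand; it also assumes element-wise clipping to `[-clip, clip]`. The values of `lam`, `lr`, and `clip`, and all data, are hypothetical and are not the settings used in the experiments.

```python
def f(theta, x):
    """Shared feature transformation; a single linear unit keeps the sketch short."""
    return sum(t * xi for t, xi in zip(theta, x))

def et_step(theta, g, g_j, batch, lam, lr, clip):
    """One gradient step on L_agg + lam * L_epif for a batch from Subject i.

    g = (a, b) is the shared regressor. g_j = (a_j, b_j) belongs to Subject j
    and is frozen: it shapes the gradient of theta but is never updated."""
    a, b = g
    a_j, b_j = g_j
    grad_theta = [0.0] * len(theta)
    grad_a = grad_b = loss = 0.0
    for x, y in batch:
        h = f(theta, x)
        r = a * h + b - y              # residual through the shared regressor
        r_j = a_j * h + b_j - y        # residual through Subject j's regressor
        loss += r * r + lam * r_j * r_j
        grad_a += 2 * r * h
        grad_b += 2 * r
        for k, xk in enumerate(x):
            grad_theta[k] += 2 * (r * a + lam * r_j * a_j) * xk

    def clamp(v):                      # element-wise gradient clipping
        return max(-clip, min(clip, v))

    theta = [t - lr * clamp(gt) for t, gt in zip(theta, grad_theta)]
    g = (a - lr * clamp(grad_a), b - lr * clamp(grad_b))
    return theta, g, loss

theta, g = [0.1, -0.2], (1.0, 0.0)
g_j = (0.8, 0.05)                                   # hypothetical frozen regressor
batch = [([1.0, 2.0], 0.3), ([0.5, -1.0], 0.7)]     # hypothetical (features, DI) pairs
losses = []
for _ in range(500):
    theta, g, loss = et_step(theta, g, g_j, batch, lam=0.5, lr=0.02, clip=1.0)
    losses.append(loss)
print(losses[0], losses[-1])
```

Only `theta` and the shared regressor change across steps; `g_j` acts as a fixed critic that penalizes features Subject j's regressor cannot use.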

II-G FWET

Our proposed algorithm, FWET, which integrates FW and ET, is shown in Algorithm 1. It learns $\mathbf{w}$ in FW and $\theta$ and $\phi$ in ET simultaneously through gradient descent optimization.

All $\theta$, $\phi$, and $\theta_s$, $\phi_s$, $s = 1, \dots, S$, are uniformly initialized. Take $\theta$ as an example. Let $m$ be the number of features in each layer. Then, each element of $\theta$ is initialized as a uniformly distributed random variable whose range depends on $m$.
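One common way to realize such an initialization is sketched below. The bound $1/\sqrt{m}$, with $m$ the number of input features of the layer, is an assumption borrowed from a widely used default; the text here does not preserve the actual interval. The layer sizes follow the 60 PSD features and the 40 hidden units mentioned elsewhere in the paper.

```python
import math
import random

def uniform_init(n_in, n_out, rng):
    """Weight matrix with entries drawn from U(-1/sqrt(n_in), 1/sqrt(n_in)).
    The bound is an assumed common default, not taken from the text."""
    bound = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-bound, bound) for _ in range(n_in)]
            for _ in range(n_out)]

rng = random.Random(0)           # fixed seed for reproducibility
W1 = uniform_init(60, 40, rng)   # 60 PSD features -> 40 hidden units
print(len(W1), len(W1[0]))
```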


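Input: Training subject data $\{(\mathbf{x}_n^s, y_n^s)\}_{n=1}^{N_s}$, $s = 1, \dots, S$;
          ET weight $\lambda$;
          Batch size $B$;
          Learning rate $\eta$.
Output: FWET model parameters $\mathbf{w}$, $\theta$ and $\phi$.
for $s = 1, \dots, S$ do
       Initialize domain-specific FW vector $\mathbf{w}_s$;
       Randomly initialize domain-specific model parameters $\theta_s$ and $\phi_s$;
end for
Initialize $\mathbf{w}$ in FWET;
Randomly initialize $\theta$ and $\phi$ in FWET;
// Warm up
for $s = 1, \dots, S$ do
       Train the domain-specific model parameters $\mathbf{w}_s$, $\theta_s$ and $\phi_s$ for one epoch, using only data from Subject $s$;
end for
while training do
       for $s = 1, \dots, S$ do
             Sample a batch from Subject $s$;
             Compute $\mathbf{x}'$ using (4), with $\mathbf{w}_s$ as the weight vector;
             Compute the sum of squared loss for the domain-specific model, and update $\mathbf{w}_s$, $\theta_s$ and $\phi_s$;
       end for
       for $i = 1, \dots, S$ do
             for $j = 1, \dots, S$ and $j \neq i$ do
                   Sample a batch from Subject $i$;
                   Compute the loss in (6) on the batch, and update $\mathbf{w}$, $\theta$ and $\phi$;
             end for
       end for
end while
Algorithm 1 Pseudocode of FWET.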
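The control flow of one pass through the while-loop body of Algorithm 1 can be made explicit with a small generator. This sketch only enumerates which subject's data and which subject's regressor are paired at each step; it performs no training, and the names are hypothetical.

```python
def fwet_iteration(num_subjects):
    """Yield the steps of one pass through the while-loop body of Algorithm 1."""
    for s in range(num_subjects):
        yield ("domain_specific", s, s)    # update Subject s's own model
    for i in range(num_subjects):
        for j in range(num_subjects):
            if j != i:
                # data from Subject i, frozen regressor from Subject j
                yield ("episodic", i, j)

steps = list(fwet_iteration(14))   # 14 training subjects in leave-one-subject-out
print(len(steps))                  # batches processed per iteration
```

With $S$ source subjects, each iteration processes $S + S(S-1)$ batches, so the episodic part dominates the cost as $S$ grows.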

III Experimental Results

This section studies the performance of FWET in EEG-based driver drowsiness estimation.

III-A Evaluation Method and Performance Measures

We used leave-one-subject-out cross-validation to validate the performance of our model. Since this was a regression problem, we used two metrics to evaluate the prediction results: root mean squared error (RMSE) and Pearson correlation coefficient (CC), which respectively measure the error and the correlation between the predicted DIs and the ground-truth DIs.
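The two metrics and the cross-validation split can be sketched as follows. This is a plain-Python illustration with hypothetical function names; the CC is undefined when either input is constant, which the sketch does not guard against.

```python
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson_cc(y_true, y_pred):
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def leave_one_subject_out(subject_ids):
    """Yield (training_ids, test_id), holding out each subject once."""
    for test_id in subject_ids:
        yield [s for s in subject_ids if s != test_id], test_id

for train_ids, test_id in leave_one_subject_out(list(range(1, 16))):
    pass  # train on the 14 subjects in train_ids, evaluate on test_id
```

The two metrics are complementary: a model can track the trend of the DI well (high CC) while being biased in level (high RMSE), and vice versa.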

We compared six different algorithms:

  • kNN, which was a $k$-nearest neighbors regressor with $k = 5$. The prediction was the average of the five nearest neighbors.

  • RR, which was ridge regression, i.e., linear regression with L2 regularization.

  • AGG, which was an MLP neural network with only one hidden layer, trained using the loss function in (2). The number of hidden layer units was 40.

  • FW-AGG, which performed FW before AGG.

  • ET, which trained an AGG model and domain-specific models together using ET. Each such model has the same structure as AGG above, i.e., a 3-layer MLP. The first layer was treated as $f_\theta$. The other two layers were treated as $g_\phi$.

  • FWET, which performed FW before ET.
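The kNN baseline in the list above is simple enough to sketch in full. The sketch assumes Euclidean distance, which the text does not specify; the data are hypothetical.

```python
def knn_predict(train_x, train_y, query, k=5):
    """Average the targets of the k training points closest to `query`
    (squared Euclidean distance; the metric is an assumption)."""
    dist = [(sum((a - b) ** 2 for a, b in zip(x, query)), y)
            for x, y in zip(train_x, train_y)]
    dist.sort(key=lambda pair: pair[0])
    nearest = dist[:k]
    return sum(y for _, y in nearest) / len(nearest)

train_x = [[0.0], [1.0], [2.0], [3.0], [4.0], [10.0]]   # hypothetical features
train_y = [0.0, 1.0, 2.0, 3.0, 4.0, 100.0]              # hypothetical targets
print(knn_predict(train_x, train_y, [2.0]))
```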

The first two algorithms are commonly used baselines for regression problems. The last four are AGG based. We compare them to analyze the individual contributions of FW and ET in FWET. The last four models were trained using mini-batch gradient descent with momentum and weight decay. We sampled 10% of the data from each training subject as the validation set for early stopping to reduce overfitting. The maximum number of training epochs was set to 500, and the early-stopping patience was 10 epochs. One epoch means one iteration in Algorithm 1. We repeated all algorithms five times and report the average performance.
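The early-stopping rule described above (at most 500 epochs, patience of 10) can be sketched as a function over a recorded sequence of validation losses. The exact stopping convention (stop once 10 consecutive epochs bring no improvement) is an assumption, and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=10, max_epochs=500):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if epoch >= max_epochs:
            break
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch       # patience exhausted
    return best_epoch, min(len(val_losses), max_epochs) - 1

losses = [1.0, 0.9, 0.8] + [0.85] * 20     # hypothetical validation curve
print(early_stopping_epoch(losses))
```

The model kept for testing is the one from `best_epoch`, not the one from `stop_epoch`.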

III-B Experimental Results

The regression performance for each subject, averaged over five repeats, is shown in Fig. 3. FW-AGG, ET and FWET outperformed kNN, RR and AGG for most subjects. One exception is Subject 10, on which FW-AGG and FWET gave negative CCs.

Fig. 3: (a) RMSEs and (b) CCs in leave-one-subject-out cross-validation. The experiments were repeated five times, and the averages are shown.

To explore why FW-AGG and FWET gave negative CCs on Subject 10, we plot the feature distributions of Subject 10, along with those from the other 14 subjects, in Fig. 4. We first plot the 10th and 90th percentiles of the PSD features from each subject in Fig. 4(a) and a $t$-SNE visualization in Fig. 4(b) to see if the feature distributions differ between subjects. Clearly, the distributions of the 51st and 52nd features of Subject 10 are dramatically different from those of other subjects, which may be due to outliers. We can also see that there are some data points from Subject 10 that are not consistent with the data from other subjects. The unsatisfactory performance of FW-AGG and FWET on Subject 10 suggests that FW may be sensitive to outliers.

Fig. 4: (a) 90th and 10th percentiles of features from Subject 10, w.r.t. the corresponding feature percentiles (90th: solid curves; 10th: dashed curves) from the other 14 subjects; (b) $t$-SNE visualization of the features from different subjects.

In offline applications, we know the features from the target subject. So, preprocessing may be used to remove the outlier features. For example, when the 51st and 52nd outlier features of Subject 10 were removed, the corresponding boxplots of the RMSEs and CCs of FW-AGG and FWET are shown in Fig. 5. They were considerably improved over the RMSEs and CCs in Fig. 3.

Fig. 5: Boxplots of (a) RMSEs and (b) CCs of FW-AGG and FWET, when Subject 10 was the target (test) subject, and the 51st and 52nd outlier features were removed.

The last group in each subfigure of Fig. 3 also shows the average performance across the 15 subjects, whose values are given in Table I. AGG is a nonlinear model with more parameters than kNN and RR. Theoretically, it should outperform kNN and RR if well-trained. However, Table I shows that this was not the case. AGG had slightly worse average RMSE than kNN, and slightly worse average CC than RR. There may be two reasons: 1) there were not enough training data to tune AGG well; and, 2) AGG was over-fitted on the training data, so it did not generalize well to a new subject. After introducing FW and ET, both training performance and generalization ability were improved, and both FW-AGG and FWET outperformed the three baselines (kNN, RR, and AGG). More specifically, ET outperformed AGG, improving 4.9% and 0.2% on the RMSE and the CC, respectively. After adding FW to AGG, FW-AGG further outperformed ET by 4.5% on the RMSE and 4.3% on the CC. FWET achieved the best performance, and further improved the RMSE and the CC by 6.9% and 5.7%, respectively, over FW-AGG.

        kNN      RR       AGG      FW-AGG   ET       FWET
RMSE    0.2688   0.3622   0.2756   0.2504   0.2621   0.2332
CC      0.4394   0.5044   0.5422   0.5668   0.5434   0.5989
TABLE I: Average RMSEs and CCs across the 15 subjects and five runs.
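The relative improvements quoted in the text can be reproduced from Table I with a short calculation. The sketch assumes the six columns follow the order in which the algorithms were introduced (kNN, RR, AGG, FW-AGG, ET, FWET), and that improvement means a relative RMSE reduction and a relative CC increase.

```python
table_i = {  # (average RMSE, average CC) from Table I
    "kNN": (0.2688, 0.4394), "RR": (0.3622, 0.5044), "AGG": (0.2756, 0.5422),
    "FW-AGG": (0.2504, 0.5668), "ET": (0.2621, 0.5434), "FWET": (0.2332, 0.5989),
}

def improvement(new, base):
    """Relative gains in percent: RMSE reduction and CC increase of `new` over `base`."""
    rmse_gain = 100 * (table_i[base][0] - table_i[new][0]) / table_i[base][0]
    cc_gain = 100 * (table_i[new][1] - table_i[base][1]) / table_i[base][1]
    return round(rmse_gain, 1), round(cc_gain, 1)

print(improvement("ET", "AGG"))        # (4.9, 0.2)
print(improvement("FW-AGG", "ET"))     # (4.5, 4.3)
print(improvement("FWET", "FW-AGG"))   # (6.9, 5.7)
```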

To determine if the differences between different algorithms were statistically significant, we also performed non-parametric multiple comparison tests on the RMSEs and CCs using Dunn’s procedure [48], with a $p$-value correction using the False Discovery Rate method [49]. The results are shown in Table II, where the statistically significant ones are marked in bold. FWET statistically significantly outperformed kNN, RR and AGG on the RMSE, and also kNN and RR on the CC. Though the performance improvements of FWET over FW-AGG and ET were not statistically significant, we have seen from Fig. 3 and Table I that on average FWET still slightly outperformed them.

                kNN     RR      AGG     FW-AGG  ET
RMSE   RR       .3697
       AGG      .4558   .3814
       FW-AGG   .1478   .0936   .1376
       ET       .1982   .1390   .2146   .3783
       FWET     .0182   .0098   .0170   .1745   .1335
CC     RR       .0063
       AGG      .0000   .0875
       FW-AGG   .0000   .0347   .3043
       ET       .0001   .1657   .3440   .1908
       FWET     .0000   .0021   .0873   .1832   .0403
TABLE II: $p$-values of non-parametric multiple comparisons on the RMSEs and CCs.
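A false discovery rate correction of a set of pairwise $p$-values can be sketched as follows. The sketch assumes the correction is the Benjamini-Hochberg step-up procedure, which is the most common FDR method but is not named explicitly in the text; the input values are hypothetical and are not taken from Table II.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):           # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]            # hypothetical raw p-values
print(benjamini_hochberg(raw))             # approximately [0.02, 0.04, 0.04, 0.02]
```

A comparison is then declared significant when its adjusted $p$-value falls below the chosen level.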

In summary, FWET gave the best average RMSE and CC among the six algorithms, so it is generally preferable to the other five.

III-C Effects of FW

The four AGG-based algorithms involve randomness, e.g., in initialization and batch selection, so it is interesting to study their stability. Recall that we had 15 subjects, and each algorithm was run five times when each subject was used as the target subject. The final results were assembled into an RMSE matrix and a CC matrix. We could plot a boxplot for each subject to show the stability of different algorithms, but that would take too much space and make it difficult to see the forest for the trees. So, we first computed the average performance of each algorithm over the 15 subjects, i.e., we took the average of the RMSE (CC) matrix along the columns to obtain a row vector, and then plotted the boxplots of the five average RMSEs (CCs) in Fig. 6. The RMSEs and CCs of kNN and RR did not have uncertainty, because there was no randomness in these algorithms. Among the four AGG-based algorithms, AGG and ET had large variations, and FW-AGG and FWET had very small variations, suggesting one more advantage of introducing FW to AGG, beyond better RMSE and CC.

Fig. 6: Boxplots of the average (a) RMSEs and (b) CCs of the six algorithms.

Fig. 6 shows that FW generally helped reduce the variation across different runs. It is interesting to study why. Several studies have analyzed the relationship between the generalization performance and sharp minima [50, 51]. It is believed that sharp minima may lead to bad generalization performance. ET tends to have flatter minima, which has already been demonstrated in [36]. We want to investigate if FW has a similar effect. We added random Gaussian noise to the learned parameters and checked how quickly the performance degraded. A rapid degradation indicates that the model is at a sharp minimum, which is bad for generalization.

As shown in Fig. 7, FW-AGG was more robust to the perturbations than AGG, and FWET was also more robust than ET. These observations demonstrated that FW led the model to flatter minima in the parameter space, which helped improve its generalization ability.

Fig. 7: (a) RMSE and (b) CC when Gaussian noise was added to the learned parameters of different algorithms. The models were trained on Subjects 1, 3-15 and tested on Subject 2.
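The perturbation test can be sketched on a toy model: add zero-mean Gaussian noise of increasing standard deviation to the learned parameters and record the resulting RMSE. The sketch uses a linear model and hypothetical data purely for illustration; the experiments in the paper perturb the trained networks.

```python
import math
import random

def rmse(params, data):
    """RMSE of a linear model y = sum(p_k * x_k) on (x, y) pairs."""
    err = [(sum(p * xk for p, xk in zip(params, x)) - y) ** 2 for x, y in data]
    return math.sqrt(sum(err) / len(err))

def perturbation_curve(params, data, sigmas, repeats=20, seed=0):
    """Average RMSE after adding N(0, sigma^2) noise to every parameter."""
    rng = random.Random(seed)
    curve = []
    for sigma in sigmas:
        total = 0.0
        for _ in range(repeats):
            noisy = [p + rng.gauss(0.0, sigma) for p in params]
            total += rmse(noisy, data)
        curve.append(total / repeats)
    return curve

data = [([1.0, 0.0], 0.2), ([0.0, 1.0], 0.5), ([1.0, 1.0], 0.7)]   # hypothetical
params = [0.2, 0.5]                                                # fits the data exactly
curve = perturbation_curve(params, data, sigmas=[0.0, 0.05, 0.1, 0.2])
print(curve)
```

A model at a flat minimum shows a slowly rising curve; a model at a sharp minimum degrades quickly as the noise level grows.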

We also visualize the importance of different regions in each power band, determined by $\mathbf{w}$ in FW. Fig. 8(a) shows the topoplots of $\mathbf{w}$ in theta and alpha bands after the softmax function in FW-AGG, when the last 14 subjects were used in training. For the theta band, the central brain region had the maximum weights, i.e., it contributed the most to drowsiness estimation. For the alpha band in FW-AGG, both the central and the occipital brain regions contributed more to drowsiness estimation than other regions. These were partially consistent with the findings in [26], where Zhao et al. studied mental fatigue in 90-minute continuous simulated driving, and found that the frontal, central and occipital regions in the theta band, and the central, parietal, occipital and temporal regions in the alpha band, all had significant differences between the beginning and the end of the driving.

Fig. 8: EEG channel importance in theta and alpha bands, converted from $\mathbf{w}$ in (a) FW-AGG and (b) FWET.

Fig. 8(b) shows the topoplots of $\mathbf{w}$ in theta and alpha bands after the softmax function in FWET, when the last 14 subjects were used in training. We can observe roughly the same patterns as in Fig. 8(a) for FW-AGG. However, note that the magnitude ranges in Fig. 8(b) were much smaller than those in Fig. 8(a), i.e., $\mathbf{w}'$ in FWET had smaller variance than that in FW-AGG.

We also computed the average PSD values for alert and drowsy states over the 15 subjects. We considered a subject to be alert (drowsy) when his/her DI was lower (higher) than the 15th (85th) percentile of the DIs over the entire session. Fig. 9(a) shows the differences between the topoplots of the drowsy and alert states, and Fig. 9(b) the Pearson correlation coefficient between each PSD feature and the DI. Interestingly, Figs. 8 and 9 are not similar, i.e., though the channel weights helped improve the drowsiness estimation performance, they were different from the correlation coefficients between the corresponding features and the DI.

Fig. 9: (a) the differences between the topoplots of the drowsy and alert states. (b) the Pearson correlation coefficient between each PSD feature and the DI.
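The percentile-based labelling of alert and drowsy samples can be sketched as follows. The percentile definition (linear interpolation between order statistics) is an assumption, and the DI values are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def label_states(di, low_q=15, high_q=85):
    """Label each DI sample 'alert', 'drowsy', or None (neither)."""
    low, high = percentile(di, low_q), percentile(di, high_q)
    return ["alert" if d < low else "drowsy" if d > high else None for d in di]

di = [i / 100 for i in range(101)]     # hypothetical session of DI values
labels = label_states(di)
print(labels.count("alert"), labels.count("drowsy"))
```

Samples between the two thresholds are left unlabelled, so the alert and drowsy averages are computed only from the clearest segments of each session.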

Finally, although FW looks similar to the attention mechanism [52], which is being widely used in computer vision and natural language processing, they are different. The attention mechanism assigns dynamic weights to the neighboring locations, which change as the input varies. FW uses a fixed weight for each EEG channel, as the contributions of different brain regions usually do not change much in the same mental task.

III-D Effects of ET

This subsection first presents two experiments to understand how ET helped extract more generalizable features from different subjects, and then studies the effect of adding more regularization terms in ET and FWET.

We used data from all 15 subjects to train AGG, ET, FW-AGG and FWET, which had different feature extractors $f_\theta$. To compare these $f_\theta$, we input Subject $i$’s data to each $f_\theta$, and used Subject $j$’s ($j \neq i$) regressor $g_{\phi_j}$ (which was trained on data from Subject $j$ only) for regression. For AGG and ET, the final regression model was $g_{\phi_j}(f_\theta(\mathbf{x}))$. For FW-AGG and FWET, the final regression model was $g_{\phi_j}(f_\theta(\mathbf{w}' \odot \mathbf{x}))$. We tried all $j \neq i$ for each $i$, i.e., 14 different $j$, and computed the average performance for each $i$. The smaller (larger) the RMSE (CC) is, the better the generalization performance is. The results are shown in Fig. 10, where the subject index means that subject’s data were used as input (Subject $i$ in the above description). ET always achieved a smaller RMSE and a larger CC than AGG, suggesting that ET extracted more generalizable features. FWET had RMSEs comparable to those of FW-AGG, but generally larger CCs, suggesting again that ET extracted more generalizable features.

Fig. 10: Average RMSEs (a) and CCs (b) when Subject $j$’s regression network was applied to data from Subject $i$ ($j \neq i$). The feature transformation was $f_\theta(\mathbf{x})$ for both AGG and ET. The feature transformation was $f_\theta(\mathbf{w}' \odot \mathbf{x})$ for both FW-AGG and FWET.

Three different regularizations were used in ET in [36] for classification problems:

  1. epif (short for episodic feature), which requires the trained feature extractor to work well with all domain-specific classifiers.

  2. epic (short for episodic classifier), which requires the trained classifier to work well with all domain-specific feature extractors.

  3. epir (short for episodic random), which requires the feature extractor to work well with a randomly initialized classifier (representing a completely new domain).

We only adopted epif in our ET, because it was much easier and faster to optimize. The average training times per iteration, when different regularization terms were used in ET and FWET, are shown in Table III. As expected, the computational cost increased when more regularization terms were used.
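As a rough sketch of the epif idea for regression, the term below scores features of a batch from one domain with the regressor of a different, randomly chosen domain. The function names, the mean squared error loss, and the weight `lam` are assumptions made for illustration; they are not taken from the paper or from [36].

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def epif_loss(extract, regressors, batch, domain, rng=random):
    """epif-style term: a batch from `domain` is passed through the shared
    feature extractor and then through the regressor of another domain."""
    xs, ys = batch
    other = rng.choice([d for d in regressors if d != domain])
    # Assumption: the mismatched regressor is treated as fixed, so in real
    # training only the feature extractor would be updated by this term.
    preds = [regressors[other](extract(x)) for x in xs]
    return mse(ys, preds)


def total_loss(extract, shared_regressor, regressors, batch, domain, lam=1.0, rng=random):
    """Aggregation loss plus the weighted epif term (lam is a hypothetical weight)."""
    xs, ys = batch
    agg = mse(ys, [shared_regressor(extract(x)) for x in xs])
    return agg + lam * epif_loss(extract, regressors, batch, domain, rng)
```

Adding epic or epir would introduce further terms of the same shape, each with its own forward pass, which is consistent with the longer training times reported in Table III.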

Table III also shows the RMSEs and CCs when more regularization terms were used. The weights for the three regularization terms were all set to the same value. For both ET and FWET, using epif alone achieved performance comparable to (and sometimes even slightly better than) that of models using more regularization terms. For the same combination of regularization terms, FWET always outperformed ET, suggesting again the benefit of FW.

Model                 RMSE    CC      Time (s)
ET (ET-epif)          0.2621  0.5434  0.5302
ET-epif-epic          0.2577  0.5486  0.7050
ET-epif-epic-epir     0.2682  0.5230  0.9235
FWET (FWET-epif)      0.2332  0.5989  0.9651
FWET-epif-epic        0.2398  0.5771  1.0530
FWET-epif-epic-epir   0.2384  0.5795  1.2812
TABLE III: Average RMSEs, CCs and training time (s) when different regularization terms were used in ET and FWET.

In summary, we have shown that our proposed ET and FWET are efficient and effective, and that their extracted features generalize comparably to (or even slightly better than) those obtained with more regularization terms.

III-E Performance Gap between Validation and Test

Early stopping on a validation set is frequently used in machine learning to reduce overfitting, and was also used in this paper. However, the validation performance is usually more optimistic than the test performance. A model with stronger generalization ability should have a smaller gap between the two.
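For readers unfamiliar with early stopping, a generic patience-based version is sketched below. The function name and the `patience` value are assumptions; the paper does not state the stopping rule it used.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose weights would be kept.

    Training stops once the validation loss has not improved for `patience`
    consecutive epochs; the best epoch seen up to that point is returned.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch
```

Because the kept epoch is chosen to minimize the validation loss, the validation score is biased toward optimism, which is why the test score is the more reliable indicator.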

Fig. 11 shows the validation and test RMSEs of the four AGG-based algorithms. Although AGG had the smallest validation RMSE, its test RMSE was the largest, i.e., its gap between the validation and test RMSEs was the largest, suggesting poor generalization ability. The validation-test RMSE gaps of FW-AGG, ET and FWET were considerably smaller. In particular, FWET had the smallest gap and the best test RMSE, suggesting strong generalizability.
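The validation-test gap used in this comparison is simply the test RMSE minus the validation RMSE. A small helper is sketched below; the function names are hypothetical, and any numbers passed to it in an example would be placeholders rather than results from Fig. 11.

```python
def generalization_gaps(results):
    """Map each model name to its test RMSE minus its validation RMSE.

    results: dict name -> (validation_rmse, test_rmse)
    """
    return {name: test - val for name, (val, test) in results.items()}


def rank_by_gap(results):
    """Return model names ordered from the smallest gap to the largest."""
    gaps = generalization_gaps(results)
    return sorted(gaps, key=gaps.get)
```

A model can rank first on validation RMSE and still rank last here, which is the pattern reported for AGG.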

Fig. 11: Validation and test RMSEs of the four AGG-based algorithms. A smaller gap between the validation RMSE and the test RMSE indicates better generalizability.

III-F Individualized

A constant value of the reaction-time parameter in (1) was used in all above experiments. This is because we considered the most challenging case in brain-computer interfaces, i.e., no labeled data at all are available from the new subject. However, if some labeled data from the new subject, or some prior knowledge about the new subject’s reaction time, are available, then it is possible to set this parameter individually. This subsection demonstrates the performance of FWET in this case.

Following the practice in [42], we set the reaction-time parameter in (1) to the 5th percentile of the reaction times of the corresponding subject, and repeated the experiments. The performances of FWET with the constant and the individualized parameter values are shown in Fig. 12. Using the individualized value reduced the RMSE for almost every subject (except Subject 4), although the CCs were roughly the same. This demonstrates that more information about the new subject can generally improve the drowsiness estimation performance.
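A per-subject 5th percentile can be computed as sketched below. The interpolation rule is an assumption, since several percentile definitions exist and the paper does not say which was used; the function names and the data layout are hypothetical.

```python
import math


def percentile(values, q):
    """q-th percentile (0 <= q <= 100), linearly interpolated between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def individualized_baselines(reaction_times, q=5):
    """Map each subject to the q-th percentile of that subject's reaction times.

    reaction_times: dict subject -> list of reaction times in seconds
    """
    return {subject: percentile(rts, q) for subject, rts in reaction_times.items()}
```

Note that this requires reaction times from the new subject, so it no longer corresponds to the calibration-free setting studied in the rest of the paper.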

Fig. 12: (a) RMSEs and (b) CCs in leave-one-subject-out cross-validation of FWET, using the constant and the individualized parameter values in (1).

III-G Discussion

This paper extends domain generalization, a concept mostly used in computer vision, to brain-computer interfaces. There are some important differences between these two application areas, which future research should take into account:

  1. The number of source domains. In computer vision applications, the number of domains is usually small, e.g., PACS has four domains, IXMAS has five domains, and MNIST usually has seven rotated domains. Scalability is usually ignored in such applications. However, in brain-computer interfaces, more and more datasets with a large number of subjects are being collected, and the scalability with respect to the number of domains can no longer be ignored.

  2. The variation of the label distribution in different domains. Most existing domain generalization approaches only focus on learning a feature transformation that makes all source domains have roughly the same marginal feature distribution, without considering the label distribution. In EEG-based driver drowsiness estimation, the distribution of DIs varies significantly among different subjects. This makes generalization across different subjects difficult in brain-computer interfaces.

IV Conclusion

EEG-based driver drowsiness estimation could be very important in improving driving safety. Unfortunately, individual differences among drivers make it very difficult to design a generic estimation algorithm whose parameters are fixed and optimal for all subjects. Usually some subject-specific calibration data are needed to tune the model parameters before the model is applied to a new subject, which is inconvenient and not user-friendly. Many approaches have been proposed to reduce this calibration effort, but few can completely eliminate it. This paper has proposed an FWET approach to completely eliminate the calibration requirement. It integrates two techniques: FW to learn the importance of different features, and ET for domain generalization. Experiments demonstrated that both FW and ET are effective, and that their integration can further improve the generalization performance. FWET does not need any labelled or unlabelled calibration data from the new subject, and hence could be very useful in plug-and-play brain-computer interfaces. Our future research will apply FWET to more such applications.


  • [1] F. Sagberg, P. Jackson, H.-P. Krüger, A. Muzet, and A. J. Williams, Fatigue, Sleepiness and Reduced Alertness as Risk Factors in Driving. Oslo: Institute of Transport Economics, 2004.
  • [2] K. Kozak, J. Pohl, W. Birk, J. Greenberg, B. Artz, M. Blommer, L. Cathey, and R. Curry, “Evaluation of lane departure warnings for drowsy drivers,” in Proc. Human Factors and Ergonomics Society Annual Meeting, Los Angeles, CA, Oct. 2006, pp. 2400–2404.
  • [3] H. Abbood, W. Al-Nuaimy, A. Al-Ataby, S. A. Salem, and H. S. AlZubi, “Prediction of driver fatigue: Approaches and open challenges,” in Proc. 14th UK Workshop on Computational Intelligence, West Yorkshire, UK, Sep. 2014, pp. 1–6.
  • [4] M. I. Chacon-Murguia and C. Prieto-Resendiz, “Detecting driver drowsiness: A survey of system designs and technology,” IEEE Consumer Electronics Magazine, vol. 4, no. 4, pp. 107–119, 2015.
  • [5] H. Kang, “Various approaches for driver and driving behavior monitoring: A review,” in Proc. IEEE Int’l Conf. on Computer Vision Workshops, Sydney, Australia, Dec. 2013, pp. 616–623.
  • [6] A. Sahayadhas, K. Sundaraj, and M. Murugappan, “Detecting driver drowsiness based on sensors: a review,” Sensors, vol. 12, no. 12, pp. 16937–16953, 2012.
  • [7] Y.-T. Wang, K.-C. Huang, C.-S. Wei, T.-Y. Huang, L.-W. Ko, C.-T. Lin, C.-K. Cheng, and T.-P. Jung, “Developing an EEG-based on-line closed-loop lapse detection and mitigation system,” Frontiers in Neuroscience, vol. 8, p. 321, 2014.
  • [8] L. M. Bergasa, J. Nuevo, M. A. Sotelo, R. Barea, and M. E. Lopez, “Real-time system for monitoring driver vigilance,” IEEE Trans. on Intelligent Transportation Systems, vol. 7, no. 1, pp. 63–77, 2006.
  • [9] T. D’Orazio, M. Leo, C. Guaragnella, and A. Distante, “A visual approach for driver inattention detection,” Pattern Recognition, vol. 40, no. 8, pp. 2341–2355, 2007.
  • [10] E. Michail, A. Kokonozi, I. Chouvarda, and N. Maglaveras, “EEG and HRV markers of sleepiness and loss of control during car driving,” in Proc. 30th Annual Int’l Conf. of the IEEE Engineering in Medicine and Biology Society, BC, Canada, Aug. 2008, pp. 2566–2569.
  • [11] G. Jahn, A. Oehme, J. F. Krems, and C. Gelau, “Peripheral detection as a workload measure in driving: Effects of traffic complexity and route guidance system use in a driving study,” Transportation Research Part F: Traffic Psychology and Behaviour, vol. 8, no. 3, pp. 255–275, 2005.
  • [12] M. Akin, M. B. Kurt, N. Sezgin, and M. Bayram, “Estimating vigilance level by using EEG and EMG signals,” Neural Computing and Applications, vol. 17, no. 3, pp. 227–236, 2008.
  • [13] S. Hu and G. Zheng, “Driver drowsiness detection with eyelid related parameters by Support Vector Machine,” Expert Systems with Applications, vol. 36, no. 4, pp. 7651–7658, 2009.
  • [14] K. Fujiwara, E. Abe, K. Kamata, C. Nakayama, Y. Suzuki, T. Yamakawa, T. Hiraoka, M. Kano, Y. Sumi, F. Masuda, M. Matsuo, and H. Kadotani, “Heart rate variability-based driver drowsiness detection and its validation with EEG,” IEEE Trans. on Biomedical Engineering, vol. 66, no. 6, pp. 1769–1778, June 2019.
  • [15] M. Miyaji, H. Kawanaka, and K. Oguri, “Driver’s cognitive distraction detection using physiological features by the AdaBoost,” in Proc. IEEE Int’l Conf. on Intelligent Transportation Systems, St. Louis, MO, Oct. 2009, pp. 1–6.
  • [16] B.-L. Lee, D.-S. Lee, and B.-G. Lee, “Mobile-based wearable-type of driver fatigue detection by GSR and EMG,” in Proc. IEEE Region 10 Conf. IEEE, Nov. 2015, pp. 1–4.
  • [17] R. Fu, H. Wang, and W. Zhao, “Dynamic driver fatigue detection using hidden Markov model in real driving condition,” Expert Systems with Applications, vol. 63, pp. 397–411, 2016.
  • [18] P. Aricò, G. Borghini, G. Di Flumeri, N. Sciaraffa, A. Colosimo, and F. Babiloni, “Passive BCI in operational environments: Insights, recent advances, and future trends,” IEEE Trans. on Biomedical Engineering, vol. 64, no. 7, pp. 1431–1436, 2017.
  • [19] Y. Cui and D. Wu, “EEG-based driver drowsiness estimation using convolutional neural networks,” in Proc. Int’l Conf. on Neural Information Processing, Guangzhou, China, Nov. 2017, pp. 822–832.
  • [20] C.-H. Chuang, Z. Cao, Y.-K. Wang, P.-T. Chen, C.-S. Huang, N. R. Pal, and C.-T. Lin, “Dynamically weighted ensemble-based prediction system for adaptively modeling driver reaction time,” arXiv preprint arXiv:1809.06675, 2018.
  • [21] D. Wu, C.-H. Chuang, and C.-T. Lin, “Online driver’s drowsiness estimation using domain adaptation with model fusion,” in Proc. Int’l Conf. on Affective Computing and Intelligent Interaction, Xi’an, China, Sep. 2015, pp. 904–910.
  • [22] D. Wu, V. J. Lawhern, S. Gordon, B. J. Lance, and C.-T. Lin, “Driver drowsiness estimation from EEG signals using online weighted adaptation regularization for regression (OwARR),” IEEE Trans. on Fuzzy Systems, vol. 25, no. 6, pp. 1522–1535, 2017.
  • [23] C.-H. Chuang, Z. Cao, J.-T. King, B.-S. Wu, Y.-K. Wang, and C.-T. Lin, “Brain electrodynamic and hemodynamic signatures against fatigue during driving,” Frontiers in Neuroscience, vol. 12, p. 181, 2018.
  • [24] W. Klimesch, P. Sauseng, and S. Hanslmayr, “EEG alpha oscillations: the inhibition–timing hypothesis,” Brain Research Reviews, vol. 53, no. 1, pp. 63–88, 2007.
  • [25] J. Perrier, S. Jongen, E. Vuurman, M. L. Bocca, J. G. Ramaekers, and A. Vermeeren, “Driving performance and EEG fluctuations during on-the-road driving following sleep deprivation,” Biological Psychology, vol. 121, pp. 1–11, 2016.
  • [26] C. Zhao, M. Zhao, J. Liu, and C. Zheng, “Electroencephalogram and electrocardiograph assessment of mental fatigue in a driving simulator,” Accident Analysis & Prevention, vol. 45, pp. 83–90, 2012.
  • [27] C.-T. Lin, C.-H. Chuang, Y.-K. Wang, S.-F. Tsai, T.-C. Chiu, and L.-W. Ko, “Neurocognitive characteristics of the driver: A review on drowsiness, distraction, navigation, and motion sickness,” Journal of Neuroscience and Neuroengineering, vol. 1, no. 1, pp. 61–81, 2012.
  • [28] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
  • [29] A. M. Azab, J. Toth, L. S. Mihaylova, and M. Arvaneh, Signal Processing and Machine Learning for Brain–Machine Interfaces. Institution of Engineering and Technology, 2018, ch. 5, pp. 81–101.
  • [30] Y.-P. Lin and T.-P. Jung, “Improving EEG-based emotion classification using conditional transfer learning,” Frontiers in Human Neuroscience, vol. 11, p. 334, 2017.
  • [31] P. Zanini, M. Congedo, C. Jutten, S. Said, and Y. Berthoumieu, “Transfer learning: A Riemannian geometry framework with applications to brain–computer interfaces,” IEEE Trans. on Biomedical Engineering, vol. 65, no. 5, pp. 1107–1116, 2018.
  • [32] H. He and D. Wu, “Transfer learning for brain-computer interfaces: A Euclidean space data alignment approach,” IEEE Trans. on Biomedical Engineering, 2019, in press.
  • [33] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi, “Domain generalization for object recognition with multi-task autoencoders,” in Proc. IEEE Int’l Conf. on Computer Vision, Santiago, Chile, Dec. 2015, pp. 2551–2559.
  • [34] H. Li, S. Jialin Pan, S. Wang, and A. C. Kot, “Domain generalization with adversarial feature learning,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, Utah, Jun. 2018, pp. 5400–5409.
  • [35] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales, “Deeper, broader and artier domain generalization,” in Proc. IEEE Int’l Conf. on Computer Vision, Venice, Italy, Oct. 2017, pp. 5542–5550.
  • [36] D. Li, J. Zhang, Y. Yang, C. Liu, Y. Song, and T. M. Hospedales, “Episodic training for domain generalization,” arXiv preprint arXiv:1902.00113, 2019.
  • [37] D. Li, Y. Yang, Y. Song, and T. M. Hospedales, “Learning to generalize: Meta-learning for domain generalization,” in Proc. 32nd AAAI Conf. on Artificial Intelligence, New Orleans, Louisiana, Feb. 2018, pp. 3490–3497.
  • [38] Y. Balaji, S. Sankaranarayanan, and R. Chellappa, “MetaReg: Towards domain generalization using meta-regularization,” in Proc. Advances in Neural Information Processing Systems, Montreal, Canada, Dec. 2018, pp. 1006–1016.
  • [39] C.-H. Chuang, L.-W. Ko, T.-P. Jung, and C.-T. Lin, “Kinesthesia in a sustained-attention driving task,” NeuroImage, vol. 91, pp. 187–202, 2014.
  • [40] S.-W. Chuang, L.-W. Ko, Y.-P. Lin, R.-S. Huang, T.-P. Jung, and C.-T. Lin, “Co-modulatory spectral changes in independent brain processes are correlated with task performance,” NeuroImage, vol. 62, no. 3, pp. 1469–1477, 2012.
  • [41] J. Horne and L. Reyner, “Vehicle accidents related to sleep: a review,” Occupational and Environmental Medicine, vol. 56, no. 5, pp. 289–294, 1999.
  • [42] C.-S. Wei, Y.-P. Lin, Y.-T. Wang, T.-P. Jung, N. Bigdely-Shamlo, and C.-T. Lin, “Selective transfer learning for EEG-based drowsiness detection,” in Proc. IEEE Int’l Conf. on Systems, Man, and Cybernetics, Hong Kong, China, Oct. 2015, pp. 3229–3232.
  • [43] C.-S. Wei, Y.-P. Lin, and T.-P. Jung, “Exploring the EEG correlates of neurocognitive lapse with robust principal component analysis,” in Proc. Int’l Conf. on Augmented Cognition, Toronto, Canada, Jul. 2016, pp. 113–120.
  • [44] C.-S. Wei, Y.-P. Lin, Y.-T. Wang, C.-T. Lin, and T.-P. Jung, “A subject-transfer framework for obviating inter-and intra-subject variability in EEG-based drowsiness detection,” NeuroImage, vol. 174, pp. 407–419, 2018.
  • [45] Q. Ji, Z. Zhu, and P. Lan, “Real-time nonintrusive monitoring and prediction of driver fatigue,” IEEE Trans. on Vehicular Technology, vol. 53, no. 4, pp. 1052–1068, 2004.
  • [46] A. Delorme and S. Makeig, “EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics including independent component analysis,” Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.
  • [47] P. Welch, “The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms,” IEEE Trans. on Audio and Electroacoustics, vol. 15, no. 2, pp. 70–73, 1967.
  • [48] O. Dunn, “Multiple comparisons using rank sums,” Technometrics, vol. 6, pp. 214–252, 1964.
  • [49] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society, Series B (Methodological), vol. 57, pp. 289–300, 1995.
  • [50] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina, “Entropy-SGD: Biasing gradient descent into wide valleys,” in Proc. Int’l Conf. on Learning Representations, Toulon, France, Apr. 2017.
  • [51] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” in Proc. Int’l Conf. on Learning Representations, Toulon, France, Apr. 2017.
  • [52] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Advances in Neural Information Processing Systems, Long Beach, CA, Dec. 2017, pp. 5998–6008.