Intra- and Inter-epoch Temporal Context Network (IITNet) for Automatic Sleep Stage Scoring

02/18/2019 ∙ by Seunghyeok Back, et al. ∙ Gwangju Institute of Science and Technology

This study proposes a novel deep learning model, called IITNet, to learn intra- and inter-epoch temporal contexts from a raw single-channel electroencephalogram (EEG) for automatic sleep stage scoring. When sleep experts identify the sleep stage of a 30-second PSG segment called an epoch, they investigate sleep-related events such as sleep spindles, K-complexes, and frequency components in local segments of the epoch (sub-epochs) and consider the relations between the sleep-related events of successive epochs to follow the transition rules. Inspired by this, IITNet learns to encode each sub-epoch into a representative feature via a deep residual network and then captures the contextual information in the sequence of representative features via a BiLSTM. Thus, IITNet can extract features at the sub-epoch level and consider the temporal context not only between epochs but also within an epoch. IITNet is an end-to-end architecture and does not need any preprocessing, handcrafted feature design, balanced sampling, pre-training, or fine-tuning. Our model was trained and evaluated on the Sleep-EDF and MASS datasets and outperformed other state-of-the-art results on both datasets, with an overall accuracy (ACC) of 84.0 and 86.6, a macro F1-score (MF1) of 77.7 and 80.8, and a Cohen's kappa of 0.78 and 0.80 in Sleep-EDF and MASS, respectively.


1 Introduction

Sleep stage scoring, also known as sleep stage classification or identification, is essential to diagnose and treat sleep disorders [wulff2010sleep]. Many people suffering from sleep disorders are at risk of underlying health problems [torabi2015withstanding]. Typical sleep disorders (e.g., sleep apnea, narcolepsy, sleepwalking) can be diagnosed by polysomnography (PSG) [berthomier2007automatic]. PSG is the gold standard in sleep stage scoring and is based on biosignals of body functions such as brain activity (electroencephalogram, EEG), eye movements (electrooculogram, EOG), heart rhythm (electrocardiogram, ECG), and muscle activity of the chin, face, or limbs (electromyogram, EMG). These recorded signals are analyzed by trained human experts, who label each 20- or 30-second segment of PSG data, called an epoch, with its corresponding sleep stage. The sleep stages are classified into wakefulness (Wake), rapid eye movement (REM), and non-REM (NREM) following the Rechtschaffen and Kales (R&K) rules [rechtschaffen1968manual] or the American Academy of Sleep Medicine (AASM) rules [berry2012aasm]. Following AASM, NREM is further divided into three stages, referred to as S1, S2, and S3 or N1, N2, and N3. In order to draw the whole-night hypnogram, which represents the sleep stage as a function of time during sleep, the experts have to inspect all the epochs visually and label their sleep stages. This manual sleep stage scoring is labor-intensive and time-consuming [stepnowsky2013scoring, rosenberg2013american, stephansen2017use].

Many sleep stage scoring methods have been proposed to analyze PSG data automatically. In particular, hand-crafted feature extraction techniques have been widely used to automate sleep stage scoring [aboalayon2016sleep]. The features can be extracted from time-domain signals [koley2012ensemble, cecotti2011convolutional], frequency or time-frequency domain signals [berthomier2007automatic, langkvist2012sleep, vilamala2017deep, biswal2017sleepnet, dong2018mixed, fraiwan2012automated, tsinalis2016automatic, hassan2016decision, sharma2017automatic, hsu2013automatic], or non-linear parameters [hassan2016computer, liang2012automatic, lajnef2015learning]. The extracted features were analyzed by fuzzy classification [berthomier2007automatic], decision trees [hassan2016computer], random forests [fraiwan2012automated, hassan2016decision, sharma2017automatic], or support vector machines [koley2012ensemble, lajnef2015learning, zhu2014analysis]. Some of these methods used multi-channel or multi-modality data [stephansen2017use, vilamala2017deep, biswal2017sleepnet, dong2018mixed, lajnef2015learning]. These studies have shown that machine learning on hand-crafted features is effective for automating sleep stage scoring. However, such approaches may require additional hand-crafted tuning to analyze PSG data obtained from a different recording environment, since the features were engineered for a specific PSG system and the available datasets [supratak2017deepsleepnet].

Recently, deep learning has been adopted to score the sleep stage automatically from PSG data. Features were extracted by convolutional neural networks (CNNs) in [supratak2017deepsleepnet, manor2015convolutional, sors2018convolutional, tsinalis2016automatic, chambon2018deep, mirowski2009classification, wulsin2011modeling]. Some deep neural networks learned features from multi-channel or multi-modality data to improve the performance [chambon2018deep, parra2005recipes, o2014montreal]. More recently, recurrent neural networks (RNNs) have been adopted to consider the transition rules documented in scoring manuals such as the AASM manual and to learn temporal information from the sequence of epochs [stephansen2017use, biswal2017sleepnet, dong2018mixed, hsu2013automatic, supratak2017deepsleepnet, sors2018convolutional, tsinalis2016automatic, chambon2018deep, phan2018joint]. Supratak et al. used RNNs to consider the temporal context between epoch-wise features, which are individually extracted from epochs by CNNs [supratak2017deepsleepnet]. The results showed that considering the transition rules by analyzing the inter-epoch temporal context is effective and essential for automated sleep stage scoring.

When human sleep experts score the sleep stage of an epoch, they generally look for sleep-related events in sub-epochs, such as K-complexes, sleep spindles, and frequency components (alpha, beta, delta, and theta activities). Then, they analyze the relations between the sleep-related events in the epoch and in its neighboring epochs. In other words, they inspect the signals at the sub-epoch level as well as at the epoch level. For instance, an epoch can be labeled N2 when K-complexes or sleep spindles exist in the last half of its previous epoch. Phan et al. [phan2018seqsleepnet] extracted representative features from sub-epochs and analyzed the temporal dynamics between the sub-epoch features with an RNN, but their approach requires preprocessing to extract time-frequency domain data from the raw PSG records via the short-time Fourier transform.

In this paper, we propose an intra- and inter-epoch temporal context network (IITNet) that automates sleep stage scoring in the way the sleep experts do. IITNet encodes each sub-epoch of the input signals into a representative feature and analyzes the relations between the features extracted from each sub-epoch in the forward and backward directions. This enables IITNet to consider both the intra- and inter-epoch temporal contexts and to learn the transition rules at the sub-epoch level. To this end, a deep residual network (ResNet) [he2016deep, he2016identity], one of the deep convolutional networks, is used to extract a sequence of representative features from single-channel raw EEG signals, and a bidirectional long short-term memory (BiLSTM) [schuster1997bidirectional], one of the RNNs, is used to learn the temporal context between the representative features.

IITNet is an end-to-end architecture and does not require any preprocessing, transformation of the raw data, balanced sampling, or cost-sensitive weighting. When compared with other state-of-the-art methods that use single-channel EEG signals, IITNet outperformed them on two public datasets: Sleep-EDF and MASS. Moreover, the optimal number of input epochs for the best prediction performance was investigated. The main contributions of this study are as follows:

  • By extracting representative features from each sub-epoch, IITNet can learn sleep-related events and transition rules effectively, considering the intra- and inter-epoch temporal context to automate sleep stage scoring with single-channel EEG signals.

  • Among the sleep scoring methods that use single-channel raw EEG signals, IITNet achieved the highest performance on the two datasets: Sleep-EDF and MASS. Moreover, using only a single 30-second epoch as input, IITNet achieved performance comparable to the state-of-the-art methods for the sleep stages Wake, N2, and N3.

  • IITNet is an end-to-end architecture and does not require any additional training or preprocessing procedures such as balanced sampling, cost-sensitive weighting, pre-training, fine-tuning, or hand-crafted feature design.

Figure 1: The conversion of an epoch signal into a feature sequence in the convolutional layers. Each representative feature encodes the corresponding sub-epoch signal as a vector whose dimension equals the number of filters in the last convolutional layer. The representative features are extracted and stacked into a feature sequence to be fed into the recurrent layers.
Figure 2: The architecture of the proposed model, IITNet. The left shows the one-to-one configuration for intra-epoch temporal context learning. The right shows the many-to-one configuration for intra- and inter-epoch temporal context learning. Each green box indicates the representative feature extracted from a sub-epoch (red dashed box). These representative features are aggregated into a feature sequence (blue dashed box) for intra-epoch temporal context learning or a series of feature sequences (yellow dashed box) for both intra- and inter-epoch temporal context learning.

2 Model Architecture of IITNet

2.1 Model Overview

The proposed model, the intra- and inter-epoch temporal context network (IITNet), is designed to learn representative features from sub-epochs of raw EEG signals and to analyze the temporal relations between the representative features at the sub-epoch level. When human sleep experts label the 30-second PSG data of a target epoch with its corresponding sleep stage, they visually inspect the frequency characteristics and sleep-related events such as spindles and K-complexes. They also check the sleep-related events in the previous epochs and consider the relations between sleep-related events to follow the transition rules. Similarly, IITNet extracts the representative features at the sub-epoch level and captures the contextual information in the sequence of representative features to score the sleep stage of the target epoch. Thus, IITNet can consider the intra-epoch temporal context as well as the inter-epoch temporal context.

IITNet is an extension of the convolutional recurrent neural network (CRNN) [shi2017end] and consists of two main parts, the convolutional layers and the recurrent layers, as shown in Fig. 2. In the convolutional layers, a modified ResNet [he2016deep] is used to learn the representative features associated with the sleep-related events from single-channel raw EEG signals. The residual network has skip connections that make it possible to train much deeper networks efficiently [he2016identity], and it has been widely used as a feature extractor in various tasks [he2017mask, xie2017rethinking, lu2016hierarchical]. In the recurrent layers, two layers of bidirectional LSTM (BiLSTM) are employed to capture the contextual information from the representative features in both the forward and backward directions [graves2013speech, hochreiter1997long, gers2002learning]. A softmax classifier is placed on top of the model to assign the most probable sleep stage to a given target epoch.

IITNet disassembles the 30-second epoch into overlapping sub-epochs and encodes each sub-epoch into a representative feature, as shown in Fig. 1. In the convolutional layers, an epoch is converted into feature maps. Each column of the feature maps represents a sub-epoch feature, since it is correlated with a local segment of the input signals according to its receptive field. These sub-epoch features are stacked from left to right in chronological order, and a sequence of sub-epoch encoding vectors, called a feature sequence, is generated for each 30-second input signal. The feature sequence is then fed into the recurrent layers, and the temporal relations between the representative features are analyzed.
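To make this mapping concrete, the following sketch (ours, not the authors' code) shows how the output feature maps of a small 1D CNN can be reinterpreted as a chronological sequence of sub-epoch feature vectors. The encoder layers and sizes are illustrative placeholders, not the paper's modified ResNet-50; only the column-to-sub-epoch reinterpretation is the point.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy stand-in for the convolutional layers (illustrative sizes)."""
    def __init__(self, n_filters=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool1d(kernel_size=3, stride=2, padding=1),
            nn.Conv1d(64, n_filters, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # x: (batch, 1, 3000) -- one 30-second epoch sampled at 100 Hz
        maps = self.features(x)       # (batch, n_filters, n_columns)
        # Each column of the feature maps encodes one sub-epoch via its
        # receptive field; transpose so the columns become a time sequence.
        return maps.permute(0, 2, 1)  # (batch, n_columns, n_filters)

encoder = TinyEncoder()
feature_seq = encoder(torch.randn(4, 1, 3000))
print(feature_seq.shape)  # torch.Size([4, 375, 128])
```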

IITNet takes the target epoch as input, either with or without its previous epochs, to predict the sleep stage of the target epoch. When a single epoch is fed into IITNet, only the intra-epoch temporal context is considered. When the sequence of the target epoch and its preceding epochs is fed into IITNet, the temporal context is analyzed at both the intra- and inter-epoch levels, since the series of feature sequences contains the representative features of both the previous epochs and the target epoch.

In this study, only the successive epochs right before the target epoch are used for inter-epoch temporal context learning. This configuration was determined based on an intensive preliminary study of three cases. In the first case, the target epoch was in the middle of the sequence of successive epochs; in the second and third cases, it was located at the very front and at the very end of the sequence, respectively. The third case showed the best performance; thus, the successive epochs before the target epoch were used to form the input sequence for IITNet. This is also clinically reasonable, since human experts generally investigate the previous epochs to apply the transition rules following AASM.

2.2 Intra-epoch Temporal Context Learning

For intra-epoch temporal context learning, IITNet takes a single target epoch $x$ as input to model the conditional probability $p(y \mid x)$, where $y$ is the true sleep stage of the epoch. IITNet extracts the representative features from the sub-epochs and produces a feature sequence that contains the sub-epoch features in chronological order. The convolutional layers create feature maps from the target epoch, and the feature maps are then converted into a sequence of features as follows:

$F = (f_1, f_2, \ldots, f_n)$ (1)

where $F$ is the feature sequence and $f_i$ is the vector indicating the representative feature of the corresponding sub-epoch. $n$ is the number of columns in the feature maps, and each feature vector $f_i$ is drawn from the $i$-th column of the feature maps. The length of a feature vector is $d$, which is equal to the number of filters in the last convolutional layer.

In the recurrent layers, the BiLSTM has internal hidden states of dimension $h$ for each direction. The two BiLSTM layers process the feature vector $f_i$ together with the previous hidden state in both the forward and backward directions, which results in internal representations of the left and right context at time step $i$ as follows [graves2005framewise]:

$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(f_i, \overrightarrow{h_{i-1}})$ (2)
$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(f_i, \overleftarrow{h_{i+1}})$ (3)

In order to predict the true sleep stage $y$, the last hidden states of the two directions are concatenated to form the internal representation of the bidirectional context of size $2h$. This representation is fed into the fully-connected layer to output the vector of the probability distribution $\hat{y}$ over the sleep stages of the epoch. Finally, the softmax classifier labels the epoch with the most probable sleep stage.
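A minimal sketch of this recurrent head, assuming a two-layer BiLSTM whose top-layer last hidden states in the two directions are concatenated and passed through a fully-connected layer; the sizes are placeholders, not the paper's settings:

```python
import torch
import torch.nn as nn

class BiLSTMHead(nn.Module):
    """Toy recurrent head: 2-layer BiLSTM + fully-connected classifier."""
    def __init__(self, feat_dim=128, hidden=128, n_classes=5):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, seq):
        # seq: (batch, n, d) -- the feature sequence F of Eq. 1
        _, (h_n, _) = self.rnn(seq)
        # h_n: (num_layers * 2, batch, hidden); the last two entries are
        # the top layer's forward and backward final hidden states.
        ctx = torch.cat([h_n[-2], h_n[-1]], dim=1)  # bidirectional context, size 2h
        return self.fc(ctx)  # logits; a softmax over them gives p(y | x)
```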

2.3 Intra- and Inter-epoch Temporal Context Learning

In order to consider the inter-epoch dependency at the sub-epoch level, IITNet takes a target epoch and its previous epochs together as input. The model scores the target epoch based on the temporal context in the series of feature sequences, which encodes the target epoch and its previous epochs at the sub-epoch level, where $L$ is the sequence length, i.e., the number of input epochs. Formally, IITNet is trained to model the conditional probability:

$p(y_L \mid x_1, x_2, \ldots, x_L)$ (4)

where $(x_1, x_2, \ldots, x_L)$ is the sequence of successive epochs, $x_L$ is the target epoch, $x_1, \ldots, x_{L-1}$ are the previous epochs, and $y_L$ is the true sleep stage of the target epoch.

In the convolutional layers, IITNet extracts the feature sequence individually for each epoch, as shown in Fig. 2. In other words, the convolutional layers take the $l$-th epoch $x_l$ and produce a corresponding feature sequence $F_l$. It should be noted that the learnable parameters of the convolutional layers are shared across the epochs. In backpropagation, the parameters are updated with the average of the gradients computed from all the input epochs. At the top of the convolutional layers, a series of feature sequences is produced as follows:

$S = (F_1, F_2, \ldots, F_L) = (f_{1,1}, \ldots, f_{1,n}, f_{2,1}, \ldots, f_{L,n})$ (5)

where $f_{l,1}, \ldots, f_{l,n}$ are the sub-epoch feature vectors corresponding to $x_l$. Accordingly, $S$ includes all representative features of the input epochs in chronological order. The recurrent layers and the softmax classifier process $S$ in the same way as in intra-epoch temporal context learning. The only difference is that the input fed into the BiLSTM is the series of feature sequences of successive epochs instead of the single feature sequence of one epoch.
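Putting the two parts together, a hedged sketch of the many-to-one forward pass: the same encoder (shared weights) is applied to each of the $L$ input epochs, the per-epoch feature sequences are concatenated in chronological order, and the recurrent head scores the last (target) epoch. TinyEncoder and BiLSTMHead are the toy modules sketched earlier, not the paper's exact networks.

```python
import torch
import torch.nn as nn

class ManyToOneIITNet(nn.Module):
    """Toy many-to-one configuration: shared encoder + BiLSTM head."""
    def __init__(self, encoder, head):
        super().__init__()
        self.encoder = encoder  # parameters shared across all L epochs
        self.head = head

    def forward(self, x):
        # x: (batch, L, 1, 3000), with the target epoch last in the sequence
        L = x.shape[1]
        seqs = [self.encoder(x[:, l]) for l in range(L)]  # L x (batch, n, d)
        series = torch.cat(seqs, dim=1)  # S of Eq. 5: (batch, L * n, d)
        return self.head(series)         # logits for the target epoch
```

Because the encoder is reused for every epoch, a single backward pass accumulates its gradients over all $L$ epochs, which realizes the parameter sharing described above.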

3 Experiments

3.1 Datasets

To evaluate the sleep stage scoring performance of IITNet, two public datasets were used: the extended Sleep-EDF [physiotoolkitphysionet, kemp2000analysis] and the Montreal Archive of Sleep Studies (MASS) [o2014montreal]. The datasets contain PSG records and their corresponding sleep stages labeled by human sleep experts. Table 1 shows the number of 30-second epochs in the datasets with respect to the sleep stages when the sequence length is one.

3.1.1 Sleep-EDF

The Sleep-EDF Expanded dataset (Sleep-EDF) contains two types of PSG records: SC (20 healthy subjects without sleep-related disorders) and ST (22 subjects in a study of the effect of temazepam on sleep). Each record includes two-channel EEG (Fpz-Cz and Pz-Oz), single-channel EOG, and single-channel EMG. The sampling rate of the EEG signals is 100 Hz. Each 30-second epoch was labeled with one of eight classes (W, REM, N1, N2, N3, N4, MOVEMENT, UNKNOWN) according to the R&K rules [ALLANHOBSON1969644].

To train and evaluate IITNet, the Fpz-Cz channel EEG signals in the SC records were used. As the number of class W (Wake) epochs was disproportionately large compared to the others, only the epochs from thirty minutes before the sleep period to thirty minutes after it were used [supratak2017deepsleepnet]. The N3 and N4 stages were merged into the N3 stage following the AASM rules. The MOVEMENT and UNKNOWN stages were excluded, because their prediction is out of the scope of this study. Successive epochs of length $L$ are used as the input:

$X = (x_1, x_2, \ldots, x_L)$ (6)

For training, the labeled sleep stage of the target epoch is used as the true label $y_L$. Although the previous epochs have their own true labels, only the single label of the target epoch is used; thus, the first $L-1$ epochs of each record were not used as target epochs. There was no additional hand-crafted feature extraction or signal processing in this data preparation.
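A small sketch of this input preparation under the stated convention (a sliding window of $L$ successive epochs, labeled with the stage of the last one); the helper name and shapes are ours:

```python
import numpy as np

def make_sequences(epochs, labels, L):
    """epochs: (n_epochs, 3000) raw EEG; labels: (n_epochs,) stage indices."""
    xs, ys = [], []
    for t in range(L - 1, len(epochs)):
        xs.append(epochs[t - L + 1 : t + 1])  # L successive epochs
        ys.append(labels[t])                  # label of the target epoch only
    return np.stack(xs), np.asarray(ys)

# e.g. a record of 900 epochs with L = 11 yields 890 (sequence, label) pairs
record = np.random.randn(900, 3000).astype(np.float32)
stages = np.random.randint(0, 5, size=900)
X, y = make_sequences(record, stages, L=11)
print(X.shape, y.shape)  # (890, 11, 3000) (890,)
```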

3.1.2 MASS

The MASS dataset includes the PSG records of 200 subjects and consists of five subsets, SS1-SS5, grouped by research and acquisition protocols. In this study, the SS3 records (62 subjects) were used to train and evaluate IITNet. The subset contains twenty-channel EEG, two-channel EOG, three-channel EMG, and single-channel ECG. The sampling rate of the EOG and EEG signals is 256 Hz. Each 30-second epoch was labeled with one of five classes (W, REM, N1, N2, N3) according to the AASM rules. The successive epochs are used as the input, and the labeled sleep stage of the target epoch is used for training in the same way as for Sleep-EDF, without any additional preprocessing.

Stage       W      N1     N2     N3 (N4)  REM    Total
Sleep-EDF   8185   2804   17799  5703     7717   42308
MASS        5672   4524   29212  7567     10420  57395

Table 1: The class-wise number of epochs in Sleep-EDF and MASS when the sequence length is one.

3.2 Model Specifications

To deal with one-dimensional time-series EEG signals, one-dimensional operations were used in the modified ResNet-50 instead of the two-dimensional convolution, max-pooling, and batch normalization operations. Furthermore, an additional max-pooling layer is placed between the conv3_x and conv4_x layers to reduce the length of the feature sequence by half. The global average pooling layer was excluded, and a dropout layer was added at the end of the convolutional layers to prevent overfitting. In the recurrent layers, two layers of BiLSTM were adopted. The hidden state size of the BiLSTM in each direction and the number of filters in the last convolutional layer were set to the same value.
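As an illustration of the 2D-to-1D substitution (ours, not the authors' implementation), one ResNet bottleneck block with Conv2d/BatchNorm2d replaced by their 1D counterparts might look as follows; the full model would stack such blocks in the ResNet-50 pattern, insert the extra max-pool between conv3_x and conv4_x, and replace the global average pool with dropout:

```python
import torch.nn as nn

class Bottleneck1d(nn.Module):
    """1D version of the ResNet bottleneck block (illustrative)."""
    def __init__(self, in_ch, mid_ch, stride=1):
        super().__init__()
        out_ch = 4 * mid_ch
        self.body = nn.Sequential(
            nn.Conv1d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm1d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv1d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm1d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv1d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm1d(out_ch),
        )
        # Projection shortcut when the shape changes, identity otherwise
        self.skip = (nn.Identity() if stride == 1 and in_ch == out_ch else
                     nn.Sequential(
                         nn.Conv1d(in_ch, out_ch, 1, stride=stride, bias=False),
                         nn.BatchNorm1d(out_ch)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The skip connection adds the input back, easing deep training
        return self.relu(self.body(x) + self.skip(x))
```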

3.3 Training

For training, IITNet used the Adam optimizer [kingma2014adam]. In order to avoid overfitting, L2 weight regularization was applied. Throughout the experiments, the batch size was 256 for Sleep-EDF and 128 for MASS. Early stopping was implemented by tracking the validation cost, i.e., the training was stopped when there was no improvement in the validation cost for ten training steps in a row. For each cross-validation, the model that achieved the best validation accuracy was used for evaluation on the test set. The training process was implemented in Python 3.5.0 and PyTorch 0.4.0 [paszke2017automatic].
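A minimal training-loop sketch matching this description, with Adam, L2 regularization via PyTorch's weight_decay, and patience-based early stopping; the learning rate, weight-decay value, and per-epoch validation schedule are assumptions, not the paper's settings:

```python
import itertools
import torch
import torch.nn as nn

def validation_cost(model, loader, loss_fn):
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            total += loss_fn(model(x), y).item() * len(y)
            n += len(y)
    return total / n

def train(model, train_loader, val_loader, patience=10):
    # weight_decay implements the L2 regularization; values are placeholders
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)
    loss_fn = nn.CrossEntropyLoss()
    best, stalls, best_state = float("inf"), 0, None
    for _ in itertools.count():
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        cost = validation_cost(model, val_loader, loss_fn)
        if cost < best:   # improvement resets the early-stopping counter
            best, stalls = cost, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            stalls += 1
            if stalls >= patience:  # ten non-improving checks in a row
                break
    model.load_state_dict(best_state)  # restore the best checkpoint
    return model
```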

3.4 Evaluation

For the evaluation of IITNet, k-fold cross-validation was conducted on both the Sleep-EDF and MASS datasets. At each fold, the data of a subset of the subjects was retained as the test set for model evaluation, and the remaining data was split into training and validation sets. The subjects in the test set were changed sequentially by repeating this process over all folds, so that the evaluation was consequently performed over all subjects. Specifically, 20-fold cross-validation was conducted for Sleep-EDF. At each fold, the records of a single subject were used as the test set, and the remainder was split into training and validation sets of 15 and 4 subjects, respectively. In other words, each subject was evaluated after the model had been trained and validated on the EEG signals of the other 19 subjects. By repeating this procedure 20 times, alternating the single subject that had not yet been evaluated, all the predicted sleep stages were collected to calculate the performance of the model with the stated criteria. For the MASS dataset, 31-fold cross-validation was carried out. At each fold, the records of two randomly selected subjects that did not overlap with the other folds were used as the test set, and the rest of the records were divided into training and validation sets of 45 and 15 subjects, respectively.
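A sketch of such a subject-wise split (our helper, with hypothetical names), which guarantees that no subject's records appear in more than one of the train/validation/test roles within a fold:

```python
import numpy as np

def subject_folds(subjects, k, n_val, seed=0):
    """Yield (train, val, test) subject lists for k subject-wise folds."""
    order = np.random.RandomState(seed).permutation(sorted(subjects))
    for test in np.array_split(order, k):
        held_out = set(test)
        rest = [s for s in order if s not in held_out]
        yield rest[n_val:], rest[:n_val], list(test)

# e.g. Sleep-EDF: 20 subjects, 20 folds -> 1 test, 4 val, 15 train per fold
for train, val, test in subject_folds(range(20), k=20, n_val=4):
    pass  # build datasets from each subject group and run one fold here
```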

The performance of IITNet was assessed by the following criteria: per-class precision (PR), per-class recall (RE), per-class F1-score (F1), overall accuracy (ACC), macro-averaged F1-score (MF1), and Cohen's kappa coefficient ($\kappa$) [cohen1960coefficient, sokolova2009systematic]. For a classification task, precision and recall are defined as in Eqs. 7 and 8:

$\mathrm{PR}_i = \dfrac{c_{ii}}{\sum_{j=1}^{N} c_{ji}}$ (7)
$\mathrm{RE}_i = \dfrac{c_{ii}}{\sum_{j=1}^{N} c_{ij}}$ (8)

where $c_{ij}$ is the element in the $i$-th row and $j$-th column of the confusion matrix and $N$ is the number of sleep stages, five in this study. The overall metrics are defined as:

$\mathrm{ACC} = \dfrac{\sum_{i=1}^{N} c_{ii}}{\sum_{i=1}^{N} \sum_{j=1}^{N} c_{ij}}$ (9)
$\mathrm{MF1} = \dfrac{1}{N} \sum_{i=1}^{N} \dfrac{2 \cdot \mathrm{PR}_i \cdot \mathrm{RE}_i}{\mathrm{PR}_i + \mathrm{RE}_i}$ (10)
$\kappa = \dfrac{p_o - p_e}{1 - p_e}$ (11)

where $p_o$ is the observed agreement (equal to ACC) and $p_e$ is the expected agreement by chance. Precision is the fraction of positive predictions that are actually positive over the total number of positive predictions, which represents how clearly the model distinguishes a sleep stage from the others. Recall is also a fraction of correct positive predictions, but it is calculated over the total number of actual positives; it represents how accurately the model predicts the sleep stage. Overall accuracy is the ratio of correct predictions to total predictions, which is an intuitive performance measure. However, the F1-score, calculated as the harmonic mean of precision and recall, can be more informative than overall accuracy in the case of an imbalanced class distribution such as PSG data. The average of the per-class F1-scores is the macro-averaged F1-score (MF1). Cohen's kappa ($\kappa$) indicates the agreement between the human experts' sleep stage scoring (truth) and the model's scoring (prediction).
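These definitions are easy to check numerically. The sketch below computes all the criteria from a confusion matrix whose rows are the true stages and columns the predicted stages, as in Tables 2-5; running it on Table 2 reproduces the Wake precision and recall (79.9, 86.2) and the one-to-one Sleep-EDF accuracy of 79.6 from Table 6.

```python
import numpy as np

def scoring_metrics(cm):
    """Per-class PR/RE/F1 plus ACC, MF1, and Cohen's kappa (Eqs. 7-11)."""
    cm = np.asarray(cm, dtype=float)
    pr = np.diag(cm) / cm.sum(axis=0)   # Eq. 7: correct / predicted-as-class
    re = np.diag(cm) / cm.sum(axis=1)   # Eq. 8: correct / actually-in-class
    f1 = 2 * pr * re / (pr + re)
    acc = np.trace(cm) / cm.sum()       # Eq. 9
    mf1 = f1.mean()                     # Eq. 10
    p_e = (cm.sum(0) * cm.sum(1)).sum() / cm.sum() ** 2  # chance agreement
    kappa = (acc - p_e) / (1 - p_e)     # Eq. 11
    return pr, re, f1, acc, mf1, kappa

table2 = [[7140, 505, 208, 27, 405],    # rows: truth; columns: prediction
          [596, 659, 592, 9, 948],
          [521, 199, 15294, 718, 1067],
          [214, 0, 643, 4840, 6],
          [466, 464, 1037, 7, 5743]]
pr, re, f1, acc, mf1, kappa = scoring_metrics(table2)
print(round(100 * pr[0], 1), round(100 * re[0], 1), round(100 * acc, 1))
# -> 79.9 86.2 79.6
```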

4 Results

4.1 Performance of IITNet in Sleep Stage Scoring

A hypnogram predicted by IITNet for one of the subjects is illustrated in Fig. 3. For each input sequence length $L$, the performance was measured by 20-fold cross-validation for Sleep-EDF and 31-fold cross-validation for MASS, as shown in Fig. 4. As the sequence length increases, the overall accuracy, MF1, and Cohen's kappa show similar variation patterns in Figs. 4(a) and 4(c). For Sleep-EDF, IITNet achieved its highest performance, with an overall accuracy of 84.0, an MF1 score of 77.7, and a Cohen's kappa of 0.78, when the sequence length was 11. For MASS, IITNet produced the best results, with an overall accuracy of 86.6, an MF1 score of 80.8, and a Cohen's kappa of 0.80, when the sequence length was 17.

         Prediction                              Metric
Truth    Wake   N1    N2     N3    REM     PR    RE    F1
Wake     7140   505   208    27    405     79.9  86.2  82.9
N1       596    659   592    9     948     36.1  23.5  28.5
N2       521    199   15294  718   1067    86.0  85.9  86.0
N3       214    0     643    4840  6       86.4  84.9  85.6
REM      466    464   1037   7     5743    70.3  74.4  72.3

Table 2: Confusion matrix of IITNet on Sleep-EDF (sequence length 1)

         Prediction                              Metric
Truth    Wake   N1    N2     N3    REM     PR    RE    F1
Wake     6781   601   227    78    208     90.1  85.9  87.9
N1       432    1148  684    18    522     49.3  40.9  44.7
N2       127    295   16105  656   616     85.7  90.5  88.0
N3       131    2     721    4846  3       86.5  85.0  85.7
REM      58     283   1058   6     6312    82.4  81.8  82.1

Table 3: Confusion matrix of IITNet on Sleep-EDF (sequence length 11)

         Prediction                              Metric
Truth    Wake   N1    N2     N3    REM     PR    RE    F1
Wake     4917   391   130    12    222     87.1  86.7  86.9
N1       437    1378  1297   0     1412    46.4  30.5  36.8
N2       143    550   26676  814   1029    89.6  91.3  90.5
N3       11     0     949    6607  0       88.9  87.3  88.1
REM      137    653   719    1     8910    77.0  85.5  81.0

Table 4: Confusion matrix of IITNet on MASS (sequence length 1)

         Prediction                              Metric
Truth    Wake   N1    N2     N3    REM     PR    RE    F1
Wake     4178   388   197    14    178     88.0  84.3  86.1
N1       394    2144  961    3     863     60.9  49.1  54.4
N2       89     506   26926  976   606     90.0  92.5  91.3
N3       2      0     1106   6452  0       86.7  85.3  86.0
REM      86     480   717    1     9136    84.7  87.7  86.2

Table 5: Confusion matrix of IITNet on MASS (sequence length 17)
Figure 3: Hypnogram comparison between a human expert and IITNet.
Figure 4: The performance of IITNet on the Sleep-EDF and MASS datasets when varying the sequence length: (a) criterion-wise performance in Sleep-EDF, (b) class-wise F1-score in Sleep-EDF, (c) criterion-wise performance in MASS, (d) class-wise F1-score in MASS.
Dataset    Article                              Architecture                   Channel        Subjects  Approach     Input Type            Seq. Length             ACC   kappa  MF1   W     N1    N2    N3    REM
Sleep-EDF  IITNet                               CRNN                           Fpz-Cz         20        Many-to-one  Raw                   11 (10 before)          84.0  0.78   77.7  87.9  44.7  88.0  85.7  82.1
Sleep-EDF  IITNet                               CRNN                           Fpz-Cz         20        One-to-one   Raw                   1                       79.6  0.72   71.1  82.9  28.5  86.0  85.6  72.3
Sleep-EDF  Tsinalis [tsinalis2016automatic]     Deep CNN                       Fpz-Cz         20        Many-to-one  Raw                   5 (2 before and after)  74.8  0.65   69.8  43.7  80.6  84.9  74.5  65.4
Sleep-EDF  Vilamala [vilamala2017deep]          Deep CNN w/ transfer learning  Fpz-Cz         20        Many-to-one  Time-frequency        5 (2 before and after)  81.3  0.74   76.5  80.9  47.4  86.2  86.2  81.9
Sleep-EDF  Supratak [supratak2017deepsleepnet]  Deep CNN + RNN                 Fpz-Cz         20        Many-to-one  Raw                   All                     82.0  0.76   76.9  84.7  46.6  89.8  84.8  82.4
Sleep-EDF  Phan [phan2018joint]                 Multitask 1-max CNN            Fpz-Cz         20        One-to-many  Time-frequency        1                       81.9  0.74   73.8  -     -     -     -     -
MASS       IITNet                               CRNN                           F4-EOG (left)  62        Many-to-one  Raw                   17 (16 before)          86.6  0.80   80.8  86.1  54.4  91.3  86.0  86.2
MASS       IITNet                               CRNN                           F4-EOG (left)  62        One-to-one   Raw                   1                       84.5  0.77   76.6  86.9  36.8  90.5  88.1  81.0
MASS       Dong [dong2018mixed]                 DNN + LSTM                     F4-EOG (left)  62        Many-to-one  Handcrafted features  5 (4 before)            85.9  0.79   80.5  84.6  56.3  90.7  84.8  86.1
MASS       Supratak [supratak2017deepsleepnet]  Deep CNN + RNN                 F4-EOG (left)  62        Many-to-one  Raw                   All                     86.1  0.79   80.2  87.3  52.6  90.3  81.5  89.3
MASS       Phan [phan2018joint]                 Multitask 1-max CNN            C4-A1          200       One-to-many  Time-frequency        1                       78.6  0.70   70.6  -     -     -     -     -

Table 6: Performance comparison between IITNet and the state-of-the-art deep-learning methods for automated sleep stage scoring on the Sleep-EDF and MASS datasets. Class-wise performances (W, N1, N2, N3, REM) are per-class F1-scores. "All" indicates that the sequence length is the number of epochs over one night.

The influence of the sequence length on the per-class F1-scores was also investigated, as shown in Figs. 4(b) and 4(d). The confusion matrices of IITNet for a sequence length of 1 and for the optimal sequence length (11 for Sleep-EDF, 17 for MASS) are shown in Tables 2, 3, 4, and 5, together with the per-class precision, recall, and F1-score. In both datasets, the performance for N1 and REM tended to improve as the sequence length increased. The highest F1-scores were 82.1 for REM and 44.7 for N1 in Sleep-EDF, and 86.2 for REM and 54.4 for N1 in MASS. In particular, when the sequence length was changed from 1 to 3, the overall performance metrics and the F1-scores of N1 and REM improved greatly. For the sleep stages Wake, N2, and N3, the F1-scores also reached their highest values at the optimal sequence length, although their prediction was less sensitive to the sequence length. The performance did not improve significantly when the sequence length was raised beyond the optimal value.

4.2 Performance Comparison with State-of-the-Art Methods

The performance of IITNet was compared with that of state-of-the-art studies that automate sleep stage scoring with deep learning approaches on the same datasets, as shown in Table 6. For Sleep-EDF, the overall accuracy, Cohen's kappa, MF1, and per-class F1-scores were computed from the aggregated confusion matrix of the 20-fold cross-validation in order to compare the performance fairly. The performance of Tsinalis et al. [tsinalis2016automatic], who used the bootstrap for evaluation, was recomputed from the aggregated confusion matrix in their article. Table 6 lists the performance of the compared algorithms together with the model architecture, approach, input channel, number of subjects, input type, and the number of epochs used at once to score the target epoch. For MASS, the performance metrics were calculated from the aggregated confusion matrix of the 31-fold cross-validation for a fair comparison. The results of Phan et al. [phan2018joint], who used 20-fold cross-validation and 200 subjects, were cited without using the aggregated confusion matrix due to the lack of class-wise F1-scores.

On Sleep-EDF, IITNet outperformed all the other state-of-the-art results in terms of overall accuracy, Cohen's kappa, and MF1 (ACC: 84.0 (+2.0), $\kappa$: 0.78 (+0.02), MF1: 77.7 (+0.8)) when the sequence length was 11. On MASS, IITNet achieved the best performance among the state-of-the-art algorithms using a single-channel EEG (ACC: 86.6 (+0.5), $\kappa$: 0.80 (+0.01), MF1: 80.8 (+0.3)). Even when using a single epoch as input, IITNet achieved performance comparable to the state-of-the-art methods on both Sleep-EDF and MASS, with F1-scores for Wake (82.9 and 86.9), N2 (86.0 and 90.5), and N3 (85.6 and 88.1), while the F1-scores for N1 (28.5 and 36.8) and REM (72.3 and 81.0) were relatively low.

5 Discussion

Human sleep experts find sleep-related events in sub-epochs and take the transition rules into account to identify the sleep stage of an epoch [berry2012aasm]. Motivated by this, we proposed a novel deep learning model, named IITNet, to score the sleep stage more accurately by considering the intra- and inter-epoch temporal context using raw single-channel EEG signals. The results support that considering the temporal context within an epoch as well as between epochs, by adopting a CRNN with shared parameters, can improve the performance of automated sleep scoring without a frequency-based approach. Furthermore, the experiment on the influence of the sequence length shows that there is a certain range of optimal sequence lengths that achieves the best performance. The class-wise performance metrics suggest that the proposed model can learn the AASM scoring rules for each sleep stage, since both the intra- and inter-epoch temporal contexts are considered. The comparison with the state-of-the-art results supports that IITNet is more accurate yet less complex than the other models.

In the experiment on the influence of the sequence length, the performance metrics increased significantly, particularly as the sequence length grew from one to three, in both datasets. The performance then reached its best results at a certain sequence length: 11 for Sleep-EDF and 17 for MASS. It was not enhanced further even when the sequence length increased beyond these values. This shows that the temporal context of neighboring epochs influences the scoring of the target epoch more strongly the closer they are to the target; indeed, it can be disadvantageous to consider previous epochs too far away from the target epoch.

The increase in overall performance is attributed to the improvement in the prediction of N1 and REM, whose F1-scores changed significantly as the sequence length increased. This supports that considering the inter-epoch temporal context improves sleep scoring performance and that IITNet follows the AASM scoring rules well. In fact, the AASM recommends that sleep experts consider the target epoch and its previous epochs, especially when labeling the sleep stage as N1 or REM. For example, a target epoch with low-amplitude, mixed-frequency (LAMF) EEG activity can be scored as N1 or REM if one of the previous epochs distinctly meets the criteria for N1 or REM and there is LAMF EEG activity without evidence for the other sleep stages.

On the other hand, the performance for the sleep stages Wake, N2, and N3 was at a state-of-the-art level regardless of the sequence length, although it generally improved as the sequence length increased. This indicates that Wake, N2, and N3 have relatively less inter-epoch dependency than the other sleep stages. According to the AASM, for a target epoch to be scored as Wake, N2, or N3, the sleep-related events or specific EEG activities of the target epoch itself are considered mainly, while the relations between the target epoch and its previous epochs must also be considered. Thus, this result supports that considering the intra-epoch temporal context as well as the inter-epoch context is essential to classify Wake, N2, and N3 correctly.

Several studies have employed recurrent neural networks to learn the inter-epoch temporal context and thereby account for the transition rules in sleep scoring. DeepSleepNet used an RNN to analyze the relations between the epoch-wise features extracted by a CNN [supratak2017deepsleepnet]. However, DeepSleepNet did not consider the temporal context within the epoch, so the intra-epoch temporal context may not have been considered explicitly. In contrast, IITNet achieved performance similar to DeepSleepNet on the sleep stages Wake, N2, and N3 even when only the intra-epoch temporal context was considered, and it outperformed DeepSleepNet when the inter-epoch temporal context was analyzed together. This result shows that considering both the intra- and inter-epoch temporal contexts, with an explicit mechanism that analyzes the input signals at the sub-epoch level via a CRNN, contributes to the improvement in deep-learning-based sleep stage scoring.

SeqSleepNet showed that time-frequency feature extraction based on filter-bank layers and RNNs is advantageous for scoring sleep stages from multi-channel PSG data [phan2018seqsleepnet]. SeqSleepNet used a hierarchical RNN, comprising epoch-wise and sequence-wise RNNs, and captured the temporal dynamics in the feature vectors generated by the filter-bank layers. Although both the intra- and inter-epoch temporal contexts were considered directly, SeqSleepNet used a spectrogram in the time-frequency domain instead of the raw time-domain signals as input. This approach might lose information needed to learn general features, since the transform of the raw signals from the time domain into the time-frequency domain still depends on hand-engineered signal processing. In contrast, IITNet learns the sub-epoch features and exploits the whole single-channel raw EEG signal with a deep CNN, without requiring intricate signal processing.

In addition, despite the imbalanced nature of PSG datasets, IITNet does not need any sampling method or special learning technique to achieve state-of-the-art performance. PSG datasets commonly have an imbalanced class distribution; e.g., N1 epochs account for less than 10% of all epochs. To account for this, cost-sensitive learning or class-balanced sampling has been applied [stephansen2017use, biswal2017sleepnet, dong2018mixed, supratak2017deepsleepnet, tsinalis2016automatic]. These techniques can improve the identification of N1 and N3, but the overall performance can deteriorate [sors2018convolutional], and so far such class-balancing methods have not guaranteed improvement across all classes. Without sampling methods, IITNet achieved state-of-the-art performance for all sleep stages by simply increasing the sequence length to learn the transition rules properly, which implies that considering both the intra- and inter-epoch temporal contexts is what matters. Furthermore, IITNet does not need pre-training, fine-tuning, or transfer learning, and thus its training procedure is much simpler.

6 Conclusions

A novel deep learning architecture, named IITNet, was proposed to score sleep stages automatically by considering the intra-epoch temporal context as well as the inter-epoch temporal context. The proposed model is an end-to-end architecture consisting of convolutional and recurrent layers. In the convolutional layers, a modified ResNet-50 was implemented to extract the representative feature from each sub-epoch. The series of feature sequences was fed into two layers of BiLSTM in the recurrent layers to capture the contextual information not only within an epoch (intra-epoch) but also between successive epochs (inter-epoch). The experiments on single-channel EEG signals from Sleep-EDF and MASS showed that the proposed model predicts the sleep stage well from single-channel raw EEG signals without any hand-crafted feature extraction or time-frequency transform. IITNet achieved state-of-the-art performance in terms of overall accuracy, Cohen's kappa, and MF1 at the optimal sequence length. The improvement is attributed to the architecture, which was designed to learn the representative features of the sub-epochs and analyze their relations according to the AASM sleep scoring rules.

Acknowledgments

This work was supported by the Institute of Integrated Technology (IIT) Research Project through a grant provided by GIST in 2018.