Automate Obstructive Sleep Apnea Diagnosis Using Convolutional Neural Networks

06/13/2020 ∙ by Longlong Feng, et al. ∙ Laurier 2

Identifying sleep problem severity from overnight polysomnography (PSG) recordings plays an important role in diagnosing and treating sleep disorders such as Obstructive Sleep Apnea (OSA). Traditionally, specialists perform this analysis manually through visual inspection, which is tedious, time-consuming, and prone to subjective errors. One solution is to use Convolutional Neural Networks (CNN), in which the convolutional and pooling layers act as feature extractors and fully-connected (FCN) layers make the final prediction of OSA severity. In this paper, a CNN architecture with 1D convolutional and FCN layers for classification is presented. The PSG data for this project are from the Cleveland Children's Sleep and Health Study database, and the classification results confirm the effectiveness of the proposed CNN method. The proposed 1D CNN model achieves test accuracies between 91.99% and 98.97% across channels without manual preprocessing of the PSG signals, such as feature extraction and feature reduction.




1 Introduction and Background

When we sleep, our muscles relax. In patients with Obstructive Sleep Apnea (OSA), the muscles at the back of the throat can relax too much and collapse the airway, leading to breathing difficulty. OSA presents with abnormal oxygenation, ventilation and sleep patterns. The prevalence of OSA has been reported to be between 1% and 5% Dehlink and Tan (2016). Children at risk need timely investigation and treatment.

The gold standard for diagnosing sleep disorders is polysomnography (PSG), which generates extensive data about biophysical changes during sleep. PSG studies help doctors diagnose sleep disorders and provide the baseline for appropriate follow-up. A clinical sleep study based on PSG acquires several biological signals while patients are sleeping. These signals typically include electroencephalography (EEG) for monitoring brain activity, electromyography (EMG) for measuring muscle activity, and electrocardiography (ECG) for recording the electrical activity of the heart over a period of sleep Moridani et al. (2019).

In recent decades, various alternative methods have been proposed to minimize the number of biosignals required to detect and classify OSA. These studies include traditional machine learning methods such as Support Vector Machines and linear discriminant analysis applied to signals such as ECG Almuhammadi et al. (2015) and respiratory signals Varon et al. (2015), as well as a combination of extracted features and a shallow neural network applied to heart rate variability and the ECG-derived respiration signal Tripathy (2018). These studies focused on extracting time-domain, frequency-domain, and other nonlinear features from physiological signals and applying feature selection techniques to reduce the dimensionality of the feature space. However, this process can be labour-intensive, requires domain knowledge, and is particularly limited and costly for high-dimensional data. In addition, feature extraction becomes difficult for traditional machine learning techniques as the number of features increases dramatically.

Deep learning frameworks have proved their modeling ability on different PSG channels. McCloskey et al. employed a 2D-CNN model on spectrograms of the nasal airflow signal, and their model achieved an average accuracy of 77.6% on three severity levels McCloskey et al. (2018). A stronger result came from the work of Cheng et al., in which the researchers used a four-layer Long Short-Term Memory (LSTM) model on the RR-ECG signal and achieved an average accuracy of 97.80% on the detection of OSA Cheng et al. (2017).

Although recurrent models (e.g., RNN, LSTM) can process time-series data and make sequential predictions, a CNN can be trained to recognize the same patterns (severity levels) in different subfields within fixed time windows. A CNN saves the time spent on manual scoring in the laboratory environment and makes the pre-screening stage easier than with traditional methods. Moreover, to improve the generalization ability of the model, we explored 1D-CNN models with different segment lengths on the EEG, ECG, EMG and respiratory channels. We focused on the model structure and used the fine-tuned model for pediatric OSA prediction in our study.

The rest of this paper is organized as follows. Section 2 explains the data processing in detail. Section 3 presents the structure of the proposed 1D-CNN model. Evaluation and experimental results are presented in Section 4. Finally, Section 5 gives the discussion and conclusion of the research.

2 Cleveland Children’s Sleep and Health Study Database

The data are retrieved from the National Sleep Research Resource (NSRR), a new National Heart, Lung, and Blood Institute resource designed to provide big data resources to the sleep research community. The PSG data come from the Cleveland Children’s Sleep and Health Study (CCSHS) database. Each anonymized record includes a summary result of a 12-hour overnight sleep study (awake and sleep stages), together with annotation files containing the scored events and the PSG signals, stored in the European Data Format (EDF).

The following channels are selected for the 1D CNN Modeling: 4 EEG channels (C3/C4 and A1/A2), 3 EMG channels (EMG1, EMG2, EMG3), 2 ECG channels (ECG1 and ECG2), and 3 respiratory channels including airflow, thoracic and abdominal breathing.

2.1 Individual Labeling

To define the target variable for this classification problem, each participant needs one label based on the OSA severity level. The Obstructive Apnea Hypopnea Index (oahi3) is used to indicate the severity of sleep apnea. It is the number of apnea and hypopnea events per hour of sleep. It combines AHI and oxygen desaturation to give an overall sleep apnea severity score that evaluates both the number of sleep disruptions and the degree of oxygen desaturation (low oxygen level in the blood). The values of oahi3 are used as the thresholds for grouping the participants. The number of participants at each severity level is shown in Table 1.

| Obstructive Apnea Hypopnea Index (oahi3) | Level of Severity | Number of Participants |
|---|---|---|
| 0 to 1 | NL (Normal) | 362 |
| 1 to 5 | MIN (Minor) | 139 |
| 5 to 10 | MOD (Moderate) | 8 |
| 10 and above | SV (Severe) | 8 |

Table 1: Grouping participants using oahi3 values
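
The grouping in Table 1 can be sketched as a small lookup. The function name `severity_label` and the treatment of boundary values (each lower bound inclusive, each upper bound exclusive) are assumptions made for illustration; the table itself only fixes the thresholds 1, 5 and 10.

```python
import bisect

# Thresholds on oahi3 (events per hour of sleep), taken from Table 1.
THRESHOLDS = [1.0, 5.0, 10.0]
LABELS = ["NL", "MIN", "MOD", "SV"]


def severity_label(oahi3: float) -> str:
    """Map an oahi3 value to a severity label.

    Assumption: lower bounds are inclusive and upper bounds exclusive,
    so oahi3 == 5.0 is labeled MOD.
    """
    if oahi3 < 0:
        raise ValueError("oahi3 must be non-negative")
    # bisect_right counts how many thresholds are <= oahi3.
    return LABELS[bisect.bisect_right(THRESHOLDS, oahi3)]


for value in (0.3, 2.5, 7.0, 12.4):
    print(value, severity_label(value))
```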

The dataset has an imbalanced response variable (362 normal / 139 minor / 8 moderate / 8 severe). The minority classes (moderate and severe) are of most interest to us, so we wanted the classifier to learn more from the moderate and severe data. An under-sampling method was therefore applied during the data pre-processing stage: we randomly selected the same number of participants (8) from each of the normal and minor groups. Overall, there are 32 participants in the final study data set. In this project, we conduct data pre-processing and CNN modeling on the data in EDF format, which have a total size of 13 GB.
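
As a minimal sketch of the under-sampling step, the snippet below draws 8 participants at random from every severity group. The participant identifiers, the `undersample` helper and the fixed seed are hypothetical; only the group sizes and the target of 8 per class come from the text.

```python
import random


def undersample(participants_by_class, per_class, seed=0):
    """Randomly keep at most `per_class` participants from each class."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    selected = {}
    for label, ids in participants_by_class.items():
        k = min(per_class, len(ids))
        selected[label] = sorted(rng.sample(sorted(ids), k))
    return selected


# Hypothetical participant ids; the group sizes follow Table 1.
counts = {"NL": 362, "MIN": 139, "MOD": 8, "SV": 8}
participants = {
    label: [f"{label}-{i:03d}" for i in range(n)] for label, n in counts.items()
}

balanced = undersample(participants, per_class=8)
total = sum(len(ids) for ids in balanced.values())
print(total)  # 32 participants, 8 per severity level
```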

2.2 Data Preprocessing

This experiment focuses on the sleep data. First, the awake signals at the beginning and end of each recording can be treated as noise and need to be removed. Second, deep learning algorithms tend to be difficult to train when the time series is very long. Figure 1 presents a segmentation strategy, i.e., dividing the time series into smaller chunks.

Each segment was labeled with the same severity level as the participant; in other words, the segments inherit the severity label of the participant they belong to. Starting from a channel of L time steps, the channel is divided into blocks of length Seq_L, yielding about L / Seq_L new events (or rows) of shorter length; N denotes the resulting number of events.

Figure 1: Demonstration of Channel Division

The PSG data were segmented into 1-minute events. For the ECG channel (sampling frequency of 256 Hz), a 1-minute event has a length of 15360 (256 × 60) data points. For example, an individual with an 8.24-hour ECG recording has a 1D time series of length 7595520. After segmentation, this long series becomes a tensor of dimension 494 × 15360, i.e., 494 events of length 15360 each. Since we have 32 selected participants and 2 ECG channels per participant, the input tensor has dimension 15824 (N) × 15360 (Seq_L) × 2 (channels).
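
The segmentation arithmetic for one participant can be reproduced with a short NumPy sketch. The `segment` helper, the zero-filled placeholder signals and the integer label are hypothetical; dropping the incomplete trailing chunk is an assumption, chosen because it matches the 494 events reported for a series of length 7595520.

```python
import numpy as np


def segment(signal, seq_len):
    """Split a 1D signal into rows of length seq_len.

    Assumption: the incomplete chunk at the end of the recording is dropped.
    """
    n_events = len(signal) // seq_len
    return signal[: n_events * seq_len].reshape(n_events, seq_len)


fs = 256                      # ECG samples per second
seq_len = fs * 60             # 15360 samples in a 1-minute event
ecg1 = np.zeros(7595520, dtype=np.float32)   # placeholder for channel ECG1
ecg2 = np.zeros(7595520, dtype=np.float32)   # placeholder for channel ECG2

events = segment(ecg1, seq_len)
print(events.shape)           # (494, 15360)

# Stack both ECG channels on the last axis: (events, Seq_L, channels).
stacked = np.stack([events, segment(ecg2, seq_len)], axis=-1)

# Every event inherits the participant's severity label (hypothetical: 2 = MOD).
labels = np.full(len(stacked), 2)
```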

With data segmentation, each time series is shorter, which helps model training, and the number of instances (rows) increases by a factor of about L / Seq_L, providing a larger data set to train on.

Since different channels (e.g., ECG, EMG) were measured at different amplitudes, the last step of data processing is to normalize the PSG data to zero mean and unit standard deviation.
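
A minimal sketch of this normalization is shown below. Whether the statistics are computed per channel, per participant, or over the whole data set is not stated, so computing one mean and one standard deviation per channel over all events is an assumption, as are the helper name `zscore_per_channel` and the synthetic data.

```python
import numpy as np


def zscore_per_channel(x, eps=1e-8):
    """Normalize a (N, Seq_L, channels) tensor to zero mean, unit std per channel."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    std = x.std(axis=(0, 1), keepdims=True)
    return (x - mean) / (std + eps)   # eps guards against a constant channel


# Synthetic data: two channels with very different offsets and amplitudes.
rng = np.random.default_rng(0)
x = rng.normal(loc=[5.0, -2.0], scale=[10.0, 0.1], size=(4, 100, 2))

z = zscore_per_channel(x)
print(z.mean(axis=(0, 1)), z.std(axis=(0, 1)))
```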

3 1D-CNN Architecture

The convolutional layer and the max-pooling layer play the key roles in the CNN’s feature extraction mechanism. The output of the convolutional layer at the $l$-th layer can be calculated as in Formula 1:

$$x_k^{l} = \sum_{c=1}^{C} w_{k,c}^{l} * x_c^{l-1} + b_k^{l} \qquad (1)$$

where $k$ represents the filter number, $c$ denotes the channel number of the input $x^{l-1}$ (with $C$ input channels in total), $w_{k,c}^{l}$ is the convolutional filter applied to the $c$-th channel, $b_k^{l}$ is the bias of the $k$-th filter, and $*$ is the dot product operation.
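
To make the convolution concrete, here is a small NumPy sketch of a 1D convolutional layer without padding: for every filter, each output value is the dot product between the filter and a window of the input, summed over the input channels, plus the filter's bias. The function name, array layout and toy values are assumptions for illustration, and the activation function is left out.

```python
import numpy as np


def conv1d_valid(x, w, b, stride=1):
    """Naive 1D convolution without padding.

    x: (length, in_channels), w: (n_filters, kernel, in_channels), b: (n_filters,)
    """
    length, _ = x.shape
    n_filters, kernel, _ = w.shape
    out_len = (length - kernel) // stride + 1
    out = np.empty((out_len, n_filters))
    for k in range(n_filters):
        for t in range(out_len):
            window = x[t * stride : t * stride + kernel, :]
            # Dot product over the window and all input channels, plus bias.
            out[t, k] = np.sum(window * w[k]) + b[k]
    return out


x = np.arange(12, dtype=float).reshape(6, 2)   # 6 time steps, 2 channels
w = np.ones((1, 3, 2))                         # one filter of size 3
b = np.array([0.5])

y = conv1d_valid(x, w, b)
print(y[:, 0])   # [15.5 27.5 39.5 51.5]
```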

The max-pooling layer is a sub-sampling function that selects the maximum value within a fixed-size filter. After the convolution-pooling blocks comes one fully connected layer, whose neurons are connected to all activations in the previous layer, as in regular neural networks. At the end of the convolutional layers, the data are flattened and passed to the Dropout layer before the softmax classifier.

Figure 2 shows the structure of the 1D CNN model proposed in this project. It contains 3 convolutional and 3 max-pooling layers. We focused our efforts on building the CNN and began the investigation with a grid search over several hyperparameters.
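
Max-pooling on a single channel can be illustrated in a few lines of NumPy. The helper name `max_pool1d` and the toy signal are hypothetical; the sketch assumes no padding.

```python
import numpy as np


def max_pool1d(x, pool_size, stride):
    """Keep the maximum of each window of `pool_size` values (no padding)."""
    out_len = (len(x) - pool_size) // stride + 1
    return np.array(
        [x[t * stride : t * stride + pool_size].max() for t in range(out_len)]
    )


x = np.array([1, 3, 2, 5, 4, 0, 6, 1])
pooled = max_pool1d(x, pool_size=2, stride=2)
print(pooled)   # [3 5 4 6]
```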

Figure 2: The Proposed 1D-CNN Architecture

Each participant's PSG data served as either training or test data, never both. We implemented a two-level stratified random sampling with 2 splitting steps among the 32 participants: first, 8 were randomly selected as test participants (2 participants for each severity level); second, the remaining 24 participants were split into two groups, 18 participants for the training set and 6 participants for the validation set. The TensorFlow graph was fed with batches of the training data, and the hyperparameters were tuned on the validation set. Finally, the trained model was evaluated on the test set.

The CNN model was trained in a fully supervised manner, and the gradients were back-propagated from the softmax layer to the convolutional layers. The network parameters were optimized by minimizing the cross-entropy loss function using gradient descent with the Adam update rule and a learning rate of 0.0001.
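
The participant-level split can be sketched as follows. The participant identifiers, the helper `split_participants` and the seed are hypothetical. The test selection is stratified by severity level as described; the text does not say how the remaining 24 participants were divided, so a plain random 18/6 split is assumed here.

```python
import random


def split_participants(ids_by_class, n_test_per_class=2, n_val=6, seed=0):
    """Split participants (not segments) into train, validation and test sets."""
    rng = random.Random(seed)
    test, remaining = [], []
    for label in sorted(ids_by_class):
        ids = sorted(ids_by_class[label])
        rng.shuffle(ids)
        test += ids[:n_test_per_class]        # stratified: same count per class
        remaining += ids[n_test_per_class:]
    rng.shuffle(remaining)                    # assumption: plain random split
    return remaining[n_val:], remaining[:n_val], test


ids_by_class = {
    label: [f"{label}-{i}" for i in range(8)] for label in ("NL", "MIN", "MOD", "SV")
}
train, val, test = split_participants(ids_by_class)
print(len(train), len(val), len(test))   # 18 6 8
```

Splitting by participant before segmenting keeps all 1-minute events of one person in a single set, so no participant contributes to both training and evaluation.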
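
The loss and update rule named above can be sketched in NumPy. The toy logits and labels are made up, the Adam constants other than the learning rate (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the commonly used defaults and are an assumption, and the logits are treated as the trainable parameters purely for demonstration.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))


def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v


logits = np.array([[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0]])
labels = np.array([0, 3])                      # class indices for NL and SV

probs = softmax(logits)
loss = cross_entropy(probs, labels)

# Gradient of the mean cross-entropy with respect to the logits.
one_hot = np.eye(4)[labels]
grad = (probs - one_hot) / len(labels)

new_logits, m, v = adam_step(logits, grad, np.zeros_like(logits), np.zeros_like(logits), t=1)
new_loss = cross_entropy(softmax(new_logits), labels)
print(loss, new_loss)
```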

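| CNN Layer | Number of Filters | Filter Size | Stride | Padding | Activation Function |
|---|---|---|---|---|---|
| Conv 1 | 46 | 10 | 2 | No | ReLU |
| Pooling 1 | - | 10 | 2 | No | - |
| Conv 2 | 92 | 10 | 2 | No | ReLU |
| Pooling 2 | - | 10 | 2 | No | - |
| Conv 3 | 184 | 20 | 2 | No | ReLU |
| Pooling 3 | - | 20 | 5 | No | - |

Table 2: CNN model structure with optimal parameters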
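
As a rough check of how quickly the layers in Table 2 shrink a 1-minute ECG event, the sketch below applies the usual no-padding output-length formula, floor((length - filter size) / stride) + 1, layer by layer. It assumes that "No" padding means no zero-padding at all and that the input has 15360 time steps; the resulting sizes are derived here and are not reported in the paper.

```python
def out_len(length, kernel, stride):
    """Output length of a 1D convolution or pooling layer without padding."""
    return (length - kernel) // stride + 1


# (layer name, filter size, stride) as listed in Table 2.
LAYERS = [
    ("Conv 1", 10, 2),
    ("Pooling 1", 10, 2),
    ("Conv 2", 10, 2),
    ("Pooling 2", 10, 2),
    ("Conv 3", 20, 2),
    ("Pooling 3", 20, 5),
]

length = 15360          # one 1-minute ECG event at 256 Hz
lengths = []
for name, kernel, stride in LAYERS:
    length = out_len(length, kernel, stride)
    lengths.append(length)
    print(f"{name}: {length} time steps")

# The last convolution has 184 filters, so the flattened vector has:
flattened = length * 184
print(flattened)        # 16560
```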

Table 2 presents the final values of the parameters within each layer. The dropout rate was set to the value generally used for CNN models. Model classification performance is evaluated using the following metrics: classification accuracy, cross-entropy loss, precision, recall and F1-score. While accuracy and loss evaluate the overall performance, the other metrics measure the performance on a specific class.

4 Results and Analysis

Figure 3 shows the learning curves for the training and validation phases. Accuracy and loss were recorded over the training iterations. The accuracy increases as the number of iterations increases, while the loss decreases. In both phases, the accuracy and the loss reach stable values after iterative learning.

Figure 3: Accuracy and Loss of the Proposed CNN model for OSA Detection

For ECG, the accuracy and loss are stable after 1000 iterations (Training acc: 0.9987, loss: 0.0114; Validation acc: 0.9916, loss: 0.0289). For EEG, the accuracy and the loss start to converge after 2500 iterations (Training acc: 0.9718, loss: 0.0945; Validation acc: 0.9447, loss: 0.1985). For EMG, the accuracy and the loss become stable after 4000 iterations (Training acc: 0.9999, loss: 0.0013; Validation acc: 0.9707, loss: 0.1131). However, there are many large fluctuations before convergence during the learning process. These can be attributed to randomness in training: (1) the Dropout method causes the network to keep only a portion of the neurons (weights) on each iteration, and sometimes those neurons do not fit the current batch well, which may cause large fluctuations; (2) there is randomness in the initialization and in the data sampling for SGD in back-propagation.

For Respiratory, the training and validation accuracy both level off at stable values, with the validation accuracy somewhat lower, indicating slight overfitting in the classification (Training acc: 0.9854, loss: 0.0378; Validation acc: 0.9180, loss: 0.2945).

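| Channels (#) | Dataset | Accuracy | Loss | Class | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|---|
| ECG (2) | Training | 0.9987 | 0.0114 | NL | 0.9997 | 0.9997 | 0.9994 |
| | | | | MIN | 0.9980 | 0.9980 | 0.9980 |
| | | | | MOD | 0.9982 | 0.9982 | 0.9982 |
| | | | | SV | 0.9988 | 0.9994 | 0.9991 |
| | Test | 0.9897 | 0.0289 | NL | 0.9862 | 0.9921 | 0.9891 |
| | | | | MIN | 0.9990 | 0.9773 | 0.9880 |
| | | | | MOD | 0.9894 | 0.9961 | 0.9927 |
| | | | | SV | 0.9843 | 0.9940 | 0.9891 |
| EEG (4) | Training | 0.9718 | 0.0945 | NL | 0.9753 | 0.9741 | 0.9747 |
| | | | | MIN | 0.9784 | 0.9820 | 0.9802 |
| | | | | MOD | 0.9721 | 0.9684 | 0.9703 |
| | | | | SV | 0.9609 | 0.9621 | 0.9615 |
| | Test | 0.9463 | 0.1985 | NL | 0.9394 | 0.9587 | 0.9490 |
| | | | | MIN | 0.9415 | 0.9741 | 0.9575 |
| | | | | MOD | 0.9682 | 0.9166 | 0.9417 |
| | | | | SV | 0.9373 | 0.9354 | 0.9363 |
| EMG (3) | Training | 0.9999 | 0.0013 | NL | 1.0000 | 1.0000 | 1.0000 |
| | | | | MIN | 1.0000 | 0.9997 | 0.9999 |
| | | | | MOD | 0.9997 | 1.0000 | 0.9999 |
| | | | | SV | 1.0000 | 1.0000 | 1.0000 |
| | Test | 0.9581 | 0.1132 | NL | 0.9518 | 0.9312 | 0.9414 |
| | | | | MIN | 0.9660 | 0.9601 | 0.9631 |
| | | | | MOD | 0.9823 | 0.9712 | 0.9767 |
| | | | | SV | 0.9329 | 0.9696 | 0.9509 |
| Respiratory (3) | Training | 0.9854 | 0.0378 | NL | 0.9857 | 0.9857 | 0.9857 |
| | | | | MIN | 0.9834 | 0.9849 | 0.9842 |
| | | | | MOD | 0.9895 | 0.9880 | 0.9888 |
| | | | | SV | 0.9828 | 0.9828 | 0.9828 |
| | Test | 0.9199 | 0.2945 | NL | 0.9147 | 0.9147 | 0.9147 |
| | | | | MIN | 0.9447 | 0.9053 | 0.9246 |
| | | | | MOD | 0.9323 | 0.9194 | 0.9258 |
| | | | | SV | 0.8899 | 0.9408 | 0.9147 |

Table 3: The CNN Evaluation Metrics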

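| Channel | Class | Training: NL | Training: MIN | Training: MOD | Training: SV | Test: NL | Test: MIN | Test: MOD | Test: SV |
|---|---|---|---|---|---|---|---|---|---|
| ECG | NL | 3321 | 0 | 1 | 2 | 1000 | 0 | 3 | 5 |
| | MIN | 0 | 3511 | 5 | 2 | 10 | 1032 | 6 | 8 |
| | MOD | 1 | 5 | 3378 | 0 | 1 | 0 | 1022 | 3 |
| | SV | 0 | 2 | 0 | 3340 | 3 | 1 | 2 | 1000 |
| EEG | NL | 3239 | 14 | 13 | 59 | 976 | 12 | 4 | 26 |
| | MIN | 9 | 3445 | 34 | 20 | 7 | 1014 | 11 | 9 |
| | MOD | 15 | 40 | 3279 | 52 | 28 | 30 | 945 | 28 |
| | SV | 58 | 22 | 47 | 3222 | 28 | 21 | 16 | 941 |
| EMG | NL | 3546 | 0 | 0 | 0 | 731 | 14 | 9 | 31 |
| | MIN | 0 | 3745 | 1 | 0 | 11 | 795 | 5 | 17 |
| | MOD | 0 | 0 | 3601 | 0 | 9 | 7 | 775 | 7 |
| | SV | 0 | 0 | 0 | 3571 | 17 | 7 | 0 | 765 |
| Respiratory | NL | 3714 | 18 | 14 | 22 | 461 | 10 | 14 | 19 |
| | MIN | 16 | 3912 | 13 | 31 | 10 | 478 | 14 | 26 |
| | MOD | 17 | 17 | 3786 | 12 | 20 | 7 | 468 | 14 |
| | SV | 21 | 31 | 13 | 3723 | 13 | 11 | 6 | 477 |

Table 4: Confusion matrices from the CNN model on training and test data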

The evaluation metrics and confusion matrices for all channels on the training and test data are presented in Tables 3 and 4, respectively. The results from Table 4 are summarized in Table 3. Table 3 shows that, on the test data, the CNN model achieves an accuracy of 98.97% for ECG, 94.63% for EEG, 95.81% for EMG, and 91.99% for Respiratory. The training curves in Figure 3 can also be verified against the training accuracy in Table 3 and the classification results in Table 4. Furthermore, the precision, recall and F1-score for each class are collected in Table 3.
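
The link between the two tables can be checked with a short NumPy sketch that recomputes accuracy, precision, recall and F1-score from the ECG test confusion matrix in Table 4. Treating rows as the actual class and columns as the predicted class is an assumption; under it, the computed values agree with the ECG test rows of Table 3 to four decimal places.

```python
import numpy as np

CLASSES = ["NL", "MIN", "MOD", "SV"]

# ECG test confusion matrix from Table 4
# (assumed layout: rows = actual class, columns = predicted class).
ecg_test = np.array([
    [1000,    0,    3,    5],
    [  10, 1032,    6,    8],
    [   1,    0, 1022,    3],
    [   3,    1,    2, 1000],
])


def per_class_metrics(cm):
    """Return per-class precision, recall and F1-score from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)   # correct / everything predicted as the class
    recall = tp / cm.sum(axis=1)      # correct / everything actually in the class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = np.trace(ecg_test) / ecg_test.sum()
precision, recall, f1 = per_class_metrics(ecg_test)

print(f"accuracy: {accuracy:.4f}")
for name, p, r, f in zip(CLASSES, precision, recall, f1):
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```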

For ECG, all three metrics in Table 3 are at least 0.9980 for every class on the training data and at least 0.9773 on the test data. For EEG, the scores are at least 0.9609 on the training data and at least 0.9166 on the test data.

For EMG, scores of 0.9997 or higher are obtained on all classes in the training phase, which means near-perfect classification of the training data during the learning process, while scores of at least 0.9312 are obtained on the test data.

Similarly, for Respiratory, the CNN achieves scores of at least 0.9828 on the training data and slightly lower scores, all at least 0.8899, on the test data. One possible reason for the gap between the training and test scores is that the respiratory sensors differ from the ECG, EEG and EMG sensors: the respiratory signal may not be sensitive enough to capture small changes when OSA occurs. Table 4 displays the classification details on the training and test data.

5 Conclusion and Discussion

First, with the correct hyperparameter setup, our 1D-CNN model can successfully extract temporal features from the PSG data and achieve high performance in OSA detection on different channels. Second, our well-trained CNN model can be an efficient tool for clinicians to identify OSA severity without manually going through large volumes of PSG data. Furthermore, our CNN models can replace traditional data processing such as signal extraction and transformation, which can be time-consuming and labour-intensive.

Our work has some limitations. First, only a small sample of 32 subjects was investigated in this study. Second, we used the ECG, EEG, EMG and Respiratory channels to build CNN models separately, so there was no cross-checking between channels. Lastly, our CNN model is slow to train without a GPU. Well-trained models require a large data set and fine-tuned hyperparameters in the training step.

Future work can aim at feeding the four single-channel CNN models into an ensemble-like model to make a prediction. Other architectures would also be of great interest for this problem. One of the most popular deep learning architectures for modeling sequence and time-series data is the recurrent neural network (RNN) with long short-term memory (LSTM) cells.

We are grateful to the National Sleep Research Resource (NSRR) for allowing us to use the PSG data from the Cleveland Children’s Sleep and Health Study. The project is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).


  • [1] W. S. Almuhammadi, K. A. I. Aboalayon, and M. Faezipour (2015-05) Efficient obstructive sleep apnea classification based on EEG signals. In 2015 Long Island Systems, Applications and Technology, pp. 1–6.
  • [2] M. Cheng, W. Sori, F. Jiang, A. Khan, and S. Liu (2017-07) Recurrent neural network based classification of ECG signal features for obstruction of sleep apnea detection. pp. 199–202.
  • [3] E. Dehlink and H. Tan (2016) Update on paediatric obstructive sleep apnoea. Journal of Thoracic Disease 8 (2). ISSN 2077-6624.
  • [4] S. McCloskey, R. Haidar, I. Koprinska, and B. Jeffries (2018-06) Detecting hypopnea and obstructive apnea events using convolutional neural networks on wavelet spectrograms of nasal airflow. pp. 361–372. ISBN 978-3-319-93033-6.
  • [5] M. K. Moridani, M. Heydar, and S. S. Jabbari Behnam (2019-02) A reliable algorithm based on combination of EMG, ECG and EEG signals for sleep apnea detection: (a reliable algorithm for sleep apnea detection). In 2019 5th Conference on Knowledge Based Engineering and Innovation (KBEI), pp. 256–262.
  • [6] R. K. Tripathy (2018) Application of intrinsic band function technique for automated detection of sleep apnea using HRV and EDR signals. Biocybernetics and Biomedical Engineering 38 (1), pp. 136–144. ISSN 0208-5216.
  • [7] C. Varon, A. Caicedo, D. Testelmans, B. Buyse, and S. Van Huffel (2015-09) A novel algorithm for the automatic detection of sleep apnea from single-lead ECG. IEEE Transactions on Biomedical Engineering 62 (9), pp. 2269–2278.