Sleep quality is critical for good health. Decreased sleep quality is associated with negative health outcomes such as depression, obesity and a higher risk of mortality due to cardiovascular diseases. Apnea and hypopnea are common sleep disorders that cause poor sleep quality. A significant amount of research has gone into detecting apnea/hypopnea events by applying machine learning techniques to polysomnography (PSG) data [5, 6]. PSG is the standard method for investigating sleep quality and for detecting respiratory and non-respiratory sleep disorders by measuring multiple physiological signals while the subject is asleep. Apnea/hypopnea events are relatively simple to detect within PSG data, as reflected in the high inter-rater reliability that has been observed for their scoring. Apnea and hypopnea events have been well researched and have been linked to multiple negative health outcomes.
Sleep arousal due to factors other than apnea and hypopnea is another form of sleep disruption. It is an abrupt change in the pattern of brain wave activity that leads to a shift from deep sleep to light sleep, or from sleep to wakefulness. It becomes a problem when it occurs frequently during sleep. According to the American Academy of Sleep Medicine (AASM) guidelines, an arousal is an abrupt shift in electroencephalogram (EEG) frequency, including alpha, theta and frequencies greater than 16 Hz, that lasts at least 3 seconds and is preceded by at least 10 seconds of stable sleep. During the rapid eye movement (REM) stage, the arousal is also accompanied by an increase in the chin electromyogram (EMG) signal [9, 10].
Sleep arousals due to factors other than apnea and hypopnea are a less researched form of sleep disruption. Non-apnea/hypopnea arousals can be respiratory effort related arousals (RERA), or they may be due to teeth grinding (bruxism), pain, hypoventilation, insomnia, muscle jerks, vocalizations, snores, periodic leg movement, Cheyne-Stokes breathing or respiratory obstructions that are not severe enough to be classified as apnea or hypopnea. RERA is normally the most common type of non-apnea/hypopnea arousal. Very little research has been done on the effect that non-apnea/hypopnea arousals have on sleep quality and general health because they are difficult to detect. Sleep arousals have been shown to have lower inter-scorer reliability than apnea/hypopnea events. A more robust method of detecting non-apnea/hypopnea arousals would allow health researchers to determine the effects that these events have on health and to develop more effective treatments to reduce their frequency. The purpose of this work is to determine how accurately non-apnea/hypopnea arousals can be detected with deep learning methods. This work was done as part of the PhysioNet/Computing in Cardiology Challenge 2018 and is the extended version of [13].
Recently, convolutional neural networks (CNN) have gained a lot of interest in physiological signal processing due to their ability to learn complex features in an end-to-end fashion without extracting any hand-crafted features [14, 15]. In this work, a dense recurrent convolutional neural network (DRCNN) is proposed primarily to detect arousal regions, as well as apnea/hypopnea and sleep/wake intervals, using the PSG data provided in the 2018 PhysioNet challenge. Our network is a modified version of the DenseNet proposed in [16] and is composed of multiple dense convolutional units (DCU), each of which is a sequence of convolutional layers that are all connected to provide maximum information flow. It ends with a bidirectional long short-term memory (LSTM) layer with a residual skip connection and extra convolutions that convert the LSTM hidden states from the forward and backward passes to the output shape. To compute the probability of the different sleep events at each sample during training, as well as the losses, a remapping mechanism is also proposed to simplify the network decision making process. Moreover, other task labels such as apnea-hypopnea/normal and sleep/wake are used as auxiliary tasks in a multi-task learning framework to share representations between related tasks and to improve the generalization of our model on the desired task, which is arousal detection.
3 Materials and Pre-Processing
The dataset includes PSG data from 1,985 subjects who were monitored at the MGH sleep laboratory for the diagnosis of sleep disorders. The data were partitioned into balanced training (n = 994) and test (n = 989) sets, and the training data were provided publicly to design a model to detect target arousal regions. The dataset includes multiple physiological signals that were all sampled at 200 Hz and were manually scored by certified sleep technicians at the MGH sleep laboratory according to the AASM guidelines. More details regarding the dataset and the annotations available for different sleep analysis purposes are provided in [12].
In this work, the PSG measurements (12 channels) are used to design an arousal detector model. The electrocardiogram (ECG) signal, which is not necessary for sleep scoring, is excluded from our analysis. First, an anti-aliasing finite impulse response (FIR) filter is applied to all channels. Second, the channels are downsampled to 50 Hz and the DC bias is removed. Finally, each channel is individually normalized by removing the mean and scaling by the root-mean-square (RMS) of the channel signal in a moving 18-minute window, computed using fast Fourier transform (FFT) convolution, which is a speed-optimized form of regular convolution. According to the AASM guidelines, the baseline breathing is established over 2 minutes. Normalizing over an 18-minute interval ensures overlap between the two ends of the baseline window. This normalization is not applied to the oxygen saturation (SaO2) measurement, which is only scaled to a bounded range to avoid saturating the neural network with large values.
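The moving-window normalization described above can be sketched as follows. This is a minimal illustration, not the authors' code: it uses prefix sums instead of FFT convolution to obtain the window statistics, and the function name `moving_normalize`, the centered window, the truncation at the record edges and the `eps` guard are all assumptions.

```python
import math

FS = 50                 # sampling rate after downsampling (Hz), from the text
WINDOW = 18 * 60 * FS   # 18-minute window expressed in samples


def moving_normalize(x, window, eps=1e-8):
    """Subtract the moving mean and divide by the moving RMS of the
    mean-removed signal. Hypothetical helper; edge windows are truncated."""
    n = len(x)
    # Prefix sums of x and x**2 give any window sum in O(1)
    # (swapped in here for the FFT convolution mentioned in the text).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(x):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v
    half = window // 2
    out = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        count = hi - lo
        mean = (s1[hi] - s1[lo]) / count
        var = max((s2[hi] - s2[lo]) / count - mean * mean, 0.0)
        rms = math.sqrt(var)
        out.append((x[i] - mean) / (rms + eps))
    return out
```

For a full-night channel at 50 Hz the call would be `moving_normalize(channel, WINDOW)`; a constant segment maps to zeros because the mean removal cancels it before the division.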
4 Sleep Disorder Detector Model
In this section, the DRCNN structure that is proposed to detect arousal regions as well as other sleep disorders is explained. Then, the multi-task learning framework is described in which all available annotations associated with the sleep/wake, arousal and apnea-hypopnea/normal events are employed to improve our network generalization.
4.1 DRCNN Network Structure
In this work, our proposed DRCNN is trained and evaluated on data downsampled to 50 Hz, which reduces the computational effort and allows a full-night recording to fit into memory. The network is composed of three types of blocks, DCU1, DCU2 and LSTM, which are displayed in Figure 1. First, there are three DCU1s, each followed by a max-pooling layer, which together down-sample the input signals to one sample per second. These are followed by eleven DCU2s. The DCU1s and DCU2s have a similar structure comprising two sequences of two depthwise separable convolutional layers followed by scaled exponential linear unit (SELU) activation functions.
In DCU2, weight normalization, position-wise normalization and stochastic batch normalization with a channel-specific affine transform are also applied to the convolutional layer outputs before the SELU activation function. Position-wise normalization involves subtracting the mean and dividing by the standard deviation across the channel dimension, independently for each time step. To extend the DCU2 receptive field, dilated convolutions are also employed: the dilation rates first increase exponentially with the depth of the network along the first six DCU2s and then decrease exponentially along the remaining ones. In DCU1, however, neither position-wise normalization nor a dilation factor is applied. Stochastic batch normalization is used in both DCU1 and DCU2.
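As an illustration of the two ideas in this paragraph, the sketch below normalizes across channels at each time step and builds one possible dilation schedule for the eleven DCU2s. The base of 2, the peak dilation of 32, the `eps` value and both function names are assumptions; the text only states that the rates grow exponentially over the first six units and shrink exponentially afterwards.

```python
import math


def position_wise_normalize(x, eps=1e-5):
    """x[c][t] holds channel c at time step t. Each time step is normalized
    independently using the mean and standard deviation across channels."""
    n_ch, n_t = len(x), len(x[0])
    out = [[0.0] * n_t for _ in range(n_ch)]
    for t in range(n_t):
        col = [x[c][t] for c in range(n_ch)]
        mean = sum(col) / n_ch
        var = sum((v - mean) ** 2 for v in col) / n_ch
        std = math.sqrt(var + eps)
        for c in range(n_ch):
            out[c][t] = (col[c] - mean) / std
    return out


def dilation_schedule(n_units=11, n_increasing=6, base=2):
    """Assumed schedule: base**0 .. base**5, then back down to base**0."""
    up = [base ** i for i in range(n_increasing)]
    down = [base ** i for i in range(n_units - n_increasing - 1, -1, -1)]
    return up + down
```

With the default arguments the schedule is `[1, 2, 4, 8, 16, 32, 16, 8, 4, 2, 1]`, so the widest receptive field step sits in the middle of the DCU2 stack.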
Following the DCUs, a bidirectional long short-term memory (LSTM) layer with a residual skip connection (linear convolution) is applied across the temporal dimension of the input channels. Finally, two more convolutional layers with mapping are used to convert the LSTM hidden states from the forward and backward passes to the output shape. The hyperbolic tangent (tanh) is applied before the last convolutional layer, which leads to a more stable training process. Weight normalization is applied to each of the three convolutional layers in the LSTM block. The overall structure of our proposed DRCNN is displayed in Figure 2.
4.2 Learning Mechanism
In this work, a multi-task learning mechanism is used to improve the generalization of our proposed arousal detector model and to learn more complex features through using other correlated tasks such as apnea-hypopnea/normal and sleep/wake. The ground truth corresponding to each task is a vector with two or three conditions that is defined as follows:
Arousal presence/absence detection task: (target arousal = 1, non-target arousal (apnea/hypopnea or wake) = -1, and normal = 0),
Apnea-hypopnea/normal detection task: (all types of apnea/hypopnea = 1, and normal = 0),
Sleep/wake detection task: (sleep stages (REM, NREM1, NREM2, NREM3) = 1, wake = 0, and undefined stage = -1)
Considering the above possible conditions associated with every task, 18 combinations can be defined (3 x 2 x 3). To investigate the distribution of the data associated with all combinations, a histogram of the labelled data was obtained. As displayed in Figure 3, only 13 of the 18 combinations were non-empty. To simplify the structure of the network output layer that computes joint probabilities, the non-empty bins are remapped to the 4 bins that are displayed in green in Figure 3. All the red bins, which correspond to the beginning of the record before the first sleep epoch is annotated (undefined sleep stage), are remapped to bin 0. The data associated with bin 0 are still processed by our model during training, but they do not contribute to the loss gradient.
By definition, a sleep disorder cannot occur while the subject is awake (the condition in bin 4). Such labels nevertheless appear because, according to the AASM guidelines, the sleep stages are annotated in 30-second epochs, so a disorder event can overlap an epoch that is scored as wake. It is therefore necessary to update the sleep/wake detection task labels in this state, and for this purpose bin 4 is remapped to bin 5. Similarly, bin 2 is remapped to bin 1 because when the arousal label is -1 and no apnea or hypopnea is present, the subject must be awake.
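The bin bookkeeping can be sketched as below. The index formula is an assumption: the text does not state how the 18 combinations are numbered, but this particular encoding reproduces every bin that the text describes (1, 2, 4, 5, 8 and 14). Only the remapping rules stated in the text are implemented, and `bin_index`, `remap` and the constant names are hypothetical.

```python
from itertools import product

AROUSAL = (-1, 0, 1)   # non-target arousal, normal, target arousal
APNEA = (0, 1)         # normal, apnea/hypopnea
SLEEP = (-1, 0, 1)     # undefined stage, wake, sleep

MODELLED_BINS = (1, 5, 8, 14)   # the four bins kept for the output layer
STATED_REMAPS = {4: 5, 2: 1}    # the two remaps given in the text


def bin_index(arousal, apnea, sleep):
    """Assumed numbering of the 3 x 2 x 3 label combinations."""
    return 6 * (arousal + 1) + 3 * apnea + (sleep + 1)


def remap(arousal, apnea, sleep):
    """Return the modelled bin, 0 for samples excluded from the loss,
    or None when the text gives no rule for the combination."""
    if sleep == -1:
        return 0  # undefined sleep stage: processed but not scored
    b = bin_index(arousal, apnea, sleep)
    b = STATED_REMAPS.get(b, b)
    return b if b in MODELLED_BINS else None


ALL_BINS = sorted(bin_index(*c) for c in product(AROUSAL, APNEA, SLEEP))
```

Under this assumed encoding, bin 1 is wake, bin 5 is apnea/hypopnea during sleep, bin 8 is normal sleep and bin 14 is a target arousal, which matches the roles these bins play in the marginal probabilities.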
The last convolutional layer of our proposed DRCNN has four output channels that are soft-maxed to compute joint probabilities corresponding to bins 1, 5, 8 and 14. Then, the predicted arousal, apnea-hypopnea/normal and sleep/wake marginal probabilities are computed as: P(arousal) = P(bin 14), P(non-arousal) = P(bin 1) + P(bin 5) + P(bin 8), P(apnea/hypopnea) = P(bin 5), P(no apnea and hypopnea) = P(bin 1) + P(bin 8) + P(bin 14), P(wake) = P(bin 1), and P(sleep) = P(bin 5) + P(bin 8) + P(bin 14).
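The marginalization above can be written out directly. The sketch assumes that the four output channels are ordered as bins 1, 5, 8 and 14; the channel order and the function names are illustrative and are not given in the text.

```python
import math

BINS = (1, 5, 8, 14)  # assumed channel order of the four network outputs


def softmax(logits):
    """Numerically stable softmax over one sample's four output channels."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def marginals(logits):
    """Marginal task probabilities from the joint bin probabilities."""
    p = dict(zip(BINS, softmax(logits)))
    return {
        "arousal": p[14],
        "non_arousal": p[1] + p[5] + p[8],
        "apnea_hypopnea": p[5],
        "no_apnea_hypopnea": p[1] + p[8] + p[14],
        "wake": p[1],
        "sleep": p[5] + p[8] + p[14],
    }
```

Because the four bins are mutually exclusive, each pair of marginals sums to one, so the three tasks are tied together by a single softmax rather than by three independent sigmoid outputs.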
To train our DRCNN, the apnea-hypopnea/normal and sleep/wake tasks are used as auxiliary detection tasks, whereas arousal detection is the desired task. The total cross-entropy loss is computed as the weighted average of the loss values corresponding to the desired and auxiliary tasks, where the arousal loss weight is set to 2. The network weight parameters are optimized using the Adam method, which outperformed the other optimization techniques in this work. In every epoch, one full-night recording is randomly selected and processed through the network. Then, to evaluate the performance of the network, the AUPRC and AUROC are obtained for the validation data, and the model is checkpointed if either score improves. The full training process is repeated four times across different folds of training and validation data, and finally the predictions of our four models are averaged to obtain the ensemble model predictions.
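A minimal sketch of the weighted multi-task loss is given below. The arousal weight of 2 comes from the text; the auxiliary weights of 1, the per-task binary cross-entropy form, and the rule of skipping samples whose label is neither 0 nor 1 are assumptions made for illustration, and all names are hypothetical.

```python
import math

# Arousal weight is from the text; auxiliary weights of 1 are assumed.
TASK_WEIGHTS = {"arousal": 2.0, "apnea": 1.0, "sleep": 1.0}


def masked_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over samples labelled 0 or 1.
    Any other label (for example -1) is skipped, which is an assumption
    about how unscored samples are kept out of the loss."""
    total, count = 0.0, 0
    for p, y in zip(probs, labels):
        if y not in (0, 1):
            continue
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
        count += 1
    return total / count if count else 0.0


def total_loss(task_losses, weights=TASK_WEIGHTS):
    """Weighted average of the per-task losses."""
    weighted = sum(weights[k] * task_losses[k] for k in weights)
    return weighted / sum(weights.values())
```

With these weights, the arousal task contributes half of the total loss and each auxiliary task contributes a quarter.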
5 Empirical Results
The proposed DRCNN is applied to 12 PSG channels, excluding ECG signal. The network hyper-parameters and learning procedure are explained in Section 4. The PSG channels are first pre-processed as described in Section 3. To train our network, the available annotated data are divided into four folds, where each includes 794 training, 100 validation and 100 consistent testing records. Using a multi-task learning process, the AUPRC and AUROC are obtained for sleep/wake, arousal and apnea-hypopnea/normal detection tasks. Table 1 displays the performance metrics measured for each fold of cross-validation as well as the average performance on validation records across the 4 folds.
The predictions of the four models trained on different data folds are averaged to form an ensemble model prediction. The ensemble model strategy improves the performance compared to the single model strategy. Table 2 displays the single and ensemble model performance evaluation results on the consistent test set. It must be noted that the performance results are obtained on predictions up-sampled to the original 200 Hz.
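The ensemble step amounts to a per-sample mean of the model outputs, as in the sketch below. The up-sampling by sample repetition is an assumption (the text does not say how predictions are brought back to 200 Hz), the factor depends on the rate of the network output, and both function names are hypothetical.

```python
def ensemble_average(predictions):
    """predictions[m][t] is the probability from model m at sample t.
    Returns the per-sample mean across models."""
    n_models = len(predictions)
    return [sum(vals) / n_models for vals in zip(*predictions)]


def upsample_repeat(probs, factor):
    """Repeat every prediction `factor` times (assumed up-sampling method)."""
    return [p for p in probs for _ in range(factor)]
```

For example, `upsample_repeat(ensemble_average(fold_predictions), factor)` would produce one probability per original sample when `factor` is the ratio between 200 Hz and the prediction rate.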
Finally, the average AUPRC and AUROC values associated with the arousal detection task on our testing dataset, together with the improved values obtained by an ensemble of four models trained on different data folds, are reported in Table 2. The ensemble model strategy improves not only the arousal detection performance metrics but also the performance on the auxiliary detection tasks.
Although our proposed DRCNN is primarily developed to detect arousal regions, the multi-task learning framework and the added auxiliary tasks enable us to deploy our model for detecting different types of sleep disorders including arousal, apnea and hypopnea. To evaluate our ensemble network on other sleep disorder detection tasks, three popular metrics are measured as follows:
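Under the standard clinical definitions, which are assumed here, and with $N_{\text{arousal}}$ and $N_{\text{AH}}$ introduced as notation for the numbers of scored arousal and apnea/hypopnea events and TST expressed in hours, these metrics read:

```latex
\mathrm{SE} = \frac{\mathrm{TST}}{\mathrm{TRT}} \times 100\%, \qquad
\mathrm{AI} = \frac{N_{\text{arousal}}}{\mathrm{TST}}, \qquad
\mathrm{AHI} = \frac{N_{\text{AH}}}{\mathrm{TST}}
```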
where TST, TRT, SE, AI and AHI correspond to the total sleep time, total recording time, sleep efficiency, arousal index and apnea-hypopnea index, respectively. In the sleep monitoring literature, these metrics are used to identify subjects with sleep disorders as well as to estimate their severity. The AHI is graded into four groups: normal (AHI between 0 and 5), mild (AHI between 5 and 15), moderate (AHI between 15 and 30) and severe (AHI above 30). Higher AHI grades indicate more serious sleep disorders, which have to be treated appropriately using methods such as a continuous positive airway pressure (CPAP) machine or other oral appliances.
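A short sketch of how these metrics and the AHI grade could be computed from per-record counts follows. It assumes the conventional definitions (SE as a percentage of recording time, AI and AHI as events per hour of sleep) and half-open grade intervals; the handling of the exact boundary values 5, 15 and 30 and the function names are assumptions.

```python
def sleep_metrics(tst_hours, trt_hours, n_arousals, n_apnea_hypopnea):
    """Return (SE in percent, AI per hour, AHI per hour).
    Conventional definitions, assumed here."""
    se = 100.0 * tst_hours / trt_hours
    ai = n_arousals / tst_hours
    ahi = n_apnea_hypopnea / tst_hours
    return se, ai, ahi


def ahi_grade(ahi):
    """Map an AHI value to its severity grade.
    Boundary values are assigned to the higher grade (an assumption)."""
    if ahi < 5:
        return "normal"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"
```

For a record with 6 hours of sleep in an 8-hour recording, 60 arousals and 30 apnea/hypopnea events, `sleep_metrics(6.0, 8.0, 60, 30)` gives an SE of 75%, an AI of 10 and an AHI of 5.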
The mean absolute errors and the average actual and predicted values of the above metrics are measured and displayed in Table 3 for the first fold of the validation data as well as our testing records.
According to Table 3, our DRCNN model estimates SE, AI and AHI with sufficiently low errors that it can be used to generate automated sleep monitoring reports. The confusion matrices of the AHI grade estimation task using our DRCNN model are displayed in Tables 4 and 5 for the validation and testing data sets, respectively.
Note that in most of the misclassified cases, our model overestimated the apnea-hypopnea severity which is more acceptable than underestimating the severity grade or not detecting at all. To evaluate the performance of our model in estimating apnea-hypopnea severity grade, the accuracy, normal grade false positive rate (FPR) and the other grades false negative rates (FNR) are computed and displayed in Table 6 for validation and testing data sets using their corresponding confusion matrices. The normal grade FPR is the rate of subjects that are incorrectly diagnosed with higher apnea-hypopnea severity grades and the other grades FNR is the rate of subjects within each category whose apnea-hypopnea severity grades are underestimated.
Table 6: Normal grade FPR and mild, moderate and severe grade FNR for the validation and testing data sets.
Therefore, the overall FNR (rate of underestimation) associated with all grades of apnea-hypopnea excluding the normal grade follows from the confusion matrices in Tables 4 and 5 for the first fold of validation data and the testing records, respectively.
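The rates defined above can be computed from a 4 x 4 confusion matrix as sketched below. The row/column convention (rows are actual grades, columns are predicted grades), the use of per-grade row totals as denominators and the function name are assumptions; the matrix in the usage note is made-up illustrative data, not a result from this work.

```python
GRADES = ("normal", "mild", "moderate", "severe")


def grade_rates(cm):
    """cm[i][j] = number of subjects with actual grade i predicted as j.
    Assumes every grade has at least one subject."""
    n = len(GRADES)
    total = sum(sum(row) for row in cm)
    rates = {"accuracy": sum(cm[i][i] for i in range(n)) / total}
    # Normal subjects predicted as any higher severity grade.
    rates["normal_fpr"] = sum(cm[0][1:]) / sum(cm[0])
    # For each non-normal grade, subjects predicted below their actual grade.
    for i in range(1, n):
        rates[GRADES[i] + "_fnr"] = sum(cm[i][:i]) / sum(cm[i])
    underestimated = sum(sum(cm[i][:i]) for i in range(1, n))
    non_normal = sum(sum(cm[i]) for i in range(1, n))
    rates["overall_fnr"] = underestimated / non_normal
    return rates
```

As a usage example with invented counts, `grade_rates([[8, 2, 0, 0], [1, 7, 2, 0], [0, 2, 6, 2], [0, 0, 1, 9]])` yields an accuracy of 0.75, a normal grade FPR of 0.2 and an overall FNR of 4/30.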
6 Ablation Study and Discussion
To elucidate the contributions of our proposed DRCNN components, an extensive ablation study including multiple experiments was performed. In each experiment, only one component was modified or removed with respect to the baseline model that is our proposed DRCNN with the architecture given in Figure 2. All the ablation study models were trained and evaluated using the first fold of our data. The performance metrics of AUPRC and AUROC were measured for the first fold of validation data as well as the consistent testing records and were compared among different experiments. Tables 7 and 8 respectively give the list of ablation study experiments and the AUPRC and AUROC corresponding to all three tasks addressed in our multi-task learning framework for the first fold of the validation data set. Similarly, Table 9 displays the ablation study results for our consistent testing records, using the model that is trained on the first fold. It must be noted that the training process was stopped at Epoch 500 in every ablation study, and the model with the highest performance on validation data set was saved to be evaluated on our testing records.
It can be concluded from the ablation study models applied on our testing records, that the AUPRC and AUROC performance metrics are marginally or highly decreased in all experiments, excluding Exp. 2 and Exp. 5, compared to our proposed original model. This confirms the positive contribution of the components that were added to our proposed structure, specifically the contribution of the residual mapping in the LSTM block as well as activating the position-wise normalization in DCU2 and disabling it in DCU1.
In experiments 2 and 5, the AUPRC and AUROC are slightly improved compared to the original model. This is not a major concern, since the difference is small enough to be attributable to the initialization of our network weights. Both the ReLU and SELU activation functions appear to work similarly in this problem. However, we still prefer SELU over ReLU because of its self-normalizing property, which limits the risk of dying neurons.
To compare the convergence speed of the experiments, the training progress of every ablation study is displayed in Figure 4 for our three detection tasks, where the improving AUPRC values are plotted against the corresponding epoch number. It must be noted that an epoch is a fixed number of full nights, not the full training set. According to Figure 4, the original model converged to its highest AUPRC value on the validation set faster than the other models, excluding the model trained in Exp. 2. The only model that is as good as the original one in terms of convergence speed is the model trained in Exp. 10 (with fixed dilation), which is not as accurate as the original model in terms of the AUPRC values. Although the AUPRC values obtained from the model in Exp. 5 are marginally higher than those obtained from the original model, the original model converges faster. It must also be noted that the AUPRC/AUROC values and the convergence trends of the apnea-hypopnea/normal and sleep/wake detection tasks obtained from the model in Exp. 9 (single task) are not valid, because that model was only trained to detect arousal.
As a result, considering both the performance metrics and the training convergence speed, our proposed original model outperforms the other models evaluated in our ablation study, except for Exp. 2 (SELU replaced by ReLU). To evaluate the contribution of the SELU activation function, further experiments should be performed using other training and validation data folds as well as the blind testing set, which is outside the scope of the current paper.
In this paper, a modified version of the dense convolutional neural network comprising multiple convolutional and LSTM blocks is proposed to detect sleep disorders including arousal, apnea and hypopnea using the 12 PSG channels that are provided in the 2018 PhysioNet challenge database. To improve our network generalization and to use information from correlated tasks, a multi-task learning procedure using a hard parameter sharing framework is also exploited in this work. Four DRCNN models are trained and evaluated on different subsets of the training and validation data. Finally, an ensemble model is obtained by computing the average prediction of these four models. The results confirm the superiority of the ensemble model over a single model approach. On the challenge blind testing dataset, the ensemble model was the first-place entry in the official stage of the PhysioNet challenge.
- [1] N. Tsuno, A. Besset, and K. Ritchie, “Sleep and depression,” The Journal of Clinical Psychiatry, 2005.
- [2] F. P. Cappuccio, F. M. Taggart, N.-B. Kandala, A. Currie, E. Peile, S. Stranges, and M. A. Miller, “Meta-analysis of short sleep duration and obesity in children and adults,” Sleep, vol. 31, no. 5, pp. 619–626, 2008.
- [3] E. Suzuki, T. Yorifuji, K. Ueshima, S. Takao, M. Sugiyama, T. Ohta, K. Ishikawa-Takata, and H. Doi, “Sleep duration, sleep quality and cardiovascular disease mortality among the elderly: a population-based cohort study,” Preventive Medicine, vol. 49, no. 2-3, pp. 135–141, 2009.
- [4] H. Engleman and N. Douglas, “Sleep · 4: Sleepiness, cognitive function, and quality of life in obstructive sleep apnoea/hypopnoea syndrome,” Thorax, vol. 59, no. 7, pp. 618–622, 2004.
- [5] N. Pombo, N. Garcia, and K. Bousson, “Classification techniques on computerized systems to predict and/or to detect apnea: A systematic review,” Computer Methods and Programs in Biomedicine, vol. 140, pp. 265–274, 2017.
- [6] A. Otero, P. Felix, M. R. Alvarez, and C. Zamarron, “Fuzzy structural algorithms to identify and characterize apnea and hypopnea episodes,” in 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS). IEEE, 2008, pp. 5242–5245.
- [7] J. R. Chesson, L. Andrew, R. A. Ferber, J. M. Fry, M. Grigg-Damberger, K. M. Hartse, T. D. Hurwitz, S. Johnson, G. A. Kader, M. Littner, G. Rosen et al., “The indications for polysomnography and related procedures,” Sleep, vol. 20, no. 6, pp. 423–487, 1997.
- [8] U. J. Magalang, N.-H. Chen, P. A. Cistulli, A. C. Fedson, T. Gíslason, D. Hillman, T. Penzel, R. Tamisier, S. Tufik, G. Phillips et al., “Agreement in the scoring of respiratory events and sleep among international sleep centers,” Sleep, vol. 36, no. 4, pp. 591–596, 2013.
- [9] P. Halász, M. Terzano, L. Parrino, and R. Bódizs, “The nature of arousal in sleep,” Journal of Sleep Research, vol. 13, no. 1, pp. 1–23, 2004.
- [10] R. B. Berry, C. L. Albertario, S. Harding, R. M. Lloyd, D. T. Plante, S. F. Quan, M. M. Troester, and B. V. Vaughn, “The AASM manual for the scoring of sleep and associated events,” Rules, Terminology and Technical Specifications, Version 2.5, American Academy of Sleep Medicine, 2018.
- [11] M. H. Bonnet, “Performance and sleepiness as a function of frequency and placement of sleep disruption,” Psychophysiology, vol. 23, no. 3, pp. 263–271, 1986.
- [12] M. M. Ghassemi, B. E. Moody, H. L. Li-wei, C. Song, Q. Li, H. Sun, R. G. Mark, M. B. Westover, and G. D. Clifford, “You snooze, you win: the PhysioNet/Computing in Cardiology Challenge 2018,” Hypertension, vol. 40, no. 41, pp. 40–6, 2018.
- [13] M. Howe-Patterson, B. Pourbabaee, and F. Benard, “Automated detection of sleep arousals from polysomnography data using a dense convolutional neural network,” Computing in Cardiology, Maastricht, Netherlands.
- [14] B. Pourbabaee, M. Javan Roshtkhari, and K. Khorasani, “Deep convolutional neural networks and learning ECG features for screening paroxysmal atrial fibrillation patients,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–10, 2017.
- [15] B. Pourbabaee, M. Howe-Patterson, E. Reiher, and F. Benard, “Deep convolutional neural network for ECG-based human identification,” vol. 41, 2018.
- [16] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” vol. 1, no. 2, 2017, pp. 4700–4708.
- [17] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real NVP,” arXiv preprint arXiv:1605.08803, 2016.
- [18] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
- [19] F. J. Nieto, T. B. Young, B. K. Lind, E. Shahar, J. M. Samet, S. Redline, R. B. D’Agostino, A. B. Newman, M. D. Lebowitz, T. G. Pickering et al., “Association of sleep-disordered breathing, sleep apnea, and hypertension in a large community-based study,” JAMA, vol. 283, no. 14, pp. 1829–1836, 2000.
- [20] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” pp. 971–980, 2017.