Analysis of sleep patterns is performed manually by experts in sleep clinics using rules and guidelines defined by the American Academy of Sleep Medicine (AASM), most recently updated in 2018 [1]. These guidelines outline technical and clinical best practices for performing routine polysomnography (PSG), an overnight recording of electroencephalography (EEG), electrooculography (EOG), electromyography (EMG), electrocardiography (ECG), respiratory effort, and peripheral limb activity. Expert technicians and somnologists use these physiological variables to analyse sleep patterns and diagnose sleep disorders based on key metrics and indices, such as total sleep time, the amount of sleep spent in the various sleep stages, and the observed number of discrete events per hour of sleep. Specifically, the numbers of arousals (short awakenings during sleep, 15 s), non-periodic and periodic leg movements (PLM), and apnea events per hour of sleep are summarized in the arousal index (AI), the periodic leg movement index (PLMI), and the apnea/hypopnea index (AHI), the latter combining apneic (absent/obstructed respiratory effort) and hypopneic (reduced respiratory effort) events. Excessive numbers of these events are disruptive to normal sleep, which can lead to patient complaints of excessive daytime sleepiness, which in turn is linked to an increase in, e.g., automotive accidents [3] and to reduced quality of life. An increased number of PLMs is also linked to other sleep disorders such as restless legs syndrome and periodic leg movement disorder [4, 5].
Correct diagnosis of sleep disorders is predicated on precise scoring of sleep stages as well as accurate scoring of these discrete sleep events. However, the current gold standard of manual analysis by experienced technicians is inherently biased and inconsistent. Several studies have shown low inter-rater reliability in the scoring of sleep stages [6, 7, 8], arousals [9], and respiratory events [10]. Furthermore, manual analysis of PSGs is time-consuming and prone to scorer fatigue. Thus, there is a need for efficient systems that provide deterministic and reliable scorings of sleep studies.
Several recent studies have already explored automatic classification of sleep stages in large cohorts with good results [11, 12, 13, 14, 15]; however, the reliable and consistent detection and classification of discrete PSG events in large cohorts remain largely unexplored.
Recent studies on certain microevents in sleep have indicated that sleep spindles and K-complexes can be reliably detected and annotated with start time and duration using deep learning methods [16, 17]. Specifically, these studies proposed a single-shot event detection algorithm that parallels the YOLO and SSD algorithms used for object detection in 2D images [18, 19]. However, they were limited in scope: events were detected only at the EEG level, the temporal connection between detected events was not explicitly exploited, and experiments were carried out on a small-scale database.
In this study, we focused on the detection of arousals (AR) and leg movements (LM). These events arise from highly distinct physiological sources, EEG and leg EMG, while ARs are also visible in the EOG and chin EMG. These events are important for the precise characterization of sleep patterns and possible diagnosis of sleep disorders, and accurate detection is therefore of high interest. We extend previous work [16, 17] by 1) preprocessing and analysing multiple input signals simultaneously, and 2) taking important temporal context into account using recurrent neural networks. Furthermore, we apply our model to a larger database than previous studies.
II-A MrOS Sleep Study
The MrOS Sleep Study is a part of the larger Osteoporotic Fractures in Men Study with the objective of researching the links between sleep disorders, fractures, cardiovascular disease, and mortality in older males (≥65 years) [20, 21, 22]. Between 2003 and 2005, 3,135 of the original 5,994 participants were recruited to undergo full-night PSG recording at six centers in the US at two separate visits (visit 1 and visit 2), each followed by a 3- to 5-day actigraphy study at home. The resulting PSG studies were subsequently scored by experienced sleep technicians for standard sleep variables, including sleep stages, leg movements, arousals, and respiratory events.
II-B Included events and signals
In this study, we considered the detection of only two PSG events: arousals and leg movements. These events are characterized by a start time and a duration, which we extracted from 2,907 PSG studies from visit 1 available from the National Sleep Research Resource repository [23, 24]. From each PSG study, we extracted left and right central EEG, left and right EOG, chin EMG, and EMG from the left and right anterior tibialis. EEG and EOG channels were referenced to the contralateral mastoid process, while a leg EMG channel was synthesized by referencing left to right. Any PSG missing the full set of channels or lacking event scorings was excluded from further analysis.
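The referencing step above amounts to a channel subtraction; a minimal sketch, where channel names and sample values are hypothetical:

```python
import numpy as np

def reference(channel, reference_channel):
    """Re-reference a signal by subtracting the reference electrode."""
    return np.asarray(channel) - np.asarray(reference_channel)

# Hypothetical raw electrode recordings (microvolts)
c3 = np.array([10.0, 12.0, 11.0])      # left central EEG
m2 = np.array([1.0, 2.0, 1.0])         # right (contralateral) mastoid
leg_l = np.array([5.0, 5.5, 6.0])      # left anterior tibialis EMG
leg_r = np.array([4.0, 5.0, 5.5])      # right anterior tibialis EMG

c3_m2 = reference(c3, m2)              # EEG referenced to contralateral mastoid
leg = reference(leg_l, leg_r)          # synthesized bipolar leg EMG channel
```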
II-C Subset demographics and partitioning
In total, 2,650 out of the 2,907 PSGs available from visit 1 were included in this study. These were partitioned into train, eval, and test sets of sizes 1,485, 165, and 1,000 studies, respectively. A subset of key demographic and PSG variables is presented in Table I.
[Table II: overview of the proposed network architecture. For each module — channel mixing, temporal feature extraction, recurrent neural network (optional), event classification, and event localization — the table lists input and output dimensions, layer type (1D convolution, 1D max. pooling), kernel size, number of kernels, stride, and activation (softmax for each kernel in the classification module). C: number of input channels; T: number of samples in a segment; K: number of event classes; remaining symbols denote the number of output channels and the number of default events in a segment.]
III-A Signal preprocessing
All signals were resampled to a common sampling frequency using poly-phase filtering with a Kaiser window before subsequent filtering according to AASM criteria. Briefly, EEG and EOG channels were subjected to a 4th-order Butterworth band-pass filter with AASM-recommended cutoff frequencies, while chin and leg EMG channels were filtered with a 4th-order Butterworth high-pass filter with a 10 Hz cutoff frequency. All filters employed zero-phase filtering. Lastly, each channel was normalized by subtracting the channel mean and dividing by the channel standard deviation across the entire night.
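The preprocessing chain can be sketched with SciPy; the target rate (128 Hz), Kaiser β (5.0), and band-pass cutoffs (0.3–35 Hz) below are illustrative assumptions, not the exact values used in this study:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

def preprocess(x, fs_in, fs_out, band):
    """Resample, band-pass filter (zero-phase), and z-score normalize one channel."""
    # Poly-phase resampling with a Kaiser anti-aliasing window
    x = resample_poly(x, fs_out, fs_in, window=("kaiser", 5.0))
    # 4th-order Butterworth band-pass, applied forward-backward for zero phase
    sos = butter(4, band, btype="bandpass", fs=fs_out, output="sos")
    x = sosfiltfilt(sos, x)
    # Normalize across the entire recording
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(0)
eeg = rng.standard_normal(512 * 60)  # one minute of hypothetical 512 Hz EEG
out = preprocess(eeg, fs_in=512, fs_out=128, band=(0.3, 35.0))
```

For the EMG channels, the same structure applies with `btype="highpass"` and a scalar 10 Hz cutoff.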
III-B Detection model overview
In brief, the proposed model receives as input a tensor containing C channels of data in a segment of T samples, along with a set of events, where each event in the associated time segment is described by its start time and duration. The objective of the deep learning model is then to infer the events given the input tensor. To do this, a set of default event windows of a fixed size (in samples) is generated over the segment. The model outputs probabilities for the event classes, including a default non-event class, for each default event window; for a prediction to be retained, the probability of a given class in a default event window must be greater than a classification threshold. To select among the many candidate predicted events, all predicted events of a given class are subjected to non-maximum suppression using the intersection-over-union (IoU, Jaccard index) as in [16, 17]. A high-level schematic of the detection model is shown in Fig. 1.
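Greedy non-maximum suppression over 1D (start, duration, score) candidates can be sketched as follows; the IoU threshold and candidate values are illustrative:

```python
def iou_1d(a, b):
    """Intersection-over-union of two (start, duration, ...) intervals."""
    a0, a1 = a[0], a[0] + a[1]
    b0, b1 = b[0], b[0] + b[1]
    inter = max(0.0, min(a1, b1) - max(a0, b0))
    union = (a1 - a0) + (b1 - b0) - inter
    return inter / union if union > 0 else 0.0

def nms_1d(events, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring candidate, suppress overlapping ones."""
    events = sorted(events, key=lambda e: e[2], reverse=True)
    kept = []
    for ev in events:
        if all(iou_1d(ev, k) < iou_threshold for k in kept):
            kept.append(ev)
    return kept

# Two overlapping candidates for the same event plus one distinct event
candidates = [(10.0, 5.0, 0.9), (11.0, 5.0, 0.6), (30.0, 3.0, 0.8)]
kept = nms_1d(candidates)
```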
III-C Network architecture
The architecture of the proposed PSG event detection model closely follows the event detection algorithms described in [16, 17], albeit with some specific changes. An overview of the proposed network is provided in Table II. Briefly, the model comprises three modules:
a channel mixing module;
a feature extraction module;
and an event detection module,
the latter containing two submodules performing event classification and event localization, respectively. The difference between the two is that the classification submodule outputs the probability of the default non-event class and of each event class, while the localization submodule predicts a start time and a duration for each predicted event relative to a specific default event window. The channel mixing module receives a segment of input data, where C is the number of input channels and T is the number of time samples in the given segment, and performs linear channel mixing using 1D convolutions to synthesize new channels. The feature extraction module consists of a sequence of identical blocks, each implemented as a 1D convolution layer followed by batch normalization of the feature maps, rectified linear unit activation, and a final 1D maximum pooling layer across the temporal dimension. Kernel size and stride were set to 3 and 1 for the convolution layers, and to 2 and 2 for the max. pooling layers. The event classification submodule is implemented as a 1D convolution layer across the entire data volume, outputting, for each default event window, probabilities over the K event classes and the non-event class. The event localization submodule is likewise implemented using a 1D convolution layer across the entire data volume.
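The temporal dimension propagating through the feature extraction blocks follows the standard convolution/pooling output-size formulas; the sketch below assumes 'same' padding for the convolutions and a hypothetical segment length, both of which are assumptions of this sketch:

```python
def conv_out(t, kernel=3, stride=1, padding=1):
    """Temporal length after a 1D convolution (floor formula)."""
    return (t + 2 * padding - kernel) // stride + 1

def pool_out(t, kernel=2, stride=2):
    """Temporal length after a 1D max. pooling layer."""
    return (t - kernel) // stride + 1

def feature_extractor_shape(t, num_blocks):
    """Temporal length after stacking conv(3, 1) + maxpool(2, 2) blocks."""
    for _ in range(num_blocks):
        t = conv_out(t)   # kernel 3, stride 1, 'same' padding keeps the length
        t = pool_out(t)   # kernel 2, stride 2 halves it
    return t
```

With these settings each block halves the temporal dimension, e.g. a hypothetical 2-minute segment at 128 Hz (15,360 samples) passed through 7 blocks yields 120 time steps.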
III-D Data and event sampling
The proposed network requires an input tensor containing PSG data in a time segment as well as information about the associated events in that segment. Since segments in a standard PSG without any event data far outnumber segments with event data, we implemented random sampling of non-event and event classes with the sampling probability of each class, including the default (non-event) class, inversely proportional to the number of classes. At each training step, we thus sample a class and afterwards randomly sample a single event of that class among all events of that class. Finally, we extract a segment of PSG data with its start drawn uniformly from an interval around the sample midpoint of the chosen event. This ensures that each segment overlaps at least 50% with at least one associated event.
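The sampling scheme can be sketched as follows; the uniform class sampling and the midpoint-based segment placement follow the description above, while the function names and the exact interval bounds are illustrative assumptions:

```python
import numpy as np

def sample_class(num_classes, rng):
    """Sample a class index uniformly, i.e. with probability 1 / num_classes,
    inversely proportional to the number of classes (non-event class included)."""
    return int(rng.integers(num_classes))

def sample_segment_start(event_start, event_duration, segment_length, rng):
    """Draw a segment start so the event's sample midpoint falls inside the
    segment, guaranteeing substantial overlap between segment and event."""
    midpoint = event_start + event_duration // 2
    lo = max(0, midpoint - segment_length + 1)
    return int(rng.integers(lo, midpoint + 1))  # start in [m - L + 1, m]
```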
III-E Optimization of network parameters
The network parameters were optimized using mini-batch stochastic gradient descent with an initial learning rate and momentum. Minibatches were balanced with respect to the detected classes. The optimization was performed with respect to the same loss function described in [16, 17], and the network was trained until convergence, determined by no decrease in the loss on the eval set over 10 epochs of train data. We also employed learning rate decay by a factor of 2 every 5 epochs of non-decreasing eval loss.
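The convergence criterion and learning rate schedule above can be sketched as a generic training loop; `step_fn` and `eval_fn` stand in for one epoch of SGD and one evaluation pass on the eval set, and are assumptions of this sketch:

```python
def train_with_early_stopping(step_fn, eval_fn, lr, patience=10,
                              decay_patience=5, decay_factor=2.0, max_epochs=200):
    """Training loop sketch: decay the learning rate by `decay_factor` after
    every `decay_patience` epochs without eval-loss improvement; stop after
    `patience` epochs without improvement."""
    best = float("inf")
    since_best = 0
    for epoch in range(max_epochs):
        step_fn(lr)                 # one epoch of SGD at the current rate
        loss = eval_fn()
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best % decay_patience == 0:
                lr /= decay_factor  # decay by a factor of 2 every 5 stale epochs
            if since_best >= patience:
                break               # no eval improvement over 10 epochs
    return best, lr
```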
III-F Experimental setups
In this study, we examined two different experimental setups.
First, we investigated the differences in predictive performance using a static vs. a dynamic default event window size. This was realized by running six separate training runs, each with a fixed window size, as well as a single training run in which all window sizes were evaluated jointly. The best-performing model was determined by evaluating the F1 score on the eval set for both LM and AR detection.
Second, we tested a network with a recurrent processing block added after the feature extraction block, shown in grey in Table II. We considered a single bidirectional gated recurrent unit (bGRU) layer. Predictions were evaluated across multiple time-scales.
III-G Performance metrics
All models were evaluated on the eval and test sets using precision (Pr), recall (Re), and F1 scores:

Pr = TP / (TP + FP), Re = TP / (TP + FN), F1 = 2 · Pr · Re / (Pr + Re),

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively.
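The metric computation from matched event counts can be sketched as:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1

# Hypothetical counts: 8 correctly detected events, 2 spurious, 4 missed
pr, re, f1 = precision_recall_f1(8, 2, 4)
```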
III-H Statistical analysis
Demographic and polysomnographic variables were tested for subset differences with the Kruskal-Wallis H-test for independent samples.
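Assuming no tied observations, the Kruskal-Wallis H statistic can be sketched directly from joint ranks (the tie correction applied by standard statistical packages is omitted here):

```python
import numpy as np

def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction): rank all observations
    jointly, then compare each group's mean rank to the grand mean rank."""
    data = np.concatenate(groups)
    n = data.size
    order = np.argsort(data, kind="stable")
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)  # ranks 1..n over the pooled sample
    h = 0.0
    start = 0
    for g in groups:
        ni = len(g)
        mean_rank = ranks[start:start + ni].mean()
        h += ni * (mean_rank - (n + 1) / 2) ** 2
        start += ni
    return 12.0 / (n * (n + 1)) * h
```

The H statistic is then compared against a chi-squared distribution with (number of groups − 1) degrees of freedom to obtain a p-value.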
IV Results and Discussion
Shown in Figs. 2(a) and 2(b) are the F1 scores as a function of the IoU and classification thresholds for both the LM and AR detection models. It is apparent that both models perform best with a minimum overlap with their respective annotated events and do not benefit from increasing the required overlap. This may be caused by the annotated events not being sufficiently precise, rather than by issues with the model itself. For example, it is not uncommon to mark only the beginning of an event in standard sleep scoring software, as the duration is then automatically annotated with a default length, such as 3 s for ARs and 0.5 s for LMs (the minimum duration defined by the AASM guidelines [1]). Future studies will be able to confirm this by either collecting a precisely annotated cohort or by investigating the average start time and duration discrepancies between annotated and predicted events.
It is also apparent from Figs. 2(a) and 2(b) that both detection models benefit from imposing a strict classification threshold. Specifically, LM detection performance as measured by F1 was highest with a high threshold, while maximum AR detection performance was attained with an even higher threshold of 0.8.
Furthermore, we explored allowing for multiple time-scales in the dynamic models, as shown in Fig. 2(c). We hypothesized that dynamic rather than static default event windows would allow for more flexibility and thus better predictive performance; however, we observed no significant differences between the optimal static window and the dynamic window model.
Shown in Fig. 3 are the performance curves for the RNN (bidirectional GRU) version of the proposed model for each of the two event detection tasks. While the optimal IoU and classification threshold operating points are unchanged from the static/dynamic models presented in Fig. 2, the optimal F1 value for AR detection is increased by incorporating temporal dependencies in the model. The reverse is true for LM detection, which saw a slight decrease in predictive performance caused by a lower precision (see Table III). Future work should consider optimizing predictive performance by investigating the effects of varying the number of bGRU layers and the number of hidden units, since this was not performed here.
Application of the optimal models to the test data is shown in Table III. We observed that, with the given architecture and the given labels and input data in the train set, LM detection was maximal for the model with a static/dynamic window, while adding a recurrent module only positively impacted AR prediction. We observed a general decrease in both precision and recall for LM detection when adding the recurrent module, while precision actually increased and recall decreased for AR detection. An example visualization of the joint distribution of F1 scores obtained from the dynamic model applied to the test data is shown in Fig. 4.
Subset partitions were reasonably well-balanced, with no significant differences in key variables; see Table I. An exception is the AHI, although the associated effect is small and most likely a result of the low sample size in eval compared to train and test. It is noted that, although the AHI, AI, and PLMI are not normally distributed and summarizing these variables with standard deviations is invalid, this is nevertheless standard practice in sleep medicine and they are thus presented the same way here. We performed little data cleaning in order to provide as much data and variation to the deep learning model as possible; however, future efforts should explore and apply inclusion criteria such as a minimal total sleep time, artifact detection, and removal of studies with severe artifacts. We did impose a trivial lower bound on the number of scored events (>0) for a PSG to be included in this study, but stricter requirements could potentially improve model performance.
In this work, we investigated 'systemic' PSG events present in multiple signal modalities instead of EEG-specific events, which required changes to the network architecture. Specifically, we kept the signal modality encoded in the first dimension of the tensor propagated through the network, which allowed for the use of one-dimensional convolutional operators. By performing 1D convolutions and keeping the channel information in the feature maps, instead of keeping the channels as a separate dimension and performing 2D convolutions as proposed in [16, 17], we simplify the model and reduce the number of computations and the training time.
However, we did not investigate the effects of modeling the conditional probability of AR and LM occurrence, although the proposed architecture is versatile enough to detect both events jointly as well as separately. Previous work also suggests that detecting multiple objects at the same time is of high interest and leads to (at least) non-inferior performance [16, 17, 18, 19].
Additionally, we speculated that the temporal dynamics of the PSG signals were important for optimal event detection performance. Although the effects were small, we did show an increase in F1 score in AR detection when adding an RNN module to the network before the detection module. However, this was not the case for LM detection, which is most likely due to the different temporal and physiological characteristics of the two events in question.
Future efforts will address the fact that, in the current modeling scheme, events are mutually exclusive within a given default event window. However, it is common to see ARs and LMs arising as a result of one another, and thus, if the window size is too small, the less likely event (as measured by classification probability and IoU) will be removed even if it matches a specific true event of a certain class.
We have proposed a deep learning model that extends on previous work and shows promise in automatic detection of arousals and leg movements during sleep. The proposed model is flexible in allowing for the detection of multiple events of distinct physiological natures. Future work will expand further on adding more signals and event classes in order to complete a general purpose sleep analysis tool.
Some of the computing for this project was performed on the Sherlock cluster. We would like to thank Stanford University and the Stanford Research Computing Center for providing computational resources and support that contributed to these research results.
-  R. B. Berry, C. L. Albertario, S. M. Harding, R. M. Lloyd, D. T. Plante, S. F. Quan, M. M. Troester, and B. V. Vaughn, The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications, Version 2.5. Darien, IL: American Academy of Sleep Medicine, 2018.
-  P. Halász, M. Terzano, L. Parrino, and R. Bódizs, “The nature of arousal in sleep,” J. Sleep Res., vol. 13, no. 1, pp. 1–23, 2004.
-  L. J. Findley, M. E. Unverzagt, and P. M. Suratt, “Automobile Accidents Involving Patients with Obstructive Sleep Apnea,” Am. Rev. Respir. Dis., vol. 138, no. 2, pp. 337–340, 1988.
-  R. Ferri, B. B. Koo, D. L. Picchietti, and S. Fulda, “Periodic leg movements during sleep: phenotype, neurophysiology, and clinical significance,” Sleep Med., vol. 31, pp. 29–38, 2017.
-  American Academy of Sleep Medicine, International Classification of Sleep Disorders, 3rd ed. Darien, IL: American Academy of Sleep Medicine, 2014.
-  R. G. Norman, I. Pal, C. Stewart, J. A. Walsleben, and D. M. Rapoport, “Interobserver agreement among sleep scorers from different centers in a large dataset.” Sleep, vol. 23, no. 7, pp. 901–8, 2000.
-  R. S. Rosenberg and S. Van Hout, “The American Academy of Sleep Medicine Inter-scorer Reliability Program: Sleep Stage Scoring,” J. Clin. Sleep Med., vol. 9, no. 1, pp. 81–87, 2013.
-  M. Younes, J. Raneri, and P. Hanly, “Staging sleep in polysomnograms: Analysis of inter-scorer variability,” J. Clin. Sleep Med., vol. 12, no. 6, pp. 885–894, 2016.
-  M. H. Bonnet, K. Doghramji, T. Roehrs, E. J. Stepanski, S. H. Sheldon, A. S. Walters, M. Wise, and A. L. Chesson Jr, “The scoring of arousal in sleep: reliability, validity, and alternatives,” J. Clin. Sleep Med., vol. 3, no. 2, pp. 133–145, 2007.
-  R. S. Rosenberg and S. Van Hout, “The American Academy of Sleep Medicine inter-scorer reliability program: Respiratory events,” J. Clin. Sleep Med., vol. 10, no. 4, pp. 447–454, 2014.
-  A. N. Olesen, P. Jennum, P. Peppard, E. Mignot, and H. B. D. Sorensen, “Deep residual networks for automatic sleep stage classification of raw polysomnographic waveforms,” in 2018 40th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. (EMBC), Honolulu, HI, USA, July 18 – 20, 2018, pp. 3713–3716.
-  J. B. Stephansen, A. N. Olesen, M. Olsen, A. Ambati, E. B. Leary, H. E. Moore, O. Carrillo, L. Lin, F. Han, H. Yan, Y. L. Sun, Y. Dauvilliers, S. Scholz, L. Barateau, B. Hogl, A. Stefani, S. C. Hong, T. W. Kim, F. Pizza, G. Plazzi, S. Vandi, E. Antelmi, D. Perrin, S. T. Kuna, P. K. Schweitzer, C. Kushida, P. E. Peppard, H. B. D. Sorensen, P. Jennum, and E. Mignot, “Neural network analysis of sleep stages enables efficient diagnosis of narcolepsy,” Nat. Commun., vol. 9, no. 1, p. 5229, 2018.
-  S. Chambon, M. N. Galtier, P. J. Arnal, G. Wainrib, and A. Gramfort, “A Deep Learning Architecture for Temporal Sleep Stage Classification Using Multivariate and Multimodal Time Series,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 26, no. 4, pp. 758–769, 2018.
-  S. Biswal, H. Sun, B. Goparaju, M. B. Westover, J. Sun, and M. T. Bianchi, “Expert-level sleep scoring with deep neural networks,” J. Am. Med. Informatics Assoc., vol. 25, no. 12, pp. 1643–1650, 2018.
-  H. Phan, F. Andreotti, N. Cooray, O. Y. Chen, and M. De Vos, “SeqSleepNet: End-to-End Hierarchical Recurrent Neural Network for Sequence-to-Sequence Automatic Sleep Staging,” IEEE Trans. Neural Syst. Rehabil. Eng., 2019.
-  S. Chambon, V. Thorey, P. J. Arnal, E. Mignot, and A. Gramfort, “A deep learning architecture to detect events in EEG signals during sleep,” in 2018 IEEE Int. Work. Mach. Learn. Signal Process. (MLSP), Aalborg, Denmark, Sept. 17 – 20, 2018, pp. 1–6.
-  S. Chambon, V. Thorey, P. Arnal, E. Mignot, and A. Gramfort, “DOSED: a deep learning approach to detect multiple sleep micro-events in EEG signal,” J. Neurosci. Methods, 2019, https://doi.org/10.1016/j.jneumeth.2019.03.017.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," in 2016 IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 779–788.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single Shot MultiBox Detector,” in Computer Vision (ECCV), B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 21–37.
-  J. B. Blank, P. M. Cawthon, M. L. Carrion-Petersen, L. Harper, J. P. Johnson, E. Mitson, and R. R. Delay, “Overview of recruitment for the osteoporotic fractures in men study (MrOS),” Contemp. Clin. Trials, vol. 26, no. 5, pp. 557–568, 2005.
-  E. Orwoll, J. B. Blank, E. Barrett-Connor, J. Cauley, S. Cummings, K. Ensrud, C. Lewis, P. M. Cawthon, R. Marcus, L. M. Marshall, J. McGowan, K. Phipps, S. Sherman, M. L. Stefanick, and K. Stone, “Design and baseline characteristics of the osteoporotic fractures in men (MrOS) study – A large observational study of the determinants of fracture in older men,” Contemp. Clin. Trials, vol. 26, no. 5, pp. 569–585, 2005.
-  T. Blackwell, K. Yaffe, S. Ancoli-Israel, S. Redline, K. E. Ensrud, M. L. Stefanick, A. Laffan, and K. L. Stone, “Associations Between Sleep Architecture and Sleep-Disordered Breathing and Cognition in Older Community-Dwelling Men: The Osteoporotic Fractures in Men Sleep Study,” J. Am. Geriatr. Soc., vol. 59, no. 12, pp. 2217–2225, 2011.
-  D. A. Dean, A. L. Goldberger, R. Mueller, M. Kim, M. Rueschman, D. Mobley, S. S. Sahoo, C. P. Jayapandian, L. Cui, M. G. Morrical, S. Surovec, G.-Q. Zhang, and S. Redline, “Scaling Up Scientific Discovery in Sleep Medicine: The National Sleep Research Resource,” Sleep, vol. 39, no. 5, pp. 1151–1164, 2016.
-  G.-Q. Zhang, L. Cui, R. Mueller, S. Tao, M. Kim, M. Rueschman, S. Mariani, D. Mobley, and S. Redline, “The National Sleep Research Resource: towards a sleep data commons,” J. Am. Med. Informatics Assoc., vol. 25, no. 10, pp. 1351–1358, 2018.
-  A. Paszke, G. Chanan, Z. Lin, S. Gross, E. Yang, L. Antiga, and Z. Devito, "Automatic differentiation in PyTorch," in NIPS 2017 Autodiff Workshop, Long Beach, CA, USA, 2017.