REST: Robust and Efficient Neural Networks for Sleep Monitoring in the Wild

In recent years, significant attention has been devoted towards integrating deep learning technologies in the healthcare domain. However, to safely and practically deploy deep learning models for home health monitoring, two significant challenges must be addressed: the models should be (1) robust against noise; and (2) compact and energy-efficient. We propose REST, a new method that simultaneously tackles both issues via 1) adversarial training and controlling the Lipschitz constant of the neural network through spectral regularization while 2) enabling neural network compression through sparsity regularization. We demonstrate that REST produces highly-robust and efficient models that substantially outperform the original full-sized models in the presence of noise. For the sleep staging task over single-channel electroencephalogram (EEG), the REST model achieves a macro-F1 score of 0.67 vs. 0.39 achieved by a state-of-the-art model in the presence of Gaussian noise while obtaining 19x parameter reduction and 15x MFLOPS reduction on two large, real-world EEG datasets. By deploying these models to an Android application on a smartphone, we quantitatively observe that REST allows models to achieve up to 17x energy reduction and 9x faster inference. We open-source the code repository with this paper:


1. Introduction

As many as 70 million Americans suffer from sleep disorders that affect their daily functioning, long-term health and longevity. The long-term effects of sleep deprivation and sleep disorders include an increased risk of hypertension, diabetes, obesity, depression, heart attack, and stroke (altevogt2006sleep). The cost of undiagnosed sleep apnea alone is estimated to run to billions of dollars in the US (american2016economic).

Figure 1. Top: we generate hypnograms for a patient in the SHHS test set. In the presence of Gaussian noise, our Rest-generated hypnogram closely matches the contours of the expert-scored hypnogram. Hypnogram generated by a state-of-the-art (SOTA) model by Sors et al. (sors2018convolutional) is considerably worse. Bottom: we measure energy consumed (in Joules) and inference time (in seconds) on a smartphone to score one night of EEG recordings. Rest is 9X more energy efficient and 6X faster than the SOTA model.
Figure 2. Rest overview: (from left) When a noisy EEG signal belonging to the REM (rapid eye movement) sleep stage enters a traditional neural network which is vulnerable to noise, it gets wrongly classified as a Wake sleep stage. On the other hand, the same signal is correctly classified as the REM sleep stage by the Rest model, which is both robust and sparse. (From right) Rest is a three-step process involving (1) training the model with adversarial training, spectral regularization and sparsity regularization, (2) pruning the model, and (3) re-training the compact model.

A central tool in identifying sleep disorders is the hypnogram—which documents the progression of sleep stages (REM stage, Non-REM stages N1 to N3, and Wake stage) over an entire night (see Fig. 1, top). The process of acquiring a hypnogram from raw sensor data is called sleep staging, which is the focus of this work. Traditionally, to reliably obtain a hypnogram the patient has to undergo an overnight sleep study—called polysomnography (PSG)—at a sleep lab while wearing bio-sensors that measure physiological signals, which include electroencephalogram (EEG), eye movements (EOG), muscle activity or skeletal muscle activation (EMG), and heart rhythm (ECG). The PSG data is then analyzed by a trained sleep technician and a certified sleep doctor to produce a PSG report. The hypnogram plays an essential role in the PSG report, where it is used to derive many important metrics such as sleep efficiency and apnea index. Unfortunately, manually annotating this PSG is both costly and time consuming for the doctors. Recent research has proposed to alleviate these issues by automatically generating the hypnogram directly from the PSG using deep neural networks (biswal2017automated; supratak2017deepsleepnet). However, the process of obtaining a PSG report is still costly and invasive to patients, reducing their participation, which ultimately leads to undiagnosed sleep disorders (sterr2018sleep).

One promising direction to reduce undiagnosed sleep disorders is to enable sleep monitoring at home using commercial wearables (e.g., Fitbit, Apple Watch, Emotiv) (henriksen2018using). However, despite significant research advances, a recent study shows that wearables using a single sensor (e.g., single-lead EEG) often have lower performance for sleep staging, leaving considerable room for improvement (Beattie2017-qn).

1.1. Contributions

Our contributions are two-fold—(i) we identify emerging research challenges for the task of sleep monitoring in the wild; and (ii) we propose Rest, a novel framework that addresses these issues.

I. New Research Challenges for Sleep Monitoring.

  • C1. Robustness to Noise. We observe that state-of-the-art deep neural networks (DNN) are highly susceptible to environmental noise (Fig. 1, top). In the case of wearables, noise is a serious consideration since bioelectrical signal sensors (e.g., electroencephalogram “EEG”, electrocardiogram “ECG”) are commonly susceptible to Gaussian and shot noise, which can be introduced by electrical interference (e.g., power-line) and user motion (e.g., muscle contraction, respiration) (chang2011gaussian; blanco2008ecg; chen2010removal; bhateja2013novel). This creates a need for noise-tolerant models. In this paper, we show that adversarial training and spectral regularization can impart significant noise robustness to sleep staging DNNs (Fig. 1, top).

  • C2. Energy and Computational Efficiency. Mobile deep learning systems have traditionally offloaded compute-intensive inference to cloud servers, which requires transferring sensitive data and assumes that Internet access is available. However, this data uploading process is difficult for many healthcare scenarios because of (1) privacy: individuals are often reluctant to share health information as they consider it highly sensitive; and (2) accessibility: real-time home monitoring is most needed in resource-poor environments where high-speed Internet may not be reliably available. Directly deploying a neural network to a mobile phone bypasses these issues. However, due to the constrained computation and energy budget of mobile devices, these models need to be fast and parsimonious with their energy consumption.

II. Noise-robust and Efficient Sleep Monitoring. Having identified these two new research challenges, we propose Rest, the first framework for developing noise-robust and efficient neural networks for home sleep monitoring (Fig. 2). Through Rest, our major contributions include:

  • Robust and Efficient Neural Networks for Sleep Monitoring. By integrating a novel combination of three training objectives, Rest endows a model with noise robustness through (1) adversarial training and (2) spectral regularization; and promotes energy and computational efficiency by enabling compression through (3) sparsity regularization.

  • Extensive evaluation. We benchmark the performance of Rest against competitive baselines on two real-world sleep staging EEG datasets: Sleep-EDF from Physionet and Sleep Heart Health Study (SHHS). We demonstrate that Rest produces highly compact models that substantially outperform the original full-sized models in the presence of noise. Rest models achieve a macro-F1 score of 0.67 vs. 0.39 for the state-of-the-art model in the presence of Gaussian noise, with 19x parameter reduction and 15x MFLOPS reduction.

  • Real-world deployment. We deploy a Rest model onto a Pixel 2 smartphone through an Android application performing sleep staging. Our experiments reveal that Rest achieves up to 17x energy reduction and 9x faster inference on a smartphone, compared to uncompressed models.

2. Related Work

In this section we discuss related work from three areas—(1) the task of sleep stage prediction, (2) robustness of deep neural networks and (3) compression of deep learning models.

2.1. Sleep-Stage Prediction

Sleep staging is the task of annotating a polysomnography (PSG) report into a hypnogram, where 30 second sleep intervals are annotated into one of five sleep stages (W, N1, N2, N3, REM). Recently, significant effort has been devoted towards automating this annotation process using deep learning (sors2018convolutional; biswal2017automated; chambon2018deep; phan2019joint; andreotti2018multichannel; dina_icml_sleep), to name a few. While there exists a large body of research in this area—two works in particular look at both single channel (biswal2017automated) and multi-channel (chambon2018deep) deep learning architectures for sleep stage prediction on EEG. In (biswal2017automated), the authors develop a deep learning architecture (SLEEPNET) for sleep stage prediction that achieves expert-level accuracy on EEG data. In (chambon2018deep), the authors develop a multi-modal deep learning architecture for sleep stage prediction that achieves state-of-the-art accuracy. As we demonstrate later in this paper (Section 4.5), these sleep staging models are frequently susceptible to noise and suffer a large performance drop in its presence (see Figure 1). In addition, these DNNs are often overparameterized (Section 4.6), making deployment to mobile devices and wearables difficult. Through Rest, we address these limitations and develop noise robust and efficient neural networks for edge computing.

2.2. Noise & Adversarial Robustness

Adversarial robustness seeks to ensure that the output of a neural network remains unchanged under a bounded perturbation of the input; in other words, it prevents an adversary from maliciously perturbing the data to fool a neural network. Adversarial deep learning was popularized by (goodfellow2014explaining), where they showed it was possible to alter the class prediction of deep neural network models by carefully crafting an adversarially perturbed input. Since then, research suggests a strong link between adversarial robustness and noise robustness (Ford2019; hendrycks2019benchmarking; tsipras2018robustness). In particular, (Ford2019) found that by performing adversarial training on a deep neural network, it becomes robust to many forms of noise (e.g., Gaussian, blur, shot, etc.). In contrast, they found that training a model on Gaussian augmented data led to models that were less robust to adversarial perturbations. We build upon this finding of adversarial robustness as a proxy for noise robustness and improve upon it through the use of spectral regularization, while simultaneously compressing the model to a fraction of its original size for mobile devices.

2.3. Model Compression

Model compression aims to learn a reduced representation of the weights that parameterize a neural network, shrinking the computational requirements for memory, floating point operations (FLOPS), inference time and energy. Broadly, prior art can be classified into four directions: pruning (han2015learning), quantization (rastegari2016xnor), low rank approximation (xue2013restructuring) and knowledge distillation (hinton2015distilling). For Rest, we focus on structured (channel) pruning thanks to its performance benefits (speedup, FLOP reduction) and ease of deployment with regular hardware. In structured channel pruning, the idea is to assign a measure of importance to each filter of a convolutional neural network (CNN) and achieve desired sparsity by pruning the least important ones. Prior work demonstrates several ways to estimate filter importance: magnitude of weights (li2016pruning), structured sparsity regularization (wen2016learning), regularization on activation scaling factors (liu2017learning), filter similarity (cluster2019duggal) and discriminative power of filters (zhuang2018discrimination). Recently there has been an attempt to bridge the area of model compression with adversarial robustness through connection pruning (guo2018sparse) and quantization (lin2019defensive). Different from previous work, Rest aims to compress a model by pruning whole filters while imparting noise tolerance through adversarial training and spectral regularization. Rest can be further compressed through quantization (lin2019defensive).

3. Rest: Noise-Robust & Efficient Models

Rest is a new method that simultaneously compresses a neural network while developing both noise and adversarial robustness.

3.1. Overview

Our main idea is to enable Rest to endow models with these properties by integrating three careful modifications of the traditional training loss function: (1) the adversarial training term, which builds noise robustness by training on adversarial examples (Section 3.2); (2) the spectral regularization term, which adds to the noise robustness by constraining the Lipschitz constant of the neural network (Section 3.3); and (3) the sparsity regularization term, which helps to identify important neurons and enables compression (Section 3.4). Throughout the paper, we follow standard notation and use capital bold letters for matrices (e.g., $\mathbf{A}$) and lower-case bold letters for vectors (e.g., $\mathbf{a}$).


3.2. Adversarial Training

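The goal of adversarial training is to generate noise robustness by exposing the neural network to adversarially perturbed inputs during the training process. Given a neural network $f(\mathbf{X}; \mathbf{W})$ with input $\mathbf{X}$, weights $\mathbf{W}$ and corresponding loss function $L(f(\mathbf{X}; \mathbf{W}), y)$, adversarial training aims at solving the following min-max problem:

$$\min_{\mathbf{W}} \; \mathbb{E}_{(\mathbf{X}, y) \sim \mathcal{D}} \left[ \max_{\boldsymbol{\delta} \in \Delta} L\big(f(\mathbf{X} + \boldsymbol{\delta}; \mathbf{W}), y\big) \right] \qquad (1)$$

Here $\mathcal{D}$ is the unperturbed dataset consisting of the clean EEG signals $\mathbf{X} \in \mathbb{R}^{K_{in} \times K_L}$ ($K_{in}$ is the number of channels and $K_L$ is the length of the signal) along with their corresponding label $y$. The inner maximization problem in (1) embodies the goal of the adversary, that is, to produce adversarially perturbed inputs (i.e., $\mathbf{X} + \boldsymbol{\delta}$) that maximize the loss function $L$. On the other hand, the outer minimization term aims to build robustness by countering the adversary through minimizing the expected loss on perturbed inputs.

Maximizing the inner loss term in (1) is equivalent to finding the adversarial signal $\mathbf{X} + \boldsymbol{\delta}$ that maximally alters the loss function within some bounded perturbation $\boldsymbol{\delta} \in \Delta$. Here $\Delta$ is the set of allowable perturbations. Several choices exist for such an adversary. For Rest, we use the iterative Projected Gradient Descent (PGD) adversary since it is one of the strongest first order attacks (madry2017towards). Its operation is described below in Equation 2.

$$\mathbf{X}^{(t)} = \Pi_{\tau}\Big( \mathbf{X}^{(t-1)} + \epsilon \cdot \mathrm{sign}\big( \nabla_{\mathbf{X}} L(f(\mathbf{X}^{(t-1)}; \mathbf{W}), y) \big) \Big) \qquad (2)$$

Here $\mathbf{X}^{(0)} = \mathbf{X}$ and at every step $t$, the previous perturbed input $\mathbf{X}^{(t-1)}$ is modified with the sign of the gradient of the loss, multiplied by $\epsilon$ (controls attack strength). $\Pi_{\tau}$ is a function that clips the input at the positions where it exceeds the predefined bound $\tau$. Finally, after $n_{iter}$ iterations we have the Rest adversarial training term $L_A$ in Equation 3.

$$L_A = L\big(f(\mathbf{X}^{(n_{iter})}; \mathbf{W}), y\big) \qquad (3)$$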
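The PGD update described above can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: a logistic-regression model stands in for the sleep staging network so that the input gradient has a closed form, and the names `pgd_attack`, `eps`, `tau` and `n_iter` are chosen here for illustration.

```python
import math

def sign(v):
    # Returns -1, 0 or 1.
    return (v > 0) - (v < 0)

def logistic_loss_grad(x, w, y):
    """Gradient of the logistic loss w.r.t. the input x for a linear model w.x
    (a stand-in for the network; label y is 0 or 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]

def pgd_attack(x, w, y, eps, tau, n_iter):
    """Iterative signed-gradient ascent with step size eps; after every step the
    perturbed input is clipped so no position moves more than tau away from x."""
    x_adv = list(x)
    for _ in range(n_iter):
        grad = logistic_loss_grad(x_adv, w, y)
        x_adv = [xa + eps * sign(g) for xa, g in zip(x_adv, grad)]
        # Projection step: clip each position back into [x_i - tau, x_i + tau].
        x_adv = [min(max(xa, xi - tau), xi + tau) for xa, xi in zip(x_adv, x)]
    return x_adv

x_clean = [0.5, -1.0, 2.0]
x_adv = pgd_attack(x_clean, w=[1.0, -2.0, 0.0], y=1, eps=0.1, tau=0.25, n_iter=5)
```

In adversarial training, the loss evaluated on `x_adv` (rather than on `x_clean`) is what gets minimized with respect to the weights.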



3.3. Spectral Regularizer

The second term in the objective function is the spectral regularization term, which aims to constrain the change in output of a neural network for some change in input. The intuition is to suppress the amplification of noise as it passes through the successive layers of a neural network. In this section we show that an effective way to achieve this is via constraining the Lipschitz constant of each layer’s weights.

For a real valued function $f: \mathbb{R} \rightarrow \mathbb{R}$ the Lipschitz constant is a positive real value $C$ such that $|f(x_1) - f(x_2)| \leq C |x_1 - x_2|$. If $C > 1$ then the change in input is magnified through the function $f$. For a neural net, this can lead to input noise amplification. On the other hand, if $C < 1$ then the noise amplification effect is diminished. This can have the unintended consequence of reducing the discriminative capability of a neural net. Therefore our goal is to set the Lipschitz constant $C = 1$. The Lipschitz constant for the $l^{th}$ fully connected layer parameterized by the weight matrix $\mathbf{W}^{(l)}$ is equivalent to its spectral norm $\rho(\mathbf{W}^{(l)})$ (cisse2017parseval). Here the spectral norm of a matrix $\mathbf{W}$ is the square root of the largest eigenvalue of $\mathbf{W}^T \mathbf{W}$, which equals the largest singular value of $\mathbf{W}$. The spectral norm of a 1-D convolutional layer parameterized by a weight tensor can be realized by reshaping the tensor to a matrix and then computing the largest singular value.

A neural network of $N$ layers can be viewed as a function $f$ composed of $N$ sub-functions $f = f_1 \circ f_2 \circ \dots \circ f_N$. A loose upper bound for the Lipschitz constant of $f$ is the product of Lipschitz constants of individual layers, or $\rho(f) \leq \prod_{l=1}^{N} \rho(f_l)$ (cisse2017parseval). The overall Lipschitz constant can grow exponentially if the spectral norm of each layer is greater than 1. On the contrary, it could go to 0 if the spectral norm of each layer is between 0 and 1. Thus the ideal case arises when the spectral norm for each layer equals 1. This can be achieved in several ways (yoshida2017spectral; cisse2017parseval; farnia2018generalizable); one effective way is to encourage orthonormality in the columns of the weight matrix $\mathbf{W}$ through the minimization of $\|\mathbf{W}^T \mathbf{W} - \mathbf{I}\|$, where $\mathbf{I}$ is the identity matrix. This additional loss term helps regulate the singular values and bring them close to 1. Thus we incorporate the following spectral regularization term into our loss objective, where $\lambda_o$ is a hyperparameter controlling the strength of the spectral regularization.

$$L_S = \lambda_o \sum_{l=1}^{N} \left\| (\mathbf{W}^{(l)})^T \mathbf{W}^{(l)} - \mathbf{I} \right\| \qquad (4)$$
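To make the orthonormality penalty and the layer-wise Lipschitz bound concrete, here is a small pure-Python sketch for matrices stored as lists of rows. The function names are illustrative, and the Frobenius norm is an assumption made for this example; the text above does not fix which matrix norm is used.

```python
import math

def orthonormality_penalty(W):
    """Frobenius norm (assumed) of W^T W - I; zero when the columns of W are orthonormal."""
    n_cols = len(W[0])
    total = 0.0
    for i in range(n_cols):
        for j in range(n_cols):
            gram_ij = sum(row[i] * row[j] for row in W)
            total += (gram_ij - (1.0 if i == j else 0.0)) ** 2
    return math.sqrt(total)

def spectral_norm(W, n_iter=100):
    """Largest singular value of W, estimated by power iteration on W^T W.
    Assumes W is non-zero and the start vector is not orthogonal to the top direction."""
    n_cols = len(W[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(n_iter):
        Wv = [sum(row[j] * v[j] for j in range(n_cols)) for row in W]
        u = [sum(W[i][j] * Wv[i] for i in range(len(W))) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in u))
        v = [x / norm for x in u]
    Wv = [sum(row[j] * v[j] for j in range(n_cols)) for row in W]
    return math.sqrt(sum(x * x for x in Wv))

def lipschitz_upper_bound(layers):
    """Loose bound for a stack of linear layers: product of per-layer spectral norms."""
    return math.prod(spectral_norm(W) for W in layers)

c, s = math.cos(0.3), math.sin(0.3)
rotation = [[c, -s], [s, c]]          # orthonormal columns, spectral norm 1
stretch = [[3.0, 0.0], [0.0, 1.0]]    # amplifies one direction by 3
```

A rotation leaves the penalty at zero, while `stretch` is penalized and contributes a factor of 3 to the product bound, which is the noise amplification the regularizer is meant to suppress.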


3.4. Sparsity Regularizer & Rest Loss Function

The third term of the Rest objective function consists of the sparsity regularizer. With this term, we aim to learn the important filters in the neural network. Once these are determined, the original neural network can be pruned to the desired level of sparsity.

The incoming weights for filter $i$ in the $l^{th}$ fully connected (or 1-D convolutional) layer can be specified as $\mathbf{W}^{(l)}_{i,:}$ (or $\mathbf{W}^{(l)}_{i,:,:}$). We introduce a per filter multiplicand $\gamma^{(l)}_i$ that scales the output activation of the $i^{th}$ neuron in layer $l$. By controlling the value of this multiplicand, we realize the importance of the neuron. In particular, zeroing it amounts to dropping the entire filter. Note that the $L_0$ norm on the multiplicand vector $\|\boldsymbol{\gamma}^{(l)}\|_0$, where $\boldsymbol{\gamma}^{(l)}$ collects the multiplicands of all filters in layer $l$, can naturally satisfy the sparsity objective since it counts the number of non-zero entries in a vector. However, since the $L_0$ norm is a nondifferentiable function, we use the $L_1$ norm as a surrogate (lebedev2016fast; wen2016learning; liu2017learning), which is amenable to backpropagation through its subgradient.

To realize the per filter multiplicand $\gamma^{(l)}_i$, we leverage the per filter multiplier within the batch normalization layer (liu2017learning). In most modern networks, a batchnorm layer immediately follows the convolutional/linear layers and implements the following operation.

$$\hat{a}^{(l)}_i = \gamma^{(l)}_i \, \frac{a^{(l)}_i - \mu^{(l)}}{\sigma^{(l)}} + \beta^{(l)}_i \qquad (5)$$

Here $a^{(l)}_i$ denotes the output activation of filter $i$ in layer $l$ while $\hat{a}^{(l)}_i$ denotes its transformation through batchnorm layer $l$; $\mu^{(l)}$, $\sigma^{(l)}$ denote the mini-batch mean and standard deviation for layer $l$’s activations; and $\gamma^{(l)}_i$ and $\beta^{(l)}_i$ are learnable parameters. Our sparsity regularization is defined on $\boldsymbol{\gamma}^{(l)}$ as below, where $\lambda_g$ is a hyperparameter controlling the strength of sparsity regularization.

$$L_G = \lambda_g \sum_{l=1}^{N} \left\| \boldsymbol{\gamma}^{(l)} \right\|_1 \qquad (6)$$


The sparsity regularization term (6) promotes learning a subset of important filters while training the model. Compression then amounts to globally pruning filters with the smallest value of multiplicands in (5) to achieve the desired model compression. Pruning typically causes a large drop in accuracy. Once the pruned model is identified, we fine-tune it via retraining.
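The global pruning step can be sketched as follows. This is an illustrative outline rather than the authors' code: `gammas` is a hypothetical mapping from layer name to that layer's batchnorm multiplicands, and tie handling (all filters at or below the threshold are dropped) is an assumption.

```python
def l1_sparsity_penalty(gammas, lam):
    """lam times the sum of |gamma| over all layers (the L1 surrogate)."""
    return lam * sum(abs(g) for layer in gammas.values() for g in layer)

def global_prune_masks(gammas, sparsity):
    """Return a keep-mask per layer that drops the `sparsity` fraction of filters
    with the smallest |gamma|, ranked across all layers at once."""
    magnitudes = sorted(abs(g) for layer in gammas.values() for g in layer)
    n_prune = int(sparsity * len(magnitudes))
    if n_prune == 0:
        return {name: [True] * len(layer) for name, layer in gammas.items()}
    threshold = magnitudes[n_prune - 1]
    return {name: [abs(g) > threshold for g in layer]
            for name, layer in gammas.items()}

# Hypothetical multiplicands for two layers with five filters each.
gammas = {
    "conv1": [0.9, 0.01, 0.5, 0.02, 0.03],
    "conv2": [0.001, 0.7, 0.04, 0.002, 0.05],
}
masks = global_prune_masks(gammas, sparsity=0.8)
```

Because the ranking is global, layers can end up with very different numbers of surviving filters; a practical implementation would likely guard against a layer losing all of its filters.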

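Now that we have discussed each component of Rest, we present the full loss function in (7) and the training process in Algorithm 1. A pictorial overview of the process can be seen in Figure 2.

$$L_R = \underbrace{L\big(f(\mathbf{X}^{(n_{iter})}; \mathbf{W}), y\big)}_{\text{adversarial training}} + \underbrace{\lambda_o \sum_{l=1}^{N} \left\| (\mathbf{W}^{(l)})^T \mathbf{W}^{(l)} - \mathbf{I} \right\|}_{\text{spectral regularization}} + \underbrace{\lambda_g \sum_{l=1}^{N} \left\| \boldsymbol{\gamma}^{(l)} \right\|_1}_{\text{sparsity regularization}} \qquad (7)$$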


4. Experiments

We compare the efficacy of Rest neural networks to four baseline models (Section 4.2) on two publicly available EEG datasets—Sleep-EDF from Physionet (goldberger2000physiobank) and Sleep Heart Health Study (SHHS) (quan1997sleep). Our evaluation focuses on two broad directions—noise robustness and model efficiency. Noise robustness compares the efficacy of each model when EEG data is corrupted with three types of noise: adversarial, Gaussian and shot. Model efficiency compares both static (e.g., model size, floating point operations) and dynamic measurements (e.g., inference time, energy consumption). For dynamic measurements which depend on device hardware, we deploy each model to a Pixel 2 smartphone.

Dataset W N1 N2 N3(N4) REM Total
Sleep-EDF 8,168 2,804 17,799 5,703 7,717 42,191
SHHS 28,854 3,377 41,246 13,409 13,179 100,065
Table 1. Dataset summary outlining the number of 30 second EEG recordings belonging to each sleep stage class.

4.1. Datasets

Our evaluation uses two real-world sleep staging EEG datasets.

  • Sleep-EDF: This dataset consists of data from two studies: age effect in healthy subjects (SC) and Temazepam effects on sleep (ST). Following (supratak2017deepsleepnet), we use whole-night polysomnographic sleep recordings on 40 healthy subjects (one night per patient) from SC. It is important to note that the SC study is conducted in the subjects’ homes, not a sleep center, and hence this dataset is inherently noisy. However, the sensing environment is still relatively controlled since sleep doctors visited the patients’ homes to set up the wearable EEG sensors. After obtaining the data, the recordings are manually classified into one of eight classes (W, N1, N2, N3, N4, REM, MOVEMENT, UNKNOWN); we follow the steps in (supratak2017deepsleepnet) and merge stages N3 and N4 into a single N3 stage and exclude MOVEMENT and UNKNOWN stages to match the five stages of sleep according to the American Academy of Sleep Medicine (AASM) (berry2012aasm). Each single channel EEG recording of 30 seconds corresponds to a vector of dimension $1 \times 3000$ (the EEG is sampled at 100 Hz). Similar to (sors2018convolutional), while scoring at time $i$, we include EEG recordings from times $i-3$, $i-2$, $i-1$ and $i$. Thus we expand the EEG vector by concatenating the previous three time steps to create a vector of size $1 \times 12000$. After pre-processing the data, our dataset consists of 42,191 EEG recordings, each described by a 12,000 length vector and assigned a sleep stage label from Wake, N1, N2, N3 and REM using the Fpz-Cz EEG sensor (see Table 1 for sleep stage breakdown). Following standard practice (supratak2017deepsleepnet), we divide the dataset on a per-patient, whole-night basis, using 80% for training, 10% for validation, and 10% for testing. That is, a single patient is recorded for one night and can only be in one of the three sets (training, validation, testing). The numbers of EEG recordings in the resulting splits do not exactly follow the 80-10-10 ratio, because the data is split over the total number of patients, where each patient is monitored for a time period of variable length (9 hours ± a few minutes).

  • Sleep Heart Health Study (SHHS): The Sleep Heart Health Study consists of two rounds of polysomnographic recordings (SHHS-1 and SHHS-2) sampled at 125 Hz in a sleep center environment. Following (sors2018convolutional), we use only the first round (SHHS-1) containing 5,793 polysomnographic records over two channels (C4-A1 and C3-A2). Recordings are manually classified into one of six classes (W, N1, N2, N3, N4 and REM). As suggested in (berry2012aasm), we merge N3 and N4 stages into a single N3 stage (see Table 1 for sleep stage breakdown). We use 100 distinct patients randomly sampled from the original dataset (one night per patient). Similar to (sors2018convolutional), we look at three previous time steps in order to score the EEG recording at the current time step. This amounts to concatenating the current EEG recording of size $1 \times 3750$ (equal to 125 Hz $\times$ 30 s) with the three preceding recordings to generate an EEG recording of size $1 \times 15000$. After this pre-processing, our dataset consists of 100,065 EEG recordings, each described by a 15,000 length vector and assigned a sleep stage label from the same 5 classes using the Fpz-Cz EEG sensor. We use the same 80-10-10 data split as in Sleep-EDF.
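The context expansion used for both datasets (the current 30-second recording concatenated with the three preceding ones) might look like the sketch below. The helper name `add_context` is hypothetical, and skipping the first three recordings of a night, which lack a full history, is an assumption; the text does not say how those are handled.

```python
def add_context(epochs, n_prev=3):
    """Concatenate every 30-second epoch with its n_prev predecessors (oldest first).
    Epochs without a full history are skipped (assumed boundary handling)."""
    windows = []
    for i in range(n_prev, len(epochs)):
        window = []
        for j in range(i - n_prev, i + 1):
            window.extend(epochs[j])
        windows.append(window)
    return windows

SAMPLES_PER_EPOCH = 125 * 30  # 125 Hz x 30 s = 3750 samples, as in SHHS

# Six dummy epochs from one night; epoch k is filled with the value k.
night = [[float(k)] * SAMPLES_PER_EPOCH for k in range(6)]
windows = add_context(night)
```

The function should be applied to one night at a time, so that a window never mixes recordings from two different patients.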

4.2. Model Architecture and Configurations

We use the sleep staging CNN architecture proposed by (sors2018convolutional), since it achieves state-of-the-art accuracy for sleep stage classification using single channel EEG. We implement all models in PyTorch 0.4. For training and evaluation, we use a server equipped with an Intel Xeon E5-2690 CPU, 250GB RAM and 8 Nvidia Titan Xp GPUs. Mobile device measurements use a Pixel 2 smartphone with an Android application running TensorFlow Lite. With (sors2018convolutional) as the architecture for all baselines below, we compare the following 6 configurations:

  1. Sors (sors2018convolutional): Baseline neural network model trained on unperturbed data. This model contains 12 1-D convolutional layers followed by 2 fully connected layers and achieves state-of-the-art performance on sleep staging using single channel EEG.

  2. Liu (liu2017learning): We train on unperturbed data and compress the Sors model using sparsity regularization as proposed in (liu2017learning).

  3. Blanco (blanco1997applying): We use the same setup as Liu above. During test time, the noisy test input is filtered using a bandpass filter with cutoff 0.5Hz-40Hz. This technique is commonly used for removing noise in EEG analysis (blanco1997applying).

  4. Ford (Ford2019): We train and compress the Sors model with sparsity regularization on input data perturbed by Gaussian noise. The Gaussian training parameter $c_g$ = 0.2 controls the perturbation strength during training; it is identified through a line search in Section 4.4.

  5. Rest (A): Our compressed Sors model obtained through adversarial training and sparsity regularization. We use the hyperparameters $\epsilon$ = 10 and $n_{iter}$ = 5/10 (SHHS/Sleep-EDF), where $\epsilon$ is a key variable controlling the strength of adversarial perturbation during training. The optimal $\epsilon$ value is determined through a line search described in Section 4.4.

  6. Rest (A+S): Our compressed Sors model obtained through adversarial training, spectral and sparsity regularization. We set the spectral regularization parameter $\lambda_o$ = 0.003 and the sparsity regularization parameter $\lambda_g$ = 1E-05 based on a grid search in Section 4.4.

All models are trained for 30 epochs using SGD. The initial learning rate is set to 0.1 and multiplied by 0.1 at epochs 10 and 20; the weight decay is set to 0.0002. All compressed models use the same compression method, consisting of weight pruning followed by model re-training. The sparsity regularization parameter $\lambda_g$ is identified through a grid search with $\lambda_o$ (after determining $\epsilon$ through a line search). Detailed analysis of the hyperparameter selection for $\epsilon$, $\lambda_o$ and $\lambda_g$ can be found in Section 4.4. Finally, we set a high sparsity level of 0.8 (80% of the neurons in the original networks are pruned) after observing that the models are overparameterized for the task of sleep stage classification.
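The step learning-rate schedule stated above can be written as a short function. This is only a sketch: whether epochs are counted from 0 or 1 is not specified, so 0-based counting is assumed here, and the function name is illustrative.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(10, 20), factor=0.1):
    """Step schedule: start at base_lr and multiply by factor at each milestone epoch.
    Epochs are assumed to be 0-indexed."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= factor
    return lr

# Learning rate for each of the 30 training epochs.
schedule = [learning_rate(epoch) for epoch in range(30)]
```

Under this assumption the 30 epochs are split into three blocks of ten at roughly 0.1, 0.01 and 0.001.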

4.3. Evaluation Metrics

Noise robustness metrics. To study the noise robustness of each model configuration, we evaluate macro-F1 score in the presence of three types of noise: adversarial, Gaussian and shot. We select macro-F1 since it is a standard metric for evaluating classification performance in imbalanced datasets. Adversarial noise is defined at three strength levels through $\epsilon$ = 2/6/12 in Equation 2; Gaussian noise at three levels through $c_g$ = 0.1/0.2/0.3 in Equation 8; and shot noise at three levels through the parameter $c_s$ in Equation 9. These parameter values are chosen based on prior work (madry2017towards; hendrycks2019benchmarking) and empirical observation. For evaluating robustness to adversarial noise, we assume the white box setting where the attacker has access to model weights. The formulations for Gaussian and shot noise are given in Equations 8 and 9, respectively.


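$$\mathbf{X}_{gauss} = \mathbf{X} + \mathcal{N}\big(0, (c_g \cdot \sigma_{train})^2\big) \qquad (8)$$

In Equation 8, $\sigma_{train}$ is the standard deviation of the training data and $\mathcal{N}$ is the normal distribution. The noise strength (low, medium and high) corresponds to $c_g$ = 0.1/0.2/0.3.

$$\mathbf{X}_{norm} = \frac{\mathbf{X} - x_{min}}{x_{max} - x_{min}}, \qquad \mathbf{X}_{shot} = \mathrm{clip}_{[0,1]}\left( \frac{\mathrm{Poisson}(\mathbf{X}_{norm} \cdot c_s)}{c_s} \right) \qquad (9)$$

In Equation 9, $x_{min}$, $x_{max}$ denote the minimum and maximum values in the training data; and $\mathrm{clip}_{[0,1]}$ is a function that projects the input to the range [0,1].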
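As an illustration of the Gaussian corruption, the sketch below perturbs a signal with zero-mean noise whose standard deviation is a multiple of the training-data standard deviation. The scaling by the standard deviation (rather than the variance), the parameter name `c_g`, and the helper name are assumptions made for this example. Shot noise is left out because the Python standard library has no Poisson sampler.

```python
import random
import statistics

def add_gaussian_noise(signal, c_g, sigma_train, rng):
    """Add zero-mean Gaussian noise with standard deviation c_g * sigma_train
    to every sample (assumed form of the corruption)."""
    return [v + rng.gauss(0.0, c_g * sigma_train) for v in signal]

rng = random.Random(0)                    # fixed seed for reproducibility
clean = [0.0] * 20000                     # flat dummy signal
noisy = add_gaussian_noise(clean, c_g=0.2, sigma_train=5.0, rng=rng)
noise_std = statistics.pstdev(noisy)      # close to 0.2 * 5.0 = 1.0
```

Tying the noise level to the training-data standard deviation keeps the low, medium and high settings comparable across datasets with different signal amplitudes.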

Model efficiency metrics. To evaluate the efficiency of each model configuration, we use the following measures:

  • Parameter Reduction: Memory consumed (in KB) for storing the weights of a model.

  • Floating point operations (FLOPS): Number of multiply and add operations performed by the model in one forward pass. Measurement units are Mega ($10^6$).

  • Inference Time: Average time taken (in seconds) to score one night of EEG data. We assume a night consists of 9 hours and amounts to 1,080 EEG recordings (each of 30 seconds). This is measured on a Pixel 2 smartphone.

  • Energy Consumption: Average energy consumed by a model (in Joules) to score one night of EEG data on a Pixel 2 smartphone. To measure consumed energy, we implement an infinite inference loop over EEG recordings until the battery level drops by 15 percentage points. For each unit percent drop (i.e., 15 levels), we log the number of iterations performed by the model. Given that a standard Pixel 2 battery can deliver 2700 mAh at 3.85 Volts, we use the following conversion to estimate energy consumed (in Joules) for a unit percent drop in battery level: $E = \frac{2700}{1000} \times 3600 \times 3.85 \times \frac{1}{100} \approx 374.2$. The total energy for inferencing over an entire night of EEG recordings is then calculated as $\frac{E}{I} \times 1080$, where $I$ is the number of inferences made in the unit battery drop interval. We average this over every unit battery percentage drop (i.e., 15 intervals) to calculate the average energy consumption.
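The energy estimate follows from simple arithmetic on the battery rating, worked through in the sketch below. Function and variable names are illustrative, and the inference counts are made-up numbers rather than measurements.

```python
BATTERY_MAH = 2700        # Pixel 2 battery capacity, as stated above
BATTERY_VOLTS = 3.85
EPOCHS_PER_NIGHT = 1080   # 9 hours of 30-second recordings

def joules_per_percent(capacity_mah=BATTERY_MAH, volts=BATTERY_VOLTS):
    """Energy in Joules released by a one-percent drop in battery level."""
    coulombs = capacity_mah / 1000 * 3600   # mAh -> Ah -> ampere-seconds
    return coulombs * volts / 100

def energy_per_night(inferences_per_percent):
    """Average energy (Joules) to score one night, given the number of inferences
    logged in each one-percent battery interval."""
    e = joules_per_percent()
    per_interval = [e / n * EPOCHS_PER_NIGHT for n in inferences_per_percent]
    return sum(per_interval) / len(per_interval)

# Hypothetical log: 15 intervals with 5,400 inferences each.
example = energy_per_night([5400] * 15)
```

With these made-up counts, each inference costs about 0.069 J, so a night of 1,080 recordings costs about 74.8 J.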

4.4. Hyperparameter Selection

Optimal hyper-parameter selection is crucial for obtaining good performance with both baseline and Rest models. We systematically conduct a series of line and grid searches to determine ideal values of $\epsilon$, $c_g$, $\lambda_o$ and $\lambda_g$ using the validation sets.

Selecting $\epsilon$. This parameter controls the perturbation strength of adversarial training in Equation 2. Correctly setting this parameter is critical since too small a value will have no effect on noise robustness, while too high a value will lead to poor benign accuracy. We follow standard procedure and determine the optimal $\epsilon$ on a per-dataset basis (madry2017towards), conducting a line search across the range [0, 30] in steps of 2. For each value of $\epsilon$ we measure benign and adversarial validation macro-F1 score, where adversarial macro-F1 is an average of three strength levels: low ($\epsilon$=2), medium ($\epsilon$=6) and high ($\epsilon$=12). We then select the $\epsilon$ with the highest macro-F1 score averaged across the benign and adversarial macro-F1. Line search results are shown in Figure 3; we select $\epsilon$ = 10 for both datasets since it is the value with the highest average macro-F1.

Figure 3. Line search results for $\epsilon$ on Sleep-EDF and SHHS datasets. We select $\epsilon$=10, since it provides the best average macro-F1 score on both datasets.
                   Gaussian F1
$c_g$  Benign F1   Low    Med    High   Average F1

0.1    0.75        0.76   0.70   0.50   0.68
0.2    0.70        0.72   0.75   0.64   0.70
0.3    0.67        0.68   0.71   0.75   0.70

0.1    0.69        0.74   0.45   0.21   0.52
0.2    0.68        0.69   0.68   0.43   0.62
0.3    0.55        0.57   0.65   0.74   0.63
Table 2. Line search results for identifying optimal $c_g$ on Sleep-EDF and SHHS datasets. Macro-F1 is abbreviated F1 in table; average macro-F1 is the mean of all macro-F1 scores. We select $c_g$=0.2 for both datasets as it represents a good trade-off between benign and Gaussian macro-F1.
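
| Spectral weight | Sparsity weight | Benign F1 | Adversarial F1 (Low) | Adversarial F1 (Med) | Adversarial F1 (High) | Avg. F1 |
|---|---|---|---|---|---|---|
| 0.001 | 1E-04 | 0.73 | 0.66 | 0.65 | 0.61 | 0.66 |
| 0.003 | 1E-04 | 0.72 | 0.64 | 0.63 | 0.59 | 0.65 |
| 0.005 | 1E-04 | 0.72 | 0.65 | 0.64 | 0.62 | 0.66 |
| 0.001 | 1E-05 | 0.73 | 0.66 | 0.65 | 0.62 | 0.67 |
| 0.003 | 1E-05 | 0.73 | 0.67 | 0.66 | 0.62 | 0.67 |
| 0.005 | 1E-05 | 0.73 | 0.64 | 0.64 | 0.62 | 0.66 |

Table 3. Grid search results for the spectral and sparsity regularization weights on the Sleep-EDF dataset. Macro-F1 is abbreviated as F1 in the table; average macro-F1 is the mean of all macro-F1 scores. We select the pair of weights with the highest average macro-F1 score.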

Selecting σ: This parameter controls the noise perturbation strength of Gaussian training in Equation 8. Similar to ε, we determine σ on a per-dataset basis, conducting a line search across three values: 0.1 (low), 0.2 (medium) and 0.3 (high). Based on results from Table 2, we select σ=0.2 for both datasets since it provides the best average macro-F1 score while minimizing the drop in benign accuracy.
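The trade-off behind this choice can be made concrete by recomputing the averages from Table 2. The sketch below is illustrative only: the helper name `summarize` is hypothetical, and the assignment of the first block of rows to Sleep-EDF and the second to SHHS is an assumption based on the order in the table caption.

```python
from statistics import mean

# Validation macro-F1 per sigma, copied from Table 2 as
# (benign, Gaussian low, Gaussian medium, Gaussian high).
# Assumption: first block of rows is Sleep-EDF, second block is SHHS.
TABLE_2 = {
    "Sleep-EDF": {
        0.1: (0.75, 0.76, 0.70, 0.50),
        0.2: (0.70, 0.72, 0.75, 0.64),
        0.3: (0.67, 0.68, 0.71, 0.75),
    },
    "SHHS": {
        0.1: (0.69, 0.74, 0.45, 0.21),
        0.2: (0.68, 0.69, 0.68, 0.43),
        0.3: (0.55, 0.57, 0.65, 0.74),
    },
}


def summarize(scores_by_sigma):
    """Map each sigma to (average F1, drop in benign F1 vs. the best benign F1)."""
    best_benign = max(scores[0] for scores in scores_by_sigma.values())
    return {
        sigma: (round(mean(scores), 4), round(best_benign - scores[0], 4))
        for sigma, scores in scores_by_sigma.items()
    }


for dataset, scores_by_sigma in TABLE_2.items():
    for sigma, (average, benign_drop) in summarize(scores_by_sigma).items():
        print(dataset, sigma, average, benign_drop)
```

In the first block, σ=0.2 and σ=0.3 have the same unrounded average (0.7025), but σ=0.2 gives up 0.05 of benign macro-F1 relative to σ=0.1, whereas σ=0.3 gives up 0.08. In the second block the benign drop at σ=0.3 is larger still (0.14 versus 0.01), which is consistent with preferring σ=0.2.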

Selecting the spectral and sparsity regularization weights: These parameters determine the strength of spectral and sparsity regularization in Equation 7. We determine the best value for each through a grid search across the parameter values listed in Table 3, and select the combination with the highest average macro-F1 score. Since these are model-dependent parameters, we calculate them once on the Sleep-EDF dataset and re-use them for SHHS.
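A grid search of this kind can be sketched as follows. This is a minimal illustration, not the authors' code: `grid_search` and `evaluate` are hypothetical names, the lookup into Table 3 stands in for training and validating a model at each grid point, and reading the first column of Table 3 as the spectral weight and the second as the sparsity weight is an assumption based on the order in which the text names them.

```python
from itertools import product
from statistics import mean

# Table 3 rows keyed by (first column, second column), with values
# (benign, adversarial low, adversarial medium, adversarial high) macro-F1.
# Assumption: first column = spectral weight, second = sparsity weight.
TABLE_3 = {
    (0.001, 1e-4): (0.73, 0.66, 0.65, 0.61),
    (0.003, 1e-4): (0.72, 0.64, 0.63, 0.59),
    (0.005, 1e-4): (0.72, 0.65, 0.64, 0.62),
    (0.001, 1e-5): (0.73, 0.66, 0.65, 0.62),
    (0.003, 1e-5): (0.73, 0.67, 0.66, 0.62),
    (0.005, 1e-5): (0.73, 0.64, 0.64, 0.62),
}


def grid_search(first_values, second_values, evaluate):
    """Return the grid point whose scores have the highest mean.

    `evaluate(a, b)` must return a sequence of validation macro-F1 scores;
    in practice it would train and validate a model for that grid point.
    """
    return max(
        product(first_values, second_values),
        key=lambda pair: mean(evaluate(*pair)),
    )


best = grid_search(
    [0.001, 0.003, 0.005],
    [1e-4, 1e-5],
    lambda a, b: TABLE_3[(a, b)],  # table lookup instead of real training
)
print(best)
```

Using unrounded means avoids the tie that the two-decimal averages in Table 3 show between the rows (0.001, 1E-05) and (0.003, 1E-05): their means are 0.665 and 0.67 respectively.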

| Data | Method | Compress | No noise | Adversarial (Low) | Adversarial (Med) | Adversarial (High) | Gaussian (Low) | Gaussian (Med) | Gaussian (High) | Shot (Low) | Shot (Med) | Shot (High) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sleep-EDF | Sors (sors2018convolutional) | No | 0.67 ± 0.02 | 0.57 ± 0.02 | 0.51 ± 0.04 | 0.19 ± 0.06 | 0.66 ± 0.03 | 0.60 ± 0.03 | 0.39 ± 0.08 | 0.58 ± 0.04 | 0.42 ± 0.08 | 0.11 ± 0.03 |
| Sleep-EDF | Liu (liu2017learning) | Yes | 0.69 ± 0.02 | 0.52 ± 0.07 | 0.41 ± 0.07 | 0.09 ± 0.02 | 0.67 ± 0.02 | 0.53 ± 0.02 | 0.28 ± 0.04 | 0.52 ± 0.03 | 0.31 ± 0.04 | 0.06 ± 0.01 |
| Sleep-EDF | Blanco (blanco1997applying) | Yes | 0.68 ± 0.01 | 0.51 ± 0.06 | 0.40 ± 0.06 | 0.09 ± 0.02 | 0.65 ± 0.02 | 0.54 ± 0.04 | 0.31 ± 0.10 | 0.53 ± 0.04 | 0.34 ± 0.09 | 0.08 ± 0.02 |
| Sleep-EDF | Ford (Ford2019) | Yes | 0.64 ± 0.01 | 0.59 ± 0.01 | 0.60 ± 0.02 | 0.31 ± 0.08 | 0.65 ± 0.01 | 0.67 ± 0.02 | 0.57 ± 0.03 | 0.67 ± 0.02 | 0.60 ± 0.02 | 0.10 ± 0.01 |
| Sleep-EDF | Rest (A) | Yes | 0.66 ± 0.02 | 0.64 ± 0.02 | 0.64 ± 0.02 | 0.61 ± 0.02 | 0.66 ± 0.02 | 0.67 ± 0.01 | 0.66 ± 0.01 | 0.67 ± 0.01 | 0.66 ± 0.01 | 0.42 ± 0.06 |
| Sleep-EDF | Rest (A+S) | Yes | 0.69 ± 0.01 | 0.67 ± 0.02 | 0.66 ± 0.01 | 0.61 ± 0.03 | 0.69 ± 0.01 | 0.68 ± 0.01 | 0.67 ± 0.02 | 0.68 ± 0.01 | 0.67 ± 0.02 | 0.42 ± 0.08 |
| SHHS | Sors (sors2018convolutional) | No | 0.78 ± 0.01 | 0.62 ± 0.03 | 0.46 ± 0.03 | 0.33 ± 0.00 | 0.64 ± 0.03 | 0.43 ± 0.02 | 0.35 ± 0.04 | 0.69 ± 0.02 | 0.59 ± 0.03 | 0.45 ± 0.01 |
| SHHS | Liu (liu2017learning) | Yes | 0.77 ± 0.01 | 0.61 ± 0.02 | 0.49 ± 0.04 | 0.34 ± 0.03 | 0.66 ± 0.05 | 0.45 ± 0.05 | 0.34 ± 0.04 | 0.70 ± 0.04 | 0.62 ± 0.04 | 0.47 ± 0.05 |
| SHHS | Blanco (blanco1997applying) | Yes | 0.77 ± 0.01 | 0.60 ± 0.03 | 0.47 ± 0.04 | 0.33 ± 0.02 | 0.64 ± 0.07 | 0.43 ± 0.05 | 0.34 ± 0.04 | 0.67 ± 0.06 | 0.59 ± 0.05 | 0.46 ± 0.04 |
| SHHS | Ford (Ford2019) | Yes | 0.62 ± 0.02 | 0.59 ± 0.01 | 0.62 ± 0.00 | 0.59 ± 0.05 | 0.66 ± 0.00 | 0.75 ± 0.04 | 0.47 ± 0.10 | 0.65 ± 0.00 | 0.68 ± 0.01 | 0.74 ± 0.04 |
| SHHS | Rest (A) | Yes | 0.70 ± 0.01 | 0.68 ± 0.00 | 0.70 ± 0.01 | 0.67 ± 0.01 | 0.72 ± 0.01 | 0.76 ± 0.01 | 0.58 ± 0.03 | 0.72 ± 0.01 | 0.74 ± 0.01 | 0.76 ± 0.01 |
| SHHS | Rest (A+S) | Yes | 0.72 ± 0.01 | 0.69 ± 0.01 | 0.70 ± 0.01 | 0.69 ± 0.02 | 0.74 ± 0.01 | 0.77 ± 0.01 | 0.62 ± 0.03 | 0.73 ± 0.01 | 0.75 ± 0.01 | 0.78 ± 0.00 |

Table 4. Meta Analysis: Comparison of macro-F1 scores achieved by each model. The models are evaluated on Sleep-EDF and SHHS datasets with three types and strengths of noise corruption. Each cell reports the macro-F1 score averaged over 3 runs ± its standard deviation. Rest performs better in all noise test measurements.

4.5. Noise Robustness

To evaluate noise robustness, we ask the following questions: (1) what is the impact of Rest on model accuracy with and without noise in the data? and (2) how does Rest training compare to baseline methods of benign training, Gaussian training and noise filtering? In answering these questions, we analyze noise robustness of models at three scales: (i) meta-level macro-F1 scores; (ii) meso-level confusion matrix heatmaps; and (iii) granular-level single-patient hypnograms.

I. Meta Analysis: Macro-F1 Scores. In Table 4, we present a high-level overview of model performance through macro-F1 scores on three types and strength levels of noise corruption. The macro-F1 scores and standard deviations are computed over three runs for each model and noise level. We identify multiple key insights as described below:

  1. Rest Outperforms Across All Types of Noise: As demonstrated by the higher macro-F1 scores, Rest outperforms all baseline methods in the presence of noise. In addition, Rest has a low standard deviation, indicating that model performance is not dependent on weight initialization.

  2. Spectral Regularization Improves Performance: Rest (A+S) consistently improves upon Rest (A), indicating the usefulness of spectral regularization towards enhancing noise robustness by constraining the Lipschitz constant.

  3. SHHS Performance Better Than Sleep-EDF: Performance is generally better on the SHHS dataset compared to Sleep-EDF. One possible explanation is that the SHHS dataset is less noisy than the Sleep-EDF dataset, since the SHHS study was performed in the hospital setting while Sleep-EDF was undertaken in the home setting.

  4. Benign & Adversarial Accuracy Trade-off: Contrary to the traditional trade-off between benign and adversarial accuracy, Rest performance matches Liu in the no-noise setting on Sleep-EDF. This is likely attributable to the noise in the Sleep-EDF dataset, which was collected in the home setting. On the SHHS dataset, where data is captured in the less noise-prone hospital setting, the Liu model outperforms Rest in the no-noise setting. As a result, Rest models are best positioned for use in noisy environments (e.g., at home), while traditional models are more effective in controlled environments (e.g., sleep labs).

II. Meso Analysis: Per-class Performance. We visualize and identify class-wise trends using confusion matrix heatmaps (Fig. 4). Each confusion matrix describes a model’s performance for a given level of noise (or no noise). A model that is performing well should have a dark diagonal and light off-diagonal. We normalize the rows of each confusion matrix to accurately represent class predictions in an imbalanced dataset. When a matrix diagonal has a value of 1 (dark blue, or dark green) the model predicts every example correctly; the opposite occurs at 0 (white). Analyzing Figure 4, we identify the following key insights:

  1. Rest Performs Well Across All Classes: Rest accurately predicts each sleep stage (W, N1, N2, N3, REM) across multiple types of noise (Fig. 4, bottom 3 rows), as evidenced by the dark diagonal. In comparison, each baseline method has considerable performance degradation (light diagonal) in the presence of noise. This is particularly evident on the Sleep-EDF dataset (left half), where data is collected in the noisier home environment.

  2. N1 Class Difficult to Predict: When no noise is present (Fig. 4, top row), each method performs well, as evidenced by the dark diagonal, except on the N1 sleep stage class. This performance drop is likely due to the limited number of N1 examples in the datasets (see Table 1).

  3. Increased Misclassification Towards “Wake” Class: On the Sleep-EDF dataset, shot and adversarial noise cause the baseline models to mispredict classes as Wake. One possible explanation is that the models misinterpret the additive noise as evidence for the Wake class, which has characteristically large fluctuations.
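The row normalization used for these heatmaps can be sketched in plain Python. This is a minimal illustration with hypothetical helper names and made-up toy labels, not the code used to produce Figure 4. Dividing each row by its total means every ground-truth class sums to 1 regardless of how many examples it has, so a rare class such as N1 is not visually drowned out by frequent classes.

```python
STAGES = ["W", "N1", "N2", "N3", "REM"]


def confusion_counts(y_true, y_pred, labels=STAGES):
    """Count matrix with ground-truth classes as rows, predictions as columns."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for truth, prediction in zip(y_true, y_pred):
        counts[index[truth]][index[prediction]] += 1
    return counts


def normalize_rows(counts):
    """Divide each row by its total; rows with no examples stay all zero."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Toy labels for illustration only (not data from the paper).
y_true = ["W", "W", "N1", "N2", "N2", "N2", "N2"]
y_pred = ["W", "W", "W", "N2", "N2", "N2", "N1"]
matrix = normalize_rows(confusion_counts(y_true, y_pred))
```

In this toy example the N2 row has a diagonal entry of 0.75, while the single N1 example is predicted as Wake, so the N1 row has a diagonal entry of 0 and a value of 1 in the W column.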

Figure 4. Meso Analysis: Class-wise comparison of model predictions. The models are evaluated over the SHHS test set perturbed with different noise types. In each confusion matrix, rows are ground-truth classes while columns are predicted classes. The intensity of a cell is obtained by normalizing the score with respect to the class membership. When a cell has a value of 1 (dark blue, or dark green) the model predicts every example correctly; the opposite occurs at 0 (white). A model that is performing well would have a dark diagonal and light off-diagonal. Rest has the darkest cells along the diagonal on both datasets.
Figure 5. Granular Analysis: Comparison of the overnight hypnograms obtained for a patient in the SHHS test set. The hypnograms are generated using the Sors (left) and Rest (right) models in the presence of increasing strengths of Gaussian noise. When no noise is present (top row), both models perform well, closely matching the ground truth (bottom row). However, with increasing noise, Sors performance rapidly degrades, while Rest continues to generate accurate hypnograms.

III. Granular Analysis: Single-patient Hypnograms. We want to understand more deeply how our Rest models counteract noise at the hypnogram level. Therefore, we select a test set patient from the SHHS dataset, and generate and visualize the patient’s overnight hypnograms using the Sors and Rest models on three levels of Gaussian noise corruption (Figure 5). Each of these hypnograms is compared to a trained technician’s hypnogram (expert scored in Fig. 5), representing the ground truth. We inspect a few more test set patients using the above approach, and identify multiple key representative insights:

  1. Noisy Environments Require Robust Models: As data noise increases, Sors performance degrades. This begins at the low noise level, further accelerates in the medium level and reaches nearly zero at the high level. In contrast, Rest effectively handles all levels of noise, generating an accurate hypnogram at even the highest level.

  2. Low Noise Environments Give Good Performance: In the no-noise setting (top row) both the Sors and Rest models generate accurate hypnograms, closely matching the contours of expert scoring (bottom).

Figure 6. Time and energy consumption for scoring a single night of EEG recordings. Rest (A+S) is significantly faster and more energy efficient than the state-of-the-art Sors model. Evaluations were done on a Pixel 2 smartphone.

4.6. Model Efficiency

We measure model efficiency along two dimensions: (1) static metrics: the amount of memory required to store weights, and FLOPS; and (2) dynamic metrics: inference time and energy consumption. For dynamic measurements that depend on device hardware, we deploy each model to a Pixel 2 smartphone.

Analyzing Static Metrics: Memory & FLOPS. Table 5 describes the size (in KB) and computational requirements (in MFLOPS) of each model. We identify the following key insights:

  1. Rest Models Require Fewest FLOPS: On both datasets, Rest requires the fewest FLOPS.

  2. Rest Models are Small: Rest models are also smaller than (or comparable to) baseline compressed models while achieving significantly better noise robustness.

  3. Model Efficiency and Noise Robustness: Combining the insights from Section 4.5 and the above, we observe that Rest models have significantly better noise robustness while maintaining a competitive memory footprint. This suggests that robustness depends more on the training process than on model capacity.

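| Data | Model | Size (KB) | MFLOPS |
|---|---|---|---|
| Sleep-EDF | Sors (sors2018convolutional) | 8,896 | 1451 |
| Sleep-EDF | Liu (liu2017learning) | 440 | 127 |
| Sleep-EDF | Blanco (blanco1997applying) | 440 | 127 |
| Sleep-EDF | Ford (Ford2019) | 448 | 144 |
| Sleep-EDF | Rest (A) | 464 | 98 |
| Sleep-EDF | Rest (A+S) | 449 | 94 |
| SHHS | Sors (sors2018convolutional) | 8,996 | 1815 |
| SHHS | Liu (liu2017learning) | 464 | 211 |
| SHHS | Blanco (blanco1997applying) | 464 | 211 |
| SHHS | Ford (Ford2019) | 478 | 170 |
| SHHS | Rest (A) | 476 | 160 |
| SHHS | Rest (A+S) | 496 | 142 |

Table 5. Comparison of model size and the FLOPS required to score a single night of EEG recordings. Rest models are significantly smaller than the uncompressed Sors model and comparable in size and compute to the compressed baselines.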
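The reduction factors implied by Table 5 can be checked with a short calculation. The sketch below is illustrative: `reduction_factors` is a hypothetical helper, and it assumes the first block of rows in Table 5 is Sleep-EDF and the second is SHHS, following the order used elsewhere in the paper.

```python
# (size in KB, MFLOPS) copied from Table 5.
# Assumption: first block of rows is Sleep-EDF, second block is SHHS.
TABLE_5 = {
    "Sleep-EDF": {"Sors": (8896, 1451), "Rest (A+S)": (449, 94)},
    "SHHS": {"Sors": (8996, 1815), "Rest (A+S)": (496, 142)},
}


def reduction_factors(dataset, baseline="Sors", model="Rest (A+S)"):
    """Return (size reduction, MFLOPS reduction) of `model` vs. `baseline`."""
    base_size, base_flops = TABLE_5[dataset][baseline]
    size, flops = TABLE_5[dataset][model]
    return base_size / size, base_flops / flops


for dataset in TABLE_5:
    size_factor, flops_factor = reduction_factors(dataset)
    print(f"{dataset}: {size_factor:.1f}x smaller, {flops_factor:.1f}x fewer MFLOPS")
```

Under this reading, Rest (A+S) is about 19.8× smaller and needs about 15.4× fewer MFLOPS than Sors in the first block, and about 18.1× and 12.8× in the second, which is roughly in line with the 19× parameter and 15× MFLOPS reductions quoted in the abstract.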

Analyzing Dynamic Metrics: Inference Time & Energy. In Figure 6, we benchmark the inference time and energy consumption of a Sors and a Rest model deployed on a Pixel 2 smartphone using TensorFlow Lite. We identify the following insights:

  1. Rest Models Run Faster: When deployed, Rest runs up to 9× faster than the uncompressed model across the two datasets.

  2. Rest Models are Energy Efficient: Rest models also consume up to 17× less energy than an uncompressed model across the Sleep-EDF and SHHS datasets.

  3. Enabling Sleep Staging for Edge Computing: The above benefits demonstrate that model compression effectively translates into faster inference and a reduction in energy consumption. These benefits are crucial for deploying on the edge.

5. Conclusion

We identified two key challenges in developing deep neural networks for sleep monitoring in the home environment: robustness to noise and efficiency. We proposed to solve these challenges through Rest, a new method that simultaneously tackles both issues. For the sleep staging task over electroencephalogram (EEG), Rest trains models that achieve up to 19× parameter reduction and 15× MFLOPS reduction with an increase of up to 0.36 in macro-F1 score in the presence of noise. By deploying these models to a smartphone, we demonstrate that Rest achieves up to 17× energy reduction and 9× faster inference.

6. Acknowledgments

This work was supported in part by NSF awards IIS-1418511, CCF-1533768, IIS-1838042, CNS-1704701, IIS-1563816; GRFP (DGE-1650044); and National Institutes of Health awards NIH R01 1R01NS107291-01 and R56HL138415.