More than 422 million people around the world (more than 24 million in the U.S. alone) suffer from diabetes [43, 36]. This chronic disease imposes a substantial economic burden on both patients and governments, and accounts for nearly 25% of the entire healthcare expenditure in the U.S. However, diabetes prevention, care, and especially early diagnosis remain challenging because the disease usually develops and gets treated outside a clinic, hence out of reach of advanced clinical care. In fact, it is estimated that more than 75% of patients remain undiagnosed. This may lead to irreversible and costly consequences: studies have shown that the longer a person lives with undiagnosed and untreated diabetes, the worse their health outcomes are likely to be. Without an early alarm, people with pre-diabetes, a less severe and still reversible condition, could progress within five years to diabetes mellitus, which can no longer be cured. Thus, it is important to develop an accessible and accurate diabetes diagnosis system for daily-life scenarios; such a system could greatly improve general welfare and bend the associated healthcare expenditure curve downwards.
The emergence of wearable medical sensors (WMSs) points to a promising way to address this challenge. In the past decade, advancements in low-power sensors and signal processing techniques have led to many disruptive WMSs. These WMSs enable continuous sensing of physiological signals during daily activities, and thus provide a powerful, yet user-transparent, human-machine interface for tracking the user's health status. Combining WMSs with machine learning opens up the possibility of pervasive health condition tracking and disease diagnosis in a daily context. This approach exploits the knowledge distillation capability of machine learning to extract medical insights from health-related physiological signals. Hence, it offers a promising way to bridge the information gap that currently separates the clinical and daily domains, enabling a unified smart healthcare system that serves people in both scenarios.
However, disease diagnosis based on WMS data, and its effective deployment at the edge, remain challenging. Conventional approaches typically involve feature extraction, model training, and model deployment. Such an approach suffers from two major problems:
Inefficient feature extraction: Handcrafting features may require substantial engineering effort and expert domain knowledge for each targeted disease. Searching for informative features through trial and error can be very inefficient, making it hard to effectively explore the available feature space. This problem is exacerbated as the feature space scales up, given (i) the growing number of signal types available from WMSs, and (ii) the more than 69,000 human diseases that need to be monitored.
Vast computation cost: Due to the large number of floating-point operations (FLOPs) required for feature extraction and model inference, continuous health monitoring can be very computationally intensive, hence hard to deploy on resource-constrained platforms.
To solve these problems, we propose a framework called DiabDeep that combines off-the-shelf WMSs with efficient neural networks (NNs) for pervasive diabetes diagnosis. DiabDeep completely bypasses the feature extraction stage, acts on raw signals captured by commercially available WMSs, and makes accurate diagnostic decisions. It supports inference both on the server and the edge. On the resource-rich server, we deploy stacked sparsely connected (SC) layers (DiabNN-server) to focus on high accuracy. On the resource-poor edge, we use the hidden-layer long short-term memory (H-LSTM) based recurrent layer (DiabNN-edge) to cut down on computation and storage costs while incurring only a minor accuracy loss. Augmented by a grow-and-prune training methodology, DiabDeep simultaneously improves accuracy, shrinks model size, and cuts down on computation costs relative to conventional approaches, such as support vector machines (SVMs) and random forest.
We summarize the major contributions of this article as follows:
We propose a novel DiabDeep framework that combines off-the-shelf WMSs and efficient NNs for pervasive diabetes diagnosis. DiabDeep draws on both physiological and demographic information that can be captured by WMSs in the daily domain, including Galvanic skin response, blood volume pulse, inter-beat interval of the heart, body temperature, ambient environmental conditions, body movements, and the patient's demographic background.
We design a novel DiabNN architecture that uses different NN layers in its edge and server inference model variants to accommodate varying resource budgets and design goals.
We develop a training flow for DiabNNs based on a grow-and-prune NN synthesis paradigm that enables the networks to learn both weights and connections in order to simultaneously tackle accuracy and compactness.
We show that DiabDeep is accurate: we evaluate DiabDeep based on data collected from 52 participants. Our system achieves a 96.3% (95.3%) accuracy in classifying diabetics against healthy individuals on the server (edge), and 95.7% (94.6%) accuracy in distinguishing among type-1 diabetics, type-2 diabetics, and healthy individuals.
We show that DiabDeep is efficient: we compare DiabNNs with conventional models, including SVMs with linear and radial basis function (RBF) kernels, k-nearest neighbors (k-NN), random forests, and linear ridge classifiers. DiabNNs achieve the highest accuracy, while reducing model size (FLOPs) by up to 454.5× (8.9×).
We show that DiabDeep is pervasive: it captures all the signals non-invasively through comfortably-worn WMSs that are already commercially available. This greatly assists with continuous diabetes detection and monitoring without disrupting daily lifestyle.
The rest of this paper is organized as follows. We review related work in Section 2. In Section 3, we discuss the proposed DiabDeep framework in detail. We explain the implementation details of DiabDeep in Section 4 and present our experimental results in Section 5. In Section 6, we discuss how the human brain inspired our framework and outline future research directions. Finally, we draw conclusions in Section 7.
2 Related Work
In this section, we first discuss diabetes diagnosis approaches using machine learning algorithms that have been previously proposed. Then, we focus on recent progress in efficient NN design.
2.1 Machine learning for diabetes diagnosis
Numerous studies have focused on applying machine learning algorithms to diabetes diagnosis from the clinical domain to the daily scenario.
Clinical approach: Electronic health records have been widely used as an information source for diabetes prediction and intervention. With the recent upsurge in the availability of biomedical datasets, new information sources have been unveiled for diabetes diagnosis, including gene sequences and retinal images. However, these approaches are still restricted to the clinical domain, hence have very limited access to patient status once the patient leaves the clinic.
Daily approach: Daily glucose level detection has recently captured an increasing amount of research attention. One stream of work has explored subcutaneous glucose monitoring for continuous glucose tracking in a daily scenario. However, this approach is invasive, requires a high level of compliance, relies on regular sensor replacement (every 3-14 days), and impacts user experience. Recent systems have started exploiting non-invasive WMSs to alleviate these shortcomings. For example, Yin et al. combine machine learning ensembles and non-invasive WMSs to achieve a diabetes diagnostic accuracy of 77.6%. Ballinger et al. propose a system called DeepHeart that acts on Apple Watch data and patient demographics. DeepHeart uses bidirectional LSTMs to deliver an 84.5% diagnostic accuracy. However, it relies on a narrow spectrum of WMS signals that includes only discrete heart rate and step count measurements (indirectly estimated via photoplethysmography and an accelerometer). This may lead to information loss, hence reduced diagnostic capability. Swapna et al. achieve a 93.6% diagnostic accuracy by combining convolutional neural networks (CNNs) with LSTMs and heart rate variability measurements. However, their system relies on an electrocardiogram (ECG) data stream sampled at 500Hz that is not supported by commercial WMSs.
2.2 Efficient neural networks
Efficient NN design is a vibrant field. We discuss two approaches next.
Compact model architecture: One stream of research exploits the design of efficient building blocks for NN redundancy removal. For example, MobileNetV2 stacks inverted residual building blocks to effectively shrink its model size and reduce its FLOPs. Ma et al. use channel shuffle operations and depth-wise convolutions to deliver model compactness. Wu et al. propose ShiftNet based on shift-based modules, as opposed to spatial convolution layers, to achieve substantial computation and storage cost reductions. Automated compact architecture design also provides a promising solution [34, 44]. Dai et al. develop efficient performance predictors to speed up the search for efficient NNs. Compared to MobileNetV2 on the ImageNet dataset, their generated ChamNets achieve up to 8.5% absolute top-1 accuracy improvement while substantially reducing inference latency.
Network compression: Compression techniques [14, 5] have emerged as another popular direction for NN redundancy removal. The pruning methodology was initially demonstrated to be effective on large CNNs: it reduced the number of parameters in AlexNet by 9× and in VGG by 13× on the well-known ImageNet dataset, without any accuracy loss. Follow-up works have also successfully shown its effectiveness on recurrent NNs, such as LSTMs [13, 33, 24]. Network growth is a complementary method to pruning that enables a sparser, yet more accurate, model before pruning starts [5, 4]. A grow-and-prune synthesis paradigm typically reduces the number of parameters in CNNs [5, 15] and LSTMs by another 2×, while increasing the classification accuracy. It enables NN based inference even on Internet-of-Things (IoT) sensors. The model can be further compressed through low-bit quantization. For example, Zhu et al. show that a ternary representation of the weights, instead of full-precision (32-bit) values, in ResNet-56 can significantly reduce memory cost while incurring only a minor accuracy loss. The quantized models offer additional speedup potential for current NN accelerators.
Knowledge distillation: Knowledge distillation allows a compact student network to distill information (or 'dark knowledge') from a more accurate, but computationally intensive, teacher network (or group of teacher networks) by mimicking the teacher's prediction distribution, given the same data inputs. The idea was first introduced by Hinton et al. Since then, knowledge distillation has been effectively used to discover efficient networks. Romero et al. proposed FitNets, which distill knowledge from the teacher's hint layers to teach compact students. Passalis et al. enhanced the knowledge distillation process by introducing a feature space probability distribution loss. Yim et al. proposed fast minimization techniques based on intermediate feature maps that can also support transfer learning.
3 The DiabDeep Framework
In this section, we describe the proposed DiabDeep framework in detail. We first give a high-level overview of the entire framework. Then, we zoom into the DiabNN architecture used for DiabDeep inference, followed by a detailed description of the gradient-based growth and magnitude-based pruning algorithms used for DiabNN training.
3.1 The DiabDeep framework
We illustrate the proposed DiabDeep framework in Fig. 1. DiabDeep captures both physiological and demographic information as data input. It deploys a grow-and-prune training paradigm to deliver two inference models, i.e., DiabNN-server and DiabNN-edge, that enable inference on the server and on the edge, respectively. Finally, DiabDeep generates a diagnosis as output. We describe the data input, model training, and model inference next.
Data input: As mentioned earlier, DiabDeep focuses on (i) physiological signals and (ii) demographic information that are available in the daily context. Physiological signals can be captured by WMSs (e.g., on a smartphone and smartwatch) in a non-invasive, passive, and efficient manner. The list of collectible signals includes, but is not limited to, heart rate, body temperature, Galvanic skin response, and blood volume pulse. Additional signals, such as electromechanical and ambient environmental data (e.g., accelerometer, gyroscope, and humidity sensor readings), may also provide information for user habit tracking that offers diagnostic insights. This list is expanding rapidly, given the speed of ongoing technological advancements in this field. Demographic information (e.g., age, weight, gender, and height) also assists with disease diagnosis. It can be easily captured and updated through a simple user interface on a smartwatch or smartphone. Both physiological and demographic data are then aggregated and merged into a comprehensive data input for subsequent analysis.
Model training: DiabDeep utilizes a grow-and-prune paradigm to train its NNs, as shown in the middle part of Fig. 1. It starts NN synthesis from a sparse seed architecture. It first allows the network to grow connections and neurons based on gradient information. Then, it prunes away insignificant connections and neurons based on magnitude information to drastically reduce model redundancy. This leads to improved accuracy and efficiency [5, 3], where the former is highly preferred on the server and the latter is critical at the edge. The training process generates two inference models, i.e., DiabNN-server and DiabNN-edge, for server and edge inference, respectively. Both models share the same DiabNN architecture, but differ in the choice of internal NN layers based on their different resource constraints and design objectives, as explained later.
Model inference: Due to the distinct inference environments encountered upon deployment, DiabNN-server and DiabNN-edge require different input data flows, as depicted by the separate data paths in Fig. 1. For DiabNN-server, data have to be accumulated in local memory, e.g., local phone/watch storage, before they can be transferred to the base station daily, weekly, or monthly, depending on user preference. As opposed to this accumulate-then-infer process, DiabNN-edge enables on-the-fly inference directly at the edge, e.g., on a smartphone. This lets users receive instantaneous diagnostic decisions. As mentioned earlier, it incurs a slight accuracy degradation (around 1%) due to the scarce energy and memory budgets on the edge. However, this deficit may be alleviated when DiabNN-edge works jointly with DiabNN-server. When an alarm is raised, DiabNN-edge can store the relevant data sections as disease-onset records (DORs) that can later be transferred to DiabNN-server for further analysis. In this manner, DiabNN-edge substantially reduces the required edge memory by bypassing the storage of not-of-interest signal sections, while preserving the capability to make accurate inferences on the server side. Such DORs can also serve as informative references when future physician intervention and checkups are needed.
We next explain our proposed DiabNN architecture in detail.
3.2 The DiabNN architecture
Fig. 2 shows the DiabNN architecture that distills diagnostic decisions (shown at the top) from data inputs (shown at the bottom). There are three sequential steps employed during this process: (i) data preprocessing, (ii) transformation via NN layers, and (iii) output generation using softmax. We describe these steps next.
The preprocessing stage is critical for DiabNN inference due to the following reasons:
Data normalization: NNs typically favor normalized inputs. Normalization methods, such as min-max scaling, standardization, and L2 normalization, generally lead to accuracy and noise tolerance improvements [20, 16]. In this work, we apply min-max scaling to map each input data stream into the [0,1] range: x' = (x - x_min) / (x_max - x_min), where x_min and x_max denote the minimum and maximum values observed for that stream in the training set.
Data alignment: WMS data streams may vary in their start times and sampling frequencies. Therefore, we synchronize the data streams by checking their timestamps and applying appropriate offsets accordingly.
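The two preprocessing steps above can be sketched as follows. This is a minimal NumPy illustration under our own naming; DiabDeep's exact offset-correction logic is not specified here, so alignment is illustrated by trimming streams to their common time span.

```python
import numpy as np

def min_max_scale(train_stream, stream):
    """Scale a data stream into [0, 1] using ranges extracted
    from the training set only (as described in Section 4.1)."""
    lo, hi = train_stream.min(), train_stream.max()
    return np.clip((stream - lo) / (hi - lo + 1e-12), 0.0, 1.0)

def align_streams(timestamps_a, values_a, timestamps_b, values_b):
    """Synchronize two streams by trimming both to their common
    time span, using their per-sample timestamps."""
    start = max(timestamps_a[0], timestamps_b[0])
    end = min(timestamps_a[-1], timestamps_b[-1])
    keep_a = (timestamps_a >= start) & (timestamps_a <= end)
    keep_b = (timestamps_b >= start) & (timestamps_b <= end)
    return values_a[keep_a], values_b[keep_b]
```

Note that the scaling ranges come from the training set only, matching the procedure described in Section 4.1; using test-set statistics would leak information.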
We use different NN layers in DiabNN for server and edge inference. DiabNN-server deploys SC layers to aim at high accuracy, whereas DiabNN-edge utilizes sparsely recurrent (SR) layers to aim at extreme efficiency. All NN layers are subjected to dropout regularization, a widely-used approach for alleviating overfitting and improving accuracy.
In DiabNN-server, each SC layer conducts a linear transformation (using a sparse matrix as opposed to a conventional full matrix) followed by a nonlinear activation function. As shown later, utilizing SC layers leads to more model parameters than SR layers, hence an improved learning capability and higher accuracy. Consequently, DiabNN-server achieves approximately a 1% accuracy improvement over DiabNN-edge.
In DiabNN-edge, we base our SR layer design on the H-LSTM cell. It is a variant of the conventional LSTM cell, obtained by adding hidden layers to its control gates. Fig. 3 shows the schematic diagram of an H-LSTM. Its internal computation flow is governed by the following equations:
f_t = σ(W_f H_f([x_t, h_{t-1}]) + b_f)
i_t = σ(W_i H_i([x_t, h_{t-1}]) + b_i)
o_t = σ(W_o H_o([x_t, h_{t-1}]) + b_o)
g_t = tanh(W_g H_g([x_t, h_{t-1}]) + b_g)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
where f_t, i_t, o_t, g_t, x_t, h_t, and c_t denote the forget gate, input gate, output gate, cell update vector, input, hidden state, and cell state at step t, respectively; h_{t-1} and c_{t-1} refer to the previous hidden and cell states at step t-1; H, W, b, σ, and ⊙ refer to a hidden layer that performs a linear transformation followed by an activation function, a sparse weight matrix, a bias, the sigmoid function, and element-wise multiplication, respectively; each gate may contain zero or more such hidden layers. The additional hidden layers enable three advantages. First, they enhance gate control through a multi-level abstraction that can lead to accuracy gains. Second, they can be easily regularized through dropout, and thus lead to better generalization. Third, they offer a wide range of choices for internal activation functions, such as the rectified linear unit (ReLU), that can lead to faster learning. Using H-LSTM based SR layers, DiabNN-edge reduces the model size by 130× and inference FLOPs by 2.2× relative to DiabNN-server.
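To make the gate computation concrete, the following NumPy sketch implements one H-LSTM step with one ReLU hidden layer per gate. The class and its weight initialization are our own illustration: DiabNN-edge uses trained sparse weight matrices, whereas the matrices here are random and dense.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

class HLSTMCell:
    """Illustrative H-LSTM cell: a standard LSTM whose four control
    gates (f, i, o, g) each contain one ReLU hidden layer before the
    final linear transformation."""
    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        n_in = input_size + hidden_size
        # One hidden layer (W1, b1) plus one output layer (W2, b2) per gate.
        self.W1 = {g: rng.standard_normal((hidden_size, n_in)) * 0.1 for g in "fiog"}
        self.W2 = {g: rng.standard_normal((hidden_size, hidden_size)) * 0.1 for g in "fiog"}
        self.b1 = {g: np.zeros(hidden_size) for g in "fiog"}
        self.b2 = {g: np.zeros(hidden_size) for g in "fiog"}

    def gate(self, g, z):
        # H-LSTM gate: hidden layer with ReLU, then a linear layer.
        return self.W2[g] @ relu(self.W1[g] @ z + self.b1[g]) + self.b2[g]

    def step(self, x_t, h_prev, c_prev):
        z = np.concatenate([x_t, h_prev])
        f = sigmoid(self.gate("f", z))   # forget gate
        i = sigmoid(self.gate("i", z))   # input gate
        o = sigmoid(self.gate("o", z))   # output gate
        g = np.tanh(self.gate("g", z))   # cell update vector
        c = f * c_prev + i * g           # new cell state
        h = o * np.tanh(c)               # new hidden state
        return h, c
```

Each gate here uses exactly one hidden layer, matching the DiabNN-edge configuration described in Section 4.2; the general H-LSTM formulation permits zero or more per gate.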
3.3 Grow-and-prune training for DiabNN
We next explain the gradient-based network growth and magnitude-based network pruning algorithms in detail. Unless otherwise stated, we assume a mask-based approach for tackling sparse networks. Each weight matrix W has a corresponding binary mask matrix Msk that has the exact same size. It is used to disregard dormant connections (connections with zero-valued weights).
Algorithm 1 illustrates the connection growth process. The main objective of the weight growth phase is to locate only the most effective dormant connections for reducing the value of the loss function. To do so, we first evaluate the gradients of all dormant connections and use this information as a metric for ranking their effectiveness. During the training process, we extract the gradients of all weight matrices for each mini-batch of training data using the back-propagation algorithm. We repeat this process over a whole training epoch to accumulate the gradients. Then, we calculate the average gradient over the entire epoch by dividing the accumulated values by the number of training instances. We activate a dormant connection if and only if its gradient magnitude is larger than a percentile threshold, determined by the growth ratio, of the gradient magnitudes of its associated layer matrix. Its initial value is set to the product of its gradient value and the current learning rate. The growth ratio is a hyperparameter; we typically use a growth ratio of 0.2 in our experiments. The NN growth method was first proposed in prior work, where it was shown to be very effective in enabling the network to reach a higher accuracy with far less redundancy than a fully connected model.
We show the connection pruning algorithm in Algorithm 2. During this process, we remove a connection if and only if its magnitude is smaller than a percentile threshold, determined by the pruning ratio, of the weight magnitudes of its associated layer matrix. When a connection is pruned away, its weight value and corresponding binary mask value are simultaneously set to zero. The pruning ratio is also a hyperparameter; we typically initialize it to 0.2 in our experiments. Connection pruning is an iterative process, in which we retrain the network to recover its accuracy after each pruning iteration.
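The two phases can be sketched with the mask-based bookkeeping described above. The function names and the exact percentile thresholds are our own illustration, not the authors' implementation.

```python
import numpy as np

def prune_connections(W, mask, beta):
    """Magnitude-based pruning: deactivate active connections whose
    |weight| falls below the beta-th percentile of active weights,
    zeroing both the weight and its mask entry."""
    active = np.abs(W[mask == 1])
    thresh = np.percentile(active, beta * 100)
    keep = (np.abs(W) >= thresh) & (mask == 1)
    return W * keep, keep.astype(int)

def grow_connections(W, mask, avg_grad, alpha, lr):
    """Gradient-based growth: activate dormant connections whose
    epoch-averaged gradient magnitude is in the top alpha fraction,
    initializing each to (gradient value * learning rate)."""
    dormant = np.abs(avg_grad[mask == 0])
    if dormant.size == 0:
        return W, mask
    thresh = np.percentile(dormant, (1 - alpha) * 100)
    new = (np.abs(avg_grad) >= thresh) & (mask == 0)
    W = W + new * avg_grad * lr
    return W, mask | new.astype(int)
```

In the growth sketch, the growth ratio alpha selects the top fraction of dormant-connection gradients; in the pruning sketch, active weights below the percentile set by the pruning ratio beta are removed, and the binary mask records the surviving topology.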
Table I: Data types collected in this study and their sources

| Data type | Data source |
|---|---|
| Galvanic skin response | Smartwatch |
| Skin temperature | Smartwatch |
| Acceleration | Smartwatch |
| Inter-beat interval | Smartwatch |
| Blood volume pulse | Smartwatch |
| Ambient illuminance | Smartphone |
| Ambient light color spectrum | Smartphone |
| Ambient temperature | Smartphone |
| Gravity | Smartphone |
| Angular velocity | Smartphone |
| Orientation | Smartphone |
| Acceleration | Smartphone |
| Linear acceleration | Smartphone |
| Air pressure | Smartphone |
| Wi-Fi radiation strength | Smartphone |
| Magnetic field strength | Smartphone |
| Relatives with diabetes | Questionnaire |
4 Implementation Details
In what follows, we first describe the dataset collected from 52 participants that is used for DiabDeep evaluation. Then, we explain the implementation details of DiabDeep based on the collected dataset.
4.1 Data collection and preparation
In this study, we collected both physiological data and demographic information from 52 participants. 27 participants had been diagnosed with diabetes (14 with type-1 and 13 with type-2 diabetes), whereas the remaining 25 participants were healthy non-diabetic controls. We collected the physiological data using a commercially available Empatica E4 smartwatch and a Samsung Galaxy S4 smartphone, as shown in Fig. 4. We also used a questionnaire to gather demographic information from all the participants. We summarize all the data types collected in this study in Table I. It can be observed that the collected data cover a wide range of physiological and demographic signals that may assist with diabetes diagnosis in the daily context. The smartwatch data capture the physiological state of the target user. This information, e.g., Galvanic skin response (GSR), which measures the electrical activity (conductance) of the skin, and blood volume pulse (BVP), which captures cardiovascular activity such as the heart beat waveform and heart rate variability, has been shown to effectively reflect the body's health status. The ambient information from the smartphone may assist with sensing of body movement and physiological signal calibration. Finally, demographic information has previously been shown to be effective for diabetes diagnosis. In this work, we study whether synergies among these sources of information collected in the daily context can support the task of pervasive diabetes diagnosis.
During data collection, we first inform each participant about the experiment, obtain a signed consent form, and ask the participant to fill out the demographic questionnaire. Then, we place the Empatica E4 smartwatch on the wrist of the participant's non-dominant hand, and the Samsung Galaxy S4 smartphone in the participant's pocket. The experiment lasts between 1.0 and 1.5 hours per participant, during which time the smartwatch and smartphone continuously track and store the physiological signals. We use the Empatica E4 Connect portal for smartwatch data retrieval. We developed an Android application to record all the smartphone sensor data streams. All the data streams contain detailed timestamps that are later used for data synchronization. The experimental procedure was approved by the Institutional Review Board of Princeton University. None of the participants reported mental, cardiac, or endocrine disorders.
We next preprocess the dataset before training the models. We first synchronize and window the WMS data streams. To avoid time correlation between adjacent data windows, we divide the data into fixed-length windows with shifts in between. The final dataset contains 5030 data instances. We use 70%, 10%, and 20% of the data as the training, validation, and test sets, respectively; the three sets have no time overlap. We extract the value ranges of the data streams from the training set, and scale all three sets using the min-max scaling method explained earlier.
4.2 DiabDeep implementation
We implement the DiabDeep framework using PyTorch on an Nvidia GeForce GTX 1060 GPU (1.708GHz, 6GB memory) and a Tesla P100 GPU (1.329GHz, 16GB memory). We employ the CUDA 8.0 and cuDNN 5.1 libraries in our experiments. We next describe our implementation of DiabNNs based on the collected dataset.
We first explain the implementation details for DiabNN-server.
Data input: For each data instance, we flatten and concatenate the data within the same monitoring window from both the smartphone and smartwatch. This results in a vector of length 3705, where the flattened smartwatch window contains 2535 signal readings (from one data stream at 64Hz, three data streams at 32Hz, two data streams at 4Hz, and one data stream at 1Hz), and the flattened smartphone window provides an additional 1170 signal readings (from 26 data streams at 3Hz). Finally, we append the seven demographic features at the end to obtain a vector of length 3712 as the input for DiabNN-server.
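As a quick sanity check, these per-window counts are mutually consistent with a 15-second monitoring window; the window length is an inference on our part, derived from the stated sampling rates and reading counts.

```python
# Sanity-check the DiabNN-server input length from the stated sampling
# rates, assuming a 15 s monitoring window (inferred, not stated here).
window_s = 15
watch = window_s * (1 * 64 + 3 * 32 + 2 * 4 + 1 * 1)  # smartwatch readings
phone = window_s * (26 * 3)                           # smartphone readings
demographic = 7                                       # demographic features
total = watch + phone + demographic
print(watch, phone, watch + phone, total)  # 2535 1170 3705 3712
```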
Model architecture: We present the model architecture for DiabNN-server in Fig. 5. We use six sequential SC layers in DiabNN-server with widths of 1024, 512, 256, 128, 64, and 2 (3 for three-class classification), respectively. The input dimension is 3712, matching the input vector described above. We use ReLU as the nonlinear activation function for all SC layers.
Training: We use a stochastic gradient descent (SGD) optimizer with a momentum of 0.9 for this experiment. We initialize the learning rate to 0.005 and divide it by 10 when the validation accuracy does not increase over 50 consecutive epochs. We use a batch size of 256 and a dropout ratio of 0.2. For grow-and-prune training, we initialize the seed architecture with a filling rate of 20%. We grow the network for three epochs using a growth ratio of 0.2. For network pruning, we initialize the pruning ratio to 0.2 and halve it whenever the retrained model cannot restore accuracy on the validation set. We terminate the process when the ratio falls below 0.01. The training curve for DiabNN-server is presented in Fig. 6.
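The learning-rate and pruning-ratio schedules described above can be expressed as simple control logic. This is an illustrative sketch with our own function names, not the authors' training code.

```python
def lr_schedule_step(lr, best_acc, val_acc, stall, patience=50):
    """Divide the learning rate by 10 when validation accuracy has not
    improved for `patience` consecutive epochs (50 for DiabNN-server,
    30 for DiabNN-edge). Returns (new_lr, new_best_acc, new_stall)."""
    if val_acc > best_acc:
        return lr, val_acc, 0
    stall += 1
    if stall >= patience:
        return lr / 10.0, best_acc, 0
    return lr, best_acc, stall

def next_prune_ratio(ratio, accuracy_restored):
    """Halve the pruning ratio whenever retraining fails to restore
    validation accuracy; stop (return None) once it falls below 0.01."""
    if not accuracy_restored:
        ratio /= 2.0
    return ratio if ratio >= 0.01 else None
```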
We explain the implementation details for DiabNN-edge in this section.
Data input: Unlike the SC layer based DiabNN-server, the SR layer based DiabNN-edge acts on time series data step by step. Thus, at each time step, we concatenate the temporal signal values from each data stream along with the demographic information to form an input vector of length 40 (corresponding to seven smartwatch data streams, 26 smartphone data streams, and seven demographic features, as shown in Table I). DiabNN-edge operates on four input vectors per second. When a signal reading is missing from a data stream (e.g., due to a lower sampling frequency), we use the closest previous reading in that data stream as the interpolated value.
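The interpolation rule amounts to a forward fill over the most recent reading of each stream. A minimal sketch with hypothetical stream names:

```python
def forward_fill_step(latest, readings):
    """Update the per-stream cache `latest` with any new readings;
    a reading of None means the stream produced no new sample at this
    0.25 s step, so its closest previous value is reused. Returns the
    step vector in a fixed (sorted) stream order."""
    for name, value in readings.items():
        if value is not None:
            latest[name] = value
    return [latest[name] for name in sorted(latest)]
```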
Model architecture: We present the model architecture for DiabNN-edge in Fig. 7. DiabNN-edge contains one H-LSTM cell based SR layer that has a hidden state width of 96. Each control gate within the H-LSTM cell contains one hidden layer. We use ReLU as the nonlinear activation function.
Training: We again use an SGD optimizer with a momentum of 0.9 for this experiment. The learning rate is initialized to 0.001. We divide the learning rate by 10 when the validation accuracy does not increase in 30 consecutive epochs. We use a batch size of 64 and a dropout ratio of 0.2 for training. For grow-and-prune training, we use the same hyperparameter set as in the experiment for DiabNN-server. The training curve for DiabNN-edge is presented in Fig. 8.
Table II: Performance evaluation metrics

| Metric | Definition |
|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) |
| False positive rate (FPR) | FP / (FP + TN) |
| False negative rate (FNR) | FN / (FN + TP) |
| F1 score | 2 × Precision × Recall / (Precision + Recall) |

TP (TN): diabetic (healthy) instances classified as diabetic (healthy)
FP (FN): healthy (diabetic) instances classified as diabetic (healthy)
5 Evaluating DiabDeep Performance
In this section, we first analyze the performance of DiabNN-server and DiabNN-edge for two classification tasks: (i) binary classification that distinguishes between diabetic vs. healthy individuals, and (ii) three-class classification to distinguish among type-1 diabetic, type-2 diabetic, and healthy individuals. Then, we compare the performances of DiabNN-server, DiabNN-edge, and the relevant baselines.
We evaluate the performance of DiabNNs using four performance metrics, as summarized in Table II. Accuracy indicates the overall prediction capability. The false positive rate (FPR) and false negative rate (FNR) measure the DiabNN’s capability to avoid misclassifying healthy and diabetic instances, respectively. The F1 score measures the overall performance of precision and sensitivity.
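For concreteness, all four metrics can be computed from binary confusion-matrix counts as follows (standard definitions; the function name is ours, and the positive class is taken to be "diabetic"):

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, FPR, FNR, and F1 score from binary
    confusion-matrix counts (positive class = diabetic)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    fpr = fp / (fp + tn)   # healthy instances misclassified as diabetic
    fnr = fn / (fn + tp)   # diabetic instances misclassified as healthy
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "fpr": fpr, "fnr": fnr, "f1": f1}
```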
5.1 DiabNN-server performance evaluation
We first analyze the performance of DiabNN-server. Table III presents the confusion matrix of DiabNN-server for the binary classification task. DiabNN-server achieves an overall accuracy of 96.3%. For the healthy instances, it achieves a very low FPR of 4.3%, demonstrating its effectiveness in avoiding false alarms. For the diabetic instances, it achieves an FNR of 3.1%, indicating its effectiveness in raising alarms when diabetes does occur. DiabNN-server achieves an F1 score of 96.5% for the binary classification task.
We present the confusion matrix of DiabNN-server for the three-class classification task in Table IV. DiabNN-server achieves an overall accuracy of 95.7%. For the healthy instances, it achieves a low FPR of 6.6%, again demonstrating its ability to avoid false alarms. It also delivers low FNRs of 1.6% and 2.8% for type-1 and type-2 diabetic instances, respectively (each FNR is the fraction of instances of a given diabetes type that are misclassified). DiabNN-server achieves an F1 score of 95.7% for the three-class classification task.
Furthermore, the grow-and-prune training paradigm not only delivers high diagnostic accuracy, but also leads to model compactness as a side benefit. For binary classification, the final DiabNN-server model contains only 429.1K parameters with a sparsity level of 90.5%. For the three-class classification task, the final DiabNN-server model contains only 445.8K parameters with a sparsity level of 90.1%. The model compactness achieved in both cases can help reduce storage and energy consumption on the server.
5.2 DiabNN-edge performance evaluation
We next analyze the performance of DiabNN-edge. We present the confusion matrix of DiabNN-edge for the binary classification task in Table V. DiabNN-edge achieves an overall accuracy of 95.3%. For the healthy case, it also achieves a very low FPR of 3.7%. For diabetic instances, it achieves an FNR of 5.6%. This shows that DiabNN-edge can also effectively raise disease alarms on the edge. DiabNN-edge achieves an F1 score of 95.4% for the binary classification task.
We also evaluate DiabNN-edge on the three-class classification task and present the confusion matrix in Table VI. DiabNN-edge achieves an overall accuracy of 94.6%. For the healthy case, it achieves an FPR of 4.5%. It achieves FNRs of 6.5% and 5.7% for the type-1 and type-2 diabetic instances, respectively. DiabNN-edge achieves an F1 score of 94.4% for the three-class classification task.
DiabNN-edge delivers extreme model compactness. For binary classification, the final DiabNN-edge model reaches a sparsity level of 96.3%, yielding a model with only 3.3K parameters. For the three-class classification task, the final DiabNN-edge model reaches a sparsity level of 95.9%, yielding a model with only 3.7K parameters. This greatly assists with inference on the edge, where resource budgets are typically very limited.
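The sparsity levels reported above are simply the fraction of exactly-zero entries left in the weight tensors after pruning; the remaining parameter count is the tensor size times (1 − sparsity). A toy sketch of the bookkeeping (the 4×5 matrix is illustrative, not a DiabNN weight):

```python
import numpy as np

def sparsity(weights):
    """Fraction of exactly-zero entries in a weight tensor."""
    w = np.asarray(weights)
    return float((w == 0).sum()) / w.size

# Toy example: a pruned 4x5 weight matrix with 16 of its 20 entries zeroed out.
w = np.zeros((4, 5))
w[0, 0], w[1, 2], w[2, 4], w[3, 1] = 0.7, -1.2, 0.3, 0.5

print(sparsity(w))                       # 0.8, i.e., 80% sparsity
nonzero_params = w.size - int((w == 0).sum())
print(nonzero_params)                    # 4 remaining parameters
```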
5.3 Results analysis
As mentioned earlier, DiabNN-edge and DiabNN-server offer different tradeoffs among diagnostic accuracy, storage cost, and run-time efficiency. This provides flexible design choices that can accommodate varying deployment objectives. To illustrate their differences, we compare the two models on the binary classification task in Table LABEL:tb:edge_server. We observe that DiabNN-server achieves a higher accuracy, a higher F1 score, and a lower FNR. DiabNN-edge, on the other hand, caters to edge-side inference by enabling:
A smaller model size: The edge model contains fewer parameters, leading to a substantial memory reduction.
Less computation: It requires 2.2× fewer FLOPs per inference, enabling a more efficient, hence more frequent, monitoring capability on the edge.
A lower FPR: It reduces the FPR by 0.6%. This results in fewer false alarms and hence improved usability in daily usage scenarios.
We also analyze the performance tradeoffs under changing model complexity and present the results in Fig. 9. It can be observed that, in general, an increase in computational complexity leads to performance improvements. However, these benefits gradually diminish as the computational complexity continues to increase.
Fig. 9: Performance tradeoffs against model complexity, based on five runs per data point with different random seeds. Error bars denote standard deviations.
[Table: per-model comparison of accuracy, #parameters, and FLOPs per inference, with columns Model, Accuracy, #Parameters, Feature extraction, Classification, and Total FLOPs; the rows were not recovered. Footnotes: number of comparison operations; calculation excluding the comparison operation cost.]
| Work | Sensors | Model | Accuracy |
| --- | --- | --- | --- |
| Swapna et al. | ECG sensor | Conv-LSTM | 95.1% |
| Swapna et al. | ECG sensor | CNN | 93.6% |
| Ballinger et al. | Watch + demographics | LSTM | 84.5% |
| Yin et al. | Watch + demographics | Ensemble | 77.6% |
| This work (DiabNN-server) | Watch + phone + demographics | Stacked SC layers | 96.3% |
| This work (DiabNN-edge) | Watch + phone + demographics | H-LSTM SR layer | 95.3% |
We next compare DiabNNs with widely-used learning methods, including SVMs with linear and RBF kernels, k-NN, random forest, and linear ridge classifiers. For all the methods, we use the same train/validation/test split and the same binary classification task to ensure a fair comparison. In line with the studies in  and , we extract the signal mean, variance, Fourier transform coefficients, and the third-order Daubechies wavelet transform approximation and detail coefficients on Daubechies D2, D4, D8, and D24 filters from each monitoring window, resulting in a feature vector of length 304 per data instance. We train all the non-NN baselines using the Python-based Scikit-learn library. We compare the performance of all the inference models in Table LABEL:tb:energy_compare. In addition to classification accuracy, we also report the FLOPs per inference required by both the feature extraction and classification stages. DiabNN-server achieves the highest accuracy among all the models. While delivering higher accuracy than all the non-NN baselines, DiabNN-edge achieves the smallest model size (up to a 454.5× reduction) and the fewest FLOPs per inference (up to an 8.9× reduction). Note that the feature extraction stage alone accounts for 491K FLOPs before the classification stage even starts executing; this is already 1.3× the total inference cost of DiabNN-edge.
Finally, we compare DiabDeep with relevant work from the literature in Table LABEL:tb:lit. We focus on the same binary classification task targeted by these studies. DiabDeep achieves the highest accuracy relative to the baselines due to two major advantages. First, it relies on a more comprehensive set of WMSs, capturing a wider spectrum of user signals in the daily context for diagnostic decisions. Second, it utilizes a grow-and-prune training paradigm that learns both the connections and weights of DiabNNs, enabling more effective SGD-based optimization in both the model architecture space and the parameter space.
6 Discussion and future work
In this section, we discuss the inspiration we drew from the human brain when training DiabNNs, as well as the future directions enabled by DiabDeep.
The human brain continually remolds its synaptic connections as new knowledge is acquired. These changes happen every second throughout our lifetimes. It has even been shown that most knowledge acquisition and information learning processes in our brains result from such synaptic rewiring, also referred to as neuroplasticity . This is very different from most current NNs, which have a fixed architecture. To mimic the learning mechanism of the human brain, we utilize gradient-based growth and magnitude-based pruning to train accurate, yet very compact, DiabNNs for DiabDeep. The grow-and-prune synthesis paradigm allows DiabNNs to easily adjust their synaptic connections to the diabetes diagnosis task.
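The magnitude-based pruning step mentioned above can be sketched in a few lines: weights whose absolute value falls below a percentile threshold are zeroed out, so only the strongest "synapses" survive. This is a generic illustration of the technique, not the paper's exact grow-and-prune implementation:

```python
import numpy as np

def magnitude_prune(w, target_sparsity):
    """Zero out the smallest-magnitude entries of w so that roughly
    target_sparsity (a fraction in [0, 1)) of the entries become zero."""
    w = np.asarray(w, dtype=float).copy()
    # Threshold at the target_sparsity quantile of the absolute values.
    threshold = np.quantile(np.abs(w), target_sparsity)
    w[np.abs(w) < threshold] = 0.0
    return w

# Prune a random 64x64 weight matrix to ~90% sparsity.
rng = np.random.default_rng(1)
w = rng.standard_normal((64, 64))
pruned = magnitude_prune(w, 0.9)
print((pruned == 0).mean())   # ~0.9: about 90% of the weights removed
```

In a full grow-and-prune loop, such pruning alternates with gradient-based growth (adding back connections with large gradient magnitudes) and with retraining of the surviving weights.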
DiabDeep opens up the potential for future WMS-based disease diagnosis studies, given that more than 69,000 diseases exist . We hope that this work will encourage clinics, hospitals, and researchers to start collecting WMS data from individuals for more challenging diagnostic tasks, e.g., long-term cancer prediction. Bypassing the feature extraction stage with efficient NNs makes the proposed approach easy to scale to other disease domains. The grow-and-prune synthesis paradigm may even support continuous disease trend forecasting, given its incremental learning capability . As more data are collected and analyzed with the proposed methodology, its effectiveness as a scalable approach to pervasive diagnosis and medication-level determination will continue to improve.
7 Conclusion
In this work, we proposed a framework called DiabDeep that combines off-the-shelf WMSs with efficient
DiabNNs for continuous and pervasive diabetes diagnosis on both the server and the edge. On the
resource-rich server, we deployed stacked SC layers to focus on high accuracy. On the
resource-scarce edge, we used an H-LSTM based SR layer to reduce computation and storage costs
with only a minor accuracy loss. We trained DiabNNs by leveraging gradient-based growth and
magnitude-based pruning algorithms. This enables DiabNNs to learn both weights and connections during
training. We evaluated DiabDeep based on data collected from 52 participants. Our system achieves
a 96.3% (95.3%) accuracy in classifying diabetics against healthy individuals on the server (edge),
and a 95.7% (94.6%) accuracy in distinguishing among type-1 diabetic, type-2 diabetic, and healthy
individuals. Against conventional baselines, such as SVMs with linear and RBF kernels, k-NN, random
forest, and linear ridge classifiers, DiabNN-edge reduces model size (FLOPs) by up to
454.5× (8.9×) while improving accuracy. Thus, we have demonstrated that DiabDeep
can be employed in a pervasive fashion, while offering high efficiency and accuracy.
The authors would like to thank Premal Kamdar, Abdullah Guler, Shrenik Shah, Aumify Health, and DiabetesSisters for assistance with data collection.
References
- (2019) Continuous glucose monitoring: A brief review for primary care practitioners. Advances in Therapy 36 (3), pp. 579–596.
- DeepHeart: Semi-supervised sequence learning for cardiovascular risk prediction. In Proc. AAAI Conf. Artificial Intelligence, pp. 2079–2086.
- (2018) Grow and prune compact, fast, and accurate LSTMs. arXiv preprint arXiv:1805.11797.
- (2019) Incremental learning using a grow-and-prune paradigm with efficient neural networks. arXiv preprint arXiv:1905.10952.
- (2019) NeST: A neural network synthesis tool based on a grow-and-prune paradigm. IEEE Trans. on Computers.
- (2019) ChamNet: Towards efficient network design through platform-aware model adaptation. In , pp. 11398–11407.
- (2019) Empatica E4 connect portal.
- (2016) Continuous glucose monitoring sensors: Past, present and future algorithmic challenges. Sensors 16 (12), pp. 2093–2104.
- (2016) Global report on diabetes.
- (1988) Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks 1 (1), pp. 17–61.
- Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. J. American Medical Association 316 (22), pp. 2402–2410.
- Performance enhancement of mental task classification using EEG signal: A study of multivariate feature selection methods. Soft Computing 19 (10), pp. 2799–2812.
- (2017) ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, pp. 75–84.
- (2015) Learning both weights and connections for efficient neural network. In Proc. Advances in Neural Information Processing Systems, pp. 1135–1143.
- (2019) SCANN: Synthesis of compact and accurate neural networks. arXiv preprint arXiv:1904.09090.
- (2019) VerIDeep: Verifying integrity of deep neural networks through sensitive-sample fingerprinting. In Proc. IEEE Conf. Computer Vision and Pattern Recognition.
- (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- (2018) A microprocessor implemented in 65nm CMOS with configurable and bit-scalable accelerator for programmable in-memory computing. arXiv preprint arXiv:1811.04047.
- (2017) Machine learning and data mining methods in diabetes research. J. Computational and Structural Biotechnology 15, pp. 104–116.
- (2012) ImageNet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems, pp. 1097–1105.
- (2018) ShuffleNet V2: Practical guidelines for efficient CNN architecture design. arXiv preprint arXiv:1807.11164.
- (2005) ECG feature extraction using Daubechies wavelets. In Proc. Int. Conf. Visualization, Imaging and Image Processing, pp. 343–348.
- Exploring sparsity in recurrent neural networks. arXiv preprint arXiv:1704.05119.
- (2018) Learning deep representations with probabilistic knowledge transfer. In Proc. European Conf. Computer Vision, pp. 268–284.
- (2017) Automatic differentiation in PyTorch. In NIPS Workshop on Autodiff.
- (2011) Scikit-learn: Machine learning in Python. J. Machine Learning Research 12, pp. 2825–2830.
- (2005) Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Medical Care, pp. 1130–1139.
- (2014) FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550.
- (2017) Diabetes 2030: Insights from yesterday, today, and future trends. Population Health Management 20 (1), pp. 6–12.
- (2018) Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381.
- (2018) Automated detection of diabetes using CNN and CNN-LSTM network and heart rate signals. Procedia Computer Science 132, pp. 1253–1262.
- (2017) Learning intrinsic sparse structures within long short-term memory. arXiv preprint arXiv:1709.05027.
- (2019) FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proc. IEEE Conf. Computer Vision and Pattern Recognition.
- (2018) Shift: A zero FLOP, zero parameter alternative to spatial convolutions. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 9127–9135.
- (2018) Economic costs of diabetes in the U.S. in 2017. Diabetes Care 41 (5), pp. 917–928.
- (2017) A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 4133–4141.
- (2018) Smart healthcare. Foundations and Trends in Electronic Design Automation 12 (4), pp. 401–466.
- (2017) A health decision support system for disease diagnosis based on wearable medical sensors and machine learning ensembles. IEEE Trans. Multi-Scale Computing Systems 3 (4), pp. 228–241.
- (2018) A hierarchical inference model for Internet-of-Things. IEEE Trans. Multi-Scale Computing Systems 4, pp. 260–271.
- (2018) Transfer learning on fMRI datasets. In Proc. Int. Conf. Artificial Intelligence and Statistics.
- (2017) A machine learning-based framework to identify type 2 diabetes through electronic health records. Int. J. Medical Informatics 97, pp. 120–127.
- (2016) Worldwide trends in diabetes since 1980: A pooled analysis of 751 population-based studies with 4.4 million participants. The Lancet 387 (10027), pp. 1513–1530.
- (2018) Neural architect: A multi-objective neural architecture search with performance prediction. In Proc. Conf. SysML.
- (2016) Trained ternary quantization. arXiv preprint arXiv:1612.01064.