CovidDeep: SARS-CoV-2/COVID-19 Test Based on Wearable Medical Sensors and Efficient Neural Networks

07/20/2020 ∙ by Shayan Hassantabar, et al. ∙ University of Pavia

The novel coronavirus (SARS-CoV-2) has led to a pandemic. Due to its highly contagious nature, it has spread rapidly, resulting in major disruption to public health. It has also had a severe negative impact on the world economy. As a result, it is now widely recognized that widespread testing is key to containing the spread of the disease and opening up the economy. However, the current testing regime has been unable to keep up with testing demands. Hence, there is a need for an alternative approach for repeated large-scale testing of COVID-19. The emergence of wearable medical sensors (WMSs) and novel machine learning methods, such as deep neural networks (DNNs), points to a promising approach to address this challenge. WMSs enable continuous and user-transparent monitoring of physiological signals. However, disease detection based on WMSs/DNNs and their deployment on resource-constrained edge devices remain challenging problems. In this work, we propose CovidDeep, a framework that combines efficient DNNs with commercially available WMSs for pervasive testing of the coronavirus. We collected data from 87 individuals, spanning four cohorts including healthy, asymptomatic (but SARS-CoV-2-positive) as well as moderately and severely symptomatic COVID-19 patients. We trained DNNs on various subsets of the features extracted from six WMS and questionnaire categories to perform ablation studies to determine which subsets are most efficacious in terms of test accuracy for a four-way classification. The highest test accuracy obtained was 99.4%. Since different WMS subsets may be more accessible (in terms of cost, availability, etc.) to different sets of people, we hope these DNN models will provide users with ample flexibility. The resultant DNNs can be easily deployed on edge devices, e.g., smartwatch or smartphone, which also has the benefit of preserving patient privacy.

1 Introduction

SARS-CoV-2, also known as the novel coronavirus, emerged in China and soon after spread across the globe. The World Health Organization (WHO) named the resultant disease COVID-19. COVID-19 was declared a pandemic on March 11, 2020 [world2020coronavirus]. In its early stages, the symptoms of COVID-19 include fever, cough, fatigue, and myalgia. However, in more serious cases, it can lead to shortness of breath, pneumonia, severe acute respiratory disorder, and heart problems, and may lead to death [mahase2020coronavirus]. It is of paramount importance to detect which individuals are infected at as early a stage as possible in order to limit the spread of disease through quarantine and contact tracing. In response to COVID-19, governments around the world have issued social distancing and self-isolation orders. This has led to a significant increase in unemployment across diverse economic sectors. As a result, COVID-19 has triggered an economic recession in a large number of countries [nicola2020socio].

Reverse Transcription-Polymerase Chain Reaction (RT-PCR) is currently the gold standard for SARS-CoV-2 detection [butt2020deep]. This test is based on viral nucleic acid detection in sputum or nasopharyngeal swab. Although it has high specificity, it has several drawbacks. The RT-PCR test is invasive and uncomfortable, and non-reusable testing kits have led to significant supply chain deficiencies. SARS-CoV-2 infection can also be assessed with an antibody test [dheda2020diagnosis]. However, antibody titers are only detectable from the second week of illness onwards and persist for an uncertain length of time. The antibody test is also invasive, requiring venipuncture which, in combination with a several-day processing time, makes it less ideal for rapid mass screening. In the current economic and social situation, there is a great need for an alternative SARS-CoV-2/COVID-19 detection method that is easily accessible to the public for repeated testing with high accuracy.

To address the above issues, researchers have begun to explore the use of artificial intelligence (AI) algorithms to detect COVID-19 [bullock2020mapping]. Initial work concentrated on CT scans and X-ray images [farooq2020covid, wang2020covid, SIRM, giovagnonidiagnosi, butt2020deep, zhang2020covid, narin2020automatic, abbas2020classification, hall2020finding, sethy2020detection, li2020artificial, gozes2020rapid, apostolopoulos2020covid, wang2020fully, afshar2020covid]. A survey of such datasets can be found in [kalkreuth2020covid, cohen2020covid]. These methods often rely on transfer learning of a convolutional neural network (CNN) architecture, pre-trained on large image datasets, on a smaller COVID-19 image dataset. However, such an image-based AI approach faces several challenges that include lack of large datasets and inapplicability outside the clinic or hospital. In addition, other work [Lin2020] shows that it is difficult to distinguish COVID-19 pneumonia from influenza virus pneumonia in a clinical setting using CT scans. Thus, the work in this area is not mature yet.

CORD-19 [cord19] is an assembly of scholarly articles on COVID-19. It can be used with natural language processing methods to distill useful information on COVID-19-related topics.

AI4COVID-19 [imran2020ai4covid] performs a preliminary diagnosis of COVID-19 through cough sample recordings with a smartphone application. However, since coughing is a common symptom of two dozen non-COVID-19 medical conditions, this is an extremely difficult task. Nonetheless, AI4COVID-19 shows promising results and opens the door for COVID-19 diagnosis through a smartphone.

The emergence of wearable medical sensors (WMSs) offers a promising way to tackle these challenges. WMSs can continuously sense physiological signals throughout the day [yin2017health]. Hence, they enable constant monitoring of the user’s health status. Training AI algorithms with data produced by WMSs can enable pervasive health condition tracking and disease onset detection [yin2019diabdeep]. This approach exploits the knowledge distillation capability of machine learning algorithms to directly extract information from physiological signals. Thus, it is not limited to disease detection in the clinical scenarios.

We propose a framework called CovidDeep for daily detection of SARS-CoV-2/COVID-19 based on off-the-shelf WMSs and compact deep neural networks (DNNs). It bypasses manual feature engineering and directly distills information from the raw signals captured by available WMSs. It addresses the problem posed by small COVID-19 datasets by relying on intelligent synthetic data generation from the same probability distribution as the training data [hassantabar2020Tutor]. These synthetic data are used to pre-train the DNN architecture in order to impose a prior on the network weights. To cut down on the computation and storage costs of the model without any loss in accuracy, CovidDeep leverages the grow-and-prune DNN synthesis paradigm [dai2017nest, hassantabar2019scann]. This not only improves accuracy, but also shrinks model size and reduces the computation costs of the inference process.

The major contributions of this article are as follows:

  • We propose CovidDeep, an easy-to-use, accurate, and pervasive SARS-CoV-2/COVID-19 detection framework. It combines features extracted from physiological signals using WMSs and simple-to-answer questions in a smartphone application-based questionnaire with efficient DNNs.

  • It uses an intelligent synthetic data generation module to obtain a synthetic dataset [hassantabar2020Tutor], labeled by decision rules. The synthetic dataset is used to pre-train the weights of the DNN architecture.

  • It uses a grow-and-prune DNN synthesis paradigm that learns both an efficient architecture and weights of the DNN at the same time [dai2017nest, hassantabar2019scann].

  • It provides a solution to the daily SARS-CoV-2/COVID-19 detection problem. It captures all the required physiological signals non-invasively through comfortably-worn WMSs that are commercially available.

The rest of the article is organized as follows. Section 2 reviews background material. Section 3 describes the CovidDeep framework. Section 4 provides implementation details. Section 5 presents experimental results. Section 6 provides a short discussion on CovidDeep and possible directions for future research. Finally, Section 7 concludes the article.

2 Background

In this section, we discuss background material related to the CovidDeep framework. It involves recent methods for synthesizing and training efficient DNN architectures.

One approach is based on the use of efficient building blocks. For example, MobileNetV2 [sandler2018mobilenetv2] uses inverted residual blocks to reduce model size and floating-point operations (FLOPs). ShuffleNet-v2 [ma2018shufflenet] uses depth-wise separable convolutions and channel-shuffling operations to ensure model compactness. Spatial convolution is one of the most expensive operations in CNN architectures. Shift [wu2018shift] uses shift-based modules that combine shifts and point-wise convolutions to significantly reduce computational cost and storage needs. FBNetV uses differentiable neural architecture search to automatically generate compact architectures. Efficient performance predictors, e.g., for accuracy, latency, and energy, are also used to accelerate the DNN search process [dai2018chamnet, hassantabar2019steerage].

In addition, DNN compression methods can be used to remove redundancy in DNN models. Han et al. [han2015deep] propose a pruning methodology to remove redundancy from large CNN architectures, such as AlexNet and VGG. Pruning methods are also effective with recurrent neural networks [han2017ese]. Combining network growth with pruning enables a sparser, yet more accurate, architecture. Dai et al. [dai2017nest, dai2018grow] use the grow-and-prune synthesis paradigm to generate efficient CNNs and long short-term memories. SCANN [hassantabar2019scann] uses feature dimensionality reduction alongside grow-and-prune synthesis to generate very compact models for deployment on edge devices and Internet-of-Things sensors. It allows the depth of the architecture to be changed during the training process.

Orthogonal to the above works, low-bit quantization of DNN weights can also be used to reduce FLOPs. A ternary weight representation is used in [zhu2016trained] to significantly reduce computation and memory costs of ResNet-, with a limited reduction in accuracy.

3 Methodology

In this section, we present the CovidDeep framework. First, we give an overview of the entire framework. Then, we describe the DNN architecture that is used in CovidDeep for inference. We also describe how synthetic data generation can be used to impose a prior on the DNN weights and then use the DNN grow-and-prune synthesis paradigm to boost the test accuracy further and ensure computational efficiency of the model.

Fig. 1: Schematic diagram of the CovidDeep framework (GSR: Galvanic skin response, IBI: inter-beat interval, Ox.: oxygen saturation, BP: blood pressure, DT/RF: decision tree/random forest, NN: neural network, KB: knowledge-base, MND: multi-variate Normal distribution, GMM: Gaussian mixture model, KDE: kernel density estimation).

3.1 Framework overview

The CovidDeep framework is shown in Fig. 1. CovidDeep obtains data from two different sources: physiological signals and questionnaire. It has two flows: one that does not use synthetic data and another one that does. When synthetic data are not used, the framework just uses the real dataset divided into three categories: training, validation, and test. It trains the DNNs with the training dataset and picks the best one for the given set of features based on the validation dataset, and finally tests this DNN on the test dataset to obtain the test accuracy. However, when the real training dataset size is small, it is often advantageous to draw a synthetic dataset from the same probability distribution. CovidDeep uses synthetic data generation methods to increase the dataset size and use such data to pre-train the DNN architecture. Then, it uses grow-and-prune synthesis to generate inference models that are both accurate and computationally-efficient. The models generated by CovidDeep are efficient enough to be deployed on the edge, e.g., the smartphone or smartwatch, for SARS-CoV-2/COVID-19 inference.

Next, we discuss the data input, model training, and model inference details.

  • Data input: As mentioned above, physiological signals and a questionnaire are the two sources of data input to the model. The physiological signals are derived from WMSs embedded in a smartwatch as well as a discrete pulse oximeter and blood pressure monitor. These signals can be easily obtained in a non-invasive, passive, and user-transparent manner. The list of these signals includes Galvanic skin response (GSR), inter-beat interval (IBI) that indicates the heart rate, skin temperature, oxygen saturation, and blood pressure (systolic and diastolic). We also collected blood volume pulse (BVP) data from the smartwatch. However, we did not find the BVP data to be useful in distinguishing among various healthy and virus-positive groups. Moreover, in the questionnaire, we asked the following yes/no questions: immune-compromised, chronic lung disease, cough, shortness of breath, chills, fever, muscle pain, headache, sore throat, smell-taste loss, and diarrhea. We collected data on age, gender, weight, height, and smoking/drinking (yes/no), but did not find them to be useful either because of overfitting or being unrepresentative. All the relevant data sources are aggregated into a comprehensive data input for further processing.

  • Model training: CovidDeep uses different types of DNN models: (i) those trained on the raw data only, (ii) those trained on raw data augmented with synthetic data to boost accuracy, and (iii) those subjected to grow-and-prune synthesis for both boosting accuracy further and reducing model size. The first type of DNN model uses a few hidden layers. The second type of DNN model is trained based on a system called TUTOR [hassantabar2020Tutor] and is suitable for settings where data availability is limited. It provides the DNN with a suitable inductive bias. The third type of DNN model is based on the grow-and-prune DNN synthesis paradigm and employs three architecture-changing operations: neuron growth, connection growth, and connection pruning. These operations have been shown to yield DNNs that are both accurate and efficient [hassantabar2019scann].

  • Model inference: CovidDeep enables users to obtain a SARS-CoV-2/COVID-19 detection decision on their edge device on demand.

Next, we discuss the CovidDeep DNN architecture.

3.2 Model architecture

Fig. 2 shows the processing pipeline of the CovidDeep framework. The architecture takes the data inputs (shown at the bottom) and generates a prediction, i.e., the detection decision, (shown at the top). The pipeline consists of four steps: data pre-processing, synthetic data generation and architecture pre-training, grow-and-prune synthesis, and output generation through softmax.

Fig. 2: An illustration of the CovidDeep processing pipeline to generate predictions from data inputs.

In the data pre-processing stage, data normalization and data alignment/aggregation are done.

  • Data normalization: This step is aimed at changing feature values to a common scale. While data normalization is not always required, it is highly beneficial in the case of datasets that have features with very different ranges. It leads to better noise tolerance and improvement in model accuracy [krizhevsky2012imagenet]. Data normalization can be done in several ways, such as min-max scaling and standardization. In this work, we use min-max scaling to map each data input to the $[0, 1]$ interval. Scaling can be done as follows:

    $$x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

  • Data alignment/aggregation: The data from different WMSs may have different start times and frequencies. In order to merge them into a dataset, we need to synchronize the data streams based on their timestamps. The answers to the questions in the questionnaire are also added to the final dataset.
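
The normalization step above can be sketched in a few lines. This is only an illustration, assuming the target interval is [0, 1] and that each feature is scaled independently; the function name `min_max_scale` and the sample readings are hypothetical and do not come from the CovidDeep implementation.

```python
def min_max_scale(values, lo=None, hi=None):
    """Map values to [0, 1] via (x - min) / (max - min).

    lo/hi may be passed explicitly, e.g. statistics computed on the
    training split, so that validation and test data reuse them.
    """
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    span = hi - lo
    if span == 0:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical skin-temperature readings (illustrative values only).
skin_temp = [33.1, 33.4, 34.0, 32.8]
scaled = min_max_scale(skin_temp)
```

Passing `lo` and `hi` from the training split is one way to avoid leaking validation or test statistics into the scaling; whether CovidDeep does this is not stated in the text.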

Synthetic data generation: The training dataset generated in the above manner is next used to generate a synthetic dataset that is used to pre-train the DNN. These synthetic data and pre-training steps are based on the TUTOR framework [hassantabar2020Tutor]. The schematic diagram of the training scheme based on synthetic data is shown in Fig. 3. The synthetic dataset is generated in three different ways in TUTOR:

Fig. 3: The schematic diagram for pre-training of the DNN model with the synthetic dataset (DT/RF: decision tree/random forest, NN: neural network, KB: knowledge-base).
  • Using multi-variate Normal distribution (MND): In this approach, the real training dataset, i.e., the one obtained as a fraction of the data obtained from the WMSs and questionnaire, is modeled as a normal distribution to generate the synthetic data.

  • Using Gaussian mixture model (GMM): This approach uses a multi-dimensional GMM to model the data distribution. The optimal number of GMM components is obtained with the help of a validation dataset. Subsequently, the synthetic dataset is generated from this GMM.

  • Using kernel density estimation (KDE): This approach uses non-parametric density estimation to estimate the probability distribution as a sum of many kernels. In our implementation, KDE is based on the Gaussian kernel function. The synthetic data are generated based on samples generated from this model.
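
To make the KDE option concrete, the sketch below draws synthetic rows from a Gaussian KDE using only the Python standard library. It relies on the fact that a Gaussian KDE is an equal-weight mixture of Gaussians centred on the training points. The function name, the fixed `bandwidth`, and the sample rows are assumptions for illustration; TUTOR's actual implementation and bandwidth selection may differ.

```python
import random


def sample_gaussian_kde(train_rows, n_samples, bandwidth, seed=0):
    """Draw synthetic rows from a Gaussian KDE fitted to train_rows.

    Sampling reduces to: pick a training row uniformly at random, then
    perturb each feature with zero-mean Gaussian noise whose standard
    deviation equals the kernel bandwidth.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_samples):
        centre = rng.choice(train_rows)
        synthetic.append([x + rng.gauss(0.0, bandwidth) for x in centre])
    return synthetic


# Hypothetical min-max-scaled training rows with two features each.
real_train = [[0.10, 0.80], [0.15, 0.75], [0.90, 0.20]]
synthetic_rows = sample_gaussian_kde(real_train, n_samples=5, bandwidth=0.05)
```

The synthetic rows produced this way are unlabeled; in the framework described here they would still need labels from the knowledge base before being used for pre-training.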

Building a knowledge base (KB): After generation of the synthetic data, we need to label the data points. To this end, we build a KB from the real training dataset. Decision tree (DT) and random forest (RF) are two classical machine learning methods that are inherently rule-based. In fact, each decision path in a decision tree, from the root to a leaf, can be thought of as a rule. Therefore, we aim to identify the set of rules that best describe the data. We use such a model as a KB to label the generated synthetic dataset.

Training with synthetic data: We use the labeled synthetic data to impose a prior on the DNN weights. To accomplish this, we pre-train the DNN model by using the generated synthetic dataset. This provides the network with an appropriate inductive bias and helps the network to “get underway.” This helps improve accuracy when data availability is limited.

3.3 Grow-and-prune synthesis of the DNN

In this section, we discuss the grow-and-prune synthesis paradigm [dai2017nest, hassantabar2019scann]. The approach presented in [hassantabar2019scann] allows the depth of the DNN to grow during synthesis. Thus, a hidden neuron can receive inputs from any neuron activated before it (including input neurons) and can feed its output to any neuron activated after it (including output neurons). As a result, the depth of the model is determined based on how the hidden neurons are connected, enabling the depth to be changed during training. We use three basic architecture-changing operations in the grow-and-prune synthesis process that are discussed next.

Connection growth: This activates the dormant connections in the network. The weights of the added connections are set to 0 and trained later. We use two different methods for connection growth:

  • Gradient-based growth: This approach was first introduced by Dai et al. [dai2017nest]. Algorithm 1 shows the process of gradient-based growth. Each weight matrix has a corresponding binary mask of the same size. This mask is used to disregard the inactive connections. The algorithm adds connections to reduce the loss function significantly. To this end, the gradients of all the dormant connections are evaluated and their effectiveness ranked based on this metric. During a training epoch, the gradients of all the weight matrices for all the data mini-batches are captured in the back-propagation step. An inactive connection is activated if its gradient magnitude is large relative to the gradients in its associated layer.

  • Full growth: This connection growth restores all the dormant connections in the network to make the DNN fully-connected.

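Input: $W \in \mathbb{R}^{M \times N}$: weight matrix of dimension $M \times N$; $Mask$: weight mask of the same dimension as the weight matrix; Network $P$; $W.grad$: gradient of the weight matrix (of dimension $M \times N$); data $D$; $\alpha$: growth ratio
  if full growth then
     $Mask_{[1:M,\,1:N]} = 1$
  else if gradient-based growth then
     Forward propagation of data $D$ through network $P$ and then back propagation
     Accumulation of $W.grad$ for one training epoch
     $t = (\alpha \times MN)^{th}$ largest element in the $|W.grad|$ matrix
     for all $w.grad_{ij}$ do
        if $|w.grad_{ij}| > t$ then
           $Mask_{ij} = 1$
        end if
     end for
  end if
  $W = W \otimes Mask$
Output: Modified weight matrix $W$ and mask matrix $Mask$
Algorithm 1 Connection growth algorithm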
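
A minimal Python sketch of the connection growth step in Algorithm 1 follows. It assumes the threshold is the k-th largest gradient magnitude, with k set by the growth ratio, and that a connection is activated only when its magnitude strictly exceeds that threshold. Nested lists stand in for tensors; the function names and sample values are hypothetical.

```python
def grow_connections(grad, mask, growth_ratio):
    """Return a new mask with high-gradient dormant connections activated.

    grad : accumulated gradients, nested list with the same shape as mask
    mask : 1 = active connection, 0 = dormant connection
    growth_ratio : fraction that selects the threshold (k-th largest |grad|)
    """
    magnitudes = sorted((abs(g) for row in grad for g in row), reverse=True)
    k = max(1, int(growth_ratio * len(magnitudes)))
    threshold = magnitudes[k - 1]
    # Activate every connection whose gradient magnitude exceeds the
    # threshold; connections that are already active simply stay active.
    return [
        [1 if abs(g) > threshold else m for g, m in zip(grad_row, mask_row)]
        for grad_row, mask_row in zip(grad, mask)
    ]


def full_growth(mask):
    """Full growth: restore every dormant connection."""
    return [[1 for _ in row] for row in mask]


# Hypothetical accumulated gradients and mask for a 2x2 weight matrix.
grad = [[0.9, -0.1], [0.05, -0.7]]
mask = [[1, 0], [0, 0]]
grown = grow_connections(grad, mask, growth_ratio=0.75)
```

In a PyTorch setting the same idea would typically operate on the gradient tensor of each layer accumulated over an epoch, with the mask applied element-wise to the weights after every update.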

Connection pruning: Connection pruning deactivates the connections whose weight magnitudes are smaller than a specified threshold. Algorithm 2 shows this process.

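Input: Weight matrix $W \in \mathbb{R}^{M \times N}$; mask matrix $Mask$ of the same dimension as the weight matrix; $\alpha$: pruning ratio
  $t = (\alpha \times MN)^{th}$ largest element in $|W|$
  for all $w_{ij}$ do
     if $|w_{ij}| < t$ then
        $Mask_{ij} = 0$
     end if
  end for
  $W = W \otimes Mask$
Output: Modified weight matrix $W$ and mask matrix $Mask$
Algorithm 2 Connection pruning algorithm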
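
Magnitude-based pruning as in Algorithm 2 can be sketched the same way. The sketch assumes the threshold is the k-th largest weight magnitude, with k determined by the ratio, so connections below it are masked out; under that reading the ratio is effectively the fraction of connections that survive. The function name and the sample weights are illustrative only.

```python
def prune_connections(weights, mask, ratio):
    """Deactivate connections whose weight magnitude is below a threshold.

    The threshold is the k-th largest |weight|, with k = ratio * size,
    so roughly the k largest-magnitude connections remain active.
    """
    magnitudes = sorted((abs(w) for row in weights for w in row), reverse=True)
    k = max(1, int(ratio * len(magnitudes)))
    threshold = magnitudes[k - 1]
    new_mask = [
        [0 if abs(w) < threshold else m for w, m in zip(w_row, m_row)]
        for w_row, m_row in zip(weights, mask)
    ]
    # Element-wise product of weights and mask: pruned weights become 0.
    new_weights = [
        [w if m else 0.0 for w, m in zip(w_row, m_row)]
        for w_row, m_row in zip(weights, new_mask)
    ]
    return new_weights, new_mask


# Hypothetical 2x2 weight matrix with all connections active.
weights = [[0.8, -0.05], [0.3, -0.6]]
mask = [[1, 1], [1, 1]]
pruned_weights, pruned_mask = prune_connections(weights, mask, ratio=0.5)
```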

Neuron growth: This step adds neurons to the network and thus increases network size. This is done by duplicating existing neurons in the architecture. To break the symmetry, random noise is added to the weights of all the connections related to the newly added neurons. The neurons to be duplicated are either selected randomly or based on higher activation values. The process is explained in Algorithm 3.

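Input: Network $P$; weight matrix $W$; mask matrix $Mask$ of the same dimension as the weight matrix; data $D$; candidate neuron $n_j$ to be added; array $A$ of activation values for all hidden neurons
  if activation-based selection then
     forward propagation through $P$ using data $D$
     $i = \arg\max(A)$
  else if random selection then
     randomly pick an active neuron $n_i$
  end if
  $Mask_{j\cdot} = Mask_{i\cdot}$, $Mask_{\cdot j} = Mask_{\cdot i}$
  $w_{j\cdot} = w_{i\cdot} + noise$, $w_{\cdot j} = w_{\cdot i} + noise$
Output: Modified weight matrix $W$ and mask matrix $Mask$
Algorithm 3 Neuron growth algorithm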
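
The neuron growth step can be illustrated with a small sketch for a single hidden layer. It assumes the layer is stored as incoming weights per hidden neuron (`w_in`) and outgoing weights per output neuron (`w_out`), and that symmetry is broken with Gaussian noise; the names, the noise scale, and the omission of the mask matrices are simplifications, not details from the paper.

```python
import random


def grow_neuron(w_in, w_out, activations=None, noise_std=0.01, seed=0):
    """Duplicate one hidden neuron and perturb the weights of the copy.

    w_in[i]     : incoming weights of hidden neuron i
    w_out[k]    : weights from every hidden neuron to output neuron k
    activations : optional activation value per hidden neuron; if given,
                  the neuron with the highest value is duplicated,
                  otherwise a neuron is picked at random
    """
    rng = random.Random(seed)
    if activations is not None:
        src = max(range(len(activations)), key=activations.__getitem__)
    else:
        src = rng.randrange(len(w_in))
    # Copy the source neuron's incoming weights, adding noise to break symmetry.
    new_in = [row[:] for row in w_in]
    new_in.append([w + rng.gauss(0.0, noise_std) for w in w_in[src]])
    # Append a noisy copy of the source neuron's outgoing weight per output.
    new_out = [row + [row[src] + rng.gauss(0.0, noise_std)] for row in w_out]
    return new_in, new_out, src


# Hypothetical layer: 2 hidden neurons with 2 inputs each, 1 output neuron.
w_in = [[0.2, -0.4], [0.7, 0.1]]
w_out = [[0.5, -0.3]]
grown_in, grown_out, src = grow_neuron(w_in, w_out, activations=[0.1, 0.9])
```

A full implementation would copy the corresponding mask rows and columns in the same way, so that the new neuron inherits the connectivity of the neuron it duplicates.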

We apply connection pruning after neuron growth and connection growth in each iteration. Grow-and-prune training runs for a pre-defined number of iterations. Finally, the architecture that performs the best on the validation dataset is chosen.

4 Implementation Details

In this section, we first explain how the data were obtained from 87 individuals and how various datasets were prepared from the data. We also provide implementation details of the CovidDeep DNN model.

4.1 Data collection and preparation

We collected physiological signals and questionnaire data with Institutional Research Board (IRB) approval at San Matteo Hospital in Pavia, Italy. 30 individuals were healthy (referred to as Cohort 1) and the remaining were SARS-CoV-2-positive with varying levels of disease severity. The SARS-CoV-2-positive cases were categorized into three other cohorts: asymptomatic (Cohort 2 with 27 individuals), symptomatic-hospitalized (Cohort 3 with 13 individuals), and symptomatic-intubated (Cohort 4 with 17 individuals). Distinguishing among these cohorts is important to ascertain who may be spreading the virus unknowingly and how much medical support is needed for symptomatic individuals with varying levels of severity. Hence, we train DNN models that can perform four-way classification.

To collect the physiological signals, we used commercially available devices: an Empatica E4 smartwatch (sensors we found useful: GSR, IBI, skin temperature), a pulse oximeter, and a blood pressure monitor. Alongside the physiological signals, we employed a questionnaire to collect information about possible COVID-19-related symptoms from all the individuals. We also collected data about age, gender, weight, height, and smoking/drinking (yes/no), but did not rely on these features as they were not necessarily representative of the larger population. Similarly, we collected BVP data from the smartwatch, but did not find them to be useful in distinguishing the four cohorts. Table I shows all the data types that we found to be useful. The smartwatch data capture the physiological state of the user. GSR measures continuous variations in the electrical characteristics of the skin, such as conductance, which can be caused by variations in body sweat. IBI correlates with cardiac health. Furthermore, skin acts as a medium for insulation, sweat, and control of blood flow. Although it is not a clear indicator of internal body temperature, skin temperature helps assess skin health. The pulse oximeter indirectly measures blood oxygen saturation. It is a comfortable and painless way of measuring how well oxygen is being sent to parts of the body furthest from the heart, such as the arms and legs. Blood pressure exposes various underlying health problems. Last but not least, the questionnaire elicits information that may help improve COVID-19 detection accuracy. From all these sources of data, we derive various subsets as datasets for use in the CovidDeep framework to see which data features are the most beneficial to obtaining a high detection accuracy. In addition, the various sensor subsets have different costs. Hence, our results also let one take test accuracy vs. cost into consideration.

Before data collection commences, we inform the participants about the procedure. We then collect some relevant information and COVID-19-related symptoms in response to a questionnaire. We place the pulse oximeter on the index finger of the user for blood oxygen measurement. We also obtain the systolic/diastolic blood pressure measurements. We place the smartwatch on the participant’s wrist. Data collection lasts for approximately one hour for each participant, during which time we collect sensor data from the smartwatch. We stream the data from the smartwatch to the smartphone over Bluetooth in real-time using a smartphone application. This application collects the data and performs basic validation to ensure data integrity.

Next, we pre-process the raw data to generate a comprehensive dataset. To this end, we first synchronize the WMS data streams. We then divide the data streams into 15-second data windows. The final dataset contains data instances. We divide this dataset into three parts: training, validation, and test, with , , and of the data, respectively. The training, validation, and test sets have no time overlap. Furthermore, in order to conduct ablation studies to gauge the impact of different data streams, we create different datasets, with various subsets of all the features.

Data type | Data source
Immune-compromised | Questionnaire
Chronic lung disease | Questionnaire
Shortness of breath | Questionnaire
Cough | Questionnaire
Fever | Questionnaire
Muscle pain | Questionnaire
Chills | Questionnaire
Headache | Questionnaire
Sore throat | Questionnaire
Smell/taste loss | Questionnaire
Diarrhea | Questionnaire
Galvanic skin response (μS) | Smartwatch
Skin temperature (°C) | Smartwatch
Inter-beat interval () | Smartwatch
Oxygen saturation (%) | Pulse oximeter
Systolic blood pressure (mmHg) | Blood pressure monitor
Diastolic blood pressure (mmHg) | Blood pressure monitor
TABLE I: Data types collected in the CovidDeep framework

4.2 Model implementation

We have implemented the CovidDeep framework in PyTorch. We perform DNN training on the Nvidia Tesla P data center accelerator, with GB of memory. We use the cuDNN library to accelerate GPU processing. Next, we give the details of the implemented DNN architectures trained on the different datasets.

We train various DNNs (with different numbers of layers and different numbers of neurons per layer) and verify their performance on the validation dataset. In general, a three-layer architecture with 256, 128, and 4 neurons, respectively, performs the best. The number of neurons in the input layer depends on which subset of features is selected for training the DNN. In the case of the full dataset, the input layer has 194 neurons, the same number as the dataset dimension. We obtain the features of the dataset from the 15-second data window as follows. Sensor data collected from the smartwatch in the data window consist of 180 signal readings, hence 180 features, from the three data streams running at 4 Hz (3 streams × 15 s × 4 Hz = 180). We derive 11 features from the 11 questionnaire questions. Finally, we append the pulse oximeter oxygen saturation measurement and systolic/diastolic blood pressure measurements to obtain a feature vector of length 194.
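
The feature count can be checked with a short sketch that assembles one 194-dimensional input vector. The 4 Hz sampling rate is inferred here from 180 readings across three streams in a 15-second window; the concatenation order, the function names, and the parameter count (a fully connected 194-256-128-4 network with biases, before any pruning) are assumptions for illustration rather than figures reported by the authors.

```python
WINDOW_SECONDS = 15
SAMPLE_RATE_HZ = 4  # inferred: 180 readings / 3 streams / 15 s
SMARTWATCH_STREAMS = ("gsr", "skin_temp", "ibi")
N_QUESTIONS = 11


def build_feature_vector(window, answers, spo2, systolic, diastolic):
    """Concatenate one data window into a flat feature vector."""
    expected = WINDOW_SECONDS * SAMPLE_RATE_HZ
    features = []
    for name in SMARTWATCH_STREAMS:
        samples = window[name]
        if len(samples) != expected:
            raise ValueError(f"{name}: expected {expected} samples")
        features.extend(samples)
    if len(answers) != N_QUESTIONS:
        raise ValueError("expected one yes/no answer per question")
    features.extend(float(a) for a in answers)       # 11 questionnaire features
    features.extend([spo2, systolic, diastolic])     # 3 discrete measurements
    return features


def dense_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Placeholder window: zeros stand in for real, normalized sensor readings.
window = {name: [0.0] * 60 for name in SMARTWATCH_STREAMS}
vector = build_feature_vector(window, [0] * 11, 0.97, 0.5, 0.4)
n_params = dense_parameter_count([len(vector), 256, 128, 4])
```

Here `len(vector)` is 3 × 60 + 11 + 3 = 194, matching the input layer size given above.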

We use leaky ReLU as the nonlinear activation function in all the DNN layers. As explained in Section 3, we generate three DNNs for each dataset: (i) DNN trained on the real training dataset, (ii) DNN pre-trained on the synthetic dataset and then trained on the real training dataset, and (iii) DNN synthesized and trained with the grow-and-prune synthesis paradigm.

4.3 Network training

We use the Adam optimizer for DNN training, with a learning rate of e- and batch size of . We use synthetic data instances to pre-train the network architecture. Moreover, in the grow-and-prune synthesis phase, we train the network for epochs each time the architecture changes. We apply network-changing operations over five iterations. In this step, we use pruning to achieve a pre-defined number of connections in the network.

5 Experimental Results

In this section, we analyze the performance of CovidDeep DNN models. We target four-way classification among the four cohorts described earlier. In addition, we perform an ablation study to analyze the impact of different subsets of features as well as different steps of CovidDeep DNN synthesis.

The CovidDeep DNN models are evaluated with four different metrics: test accuracy, false positive rate (FPR), false negative rate (FNR), and F1 score. These terms are based on the following:

  • True positive (negative): SARS-CoV-2/COVID-19 (healthy) data instances classified as SARS-CoV-2/COVID-19 (healthy).

  • False positive (negative): healthy (SARS-CoV-2/COVID-19) data instances classified as SARS-CoV-2/COVID-19 (healthy).

These metrics evaluate the model performance from different perspectives. Test accuracy evaluates its overall prediction power. It is simply the ratio of the number of correct predictions on the test data instances to the total number of such instances. The FPR is defined as the ratio of the number of negative, i.e., healthy, instances wrongly categorized as positive (false positives) to the total number of actual negative instances. The FNR is the fraction of positive instances that receive an incorrect test outcome. Since there are three virus-positive cohorts, there is a separate FNR for each of Cohorts 2, 3, and 4. Because of the four-way classification, the F1 score we report is the Macro F1 score.
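
The four metrics can be computed from a confusion matrix as sketched below. The sketch assumes rows are true cohorts and columns are predictions, with Cohort 1 (healthy) first, and it treats any misclassification of a virus-positive instance as a false negative for its cohort. The matrix values are made up for illustration and are not the paper's results.

```python
def four_way_metrics(cm):
    """Compute accuracy, FPR, per-cohort FNR, and Macro F1.

    cm[i][j] = number of instances of true cohort i+1 predicted as
    cohort j+1, where cohort 1 is the healthy class.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    # FPR: healthy instances predicted as any virus-positive cohort.
    fpr = (sum(cm[0]) - cm[0][0]) / sum(cm[0])
    # FNR(k): instances of positive cohort k that were misclassified.
    fnr = {i + 1: (sum(cm[i]) - cm[i][i]) / sum(cm[i]) for i in range(1, n)}
    f1_scores = []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))
        actual = sum(cm[i])
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return {"accuracy": accuracy, "fpr": fpr, "fnr": fnr,
            "macro_f1": sum(f1_scores) / n}


# Made-up confusion matrix (rows: true C1..C4, columns: predicted C1..C4).
example_cm = [
    [48, 2, 0, 0],
    [1, 39, 0, 0],
    [0, 0, 20, 0],
    [0, 0, 0, 30],
]
metrics = four_way_metrics(example_cm)
```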

5.1 Model performance evaluation

We obtained the highest test accuracy with a DNN model trained with the grow-and-prune synthesis paradigm on the dataset that contained features from four categories: GSR, pulse oximeter (Ox), blood pressure (BP), and questionnaire (Q). Table II shows the confusion matrix for four-way classification among the four cohorts: Cohort 1 (healthy), Cohort 2 (asymptomatic-positive), Cohort 3 (symptomatic-hospitalized), and Cohort 4 (symptomatic-intubated), denoted as C1, C2, C3, and C4, respectively. CovidDeep DNN achieves a test accuracy of 99.4%. The model achieves an FPR of only 1.8%. The low FPR means that the model does not raise many false alarms. It also results in 0.1% FNR for Cohort 2, and a 0.0% FNR for Cohorts 3 and 4, denoted as FNR(2), FNR(3), and FNR(4), respectively (each FNR refers to the ratio of the number of false predictions for that cohort divided by the total number of data instances of that type). The very low FNRs demonstrate the ability of the DNN model to not miss virus-positive cases. Moreover, the Macro F1 score of the DNN model is also high: 99.6%.

Label\Prediction C1 C2 C3 C4 Total
C1
C2
C3
C4
Total
TABLE II: Confusion matrix for the most accurate four-way classification model

Next, in Table III, we compare three DNN models for the most accurate case: one trained on the real training dataset only, one trained with the aid of synthetic data, and one trained with the aid of both synthetic data and grow-and-prune synthesis. We see that even though the test accuracy of the DNN model trained on just the real dataset is already very high, use of synthetic data and then grow-and-prune synthesis pushes it even higher. The impact of the second and third DNN models is more pronounced when the test accuracy of the first model is not that high (which happens when we drop various features in the ablation studies), as we will see next. Note that the third model has a slightly better F1 score than the second model, despite having the same FPR and FNRs, because of rounding error.

DNN model trained on Acc. FPR FNR(2) FNR(3) FNR(4) F1 Score
Real training dataset
Real+synthetic training dataset
Real+synthetic training dataset + grow-prune
TABLE III: Test accuracy, FPR, FNRs, and F1 score (all in %) for the three DNN models obtained for the most accurate case

5.2 Ablation studies

In this section, we report results on various ablation studies. We begin by considering DNN models trained on features obtained from subsets of the six data categories (five sensors and the questionnaire). This helps us understand the impact of each of these categories and their various combinations. Then, we analyze the impact of different parts of the CovidDeep training process, namely pre-training with synthetic data and grow-and-prune synthesis.

Since there are six data categories from which the corresponding features are obtained, there are 2^6 = 64 subsets. However, one of these subsets is the null subset. Thus, we evaluate the remaining 63 subsets. For these evaluations, we only consider the first two types of DNN models, referred to as DNN Models 1 and 2. We consider grow-and-prune synthesis-based models later. Table IV shows the results when features from only one, two, or three data categories are chosen, and Table V shows the results when features from four, five, or six data categories are chosen.
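The subset count can be checked with a short enumeration. The sketch below lists every non-empty combination of the six data categories and splits them by size in the same way as Tables IV and V; the variable and function names are assumptions made for illustration.

```python
from itertools import combinations

# The six data categories, in the order used by the tables.
CATEGORIES = ["GSR", "Temp", "IBI", "Ox", "BP", "Q"]

def nonempty_subsets(categories):
    """Return all non-empty subsets as '+'-joined labels, smallest first."""
    return [
        "+".join(combo)
        for size in range(1, len(categories) + 1)
        for combo in combinations(categories, size)
    ]

subsets = nonempty_subsets(CATEGORIES)
# Subsets with one to three categories (Table IV) and four to six (Table V).
table_iv_rows = [s for s in subsets if s.count("+") <= 2]
table_v_rows = [s for s in subsets if s.count("+") >= 3]

print(len(subsets), len(table_iv_rows), len(table_v_rows))  # 63 41 22
```

The 41 smaller subsets correspond to the rows of Table IV and the 22 larger ones to the rows of Table V.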

We first notice that DNN Model 2 generally performs better than DNN Model 1 across the various performance metrics. This underscores the importance of using synthetic data when the available dataset size is not large. Second, we observe that since this is a four-way classification, only 25% accuracy is possible by randomly predicting one of the four cohorts. Thus, even single data categories (GSR, Temp, IBI, Ox, BP, Q) enable much better prediction than by chance. These single data categories are still only weak learners of the correct label, when used in isolation. Third, DNN models, in general, tend to perform better on the various performance metrics when more data categories are used. However, this is not always true. For example, we obtain the highest accuracy of 99.4% with DNN Model 2 when only features from four (GSR, Ox, BP, Q) of the six categories are used. Adding features based on Temp or IBI or both actually reduces the test accuracy. This may be due to the curse of dimensionality. When the number of features increases, in general, the dataset size needs to be increased to obtain a good accuracy. For a fixed dataset size, this curse indicates that the number of features should be reduced. However, throwing out informative features would also reduce accuracy. In addition, some features are interactive, i.e., work synergistically to increase accuracy. Hence, a balance has to be found between accuracy and the number of features. Finally, when not all sensors are available (perhaps due to cost reasons), a suitable set that still provides reasonable accuracy can be chosen based on the given cost budget. This may help a broader cross-section of the population access the technology.
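To make the last point concrete, the following sketch picks the most accurate feature set whose sensors fit within a cost budget. The accuracies are the Acc.(2) values reported in Table VI for ten feature sets; the per-category costs are purely hypothetical units chosen for illustration, and the function name is likewise an assumption.

```python
# DNN Model 2 test accuracy (in %) for the ten feature sets listed in Table VI.
ACC_MODEL_2 = {
    "GSR+Ox+BP+Q": 99.4,
    "Temp+IBI+Ox+BP": 97.9,
    "GSR+Temp+IBI+BP": 96.5,
    "Ox+BP+Q": 95.4,
    "GSR+IBI+BP": 94.3,
    "GSR+Ox+Q": 93.6,
    "Temp+BP+Q": 92.4,
    "GSR+Temp+Ox": 93.0,
    "GSR+IBI+Ox": 89.5,
    "GSR+Ox": 85.1,
}

# HYPOTHETICAL cost of each data category in arbitrary units (not from the paper).
COST = {"GSR": 3, "Temp": 3, "IBI": 3, "Ox": 1, "BP": 2, "Q": 0}

def best_subset_within_budget(budget, accuracies=ACC_MODEL_2, cost=COST):
    """Return the most accurate feature set whose total cost fits the budget.

    Returns None when no listed feature set is affordable.
    """
    affordable = {
        name: acc
        for name, acc in accuracies.items()
        if sum(cost[c] for c in name.split("+")) <= budget
    }
    if not affordable:
        return None
    return max(affordable, key=affordable.get)

for budget in (3, 6):
    print(budget, best_subset_within_budget(budget))
```

With real device prices in place of the placeholder costs, the same lookup would indicate which sensor combination to recommend for a given budget.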

DNN Model 1 DNN Model 2
Data category Acc. FPR FNR(2) FNR(3) FNR(4) F1 Score Acc. FPR FNR(2) FNR(3) FNR(4) F1 Score
GSR
Temp
IBI
Ox
BP
Q
GSR+Temp
GSR+IBI
GSR+Ox
GSR+BP
GSR+Q
Temp+IBI
Temp+Ox
Temp+BP
Temp+Q
IBI+Ox
IBI+BP
IBI+Q
Ox+BP
Ox+Q
BP+Q
GSR+Temp+IBI
GSR+Temp+Ox
GSR+Temp+BP
GSR+Temp+Q
GSR+IBI+Ox
GSR+IBI+BP
GSR+IBI+Q
GSR+Ox+BP
GSR+Ox+Q
GSR+BP+Q
Temp+IBI+Ox
Temp+IBI+BP
Temp+IBI+Q
Temp+Ox+BP
Temp+Ox+Q
Temp+BP+Q
IBI+Ox+BP
IBI+Ox+Q
IBI+BP+Q
Ox+BP+Q
TABLE IV: Test accuracy, FPR, FNRs, and F1 score (all in %) for two DNN models obtained for feature subsets from one, two or three data categories
DNN Model 1 DNN Model 2
Data category Acc. FPR FNR(2) FNR(3) FNR(4) F1 Score Acc. FPR FNR(2) FNR(3) FNR(4) F1 Score
GSR+Temp+IBI+Ox
GSR+Temp+IBI+BP
GSR+Temp+IBI+Q
GSR+Temp+Ox+BP
GSR+Temp+Ox+Q
GSR+Temp+BP+Q
GSR+IBI+Ox+BP
GSR+IBI+Ox+Q
GSR+IBI+BP+Q
GSR+Ox+BP+Q
Temp+IBI+Ox+BP
Temp+IBI+Ox+Q
Temp+IBI+BP+Q
Temp+Ox+BP+Q
IBI+Ox+BP+Q
GSR+Temp+IBI+Ox+BP
GSR+Temp+IBI+Ox+Q
GSR+Temp+IBI+BP+Q
GSR+Temp+Ox+BP+Q
GSR+IBI+Ox+BP+Q
Temp+IBI+Ox+BP+Q
GSR+Temp+IBI+Ox+BP+Q
TABLE V: Test accuracy, FPR, FNRs, and F1 score (all in %) for two DNN models obtained for feature subsets from four, five or six data categories

To illustrate the effect of the different parts of the CovidDeep training process, we compare 10 CovidDeep DNN models, trained using the different DNN synthesis and training steps. We chose these models from different accuracy ranges. Table VI shows comparison results for the four-way classification task. We have already compared various performance metrics for DNN Models 1 and 2. Hence, here, we just report their accuracy, FLOPs, and number of model parameters (#Param.). Acc.(1) and Acc.(2) refer to the accuracy of DNN Models 1 and 2, respectively. The FLOPs and #Param. for these two models are identical. We report all the performance metrics for DNN Model 3, which is generated by grow-and-prune synthesis using both real and synthetic data. Thus, the starting point for DNN Model 3 synthesis is DNN Model 2. Next, we compare DNN Model 3 with the other two models based on various measures and show why it is suitable for deployment on edge devices.

  • Smaller model size: It contains fewer parameters on average (geometric mean) than DNN Models 1 and 2, thus significantly reducing the memory requirements.

  • Less computation: It reduces FLOPs per inference on average (geometric mean) relative to DNN Models 1 and 2, thus facilitating more efficient inference on edge devices.

  • Better performance: It improves accuracy on average relative to both DNN Model 1 and DNN Model 2, while also lowering FPR and FNRs, in general.

DNN Models 1 and 2 DNN Model 3
Data category Acc.(1) Acc.(2) FLOPs #Param. Acc. FLOPs #Param FPR FNR(2) FNR(3) FNR(4) F1 Score
GSR+Ox+BP+Q 99.3 99.4 104.1k 52.2k 99.4 3.6k 2.0k 1.8 0.1 0.0 0.0 99.6
Temp+IBI+Ox+BP 96.1 97.9 129.1k 64.8k 98.8 11.6k 6.0k 2.6 0.9 0.0 0.1 99.0
GSR+Temp+IBI+BP 93.8 96.5 159.3k 79.9k 98.7 19.6k 10.0k 1.3 2.0 1.9 0.1 98.6
Ox+BP+Q 95.4 95.4 73.3k 36.9k 95.4 5.6k 3.0k 0.6 12.3 0.0 0.0 96.6
GSR+IBI+BP 93.0 94.3 128.6k 64.5k 97.1 11.6k 6.0k 3.8 4.2 2.3 0.1 97.4
GSR+Ox+Q 92.8 93.6 103.0k 51.7k 93.9 9.6k 5.0k 10.3 7.6 1.2 0.0 95.3
Temp+BP+Q 88.8 92.4 103.5k 52.0k 96.2 15.6k 8.0k 2.8 8.2 0.0 0.0 97.2
GSR+Temp+Ox 90.3 93.0 128.1k 64.3k 96.6 29.6k 15.0k 5.2 2.7 3.9 1.9 96.6
GSR+IBI+Ox 87.1 89.5 128.1k 64.2k 93.1 15.6k 8.0k 9.6 5.5 4.6 6.3 93.0
GSR+Ox 81.9 85.1 97.4k 48.9k 86.2 9.6k 5.0k 14.5 12.4 7.7 16.9 85.7
TABLE VI: Comparison of the three DNN models (all performance metrics in %) for various feature sets
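The averages referred to in the comparison above can be recomputed from the ten rows of Table VI. The sketch below transcribes those rows and computes the geometric-mean reduction in parameters and FLOPs, as well as the arithmetic-mean accuracy gain of DNN Model 3 over DNN Models 1 and 2. The tuple layout and variable names are assumptions made here, and the results depend only on these ten rows, so they need not coincide with averages computed over a different set of models.

```python
from math import exp, log

# Rows of Table VI: (feature set, Acc.(1), Acc.(2), FLOPs and #Param. of
# Models 1/2 in thousands, then Acc., FLOPs and #Param. of Model 3).
TABLE_VI = [
    ("GSR+Ox+BP+Q",     99.3, 99.4, 104.1, 52.2, 99.4,  3.6,  2.0),
    ("Temp+IBI+Ox+BP",  96.1, 97.9, 129.1, 64.8, 98.8, 11.6,  6.0),
    ("GSR+Temp+IBI+BP", 93.8, 96.5, 159.3, 79.9, 98.7, 19.6, 10.0),
    ("Ox+BP+Q",         95.4, 95.4,  73.3, 36.9, 95.4,  5.6,  3.0),
    ("GSR+IBI+BP",      93.0, 94.3, 128.6, 64.5, 97.1, 11.6,  6.0),
    ("GSR+Ox+Q",        92.8, 93.6, 103.0, 51.7, 93.9,  9.6,  5.0),
    ("Temp+BP+Q",       88.8, 92.4, 103.5, 52.0, 96.2, 15.6,  8.0),
    ("GSR+Temp+Ox",     90.3, 93.0, 128.1, 64.3, 96.6, 29.6, 15.0),
    ("GSR+IBI+Ox",      87.1, 89.5, 128.1, 64.2, 93.1, 15.6,  8.0),
    ("GSR+Ox",          81.9, 85.1,  97.4, 48.9, 86.2,  9.6,  5.0),
]

def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    values = list(values)
    return exp(sum(log(v) for v in values) / len(values))

def arithmetic_mean(values):
    values = list(values)
    return sum(values) / len(values)

# Per-row reduction factors of Model 3 relative to Models 1/2.
param_reduction = geometric_mean(row[4] / row[7] for row in TABLE_VI)
flops_reduction = geometric_mean(row[3] / row[6] for row in TABLE_VI)
# Accuracy gain (percentage points) of Model 3 over Model 1 and over Model 2.
gain_over_model_1 = arithmetic_mean(row[5] - row[1] for row in TABLE_VI)
gain_over_model_2 = arithmetic_mean(row[5] - row[2] for row in TABLE_VI)

print(f"parameters: {param_reduction:.1f}x fewer, FLOPs: {flops_reduction:.1f}x fewer")
print(f"accuracy gain: {gain_over_model_1:.2f} (vs. Model 1), "
      f"{gain_over_model_2:.2f} (vs. Model 2)")
```

The geometric mean is the natural choice for the reduction factors because they are ratios; a single large ratio, such as the one for GSR+Ox+BP+Q, would dominate an arithmetic mean.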

6 Discussion and Future Work

In this section, we discuss the inspirations we took from the human brain in the synthesis process of CovidDeep DNNs. We also discuss future directions in medical research enabled by the CovidDeep framework.

An interesting ability of the human brain is to efficiently solve novel problems in a new domain despite limited prior experience. Inspired by this human capability, CovidDeep uses the TUTOR [hassantabar2020Tutor] approach for synthetic data generation and labeling to help the neural network start from a better initialization point. Use of gradient descent from a learned initialization point provides the DNN with an appropriate inductive bias. Hence, it reduces the need for large datasets that are not readily available for SARS-CoV-2/COVID-19 AI research.

The CovidDeep DNN training process takes another inspiration from the human brain development process in the grow-and-prune synthesis step. The human brain undergoes dynamic changes in its synaptic connections every second of its lifetime. Acquisition of knowledge depends on these synaptic rewirings [grossberg1988nonlinear]. Inspired by this phenomenon, CovidDeep utilizes the grow-and-prune synthesis paradigm to enable DNN architecture adaptation throughout training. CovidDeep DNNs synthesized with grow-and-prune synthesis do not suffer from the situation faced by most current DNNs: fixed connections during training. This enables CovidDeep to generate very compact, yet accurate, models for SARS-CoV-2/COVID-19 detection.
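The idea of connections that change during training can be illustrated with a binary mask over a weight matrix. The sketch below is a generic illustration, not the CovidDeep implementation: it assumes magnitude-based pruning (deactivating the active connections with the smallest absolute weights) and gradient-based growth (activating the dormant connections with the largest absolute gradients), and all function names and numbers are hypothetical.

```python
import numpy as np

def prune_smallest(weights, mask, fraction):
    """Deactivate the given fraction of active connections with smallest |weight|."""
    flat_mask = mask.ravel().copy()
    active = np.flatnonzero(flat_mask)
    n_prune = int(len(active) * fraction)
    order = np.argsort(np.abs(weights.ravel()[active]))
    flat_mask[active[order[:n_prune]]] = False
    return flat_mask.reshape(mask.shape)

def grow_largest_gradient(grads, mask, n_grow):
    """Activate up to n_grow dormant connections with largest |gradient|."""
    flat_mask = mask.ravel().copy()
    dormant = np.flatnonzero(~flat_mask)
    order = np.argsort(-np.abs(grads.ravel()[dormant]))
    flat_mask[dormant[order[:n_grow]]] = True
    return flat_mask.reshape(mask.shape)

# Toy 2x2 layer: illustrative weights and loss gradients.
demo_weights = np.array([[0.1, -2.0], [0.05, 3.0]])
demo_grads = np.array([[5.0, 0.0], [1.0, 0.0]])
demo_mask = np.ones_like(demo_weights, dtype=bool)

pruned_mask = prune_smallest(demo_weights, demo_mask, fraction=0.5)
regrown_mask = grow_largest_gradient(demo_grads, pruned_mask, n_grow=1)
print(int(pruned_mask.sum()), int(regrown_mask.sum()))  # 2 3
```

In a training loop, the mask would multiply the weights in the forward pass, and growth and pruning phases would alternate with ordinary gradient updates.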

CovidDeep uses physiological signals extracted using commercially available devices and achieves high test accuracy. As a result, it provides a testing mechanism that is accurate, easily accessible to the general public, and easy for individuals to use. Furthermore, this mechanism only requires a few minutes of data collection from an individual to perform an inference. Note that one hour of data collection from each individual was only required for training of the DNN models. It does not require the presence of a nurse or physician during testing. In fact, besides the data collected by the smartwatch and discrete sensors (for obtaining blood oxygen and blood pressure), the additional information required by the electronic questionnaire is small, related to the general health of the subject, and can be easily filled out with a yes/no answer. Thus, CovidDeep has the potential to significantly decrease the spread of SARS-CoV-2, save hundreds of thousands of lives, and drastically reduce the need for hospitalization, while also helping the world economy recover.

CovidDeep demonstrates that WMS-based SARS-CoV-2/COVID-19 detection is feasible. Previously, diabetes diagnosis was shown to be possible with the help of such sensors [yin2019diabdeep]. We believe that WMS-based disease detection is feasible for a large number of diseases [yin2017health].

The data were collected from only 87 individuals at a single location in Italy. Although the real training data were augmented with synthetic training data drawn from the same probability distribution, more work is needed to validate the various DNN models in the field. This process has begun across various continents.

7 Conclusion

In this article, we proposed a framework called CovidDeep to facilitate daily and pervasive detection of SARS-CoV-2/COVID-19. The framework combines off-the-shelf WMSs with efficient DNNs to achieve this goal. CovidDeep DNNs can be easily deployed on edge devices (e.g., smartphones and smartwatches) as well as servers. CovidDeep uses synthetic data generation to alleviate the need for large datasets. In addition, training of CovidDeep DNNs based on the grow-and-prune synthesis paradigm enables them to learn both the weights and the architecture during training. CovidDeep was evaluated based on data collected from 87 individuals. The highest accuracy it achieves is 99.4%. However, many subsets of features that correspond to easily accessible sensors in the market also achieve high enough accuracy to be practically useful. With more data collected from larger deployment scenarios, the accuracy of CovidDeep DNNs can be improved further through incremental learning.

Contributions: The SARS-CoV-2/COVID-19 detection project was conceived by Niraj K. Jha. He also supervised the dataset preparation and DNN model generation efforts. Shayan Hassantabar performed DNN synthesis and evaluation. Vishweshwar Ghanakota developed the smartphone application for data collection, authenticated the credentials of the application sending data, ensured data integrity, and ran pre-processing scripts. Gregory N. Nicola MD and Ignazio R. Marino MD defined the patient cohorts, and helped with the IRB approval process. Gregory N. Nicola MD, Ignazio R. Marino MD, and Bruno Raffaele decided on the questions to be placed in the questionnaire. Novati Stefano, Alessandra Ferrari, and Bruno Raffaele collected data from patients and healthy individuals and labeled the data. All co-authors helped with the revision and editing of the manuscript.

Acknowledgments: The project was facilitated by the tireless efforts of Bob Schena (CEO, Rajant Corp.) and Adel Laoui (CEO, NeuTigers, Inc.). Giana Schena and Maria Schena helped with buying and transporting the instruments as well as English-to-Italian translations of various documents. Joe Zhang helped initially with feature extraction from the raw dataset. Claudia Cirillo coordinated the administrative work and helped with translation of documents to Italian for the IRB application. Ravi Jha helped with proofreading of the manuscript. The Chief of the Italian Police, Franco Gabrielli, helped ensure safe and fast entrance and transfer of US researchers on Italian soil during the COVID-19 lockdown.

Competing interests: Four of the co-authors of this article, Niraj K. Jha, Shayan Hassantabar, Vishweshwar Ghanakota, and Gregory N. Nicola MD, have equity in NeuTigers, Inc. NeuTigers, along with Rajant Corporation and Thomas Jefferson University and Jefferson Health, enabled data collection from San Matteo Hospital, Pavia, Italy.

References