Continuous Monitoring of Blood Pressure with Evidential Regression

02/06/2021 · Hyeongju Kim, et al.

Photoplethysmogram (PPG) signal-based blood pressure (BP) estimation is a promising candidate for modern BP measurement, as PPG signals can be easily obtained from wearable devices in a non-invasive manner, allowing quick BP measurement. However, the performance of existing machine learning-based BP measuring methods still falls behind some BP measurement guidelines, and most of them provide only point estimates of systolic blood pressure (SBP) and diastolic blood pressure (DBP). In this paper, we present a cutting-edge method which is capable of continuously monitoring BP from the PPG signal and satisfies healthcare criteria such as the Association for the Advancement of Medical Instrumentation (AAMI) and the British Hypertension Society (BHS) standards. Furthermore, the proposed method provides the reliability of the predicted BP by estimating its uncertainty, helping medical diagnosis based on the model prediction. Experiments on the MIMIC II database verify the state-of-the-art performance of the proposed method under several metrics and its ability to accurately represent uncertainty in prediction.




1 Introduction

According to the World Health Organization report (World Health Organization, 2020), cardiovascular disease (CVD) is still a leading cause of death, threatening a large number of lives. Hypertension is one of the major risk factors for CVD, but most people with hypertension are unaware of the risk and do not acknowledge the necessity of controlling their blood pressure (BP). As hypertension silently damages the heart and arteries, continuous monitoring of BP is expected to play a significant role in preventing CVDs.

With traditional methods, however, it is difficult to monitor BP regularly in everyday life. For instance, cuff-based measurement is the most common method of measuring BP, which slowly deflates an inflated cuff to estimate the systolic blood pressure (SBP) and diastolic blood pressure (DBP) (Van Montfrans, 2001). However, this process is rather uncomfortable and takes a minute or two to obtain the results. In addition, the stress or anxiety of the subjects often leads to inaccurate measurements, which is known as the white coat syndrome (He et al., 2013). Another approach is catheter-based measurement, which inserts a catheter into an artery to observe the level of BP in real-time (Investigators, 2011). However, catheter-based measurement is not adequate for regular monitoring, as it is an invasive operation that requires an expert and carries the risk of infection. Thus, conventional measurement methods are typically performed as one-off procedures in limited situations, and more convenient BP monitoring methods are sorely needed.

Recently, photoplethysmogram (PPG) signal-based approaches have received a lot of attention for estimating BP (Hsu et al., 2020; Slapničar et al., 2019; Liang et al., 2018; Kachuee et al., 2016). Essentially, fluctuation in the PPG waveform is associated with blood circulation, as the variation of blood volume affects the amount of light absorbed by the tissue (Elgendi, 2012). Also, the acquisition of PPG is relatively simple and inexpensive: (1) a light-emitting diode (LED) illuminates the skin with infrared light and (2) a photodetector records the intensity of the non-absorbed light reflected from the tissue (Castaneda et al., 2018). Owing to such convenience, the PPG has been widely used for clinical monitoring of physiologic parameters such as heart rate, oxygen saturation, and the level of hemoglobin concentration in blood (Kavsaoğlu et al., 2015; Allen, 2007).

To take advantage of PPG's simple acquisition procedure and its association with BP, we propose an elaborately designed framework to predict a continuous waveform of BP using the PPG signal only. First, we adopt a variant of the Demucs architecture (Défossez et al., 2019) that is suited for modeling time series data. The proposed framework employs the U-net structure to leverage the common periodicity of PPG and BP signals, and offers far better performance than conventional methods. Second, we note that a simple regression loss such as mean absolute error (MAE) does not guarantee optimal performance due to the discrepancy between the training objective and the test criteria. To alleviate this mismatch, we propose an auxiliary loss function that matches peak values between the true and estimated BP. Since the proposed loss imposes a larger penalty on incorrect peak predictions, the regression model prioritizes estimating SBP and DBP more accurately than other BP values. Finally, we employ deep evidential regression (DER) (Amini et al., 2019) to provide uncertainty in model prediction. Knowing the reliability of a prediction is important when deploying the model in the real world, as it helps with diagnosing a patient's condition based on model prediction or calibrating the model estimate. Direct application of DER to our task, however, causes an overfitting problem that degrades the performance of BP estimation. To deal with this issue, we propose two training techniques: i) weight initialization with deterministic regression and ii) temperature scaling. Both techniques allow the model to measure BP accurately while representing the uncertainty well. In the experiments, we demonstrate that the proposed framework shows cutting-edge performance on a variety of evaluation metrics and represents prediction uncertainty desirably.

In summary, our main contributions are as follows:

  • We present a state-of-the-art model suitable for monitoring a continuous waveform of BP using only raw PPG signals.

  • We propose an auxiliary objective function called peak-to-peak matching loss to obtain better estimates of SBP and DBP.

  • We propose two different training strategies to overcome the overfitting problem of DER and demonstrate that the model represents uncertainty appropriately.

  • To the best of our knowledge, this is the first work to take into account the reliability of model prediction in BP measurements beyond naive regression.

2 Related Work

The basic working principle behind PPG acquisition is closely associated with changes in blood volume. As a result, PPG signals have been widely used in calculating physiological parameters of the body. Since blood volume variations are related to the blood flow that exerts pressure on the vessel, PPG signals are commonly considered good evidence for estimating BP (Ibtehaz and Rahman, 2020). However, the precise relation between PPG and BP is not yet fully understood.

To exploit the valuable information underlying PPG signals for BP measurement in a data-driven manner, many recent studies have employed various machine learning algorithms (Hsu et al., 2020; Ibtehaz and Rahman, 2020; Slapničar et al., 2019; Kachuee et al., 2016). For instance, Kachuee et al. (2016) applied classical algorithms such as linear regression, decision trees, support vector machines (SVM), adaptive boosting (AdaBoost), and random forests to handcrafted features extracted from PPG and ECG signals. The authors reported that the AdaBoost model performed best among the various approaches under the mean absolute error (MAE) criterion. Although they proposed non-invasive BP estimation methods, their models achieved low grades under the British Hypertension Society (BHS) and the Association for the Advancement of Medical Instrumentation (AAMI) standards due to the limited performance of classical machine learning algorithms.

On the other hand, Hsu et al. (2020) proposed to train a feed-forward network on manually selected features obtained from single cardiac cycle segmentation to output SBP and DBP values. However, their method relies on a heuristic search for feature selection and does not adopt an architecture suited to fully exploiting the sequential information of the PPG signal.

Slapničar et al. (2019) is another line of work that directly estimates SBP and DBP using deep learning. They employed ResNet (He et al., 2016) and GRU (Cho et al., 2014) to leverage both temporal and frequency information in the PPG signal. Still, their measurements were somewhat inaccurate, and their model cannot estimate a continuous waveform of BP. Ibtehaz and Rahman (2020) is the work most similar to our framework in that they introduced a model that predicts sequential values of BP. They employed a one-dimensional U-net model (Ronneberger et al., 2015) and used the raw PPG signal as input to perform BP regression. However, they failed to satisfy the BHS and AAMI standards in terms of SBP. In addition, none of these approaches can provide the reliability of the BP measurement, which can be critical information for making a medical decision based on model prediction.

3 Proposed Method

This section introduces an elaborately designed framework appropriate for measuring a continuous waveform of BP using only the PPG signal as input. In our framework, SBP, DBP, and mean arterial pressure (MAP) values are calculated by finding the maximum, minimum, and average values of the predicted BP waveform, respectively. For better estimation of SBP and DBP, we introduce a novel peak-to-peak matching loss with a simple peak detection algorithm. Furthermore, beyond naive regression, the proposed model is optimized to faithfully represent the prediction reliability as well.

3.1 Continuous Monitoring

Figure 1: Overall structure of the proposed model.

Most current works still focus on directly estimating SBP, DBP, and MAP values from continuous PPG input via traditional machine learning methods or simple feed-forward networks (Hsu et al., 2020; Mousavi et al., 2019; Kachuee et al., 2016). However, these models cannot provide the invaluable information for the diagnosis and treatment of CVDs that underlies the BP waveform itself (Seo et al., 2015). Here, we consider a more sophisticated architecture suitable for modeling time series input and output data. Fig. 1 shows the overall structure of the proposed model. The architecture is a one-dimensional adaptation of U-net (Ronneberger et al., 2015), similar to Demucs (Défossez et al., 2019). Though the input and output domains differ, we adopt skip connections to help the model leverage the shared cardiac periodicity of PPG and BP. Each convolution layer except the last layer in the decoder is followed by a gated linear unit (GLU) (Dauphin et al., 2017) or a rectified linear unit (ReLU) as activation function. To stabilize the training process and achieve better test performance, batch normalization (Ioffe and Szegedy, 2015) and weight normalization (Salimans and Kingma, 2016) are additionally used. Finally, a two-layer bidirectional LSTM is employed between the encoder and the decoder to capture long-term dependencies in PPG signals. For use as input to the decoder, the channel size of the bidirectional LSTM's output is halved by a fully connected layer. Unlike PPG2ABP (Ibtehaz and Rahman, 2020), which simply outputs a sequence of point estimates of BP, the proposed model yields a 4-dimensional temporal sequence for the parameters of the Normal Inverse-Gamma (NIG) distribution. We compute the likelihood of the ground-truth BP from the NIG parameters and train the model by maximum likelihood. More details about evidential regression are given in Section 3.3.

3.2 Peak-to-Peak Matching Loss

Figure 2: Example of peak points detected by our method. Black dots represent the maximum and minimum point in each frame.

Regression models are usually optimized to minimize mean squared error (MSE), mean absolute error (MAE), or negative log-likelihood (NLL). For instance, PPG2ABP, which aims to monitor BP waveforms, uses the MAE and MSE losses for training the BP regression model. However, these loss functions are not optimal for the BP measurement task, since some medical diagnoses at test time are conducted based on other statistics (e.g., SBP and DBP) calculated from the estimated BP waveform. The discrepancy between the training loss function and the test criteria hinders the model from reaching its full potential.

To mitigate the mismatch, we propose a peak-to-peak matching loss as an auxiliary objective function. Let y = (y₁, …, y_T) be a sequence of true BP values and ŷ = (ŷ₁, …, ŷ_T) be the corresponding estimate. We first divide y and ŷ into N segments, {y_{n,j}} and {ŷ_{n,j}}, where y_{n,j} and ŷ_{n,j} are the j-th elements in the n-th frame. Then, the peak-to-peak matching loss derived from the peak points of y is computed as follows:

L_max = (1/N) Σ_{n=1}^{N} | y_{n,j_n} − ŷ_{n,j_n} |,  where j_n = argmax_j y_{n,j}.  (1)

Similarly, the second peak-to-peak matching loss can be obtained by replacing argmax with argmin in Eq. (1):

L_min = (1/N) Σ_{n=1}^{N} | y_{n,k_n} − ŷ_{n,k_n} |,  where k_n = argmin_j y_{n,j}.  (2)

The total peak-to-peak matching loss is given by L_peak = L_max + L_min, and we scale it with a coefficient λ_peak. Ideally, the peak-to-peak matching loss should be applied to the maximum and minimum values in every cardiac cycle. However, peak detection within an exact single cycle is somewhat tricky. Instead, we set a fixed time interval and select peak values in each interval. Fig. 2 shows an example of the detected peaks of the ground-truth BP using our method. Taking the average heart rate into account, the frames are divided every 0.8 seconds. Although the method occasionally skips some peak values or incorrectly selects non-peak values, most peak values are detected well. By minimizing L_peak, the model is trained to estimate the peak values more accurately, and more reliable results can be obtained when determining hypertension from the predicted waveforms.
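As a concrete illustration, the frame-based peak matching described above can be sketched as follows. The function name, the exact loss form (mean L1 distance at the ground-truth argmax/argmin positions per frame), and the non-batched layout are reconstructions from the text, not the authors' code:

```python
import numpy as np

def peak_matching_loss(y_true, y_pred, frame_len):
    """L1 loss between true and predicted BP at the per-frame peak
    (max and min) positions of the ground-truth waveform.
    Frame-based peak picking follows the paper's heuristic of fixed
    time intervals; the exact loss form is an assumption."""
    n_frames = len(y_true) // frame_len
    loss_max, loss_min = 0.0, 0.0
    for n in range(n_frames):
        frame_t = y_true[n * frame_len:(n + 1) * frame_len]
        frame_p = y_pred[n * frame_len:(n + 1) * frame_len]
        j_max = int(np.argmax(frame_t))  # peak index taken from ground truth
        j_min = int(np.argmin(frame_t))
        loss_max += abs(frame_t[j_max] - frame_p[j_max])
        loss_min += abs(frame_t[j_min] - frame_p[j_min])
    return (loss_max + loss_min) / n_frames
```

At the 125 Hz sampling rate used in the experiments, the 0.8-second frames correspond to `frame_len = 100`.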

3.3 Evidential Regression

Figure 3: Left: Negative log-likelihood (NLL) curve. Right: Mean absolute error (MAE) curve. MAE continues to decline during training while overfitting of NLL occurs in the early training stage.

In order to use the predicted BP values for medical diagnosis, it is necessary not only to predict accurate values but also to consider the reliability of the predicted values. However, most existing works focus solely on estimating SBP and DBP values and overlook the importance of the latter. In this paper, we employ deep evidential regression (DER) (Amini et al., 2019) to provide the reliability of the prediction. Furthermore, we propose two training techniques to solve an overfitting problem that arises when DER is applied to the BP measurement task.

In the DER framework, the target distribution is parameterized by a hierarchical structure with unknown mean μ and variance σ². The prior distribution of (μ, σ²) is set to the Normal Inverse-Gamma (NIG) distribution with known parameters (γ, ν, α, β).¹ More specifically, the distribution of the i-th value of BP, yᵢ, is assumed as follows:

yᵢ ∼ N(μ, σ²),  (3)
μ ∼ N(γ, σ²/ν),  σ² ∼ Γ⁻¹(α, β),  (4)

where γ ∈ ℝ, ν > 0, α > 1, β > 0, and Γ⁻¹(α, β) is the Inverse-Gamma distribution. Eqs. (3) and (4) indicate that a single higher-order distribution yields various lower-order data distributions, which in turn generate yᵢ. This setting allows us to define two types of uncertainty as follows:

E[σ²] = β / (α − 1)  (aleatoric),  Var[μ] = β / (ν(α − 1))  (epistemic),  (5)

where the aleatoric uncertainty E[σ²] captures the innate stochasticity of the data, and the epistemic uncertainty Var[μ] represents the uncertainty of the model arising from a lack of training data for particular data patterns.

¹ We omit the index of the NIG parameters for simplicity.
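For reference, the two uncertainty quantities of the NIG prior (as defined in Amini et al., 2019) can be computed directly from the predicted parameters; this small helper is an illustrative sketch:

```python
def nig_uncertainties(nu, alpha, beta):
    """Aleatoric uncertainty E[sigma^2] and epistemic uncertainty Var[mu]
    of the Normal Inverse-Gamma prior; both require alpha > 1 to be finite."""
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (nu * (alpha - 1.0))
    return aleatoric, epistemic
```

Note that the epistemic term is the aleatoric term divided by ν, so large predicted evidence ν shrinks the model uncertainty but not the data noise.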

To compute the likelihood of yᵢ given the NIG parameters (γ, ν, α, β), we should marginalize over all possible pairs of (μ, σ²):

p(yᵢ | γ, ν, α, β) = ∫∫ N(yᵢ; μ, σ²) p(μ, σ² | γ, ν, α, β) dμ dσ².  (6)

Since the NIG distribution is the conjugate prior of the normal distribution, we can derive the negative log-likelihood L_NLL analytically in a closed form as follows:

L_NLL = ½ log(π/ν) − α log Ω + (α + ½) log(ν(yᵢ − γ)² + Ω) + log(Γ(α)/Γ(α + ½)),  (7)

where Ω = 2β(1 + ν) and Γ(·) is the gamma function (Murphy, 2007). To regularize high evidence on incorrect predictions, we also employ the penalty term introduced by Amini et al. (2019). The total evidential loss is given by

L_evid = L_NLL + λ |yᵢ − γ| (2ν + α),  (8)

where λ is a regularization coefficient.

The implementation of DER using the model described in Section 3.1 is straightforward: we set the dimension of the output sequence to 4 (i.e., the dimensions represent γ, ν, α, and β) and train the model according to Eq. (8). In practice, however, we found that direct optimization of Eq. (8) results in an overfitting problem in the early training stage. Fig. 3 shows that the validation loss begins to worsen even though the MAE continues to decrease. In other words, if the model is selected based on the validation NLL, it is bound to have suboptimal prediction accuracy, which leads to degraded performance on the evaluation metrics. To handle this issue, we propose two different training techniques in the following subsections.
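The per-sample evidential loss can be sketched as follows, using the closed-form Student-t negative log-likelihood and the evidence penalty from Amini et al. (2019); the function name and the scalar (non-batched) form are illustrative:

```python
import math

def evidential_loss(y, gamma, nu, alpha, beta, lam):
    """DER training loss for one target y: closed-form NLL of the
    NIG-marginalized likelihood plus lam * evidence-regularization
    penalty |y - gamma| * (2*nu + alpha)."""
    omega = 2.0 * beta * (1.0 + nu)
    nll = (0.5 * math.log(math.pi / nu)
           - alpha * math.log(omega)
           + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
           + math.lgamma(alpha) - math.lgamma(alpha + 0.5))
    penalty = abs(y - gamma) * (2.0 * nu + alpha)
    return nll + lam * penalty
```

The penalty vanishes when the prediction γ matches the target, so it only discourages confident (high ν, α) but wrong predictions.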

3.3.1 Weight Initialization with Deterministic Regression

The first idea is to initialize the model weights so as to ensure high prediction accuracy from the beginning. With the proposed model, a predicted value is given by ŷᵢ = γ. Before optimizing the model according to Eq. (8), we suggest leveraging the MAE loss between yᵢ and γ in the initialization stage:

L_init = |yᵢ − γ|.  (9)

Note that the gradient of L_init does not backpropagate through ν, α, and β. Initialization is then performed by selecting a model based on the MAE loss computed on the validation set. With the pretrained weights, the model is again optimized with respect to L_evid. The final model is selected based on the evidential loss calculated on the validation set. In our experiments, we verify that this simple training strategy significantly improves the prediction accuracy.

3.3.2 Temperature Scaling

As training progresses, the model becomes too overconfident before its predictions become accurate enough. To recover an appropriate level of model confidence, we scale ν and α with two scalar temperature parameters τ_ν and τ_α in Eq. (5) as follows:

ν̄ = ν / τ_ν,  ᾱ = 1 + (α − 1) / τ_α.  (10)

This strategy is motivated by a calibration method for classification networks that scales a logit vector to raise the output entropy (Guo et al., 2017). When τ_α < 1, both epistemic and aleatoric uncertainties are readjusted to be smaller; conversely, as the temperatures increase, the overconfidence of the model is gradually alleviated. For implementation, we first choose the best model based on the MAE loss computed on the validation set after training the model according to Eq. (8). Then, τ_ν and τ_α are optimized with respect to the NLL on the validation set in a post-processing step. Note that none of the model parameters are updated during post-processing, so the model accuracy remains the same. At test time, we use the scaled parameters ν̄ and ᾱ for uncertainty estimation.
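One plausible reading of the temperature scaling, sketched below, rescales the ν and (α − 1) terms so that temperatures above 1 enlarge both uncertainties and temperatures below 1 shrink them; the exact parameterization is not fully recoverable from the text, so treat this as an assumption rather than the authors' implementation:

```python
def scaled_uncertainties(nu, alpha, beta, tau_nu, tau_alpha):
    """Temperature-scaled NIG uncertainties (assumed parameterization:
    nu -> nu / tau_nu and (alpha - 1) -> (alpha - 1) / tau_alpha).
    With both temperatures equal to 1 this reduces to the unscaled
    aleatoric E[sigma^2] and epistemic Var[mu]."""
    aleatoric = tau_alpha * beta / (alpha - 1.0)
    epistemic = tau_alpha * tau_nu * beta / (nu * (alpha - 1.0))
    return aleatoric, epistemic
```

Since only the uncertainty estimates change, the point prediction γ, and hence all accuracy metrics, are untouched by this post-processing.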

4 Experiments

Model   | λ_peak | Training technique    | SBP (mmHg)    | MAP (mmHg)    | DBP (mmHg)    | ABP (mmHg)
Model 1 | 0.0    | Not applied           | 3.870 ± 0.034 | 1.956 ± 0.021 | 1.998 ± 0.021 | 3.161 ± 0.002
Model 2 | 1.0    | Not applied           | 3.469 ± 0.032 | 1.949 ± 0.020 | 1.973 ± 0.021 | 3.170 ± 0.002
Model 3 | 0.0    | Weight initialization | 3.443 ± 0.034 | 1.871 ± 0.021 | 1.904 ± 0.021 | 2.831 ± 0.002
Model 4 | 1.0    | Weight initialization | 3.040 ± 0.033 | 1.811 ± 0.021 | 1.776 ± 0.022 | 2.817 ± 0.002
Model 5 | 0.0    | Temperature scaling   | 3.404 ± 0.035 | 1.811 ± 0.022 | 1.815 ± 0.023 | 2.768 ± 0.002
Model 6 | 1.0    | Temperature scaling   | 3.098 ± 0.034 | 1.761 ± 0.021 | 1.756 ± 0.022 | 2.688 ± 0.002

Table 1: Comparison of BP measurement performance (MAE, mmHg) using different model configurations; values are given with their 95% confidence intervals. The results demonstrate that the auxiliary peak-to-peak matching loss (weight λ_peak) and the two training techniques for DER play an important role in achieving high accuracy of BP measurements.

We conducted a set of experiments using the Multi-parameter Intelligent Monitoring in Intensive Care (MIMIC) II database (Saeed et al., 2011) to evaluate the proposed framework. We utilized the refined version of MIMIC II provided by Kachuee et al. (2015) and followed the same pre-processing steps as in Ibtehaz and Rahman (2020) to sample well-distributed BP signals between 50 mmHg and 200 mmHg. We used 10-second-long signals with a sampling rate of 125 Hz for the training and validation sets. Since the evaluation metrics used in this work require BP values computed from a single cardiac cycle, we constructed a test set consisting of 2-second-long segments to ensure that each test sample contains at least one cardiac cycle. The total duration of the training, validation, and test sets was 250 hours, 27.8 hours, and 75.7 hours, respectively.

We constructed the encoder of the model with 4 blocks of convolutional networks. Each block consisted of a convolution layer with kernel size 6 and stride 2, a 1x1 convolution layer, and a batch normalization layer. We also applied weight normalization to each convolution layer. The output channel size of the first block was set to 64, and each subsequent block doubled it (i.e., 64, 128, 256, 512). The same configuration was used symmetrically for the decoder. Between the encoder and the decoder, a two-layer bidirectional LSTM with hidden size 512 and a fully connected layer were employed. The fully connected layer converted the dimension of the bidirectional LSTM's output back to 512.
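Under these hyperparameters, the encoder's channel/length progression can be checked with a small helper; valid (unpadded) convolutions are assumed here, since the padding scheme is not stated:

```python
def encoder_shapes(input_len, n_blocks=4, kernel=6, stride=2, ch0=64):
    """Walk the (channels, length) progression through the conv encoder:
    each block halves the temporal length (kernel 6, stride 2, no padding
    assumed) and doubles the channel count starting from 64."""
    shapes, length, ch = [], input_len, ch0
    for _ in range(n_blocks):
        length = (length - kernel) // stride + 1  # valid conv output length
        shapes.append((ch, length))
        ch *= 2
    return shapes
```

For a 10-second input at 125 Hz (1250 samples), this gives (64, 623) → (128, 309) → (256, 152) → (512, 74), so the 512-channel bottleneck matches the bidirectional LSTM hidden size described above.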

We trained the models for 500K iterations with a batch size of 512 using a single 2080-Ti GPU. We used the Adam optimizer (Kingma and Ba, 2014). When weight initialization was conducted using L_init, the model was further fine-tuned for 50K steps according to Eq. (8), where λ was set to 1.5. When we performed post-processing to calibrate the model confidence, we adjusted the learning rate to 0.02 and optimized the temperature parameters for 1K steps using the validation set. In this case, we set λ to 0.1 for both training and post-processing.

4.1 BP Measurement

4.1.1 Validation of Proposed Framework

We trained the proposed model with various configurations and report the results of BP measurement in Table 1. SBP, MAP, and DBP were calculated by finding the maximum, average, and minimum values in the test segments, respectively. Since the proposed model is capable of measuring an entire BP waveform, we also report the MAE of arterial blood pressure (ABP) at all time steps. First of all, it can be observed that the additional optimization of the peak-to-peak matching loss improved the accuracy of SBP measurement in all cases (Model 2, Model 4, and Model 6). Moreover, its use did not hurt the performance of predicting ABP and MAP values. These results suggest that the peak-to-peak matching loss efficiently leads the model to prioritize estimating peak values more accurately while maintaining the accuracy of the other BP predictions. Secondly, we can verify that the two proposed techniques for training DER are quite helpful for solving the overfitting problem and result in much more reliable measurements. When the model parameters were initialized with deterministic regression and then updated to optimize the evidential loss, the model converged to a far better point for BP measurement (Model 3 and Model 4). Likewise, when the model was selected based on the validation MAE and the model confidence was scaled with temperature parameters, the BP estimates faithfully followed the true BP values (Model 5 and Model 6). Thus, the overall results demonstrate that the proposed methods work well in practice. In the following experiments, we evaluate the performance of the proposed framework using Model 6.

4.1.2 BHS Standard

        Cumulative Error Percentage (%)
        | ≤ 5 mmHg | ≤ 10 mmHg | ≤ 15 mmHg
SBP     | 82.83    | 91.73     | 95.34
MAP     | 90.40    | 96.27     | 98.24
DBP     | 91.36    | 96.62     | 98.33
ABP     | 85.03    | 93.28     | 96.44
Grade A | 60       | 85        | 95
Grade B | 50       | 75        | 90
Grade C | 40       | 65        | 85

Table 2: Cumulative error percentage of BP predictions obtained from our model and the grading criteria of the BHS standard.

The BHS standard is a protocol of requirements for the evaluation of BP measuring devices and methods (O’Brien et al., 1993). The BHS standard counts the cumulative number of predictions belonging to three intervals (i.e., whether the absolute error of a prediction is lower than (i) 5mmHg, (ii) 10mmHg, and (iii) 15mmHg) and evaluates the accuracy of measurement according to the tabulated grading criteria. There are four types of grade in the BHS standard: grade A, grade B, grade C, and grade D. To get a specific grade, the cumulative error percentage should satisfy three thresholds simultaneously. If a measurement method cannot fulfill even the grade C thresholds, it acquires a grade D score. Table 2 presents the grading criteria of the BHS standard and the cumulative error percentage of the proposed model. Surprisingly, the proposed model acquires grade A scores in all assessments. In other words, most of our model’s predictions fit well with the corresponding true BP values within the 15mmHg error range. In particular, for the result of SBP, it is very meaningful to get a grade A score since most literature could not achieve it on the MIMIC II dataset (Ibtehaz and Rahman, 2020; Mousavi et al., 2019). The results demonstrate that the proposed model provides accurate BP measurement.
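The grading rule can be made concrete with a short helper that computes the three cumulative error percentages and checks them against each grade's cutoffs; whether the thresholds are inclusive is an assumption here:

```python
def bhs_grade(abs_errors):
    """Grade a set of absolute errors (mmHg) per the BHS standard:
    the percentages of errors within 5/10/15 mmHg must jointly meet
    all three cutoffs of a grade, otherwise the result is grade D."""
    n = len(abs_errors)
    pct = [100.0 * sum(e <= t for e in abs_errors) / n for t in (5, 10, 15)]
    grades = (("A", (60, 85, 95)), ("B", (50, 75, 90)), ("C", (40, 65, 85)))
    for grade, cutoffs in grades:
        if all(p >= c for p, c in zip(pct, cutoffs)):
            return grade, pct
    return "D", pct
```

Applying this rule to the SBP row of Table 2 (82.83 / 91.73 / 95.34 against 60 / 85 / 95) confirms the grade A result reported above.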

4.1.3 AAMI Standard

          | ME (mmHg) | STD (mmHg) | # of subjects
SBP       | 0.337     | 7.058      | 942
MAP       | 0.270     | 4.364      |
DBP       | 0.200     | 4.508      |
ABP       | 0.270     | 6.311      |
Criterion | < 5       | < 8        | ≥ 85

Table 3: Comparison with the AAMI standard.

The AAMI standard is another evaluation protocol that has been widely used in the literature for benchmarking. It requires BP measuring methods to simultaneously meet the following criteria: (i) the mean error (ME) is less than 5 mmHg, (ii) the standard deviation (STD) of errors is less than 8 mmHg, and (iii) the evaluation is performed on at least 85 subjects. We report the mean and standard deviation of the prediction errors of our model in Table 3. It is noteworthy that the proposed model meets the requirements of the AAMI standard in all cases. Even for SBP, which is the most difficult BP value to predict, the proposed model satisfies the AAMI criteria, recording 0.337 mmHg and 7.058 mmHg for ME and STD, respectively. These experimental results support that the proposed model estimates BP values with fairly high accuracy and can potentially be exploited for clinical use.
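Similarly, the AAMI check reduces to a few lines; the population (rather than sample) standard deviation is an assumption of this sketch:

```python
def aami_pass(errors, n_subjects):
    """AAMI criteria: |mean error| < 5 mmHg, standard deviation of the
    errors < 8 mmHg, and evaluation on at least 85 subjects."""
    n = len(errors)
    me = sum(errors) / n
    std = (sum((e - me) ** 2 for e in errors) / n) ** 0.5  # population std
    return abs(me) < 5.0 and std < 8.0 and n_subjects >= 85
```

Plugging in the SBP statistics from Table 3 (ME 0.337, STD 7.058, 942 subjects) satisfies all three conditions.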

4.2 Reliability of Model Prediction

4.2.1 Uncertainty Estimation

Figure 4: Examples of uncertainty estimation in model prediction. Red dashed lines are true BP values, black solid lines are model predictions, and grey areas represent the degree of uncertainty in prediction.

To efficiently show the relation between the reliability of model prediction and its accuracy, we visualize the level of model uncertainty. To this end, we first calculated the model uncertainty of each measurement. Then, we colored the area around the BP measurements in proportion to the corresponding model uncertainty. In Fig. 4, we present several visualized examples of model reliability obtained through this process. By comparing the first and second rows, we can observe a significant correlation between the accuracy of model prediction and the estimated uncertainty. When the estimated uncertainty is high, as shown in the first row, the predicted BP values have some obvious errors compared to the actual BP values. In contrast, when the model uncertainty is comparatively low, as in the second row, the model predictions are highly trustworthy and almost perfectly fit the true BP values. These experimental results demonstrate that we have appropriately applied the DER framework to the BP measurement task along with the proposed training methods and managed to properly estimate the reliability of the predictions.

4.2.2 BP measurement on Selected Samples

The uncertainty of model prediction can be exploited to produce more reliable BP measurements. For example, if the model uncertainty for a particular PPG input is high, we can remeasure it or optimize the model for more steps using PPG signals similar to that input pattern. Here, we choose to skip BP measurements on low-reliability PPG signals and re-evaluate BP measurement performance. We computed the model uncertainty of all test samples and excluded the 20% with the highest model uncertainty. Using this subset of the test set, we assessed the BP measurement performance of our model again. As shown in Table 4, the MAE values of SBP, MAP, DBP, and ABP decreased by 0.666 mmHg, 0.433 mmHg, 0.468 mmHg, and 0.539 mmHg, respectively. These experimental results illustrate that the estimated reliability is highly consistent with the model accuracy and can be further utilized in a wide range of applications.
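The selection protocol above can be sketched as follows; the helper simply ranks samples by predicted uncertainty, keeps the most reliable 80%, and recomputes the MAE on that subset (names and list-based layout are illustrative):

```python
def mae_on_reliable_subset(preds, targets, uncertainties, drop_frac=0.2):
    """Drop the drop_frac most-uncertain samples and return the MAE on
    the remaining subset, mirroring the reliability-based selection."""
    order = sorted(range(len(preds)), key=lambda i: uncertainties[i])
    keep = order[: int(len(preds) * (1.0 - drop_frac))]  # lowest uncertainty
    return sum(abs(preds[i] - targets[i]) for i in keep) / len(keep)
```

If the uncertainty estimates correlate with the errors, as reported above, this MAE should be lower than the MAE on the full test set.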

4.2.3 Hypertension Classification

         Mean Absolute Error (mmHg)
Test set | SBP   | MAP   | DBP   | ABP
All      | 3.098 | 1.761 | 1.756 | 2.688
Subset   | 2.432 | 1.328 | 1.288 | 2.149

Table 4: Comparison of BP measurements using Model 6 between all test samples and test samples selected according to the model reliability.
Figure 5: Confusion matrices of hypertension classification (left: all test samples; right: the high-reliability subset of the test set).

               |          | MAE (mmHg)  | BHS grade
Method         | Input    | SBP  | DBP  | SBP | DBP
Kachuee et al. | PPG, ECG | 8.21 | 4.31 | D   | A
Ibtehaz et al. | PPG      | 5.73 | 3.45 | B   | A
Li et al.      | PPG, ECG | 6.73 | 2.52 | B   | A
Hsu et al.     | PPG      | 3.21 | 2.23 | A   | A
Ours           | PPG      | 3.10 | 1.76 | A   | A

Table 5: Comparison of overall performance with other approaches. The proposed model meets the BHS and AAMI standards and achieves state-of-the-art performance under the MAE criterion in SBP and DBP measurements. In addition, the proposed model is the only one capable of monitoring both the continuous BP waveform and the prediction reliability.

As a real-world application, we can diagnose hypertension based on the model prediction. The BP classification criteria for SBP and DBP are well established (Holm et al., 2006), and for this experiment we classified the levels of BP into three categories based on SBP: hypertension, prehypertension, and normotension. For a more accurate diagnosis, we also conducted BP classification on the subset of the test set filtered by reliability in the same way as in Section 4.2.2. We present the resulting confusion matrices of BP classification in Fig. 5. Thanks to the sophisticated architecture and the proposed methods, the model achieved high performance, detecting hypertension with a probability of about 90%. It is also noteworthy that our model classified the three BP groups with similar levels of accuracy. When the test samples with low reliability were excluded, as shown on the right of Fig. 5, the classification accuracy increased in all classes. These experimental results suggest that our model is suitable for simple diagnosis of hypertension and can additionally leverage the estimated reliability to increase accuracy.

4.3 Comparison with Other Works

Although there have been many studies using PPG signals to measure BP, it is difficult to compare them directly with ours, since each study used different experimental configurations and some even assessed performance on private data. To evaluate the model as fairly as possible, we selected papers that conducted training and evaluation on the MIMIC II dataset and present the overall performance comparison in Table 5. First of all, it can be observed that the estimation of SBP is so difficult that Kachuee et al. (2016) and Li et al. (2020) obtained grade D and grade B under the BHS standard, respectively, even with additional electrocardiogram (ECG) input. Nevertheless, our model achieved grade A for both SBP and DBP, recording the lowest MAEs. Though the BP measurement performance of Hsu et al. (2020) is close to ours, their model cannot estimate BP values other than SBP and DBP. Ibtehaz and Rahman (2020) proposed to predict a whole BP waveform from a raw PPG signal, but their measurement accuracy was somewhat insufficient for the BHS standard. Most importantly, none of the existing works provides the reliability of its predictions, which may play a critical role in real-world deployments. These facts strongly support that our model achieves cutting-edge performance with attractive features and paves a new way for continuous monitoring of BP waveforms.

5 Conclusion and Future Work

In this paper, we have introduced an elaborately designed framework for monitoring a continuous waveform of BP using the raw PPG signal as input. We experimentally demonstrated that the proposed model is capable of measuring BP values with high accuracy and satisfies the BHS and AAMI standards even in SBP measurement. To go further, we proposed two training techniques to adequately apply the DER framework to the BP measurement task. These techniques ensure a strong correlation between the model accuracy and the estimated reliability, which is experimentally demonstrated through uncertainty visualization, BP measurement, and BP classification on the high-reliability subset of the test set. We believe that the estimated reliability can be utilized to help determine whether the measured BP values should be trusted, or to provide more informative training data for optimizing robust models. We also expect that the proposed approach can be employed for other safety-critical applications as well.

Although we presented a state-of-the-art model for BP estimation, there is still room for improvement. First, the proposed approach may yield biased uncertainty estimates due to its two-stage training procedure. To enhance the approach, one could combine weight initialization and temperature scaling, or develop an end-to-end model for exact likelihood estimation. Another issue is that the threshold for model reliability was determined rather heuristically in this work. We believe the decision boundaries can be derived theoretically to produce more convincing reliability estimates and improved results. To develop safe medical applications, we plan to conduct extensive research addressing these issues.
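To make the temperature-scaling direction concrete, the sketch below shows one generic post-hoc recipe in the spirit of Guo et al. (2017): fit a single scalar that rescales predictive variances so as to minimize Gaussian negative log-likelihood on a held-out set. This is an assumed, simplified formulation (grid search over a scalar), not the calibration procedure used in the paper.

```python
import numpy as np

def fit_temperature(mu, var, y, grid=np.linspace(0.1, 10.0, 200)):
    """Select a scalar temperature T (var -> T * var) minimizing the
    Gaussian negative log-likelihood on held-out targets y, given
    predicted means mu and variances var."""
    def nll(t):
        v = t * var
        return np.mean(0.5 * (np.log(2.0 * np.pi * v) + (y - mu) ** 2 / v))
    return min(grid, key=nll)
```

The closed-form optimum for a shared variance is T* = mean((y - mu)^2 / var), so the grid search should land near that value; a finer grid or gradient-based fit would tighten it.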


  • J. Allen (2007) Photoplethysmography and its application in clinical physiological measurement. Physiological measurement 28 (3), pp. R1. Cited by: §1.
  • A. Amini, W. Schwarting, A. Soleimany, and D. Rus (2019) Deep evidential regression. arXiv preprint arXiv:1910.02600. Cited by: §1, §3.3, §3.3.
  • D. Castaneda, A. Esparza, M. Ghamari, C. Soltanpur, and H. Nazeran (2018) A review on wearable photoplethysmography sensors and their potential future applications in health care. International journal of biosensors & bioelectronics 4 (4), pp. 195. Cited by: §1.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §2.
  • Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier (2017) Language modeling with gated convolutional networks. In International conference on machine learning, pp. 933–941. Cited by: §3.1.
  • A. Défossez, N. Usunier, L. Bottou, and F. Bach (2019) Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254. Cited by: §1, §3.1.
  • M. Elgendi (2012) On the analysis of fingertip photoplethysmogram signals. Current cardiology reviews 8 (1), pp. 14–25. Cited by: §1.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. Cited by: §3.3.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.
  • X. He, R. A. Goubran, and X. P. Liu (2013) Evaluation of the correlation between blood pressure and pulse transit time. In 2013 IEEE International Symposium on Medical Measurements and Applications (MeMeA), pp. 17–20. Cited by: §1.
  • S. W. Holm, L. L. Cunningham, E. Bensadoun, and M. J. Madsen (2006) Hypertension: classification, pathophysiology, and management during outpatient sedation and local anesthesia. Journal of oral and maxillofacial surgery 64 (1), pp. 111–121. Cited by: §4.2.3.
  • Y. Hsu, Y. Li, C. Chang, and L. N. Harfiya (2020) Generalized deep neural network model for cuffless blood pressure estimation with photoplethysmogram signal only. Sensors 20 (19), pp. 5668. Cited by: §1, §2, §2, §3.1, §4.3.
  • N. Ibtehaz and M. S. Rahman (2020) PPG2ABP: translating photoplethysmogram (PPG) signals to arterial blood pressure (ABP) waveforms using fully convolutional neural networks. arXiv preprint arXiv:2005.01669. Cited by: §2, §3.1, §4.1.2, §4.3, §4.
  • S. H. Investigators (2011) Catheter-based renal sympathetic denervation for resistant hypertension: durability of blood pressure reduction out to 24 months. Hypertension 57 (5), pp. 911–917. Cited by: §1.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.1.
  • M. Kachuee, M. M. Kiani, H. Mohammadzade, and M. Shabany (2015) Cuff-less high-accuracy calibration-free blood pressure estimation using pulse transit time. In 2015 IEEE international symposium on circuits and systems (ISCAS), pp. 1006–1009. Cited by: §4.
  • M. Kachuee, M. M. Kiani, H. Mohammadzade, and M. Shabany (2016) Cuffless blood pressure estimation algorithms for continuous health-care monitoring. IEEE Transactions on Biomedical Engineering 64 (4), pp. 859–869. Cited by: §1, §2, §3.1, §4.3.
  • A. R. Kavsaoğlu, K. Polat, and M. Hariharan (2015) Non-invasive prediction of hemoglobin level using machine learning techniques with the ppg signal’s characteristics features. Applied Soft Computing 37, pp. 983–991. Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • Y. Li, L. N. Harfiya, K. Purwandari, and Y. Lin (2020) Real-time cuffless continuous blood pressure estimation using deep learning model. Sensors 20 (19), pp. 5606. Cited by: §4.3.
  • Y. Liang, Z. Chen, R. Ward, and M. Elgendi (2018) Photoplethysmography and deep learning: enhancing hypertension risk stratification. Biosensors 8 (4), pp. 101. Cited by: §1.
  • S. S. Mousavi, M. Firouzmand, M. Charmi, M. Hemmati, M. Moghadam, and Y. Ghorbani (2019) Blood pressure estimation from appropriate and inappropriate ppg signals using a whole-based method. Biomedical Signal Processing and Control 47, pp. 196–206. Cited by: §3.1, §4.1.2.
  • K. P. Murphy (2007) Conjugate Bayesian analysis of the Gaussian distribution. def 1 (22), pp. 16. Cited by: §3.3.
  • E. O’Brien, J. Petrie, W. Littler, M. de Swiet, P. L. Padfield, D. Altman, M. Bland, A. Coats, N. Atkins, et al. (1993) The british hypertension society protocol for the evaluation of blood pressure measuring devices. J hypertens 11 (Suppl 2), pp. S43–S62. Cited by: §4.1.2.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §2, §3.1.
  • M. Saeed, M. Villarroel, A. T. Reisner, G. Clifford, L. Lehman, G. Moody, T. Heldt, T. H. Kyaw, B. Moody, and R. G. Mark (2011) Multiparameter intelligent monitoring in intensive care ii (mimic-ii): a public-access intensive care unit database. Critical care medicine 39 (5), pp. 952. Cited by: §4.
  • T. Salimans and D. P. Kingma (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868. Cited by: §3.1.
  • J. Seo, S. J. Pietrangelo, H. Lee, and C. G. Sodini (2015) Noninvasive arterial blood pressure waveform monitoring using two-element ultrasound system. IEEE transactions on ultrasonics, ferroelectrics, and frequency control 62 (4), pp. 776–784. Cited by: §3.1.
  • G. Slapničar, N. Mlakar, and M. Luštrek (2019) Blood pressure estimation from photoplethysmogram using a spectro-temporal deep neural network. Sensors 19 (15), pp. 3420. Cited by: §1, §2, §2.
  • G. A. Van Montfrans (2001) Oscillometric blood pressure measurement: progress and problems. Blood pressure monitoring 6 (6), pp. 287–290. Cited by: §1.
  • World Health Organization (2020) World health statistics 2020: monitoring health for the SDGs, sustainable development goals, Geneva. Cited by: §1.