A Supervised Learning Approach for Robust Health Monitoring using Face Videos

by   Mayank Gupta, et al.
Purdue University

Monitoring of cardiovascular activity is highly desired and can enable novel applications in diagnosing potential cardiovascular diseases and maintaining an individual's well-being. Currently, such vital signs are measured using intrusive contact devices such as an electrocardiogram (ECG), chest straps, and pulse oximeters that require the patient or a health provider to operate them manually. Non-contact, device-free human sensing methods can eliminate the need for specialized heart and blood pressure monitoring equipment. Non-contact methods have additional advantages: they scale to any environment where video can be captured, support continuous measurement, and can be used on patients with varying levels of dexterity and independence, from people with physical impairments to infants (e.g., via a baby camera). In this paper, we use a non-contact method that only requires face videos recorded with commercially available webcams. These videos are exploited to predict health attributes such as pulse rate and variance in pulse rate. The proposed approach detects the face in each frame of the video using facial landmarks, then applies supervised learning with deep neural networks to train the prediction model. The videos capture subjects performing different physical activities that result in varying cardiovascular responses. The proposed method does not require training data from every individual, so predictions can be obtained for new individuals for whom there is no prior data; this is critical for generalization. The approach was also evaluated on a dataset of people of different ethnicities. The proposed approach had less than a 4.6% error in predicting the pulse rate.




1. Introduction

Regular and non-invasive measurement of vital physiological attributes such as pulse rate (PR), pulse rate variability (PRV), and blood pressure (BP) is important due to their fundamental role in tracking one's fitness level, diagnosing cardiovascular diseases, and monitoring well-being. In office and home environments, passive non-contact measurement is essential for monitoring warning signs of cardiovascular disease, stress, and anxiety. This paper explores the use of facial features from videos to predict these vital health attributes.

Currently, the gold-standard techniques for measuring such vital health attributes use intrusive contact devices such as an electrocardiogram (ECG), chest straps, and pulse oximeters. Traditionally, the ECG was used extensively for such measurements, but the trend has recently shifted toward pulse oximeters because of their low cost.

Although pulse oximeters are easy to use, they have limitations for frequent measurements. First, they require the purchase of equipment and need either the health provider or the user to perform the measurements manually. Second, the device must be carried wherever the user goes, limiting its use. Third, the finger and earlobe clip-ons may not fit well on every individual due to varying finger and earlobe sizes, and improper fit can lead to estimation errors (haynes2007ear). Fourth, clip-ons may be uncomfortable during long use. Thus, this paper considers a non-contact approach in which passive video of the face is used to estimate health metrics.

Monitoring of health parameters using non-contact methods, such as videos from commercially available cameras, has recently been considered in (verkruysse2008remote; poh2011advancements; sun2011motion; kumar2015distanceppg), showing that the photoplethysmogram (PPG) signal can be extracted from videos of the face. These techniques require no dedicated light source, and a low-cost digital camera suffices. Non-contact measurement using camera video has many applications, including measuring the health parameters of people in offices and on shop floors, and of newborn infants in hospitals where contact probes may not be practical. Non-contact methods could also replace the contact sensors currently deployed on treadmills for pulse rate measurement. In these works, the PPG signal is extracted for each individual, so the coefficients of the video features that yield the PPG signal depend on the individual. In contrast, we do not use individual characteristics in the prediction. The proposed method can thus predict health metrics for an individual for whom no training sample has been collected, making our methodology robust. The non-contact methodology has the additional advantages of being scalable and portable, since cameras are ubiquitous.

The proposed approach has two key steps. The first step considers capturing the video and extracting the face from the video. The features corresponding to the face in each frame are obtained. The second step includes training a neural network to learn the health parameters from the above obtained features. Our results obtained from the deep learning model has a mean absolute percentage error of 4.6% for predicting pulse rate. The appendix contains initial results on predicting indicators of the variance in pulse rate.

2. Related Work

The authors in (humphreys2007noncontact) showed that it is possible to extract the PPG signal from video using a complementary metal-oxide-semiconductor camera by illuminating a region of tissue with external light-emitting diodes at dual wavelengths (760 nm and 880 nm). The authors of (verkruysse2008remote) then demonstrated that the PPG signal can be estimated using only ambient light as the illumination source along with a simple digital camera. Further, in (poh2011advancements), the PPG waveform was estimated from videos recorded with a low-cost webcam: the red, green, and blue channels of the images were decomposed into independent sources using independent component analysis, and one of these sources was selected to estimate the PPG and, from it, HR and HRV. All of these works showed the possibility of extracting PPG signals from videos and demonstrated the similarity of the extracted signal to one obtained with a contact device. The authors in (10.1109/CVPR.2013.440) further showed that heart rate can also be extracted from head features, by capturing the subtle head movements caused by blood flow.

The authors of (kumar2015distanceppg) proposed a methodology that overcomes the challenge of extracting PPG for people with darker skin tones, as well as the challenges posed by slight movement and low lighting during video recording. In their method, the PPG signal is extracted from different regions of the face, and the regional signals are combined via a weighted average whose weights differ across people depending on skin color.

Other works (6523142; 6909939; 7410772; 7412627) have introduced methodologies to make pulse rate estimation robust to illumination variation and subject motion. The paper (6523142) introduces a chrominance-based method to reduce the effect of motion on pulse rate estimation. The authors of (6909939) used face tracking and normalized least-squares adaptive filtering to counter variations due to illumination and subject movement. The paper (7410772) addresses subject movement by choosing rectangular regions of interest (ROIs) on the face relative to facial landmarks; the landmarks are tracked through the video using the pose-free facial landmark fitting tracker of (yu2016face), and illumination noise is then removed to extract a clean PPG signal for pulse rate estimation.

Recently, the use of machine learning for predicting health parameters has gained attention. The paper (osman2015supervised) used a supervised learning methodology to predict pulse rate from videos taken with an off-the-shelf camera, demonstrating the feasibility of machine learning methods for pulse rate estimation. However, our method outperforms their results when the root mean squared error of the predicted pulse rate is compared. The authors in (hsu2017deep) proposed a deep learning methodology to predict pulse rate from facial videos: a convolutional neural network (CNN) was trained on images generated by applying the Short-Time Fourier Transform (STFT) to the R, G, and B channels of facial regions of interest. Both (osman2015supervised) and (hsu2017deep) predicted only pulse rate; we extend this line of work by also predicting the variance in the pulse rate measurements.

All of the related work discussed above uses filtering and digital signal processing to extract a PPG signal from the video, which is then used to estimate PR and PRV. The method proposed in (kumar2015distanceppg) is person-dependent, since its weights differ for people with different skin tones. In contrast, we propose a deep learning model that predicts PR independently of the individual being monitored. Thus, the model works even when no prior training data exists for that individual, making our approach robust.

3. Data Collection

We designed our own experiment to collect the data for training the model. Twenty healthy volunteers participated in this study. The participants were recruited from a university population through an email including a description of the study. The study was reviewed by the university's Institutional Review Board, and all participants provided informed consent. The details of our experiment are given below.

3.1. Set-Up

To predict the vital health metrics, we used face video of the person. The video was obtained using the 5 MP front-facing Hello face-authentication camera (1080p HD) of a Microsoft Surface Book, recording at 30 frames per second. The camera captures red, green, and blue color channels. The authors of (verkruysse2008remote) showed that the green channel of the video outperforms the blue and red channels in estimating health parameters; therefore, we also utilize the green channel. The features obtained from the video are used to predict the health metrics.

To obtain training labels, the true value of PR was calculated using a contact measurement device, the Shimmer3 GSR+, which records the ground-truth PPG signal. We chose the earlobe as the position for recording PPG since it is close to the face, which is also used for the video recordings. Subjects were asked to sit still for a 50 s video, facing the camera at a distance of approximately 0.5 m, while the PPG signal was recorded simultaneously through the Shimmer3 GSR+ device. The experimental set-up is shown in Figure 4 in the appendix.

Figure 1.

The steps followed for feature extraction from each frame of the video. (a) The actual image (one of many frames) from the video captured during the experiment. (b) The detected and aligned face obtained using DeepFace, along with landmark points. (c) The face is cropped using the landmark points to retain only the required facial features. (d) Each frame is downsampled to a 20x20 image.

3.2. Experiments

The study involved people of different skin colors and ethnicities. For each subject, the measurements were performed after different activity levels, thus providing a variation in the heart rate of each subject. The different activity levels at which the measurements were collected were: Rest Position, Brisk Walk, and Exercise.

Rest Position: The first experiment was conducted with each participant at rest. Each subject was asked to relax and sit in front of the camera. The video and Shimmer device recordings were collected simultaneously.

Brisk Walk: The next experiment involved data collection from the same participants after they were asked to do a brisk walk of 0.25 miles at a speed of 3-5 mph on the treadmill. The video and Shimmer recordings were captured immediately after the subject completed the brisk walk.

Exercise: The last experiment involved more challenging physical tasks. All subjects were asked to perform as many push-ups or sit-ups as they could, exerting themselves to full capacity. This activity was designed to elicit a high pulse rate, since the individual was working out at full capacity.

The mean pulse rates for the rest, walk, and exercise conditions were 72.9, 79.6, and 98.5 bpm, respectively. PPG and facial videos were recorded immediately after each activity to minimize recovery effects on the physiological data. Each subject was given a 10-minute rest before each activity to recover prior to the next one.

We acknowledge that the heart rate will change dynamically during the video capture; however, the purpose of this study was to compare device-free sensing to the gold-standard continuous measurement. Successfully capturing these dynamic behaviors would further demonstrate the technique's promise for addressing the dynamic changes commonly seen in the real world.

4. Proposed Approach

The methodology adopted for estimating the health parameters is twofold. First, the videos of subjects captured under different conditions were processed to extract the face features. The second step involved training the deep learning model using the face features as predictors and the actual values of PR as the response variable. The detailed steps are explained below.

4.1. Video Processing

The video was recorded for 50 seconds for each subject under each activity. The video was broken into frames, where each frame was composed of red, green, and blue color bands. We utilized the DeepFace (taigman2014deepface) algorithm for face recognition. The DeepFace algorithm was developed by researchers at Facebook and achieved an accuracy of 97.35% on the Labeled Faces in the Wild (LFW) dataset, reducing the error of the then state of the art (huang2012learning; sun2013deep; cao2013practical; chen2013blessing) by more than 27%. DeepFace uses a nine-layer deep neural network and was trained on a large facial dataset of four million images belonging to more than 4,000 identities.

The first step in video processing involved detecting the human face in each frame of the video. The detected face was then aligned automatically by DeepFace using its 3D alignment method (taigman2014deepface). The aligned face was cropped from the image using the landmark points on the face shown in Figure 1. The image was cropped to the facial features, removing extraneous pixels. We were careful to retain the forehead, since it contains the most information about blood perfusion in the arteries (kumar2015distanceppg). The cropped images were used for training the deep learning model. The stages of video processing are shown in Figure 1.

The extraction of the "right" features is important, as it plays a significant role in training a neural network. Choosing a subset of features from the available data reduces redundancy in the input to the neural network and subsequently improves performance. Therefore, we down-sample each cropped image to a 20x20 image and extract 400 pixel intensity values from each frame, as shown in Figure 1(d).
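The paper does not specify how the cropped face is downsampled; a minimal sketch, assuming simple block averaging over a single (green) channel, could look like this (the 180x140 input size and the `downsample_to_20x20` helper are illustrative, not from the paper):

```python
import numpy as np

def downsample_to_20x20(face):
    """Downsample a cropped (H, W) green-channel face image to 20x20 by
    averaging over equal-sized blocks, then flatten to 400 features."""
    h, w = face.shape
    # Trim so height and width divide evenly into 20 bins.
    face = face[:h - h % 20, :w - w % 20]
    h, w = face.shape
    small = face.reshape(20, h // 20, 20, w // 20).mean(axis=(1, 3))
    # Normalize pixel intensities to [0, 1] before training, as in the text.
    return (small / 255.0).ravel()  # 400 features per frame

# Example on a synthetic 180x140 "cropped face" frame
frame = np.random.randint(0, 256, size=(180, 140)).astype(float)
features = downsample_to_20x20(frame)
print(features.shape)  # (400,)
```

Averaging (rather than naive subsampling) is one plausible way to retain the low-frequency intensity changes that carry the pulse information.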

4.2. Model Training

We trained a deep learning model using TensorFlow to estimate the health metrics. The model was a fully connected neural network with three hidden layers; the detailed architecture is shown in Figure 5 (in the appendix). The network used rectified linear units (ReLU), whose activation is given as f(x) = max(0, x), where x is the input to the neuron. The choice of network architecture and activation function was based on the minimum value of the loss function. The network was trained using the backpropagation algorithm with mean squared error as the loss function. Batch normalization was used in each hidden layer (DBLP:journals/corr/IoffeS15). Drop-out is one of the simplest ways to avoid over-fitting a neural network (srivastava2014dropout); the drop-out rate was set to 30% in all three hidden layers, which helps the network generalize to unseen data. Since the green color band has been shown to be the best source of information about the health parameters (verkruysse2008remote), we utilized the pixel values of the green channel of each frame to train the model. The image from each frame was downsampled to a 20x20 image, so 400 features per frame were used to train the model. The features were normalized to the range [0, 1] to make it easier for the neural network to learn from the data; the downsampling also reduced the computational expense of the model. The actual response value, i.e., PR, was extracted from the PPG signal recorded during the experiment. To extract the actual PR from the PPG, we computed the power spectral density (PSD) of the PPG signal using a fast Fourier transform (FFT) algorithm. The PR was then estimated as PR = 60 * f_max bpm, where f_max is the frequency corresponding to the maximum power in the PSD.
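The PSD-peak estimate of PR described above can be sketched as follows on a synthetic signal; the 128 Hz sampling rate and the 0.7-4 Hz search band are assumptions for illustration, not values stated in the paper:

```python
import numpy as np

np.random.seed(1)
fs = 128.0                      # assumed PPG sampling rate (Hz); not stated in the paper
t = np.arange(0, 50, 1 / fs)    # 50 s recording, as in the experiments

# Synthetic PPG: dominant oscillation at 1.3 Hz (78 bpm) plus noise
ppg = np.sin(2 * np.pi * 1.3 * t) + 0.2 * np.random.randn(t.size)

# Periodogram-style PSD via FFT, restricted to a plausible pulse band
freqs = np.fft.rfftfreq(t.size, d=1 / fs)
psd = np.abs(np.fft.rfft(ppg)) ** 2
band = (freqs >= 0.7) & (freqs <= 4.0)   # 42-240 bpm; an assumed search range
f_max = freqs[band][np.argmax(psd[band])]

pr = 60.0 * f_max  # pulse rate in beats per minute
print(round(pr, 1))  # close to 78 bpm
```

Restricting the peak search to a physiologically plausible band prevents low-frequency drift or harmonics from being mistaken for the pulse.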

5. Prediction Results

Figure 2. Behavior of train loss and test loss for predicting pulse rate.
Figure 3. Scatter plot of the predicted PR value vs. the ground truth PR value. The straight line is the function y = x. The closeness of the points to the line indicates the model accuracy.

The values of PR predicted by our model were compared with the true values calculated from the readings of the contact device, and errors were calculated accordingly. The mean absolute percentage errors were calculated using leave-one-out cross-validation. To be specific, since we had 20 subjects in total, we chose nineteen of the twenty subjects and used all of their observations as the training set; all observations from the remaining subject formed the test set. We iterated this procedure over each subject to ensure the model was tested on every individual. Since the test-set data come from a subject entirely unseen during training, this measures how the model predicts for subjects it has never seen before, regardless of skin tone, race, or facial features.
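The subject-wise leave-one-out split described above can be sketched as follows (the `leave_one_subject_out` helper is illustrative, not from the paper):

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (train_idx, test_idx) pairs in which the test fold holds every
    observation from exactly one subject, as described above."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        yield np.where(~test_mask)[0], np.where(test_mask)[0]

# Toy example: six observations from three subjects (two activities each)
ids = [1, 1, 2, 2, 3, 3]
folds = list(leave_one_subject_out(ids))
print(len(folds))  # 3 folds, one per held-out subject
```

Grouping the split by subject, rather than by observation, is what guarantees the test subject is truly unseen.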

We then calculated the mean absolute percentage error (MAPE) and root mean squared error (RMSE) for our predictions. The mean of the errors over all 20 subjects was 4.6%, and the RMSE on the test set was 4.39. The authors of (osman2015supervised) reported an RMSE of 9.52 on their test set for predicting PR, meaning our model outperforms theirs, reducing the RMSE by 53%.
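For reference, the MAPE and RMSE metrics used above can be computed as follows (the numbers in the example are illustrative, not the paper's results):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * float(np.mean(np.abs((y_true - y_pred) / y_true)))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only, not the paper's measurements
truth = [72.9, 79.6, 98.5]
preds = [70.0, 82.0, 95.0]
print(round(mape(truth, preds), 2), round(rmse(truth, preds), 2))
```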

Figure 2 shows how the test and train losses vary with the number of iterations run by our network. We used mean squared error as the loss function. The number of iterations was chosen based on the behavior of the test and train losses: too few iterations lead to under-fitting, where both train and test errors are high, while too many lead to over-fitting. To avoid both scenarios, we ran our model for 170 iterations.
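Choosing the iteration count from the loss curves, as described above, amounts to stopping near the minimum of the test loss; a trivial sketch with a made-up loss curve:

```python
import numpy as np

def best_iteration(test_loss):
    """Return the 1-indexed iteration at which the test loss is lowest,
    i.e., the point just before over-fitting sets in."""
    return int(np.argmin(test_loss)) + 1

# Toy loss curve: falls, bottoms out, then rises as the model over-fits
test_loss = [9.0, 6.0, 4.5, 4.0, 4.2, 4.8, 5.5]
print(best_iteration(test_loss))  # 4
```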

Figure 3 shows a scatter plot of the predicted vs. actual values of pulse rate. The straight line shown is the 45-degree line (y = x), and the closeness of the scatter points to this line indicates the high accuracy of our model.

6. Conclusions

Monitoring health parameters like PR and PRV is important for keeping a check on an individual's health and spotting potential cardiovascular diseases. Recently, device-free methods such as camera videos have become preferred over contact methods like pulse oximeters for such measurements. In this paper, we proposed a two-step methodology in which a supervised learning technique is leveraged to predict the pulse rate. The physiological parameters are remotely predicted using video of human faces captured with a laptop camera. Subtle changes in face pixel intensities across the frames of the video are exploited to train a neural network with three hidden layers. Experimental evaluations were performed on twenty subjects, and the proposed approach demonstrates significant improvement over the baselines, validating its potential for application in real scenarios.


  • (1) J. M. Haynes, “The ear as an alternative site for a pulse oximeter finger clip sensor,” Respiratory care, vol. 52, no. 6, pp. 727–729, 2007.
  • (2) W. Verkruysse, L. O. Svaasand, and J. S. Nelson, “Remote plethysmographic imaging using ambient light.” Optics express, vol. 16, no. 26, pp. 21 434–21 445, 2008.
  • (3) M.-Z. Poh, D. J. McDuff, and R. W. Picard, “Advancements in noncontact, multiparameter physiological measurements using a webcam,” IEEE transactions on biomedical engineering, vol. 58, no. 1, pp. 7–11, 2011.
  • (4) Y. Sun, S. Hu, V. Azorin-Peris, S. Greenwald, J. Chambers, and Y. Zhu, “Motion-compensated noncontact imaging photoplethysmography to monitor cardiorespiratory status during exercise,” Journal of biomedical optics, vol. 16, no. 7, pp. 077 010–077 010, 2011.
  • (5) M. Kumar, A. Veeraraghavan, and A. Sabharwal, “Distanceppg: Robust non-contact vital signs monitoring using a camera,” Biomedical optics express, vol. 6, no. 5, pp. 1565–1588, 2015.
  • (6) K. Humphreys, T. Ward, and C. Markham, “Noncontact simultaneous dual wavelength photoplethysmography: a further step toward noncontact pulse oximetry,” Review of scientific instruments, vol. 78, no. 4, p. 044304, 2007.
  • (7) G. Balakrishnan, F. Durand, and J. Guttag, “Detecting pulse from head motions in video,” in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, ser. CVPR ’13. USA: IEEE Computer Society, 2013, pp. 3430–3437. [Online]. Available: https://doi.org/10.1109/CVPR.2013.440
  • (8) G. de Haan and V. Jeanne, “Robust pulse rate from chrominance-based rppg,” IEEE Transactions on Biomedical Engineering, vol. 60, no. 10, pp. 2878–2886, Oct 2013.
  • (9) X. Li, J. Chen, G. Zhao, and M. Pietikäinen, “Remote heart rate measurement from face videos under realistic situations,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp. 4264–4271.
  • (10) A. Lam and Y. Kuno, “Robust heart rate measurement from video using select random patches,” in 2015 IEEE International Conference on Computer Vision (ICCV), Dec 2015, pp. 3640–3648.
  • (11) M. A. Haque, R. Irani, K. Nasrollahi, and T. B. Moeslund, “Heartbeat rate measurement from facial video,” IEEE Intelligent Systems, vol. 31, no. 3, pp. 40–48, May 2016.
  • (12) X. Yu, J. Huang, S. Zhang, and D. N. Metaxas, “Face landmark fitting via optimized part mixtures and cascaded deformable model,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 11, pp. 2212–2226, 2016.
  • (13) A. Osman, J. Turcot, and R. El Kaliouby, “Supervised learning approach to remote heart rate estimation from facial videos,” in Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, vol. 1.   IEEE, 2015, pp. 1–6.
  • (14) G.-S. Hsu, A. Ambikapathi, and M.-S. Chen, “Deep learning with time-frequency representation for pulse estimation from facial videos,” in Biometrics (IJCB), 2017 IEEE International Joint Conference on.   IEEE, 2017, pp. 383–389.
  • (15) Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1701–1708.
  • (16) G. B. Huang, H. Lee, and E. Learned-Miller, “Learning hierarchical representations for face verification with convolutional deep belief networks,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2518–2525.
  • (17) Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade for facial point detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 3476–3483.
  • (18) X. Cao, D. Wipf, F. Wen, G. Duan, and J. Sun, “A practical transfer learning algorithm for face verification,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 3208–3215.
  • (19) D. Chen, X. Cao, F. Wen, and J. Sun, “Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3025–3032.
  • (20) S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
  • (21) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

Appendix A Experimental Setup and Neural Network

Figure 4 depicts the experimental setup, and Fig. 5 depicts the fully connected neural network architecture used.

Figure 4. Experimental set-up. The contact probe of the Shimmer3 GSR+ is attached to the earlobe and laptop camera is placed around 0.5 m away from the subject.
Figure 5. The architecture of a fully connected neural network with three hidden layers
Figure 6. Scatter plot of the predicted LF value vs. the ground truth LF value.

Appendix B Pulse Rate Variability

Figure 7. Training loss and test loss for predicting Low Frequency (LF) component of PRV.
Figure 8. Training loss and test loss for predicting High Frequency (HF) component of PRV.
Figure 9. Scatter plot of the predicted HF value vs. the ground truth HF value.

Pulse rate variability is the variation in the time interval between two expansions of the artery, usually measured as the variation in the beat-to-beat interval. This metric is considered a non-invasive measure of autonomic nervous system (ANS) activity. The autonomic nervous system has two branches, the sympathetic nervous system (SNS) and the parasympathetic nervous system (PNS), and is regulated by the hypothalamus. Its functions include control of respiration, cardiac regulation, vasomotor activity, and certain reflex actions like coughing, sneezing, swallowing, and vomiting. The high-frequency (HF) component of PRV is affected by efferent vagal (parasympathetic) activity and decreases under acute time pressure, emotional strain, mental stress, and elevated anxiety. The low-frequency (LF) component of PRV is known to contain both sympathetic and vagal influences. Thus, frequent and accurate measurement of PR and PRV can provide critical signs of one's well-being, and abnormalities can point to potential health problems.

In this study, the LF and HF components of PRV were roughly estimated by computing the area under the PSD curve within specific frequency ranges: 0.04-0.15 Hz for LF and 0.15-0.4 Hz for HF. We used these rough estimates of LF and HF as response variables, and trained separate models to predict the HF and LF components.
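The LF/HF band-power estimates described above can be sketched as follows; the periodogram normalization, the 4 Hz rate of the interval series, and the synthetic signal are assumptions for illustration:

```python
import numpy as np

def band_power(signal, fs, f_lo, f_hi):
    """Approximate the area under a plain FFT periodogram between
    f_lo and f_hi (Hz) by summing PSD bins times the bin width."""
    freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / (fs * signal.size)
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return float(psd[mask].sum() * (freqs[1] - freqs[0]))

rng = np.random.default_rng(0)
fs = 4.0                        # assumed resampling rate of the interval series (Hz)
t = np.arange(0, 300, 1 / fs)
# Synthetic beat-interval series with a strong 0.1 Hz (LF) oscillation
x = np.sin(2 * np.pi * 0.1 * t) + 0.1 * rng.standard_normal(t.size)

lf = band_power(x, fs, 0.04, 0.15)   # LF band
hf = band_power(x, fs, 0.15, 0.40)   # HF band
print(lf > hf)  # True: LF dominates this synthetic series
```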

We tested our model using the leave-one-out cross-validation method, as for the pulse rate, so the model makes predictions for users not seen in the training data. We ran the model for 200 iterations. The test and training losses over iterations for the LF and HF components of the PPG signal are depicted in Fig. 7 and Fig. 8, respectively.

For the LF component, the mean absolute percentage error on the test set is 4.58% and the root mean squared error is 3.49. For the HF component, the MAPE is 10.2% and the RMSE is 4.96 on test data. The mean RMSE for our model is 4.3, whereas the mean RMSE over people of different skin colors in (kumar2015distanceppg) is 25.3, an 83% decrease in RMSE. Figures 6 and 9 compare the actual and predicted values for the two components, respectively. We note that the datasets in (kumar2015distanceppg) and (osman2015supervised) differ from the dataset used in this paper, so the comparison is not on identical data; since the code and data of the prior work are not public, the comparison is made on aggregate prediction accuracy.