Regular and non-invasive measurement of vital physiological attributes such as pulse rate (PR), pulse rate variability (PRV), and blood pressure (BP) is important because of their fundamental role in tracking fitness, diagnosing cardiovascular disease, and monitoring well-being. In office and home environments, passive non-contact measurement is essential for monitoring warning signs of cardiovascular disease, stress, and anxiety. This paper explores the use of facial features extracted from video to predict these vital health attributes.
Currently, the gold-standard techniques for measuring such vital health attributes rely on intrusive contact devices such as electrocardiograms (ECGs), chest straps, and pulse oximeters. Traditionally, the ECG was used extensively for such measurements, but the recent trend has shifted toward pulse oximeters because of their low cost.
Although pulse oximeters are easy to use, they have limitations for frequent measurement. First, they require purchasing equipment, and either a health provider or the user must perform the measurements manually. Second, the device must be carried wherever the user goes, limiting its use. Third, finger and earlobe clip-ons may not fit well on every individual because finger and earlobe sizes vary, and an improper fit can lead to estimation errors (haynes2007ear, ). Fourth, clip-ons can be uncomfortable during prolonged use. This paper therefore considers a non-contact approach in which passive video of the face is used to estimate health metrics.
Monitoring health parameters using non-contact methods such as video from commercially available cameras has recently been considered in (verkruysse2008remote, ; poh2011advancements, ; sun2011motion, ; kumar2015distanceppg, ), showing that the photoplethysmogram (PPG) signal can be extracted from videos of the face. These techniques require no dedicated light source, and a low-cost digital camera suffices. Non-contact measurement using camera video has many applications, including determining the health parameters of people working in an office or on a shop floor, and of newborn infants in the hospital, where contact probes may not be feasible. It can also replace current contact methods deployed on treadmills for measuring pulse rate. In these works, the PPG signal is extracted per individual, so the coefficients of the video features that yield the PPG signal depend on the individual. In contrast, we do not use individual characteristics in the prediction. The proposed method can therefore predict health metrics for an individual for whom no training sample has been collected, making our methodology robust. Non-contact measurement is also scalable and portable, since cameras are ubiquitous.
The proposed approach has two key steps. The first captures the video, extracts the face, and obtains features corresponding to the face in each frame. The second trains a neural network to learn the health parameters from these features. Our deep learning model achieves a mean absolute percentage error of 4.6% in predicting pulse rate. The appendix contains initial results on predicting indicators of the variance in pulse rate.
2. Related Work
The authors of (humphreys2007noncontact, ) showed that the PPG signal can be extracted from video recorded with a complementary metal-oxide-semiconductor camera by illuminating a region of tissue with external dual-wavelength (760 nm and 880 nm) light-emitting diodes. The authors of (verkruysse2008remote, ) then demonstrated that the PPG signal can be estimated using only ambient light as the source of illumination along with a simple digital camera. Further, in (poh2011advancements, ), the PPG waveform was estimated from videos recorded with a low-cost webcam: the red, green, and blue channels of the images were decomposed into independent sources using independent component analysis, and one of the sources was selected to estimate PPG and, in turn, HR and HRV. All these works showed that PPG signals can be extracted from video and are similar to the signals obtained with a contact device. The authors of (10.1109/CVPR.2013.440, ) further showed that heart rate can also be extracted from head features, by capturing the subtle head movements caused by blood flow.
The authors of (kumar2015distanceppg, ) proposed a methodology that overcomes the challenge of extracting PPG for people with darker skin tones, and also addressed slight subject movement and low lighting during video recording. In their method, the PPG signal is extracted from different regions of the face, and the regional signals are combined as a weighted average, with weights that differ across people depending on skin color.
Other works (6523142, ; 6909939, ; 7410772, ; 7412627, ) have introduced methodologies that make pulse-rate estimation robust to illumination variation and subject motion. The paper (6523142, ) introduces a chrominance-based method to reduce the effect of motion on pulse-rate estimation. The authors of (6909939, ) combined face tracking with normalized least-squares adaptive filtering to counter variations due to illumination and subject movement. The paper (7410772, ) handles subject movement by choosing rectangular ROIs on the face relative to facial landmarks, which are tracked through the video with the pose-free facial-landmark fitting tracker of (yu2016face, ); illumination noise is then removed to extract a clean PPG signal for estimating pulse rate.
Recently, the use of machine learning to predict health parameters has gained attention. The paper (osman2015supervised, ) used a supervised learning methodology to predict pulse rate from videos taken with any off-the-shelf camera, showing that machine learning methods can estimate pulse rate; our method outperforms theirs when the root mean squared error of the predicted pulse rate is compared. The authors of (hsu2017deep, ) proposed a deep learning methodology to predict pulse rate from facial videos, training a convolutional neural network (CNN) on images generated by applying the Short-Time Fourier Transform (STFT) to the R, G, and B channels of facial regions of interest. The authors of (osman2015supervised, ; hsu2017deep, ) predicted only pulse rate, whereas we extend our work to predicting variance in pulse-rate measurements as well.
All the related work discussed above uses filtering and digital signal processing to extract the PPG signal from video, from which PR and PRV are then estimated. The method proposed in (kumar2015distanceppg, ) is person-dependent, since the weights differ for people with different skin tones. In contrast, we propose a deep learning model that predicts PR independently of the individual. The model thus works even when no prior training data exists for that individual, making it robust.
3. Data Collection
We designed our own experiment to collect data for training the model. Twenty healthy volunteers participated in this study. Participants were recruited from a university population through an email that included a description of the study. The study was reviewed by the university's Institutional Review Board, and all participants provided informed consent. The details of our experiment are given below.
To predict the vital health metrics, we used video of the subject's face, recorded at 30 frames per second with the 5 MP front-facing Hello face-authentication camera (1080p HD) of a Microsoft Surface Book. The camera captures red, green, and blue color channels. The authors of (verkruysse2008remote, ) showed that the green channel outperforms the blue and red channels in estimating health parameters, so we also use the green channel. Features obtained from the video are used to predict the health metrics.
To obtain training labels, true values of PR were calculated from the ground-truth PPG signal recorded with a contact device, a Shimmer3 GSR+. We chose the earlobe as the recording position for PPG because of its proximity to the face, which is also the site of our video recordings. Subjects were asked to sit still for a 50 s video, facing the camera at a distance of approximately 0.5 m, while the PPG signal was recorded simultaneously with the Shimmer3 GSR+ device. The experimental set-up is shown in Figure 4 in the appendix.
The study involved people of different skin colors and ethnicities. For each subject, measurements were taken after different activity levels, providing variation in each subject's heart rate. The three activity levels were: rest position, brisk walk, and exercise.
Rest Position: The first experiment was conducted with each participant at rest. Each subject was asked to relax and sit in front of the camera while the video and Shimmer recordings were collected simultaneously.
Brisk Walk: The next experiment collected data from the same participants after they had done a 0.25-mile brisk walk at 3-5 mph on a treadmill. The video and Shimmer recordings were captured immediately after the subject completed the walk.
Exercise: The last experiment involved a more challenging physical task. Each subject was asked to perform as many push-ups or sit-ups as they could, exerting themselves to full capacity. This activity was designed to elicit a high pulse rate.
The mean pulse rates for the rest, walk, and exercise conditions were 72.9, 79.6, and 98.5 bpm, respectively. PPG and facial video were recorded immediately after each activity to minimize recovery effects on the physiological data. Each subject was given a 10-minute rest before each activity to allow recovery before the next one.
We acknowledge that heart rate changes dynamically during video capture; however, the purpose of this study was to compare device-free sensing against the gold-standard continuous measurement. Successfully capturing such dynamic behavior would further demonstrate the technique's promise for the dynamic changes commonly seen in the real world.
4. Proposed Approach
The methodology for estimating health parameters is two-fold. First, the videos of subjects captured under different conditions are processed to extract face features. Second, a deep learning model is trained with the face features as predictors and the measured PR values as the response variable. The detailed steps are explained below.
4.1. Video Processing
The video was recorded for 50 seconds per subject under each activity. Each video was split into frames composed of red, green, and blue color bands. We used the DeepFace (taigman2014deepface, ) algorithm for face recognition. DeepFace, developed by researchers at Facebook, achieved 97.35% accuracy on the Labeled Faces in the Wild (LFW) dataset, reducing the error of the then state of the art (huang2012learning, ; sun2013deep, ; cao2013practical, ; chen2013blessing, ) by more than 27%. DeepFace uses a nine-layer deep neural network and was trained on a large dataset of four million facial images belonging to more than 4,000 identities.
The first step in video processing was the detection of a human face in each frame. The detected face was aligned automatically by DeepFace using its 3D alignment method (taigman2014deepface, ) and cropped from the image using the facial landmark points shown in Figure 1. The image was cropped to the facial features only, removing extra pixels. We were careful to retain the forehead, since it carries the most information about blood perfusion in the arteries (kumar2015distanceppg, ). The cropped images were used to train the deep learning model. The stages of video processing are shown in Figure 1.
Extracting the "right" features is important, as it plays a significant role in training a neural network. Choosing a subset of features from the available data reduces redundancy in the input to the network and thereby improves performance. We therefore down-sample each cropped image to a 20x20 image and extract the 400 pixel-intensity values from each frame, as shown in Figure 1(d).
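The feature-extraction step can be sketched as follows. The paper does not specify the interpolation scheme used for down-sampling, so block-averaging is used here as an illustrative assumption; `downsample_green` is a hypothetical helper name.

```python
import numpy as np

def downsample_green(frame, out=20):
    """Block-average the green channel of an RGB frame to out x out pixels,
    then flatten to a feature vector of out*out values normalized to [0, 1].
    Block-averaging is an assumption; the paper does not name the method."""
    g = frame[:, :, 1].astype(np.float64) / 255.0   # green channel, normalized
    h, w = g.shape
    g = g[: h - h % out, : w - w % out]             # crop so blocks tile evenly
    bh, bw = g.shape[0] // out, g.shape[1] // out
    blocks = g.reshape(out, bh, out, bw)            # group rows/cols into blocks
    return blocks.mean(axis=(1, 3)).ravel()         # mean per block, flatten

# Example on a hypothetical 100x80 cropped-face image
rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(100, 80, 3), dtype=np.uint8)
features = downsample_green(face)
print(features.shape)  # (400,)
```

Each frame thus contributes a 400-dimensional feature vector, matching the input size used in Section 4.2.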
4.2. Model Training
We trained a deep learning model using TensorFlow to estimate the health metrics. The model is a fully connected neural network with three hidden layers; the detailed architecture is shown in Figure 5 in the appendix. The network architecture and activation function were chosen to minimize the loss function. The network was trained with backpropagation using mean squared error as the loss, and batch normalization was applied in each hidden layer (DBLP:journals/corr/IoffeS15, ). Drop-out is one of the simplest ways to avoid over-fitting in neural networks (srivastava2014dropout, ); we set the drop-out rate to 30% in all three hidden layers, which helps the network generalize to unseen data. Since the green color band has been shown to be the best source of information about the health parameters (verkruysse2008remote, ), we used the pixel values of the green channel of each frame to train the model. Each frame was downsampled to a 20x20 image, giving 400 features per frame; downsampling also reduces the computational cost of the model. The features were normalized to the range [0, 1] to make learning easier. The response variable, PR, was extracted from the PPG signal recorded during our experiment: we computed the power spectral density (PSD) of the PPG signal using a fast Fourier transform (FFT) and estimated PR = 60 f_max bpm, where f_max is the frequency (in Hz) corresponding to the maximum power in the PSD.
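The PR-from-PPG step can be sketched in a few lines of NumPy. The 0.7-4 Hz search band (42-240 bpm) is an assumption added here to keep the peak search in a physiologically plausible range; the paper does not state a band.

```python
import numpy as np

def pulse_rate_from_ppg(ppg, fs):
    """Estimate PR (bpm) as 60 * the frequency of the PSD peak.
    The 0.7-4 Hz search band is an assumption, not from the paper."""
    ppg = np.asarray(ppg, dtype=float)
    ppg = ppg - ppg.mean()                      # remove DC so it cannot dominate
    freqs = np.fft.rfftfreq(len(ppg), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(ppg)) ** 2         # periodogram (unnormalized)
    band = (freqs >= 0.7) & (freqs <= 4.0)
    f_max = freqs[band][np.argmax(psd[band])]   # frequency of maximum power
    return 60.0 * f_max

# Synthetic 50 s PPG-like signal at 1.2 Hz (72 bpm), sampled at 100 Hz
fs = 100.0
t = np.arange(0, 50, 1 / fs)
ppg = np.sin(2 * np.pi * 1.2 * t) \
      + 0.1 * np.random.default_rng(1).standard_normal(t.size)
print(pulse_rate_from_ppg(ppg, fs))  # ≈ 72.0
```

With a 50 s window the frequency resolution is fs/N = 0.02 Hz, i.e. about 1.2 bpm, which bounds the quantization error of this estimator.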
5. Prediction Results
The PR values predicted by our model were compared with the true values calculated from the contact-device readings, and the errors were calculated accordingly. Mean absolute percentage errors were computed using leave-one-out cross-validation: with 20 subjects in total, all observations from nineteen subjects formed the training set, and all observations from the held-out subject formed the test set. We repeated this procedure for each subject so that the model was tested on every individual. Because the test subject is entirely unseen during training, this measures how the model predicts for subjects it has never seen before, regardless of skin tone, race, or facial features.
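The subject-wise leave-one-out split described above can be expressed as a small helper; `leave_one_subject_out` is a hypothetical function name for illustration.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (train_idx, test_idx) pairs, holding out every observation
    of exactly one subject per fold."""
    subject_ids = np.asarray(subject_ids)
    for s in np.unique(subject_ids):
        test = np.where(subject_ids == s)[0]
        train = np.where(subject_ids != s)[0]
        yield train, test

# Example: 3 hypothetical subjects with 2 observations each
ids = [0, 0, 1, 1, 2, 2]
folds = list(leave_one_subject_out(ids))
print(len(folds))  # 3
```

Splitting by subject rather than by observation is what guarantees the test subject's frames never leak into training.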
We then calculated the mean absolute percentage error (MAPE) and root mean squared error (RMSE) of our predictions. The mean MAPE over all 20 subjects was 4.6%, and the RMSE on the test set was 4.39. The authors of (osman2015supervised, ) reported a test-set RMSE of 9.52 for PR prediction, so our model outperforms theirs, a 53% reduction in RMSE.
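For reference, the two error metrics are computed as follows; the pulse-rate values in the example are made up for illustration.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative pulse rates in bpm (not data from the study)
actual    = [72.0, 80.0, 100.0]
predicted = [70.0, 82.0,  95.0]
print(round(mape(actual, predicted), 2))  # 3.43
print(round(rmse(actual, predicted), 2))  # 3.32
```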
Figure 2 shows how the test and training losses vary with the number of iterations run by our network; the loss function is mean squared error. The number of iterations was chosen from the behavior of the two losses: too few iterations leads to under-fitting, where both training and test errors are high, while too many leads to over-fitting. To avoid both, we trained the model for 170 iterations.
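The network whose loss curves Figure 2 tracks can be sketched as a forward pass in plain NumPy. The paper specifies 400 inputs, three hidden layers, and 30% drop-out but not the layer widths, so the 64-64-64 sizes below are an illustrative assumption; batch normalization is omitted from this sketch for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w, b, drop=0.3, train=True):
    """Fully connected layer with ReLU and inverted drop-out (rate 30%)."""
    h = np.maximum(0.0, x @ w + b)
    if train and drop > 0:
        mask = rng.random(h.shape) >= drop
        h = h * mask / (1.0 - drop)   # rescale so the expectation is unchanged
    return h

# Hypothetical widths: 400 -> 64 -> 64 -> 64 -> 1 (sizes are assumptions)
sizes = [400, 64, 64, 64]
params = [(rng.normal(0, 0.05, (m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
w_out, b_out = rng.normal(0, 0.05, (64, 1)), np.zeros(1)

x = rng.random((8, 400))              # batch of 8 frames' normalized features
h = x
for w, b in params:
    h = layer(h, w, b)
pr_pred = h @ w_out + b_out           # linear output: predicted PR in bpm
print(pr_pred.shape)  # (8, 1)
```

In the actual experiments this network was built and trained in TensorFlow with backpropagation on the mean squared error loss.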
Figure 3 shows a scatter plot of predicted versus actual pulse rate. The straight line is the 45-degree line (y = x); the closeness of the scatter points to this line indicates the high accuracy of our model.
6. Conclusion
Monitoring health parameters such as PR and PRV is important for tracking an individual's health and spotting potential cardiovascular disease. Recently, device-free methods such as camera video have become preferable to contact methods such as pulse oximeters for such measurement. In this paper, we proposed a two-fold methodology in which a supervised learning technique is leveraged to predict pulse rate. The physiological parameters are predicted remotely from video of the face captured with a laptop camera. Subtle changes in facial pixel intensity across the frames of the video are exploited to train a neural network with three hidden layers. Experimental evaluations on twenty subjects show significant improvement over the baselines, validating the approach's potential for real-world application.
- (1) J. M. Haynes, “The ear as an alternative site for a pulse oximeter finger clip sensor,” Respiratory care, vol. 52, no. 6, pp. 727–729, 2007.
- (2) W. Verkruysse, L. O. Svaasand, and J. S. Nelson, “Remote plethysmographic imaging using ambient light.” Optics express, vol. 16, no. 26, pp. 21 434–21 445, 2008.
- (3) M.-Z. Poh, D. J. McDuff, and R. W. Picard, “Advancements in noncontact, multiparameter physiological measurements using a webcam,” IEEE transactions on biomedical engineering, vol. 58, no. 1, pp. 7–11, 2011.
- (4) Y. Sun, S. Hu, V. Azorin-Peris, S. Greenwald, J. Chambers, and Y. Zhu, “Motion-compensated noncontact imaging photoplethysmography to monitor cardiorespiratory status during exercise,” Journal of biomedical optics, vol. 16, no. 7, pp. 077 010–077 010, 2011.
- (5) M. Kumar, A. Veeraraghavan, and A. Sabharwal, “Distanceppg: Robust non-contact vital signs monitoring using a camera,” Biomedical optics express, vol. 6, no. 5, pp. 1565–1588, 2015.
- (6) K. Humphreys, T. Ward, and C. Markham, “Noncontact simultaneous dual wavelength photoplethysmography: a further step toward noncontact pulse oximetry,” Review of scientific instruments, vol. 78, no. 4, p. 044304, 2007.
- (7) G. Balakrishnan, F. Durand, and J. Guttag, “Detecting pulse from head motions in video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
- (8) G. de Haan and V. Jeanne, “Robust pulse rate from chrominance-based rppg,” IEEE Transactions on Biomedical Engineering, vol. 60, no. 10, pp. 2878–2886, Oct 2013.
- (9) X. Li, J. Chen, G. Zhao, and M. Pietikäinen, “Remote heart rate measurement from face videos under realistic situations,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp. 4264–4271.
- (10) A. Lam and Y. Kuno, “Robust heart rate measurement from video using select random patches,” in 2015 IEEE International Conference on Computer Vision (ICCV), Dec 2015, pp. 3640–3648.
- (11) M. A. Haque, R. Irani, K. Nasrollahi, and T. B. Moeslund, “Heartbeat rate measurement from facial video,” IEEE Intelligent Systems, vol. 31, no. 3, pp. 40–48, May 2016.
- (12) X. Yu, J. Huang, S. Zhang, and D. N. Metaxas, “Face landmark fitting via optimized part mixtures and cascaded deformable model,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 11, pp. 2212–2226, 2016.
- (13) A. Osman, J. Turcot, and R. El Kaliouby, “Supervised learning approach to remote heart rate estimation from facial videos,” in Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, vol. 1. IEEE, 2015, pp. 1–6.
- (14) G.-S. Hsu, A. Ambikapathi, and M.-S. Chen, “Deep learning with time-frequency representation for pulse estimation from facial videos,” in Biometrics (IJCB), 2017 IEEE International Joint Conference on. IEEE, 2017, pp. 383–389.
- (15) Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1701–1708.
- (16) G. B. Huang, H. Lee, and E. Learned-Miller, “Learning hierarchical representations for face verification with convolutional deep belief networks,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2518–2525.
- (17) Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade for facial point detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 3476–3483.
- (18) X. Cao, D. Wipf, F. Wen, G. Duan, and J. Sun, “A practical transfer learning algorithm for face verification,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 3208–3215.
- (19) D. Chen, X. Cao, F. Wen, and J. Sun, “Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3025–3032.
- (20) S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
- (21) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
Appendix A Experimental Setup and Neural Network
Appendix B Pulse Rate Variability
Pulse rate variability is the variation in the time interval between successive expansions of an artery, usually measured by the variation in the beat-to-beat interval. It is considered a non-invasive measure of autonomic nervous system (ANS) activity. The ANS has two branches, the sympathetic nervous system (SNS) and the parasympathetic nervous system (PNS), and is regulated by the hypothalamus; its functions include control of respiration, cardiac regulation, vasomotor activity, and certain reflex actions such as coughing, sneezing, swallowing, and vomiting. The high-frequency (HF) component of PRV is affected by efferent vagal (parasympathetic) activity and decreases under acute time pressure, emotional strain, mental stress, and elevated anxiety. The low-frequency (LF) component of PRV contains both sympathetic and vagal influences. Frequent and accurate measurement of PR and PRV can thus provide critical signs of one's well-being, and abnormalities can point to potential health problems.
In this study, the LF and HF components of PRV were roughly estimated by computing the area under the PSD curve within specific frequency ranges: 0.04-0.15 Hz for LF and 0.15-0.4 Hz for HF. We used these rough estimates of LF and HF as response variables, training separate models to predict the LF component and the HF component of the PPG signal.
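The band-power computation above can be sketched as follows, approximating the area under the periodogram within each band as a Riemann sum; the synthetic test signal is illustrative, not study data.

```python
import numpy as np

def band_power(signal, fs, f_lo, f_hi):
    """Approximate area under the periodogram between f_lo and f_hi (Hz)."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                                 # remove DC component
    n = len(x)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2 / (fs * n)     # periodogram estimate
    band = (freqs >= f_lo) & (freqs <= f_hi)
    df = freqs[1] - freqs[0]
    return float(np.sum(psd[band]) * df)             # Riemann-sum area

# Synthetic beat-interval signal dominated by a 0.1 Hz (LF-band) oscillation
fs = 4.0
t = np.arange(0, 500, 1 / fs)
x = np.sin(2 * np.pi * 0.1 * t)
lf = band_power(x, fs, 0.04, 0.15)
hf = band_power(x, fs, 0.15, 0.40)
print(lf > hf)  # True: almost all power lies in the LF band
```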
We tested the model using the same leave-one-out cross-validation as for pulse rate, so predictions are made for a user the model has never seen in training. We chose 200 iterations for these models. The test and training losses over iterations for the LF and HF components of the PPG signal are depicted in Fig. 7 and Fig. 8, respectively.
For the LF component, the mean absolute percentage error on the test set is 4.58% and the RMSE is 3.49; for the HF component, the MAPE is 10.2% and the RMSE is 4.96. The mean RMSE of our model is 4.3, whereas the mean RMSE over people of different skin colors in (kumar2015distanceppg, ) is 25.3, an 83% decrease in RMSE. Figures 6 and 9 compare the actual and predicted values for the two components of the PPG signal, respectively. We note that the datasets in (kumar2015distanceppg, ) and (osman2015supervised, ) differ from the dataset used in this paper, so the comparison is not on the same data; however, since the code and data of the prior work are not public, the comparison is made on aggregate prediction accuracy.