Assessment of Deep Learning-based Heart Rate Estimation using Remote Photoplethysmography under Different Illuminations

by Ze Yang, et al.
Beihang University

Remote photoplethysmography (rPPG) monitors heart rate without requiring physical contact, which enables a wide variety of applications. Deep learning-based rPPG methods have demonstrated performance superior to that of traditional approaches in controlled contexts. However, the lighting situation in indoor spaces is typically complex, with uneven light distribution and frequent variations in illumination, and a fair comparison of different methods under different illuminations using the same dataset has been lacking. In this paper, we present a public dataset, the BH-rPPG dataset, which contains data from twelve subjects under three illuminations: low, medium, and high. We also provide the ground-truth heart rate measured by an oximeter. We compare the performance of three deep learning-based methods with that of four traditional methods using two public datasets: the UBFC-rPPG dataset and the BH-rPPG dataset. The experimental results demonstrate that traditional methods are generally more resistant to fluctuating illuminations. We find that rPPGNet achieves the lowest MAE among the deep learning-based methods under medium illumination (4.5 beats per minute (BPM)), whereas CHROM achieves 1.5 BPM, outperforming rPPGNet by 67%. These results suggest that when designing deep learning-based heart rate estimation algorithms, illumination variation should be taken into account. This work serves as a benchmark for rPPG performance evaluation and opens a pathway for future investigation into deep learning-based rPPG under illumination variations.





I Introduction

Heart rate (HR) is an important physiological indicator for both physical and mental health. HR monitoring has been used in many applications, such as state monitoring [23], driver fatigue detection [36], and face anti-spoofing [5]. Traditional HR monitoring methods rely on electrocardiograph (ECG) and contact photoplethysmography (PPG) sensors. However, wearing such contact devices is uncomfortable and often interferes with daily activities. With the development of computer vision algorithms, remote HR measurement based on remote photoplethysmography (rPPG) has been proposed [4, 3, 16, 26]. While rPPG offers the potential for contactless and continuous measurement of HR using low-cost web cameras, the system performance is still limited by many factors, such as lighting variations and head movements [14].

Lighting conditions are critical for rPPG, since the quality of the rPPG signal is determined by the light absorbed by the skin. However, most existing studies only explored laboratory conditions with good lighting [3, 42, 4, 34, 21, 38]. Insufficient illumination may lead to a low-amplitude rPPG signal, because the light is too weak to penetrate the skin surface. Moreover, most traditional methods are required to find a specific skin area [38, 4], and the low contrast of the image makes it difficult to obtain the correct region of interest (ROI). Conversely, high light intensity leads to image clipping on the skin surface [37, 20]. Besides intensity, the light distribution also has a significant impact on rPPG. Conventional methods usually select the whole face as the ROI and assume that different parts of the face contribute equally to the rPPG signal. This assumption may not hold in real-world applications, especially in indoor spaces, where the light distribution and intensity vary with the relative position between the subject and the light source.

The traditional approaches use different methods to extract the rPPG signal, which can be mainly categorized into two types: 1) skin reflection model-based approaches [4, 38] and 2) blind source separation-based approaches [21, 12]. Unfortunately, these models seldom take lighting conditions into consideration. Po et al. proposed an adaptive ROI approach based on the quality of the rPPG signal acquired from sub-regions of the face to tackle the uneven light distribution challenge. However, the system performance under different lighting intensities has not yet been evaluated.

Deep learning-based approaches, such as Convolutional Neural Networks (CNN), have been used to estimate HR from the rPPG signal. Špetlík et al. [26] proposed to use a 2D-CNN backbone to directly estimate a single HR value at the early stage of rPPG research, but this neglects the temporal information between frames. Physnet [41] and rPPGNet [42] can detect atrial fibrillation (AF) by generating more precise rPPG signals. Physnet employs a deep spatio-temporal network with a 3D-CNN backbone and builds an end-to-end model. rPPGNet treats HR estimation as a multi-task learning problem (an HR estimation task and a skin segmentation task), also using a 3D-CNN backbone with a skin segmentation branch. Although these works achieve superior performance, the quality of the rPPG signal generated under different illuminations is still unknown. Deepphys [3] combines the theory of the skin reflection model with an attention mechanism, adopting a 2D-CNN backbone. It outperforms the traditional methods by using attention to take into account the rPPG intensity distribution across different parts of the face. It is well known that the performance of deep learning models is sensitive to illumination. While the filters in a CNN learn specific patterns capturing different levels of visual information in most computer vision tasks, in the HR estimation task the lighting affects the quality of the rPPG signal itself. Wang et al. [43] conducted a series of experiments showing that CNNs use the color variation caused by blood absorption to estimate HR. However, they did not validate the performance of deep learning models under different illuminations.

For data-driven approaches, the quality of the training data determines the system performance. Most previous studies evaluated performance on different datasets, which makes it unfair to compare systems. For example, Physnet [41] is trained on the OBF [13] dataset and Deepphys is trained on the RGB Video I [3] dataset; neither dataset is publicly accessible. To evaluate the robustness of different methods in real-world applications, we present a public dataset, the BH-rPPG dataset (BH stands for BeiHang University), which consists of three lighting intensities with uneven light distribution on the face (see the first row in Fig. 1).

In summary, the primary contributions of this paper are three-fold:

  1. We present the BH-rPPG, a public dataset for rPPG-based heart rate estimation. The BH-rPPG consists of twelve subjects’ data under three different illuminations. The link can be found at

  2. We systematically evaluated the robustness to illumination variation of typical methods for rPPG-based heart rate estimation, including four traditional methods [4, 34, 21, 38] and three deep learning-based methods [3, 42, 41].

  3. Our experimental results suggest that although the deep learning-based methods achieve superior performance under normal illumination, they are less resistant to illumination variations compared with traditional methods. These findings draw attention to designing more robust deep learning-based methods for remote heart rate estimation.

Fig. 1: Uncontrolled light sources and controlled light sources. The first row is collected in natural lighting condition. The lighting intensity on face is uneven and the three columns from left to right correspond to pictures taken at different light intensities. The second row represents video recorded in controlled lighting environment.

The remainder of this paper is organized as follows. Section II summarizes the related work on heart rate estimation using rPPG and the lighting conditions in different applications. Section III describes the experimental setup, including the datasets, methods, experimental protocols and performance evaluation metrics. Section IV presents the experimental results. Section V discusses the findings. Finally, Section VI concludes this paper and outlines future work.

II Related Work

We first review the existing methods of heart rate estimation using rPPG, including both traditional approaches and deep learning-based approaches. Then we summarize the different lighting conditions in various rPPG applications.

II-A Heart Rate Estimation via rPPG

Heart rate can be remotely monitored through two channels: ballistocardiography (BCG) [2, 9] and remote photoplethysmography (rPPG) [3, 42, 4, 34, 21, 38]. BCG-based methods use a camera to capture the subtle movements induced by the blood periodically ejected into the vessels with each heartbeat. Non-contact BCG pulse measurement is achieved by blind source separation of the head movements in video. However, BCG-based methods are usually limited by the user's head movement, since the faint movement trace induced by cardiac activity is hard to capture during large-scale head movements. In contrast, rPPG-based methods register the pulse from subtle color variations of human skin [29, 34]. This measurement is based on the fact that the pulsatile blood propagating through the human cardiovascular system changes the blood volume in skin tissue. The oxygenated blood circulation leads to fluctuations in the amount of hemoglobin molecules and proteins, thereby causing variations in optical absorption and scattering across the light spectrum [29].

The rPPG-based methods can be categorized into two types: 1) traditional methods that rely on optical models, e.g., the Lambert-Beer law and Shafer's dichromatic reflection model; and 2) deep learning-based methods that rely on the appearance of the face.

The optical models used in the traditional methods are grounded in the optical properties of the skin under ambient illumination. Different color channels contain rPPG signals of different quality. The green channel was used in early rPPG research, since it carries the strongest rPPG signal [34]. Previous studies have shown that cardiac activity causes variations in optical absorption across the light spectrum [1]; using this characteristic, CHROM [4] and POS [38] project the RGB channels onto different planes by re-weighting and linearly combining the color channels. Blind source separation has also been proposed, which assumes that the temporal trace of the PPG can be retrieved from independent or uncorrelated signal sources under certain assumptions. Independent component analysis has been applied to multiple signal sources obtained in different ways, such as the color channels of the same region [21] and patch-level regions of interest (ROI) [10].
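The channel-projection idea behind CHROM and POS can be sketched in a few lines. The following is a minimal, illustrative NumPy implementation of a POS-style projection; the function name, window length, and default frame rate are our own choices, not the original authors' code:

```python
import numpy as np

def pos_projection(rgb, fps=30, win_sec=1.6):
    """POS-style pulse extraction from per-frame mean RGB values.

    rgb: array of shape (N, 3) holding the spatially averaged R, G, B
    values of the skin region for each of N frames.
    Returns a 1-D pulse signal of length N.
    """
    n = len(rgb)
    w = int(win_sec * fps)           # sliding-window length in frames
    h = np.zeros(n)
    proj = np.array([[0., 1., -1.],   # S1 =  G - B
                     [-2., 1., 1.]])  # S2 = -2R + G + B
    for t in range(n - w + 1):
        block = np.asarray(rgb[t:t + w], dtype=float)
        cn = block / block.mean(axis=0)       # temporal normalization
        s1, s2 = proj @ cn.T                  # project onto the POS plane
        alpha = s1.std() / (s2.std() + 1e-9)  # tune S2's contribution
        p = s1 + alpha * s2
        h[t:t + w] += p - p.mean()            # overlap-add
    return h
```

Feeding it a synthetic RGB trace whose channels are modulated by a common pulse recovers a signal whose dominant frequency matches the simulated pulse rate.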

Many deep learning-based heart rate estimation methods have been proposed recently. Chen and McDuff [3] presented Deepphys, which employs two parallel CNN branches to extract rPPG features: a motion branch and an appearance branch. The motion branch is fed with normalized frame differences to cancel motion effects on the rPPG signal, while the appearance branch uses an attention mechanism that enables the network to focus on skin areas. Other researchers investigated different network architectures for better estimation. Yu et al. [41] developed an end-to-end network to estimate heart rate from compressed videos. They used a three-dimensional CNN to capture temporal information and an extra skin segmentation branch to regress the PPG signal. Niu et al. [19] proposed to directly estimate heart rate from a spatio-temporal network. Špetlík et al. [26] introduced a two-step network for feature extraction and heart rate estimation. Qiu et al. [22] integrated the signal magnification technique named Eulerian video magnification [40] with a convolutional neural network to estimate heart rate. Lee et al. [17] proposed a transductive meta-learner to adapt the model to different domains. In addition to the meta-learner, Niu et al. [35] introduced a cross-verified scheme to purify the features constructed from spatio-temporal maps. Although deep learning methods yield promising results, their performance under different illuminations remains to be explored.

II-B Lighting conditions in rPPG applications

The rPPG has been used in many applications, such as state monitoring at home or driver fatigue detection in the car, where the lighting conditions can be very different.

In indoor environments, depending on the relative position between the person and the light source, the system may suffer from insufficient and uneven lighting. In the application of state monitoring, the algorithm should adapt to light variation. For example, Sun et al. [28] continuously monitored discomfort of infants over a long period, which requires the algorithm to work under complex lighting conditions. To estimate heart rate in extremely low light, Lin et al. [15] proposed to extract features from the infrared spectrum. In addition, due to COVID-19, the demand for non-contact healthcare techniques is increasing dramatically [24]. However, investigations of algorithm performance under complex lighting conditions are relatively rare.

In outdoor environments, heart rate estimation becomes more challenging since the illumination changes dramatically [8]. In the driver fatigue detection task, the illumination is quite distinct from that in the laboratory. In other applications such as face anti-spoofing and online payment systems, rPPG technology can be used to perform liveness detection, which prevents a fake face from circumventing the system to gain unauthorized access [5]. Deepfake videos can also be distinguished by rPPG [6], where the lighting conditions are far more complicated due to the wide range of usage.

In summary, although rPPG has been deployed in many applications, non-ideal lighting conditions degrade the system performance. Thus, it is necessary to conduct a systematic comparison between different approaches and to evaluate their robustness under different lighting conditions.

III Experimental setup

In this section, we first briefly introduce the public dataset used in the experiment. Then, we present the details of BH-rPPG dataset under three different lighting conditions: low/medium/high illumination. Next, we introduce the methods compared in this paper. Finally, we describe our experimental protocol and the performance evaluation metrics.

III-A Public dataset

Most public datasets are collected under controlled environments, such as UBFC-rPPG [33], VIPL [18], PURE [27] and MAHNOB [25]. To the best of our knowledge, no public dataset examines the effect of illumination intensity. Although the COHFACE [7] dataset was collected under both controlled and natural lighting, the lighting intensity remains the same. Here we choose the UBFC-rPPG [33] dataset as the training set for the deep learning methods.

The University Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy dataset (i.e., the UBFC-rPPG dataset) consists of two scenarios; here we only use the part in which subjects play a time-sensitive mathematical game. This is a realistic setting that includes natural head movements, and the subjects' heart rates change over time as induced by the game. The dataset includes 42 one-minute videos from different subjects. Each video is recorded using a low-cost webcam (Logitech C920 HD Pro) at 30 fps with a resolution of 640x480 in uncompressed 8-bit RGB format. A CMS50E transmissive pulse oximeter was used to obtain the ground-truth PPG data, comprising the PPG waveform as well as the PPG heart rates.

III-B BH-rPPG dataset

III-B1 Apparatus setup

Fig. 2: Experimental setup.

Fig. 2 presents the experimental setup. There are two light sources (a ceiling lamp and a table lamp) that create different lighting conditions. An oximeter (CONTEC CMS50E) was used to obtain the ground-truth PPG data. A webcam (Logitech HD Pro Webcam C310 color camera) recorded video synchronized with the oximeter. The resolution for video is . The webcam's nominal frame rate is 30 fps, but under low lighting intensity the actual frame rate drops to about 20 fps. The subject sits 1 meter away from the camera. Since our study focuses on illumination variations instead of head movements, subjects were asked to keep their heads stationary during data collection. We used two lamps in the experiment because this is closer to the settings of daily living. We collected data under three lighting conditions, as shown in Table I. With the ceiling lamp always on in all three conditions, we changed the mode of the table lamp to modulate the illumination. Fig. 3 shows sample images under different illuminations.

Lighting condition       Ceiling lamp   Table lamp
Low intensity level      on             off
Medium intensity level   on             normal
High intensity level     on             high
TABLE I: Three lighting conditions in our dataset.

III-B2 Data collection procedure

We recruited twelve healthy subjects (11 males and 1 female) on campus, with a mean age of 32 (SD = 2.56). For each subject, we recorded three 30-second videos under the three lighting conditions. The left part of Fig. 3 shows the average lighting intensity under the three conditions: the illuminations at the low, medium and high levels are 8.0, 42.4, and 104.0 lux, respectively.

Fig. 3: Average illumination and sample images.

III-C Methods

We evaluate both traditional methods and deep learning-based methods. To make a fair comparison and eliminate differences in preprocessing, we used the Viola-Jones face detector to extract the face area and reduce noise from the background. We employed the Kanade-Lucas-Tomasi (KLT) [31] algorithm to track the location of the face region to compensate for rigid head movements. The processed video frames are used as input for the different algorithms.

III-C1 Traditional methods

We compared four representative methods: GREEN [34], CHROM [4], POS [38] and ICA [21]. For implementation, we used the open-source toolbox iPhys [16]. The basic workflow of the traditional methods is shown in Fig. 4. First, we detect and track the bounding box of the face using the KLT algorithm [31]. Then, the skin area is detected and the eyes and mouth are removed, since they often introduce noise through non-rigid movements during blinking and speech. Next, the pulse signal is extracted by spatial pooling, and the mean values of all frames are concatenated as the raw pulse trace c(t) for t = 1, ..., N, where N is the number of frames in the video. After that, the varying part induced by the heart rate is obtained by band-pass filtering and detrending. Finally, we apply the different methods to the raw pulse trace, transform the signal into the frequency domain using the fast Fourier transform (FFT), and find the spectral peak to estimate the heart rate.
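The final steps of this workflow (detrending, band-pass restriction, FFT, peak picking) can be sketched as follows. This is an illustrative NumPy version, not the iPhys implementation; it performs the band-pass simply by restricting the spectral peak search to the plausible heart-rate band:

```python
import numpy as np

def estimate_hr(trace, fps=30.0, band=(0.7, 2.5)):
    """Estimate heart rate (BPM) from a raw pulse trace.

    Detrends the trace, transforms it with an FFT, and picks the
    spectral peak inside the plausible heart-rate band (0.7-2.5 Hz,
    i.e. 42-150 BPM), mirroring the workflow in Fig. 4.
    """
    trace = np.asarray(trace, dtype=float)
    trace = trace - trace.mean()                # remove the DC component
    t = np.arange(len(trace))
    # simple linear detrending
    trace = trace - np.polyval(np.polyfit(t, trace, 1), t)
    spectrum = np.abs(np.fft.rfft(trace))
    freqs = np.fft.rfftfreq(len(trace), d=1.0 / fps)
    mask = (freqs >= band[0]) & (freqs <= band[1])   # band restriction
    peak = freqs[mask][np.argmax(spectrum[mask])]
    return peak * 60.0                           # Hz -> beats per minute
```

With a 10-second, 30 fps synthetic trace pulsing at 1.2 Hz, the estimate lands at 72 BPM.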

Fig. 4: Traditional method workflow

III-C2 Deep learning-based methods

We evaluate the performance of three typical deep learning-based methods: Deepphys [3], rPPGNet [42], and Physnet [41]. Deepphys is a two-dimensional CNN-based network that uses an attention mechanism to learn a skin map. rPPGNet is a three-dimensional CNN-based network that uses a skin segmentation branch to make the model focus on the skin area. Physnet uses a temporal encoder-decoder structure for the rPPG task, a structure also applied in action segmentation. The basic procedure of the deep learning-based methods can be formulated as

ŷ = G(F(x_1, ..., x_T; θ_F); θ_G),

where y denotes the ground truth collected by the finger oximeter and x_1, ..., x_T are the frames sampled from the original video. In Deepphys, T is set to 2, meaning that two consecutive frames are used to compute the normalized difference frame. First, a CNN backbone F extracts the spatial-temporal feature with parameters θ_F; then G performs channel aggregation with parameters θ_G. The estimated PPG is ŷ. Deepphys uses a 2D CNN as F to extract spatial information and uses soft attention to assign different weights to skin regions. Physnet and rPPGNet use a 3D CNN as F to model the temporal signal and take into account the correlation between the ground truth and the output. We re-implemented the Deepphys algorithm since the authors did not release the source code; for rPPGNet and Physnet, we directly used the open-source models.
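For concreteness, the normalized frame difference used as the motion input in Deepphys-style networks can be sketched as below. This is an illustrative NumPy version; the final standardization step is a common preprocessing choice, not necessarily the authors' exact recipe:

```python
import numpy as np

def normalized_frame_difference(frames, eps=1e-6):
    """Motion representation for Deepphys-style networks.

    frames: float array of shape (T, H, W, C) with pixel values.
    Returns (T-1, H, W, C) normalized differences
    d(t) = (c(t+1) - c(t)) / (c(t+1) + c(t)),
    which suppresses the stationary appearance term and keeps the
    pulsatile/motion component.
    """
    frames = np.asarray(frames, dtype=float)
    num = frames[1:] - frames[:-1]
    den = frames[1:] + frames[:-1] + eps
    d = num / den
    # standardize before feeding into the CNN
    return d / (d.std() + eps)
```

For a static scene the representation is identically zero, which is exactly why the motion branch cancels appearance and illumination offsets.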

III-D Experimental protocols

On the one hand, we would like to compare the performance of different deep learning methods trained under the same protocol, i.e., trained and evaluated on the same dataset. On the other hand, we aim to evaluate the performance of the deep learning-based methods under different lighting conditions. We therefore provide a comprehensive performance comparison between the different methods.

III-D1 Performance comparison under the same training protocol

We utilize the UBFC-rPPG dataset to train and test the different deep learning-based methods. Specifically, we randomly divide the 42 videos into a training set (37 videos) and a test set (5 videos). Since each video corresponds to one subject, the task is subject-independent. For the traditional methods, we only evaluated performance on the test set for a fair comparison with the deep learning-based methods. For Deepphys [3], we reproduced the model and trained it with the same learning rate and batch size. For Physnet [41] and rPPGNet [42], we adopted clip lengths of 128 and 64 frames, respectively, sampled from the original videos.

III-D2 Performance comparison under different illuminations

We evaluated the traditional methods and the deep learning-based methods trained on the UBFC-rPPG dataset. For the traditional methods, we followed the settings of iPhys [16], except that the skin pixel value range was adapted to the different lighting conditions; the raw signal was filtered to the frequency range of 0.7 to 2.5 Hz. For the deep learning-based methods, we used the best model trained under the protocol described in Section III-D1 to cross-test on the BH-rPPG dataset. For a fair comparison, we used the same window size of 8 s and step size of 1 s as in the evaluation protocol of the traditional methods.
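The 8-second window advanced in 1-second steps can be expressed as a simple index generator. This is an illustrative sketch (the helper name is ours); frame counts are derived from the nominal frame rate:

```python
def sliding_windows(n_frames, fps=30.0, win_sec=8.0, step_sec=1.0):
    """Yield (start, end) frame-index pairs for the evaluation protocol:
    an 8-second window advanced in 1-second steps, applied identically
    to the traditional and deep learning-based methods."""
    win = int(win_sec * fps)
    step = int(step_sec * fps)
    for start in range(0, n_frames - win + 1, step):
        yield start, start + win
```

Each window is then passed to the HR estimator, and the per-window estimates are compared against the oximeter ground truth.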

III-E Performance evaluation metrics

We used four evaluation metrics: the mean absolute error (MAE), the root mean square error (RMSE), the signal-to-noise ratio (SNR), and the Bland-Altman plot.


III-E1 Mean absolute error

MAE = (1/N) Σ_{i=1}^{N} |HR_est(i) − HR_gt(i)|,

where N is the total number of samples, HR_est(i) is the HR estimated by rPPG for the i-th sample, and HR_gt(i) is the ground-truth HR for the i-th sample.

III-E2 Root mean square error

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (HR_est(i) − HR_gt(i))² ),

with the same notation as for the MAE.
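Both error metrics are straightforward to compute. A minimal NumPy sketch (helper names are ours):

```python
import numpy as np

def mae(hr_est, hr_gt):
    """Mean absolute error between estimated and ground-truth HR (BPM)."""
    hr_est, hr_gt = np.asarray(hr_est, float), np.asarray(hr_gt, float)
    return np.mean(np.abs(hr_est - hr_gt))

def rmse(hr_est, hr_gt):
    """Root mean square error between estimated and ground-truth HR (BPM)."""
    hr_est, hr_gt = np.asarray(hr_est, float), np.asarray(hr_gt, float)
    return np.sqrt(np.mean((hr_est - hr_gt) ** 2))
```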
III-E3 Signal-to-noise ratio

The SNR computes the ratio between the energy around the fundamental frequency plus the first harmonic of the pulse signal and the remaining energy contained in the spectrum:

SNR = 10 log10( Σ_f (U(f) Ŝ(f))² / Σ_f ((1 − U(f)) Ŝ(f))² ).

Here we follow the same definition as in [4], where Ŝ(f) is the spectrum of the pulse signal, f is the frequency in beats per minute, and U(f) is a binary template window that selects the bands around the HR fundamental and its first harmonic.
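An illustrative NumPy computation of this SNR, assuming a rectangular template of ±6 BPM around the fundamental and the first harmonic (the template width is our assumption; [4] defines the exact window):

```python
import numpy as np

def snr_db(pulse, hr_bpm, fps=30.0, half_width=6.0):
    """SNR of a pulse signal in the spirit of de Haan & Jeanne [4].

    The spectral energy inside a binary template window around the HR
    fundamental and its first harmonic (half_width BPM on each side)
    is compared against the remaining spectral energy.
    """
    pulse = np.asarray(pulse, float) - np.mean(pulse)
    spectrum = np.abs(np.fft.rfft(pulse)) ** 2
    freqs_bpm = np.fft.rfftfreq(len(pulse), d=1.0 / fps) * 60.0
    template = (np.abs(freqs_bpm - hr_bpm) <= half_width) | \
               (np.abs(freqs_bpm - 2 * hr_bpm) <= half_width)
    signal = spectrum[template].sum()
    noise = spectrum[~template].sum() + 1e-12
    return 10.0 * np.log10(signal / noise)
```

A clean sinusoid at the reference rate yields a strongly positive SNR, while broadband noise yields a low one.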

III-E4 Bland-Altman plot

This plot demonstrates the consistency between two signals. The differences between the heart rate estimated by the rPPG algorithm and the ground truth are plotted against the average of the two. We show the mean, the standard deviation (SD), and the 95% limits of agreement (±1.96 SD) of the differences.
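The summary statistics of a Bland-Altman analysis can be sketched as follows (an illustrative helper; the plotting itself is omitted):

```python
import numpy as np

def bland_altman_stats(hr_est, hr_gt):
    """Mean difference, and 95% limits of agreement (mean ± 1.96 SD).

    In the plot, each difference hr_est[i] - hr_gt[i] is drawn against
    the pairwise average (hr_est[i] + hr_gt[i]) / 2.
    """
    hr_est, hr_gt = np.asarray(hr_est, float), np.asarray(hr_gt, float)
    diff = hr_est - hr_gt
    mean_diff = diff.mean()
    sd = diff.std(ddof=1)          # sample standard deviation
    return mean_diff, mean_diff - 1.96 * sd, mean_diff + 1.96 * sd
```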

IV Experimental Results

IV-A Performance comparison under the same protocol

Table II shows the performance of the traditional methods and the deep learning-based methods. Note that the deep learning-based methods are trained and tested under the same protocol using the UBFC-rPPG dataset. Physnet [41] achieves the best performance within UBFC-rPPG. Compared with the traditional methods, the deep learning-based methods perform considerably better under this protocol. These results suggest that deep learning-based methods indeed demonstrate superior performance in a controlled setting.

Category                Method         MAE     RMSE
Deep learning method    Deepphys [3]   3.71    5.27
                        rPPGNet [42]   3.24    4.97
                        Physnet [41]   2.33    3.04
Traditional method      CHROM [4]      9.13    15.00
                        GREEN [34]     20.90   29.33
                        ICA [21]       8.62    13.54
                        POS [38]       15.39   27.22
TABLE II: The results on the UBFC-rPPG dataset (MAE and RMSE in BPM).

IV-B Performance comparison under different illuminations

Table III shows the performance of the traditional methods and the deep learning-based methods under different illuminations. From the MAE and RMSE plots in Fig. 5, we observe that the traditional methods are generally more robust to light variations, except for ICA, which performs poorly under low lighting. The superior performance of the conventional methods may be attributed to the fact that the relationships between the different color channels are largely unaffected by illumination. In the RMSE plot, we find that the RMSE of all conventional methods is below 3 BPM, except for ICA, which reaches 10.76 BPM under low lighting. For the deep learning-based methods, rPPGNet achieved an MAE of 4.5 BPM under medium lighting, while the MAEs of Physnet and Deepphys are all above 8.6 BPM. This demonstrates that the domain gap between different lighting conditions has a great impact on deep learning-based methods.

Fig. 5: Performance comparison between different methods. The three subplots show the MAE, RMSE, and SNR, respectively. The x-axis represents the different methods; the colors indicate the three lighting conditions: high, medium and low.
Method          Lighting condition   MAE (BPM)   RMSE (BPM)   SNR (dB)
rPPGNet [42]    low                  5.20        6.81         -3.23
                medium               4.50        5.85         -0.47
                high                 5.60        7.76         -0.99
Deepphys [3]    low                  9.01        13.26        -2.58
                medium               10.95       15.93         1.95
                high                 12.20       17.24        -0.80
Physnet [41]    low                  11.85       15.62        -8.93
                medium               8.98        14.57        -3.74
                high                 10.54       15.62        -3.95
CHROM [4]       low                  1.25        1.48         -0.67
                medium               1.50        2.12          1.45
                high                 1.45        2.25          0.72
GREEN [34]      low                  1.16        1.41         -0.18
                medium               1.58        2.00          3.27
                high                 2.21        3.18          0.84
ICA [21]        low                  5.58        10.76        -0.64
                medium               1.83        2.68          2.43
                high                 1.54        2.27          3.55
POS [38]        low                  1.25        1.47         -1.14
                medium               1.37        1.85          1.32
                high                 1.46        2.26          1.79
TABLE III: The results under different lighting conditions.

From the SNR plot in Fig. 5, we find that the conventional methods achieve positive values under high and medium lighting conditions, whereas among the deep learning-based methods only Deepphys achieves a positive SNR, and only under medium lighting. It is notable that the deep learning-based methods are significantly inferior to the traditional methods in terms of generalization ability, and the different algorithms perform inconsistently across illuminations: Deepphys works better in low light, rPPGNet is more effective in low and medium light, and Physnet only works well in moderate light. Fig. 6 depicts the ground-truth HR and the estimated HR under the three lighting conditions to illustrate the estimation consistency of the various methods. The traditional methods are much more consistent with the ground-truth HR than the deep learning-based methods.

Fig. 6: The Bland-Altman plots of the rPPGNet, Deepphys, Physnet and POS methods under different lighting conditions. Each row shows a different method and each column a different lighting condition. The Bland-Altman plots illustrate the degree of agreement between the estimated HR and the ground-truth HR. The solid lines represent the mean value, while the dashed lines represent the 95% limits of agreement.

V Discussion

V-A Distribution of rPPG signal under uneven light

To better understand the performance difference, we visualize the region of interest (ROI) used by the different methods. Fig. 7 shows the original frame, the ground truth, the preprocessing result of the traditional method, and the attention weights of intermediate steps in rPPGNet.

The ground-truth map is generated by applying the POS method and comparing against the ground-truth signal using the SNR definition in Section III-E, since the traditional method can accurately depict the real distribution of the rPPG signal on the face. From the original frame and the ground truth, we can see that the brighter the facial region, the higher the SNR of the rPPG signal. The varying light intensity changes the distribution of the rPPG signal, which echoes the findings in [38] and [37].

Fig. 7: The visualization of ROI. The three columns show the results of the traditional method, the ground truth and the attention map extracted by rPPGNet.

However, the attention weights learned by the deep learning methods show that the neural network focuses on the background and on skin areas that are irrelevant to the light. One possible explanation is the domain gap between the training data (UBFC-rPPG dataset) and the test data (BH-rPPG dataset): the skin tones and good lighting conditions in the UBFC-rPPG training set differ from the BH-rPPG test set. Lee et al. [11] proposed a meta-learning framework to update model weights, which may help a model adapt to different application scenarios. In addition, the skin branch in rPPGNet is a set of learnable weights optimized against a binary skin mask generated by [30]. When the lighting intensity, skin tone or environment changes, the skin branch naturally produces an incorrect skin mask. We believe that an ROI-finding branch is significant for the deep learning-based rPPG task, and that varying lighting conditions have a significant effect on the performance of deep learning-based methods.

In contrast to deep learning-based rPPG, the traditional methods detect skin regions in the preprocessing step, as visualized in the traditional-method column of Fig. 7. Although the lighting intensity distribution on the face is uneven, the skin detection algorithm successfully locates the correct skin area containing the rPPG signal induced by cardiac activity. This may explain why the traditional approaches perform better than the deep learning-based methods under different illuminations. Furthermore, average pooling over the whole region retrieves the information from brighter regions that contain more rPPG signal and smooths the darker regions that contribute less. Additionally, exploiting the spatial redundancy of rPPG [39], some studies partition the face into distinct grids and weight each grid separately, which may be beneficial under varying lighting conditions.

V-B Algorithm robustness to illumination variation

In this paper, we investigated the performance of deep learning-based rPPG under varying lighting conditions, using the conventional methods as a baseline. From Table II and Table III, we find that the deep learning-based methods perform well within the UBFC-rPPG dataset but poorly on the BH-rPPG dataset. This is especially true for Physnet, which shows the best performance within UBFC-rPPG but poor results on BH-rPPG. In this case, the appearance of the background, the subject's skin tone and the lighting conditions have a great impact on the deep learning-based methods. According to the MAE and RMSE results in Table III, rPPGNet is the most robust to illumination variations. Even so, the ambient light needs to be chosen carefully: high or low lighting intensity reduces its accuracy by 24% and 15%, respectively.

As for the conventional methods, they perform poorly on the UBFC-rPPG dataset but well on the BH-rPPG dataset. One possible reason for the low accuracy on UBFC-rPPG is the motion effect on the rPPG signal: a deep learning-based method learns the relationship between pixel values and HR through a large number of non-linear mappings, whereas the linear combination of color channels in the conventional methods may not hold in complicated environments with large head movements. However, given their robustness under varying lighting conditions, the conventional methods are more suitable for scenarios with little head movement, such as liveness detection in payment systems.

VI Conclusion

In this work, we compared the performance of different rPPG-based heart rate estimation methods under three lighting intensities. The results show that conventional methods are more robust to changes in lighting intensity and to uneven lighting distribution, while rPPGNet achieves the best performance among the deep learning-based methods. The development of deep learning-based methods should therefore account for varying lighting conditions, especially different lighting intensities and uneven lighting distribution. Moreover, we conducted a comparative evaluation of the deep learning-based techniques under the same training paradigm. The results show that PhysNet performs best on the UBFC-rPPG dataset and that deep learning-based methods are able to capture the temporal variation of skin color under motion. The findings of this study motivate further research into more robust deep learning models to enable real-world application in daily living.


  • [1] J. Allen (2007) Photoplethysmography and its application in clinical physiological measurement. Physiological measurement 28 (3), pp. R1. Cited by: §II-A.
  • [2] G. Balakrishnan, F. Durand, and J. Guttag (2013) Detecting pulse from head motions in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3430–3437. Cited by: §II-A.
  • [3] W. Chen and D. McDuff (2018) Deepphys: video-based physiological measurement using convolutional attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 349–365. Cited by: item 2, §I, §I, §I, §I, §II-A, §II-A, §III-C2, §III-D1, TABLE II, TABLE III.
  • [4] G. de Haan and V. Jeanne (2013) Robust pulse rate from chrominance-based rppg. IEEE Transactions on Biomedical Engineering 60 (10), pp. 2878–2886. External Links: Document Cited by: item 2, §I, §I, §I, §II-A, §II-A, §III-C1, §III-E3, TABLE II, TABLE III.
  • [5] (Website) External Links: Link Cited by: §I, §II-B.
  • [6] S. Fernandes, S. Raj, E. Ortiz, I. Vintila, M. Salter, G. Urosevic, and S. Jha (2019) Predicting heart rate variations of deepfake videos using neural ode. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §II-B.
  • [7] G. Heusch, A. Anjos, and S. Marcel (2017-09-04)(Website) External Links: Link, 1709.00962 Cited by: §III-A.
  • [8] P. Huang, B. Wu, and B. Wu (2020) A heart rate monitoring framework for real-world drivers using remote photoplethysmography. IEEE journal of biomedical and health informatics 25 (5), pp. 1397–1408. Cited by: §II-B.
  • [9] (Website) External Links: Link Cited by: §II-A.
  • [10] A. Lam and Y. Kuno (2015) Robust heart rate measurement from video using select random patches. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3640–3648. Cited by: §II-A.
  • [11] E. Lee, E. Chen, and C. Lee (2020) Meta-rppg: remote heart rate estimation using a transductive meta-learner. In European Conference on Computer Vision, pp. 392–409. Cited by: §V-A.
  • [12] M. Lewandowska, J. Rumiński, T. Kocejko, and J. Nowak (2011) Measuring pulse rate with a webcam—a non-contact method for evaluating cardiac activity. In 2011 federated conference on computer science and information systems (FedCSIS), pp. 405–410. Cited by: §I.
  • [13] X. Li, I. Alikhani, J. Shi, T. Seppanen, J. Junttila, K. Majamaa-Voltti, M. Tulppo, and G. Zhao (2018) The obf database: a large face video database for remote physiological signal measurement and atrial fibrillation detection. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 242–249. Cited by: §I.
  • [14] X. Li, J. Chen, G. Zhao, and M. Pietikainen (2014) Remote heart rate measurement from face videos under realistic situations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4264–4271. Cited by: §I, §III-E.
  • [15] X. Lin and G. de Haan. Using blood volume pulse vector to extract rppg signal in infrared spectrum. Cited by: §II-B.
  • [16] D. McDuff and E. Blackford (2019) Iphys: an open non-contact imaging-based physiological measurement toolbox. In 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 6521–6524. Cited by: §I, §III-C1, §III-D2.
  • [17] (Website) External Links: Link Cited by: §II-A.
  • [18] X. Niu, H. Han, S. Shan, and X. Chen (2018) VIPL-hr: a multi-modal database for pulse estimation from less-constrained face video. In Asian Conference on Computer Vision, pp. 562–576. Cited by: §III-A.
  • [19] X. Niu, S. Shan, H. Han, and X. Chen (2019) Rhythmnet: end-to-end heart rate estimation from face via spatial-temporal representation. IEEE Transactions on Image Processing 29, pp. 2409–2423. Cited by: §II-A.
  • [20] A. Papageorgiou and G. de Haan (2014) Adaptive gain tuning for robust remote pulse rate monitoring under changing light conditions. Technische Universiteit Eindhoven. Cited by: §I.
  • [21] M. Poh, D. J. McDuff, and R. W. Picard (2010) Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE transactions on biomedical engineering 58 (1), pp. 7–11. Cited by: item 2, §I, §I, §II-A, §II-A, §III-C1, TABLE II, TABLE III.
  • [22] Y. Qiu, Y. Liu, J. Arteaga-Falconi, H. Dong, and A. El Saddik (2018) EVM-cnn: real-time contactless heart rate estimation from facial video. IEEE transactions on multimedia 21 (7), pp. 1778–1787. Cited by: §II-A.
  • [23] (Website) External Links: Link Cited by: §I.
  • [24] A. C. Smith, E. Thomas, C. L. Snoswell, H. Haydon, A. Mehrotra, J. Clemensen, and L. J. Caffery (2020) Telehealth for global emergencies: implications for coronavirus disease 2019 (covid-19). Journal of telemedicine and telecare 26 (5), pp. 309–313. Cited by: §II-B.
  • [25] M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic (2011) A multimodal database for affect recognition and implicit tagging. IEEE transactions on affective computing 3 (1), pp. 42–55. Cited by: §III-A.
  • [26] R. Špetlík, V. Franc, and J. Matas (2018) Visual heart rate estimation with convolutional neural network. In Proceedings of the British Machine Vision Conference, Newcastle, UK, pp. 3–6. Cited by: §I, §I, §II-A.
  • [27] R. Stricker, S. Müller, and H. Gross (2014) Non-contact video-based pulse rate measurement on a mobile service robot. In The 23rd IEEE International Symposium on Robot and Human Interactive Communication, pp. 1056–1062. Cited by: §III-A.
  • [28] Y. Sun, J. Hu, W. Wang, M. He, and P. H. de With (2021) Camera-based discomfort detection using multi-channel attention 3d-cnn for hospitalized infants. QUANTITATIVE IMAGING IN MEDICINE AND SURGERY 11 (7), pp. 3059–3069. Cited by: §II-B.
  • [29] C. Takano and Y. Ohta (2007) Heart rate measurement based on a time-lapse image. Medical engineering & physics 29 (8), pp. 853–857. Cited by: §II-A.
  • [30] M. J. Taylor and T. Morris (2014) Adaptive skin segmentation via feature-based face detection. In Real-Time Image and Video Processing 2014, Vol. 9139, pp. 91390P. Cited by: §V-A.
  • [31] C. Tomasi and T. Kanade (1991) Detection and tracking of point features. Technical Report CMU-CS-91-132, Carnegie Mellon University. Cited by: §III-C1, §III-C.
  • [32] S. Tulyakov, X. Alameda-Pineda, E. Ricci, L. Yin, J. F. Cohn, and N. Sebe (2016) Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2396–2404. Cited by: §III-E.
  • [33] (Website) External Links: Link Cited by: §III-A.
  • [34] W. Verkruysse, L. O. Svaasand, and J. S. Nelson (2008) Remote plethysmographic imaging using ambient light.. Optics express 16 (26), pp. 21434–21445. Cited by: item 2, §I, §II-A, §II-A, §III-C1, TABLE II, TABLE III.
  • [35] (Website) External Links: Link Cited by: §II-A.
  • [36] (Website) External Links: Link Cited by: §I.
  • [37] W. Wang (2017) Robust and automatic remote photoplethysmography. Cited by: §I, §V-A.
  • [38] W. Wang, A. C. den Brinker, S. Stuijk, and G. De Haan (2016) Algorithmic principles of remote ppg. IEEE Transactions on Biomedical Engineering 64 (7), pp. 1479–1491. Cited by: item 2, §I, §I, §II-A, §II-A, §III-C1, TABLE II, TABLE III, §V-A.
  • [39] W. Wang, S. Stuijk, and G. De Haan (2014) Exploiting spatial redundancy of image sensor for motion robust rPPG. IEEE Transactions on Biomedical Engineering 62 (2), pp. 415–425. Cited by: §V-A.
  • [40] H. Wu, M. Rubinstein, E. Shih, J. Guttag, F. Durand, and W. Freeman (2012) Eulerian video magnification for revealing subtle changes in the world. ACM transactions on graphics (TOG) 31 (4), pp. 1–8. Cited by: §II-A.
  • [41] Z. Yu, X. Li, and G. Zhao (2019) Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks. arXiv preprint arXiv:1905.02419. Cited by: item 2, §I, §I, §II-A, §III-C2, §III-D1, §IV-A, TABLE II, TABLE III.
  • [42] Z. Yu, W. Peng, X. Li, X. Hong, and G. Zhao (2019) Remote heart rate measurement from highly compressed facial videos: an end-to-end deep learning solution with video enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 151–160. Cited by: item 2, §I, §I, §II-A, §III-C2, §III-D1, TABLE II, TABLE III.
  • [43] Q. Zhan, W. Wang, and G. de Haan (2020) Analysis of cnn-based remote-ppg to understand limitations and sensitivities. Biomedical optics express 11 (3), pp. 1268–1283. Cited by: §I.