Heart rate (HR) is an important physiological signal that reflects the physical and emotional activities, and HR measurement can be useful for many applications, such as training aid, health monitoring, and nursing care. Traditional HR measurement usually relies on contact monitors, such as electrocardiograph (ECG) and contact photoplethysmography (cPPG), which are inconvenient for the users and limit the application scenarios. In recent years, a growing number of studies have been reported on remote HR estimation from face videos [1, 2, 3, 4, 5, 6, 7].
The existing video-based HR estimation methods mainly depend on two kinds of signals: remote photoplethysmography (rPPG) signals [1, 2, 4, 5, 6, 7] and ballistocardiographic (BCG) signals. Estimating HR from face videos could provide a convenient physiological platform for clinical monitoring and health care in the future. However, most of the existing approaches only provide evaluations on private databases, which makes it difficult to compare different methods. Although a few public-domain HR databases are available [8, 6, 9, 10, 11], these databases are very limited in size (usually fewer than 50 subjects). Moreover, they are usually captured in well-controlled scenarios, with minor illumination and motion variations (see Fig. 1). These limitations lead to two issues for remote HR estimation: i) it is hard to analyze the robustness of individual HR estimation algorithms against different variations and acquisition devices; ii) it is difficult to leverage deep representation learning approaches in remote HR estimation, which are believed to be able to overcome the limitations of hand-crafted methods designed on specific assumptions.
To overcome these limitations, we introduce the VIPL-HR database for remote HR estimation, which is a large-scale multi-modal database recorded with various head movement, illumination variations, and acquisition device changes (see Fig. 1).
Given the large VIPL-HR database, we further propose a deep HR estimator, named RhythmNet, for robust heart rate estimation from face videos. RhythmNet takes an informative spatial-temporal map as input and adopts an effective training strategy to learn the HR estimator. The results of within-database and cross-database experiments show the effectiveness of the proposed approach.
The rest of this paper is organized as follows: Section 2 discusses related work on remote HR estimation and existing public-domain HR databases. Section 3 introduces the large-scale VIPL-HR database we have collected. Section 4 provides the details of the proposed RhythmNet. Section 5 evaluates the proposed method under both within-database and cross-database protocols. Finally, the conclusions and future work are summarized in Section 6.
2 Related Work
2.1 Remote HR Estimation
The possibility of using PPG signals captured by consumer-level color cameras was first introduced by Verkruysse et al. Many algorithms have since been proposed, which can be generally divided into blind signal separation (BSS) methods, model-based methods, and data-driven methods.
Among the BSS methods, Poh et al. applied independent component analysis (ICA) to temporally filtered red, green, and blue (RGB) color channel signals to seek the heartbeat-related signal, which they assumed to be one of the separated independent components. A patch-level HR signal calculation with ICA was performed later and achieved state-of-the-art results on the publicly available MAHNOB-HCI database.
Another kind of PPG-based HR estimation method leverages prior knowledge of the skin model for remote HR estimation. De Haan and Jeanne first proposed a skin optical model of different color channels under motion and computed a chrominance feature from a combination of RGB signals to reduce motion noise. In a later work, pixel-wise chrominance features were computed and used for HR estimation. A detailed discussion of the different skin optical models used for rPPG-based HR estimation has also been presented, whose authors further proposed a new projection method for the original RGB signals to extract pulse signals. Niu et al. further applied the chrominance feature to continuous estimation situations.
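To make the chrominance idea concrete, the following is a minimal NumPy sketch of a chrominance-based pulse signal in the spirit of De Haan and Jeanne, followed by a spectral-peak HR readout. It is a simplification under stated assumptions: the input is the per-frame mean RGB value of the skin region, the band-pass filtering step of the original method is omitted, and the function names `chrom_pulse` and `dominant_bpm` as well as the 0.7 to 4.0 Hz search band are illustrative choices, not taken from this paper.

```python
import numpy as np


def chrom_pulse(rgb):
    """Chrominance-based pulse signal from per-frame mean skin RGB values.

    rgb: array of shape (T, 3) holding the mean R, G, B of the skin region.
    Simplified sketch: no band-pass filtering is applied before combining.
    """
    rgb = np.asarray(rgb, dtype=float)
    # Divide each channel by its temporal mean to remove the static skin level.
    norm = rgb / rgb.mean(axis=0)
    r, g, b = norm[:, 0], norm[:, 1], norm[:, 2]
    # Two chrominance combinations of the normalized channels.
    x = 3.0 * r - 2.0 * g
    y = 1.5 * r + g - 1.5 * b
    # Intensity changes shared by all channels (e.g. motion) appear similarly
    # in x and y, so the scaled difference suppresses them.
    alpha = x.std() / y.std()
    return x - alpha * y


def dominant_bpm(signal, fps, lo=0.7, hi=4.0):
    """Return the strongest frequency inside [lo, hi] Hz, in beats per minute."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    band = (freqs >= lo) & (freqs <= hi)
    return 60.0 * freqs[band][np.argmax(power[band])]
```

On a synthetic trace in which a common intensity disturbance is stronger than the pulse in every channel, the raw green channel tends to lock onto the disturbance, while the chrominance combination recovers the pulse frequency.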
Besides hand-crafted methods, there are also some data-driven methods designed for remote HR estimation. Tulyakov et al. divided the face into multiple regions of interest (ROI) to get a matrix temporal representation and used a matrix completion approach to purify rPPG signals. Hsu et al. generated time-frequency maps from different color signals and used them to learn an HR estimator. Although the existing data-driven approaches attempted to build learning-based HR estimators, they failed to build an end-to-end estimator. Besides, the features they used remain hand-crafted, which may not be optimal for the HR estimation task. Niu et al. proposed a general-to-specific learning strategy to address the problems of representing HR signals and lacking data. However, they did not investigate the choice of color spaces or the missing data situation.
Besides PPG-based HR measurement, ballistocardiographic (BCG) signals, which are the subtle head motions caused by cardiovascular circulation, can also be used for remote HR estimation. Inspired by the Eulerian magnification method, Balakrishnan et al. tracked key points on the face and used PCA to extract the pulse signal from the trajectories of the feature points. Since these methods are based on subtle motion, the subjects' voluntary movements introduce significant interference to the HR signals, which limits their use in real-life applications.
Although the published methods have made considerable progress in remote HR measurement, they still have limitations. First, the existing approaches are usually tested on well-controlled small-scale databases, which may not represent real-life situations. Second, most of the existing approaches are designed in a step-by-step way using hand-crafted features, which are based on specific assumptions and may fail under complex conditions. Data-driven methods based on large-scale databases are needed.
2.2 Public-domain Databases for Remote HR Estimation
Many of the published methods reported their performance on private databases, which makes performance comparison with other approaches difficult. The first evaluation on a public-domain database was introduced by Li et al. They evaluated their method on the MAHNOB-HCI database, which was designed for emotion detection and in which the subjects performed slight head movements and facial expressions. Later, in 2016, Tulyakov et al. introduced a new database, MMSE-HR, which was part of the MMSE database and in which the subjects' facial expressions were more varied. However, these two databases were originally designed for emotion analysis, and the subjects' motions were mainly limited to facial expression changes, which is far from enough for real-world remote HR estimation.
There are also a few publicly available databases specially designed for the task of remote HR estimation. Stricker et al. first released the PURE database collected by the camera of a mobile service robot. Hsu et al. released the PFF database containing 10 subjects under 8 different situations. These two databases are limited by the number of subjects and recording situations. In 2018, Li et al. proposed a database designed for HR and heart rate variability (HRV) measurement. Since this database aims at HRV analysis, all the recording situations are well-controlled, which makes remote HR estimation on it very easy.
The existing public-domain databases for remote HR estimation can be found in Table 1. As we can see from Table 1, all these databases are limited in either the number of subjects or the recording situations. A large-scale database recorded under real-life variations is required to push the studies on remote HR estimation.
L = Lab Environment, D = Dark Environment, B = Bright Environment, E = Expression,
S = Stable, SM = Slight Motion, LM = Large Motion, T = Talking,
C = Color Camera, N = NIR Camera, P = Smart Phone Frontal Camera
3 The VIPL-HR Database
In order to evaluate methods designed for real-world HR estimation, a database covering various face variations, such as head movements and illumination changes, as well as acquisition diversity, is needed. To fill this gap, we collected the VIPL-HR database, which contains more than one hundred subjects under various illumination conditions and different head movements. All the face videos are recorded by three different cameras, and the related physiological measurements, such as HR, SpO2, and blood volume pulse (BVP) signal, are simultaneously recorded. In this section, we introduce our VIPL-HR database from three aspects: i) setup and data collection, ii) video compression, and iii) database statistics.
3.1 Setup and Data Collection
We design our data collection procedure with two objectives in mind: i) videos should be recorded under natural conditions (i.e., head movement and illumination change) instead of well-controlled situations; and ii) videos should be captured using various recording devices to replicate the common case in daily life, i.e., smartphones, RGB-D cameras, and web cameras. The recording setup is arranged based on these two targets, which includes a computer, an RGB web-camera, an RGB-D camera, a smartphone, a finger pulse oximeter, and a filament lamp. The details of the device specifications can be found in Table 2.
Videos recorded from different devices are the core component of VIPL-HR database. In order to test the influence of cameras with different recording quality, we choose the widely used web-camera Logitech C310 and the color camera of RealSense F200 to record the RGB videos. At the same time, while smartphones have become an indispensable part of our daily lives, remote HR estimation from the videos recorded by smart phone cameras has not been studied yet. Thus, we use a HUAWEI P9 smart phone (with its frontal camera) to record the RGB face videos for the potential applications of remote HR estimation on mobile devices. Besides recording the RGB color videos, we also record the NIR face videos using a RealSense F200 to investigate the possibility of remote HR estimation under dim lighting conditions. Related physiological signals, including HR, SpO2, and BVP signals, are synchronously recorded with a CONTEC CMS60C BVP sensor.
|Device|Model|Specification|Output|
|---|---|---|---|
|Computer|Lenovo ThinkCentre|Windows 10 OS|N/A|
|Color camera|Logitech C310|25fps, 960×720 color camera|Color videos|
|RGB-D camera|RealSense F200|30fps, 640×480 NIR camera, 1920×1080 color camera|Color videos, NIR videos|
|Smart phone frontal camera|HUAWEI P9|30fps, 1920×1080 color camera|Color videos|
|BVP recorder|CONTEC CMS60C|N/A|HR, SpO2, and BVP signals|
The recording environmental setup is illustrated in Figure 2. The subjects are asked to sit in front of the cameras at two different distances: one meter and 1.5 meters. A filament lamp is placed aside the cameras to change the light conditions. Each subject is asked to sit naturally in front of the cameras, and daily activities such as talking and looking around are encouraged during the video recording. HR changes of the subject after exercises are also taken into consideration. The smartphone is first fixed in front of the subject for video recording, and then we asked the subject to hold the smartphone by themselves to record videos like a video chat scenario. Videos under nine different situations are recorded in total for each subject, and the details of these situations are listed in Table 3.
S = Stable, LM = Large Motion, T = Talking,
L = Lab Environment, D = Dark Environment, B = Bright Environment
3.2 Database Compression
Video compression is known to play an important role in video-based heart rate estimation. The raw data of VIPL-HR we collected is nearly 1.05TB in total, making it very inconvenient for public access. In order to balance the convenience of data sharing against the completeness of the HR signals, we investigate a compressed and resized version of our database that retains the HR signals as completely as possible. The compression methods we considered include video compression and frame resizing. The video compression codecs we take into consideration are ‘MJPG’, ‘FMP4’, ‘DIVX’, ‘PIM1’ and ‘X264’, which are commonly used video compression codecs. The resizing scales we consider are 1/2, 2/3, and 3/4 for each dimension of the original frame. We choose one of the widely used remote HR estimation methods, Haan2013, as the baseline to verify how the HR estimation accuracy changes after each compression approach.
The HR estimation accuracies of the baseline HR estimator on the various compressed videos are given in Fig. 3 in terms of root mean square error (RMSE). From the results, we can see that the ‘MJPG’ video codec is better at maintaining the HR signal in the videos while still reducing the size of the database significantly. Resizing the frames to two-thirds of the original resolution causes little damage to the HR signal. Therefore, we choose the ‘MJPG’ codec and two-thirds of the original resolution as our final data compression solution, and we obtain a compressed VIPL-HR dataset of about 48GB. Nevertheless, we would like to share both the uncompressed and compressed databases with the research community, according to the researchers' preference.
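For reference, the RMSE used to compare compression settings can be computed as in the short sketch below. The function name `hr_rmse` and the example values are illustrative assumptions, not data from the paper.

```python
import numpy as np


def hr_rmse(pred_bpm, true_bpm):
    """Root mean square error between estimated and ground-truth HRs (bpm)."""
    pred = np.asarray(pred_bpm, dtype=float)
    true = np.asarray(true_bpm, dtype=float)
    if pred.shape != true.shape:
        raise ValueError("pred_bpm and true_bpm must have the same shape")
    return float(np.sqrt(np.mean((pred - true) ** 2)))


# Hypothetical per-video estimates from one compressed version of a database.
example_rmse = hr_rmse([70.0, 80.0], [73.0, 76.0])
```

Running the same baseline estimator on each compressed version and comparing the resulting RMSE values against the uncompressed case gives the kind of comparison summarized in Fig. 3.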
3.3 Database Statistics
The VIPL-HR dataset contains a total of 2,451 color videos and 752 NIR videos from 107 participants (79 males and 28 females) aged between 22 and 41. Each video is recorded with a length of about 30s, and the frame rate is about 30fps (see Table 1). Some example video frames of one subject captured by different devices are shown in Fig. 4.
To further analyze the characteristics of our VIPL-HR database, we calculated the head pose variations using the OpenFace head pose estimator (https://github.com/TadasBaltrusaitis/OpenFace) for the videos with head movement (see Situation 2 in Table 3). Histograms of the maximum amplitudes of the three rotation components for all the videos can be found in Fig. 5. From the histograms, we can see that the maximum rotation amplitudes of the subjects vary over a large range in roll, pitch, and yaw. This is reasonable because every subject is allowed to look around during the video recording.
At the same time, in order to quantitatively demonstrate the illumination changes in VIPL-HR database, we have calculated the mean grey-scale intensity of face area for Situation 1, Situation 4, and Situation 5 in Table 3. The results are shown in Fig. 6. We can see that the mean gray-scale intensity varies from 60 to 212, covering complicated illumination variations.
A histogram of ground-truth HRs is also shown in Fig. 7. We can see that the ground-truth HRs in VIPL-HR vary from 47 bpm to 146 bpm, which covers the typical HR range (https://en.wikipedia.org/wiki/Heart_rate). The wide HR distribution in VIPL-HR fills the gap between lab-controlled databases and the HR distribution present in daily-life scenes. The relatively large size of VIPL-HR also makes it possible to use deep learning methods to build data-driven HR estimators.
4 Deeply Learned HR Estimator
With the less-constrained VIPL-HR database, we are able to build a data-driven HR estimator using deep learning methods. Following prior work, we propose a deep HR estimation method named RhythmNet. An overview of the RhythmNet procedure is shown in Fig. 8.
4.1 Spatial-temporal Maps Generation
In order to identify the face area within the video frames, we first use the face detector provided by the open-source SeetaFace engine (https://github.com/seetaface/SeetaFaceEngine) to get the face location and 81 facial landmarks (see Fig. 9). Since the facial landmark detection is able to run at a frame rate of more than 30 fps, we perform face detection and landmark detection on every frame in order to get consistent ROI localization in a face video sequence. A moving average filter is applied to the 81 landmark points to get more stable landmark localizations.
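The landmark smoothing step can be sketched as a centered moving average over time, as below. The window length of 5 frames and the function name `smooth_landmarks` are assumptions made for illustration; the text does not state the filter length that was actually used.

```python
import numpy as np


def smooth_landmarks(landmarks, window=5):
    """Temporal moving average over per-frame facial landmarks.

    landmarks: array of shape (T, 81, 2) with (x, y) coordinates per frame.
    window: number of frames averaged (assumed value, shrinks at the borders).
    """
    pts = np.asarray(landmarks, dtype=float)
    num_frames = pts.shape[0]
    half = window // 2
    out = np.empty_like(pts)
    for t in range(num_frames):
        lo = max(0, t - half)
        hi = min(num_frames, t + half + 1)
        # Average each landmark coordinate over the frames in the window.
        out[t] = pts[lo:hi].mean(axis=0)
    return out
```

A longer window gives steadier ROI boundaries but reacts more slowly to real head movement, so the length is a trade-off that would need tuning on actual videos.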
The most informative facial parts containing the color changes due to heart rhythms have been reported to be the cheek area and the forehead area. The ROI containing both the cheek and forehead areas is determined using the cheek border and chin locations, as shown in Fig. 9. Face alignment is first performed using the eye center points, and then a bounding box is defined whose width is set from the horizontal distance between the outer cheek border points and whose height is set from the vertical distance between the chin location and the eye center points. Skin segmentation is then applied to the defined ROI to remove non-face areas such as the eye region and the background.
A good representation of the HR signals is very important for training a deep HR estimator. Niu et al. directly used the average pixel values of the RGB channels as the HR signal representation, which may not be the best choice. Alternative color spaces derived from RGB video have been reported to be beneficial for obtaining a better HR signal representation. After testing alternative color spaces, we finally chose the YUV color space for further computation. The color space transform is a per-pixel conversion from RGB to YUV, and the final spatial-temporal map generation procedure is shown in Fig. 9.
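As a rough illustration of how a spatial-temporal map could be assembled from aligned face crops, the sketch below converts each frame to YUV, averages every block of a 5×5 grid per frame, and rescales each temporal signal. Several points are assumptions rather than details confirmed by this text: the BT.601 full-range RGB-to-YUV coefficients, the min-max rescaling to [0, 255], the omission of skin segmentation, and the names `RGB2YUV`, `YUV_OFFSET`, and `spatial_temporal_map`.

```python
import numpy as np

# Assumed BT.601 full-range RGB -> YUV conversion (one common definition).
RGB2YUV = np.array([
    [0.299, 0.587, 0.114],
    [-0.169, -0.331, 0.5],
    [0.5, -0.419, -0.081],
])
YUV_OFFSET = np.array([0.0, 128.0, 128.0])


def spatial_temporal_map(frames, grid=5):
    """Build a (grid*grid, T, 3) map from aligned RGB face crops.

    frames: array of shape (T, H, W, 3), RGB values in [0, 255].
    Skin segmentation is not applied in this sketch.
    """
    frames = np.asarray(frames, dtype=float)
    num_frames, height, width, _ = frames.shape
    yuv = frames @ RGB2YUV.T + YUV_OFFSET
    rows = np.linspace(0, height, grid + 1).astype(int)
    cols = np.linspace(0, width, grid + 1).astype(int)
    st_map = np.empty((grid * grid, num_frames, 3))
    for i in range(grid):
        for j in range(grid):
            block = yuv[:, rows[i]:rows[i + 1], cols[j]:cols[j + 1], :]
            # One mean value per frame and channel for this block.
            st_map[i * grid + j] = block.mean(axis=(1, 2))
    # Assumed normalization: rescale each block/channel signal to [0, 255].
    lo = st_map.min(axis=1, keepdims=True)
    hi = st_map.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (st_map - lo) / span * 255.0
```

Each row of the resulting map is the temporal trace of one face block, so the map can be treated as an image whose horizontal axis is time and fed to a convolutional network.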
4.2 Learning Strategy for HR Measurement
For each face video, we first divide it into small video clips using a fixed sliding window and then compute the spatial-temporal map of each clip, which is used to estimate the HR per video clip. The average HR of the face video is computed as the average of the HRs estimated from each clip. We choose ResNet18, which is commonly used in various computer vision tasks, as the convolutional backbone for learning the mapping from spatial-temporal maps to HRs. The ResNet18 architecture includes four blocks made up of convolutional layers and residual links, one convolutional layer, and one fully connected layer for the final prediction. The output of the network is a single HR value, which is normalized based on the frame rate of the face video. A loss measuring the distance between the predicted HR and the ground-truth HR is used for training.
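The clip-level aggregation described above can be sketched as follows. The per-clip estimator is passed in as a callable, `estimate_clip_hr`, which is a hypothetical stand-in for the trained network; the window and step sizes are parameters here because their values are not given in this text.

```python
import numpy as np


def clip_starts(num_frames, window, step):
    """Start indices of all full sliding windows over num_frames frames."""
    return list(range(0, num_frames - window + 1, step))


def video_hr(st_map, window, step, estimate_clip_hr):
    """Average the per-clip HR estimates over one video.

    st_map: array of shape (blocks, T, channels) for the whole video.
    estimate_clip_hr: hypothetical callable mapping one clip map to an HR
        in bpm (in practice, the trained network).
    """
    num_frames = st_map.shape[1]
    starts = clip_starts(num_frames, window, step)
    if not starts:
        raise ValueError("video is shorter than one sliding window")
    clip_hrs = [estimate_clip_hr(st_map[:, s:s + window]) for s in starts]
    return float(np.mean(clip_hrs))
```

Averaging over overlapping clips smooths out individual clip errors, at the cost of assuming the HR is roughly stable within a video.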
It is important to leverage the prior knowledge of HR signals during HR estimation. To this end, we use synthetic rhythm data for network pre-training. Specifically, our training strategy can be divided into three stages. First, we train our model using the large-scale image database ImageNet. Then the synthetic spatial-temporal maps are used to further guide the network to learn the prior knowledge of mapping an RGB video sequence into an HR value. With this prior knowledge, we can further fine-tune the neural network for the final HR estimation task using the real-life face videos.
Another situation we need to consider is that the face detector may fail for a short time interval, which commonly happens when the subject's head is moving or rotating. A face detection failure causes missing data in the HR signals. In order to handle this issue, we randomly mask (set to zero) the spatial-temporal maps along the time dimension to simulate the missing data caused by face detection failures. The masked spatial-temporal maps are found to be useful for training an HR estimator that is robust to face detection failures.
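The masking augmentation can be sketched as zeroing one random contiguous span along the time axis of a spatial-temporal map. The function name `random_time_mask` and the single-span design are assumptions; the mask length bounds are left as parameters because their values are not given in this text.

```python
import numpy as np


def random_time_mask(st_map, min_len, max_len, rng):
    """Zero a random contiguous span along the time axis of an ST map.

    st_map: array of shape (blocks, T, channels).
    min_len, max_len: inclusive bounds on the mask length in frames.
    rng: a numpy.random.Generator, so the augmentation is reproducible.
    Returns the masked copy and the (start, length) of the masked span.
    """
    masked = np.array(st_map, dtype=float, copy=True)
    num_frames = masked.shape[1]
    length = min(int(rng.integers(min_len, max_len + 1)), num_frames)
    start = int(rng.integers(0, num_frames - length + 1))
    # Simulate frames where face detection failed: all blocks are missing.
    masked[:, start:start + length, :] = 0.0
    return masked, (start, length)
```

In training, such a function would be applied to only a fraction of the maps, so the network sees both complete and partially missing inputs.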
5 Experiments
5.1 Database, Protocol, and Experimental Settings
In this paper, we use the VIPL-HR database for within-database testing and the MMSE-HR database for cross-database evaluation. Details about these two databases can be found in Table 1. We first perform participant-dependent five-fold cross-validation for within-database testing on the VIPL-HR database. Then, we directly train RhythmNet on the VIPL-HR database and test it on the MMSE-HR database. After that, the RhythmNet pre-trained on VIPL-HR is fine-tuned and tested on MMSE-HR.
For each 30-second video, a fixed-length sliding window is used for generating the spatial-temporal maps. We divide the face area into 25 blocks. A portion of the spatial-temporal maps is masked, and the mask length varies within a fixed range of frames. RhythmNet is implemented using the PyTorch framework (https://pytorch.org/). The Adam solver with an initial learning rate of 0.001 is applied to train the model, and the maximum number of training epochs is 50.
5.2 Within-database Testing
5.2.1 Experiments on Color Face Videos.
We first perform the within-database testing on the VIPL-HR database. We use the state-of-the-art methods (Haan2013, Tulyakov2016, Wang2017, and Niu2018) for comparison. The results of the individual methods are reported in Table 4.
From the results, we can see that the proposed method achieves promising results with an error of 8.94 bpm, which is much lower than that of the other methods. In order to further analyze the consistency between the ground-truth HR and the estimated HR, we draw a Bland-Altman plot for RhythmNet in Fig. 10. Again, it can be seen that our method achieves good consistency on the VIPL-HR database.
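For readers unfamiliar with Bland-Altman analysis, the quantities behind such a plot can be computed as in the sketch below: the per-video mean of the two measurements, their difference, the mean difference (bias), and the conventional 95% limits of agreement at bias ± 1.96 standard deviations. The function name `bland_altman` and the example numbers are illustrative and are not results from the paper.

```python
import numpy as np


def bland_altman(pred_bpm, true_bpm):
    """Return the quantities plotted in a Bland-Altman diagram.

    Output: (pair_means, differences, bias, (lower_limit, upper_limit)),
    where the limits are bias -/+ 1.96 sample standard deviations.
    """
    pred = np.asarray(pred_bpm, dtype=float)
    true = np.asarray(true_bpm, dtype=float)
    differences = pred - true
    pair_means = (pred + true) / 2.0
    bias = float(differences.mean())
    spread = float(differences.std(ddof=1))
    limits = (bias - 1.96 * spread, bias + 1.96 * spread)
    return pair_means, differences, bias, limits


# Hypothetical estimated and ground-truth HRs for four videos.
means, diffs, bias, limits = bland_altman([72, 80, 65, 90], [70, 82, 66, 85])
```

Plotting `diffs` against `means` with horizontal lines at `bias` and the two limits shows whether the estimation error depends on the HR level.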
5.2.2 Experiment on NIR Face Videos.
The experiments on NIR face videos are also conducted using the protocol described in Section 5.1. Since the NIR face videos have only one channel, no color space transformation is used, and we get one-channel spatial-temporal maps for the deep HR estimator. Very few methods have been proposed and evaluated on NIR data, so we only report the results of RhythmNet in Table 5. The Bland-Altman plot for the NIR data can also be found in Fig. 10.
5.3 Cross-database Testing
The cross-database experiments are then conducted on the MMSE-HR database. Specifically, we first train our model on the VIPL-HR database and directly test it on the MMSE-HR database. We also fine-tune the model on MMSE-HR to see whether fine-tuning can improve the HR estimation accuracy. All the results can be found in Table 6. The baseline methods we use for comparison are Li2014, Haan2013, and Tulyakov2016, and their performances are taken from previously reported results.
From the results, we can see that the proposed method achieves promising performance with an error of 10.58 bpm, even when we directly test our VIPL-HR pre-trained model on MMSE-HR. The error is further reduced to 8.22 bpm when we fine-tune the pre-trained model on MMSE-HR. Both results of the proposed approach are much better than those of previous methods. These results indicate that the variations in illumination, movement, and acquisition device in the VIPL-HR database help to learn an HR estimator that generalizes well to unseen scenarios. In addition, the proposed RhythmNet leverages the diverse information contained in VIPL-HR to learn a robust HR estimator.
6 Conclusion and Further Work
Remote HR estimation from a face video has wide applications; however, accurate HR estimation from the face in the wild is challenging due to the many variations in less-constrained scenarios. In this paper, we introduce the multi-modality VIPL-HR database for remote heart rate estimation under less-constrained conditions, such as head movement, illumination change, and camera diversity. We also propose RhythmNet, a data-driven heart rate estimator based on a CNN, to perform remote HR estimation. Benefiting from the proposed spatial-temporal map and the effective training strategy, our approach achieves promising HR accuracies in both within-database and cross-database testing.
In the future, besides investigating more effective HR signals representations, we are also going to establish models to leverage the relation between adjacent measurements from the sliding windows. In addition, detailed analysis of individual methods under various recording situations will be provided using the VIPL-HR database.
References
[1] Poh, M.Z., McDuff, D.J., Picard, R.W.: Non-contact, automated cardiac pulse measurements using video imaging and blind source separation. Opt. Express 18 (2010) 10762–10774
[2] Poh, M.Z., McDuff, D.J., Picard, R.W.: Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE Trans. Biomed. Eng. 58 (2011) 7–11
[3] Balakrishnan, G., Durand, F., Guttag, J.: Detecting pulse from head motions in video. In: Proc. IEEE CVPR. (2013) 3430–3437
[4] de Haan, G., Jeanne, V.: Robust pulse rate from chrominance-based rPPG. IEEE Trans. Biomed. Eng. 60 (2013) 2878–2886
[5] Li, X., Chen, J., Zhao, G., Pietikainen, M.: Remote heart rate measurement from face videos under realistic situations. In: Proc. IEEE CVPR. (2014) 4264–4271
[6] Tulyakov, S., Alameda-Pineda, X., Ricci, E., Yin, L., Cohn, J.F., Sebe, N.: Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. In: Proc. IEEE CVPR. (2016)
[7] Wang, W., den Brinker, A.C., Stuijk, S., de Haan, G.: Algorithmic principles of remote PPG. IEEE Trans. Biomed. Eng. 64 (2017) 1479–1491
[8] Soleymani, M., Lichtenauer, J., Pun, T., Pantic, M.: A multimodal database for affect recognition and implicit tagging. IEEE Trans. Affect. Comput. 3 (2012) 42–55
[9] Stricker, R., Müller, S., Gross, H.M.: Non-contact video-based pulse rate measurement on a mobile service robot. In: Proc. IEEE RO-MAN. (2014) 1056–1062
[10] Hsu, G.S., Ambikapathi, A., Chen, M.S.: Deep learning with time-frequency representation for pulse estimation. In: Proc. IJCB. (2017)
[11] Li, X., Alikhani, I., Shi, J., Seppanen, T., Junttila, J., Majamaa-Voltti, K., Tulppo, M., Zhao, G.: The OBF database: A large face video database for remote physiological signal measurement and atrial fibrillation detection. In: Proc. IEEE FG. (2018) 242–249
[12] Niu, X., Han, H., Shan, S., Chen, X.: SynRhythm: Learning a deep heart rate estimator from general to specific. In: Proc. IEEE ICPR. (2018)
[13] Verkruysse, W., Svaasand, L.O., Nelson, J.S.: Remote plethysmographic imaging using ambient light. Opt. Express 16 (2008) 21434–21445
[14] Lam, A., Kuno, Y.: Robust heart rate measurement from video using select random patches. In: Proc. IEEE ICCV. (2015) 3640–3648
[15] Wang, W., Stuijk, S., de Haan, G.: Exploiting spatial redundancy of image sensor for motion robust rPPG. IEEE Trans. Biomed. Eng. 62 (2015) 415–425
[16] Niu, X., Han, H., Shan, S., Chen, X.: Continuous heart rate measurement from face: A robust rPPG approach with distribution learning. In: Proc. IJCB. (2017)
[17] Wu, H.Y., Rubinstein, M., Shih, E., Guttag, J., Durand, F., Freeman, W.: Eulerian video magnification for revealing subtle changes in the world. ACM Trans. Graph. 31 (2012) 65
[18] Zhang, Z., Girard, J.M., Wu, Y., Zhang, X., Liu, P., Ciftci, U., Canavan, S., Reale, M., Horowitz, A., Yang, H., et al.: Multimodal spontaneous emotion corpus for human behavior analysis. In: Proc. IEEE CVPR. (2016) 3438–3446
[19] McDuff, D.J., Blackford, E.B., Estepp, J.R.: The impact of video compression on remote cardiac pulse measurement using imaging photoplethysmography. In: Proc. IEEE FG. (2017) 63–70
[20] Kwon, S., Kim, J., Lee, D., Park, K.: ROI analysis for remote photoplethysmography on facial video. In: Proc. EMBS. (2015) 851–862
[21] Tsouri, G.R., Li, Z.: On the benefits of alternative color spaces for noncontact heart rate measurements using standard red-green-blue cameras. J. Biomed. Opt. 20 (2015) 048002
[22] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. IEEE CVPR. (2016) 770–778
[23] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. IJCV 115 (2015) 211–252