Recovering remote Photoplethysmograph Signal from Facial videos Using Spatio-Temporal Convolutional Networks

by Zitong Yu, et al.

Recently, average heart rate (HR) can be measured relatively accurately from human face videos using non-contact remote photoplethysmography (rPPG). However, in many healthcare applications, knowing only the average HR is not enough; the measured blood volume pulse signal and its heart rate variability (HRV) features are also important. We propose the first end-to-end rPPG signal recovery system (PhysNet), which uses deep spatio-temporal convolutional networks to measure both HR and HRV features. PhysNet extracts spatial and temporal hidden features simultaneously from raw face sequences and outputs the corresponding rPPG signal directly. The temporal context information helps the network learn more robust features with less fluctuation. Our approach was tested on two datasets and achieved superior performance on HR and HRV features compared to state-of-the-art methods.




1 Introduction

The heart pulse is an important vital sign that needs to be measured in many circumstances for healthcare or medical purposes. Traditionally, electrocardiography (ECG) and pulse oximetry (which measures the photoplethysmograph, PPG) are the two most common ways of measuring heart activity. From ECG or PPG signals, doctors can obtain not only the basic average heart rate (HR) but also more detailed information, such as heart rate variability (HRV) features, to support their diagnosis. However, both ECG and PPG sensors need to be attached to body parts, which may cause discomfort and is inconvenient for long-term monitoring. To address this issue, remote photoplethysmography (rPPG) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], which aims to measure heart activity remotely without any contact, has developed rapidly in recent years.

In earlier studies of rPPG, most methods [1, 2, 3, 4, 5, 6, 7] can be seen as a two-stage pipeline, which first detects or tracks the face to extract the rPPG signal from the face video, and then estimates the corresponding average HR by frequency analysis. This framework is reasonable and builds on the original work [1], which showed that the rPPG signal can be measured under normal ambient light. However, these methods have two notable disadvantages. First, they can only measure the average HR over a time span, which is very limited information, while most medical applications require detailed information on the inter-beat interval (IBI) and HRV. Second, the traditional methods involve handcrafted features or filters, which may not generalize well and can easily lose important information related to the heartbeat.

Figure 1: Proposed rPPG signal recovering framework using PhysNet.

Recently, deep learning based methods have attracted more attention and have been used successfully in many vision tasks such as image classification [11]. There have also been several attempts to estimate HR remotely with deep learning [12, 13, 14, 15]. However, all of them have at least one of the following drawbacks: 1) They treat HR estimation as a one-stage average HR regression problem and lose the individual pulse peak information, so it is impossible to measure HRV features, which limits their use in healthcare. 2) They are not truly end-to-end, as the network inputs are not the original face videos; pre-processing steps or handcrafted features are still needed. 3) They are usually designed around 2-dimensional (2D) spatial neural networks without considering the temporal context of the facial features.

In this paper, we propose to train a spatio-temporal convolutional network to recover the rPPG signal from original face videos, aiming to locate each individual heartbeat peak accurately. Figure 1 shows the whole framework. The recovered rPPG signal resembles a PPG signal measured from a finger using an oximeter, and can be processed to obtain not only the average HR but also a detailed analysis of heart activity in terms of IBI and HRV.

The main contributions of this work are: 1) We utilize spatio-temporal convolutional networks for recovering rPPG signals, taking into account the temporal context that was ignored in previous work. 2) It is the first end-to-end deep network that can recover rPPG signals directly from a raw RGB face sequence. 3) It marks a milestone in measuring HRV features from recovered rPPG signals with a deep learning based method, which is more valuable in healthcare than measuring only the average HR, as was done in the past.

2 Related work

Previous methods on remote heart rate measurement, basic background of HRV measurement, and spatio-temporal networks are briefly reviewed in three sub-sections.

2.1 Remote Heart Rate Measurement

In the past few years, several studies explored measuring HR remotely from face videos by analysing facial color changes. Poh et al. [2] introduced independent component analysis to decompose the original RGB channel signals into independent non-Gaussian signals to measure HR.

After that, several methods based on region of interest (ROI) selection from the face were studied. In [5], Li et al. first defined and tracked a particular face ROI in every frame, then used a least mean square filter and non-rigid motion elimination to obtain a more robust vital signal. In [16], Lam and Kuno used multiple randomly sampled blocks from the already-defined ROI to form multiple smaller ROIs, and then used majority voting to make the final prediction. Tulyakov et al. [7] separated the warped face ROI into multiple blocks and used a matrix completion approach to yield a final filtered signal.

Besides these ROI based methods, other studies focused on color subspace projections. These works [6, 17] assumed that all skin-color pixels contribute to the desired signal. De Haan and Jeanne [6] proposed chrominance-based rPPG, which first projected the RGB channels into a chrominance subspace, then applied bandpass filters separately in the XY chrominance color space, and finally used temporally weighted summation for fusion. Wang et al. [17] proposed estimating a spatial subspace of selected skin pixels and measuring its temporal rotation to obtain the pulse signal.

These methods all contributed specific solutions for improving HR measurement accuracy. On the other hand, they require complex prior knowledge for ROI selection or skin-pixel detection, as well as handcrafted signal processing steps, which are hard to deploy and do not necessarily generalize well to new data. Besides, the majority of these works only aimed at obtaining the average HR and did not consider the accuracy of locating each individual pulse peak, which is a more challenging task.

There were a few one-stage deep learning based methods used in average HR estimation. In [12], Hsu et al. employed the short-time Fourier transform to build 2D time-frequency representations of the sequences as the inputs of a convolutional neural network (CNN) and regressed the HR directly. In a similar manner, Niu et al. [13] used a spatial-temporal map representation instead as the input of the CNN.

In addition to the one-stage methods, there were also a few two-stage deep learning based methods for remote HR estimation. Špetlík et al. [14] proposed HR-CNN, which first strictly aligned the faces in every frame and then used a CNN to predict the HR. In [15], Chen and McDuff deployed convolutional attention networks and used the normalized difference between every two successive frames as input to predict the pulse signal. These methods still needed handcrafted features or aligned face images as inputs, so they are not end-to-end frameworks. Moreover, they were all based on 2D CNNs, which lack the ability to learn long- and short-term temporal features of facial blood flow changes.

2.2 Heart Rate Variability Measurement

Most of the studies mentioned above focused only on average HR measurement. The HR counts the total number of heartbeats in a given time period, which is a coarse description of cardiac activity. HRV features, on the other hand, describe heart activity on a much finer scale and are computed by analysing the IBIs of pulse signals. The most common HRV features include low frequency (LF) power, high frequency (HF) power, and their ratio LF/HF, which are widely used in medical applications. In addition, the respiratory frequency (RF) can be estimated by analyzing the frequency power of the IBIs, as in [3] and [18]. Compared with estimating the average HR (a single number), measuring HRV features is clearly more challenging, as it requires accurately locating each individual pulse peak in time. For most healthcare applications, average HR is far from enough, and methods are needed that can measure heart activity at the HRV level.
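As a small illustration of how LF and HF are expressed in normalized units, the sketch below assumes the two band powers have already been estimated from the IBI series; the spectral estimation itself is outside this sketch. It uses the common definition LF/(LF+HF) and HF/(LF+HF), which is an assumption here, and the function name and example power values are hypothetical.

```python
def normalized_hrv(lf_power, hf_power):
    """Return (LF n.u., HF n.u., LF/HF) from absolute band powers.

    Assumes normalized units are defined relative to LF + HF.
    """
    if lf_power < 0 or hf_power <= 0:
        raise ValueError("band powers must be non-negative, and HF must be positive")
    total = lf_power + hf_power
    lf_nu = lf_power / total
    hf_nu = hf_power / total
    return lf_nu, hf_nu, lf_power / hf_power


# Hypothetical band powers in arbitrary units.
lf_nu, hf_nu, ratio = normalized_hrv(lf_power=300.0, hf_power=200.0)
print(lf_nu, hf_nu, ratio)
```

Under this definition the two normalized features always sum to one, so an error in LF is mirrored by an equal-sized error in HF, which would be consistent with the identical LF and HF error columns reported in Table 1.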

Figure 2: The proposed network: (a) the overall architecture; (b) the spatio-temporal convolutional block. All 4-dimensional tensors described below in each block are the corresponding output dimensions.


2.3 Spatio-Temporal Networks

With the rapid development of video multimedia applications, deep spatio-temporal networks play a crucial role in many computer vision fields (e.g., action recognition and video captioning). There are two mainstream spatio-temporal frameworks. First, 3D convolutional neural networks are widely used in video understanding as they can capture spatial and temporal context simultaneously. In [19], 3D Convolutional Networks (C3D) were first proposed by Tran et al. to aggregate spatial and temporal information at the same time for more robust decisions. Later, several extended versions (e.g., Pseudo-3D CNN (P3D) [20]) were developed for lower computation cost and better performance. Second, another framework (a CNN cascaded with a recurrent neural network) can also capture the temporal context among the CNN spatial features.

Existing spatio-temporal neural network frameworks such as P3D cannot be used directly in our task, because they are designed for particular tasks (e.g., action recognition) and squeeze the spatial and temporal information concurrently to obtain semantic features. We design a novel spatio-temporal convolutional framework to recover facial rPPG signals, which is able to represent long- and short-term spatio-temporal context features. Moreover, the proposed method is an end-to-end framework that takes raw face images as inputs and outputs rPPG signals, which can be used directly for subsequent HRV analysis.

3 Proposed Approach

In this section, we propose a spatio-temporal convolutional network which is able to recover the rPPG signals from a face video. The input of the network is a sequence of successive face images and its output is a 1D vector (rPPG signal).

3.1 Network Architecture

According to [6], two procedures are important for obtaining pulse information from facial videos. The first is to project RGB into another color space that better represents the pulse information. The color subspace then needs to be reprojected to remove irrelevant information (e.g., noise caused by illumination or motion) and reach the target signal space. We propose an end-to-end deep spatio-temporal convolutional network (denoted as PhysNet), which merges these two steps and produces the final rPPG signals in a single training step.

There are two basic rules for designing PhysNet. 1) For the task of rPPG signal recovery, the network should have a large enough receptive field in the spatial domain and a moderate receptive field in the temporal domain, as implied by [6]. 2) The network should be lightweight, without too many parameters, because the available training set is not large (200 videos from 100 subjects); otherwise it may easily over-fit.

The overall architecture of PhysNet is shown in Figure 2(a). The input of the network is T-frame face images with RGB channels. After several spatial and temporal convolution operations with average pooling, multi-channel manifolds are formed to represent the spatio-temporal features. Finally, the latent manifolds are projected into signal space using a simple pseudo 3D convolution operation with kernel to generate the predicted rPPG signal. The whole procedure can be formulated as


$$Y = g\big(f(X;\, \theta);\, w\big)$$

where $X = [x_1, \dots, x_T]$ are the input frames, and $Y$ is the output signal of the network. $f$ is the spatio-temporal model for subspace projection (all the convolution and pooling layers except the last one in Figure 2), $\theta$ is a concatenation of all convolutional filter parameters of this model, $g$ is the final signal projection (last channel-wise convolution layer in Figure 2), and $w$ is the set of its parameters.

As shown in Figure 2(b), the spatio-temporal convolutional block (ST Conv Block) uses separable convolutions like [20] for reducing redundant parameters compared to traditional 3D convolutions. ST Conv Block not only aggregates the spatial information with temporal variation, but also filters out the noisy fluctuation for obtaining smoother features, which is beneficial to project the raw RGB sequence into appropriate subspaces.
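To make the parameter saving of separable convolutions concrete, the following sketch counts weights for a full 3D convolution and for a spatial convolution followed by a temporal one. The 3x3x3 kernel and the 64 input and output channels are assumed values chosen for illustration, not the configuration of PhysNet, and biases are ignored.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Number of weights in a 3D convolution with a (kt, kh, kw) kernel."""
    return c_in * c_out * kt * kh * kw


def separable_params(c_in, c_out, kt, kh, kw):
    """Weights of a (1, kh, kw) spatial conv followed by a (kt, 1, 1) temporal conv."""
    spatial = conv3d_params(c_in, c_out, 1, kh, kw)
    temporal = conv3d_params(c_out, c_out, kt, 1, 1)
    return spatial + temporal


# Assumed channel counts and kernel size, for illustration only.
full = conv3d_params(64, 64, 3, 3, 3)
separable = separable_params(64, 64, 3, 3, 3)
print(full, separable, round(separable / full, 3))
```

With these assumed sizes the separable pair needs fewer than half the weights of the full 3D kernel, which matches the motivation of keeping the network lightweight.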

In terms of the implementation details, all the convolution operations use stride, and are cascaded with batch normalization and nonlinear activation function ReLU except the last convolution with kernel. In addition, spatial padding for all convolutions with and kernels are needed to keep a consistent spatial size, while there is no temporal padding for convolutions with kernel for obtaining smoother context information. AvgPool is employed instead of MaxPool, following [12] in order to achieve higher signal-to-noise ratio for the recovered rPPG measurement. For an input sequence of frames, the length of the corresponding recovered signal is .
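The effect of omitting temporal padding on the signal length follows from standard convolution arithmetic. The sketch below is illustrative only: the temporal kernel width of 3 and the count of four temporal convolutions are assumptions rather than values taken from the paper, and temporal pooling is ignored.

```python
def conv_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution along one axis."""
    return (length + 2 * padding - kernel) // stride + 1


def temporal_length(t_in, n_temporal_convs, kernel=3):
    """Length after stacking unpadded, stride-1 temporal convolutions."""
    length = t_in
    for _ in range(n_temporal_convs):
        length = conv_out_len(length, kernel)
    return length


# Assumed depth and kernel width; each unpadded layer trims kernel - 1 frames.
print(temporal_length(64, n_temporal_convs=4))
# With padding of 1 and a width-3 kernel the length is preserved.
print(conv_out_len(64, 3, padding=1))
```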

3.2 Loss Function

Besides designing the network architecture, we also need an appropriate loss function to guide the network. Compared to previous works in this field, one major step forward of the current study is that we aim to recover rPPG signals that predict the time location of each individual pulse peak as accurately as possible, so that we can not only measure the average HR but also compute the IBI for detailed HRV analysis. Figure 3 illustrates the problem that recovered signals with accurate average HR (blue curve) may be very erroneous in terms of peak locations and IBI. Our loss function uses negative Pearson correlation to minimize the peak location errors as:


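$$\mathcal{L} = 1 - \frac{T\sum_{i=1}^{T} y_i z_i - \sum_{i=1}^{T} y_i \sum_{i=1}^{T} z_i}{\sqrt{\Big(T\sum_{i=1}^{T} y_i^2 - \big(\sum_{i=1}^{T} y_i\big)^2\Big)\Big(T\sum_{i=1}^{T} z_i^2 - \big(\sum_{i=1}^{T} z_i\big)^2\Big)}}$$

where $T$ is the length of the signals, $y$ indicates the predicted rPPG signals, and $z$ indicates the ground truth PPG signals.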
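A negative Pearson correlation loss can be sketched in plain Python as follows. It is written as one minus the correlation so that a perfectly correlated prediction gives zero loss; that offset, the function name, and the synthetic signals are assumptions for illustration, and an actual training setup would use a differentiable tensor implementation instead.

```python
import math


def neg_pearson_loss(pred, target):
    """Return 1 - Pearson correlation between two equal-length signals."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("signals must have the same length of at least 2")
    t = len(pred)
    sum_x = sum(pred)
    sum_y = sum(target)
    sum_xy = sum(p * q for p, q in zip(pred, target))
    sum_x2 = sum(p * p for p in pred)
    sum_y2 = sum(q * q for q in target)
    num = t * sum_xy - sum_x * sum_y
    den = math.sqrt((t * sum_x2 - sum_x ** 2) * (t * sum_y2 - sum_y ** 2))
    if den == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return 1.0 - num / den


# Synthetic pulse-like signal used as a stand-in for ground truth.
truth = [math.sin(2 * math.pi * i / 20) for i in range(100)]
scaled = [3.0 * v + 5.0 for v in truth]   # same peaks, different gain and offset
inverted = [-v for v in truth]            # peaks where troughs should be

print(neg_pearson_loss(scaled, truth), neg_pearson_loss(inverted, truth))
```

Because the correlation ignores gain and offset, the loss depends on the shape and timing of the waveform rather than its amplitude, which is the property that matters for locating peaks.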

PPG signals, rather than ECG, are used as the ground truth for training our network for the following reason. PPG measured from fingers more closely resembles the rPPG measured from faces, as both measure peripheral blood volume changes, while ECG measures electrical activity and thus contains extra components that are not present in rPPG.

4 Experiments

Two datasets are employed in our experiments. First, we train the proposed PhysNet on the OBF dataset [18]. OBF provides facial videos with corresponding PPG signals measured from the finger as ground truth, which suits our training needs. The trained PhysNet is first tested on the OBF dataset and then on the MAHNOB-HCI dataset [21], which is commonly used in previous works.

| Method | HR (bpm) | RF (Hz) | LF (n.u.) | HF (n.u.) | LF/HF |
|---|---|---|---|---|---|
| Baseline [18] | -0.03 / 2.16 / 2.16 / 0.99 | 0.001 / 0.08 / 0.08 / 0.32 | 0.03 / 0.2 / 0.2 / 0.57 | 0.03 / 0.2 / 0.2 / 0.57 | -0.11 / 0.82 / 0.83 / 0.57 |
| PhysNet64_Spa | 23.06 / 10.9 / 25.5 / 0.62 | 0.001 / 0.09 / 0.09 / 0.02 | 0.09 / 0.26 / 0.27 / 0.16 | 0.09 / 0.26 / 0.27 / 0.16 | -0.33 / 0.99 / 1.04 / 0.16 |
| PhysNet128_Spa | 12.48 / 4.8 / 13.1 / 0.94 | 0.004 / 0.09 / 0.09 / 0.07 | 0.11 / 0.23 / 0.25 / 0.30 | 0.11 / 0.23 / 0.25 / 0.30 | -0.39 / 0.89 / 0.97 / 0.3 |
| PhysNet256_Spa | 6.81 / 3.76 / 8.32 / 0.96 | 0.004 / 0.09 / 0.09 / 0.08 | 0.06 / 0.21 / 0.22 / 0.45 | 0.06 / 0.21 / 0.22 / 0.45 | -0.25 / 0.89 / 0.92 / 0.41 |
| PhysNet64 | -0.48 / 3.14 / 3.17 / 0.97 | 0.001 / 0.07 / 0.07 / 0.46 | 0.03 / 0.15 / 0.16 / 0.74 | 0.03 / 0.15 / 0.16 / 0.74 | -0.13 / 0.66 / 0.67 / 0.71 |
| PhysNet128 | -0.41 / 1.88 / 1.92 / 0.99 | 0.001 / 0.07 / 0.07 / 0.47 | 0.04 / 0.16 / 0.17 / 0.71 | 0.04 / 0.16 / 0.17 / 0.71 | -0.15 / 0.69 / 0.7 / 0.68 |
| PhysNet256 | -0.46 / 4.14 / 4.32 / 0.95 | 0 / 0.07 / 0.07 / 0.37 | 0.04 / 0.18 / 0.19 / 0.63 | 0.04 / 0.18 / 0.19 / 0.63 | -0.17 / 0.76 / 0.78 / 0.59 |

Table 1: Summary results on OBF compared to baseline using the proposed framework. Each cell lists ME / SD / RMSE / R.

| Method | SD (bpm) | MAE (bpm) | RMSE (bpm) | R |
|---|---|---|---|---|
| SynRhythm [13] | 10.88 | - | 11.08 | - |
| HR-CNN [14] | - | 7.25 | 9.24 | 0.51 |
| DeepPhys [15] | - | 4.57 | - | - |
| PhysNet256_Spa | 9.91 | 7.68 | 9.97 | 0.64 |
| PhysNet128 | 8.75 | 6.85 | 8.76 | 0.69 |

Table 2: Results of estimated average HR using deep learning based methods on MAHNOB

4.1 Experimental Settings

Datasets. The OBF dataset [18] is used for both training and testing. OBF contains data recorded from 100 healthy adults. For each subject, there were two 5-minute sessions, in which the facial videos and the corresponding ground truth physiological signals (ECG and breathing signal measured from the chest, and PPG from the finger) were recorded simultaneously while the subject was sitting in a chair. The first session recorded the subject in a resting state; the subject then exercised for five minutes to elevate the HR, and the second session recorded the post-exercise subject for another five minutes. OBF contains videos recorded with both an RGB camera and an NIR camera. In the current study we only use the RGB videos, which were recorded at 60 fps with a resolution of 1920x1080.

The MAHNOB-HCI dataset [21] is used only for testing and not for training, because our training needs PPG as ground truth, which it does not provide. It includes 27 subjects, each with 20 face videos recorded at 61 fps with a resolution of 780x580. ECG signals were recorded with three channel sensors (EXG1, 2 and 3), and we used the second channel (EXG2) as the ground truth in evaluation. Altogether 527 videos are used in our experiments (13 videos were missing due to data loss). The original videos are of different lengths. In order to make a fair comparison with previous works [13, 14, 15], we followed the same routine and used 30 seconds (frames 306 to 2135) of each video.

Training Settings. Facial videos and corresponding PPG signals are synchronized before training. The input video clip size is tailored from three aspects with concerns of the memory loads.

1) Face cut: the face area is detected by the Viola-Jones face detector [22] on the first input image, coarsely cropped (Figure 4 left), and kept fixed through the whole clip. This step simply excludes the large number of irrelevant background pixels in the original videos. We do not specify any ROI as [13] did. Instead, PhysNet is expected to learn a reasonable weight distribution by itself (Figure 4 right) through training. The face images are normalized to 120x120.

Figure 3: Comparison of the recovered rPPG signals by different methods. The black curve is the ground truth PPG, the blue curve is output of baseline method [18], which gets accurate average HR but is erroneous for individual peak locations; the red curve is output of PhysNet with accurate peak locations.

2) Temporal down-sample: the input length T is limited by memory resources. According to [23], reducing the frame rate does not strongly impact the performance of recovering rPPG. We therefore down-sampled the videos to 30 fps, and the ground truth PPG signals were down-sampled to the same sampling rate.

3) Three levels of T: in order to explore the effect of input length, we set three input lengths of T = 64, 128 and 256 (noted as PhysNet64, PhysNet128 and PhysNet256).

4) Effect of temporal convolution: we evaluate the network with and without temporal convolution layers (purple blocks in Figure 2), giving two conditions: PhysNet (with temporal convolution) and PhysNet_Spa (without temporal convolution). The motivation is that PhysNet_Spa shares a similar idea with the network proposed in [14], which completely ignored the temporal context. We would like to demonstrate the contribution of temporal convolutions explicitly.
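The temporal down-sampling and the fixed-length inputs described in the list above can be sketched as follows. Frame indices stand in for images, and two details are assumptions not stated in the text: down-sampling is done by keeping every second frame, and clips are cut without overlap.

```python
def downsample(frames, src_fps=60, dst_fps=30):
    """Keep every (src_fps // dst_fps)-th frame; assumes an integer ratio."""
    if src_fps % dst_fps != 0:
        raise ValueError("this sketch only handles integer frame-rate ratios")
    return frames[:: src_fps // dst_fps]


def split_clips(frames, clip_len):
    """Cut a sequence into non-overlapping clips of clip_len frames."""
    n_clips = len(frames) // clip_len
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]


frames = list(range(600))          # 10 s at 60 fps, indices as placeholders
low = downsample(frames)           # 30 fps version
clips = split_clips(low, 128)      # T = 128, one of the three input lengths
print(len(low), len(clips), clips[1][0])
```

The same index selection would have to be applied to the PPG ground truth so that each clip stays aligned with its target signal.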

The proposed network is trained on an Nvidia P100 GPU using the PyTorch library. For data augmentation, online horizontal flipping with a probability of 50% is applied to the face images in the same T-length sequence. The Adam optimizer is used with a learning rate of 0.0001. We find that 15 epochs are enough for convergence.
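One reading of the flipping augmentation is that a single flip decision is made per clip and applied to every frame in it, so the face does not jump between orientations inside a sequence. The sketch below follows that reading, which is an assumption; frames are represented as nested lists and the function name is hypothetical.

```python
import random


def maybe_flip_clip(clip, rng, p=0.5):
    """Horizontally flip every frame of a clip with probability p.

    clip: list of frames; each frame is a list of rows; each row a list of pixels.
    """
    if rng.random() < p:
        return [[row[::-1] for row in frame] for frame in clip]
    return clip


# Two tiny 2x3 "frames" as placeholders for face images.
clip = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
rng = random.Random(0)
flipped = maybe_flip_clip(clip, rng, p=1.0)   # p=1.0 forces the flip
print(flipped[0][0], flipped[1][1])
```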

Testing Settings and Performance Metrics. The trained PhysNet is first tested on OBF and then on MAHNOB-HCI. We followed [3, 18] to process the recovered rPPG signal: detrending, normalization, interpolation back to 256 Hz, and then peak detection to get the IBI, as shown in Figure 1. Performance is then evaluated on two levels: the average HR level and the detailed HRV level. For HRV level evaluation, we followed [3] to calculate three commonly used HRV features (in normalized units, n.u.) and the RF (in Hz). Details about the features can be found in [3, 18]. OBF has both PPG and ECG, and for test results on OBF we used PPG as the ground truth. We also compared the results obtained with ECG and PPG, and their difference is trivial (less than 0.2). MAHNOB-HCI only has ECG, so we used ECG for evaluation.
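The last steps of this processing chain, normalization and peak detection to obtain IBIs, can be sketched as below on a synthetic 75 bpm signal sampled at 256 Hz. Detrending and interpolation are omitted, and the simple local-maximum detector is an assumption made for illustration; it is not the peak detection algorithm used in [3, 18].

```python
import math


def zscore(signal):
    """Normalize a signal to zero mean and unit (population) variance."""
    mean = sum(signal) / len(signal)
    std = math.sqrt(sum((v - mean) ** 2 for v in signal) / len(signal))
    return [(v - mean) / std for v in signal]


def find_peaks(signal, min_height=0.0):
    """Indices of local maxima that exceed min_height."""
    return [
        i for i in range(1, len(signal) - 1)
        if signal[i] > min_height and signal[i - 1] < signal[i] >= signal[i + 1]
    ]


def ibi_and_hr(peaks, fs):
    """Inter-beat intervals in seconds and the average HR in bpm."""
    ibis = [(b - a) / fs for a, b in zip(peaks, peaks[1:])]
    hr = 60.0 / (sum(ibis) / len(ibis))
    return ibis, hr


fs = 256
# Synthetic pulse at 1.25 Hz (75 bpm), 10 seconds long.
signal = [math.sin(2 * math.pi * 1.25 * n / fs) for n in range(10 * fs)]
peaks = find_peaks(zscore(signal))
ibis, hr = ibi_and_hr(peaks, fs)
print(len(peaks), round(hr, 1))
```

The IBI series, rather than the single HR value, is what HRV analysis works on, so errors in individual peak positions feed directly into the HRV features.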

Performance metrics for evaluating both the average HRs and HRV features include: the mean error (ME), the standard deviation (SD), the root mean square error (RMSE), the Pearson’s correlation coefficient (R), and the mean absolute error (MAE). These metrics are chosen to follow, and allow comparison with, previous works [13, 14, 15, 18].
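For reference, the five metrics can be computed as in the sketch below. The SD is taken as the population standard deviation of the errors, which is an assumption, and the HR values are made-up numbers used only to exercise the function.

```python
import math


def hr_metrics(pred, truth):
    """ME, SD, RMSE, MAE of the errors and Pearson's R between pred and truth."""
    n = len(pred)
    err = [p - t for p, t in zip(pred, truth)]
    me = sum(err) / n
    sd = math.sqrt(sum((e - me) ** 2 for e in err) / n)   # population SD
    rmse = math.sqrt(sum(e * e for e in err) / n)
    mae = sum(abs(e) for e in err) / n
    mp = sum(pred) / n
    mt = sum(truth) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, truth))
    r = cov / math.sqrt(
        sum((p - mp) ** 2 for p in pred) * sum((t - mt) ** 2 for t in truth)
    )
    return {"ME": me, "SD": sd, "RMSE": rmse, "MAE": mae, "R": r}


# Hypothetical predicted and reference HR values in bpm.
pred = [72.0, 80.5, 65.0, 90.0]
truth = [70.0, 82.0, 66.0, 88.0]
m = hr_metrics(pred, truth)
print(m)
```

With the population SD, the three error metrics satisfy RMSE² = ME² + SD², which gives a quick sanity check when reading result tables.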

4.2 Evaluation on OBF

OBF is used for both training and testing, and we perform subject-independent 10-fold cross validation, i.e., in each round 9 folds are used for training and the remaining fold for testing. PhysNet is evaluated at three T levels of 64, 128 and 256, and the output rPPG signals are concatenated into 30-second clips for evaluation to achieve reasonably stable HRV analysis [18]. PhysNet with and without (PhysNet_Spa) temporal convolution are also compared. All results on OBF are summarized in Table 1.
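A subject-independent split keeps all videos of a subject in the same fold, so no person appears in both training and testing. The sketch below shows one way to build such folds for 100 subjects with two sessions each; the round-robin assignment of subjects to folds and the helper names are assumptions for illustration.

```python
def subject_folds(subject_ids, k=10):
    """Partition the unique subject ids into k folds (round-robin)."""
    subjects = sorted(set(subject_ids))
    return [subjects[i::k] for i in range(k)]


def split_for_round(samples, folds, test_round):
    """Split (subject, session) samples into train and test for one round."""
    test_subjects = set(folds[test_round])
    train = [s for s in samples if s[0] not in test_subjects]
    test = [s for s in samples if s[0] in test_subjects]
    return train, test


# 100 subjects with two sessions each, as placeholders for the videos.
samples = [(subj, session) for subj in range(100) for session in (1, 2)]
folds = subject_folds([s[0] for s in samples])
train, test = split_for_round(samples, folds, test_round=0)
print(len(train), len(test))
```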

1) Comparison with baseline: We replicated the method in [18] to obtain the baseline results. Our replicated results are slightly better than the original ones in [18], perhaps due to different signal filtering and peak detection algorithms. Our PhysNet achieved better performance than the baseline not only on the average HR but also at the HRV feature level. In addition to the statistics, we visualized one pair of sample output signals in Figure 3 for direct comparison. The blue curve is the rPPG signal recovered with the baseline method in [18], and the red curve is the output of the current method. The blue curve may estimate the average HR acceptably (the number of peaks is correct), but the red curve is clearly more accurate in terms of the location of each peak, which matters for computing HRV features. Moreover, the red curve has fewer global fluctuations, indicating that PhysNet can also help with detrending.

2) Contribution of temporal convolution: The proposed full PhysNet (with temporal convolution) achieved better performance than the baseline at both the average HR and the HRV level. The errors (in terms of SD and RMSE) are reduced, and the correlations (in terms of R) are increased. In contrast, PhysNet without temporal convolution (PhysNet_Spa) did not help with the task and produced even worse results than the baseline. These results indicate that the temporal convolution layers are essential, and that with them the proposed PhysNet can improve pulse measurement accuracy, especially at the HRV level.

3) Effects of input length T: Results show that the input length T has different effects under the two network conditions. Without temporal convolution layers (PhysNet_Spa), the trend is clear that longer inputs lead to better performance; but with temporal convolution layers we are able to achieve significantly better results with much shorter inputs, i.e., PhysNet64 outperforms PhysNet256_Spa. The results are in line with our expectation, because the temporal convolution layers are employed to provide extra help in learning temporal representations.

4.3 Evaluation on MAHNOB

We apply the model trained on OBF and test it on MAHNOB-HCI as a cross-dataset validation. We test with three input length levels and with or without temporal convolution layers, the same as we did on OBF. We first evaluate the average HR results; the settings with the best performance are listed in Table 2 and compared with previous results. Without temporal convolution layers, PhysNet256_Spa achieved a similar performance level to the HR-CNN in [14], and better performance was achieved with the temporal convolution layers of our proposed PhysNet128. We cited the subject-independent results (Table 4 of [13]) of the SynRhythm method for a fair comparison, as the other results in Table 2 are all subject-independent, and our PhysNet128 achieved better results than theirs. DeepPhys [15] reported a lower MAE of HR, but did not report any other metric for a fuller comparison.

None of the previous works listed in Table 2 reported measurements at the HRV level, as their approaches only concerned average HR. We are the first and only ones so far to report remote HRV measurement results on MAHNOB-HCI (using PhysNet128), as listed in Table 3. The performance is somewhat lower than on OBF. However, considering that this is a cross-dataset evaluation without any prior knowledge about the test samples, the results are very promising and show the generalization capability of the proposed method.

| Metric | RF (Hz) | LF (n.u.) | HF (n.u.) | LF/HF |
|---|---|---|---|---|
| ME | -0.003 | 0.046 | 0.046 | -0.107 |
| SD | 0.11 | 0.22 | 0.22 | 0.52 |
| RMSE | 0.11 | 0.23 | 0.23 | 0.53 |

Table 3: HRV results on MAHNOB using PhysNet128

Figure 4: Visualization results for original face (left) and the corresponding neural features (right): (a) visualization on OBF; (b) visualization on MAHNOB.

We examined erroneous cases to plan our future work. We found that the neural features (activations after the first ST Conv Block) of some MAHNOB-HCI samples (Figure 4(b)) are noisier than those of OBF samples (Figure 4(a)): the facial skin pixels should have higher weight (as in Figure 4(a) right), while bright patches in the background or collar area (as in Figure 4(b) right) are erroneous noise. This is probably caused by the lack of training on MAHNOB-HCI samples. In future work we will adapt our approach to include MAHNOB-HCI samples in training, which we expect to improve the performance.

5 Conclusion

In this paper, we study the contribution of deep learning based methods to rPPG signal recovery. We propose an end-to-end framework with a spatio-temporal convolutional network that can recover rPPG signals directly from raw face sequences. The results show that the proposed PhysNet can learn the time location of each individual pulse peak, so that it can not only estimate average HR but also measure detailed HRV features including RF, LF and HF. We achieved promising results on the OBF and MAHNOB-HCI datasets.


References

  • [1] W. Verkruysse, L. O. Svaasand, and J. S. Nelson, “Remote plethysmographic imaging using ambient light,” Opt. Express, vol. 16, no. 26, pp. 21434–21445, Dec 2008.
  • [2] M.-Z. Poh, D. J. McDuff, and R. W. Picard, “Non-contact, automated cardiac pulse measurements using video imaging and blind source separation,” Opt. Express, vol. 18, no. 10, pp. 10762–10774, 2010.
  • [3] M.-Z. Poh, D. J. McDuff, and R. W. Picard, “Advancements in noncontact, multiparameter physiological measurements using a webcam,” IEEE Trans. Biomed. Eng., vol. 58, no. 1, pp. 7–11, 2011.
  • [4] G. Balakrishnan, F. Durand, and J. Guttag, “Detecting pulse from head motions in video,” in CVPR, 2013.
  • [5] X. Li, J. Chen, G. Zhao, and M. Pietikäinen, “Remote heart rate measurement from face videos under realistic situations,” in CVPR, 2014.
  • [6] G. de Haan and V. Jeanne, “Robust pulse rate from chrominance-based rppg,” IEEE Trans. Biomed. Eng., vol. 60, no. 10, pp. 2878–2886, 2013.
  • [7] S. Tulyakov, X. Alameda-Pineda, E. Ricci, L. Yin, J. F. Cohn, and N. Sebe, “Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions,” in CVPR, 2016.
  • [8] D. Tran, H. Lee, and C. Kim, “A robust real time system for remote heart rate measurement via camera,” in ICME, 2015.
  • [9] C. Yang, G. Cheung, and V. Stankovic, “Estimating heart rate and rhythm via 3d motion tracking in depth video,” IEEE Trans. Multimedia, vol. 19, no. 7, pp. 1625–1636, 2017.
  • [10] Y. Lin and Y. Lin, “Step count and pulse rate detection based on the contactless image measurement method,” IEEE Trans. Multimedia, vol. 20, no. 8, pp. 2223–2231, 2018.
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [12] G. Hsu, M.-S. Chen, and A. Ambikapathi, “Deep learning with time-frequency representation for pulse estimation,” in IJCB, 2017.
  • [13] X. Niu, H. Han, S. Shan, and X. Chen, “Synrhythm: Learning a deep heart rate estimator from general to specific,” in ICPR, 2018.
  • [14] R. Špetlík, V. Franc, and J. Matas, “Visual heart rate estimation with convolutional neural network,” in BMVC, 2018.
  • [15] W. Chen and D. McDuff, “Deepphys: Video-based physiological measurement using convolutional attention networks,” in ECCV, 2018.
  • [16] A. Lam and Y. Kuno, “Robust heart rate measurement from video using select random patches,” in ICCV, 2015.
  • [17] W. Wang, S. Stuijk, and G. de Haan, “A novel algorithm for remote photoplethysmography: Spatial subspace rotation,” IEEE Trans. Biomed. Eng., vol. 63, no. 9, pp. 1974–1984, 2016.
  • [18] X. Li, I. Alikhani, J. Shi, T. Seppänen, J. Junttila, K. Majamaa-Voltti, M. Tulppo, and G. Zhao, “The obf database: A large face video database for remote physiological signal measurement and atrial fibrillation detection,” in FG, 2018.
  • [19] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in ICCV, 2015.
  • [20] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3d residual networks,” in CVPR, 2017.
  • [21] M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic, “A multimodal database for affect recognition and implicit tagging,” IEEE Trans. Affect. Comput., vol. 3, no. 1, pp. 42–55, 2012.
  • [22] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in CVPR, 2001.
  • [23] E. Blackford and J. Estepp, “Effects of frame rate and image resolution on pulse rate measured using multiple camera imaging photoplethysmography,” in SPIE Medical Imaging, 2015, p. 94172D.