TransPPG: Two-stream Transformer for Remote Heart Rate Estimation

Non-contact heart rate estimation from facial video using remote photoplethysmography (rPPG) has shown great potential in many applications (e.g., remote health care) and has achieved creditable results in constrained scenarios. Practical applications, however, require accurate results even in complex environments with head movement and unstable illumination, so improving the performance of rPPG in such environments has become a key challenge. In this paper, we propose a novel video embedding method that embeds each facial video sequence into a feature map referred to as a Multi-scale Adaptive Spatial and Temporal Map with Overlap (MAST_Mop). The map contains not only vital information but also the surrounding region as a reference, which acts as a mirror of the homogeneous perturbations, such as illumination instability, imposed on foreground and background simultaneously. Correspondingly, we propose a two-stream Transformer model that maps the MAST_Mop to heart rate (HR): one stream follows the pulse signal in the facial area while the other captures the perturbation signal from the surrounding region, so that the difference between the two streams yields adaptive noise cancellation. Our approach significantly outperforms current state-of-the-art methods on two public datasets, MAHNOB-HCI and VIPL-HR. To the best of our knowledge, this is the first work to use a Transformer backbone to capture the temporal dependencies in rPPG and to apply a two-stream scheme that treats background interference as a mirror of the corresponding perturbation on foreground signals for noise tolerance.


1 Introduction

Heart rate (HR) is one of the most significant vital signs reflecting people's physical status. HR monitoring plays an important role in a variety of applications such as health care, early detection of cardiovascular diseases, and characterizing emotional status in psychology tests. Traditional heart rate measurement generally relies on electrocardiography (ECG) and photoplethysmography (PPG) [Shelley et al.2001], both of which need professional equipment with sensors attached to the skin, which is generally uncomfortable or inconvenient. To avoid these problems, remote photoplethysmography (rPPG) [Verkruysse et al.2008] was proposed. Its target is to recover the volumetric change of blood over time from facial videos captured remotely by webcams, based on measuring subtle color changes of the skin. The resulting signal is known as the blood volume pulse (BVP), from which HR can be estimated. However, the extracted BVP is significantly disturbed in realistic situations, because skin color changes in videos are caused by both vital (blood volume) and non-vital factors (e.g., illumination and head movement), and the amplitude of the latter is much greater than that of the former, leading to a very low signal-to-noise ratio. Therefore, extracting BVP from facial videos remains a challenging task.

Figure 1: Simplified skin reflection model. When a beam of light irradiates the surface of the skin, part of it is absorbed by different tissues while the rest is reflected.

In recent years, many rPPG-based methods have been proposed to obtain more accurate and stable HR values. [Poh et al.2010] utilized independent component analysis (ICA) to obtain cleaner BVP signals. CHROM [De Haan and Jeanne2013] and POS [Wang et al.2016] transform colors from RGB space to new color spaces designed with prior knowledge to mitigate the adverse impact of illumination conditions. [Tulyakov et al.2016] selected regions of interest (ROIs) adaptively and computed BVP via matrix completion. Recently, several deep learning based approaches have been proposed. [Špetlík et al.2018] first extracted features via spatial decomposition and temporal filtering from a particular face region, and then utilized a CNN to map the features to an HR value. [Chen and McDuff2018] designed a convolutional attention network that takes the normalized difference between frames as input to predict BVP signals. [Niu et al.2019] proposed a novel spatial-temporal feature referred to as MST_Map, extracted from multiple small face ROIs, and then utilized a CNN-RNN model for HR estimation. [Yu et al.2019] proposed a two-stage CNN that first enhances the quality of videos and then uses the enhanced videos to obtain a more stable BVP. [Niu et al.2020] utilized disentangled representation learning to denoise the raw BVP signals extracted from videos so as to deal with unstable environmental factors.

However, all the aforementioned approaches share some common defects. First, they focus only on detecting pulse patterns from the face region (the foreground) and ignore the information contained in the background, which is key for noise removal: the signal from the background can act as a mirror that reveals the homogeneous perturbation imposed on the foreground face region, such as disturbances caused by illumination. Second, during data processing they crop and resize the face region to handle inconsistent face sizes, and use facial landmarks to align face regions to handle head movement; however, resizing and alignment change pixel values of the original frames and thus introduce new noise. Third, almost all the learning-based approaches use CNNs to estimate HR, which are not good at handling temporal information. Therefore, in this paper, we propose a new framework to address these problems. Our contributions are as follows:


  • We propose a novel video embedding method that embeds videos into feature maps referred to as MAST_Mop, which solves the problems of inconsistent face size and head movement without introducing noise, since no resize operation is needed for alignment.

  • We propose a two-stream Transformer network that utilizes both foreground and background information to estimate HR, where the background stream perceives interference arising from the background so as to reduce its negative impact on foreground signals. To the best of our knowledge, it is the first work with a Transformer as backbone to capture the temporal dependencies in rPPG and to apply a two-stream scheme that treats background interference as a mirror of the corresponding perturbation on foreground signals.

  • The proposed approach achieves state-of-the-art performance on two public datasets, MAHNOB-HCI [Soleymani et al.2011] and VIPL-HR [Niu et al.2018].

Figure 2: Overview of our framework. We first embed the facial video into MAST_Mop and then feed it to a two-stream Transformer, in which one stream learns the foreground features while the other learns the background features. After that, we subtract the background features from the foreground features and use a multilayer perceptron (MLP) predictor to estimate HR.

2 Related Work

2.1 Photoplethysmography

Photoplethysmography (PPG) [Shelley et al.2001] is a widely used contact HR measurement technology; devices such as finger-clip HR meters and smart watches are PPG applications. Its principle is to estimate HR by measuring the heartbeat-induced change of blood volume in the capillaries near the skin surface. More specifically, as shown in Figure 1, the blood volume in capillaries increases with cardiac systole and decreases with diastole. According to the Lambert-Beer law [Lambert1760], the absorption of light by blood varies with its volume, while the absorption by other tissues (e.g., epidermis, dermis) is constant. Therefore, a light source with constant intensity is used to irradiate the skin surface: during systole the absorption of light increases, so the intensity of reflected light decreases, and the opposite happens during diastole. A photosensitive sensor receives the reflected light, and by analyzing the change of its intensity we obtain the BVP and can calculate the heart rate correspondingly.
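To make the last step concrete, the following sketch estimates HR from a BVP-like intensity trace by locating the dominant frequency within a typical human heart-rate band; the band limits, sampling rate, and synthetic signal are illustrative assumptions rather than settings from this paper.

```python
import numpy as np

def estimate_hr_from_bvp(bvp, fs, hr_band=(0.7, 3.0)):
    """Estimate heart rate (bpm) from a 1-D BVP signal sampled at fs Hz.

    The dominant frequency within hr_band (Hz) is taken as the pulse
    frequency; 0.7-3.0 Hz corresponds to roughly 42-180 bpm.
    """
    bvp = np.asarray(bvp, dtype=float)
    bvp = bvp - bvp.mean()                      # remove the DC component
    spectrum = np.abs(np.fft.rfft(bvp)) ** 2    # power spectrum
    freqs = np.fft.rfftfreq(len(bvp), d=1.0 / fs)
    band = (freqs >= hr_band[0]) & (freqs <= hr_band[1])
    peak_freq = freqs[band][np.argmax(spectrum[band])]
    return peak_freq * 60.0                     # Hz -> beats per minute

# Toy usage: a synthetic 75-bpm pulse observed for 10 s at 30 fps.
fs = 30.0
t = np.arange(0, 10, 1.0 / fs)
toy_bvp = np.sin(2 * np.pi * 1.25 * t) + 0.1 * np.random.randn(t.size)
print(estimate_hr_from_bvp(toy_bvp, fs))        # approximately 75 bpm
```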

2.2 Remote-photoplethysmography

The first study of remote photoplethysmography (rPPG) was reported in [Verkruysse et al.2008] and builds on the traditional PPG principle. Traditional PPG requires a light transmitter with constant intensity and a photosensitive receiver, both in contact with the skin. In [Verkruysse et al.2008], the authors suggested using natural illumination instead of a transmitter and a webcam as the receiver, which is also able to capture the change of skin color. Meanwhile, considering the dense capillaries in human faces, rPPG extracts BVP signals from the face region instead of from the finger as in traditional PPG.

However, natural illumination is not constant: many factors such as air humidity and dust affect the intensity of light. At the same time, head motion has a remarkable impact on the consistency of the perceived color signals, so early rPPG algorithms such as [Poh et al.2010, Li et al.2014, Verkruysse et al.2008] only work well in well-controlled scenes, which is not practical.

To improve the accuracy and robustness of rPPG algorithms, many studies have been conducted in recent years, which can be divided into two categories: traditional hand-crafted methods and learning-based methods. The traditional approaches aim to extract more robust BVP signals using different color spaces [De Haan and Jeanne2013, Wang et al.2016] or different regions of interest (ROIs) [Li et al.2014, Tulyakov et al.2016]. Besides, independent component analysis (ICA) [Poh et al.2010] and principal component analysis (PCA) [Lewandowska et al.2011] are widely used to obtain cleaner BVP signals. However, most of these methods rely on prior assumptions that may not hold in less-constrained and complicated scenarios, resulting in significant degradation of accuracy and stability in HR estimation.

In addition, a number of learning-based rPPG algorithms have appeared with the rise of deep learning. [Chen and McDuff2018] proposed a CNN model that deals with head motion by using an attention mechanism to focus on more important regions so that more stable BVP signals can be extracted. [Yu et al.2019] proposed a two-stage model: an encoder-decoder first enhances the quality of the videos, and a CNN model then predicts HR from the enhanced videos. [Niu et al.2019] first proposed a new representation of videos and then used a CNN-RNN model to estimate HR from it. [Niu et al.2020] proposed a framework based on disentangled representation learning so that noise can be filtered out more efficiently. [Sabokrou et al.2021] proposed a two-stage framework that first learns a discriminative representation of face videos and then uses an auto-encoder to regress the HR value. These methods have made many breakthroughs by improving network structures and feature extraction. However, they also share some common defects: first, they extract features only from face regions, which loses background information that could act as a noise model to refine the foreground information; second, all of them use CNN-based models, which do not work well in capturing temporal dependencies.

2.3 Transformer

The Transformer was first proposed by Google in [Vaswani et al.2017]. The model stacks self-attention layers and fully connected layers in its encoder and decoder, without recurrence or convolution. It was first applied to various tasks in natural language processing (NLP), such as the BERT model [Devlin et al.2018] that replaces Word2Vec [Le and Mikolov2014]; BERT uses a Transformer backbone, which captures the bidirectional relationships in sentences more thoroughly. Moreover, thanks to its powerful ability to capture long-term context, the Transformer has also been applied to time-series processing [Li et al.2019], where the encoder is used to obtain latent features for regression. Later, [Zhou et al.2021] proposed the ProbSparse self-attention mechanism to reduce time complexity, together with a generative-style decoder that acquires a long sequence output with only one forward step, avoiding cumulative error during inference.

Figure 3: Video embedding. We first get the bounding boxes of the face (red box) and the background (green box) regions spanned by the 81 landmarks. Next, we split each cropped frame into blocks and obtain one temporal sequence per block. We repeat this process with several different numbers of blocks and finally concatenate the results into the MAST_Mop.

3 Method

Figure 2 gives an overview of our framework, which has two steps. First, we embed a facial video into two feature maps, both referred to as MAST_Mop, that represent the foreground and background information, respectively. Second, a two-stream Transformer model learns a mapping from the MAST_Mop to the HR value.

3.1 Video embedding

The target of video embedding is to embed a video into a feature map that contains both spatial and temporal information. Our video embedding approach has two steps. First, we crop a particular region from every frame of the video. Second, we compute a spatial map for every frame and aggregate all spatial maps into a spatiotemporal map, the MAST_Mop.

3.1.1 Video cropping:

For each frame, as shown in Figure 3, we first localize 81 facial landmarks with the open-source toolkit Seetaface [CAS2019]. A face bounding box is then obtained whose width is the horizontal distance between the outer cheek border points and whose height is 1.2 times the vertical distance between the chin and the center of the eyebrows, the same as in [Niu et al.2019]. This box corresponds to the foreground but includes both vital and non-vital information, the latter arising, for example, from unstable illumination. Using only the foreground makes it difficult to separate vital from non-vital information without a reference model of pure non-vital information. Since the background generally contains only non-vital information, we let it serve as that reference model, so that its counterpart in the foreground can be identified and suppressed by comparing foreground to background. Such a scheme holds because the perturbations on the foreground and the surrounding background are in general homogeneous. To this end, we crop regions on both the left and right sides of the face box and concatenate them to form the background region. Formally, a video consists of a number of frames, each with 3 RGB channels and a given height and width. From the facial landmarks we first obtain the face bounding box of each frame, defined by its upper-left and lower-right corner points, and the background region is then cropped from the strips adjacent to the left and right sides of this box.

Note that this step is not trivial: it incorporates the background area into the receptive field so that perturbations caused by illumination instability are reflected in foreground and background simultaneously and homogeneously due to their adjacency.
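The cropping step can be sketched as follows; the landmark handling, the background strip width (bg_ratio), and the helper name crop_face_and_background are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def crop_face_and_background(frame, landmarks, bg_ratio=0.25):
    """Crop the face box and two adjacent background strips from one frame.

    frame: H x W x 3 array; landmarks: N x 2 array of (x, y) facial points.
    The face box spans the landmarks horizontally, with its height scaled by
    1.2 as described in the paper; bg_ratio controls the width of each
    background strip relative to the face box (an assumed value).
    """
    xs, ys = landmarks[:, 0], landmarks[:, 1]
    x1, x2 = int(xs.min()), int(xs.max())           # outer cheek borders
    y1, y2 = int(ys.min()), int(ys.max())           # eyebrow top to chin
    h = int(1.2 * (y2 - y1))
    y1 = max(0, y2 - h)                             # extend the box upward

    face = frame[y1:y2, x1:x2]

    strip_w = int(bg_ratio * (x2 - x1))
    left_bg = frame[y1:y2, max(0, x1 - strip_w):x1]
    right_bg = frame[y1:y2, x2:min(frame.shape[1], x2 + strip_w)]
    background = np.concatenate([left_bg, right_bg], axis=1)
    return face, background
```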

3.1.2 MAST_Mop for Representing BVP:

Our MAST_Mop is inspired by the MST_Map introduced in [Niu et al.2019], which first applies a sliding window to partition every frame into blocks, then concatenates the average pixel value of each block at the corresponding location across the video sequence, and finally concatenates the per-block sequences into a feature map (MST_Map) that represents the BVP. MST_Map utilizes the whole face instead of a single specific region as in [Li et al.2014], making the information richer, and the average pooling over blocks makes it more robust. However, it still has some defects. First, to keep the number of blocks per frame identical and thus guarantee a fixed feature dimension, the cropped frames are resized to fit a fixed window size, which may introduce new noise from image scaling. Second, only one fixed window size is used for sliding, making the features too monotonous. To avoid these problems, we propose MAST_Mop. As shown in Figure 3, the differences from MST_Map are as follows:


  • Since the resize operation is noise-prone, instead of resizing to normalize cropped regions of different sizes, we use a fixed number of sliding windows to cover each cropped region, so that the size of the sliding window adapts to the size of the cropped region and the noise-inducing resize operation is avoided. Concretely, the window size is determined by the number of sliding windows together with the height and width of the region cropped from each frame, and adjacent windows overlap so that the fixed number of windows exactly covers the region (see the sketch after this list).

  • To further improve the robustness of feature extraction, we apply several different numbers of sliding windows so as to incorporate multiple resolutions. For each number of windows we compute the corresponding feature map, and we then concatenate all the maps into a multi-scale feature map, which promises robustness against the varying size of the ROI across frames in various scenarios.
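A minimal sketch of this embedding, under the assumptions that block features are RGB means and that three block counts (3, 5, 7) are used, is given below; the function names and scale values are ours and do not reproduce the paper's exact formulation.

```python
import numpy as np

def _window_starts(length, n):
    """n overlapping window start positions covering [0, length)."""
    w = max(1, int(np.ceil(2 * length / (n + 1))))   # adaptive window size
    starts = np.linspace(0, length - w, n).astype(int)
    return starts, w

def spatial_map(region, n_blocks):
    """Average-pool an H x W x 3 region with an n_blocks x n_blocks grid of
    overlapping windows whose size adapts to the region, so the original
    pixels are never resized."""
    h, w, _ = region.shape
    ys, wh = _window_starts(h, n_blocks)
    xs, ww = _window_starts(w, n_blocks)
    feat = np.zeros((n_blocks, n_blocks, 3))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            block = region[y:y + wh, x:x + ww]
            feat[i, j] = block.reshape(-1, 3).mean(axis=0)  # RGB mean
    return feat.reshape(-1)                      # flatten to one row

def mast_mop(regions, scales=(3, 5, 7)):
    """Build a multi-scale spatio-temporal map from a list of per-frame
    cropped regions (foreground or background). Each row holds the
    concatenated multi-scale block features of one frame; rows index time."""
    rows = [np.concatenate([spatial_map(r, s) for s in scales]) for r in regions]
    return np.stack(rows, axis=0)                # shape: T x sum_s(s * s * 3)
```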

Method                     | MAHNOB-HCI          | VIPL-HR
                           | RMSE   std    r     | MAE    RMSE   std    r
[Poh et al.2010]           | 13.6   24.3   0.08  | -      -      -      -
[De Haan and Jeanne2013]   | 22.36  13.67  0.21  | 11.4   16.9   15.1   0.28
[Li et al.2014]            | 7.62   6.88   0.81  | -      -      -      -
[Tulyakov et al.2016]      | 6.23   5.81   0.83  | 11.5   17.2   15.3   0.30
[Chen and McDuff2018]      | -      -      -     | 11.0   13.8   13.6   0.11
[Yu et al.2019]            | 5.93   5.57   0.88  | -      -      -      -
[Yu et al.2020]            | 5.10          0.86  | -      -      -      -
[Niu et al.2020]           | 5.23   5.10   7.92  | 5.02   7.97   7.92
Ours                       | 4.80
Table 1: Comparison of our method with state-of-the-art methods for HR estimation. Values in bold show the best result.

3.2 Two-stream Transformer

As concluded in [Li et al.2014], the features extracted from the foreground can be regarded as a superposition of non-vital information (such as illumination changes) over vital information (BVP), while the features extracted from the background contain only non-vital information. This motivates us to let the system become aware of the non-vital information disturbing the foreground by referring to the background. To this end, we propose a network composed of two independent streams. As shown in Figure 2, both streams take a MAST_Mop as input: the fore-stream learns foreground features, while the back-stream learns background features. We then subtract the background features from the foreground features to obtain stable and cleaner BVP features. Since background and foreground are adjacent, non-vital information such as unstable illumination is imposed on them simultaneously and homogeneously, which makes the subtraction between the two streams work like adaptive noise cancellation. Finally, we use a multilayer perceptron (MLP) as the predictor to estimate HR from the BVP features, trained under the supervision of the ground-truth HR.

It is also worth noting that, to deal with the long-range temporal dependencies in BVP signals, we use the encoder proposed in Informer [Zhou et al.2021] to extract features from the MAST_Mop rather than the original Transformer encoder [Vaswani et al.2017]. After video embedding we obtain a feature map for the whole video. However, the HR may change remarkably if the video is too long, while the granularity of HR is too coarse to be precisely captured if the video is too short; we therefore cut the whole video into small clips of fixed length and compute the corresponding features per clip. Each clip feature is fed to the network, which outputs the corresponding heart rate under the supervision of the ground truth.

For long-term estimation, we simply perform the estimation for each clip separately and then average the predictions over all clips.
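A minimal PyTorch sketch of this two-stream design is shown below, with a standard nn.TransformerEncoder standing in for the Informer encoder used in the paper; the layer sizes, temporal mean pooling, and feature dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransPPGSketch(nn.Module):
    """Two-stream encoder: foreground and background MAST_Mop clips are
    encoded separately, background features are subtracted from foreground
    features, and an MLP head regresses HR."""

    def __init__(self, feat_dim, d_model=128, n_layers=3, n_heads=4):
        super().__init__()
        self.embed_fg = nn.Linear(feat_dim, d_model)
        self.embed_bg = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=256,
                                           batch_first=True)
        # Stand-ins for the Informer encoders used in the paper.
        self.fore_stream = nn.TransformerEncoder(layer, n_layers)
        self.back_stream = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(),
                                  nn.Linear(64, 1))

    def forward(self, fg, bg):
        # fg, bg: (batch, clip_len, feat_dim) MAST_Mop rows per frame.
        f = self.fore_stream(self.embed_fg(fg))
        b = self.back_stream(self.embed_bg(bg))
        clean = (f - b).mean(dim=1)          # noise-cancelled clip feature
        return self.head(clean).squeeze(-1)  # predicted HR (bpm)

# Toy usage: batch of 2 clips, 300 frames, 225-dim MAST_Mop rows (assumed).
model = TransPPGSketch(feat_dim=225)
hr = model(torch.randn(2, 300, 225), torch.randn(2, 300, 225))
```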

4 Experiments

In this section, we provide evaluations of the proposed framework on two widely used datasets: MAHNOB-HCI [Soleymani et al.2011] and VIPL-HR [Niu et al.2018].

4.1 Datasets

MAHNOB-HCI [Soleymani et al.2011] is a multimodal dataset with videos of subjects participating in two experiments: emotion elicitation and implicit tagging. It contains 27 subjects (12 males and 15 females) in total, and all 527 videos have a resolution of 780×580 and come with the corresponding ECG signals.

VIPL-HR [Niu et al.2018] is a large public dataset whose goal is to promote research on remote HR estimation under less-constrained scenarios, with head movement, illumination variation, and acquisition-device diversity. It contains 2371 RGB facial videos from 107 subjects, recorded with three different devices (webcam, RealSense, and smartphone) under varying illumination and pose, together with ground-truth HR records.

Figure 4: Scatter plot comparing the ground-truth HR and the predicted HR on MAHNOB-HCI.

4.2 Experimental setting

We implement our framework with PyTorch [Paszke et al.2017] on a GeForce RTX 2080 Ti GPU. Both the fore-stream and the back-stream are composed of 3 standard Informer encoders [Zhou et al.2021]. For MAHNOB-HCI [Soleymani et al.2011], we compute the ground-truth HR from the ECG waveforms of the second channel (EXG2) provided in the data. For training, every video is cut into 10-second clips with a moving-forward step of 0.5 second. For testing, following earlier works such as [Li et al.2014] and [Niu et al.2019], we use a 30-second clip (from second 10 to second 40) to evaluate our model. For VIPL-HR [Niu et al.2018], we train in the same way as for MAHNOB-HCI and use a 20-second clip (from second 0 to second 20) for testing, as many videos in the dataset are shorter than 30 seconds. We use the Adam optimizer [Kingma and Ba2014] to train our framework for 80 epochs with a learning rate of 0.0001 for MAHNOB-HCI and for 120 epochs with a learning rate of 0.00005 for VIPL-HR.
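The clip-based training schedule can be sketched as follows; the 30 fps frame rate, the feature dimension, and the placeholder model are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def slice_clips(video_feat, fps=30, clip_sec=10, step_sec=0.5):
    """Cut a (T, feat_dim) per-video feature map into overlapping fixed-length
    clips, mirroring the 10-second clips with 0.5-second stride used here."""
    clip_len, step = int(clip_sec * fps), int(step_sec * fps)
    return [video_feat[s:s + clip_len]
            for s in range(0, video_feat.shape[0] - clip_len + 1, step)]

# Reported optimizer settings: Adam, lr 1e-4 for 80 epochs (MAHNOB-HCI),
# lr 5e-5 for 120 epochs (VIPL-HR). `model` stands in for the two-stream net.
model = nn.Linear(225, 1)                      # placeholder module
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

clips = slice_clips(torch.randn(900, 225))     # a 30 s video at 30 fps
print(len(clips), clips[0].shape)              # 41 clips of shape (300, 225)
```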

We compare our framework with several methods using multiple metrics: mean absolute error (MAE), root mean square error (RMSE), mean error and standard deviation of the error (std), and the Pearson correlation coefficient (r).
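For reference, these metrics can be computed as in the short sketch below; the function name is ours, and SciPy is assumed to be available for the Pearson correlation.

```python
import numpy as np
from scipy.stats import pearsonr

def hr_metrics(pred, gt):
    """MAE, RMSE, mean error, std of error, and Pearson r between predicted
    and ground-truth HR values (both 1-D arrays in bpm)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    err = pred - gt
    return {
        "MAE": np.abs(err).mean(),
        "RMSE": np.sqrt((err ** 2).mean()),
        "Mean": err.mean(),
        "Std": err.std(),
        "r": pearsonr(pred, gt)[0],
    }

print(hr_metrics([72.0, 80.5, 65.2], [70.0, 82.0, 66.0]))
```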

4.3 Results

4.3.1 Results on MAHNOB-HCI:

We use 422 videos for training and 105 videos for testing. The performance of our method compared with other state-of-the-art methods is shown in Table 1, and the comparison between the ground-truth and predicted HR is plotted in Figure 4. Note that [Niu et al.2019, Yu et al.2019, Sabokrou et al.2021] are all learning-based methods. As can be seen, our method obtains substantially better results than these baselines. From Table 1 and Figure 4, we can see that our framework estimates HR more accurately and more stably, with no large deviations in the prediction results.

4.3.2 Results on VIPL:

We use 1910 videos for training and 468 videos for testing. The performance of our method compared with other state-of-the-art methods is shown in Table 1, and the comparison between the ground-truth and predicted HR is plotted in Figure 5. Note that all methods referred to in Table 1 are learning-based except [De Haan and Jeanne2013] and [Tulyakov et al.2016]. From the results, we can see that our method achieves a much higher accuracy, with MAE = 4.94 and RMSE = 7.42; besides, the lowest std of 7.30 shows the better stability of our method.

Figure 5: Scatter plot comparing the ground-truth HR and the predicted HR on VIPL-HR.

4.4 Ablation experiments

To validate the effectiveness of the two-stream network, we use MAST_Mop as input to train a one-stream network containing only the foreground stream, so as to check whether the proposed method really benefits from the two-stream adaptive noise suppression. Furthermore, to validate the contribution of the MAST_Mop, we use STMap [Niu et al.2019] instead of MAST_Mop as input to this one-stream network, so as to check how the performance varies with the feature. A feature comparison on the two-stream framework is also conducted. The results are shown in Table 2: applying the two-stream structure and the MAST_Mop reduces the RMSE on the two datasets by 0.54 and 0.68, respectively.

Method                               | MAHNOB-HCI | VIPL-HR
                                     | RMSE       | MAE    RMSE
fore-stream + STMap                  | 5.42       | 5.12   8.10
fore-stream + back-stream + STMap    | 5.18       | 5.62   7.71
fore-stream + MAST_Mop               | 5.23       | 5.10   7.92
TransPPG                             |            |
Table 2: Ablation experiments.

5 Conclusion

In this paper, we proposed a novel framework for accurate and stable HR estimation under less-constrained scenarios, which achieves remarkably better results under multiple metrics. Our framework advances the literature with two innovations: the video embedding and the two-stream Transformer. The video embedding maps a video to a feature map that represents the vital and non-vital information with multi-resolution window sizes while avoiding the noise-prone resize operation. The two-stream Transformer estimates HR from the vital and non-vital information and reduces interference by letting the two channels serve as reference profiles for each other. We do not use CNN-based models because Transformer-based models are better at capturing long-term context in the extracted feature maps. To the best of our knowledge, this is the first work with a Transformer as backbone to capture the temporal dependencies in rPPG and to apply a two-stream scheme that treats background interference as a mirror of the corresponding perturbation on foreground signals. We evaluated our framework on two widely used benchmarks, and the results show that it outperforms the state-of-the-art methods.

References

  • [CAS2019] CAS. Seetaface2. https://github.com/seetafaceengine/SeetaFace2, 2019. Accessed: 2021-07-15.
  • [Chen and McDuff2018] Weixuan Chen and Daniel McDuff. Deepphys: Video-based physiological measurement using convolutional attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 349–365, 2018.
  • [De Haan and Jeanne2013] Gerard De Haan and Vincent Jeanne. Robust pulse rate from chrominance-based rppg. IEEE Transactions on Biomedical Engineering, 60(10):2878–2886, 2013.
  • [Devlin et al.2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [Kingma and Ba2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [Lambert1760] JH Lambert. Photometria sive de mensura et gradibus luminis colorum et umbrae augsburg. Detleffsen for the widow of Eberhard Klett, 1760.
  • [Le and Mikolov2014] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196. PMLR, 2014.
  • [Lewandowska et al.2011] Magdalena Lewandowska, Jacek Ruminski, Tomasz Kocejko, and Jędrzej Nowak. Measuring pulse rate with a webcam—a non-contact method for evaluating cardiac activity. In 2011 Federated Conference on Computer Science and Information Systems (FedCSIS), pages 405–410. IEEE, 2011.
  • [Li et al.2014] Xiaobai Li, Jie Chen, Guoying Zhao, and Matti Pietikainen. Remote heart rate measurement from face videos under realistic situations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4264–4271, 2014.
  • [Li et al.2019] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in Neural Information Processing Systems, 32:5243–5253, 2019.
  • [Niu et al.2018] Xuesong Niu, Hu Han, Shiguang Shan, and Xilin Chen. Vipl-hr: A multi-modal database for pulse estimation from less-constrained face video. In Asian Conference on Computer Vision, pages 562–576. Springer, 2018.
  • [Niu et al.2019] Xuesong Niu, Shiguang Shan, Hu Han, and Xilin Chen. Rhythmnet: End-to-end heart rate estimation from face via spatial-temporal representation. IEEE Transactions on Image Processing, 29:2409–2423, 2019.
  • [Niu et al.2020] Xuesong Niu, Zitong Yu, Hu Han, Xiaobai Li, Shiguang Shan, and Guoying Zhao. Video-based remote physiological measurement via cross-verified feature disentangling. In European Conference on Computer Vision, pages 295–310. Springer, 2020.
  • [Paszke et al.2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • [Poh et al.2010] Ming-Zher Poh, Daniel J McDuff, and Rosalind W Picard. Non-contact, automated cardiac pulse measurements using video imaging and blind source separation. Optics express, 18(10):10762–10774, 2010.
  • [Sabokrou et al.2021] Mohammad Sabokrou, Masoud Pourreza, Xiaobai Li, Mahmood Fathy, and Guoying Zhao. Deep-hr: Fast heart rate estimation from face video under realistic conditions. Expert Systems with Applications, page 115596, 2021.
  • [Shelley et al.2001] Kirk Shelley, Stacey Shelley, and Carol Lake. Pulse oximeter waveform: photoelectric plethysmography. Clinical monitoring, 2, 2001.
  • [Soleymani et al.2011] Mohammad Soleymani, Jeroen Lichtenauer, Thierry Pun, and Maja Pantic. A multimodal database for affect recognition and implicit tagging. IEEE transactions on affective computing, 3(1):42–55, 2011.
  • [Špetlík et al.2018] Radim Špetlík, Vojtech Franc, and Jirí Matas. Visual heart rate estimation with convolutional neural network. In Proceedings of the British Machine Vision Conference, Newcastle, UK, pages 3–6, 2018.
  • [Tulyakov et al.2016] Sergey Tulyakov, Xavier Alameda-Pineda, Elisa Ricci, Lijun Yin, Jeffrey F Cohn, and Nicu Sebe. Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2396–2404, 2016.
  • [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [Verkruysse et al.2008] Wim Verkruysse, Lars O Svaasand, and J Stuart Nelson. Remote plethysmographic imaging using ambient light. Optics express, 16(26):21434–21445, 2008.
  • [Wang et al.2016] Wenjin Wang, Albertus C den Brinker, Sander Stuijk, and Gerard De Haan. Algorithmic principles of remote ppg. IEEE Transactions on Biomedical Engineering, 64(7):1479–1491, 2016.
  • [Yu et al.2019] Zitong Yu, Wei Peng, Xiaobai Li, Xiaopeng Hong, and Guoying Zhao. Remote heart rate measurement from highly compressed facial videos: an end-to-end deep learning solution with video enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 151–160, 2019.
  • [Yu et al.2020] Zitong Yu, Xiaobai Li, Xuesong Niu, Jingang Shi, and Guoying Zhao. Autohr: A strong end-to-end baseline for remote heart rate measurement with neural searching. IEEE Signal Processing Letters, 27:1245–1249, 2020.
  • [Zhou et al.2021] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of AAAI, 2021.