1 Introduction
The gold standard for monitoring instantaneous heart rate (iHR) is the electrocardiogram (ECG) [1]. Another popular noninvasive technique is the photoplethysmogram (PPG) [2, 3]. Both techniques require direct skin contact with the subject, which might not be suitable in contexts such as driver drowsiness detection or sleep monitoring. PPG relies on measuring the rapid variations in light absorption in an illuminated skin region caused by the difference between the absorption curves of oxygenated and deoxygenated blood. This principle motivated the use of digital cameras to measure plethysmographic signals from face videos under ambient light conditions [4, 5, 6]. Several methodologies for estimating heart rate from face videos have been developed over the years [7, 8, 9, 10, 11, 12]. In particular, [13] provides a comprehensive overview of the history of research in this area and compares the performance of some of these approaches. As a general rule, most of these methods need an illumination source, depend on color band manipulation, and require control over the signal acquisition process (e.g., controlled light sources, or subjects remaining motionless during acquisition).
The recent inclusion of infrared (IR) cameras in many conventional devices, coupled with their resilience to low-light and variable-light conditions, makes them especially attractive for remote monitoring in the context of iHR detection. Their use has only recently started to be explored for heart rate detection from infrared face videos [14, 15], but so far these approaches are limited to estimating a heart rate average over a considerable time frame (over 30 seconds). This paper shows that, under controlled motion conditions, it is feasible to extract even sub-second approximations of the iHR using basic spatio-temporal and time-frequency analysis.
We describe this approach and evaluate its performance on IR face videos acquired with a Kinect camera from 7 healthy volunteers. The extracted iHR is compared against simultaneously acquired ECG and contact PPG ground truth signals.
2 Noncontact PPG signal from IR video
Here we describe the proposed algorithm to construct the noncontact PPG signal from an IR face video and hence extract the instantaneous heart rate. We divide this process into three main steps. The first step is detecting and segmenting the face in the video into disjoint spatial regions. Secondly, we take the mean activity of each region, denoise it, and decompose it into a smaller subset of sources. Finally, we introduce a signal quality index to select the signals of interest, and combine them to construct the noncontact PPG signal we are after. Figure 1 summarizes the initial preprocessing stages, while Figure 2 shows an example of the subsequent recovery process of the noncontact PPG.
2.1 Input IR video
For each subject, denote the recorded IR video as $\{F_t\}_{t=1}^{N}$, where $F_t$ denotes the frame recorded at time $t\Delta t$, which is of size $h \times w$ (height $\times$ width). Suppose the video is sampled every $\Delta t$ seconds; that is, sampled at $f_s = 1/\Delta t$ Hz, and the recording starts at time $0$ and lasts for $T$ seconds. We have thus $N = T f_s$ frames. We additionally assume that the subject's head is fixed, so major movements between frames are ignored.
2.2 Preprocessing the IR video
We detect the boundaries of the face by applying the Dlib landmark detector [16] to the average face location; frame-by-frame detection is not performed, since the subject is assumed to be immobile. An example is shown in Figure 1(a). We then divide the area inside the detected face into disjoint regions following a predefined mesh grid.
Denote those disjoint regions as $R_i$, $i = 1, \ldots, n$. For each video frame $F_t$, $t = 1, \ldots, N$, we compute the mean IR value on each region,
$$x_i(t) = \frac{1}{|R_i|} \sum_{p \in R_i} F_t(p),$$
where $|R_i|$ denotes the number of pixels in $R_i$. As a result, we obtain the data matrix
$$X = [x_i(t)]_{i,t} \in \mathbb{R}^{n \times N}.$$
In other words, the $i$th row of the matrix $X$ contains a time series with the mean IR activity over region $R_i$; there are $n$ such regions defined across the face. We will refer to these rows as channels. Note that data matrices of this form are commonly encountered in spatio-temporal analysis. For the purposes of this study, the face is subdivided into regions using a non-overlapping pixel grid. See Figure 1(b) for an illustration.
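As a sketch of this construction, the per-region averaging over a grid can be implemented as follows. The bounding-box tuple and the 8x8 grid size are illustrative assumptions; in the paper the face boundary comes from the Dlib detector and the grid is predefined.

```python
import numpy as np

def region_means(frames, bbox, grid=(8, 8)):
    """Mean IR intensity per grid cell, per frame.

    frames : array of shape (n_frames, H, W), the IR video
    bbox   : (top, bottom, left, right) face bounding box; in the paper this
             comes from the Dlib landmark detector (the tuple form is ours)
    grid   : number of cells along (rows, cols); 8x8 is an arbitrary choice
    """
    top, bottom, left, right = bbox
    rows = np.linspace(top, bottom, grid[0] + 1).astype(int)
    cols = np.linspace(left, right, grid[1] + 1).astype(int)
    X = np.empty((grid[0] * grid[1], frames.shape[0]))
    k = 0
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = frames[:, rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
            X[k] = cell.mean(axis=(1, 2))  # one time series per region
            k += 1
    return X  # data matrix of shape (n_regions, n_frames)
```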
To denoise the time series, we apply to each channel in the data matrix $X$ a band-pass Butterworth filter whose cut-off frequencies (in bpm) comfortably accommodate most normal heart rates; this choice is based on the physiological knowledge of the range in which a normal subject's heart rate lies. Denote the filtered signals as the data matrix $\tilde{X}$.
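A minimal sketch of this filtering step, using a zero-phase Butterworth band-pass; the filter order and the 40-180 bpm cut-offs are illustrative values, not the paper's exact settings.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_channels(X, fs, lo_bpm=40.0, hi_bpm=180.0, order=5):
    """Zero-phase Butterworth band-pass applied to each channel (row) of X.

    lo_bpm, hi_bpm, order : illustrative values chosen to cover normal
    heart rates; the paper's exact cut-offs are not reproduced here.
    """
    sos = butter(order, [lo_bpm / 60.0, hi_bpm / 60.0], btype='bandpass',
                 fs=fs, output='sos')
    # filtfilt runs the filter forward and backward, cancelling phase delay
    return sosfiltfilt(sos, X, axis=1)
```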
2.3 Low rank spatiotemporal model
We assume that the IR video captures $d$ different physiological dynamics, such as respiration, body movement, and hemodynamics, among others. Denote these physiological sources as $s_1, \ldots, s_d \in \mathbb{R}^{N}$. Note that in general $s_i$ and $s_j$ might not be orthogonal when $i \neq j$; for example, the hemodynamics and respiration might be coupled due to the respiratory sinus arrhythmia.
The data matrix is then modeled as a mixture of these source signals with additive and uncorrelated noise,
$$\tilde{X} = A S + \sigma Z, \qquad (1)$$
where $S = [s_1, \ldots, s_d]^{\top} \in \mathbb{R}^{d \times N}$ contains the physiological source signals, $A \in \mathbb{R}^{n \times d}$ is the source mixture matrix, $Z \in \mathbb{R}^{n \times N}$ is a noise matrix with independent and identically distributed entries with zero mean, unit variance, and finite fourth moment, and $\sigma > 0$ is a scalar constant that describes the noise variance. In other words, the recorded signal on each region is a mixture of the different sources via $A$, contaminated by noise. We further make the low-rank assumption that $d$ is fixed and small. This assumption means that there are limited sources of physiological dynamics captured by the IR video.

2.4 Determine important sources
Due to the low-rank assumption and the high-dimensional nature of the spatio-temporal model, we apply the singular value decomposition (SVD) to the data matrix $\tilde{X}$:
$$\tilde{X} = U \Lambda V^{\top}, \qquad (2)$$
where $U \in \mathbb{R}^{n \times n}$ consists of the left singular vectors, $V \in \mathbb{R}^{N \times n}$ consists of the right singular vectors, and $\Lambda \in \mathbb{R}^{n \times n}$ is the diagonal matrix of singular values $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$. Denote $u_i$ and $v_i$ to be the $i$th left and right singular vectors, respectively. Note that $V$ contains the relevant temporal signals that are mixed in each region, and $U$ their weight in each spatial location. An example is illustrated in Figure 2.

Denote $\hat{d}$ the number of singular values exceeding a threshold determined by the noise level $\sigma$. Since the noise level is in general not known, we estimate it as proposed in [17]:
$$\hat{\sigma} = \frac{\lambda_{\mathrm{med}}}{\sqrt{N \mu_\beta}},$$
where $\lambda_{\mathrm{med}}$ is the median singular value of $\tilde{X}$ and $\mu_\beta$ is the median of the Marchenko-Pastur distribution [18] with parameter $\beta = n/N$. Applying this procedure to $\tilde{X}$ reduced the number of retained singular values considerably on average.
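The thresholding step can be sketched as follows. As a stand-in for the median-based noise estimate, this sketch uses the Gavish-Donoho polynomial approximation of the optimal hard-threshold coefficient for unknown noise level [17]; the function name and interface are ours.

```python
import numpy as np

def truncate_rank(X):
    """Estimate the rank of X by hard-thresholding its singular values.

    Sketch: uses the Gavish-Donoho polynomial approximation omega(beta)
    for unknown noise, with the median singular value as a robust noise
    proxy, in place of the paper's exact Marchenko-Pastur-based estimate.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    m, n = X.shape
    beta = min(m, n) / max(m, n)
    # polynomial fit of the optimal coefficient from Gavish & Donoho
    omega = 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43
    tau = omega * np.median(s)          # data-driven hard threshold
    r = int(np.sum(s > tau))            # number of retained singular values
    return U[:, :r], s[:r], Vt[:r], r
```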
2.5 Reconstructing the noncontact PPG signal
Due to the non-orthogonal nature of the physiological sources, we cannot recover $S$ directly from $\tilde{X}$ by applying standard blind source separation techniques. We thus propose the following procedure to reconstruct the noncontact PPG signal.
Define a signal quality index (SQI) for a signal $x$ of length $N$ as
$$Q(x) = \frac{\sum_{|\xi - \xi_0| \leq \Delta\xi} |\hat{x}(\xi)|^2}{\sum_{\xi} |\hat{x}(\xi)|^2},$$
where $\hat{x}$ is the Fourier transform of the time series $x$, $\xi_0$ is the expected heart rate of a normal subject, and $\Delta\xi$ is a small bandwidth. Note that $Q(x)$ quantifies how concentrated the time series $x$ is around $\xi_0$ in the frequency domain.

We rank all temporal signals $\lambda_i v_i$, where $i = 1, \ldots, \hat{d}$, according to their SQIs $Q(\lambda_i v_i)$. Consider the reordering permutation $\pi$ so that $Q(\lambda_{\pi(1)} v_{\pi(1)}) \geq \cdots \geq Q(\lambda_{\pi(\hat{d})} v_{\pi(\hat{d})})$. Our hemodynamic estimator, the noncontact PPG signal denoted as $\hat{s}$, is defined as
$$\hat{s} = \sum_{i=1}^{K} \lambda_{\pi(i)} v_{\pi(i)},$$
for $K$ chosen by the user. Here we determine $K$ by greedily accumulating the sources until the maximal quality is achieved; that is,
$$K = \arg\max_{k \in \{1, \ldots, \hat{d}\}} Q\Big( \sum_{i=1}^{k} \lambda_{\pi(i)} v_{\pi(i)} \Big).$$
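A sketch of the SQI ranking and greedy accumulation. The concentration-ratio SQI, its half-bandwidth parameter, and the singular-value weighting of the sources are our assumptions about details not fully spelled out above.

```python
import numpy as np

def sqi(x, fs, f0, half_band=0.2):
    """Fraction of spectral energy within +/- half_band Hz of the expected
    heart-rate frequency f0 (one plausible instance of the SQI in the text)."""
    spec = np.abs(np.fft.rfft(x - x.mean()))**2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= f0 - half_band) & (freqs <= f0 + half_band)
    return spec[band].sum() / spec.sum()

def combine_sources(V, s, fs, f0):
    """Greedily accumulate singular-value-weighted sources (rows of V) in
    decreasing SQI order, keeping the partial sum with the best SQI."""
    order = np.argsort([-sqi(v, fs, f0) for v in V])
    acc = np.zeros(V.shape[1])
    best, best_q = None, -np.inf
    for idx in order:
        acc = acc + s[idx] * V[idx]
        q = sqi(acc, fs, f0)
        if q > best_q:               # keep the best partial sum seen so far
            best_q, best = q, acc.copy()
    return best
```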
Figure 2 shows an outline of the recovery process for the noncontact PPG over the full face. Figure 3 shows the recovered iHR signal when the proposed method is applied independently to the channels contained in each of the five major facial areas; this is provided merely for illustration purposes, as using the entire facial area provided the best results in general. Figure 4 shows short time segments of the noncontact PPG compared against the ground truth contact PPG. Additional examples are provided in the following sections.
2.6 Estimation of the instantaneous heart rate
Denote the short-time Fourier transform (STFT) of the constructed noncontact PPG signal $\hat{s}$ as $S(t, \xi)$, where $S(t, \xi)$ is the STFT coefficient at time $t$ and frequency $\xi$. From the STFT we extract the dominant curve $c^{*}$ using the curve extractor proposed in [19],
$$c^{*} = \arg\max_{c} \sum_{t} \log \frac{|S(t, c(t))|^{2}}{\sum_{\xi} |S(t, \xi)|^{2}} - \mu \sum_{t} |c(t+1) - c(t)|^{2}, \qquad (3)$$
where $\mu > 0$ is a regularization constant. The iHR at time $t$ is thus determined by
$$\mathrm{iHR}(t) = c^{*}(t).$$
Figure 5 shows the obtained iHR and the STFT of the noncontact PPG signal; the ground truth iHR from the ECG is also shown for comparison.
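A simplified stand-in for the curve extractor of [19]: a dynamic-programming ridge over the STFT magnitude with a quadratic smoothness penalty. The parameter values and the frequency band are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.signal import stft

def extract_ihr(ppg, fs, lam=0.1, fmin=0.6, fmax=3.0, nperseg=256):
    """Extract the dominant smooth ridge of the STFT and return iHR in bpm.

    Sketch only: a penalized-ridge dynamic program standing in for the
    curve extractor of [19]; lam, fmin, fmax, nperseg are illustrative.
    """
    freqs, times, S = stft(ppg, fs=fs, nperseg=nperseg)
    band = (freqs >= fmin) & (freqs <= fmax)   # restrict to heart-rate band
    freqs, P = freqs[band], np.abs(S[band])**2
    P = np.log(P / P.sum(axis=0, keepdims=True) + 1e-12)
    n_f, n_t = P.shape
    penalty = lam * (freqs[:, None] - freqs[None, :])**2
    cost = P[:, 0].copy()
    back = np.zeros((n_f, n_t), dtype=int)
    for t in range(1, n_t):
        trans = cost[None, :] - penalty        # score of each predecessor bin
        back[:, t] = np.argmax(trans, axis=1)
        cost = P[:, t] + np.max(trans, axis=1)
    ridge = np.empty(n_t, dtype=int)
    ridge[-1] = np.argmax(cost)
    for t in range(n_t - 1, 0, -1):            # backtrack the best path
        ridge[t - 1] = back[ridge[t], t]
    return times, 60.0 * freqs[ridge]          # convert Hz to bpm
```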
3 Experiments
We acquired 9 simultaneous recordings of ECG, contact PPG, and IR face video using a standard patient monitor (Philips IntelliVue MP70 Patient Monitor) and a Microsoft Kinect camera. The clocks of the Kinect camera and the patient monitor were synchronised. Acquisitions were performed on 7 healthy subjects. The subjects were asked to look straight into the camera and maintain a steady posture, but otherwise behave, blink, and breathe normally. The instantaneous heart rate (iHR) was estimated from the IR video using the process described in Section 2. Ground truth iHR was extracted from the ECG signal using the R-peak detection algorithm implemented in the Python library biosppy [20].
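Converting detected R-peaks into a ground-truth iHR series can be sketched as follows. biosppy's detector returns R-peak sample indices; the helper below and its name are ours.

```python
import numpy as np

def rpeaks_to_ihr(rpeak_idx, fs):
    """Convert R-peak sample indices (e.g. from biosppy's ECG detector)
    into an instantaneous heart-rate series in bpm, one value per RR
    interval, together with the time stamp of each interval's end."""
    rpeak_idx = np.asarray(rpeak_idx, dtype=float)
    rr = np.diff(rpeak_idx) / fs    # RR intervals in seconds
    t = rpeak_idx[1:] / fs          # time of each interval's closing peak
    return t, 60.0 / rr
```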
4 Results
For each of the 9 datasets, we measured the difference between the recovered iHR signal and the ground truth using the root mean square error (RMSE) and the relative error.
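The two error metrics can be sketched as follows; the exact relative-error definition was lost in this copy of the text, so the one shown (mean absolute error relative to the ground truth) is a common choice and our assumption.

```python
import numpy as np

def rmse_bpm(est, ref):
    """Root mean square error between estimated and ground-truth iHR (bpm)."""
    est, ref = np.asarray(est, dtype=float), np.asarray(ref, dtype=float)
    return float(np.sqrt(np.mean((est - ref)**2)))

def relative_error(est, ref):
    """Mean absolute error relative to the ground-truth iHR (assumed form)."""
    est, ref = np.asarray(est, dtype=float), np.asarray(ref, dtype=float)
    return float(np.mean(np.abs(est - ref) / ref))
```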
Table 1 shows these values. Figure 5 shows the extracted iHR signals. Implemented code is available at https://github.com/natalialmg/IR_iHR.

Table 1: Errors between the recovered iHR and the ground truth for each dataset (dataset/subject): RMSE [bpm] with the iHR averaged over 1 s, 10 s, and 30 s windows, and relative error [%] over 30 s windows.

                   RMSE [bpm]                Relative error [%]
        Every 1 s   Every 10 s   Every 30 s      Every 30 s
d1/1      5.39        4.27         4.03            4.50
d2/1      5.61        4.99         5.22            6.51
d3/2      4.71        4.44         3.70            4.40
d4/2      3.59        2.87         1.33            1.56
d5/3      4.39        3.86         1.95            2.60
d6/4      4.95        4.65         2.91            3.58
d7/5      2.21        1.31         1.02            1.60
d8/6      3.30        1.42         0.23            0.25
d9/7      2.38        1.26         0.66            1.08
In general, RMSE results averaged over longer time frames (10 s and 30 s) are satisfactory. Perhaps surprisingly, RMSE results for the iHR at 1 s intervals are also reasonable. Figure 5 shows good correspondence between the ground truth ECG iHR and the STFT of the recovered noncontact PPG.
5 Concluding remarks
In this paper, we extracted a noncontact PPG signal from IR facial video. We showed that a simple, principled method based on matrix decomposition is sufficient to recover the instantaneous heart rate with small relative errors on a second-by-second basis when subjects remain relatively stationary.
This suggests the viability of IR for noncontact PPG, particularly when we consider the low-light and varying-light performance of IR compared to traditional RGB methods. Additional work is required to adequately and robustly correct for motion artifacts. Improvements can also be made to the process by which we combine the singular vectors into the final hemodynamic estimator. Finally, more research needs to be done on the characterization of the absorption curves of the biological processes of interest in the near-infrared spectrum; we leave this physiological research as future collaborative work.

We could also consider more sophisticated time-frequency representation tools to further analyze the obtained noncontact PPG signal for instantaneous heart rate estimation. More general manifold learning and matrix denoising techniques can be applied to capture motion and time latency; for example, due to the high-dimensional nature of $\tilde{X}$, the matrix can be denoised by the optimal shrinkage algorithm proposed in [21], in which each singular value $\lambda_i$ is replaced by $\phi(\lambda_i)$, where $\phi$ is the optimal shrinker under the Frobenius norm. This approach has the potential to further improve the overall quality of the signal. We will explore these possibilities in future work.
References
 [1] JA Dawson, COF Kamlin, C Wong, AB Te Pas, M Vento, TJ Cole, SM Donath, SB Hooper, PG Davis, and CJ Morley, “Changes in heart rate in the first minutes after birth,” Archives of Disease in Childhood - Fetal and Neonatal Edition, vol. 95, no. 3, pp. F177–F181, 2010.
 [2] Aymen A Alian and Kirk H Shelley, “Photoplethysmography,” Best Practice & Research Clinical Anaesthesiology, vol. 28, no. 4, pp. 395–406, 2014.
 [3] John Allen, “Photoplethysmography and its application in clinical physiological measurement,” Physiological measurement, vol. 28, no. 3, pp. R1, 2007.
 [4] Wim Verkruysse, Lars O Svaasand, and J Stuart Nelson, “Remote plethysmographic imaging using ambient light.,” Optics express, vol. 16, no. 26, pp. 21434–21445, 2008.
 [5] MingZher Poh, Daniel J McDuff, and Rosalind W Picard, “Noncontact, automated cardiac pulse measurements using video imaging and blind source separation.,” Optics express, vol. 18, no. 10, pp. 10762–10774, 2010.
 [6] Maria I Davila, Gregory F Lewis, and Stephen W Porges, “The physiocam: cardiac pulse, continuously monitored by a color video camera,” Journal of Medical Devices, vol. 10, no. 2, pp. 020951, 2016.
 [7] MingZher Poh, Daniel J McDuff, and Rosalind W Picard, “Advancements in noncontact, multiparameter physiological measurements using a webcam,” IEEE transactions on biomedical engineering, vol. 58, no. 1, pp. 7–11, 2011.
 [8] Sungjun Kwon, Hyunseok Kim, and Kwang Suk Park, “Validation of heart rate extraction using video imaging on a built-in camera system of a smartphone,” in Engineering in Medicine and Biology Society (EMBC), 2012 Annual International Conference of the IEEE. IEEE, 2012, pp. 2174–2177.

 [9] Xiaobai Li, Jie Chen, Guoying Zhao, and Matti Pietikainen, “Remote heart rate measurement from face videos under realistic situations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4264–4271.
 [10] Mayank Kumar, Ashok Veeraraghavan, and Ashutosh Sabharwal, “DistancePPG: Robust noncontact vital signs monitoring using a camera,” Biomedical Optics Express, vol. 6, no. 5, pp. 1565–1588, 2015.
 [11] Antony Lam and Yoshinori Kuno, “Robust heart rate measurement from video using select random patches,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3640–3648.
 [12] Sergey Tulyakov, Xavier Alameda-Pineda, Elisa Ricci, Lijun Yin, Jeffrey F Cohn, and Nicu Sebe, “Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2396–2404.
 [13] Chen Wang, Thierry Pun, and Guillaume Chanel, “A comparative survey of methods for remote heart rate detection from frontal face videos,” Frontiers in Bioengineering and Biotechnology, vol. 6, 2018.
 [14] Jie Chen, Zhuoqing Chang, Qiang Qiu, Xiaobai Li, Guillermo Sapiro, Alex Bronstein, and Matti Pietikäinen, “RealSense = real heart rate: Illumination invariant heart rate estimation from videos,” in Image Processing Theory Tools and Applications (IPTA), 2016 6th International Conference on. IEEE, 2016, pp. 1–6.
 [15] Qi Zhang, Yimin Zhou, Shuang Song, Guoyuan Liang, and Haiyang Ni, “Heart rate extraction based on near-infrared camera: Towards driver state monitoring,” IEEE Access, vol. 6, pp. 33076–33087, 2018.

 [16] Davis E. King, “Dlib-ml: A machine learning toolkit,” Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009.
 [17] Matan Gavish and David L Donoho, “The optimal hard threshold for singular values is 4/√3,” arXiv preprint, 2013.

 [18] Vladimir Alexandrovich Marchenko and Leonid Andreevich Pastur, “Distribution of eigenvalues for some sets of random matrices,” Matematicheskii Sbornik, vol. 114, no. 4, pp. 507–536, 1967.
 [19] Antonio Cicone and Hau-Tieng Wu, “How nonlinear-type time-frequency analysis can help in sensing instantaneous heart rate and instantaneous respiratory rate from photoplethysmography in a reliable way,” Frontiers in Physiology, vol. 8, pp. 701, 2017.
 [20] Carlos Carreiras, Ana Priscila Alves, André Lourenço, Filipe Canento, Hugo Silva, Ana Fred, et al., “BioSPPy: Biosignal processing in Python,” 2015–, [Online; accessed Jan 29, 2019].
 [21] Matan Gavish and David L Donoho, “Optimal shrinkage of singular values,” IEEE Transactions on Information Theory, vol. 63, no. 4, pp. 2137–2152, 2017.