ReViSe: Remote Vital Signs Measurement Using Smartphone Camera

by   Donghao Qiao, et al.
Queen's University

Remote Photoplethysmography (rPPG) is a fast, effective, inexpensive and convenient method for collecting biometric data, as it enables vital signs estimation from face videos. Remote contactless medical service provisioning proved to be a dire necessity during the COVID-19 pandemic. We propose an end-to-end framework that measures people's vital signs, including Heart Rate (HR), Heart Rate Variability (HRV), Oxygen Saturation (SpO2) and Blood Pressure (BP), using the rPPG methodology on a video of the user's face captured with a smartphone camera. We extract face landmarks in real time with a deep learning-based neural network model. Multiple face patches, also called Regions of Interest (RoIs), are extracted using the predicted face landmarks. Several filters are applied to reduce the noise in the cardiac signals extracted from the RoIs, called Blood Volume Pulse (BVP) signals. We trained and validated machine learning models using two public rPPG datasets, namely the TokyoTech rPPG and the Pulse Rate Detection (PURE) datasets, on which our models achieved the following Mean Absolute Errors (MAE): a) for HR, 1.73 and 3.95 Beats-Per-Minute (bpm) respectively, b) for HRV, 18.55 and 25.03 ms respectively, and c) for SpO2, a MAE of 1.64 on the PURE dataset. We validated our end-to-end rPPG framework, ReViSe, in a real-life environment, and thereby created the Video-HR dataset. Our HR estimation model achieved a MAE of 2.49 bpm on this dataset. Since no publicly available rPPG datasets existed for BP measurement with face videos, we used a dataset with signals from a fingertip sensor to train our model and also created our own video dataset, Video-BP. On our Video-BP dataset, our BP estimation model achieved a MAE of 6.7 mmHg for Systolic Blood Pressure (SBP), and a MAE of 9.6 mmHg for Diastolic Blood Pressure (DBP).


I Introduction

Vital signs like the Heart Rate (HR), Heart Rate Variability (HRV), Oxygen Saturation (SpO2) and Blood Pressure (BP) are important indicators of a person’s physiological and emotional well-being. HR indicates the number of times a person’s heart beats in a minute. HR fluctuates depending on people’s physical activity as well as mental state. It can also be indicative of a person’s emotional state and reaction to external stimuli [1, 2]. For instance, while playing video games or watching a movie, the HR can indicate how much a person is enjoying or is engrossed in the activity. HRV denotes the variation in the time interval between adjacent heartbeats. It is very sensitive to physical condition and is difficult to measure [3]. SpO2 measurement indicates the proportion of oxygenated hemoglobin in the blood compared to the total amount of hemoglobin [4]. It is widely used for monitoring lung infections in patients. SpO2 measurement is based on the theory that Oxygenated Hemoglobin (HbO2) absorbs light differently at different wavelengths than the non-oxygenated Hemoglobin (Hb). BP is the pressure which blood exerts on the walls of the arteries due to the pumping of the heart, which helps circulate the blood through the arteries.

Traditional methods: The most common methods used for monitoring vitals, such as the Electrocardiogram (ECG), sphygmomanometer, and pulse oximeter, require skin contact [5, 6]. ECG, the current gold standard for measuring HR, is not only tedious but can also cause discomfort, as it involves applying gel on the patient’s chest in order to measure the heart’s electrical signals [5]. An oximeter uses the principle of Photoplethysmography (PPG), wherein the skin is illuminated with light. A part of the light, proportional to the volume of blood flowing through the tissues, is absorbed by the tissues and the rest is reflected. By monitoring the amount of reflected light, the Blood Volume Pulse (BVP) signal is extracted, from which the HR, SpO2, and BP are computed [1, 2]. Although using an oximeter usually does not cause discomfort to the patient, it is not readily available to everyone for a quick measurement. Mercury and digital sphygmomanometers are widely used for measuring BP. These devices contain an inflatable cuff which is applied to the arm, and the BP is measured by increasing and releasing the pressure in the cuff. The pressure applied on the arm causes uneasiness, and some people even complain of pain. The most accurate method of measuring BP is to place an arterial catheter in an artery and measure the pressure on a transducer. This method is intrusive and is used only in ICUs for critical patients. It is not usable for daily measurements, as it is a very sensitive method which can lead to bleeding and infection if not managed well [36]. Therefore, the medical community is constantly looking for more convenient methods for regular measurement of vital signs to improve health monitoring.

rPPG Methods: In recent years, remote methods for measuring vital signs based on the principle of PPG have gained momentum and proved invaluable during the pandemic. These methods are referred to as rPPG methods, and they employ contactless, video-based measurement of vital signs [6]. Computer vision techniques are applied to videos of skin to obtain the BVP signal, which represents the blood flow pattern inside the blood vessels [1, 15, 38]. Various vital signs can then be computed from the BVP signal by using statistical models [41, 43] and machine learning algorithms [39, 40, 42]. The signal obtained by the remote method is very sensitive to motion and illumination artifacts. To eliminate this noise and obtain a stable signal, robust image filtering and signal processing techniques must be applied. rPPG methods can be broadly classified into two categories: motion-based methods and intensity-based methods. In the motion-based methods, the head movement of the subject is tracked in videos in order to obtain the BVP signal [28, 29]. In the intensity-based methods, the light reflected from the skin surface shows slight color changes on the face, which can be tracked in face videos to obtain the BVP signal [9, 10, 11, 12, 13, 14]. rPPG methods are very promising, as they not only eliminate the discomfort experienced in intrusive methods, but also provide an ideal solution for enabling remote facial expression analysis and medical consultation.

The major deterrents in obtaining an accurate BVP signal in rPPG are illumination changes in the environment and various motion noises, which include rotation of the head, blinking of the eyes and twitching of the face [1, 9, 43]. These movements are usually present under realistic conditions and are therefore unavoidable. To combat these issues, researchers must employ different techniques in order to filter out noises from the BVP signal due to light intensity changes and motion in the rPPG videos [1, 2, 6].

Contributions: The key contributions of this work are as follows.

  1. We propose an rPPG-based end-to-end framework, ReViSe, shown in Fig. 1, for remote measurement of vital signs in near real-time using only a face video captured in a natural environment with a smartphone camera, whereas state-of-the-art research uses videos captured in an ideal laboratory environment. The ReViSe framework has been commercialized with further enhancements by our industry partner (the mobile application is called Veyetals). Anyone who has a smartphone camera and access to a telecommunication signal can use this ubiquitous mobile service to measure vital signs. The results are returned in near real-time based on a 20-second face video, which is extremely useful for remote medical advising.

  2. ReViSe measures HR, HRV, SpO2, and BP, all four vital signs unlike the state of the art approaches which measure mostly HR [1, 2, 9, 10, 11, 12].

  3. We develop and present an advanced video data processing pipeline that includes multiple techniques to extract a robust BVP signal such as face detection, face landmarks prediction, extraction of face regions providing high quality signals, intensity-based signal extraction, and noise removal techniques to diminish motion and light noises.

This paper presents an extension to our previous work [35] with enhancements in noise removal techniques, inclusion of an advanced data ingestion and processing pipeline as shown in Fig. 1, the addition of BP measurement capability, and creation of two new datasets, Video-HR and Video-BP. The front camera of a smartphone captures the face video which is streamed to a back-end cloud platform. A deep learning-based face landmarks prediction model processes the streaming video frames and calculates the vital signs. The models are trained and validated using two publicly available datasets namely the TokyoTech rPPG dataset [7] and the Pulse Rate Detection (PURE) dataset [8]. The BP estimation model is trained using a publicly available PPG dataset, and validated using a self-created dataset, Video-BP. The ReViSe framework is validated by volunteers who used a smartphone mobile application to measure their vital signs and simultaneously recorded their vitals using a medical device as the ground truth values.

The rest of the paper is organized as follows. Section II presents an overview of our end-to-end rPPG framework. The related work is discussed in Section III. Section IV explains the methodology in detail. The experiments and results are presented in Section V and Section VI respectively. Finally, Section VII concludes the paper with a brief description of the future work.

Fig. 1: Overview of the ReViSe framework. It comprises three subsystems: (a) front-end app called Veyetals, (b) back-end processing system and (c) cloud database system.

II Overview of the End-to-End Framework

The end-to-end ReViSe framework, shown in Fig. 1, facilitates non-invasive remote monitoring of a user’s vital signs, specifically HR, HRV, SpO2, and BP. It is composed of three main subsystems. Subsystem 1 is a mobile application called Veyetals, which captures the user’s face video and streams it to the cloud server using the mobile phone service or the internet service. Subsystem 2 is the cloud server at the back-end, which processes the video frames and estimates the vitals. The back-end is linked to a database, which represents Subsystem 3. It stores users’ data and vitals history. A detailed description of each subsystem is presented below.

II-A Subsystem 1: Front-end Application - Veyetals

Veyetals is the front-end mobile application as shown in Fig. 1.(a), which needs to be installed on users’ smartphones. After the user logs in, the application switches on the front camera and starts streaming the user’s video to the cloud platform. When the estimated vital signs are returned by the back end, the application displays the readings to the user on the front-end User Interface (UI), which allows users to interact with the application and view the past measurements.

Fig. 2: The six steps performed by the back-end system to process the video data and calculate the vital signs.

II-B Subsystem 2: Back-end Data Processing

The back-end hosts a Python program on the Google Cloud Platform (GCP), which accesses the streamed video data at a defined server location through a Uniform Resource Locator (URL) set up in Amazon Web Services (AWS). Processing of the video data is performed in six steps, as illustrated in Fig. 2. In Step 1, we analyze the luminance and brightness of the input video, and only videos with appropriate lighting are processed further. In Step 2, a deep learning-based model detects the face and predicts the face landmarks in the video, which are used to segment the face into specific areas referred to as Regions of Interest (RoIs). The changes in the color intensity of pixels in the RoIs yield the raw Blood Volume Pulse (BVP) signals. Multiple RoIs are extracted in Step 2, as the strength of the BVP signals varies across face areas such as the forehead, cheeks, and nose owing to varying light conditions, obstruction due to facial hair, and face movement. In Step 3, the raw BVP signals are extracted from all the segmented RoIs. The commonly used RoIs include the forehead, cheek, and the area around the nose [7, 8, 13, 14]. In Step 4, the noise due to changes in light intensity and motion is eliminated to a large extent by applying various signal processing and image filtering techniques. This results in robust BVP signals. In Step 5, the RoI that provides the best BVP signal is selected. Finally, in Step 6, the chosen BVP signal is processed further to compute the vital signs, namely HR, HRV, SpO2, and BP, as explained later in Section IV. The computed values are transmitted to Subsystem 3 for storage.

II-C Subsystem 3: Cloud Database

The computed vital signs of the user are uploaded to a cloud database called Firestore, which is supported by the development platform. The front-end application in Subsystem 1 pulls data from this database and displays the readings to the user. Subsystem 3 allows the vital signs to be saved for registered system users so that the users can monitor the historical measurement data and, if needed, share the information with medical professionals for consultation.

III Related Work

The proposed framework deploys deep learning models and signal processing techniques to compute the vital signs from the user’s face video. In this section, we describe the relevant literature on face detection and RoI extraction, signal processing, and computation of vital signs.

III-A Face Detection and RoI Extraction

The most critical step for accurate measurement of vital signs is selecting an appropriate RoI which can provide a clean and stable periodic BVP signal containing the maximum pulsatile data. The BVP signal is often dampened by motion artifacts owing to involuntary facial movements like blinking, twitching, smiling, and frowning. Therefore, it becomes necessary to choose a RoI that includes the least noise and the most cardiac information.

Two methods are commonly used in the literature for extracting the RoIs on the face [9, 43, 1]. The first method uses a face detector that localizes the bounding box surrounding the face region. The second method directly predicts the coordinates of face landmarks such as the eyes and mouth, which can then aid in segmenting the RoIs. The former approach includes the background along with the face. Since only the facial skin contributes to the cardiac information, it is necessary to eliminate the background. To get rid of the background components, a rectangular RoI covering 60% of the width and the full height of the detected face was extracted in [8, 10, 11, 12]. McDuff et al. [8] used a combination of two rectangular face patches and the face landmarks. However, using fixed proportions of the width and height of the predicted face bounding box to select RoIs can still include some background noise. Besides, face movements or rotations can change the relative regions cropped from the face bounding box.

The face landmarks can help select specific regions on the face and some irregular shaped RoIs were also extracted to exclude parts of the face such as mouth and eyes which are more susceptible to movement. Gudi et al. [14] excluded the mouth, Li et al. [9] excluded the eyes, while Tulyakov et al. [17] excluded both the areas. We attempt to exclude the influence of eyes and mouth, but add the forehead which has a large and flat skin area. In order to diminish the influence of image warping, we use the extracted RoIs directly which is different from [17].

Some researchers selected multiple face regions to overcome variations in incident light intensity and surface reflectance. Maki et al. [7] randomly and repeatedly sampled pairs of face patches. Each pair of face patches generated a single raw BVP signal. From these signals, the one with the highest confidence ratio, signal-to-noise ratio and peak height variance was selected. Kumar et al. [15] divided the face into 7 regions by using the face landmarks. The extracted BVP signal was the weighted average of the raw BVP signals from these 7 regions. Wang et al. [16] split the face area into multiple triangular patches and adaptively selected useful patches by analyzing the box plots of pixel quantity. However, the noise and validity of the RoIs are not taken into consideration when selecting the RoIs. Inspired by these methods, we extract three RoIs after face detection: the forehead, the nose, and the face excluding the mouth. This selection approach accounts for factors such as asymmetric lighting, head rotation and skin occlusion due to facial hair or a mask, and helps select the most effective RoI for vital signs calculation.

In a low-light environment the skin cardiac data is not clearly visible, which affects the extracted PPG features [9, 10]. Xi et al. [37] decomposed the video frames (L) into an illumination part (T) and a reflectance part (R), which can be expressed as

$$L = R \circ T \qquad (1)$$

where $\circ$ means element-wise multiplication. Using the green channel as the initial illumination map, they found the enhanced reflectance image (R) by pixel-wise division. Guo et al. [51] applied Histogram Equalization (HE) and the Low-light Image Enhancement (LIME) via illumination map estimation algorithm to the videos and found that the enhanced videos gave larger RoIs than the original ones and also resulted in improved quality pulsatile signals. The HE-enhanced videos showed very little difference from the LIME-enhanced ones. Quellec et al. [52] improved the image quality by using the YCrCb color space, wherein the Y channel is known to represent luminance. Leaving the Cr and Cb components unchanged, the image background (extracted using a Gaussian kernel) was subtracted from the Y component and the resultant image was converted to the RGB color space.
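The decomposition of a frame into illumination and reflectance, L = R ∘ T, can be illustrated with a few lines of NumPy. The sketch below is not the implementation of Xi et al. [37]; it only shows pixel-wise division by a green-channel illumination map. The function name, the `eps` floor and the final clipping are assumptions added for illustration.

```python
import numpy as np

def enhance_reflectance(frame_rgb, eps=1e-3):
    """Estimate the reflectance R from L = R * T (element-wise).

    frame_rgb: uint8 array of shape (h, w, 3) in RGB order.
    eps: assumed lower bound on the illumination to avoid division by zero.
    """
    frame = frame_rgb.astype(np.float64) / 255.0
    # Initial illumination map T: the green channel of the frame.
    illumination = np.maximum(frame[..., 1], eps)
    # Pixel-wise division recovers R for every color channel.
    reflectance = frame / illumination[..., None]
    # Clipping to [0, 1] is an illustrative choice, not taken from the paper.
    return np.clip(reflectance, 0.0, 1.0)
```

For a uniformly dark frame with RGB values (40, 50, 30), the green channel of the result becomes 1.0 and the red channel 0.8, which shows how the division brightens the frame while keeping the channel ratios.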

III-B Signal Processing

The signal processing in rPPG is a vital step in generating denoised and clean BVP signals. Signal processing methods can be divided into deep learning-based and conventional ones. The deep learning-based methods use deep learning neural networks to regress a BVP signal from the original video clips or raw BVP signals.

Przybyło [24] and Song et al. [25] extracted a raw BVP signal from the original face video, and leveraged LSTM and GAN respectively to process the raw BVP signal and regress the BVP signal. Tsou et al. [26] utilized a 3D-CNN to directly regress the BVP signal from the video, which is very time consuming. The input of the network has two branches: video of the forehead and video of the cheek. A shared-weight 6-layer 3D-CNN is applied to regress the two BVPs separately, and the sum of the two BVPs is utilized to estimate HR. Deep learning-based methods can achieve measurements comparable to the conventional methods. However, these methods are data intensive, and their prediction results depend significantly on the quantity and quality of the training data. Therefore, the conventional signal processing methods for noise reduction and BVP signal generation were explored further in this work.

In conventional rPPG methodology, the individual video frames are monitored over a period of time to track the changes in pixel color intensity or location of facial features. These changes generate the raw BVP signal. Once the RoI is selected, two main techniques are generally applied to extract the raw signal: motion-based methods and color intensity based methods [9]. Efficient signal processing is critical for eliminating noise and accurately calculating the vital signs. While the motion-based methods track coordinates of facial points from frame to frame, the light intensity-based methods track variations in pixel intensity between video frames which indicate the color changes due to cardiac activity.

Balakrishnan et al. [28] proposed the use of head movement for obtaining the BVP signal. The motion-based methods compute the ballistocardiogram movement obtained from the head movement. The involuntary vertical head movement happens due to the pulse and the bobbing movement happens due to respiration. The basic idea is to track features from a person’s head, filter out the velocity of interest, and then extract the periodic signal caused by heartbeats. The featured points on the face (face landmarks) represented by coordinates x(n) and y(n), are tracked along time, and only the vertical component y is used to extract the trajectory. The trajectory is used as the raw signal.

However, the intensity-based methods are more popular due to their ease of implementation, as many techniques can be applied to extract the color changes caused by the pulse. When the skin is illuminated by light, the absorption and reflection of light is a function of the blood volume flowing through the capillaries in the region due to the heartbeat. All three color channels, namely red (R), green (G) and blue (B), contain PPG data. However, hemoglobin has a higher absorption capacity in the green range of light than in the red range [44], so the green component is widely used to extract the PPG data. Blind Source Separation (BSS) is applied when all three color channels (RGB) are used.

Wang et al. [9] utilized the PPG data from the green channel of the RGB signal, as it has the maximum Signal to Noise Ratio (SNR). Poh et al. [10, 11] used all three color channels with BSS for signal extraction. Li et al. [9] applied a detrending filter to remove the low frequencies from the signal. To further eliminate the irrelevant frequencies, various bandpass filters were used: Balakrishnan et al. [28] used a Butterworth filter, while Li et al. [9] used Finite Impulse Response (FIR) bandpass filters with cutoff frequencies covering the normal HR range.
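The green-channel extraction, detrending and bandpass steps described above can be sketched as follows. This is a simplified illustration and not the filter design of any cited work: the trend is removed with a straight-line fit, and the bandpass is an FFT mask that stands in for the Butterworth and FIR filters mentioned in the text. The cutoffs of 0.7 Hz and 4.0 Hz and the function names are assumptions.

```python
import numpy as np

def green_trace(frames):
    """Mean green intensity per frame; frames has shape (n, h, w, 3), RGB order."""
    return frames[..., 1].reshape(frames.shape[0], -1).mean(axis=1)

def detrend_and_bandpass(signal, fps, low_hz=0.7, high_hz=4.0):
    """Remove a linear trend, then zero every FFT bin outside [low_hz, high_hz]."""
    signal = np.asarray(signal, dtype=np.float64)
    t = np.arange(signal.size)
    # Linear detrend: subtract the least-squares straight line.
    slope, intercept = np.polyfit(t, signal, 1)
    detrended = signal - (slope * t + intercept)
    # Ideal bandpass in the frequency domain (assumed stand-in for a real filter).
    spectrum = np.fft.rfft(detrended)
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fps)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=signal.size)
```

An FFT mask is easy to reason about on a fixed 20-second clip, but it rings at the clip edges; a streaming system would more likely use a causal filter such as the ones cited above.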

BSS is used to separate a mixed signal into its source components [7, 10, 11, 12]. This technique can help separate the noise component from the BVP signal. The two commonly employed BSS techniques are Principal Component Analysis (PCA) and Independent Component Analysis (ICA) [19]. Among the various ICA algorithms, Joint Approximate Diagonalization of Eigen matrices (JADE) is popular for HR calculation as it is computationally efficient [11, 12]. Balakrishnan et al. [28] used PCA for extracting the head trajectories caused by cardiac activities.

ICA involves separating a multivariate signal into its additive subcomponents. The use of the signal from the green channel and separating it using ICA is the most popular method employed in rPPG. Assume that the observed signal is a linear mixture of the source signals, $x = As$. The ICA approach first finds a matrix $W$, which is the inverse of $A$, and then obtains the independent components by $\hat{s} = Wx$. PCA maps the signal into unit vectors (known as eigenvectors), where each vector is in the direction of a line that best fits the data while being orthogonal to the other vectors. The eigenvectors are in the directions along which the features have maximum variance. In intensity-based methods, the aim is to extract the component with the highest variance (first eigenvector). In motion-based methods, the component with the highest periodicity in the frequency spectrum is chosen as the BVP signal [45]. PCA has lower computational complexity compared to ICA, while ICA is more robust [1].
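The PCA step can be sketched with an eigen-decomposition of the covariance of the three color traces. The example covers only PCA, not ICA or JADE, and the function name and the input layout (one row per frame, one column per color channel) are assumptions.

```python
import numpy as np

def pca_components(rgb_traces):
    """Project (n_samples, 3) color traces onto their principal axes.

    Returns the projected components and their variances, both ordered
    from the highest-variance component to the lowest.
    """
    x = np.asarray(rgb_traces, dtype=np.float64)
    x = x - x.mean(axis=0)                   # center each channel
    cov = np.cov(x, rowvar=False)            # 3 x 3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending variance
    eigvecs = eigvecs[:, order]
    return x @ eigvecs, eigvals[order]
```

In an intensity-based setting the first returned column is the candidate BVP signal. Its sign is arbitrary, so any downstream peak detection should not depend on polarity.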

SNR is the ratio of the first two harmonic ranges of the signal to the remaining parts of the signal in the frequency domain. It is used as a metric to analyze the quality of the BVP signal [37, 51]. For motion elimination, Haan et al. [53] divided the signal into segments of equal length and computed the standard deviation of each segment. The top 5% of the segments with the highest standard deviation were eliminated and the remaining segments were concatenated.
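The segment-based motion elimination described above can be sketched as follows. The number of segments is left as a parameter because the source does not state it; rounding the 5% up to a whole segment and discarding trailing samples that do not fill a segment are further assumptions.

```python
import numpy as np

def drop_noisy_segments(signal, n_segments, drop_fraction=0.05):
    """Split the signal into equal segments and remove the most variable ones."""
    signal = np.asarray(signal, dtype=np.float64)
    seg_len = signal.size // n_segments
    # Trailing samples that do not fill a whole segment are discarded.
    segments = signal[: seg_len * n_segments].reshape(n_segments, seg_len)
    stds = segments.std(axis=1)
    n_drop = int(np.ceil(drop_fraction * n_segments))
    # Keep the segments with the lowest standard deviation, in original order.
    keep = np.sort(np.argsort(stds)[: n_segments - n_drop])
    return segments[keep].ravel()
```

Concatenating the surviving segments introduces small phase jumps at the joins, which is one reason this kind of pruning suits frequency-domain HR estimation better than beat-to-beat analysis.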

III-C Vital Signs Calculation

We calculate four different vital signs from the extracted BVP signal: HR, HRV, SpO2, and BP as described below.

HR: The time-domain algorithms [10, 11, 13, 14] detect the peaks in the BVP signal and compute the HR based on the interval of the peaks. However, these methods rely heavily on the extraction of a clean BVP signal and are sensitive to noise. Since the HR is usually a periodic signal, the BVP signal can be transformed into the frequency domain using the Fast Fourier Transform (FFT) [9, 12] or Discrete Cosine Transform (DCT) [29] methods. The transformed frequency power spectrum cannot indicate the HR instantly, unlike the time-domain methods, but it is more robust in HR estimation. Machine learning algorithms like Support Vector Machines (SVM) were used by Monkaresi et al. [30] and Osman et al. [31] to compute the Inter-Beat Interval (IBI) from the power spectrum. Deep learning models like DeepPhys [32] and HR-Net [33] extracted BVP signals for HR calculation from a series of images in videos. However, a large dataset is needed to train deep learning models using supervised learning algorithms.
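The frequency-domain approach can be illustrated by picking the strongest FFT peak inside a plausible HR band. This is a generic sketch rather than the estimator of any cited work; the band limits of 42 and 240 bpm and the function name are assumptions.

```python
import numpy as np

def heart_rate_fft(bvp, fps, min_bpm=42.0, max_bpm=240.0):
    """Return the HR in bpm at the highest power-spectrum peak within the band."""
    bvp = np.asarray(bvp, dtype=np.float64)
    bvp = bvp - bvp.mean()                       # remove the DC offset
    power = np.abs(np.fft.rfft(bvp)) ** 2        # power spectrum
    freqs_bpm = np.fft.rfftfreq(bvp.size, d=1.0 / fps) * 60.0
    band = (freqs_bpm >= min_bpm) & (freqs_bpm <= max_bpm)
    return float(freqs_bpm[band][np.argmax(power[band])])
```

The frequency resolution is 60 / T bpm for a clip of T seconds, so a 20-second video resolves HR in steps of 3 bpm unless the spectrum is interpolated or zero-padded.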

HRV: HRV can be computed by calculating the time interval between two successive peaks in the BVP signal in the time-domain. This process is extremely sensitive to noise as random spikes and jumps can add artifacts to the signal and thereby, affect the accuracy of the results. There are also some other HRV measurement techniques applicable to the frequency domain such as Low-Frequency (LF) power, High-Frequency (HF) power, and ratio of LF-to-HF power.
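The time-domain HRV computation can be sketched by detecting peaks and taking statistics of the intervals between them. The peak detector below is a deliberately simple stand-in, and SDNN and RMSSD are shown only as two common summary statistics; the source does not state which HRV metric is reported, so the metric choice, the function names and the 0.3-second minimum peak gap are all assumptions.

```python
import numpy as np

def find_peaks_simple(bvp, fps, min_gap_s=0.3):
    """Indices of local maxima above the mean, at least min_gap_s apart."""
    bvp = np.asarray(bvp, dtype=np.float64)
    threshold = bvp.mean()
    min_gap = int(min_gap_s * fps)
    peaks = []
    for i in range(1, bvp.size - 1):
        is_local_max = bvp[i] >= bvp[i - 1] and bvp[i] > bvp[i + 1]
        if is_local_max and bvp[i] > threshold:
            if not peaks or i - peaks[-1] >= min_gap:
                peaks.append(i)
    return np.array(peaks)

def hrv_metrics(peaks, fps):
    """Inter-beat intervals in ms, plus SDNN and RMSSD computed from them."""
    ibi_ms = np.diff(peaks) / fps * 1000.0
    sdnn = float(ibi_ms.std(ddof=1))                       # spread of the intervals
    rmssd = float(np.sqrt(np.mean(np.diff(ibi_ms) ** 2)))  # beat-to-beat change
    return ibi_ms, sdnn, rmssd
```

At 30 fps each frame spans about 33 ms, which bounds the timing precision of every detected peak and helps explain why HRV is harder to estimate from video than HR.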

SpO2: Tamura et al. [4] described the principle of SpO2 estimation [34]. They used red light and infrared light for calculating SpO2. For SpO2, pixel intensities of two different light sources of different wavelengths are approximated using mathematical models. SpO2 computation is based on the principle that the absorbance of Red (R) light and Infrared (IR) light by the pulsatile blood (blood flowing with periodic variations) changes with the degree of oxygenation. The extracted BVP signal obtained from the reflected light is divided into two parts: the Alternating Current (AC) component resulting from the arterial blood and the Direct Current (DC) component resulting from the underlying tissues, venous blood and constant part of arterial blood flow. For both the R and IR lights, the DC component is subtracted and the AC component is amplified, and these components are used to calculate the SpO2 level in the blood using Eq. 2 as given below.

$$SpO_2 = A - B \cdot \frac{AC_R / DC_R}{AC_{IR} / DC_{IR}} \qquad (2)$$

where $A$ and $B$ are parameters that can be calibrated by using a standard pulse oximeter.
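The AC/DC principle described above is commonly written as a ratio of ratios, SpO2 = A - B (AC_R / DC_R) / (AC_IR / DC_IR). The sketch below assumes that form, takes the mean of each trace as its DC component and the standard deviation as a proxy for its AC component, and uses placeholder values for the two calibration parameters. None of these choices is confirmed by the source.

```python
import numpy as np

def spo2_ratio_of_ratios(red, infrared, a=100.0, b=5.0):
    """Estimate SpO2 from two intensity traces with the ratio-of-ratios form.

    a, b: hypothetical calibration constants; real values must be fitted
    against a reference pulse oximeter.
    """
    red = np.asarray(red, dtype=np.float64)
    infrared = np.asarray(infrared, dtype=np.float64)
    # DC = mean level, AC = standard deviation of the pulsatile variation.
    ratio_red = red.std() / red.mean()
    ratio_ir = infrared.std() / infrared.mean()
    return a - b * (ratio_red / ratio_ir)
```

Because only the relative pulsatile amplitude of each wavelength enters the formula, a uniform change in overall brightness that scales both AC and DC of a channel leaves the estimate unchanged.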

BP: A non-linear relationship exists between the temporal features of a BVP signal and the BP [36]. Therefore, different machine learning models have been employed to exploit this non-linearity. Chowdhury et al. [39] used the PPG signal recorded with a fingertip sensor device to extract 107 features such as the systolic and diastolic peak, notch, systolic peak time, first and second derivative of the signal, demographic features (height, weight, BMI, gender, age), statistical features (standard deviation, skewness, kurtosis), time domain features and frequency domain features. Using ReliefF feature selection for automatic attribute selection and Gaussian Process Regression (GPR), they estimated the SBP and DBP. Su et al. used a sequence learning model based on a deep recurrent neural network for predicting multi-day BP from seven handcrafted features of ECG and PPG signals. Autoencoders were used to extract the complex signal features by Shimazaki et al. [48] to train a four-layer neural network. Xing et al. [40] and Viejo et al. [49] used the PPG signal’s amplitudes and phases as input to a feed-forward neural network. Slapnicar et al. trained a random forest model and a deep neural network (Spectro-temporal ResNet) with PPG signals from the MIMIC-III dataset. The first model used frequency domain features, while the second model used the signal, its first and second derivatives, and the frequency features calculated using a spectrogram. They reported that the deep learning model was the best model for BP prediction with a MAE of 6.7 for SBP and 9.6 for DBP. Huang et al. used the results from applying transfer learning on the MIMIC II dataset with k-nearest neighbours for BP prediction from face videos. Schrumpf et al. [50] trained ResNet, AlexNet, Spectro-temporal ResNet and LSTM models. They trained the models on PPG data from the MIMIC-III dataset and then used transfer learning to train on rPPG data. They found that the ResNet gave the best performance.

IV Methodology

This section first describes the functionality of the end-to-end ReViSe framework and then explains the methods applied in the back-end cloud system to measure the vital signs from the streaming video data.

IV-A End-to-End Framework: ReViSe

We developed a smartphone- and tablet-compatible, cloud-supported end-to-end framework in collaboration with our industry partner, which has been productized as the mobile application called Veyetals. The front-end user interface for the mobile device has been developed in React Native for the iOS and Android mobile platforms. The complete framework has been developed using the Firebase platform, which has a cloud server back-end. The mobile application must be downloaded from the app store and deployed on a smartphone or mobile device. When the user starts the application by clicking on the application icon, the front camera of the user’s smart device is activated to capture the user’s face video for measuring vital signs. The user gets about 10 seconds to read the instructions and adjust the device camera such that the face is positioned within a rectangular box on the screen (Fig. 1(a)). The front-end application records a 20-second video and simultaneously down-samples it to a lower resolution to reduce the volume of data that must be transferred over the communication network. The down-sampled video is recorded and streamed simultaneously at 30 Frames-Per-Second (fps) to a Real-Time Messaging Protocol (RTMP) server at the back-end, set up on the cloud using AWS.

The back-end cloud server system consists of multiple AI models and algorithms to process the video, compute the vital signs, and return the measurements to the user’s device. The Python 3 programming language is used to develop the models and algorithms, which are tested and deployed as the back-end application on the Google Cloud Platform (GCP) as a function. Videos streamed to the RTMP URL are accessed by the back-end Python application through a Google Cloud Function when triggered by React Native. Before processing, the brightness of the video is assessed to ensure that a good video with strong rPPG features is used to compute the vital signs, as described in Section IV-B.

As soon as the processing completes, the results are stored in the Firebase Firestore database, a real-time cloud storage system in Subsystem 3, which is supported by React Native. The front-end React Native application then reads the measurements from the cloud-based Firestore database in near real-time and displays them to the user on the front-end mobile interface.

IV-B Back-end Video Processing and Vital Signs Calculation

The video is processed to calculate vital signs in a 6-step pipeline in the back-end system as shown in Fig. 2.

IV-B1 Step 1: Luminance Analysis

After receiving the video, we analyze the luminance and brightness conditions with the following methods.

  • Mean Grayscale - The RGB image of a video frame is converted to grayscale and the mean of all pixel values is computed. Grayscale pixel values range from 0 to 255, with 127 as the midpoint. Therefore, if the mean lies between 127 and 255, the video frame corresponds to a light image. However, it was empirically determined that Histogram Equalization (HE) improved the brightness and resulted in good PPG data when the mean was below 75; when the mean was between 75 and 127, HE added noise to the PPG data.

  • Low contrast - Using the is_low_contrast function of the skimage.exposure library, the video frame contrast is computed and compared with a threshold of 0.65. If the image contrast is low, the lighting is not good enough to recover the PPG features.

  • Y channel of YCrCb model - The RGB image is converted to YCrCb color space and the mean value of Y channel pixels is computed. This value is monitored in all the video frames and if it varies beyond a threshold of 15 units, the system returns a message that the ambient light is varying.
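The mean-grayscale and Y-channel checks above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes BT.601 luma weights for both the grayscale value and the Y channel, represents a frame as rows of (R, G, B) tuples, and interprets "varies beyond a threshold of 15 units" as the spread between the largest and smallest per-frame mean. The function names and the labels "dark", "dim" and "light" are hypothetical.

```python
from statistics import mean


def luma(pixel):
    """BT.601 luma of an (R, G, B) pixel (assumed weights)."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def frame_mean_luma(frame):
    """Mean luma over a frame given as rows of (R, G, B) tuples."""
    return mean(luma(p) for row in frame for p in row)


def brightness_class(frame, dark=75, mid=127):
    """Classify a frame with the 75 and 127 thresholds quoted in the text."""
    m = frame_mean_luma(frame)
    if m < dark:
        return "dark"   # range in which HE was reported to help
    if m < mid:
        return "dim"    # range in which HE was reported to add noise
    return "light"


def ambient_light_varies(frames, max_spread=15):
    """True if the per-frame mean Y spans more than max_spread units."""
    means = [frame_mean_luma(f) for f in frames]
    return max(means) - min(means) > max_spread
```

In a real pipeline the per-frame means would be computed on decoded video frames with an image library; the pure-Python version is only meant to make the thresholds concrete.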

IV-B2 Step 2: Face Detection and RoIs Extraction

Fig. 3: The video processing pipeline includes face landmark prediction, raw BVP extraction from the segmented RoIs, BVP extraction with signal processing algorithms, and PSD calculation from the BVPs. (a) Signals from RoI 1 (forehead). (b) Signals from RoI 2 (nose and cheek). (c) Signals from RoI 3 (face excluding mouth). The raw BVP signal of RoI 1 has a larger variance since the skin is covered by hair.

Considering the robustness and efficiency of face detection and face landmark prediction, the open source deep learning-based MediaPipe Face Mesh framework [23] is applied to process the user video frame by frame. First, BlazeFace [22] is applied to extract the whole face area in the image. The face detector consists of a Convolutional Neural Network (CNN) with BlazeBlock for feature extraction and a modified Single Shot multibox Detector (SSD) for bounding box prediction.

After detecting the face, a CNN is applied to extract features from the face image and 478 face landmarks are predicted on the face. The face landmarks are used to segment 3 RoIs as shown in Fig. 3: RoI 1 is the forehead, RoI 2 is the cheek and nose area, and RoI 3 is most of the face area excluding the mouth. A desirable RoI is one that contains the most facial skin, since skin carries the cardiac signal, while having the least noise. RoI 1, the forehead area, has the largest and flattest skin surface on the face and can also be used when users are wearing a mask. However, when the user’s hair covers the forehead (as shown in Fig. 3), this RoI contains less useful information and more noise, as indicated by the larger variance in the extracted signal in Fig. 3(a). RoI 2 is free of noise due to obstruction by hair and/or beard, and of motion noise from eye blinking or mouth movements. RoI 3 aims to select the maximum skin area to extract general information from the face.

IV-B3 Step 3: Raw BVP Signal Extraction

Each RoI provides a three-dimensional array comprising the pixel intensities in the three color channels: red, green and blue. The raw BVP value of each frame is represented by the mean pixel value of the green channel. Hence, the 3 RoIs generate 3 raw BVP signals, one per RoI, each indexed by the frame number. Next, we process the raw BVP signals to eliminate the noise and extract the final clean BVP signals.
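As a rough illustration of this step, the sketch below averages the green channel inside one RoI for every frame. It assumes a rectangular RoI given by pixel bounds, which is a simplification: the framework segments its RoIs from the predicted face landmarks. The function name `raw_bvp` and the frame layout (rows of (R, G, B) tuples) are hypothetical.

```python
def raw_bvp(frames, roi):
    """Mean green value per frame inside a rectangular RoI.

    frames: list of frames, each a list of rows of (R, G, B) tuples.
    roi: (top, bottom, left, right) pixel bounds, end-exclusive.
    """
    top, bottom, left, right = roi
    signal = []
    for frame in frames:
        # Collect the green component of every pixel inside the RoI.
        greens = [px[1] for row in frame[top:bottom] for px in row[left:right]]
        signal.append(sum(greens) / len(greens))
    return signal
```

Calling this once per RoI yields the three raw signals, one value per frame, that the later steps operate on.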

IV-B4 Step 4: Signal Processing

To attenuate the noise and extract the desired BVP signal, the raw BVP signal is processed in several steps as explained below.

  • Signal to Noise Ratio (SNR): To analyze the quality of the signal we use the SNR metric. We transform the raw BVP signal to the frequency domain and retain only the components between 30-240 bpm, as this is the human HR range. The components are then normalized and the SNR is calculated using Eq. 3.


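    $$\mathrm{SNR} = 10\log_{10}\left(\frac{\sum_{f=30}^{240}\left(U(f)\,\hat{S}(f)\right)^{2}}{\sum_{f=30}^{240}\left(\left(1-U(f)\right)\hat{S}(f)\right)^{2}}\right) \quad (3)$$

    where $\hat{S}(f)$ is the spectrum of the BVP signal $S$, $f$ is the frequency in bpm, and $U(f)$ is a binary template window that extends from the maximum amplitude to 3 units after that. This is illustrated in Fig. 4. Therefore, the signal corresponds to the amplitudes inside the binary window and the remaining amplitudes are the noise.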

    The SNR is computed for each RoI. When the SNR is negative, a step that eliminates motion artefacts is carried out. Here, the time-domain signal is divided into 10 segments of equal length. The standard deviation of each segment is calculated and the top 5% of segments with the highest standard deviation are eliminated. The resultant signal is the cleaned raw BVP signal.

    Fig. 4: The SNR is the ratio of the components inside to those outside the binary window of 3 units.
  • Denoise Filter: We apply a customized denoise filter to the raw BVP signal to remove the large jumps and steps caused by motion noise such as head rotation and shaking. The input is the time-series BVP signal. We calculate the absolute difference between consecutive data points in the signal and compare this difference against a threshold. If the threshold is exceeded, we subtract the difference from the subsequent part of the signal to remove the step. The pseudocode of our denoise filter is shown in Algorithm 1.

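    Input: BVP signal and threshold
    Initialize the sample index and the output list
    while the index is within the signal do
         if the absolute difference between consecutive samples exceeds the threshold then
              subtract that difference from all subsequent samples
         else
              keep the sample unchanged
         end if
         advance the index
    end while
    Algorithm 1 Denoise filter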
  • Normalization: Normalization makes the comparison of two signals more robust as it brings the signals within the same range. We normalize the denoised signal by subtracting its mean and then dividing the result by the standard deviation, as shown in Eq. 4.

    $$\hat{S}(t) = \frac{S(t) - \mu}{\sigma} \quad (4)$$

    where $\mu$ and $\sigma$ are the mean and standard deviation of the signal $S$ respectively.

  • Independent Component Analysis (ICA): Next, to extract the independent source signal from the mixed signal set, we use ICA. ICA randomly returns a positive or a negative signal, so that the output signal may sometimes be reversed after ICA. This is of little significance in HR calculation in frequency-domain, but to estimate the HRV in time-domain, more accurate peak positions are required. Therefore, we return the signal (positive or negative) that has a higher correlation with the input signal.

  • Detrending Filter: After obtaining the source signal, we apply a detrending filter [20] which is designed for PPG signal processing. Detrending filter helps in reducing the non-stationary components of the signal. The method is based on smoothness priors formulation and a smoothing parameter is applied to adjust the frequency response.

  • Moving Average Filter: Finally, a moving average filter, as shown in Eq. 5, is applied to remove random noise. The moving average filter performs temporal filtering of the signal. It computes the average of the data points across neighbouring frames, thereby reducing random noise yet retaining a sharp step response.

    $$y(t) = \frac{1}{M}\sum_{j=0}^{M-1} x(t+j) \quad (5)$$

    where $x$ is the input signal, $y$ is the filtered signal, $M$ is the number of points in the average and $t$ is the index of frame. This filter helps in smoothing the signal by removing jumps due to sudden light changes or motion.
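Three of the filters in this step (step removal, normalization and moving average) are simple enough to sketch directly. The code below is an illustrative reading of the prose description and not the authors' code: the threshold and window values are not given here, so they are left as parameters, the population standard deviation is assumed for the normalization, and the moving average is assumed to use a forward window. All function names are hypothetical.

```python
from statistics import mean, pstdev


def remove_steps(signal, threshold):
    """Remove jumps larger than threshold by shifting all later samples."""
    out = list(signal)
    for i in range(1, len(out)):
        diff = out[i] - out[i - 1]
        if abs(diff) > threshold:
            # Subtract the jump from this sample and every later one.
            for j in range(i, len(out)):
                out[j] -= diff
    return out


def zscore(signal):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(signal), pstdev(signal)
    return [(x - mu) / sigma for x in signal]


def moving_average(signal, m):
    """Average each run of m consecutive samples (output is m - 1 shorter)."""
    return [sum(signal[i:i + m]) / m for i in range(len(signal) - m + 1)]
```

For example, `remove_steps([0, 1, 0, 10, 11, 10], 5)` flattens the jump of 10 units and returns `[0, 1, 0, 0, 1, 0]`. A constant signal has zero standard deviation, so `zscore` would need a guard before being used on real data.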

IV-B5 Step 5: RoI and BVP Selection

Signal processing provides 3 BVP signals corresponding to the 3 RoIs. At this stage, a bandpass filter is applied to retain only the frequencies of interest. A bandpass filter retains frequencies within a defined range and eliminates frequencies outside that range. Since human HR lies in the range of 0.7 Hz - 4 Hz, which corresponds to 42 beats-per-minute (bpm) - 240 bpm, all the unwanted frequencies can be eliminated using a bandpass filter. The Power Spectral Density (PSD) of each BVP is computed with Welch’s method [21]. The signal with the highest peak spectral power is selected for calculating the vital signs. Fig. 3 shows the PSDs of the three BVP signals and the peak values of these PSDs, which are 162.59 for RoI 1, 296.54 for RoI 2 and 258.51 for RoI 3. Therefore, RoI 2, which has the highest PSD peak, and its corresponding BVP 2 are selected to calculate the vital signs.
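The selection rule can be illustrated with a small standard-library sketch. Two substitutions are made and should be read as assumptions: a plain periodogram computed with a direct DFT stands in for Welch’s method, and discarding all bins outside 0.7-4 Hz stands in for the bandpass filter, whose design is not specified here. The function names are hypothetical.

```python
import cmath
import math


def band_spectrum(signal, fs, f_lo=0.7, f_hi=4.0):
    """Periodogram power at each DFT bin inside [f_lo, f_hi] Hz."""
    n = len(signal)
    mu = sum(signal) / n
    centered = [x - mu for x in signal]
    spectrum = []
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        if f_lo <= freq <= f_hi:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(centered))
            spectrum.append((freq, abs(coeff) ** 2 / (fs * n)))
    return spectrum


def peak(spectrum):
    """(frequency, power) of the strongest in-band bin."""
    return max(spectrum, key=lambda fp: fp[1])


def select_bvp(signals, fs):
    """Index of the signal with the highest in-band peak, and that peak."""
    peaks = [peak(band_spectrum(s, fs)) for s in signals]
    best = max(range(len(signals)), key=lambda i: peaks[i][1])
    return best, peaks[best]
```

The direct DFT is quadratic in the signal length, which is acceptable for a 20-second clip at 30 fps (600 samples); a production system would more likely call an FFT-based Welch estimator from a signal processing library.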

IV-B6 Step 6: Vital Signs Calculation

  • HR Calculation: The heartbeat is normally periodic, so the HR is calculated in the frequency domain by using the PSD of the selected BVP signal. In Fig. 3(b), signal 2 is selected as its power is the maximum. The peak coordinate (1.21, 296.54) indicates that the HR frequency is 1.21 Hz and the maximum power is 296.54. The final calculated HR of the subject is 1.21 × 60 bpm, i.e., 72.6 bpm.

  • HRV Calculation: In the time domain, the Inter-beat Interval (IBI) is an important cardiac parameter used to calculate HRV. The IBI is the time period between heartbeats, which are represented by the peaks of the extracted BVP signals. The red dots on the BVPs in Fig. 3 are the peaks, which must be greater than zero. Therefore, $\mathrm{IBI}_i = t_{i+1} - t_i$, where $t_i$ is the time of the $i$-th detected peak. The Root Mean Square of Successive HR Interval Differences (RMSSD), one of the time-domain HRV measures [3], is calculated using Eq. 6 and represents the HRV.

    $$\mathrm{RMSSD} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N-1}\left(\mathrm{IBI}_{i+1}-\mathrm{IBI}_{i}\right)^{2}} \quad (6)$$

    where N is the number of IBIs in the sequence.

    The HRV measurement is very sensitive and vulnerable in time-domain. It heavily relies on a clean and reliable BVP signal and accurate peak detection. Therefore, after getting a set of IBIs from the BVP signal, we set one standard deviation from the mean of the set as the cut-off in identifying and removing the outliers, and diminish the influence of calculation errors caused by peak detection and residual noise.

  • SpO2 Calculation: As discussed in the related work (Section III-C), we use the AC and DC components of the R and IR signals to calculate SpO2 with Eq. 2. The two calibration parameters of Eq. 2 are 1 and 0.04 respectively in our experiment.

  • BP Calculation: We trained a deep learning model using ResNet blocks for predicting the BP. Since the face video samples were limited, transfer learning was employed. The model was trained on BVP signals obtained first from PPG samples recorded with a sensor and then from rPPG videos. The network has three branches for the three different input signals namely the BVP signal, its first derivative, and its second derivative. After passing the signals through the ResNet blocks, the outputs were flattened, concatenated and passed through two dense layers to generate the SBP and DBP as output. The network architecture is shown in Fig. 5.

    Fig. 5: Deep learning ResNet model architecture for BP Estimation taking three inputs: the BVP signal, its first derivative and its second derivative. a) The branch diagram illustrating the layers through which each input passes, b) the output of the three branches are concatenated and passed through Dense layers to get SBP and DBP
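The HR and HRV calculations described in this step can be sketched with the standard library as follows. This is an illustration under stated assumptions rather than the authors' code: peaks are taken to be positive local maxima, IBIs are expressed in milliseconds, the one-standard-deviation cut-off uses the population standard deviation, and RMSSD follows its standard definition. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def heart_rate_bpm(peak_hz):
    """Convert the PSD peak frequency in Hz to beats per minute."""
    return 60.0 * peak_hz


def find_peaks(signal):
    """Indices of local maxima that are greater than zero."""
    return [i for i in range(1, len(signal) - 1)
            if signal[i] > 0 and signal[i - 1] < signal[i] >= signal[i + 1]]


def inter_beat_intervals(peak_indices, fs):
    """IBIs in milliseconds from consecutive peak positions."""
    return [1000.0 * (b - a) / fs
            for a, b in zip(peak_indices, peak_indices[1:])]


def drop_outliers(ibis):
    """Keep IBIs within one standard deviation of the mean."""
    mu, sigma = mean(ibis), pstdev(ibis)
    return [x for x in ibis if abs(x - mu) <= sigma]


def rmssd(ibis):
    """Root mean square of successive IBI differences."""
    diffs = [b - a for a, b in zip(ibis, ibis[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

With the peak frequency of 1.21 Hz quoted above, `heart_rate_bpm(1.21)` gives approximately 72.6 bpm. The outlier filter is deliberately aggressive, since a single missed or spurious peak changes two neighbouring IBIs and can dominate the RMSSD.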

V Experiments

In this section we present the experiments that we conducted to train and test our framework, focusing mainly on the robustness and accuracy of the models in real-life use-case scenarios.

We recruited participants with the help of our industry partner to use the smartphone application for demonstrating the usability and performance of the framework in real life. The data collected was used for testing the models trained with benchmark data. In this paper we mainly present the quantitative evaluation for the initial version of our application where we compared the values returned by our framework against ground truth data measured using a medical device.

Most existing works have only validated their models in a controlled laboratory environment [9, 1, 55], which generally provides good quality video to start with. We validated our system in two phases: (i) using benchmark datasets, and (ii) using our own dataset created from videos and ground truth data contributed by a group of participants with their informed consent.

V-A Experiment Design

We designed 4 experiments to evaluate our models for HR, HRV, SpO2, and BP measurement. These models were deployed on the back-end cloud server. The HR, HRV and SpO2 compute models were evaluated using two publicly available rPPG datasets as illustrated in Experiments 1 and 2. In Experiment 3, additional tests were done using a dataset, Video-HR, created from the data which was collected from a group of participants who used the Veyetals smartphone application to measure their vital signs while recording the same using a medical device. The face videos were recorded in real life scenario with ambient lights. Finally, Experiment 4 presents the BP measurement model development and validation results. Experiment 4 was conducted with a public benchmark dataset and the Video-BP dataset that we created.

V-B Benchmark Datasets

Three publicly available datasets were selected to train and test our models. The data were collected in a laboratory environment using stable light sources, so the videos generally have good luminance. The TokyoTech rPPG Dataset [7] contains PPG data recorded using finger-contact PPG sensors, from which the HR and HRV values were measured. The second dataset, PURE [8], contains the ground truth values of HR, HRV and SpO2. This dataset is more challenging since some of the face videos contain head movements such as rotation and translation. Achieving higher accuracy on this data helps to ensure the robustness and reliability of the models against motion noise interference. A third dataset, PPG-BP, published by Liang et al. [54], was used for training the BP model. A detailed description of these datasets is given below.

  • TokyoTech rPPG Dataset: This dataset consists of 9 subjects (8 male and 1 female) aged between 20 and 60 years. Each subject has three 1-minute videos corresponding to three sessions: relax, exercise and relax. The participants perform a hand grip exercise before the exercise session. Each 1-minute video is split into three 20-second videos. A finger clip Contact PPG (cPPG) sensor (Procomp Infinity T7500M, Thought Technology Ltd., Canada) with a sampling frequency of 2048 Hz is used to gather the contact BVP signals for ground truth reference.

  • PURE Dataset: This dataset contains 10 subjects (8 male and 2 female) and each subject has 6 different setups of 1-minute videos. Therefore, there are 60 video sequences and the total video duration for each subject is 6 minutes. It is more challenging to achieve good results with this dataset as the videos were recorded in 6 setups including steady, talking, slow translation, fast translation, small rotation and medium rotation. The videos were captured with an eco274CVGE camera by SVS-Vistek GmbH at a frame rate of 30 fps with a cropped resolution of pixels. A finger clip pulse oximeter (pulox CMS50E) is applied to simultaneously measure pulse rate wave and SpO2 with a sampling rate of 60 Hz.

  • PPG-BP Dataset: Due to the absence of a dataset containing face video and ground truth BP, we used the publicly available PPG-BP dataset released by Liang et al. [54]. It contains 657 PPG signal samples collected from 219 subjects using finger PPG sensor. The dataset also contains ground truth vitals namely HR, SBP and DBP recorded simultaneously.

V-C Self-created Video-HR Dataset

To evaluate the end-to-end system, we built our own dataset named Video-HR with the help of 15 participants between 10 and 80 years of age, who used their smartphones with our framework to measure their vital signs in a real-life environment. Each participant recorded two 20-second face videos with the Veyetals application on his/her smartphone under different lighting conditions and in different active positions. The smartphones used were the Samsung Galaxy Note 9, iPhone 6s, iPhone X and iPhone 11 Pro. For this pilot study, ground truth HR measurements were collected at the same time using an Apple Watch and a Bluebird pulse oximeter. In the future we plan to conduct the experiment in a clinic where the ground truth data would be collected by authorized personnel using medical devices.

V-D Self-created Video-BP Dataset

To the best of our knowledge there is no open public dataset that contains BP readings with users’ face videos. Therefore, we created a dataset named Video-BP for training the BP estimation model. The samples in this dataset include face videos which were recorded by a group of participants using a Samsung Galaxy Note 9 smartphone camera application for about 25 seconds at a frame rate of 30 fps in an environment with ample daylight. The recording device was either held in hand, placed on a surface, or fixed on a tripod stand. This was done to regularize the model so that it adapts to device movement during video recording. The ground truth BP was recorded using the Andesfit Health BP Monitor. 49 people, comprising 15 males and 34 females between 11 and 78 years of age, participated in the study, which resulted in 144 data samples. In the collected dataset, the SBP ranges between 85-168 mmHg and the DBP ranges between 50-103 mmHg.

V-E Model Training and Validation

The metric used to evaluate the performance of the vital signs measurement models is the Mean Absolute Error (MAE).
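For reference, the MAE is the average of the absolute differences between the estimated values and the ground truth. A minimal sketch follows; the function name is hypothetical and the numbers in the usage note are made-up illustrative values, not results from this work.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired estimates and references."""
    if len(predicted) != len(reference):
        raise ValueError("inputs must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)
```

For example, estimates of 72, 65 and 80 bpm against references of 70, 66 and 83 bpm have absolute errors of 2, 1 and 3 bpm, so `mean_absolute_error([72, 65, 80], [70, 66, 83])` returns 2.0.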

V-E1 HR, HRV, and SpO2

The HR, HRV (RMSSD) and SpO2 calculations described in Section IV-B6 do not require model training.

V-E2 BP

Because no public dataset was available for BP containing face videos, we decided to train our model using a publicly available PPG-BP dataset released by Liang et al. [54]. The BVP signal is extracted from the PPG signal to train our BP model using the ground truth HR, SBP and DBP data provided in the dataset. Next we replace the input part of the model architecture to extract the BVP signal from the face video in the same way as we did for the HR, HRV, and SpO2, and use the remaining part of the pre-trained model to estimate BP.

Data Preprocessing: First, poor signal samples were discarded using the signal skewness metric. Each sample contains 2,100 data points. Each sample was clipped to 1,905 data points and segmented into 3 parts, and each part became an input to the network.
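The clipping and segmentation arithmetic can be sketched as below: 1,905 points split into 3 parts gives 635 points per part. The sketch assumes that the leading 1,905 points are kept, which the text does not state, and uses a first-order finite difference as a stand-in for the derivative inputs of the BP network. The function names are hypothetical.

```python
def clip_and_segment(sample, keep=1905, parts=3):
    """Clip a sample to `keep` points and split it into equal parts."""
    clipped = sample[:keep]   # assumption: keep the leading points
    size = keep // parts      # 1905 // 3 == 635
    return [clipped[i * size:(i + 1) * size] for i in range(parts)]


def derivative(signal):
    """First-order finite difference (one sample shorter than the input)."""
    return [b - a for a, b in zip(signal, signal[1:])]
```

Applying `derivative` twice yields an approximation of the second derivative, so each 635-point segment could supply the three network inputs named in Section IV-B6.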

Training and Validation: Our deep learning model for measuring BP, as described in Section IV-B6, was trained on the PPG-BP dataset with a batch size of 256 for 50 epochs using the Adam optimizer and a learning rate of 0.001. Next we fine-tuned the pre-trained model using the self-created Video-BP dataset. After a train-test split of 90:10, the model was trained with a batch size of 25 for 50 epochs with early stopping, using the Adam optimizer and a learning rate of 0.0001. K-fold cross validation (k=10) was used during training so that the model gets to train on each type of sample.
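One way to read the fine-tuning protocol is sketched below: hold out 10% of the samples as a test set, then run 10-fold cross validation on the remaining 90%. How the split and the folds were actually combined is not stated, so this ordering, the shuffling seed and the rounding of the test-set size are all assumptions. The function names are hypothetical.

```python
import random


def train_test_split(n_samples, test_fraction=0.1, seed=0):
    """Shuffle sample indices and hold out a fraction for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_folds(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val
```

With the 144 Video-BP samples, this reading gives 14 test samples and 130 training samples, and each of the 10 folds validates on 13 samples while training on the other 117.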

Metric            Relax   Exercise   Relax   Overall
HR (bpm)          1.29    2.54       1.34    1.73
HRV (RMSSD, ms)   16.66   20.85      18.12   18.55
TABLE I: Performance (MAE) on the TokyoTech rPPG Dataset [7]
Fig. 6: Boxplot of the MAE in HR and HRV (RMSSD) calculation on the TokyoTech rPPG dataset [7].

VI Results

The results of the vital signs calculation are evaluated with MAE. The TokyoTech rPPG dataset is used to evaluate HR and HRV, and the evaluation results are shown in Table I. The boxplot for each session is shown in Fig. 6. Since this dataset was collected in a laboratory environment with less light and motion noise, our model achieves 1.73 bpm MAE in HR estimation and 18.55 ms MAE in HRV estimation. The evaluation results and the boxplot for the PURE dataset are displayed in Table II and Fig. 7. Our model achieves better results for steady subjects or small movements such as slow translation and small rotation, with less than 3 bpm MAE in HR estimation. The MAEs of the HRV and SpO2 calculations are 25.03 ms and 1.64% respectively. Our use-case Video-HR dataset was built to evaluate HR measurement from the smartphone. The MAE is 2.49 bpm, and the boxplot is shown in Fig. 8. Even though this data has more variation than the laboratory datasets, good performance was achieved in HR calculation in scenarios with less motion noise. The BP estimation model achieved a MAE of 6.7 mmHg for SBP and 9.6 mmHg for DBP on the test set of the Video-BP dataset, and the boxplot is shown in Fig. 9.

Metric            Steady   Talking   Slow Translation   Fast Translation   Small Rotation   Medium Rotation   Overall
HR (bpm)          1.40     6.7       2.8                3.3                2.9              6.6               3.95
HRV (RMSSD, ms)   24.57    20.84     22.14              24.38              28.69            29.54             25.03
SpO2 (%)          1.48     1.53      1.78               2.08               1.39             1.56              1.64
TABLE II: Performance (MAE) on the PURE Dataset [8]
Fig. 7: Boxplot of the MAE in HR, HRV (RMSSD) and SpO2 calculation on the PURE dataset [8].
Fig. 8: Boxplot of the MAE in HR calculation on our Video-HR dataset.
Fig. 9: Boxplot of the MAE in BP calculation on our Video-BP dataset.

VI-A Discussion

The model performs well in HR prediction on all datasets, although head movements affect the model stability in some videos. The HRV estimation model depends on the inter-peak distance in the BVP signal. The TokyoTech dataset has videos that were recorded under strictly controlled laboratory conditions. Therefore, the MAE for HRV estimation on this dataset is low. However, under real-life conditions, there is more noise in the rPPG data, which results in a higher MAE, as we observe in the results achieved with the PURE dataset. In the presence of noise, which can be extremely challenging to eliminate, the signal is dampened and its periodicity is affected. Since the SpO2 values in the training datasets ranged between 90-100%, the model gave low accuracy for the test samples where the SpO2 was below 90%.

Method                                SBP MAE     DBP MAE
Polynomial Kernel Regression [56]     16.4 mmHg   13.3 mmHg
Feed Forward Neural Network [49]      13.6 mmHg   12.1 mmHg
Convolutional Neural Network [57]     14.5 mmHg   10.1 mmHg
Deep Learning ResNet                  6.7 mmHg    9.6 mmHg
TABLE III: Comparing BP estimation model with previous work using Video-BP dataset

Our ResNet BP estimation model returned a MAE of 6.7 mmHg and 9.6 mmHg for SBP and DBP respectively. This is lower than the MAE of 14.1 mmHg and 11.2 mmHg for SBP and DBP respectively reported by Schrumpf et al. [50]. However, our model and their model were tested on different datasets. Although the authors of [50] stated that they did not find much improvement using signal derivatives, the derivatives reduced the MAE of our model by 1.6 mmHg. Since the improvement is not that high, the model can be trained with only the signal instead of its derivatives to reduce computational complexity and optimize the use of resources. The model accuracy may be further improved with a larger Video-BP training dataset (Section V-D).

VII Conclusion

In this paper, we propose an end-to-end framework, ReViSe, to measure users’ vital signs (namely HR, HRV, SpO2 and BP) by using a smartphone camera. Our video processing and vital signs estimation methodology is evaluated on two public rPPG datasets. For HR estimation, the model achieved a MAE of 1.73 and 3.95 bpm on the public TokyoTech and PURE datasets respectively. We also evaluated the end-to-end framework with our own Video-HR dataset and achieved a MAE of 2.49 bpm. HRV heavily relies on a clean BVP signal, which is very sensitive to variation in illumination and motion noise. The model achieved a MAE of 1.64 for SpO2 on the PURE dataset. The BP model was trained using a publicly available PPG-BP dataset and tested using a self-created Video-BP dataset. It achieved a MAE of 6.7 and 9.6 mmHg for SBP and DBP respectively on the Video-BP dataset.

Medical professionals can use our smartphone application to measure a patient’s vital signs remotely for medical advising. It is a good alternative to invasive methods that are currently the gold standard for vital signs measurement. During the COVID-19 pandemic, when in-person medical visits were restricted, such remote contactless methods proved to be extremely useful. In future, the framework can be extended for health screening, health risk prediction, or remote health care provisioning, thereby making healthcare more affordable and accessible online around the clock.

Future Work: There are many challenges in processing real-life video data. A good quality video with higher resolution can provide better accuracy but would take longer to transmit to the server and to process, and would require greater network bandwidth. Motion and lighting conditions pose difficult challenges in applying the technology in real-life scenarios. In collaboration with our industry partner, we had the opportunity to explore and address some of these challenges in this preliminary version of the work. In future work, we intend to address the above challenges to further reduce the noise, improve the measurement accuracy, and test the framework with a wide range of participants of varying age, skin color, gender and health conditions in a clinical environment to collect a greater range of measured vital signs. We plan to use Health Canada approved medical devices to collect the ground truth data for better training and validation of the models. With the larger dataset, we plan to implement deep learning-based models for estimating all the vital signs. We will also extend the framework to support multiple devices at the front-end and perform a usability study of the framework.


This research is funded by Mitacs and Markitech.


  • [1] Wang, C., Pun, T. and Chanel, G., 2018. A comparative survey of methods for remote heart rate detection from frontal face videos. Frontiers in bioengineering and biotechnology, 6, p.33.
  • [2] Rouast, P.V., Adam, M.T., Chiong, R., Cornforth, D. and Lux, E., 2018. Remote heart rate measurement using low-cost RGB face video: a technical literature review. Frontiers of Computer Science, 12(5), pp.858-872.
  • [3] Shaffer, F. and Ginsberg, J.P., 2017. An overview of heart rate variability metrics and norms. Frontiers in public health, 5, p.258.
  • [4] Tamura, T., 2019. Current progress of photoplethysmography and SPO 2 for health monitoring. Biomedical engineering letters, 9(1), pp.21-36.
  • [5] Tamura, T., Maeda, Y., Sekine, M. and Yoshida, M., 2014. Wearable photoplethysmographic sensors—past and present. Electronics, 3(2), pp.282-302.
  • [6] Sun, Y. and Thakor, N., 2015. Photoplethysmography revisited: from contact to noncontact, from point to imaging. IEEE transactions on biomedical engineering, 63(3), pp.463-477.
  • [7] Maki, Y., Monno, Y., Yoshizaki, K., Tanaka, M. and Okutomi, M., 2019, July. Inter-Beat Interval Estimation from facial video Based on Reliability of BVP Signals. In 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (pp. 6525-6528). IEEE.
  • [8] Stricker, R., Müller, S. and Gross, H.M., 2014, August. Non-contact video-based pulse rate measurement on a mobile service robot. In The 23rd IEEE International Symposium on Robot and Human Interactive Communication (pp. 1056-1062). IEEE.
  • [9] Li, X., Chen, J., Zhao, G. and Pietikainen, M., 2014. Remote heart rate measurement from face videos under realistic situations. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4264-4271).

  • [10] Poh, M.Z., McDuff, D.J. and Picard, R.W., 2010. Non-contact, automated cardiac pulse measurements using video imaging and blind source separation. Optics express, 18(10), pp.10762-10774.
  • [11] Poh, M.Z., McDuff, D.J. and Picard, R.W., 2010. Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE transactions on biomedical engineering, 58(1), pp.7-11.
  • [12] Kwon, S., Kim, H. and Park, K.S., 2012, August. Validation of heart rate extraction using video imaging on a built-in camera system of a smartphone. In 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (pp. 2174-2177). IEEE.
  • [13] Gudi, A., Bittner, M., Lochmans, R. and van Gemert, J., 2019. Efficient real-time camera based estimation of heart rate and its variability. In Proceedings of the IEEE International Conference on Computer Vision Workshops (pp. 0-0).
  • [14] Gudi, A., Bittner, M. and van Gemert, J.V., 2020. Real-Time Webcam Heart-Rate and Variability Estimation with Clean Ground Truth for Evaluation. Applied Sciences, 10(23), p.8630.
  • [15] Kumar, M., Veeraraghavan, A. and Sabharwal, A., 2015. DistancePPG: Robust non-contact vital signs monitoring using a camera. Biomedical optics express, 6(5), pp.1565-1588.
  • [16] Wang, Z., Yang, X. and Cheng, K.T., 2018. Accurate face alignment and adaptive patch selection for heart rate estimation from videos under realistic scenarios. PloS one, 13(5), p.e0197275.
  • [17] Tulyakov, S., Alameda-Pineda, X., Ricci, E., Yin, L., Cohn, J.F. and Sebe, N., 2016. Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2396-2404).
  • [18] Tipping, M.E. and Bishop, C.M., 1999. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), pp.611-622.
  • [19] Hyvärinen, A. and Oja, E., 2000. Independent component analysis: algorithms and applications. Neural networks, 13(4-5), pp.411-430.
  • [20] Tarvainen, M.P., Ranta-Aho, P.O. and Karjalainen, P.A., 2002. An advanced detrending method with application to HRV analysis. IEEE Transactions on Biomedical Engineering, 49(2), pp.172-175.
  • [21] Welch, P., 1967. The use of fast Fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Transactions on audio and electroacoustics, 15(2), pp.70-73.
  • [22] Bazarevsky, V., Kartynnik, Y., Vakunov, A., Raveendran, K. and Grundmann, M., 2019. Blazeface: Sub-millisecond neural face detection on mobile gpus. arXiv preprint arXiv:1907.05047.
  • [23] Grishchenko, I., Ablavatski, A., Kartynnik, Y., Raveendran, K. and Grundmann, M., 2020. Attention Mesh: High-fidelity Face Mesh Prediction in Real-time. arXiv preprint arXiv:2006.10962.
  • [24] Przybyło, J., 2022. A deep learning approach for remote heart rate estimation. Biomedical Signal Processing and Control, 74, p.103457.
  • [25] Song, R., Chen, H., Cheng, J., Li, C., Liu, Y. and Chen, X., 2021. PulseGAN: Learning to generate realistic pulse waveforms in remote photoplethysmography. IEEE Journal of Biomedical and Health Informatics, 25(5), pp.1373-1384.
  • [26] Tsou, Y.Y., Lee, Y.A., Hsu, C.T. and Chang, S.H., 2020, March. Siamese-rPPG network: Remote photoplethysmography signal estimation from face videos. In Proceedings of the 35th annual ACM symposium on applied computing (pp. 2066-2073).
  • [27] Garcia, M.B., Pilueta, N.U. and Jardiniano, M.F., 2019. VITAL APP: Development and User Acceptability of an IoT-Based Patient Monitoring Device for Synchronous Measurements of Vital Signs. In 2019 IEEE 11th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM) (pp. 1-6). IEEE.
  • [28] Balakrishnan, G., Durand, F., and Guttag, J. (2013). “Detecting pulse from head motions in video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Portland, Oregon.
  • [29] Irani, R., Nasrollahi, K., and Moeslund, T. B. (2014). “Improved pulse detection from head motions using DCT,” in Computer Vision Theory and Applications (VISAPP), 2014 International Conference on, Vol. 3 (Lisbon, Portugal: IEEE).
  • [30] Monkaresi, H., Calvo, R. A., and Yan, H. (2014). A machine learning approach to improve contactless heart rate monitoring using a webcam. IEEE J Biomed. Health Inform. 18, 1153–1160. doi:10.1109/JBHI.2013.2291900
  • [31] Osman, A., Turcot, J., and El Kaliouby, R. (2015). “Supervised learning approach to remote heart rate estimation from facial videos,” in Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, Vol. 1 (Washington: IEEE)
  • [32] Chen, W. and McDuff, D., 2018. Deepphys: Video-based physiological measurement using convolutional attention networks. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 349-365).
  • [33] Špetlík, R., Franc, V. and Matas, J., 2018, September. Visual heart rate estimation with convolutional neural network. In Proceedings of the British Machine Vision Conference, Newcastle, UK (pp. 3-6).
  • [34] Kanva, A.K., Sharma, C.J. and Deb, S., 2014, January. Determination of SpO2 and heart-rate using smartphone camera. In Proceedings of The 2014 International Conference on Control, Instrumentation, Energy and Communication (CIEC) (pp. 237-241). IEEE.
  • [35] Qiao, D., Zulkernine, F., Masroor, R., Rasool, R. and Jaffar, N., 2021, June. Measuring heart rate and heart rate variability with smartphone camera. In 2021 22nd IEEE International Conference on Mobile Data Management (MDM) (pp. 248-249). IEEE.
  • [36] El-Hajj, C., and Kyriacou, P.A., 2020. A review of machine learning techniques in photoplethysmography for the non-invasive cuff-less measurement of blood pressure. Biomedical Signal Processing and Control, Vol. 58, 101870. Elsevier.
  • [37] Xi, L., Chen, W., Zhao, C., Wu, X., and Wang, J., 2020. Image Enhancement for Remote Photoplethysmography in a Low-Light Environment. 15th IEEE International Conference on Automatic Face and Gesture Recognition.
  • [38] Nemcova, A., Jordanova, I., Varecka, M., Smisek, R., Marsanova, L., Smital, L., and Vitek, M., 2020. Monitoring of heart rate, blood oxygen saturation, and blood pressure using a smartphone. Biomedical Signal Processing and Control, Vol. 59.
  • [39] Chowdhury, M., Shuzan, M., Chowdhury, M., Mahbub, Z., Monir Uddin, M., Khandakar, A., and Ibne Reaz, M., 2020. Estimating Blood Pressure from the Photoplethysmogram Signal and Demographic Features Using Machine Learning Techniques. Artificial Intelligence in Medical Sensors. MDPI.
  • [40] Xing, X., and Sun, M., 2016. Optical blood pressure estimation with photoplethysmography and FFT-based neural networks. Biomed. Opt. Express, Vol. 7 (pp. 3007-3020).
  • [41] Huang, P., Lin, C., Chung, M., Lin, T., and Wu, B., 2017. Image Based Contactless Blood Pressure Assessment using Pulse Transit Time. 2017 International Automatic Control Conference (CACS). doi:10.1109/cacs.2017.8284275 (accessed on October 28, 2021).
  • [42] Luo, H., Yang, D., Barszczyk, A., Vempala, N., Wei, J., Wu, S., Zheng, P., Fu, G., Lee, K. and Feng, Z., 2019. Smartphone-based blood pressure measurement using transdermal optical imaging technology. Circulation: Cardiovascular Imaging, Vol. 12(8).
  • [43] Rahman, H., Ahmed, M., Begum, S., and Funk, P., 2016. Real Time Heart Rate Monitoring From Facial RGB Color Video Using Webcam. 9th Annual Workshop of the Swedish Artificial Intelligence Society (SAIS).
  • [44] Verkruysse, W., Svaasand, L., and Nelson, J., 2008. Remote plethysmographic imaging using ambient light. Optics express.
  • [45] Lewandowska, M., Rumiński, J., Kocejko, T., and Nowak, J., 2011. “Measuring pulse rate with a webcam—a non-contact method for evaluating cardiac activity,” in Computer Science and Information Systems (FedCSIS), 2011 Federated Conference on (Szczecin, Poland: IEEE).
  • [46] Su, P., Ding, X., Zhang, Y., Liu, J., Miao, F., and Zhao, N., 2018. Long-term Blood Pressure Prediction with Deep Recurrent Neural Networks. IEEE EMBS International Conference on Biomedical & Health Informatics. IEEE.
  • [47] Slapnicar, G., Mlakar, N., and Luštrek, M., 2019. Blood Pressure Estimation from Photoplethysmogram Using a Spectro-Temporal Deep Neural Network. Sensors, Vol. 19(15).
  • [48] Shimazaki, S., Bhuiyan, S., Kawanaka, H., Oguri, K., 2018. Features Extraction for Cuffless Blood Pressure Estimation by Autoencoder from Photoplethysmography. 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE.
  • [49] Viejo, C., Fuentes, S., Torrico, D., and Dunshea, F., 2018. Non-Contact Heart Rate and Blood Pressure Estimations from Video Analysis and Machine Learning Modelling Applied to Food Sensory Responses: A Case Study for Chocolate. Sensors. Vol. 18(6). MDPI.
  • [50] Schrumpf, F., Frenzel, P., Aust, C., Osterhoff, G., and Fuchs, M., 2021. Assessment of Non-Invasive Blood Pressure Prediction from PPG and rPPG Signals Using Deep Learning. International Workshop on Computer Vision for Physiological Measurement in Conjunction with the CVPR. Sensors.
  • [51] Guo, X., Li, Y., and Ling, H., 2017. Lime: Low-light image enhancement via illumination map estimation. IEEE Transactions on Image Processing, Vol. 26(2) (pp.982-993). IEEE.
  • [52] Quellec, G., Lamard, M., Conze, P.H., Massin, P. and Cochener, B., 2020. Automatic detection of rare pathologies in fundus photographs using few-shot learning. Medical image analysis, Vol. 61.
  • [53] de Haan, G., and Jeanne, V., 2013. Robust pulse-rate from chrominance-based rPPG. IEEE Transactions on Biomedical Engineering, Vol. 60(10) (pp. 2878-2886). IEEE.
  • [54] Liang, Y., Liu, G., Chen, Z., and Elgendi, M. (2018). PPG-BP Database. (accessed on December 15, 2021).
  • [55] Fan, X., Ye, Q., Yang, X., and Choudhury, S., 2018. Robust blood pressure estimation using an RGB camera. Journal of Ambient Intelligence and Humanized Computing, Springer Nature 2018.
  • [56] Jain, M., Deb, S., Subramanyam, A., 2016. Face video based touchless blood pressure and heart rate estimation. 2016 IEEE 18th International Workshop on Multimedia Signal Processing (MMSP).
  • [57] Brophy, E., Muehlhausen, W., Smeaton, A.F. and Ward, T.E., 2020. Optimised convolutional neural networks for heart rate estimation and human activity recognition in wrist worn sensing applications. arXiv preprint arXiv:2004.00505.