The wide availability of cheap color cameras, especially those built into smartphones, makes it very appealing to retrieve biosignals remotely, from video and without special sensors.
Early papers [Verkruysse2008, Wieringa2005, Wu2003] demonstrated this possibility, with information retrieved through analysis of small fluctuations of skin color. This approach was later called iPPG (imaging photoplethysmography) or rPPG (remote photoplethysmography), which are effectively synonyms (except, possibly, in specific cases such as a one-pixel camera [Wang2019]). Later papers also demonstrated analysis of micro-movements caused by the pulse (imaging ballistocardiography) [Balakrishnan2013, Shao2017].
More recent works demonstrated that the pulse signal can be recovered from long distances [Shi2010], up to 50 m [Blackford2016], and from images as small as pixels [McDuff_2018_CVPR_Workshops].
Motion of the subject is a significant challenge. The standard approach is to extract heart rate (HR) information from a signal based on small fluctuations of skin color (the green color component is the most useful [Osman2015, Verkruysse2008]), often on the basis of its spectrum [Verkruysse2008]. However, the frequency of typical subject movements (head tilts, for example) often falls within the expected HR range, generating a strong false signal [Chen2018DeepMagSS, Osman2015, Verkruysse2008].
Several authors tried to resolve this issue with Independent Component Analysis (ICA), which attempts to separate the different sources contributing to the final signal [Macwan2018, Poh2010].
It was demonstrated, however, that even small motion of a subject during natural interaction with a computer causes a significant decrease in accuracy (compared to the controlled motionless case), and an indoor exercise environment makes ICA almost useless. At the same time, similar methods improved by machine learning (ML) techniques show much better accuracy [Moco2016, Monkaresi2014, Nowara2018, Prakash2018].
Our work studies the impact of certain architectural techniques that could contribute to modern ML-based approaches to rPPG.
In particular, we consider classification-based estimation of HR values by a convolutional network followed by two fully connected layers.
The outputs of this network, if normalized, can be treated as a distribution of relative probabilities (there is a single output for each HR value, with a constant step). This “pseudo-spectrum” is often noisy. In order to suppress noise-related outliers, some processing or filtering method, for example smoothing, may be applied to this distribution. We replace such predetermined processing with convolutional layers, expecting an optimal filtering procedure to be learned during training.
2 Related works
The ground truth signal is in most cases obtained by either electrocardiography (ECG) [Chen2018, Monkaresi2014, Spetlik2018] or contact PPG [Osman2015, Villarroel2017NoncontactVS, Yu2019RecoveringRP]. ECG is reported to be more reliable [Spetlik2018], while contact PPG is reported to be closer to the retrieved rPPG signal [Yu2019RecoveringRP], making training easier. Reproduction of the ground truth signal is often the main objective of ML-based training [Chen2018].
A mask is often applied to select the so-called “region of interest” (ROI): the area of the frame image without background pixels and with the most informative fluctuations [Chen2018, Monkaresi2014, Osman2015]. In most cases, a video of the face is processed.
While modern progress in deep learning techniques is believed to provide powerful tools for rPPG, a straightforward approach faces several difficulties:
Small datasets for training. Most of the available datasets include fewer than 100 subjects. Often only one type of camera is used.
To compensate for the small number of subjects, researchers tend to gather many videos from each subject, which does not seem to resolve the problem, since even with long videos the datasets are still relatively small. To overcome this problem, a transfer learning approach is used, with original data coming from other domains [Faust2012Review, Ganapathy2018Taxonomy] or even obtained by training on mock signals [Niu2018]. Transfer learning capability is also used in [Chen2018] to measure the quality of the proposed method.
Video compression. Compression methods tend to preserve details that are visible to the naked eye and to suppress mostly invisible (and therefore presumed unimportant) details. Another problem is variable frame rate [Fletcher2015], often produced by video codecs trying to keep a constant bit rate, which causes jitter in the intervals between frames.
It was observed that uncompressed video, while impractical, is a much better source for iPPG [Spetlik2018], and that iPPG accuracy decreases linearly as the compression rate increases [McDuff_2018_CVPR_Workshops, McDuff2017Compression].
Magnification of small motions and color changes of the skin is used to overcome problems caused by video compression and as a method to increase general sensitivity [Chen2018DeepMagSS, Hurter2017Cardiolens, Wu2012EulerianMagnification]. Another, quite unusual, approach is to use a one-pixel camera, which has no bandwidth problems and therefore needs no compression [Wang2019].
The following methods are used to handle the motion of the subject and to overcome related problems such as illumination changes:
ML-based detection of peaks instead of spectrum analysis [Osman2015];
CNN signal post-processing to extract HR information [Spetlik2018];
CNN pre-processing which is expected to be stable to small movements [Spetlik2018, Tang2018];
skin reflection model, which is expected to help with noise from apparent skin color changes caused by different viewing angles [Chen2018];
attention model: an ROI-building procedure that pays special attention to moving pixels of the image and is also expected to distinguish smaller movements from global rigid motion [Chen2018, Kumar2015, Yang2019];
spatio-temporal CNN, which is able to extract temporal features from a series of 2D images but, compared to traditional 3D convolutions, uses significantly fewer parameters for training [Yu2019RecoveringRP].
Other sources of inspiration for this paper include:
improved quality of event detection when multi-lead ECG is used instead of single-lead ECG [Ganapathy2018Taxonomy];
motion-compensated pixel-to-pixel pulse extraction, used to exploit the spatial redundancy of the image [Fletcher2015].
Both studies involved multichannel registration of biosignals, which led to more accurate evaluation of the spatio-temporal characteristics of the signals. This could be explained by the common sources of the registered signals manifesting more strongly in their cross-correlations.
3 Experimental setup
This section describes the self-collected dataset containing 52 videos recorded on three cameras in different motion scenarios. The preprocessed and ground truth data are publicly available (see Section 4.1).
Three cameras were alternately used for video recording:
Cam: Logitech C920 webcam with (Width Height) pixels and WMV2 video codec.
Cam: Microsoft VX800 webcam with pixels and WMV3 video codec.
Cam: Lenovo B590 laptop integrated webcam with pixels and WMV3 video codec.
All video sequences were recorded in RGB (24-bit depth) at 15 frames per second (fps) with 60–80 seconds duration. Each frame contains a person's face.
From 2 to 14 video sequences were recorded for each of 8 healthy participants (7 male, 1 female, aged from 24 to 37, with skin-tones categorized from type-I to type-IV on the Fitzpatrick scale). Distribution of reference HR values is shown in Fig. 1. Each subject signed written consent to take part in the tests, which were performed in compliance with the bioethics regulations; experimental protocols were approved by the bioethics committee of the anonymized University.
The distance from the face to the webcam ranged from 0.5 to 0.7 m. The pixel size of the facial area was from pixels to when using the Cam and from to when using cameras Cam, Cam. Each video sequence was recorded at 15 frames per second in daylight illumination (300–1000 lx). Ground truth, or reference, HR values were obtained by the Choicemmed MD300C318 pulse oximeter (with declared mean absolute error of 2 bpm).
Experiments were conducted in Stationary Scenario and in Mixed Motion Scenario (Fig. 2).
Stationary Scenario. Subject sat still in front of the webcams in a fixed pose looking straight ahead. 12 video sequences for each webcam were recorded.
Mixed Motion Scenario. Subject rotated their head from right to left (with amplitude), from up to down (with amplitude). Subject was asked to speak and change facial expressions. 6 video sequences were recorded for each webcam.
In this section, we describe the methods of data pre-processing and HR estimation on a color signal sample by means of a CNN model. The final model architecture sequentially performs three steps: feature extraction (convolutional layers in Fig. 3(a)), HR prediction (fully connected layers in Fig. 3(b)), and filtering (Fig. 3(b)).
The following contributions are considered.
Multiple ROIs, forming several input signals (Section 4.1.2). Our simple attention model focuses on easily identifiable parts of the face that are known to be important [Kopeliovich2016, Kumar2015, McDuff_2018_CVPR_Workshops]. The CNN is expected to exploit spatial redundancy as in [Fletcher2015] and to extract cross-correlations between signals, as in multi-lead ECG [Ganapathy2018Taxonomy].
Pseudo-spectrum instead of regression-like models (Section 4.2). Different sources (blinking, head tilts, facial expressions) generate noise at different frequencies, so the features needed to filter them out may differ for different heart rates. Classification models (as opposed to regression ones, such as that used by Spetlik [Spetlik2018]) are also reported to be effective for event detection (HR estimation is believed to rely on the detection of heartbeat events) even if the input is noisy [Ganapathy2018Taxonomy].
Combined loss, which is based on the cross-entropy and mean squared error losses (Section 4.3). In order to minimize the penalty for minor misclassification (when the estimated HR value is close to the reference one), we add a mean squared error to the loss function.
Post-processing 1D CNN (Section 4.4). Magnification of pseudo-spectrum peaks increases contrast and makes detection of the HR more accurate. It resembles the post-processing approach of Spetlik [Spetlik2018] (but is used to increase contrast between classes) and the magnification techniques described in [Chen2018DeepMagSS, Hurter2017Cardiolens] (but is used for post-processing instead of pre-processing).
4.1 Data pre-processing
Pre-processing is performed independently on each video sequence. It includes extracting color signals from the sequence and generating the training, validation, and test sets. It is assumed that each frame of a video sequence contains the face of the same person, while the person may differ between sequences.
The data containing and coordinates of with synchronized reference HR values are publicly available [datasetDL_anon].
4.1.1 Color signal extraction
For a given video sequence, is defined as -th rectangle with coordinates relative to facial bounding box in -th frame. The facial bounding box is detected by the OpenCV implementation of the Viola–Jones face detector [Viola2001] applied to each frame. Following the works on ROI selection [Kopeliovich2016, Kumar2015], six ROIs are used in this paper (Fig. 4), corresponding to the nose, nose bridge, areas under the eyes, truncated facial box, and full bounding box. Coordinates of the bounding box are averaged over the last 20 frames (1.3 sec) to minimize detection jitter.
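The jitter suppression described above can be sketched as a running mean over recent detections. This is only an illustration: the `(x, y, w, h)` box layout, the function name `smooth_boxes`, and the handling of the first frames (averaging over however many detections exist so far) are assumptions, not details given in the text.

```python
from collections import deque

def smooth_boxes(boxes, window=20):
    """Average each (x, y, w, h) box with up to `window - 1` preceding boxes."""
    recent = deque(maxlen=window)   # keeps only the last `window` detections
    smoothed = []
    for box in boxes:
        recent.append(box)
        n = len(recent)
        smoothed.append(tuple(sum(b[i] for b in recent) / n for i in range(4)))
    return smoothed

# Hypothetical detections that jitter by one pixel around (100, 50, 80, 80).
raw = [(100 + (k % 2), 50 - (k % 2), 80, 80) for k in range(40)]
stable = smooth_boxes(raw)
```

With a 20-frame window at 15 fps, the smoothed box lags the face by roughly a second, which is the price paid for removing the frame-to-frame jitter.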
is a color signal value obtained by averaging intensity of red , green , or blue color component over the . Finally, one-dimensional color signals were obtained from each video sequence, forming multi-dimensional color signals , where is index of video sequence.
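As a rough sketch of this averaging step, the snippet below converts an ROI given relative to the face box into pixel bounds and averages one color component over it. The fractional ROI encoding, the nested-list frame layout, and the helper names `roi_pixels` and `roi_mean` are hypothetical choices made for the example.

```python
def roi_pixels(box, rel_roi):
    """Convert an ROI given in fractions of the face box to pixel bounds."""
    bx, by, bw, bh = box
    rx0, ry0, rx1, ry1 = rel_roi
    return (int(bx + rx0 * bw), int(by + ry0 * bh),
            int(bx + rx1 * bw), int(by + ry1 * bh))

def roi_mean(frame, bounds, channel):
    """Mean intensity of one color channel (0=R, 1=G, 2=B) inside the ROI."""
    x0, y0, x1, y1 = bounds
    values = [frame[y][x][channel] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(values) / len(values)

# Hypothetical 8x8 RGB frame whose green channel equals the column index.
frame = [[(10, x, 20) for x in range(8)] for _ in range(8)]
bounds = roi_pixels(box=(0, 0, 8, 8), rel_roi=(0.25, 0.25, 0.75, 0.75))
green = roi_mean(frame, bounds, channel=1)
```

Repeating `roi_mean` for every frame, ROI, and color component yields one one-dimensional color signal per (ROI, component) pair.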
4.1.2 Input data generation
The input data samples are 64-frame fragments of color signals (4.3 sec per fragment), obtained by splitting each signal into overlapping segments, starting from the first video frame with a step of 10 frames. Samples are scaled to fit the interval. It is assumed that HR does not change significantly within a sample, so ground truth HR values are averaged within the sample, resulting in one reference value for each sample.
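A minimal sketch of this windowing for a single one-dimensional signal follows. The target range of the scaling is not preserved in the text, so min-max scaling to [0, 1] is an assumption here, as is scaling each fragment independently; `make_samples` is a hypothetical name.

```python
def make_samples(signal, hr, length=64, step=10):
    """Split one color signal into overlapping fragments with one HR label each.

    `signal` and `hr` are per-frame lists of equal length.
    """
    samples = []
    for start in range(0, len(signal) - length + 1, step):
        chunk = signal[start:start + length]
        lo, hi = min(chunk), max(chunk)
        span = (hi - lo) or 1.0          # guard against a constant fragment
        scaled = [(v - lo) / span for v in chunk]   # assumed target range [0, 1]
        label = sum(hr[start:start + length]) / length   # mean reference HR
        samples.append((scaled, label))
    return samples

# Hypothetical 15 fps recording: 60 s -> 900 frames.
frames = 900
signal = [float(i % 30) for i in range(frames)]
hr = [72.0] * frames
samples = make_samples(signal, hr)
```

A 900-frame recording yields 84 fragments with these settings, which shows how a handful of one-minute videos turns into a few thousand training samples.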
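Table 1: Number of samples (training / validation / test) per camera and scenario.
|Camera||Stationary||Mixed Motion||Both scenarios|
|Cam||1169 / 94 / 340||439 / 27 / 128||1608 / 121 / 468|
|Cam||944 / 64 / 279||446 / 28 / 132||1390 / 92 / 411|
|Cam||1056 / 84 / 312||402 / 24 / 120||1458 / 108 / 432|
|All||3169 / 242 / 931||1287 / 79 / 380||4456 / 321 / 1311|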
The video sequences are typically collected in similar environmental conditions but with different participants. In order to keep the training, validation, and test sets statistically equivalent, the training set includes the first 70% of samples obtained from each video sequence, while the validation set includes the next 10% of samples (excluding ones that overlap with the training set), and the test set includes the last 20% of samples. In this way, color signals from each video sequence are represented in all sets, while the training set does not intersect with the validation or test sets. An alternative distribution of the training, validation, and test sets was also evaluated, in which the sets were chosen from different non-overlapping video sequences recorded by different webcams, in order to estimate the generalization properties of the model. Table 1 shows the number of samples for each camera and scenario.
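The chronological split can be sketched on window start indices, as below. The rounding of the 70% and 10% shares and the name `split_samples` are assumptions; the sketch only drops validation windows that share frames with the training windows, as the text describes.

```python
def split_samples(starts, length=64, train=0.7, val=0.1):
    """Split window start indices of one video chronologically."""
    n = len(starts)
    n_train = int(n * train)
    n_val = int(n * val)
    train_set = starts[:n_train]
    train_end = train_set[-1] + length          # first frame not seen in training
    # Keep only validation windows that share no frames with training windows.
    val_set = [s for s in starts[n_train:n_train + n_val] if s >= train_end]
    test_set = starts[n_train + n_val:]
    return train_set, val_set, test_set

starts = list(range(0, 837, 10))    # 84 windows from a hypothetical 900-frame video
train_set, val_set, test_set = split_samples(starts)
```

Because 64-frame windows advance by 10 frames, the first six validation windows after the training block still overlap it, which is why the validation counts in Table 1 are well below a seventh of the training counts.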
Data augmentation was applied to the training set: random uniform noise was added to the sample values. The noise amplitude was itself a uniformly distributed random value between 5e-3 and 5e-2, which typically corresponds to 5%–50% of the pulse signal amplitude. The amplitude changed after each training step.
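The augmentation can be illustrated with the standard `random` module. Drawing the noise symmetrically around zero and redrawing the amplitude on every call (standing in for "every training step") are assumptions; `augment` is a hypothetical helper.

```python
import random

def augment(sample, rng, amp_range=(5e-3, 5e-2)):
    """Add uniform noise whose amplitude is itself drawn uniformly."""
    amplitude = rng.uniform(*amp_range)          # redrawn on every call
    noisy = [v + rng.uniform(-amplitude, amplitude) for v in sample]
    return noisy, amplitude

rng = random.Random(0)          # fixed seed only to make the sketch repeatable
clean = [0.5] * 64
noisy, amplitude = augment(clean, rng)
```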
4.2 Network architecture
Data sample size is , where the 1st dimension is for color signal channels, the 2nd is for discrete time. A sample is processed as a single-channel image. Because the relatively large kernels quickly reduce the temporal dimension through the convolutional layers of the network, we do not use pooling layers, to avoid reducing it twice.
(See Fig. 3(a).) After each convolution layer there is 2D Batch Normalization [Ioffe2015], while after fully connected layers there are 1D Batch Normalization and dropout layers (with 0.5 dropout rate). We tried adding Batch Normalization before and after ReLU, and the latter led to better accuracy. The number of output channels in convolution layers is 16. Kernel size is 5×11 (color signal channels × discrete time) for the first four layers, and 2×11 for the fifth layer. Consequently, a 16-channel image of size 1×14 is input to the first fully connected layer, which has 60 output neurons.
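The shape arithmetic behind the 1×14 figure can be checked in a few lines. The 18×64 input size used here (6 ROIs × 3 color components, 64 frames) is an assumption inferred from the rest of the text; unpadded, stride-1 convolutions are assumed as well.

```python
def conv_out(size, kernel):
    """Output length of an unpadded, stride-1 convolution along one axis."""
    return size - kernel + 1

height, width = 18, 64                 # assumed: 6 ROIs x 3 colors, 64 frames
kernels = [(5, 11)] * 4 + [(2, 11)]    # kernel sizes quoted in the text
for kh, kw in kernels:
    height, width = conv_out(height, kh), conv_out(width, kw)

channels = 16
fc_inputs = channels * height * width  # flattened input of the first FC layer
print(height, width, fc_inputs)        # 1 14 224
```

Each 11-wide kernel removes 10 time steps, so five layers already shrink 64 frames to 14, which is consistent with the decision to omit pooling.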
We formulate the problem of HR estimation in two ways: as regression or classification tasks. For regression, output of the second fully connected layer is a single value representing HR estimate. For classification, the output is a -length prediction vector, where is the number of classes. Classes are generated from the range of admissible HR values (40–125 bpm), which is split into segments of equal size ( bpm). The segments are assigned to corresponding class labels . The resulting label is calculated as argmax.
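The mapping between HR values and class labels can be sketched as follows. The number of classes (128, giving a step of about 0.66 bpm) is an assumption, chosen because four such classes span roughly the 2.7 bpm quoted later for the coverage metric; decoding a label to the center of its segment is also an assumption.

```python
HR_MIN, HR_MAX = 40.0, 125.0
N_CLASSES = 128                         # assumption, not stated in the text
STEP = (HR_MAX - HR_MIN) / N_CLASSES    # about 0.66 bpm per class

def hr_to_label(hr):
    """Index of the equal-width segment containing `hr` (clamped to range)."""
    label = int((hr - HR_MIN) / STEP)
    return min(max(label, 0), N_CLASSES - 1)

def label_to_hr(label):
    """Center of the segment, used here as the decoded HR value."""
    return HR_MIN + (label + 0.5) * STEP

def predict_label(outputs):
    """argmax over the prediction vector."""
    return max(range(len(outputs)), key=outputs.__getitem__)

label = hr_to_label(72.0)
decoded = label_to_hr(label)
```

Decoding to the segment center bounds the quantization error by half a step, far below the 2 bpm error declared for the reference pulse oximeter.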
Note that reference value and estimate correspond to HR values in regression task, while in classification task they are class labels.
4.3 Loss functions
We consider several loss functions: squared error (SE) for regression task; cross entropy (CE) and combined loss (CL) for classification task.
SE loss is calculated between model output and reference HR value :
The distribution of reference HR values in a dataset can be unbalanced. To compensate for this, weight coefficients are used in all CE losses. First, vector is calculated denoting the inverse numbers of samples per class in the dataset. Next, the weights vector is calculated by smoothing the vector, which is computed as:
where is a zero-mean Gaussian kernel with window size bpm and bpm, and the operator denotes discrete convolution with “same” padding.
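A possible reading of this weighting scheme is sketched below. The kernel window and standard deviation are given in class units with placeholder values (the text states them in bpm, and the numbers are not preserved), the kernel is normalized to unit sum, and classes without samples receive an inverse count of zero; all three choices are assumptions.

```python
import math

def gaussian_kernel(window, sigma):
    """Zero-mean Gaussian kernel of odd length `window`, normalized to sum 1."""
    half = window // 2
    k = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-half, half + 1)]
    total = sum(k)
    return [v / total for v in k]

def conv_same(x, kernel):
    """Discrete convolution with zero 'same' padding (symmetric kernel)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, kv in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(x):
                acc += kv * x[idx]
        out.append(acc)
    return out

def class_weights(labels, n_classes, window=5, sigma=1.0):
    counts = [0] * n_classes
    for lab in labels:
        counts[lab] += 1
    inverse = [1.0 / c if c else 0.0 for c in counts]   # empty class -> 0 (assumed)
    return conv_same(inverse, gaussian_kernel(window, sigma))

labels = [2] * 4 + [3] * 2          # hypothetical unbalanced label set
weights = class_weights(labels, n_classes=6)
```

Smoothing spreads weight to neighboring classes, so a class with few or no samples still gets a sensible weight from its neighbors instead of an extreme one.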
The CE loss combines softmax and negative log likelihood functions:
In a pure classification task, the loss value does not depend on the distance between the predicted and reference HR values. To take this distance into account, we introduce one-hot vector , corresponding to label. Similarly to Eq. (2), the vector is smoothed resulting in :
where bpm, bpm.
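The effect of combining cross entropy with a mean squared error against the smoothed one-hot vector can be illustrated as below. Several details are assumptions: the MSE is taken between softmax outputs and a target normalized to unit sum, the two terms are simply added with a coefficient `alpha`, and `sigma` is expressed in class units.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def smoothed_one_hot(label, n_classes, sigma=2.0):
    """Gaussian bump centered on the reference label, normalized to sum to 1."""
    bump = [math.exp(-0.5 * ((i - label) / sigma) ** 2) for i in range(n_classes)]
    total = sum(bump)
    return [b / total for b in bump]

def combined_loss(outputs, label, alpha=1.0, class_weight=1.0):
    probs = softmax(outputs)
    ce = -class_weight * math.log(probs[label])          # weighted cross entropy
    target = smoothed_one_hot(label, len(outputs))
    mse = sum((p - t) ** 2 for p, t in zip(probs, target)) / len(outputs)
    return ce + alpha * mse

def peaked(n, peak, height=3.0):
    """Hypothetical network output with a single peak at index `peak`."""
    return [height if i == peak else 0.0 for i in range(n)]

near_miss = combined_loss(peaked(9, 5), label=4)   # prediction one class away
far_miss = combined_loss(peaked(9, 8), label=4)    # prediction four classes away
```

Both predictions are equally wrong for plain cross entropy, yet the near miss receives the smaller combined loss, which is the behavior the distance-aware term is meant to provide.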
The smoothing parameters can be treated as hyperparameters; however, their optimization is a challenging problem that possibly should be repeated after each training iteration. Instead of adding new hyperparameters, we add a filtering step to the basic network architecture (Fig. 3(b)), optimizing the smoothing of during the standard training process in order to fit .
The filtering step follows the output of the second fully connected layer. We use three 1D convolution layers with 16 output channels (a single channel for the last layer) and kernel size of . Each layer is followed by ReLU activation and 1D Batch Normalization. Because the layers use no padding, the output of the second fully connected layer is increased to , so the model output remains a single-channel vector of the same length.
We recommend using the filtering step together with CL loss. Nevertheless, it also can be applied when using CE loss in order to clarify a class label. Regarding the regression task, where the model output is a single value, the filtering step is inappropriate.
5 Experimental evaluation and results
This section describes metrics for models evaluation, list of training hyperparameters, and experimental results.
5.1 Evaluation metric
We used two metrics to evaluate the performance of different models of HR estimation. All metrics were applied to results on color signal samples of a test set.
Mean absolute error (MAE) measures the L1 distance between the vector of estimated HR values and the reference vector Y:
where M is a number of samples. We treat MAE as a qualitative measure of model accuracy.
Coverage at 3 bpm
This metric was used by Wang [Wang2019] for the analysis of video sequences. We redefine it as the percentage of samples for which the absolute error is smaller than 3 bpm. In the classification task, the model output is one of classes corresponding to segments within the range of admissible HR values (40–125 bpm). Therefore, for classification, we use coverage at 4 class labels, which is approximately equal to 2.7 bpm. The coverage metric can be interpreted as model quality. The 3 bpm threshold is close to the MAE of the pulse oximeter (2 bpm), indicating that such a threshold could be used to determine an acceptable measurement.
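Both metrics are easy to state in code. The sketch below works on HR values in bpm; the function names and the example numbers are hypothetical, and for class labels the threshold would be 4 labels instead of 3 bpm.

```python
def mae(estimates, references):
    """Mean absolute error over all samples."""
    return sum(abs(e - r) for e, r in zip(estimates, references)) / len(references)

def coverage(estimates, references, threshold=3.0):
    """Percentage of samples whose absolute error is below `threshold`."""
    hits = sum(abs(e - r) < threshold for e, r in zip(estimates, references))
    return 100.0 * hits / len(references)

# Hypothetical estimates and reference values in bpm.
est = [70.0, 82.0, 64.0, 90.0]
ref = [71.0, 80.0, 70.0, 90.0]
print(mae(est, ref), coverage(est, ref))   # 2.25 75.0
```

The single 6 bpm outlier raises the MAE noticeably but costs the coverage only one sample, which is why the two metrics are reported together.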
Here we describe hyperparameters used during the training process. In this work, we set them manually and don't address their optimization.
The batch size was 1024 samples; the number of epochs was 5000. The training set was randomly shuffled after each epoch. The best model parameters were taken from the epoch with the minimum MAE value on the validation set. The optimization method was Adam [KinBa17] with default parameters. The balancing coefficient for the CL loss (Eq. (5)) is also a hyperparameter.
Before the training process, we applied the learning rate range test [Smith2015] to choose the learning rate boundaries. The test consisted of estimating the MAE metric after 5 epochs of training for several learning rates varying from 10e-7 to 10e+1. The resulting curve (see Fig. 5) was smoothed using a Gaussian kernel. The maximum learning rate is defined as the argmin of the smoothed curve; the minimum learning rate is obtained by reducing the maximum by two orders of magnitude. During further training, the learning rate was changed linearly from the minimum to the maximum and back according to the “1cycle” learning policy [Smith2017].
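The choice of learning rate boundaries and the triangular schedule can be sketched as follows. The range-test values are made up for illustration, and the symmetric up/down ramp with equal halves is an assumption about how the "1cycle" policy was applied.

```python
def lr_bounds(learning_rates, smoothed_mae):
    """Maximum LR = argmin of the smoothed range-test curve; minimum = max / 100."""
    best = min(range(len(smoothed_mae)), key=smoothed_mae.__getitem__)
    lr_max = learning_rates[best]
    return lr_max / 100.0, lr_max

def one_cycle_lr(step, total_steps, lr_min, lr_max):
    """Linear ramp from lr_min to lr_max over the first half, then back down."""
    half = total_steps / 2.0
    frac = step / half if step <= half else (total_steps - step) / half
    return lr_min + (lr_max - lr_min) * frac

# Hypothetical range-test results (not values from the paper).
rates = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
curve = [14.0, 12.5, 9.0, 7.5, 20.0]
lr_min, lr_max = lr_bounds(rates, curve)
schedule = [one_cycle_lr(s, 100, lr_min, lr_max) for s in range(101)]
```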
We evaluated four models that are named after the corresponding loss functions: SE model for the regression task (62,675 parameters); CE (70,676 parameters), CL (70,676 parameters), and CL+F (with filtering layers, 72,017 parameters) for the classification task.
Training and evaluation methods were implemented in Python (using the PyTorch library). The code for generating the dataset from the color signals, the implementation of the proposed architecture, the training and testing procedures, and the trained models are freely available online [datasetDL_anon].
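The reported parameter counts can be reproduced by simple bookkeeping over the layers described in Section 4.2. Two values used here are assumptions that are not stated in the text: 128 output classes and a filter kernel size of 3. Batch Normalization is counted with two learnable parameters per channel after every convolution and fully connected layer.

```python
def conv_params(c_in, c_out, *kernel):
    size = 1
    for k in kernel:
        size *= k
    return c_in * c_out * size + c_out        # weights + biases

def bn_params(channels):
    return 2 * channels                       # learnable scale and shift

def linear_params(n_in, n_out):
    return n_in * n_out + n_out

def count_params(n_outputs, filter_kernel=None):
    total = conv_params(1, 16, 5, 11) + bn_params(16)
    total += 3 * (conv_params(16, 16, 5, 11) + bn_params(16))
    total += conv_params(16, 16, 2, 11) + bn_params(16)
    fc2_out = n_outputs
    if filter_kernel:
        fc2_out += 3 * (filter_kernel - 1)    # compensate three unpadded 1D convs
    total += linear_params(16 * 1 * 14, 60) + bn_params(60)
    total += linear_params(60, fc2_out) + bn_params(fc2_out)
    if filter_kernel:
        total += conv_params(1, 16, filter_kernel) + bn_params(16)
        total += conv_params(16, 16, filter_kernel) + bn_params(16)
        total += conv_params(16, 1, filter_kernel) + bn_params(1)
    return total

print(count_params(1))                        # SE model: 62675
print(count_params(128))                      # CE and CL models: 70676
print(count_params(128, filter_kernel=3))     # CL+F model: 72017
```

Under these assumptions the filtering step adds fewer than 1,400 parameters, about 2% of the model size.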
|Test subset||Stationary||Mixed Motion||Cam||Cam||Cam||Full test set|
The test set was divided into several subsets by scenario and camera (see Section 3): Stationary and Mixed Motion subsets named after the corresponding scenarios, Cam, Cam, Cam subsets containing the samples from the corresponding cameras in both scenarios, and Full test set, which includes all samples of the test set.
5.3.1 Accuracy on test subsets
Model comparison results are presented in Table 2. The considered models were trained on the Full training set (defined in Section 4.1.2). Then the models were evaluated on the test subsets. It is clear that the SE model had much lower accuracy than the classification-based models. The accuracy of both the CL and CL+F models was higher than that of CE, where the distance between classes is not taken into account. Adding filtering layers to the CL model led to the highest accuracy in most cases, including the Full test set. The CL model had a low coverage value on the Mixed Motion and Cam subsets. The former can be explained by the presence of high-amplitude noise in color signals caused by motion.
The coverage estimates of the CL+F model were typically near 50%. This is insufficient for using the model in practical applications. However, the size of the training set was nearly 15 times smaller than the number of model parameters. Therefore, the model accuracy and coverage could grow as the dataset expands.
Fig. 6 shows scatter plots for the considered models evaluated on the different test subsets. Predictions of the SE model were distributed within the first half of the admissible HR range as a result of the unbalanced dataset (see Fig. 1). Classification-based models produced similar plots differing in the number of outliers, with the CL+F model showing the best results.
As mentioned above, head motion causes disturbances in the color signal. The amplitude of these disturbances is up to two orders of magnitude higher than the amplitude of the signal. Because of that, it was unexpected for the model to have the same coverage values on the Stationary and Mixed Motion subsets, which was the case for CL+F. The different MAE values indicate a large number of outliers (Fig. 6(b)) on the Mixed Motion subset. We believe that filtering out such outliers merits further research.
5.3.2 Model generalization
|Cam||Cam||Cam||Full test set|
We studied the generalization of the CL+F network architecture by comparing models trained on different subsets: (CL+F) () trained on the training subsets with samples from the single camera Cam; (CL+F) () trained on the training subsets with samples from two cameras Cam, Cam. Sets based on two cameras were reduced by removing a random 50% of the samples in order to equalize the number of samples.
MAE values are given in Table 3. In the single-camera case, the error was high on every test subset except the one corresponding to the training camera. This is due to the different camera resolutions, noise, codecs, and other parameters. The two-camera case led to similar results: low error for cameras from the training subset and high error for the remaining camera. However, the error for the remaining camera was noticeably lower than the errors on cameras outside the training subset in the single-camera case. Moreover, the two-camera cases provided better accuracy on the full test set. As every training subset had a comparable number of samples, we conclude that the CL+F network architecture provides a high generalizing ability for its instances.
The problem of remote photoplethysmography by means of deep learning was considered. Color signals, which are time series of red, green, and blue color components averaged over certain regions in the facial area (cheeks, forehead, nose, etc.), were used as inputs. Inputs were processed by a convolutional neural network followed by two fully connected layers. Multiple outputs of this network correspond to different possible HR values, with a constant step. The impact of improvements to the network architecture and loss function was studied.
In particular, adding a convolution-based filter for post-processing of network outputs led to better accuracy of HR estimates. We expect that this improvement can benefit a wide range of deep neural network architectures that address a regression problem by classification and produce a “pseudo-spectrum” as output.
Another improvement is the combined loss function, where the first component is a cross entropy and the second one is a mean squared error between the network output and the smoothed one-hot vector. The proposed model demonstrated a generalization tendency: the model performance evaluated on a particular camera increases with the number of cameras in the training set (excluding the chosen camera), while the number of training samples is kept the same.