Revealing subtle signals in our everyday world is important for helping us understand the processes that cause them. Magnifying small temporal variations in video has applications in basic science (e.g., visualizing physical processes in the world), engineering (e.g., identifying the motion of large structures) and education (e.g., teaching scientific principles). As an illustration, physiological phenomena are often invisible to the unaided eye, yet understanding these processes can help us detect and treat negative health conditions. Pulse and respiration magnification, specifically, are good exemplar tasks for video magnification, as physiological phenomena cause both subtle color and motion variations. Furthermore, larger rigid and non-rigid motions of the body often mask the subtle variations, which makes the magnification of physiological signals non-trivial.
Several methods have been proposed to reveal subtle temporal variations in video. Lagrangian methods for video magnification [liu2005motion] rely on accurate tracking of the motion of particles (e.g., via optical flow) over time. These approaches are computationally expensive and will not work effectively for color changes. Eulerian video magnification methods do not rely on motion estimation, but rather magnify the variation of pixel values over time [wu2012eulerian]. This simple and clever approach allows for subtle signals to be magnified that might otherwise be missed by optical flow. Subsequent iterations of such approaches have improved the method with phase-based representations [wadhwa2013phase], matting [elgharib2015video], second-order manipulation [Zhang2017], and learning-based representations [oh2018learning]. However, all these approaches use frequency properties to separate the target signal from noise, so they require precise prior knowledge about the signal frequency. Furthermore, if the signal of interest is at a similar frequency to another signal (for example, if head motions are at a similar frequency to the pulse signal), an Eulerian approach will magnify both and cause numerous artifacts (see Fig. 1).
To address these problems, we present a generalized approach for magnifying color and motion variations in videos that feature other periodic or random motions. Our method leverages a convolutional neural network (CNN) as a video motion discriminator to separate a specific source signal even if it overlaps with other motion sources in the frequency domain. Then the separated signal can be magnified in video by performing gradient ascent[erhan2009visualizing] in the input space of the CNN, with the other motion sources untouched. To adapt the gradient ascent method to the video magnification task, several methodological innovations are introduced including adding L1 normalization and sign correction. The whole algorithm proves to work effectively even in the presence of interference motions with large magnitudes and velocities. Fig. 1 shows a comparison between the proposed method and previous approaches.
While our method can generally be applied to any type of color or motion magnification task, magnifying physiological changes on the human body without impacting other aspects of the visual appearance is an especially interesting use case with numerous applications in and of itself. In medicine and affective computing, the photoplethysmogram (PPG) and respiration signals are used as unobtrusive measures of cardiopulmonary performance. Visualizing these signals could help in understanding vascular disease, heart conditions (e.g., atrial fibrillation) [chan2016diagnostic] and stress responses. For example, jugular venous pressure (JVP) is analyzed by studying subtle motions of the neck. This is challenging for clinicians, and video magnification could offer a practical aid. Another application is in the design of avatars [suwajanakorn2017synthesizing]. Synthetic embodied agents may fall into the “uncanny valley” [mori1970uncanny] or be easily detected as “spoofs” if they do not exhibit accurate physiological responses, including respiration, pulse rates and blood flow that can be recovered using video analysis [poh2010non]. Our method presents the opportunity to not only magnify signals but also synthesize them at different frequencies within a video.
The main contributions of this paper are to: (1) present our novel end-to-end framework for video magnification based on a deep convolutional neural network and gradient ascent, (2) demonstrate recovery of the pulse and respiration waves and magnification of these signals in the presence of large rigid head motions, (3) systematically quantitatively and qualitatively compare our approach with state-of-the-art motion magnification approaches under different rigid motion conditions.
2 Related Work
2.1 Video Motion Magnification
Lagrangian video magnification approaches involve estimation of motion trajectories that are then amplified [liu2005motion, wang2006cartoon]. However, these approaches require a number of complex steps, including robust registration, frame intensity normalization, tracking and clustering of feature point trajectories, segmentation and magnification. Another approach, using temporal sampling kernels, can aid visualization of time-varying effects within videos [fuchs2010real]. However, this method involves video downsampling and relies on high-framerate input videos.
The neat Eulerian video magnification (EVM) approach proposed by Wu et al. [wu2012eulerian] combines spatial decomposition with temporal filtering to reveal time varying signals without estimating motion trajectories. However, it uses linear magnification that only allows for relatively small magnifications at high spatial frequencies and cannot handle spatially variant magnification. To counter the limitation, Wadhwa et al. [wadhwa2013phase] proposed a non-linear phase-based approach, magnifying phase variations of a complex steerable pyramid over time. Replacing the complex steerable pyramid [wadhwa2013phase] with a Riesz pyramid [wadhwa2014riesz] produces faster results. In general, the linear EVM technique is better at magnifying small color changes, while the phase-based pipeline is better at magnifying subtle motions [Wu2012web]. Both the EVM and the phase-EVM techniques rely on hand-crafted motion representations. To optimize the representation construction process, a learning-based method [oh2018learning] was proposed, which uses convolutional neural networks as both frame encoders and decoders. With the learned motion representation, fewer ringing artifacts and better noise characteristics have been achieved.
One common problem with all the methods above is that they are limited to stationary objects, whereas many realistic applications would involve small motions of interest in the presence of large ones. After motion magnification, these large motions would result in large artifacts such as haloes or ripples, and overwhelm any small temporal variation. A couple of improvements have been proposed including a clever layer-based approach called DVMAG [elgharib2015video]. By using matting, it can amplify only a specific region of interest (ROI) while maintaining the quality of nearby regions of the image. However, the approach relies on 2D warping (either affine or translation-only) to discount large motions, so it is only good at diminishing the impact of motions parallel to the camera plane and cannot deal with more complex 3D motions such as the human head rotation. The other method addressing large motion interferences is video acceleration magnification (VAM) [Zhang2017]. It assumes large motions to be linear on the temporal scale so that magnifying the motion acceleration via a second-order derivative filter will only affect small non-linear motions. However, the method will fail if the large motions have any non-linear components, and ideal linear motions are rare in real life, especially on living organisms.
Another problem with all the previous motion magnification methods is that they use frequency properties to separate target signals from noise, so they typically require the frequency of interest to be known a priori for the best results and, as such, have at least three parameters (the frequency bounds and a magnification factor) that need to be tuned. If there are motion signals from different sources that are at similar frequencies (e.g., someone is breathing and turning their head), it has previously not been possible to isolate the different signals.
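The frequency-overlap limitation can be illustrated on a single pixel trace. The following is a minimal sketch under our own assumptions, not an implementation of any of the cited methods: an ideal FFT band-pass stands in for the temporal filter, and the signal names, amplitudes and frequencies are hypothetical. Because the target and the interfering motion both fall inside the pass band, the linear magnification scales them by the same factor.

```python
import numpy as np

def ideal_bandpass(x, fs, low, high):
    """Zero out all temporal frequencies outside [low, high] Hz."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < low) | (freqs > high)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def eulerian_magnify(x, fs, low, high, alpha):
    """Linear Eulerian-style magnification of one pixel trace."""
    return x + alpha * ideal_bandpass(x, fs, low, high)

fs = 30.0                                    # frames per second (assumed)
t = np.arange(300) / fs                      # 10 s, so FFT bins are 0.1 Hz apart
pulse = 0.01 * np.sin(2 * np.pi * 1.0 * t)   # subtle target signal (hypothetical)
head = 1.00 * np.sin(2 * np.pi * 1.1 * t)    # large interfering motion, same band

out = eulerian_magnify(pulse + head, fs, low=0.7, high=2.5, alpha=10.0)
# Both components are amplified by (1 + alpha); the filter cannot tell them apart.
```

In this toy case the interfering component remains two orders of magnitude larger than the target after magnification, which is the situation a frequency-only criterion cannot resolve.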
2.2 Gradient Ascent for Feature Visualization
Opposite to gradient descent, gradient ascent is a first-order iterative optimization algorithm that takes steps proportional to the positive of the gradient (or approximate gradient) of a function. Since neural networks are generally differentiable with respect to their inputs, it is possible to perform gradient ascent in the input space by freezing the network weights and iteratively tweaking the inputs towards the maximization of an internal neuron firing or the final output behavior. Early works found that this technique can be used to visualize network features (showing what a network is looking for by generating examples)[erhan2009visualizing, simonyan2013deep] and to produce saliency maps (showing what part of an example is responsible for the network activating a particular way) [simonyan2013deep].
A recent famous application of gradient ascent in feature visualization is Google DeepDream[mordvintsev2015deepdream]. It maximizes the L2 norm of activations of a particular layer in a CNN to enhance patterns in images and create a dream-like hallucinogenic appearance. It should be noted that applying gradient ascent independently to each pixel of the inputs commonly produces images with nonsensical high-frequency noise, which can be improved by including a regularizer that prefers inputs that have natural image statistics. Also, following the same idea of DeepDream, not only a network layer but also a single neuron, a channel, or an output class can be set as the objective of gradient ascent. For a comprehensive discussion of various regularizers and different optimization objectives used in feature visualization tasks see [olahfeature].
None of the previous works have applied gradient ascent to motion magnification or any task related to motions in video. In contrast to DeepDream and similar visualization tools, our method maximizes the output activation of a CNN in motion representations computed from frames instead of in raw images.
2.3 Video-Based Physiological Measurement
Over the past decade video-based physiological measurement using RGB cameras has developed significantly [mcduff2015survey]. For instance, physiological parameters such as heart rate (HR) and breathing rate (BR) have been accurately extracted from facial videos in which subtle color changes of the skin caused by blood circulation can be amplified and analyzed (a.k.a., imaging plethysmography) [verkruysse2008remote, poh2010non, poh2011advancements, de2013robust, Tarassenko2014, Wang2016b]. Similar metrics have also been extracted by analyzing subtle face motions associated with the blood ejection into the vessels (a.k.a., imaging ballistocardiography) [Balakrishnan2013] as well as more prominent chest volume changes during breathing [Tan2010, Janssen2016].
Early work on imaging plethysmography identified that spatial averaging of skin pixel values from an imager could be used to recover the blood volume pulse [takano2007heart]. The strongest pulse signal was observed in the green channel [verkruysse2008remote], but a combination of color channels provides improved results [poh2010non, mcduff2014improvements]. Combining these insights with face tracking and signal decomposition enables a fully automated recovery of the pulse wave and heart rate [poh2010non].
In the presence of dynamic lighting and motion, advancements were needed to successfully recover the pulse signal. Leveraging models grounded in the optical properties of the skin has improved performance. The CHROM [de2013robust] method uses a linear combination of the chrominance signals. It makes the assumption of a standardized skin color profile to white-balance the video frames. The Pulse Blood Vector (PBV) method [de2014improved] relies on characteristic blood volume changes in different regions of the frequency spectrum to weight the color channels. Adapting the facial ROI can improve the performance of iPPG measurements, as blood perfusion varies in intensity across the body [Tulyakov2016].
Few approaches have made use of supervised learning for video-based physiological measurement. Formulating the problem is not trivial and performance has been modest[osman2015supervised, monkaresi2014machine]. Recent advances in deep neural video analysis offer opportunities for recovering accurate physiological measurements. Recently, Chen and McDuff [chen2018deepphys] presented a supervised method using a convolutional attention network that provided state-of-the-art measurement performance and generalized across people. Our video magnification algorithm is based on a novel framework that allows recovery of pulse and respiratory waves using such a convolutional architecture.
3.1 Video Magnification Using Gradient Ascent
Fig. 2 shows the workflow of the proposed video magnification algorithm using gradient ascent. Similar to previous video magnification algorithms, it reads a series of video frames , magnifies a specific subtle motion in them, and outputs frames of the same dimension .
The first step of our algorithm is computing the input motion representation from the original video frames . represents any change happening between two consecutive frames and . Common motion representations include frame difference and optical flow. Different motion representations can emphasize different aspects of motions. For example, the physiology-based motion representation called normalized frame difference [chen2018deepphys] was proposed to capture skin absorption changes robustly under varying rigid motions. On the other hand, optical flow based on the brightness constancy constraint is good at representing object displacements, but largely ignores the light absorption changes of objects. As a general framework for video magnification, our algorithm supports any type of motion representation.
In realistic videos the motion representations are comprised of multiple motions from different sources. For example, unconstrained facial video recordings commonly contain not only respiration movements and pulse-induced skin color changes but also head rotations and facial expressions. As we are only interested in magnifying one of these motions at a time, a video magnification algorithm should have the ability to separate the target motion from the others in the motion representation. Previous methods have typically used frequency-domain characteristics of the target motion in separation, so they rely on precise prior knowledge about the motion frequency (e.g. the exact heart rate). Furthermore, if any other motion overlaps with the target motion in frequency, it will still be magnified and cause artifacts. To improve the specificity of magnification and reduce the dependence on prior knowledge, we propose to use a deep convolutional neural network (CNN) to model the relationship between the motion representation and the motion of interest. As shown in Fig. 2, the CNN has the input motion representation as its input, and the first-order derivative of the target motion signal as its output. For many motion types, there are available datasets with paired videos and ground truth motion signals (e.g., facial videos with pulse and respiration signals measured from medical devices). Therefore, the weights of the CNN can be determined by training it on one of these datasets. It has been shown in [chen2018deepphys] that CNNs trained in this way have good generalization ability over different objects (human subjects), different backgrounds, and different lighting conditions.
As the CNN has established the relationship between the input motion representation and the target motion signal , magnification of in can be achieved by amplifying the L2 norm of its first-order derivative and then propagating the changes back to using gradient ascent. The process can be expressed as
in which is the total number of iterations and is the step size. is the weights of the CNN, which are frozen during gradient ascent. is the gradient of with respect to , which is the direction to which can be modified to specifically magnify the target motion rather than the other motions. Note that both and correspond to time point in (1), but is omitted for conciseness.
The vanilla gradient ascent in (1) is appropriate for magnifying a single motion representation at time . However, for video magnification, a series of motion representations need to be processed and magnified to the same level. Since the magnitude of the gradient is sensitive to the surface shape of the objective function (i.e. a point on a steep surface will have high magnitude whereas a point on the fairly flat surface will have low magnitude), it is not guaranteed that the accumulated gradient will be proportional to the original motion amplitude. Therefore, we apply L1 normalization to the gradient
so that only the gradient direction is kept and the gradient magnitude is controlled by the step size .
Another problem with (1) is that motions in opposite directions contribute equivalently to the L2 norm of . As a result, the target motion might be amplified in terms of the absolute amplitude but 180-degrees out of phase. To address the problem, we correct the signs of the gradient to always match the signs of the input motion representation
in which is the sign function and is element-wise multiplication.
Summing up the changes of in all the iterations, we get the final expression of the magnified motion representation:
There are only two hyper-parameters and , which can be tuned to change the magnification factor. Finally, the magnified motion representation can be combined with previous frames to iteratively generate the output video. The complete algorithm is summarized in Algorithm 1.
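The update described above can be sketched in a few lines. This is our reading of the text rather than the reference implementation: in each iteration the gradient of the output norm is L1-normalized, its element-wise absolute value is taken, and the result is multiplied element-wise by the sign of the original input before being scaled by the step size. The function names, the toy linear "network" and the numeric values are all hypothetical; in practice the gradient would come from the frozen CNN via automatic differentiation.

```python
import numpy as np

def magnify(x0, grad_fn, n_iter, step):
    """L1-normalized, sign-corrected gradient ascent in input space.

    x0      : input motion representation (any array shape)
    grad_fn : returns the gradient of ||y(x)||_2 w.r.t. x for the frozen model
    """
    x = np.array(x0, dtype=np.float64)       # working copy of the input
    sign0 = np.sign(x0)                      # signs of the original input
    for _ in range(n_iter):
        g = grad_fn(x)
        l1 = np.abs(g).sum()
        if l1 == 0:                          # flat objective: nothing to follow
            break
        # Keep only the normalized gradient magnitude and force the input's sign.
        x += step * (np.abs(g) / l1) * sign0
    return x

# Toy frozen "network": a single linear unit y = <w, x> (hypothetical).
w = np.array([0.5, -2.0, 0.0, 1.5])

def toy_grad(x):
    y = float(w @ x)
    return np.sign(y) * w                    # gradient of |y| with respect to x

x0 = np.array([0.2, -0.1, 0.3, 0.4])
x_mag = magnify(x0, toy_grad, n_iter=20, step=0.05)
```

With the normalization in place, every iteration changes the input by exactly one step size in L1 norm, so the total change is governed by the product of the number of iterations and the step size, independently of how steep the objective is. Elements to which the model is insensitive (the third element here, whose weight is zero) are left untouched.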
3.2 Example I: Color Magnification
One example of applying our proposed algorithm is in the magnification of subtle skin color changes associated with the cardiac cycle. As blood flows through the skin it changes the light reflected from it. A good motion representation for these color changes is normalized frame difference [chen2018deepphys], which is summarized below.
For modeling lighting, imagers and physiology, previous works used the Lambert-Beer law (LBL) [lam2015robust, Xu2014a] or Shafer’s dichromatic reflection model (DRM) [Wang2016b]. We build our motion representation on top of the DRM as it provides a better framework for separating specular reflection and diffuse reflection. Assume the light source has a constant spectral composition but varying intensity. We can define the RGB values of the -th skin pixel in an image sequence by a time-varying function:
where denotes a vector of the RGB values; is the luminance intensity level, which changes with the light source as well as the distance between the light source, skin tissue and camera; is modulated by two components in the DRM: specular reflection , mirror-like light reflection from the skin surface, and diffuse reflection , the absorption and scattering of light in skin-tissues; denotes the quantization noise of the camera sensor. , and
can all be decomposed into a stationary and a time-dependent part through a linear transformation[Wang2016b]:
where denotes the unit color vector of the skin-tissue; denotes the stationary reflection strength; denotes the relative pulsatile strengths caused by hemoglobin and melanin absorption; denotes the BVP.
where denotes the unit color vector of the light source spectrum; and denote the stationary and varying parts of specular reflections.
where is the stationary part of the luminance intensity, and is the intensity variation observed by the camera. The stationary components from the specular and diffuse reflections can be combined into a single component representing the stationary skin reflection:
As the time-varying components are much smaller (i.e., orders of magnitude) than the stationary components in (10), we can neglect any product between varying terms and approximate as:
The first step in computing our motion representation is spatial averaging of pixels, which has been widely used for reducing the camera quantization error in (11). We implemented this by downsampling every frame to pixels by pixels using bicubic interpolation. Empirical evidence shows that bicubic interpolation preserves the color information more accurately than linear interpolation [mcduff2018super]. Selecting is a trade-off between suppressing camera noise and retaining spatial resolution ([wang2015exploiting] found that was a good choice for face videos). The downsampled pixel values will still obey the DRM model only without the camera quantization error:
where is the new pixel index in every frame.
Then we need to reduce the dependency of on the stationary skin reflection color , resulting from the light source and subject’s skin tone. In (12), appears twice. It is difficult to eliminate the second term as it interacts with the unknown . However, the first time-invariant term, which is usually dominant, can be removed by taking the first order derivative of both sides of (12) with respect to time:
One problem with this frame difference representation is that the stationary luminance intensity level is spatially heterogeneous due to different distances to the light source and uneven skin contours. The spatial distribution of has nothing to do with physiology, but is different in every video recording setup. Thus, was normalized by dividing it by the temporal mean of to remove :
where . In (14), needs to be computed pixel-by-pixel over a short time window to minimize occlusion problems and prevent the propagation of errors. We found it was feasible to compute it over two consecutive frames so that (14) can be expressed discretely as:
which is the normalized frame difference we used as motion representation.
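A minimal sketch of the normalized frame difference follows. It assumes the two-frame form in which the difference of consecutive frames is divided by their sum (the two-frame temporal mean up to a constant factor of two); the function name and the array layout (time, height, width, channel) are our own choices. The example also checks the property that motivates the normalization: a spatially varying but time-invariant luminance gain cancels out.

```python
import numpy as np

def normalized_frame_difference(frames):
    """(C(t+1) - C(t)) / (C(t+1) + C(t)), computed per pixel and channel.

    frames : array of shape (T, H, W, 3), already spatially downsampled
    returns: array of shape (T - 1, H, W, 3)
    """
    frames = np.asarray(frames, dtype=np.float64)
    diff = frames[1:] - frames[:-1]
    total = frames[1:] + frames[:-1]
    out = np.zeros_like(diff)
    # Leave zeros where both frames are zero instead of dividing by zero.
    np.divide(diff, total, out=out, where=total != 0)
    return out

rng = np.random.default_rng(0)
frames = rng.uniform(50.0, 200.0, size=(6, 4, 4, 3))    # hypothetical frames
gain = rng.uniform(0.5, 2.0, size=(1, 4, 4, 1))         # static luminance map

d_plain = normalized_frame_difference(frames)
d_gain = normalized_frame_difference(frames * gain)      # same up to rounding
```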
The CNN we used for extracting pulse signals from the motion representation is shown in Fig. 3 (a). The pooling layers are 2x2 average pooling, and the convolution layers have a stride of one. All the layers use ReLU as the activation function. Note that bounded activation functions such as tanh and sigmoid are not suitable for this task, as they would limit the extent to which the motion representation can be magnified during gradient ascent.
After gradient ascent, the input motion representation was magnified as , from which we could reconstruct the magnified video. The first step of reconstruction is to denoise the output motion representation by filtering the accumulated gradient:
in which is a zero-phase band-pass filter. Note that unlike previous motion magnification methods the function of the filter here is not to select the target motion but to remove low and high frequency noise, so the filter bands do not need to precisely match the motion frequency in the video and can be chosen conservatively. Specifically, a 6th-order Butterworth filter with cut-off frequencies of 0.7 and 2.5 Hz was used to generally cover the normal heart rate range (42 to 150 beats per minute). Then we applied the inverse operation of (15) to reconstruct the downsampled version of the frames :
Finally, was upsampled back to the original video resolution:
is an image upsampling operator.
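The zero-phase band-pass denoising step used above can be sketched with SciPy. Several details are assumptions on our part: the frame rate, the use of forward-backward filtering (`sosfiltfilt`) to obtain zero phase, and the interpretation of "6th-order" as a band-pass design with `order=3` (which yields a 6th-order filter in SciPy's convention). The test signal is synthetic.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def zero_phase_bandpass(x, fs, low_hz, high_hz, order=3, axis=0):
    """Butterworth band-pass applied forward and backward (zero phase)."""
    nyquist = 0.5 * fs
    sos = butter(order, [low_hz / nyquist, high_hz / nyquist],
                 btype="bandpass", output="sos")
    return sosfiltfilt(sos, x, axis=axis)

fs = 30.0                                    # frames per second (assumed)
t = np.arange(900) / fs                      # 30 s of samples
pulse = np.sin(2 * np.pi * 1.2 * t)          # 72 beats per minute, inside the band
drift = 2.0 * np.sin(2 * np.pi * 0.1 * t)    # slow drift, outside the band

filtered = zero_phase_bandpass(pulse + drift, fs, low_hz=0.7, high_hz=2.5)
```

Because the filter only removes out-of-band noise, the band edges correspond to a generic physiological range (0.7 Hz × 60 = 42 and 2.5 Hz × 60 = 150 beats per minute) rather than to the heart rate of a particular video.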
3.3 Example II: Motion Magnification
Our second example is amplifying subtle motions on the human body induced by respiration. We used phase variations in a complex steerable pyramid to represent the local motions in a video. The complex steerable pyramid [Simoncelli1992, Portilla2000] is a filter bank that breaks each frame of the video into complex-valued sub-bands corresponding to different scales and orientations. The basis functions of this transformation are scaled and oriented Gabor-like wavelets with both cosine- and sine-phase components. Each pair of cosine- and sine-like filters can be used to separate the amplitude of local wavelets from their phase. Specifically, each scale and orientation is a complex image that can be expressed in terms of amplitude and phase as:
We take the first-order temporal derivative of the local phases computed in this equation as our input motion representation:
For small motions, these phase variations are approximately proportional to displacements of image structures along the corresponding orientation and scale [gautama2002phase]. To lower computational cost, we computed a pyramid with octave bandwidth and four orientations (). Using half-octave or quarter-octave bandwidth and more orientations would enable our algorithm to amplify more motion details, but would require significantly greater computational resources. In theory, contains scales of representations in different spatial resolutions, and extracting the target respiration motion from them would need different CNNs to fit different input dimensions. However, we found that and the amplified on different scales were approximately proportional to , so it is possible to only process one scale and interpolate the other scales with it.
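The proportionality between phase variation and displacement can be checked on a one-dimensional toy example. This sketch does not build a complex steerable pyramid; a single complex Gabor filter stands in for one sub-band, and the spatial frequency, filter width and per-frame shift are hypothetical values chosen for illustration.

```python
import numpy as np

def gabor_kernel(omega, sigma, half_width):
    """Complex Gabor filter: Gaussian envelope times a complex exponential."""
    u = np.arange(-half_width, half_width + 1)
    return np.exp(-u ** 2 / (2.0 * sigma ** 2)) * np.exp(1j * omega * u)

def local_phase(frames, kernel):
    """Phase of the complex sub-band response for each 1-D frame."""
    return np.stack([np.angle(np.convolve(f, kernel, mode="same"))
                     for f in frames])

def phase_variation(phase):
    """Temporal phase difference, wrapped to (-pi, pi]."""
    return np.angle(np.exp(1j * (phase[1:] - phase[:-1])))

omega, shift = 0.5, 0.1                      # rad/pixel and pixels/frame (assumed)
x = np.arange(256)
frames = np.stack([np.cos(omega * (x - shift * k)) for k in range(5)])

dphi = phase_variation(local_phase(frames, gabor_kernel(omega, 8.0, 32)))
# Away from the borders, dphi is close to -omega * shift for every frame pair.
```

Wrapping the difference keeps the representation well defined when the raw phase crosses the ±π boundary between two frames.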
The CNN we used for extracting respiration signals from the motion representation is shown in Fig. 3 (b). The neural network is deeper than the one used for pulse magnification, because the input motion representation for respiration has a higher dimension. The pooling layers and convolution layers are of the same type as in Fig. 3 (a). As we met the dying ReLU problem (ReLU neurons were stuck in the negative side and always output 0) in our experiments, the activation functions of all the layers were replaced with scaled exponential linear units (SELU) [klambauer2017self].
After gradient ascent, the input motion representation was magnified as , from which we could reconstruct the magnified video. Unlike in Example I, the phase variations were reconstructed by reversing (20) before denoising:
Then the reconstructed phase was denoised by band-pass filtering and phase clipping:
The filter is a 6th-order zero-phase Butterworth filter with cut-off frequencies of 0.16 and 0.5 Hz for generally covering the normal breathing rate range (10 to 30 breaths per minute). The magnified phase of the other scales can be interpolated by exponentially scaling the filtered term:
Finally, the magnified video frame can be reconstructed from all the scales of the complex steerable pyramid with their phase updated as (23).
We used the dataset collected by Estepp et al. [estepp2014recovering] for testing our approach. Videos were recorded with a Basler Scout scA640-120gc GigE-standard color camera, capturing 8-bit, 658x492 pixel images at 120 fps. The camera was equipped with a 16 mm fixed focal length lens. Twenty-five participants (17 males) were recruited to participate in the study. Nine individuals were wearing glasses, eight had facial hair, and four were wearing makeup on their face and/or neck. The participants exhibited the following estimated Fitzpatrick Sun-Reactivity Skin Types [fitzpatrick1988validity]: I-1, II-13, III-10, IV-2, V-0. Gold-standard physiological signals were measured using a BioSemi ActiveTwo research-grade biopotential acquisition unit.
We used videos of participants during a set of four five-minute tasks for our analysis. Two of the tasks (A and D) were performed in front of a patterned background and two (B and C) were performed in front of a black background. The four tasks were designed to capture different levels of head rotation about the vertical axis (yaw). Examples of frames from the tasks can be seen in Fig. 4.
Task A: Participants stayed still allowing for small natural motions.
Task B: Participants performed a 120-degree sweep centered about the camera at a speed of 10 degrees/sec.
Task C: Similar to Task B but with a speed of 30 degrees/sec.
Task D: Participants were asked to reorient their head position once per second to a randomly chosen target from a set positioned in 20-degree increments over a 120-degree arc, thus simulating random head motion.
We compare the color magnification results to Eulerian video magnification [wu2012eulerian] and video acceleration magnification [Zhang2017], and compare the motion magnification results to phase-based Eulerian video magnification [wadhwa2013phase] and video acceleration magnification (EVM and phase-based EVM perform poorly for motion magnification and color magnification respectively). In each case we perform qualitative evaluations similar to that presented in prior work. In addition, we perform a quantitative evaluation by assessing the image quality of the resulting videos. Prior work has generally not considered quantitative evaluations.
For obtaining our own results, the CNN model was either trained and tested on different time periods of the same videos (participant-dependent) or trained and tested on videos of different human participants (participant-independent), both using a 20% holdout rate for testing. The qualitative and quantitative results we show in the following sections are always from video excerpts in the test set. To achieve a fair comparison, all the compared methods used the same filter bands: [0.7 Hz, 2.5 Hz] for pulse color magnification, and [0.16 Hz, 0.5 Hz] for respiration motion magnification. Since VAM uses difference of Gaussian (DoG) filters defined by a single pass-band frequency, we adopted the center frequencies of the physiology frequency bands ( for pulse, and for respiration) as its filtering parameters. In the color magnification baselines, video frames were decomposed into multiple scales using a Gaussian pyramid with the intensity changes in the fourth level amplified (following the source code released by [wu2012eulerian]). All the motion magnification baselines used complex steerable pyramids with octave bandwidth and four orientations. The magnification factors of all the methods were tuned to be visually the same on task A without head motion interferences.
5.1 Color Magnification
We apply our method to the task of magnifying the photoplethysmogram. In this task the target variable for training the CNN was the gold-standard contact PPG signal. The input motion representation was 36 pixels × 36 pixels × 3 color channels. In terms of the hyper-parameters of gradient ascent, the number of iterations was chosen to be 20, and the step size was chosen to be . We found these choices provided a moderate magnification level, equivalent to the magnification using EVM. Different choices of these hyper-parameters will be discussed in the following sections.
Fig. 5 shows a qualitative comparison between our method and the baseline methods. The human participant in the video reoriented his head once per second to a random direction. In the horizontal scan line of the input video, only the head rotation is visible and the subtle color changes of the skin corresponding to pulse cannot be seen with the unaided eye. In the results of the baseline methods, strong motion artifacts are introduced. This is because the complex head motion is not distinguishable from the pulse signal in the frequency domain, so it is amplified along with the pulse. Since the pulse-induced color changes are several orders of magnitude weaker than the head motion, they are completely buried by the motion artifacts in the amplified video. The VAM scan line (Fig. 5 (c)) shows slightly fewer artifacts than the EVM scan line (Fig. 5 (b)) as the head rotation was occasionally semi-linear. On the other hand, our algorithm uses a deep neural network to separate the pulse signal from the head motion, and uses gradient ascent to specifically amplify it. Consequently, its scan line (Fig. 5 (d)) preserves the morphology of the head rotation while revealing the periodic color changes clearly on the skin.
To show the magnification effects on different colors and different object surfaces, we plotted the original and magnified traces of a pixel in the three color channels of a video in Fig. 6. The human participant in the video rotated her head left and right, so the selected pixel was on her forehead half of the time and on the black background the other half of the time (corresponding to the notches in the traces). First, the pulse-induced color changes were only magnified when the pixel was on the skin surface, which demonstrates the good spatial specificity of our algorithm. Second, the magnified pulse signal has a much higher amplitude in the green channel than in the other channels. This is consistent with previous findings that the relative amplitude of the human pulse is approximately 0.33:0.77:0.53 in the R, G and B channels under a halogen lamp [de2013robust], and verifies that our algorithm faithfully preserves the original physiological property during magnification. Third, we changed the chosen step size to different multiples of its value with the number of iterations unaltered, and visualized the resulting pixel traces in Fig. 6 as well. There is a clear trend that larger step sizes lead to higher amplitudes of the magnified pulse.
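| Method | Color PSNR: Task A | Task B | Task C | Task D | Color SSIM: Task A | Task B | Task C | Task D | Motion PSNR: Task A | Task B | Task C | Task D | Motion SSIM: Task A | Task B | Task C | Task D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepMag - P. Dep. | 38.2 | 42.8 | 42.8 | 38.5 | .981 | .987 | .987 | .981 | 33.3 | 41.5 | 41.4 | 34.1 | .940 | .980 | .979 | .952 |
| DeepMag - P. Ind. | 38.3 | 42.7 | 42.6 | 38.5 | .981 | .987 | .987 | .981 | 33.4 | 41.5 | 41.4 | 34.0 | .940 | .979 | .979 | .951 |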
and VAM. The table shows the average metrics across all videos within each task, while the bar charts also show the standard deviations as error bars. Our models (both participant-dependent and participant-independent) produce videos with higher PSNR and SSIM than the baselines for all tasks. The benefit of our model is particularly strong for videos with greater levels of head rotation.
To perform a quantitative evaluation of video quality, we used two metrics: peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). In both cases we calculated the metrics on every frame of the tested videos, and averaged them across all participants within each task. The reference frame in each case was the corresponding frame from the original, unmagnified video. Table 1 shows a comparison of the video quality metrics for the baselines and our method. Although the magnified blood flow or respiration will naturally lower the metrics, we found that artifacts in the generated videos had a much more significant impact on their values than the magnified physiology. Thus, lower PSNR and SSIM values indicate more artifacts and lower quality. According to the table, our methods achieve both higher PSNR and higher SSIM than the baseline methods, which verifies the ability of our methods to magnify subtle color changes while suppressing motion artifacts. On Task A, which contains limited head motion, the metrics of the baseline methods are very close to those of our method. However, as the head rotation becomes faster and more random on the more difficult tasks, the video quality of the baseline outputs decreases dramatically. This is because the baseline algorithms indiscriminately amplify any motion lying in the filter band, so the magnification leads to significant artifacts when large head motions are present. With our method, on the other hand, the video quality is maintained at almost the same level across tasks. Both PSNR and SSIM are only slightly lower on Task A and Task D, because the patterned background is more vulnerable to artifacts than the black one. The difference between the participant-dependent results and the participant-independent results is also very small, suggesting that our algorithm generalizes well and can be applied to new videos containing different human participants without additional tuning.
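As an illustration of the per-frame evaluation, the sketch below computes PSNR and a simplified SSIM between original and magnified frames and averages them over a video. It rests on several assumptions: 8-bit frames (peak value 255), the commonly used SSIM constants 0.01 and 0.03, and a single global SSIM window per frame rather than the usual sliding local windows, so its SSIM values would differ from those of a standard implementation. The function names are hypothetical.

```python
import numpy as np


def psnr(reference, output, peak=255.0):
    """Peak signal-to-noise ratio in dB between two frames of equal shape."""
    diff = np.asarray(reference, dtype=float) - np.asarray(output, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))


def global_ssim(reference, output, peak=255.0):
    """SSIM computed from whole-frame statistics (a single window; a simplification)."""
    x = np.asarray(reference, dtype=float)
    y = np.asarray(output, dtype=float)
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mx, my = x.mean(), y.mean()
    cov = np.mean((x - mx) * (y - my))
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (x.var() + y.var() + c2)
    return float(num / den)


def mean_video_metrics(reference_frames, output_frames):
    """Average both metrics over all frame pairs of one video."""
    pairs = list(zip(reference_frames, output_frames))
    mean_psnr = float(np.mean([psnr(r, o) for r, o in pairs]))
    mean_ssim = float(np.mean([global_ssim(r, o) for r, o in pairs]))
    return mean_psnr, mean_ssim


ref = np.arange(16, dtype=np.uint8).reshape(4, 4)
checker = np.indices((4, 4)).sum(axis=0) % 2   # 0/1 checkerboard perturbation
mild = ref + checker                           # small distortion
strong = ref + 5 * checker                     # larger distortion
```

Because the unmagnified frame is the reference, any change lowers both metrics, which matches the reading above that lower values indicate more artifacts.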
5.2 Motion Magnification
We apply our method to the task of magnifying respiration motions. In this task, the target variable for training the CNN was the gold-standard respiration signal measured via the chest strap. Given the subtle nature of the motions, we found that a higher-dimensional input motion representation was needed than for the PPG magnification. As shown in Fig. 3, the motion representation was 123 pixels × 123 pixels × 4 orientations. The gradient ascent hyper-parameters, the number of iterations and the step size, were chosen to be 20 and to produce moderate magnification effects.
Fig. 7 shows a qualitative comparison between our method and the baseline methods. The human participant in the video rotated his head at a speed of 10 degrees/sec. A vertical scan line on his shoulder was plotted over time to show the respiration movement. In the input video, the respiration movement is very subtle. Both our method and the baseline methods greatly increased its magnitude (Fig. 7 (b) (c) (d)). However, the baseline methods cannot clearly distinguish the phase variations caused by respiration from those caused by head rotation, so they also amplified the head rotation and blurred the participant's face. Our method is based on a better motion discriminator learned via the CNN, so the head motions are not amplified.
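A scan-line plot of this kind can be assembled by taking the same pixel column (or row) from every frame and stacking the slices over time. The sketch below assumes the video is stored as an array of shape (frames, height, width, channels). The function names and the choice of which image axis carries time are illustrative assumptions, not the layout used in the figures.

```python
import numpy as np


def vertical_scanline(video, column):
    """Stack one pixel column from every frame; result has shape (height, frames, channels)."""
    video = np.asarray(video)
    return np.transpose(video[:, :, column, :], (1, 0, 2))


def horizontal_scanline(video, row):
    """Stack one pixel row from every frame; result has shape (frames, width, channels)."""
    return np.asarray(video)[:, row, :, :]


# Tiny synthetic video: 2 frames of 3 x 4 pixels with 3 color channels.
video = np.arange(2 * 3 * 4 * 3).reshape(2, 3, 4, 3)
v_slice = vertical_scanline(video, column=1)
h_slice = horizontal_scanline(video, row=2)
```

In the vertical slice, periodic up-and-down displacement of an edge such as the shoulder would appear as a wave along the time axis.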
To show the intermediate phase variations and the different magnification effects along different orientations, we plotted the original and magnified traces of a pixel in the phase representation (Fig. 8). Since the selected pixel is on the shoulder of the human participant, the respiration movement is mainly in the vertical direction. As a result, the amplified phase variations corresponding to breathing have the highest amplitude along the orientation shown in Fig. 8 (c) and the lowest amplitude along the orientation shown in Fig. 8 (a). We also changed the chosen step size to different multiples of its value with the number of iterations unaltered, and visualized the resulting phase traces in Fig. 8. The figure suggests that the magnification level always increases with the step size.
The same quantitative metrics as those for color magnification were computed and are shown in Table 1. They generally follow the same pattern as in the color magnification analysis: the video quality of the baseline methods is affected by the level of head motion, while our method is considerably more robust. There is no significant difference between our participant-dependent results and participant-independent results.
5.3 Magnification Factors
The magnification factor of our algorithm is controlled by two hyper-parameters: the number of iterations and the step size. In Fig. 6 and Fig. 8, we kept the number of iterations fixed and set the step size to different multiples of its chosen value. The resulting magnification levels were always higher when the step size was larger. However, there is a trade-off in the selection of the step size, as a higher magnification factor also introduces more artifacts. Table 2 shows the average video quality metrics PSNR and SSIM for our output videos on an exemplary task (Task C) with different choices of the step size. For both the pulse and respiration magnification tasks, the video quality decreases to different extents as the step size increases. Given that artifacts considerably reduce the PSNR and SSIM metrics (as shown in Table 1), the fact that the values do not change dramatically with the step size shows that few artifacts are introduced with increasing magnification.
To quantitatively analyze the effects of the number of iterations and the step size on the magnification factor, we plotted exemplary learning curves for one of our videos in Fig. 9 (a) with different choices of parameters. The curves show the changes of our CNN loss, the L2 norm of the differential motion signal, which is a good estimate of the target motion magnitude. According to the learning curves, both hyper-parameters correlate positively with the motion magnitude, and the relationship between and the motion magnitude is semi-linear. However, a larger step size with fewer iterations is not equivalent to a smaller step size with more iterations. In Fig. 9 (b), we show how the loss changes with the product of the number of iterations and the step size, which suggests that relatively small step sizes and more iterations can increase the magnification factor more efficiently.
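The role of the two hyper-parameters can be illustrated with plain gradient ascent on a toy objective. This sketch is not the trained CNN: it assumes a fixed linear map `A` as a stand-in for the network, uses the L2 norm of its output as the loss, and picks arbitrary values for the input, the iteration counts and the step sizes. All names are hypothetical.

```python
import numpy as np


def l2_loss(A, x):
    """L2 norm of the toy model output A @ x, standing in for the CNN loss."""
    return float(np.linalg.norm(A @ x))


def gradient_ascent(A, x0, n_iters, step):
    """Run plain gradient ascent on the input x and record the loss after each step."""
    x = np.asarray(x0, dtype=float).copy()
    losses = [l2_loss(A, x)]
    for _ in range(n_iters):
        y = A @ x
        grad = A.T @ y / np.linalg.norm(y)  # gradient of ||A x||_2 with respect to x
        x = x + step * grad
        losses.append(l2_loss(A, x))
    return x, losses


A = np.diag([1.0, 2.0])        # toy stand-in for the network (assumption)
x0 = np.array([0.3, -0.1])     # toy input motion representation

_, curve_base = gradient_ascent(A, x0, n_iters=20, step=0.1)
_, curve_short = gradient_ascent(A, x0, n_iters=10, step=0.1)
_, curve_big_step = gradient_ascent(A, x0, n_iters=20, step=0.5)
```

On this toy objective the loss grows with both the number of iterations and the step size. The non-equivalence between a few large steps and many small steps reported above comes from the nonlinearity of the real network and is not reproduced by a linear model.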
5.4 Gradient Ascent Mechanisms
Compared with traditional gradient ascent, we added two new mechanisms to adapt the approach to the task of video magnification: L1 normalization and sign correction. Here we show experimental results to support the necessity of these mechanisms.
The goal of applying L1 normalization is to ensure that every frame in a video is magnified to the same level. To achieve this, the gradient in (1) needs to be approximately proportional to the motion representation. However, this was not the case without L1 normalization. Fig. 10 shows the time series and histograms of the L1 norms of the motion representation and of the gradient for a 30-second video. The distribution of the motion representation appears Gaussian, while the distribution of the gradient is highly skewed. To correct the distribution of the gradient so that it matches that of the motion representation, the gradient needs to be L1 normalized.
In Fig. 11, we show the pixel-wise correlation coefficients between the input and the magnified motion representations in the respiration magnification task, with and without the sign correction mechanism. When there is no sign correction, the correlation coefficients take both positive and negative values (Fig. 11 (b)). As introduced in Section 3.1, the negative values appear because the target motion could be amplified with its direction reversed. In the example in Fig. 11 (b), most of the negative values occur on the background and are negligible, as the background has nearly no motion to amplify. However, some of them are on the human body, which will cause the output video to be blurry after magnification. After sign correction is applied, all the correlation coefficients become positive (Fig. 11 (c)).
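One possible reading of the two mechanisms, for a single frame, is sketched below. The exact update rule is defined earlier in the paper and is not reproduced here, so both forms are assumptions: sign correction is taken to mean that each gradient component is given the sign of the original motion representation, and L1 normalization is taken to mean that the gradient is rescaled so that its L1 norm equals that of the original motion representation. The function names and example values are hypothetical.

```python
import numpy as np


def corrected_update(grad, x0):
    """Apply sign correction, then L1 normalization, to one frame's gradient (assumed forms)."""
    # Sign correction: push every element in the direction of the original motion.
    corrected = np.abs(grad) * np.sign(x0)
    l1 = np.abs(corrected).sum()
    if l1 == 0.0:
        return np.zeros_like(corrected)
    # L1 normalization: make the update size proportional to the frame's own motion.
    return corrected / l1 * np.abs(x0).sum()


def magnify_step(x, x0, grad, step):
    """One gradient ascent step on the motion representation x of a frame."""
    return x + step * corrected_update(grad, x0)


x0 = np.array([0.2, -0.1, 0.0, 0.05])   # toy motion representation of one frame
grad = np.array([-4.0, -2.0, 1.0, 8.0])  # toy gradient; first sign disagrees with x0
x1 = magnify_step(x0, x0, grad, step=1.0)
```

In this example the first component would have been pushed against the original motion without sign correction, and the element with no original motion is left unchanged.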
Revealing subtle signals in our everyday world is important for helping us understand the processes that cause them. We present a novel single deep neural framework for video magnification that is robust to large rigid motions. Our method leverages a CNN architecture that enables magnification of a specific source signal even if it overlaps with other motion sources in the frequency domain. We present several methodological innovations in order to achieve our results, including adding L1 normalization and sign correction to the gradient ascent method.
Pulse and respiration magnification are good exemplar tasks for video magnification as these physiological phenomena cause both subtle color and motion variations that are invisible to the unaided eye. Our qualitative evaluation illustrates how the PPG color changes and respiration motions can be clearly magnified. Comparisons with baseline methods show that our proposed architecture dramatically reduces artifacts when there are other rotational head motions present in the videos.
In a systematic quantitative evaluation our method improves the PSNR and SSIM metrics across tasks with different levels of rigid motion. By magnifying a specific source signal we are able to maintain the quality of the magnified videos to a greater extent.