Warping-based video stabilizers smooth the camera trajectory by constraining each pixel's displacement and warp stabilized frames from unstable ones accordingly. However, since the view outside the boundary is not available during warping, the resulting holes around the boundary of the stabilized frame must be discarded (i.e., cropped) to maintain visual consistency, which leads to a tradeoff between stability and cropping ratio. In this paper, we make a first attempt to address this issue by proposing a new Out-of-boundary View Synthesis (OVS) method. Exploiting the spatial coherence between adjacent frames and within each frame, OVS extrapolates the out-of-boundary view by aligning adjacent frames to each reference frame. Technically, it first calculates the optical flow and propagates it to the outer boundary region according to the affinity, and then warps pixels accordingly. OVS can be integrated into existing warping-based stabilizers as a plug-and-play module to significantly improve the cropping ratio of the stabilized results. In addition, stability is improved because the jitter amplification effect caused by cropping and resizing is reduced. Experimental results on the NUS benchmark show that OVS can improve the performance of five representative state-of-the-art methods in terms of objective metrics and subjective visual quality. The code is publicly available at https://github.com/Annbless/OVS_Stabilization.
With the increased demand for high-quality video shot on handheld devices, video stabilization has become increasingly important, as such videos often contain undesirable jitter. Many video stabilization methods have been proposed to eliminate jitter in unstable videos for a better visual experience [31, 5, 38, 36, 37], and they can also facilitate many other computer vision tasks [3, 27, 42, 4, 41, 40].
Warping-based stabilizers perform stabilization by first estimating and then smoothing the camera trajectories. The stabilized video is warped from the unstable video based on the pixel displacement field obtained from the transformation between the shaky and smoothed trajectories. Unfortunately, some of the requisite source pixels during warping lie outside the boundary of the current unstable frame, inevitably leading to holes near the boundary of the stabilization result. To maintain visual consistency, cropping and resizing operations are employed to discard these holes, but they may reduce the effective frame size, change the frame aspect ratio, and amplify the jitter. Previous approaches mitigate this problem by reducing the area of these out-of-boundary pixels, e.g., by limiting the maximum deformation displacement [18, 19, 8, 7]. This constraint forces a compromise between stability and cropping ratio: smoother trajectories require cropping a larger area, and vice versa, neither of which is ideal for a better visual experience. Is it possible to maintain stability while increasing the cropping ratio to get (near) full-frame stabilized results?
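Before turning to that question, the tradeoff itself can be made concrete. The following minimal sketch is our illustration, not any specific stabilizer's implementation; the helper name and the Gaussian smoother are assumptions. It smooths a toy 2D trajectory and measures the crop margin that the correction implies:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def stabilize_trajectory(trajectory: np.ndarray, sigma: float = 5.0):
    """Smooth a per-frame camera trajectory (N x 2 translations) and
    return the smoothed path plus the per-frame warp correction."""
    smoothed = gaussian_filter1d(trajectory, sigma=sigma, axis=0)
    correction = smoothed - trajectory  # displacement applied when warping
    return smoothed, correction

# A larger sigma gives a steadier path but larger corrections, i.e., more
# out-of-boundary pixels and hence a bigger crop.
traj = np.cumsum(np.random.randn(100, 2), axis=0)  # toy shaky trajectory
_, corr = stabilize_trajectory(traj, sigma=5.0)
print("crop margin per axis (pixels):", np.abs(corr).max(axis=0))
```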
Interpolation-based stabilizers provide a solution to achieve this goal by iteratively interpolating intermediate frames from adjacent frames, including pixels that lie in the out-of-boundary view of the current frame. These methods implicitly exploit the property that the content in adjacent frames and within each frame follows spatial coherence, a property that has been widely adopted in SfM [9, 32], video inpainting [15, 12, 39], and super-resolution [16, 17, 44]. However, this property has not been explored in previous warping-based methods for filling in the out-of-boundary pixels. The most intuitive way to exploit it is to use interpolation to fill the black holes after warping. Unfortunately, some valuable content may already have been permanently discarded because it was not sampled during the warping process that produced the stable frames, making such post-warping inpainting ineffective here. In contrast, we investigate pre-warping extrapolation to alleviate this problem, which aims to synthesize the out-of-boundary view of each frame by exploiting the property of spatial coherence, thus enabling the subsequent warping process to sample enough pixels as needed.
Specifically, we propose a new Out-of-boundary View Synthesis (OVS) method in this paper, which consists of two stages and infers the view outside the boundary from the adjacent frames by aligning them to each reference frame. In the first, coarse alignment stage, the adjacent frames are roughly aligned with the reference frame using grid-based motion estimation. Afterwards, a second, fine alignment stage is introduced to handle subtle misalignment and refine the results. It first calculates the optical flow, then predicts the optical flow in the out-of-boundary views via affinity propagation, and finally warps pixels from adjacent frames according to the predicted optical flows. This process is carried out iteratively to gradually align more distant neighboring frames to the current frame and expand the area of the out-of-boundary view, so that the subsequent warping process can find the needed candidate pixels to obtain a stable frame. Experimental results on the NUS benchmark show that OVS can be plugged into five representative warping-based methods, significantly improving the cropping ratio of the stabilized results.
In summary, the contribution of this work is threefold:
1) We make a first attempt to improve warping-based video stabilizers towards full-frame stabilization by extrapolating the requisite out-of-boundary view during warping.
2) We propose a two-stage coarse-to-fine method for out-of-boundary view synthesis by exploiting the spatial coherence in the video.
3) Experimental results on publicly available datasets show that the proposed method can serve as a plug-and-play module to significantly improve both grid-based and pixel-based warping methods.
A representative solution for video stabilization is to estimate the warping field from unstable frames to stabilized ones. Traditional methods typically follow a three-step procedure: first estimating the trajectory, then smoothing it, and finally obtaining the stabilized frame from the unstable one based on the warping field. The warping field is generated by computing the transformation between the original trajectory and the smoothed trajectory. These methods can constrain the maximum transformation to reduce the area of the out-of-boundary view needed during the warping process, yielding a high cropping ratio but low stability. For example, Subspace minimizes the displacement of pixels after smoothing the trajectory, which is fitted by a polynomial curve. A similar strategy is utilized in CPW [18, 8]. Bundled, SteadyFlow, and MeshFlow try to minimize the transformation as well as the motion between adjacent frames to reduce the area of the out-of-boundary view. Liu uses a depth camera for video stabilization and limits the maximum rotation and translation transformations. L1Stabilizer limits the range of each element in the warping matrix to reduce the area of the out-of-boundary view. Deep learning-based stabilizers learn to regress the unstable-to-stable warping field for stabilization, or to regress stabilized frames directly. By leveraging the ground-truth stable frames as the supervisory signal [31, 33], the warping field is implicitly constrained to produce stabilized results. Since some pixels of the stabilized frame are not available due to the absence of pixel correspondence during warping, their losses are not calculated, implying that cropping is still needed to get the desired final result.
These methods only make a tradeoff between cropping ratio and stability, as the need for out-of-boundary pixels during warping is reduced rather than satisfied. In contrast, we propose a new method, named OVS, that explicitly extrapolates the requisite out-of-boundary pixels for warping, helping warping-based stabilizers achieve full-frame video stabilization. It offers a new perspective on video stabilization and does not require the transformation to be limited to a small range. In addition, OVS can serve as a plug-and-play module to significantly improve both grid-based and pixel-based warping methods.
Towards cropping-free video stabilization, DIFRINT proposes to treat stabilization as a frame generation problem, where the stabilized frames are intermediate frames generated by interpolating from adjacent frames in an iterative manner. However, due to the large jitter in unstable videos, the estimated optical flow for interpolation is not reliable, leading to distortions or ghost artifacts, especially around the boundaries of dynamic objects. Nevertheless, it is insightful to exploit the spatial coherence of adjacent frames for full-frame video stabilization. In this paper, we also investigate the impact of spatial coherence, but from a different perspective, i.e., extrapolating the out-of-boundary view to benefit warping-based methods, rather than interpolating frames for stabilization directly. Together with warping-based stabilizers, OVS can obtain better stabilized results with fewer distortions or ghost artifacts while maintaining a high cropping ratio as well as good stability.
The out-of-boundary view of each frame can be obtained by aligning adjacent frames to the reference one. Traditional image alignment methods usually detect feature points first, e.g., SIFT, SURF, ORB, or LIFT, then select robust feature points from them, e.g., by RANSAC, and finally use them to calculate the transformation for alignment. Recently, several deep learning-based image alignment methods have been proposed [6, 26, 43]. Among them, [43] estimates a global homography between adjacent frames and uses it for alignment. It is noteworthy that using a single global homography may be ineffective when there is a large movement of the camera pose, which is very common in the unstable videos of the video stabilization task. Nevertheless, because of its superiority over traditional methods, we use it as the baseline method for out-of-boundary view synthesis. In contrast, we propose a two-stage coarse-to-fine method, which can deal with large camera movements and dynamic objects more effectively.
The proposed OVS method aims to synthesize the out-of-boundary view for video stabilization. It consists of a coarse alignment stage and a fine alignment stage. As shown in Figure 2, the coarse alignment stage takes each current frame together with its neighboring frames as input. It aligns each neighboring frame to the reference one and generates a mask to indicate the valid out-of-boundary view after alignment. The second, fine alignment stage takes the aligned frames and their masks as input for further refinement. These two stages are carried out alternately to gradually align more distant frames to the reference one and expand the area of the out-of-boundary view. The overall procedure is sketched below; the details of the two stages are discussed in the following.
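At a high level, the alternation can be summarized by the following schematic sketch, where identity stubs stand in for the two stages; all function names are ours, not the released code:

```python
import numpy as np

def coarse_align(neighbor, reference):
    """Stub for the grid-based coarse alignment stage (identity here)."""
    return neighbor.copy(), np.ones(neighbor.shape[:2], dtype=bool)

def fine_align(aligned, mask, reference):
    """Stub for the flow- and affinity-based fine alignment stage."""
    return aligned, mask

def expand_canvas(frames, t, max_offset=3):
    """Alternate coarse and fine alignment over increasingly distant neighbors."""
    canvas = frames[t].copy()
    filled = np.ones(canvas.shape[:2], dtype=bool)  # would start as the padded mask
    for k in range(1, max_offset + 1):
        for j in (t - k, t + k):                    # forward and backward passes
            if 0 <= j < len(frames):
                aligned, m = coarse_align(frames[j], canvas)
                aligned, m = fine_align(aligned, m, canvas)
                hole = ~filled & m                  # only fill still-missing pixels
                canvas[hole] = aligned[hole]
                filled |= m
    return canvas, filled
```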
We adopt the motion estimation module of DUT for the coarse alignment stage. It first detects keypoints and estimates their motion in each frame, then propagates the motion of the keypoints to a set of predefined grids, and finally estimates a homography for each grid. The keypoints are robust to noise and illumination variation, while the grid-based estimation can handle dynamic objects and large jitter. In addition, a multi-planar estimation strategy is used to deal with the complex motion patterns in real scenes. After obtaining the motion of the mesh vertices, the mesh-based homography matrices are computed and used for coarse alignment. The aligned frames are used as the input for the fine alignment stage. Specifically, the frames of the input unstable video are denoted $\{I_t\}_{t=1}^{N}$, where $N$ is the total number of frames and $I_t$ represents the $t$-th frame. The size of the input frames is $H \times W$, where $H$ is 480 and $W$ is 640. We employ zero-padding outside the boundary of each frame before warping, i.e., 80 pixels in each direction, and use masks $M_t$ to denote the valid regions after padding. The final output of the coarse alignment module is $\{(I^{cf}_t, M^{cf}_t)\}$ and $\{(I^{cb}_t, M^{cb}_t)\}$, where the former set is generated from a forward pass and the latter set from a backward pass. Note that we keep $I^{cf}_1 = I_1$, $M^{cf}_1 = M_1$, $I^{cb}_N = I_N$, and $M^{cb}_N = M_N$. The coarse-aligned frames ($I^{cf}_t$, $I^{cb}_t$) and corresponding masks ($M^{cf}_t$, $M^{cb}_t$) are of size $\hat{H} \times \hat{W}$, where $\hat{H} = H + 160$ and $\hat{W} = W + 160$.
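For concreteness, the padding and mask construction can be written as follows; this is a PyTorch sketch using the sizes stated above, and the helper name is ours:

```python
import torch
import torch.nn.functional as F

PAD = 80  # pixels padded in each direction, as stated above

def pad_frame(frame: torch.Tensor):
    """Zero-pad a (C, H, W) frame and build its validity mask."""
    padded = F.pad(frame, (PAD, PAD, PAD, PAD))  # zeros outside the boundary
    mask = torch.zeros(1, *padded.shape[-2:])
    mask[:, PAD:-PAD, PAD:-PAD] = 1.0            # valid only inside the original frame
    return padded, mask

frame = torch.rand(3, 480, 640)                  # H = 480, W = 640
padded, mask = pad_frame(frame)
assert padded.shape[-2:] == (640, 800)           # hat{H} = 640, hat{W} = 800
assert mask.sum().item() == 480 * 640            # only the original region is valid
```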
The coarse-aligned frames serve as good initial out-of-boundary view synthesis results, although there is some subtle misalignment due to the influence of dynamic objects or large jitter. To further refine the results, we propose a fine alignment (FA) stage. In the following, we denote $I^{cf}_t$ as $I^{c}_t$ and $M^{cf}_t$ as $M^{c}_t$, and only present the forward pass for simplicity. As shown in Figure 2(b), the fine alignment stage first estimates the optical flow between the roughly aligned frame $I^{c}_{t-1}$ and the reference frame $I_t$ inside the boundary, then propagates it outside the boundary based on affinity, and finally extrapolates the out-of-boundary pixels from the aligned frame according to the propagated optical flow to obtain the refined result.
Taking $I^{c}_{t-1} \odot M^{c}_{t-1}$ and $I_t$ as input, the optical flow from $I^{c}_{t-1}$ to $I_t$ is estimated using PWCNet:
$$F_{t-1 \to t} = \mathrm{PWCNet}\left(I^{c}_{t-1} \odot M^{c}_{t-1},\, I_t\right),$$
where $\odot$ denotes element-wise multiplication. Note that PWCNet has been widely used for optical flow estimation in many stabilizers [5, 34] due to its good performance. Then, a flow reverse (FR) layer [1, 11] is employed to get the reversed optical flow from $I_t$ to $I^{c}_{t-1}$:
$$F_{t \to t-1},\, M^{s}_{t} = \mathrm{FR}\left(F_{t-1 \to t}\right).$$
The reason that we use flow reversal to estimate $F_{t \to t-1}$, instead of directly estimating it from $I_t$ to $I^{c}_{t-1}$ using PWCNet, is that some pixels in $I_t$ may correspond to pixels outside the boundary of $I^{c}_{t-1}$, therefore leading to erroneous optical flow. Moreover, the flow reverse layer can output a mask $M^{s}_{t}$ indicating whether or not a pixel in $I_t$ has a corresponding pixel in $I^{c}_{t-1}$, i.e., the yellow shared view of $I_t$ and $I^{c}_{t-1}$ illustrated in Figure 3.
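The FR layers in [1, 11] rely on soft (bilinear or softmax) splatting; the simplified nearest-pixel sketch below conveys the idea and also produces the shared-view mask. It is a simplification for illustration, not the layer from [1, 11]:

```python
import numpy as np

def reverse_flow(flow_ab: np.ndarray):
    """Naive flow reversal by nearest-pixel forward splatting.

    flow_ab: (H, W, 2) optical flow from frame a to frame b in (dx, dy) order.
    Returns flow_ba defined on frame b and a mask that is 1 where a pixel of
    b received a correspondence, i.e., lies in the shared view.
    """
    h, w = flow_ab.shape[:2]
    flow_ba = np.zeros_like(flow_ab)
    mask = np.zeros((h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    # target coordinates in frame b, rounded to the nearest pixel
    tx = np.rint(xs + flow_ab[..., 0]).astype(int)
    ty = np.rint(ys + flow_ab[..., 1]).astype(int)
    ok = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    flow_ba[ty[ok], tx[ok]] = -flow_ab[ys[ok], xs[ok]]  # negate to reverse
    mask[ty[ok], tx[ok]] = 1.0
    return flow_ba, mask
```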
Given $F_{t \to t-1}$ inside the shared view, we want to obtain the optical flow outside the boundary such that the pixels there can be extrapolated from $I^{c}_{t-1}$ accordingly. Following the spatial coherence property within the frame, we argue that the motion of static objects should be coherent locally. Based on this assumption, we propose to estimate affinity kernels from the color and structure information and use them to propagate the optical flow from inside the shared view to the outside of the boundary, as illustrated in Figure 3(c).
Technically, we devise an encoder-decoder network, as shown in Figure 2(b), to estimate the affinity kernels from $I_t$, $E_t$, $M_t$, $I^{c}_{t-1}$, $E^{c}_{t-1}$, $M^{c}_{t-1}$, $F_{t \to t-1}$, and $M^{s}_{t}$, where $E_t$ ($E^{c}_{t-1}$) denotes the edge map extracted from $I_t$ ($I^{c}_{t-1}$) using the Sobel filter, which encodes the structure information. We use ResNet-50 as the backbone encoder. The decoder takes the features from the last layer of the encoder as input and gradually upsamples them back to the original resolution. To fully utilize both high- and low-level features, we follow a UNet-like structure and concatenate the feature from the encoder and the previous decoder layer as the input of the next decoder layer. Each decoder layer is composed of three convolution layers with batch normalization. Given the output of the decoder, denoted $D_t$, a convolution layer (denoted $\phi_A$) is employed to predict pixel-wise affinity kernels, and a separate convolution layer (denoted $\phi_F$) predicts a refined flow used in the subsequent propagation process, i.e.,
$$A_t = \phi_A(D_t), \qquad \tilde{F}_{t \to t-1} = \phi_F(D_t).$$
The affinity matrix $A_t$ is of size $\hat{H} \times \hat{W} \times (2r+1)^2$, where $r$ is the radius of the affinity kernel, i.e., 4 in this paper. $\tilde{F}_{t \to t-1}$ is the refined flow of size $\hat{H} \times \hat{W} \times 2$, which also provides an initial estimate for the out-of-boundary view. Then, we use the affinity matrix $A_t$, the refined optical flow $\tilde{F}_{t \to t-1}$, and the mask $M^{s}_{t}$ to propagate the optical flow from pixels in the shared view to the out-of-boundary pixels. Mathematically, this process can be formulated as follows:
$$F^{k+1}(p) = \sum_{q \in \mathcal{N}_r(p)} \hat{A}_t(p, q)\, F^{k}(q),$$
where $p$ and $q$ denote the 2D location of each pixel, $\mathcal{N}_r(p)$ is the $(2r+1) \times (2r+1)$ neighborhood of $p$, $\hat{A}_t$ denotes the normalized affinity kernels, $k$ denotes the number of iterations, and $F^{0} = \tilde{F}_{t \to t-1}$.
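A PyTorch sketch of this propagation step follows; it reflects our reading of the formulation, and the absolute-value normalization and the hard freezing of the shared view are assumptions:

```python
import torch
import torch.nn.functional as F

def propagate_flow(flow: torch.Tensor, affinity: torch.Tensor,
                   shared_mask: torch.Tensor, iters: int = 10):
    """Propagate flow to out-of-boundary pixels with local affinity kernels.

    flow:        (1, 2, H, W) initial flow, i.e., the refined flow F^0
    affinity:    (1, K*K, H, W) per-pixel kernels, K = 2*r + 1
    shared_mask: (1, 1, H, W), 1 inside the shared view, 0 outside
    """
    k2 = affinity.shape[1]
    k = int(k2 ** 0.5)
    # Normalize so each pixel's kernel sums to 1; this is what bounds the
    # gradient of one propagation step (see the analysis below).
    aff = affinity.abs()
    aff = (aff / aff.sum(dim=1, keepdim=True).clamp(min=1e-6)).view(1, 1, k2, -1)
    for _ in range(iters):
        # Gather every pixel's K x K neighborhood of flow values.
        cols = F.unfold(flow, kernel_size=k, padding=k // 2).view(1, 2, k2, -1)
        new = (cols * aff).sum(dim=2).view_as(flow)
        # Out-of-boundary pixels take the propagated value; the shared view
        # is kept fixed here (see the slow update below).
        flow = shared_mask * flow + (1 - shared_mask) * new
    return flow
```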
The propagation step can be rewritten in matrix form as
$$\mathbf{F}^{k+1} = \hat{\mathbf{A}}_t \mathbf{F}^{k},$$
where $\mathbf{F}^{k}$ and $\hat{\mathbf{A}}_t$ denote the vectorized flow and the normalized affinity matrix, respectively. The partial derivative of this propagation step w.r.t. $\mathbf{F}^{k}$ is:
$$\frac{\partial \mathbf{F}^{k+1}}{\partial \mathbf{F}^{k}} = \hat{\mathbf{A}}_t.$$
Therefore, we have:
$$\left\lVert \frac{\partial \mathbf{F}^{k+1}}{\partial \mathbf{F}^{k}} \right\rVert_{\infty} \le 1.$$
Note that the maximum gradient will always be less than 1 because we use the normalized kernel values during the propagation process. Thus, the stability of the affinity-based flow propagation can be guaranteed. Besides, since the refined flow inside the shared view is much more reliable, we use a slow update strategy to preserve its values during the propagation process. To be specific, we use a ratio $\gamma$ for updating as follows:
$$F^{k+1}(p) = \gamma \sum_{q \in \mathcal{N}_r(p)} \hat{A}_t(p, q)\, F^{k}(q) + (1 - \gamma)\, F^{k}(p), \qquad \forall\, p \in M^{s}_{t}.$$
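In code, the slow update replaces the hard freeze of the shared view in the sketch above; the blending ratio below is an illustrative value, since the paper's setting is not reproduced here:

```python
def slow_update(flow, propagated, shared_mask, gamma: float = 0.1):
    """Blend the propagated flow into the reliable shared-view flow slowly.

    Outside the boundary the propagated flow is taken as-is; inside the
    shared view only a fraction gamma of the update is applied (gamma is
    an assumed value).
    """
    inside = shared_mask * (gamma * propagated + (1.0 - gamma) * flow)
    outside = (1.0 - shared_mask) * propagated
    return inside + outside
```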
The loss function of the coarse alignment stage is the same as in DUT. For the fine alignment stage, we use a robust L1 loss during training. Given the propagated optical flow $F^{K}$ after $K$ iterations, we can extrapolate the pixel values outside the boundary of $I_t$, $E_t$, and $M_t$, respectively, i.e.,
$$\hat{I}_t = \mathcal{W}\left(I^{c}_{t-1}, F^{K}\right), \quad \hat{E}_t = \mathcal{W}\left(E^{c}_{t-1}, F^{K}\right), \quad \hat{M}_t = \mathcal{W}\left(M^{c}_{t-1}, F^{K}\right),$$
where $\mathcal{W}$ denotes backward warping. Then, we can calculate the robust L1 loss between the above predictions and their ground truth, e.g., for the frame term,
$$L_{I} = \sqrt{\left(\hat{I}_t - I^{gt}_t\right)^2 + \epsilon^2},$$
where $\epsilon$ is a small value for numerical stability. Note that there may be a trivial propagation solution, where every element in the extrapolated mask $\hat{M}_t$ is zero. To address this issue, we add a regularization loss on $\hat{M}_t$ as follows:
$$L_{reg} = \max\left(0,\; \textstyle\sum_{p} M^{s}_{t}(p) - \sum_{p} \hat{M}_t(p)\right),$$
which penalizes the shrinkage of the mask region after propagation. The final training objective for the fine alignment stage is $L = L_{I} + \lambda_{E} L_{E} + \lambda_{M} L_{M} + \lambda_{reg} L_{reg}$, where the loss weights are set empirically.
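The two loss components map onto a few lines of PyTorch; the sketch below is under assumptions, as the eps value and the normalization of the regularizer are ours:

```python
import torch

def robust_l1(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6):
    """Robust (Charbonnier-style) L1 loss; eps is an assumed small constant."""
    return torch.sqrt((pred - gt) ** 2 + eps ** 2).mean()

def mask_regularizer(mask_pred: torch.Tensor, mask_shared: torch.Tensor):
    """Penalize shrinkage of the valid region after propagation.

    Positive only when the extrapolated mask covers less area than the
    shared-view mask, discouraging the trivial all-zero solution.
    """
    return torch.relu(mask_shared.sum() - mask_pred.sum()) / mask_pred.numel()
```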
Note that although we adopt supervised training for OVS, it does not require paired stable/unstable data. In our experiments, we prepare the training data by cropping the unstable frames. Specifically, we first randomly crop a region of size $(H + 160) \times (W + 160)$ from each unstable frame and use it as the ground truth $I^{gt}$. Then, we crop its central part of size $H \times W$ as the input $I$. The remaining surrounding part of $I^{gt}$ is exactly the out-of-boundary view w.r.t. $I$.
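A minimal sketch of this pair construction follows; the crop sizes derive from the 80-pixel padding stated earlier, and the helper name is ours:

```python
import numpy as np

PAD, H, W = 80, 480, 640  # sizes as stated earlier in the text

def make_training_pair(frame: np.ndarray, rng=np.random):
    """Build an (input, ground-truth) pair from one unstable frame.

    frame: (Hf, Wf, 3) with Hf >= H + 2*PAD and Wf >= W + 2*PAD. The random
    crop serves as ground truth including the out-of-boundary ring; its
    central H x W part is the network input.
    """
    hf, wf = frame.shape[:2]
    gh, gw = H + 2 * PAD, W + 2 * PAD
    y = rng.randint(0, hf - gh + 1)
    x = rng.randint(0, wf - gw + 1)
    gt = frame[y:y + gh, x:x + gw]
    inp = gt[PAD:PAD + H, PAD:PAD + W]
    return inp, gt
```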
We only use the unstable videos in the DeepStab dataset for training and validation. Specifically, fifty videos are used for training and the other ten videos for validation. We use the Adam optimizer with betas (0.9, 0.99) during training. The learning rate is set to 2e-4 and decays by 0.5 every 30 epochs. We train the coarse alignment stage for 50 epochs and the fine alignment stage for 200 epochs. The whole training process takes about 36 hours on a single NVIDIA V100 GPU. We test the performance of our OVS model on the NUS dataset.
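These optimization settings map directly onto standard PyTorch components; a sketch, where the model is a stand-in and the loop body is elided:

```python
import torch

model = torch.nn.Sequential(torch.nn.Conv2d(3, 3, 3, padding=1))  # stand-in network
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.99))
# Halve the learning rate every 30 epochs, per the schedule above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)

for epoch in range(200):  # fine alignment stage: 200 epochs
    # ... one training epoch over the DeepStab unstable videos ...
    scheduler.step()
```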
Table 1: Quantitative comparison on the NUS benchmark in terms of cropping ratio, distortion, and stability (higher is better). DUT + OVS* additionally fills the remaining holes via nearest-neighbor interpolation.

| Method | Cropping ratio | Distortion | Stability |
| --- | --- | --- | --- |
| PWStabNet + OVS | 0.959 | 0.957 | 0.832 |
| Yu + OVS | 0.922 | 0.784 | 0.834 |
| StabNet + OVS | 0.763 | 0.829 | 0.748 |
| MeshFlow + OVS | 0.898 | 0.683 | 0.823 |
| DUT + OVS | 0.967 | 0.926 | 0.847 |
| DUT + OVS* | 0.999 | 0.944 | 0.849 |
To demonstrate the versatility and effectiveness of the proposed OVS, we select several representative warping-based stabilizers and integrate OVS into them as a plug-and-play module. Specifically, PWStabNet and Yu utilize pixel-based warping for stabilization, while StabNet, MeshFlow, and DUT are grid-based warping stabilizers. OVS is integrated with these stabilizers by replacing their warping steps and keeping the rest of each method unchanged.
The results are summarized in Table 1. Following Bundled, we use three metrics for performance evaluation, including cropping ratio, distortion, and stability, where larger values are better. DIFRINT is an interpolation-based method; it performs better in terms of cropping ratio but falls behind warping-based methods like PWStabNet and DUT in terms of stability and distortion. It is noteworthy that the cropping ratio of DIFRINT is smaller than 1 although it does not require cropping. We suspect that this is because the stabilizer learned a small zoom-in effect during the iterative interpolation process, as shown in Figure 4. More analysis is provided in the supplementary material.
In addition, it can be seen that OVS significantly improves the cropping ratio of all the warping-based methods. Meanwhile, the distortion and stability of these methods also increase as a by-product, since a larger cropping ratio means less cropping and resizing of the stabilized frames, thereby reducing the jitter amplification effect. Besides, since cropping aims to remove the black holes around the boundary, it does not guarantee preservation of the original frame aspect ratio. Consequently, when the holes dominate in one direction, there may be large distortions in the result, i.e., a lower distortion metric. In contrast, after extrapolating the out-of-boundary view via the proposed OVS, the frame aspect ratio is better preserved, resulting in larger distortion metrics. We also notice that when applying our OVS to DUT, the cropping ratio is 0.967, which is close to 1, and most of the remaining holes lie in texture-less regions like the sky. We therefore use a simple trick to fill in these holes via nearest-neighbor interpolation during warping. This trick leads to full-frame stabilization results, as shown in the last row of Table 1.
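The nearest-neighbor filling trick can be realized, for instance, with a distance transform; the sketch below is our illustration of the idea, not the authors' implementation:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def fill_holes_nearest(frame: np.ndarray, valid: np.ndarray) -> np.ndarray:
    """Fill hole pixels with the color of the nearest valid pixel.

    frame: (H, W, 3) stabilized frame; valid: (H, W) boolean mask, True
    where the warp produced a pixel and False inside the holes.
    """
    # For each hole pixel, find the coordinates of the nearest valid pixel.
    idx = distance_transform_edt(~valid, return_distances=False,
                                 return_indices=True)
    return frame[idx[0], idx[1]]
```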
We also provide some subjective results of DUT, MeshFlow, and Yu with and without our OVS, as well as DIFRINT, for comparison in Figure 4. It can be seen that without OVS, there are significant content losses in the results of DUT, MeshFlow, and Yu due to the absence of out-of-boundary views during warping. This content loss can be mitigated by extrapolating the out-of-boundary view using our OVS, as shown in the fourth, sixth, and last rows. The results of DIFRINT suffer from large distortions, as highlighted by the red circle, especially when there are severe jitters between neighboring frames and dynamic objects. This is because the optical flow extracted for interpolation is not accurate under large jitters and dynamic objects, leading to ghost artifacts around object boundaries. Besides, the results of DIFRINT suffer from small content losses, e.g., the bottom-left flag in the second column and the zoom-in effect in the sixth column. In contrast, OVS mitigates these issues, as shown in the results of DUT+OVS. Based on both the quantitative and qualitative evaluation results, OVS demonstrates its effectiveness in helping warping-based stabilizers achieve better video stabilization results with a high cropping ratio, e.g., DUT+OVS achieves near full-frame stabilization results.
We select the state-of-the-art image alignment method [43] as a baseline for comparison with our OVS, which is a deep learning-based method specifically designed for adjacent-frame alignment. In addition, we isolate the coarse alignment stage and the fine alignment stage from our OVS to compare their results with the complete version of OVS. Specifically, we choose DUT as the basic stabilizer and use the aligned frames from the different models for warping. Their PSNR and SSIM metrics on the validation set are summarized in Table 2. It can be seen that 1) the baseline alignment method can achieve good alignment results in terms of the SSIM score and helps DUT improve the cropping ratio, demonstrating its effectiveness for image alignment. Our coarse alignment module can also help DUT improve the cropping ratio and distortion metric, and it obtains a better PSNR score than the baseline. However, its SSIM score is the worst among all the models, implying that there is some structural misalignment in its results. Our fine alignment module obtains better out-of-boundary view synthesis results than the coarse alignment module in terms of both PSNR and SSIM. However, it does not help DUT achieve better stabilization results on its own, implying that it can only serve as a refinement module. After combining both modules, our OVS achieves the best view synthesis performance among all the models and significantly improves the performance of DUT for video stabilization.
Some visual results are shown in Figure 5 and Figure 6. The baseline method produces significant discontinuities around sharp edges and the boundaries of dynamic objects in its stabilized results, while our OVS exhibits no such distortions. Similar discontinuities can be observed when only using coarse alignment for warping, implying the necessity of fine alignment. However, only using fine alignment leads to a smaller out-of-boundary view, as shown by the large holes around the boundaries. These results confirm the complementarity between coarse alignment and fine alignment in our OVS, which collaboratively help synthesize large out-of-boundary views with fewer distortions.
We investigate the influence of the number of iterations when employing OVS in warping-based stabilizers, taking DUT as an exemplar stabilizer. The stability, distortion, and cropping ratio at different settings are shown in Table 3. It can be seen that as the number of iterations increases, the cropping ratio first increases and then almost saturates. The distortion metric increases significantly at the beginning and then slightly decreases, while the stability metric stays almost the same. As the number of iterations increases, more distant frames are aligned to the current reference frame for out-of-boundary view synthesis, leading to a larger synthesized area. Meanwhile, subtle misalignment may accumulate over the iterations, especially from distant frames. Nevertheless, distant frames contribute less to the synthesis since they share less area with the reference frame. This explains why the distortion metric slightly decreases as the number of iterations grows.
Due to variations between adjacent frames, such as illumination changes, dynamic objects, and noise, some visible seams may exist in the stabilized frames, degrading the visual experience. Such discontinuities can be refined using an encoder-decoder network. It is noteworthy that we use the widely adopted PWCNet for optical flow estimation following [5, 34] and a simple encoder-decoder network for fine alignment and affinity propagation, which we do not claim as a major contribution. Rather, the major contribution of this work is that we provide a fresh perspective for warping-based methods towards full-frame video stabilization, i.e., explicitly extrapolating the requisite out-of-boundary view during warping. In addition, we exploit the spatial coherence in the video to achieve this goal via a simple coarse-to-fine scheme. In the future, we plan to devise a more effective end-to-end model for better out-of-boundary view synthesis and video stabilization.
This paper presents a new Out-of-boundary View Synthesis (OVS) method that can help warping-based stabilizers achieve near full-frame stabilization with fewer distortions and better stability. OVS exploits the spatial coherence in the video to effectively align adjacent frames to reference frames and synthesize the out-of-boundary view, thereby directly benefiting the warping process in warping-based stabilizers by providing the requisite candidate pixels. It can serve as a plug-and-play module and significantly improve both pixel-based and grid-based warping stabilizers. We hope this study provides valuable insights to the community and inspires follow-up research from a different but promising direction for video stabilization.
Acknowledgement Mr. Yufei Xu and Dr. Jing Zhang are supported by the ARC project FL-170100117.