Deep Online Video Stabilization

02/22/2018 ∙ by Miao Wang, et al.

Video stabilization is essential for most hand-held videos because of high-frequency shake. Several 2D-, 2.5D- and 3D-based stabilization techniques have been well studied, but to our knowledge, no solution based on deep neural networks has been proposed. This is mostly due to the shortage of training data, as well as the challenge of modeling the problem using neural networks. In this paper, we solve the video stabilization problem using a convolutional neural network (ConvNet). Instead of dealing with offline holistic camera path smoothing based on feature matching, we focus on low-latency real-time camera path smoothing without explicitly representing the camera path. Our network, called StabNet, learns a transformation for each input unsteady frame progressively along the time-line, while creating a more stable latent camera path. To train the network, we create a dataset of synchronized steady/unsteady video pairs using well-designed hand-held hardware. Experimental results show that the proposed online method (without using future frames) performs comparably to traditional offline video stabilization methods, while running about 30 times faster. Further, the proposed StabNet is able to handle night-time and blurry videos, where existing methods fail in robust feature matching.




I Introduction

Video captured by a hand-held camera is often hard to watch because of shaky content. Several digital video stabilization techniques have been proposed in the past decade to improve the visual quality of hand-held videos by removing high-frequency camera movements [1, 2, 3, 4, 5]. The majority of the proposed methods treat the problem globally, estimating and smoothing the camera path using offline computation. The very few online stabilization methods perform a ‘capture, compute, display’ operation for each incoming video frame in real time with low latency. Due to the real-time requirement, the camera motion is estimated by an affine transformation, a homography or meshflow. In this paper, we focus on the online stabilization problem. Unlike existing approaches, which must explicitly model the camera path in order to smooth it, we use a learning-based framework to directly compute a target steady transformation, with guidance from historical stabilized frames (see Figure 1).

Fig. 1: Deep online video stabilization. We propose StabNet, a ConvNet model that learns to predict transformation parameters for each incoming unsteady frame, given the history of steady frames. Applying the predicted transformation parameters to the original unsteady frame generates the stabilized output frame. Stabilized frames then act as historical frames for stabilizing future unsteady frames.

In recent years, we have witnessed how convolutional neural networks (ConvNets) have changed the vision and graphics fields. In general, methods based on ConvNets run more efficiently than their traditional counterparts. Several traditional video processing topics such as video stylization [6] and video deblurring [7] have been re-addressed using ConvNets. To our knowledge, no ConvNet-based method has been published for digital video stabilization, although it is an important topic in video processing. We observed two main obstacles preventing a ConvNet-based stabilization solution: 1) Lack of training data: pairs of steady and unsteady synchronized videos with an identical capturing route and content are required for training a ConvNet model. While this is not necessary for traditional methods, it is essential for a learning-based stabilization approach. 2) Problem definition: traditional stabilization methods compute and smooth a camera path, which cannot be easily adapted to a ConvNet-based solution. A somewhat different problem definition is required.

Based on these observations, we propose to solve corresponding issues by creating a practical data set for training a neural network, and modifying the formulation of the problem using progressive online stabilization. To collect training data, we captured synchronized hand-held unsteady/steady video pairs using a remodeled hand-held stabilizer with two cameras. With this hardware, only one camera is stabilized by the hand-held stabilizer while the other camera is fixed to the stabilizer grip, moving consistently with the hand motions. In our modified formulation of the stabilization problem, instead of estimating and smoothing a virtual camera path, we learn the transformation parameters for each unsteady frame progressively along the time-line, and generate a steady output video in an online mode.

We present StabNet, a ConvNet model that stabilizes frames with lightweight feed-forward operations through the network. The learning process is driven by the information in historically stabilized frames, supervised by the ground-truth steady frame. Figure 1 shows an overview of our deep video stabilization.

The proposed deep stabilization method performs comparably well on test videos collected from existing works. The main merit of our algorithm is its ability to run in real time at 93 FPS with low latency (1 frame), about 30 times faster than offline methods. Further, our method is superior to existing methods in its ability to handle night-time videos and extremely blurry videos, where existing feature-matching-based methods may fail completely. To our knowledge, the proposed StabNet is a pioneer in using a convolutional network for digital video stabilization. We also built the DeepStab dataset, consisting of pairs of synchronized steady/unsteady videos for training. We believe that this first stabilization training dataset will benefit the community, and we plan to release it for future research.

II Related Work

Our work is closely related to digital video stabilization approaches and deep learning video manipulation.

Digital Video Stabilization

Existing offline stabilization techniques estimate the camera trajectory from a 2D, 2.5D or 3D perspective and then synthesize a new smooth camera trajectory to remove the undesirable high-frequency motion. 2D video stabilization methods estimate (bundled) homography or affine transformations between consecutive frames and smooth these transformations temporally. In early work, low-pass filters were applied to individual model parameters [1, 8]. Grundmann et al. applied L1-norm optimization to synthesize a path consisting of simple cinematography motions [3]. Liu et al. [5] modeled the camera motion on multiple local camera paths. Zhang et al. [9] proposed to optimize geodesics on the Lie group embedded in transformation space. 3D-based stabilization methods reconstruct a 3D scene [10] and estimate the 3D camera trajectory. Liu et al. [2] proposed the first 3D stabilization method using content-preserving warping. The later subspace video stabilization method [4] smooths long tracked features using subspace constraints. Goldstein and Fattal [11] proposed to enhance the length of feature tracks with epipolar transfer. These two 2.5D-based methods can deal with cases of reconstruction failure, and produce results that are visually as good as those of a 3D-based method. With user interactions, Bai et al. [12] proposed a user-guided stabilization approach to select good feature tracks and warping results. The rolling shutter problem in high-speed video is addressed in [13]. Recently, a 2D-3D hybrid stabilization approach was proposed to stabilize 360° video [14]. In general, 2D stabilization methods perform efficiently and robustly, while 3D-based methods can generate visually better results.

Real-time online stabilization is specifically desired for live-stream applications. Liu et al. [15] proposed an online stabilization method which uses only the historical camera path to compute warping functions for incoming frames. Inspired by their idea, we present a deep online stabilization approach which performs stabilization given a few historical stabilized frames. The novelty of our approach is that we avoid explicitly estimating and smoothing the camera path; instead, we use a ConvNet model to directly predict warping functions.

ConvNets for Video Applications

In recent years, ConvNets have made huge improvements in computer vision tasks such as image recognition [16, 17, 18], segmentation [19, 20, 21] and generation [22, 23, 24]. When fed multiple successive frames from videos, ConvNets can predict optical flow [25, 26] or semantics [27, 28]. Several works use ConvNets to directly produce video contents, such as scene dynamic generation [29, 30], frame interpolation [31, 32] and deblurring [7, 33]. Because predicting a long video sequence is still a challenging problem, all of the above works used only two or very few successive frames as training samples. The proposed StabNet also considers a temporal neighborhood at each time. The stabilization problem cannot be solved using a generation-based model because of the severe vibration of the input video content. To generate visually pleasing results, our StabNet learns the warping parameters instead of generating pixel values.

III Training Dataset

Fig. 2: Exemplar frames of the DeepStab dataset. The dataset includes pairs of synchronously captured videos. Each pair consists of an unsteady video and a stabilized video, with the same content. Camera motions include forward movement, pan movement, spin movement and complex movements including combinations of the above, at various speeds.

Generating training data is one of the key challenges for digital video stabilization, where ground truth data cannot be easily collected or labeled. To train StabNet, two synchronized video sequences of the same scene are required: one sequence captures a steady camera movement, while the other is unstable. One possible way to generate such data is to render a virtual scene with two camera path configurations: smooth and turbulent. However, ConvNet models trained using rendered virtual scenes may not generalize well due to the domain gap between synthetic training videos and real testing videos captured by a hand-held camera. To generate authentic data, we designed a hardware rig with two portable GoPro Hero 4 Black cameras and a hand-held stabilizer, where the cameras lie horizontally next to each other with small disparity (Figure 3). When capturing videos, the two cameras shoot synchronously, with only one camera stabilized, while the other moves consistently with the hand/body motion of the holder. We turned off the auto-focus and auto-exposure functions of the cameras and used the synchronous remote control for synchronization.

Fig. 3: Hardware and training data capturing process.

Training videos are obtained by holding the designed hardware while taking shots from a first-person point of view. We present the DeepStab dataset, containing pairs of synchronized videos of outdoor scenes with diverse camera movements. The dataset includes indoor scenes with large parallax and common outdoor scenes with buildings, vegetation, crowds, etc. Camera motions include forward movement, pan movement, spin movement and complex movements including combinations of the above, at various speeds. Videos are processed to remove the fish-eye distortion and to trim out clips with large lighting differences.

In total, we collected 60 pairs of synchronized videos, each 20-30 seconds long at 30 FPS on average. The videos are split into 44 training pairs, 8 validation pairs and 8 testing pairs. Figure 2 shows representative sampled frames from the dataset. The recorded video pairs are augmented to provide more training samples by horizontally flipping the frames, reversing the video sequences, and combining both flipping and reversing.
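The three augmentations described above can be sketched as follows. This is a minimal illustration, not the pipeline used to build DeepStab: the function name `augment_pair` and the array layout (time, height, width, optional channels) are assumptions made for the example.

```python
import numpy as np

def augment_pair(unsteady, steady):
    """Return the original pair plus flipped, reversed, and flipped+reversed copies.

    Both inputs are arrays of shape (T, H, W) or (T, H, W, C). The same
    operation is applied to both videos so the pair stays synchronized.
    The array layout is an assumption of this sketch.
    """
    def hflip(video):
        return video[:, :, ::-1]   # mirror the width axis

    def reverse(video):
        return video[::-1]         # play the clip backwards

    ops = [lambda v: v, hflip, reverse, lambda v: hflip(reverse(v))]
    return [(op(unsteady), op(steady)) for op in ops]
```

Each recorded pair therefore yields four training pairs, and applying an identical operation to both videos keeps the steady/unsteady correspondence intact.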

IV The StabNet

Fig. 4: Network Architecture. StabNet is a two-branch Siamese network with shared parameters in each branch. It consists of an encoder and a Homography Regressor. The Homography Regressor is a Conv layer that outputs 8 channels with 1×1 kernel size and 1 pixel stride (k1n8s1). During training, samples of two successive frames are fed to the network, and the corresponding transformation parameters are predicted. The network is trained with both a stability loss and a temporal loss.


As our focus is online stabilization, we cannot use future frames when processing a given frame. We convert the online stabilization problem to a supervised learning problem of conditional transformation without explicitly computing a camera path. The inputs to StabNet are an incoming unsteady frame $I_t$ and, as the condition, 5 historical steady frames $S_t$ sampled from approximately the previous second for time-stamp $t$. The output is a transformation $F_t$ for frame $I_t$. The steady frame $\hat{I}_t$ is created by applying $\hat{I}_t = F_t * I_t$, where $*$ is the warping operator. The learning process is supervised by ground-truth steady frames $I'_t$. When training StabNet, the conditional inputs $S_t$ are ground-truth steady frames, while for testing, $S_t$ are the historical stabilized frames.

IV-A Network Architecture

Our StabNet is a Siamese network [34] that has two branches sharing the network parameters. We use a Siamese architecture to preserve temporal consistency between successive transformed frames. Each branch of StabNet is a two-stage network consisting of an encoder, which extracts high-level features from the inputs, and a regressor, which predicts the stabilization transformation parameters from the extracted feature map. Figure 4 shows the architecture of StabNet. The inputs are 6 concatenated grayscale frames of equal spatial dimension, consisting of 5 conditional steady frames and one unsteady frame. Frames are sent to an encoder to extract features. This encoder adapts ResNet-50 [18] as the backbone feature extractor, with the conv 1 layer modified to accept our 6-channel input, and with all layers after average pooling removed. The extracted feature map from the encoder is of dimension 1×1×2048. Next, we use a conv layer with 1×1 kernel size to regress a Homography transformation. A Homography transformation has eight parameters to regress:

$$F_t = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{pmatrix}$$

These parameters can be represented using a 1×1×8 vector. As a result, a 1×1×8 vector is produced as the output of StabNet.
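To make the eight-parameter output concrete, the sketch below turns a regressed 8-vector into a 3×3 Homography and applies it to 2D points. The row-major ordering of the eight values and the helper names `params_to_homography` and `warp_points` are assumptions for illustration; the network's actual parameter ordering is not specified here.

```python
import numpy as np

def params_to_homography(params):
    """Build a 3x3 homography from 8 values; the bottom-right entry is fixed to 1.

    Row-major ordering of the 8 values is an assumption of this sketch.
    """
    params = np.asarray(params, dtype=np.float64)
    if params.shape != (8,):
        raise ValueError("expected exactly 8 parameters")
    return np.append(params, 1.0).reshape(3, 3)

def warp_points(H, points):
    """Apply homography H to an (N, 2) array of (x, y) points."""
    points = np.asarray(points, dtype=np.float64)
    homog = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coordinates
    out = homog @ np.asarray(H, dtype=np.float64).T
    return out[:, :2] / out[:, 2:3]                         # perspective divide
```

With this ordering, the vector `[1, 0, 0, 0, 1, 0, 0, 0]` is the identity transformation, which leaves a frame unchanged.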

IV-B Stabilization Loss Function


The training process is driven by a two-term loss function including a stability term and a temporal smoothness term, which is based on neighboring unsteady frames $I_{t-1}$ and $I_t$. The loss function is defined as:

$$L = L_{stab} + \lambda L_{temp}$$

where $L_{stab}$ is the stability loss, and $L_{temp}$ is the temporal loss. $\lambda$ is a weighting parameter balancing the two losses.

IV-B1 Stability Loss

The stability loss drives the warped unsteady frames to the ground-truth steady frames using cues of pixel alignment and feature point alignment. It is defined as:

$$L_{stab} = L_{pixel} + \alpha L_{feature}$$

where $L_{pixel}$ is the pixel alignment term, $L_{feature}$ is the feature alignment term, and $\alpha$ is a weighting parameter.

The pixel alignment term measures how the transformed frame $F_t * I_t$ (the unsteady frame $I_t$ warped by the predicted transformation $F_t$) aligns with the ground-truth steady frame $I'_t$, using mean squared error (MSE):

$$L_{pixel} = \frac{1}{D} \left\| I'_t - F_t * I_t \right\|_2^2$$

where $D$ is the spatial dimension of the frame. The transformation operates in the image domain. To make the warping function differentiable, we use a spatial transformer layer [35]. The $L_{pixel}$ loss will be small if the transformed frame aligns well with the ground-truth frame. However, during training the transformed frame can converge slowly to the ground truth. During early training stages, frames are not well aligned and the loss term is less informative. For faster convergence during training, we further introduce a feature alignment loss.

The feature alignment term is computed as the average alignment error of matched feature points after transforming the unsteady frame using the predicted transformation $F_t$:

$$L_{feature} = \frac{1}{m} \sum_{i=1}^{m} \left\| {p'}_t^{i} - F_t * p_t^{i} \right\|_2^2$$

where $m$ is the number of pairs of matched feature points between each steady/unsteady frame pair, and $p_t^{i}$ and ${p'}_t^{i}$ are the $i$-th matched feature points from the unsteady frame and the ground-truth steady frame respectively.
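The two stability terms can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the training code: the frame is assumed to be already warped, the feature error is taken as a squared Euclidean distance, and `alpha=1.0` is a placeholder rather than the value used in the experiments.

```python
import numpy as np

def pixel_loss(warped, target):
    """Mean squared error between the warped unsteady frame and the steady frame."""
    warped = np.asarray(warped, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean((warped - target) ** 2))

def feature_loss(H, pts_unsteady, pts_steady):
    """Average squared distance between steady points and transformed unsteady points.

    Using the squared distance is an assumption of this sketch.
    """
    pts_u = np.asarray(pts_unsteady, dtype=np.float64)
    pts_s = np.asarray(pts_steady, dtype=np.float64)
    homog = np.hstack([pts_u, np.ones((len(pts_u), 1))])
    proj = homog @ np.asarray(H, dtype=np.float64).T
    proj = proj[:, :2] / proj[:, 2:3]          # perspective divide
    return float(np.mean(np.sum((proj - pts_s) ** 2, axis=1)))

def stability_loss(warped, target, H, pts_unsteady, pts_steady, alpha=1.0):
    """Pixel term plus alpha times the feature term (alpha=1.0 is a placeholder)."""
    return pixel_loss(warped, target) + alpha * feature_loss(H, pts_unsteady, pts_steady)
```

The feature term gives a useful signal even when the frames overlap poorly, because it depends only on point coordinates and not on pixel intensities.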

To compute the feature loss, all matched point pairs are computed in a pre-processing stage between steady and unsteady frame pairs. We extract SURF features [36] from both frames of each pair, then calculate the matching between them by dividing the frames into sub-images and using a RANSAC algorithm [37] to fit a Homography in each corresponding sub-image. Our sub-image division differs from that in [5] because of the large camera pose and content variation between the steady and unsteady cameras. Note that the feature extraction and feature matching processes are only performed for training the network and are not needed during online stabilization.

IV-B2 Temporal Loss

Simply applying the transformations separately to every video frame can create wobble artifacts in the video. Therefore, we incorporate a temporal loss term to enforce temporal coherency between adjacent frames using the Siamese network architecture. Figure 5 compares stabilization results with and without temporal loss. Each time two successive samples are fed into StabNet, two successive transformations $F_{t-1}$ and $F_t$ are predicted. The temporal loss is defined as the mean squared error between the successive output frames:

$$L_{temp} = \frac{1}{D} \left\| F_t * I_t - w(F_{t-1} * I_{t-1}) \right\|_2^2$$

where $I_{t-1}$ and $I_t$ are the successive unsteady frames, $D$ is the spatial dimension of the frame, and $w(\cdot)$ is a function that warps the steady frame at $t-1$ to the steady frame at $t$ according to pre-computed optical flow. In our experiments we use the TV-L1 algorithm [38] to compute the optical flow, but alternative methods for optical flow calculation can also be used.
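A minimal sketch of the temporal term follows. It assumes grayscale frames, a dense flow field that has already been computed, and nearest-neighbour backward warping with border clamping; these are simplifications for illustration, and `warp_with_flow` is a hypothetical helper rather than part of the described system.

```python
import numpy as np

def warp_with_flow(frame, flow):
    """Backward-warp a grayscale frame with a dense flow field (nearest neighbour).

    flow[y, x] = (dx, dy) means the output pixel at (x, y) is read from
    (x + dx, y + dy) in `frame`; samples are clamped to the image border.
    """
    height, width = frame.shape
    ys, xs = np.mgrid[0:height, 0:width]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, width - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, height - 1)
    return frame[src_y, src_x]

def temporal_loss(out_t, out_prev, flow):
    """MSE between the current output frame and the flow-warped previous output."""
    warped_prev = warp_with_flow(out_prev, flow).astype(np.float64)
    diff = np.asarray(out_t, dtype=np.float64) - warped_prev
    return float(np.mean(diff ** 2))
```

A differentiable implementation would use bilinear sampling instead of rounding, but the structure of the loss is the same.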

Fig. 5: Stabilization results without and with temporal loss. Results with temporal loss lead to a sharper, less “ghosted” mean and lower standard deviations. (a), (b) and (c) are successive stabilized frames and the corresponding mean and standard deviation without temporal loss; (d), (e) and (f) are successive stabilized frames and the corresponding mean and standard deviation with temporal loss.

Fig. 6: Comparison on 6 publicly available videos in terms of three metrics.

IV-C Implementation Details

To train StabNet, we resize the videos to a fixed, reduced spatial dimension for efficiency. A ResNet-50 model pre-trained on ImageNet [39] is loaded without the Conv 1 layer, and is fine-tuned during the training process. We use mini-batch training with ADAM [40] for optimization. The learning rate is multiplied by a constant factor at a fixed interval of iterations, and the training process is terminated after a fixed number of iterations. The whole training process is run on an NVIDIA GTX 1080 Ti graphics card.

In the training process, we feed StabNet with two successive samples in the two branches (with shared network parameters) to learn temporal coherency. During testing, however, the network is used to stabilize a single frame at a time; temporal consistency is automatically preserved. Further, the stabilization is self-driven for a test video as follows: we start by duplicating the first frame and regard the duplicates as the initial historical frames. After stabilizing a frame, the historical stabilized frames are updated and used as the condition for stabilizing the next frame. This process is repeated through the time-line.
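The self-driven test procedure can be sketched as a simple loop. The callables `predict` and `warp` are hypothetical stand-ins for the trained network and the warping operator, and keeping the 5 most recent outputs as history is a simplifying assumption; the described method samples its historical frames from roughly the previous second.

```python
from collections import deque

def stabilize_online(frames, predict, warp, history_len=5):
    """Stabilize frames one at a time, feeding each output back as history.

    `predict(history, frame)` returns a transformation and `warp(frame, t)`
    applies it; both are placeholders for the real model components.
    """
    # Bootstrap the history by duplicating the first frame.
    history = deque([frames[0]] * history_len, maxlen=history_len)
    outputs = []
    for frame in frames:
        transform = predict(list(history), frame)  # uses past outputs only
        stable = warp(frame, transform)
        outputs.append(stable)
        history.append(stable)                     # the oldest entry drops out
    return outputs
```

Because each frame depends only on already-stabilized frames, the loop never needs to look ahead, which is what gives the method its single-frame latency.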

The stabilization results inevitably have meaningless frame borders introduced by the warping function. As StabNet uses stabilized frames as the inputs for future frames, we need to make StabNet robust to such borders. During training, we add black borders to the ground-truth historical frames, produced by Homography perturbations around the Identity transformation. The Homography perturbations are randomly sampled within a small range, with the image axes normalized. For testing, we crop and trim the borders in post-processing. We plan to release source code and a pre-trained StabNet model.
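One way to sample such perturbations is sketched below. The uniform distribution and the bound `max_delta=0.05` are placeholder assumptions for illustration; the actual sampling range is not reproduced here.

```python
import numpy as np

def random_homography_near_identity(rng, max_delta=0.05):
    """Sample a homography whose 8 free entries deviate from identity by at most max_delta.

    The uniform distribution and the default bound are assumptions of this sketch.
    """
    delta = rng.uniform(-max_delta, max_delta, size=8)
    # The bottom-right entry stays fixed at 1, so it receives no perturbation.
    return np.eye(3) + np.append(delta, 0.0).reshape(3, 3)
```

Warping a ground-truth historical frame with such a matrix, with coordinates normalized, leaves thin black regions along the borders similar to those produced at test time.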

Fig. 7: Comparison with commercial software Adobe Premiere CS6. (a) Quantitative evaluation; (b) Visual comparison of 5 consecutive stabilized frames with its central frame, average frame and standard deviation.

V Experimental Results

We tested our method on various video sources. Testing videos are either from our DeepStab testing set or from [5]. On average, testing runs at 93.3 FPS, which meets the requirement of real-time online stabilization with 1-frame latency.

V-A Quantitative Evaluation

The online stabilization problem is inherently harder than offline stabilization, because only historical frames are available in online stabilization, without a global sense of the camera path. Hence, quantitative performance statistics for online stabilization methods are expected to be inferior to those of offline ones.

We compare our method with existing approaches using quantitative evaluation, computed following [5, 15]. The three objective metrics are cropping ratio, distortion and stability.

Cropping ratio

This metric measures the area of the remaining content after stabilization. A larger cropping ratio, meaning less cropping, is favored. The per-frame cropping ratio is computed as the scale component of the global Homography estimated from the input frame to the output frame. The per-frame ratios are averaged to give the cropping ratio of the whole video.

Distortion value

Distortion value evaluates the degree of distortion introduced by stabilization. The per-frame distortion value is computed as the ratio of the two largest eigenvalues of the affine part of the Homography. The minimum value, which represents the worst distortion, is chosen as the distortion value for the whole video.
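The distortion metric can be sketched as follows. Treating the upper-left 2×2 block as the affine part, using eigenvalue magnitudes, and taking the ratio as smaller over larger (so that 1 means no anisotropic scaling) are assumptions of this sketch.

```python
import numpy as np

def frame_distortion(H):
    """Ratio (smaller / larger) of eigenvalue magnitudes of the upper-left 2x2 block."""
    affine = np.asarray(H, dtype=np.float64)[:2, :2]
    magnitudes = np.abs(np.linalg.eigvals(affine))  # magnitudes handle complex pairs
    return float(magnitudes.min() / magnitudes.max())

def video_distortion(homographies):
    """Worst (minimum) per-frame distortion value over the whole video."""
    return min(frame_distortion(H) for H in homographies)
```

Under these assumptions a pure rotation or translation scores 1, while stretching one axis by a factor of two scores 0.5.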

Stability score

Stability score measures how stable a video is. As there is no benchmark for evaluating stabilized videos, following [5], we use frequency-domain analysis of the camera path to estimate the stability score. The camera path is computed by accumulating the Homography transformations between successive frames. We extract the rotation and translation components from the path as 1D temporal signals, and for each signal take the ratio of its lowest frequency components (2nd to 6th) to the full frequency range as the translation or rotation stability score. Finally, we take the minimum of the translation stability and the rotation stability as the overall stability score.
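The frequency-domain score can be sketched for a single 1D path signal as below. Measuring the components by spectral energy and excluding the DC term from the total are assumptions of this sketch, as are the function names.

```python
import numpy as np

def stability_score(signal):
    """Share of spectral energy in the 2nd-6th lowest frequency components.

    The 1st component (DC) is excluded from both numerator and denominator,
    which is an assumption of this sketch.
    """
    signal = np.asarray(signal, dtype=np.float64)
    energy = np.abs(np.fft.rfft(signal)) ** 2
    energy = energy[1:]                      # drop the DC component
    total = energy.sum()
    if total == 0.0:
        return 1.0                           # no non-DC energy at all
    return float(energy[:5].sum() / total)   # components 2 to 6

def overall_stability(translation_signal, rotation_signal):
    """Overall score is the minimum of the translation and rotation scores."""
    return min(stability_score(translation_signal), stability_score(rotation_signal))
```

A slowly varying path concentrates its energy in the low-frequency components and scores close to 1, while a jittery path scores close to 0.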

We compare our results on 6 publicly available videos against [2, 4, 11, 3, 15] in terms of the objective metrics, based on results provided by the corresponding authors. Comparing against offline stabilization methods is slightly unfair to our method, because future-frame information is not available to our real-time online method. As a result, the stability score of our method is slightly lower. Nevertheless, our method runs in real time while being visually comparable to all existing methods. Comparison details are shown in Figure 6; entries for videos whose results we could not find are left blank.

We further compare our method with the commercial offline stabilization software Adobe Premiere CS6 on a publicly available video data set [5] and our DeepStab data set. As far as we know, the Adobe Premiere stabilizer is developed according to the methods in [4]. We choose the default parameters for Adobe Premiere (smoothness: 50, ‘Smooth Motion’ and ‘Subspace Warp’) to produce results. The testing sets group videos into several categories according to scene type and camera motion, including Regular, Quick Rotation, Quick Zooming, Large Parallax, Running and Crowd. The evaluation is reported in Figure 7. Figure 7 (a) shows that our method generally performs as well as Adobe Premiere CS6, and Figure 7 (b) shows that our results are better than those of Adobe Premiere CS6 in the Regular and Running categories.


As mentioned, the quantitative evaluation score of our online stabilization method is inevitably lower than that of offline methods. However, the average running time of our method is superior to that of all existing methods. Further, our method is based only on historical frames, and thus can be used for online streaming. Running time performance is given in Table I.

Method            FPS    Future Frames
Bundled Cameras   3.5    Yes
Adobe Premiere    4.2    Yes
MeshFlow          22.0   No
Ours              93.3   No
TABLE I: Running time comparison. FPS statistics are given in the second column. The third column shows whether future frames are required for stabilization.
Fig. 8: User study result by comparing our method with Adobe Premiere Stabilizer.

V-B User Study

To visually compare our method with the Adobe Premiere stabilizer, we further conduct a user study with 20 participants aged from 18 to 32. We provide 18 videos for testing, 3 from each aforementioned category. In each test case, we simultaneously show the subjects the original input video, our result, and the result from the Adobe Premiere stabilizer. The two stabilization results are displayed side by side in random order. Every participant is asked to pick the more stable of the two results, or to mark them ‘indistinguishable’, while disregarding differences in aspect ratio or sharpness.

The user study results are shown in Figure 8. For each category, we show the average percentage of user preference. It can be concluded that for videos from the Regular, Quick Zooming and Running categories, our results are comparable with those of the offline approach. For the other categories, which are harder to process without future-frame information, our results are slightly worse. The user study results are consistent with our aforementioned discussion.

V-C Handling Low Quality Videos

StabNet is robust to low-quality videos affected by noise, motion blur or low resolution. When dealing with such videos, traditional methods can fail because few or even no features are available for computing the camera path. We show low-quality video stabilization results on night-time and blurry video cases in the supplemental video.

V-D Limitations

StabNet has limitations. First, in our current implementation, a global Homography transformation is predicted for stabilizing each unsteady frame. More complex network architectures and more complex transformations such as bundled transformations can be further explored. Second, in scenes with drastic motion or with extreme near-range foreground objects, our method may fail. We note that these scenarios are also a challenge for previous methods [4, 3, 5, 15].

VI Conclusion

We have presented StabNet, a convolutional network for digital online video stabilization. Unlike traditional methods, which estimate and smooth camera paths, StabNet learns a warping transformation for each unsteady frame, using only historical stabilized frames as the condition. It runs in real time through fast feed-forward operations. We also present DeepStab, a dataset consisting of pairs of synchronized steady/unsteady videos for training. This set was created using a practical method for generating training videos with synchronized steady/unsteady frames, which could benefit future deep stabilization methods. To our knowledge, StabNet is the first ConvNet for video stabilization. We have demonstrated the power of StabNet for handling typical types of hand-held videos. We believe ConvNet-based methods are promising for digital video stabilization.