Video Stitching for Linear Camera Arrays

07/31/2019, by Wei-Sheng Lai, et al.

Despite the long history of image and video stitching research, existing academic and commercial solutions still produce strong artifacts. In this work, we propose a wide-baseline video stitching algorithm for linear camera arrays that is temporally stable and tolerant to strong parallax. Our key insight is that stitching can be cast as a problem of learning a smooth spatial interpolation between the input videos. To solve this problem, inspired by pushbroom cameras, we introduce a fast pushbroom interpolation layer and propose a novel pushbroom stitching network, which learns a dense flow field to smoothly align the multiple input videos for spatial interpolation. Our approach outperforms the state-of-the-art by a significant margin, as we show with a user study, and has immediate applications in many areas such as virtual reality, immersive telepresence, autonomous driving, and video surveillance.







1 Introduction

Due to sensor resolution and optics limitations, the field of view (FOV) of most cameras is too narrow for applications such as autonomous driving and virtual reality. A common solution is to stitch the outputs of multiple cameras into a panoramic video, effectively extending the FOV. When the optical centers of these cameras are nearly co-located, stitching can be solved with a simple homography transformation. However, in many applications, such as autonomous driving, remote video conferencing, and video surveillance, multiple cameras have to be placed with wide baselines, either to increase view coverage or to satisfy physical constraints. In these cases, even state-of-the-art methods [jiang2015video, perazzi2015panoramic] and current commercial solutions (e.g., VideoStitch Studio [video_stitch_studio], AutoPano Video [autopano], and NVIDIA VRWorks [vrworks]) struggle to produce artifact-free videos, as shown in Figure 6.

One main challenge for video stitching with wide baselines is parallax, i.e., the apparent displacement of an object across multiple input videos due to camera translation. Parallax varies with object depth, which makes it impossible to properly align objects without knowing dense 3D information. In addition, occlusions, dis-occlusions, and the limited overlap between the FOVs cause a significant amount of stitching artifacts. To obtain better alignment, existing image stitching algorithms perform content-aware local warping [chang2014shape, zaragoza2013projective] or find optimal seams around objects to mitigate artifacts at the transition from one view to the other [eden2006seamless, zhang2014parallax]. Applying these strategies to process a video frame-by-frame inevitably produces noticeable jittering or wobbling artifacts. On the other hand, algorithms that explicitly enforce temporal consistency, such as spatio-temporal mesh warping with a large-scale optimization [jiang2015video], are computationally expensive. In fact, commercial video stitching software often adopts simple seam cutting and multi-band blending [burt1983multiresolution]. These methods, however, often cause severe artifacts, such as ghosting or misalignment, as shown in Figure 6. Moreover, seams can cause objects to be cut off or to completely disappear from stitched images—a particularly dangerous outcome for use cases such as autonomous driving.

We propose a video stitching solution for linear camera arrays that produces a panoramic video. We identify three desirable properties in the output video: (1) artifacts, such as ghosting, should not appear; (2) objects may be distorted, but should not be cut off or disappear in any frame; (3) the stitched video must be temporally stable. With these three desiderata in mind, we formulate video stitching as a spatial view interpolation problem. Specifically, we take inspiration from the pushbroom camera, which concatenates vertical image slices captured while the camera translates [gupta1997linear]. We propose a pushbroom stitching network based on deep convolutional neural networks (CNNs). We first project the input views onto a common cylindrical surface. We then estimate bi-directional optical flow, with which we simulate a pushbroom camera by interpolating all the intermediate views between the input views. Instead of generating all the intermediate views (which would require multiple bilinear warping steps on the entire image), we develop a pushbroom interpolation layer that generates the interpolated view in a single feed-forward pass. Figure 7 shows an overview of the conventional video stitching pipeline and our proposed method. Our method yields results that are visually superior to existing solutions, both academic and commercial, as we show with an extensive user study.

2 Related Work

(a) Conventional video stitching pipeline [jiang2015video]

(b) Proposed pushbroom stitching network

Figure 7: Algorithm overview. (a) Conventional video stitching algorithms [jiang2015video] use spatio-temporal local mesh warping and a 3D graph cut to align the entire video, which is often sensitive to scene content and computationally expensive. (b) The proposed pushbroom stitching network adopts a pushbroom interpolation layer to gradually align the input views, and outperforms prior work and commercial software.

Image stitching.

Existing image stitching methods often build on the conventional pipeline of Brown and Lowe [brown2007automatic], which first estimates a 2D transformation (e.g., a homography) for alignment and then stitches the images by defining seams [eden2006seamless] and using multi-band blending [burt1983multiresolution]. However, ghosting artifacts and misalignment still occur, especially when input images have large parallax. To account for parallax, several methods adopt spatially varying local warping based on affine [lin2011smoothly] or projective [zaragoza2013projective] transformations. Zhang et al. [zhang2014parallax] integrate content-preserving warping and seam-cutting algorithms to handle parallax while avoiding local distortions. More recent methods combine homography and similarity transforms [chang2014shape, lin2015adaptive] to reduce projective distortion (i.e., stretched shapes) or adopt a global similarity prior [chen2016natural] to preserve the global shape of the resulting stitched images.

While these methods perform well on still images, applying them to videos frame-by-frame results in strong temporal instability. In contrast, our algorithm, also single-frame, generates videos that are spatio-temporally coherent, because our pushbroom layer only operates on the overlapping regions, while the rest is directly taken from the inputs.

Video stitching.

Due to its computational cost, enforcing spatio-temporal consistency within existing image stitching algorithms is not straightforward. Commercial software, e.g., VideoStitch Studio [video_stitch_studio] or AutoPano Video [autopano], often finds a fixed transformation (via camera calibration) to align all the frames, but cannot align local content well. Recent methods integrate local warping and optical flow [perazzi2015panoramic] or find a spatio-temporal content-preserving warping [jiang2015video] to stitch videos, both of which are computationally expensive. Lin et al. [lin2016seamless] stitch videos captured from hand-held cameras based on 3D scene reconstruction, which is also time-consuming. On the other hand, several approaches, e.g., Rich360 [lee2016rich360] and Google Jump [anderson2016jump], create videos from multiple videos captured on a structured rig. Recently, NVIDIA released VRWorks [vrworks], a toolkit to efficiently stitch videos based on depth and motion estimation. Still, as shown in Figure 6(b), several artifacts, e.g., broken objects and ghosting, are visible in the stitched video.

In contrast to existing video stitching methods, our algorithm learns local warping flow fields based on a deep CNN to effectively and efficiently align the input views. The flow is learned to optimize the quality of the stitched video in an end-to-end fashion.

Pushbroom panorama.

Linear pushbroom cameras [gupta1997linear] are common for satellite imagery: while a satellite moves along its orbit, they capture multiple 1D images, which can be concatenated into a full image. A similar approach has also been used to capture street view images [seitz2003multiperspective]. However, when the scene is not planar, or cannot be approximated as such, these cameras introduce artifacts, such as stretched or compressed objects. Several methods handle this issue by estimating scene depth [rav2004mosaicing], finding a cutting seam on the space-time volume [wexler2005space], or optimizing the viewpoint for each pixel [agarwala2006photographing]. The proposed method is a software simulation of a pushbroom camera, which creates panoramas by concatenating vertical slices that are spatially interpolated between the input views. We note that the method of Jin et al. [jin2018learning] addresses a similar problem of view morphing, which aims at synthesizing intermediate views along a circular path. However, they focus on synthesizing a single object, such as a person or a car, and do not consider the background. Instead, our method synthesizes intermediate views for the entire scene.

3 Stitching as Spatial Interpolation

Our method produces a temporally stable stitched video from wide-baseline inputs of dynamic scenes. While the proposed approach is suitable for a generic linear camera array configuration, here we describe it with reference to the automotive use case. Unlike other applications of structured camera arrays, in the automotive case, objects can come arbitrarily close to the cameras, thus requiring the stitching algorithm to tolerate large parallax.

For the purpose of describing the method, we define the camera setup as shown in Figure 10(a), which consists of three fisheye cameras, $C_L$, $C_C$, and $C_R$, whose baseline spans the entire car's width. Figures 16(a)-(c) show typical images captured under this configuration, and underscore some of the challenges we face: strong parallax, large exposure differences, and geometric distortion. To minimize the appearance change between the three views and to represent the wide FOV of the stitched frames, we first adopt a camera pose transformation to warp the side views $C_L$ and $C_R$ into the reference frame of the center camera, so that the new origin is set at the center camera $C_C$. Then, we apply a cylindrical projection (by approximating the scene to be at infinity) to warp all the views onto a common viewing cylinder, as shown in Figure 10(a). However, even after camera calibration, exposure compensation, fisheye distortion correction, and cylindrical projection, parallax still causes significant misalignment, which results in severe ghosting artifacts, as shown in Figure 16(d).
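As a concrete illustration, the cylindrical projection under the scene-at-infinity approximation can be expressed as a backward map from output-cylinder pixels to source-image pixels. The following is a minimal numpy sketch, assuming an idealized pinhole source camera with intrinsic matrix `K` (the actual pipeline also performs fisheye undistortion and exposure compensation, which are omitted here; all names are illustrative):

```python
import numpy as np

def cylindrical_backward_map(out_w, out_h, f, K):
    """For each output cylinder pixel, compute the source-image pixel
    to sample, assuming the scene is at infinity (a sketch; the real
    pipeline also handles fisheye undistortion).

    f : focal length of the cylindrical surface, in pixels
    K : 3x3 intrinsic matrix of the (already undistorted) source camera
    """
    xs = np.arange(out_w) - out_w / 2.0
    ys = np.arange(out_h) - out_h / 2.0
    theta = xs / f                                  # azimuth per output column
    # Unit-scale viewing ray for every (row, col) on the cylinder.
    dx = np.sin(theta)[None, :] * np.ones((out_h, 1))
    dz = np.cos(theta)[None, :] * np.ones((out_h, 1))
    dy = (ys / f)[:, None] * np.ones((1, out_w))
    rays = np.stack([dx, dy, dz], axis=-1)          # shape (H, W, 3)
    # Project the rays through the source camera's intrinsics.
    uvw = rays @ K.T
    u = uvw[..., 0] / uvw[..., 2]
    v = uvw[..., 1] / uvw[..., 2]
    return u, v   # source pixel coordinates to bilinearly sample
```

In practice the returned coordinate grids would feed a bilinear sampler; the cylinder's central column maps straight to the camera's principal point, and columns farther from the center sample increasingly oblique rays.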

(a) Camera setup
(b) Images on the viewing cylinder
Figure 10: Camera setup and input images. (a) The input videos are captured from three fisheye cameras, $C_L$, $C_C$, and $C_R$, separated by a large baseline. The output is a viewing cylinder centered at $C_C$. (b) The input images are first projected on the output cylinder assuming a constant depth. Within the transition regions, our pushbroom interpolation method progressively warps and blends vertical slices from the input views to create a smooth transition. Outside the transition regions, we do not modify the content from the inputs.
(d) Images blended on the cylinder
(e) Pushbroom interpolation
Figure 16: Example of input and stitched views. Simply projecting the input views $I_L$, $I_C$, and $I_R$ onto the output cylinder causes artifacts due to parallax and scene depth variation (d). Our pushbroom interpolation method effectively stitches the views and does not produce ghosting artifacts (e).

Figure 17: Pushbroom interpolation layer. A straightforward implementation of the pushbroom interpolation layer requires generating all the intermediate flows and intermediate views, which is time-consuming when the number of interpolated views is large. Therefore, we develop a fast pushbroom interpolation layer based on a column-wise scaling of the optical flows, which requires generating only a single interpolated image per transition region.

3.1 Formulation

In this work, we cast video stitching as a problem of spatial interpolation between the side views and the center view. We denote the output view by $O$, and the input views (projected onto the output cylinder) by $I_L$, $I_C$, and $I_R$, respectively. Note that $I_L$, $I_C$, and $I_R$ are in the same coordinate system and have the same resolution. We define a transition region as part of the overlapping region between a pair of inputs (see the yellow regions in Figure 10(b)). Within the transition region, we progressively warp vertical slices from both images to create a smooth transition from one camera to the other. Outside the transition region, we directly take the pixel values from the input images without modifying them.

For presentation clarity, here we focus only on $I_L$ and $I_C$. Our goal is to generate $K$ intermediate frames, $\{I_{t_k}\}_{k=1}^{K}$ with $t_k = k/K$, which smoothly transition between $I_L$ and $I_C$. We first compute the bidirectional optical flows, $F_{L \to C}$ and $F_{C \to L}$, and then generate warped frames $\hat{I}_{L \to t} = \mathcal{W}(I_L, t \cdot F_{L \to C})$ and $\hat{I}_{C \to t} = \mathcal{W}(I_C, (1 - t) \cdot F_{C \to L})$, where $\mathcal{W}(I, F)$ is a function that warps image $I$ based on flow $F$, and $t \in [0, 1]$ scales the flow to create the smooth transition. We define the left stitching boundary $x_L$ as the column of the leftmost valid pixel for $I_C$ on the output cylinder. Given the interpolation step size $\Delta$, the left half of the output view, $O_L$, is constructed by

$O_L[:, x] = \begin{cases} I_L[:, x], & 0 \le x < x_L, \\ I_{t_k}[:, x], & x_L + (k-1)\Delta \le x < x_L + k\Delta, \quad k = 1, \dots, K, \\ I_C[:, x], & x_L + K\Delta \le x < W, \end{cases} \quad (1)$

where $W$ is the width of the output frame, and $I_{t_k}$ is obtained by appropriately fusing $\hat{I}_{L \to t_k}$ and $\hat{I}_{C \to t_k}$ (see Section 3.2). By construction, the output image is aligned with $I_L$ at $x = x_L$ and aligned with $I_C$ at $x = x_L + K\Delta$. Within the transition region, the output view gradually changes from $I_L$ to $I_C$ by taking the corresponding columns from the intermediate views. The right half of the output, $O_R$, is defined similarly to $O_L$. Figure 16(e) shows a result of this interpolation.
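The column-wise assembly above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the hypothetical `interp` callable stands in for the flow-based view interpolation, and here it simply cross-fades pre-aligned inputs.

```python
import numpy as np

def assemble_left_half(I_L, I_C, interp, x0, K, delta):
    """Assemble the left half of the output in the spirit of Equation 1.

    interp(t) must return an image aligned with I_L at t=0 and with I_C
    at t=1 (in the paper, a flow-warped intermediate view).
    Columns [0, x0) come from I_L, the K*delta transition columns come
    from progressively interpolated views, and the rest come from I_C.
    """
    H, W = I_L.shape[:2]
    out = I_C.copy()
    out[:, :x0] = I_L[:, :x0]
    for k in range(K):
        t = (k + 1) / K                 # interpolation position of slice k
        lo = x0 + k * delta
        hi = min(lo + delta, W)
        out[:, lo:hi] = interp(t)[:, lo:hi]
    return out
```

With a cross-fading `interp`, the output sweeps smoothly from the left image to the center image across the transition region, which is exactly the pushbroom behavior the equation encodes.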

We note that the finer the interpolation steps, the higher the quality of the stitched results. We empirically set the number of slices $K$ and the slice width $\Delta$: the transition region consists of $K$ pushbroom columns, each $\Delta$ pixels wide.

3.2 Fast Pushbroom Interpolation Layer

Synthesizing the transition regions exactly as described in the previous section is computationally expensive. For each side, it requires scaling the forward and backward optical flow fields $K$ times, and using them to warp the full-resolution images just as many times. For $W \times H$ images, this results in $2KWH$ pixels to warp for each side. However, we only need a slice of $\Delta \times H$ pixels from each of the warped images.
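The cost gap is easy to quantify. With hypothetical values for the slice count and frame size (the paper's exact settings are not reproduced here), the direct implementation warps a factor of $K$ more pixels than a scheme that warps each input pixel once:

```python
# Hypothetical sizes, for illustration only.
K = 50             # number of interpolation steps (assumed)
W, H = 1920, 1080  # output frame resolution (assumed)

naive_pixels = 2 * K * W * H   # warp both full images for every step t
fast_pixels = 2 * W * H        # fast layer: warp each input pixel once

print(naive_pixels // fast_pixels)  # prints K: the fast layer warps K times fewer pixels
```

This factor-of-$K$ saving is what the fast pushbroom interpolation layer of the next paragraphs realizes.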

Instead of scaling each flow field in its entirety, we propose to generate a single flow field in which the entries corresponding to different slices are scaled differently. For instance, from the flow field $F_{L \to C}$ from $I_L$ to $I_C$, we generate a new field

$\tilde{F}_{L \to C}[:, x] = t_k \cdot F_{L \to C}[:, x], \quad x_{k-1} \le x < x_k, \quad (2)$

where $\{x_k\}$ are the boundaries of each slice and $t_k = k/K$. We can then warp both images as $\hat{I}_L = \mathcal{W}(I_L, \tilde{F}_{L \to C})$ and $\hat{I}_C = \mathcal{W}(I_C, \tilde{F}_{C \to L})$, where $\tilde{F}_{C \to L}$ is computed with Equation 2 with $1 - t_k$ in place of $t_k$. Note that this approach warps each pixel in the input images only once.
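The column-wise scaling can be sketched as follows (a minimal numpy illustration with assumed names; the real layer also produces the backward-scaled flow and runs inside the network):

```python
import numpy as np

def columnwise_scaled_flow(F, x0, K, delta):
    """Scale a dense flow field column-by-column, in the spirit of the
    fast pushbroom layer: columns in slice k get factor t_k = (k+1)/K,
    so a single warp yields every needed transition slice at once.

    F : (H, W, 2) flow field from the left view to the center view.
    x0 : first column of the transition region.
    """
    H, W, _ = F.shape
    scale = np.zeros((W,))
    scale[:x0] = 0.0                        # left of transition: no warp
    for k in range(K):
        lo, hi = x0 + k * delta, min(x0 + (k + 1) * delta, W)
        scale[lo:hi] = (k + 1) / K          # slice k warps partway
    scale[x0 + K * delta:] = 1.0            # right of transition: full flow
    return F * scale[None, :, None]
```

A single bilinear warp with this piecewise-scaled field then replaces the $K$ separate warps of the direct implementation.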

To deal with the unavoidable artifacts of optical flow estimation, we use a flow refinement network to refine the scaled flows and predict a visibility map for blending. As shown in Figure 17, the flow refinement network takes the scaled optical flows and the initial estimates of the warped images, from which it generates refined flows and a visibility map $V$. The visibility map can be considered a quality measure of the flow, which prevents potential ghosting artifacts due to occlusions. With the refined flows, we warp the input images again to obtain $\hat{I}_L$ and $\hat{I}_C$. The final interpolated image is then generated by blending based on visibility: $\hat{I} = V \odot \hat{I}_L + (1 - V) \odot \hat{I}_C$.
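The visibility-weighted fusion itself is a pixel-wise convex combination; a one-line sketch (function name is illustrative):

```python
import numpy as np

def blend_with_visibility(warped_L, warped_C, V):
    """Fuse the two refined warps with a visibility map V in [0, 1]:
    V -> 1 trusts the warp from the left view, V -> 0 trusts the warp
    from the center view, so pixels occluded in one view are filled
    from the other."""
    return V * warped_L + (1.0 - V) * warped_C
```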

Finally, the output view, $O_L$, is constructed by replacing the intermediate views $I_{t_k}$ in Equation 1 with the corresponding columns of $\hat{I}$. We generate $O_R$ by mirroring the process above. Our fast pushbroom interpolation layer generates results of similar quality but is substantially faster than the direct implementation at the full output resolution.

3.3 Training Pushbroom Stitching Network

Training dataset.

Capturing data to train our system is challenging, as one would need to use hundreds of synchronized linear cameras. Instead, we render realistic synthetic data using the urban driving simulator CARLA [carla], which allows us to specify the location and rotation of cameras. For the input cameras, we follow the setup of Figure 10(a). To synthesize the output pushbroom camera, we use 100 cameras uniformly spaced between $C_L$ and $C_C$, and between $C_C$ and $C_R$. We then use Equation 1, replacing the intermediate views $I_{t_k}$ with these rendered views, to produce ground-truth stitched videos. We synthesize 152 such videos with different routes and weather conditions (e.g., sunny, rainy, cloudy) for training. We provide the detailed network architectures of the flow estimation and flow refinement networks in the supplementary material.

Training loss.

To train our pushbroom interpolation network, we optimize the following loss functions: (1) content loss, (2) perceptual loss, and (3) temporal warping loss.

The content loss is computed by $\mathcal{L}_c = \| M \odot (O - G) \|_1$, where $O$ is the output image, $G$ is the ground truth, and $M$ is a mask indicating whether a pixel is valid on the viewing cylinder. The perceptual loss is computed by $\mathcal{L}_p = \| M' \odot (\phi(O) - \phi(G)) \|_1$, where $\phi$ denotes the feature activation at the relu4-3 layer of the pre-trained VGG-19 network [VGG] and $M'$ is the valid mask downscaled to the size of the corresponding features. To improve the temporal stability, we also optimize the temporal warping loss [Lai-ECCV-2018] $\mathcal{L}_t = \sum_{s \in \Omega(t)} \| C_{t,s} \odot M \odot (O_t - \hat{O}_{s \to t}) \|_1$, where $\Omega(t)$ is the set of neighboring frames at time $t$, $C_{t,s}$ is a confidence map, and $\hat{O}_{s \to t}$ is the frame $O_s$ warped with optical flow. We use PWC-Net [sun2018pwc] to compute the optical flow between subsequent frames. Note that the optical flow is only used to compute the training loss, and is not needed at testing time. The confidence map is computed from the ground-truth frames as $C_{t,s} = \exp(-\alpha \, \| G_t - \hat{G}_{s \to t} \|_2^2)$, where $\hat{G}_{s \to t}$ is the warped ground-truth frame. A smaller value of $C_{t,s}$ indicates that the pixel is more likely to be occluded.

The overall loss function is defined as $\mathcal{L} = \lambda_c \mathcal{L}_c + \lambda_p \mathcal{L}_p + \lambda_t \mathcal{L}_t$, where $\lambda_c$, $\lambda_p$, and $\lambda_t$ are balancing weights set empirically. For the spatial optical flow in the transition regions, we use SuperSloMo [jiang2018super], initialized with the weights provided by the authors and then fine-tuned to our use case during end-to-end training. We provide more implementation details in the supplementary material.
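For concreteness, the masked losses above can be sketched in numpy (function names are illustrative; the real implementation operates on network tensors, and the perceptual loss applies the same masked form to VGG features rather than pixels):

```python
import numpy as np

def masked_l1(pred, target, mask):
    """Masked L1 distance: only pixels valid on the viewing cylinder
    (mask == 1) contribute. This is the shared form of the content loss
    (on pixels) and the perceptual loss (on features, with a downscaled
    mask)."""
    denom = max(mask.sum(), 1)
    return float(np.sum(mask * np.abs(pred - target)) / denom)

def temporal_warping_loss(O_t, O_warped, conf, mask):
    """Temporal warping loss (sketch): difference between the current
    output and the flow-warped neighboring output, weighted by the
    confidence map so that likely-occluded pixels are downweighted."""
    w = conf * mask
    return float(np.sum(w * np.abs(O_t - O_warped)) / max(w.sum(), 1))
```

The total training loss would then be a weighted sum of these terms, with the weights playing the role of $\lambda_c$, $\lambda_p$, and $\lambda_t$.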

4 Experimental Results

Figure 24: Stitching on a synthetic video. Panels: (a) stitched frame from the full model, (b) ground truth, (c) baseline, and ablations trained with $\mathcal{L}_c$, $\mathcal{L}_c + \mathcal{L}_p$, and $\mathcal{L}_c + \mathcal{L}_p + \mathcal{L}_t$. After training the proposed model on the synthetic data, our model aligns the content well and reduces ghosting artifacts.

The output of our algorithm, while visually pleasing, does not match a physical optical system, since the effective projection matrix changes horizontally. To numerically evaluate our results, we can use rendered data (Section 4.1). However, a pixel-level numerical comparison with other methods is not possible, as each method effectively uses a different projection matrix. For a fair comparison, then, we carried out an extensive user study (Section 4.2). The video results are available in the supplementary material and on our project website.

4.1 Model Analysis

To quantitatively evaluate the stitching quality, we use the CARLA simulator to render a test set with a different town map from the training data. We render 10 test videos, each with 300 frames.

We measure the PSNR and SSIM [wang2004image] between the stitched frames and the ground-truth images to evaluate image quality. In addition, we measure the temporal stability by computing the temporal warping error $E_{\text{warp}} = \frac{1}{T-1} \sum_{t=2}^{T} \frac{1}{\sum_i M_t(i)} \sum_i M_t(i) \, \| O_t(i) - \hat{O}_{t-1 \to t}(i) \|_2^2$, where $\hat{O}_{t-1 \to t}$ is the flow-warped frame $O_{t-1}$, $M_t$ is a mask indicating the non-occluded pixels, and $\sum_i M_t(i)$ is the number of valid pixels in the mask. We use the occlusion detection method of Ruder et al. [Ruder-2016] to estimate the mask $M_t$.
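The metric can be sketched as follows (assumed data layout: lists of per-frame numpy arrays; flow warping and occlusion estimation happen upstream and are not reproduced):

```python
import numpy as np

def temporal_warping_error(frames, warped_prev, masks):
    """Average per-frame MSE between each frame and the flow-warped
    previous frame, restricted to non-occluded pixels.

    frames      : [O_1, ..., O_T]
    warped_prev : warped_prev[t] is O_t warped onto frame t+1
    masks       : masks[t] marks non-occluded pixels for that pair
    """
    errs = []
    for O_next, O_hat, M in zip(frames[1:], warped_prev, masks):
        n = max(M.sum(), 1)                       # valid-pixel count
        errs.append(float(np.sum(M * (O_next - O_hat) ** 2) / n))
    return sum(errs) / len(errs)
```

Lower values indicate a temporally more stable video, which is how the metric is used in Table 1.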

We first evaluate the baseline model, where the pushbroom interpolation layer is initialized with the pre-trained SuperSloMo [jiang2018super]. The baseline model provides a visually plausible stitching result, but causes object distortion and temporal flickering due to inaccurate flow estimation. After fine-tuning the whole model, both the visual quality and the temporal stability are significantly improved. As shown in Table 1, all the loss functions, $\mathcal{L}_c$, $\mathcal{L}_p$, and $\mathcal{L}_t$, improve the PSNR and SSIM and also reduce the temporal warping error. In Figure 24, we show an example where our full model aligns the speed sign well and avoids ghosting artifacts.

Figure 25(a) shows a stitched frame from the baseline model, where the pole on the right is distorted and almost disappears. After training, the pole remains intact (Figure 25). We also visualize the optical flows before and after training the model. After end-to-end training, the flows are smoother and warp the pole as a whole, avoiding distortion.

Training loss                                      PSNR   SSIM   Warping error
N.A. (baseline)                                    27.69  0.908  13.89
$\mathcal{L}_c$                                    30.72  0.925  11.72
$\mathcal{L}_c + \mathcal{L}_p$                    31.04  0.926  11.63
$\mathcal{L}_c + \mathcal{L}_p + \mathcal{L}_t$    31.27  0.928  10.67

Table 1: Ablation study. After training the model with the content loss $\mathcal{L}_c$, perceptual loss $\mathcal{L}_p$, and temporal loss $\mathcal{L}_t$, the image quality and temporal stability are significantly improved.

Ours vs.                 Preference (%)  Broken objects (%)  Less ghosting (%)  Similar results (%)
AutoPano [autopano]      90.74           85.71               20.41              10.20
VRWorks [vrworks]        97.22           80.00               49.52              1.90
STCPW [jiang2015video]   98.15           87.74               38.68              0.00
Overall                  95.37           84.74               36.57              3.88

Table 2: User study. We conduct pairwise comparisons on 20 real videos. Our method is preferred by 95.37% of users on average.
Figure 25: Visualization of the stitched frames and flows. We show the stitched frames (a), forward flows (b), and backward flows (c) from the pushbroom interpolation layer before (top) and after (bottom) fine-tuning the proposed model. The fine-tuned model generates smooth flow fields to warp the input views and preserve the content (e.g., the pole on the right) well.
(a) STCPW [jiang2015video]
(b) AutoPano Video [autopano]
(c) NVIDIA VRWorks [vrworks]
(d) Ours
Figure 30: Comparison with existing video stitching methods. The proposed method achieves better alignment quality while better preserving the shape of objects and avoiding ghosting artifacts.

4.2 Comparisons with Existing Methods

We compare the proposed method with the commercial software AutoPano Video [autopano] and with existing video stitching algorithms, STCPW [jiang2015video] and NVIDIA VRWorks [vrworks]. We show two stitched frames from real videos in Figure 30, where the proposed approach generally achieves better alignment quality with fewer broken objects and ghosting artifacts. More video results are provided in the supplementary material.

As different methods use different projection models, a fair quantitative evaluation of the different video stitching algorithms is impossible. Therefore, we conduct a human subject study through pairwise comparisons. Specifically, we ask the participants to indicate which video in a pair of stitched videos presents fewer artifacts. We evaluate a total of 20 real videos and ask each participant to compare 12 pairs of videos. In each comparison, participants can watch both videos multiple times before making a selection. In total, we collect results from 54 participants.

Table 2 shows that our results are preferred by 95.37% of users on average, which demonstrates the effectiveness of the proposed method at generating high-quality stitching results. In addition, we ask participants to provide the reasons why they prefer the selected video from the following options: (1) the video has fewer broken lines or objects, (2) the video has fewer ghosting artifacts, and (3) the two videos are similar. Overall, our results are preferred due to better alignment and fewer broken objects. Moreover, only 3.88% of users feel that our result is comparable to the others, which indicates that users generally have a clear judgment when comparing our method with other approaches.

4.3 Discussion and Limitations

Our method requires the cameras to be calibrated for the cylindrical projection of the inputs. While common to many stitching methods, e.g., NVIDIA's VRWorks [vrworks], this requirement could be limiting if interpreted strictly. However, our experiments reveal that moving the side cameras inwards by a sizable fraction of the original baseline reduces the PSNR by less than 1dB. An outward shift is more problematic because it reduces the overlap between the views; still, a comparable outward shift causes less than a 2dB drop in PSNR. Fine-tuning the network with perturbed camera configurations can further reduce the error. We present a detailed analysis in the supplementary material. Our method also inherits some limitations of optical flow: for instance, thin structures can cause a small amount of ghosting. We show failure cases in the supplementary material; in practice, the proposed method performs robustly even in such cases.

5 Conclusion

In this work, we present an efficient algorithm to stitch videos with deep CNNs. We propose to cast video stitching as a problem of spatial interpolation and we design a pushbroom interpolation layer for this purpose. Our model effectively aligns and stitches different views while preserving the shape of objects and avoiding ghosting artifacts. To the best of our knowledge, ours is the first learning-based video stitching algorithm. A human subject study demonstrates that it outperforms existing algorithms and commercial software.