Due to sensor resolution and optics limitations, the field of view (FOV) of most cameras is too narrow for applications such as autonomous driving and virtual reality. A common solution is to stitch the outputs of multiple cameras into a panoramic video, effectively extending the FOV. When the optical centers of these cameras are nearly co-located, stitching can be solved with a simple homography transformation. However, in many applications, such as autonomous driving, remote video conferencing, and video surveillance, multiple cameras have to be placed with wide baselines, either to increase view coverage or due to physical constraints. In these cases, even state-of-the-art methods [jiang2015video, perazzi2015panoramic] and current commercial solutions (e.g., VideoStitch Studio [video_stitch_studio], AutoPano Video [autopano], and NVIDIA VRWorks [vrworks]) struggle to produce artifact-free videos, as shown in Figure 6.
One main challenge for video stitching with wide baselines is parallax, i.e., the apparent displacement of an object in multiple input videos due to camera translation. Parallax varies with object depth, which makes it impossible to properly align objects without knowing dense 3D information. In addition, occlusions, dis-occlusions, and limited overlap between the FOVs also cause significant stitching artifacts. To obtain better alignment, existing image stitching algorithms perform content-aware local warping [chang2014shape, zaragoza2013projective] or find optimal seams around objects to mitigate artifacts at the transition from one view to the other [eden2006seamless, zhang2014parallax]. Applying these strategies to process a video frame-by-frame inevitably produces noticeable jittering or wobbling artifacts. On the other hand, algorithms that explicitly enforce temporal consistency, such as spatio-temporal mesh warping with a large-scale optimization [jiang2015video], are computationally expensive. In fact, commercial video stitching software often adopts simple seam cutting and multi-band blending [burt1983multiresolution]. These methods, however, often cause severe artifacts, such as ghosting or misalignment, as shown in Figure 6. Moreover, seams can cause objects to be cut off or to disappear completely from stitched images, a particularly dangerous outcome for use cases such as autonomous driving.
We propose a video stitching solution for linear camera arrays that produces a panoramic video. We identify three desirable properties in the output video: (1) Artifacts, such as ghosting, should not appear. (2) Objects may be distorted, but should not be cut off or disappear in any frame. (3) The stitched video needs to be temporally stable. With these three desiderata in mind, we formulate video stitching as a spatial view interpolation problem. Specifically, we take inspiration from the pushbroom camera, which concatenates vertical image slices that are captured while the camera translates [gupta1997linear].
We propose a pushbroom stitching network based on deep convolutional neural networks (CNNs). We first project the input views onto a common cylindrical surface. We then estimate bi-directional optical flow, with which we simulate a pushbroom camera by interpolating all the intermediate views between the input views. Instead of generating all the intermediate views (which requires multiple bilinear warping steps on the entire image), we develop a pushbroom interpolation layer to generate the interpolated view in a single feed-forward pass. Figure 7 shows an overview of the conventional video stitching pipeline and our proposed method. Our method yields results that are visually superior to existing solutions, both academic and commercial, as we show with an extensive user study.
2 Related Work
Existing image stitching methods often build on the conventional pipeline of Brown and Lowe [brown2007automatic], which first estimates a 2D transformation (e.g., homography) for alignment and then stitches the images by defining seams [eden2006seamless] and using multi-band blending [burt1983multiresolution]. However, ghosting artifacts and misalignment still exist, especially when input images have large parallax. To account for parallax, several methods adopt spatially varying local warping based on the affine [lin2011smoothly] or projective [zaragoza2013projective] transformations. Zhang et al. [zhang2014parallax] integrate the content-preserving warping and seam-cutting algorithms to handle parallax while avoiding local distortions. More recent methods combine the homography and similarity transforms [chang2014shape, lin2015adaptive] to reduce the projective distortion (i.e., stretched shapes) or adopt a global similarity prior [chen2016natural] to preserve the global shape of the resulting stitched images.
While these methods perform well on still images, applying them to videos frame-by-frame results in strong temporal instability. In contrast, our algorithm, also single-frame, generates videos that are spatio-temporally coherent, because our pushbroom layer only operates on the overlapping regions, while the rest is directly taken from the inputs.
Because enforcing spatio-temporal consistency is computationally costly, it is not straightforward to add it to existing image stitching algorithms. Commercial software, e.g., VideoStitch Studio [video_stitch_studio] or AutoPano Video [autopano], often finds a fixed transformation (with camera calibration) to align all the frames, but cannot align local content well. Recent methods integrate local warping and optical flow [perazzi2015panoramic] or find a spatio-temporal content-preserving warping [jiang2015video] to stitch videos, both of which are computationally expensive. Lin et al. [lin2016seamless] stitch videos captured from hand-held cameras based on 3D scene reconstruction, which is also time-consuming. On the other hand, several approaches, e.g., Rich360 [lee2016rich360] and Google Jump [anderson2016jump], create videos from multiple videos captured on a structured rig. Recently, NVIDIA released VRWorks [vrworks], a toolkit to efficiently stitch videos based on depth and motion estimation. Still, as shown in Figure 6(b), several artifacts, e.g., broken objects and ghosting, are visible in the stitched video.
In contrast to existing video stitching methods, our algorithm learns local warping flow fields based on a deep CNN to effectively and efficiently align the input views. The flow is learned to optimize the quality of the stitched video in an end-to-end fashion.
Linear pushbroom cameras [gupta1997linear] are common for satellite imagery: while a satellite moves along its orbit, they capture multiple 1D images, which can be concatenated into a full image. A similar approach has also been used to capture street view images [seitz2003multiperspective]. However, when the scene is not planar, or cannot be approximated as such, they introduce artifacts, such as stretched or compressed objects. Several methods handle this issue by estimating scene depth [rav2004mosaicing], finding a cutting seam in the space-time volume [wexler2005space], or optimizing the viewpoint for each pixel [agarwala2006photographing]. The proposed method is a software simulation of a pushbroom camera, which creates panoramas by concatenating vertical slices that are spatially interpolated between the input views. We note that the method of Jin et al. [jin2018learning] addresses a similar problem of view morphing, which aims at synthesizing intermediate views along a circular path. However, they focus on synthesizing a single object, e.g., a person or a car, and do not consider the background. Instead, our method synthesizes intermediate views for the entire scene.
3 Stitching as Spatial Interpolation
Our method produces a temporally stable stitched video from wide-baseline inputs of dynamic scenes. While the proposed approach is suitable for a generic linear camera array configuration, here we describe it with reference to the automotive use case. Unlike other applications of structured camera arrays, in the automotive case, objects can come arbitrarily close to the cameras, thus requiring the stitching algorithm to tolerate large parallax.
For the purpose of describing the method, we define the camera setup as shown in Figure 10(a), which consists of three fisheye cameras whose baseline spans the entire car’s width. Figures 16(a)-(c) show typical images captured under this configuration, and underscore some of the challenges we face: strong parallax, large exposure differences, as well as geometric distortion. To minimize the appearance change between the three views and to represent the wide FOV of the stitched frames, we first adopt a camera pose transformation to warp and to the position of and , respectively. Therefore, the new origin is set at the center camera . Then, we apply a cylindrical projection (by approximating the scene to be at infinity) to warp all the views onto a common viewing cylinder, as shown in Figure 10(a). However, even after camera calibration, exposure compensation, fisheye distortion correction, and cylindrical projection, parallax still causes significant misalignment, which results in severe ghosting artifacts, as shown in Figure 16(d).
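As a minimal sketch of the cylindrical projection step, the function below maps a pixel of an already undistorted (pinhole) image onto a viewing cylinder. The focal length `f` and principal point `(cx, cy)` are illustrative parameters assumed here, not values from the text, and the camera pose transformation that precedes this step is omitted.

```python
import math

def to_cylinder(x, y, f, cx, cy):
    """Map a pinhole pixel (x, y) to cylindrical image coordinates.

    f is the focal length in pixels and (cx, cy) the principal point;
    all three are hypothetical parameters chosen for illustration.
    """
    xn = (x - cx) / f                  # normalized horizontal coordinate
    yn = (y - cy) / f                  # normalized vertical coordinate
    theta = math.atan(xn)              # angle around the cylinder axis
    h = yn / math.sqrt(xn * xn + 1.0)  # height on the unit cylinder
    return f * theta + cx, f * h + cy
```

The principal point maps to itself, and columns far from the center are compressed horizontally, which is what allows a wide FOV to fit on one surface.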
In this work, we cast video stitching as a problem of spatial interpolation between the side views and the center view. We denote the output view by , and the input views (projected onto the output cylinder) by , , and , respectively. Note that , , and are in the same coordinate system and have the same resolution. We define a transition region as part of the overlapping region between a pair of inputs (see the yellow regions in Figure 10(b)). Within the transition region, we progressively warp vertical slices from both images to create a smooth transition from one camera to another. Outside the transition region, we directly take the pixel values from the input images without modifying them.
For presentation clarity, here we focus only on and . Our goal is to generate intermediate frames, , which smoothly transition between and . We first compute the bidirectional optical flows, and , and then generate warped frames and , where is a function that warps image based on flow , and scales the flow to create the smooth transition. We define the left stitching boundary as the column of the leftmost valid pixel for on the output cylinder. Given the interpolation step size , the left half of the output view, , is constructed by
where is the width of the output frame, and is obtained by appropriately fusing and , (see Section 3.2). By construction, the output image is aligned with at , and aligned with at . Within the transition region, the output view gradually changes from to by taking the corresponding columns from the intermediate views. The right half of the output, , is defined similarly to . Figure 16(e) shows a result of this interpolation.
We note that the finer the interpolation steps, the higher the quality of the stitched results. We empirically set and , i.e., pushbroom columns each -pixel wide.
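The column-wise construction described above can be sketched as follows. This is an illustration under assumptions rather than the exact formulation: the names `x1` (left stitching boundary), `s` (step size), and `K` (number of intermediate views) are hypothetical labels, the transition region is assumed to start at `x1` and span `K` slices of `s` columns each, and each view is reduced to a single image row for brevity.

```python
def column_source(x, x1, s, K):
    """Return which view supplies output column x in the left half.

    0      -> left input view (columns before the transition region)
    1..K   -> k-th intermediate view (slice k, s columns wide)
    K + 1  -> center input view (columns after the transition region)
    """
    if x < x1:
        return 0
    k = (x - x1) // s + 1
    return k if k <= K else K + 1

def compose_left_half(left, inter, center, x1, s):
    """Assemble one output row from rows of the left, intermediate,
    and center views, all given in the same cylindrical coordinates."""
    K = len(inter)
    views = [left] + inter + [center]
    return [views[column_source(x, x1, s, K)][x] for x in range(len(left))]
```

With two intermediate views and `s = 2`, columns 0-1 come from the left view, columns 2-3 and 4-5 from the first and second intermediate views, and the remaining columns from the center view.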
3.2 Fast Pushbroom Interpolation Layer
Synthesizing the transition regions exactly as described in the previous section is computationally expensive. For each side, it requires scaling the forward and backward optical flow fields times, and using them to warp the full-resolution images just as many times. For images, this results in pixels to warp for each side. However, we only need a slice of pixels from each of them.
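To make the cost argument concrete, the sketch below counts warped pixels for one side. It assumes that both the forward- and backward-warped images are counted, and the names `K` (number of intermediate views) and `s` (slice width) are hypothetical labels; the numbers in the usage are illustrative only.

```python
def naive_warp_pixels(width, height, K):
    """Pixels warped when every intermediate view is rendered in full:
    two images (forward and backward) for each of the K views."""
    return 2 * K * width * height

def used_pixels(height, K, s):
    """Pixels actually kept: one s-column slice per warped image."""
    return 2 * K * s * height

def single_pass_warp_pixels(width, height):
    """Pixels warped when each input image is warped exactly once."""
    return 2 * width * height
```

For a 10 x 10 image with `K = 5` and `s = 1`, the direct approach warps 1000 pixels and keeps only 100 of them, whereas warping each input once touches 200 pixels.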
Instead of scaling each flow field in its entirety, we propose to generate a single flow field in which entries corresponding to different slices are scaled differently. For instance, from the flow field from to , we generate a new field
where are the boundaries of each slice. We can then warp both images as and , where is computed with Equation 2 with in place of . Note that this approach only warps each pixel in the input images once.
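The idea of a single, slice-dependent flow field can be sketched as below. The linear schedule `k / (K + 1)` for slice `k`, the scale of 0 before the transition region and 1 after it, and the names `x1`, `s`, and `K` are all assumptions made for illustration; the exact scaling is the one defined by the equations referenced above.

```python
def slice_scale(x, x1, s, K):
    """Per-column flow scale: 0 before the transition region, 1 after it,
    and k / (K + 1) inside slice k (an assumed linear schedule)."""
    if x < x1:
        return 0.0
    k = (x - x1) // s + 1
    return k / (K + 1) if k <= K else 1.0

def scale_flow_by_slice(flow, x1, s, K):
    """Scale a dense flow field column by column.

    flow[y][x] is a (u, v) displacement; every entry is visited once,
    so the subsequent warp also touches each pixel only once.
    """
    return [[(u * slice_scale(x, x1, s, K), v * slice_scale(x, x1, s, K))
             for x, (u, v) in enumerate(row)]
            for row in flow]
```

Because the scale depends only on the column index, the whole field is produced in one pass instead of once per intermediate view.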
To deal with the unavoidable artifacts of optical flow estimation, we use a flow refinement network to refine the scaled flows and predict a visibility map for blending. As shown in Figure 17, the flow refinement network takes the scaled optical flows and the initial estimates of the warped images, from which it generates refined flows and a visibility map . The visibility map can be considered as a quality measure of the flow, which prevents any potential ghosting artifacts due to occlusions. With the refined flows, we warp the input images again to obtain and . The final interpolated image is then generated by blending based on visibility: .
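A minimal sketch of visibility-based blending is given below, assuming the common convex-combination form in which a visibility value of 1 selects the first warped image and 0 selects the second. The function and argument names are hypothetical, and images are plain nested lists of grayscale values.

```python
def blend_by_visibility(warped_a, warped_b, visibility):
    """Blend two warped images pixel by pixel.

    visibility values lie in [0, 1]: 1 keeps warped_a, 0 keeps warped_b,
    and intermediate values mix the two (assumed convex combination).
    """
    return [[v * a + (1.0 - v) * b for a, b, v in zip(row_a, row_b, row_v)]
            for row_a, row_b, row_v in zip(warped_a, warped_b, visibility)]
```

Where one view is occluded, a visibility value close to 0 or 1 effectively copies the other view instead of averaging the two, which is how blending of this kind suppresses ghosting.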
Finally, the output view, , is constructed by replacing all the in Equation 1 with . We generate and construct by mirroring the process above. Our fast pushbroom interpolation layer generates the results with similar quality but is about faster than the direct implementation for an output image with a resolution of pixels.
3.3 Training Pushbroom Stitching Network
Capturing data to train our system is challenging, as one would need to use hundreds of synchronized linear cameras. Instead, we render realistic synthetic data using the urban driving simulator CARLA [carla], which allows us to specify the location and rotation of cameras. For the input cameras, we follow the setup of Figure 10(a). To synthesize the output pushbroom camera, we use 100 cameras uniformly spaced between and , and between and . We then use Equation 1 and replace with these views to render the ground-truth stitched video. We synthesize 152 such videos with different routes and weather conditions (e.g., sunny, rainy, and cloudy) for training. We provide the detailed network architectures of the flow estimation and flow refinement networks in the supplementary material.
To train our pushbroom interpolation network, we optimize the following loss functions: (1) content loss, (2) perceptual loss, and (3) temporal warping loss.
The content loss is computed by , where is the output image, is the ground-truth, and is a mask indicating whether pixel is valid on the viewing cylinder. The perceptual loss is computed by , where denotes the feature activation at the relu4-3 layer of the pre-trained VGG-19 network [VGG] and is the valid mask downscaled to the size of the corresponding features. To improve the temporal stability, we also optimize the temporal warping loss [Lai-ECCV-2018] , where is the set of neighboring frames at time , is a confidence map, and is the frame warped with optical flow . We use PWC-Net [sun2018pwc] to compute the optical flow between subsequent frames. Note that the optical flow is only used to compute the training loss, and is not needed at testing time. The confidence map is computed from the ground-truth frame and , where . A smaller value of indicates that the pixel is more likely to be occluded.
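The masked content loss and the confidence-weighted temporal loss can be sketched as follows. The L1 norm, the exponential form of the confidence map, the parameter `alpha`, and the lack of normalization are assumptions made for illustration; frames are flattened lists of grayscale values, and the perceptual loss is omitted because it requires a pre-trained network.

```python
import math

def masked_l1(output, target, mask):
    """Content loss sketch: absolute error summed over valid pixels only."""
    return sum(m * abs(o - t) for o, t, m in zip(output, target, mask))

def confidence_map(gt_t, gt_prev_warped, alpha):
    """Assumed confidence: near 1 where the ground-truth frame and the
    flow-warped previous ground-truth frame agree, and near 0 where they
    differ, which typically indicates occlusion."""
    return [math.exp(-alpha * (a - b) ** 2)
            for a, b in zip(gt_t, gt_prev_warped)]

def temporal_loss(out_t, out_prev_warped, confidence):
    """Temporal loss sketch: confidence-weighted difference between the
    current output and the flow-warped previous output."""
    return sum(c * abs(a - b)
               for a, b, c in zip(out_t, out_prev_warped, confidence))
```

Pixels with low confidence contribute little to the temporal term, so the network is not penalized for changes caused by occlusion rather than flicker.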
The overall loss function is defined as , where , , and are balancing weights. We empirically set , , and . For the spatial optical flow in the transition regions, we use SuperSloMo [jiang2018super] initialized with the weights provided by the authors and then fine-tuned to our use case in our end-to-end training. We provide more implementation details in the supplementary material.
4 Experimental Results
The output of our algorithm, while visually pleasing, does not match a physical optical system, since the effective projection matrix changes horizontally. To numerically evaluate our results we can use rendered data (Section 4.1). However, a pixel-level numerical comparison with other methods is not possible as each method effectively uses a different projection matrix. For a fair comparison, then, we carried out an extensive user study (Section 4.2). The video results are available in the supplementary material and our project website at http://vllab.ucmerced.edu/wlai24/video_stitching/.
4.1 Model Analysis
To quantitatively evaluate stitching quality, we use the CARLA simulator to render a test set using a different town map from the training data. We render 10 test videos, each with 300 frames.
We measure the PSNR and SSIM [wang2004image] between the stitched frames and the ground-truth images to evaluate the image quality. In addition, we measure the temporal stability by computing the temporal warping error , where is the flow-warped frame , is a mask indicating the non-occluded pixels, and is the number of valid pixels in the mask. We use the occlusion detection method by Ruder et al. [Ruder-2016] to estimate the mask .
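For reference, the two simpler metrics can be sketched as below. PSNR follows its standard definition; the warping error is written as a masked mean of squared differences, which is an assumption about the exact norm used. Frames are flattened lists of pixel values, and the function names are hypothetical.

```python
import math

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak * peak / mse)

def warping_error(frame_t, warped_prev, mask):
    """Mean squared difference between a frame and the flow-warped
    neighboring frame, averaged over non-occluded pixels (mask == 1)."""
    n = sum(mask)
    if n == 0:
        return 0.0
    return sum(m * (x - y) ** 2
               for x, y, m in zip(frame_t, warped_prev, mask)) / n
```

A lower warping error means that consecutive output frames agree after motion compensation, i.e., less flicker.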
We first evaluate the baseline model, where the pushbroom interpolation layer is initialized with the pre-trained SuperSloMo [jiang2018super]. The baseline model provides a visually plausible stitching result but causes object distortion and temporal flickering due to inaccurate flow estimation. After fine-tuning the whole model, both the visual quality and temporal stability are significantly improved. As shown in Table 1, all the loss functions, , , and , improve the PSNR and SSIM and also reduce the temporal warping error. In Figure 24, we show an example where our full model aligns the speed sign well and avoids the ghosting artifacts.
Figure 25(a) shows a stitched frame from the baseline model, where the pole on the right is distorted and almost disappears. After training, the pole remains intact (Figure 25(b)). We also visualize the optical flows before and after training the model. After end-to-end training, the flows are smoother and warp the pole as a whole, avoiding distortion.
4.2 Comparisons with Existing Methods
We compare the proposed method with the commercial software AutoPano Video [autopano] and with existing video stitching algorithms, STCPW [jiang2015video] and NVIDIA VRWorks [vrworks]. We show two stitched frames from real videos in Figure 30, where the proposed approach generally achieves better alignment quality with fewer broken objects and ghosting artifacts. More video results are provided in the supplementary material.
As different methods use different projection models, a fair quantitative evaluation of the different video stitching algorithms is impossible. Therefore, we conduct a human subject study through pairwise comparisons. Specifically, we ask the participants to indicate which video in a pair of stitched videos presents fewer artifacts. We evaluate a total of 20 real videos and ask each participant to compare 12 pairs of videos. In each comparison, participants can watch both videos multiple times before making a selection. In total, we collect the results from 54 participants.
Table 2 shows that our results are preferred by about of users, which demonstrates the effectiveness of the proposed method in generating high-quality stitching results. In addition, we ask participants to provide the reasons why they prefer the selected video from the following options: (1) the video has fewer broken lines or objects, (2) the video has fewer ghosting artifacts, and (3) the two videos are similar. Overall, our results are preferred due to better alignment and fewer broken objects. Moreover, only of users feel that our result is comparable to the others, which indicates that users generally have a clear judgment when comparing our method with other approaches.
4.3 Discussion and Limitations
Our method requires the cameras to be calibrated for the cylindrical projection of the inputs. While common to many stitching methods, e.g., NVIDIA’s VRWorks [vrworks], this requirement can be limiting, if strict. However, our experiments reveal that moving the side cameras inwards by up to of the original baseline reduces the PSNR by less than 1 dB. An outward shift is more problematic because it reduces the overlap between the views. Still, an outward shift that is of the original baseline causes a drop of less than 2 dB in PSNR. Fine-tuning the network by perturbing the original configuration of cameras can reduce the error. We present a detailed analysis in the supplementary material. Our method inherits some limitations of the optical flow. For instance, thin structures can cause a small amount of ghosting. We show failure cases in the supplementary material. In practice, the proposed method performs robustly even in such cases.
In this work, we present an efficient algorithm to stitch videos with deep CNNs. We propose to cast video stitching as a problem of spatial interpolation and we design a pushbroom interpolation layer for this purpose. Our model effectively aligns and stitches different views while preserving the shape of objects and avoiding ghosting artifacts. To the best of our knowledge, ours is the first learning-based video stitching algorithm. A human subject study demonstrates that it outperforms existing algorithms and commercial software.