Street-view images have proven helpful for exploring remote places and are useful for a variety of applications in virtual or mixed reality, realistic simulation and gaming, viewpoint interpolation, and cross-view matching. Nevertheless, their acquisition is rather expensive, and regular updates to capture changes are required for some tasks. On the other hand, satellite images are cheaper to obtain, have significantly better earth coverage, and are generally much more widely available than street-view images. The automatic generation of street-view images from given satellite or aerial images is thus an attractive and interesting alternative for several of the aforementioned applications.
While single street-view image generation from satellite images has recently been investigated [regmi2018cross, lu20geometry], these methods are not suitable for creating continuous viewpoint changes around a given location, since they are built upon random generators and lack constraints on the correspondence between frame pixels. They are thus unable to synthesize the temporally and spatially consistent image sequences desired for a better visual experience.
In this paper, we address the novel task of synthesizing street-view panoramic video sequences, as realistically and as consistently as possible, from a single satellite image and given viewing locations. To achieve this, instead of resorting to 2D generators like [regmi2018cross, lu20geometry] and generating images individually, we propose to generate the whole scene as a 3D point cloud representation and establish correspondences between these visible points and the 2D frame pixels. In this way, the views projected from the single generated scene instance are naturally consistent by design. In order to generate image frames with quality comparable to single-image synthesis, we design a two-stage 3D generator in a coarse-to-fine manner that exploits the characteristics of different 3D convolutional neural networks. Fig. 1 presents two examples of our synthesized results, which demonstrate the temporal consistency of our generated videos.
In summary, our major contributions are threefold. (1) We present the first work on satellite-to-ground video synthesis from a single satellite image with a trajectory. (2) We propose a novel cross-view video synthesis method that ensures both spatial and temporal consistency by explicitly modeling cross-frame correspondences using a 3D point cloud representation and by building projective geometry constraints into our network architecture. (3) We extend the London panorama dataset used in S2G [lu20geometry], which contains only single static street-view panoramic images, to a video version, named London Panorama Video, for training and testing networks for cross-view video synthesis. All source code and pre-trained models will be made publicly available upon publication.
2 Related Work
Cross-view synthesis focuses on synthesizing a completely different view of a given image. Most existing works in this field target single-image synthesis. A typical application is generating the street view from a given satellite image. Zhai et al. [zhai2017predicting] proposed to learn a mapping of semantic segmentation from the aerial to the ground perspective, which can further be used to synthesize ground-level views based on GANs [goodfellow2014generative]. Regmi et al. [regmi2018cross, regmi2019bridging] proposed to use conditional GANs to learn aerial or ground-view images together with semantic segmentation. In order to keep geometric consistency, Lu et al. [lu20geometry] proposed a differentiable geo-transformation layer that turns a semantically labeled satellite depth image into the corresponding street-view depth and semantics for further street-view panorama generation. Turning to the field of cross-view video synthesis, little work exists yet, as the problem becomes even harder. Although a video can be synthesized frame-by-frame by image synthesis methods, its temporal consistency, which is important for a video, is hard to guarantee.
Video synthesis is a field that has attracted increasing attention in the community and takes various forms depending on the given input, which can be roughly divided into the following three categories. (1) Unconditional video synthesis [saito2017temporal, tulyakov2018mocogan, vondrick2016generating] generates video clips from given random input variables by extending current GAN frameworks on (spatial) images into the temporal dimension. (2) Future video prediction [finn2016unsupervised, hao2018controllable, li2018flow, liang2017dual, lotter2016deep, mathieu2015deep, walker2016uncertain, walker2017pose, pan2019video] aims at inferring the future frames of a video based on the observations so far. (3) Video-to-video synthesis [chan2019everybody, chen2019mocycle, wang2019few, wang2018vid2vid, mallya2020world] maps a video in one domain to a corresponding video in another domain, sharing a similar spirit with image-to-image translation but with enhanced temporal consistency. Nevertheless, the cross-view video synthesis setting in our work is still different from all these categories, since it must consider both the temporal consistency between video frames and the geometric consistency between different views.
Novel view synthesis and neural rendering
technologies have developed rapidly in recent years with the advancements in deep neural networks. Many state-of-the-art works focus on synthesis from a single image. SynSin [wiles2020synsin] proposed an end-to-end view synthesis pipeline via a learned point cloud and a differentiable soft z-buffer, where a point is projected to a region of some radius in the image plane and α-composited with other projected points (regions). Shih et al. [shih20203dp] regarded the input depth image as a layered structure, where a learning-based inpainting model synthesizes color-and-depth content into the occluded regions in a spatially context-aware manner. These works usually assume small viewpoint changes, which makes it nearly impossible to directly employ them here. On the other hand, synthesis and rendering with arbitrary viewpoint changes is often achieved with multiple input images [sitzmann2019scene, thies2019deferred, sitzmann2019deepvoxels, meshry2019neural, mildenhall2020nerf]. Traditional methods usually adopt image-based rendering techniques [shum2000review] to generate novel views. Riegler et al. [Riegler2020FVS] employed differentiable reprojection of image features. Sitzmann et al. [sitzmann2019deepvoxels] learned a 3D-structured scene representation from only 2D supervision that encodes the view-dependent appearance of a 3D scene, and further proposed an implicit 3D scene representation [sitzmann2019scene] that can also be learned from 2D images via a differentiable ray-marching algorithm. Mildenhall et al. [mildenhall2020nerf] proposed to represent scenes as 5D neural radiance fields that can render photorealistic novel views of complex scenes. Meshry et al. [meshry2019neural] used additional depth and semantic information of the point cloud, together with an encoded latent vector, to achieve realistic rendering with different styles. Recent surveys on neural rendering can be found in [tewari2020neuralrendering, kato2020differentiable]. All these methods require a set of images or a pre-built point cloud as input in order to learn a detailed 3D scene representation with a deep network. Since our input is only a single satellite image, it is even harder for the network to learn a meaningful representation.
3 Method
We introduce a novel framework for synthesizing street-view panoramic videos from a single satellite image and provide an overview of our proposed pipeline in Fig. 2. As shown in the figure, we use a cascaded network architecture with three stages: a satellite stage, a transformation stage, and a 3D-to-video generation stage. The satellite stage is similar to the current state-of-the-art method S2G [lu20geometry] and estimates both a depth map and semantics from the input satellite image. Unlike the geo-transformation layer used in S2G [lu20geometry], which transforms satellite-domain information to the street view, we directly extract visible points from the constructed occupancy grid according to the given input trajectory. In the final 3D-to-video generation stage, two cascaded networks are utilized to generate a colorized point cloud from semantics, followed by a projection to each video frame. The second and third stages are detailed in the following subsections.
3.1 Visible Points Extraction
We first build a semantic voxel occupancy grid using the depth and semantic images from the satellite stage. Together with the sampling locations in the input trajectory, we create a point cloud with only visible points and build 3D-2D correspondences. This corresponds to finding the index of the point in the 3D space for each pixel in the video. Each pixel has a uniquely corresponding 3D point, and each point in the 3D space may correspond to multiple pixels. The same mapping will also be utilized for projecting the colored point cloud onto the video frames in the final step of the 3D-to-video generation stage.
Algorithm 1 describes the detailed procedure for extracting visible points and building 3D-2D correspondences. The algorithm takes as input the voxelized occupancy grid $O$ and the ordered sampling locations $\{l_t\}_{t=1}^{N}$. Here, $N$ denotes the number of sampling locations, which is equal to the number of video frames. The final output consists of an ordered set $P$ saving the 3D coordinates of all visible points and a mapping tensor $M$ for all 2D frame pixels. Each element $M_t(u, v)$ keeps an index value $i$ if the pixel at position $(u, v)$ of frame $t$ corresponds to the $i$-th visible point in $P$. The ordered set of visible points and the mapping tensor are computed iteratively. We assign a value of $0$ to all frame pixels that have no corresponding point in the point cloud in the current iteration, so point indices start from $1$.
At each time step $t$, we first obtain a dense depth map $D_t$ for the frame at location $l_t$ by taking, for each viewing ray, the distance to the first occupied voxel in the occupancy grid $O$. This processing step is identical to the geo-transformation layer proposed in S2G [lu20geometry]. Then, a preliminary mapping $M'_t$, which indicates the correspondence between the current frame pixels and the visible points collected so far, is calculated by a projection function. $M'_t(u, v) = i$ means that the $i$-th point in $P$ is projected to the pixel $(u, v)$ and its depth is within a tolerance $\delta$ of $D_t(u, v)$; otherwise $M'_t(u, v) = 0$. In our experiments, $\delta$ is set to a small fixed value. For pixels without corresponding points, i.e., $M'_t(u, v) = 0$, we unproject them to 3D space to obtain an additional ordered set $P''_t$ containing the incremental visible points and an additional mapping $M''_t$ saving the correspondences between these pixels and the incremental visible points. It should be noted that in $M''_t$, a pixel that already has a corresponding point in $P$ is assigned $0$, while pixels with correspondences in $P''_t$ have their indices offset by $|P|$. Hence, $M'_t \odot M''_t = 0$ always holds, where $\odot$ denotes the Hadamard product. Finally, we update the visible point set and mapping tensor by joining $P''_t$ onto $P$ and saving $M'_t + M''_t$ to $M_t$.
Since only the center frame has a ground-truth street-view RGB image, and to reflect the projection characteristics of the panorama image, the sampling locations are input to the algorithm in an order that starts from the center frame $l_c$, where $c$ is the index of the center frame.
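The per-frame bookkeeping of Algorithm 1 can be sketched as follows. This is a simplified NumPy illustration of only the merging step: the projection and unprojection themselves (which depend on the panorama camera model) are assumed to have already produced, for each frame, the mapping of pixels to known points and the locally indexed mapping for newly unprojected points.

```python
import numpy as np

def extract_visible_points(frames_proj, frames_unproj):
    # frames_proj[t]: m'_t, 1-based indices into the current point set P
    #                 (0 where the pixel matches no known point).
    # frames_unproj[t]: (new_pts, m_local) -- newly unprojected 3-D points for
    #                 the unmatched pixels and their local 1-based mapping
    #                 (0 where the pixel was already matched).
    P = []                      # ordered set of visible 3-D points
    M = []                      # per-frame point-pixel mappings
    for m_p, (new_pts, m_local) in zip(frames_proj, frames_unproj):
        offset = len(P)         # |P| before adding this frame's new points
        m_pp = np.where(m_local > 0, m_local + offset, 0)
        assert np.all(m_p * m_pp == 0)   # supports are disjoint by design
        M.append(m_p + m_pp)             # final mapping is a simple sum
        P.extend(new_pts)
    return P, np.stack(M)
```

Because a pixel is either matched by projection or handled by unprojection, the two mappings never overlap, which is exactly the Hadamard-product condition above.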
3.2 3D Generator
In the 3D-to-video generation stage, we aim to directly infer the color information for the point cloud in 3D space from its semantics. The semantics of the points are gathered from the satellite semantics according to each point's coordinates in the horizontal plane; distant points are simply labeled as sky. The proposed 3D generator consists of a SparseConvNet [graham18sparse] and a RandLA-Net [hu20randlanet] with a cascaded connection. Both networks operate purely in the 3D domain and have an hourglass structure, acting on coarse and fine generation successively. Finally, the colorized points are projected to each frame to generate a video with high temporal consistency.
The coarse generation stage is based on voxels. At the beginning of this stage, the point cloud is voxelized according to a given voxel size; the features of multiple points sharing the same voxel are averaged to form that voxel's feature. In our experiments, the voxel size is set to 3.125 cm (32 voxels per meter). The SparseConvNet operates only on the occupied part of the voxel grid, avoiding unnecessary computation on free space and thus achieving time- and memory-efficient 3D convolutions. Finally, the output of the network is de-voxelized back to a point cloud, where points sharing the same voxel are assigned the same feature. As depicted in Fig. 2, the point cloud visualized with intermediate SparseConvNet features already shows some characteristics of the building facades, such as windows.
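The voxelization and de-voxelization around the SparseConvNet can be sketched as a grouped mean, for instance (a NumPy sketch; the actual pipeline uses SparseConvNet's own hash-based voxelization):

```python
import numpy as np

def voxelize_mean(points, feats, voxel_size=0.03125):
    # quantize coordinates to integer voxel indices (32 voxels per meter)
    keys = np.floor(points / voxel_size).astype(np.int64)
    uniq, inv = np.unique(keys, axis=0, return_inverse=True)
    inv = inv.ravel()
    # average the features of all points falling into the same voxel
    sums = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(sums, inv, feats)
    counts = np.bincount(inv, minlength=len(uniq)).reshape(-1, 1)
    voxel_feats = sums / counts
    # de-voxelization: each point receives its voxel's averaged feature
    return uniq, voxel_feats, voxel_feats[inv]
```

`np.add.at` performs an unbuffered scatter-add, so repeated voxel indices accumulate correctly.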
The fine generation stage is based on the point cloud. The input of this stage is a concatenation of the intermediate SparseConvNet [graham18sparse] features and the original point semantics from the skip connection. RandLA-Net [hu20randlanet] is an efficient and lightweight state-of-the-art architecture designed for semantic segmentation of large-scale point clouds. We leverage this network in reverse to infer the final RGB for each point from the semantics and the features. We set the number of nearest neighbors to 8 and the decimation ratio in its local feature aggregation module to 4. Finally, each pixel in each video frame gathers RGB from its corresponding point in the point cloud according to the point-pixel mapping computed in the transformation stage.
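The final per-pixel gather is a simple indexing operation. A minimal sketch, assuming 1-based point indices with 0 marking pixels without a corresponding point (which here simply render as black):

```python
import numpy as np

def render_frames(point_rgb, M):
    # point_rgb: (P, 3) colors inferred by the 3D generator
    # M: (T, H, W) 1-based point indices per pixel, 0 = no corresponding point
    palette = np.concatenate([np.zeros((1, 3)), point_rgb], axis=0)
    return palette[M]          # (T, H, W, 3)
```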
The reason for using a cascaded architecture of these two networks rather than RandLA-Net alone is that its efficient design keeps the network rather small, so its capacity may not suffice for full scene generation. With the help of SparseConvNet, which is good at learning high-level features, RandLA-Net can better infer RGB from local information. The two networks complement each other while performing separate duties. We also conduct experiments on a generator with only RandLA-Net, as detailed in Sec. 4.4.
3.3 Multi-class Encoder
S2G [lu20geometry] follows BicycleGAN [zhu2017toward] to use a single latent vector when generating the whole scene. Instead, we use a multi-class texture encoder that computes several latent vectors per class to enrich the diversity of generated scenes.
The encoder of the BicycleGAN used in our pipeline takes as input the ground-truth street-view RGB as well as the semantics of the center frame during training. The semantics here serve as an indicator for attentive pooling: after obtaining the feature map of the entire image, the encoder does not directly perform average pooling but instead pools the features of pixels sharing the same semantic class to obtain multiple latent vectors. As shown in Eq. 1, for a specific class, its corresponding semantic map is used for attentive pooling to finally obtain the latent vector of that class.
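Eq. 1 is not reproduced in this excerpt. Based on the description above, the class-wise attentive pooling presumably takes the form of a masked average, with $F$ the encoder feature map, $S_k$ the binary semantic mask of class $k$, and $p$ ranging over pixels (a reconstruction, not the paper's exact equation):

```latex
z_k = \frac{\sum_{p} S_k(p)\, F(p)}{\sum_{p} S_k(p)}
```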
The encoder for the satellite image is similar to the encoder in the BicycleGAN. During training, the goal is to make its generated latent vectors as similar as possible to those generated by the BicycleGAN encoder. Since some classes, e.g., sky and sidewalk, may not be inferable from the satellite image, no loss is imposed on the latent vectors of these classes during training, and they are directly assigned random vectors during inference.
4 Experiments
4.1 Ground Truth Video Generation
To the best of our knowledge, there is currently no available dataset that provides both satellite images and corresponding street-view panoramic videos. As the first work that sheds light on the task of street-view video synthesis from a single satellite image, we first produce a dataset that satisfies the requirements of the task by extending the London panorama dataset used in S2G [lu20geometry] with ground-truth street-view video snippets. The original dataset includes around 2K pairs of satellite images and corresponding street-view panoramas captured at the center position of the satellite images. The estimated depth (elevation) and semantics of the satellite images are also provided as ground truth. In brief, we generate the street-view panoramic videos by interpolation in 3D space via a point cloud whose geometry is computed from the satellite depth. Since it is hard to infer accurate geometry from the satellite depth, we further provide a refined version by inferring dense depth from the street-view panorama. We elaborate on the details of ground-truth video generation as follows.
Sampling trajectory. Each single street-view panorama provided in the London panorama dataset [lu20geometry] is taken at the center of a satellite image and is associated with an orientation. To generate the street-view panoramic video surrounding the location of this image, we set the sampling paths in both training and inference to a total range of 7 meters straight ahead and back from the viewing center. With an interval of 0.5 meters, a total of 15 frames including the center frame are sampled to form a video. We denote the provided single street-view panorama as the center frame for brevity.
Street-view semantics. To obtain the semantics of a street-view panorama, we adopt DeepLab v3+ [chen18deeplabv3plus] with an Xception 71 [chollet17xception] backbone pre-trained on the Cityscapes [cordts16cityscapes] dataset. Compared to SegNet [badri17segnet] utilized in S2G [lu20geometry], DeepLab v3+ generates more accurate semantics.
Point cloud coloring. A point cloud can be estimated from the satellite ground-truth depth using the algorithm described in Sec. 3.1. However, only points unprojected from the center frame possess exact semantics and RGB information. For the other points in the point cloud, we complement this information through nearest-neighbor search. More specifically, for each uncolored point, we search for its 32 nearest center-frame points that have valid information and determine its RGB by a distance-weighted average and its semantics by voting among these neighbors.
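This coloring step can be sketched as follows (a brute-force NumPy illustration; the exact weighting scheme is our assumption, and a KD-tree would be used at scale):

```python
import numpy as np

def colorize_by_neighbors(query_xyz, known_xyz, known_rgb, known_sem, k=32):
    # brute-force k-nearest-neighbour search among the colored points
    d = np.linalg.norm(query_xyz[:, None, :] - known_xyz[None, :, :], axis=-1)
    k = min(k, known_xyz.shape[0])
    idx = np.argsort(d, axis=1)[:, :k]
    dist = np.take_along_axis(d, idx, axis=1)
    # inverse-distance weights for the RGB average
    w = 1.0 / (dist + 1e-8)
    w /= w.sum(axis=1, keepdims=True)
    rgb = (w[..., None] * known_rgb[idx]).sum(axis=1)
    # majority vote among the neighbours for the semantic label
    sem = np.array([np.bincount(s).argmax() for s in known_sem[idx]])
    return rgb, sem
```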
Incremental dataset. Because the accuracy of the satellite depth is limited, it is hard to ensure that the point cloud estimated from it is well aligned with the panorama image. For instance, distant points (geometrically sky points) might be assigned the colors or semantics of a building if the estimated height of that building does not agree with the panorama. In particular, for objects like street lamps and cars, which are nearly invisible in the satellite image, misalignment is inevitable. Such misalignments between color/semantics and geometry result in distorted panorama frames. To mitigate such errors, we introduce an incremental training set by refining the geometry of the point cloud. We first generate a dense depth map for the center frame using MiDaS [ranftl20midas], a state-of-the-art method for monocular depth estimation from a single RGB image. We then correct this depth map by scaling it such that the height of the viewing center (the standing point) is 3 meters. Next, we unproject the center-frame depth to obtain a raw 3D point cloud and get the depth of the other frames by re-projecting this point cloud into each frame. For locations without a valid projection, we infer the missing depth values using the OpenCV inpainting function. By unprojecting each frame's depth into 3D space, a refined point cloud is finally built, whose colors are complemented by the nearest-neighbor search described in the previous paragraph.
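The scaling correction can be sketched as below, assuming the relative MiDaS prediction has already been converted to a distance map and that the bottom-center pixel of the equirectangular panorama looks straight down (both are our simplifying assumptions):

```python
import numpy as np

def rescale_depth(depth, camera_height=3.0):
    # the nadir pixel of an equirectangular panorama views the ground directly
    # below the camera, so its depth should equal the assumed camera height
    h, w = depth.shape
    nadir = depth[h - 1, w // 2]
    return depth * (camera_height / nadir)
```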
Fig. 3 shows three examples of ground-truth street-view panoramic frames generated from the ground-truth satellite depth in S2G [lu20geometry] and from the depth map predicted by MiDaS [ranftl20midas]. The red bounding boxes clearly show that depth misalignments can cause shape distortions in the generated frames, which is alleviated by leveraging the refined point cloud. Since most RGB frames generated from satellite-depth-based geometry are blurred or distorted, we adopt the more consistent, higher-quality RGB frames generated from the refined point cloud with MiDaS [ranftl20midas] depth as ground truth for the inference evaluation. In the training phase, we also use them as an incremental dataset. As shown in the ablation study in Sec. 4.4, incorporating the incremental dataset improves the final performance.
(Fig. 3, columns left to right: Center Frame GT, S2G [lu20geometry] depth, MiDaS [ranftl20midas] depth.)
4.2 Implementation Details
Our framework is implemented in PyTorch and runs on a single Nvidia Tesla V100 GPU with 32 GB memory. For the subnetworks, we use the official implementations of BicycleGAN [zhu2017toward] and SparseConvNet [graham18sparse], as well as an unofficial PyTorch implementation of RandLA-Net [hu20randlanet] (https://github.com/aRI0U/RandLA-Net-pytorch). For the dataset, we keep an output resolution of 256×128 such that the point cloud size of each scene is around 200K. During training, both versions of the dataset are used; in the inference phase, although the geometry comes from the satellite stage, the RGB frames from the MiDaS version are used as ground truth for evaluation. For the network architecture, the default training settings of BicycleGAN are employed, with 16 for the size of the latent vectors and 64 for the size of the intermediate features. The discriminator uses all predicted frames as the fake set and the RGBs from the MiDaS version as the real set. The multi-class encoder only takes the center frame as input. We further distinguish between left and right buildings in the semantic labels to achieve better diversity.
For the 3D generator, we use the default U-Net [ronneberger15unet] implementations provided with the SparseConvNet [graham18sparse] and RandLA-Net [hu20randlanet] frameworks, which are originally used for point cloud semantic segmentation. During training, the 3D generator loss is computed directly on the point cloud instead of being projected to each frame. If the point cloud geometry comes from the satellite stage, points with misalignment, i.e., distant points with a non-sky label or nearby points with a sky label, are given zero weights.
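The zero-weighting of misaligned points can be sketched as a simple mask (the label id and distance threshold below are hypothetical placeholders, not values from the paper):

```python
import numpy as np

SKY = 10            # hypothetical id of the "sky" class
FAR_THRESH = 100.0  # hypothetical distance beyond which points should be sky

def point_loss_weights(dist, sem):
    far = dist > FAR_THRESH
    # misaligned: distant points not labelled sky, or nearby points labelled sky
    misaligned = (far & (sem != SKY)) | (~far & (sem == SKY))
    return (~misaligned).astype(np.float32)
```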
4.3 Baseline Comparison
Table 1: Quantitative comparison with the baselines. Each cell reports video frames / center frame; ↑ indicates higher is better, ↓ lower is better.

| Method | PSNR ↑ | SSIM ↑ | Sharp Diff. ↑ | P_vgg ↓ | P_sq ↓ |
| --- | --- | --- | --- | --- | --- |
| Pix2Pix [isola2017image] | – / 13.2567 | – / 0.3129 | – / 24.6731 | – / 0.6290 | – / 0.4778 |
| Regmi [regmi2018cross] | – / 13.3050 | – / 0.3198 | – / 24.5601 | – / 0.6001 | – / 0.4430 |
| S2G [lu20geometry]-F | 14.0905 / 14.1260 | 0.3461 / 0.3458 | 25.8236 / 25.8334 | 0.6259 / 0.6264 | 0.4223 / 0.4223 |
| S2G [lu20geometry]-I | 14.1694 / 14.1260 | 0.3650 / 0.3458 | 26.1079 / 25.8334 | 0.5941 / 0.6264 | 0.4044 / 0.4223 |
| Ours | 14.6547 / 14.7141 | 0.3852 / 0.3911 | 25.8212 / 25.8335 | 0.5756 / 0.5721 | 0.4034 / 0.3993 |
(Fig. 4 layout: six examples, each with rows labeled GT, S2G-F, Ours, and S2G-I; columns show the satellite RGB, the center frame, and the surrounding frames.)
Since we are the first to propose a method for generating street-view panoramic videos from single satellite images, we design two baseline methods by adapting the state-of-the-art street-view panoramic image synthesis method S2G [lu20geometry] for video generation: (1) S2G-F: each frame is generated individually but shares the same latent vector encoded from the input satellite image; (2) S2G-I: only the center frame is generated and other frames are interpolated by using the point cloud coloring procedure described in Sec. 4.1. We compare our method with these two baseline methods on the test set of London Panorama Video [lu20geometry].
For quantitative evaluation, we follow [lu20geometry] and use PSNR, SSIM, and sharpness difference (Sharp Diff.) as low-level metrics to measure the per-pixel differences between the predicted image and the ground truth. A high-level perceptual similarity metric (P) is also used: P_vgg and P_sq denote the evaluation results based on the VGG [vgg] and SqueezeNet [squeezenet] backbones, respectively.
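For reference, PSNR reduces to a log-scaled mean squared error; a minimal sketch of the standard definition (not the paper's evaluation code):

```python
import numpy as np

def psnr(pred, gt, max_val=255.0):
    # peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```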
In addition to the above two baselines, we compare with two image-to-image translation works, Pix2Pix [isola2017image] and Regmi et al. [regmi2018cross], on center frame generation. The quantitative results are shown in Table 1. For the video generation comparison, our improved performance likely results from the better temporal consistency of our generated videos, since all methods use the same geometry inferred from the input satellite image. For the center frame comparison, we outperform all state-of-the-art methods on all metrics, which indicates the superiority of our method in generating geometrically consistent single street-view panoramas.
More qualitative results are presented in Fig. 4. We can see that the frames generated by our method are both temporally and geometrically consistent. Since each frame from the baseline method S2G-F is synthesized independently, the textures in different frames are nearly stationary and there is no consistent transition between them when the observation location changes. For S2G-I, the interpolation ensures texture consistency between frames, since every frame's texture comes from the center frame and follows the geometry. Nevertheless, the texture in frames far away from the center frame is likely to be blurred, especially on building facades that are invisible in the center frame.
4.4 Ablation Study
Table 2: Ablation study. Each cell reports video frames / center frame; ↑ indicates higher is better, ↓ lower is better.

| Method | PSNR ↑ | SSIM ↑ | Sharp Diff. ↑ | P_vgg ↓ | P_sq ↓ |
| --- | --- | --- | --- | --- | --- |
| R | 13.6855 / 13.7390 | 0.3265 / 0.3252 | 25.7345 / 25.7454 | 0.6209 / 0.6192 | 0.4434 / 0.4426 |
| R+S | 14.1518 / 14.3178 | 0.3850 / 0.3922 | 25.3743 / 25.4204 | 0.5743 / 0.5690 | 0.4082 / 0.4026 |
| R+S+M | 14.5507 / 14.5900 | 0.4023 / 0.4032 | 25.5022 / 25.4883 | 0.5723 / 0.5681 | 0.4042 / 0.4018 |
| R+S+M+D (Ours) | 14.6547 / 14.7141 | 0.3852 / 0.3911 | 25.8212 / 25.8335 | 0.5756 / 0.5721 | 0.4034 / 0.3993 |
(Fig. 5, columns left to right: Sate. RGB, R, R+S, R+S+M, R+S+M+D (Ours), GT.)
To better evaluate the effectiveness of the individual components of our method, we conduct an ablation study by incrementally adding components to our basic framework. More specifically, we focus on the following three components: (1) the SparseConvNet used in the 3D generator; (2) the setting of multiple latent vectors; (3) the incremental MiDaS training set. We define the basic framework as the pipeline with only RandLA-Net in the 3D generation stage, while our full method possesses all components.
Tab. 2 shows the quantitative evaluation results of the ablation study. The abbreviations of the method names in the table are defined as follows. R: the basic framework that uses RandLA-Net and a global latent vector in the 3D generation stage. R+S: the coarse-and-fine generation framework that further incorporates SparseConvNet. R+S+M: further adds a multi-class encoder to the R+S setting. R+S+M+D: further uses the MiDaS training set, forming our final method with all components.
The effectiveness of each added component is shown by clear performance improvements in the PSNR and P metrics. In particular, the addition of SparseConvNet significantly improves the performance on nearly every metric compared to the method that only uses RandLA-Net. We attribute this mainly to the explicit allocation of coarse generation and fine generation to two different cascaded networks, which alleviates the struggle of RandLA-Net to generate both coarse and fine textures. The additional multi-class encoder further improves the performance on every metric, especially PSNR, since it disentangles the latent vectors of different classes and enables more generation possibilities. Fig. 5 shows a qualitative comparison of the results generated by the aforementioned methods. Since each network takes the same semantic point cloud as input, the consistency of each method stays exactly the same; therefore, we only show and compare the generated center frames in this figure. From the figure, we can observe that the basic setting (R) only produces the overall colors and cannot restore texture details, e.g., on the building facades. With the addition of SparseConvNet (R+S), more realistic scenes can be generated but with potentially fixed artifacts (like the fixed light spot on the road). With the introduction of multiple latent vectors (R+S+M), there are no more obvious artifacts, and the facades become clearer and more diverse. Although the visual changes are inconspicuous after adding the incremental MiDaS training set (R+S+M+D), the results look less noisy and the performance improves on several metrics.
5 Conclusion
We proposed a novel approach for cross-view video synthesis. In particular, we presented a multi-stage pipeline that takes as input a single satellite image with a given trajectory and generates a street-view panoramic video with both geometric and temporal consistency. The generator in our pipeline colors a point cloud built from the input satellite image through successive coarse and fine generation performed by two cascaded hourglass networks. Generating directly in 3D significantly improves the temporal consistency across frames, which is especially important for video synthesis. Our experiments demonstrated that our method outperforms existing state-of-the-art cross-view generation approaches and synthesizes more realistic street-view panoramic videos with larger variability. To the best of our knowledge, this is the first work that synthesizes videos under cross-view settings.