Novel view synthesis aims to infer the appearance of an object from unobserved points of view. The synthesis of unseen views of objects could be important for image-based 3D object manipulation [Kholgade2014], robot traversability [Hirose2019], or 3D object reconstruction [Tatarchenko2016]. Generating a coherent view of unseen parts of an object requires a non-trivial understanding of the object’s inherent properties such as (3D) geometry, texture, shading, and illumination.
Different algorithms make use of the provided source images in different ways. Model-based approaches use similar-looking open-stock 3D models [Kholgade2014], or construct models through user interaction [Zheng2012, Chen2013, Rematas2017]. Image-based methods [Tatarchenko2016, Zhou2016, Park2017, Sun2018, Olszewski2019] assume an underlying parametric model of object appearance conditioned on viewpoints and try to learn it using statistical frameworks. Despite their differences, both approaches use 3D information in predicting new views of objects. The former imposes stronger assumptions on the full 3D structure and shifts the paradigm to obtaining full models, while the latter captures the 3D information in latent space to cope with (self-)occlusion.
The principle is that the generation of a new view of an object is composed of (1) relocating pixels in the source image that will be visible to their corresponding positions in the target view, (2) removing the pixels that will be occluded, and (3) adding disoccluded pixels that are not seen in the source but will be visible in the target view [Park2017]. [Zhou2016, Park2017, Sun2018] show that (1) and (2) can be done by learning an appearance flow field that "flows" pixels from a source image to the corresponding positions in the target view, and that (3) can be done by a completion network with an adversarial loss.
In this paper, we leverage the explicit use of geometry information in synthesizing novel views. We argue that (1) and (2) can be done in a straightforward manner given access to the geometry of the objects. The appearance flow [Zhou2016, Park2017, Sun2018], which associates pixels of the source view with their positions in the target view, is the projection of the 3D displacement of the objects' points before and after transformation. Occluded object parts can be identified based on the orientation of the object surface normals relative to the view directions. The argument also extends to multiple input images. In this paper, we show that the geometry of an object provides an explicit and natural basis for the problem of novel view synthesis.
In contrast to geometry-based methods, the proposed approach does not require 3D supervision. The method predicts a depth map in a self-supervised manner by formulating the depth estimation problem in the context of novel view synthesis. The predicted depth is used to partly construct the target views and to assist the completion network.
The main contributions of this paper are: (1) a novel methodology for novel view synthesis using explicit transformations of estimated point clouds; (2) an integrated model combining self-supervised monocular depth estimation and novel view synthesis, which can be trained end-to-end; (3) natural extensions to multi-view inputs and full point cloud reconstruction from a single image; and (4) experimental benchmarking to validate the proposed method, which outperforms the current state-of-the-art methods for novel view synthesis.
2 Related Work
2.1 Geometry-based view synthesis
View synthesis via 3D models Full models (textured meshes or colored point clouds) of objects or scenes are constructed from multiple images taken from various viewpoints [Debevec1996, Seitz2006, Meshry2019], or are given and aligned interactively by users [Kholgade2014, Rematas2017]. The use of 3D models allows for extreme pose changes, re-texturing, and flexible (re-)lighting by applying rendering techniques [Nguyen2018, Meshry2019]. However, obtaining complete 3D models of objects or scenes is a challenging task in itself. Therefore, these approaches require additional user input to identify object boundaries [Zheng2012, Chen2013], select and align 3D models with image views [Kholgade2014, Rematas2017], or use simple texture-mapped 3-planar billboard models [Hoiem2005]. In contrast, the proposed method makes use of objects' partial point clouds constructed from a given source view and does not require a predefined (explicit) 3D model.
View synthesis via depth Methods using 3D models assume a coherent structure between the desired objects and the obtained 3D models [Chen2013, Kholgade2014]. Synthesis using depth instead builds an intermediate representation from depth information, which captures hidden surfaces from one or multiple viewpoints. [Zitnick2004] proposes to use layered depth images, [Flynn2016] creates 3D plane-sweep volumes by projecting images onto target viewpoints at different depths, [Zhou2018stereo] uses multi-plane images at fixed distances to the camera, and [Choi2019] estimates depth probability volumes to leverage depth uncertainty in occluded regions.
In contrast, the proposed method estimates depth directly from monocular views to partially construct the target views. Self-supervised depth estimation with deep neural networks based on photometric re-projection consistency has been researched by several authors [Garg2016, Zhou2017, Godard2017, Godard2019, Johnston2019]. In this paper, we train a self-supervised depth prediction network with novel view synthesis in an end-to-end system.
2.2 Image-based view synthesis
Requiring explicit geometrical structures of objects or scenes as a precursor severely limits the applicability of a method. With the advance of convolutional neural networks (CNNs), generative adversarial networks (GANs) [GAN] achieve impressive results in image generation, allowing view synthesis without explicit geometrical structures of objects or scenes.
View synthesis via embedded geometry Zhou et al. [Zhou2016] propose learning a flow field that maps pixels in input images to their corresponding locations in target views to capture latent geometrical information. [Olszewski2019] learns a volumetric representation in a transformable bottleneck layer, which can generate corresponding views for arbitrary transformations. The former explicitly utilizes input (source) image pixels in constructing new views, either fully [Zhou2016] or partly, with the rest being filled by a completion network [Park2017, Sun2018]. The latter explicitly applies transformations to the volumetric representation in latent space and generates new views by means of pixel-generation networks.
The proposed method takes the best of both worlds. By directly using object geometry, the source pixels are mapped to their target positions based on the given transformation parameters, hence making the best use of the given information when synthesizing new views. Our approach is fundamentally different from [Park2017]: we estimate the object point cloud using self-supervised depth predictions and obtain coarse target views from purely geometrical transformations, while [Park2017] learns an image-based flow field and visibility maps under full supervision.
View synthesis directly from image
Since the introduction of image-to-image translation [Isola2017], there has been a paradigm shift towards pure image-based approaches [Tatarchenko2016]. [Zhu2018] synthesizes bird-view images from a single frontal-view image, while [Regmi2018] generates cross-views of aerial and street-view images. Networks can be trained to predict all the views in an orbit from a single view of an object [Kicanaoglu2018, Johnston2019], or to generate a view in an iterative manner [Galama2019]. Additional features can be embedded, such as view-independent intrinsic properties of objects [Xu2019_ICCV]. In this paper, we employ GANs to generate complete views, conditioned on the geometrical features and the relative poses between source and target views. Our approach can be interpreted as a reverse, end-to-end version of [Johnston2019]: we estimate objects' arbitrary new views via point clouds constructed from self-supervised depth maps, while [Johnston2019] predicts objects' fixed orbit views for 3D reconstruction.
3.1 Point-cloud based transformations
The core of the proposed novel view synthesis method is to use point clouds for geometrically aware transformations. Using the pinhole camera model and known intrinsics $K$, the point cloud can be reconstructed when the pixel-wise depth map $D$ is available. The camera intrinsics can be obtained by camera calibration; for the synthetic data used in our experiments, $K$ is given. A pixel $p = (u, v, 1)^\top$ on the source image plane (using homogeneous coordinates) corresponds to a point $P_s$ in the source camera space:
$$P_s = D(p)\, K^{-1} p. \qquad (1)$$
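As an illustration, the back-projection above can be sketched in a few lines of NumPy; the function name and array conventions (a 3 x 3 intrinsics matrix, an H x W depth map, 3 x N output) are our own for this example, not part of the original method.

```python
import numpy as np

def backproject(depth, K):
    """Back-project a pixel-wise depth map into a camera-space point cloud.

    Sketch of P_s = D(p) * K^{-1} p with p = (u, v, 1)^T in homogeneous
    pixel coordinates; returns a 3 x N array of points.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))                  # pixel grid
    p = np.stack([u, v, np.ones_like(u)]).reshape(3, -1)            # 3 x N homogeneous pixels
    P = np.linalg.inv(K) @ p * depth.reshape(1, -1)                 # scale rays by depth
    return P
```

For identity intrinsics, each pixel $(u, v)$ simply maps to the point $(u, v, 1)$ scaled by its depth.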
Rigid transformations can be obtained by matrix multiplications. The relative transformation $T_{s \to t}$ from the source camera to the target viewpoint is given by
$$T_{s \to t} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}, \qquad (2)$$
where $R$ denotes the desired rotation matrix and $t$ the translation vector. Points in the target camera view are given by $P_t = T_{s \to t} P_s$ (in homogeneous coordinates). This can also be regarded as an image-based flow field parameterized by $T_{s \to t}$ (c.f. [Zhou2016, Park2017, Sun2018]). The flow field $F$ returns the homogeneous coordinates of pixels in the target image for each pixel in the source image:
$$F(p) = K \left( R\, D(p)\, K^{-1} p + t \right). \qquad (3)$$
By observing that the third coordinate of $F(p)$ is the depth of the point in the target view, the Cartesian pixel coordinates in the target view can be extracted by dividing by this coordinate. The advantage of the flow-field interpretation is that it provides a direct mapping between the image planes of the source view and the target view.
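The flow-field computation, including the final dehomogenization, can be sketched as follows (a NumPy illustration under our own naming conventions, not the paper's implementation):

```python
import numpy as np

def flow_field(depth, K, R, t):
    """Image-based flow field induced by a rigid transform (sketch).

    Maps each source pixel to its target-view pixel position via
    F(p) = K (R D(p) K^{-1} p + t); Cartesian coordinates follow by
    dividing by the third (projective) component. Returns 2 x N coords.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    p = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(float)
    P_s = np.linalg.inv(K) @ p * depth.reshape(1, -1)    # source-camera points
    P_t = R @ P_s + t.reshape(3, 1)                      # target-camera points
    q = K @ P_t                                          # homogeneous target pixels
    return q[:2] / q[2]                                  # Cartesian pixel coordinates
```

With identity intrinsics and rotation, a pure translation of one unit along the x-axis shifts every pixel at unit depth by one pixel.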
Forward warping The flow field is used to generate the target view from the source:
$$\hat{I}_t(F(p)) = I_s(p). \qquad (4)$$
Backward warping The flow field is used to generate the source view from the target:
$$\hat{I}_s(p) = I_t(F(p)). \qquad (5)$$
The process assigns a value to every pixel in $\hat{I}_s$, resulting in a dense image, as illustrated in Fig. 2 (bottom-right). The generated source view may contain artifacts due to (dis)occlusion in the target view. To sample from $I_t$, a differentiable bilinear sampling layer [Jaderberg2015] is used. The generated source view is used for self-supervised monocular depth prediction (Sec. 3.3).
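Backward warping hinges on sampling the target image at continuous flow-field positions. A minimal NumPy sketch of bilinear sampling (c.f. [Jaderberg2015]) for a single-channel image, with border clamping as one possible boundary handling:

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Bilinear sampling of img (H x W) at continuous positions (x, y).

    Positions are clamped to the image border; each output value is a
    weighted average of the four surrounding pixels.
    """
    h, w = img.shape
    x = np.clip(x, 0, w - 1)
    y = np.clip(y, 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]      # interpolate along x (top row)
    bot = (1 - wx) * img[y1, x0] + wx * img[y1, x1]      # interpolate along x (bottom row)
    return (1 - wy) * top + wy * bot                      # interpolate along y
```

In the full pipeline this operation is performed by a differentiable sampling layer so gradients can flow back to the depth network.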
3.2 Novel view synthesis
The point-cloud-based forward warping relocates the visible pixels of the object in the source view to their corresponding positions in the target view. For novel view synthesis, however, two more steps are required: (1) obtaining the target coarse view by discarding occluded pixels, and (2) filling in the pixels that are not seen in the source view.
Coarse view construction The goal is to remove the pixels which are seen in the source view yet should not be visible in the target view, due to occlusion. To this end, pixels that have surface normals (after transformation) pointing away from the viewing direction are removed, similarly to [Park2017]. Surface normals are obtained from normalized depth gradients.
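The normal-based occlusion test can be sketched as follows; this is an illustrative NumPy version under assumed conventions (a 3 x H x W point map, normals from finite-difference tangents, camera at the origin), not the exact implementation:

```python
import numpy as np

def visibility_mask(points):
    """Backface culling from estimated geometry (illustrative sketch).

    points: 3 x H x W point map in the target camera frame. Normals are
    cross products of finite-difference surface tangents (playing the role
    of the normalized depth gradients in the text); a point is kept when
    its normal faces the camera, i.e. n . view < 0 under the chosen
    tangent ordering.
    """
    dx = np.gradient(points, axis=2)        # tangent along image x
    dy = np.gradient(points, axis=1)        # tangent along image y
    n = np.cross(dy, dx, axis=0)            # surface normals, 3 x H x W
    cos = np.sum(n * points, axis=0)        # sign of n . viewing ray
    return cos < 0                          # True where the surface faces the camera
```

A fronto-parallel plane in front of the camera is, as expected, visible everywhere under this test.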
An illustration of the coarse view construction is shown in Fig. 3 for different target views. The first row depicts the target views; the second row indicates the parts visible from the input image (third column). The third and fourth rows show the coarse view with and without occlusion removal (or backface culling). Finally, the fifth row shows an enhanced version of the coarse view, where the object is assumed to be left-right symmetric [Park2017]. The proposed method directly identifies and removes occluded pixels from the input view using the estimated depth, in contrast to [Park2017], where ground-truth visibility masks are required for each target view to train a visibility prediction network.
View completion The obtained coarse view is already in the target viewpoint, but it remains sparse. To synthesize the final dense image, an image completion network is used.
The completion network uses the hour-glass architecture [Newell2016]. Following [Park2017], we concatenate the depth bottleneck features and the embedded transformation to the completion network bottleneck. By conditioning the completion network on the input features and the desired transformation, the network can fix artifacts and errors due to estimated depth and cope better with extreme pose transformations, i.e. when the coarse view image is nearly empty (e.g. columns 9-11 in Fig. 3).
The image completion network is trained in a GAN manner using a generator $\mathcal{G}$, a discriminator $\mathcal{D}$, an input image $I_s$, and a target image $I_t$. The combination of losses that are used is given by
$$\mathcal{L}_{\mathcal{D}} = \tfrac{1}{2}\left(\mathcal{D}(I_t) - 1\right)^2 + \tfrac{1}{2}\,\mathcal{D}(\mathcal{G}(I_s))^2 \quad \text{(LS-GAN discriminator loss)}, \qquad (6)$$
where the perceptual loss uses $\phi$ and $\psi$ to denote features extracted from an image by the discriminator network and by a pre-trained VGG network, respectively, c.f. [Perceptual]. SSIM denotes the structural similarity index measure, see Sec. 4. The total loss is given by the weighted sum
$$\mathcal{L} = \sum_i \lambda_i\, \mathcal{L}_i,$$
where $\lambda_i$ denotes the weighting of the individual losses (set following [Park2017]).
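As a schematic sketch of the adversarial part, the LS-GAN objectives and the weighted total can be written as below. The scalar NumPy stand-ins for discriminator outputs and the function names are our own; the real model operates on image batches and includes the perceptual and SSIM terms.

```python
import numpy as np

def lsgan_discriminator_loss(d_real, d_fake):
    """LS-GAN discriminator loss: push real scores to 1, fake scores to 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_generator_loss(d_fake):
    """LS-GAN generator (adversarial) loss: push fake scores to 1."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

def total_loss(losses, weights):
    """Weighted sum of the individual loss terms."""
    return sum(w * l for w, l in zip(weights, losses))
```

The least-squares form penalizes samples by their distance to the decision boundary, which tends to give smoother gradients than the original cross-entropy GAN objective.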
3.3 Self-supervised Monocular Depth estimation
The discussion so far has assumed that pixel-wise depth maps are available. In this section, the method used to estimate depth from a single image is detailed. In order to make minimal assumptions about the training data, self-supervised methods are considered, which do not require ground-truth depth [Garg2016, Zhou2017, Godard2017, Godard2019, Johnston2019].
For the depth prediction an encoder-decoder network with bottleneck architecture is used, similar to [Godard2019]. The network is optimised using a set of (reconstruction) losses between the source image $I_s$ and its synthesized version $\hat{I}_s$, obtained by backward warping, Eq. (5), from a second (target) image and the predicted depth map. The underlying rationale is that a more realistic depth map will have a lower reconstruction loss.
The losses are given by:
$$\mathcal{L}_{smooth} = \left|\partial_x d^*\right| e^{-\left|\partial_x I_s\right|} + \left|\partial_y d^*\right| e^{-\left|\partial_y I_s\right|} \quad \text{(Smoothness loss [Godard2017])} \qquad (11)$$
where $d^* = d / \bar{d}$ is the mean-normalized inverse depth, and $\mu$ is an indicator function which equals 1 iff the photometric loss of the synthesized image is lower than that of the original source image, see [Godard2019] for more details. The smoothness loss encourages nearby pixels to have similar depths, while artifacts due to (dis)occlusion are excluded by the per-pixel minimum-projection mechanism.
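The edge-aware smoothness term of Eq. (11) can be sketched as follows, assuming a single-channel image and forward differences as the gradient operator (both our own simplifications for this example):

```python
import numpy as np

def smoothness_loss(disp, img):
    """Edge-aware smoothness loss, c.f. [Godard2017] (illustrative sketch).

    disp: inverse depth (H x W); img: grayscale image (H x W). The
    disparity is mean-normalized (d* = d / mean(d)) and its gradients are
    down-weighted where the image itself has strong gradients (edges).
    """
    d = disp / (disp.mean() + 1e-7)              # mean-normalized inverse depth
    dx = np.abs(np.diff(d, axis=1))              # disparity gradients along x
    dy = np.abs(np.diff(d, axis=0))              # disparity gradients along y
    wx = np.exp(-np.abs(np.diff(img, axis=1)))   # edge-aware weights along x
    wy = np.exp(-np.abs(np.diff(img, axis=0)))   # edge-aware weights along y
    return (dx * wx).mean() + (dy * wy).mean()
```

A constant disparity map incurs zero smoothness penalty, as intended.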
4 Experiments

In this section, the proposed method is analysed on the 3D ShapeNet benchmark, including an ablation study of the effects of the different components and a state-of-the-art comparison.
Dataset We use the object-centered car and chair images rendered from the 3D ShapeNet models [shapenet] using the same render engine and setup as in [Zhou2016, Park2017, Sun2018, Olszewski2019]. (The specific render engine and setup guarantee a fair comparison with the reported methods, as none of the author-provided weights perform at a similar level on images rendered with different rendering setups.) Specifically, there are 7497 car and 698 chair models with high-quality textures, split 80%/20% for training and test. The images are rendered at 18 azimuth angles and 3 elevation angles. Input and output images are of equal size.
Metrics We evaluate the generated images using the standard pixel-wise $L_1$ error (normalized; lower is better) and the structural similarity index measure (SSIM) [ssim] (higher is better). $L_1$ indicates the proximity of pixel values between a completed image and the target, while SSIM measures the perceived quality and structural similarity between the images.
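For concreteness, the two metrics can be sketched as follows; `ssim_global` is a simplified single-window variant of SSIM for this illustration, whereas the benchmark uses the standard local-window formulation of [ssim]:

```python
import numpy as np

def l1_error(a, b):
    """Mean absolute pixel error, assuming images normalized to [0, 1]."""
    return np.mean(np.abs(a - b))

def ssim_global(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM (simplified sketch of [ssim])."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (va + vb + c2)
    return num / den
```

Identical images give an $L_1$ error of 0 and an SSIM of 1, the best possible scores under both metrics.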
Baseline We compare the results of our method with the following state-of-the-art methods: AFN [Zhou2016], TVSN [Park2017], M2NV [Sun2018], and TBN [Olszewski2019].
4.1 Initial experiments
Comparison to image-based completion In this section, we compare the intermediate views generated by forward warping of estimated point clouds with those generated by image-based flow-field prediction in DOAFN [Park2017] and M2NV [Sun2018]. For this experiment, the coarse view after occlusion removal and left-right symmetry enhancement is used. The image completion network is the basic variant, using DCGAN, without bottleneck inter-connections. The results are shown in Table 2(a). The transformation of estimated point clouds provides coarse views which are closer to the target view, and these help to obtain higher-quality completed views.
We analyze the effects of the different components of the proposed pipeline. The results are shown in Table 2(b). The use of the LS-GAN loss shows a relatively large improvement over the traditional DCGAN. The drop in performance when removing the symmetry assumption shows the importance of prior knowledge about the target objects, which is intuitive. The inter-connections from the depth network and the embedded transformation to the completion network allow the model to not rely solely on the intermediate views. This is important for overcoming errors and artifacts in the coarse images (due to inevitable uncertainties in depth prediction) and for generating, in general, higher-quality images. The SSIM loss, first employed by [Olszewski2019], shows an improvement in the SSIM metric, which is intuitive as the training objective is closer to the evaluation metric.
4.2 Comparison to State-of-the-Art
In this section, the proposed method is compared with state-of-the-art methods. The quantitative results are shown in Table 2. The proposed method consistently performs (slightly) better on both evaluation metrics for both types of objects. Qualitative results are shown in Fig. 4, where challenging cases are shown in the last 2 rows. Notice the better ability to retain objects' textures (such as color patterns and texts on cars) of methods that explicitly use input pixel values in generating new views, compared to TBN. The results for cars are consistently higher than those for chairs due to the intricate structures of chairs. However, by having access to object geometry, geometrical assumptions such as symmetry and occlusion can be applied directly to the intermediate views (instead of having to be learned from annotated data, c.f. [Park2017]), which creates better views for near-symmetric targets. High-quality qualitative results and more analyses can be found in the supplementary materials.
Table 2 also shows the evaluation when target viewpoints are at different elevation angles. Methods such as AFN, TVSN, and M2NV encode the transformation as a one-hot vector and are thus limited to operating within a pre-defined set of transformations (18 azimuth angles, same elevation). This is not the case for our method and TBN, which apply direct transformations. We use the same azimuth angles as in the standard test set while randomly sampling new elevation angles for the input images. The results are shown for networks trained with the regular fixed-elevation settings. The new transformations produce different statistics from what the networks have been trained on, resulting in a performance drop for both methods. Nevertheless, the proposed method can still maintain high-quality image synthesis.
4.3 Multi-View Synthesis and Point Cloud Reconstruction
Multi-view inputs The proposed method can be naturally extended to use multi-view inputs as follows: for each input image, depth is predicted independently, and the results are combined into a single point cloud. The resulting coarse target image becomes denser when more images are used, and is passed through the image completion network.
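The multi-view combination step amounts to transforming each per-view cloud into the common target frame and concatenating; a minimal sketch, with our own conventions of 3 x N arrays and (R, t) pose tuples:

```python
import numpy as np

def fuse_point_clouds(clouds, poses):
    """Fuse per-view point clouds into a common target frame (sketch).

    clouds: list of 3 x N_i arrays, one per input view; poses: list of
    (R, t) relative transforms from each input view to the target view.
    Returns a single, denser 3 x (sum N_i) cloud.
    """
    fused = [R @ P + t.reshape(3, 1) for P, (R, t) in zip(clouds, poses)]
    return np.concatenate(fused, axis=1)
```

Each additional view contributes its points to the fused cloud, which is why the coarse target image grows denser with more inputs.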
In this experiment, the model trained for single-view prediction is used and evaluated using multiple (1 to 8) input images. The results in Table 3 show that the quality of the coarse view increases, as expected, when more input images are used and hence the point clouds are denser. Surprisingly, however, the image completion results improve only marginally, indicating that the coarse view contains enough information for the image completion network to synthesize a high-quality target image.
Point cloud reconstruction
In this final experiment, the aim is to reconstruct a full, dense point cloud from a single image, using the models trained for novel view synthesis. To do so, multiple views are generated from a single view of an object, see Fig. 5 (top). Each of these views is fed to the depth estimation network, and the obtained depth estimate is used to generate a partial point cloud. These point clouds are stitched together using the corresponding transformations, resulting in a high-quality dense point cloud, as shown in Fig. 5 (bottom).
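The stitching step can be sketched as follows, assuming (as an illustration, not the paper's exact setup) that the generated views lie on an orbit of azimuth rotations around the object's vertical axis, so each partial cloud is brought back to the reference frame by the inverse rotation:

```python
import numpy as np

def orbit_rotations(n):
    """Rotation matrices for n equal azimuth steps around the y-axis."""
    mats = []
    for k in range(n):
        a = 2 * np.pi * k / n
        mats.append(np.array([[np.cos(a), 0.0, np.sin(a)],
                              [0.0, 1.0, 0.0],
                              [-np.sin(a), 0.0, np.cos(a)]]))
    return mats

def stitch(partial_clouds, rotations):
    """Map each partial cloud back to the reference frame and concatenate."""
    return np.concatenate(
        [R.T @ P for P, R in zip(partial_clouds, rotations)], axis=1)
```

Since each rotation is orthogonal, applying its transpose exactly undoes it, so points observed in different generated views land in a single consistent cloud.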
4.4 Results on real-world imagery
We apply the trained car model to the car images of the real-imagery ALOI dataset [geusebroek2005], consisting of 100 objects captured at 72 viewing angles. We use 4 cars for fine-tuning only the depth network, which requires no ground truth, while the image completion network is left untouched. The quantitative results on the remaining 3 cars are shown in Fig. 6.
5 Conclusion

In this paper, partial point clouds are estimated from a single image by a self-supervised depth prediction network and used to obtain a coarse image in the target view. The final image is produced by an image completion network which uses the coarse image as input. Experimentally, the proposed method outperforms the current state-of-the-art methods on the ShapeNet benchmark for novel view synthesis. Qualitative results show high-quality, dense point clouds obtained from a single image by synthesizing and combining views. Based on these results, we conclude that point clouds are a suitable, geometry-aware representation for true novel view synthesis.
6 Supplementary materials
In the supplementary materials, a more elaborate qualitative comparison is provided between the proposed method and other state-of-the-art methods. For this, the synthesized target views for 36 car images are shown in Figures 7-9, and the synthesized target views for 21 chair images are shown in Figures 10-12.
From the results we observe that, in line with the overall quality metrics (Table 2 of the main paper), our method synthesizes, in general, higher-quality views (better geometrical shape and matching texture) compared to the other methods.
A few observations:
Observe that TVSN, M2NV, and the proposed method all have a similar inter-connection network architecture, in contrast to TBN. The inter-connections allow for explicit use of the input image pixels in constructing the generated views, and thus these models can retain the object textures in the generated views. Examples include rows 4 and 5 of Figure 8 and rows 2, 5, and 10 of Figure 9, where the specific color patterns or texts on the input views are retained in the generated views.
Note also row 4 of Figure 8, which is a failure, yet legitimate, case. The target is posed at an extreme angle to the input, and the unseen back of the truck has a different texture/color. Methods based on the assumption of object symmetry make an "educated" guess of the unseen view, and thus fail when the object texture does not follow this assumption.
The main difference between our method and the other state-of-the-art methods is the explicit use of object geometry in reasoning about occlusion and symmetry and in generating new views. TVSN and M2NV use occlusion and symmetry in creating annotated data and in training the network to predict the coarse view, while the proposed method imposes these assumptions directly on the coarse view. This is beneficial when the target pose is close to the symmetric pose of the input. Examples can be seen in the generated chair images, specifically rows 2-7 of Figure 10 and rows 2-4 of Figure 11. Despite the intricate structure of chairs, these examples stand out in quality compared to the other methods.