Novel View Synthesis from Single Images via Point Cloud Transformation

09/17/2020 ∙ by Hoang-An Le, et al. ∙ 0

In this paper the argument is made that for true novel view synthesis of objects, where the object can be synthesized from any viewpoint, an explicit 3D shape representation isdesired. Our method estimates point clouds to capture the geometry of the object, which can be freely rotated into the desired view and then projected into a new image. This image, however, is sparse by nature and hence this coarse view is used as the input of an image completion network to obtain the dense target view. The point cloud is obtained using the predicted pixel-wise depth map, estimated from a single RGB input image,combined with the camera intrinsics. By using forward warping and backward warpingbetween the input view and the target view, the network can be trained end-to-end without supervision on depth. The benefit of using point clouds as an explicit 3D shape for novel view synthesis is experimentally validated on the 3D ShapeNet benchmark. Source code and data will be available at



There are no comments yet.


page 6

page 10

page 14

page 15

page 16

page 17

page 18

page 19

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Novel view synthesis aims to infer the appearance of an object from unobserved points of view. The synthesis of unseen views of objects could be important for image-based 3D object manipulation [Kholgade2014], robot traversability [Hirose2019], or 3D object reconstruction [Tatarchenko2016]. Generating a coherent view of unseen parts of an object requires a non-trivial understanding of the object’s inherent properties such as (3D) geometry, texture, shading, and illumination.

Different algorithms make use of provided source images in different ways. Model-based approaches use similar-look open stock 3D models [Kholgade2014], or through user interactive construction [Zheng2012, Chen2013, Rematas2017]. Image-based methods [Tatarchenko2016, Zhou2016, Park2017, Sun2018, Olszewski2019]

assume an underlying parametric model of object appearances conditioned on viewpoints and try to learn it using statistical frameworks. Despite their differences, both approaches use 3D information in predicting object new views. The former imposes stronger assumptions on the full 3D structure and shifts the paradigm to obtain the full models, while the latter captures the 3D information in latent space to cope with (self) occlusion.

Figure 1: Overview of the proposed model for training and inference. From a single input image, the pixel-wise depth map is predicted. The depth map is subsequently used to compute a coarse novel view (forward warping), and trained by making use of backward warping (from the target view back to the source view). The model is trained end-to-end.

The principle is that the generation of a new view of an object is composed of (1) relocating pixels in source images that will be visible to the corresponding positions in the target view, (2) removing the pixels that will be occluded, and (3) adding disoccluded pixels that are not seen in the source and will be visible in the target view [Park2017]

. With the advance of convolution neural networks (CNNs) and generative adversarial networks (GANs),

[Zhou2016, Park2017, Sun2018] show that (1) and (2) can be done by learning an appearance flow field that "flows" pixels from a source image to the corresponding positions in the target view, and (3) can be done by a completion network with an adversarial loss.

In this paper, we leverage the explicit use of geometry information in synthesizing novel views. We argue that (1) and (2) can be done in a straightforward manner by obtaining access to the geometry of the objects. The appearance flow [Zhou2016, Park2017, Sun2018] which associates pixels of the source view to their positions in the target view, is the projection of the 3D displacement of objects’ points before and after transformation. Occluded object parts can be identified based on the orientation of the object surface normals and the view directions. The argument can also be extended for multiple input images. In this paper, we show that the geometry of an object provides an explicit and natural basis to the problem of novel view synthesis.

In contrast to geometry-based methods, the proposed approach does not require 3D supervision. The method predicts a depth map in a self-supervised manner by formulating the depth estimation problem in the context of novel view synthesis. The predicted depth is used to partly construct the target views and to assist the completion network.

The main contributions of this paper are: (1) a novel methodology for novel view synthesis using explicit transformations of estimated point clouds; (2) an integrated model combining self-supervised monocular depth estimation and novel view synthesis, which can be trained end-to-end; (3) natural extensions to multi-view inputs and full point cloud reconstruction from a single image; and (4) experimental benchmarking to validate the proposed method, which outperforms the current state-of-the art methods for novel view synthesis.

2 Related Work

2.1 Geometry-based view synthesis

View synthesis via 3D models  Full models (textured meshes or colored point clouds) of objects or scenes are constructed from multiple images taken from various viewpoints  [Debevec1996, Seitz2006, Meshry2019] or are given and aligned interactively by users [Kholgade2014, Rematas2017]

. The use of 3D models allows for extreme pose estimation, re-texturing and flexible (re-)lighting by applying rendering techniques 

[Nguyen2018, Meshry2019]. However, obtaining complete 3D models of objects or scenes is a challenging task in itself. Therefore, these approaches require additional user input to identify objects boundaries [Zheng2012, Chen2013], select and align 3D models with image views [Kholgade2014, Rematas2017], or use simple textured-mapped 3-planar billboard models [Hoiem2005]. In contrast, the proposed method makes use of objects partial point clouds constructed from a given source view and does not require a predefined (explicit) 3D model.

View synthesis via depth  Methods using 3D models assume a coherent structure between the desired objects and the obtained 3D models [Chen2013, Kholgade2014]. Synthesis using depth obtains an intermediate representation from depth information. The intermediate representation captures hidden surfaces from one or multiple viewpoints. [Zitnick2004] proposes to use layered depth images, [Flynn2016] creates 3D plane sweep volumes by projecting images onto target viewpoints at different depths, [Zhou2018stereo] uses multi-plane images at fix-distances to the camera, and [Choi2019]

estimates depth probability volumes to leverage depth uncertainty in occluded regions.

In contrast, the proposed method estimates depth directly from monocular views to partially construct the target views. Self-supervised depth estimation using deep neural networks using photometric re-projection consistency has been researched by several authors 

[Garg2016, Zhou2017, Godard2017, Godard2019, Johnston2019]. In this paper, we train a self-supervised depth prediction network with novel view synthesis in an end-to-end system.

2.2 Image-based view synthesis

Requiring explicit geometrical structures of objects or scenes as a precursor severely limits the applicability of a method. With the advance of neural networks (CNNs), generative adversarial networks [GAN] (GANs) achieve impressive results in image generation, allowing view synthesis without explicit geometrical structures of objects or scenes.

View synthesis via embedded geometry  Zhou et al [Zhou2016] proposes learning a flow field that maps pixels in input images to their corresponding locations in target views to capture latent geometrical information. [Olszewski2019] learns a volumetric representation in a transformable bottleneck layer, which can generate corresponding views for arbitrary transformations. The former explicitly utilizes input (source) image pixels in constructing new views, either fully [Zhou2016], or partly with the rest being filled by a completion network [Park2017, Sun2018]. The latter explicitly applies transformations on the volumetric representation in latent space and generates new views by means of pixel generation networks.

The proposed method takes the best of both worlds. By directly using object geometry the source pixels are mapped to their target positions based on given transformation parameters, hence making the best use of the given information synthesizing new views. Our approach is fundamentally different from [Park2017]: we estimate the object point cloud using self-supervised depth predictions and obtain coarse target views from purely geometrical transformations, while [Park2017]

learns mappings from input images and ground truth occluded regions to generate coarse target views using one-hot encoded transformation vectors.

View synthesis directly from image

  Since the introduction of image-to-image translation 

[Isola2017], there is a paradigm shift towards pure image-based approaches [Tatarchenko2016]. [Zhu2018] synthesizes bird view images from a single frontal view image, while [Regmi2018] generates cross-views of aerial and street-view images. The networks can be trained to predict all the views in an orbit from a single-view object [Kicanaoglu2018, Johnston2019], or generate a view in an iterative manner [Galama2019]. Additional features can be embedded such as view-independent intrinsic properties of objects [Xu2019_ICCV]. In this paper, we employ GANs to generate complete views, which is conditioned on the geometrical features and the relative poses between source and target views. Our approach can be interpreted as a reverse and end-to-end process of [Johnston2019]: we estimate objects’ arbitrary new views via point clouds constructed from self-supervised depth maps, while [Johnston2019] predict objects’ fixed orbit views for 3D reconstruction.

3 Method

3.1 Point-cloud based transformations

The core of the proposed novel view synthesis method is to use point clouds for geometrically aware transformations. Using the pinhole camera model and known intrinsics , the point cloud can be reconstructed when the pixel-wise depth map (D) is available. The camera intrinsics can be obtained by camera calibration, yet for the synthetic data used in our experiments, is given. A pixel on the source image plane (using homogeneous coordinates), corresponds to a point in the source camera space:


Rigid transformations can be obtained by matrix multiplications. The relative transformation to the target viewpoint from the source camera, is given by


where denotes the desired rotation matrix and the translation vector. Points in the target camera view are given by . This can also be regarded as an image-based flow field parameterized by (c.f [Zhou2016, Park2017, Sun2018]). The flow field returns the homogeneous coordinates of pixels in the target image for each pixel in the source image:


By observing that , the Cartesian pixel coordinates in the target view can be extracted. The advantage of the flow field interpretation is that it provides a direct mapping between the image planes of the source view and the target view.

Forward warping  The flow field is used to generate the target view from the source:


The resulted image is sparse due to the discrete pixel coordinates and (dis)occluded regions, see Fig. 2 (top-right). It is used as input to the image completion network (Sec. 3.2).

Figure 2: Illustration of the forward and backward warping operation of point clouds. The forward warping is used to generate a coarse target view, while the backward warping is used to reconstruct the source view from a target view for self-supervised depth estimation.

Backward warping  The flow field is used to generate the source view from the target:


The process assigns a value to every pixel in resulting in a dense image, as illustrated in Fig. 2 (bottom-right). The generated source view may contain artifacts due to (dis)occlusion in the target view. To sample from , a differentiable bi-linear sampling layer [Jaderberg2015] is used. The generated source view is used for self-supervised monocular depth prediction (Sec. 3.3).

3.2 Novel view synthesis

The point-cloud-based forward warping relocates the visible pixels of the object in the source view to their corresponding positions in the target view. For novel view synthesis, however, two more steps are required: (1) obtaining the target coarse view by discarding occluded pixels, and (2) filling in the pixels that are not seen in the source view.

Coarse view construction  The goal is to remove the pixels which are seen in the source view yet should not be visible in the target view, due to occlusion. To this end, pixels that have surface normals (after transformation) pointing away from the viewing direction are removed, similarly to [Park2017]. Surface normals are obtained from normalized depth gradients.

An illustration of the coarse view construction is shown in Fig. 3 for different target views. The first row depicts the target views, the second row indicates the visible parts from the input image (third column). The third and fourth row show the coarse view with and without occlusion removal (or backface culling). Finally, the fifth row shows an enhanced version of the coarse view, where the object is assumed to be left-right symmetric [Park2017]. The proposed method directly identifies and removes occlusion pixels from the input view using estimated depth, which contrasts to [Park2017], where ground truth visibility mask are required for each target view to train a visibility prediction network.

View completion  The obtained coarse view is already in the target viewpoint, but it remains sparse. To synthesize the final dense image, an image completion network is used.

The completion network uses the hour-glass architecture [Newell2016]. Following [Park2017], we concatenate the depth bottleneck features and embedded transformation to the completion network bottleneck. By conditioning the completion network on the input features and the desired transformation , the network can fix artifacts and errors due to estimated depth and cope better with extreme pose transformations, i.ewhen coarse view image is near empty (e.gcolumns 9-11 in Fig. 3).

The image completion network is trained in a GAN-manner by using a generator , a discriminator , an input image and a target image . The combination of losses that are used is given by

LS-GAN Discriminator loss (6)
Generator loss (7)
Perceptual loss (8)

where the perceptual loss uses (

) to denote features extracted from image

and from the discriminator network and pre-trained VGG network respectively, c.f [Perceptual]. SSIM denotes the structural similarity index measure, see Sec. 4. The total loss is given by:


where denotes the weighting of the losses (, and , c.f [Park2017]).

Figure 3: Image coarse views for different target viewpoints. The input image is depicted in the third column (red box). From top to bottom: (1) target views, (2) source region visible in each target viewpoint, coarse view (3) naive (4) with occlusion removal, and (5) with occlusion removal and symmetry.

3.3 Self-supervised Monocular Depth estimation

The discussion so far has assumed that pixel-wise depth maps are available. In this section, the method used to estimate depth from a single image is detailed. In order to make the minimum assumption about the training data, self-supervised methods are considered, which do not require ground-truth depth [Garg2016, Zhou2017, Godard2017, Godard2019, Johnston2019].

For the depth prediction an encoder-decoder network with bottleneck architecture is used, similar to [Godard2019]. The network is optimised using a set of (reconstruction) losses between the source image and its synthesized version , using the backward warping, Eq.  (5), from a second (target) image and the predicted depth map. The underlying rationale is that a more realistic depth map will have a lower reconstruction loss.

The losses are given by:

Photometric loss (10)
Smoothness loss [Godard2017] (11)
Total loss (12)

where , and is the mean-normalized inverse depth, , and is an indicator function which equals 1 iff the photometric loss , see [Godard2019] for more details. The smoothness loss encourages nearby pixels to have similar depths, while the artifacts due to (dis)occlusion are excluded by the per-pixel minimum-projection mechanism.

4 Experiments

In this section, the proposed method is analysed on the 3D ShapeNet benchmark including an ablation to study the effects of the different components and a state-of-the-art comparison.

Dataset  We use the object-centered car and chair images rendered from the 3D ShapeNet models [shapenet] using the same render engine111The specific render engine and setup is to guarantee fair comparison with reported methods as none of the authors-provided weights perform at the similar level on images rendered with different rendering setups. and set up as in [Zhou2016, Park2017, Sun2018, Olszewski2019]. Specifically, there are 7497 car and 698 chair models with high-quality textures, split by 80%/20% for training and test. The images are rendered at 18 azimuth angles (in , -separation) and 3 elevation angles (). Input and output images are of size .

Metrics  We evaluate the generated images using the standard pixel-wise error (normalized to the ranged , lower is better) and the structural similarity index measure (SSIM) [ssim] (value range of , higher is better). indicates the proximity of pixel values between a completed image and the target, while SSIM measures the perceived quality and structural similarity between the images.

Baseline  We compare the results of our method with the following state-of-the-art methods: AFN [Zhou2016], TVSN [Park2017], M2NV [Sun2018], and TBN [Olszewski2019].

4.1 Initial experiments

Comparison to image-based completion  In this section, we compare the intermediate views generated by the forward warping using estimated point clouds and those by image-based flow field prediction by DOAFN [Park2017] and M2NV [Sun2018]. For this experiment, the coarse view after occlusion removal and left-right symmetric enhancements are used. The image completion network is the basis variant, using DCGAN, without bottleneck inter-connections. The results are shown in Table 2(a). The transformation of estimated point clouds provides coarse views which are closer to the target view, and these help to obtain a higher quality of completed views.

Coarse view Completed view
DOAFN [Park2017] .220 .876 .121 .910
M2NV [Sun2018] .226 .879 .154 .906
Ours .203 .882 .118 .924
(a) Coarse vs Completed
.118 .924
.101 .939
.103 .939
.107 .933
.097 .939
.097 .942
(b) Ablation study
Table 1: Quality of coarse and completed view (a) and ablation study (b). The proposed transformation within estimated point clouds generates better coarse images over image-based predicted flow fields. Subsequently, the completed view quality is improved. The ablation study shows the best results are obtained by using LSGAN (LS), perceptual loss (PL), symmetry (Sym), bottleneck inter-connection (IC), and SSIM loss (SL).

Ablation Study

We analyze the effects of the different component of the proposed pipeline. The results are shown in Table 2(b). The use of the LSGAN loss shows a relative large improvement over the traditional DCGAN. The drop of performance by removing symmetry assumption shows the importance of prior knowledge on target objects, which is intuitive. The inter-connection from the depth network and the embedded transformation to the completion network allow the model to not rely solely on intermediate views. This is important for overcoming errors and artifacts which occur in the coarse images (due to inevitable uncertainties in depth prediction) and generate in general higher quality images. The SSIM loss, first employed by [Olszewski2019]

, shows improvement in SSIM metric, which is intuitive as training objectives are closer to evaluation metrics.

4.2 Comparison to State-of-the-Art

In this section, the proposed method is compared with state-of-the-art methods. The quantitative results are shown in Table 2. The proposed method performs consistently performs (slightly) better on both evaluation metrics for both types of objects. Quantitative results are shown in Fig. 4 where challenging cases are shown in the last 2 rows. Notice the better ability in retaining objects’ textures (such as color patterns and texts on cars) of methods that explicitly use input pixel values in generating new views to that of TBN. The results of cars are constantly higher than that of chairs due to the intricate structures of chairs. However, by having access to object geometry, geometrical assumptions such as symmetry and occlusion can be applied directly to intermediate views (instead of having to learn from annotated data c.f [Park2017]), which creates better views for near-to symmetry targets. High-quality qualitative results and more analyses can be found in the supplementary materials.

Table 2 also shows the evaluation when target viewpoints are from different elevation angles. Methods such AFN, TVSN, and M2NV encode transformation as one-hot vectors and thus, are limited to operate within a pre-defined set of transformations (18 azimuth angles, same elevation). This is not the case for our method and TBN which apply direct transformation. We use the same azimuth angles as in the standard test set while randomly sample new elevation angles for input images in . The results are shown with networks trained with the regular fixed-elevation settings. The new transformations produces different statistics from what the networks have been trained, resulting in a performance drop for both methods. Nevertheless, the proposed method can still maintain high quality image synthesis.

Methods cars chairs
Same elevation
AFN [Zhou2016] .148 .877 .229 .871
TVSN [Park2017] .119 .913 .202 .889
M2VN [Sun2018] .098 .923 .181 .895
TBN [Olszewski2019] .091 .927 .178 .895
ours .096 .945 .175 .914
TBN .199 .910 .215 .902
ours .122 .934 .207 .905
Table 2: Quantitative comparison with state-of-the-art methods on novel view synthesis: our method consistently performs (slightly) better than the other methods for both categories where target views have the same or different elevation angles with input views.












Figure 4: Qualitative comparisons of synthesized cars (left) and chairs (right), given a single input image (first column) and a given target view (last column). The last two rows show a more challenging examples. The proposed method captures better the geometry of the object and the fine (texture) details. More examples are provided in the supplementary materials.

4.3 Multi-View Synthesis and Point Cloud Reconstruction

Multi-view inputs  The proposed method can be naturally extended to use multi-view inputs as follows: for each image depth is predicted independently and combined into a single point cloud. The resulting coarse target image will be denser when more images are used, and is passed through the image completion network.

In this experiment, the model trained for single-view prediction is used and evaluated using multiple (1 to 8) input images. The results in Table 3 show that the quality of the coarse view increases, as expected, when more input images are used and hence the point clouds are denser. Surprisingly, however, the image completion network only marginally improves, indicating that the coarse view contains enough information for the image completion network to synthesis a high quality target image.

No. views Coarse Final
L1 () SSIM () L1 () SSIM ()
1 .203 .882 .090 .945
2 .188 .888 .089 .945
4 .152 .906 .085 .946
8 .111 .907 .084 .947
Table 3: Performance by extending single-view-trained networks for multi-view inputs.

Point cloud reconstruction

In this final experiment, the aim is to reconstruct a full dense point cloud from a single image, using the models trained for novel view synthesis. In order to do so, -views are generated from a single view of an object, see Fig. 5 (top). Each of these views are fed to the depth estimation network and the obtain estimated depth is used to generate a partial point cloud. These point clouds are stitched together, using corresponding transformations, resulting in a high quality dense point cloud, as shown in Fig. 5 (bottom).

Figure 5: Top -views generation from single input images (first column), bottom point cloud reconstruction using estimated depths of each generated views. Depth estimation trained on real images can perform well on synthesized ones.

4.4 Results on real-world imagery

We apply the trained car model to the car images of the real-imagery ALOI dataset [geusebroek2005], consisting of 100 objects, captured at 72 viewing angles. We use 4 cars for fine-tuning only the depth network, which requires no ground truths, while the image-completion network is left untouched. The quantivative resuls on the remaining 3 cars are shown in Fig. 6.









Figure 6: Quantitative results on real-imagery ALOI [geusebroek2005] dataset. The inputs and targets are shown with provided object masks, while the synthesized images are with predicted masks. The completion network does not need to be finetuned, yet provide competent results.

5 Conclusion

In this paper partial point clouds are estimated from a single image, by a self-supervised depth prediction network and used to obtain a coarse image in the target view. The final image is produced by an image completion network which uses the coarse image as input. Experimentally the proposed method outperforms any of the current SOTA methods on the ShapeNet Benchmark on novel view synthesis. Qualitative results show high quality and dense point clouds, obtained from a single image, by synthesizing and combining views. Based on these results, we conclude that point clouds are a suitable, geometry aware representation for true novel view synthesis.


6 Supplementary materials

In the supplementary materials a more elaborate qualitative comparison is provided between the proposed method and other state-of-the-art methods. For this the synthesized target views for 36 car images are shown in Figure 79 and the syntesized target views for 21 chair images are shown in Figure 1012.

From the results we observe that in line with the overall quality metrics (Table 2 of the main paper), our method synthesize in general higher quality (better geometrical shape and matching texture) compared to the other methods.

A few observations:

  • Observe that TVSN, M2NV and the proposed method all have a similar inter-connection network architecture, in contrast to TBN. The inter-connection allow for explicit use of the input image pixels in constructing the generated views, and thus these models can retain the object textures in the generated views. Examples include row 4, 5 of Figure 8 and row 2, 5, 10 of Figure 9, where the specific color patterns or texts on the input views are retained in the generated views.

  • Note also row 4 of Figure 8, which is indeed a failure, yet legitimate, case. The target is posed at an extreme angle with the input, but the unseen back of the truck has a different texture/color. The methods based on the assumption of object symmetry to make an "educated" guess of the unseen view, thus fail when the object texture does not follow such assumption.

  • The main difference of our method to that of other state-of-the-arts is the explict use of object geometry in reasoning of occlusion, symmetry and in generating new views. TVSN and M2NV use occlusion and symmetry in creating annotated data and in training the network to predict the coarse view, while the proposed method impose such assumption directly on the coarse view. This has the benefit when the target pose is close to the symmetric pose of the input. Examples can be seen in the generated chair images, specifically row 2-7 of Figure 10 and row 2-4 of Figure 11. Despite the intricate structure of chairs, these examples standout in quality compared to other methods.






Figure 7: Qualitative examples of cars generated by our point-cloud-based method and other related work. Our method retains better the objects’ geometrical structure and detailed textures, e.grow 3 and 5.






Figure 8: Qualitative examples of cars generated by our point-cloud-based method and other related work. Our method retains better the objects’ geometrical structure and detailed textures, e.grow 5 and 11.






Figure 9: Qualitative examples of cars generated by our point-cloud-based method and other related work. Our method retains better the objects’ geometrical structure and detailed textures, e.grow 2 and 7.






Figure 10: Qualitative examples of chairs generated by our point-cloud-based method and other related work. Our method retains better the objects’ geometrical structure and detailed textures, e.grow 4 and 7.






Figure 11: Qualitative examples of chairs generated by our point-cloud-based method and other related work. Our method retains better the objects’ geometrical structure and detailed textures, e.grow 3 and 4.






Figure 12: Qualitative examples of chairs generated by our point-cloud-based method and other related work. Our method retain better the objects’ geometrical structure and detailed textures, e.grow 6 and 7.