Differentiable Surface Splatting for Point-based Geometry Processing

06/10/2019 · by Wang Yifan, et al.

We propose Differentiable Surface Splatting (DSS), a high-fidelity differentiable renderer for point clouds. Gradients for point locations and normals are carefully designed to handle discontinuities of the rendering function. Regularization terms are introduced to ensure uniform distribution of the points on the underlying surface. We demonstrate applications of DSS to inverse rendering for geometry synthesis and denoising, where large scale topological changes, as well as small scale detail modifications, are accurately and robustly handled without requiring explicit connectivity, outperforming state-of-the-art techniques. The data and code are at https://github.com/yifita/DSS.




1. Introduction

Differentiable processing of scene-level information in the image formation process is emerging as a fundamental component for both 3D scene and 2D image and video modeling. The challenge of developing a differentiable renderer lies at the intersection of computer graphics, vision, and machine learning, and has recently attracted a lot of attention from all communities due to its potential to revolutionize digital visual data processing and its high relevance for a wide range of applications, especially when combined with contemporary neural network architectures [Loper and Black, 2014; Kato et al., 2018; Liu et al., 2018; Yao et al., 2018; Petersen et al., 2019].

A differentiable renderer (DR) takes scene-level information θ such as 3D scene geometry, lighting, material and camera position as input, and outputs a synthesized image I = R(θ). Any changes in the image I can thus be propagated to the parameters θ, allowing for image-based manipulation of the scene. Assuming a differentiable loss function L(I) = L(R(θ)) on a rendered image I, we can update the parameters θ with the gradient ∂L/∂θ = (∂L/∂I)(∂I/∂θ). This view provides a generic and powerful shape-from-rendering framework where we can exploit vast image datasets, deep learning architectures and computational frameworks, as well as pre-trained models. The challenge, however, lies in computing the gradient ∂I/∂θ in the renderer.
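This shape-from-rendering loop can be sketched with a toy one-parameter differentiable renderer; the sine "renderer", the quadratic loss and the learning rate below are illustrative stand-ins, not DSS itself:

```python
import numpy as np

# Toy shape-from-rendering: recover a scene parameter theta from a
# reference image by gradient descent through a differentiable renderer.
# The one-pixel "renderer" and quadratic loss are illustrative stand-ins.

def render(theta):
    # A trivially differentiable rendering function R(theta).
    return np.sin(theta)

def d_render(theta):
    # Analytic dI/dtheta of the toy renderer.
    return np.cos(theta)

def optimize(theta, target, lr=0.5, steps=200):
    for _ in range(steps):
        image = render(theta)
        dL_dI = 2.0 * (image - target)          # dL/dI for L = (I - I*)^2
        # Chain rule: dL/dtheta = (dL/dI) * (dI/dtheta)
        theta -= lr * dL_dI * d_render(theta)
    return theta

theta = optimize(0.1, target=np.sin(0.8))
```

The same chain-rule structure carries over to DSS, where θ contains thousands of point positions and normals instead of one scalar.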

Existing DR methods can be classified into three categories based on their geometric representation: voxel-based [Nguyen-Phuoc et al., 2018; Tulsiani et al., 2017; Liu et al., 2017], mesh-based [Loper and Black, 2014; Kato et al., 2018; Liu et al., 2018], and point-based [Insafutdinov and Dosovitskiy, 2018; Lin et al., 2018; Roveri et al., 2018a; Rajeswar et al., 2018]. Voxel-based methods work on volumetric data and thus come with high memory requirements even for relatively coarse geometries. Mesh-based DRs solve this problem by exploiting the sparseness of the underlying geometry in 3D space. However, they are bound by the mesh structure, with limited room for global and topological changes, as connectivity is not differentiable. Equally importantly, acquired 3D data typically comes in an unstructured representation that needs to be converted into mesh form, which is itself a challenging and error-prone operation. Point-based DRs circumvent these problems by directly operating on point samples of the geometry, leading to flexible and efficient processing. However, existing point-based DRs use simple rasterization techniques such as forward projection or depth maps, and thus come with well-known deficiencies in point cloud processing when capturing fine geometric details, dealing with gaps and occlusions between nearby points, and forming a continuous surface.

In this paper, we introduce Differentiable Surface Splatting (DSS), the first high-fidelity point-based differentiable renderer. We utilize ideas from surface splatting [Zwicker et al., 2001], where each point is represented as a disk or ellipse in object space, which is projected onto screen space to form a splat. The splats are then interpolated to encourage hole-free and antialiased renderings. For inverse rendering, we carefully design gradients with respect to point locations and normals by taking each forward operation apart and utilizing domain knowledge. In particular, we introduce regularization terms for the gradients to carefully drive the algorithm towards the most plausible point configuration: there are infinitely many ways splats can form a given image due to the high degrees of freedom of point locations and normals. Our inverse pass ensures that points stay on local geometric structures with uniform distribution.

We apply DSS to render multi-view color images as well as auxiliary maps from a given scene. We process the rendered images with state-of-the-art techniques and show that this leads to high-quality geometries when propagated utilizing DSS. Experiments show that DSS yields significantly better results compared to previous DR methods, especially for substantial topological changes and geometric detail preservation. We focus on the particularly important application of point cloud denoising. The implementation of DSS, as well as our experiments, will be available upon publication.

2. Related work

In this section we provide some background and review the state of the art in differentiable rendering and point based processing.

| method | objective | position update | depth update | normal update | occlusion | silhouette change | topology change |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenDR | mesh | yes | no | via position change | no | yes | no |
| NMR | mesh | yes | no | via position change | no | yes | no |
| Paparazzi | mesh | limited | limited | via position change | no | no | no |
| Soft Rasterizer | mesh | yes | yes | via position change | yes | yes | no |
| Pix2Vex | mesh | yes | yes | via position change | yes | yes | no |
| Ours | points | yes | yes | yes | yes | yes | yes |

Table 1. Comparison of generic differentiable renderers. By design, OpenDR [Loper and Black, 2014] and NMR [Kato et al., 2018] do not propagate gradients to depth; Paparazzi [Liu et al., 2018] is limited in updating vertex positions in directions orthogonal to their face normals and thus cannot alter the silhouette of shapes; Soft Rasterizer [Liu et al., 2019] and Pix2Vex [Petersen et al., 2019] can pass gradients to occluded vertices through blurred edges and transparent faces. None of the mesh renderers considers the normal field directly, and none can modify mesh topology. Our method uses a point cloud representation, updates point positions and normals jointly, considers occluded points and visibility changes, and enables large deformations including topology changes.

2.1. Differentiable rendering

An ideal differentiable renderer (DR) should: (i) render images as realistically as possible, and (ii) compute reliable derivatives w.r.t. all rendering parameters. However, depending on the application, a tradeoff must be made between the complexity of the rendering function, the number of targeted parameters, and the quality of the gradients. We first discuss general DR frameworks, followed by DRs for specific purposes.

Loper and Black [2014] develop a differentiable renderer framework called OpenDR that approximates a primary renderer and computes the gradients via automatic differentiation. Neural mesh renderer (NMR) [Kato et al., 2018] approximates the backward gradient for the rasterization operation using a handcrafted function for visibility changes. Liu et al. [2018] propose Paparazzi, an analytic DR for mesh geometry processing using image filters. In concurrent work, Petersen et al. [2019] present Pix2Vex, a differentiable renderer via soft blending schemes of nearby triangles, and Liu et al. [2019] introduce Soft Rasterizer, which renders and aggregates the probabilistic maps of mesh triangles, allowing flowing gradients from the rendered pixels to the occluded and far-range vertices. All these generic DR frameworks rely on mesh representation of the scene geometry. We summarize the properties of these renderers in Table 1 and discuss them in greater detail in Sec. 3.2.

Numerous recent works employ DRs for learning-based 3D vision tasks, such as single-view image reconstruction [Vogels et al., 2018; Yan et al., 2016; Pontes et al., 2017; Zhu et al., 2017], face reconstruction [Richardson et al., 2017], shape completion [Hu et al., 2019], and image synthesis [Sitzmann et al., 2018]. To describe a few, Pix2Scene [Rajeswar et al., 2018] uses a point-based DR to learn implicit 3D representations from images. However, Pix2Scene renders one surfel for each pixel and does not use screen space blending. Nguyen-Phuoc et al. [2018] and Insafutdinov and Dosovitskiy [2018] propose neural DRs using a volumetric shape representation, but the resolution is limited in practice. Li et al. [2018] and Azinović et al. [2019] introduce differentiable ray tracers to differentiate physics-based rendering effects, handling e.g. camera position, lighting and texture.

A number of works render depth maps of point sets [Lin et al., 2018; Insafutdinov and Dosovitskiy, 2018; Roveri et al., 2018b] for point cloud classification or generation. These renderers do not define proper gradients for updating point positions or normals; thus they are commonly applied as an add-on layer behind a point processing network to provide 2D supervision. Typically, their gradients are defined either only for depth values [Lin et al., 2018], or within a small local neighborhood around each point. Such gradients are not sufficient to alter the shape of a point cloud, as we demonstrate with a pseudo point renderer in Fig. 10.

Differentiable rendering also relates to shape-from-shading techniques [Langguth et al., 2016; Shi et al., 2017; Maier et al., 2017; Sengupta et al., 2018] that extract shading and albedo information for geometry processing and surface reconstruction. However, the framework proposed in this paper can be used seamlessly with contemporary deep neural networks, opening up a variety of new applications.

2.2. Point-based geometry processing and rendering

With the proliferation of 3D scanners and depth cameras, the capture and processing of 3D point clouds is becoming commonplace. The noise, outliers, incompleteness and misalignment persisting in the raw data pose significant challenges for point cloud filtering, editing, and surface reconstruction [Berger et al., 2017].

Early optimization-based point set processing methods rely on shape priors. Alexa and colleagues [2003] introduce the moving least squares (MLS) surface model, assuming a smooth underlying surface. Aiming to preserve sharp edges, Öztireli et al. [2009] propose the robust implicit moving least squares (RIMLS) surface model. Huang et al. [2013] employ an anisotropic weighted locally optimal projection (WLOP) operator [Lipman et al., 2007; Huang et al., 2009] and a progressive edge-aware resampling (EAR) procedure to consolidate noisy input. Lu et al. [2018] formulate WLOP with a Gaussian mixture model and use a point-to-plane distance for point set processing (GPF). These methods depend on the fitting of local geometry, e.g., normal estimation, and struggle with reconstructing multi-scale structures from noisy input.

Advanced learning-based methods for point set processing are currently emerging, encouraged by the success of deep learning. Based on PointNet [Qi et al., 2017a], PCPNET [Guerrero et al., 2018] and PointCleanNet [Rakotosaona et al., 2019] estimate local shape properties from noisy and outlier-ridden point sets; EC-Net [Yu et al., 2018] learns point cloud consolidation and restoration of sharp features by minimizing a point-to-edge distance, but it requires edge annotation for the training data. Hermosilla et al. [2019] propose an unsupervised point cloud cleaning method based on Monte Carlo convolution [Hermosilla et al., 2018]. Roveri et al. [2018a] present a projection based differentiable point renderer to convert unordered 3D points to 2D height maps, enabling the use of convolutional layers for height map denoising before back-projecting the smoothed pixels to the 3D point cloud. In contrast to the commonly used Chamfer or EMD loss [Fan et al., 2017], our DSS framework, when used as a loss function, is compatible with convolutional layers and is sensitive to the exact point distribution pattern.

Surface splatting is fundamental to our method. Splatting has been developed for simple and efficient point set rendering and processing in the early seminal point based works [Pfister et al., 2000; Zwicker et al., 2001, 2002; Zwicker et al., 2004]. Recently, point based techniques have gained much attention for their superior potential in geometric learning. To the best of our knowledge, we are the first to implement high-fidelity differentiable surface splatting.

3. Method

In essence, a differentiable renderer is designed to propagate image-level changes to scene-level parameters θ. This information can be used to optimize θ so that the rendered image I = R(θ) matches a reference image I*. Typically, θ includes the coordinates, normals and colors of the points, camera position and orientation, as well as lighting. Formally, this can be formulated as an optimization problem

    θ* = argmin_θ L(R(θ), I*),    (1)

where L is the image loss, measuring the distance between the rendered and reference images.

Methods to solve the optimization problem (1) are commonly based on gradient descent, which requires R to be differentiable with respect to θ. However, the gradients w.r.t. point coordinates p and normals n, ∂I/∂p and ∂I/∂n, are not defined everywhere, since R is a discontinuous function due to occlusion events and edges.

The key to our method is two-fold. First, we define the gradients ∂I/∂p and ∂I/∂n using the principle of finite differences. Second, to address the optimization difficulty arising from the significant number of degrees of freedom in the unstructured point set, we introduce regularization terms that help obtain clean and smooth surface points.

In this section, we first review screen space EWA (elliptical weighted average) [Zwicker et al., 2001; Heckbert, 1989], which we adopt to efficiently render high-quality, realistic images from point clouds. Then we propose an occlusion-aware gradient definition for the rasterization step which, unlike previously proposed differentiable mesh renderers, propagates gradients to depth and allows large deformations. Lastly, we introduce two novel regularization terms for generating clean surface points.

3.1. Forward pass

Figure 2. Illustration of forward splatting using EWA [Zwicker et al., 2001].

Our forward pass closely follows the screen space elliptical weighted average (EWA) filtering described in [Zwicker et al., 2001]. In the following, we briefly review the derivation of EWA filters.

In a nutshell, the idea of screen space EWA is to apply an isotropic Gaussian filter to the attribute of a point in the tangent plane (defined by the normal at that point). The projection onto the image plane defines elliptical Gaussians which, after truncation to bounded support, form a disk, or splat, as shown in Fig. 2. For a point p_k, we write the filter weight of the isotropic Gaussian at a position p on the tangent plane as

    G_V(p − p_k) = 1/(2π √|V|) exp(−½ (p − p_k)ᵀ V⁻¹ (p − p_k)),  with V = σ² Id,    (2)

where σ is the standard deviation and Id is the identity matrix.

Now we consider the projected Gaussian in screen space. Points p_k and p are projected to x_k and x, respectively. We write the Jacobian of this projection from the tangent plane to the image plane as J_k; we refer the reader to the original surface splatting paper [Zwicker et al., 2001] for the derivation of J_k. Then, at pixel x, the screen space elliptical Gaussian weight is

    r_k(x) = G_{J_k V J_kᵀ}(x − x_k).    (3)

Note that r_k is determined by the point position p_k and the normal n_k, because J_k is determined by p_k and n_k.

Next, a low-pass Gaussian filter with variance Id is convolved with Eq. (3) in screen space. Thus the final elliptical Gaussian is

    ρ_k(x) = G_{J_k V J_kᵀ + Id}(x − x_k).    (4)
In the final step, two sources of discontinuity are introduced into the fully differentiable ρ_k. First, for computational reasons, we truncate the elliptical Gaussians to a bounded support in the image plane, setting ρ_k(x) = 0 for all x outside a cutoff radius C. Second, we set the Gaussian weights of occluded points to zero. Specifically, at each pixel position we keep a list of at most K closest points (we choose K = 5), compute their depth difference to the front-most point, and set the Gaussian weights to zero for points that lie behind the front-most point by more than a threshold T, chosen relative to the bounding box diagonal length.

The resulting truncated Gaussian weight, denoted ρ̄_k, can be formally defined as

    ρ̄_k(x) = ρ_k(x), if p_k is visible at x and x lies within the cutoff radius C of x_k; 0, otherwise.    (5)

The final pixel value I_x at position x is simply the normalized sum of all filtered point attributes w_k, i.e.,

    I_x = Σ_k w_k ρ̄_k(x) / Σ_k ρ̄_k(x).    (6)

In practice, this summation can be greatly optimized by computing the bounding box of each ellipse and only considering points whose elliptical support covers the pixel x.
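The normalized splatting sum can be sketched in 2D, with isotropic per-splat Gaussians standing in for the full EWA covariances J V Jᵀ + Id; the splat centers, colors, image size and cutoff below are illustrative values:

```python
import numpy as np

# Minimal sketch of the forward splatting sum: truncated Gaussian weights
# per splat, then a normalized blend of point attributes at each pixel.
# Isotropic Gaussians replace the projected EWA covariances for brevity.

def splat(centers, colors, H=32, W=32, sigma=1.5, cutoff=3.0):
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys], axis=-1).astype(float)      # (H, W, 2) pixel coords
    num = np.zeros((H, W, colors.shape[1]))
    den = np.zeros((H, W))
    for c, w in zip(centers, colors):
        d2 = ((pix - c) ** 2).sum(-1)
        rho = np.exp(-0.5 * d2 / sigma**2)
        rho[d2 > cutoff**2 * sigma**2] = 0.0             # truncated support
        num += rho[..., None] * w                        # sum of rho_k * w_k
        den += rho                                       # sum of rho_k
    mask = den > 0
    img = np.zeros_like(num)
    img[mask] = num[mask] / den[mask][..., None]         # normalized blend
    return img

# Two overlapping splats, one red and one blue:
img = splat(np.array([[10.0, 10.0], [12.0, 10.0]]),
            np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]))
```

Pixels equidistant from both centers blend the two colors equally, while pixels outside every splat's support stay empty.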

The point value w_k can be any point attribute, e.g., albedo color, shading, depth value, normal vector, etc. In most of our experiments, we use diffuse shading under three orthogonally positioned RGB-colored sun lights. This way, w_k carries strong information about the point normals while remaining independent of the point position (unlike with point lights), which greatly simplifies the factorization for gradient computation, as explained in Sec. 3.2.

Figure 3. Examples of images rendered using DSS. From left to right, we render the normals, inverse depth values and diffuse shading with three RGB-colored sun light sources.

Fig. 3 shows some examples of rendered images. Unlike many pseudo renderers, which achieve differentiability by blurring edges and rendering transparent surfaces, our rendered images faithfully depict the actual geometry in the scene.

3.2. Backward pass

In the backward pass, we define an artificial gradient for the discontinuous rasterization function. We first reduce the discontinuity to a step function that depends solely on the point position p_k, and then define the gradient w.r.t. p_k.

The discontinuity is encapsulated in the truncated Gaussian weights ρ̄_k described in Eq. (5). In order to fully utilize the automatic differentiation available in most optimization libraries, we factorize the discontinuous ρ̄_k into the fully differentiable term ρ_k and a discontinuous visibility term h_k, i.e. ρ̄_k = ρ_k h_k, where h_k is defined as

    h_k = 1, if p_k is visible at x and x lies within the cutoff radius C; 0, otherwise.    (7)

Since, compared to ρ_k, h_k only impacts the visibility of a small set of pixels around the ellipse, we further simplify the expression so that h_k is solely determined by p_k, i.e., h_k = h_k(p_k). Therefore, if we write the pixel value I_x as a function of ρ_k, h_k and w_k, then by the chain rule we have

    ∂I_x/∂p_k = (∂I_x/∂ρ_k)(∂ρ_k/∂p_k) + (∂I_x/∂h_k)(∂h_k/∂p_k),    (8)

where ∂h_k/∂p_k is undefined at the edges of ellipses due to occlusion.

Figure 4. A schematic illustration of the artificial gradient in a 1D case (rows: before movement, after movement, pixel color and gradient). We consider the pixel intensity I_x at pixel x in relation to the position of a splat p_k (solid line in the third row). We approximate this discontinuous function with a linear function (dotted line).
Figure 5. Illustration of the artificial gradient ∂I_x/∂p_k for a point p_k from the loss at pixel x. For clarity, we focus on grayscale images. Fig. 5(a) depicts −∂L/∂I_x, indicating the color change required to decrease the image loss; in this case, the pixels inside the shape have to become darker in order to represent a star. Fig. 5(b) and Fig. 5(c) depict the gradient direction when p_k is not visible at x, while Fig. 5(d) depicts the case when p_k is visible.

The construction of the gradient w.r.t. p_k, despite the discontinuity of h_k, comprises two key components. First, instead of considering ∂h_k/∂p_k in isolation, we focus on the joint term ∂I_x/∂p_k, since the additional color information conveyed in I_x enables us to define gradients only in directions which decrease the image loss. Secondly, we replace the discontinuous function mapping the position of p_k to the pixel color with a continuous linear function, and define the gradient as ∂I_x/∂p_k = ΔI_x/Δp_k, where ΔI_x and Δp_k denote the change of pixel value and point position, respectively. A schematic illustration for a 1D scenario is depicted in Fig. 4.

Intuitively, the joint term ∂I_x/∂p_k expresses the change of pixel values when varying p_k, assuming the shape and colors of the ellipse are fixed, which is a justified assumption for sun-light diffuse shading. Whenever the change of pixel value incurred by the movement of p_k can decrease the image loss, i.e., (∂L/∂I_x) ΔI_x < 0, an artificial gradient is created to push p_k in the corresponding direction.

A concrete example for a grayscale image is illustrated in Fig. 5. We are interested in pixel x and the splat p_k. The negative gradient of the image loss w.r.t. the pixel value, −∂L/∂I_x, shown in Fig. 5(a), indicates the desired change in pixel value in order to decrease the image loss; in this example, I_x should become darker. In Fig. 5(b), p_k is not visible at x; x is rendered by another ellipse, or multiple lighter ellipses, in front of p_k. Since moving the darker splat p_k to cover x darkens I_x, we find the intersection q of the viewing ray with the front-most ellipse rendered at x, and then define Δp_k in the direction q − p_k. In case no ellipses are rendered at x, or the currently rendered ellipse is behind p_k, as shown in Fig. 5(c), q is the intersection of the viewing ray with the plane of the ellipse of p_k, which is orthogonal to the principal axis. Finally, in Fig. 5(d), p_k refers to the brighter ellipse; moving it towards x and away from x will both reveal the darker splat behind and thus darken I_x, creating two possible gradients in opposite directions. Thus Δp_k is obtained by averaging these two gradients. Notice that in the first case, Δp_k can have a non-zero value in the depth dimension, allowing for a depth update, while in the other cases it is equivalent to defining the gradient only on the image plane.

Given the translation vector Δp_k and assuming the pixel values have C channels, the artificial gradient is defined as

    ∂I_x/∂p_k = ΔI_x Δp_kᵀ / (‖Δp_k‖² + ε) ∈ ℝ^{C×3}.    (9)

Here, ‖Δp_k‖ corresponds to the distance between x and the edge of the ellipse. Intuitively, the further p_k needs to travel, the less impact it has on I_x, and vice versa. The value ε is a small constant that prevents the gradient from becoming extremely large when Δp_k is close to zero, which would lead to overshooting, oscillation and other convergence problems.

In order to compute ΔI_x as accurately as possible, we evaluate Eq. (6) after the movement of p_k while taking into account currently occluded ellipses. For this purpose, we cache an ordered list of the top-K (we choose K = 5) closest ellipses that project onto each pixel, and save their ρ_k, w_k and depth values during the forward pass.
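Under this linearization, the per-splat, per-pixel gradient can be sketched as an outer product of the pixel-value change and the candidate translation; the exact normalization and the value of ε below are our assumptions for illustration:

```python
import numpy as np

# Sketch of the linearized artificial gradient for one splat and one pixel:
# the change in pixel value dI (C channels) caused by a candidate
# translation dp (3D) is turned into a C x 3 Jacobian. The normalization
# by ||dp||^2 + eps is an illustrative choice; eps guards against the
# gradient blowing up for very small translations.

def artificial_gradient(dI, dp, eps=1e-2):
    dI = np.asarray(dI, dtype=float)       # (C,) change of pixel value
    dp = np.asarray(dp, dtype=float)       # (3,) translation of the splat
    return np.outer(dI, dp) / (np.dot(dp, dp) + eps)   # (C, 3)

# A splat moving 2 units along +x darkens the pixel by 0.5 in each channel:
g = artificial_gradient([-0.5, -0.5, -0.5], [2.0, 0.0, 0.0])
```

Note how a longer required translation dp shrinks the gradient magnitude, matching the intuition that far-away splats influence a pixel less.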

Figure 6. The effect of repulsion regularization. We deform a 2D grid to the teapot. Without the repulsion term, points cluster in the center of the target shape. The repulsion term penalizes this type of local minima and encourages a uniform point distribution.

Comparison to other differentiable renderers

In Paparazzi [Liu et al., 2018], the rendering function is simplified enough that the gradients can be computed analytically, which is prohibitive for silhouette changes, where handling significant occlusion events is required. The work most closely related to our approach in terms of gradient definition is the neural mesh renderer (NMR) [Kato et al., 2018]. Both methods construct the gradient based on the change of the pixel value I_x, but our method differs from NMR in the following aspects: (1) In our definition, we consider the movement of p_k in 3D space, while NMR only considers movement in the image plane. As a result, we can optimize in the depth dimension even with a single view. (2) In our definition, the gradient for all dimensions of p_k is defined jointly. In contrast, NMR considers the 1D gradients separately and, consequently, only pixels along the x- and y-axes are considered. (3) In our definition, the change of pixel value is computed considering a set of occluded and occluding ellipses projected to pixel x. This not only leads to more accurate gradient values, but also encourages noisy points inside the model to move onto the surface, to a position with a matching pixel color.

3.3. Surface regularization

Figure 7. The effect of projection regularization. The projection term effectively enforces points to form a local manifold. For a better visualization of outliers inside and outside of the object, we use a small disk radius and render the backside of the disks using light gray color.

The lack of structure in point clouds, while providing freedom for massive topology changes, poses a significant challenge for optimization. First, the gradient derivation is entirely parallelized; as a result, points move irrespective of each other. Secondly, as the movement of points only induces small and sparse changes in the rendered image, gradients on each point are less structured compared to the corresponding gradients for meshes. Without proper regularization, one can quickly end up in local minima.

Inspired by [Huang et al., 2009; Öztireli et al., 2009], we propose a regularization based on two terms: a repulsion and a projection term. The repulsion term aims at generating uniform point distributions by maximizing the distances between a point and its neighbors on a local projection plane, while the projection term preserves clean surfaces by minimizing the distance from a point to the surface tangent plane.

Obviously, both terms require finding a reliable surface tangent plane. This can be challenging, since during optimization, especially in the case of multi-view joint optimization, intermediate point clouds can be very noisy and contain many occluded points inside the model. Hence we propose a weighted PCA that downweights the occluded inner points. In addition to the commonly used bilateral weights, which consider both the point-to-point Euclidean distance and the normal similarity, we propose a visibility weight, which penalizes occluded points, since they are more likely to be outliers inside the model.

Let p_k denote the point in question and p_i a point in its neighborhood N(p_k). We propose computing a weighted PCA using the following weights:

    w_i = φ(‖p_i − p_k‖) ψ(1 − n_iᵀ n_k) ω(o_i),    (10)

where φ and ψ are bilateral weights which favor neighboring points that are spatially close and have a similar normal orientation, respectively, and ω is the proposed visibility weight, defined via an occlusion counter o_i that counts the number of times p_i is occluded in all camera views. A reliable projection plane can then be obtained via singular value decomposition of the weighted vectors w_i (p_i − p̄), where p̄ = Σ_i w_i p_i / Σ_i w_i.
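A weighted PCA of this kind can be sketched as follows; the Gaussian forms of the spatial, normal and visibility weights, as well as all bandwidths, are illustrative assumptions rather than the paper's exact choices:

```python
import numpy as np

# Sketch of a weighted PCA estimating a local projection plane from a
# point neighborhood. Spatial closeness, normal similarity and a mocked
# occlusion counter jointly weight each neighbor before the SVD.

def weighted_plane(p_k, n_k, nbrs, nbr_normals, occlusions,
                   sigma_s=0.5, sigma_n=0.5, sigma_v=2.0):
    d = np.linalg.norm(nbrs - p_k, axis=1)
    w_s = np.exp(-(d / sigma_s) ** 2)                        # spatial weight
    w_n = np.exp(-((1 - nbr_normals @ n_k) / sigma_n) ** 2)  # normal similarity
    w_v = np.exp(-(occlusions / sigma_v) ** 2)               # visibility weight
    w = w_s * w_n * w_v
    mean = (w[:, None] * nbrs).sum(0) / w.sum()              # weighted centroid
    _, _, Vt = np.linalg.svd(w[:, None] * (nbrs - mean))
    # Rows of Vt are principal directions: the first two span the tangent
    # plane, the last approximates the plane normal.
    return mean, Vt[:2], Vt[2]

# Neighbors lying almost on the z = 0 plane, all facing +z, none occluded:
nbrs = np.array([[0.1, 0, 0], [-0.1, 0, 0], [0, 0.1, 0],
                 [0, -0.1, 0], [0.05, 0.05, 0.01]])
normals = np.tile([0.0, 0.0, 1.0], (5, 1))
mean, tangent, plane_n = weighted_plane(
    np.zeros(3), np.array([0.0, 0.0, 1.0]), nbrs, normals, np.zeros(5))
```

For this nearly planar neighborhood, the recovered plane normal aligns with the z-axis.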

For the repulsion term, the projected point-to-point distance is obtained via d_{ik} = V_{1,2}ᵀ (p_i − p_k), where V_{1,2} contains the first two principal components. We define the repulsion loss as follows and minimize it together with the per-pixel image loss:

    L_r = −Σ_k Σ_{i∈N(p_k)} ‖d_{ik}‖.    (11)

For the projection term, we minimize the point-to-plane distance via d̃_{ik} = V_3ᵀ (p_i − p_k), where V_3 is the last principal component. Correspondingly, the projection loss is defined as

    L_p = Σ_k Σ_{i∈N(p_k)} ‖d̃_{ik}‖².    (12)
The effects of the repulsion and projection terms are clearly demonstrated in Fig. 6 and Fig. 7. In Fig. 6, we aim to move points lying on a 2D grid to match the silhouette of a 3D teapot. Without the repulsion term, points quickly shrink to the center of the reference shape, a common local minimum since the gradients coming from surrounding pixels cancel each other out. With the repulsion term, the points can escape such local minima and distribute evenly inside the silhouette. In Fig. 7, we deform a sphere to a bunny from 12 views. Without projection regularization, points are scattered within and outside the surface. In contrast, when the projection term is applied, we obtain a clean and smooth surface.
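The two regularizers can be sketched for a single point with a known tangent frame; the exact loss forms (negative projected distance for repulsion, squared point-to-plane distance for projection) follow our reading of the text, and all inputs are toy values:

```python
import numpy as np

# Sketch of the repulsion and projection regularizers for one point p_k and
# its neighbors, given the tangent frame from the local weighted PCA.

def repulsion_projection(p_k, nbrs, tangent, normal):
    diff = nbrs - p_k                        # (N, 3) offsets to neighbors
    d_tan = diff @ tangent.T                 # (N, 2) offsets within the plane
    d_nor = diff @ normal                    # (N,)  distances to the plane
    loss_rep = -np.linalg.norm(d_tan, axis=1).sum()   # maximize in-plane spread
    loss_proj = (d_nor ** 2).sum()                    # flatten onto the plane
    return loss_rep, loss_proj

tangent = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
normal = np.array([0.0, 0.0, 1.0])
nbrs = np.array([[0.2, 0.0, 0.05], [0.0, 0.3, -0.05]])
rep, proj = repulsion_projection(np.zeros(3), nbrs, tangent, normal)
```

Minimizing the first value pushes neighbors apart along the tangent plane, while minimizing the second pulls stragglers onto it.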

4. Implementation details

4.1. Optimization objective

We choose the Symmetric Mean Absolute Percentage Error (SMAPE) as the image loss L_I. SMAPE is designed for high dynamic range images such as rendered images and therefore behaves more stably for unbounded values [Vogels et al., 2018]. It is defined as

    L_I(I, I*) = 1/(HW) Σ_x Σ_{c=1}^{C} |I_{x,c} − I*_{x,c}| / (|I_{x,c}| + |I*_{x,c}| + ε),    (13)

where H and W are the dimensions of the image, C is the number of channels, and ε is a small constant.
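A minimal SMAPE implementation might look as follows; the value of ε is an assumed small constant, not the paper's setting:

```python
import numpy as np

# Sketch of the SMAPE image loss: per-pixel, per-channel absolute
# difference normalized by the magnitudes of both images, keeping the
# loss bounded even for high-dynamic-range values.

def smape(img, ref, eps=1e-5):
    num = np.abs(img - ref)
    den = np.abs(img) + np.abs(ref) + eps
    return (num / den).mean()

loss = smape(np.array([[1.0, 2.0]]), np.array([[1.0, 4.0]]))
```

Because each term is normalized by |I| + |I*|, a fixed absolute error contributes less for bright pixels than for dark ones.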

The total optimization objective corresponding to Eq. (1) for a set of views V amounts to

    L = Σ_{v∈V} L_I(I_v, I*_v) + γ_r L_r + γ_p L_p,    (14)

where L_I is the image loss of Eq. (13). The loss weights γ_r and γ_p balance the repulsion and projection terms against the image loss.

4.2. Alternating normal and point update

For meshes, the face normals are determined by the vertex positions. For points, though, normals and point positions can be treated as independent entities and thus optimized individually. Our pixel value factorization in Eq. (8) and Eq. (9) means that the gradient on point positions mainly stems from the visibility term, while gradients on normals can be derived from ρ_k and w_k. Because the gradient w.r.t. the positions p assumes the normals n stay fixed, and vice versa, we apply the updates of p and n in an alternating fashion: we first optimize the normals for a number of steps, then optimize the point positions likewise.

As observed in many point denoising works [Öztireli et al., 2009; Huang et al., 2009; Guerrero et al., 2018], finding the right normals is key to obtaining clean surfaces. Hence we utilize the improved normals even when the point positions are not being updated, by directly updating the point positions using the gradient from the regularization terms L_r and L_p. In fact, for local surface modification, this simple strategy consistently yields satisfying results.

4.3. Error-aware view sampling

View selection is very important for quick convergence. In our experiments, we aim to cover all possible angles by sampling camera positions from a hulling sphere using farthest point sampling. Then we randomly perturb the sampled position and set the camera to look at the center of the object. The sampling process is repeated periodically to further improve optimization.

However, for shapes with complex topology, such a sampling scheme is not enough. We propose an error-aware view sampling scheme which chooses the new camera positions based on the current image loss.

Specifically, we downsample the reference image and the rendered result, then compute the pixel position with the largest image error. We then find the points whose projections are closest to that pixel; the mean 3D position of these points becomes the center of focus. Finally, we sample camera positions on a sphere around this focal point with a relatively small distance. Such a technique helps us improve point positions in small holes during large shape deformations.
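The selection step can be sketched as follows; the projection function, image sizes and the number of nearest points are illustrative stand-ins:

```python
import numpy as np

# Sketch of error-aware view sampling: locate the worst pixel on the
# downsampled error image, pick the points projecting closest to it, and
# return their mean position as the new center of focus for the cameras.

def focus_point(err_img, points, project, k=8):
    worst = np.unravel_index(np.argmax(err_img), err_img.shape)  # (row, col)
    target = np.array([worst[1], worst[0]], dtype=float)         # (x, y) pixel
    proj = project(points)                                       # (N, 2) projections
    d = np.linalg.norm(proj - target, axis=1)
    nearest = np.argsort(d)[:k]                                  # k closest points
    return points[nearest].mean(axis=0)                          # center of focus

# Orthographic toy projection: drop the z coordinate.
pts = np.random.RandomState(0).rand(100, 3) * 16
err = np.zeros((16, 16))
err[4, 12] = 1.0          # the worst pixel sits at x = 12, y = 4
center = focus_point(err, pts, lambda p: p[:, :2])
```

New cameras sampled on a small sphere around `center` would then concentrate image supervision where the error is largest.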

5. Results

We evaluate the performance of DSS by comparing it to state-of-the-art DRs, and demonstrate its applications in point-based geometry editing and filtering.

Our method is implemented in PyTorch [Paszke et al., 2017]; we use stochastic gradient descent with Nesterov momentum [Sutskever et al., 2013] for optimization, with separate learning rates for points and normals. We reduce the learning rates by a factor of 0.5 if the total optimization loss stagnates for 15 optimization steps. In all experiments, we render in back-face culling mode at a fixed resolution with diffuse shading, using RGB sun lights fixed relative to the camera position.

Unless otherwise stated, we optimize for up to 16 cycles of alternating optimization steps for point normals and positions, with different step counts for large deformation and for local surface editing. In each cycle, 12 randomly sampled views are used simultaneously for an optimization step. To test our algorithm's noise resilience, we add random white Gaussian noise with a standard deviation measured relative to the diagonal length of the bounding box of the input model. We refer to Appendix A for a detailed discussion of parameter settings.

5.1. Comparison of different DRs.

Figure 8. Comparison of large shape deformation with topological changes (columns: initialization, target, result, Meshlab render). From top to bottom: results of Paparazzi [Liu et al., 2018], Neural Mesh Renderer [Kato et al., 2018] and ours. The two mesh-based DRs fail to deform a sphere to a teapot, while DSS successfully recovers the handle and lid of the teapot, thanks to the flexibility of the point-based representation.
Figure 9. DSS deforms a cube to three different Yoga models. Noisy points may occur when camera views are under-sampled or occluded (as shown in the initial result). We apply an additional refinement step improving the view sampling as described in Sec. 4.3.

We compare DSS to two state-of-the-art mesh-based DRs, NMR [Kato et al., 2018] and Paparazzi [Liu et al., 2018], in terms of large geometry deformation. We use the publicly available code provided by the authors and report the best results among experiments using different parameters (e.g., number of cameras and learning rate). All three methods use the same initial and target shapes, and similar camera positions. Note that both NMR and DSS use a pinhole camera, while Paparazzi uses an orthographic projection. We directly propagate the image error to the DRs, without using additional neural networks to aid the point/vertex position updates.

As shown in Figure 8, NMR and Paparazzi cannot deform a sphere into the target teapot, mainly due to the limitations of the mesh representation. These two mesh-based DRs excel at mapping and transferring image texture to geometry space, but are not designed for large-scale geometry deformation, which is vital for many 3D learning tasks.

We implement a naive point DR to verify the necessity of our gradient computation and surface regularization, as no publicly available point-based DR is designed for geometry processing. The implementation follows existing point-based DRs [Roveri et al., 2018a; Roveri et al., 2018b; Insafutdinov and Dosovitskiy, 2018], where depth values are forward-projected as pixel intensities and an isotropic Gaussian filter is applied to the projected values, creating a gradient for the point position, also in the depth direction, inside the support of the Gaussian filter. As shown in Figure 10, such a naive point-based DR can handle neither large-scale shape deformation nor fine-scale denoising, because the position gradient is confined locally, restricting long-range movement, and normal information is not utilized for fine-grained geometry updates.
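A minimal sketch of such a naive depth-projection renderer is given below (orthographic projection onto a square image; the function name and normalized-coordinate convention are our own, not from the cited implementations). Each point influences only pixels within the Gaussian support, which illustrates why the position gradient is local:

```python
import math

def splat_depth(points, res=32, sigma=1.0, radius=3):
    # points: (x, y, z) with x, y in [0, 1]; z is depth, projected orthographically.
    # Each point contributes its depth to nearby pixels via an isotropic Gaussian,
    # so the rendered image (and hence the gradient) only "sees" a point within
    # ~radius pixels of its projection.
    img = [[0.0] * res for _ in range(res)]
    wsum = [[0.0] * res for _ in range(res)]
    for x, y, z in points:
        px, py = x * (res - 1), y * (res - 1)
        for i in range(max(0, int(py) - radius), min(res, int(py) + radius + 1)):
            for j in range(max(0, int(px) - radius), min(res, int(px) + radius + 1)):
                w = math.exp(-((j - px) ** 2 + (i - py) ** 2) / (2 * sigma ** 2))
                img[i][j] += w * z
                wsum[i][j] += w
    # Normalize by the accumulated weights (empty pixels stay at 0).
    return [[img[i][j] / wsum[i][j] if wsum[i][j] > 0 else 0.0
             for j in range(res)] for i in range(res)]
```

A point more than `radius` pixels away from its target location receives no gradient at all, which is exactly the failure mode shown in Figure 10.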

Figure 10. A simple projection-based point renderer which renders depth values fails in deformation and denoising tasks.

5.2. Application: shape editing via image filter

As demonstrated by Paparazzi, one important application of DRs is shape editing using existing image filters. This enables many kinds of geometric filtering and style transfer that would be challenging to define purely in the geometry domain. This benefit also applies to DSS.

We experimented with two types of image filters: L0 smoothing [Xu et al., 2011] and superpixel segmentation [Achanta et al., 2012]. These filters are applied to the originally rendered images to create references. Like Paparazzi, we keep the silhouette of the shape and change the local surface geometry by updating point normals; the projection and repulsion regularizations are then applied to correct the point positions.
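The editing loop can be summarized structurally as below; every argument is a placeholder for the corresponding DSS component (renderer, image filter, image-loss gradient step on the normals, projection and repulsion regularizers), so this is a sketch of the control flow rather than the actual implementation:

```python
def edit_geometry(points, normals, render, image_filter,
                  normal_step, project, repulse, n_iters=10):
    # Filtered renderings of the original shape serve as the fixed reference
    # (the silhouette is preserved); normals absorb the image loss, then
    # projection and repulsion correct the point positions so that they stay
    # uniformly distributed on the surface.
    reference = image_filter(render(points, normals))
    for _ in range(n_iters):
        normals = normal_step(points, normals, reference)
        points = repulse(project(points, normals))
    return points, normals
```

The ordering matters: positions are only corrected after the normals have been updated, so the regularizers act on geometry that already reflects the filtered reference.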

As shown in Fig. 11, DSS successfully transfers image-level changes to the geometry. Even under 1% noise, DSS continues to produce reasonable results. In contrast, mesh-based DRs are sensitive to input noise, because the noise leads to small piecewise structures and flipped faces in image space (see Fig. 12), which are troublesome for the computation of gradients. Points, in comparison, are free of any structural constraints; thus, DSS can update normals and positions independently, which makes it robust under noise.

Figure 11. Examples of DSS-based geometry filtering. We apply image filters to the DSS-rendered multi-view images and propagate the changes in pixel values to point positions and normals. From left to right: the Poisson reconstruction of the input points, points filtered by L0 smoothing, and points filtered by superpixel segmentation. In the first row, a clean point cloud is used as input, while in the second row, we add 1% white Gaussian noise. In both cases, DSS updates the geometry to match the changes in the image domain.
Figure 12. Paparazzi [Liu et al., 2018] successfully applies an image filter to a clean mesh (left) but fails on an input containing 0.5% noise (right).
Figure 13. Denoising a real Kinect scan using our Pix2Pix-DSS.

5.3. Application: point cloud denoising

One of the benefits of the shape-from-rendering framework is the possibility to leverage powerful neural networks and vast amounts of 2D data. We demonstrate this advantage on a point cloud denoising task, which is known to be an ill-posed problem where handcrafted priors struggle to recover all levels of smooth and sharp features.

First, we adopt the off-the-shelf image-translation network Pix2Pix [Isola et al., 2017] to denoise rendered images. In addition to a per-pixel L1 loss, Pix2Pix is supervised by an adversarial loss [Goodfellow et al., 2014] to add plausible details for improved visual quality. At test time, we render images of the noisy point cloud from different views and use the trained Pix2Pix network to reconstruct the geometric structure from the noisy images. Finally, we update the point cloud using DSS with the denoised images as reference.

To synthesize training data for the Pix2Pix denoising network, we use the training set of the Sketchfab dataset [Yifan et al., 2018], which consists of 91 high-resolution 3D models. We use the Poisson-disk sampling [Corsini et al., 2012] implemented in MeshLab [Cignoni et al., 2008] to sample 20K points per model as reference points, create noisy input points by adding white Gaussian noise, and compute PCA normals [Hoppe et al., 1992] for both the reference and input points. We generate training data by rendering a total of 149,240 pairs of images from the noisy and clean models using DSS, from a variety of viewpoints and distances, with a point light and diffuse shading. While sophisticated lighting, non-uniform albedo, and specular shading can provide useful cues for estimating global information such as lighting and camera positions, we find that glossy effects pose unnecessary difficulties for the network in inferring local geometric structure.
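The noise-synthesis step, with the standard deviation specified relative to the bounding-box diagonal as in the text, can be sketched as follows (the function name and seeding are our own):

```python
import math
import random

def add_relative_noise(points, level=0.01, seed=0):
    # Perturb each coordinate with white Gaussian noise whose standard
    # deviation is `level` (e.g., 0.01 for 1% noise) times the diagonal
    # length of the point cloud's axis-aligned bounding box.
    rng = random.Random(seed)
    lo = [min(p[d] for p in points) for d in range(3)]
    hi = [max(p[d] for p in points) for d in range(3)]
    sigma = level * math.dist(lo, hi)
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]
```

Specifying the noise level relative to the bounding-box diagonal makes the perturbation scale-invariant across models of different sizes.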

To apply Pix2Pix to rendered content, we remove the tanh activation in the final layer to obtain unbounded pixel values (we refer readers to Appendix B for more details on the adapted architecture). To maximize the amount of hallucinated details, we train two models for 1.0% and 0.3% noise respectively. Fig. 15 shows some examples of the input and output of the network. Hallucinated delicate structures can be observed clearly in both noise levels. Furthermore, even though our Pix2Pix model is not trained with view-consistency constraints, the hallucinated details remain mostly consistent across views. In case small inconsistencies appear in regions where a large amount of high-frequency details are created, DSS is still able to transfer plausible details from the 2D to the 3D domain without visible artefacts, as shown in Fig. 17, thanks to simultaneous multi-view optimization.

Figure 14. Examples of multi-view Pix2Pix denoising on the same 3D model. As our Pix2Pix model processes each view independently, inconsistencies across different views might occur in the generated high-frequency details. In spite of that, DSS recovers plausible structures in the 3D shape (see Fig. 17) thanks to our simultaneous multi-view optimization.

Evaluation of DSS denoising. We perform quantitative and qualitative comparison with state-of-the-art optimization-based methods WLOP [Huang et al., 2009], EAR [Huang et al., 2013], RIMLS [Öztireli et al., 2009] and GPF [Lu et al., 2018], as well as a learning-based method, PointCleanNet [Rakotosaona et al., 2019], using the code provided by the authors. For quantitative comparison, we compute Chamfer distance (CD) and Hausdorff distance (HD) between the reconstructed and ground truth surface.
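For reference, the two metrics can be computed between point samples of the two surfaces with a brute-force sketch like the following (the paper evaluates them on reconstructed versus ground-truth surfaces; the function names and nearest-neighbor strategy here are ours):

```python
import math

def _min_dists(a, b):
    # For each point in a, the distance to its nearest neighbor in b (brute force).
    return [min(math.dist(p, q) for q in b) for p in a]

def chamfer_distance(a, b):
    # Symmetric Chamfer distance: mean nearest-neighbor distance, both directions.
    return (sum(_min_dists(a, b)) / len(a)
            + sum(_min_dists(b, a)) / len(b))

def hausdorff_distance(a, b):
    # Symmetric Hausdorff distance: worst-case nearest-neighbor distance.
    return max(max(_min_dists(a, b)), max(_min_dists(b, a)))
```

In practice, a spatial acceleration structure (e.g., a k-d tree) replaces the brute-force nearest-neighbor search for large point sets.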

| model | application | # points | opt. steps (position) | opt. steps (normal) | avg. forward (ms) | avg. backward (ms) | total time (s) | GPU memory |
|---|---|---|---|---|---|---|---|---|
| Fig. 8 | shape deformation | 8003 | 200 | 120 | 19.3 | 79.9 | 336 | 1.7 MB |
| Fig. 11 | L0 surface filtering | 20000 | 8 | 152 | 42.8 | 164.6 | 665 | 1.8 MB |
| Fig. 17 | denoising | 100000 | 8 | 152 | 258.1 | 680.2 | 1951 | 2.3 MB |
Table 2. Runtime and GPU memory demand for exemplary models in different applications. 12 views are used per optimization step.

First, we compare denoising performance on relatively noisy (1% noise) and sparse (20K points) input data, as shown in Fig. 16. Optimization-based methods can reconstruct a smooth surface but also smear the low-level details. The learning-based PointCleanNet can preserve some detailed structure, like the fingers of the armadillo, but cannot remove all high-frequency noise. This is mainly because the multilayer perceptrons (MLPs) used in PointNet++ [Qi et al., 2017b]-based networks are suboptimal at learning multiple levels of detail from a large training dataset, compared with convolutional layers. We test DSS with two image filters, i.e., L0 smoothing and the Pix2Pix model trained on data with 20K points and 1% noise. L0-DSS performs on par with the optimization-based methods. Pix2Pix-DSS outperforms all other compared methods both quantitatively and qualitatively.

Second, we evaluate the ability to preserve fine-grained detail on relatively smooth (0.3% noise) and dense (100K points) input data, as shown in Fig. 17. Here, our Pix2Pix model is trained on data with 20K points. Optimization-based methods and L0-DSS produce high-accuracy reconstructions, as the local surface is less contaminated and densely sampled. PointCleanNet suffers from overfitting to the characteristics of its training data, e.g., the number of sample points. In contrast, Pix2Pix-DSS is less sensitive to the characteristics of the point sampling, thanks to the surface splatting in the image domain. As a result, although its reconstruction error is slightly higher than that of RIMLS and of the direct Poisson reconstruction of the input, Pix2Pix-DSS reconstructs a clean surface with a great deal of hallucinated detail.

Finally, we validate the generalizability of the proposed image-to-geometry denoising method on real scanned data. First, we test on depth images acquired with a Kinect device [Wang et al., 2016]. Since the raw input is too sparse (2000 vertices), we use Poisson-disk sampling to resample 20K points and compute PCA normals. We then use the Pix2Pix denoising model trained on synthetic data with 1.0% white Gaussian noise to denoise the rendered images. As shown in Fig. 13, the combination of neural image denoising and DSS generalizes well to different types of noise.

Furthermore, we acquire a 3D scan of a dragon model ourselves using a hand-held scanner and resample 50K points as input. We compare the point cloud cleaning performance of EAR, RIMLS, PointCleanNet, and ours in Fig. 18. EAR outputs clean and smooth surfaces but tends to produce underwhelming geometric detail. RIMLS preserves sharp geometric features, but compared to our method, its results contain more low-frequency noise. The output of PointCleanNet is notably noisier than that of the other methods, while its reconstruction falls between EAR and RIMLS in terms of detail preservation and surface smoothness. In comparison, our method yields clean and smooth surfaces with rich geometric detail.

Figure 15. Examples of the input and output of the Pix2Pix denoising network. We train two models targeting two different noise levels (0.3% and 1.0%). In both cases, the network recovers smooth yet detailed geometry, while the 0.3% noise variant generates more fine-grained detail.
Figure 16. Quantitative and qualitative comparison of point cloud denoising. We report the Chamfer distance (CD) and Hausdorff distance (HD). With respect to HD, our method outperforms its competitors; for CD, only PointCleanNet generates better, albeit noisier, results.

5.4. Performance

Our forward and backward rasterization passes are implemented in CUDA. We benchmark the runtime on an NVIDIA 1080Ti GPU with CUDA 10.0 and summarize the runtime and memory demand for all applications mentioned above on one exemplary model each in Table 2. As before, 12 views are used per optimization step.

As a reference, for the teapot example, one optimization step takes about 50 ms in Paparazzi and 160 ms in Neural Mesh Renderer, whereas it takes us 100 ms (see the second row of Table 2). However, since Paparazzi does not jointly optimize multiple views, it requires more iterations to converge. In the L0-smoothing example (see Fig. 12), it takes Paparazzi 30 minutes and 30000 optimization steps to obtain the final result, whereas DSS needs 160 steps and 11 minutes for a similar result (see the third row of Table 2).

Figure 17. Quantitative and qualitative comparison of point cloud denoising with 0.3% noise. We report CD and HD. Although some methods perform better in the quantitative evaluation, our result matches the ground truth more closely than the other methods.
Figure 18. Qualitative comparison of point cloud denoising on a dragon model acquired using a hand-held scanner (without intermediate mesh representation). Our Pix2Pix-DSS outperforms the compared methods.

6. Conclusion and future works

In this paper, we showed how a high-quality splat-based differentiable renderer can be developed. DSS inherits the flexibility of point-based representations, can propagate gradients to point positions and normals, and produces accurate geometry and topology. This is made possible by careful handling of gradients and dedicated regularization. We showcased several applications of how such a renderer can be utilized for image-based geometry processing. In particular, combining DSS with contemporary deep neural network architectures yielded state-of-the-art results.

There is a plethora of neural networks that provide excellent results on images for various applications such as stylization, segmentation, super-resolution, or finding correspondences, just to name a few. Developing DSS is a first step toward transferring these techniques from the image domain to the geometry domain. Another fundamental application of DSS is inverse rendering, where we try to infer scene-level information such as geometry, motion, materials, and lighting from images or video. We believe DSS will be instrumental in inferring dynamic scene geometry in multi-modal capture setups.


  • Achanta et al. [2012] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. 2012. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE transactions on pattern analysis and machine intelligence 34, 11 (2012), 2274–2282.
  • Alexa et al. [2003] Marc Alexa, Johannes Behr, Daniel Cohen-Or, Shachar Fleishman, David Levin, and Claudio T Silva. 2003. Computing and rendering point set surfaces. IEEE Trans. Visualization & Computer Graphics 9, 1 (2003), 3–15.
  • Azinović et al. [2019] Dejan Azinović, Tzu-Mao Li, Anton Kaplanyan, and Matthias Nießner. 2019. Inverse Path Tracing for Joint Material and Lighting Estimation. arXiv preprint arXiv:1903.07145 (2019).
  • Berger et al. [2017] Matthew Berger, Andrea Tagliasacchi, Lee M Seversky, Pierre Alliez, Gael Guennebaud, Joshua A Levine, Andrei Sharf, and Claudio T Silva. 2017. A survey of surface reconstruction from point clouds. In Computer Graphics Forum, Vol. 36. 301–329.
  • Cignoni et al. [2008] Paolo Cignoni, Marco Callieri, Massimiliano Corsini, Matteo Dellepiane, Fabio Ganovelli, and Guido Ranzuglia. 2008. MeshLab: an Open-Source Mesh Processing Tool. In Eurographics Italian Chapter Conference.
  • Corsini et al. [2012] Massimiliano Corsini, Paolo Cignoni, and Roberto Scopigno. 2012. Efficient and flexible sampling with blue noise properties of triangular meshes. IEEE Trans. Visualization & Computer Graphics 18, 6 (2012), 914–924.
  • Fan et al. [2017] Haoqiang Fan, Hao Su, and Leonidas J Guibas. 2017. A Point Set Generation Network for 3D Object Reconstruction from a Single Image. Proc. IEEE Conf. on Computer Vision & Pattern Recognition 2, 4, 6.
  • Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proc. Inter. Conf. on Artificial Intelligence and Statistics. 249–256.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NIPS).
  • Guerrero et al. [2018] Paul Guerrero, Yanir Kleiman, Maks Ovsjanikov, and Niloy J Mitra. 2018. PCPNet learning local shape properties from raw point clouds. In Computer Graphics Forum (Proc. of Eurographics), Vol. 37. 75–85.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition.
  • Heckbert [1989] Paul S Heckbert. 1989. Fundamentals of texture mapping and image warping. (1989).
  • Hermosilla et al. [2019] Pedro Hermosilla, Tobias Ritschel, and Timo Ropinski. 2019. Total Denoising: Unsupervised Learning of 3D Point Cloud Cleaning. arXiv preprint arXiv:1904.07615 (2019).
  • Hermosilla et al. [2018] P. Hermosilla, T. Ritschel, P-P Vazquez, A. Vinacua, and T. Ropinski. 2018. Monte Carlo Convolution for Learning on Non-Uniformly Sampled Point Clouds. ACM Trans. on Graphics (Proc. of SIGGRAPH Asia) 37, 6 (2018).
  • Hoppe et al. [1992] Hugues Hoppe, Tony DeRose, Tom Duchamp, John McDonald, and Werner Stuetzle. 1992. Surface reconstruction from unorganized points. Proc. of SIGGRAPH (1992), 71–78.
  • Hu et al. [2019] Tao Hu, Zhizhong Han, Abhinav Shrivastava, and Matthias Zwicker. 2019. Render4Completion: Synthesizing Multi-view Depth Maps for 3D Shape Completion. arXiv preprint arXiv:1904.08366 (2019).
  • Huang et al. [2009] Hui Huang, Dan Li, Hao Zhang, Uri Ascher, and Daniel Cohen-Or. 2009. Consolidation of Unorganized Point Clouds for Surface Reconstruction. ACM Trans. on Graphics (Proc. of SIGGRAPH Asia) 28, 5 (2009), 176:1–176:7.
  • Huang et al. [2013] Hui Huang, Shihao Wu, Minglun Gong, Daniel Cohen-Or, Uri Ascher, and Hao Richard Zhang. 2013. Edge-Aware Point Set Resampling. ACM Trans. on Graphics 32, 1 (2013), 9:1–9:12.
  • Insafutdinov and Dosovitskiy [2018] Eldar Insafutdinov and Alexey Dosovitskiy. 2018. Unsupervised learning of shape and pose with differentiable point clouds. In Advances in Neural Information Processing Systems (NIPS). 2802–2812.
  • Isola et al. [2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition.
  • Karras et al. [2018] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In Proc. Int. Conf. on Learning Representations.
  • Kato et al. [2018] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. 2018. Neural 3d mesh renderer. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition. 3907–3916.
  • Langguth et al. [2016] Fabian Langguth, Kalyan Sunkavalli, Sunil Hadap, and Michael Goesele. 2016. Shading-aware multi-view stereo. In Proc. Euro. Conf. on Computer Vision. Springer, 469–485.
  • Li et al. [2018] Tzu-Mao Li, Miika Aittala, Frédo Durand, and Jaakko Lehtinen. 2018. Differentiable monte carlo ray tracing through edge sampling. In ACM Trans. on Graphics (Proc. of SIGGRAPH Asia). ACM, 222.
  • Lin et al. [2018] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. 2018. Learning efficient point cloud generation for dense 3D object reconstruction. In AAAI Conference on Artificial Intelligence.
  • Lipman et al. [2007] Yaron Lipman, Daniel Cohen-Or, David Levin, and Hillel Tal-Ezer. 2007. Parameterization-free projection for geometry reconstruction. ACM Trans. on Graphics (Proc. of SIGGRAPH) 26, 3 (2007), 22:1–22:6.
  • Liu et al. [2017] Guilin Liu, Duygu Ceylan, Ersin Yumer, Jimei Yang, and Jyh-Ming Lien. 2017. Material editing using a physically based rendering network. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition. 2261–2269.
  • Liu et al. [2018] Hsueh-Ti Derek Liu, Michael Tao, and Alec Jacobson. 2018. Paparazzi: Surface Editing by way of Multi-View Image Processing. In ACM Trans. on Graphics (Proc. of SIGGRAPH Asia). ACM, 221.
  • Liu et al. [2019] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. 2019. Soft Rasterizer: A Differentiable Renderer for Image-based 3D Reasoning. arXiv preprint arXiv:1904.01786 (2019).
  • Loper and Black [2014] Matthew M Loper and Michael J Black. 2014. OpenDR: An approximate differentiable renderer. In Proc. Euro. Conf. on Computer Vision. Springer, 154–169.
  • Lu et al. [2018] Xuequan Lu, Shihao Wu, Honghua Chen, Sai-Kit Yeung, Wenzhi Chen, and Matthias Zwicker. 2018. GPF: GMM-inspired feature-preserving point set filtering. IEEE Trans. Visualization & Computer Graphics 24, 8 (2018), 2315–2326.
  • Maier et al. [2017] Robert Maier, Kihwan Kim, Daniel Cremers, Jan Kautz, and Matthias Nießner. 2017. Intrinsic3d: High-quality 3D reconstruction by joint appearance and geometry optimization with spatially-varying lighting. In Proc. Int. Conf. on Computer Vision. 3114–3122.
  • Mao et al. [2017] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. 2017. Least squares generative adversarial networks. In Proc. Int. Conf. on Computer Vision. 2794–2802.
  • Nguyen-Phuoc et al. [2018] Thu H Nguyen-Phuoc, Chuan Li, Stephen Balaban, and Yongliang Yang. 2018. RenderNet: A deep convolutional network for differentiable rendering from 3D shapes. In Advances in Neural Information Processing Systems (NIPS). 7891–7901.
  • Öztireli et al. [2009] A Cengiz Öztireli, Gael Guennebaud, and Markus Gross. 2009. Feature preserving point set surfaces based on non-linear kernel regression. In Computer Graphics Forum (Proc. of Eurographics), Vol. 28. 493–501.
  • Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
  • Petersen et al. [2019] Felix Petersen, Amit H Bermano, Oliver Deussen, and Daniel Cohen-Or. 2019. Pix2Vex: Image-to-Geometry Reconstruction using a Smooth Differentiable Renderer. arXiv preprint arXiv:1903.11149 (2019).
  • Pfister et al. [2000] Hanspeter Pfister, Matthias Zwicker, Jeroen Van Baar, and Markus Gross. 2000. Surfels: Surface elements as rendering primitives. In Proc. Conf. on Computer Graphics and Interactive techniques. 335–342.
  • Pontes et al. [2017] Jhony K Pontes, Chen Kong, Sridha Sridharan, Simon Lucey, Anders Eriksson, and Clinton Fookes. 2017. Image2Mesh: A Learning Framework for Single Image 3D Reconstruction. arXiv preprint arXiv:1711.10669 (2017).
  • Qi et al. [2017a] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017a. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition.
  • Qi et al. [2017b] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. 2017b. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems (NIPS). 5099–5108.
  • Rajeswar et al. [2018] Sai Rajeswar, Fahim Mannan, Florian Golemo, David Vazquez, Derek Nowrouzezahrai, and Aaron Courville. 2018. Pix2Scene: Learning Implicit 3D Representations from Images. (2018).
  • Rakotosaona et al. [2019] Marie-Julie Rakotosaona, Vittorio La Barbera, Paul Guerrero, Niloy J Mitra, and Maks Ovsjanikov. 2019. POINTCLEANNET: Learning to Denoise and Remove Outliers from Dense Point Clouds. arXiv preprint arXiv:1901.01060 (2019).
  • Richardson et al. [2017] Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. 2017. Learning detailed face reconstruction from a single image. In IEEE Trans. Pattern Analysis & Machine Intelligence. 1259–1268.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Inter. Conf. on Medical image computing and computer-assisted intervention. Springer, 234–241.
  • Roveri et al. [2018a] Riccardo Roveri, A Cengiz Öztireli, Ioana Pandele, and Markus Gross. 2018a. PointProNets: Consolidation of point clouds with convolutional neural networks. In Computer Graphics Forum (Proc. of Eurographics), Vol. 37. 87–99.
  • Roveri et al. [2018b] Riccardo Roveri, Lukas Rahmann, Cengiz Oztireli, and Markus Gross. 2018b. A network architecture for point cloud classification via automatic depth images generation. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition. 4176–4184.
  • Sengupta et al. [2018] Soumyadip Sengupta, Angjoo Kanazawa, Carlos D Castillo, and David W Jacobs. 2018. SfSNet: Learning Shape, Reflectance and Illuminance of Faces in the Wild. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition. 6296–6305.
  • Shi et al. [2017] Jian Shi, Yue Dong, Hao Su, and Stella X. Yu. 2017. Learning Non-Lambertian Object Intrinsics Across ShapeNet Categories. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition.
  • Sitzmann et al. [2018] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. 2018. DeepVoxels: Learning Persistent 3D Feature Embeddings. arXiv preprint arXiv:1812.01024 (2018).
  • Sutskever et al. [2013] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In Proc. IEEE Int. Conf. on Machine Learning. 1139–1147.
  • Tulsiani et al. [2017] Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. 2017. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition. 2626–2634.
  • Vogels et al. [2018] Thijs Vogels, Fabrice Rousselle, Brian McWilliams, Gerhard Röthlin, Alex Harvill, David Adler, Mark Meyer, and Jan Novák. 2018. Denoising with kernel prediction and asymmetric loss functions. ACM Trans. on Graphics 37, 4 (2018), 124.
  • Wang et al. [2016] Peng-Shuai Wang, Yang Liu, and Xin Tong. 2016. Mesh denoising via cascaded normal regression. ACM Trans. on Graphics (Proc. of SIGGRAPH Asia) 35, 6 (2016), 232–1.
  • Xu et al. [2011] Li Xu, Cewu Lu, Yi Xu, and Jiaya Jia. 2011. Image smoothing via L0 gradient minimization. In ACM Transactions on Graphics (TOG), Vol. 30. ACM, 174.
  • Yan et al. [2016] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. 2016. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In Advances in Neural Information Processing Systems (NIPS). 1696–1704.
  • Yao et al. [2018] Shunyu Yao, Tzu Ming Hsu, Jun-Yan Zhu, Jiajun Wu, Antonio Torralba, Bill Freeman, and Josh Tenenbaum. 2018. 3D-aware scene manipulation via inverse graphics. In Advances in Neural Information Processing Systems (NIPS). 1887–1898.
  • Yifan et al. [2018] Wang Yifan, Shihao Wu, Hui Huang, Daniel Cohen-Or, and Olga Sorkine-Hornung. 2018. Patch-based Progressive 3D Point Set Upsampling. arXiv preprint arXiv:1811.11286 (2018).
  • Yu et al. [2018] Lequan Yu, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. 2018. EC-Net: an Edge-aware Point set Consolidation Network. Proc. Euro. Conf. on Computer Vision (2018).
  • Zhu et al. [2017] Rui Zhu, Hamed Kiani Galoogahi, Chaoyang Wang, and Simon Lucey. 2017. Rethinking reprojection: Closing the loop for pose-aware shape reconstruction from a single image. In Proc. Int. Conf. on Computer Vision. 57–65.
  • Zwicker et al. [2002] Matthias Zwicker, Mark Pauly, Oliver Knoll, and Markus Gross. 2002. Pointshop 3D: An interactive system for point-based surface editing. In ACM Trans. on Graphics, Vol. 21. ACM, 322–329.
  • Zwicker et al. [2001] Matthias Zwicker, Hanspeter Pfister, Jeroen Van Baar, and Markus Gross. 2001. Surface splatting. In Proc. Conf. on Computer Graphics and Interactive techniques. ACM, 371–378.
  • Zwicker et al. [2004] Matthias Zwicker, Jussi Räsänen, Mario Botsch, Carsten Dachsbacher, and Mark Pauly. 2004. Perspective accurate splatting. In Proc. of Graphics interface. Canadian Human-Computer Communications Society, 247–254.

Appendix A Parameter discussion

Here, we describe the effects of all hyper-parameters of our method.

Forward rendering. The required hyper-parameters are the cutoff threshold, the merge threshold, and the standard deviation of the splat kernel. For all of these, we closely follow the default settings of the original EWA paper [Zwicker et al., 2001]. For close camera views, the default standard deviation is increased so that the splats are large enough to create hole-free renderings.

Backward rendering. The only hyper-parameter is the size of the cache used for logging the points that project to each pixel. The larger the cache, the more accurate the gradient becomes, as more occluded points can be considered in the re-evaluation of (6). We find our default cache size sufficiently large for all experiments.
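The per-pixel cache can be sketched as keeping the frontmost entries among all points projecting to a pixel (the tuple layout and cache size below are illustrative):

```python
import heapq

def cache_pixel(hits, cache_size=5):
    # hits: (depth, point_id) pairs of every point splatting onto one pixel.
    # Keeping the `cache_size` frontmost (smallest-depth) points lets occluded
    # points just behind the visible surface still receive gradients in the
    # backward pass, at a memory cost proportional to the cache size.
    return heapq.nsmallest(cache_size, hits)
```

A larger cache admits more occluded points into the gradient re-evaluation, trading memory for accuracy as described above.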

Regularization. The bandwidths for computing the weights in (11) and (12) are set as suggested in previous works [Huang et al., 2009; Öztireli et al., 2009]. Specifically, one bandwidth is computed from the diagonal length of the bounding box of the initial shape and the number of points, and the other is set to encourage a smooth surface in the presence of outliers. For large-scale deformations, where intermediate results can have more outliers, we set the weight of the projection term to a higher value, which helps to pull the outliers to the nearest surface.

Optimization. The learning rate has a substantial impact on convergence. In our experiments, we set the learning rates for position and normal to 5 and 5000, respectively; these values generally work well for all applications. Higher learning rates cause the points to converge faster but increase the risk of the points gathering in clusters. A more sophisticated optimization algorithm could make the optimization more robust, but that is beyond the scope of this paper. A sufficient number of views per optimization step is key to a good result in the ill-posed 2D-to-3D setting. Twelve camera views are used in all our experiments; with 8 or fewer views, results start to degenerate. The numbers of point and normal update steps differ per application: large topology changes require more position steps, whereas local geometry processing requires more normal steps (cf. Table 2). Finally, we choose the loss weights for the image loss, the projection regularization, and the repulsion regularization by ensuring that the magnitude of the per-point gradients from the projection and repulsion terms is a fraction of that from the image loss. If the repulsion weight is too large, points can be repelled far off the surface, while if the projection weight is too large, points are forced to stay on a local surface, making topology changes difficult.
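The gradient-magnitude balancing described above can be sketched as follows; `target_ratio` stands in for the fraction of the image-loss gradient magnitude, whose exact value is not reproduced here, and the function name is our own:

```python
def balance_weight(reg_grad_norms, img_grad_norms, target_ratio=0.1):
    # Choose a regularizer weight so that the average per-point gradient
    # magnitude of the (unit-weighted) regularizer becomes `target_ratio`
    # times that of the image loss. `target_ratio` is an assumed placeholder.
    mean_reg = sum(reg_grad_norms) / len(reg_grad_norms)
    mean_img = sum(img_grad_norms) / len(img_grad_norms)
    return target_ratio * mean_img / mean_reg if mean_reg > 0 else 0.0
```

The same calibration is applied separately to the projection and repulsion terms, keeping both subordinate to the image loss.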

Appendix B Network details

Our model is based on Pix2Pix [Isola et al., 2017], which consists of a generator and a discriminator. For the generator, we experimented with U-Net [Ronneberger et al., 2015] and ResNet [He et al., 2016] and found that ResNet performs slightly better in our task, so we use it for all experiments. That is, the generator has a 2-stride convolution and a 2-stride up-convolution in the encoder and decoder networks, respectively, and 9 residual blocks in between. The discriminator follows the architecture C64-C128-C256-C512-C1, and an LSGAN [Mao et al., 2017] objective is used. To deal with checkerboard artefacts, we use pixel-wise normalization in the generator and add a 1-stride convolutional layer after each deconvolutional layer in the discriminator [Karras et al., 2018]. We use the default parameters of the authors' Pix2Pix PyTorch implementation and the ADAM optimizer. Xavier initialization [Glorot and Bengio, 2010] is used for the weights. We train our models for about two days on an NVIDIA 1080Ti GPU.