1. Introduction
Differentiable processing of scene-level information in the image formation process is emerging as a fundamental component for both 3D scene and 2D image and video modeling. The challenge of developing a differentiable renderer lies at the intersection of computer graphics, vision, and machine learning, and has recently attracted significant attention from all of these communities due to its potential to revolutionize digital visual data processing and its high relevance for a wide range of applications, especially when combined with contemporary neural network architectures
[Loper and Black, 2014; Kato et al., 2018; Liu et al., 2018; Yao et al., 2018; Petersen et al., 2019]. A differentiable renderer (DR) takes scene-level information θ, such as 3D scene geometry, lighting, material, and camera position, as input, and outputs a synthesized image I = R(θ). Any changes in the image I can thus be propagated to the parameters θ, allowing for image-based manipulation of the scene. Assuming a differentiable loss function ℒ(I) = ℒ(R(θ)) on a rendered image I, we can update the parameters θ with the gradient ∂ℒ/∂θ. This view provides a generic and powerful shape-from-rendering framework where we can exploit vast available image datasets, deep learning architectures and computational frameworks, as well as pre-trained models. The challenge, however, is being able to compute the gradient ∂ℒ/∂θ
in the renderer. Existing DR methods can be classified into three categories based on their geometric representation: voxel-based
[Nguyen-Phuoc et al., 2018; Tulsiani et al., 2017; Liu et al., 2017], mesh-based [Loper and Black, 2014; Kato et al., 2018; Liu et al., 2018], and point-based [Insafutdinov and Dosovitskiy, 2018; Lin et al., 2018; Roveri et al., 2018a; Rajeswar et al., 2018]. Voxel-based methods work on volumetric data and thus come with high memory requirements even for relatively coarse geometries. Mesh-based DRs solve this problem by exploiting the sparseness of the underlying geometry in 3D space. However, they are bound to the mesh structure with limited room for global and topological changes, as connectivity is not differentiable. Equally importantly, acquired 3D data typically comes in an unstructured representation that needs to be converted into a mesh, which is itself a challenging and error-prone operation. Point-based DRs circumvent these problems by directly operating on point samples of the geometry, leading to flexible and efficient processing. However, existing point-based DRs use simple rasterization techniques such as forward projection or depth maps, and thus suffer from the well-known deficiencies of point cloud processing when capturing fine geometric details, dealing with gaps and occlusions between nearby points, and forming a continuous surface. In this paper, we introduce Differentiable Surface Splatting (DSS), the first high-fidelity point-based differentiable renderer. We utilize ideas from surface splatting [Zwicker et al., 2001]
, where each point is represented as a disk or ellipse in object space, which is projected onto the screen space to form a splat. The splats are then interpolated to encourage hole-free and anti-aliased renderings. For inverse rendering, we carefully design gradients with respect to point locations and normals by taking each forward operation apart and utilizing domain knowledge. Due to the high degree of freedom of point locations and normals, there are infinitely many splat configurations that can form a given image; we therefore introduce regularization terms that drive the optimization towards the most plausible point configuration. Our inverse pass ensures that points stay on local geometric structures with uniform distribution.
We apply DSS to render multi-view color images as well as auxiliary maps from a given scene. We process the rendered images with state-of-the-art techniques and show that propagating the changes back through DSS leads to high-quality geometries. Experiments show that DSS yields significantly better results compared to previous DR methods, especially for substantial topological changes and geometric detail preservation. We focus on the particularly important application of point cloud denoising. The implementation of DSS, as well as our experiments, will be available upon publication.
2. Related work
In this section we provide some background and review the state of the art in differentiable rendering and point-based processing.
Table 1. Comparison of generic differentiable renderers.

| method | representation | position update | depth update | normal update | occlusion | silhouette change | topology change |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenDR | mesh | ✓ | ✗ | via position change | ✗ | ✓ | ✗ |
| NMR | mesh | ✓ | ✗ | via position change | ✗ | ✓ | ✗ |
| Paparazzi | mesh | limited | limited | via position change | ✗ | ✗ | ✗ |
| Soft Rasterizer | mesh | ✓ | ✓ | via position change | ✓ | ✓ | ✗ |
| Pix2Vex | mesh | ✓ | ✓ | via position change | ✓ | ✓ | ✗ |
| Ours | points | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
2.1. Differentiable rendering
An ideal differentiable renderer (DR) should (i) render images as realistically as possible, and (ii) compute reliable derivatives w.r.t. all rendering parameters. However, depending on the application, a trade-off must be made between the complexity of the rendering function, the number of targeted parameters, and the quality of the gradients. We first discuss general DR frameworks, followed by DRs designed for specific purposes.
Loper and Black [2014] develop a differentiable rendering framework called OpenDR that approximates a primary renderer and computes the gradients via automatic differentiation. The neural mesh renderer (NMR) [Kato et al., 2018] approximates the backward gradient of the rasterization operation using a hand-crafted function for visibility changes. Liu et al. [2018] propose Paparazzi, an analytic DR for mesh geometry processing using image filters. In concurrent work, Petersen et al. [2019] present Pix2Vex, a differentiable renderer based on soft blending schemes of nearby triangles, and Liu et al. [2019] introduce Soft Rasterizer, which renders and aggregates probabilistic maps of mesh triangles, allowing gradients to flow from the rendered pixels to occluded and far-range vertices. All these generic DR frameworks rely on a mesh representation of the scene geometry. We summarize the properties of these renderers in Table 1 and discuss them in greater detail in Sec. 3.2.
Numerous recent works employ DR for learning-based 3D vision tasks, such as single-view image reconstruction [Vogels et al., 2018; Yan et al., 2016; Pontes et al., 2017; Zhu et al., 2017], face reconstruction [Richardson et al., 2017], shape completion [Hu et al., 2019], and image synthesis [Sitzmann et al., 2018]. To describe a few, Pix2Scene [Rajeswar et al., 2018] uses a point-based DR to learn implicit 3D representations from images. However, Pix2Scene renders one surfel per pixel and does not use screen space blending. Nguyen-Phuoc et al. [2018] and Insafutdinov and Dosovitskiy [2018] propose neural DRs using a volumetric shape representation, but the resolution is limited in practice. Li et al. [2018] and Azinović et al. [2019] introduce differentiable ray tracers that support differentiation of physically based rendering effects, handling e.g. camera position, lighting and texture.
A number of works render depth maps of point sets [Lin et al., 2018; Insafutdinov and Dosovitskiy, 2018; Roveri et al., 2018b] for point cloud classification or generation. These renderers do not define proper gradients for updating point positions or normals; thus, they are commonly applied as an add-on layer behind a point processing network to provide 2D supervision. Typically, their gradients are defined either only for depth values [Lin et al., 2018] or only within a small local neighborhood around each point. Such gradients are not sufficient to alter the shape of a point cloud, as we show with a pseudo point renderer in Fig. 10.
Differentiable rendering is also related to shape-from-shading techniques [Langguth et al., 2016; Shi et al., 2017; Maier et al., 2017; Sengupta et al., 2018] that extract shading and albedo information for geometry processing and surface reconstruction. However, the framework proposed in this paper can be used seamlessly with contemporary deep neural networks, opening up a variety of new applications.
2.2. Point-based geometry processing and rendering
With the proliferation of 3D scanners and depth cameras, the capture and processing of 3D point clouds is becoming commonplace. The noise, outliers, incompleteness and misalignments present in the raw data pose significant challenges for point cloud filtering, editing, and surface reconstruction
[Berger et al., 2017]. Early optimization-based point set processing methods rely on shape priors. Alexa and colleagues [2003] introduce the moving least squares (MLS) surface model, assuming a smooth underlying surface. Aiming to preserve sharp edges, Öztireli et al. [2009] propose the robust implicit moving least squares (RIMLS) surface model. Huang et al. [2013] employ an anisotropic weighted locally optimal projection (WLOP) operator [Lipman et al., 2007; Huang et al., 2009] and a progressive edge-aware resampling (EAR) procedure to consolidate noisy input. Lu et al. [2018] formulate WLOP with a Gaussian mixture model and use a point-to-plane distance for point set processing (GPF). These methods depend on fitting local geometry, e.g. normal estimation, and struggle with reconstructing multi-scale structures from noisy input.
Advanced learning-based methods for point set processing are currently emerging, encouraged by the success of deep learning. Based on PointNet [Qi et al., 2017a], PCPNET [Guerrero et al., 2018] and PointCleanNet [Rakotosaona et al., 2019] estimate local shape properties from noisy and outlier-ridden point sets; EC-Net [Yu et al., 2018] learns point cloud consolidation and restoration of sharp features by minimizing a point-to-edge distance, but it requires edge annotations for the training data. Hermosilla et al. [2019] propose an unsupervised point cloud cleaning method based on Monte Carlo convolution [Hermosilla et al., 2018]. Roveri et al. [2018a] present a projection-based differentiable point renderer that converts unordered 3D points to 2D height maps, enabling the use of convolutional layers for height map denoising before back-projecting the smoothed pixels to the 3D point cloud. In contrast to the commonly used Chamfer or EMD loss [Fan et al., 2017], our DSS framework, when used as a loss function, is compatible with convolutional layers and is sensitive to the exact point distribution pattern.
Surface splatting is fundamental to our method. Splatting was developed for simple and efficient point set rendering and processing in early seminal point-based works [Pfister et al., 2000; Zwicker et al., 2001, 2002; Zwicker et al., 2004]. Recently, point-based techniques have gained much attention for their potential in geometric learning. To the best of our knowledge, we are the first to implement high-fidelity differentiable surface splatting.
3. Method
In essence, a differentiable renderer is designed to propagate image-level changes to scene-level parameters θ. This information can be used to optimize the parameters so that the rendered image I = R(θ) matches a reference image I*. Typically, θ includes the coordinates, normals and colors of the points, the camera position and orientation, as well as lighting. Formally, this can be formulated as an optimization problem
(1)  θ* = argmin_θ ℒ(R(θ), I*),
where ℒ is the image loss, measuring the distance between the rendered and reference images.
Methods to solve the optimization problem (1) are commonly based on gradient descent, which requires R to be differentiable with respect to θ. However, gradients w.r.t. the point coordinates p and normals n, i.e. ∂I/∂p and ∂I/∂n, are not defined everywhere, since R is a discontinuous function due to occlusion events and edges.
The key to our method is twofold. First, we define gradients ∂I/∂p and ∂I/∂n using the principle of finite differences. Second, to address the optimization difficulty that arises from the large number of degrees of freedom due to the unstructured nature of points, we introduce regularization terms that help obtain clean and smooth surface points.
In this section, we first review screen space EWA (elliptical weighted average) filtering [Zwicker et al., 2001; Heckbert, 1989], which we adopt to efficiently render high-quality realistic images from point clouds. Then we propose an occlusion-aware gradient definition for the rasterization step, which, unlike previously proposed differentiable mesh renderers, propagates gradients to depth and allows for large deformations. Lastly, we introduce two novel regularization terms for generating clean surface points.
3.1. Forward pass
Our forward pass closely follows the screen space elliptical weighted average (EWA) filtering described in [Zwicker et al., 2001]. In the following, we briefly review the derivation of EWA filters.
In a nutshell, the idea of screen space EWA is to apply an isotropic Gaussian filter to the attribute of a point in its tangent plane (defined by the normal at that point). The projection onto the image plane defines elliptical Gaussians, which, after truncation to bounded support, form a disk, or splat, as shown in Fig. 2. For a point p_k, we write the filter weight of the isotropic Gaussian at a position x in the tangent plane as
(2)  G_V(x − p_k) = (2π)⁻¹ |V|^(−1/2) exp(−(x − p_k)ᵀ V⁻¹ (x − p_k) / 2),  with V = σ² Id,
where σ is the standard deviation and Id is the identity matrix.
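As a concrete reference, the isotropic Gaussian weight of Eq. (2) can be sketched in a few lines (a minimal illustration, not the paper's implementation; the function name and default σ are our own):

```python
import numpy as np

def gaussian_weight(x, p, sigma=1.0):
    """Isotropic 2D Gaussian filter weight G_V(x - p) with V = sigma^2 * Id,
    evaluated in the tangent plane of point p (sketch of Eq. 2)."""
    d = np.asarray(x, float) - np.asarray(p, float)
    var = sigma ** 2
    # |V|^(1/2) = var for a 2x2 matrix sigma^2 * Id.
    return np.exp(-0.5 * d.dot(d) / var) / (2.0 * np.pi * var)
```

At the point center the weight peaks at 1/(2πσ²) and decays radially, which is what makes nearby splats blend smoothly.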
Now we consider the projected Gaussian in screen space. The point p_k and positions in its tangent plane are projected to the image plane by the camera projection. We write the Jacobian of this projection from the tangent plane to the image plane as J_k; we refer the reader to the original surface splatting paper [Zwicker et al., 2001] for the derivation of J_k. Then, at pixel x, the screen space elliptical Gaussian weight is
(3)  r_k(x) = |J_k⁻¹|⁻¹ G_{J_k V J_kᵀ}(x − x_k),
where x_k is the screen space projection of p_k. Note that r_k is determined by the point position p_k and the normal n_k, because J_k depends on p_k and n_k.
Next, a low-pass Gaussian filter with an identity variance matrix is convolved with Eq. (3) in screen space. The final elliptical Gaussian is
(4)  ρ_k(x) = |J_k⁻¹|⁻¹ G_{J_k V J_kᵀ + Id}(x − x_k).
In the final step, two sources of discontinuity are introduced into the fully differentiable ρ_k. First, for computational reasons, we limit the elliptical Gaussians to a bounded support in the image plane, setting ρ_k(x) = 0 for all x outside a cutoff radius C. Second, we set the Gaussian weights of occluded points to zero. Specifically, we keep a list of the closest points at each pixel position (at most K = 5), compute their depth difference to the front-most point, and set the Gaussian weights to zero for points that lie behind the front-most point by more than a threshold T (set as a small fraction of the bounding box diagonal length).
The resulting truncated Gaussian weight, denoted ρ̂_k, can be formally defined as
(5)  ρ̂_k(x) = 0 if p_k is occluded at x or x lies outside the cutoff radius, and ρ̂_k(x) = ρ_k(x) otherwise.
The final pixel value at position x is simply the normalized sum of all filtered point attributes w_k, i.e.,
(6)  I_x = Σ_k w_k ρ̂_k(x) / Σ_k ρ̂_k(x).
In practice, this summation can be greatly optimized by computing the bounding box of each ellipse and only considering points whose elliptical support covers the pixel .
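The per-pixel accumulation of Eqs. (5)-(6) can be sketched as follows (a simplified single-pixel version; the cutoff and depth-threshold defaults here are placeholders, not the paper's values):

```python
import numpy as np

def pixel_value(rhos, depths, attrs, cutoff=1e-5, depth_thresh=0.01):
    """Normalized sum of truncated Gaussian-weighted point attributes at
    one pixel (sketch of Eqs. 5-6). rhos: screen-space Gaussian weights
    of the candidate points, depths: their depths, attrs: their attributes."""
    rhos = np.asarray(rhos, float)
    depths = np.asarray(depths, float)
    attrs = np.asarray(attrs, float)
    front = depths[rhos > 0].min()            # front-most contributing point
    # Truncation: drop weights below the cutoff and points occluded by
    # more than depth_thresh behind the front-most point.
    keep = (rhos > cutoff) & (depths - front < depth_thresh)
    r = np.where(keep, rhos, 0.0)
    if r.sum() == 0:
        return 0.0
    return float((r * attrs).sum() / r.sum())
```

With two equally weighted visible points the result is the attribute average; pushing one point behind the depth threshold removes it from the sum.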
The point attribute w_k can be any per-point value, e.g., albedo color, shading, depth, normal vector, etc. In most of our experiments, we use diffuse shading under three orthogonally positioned RGB-colored sun lights. This way, w_k carries strong information about the point normals while remaining independent of the point positions (unlike with point lights), which greatly simplifies the factorization for gradient computation, as explained in Sec. 3.2. Fig. 3 shows some examples of rendered images. Unlike many pseudo renderers, which achieve differentiability by blurring edges and rendering surfaces transparently, our rendered images faithfully depict the actual geometry in the scene.
3.2. Backward pass
In the backward pass, we define an artificial gradient for the discontinuous rasterization function. We first reduce the discontinuous function to a step function that depends solely on the point position p_k, and then define the gradient w.r.t. p_k.
The discontinuity is encapsulated in the truncated Gaussian weights ρ̂_k described in Eq. (5). In order to fully utilize the automatic differentiation available in most optimization libraries, we factorize the discontinuous ρ̂_k into the fully differentiable term ρ_k and a discontinuous visibility term 𝕍_k, i.e. ρ̂_k = ρ_k 𝕍_k, where 𝕍_k is defined as
(7)  𝕍_k(x) = 0 if p_k is occluded at x or x lies outside the cutoff radius, and 𝕍_k(x) = 1 otherwise.
Since, compared to ρ_k, 𝕍_k only impacts the visibility of a small set of pixels around the ellipse, we further simplify the expression so that 𝕍_k is determined solely by p_k, i.e., 𝕍_k = 𝕍_k(p_k). Therefore, if we write the pixel value I_x as a function of w_k, ρ_k and 𝕍_k, then by the chain rule we have
(8)  ∂I_x/∂n_k = (∂I_x/∂w_k)(∂w_k/∂n_k) + (∂I_x/∂ρ_k)(∂ρ_k/∂n_k),
(9)  ∂I_x/∂p_k = (∂I_x/∂𝕍_k)(∂𝕍_k/∂p_k) + (∂I_x/∂ρ_k)(∂ρ_k/∂p_k),
where ∂𝕍_k/∂p_k is undefined at the edges of ellipses due to occlusion.
The construction of the gradient w.r.t. p_k despite the discontinuity of 𝕍_k comprises two key components. First, instead of considering ∂𝕍_k/∂p_k alone, we focus on the joint term (∂I_x/∂𝕍_k)(∂𝕍_k/∂p_k), since the additional color information conveyed in ∂I_x/∂𝕍_k enables us to define gradients only in the direction that decreases the image loss. Second, we replace the discontinuous dependence of a pixel color on the position of p_k with a continuous linear function, and define the gradient as ΔI_x/Δp_k, where ΔI_x and Δp_k denote the change of pixel value and point position, respectively. A schematic illustration of a 1D scenario is depicted in Fig. 4.
Intuitively, the joint term expresses the change of pixel values when varying p_k, assuming the shape and colors of the ellipse are fixed, which is a justified assumption under diffuse shading with sun lights. Whenever the change of pixel value incurred by the movement of p_k can decrease the image loss, an artificial gradient is created to push p_k in the corresponding direction.
A concrete example for a grayscale image is illustrated in Fig. 5. We are interested in a pixel x and a splat around p_k. The negative gradient of the image loss w.r.t. the pixel value, shown in Fig. 5(a), indicates the desired change of the pixel value in order to decrease the image loss; in this example, the pixel should become darker. In Fig. 5(b), p_k is not visible at x, which is rendered by another ellipse, or multiple lighter ellipses, in front of p_k. Since moving the darker splat to cover x darkens the pixel, we find the intersection of the viewing ray with the front-most ellipse rendered at x and define Δp_k toward that intersection. In case no ellipses are rendered at x, or the currently rendered ellipse lies behind p_k, as shown in Fig. 5(c), Δp_k points toward the intersection of the viewing ray with the plane of the ellipse of p_k, which is orthogonal to its principal axis. Finally, in Fig. 5(d), p_k refers to the brighter ellipse; moving it either toward or away from x reveals the darker splat behind it and thus darkens the pixel, creating two possible gradients in opposite directions, so Δp_k is obtained by averaging these two gradients. Notice that in the first case, Δp_k can have a non-zero value in the depth dimension, allowing for a depth update, while in the other cases defining Δp_k is equivalent to defining the gradient only on the image plane.
Given the translation vector Δp_k and assuming the pixel values have C channels, the artificial gradient is defined as
(10)  ∂I_x/∂p_k ≈ ΔI_x Δp_kᵀ / (‖Δp_k‖² + ε).
Here, ‖Δp_k‖ is the distance between p_k and the edge of the ellipse rendered at x. Intuitively, the further p_k needs to travel, the less impact it has on I_x, and vice versa. The value ε is a small constant that prevents the gradient from becoming extremely large when Δp_k approaches zero, which would lead to overshooting, oscillation and other convergence problems.
In order to compute ΔI_x as accurately as possible, we evaluate Eq. (6) after the movement of p_k while taking into account currently occluded ellipses. For this purpose, during the forward pass we cache an ordered list of the top-K (we choose K = 5) closest ellipses that project to each pixel, and save their ρ_k, w_k and depth values.
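The damped finite-difference gradient described around Eq. (10) can be sketched as follows; this is one plausible reading of the construction, and the exact formula and the value of ε are assumptions:

```python
import numpy as np

def artificial_gradient(delta_I, delta_p, eps=0.01):
    """Artificial gradient of a pixel value w.r.t. a point position:
    the predicted per-channel pixel change delta_I, spread along the
    travel direction delta_p, damped so that longer travel distances
    produce smaller gradients (hedged sketch of the Eq. 10 idea).
    Assumes delta_p is non-zero."""
    delta_I = np.atleast_1d(np.asarray(delta_I, float))   # C channels
    delta_p = np.asarray(delta_p, float)                  # 3D translation
    dist = np.linalg.norm(delta_p)
    # Outer product: one 3-vector gradient per color channel; eps keeps
    # the magnitude bounded as dist -> 0.
    return np.outer(delta_I, delta_p / dist) / (dist + eps)
```

Doubling the required travel distance roughly halves the gradient magnitude, matching the intuition that far-away splats influence a pixel less.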
Comparison to other differentiable renderers
In Paparazzi [Liu et al., 2018], the rendering function is simplified enough that the gradients can be computed analytically, which precludes silhouette changes, where handling significant occlusion events is required. The work most closely related to our approach in terms of gradient definition is the neural mesh renderer (NMR) [Kato et al., 2018]. Both methods construct the gradient based on the change of pixel values, but our method differs from NMR in the following aspects: (1) We consider the movement of p_k in 3D space, while NMR only considers movement in the image plane. As a result, we can optimize in the depth dimension even with a single view. (2) We define the gradient for all dimensions of p_k jointly. In contrast, NMR considers the 1D gradients separately and consequently only considers pixels along the x and y axes. (3) We compute the change of pixel value considering the full set of occluded and occluding ellipses projecting to pixel x. This not only leads to more accurate gradient values, but also encourages noisy points inside the model to move onto the surface, to a position with matching pixel color.
3.3. Surface regularization
The lack of structure in point clouds, while granting the freedom for massive topology changes, can pose a significant challenge for optimization. First, the gradient computation is entirely parallelized; as a result, points move irrespective of each other. Second, as the movement of points only induces small and sparse changes in the rendered image, the gradients on each point are less structured compared to the corresponding gradients for meshes. Without proper regularization, one can quickly end up in a local minimum.
Inspired by [Huang et al., 2009; Öztireli et al., 2009], we propose a regularization based on two parts: a repulsion and a projection term. The repulsion term aims at generating uniform point distributions by maximizing the distances between neighboring points on a local projection plane, while the projection term preserves clean surfaces by minimizing the distance from each point to the surface tangent plane.
Both terms require a reliable surface tangent plane. However, estimating it can be challenging, since during optimization, especially for multi-view joint optimization, intermediate point clouds can be very noisy and contain many occluded points inside the model. We therefore propose a weighted PCA that penalizes occluded inner points. In addition to the commonly used bilateral weights, which consider both the point-to-point Euclidean distance and the normal similarity, we propose a visibility weight that penalizes occluded points, since they are more likely to be outliers inside the model.
Let p_i denote the point in question and p_j one of the points in its neighborhood 𝒩(p_i). We compute a weighted PCA using the following weights:
(11)  θ_j = exp(−‖p_j − p_i‖² / σ_p²),
(12)  φ_j = exp(−(1 − n_jᵀ n_i) / σ_n),
(13)  ν_j = exp(−o_j),
where θ_j and φ_j are bilateral weights which favor neighboring points that are spatially close and have similar normal orientations, respectively, and ν_j is the proposed visibility weight, defined via an occlusion counter o_j that counts the number of camera views in which p_j is occluded. A reliable projection plane can then be obtained by singular value decomposition of the weighted vectors w_j (p_j − p_i), where w_j = θ_j φ_j ν_j. For the repulsion term, the projected point-to-point distance is obtained as d̃_ij = V_{1,2}ᵀ (p_j − p_i), where V_{1,2} contains the first two principal components. We define the repulsion loss as follows and minimize it together with the per-pixel image loss:
(14)  ℒ_r = −Σ_i Σ_{p_j ∈ 𝒩(p_i)} w_j ‖d̃_ij‖.
For the projection term, we minimize the point-to-plane distance d'_ij = V_3ᵀ (p_j − p_i), where V_3 is the last principal component. Correspondingly, the projection loss is defined as
(15)  ℒ_p = Σ_i Σ_{p_j ∈ 𝒩(p_i)} w_j ‖d'_ij‖².
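The two regularizers can be sketched with plain (unweighted) PCA as follows; the bilateral and visibility weights are omitted for brevity, so this is an illustration of the geometric idea rather than the full method:

```python
import numpy as np

def regularizers(p, neighbors):
    """Repulsion and projection terms for one point (hedged sketch,
    unweighted PCA). Repulsion penalizes clustering in the local
    tangent plane; projection penalizes distance from that plane."""
    p = np.asarray(p, float)
    q = np.asarray(neighbors, float)
    c = q.mean(0)
    # PCA of the neighborhood: the first two right-singular vectors span
    # the tangent plane, the last one approximates the surface normal.
    _, _, vt = np.linalg.svd(q - c, full_matrices=False)
    plane, normal = vt[:2], vt[2]
    d = q - p
    # Repulsion: negative mean in-plane distance to neighbors
    # (minimizing it pushes neighbors apart in the tangent plane).
    repulsion = -np.mean(np.linalg.norm(d @ plane.T, axis=1))
    # Projection: squared distance of p from the fitted plane.
    projection = float((p - c) @ normal) ** 2
    return repulsion, projection
```

For a point hovering 0.5 above a planar ring of neighbors, the projection term reports the squared offset 0.25, while the repulsion term reports the (negated) average in-plane spacing.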
The effects of the repulsion and projection terms are clearly demonstrated in Fig. 6 and Fig. 7. In Fig. 6, we aim to move points lying on a 2D grid to match the silhouette of a 3D teapot. Without the repulsion term, points quickly shrink to the center of the reference shape, a common local minimum since the gradients coming from surrounding pixels cancel each other out. With the repulsion term, the points can escape such local minima and distribute evenly inside the silhouette. In Fig. 7, we deform a sphere into a bunny using 12 views. Without the projection regularization, points are scattered within and outside the surface. In contrast, when the projection term is applied, we obtain a clean and smooth surface.
4. Implementation details
4.1. Optimization objective
We choose the Symmetric Mean Absolute Percentage Error (SMAPE) as the image loss ℒ_I. SMAPE is designed for high dynamic range images, such as rendered images, and behaves more stably for unbounded values [Vogels et al., 2018]. It is defined as
(16)  ℒ_I(I, I*) = (1 / (HW)) Σ_x Σ_c |I_{x,c} − I*_{x,c}| / (|I_{x,c}| + |I*_{x,c}| + ε),
where H and W are the dimensions of the image, c iterates over the color channels, and ε is a small positive constant.
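A minimal SMAPE implementation following Eq. (16) might look like this (the default value of ε here is an assumption):

```python
import numpy as np

def smape(img, ref, eps=0.01):
    """Symmetric Mean Absolute Percentage Error between a rendered image
    and a reference (sketch of Eq. 16); eps guards the denominator so
    the loss stays finite where both images are zero."""
    img = np.asarray(img, float)
    ref = np.asarray(ref, float)
    return np.mean(np.abs(img - ref) / (np.abs(img) + np.abs(ref) + eps))
```

Because the per-pixel error is normalized by the pixel magnitudes, bright HDR pixels cannot dominate the loss the way they would under a plain L1 or L2 metric.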
The total optimization objective corresponding to Eq. (1) for a set of V views amounts to
(17)  ℒ = Σ_{v=1}^{V} ℒ_I(I_v, I*_v) + γ_r ℒ_r + γ_p ℒ_p,
where the loss weights γ_r and γ_p balance the repulsion and projection terms against the image loss.
4.2. Alternating normal and point update
For meshes, the face normals are determined by the vertex positions. For points, though, normals and positions can be treated as independent entities and thus optimized individually. Our pixel value factorization in Eq. (8) and Eq. (9) means that the gradient on point positions mainly stems from the visibility term, while the gradients on normals are derived from w_k and ρ_k. Because the gradient w.r.t. p_k and n_k assumes the other quantity stays fixed, we update p_k and n_k in an alternating fashion: we start with the normals, execute a number of optimization steps, and then optimize the point positions for a number of steps.
As observed in many point denoising works [Öztireli et al., 2009; Huang et al., 2009; Guerrero et al., 2018], finding the right normals is key to obtaining clean surfaces. Hence, we utilize the improved normals even while the point positions are not being optimized, by directly updating the point positions using the gradients of the regularization terms ℒ_r and ℒ_p. In fact, for local surface modification, this simple strategy consistently yields satisfying results.
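The alternating schedule can be sketched as follows; the `step_normal` and `step_point` callbacks are hypothetical placeholders for single DSS update steps, and the default step counts are illustrative only:

```python
def alternating_optimization(points, normals, step_normal, step_point,
                             n_normal=15, n_point=25, cycles=2):
    """Alternate normal and position updates (sketch of Sec. 4.2):
    run n_normal normal steps, then n_point position steps, per cycle.
    step_normal(points, normals) -> normals and
    step_point(points, normals) -> points are caller-supplied."""
    for _ in range(cycles):
        for _ in range(n_normal):          # normals first, positions fixed
            normals = step_normal(points, normals)
        for _ in range(n_point):           # then positions, normals fixed
            points = step_point(points, normals)
    return points, normals
```

Each phase holds the other variable fixed, matching the assumption under which the gradients of Eqs. (8)-(9) were derived.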
4.3. Erroraware view sampling
View selection is very important for quick convergence. In our experiments, we aim to cover all viewing angles by sampling camera positions from an enclosing sphere using farthest-point sampling. We then randomly perturb the sampled positions and set the cameras to look at the center of the object. The sampling process is repeated periodically to further improve optimization.
However, for shapes with complex topology, such a sampling scheme is not enough. We propose an erroraware view sampling scheme which chooses the new camera positions based on the current image loss.
Specifically, we downsample the reference image and the rendered result, then compute the pixel position with the largest image error. We then find the points whose projections are closest to that pixel; the mean 3D position of these points becomes the center of focus. Finally, we sample camera positions on a sphere around this focal point at a relatively small distance. This technique helps us improve point positions in small holes during large shape deformations.
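The focal-point selection described above might be sketched as follows; the `project` callback (mapping 3D points to pixel coordinates) and the choice of k are assumptions for illustration:

```python
import numpy as np

def error_focus_point(err_map, points, project, k=5):
    """Pick the next camera focus (sketch of Sec. 4.3): the mean 3D
    position of the k points whose projections land nearest the pixel
    with the largest image error. err_map: 2D per-pixel error image;
    project: callback mapping an (N, 3) array to (N, 2) pixel coords."""
    err_map = np.asarray(err_map, float)
    row, col = np.unravel_index(np.argmax(err_map), err_map.shape)
    pix = np.array([col, row], float)          # (x, y) pixel coordinates
    dists = np.linalg.norm(project(points) - pix, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.asarray(points, float)[nearest].mean(0)
```

New cameras are then sampled on a small sphere around the returned focal point, concentrating views on the worst-reconstructed region.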
5. Results
We evaluate the performance of DSS by comparing it to state-of-the-art DRs, and demonstrate its applications in point-based geometry editing and filtering.
Our method is implemented in PyTorch [Paszke et al., 2017]; we use stochastic gradient descent with Nesterov momentum [Sutskever et al., 2013] for optimization. Separate learning rates are used for points and normals, and we reduce both by a factor of 0.5 if the total optimization loss stagnates for 15 optimization steps. In all experiments, we render with back-face culling and diffuse shading, using RGB sun lights fixed relative to the camera position. Unless otherwise stated, we optimize for up to 16 cycles of alternating normal and position optimization steps, with more steps per cycle for large deformations than for local surface editing. In each cycle, 12 randomly sampled views are used simultaneously for an optimization step. To test the noise resilience of our algorithm, we use random white Gaussian noise with a standard deviation measured relative to the diagonal length of the bounding box of the input model. We refer to Appendix A for a detailed discussion of parameter settings.
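The loss-stagnation learning rate schedule can be sketched in pure Python; PyTorch users could instead combine `torch.optim.SGD(..., nesterov=True)` with `torch.optim.lr_scheduler.ReduceLROnPlateau`. The improvement tolerance `eps` here is an assumption:

```python
def reduce_on_stagnation(lr, losses, patience=15, factor=0.5, eps=1e-6):
    """Halve the learning rate when the loss has not improved for
    `patience` consecutive steps (sketch of the schedule in Sec. 5).
    losses: full history of the total optimization loss so far."""
    if len(losses) <= patience:
        return lr
    # Stagnation: the best loss in the last `patience` steps is no
    # better than the best loss seen before that window.
    if min(losses[-patience:]) > min(losses[:-patience]) - eps:
        return lr * factor
    return lr
```

A steadily decreasing loss leaves the learning rate untouched, while a flat plateau of 15+ steps triggers the 0.5 reduction described above.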
5.1. Comparison of different DRs
(Fig. 8: initialization, target, optimization result, and MeshLab rendering of the result, for Paparazzi, NMR, and our method.)
We compare DSS to two state-of-the-art mesh-based DRs, NMR [Kato et al., 2018] and Paparazzi [Liu et al., 2018], in terms of large geometry deformation. We use the publicly available code provided by the authors and report the best results among experiments with different parameters (e.g., number of cameras and learning rate). All three methods use the same initial and target shape, and similar camera positions. Note that both NMR and DSS use a pinhole camera, while Paparazzi uses an orthographic projection. We directly propagate the image error through the DRs, without using additional neural networks to aid the point/vertex position updates.
As shown in Figure 8, NMR and Paparazzi cannot transform a sphere into the target teapot, mainly due to the limitations of the mesh representation. These two mesh-based DRs perform best at mapping and transferring image texture to geometry space, but are not designed for large-scale geometry deformation, which is vital for many 3D learning tasks.
Since no publicly available point-based DR is designed for geometry processing, we implement a naive point DR to verify the necessity of our gradient computation and surface regularization. The implementation follows existing point-based DRs [Roveri et al., 2018a; Roveri et al., 2018b; Insafutdinov and Dosovitskiy, 2018], where depth values are forward-projected as pixel intensities and an isotropic Gaussian filter is applied to the projected values so as to create gradients for point positions within the support of the Gaussian filter. As shown in Figure 10, such a naive point-based DR can handle neither large-scale shape deformation nor fine-scale denoising, because the position gradient is confined locally, restricting long-range movement, and normal information is not utilized for fine-grained geometry updates.
5.2. Application: shape editing via image filter
As demonstrated in Paparazzi, one important application of DR is shape editing using existing image filters. It allows many kinds of geometric filtering and style transfer, which would have been challenging to define purely in the geometry domain. This benefit also applies to DSS.
We experimented with two types of image filters, L0 smoothing [Xu et al., 2011] and superpixel segmentation [Achanta et al., 2012]. These filters are applied to the original rendered images to create the references. Like Paparazzi, we keep the silhouette of the shape and change the local surface geometry by updating point normals; the projection and repulsion regularizations are then applied to correct the point positions.
As shown in Fig. 11, DSS successfully transfers image-level changes to the geometry. Even under 1% noise, DSS continues to produce reasonable results. In contrast, mesh-based DRs are sensitive to input noise, which leads to small piecewise structures and flipped faces in image space (see Fig. 12) that are troublesome for the computation of gradients. Points, in comparison, are free of any structural constraints; thus, DSS can update normals and positions independently, which makes it robust under noise.
Denoising real Kinect scan data using our Pix2Pix-DSS.
5.3. Application: point cloud denoising
One of the benefits of the shape-from-rendering framework is the possibility to leverage powerful neural networks and vast 2D data. We demonstrate this advantage on a point cloud denoising task, a notoriously ill-posed problem where hand-crafted priors struggle to recover all levels of smooth and sharp features.
First, we adopt an off-the-shelf image translation network, Pix2Pix [Isola et al., 2017], to denoise the rendered images. In addition to a per-pixel L1 loss, Pix2Pix is supervised by an adversarial loss [Goodfellow et al., 2014] to add plausible details for improved visual quality. At test time, we render images of the noisy point cloud from different views and use the trained Pix2Pix network to reconstruct the geometric structure from the noisy images. Finally, we update the point cloud using DSS with the denoised images as references.
To synthesize training data for the Pix2Pix denoising network, we use the training set of the Sketchfab dataset [Yifan et al., 2018], which consists of 91 high-resolution 3D models. We use Poisson-disk sampling [Corsini et al., 2012], as implemented in MeshLab [Cignoni et al., 2008], to sample 20K points per model as reference points, and create noisy input points by adding white Gaussian noise; we then compute PCA normals [Hoppe et al., 1992] for both the reference and the input points. We generate training data by rendering a total of 149,240 pairs of images from the noisy and clean models using DSS, from a variety of viewpoints and distances, using a point light and diffuse shading. While sophisticated lighting, non-uniform albedo, and specular shading can provide useful cues for estimating global information such as lighting and camera positions, we find that glossy effects pose unnecessary difficulties for the network when inferring local geometric structure.
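The noise-and-normals part of this synthesis step can be sketched as follows (a sketch under our own assumptions: the wavy test surface, the neighborhood size `k=16`, and the noise scaling by the bounding-box diagonal are illustrative choices, not necessarily the paper's). The PCA normal of each point is the smallest-eigenvalue direction of its neighborhood covariance [Hoppe et al., 1992], obtained here via SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.uniform(-1, 1, (500, 2))
clean = np.column_stack([clean, 0.1 * np.sin(3 * clean[:, 0])])  # wavy sheet as test surface

diag = np.linalg.norm(clean.max(0) - clean.min(0))
noisy = clean + rng.normal(0, 0.01 * diag, clean.shape)          # "1% noise"

def pca_normals(points, k=16):
    # Brute-force k nearest neighbors, then per-point SVD of the
    # centered neighborhood; the last right-singular vector is the
    # direction of least variance, i.e. the estimated normal.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, :k]
    normals = np.empty_like(points)
    for i, idx in enumerate(knn):
        nbrs = points[idx] - points[idx].mean(0)
        _, _, vt = np.linalg.svd(nbrs, full_matrices=False)
        normals[i] = vt[-1]
    return normals

n = pca_normals(noisy)
```

Note that the sign of each normal is ambiguous; a consistent orientation step (e.g. propagation along a neighborhood graph) would follow in practice.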
To apply Pix2Pix to rendered content, we remove the tanh activation in its final layer to obtain unbounded pixel values (we refer readers to Appendix B for more details on the adapted architecture). To maximize the amount of hallucinated detail, we train two models, for 1.0% and 0.3% noise respectively. Fig. 15 shows some examples of the input and output of the network. Hallucinated delicate structures can be observed clearly at both noise levels. Furthermore, even though our Pix2Pix model is not trained with view-consistency constraints, the hallucinated details remain mostly consistent across views. Where small inconsistencies appear in regions with a large amount of hallucinated high-frequency detail, DSS is still able to transfer plausible details from the 2D to the 3D domain without visible artefacts, as shown in Fig. 17, thanks to the simultaneous multi-view optimization.
Evaluation of DSS denoising. We perform quantitative and qualitative comparisons with the state-of-the-art optimization-based methods WLOP [Huang et al., 2009], EAR [Huang et al., 2013], RIMLS [Öztireli et al., 2009] and GPF [Lu et al., 2018], as well as a learning-based method, PointCleanNet [Rakotosaona et al., 2019], using the code provided by the authors. For the quantitative comparison, we compute the Chamfer distance (CD) and the Hausdorff distance (HD) between the reconstructed and ground-truth surfaces.
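The two metrics, in their common point-set form, can be written as below (a sketch: the paper evaluates against reconstructed surfaces, whereas here two point sets are compared directly). The symmetric Chamfer distance averages nearest-neighbor distances in both directions; the Hausdorff distance takes the worst-case nearest-neighbor distance.

```python
import numpy as np

def nn_dists(a, b):
    # For each point in a, the distance to its nearest neighbor in b
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1)

def chamfer(a, b):
    # Symmetric: average both directions so neither under- nor over-coverage hides
    return nn_dists(a, b).mean() + nn_dists(b, a).mean()

def hausdorff(a, b):
    # Worst-case deviation in either direction
    return max(nn_dists(a, b).max(), nn_dists(b, a).max())

rng = np.random.default_rng(0)
gt = rng.normal(size=(100, 3))     # stand-in "ground truth" samples
rec = gt.copy()
rec[:, 0] += 0.05                  # a "reconstruction" shifted along x

cd, hd = chamfer(gt, rec), hausdorff(gt, rec)
```

The brute-force pairwise distance matrix is fine at this scale; for 100K-point clouds a KD-tree nearest-neighbor query would replace `nn_dists`.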
Table 2. Runtime and memory demand for one exemplary model per application.

model | application | number of points | opt. steps (position) | opt. steps (normal) | avg. forward time (ms) | avg. backward time (ms) | total time (s) | GPU memory
Fig. 8 | shape deformation | 8003 | 200 | 120 | 19.3 | 79.9 | 336 | 1.7 MB
Fig. 11 | L0 surface filtering | 20000 | 8 | 152 | 42.8 | 164.6 | 665 | 1.8 MB
Fig. 17 | denoising | 100000 | 8 | 152 | 258.1 | 680.2 | 1951 | 2.3 MB
First, we compare denoising performance on relatively noisy (1% noise) and sparse (20K points) input data, as shown in Fig. 16. Optimization-based methods reconstruct a smooth surface but also smear the low-level details. The learning-based PointCleanNet preserves some detailed structure, like the fingers of the armadillo, but cannot remove all high-frequency noise. This is mainly because the multilayer perceptrons (MLPs) used in PointNet++ [Qi et al., 2017b]-based networks are less effective than convolutional layers at learning multiple levels of detail from a large training dataset. We test DSS with two image filters, i.e., L0 smoothing and the Pix2Pix model trained on data with 20K points and 1% noise. DSS alone performs similarly to the optimization-based methods; Pix2Pix-DSS outperforms the other compared methods both quantitatively and qualitatively.

Second, we evaluate the ability to preserve fine-grained detail on relatively smooth (0.3% noise) and dense (100K points) input data, as shown in Fig. 17. Here, our Pix2Pix model is trained on data with 20K points. Optimization-based methods and DSS produce high-accuracy reconstructions, as the local surface is less contaminated and densely sampled. PointCleanNet suffers from overfitting to the characteristics of its training data, e.g., the number of sample points. In contrast, Pix2Pix-DSS is less sensitive to the point-sampling characteristics, thanks to the surface splatting in the image domain. As a result, although its reconstruction error is slightly higher than RIMLS and a direct Poisson reconstruction of the input, Pix2Pix-DSS reconstructs a clean surface with a great deal of hallucinated detail.
Finally, we validate the generalizability of the proposed image-to-geometry denoising method using real scanned data. First, we test with depth images acquired using a Kinect device [Wang et al., 2016]. Since the raw input is too sparse (2000 vertices), we use Poisson-disk sampling to resample 20K points and compute PCA normals. Then we use the Pix2Pix denoising model, trained on synthetic data with 1.0% white Gaussian noise, to denoise the rendered images. As shown in Fig. 13, the combination of neural image denoising and DSS generalizes well to different types of noise.
Furthermore, we acquire a 3D scan of a dragon model ourselves using a hand-held scanner and resample 50K points as input. We compare the point cloud cleaning performance of EAR, RIMLS, PointCleanNet, and our method in Fig. 18. EAR outputs clean and smooth surfaces but tends to produce underwhelming geometric detail. RIMLS preserves sharp geometric features, but compared to our method, its results contain more low-frequency noise. The output of PointCleanNet is notably noisier than that of the other methods, while its reconstructed model falls between EAR and RIMLS in terms of detail preservation and surface smoothness. In comparison, our method yields clean and smooth surfaces with rich geometric detail.
Input (top) and output (bottom) of the denoising network at 0.3% and 1.0% noise.
5.4. Performance
Our forward and backward rasterization passes are implemented in CUDA. We benchmark the runtime using an NVIDIA 1080Ti GPU with CUDA 10.0 and summarize the runtime as well as memory demand for all of the applications mentioned above on one exemplary model in Table 2. As before, models are rendered with resolution and 12 views are used per optimization step.
As a reference, for the teapot example, one optimization step in Paparazzi and the Neural Mesh Renderer takes about 50 ms and 160 ms respectively, whereas ours takes 100 ms (see the second row in Table 2). However, since Paparazzi does not jointly optimize multiple views, it requires more iterations to converge. In the L0 smoothing example (see Fig. 12), it takes 30 minutes and 30000 optimization steps to obtain the final result, whereas DSS needs 160 steps and 11 minutes for a similar result (see the third row in Table 2).
6. Conclusion and future work
In this paper, we showed how a high-quality splat-based differentiable renderer can be developed. DSS inherits the flexibility of point-based representations, can propagate gradients to point positions and normals, and produces accurate geometry and topology. This is possible due to the careful handling of gradients and regularization. We showcased several applications of how such a renderer can be utilized for image-based geometry processing; in particular, combining DSS with contemporary deep neural network architectures yielded state-of-the-art results.
There is a plethora of neural networks that provide excellent results on images for various applications such as stylization, segmentation, super-resolution, or finding correspondences, to name a few. Developing DSS is the first step in transferring these techniques from the image to the geometry domain. Another fundamental application of DSS is inverse rendering, where we try to infer scene-level information such as geometry, motion, materials, and lighting from images or video. We believe DSS will be instrumental in inferring dynamic scene geometries in multi-modal capture setups.
References
Achanta et al. [2012] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. 2012. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Analysis & Machine Intelligence 34, 11 (2012), 2274–2282.
Alexa et al. [2003] Marc Alexa, Johannes Behr, Daniel Cohen-Or, Shachar Fleishman, David Levin, and Claudio T Silva. 2003. Computing and rendering point set surfaces. IEEE Trans. Visualization & Computer Graphics 9, 1 (2003), 3–15.
Azinović et al. [2019] Dejan Azinović, Tzu-Mao Li, Anton Kaplanyan, and Matthias Nießner. 2019. Inverse Path Tracing for Joint Material and Lighting Estimation. arXiv preprint arXiv:1903.07145 (2019).
 Berger et al. [2017] Matthew Berger, Andrea Tagliasacchi, Lee M Seversky, Pierre Alliez, Gael Guennebaud, Joshua A Levine, Andrei Sharf, and Claudio T Silva. 2017. A survey of surface reconstruction from point clouds. In Computer Graphics Forum, Vol. 36. 301–329.
 Cignoni et al. [2008] Paolo Cignoni, Marco Callieri, Massimiliano Corsini, Matteo Dellepiane, Fabio Ganovelli, and Guido Ranzuglia. 2008. MeshLab: an OpenSource Mesh Processing Tool. In Eurographics Italian Chapter Conference.
 Corsini et al. [2012] Massimiliano Corsini, Paolo Cignoni, and Roberto Scopigno. 2012. Efficient and flexible sampling with blue noise properties of triangular meshes. IEEE Trans. Visualization & Computer Graphics 18, 6 (2012), 914–924.

Fan et al. [2017] Haoqiang Fan, Hao Su, and Leonidas J Guibas. 2017. A Point Set Generation Network for 3D Object Reconstruction from a Single Image. Proc. IEEE Conf. on Computer Vision & Pattern Recognition 2, 4, 6.
Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proc. Inter. Conf. on Artificial Intelligence and Statistics. 249–256.
Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NIPS).
Guerrero et al. [2018] Paul Guerrero, Yanir Kleiman, Maks Ovsjanikov, and Niloy J Mitra. 2018. PCPNet: Learning local shape properties from raw point clouds. In Computer Graphics Forum (Proc. of Eurographics), Vol. 37. 75–85.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition.
 Heckbert [1989] Paul S Heckbert. 1989. Fundamentals of texture mapping and image warping. (1989).
 Hermosilla et al. [2019] Pedro Hermosilla, Tobias Ritschel, and Timo Ropinski. 2019. Total Denoising: Unsupervised Learning of 3D Point Cloud Cleaning. arXiv preprint arXiv:1904.07615 (2019).
 Hermosilla et al. [2018] P. Hermosilla, T. Ritschel, PP Vazquez, A. Vinacua, and T. Ropinski. 2018. Monte Carlo Convolution for Learning on NonUniformly Sampled Point Clouds. ACM Trans. on Graphics (Proc. of SIGGRAPH Asia) 37, 6 (2018).
 Hoppe et al. [1992] Hugues Hoppe, Tony DeRose, Tom Duchamp, John McDonald, and Werner Stuetzle. 1992. Surface reconstruction from unorganized points. Proc. of SIGGRAPH (1992), 71–78.
 Hu et al. [2019] Tao Hu, Zhizhong Han, Abhinav Shrivastava, and Matthias Zwicker. 2019. Render4Completion: Synthesizing Multiview Depth Maps for 3D Shape Completion. arXiv preprint arXiv:1904.08366 (2019).
Huang et al. [2009] Hui Huang, Dan Li, Hao Zhang, Uri Ascher, and Daniel Cohen-Or. 2009. Consolidation of Unorganized Point Clouds for Surface Reconstruction. ACM Trans. on Graphics (Proc. of SIGGRAPH Asia) 28, 5 (2009), 176:1–176:7.
Huang et al. [2013] Hui Huang, Shihao Wu, Minglun Gong, Daniel Cohen-Or, Uri Ascher, and Hao Richard Zhang. 2013. Edge-Aware Point Set Resampling. ACM Trans. on Graphics 32, 1 (2013), 9:1–9:12.
Insafutdinov and Dosovitskiy [2018] Eldar Insafutdinov and Alexey Dosovitskiy. 2018. Unsupervised learning of shape and pose with differentiable point clouds. In Advances in Neural Information Processing Systems (NIPS). 2802–2812.

Isola et al. [2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition.
Karras et al. [2018] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In Proc. Int. Conf. on Learning Representations.
 Kato et al. [2018] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. 2018. Neural 3d mesh renderer. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition. 3907–3916.
 Langguth et al. [2016] Fabian Langguth, Kalyan Sunkavalli, Sunil Hadap, and Michael Goesele. 2016. Shadingaware multiview stereo. In Proc. Euro. Conf. on Computer Vision. Springer, 469–485.
Li et al. [2018] Tzu-Mao Li, Miika Aittala, Frédo Durand, and Jaakko Lehtinen. 2018. Differentiable Monte Carlo ray tracing through edge sampling. In ACM Trans. on Graphics (Proc. of SIGGRAPH Asia). ACM, 222.
Lin et al. [2018] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. 2018. Learning efficient point cloud generation for dense 3D object reconstruction. In AAAI Conference on Artificial Intelligence.
Lipman et al. [2007] Yaron Lipman, Daniel Cohen-Or, David Levin, and Hillel Tal-Ezer. 2007. Parameterization-free projection for geometry reconstruction. ACM Trans. on Graphics (Proc. of SIGGRAPH) 26, 3 (2007), 22:1–22:6.
Liu et al. [2017] Guilin Liu, Duygu Ceylan, Ersin Yumer, Jimei Yang, and Jyh-Ming Lien. 2017. Material editing using a physically based rendering network. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition. 2261–2269.
Liu et al. [2018] Hsueh-Ti Derek Liu, Michael Tao, and Alec Jacobson. 2018. Paparazzi: Surface Editing by way of Multi-View Image Processing. In ACM Trans. on Graphics (Proc. of SIGGRAPH Asia). ACM, 221.
Liu et al. [2019] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. 2019. Soft Rasterizer: A Differentiable Renderer for Image-based 3D Reasoning. arXiv preprint arXiv:1904.01786 (2019).
 Loper and Black [2014] Matthew M Loper and Michael J Black. 2014. OpenDR: An approximate differentiable renderer. In Proc. Euro. Conf. on Computer Vision. Springer, 154–169.
Lu et al. [2018] Xuequan Lu, Shihao Wu, Honghua Chen, Sai-Kit Yeung, Wenzhi Chen, and Matthias Zwicker. 2018. GPF: GMM-inspired feature-preserving point set filtering. IEEE Trans. Visualization & Computer Graphics 24, 8 (2018), 2315–2326.
 Maier et al. [2017] Robert Maier, Kihwan Kim, Daniel Cremers, Jan Kautz, and Matthias Nießner. 2017. Intrinsic3d: Highquality 3D reconstruction by joint appearance and geometry optimization with spatiallyvarying lighting. In Proc. Int. Conf. on Computer Vision. 3114–3122.
 Mao et al. [2017] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. 2017. Least squares generative adversarial networks. In Proc. Int. Conf. on Computer Vision. 2794–2802.
Nguyen-Phuoc et al. [2018] Thu H Nguyen-Phuoc, Chuan Li, Stephen Balaban, and Yongliang Yang. 2018. RenderNet: A deep convolutional network for differentiable rendering from 3D shapes. In Advances in Neural Information Processing Systems (NIPS). 7891–7901.
 Öztireli et al. [2009] A Cengiz Öztireli, Gael Guennebaud, and Markus Gross. 2009. Feature preserving point set surfaces based on nonlinear kernel regression. In Computer Graphics Forum (Proc. of Eurographics), Vol. 28. 493–501.
 Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPSW.
Petersen et al. [2019] Felix Petersen, Amit H Bermano, Oliver Deussen, and Daniel Cohen-Or. 2019. Pix2Vex: Image-to-Geometry Reconstruction using a Smooth Differentiable Renderer. arXiv preprint arXiv:1903.11149 (2019).
 Pfister et al. [2000] Hanspeter Pfister, Matthias Zwicker, Jeroen Van Baar, and Markus Gross. 2000. Surfels: Surface elements as rendering primitives. In Proc. Conf. on Computer Graphics and Interactive techniques. 335–342.
 Pontes et al. [2017] Jhony K Pontes, Chen Kong, Sridha Sridharan, Simon Lucey, Anders Eriksson, and Clinton Fookes. 2017. Image2Mesh: A Learning Framework for Single Image 3D Reconstruction. arXiv preprint arXiv:1711.10669 (2017).
 Qi et al. [2017a] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017a. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition.
Qi et al. [2017b] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. 2017b. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems (NIPS). 5099–5108.
 Rajeswar et al. [2018] Sai Rajeswar, Fahim Mannan, Florian Golemo, David Vazquez, Derek Nowrouzezahrai, and Aaron Courville. 2018. Pix2Scene: Learning Implicit 3D Representations from Images. (2018).
Rakotosaona et al. [2019] Marie-Julie Rakotosaona, Vittorio La Barbera, Paul Guerrero, Niloy J Mitra, and Maks Ovsjanikov. 2019. PointCleanNet: Learning to Denoise and Remove Outliers from Dense Point Clouds. arXiv preprint arXiv:1901.01060 (2019).
Richardson et al. [2017] Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. 2017. Learning detailed face reconstruction from a single image. In IEEE Trans. Pattern Analysis & Machine Intelligence. 1259–1268.
Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In Inter. Conf. on Medical Image Computing and Computer-Assisted Intervention. Springer, 234–241.

Roveri et al. [2018a] Riccardo Roveri, A Cengiz Öztireli, Ioana Pandele, and Markus Gross. 2018a. PointProNets: Consolidation of point clouds with convolutional neural networks. In Computer Graphics Forum (Proc. of Eurographics), Vol. 37. 87–99.
Roveri et al. [2018b] Riccardo Roveri, Lukas Rahmann, Cengiz Oztireli, and Markus Gross. 2018b. A network architecture for point cloud classification via automatic depth images generation. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition. 4176–4184.
Sengupta et al. [2018] Soumyadip Sengupta, Angjoo Kanazawa, Carlos D Castillo, and David W Jacobs. 2018. SfSNet: Learning Shape, Reflectance and Illuminance of Faces 'in the Wild'. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition. 6296–6305.
 Shi et al. [2017] Jian Shi, Yue Dong, Hao Su, and Stella X. Yu. 2017. Learning NonLambertian Object Intrinsics Across ShapeNet Categories. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition.
 Sitzmann et al. [2018] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. 2018. DeepVoxels: Learning Persistent 3D Feature Embeddings. arXiv preprint arXiv:1812.01024 (2018).
 Sutskever et al. [2013] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In Proc. IEEE Int. Conf. on Machine Learning. 1139–1147.
Tulsiani et al. [2017] Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. 2017. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition. 2626–2634.
 Vogels et al. [2018] Thijs Vogels, Fabrice Rousselle, Brian McWilliams, Gerhard Röthlin, Alex Harvill, David Adler, Mark Meyer, and Jan Novák. 2018. Denoising with kernel prediction and asymmetric loss functions. ACM Trans. on Graphics 37, 4 (2018), 124.
 Wang et al. [2016] PengShuai Wang, Yang Liu, and Xin Tong. 2016. Mesh denoising via cascaded normal regression. ACM Trans. on Graphics (Proc. of SIGGRAPH Asia) 35, 6 (2016), 232–1.
Xu et al. [2011] Li Xu, Cewu Lu, Yi Xu, and Jiaya Jia. 2011. Image smoothing via L0 gradient minimization. In ACM Transactions on Graphics (TOG), Vol. 30. ACM, 174.
Yan et al. [2016] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. 2016. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In Advances in Neural Information Processing Systems (NIPS). 1696–1704.
Yao et al. [2018] Shunyu Yao, Tzu Ming Hsu, Jun-Yan Zhu, Jiajun Wu, Antonio Torralba, Bill Freeman, and Josh Tenenbaum. 2018. 3D-aware scene manipulation via inverse graphics. In Advances in Neural Information Processing Systems (NIPS). 1887–1898.
Yifan et al. [2018] Wang Yifan, Shihao Wu, Hui Huang, Daniel Cohen-Or, and Olga Sorkine-Hornung. 2018. Patch-based Progressive 3D Point Set Upsampling. arXiv preprint arXiv:1811.11286 (2018).
Yu et al. [2018] Lequan Yu, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. 2018. EC-Net: an Edge-aware Point set Consolidation Network. Proc. Euro. Conf. on Computer Vision (2018).
Zhu et al. [2017] Rui Zhu, Hamed Kiani Galoogahi, Chaoyang Wang, and Simon Lucey. 2017. Rethinking reprojection: Closing the loop for pose-aware shape reconstruction from a single image. In Proc. Int. Conf. on Computer Vision. 57–65.
 Zwicker et al. [2002] Matthias Zwicker, Mark Pauly, Oliver Knoll, and Markus Gross. 2002. Pointshop 3D: An interactive system for pointbased surface editing. In ACM Trans. on Graphics, Vol. 21. ACM, 322–329.
 Zwicker et al. [2001] Matthias Zwicker, Hanspeter Pfister, Jeroen Van Baar, and Markus Gross. 2001. Surface splatting. In Proc. Conf. on Computer Graphics and Interactive techniques. ACM, 371–378.
Zwicker et al. [2004] Matthias Zwicker, Jussi Räsänen, Mario Botsch, Carsten Dachsbacher, and Mark Pauly. 2004. Perspective accurate splatting. In Proc. of Graphics Interface. Canadian Human-Computer Communications Society, 247–254.
Appendix A Parameter discussion
Here, we describe the effects of all hyperparameters of our method.
Forward rendering. The required hyperparameters consist of the cutoff threshold, the merge threshold, and the standard deviation. For all these parameters, we closely follow the default settings in the original EWA paper [Zwicker et al., 2001]. For close camera views, the default value is increased so that the splats are large enough to create hole-free renderings.
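The role of the cutoff threshold can be sketched as a truncated screen-space Gaussian in the spirit of EWA surface splatting [Zwicker et al., 2001] (a sketch: the variance matrix and the cutoff value below are illustrative, not our implementation's constants): the splat's weight falls off with the Mahalanobis distance of the pixel offset `d` under the 2x2 variance matrix `V`, and is cut to zero once the exponent exceeds the threshold.

```python
import numpy as np

def splat_weight(d, V, cutoff=3.0):
    # Truncated screen-space Gaussian: zero outside the cutoff ellipse
    q = 0.5 * d @ np.linalg.inv(V) @ d   # Mahalanobis-style exponent
    return np.exp(-q) if q < cutoff else 0.0

V = np.array([[2.0, 0.3],
              [0.3, 1.0]])               # an anisotropic screen-space splat

w_center = splat_weight(np.zeros(2), V)          # weight at the splat center
w_near = splat_weight(np.array([0.5, 0.0]), V)   # inside the cutoff ellipse
w_far = splat_weight(np.array([10.0, 10.0]), V)  # beyond the cutoff: exactly zero
```

A larger standard deviation inflates `V` and thus widens the ellipse, which is why increasing it for close camera views closes the holes between neighboring splats.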
Backward rendering. The only hyperparameter is the size of the cache used for logging points that are projected to each pixel. The larger it is, the more accurate the result becomes, as more occluded points can be considered for the re-evaluation of (6). We find our choice to be sufficiently large for our experiments.
Regularization. Bandwidths and for computing weights in (11) and (12) are set as suggested in previous work [Huang et al., 2009; Öztireli et al., 2009]. Specifically, , where is the diagonal length of the bounding box of the initial shape and is the number of points; is set to to encourage a smooth surface under the presence of outliers. For large-scale deformation, where the intermediate results can have more outliers, we set of the projection term to a higher value, e.g. , which helps to pull the outliers to the nearest surface.
Optimization. The learning rate has a substantial impact on convergence. In our experiments, we set the learning rates for position and normal to 5 and 5000 respectively; these values generally work well for all applications. Higher learning rates cause the points to converge faster but increase the risk of the points gathering in clusters. A more sophisticated optimization algorithm could yield a more robust optimization process, but this is beyond the scope of this paper. A sufficient number of views per optimization step is key to a good result in the ill-posed 2D-to-3D formulation. Twelve camera views are used in all our experiments; with 8 or fewer views, results start to degenerate. The number of steps for the point and normal updates, and , differs per application. In general, for large topology changes we set , where typically and , while for local geometry processing with and . Finally, we find the loss weights for the image loss , projection regularization and repulsion regularization , by ensuring that the magnitude of the per-point gradient from and is around of that from . If the repulsion weight is too large, e.g. , points can be repelled far off the surface, while if the projection weight is too large, e.g. , points are forced to stay on a local surface, making topology changes difficult.
Appendix B Network details
Our model is based on Pix2Pix [Isola et al., 2017], which consists of a generator and a discriminator. For the generator, we experimented with U-Net [Ronneberger et al., 2015] and ResNet [He et al., 2016], and find that ResNet performs slightly better in our task; we therefore use it for all experiments. That is, the generator has a 2-stride convolution and a 2-stride up-convolution for the encoder and decoder networks, with 9 residual blocks in between. The discriminator follows the architecture C64-C128-C256-C512-C1, and an LSGAN [Mao et al., 2017] objective is used. To deal with checkerboard artefacts, we use pixel-wise normalization in the generator and add a 1-stride convolutional layer after each deconvolutional layer in the discriminator [Karras et al., 2018]. We use the default parameters of the authors' PyTorch implementation of Pix2Pix and the ADAM optimizer (). Xavier initialization [Glorot and Bengio, 2010] is used for the weights. We train our models for about two days on an NVIDIA 1080Ti GPU.