Black-Box Test-Time Shape REFINEment for Single View 3D Reconstruction

Much recent progress has been made in reconstructing the 3D shape of an object from an image of it, i.e. single view 3D reconstruction. However, it has been suggested that current methods simply adopt a "nearest-neighbor" strategy, instead of genuinely understanding the shape behind the input image. In this paper, we rigorously show that for many state of the art methods, this issue manifests as (1) inconsistencies between coarse reconstructions and input images, and (2) inability to generalize across domains. We thus propose REFINE, a postprocessing mesh refinement step that can be easily integrated into the pipeline of any black-box method in the literature. At test time, REFINE optimizes a network per mesh instance, to encourage consistency between the mesh and the given object view. This, along with a novel combination of regularizing losses, reduces the domain gap and achieves state of the art performance. We believe that this novel paradigm is an important step towards robust, accurate reconstructions, remaining relevant as new reconstruction networks are introduced.


1 Introduction

Single view 3D reconstruction is the problem of reconstructing the 3D shape of an object from an image of it. Despite tremendous recent progress, the problem remains a significant challenge for computer vision. Figure 1 shows reconstructions by several state of the art methods. While the reconstructed shape (bottom row) reflects the category of the object in the image (top row), many details that determine fine-grained identity are lost. This happens even without ambiguity from self-occlusion, e.g. for object parts clearly visible in the image (circled in the figure). In contrast, unseen shape regions in the image tend to be reconstructed as well as visible ones. This suggests that current methods simply recognize the object, perform a “nearest-neighbor” search for a “mean class shape” memorized during training [tatarchenko2019single], and make slight adjustments that are usually not enough to recover intricate shape details.

This problem can be subtle when training and test distributions are similar, in which case nearest-neighbor produces decent reconstructions. However, it is magnified when there is a domain shift, as illustrated in Figure 2. When a reconstruction network trained on ShapeNet [chang2015shapenet] is applied to real images (Pix3D [sun2018pix3d]), reconstruction can fail severely. Distribution shift can often be mitigated with domain adaptation techniques [sun2016deep, ganin2016domain, tzeng2017adversarial, long2017deep]. However, because they align entire probability distributions, these methods are likely to only help produce better mean class shape reconstructions. It is unclear that they suffice to recover the lost shape details of Figure 1. They also require sizeable amounts of data from the target domain. A more ambitious but practical solution is to bridge shifts to unknown target domains by performing optimization at test time, in a self-supervised manner [sun2020test].

Figure 1: Important image details (circled in green) are frequently lost by state of the art 3D reconstruction methods (circled in red) [kanazawa2018learning, li2020self, mescheder2019occupancy, wang2018pixel2mesh, xu2019disn].

Recently, [remelli2020meshsdf] has shown that test-time optimization can improve shape reconstructions, by introducing a differentiable explicit surface mesh representation for Deep Signed Distance Function (DeepSDF) [park2019deepsdf] based reconstruction networks. This enabled the MeshSDF algorithm, which at test time fine-tunes a DeepSDF based network’s weights to the target object using its silhouette. While test-time fine-tuning opens exciting possibilities for solving the problem of Figure 1, it has the limitations inherent to a white-box technique. First, the original shape reconstruction network must be available and accessible for retraining. This is not always easy, given the complexity of many state of the art reconstruction procedures. Second, the test-time optimization is specific to that network. For example, it is not trivial to extend the method of [remelli2020meshsdf] to approaches like Mesh R-CNN [gkioxari2019mesh], which utilizes an intermediate voxel representation, or AtlasNet [groueix2018papier], which represents meshes using atlas surface elements.

This suggests the need for black-box single view reconstruction refinement, which would not suffer from these problems. Unfortunately, no black-box refinement method currently exists. Thus, in this work we design a dedicated refinement network external to the reconstruction model, which refines the mesh shape produced by the latter, a posteriori, as illustrated in Figure 2. In contrast to network fine-tuning, black-box refinement has the advantage of being model agnostic and applicable even to meshes produced by third-party networks (e.g. [choy20163d, tatarchenko2017octree, mescheder2019occupancy, groueix2018papier, kanazawa2018learning]) unavailable at test time. Furthermore, it allows the joint development of network architectures and loss functions tailored to solving the problem of Figure 1 at test time. This is important because, when compared to reconstruction networks trained with large datasets, test-time refinement networks must be much more efficient and less prone to overfitting.

In particular, our black-box test-time shape refinement procedure for single view reconstruction follows the test-time formulation of [remelli2020meshsdf], i.e. it seeks the test-time shape refinement that best matches an object’s silhouette, but abstracts this refinement from the original shape reconstruction. Rather than simply finetuning the reconstruction network, we introduce REFINE (a recursive acronym for REFine INstances at Evaluation), a procedure implemented with a postprocessing network whose parameters are optimized at test time using the given reconstructed shape and object silhouette. The REFINE optimization is performed at instance level, i.e. each instance is refined independently with re-initialized parameters. It leverages a novel combination of loss functions, encouraging both silhouette consistency and confidence-based mesh symmetry, to produce mesh displacements. Using a variety of metrics, it is systematically shown that the reconstruction quality of existing networks degrades when training and testing distributions differ. The degradation is studied on (1) different renderings of a synthetic dataset and (2) real world datasets. It is then shown that REFINE improves the shapes reconstructed by various reconstruction networks, on-the-fly, without relying on prior knowledge of the target distribution. This holds even for large domain gaps (see Figure 2), and REFINE outperforms the state of the art test-time finetuning method of [remelli2020meshsdf] even for DeepSDF reconstructions.

Overall, this work makes three main contributions. The first is to characterize the inconsistency between the input image and the mesh reconstructed by existing networks. This shows that these networks (1) produce coarse, class-averaged shapes and (2) are unable to handle distribution shift between training and test data. Second, a new black-box test-time shape REFINEment framework, based on instance-level test-time optimization, is proposed to overcome the problem. Finally, extensive experiments demonstrate that REFINE achieves state of the art performance for test-time shape refinement, leading to significantly more accurate shapes than those synthesized by current reconstruction networks, especially under distribution shift.

Figure 2: Black-box test-time shape refinement. Reconstructions by a network trained on ShapeNet are fed to our proposed external shape REFINEment network (in green) at test-time. REFINE improves reconstructions for images in both the training domain (top, image also from ShapeNet), and unknown test domains (bottom, image from Pix3D).

2 Related Work

In this section, we briefly review the related literature in single view reconstruction and test-time optimization.

Single View 3D Reconstruction.

While many single view 3D reconstruction methods have been proposed, they all suffer from the inconsistencies of Figure 1, and can benefit from REFINE. The main 3D output modalities are voxels, pointclouds, and meshes. Voxel based methods typically encode an image into a latent vector, which is decoded into a 3D voxel grid with upsampling 3D transposed convolutions [xie2019pix2vox, choy20163d]. Octrees can enable higher voxel resolution [tatarchenko2017octree, wang2017cnn]. Pointclouds have been explored as an alternative to voxels [fan2017point, lin2017learning] but usually require voxel or mesh conversion for use by downstream tasks. Among mesh based methods, some learn to displace vertices on a sphere [kato2018neural, wang2018pixel2mesh] or a mean shape [kanazawa2018learning] to produce the output. Current state of the art methods rely on an intermediate implicit function representation to describe shape [mescheder2019occupancy, xu2019disn, park2019deepsdf, genova2020local, niemeyer2020differentiable], mapped into a mesh by marching cubes [lorensen1987marching].

Methods also vary by their level of supervision. Most are fully supervised, requiring a large dataset of 3D shapes such as ShapeNet [chang2015shapenet]. Recently, weakly-supervised methods have also been introduced, using semantic keypoints [kanazawa2018learning] or 2.5D sketches [wu2017marrnet] as supervision. Alternatively, [li2020self] has proposed a fully unsupervised method, combining part segmentation and differentiable rendering. The few-shot setting, where classes have limited training data, is considered in [wallace2019few, michalkiewicz2020few]. Domain adaptation was explored in [pinheiro2019domain], which assumes access to data from a known target domain.

Despite progress in single view 3D reconstruction, questions arise about what is actually being learned. In particular, [tatarchenko2019single] shows that simple nearest-neighbor model retrieval can beat state of the art reconstruction methods. This raises concerns that current methods bypass genuine reconstruction, simply combining image recognition and shape memorization. Such memorization is consistent with Figures 1 and 2, leading to suboptimal reconstructions and an inability to generalize across domains. It is likely a consequence of learning the reconstruction network over a training set of many instances from the same class. In contrast, REFINE uses test-time optimization to refine a single shape, encouraging consistency with a single silhouette. This prevents memorization, directly addressing the concerns of [tatarchenko2019single]. It also makes REFINE complementary to reconstruction methods and applicable as a postprocessing stage to any of them.

Test-Time Optimization.

Test-time optimization usually exploits inherent structure of the data in a self-supervised manner, as no ground truth labels are available. For example, [sun2020test] leverages an auxiliary self-supervised rotation angle prediction task at test time to reduce domain shift in object classification. The same goal is achieved in [wang2020fully] by test-time entropy minimization. Meanwhile, [tung2017self] uses self-supervision at test time to improve human motion capture. Additionally, interactive user feedback serves to dynamically optimize segmentation predictions [sakinis2019interactive, sofiiuk2020f, jang2019interactive].

Figure 3: The REFINE network architecture. Given an original mesh reconstruction with missing details, the network outputs the vertex translations needed to make the refined mesh consistent with an input image. It consists of a silhouette feature map encoder, whose outputs create a shape graph. This is refined by graph convolutions and two fully connected branches, which output the refined mesh and symmetry confidence weights. The latter enforce optimized 3D symmetry constraints. Several losses help avoid overfitting to the input viewpoint. Optimization is performed over single examples, at test time.

Test-Time Shape Refinement.

Test-time shape refinement is a postprocessing procedure that improves the accuracy of meshes produced by a reconstruction network. Most previous approaches are white-box methods, i.e. they are specific to a particular model (or class of models) and require access to the internal workings of the model. Examples include methods that exploit temporal consistency in videos, akin to multi-view 3D reconstruction [li2020onlineAda, lin2019photometric]. [li2020onlineAda] requires the unsupervised part-based video reconstruction architecture proposed by its authors, and [zuffi2019three] optimizes over a shape space specific to its architecture for zebra images. Among white-box methods, the approach closest to REFINE is that of [remelli2020meshsdf], which finetunes the weights of the reconstruction network at test time, to better match the object silhouette. But even this method is specific to signed distance function (DeepSDF [park2019deepsdf]) networks. By instead adopting the black-box paradigm, where the mesh refinement step is intentionally decoupled from the reconstruction process, REFINE can learn vertex-based deformations for a mesh generated by any reconstruction architecture. Our experiments show that it can be effectively applied to improve the reconstruction performance of many networks and achieves state of the art results for test-time shape refinement, even outperforming [remelli2020meshsdf] for DeepSDF networks. In summary, unlike prior approaches, REFINE is a black-box technique that can be universally applied to improve reconstruction accuracy, a posteriori.

Method REFINE OccNet [mescheder2019occupancy] MeshSDF [remelli2020meshsdf] Pix2Mesh [wang2018pixel2mesh] AtlasNet [groueix2018papier]
Params. (Mil.) 0.9 12.7 13.2 18.8 20.3
Table 1: Parameter size comparisons between the REFINE network and popular single view reconstruction models.

3 Black-Box Test-Time Shape Refinement

In this section we introduce REFINE, detailing its neural network architecture, losses, and training procedure.

3.1 Formulation and Inputs

Single view 3D reconstruction methods reconstruct a 3D object shape from a single image of the object. An RGB image $I \in \mathbb{R}^{W \times H \times 3}$ of width $W$ and height $H$ is mapped to a mesh $\mathcal{M}$ by a reconstruction network $f$,

$$\mathcal{M} = f(I) = (V, E), \tag{1}$$

i.e., where $V \in \mathbb{R}^{|V| \times 3}$ is a set of vertices and $E \in \mathbb{B}^{|V| \times |V|}$ a set of edges. $\mathbb{B} = \{0, 1\}$ is a boolean domain specifying mesh connectivity.

$\mathcal{M}$ is usually a coarse shape estimate, whose details do not match the input image, as shown in Figure 1. Performance further degrades when $I$ is sampled from an image distribution different from that used during training [pinheiro2019domain].

It was proposed in [remelli2020meshsdf] to address this issue by optimizing the parameters of $f$ on-the-fly during inference, given a coarsely reconstructed mesh $\mathcal{M}$, an object silhouette $S$, and the camera pose $\theta_{cam}$. We refer to this problem as test-time shape refinement (TTSR). We investigate an alternative class of solutions to the TTSR problem, which abstracts shape refinement from the reconstruction network $f$. These black-box solutions implement a refinement mapping $\mathcal{M}' = g(\mathcal{M}, S, \theta_{cam})$ with a dedicated refinement network $g$ external to $f$. The network $g$ is trained at test-time, so that $\mathcal{M}'$ is a 3D mesh that more accurately approximates the object shape, as illustrated in Figure 2. We denote the approach as REFINE and $g$ as the REFINEment network. In this formulation, $g$ predicts a set of 3D displacements $D = \{d_i\}_{i=1}^N$ for the vertices in $V$. These are used to compute the REFINEd mesh $\mathcal{M}' = (V + D, E)$, whose render best matches the silhouette $S$. This set of displacements is complemented by a set of symmetry confidence scores $C = \{c_i\}_{i=1}^N$, which regularizes $g$ through a symmetry prior, as elaborated upon in Section 3.3.

Several advantages derive from the abstraction of refinement from reconstruction. First, this makes REFINE a black-box technique, applicable to any network $f$. In fact, the network $f$ does not even have to be available, only the mesh $\mathcal{M}$, which gives REFINE a great deal of flexibility. For example, while MeshSDF can only be used with DeepSDF networks, REFINE is applicable even to voxel and pointcloud reconstruction methods, by using mesh conversions [kazhdan2013screened, kazhdan2006poisson, calakli2011ssd, bernardini1999ball]. This property is important, as different methods are better suited for different downstream applications. For example, implicit methods [mescheder2019occupancy, park2019deepsdf] tend to produce the best reconstructions but can have slow inference [park2019deepsdf]. Meanwhile, AtlasNet [groueix2018papier] is less accurate but much more efficient, and inherently provides a parametric patch representation useful for downstream applications like shape correspondence. A second advantage of black-box refinement is that, because the refinement network $g$ and the loss functions used to train it are independent of the reconstruction network $f$, they can be specialized to the test-time shape refinement goal. This is important because special considerations must usually be taken to regularize $g$ when compared to $f$, since test-time training is based on a single mesh instance, rather than a large dataset. To avoid overfitting, we design $g$ to be much smaller than $f$. As shown in Table 1, the REFINE network is at least 10 times smaller than most currently popular reconstruction networks. We also propose several novel loss functions, tailored for test-time training, to regularize $g$.
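To make the black-box interface concrete, the sketch below expresses the refinement mapping in PyTorch-style code. The `refiner` module, argument names, and tensor shapes are illustrative assumptions, not the authors' released implementation; the only point is that the mapping consumes the coarse mesh, silhouette, and camera pose, and never the reconstruction network $f$ itself.

```python
import torch

def black_box_refine(vertices, edges, silhouette, camera, refiner):
    """Sketch of the REFINE formulation: the refiner g sees only the coarse mesh
    (V, E), the object silhouette S and the camera pose, never the reconstruction
    network f that produced the mesh, which is what makes the procedure black-box.

    vertices:   (N, 3) float tensor of mesh vertices V.
    edges:      mesh connectivity E (any representation; passed through unchanged).
    silhouette: (H, W) binary object silhouette S.
    camera:     camera pose used to render and compare against the silhouette.
    refiner:    the REFINE network g, optimized at test time for this one instance.
    """
    displacements, confidences = refiner(vertices, silhouette, camera)  # (N, 3), (N,)
    refined_vertices = vertices + displacements
    return (refined_vertices, edges), confidences
```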

3.2 Architecture

Figure 3 summarizes the architecture of the REFINE network. This is a combination of an encoder $\mathcal{E}$ and a graph refiner $\mathcal{R}$, followed by two branches $h_d$ and $h_c$, which predict the vertex displacements and vertex confidence scores respectively. The encoder module $\mathcal{E}$ contains neural network layers of parameters $\theta_{\mathcal{E}}$, takes the silhouette $S$ as input, and outputs a set of feature maps $\{F_k\}_{k=1}^K$, of width $W_k$, height $H_k$ and $C_k$ channels. In our implementation, $\mathcal{E}$ is based on ResNet [he2016deep] and $K$ is set to 2, with feature maps of two different spatial resolutions.

Given feature map $F_k$, the feature vector $p_i^k$ corresponding to a vertex $v_i$ in $V$ is computed by projecting the vertex position onto the feature map [gkioxari2019mesh, wang2018pixel2mesh],

$$p_i^k = F_k\big(\pi(v_i; \theta_{cam})\big), \tag{2}$$

where $\theta_{cam}$ is the camera viewpoint and $\pi(\cdot)$ a perspective projection with bilinear interpolation. Vertices are represented at different resolutions, by concatenating the feature vectors of different layers into $p_i = [p_i^1, \ldots, p_i^K]$. The set $P = \{p_i\}$ of concatenated feature vectors extracted from all vertices is then processed by a graph convolution [kipf2016semi] refiner $\mathcal{R}$, of parameters $\theta_{\mathcal{R}}$, to produce an improved set of feature vectors $P' = \{p'_i\}$. Finally, this set is mapped into the displacement vector

$$d_i = h_d(p'_i; \theta_d) \tag{3}$$

by a fully connected branch $h_d$ of parameters $\theta_d$ and into the confidence score

$$c_i = h_c(p'_i; \theta_c) \tag{4}$$

by a fully connected branch $h_c$ of parameters $\theta_c$. Overall, the REFINE network implements the mapping

$$(D, C) = g(\mathcal{M}, S, \theta_{cam}; \theta_g), \qquad \theta_g = (\theta_{\mathcal{E}}, \theta_{\mathcal{R}}, \theta_d, \theta_c). \tag{5}$$
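As an illustration of the vertex feature extraction in (2), the following sketch samples each encoder feature map at the projected vertex locations with bilinear interpolation. The `project` callable (the perspective projection under the input camera) and the tensor shapes are assumptions made for the example, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def project_vertex_features(feature_maps, vertices, project):
    """Sample one feature vector per vertex from each encoder feature map (sketch of Eq. 2).

    feature_maps: list of (1, C_k, H_k, W_k) tensors from the silhouette encoder.
    vertices:     (N, 3) mesh vertices.
    project:      callable mapping (N, 3) vertices to normalized image coordinates
                  in [-1, 1], i.e. the perspective projection under the input camera.
    """
    coords = project(vertices)                       # (N, 2) normalized coordinates
    grid = coords.view(1, 1, -1, 2)                  # grid_sample expects (B, H_out, W_out, 2)
    per_level = []
    for fmap in feature_maps:
        sampled = F.grid_sample(fmap, grid, mode='bilinear', align_corners=False)
        per_level.append(sampled.view(fmap.shape[1], -1).t())   # (N, C_k)
    return torch.cat(per_level, dim=1)               # (N, sum_k C_k), one row per vertex
```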

3.3 Optimization

The REFINE optimization combines popular reconstruction losses with novel losses tailored for test-time shape refinement. In what follows we use $R(\mathcal{M}; \theta_{cam})$ to denote a differentiable renderer [kato2018neural, liu2019soft] that maps mesh $\mathcal{M}$ into its image captured by a camera of parameters $\theta_{cam}$. We also use the sets $V = \{v_i\}$, $D = \{d_i\}$, and $C = \{c_i\}$ of size $N$, constructed with the rows of the vertex, displacement, and confidence matrices, respectively.

Silhouette Loss: Penalizes shape and silhouette mismatch

$$\mathcal{L}_{sil} = \mathcal{L}_{bce}\big(\hat{S}, S\big), \tag{6}$$

where $\hat{S} = R_{sil}(\mathcal{M}'; \theta_{cam})$ is the silhouette of the rendered shape, using the 2D binary cross entropy loss

$$\mathcal{L}_{bce}(\hat{S}, S) = -\sum_{x,y} \Big[ S_{xy} \log \hat{S}_{xy} + (1 - S_{xy}) \log (1 - \hat{S}_{xy}) \Big]. \tag{7}$$

Displacement Loss: Discourages overly large vertex deformations, with

$$\mathcal{L}_{disp} = \frac{1}{N} \sum_{i=1}^{N} \| d_i \|_2^2. \tag{8}$$

Normal Consistency & Laplacian Losses: $\mathcal{L}_{norm}$ and $\mathcal{L}_{lap}$ are widely used [wang2018pixel2mesh, desbrun1999implicit] to encourage mesh smoothness.
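A minimal sketch of the silhouette and displacement terms follows, assuming the refined mesh has already been passed through a differentiable renderer to obtain its silhouette; the normal consistency and Laplacian regularizers are the standard mesh smoothness losses available in common mesh libraries (e.g. PyTorch3D) and are omitted here.

```python
import torch
import torch.nn.functional as F

def silhouette_loss(rendered_sil, target_sil, eps=1e-7):
    """Binary cross entropy between the rendered silhouette of the refined mesh and
    the input silhouette (sketch of Eqs. 6-7). Both inputs are (H, W) tensors in [0, 1]."""
    rendered_sil = rendered_sil.clamp(eps, 1.0 - eps)
    return F.binary_cross_entropy(rendered_sil, target_sil)

def displacement_loss(displacements):
    """Mean squared norm of the predicted vertex displacements (sketch of Eq. 8);
    displacements is an (N, 3) tensor."""
    return displacements.pow(2).sum(dim=1).mean()
```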

Figure 4: To enforce symmetry, a mesh is differentiably rendered by two cameras, where the viewpoint of camera 2 is obtained by reflecting that of camera 1 about the mesh’s plane of symmetry (yellow). The second render is compared to the horizontal flip of the first.

A second set of losses is introduced to avoid overfitting during test-time refinement. This leverages the prior that many real world objects are bilaterally symmetric [liu2010computational] about a reflection plane $\Pi$. These losses enforce two constraints. The first is that the object vertices should be symmetric. The second is that the image projections of the shape should reflect that symmetry. The latter is enforced by the procedure of Figure 4. For a given render, under camera 1, the reflected camera, i.e. the camera whose viewpoint is reflected about $\Pi$, is first found. The object is then rendered under this camera viewpoint and the resulting render compared to the horizontal flip of the render under camera 1. All reflections are implemented with the Householder transformation $T = I - 2nn^T$, where $n$ is the unit normal vector of $\Pi$. While there are methods to predict planes of object symmetry [gao2019prs, zhou2020learning], we found them to be unnecessary, since most reconstruction methods output semantically aligned meshes for objects of the same class. In general, the objects are aligned so that $\Pi$ is a fixed vertical plane through the origin. We adopt this convention in all our experiments.

While we have found this symmetry prior to be helpful for many objects, not all objects are exactly symmetric. For example, an object can be almost symmetric (e.g. an airplane missing part of one wing). To prevent the symmetry prior from overwhelming (6) when this is the case, we introduce a confidence score $c_i \in [0, 1]$ per vertex $v_i$. These confidence scores are learned during the REFINE optimization, enabling local deviations from the global symmetry constraint when appropriate. The two symmetry losses are defined as follows.

Vertex-Based Symmetry Loss: Encourages symmetric mesh vertices according to

$$\mathcal{L}_{sym\text{-}v} = \frac{1}{N} \sum_{i=1}^{N} \Big[ c_i \min_j \| T v'_i - v'_j \|_2^2 + \beta (1 - c_i)^2 \Big], \tag{9}$$

where $v'_i = v_i + d_i$ are the refined mesh vertices and $c_i$ the associated symmetry confidence scores. The first term penalizes distances between each vertex and its nearest neighbor upon reflection about $\Pi$. This is weighted by the learned confidence score $c_i$, which is low for vertices that should be asymmetric based on the signal given by the silhouette. The second term penalizes small confidence scores, preventing trivial solutions. The trade-off between the two terms is controlled by the hyperparameter $\beta$. As shown in Figure 5, scores are large except in areas of clear asymmetry.
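One plausible instantiation of the vertex-based symmetry loss is sketched below. The exact weighting and the value of $\beta$ used by the authors may differ; the code is only meant to illustrate the confidence-weighted nearest-neighbor term and the penalty on small confidences described above.

```python
import torch

def vertex_symmetry_loss(vertices, confidences, plane_normal, beta=0.5):
    """Confidence-weighted vertex symmetry prior (a sketch of Eq. 9).

    vertices:     (N, 3) refined mesh vertices.
    confidences:  (N,) per-vertex symmetry confidence scores in [0, 1].
    plane_normal: (3,) unit normal of the symmetry plane (assumed through the origin).
    beta:         trade-off weight between the two terms; the value here is illustrative.
    """
    n = plane_normal / plane_normal.norm()
    reflected = vertices - 2.0 * (vertices @ n).unsqueeze(1) * n   # Householder reflection of each vertex
    nn_dist = torch.cdist(reflected, vertices).min(dim=1).values   # distance to nearest original vertex
    symmetry_term = (confidences * nn_dist.pow(2)).mean()          # low confidence relaxes the prior locally
    penalty_term = (1.0 - confidences).pow(2).mean()               # discourage trivially small confidences
    return symmetry_term + beta * penalty_term
```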

Figure 5: Left to right: image, original mesh, REFINEd mesh, and vertex confidence weights (shown as points or colors on the REFINEd mesh). Red shades indicate lower confidence; green higher. Note how the confidence weights relax the symmetry prior on asymmetric object parts.

Render-Based Image Symmetry Loss: Encourages image projections that reflect object symmetry. Given a set of camera viewpoints $\{\theta_{cam}^j\}_{j=1}^J$, the renderer $R$ is used to obtain rendered pairs from symmetric camera viewpoints $(\theta_{cam}^j, \bar{\theta}_{cam}^j)$, where $\bar{\theta}_{cam}^j$ is the reflection of $\theta_{cam}^j$ about $\Pi$, as shown in the rows of Figure 4. The loss is defined as

$$\mathcal{L}_{sym\text{-}r} = \frac{1}{J} \sum_{j=1}^{J} \sum_{x,y} \bar{c}_{xy} \, \Big| \mathcal{F}\big(R(\mathcal{M}'; \theta_{cam}^j)\big)_{xy} - R(\mathcal{M}'; \bar{\theta}_{cam}^j)_{xy} \Big|, \tag{10}$$

where $\mathcal{F}(\cdot)$ is the image horizontal flip and $(x, y)$ are image coordinates. Symmetry is enforced by minimizing the distance between the horizontal flip of each render and the render at the symmetric camera viewpoint. This is akin to comparing a “virtual image” of what the mesh should symmetrically look like. Pixel-based confidence scores $\bar{c}_{xy}$ are used as in (9). However, they are not relearned, but derived from the vertex confidences $c_i$ of (9) by barycentric interpolation on the mesh faces, with the interpolation weights determined by the mesh face vertices projected into pixel $(x, y)$.
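The render-based term can be sketched as follows, where `render_fn`, `reflect_camera`, and `pixel_confidence` are assumed helpers (the differentiable renderer, the reflection of a camera about the symmetry plane, and the barycentric interpolation of vertex confidences); the absolute-difference image distance is one reasonable choice, not necessarily the authors' exact one.

```python
import torch

def render_symmetry_loss(render_fn, mesh, cameras, reflect_camera, pixel_confidence):
    """Render-based image symmetry prior (a sketch of Eq. 10).

    render_fn:        callable (mesh, camera) -> (H, W) differentiable render.
    cameras:          list of camera viewpoints.
    reflect_camera:   callable mapping a camera to its mirror about the symmetry plane.
    pixel_confidence: callable (mesh, camera) -> (H, W) confidence map, e.g. obtained by
                      barycentric interpolation of the vertex confidences (assumed given).
    """
    loss = 0.0
    for cam in cameras:
        original = render_fn(mesh, cam)
        mirrored = render_fn(mesh, reflect_camera(cam))
        flipped = torch.flip(original, dims=[-1])             # horizontal image flip
        weights = pixel_confidence(mesh, cam)
        loss = loss + (weights * (flipped - mirrored).abs()).mean()
    return loss / len(cameras)
```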
Overall Loss: REFINE is trained with a weighted combination of the six losses

$$\mathcal{L} = \lambda_{sil}\mathcal{L}_{sil} + \lambda_{disp}\mathcal{L}_{disp} + \lambda_{norm}\mathcal{L}_{norm} + \lambda_{lap}\mathcal{L}_{lap} + \lambda_{sym\text{-}v}\mathcal{L}_{sym\text{-}v} + \lambda_{sym\text{-}r}\mathcal{L}_{sym\text{-}r}. \tag{11}$$

$\mathcal{L}_{sil}$ is the main driving factor ensuring input silhouette consistency, while the other losses serve as regularizers to prevent overfitting. Figure 6 shows that REFINEd shape quality tracks the evolution of this loss, for an airplane whose body has been truncated in the original reconstruction. As the REFINE loss steadily decreases, the airplane mesh progressively becomes more faithful to the input image; this is seen in the elongated body and corrected wing shape.
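The combination in (11) is simply a weighted sum; below is a sketch with illustrative key names for the six terms (the actual $\lambda$ values are hyperparameters, reported in the supplementary).

```python
def total_refine_loss(losses, weights):
    """Weighted combination of the six REFINE losses (sketch of Eq. 11).

    losses / weights: dicts keyed by the silhouette, displacement, normal-consistency,
    Laplacian, vertex-symmetry and render-symmetry terms (key names are illustrative).
    """
    keys = ('sil', 'disp', 'norm', 'lap', 'sym_v', 'sym_r')
    return sum(weights[k] * losses[k] for k in keys)
```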

Figure 6: As the REFINEment optimization proceeds, 3D shape becomes more accurate.
Figure 7: An airplane reconstructed by three different methods [mescheder2019occupancy, wang2018pixel2mesh, groueix2018papier]. Since the methods differ greatly, they exhibit very different failure cases and artifacts. Nevertheless, REFINE improves all reconstructions.

3.4 Implementation Details

Ideally, test-time shape refinement postprocessing should support any mesh and be fast. REFINE intrinsically satisfies the first requisite, since it is black-box, class-agnostic, and allows a variable number of vertices per mesh. To prevent the imprecise shapes of Figure 1, it optimizes a single instance at a time, starting from a network of random weights. Optimization from scratch converges in relatively few iterations, approximately 400 (i.e. 400 forward and backward passes). This requires about 90 seconds on a GTX 1080Ti GPU. Moreover, because instances are treated independently, the refinement is trivially parallelizable. Since 4 instances fit on a GPU, a two GPU server achieves an amortized per-instance refinement time of roughly 11 seconds, satisfying the second requisite.

Several details of our implementation are worth noting. In all experiments we used a set of 6 camera viewpoints, spanning a range of azimuths and elevations around the object. The learning rate is 0.00007; the values of the loss weights $\lambda$ and of the remaining hyperparameters are given in the supplementary.
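Putting the pieces together, the per-instance test-time optimization can be sketched as below. The choice of Adam is an assumption (the text specifies only the learning rate and the approximate iteration count), and `make_refiner` / `compute_losses` are placeholder helpers rather than functions from the paper's code.

```python
import torch

def refine_instance(vertices, edges, silhouette, camera, make_refiner, compute_losses,
                    steps=400, lr=7e-5):
    """Per-instance test-time optimization: the refiner is re-initialized from random
    weights for every mesh and optimized for roughly 400 forward/backward passes."""
    refiner = make_refiner()                               # fresh weights for this instance
    opt = torch.optim.Adam(refiner.parameters(), lr=lr)    # lr = 0.00007 as stated above
    for _ in range(steps):
        displacements, confidences = refiner(vertices, silhouette, camera)
        loss = compute_losses(vertices + displacements, edges, confidences,
                              silhouette, camera)          # weighted sum of Eq. 11 terms
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        displacements, _ = refiner(vertices, silhouette, camera)
    return vertices + displacements
```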

Figure 8: Mesh REFINEment examples for Pix3D (top three rows) and ShapeNet (bottom three) images. REFINE is capable of correcting small details (e.g. airplane nose) as well as generating entirely new parts (e.g. the rear wing).

4 Experiments

In this section, we discuss several experiments performed to validate the effectiveness of REFINE.

4.1 Experimental Setup

Metrics: To evaluate REFINEment performance, the original mesh is first reconstructed by a baseline single view reconstruction method, and the reconstruction accuracy is measured. REFINE is then applied to the mesh and its accuracy is measured again. Several popular metrics [tatarchenko2019single] are used to measure 3D accuracy: EMD, Chamfer Distance, F-Score, and 3D Volumetric IoU. Lower is better for EMD and Chamfer Distance, while higher is better for IoU and F-Score; for details please refer to the supplementary.
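For reference, Chamfer distance and F-score between sampled surface point clouds can be computed as in the sketch below; the point-sampling procedure, distance scaling, and the threshold `tau` follow common practice and are assumptions here, with the paper's exact evaluation protocol given in its supplementary.

```python
import torch

def chamfer_and_fscore(pred_points, gt_points, tau=0.01):
    """Chamfer distance and F-score between two point clouds sampled from the
    predicted and ground-truth surfaces.

    pred_points, gt_points: (N, 3) and (M, 3) tensors of surface samples.
    tau: distance threshold for F-score precision/recall (illustrative value).
    """
    d = torch.cdist(pred_points, gt_points)           # (N, M) pairwise distances
    d_pred_to_gt = d.min(dim=1).values                 # nearest ground-truth point per prediction
    d_gt_to_pred = d.min(dim=0).values                 # nearest predicted point per ground-truth
    chamfer = d_pred_to_gt.pow(2).mean() + d_gt_to_pred.pow(2).mean()
    precision = (d_pred_to_gt < tau).float().mean()
    recall = (d_gt_to_pred < tau).float().mean()
    fscore = 2 * precision * recall / (precision + recall + 1e-8)
    return chamfer, fscore
```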

Datasets: Four datasets are considered, to rigorously study domain shift. All baseline models are trained on the synthetic ShapeNet dataset [chang2015shapenet], with images rendered by [choy20163d] using Blender [blender]. We also re-rendered the 3D models in the test set of [choy20163d] using PyTorch3D [ravi2020pytorch3d]. This second dataset, called RerenderedShapeNet, is designed to create a domain gap due to significant differences in shading, viewpoint, and lighting. The third dataset is motivated by our observation that about 97% of ShapeNet is symmetrical, in the sense that each mesh has a near-zero vertex symmetry loss (9) when all confidence scores are set to 1. To ablate how asymmetry affects reconstruction quality, we introduce a subset of RerenderedShapeNet, denoted ShapeNetAsym, containing 1259 asymmetric meshes. Finally, we use the Pix3D dataset [sun2018pix3d], which contains real images and their ground truth meshes, to test for large domain shifts. Hyperparameters were tuned with a small portion of RerenderedShapeNet, disjoint from the test set.

Configuration EMD CD F-Score Vol. IoU 2D IoU
OccNet [mescheder2019occupancy], unrefined 4.3 34.0 80 33 69
REFINEd, $\mathcal{L}_{sil}$ only 12.2 154.8 51 16 87
REFINEd, + $\mathcal{L}_{disp}$, $\mathcal{L}_{norm}$, $\mathcal{L}_{lap}$ 3.7 26.2 80 31 85
REFINEd, + $\mathcal{L}_{sym\text{-}v}$ 3.7 25.8 81 32 86
REFINEd, all losses 3.3 22.5 84 35 85
OccNet* [mescheder2019occupancy], unrefined 11.0 123.3 48 10 53
REFINEd, all losses, no confidence scores* 8.9 89.1 52 10 72
REFINEd, all losses* 7.8 85.9 55 12 76
Table 2: Ablation study of REFINE on OccNet reconstructions. Rows two to five cumulatively add losses; "all losses" indicates the full loss of (11). An asterisk indicates results averaged over ShapeNetAsym instead of RerenderedShapeNet.
Figure 9: Leftmost column: input image and mesh. Other columns: REFINEment improves with an increasingly larger set of losses (left to right). Best viewed enlarged.

4.2 Ablation Studies

Ablation studies were performed for different components of REFINE. In addition to the metrics above, we also measure in these experiments the consistency between reconstructed mesh and input silhouette, by computing the 2D IoU between the latter and the silhouette of the mesh render.

The top section of Table 2 shows the effect of different REFINEments of RerenderedShapeNet meshes originally reconstructed by OccNet. The first row has not been REFINEd. The second row shows that using the silhouette loss only ($\lambda_{sil}$ non-zero, all other $\lambda$s set to 0) improves input image consistency (from 69 to 87 2D IoU), but the refined mesh severely overfits to the input viewpoint, leading to decreased 3D accuracy. Adding the popular regularizers in the literature ($\mathcal{L}_{disp}$, $\mathcal{L}_{norm}$, $\mathcal{L}_{lap}$) limits vertex displacements and encourages mesh smoothness. As shown in the third row, this improves 3D reconstruction performance, but the gains over the baseline are small. The fourth row shows that enforcing vertex symmetry ($\mathcal{L}_{sym\text{-}v}$) yields only marginal improvements by itself. However, when combined with render-based image symmetry (row five, which further adds the image symmetry loss $\mathcal{L}_{sym\text{-}r}$), it enables significant improvements in all metrics (e.g. from 34 to 22.5 Chamfer Distance).

EMD CD F-Score Vol. IoU
SVR: AtlasNet [groueix2018papier] 8.0 13.0 89 30
SVR: Mesh R-CNN [gkioxari2019mesh] 4.2 10.3 90 -
SVR: Pix2Mesh [wang2018pixel2mesh] 3.4 8.0 93 48
SVR: DISN [xu2019disn] 2.6 9.7 91 57
TTSR: MeshSDF [remelli2020meshsdf] 3.0 → 2.5 (-0.5) 12.0 → 7.8 (-4.2) 91 → 95 (+4) -
TTSR: REFINEd OccNet [mescheder2019occupancy] 2.9 → 2.3 (-0.6) 12.2 → 7.5 (-4.7) 91 → 96 (+5) 57 → 59 (+2)
Table 3: Reconstruction accuracies with no domain shift. Top: single view reconstruction (SVR) networks. Bottom: test-time shape refinement (TTSR) methods. TTSR results are presented as accuracy before → after refinement, with the gain shown in parentheses.
  REFINEd OccNet [mescheder2019occupancy]   REFINEd Pix2Mesh [wang2018pixel2mesh]   REFINEd AtlasNet [groueix2018papier]
  EMD CD F-Score Vol. IoU   EMD CD F-Score Vol. IoU   EMD CD F-Score Vol. IoU
Airplane   3.5 2.2 20.6 11.4 86 91 38 40   3.7 2.3 22.3 11.0 65 88 12 22   5.3 3.8 41.9 18.2 60 82 5 13
Bench   2.9 2.2 28.6 17.0 84 86 20 20   3.6 2.6 28.0 19.9 65 76 9 11   4.9 4.6 50.0 37.7 58 68 5 8
Cabinet   3.4 2.7 17.0 14.8 83 85 45 46   3.6 3.0 20.2 16.4 74 78 37 39   4.3 4.1 30.7 19.9 59 75 14 17
Car   2.9 2.5 19.9 12.9 86 87 30 31   2.7 2.3 10.8 7.8 85 90 24 27   7.6 4.8 98.8 27.0 44 72 6 12
Chair   6.5 5.4 48.5 39.4 72 76 29 32   6.3 4.5 35.4 25.2 60 73 17 22   6.8 5.0 49.5 27.3 53 71 8 13
Display   3.5 2.7 30.8 18.1 76 83 31 37   4.2 3.0 28.0 17.4 72 81 25 32   4.9 4.5 43.1 30.0 61 71 10 14
Lamp   8.9 6.3 90.5 59.1 68 73 22 23   9.2 7.0 71.6 40.6 50 66 11 14   10.2 7.5 102.4 51.1 44 62 5 10
Speakers   4.4 3.6 29.8 22.3 73 76 43 44   4.3 3.8 31.4 25.5 65 70 36 38   5.4 4.7 46.6 27.7 55 69 13 17
Rifle   6.5 3.9 37.7 14.6 86 91 30 30   3.5 3.4 18.1 10.1 76 91 12 21   6.3 4.5 61.4 28.6 70 84 7 14
Sofa   3.0 2.7 23.8 17.9 83 85 48 49   4.3 3.2 24.8 21.8 71 79 34 40   5.3 4.7 48.0 31.1 63 73 15 19
Table   4.5 3.9 40.6 34.3 72 77 17 20   9.3 6.2 159.3 81.8 30 44 6 8   8.9 7.4 129.6 82.7 36 47 4 8
Telephone   2.3 2.0 10.9 8.0 90 92 48 50   2.2 1.8 10.9 8.2 89 92 40 44   3.4 3.3 33.6 20.8 66 79 11 16
Watercraft   4.3 2.9 42.4 23.5 80 86 32 36   5.0 2.7 32.7 14.0 71 86 16 27   7.1 4.3 76.8 25.5 55 79 6 15
Mean   4.3 3.3 34.0 22.5 80 84 33 35   4.8 3.5 38.0 23.1 67 78 22 27   6.2 4.9 62.5 32.9 56 72 8 13
(-1.0) (-11.5) (+4) (+2)   (-1.3) (-14.9) (+11) (+5)   (-1.3) (-29.6) (+16) (+5)
Table 4: REFINEment in the presence of mild domain shift, namely RerenderedShapeNet reconstructions by ShapeNet trained networks. Each pair of numbers gives accuracy before and after REFINEment; the final row shows the mean gains in parentheses. REFINE achieves gains under all networks, classes, and metrics.
EMD CD F-Score Vol. IoU
MeshSDF [remelli2020meshsdf] Chair 11.9 9.8 102.0 89.0 - -
(-2.1) (-13.0)
REFINEd OccNet [mescheder2019occupancy] Chair 11.0 8.5 110.7 74.5 57 62 18 20
Bed* 7.5 6.1 70.1 47.9 57 62 22 23
Bookcase* 7.4 4.1 72.0 38.5 56 65 9 12
Desk 7.6 6.7 60.6 43.7 71 72 26 27
Misc* 10.2 5.4 129.6 69.8 46 55 19 20
Sofa 3.2 3.1 30.8 25.5 75 76 50 51
Table 6.5 5.6 67.7 57.8 60 62 16 17
Tool* 10.8 8.6 140.8 118.6 51 60 11 14
Wardrobe* 5.9 3.7 49.9 29.3 65 68 54 55
Mean 7.9 5.8 81.4 56.2 59 65 23 28
(-2.1) (-25.2) (+6) (+5)
Table 5: REFINEment gain in the presence of large domain shift, namely Pix3D reconstructions by ShapeNet trained networks. Each pair of numbers gives accuracy before and after REFINEment, with mean gains in parentheses. REFINE achieves gains under all metrics and for all networks, and is even able to improve on classes not seen during training, shown with an asterisk.

The bottom three rows of Table 2 use ShapeNetAsym to study the effect of asymmetry on REFINE performance. The sixth row is not REFINEd. The seventh row shows that when the confidence scores of (9) and (10) are removed (by setting the loss hyperparameters so that the confidence scores become approximately 1), the refinement of asymmetrical meshes is significantly less accurate than with the default configuration (eighth row). It can also be seen that, when confidence scores are used, the reconstruction quality is significantly superior to that of the original meshes. In summary, the proposed confidence mechanism enables effective REFINEment of non-symmetric objects.

Figure 9 illustrates the contribution of each loss. The leftmost column shows the input airplane image (top) and mesh (bottom). From the second column on, we progressively add more losses. With only the silhouette loss, REFINE severely overfits to the input viewpoint. The displacement loss helps regularize deformation magnitude; the smoothness losses reduce jagged artifacts; the symmetry losses correct shape details (e.g. the airplane tail) by enforcing a symmetry prior. These losses operate intuitively and can be tweaked for target applications. For example, if only symmetric objects will be reconstructed, the symmetry loss weights can be increased.

4.3 Refinement Results

We next consider the robustness of REFINE postprocessing to different levels of domain gap. First, refinement experiments without domain gap were performed, for reconstruction networks trained and tested on the ShapeNet renders of [choy20163d]. Table 3 presents results for different reconstruction networks and test-time refinement methods (please see the supplementary for full per-class results). As the particulars of MeshSDF’s architecture [remelli2020meshsdf] have not been open sourced as of this submission, we instead REFINE OccNet [mescheder2019occupancy]; here, OccNet and the raw, unrefined version of MeshSDF are directly comparable, as both are implicit based, with nearly identical performance prior to refinement. However, REFINE achieves state of the art test-time shape refinement results when paired with OccNet, despite being a black-box method compared to the white-box refinement of MeshSDF [remelli2020meshsdf], which is specific to their network.

Several experiments were next conducted to evaluate the effectiveness of REFINEment in the presence of domain gap. Table 4 gives reconstruction accuracy for RerenderedShapeNet reconstructions, before and after REFINEment, of ShapeNet pretrained networks. Three methods representative of different reconstruction strategies are considered: OccNet [mescheder2019occupancy], based on implicit functions, Pixel2Mesh [wang2018pixel2mesh], which deforms an ellipsoid, and AtlasNet [groueix2018papier], based on surface atlas elements. A larger table with more REFINEd methods is presented in the supplementary. The results of Table 4 are generally worse than those of Table 3; while the methods perform well on the training domain, they struggle to generalize to out-of-distribution data. However, REFINEment significantly recovers much of the degraded performance for all networks. Average gains are particularly large under the Chamfer distance (-11.5 for OccNet, -14.9 for Pixel2Mesh, and -29.6 for AtlasNet) and increase with the network sensitivity to the domain gap (e.g. largest for AtlasNet, which has the weakest performance).

Finally, the effectiveness of REFINE is validated on the real-world Pix3D dataset, which has the largest domain gap. Table 5 shows that the beneficial behavior of REFINE is qualitatively identical to that of Table 4. A comparison to the test-time refinement method of [remelli2020meshsdf] on the shapes of the “Chair” class (the only class for which [remelli2020meshsdf] reports results) once again shows that REFINE substantially improves on the state of the art. This occurs even though the initial performance prior to refinement is actually worse for OccNet than for MeshSDF in this case (e.g. Chamfer Distance 110.7 versus 102.0), demonstrating the strength of REFINE.

Overall, Tables 3, 4, and 5 show that REFINE postprocessing achieves state of the art reconstruction accuracies, consistently improving the performance of original reconstruction networks regardless of class, metric, base reconstruction method, or dataset. Furthermore, gains increase with the domain gap; for the best performing OccNet (ShapeNet trained), REFINE yields Chamfer Distance average improvements of -4.7, -11.5, and -25.2 on ShapeNet, RerenderedShapeNet, and Pix3D respectively.

These improvements are illustrated by examples in Figure 8 (originally reconstructed by an OccNet). REFINE can both sharpen details (such as the airplane’s elongated nose) and create entirely new parts (the set of wings in the back). It also excels at recovering details from unusual “outlier” shapes, such as cars with spoilers, and is successful even for classes on which the original reconstruction method was not trained, which lead to poor original meshes. These are marked by an asterisk in Table 5, and include the bed and spoon in the first two rows of Figure 8. REFINE can also recover from very poor reconstructions due to significant domain shift, such as the Pix3D table shown in the third row of Figure 8. Finally, Figure 7 illustrates that although OccNet, Pix2Mesh, and AtlasNet produce very different failure cases and artifacts, REFINE improves both the input image consistency and 3D accuracy of all methods. Additional examples can be found in the supplementary.

5 Conclusion

In this paper, we demonstrated the effectiveness and versatility of black-box mesh refinement at test time for the problem of single view 3D reconstruction. The proposed REFINE method enforces regularized input image consistency, and can be applied to any reconstruction method in the literature. Experiments show systematic and significant improvements over the state of the art, across many metrics, datasets, and reconstruction methods. We believe that this new paradigm will remain relevant as novel reconstruction networks are introduced, and will inspire substantial future work in test-time black-box refinement of reconstructed meshes.

References