
On the generalization of learning-based 3D reconstruction

State-of-the-art learning-based monocular 3D reconstruction methods learn priors over object categories on the training set, and as a result struggle to achieve reasonable generalization to object categories unseen during training. In this paper we study the inductive biases encoded in the model architecture that impact the generalization of learning-based 3D reconstruction methods. We find that 3 inductive biases impact performance: the spatial extent of the encoder, the use of the underlying geometry of the scene to describe point features, and the mechanism to aggregate information from multiple views. Additionally, we propose mechanisms to enforce those inductive biases: a point representation that is aware of camera position, and a variance cost to aggregate information across views. Our model achieves state-of-the-art results on the standard ShapeNet 3D reconstruction benchmark in various settings.



1 Introduction

Reconstructing the 3D shape of an object from monocular input views is a fundamental problem in computer vision. When the number of input views is small, reconstruction methods rely on priors over object shapes. Learning-based algorithms encode such priors from data. Recently proposed approaches [atlasnet, disn, occnet] have been successful in both the single-view and multi-view settings when generalizing to novel objects within categories seen during training. However, these approaches have difficulty generalizing to object categories not seen during training (cf. Fig. 1).

Figure 1: An example of reconstructing object categories unseen during training. State-of-the-art methods for learning-based reconstruction like OccNets [occnet] fail to generalize to categories unseen during training, mapping objects to their closest category in the training set (a chair). 3D43D improves generalization by using 3 inductive biases in the network design.

We present progress towards learning priors that generalize to unseen categories by incorporating a geometry-aware spatial feature map. Within this paradigm, we propose a point representation that is aware of the camera position, and a variance cost to aggregate information across views.

A typical learning-based approach takes a single 2D view of an object as input and uses a model to generate a 3D reconstruction. What should happen to the 3D ground truth as the viewpoint of the 2D input changes? An object-centric coordinate system keeps the ground truth fixed to a canonical coordinate system, regardless of the viewpoint of the 2D input view. In contrast, a view-centric coordinate system rotates the ground truth coordinate frame in conjunction with the input view. An example of the two coordinate systems is shown in Fig. 2 (in the graphics community the object-centric coordinate system is often referred to as world coordinates and the view-centric coordinate system as camera coordinates). Object-centric coordinate systems align shapes of the same category to an arbitrary, shared coordinate system. This introduces stable spatial relationships during training (e.g., wheels of different car shapes generally occupy the same absolute region of 3D space). This makes the reconstruction task easier to learn, but these relationships are not necessarily shared across categories. Similar to [pixelsvoxelsviews, singleviewlearn], we show empirically that adopting a view-centric coordinate system improves generalization to unseen categories.

Figure 2: (a) View-centric coordinate system, where ground truth 3D objects are aligned to their respective input views. (b) Object-centric coordinate system, where all input views share the same ground-truth canonical 3D object orientation.
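The difference between the two supervision schemes can be illustrated with a small NumPy sketch. The helper names (`rotation_y`, `world_to_camera`), the camera convention p_cam = R p_world + t, and the toy points are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def rotation_y(angle_rad):
    """Rotation matrix about the y axis (hypothetical helper)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def world_to_camera(points_world, R, t):
    """Map (N, 3) object-centric points into the camera frame: p_cam = R p_world + t."""
    return points_world @ R.T + t

# Toy ground-truth points in the canonical (object-centric) frame.
gt_object = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])

t = np.array([0.0, 0.0, 2.0])        # assumed camera translation
R_view_a = rotation_y(0.0)           # first input view
R_view_b = rotation_y(np.pi / 2)     # second input view, rotated by 90 degrees

# Object-centric supervision: the target is identical for both input views.
target_obj_a, target_obj_b = gt_object, gt_object

# View-centric supervision: the target moves together with the input view.
target_view_a = world_to_camera(gt_object, R_view_a, t)
target_view_b = world_to_camera(gt_object, R_view_b, t)
```

In the object-centric case the network can memorize where parts sit in the canonical frame, whereas in the view-centric case the target depends on the camera, so the network has to use the image evidence to place the geometry.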

Another critical factor for achieving good generalization to unseen categories is the capacity of a model to encode geometrical features when processing the input view. Similar to [disn], our model uses feature maps with spatial extent, rather than pooling across spatial locations to obtain a global representation. In [disn] the motivation for using spatial feature maps is to preserve fine grained details of geometry (to better model categories in the training set). In contrast, in this paper we analyze generalization to unseen categories and how different encoding designs impact generalization capability. We argue that using a globally pooled representation encourages the model to perform reconstruction at the object level (since the pooling operation is invariant to spatial arrangements of the feature map), which makes it difficult to generalize to objects from unseen categories. By keeping the spatial extent of features, on the other hand, we process and represent an object at the part level. Critically, in contrast to [disn], we model the scene geometry across different views by explicitly embedding information about camera poses in the spatial feature maps. We show empirically that using these geometry-aware spatial feature maps increases generalization performance.

Finally, we use multi-view aggregation to improve generalization performance. Traditional approaches to 3D reconstruction, such as multi-view stereo (MVS) [mvs] or structure-from-motion (SfM) [sfm], exploit the geometry of multiple views via cost volumes instead of priors learned from data. These approaches fail in single-view cases. Single-view reconstruction models, in turn, must rely entirely on priors for occluded regions. We propose a model that combines learned priors with the complementary information gained from multiple views. We aggregate information from multiple views by taking inspiration from cost volumes used in MVS and compute a variance cost across views. By refining its single-view estimates with additional views, our model shows improved generalization performance.

Each of these factors is important on its own, as supported by the literature and by our empirical results, and addressing them jointly has a compounding effect on generalization. The view-centric coordinate system has been shown to improve generalization [pixelsvoxelsviews, singleviewlearn]. However, aggregating information from multiple views is also essential to reconstruct categories not seen during training, since the prior learned over the training categories cannot be trusted for unseen categories. In order to benefit as much as possible from aggregating information from multiple views, we require features that encode information about parts of objects rather than encoding the object as a whole entity without preserving spatial information. Otherwise, aggregation can be counterproductive by reinforcing the wrong object prior. We show empirically that by compounding these three factors, 3D43D outperforms state-of-the-art 3D reconstruction algorithms when tested on both categories seen and unseen during training. Contrary to suggestions from previous work [unseenclasses, singleviewlearn], we achieve these gains in generalization without a drop in performance when testing on categories seen during training.

| | MV Consist. [tulsiani] | Diff. PCs. [diffpcs] | L-MVS [lmvs] | PiFU [pifu] | DISN [disn] | OccNet [occnet] | 3D43D |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Geometry | Voxel | Point cloud | Voxel/Depth | Func. | Func. | Func. | Func. |
| Coordinate Sys. | Viewer | Viewer | Object | Object | Object | Object | Viewer |
| Features | Global | Global | Spatial+Global | Spatial+Global | Spatial+Global | Global | Spatial+Geometric |
| Multi-view | No | No | Yes | Yes | Yes | No | Yes |
| Generalization | No | No | Yes | No | No | No | Yes |
Table 1: Summary of design choices of different approaches. We describe each method in terms of their choice of: geometry representation (Geometry), coordinate system (Coordinate Sys.), feature representation (Features), capacity to use multiple views (Multi-view) and if they analyze generalization to unseen categories (Generalization).

2 Related Work

Object Coordinate System. Careful and extensive experimental analysis in [pixelsvoxelsviews, singleviewlearn] has revealed that the object shape priors learned by most 3D reconstruction approaches act in a categorization regime rather than in the expected 3D dense reconstruction regime. In other words, reconstructing the 3D object in these models happens by categorizing the input into a blend of the known 3D objects in the training set. This phenomenon has been mainly explained by the choice of the object-centric coordinate system.

Results by [pixelsvoxelsviews, singleviewlearn] showed that object-centric coordinate systems perform better for reconstructing viewpoints and categories seen during training, at the cost of significantly hampering the capability to generalize to categories or viewpoints unseen at training time. The converse result was also observed for view-centric coordinate systems, which generalized better to unseen objects and viewpoints at the cost of degraded reconstruction performance for known categories/viewpoints.

Feature Representation. Single-view 3D reconstruction approaches have recently obtained impressive results [occnet, pifu, disn, tulsiani, atlasnet, softraster, birds] despite the ill-posed nature of the problem. In order for these approaches to perform single-view reconstruction successfully, priors are required to resolve ambiguities. Not surprisingly, recent works show that using local features that retain spatial information [pifu, disn] improves reconstruction accuracy. However, none of these approaches analyze their performance on object categories unseen during training. A recent exception is [unseenclasses], where the authors propose an approach for single-view reconstruction that is not fully differentiable and relies on depth in-painting of spherical maps [sphericalmaps]. [unseenclasses] also differs from 3D43D because 3D voxel grids are used as ground truth, and extra supervision at the level of depth maps is available at training time.

View Aggregation. Multi-view 3D reconstruction has traditionally been addressed by stereopsis approaches like MVS [mvs, mvs2] or SfM [sfm]. Modern learning-based approaches to MVS [mvsnet, mvs2] have incorporated powerful convolutional networks into the MVS pipeline. These networks focus on visible regions and do not make inferences about the geometry of occluded object parts.

Another interesting trend has been to exploit the multi-view consistency inductive bias from MVS to learn 3D shape and pose from pairs of images [tulsiani, diffpcs, keypointnet]. However, these approaches either predict a very sparse set of keypoints [keypointnet], a sparse point cloud [diffpcs], or a voxel grid [tulsiani], limiting the approaches to fixed resolution representations.

Conceptually close to our approach is [lmvs]. The authors propose differentiable proxies to the operations in the standard MVS pipeline, allowing end-to-end optimization. Although [lmvs] addresses multi-view aggregation, critical design choices in other aspects of the method limit its performance. First, the geometry representation produced is a voxel grid, making the estimation of high-resolution geometry infeasible. Second, the cost-volume optimization happens via a large 3D auto-encoder, which has no straightforward geometric interpretation. Third, view aggregation is performed in a recurrent fashion, making the model sensitive to permutations of the views.

Properly extending the previously discussed single-view works [occnet, pifu, disn, tulsiani, atlasnet, softraster, birds] to the multi-view case is not trivial, although simple extensions to aggregate multiple views are briefly outlined in [pifu, disn]. Inspired by the cost volume computation used in MVS [mvsnet], we aggregate information from different views by computing a variance cost (Sect. 3.3).

Geometry Representation. The choice of representation scheme for 3D objects has been at the core of 3D reconstruction research from the beginning. Voxels [tulsiani, lmvs] have been used as a natural extension of 2D image representation, showing great results in low-resolution regimes. However, the memory and computation required to scale voxel representations to higher resolutions prevent them from being widely used. Circumventing this problem, point clouds are a more frequently used representation [diffpcs, keypointnet]. Point clouds deal with the computational cost of voxels by sparsifying the representation and eliminating the neighboring structure information. Meshes [birds, meshrenderer, softraster] add the neighboring structure back into point cloud representations. However, to make mesh estimation efficient, the neighboring structure has to be predefined (usually in the form of the connectivity of a convex polyhedron with a variable number of faces) and only deformations of that structure can be modelled. Finally, functional/implicit representations have recently gained interest [occnet, disn, deepsdf, deeplevelsets, cvxnet, nasa]. This representation encodes geometry as the level set of a function that can be evaluated at any point in 3D space. Such a function can generate geometry at arbitrary resolutions by evaluating as many points as desired. As a summary, Tab. 1 shows the contributions of the most relevant and related literature in comparison to 3D43D.

3 Model

We now describe our approach in terms of the choice of geometry representation, the use of geometry-aware feature maps, and the multi-view aggregation strategy. Our model is shown in Fig. 3.

3.1 Functional Geometry Representation

3D43D takes the form of a functional estimator of geometry [occnet, disn, atlasnet, deepsdf]. Given a view of an object, our goal is to predict the object occupancy function, which indicates whether a given 3D point lies inside the mesh of the object. In order to do so, we learn a parametric scalar field $f_\theta$ that maps a 3D point and a monocular (RGB) view of the object to an occupancy value. In the remainder of the text the parameter subscript is dropped for ease of notation. This scalar field is implemented by a fully connected deep neural network with residual blocks (details of the implementation can be found in the supplementary material).
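To make the idea of a functional geometry representation concrete, the following sketch queries an occupancy function on grids of two different resolutions. The analytic sphere `toy_occupancy` is only a stand-in for the learned network, and `query_grid`, the radius and the bounding box are hypothetical choices for illustration.

```python
import numpy as np

def toy_occupancy(points, radius=0.4):
    """Stand-in for a learned occupancy field: 1 inside a sphere, 0 outside."""
    return (np.linalg.norm(points, axis=-1) < radius).astype(np.float32)

def query_grid(occupancy_fn, resolution, bound=0.5):
    """Evaluate an occupancy function on a resolution^3 grid in [-bound, bound]^3."""
    axis = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    values = occupancy_fn(grid.reshape(-1, 3))
    return values.reshape(resolution, resolution, resolution)

# The same function yields geometry at any resolution; only the query points change.
coarse = query_grid(toy_occupancy, 16)
fine = query_grid(toy_occupancy, 64)
```

Unlike a voxel decoder, whose output resolution is fixed by the architecture, the resolution here is a property of the query set, which is what allows meshes to be extracted at the desired level of detail.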

3.2 Encoding Geometry-aware Features

Our goal is to learn a prior over occupancy functions that generalizes well to unseen categories, which we address by giving our model the capacity to reason about local object cues and scene geometry. In order to do so, we process input views with a convolutional U-Net [unet] encoder with skip connections (refer to the supplementary material for implementation details). This results in a feature map C for a given RGB view. This is in contrast to the approach taken in [occnet, atlasnet], where a view is represented by a global feature which pools across the spatial dimensions of the feature map. Our hypothesis is that preserving the spatial dimensions in the latent representation is crucial to retain local object cues, which greatly improve generalization to unseen views and categories, as demonstrated in the experiments. To predict the occupancy value for a 3D point in world coordinates, we project this point into its location in C by using the extrinsic camera parameters and the intrinsic parameters (cf. Eq. (1); we assume camera intrinsics to be constant), and sample the corresponding feature vector. We use bilinear sampling as in [spatialtransformer] to make the sampling operation differentiable.
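The projection and sampling step can be sketched as follows, assuming a standard pinhole camera with intrinsics `K` and extrinsics `(R, t)` such that p_cam = R p_world + t. The function names, the camera values and the toy feature map are assumptions for illustration; the paper's own implementation is described in its supplementary material.

```python
import numpy as np

def project(point_world, K, R, t):
    """Pinhole projection of a 3D world point to continuous pixel coordinates (u, v)."""
    p_cam = R @ point_world + t
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

def bilinear_sample(feature_map, u, v):
    """Sample a (C, H, W) feature map at a continuous pixel location (u, v)."""
    _, h, w = feature_map.shape
    u = np.clip(u, 0.0, w - 1.0)
    v = np.clip(v, 0.0, h - 1.0)
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    au, av = u - u0, v - v0
    # Interpolate along the horizontal axis first, then along the vertical axis.
    top = (1 - au) * feature_map[:, v0, u0] + au * feature_map[:, v0, u1]
    bottom = (1 - au) * feature_map[:, v1, u0] + au * feature_map[:, v1, u1]
    return (1 - av) * top + av * bottom

# Assumed camera: focal length 100 px, principal point at (32, 32), camera 2 units away.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 2.0])

# Toy single-channel 64x64 feature map whose value equals the column index.
C = np.tile(np.arange(64, dtype=np.float64), (1, 64, 1))

u, v = project(np.array([0.05, 0.0, 0.0]), K, R, t)
c = bilinear_sample(C, u, v)
```

Because the interpolation weights are continuous in (u, v), gradients can flow from the sampled feature back to the feature map, which is the property the differentiable sampling is used for.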


The sampled feature vector encodes localized appearance information but lacks any notion of the structural geometry of the underlying scene. This causes ambiguities in the representation since, for example, all points on a ray shot from the camera origin are projected to the same location in the image plane. Thus, the sampled feature vector alone cannot uniquely characterize the points on the ray, which is needed to predict occupancy. Recent works [pifu, disn] mitigate this issue by augmenting the feature vector to explicitly encode the coordinates of the 3D point. This is often done by concatenating the feature vector with the point coordinates (or with a latent representation of them [disn]), and further processing the result via additional fully connected blocks.
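The ray ambiguity can be checked numerically with a small sketch. The intrinsics, the two points and the placeholder feature values below are made up for illustration.

```python
import numpy as np

def project_camera_point(p_cam, K):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])

near = np.array([0.1, -0.2, 1.0])   # a point on a ray through the camera origin
far = 3.0 * near                    # another point further along the same ray

pixel_near = project_camera_point(near, K)
pixel_far = project_camera_point(far, K)
same_pixel = np.allclose(pixel_near, pixel_far)

# Placeholder for the sampled feature: both points read the same image location,
# so they receive the same feature vector.
shared_feature = np.array([0.7, -1.3])

# Appending the point coordinates makes the two representations distinguishable.
augmented_near = np.concatenate([shared_feature, near])
augmented_far = np.concatenate([shared_feature, far])
distinguishable = not np.allclose(augmented_near, augmented_far)
```

Both points land on the same pixel, so a decoder that only sees the sampled feature would have to assign them the same occupancy, even if one lies inside the object and the other behind it.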

However, the point (or its representation) is sensitive to the choice of coordinate system [pixelsvoxelsviews]. Recent approaches [occnet, disn] use a canonical object-centric coordinate system for each object category, which has been shown to generalize poorly to categories not seen during training [singleviewlearn]. On the other hand, expressing the point in a view-centric coordinate system (also known as the camera coordinate system) improves generalization to unseen categories [pixelsvoxelsviews, singleviewlearn, unseenclasses]. Note that if the point is expressed in the view-centric coordinate system, the characterization of the scene is incomplete, since it lacks information about where the rays passing through the point originated in the image capturing process (the representation is not aware of the origin of the view-centric coordinate system with respect to the scene).

To tackle this issue we represent the point in the camera coordinate system, and give the representation access to the origin of the camera coordinate system with respect to the world (the camera position in the world coordinate system). Therefore, after sampling the feature vector we concatenate it with the point in camera coordinates and with the camera position, and process the result with an MLP with residual blocks, obtaining a feature representation g that is aware of the scene geometry. This feature representation is then input to the occupancy field. Note that this does not require additional camera information compared to [disn, pifu], since the camera position is already used to project the point into the image plane to sample the feature map. In our model we explicitly condition the representation on the camera position instead of only implicitly using the camera position to sample feature maps. Fig. 3 shows our model.
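The input to the geometry-aware MLP can be sketched as a simple concatenation. The code below assumes the convention p_cam = R p_world + t, under which the camera position in world coordinates is -Rᵀt; the function names and toy values are hypothetical, and the residual MLP that follows the concatenation is omitted.

```python
import numpy as np

def camera_center(R, t):
    """World-space camera position for the convention p_cam = R p_world + t."""
    return -R.T @ t

def geometry_aware_input(sampled_feature, point_world, R, t):
    """Concatenate the sampled feature, the point in camera coordinates,
    and the camera position in world coordinates."""
    p_cam = R @ point_world + t
    return np.concatenate([sampled_feature, p_cam, camera_center(R, t)])

# Assumed extrinsics: 90 degree rotation about z, camera offset along z.
R = np.array([[0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.0, 0.0, 2.0])

feature = np.array([0.5, -0.5, 1.0, 0.0])   # placeholder sampled feature
point = np.array([0.1, 0.2, 0.3])           # query point in world coordinates

x = geometry_aware_input(feature, point, R, t)
```

The resulting vector has the feature dimension plus six entries; in the model described above, a vector of this kind would be processed by an MLP to produce the per-view representation g.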

Figure 3: Overview of our model. Input views are processed by our U-Net encoder, producing feature maps that are sampled at the spatial locations corresponding to a 3D point. Those features are then concatenated with the point and the location of the camera origin of the corresponding input view, and processed by an MLP that produces geometry-aware point representations (one for each view). Those representations are used to compute a mean and a variance cost across views, which are used by another MLP to predict occupancy.

3.3 Multi-View Aggregation

We now turn to the task of aggregating information from multiple views to estimate occupancy. Traditionally, view aggregation approaches for geometry estimation require the explicit computation of a 3D cost volume, which is either refined using graph optimization techniques in traditional MVS approaches [mvs] or processed by 3D convolutional auto-encoders with a large number of parameters in learned models [lmvs, mvsnet, mvs2]. Here we do not explicitly construct a cost volume for the whole scene. Instead, we compute point-wise estimates of that cost volume. One key observation is that our model is able to estimate geometry for parts of the object that are occluded in the input views, as opposed to MVS approaches that only predict geometry for visible parts of a scene (depth). As a result, our approach integrates reconstruction of visible geometry and generation of unseen geometry under the same framework.

Our task is to predict the ground truth occupancy values of a set of 3D points, given a set of $N$ posed RGB views. In order to do so, we independently compute geometry-aware representations $\mathbf{g}_i$, $i = 1, \dots, N$, across views for each point as shown in Sect. 3.2. In order for our model to deal with a variable number of views, a pooling operation over the $\mathbf{g}_i$ is required. Modern approaches to estimate the complete geometry of a scene (visible and occluded) from multiple views rely on element-wise pooling operators like mean or max [disn, pifu]. These element-wise operators can be suitable for categories seen at training time. However, in order to better generalize to unseen categories it is often beneficial to rely on the alignment cost between features, as done in purely geometric (non learning-based) approaches [mvs]. Inspired by traditional geometric approaches, we propose to use an approximation to the alignment cost between the local features $\mathbf{g}_i$. We approximate the alignment cost on the set of local features by computing the variance as follows,

$$\mathrm{Var}(\mathbf{g}) = \frac{1}{N} \sum_{i=1}^{N} \left(\mathbf{g}_i - \bar{\mathbf{g}}\right)^2, \qquad (2)$$

where $\bar{\mathbf{g}}$ is the average of the $\mathbf{g}_i$. A key design choice is that we do not use the variance as the sole input to our functional occupancy decoder, since the variance will be zero everywhere and uninformative when only a single view is available. Instead, we add a conditional input branch to our decoder, which takes the mean $\bar{\mathbf{g}}$ as input conditioned on the variance. We also give the model access to a global object representation by introducing a residual path that performs additive conditioning. We perform average pooling on the feature maps, both spatially and across views, to obtain this global representation. Conditioning in the decoder is implemented via conditional batch normalization layers [cbn] (implementation details are in the supplementary material). This formulation naturally handles the single view case, where the variance is zero. Finally, our objective function is shown in Eq. 3.
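The aggregation step can be sketched in a few lines of NumPy. The function name `aggregate_views` and the random toy features are assumptions; the sketch only covers the mean and variance statistics and leaves out the decoder and its conditional batch normalization.

```python
import numpy as np

def aggregate_views(g):
    """Aggregate per-view representations of one 3D point.

    g: array of shape (N_views, D), one geometry-aware feature per view.
    Returns the element-wise mean and the element-wise (1/N) variance across views.
    """
    mean = g.mean(axis=0)
    var = ((g - mean) ** 2).mean(axis=0)
    return mean, var

rng = np.random.default_rng(0)
g_single = rng.normal(size=(1, 8))   # one view
g_multi = rng.normal(size=(5, 8))    # five views that disagree

mean_1, var_1 = aggregate_views(g_single)
mean_5, var_5 = aggregate_views(g_multi)

# Reversing the view order leaves both statistics unchanged.
mean_perm, var_perm = aggregate_views(g_multi[::-1])
```

Two properties follow directly from the definition: the variance is exactly zero with a single view, which is why it cannot be the sole decoder input, and both statistics are invariant to the order of the views, in contrast to recurrent aggregation.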


4 Experiments

We present empirical results that show how 3D43D performs in two experimental setups: reconstruction of categories seen during training and generalization to categories unseen during training. In the first setup our goal is to show that 3D43D is competitive with state-of-the-art 3D reconstruction approaches. In the second setup we show that 3D43D generalizes better to categories unseen at training time. Finally, we conduct ablation experiments to show how the proposed contributions impact the reconstruction accuracy.

4.1 Settings

Dataset: For all of our experiments, we use the ShapeNet [shapenet] subset of Choy et al. [3dr2n2], together with their renderings. For a fair comparison with different methods we use the same train/test splits and occupancy ground truth as [occnet], which provides an in-depth comparison with several approaches (readers interested in the ground-truth generation process are referred to [occnet, stutz2018learning]).

Metrics: We report the following metrics, following [occnet]: volumetric IoU, Chamfer-L1 distance, and normal consistency score. In addition, as recently suggested by [singleviewlearn], we report the F-score. Volumetric IoU is defined as the intersection over union of the volume of two meshes. An estimate of the volumetric IoU is computed by randomly sampling 100k points and determining whether the points reside in the meshes [occnet]. The Chamfer-L1 distance is a relaxation of the symmetric Hausdorff distance, measuring the average of reconstruction accuracy and completeness. The normal consistency score is the mean absolute dot product of the normals in one mesh and the normals at the corresponding nearest neighbors in the other mesh [occnet]. Finally, the F-score can be understood as “the percentage of correctly reconstructed surface” [singleviewlearn].
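As a rough illustration of three of these metrics, the sketch below estimates the IoU by Monte Carlo sampling and computes the Chamfer-L1 distance and the F-score on point sets. It assumes Euclidean nearest-neighbor distances, the common definition of the F-score as the harmonic mean of precision and recall at a distance threshold, and hypothetical function names, shapes and thresholds. It is not the evaluation code of [occnet], and normal consistency is omitted.

```python
import numpy as np

def volumetric_iou(occ_a, occ_b, n_points=100_000, bound=0.5, seed=0):
    """Monte Carlo IoU of two occupancy functions inside [-bound, bound]^3."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(-bound, bound, size=(n_points, 3))
    a, b = occ_a(pts), occ_b(pts)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def nearest_distances(src, dst):
    """Euclidean distance from each point in src (N, 3) to its nearest point in dst (M, 3)."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def chamfer_l1(pred, gt):
    """Average of accuracy (pred to gt) and completeness (gt to pred)."""
    accuracy = nearest_distances(pred, gt).mean()
    completeness = nearest_distances(gt, pred).mean()
    return 0.5 * (accuracy + completeness)

def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold."""
    precision = (nearest_distances(pred, gt) < threshold).mean()
    recall = (nearest_distances(gt, pred) < threshold).mean()
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total

def sphere(radius):
    """Occupancy function of a centered sphere (toy shape)."""
    return lambda pts: np.linalg.norm(pts, axis=-1) < radius

# Concentric spheres: the exact IoU is (0.2 / 0.4)^3 = 0.125.
iou = volumetric_iou(sphere(0.2), sphere(0.4))

rng = np.random.default_rng(1)
gt_points = rng.uniform(-0.5, 0.5, size=(200, 3))
pred_points = gt_points + np.array([0.01, 0.0, 0.0])   # slightly shifted prediction

cd = chamfer_l1(pred_points, gt_points)
fs = f_score(pred_points, gt_points, threshold=0.02)
```

For identical shapes the IoU and F-score reach 1 and the Chamfer-L1 distance is 0; higher is better for IoU, normal consistency and F-score, while lower is better for Chamfer-L1.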

Implementation: We resize our input images to pixels. For our encoder, we choose a U-Net with a ResNet-50 [resnet] encoder; the final feature maps C have channels and are of the same spatial size as the input. The function that computes geometric features is an MLP with ResNet blocks that takes a vector of dimensions and outputs a -dimensional representation g. Our occupancy function is an MLP with ResNet blocks where all layers have hidden units except the output layer. To train the occupancy function we sample points with their respective occupancy value from a pool of k points and use input views (this was due to memory limitations; nonetheless, the method generalizes to an arbitrary number of input views during inference). Details of different sampling strategies can be found in [occnet]. We train our network with batches of samples and use Adam [adam] with default PyTorch parameters as our optimizer. We use a learning rate of and train our network for epochs. To obtain meshes at inference time from the occupancy function we follow the process in [occnet].

4.2 Categories seen during training

In this section, we compare our method to various baselines on single-view and multi-view 3D reconstruction experiments. For the single-view setup we report our results on standard experiments for reconstructing unseen objects of categories seen during training (cf. Tab. 2). We compare 3D43D with 3D-R2N2[3dr2n2], Pix2Mesh[pix2mesh], PSGN[psgn], AtlasNet[atlasnet] and OccNets[occnet]. Encouragingly, 3D43D performs on par with the state-of-the-art OccNets[occnet] approach, which uses global features that are able to encode semantics of the objects in the training set. In addition, OccNets make use of an object-centric coordinate system, which aligns all shapes to the same canonical orientation, making the reconstruction problem simpler. Our results indicate that spatial feature maps are able to encode useful information for reconstruction despite being spatially localized. This result backs up our initial hypothesis and is critical to establish a good baseline performance. From this baseline, we explore the performance of our model when generalizing to unseen categories in different scenarios (Sect. 4.3).

| Seen category | IoU [3dr2n2] | IoU [pix2mesh] | IoU [occnet] | IoU 3D43D | Chamfer-L1 [3dr2n2] | Chamfer-L1 [psgn] | Chamfer-L1 [pix2mesh] | Chamfer-L1 [atlasnet] | Chamfer-L1 [occnet] | Chamfer-L1 3D43D | Normal Consistency [3dr2n2] | Normal Consistency [pix2mesh] | Normal Consistency [atlasnet] | Normal Consistency [occnet] | Normal Consistency 3D43D |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| airplane | 0.426 | 0.420 | 0.591 | 0.571 | 0.227 | 0.137 | 0.187 | 0.104 | 0.134 | 0.096 | 0.629 | 0.759 | 0.836 | 0.845 | 0.825 |
| bench | 0.373 | 0.323 | 0.492 | 0.502 | 0.194 | 0.181 | 0.201 | 0.138 | 0.150 | 0.112 | 0.678 | 0.732 | 0.779 | 0.814 | 0.809 |
| cabinet | 0.667 | 0.664 | 0.750 | 0.761 | 0.217 | 0.215 | 0.196 | 0.175 | 0.153 | 0.119 | 0.782 | 0.834 | 0.850 | 0.884 | 0.886 |
| car | 0.661 | 0.552 | 0.746 | 0.741 | 0.213 | 0.169 | 0.180 | 0.141 | 0.149 | 0.122 | 0.714 | 0.756 | 0.836 | 0.852 | 0.844 |
| chair | 0.439 | 0.396 | 0.530 | 0.564 | 0.270 | 0.247 | 0.265 | 0.209 | 0.206 | 0.193 | 0.663 | 0.746 | 0.791 | 0.829 | 0.832 |
| display | 0.440 | 0.490 | 0.518 | 0.548 | 0.605 | 0.284 | 0.239 | 0.198 | 0.258 | 0.166 | 0.720 | 0.830 | 0.858 | 0.857 | 0.883 |
| lamp | 0.281 | 0.323 | 0.400 | 0.453 | 0.778 | 0.314 | 0.308 | 0.305 | 0.368 | 0.561 | 0.560 | 0.666 | 0.694 | 0.751 | 0.766 |
| loudspeaker | 0.611 | 0.599 | 0.677 | 0.729 | 0.318 | 0.316 | 0.285 | 0.245 | 0.266 | 0.229 | 0.711 | 0.782 | 0.825 | 0.848 | 0.868 |
| rifle | 0.375 | 0.402 | 0.480 | 0.529 | 0.183 | 0.134 | 0.164 | 0.115 | 0.143 | 0.248 | 0.670 | 0.718 | 0.725 | 0.783 | 0.798 |
| sofa | 0.626 | 0.613 | 0.693 | 0.718 | 0.229 | 0.224 | 0.212 | 0.177 | 0.181 | 0.125 | 0.731 | 0.820 | 0.840 | 0.867 | 0.875 |
| table | 0.420 | 0.395 | 0.542 | 0.574 | 0.239 | 0.222 | 0.218 | 0.190 | 0.182 | 0.146 | 0.732 | 0.784 | 0.832 | 0.860 | 0.864 |
| telephone | 0.611 | 0.661 | 0.746 | 0.740 | 0.195 | 0.161 | 0.149 | 0.128 | 0.127 | 0.107 | 0.817 | 0.907 | 0.923 | 0.939 | 0.935 |
| vessel | 0.482 | 0.397 | 0.547 | 0.588 | 0.238 | 0.188 | 0.212 | 0.151 | 0.201 | 0.175 | 0.629 | 0.699 | 0.756 | 0.797 | 0.799 |
| mean | 0.493 | 0.480 | 0.593 | 0.621 | 0.278 | 0.215 | 0.216 | 0.175 | 0.194 | 0.184 | 0.695 | 0.772 | 0.810 | 0.840 | 0.845 |
Table 2: Performance of different approaches on the test set of categories seen during training, trained with single views. Our results show that 3D43D is comparable with state-of-the-art models trained on an object-centric coordinate system in the single-view setting. Compared models are: 3D-R2N2 [3dr2n2], Pix2Mesh [pix2mesh], PSGN [psgn], AtlasNet [atlasnet] and OccNets [occnet].

In order to further validate our contribution, we provide results on multi-view reconstruction. We randomly sample 5 views of the object and compare our method with OccNets [occnet], the strongest competing method for single-view reconstruction. To provide a fair comparison, we extend the trained model provided by OccNets to the multiple view case by average pooling its conditional features across views at inference time. Since our method uses spatial features that are aware of scene geometry, we expect our aggregation mechanism to obtain more accurate reconstructions. The results shown in Tab. 3 and the qualitative results in Fig. 4 consistently agree with this expectation.

| Seen category | IoU [occnet] | IoU 3D43D | Chamfer-L1 [occnet] | Chamfer-L1 3D43D | Normal Consistency [occnet] | Normal Consistency 3D43D | F-score [occnet] | F-score 3D43D |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| airplane | 0.600 | 0.736 | 0.096 | 0.021 | 0.853 | 0.899 | 0.735 | 0.841 |
| bench | 0.547 | 0.663 | 0.176 | 0.027 | 0.834 | 0.881 | 0.691 | 0.789 |
| cabinet | 0.770 | 0.831 | 0.125 | 0.073 | 0.893 | 0.925 | 0.853 | 0.898 |
| car | 0.759 | 0.797 | 0.109 | 0.090 | 0.861 | 0.873 | 0.852 | 0.878 |
| chair | 0.568 | 0.716 | 0.187 | 0.063 | 0.846 | 0.911 | 0.704 | 0.824 |
| display | 0.593 | 0.752 | 0.168 | 0.089 | 0.884 | 0.935 | 0.723 | 0.851 |
| lamp | 0.415 | 0.625 | 1.083 | 0.256 | 0.764 | 0.858 | 0.546 | 0.752 |
| loudspeaker | 0.699 | 0.807 | 0.360 | 0.143 | 0.856 | 0.912 | 0.801 | 0.883 |
| rifle | 0.466 | 0.745 | 0.112 | 0.012 | 0.789 | 0.903 | 0.625 | 0.851 |
| sofa | 0.731 | 0.809 | 0.171 | 0.054 | 0.886 | 0.927 | 0.831 | 0.886 |
| table | 0.569 | 0.689 | 0.588 | 0.058 | 0.873 | 0.921 | 0.703 | 0.805 |
| telephone | 0.785 | 0.861 | 0.103 | 0.017 | 0.948 | 0.971 | 0.866 | 0.922 |
| vessel | 0.592 | 0.708 | 0.163 | 0.053 | 0.818 | 0.868 | 0.730 | 0.821 |
| mean | 0.621 | 0.749 | 0.265 | 0.073 | 0.854 | 0.906 | 0.743 | 0.846 |
Table 3: Performance metrics for multi-view reconstruction using 5 random views of objects from categories seen at training time, where we see that 3D43D achieves consistently better reconstruction.
Figure 4: Reconstructions from categories seen during training time using 5 input views. For each object: (Top row) Input views. (Middle row) OccNets [occnet] prediction (orbit of 5 views of the predicted mesh). (Bottom row) 3D43D prediction (orbit of 5 views of the predicted mesh). We can qualitatively see that 3D43D produces better results than OccNets [occnet] in terms of high-frequency geometry. Note that input views and reconstructions are not shown from the same viewpoint.

4.3 Generalization to unseen categories

We now turn to our second experimental setup, where we evaluate the ability of 3D43D to generalize to categories not seen during training. In order to do so, we restrict the training set to the top-3 most frequent categories in ShapeNet (Car, Chair and Airplane) following [unseenclasses], and test on the remaining categories. Tab. 4 compares the performance of 3D43D with two strong baselines: OccNets [occnet] and OccNets trained with a view-centric coordinate system ([occnet]-v). We extend OccNets to use view-centric coordinates in order to validate observations in recent papers [pixelsvoxelsviews, singleviewlearn] reporting that using a view-centric coordinate system improves generalization to unseen categories. We find empirically that this observation holds for models that do not aggregate information from multiple views. As discussed in Sect. 3.3, [occnet]-v suffers from systematic drawbacks due to the use of global features, and this results in degraded performance. Additionally, using a view-centric coordinate system only partially tackles the generalization problem, and further improvements can be obtained from the geometry-aware features and the mean and variance aggregation used by 3D43D.

We show sample reconstructions from this experiment in Fig. 5. The visualizations reveal that OccNets tend to work in a categorization regime, often mapping unseen categories to their closest counterparts in the training set. This is clearly visible in Fig. 5. This problem is not solved solely by using multiple views, which can be counterproductive by giving OccNets more confidence to reconstruct the wrong object.

Figure 5: Reconstruction of objects from unseen categories when training OccNets[occnet] and 3D43D on Cars, Chairs and Airplanes. For each object: (Top row) Input views. (Middle row) OccNets [occnet] prediction (orbit of 5 views of the predicted mesh). (Bottom row) 3D43D prediction (orbit of 5 views of the predicted mesh). OccNets [occnet] commonly map unseen categories to categories seen at training time. In comparison, 3D43D reconstructions are more accurate and less biased towards training categories. Note that input views and reconstructions are not shown from the same viewpoint.
| Unseen category | IoU | | | Chamfer- | | | Normal Consistency | | | F-score | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | [occnet] | [occnet]-v | 3D43D | [occnet] | [occnet]-v | 3D43D | [occnet] | [occnet]-v | 3D43D | [occnet] | [occnet]-v | 3D43D |
| (1 view) | | | | | | | | | | | | |
| bench | 0.251 | 0.291 | 0.302 | 0.752 | 0.323 | 0.357 | 0.714 | 0.733 | 0.706 | 0.374 | 0.426 | 0.447 |
| cabinet | 0.282 | 0.404 | 0.502 | 1.102 | 0.621 | 0.529 | 0.662 | 0.739 | 0.759 | 0.418 | 0.551 | 0.647 |
| display | 0.117 | 0.162 | 0.243 | 3.213 | 1.836 | 1.389 | 0.546 | 0.612 | 0.638 | 0.197 | 0.260 | 0.364 |
| lamp | 0.100 | 0.150 | 0.223 | 3.482 | 2.276 | 1.997 | 0.582 | 0.625 | 0.618 | 0.166 | 0.241 | 0.340 |
| loudspeaker | 0.311 | 0.405 | 0.507 | 1.649 | 0.860 | 0.744 | 0.655 | 0.731 | 0.749 | 0.452 | 0.552 | 0.649 |
| rifle | 0.155 | 0.150 | 0.236 | 2.465 | 2.206 | 0.707 | 0.539 | 0.527 | 0.588 | 0.255 | 0.252 | 0.3737 |
| sofa | 0.493 | 0.552 | 0.559 | 0.915 | 0.399 | 0.421 | 0.761 | 0.799 | 0.784 | 0.625 | 0.688 | 0.699 |
| table | 0.172 | 0.214 | 0.313 | 1.304 | 0.861 | 0.583 | 0.686 | 0.722 | 0.731 | 0.275 | 0.331 | 0.461 |
| telephone | 0.052 | 0.155 | 0.271 | 1.673 | 1.062 | 0.996 | 0.654 | 0.682 | 0.700 | 0.096 | 0.256 | 0.403 |
| vessel | 0.324 | 0.378 | 0.401 | 0.849 | 0.592 | 0.521 | 0.648 | 0.691 | 0.690 | 0.463 | 0.525 | 0.553 |
| mean | 0.226 | 0.286 | 0.356 | 1.740 | 1.104 | 0.824 | 0.645 | 0.686 | 0.696 | 0.332 | 0.408 | 0.494 |
| (multi-view) | | | | | | | | | | | | |
| bench | 0.288 | 0.147 | 0.463 | 0.508 | 1.960 | 0.113 | 0.729 | 0.625 | 0.800 | 0.421 | 0.242 | 0.617 |
| cabinet | 0.295 | 0.312 | 0.629 | 0.917 | 1.273 | 0.250 | 0.674 | 0.655 | 0.844 | 0.430 | 0.458 | 0.756 |
| display | 0.120 | 0.127 | 0.409 | 2.868 | 3.179 | 0.428 | 0.560 | 0.534 | 0.770 | 0.200 | 0.213 | 0.558 |
| lamp | 0.100 | 0.138 | 0.369 | 3.365 | 2.653 | 2.057 | 0.586 | 0.623 | 0.738 | 0.167 | 0.224 | 0.513 |
| loudspeaker | 0.315 | 0.333 | 0.627 | 1.460 | 1.344 | 0.392 | 0.660 | 0.677 | 0.829 | 0.457 | 0.480 | 0.753 |
| rifle | 0.180 | 0.095 | 0.498 | 1.866 | 2.610 | 0.115 | 0.567 | 0.444 | 0.760 | 0.290 | 0.169 | 0.655 |
| sofa | 0.525 | 0.356 | 0.679 | 0.732 | 1.445 | 0.147 | 0.776 | 0.663 | 0.858 | 0.656 | 0.508 | 0.795 |
| table | 0.186 | 0.177 | 0.455 | 1.122 | 1.771 | 0.255 | 0.694 | 0.700 | 0.827 | 0.295 | 0.285 | 0.609 |
| telephone | 0.036 | 0.131 | 0.549 | 1.588 | 1.457 | 0.184 | 0.689 | 0.592 | 0.861 | 0.066 | 0.226 | 0.691 |
| vessel | 0.347 | 0.256 | 0.521 | 0.683 | 1.524 | 0.145 | 0.661 | 0.603 | 0.776 | 0.489 | 0.390 | 0.669 |
| mean | 0.239 | 0.207 | 0.520 | 1.511 | 1.922 | 0.409 | 0.660 | 0.612 | 0.806 | 0.347 | 0.319 | 0.662 |

Table 4: Performance metrics for single and multi-view reconstruction when generalizing to unseen object categories after training only on the car, chair and plane categories.
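
As a sanity check on Tab. 4, the short sketch below recomputes the mean IoU rows for OccNets and 3D43D from the per-category values and compares how much each model gains from additional views. The per-category numbers are copied from the table; the helper name `column_mean` and the rounding to three decimals are our own choices for illustration.

```python
# Per-category IoU values copied from Tab. 4, in the order:
# bench, cabinet, display, lamp, loudspeaker, rifle, sofa, table, telephone, vessel.
iou = {
    ("occnet", "single"): [0.251, 0.282, 0.117, 0.100, 0.311, 0.155, 0.493, 0.172, 0.052, 0.324],
    ("occnet", "multi"):  [0.288, 0.295, 0.120, 0.100, 0.315, 0.180, 0.525, 0.186, 0.036, 0.347],
    ("3d43d", "single"):  [0.302, 0.502, 0.243, 0.223, 0.507, 0.236, 0.559, 0.313, 0.271, 0.401],
    ("3d43d", "multi"):   [0.463, 0.629, 0.409, 0.369, 0.627, 0.498, 0.679, 0.455, 0.549, 0.521],
}

def column_mean(values):
    """Mean over categories, rounded to the three decimals used in the table."""
    return round(sum(values) / len(values), 3)

means = {key: column_mean(values) for key, values in iou.items()}

# Absolute IoU gain when moving from one view to multiple views.
gain_occnet = round(means[("occnet", "multi")] - means[("occnet", "single")], 3)
gain_3d43d = round(means[("3d43d", "multi")] - means[("3d43d", "single")], 3)

print(means)
print("multi-view gain, OccNets:", gain_occnet)
print("multi-view gain, 3D43D:", gain_3d43d)
```

The recomputed means agree with the mean rows of the table, and the mean IoU gain from additional views is roughly an order of magnitude larger for 3D43D than for OccNets.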

4.4 Ablation

We perform an ablation study to show how the main design choices of our approach affect performance (i.e., a point representation that is aware of camera position and a variance cost to aggregate information across views). We take as our baseline a model conceptually similar to DISN [disn] but with a view-centric coordinate frame. We have already shown that, for seen categories, spatial feature maps provide substantial improvements over 1D features (see the improvements over OccNet [occnet] in Tabs. 2 and 3). Note that DISN [disn] reports similar results. In this paper, however, we focus on the generalization of the model to categories unseen during training, and our ablation shows that spatial feature maps are not the only critical design choice: our novel contributions further improve reconstruction accuracy for unseen categories.

For all our model ablations, our encoder (i.e., a UNet with a ResNet50 encoder) outputs spatial feature maps for each view, which are sampled at the locations corresponding to the particular point for which occupancy is predicted. Our ablation is divided into three models:

  • Point model (P): Here we take the features sampled across views and concatenate each of them with the point p before feeding the result through our MLP, which yields one feature representation per view. We then aggregate these per-view representations using average pooling, and the resulting vector is used as input to the next stage of the model.

  • Point+Camera model (P+C): In this version we also concatenate the camera location before processing the vector with the MLP. We then average pool the resulting features and use them as input to the next stage of the model.

  • Point+Camera+Variance model (P+C+V): In this model we use the same encoding as in P+C. However, we now compute the mean and variance of the per-view features and use them as input and conditioning, respectively, for the next stage of the model. This is our full 3D43D model.
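
The three ablation variants differ only in what is concatenated before the per-view MLP and in which statistics are pooled across views. The following is a minimal NumPy sketch of that difference. The single-layer MLP, the feature sizes, the random inputs, and the use of one shared 3-vector for p across views are all assumptions made for illustration; the sketch also stops at the pooled statistics and does not model how the variance conditions the downstream network.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(in_dim, out_dim):
    """Hypothetical stand-in for the per-view MLP: one linear layer plus ReLU."""
    weights = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda x: np.maximum(x @ weights, 0.0)

def encode_views(view_features, point, mlp, cameras=None):
    """Encode a point once per view.

    view_features: (V, F) features sampled for the point in each of V views.
    point:         (3,) coordinates of the query point p.
    cameras:       optional (V, 3) camera locations (used by P+C and P+C+V).
    """
    num_views = view_features.shape[0]
    parts = [view_features, np.tile(point, (num_views, 1))]
    if cameras is not None:
        parts.append(cameras)
    return mlp(np.concatenate(parts, axis=1))

def aggregate(encoded, use_variance=False):
    """Pool per-view encodings: mean only (P, P+C) or mean and variance (P+C+V)."""
    mean = encoded.mean(axis=0)
    variance = encoded.var(axis=0) if use_variance else None
    return mean, variance

num_views, feat_dim, hidden = 4, 8, 16
features = rng.normal(size=(num_views, feat_dim))
point = np.array([0.1, -0.2, 0.5])
cameras = rng.normal(size=(num_views, 3))

# P: features + point, average pooling.
p_mean, _ = aggregate(encode_views(features, point, make_mlp(feat_dim + 3, hidden)))
# P+C: features + point + camera location, average pooling.
pc_mean, _ = aggregate(encode_views(features, point, make_mlp(feat_dim + 6, hidden), cameras))
# P+C+V: same encoding as P+C, but both mean and variance are kept.
pcv_mean, pcv_var = aggregate(
    encode_views(features, point, make_mlp(feat_dim + 6, hidden), cameras),
    use_variance=True,
)
```

In this sketch the variance is zero whenever all views produce the same encoding and grows as the views disagree, which is the signal that the P+C+V variant exposes beyond the mean.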

These models are trained on 3 ShapeNet categories: plane, chair and car. We then report results on the test set of the 10 unseen categories. We train and evaluate our models with multiple views and report the average IoU across the unseen classes in Tab. 5, which shows that each of our contributions improves reconstruction accuracy.

| | P | P+C | P+C+V (3D43D) |
|---|---|---|---|
| IoU | 0.453 | 0.476 | 0.491 |
Table 5: Results of our ablation experiments.
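
To read Tab. 5 as incremental contributions, the small sketch below computes the IoU gain of each step of the ablation and the overall relative improvement. The IoU values are copied from the table; the variable names are illustrative.

```python
# Mean IoU on unseen categories, copied from Tab. 5.
ablation_iou = {"P": 0.453, "P+C": 0.476, "P+C+V": 0.491}

names = list(ablation_iou)
# Absolute IoU gain contributed by each successive design choice.
step_gains = {
    f"{prev} -> {cur}": round(ablation_iou[cur] - ablation_iou[prev], 3)
    for prev, cur in zip(names, names[1:])
}
# Relative improvement of the full model over the point-only baseline, in percent.
relative_total = round(
    100 * (ablation_iou["P+C+V"] - ablation_iou["P"]) / ablation_iou["P"], 1
)

print(step_gains)
print(f"relative improvement of P+C+V over P: {relative_total}%")
```

Under these numbers, adding the camera location accounts for the larger share of the gain, and the variance term adds a further, smaller improvement on top of it.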

5 Conclusions

In this paper, we studied factors that impact the generalization of learning-based 3D reconstruction models to categories unseen during training. We argued that, for a 3D reconstruction approach to generalize successfully to unseen classes, all these factors need to be addressed together. We showed empirically that by taking them into account when designing our model, we obtain large improvements over state-of-the-art methods when reconstructing objects from unseen categories. These improvements in generalization are a step forward for learned 3D reconstruction methods, which we hope will also enable recent Neural Rendering approaches [nerf, srn, enr] to go beyond the constrained scenario of training-category-specific models. Finally, larger datasets will lead to more informative priors. We believe that a clear understanding of these factors and their compound effects will enrich this promising avenue of research.