1 Introduction
Reconstructing the 3D shape of an object from monocular input views is a fundamental problem in computer vision. When the number of input views is small, reconstruction methods must rely on priors over object shapes. Learning-based algorithms encode such priors from data. Recently proposed approaches [atlasnet, disn, occnet] have achieved success in the single/multi-view, seen-category case, generalizing to novel objects within the categories seen during training. However, these approaches have difficulty generalizing to object categories not seen during training (cf. Fig. 1). We present progress in learning priors that generalize to unseen categories by incorporating a geometry-aware spatial feature map. Within this paradigm, we propose a point representation that is aware of camera position, and a variance cost to aggregate information across views.
A typical learning-based approach takes a single 2D view of an object as input and uses a model to generate a 3D reconstruction. What should happen to the 3D ground truth as the viewpoint of the 2D input changes? An object-centric coordinate system keeps the ground truth fixed to a canonical coordinate system, regardless of the viewpoint of the 2D input view. In contrast, a view-centric coordinate system rotates the ground-truth coordinate frame in conjunction with the input view. An example of the two coordinate systems is shown in Fig. 2 (in the graphics community the object-centric coordinate system is often referred to as world coordinates and the view-centric coordinate system as camera coordinates). Object-centric coordinate systems align shapes of the same category to an arbitrary, shared coordinate system. This introduces stable spatial relationships during training (e.g., wheels of different car shapes generally occupy the same region of the shared coordinate space). This makes the reconstruction task easier to learn, but these relationships are not necessarily shared across categories. Similar to [pixelsvoxelsviews, singleviewlearn], we show empirically that adopting a view-centric coordinate system improves generalization to unseen categories.
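As a toy illustration of the two conventions, the sketch below (NumPy, with a made-up point and camera poses) expresses the same ground-truth point in two camera frames: under view-centric supervision the target rotates with the input view, whereas under object-centric supervision it stays fixed in the world frame.

```python
import numpy as np

def world_to_camera(p_world, R, t):
    """Express a 3D point in the view-centric (camera) frame: p_cam = R @ p_world + t."""
    return R @ p_world + t

# A point on the object, fixed in the object-centric (world) frame.
p_world = np.array([1.0, 0.0, 0.0])

# Camera 1: identity pose. Camera 2: rotated 90 degrees about the z-axis.
R1, t1 = np.eye(3), np.zeros(3)
theta = np.pi / 2
R2 = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
t2 = np.zeros(3)

# Object-centric supervision: the target is p_world for both views.
# View-centric supervision: the target is the per-camera coordinate below.
p_cam1 = world_to_camera(p_world, R1, t1)   # [1, 0, 0]
p_cam2 = world_to_camera(p_world, R2, t2)   # [0, 1, 0]
```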
Another critical factor for achieving good generalization to unseen categories is the capacity of a model to encode geometric features when processing the input view. Similar to [disn], our model uses feature maps with spatial extent, rather than pooling across spatial locations to obtain a global representation. In [disn] the motivation for using spatial feature maps is to preserve fine-grained details of geometry (to better model categories in the training set). In contrast, in this paper we analyze generalization to unseen categories and how different encoding designs impact generalization capability. We argue that using a globally pooled representation encourages the model to perform reconstruction at the object level (since the pooling operation is invariant to spatial arrangements of the feature map), which makes it difficult to generalize to objects from unseen categories. By keeping the spatial extent of features, on the other hand, we process and represent an object at the part level. Critically, in contrast to [disn], we model the scene geometry across different views by explicitly embedding information about camera poses in the spatial feature maps. We show empirically that using these geometry-aware spatial feature maps increases generalization performance.
Finally, we use multi-view aggregation to improve generalization performance. Traditional approaches to 3D reconstruction, such as multi-view stereo (MVS) [mvs] or structure-from-motion (SfM) [sfm], exploit the geometry of multiple views via cost volumes instead of priors learned from data. These approaches fail in the single-view case. Single-view reconstruction models, on the other hand, must rely entirely on priors for occluded regions. We propose a model that combines learned priors with the complementary information gained from multiple views. We aggregate information from multiple views by taking inspiration from the cost volumes used in MVS and compute a variance cost across views. By refining its single-view estimates with additional views, our model shows improved generalization performance.
Individually, these factors are important, as backed by prior literature and our empirical results, and addressing them together leads to compounding effects on generalization. The view-centric coordinate system has been shown to improve generalization [pixelsvoxelsviews, singleviewlearn]. However, aggregating information from multiple views is also paramount for reconstructing categories not seen at training time, since the prior learned over the training categories is not trustworthy in this unseen-category case. To maximally benefit from aggregating information from multiple views, we require features that encode information about parts of objects, rather than encoding the object as a whole without preserving spatial information. Otherwise, aggregation can be counterproductive by reinforcing the wrong object prior. We show empirically that by compounding these three factors, 3D43D outperforms state-of-the-art 3D reconstruction algorithms when tested on both categories seen and unseen during training. Contrary to suggestions from previous work [unseenclasses, singleviewlearn], we achieve these gains in generalization without a drop in performance when testing on categories seen during training.
MV Consist. [tulsiani]  Diff. PCs. [diffpcs]  LMVS [lmvs]  PIFu [pifu]  DISN [disn]  OccNet [occnet]  3D43D  
Geometry  Voxel  Point cloud  Voxel/Depth  Func.  Func.  Func.  Func. 
Coordinate Sys.  Viewer  Viewer  Object  Object  Object  Object  Viewer 
Features  Global  Global  Spatial+Global  Spatial+Global  Spatial+Global  Global  Spatial+Geometric 
Multiview  No  No  Yes  Yes  Yes  No  Yes 
Generalization  No  No  Yes  No  No  No  Yes 
2 Related Work
Object Coordinate System. Careful and extensive experimental analysis in [pixelsvoxelsviews, singleviewlearn] has revealed that the object shape priors learned by most 3D reconstruction approaches act in a categorization regime rather than in the expected dense 3D reconstruction regime. In other words, these models reconstruct a 3D object by categorizing the input into a blend of the known 3D objects in the training set. This phenomenon has been mainly explained by the choice of an object-centric coordinate system.
Results in [pixelsvoxelsviews, singleviewlearn] showed that object-centric coordinate systems perform better for reconstructing viewpoints and categories seen during training, at the cost of significantly hampering the capability to generalize to categories or viewpoints unseen at training time. The converse was observed for view-centric coordinate systems, which generalize better to unseen objects and viewpoints at the cost of degraded reconstruction performance on known categories/viewpoints.
Feature Representation. Single-view 3D reconstruction approaches have recently obtained impressive results [occnet, pifu, disn, tulsiani, atlasnet, softraster, birds] despite the ill-posed nature of the problem. For these approaches to perform single-view reconstruction successfully, priors are required to resolve ambiguities. Not surprisingly, recent works show that using local features that retain spatial information [pifu, disn] improves reconstruction accuracy. However, none of these approaches analyze their performance on object categories unseen during training. A recent exception is [unseenclasses], where the authors propose a not fully differentiable approach for single-view reconstruction that relies on depth inpainting of spherical maps [sphericalmaps]. [unseenclasses] also differs from 3D43D in that 3D voxel grids are used as ground truth, and extra supervision at the level of depth maps is available at training time.
View Aggregation. Multi-view 3D reconstruction has traditionally been addressed by stereopsis approaches like MVS [mvs, mvs2] or SfM [sfm]. Modern learning-based approaches to MVS [mvsnet, mvs2] have incorporated powerful convolutional networks into the MVS pipeline. These networks focus on visible regions and do not make inferences about the geometry of occluded object parts.
Another interesting trend has been to exploit the multi-view consistency inductive bias from MVS to learn 3D shape and pose from pairs of images [tulsiani, diffpcs, keypointnet]. However, these approaches predict either a very sparse set of keypoints [keypointnet], a sparse point cloud [diffpcs], or a voxel grid [tulsiani], limiting them to fixed-resolution representations.
Conceptually close to our approach is [lmvs]. The authors propose differentiable proxies to the operations in the standard MVS pipeline, allowing end-to-end optimization. Although [lmvs] addresses multi-view aggregation, critical design choices in other aspects of the method limit its performance. First, the geometry representation produced is a voxel grid, making the estimation of high-resolution geometry infeasible. Second, the cost-volume optimization happens via a large 3D autoencoder with a nontrivial geometric interpretation. Third, view aggregation is performed in a recurrent fashion, making the model sensitive to the ordering of the views.
Properly extending the previously discussed single-view works [occnet, pifu, disn, tulsiani, atlasnet, softraster, birds] to the multi-view case is not trivial, although simple extensions that aggregate multiple views are briefly outlined in [pifu, disn]. Inspired by the cost-volume computation used in MVS [mvsnet], we aggregate information from different views by computing a variance cost (Sect. 3.3).
Geometry Representation. The choice of representation scheme for 3D objects has been at the core of 3D reconstruction research from the very beginning. Voxels [tulsiani, lmvs] have been used as a natural extension of 2D image representations, showing great results in low-resolution regimes. However, the memory and computation requirements of scaling voxel representations to higher resolutions prevent them from being widely used. Circumventing this problem, point clouds are a frequently used representation [diffpcs, keypointnet]. Point clouds deal with the computational cost of voxels by sparsifying the representation and eliminating neighboring-structure information. Meshes [birds, meshrenderer, softraster] add the neighboring structure back into point cloud representations. However, to make mesh estimation efficient, the neighboring structure has to be predefined (usually in the form of the connectivity of a convex polyhedron with a variable number of faces) and only deformations of that structure can be modeled. Finally, functional/implicit representations have recently gained interest [occnet, disn, deepsdf, deeplevelsets, cvxnet, nasa]. This representation encodes geometry as the level set of a function that can be evaluated at any point in ℝ³. Such a function can generate geometry at arbitrary resolutions by evaluating as many points as desired. In summary, Tab. 1 contrasts the contributions of the most relevant related literature with 3D43D.
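The resolution independence of implicit representations can be illustrated with a minimal sketch: here a closed-form sphere occupancy stands in for a learned occupancy network, and the same function is queried on grids of different resolutions without any change to the representation itself.

```python
import numpy as np

# Toy implicit geometry: occupancy of a unit sphere, o(p) = 1 iff ||p|| <= 1.
# A learned model would replace this closed form with a neural network.
def occupancy(points):
    return (np.linalg.norm(points, axis=-1) <= 1.0).astype(np.float32)

def query_grid(resolution, half_extent=1.5):
    """Evaluate the field on a regular grid; any resolution works with the same function."""
    axis = np.linspace(-half_extent, half_extent, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    return occupancy(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)

coarse = query_grid(16)   # low-resolution query
fine = query_grid(64)     # high-resolution query of the same representation

# The occupied volume fraction converges to vol(sphere)/vol(cube) as resolution grows.
frac_fine = fine.mean()
```

A mesh extractor such as marching cubes would then recover the level set from any of these grids, which is how functional models produce geometry at arbitrary resolution.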
3 Model
We now describe our approach in terms of the choice of geometry representation, the use of geometry-aware feature maps, and the multi-view aggregation strategy. Our model is shown in Fig. 3.
3.1 Functional Geometry Representation
3D43D takes the form of a functional estimator of geometry [occnet, disn, atlasnet, deepsdf]. Given a view of an object, our goal is to predict the object occupancy function o: ℝ³ → {0, 1} indicating whether a given point p lies inside the mesh of the object. To do so, we learn a parametric scalar field f_θ(p, I), where I is a monocular (RGB) view of the object. In the remainder of the text the parameter subscript θ is dropped for ease of notation. This scalar field is implemented by a fully connected deep neural network with residual blocks (details of the implementation can be found in the supplementary material).

3.2 Encoding Geometry-aware Features
Our goal is to learn a prior over occupancy functions that generalizes well to unseen categories, which we address by giving our model the capacity to reason about local object cues and scene geometry. To do so, we process input views with a convolutional U-Net [unet] encoder with skip connections (refer to the supplementary material for implementation details). This results in a feature map C for a given RGB view. This is in contrast to the approach taken in [occnet, atlasnet], where a view is represented by a global feature obtained by pooling across the spatial dimensions of the feature map. Our hypothesis is that preserving the spatial dimensions in the latent representation is crucial to retain local object cues, which greatly improves generalization to unseen views and categories, as demonstrated in the experiments. To predict the occupancy value for a 3D point p in world coordinates, we project this point into its location π(p) in C using the extrinsic camera parameters [R | t] and intrinsic parameters K (we assume camera intrinsics to be constant),

π(p) = K (R p + t),    (1)

where the result is normalized by its last coordinate to obtain pixel coordinates, and sample the corresponding feature vector C(π(p)). We use bilinear sampling as in [spatialtransformer] to make the sampling operation differentiable.
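A minimal NumPy sketch of this projection-and-sampling step, assuming the standard pinhole form of Eq. (1); the intrinsics, pose, and random feature map below are toy stand-ins, not the values used in the paper.

```python
import numpy as np

def project(p_world, K, R, t):
    """Pinhole projection of Eq. (1): x = K (R p + t), then perspective divide."""
    p_cam = R @ p_world + t
    x = K @ p_cam
    return x[:2] / x[2]          # pixel coordinates (u, v)

def bilinear_sample(C, uv):
    """Differentiable sampling of feature map C (H x W x F) at a continuous location."""
    u, v = uv
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * C[v0, u0] + du * (1 - dv) * C[v0, u0 + 1]
            + (1 - du) * dv * C[v0 + 1, u0] + du * dv * C[v0 + 1, u0 + 1])

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 2.0])
C = np.random.default_rng(0).normal(size=(64, 64, 8))   # toy 64 x 64, 8-channel map

p = np.array([0.1, -0.1, 0.0])            # 3D point in world coordinates
uv = project(p, K, R, t)                  # lands inside the map
c = bilinear_sample(C, uv)                # sampled feature vector, shape (8,)

# The ray ambiguity discussed next: any point on the same camera ray projects to
# the same pixel (camera center here is at (0, 0, -2)), so c alone cannot
# distinguish points along the ray.
p_far = np.array([0.2, -0.2, 2.0])        # a second point on the same ray
```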
The sampled feature vector encodes localized appearance information but lacks any notion of the structural geometry of the underlying scene. This causes ambiguities in the representation since, for example, all points on a ray shot from the camera origin are projected to the same location in the image plane. Thus, the sampled feature vector cannot uniquely characterize the points on the ray (e.g., to predict occupancy). Recent works [pifu, disn] mitigate this issue by augmenting the sampled feature to explicitly encode the coordinates of the 3D point p. This is often done by concatenating the feature (or a latent representation of it [disn]) with p, and further processing the result via additional fully connected blocks.
However, p (or its representation) is sensitive to the choice of coordinate system [pixelsvoxelsviews]. Recent approaches [occnet, disn] use a canonical object-centric coordinate system for each object category, which has been shown to generalize poorly to categories not seen during training [singleviewlearn]. On the other hand, expressing p in a view-centric coordinate system (also known as a camera coordinate system) improves generalization to unseen categories [pixelsvoxelsviews, singleviewlearn, unseenclasses]. Note that if p is expressed in the view-centric coordinate system, the characterization of the scene is incomplete, since it lacks information about where the rays passing through p originated in the image capturing process (the representation is not aware of the origin of the view-centric coordinate system in the scene).
To tackle this issue we represent p in the camera coordinate system, and give the representation access to the origin of the camera coordinate system with respect to the world (the camera position in world coordinates). Therefore, after sampling the feature vector we concatenate it with the point in camera coordinates and the camera position, and process the result with an MLP with residual blocks, resulting in a feature representation that is aware of the scene geometry. This feature representation is then input to the occupancy field f. Note that this does not require additional camera information compared to [disn, pifu], since the camera position is already used to project p into the image plane to sample the feature map. In our model we explicitly condition the representation on the camera position instead of only implicitly using the camera position to sample feature maps. Fig. 3 shows our model.
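The construction of this geometry-aware feature can be sketched as follows; the dimensions and the two-layer random-weight MLP are made-up stand-ins for the residual-block MLP described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    """A minimal two-layer ReLU MLP standing in for the residual blocks in the paper."""
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

# Hypothetical dimensions: 8-d sampled feature, 3-d point, 3-d camera position.
F, H, G = 8, 32, 16
W1, b1 = rng.normal(size=(H, F + 3 + 3)), np.zeros(H)
W2, b2 = rng.normal(size=(G, H)), np.zeros(G)

c = rng.normal(size=F)                 # feature sampled from the map at the projection
p_cam = np.array([0.1, -0.1, 2.0])     # query point in view-centric coordinates
cam_pos = np.array([0.0, 0.0, -2.0])   # camera origin in world coordinates

# Geometry-aware feature: concatenate appearance, point, and camera position,
# then process with the MLP; this is the per-view representation fed to f.
g = mlp(np.concatenate([c, p_cam, cam_pos]), W1, b1, W2, b2)
```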
3.3 Multi-View Aggregation
We now turn to the task of aggregating information from multiple views to estimate occupancy. Traditionally, view aggregation approaches for geometry estimation require the explicit computation of a 3D cost volume, which is either refined using graph optimization techniques in traditional MVS approaches [mvs] or processed by 3D convolutional autoencoders with a large number of parameters in learned models [lmvs, mvsnet, mvs2]. Here we do not explicitly construct a cost volume for the whole scene; instead, we compute pointwise estimates of that cost volume. One key observation is that our model is able to estimate geometry for parts of the object that are occluded in the input views, as opposed to MVS approaches that only predict geometry for visible parts of a scene (depth). As a result, our approach integrates reconstruction of visible geometry and generation of unseen geometry in the same framework.
Our task is to predict the ground-truth occupancy values of points p, given a set of N posed RGB views. To do so, we independently compute geometry-aware representations g_1, …, g_N across views for each point, as shown in Sect. 3.2. For our model to deal with a variable number of views, a pooling operation over the g_n is required. Modern approaches to estimating the complete geometry of a scene (visible and occluded) from multiple views rely on element-wise pooling operators like mean or max [disn, pifu]. These element-wise operators can be suitable for categories seen at training time. However, to better generalize to unseen categories it is oftentimes beneficial to rely on the alignment cost between features, as done in purely geometric (non-learning-based) approaches [mvs]. Inspired by traditional geometric approaches, we propose to use an approximation to the alignment cost between the local features g_n, computed as their variance,

σ² = (1/N) Σ_{n=1}^{N} (g_n − ḡ)²,    (2)

where ḡ = (1/N) Σ_{n=1}^{N} g_n is the average of the g_n. A key design choice is that we do not use the variance as the sole input to our functional occupancy decoder, since the variance will be zero everywhere and uninformative when only a single view is available. Instead, we add a conditional input branch to our decoder f, which takes ḡ as input conditioned on σ². We also give the model access to a global object representation by introducing a residual path that performs additive conditioning: we perform average pooling on the feature maps both spatially and across views to obtain a global feature that is added to ḡ. Conditioning in f is implemented via conditional batch normalization layers [cbn] (implementation details in the supplementary material). This formulation naturally handles the single-view case, where σ² = 0. Finally, our objective function is the binary cross-entropy between predicted and ground-truth occupancies over sampled points,

L(θ) = Σ_i BCE(f(p_i | I_1, …, I_N), o_i).    (3)
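A sketch of the variance-based aggregation of Eq. (2), with random per-view features of a hypothetical dimension; it shows that the scheme degrades gracefully to the single-view case (zero variance) and is invariant to the ordering of the views.

```python
import numpy as np

def aggregate(g_views):
    """Aggregate per-view geometry-aware features (N x G): the mean is the decoder
    input, the element-wise variance of Eq. (2) is the conditioning signal."""
    g_mean = g_views.mean(axis=0)
    g_var = ((g_views - g_mean) ** 2).mean(axis=0)
    return g_mean, g_var

rng = np.random.default_rng(0)
G = 16
g_multi = rng.normal(size=(5, G))          # five posed views of the same point

mean5, var5 = aggregate(g_multi)           # multi-view case
mean1, var1 = aggregate(g_multi[:1])       # single-view case

# With one view the variance is identically zero, which is why it is used as a
# conditioning signal rather than the sole decoder input. Unlike recurrent
# aggregation, mean/variance pooling is permutation invariant.
mean_perm, var_perm = aggregate(g_multi[::-1])
```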
4 Experiments
We present empirical results showing how 3D43D performs in two experimental setups: reconstruction of categories seen during training, and generalization to categories unseen during training. In the first setup, our goal is to show that 3D43D is competitive with state-of-the-art 3D reconstruction approaches. In the second, we show that 3D43D generalizes better to categories unseen at training time. Finally, we conduct ablation experiments to show how the proposed contributions impact reconstruction accuracy.
4.1 Settings
Dataset: For all of our experiments, we use the ShapeNet [shapenet] subset of Choy et al. [3dr2n2], together with their renderings. For a fair comparison with different methods we use the same train/test splits and occupancy ground truth as [occnet], which provides an in-depth comparison with several approaches (readers interested in the ground-truth generation process are referred to [occnet, stutz2018learning]).
Metrics: Following [occnet], we report volumetric IoU, Chamfer-L1 distance, and normal consistency score. In addition, as recently suggested by [singleviewlearn], we report the F-score. Volumetric IoU is defined as the intersection over union of the volumes of two meshes. An estimate of the volumetric IoU is computed by randomly sampling 100k points and determining whether the points reside in the meshes [occnet]. The Chamfer-L1 distance is a relaxation of the symmetric Hausdorff distance, measuring the average of reconstruction accuracy and completeness. The normal consistency score is the mean absolute dot product of the normals in one mesh and the normals at the corresponding nearest neighbors in the other mesh [occnet]. Finally, the F-score can be understood as "the percentage of correctly reconstructed surface" [singleviewlearn].

Implementation: We resize our input images to pixels. For our encoder, we choose a U-Net with a ResNet50 [resnet] encoder; the final feature maps C have channels and the same spatial size as the input. The function that computes geometric features is an MLP with ResNet blocks that takes a vector of dimensions and outputs a dimensional representation g. Our occupancy function is an MLP with ResNet blocks where all layers have hidden units except the output layer. To train the occupancy function we sample points with their respective occupancy values from a pool of k points and use input views (this is due to memory limitations; nonetheless, the method generalizes to an arbitrary number of input views during inference, and details of different sampling strategies can be found in [occnet]). We train our network with batches of samples and use Adam [adam] with default PyTorch parameters as our optimizer. We use a learning rate of and train our network for epochs. To obtain meshes at inference time from the occupancy function we follow the process in [occnet].

4.2 Categories seen during training
In this section, we compare our method to various baselines on single-view and multi-view 3D reconstruction experiments. For the single-view setup we report results on standard experiments for reconstructing unseen objects of categories seen during training (cf. Tab. 2). We compare 3D43D with 3D-R2N2 [3dr2n2], Pix2Mesh [pix2mesh], PSGN [psgn], AtlasNet [atlasnet] and OccNets [occnet]. Encouragingly, 3D43D performs on par with the state-of-the-art OccNets [occnet] approach, which uses global features able to encode the semantics of the objects in the training set. In addition, OccNets make use of an object-centric coordinate system, which aligns all shapes to the same canonical orientation, making the reconstruction problem simpler. Our results indicate that spatial feature maps encode useful information for reconstruction despite being spatially localized. This result backs up our initial hypothesis and is critical for establishing a good baseline performance. From this baseline, we explore the performance of our model when generalizing to unseen categories in different scenarios (Sect. 4.3).
IoU  Chamfer  Normal Consistency  

Seen category  [3dr2n2]  [pix2mesh]  [occnet]  3D43D  [3dr2n2]  [psgn]  [pix2mesh]  [atlasnet]  [occnet]  3D43D  [3dr2n2]  [pix2mesh]  [atlasnet]  [occnet]  3D43D 
airplane  0.426  0.420  0.591  0.571  0.227  0.137  0.187  0.104  0.134  0.096  0.629  0.759  0.836  0.845  0.825 
bench  0.373  0.323  0.492  0.502  0.194  0.181  0.201  0.138  0.150  0.112  0.678  0.732  0.779  0.814  0.809 
cabinet  0.667  0.664  0.750  0.761  0.217  0.215  0.196  0.175  0.153  0.119  0.782  0.834  0.850  0.884  0.886 
car  0.661  0.552  0.746  0.741  0.213  0.169  0.180  0.141  0.149  0.122  0.714  0.756  0.836  0.852  0.844 
chair  0.439  0.396  0.530  0.564  0.270  0.247  0.265  0.209  0.206  0.193  0.663  0.746  0.791  0.829  0.832 
display  0.440  0.490  0.518  0.548  0.605  0.284  0.239  0.198  0.258  0.166  0.720  0.830  0.858  0.857  0.883 
lamp  0.281  0.323  0.400  0.453  0.778  0.314  0.308  0.305  0.368  0.561  0.560  0.666  0.694  0.751  0.766 
loudspeaker  0.611  0.599  0.677  0.729  0.318  0.316  0.285  0.245  0.266  0.229  0.711  0.782  0.825  0.848  0.868 
rifle  0.375  0.402  0.480  0.529  0.183  0.134  0.164  0.115  0.143  0.248  0.670  0.718  0.725  0.783  0.798 
sofa  0.626  0.613  0.693  0.718  0.229  0.224  0.212  0.177  0.181  0.125  0.731  0.820  0.840  0.867  0.875 
table  0.420  0.395  0.542  0.574  0.239  0.222  0.218  0.190  0.182  0.146  0.732  0.784  0.832  0.860  0.864 
telephone  0.611  0.661  0.746  0.740  0.195  0.161  0.149  0.128  0.127  0.107  0.817  0.907  0.923  0.939  0.935 
vessel  0.482  0.397  0.547  0.588  0.238  0.188  0.212  0.151  0.201  0.175  0.629  0.699  0.756  0.797  0.799 
mean  0.493  0.480  0.593  0.621  0.278  0.215  0.216  0.175  0.194  0.184  0.695  0.772  0.810  0.840  0.845 
To further validate our contribution, we provide results on multi-view reconstruction. We randomly sample 5 views of the object and compare our method with OccNets [occnet], the strongest single-view reconstruction baseline. To provide a fair comparison, we extend the trained model provided by OccNets to the multi-view case by average pooling their conditional features across views at inference time. Since our method uses spatial features that are aware of scene geometry, we expect our aggregation mechanism to obtain more accurate reconstructions. The results in Tab. 3 and the qualitative results in Fig. 4 consistently agree with this observation.
IoU  Chamfer  Normal Consistency  Fscore  

Seen category  [occnet]  3D43D  [occnet]  3D43D  [occnet]  3D43D  [occnet]  3D43D 
airplane  0.600  0.736  0.096  0.021  0.853  0.899  0.735  0.841 
bench  0.547  0.663  0.176  0.027  0.834  0.881  0.691  0.789 
cabinet  0.770  0.831  0.125  0.073  0.893  0.925  0.853  0.898 
car  0.759  0.797  0.109  0.090  0.861  0.873  0.852  0.878 
chair  0.568  0.716  0.187  0.063  0.846  0.911  0.704  0.824 
display  0.593  0.752  0.168  0.089  0.884  0.935  0.723  0.851 
lamp  0.415  0.625  1.083  0.256  0.764  0.858  0.546  0.752 
loudspeaker  0.699  0.807  0.360  0.143  0.856  0.912  0.801  0.883 
rifle  0.466  0.745  0.112  0.012  0.789  0.903  0.625  0.851 
sofa  0.731  0.809  0.171  0.054  0.886  0.927  0.831  0.886 
table  0.569  0.689  0.588  0.058  0.873  0.921  0.703  0.805 
telephone  0.785  0.861  0.103  0.017  0.948  0.971  0.866  0.922 
vessel  0.592  0.708  0.163  0.053  0.818  0.868  0.730  0.821 
mean  0.621  0.749  0.265  0.073  0.854  0.906  0.743  0.846 
4.3 Generalization to unseen categories
We now turn to our second experimental setup, where we evaluate the ability of 3D43D to generalize to categories not seen during training. To do so, we restrict the training set to the top-3 most frequent categories in ShapeNet (Car, Chair and Airplane) following [unseenclasses], and test on the remaining categories. Tab. 4 compares the performance of 3D43D with two strong baselines: OccNets [occnet] and OccNets trained with a view-centric coordinate system ([occnet]-v). We extend OccNets to use view-centric coordinates in order to validate observations in recent papers [pixelsvoxelsviews, singleviewlearn] reporting that a view-centric coordinate system improves generalization to unseen categories. We find empirically that this observation holds for models that do not aggregate information from multiple views. As discussed in Sect. 3.3, [occnet]-v suffers from systematic drawbacks due to the use of global features, and this results in degraded performance. Additionally, using a view-centric coordinate system only partially tackles the generalization problem; further improvements come from the geometry-aware features and the mean and variance aggregation used by 3D43D.
We show sample reconstructions from this experiment in Fig. 5. The visualizations reveal that OccNets tend to operate in a categorization regime, often mapping unseen categories to their closest counterparts in the training set. This problem is not solved solely by using multiple views, which can even be counterproductive, giving OccNets more confidence in reconstructing the wrong object.
IoU  Chamfer  Normal Consistency  Fscore  

Unseen category  [occnet]  [occnet]v  3D43D  [occnet]  [occnet]v  3D43D  [occnet]  [occnet]v  3D43D  [occnet]  [occnet]v  3D43D 
(1 view)  
bench  0.251  0.291  0.302  0.752  0.323  0.357  0.714  0.733  0.706  0.374  0.426  0.447 
cabinet  0.282  0.404  0.502  1.102  0.621  0.529  0.662  0.739  0.759  0.418  0.551  0.647 
display  0.117  0.162  0.243  3.213  1.836  1.389  0.546  0.612  0.638  0.197  0.260  0.364 
lamp  0.100  0.150  0.223  3.482  2.276  1.997  0.582  0.625  0.618  0.166  0.241  0.340 
loudspeaker  0.311  0.405  0.507  1.649  0.860  0.744  0.655  0.731  0.749  0.452  0.552  0.649 
rifle  0.155  0.150  0.236  2.465  2.206  0.707  0.539  0.527  0.588  0.255  0.252  0.3737 
sofa  0.493  0.552  0.559  0.915  0.399  0.421  0.761  0.799  0.784  0.625  0.688  0.699 
table  0.172  0.214  0.313  1.304  0.861  0.583  0.686  0.722  0.731  0.275  0.331  0.461 
telephone  0.052  0.155  0.271  1.673  1.062  0.996  0.654  0.682  0.700  0.096  0.256  0.403 
vessel  0.324  0.378  0.401  0.849  0.592  0.521  0.648  0.691  0.690  0.463  0.525  0.553 
mean  0.226  0.286  0.356  1.740  1.104  0.824  0.645  0.686  0.696  0.332  0.408  0.494 
(5 views)  
bench  0.288  0.147  0.463  0.508  1.960  0.113  0.729  0.625  0.800  0.421  0.242  0.617 
cabinet  0.295  0.312  0.629  0.917  1.273  0.250  0.674  0.655  0.844  0.430  0.458  0.756 
display  0.120  0.127  0.409  2.868  3.179  0.428  0.560  0.534  0.770  0.200  0.213  0.558 
lamp  0.100  0.138  0.369  3.365  2.653  2.057  0.586  0.623  0.738  0.167  0.224  0.513 
loudspeaker  0.315  0.333  0.627  1.460  1.344  0.392  0.660  0.677  0.829  0.457  0.480  0.753 
rifle  0.180  0.095  0.498  1.866  2.610  0.115  0.567  0.444  0.760  0.290  0.169  0.655 
sofa  0.525  0.356  0.679  0.732  1.445  0.147  0.776  0.663  0.858  0.656  0.508  0.795 
table  0.186  0.177  0.455  1.122  1.771  0.255  0.694  0.700  0.827  0.295  0.285  0.609 
telephone  0.036  0.131  0.549  1.588  1.457  0.184  0.689  0.592  0.861  0.066  0.226  0.691 
vessel  0.347  0.256  0.521  0.683  1.524  0.145  0.661  0.603  0.776  0.489  0.390  0.669 
mean  0.239  0.207  0.520  1.511  1.922  0.409  0.660  0.612  0.806  0.347  0.319  0.662 
4.4 Ablation
We perform an ablation study to show how the main design choices of our approach affect performance (i.e., a point representation that is aware of camera position, and a variance cost to aggregate information across views). We take as our baseline a model conceptually similar to DISN [disn] but with a view-centric coordinate frame. We have already shown that spatial feature maps provide substantial improvements over global 1D features for seen categories (improvements over OccNet [occnet] shown in Tabs. 2 and 3); DISN [disn] reports similar results. However, in this paper we focus on analyzing generalization to categories unseen during training, and our ablation shows that spatial feature maps are not the only critical design choice: our novel contributions further improve reconstruction accuracy for unseen categories.
For all our model ablations, our encoder (i.e., a U-Net with a ResNet50 encoder) outputs spatial feature maps for each view that are sampled at the locations corresponding to a particular point for which occupancy is predicted. Our ablation is divided into three models:

Point model (P): Here we take the features sampled across views and concatenate each with the query point before feeding it through our MLP. We take the resulting feature representations across views and aggregate them using average pooling; the resulting vector is used as input to the occupancy function.

Point+Camera model (P+C): In this version we also concatenate the camera position before processing the vector with the MLP. We then average pool the resulting features and use them as input to the occupancy function.

Point+Camera+Variance model (P+C+V): In this model we take the same per-view encoding as in P+C. However, we now compute the mean and variance of the features across views and use them as input and conditioning, respectively, for the occupancy function. This is our full 3D43D model.
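The three ablation variants can be summarized in a sketch; the dimensions are hypothetical, and a single random linear map stands in for the trained MLP.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, G = 5, 8, 16                     # views, sampled-feature dim, output dim (made up)
c = rng.normal(size=(N, F))            # per-view sampled features
p = rng.normal(size=(N, 3))            # the query point in each view's camera frame
cam = rng.normal(size=(N, 3))          # per-view camera positions
W = rng.normal(size=(G, F + 3 + 3)) * 0.1

def h(x):
    """Single linear layer standing in for the MLP; sliced to the input width."""
    return W[:, :x.shape[-1]] @ x

def model_P(c, p):
    # P: feature + point, average pooled across views.
    g = np.stack([h(np.concatenate([c[i], p[i]])) for i in range(N)])
    return g.mean(axis=0)

def model_PC(c, p, cam):
    # P+C: also concatenate the camera position before the MLP.
    g = np.stack([h(np.concatenate([c[i], p[i], cam[i]])) for i in range(N)])
    return g.mean(axis=0)

def model_PCV(c, p, cam):
    # P+C+V: mean as decoder input, variance as the conditioning signal.
    g = np.stack([h(np.concatenate([c[i], p[i], cam[i]])) for i in range(N)])
    return g.mean(axis=0), g.var(axis=0)
```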
These models are trained on 3 ShapeNet categories: plane, chair and car. We then report results on the test set of 10 unseen categories. We train and evaluate our model with views and report the average IoU across the unseen classes in Tab. 5, which shows that our novel contributions improve reconstruction accuracy.
P  P+C  P+C+V (3D43D)  
IoU  0.453  0.476  0.491 
5 Conclusions
In this paper, we studied factors that impact the generalization of learning-based 3D reconstruction models to categories unseen during training. We argued that for a 3D reconstruction approach to generalize successfully to unseen classes, all these factors need to be addressed together. We empirically showed that by taking this into account when designing our model, we obtain large improvements over state-of-the-art methods when reconstructing objects of unseen categories. These improvements in generalization are a step forward for learned 3D reconstruction methods, which we hope will also enable recent Neural Rendering approaches [nerf, srn, enr] to go beyond the constrained scenario of training category-specific models. Finally, larger datasets will lead to more informative priors. We believe that having a clear understanding of these factors and their compound effects will enrich this promising avenue of research.