LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation

12/01/2019 ∙ by Keunhong Park, et al. ∙ University of Washington ∙ Nvidia

Current 6D object pose estimation methods usually require a 3D model for each object. These methods also require additional training in order to incorporate new objects. As a result, they are difficult to scale to a large number of objects and cannot be directly applied to unseen objects. In this work, we propose a novel framework for 6D pose estimation of unseen objects. We design an end-to-end neural network that reconstructs a latent 3D representation of an object using a small number of reference views of the object. Using the learned 3D representation, the network is able to render the object from arbitrary views. Using this neural renderer, we directly optimize for pose given an input image. By training our network with a large number of 3D shapes for reconstruction and rendering, our network generalizes well to unseen objects. We present a new dataset for unseen object pose estimation–MOPED. We evaluate the performance of our method for unseen object pose estimation on MOPED as well as the ModelNet dataset.




1 Introduction

The pose of an object defines where it is in space and how it is oriented. An object pose is typically defined by a 3D orientation (rotation) and a translation, comprising six degrees of freedom (6D). Knowing the pose of an object is crucial for any application that involves interacting with real world objects. For example, in order for a robot to manipulate objects it must be able to reason about the pose of the object. In augmented reality, 6D pose estimation enables virtual interaction and re-rendering of real world objects.
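As a concrete illustration (not from the paper), a 6D pose can be applied to object points as a rotation followed by a translation. The following is a minimal numpy sketch with made-up values:

```python
import numpy as np

def apply_pose(R, t, points):
    # Transform Nx3 object-space points into camera space: p' = R p + t.
    return points @ R.T + t

# A 90-degree rotation about the z-axis plus a 2 m translation along z.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([0.0, 0.0, 2.0])
p = apply_pose(R, t, np.array([[1.0, 0.0, 0.0]]))
```

The rotation matrix and translation here are arbitrary examples; estimating them from an image is the problem this paper addresses.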

Figure 1: We present an end-to-end differentiable modeling and rendering pipeline. We use this pipeline to perform pose estimation using simple gradient updates.

In order to estimate the 6D pose of objects, current state-of-the-art methods [41, 6, 37] require a 3D model for each object. Methods based on renderings [34] usually need high-quality 3D models, typically obtained using 3D scanning devices. Although modern 3D reconstruction and scanning techniques such as [21] can generate 3D models of objects, they typically require significant effort. Building a 3D model for every object is therefore infeasible at scale.

Furthermore, existing pose estimation methods require extensive training under different lighting conditions and occlusions. For methods that train a single network for multiple objects [41], the pose estimation accuracy drops significantly with the increase in number of objects. This is due to large variation of object appearances depending on the pose. To remedy this mode of degradation, some approaches train a separate network for each object [34, 33, 6]. This approach is not scalable to a large number of objects. Regardless of using a single or multiple networks, all model-based methods require extensive training for unseen test objects that are not in the training set.

In this paper, we investigate the problem of learning 3D object representations for 6D object pose estimation without 3D models and without extra training for unseen objects during test time. The core of our method is a novel neural network that takes a set of reference RGB images of a target object with known poses, and internally builds a 3D representation of the object. Using the internal 3D representation, the network is able to render arbitrary views of the object. To estimate object pose, the network compares the input image with its rendered images in a gradient descent fashion to search for the best pose where the rendered image matches the input image. Applying the network to an unseen object only requires collecting views with registered poses using traditional techniques [21] and feeding a small subset of those views with the associated poses to the network, instead of training for the new object which takes time and computational resources.

Our network design is inspired by space carving [13]. We build a 3D voxel representation of an object by computing 2D latent features and projecting them to a canonical 3D voxel grid using a deprojection unit inspired by [22]. This operation can be interpreted as space carving in latent space. Rendering a novel view is conducted by rotating the latent voxel representation to the new view and projecting it into the 2D image space. Using the projected latent features, a decoder generates a new view image by first predicting the depth map of the object at the query view and then assigning a color to each pixel by combining corresponding pixel values from the different reference views.

To reconstruct and render unseen objects, we train the network on random 3D meshes from the ShapeNet dataset [4] that are textured using images from the MS-COCO dataset [16] under different lighting conditions. Our experiments show that the model generalizes to novel object categories and instances. For pose estimation during inference, we assume that the object of interest is segmented with a generic object instance segmentation method such as [42]. The pose of the object is estimated by finding a 6D pose that maximizes the similarity between the latent representation of the segmented object in the input image and the latent representation generated at that pose from the network. The object pose is computed using the gradient of the latent similarity with respect to the object pose. Fig. 1 illustrates our reconstruction and pose estimation pipeline.

We believe that the problem of unseen object pose estimation from limited views in the absence of high fidelity textured 3D models is very important from a practical point of view. To this end, we present a new evaluation dataset called MOPED, Model-free Object Pose Estimation Dataset. We summarize our key contributions:

  • We propose a novel neural network that reconstructs a latent representation of a novel object given a limited set of reference views and can subsequently render it from arbitrary viewpoints without additional training.

  • We perform pose estimation on unseen objects given reference views without additional training.

  • We introduce MOPED–a dataset for evaluating model-free, zero-shot pose estimation. We provide reference images of objects taken in controlled environments and test images taken in uncontrolled environments.

Figure 2: A high-level overview of our architecture. 1) Our modeling network takes an image and mask and predicts a feature volume for each input view. The predicted feature volumes are then fused into a single canonical latent object by the fusion module. 2) Given the latent object, our rendering network produces a depth map and a mask for any output camera.

2 Related Work

Pose Estimation. Pose estimation methods fall into three major categories. The first category tackles the problem of pose estimation by designing network architectures that facilitate pose estimation [20, 12, 9]. The second category formulates pose estimation as predicting a set of 2D image features, such as the projection of 3D box corners [34, 37] or the direction toward the center of the object [41], then recovering the pose of the object using the predictions. The third category estimates the pose of objects by aligning a rendering of the 3D model to the image. DeepIM [15] estimates the pose of the object by learning to align the 3D model of the object to the image. Another approach is to learn a model that can reconstruct the object with different poses [33, 6]. These methods then use the latent representation of the object to estimate the pose of the object. A limitation of this line of work is that they need to train separate auto-encoders for each object category and there is a lack of knowledge transfer between object categories. In addition, these methods require high fidelity textured 3D models for each object which are not trivial to build in practice since it involves specialized hardware [29]. Our method addresses both of these limitations: our method works with a set of reference views with registered poses instead of a 3D model. Without additional training, our system builds a latent representation from the reference views which can be rendered to color and depth for arbitrary viewpoints. Similar to [33, 6], we seek to find a pose that minimizes the difference in latent space between the query object and the test image.

3D shape learning and novel view synthesis. Inferring shapes of objects at the category level has recently gained a lot of attention. Shape geometry has been represented as voxels [5], signed distance functions (SDF) [25], point clouds [43], and as implicit functions encoded by a neural network [31]. These methods are trained at the category level and can only represent different instances within the categories they were trained on. In addition, these models only capture the shape of the object and do not model the appearance of the object. To overcome this limitation, recent works [23, 22, 31] decode appearance from neural 3D latent representations that respect projective geometry, generalizing well to novel viewpoints. Novel views are generated by transforming the latent representation in 3D and projecting it to 2D. A decoder then generates a novel view from the projected features. Some methods find a nearest neighbor shape proxy and infer high quality appearances but cannot handle novel categories [38, 24]. Differentiable rendering [14, 22, 17] systems seek to implement the rendering process (rasterization and shading) in a differentiable manner so that gradients can be propagated to and from neural networks. Such methods can be used to directly optimize parameters such as pose or appearance. Current differentiable rendering methods are limited by the difficulty of implementing complex appearance models and by the requirement of a 3D mesh. We seek to combine the best of these methods by creating a differentiable rendering pipeline that does not require a 3D mesh, instead building voxelized latent representations from a small number of reference images.

Multi-View Reconstruction. Our method takes inspiration from multi-view reconstruction methods. It is most similar to space carving [13] and can be seen as a latent-space extension of it. Dense fusion methods such as [21, 39] generate dense point clouds of objects from RGB-D sequences. Other works [36, 35] train models that learn object representations from unaligned views. During inference, these methods are able to estimate the pose of the object by reasoning about the shape. These methods require a large training corpus for each category. Our method takes a hybrid approach, taking multiple reference images as input and building a latent representation at inference time.

3 Overview

We present an end-to-end system for novel view reconstruction and pose estimation. We present our system in two parts. Sec. 4 describes our reconstruction pipeline which takes a small collection of reference images as input and produces a flexible representation which can be rendered from novel viewpoints. We leverage multi-view consistency to construct a latent representation and do not rely on category specific shape priors. This key architectural decision enables generalization beyond the distribution of training objects. We show that our reconstruction pipeline can accurately reconstruct unseen object categories, including from real images. In Sec. 5, we formulate the 6D pose estimation problem using our neural renderer. Since our rendering process is fully differentiable, we directly optimize for the camera parameters without the need for additional training or code-book generation.

Camera Model. Throughout this paper we use a perspective pinhole camera model with an intrinsic matrix

K = [ f_u   0    u_0 ;
       0   f_v   v_0 ;
       0    0     1  ]

and a homogeneous extrinsic matrix E = [R t; 0 1], where f_u and f_v are the focal lengths, u_0 and v_0 are the coordinates of the camera principal point, and R and t are the rotation and translation of the camera, respectively. We also define a viewport cropping parameter c which represents a bounding box around the object in pixel coordinates. For brevity, we refer to the collection of these camera parameters as θ.
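As an illustration of this camera model, the numpy sketch below builds an intrinsic and an extrinsic matrix with assumed (made-up) focal lengths and principal point, then projects a 3D point to pixel coordinates:

```python
import numpy as np

# Assumed intrinsics: focal lengths and principal point are made-up values.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Extrinsics: identity rotation, camera shifted 1 m along the optical axis.
E = np.eye(4)
E[:3, 3] = [0.0, 0.0, 1.0]

p_obj = np.array([0.1, -0.1, 1.0, 1.0])  # homogeneous object-space point
p_cam = E @ p_obj                         # camera coordinates (depth 2.0)
uv_h = K @ p_cam[:3]
uv = uv_h[:2] / uv_h[2]                   # perspective divide -> pixels
```

The viewport parameter would additionally crop a pixel-space bounding box around the projected object; it is omitted from this sketch.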

4 Neural Reconstruction and Rendering

Given a set of reference images with associated object poses and object segmentation masks, we seek to construct a representation of the object which can be rendered with arbitrary camera parameters. Building on the success of recent methods [23, 31], we represent the object as a latent 3D voxel grid. This representation can be directly manipulated using standard 3D transformations–naturally accommodating our requirement of novel view rendering. The overview of our method is shown in Fig. 2. There are two main components to our reconstruction pipeline: 1) Modeling the object by predicting per-view feature volumes and fusing them into a single canonical latent representation; 2) Rendering the latent representation to depth and color images.

4.1 Modeling

Our modeling step is inspired by space carving [13] in that our network takes observations from multiple views and leverages multi-view consistency to build a canonical representation. However, instead of using photometric consistency, we use latent features to represent each view which allows our network to learn features useful for this task.

Per-View Features. We begin by generating a feature volume for each input view x_i. Each feature volume corresponds to the camera frustum of the input camera, bounded by the viewport parameter c and bounded depth-wise by [d_o − r, d_o + r], where d_o is the distance to the object center and r is the radius of the object. Fig. 3 illustrates the generation of the per-view features.

Figure 3: The per-view feature volumes computed in the modeling network correspond to a depth-bounded camera frustum. The blue box on the image plane is determined by the camera crop parameter and, together with the depth, determines the bounds of the frustum.

Similar to [30], we use U-Nets [26] for their property of preserving spatial structure. We first compute 2D features by passing the input x_i (an RGB image I_i, a binary mask M_i, and optionally a depth map D_i) through a 2D U-Net g_2D. The deprojection unit (π⁻¹) then lifts the 2D image features in R^{(C·D)×H×W} to 3D volumetric features in R^{C×D×H×W} by factoring the 2D channel dimension into the 3D channel dimension C and the depth dimension D. This deprojection operation is the exact opposite of the projection unit presented in [22]. The lifted features are then passed through a 3D U-Net g_3D to produce the volumetric features for the camera:

Φ_i = g_3D(π⁻¹(g_2D(x_i))).
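The deprojection step itself reduces to a reshape that factors the channel dimension of a 2D feature map into channel and depth dimensions. The sizes below are illustrative assumptions, not the paper's actual dimensions:

```python
import numpy as np

# Deprojection sketch: factor a 2D feature map's (C*D) channels into a 3D
# channel dimension C and a depth dimension D. All sizes here are assumed.
C, D, H, W = 4, 8, 16, 16
feats_2d = np.random.randn(C * D, H, W)  # output of the 2D U-Net
feats_3d = feats_2d.reshape(C, D, H, W)  # lifted volumetric features
```

The inverse operation (the projection unit used at render time) collapses the depth dimension back into channels with the opposite reshape.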
Camera to Object Coordinates. Each voxel in our feature volume represents a point in 3D space. Following recent works [22, 23, 30], we transform our feature volumes directly using rigid transformations. Consider a continuous function φ(p) defining our camera-space latent representation, where p ∈ R³ is a point in camera coordinates. The feature volume Φ_i is a discrete sample of this function. The same representation in object space is given by φ′(p′) = φ(E p′), where p′ is a point in object coordinates and E is the object-to-camera extrinsic matrix. We compute the object-space volume by sampling φ(E p′_jkl) for each object-space voxel coordinate p′_jkl. In practice, this is done by trilinearly sampling the voxel grid and edge-padding values that fall outside it. Given this transformation operation T_{c→o}, the object-space feature volume is given by V_i = T_{c→o}(Φ_i).
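A minimal sketch of this resampling, using scipy's trilinear interpolation (`map_coordinates` with `order=1`) and edge padding (`mode='nearest'`) as a stand-in for the network's sampling op (the transform here acts on voxel indices for simplicity):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def transform_volume(vol, T):
    """Resample a (C, D, H, W) feature volume under a 4x4 transform T.

    Each output voxel (z, y, x) samples the input at T @ (z, y, x, 1),
    with trilinear interpolation (order=1) and edge padding ('nearest').
    """
    C, D, H, W = vol.shape
    zz, yy, xx = np.meshgrid(np.arange(D), np.arange(H), np.arange(W),
                             indexing='ij')
    ones = np.ones_like(xx)
    coords = np.stack([zz, yy, xx, ones]).reshape(4, -1).astype(float)
    src = (T @ coords)[:3]  # source coordinates for every output voxel
    out = np.stack([map_coordinates(vol[c], src, order=1, mode='nearest')
                    for c in range(C)])
    return out.reshape(C, D, H, W)

vol = np.arange(2 * 3 * 3 * 3, dtype=float).reshape(2, 3, 3, 3)
same = transform_volume(vol, np.eye(4))  # identity transform returns the input
```

In the actual network this sampling must be differentiable so gradients can flow through the transform; this sketch only illustrates the geometry.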

View Fusion. We now have a collection of object-space feature volumes V_1, …, V_N, each associated with an input view. Our fusion module fuses all views into a single canonical feature volume: z = f_fuse(V_1, …, V_N).

Simple channel-wise average pooling yields good results, but we found that sequentially integrating each volume using a Recurrent Neural Network (RNN), similarly to [30], slightly improves reconstruction accuracy (see Sec. 6.4). Using a recurrent unit allows the network to keep and ignore features from views, in contrast to average pooling. This facilitates comparisons between different views, allowing the network to perform operations similar to the photometric consistency criterion used in space carving [13]. We use a Convolutional Gated Recurrent Unit (ConvGRU) [1] so that the network can leverage spatial information. An illustration of our fusion module is shown in Fig. 4.

Figure 4: We illustrate two methods of fusing per-view feature volumes: (1) simple channel-wise average pooling and (2) a recurrent fusion module similar to that of [30].
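A recurrent fusion step of this kind can be sketched as follows. For brevity this toy version uses 1x1 kernels (plain channel-mixing matrices) instead of real spatial convolutions, and all shapes and weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv1x1(W, x):
    # A 1x1 "convolution": mix the channels of a (C_in, D, H, W) volume.
    return np.einsum('oc,cdhw->odhw', W, x)

def convgru_step(h, x, Wr, Wu, Wh):
    """One ConvGRU fusion step (1x1 kernels for brevity)."""
    hx = np.concatenate([h, x], axis=0)
    r = sigmoid(conv1x1(Wr, hx))                   # reset gate
    u = sigmoid(conv1x1(Wu, hx))                   # update gate
    cand = np.tanh(conv1x1(Wh, np.concatenate([r * h, x], axis=0)))
    return (1.0 - u) * h + u * cand                # updated canonical volume

C, D, H, W = 4, 2, 3, 3
rng = np.random.default_rng(0)
Wr, Wu, Wh = (0.1 * rng.normal(size=(C, 2 * C)) for _ in range(3))

h = np.zeros((C, D, H, W))                         # canonical latent volume
for x in rng.normal(size=(3, C, D, H, W)):         # three per-view volumes
    h = convgru_step(h, x, Wr, Wu, Wh)
```

Average pooling would instead be a single `views.mean(axis=0)`; the gates are what let the recurrent variant keep or discard evidence from individual views.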

4.2 Rendering

Our rendering module takes the fused object volume and renders it given arbitrary camera parameters. Ideally, our rendering module would directly regress a color image. However, it is challenging to preserve high-frequency details through a neural network. U-Nets [26] introduce skip connections between equivalent-scale layers, allowing high-frequency spatial structure to propagate to the end of the network, but it is unclear how to add skip connections in the presence of 3D transformations. Existing works such as [30, 18] train a single network for each scene, allowing the decoder to memorize high-frequency information while the latent representation encodes state information. Trying to predict color without skip connections results in blurry outputs, as shown in Fig. 5. We side-step this difficulty by first rendering depth and then using an image-based rendering approach to produce a color image.

Decoding Depth. Depth is a 3D representation, making it easier for the network to exploit the geometric structure we provide. In addition, depth tends to be locally smoother compared to color allowing more information to be compactly represented in a single voxel.

Our rendering network is a simple inversion of the reconstruction network and bears many similarities to RenderNet [22]. First, we pass the canonical object-space volume z through a small 3D U-Net (h_3D) before transforming it to camera coordinates using the method described in Sec. 4.1, performing the transformation with the object-to-camera extrinsic matrix E instead of its inverse E⁻¹. A second 3D U-Net (h′_3D) then decodes the resulting volume to produce a feature volume, which is then flattened to a 2D feature grid using the projection unit (π) from [22] by first collapsing the depth dimension into the channel dimension and applying a 1x1 convolution. The resulting features are decoded by a 2D U-Net (h_2D) with two output branches, one for depth and one for a segmentation mask. The outputs of the rendering network are given by:

(y_depth, y_mask) = h_2D(π(h′_3D(T_{o→c}(h_3D(z))))).
Image-Based Rendering (IBR). We use image-based rendering [28] to leverage the reference images when predicting output color. Given the camera intrinsics K and the predicted depth map d for an output view, we can recover the 3D object-space position of each output pixel (u, v) as

p_obj = E⁻¹ ( d(u, v) · K⁻¹ (u, v, 1)ᵀ ),

which can be transformed to the input image frame as u_i = K_i E_i p_obj for each input camera i. The output pixel can then copy the color of the corresponding input pixel to produce a reprojected color image.
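The per-pixel reprojection can be sketched in numpy as follows; the intrinsics and camera placement are made-up values for illustration:

```python
import numpy as np

# Assumed intrinsics shared by the output and reference cameras.
K = np.array([[500.0, 0.0, 64.0],
              [0.0, 500.0, 64.0],
              [0.0, 0.0, 1.0]])
E_out = np.eye(4)              # output camera frame coincides with the object
E_in = np.eye(4)
E_in[:3, 3] = [0.1, 0.0, 0.0]  # reference camera shifted 10 cm along x

u, v, depth = 70.0, 64.0, 2.0  # an output pixel and its predicted depth
p_cam = depth * np.linalg.solve(K, np.array([u, v, 1.0]))  # back-project
p_obj = np.linalg.inv(E_out) @ np.append(p_cam, 1.0)       # to object space
p_in = (E_in @ p_obj)[:3]                                  # to reference camera
uv_h = K @ p_in
uv_in = uv_h[:2] / uv_h[2]   # pixel in the reference image to copy color from
```

Repeating this for every output pixel and every reference view yields the reprojected images that the blending step weighs against each other.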

Figure 5: We compare different methods for reconstructing color. (1) the ground truth image, (2) the prediction of a color branch added to the rendering network, (3) the reprojected reference images weighted by camera similarity, and (4) the reprojected reference images blended by our IBR network. Please see supplementary materials for more examples.

The resulting reprojected image will contain invalid pixels due to occlusions. There are multiple strategies for weighting each pixel, including 1) weighting by reprojected depth error, 2) weighting by the similarity between the input and query cameras, and 3) using a neural network. Examples of renderings using each method are shown in Fig. 5. The first choice suffers from artifacts in the presence of depth errors or thin surfaces. The second approach yields reasonable results but produces blurry images for intermediate views. We opt for the third option. Following deep blending [7], we train a network that predicts blend weights w_i for each reprojected input x_i: y = Σ_i w_i ⊙ x_i, where ⊙ is an element-wise product. The blend weights are predicted by a 2D U-Net. The inputs to this network are 1) the depth predicted by our reconstruction pipeline, 2) each reprojected input image x_i, and 3) a view similarity score s_i based on the angle between the input and query poses. We concatenate the inputs into a 7-channel tensor.

5 Object Pose Estimation

Given an image I and a depth map D, a pose estimation system provides a rotation R and a translation t which together define an object-to-camera coordinate transformation referred to as the object pose. In this section, we describe how we use our reconstruction pipeline described in Sec. 4 to directly optimize for the pose. We first find a coarse pose using only forward inference and then refine it using gradient optimization.

Formulation. Pose is defined by a rotation R and a translation t. Our formulation also includes the viewport parameter c defined in Sec. 3. Defining a viewport allows us to efficiently pass the input to the reconstruction network while also providing scale invariance. We encode the rotation as a quaternion q and the translation as t ∈ R³. We assume we are given an RGB image I, an object segmentation mask M, and depth D, together comprising the input x = (I, M, D).

5.1 Losses

In order to estimate pose, we must first provide a criterion which quantifies the quality of a pose. We use two loss functions. One is a standard L1 depth loss:

L_depth(ŷ, D) = || ŷ − D ||_1,

which disambiguates the object scale and serves as a reconstruction loss measuring how well the predicted depth ŷ matches the input depth D.
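A depth loss of this flavor can be sketched as below; restricting the error to the object mask and averaging over masked pixels are assumptions of this sketch, not details stated here:

```python
import numpy as np

def depth_loss(pred_depth, input_depth, mask):
    # Mean absolute depth error over masked (object) pixels; the masking and
    # normalization are assumptions of this sketch.
    return np.abs(mask * (pred_depth - input_depth)).sum() / max(mask.sum(), 1.0)

pred = np.full((4, 4), 2.0)      # predicted depth
target = np.full((4, 4), 2.5)    # input depth
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0             # 4 object pixels
loss = depth_loss(pred, target, mask)
```

Because the loss is a simple elementwise difference, it is trivially differentiable with respect to the predicted depth, which the pose optimization relies on.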

We introduce a novel latent loss which leverages our reconstruction network to evaluate the fitness of a pose. Given the input x, a latent object z, and a pose θ, the latent loss is defined as

L_latent(x, θ) = || F(z, θ) − G(x, θ) ||_1,

where F is the rendering network up to the projection layer and G is the modeling network as described in Sec. 4. This loss differs from auto-encoder based approaches such as [33, 6] in that 1) our network is not trained on the object, and 2) the loss is computed directly given the image and the camera pose. Given the two loss functions, the pose estimation problem is given by:

θ* = argmin_θ  L_depth(ŷ(θ), D) + λ L_latent(x, θ),    (8)

where ŷ(θ) is the output of the depth branch of the rendering network and λ is a weight balancing the two losses, optimized using [10]. The latent object z is omitted for clarity.

Coarse Pose. We bootstrap our pose estimation by first computing a coarse estimate. This speeds up optimization and prevents the optimization from getting stuck in bad local minima. We begin by estimating the translation of the object as the centroid of the bounding cube defined by the mask bounding box and the corresponding depth values. We then uniformly sample rotation quaternions and translations and select the top-scoring poses according to Eq. (8). This step requires only forward inference of our network, so we can evaluate a large number of candidate poses.
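The sampling step can be sketched as follows. Normalizing 4D Gaussian samples yields rotations uniform over SO(3); a stand-in scoring function replaces Eq. (8) purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_quaternions(n):
    # Normalized 4D Gaussian samples are uniformly distributed unit
    # quaternions, i.e. uniform random rotations.
    q = rng.normal(size=(n, 4))
    return q / np.linalg.norm(q, axis=1, keepdims=True)

def toy_loss(q):
    # Stand-in scoring function (NOT Eq. (8)): distance from the identity
    # rotation (1, 0, 0, 0), up to the quaternion double cover.
    return 1.0 - np.abs(q[:, 0])

candidates = sample_quaternions(1024)
best = candidates[np.argsort(toy_loss(candidates))[:8]]  # keep top-8 poses
```

In the real system each candidate would be scored by a forward pass of the rendering network, which batches well since no gradients are needed at this stage.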

Pose Optimization. Our entire pipeline is differentiable end-to-end. We can therefore optimize Eq. (8) using gradient optimization. Given a latent object and a coarse pose , we compute the loss and propagate the gradients back to the camera pose. This step only requires the rendering network and does not use the modeling network. The image-based rendering network is also not used in this step.

We parameterize the rotation in the log quaternion form in order to keep gradient updates valid rotations:

q = exp(ω) = ( cos ||ω||, (ω / ||ω||) sin ||ω|| ),    ω ∈ R³.

We also jointly optimize the translation t and the viewport c.
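The exponential map from an unconstrained 3-vector to a unit quaternion can be sketched as:

```python
import numpy as np

def exp_quaternion(w):
    """Map an unconstrained w in R^3 to a unit quaternion (scalar-first order).

    This is the exponential of the pure quaternion (0, w); any gradient step
    on w still maps back to a valid rotation.
    """
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])  # identity rotation
    return np.concatenate([[np.cos(theta)], np.sin(theta) * w / theta])

q = exp_quaternion(np.array([0.0, 0.0, np.pi / 4]))  # 90° rotation about z
```

Optimizing w directly avoids the renormalization step that a raw-quaternion parameterization would need after every gradient update.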

6 Experiments

We evaluate our method on two datasets: ModelNet [40] and our new dataset MOPED. We aim to evaluate pose estimation accuracy on unseen objects.

6.1 Implementation Details

Training Data. We train our reconstruction network on 3D models from ShapeNet [4], which contains around 51,300 shapes. We exclude models larger than 50MB or with more than 10 components for efficient data loading, resulting in around 30,000 models. We generate UV maps for each model using Blender’s smart UV projection [3] to facilitate texturing. We normalize all models to unit diameter. When rendering, we sample a random image from MS-COCO [16] for each component of the model. We render with the Beckmann model [2] with randomized parameters. We also render uniformly colored objects with a probability of 0.5.

Network Input. We generate our training data at a higher resolution than the fixed input size of our network. To keep our inputs consistent and our network scale-invariant, we ‘zoom’ into the object such that all images appear to be taken from the same distance. This is done by computing a bounding box whose size depends on the current image width and height, the distance to the object centroid (see Fig. 3), the desired output size, and the desired ‘zoom’ distance, and then cropping around the object centroid projected to image coordinates. This defines the viewport parameter c. The cropped image is then scaled to the network input size.

Training. In each iteration of training, we sample a 3D model and then sample 16 random reference poses and 16 random target poses. Each pose is sampled by uniformly sampling a unit quaternion and a translation such that the object stays within frame. We train our network using the Adam optimizer [11] with a fixed learning rate of 0.001 for 1.5M iterations. Each batch consists of 20 objects with 16 input views and 16 target views. We use an L1 reconstruction loss for depth and binary cross-entropy for the mask. We apply the losses to both the input and output views. We randomly orient our canonical coordinate frame in each iteration by uniformly sampling a random unit quaternion. This prevents our network from overfitting to the implementation of our latent voxel transformations. We also add motion blur, color jitter, and pixel noise to the color inputs and add noise to the input masks using the same procedure as [19].

Figure 6: Here we show two examples from the ModelNet experiment. (1) shows the target image, (2) shows the ground truth depth, (3) shows our optimized predicted depth, and (4) shows the error between the ground truth and our prediction. (a) illustrates how a pose with low depth error can still result in a relatively high angular error.

6.2 Experiments on the ModelNet Dataset

We conduct experiments on ModelNet to evaluate the generalization of our method toward unseen object categories. To this end, we train our network on all the meshes in ShapeNetCore [4] excluding the categories we are going to evaluate on. We closely follow the evaluation protocol of [15] for this section. The model is evaluated on 7 unseen categories: bathtub, bookshelf, guitar, range hood, sofa, wardrobe, and TV stand. For each category, 50 pairs of initial and target object poses are sampled. We compare with [15] and [32], where all methods are initialized with the initial pose and evaluated on how successfully they estimate the target pose. Three metrics are used: 1) (5°, 5cm): the percentage of estimates within 5° and 5 cm of the ground truth target pose. 2) ADD (0.1d): considers a pose estimate correct if its average distance [8] to the target pose is within 10% of the model diameter. 3) Proj2D (5px): the percentage of estimates for which the projections of a set of predefined mesh points at the estimated pose are within 5 pixels of the projections at the ground truth pose.

Table 1 shows the quantitative results on the ModelNet dataset. On average, our method achieves state-of-the-art results on all the metrics thanks to our ability to perform continuous optimization on pose. However, for the (5°, 5cm) metric, there are object categories where our method performs worse despite performing well on all other metrics. One reason is image and spatial resolution: the input and output images of our network, as well as our voxel representation, have limited resolution, which can hinder performance for small objects or objects that are distant from the camera. Small changes in the depth of each pixel may disproportionately affect the rotation of the object compared to our loss values L_depth and L_latent. Fig. 6 shows examples from the ModelNet experiment illustrating this limitation.

(5°, 5cm) ADD (0.1d) Proj2D (5px)
DI MP Ours DI MP Ours DI MP Ours
bathtub 71.6 85.5 85.0 88.6 91.5 92.7 73.4 80.6 94.9
bookshelf 39.2 81.9 80.2 76.4 85.1 91.5 51.3 76.3 91.8
guitar 50.4 69.2 73.5 69.6 80.5 83.9 77.1 80.1 96.9
range_hood 69.8 91.0 82.9 89.6 95.0 97.9 70.6 83.9 91.7
sofa 82.7 91.3 89.9 89.5 95.8 99.7 94.2 86.5 97.6
tv_stand 73.6 85.9 88.6 92.1 90.9 97.4 76.6 82.5 96.0
wardrobe 62.7 88.7 91.7 79.4 92.1 97.0 70.0 81.1 94.2
Mean 64.3 84.8 85.5 83.6 90.1 94.3 73.3 81.6 94.7
Table 1: ModelNet pose refinement experiments compared to DeepIM (DI) [15] and Multi-Path Learning (MP) [32].
The first group of three columns is PoseRBPF [6] (input: textured 3D mesh; requires training; one network per object). The remaining groups are variants of our method (input: images + camera poses; no additional training; a single universal network) with different pose loss combinations. Each group of three columns reports ADD / ADD-S / Proj2D.
black_drill 59.78 82.94 49.80 56.67 79.06 53.77 62.15 82.36 59.36 51.61 80.81 48.05
cheezit 57.78 82.45 48.47 61.31 91.63 55.24 44.56 90.24 35.10 23.98 88.20 15.92
duplo_dude 56.91 82.14 47.11 74.02 89.55 52.49 76.81 90.50 59.83 53.26 89.51 38.54
duster 58.91 82.78 46.66 49.13 91.56 19.33 51.13 91.68 24.78 39.05 81.57 20.82
graphics_card 59.13 83.20 49.85 80.71 91.25 67.71 79.33 90.90 60.35 60.11 87.91 41.92
orange_drill 58.23 82.68 49.08 51.84 70.95 46.12 55.52 73.68 45.46 44.20 68.39 41.68
pouch 57.74 82.16 49.01 60.43 89.60 49.80 58.51 89.15 44.40 22.03 82.94 20.19
remote 56.87 82.04 48.06 55.38 94.80 37.73 63.18 94.96 45.27 62.39 91.58 41.96
rinse_aid 57.74 82.53 48.13 65.63 92.58 28.61 67.09 93.66 27.62 57.54 87.44 19.00
toy_plane 62.41 85.10 49.81 60.18 90.24 51.70 56.80 88.54 40.16 34.29 87.22 35.07
vim_mug 58.09 82.38 48.08 30.11 80.76 14.38 49.89 77.79 32.85 27.49 78.59 10.51
mean 58.51 82.76 48.55 58.67 87.45 43.35 60.45 87.59 43.20 43.27 84.01 30.33
Table 2: Quantitative Results on MOPED Dataset. We report the Area Under Curve (AUC) for each metric.
Figure 7: Objects in MOPED–a new dataset for model-free pose estimation. The objects shown are: (a) toy_plane, (b) duplo_dude, (c) cheezit, (d) duster, (e) black_drill, (f) orange_drill, (g) graphics_card, (h) remote, (i) rinse_aid, (j) vim_mug, and (k) pouch.

6.3 Experiments on the MOPED Dataset

We introduce the Model-free Object Pose Estimation Dataset (MOPED). MOPED consists of 11 objects, shown in Fig. 7. For each object, we capture multiple RGB-D videos covering all views of the object. We first use KinectFusion [21] to register frames from a single capture and then use a combination of manual annotation and automatic registration [44, 27, 45] to align separate captures. For each object, we select reference frames with farthest point sampling to ensure good coverage of the object. For test sequences, we capture each object in 5 different environments. We sample every other frame for the evaluation videos, resulting in approximately 300 test images per object. We quantitatively evaluate our method and baselines using three metrics, for each of which we report the Area Under Curve (AUC) over a range of error thresholds: 1) ADD: average distance error. 2) ADD-S: symmetric average distance error, which measures the geometric alignment of symmetric objects regardless of texture. 3) Proj2D: projection error. We compute all metrics for all sampled frames.
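An AUC of this kind can be computed by averaging the accuracy-vs-threshold curve; the threshold range and error values below are assumed examples, not the dataset's actual settings:

```python
import numpy as np

def auc(errors, max_threshold, steps=100):
    """Average accuracy over an error-threshold sweep from 0 to max_threshold."""
    thresholds = np.linspace(0.0, max_threshold, steps)
    accuracy = [(errors <= t).mean() for t in thresholds]
    return float(np.mean(accuracy))

errors_cm = np.array([0.5, 1.0, 2.0, 8.0, 12.0])  # toy per-frame ADD errors
score = auc(errors_cm, max_threshold=10.0)
```

Errors above the maximum threshold (like the 12.0 entry) simply never count as correct, so the AUC rewards methods whose errors are consistently small rather than occasionally perfect.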

Figure 8: Qualitative results on the MOPED dataset

We compare our method with PoseRBPF [6], a state-of-the-art model-based pose estimation method. Since PoseRBPF requires textured 3D models, we reconstruct a mesh for each object by aggregating point clouds from reference captures and building a TSDF volume. The point clouds are integrated into the volume using KinectFusion [21]. The meshes have artifacts such as washed out high frequency details and shadow vertices due to slight misalignment (see supplementary materials). Table 2 shows quantitative comparisons on the MOPED dataset. Note that our method is not trained on the test objects, while PoseRBPF has a separate encoder for each object. Our method achieves superior performance on both ADD and ADD-S. We evaluate versions of our method with different combinations of loss functions. Compared to our combined loss, optimizing only the depth loss performs better for geometrically asymmetric objects but worse on textured objects such as the cheezit box; optimizing both losses achieves better results on textured objects. Fig. 8 shows estimated poses for different test images. Please see supplementary materials for qualitative examples.

# Views    1      2      4      8      16     32
ADD       15.91  25.00  40.38  55.35  58.67  55.81
ADD-S     63.14  75.91  85.62  87.72  87.45  88.70
Proj.2D    8.68  15.43  28.41  38.87  43.35  38.45

Table 3: AUC metrics by number of reference views.

6.4 Ablation Studies

In this section, we analyze the effect of different design choices and how they affect the robustness of our method.

Number of reference views. We first evaluate the sensitivity of our method to the number of input reference views. Novel view synthesis is easier with more reference views because there is a higher chance that a query view will be close to a reference view. Table 3 shows that accuracy increases with the number of reference views. In addition, the fact that more than 8 reference views yield only marginal gains shows that our method does not require many views to achieve good pose estimation results.

View Fusion. We compare multiple strategies for aggregating the latent representations from the reference views. The naive approach is a simple pooling function such as average or max pooling. Alternatively, we can integrate the volumes with an RNN such as a ConvGRU, allowing the network to reason across views. Table 4 shows the quantitative evaluation of these two variants. Although the average performance across objects is very similar, the ConvGRU variant performs better than average pooling. This indicates the importance of spatial relationships in the voxel representation for pose estimation.
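The two aggregation strategies can be sketched as follows. A per-element GRU in plain NumPy stands in for the ConvGRU (which applies the same gating with convolutions over the voxel grid); the random weights and dimensions are illustrative, not trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def average_fuse(volumes):
    # Order-invariant baseline: element-wise mean over the view axis.
    return np.mean(volumes, axis=0)

class GRUFuser:
    """Minimal GRU that fuses per-view latent feature vectors
    sequentially, letting later views update earlier evidence."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1
        # Each gate sees the concatenated [hidden, input] vector.
        self.Wz = rng.normal(0, scale, (dim, 2 * dim))  # update gate
        self.Wr = rng.normal(0, scale, (dim, 2 * dim))  # reset gate
        self.Wh = rng.normal(0, scale, (dim, 2 * dim))  # candidate state

    def step(self, h, x):
        hx = np.concatenate([h, x])
        z = sigmoid(self.Wz @ hx)
        r = sigmoid(self.Wr @ hx)
        h_tilde = np.tanh(self.Wh @ np.concatenate([r * h, x]))
        return (1 - z) * h + z * h_tilde

    def fuse(self, volumes):
        h = np.zeros(volumes[0].shape)
        for x in volumes:  # iterate over reference views
            h = self.step(h, x)
        return h
```

Unlike average pooling, the recurrent variant is order-dependent and can weigh views unequally, which is the flexibility the ablation attributes to the ConvGRU.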

           ADD    ADD-S  Proj.2D
Avg Pool   56.78  88.04  39.82
ConvGRU    56.36  88.28  40.43

Table 4: Effect of different view fusion strategies.

7 Conclusion

We have presented a novel framework for learning 3D object representations from reference views. Our network decodes this representation to synthesize novel views and to estimate the 6D pose of objects. By training on thousands of 3D shapes, the network learns to reconstruct and estimate the pose of objects unseen during training. Compared to current methods for 6D object pose estimation, our method removes the requirement of a high-quality 3D model or per-object training, and therefore has the potential to scale to a large number of objects. For future work, we plan to investigate unseen object pose estimation in cluttered scenes where objects may occlude each other. Another direction is to speed up the pose estimation process by applying network optimization techniques.


  • [1] N. Ballas, L. Yao, C. Pal, and A. Courville (2015) Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432. Cited by: §4.1.
  • [2] P. Beckmann and A. Spizzichino (1987) The scattering of electromagnetic waves from rough surfaces. Artech House, Norwood, MA, 511 p. Cited by: §6.1.
  • [3] Blender Online Community (2019) Blender - a 3d modelling and rendering package. Blender Foundation, Blender Institute, Amsterdam. External Links: Link Cited by: §6.1.
  • [4] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012. Cited by: §1, §6.1, §6.2.
  • [5] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese (2016) 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In European Conference on Computer Vision (ECCV), pp. 628–644. Cited by: §2.
  • [6] X. Deng, A. Mousavian, Y. Xiang, F. Xia, T. Bretl, and D. Fox (2019) PoseRBPF: a rao-blackwellized particle filter for 6D object pose tracking. In Robotics: Science and Systems (RSS), Cited by: §1, §1, §2, §5.1, §6.3, Table 2.
  • [7] P. Hedman, J. Philip, T. Price, J. Frahm, G. Drettakis, and G. Brostow (2018) Deep blending for free-viewpoint image-based rendering. In SIGGRAPH Asia 2018 Technical Papers, pp. 257. Cited by: §4.2.
  • [8] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab (2012) Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Asian Conference on Computer Vision (ACCV), pp. 548–562. Cited by: §6.2.
  • [9] W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab (2017) SSD-6D: making RGB-based 3D detection and 6D pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1521–1529. Cited by: §2.
  • [10] A. Kendall, Y. Gal, and R. Cipolla (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7482–7491. Cited by: §5.1.
  • [11] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.1.
  • [12] A. Kundu, Y. Li, and J. M. Rehg (2018) 3D-RCNN: instance-level 3D object reconstruction via render-and-compare. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3559–3568. Cited by: §2.
  • [13] K. N. Kutulakos and S. M. Seitz (2000) A theory of shape by space carving. International Journal of Computer Vision (IJCV) 38 (3), pp. 199–218. Cited by: §1, §2, §4.1, §4.1.
  • [14] T. Li, M. Aittala, F. Durand, and J. Lehtinen (2018) Differentiable monte carlo ray tracing through edge sampling. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 37 (6), pp. 222:1–222:11. Cited by: §2.
  • [15] Y. Li, G. Wang, X. Ji, Y. Xiang, and D. Fox (2018) DeepIM: deep iterative matching for 6D pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 683–698. Cited by: §2, §6.2, Table 1.
  • [16] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755. Cited by: §6.1.
  • [17] S. Liu, T. Li, W. Chen, and H. Li (2019-10) Soft rasterizer: a differentiable renderer for image-based 3D reasoning. The IEEE International Conference on Computer Vision (ICCV). Cited by: §2.
  • [18] S. Lombardi, T. Simon, J. Saragih, G. Schwartz, A. Lehrmann, and Y. Sheikh (2019) Neural volumes: learning dynamic renderable volumes from images. arXiv preprint arXiv:1906.07751. Cited by: §4.2.
  • [19] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg (2017) Dex-Net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In Robotics: Science and Systems (RSS). Cited by: §6.1.
  • [20] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka (2017) 3D bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [21] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. W. Fitzgibbon (2011) KinectFusion: real-time dense surface mapping and tracking. In IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Vol. 11, pp. 127–136. Cited by: §1, §1, §2, §6.3, §6.3.
  • [22] T. Nguyen-Phuoc, C. Li, S. Balaban, and Y. Yang (2018) RenderNet: a deep convolutional network for differentiable rendering from 3D shapes. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2, §4.1, §4.1, §4.2.
  • [23] T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y. Yang (2019) HoloGAN: unsupervised learning of 3D representations from natural images. In International Conference on Computer Vision (ICCV). Cited by: §2, §4.1, §4.
  • [24] K. Park, K. Rematas, A. Farhadi, and S. M. Seitz (2018) PhotoShape: photorealistic materials for large-scale shape collections. In SIGGRAPH Asia 2018 Technical Papers, pp. 192. Cited by: §2.
  • [25] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [26] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §4.1, §4.2.
  • [27] S. Rusinkiewicz and M. Levoy (2001) Efficient variants of the ICP algorithm. In Proceedings Third International Conference on 3-D Digital Imaging and Modeling, pp. 145–152. Cited by: §6.3.
  • [28] H. Shum and S. B. Kang (2000) Review of image-based rendering techniques. In Visual Communications and Image Processing, Vol. 4067, pp. 2–13. Cited by: §4.2.
  • [29] A. Singh, J. Sha, K. S. Narayan, T. Achim, and P. Abbeel (2014) BigBIRD: a large-scale 3D database of object instances. In IEEE International Conference on Robotics and Automation (ICRA), pp. 509–516. Cited by: §2.
  • [30] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhöfer (2019) DeepVoxels: learning persistent 3D feature embeddings. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, Cited by: Figure 4, §4.1, §4.1, §4.1, §4.2.
  • [31] V. Sitzmann, M. Zollhöfer, and G. Wetzstein (2019) Scene representation networks: continuous 3D-structure-aware neural scene representations. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: §2, §4.
  • [32] M. Sundermeyer, M. Durner, E. Y. Puang, Z. Marton, and R. Triebel (2019) Multi-path learning for object pose estimation across domains. arXiv preprint arXiv:1908.00151. Cited by: §6.2, Table 1.
  • [33] M. Sundermeyer, Z. Marton, M. Durner, M. Brucker, and R. Triebel (2018) Implicit 3D orientation learning for 6D object detection from rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 699–715. Cited by: §1, §2, §5.1.
  • [34] J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield (2018) Deep object pose estimation for semantic robotic grasping of household objects. In Conference on Robot Learning (CoRL), External Links: Link Cited by: §1, §1, §2.
  • [35] S. Tulsiani, A. A. Efros, and J. Malik (2018) Multi-view consistency as supervisory signal for learning shape and pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2897–2905. Cited by: §2.
  • [36] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik (2017) Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 2626–2634. Cited by: §2.
  • [37] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese (2019) DenseFusion: 6D object pose estimation by iterative dense fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3343–3352. Cited by: §1, §2.
  • [38] T. Y. Wang, H. Su, Q. Huang, J. Huang, L. Guibas, and N. J. Mitra (2016) Unsupervised texture transfer from images to model collections. ACM Trans. Graph. 35 (6), pp. 177:1–177:13. External Links: ISSN 0730-0301, Link Cited by: §2.
  • [39] T. Whelan, S. Leutenegger, R. Salas-Moreno, B. Glocker, and A. Davison (2015) ElasticFusion: dense slam without a pose graph. In Robotics: Science and Systems (RSS), Cited by: §2.
  • [40] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3D ShapeNets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 1912–1920. Cited by: §6.
  • [41] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox (2018) PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes. In Robotics: Science and Systems (RSS). Cited by: §1, §1, §2.
  • [42] C. Xie, Y. Xiang, A. Mousavian, and D. Fox (2019) The best of both modes: separately leveraging RGB and depth for unseen object instance segmentation. In Conference on Robot Learning (CoRL), Cited by: §1.
  • [43] G. Yang, Z. Hao, M. Liu, S. Belongie, and B. Hariharan (2019) PointFlow: 3D point cloud generation with continuous normalizing flows. International Conference on Computer Vision (ICCV). Cited by: §2.
  • [44] Q. Zhou, J. Park, and V. Koltun (2016) Fast global registration. In European Conference on Computer Vision (ECCV), pp. 766–782. Cited by: §6.3.
  • [45] Q. Zhou, J. Park, and V. Koltun (2018) Open3D: A modern library for 3D data processing. arXiv:1801.09847. Cited by: §6.3.