The pose of an object defines where it is in space and how it is oriented. An object pose is typically defined by a 3D orientation (rotation) and translation comprising six degrees of freedom (6D). Knowing the pose of an object is crucial for any application that involves interacting with real world objects. For example, in order for a robot to manipulate objects it must be able to reason about the pose of the object. In augmented reality, 6D pose estimation enables virtual interaction and re-rendering of real world objects.
In order to estimate the 6D pose of objects, current state-of-the-art methods [41, 6, 37] require a 3D model for each object. Methods based on renderings usually need high quality 3D models, typically obtained using 3D scanning devices. Although modern 3D reconstruction and scanning techniques can generate 3D models of objects, they typically require significant effort. Building a 3D model for every object is therefore infeasible in practice.
Furthermore, existing pose estimation methods require extensive training under different lighting conditions and occlusions. For methods that train a single network for multiple objects , the pose estimation accuracy drops significantly with the increase in number of objects. This is due to large variation of object appearances depending on the pose. To remedy this mode of degradation, some approaches train a separate network for each object [34, 33, 6]. This approach is not scalable to a large number of objects. Regardless of using a single or multiple networks, all model-based methods require extensive training for unseen test objects that are not in the training set.
In this paper, we investigate the problem of learning 3D object representations for 6D object pose estimation without 3D models and without extra training for unseen objects during test time. The core of our method is a novel neural network that takes a set of reference RGB images of a target object with known poses, and internally builds a 3D representation of the object. Using the internal 3D representation, the network is able to render arbitrary views of the object. To estimate object pose, the network compares the input image with its rendered images in a gradient descent fashion to search for the best pose where the rendered image matches the input image. Applying the network to an unseen object only requires collecting views with registered poses using traditional techniques  and feeding a small subset of those views with the associated poses to the network, instead of training for the new object which takes time and computational resources.
Our network design is inspired by space carving. We build a 3D voxel representation of an object by computing 2D latent features and projecting them to a canonical 3D voxel using a deprojection unit inspired by prior work. This operation can be interpreted as space carving in latent space. Rendering a novel view is conducted by rotating the latent voxel representation to the new view and projecting it into the 2D image space. Using the projected latent features, a decoder generates a new view image by first predicting the depth map of the object at the query view and then assigning color for each pixel by combining corresponding pixel values at different reference views.
To reconstruct and render unseen objects, we train the network on random 3D meshes from the ShapeNet dataset that are textured using images from the MS-COCO dataset under different lighting conditions. Our experiments show that the model generalizes to novel object categories and instances. For pose estimation during inference, we assume that the object of interest is segmented with a generic object instance segmentation method. The pose of the object is estimated by finding a 6D pose that maximizes the similarity between the latent representation of the segmented object in the input image and the latent representation generated at that pose from the network. The object pose is computed using the gradient of the latent similarity with respect to the object pose. Fig. 1 illustrates our reconstruction and pose estimation pipeline.
We believe that the problem of unseen object pose estimation from limited views in the absence of high fidelity textured 3D models is very important from a practical point of view. To this end, we present a new evaluation dataset called MOPED, Model-free Object Pose Estimation Dataset. We summarize our key contributions:
We propose a novel neural network that reconstructs a latent representation of a novel object given a limited set of reference views and can subsequently render it from arbitrary viewpoints without additional training.
We perform pose estimation on unseen objects given reference views without additional training.
We introduce MOPED, a dataset for evaluating model-free, zero-shot pose estimation. We provide reference images of objects taken in controlled environments and test images taken in uncontrolled environments.
2 Related Work
Pose Estimation. Pose estimation methods fall into three major categories. The first category tackles the problem of pose estimation by designing network architectures that facilitate pose estimation [20, 12, 9]. The second category formulates pose estimation as predicting a set of 2D image features, such as the projection of 3D box corners [34, 37] or the direction of the center of the object, and then recovering the pose of the object from these predictions. The third category estimates the pose of objects by aligning a rendering of the 3D model to the image. DeepIM estimates the pose of the object by learning to align the 3D model of the object to the image. Another approach is to learn a model that can reconstruct the object with different poses [33, 6]. These methods then use the latent representation of the object to estimate its pose. A limitation of this line of work is that a separate auto-encoder must be trained for each object category, with no knowledge transfer between object categories. In addition, these methods require high fidelity textured 3D models for each object, which are not trivial to build in practice since doing so involves specialized hardware. Our method addresses both of these limitations: it works with a set of reference views with registered poses instead of a 3D model. Without additional training, our system builds a latent representation from the reference views which can be rendered to color and depth for arbitrary viewpoints. Similar to [33, 6], we seek to find a pose that minimizes the difference in latent space between the query object and the test image.
3D shape learning and novel view synthesis. Inferring shapes of objects at the category level has recently gained a lot of attention. Shape geometry has been represented as voxels, signed distance functions (SDF), point clouds, and implicit functions encoded by a neural network. These methods are trained at the category level and can only represent different instances within the categories they were trained on. In addition, these models only capture the shape of the object and do not model its appearance. To overcome this limitation, recent works [23, 22, 31] decode appearance from neural 3D latent representations that respect projective geometry, generalizing well to novel viewpoints. Novel views are generated by transforming the latent representation in 3D and projecting it to 2D. A decoder then generates a novel view from the projected features. Some methods find a nearest neighbor shape proxy and infer high quality appearances but cannot handle novel categories [38, 24]. Differentiable rendering systems [14, 22, 17] seek to implement the rendering process (rasterization and shading) in a differentiable manner so that gradients can be propagated to and from neural networks. Such methods can be used to directly optimize parameters such as pose or appearance. Current differentiable rendering methods are limited by the difficulty of implementing complex appearance models and require a 3D mesh. We seek to combine the best of these methods by creating a differentiable rendering pipeline that does not require a 3D mesh, instead building voxelized latent representations from a small number of reference images.
Multi-View Reconstruction. Our method takes inspiration from multi-view reconstruction methods. It is most similar to space carving and can be seen as a latent-space extension of it. Dense fusion methods such as [21, 39] generate dense point clouds of objects from RGB-D sequences. Other works [36, 35] train models that learn object representations from unaligned views. During inference, these methods are able to estimate the pose of the object by reasoning about the shape. These methods require a large training corpus for each category. Our method takes a hybrid approach, taking multiple reference images as input and building a latent representation at inference time.
We present an end-to-end system for novel view reconstructions and pose estimation. We present our system in two parts. Sec. 4 describes our reconstruction pipeline which takes a small collection of reference images as input and produces a flexible representation which can be rendered from novel viewpoints. We leverage multi-view consistency to construct a latent representation and do not rely on category specific shape priors. This key architecture decision enables generalization beyond the distribution of training objects. We show that our reconstruction pipeline can accurately reconstruct unseen object categories including from real images. In Sec. 5, we formulate the 6D pose estimation problem using our neural renderer. Since our rendering process is fully differentiable, we directly optimize for the camera parameters without the need for additional training or code-book generation.
Camera Model. Throughout this paper we use a perspective pinhole camera model with an intrinsic matrix
$$K = \begin{bmatrix} f_u & 0 & u_0 \\ 0 & f_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$
and a homogeneous extrinsic matrix $E = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$, where $f_u$ and $f_v$ are the focal lengths, $u_0$ and $v_0$ are the coordinates of the camera principal point, and $R$ and $t$ are rotation and translation of the camera, respectively. We also define a viewport cropping parameter which represents a bounding box around the object in pixel coordinates. For brevity we refer to the intrinsics, extrinsics, and viewport collectively as the camera parameters.
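As an illustration of the camera model, the following minimal NumPy sketch projects object-space points to pixel coordinates. It assumes the extrinsic matrix maps object coordinates to camera coordinates; the function names and the numeric values are hypothetical and chosen only for the example.

```python
import numpy as np

def intrinsic_matrix(fu, fv, u0, v0):
    """Pinhole intrinsics from focal lengths and principal point."""
    return np.array([[fu, 0.0, u0],
                     [0.0, fv, v0],
                     [0.0, 0.0, 1.0]])

def extrinsic_matrix(R, t):
    """Homogeneous 4x4 extrinsics from a 3x3 rotation and a translation."""
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = t
    return E

def project(K, E, points_obj):
    """Project (N, 3) object-space points to (N, 2) pixel coordinates."""
    ones = np.ones((len(points_obj), 1))
    pts_cam = (np.concatenate([points_obj, ones], axis=1) @ E.T)[:, :3]
    uvw = pts_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide

# Hypothetical values: object placed 2 units in front of the camera.
K = intrinsic_matrix(500.0, 500.0, 320.0, 240.0)
E = extrinsic_matrix(np.eye(3), np.array([0.0, 0.0, 2.0]))
pixels = project(K, E, np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]))
```

The object origin lands on the principal point, and a point offset along the x axis shifts only the horizontal pixel coordinate.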
4 Neural Reconstruction and Rendering
Given a set of reference images with associated object poses and object segmentation masks, we seek to construct a representation of the object which can be rendered with arbitrary camera parameters. Building on the success of recent methods [23, 31], we represent the object as a latent 3D voxel grid. This representation can be directly manipulated using standard 3D transformations, naturally accommodating our requirement of novel view rendering. The overview of our method is shown in Fig. 2. There are two main components to our reconstruction pipeline: 1) Modeling the object by predicting per-view feature volumes and fusing them into a single canonical latent representation; 2) Rendering the latent representation to depth and color images.
Our modeling step is inspired by space carving  in that our network takes observations from multiple views and leverages multi-view consistency to build a canonical representation. However, instead of using photometric consistency, we use latent features to represent each view which allows our network to learn features useful for this task.
Per-View Features. We begin by generating a feature volume for each input view. Each feature volume corresponds to the camera frustum of the input camera, bounded by the viewport parameter and depth-wise by $[z_c - r, z_c + r]$, where $z_c$ is the distance to the object center and $r$ is the radius of the object. Fig. 3 illustrates the generation of the per-view features.
Similar to , we use U-Nets  for their property of preserving spatial structure. We first compute 2D features by passing the input (an RGB image , a binary mask , and optionally depth ) through a 2D U-Net. The deprojection unit () then lifts the 2D image features in to 3D volumetric features in by factoring the 2D channel dimension into the 3D channel dimension and depth dimension . This deprojection operation is the exact opposite of the projection unit presented in . The lifted features are then passed through a 3D U-Net to produce the volumetric features for the camera:
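The deprojection step described above amounts to a reshape that splits the 2D channel axis into a 3D channel axis and a depth axis. The sketch below shows this with NumPy under assumed tensor layouts (channels first, no batch axis); the function name and sizes are hypothetical.

```python
import numpy as np

def deproject(features_2d, depth_bins):
    """Lift (C, H, W) image features to a (C // D, D, H, W) feature volume."""
    channels, height, width = features_2d.shape
    if channels % depth_bins != 0:
        raise ValueError("channel count must be divisible by the depth size")
    return features_2d.reshape(channels // depth_bins, depth_bins, height, width)

# Hypothetical sizes: 8 image channels become 2 volume channels x 4 depth bins.
features_2d = np.arange(8 * 4 * 4, dtype=float).reshape(8, 4, 4)
volume = deproject(features_2d, depth_bins=4)
```

Because the operation is a pure reshape, it adds no parameters; the learned behaviour comes from the 2D and 3D U-Nets on either side of it.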
Camera to Object Coordinates. Each voxel in our feature volume represents a point in 3D space. Following recent works [22, 23, 30], we transform our feature volumes directly using rigid transformations. Consider a continuous function defining our camera-space latent representation, where is a point in camera coordinates. The feature volume is a discrete sample of this function. This representation in object space is given by where is a point in object coordinates and is an object-to-camera extrinsic matrix. We compute the object-space volume by sampling for each object-space voxel coordinate
. In practice, this is done by trilinear sampling the voxel grid and edge-padding values that fall outside. Given this transformation operation, the object-space feature volume is given by
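The resampling step can be sketched as follows: every voxel of the output grid is mapped through a rigid transform into the source grid, and the source volume is read with trilinear interpolation, clamping coordinates at the border (edge padding). This is a minimal NumPy sketch under assumptions that are not from the text: a grid centred at the origin, an isotropic `voxel_size`, and the helper names used here.

```python
import numpy as np

def trilinear_sample(volume, coords):
    """Sample a (C, D, H, W) volume at (N, 3) continuous (z, y, x) indices."""
    upper = np.array(volume.shape[1:]) - 1
    coords = np.clip(coords, 0.0, upper)          # edge padding
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, upper)
    frac = coords - lo
    out = np.zeros((volume.shape[0], len(coords)))
    for dz in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                z = hi[:, 0] if dz else lo[:, 0]
                y = hi[:, 1] if dy else lo[:, 1]
                x = hi[:, 2] if dx else lo[:, 2]
                wz = frac[:, 0] if dz else 1.0 - frac[:, 0]
                wy = frac[:, 1] if dy else 1.0 - frac[:, 1]
                wx = frac[:, 2] if dx else 1.0 - frac[:, 2]
                out += volume[:, z, y, x] * (wz * wy * wx)
    return out

def transform_volume(volume, E, voxel_size=1.0):
    """Resample so that output voxel x' reads the source volume at E x'."""
    C, D, H, W = volume.shape
    center = (np.array([W, H, D]) - 1) / 2.0      # grid centre in (x, y, z)
    zz, yy, xx = np.meshgrid(np.arange(D), np.arange(H), np.arange(W), indexing="ij")
    idx_xyz = np.stack([xx, yy, zz], axis=-1).reshape(-1, 3).astype(float)
    pts_out = (idx_xyz - center) * voxel_size     # output-space points
    pts_src = pts_out @ E[:3, :3].T + E[:3, 3]    # mapped into source space
    src_idx = pts_src / voxel_size + center
    sampled = trilinear_sample(volume, src_idx[:, ::-1])  # back to (z, y, x)
    return sampled.reshape(C, D, H, W)

rng = np.random.default_rng(0)
vol = rng.random((2, 4, 4, 4))
same = transform_volume(vol, np.eye(4))
shift = np.eye(4)
shift[0, 3] = 1.0                                 # one voxel along x
shifted = transform_volume(vol, shift)
```

With the identity transform the volume is returned unchanged; with a one-voxel shift each output voxel reads its neighbour along x, and the last column repeats the border value.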
View Fusion. We now have a collection of feature volumes , each associated with an input view. Our fusion module fuses all views into a single canonical feature volume: .
Simple channel-wise average pooling yields good results, but we found that sequentially integrating each volume using a Recurrent Neural Network (RNN) slightly improved reconstruction accuracy (see Sec. 6.4). In contrast to average pooling, a recurrent unit allows the network to selectively keep or ignore features from each view. This facilitates comparisons between different views, allowing the network to perform operations similar to the photometric consistency criterion used in space carving. We use a Convolutional Gated Recurrent Unit (ConvGRU) so that the network can leverage spatial information. An illustration of our fusion module is shown in Fig. 4.
Our rendering module takes the fused object volume and renders it given arbitrary camera parameters. Ideally, our rendering module would directly regress a color image. However, it is challenging to preserve high frequency details through a neural network. U-Nets introduce skip connections between equivalent scale layers, allowing high frequency spatial structure to propagate to the end of the network, but it is unclear how to add skip connections in the presence of 3D transformations. Existing works such as [30, 18] train a single network for each scene, allowing the decoder to memorize high frequency information while the latent representation encodes state information. Trying to predict color without skip connections results in blurry outputs, as shown in Fig. 5. We side-step this difficulty by first rendering depth and then using an image-based rendering approach to produce a color image.
Decoding Depth. Depth is a 3D representation, making it easier for the network to exploit the geometric structure we provide. In addition, depth tends to be locally smoother compared to color allowing more information to be compactly represented in a single voxel.
Our rendering network is a simple inversion of the reconstruction network and bears many similarities to RenderNet . First, we pass the canonical object-space volume through a small 3D U-Net () before transforming it to camera coordinates using the method described in Sec. 4.1. We perform the transformation with an object-to-camera extrinsic matrix instead of the inverse . A second 3D U-Net () then decodes the resulting volume to produce a feature volume: which is then flattened to a 2D feature grid using the projection unit () from  by first collapsing the depth dimension into the channel dimension and applying a 1x1 convolution. The resulting features are decoded by a 2D U-Net () with two output branches for depth () and for a segmentation mask (). The outputs of the rendering network are given by:
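The projection unit described above can be sketched as a reshape followed by a per-pixel linear map, which is what a 1x1 convolution computes. The NumPy sketch below assumes a channels-first layout without a batch axis and omits any bias term; the weight matrix stands in for learned parameters and is hypothetical.

```python
import numpy as np

def projection_unit(volume, weight):
    """Flatten a (C, D, H, W) volume to (C_out, H, W) image features.

    The depth axis is collapsed into the channel axis, then a 1x1
    convolution (a linear map applied at every pixel) mixes the channels.
    """
    c, d, h, w = volume.shape
    flat = volume.reshape(c * d, h, w)
    return np.einsum("oc,chw->ohw", weight, flat)

# Hypothetical sizes; this weight simply selects the first two collapsed channels.
volume = np.arange(2 * 3 * 4 * 4, dtype=float).reshape(2, 3, 4, 4)
weight = np.eye(6)[:2]
features_2d = projection_unit(volume, weight)
```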
Image Based Rendering (IBR). We use image-based rendering  to leverage the reference images to predict output color. Given the camera intrinsics and depth map for an output view, we can recover the 3D object-space position of each output pixel as
which can be transformed to the input image frame as for each input camera . The output pixel can then copy the color of the corresponding input pixel to produce a reprojected color image.
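The reprojection can be sketched in three steps: back-project each output pixel using its depth, move the point into the reference camera frame via object space, and read the reference pixel it lands on. The NumPy sketch below is illustrative only; it assumes both cameras share the same intrinsics, that depth is positive everywhere, that extrinsics map object to camera coordinates, and it uses nearest-pixel lookup without any occlusion handling.

```python
import numpy as np

def backproject(K, depth):
    """Turn an (H, W) depth map into (H, W, 3) camera-space points."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
    rays = pixels @ np.linalg.inv(K).T
    return rays * depth[..., None]

def reproject_colors(depth_out, K, E_out, E_in, image_in):
    """Copy colours from a reference image into the output view."""
    h, w = depth_out.shape
    pts_cam = backproject(K, depth_out).reshape(-1, 3)
    pts_h = np.concatenate([pts_cam, np.ones((h * w, 1))], axis=1)
    pts_obj = pts_h @ np.linalg.inv(E_out).T   # output camera -> object space
    pts_in = (pts_obj @ E_in.T)[:, :3]         # object space -> reference camera
    uvw = pts_in @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]
    u = np.clip(np.rint(uv[:, 0]).astype(int), 0, image_in.shape[1] - 1)
    v = np.clip(np.rint(uv[:, 1]).astype(int), 0, image_in.shape[0] - 1)
    return image_in[v, u].reshape(h, w, -1)

# Hypothetical 4x5 example: identical cameras must return the reference image.
K = np.array([[50.0, 0.0, 2.0], [0.0, 50.0, 1.5], [0.0, 0.0, 1.0]])
E = np.eye(4)
E[2, 3] = 2.0
image = np.arange(4 * 5 * 3, dtype=float).reshape(4, 5, 3)
depth = np.full((4, 5), 2.0)
same_view = reproject_colors(depth, K, E, E, image)
```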
The resulting reprojected image will contain invalid pixels due to occlusions. There are multiple strategies for weighting each pixel, including 1) weighting by reprojected depth error, 2) weighting by similarity between input and query cameras, and 3) using a neural network. Examples of renderings using each method are shown in Fig. 5. The first choice suffers from artifacts in the presence of depth errors or thin surfaces. The second approach yields reasonable results but produces blurry images for intermediate views. We opt for the third option. Following deep blending, we train a network that predicts blend weights for each reprojected input; the output color is the sum over reference views of the element-wise product of each reprojected input with its blend weights. The blend weights are predicted by a 2D U-Net. The inputs to this network are 1) the depth predicted by our reconstruction pipeline, 2) each reprojected input image, and 3) a view similarity score based on the angle between the input and query poses. We concatenate the inputs into a 7-channel tensor.
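Once per-view weights are available, the blending itself is a weighted sum of the reprojected images. The sketch below shows only that final step in NumPy. The softmax normalisation across views is an assumption made here for illustration, and the raw scores stand in for the output of the weight-prediction network, which is not reproduced.

```python
import numpy as np

def blend_views(reprojected, scores):
    """Blend (N, H, W, 3) reprojected images using (N, H, W) per-pixel scores."""
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)        # sums to 1 over views
    blended = (weights[..., None] * reprojected).sum(axis=0)
    return blended, weights

# Two hypothetical views with constant colours and equal scores.
views = np.stack([np.zeros((2, 2, 3)), np.ones((2, 2, 3))])
blended, weights = blend_views(views, np.zeros((2, 2, 2)))
```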
5 Object Pose Estimation
Given an image , and a depth map , a pose estimation system provides a rotation and a translation which together define an object-to-camera coordinate transformation referred to as the object pose. In this section, we describe how we use our reconstruction pipeline described in Sec. 4 to directly optimize for the pose. We first find a coarse pose using only forward inference and then refine it using gradient optimization.
Formulation. Pose is defined by a rotation and a translation . Our formulation also includes the viewport parameter defined in Sec. 3. Defining a viewport allows us to efficiently pass the input to the reconstruction network while also providing scale invariance. We encode the rotation as a quaternion and translation as . We assume we are given an RGB image , an object segmentation mask , and depth comprising the input .
In order to estimate pose, we must first provide a criterion which quantifies the quality of the pose. We use two loss functions. One is a standard depth loss:
which disambiguates the object scale and serves as a reconstruction loss for how well the prediction depth matches the input depth .
We introduce a novel latent loss which leverages our reconstruction network to evaluate the fitness of a pose. Given the input , a latent object , and a pose , the latent loss is defined as
where is the rendering network up to the projection layer and is the modeling network as described in Sec. 4. This loss differs from auto-encoder based approaches such as [33, 6] in that 1) our network is not trained on the object, and 2) the loss is computed directly given the image and camera pose. Given the two loss functions, the pose estimation problem is given by:
where is the depth branch of the rendering network and is a weight to balance the two losses which is optimized using . The latent object is omitted for clarity.
Coarse Pose. We bootstrap our pose estimation by first computing a coarse estimate. This speeds up optimization and prevents the optimization from getting stuck in bad local minima. We begin by estimating the translation of the object as the centroid of the bounding cube defined by the mask bounding box and corresponding depth values. We then uniformly sample rotation quaternions and translations and select the top poses according to Eq. (8). This step only requires forward inference of our network and thus we can perform this step for a large number of candidate poses.
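The coarse search can be sketched as: draw rotation candidates uniformly, score each with a forward pass, and keep the best few. In the NumPy sketch below, unit quaternions are drawn uniformly by normalising 4D Gaussian samples. The `toy_loss` function is a hypothetical stand-in for the network-based objective, and translation sampling is omitted for brevity.

```python
import numpy as np

def sample_unit_quaternions(n, rng):
    """Uniform samples on the unit 3-sphere via normalised Gaussians."""
    q = rng.normal(size=(n, 4))
    return q / np.linalg.norm(q, axis=1, keepdims=True)

def select_top_poses(candidates, loss_fn, k):
    """Score every candidate with loss_fn and return the k lowest-loss ones."""
    losses = np.array([loss_fn(q) for q in candidates])
    order = np.argsort(losses)[:k]
    return candidates[order], losses[order]

# Hypothetical objective: distance to a known target rotation (q and -q agree).
target = np.array([1.0, 0.0, 0.0, 0.0])
toy_loss = lambda q: 1.0 - abs(float(q @ target))

rng = np.random.default_rng(0)
candidates = sample_unit_quaternions(256, rng)
best, best_losses = select_top_poses(candidates, toy_loss, k=8)
```

Because each candidate needs only a forward evaluation, the number of candidates can be large; the selected poses then seed the gradient-based refinement.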
Pose Optimization. Our entire pipeline is differentiable end-to-end. We can therefore optimize Eq. (8) using gradient optimization. Given a latent object and a coarse pose , we compute the loss and propagate the gradients back to the camera pose. This step only requires the rendering network and does not use the modeling network. The image-based rendering network is also not used in this step.
We parameterize the rotation in the log quaternion form in order to keep gradient updates valid:
We also jointly optimize the translation and viewport .
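A log-quaternion parameterization stores the rotation as an unconstrained 3-vector and maps it to a unit quaternion with the exponential map, so any gradient step still yields a valid rotation. The NumPy sketch below uses one common convention (scalar part first, vector norm as the angle argument); the exact form used in the paper's equation is not shown in this text, so treat the details as an assumption.

```python
import numpy as np

def exp_quaternion(omega):
    """Map a 3-vector to a unit quaternion (w, x, y, z)."""
    omega = np.asarray(omega, dtype=float)
    theta = np.linalg.norm(omega)
    if theta < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    return np.concatenate([[np.cos(theta)], np.sin(theta) * omega / theta])

def log_quaternion(q):
    """Inverse map: unit quaternion to a 3-vector with norm in [0, pi]."""
    q = np.asarray(q, dtype=float)
    q = q / np.linalg.norm(q)
    v_norm = np.linalg.norm(q[1:])
    if v_norm < 1e-12:
        return np.zeros(3)
    return np.arctan2(v_norm, q[0]) * q[1:] / v_norm

# An arbitrary update in log space still produces a unit quaternion.
omega = np.array([0.3, -0.2, 0.5])
step = np.array([0.05, 0.01, -0.02])      # hypothetical gradient step
q_updated = exp_quaternion(omega - step)
```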
We evaluate our method on two datasets: ModelNet  and our new dataset MOPED. We aim to evaluate pose estimation accuracy on unseen objects.
6.1 Implementation Details
Training Data. We train our reconstruction network on 3D models from ShapeNet, which contains around 51,300 shapes. We exclude models larger than 50MB or models with more than 10 components for efficient data loading, resulting in around 30,000 models. We generate UV maps for each model using Blender’s smart UV projection to facilitate texturing. We normalize all models to unit diameter. When rendering, we sample a random image from MS-COCO for each component of the model. We render with the Beckmann model with randomized parameters. We also render uniformly colored objects with a probability of 0.5.
Network Input. We generate our training data at a resolution of . However, the input to our network is a fixed size . To keep our inputs consistent and to keep our network scale-invariant, we ‘zoom’ into the object such that all images appear to be from the same distance. This is done by computing a bounding box size where is the current image width and height, is the distance to the centroid (See Fig. 3), is the desired output size, and is the desired ‘zoom’ distance and cropping around object centroid projected to image coordinates . This defines the viewport parameter . The cropped image is scaled to .
Training. In each iteration of training, we sample a 3D model and then sample 16 random reference poses and 16 random target poses. Each pose is sampled by uniformly sampling a unit quaternion and a translation such that the object stays within frame. We train our network using the Adam optimizer with a fixed learning rate of 0.001 for 1.5M iterations. Each batch consists of 20 objects with 16 input views and 16 target views. We use a reconstruction loss for depth and binary cross-entropy for the mask. We apply the losses to both the input and output views. We randomly orient our canonical coordinate frame in each iteration by uniformly sampling a random unit quaternion. This prevents our network from overfitting to the implementation of our latent voxel transformations. We also add motion blur, color jitter, and pixel noise to the color inputs and add noise to the input masks, following prior work.
6.2 Experiments on the ModelNet Dataset
We conduct experiments on ModelNet to evaluate the generalization of our method toward unseen object categories. To this end, we train our network on all the meshes in ShapeNetCore, excluding the categories we are going to evaluate on. We closely follow the evaluation protocol of prior work for this section. The model is evaluated on 7 unseen categories: bathtub, bookshelf, guitar, range hood, sofa, wardrobe, and TV stand. For each category, 50 pairs of initial and target object poses are sampled. We compare against prior methods, where all methods are initialized with the initial pose and evaluated on how successfully they estimate the target pose. Three metrics are used: 1) (5°, 5cm), the percentage of estimates that are within 5° and 5 cm of the ground truth target pose; 2) ADD (0.1d), which considers a pose estimate correct if its average distance to the target pose is within 10% of the model diameter; 3) Proj2D (5px), which measures the percentage of estimates for which the projections of a set of predefined points on the mesh at the estimated pose are within 5 pixels of the projected points at the ground truth pose.
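Two of these metrics can be sketched directly from their definitions. The NumPy sketch below computes the ADD error and the (5°, 5cm) test for a single pose estimate. It assumes translations are expressed in metres and that `points` is a set of model points in object coordinates; the function names and the toy values are hypothetical.

```python
import numpy as np

def add_error(points, R_est, t_est, R_gt, t_gt):
    """Average distance between model points under the two poses."""
    p_est = points @ R_est.T + t_est
    p_gt = points @ R_gt.T + t_gt
    return np.linalg.norm(p_est - p_gt, axis=1).mean()

def rotation_error_deg(R_est, R_gt):
    """Angle of the relative rotation between the two poses, in degrees."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def within_5deg_5cm(R_est, t_est, R_gt, t_gt):
    """True if the estimate is within 5 degrees and 5 cm of the ground truth."""
    return bool(rotation_error_deg(R_est, R_gt) <= 5.0
                and np.linalg.norm(t_est - t_gt) <= 0.05)

def add_correct(points, R_est, t_est, R_gt, t_gt, diameter, fraction=0.1):
    """ADD (0.1d): correct if the ADD error is below 10% of the diameter."""
    return bool(add_error(points, R_est, t_est, R_gt, t_gt) <= fraction * diameter)

# Toy example: a 1 cm translation error on an object of diameter 0.2 m.
points = np.array([[0.1, 0, 0], [0, 0.1, 0], [0, 0, 0.1], [-0.1, 0, 0]])
R_gt, t_gt = np.eye(3), np.array([0.0, 0.0, 1.0])
t_est = t_gt + np.array([0.01, 0.0, 0.0])
add = add_error(points, R_gt, t_est, R_gt, t_gt)

angle = np.radians(10.0)
R_10 = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                 [np.sin(angle), np.cos(angle), 0.0],
                 [0.0, 0.0, 1.0]])
rot_err = rotation_error_deg(R_10, R_gt)
```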
Table 1 shows the quantitative results on the ModelNet dataset. On average, our method achieves state-of-the-art results on all the metrics thanks to our ability to perform continuous optimization on pose. However, for the (5°, 5cm) metric, there are object categories on which our method performs worse despite performing well on all other metrics. One reason is image and spatial resolution: both the input and output images of our network and our voxel representation have limited resolution, which can hinder performance for small objects or objects that are distant from the camera. Small changes in the depth of each pixel may disproportionately affect the rotation of the object compared to their effect on our loss values. Fig. 6 shows examples from the ModelNet experiment illustrating this limitation.
Table 1 (column headers): |(5°, 5cm)||ADD (0.1d)||Proj2D (5px)|
Table 2 (method properties):
|||PoseRBPF||Ours|
|Input||Textured 3D Mesh||Images + Camera Pose|
|# Networks||Per-Object||Single Universal|
6.3 Experiments on the MOPED Dataset
We introduce the Model-free Object Pose Estimation Dataset (MOPED). MOPED consists of 11 objects, shown in Fig. 7. For each object, we take multiple RGB-D videos covering all views of the object. We first use KinectFusion to register frames from a single capture and then use a combination of manual annotation and automatic registration [44, 27, 45] to align separate captures. For each object, we select reference frames with farthest point sampling to ensure good coverage of the object. For test sequences, we capture each object in 5 different environments. We sample every other frame of the evaluation videos. This results in approximately 300 test images per object. We quantitatively evaluate our method and baselines using three metrics, for each of which we report the Area Under Curve (AUC) over a range of thresholds: 1) ADD: average distance error. 2) ADD-S: symmetric average distance error, which measures the geometric alignment of symmetric objects regardless of texture. 3) Proj2D: projection error. We compute all metrics for all sampled frames.
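Farthest point sampling, used above to pick reference frames, greedily adds the candidate that is farthest from everything already selected. The NumPy sketch below applies it to a set of points. Treating each frame as a single point (for example, its camera position) is an assumption made for illustration, as is the choice of the first frame as the seed.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to maximise coverage."""
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return selected

# Hypothetical 1D camera positions embedded in 3D.
positions = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0], [10.0, 0, 0]])
chosen = farthest_point_sampling(positions, k=3)
```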
We compare our method with PoseRBPF, a state-of-the-art model-based pose estimation method. Since PoseRBPF requires textured 3D models, we reconstruct a mesh for each object by aggregating point clouds from reference captures and building a TSDF volume. The point clouds are integrated into the volume using KinectFusion. The meshes have artifacts such as washed out high frequency details and shadow vertices due to slight misalignment (see supplementary materials). Table 2 shows quantitative comparisons on the MOPED dataset. Note that our method is not trained on the test objects, while PoseRBPF has a separate encoder for each object. Our method achieves superior performance on both ADD and ADD-S. We evaluate different versions of our method with different combinations of loss functions. Compared to our combined loss, optimizing a single loss performs better for geometrically asymmetric objects but worse on textured objects such as the cheezit box. Optimizing both losses achieves better results on textured objects. Fig. 8 shows estimated poses for different test images. Please see supplementary materials for qualitative examples.
6.4 Ablation Studies
In this section, we analyze the effect of different design choices and how they affect the robustness of our method.
Number of reference views. We first evaluate the sensitivity of our method to the number of input reference views. Novel view synthesis is easier with more reference views because there is a higher chance that a query view will be close to a reference view. Table 3 shows that the accuracy increases with the number of reference views. In addition, the fact that using more than 8 reference views yields only marginal performance gains shows that our method does not require many views to achieve good pose estimation results.
We compare multiple strategies for aggregating the latent representations from each reference view. The naive way is to use a simple pooling function such as average/max pooling. Alternatively, we can integrate the volumes using an RNN such as a ConvGRU so that the network can reason across views. Table 4 shows the quantitative evaluation of these two variations. Although the average performance across objects is very similar, the ConvGRU variation performs better than the average pooling variation. This indicates the importance of spatial relationship in the voxel representation for pose estimation.
We have presented a novel framework for learning 3D object representations from reference views. Our network is able to decode this representation to synthesize novel views and estimate 6D poses of objects. By training the network with thousands of 3D shapes, our network learns to reconstruct and estimate the pose of unseen objects during inference. Compared to the current methods for 6D object pose estimation, our method removes the requirement of having a high quality 3D model or performing training for each object. Consequently, our method has the potential to handle a large number of objects for pose estimation. For future work, we plan to investigate unseen object pose estimation in cluttered scenes where objects may occlude each other. Another direction is to speed up the pose estimation process by applying network optimization techniques.
-  (2015) Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432. Cited by: §4.1.
-  (1987) The scattering of electromagnetic waves from rough surfaces. Norwood, MA, Artech House, Inc., 1987, 511 p.. Cited by: §6.1.
-  (2019) Blender - a 3d modelling and rendering package. Blender Foundation, Blender Institute, Amsterdam. External Links: Cited by: §6.1.
-  (2015) ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012. Cited by: §1, §6.1, §6.2.
3D-R2N2: a unified approach for single and multi-view 3D object reconstruction.
European Conference on Computer Vision (ECCV), pp. 628–644. Cited by: §2.
-  (2019) PoseRBPF: a rao-blackwellized particle filter for 6D object pose tracking. In Robotics: Science and Systems (RSS), Cited by: §1, §1, §2, §5.1, §6.3, Table 2.
-  (2018) Deep blending for free-viewpoint image-based rendering. In SIGGRAPH Asia 2018 Technical Papers, pp. 257. Cited by: §4.2.
-  (2012) Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Asian Conference on Computer Vision (ACCV), pp. 548–562. Cited by: §6.2.
-  (2017) SSD-6D: making RGB-based 3D detection and 6D pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1521–1529. Cited by: §2.
Multi-task learning using uncertainty to weigh losses for scene geometry and semantics.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7482–7491. Cited by: §5.1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.1.
-  (2018) 3D-RCNN: instance-level 3D object reconstruction via render-and-compare. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3559–3568. Cited by: §2.
-  (2000) A theory of shape by space carving. International Journal of Computer Vision (IJCV) 38 (3), pp. 199–218. Cited by: §1, §2, §4.1, §4.1.
-  (2018) Differentiable monte carlo ray tracing through edge sampling. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 37 (6), pp. 222:1–222:11. Cited by: §2.
-  (2018) DeepIM: deep iterative matching for 6D pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 683–698. Cited by: §2, §6.2, Table 1.
-  (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755. Cited by: §6.1.
-  (2019-10) Soft rasterizer: a differentiable renderer for image-based 3D reasoning. The IEEE International Conference on Computer Vision (ICCV). Cited by: §2.
-  (2019) Neural volumes: learning dynamic renderable volumes from images. arXiv preprint arXiv:1906.07751. Cited by: §4.2.
-  Dex-Net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In Robotics: Science and Systems (RSS), Cited by: §6.1.
-  (2017) 3D bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2011) KinectFusion: real-time dense surface mapping and tracking. In IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Vol. 11, pp. 127–136. Cited by: §1, §1, §2, §6.3, §6.3.
-  (2018) RenderNet: a deep convolutional network for differentiable rendering from 3D shapes. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2, §4.1, §4.1, §4.2.
-  HoloGAN: unsupervised learning of 3D representations from natural images. In International Conference on Computer Vision (ICCV), Cited by: §2, §4.1, §4.
-  (2018) PhotoShape: photorealistic materials for large-scale shape collections. In SIGGRAPH Asia 2018 Technical Papers, pp. 192. Cited by: §2.
-  (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241. Cited by: §4.1, §4.2.
-  (2001) Efficient variants of the ICP algorithm. In Proceedings Third International Conference on 3-D Digital Imaging and Modeling, pp. 145–152. Cited by: §6.3.
-  (2000) Review of image-based rendering techniques. In Visual Communications and Image Processing, Vol. 4067, pp. 2–13. Cited by: §4.2.
-  (2014) BigBIRD: a large-scale 3D database of object instances. In IEEE International Conference on Robotics and Automation (ICRA), pp. 509–516. Cited by: §2.
-  (2019) DeepVoxels: learning persistent 3D feature embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 4, §4.1, §4.1, §4.1, §4.2.
-  (2019) Scene representation networks: continuous 3D-structure-aware neural scene representations. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2, §4.
-  (2019) Multi-path learning for object pose estimation across domains. arXiv preprint arXiv:1908.00151. Cited by: §6.2, Table 1.
-  (2018) Implicit 3D orientation learning for 6D object detection from RGB images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 699–715. Cited by: §1, §2, §5.1.
-  (2018) Deep object pose estimation for semantic robotic grasping of household objects. In Conference on Robot Learning (CoRL), Cited by: §1, §1, §2.
-  (2018) Multi-view consistency as supervisory signal for learning shape and pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2897–2905. Cited by: §2.
-  (2017) Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2626–2634. Cited by: §2.
-  (2019) DenseFusion: 6D object pose estimation by iterative dense fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3343–3352. Cited by: §1, §2.
-  (2016) Unsupervised texture transfer from images to model collections. ACM Trans. Graph. 35 (6), pp. 177:1–177:13. Cited by: §2.
-  (2015) ElasticFusion: dense SLAM without a pose graph. In Robotics: Science and Systems (RSS), Cited by: §2.
-  (2015) 3D ShapeNets: a deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1912–1920. Cited by: §6.
-  PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes. In Robotics: Science and Systems (RSS), Cited by: §1, §1, §2.
-  (2019) The best of both modes: separately leveraging RGB and depth for unseen object instance segmentation. In Conference on Robot Learning (CoRL), Cited by: §1.
-  (2019) PointFlow: 3D point cloud generation with continuous normalizing flows. In International Conference on Computer Vision (ICCV), Cited by: §2.
-  (2016) Fast global registration. In European Conference on Computer Vision (ECCV), pp. 766–782. Cited by: §6.3.
-  (2018) Open3D: a modern library for 3D data processing. arXiv preprint arXiv:1801.09847. Cited by: §6.3.