1 Introduction
Estimating 6DoF object pose from images is a core problem for a wide range of applications, including robotic manipulation, navigation, augmented reality and autonomous driving. While numerous methods appear in the literature [12, 41, 1, 39, 2, 6, 17, 26], scalability (to large numbers of objects) and accuracy continue to be critical issues that limit existing methods. Recent work has attempted to leverage the power of deep CNNs to surmount these limitations [35, 25, 42, 27, 38, 16, 44, 30]. One naive approach is to train a network to estimate the pose of each object of interest (Fig. 1 (a)). More recent approaches follow the principle of “object per output branch” (Fig. 1 (b)), whereby each object class (an object class may refer to either an object instance or an object category) is associated with an output stream connected to a shared feature basis [44, 16, 35, 25, 30]. In both cases, the size of the network increases with the number of objects, which implies that large amounts of data are needed for each class to avoid overfitting. In this work, we present a multi-class pose estimation architecture (Fig. 1 (c)) which receives object images and class labels provided by a detection system and has a single branch for pose prediction. As a result, our model is readily scalable to large numbers of object categories and works for unseen instances while providing robust and accurate pose prediction for each object.
The ambiguity of object appearance and occlusion in cluttered scenes is another problem that limits the application of pose estimation in practice. One solution is to exploit additional views of the same instance to compensate for recognition failure from a single view. However, naive “averaging” of multiple single-view pose estimates in SE(3) [5] does not work due to its sensitivity to incorrect predictions. Additionally, most current approaches to multi-view 6DoF pose estimation [33, 22, 7] do not address single-view ambiguities caused by object symmetry. This exacerbates the complexity of view fusion when multiple correct estimates from single views do not agree on SE(3). Motivated by these challenges, we demonstrate a new multi-view framework (Fig. 1 (d)) which selects pose hypotheses, computed by our single-view multi-class network, based on a distance metric robust to object symmetry.
In summary, we make the following contributions to scalable and accurate pose estimation over multiple classes and multiple views:

We develop a multi-class CNN architecture for accurate pose estimation with three novel features: a) a single pose prediction branch which is coupled with a discriminative pose representation in SE(3) and is shared by multiple classes; b) a method to embed object class labels into the learning process by concatenating a tiled class map with convolutional layers; and c) deep supervision with an object mask, which improves generalization from synthetic data to real images.

We present a multi-view fusion framework which reduces single-view ambiguity based on a voting scheme. An efficient implementation is proposed to enable fast hypothesis selection during inference.

We show that our method provides state-of-the-art performance on public benchmarks, including YCB-Video [44] and JHUScene-50 [22] for 6DoF object pose estimation, and ObjectNet3D [42] for large-scale viewpoint estimation. Further, we present a detailed ablative study on all benchmarks to empirically validate the three innovations in the single-view pose estimation network.
2 Related Work
We first review three categories of work on single-view pose estimation and then investigate recent progress on multi-view object recognition.
Template Matching. Traditional template-based methods compute the 6DoF pose of an object by matching image observations to hundreds or thousands of object templates sampled from a constrained viewing sphere [12, 41, 1, 39]. Recent approaches apply deep CNNs as end-to-end matching machines to improve the robustness of template matching [41, 1, 19]. Unfortunately, these methods do not scale well in general because the inference time grows linearly with the number of objects. Moreover, they generalize poorly to unseen object instances, as shown in [1], and suffer from the domain shift from synthetic to real images.
Bottom-Up Approaches. Given object CAD models, 6DoF object pose can be inferred by registering a CAD model to part of a scene using coarse-to-fine ICP [47], Hough voting [37], RANSAC [28] and heuristic 3D descriptors [8, 32]. More principled approaches use random forests to infer local object coordinates for each image pixel based on hand-crafted features [3, 4, 26] or auto-encoders [6, 17]. However, local image patterns are ambiguous for objects with similar appearance, which prevents this line of work from being applied to generic objects and unconstrained background clutter.
Learning End-to-End Pose Machines. This class of work deploys deep CNNs to learn an end-to-end mapping from a single RGB or RGB-D image to object pose. [35, 25, 27, 42] train CNNs to directly predict the Euler angles of object instances and then apply them to unseen instances from the same object categories. Other methods decouple 6DoF pose into rotation and translation components and infer each independently. SSD-6D [16] classifies an input into discrete bins of Euler angles and subsequently estimates the 3D position by fitting 2D projections to a detected bounding box. PoseCNN [44]
regresses rotation with a loss function that is robust to object symmetry, and follows this with a bottom-up approach to vote for the 3D location of the object center via RANSAC. In contrast to the above, our method formulates a discriminative representation of 6DoF pose that enables prediction of both rotation and translation by a single forward pass of a CNN, while scaling to hundreds of object categories.
Multi-View Recognition. In recent years, several multi-view systems have been developed to enhance 3D model classification [34, 15], 2D object detection [20, 29] and semantic segmentation [23, 36, 47]. For 6DoF pose estimation, SLAM++ [33] is an early representative of a multi-view pose framework which jointly optimizes the poses of both the detected objects and the cameras. [23] computes object pose by registering 3D object models to an incrementally reconstructed scene via a dense SLAM system. These two methods are difficult to scale because they rely on [28], whose running time grows linearly with the number of objects. A more recent method [7] formulates a probabilistic framework to fuse pose estimates from different views. However, it requires computing a marginal probability over all subsets of a given number of views, which is computationally prohibitive when the number of views and/or objects is large.
3 Single-View Multi-Class Pose Estimation Network
In this section, we introduce a CNN-based architecture for multi-class pose estimation (Fig. 2). The input can be an RGB or RGB-D image region of interest (ROI) of an object provided by an arbitrary object detection algorithm. The network outputs represent both the rotation and the translation of a 6DoF pose in SE(3).
We first note that a single rotation R relative to the camera corresponds to different object appearances in the image domain as the ROI location varies. This issue has been discussed in [27] in the case of 1D yaw angle estimation. To create a consistent mapping from the ROI appearance to R, we initially rectify the annotated pose to align to the current viewpoint as follows. We first compute the 3D orientation v towards the ROI center (cx, cy): v = ((cx − px)/fx, (cy − py)/fy, 1), where (px, py) is the 2D camera center and fx, fy are the focal lengths for the X and Y axes. Subsequently, we compute rectified XYZ axes (rx, ry, rz) by aligning the Z axis to v:

rz = v / ||v||,   rx = (ey × rz) / ||ey × rz||,   ry = rz × rx   (1)

where ey = (0, 1, 0) and the symbol × indicates the cross product of two vectors. Finally, we project the annotated pose [R | T] onto Rv and obtain the rectified pose [R' | T']: R' = Rv^T R and T' = Rv^T T, where Rv = [rx, ry, rz]. We refer readers to the supplementary material for more details about the rectification step. When depth is available, we rectify the XYZ value of each pixel by Rv^T and construct a normalized XYZ map by centering the point cloud to the median along each axis.

Figure 2 illustrates the details of our network design. Two streams of convolutional layers receive the RGB image and the XYZ map respectively, and the final outputs are bin and delta vectors (described below) for both rotation and translation (Sec. 3.1). These two streams are further merged with class priors (Sec. 3.2) and deeply supervised by an object mask (Sec. 3.3). When depth data is not available, we simply remove the XYZ stream.
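The rectification step above can be sketched in a few lines of numpy. This is a minimal sketch: the axis convention used to build the rectified frame (a look-at construction against the camera Y axis) is an assumption, and `rectify_pose` and `roi_center` are illustrative names, not from the paper.

```python
import numpy as np

def rectify_pose(R, T, roi_center, fx, fy, px, py):
    """Rectify an annotated pose [R|T] so that the camera Z axis points
    toward the ROI center. A sketch of the rectification in Sec. 3; the
    exact axis convention in the paper may differ."""
    cx, cy = roi_center
    # 3D viewing direction v toward the ROI center.
    v = np.array([(cx - px) / fx, (cy - py) / fy, 1.0])
    rz = v / np.linalg.norm(v)
    # Build rectified X/Y axes via cross products with the camera Y axis
    # (degenerate if v is parallel to the Y axis, which cannot happen
    # for a direction with positive Z component).
    rx = np.cross([0.0, 1.0, 0.0], rz)
    rx /= np.linalg.norm(rx)
    ry = np.cross(rz, rx)
    Rv = np.stack([rx, ry, rz], axis=1)   # columns are the rectified axes
    # Project the annotated pose onto the rectified frame.
    return Rv.T @ R, Rv.T @ T
```

When the ROI is centered at the principal point, v is the optical axis and the rectification is the identity, which is a useful sanity check.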
3.1 Bin & Delta Representation for SE(3)
Direct regression to object rotation has been shown to be inferior to a classification scheme over a discretized SO(3) (the special orthogonal group of rotations in three dimensions) [31, 27, 16]. One common discretization of SO(3) is to bin along each Euler angle (i.e. yaw, pitch and roll) [35, 16]. However, this binning scheme yields a non-uniform tessellation of SO(3). Consequently, a small error on one Euler angle may be magnified and result in a large deviation in the final rotation estimate. In the following, we formulate two new bin & delta representations which uniformly partition both SO(3) and R^3. They are further coupled with a classification & regression scheme for learning discriminative pose features.
Almost Uniform Partition of SO(3). We first exploit the sampling technique developed by [45] to generate a set of rotations {R1, ..., RN} that are uniformly distributed on SO(3). These N rotations are used as the centers of rotation bins in SO(3) and are shared between different object classes. Given an arbitrary rotation matrix R, we convert it to a bin and delta pair (b(R), d(R)) based on {Ri}. The bin vector b(R) contains N dimensions, where the i-th dimension indicates the confidence of R belonging to bin i. d(R) stores N rotations (i.e. quaternions in our implementation), where the i-th rotation Δi is the deviation from Ri to R. During inference, we take the bin with maximum score and apply the corresponding delta value to the bin center to compute the final prediction. In training, we enforce a sparse confidence scoring scheme for b(R) to supervise the network:

b_i(R) = s1, if Ri is the nearest neighbor of R;  s2, if Ri ∈ N_K(R) and Ri is not the nearest neighbor;  0, otherwise   (2)

where s1 > s2 and N_K(R) is the set of K nearest neighbors of R among {Ri} in terms of the geodesic distance between two rotations. Note that we design the delta to achieve R = ΔiRi and not R = RiΔi because the former is numerically more stable. Specifically, if Δ̂i is the prediction of Δi with error ε such that Δ̂i = εΔi, the error of the final prediction is also ε because Δ̂iRi = εΔiRi = εR. If we define R = RiΔi instead, then RiΔ̂i = RiεΔi and the error will be RiεRi^T. Thus, the error of Δ̂i may be magnified in the final rotation estimate.
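The encoding above can be sketched with unit quaternions. This is a minimal sketch, assuming a quaternion representation of the bin centers; the function name `rotation_bin_delta` and the values of K, s1 and s2 are illustrative, not the paper's.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_conj(q):
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def geodesic(a, b):
    """Geodesic distance between two unit quaternions."""
    return 2.0 * np.arccos(np.clip(abs(a @ b), -1.0, 1.0))

def rotation_bin_delta(q, bins, K=2, s1=1.0, s2=0.5):
    """Convert a rotation q into the bin & delta pair of Sec. 3.1.
    `bins` holds unit quaternions of the (near-)uniform SO(3) bin centers."""
    d = np.array([geodesic(q, b) for b in bins])
    nn = np.argsort(d)[:K]               # K nearest bins
    conf = np.zeros(len(bins))
    conf[nn] = s2
    conf[nn[0]] = s1                     # top neighbor gets the higher score
    # delta_i is defined so that q = delta_i * bin_i (left multiplication)
    deltas = np.zeros((len(bins), 4))
    for i in nn:
        deltas[i] = quat_mul(q, quat_conj(bins[i]))
    return conf, deltas
```

Composing the stored delta with its bin center recovers the original rotation, which mirrors the inference rule of applying the delta to the highest-scoring bin.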
Gridding XYZ Axes. The translation vector T is the 3D vector from the camera origin to the object center. To divide the translation space, we uniformly grid the X, Y and Z axes independently. For RGB images, we align the X and Y axes to image coordinates and the Z axis is the optical axis of the camera. We also rescale the ROI to a fixed scale for the CNN, so we further adjust the Z value z of each pixel to z' such that the image scale is consistent with the depth value: z' = z · (sb / sa), where sb and sa are the image scales before and after rescaling, respectively. When depth data is available, the XYZ axes are simply chosen to be the coordinate axes of the normalized point cloud.

We now discuss how to construct the bin & delta pair for the X axis; the Y and Z axes are handled in the same way. We first create M non-overlapping bins of equal size between tmin and tmax (tmin and tmax may vary across different axes). When the X value is lower than tmin (or larger than tmax), we assign it to the first (or last) bin. During inference, we compute the X value by adding the delta to the center of the bin with the maximum confidence score. During training, similar to Eq. 2, we compute b(x) of an X value x by finding its K nearest neighbors among the bins. Then, we assign s1 to the top nearest neighbor and s2 to the remaining neighbors (s1 > s2). Correspondingly, the delta values of the K nearest neighbor bins are the deviations from the bin centers to the actual X value, and all others are 0. Finally, we concatenate all bins and deltas of the X, Y and Z axes. One alternative way of dividing the translation space is to apply a joint gridding over the XYZ space. However, the total number of bins then grows as the cube of the per-axis resolution, and we found no performance gain by doing so in practice.
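The per-axis encoding and its decoding can be sketched as follows. This is a minimal sketch; `axis_bin_delta`, `decode` and the values of M, K, s1, s2 are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def axis_bin_delta(x, t_min, t_max, M=10, K=2, s1=1.0, s2=0.5):
    """Bin & delta encoding of one translation axis (Sec. 3.1): M
    equal-size bins on [t_min, t_max]; out-of-range values fall in the
    first or last bin via nearest-center assignment."""
    centers = t_min + (np.arange(M) + 0.5) * (t_max - t_min) / M
    nn = np.argsort(np.abs(centers - x))[:K]   # K nearest bins
    conf = np.zeros(M)
    conf[nn] = s2
    conf[nn[0]] = s1                           # top neighbor scores higher
    delta = np.zeros(M)
    delta[nn] = x - centers[nn]                # deviation from bin center
    return centers, conf, delta

def decode(centers, conf, delta):
    """Inference rule: add the delta to the highest-confidence bin center."""
    i = np.argmax(conf)
    return centers[i] + delta[i]
```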
3.2 Fusion of Class Prior
Many existing methods assume known object class labels, provided by a detection system, prior to pose analysis [44, 16, 31, 25, 1]. However, they ignore this class prior during training and only apply it during inference. Our idea is to directly incorporate the known class label into the learning process of the convolutional filters for pose. This is partly inspired by prior work on CNN-based hand-eye coordination learning [21], where a tiled robot motor motion map is concatenated with one hidden convolutional layer for predicting the grasp success probability. Given the class label of the ROI, we create a one-hot vector where the entry corresponding to the class label is set to 1 and all others to 0. We further spatially tile this one-hot vector to form a 3D tensor of size H × W × C, where C is the number of object classes and H, W are the height and width of a convolutional feature map at an intermediate layer chosen as part of the network design. As shown in Fig. 2, we concatenate this tiled class tensor with the last convolutional layers of both the color and depth streams along the filter channel. Therefore, the original feature map is embedded with class labels at all spatial locations, and the subsequent layers are able to model class-specific patterns for pose estimation. This is critical in teaching the network to develop compact class-specific filters for each individual object while taking advantage of a shared basis of low-level features for robustness.

3.3 Deep Supervision with Object Segmentation
Due to the limited availability of pose annotations on real images, synthetic CAD renderings are commonly used as training data for learning-based pose estimation methods [44, 12, 16]. We take this approach but, following [24], we also incorporate the deep supervision of an object mask at a hidden layer (shown in Fig. 2) for additional regularization of the training process. We can view the object mask as an intermediate result for the final task of 6DoF pose estimation. That is, good object segmentation is a prerequisite for the success of pose estimation. Moreover, a precisely predicted object mask benefits a post-refinement step such as Iterative Closest Point (ICP).
To incorporate the mask with the feature and class maps (Sec. 3.2), we append one output branch for the object mask which contains one convolutional layer followed by two deconvolution layers for upsampling. We assume that the object of interest dominates the input image, so that only a binary mask (“1” indicates an object pixel and “0” means background or other objects) is needed as an auxiliary cue. As such, the size of the output layer for binary segmentation prediction is fixed regardless of the number of object instances in the database, which enables our method to scale well to large numbers of objects. Conversely, when multiple objects appear in a scene, we must rely on some detection system to “roughly” localize them in the 2D image first.
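The class-prior fusion of Sec. 3.2, which feeds the layers supervised by this mask branch, can be sketched in a framework-agnostic way. This is a minimal sketch, assuming a channel-last feature layout; `fuse_class_prior` is an illustrative name, not from the paper.

```python
import numpy as np

def fuse_class_prior(feature_map, class_id, num_classes):
    """Concatenate a tiled one-hot class map with a convolutional feature
    map (Sec. 3.2). `feature_map` has shape (H, W, F); the result has
    shape (H, W, F + num_classes), embedding the class label at every
    spatial location."""
    H, W, _ = feature_map.shape
    onehot = np.zeros(num_classes)
    onehot[class_id] = 1.0
    # Spatially tile the one-hot vector into an H x W x C tensor.
    tiled = np.broadcast_to(onehot, (H, W, num_classes))
    return np.concatenate([feature_map, tiled], axis=-1)
```

In a deep learning framework the same operation is a broadcast plus a channel-wise concatenation, so it adds no learnable parameters of its own.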
3.4 Network Architecture
The complete loss function for training the network consists of five loss components over the segmentation map, the rotation, and the three translation components:

L = Lseg + (Lb^r + Ld^r) + Σ_{a ∈ {X, Y, Z}} (Lb^a + Ld^a)   (3)

where Lb and Ld denote losses on the bin and delta estimates of the ground-truth bins and deltas defined in Sec. 3.1. We apply a cross-entropy softmax loss to the segmentation output at each pixel location and to the bin losses, and L2 losses to the delta values. All losses are simultaneously back-propagated to update the network parameters on each batch. For simplicity, we apply an equal weight to each loss term.

Each convolutional layer is coupled with a batch-norm layer [13] and a ReLU. All convolutional filters share the same spatial size. The output layer for each bin and delta is constructed with one global average pooling (GAP) layer followed by one fully connected (FC) layer. We employ a dropout layer [18] before each strided (downsampling) convolution.

4 Multi-View Pose Framework
In this section, we present a multi-view framework which refines the outputs of our single-view network (Sec. 3) during the inference stage. We assume that the camera pose of each frame in a sequence is known. In practice, camera poses can be provided by many SLAM systems such as KinectFusion [14].
4.1 Motivation
Recall that we can obtain top-k estimates from all subspaces of SE(3), including SO(3) and the X, Y and Z spaces (Sec. 3.1). Therefore, we can compute pose hypotheses by composing the top-k results from all subspaces. In turn, we compute the top-k accuracy as the highest pose accuracy achieved among all hypotheses. Fig. 3 shows the curve of top-k accuracies of our pose estimation network across all object instances, in terms of the mPCK metric (see Sec. 5 for details) on the YCB-Video benchmark [44]. We observe that pose estimation performance significantly improves as k initially increases and then almost saturates. This suggests that the inferred confidence score is ambiguous only within a small range, which makes sense especially for objects that have symmetric geometry or texture. The question is how we can resolve this ambiguity and further improve the pose estimation performance. We now present a multi-view voting algorithm that selects the correct hypothesis from the top-k hypothesis set.
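The hypothesis composition described above can be sketched as a Cartesian product over the per-subspace candidates; composing top-k results from the four subspaces yields k^4 hypotheses. A minimal sketch (`compose_hypotheses` is an illustrative name, not from the paper):

```python
from itertools import product

def compose_hypotheses(rot_topk, x_topk, y_topk, z_topk):
    """Compose pose hypotheses from the top-k predictions of each
    subspace (Sec. 4.1): every combination of a rotation candidate with
    X/Y/Z candidates yields one hypothesis, k^4 in total."""
    return [(R, (x, y, z))
            for R, x, y, z in product(rot_topk, x_topk, y_topk, z_topk)]
```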
4.2 Hypothesis Voting
To measure the difference between hypotheses from different views, we first transfer all hypotheses into a common reference view using the known camera poses of all views. We consider a hypothesis set H collected from multiple views, where each hypothesis is expressed with respect to the camera coordinates of the reference view. To handle single-view ambiguity caused by symmetrical geometry, we test the consistency of fit to the observed data. More specifically, we employ the distance metric proposed by [12] to measure the discrepancy between two hypotheses h1 and h2:

d(h1, h2) = (1 / |M|) Σ_{x1 ∈ M} min_{x2 ∈ M} || h1(x1) − h2(x2) ||   (4)

where M denotes the set of 3D model points and h(x) = Rx + T applies a pose hypothesis to a model point. d(h1, h2) yields a small distance when the 3D object occupancies under poses h1 and h2 are similar, even if h1 and h2 have a large geodesic distance on SO(3). Finally, the voting score for a hypothesis h is calculated as:

S(h) = Σ_{h' ∈ H} 1[ d(h, h') < δ ]   (5)

where δ is the threshold for outlier rejection and 1[·] is the indicator function. We select the hypothesis with the highest vote score as the final prediction. Fig. 1 (d) illustrates this multi-view voting process.

Efficient Implementation. The above hypothesis voting algorithm is computationally expensive because the time complexity of Eq. 4 is at least O(|M| log |M|) via a KD-tree implementation. Our solution is to decouple the translation and rotation components in Eq. 4 and approximate d(h1, h2) by an upper bound d̂(h1, h2):
d̂(h1, h2) = || T1 − T2 || + dR(R1, R2)   (6)

where dR(R1, R2) is Eq. 4 evaluated on the rotation components alone. In fact, d̂ is an upper bound on d: d(h1, h2) ≤ d̂(h1, h2) for any h1 and h2, because || R1x1 + T1 − R2x2 − T2 || ≤ || T1 − T2 || + || R1x1 − R2x2 || based on the triangle inequality. Since the complexity of computing || T1 − T2 || is O(1), we can focus on speeding up the computation of the rotation distance dR(R1, R2). Our approach is to precompute a table of all pairwise distances between every two rotations from P uniformly sampled rotation bins given by [45]. For arbitrary R1 and R2, we search for their nearest neighbors Rp and Rq among the P bins. In turn, we approximate the rotation distance as follows:

dR(R1, R2) ≈ dR(Rp, Rq)   (7)

where the right-hand side can be directly retrieved from the precomputed distance table during inference. When P is large enough, the approximation error of Eq. 7 has little effect on our voting algorithm; in practice, we find the performance gain saturates at a moderate table size. Thus, the complexity of Eq. 7 is dominated by the O(log P) nearest neighbor search, which is significantly smaller than the per-pair cost of Eq. 5.
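The symmetry-robust distance of Eq. 4 and the voting rule of Eq. 5 can be sketched as follows. This is a minimal brute-force sketch (a KD-tree would give the O(|M| log |M|) cost mentioned above); `sym_distance` and `vote` are illustrative names, not from the paper.

```python
import numpy as np

def sym_distance(model_pts, pose1, pose2):
    """Eq. 4: average distance from each model point transformed by one
    pose to the closest model point transformed by the other pose."""
    (R1, T1), (R2, T2) = pose1, pose2
    p1 = model_pts @ R1.T + T1
    p2 = model_pts @ R2.T + T2
    return float(np.mean([np.min(np.linalg.norm(p2 - q, axis=1)) for q in p1]))

def vote(hypotheses, model_pts, delta):
    """Eq. 5: each hypothesis counts the hypotheses within distance
    delta; the highest-scoring hypothesis is the final prediction."""
    scores = [sum(sym_distance(model_pts, h, h2) < delta for h2 in hypotheses)
              for h in hypotheses]
    return hypotheses[int(np.argmax(scores))]
```

For a model that is symmetric about an axis, two poses differing by a rotation about that axis get a near-zero distance, which is exactly why this metric lets correct but symmetry-ambiguous single-view estimates vote for each other.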
5 Experiments
In this section, we empirically evaluate our method on three large-scale datasets: YCB-Video [44] and JHUScene-50 [22] for 6DoF pose estimation, and ObjectNet3D [42] for viewpoint estimation. Further, we conduct an ablative study to validate our three innovations for the single-view pose network.
Evaluation Metric. For 6DoF pose estimation, we follow the recently proposed “ADD-S” metric [44]. The traditional metric [12] considers a pose estimate to be correct if its distance (Eq. 4) to the ground truth is below a threshold. “ADD-S” improves this threshold-based metric by computing the area under the accuracy-threshold curve over thresholds within a range. We rename “ADD-S” as “mPCK” because it is essentially the mean of the PCK accuracy [46]. For viewpoint estimation, we use the Average Viewpoint Precision (AVP) used in PASCAL3D+ [43] and the Average Orientation Similarity (AOS) used in KITTI [9].
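The mPCK (ADD-S area-under-curve) computation can be sketched directly from this definition. A minimal sketch: `mpck` is an illustrative name, and the maximum threshold and step count are placeholder assumptions, not the benchmark's exact settings.

```python
import numpy as np

def mpck(distances, max_threshold=0.10, steps=100):
    """Mean PCK accuracy: sweep a distance threshold from 0 to
    max_threshold, compute the fraction of poses whose symmetry-robust
    distance (Eq. 4) to the ground truth is below each threshold, and
    average (i.e. the area under the accuracy-threshold curve)."""
    thresholds = np.linspace(0.0, max_threshold, steps)
    pck = [(np.asarray(distances) < t).mean() for t in thresholds]
    return float(np.mean(pck))
```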
Implementation Details. We fix the number of nearest neighbors K used for soft binning, the binning scores s1 and s2, and the numbers of rotation and translation bins across all experiments; for inference on RGB data, we use different bin resolutions for the XY and Z axes. In multi-view voting, we fix the distance threshold δ and the size P of the precomputed distance table. The input to our single-view pose network is a fixed-size image, and the tiled class map is inserted at an intermediate convolutional layer. We use stochastic gradient descent with momentum to train our network from scratch. The learning rate starts at 0.01 and decreases by one-tenth at regular intervals. We construct each batch by mixing an equal number of samples from each class. We name our Multi-Class pose Network “MCN”. The multi-view framework using n views is called “MVn-MCN”. Since MCN also infers an instance mask, we use it to extract object point clouds when depth data is available and then run ICP to refine estimated poses by registering the object mesh to the extracted object clouds. We denote this ICP-based approach as “MCN+ICP”.

5.1 YCB-Video
Object  RGB  RGBD  

PCNN [44]  MCN  MV5-MCN  3D Reg. [44]  PCNN + ICP [44]  MCN  MCN + ICP  MV5-MCN  
002_master_chef_can  84.4  87.8  90.6  90.1  95.7  89.4  96.0  96.2 
003_cracker_box  80.8  64.3  72.0  77.4  94.8  85.4  88.7  90.9 
004_sugar_box  77.5  82.4  87.4  93.3  97.9  92.7  97.3  95.3 
005_tomato_soup_can  85.3  87.9  91.8  92.1  95.0  93.2  96.5  97.5 
006_mustard_bottle  90.2  92.5  94.3  91.1  98.2  96.7  97.7  97.0 
007_tuna_fish_can  81.8  84.7  89.6  86.9  96.2  95.1  97.6  95.1 
008_pudding_box  86.6  51.0  51.7  89.3  98.1  91.6  86.2  94.5 
009_gelatin_box  86.7  86.4  88.5  97.2  98.9  94.6  97.6  96.0 
010_potted_meat_can  78.8  83.1  90.3  84.0  91.6  91.7  90.8  96.7 
011_banana  80.8  79.1  85.0  77.3  96.5  93.8  97.5  94.4 
019_pitcher_base  81.0  84.8  86.1  83.8  97.4  93.8  96.6  96.2 
021_bleach_cleanser  75.7  76.0  81.0  89.2  96.3  92.9  96.4  95.4 
024_bowl  74.2  76.1  80.2  67.4  91.7  82.6  76.0  82.0 
025_mug  70.0  91.4  93.1  85.3  94.2  95.3  97.3  96.8 
035_power_drill  73.9  76.0  81.1  89.4  98.0  88.2  95.9  93.1 
036_wood_block  63.9  54.0  58.4  76.7  93.1  81.5  93.5  93.6 
037_scissors  57.8  71.6  82.7  82.8  94.6  87.3  79.2  94.2 
040_large_marker  56.2  60.1  66.3  82.8  97.8  90.2  98.0  95.4 
051_large_clamp  34.3  66.8  77.5  67.6  81.5  91.5  94.0  93.3 
052_extra_large_clamp  38.6  61.1  68.0  49.0  51.6  88.0  90.7  90.9 
061_foam_brick  82.0  60.9  67.7  82.4  96.4  93.2  96.5  95.9 
All  73.4  75.1  80.2  83.7  93.1  90.6  93.3  94.3 
The YCB-Video dataset [44] contains 92 real video sequences of 21 object instances. 80 videos along with 80,000 synthetic images are used for training, and 2,949 key frames are extracted from the remaining 12 videos for testing. We fine-tune the current state-of-the-art Mask R-CNN [11] on the training set as the detection system. Following the same scenario as [44], we assume that each object appears at most once in a scene. Therefore, we compute the bounding box of a particular object as the box with the highest detection score for that object. For our multi-view system, each view is coupled with other randomly sampled views from the same sequence. Each view outputs the top-k results from each space of SO(3), X, Y and Z, which in turn compose k^4 pose hypotheses.
Table 1 reports the mPCK accuracies of our methods and variants of PoseCNN [44] (denoted “PCNN”). All methods are trained and tested following the experimental setting defined in [44]. We first observe that the multi-view framework (MV5-MCN) consistently improves over the single-view network (MCN) across different instances and achieves the overall state-of-the-art performance. The improvement is more significant on RGB data, where the overall mPCK margin between MV5-MCN and MCN is 5.1, much larger than the 3.7 margin on RGB-D data. This is mainly because single-view ambiguity is more severe without depth data. Subsequently, MCN outperforms PoseCNN by 1.7 on RGB, and MCN+ICP is marginally better than PCNN+ICP by 0.2 on RGB-D. We can also see that MCN achieves a more balanced performance than PoseCNN across different instances. For example, PCNN+ICP only obtains 51.6 on class “052_extra_large_clamp”, which is lower than the minimum single-class accuracy (76.0) of MCN+ICP. This can mainly be attributed to our class fusion design, which learns discriminative class-specific features so that similar objects are well separated in feature space (e.g. “051_large_clamp” and “052_extra_large_clamp”). We also observe that MCN is much inferior to PoseCNN on some instances such as 061_foam_brick. This is mainly caused by larger detection errors (less than 0.5 IoU with the ground truth) on these instances.
We also run MCN on ground truth bounding boxes; the overall mPCK improves on both RGB and RGB-D relative to detected bounding boxes, with a much larger gain on RGB. This indicates that MCN is sensitive to detection error on RGB while being robust on RGB-D data. The reason is that we rely on the image scale of the bounding box to recover the 3D translation for RGB input. In addition, MCN obtains high instance segmentation accuracy (measured as the ratio of pixels with a correctly predicted mask label to all pixels) across all object instances on both RGB and RGB-D. This implies that MCN does actually learn the intermediate foreground mask as part of pose prediction. We refer readers to the supplementary material for more numerical results, including segmentation accuracies, PCK curves of MCN, and mPCK accuracies on ground-truth bounding boxes for individual instances. Last, we show some qualitative results in the upper part of Fig. 4. We can see that MCN is capable of predicting object pose under occlusion, and MV5-MCN further refines the MCN results.
5.2 JHUScene50
Object  RGB  RGBD  

Manifold [1]  MCN  MV5-MCN  ObjRec. [28]  Manifold [1]  MCN  MV5-MCN  
drill_1  10.6  33.4  36.5  14.5  70.3  76.8  78.1 
drill_2  9.9  48.8  54.5  2.9  49.0  76.6  80.1 
drill_3  7.6  45.5  48.0  3.7  50.9  81.5  85.4 
drill_4  9.3  41.6  45.5  6.5  51.4  82.0  87.1 
hammer_1  5.0  24.9  30.2  8.1  38.7  80.1  87.6 
hammer_2  5.1  28.3  33.4  10.7  35.5  81.2  91.5 
hammer_3  7.8  26.2  31.2  8.6  47.8  83.1  88.1 
hammer_4  5.1  17.2  20.6  3.8  38.3  73.8  87.8 
hammer_5  5.2  37.1  44.4  9.6  35.0  78.0  86.3 
sander  10.7  35.6  39.5  9.5  54.3  76.0  75.5 
All  7.6  33.9  38.4  7.8  47.1  78.9  84.8 
JHUScene-50 [22] contains 50 scenes with diverse background clutter and severe object occlusion. Moreover, the target object set consists of hand-tool instances with similar appearance. Only textured CAD models are available during training, and all real image frames comprise the test set. To fit our pose learning framework, we simulate a large amount of synthetic training data by rendering densely cluttered scenes similar to the test data, in which objects are randomly piled on a table. We use UnrealCV [40] as the rendering tool and generate 100k training images.
We compare MCN and MV5-MCN with the baseline method ObjRecRANSAC (https://github.com/tummvp/ObjRecRANSAC) [28] from JHUScene-50 and one recent state-of-the-art pose manifold learning technique [1] (we re-implement this method because the source code is not publicly available). All methods are trained on the same synthetic training set and tested on the real image frames of JHUScene-50. We compute the 3D translation for [1] by following the same procedure used in [12]. We evaluate all methods on the ground truth locations of all objects. Table 2 reports the mPCK accuracies. We can see that MCN significantly outperforms the comparative methods by a large margin, though it performs much worse than on YCB-Video, mainly because of the severe occlusion and diverse cluttered backgrounds in JHUScene-50. Additionally, MV5-MCN is superior to MCN on both RGB and RGB-D data. The performance gain achieved by MV5-MCN on RGB-D data is much larger than on YCB-Video, especially for the hammer category, due to its symmetrical 3D geometry. We visualize some results of MCN and MV5-MCN at the bottom of Fig. 4. The bottom-right example shows that MV5-MCN corrects the orientation of the MCN result, an error which frequently occurs for hammers.
5.3 ObjectNet3D
mAP  AOS  AVP  

Fast R-CNN [10]  ObjectNet3D [42]  MCN  ObjectNet3D [42]  MCN  
Accuracy  61.6  51.9  56.0  39.4 (64.0)  50.0 (81.2) 
To evaluate the scalability of our method, we conduct an experiment on ObjectNet3D, which provides viewpoint annotations for object instances from 100 object categories. In contrast to most existing benchmarks [44, 22, 12], which target indoor scenes and small objects, ObjectNet3D covers a wide range of outdoor environments and diverse object categories such as airplane. We modify the MCN model by using only the rotation branch for viewpoint estimation and removing the deep supervision of the object mask, because object masks are not available in ObjectNet3D. To our knowledge, only [42] reports viewpoint estimation accuracy on this dataset, where a viewpoint regression branch is added alongside bounding box regression in the Fast R-CNN architecture [10]. For a fair comparison, we use the same detection results as [42] as the input to MCN. Because ObjectNet3D only provides detection results on the validation set, we train our model on the training split and test on the validation set. Table 3 reports the viewpoint estimation accuracies on the validation set, in terms of two different metrics, AVP [43] and AOS [9]. The detection performance in mAP is the upper bound of AVP. The numbers in parentheses are the ratios of AVP to mAP. We can see that MCN is significantly superior to the large-scale model of [42] on both AOS and AVP, even though [42] actually optimizes its network hyper-parameters on the validation set. This shows that MCN scales to a large-scale pose estimation problem. Moreover, object instances have little overlap between the training and validation sets in ObjectNet3D, which indicates that MCN can generalize to unseen object instances within a category.
5.4 Ablative Study
Method  RGB  RGBD  

YCB-Video  JHU  ObjectNet3D  YCB-Video  JHU  
plain  61.0  25.0  51.7 / 38.3  61.8  19.6 
BD + Seg  66.2  26.3  50.3* / 41.3*  89.5  70.0 
BD + TC  68.5  29.3  56.0 / 50.0  90.1  76.4 
SepBranch + Seg + BD  73.8  31.6  52.5* / 42.9*  90.2  77.7 
SepNet + Seg + BD  62.1  28.7  NA  87.1  66.9 
MCN (Seg + TC + BD)  80.2  33.9  NA  90.8  78.9 
In this section, we empirically validate the three innovations introduced in MCN: the bin & delta representation (“BD”), the tiled class map (“TC”) and the deep supervision of object segmentation (“Seg”). Additionally, we inspect two baseline architectures: a separate network for each object (“SepNet”) and a separate output branch for each object (“SepBranch”), as shown in Fig. 1 (a) and Fig. 1 (b) respectively. To measure the effect of “BD”, we directly regress quaternion and translation (“plain”) as the comparison. Table 4 presents the accuracies of the different variants on all three benchmarks. We follow the previous sections in reporting mPCK for YCB-Video and JHUScene-50, and AOS/AVP for ObjectNet3D. Because ObjectNet3D does not provide segmentation ground truth, we remove the “Seg” module in all analysis related to ObjectNet3D. Also, we do not report the accuracy of “SepNet” on ObjectNet3D because it requires 100 GPUs for training. We have three main observations: 1. removing any of the three innovations consistently decreases pose estimation performance; notably, “BD” is a more critical design than “Seg” and “TC” because its removal causes a larger performance drop; 2. “SepBranch” coupled with “BD” and “Seg” is the second-best architecture, but it is still inferior to MCN, especially on YCB-Video and ObjectNet3D, and the model size of “SepBranch” grows rapidly with the number of classes; 3. “SepNet” is expensive to train and performs substantially worse than MCN, because MCN exploits diverse data from different classes to reduce overfitting.
6 Conclusion
We present a unified architecture for inferring 6DoF object pose from single and multiple views. We first introduce a single-view pose estimation network with three innovations: a new bin & delta pose representation, the fusion of a tiled class map into the convolutional layers, and deep supervision of the object mask at an intermediate layer. These modules enable a scalable pose learning architecture for large numbers of object classes and unconstrained background clutter. Subsequently, we formulate a new multi-view framework for selecting single-view pose hypotheses while accounting for ambiguity caused by object symmetry. In the future, an intriguing direction is to embed the multi-view procedure into the training process to jointly optimize both single-view and multi-view performance. The multi-view algorithm could also be improved to maintain a fixed number of “good” hypotheses for any incremental update given a new frame.
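For reference, a symmetry-robust comparison of pose hypotheses can be sketched as a rotation distance minimized over an object's known symmetry group. The helper names and the 180° z-axis symmetry example below are our own illustrations, not the paper's exact metric.

```python
import math

def matmul(A, B):
    # 3x3 matrix product on plain nested lists.
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def geodesic_angle(R1, R2):
    # Angle of the relative rotation R1^T R2, in radians;
    # uses trace(R1^T R2) = sum of elementwise products.
    trace = sum(R1[i][j] * R2[i][j] for i in range(3) for j in range(3))
    cos = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos)

def sym_aware_distance(R1, R2, sym_group):
    # Two estimates that differ only by a symmetry rotation S are
    # treated as identical: take the minimum angle over the group.
    return min(geodesic_angle(matmul(R1, S), R2) for S in sym_group)

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Example: an object with a 180-degree symmetry about its z-axis.
group = [I3, rot_z(math.pi)]
d = sym_aware_distance(I3, rot_z(math.pi), group)  # ~0: a symmetric match
```

Under the plain geodesic distance these two estimates would be maximally far apart (180°), which is exactly the single-view ambiguity that complicates naive multi-view fusion.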
Acknowledgments. This work is supported by the IARPA DIVA program and the National Science Foundation under grants IIS-127228 and IIS-1637949.
References
 [1] Balntas, V., Doumanoglou, A., Sahin, C., Sock, J., Kouskouridas, R., Kim, T.K.: Pose guided rgb-d feature learning for 3d object pose estimation. In: CVPR (2017)
 [2] Brachmann, E., Krull, A., Michel, F., Gumhold, S., Shotton, J., Rother, C.: Learning 6d object pose estimation using 3d object coordinates. In: ECCV. Springer (2014)
 [3] Brachmann, E., Krull, A., Michel, F., Gumhold, S., Shotton, J., Rother, C.: Learning 6d object pose estimation using 3d object coordinates. In: ECCV. Springer (2014)
 [4] Brachmann, E., Michel, F., Krull, A., Ying Yang, M., Gumhold, S., et al.: Uncertainty-driven 6d pose estimation of objects and scenes from a single rgb image. In: CVPR (2016)
 [5] Chirikjian, G.S., Mahony, R., Ruan, S., Trumpf, J.: Pose changes from a different point of view. Journal of Mechanisms and Robotics (2018)
 [6] Doumanoglou, A., Kouskouridas, R., Malassiotis, S., Kim, T.K.: Recovering 6d object pose and predicting next-best-view in the crowd. In: CVPR (2016)
 [7] Erkent, Ö., Shukla, D., Piater, J.: Integration of probabilistic pose estimates from multiple views. In: ECCV. Springer (2016)
 [8] Tombari, F., Salti, S., Di Stefano, L.: A combined texture-shape descriptor for enhanced 3d feature matching. In: ICIP (2011)
 [9] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In: CVPR (2012)
 [10] Girshick, R.: Fast R-CNN. arXiv preprint arXiv:1504.08083 (2015)
 [11] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV. IEEE (2017)
 [12] Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., Navab, N.: Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In: Computer Vision–ACCV 2012. Springer (2013)
 [13] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. JMLR (2015)
 [14] Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., et al.: Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera. In: ACM Symposium on User Interface Software and Technology. ACM (2011)
 [15] Johns, E., Leutenegger, S., Davison, A.J.: Pairwise decomposition of image sequences for active multi-view recognition. In: CVPR. IEEE (2016)
 [16] Kehl, W., Manhardt, F., Tombari, F., Ilic, S., Navab, N.: SSD-6D: Making rgb-based 3d detection and 6d pose estimation great again. In: CVPR (2017)
 [17] Kehl, W., Milletari, F., Tombari, F., Ilic, S., Navab, N.: Deep learning of local rgb-d patches for 3d object detection and 6d pose estimation. In: ECCV. pp. 205–220. Springer (2016)
 [18] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
 [19] Krull, A., Brachmann, E., Michel, F., Ying Yang, M., Gumhold, S., Rother, C.: Learning analysis-by-synthesis for 6d pose estimation in rgb-d images. In: ICCV (2015)
 [20] Lai, K., Bo, L., Ren, X., Fox, D.: Detection-based object labeling in 3d scenes. In: ICRA. IEEE (2012)
 [21] Levine, S., Pastor, P., Krizhevsky, A., Quillen, D.: Learning hand-eye coordination for robotic grasping with large-scale data collection. In: International Symposium on Experimental Robotics. pp. 173–184. Springer (2016)
 [22] Li, C., Bohren, J., Carlson, E., Hager, G.D.: Hierarchical semantic parsing for object pose estimation in densely cluttered scenes. In: ICRA (2016)
 [23] Li, C., Xiao, H., Tateno, K., Tombari, F., Navab, N., Hager, G.D.: Incremental scene understanding on dense slam. In: IROS. IEEE (2016)
 [24] Li, C., Zia, M.Z., Tran, Q.H., Yu, X., Hager, G.D., Chandraker, M.: Deep supervision with shape concepts for occlusion-aware 3d object parsing. CVPR (2017)
 [25] Massa, F., Marlet, R., Aubry, M.: Crafting a multi-task cnn for viewpoint estimation. BMVC (2016)
 [26] Michel, F., Kirillov, A., Brachmann, E., Krull, A., Gumhold, S., Savchynskyy, B., Rother, C.: Global hypothesis generation for 6d object pose estimation. ICCV (2017)
 [27] Mousavian, A., Anguelov, D., Flynn, J., Košecká, J.: 3d bounding box estimation using deep learning and geometry. In: CVPR. IEEE (2017)
 [28] Papazov, C., Burschka, D.: An efficient ransac for 3d object recognition in noisy and occluded scenes. In: Computer Vision–ACCV 2010 (2011)
 [29] Pillai, S., Leonard, J.: Monocular slam supported object recognition. In: RSS (2015)
 [30] Rad, M., Lepetit, V.: Bb8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In: ICCV (2017)
 [31] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NIPS (2015)
 [32] Rusu, R.B.: Semantic 3d object maps for everyday manipulation in human living environments. KI-Künstliche Intelligenz (2010)
 [33] Salas-Moreno, R., Newcombe, R., Strasdat, H., Kelly, P., Davison, A.: Slam++: Simultaneous localisation and mapping at the level of objects. In: CVPR (2013)
 [34] Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3d shape recognition. In: CVPR. pp. 945–953 (2015)
 [35] Su, H., Qi, C.R., Li, Y., Guibas, L.J.: Render for CNN: Viewpoint estimation in images using CNNs trained with Rendered 3D model views. In: ICCV (2015)
 [36] Tateno, K., Tombari, F., Laina, I., Navab, N.: CNN-SLAM: Real-time dense monocular slam with learned depth prediction. CVPR (2017)
 [37] Tejani, A., Tang, D., Kouskouridas, R., Kim, T.K.: Latent-class hough forests for 3d object detection and pose estimation. In: ECCV. Springer (2014)
 [38] Tekin, B., Sinha, S.N., Fua, P.: Real-time seamless single shot 6d object pose prediction. arXiv preprint arXiv:1711.08848 (2017)
 [39] Tjaden, H., Schwanecke, U., Schömer, E.: Real-time monocular pose estimation of 3d objects using temporally consistent local color histograms. In: CVPR (2017)
 [40] Qiu, W., Zhong, F., Zhang, Y., Qiao, S., Xiao, Z., Kim, T.S., Wang, Y., Yuille, A.: UnrealCV: Virtual worlds for computer vision. ACM Multimedia Open Source Software Competition (2017)
 [41] Wohlhart, P., Lepetit, V.: Learning descriptors for object recognition and 3d pose estimation. In: CVPR (2015)
 [42] Xiang, Y., Kim, W., Chen, W., Ji, J., Choy, C., Su, H., Mottaghi, R., Guibas, L., Savarese, S.: Objectnet3d: A large scale database for 3d object recognition. In: ECCV (2016)
 [43] Xiang, Y., Mottaghi, R., Savarese, S.: Beyond PASCAL: A Benchmark for 3D Object Detection in the Wild. In: WACV (2014)
 [44] Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199 (2017)
 [45] Yan, Y., Chirikjian, G.S.: Almost-uniform sampling of rotations for conformational searches in robotics and structural biology. In: ICRA (2012)
 [46] Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: CVPR (2011)
 [47] Zeng, A., Yu, K.T., Song, S., Suo, D., Walker, E., Rodriguez, A., Xiao, J.: Multi-view self-supervised deep learning for 6d pose estimation in the amazon picking challenge. In: ICRA. IEEE (2017)