Code for our work on pose-estimation using template 3D models.
Understanding the geometry and pose of objects in 2D images is a fundamental necessity for a wide range of real-world applications. Driven by deep neural networks, recent methods have brought significant improvements to object pose estimation. However, they suffer from the scarcity of keypoint/pose-annotated real images and hence cannot exploit the object's 3D structural information effectively. In this work, we propose a data-efficient method which utilizes the geometric regularity of intraclass objects for pose estimation. First, we learn pose-invariant local descriptors of object parts from simple 2D RGB images. These descriptors, along with keypoints obtained from renders of a fixed 3D template model, are then used to generate keypoint correspondence maps for a given monocular real image. Finally, a pose estimation network predicts the 3D pose of the object using these correspondence maps. This pipeline is further extended to a multi-view approach, which assimilates keypoint information from correspondence sets generated from multiple views of the 3D template model. Fusion of multi-view information significantly improves the geometric comprehension of the system, which in turn enhances the pose estimation performance. Furthermore, the correspondence framework responsible for learning the pose-invariant keypoint descriptors also allows us to effectively alleviate the data-scarcity problem. This enables our method to achieve state-of-the-art performance on multiple real-image viewpoint estimation datasets, such as Pascal3D+ and ObjectNet3D. To encourage reproducible research, we have released the code for our proposed approach.
Estimating the 3D pose of an object from a given RGB image is an important and challenging task in computer vision. Pose estimation can enable AI systems to gain 3D understanding of the world from simple monocular projections. While ample variation is observed in the design of objects of a certain type, say chairs, the intrinsic structure or skeleton is mostly similar. Moreover, in the case of 3D objects, it is often possible to unite information from multiple 2D views, which in turn can enhance the 3D perception of humans as well as artificial vision systems. In this work, we show how the intraclass structural similarity of objects, along with multi-view 3D interpretation, can be utilized to solve the task of fine-grained 3D pose estimation.
By viewing instances of an object class from multiple viewpoints over time, humans gain the ability to recognize sub-parts of the object, independent of pose and intra-class variations. Such viewpoint- and appearance-invariant comprehension enables the human brain to match semantic sub-parts between different instances of the same object category, even from simple 2D perspective projections (RGB images). Inspired by human cognition, an artificial model with a similar matching mechanism can be designed to improve final pose estimation results. In this work, we consider a single template model with known keypoint annotations as a 3D structural reference for the object category of interest. Subsequently, keypoint correspondence maps are obtained by matching keypoint descriptors of synthetic RGB projections from multiple viewpoints against the spatial descriptors from a real RGB image. Such keypoint correspondence maps provide geometric and structural cues useful for pose estimation.
The proposed pose estimation system consists of two major parts: (1) a fully convolutional network which learns pose-invariant local descriptors to obtain keypoint correspondence, and (2) a pose estimation network which fuses information from multiple correspondence maps to output the final pose estimation result. For each object class, we annotate a single template 3D model with sparse 3D keypoints. Given an image in which the object's pose is to be estimated, it is first paired with multiple images of the template 3D model rendered from different viewpoints (see Figure 1). Projections of the annotated 3D keypoints are tracked on the rendered synthetic images to provide ground truth for learning efficient keypoint descriptors. Subsequently, keypoint correspondence maps are generated for each image pair by correlating each individual keypoint descriptor (from the rendered image) with the spatial descriptors obtained from the given image.
Recent works [1, 2, 3] show that deep neural networks can effectively merge information from multiple 2D views to deliver enhanced view estimation performance. These approaches require multi-view projections of the given input image to exploit the multi-view information. In the proposed approach, by contrast, we exploit multi-view cues from a single-view real RGB image by comparing it against multi-view synthetic renders. This is achieved by feeding the multi-view keypoint correspondence maps through a carefully designed fusion network (a convolutional neural network) to obtain the final pose estimation results. Moreover, by fusing information from multiple viewpoints, we show significant improvement in pose estimation, making our approach state-of-the-art on competitive real-image datasets such as Pascal3D+ and ObjectNet3D. In Figure 1, a diagrammatic overview of our approach is presented.
Many recent works [6, 4, 7] have utilized deep neural networks for 3D object understanding and pose estimation. However, these approaches have several drawbacks. Works such as [8, 9] achieve improved pose estimation performance by utilizing a vast amount of synthetic data. This can be a severe bottleneck when an extensive repository of diverse 3D models for a specific category is unavailable (as in the case of novel object classes, such as mechanical parts, abstract 3D models, etc.). Additionally, 3D-INN requires a complex keypoint-refinement module that, while remarkable at keypoint estimation, shows sub-optimal performance for viewpoint estimation when compared against current state-of-the-art models. We posit that it is essential to explore and exploit strong 3D structural object priors to alleviate various general issues, such as the data bottleneck and partial occlusion, which are observed in object viewpoint estimation. Moreover, our approach has two crucial advantages. Firstly, our keypoint correspondence map captures the relation between a keypoint and the entire 2D spatial view of the object in a given image. That is, the correspondence map not only captures information regarding the spatial location of the keypoint in the given image, but also captures various relations between the keypoint and other semantic parts of the object. In Figure 2, we show the obtained correspondence maps for varied keypoints, and provide evidence for this line of reasoning. Secondly, our network fuses the correspondence map of each keypoint across multiple views. This multi-view comprehension of individual keypoints enables our network to form a more nuanced interpretation of the 3D structure of the object class, which in turn improves pose estimation performance.
To summarize, our main contributions in this work include: (1) a method for learning pose-invariant local descriptors for various object classes, (2) a keypoint correspondence map formulation which captures various explicit and implicit relations between the keypoint, and a given image, (3) a pose estimation network which assimilates information from multiple viewpoints, and (4) state-of-the-art performance on real-image object pose estimation datasets for indoor object classes such as ‘Chair’, ‘Sofa’, ‘Table’ and ‘Bed’.
Local descriptors and keypoint correspondence: A multitude of works propose formulations for local descriptors of 3D objects, as well as 2D images. Early methods employed hand-engineered local descriptors like SIFT or HOG [10, 11, 12, 13] to represent semantic part structures useful for object comprehension. With the advent of deep learning, works such as [14, 15, 16, 17] have proposed effective learning methods to obtain local descriptor correspondence in 2D images. Recently, Huang et al. proposed to learn local descriptors for 3D objects following a deep multi-view fusion approach. While this work is one of our inspirations, our method differs in many crucial aspects. We do not require the extensive multi-view fusion of local descriptors performed by Huang et al. for individual local points. Moreover, we do not rely on a large repository of 3D models with surface segmentation information for generalization. For effective local descriptor correspondence, the Universal Correspondence Network (UCN) formulates an optimization strategy for learning robust spatial correspondence, used in coherence with an active hard-mining strategy and a convolutional spatial transformer (STN). While UCN learns geometric and spatial correspondence for tasks such as semantic part matching, we focus on the learning procedure of that approach and adapt it for learning our pose-invariant local descriptors.
Multi-view information assimilation: Borotschnig et al. and Paletta et al. were among the earliest works to show the utility of multi-view information for improving performance on tasks related to 3D object comprehension. In recent years, multiple innovative network architectures, such as [2, 3], have been proposed for the same purpose. One of the earliest works to combine deep learning with multi-view information assimilation showed that 2D image-based approaches are effective for general object recognition tasks, even for 3D models. They proposed an approach for 3D object recognition based on multiple 2D projections of the object, surpassing previous works based on other 3D object representations such as voxel and mesh formats. Qi et al. give a comprehensive study of voxel-based CNNs and multi-view CNNs for 3D object classification.
Apart from object classification, the multi-view approach has proven useful for a wide variety of other tasks, such as learning local features for 3D models and 3D object shape prediction. In this work, we use multi-view information assimilation for object pose estimation in a given monocular RGB image using multiple views of a 3D template model. To the best of our knowledge, such a multi-view approach does not exist in the literature.
Object viewpoint estimation: Many recent works [24, 25] use deep convolutional networks for object viewpoint estimation. While some works attempt pose estimation along with keypoint estimation, an end-to-end approach solely for 3D pose estimation was first proposed by RenderForCNN. Su et al. proposed to utilize a vast amount of synthetic data rendered from 3D CAD models, with dataset-specific cues for occlusion and clutter, to combat the lack of pose-annotated real data. In contrast, the 3D Interpreter Network (3D-INN) proposes an interesting approach where 3D keypoints and views are approximated by minimizing a novel re-projection loss on the estimated 2D keypoints. However, the requirement of a vast amount of synthetic data is a significant bottleneck for both works. In comparison, our method relies on the presence of a single synthetic template model per object category, making it significantly more data-efficient and far more scalable. This is an important prerequisite for applying the proposed approach to novel object classes, where multiple 3D models may not exist. Recently, Grabner et al. estimated object pose by predicting the vertices of a 3D bounding box and solving a perspective-n-point problem. While achieving state-of-the-art performance in multiple object categories, they could not surpass prior performance on the challenging indoor object classes such as 'chair', 'sofa', and 'table'. It is essential to provide stronger 3D structural priors to learn pose estimation under data-scarcity scenarios for such complex categories. This structural prior is effectively modeled in our case by keypoint correspondence and multi-view information assimilation.
This section consists of three main parts: in Section 3.1, we present our approach for learning pose-invariant local descriptors, Section 3.2 explains how the keypoint correspondence maps are generated, and Section 3.3 explains our regression network, along with various related design choices. Finally, we briefly describe our data generation pipeline in Section 3.4.
To effectively compare given image descriptors with the keypoint descriptors from multi-view synthetic images, our method must identify various sub-parts of the given object, invariant to pose and intra-class variation. To achieve this, we train a convolutional neural network (CNN), which takes an RGB image as input and gives a spatial map of local descriptors as output. That is, given an image of size H × W, our network predicts a spatial local descriptor map of size H × W × D, where the D-dimensional vector at each spatial location is treated as the corresponding local descriptor.
Following the approach of other established methods [18, 17], we use the CNN to form the two branches of a Siamese architecture with shared convolutional parameters. Now, given a pair of images I and I′ with annotated keypoints, we pass them through the Siamese network to get the spatial local descriptor maps F and F′ respectively. The annotated keypoints are then used to generate positive and negative correspondence pairs, where a positive correspondence pair refers to a pair of points that represent the same semantic keypoint. The correspondence contrastive loss, introduced for UCN, is used to reduce the distance between the local descriptors of positive correspondence pairs and to increase the distance for the negative pairs. Let x and x′ represent spatial locations on I and I′ respectively. The correspondence contrastive loss can be defined as

L = (1 / 2N) Σᵢ [ sᵢ ‖F(xᵢ) − F′(x′ᵢ)‖² + (1 − sᵢ) max(0, m − ‖F(xᵢ) − F′(x′ᵢ)‖)² ],

where N is the total number of pairs, sᵢ = 1 for positive correspondence pairs, sᵢ = 0 for negative correspondence pairs, and m is the margin.
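The loss above can be sketched in NumPy as follows. This is an illustrative re-implementation, not the authors' released code; the function name and the margin value are our own choices.

```python
import numpy as np

def correspondence_contrastive_loss(desc_a, desc_b, labels, margin=1.0):
    """Correspondence contrastive loss (after the UCN formulation).

    desc_a, desc_b : (N, D) arrays of local descriptors sampled at the
                     paired spatial locations in the two images.
    labels         : (N,) array; 1 for positive pairs, 0 for negative.
    margin         : hinge margin m for negative pairs (assumed value).
    """
    d = np.linalg.norm(desc_a - desc_b, axis=1)            # pairwise distances
    pos = labels * d ** 2                                  # pull positives together
    neg = (1 - labels) * np.maximum(0.0, margin - d) ** 2  # push negatives apart
    return float(np.mean(pos + neg) / 2.0)
```

For identical descriptors, a positive pair contributes zero loss, while a negative pair is penalized by the full margin, matching the pull/push behavior described above.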
The chief benefit of using a correspondence network is its utility in combating data scarcity. Given N samples with keypoint annotations, we can generate on the order of N² image pairs for training the local descriptor representations. The learned local descriptors do most of the heavy lifting by providing useful structural cues for 3D pose estimation. This helps us avoid extensive usage of synthetic data and the common pitfalls associated with it, such as domain shift when testing on real samples. Compared to state-of-the-art works [8, 9], where millions of synthetic data samples were used for effective training, we use only renders of a single template 3D model per class, a small fraction of the data used by [8, 9]. Another computational advantage we observe is run-time efficiency. Given a single image, we estimate the local descriptors for all the visible points on the object. This is in stark contrast to Huang et al., where multiple images were used to generate local descriptors for each point of the object.
In most prior work, objects are represented by a sparse set of keypoints. Learning feature descriptors for only a few sparse semantic keypoints has many disadvantages. In such cases, the model fails to learn efficient descriptors for spatial regions away from the defined semantic keypoint locations. However, information regarding parts away from these keypoints can also be useful for pose estimation. Hence, we propose to learn proxy-dense local descriptors to obtain more effective correspondence maps (see Figures 3b and 3c). This also allows us to train the network more efficiently by generating a sufficient number of positive and negative correspondence pairs. To achieve this objective, we generate dense keypoints for all images, the details of which are presented in Section 3.3.
Correspondence Network Architecture: The Siamese network contains two branches with shared weights. It is trained on the generated keypoint annotations (details in Section 3.3) using the loss in Equation 1 described above. For the Siamese network, we employ a standard GoogLeNet architecture with ImageNet-pretrained weights. Further, to obtain spatially aligned local features, we use a convolutional spatial transformation layer after an intermediate layer of the GoogLeNet architecture, as proposed in UCN. The convolutional spatial transformation layer is found to be very useful for semantic part correspondence in the presence of reasonably high pose and intra-class variations.
The CNN introduced in the previous section provides a spatial local descriptor map F′ for a rendered synthetic image I′. Now, using the keypoint annotations rendered from the 3D template model, we want to generate a spatial map which captures the location of the corresponding keypoint in a given real image I. To achieve this, we propose to utilize pairwise descriptor correlation between the two images. Let F′ be of size H × W × D, and let k represent a keypoint in I′. Our goal is to estimate a correspondence map for keypoint k over the real image I. By correlating the local descriptor at k with the spatial local descriptor map F of image I at every location, a correspondence map is obtained for each keypoint. Using the max-out Hadamard product, we compute the pairwise descriptor correlation for any location x in I and x′ in I′ as

c(x, x′) = max(0, F(x) · F′(x′)).
As the learned local descriptors are unit normalized, the max-out Hadamard product retains only positive correlations between the local descriptor at k and the local descriptors at all locations in image I. By applying softmax over the entire map of rectified Hadamard products, multiple high correlation values are suppressed, making the highest correlation value more prominent in the final correspondence map. Such a normalization step is in line with the traditionally used second nearest neighbor test proposed by Lowe et al. Using the above formulation, keypoint correspondence maps are generated for a set of sparse, structurally important keypoints in image I′; for each object category, we use the structurally important keypoint set defined by Wu et al. Finally, the stacked correspondence map computed for all structural keypoints of I′ with respect to image I is represented by C, which is of size K × H × W, where K is the number of keypoints.
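The rectified-correlation-plus-softmax construction for a single keypoint can be sketched as follows. This is an illustrative NumPy version with assumed names and shapes, not the released implementation.

```python
import numpy as np

def keypoint_correspondence_map(kp_desc, desc_map):
    """Correspondence map for one rendered keypoint against a real image.

    kp_desc  : (D,) unit-normalized descriptor of a keypoint in the render.
    desc_map : (H, W, D) unit-normalized spatial descriptor map of the
               real image.
    Returns an (H, W) map: rectified (max-out) correlations, softmax-
    normalized over all locations so the strongest match is sharpened.
    """
    corr = desc_map @ kp_desc            # (H, W) cosine correlations
    corr = np.maximum(corr, 0.0)         # max-out: keep only positive correlation
    e = np.exp(corr - corr.max())        # numerically stable softmax
    return e / e.sum()
```

Stacking such maps for all K structural keypoints yields the K × H × W correspondence set described above.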
As explained earlier, our keypoint correspondence map computes the relation between a keypoint in I′ and all the points in I. In comparison to approaches where a location heatmap is predicted for each keypoint, our keypoint correspondence map captures the interplay between different keypoints as well. This in turn acts as an important cue for final pose estimation. Figure 1 shows keypoint correspondence maps generated by our approach, which clearly support these claims.
With the structural cues for the object in image I provided by the keypoint correspondence set C, we can estimate the pose of the object more effectively. In our setup, I′ is a synthetically rendered image of the template 3D model with tracked 2D keypoint annotations, and I is the image of interest in which the pose is to be estimated. It is important to note that C contains information regarding the relation between the keypoints in I′ and the image I. However, as I′ is a 2D projection of the 3D template object, it is possible that some keypoints are self-occluded or only partially visible. For such keypoints, C would contain noisy and unclear correspondences. As mentioned earlier, the selected keypoints are structurally important, and hence the lack of information for any of them can hamper the final pose estimation performance.
To alleviate this issue, we propose a multi-view pose estimation approach. We first render the template 3D model from V fixed viewpoints. Then, the keypoint correspondence set is generated for each view by pairing I with each rendered view. Finally, information from multiple views is combined by concatenating all the correspondence sets to form a fused multi-view correspondence set, represented by M. Here, M is of size (V·K) × H × W, where V is the number of views and K is the number of structurally important keypoints. Subsequently, M is supplied as input to our pose estimation network, which effectively combines information from multiple views of the template object to infer the required structural cues. Each of the V fixed viewpoints is specified as a tuple of azimuth, elevation, and tilt angles in degrees.
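The fusion of per-view correspondence sets is plain channel-wise concatenation, which can be sketched as below. The viewpoint tuples listed are placeholders for illustration, not the fixed set used in the paper.

```python
import numpy as np

# Hypothetical viewpoints (azimuth, elevation, tilt in degrees); the actual
# fixed set used in the paper is not reproduced here.
VIEWPOINTS = [(0, 10, 0), (60, 10, 0), (120, 10, 0),
              (180, 10, 0), (240, 10, 0)]

def multi_view_correspondence_set(maps_per_view):
    """Fuse per-view correspondence sets along the keypoint/channel axis.

    maps_per_view : list of V arrays, each of shape (K, H, W) -- the K
                    keypoint correspondence maps for one rendered view.
    Returns the fused multi-view correspondence set of shape (V*K, H, W).
    """
    return np.concatenate(maps_per_view, axis=0)
```

The fused tensor is what the pose estimation network consumes as input.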
In Figure 2b, the architecture of our pose estimation network is outlined. Empirically, we found the Inception layer to be the most efficient in terms of performance per memory footprint. We believe the multiple receptive fields in the Inception layer help the network learn structural relations at varied scales, which improves pose estimation performance. For effective modeling, we consider a deeper architecture with a reduced number of filters per convolutional layer. The pose estimation network classifies the three Euler angles, namely azimuth, elevation, and tilt. Following prior work, we use the geometric structure aware classification loss for effective estimation of all three angles.
As a result of proxy-dense correspondence, the pose-invariant local descriptor map F carries information about dense keypoints, whereas M leverages information only from the sparse set of structurally important keypoints. Therefore, we also explore whether F can be utilized to improve the final pose estimation performance. To achieve this, we concatenate a convolution-processed feature map of F with the Inception-processed features of M to form the input to our pose estimation network. This brings us to our final state-of-the-art architecture. Various experiments are performed in Section 4.1 which outline the benefits of each of these design choices.
Learning an efficient pose-invariant keypoint descriptor requires a sufficient number of ground-truth positive correspondence pairs. For each real image, we generate an ordered set of dense keypoints by forming a skeletal frame of the object from the available sparse keypoint annotations provided in the Keypoint-5 dataset. To obtain dense positive keypoint pairs, we sample additional points along the structural skeleton lines obtained from the semantic sparse keypoints, for both real and synthetic images. Various simple keypoint pruning methods based on seat presence, self-occlusion, etc. are used to remove noisy keypoints (more details in the supplementary). Figure 3c shows some real images where dense keypoint annotations are generated from the available sparse keypoint annotations as described above.
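The skeleton-based densification can be sketched as follows, assuming keypoints are 2D image coordinates and the skeleton is given as index pairs; `samples_per_edge` is an assumed parameter, and the pruning heuristics mentioned above are omitted.

```python
import numpy as np

def densify_keypoints(sparse_kps, skeleton_edges, samples_per_edge=5):
    """Sample extra keypoints along skeleton line segments.

    sparse_kps       : (K, 2) array of annotated 2D keypoints.
    skeleton_edges   : list of (i, j) index pairs defining skeleton lines.
    samples_per_edge : interior points added per edge (assumed value).
    Returns the sparse keypoints stacked with the interpolated dense ones.
    """
    dense = [sparse_kps]
    for i, j in skeleton_edges:
        # interior interpolation parameters; endpoints are excluded since
        # they are already present as sparse keypoints
        t = np.linspace(0.0, 1.0, samples_per_edge + 2)[1:-1, None]
        dense.append((1 - t) * sparse_kps[i] + t * sparse_kps[j])
    return np.vstack(dense)
```

The same routine applies to the projected 2D keypoints of the synthetic renders, yielding matched dense correspondence pairs.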
For our synthetic data, a single template 3D model (per category) is manually annotated with a sparse set of 3D keypoints. These models are shown in Figure 3a. Using a modified version of an existing rendering pipeline, we render the template 3D model and project its sparse 2D keypoints from multiple views to generate the synthetic data required for the pipeline. The same skeletal point sampling mechanism mentioned earlier is used to form dense keypoint annotations for each synthetic image, as shown in Figure 3b (more details in the supplementary).
In this section, we evaluate the proposed approach against other state-of-the-art models for multiple tasks related to viewpoint estimation. Additionally, multiple architectural choices are validated by performing various ablations on the proposed multi-view assimilation method.
Datasets: We empirically demonstrate state-of-the-art or competitive performance compared to several other methods on two public datasets. Pascal3D+: This dataset contains images from the Pascal and ImageNet sets labeled with both detection and continuous pose annotations for 12 rigid object categories. ObjectNet3D: This dataset consists of 100 diverse categories, with 90,127 images containing 201,888 objects. Due to the requirement of keypoints, keypoint-based methods can be evaluated only on object categories with available keypoint annotations. Hence, we evaluate our method on 4 categories from these datasets, namely Chair, Bed, Sofa, and Dining-table (3 on Pascal3D+, as it does not contain the Bed category). We evaluate our performance on the tasks of object viewpoint estimation, and joint detection and viewpoint estimation.
Metrics: Performance in object viewpoint estimation is measured using Median Error (MedErr) and Accuracy at π/6 (Acc at π/6), which were introduced by Tulsiani et al. MedErr measures the median geodesic distance between the predicted pose and the ground-truth pose (in degrees), and Acc at π/6 measures the percentage of images where the geodesic distance between the predicted pose and the ground-truth pose is less than π/6 (in radians). While previous works evaluate with the π/6 threshold only, we evaluate with smaller thresholds as well to highlight our model's ability to deliver more accurate pose estimates. Finally, to evaluate performance on joint detection and viewpoint estimation, we use the Average Viewpoint Precision at 'n' views (AVP-n) metric introduced with the Pascal3D+ benchmark.
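For rotation-matrix representations of pose, the geodesic distance underlying both metrics can be computed with the standard rotation-angle formula. The sketch below is for clarity only; the paper's exact evaluation code may differ.

```python
import numpy as np

def geodesic_distance(R_pred, R_gt):
    """Geodesic distance (radians) between two 3x3 rotation matrices.

    This equals the angle of the relative rotation R_pred^T R_gt,
    recovered from its trace: cos(theta) = (tr(R) - 1) / 2.
    """
    R = R_pred.T @ R_gt
    cos = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)  # clip for safety
    return float(np.arccos(cos))

def acc_at_threshold(distances, threshold=np.pi / 6):
    """Fraction of predictions with geodesic error below `threshold`."""
    return float(np.mean(np.asarray(distances) < threshold))
```

Median Error is then simply the median of these geodesic distances (converted to degrees), and Acc at π/6 uses the default threshold shown.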
Training details: We train using the ADAM optimizer. For each object class, we assign a single 3D model from the ShapeNet repository as the object template. The local feature descriptor network is trained using 8,000 renders of the template 3D model (per class), along with real training images from Keypoint-5 and Pascal3D+. Dense correspondence annotations are generated for this segment of the training (refer to Section 3.4). Finally, the pose estimation network is trained using the Pascal3D+ or ObjectNet3D datasets. This training regime provides our normal model. Additionally, to compare against RenderForCNN in the presence of synthetic data, we construct a separate training regime in which the synthetic data provided by RenderForCNN is also utilized for training the pose estimation network; the model trained in this regime is labeled separately.
In this section, we focus on evaluating the utility of the various components of our method for object viewpoint estimation. Our ablative analysis focuses on the Chair category, which, having high intra-class variation, is considered one of the most challenging classes and provides a minimally biased dataset for evaluating ablations of our architecture. For all the ablations, the network is trained on the train subsets of the ObjectNet3D and Pascal3D+ datasets. We report our ablation statistics on the easy-test subset of Pascal3D+ for the Chair category.
First, we show the utility of multi-view information assimilation by performing ablations on the number of views V. In Figure 5, we evaluate the accuracy of our method with V varying from 1 to 7. Note that we do not utilize the local descriptors in this setup, and the pose estimator uses only the multi-view keypoint correspondence maps as input. As the figure shows, the additional information from multiple views is crucial. To keep the system computationally efficient yet effective, we fix the number of views for all the following experiments. Next, it is essential to ascertain the utility of the local descriptors in improving our performance. In Table 5, we can clearly observe an increase in performance when the local descriptor map is used along with the multi-view correspondence set. Hence, in our final pipeline, the pose estimator network is designed to include the local descriptor map as an additional input.
In this section, we evaluate our method against other state-of-the-art approaches for the task of viewpoint estimation. Similar to other keypoint-based pose estimation works, such as 3D-INN , we conduct our experiments on all object classes where 2D-keypoint information is available.
Pascal3D+: Table 1 compares our approach to other state-of-the-art methods, namely Grabner et al. and RenderForCNN. As the table shows, our best performing method clearly outperforms other established approaches on the pose estimation task.
ObjectNet3D: As none of the existing works have reported results on the ObjectNet3D dataset, we trained RenderForCNN using the synthetic data and code provided by its authors, Su et al., for ObjectNet3D. Table 2 compares our method against RenderForCNN on various metrics for viewpoint estimation. RenderForCNN, despite being trained with 500,000 additional samples of synthetic images, still performs worse than the proposed method.
(Table 2: object viewpoint estimation, and joint object detection and pose estimation, on ObjectNet3D, comparing Su et al. (RenderForCNN) against our method.)
For this task, our pipeline is used with object detections from R-CNN using MCG object proposals to estimate the viewpoint of the object in each detected bounding box, as also done by V&K. Note that the performance of all models in this task is affected by the performance of the underlying object detection module, which varies significantly among classes.
Pascal3D+: In Table 3, we compare our approach against other state-of-the-art keypoint-based methods, namely 3D-INN and V&K. The metric comparison shows the superiority of our method, which in turn highlights its ability to predict pose even with noisy object localization.
Table 2 clearly demonstrates the sub-optimal performance of RenderForCNN on ObjectNet3D. This is because the synthetic data provided by the authors, Su et al., is overfit to the distribution of the Pascal3D+ dataset. This leads to a lack of generalizability in RenderForCNN, where a mismatch between the synthetic and real data distributions can significantly lower its performance. Thus, Table 2 not only presents our superior performance, but also highlights the poor generalizability of RenderForCNN.
Here, we present analysis of results on additional experiments to highlight the chief benefits of the proposed approach.
Effective Data Utilization: To highlight the effective utilization of data in our method, we compare against other methods trained without any synthetic data. For this experiment, we trained RenderForCNN without synthetic data and compare it against our model in Table 4. The table not only provides evidence for the high data dependency of RenderForCNN, it also highlights our superior performance against Grabner et al. even in this limited-data scenario.
Higher precision of our approach: Table 5 compares our model to RenderForCNN on stricter accuracy thresholds. Further, we show plots of accuracy versus threshold in Figure 7 for multiple classes in both the Pascal3D+ and ObjectNet3D datasets. Compared to the previous state-of-the-art model, we substantially improve performance under harsher bounds, indicating that our model is more precise at estimating the pose of objects in both the 'Chair' and 'Table' categories. This firmly establishes the superiority of our approach for the task of fine-grained viewpoint estimation.
In this paper, we present a novel approach for object viewpoint estimation which combines keypoint correspondence maps from multiple views to achieve state-of-the-art results on standard pose estimation datasets. Being data-efficient, our method is suitable for large-scale or novel-object real-world applications. In future work, we would like to make the method weakly supervised, as obtaining keypoint annotations for novel object categories is non-trivial. Finally, the pose-invariant local descriptors show promise for use in other tasks, which we will also explore in the future.