iSPA-Net: Iterative Semantic Pose Alignment Network

Understanding and extracting 3D information of objects from monocular 2D images is a fundamental problem in computer vision. In the task of 3D object pose estimation, recent data-driven approaches based on deep neural networks suffer from a scarcity of real images with 3D keypoint and pose annotations. Drawing inspiration from human cognition, where annotators use a 3D CAD model as a structural reference to acquire ground-truth viewpoints for real images, we propose an iterative Semantic Pose Alignment Network, called iSPA-Net. Our approach exploits semantic 3D structural regularity to solve the task of fine-grained pose estimation by predicting the viewpoint difference between a given pair of images. Such an image-comparison-based approach also alleviates the problem of data scarcity and hence enhances the scalability of the proposed approach to novel object categories with minimal annotation. The fine-grained object pose estimator is also aided by correspondence of learned spatial descriptors of the input image pair. The proposed pose alignment framework can refine its initial pose estimate in consecutive iterations by utilizing an online rendering setup along with a non-uniform bin classification of the pose difference. This enables iSPA-Net to achieve state-of-the-art performance on various real-image viewpoint estimation datasets. Further, we demonstrate the effectiveness of the approach for multiple applications. First, we show results for active object viewpoint localization, which captures images from a similar pose using only a single image as pose reference. Second, we demonstrate the ability of the learned semantic correspondence to perform unsupervised part-segmentation transfer using only a single part-annotated 3D template model per object class. To encourage reproducible research, we have released the code for our proposed algorithm.




1. Introduction

Although human creativity has led us to create ingenious, highly varied designs for objects like cars and furniture, the intrinsic structure of different instances of an object category is similar. By having a clear understanding of the 3D structural regularities despite the diverse intra-class variations, machines can gain a holistic understanding of their environment. A task that goes hand-in-hand with this aim is object viewpoint estimation, the task of estimating the 3D pose of an object from a 2D image. In this work, we explore whether the relation between 3D structural regularities and viewpoint can be exploited for fine-grained viewpoint estimation.

In our daily routine we see objects from various viewpoints over time. This helps human perception to relate one view of an object to another view of the same object class already perceived in temporal proximity (Perry et al., 2006). In fact, to accurately annotate object viewpoint in various datasets (Xiang et al., 2016; Xiang et al., 2014), the annotators are able to compare the object in the image to a reference 3D model projection at different viewpoints. Inspired by both human cognition and recent advances in deep learning, we take a novel direction to solve the task of object pose estimation using a single 3D template model as structural reference. A fine-grained iterative Semantic Pose Alignment Network, named iSPA-Net, is developed to efficiently model a view-planning strategy, which can iteratively match the pose of a given 3D template to the pose of the object in an RGB image.

Figure 1. Illustration of the proposed pipeline. At each iteration, iSPA-Net takes a pair of real and graphically rendered images to predict the difference in viewpoint. In order to improve pose alignment, a new synthetic image is rendered from the previously estimated viewpoint in each iteration. The figure on the right shows the trajectory of viewpoints for the iterative alignment.

Our proposed approach performs object viewpoint estimation by semantically aligning the pose of a reference 3D model projection to the pose of the object in a given natural image. To realize this, we employ a novel CNN architecture which takes the pair of images as input and estimates the difference in object viewpoint. The proposed architecture consists of two major parts: a correspondence network followed by a pose-estimator network. The correspondence network computes a correspondence tensor capturing information regarding the spatial displacement of object parts between the two input images. Following this, the pose-estimator network infers the difference in object viewpoint. To obtain an absolute viewpoint prediction for the given real object, we add the predicted viewpoint difference to the known viewpoint of the synthetically rendered image. To further improve the pose estimation result, we introduce an iterative estimation pipeline where, at each iteration, the reference 3D template model is rendered from the viewpoint estimate of the previous iteration (see Figure 1).


Object viewpoint estimation has been attempted previously using deep convolutional networks (CNNs) in (Tulsiani and Malik, 2015; Su et al., 2015; Xiang et al., 2014), but these approaches have several drawbacks. Works such as RenderForCNN (Su et al., 2015) solve this task by employing a CNN to directly regress the pose parameters. Due to the lack of structural supervision, such methods do not learn the underlying structural information in an explicit manner. Other works such as 3D-INN (Wu et al., 2016) and DISCO (Li et al., 2017) primarily predict an abstract 2D skeleton map for a given RGB image. While the performance of such works on keypoint estimation is remarkable, they deliver sub-optimal results for viewpoint estimation. This is understandable, as the central focus of such works is keypoint estimation, which does not explicitly encourage fine-grained viewpoint estimation.

In contrast, iSPA-Net benefits from a number of key design choices. The iterative alignment pipeline is one of the key novel features of the proposed pose estimation framework. Unlike previous state-of-the-art methods, it enables iSPA-Net to refine its pose estimate in successive iterations using the online rendering pipeline. Another key feature of iSPA-Net is that, instead of inferring the absolute viewpoint, we estimate the viewpoint difference between a pair of image projections. This enables us to build training examples from only the unique real images that have pose annotation, since each real image can be paired with rendered images at different viewpoints. Due to this, we are able to efficiently utilize the available data, which in the case of viewpoint estimation is very scarce due to the high cost of manual annotation. Previous works like RenderForCNN (Su et al., 2015) and 3D-INN (Wu et al., 2016) tackle the scarcity of data by generating synthetic data in abundance (millions of images in (Su et al., 2015; Wu et al., 2016)) along with the available real samples. However, it is known that the statistics of synthesized images differ from those of real images (Nath Kundu et al., 2018), which often leads to sub-optimal performance on real data. To explicitly enforce fine-grained pose estimation, iSPA-Net uses a non-uniform binning classifier for estimating the viewpoint difference. Bins vary from coarse to fine partitions as the viewpoint difference varies from large to small. By forcing the network to progressively improve its precision as the viewpoint difference decreases, we ensure that the pose estimate improves in successive iterations (Section 3.5).


Several experiments are conducted on real image datasets to demonstrate the effectiveness of the proposed approach (Section 4). We achieve state-of-the-art results on two viewpoint estimation datasets, namely PASCAL 3D+ (Xiang et al., 2014) and ObjectNet3D (Xiang et al., 2016). We also show two diverse applications of iSPA-Net. Firstly, we present the task of active object viewpoint localization (Section 5.1), where, given a reference image, the agent must relocate its position so that the pose of the object in the camera feed matches the pose of the object in the reference image. This has a wide range of applications in industrial and warehousing setups, for instance automated cataloging of all the chairs in a showroom from a particular viewpoint specified by a reference image, using a camera-mounted drone. Secondly, we utilize the learned semantic correspondence of iSPA-Net to perform unsupervised transfer of part segmentation on various objects using a single part-annotated 3D template model (Section 5.2).

To summarize, our main contributions in this work include an approach for object viewpoint estimation which (1) by its iterative nature enables accurate fine-grained pose estimation, and (2) by predicting the difference of viewpoint between objects in an image pair, alleviates the data bottleneck. We also show (3) state-of-the-art performance in object viewpoint estimation on various datasets. Finally, (4) multiple applications of the proposed approach are shown, such as Active Object Viewpoint Localization on synthetic data and Unsupervised Part-Segmentation on real data.

2. Related work

3D structure inference from 2D image: One of the fundamental goals of computer vision is to infer 3D structural information of objects from 2D images. A few recent works attempt to solve this in an unsupervised way using projection transformation by employing special layers within a deep convolutional network architecture (Rezende et al., 2016; Yan et al., 2016; Choy et al., 2016b; Li et al., 2015). Kalogerakis et al. (Liu et al., 2016) follow a similar approach to transfer object part annotations from multiple 2D view projections to 3D mesh models. Before the deep learning era, projected locations of unique object parts or semantic keypoints were also explored to infer the 3D viewpoint of objects from RGB images (Aubry et al., 2014; Lim et al., 2014; Su et al., 2014; Xu et al., 2016). Earlier methods employed hand-engineered local descriptors like SIFT or HOG (Aubry et al., 2014; Liu et al., 2016; Taniai et al., 2016; Berg et al., 2005) to represent semantic part structures useful for viewpoint estimation. Recent works, such as (Li et al., 2017; Wu et al., 2016), extract 3D structural information by predicting 3D keypoint locations as an intermediate step while estimating 2D keypoints. However, as the projection of 2D keypoints to 3D space is an ill-posed problem and therefore prone to error, the 3D structural information may not be precise. Additionally, because only a few keypoints are estimated, the extracted 3D structural information might not be suitable for a high-precision task such as fine-grained viewpoint estimation.

Figure 2. Left: Illustration of the proposed iSPA-Net architecture with connections between the individual CNN modules (Section 3). Right: μ-law curve along with the resultant non-uniform bin-quantization used for prediction of angle-difference.

Multitudes of work, such as (Schmidt et al., 2017; Han et al., 2017; Yu et al., 2018; Choy et al., 2016a), propose use of CNNs for learning semantic correspondence between images. Universal Correspondence Network (Choy et al., 2016a) proposes an optimization technique for deep networks to learn robust spatial correspondence by efficiently designing an active hard-mining strategy and a convolutional spatial transformer.

Object viewpoint estimation:   There are many recent works which use deep convolutional networks for object viewpoint estimation (Poirson et al., 2016; Mahendran et al., 2017). RenderForCNN (Su et al., 2015) was one of the first to use deep CNNs in an end-to-end approach solely for 3D pose estimation. They synthesized rendered views from 3D CAD models with random occlusion and background information to gather enough labeled data to train a deep network. However, such an approach may not be applicable to novel object categories, as it requires a large repository of 3D CAD models for creating labeled training data. Following an entirely different approach, 3D Interpreter Network (3D-INN) (Wu et al., 2016) proposed a convolutional network to predict 2D skeleton locations. They recover the 3D joint-point locations and the corresponding 3D viewpoint from the estimated 2D joints by minimizing the reprojection error. However, such skeleton-based approaches rely heavily on the correctness of keypoint annotation data for natural images, which is especially noisy for occluded keypoints. Tulsiani et al. (Tulsiani and Malik, 2015) explored the idea of coarse-to-fine viewpoint estimation. An end-to-end approach to directly classify discretized viewpoint angles can only be used to perform coarse-level 3D view estimation. Such methods extract global structural features, whereas for fine-grained pose estimation, spatial part-based keypoints play a crucial role. Hence, in our proposed approach, the iterative alignment framework starts from a coarse-level viewpoint estimate, followed by fine-level alignment of a structural template model by enforcing explicit local descriptor correspondence.

3D model retrieval for viewpoint estimation:   A cluster of prior work attempts to estimate the 3D structure of an object by aligning a retrieved matching CAD model using 2D RGB images (Aubry et al., 2014; Xiang et al., 2014; Massa et al., 2016), in some cases with additional depth information (Bansal et al., 2016; Gupta et al., 2015). However, the performance of these methods relies heavily on the style of the retrieved CAD model due to the high intra-class variations. Additionally, such works cannot be adapted to the proposed task of active object viewpoint localization, as they require a CAD model of the object for alignment, which may not be available in real-life applications.

3. Approach

In this section, we explain the details of the proposed pose alignment pipeline (refer to Figure 2 for an overview). The architecture is inspired by a classical computer vision setup (Philbin et al., 2007), with the addition of specially designed modules which make the pipeline fully differentiable for end-to-end training. A classical pipeline would typically start with the extraction of useful local descriptors (e.g. SIFT, HOG) or spatial representations for both of the given images. Then, the part-based spatial features are matched between the two images to acquire a correspondence map, which is then used to infer the pose shift or geometrical parameters for alignment (e.g. with RANSAC). Our architecture follows a similar pipeline to align the pose of the 3D template model to a given natural image. The different components of our proposed approach are presented in the following subsections.

3.1. View-invariant feature representation

As discussed earlier, we first focus on the extraction of useful local features, which can result in efficient local-descriptor correspondence. As shown in Figure 2, the network passes two input images through a Siamese architecture with shared parameters and outputs the corresponding spatial feature maps. The first input is a real image containing the object of interest, whereas the second is a rendered RGB image generated from a 3D template model with known viewpoint parameters, i.e. azimuth, elevation, and tilt (in-plane rotation). The output feature maps of this network are analogous to the local descriptors used in a classical setup. To learn spatial representations essential for part alignment, we use the correspondence contrastive loss presented in (Choy et al., 2016a). This loss is defined over pairs of spatial locations, one on each of the two feature maps, and is averaged over the total number of pairs; each pair is labeled as either a positive or a negative correspondence pair. In Section 3.6, we describe how positive correspondence pairs are acquired between a pair of images.
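For reference, the correspondence contrastive loss of Choy et al. (2016a) has the general form below. The symbol names are assumptions chosen for illustration and may differ from the authors' notation: $F_a$ and $F_b$ denote the feature maps of the real and rendered image, $x_i$ and $x'_i$ a pair of spatial locations, $s_i$ the pair label, $m$ a margin, and $N$ the total number of pairs.

```latex
\mathcal{L}_{c} = \frac{1}{2N} \sum_{i=1}^{N} \Big[ s_i \, \lVert F_a(x_i) - F_b(x'_i) \rVert^{2}
  + (1 - s_i) \, \max\!\big(0,\; m - \lVert F_a(x_i) - F_b(x'_i) \rVert \big)^{2} \Big] \tag{1}
```

With $s_i = 1$ for positive correspondence pairs, the first term pulls matching descriptors together; with $s_i = 0$ for negative pairs, the second term pushes non-matching descriptors apart until they are at least $m$ away from each other.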

Correspondence Network Architecture:   For the Siamese network, we employ a standard GoogLeNet (Szegedy et al., 2015) architecture with ImageNet-pretrained weights. To obtain spatially aligned local features, we use a convolutional spatial transformation layer, as proposed in UCN (Choy et al., 2016a), placed after one of the layers of the GoogLeNet architecture. The convolutional spatial transformation layer greatly improves feature correlation performance by explicitly handling scale and rotation parameters for efficient correspondence among the spatial descriptors of the two feature maps.

3.2. Correspondence map and Disparity network

The two output feature maps are tensors that hold one descriptor vector per spatial location. After L2 normalization, each can be regarded as a grid of fixed-dimensional spatial descriptors. A feature correlation layer is formulated to obtain a spatial correlation map for all location pairs, covering the full resolution of both feature maps. The pairwise feature correlation for any location pair is computed as the dot product between the L2-normalized spatial descriptors at the two locations, which serves as the measure of correlation. The correlation maps are also normalized for each location of one feature map across all locations of the other feature map. Due to this normalization, ambiguous correspondences having multiple high-correlation matches with the other spatial feature map are penalized. This normalization step is in line with the traditionally used second-nearest-neighbor test proposed by Lowe (Lowe, 2004). The final tensor represents a location-wise spatial matching index of part-based descriptors for a given pair of input images. Both the correlation and the normalization steps are differentiable, consisting of simple vector and matrix operations, thus enabling end-to-end training of the pose alignment framework.
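The correlation layer described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: the function names, the `(H, W, D)` layout, and in particular the softmax used for the normalization step (with a hypothetical `temperature` parameter) are assumptions, since the exact normalization is not recoverable from the text.

```python
import numpy as np

def l2_normalize(feat, eps=1e-8):
    """L2-normalize descriptors along the channel (last) axis."""
    norm = np.linalg.norm(feat, axis=-1, keepdims=True)
    return feat / (norm + eps)

def correlation_map(feat_a, feat_b):
    """Dot-product correlation between two (H, W, D) feature maps.

    Returns an (H, W, H*W) tensor: for every location of feat_a, the
    correlation with every location of feat_b.
    """
    h, w, d = feat_a.shape
    a = l2_normalize(feat_a).reshape(h * w, d)
    b = l2_normalize(feat_b).reshape(-1, d)
    corr = a @ b.T  # all pairwise dot products
    return corr.reshape(h, w, -1)

def normalize_matches(corr, temperature=0.1):
    """ASSUMED normalization: softmax over all locations of the other map."""
    z = corr / temperature
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
feat_real = rng.normal(size=(4, 4, 8))
feat_render = rng.normal(size=(4, 4, 8))
corr = correlation_map(feat_real, feat_render)
match = normalize_matches(corr)
```

Under such a normalization, a location whose descriptor matches several places equally well receives a low score for each of them, which mirrors the penalty on ambiguous correspondences mentioned in the text.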

To gain a compact and fused representation of the correspondence, the correspondence tensor is further passed through a single inception module. Finally, we apply the down-sampled transparency map obtained from the rendering mechanism of the rendered image to every feature channel of the correspondence tensor. For further processing, the correspondence feature is then concatenated with the spatial features of both images after processing through some convolutional layers. This combines the part-based local descriptor representations of the two images with the corresponding spatial shift obtained between them. Next, a disparity map is computed between a stereo pair of rendered images: the rendered input image and a second render of the template. The second render is generated by applying a minimal shift in both azimuth and elevation angle at the absolute viewpoint of the object in the rendered input image. The raw disparity map is fed to a small disparity network, allowing the network to exploit useful 3D information of the template object. Finally, as shown in Figure 2, the extracted representation is merged with the previously acquired concatenated tensor to obtain the input tensor to the pose-estimator network.

Note that only appearance-based disparity is used, which can be easily captured without access to the 3D model, as the case might be in real-world scenario. In Section 4.1, we show ablation on our proposed architecture and demonstrate the utility of each of the aforementioned components.

3.3. Pose-estimator network

The merged representation, containing the output of the disparity network along with the correlation map and the view-invariant feature tensor, is then passed as input to the pose-estimator network. The network is trained to predict the viewpoint difference between the viewpoints of the object in the real image and in the rendered image. We model the viewpoint difference as a multi-bin classification problem. The bins for classification are formed using μ-law quantization of the viewpoint difference, which is explained in Section 3.5. The network is trained using the Geometric Structure Aware Loss Function proposed in (Su et al., 2015). Further, we introduce an auxiliary task of predicting the absolute azimuth angle of the real object in the input image, and experimentally evaluate its utility.
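As a rough illustration of a geometry-aware classification loss in the spirit of Su et al. (2015), the sketch below replaces the one-hot target with weights that decay exponentially with the distance to the ground-truth bin. The function name, the bin-index distance, the `circular` switch, and the bandwidth `sigma` are assumptions made for this example, not values taken from the paper.

```python
import numpy as np

def geometry_aware_loss(log_probs, gt_bin, sigma=1.0, circular=False):
    """Soft-label cross-entropy with weights exp(-d(v, v_gt) / sigma).

    log_probs: (num_bins,) log-softmax outputs for one sample.
    gt_bin:    index of the ground-truth bin.
    """
    num_bins = log_probs.shape[0]
    bins = np.arange(num_bins)
    dist = np.abs(bins - gt_bin)
    if circular:
        # Treat the first and last bins as neighbours (e.g. absolute azimuth).
        dist = np.minimum(dist, num_bins - dist)
    weights = np.exp(-dist / sigma)
    return -np.sum(weights * log_probs)

# A prediction peaked at bin 0 is penalized less when bin 0 is correct.
peaked = np.log(np.array([0.97, 0.01, 0.01, 0.01]))
loss_near = geometry_aware_loss(peaked, gt_bin=0)
loss_far = geometry_aware_loss(peaked, gt_bin=3)
```

Compared with plain cross-entropy, predictions that land in a bin adjacent to the correct one are penalized less than predictions that land far away, which suits angle classification.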

3.4. Iterative pipeline and viewpoint classifier

The method described so far consists of a pipeline which, given a real image and a rendered image as input, estimates the viewpoint difference between them. To estimate the viewpoint of the object in a real image, the image is passed along with a rendered image of the 3D template object at known viewpoint parameters. The viewpoint in the real image is then estimated by adding the predicted difference to the known viewpoint of the rendered image. For fine-grained pose estimation, however, we propose the following iterative pipeline. Suppose that, at a given iteration, the rendered image has a known viewpoint and the network predicts a viewpoint difference. Since it is possible to render a new image at the viewpoint obtained by adding the predicted difference to the current viewpoint, the real image and this newly rendered image act as the input pair for the next iteration, thereby allowing the network to perform fine-grained viewpoint estimation. This process is continued until the estimated viewpoint difference is below some threshold, or until the iteration limit is reached.

While it is possible to randomly initialize the viewpoint of the 3D template object for the first iteration, we develop a more structured approach by employing a small Viewpoint Classifier Network. This network takes the real image as input and provides a coarse viewpoint estimate of the object in it. This estimate is used as the initial viewpoint of the rendered image in the iterative pipeline. By using this coarse estimate, the average number of iterations required to reach the threshold is reduced significantly. The classifier is a shallow network comprising only three convolutional layers with a final classification layer. It is trained separately for a 16-way classification of only the azimuth angle, using the Geometric Structure Aware Loss proposed in (Su et al., 2015).
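The control flow of the iterative pipeline can be sketched as follows. This is a toy illustration under stated assumptions: `predict_difference` is a hypothetical stand-in for rendering the template at the current viewpoint and running the network on the (real, rendered) pair, and the threshold and iteration limit are arbitrary example values.

```python
def wrap_deg(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def iterative_alignment(predict_difference, init_view, threshold=1.0, max_iters=10):
    """Refine a viewpoint estimate by repeatedly applying predicted differences.

    predict_difference(view) returns the predicted (azimuth, elevation, tilt)
    difference in degrees between the real image and a render at `view`.
    """
    view = tuple(init_view)
    trajectory = [view]
    for _ in range(max_iters):
        delta = predict_difference(view)
        # New render viewpoint = current viewpoint + predicted difference.
        view = tuple(wrap_deg(v + d) for v, d in zip(view, delta))
        trajectory.append(view)
        if max(abs(d) for d in delta) < threshold:
            break  # predicted difference is small enough
    return view, trajectory

# Toy predictor: recovers only half of the remaining difference per call.
true_view = (40.0, 10.0, 0.0)

def toy_predictor(view):
    return tuple(0.5 * wrap_deg(t - v) for t, v in zip(true_view, view))

estimate, path = iterative_alignment(toy_predictor, init_view=(0.0, 0.0, 0.0))
```

Even with a predictor that is only approximately right at each step, the estimate converges toward the true viewpoint, which is the behaviour the iterative design relies on. In the full pipeline, `init_view` would come from the coarse viewpoint classifier rather than a fixed value.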

3.5. μ-Law Quantization of angle difference

Ideally, for fine-grained pose alignment, the model should be precise in its prediction of the viewpoint difference over the whole range, both large and small. However, such a model would require high capacity as well as a large quantity of data. Instead, we propose an alternative solution, where the model is biased to make precise predictions when the viewpoint difference is small, and only approximate predictions when it is large. When used in an iterative setup, such a model reduces the viewpoint difference at each iteration, and therefore gains precision in successive iterations. Hence, this model can sufficiently address the task of fine-grained pose alignment without facing the capacity and data bottleneck of the ideal model.

In iSPA-Net, we realize such a bias in the system by introducing non-uniform binning for the output of the pose-estimator network. Instead of a uniform multi-bin classification, we train a classifier in which finer bins are allocated to the lower range of viewpoint difference and coarser bins to the higher range. We use the μ-law curve to assign each viewpoint difference to a bin. The right section of Figure 2 shows the μ-law curve and a representative labeling of the angle difference into 20 bins using the proposed formulation. Similarly, we use the inverse of the μ-law function to obtain the prediction from the bin classifier.
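The binning can be illustrated with the standard μ-law companding function, F(x) = sgn(x) ln(1 + μ|x|) / ln(1 + μ) for x in [-1, 1], followed by uniform quantization of F(x). In the sketch below, the value of `mu`, the normalization of the angle difference by 180 degrees, and the use of bin centers for decoding are assumptions made for illustration; only the 20-bin count is taken from Figure 2.

```python
import numpy as np

def mu_law(x, mu=10.0):
    """Standard mu-law companding of x in [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_inverse(y, mu=10.0):
    """Inverse of mu_law."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

def angle_to_bin(delta_deg, num_bins=20, mu=10.0, max_deg=180.0):
    """Map an angle difference (degrees) to a non-uniform bin index."""
    x = np.clip(np.asarray(delta_deg) / max_deg, -1.0, 1.0)
    y = mu_law(x, mu)  # companded value in [-1, 1]
    idx = np.floor((y + 1.0) / 2.0 * num_bins).astype(int)
    return np.clip(idx, 0, num_bins - 1)

def bin_to_angle(idx, num_bins=20, mu=10.0, max_deg=180.0):
    """Decode a bin index to the angle at the bin center (degrees)."""
    y = (np.asarray(idx) + 0.5) / num_bins * 2.0 - 1.0
    return mu_law_inverse(y, mu) * max_deg

bins = np.arange(20)
centers = bin_to_angle(bins)
```

Because the companding curve is steep near zero, uniform bins on its output translate into narrow bins for small angle differences and wide bins for large ones.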

Another advantage of this approach is that, while it provides performance comparable to a variant of iSPA-Net with additional recurrent layers and normal uniform binning, its training regime is substantially simpler and faster. Training is simpler because iSPA-Net is trained with a single iteration, whereas the recurrent variant requires online iterative training. Hence, iSPA-Net is truly iterative only in the validation or test setup.

3.6. Data preparation

We select a single 3D template model for each object class from the ShapeNet dataset (Chang et al., 2015). Using a modified version of the rendering pipeline presented by (Su et al., 2015), the selected template model is rendered at various viewpoints to create rendered sample images. By pairing these images randomly with real images, we create training samples for our pose alignment framework. Our network takes such an image pair as input, and the corresponding ground truth is the difference between the annotated viewpoint of the real image and the known viewpoint of the rendered image. Note that our reliance on synthetic data is minimal. For efficient offline training, where rendered images are paired with real images, we use only 8,000 renders of a single 3D template model. This is in sharp contrast to works such as (Su et al., 2015; Li et al., 2017), which use millions of synthetic images in their training pipeline.

While this information is sufficient to train the network in an end-to-end fashion, to further improve our view-invariant feature representation we use the correspondence contrastive loss of Section 3.1 (equation 1), which requires dense correspondence annotations between the images of a pair. To generate such correspondence information, we automatically process the annotations provided in the Keypoint-5 dataset, released by Wu et al. (Wu et al., 2016). We use the 2D skeletal representation, which is based on the annotation of sparse 2D skeletal keypoints of real images. As shown in Figure 3, the sparse 2D keypoints are annotated on real images at important joint locations such as leg ends, seat joints, etc. For each real image sample, we join these sparse keypoints in a fixed order to obtain the corresponding 2D skeletal frame (second row of Figure 3: a and c). To generate a similar skeletal frame for our rendered 3D object templates, we manually annotate sparse 3D keypoints on the template objects, as shown in Figure 3d. Using the projection of these 3D keypoints, based on the viewpoint parameters, we generate 2D keypoints for any rendered image and the corresponding 2D skeletal frame in a similar fashion. Now, by pairing points along the generated skeletal frames of any image pair, 2D points on the real image can be matched to the corresponding points on the rendered image. By creating multiple such pairs, we generate a dense correspondence set for any image pair, which is then used to improve the performance of the local descriptor network.
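The pairing of points along two skeletal frames can be sketched as follows. This is a minimal illustration assuming that both skeletons share the same keypoint ordering and edge list; the function names, the `edges` argument, and `samples_per_edge` are hypothetical, and the pruning steps described later are omitted.

```python
import numpy as np

def sample_skeleton(keypoints, edges, samples_per_edge=10):
    """Sample points along each skeleton edge by linear interpolation.

    keypoints: (K, 2) array of 2D keypoints.
    edges:     list of (i, j) keypoint index pairs, in a fixed order.
    Returns a (len(edges) * samples_per_edge, 2) array.
    """
    t = np.linspace(0.0, 1.0, samples_per_edge)[:, None]
    pts = [(1.0 - t) * keypoints[i] + t * keypoints[j] for i, j in edges]
    return np.concatenate(pts, axis=0)

def dense_correspondences(kp_real, kp_render, edges, samples_per_edge=10):
    """Pair the k-th sample on each edge of one skeleton with the k-th on the other."""
    a = sample_skeleton(np.asarray(kp_real, dtype=float), edges, samples_per_edge)
    b = sample_skeleton(np.asarray(kp_render, dtype=float), edges, samples_per_edge)
    return np.stack([a, b], axis=1)  # shape (M, 2, 2): [pair, image, xy]

# Toy example: one edge, seen horizontally in one image and vertically in the other.
pairs = dense_correspondences([[0, 0], [10, 0]], [[0, 0], [0, 20]],
                              edges=[(0, 1)], samples_per_edge=3)
```

Each row of `pairs` is one positive correspondence: a point on the real image's skeleton and the point at the same relative position on the rendered image's skeleton.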

Figure 3 shows some qualitative examples of the annotations generated for our training. We employ various methods to prune and reform the generated annotations, such as depth-based pruning and other methods (details are provided in the supplementary material). Although the generated annotations are sometimes inaccurate (Figure 3c), the correspondence model is able to learn improved view-invariant local descriptors due to the presence of an ample amount of correct, noise-free annotations.

Figure 3. Left: Top row of (a) and (c) show examples from Keypoint-5 dataset and top row of (b) shows a synthetic rendered sample. Bottom row in all three depicts the generated 2D skeletal frames. Right (d): Manually annotated 3D skeletal model for the single 3D template model of each object category.

4. Experiments

In this section, we compare the proposed approach, iSPA-Net, with other state-of-the-art methods for viewpoint estimation. We also examine the performance improvement caused by the different design decisions for iSPA-Net.

Datasets and Metrics   We empirically demonstrate state-of-the-art performance when compared to several other methods on two public datasets, namely Pascal3D+ (Xiang et al., 2014) and ObjectNet3D (Xiang et al., 2016). We evaluate our performance on the task of object viewpoint estimation.

Performance in object viewpoint estimation is measured using Median Error (MedErr) and Accuracy at θ (Acc_θ), which were introduced by Tulsiani et al. (Tulsiani and Malik, 2015). MedErr is measured in degrees. As our approach aims to perform fine-grained object viewpoint estimation, we also report Acc_θ with smaller θ. This stronger metric requires higher precision in pose estimation and highlights our model's utility for fine-grained pose estimation. Finally, we show Acc_θ vs θ plots to concretely establish our superiority in fine-grained pose estimation.
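Both metrics are built on the geodesic distance between the predicted and ground-truth rotations, i.e. the angle of the relative rotation between them. The sketch below is an illustrative implementation with hypothetical helper names; the 30-degree default threshold (π/6) is the value commonly used with these metrics and is an assumption here.

```python
import numpy as np

def rotation_z(deg):
    """Rotation matrix about the z-axis (used only to build toy inputs)."""
    r = np.deg2rad(deg)
    c, s = np.cos(r), np.sin(r)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def geodesic_error_deg(r_pred, r_gt):
    """Angle (degrees) of the relative rotation r_pred^T r_gt."""
    cos = (np.trace(r_pred.T @ r_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def med_err(errors_deg):
    """Median of the per-sample geodesic errors."""
    return float(np.median(errors_deg))

def acc_at(errors_deg, theta_deg=30.0):
    """Fraction of samples whose error is below the threshold theta."""
    return float(np.mean(np.asarray(errors_deg) < theta_deg))

err = geodesic_error_deg(rotation_z(10.0), rotation_z(40.0))
errors = [5.0, 10.0, 20.0, 50.0]
summary = (med_err(errors), acc_at(errors, 30.0), acc_at(errors, 10.0))
```

Lowering `theta_deg` gives the stricter accuracy variants: the same set of errors yields a lower score, so only precise predictions are counted as correct.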

Training details   We use the ADAM optimizer (Kingma and Ba, 2014) for training. For training the local feature descriptor network, we generate dense correspondence annotations on the Keypoint-5 and Pascal3D+ datasets. The regression network, in contrast, is trained using the Pascal3D+ and ObjectNet3D datasets.

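Methods                                        Acc_θ     MedErr
Baseline (correspondence map only)             77%       12.52
+ processed features                           78.5%     11.26
+ disparity network                            79.32%    10.66
+ auxiliary loss (full iSPA-Net, μ-law)        86%       8.96
iSPA-Net Naive Binning                         75.5%     13.4
Table 1. Performance comparison of baseline ablations for different design choices.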

4.1. Ablative Analysis

In this section, we experimentally validate the improvements caused by the addition of the various components in our pipeline. Our baseline evaluation of architectural modifications focuses on the Chair category, as it is considered one of the most challenging classes, with a high amount of intra-class diversity.

Ablations of iSPA-Net pipeline: Our baseline model uses only the correspondence map and is trained using the Geometric Structure Aware Loss Function. First, the processed spatial features are introduced into the pipeline. Then, the disparity network is added. Finally, the auxiliary loss (Section 3.3) is appended to complete the full iSPA-Net pipeline. As shown in Table 1, each of these enhancements leads to increased performance of iSPA-Net. In the last two rows of Table 1, we compare the performance of iSPA-Net with naive uniform binning of the viewpoint difference to iSPA-Net with μ-law quantization of the viewpoint difference (the full pipeline). It is clear from these results that μ-law quantization improves the performance of iSPA-Net.

Figure 4. MedErr of iSPA-Net with respect to the iterative prediction limit.

Ablations on number of iterations: Figure 4 shows the improvement in viewpoint estimation due to refinement of the prediction in consecutive iterations. As is evident from the figure, iSPA-Net's performance improves iteratively, supporting the notion of iterative pose refinement for object viewpoint estimation.

For all the ablations, the network is trained on the train subsets of the ObjectNet3D and Pascal3D+ datasets. We report our ablation statistics on the test subset of Pascal3D+ for the Chair category.

4.2. Viewpoint Estimation

In this section, we evaluate iSPA-Net against other state-of-the-art networks for the task of viewpoint estimation.

Evaluation on ObjectNet3D:   The ObjectNet3D dataset consists of 100 diverse categories, with 90,127 images containing 201,888 objects. Due to the lack of keypoint annotation, we evaluate iSPA-Net on 4 categories from this dataset, namely Chair, Bed, Sofa and Dining-table. To evaluate the viewpoint estimation of iSPA-Net, we report performance in terms of MedErr and Acc_θ using the ground-truth bounding boxes provided with the dataset.

Due to the lack of prior work on this dataset, we additionally trained RenderForCNN (Su et al., 2015) on it using the code and data provided by the authors. The results are presented in Table 2. RenderForCNN is observed to perform poorly on this dataset. This is because the synthetic data provided by the authors is overfit to the distribution of the Pascal3D+ dataset. The poor performance of RenderForCNN not only highlights its lack of generalizability, but also demonstrates how models trained on synthetic data can falter on real data even under a slight mismatch of image distributions. Stricter accuracy metrics, i.e. Acc_θ with smaller θ, further emphasize the superiority of our method.

Method             Metric   Chair   Sofa   Table   Bed    Avg.
Su et al. (2015)            9.70    8.45   4.50    7.21   7.46
iSPA-Net                    9.15    6.08   4.70    7.11   6.76
Su et al. (2015)            0.75    0.90   0.77    0.77   0.80
iSPA-Net                    0.82    0.92   0.95    0.83   0.88
Su et al. (2015)            0.71    0.89   0.72    0.75   0.76
iSPA-Net                    0.79    0.91   0.93    0.80   0.85
Su et al. (2015)            0.64    0.80   0.68    0.72   0.71
iSPA-Net                    0.67    0.86   0.89    0.74   0.79
Table 2. Evaluation of viewpoint estimation on the ObjectNet3D dataset. Note that iSPA-Net is trained with no synthetic data, whereas Su et al. trained with 500,000 synthetic images (for all 4 classes).
Category   Su et al. (2015)   Grabner et al. (2018)   Ours (iSPA-Net)
Chair      0.86    9.7        0.80    13.7            0.86    8.96
Sofa       0.90    9.5        0.87    13.5            0.88    9.37
Table      0.73   10.8        0.71    11.8            0.83    6.28
Average    0.83   10.0        0.79    13.0            0.86    8.20
Table 3. Performance for object viewpoint estimation on PASCAL 3D+ (Xiang et al., 2014) using ground truth bounding boxes. Note that MedErr is measured in degrees.

Evaluation on Pascal3D+:   The Pascal3D+ dataset (Xiang et al., 2014) contains images from Pascal (Everingham et al., 2015) and ImageNet (Russakovsky et al., 2015) labeled with both detection and continuous pose annotations for 12 rigid object categories. Due to the lack of keypoint annotations, we show results for 3 classes, namely Chair, Sofa and Dining-table, similar to 3D-INN (Wu et al., 2016).

We observe in Table 3 that iSPA-Net, even with significantly less data than RenderForCNN, is able to surpass current state-of-the-art methods.

Method             Metric   Chair   Sofa   Table   Avg.
Su et al. (2015)            0.59    0.76   0.68    0.68
iSPA-Net                    0.84    0.80   0.83    0.82
Su et al. (2015)            0.42    0.69   0.60    0.57
iSPA-Net                    0.76    0.75   0.78    0.76
Table 4. Performance for object viewpoint estimation on PASCAL 3D+ (Xiang et al., 2014) using ground truth bounding boxes, for stricter metrics. Note that we use 95% less synthetic data than RenderForCNN.

4.3. Favorable Attributes of iSPA-Net

Figure 5. Comparison of iSPA-Net to Su et al. for varied values of the threshold in the accuracy metric on the ObjectNet3D dataset.

Fine-Grained Pose Estimation: In Table 4, we compare iSPA-Net to RenderForCNN on two stricter accuracy metrics. As shown in the table, iSPA-Net is clearly superior to RenderForCNN for fine-grained pose estimation. Further, in Figure 5, we plot accuracy against the error threshold for two different categories of the ObjectNet3D test set. Compared to the previous state-of-the-art models, we substantially improve the performance under harsher bounds, indicating that our model estimates the object pose with high precision in many images.

Method                  Metric   Chair   Sofa    Table   Avg.
Grabner et al. (2018)            15.90   11.60   16.20   14.57
iSPA-Net                         13.56    8.98   11.84   11.45
Grabner et al. (2018)             0.72    0.80    0.67    0.73
iSPA-Net                          0.74    0.80    0.83    0.79
Table 5. Performance for object viewpoint estimation on PASCAL 3D+ (Xiang et al., 2014) for the single-model training regime, which highlights the generalizability of iSPA-Net’s learned representation.

High Generalizability: For pose estimation, the memory efficiency of an approach is crucial for its deployment. iSPA-Net achieves memory efficiency by being highly generalizable across object categories. We train a single network for all the object categories considered above, and compare it to the single-network performance of Grabner et al. (2018). In Table 5, we show that iSPA-Net clearly outperforms Grabner et al. Note that the single-network model of Grabner et al. is trained on 12 classes. However, given the significantly better performance of our approach, we assert that it is equally, if not more, generalizable.

Figure 6. Illustration of active view point localization; (a) given reference image, (b) localized view-point on various 3D objects.

5. Applications of iSPA-Net

Figure 7. Qualitative results of unsupervised semantic part segmentation. First, iSPA-Net performs pose-alignment, which is then followed by transfer of part labels from the template model to the given real image. Note that only a single template model has to be annotated per category.

5.1. Object Viewpoint Localization

In this section we present a novel application of iSPA-Net. As presented earlier, object viewpoint localization is the task of estimating the camera location from which the pose of an object in the 3D world matches the pose of the object in a given reference image. Solutions to this task can be used in multiple industrial applications, such as automated object cataloging and large-scale manufacturing surveys. iSPA-Net is designed to estimate the fine-grained pose of objects without relying on the presence of a similar CAD model for alignment. Our model can be mounted on a drone, which receives a reference image and has an input feed (e.g. a camera) to obtain a real-world image. Now, instead of rendering at a new viewpoint for pose alignment, the drone can maneuver to a different location, giving rise to an updated camera feed, so as to align with the pose in the reference image over consecutive iterations.
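The alignment loop for this application can be sketched as below. This is a minimal illustration rather than the released implementation: the pose is reduced to a single azimuth angle for brevity, `predict_delta` is a hypothetical stand-in for the network's pose-difference prediction, `capture` stands in for the camera feed, and the iteration limit and stopping tolerance are assumed values.

```python
from typing import Callable, Tuple

def align_viewpoint(
    reference,
    start_view: float,
    capture: Callable[[float], object],
    predict_delta: Callable[[object, object], float],
    max_iters: int = 5,
    tol_deg: float = 1.0,
) -> Tuple[float, int]:
    """Move the camera azimuth (degrees) until the predicted pose difference
    to the reference image is below tol_deg or max_iters is reached."""
    view = start_view % 360.0
    for step in range(max_iters):
        current = capture(view)                     # updated camera feed
        delta = predict_delta(reference, current)   # predicted pose difference
        if abs(delta) < tol_deg:
            return view, step
        view = (view + delta) % 360.0               # maneuver by the prediction
    return view, max_iters

# Toy setup: an "image" is represented by its azimuth, and the predictor
# recovers only 80% of the true wrapped difference (an imperfect estimator).
def wrapped_diff(a: float, b: float) -> float:
    return (a - b + 180.0) % 360.0 - 180.0

def toy_capture(view: float) -> float:
    return view

def toy_predict(ref: float, cur: float) -> float:
    return 0.8 * wrapped_diff(ref, cur)

final_view, steps = align_viewpoint(75.0, 10.0, toy_capture, toy_predict)
```

Even with an imperfect predictor, repeated application shrinks the residual difference, which is the behavior the iterative formulation relies on.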

Due to the lack of an experimental setup, we qualitatively evaluate this task on a synthetic dataset. Using iSPA-Net, we align different 3D objects to a single given reference image. Figure 6 shows our qualitative results for this task, where images in (b) show localized viewpoints for various 3D objects based on the reference image given in (a).

5.2. Semantic Part Segmentation

As an additional application of the proposed pose alignment framework, we perform semantic part segmentation transfer from the chosen template model to multiple real images containing instances of the same object class (Huang et al., 2015). We manually annotate only the 3D template models used by the pose alignment network (one for each object class), as shown in Figure 7. To transfer the part segmentation from the annotated template model, we first perform pose alignment using the proposed iSPA-Net framework. From the pose-aligned render of the 3D template projection (with labels), feature correlation from the intermediate correspondence network output is used for semantic label transfer. This is done by assigning to each pixel in the real image the label of the spatial features in the 3D template projection that are most highly correlated with the features of that pixel. Then, a silhouette map for the given natural object image is obtained using a state-of-the-art object segmentation model (Chen et al., 2017). A hierarchical image segmentation algorithm proposed by Arbelaez et al. (2011) is then used to acquire super-pixel regions in the image. The over-segmented regions are obtained only for the region masked by the silhouette map, as shown in the second row of each example in Figure 7. To each over-segmented region we assign the median of its constituent pixel labels obtained from the correlation-based label-transfer step. The resulting part-segmentation map is shown in the last row of Figure 7.
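The two label-transfer steps can be sketched as follows. This is an illustrative sketch under stated assumptions: features are compared with normalized correlation, label 0 denotes background, and the superpixel map and silhouette are taken as given inputs (in practice produced by the segmentation methods cited above). The function names are hypothetical.

```python
import numpy as np

def transfer_labels(real_feat, tmpl_feat, tmpl_labels):
    """real_feat: (H, W, C); tmpl_feat: (h, w, C); tmpl_labels: (h, w) ints.
    Each real pixel takes the label of its most correlated template feature."""
    H, W, C = real_feat.shape
    real = real_feat.reshape(-1, C)
    tmpl = tmpl_feat.reshape(-1, C)
    # Normalize so that the dot product is a correlation score.
    real = real / (np.linalg.norm(real, axis=1, keepdims=True) + 1e-8)
    tmpl = tmpl / (np.linalg.norm(tmpl, axis=1, keepdims=True) + 1e-8)
    best = np.argmax(real @ tmpl.T, axis=1)   # best template location per pixel
    return tmpl_labels.reshape(-1)[best].reshape(H, W)

def smooth_by_superpixel(labels, superpixels, silhouette):
    """Give every superpixel inside the silhouette the median of its pixel
    labels; pixels outside the silhouette are set to 0 (background)."""
    out = np.zeros_like(labels)
    for sp in np.unique(superpixels[silhouette]):
        region = (superpixels == sp) & silhouette
        out[region] = int(np.median(labels[region]))
    return out
```

The median vote removes isolated label errors inside a superpixel, at the cost of resolution finer than the over-segmentation.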

It is evident from the qualitative results that the pose-alignment network can be used effectively to obtain a coarse part segmentation even in the presence of diverse views and occlusion scenarios (see Figure 7). Such a pose-alignment-based approach also opens up possibilities to improve the available part-segmentation models by utilizing fine-grained pose information in a much more explicit manner. Moreover, part segmentation obtained through pose alignment can assist annotators with an initial coarse label map. The procedure requires manual segmentation of only a single template model per class, which also addresses the scalability issue of part-segmentation algorithms for novel object categories.

6. Conclusions

In this paper, we present a novel iterative object viewpoint estimation network, iSPA-Net, for fine-grained pose estimation, drawing inspiration from human perception and classical computer vision pipelines. Along with demonstrating state-of-the-art performance on various public datasets, we also show that such a pipeline can have wide industrial applications. This work also raises a number of new challenges, such as formulating an unsupervised approach for an annotation-free training regime and estimating the pose of diverse outdoor and indoor objects. Along with addressing these challenges, our future work will focus on extending the proposed framework to perform 6D object pose tracking.

Acknowledgements   This work was supported by a CSIR Fellowship (Jogendra), and a project grant from Robert Bosch Centre for Cyber-Physical Systems, IISc.

Supplementary: iSPA-Net: Iterative Semantic Pose Alignment Network

In this supplementary we outline the various secondary details which provide interesting insight into this work, while also elaborating on various intricacies of our approach.

7. Data Generation

7.1. Overview

We select a single 3D template model for each object class from the ShapeNet dataset (Chang et al., 2015). Using a modified version of the rendering pipeline presented by Su et al. (2015), we render the selected template model at various viewpoints to create rendered image samples. Note that our reliance on synthetic data is minimal: we use only 8K renders of a single 3D template model. This is in sharp contrast to works such as (Su et al., 2015; Li et al., 2017), which use millions of synthetic images in their training pipeline.

Figure 8. Data Generation.

To train our pose-invariant local descriptors, we use the contrastive correspondence loss function introduced in (Choy et al., 2016a), which requires dense correspondence annotations between the images of a pair. To generate such correspondence information, we automatically process the annotations provided in the Keypoint-5 dataset, released with 3D-INN (Wu et al., 2016). We use the 2D skeletal representation, which is based on annotations of sparse 2D skeletal keypoints on real images. As shown in Figure 8, the sparse 2D keypoints are annotated on real images at important joint locations such as leg ends and seat joints. For each real image sample, we join these sparse keypoints in a fixed order to obtain the corresponding 2D skeletal frame (second row of Figure 1 (a, c)). To generate a similar skeletal frame for the renders of our 3D object templates, we manually annotate sparse 3D keypoints on the template objects, as shown in Figure 8 (d). Using the projection of these 3D keypoints, based on the viewpoint parameters, we generate the 2D keypoints for any rendered image, and also the corresponding 2D skeletal frame in a similar fashion. Now, by pairing points along the generated skeletal frames of any image pair, 2D keypoints on one image can be matched to the corresponding keypoints on the other. Following this, we generate a dense correspondence set for any image pair to improve the performance of the local descriptors.
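One simple way to pair points along two skeletal frames is sketched below. It assumes that both images share the same keypoint ordering and edge list, and that points are paired by linear interpolation at equal fractions along each edge; the function names and the number of samples per edge are illustrative choices, not a description of the exact procedure we used.

```python
import numpy as np

def dense_skeleton_points(keypoints, edges, samples_per_edge=10):
    """keypoints: (K, 2) array of 2D keypoints; edges: list of (i, j) index
    pairs joined in a fixed order. Returns points sampled at equal fractions
    along every edge, shape (len(edges) * samples_per_edge, 2)."""
    keypoints = np.asarray(keypoints, dtype=float)
    t = np.linspace(0.0, 1.0, samples_per_edge)[:, None]   # fractions along an edge
    pts = [(1.0 - t) * keypoints[i] + t * keypoints[j] for i, j in edges]
    return np.concatenate(pts, axis=0)

def dense_correspondences(kp_a, kp_b, edges, samples_per_edge=10):
    """Pair the n-th sample on image A's skeleton with the n-th sample on
    image B's skeleton. Returns an array of shape (N, 2, 2) holding
    [point in A, point in B] for each correspondence."""
    a = dense_skeleton_points(kp_a, edges, samples_per_edge)
    b = dense_skeleton_points(kp_b, edges, samples_per_edge)
    return np.stack([a, b], axis=1)
```

Correspondences produced this way would still need the pruning described in Section 7.2, since sampled points may be occluded in either image.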

Figure 8 shows some qualitative examples of the annotations generated for our training. We employ various methods to prune and refine the generated annotations, such as depth-based pruning and other methods explained in the next section. Although the generated annotations are sometimes inaccurate (Figure 8 (c)), the correspondence model is still able to learn improved view-invariant local descriptors, owing to the presence of an ample amount of correct, noise-free annotations.

Finally, Figure 9 shows the single 3D template model used for each object category.

Figure 9. The template 3D model we use for each class.

7.2. Keypoint Pruning Mechanism

For keypoint pruning, we use three main approaches:

Figure 10. The utility of the three pruning mechanisms presented in Section 7.2.
  • Visibility-Based Pruning in the Rendered Image: As the image is rendered using a template 3D model, a visibility map of the entire object can also be formed easily. Using this visibility map, we prune out the points that are not visible from the rendered viewpoint. An example is presented in Figure 10 (a).

  • Seat Presence in the Real Image: Visibility information for all parts of the object in a real image is not available, so we instead use approximations. We assume that all images of the object are taken from positive elevation angles. If this assumption holds, all leg skeletal keypoints that fall inside the 2D region covered by the seat are not visible and hence can be pruned out. Examples of this pruning mechanism are presented in Figure 10 (b).

  • Self-Occlusion of Legs in Image: Self-occlusion of object legs is very frequent; from almost all angles, some legs of an object may occlude other legs. We further prune out keypoints on occluded legs by applying a heuristic approach. First, we approximate the pose quadrant of the object by drawing a 2D vector from the back of the seat to the front. Based on this approximate pose, it is known which leg can occlude the other. This information is then used to prune out self-occluded leg keypoints. An example is presented in Figure 10 (c).
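As an illustration of the seat-based pruning rule, the sketch below discards leg skeleton points that fall inside the 2D polygon spanned by the seat keypoints. The ray-casting point-in-polygon test and the function names are our own illustrative choices; the rule is only valid under the positive-elevation assumption stated above.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test. polygon is a list of (x, y) vertices given in order."""
    x, y = point
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def prune_leg_points(leg_points, seat_polygon):
    """Keep only the leg skeleton points that lie outside the seat region."""
    return [p for p in leg_points if not point_in_polygon(p, seat_polygon)]

# Toy example: a square seat region and four candidate leg points.
seat = [(0, 0), (4, 0), (4, 4), (0, 4)]
legs = [(2, 2), (5, 5), (1, 3), (-1, 2)]
visible = prune_leg_points(legs, seat)
```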

Figure 11. Keypoint Estimation results.

8. Keypoint Correspondence

For our proposed approach, the quality of the learned local descriptors used to compute the correspondence map is crucial. In this section, we show some qualitative results to demonstrate the keypoint estimation ability of our pose-invariant local descriptors. For each keypoint in the synthetic render, we find the closest matching location in a given real image. In Figure 11, we show some qualitative results of keypoint matching between real images and multiple renders of our template 3D model. As we can see, the learned local descriptors are indeed pose-invariant, as they correspond to the right locations even after a considerable change in pose (for example, the bottom-right pair).
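The matching step used for these qualitative results can be sketched as a nearest-neighbor search in descriptor space. The sketch assumes dense descriptor maps for both images and Euclidean distance as the notion of "closest"; both the distance choice and the function name are assumptions made for illustration.

```python
import numpy as np

def match_keypoints(render_desc, render_kps, real_desc):
    """render_desc: (h, w, C) descriptors of the render;
    render_kps: list of (row, col) keypoint locations in the render;
    real_desc: (H, W, C) descriptors of the real image.
    Returns, for each keypoint, the (row, col) in the real image whose
    descriptor is closest in Euclidean distance."""
    H, W, C = real_desc.shape
    flat = real_desc.reshape(-1, C)
    matches = []
    for r, c in render_kps:
        dist = np.linalg.norm(flat - render_desc[r, c], axis=1)
        row, col = np.unravel_index(np.argmin(dist), (H, W))
        matches.append((int(row), int(col)))
    return matches
```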

Figure 12. Pose Estimation results. The bar below each image represents the azimuth angle values. The green ellipse represents the Ground Truth pose, and the blue ellipse represents the predicted pose.

9. Qualitative Samples for Pose Estimation

In this section, we show some of the results achieved by our network. In Figure 12, we show examples of images from the Pascal3D+ easy test set, along with the predicted and annotated azimuth angles. The images are arranged in ascending order of angular error in azimuth estimation. As we can see, high error in pose estimation often occurs because of extremely poor image quality, caused by factors such as lack of illumination and clutter.


  • Arbelaez et al. (2011) Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. 2011. Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence 33, 5 (2011), 898–916.
  • Aubry et al. (2014) Mathieu Aubry, Daniel Maturana, Alexei A Efros, Bryan C Russell, and Josef Sivic. 2014. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In CVPR.
  • Bansal et al. (2016) Aayush Bansal, Bryan Russell, and Abhinav Gupta. 2016. Marr revisited: 2d-3d alignment via surface normal prediction. In CVPR.
  • Berg et al. (2005) Alexander C Berg, Tamara L Berg, and Jitendra Malik. 2005. Shape matching and object recognition using low distortion correspondences. In CVPR.
  • Chang et al. (2015) Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. 2015. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR].
  • Chen et al. (2017) Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017).
  • Choy et al. (2016a) Christopher B Choy, JunYoung Gwak, Silvio Savarese, and Manmohan Chandraker. 2016a. Universal correspondence network. In NIPS.
  • Choy et al. (2016b) Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 2016b. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In ECCV.
  • Everingham et al. (2015) Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2015. The pascal visual object classes challenge: A retrospective. International journal of computer vision 111, 1 (2015), 98–136.
  • Grabner et al. (2018) Alexander Grabner, Peter M. Roth, and Vincent Lepetit. 2018. 3D Pose Estimation and 3D Model Retrieval for Objects in the Wild. In CVPR.
  • Gupta et al. (2015) Saurabh Gupta, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. 2015. Inferring 3d object pose in RGB-D images. arXiv preprint arXiv:1502.04652 (2015).
  • Han et al. (2017) Kai Han, Rafael S Rezende, Bumsub Ham, Kwan-Yee K Wong, Minsu Cho, Cordelia Schmid, and Jean Ponce. 2017. SCNet: Learning semantic correspondence. In ICCV.
  • Huang et al. (2015) Qixing Huang, Hai Wang, and Vladlen Koltun. 2015. Single-view reconstruction via joint analysis of image and shape collections. ACM Transactions on Graphics (TOG) 34, 4 (2015), 87.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Li et al. (2017) Chi Li, M Zeeshan Zia, Quoc-Huy Tran, Xiang Yu, Gregory D Hager, and Manmohan Chandraker. 2017. Deep supervision with shape concepts for occlusion-aware 3d object parsing. In CVPR.
  • Li et al. (2015) Yangyan Li, Hao Su, Charles Ruizhongtai Qi, Noa Fish, Daniel Cohen-Or, and Leonidas J Guibas. 2015. Joint embeddings of shapes and images via CNN image purification. ACM Trans. Graph. 34, 6 (2015), 234–1.
  • Lim et al. (2014) Joseph J Lim, Aditya Khosla, and Antonio Torralba. 2014. Fpm: Fine pose parts-based model with 3d cad models. In ECCV.
  • Liu et al. (2016) Ce Liu, Jenny Yuen, and Antonio Torralba. 2016. Sift flow: Dense correspondence across scenes and its applications. In Dense Image Correspondences for Computer Vision. Springer, 15–49.
  • Lowe (2004) David G Lowe. 2004. Distinctive image features from scale-invariant keypoints. International journal of computer vision 60, 2 (2004), 91–110.
  • Mahendran et al. (2017) Siddharth Mahendran, Haider Ali, and René Vidal. 2017. 3D pose regression using convolutional neural networks.
  • Massa et al. (2016) Francisco Massa, Bryan C Russell, and Mathieu Aubry. 2016. Deep exemplar 2d-3d detection by adapting from real to rendered views. In CVPR.
  • Nath Kundu et al. (2018) Jogendra Nath Kundu, Phani Krishna Uppala, Anuj Pahuja, and R. Venkatesh Babu. 2018. AdaDepth: Unsupervised Content Congruent Adaptation for Depth Estimation. In CVPR.
  • Perry et al. (2006) Gavin Perry, Edmund T Rolls, and Simon M Stringer. 2006. Spatial vs temporal continuity in view invariant visual object recognition learning. Vision Research 46, 23 (2006), 3994–4006.
  • Philbin et al. (2007) James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. 2007. Object retrieval with large vocabularies and fast spatial matching. In CVPR.
  • Poirson et al. (2016) Patrick Poirson, Phil Ammirato, Cheng-Yang Fu, Wei Liu, Jana Kosecka, and Alexander C Berg. 2016. Fast single shot detection and pose estimation. In 3DV.
  • Rezende et al. (2016) Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. 2016. Unsupervised learning of 3d structure from images. In NIPS.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, and others. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
  • Schmidt et al. (2017) Tanner Schmidt, Richard Newcombe, and Dieter Fox. 2017. Self-supervised visual descriptor learning for dense correspondence. IEEE Robotics and Automation Letters (2017).
  • Su et al. (2014) Hao Su, Qixing Huang, Niloy J Mitra, Yangyan Li, and Leonidas Guibas. 2014. Estimating image depth using shape collections. ACM Transactions on Graphics (TOG) 33, 4 (2014), 37.
  • Su et al. (2015) Hao Su, Charles R Qi, Yangyan Li, and Leonidas J Guibas. 2015. Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views. In CVPR.
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, and others. 2015. Going deeper with convolutions. In CVPR.
  • Taniai et al. (2016) Tatsunori Taniai, Sudipta N Sinha, and Yoichi Sato. 2016. Joint recovery of dense correspondence and cosegmentation in two images. In CVPR.
  • Tulsiani and Malik (2015) Shubham Tulsiani and Jitendra Malik. 2015. Viewpoints and keypoints. In CVPR.
  • Wu et al. (2016) Jiajun Wu, Tianfan Xue, Joseph J Lim, Yuandong Tian, Joshua B Tenenbaum, Antonio Torralba, and William T Freeman. 2016. Single image 3d interpreter network. In ECCV.
  • Xiang et al. (2016) Yu Xiang, Wonhui Kim, Wei Chen, Jingwei Ji, Christopher Choy, Hao Su, Roozbeh Mottaghi, Leonidas Guibas, and Silvio Savarese. 2016. ObjectNet3D: A Large Scale Database for 3D Object Recognition. In ECCV.
  • Xiang et al. (2014) Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. 2014. Beyond pascal: A benchmark for 3d object detection in the wild. In WACV.
  • Xu et al. (2016) Kai Xu, Vladimir G Kim, Qixing Huang, Niloy Mitra, and Evangelos Kalogerakis. 2016. Data-driven shape analysis and processing. In SIGGRAPH ASIA 2016 Courses. ACM, 4.
  • Yan et al. (2016) Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. 2016. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NIPS.
  • Yu et al. (2018) Wei Yu, Xiaoshuai Sun, Kuiyuan Yang, Yong Rui, and Hongxun Yao. 2018. Hierarchical semantic image matching using CNN feature pyramid. Computer Vision and Image Understanding (2018).