StarMap for Category-Agnostic Keypoint and Viewpoint Estimation
Semantic keypoints provide concise abstractions for a variety of visual understanding tasks. Existing methods define semantic keypoints separately for each category with a fixed number of semantic labels. As a result, this representation is not suitable when objects have a varying number of parts, e.g. chairs with varying numbers of legs. We propose a category-agnostic keypoint representation encoded with their 3D locations in the canonical object views. Our intuition is that the 3D locations of the keypoints in canonical object views contain rich semantic and compositional information. Our representation thus consists of a single-channel, multi-peak heatmap (StarMap) for all the keypoints and their corresponding features as 3D locations in the canonical object view (CanViewFeature) defined for each category. Not only is our representation flexible, but we also demonstrate competitive performance in keypoint detection and localization compared to category-specific state-of-the-art methods. Additionally, we show that when augmented with an additional depth channel (DepthMap) to lift the 2D keypoints to 3D, our representation can achieve state-of-the-art results in viewpoint estimation. Finally, we demonstrate that each individual component of our framework can be used on the task of human pose estimation to simplify the state-of-the-art architecture.
Semantic keypoints, such as joints on a human body or corners on a chair, provide concise abstractions of visual objects regarding their compositions, shapes, and poses. Accurate semantic keypoint detection forms the basis for many visual understanding tasks, including human pose estimation [4, 24, 27, 53], hand pose estimation [48, 54], viewpoint estimation [26, 37], feature matching , fine-grained image classification , and 3D reconstruction [38, 41, 11, 10].
Existing methods define a fixed number of semantic keypoints for each object category in isolation [37, 26, 42, 24]. A standard approach is to allocate a heatmap channel for each keypoint. Or in other words, keypoints are inferred as separate heat maps according to their encoding order. This approach, however, is not suitable when objects have a varying number of parts, e.g. chairs with varying numbers of legs. The approach is even more limiting when we want to share and use keypoint labels of multiple different categories. In fact, keypoints of different categories do share rich compositional similarities. For instance, chairs and tables may share the same configuration of legs, and motorcycles and bicycles all contain wheels. Category-specific keypoint encodings fail to capture both the intra-category part variations and the inter-category part similarities.
In this paper, we propose a novel, category-agnostic keypoint representation. Our representation consists of two components: 1) a single-channel, multi-peak heatmap, termed StarMap, for all keypoints of all objects; and 2) their respective feature (Fig. 1), termed CanViewFeature, which is defined as the 3D locations in a normalized canonical object view (or a world coordinate system). Specifically, StarMap combines the separate keypoint heatmaps in previous approaches [37, 26] into a single heatmap, and thus unifies the detection of different keypoints. CanViewFeature provides semantic discrimination between keypoints, i.e., through their locations in the normalized canonical object view. One intuition behind this representation is that the distribution of keypoints' 3D locations in the canonical object view encodes rich semantic and compositional information. For example, the locations of all legs are close to the ground, and they are below the seats. Our representation can be obtained via supervised training on any standard dataset with 3D viewpoint annotations, such as Pascal3D+ and ObjectNet3D.
Our representation provides the flexibility to represent varying numbers of keypoints across different categories by eliminating the hard-encoding of keypoints. Additionally, we demonstrate that our representation can still achieve competitive results in keypoint detection and localization compared to the state-of-the-art category-specific approaches [16, 37] (Sec 4.2) by using simple nearest neighbor association on the category-level keypoint templates.
One direct application of our representation is viewpoint estimation [37, 30, 21], which can be achieved by solving a perspective-n-points (PnP)  problem to align the CanViewFeature with the StarMap. Further, we observed considerable performance gains in this task by augmenting the StarMap with an additional depth channel (DepthMap) to lift the 2D image coordinates into 3D. We report state-of-the-art performance compared to previous viewpoint estimation methods [30, 26, 21, 37] with ablation studies on each component. Finally, we show our method works well when applied to unseen categories. Full code is publicly available at https://github.com/xingyizhou/StarMap.
first trained a deep neural network for 2D human pose regression and Li et al. extended this approach to 3D. Starting from Tompson et al. , the heatmap representation has dominated the 2D keypoint estimation community and has achieved great success in both 2D human pose estimation [24, 40, 46] and single category man-made object keypoint detection [42, 41]. Recently, the heatmap representation has been generalized in various different directions. Cao et al.  and Newell et al.  extended the single peak heatmap (for single keypoint detection) to a multi-peak heatmap where each peak is one instance of a specific type of keypoint, enabling bottom-up, multi-person pose estimation. Pavlakos et al.  lifted the 2D pixel heatmap to a 3D voxel heatmap, resulting in an end-to-end 3D human pose estimation system. Tulsiani et al.  and Pavlakos et al.  stacked keypoint heatmaps from different object categories together for multi-category object keypoint estimation. Despite good performance gained by these approaches, they share a common limitation: each heatmap is only trained for a specific keypoint type from a specific object. Learning each keypoint individually not only ignores the intra-category variations or inter-category similarities, but also makes the representation inherently impossible to be generalized to unknown keypoint configurations for novel categories.
Viewpoint estimation. Viewpoint estimation, i.e., estimating an object's orientation in a given frame, is a practical problem in computer vision and robotics [12, 26]. It has been well explored by traditional techniques that solve for transformations between corresponding points in the world and image views; this is known as the Perspective-n-Point Problem [13, 18]. Lately, viewpoint estimation accuracy and utility have been greatly improved in the deep learning era. Tulsiani et al. introduced viewpoint estimation as a bin classification problem for each viewing angle (azimuth, elevation and in-plane rotation). Mousavian et al.  augmented the bin classification scheme by adding regression offsets within each bin so that predictions could be more fine-grained. Szeto et al.  used annotated keypoints as additional input to further improve bin classification. To combat scarcity of training data and generic features, Su et al.  proposed to synthesize images with known 3D viewpoint annotations and proposed a geometry-aware loss to further boost the estimation performance. Recently, Pavlakos et al.  proposed to use detected semantic keypoints followed by a PnP algorithm  to solve for the resulting viewpoint matrix and achieved state-of-the-art results. However, this method relies on category-specific keypoint annotation and is not generalizable. In contrast, our approach is both accurate and category-agnostic, by utilizing category-agnostic keypoints.
General keypoint detection. There are several related concepts similar to our general semantic keypoint. The most well-known one is the SIFT descriptor , which aims to detect a large number of interest points based on local and low level image statistics. Also, the heatmap representation has been used in saliency detection  and visual attention , which detects a region of image which is “important” in the context. Similarly, Altwaijry et al.  used the heatmap representation to detect a set of points that is useful for feature matching. The key difference between our keypoint and the above concepts is that their keypoints do not contain semantic meanings and are not annotated by humans, making them less useful in high level vision tasks such as pose estimation.
To the best of our knowledge, we are the first to propose a category-agnostic keypoint representation and show that it is directly applicable to viewpoint estimation.
In this section, we describe our approach for learning a category-agnostic keypoint representation from a single RGB image. We begin with describing the representation in Section 3.1. We then introduce how to learn this representation in Section 3.2. Finally, we show a direct application of our representation in viewpoint estimation in Section 3.3.
A desired general purpose keypoint representation should be both adaptive (i.e., should be able to represent different content of different visual objects) and semantically meaningful (i.e., should convey certain semantic information for downstream applications).
So far, the most widely used keypoint representation is the category-specific stacked keypoint vector , which represents object keypoints by a vector ( for number of keypoints and for dimensions), or multi-channel heatmaps [34, 24], which associate each channel with one specific keypoint on a specific object category, e.g., -channel heatmaps for human [34, 24], -channel heatmaps for chair . Although these representations are certainly semantically meaningful (e.g., the first channel of human heatmaps is the left ankle), they do not satisfy the adaptive property, e.g., chairs with legged bases and swivel bases cannot be learned together due to their varying numbers of keypoints. As a result, they cannot be considered as the same category based on their different keypoint configurations. To generalize heatmaps to multiple categories, a popular approach is to stack all heatmaps from all categories [37, 26] (resulting in output channels, where is the number of keypoints of category ). In such a representation, keypoints from different objects are completely separated, e.g. seat corners from swivel chairs are irrelevant to seat corners from chairs. To merge keypoints from different objects, one has to establish consistent correspondences  between different keypoints across multiple categories, which is difficult or sometimes impossible.
In this paper, we introduce a hybrid representation that meets all desired properties. As illustrated in Figure 2, our hybrid representation consists of three components: StarMap, CanViewFeature and DepthMap. In particular, StarMap specifies the image coordinates of keypoints, where the number of keypoints can vary across different categories; CanViewFeature specifies the 3D locations of keypoints in a canonical coordinate system, which provide an identity for each keypoint; DepthMap lifts 2D keypoints into 3D. As we will see later, DepthMap enhances the performance of using this representation for the application of viewpoint estimation. We now describe each component in more detail.
StarMap. As shown in Figure 2 (top left), StarMap is a single-channel heatmap whose local maxima encode the image locations of the underlying points. It is motivated by the success of using one heatmap to encode occurrences of one keypoint on multiple persons [4, 23]. In our setting, we generalize the idea to encode all keypoints of each object. This is in contrast to [4, 23], which use multi-peak heatmaps to detect multiple instances of the same specific keypoint. In our implementation, given a heatmap, we extract the corresponding keypoints by detecting all local maxima, with respect to the 8-ring neighborhood, whose values are above a threshold.
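The peak extraction step described above can be sketched as follows. This is a minimal NumPy illustration, not the released implementation: the function name `extract_peaks` and the `threshold` argument are hypothetical, and ties on a flat plateau are all reported as peaks.

```python
import numpy as np

def extract_peaks(heatmap, threshold):
    """Return (row, col, score) for each local maximum above `threshold`.

    A pixel counts as a peak when its value exceeds `threshold` and is not
    smaller than any of its 8 neighbours (assumed tie handling).
    """
    heatmap = np.asarray(heatmap, dtype=float)
    h, w = heatmap.shape
    # Pad with -inf so border pixels are compared like interior pixels.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    is_peak = heatmap > threshold
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            is_peak &= heatmap >= neighbour
    rows, cols = np.nonzero(is_peak)
    return [(int(r), int(c), float(heatmap[r, c])) for r, c in zip(rows, cols)]
```

The returned score is the heatmap value at the peak, which can later serve as a per-keypoint confidence.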
When comparing multi-channel heatmaps and a single channel heatmap, one intuition is that multi-channel heatmaps, which are category-specific and keypoint-specific representations, lead to better accuracy. However, as we will see later, using a single channel allows us to train the representation from bigger training data (multiple categories), leading to an overall better keypoint predictor. We also argue that a single-channel representation (1 channel vs 100+ channels on Pascal3D+ ) is favored when computational and memory resources are limited. On the other hand, StarMap alone does not provide the semantic meaning of each detected point. This drawback motivates the second component of our hybrid keypoint representation.
CanViewFeature. CanViewFeature collects the 3D locations of the keypoints in the canonical view. In our implementation, we allocate three channels for CanViewFeature. Specifically, after detecting a keypoint (peak) in StarMap, the values of these three channels at the corresponding pixel specify the 3D location in the canonical coordinate system. The design of CanViewFeature is motivated by recent works on embedding visual objects into latent spaces [32, 39]. Such latent spaces provide a shared platform for comparing and linking different visual objects. Our representation shares the same abstract idea, yet we make the embedding explicit in 3D (where we can view the learned representation) and learnable in a supervised manner. This enables additional applications such as viewpoint estimation, as we will discuss later. When considering the space of keypoint configurations in the canonical space, it is easy to see that the feature is invariant to object pose and image appearance (scale, translation, rotation, lighting), varies little with object shape (e.g., left frontal wheels from different cars are always in the left frontal area), and varies little with object category (e.g., frontal wheels from different categories are always in the bottom frontal area).
Although CanViewFeature only provides 3D locations, we can leverage this to classify the keypoints by using nearest neighbor association on the category-level keypoint templates.
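As an illustration of the nearest neighbor association mentioned above, the sketch below assigns each predicted canonical 3D location the index of the closest template keypoint. The function name and the toy template are hypothetical; Euclidean distance is an assumed choice of metric.

```python
import numpy as np

def assign_keypoint_ids(pred_xyz, template_xyz):
    """Label each predicted canonical-view point with its nearest template ID.

    pred_xyz:     (N, 3) predicted 3D locations in the canonical view.
    template_xyz: (K, 3) category-level template, one row per keypoint ID.
    Returns an (N,) integer array of template indices.
    """
    pred = np.asarray(pred_xyz, dtype=float)
    tmpl = np.asarray(template_xyz, dtype=float)
    # Pairwise Euclidean distances, shape (N, K).
    dists = np.linalg.norm(pred[:, None, :] - tmpl[None, :, :], axis=2)
    return dists.argmin(axis=1)
```

Note that nothing in this sketch prevents two detections from receiving the same ID; resolving such collisions would need an extra matching step.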
DepthMap. CanViewFeature and StarMap are related to each other via a similarity transform (rotation, translation, scaling) and a perspective projection. It is certainly possible to solve a non-linear optimization problem to recover the underlying similarity transform. However, since the network predictions are not perfect, we found that this approach leads to sub-optimal results.
To stabilize this process and make the relation even simpler, we augment StarMap with one additional channel called DepthMap. The encoding is the same as CanViewFeature. More precisely, we first extract keypoints at peak locations and then access the corresponding pixels to obtain the depth values. When the camera intrinsic parameters are present, we use them to convert image coordinates and depth value into the true 3D location of the corresponding pixel. Otherwise, we assume weak-perspective projection, and directly use the image coordinates and depth value as an approximation of the underlying 3D location.
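The two lifting modes described above can be sketched as follows. This is an illustrative example under stated assumptions: a pinhole camera with a single focal length and the principal point at the given image center, and a depth value measured along the optical axis. The function name and arguments are hypothetical.

```python
import numpy as np

def lift_keypoints(uv, depth, image_center, focal=None):
    """Lift 2D keypoints with per-keypoint depth to 3D points.

    uv:           (N, 2) pixel coordinates of the detected peaks.
    depth:        (N,) depth values read at the peak locations.
    image_center: (2,) image center (used as the principal point).
    focal:        focal length in pixels, or None if intrinsics are unknown.
    """
    uv = np.asarray(uv, dtype=float)
    depth = np.asarray(depth, dtype=float)
    centered = uv - np.asarray(image_center, dtype=float)
    if focal is None:
        # Weak perspective: use centered pixel coordinates directly.
        xy = centered
    else:
        # Pinhole back-projection (assumed model): X = (u - cx) * Z / f.
        xy = centered * depth[:, None] / focal
    return np.concatenate([xy, depth[:, None]], axis=1)
```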
Data preparation. Training our hybrid representation requires annotations of 2D keypoints, their corresponding depths, and their corresponding 3D locations in the canonical view. We remark that such training data is feasible to obtain and publicly available [44, 43]. 2D keypoint annotations per image are straightforward to retrieve  and thus widely available [15, 2, 3]. Also, annotating 3D keypoints of a CAD model  is not a hard task, given an interactive 3D UI such as MeshLab . The canonical view of a CAD model is defined as the front view of an object with the largest 3D bounding box dimension scaled to (meaning it is zero centered). Note that just a few 3D CAD models need to be annotated for each category (about 10 per category), because keypoint configuration variation is orders of magnitude smaller than the image appearance variation. Given a collection of images and a small set of CAD models of the corresponding categories, a human annotator is asked to select the closest CAD model to the image’s content, as done in Pascal3D+ and ObjectNet3D [44, 43]. A coarse viewpoint is also annotated by manually dragging the selected CAD model to align the image appearance. In summary, all the annotations required to train our hybrid representation are relatively easy to acquire. We refer to [44, 43] for more details on how to annotate such data.
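We now describe how we calculate the depth annotation. Ideally, the transformation between the canonical view and the image pixel coordinates is a full-perspective camera model:
$$[u, v]^\top = \Psi\left(s \cdot R \cdot [x, y, z]^\top + T\right), \qquad (1)$$
where $\Psi$ describes the intrinsic camera parameters, $[u, v]$ is the 2D keypoint location in the image coordinate system, and $[x, y, z]$ is the 3D location in the canonical coordinate system. $R$, $T$, and $s$ are the rotation matrix (i.e., viewpoint), translation vector, and scale factor, respectively. However, the camera intrinsic parameters are most likely unavailable in testing scenarios. In those cases, a weak-perspective camera model is often applied to approximate the 3D-to-2D transformation for keypoint estimation [51, 26], by changing Eq. 1 to
$$[u - c_x, v - c_y, d]^\top = s \cdot R \cdot [x, y, z]^\top, \qquad (2)$$
where $(u, v)$ specifies the location of the keypoint, $d$ is its associated depth, and $(c_x, c_y)$ denotes the center of the image.
Letting $[\hat{x}, \hat{y}, \hat{z}]^\top = R \cdot [x, y, z]^\top$ be the transformed 3D keypoints in the metric space, we have $[u - c_x, v - c_y, d]^\top = s \cdot [\hat{x}, \hat{y}, \hat{z}]^\top$ (with unknown $s$), which transforms one point from the 3D metric space to the 2D pixel space with an augmented depth value $d$. In training, let $K_c$ be the number of keypoints in category $c$. Both the viewpoint transformation matrix $R$ and the canonical points $\{[x_k, y_k, z_k]\}_{k=1}^{K_c}$ are known, and we can calculate the rotated keypoints $\{[\hat{x}_k, \hat{y}_k, \hat{z}_k]\}_{k=1}^{K_c}$. Moreover, the corresponding 2D keypoints $\{[u_k, v_k]\}_{k=1}^{K_c}$ are known, so we can simply solve for the scale factor by aligning the $(\hat{x}, \hat{y})$ and $(u, v)$ plane bounding box size: $s = \frac{\max(\max_k u_k - \min_k u_k,\ \max_k v_k - \min_k v_k)}{\max(\max_k \hat{x}_k - \min_k \hat{x}_k,\ \max_k \hat{y}_k - \min_k \hat{y}_k)}$, which gives rise to the underlying depth value $d_k = s \cdot \hat{z}_k$.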
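The depth annotation procedure can be illustrated with a short NumPy sketch. It rotates the canonical keypoints by the known viewpoint, chooses the scale so that the larger side of the rotated keypoints' bounding box in the image plane matches the larger side of the 2D keypoint bounding box, and scales the rotated third coordinate to obtain depth. The function name is hypothetical, and the exact bounding box convention (larger side versus per-axis) is an assumption.

```python
import numpy as np

def depth_annotation(canonical_xyz, rotation, uv):
    """Compute the scale factor and per-keypoint depth under weak perspective.

    canonical_xyz: (K, 3) keypoints in the canonical view.
    rotation:      (3, 3) viewpoint rotation matrix.
    uv:            (K, 2) annotated 2D keypoints in pixels.
    Returns (scale, depth) where depth has shape (K,).
    """
    xyz = np.asarray(canonical_xyz, dtype=float)
    uv = np.asarray(uv, dtype=float)
    # Rotate every canonical keypoint into the camera orientation.
    rotated = xyz @ np.asarray(rotation, dtype=float).T
    # Larger bounding-box side in pixels and in rotated metric space.
    extent_2d = (uv.max(axis=0) - uv.min(axis=0)).max()
    extent_3d = (rotated[:, :2].max(axis=0) - rotated[:, :2].min(axis=0)).max()
    scale = extent_2d / extent_3d
    return float(scale), scale * rotated[:, 2]
```

At least two keypoints with distinct image-plane positions are needed, otherwise the denominator is zero.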
Network training. As described above, we have full supervision for all of our 3 output components. Training is done as a supervised heatmap regression, i.e., we minimize the distance between the output 5-channel heatmap and their ground truth. Note that for CanViewFeature and DepthMap, we only care about the output at peak locations. Following [22, 23], we ignore the non-peak output locations rather than forcing them to be zero. This can be simply implemented by multiplying a mask matrix to both the network output and ground truth and then using a standard loss.
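The masking idea above can be sketched as follows, using NumPy rather than PyTorch to keep the example dependency-light. The channel layout (channel 0 for StarMap, channels 1 to 4 for CanViewFeature and DepthMap), the squared-error loss, and the unweighted sum of the two terms are assumptions made for illustration.

```python
import numpy as np

def hybrid_loss(output, target, peak_mask):
    """Squared-error loss over a 5-channel output with masked feature channels.

    output, target: (5, H, W) arrays; channel 0 is assumed to be StarMap.
    peak_mask:      (H, W) array, 1 at ground-truth peak pixels, 0 elsewhere.
    """
    output = np.asarray(output, dtype=float)
    target = np.asarray(target, dtype=float)
    mask = np.asarray(peak_mask, dtype=float)
    # StarMap is supervised at every pixel.
    star_loss = np.mean((output[0] - target[0]) ** 2)
    # Multiply both prediction and ground truth by the mask so that
    # non-peak pixels contribute zero error instead of being pushed to zero.
    masked_out = output[1:] * mask
    masked_gt = target[1:] * mask
    feat_loss = np.mean((masked_out - masked_gt) ** 2)
    return float(star_loss + feat_loss)
```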
Our implementation is done in the PyTorch framework. We use a 2-stack Hourglass Network, which is the state-of-the-art architecture for 2D human pose estimation . We trained our network using curriculum learning, i.e., we first train the network with only StarMap output for 90 epochs and then fine-tune the network with the CanViewFeature followed by DepthMap supervision for an additional 90 epochs each. The whole training took about 2 days on one GTX 1080 Ti GPU. All the hyper-parameters are set to the default values in the original Hourglass implementation .
The output of our approach (StarMap, DepthMap and CanViewFeature) can directly be used to estimate the viewpoint of the input image with respect to the canonical view (i.e., camera pose estimation). Specifically, let $p_k = [u_k - c_x, v_k - c_y, d_k]$ be the un-normalized 3D coordinate of keypoint $k$, where $(c_x, c_y)$ is the image center. Let $q_k$ be its counterpart in the canonical view. With $w_k$ we denote this keypoint's value on the heatmap, which indicates a confidence score. We solve for a similarity transformation between the image coordinate system and world coordinate system that is parameterized by a scalar $s$, a rotation $R$, and a translation $T$. This is done by minimizing the following objective function:
$$s^\star, R^\star, T^\star = \arg\min_{s, R, T} \sum_k w_k \left\| s R q_k + T - p_k \right\|^2. \qquad (3)$$
The optimal rotation has a closed-form solution, $R^\star = U V^\top$, where $U \Sigma V^\top = \sum_k w_k (p_k - \bar{p})(q_k - \bar{q})^\top$ is the SVD and $\bar{p}$, $\bar{q}$ are the means of $p_k$, $q_k$.
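A confidence-weighted similarity alignment of this kind can be sketched with NumPy as below. This is a standard SVD-based (Kabsch/Umeyama-style) solution written under assumptions, not the authors' code: it maps canonical points onto image-space points, uses weighted means for centering (which reduce to plain means for uniform weights), and adds a reflection guard.

```python
import numpy as np

def align_similarity(p, q, w):
    """Find s, R, T minimising sum_k w_k * ||s * R @ q_k + T - p_k||^2.

    p: (K, 3) lifted image-space keypoints.
    q: (K, 3) corresponding canonical-view keypoints.
    w: (K,) non-negative confidence weights.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    p_mean = (w[:, None] * p).sum(axis=0)
    q_mean = (w[:, None] * q).sum(axis=0)
    pc, qc = p - p_mean, q - q_mean
    # Weighted cross-covariance: sum_k w_k * pc_k * qc_k^T, shape (3, 3).
    cov = (w[:, None] * pc).T @ qc
    U, S, Vt = np.linalg.svd(cov)
    # Guard against an improper rotation (reflection).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    s = (S * np.diag(D)).sum() / (w * (qc ** 2).sum(axis=1)).sum()
    T = p_mean - s * R @ q_mean
    return float(s), R, T
```

The recovered `R` is the estimated viewpoint; `s` and `T` absorb the object's image scale and position.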
In this section, we perform experimental evaluations on the proposed hybrid keypoint representation. We begin by describing the experimental setup in Section 4.1. We then evaluate the accuracy of our keypoint detector and the application in viewpoint estimation in Section 4.2 and Section 4.3, respectively. We then present advanced analysis of our hybrid keypoint representation in Section 4.4. Finally, we show that our category-agnostic keypoint representation can be extended to novel categories in Section 4.5. Table 5 collects some qualitative results, and more results are deferred to the supplementary material.
We use Pascal3D+  as our major evaluation benchmark. This dataset contains 12 man-made object categories with 2K to 4K images per category. We make use of the following annotations in our training: object bounding box, category-specific 2D keypoints (annotations from ), approximate 3D CAD model of the object, viewpoint of the image, and category-specific 3D keypoint annotations (corresponds with the 2D keypoint configuration) in the canonical coordinate system defined on each CAD model. Following [37, 30], evaluation is done on the subset of the validation set that is non-truncated and non-occluded, which contains samples in total. As the evaluation protocols and baseline approaches vary across different tasks, we will describe them for each specific set of evaluations.
We first evaluate our method on the keypoint estimation task, which specifies the locations of the predicted keypoints. Since keypoint locations alone do not carry the identities of each keypoint and cannot be used for identity-specific evaluation, we perform the evaluation using two protocols, namely, with identification inferred from our learned CanViewFeature or with oracle-assigned identification. Specifically, for the first protocol, for each category, we calculate the mean of the locations of each keypoint in the world coordinate system among all CAD models and use this as the category-level template. We then associate each keypoint with the ID of its nearest mean annotated keypoint in the template. For the second protocol, we assume a perfect ID assignment (or keypoint classification) by assigning the output keypoint ID as the closest annotation (in image coordinates). The second protocol can also be thought of as randomly perturbing the annotated keypoint order and picking the best one. Following the conventions [16, 37], we use PCK($\alpha$), or Percentage of Correct Keypoints, as the evaluation metric. PCK considers a keypoint to be correct if its 2D pixel distance from the ground truth keypoint location is less than $\alpha \cdot \max(h, w)$, where $h$ and $w$ are the object's bounding box dimensions.
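A minimal sketch of the PCK computation for a single object follows. It assumes the common convention that the distance threshold is a fraction `alpha` of the larger bounding box side; the function name and the value of `alpha` used in any call are illustrative.

```python
import numpy as np

def pck(pred_uv, gt_uv, bbox_hw, alpha):
    """Fraction of keypoints within alpha * max(h, w) pixels of ground truth.

    pred_uv, gt_uv: (K, 2) predicted and annotated keypoints, same ID order.
    bbox_hw:        (height, width) of the object's bounding box in pixels.
    alpha:          relative distance threshold (assumed convention).
    """
    pred = np.asarray(pred_uv, dtype=float)
    gt = np.asarray(gt_uv, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=1)
    return float((dist < alpha * max(bbox_hw)).mean())
```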
|Pavlakos.  Oracle Id||92.3||93.0||79.6||89.3||97.8||96.7||83.9||N/A||N/A||85.1||73.3||88.5||89.0|
|Ours Oracle Id||93.1||92.6||84.1||92.4||98.4||96.0||91.7||90.0||90.1||89.7||83.0||95.2||92.2|
The keypoint localization and classification results are shown in Table 1. We show 3 state-of-the-art methods [16, 37, 26] for category-specific keypoint localization for comparison. The evaluation of  is done by ourselves based on their published model. For the first protocol, our mean PCK is marginally better than the state-of-the-art methods from 2014 [16, 37], probably because we used a more up-to-date Hourglass Network. Our performance is slightly worse than , who uses the same Hourglass architecture but with stacked category-specific output channels, which is expected. This is due to the error caused by incorrect keypoint ID association. We emphasize that all counterpart methods are category-specific, thus requiring the ground truth object category as input, while ours is general.
The second protocol (bottom of Table 1) factors out the error caused by incorrect keypoint ID association. For a fair comparison, we also allow  to change its output order with the oracle nearest location (to eliminate the common left-right flip error ). We can see our score is , which is higher than that of Pavlakos et al . This is quite encouraging since our approach is designed to be a general purpose keypoint predictor. This result shows that it is advantageous to train a unified network to predict keypoint locations, as this allows a single network to be trained with more relevant training data.
Some qualitative results are shown in Table 5, and more results can be found in the supplementary material.
As a direct application, we evaluate our hybrid representation on the task of viewpoint estimation. The objective of viewpoint estimation is to predict the azimuth (), elevation (), and in-plane rotation () of the image object with respect to the world coordinate system. In our experiment, we follow the conventions [37, 30] by measuring the angle between the predicted rotation vector and the ground truth rotation vector: where transforms the viewpoint representation into a rotation matrix. Here , and are rotations along , and axis, respectively.
We consider two metrics that are commonly applied in the literature [37, 26, 21, 30], namely, Median Error, which is the median of the rotation angle error, and Accuracy at , which is the percentage of keypoints whose error is less than . We use , which is a default setting in the literature.
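The two metrics can be sketched as follows, assuming the per-sample error is the geodesic angle between the predicted and ground truth rotation matrices (a common choice for this task). The function names are hypothetical, and the threshold `theta` is left as a parameter.

```python
import numpy as np

def rotation_angle_error(R_pred, R_gt):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    R_rel = np.asarray(R_gt, dtype=float).T @ np.asarray(R_pred, dtype=float)
    # trace(R) = 1 + 2*cos(angle); clip to absorb floating-point drift.
    cos_angle = (np.trace(R_rel) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def viewpoint_metrics(errors, theta):
    """Return (median error, fraction of samples with error below theta)."""
    errors = np.asarray(errors, dtype=float)
    return float(np.median(errors)), float((errors < theta).mean())
```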
A popular approach for solving viewpoint estimation is to cast the problem as bin classification by discretizing the space of [37, 21, 30, 19]. Since network architecture governs the performance of a neural network, we re-train the baseline models  with more modern network architectures . We implemented a ResNet18 (Res18-Specific) with the same hyper-parameters as  (we also tried VGG  or ResNet50  but observed very similar or worse performance).
We also want to remark that although viewpoint estimation itself is not a category-specific task, all the previous works studied have used a category-specific formulation, e.g., using separate last-layer bin classifiers for each category, resulting in output units . We also provide a general viewpoint estimator as a baseline (Res18-General).
Table 2 compares our approach with previous techniques. Our method outperforms all previous methods and baselines in both testing metrics. Specifically with respect to MedErr, our approach achieved , which is lower than the prior state-of-the-art result reported in Mousavian et al . In terms of , our method outperforms the state-of-the-art result of Su et al . This is a quite positive result, since  uses additional rendered images for training.
We further evaluate , which assesses the percentage of very accurate predictions. In this case, we simply compare against our re-implemented Res18, which achieved similar results with other state-of-the-art techniques. As shown in Table 2, our approach is significantly better than Res18-General/Specific with respect to . This shows the advantage of performing keypoint alignment for pose estimation.
Note that it is also possible to directly align CanViewFeature with StarMap for viewpoint estimation by a weak-perspective PnP  algorithm (PnP in Table 2). In this case, utilizing DepthMap outperforms the direct alignment by in terms of and in terms of , respectively. On one hand, this shows the usefulness of DepthMap, particularly when the prediction is noisy. On the other hand, the performance of both approaches becomes similar when the predictions are very accurate (). This is expected since both approaches should output identical results when the predictions are perfect.
Analysis of CanViewFeature. We use the ground-truth keypoint location, and compare their learned 3D locations for keypoint classification with popular point features used in the literature, namely, SIFT  and Conv5 of VGG . For CanViewFeature, we still follow the same procedure of using nearest neighbor for keypoint classification. For SIFT and Conv5, a linear SVM is used to classify the keypoints .
Table 3 compares CanViewFeature with the two baseline approaches from . We can see that CanViewFeature is significantly better than baseline approaches. This shows the advantage of using a shared keypoint representation for training a general purpose keypoint detector.
Ablation study on representation components. To better understand the importance of each component of our representation and whether they are well-trained, we provide error analysis by replacing each output component with its ground truth. To this end, we use viewpoint estimation as the task for evaluation, and Table 4 summarizes the results. Specifically, replacing StarMap with its ground truth does not provide much performance gain in either metric, indicating that StarMap is fairly accurate. This is consistent with the high keypoint accuracy reported in Section 4.2. Moreover, replacing either CanViewFeature or DepthMap with the underlying ground truth provides considerable performance gains in terms of . In particular, using a perfect DepthMap leads to a noticeable decrease in median error. This is expected since the general task of estimating pixel depth remains quite challenging.
Our keypoint representation is category-agnostic and is free to be extended to novel object categories .
We note that Pascal3D+  only contains 12 categories and it is hard to learn common inter-category information with such limited category samples. To further verify the generalization ability of our method, we used a newly published large scale 3D dataset, ObjectNet3D . ObjectNet3D  has the same annotations as Pascal3D+  but with 100 categories. We evenly hold out 20 categories (every fifth category in alphabetical order) from the training data and only use them for testing. Because Shoe and Door do not have keypoint annotations, we remove them from the testing set, resulting in 18 novel categories. Please refer to the supplementary material for details on the dataset.
We compare the performance gap between including and withholding the categories during training. The results are shown in Table 6. As expected, the viewpoint estimation accuracy of most categories drops. For some categories (Iron, Knife, Pen, Rifle, Slipper), both experiments fail (with accuracy lower than ). One explanation is that these 5 failed categories are small and narrow objects, whose annotations may not be accurate. For example, the keypoint annotations on ObjectNet3D  for small objects are not always well-defined (see qualitative results in the supplementary material), e.g., Key and Spoon have dense keypoint annotations on their silhouettes. For half of the novel objects (bookshelf, cellphone, computer, filing cabinet, guitar, microwave, pot, stove, tub), the performance gap between including and withholding training data is less than . This indicates that our representation is fairly general and can extend viewpoint estimation to novel categories.
Acknowledgement. We thank Shubham Tulsiani and Angela Lin for the helpful discussions.
Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2d human pose estimation: New benchmark and state of the art analysis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2014)
Li, S., Chan, A.B.: 3d human pose estimation from monocular images with deep convolutional neural network. In: Asian Conference on Computer Vision. pp. 332–347. Springer (2014)
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: International Conference on Machine Learning. pp. 2048–2057 (2015)
|Method||Motorcycle||Bus|
|Similar Classifier Transfer||0.58||0.50|
|General Classifier||0.55||0.80|
|General Classifier (Res18)||0.58||0.79|
The problem of pose induction for novel categories has been studied by Tulsiani et al. They proposed two baselines for viewpoint induction: i) Similar Classifier Transfer (SCT), which uses the viewpoint classifier of a manually defined similar category for the novel category (e.g., using the bicycle classifier for motorcycle); ii) General Classifier (GC), which trains a category-agnostic viewpoint classifier (similar to our Res18-General baseline in Table 2 of the main paper). For evaluation, they exclude two categories (Motorcycle and Bus) from the Pascal3D+ training set and evaluate viewpoint estimation on these two categories with the same evaluation protocol. We compare our proposed method on viewpoint estimation with their baselines in Table 7.
Our keypoint alignment-based viewpoint estimator achieves lower performance than direct general viewpoint classification. This can be understood from the following factors. First, viewpoint estimation has proven not to be a category-specific task. As shown in Table 2 of the main paper, Res18-General performs very close to Res18-Specific (0.79 vs. 0.81), indicating that viewpoint estimation does not benefit much from category-specific design. Keypoint estimation, however, is inherently category-specific, and keypoint definitions vary widely per category. Our system places emphasis on learning the geometry of each training category, and such information is only weakly connected to the viewpoint estimation task. Despite these limitations, our keypoint-based method is able to achieve encouraging results on pose induction ( accuracy on Motorcycle, accuracy on Bus). Moreover, as indicated in the main paper, the viewpoint estimation performance of our method is highly correlated with the consistency between the keypoint predictions and CanViewFeature. On novel categories, they become less consistent, leading to a drop in viewpoint estimation accuracy. One could employ domain adaptation techniques to improve their consistency; we leave this as a direction for future research.
Our proposed method is currently the only learning-based method that extends keypoint estimation to novel categories. However, we avoid directly evaluating keypoint localization performance, because the keypoint detection task on novel categories is ill-posed. Keypoint definitions are subjective on novel objects: for example, our method consistently predicts the front lights of a bus as keypoints, while the Pascal3D+ annotations do not include them, presumably because lights are defined as keypoints on cars but not on buses.
|bicycle||clock||fish tank||lighter||racket||speaker||vending machine|
|blackboard||coffee maker||flashlight||mailbox||refrigerator||spoon||washing machine|
|bottle||cup||hair dryer||motorbike||road pole||suitcase|
|bucket||desk lamp||hammer||mouse||satellite dish||teapot|
The detailed split into training and testing categories is shown in Table 8. ObjectNet3D contains about 50k training samples in total, but only 20k of them have keypoint annotations. We use the 20k subset of the training set for training and the validation set for testing. In total, we collected 19k images for training and 4k images for novel categories.
|HourglassNetwork with oracle ID||97.44||98.27||94.02||92.22||93.30||90.49||86.02||93.22|
|StarMap with oracle ID||92.12||93.65||90.49||86.09||82.40||87.23||82.22||88.17|
|StarMap with learned ID||91.00||88.69||83.02||73.58||74.16||76.67||69.01||79.85|
The main paper evaluates our approach on rigid objects. Here we show that the results are consistent on a different task, namely human pose estimation.
We first evaluate StarMap on the task of 2D human pose estimation on the MPII dataset, by replacing the 16-channel output of the state-of-the-art HourglassNet with a one-channel StarMap and a two-channel 2D canonical feature. As shown in Table 9, our method leads to encouraging results when compared to the default HourglassNet, especially when given oracle joint identification. This means that we can obtain very similar visual results using 1 heatmap channel instead of 16.
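To make the decoding of such an output concrete, here is a minimal sketch of how a single-channel heatmap plus a two-channel feature could be turned into labelled joints. It assumes that peaks are local maxima above a threshold and that a joint identity is assigned by the nearest canonical template; the helper names, the threshold, and the template coordinates are hypothetical and are not taken from the paper.

```python
def find_peaks(heatmap, threshold=0.5):
    """Return (row, col) of strict local maxima above `threshold`."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = heatmap[r][c]
            if v < threshold:
                continue
            # Compare against the (up to) 8 surrounding pixels.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(v > n for n in neighbours):
                peaks.append((r, c))
    return peaks


def assign_ids(peaks, feature_maps, templates):
    """Label each peak with the template nearest to its feature vector."""
    labels = {}
    for r, c in peaks:
        feat = [fm[r][c] for fm in feature_maps]
        labels[(r, c)] = min(
            templates,
            key=lambda n: sum((a - b) ** 2 for a, b in zip(feat, templates[n])),
        )
    return labels


# Toy 4x4 example: one heatmap channel, two feature channels.
heat = [[0.0] * 4 for _ in range(4)]
heat[1][1], heat[2][3] = 0.9, 0.8
fx = [[0.0] * 4 for _ in range(4)]
fy = [[0.0] * 4 for _ in range(4)]
fx[1][1], fy[1][1] = -0.4, 0.1
fx[2][3], fy[2][3] = 0.5, 0.0
# Hypothetical canonical 2D positions for two joints.
templates = {"left_wrist": (-0.5, 0.0), "right_wrist": (0.5, 0.0)}

peaks = find_peaks(heat)
print(assign_ids(peaks, [fx, fy], templates))
# {(1, 1): 'left_wrist', (2, 3): 'right_wrist'}
```

With oracle identification, the `assign_ids` step would be replaced by the ground-truth joint labels, which isolates localization quality from identification errors.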
The DepthMap representation, which associates each 2D joint with a depth value in a map representation, can serve as a simplified 3D keypoint representation. This is in contrast to Zhou et al., who represent 3D keypoints as 2D heatmaps and a depth vector learned with an additional sub-network. More specifically, Zhou et al. propose to decouple the 3D coordinates into image coordinates and depth (see our Section 3.2) under a weak-perspective camera model, which enables the use of rich 2D in-the-wild data in training. For estimating the depth of each joint, they use an additional depth regression sub-network on top of the 2D network, which is cumbersome (i.e., it introduces more hyper-parameters for designing the sub-network and increases the feed-forward time). With our DepthMap encoding, which augments the heatmaps with an additional depth channel and reads the depth value at the heatmap peak location, we can replace the sub-network with additional output channels. We illustrate the difference in Fig. 3.
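The read-out step of this encoding can be sketched as below. The sketch assumes one depth map per joint heatmap and takes the global maximum of each heatmap as the joint location; the function name `decode_joints` and the toy values are illustrative only.

```python
def decode_joints(heatmaps, depthmaps):
    """Return one (x, y, depth) tuple per joint.

    Assumption: heatmaps[k] and depthmaps[k] are 2D lists of equal size,
    and the depth of joint k is read at the peak of heatmaps[k].
    """
    joints = []
    for hm, dm in zip(heatmaps, depthmaps):
        # Location of the strongest response in this joint's heatmap.
        row, col = max(
            ((r, c) for r in range(len(hm)) for c in range(len(hm[0]))),
            key=lambda rc: hm[rc[0]][rc[1]],
        )
        joints.append((col, row, dm[row][col]))
    return joints


# Toy example: a single joint on a 3x3 grid.
hm = [[0.0] * 3 for _ in range(3)]
dm = [[0.0] * 3 for _ in range(3)]
hm[2][1] = 1.0
dm[2][1] = 0.25
print(decode_joints([hm], [dm]))  # [(1, 2, 0.25)]
```

Because the depth is looked up rather than regressed by a separate branch, no extra sub-network is needed at inference time. If a single shared depth channel were used instead, the same function would apply by passing that map for every joint.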
We evaluate on the Human3.6M dataset, which contains about 3.6 million frames, each with accurate 3D human joint location annotations. Following [51, 27], both training and testing are done on a down-sampled subset. We follow the standard protocol and use 5 subjects for training and 2 subjects for testing. The error is measured as the mean per joint position error (MPJPE) in millimeters, after aligning the root joint location with the ground truth and assuming a fixed average scale [51, 27]. All the experiment settings are the same as in .
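For reference, the root-aligned MPJPE metric can be written as a short function. This is a generic sketch of the metric and not the evaluation code used for the reported numbers: it assumes the root joint is at index 0 and does not model the fixed-average-scale step.

```python
import math


def mpjpe(pred, gt, root=0):
    """Mean per joint position error after aligning the root joint.

    `pred` and `gt` are lists of (x, y, z) tuples in millimeters.
    """
    def centred(joints):
        rx, ry, rz = joints[root]
        return [(x - rx, y - ry, z - rz) for x, y, z in joints]

    p, g = centred(pred), centred(gt)
    # Average Euclidean distance over all joints, root included.
    return sum(math.dist(a, b) for a, b in zip(p, g)) / len(g)


gt = [(0, 0, 0), (100, 0, 0), (0, 200, 0)]
pred = [(10, 10, 10), (110, 10, 10), (13, 214, 10)]
print(round(mpjpe(pred, gt), 3))  # 1.667
```

In the toy example the prediction is offset by a constant translation, which root alignment removes; only the third joint keeps a residual error of 5 mm, giving a mean of 5/3 mm.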
The results in Table 10 show that our DepthMap representation achieves performance very close to the original design of Zhou et al., while saving the network parameters of the depth-regression sub-network. We also compare with Mehta et al., who also use a map representation for coordinates. Instead of directly using the coordinates from the 2D heatmap (with a weak-perspective camera model), they regress the full coordinates at the peak heatmap location with a full-perspective camera model. They also use a modified ResNet50 architecture instead of HourglassNetwork. Our results are considerably better than theirs, showing the effectiveness of the decoupled weak-perspective 3D keypoint representation.
This section was removed due to the arXiv size limit. Please visit the project page (https://github.com/xingyizhou/StarMap) for more qualitative results.