Click Here: Human-Localized Keypoints as Guidance for Viewpoint Estimation
We motivate and address a human-in-the-loop variant of the monocular viewpoint estimation task in which the location and class of one semantic object keypoint is available at test time. In order to leverage the keypoint information, we devise a Convolutional Neural Network called Click-Here CNN (CH-CNN) that integrates the keypoint information with activations from the layers that process the image. It transforms the keypoint information into a 2D map that can be used to weigh features from certain parts of the image more heavily. The weighted sum of these spatial features is combined with global image features to provide relevant information to the prediction layers. To train our network, we collect a novel dataset of 3D keypoint annotations on thousands of CAD models, and synthetically render millions of images with 2D keypoint information. On test instances from PASCAL 3D+, our model achieves a mean class accuracy of 90.7%, whereas the state-of-the-art baseline obtains only 85.7%, validating our argument for human-in-the-loop inference.
It is well understood that humans and computers have complementary abilities. Humans, for example, are good at visual perception—even in rather challenging scenarios such as finding a toy in a cluttered room—and, consequently, subsequent abstract reasoning from visually acquired information. On the other hand, computers are good at processing large amounts of data quickly and with great precision, such as predicting viewpoints for millions of images within an exact, but possibly inaccurate, degree. Although we, as a community, design automatic systems that seek to extract information from images automatically—and have done this quite well, e.g., [9, 17]—there are indeed situations that are beyond the capabilities of current systems, such as inferring the extent of damage to two vehicles involved in a car accident from data acquired by a dash-cam.
In such exceptionally challenging cases, integrating the abilities of both humans and computers during inference is necessary; we call this methodology hybrid intelligence, borrowing a term from social computing. This strategy can lead to pipelines that achieve better performance than fully automatic systems without incurring a significant burden on the human (Figure 1 illustrates such an example). Indeed, numerous computer vision researchers have begun to investigate tasks inspired by this methodology, such as learning on a budget and Markov Decision Process-based fusion.
Continuing in this vein of work, we focus on integrating the information provided by a human as additional input during inference to a novel convolutional neural network (CNN) architecture. We refer to this architecture as the Click-Here Convolutional Neural Network, or CH-CNN. In training, we learn how to best make use of the additional keypoint information. We develop a means to encode the location and identity of a single semantic keypoint on an image as the extra human guidance, and automatically learn how to integrate it within the part of the network that processes the image. The human guidance keypoint essentially determines a weighting, or attention mechanism , to identify particularly discriminative locations of information as data flows through the network. To the best of our knowledge, this is the first work to integrate such human guidance into a CNN at inference time.
To ground this work, we focus on the specific problem of monocular viewpoint estimation, i.e. the problem of identifying the camera’s position with respect to the target object from a single RGB image. This challenging problem has applications in numerous areas such as automated driving, robotics, and scene understanding, in many of which we envision a possible human-in-the-loop during inference. Although discriminative CNN-based methods have achieved remarkable performance on this task [23, 22, 14, 28], they often make mistakes when faced with three types of challenges: occlusion, truncation, and highly symmetrical objects. In the first two cases, there is not enough visual information for the model to make the correct prediction, whereas in the third case, the model cannot identify the visual cues necessary to select among multiple plausible viewpoints.
Monocular viewpoint estimation is well-suited to our hybrid intelligence setup as humans can locate semantic keypoints on objects, such as the center of the left-front wheel on a car, fairly easily and with high confidence. CH-CNN is able to integrate such a keypoint directly into the inference pipeline. It computes a distance transform based on the keypoint location, combines it with a one-hot vector that indicates the keypoint class label, and then uses these data to generate a weight map that is combined with hidden activations from the convolutional layers that operate on the image. At a high level, our model learns to extract two types of information—global image information and keypoint-conditional information—and uses them to obtain the final viewpoint prediction.
We train CH-CNN with over 8,000 computer-aided design (CAD) models from ShapeNet  annotated with a custom, web-based interface. To our knowledge, our keypoint annotation dataset is an order of magnitude larger than the next largest keypoint dataset for ShapeNet CAD models  in terms of number of annotated models. As our thorough experiments show, we are able to use this human guidance to vastly improve viewpoint estimation performance: on human-guidance instances from the PASCAL 3D+ validation set , a fine-tuned version of the state-of-the-art model from Su et al.  achieves 85.7% mean class accuracy, while our CH-CNN achieves 90.7% mean class accuracy. Additionally, our model is well-suited for handling challenges that the state-of-the-art model often fails to overcome, as shown by our qualitative results.
We summarize our contributions as follows. First, we propose a novel CNN that integrates two types of information—an image and information about a single keypoint—to output viewpoint predictions; this model is designed to be incorporated into a hybrid-intelligence viewpoint estimation pipeline. Second, to train our model, we collect keypoint locations on thousands of CAD models, and use these data to render millions of synthetic images with 2D keypoint information. Finally, we evaluate our model on the PASCAL 3D+ viewpoint estimation dataset  and achieve substantially better performance than the leading state-of-the-art, image-only method, validating our hybrid intelligence-based approach. Our code and 3D CAD keypoint annotations are available on our project website at ryanszeto.com/projects/ch-cnn.
Monocular Viewpoint Estimation. Viewpoint estimation and pose estimation of rigid objects have been tackled using a wide variety of approaches. One line of work has extended Deformable Part Models (DPMs) to simultaneously localize objects and predict their viewpoint [29, 19, 8]. However, DPM-based methods can only predict a limited set of viewpoints, since each viewpoint requires a separate set of models. Patch alignment-based approaches identify discriminative patches from the test image and match them to a database of rendered 3D CAD models [1, 16]. More recent approaches have leveraged CNNs [5, 4, 28, 14, 23, 22], which achieve high performance without requiring the hand-crafted features used by earlier work. Additionally, unlike DPM-based approaches, CNNs extend easily to fine-grained viewpoints by regressing from the image to either a continuous viewpoint space [5, 4] or a discrete, but fine-grained space [23, 22]. Even better performance can be achieved by supervising the CNN training stage with intermediate representations [28, 14]. Nonetheless, most fully-automatic approaches struggle with three specific challenges: occlusion [29, 22, 1], truncation [29, 22], and highly symmetric objects [22, 16]. As we show in Section 5, CH-CNN helps reduce the error caused by these challenges.
Human Interaction for Vision Tasks. Most prior work in the vision community on integrating information from humans at inference time falls under either active learning or dynamic inference. Active learning approaches reduce the amount of labeled data required for sufficient performance by intelligently selecting unlabeled instances for the human to annotate [24, 25, 24, 15]. Our task differs from active learning in that the information from the human (the keypoint) is available at inference time rather than training time, and we leverage auxiliary human information to improve the accuracy of our model rather than to achieve sufficient performance with fewer examples. In dynamic inference, a system proposes questions with the goal of improving the confidence or quality of its final answer [20, 2, 26, 27, 10]. This line of work has demonstrated the potential of incorporating human input at inference time. In contrast to work in dynamic inference, which emphasizes the process of selecting questions for the human to answer, we focus on the problem of learning how to integrate answers in an end-to-end approach for viewpoint estimation CNNs.
Our goal is to estimate three discrete angles that describe the rotation of the camera about a target object, where we are given a tight crop of the object, the location of a visible keypoint in the image, and the keypoint class (e.g. the center of the front right wheel, for a car). We do so with a novel CH-CNN that outputs confidences for each possible angle.
Formally, let $I$ be a single RGB image, $k$ be the 2D coordinate of the provided keypoint location in the image, and $c$ be the keypoint class. The label $c$ can take on one of $\sum_{o \in \mathcal{O}} |\mathcal{K}_o|$ values, where $\mathcal{O}$ is the set of object classes and $\mathcal{K}_o$ is the set of keypoint classes for a given object class $o$. Furthermore, for a given instance $s = (I, k, c)$, let $y_s = (\theta, \phi, \psi)$ be a tuple associated with $s$ representing the ground-truth azimuth/longitudinal rotation, elevation/latitudinal rotation, and in-plane rotation of the camera with respect to the object’s canonical coordinate system; each angle is discretized into $N$ bins (following Su et al., we consider $N = 360$). For each object class $o$, we seek a probability distribution function $p_o(\cdot \mid s)$ that is maximized at $y_s$ for any instance $s$. We approximate this set of functions with our CH-CNN.
Prior work [23, 22] has explored the case where only the image and object class are available at test time, by fine-tuning popular CNN architectures such as AlexNet and VGGNet. Note that after fine-tuning, the intermediate activations of these models can be interpreted as image features that are useful for viewpoint estimation. In our case, we have access to additional information at test time, i.e. the keypoint location and class. We believe that for viewpoint estimation, this information can be used to produce features that complement the global image features extracted from popular CNN architectures. We incorporate this idea in CH-CNN by learning to weigh features from certain regions in the image more heavily based on the keypoint information.
Figure 2 illustrates the architecture of CH-CNN. The early layers of our architecture are divided into two streams: the first generates features from the image, and the second produces “keypoint features” to complement the high-level image features. The keypoint feature stream produces features in three steps. First, a weight map is produced by passing the keypoint map and class through a series of linear transformations and taking the softmax of the result. Second, the activation depth columns from a convolutional layer (conv4 in our case) are multiplied by the corresponding weights from the weight map. Finally, the keypoint features are created by taking the sum of the weighted columns.
CH-CNN concatenates the features from the image and keypoint streams and performs inference with one fully-connected hidden layer and one prediction layer for each angle. The fact that we seek a probability distribution function for each object class suggests that a separate network must be trained for each object class. To avoid this, we adopt the approach used in Su et al.  where lower-level feature layers are shared by all object classes, and object class-dependent prediction layers are used for each angle.
We implement the image stream of CH-CNN with the hidden layers of AlexNet  (i.e. the layers up to the second fully-connected layer fc7); we take the activations of the fc7 layer as our image features. We stress that while AlexNet is a less powerful model than more recent ones such as ResNet , our choice allows for a sensible comparison with Su et al. , who fine-tune the same architecture for viewpoint estimation. Additionally, the choice of architecture used for the image stream is independent of our primary contribution, which is to leverage the additional guidance from the provided keypoint at inference time.
The keypoint feature stream takes representations of the keypoint location $k$ and keypoint class $c$ and generates a weighting over activation depth columns from a convolutional layer in the image stream (the fourth layer conv4 in our case), where spatial, but high-level information is retained. We use $a_{i,j}$ to denote the column at position $(i, j)$ in the conv4 activation depth column grid. We represent $k$ with a keypoint map $M$, a matrix with the same height and width as the input image, where each entry $M_{u,v}$ is the Chebyshev distance of $(u, v)$ from $k$ divided by the largest possible distance from the keypoint; the label $c$ is represented with a one-hot vector encoding $e_c$.
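The two keypoint representations described above can be sketched in NumPy as follows. This is a minimal illustration rather than the released implementation: the function names are hypothetical, the keypoint is given as a (row, column) pixel index, and "largest possible distance from the keypoint" is assumed to mean the largest Chebyshev distance from the keypoint to any pixel inside the image.

```python
import numpy as np

def chebyshev_keypoint_map(kp_row, kp_col, size=227):
    """Normalized Chebyshev distance of every pixel from the keypoint."""
    rows, cols = np.mgrid[0:size, 0:size]
    dist = np.maximum(np.abs(rows - kp_row), np.abs(cols - kp_col))
    # Assumption: normalize by the farthest in-image pixel from this keypoint.
    max_dist = max(kp_row, size - 1 - kp_row, kp_col, size - 1 - kp_col)
    return dist / max_dist

def one_hot(class_index, num_classes=34):
    """One-hot encoding of the keypoint class (34 classes in the paper's setup)."""
    vec = np.zeros(num_classes)
    vec[class_index] = 1.0
    return vec
```

With this normalization the map is 0 at the keypoint and 1 at the farthest pixel, so its range does not depend on where the user clicks.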
To learn weights over the activation depth columns, we first learn keypoint map features $g_M$ by downsampling the keypoint map $M$ with max pooling, and applying a linear transformation to the vectorized result:

$$g_M = W_M \, \mathrm{vec}\big(\mathrm{maxpool}(M)\big)$$

Similarly, features $g_c$ from the keypoint class vector $e_c$ are obtained with a linear transformation:

$$g_c = W_c \, e_c$$

Finally, the weight map $A$ for the conv4 activation depth columns is obtained by linearly transforming the concatenated keypoint features, applying the softmax function, and reshaping the result to match the shape of the conv4 activation depth column grid:

$$A = \mathrm{reshape}\left(\mathrm{softmax}\left(W_A \begin{bmatrix} g_M \\ g_c \end{bmatrix}\right)\right)$$

The keypoint feature vector $f_{\mathrm{kp}}$ is the sum of the conv4 activation depth columns $a_{i,j}$ weighted by $A$:

$$f_{\mathrm{kp}} = \sum_{i} \sum_{j} A_{i,j} \, a_{i,j}$$

where $i$ and $j$ index into $A$ and the conv4 activation depth column grid.
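The keypoint stream described above (max pooling, two linear maps, a softmax weight map, and a weighted sum of conv4 columns) can be sketched in NumPy. This is a hedged illustration under stated assumptions: the variable names (`W_M`, `W_c`, `W_A`, `g_M`, `g_c`, `A`, `f_kp`) are chosen for this sketch, biases are omitted, the conv4 tensor is laid out as (channels, height, width), and the pooling keeps partial border windows so that a 227 × 227 map pools to 46 × 46.

```python
import numpy as np

def max_pool(x, k=5):
    """Non-overlapping k x k max pooling that keeps partial border windows."""
    h, w = x.shape
    pad_h, pad_w = -h % k, -w % k
    x = np.pad(x.astype(float), ((0, pad_h), (0, pad_w)), constant_values=-np.inf)
    return x.reshape((h + pad_h) // k, k, (w + pad_w) // k, k).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def keypoint_features(kp_map, class_vec, conv4, W_M, W_c, W_A):
    """Return the weight map A and the weighted sum of conv4 depth columns."""
    g_M = W_M @ max_pool(kp_map).ravel()   # keypoint map features
    g_c = W_c @ class_vec                  # keypoint class features
    A = softmax(W_A @ np.concatenate([g_M, g_c])).reshape(conv4.shape[1:])
    f_kp = (conv4 * A).sum(axis=(1, 2))    # one value per conv4 channel
    return A, f_kp
```

Because the softmax makes the entries of `A` sum to 1, `f_kp` is a convex combination of the depth columns and stays on the same scale as the conv4 activations.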
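To perform inference, the image features $f_{\mathrm{img}}$ (the fc7 activations) and the keypoint features $f_{\mathrm{kp}}$ are concatenated. The result is passed through one non-linear hidden layer with an activation function $\rho$ (e.g. the rectified linear activation function) and a set of class-wise prediction layers for each angle $\alpha$ (azimuth, elevation, or in-plane rotation):

$$h = \rho\left(W_h \begin{bmatrix} f_{\mathrm{img}} \\ f_{\mathrm{kp}} \end{bmatrix}\right), \qquad z_{o,\alpha} = W_{o,\alpha} \, h$$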
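A small NumPy sketch of these inference layers may help make the class-wise prediction heads concrete. It assumes a ReLU hidden layer without biases, and stores one prediction matrix per (object class, angle) pair in a dictionary; the names `W_h`, `W_pred`, and `predict_angle_scores` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def predict_angle_scores(f_img, f_kp, W_h, W_pred, object_class, angle):
    """Scores over angle bins for one object class and one angle."""
    features = np.concatenate([f_img, f_kp])   # image + keypoint features
    hidden = np.maximum(0.0, W_h @ features)   # ReLU hidden layer
    # Only the head for the requested class and angle is evaluated;
    # all layers below it are shared across object classes.
    return W_pred[(object_class, angle)] @ hidden
```

Selecting the head by object class is what allows a single network to serve all classes while still producing a separate distribution per class.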
To train our network, we use the geometric structure aware loss function from Su et al.,

$$L(\mathcal{S}) = -\sum_{s \in \mathcal{S}} \sum_{v \in \mathcal{V}} e^{-d(v, y_s)/\sigma} \log p_{o_s}(v \mid s),$$

where $s$ is a sample from object class $o_s$, $\mathcal{S}$ is the set of training instances, $\mathcal{V}$ is the set of possible viewpoints, $p_{o_s}(v \mid s)$ is the estimated probability of $v$ given instance $s$, $d(v, y_s)$ is a distance metric between viewpoints $v$ and $y_s$, the ground-truth viewpoint of $s$ (e.g. the geodesic distance defined in Sec. 5.1), and $\sigma$ is a hyperparameter that tunes the cost of an inaccurate prediction. This loss is a modification of the cross-entropy loss that encourages correlation between the predictions of nearby views.
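To illustrate how this loss differs from plain cross-entropy, the sketch below evaluates it for a single discretized angle. It is an assumption-laden simplification: the distance between viewpoints is taken to be the circular distance between bin indices, and the default `sigma` is an arbitrary placeholder, not a value from the paper.

```python
import numpy as np

def geometric_structure_aware_loss(log_probs, true_bins, sigma=3.0):
    """log_probs: (batch, num_bins) log-probabilities; true_bins: (batch,) ints."""
    num_bins = log_probs.shape[1]
    bins = np.arange(num_bins)
    diff = np.abs(bins[None, :] - true_bins[:, None])
    dist = np.minimum(diff, num_bins - diff)   # circular bin distance (assumed metric)
    weights = np.exp(-dist / sigma)            # 1 at the true bin, decaying with distance
    return -(weights * log_probs).sum()
```

As `sigma` approaches zero, the weights vanish everywhere except at the true bin and the expression reduces to the usual cross-entropy; larger values also reward probability mass placed on neighboring bins.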
To train the network, we begin by generating sets of training instances from synthetic data from ShapeNet and real-world data from the PASCAL 3D+ dataset (see Section 4 for details). Then, we initialize the layers from AlexNet with the weights learned from Su et al.; the layers in the keypoint feature stream, as well as the inference layers (the hidden layer and the class-wise prediction layers), are initialized with random weights. Next, we train on the synthetic data until the validation performance on a held-out subset of the synthetic data plateaus. Finally, we fine-tune on the real-world training data until the loss on that data plateaus. We develop and train our models in Caffe.
The annotations available in the PASCAL 3D+ dataset  allow us to generate about 14,000 training instances from real-world images (see Section 4.1 for details on this process), but this number is insufficient for training CH-CNN. To overcome this limitation, we have extended the synthetic rendering pipeline proposed by Su et al.  to generate not only synthetic images with labels, but also 2D keypoint locations, resulting in about two million synthetic training instances. Because this procedure requires knowledge of the 3D keypoint locations on CAD models, we have collected keypoint annotations on 918 bus, 7,377 car, and 320 motorcycle models from the CAD model repository ShapeNet  with the use of an in-house annotation interface (refer to the supplemental material for details on the CAD model filtering and annotation collection processes). We focus on vehicles to help advance applications in automotive settings, but note that our method is applicable to any rigid object class with semantic keypoints. To the best of our knowledge, the number of annotated CAD models in our dataset is greater than ten times that of the next largest ShapeNet-based keypoint dataset from Li et al. , who collected keypoints on 472 cars, 80 chairs, and 80 sofas. Our annotated CAD models are publicly available on our project website.
We render images of the annotated CAD models using the same pipeline used in Su et al., which we now describe. First, we randomly sample light sources and camera extrinsics. Then, we render the CAD model over a random background from the SUN397 dataset to reduce overfitting to synthetic instances. Finally, we crop the object with a randomly perturbed bounding box. From a single rendered image $I$, we generate one instance of the form $(I, k, c)$ (image, keypoint location, keypoint class) with viewpoint label $(\theta, \phi, \psi)$ for each visible keypoint, which can be identified by ray-tracing in the rendering environment. We focus on visible keypoints because in the hybrid intelligence environment, we assume that the human locates unambiguous keypoints, which disqualifies occluded and truncated keypoints. We follow this approach to generate about two million synthetic training instances.
PASCAL 3D+ provides detailed annotations that make generating labeled instances a straightforward process. To obtain instance-label pairs from PASCAL 3D+, we extract ground-truth bounding box crops of every vehicle in the dataset. For each cropped vehicle image and ground-truth keypoint contained inside that is labeled as visible, we produce one labeled instance. We augment the set of training data by horizontally flipping the image and adjusting the keypoint location, keypoint class, and viewpoint label appropriately. In total, we extract about 14,000 training instances and 7,000 test instances from the PASCAL 3D+ training and validation sets, respectively.
We conduct experiments to compare image-only viewpoint estimation with our human-in-the-loop approach, as well as analyze the impact of keypoint information on our model. First, we quantitatively compare our model against the state-of-the-art model R4CNN  on the three vehicle object classes in PASCAL 3D+ (Section 5.1). Second, we analyze the influence of the keypoint information on our model via ablation tests and perturbations in the keypoint location at inference time (Section 5.2). Finally, we provide qualitative results to compare our model’s predictions to those made by R4CNN (Section 5.3).
|Model||$Acc_{\pi/6}$ bus||$Acc_{\pi/6}$ car||$Acc_{\pi/6}$ motorcycle||$Acc_{\pi/6}$ mean||$MedErr$ bus||$MedErr$ car||$MedErr$ motorcycle||$MedErr$ mean|
|R4CNN, fine-tuned||90.6||82.4||84.1||85.7||2.93||5.63||11.7||6.74|
|Keypoint features (Gaussian fixed attention)||88.9||81.3||82.8||84.4||3.00||5.88||11.4||6.76|
|Keypoint features (uniform fixed attention)||90.6||82.0||83.7||85.4||3.01||5.72||12.1||6.93|
|CH-CNN (keypoint map only)||90.6||82.0||84.2||85.6||3.04||5.73||11.3||6.68|
|CH-CNN (keypoint class only)||90.9||86.3||83.1||86.8||2.92||5.29||11.0||6.41|
|CH-CNN (keypoint map + class)||96.8||90.2||85.2||90.7||2.64||4.98||11.4||6.35|
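We compare multiple viewpoint estimation models by evaluating their performance on instances extracted from the PASCAL 3D+ validation set. To be consistent with prior work [23, 22], we report two metrics, $Acc_{\pi/6}$ and $MedErr$, which are defined as follows. Let $\Delta(R_{\mathrm{pred}}, R_{\mathrm{gt}}) = \lVert \log(R_{\mathrm{gt}}^\top R_{\mathrm{pred}}) \rVert_F / \sqrt{2}$ be the geodesic distance between the predicted rotation matrix $R_{\mathrm{pred}}$ and the ground-truth rotation matrix $R_{\mathrm{gt}}$ on the manifold of rotation matrices. We define $Acc_{\pi/6}$ as the fraction of test instances where $\Delta(R_{\mathrm{pred}}, R_{\mathrm{gt}}) < \pi/6$ in radians, and $MedErr$ as the median value of $\Delta(R_{\mathrm{pred}}, R_{\mathrm{gt}})$ in degrees over all test instances.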
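The two metrics can be computed with a few lines of NumPy. The sketch below relies on the standard identity that the geodesic distance between two rotation matrices equals the rotation angle of their relative rotation, which can be read off the trace; the function names and the default threshold of π/6 are assumptions of this illustration.

```python
import numpy as np

def geodesic_distance(R_pred, R_gt):
    """Rotation angle (radians) of the relative rotation R_gt^T R_pred."""
    cos_angle = (np.trace(R_gt.T @ R_pred) - 1.0) / 2.0
    # Clip to guard against round-off pushing the cosine outside [-1, 1].
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

def evaluate(R_preds, R_gts, threshold=np.pi / 6):
    """Return (accuracy at threshold, median error in degrees)."""
    dists = np.array([geodesic_distance(p, g) for p, g in zip(R_preds, R_gts)])
    accuracy = np.mean(dists < threshold)
    median_error_deg = np.degrees(np.median(dists))
    return accuracy, median_error_deg
```

Using the trace avoids computing a matrix logarithm while giving the same value as the Frobenius-norm definition.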
Table 1 summarizes the performance of various models on the instances extracted from the PASCAL 3D+ validation set. We include R4CNN with and without fine-tuning (Section 3.2) to account for the difference in object classes used in Su et al. We also compare against two baselines that use a fixed weight map $A$ (Equation 4) instead of learning attention from the keypoint data. The first baseline (Gaussian fixed attention) sets $A$ to a normalized 13 × 13 Gaussian kernel with a standard deviation of 6, and the second baseline (uniform fixed attention) sets $A$ to a 13 × 13 box filter. Aside from the baselines, we evaluate three versions of our CH-CNN model described in Section 3.1. The first two learn a weight map using either the keypoint map or the keypoint class vector exclusively, and the third is our full model that integrates both sources of information into the weight map computation.
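For reference, the two fixed-attention baselines can be written down directly. This sketch assumes that the Gaussian kernel is centered on the middle cell of the 13 × 13 grid and that the box filter is normalized to sum to 1; neither detail is spelled out in the text, and the function names are hypothetical.

```python
import numpy as np

def gaussian_attention(size=13, sigma=6.0):
    """Normalized 2D Gaussian kernel centered on the grid (assumed center)."""
    center = (size - 1) / 2.0
    rows, cols = np.mgrid[0:size, 0:size]
    kernel = np.exp(-((rows - center) ** 2 + (cols - center) ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def uniform_attention(size=13):
    """Box filter: every conv4 depth column receives the same weight."""
    return np.full((size, size), 1.0 / size ** 2)
```

Both maps ignore the keypoint entirely, which is what makes them useful controls: any gain of the learned weight map over them can be attributed to conditioning on the keypoint.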
As shown in Table 1, our full CH-CNN model obtains the highest accuracies out of all tested models by a wide margin; noticeable drops in median error also occur. A conclusion that we draw from these results is that a weighted sum of feature columns can help improve viewpoint estimates. Most importantly, learning to weigh these features based on the keypoint information is critical to substantially improving performance over image-only methods. This indicates that providing a single keypoint during inference can indeed help viewpoint estimation by providing features that complement those extracted solely from the image.
Figure 4 shows the histograms of angle errors across all object classes obtained by our full CH-CNN model and fine-tuned R4CNN (we refer to this model simply as R4CNN for the remainder of the paper). The most notable difference between the two error distributions occurs along the tails: CH-CNN obtains high errors noticeably less frequently than R4CNN, which we attribute to our model’s ability to take advantage of keypoint features when the image features are not informative enough to make a good estimate.
Table 2 stratifies performance by car keypoint classes. In all cases, our model estimates the viewpoint more accurately than R4CNN. However, relative improvement varies greatly, meaning that if certain keypoints can be provided, the improvement from using our model over R4CNN will become more apparent. For instance, CH-CNN yields the greatest relative increase in accuracy when the right back windshield keypoint is provided, but the lowest relative improvement when the right front light keypoint is provided. We attribute this difference to the varying amount of visual information that an image-only system can leverage, which depends on which keypoints are visible: front lights are often more visually distinguishable from their rear counterparts than windshield corners are to their front counterparts. Stratified performance for bus and motorcycle keypoints can be found in the supplementary materials.
|Car keypoint class||R4CNN accuracy||CH-CNN accuracy||Relative improvement (%)|
|Left front wheel||86.9||89.5||2.99|
|Left back wheel||80.6||89.0||10.4|
|Right front wheel||89.4||91.2||2.01|
|Right back wheel||85.9||90.8||5.70|
|Left front light||90.5||94.5||4.42|
|Right front light||93.2||95.5||2.47|
|Left front windshield||87.3||91.0||4.24|
|Right front windshield||88.9||91.7||3.15|
|Left back trunk||76.8||89.5||16.5|
|Right back trunk||72.8||88.0||20.9|
|Left back windshield||72.1||84.7||17.5|
|Right back windshield||70.8||87.6||23.7|
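The last column of the table above is consistent with the relative increase in accuracy of CH-CNN over R4CNN, expressed as a percentage of the R4CNN accuracy. A short check, using a hypothetical helper name:

```python
def relative_improvement(baseline_acc, new_acc):
    """Relative increase of new_acc over baseline_acc, in percent."""
    return 100.0 * (new_acc - baseline_acc) / baseline_acc

# Left front wheel row: R4CNN 86.9, CH-CNN 89.5
left_front_wheel = round(relative_improvement(86.9, 89.5), 2)       # 2.99
# Right back windshield row: R4CNN 70.8, CH-CNN 87.6
right_back_windshield = round(relative_improvement(70.8, 87.6), 1)  # 23.7
```

Normalizing by the baseline is why keypoints with low R4CNN accuracy, such as the back windshield and trunk corners, show the largest relative gains.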
In this section, we explore how changing the keypoint information at inference time affects our trained CH-CNN model. To argue that CH-CNN adapts to the keypoint features rather than ignoring them in favor of the image features, we experiment with providing a keypoint map of all zeros, a keypoint class vector of all zeros, or both to our trained model at test time. As shown in Table 3, CH-CNN attains the worst performance when both the keypoint map and class vector are blank. In the cases where either the keypoint map or class is available, but not both, the model achieves better performance. Finally, the best performance is obtained by providing both sources of information. These results indicate that our model adapts to the keypoint information, rather than relying solely on the image features.
Next, we demonstrate that CH-CNN is robust to noise in the keypoint location at inference time, which is required in order to be useful for the hybrid intelligence environment. The noise is modeled by sampling the keypoint location from a 2D Gaussian whose mean is at the true keypoint location. We accomplish this by creating a new test set for each standard deviation $\sigma$ as follows. We replace each instance $(I, k, c)$ from the PASCAL 3D+ validation set with one instance of the form $(I, k', c)$, where $k' \sim \mathcal{N}(k, \sigma^2 \mathbf{I}_2)$. Here, $\mathbf{I}_2$ is the 2 × 2 identity matrix and $\sigma$ parameterizes the covariance matrix.
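The perturbation procedure can be sketched as follows. Rounding to integer pixel coordinates and clipping to the image bounds are assumptions of this sketch (the text does not say how out-of-image samples are handled), and the function name is hypothetical.

```python
import numpy as np

def perturb_keypoint(kp, sigma, rng, size=227):
    """Sample a noisy keypoint from an isotropic 2D Gaussian centered at kp."""
    noisy = rng.normal(loc=kp, scale=sigma, size=2)
    # Assumption: snap to the pixel grid and keep the keypoint inside the image.
    return np.clip(np.rint(noisy), 0, size - 1).astype(int)

rng = np.random.default_rng(0)
noisy_kp = perturb_keypoint(np.array([100, 50]), sigma=10.0, rng=rng)
```

Building one perturbed test set per value of `sigma` then amounts to applying this function once to every validation instance.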
In Figure 5, we plot the mean class performance of CH-CNN as the standard deviation $\sigma$ of the keypoint noise increases. We see that our model is robust to misplaced keypoints, retaining over 98% of its maximum performance even when the standard deviation is about 20% of the image dimensions. This is likely due to our method of downsampling the keypoint map, which would map the perturbed keypoint to a similar depth column weight map.
To conclude our analysis, we present qualitative comparisons between CH-CNN and R4CNN  by illustrating the confidences across azimuth, the most challenging angle to predict for PASCAL 3D+ . In Figure 6, we compare the two models for images that exhibit either occlusion, truncation, or highly symmetric objects, observing that CH-CNN tends to estimate viewpoint more robustly than R4CNN under these circumstances. In the shown examples, our model estimates a narrow band around the true azimuth with high confidence. On the other hand, R4CNN exhibits a variety of behaviors, such as multiple peaks (all rows, left), wide bands (middle row, left), or high confidence for the angle opposite the true azimuth (top row, right). We attribute the relative improvement of CH-CNN to the keypoint features, which can help suppress contradictory viewpoint estimates.
Figure 7 includes multiple examples of each object class, as well as failure cases for our model. In the positive cases, we continue to see narrower, but more accurate, bands of high confidence from CH-CNN than from R4CNN. Although the negative cases show that CH-CNN does not entirely overcome the main challenges of viewpoint estimation, the improved performance as shown in Table 1 indicates that these factors impact our model less severely than they impact R4CNN.
Limitations and Suggestions. Our work makes a few critical assumptions that are worth addressing in future work. First, we assume that information about only one keypoint is provided; in reality, we should be able to leverage multiple keypoints to further improve the estimate. Second, we assume that viewpoint estimates of the same object with different keypoint data are unrelated, whereas a better approach would be to enforce the consistency of viewpoint estimates of the same object. Third, we assume that the provided keypoint is both unoccluded and within the object bounding box. However, this is sensible in the context of hybrid intelligence because we can trust the human to suggest unambiguous keypoints or indicate that none exist, in which case we can fall back on image-only systems.
Summary. We have presented a hybrid intelligence approach to monocular viewpoint estimation called CH-CNN, which leverages keypoint information provided by humans at inference time to more accurately estimate the viewpoint. Our method combines global image features with keypoint-conditional features by learning to weigh feature activation depth columns based on the keypoint information. We train this model by generating synthetic examples from a new, large-scale 3D keypoint dataset. As shown by our experiments, our method vastly improves viewpoint estimation performance over state-of-the-art, image-only systems, validating our argument that applying hybrid intelligence to the domain of viewpoint estimation can yield great benefits with minimal human effort. To spur further work in hybrid intelligence for 3D scene understanding, we have made our code and keypoint annotations available at ryanszeto.com/projects/ch-cnn.
Acknowledgements. We thank Vikas Dhiman, Luowei Zhou, and Madan Ravi Ganesh for their helpful discussions and management of computing resources. We also thank Alex Miller, Matthew Dorow, Bhavika Reddy Jalli, Hojun Son, Guangyu Wang, Ronald Scott, and the other student annotators for collecting the keypoint dataset. This work was partially supported by the Denso Corporation, NSF CNS 1463102, and DARPA W31P4Q-16-C-0091.
This document constitutes the written portion of the supplementary material for Click Here: Human-Localized Keypoints as Guidance for Viewpoint Estimation. It is organized as follows:
Appendix B provides details for our CH-CNN architecture, including layer sizes and training parameters.
Appendix C describes how we collected and verified the CAD model keypoint annotations in our dataset.
Appendix D provides additional quantitative analysis. We analyze the $Acc_{\pi/6}$ and $MedErr$ evaluation metrics on our model and R4CNN in multiple ways, such as comparing performance by keypoint class and comparing accuracy over a range of thresholds. Additionally, we list the evaluation metrics on variations of our model that use different types of keypoint maps.
Appendix E provides additional qualitative results. First, we visualize challenging instances in which occlusion, truncation, and/or symmetry occur. Then, we visualize the weight maps produced by instances from each object class.
CH-CNN takes three inputs that are generated from the instance tuple $(I, k, c)$: the image input $I'$, the keypoint map $M$, and the keypoint class vector $e_c$. $I'$ is a 227 × 227 × 3 RGB image subtracted by the ImageNet image mean; $M$ is a 227 × 227 grayscale image whose values are produced by any method described in Appendix D; and $e_c$ is a 34-length vector (corresponding to 12 bus, 12 car, and 10 motorcycle keypoint classes) with value 1 at keypoint class index $c$ and zero elsewhere. The image stream of CH-CNN is implemented with the first seven layers from AlexNet using the reference architecture available in Caffe.
To obtain the keypoint map features $g_M$, the keypoint map is first downsampled with a max pooling layer with a stride and kernel size of 5. Then, the result is flattened and multiplied by the learned 2116 × 2116 matrix $W_M$. To obtain the keypoint class features $g_c$, the one-hot keypoint class vector is multiplied by the learned 34 × 34 matrix $W_c$. The concatenated vector is multiplied by the learned 169 × 2150 weight matrix $W_A$ to get a 169-length vector, to which a softmax and reshaping are applied to obtain a 13 × 13 weight map $A$ whose entries sum to 1 (13 × 13 comes from the height and width of the conv4 activation tensor, and 169 comes from their product). For the inference layers, the hidden activations $h$ are obtained with a learned non-linear fully-connected layer with 4096 outputs and ReLU as the non-linear activation function. Finally, the angle prediction layer for each angle $\alpha$ and object class $o$ takes $h$ and multiplies it by the learned 360 × 4096 matrix $W_{o,\alpha}$.
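The layer sizes quoted above follow from simple arithmetic, which the snippet below reproduces. It assumes that the pooling layer keeps partial windows at the border (ceiling rounding, which is how Caffe computes pooling output sizes); the helper name is hypothetical.

```python
import math

def pooled_size(input_size, kernel, stride):
    """Output side length of a pooling layer that keeps partial border windows."""
    return math.ceil((input_size - kernel) / stride) + 1

side = pooled_size(227, kernel=5, stride=5)   # 46
map_features = side * side                    # 2116, input size of the map transform
concat_features = map_features + 34           # 2150, map features + class features
weight_map_entries = 13 * 13                  # 169, one weight per conv4 depth column
```

With floor rounding instead, the pooled map would be 45 × 45 = 2025 entries, which would not match the 2116 × 2116 matrix stated in the text.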
The entire CH-CNN architecture is trained end-to-end with the Adam algorithm  while training on synthetic and real instances. In both cases, the batch size, base learning rate, first momentum rate, and second momentum rate are set to 192, , 0.9, and 0.999 respectively. It takes about 3 days on an NVIDIA Titan X Pascal GPU to train CH-CNN.
Collecting a large number of keypoint annotations efficiently requires a scalable and easily-accessible interface. To this end, we extended the open-source project cad.js (https://github.com/ghemingway/cad.js), a web-based interface and server for viewing 3D CAD models, to support keypoint annotation. Figure 8 shows screenshots of our keypoint annotation interface. When a CAD model is loaded, the user can navigate around the object via rotate, pan, and zoom operations with the mouse. At the bottom of the screen is a panel that describes the requested keypoint with visual examples and text. In order to label the keypoint, the user enters edit mode and drags a small sphere onto the appropriate location. The user cycles through and labels all keypoints for the model’s object class, then enters save mode to preview the annotations before saving them to the server. The preview displays all keypoints at once, with a line drawn from the textual label to the keypoint location; users can view individual keypoints by mousing over the textual label.
To obtain the 3D CAD models, we downloaded the bus, car, and motorcycle models from ShapeNet. We restricted ourselves to these three vehicle classes from PASCAL 3D+ due to cost constraints; however, we note that our methods extend naturally to any number of object classes. After filtering out models that did not capture realistic appearance (e.g., models without wheels, cars without bodies, etc.), which comprised about 1.8% of ShapeNet vehicle models, we were left with 918 bus, 7,377 car, and 320 motorcycle models to annotate.
We hired 10 student annotators over the course of one month to label the 3D CAD models with the semantic keypoints identified in the PASCAL 3D+ dataset. Although Xiang et al. identified keypoints meant to broadly describe the corresponding object classes, we found that not all CAD models contained all semantic keypoints (e.g., convertible cars do not have rear windows, so the “rear windshield corner” keypoints have no meaning for these models). In these cases, the annotators were instructed not to label those keypoints; as a result, some CAD models do not have an annotation for every keypoint for their object class.
To improve the quality and consistency of annotations, each CAD model was viewed by one annotator and one verifier—the annotator placed the labels, and the verifier checked that the labels were placed appropriately. The verifier sent the model back to the annotator if he/she disagreed with the locations; if they could not reach a mutual agreement on keypoint locations, we annotated the model ourselves.
We begin by exploring the impact of using four different types of keypoint maps on viewpoint estimation performance for CH-CNN. We produce unnormalized keypoint maps defined by the following four procedures:
Gaussian keypoint map. Given the keypoint location, the unnormalized keypoint map is a 2D Gaussian whose mean is the keypoint location and whose standard deviation is about 10% of the image width (23 pixels for our 227 × 227 images).
Euclidean distance transform keypoint map. Given the keypoint location, each entry in the keypoint map is the Euclidean distance from that entry's pixel position to the keypoint location.
Manhattan distance transform keypoint map. Given the keypoint location, each entry in the keypoint map is the Manhattan distance from that entry's pixel position to the keypoint location.
Chebyshev distance transform keypoint map. Given the keypoint location, each entry in the keypoint map is the Chebyshev distance from that entry's pixel position to the keypoint location.
We produce the final keypoint map from an unnormalized keypoint map by dividing by the maximum possible value over all possible keypoint locations (note that this is not necessarily the maximum value of the given unnormalized map). The performance of CH-CNN with each type of keypoint map is shown in Table 4.
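The four keypoint maps and the normalization step can be illustrated with the following sketch. It rests on several assumptions that the text does not spell out: keypoints are given as integer pixel coordinates `(kx, ky)` inside the image, the Gaussian is unscaled so that its peak value is 1, and the "maximum possible value" for the distance maps is reached when the keypoint sits in one corner and the entry in the opposite corner. All function names are hypothetical.

```python
import numpy as np

def unnormalized_keypoint_map(kind, kx, ky, size=227, sigma=23.0):
    """Build one of the four unnormalized maps for a keypoint at column kx, row ky."""
    ys, xs = np.mgrid[0:size, 0:size]           # row and column index grids
    dx, dy = np.abs(xs - kx), np.abs(ys - ky)   # per-axis offsets to the keypoint
    if kind == "gaussian":
        return np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))
    if kind == "euclidean":
        return np.sqrt(dx ** 2 + dy ** 2)
    if kind == "manhattan":
        return (dx + dy).astype(float)
    if kind == "chebyshev":
        return np.maximum(dx, dy).astype(float)
    raise ValueError(f"unknown keypoint map type: {kind}")

def max_possible_value(kind, size=227):
    """Largest value any entry can take over all keypoint locations (assumed)."""
    far = size - 1  # largest per-axis offset, corner to opposite corner
    return {
        "gaussian": 1.0,
        "euclidean": far * np.sqrt(2.0),
        "manhattan": 2.0 * far,
        "chebyshev": float(far),
    }[kind]

def keypoint_map(kind, kx, ky, size=227):
    """Final keypoint map: unnormalized map divided by the global maximum."""
    return unnormalized_keypoint_map(kind, kx, ky, size) / max_possible_value(kind, size)
```

Dividing by a constant that does not depend on the keypoint keeps the maps comparable across instances: a keypoint at the image center produces a Chebyshev map whose largest entry is only 0.5, whereas per-map normalization would stretch every map to the same range.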
Table 5 lists the performance of fine-tuned R4CNN and CH-CNN on each keypoint type, as well as the relative improvement of our model over fine-tuned R4CNN. From the overall relative improvement for all three object classes under both evaluation metrics, we observe that providing keypoint information generally increases viewpoint estimation performance over R4CNN. However, we also note that for the motorcycle class, some keypoint classes appear to confuse CH-CNN and yield a relative decrease in performance.
In Table 6, we compare the errors made by CH-CNN and R4CNN based on per-instance performance rather than aggregate performance. To do this, we compute the error for one instance from R4CNN, and subtract the corresponding value from CH-CNN, so that a positive difference means CH-CNN is closer to the ground truth on that instance. The values in Table 6 are the means of the resulting differences in error, stratified by keypoint class. From this table, we see that across most keypoint classes, CH-CNN on average predicts an angle closer to the ground truth than R4CNN. We see the same general trends as those seen in Table 5, such as performance varying depending on keypoint class and decreased performance for certain motorcycle keypoint classes.
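The per-instance comparison can be sketched as follows. This is only an illustration of the bookkeeping: the function name and the keypoint class labels in the usage note are hypothetical, and the errors are assumed to be given as parallel per-instance lists in the same units.

```python
from collections import defaultdict

def mean_error_difference_by_keypoint(r4cnn_errors, chcnn_errors, keypoint_classes):
    """Mean of (R4CNN error - CH-CNN error) per keypoint class.

    Positive values mean CH-CNN was closer to the ground truth on average.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for base_err, our_err, kp in zip(r4cnn_errors, chcnn_errors, keypoint_classes):
        sums[kp] += base_err - our_err
        counts[kp] += 1
    return {kp: sums[kp] / counts[kp] for kp in sums}
```

For example, calling the function with errors `[0.5, 0.25, 0.25, 1.0]` and `[0.25, 0.25, 0.5, 0.5]` and hypothetical labels `["wheel", "wheel", "seat", "wheel"]` gives a positive mean for `"wheel"` and a negative mean for `"seat"`, which is the kind of per-class sign change discussed for the motorcycle keypoints.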
In Figure 9, we plot the accuracy at a given threshold, defined as the fraction of test instances whose viewpoint error (in radians) is below that threshold, across multiple threshold values starting from 0. We also report the normalized area under the curve (nAUC), which is the percentage of the plotted area that falls under a given curve. From this graph, we notice that the gap in performance between CH-CNN and R4CNN widens considerably once the threshold is large enough. However, the two models perform similarly at very small threshold values, which suggests the need to focus on improvements at strict thresholds.
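A threshold-accuracy curve and its normalized area can be computed along these lines. The sketch makes two assumptions: the comparison against the threshold is strict, and the area is integrated with the trapezoidal rule and divided by the area of the full plot (threshold range times a maximum accuracy of 1). The function names are hypothetical, and the threshold range is left to the caller.

```python
import numpy as np

def accuracy_at_threshold(errors, theta):
    """Fraction of instances whose angular error (radians) is strictly below theta."""
    return float(np.mean(np.asarray(errors, dtype=float) < theta))

def accuracy_curve(errors, thetas):
    """Accuracy evaluated at each threshold in thetas."""
    return np.array([accuracy_at_threshold(errors, t) for t in thetas])

def normalized_auc(thetas, accuracies):
    """Percentage of the plotted area lying under the accuracy curve."""
    thetas = np.asarray(thetas, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    # Trapezoidal rule over the sampled thresholds.
    area = np.sum((accuracies[1:] + accuracies[:-1]) * np.diff(thetas)) / 2.0
    full_area = thetas[-1] - thetas[0]  # plot height is 1 (maximum accuracy)
    return 100.0 * area / full_area
```

Two models can then be compared by evaluating `accuracy_curve` on the same grid of thresholds and inspecting where the curves separate, alongside the single nAUC number for each.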
In this section, we present additional qualitative results. Figure 10 visualizes additional instances where a high degree of occlusion, truncation, or object symmetry is present. Figure 11 shows the attention maps that are generated from a test instance from each object class.