1 Introduction
Convolutional neural networks have shown that jointly optimizing feature extraction and classification pipelines can significantly improve object recognition lenet ; alexnet . Nevertheless, current approaches to geometric vision problems, such as 3D reconstruction phototourism and shape alignment li2009robust , comprise a separate keypoint detection module, followed by geometric reasoning as a postprocess. In this paper, we explore whether one can benefit from an end-to-end geometric reasoning framework, in which keypoints are jointly optimized as a set of latent variables for a downstream task.
Consider the problem of determining the 3D pose of a car in an image. A standard solution first detects a sparse set of category-specific keypoints, and then uses these points within a geometric reasoning framework (e.g., a PnP algorithm lepetit2008pnp ) to recover the 3D pose or camera angle. To this end, one can develop a set of keypoint detectors by leveraging strong supervision in the form of manual keypoint annotations in different images of an object category, or by using expensive and error-prone offline model-based fitting methods. Researchers have compiled large datasets of annotated keypoints for faces sagonas2016300 , hands tompson14tog , and human bodies andriluka14cvpr ; lin2014microsoft . However, selection and consistent annotation of keypoints in images of an object category is expensive and ill-defined. To devise a reasonable set of points, one should take into account the downstream task of interest. Directly optimizing keypoints for a downstream geometric task should naturally encourage desirable keypoint properties such as distinctiveness, ease of detection, and diversity.
This paper presents KeypointNet, an end-to-end geometric reasoning framework to learn an optimal set of category-specific 3D keypoints, along with their detectors, for a specific downstream task. Our approach stands in contrast to prior work that learns latent keypoints through an arbitrary proxy self-supervision objective, such as reconstruction zhang2018unsupervised ; hinton2011transforming . Our framework is applicable to any downstream task represented by an objective function that is differentiable with respect to keypoint positions. We formulate 3D pose estimation as one such task, and our key technical contributions include (1) a novel differentiable pose estimation objective and (2) a multi-view consistency loss function. The pose objective seeks optimal keypoints for recovering the relative pose between two views of an object. The multi-view consistency loss encourages consistent keypoint detections across 3D transformations of an object. Notably, we propose to detect 3D keypoints (2D points with depth) from individual 2D images and formulate pose and consistency losses for such 3D keypoint detections.
We show that KeypointNet discovers geometrically and semantically consistent keypoints across viewing angles as well as across object instances of a given class. Some of the discovered keypoints correspond to interesting and semantically meaningful parts, such as the wheels of a car, and we show how these 3D keypoints can infer their depths without access to object geometry. We conduct three sets of experiments on different object categories from the ShapeNet dataset chang2015shapenet . We evaluate our technique against a strongly supervised baseline based on manually annotated keypoints on the task of relative 3D pose estimation. Surprisingly, we find that our end-to-end framework achieves significantly better results, despite the lack of keypoint annotations.
2 Related Work
Both 2D and 3D keypoint detection are longstanding problems in computer vision, where keypoint inference is traditionally used as an early stage in object localization pipelines lepetit2006keypoint . As an example, a successful early application of modern convolutional neural networks (CNNs) was the detection of 2D human joint positions from monocular RGB images. Due to its compelling utility for HCI, motion capture, and security applications, a large body of work has since developed in this joint detection domain toshev2014deeppose ; tompson2014joint ; pishchulin16cvpr ; newell2016stacked ; yang2017learning ; papandreou2017towards ; huang2017coarse ; he2017mask .
More related to our work, a number of recent CNN-based techniques have been developed for 3D human keypoint detection from monocular RGB images, which use various architectures, supervised objectives, and 3D structural priors to directly infer a predefined set of 3D joint locations VNect_SIGGRAPH2017 ; mehta2017monocular ; chen2017adversarial ; mehta2017single ; guler2018densepose . Other techniques use inferred 2D keypoint detectors and learned 3D priors to perform “2D-to-3D lifting” ramakrishna2012reconstructing ; chen20173d ; zhou2016sparseness ; martinez2017simple or find data-to-model correspondences from depth images pons2015metric . Honari et al. honari2018improving improve landmark localization by incorporating semi-supervised tasks such as attribute prediction and equivariant landmark prediction. In contrast, our set of keypoints is not defined a priori and is instead a latent set that is optimized end-to-end to improve inference for a geometric estimation problem. A body of work also exists for more generalized, albeit supervised, keypoint detection, e.g., NIPS2012_4680 ; wu2016single .
Enforcing latent structure in CNN feature representations has been explored for a number of domains. For instance, the capsule framework hinton2011transforming and its variants sabour2017dynamic ; hinton2018matrix encode activation properties in the magnitude and direction of hidden-state vectors and then combine them to build higher-level features. The output of our KeypointNet can be seen as a similar form of latent 3D feature, which is encouraged to represent a set of 3D keypoint positions due to the carefully constructed consistency and relative pose objective functions.
Recent work has demonstrated 2D correspondence matching across intra-class instances with large shape and appearance variation. For instance, Choy et al. choy2016universal use a novel contrastive loss based on appearance to encode geometry and semantic similarity. Han et al. han2017scnet propose a novel SCNet architecture for learning a geometrically plausible model for 2D semantic correspondence. Wang et al. wang2017multi rely on deep features and perform multi-image matching across an image collection by solving a feature selection and labeling problem. Thewlis et al. thewlis2017unsupervised use ground-truth transforms (optical flow between image pairs) and pointwise matching to learn a dense object-centric coordinate frame with viewpoint and image deformation invariance. Similarly, Agrawal et al. agrawal2015learning use egomotion prediction between image pairs to learn semi-supervised feature representations, and show that these features are competitive with supervised features for a variety of tasks.
Other work has sought to learn latent 2D or 3D features with varying amounts of supervision. Arie-Nachimson & Basri constructing_implicit_iccv_09 build 3D models of rigid objects and exploit these models to estimate 3D pose from a 2D image as well as a collection of 3D latent features and visibility properties. Inspired by cycle consistency for learning correspondence huang2013consistent ; zhou2015multi , Zhou et al. zhou2016learning train a CNN to predict correspondence between different objects of the same semantic class by utilizing CAD models. Independent of our work, Zhang et al. zhang2018unsupervised discover sparse 2D landmarks in images of a known object class as an explicit structure representation through a reconstruction objective. Similarly, Jakab and Gupta et al. jakab2018conditional use conditional image generation and a reconstruction objective to learn 2D keypoints that capture geometric changes in training image pairs. Rhodin et al. rhodin2018unsupervised use a multi-view consistency loss, similar to ours, to infer 3D latent variables specifically for the human pose estimation task. In contrast to zhou2016learning ; zhang2018unsupervised ; jakab2018conditional ; rhodin2018unsupervised , our latent keypoints are optimized for a downstream task, which encourages more directed keypoint selection. By representing keypoints in true physical 3D structures, our method can even find occluded correspondences between images with large pose differences, e.g., large out-of-plane rotations.
Approaches for finding 3D correspondence have also been investigated. Salti et al. salti2015learning cast 3D keypoint detection as a binary classification between points whose ground-truth similarity label is determined by a predefined 3D descriptor. Zhou et al. zhou2017unsupervised use view-consistency as a supervisory signal to predict 3D keypoints, although only on depth maps. Similarly, Su et al. su2015render leverage synthetically rendered models to estimate object viewpoint by matching them to real-world images via a CNN viewpoint embedding. Besides keypoints, self-supervision based on geometric and motion reasoning has been used to predict other forms of output, such as 3D shape represented as blendshape coefficients for human motion capture tung2017self .
3 End-to-end Optimization of 3D Keypoints
Given a single image of a known object category, our model predicts an ordered list of 3D keypoints, defined as pixel coordinates and associated depth values. Such keypoints are required to be geometrically and semantically consistent across different viewing angles and instances of an object category (e.g., see Figure 4). Our KeypointNet has heads that extract keypoints, and the same head tends to extract 3D points with the same semantic interpretation. These keypoints will serve as a building block for feature representations based on a sparse set of points, useful for geometric reasoning and pose-aware or pose-invariant object recognition (e.g., sabour2017dynamic ).
In contrast to approaches that learn a supervised mapping from images to a list of annotated keypoint positions, we do not define the keypoint positions a priori. Instead, we jointly optimize keypoints with respect to a downstream task. We focus on the task of relative pose estimation at training time, where given two views of the same object with a known rigid transformation , we aim to predict optimal lists of 3D keypoints, and in the two views that best match one view to the other (Figure 1). We formulate an objective function , based on which one can optimize a parametric mapping from an image to a list of keypoints. Our objective consists of two primary components:

A multi-view consistency loss that measures the discrepancy between the two sets of points under the ground truth transformation.

A relative pose estimation loss, which penalizes the angular difference between the ground truth rotation and the rotation recovered from and using orthogonal Procrustes.
We demonstrate that these two terms allow the model to discover important keypoints, some of which correspond to semantically meaningful locations that humans would naturally select for different object classes. Note that we do not directly optimize for keypoints that are semantically meaningful, as those may be suboptimal for downstream tasks or simply hard to detect. In what follows, we first explain our objective function and then describe the neural architecture of KeypointNet.
Notation. Each training tuple comprises a pair of images of the same object from different viewpoints, along with their relative rigid transformation , which transforms the underlying 3D shape from to . has the following matrix form:
(1) 
where and represent a 3D rotation and translation respectively. We learn a function , parametrized by , that maps a 2D image to a list of 3D points where , by optimizing an objective function of the form .
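For reference, a rigid transformation built from a rotation and a translation is conventionally written as a 4×4 homogeneous matrix. The symbols $T$, $R$, and $t$ below are notation assumed here for illustration:

```latex
T = \begin{bmatrix} R & t \\ \mathbf{0}^\top & 1 \end{bmatrix},
\qquad R \in SO(3),\; t \in \mathbb{R}^{3}
```

Applying $T$ to a homogeneous point $[x, y, z, 1]^\top$ rotates the point by $R$ and then translates it by $t$.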
3.1 Multi-view consistency
The goal of our multi-view consistency loss is to ensure that the keypoints track consistent parts across different views. Specifically, a 3D keypoint in one image should project onto the same pixel location as the corresponding keypoint in the second image. For this task, we assume a perspective camera model with a known global focal length . Below, we use to denote 3D coordinates, and to denote pixel coordinates. The projection of a keypoint from image into image (and vice versa) is given by the projection operators:
where, for instance, denotes the projection of to the second view, and denotes the projection of to the first view. Here, represents the perspective projection operation that maps an input homogeneous 3D coordinate in camera coordinates to a pixel position plus depth:
(2) 
We define a symmetric multi-view consistency loss as:
(3) 
We measure error only in the observable image space as opposed to also using , because depth is never directly observed, and usually has different units compared to and . Note however that predicting is critical for us to be able to project points between the two views.
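The projection between views and the symmetric pixel-space error described in this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the pinhole model with the principal point at the origin, the convention that `T` maps camera coordinates of the first view to those of the second, and the exact normalization of the loss are all assumptions.

```python
import numpy as np

def unproject(uvz, f):
    """Invert the perspective projection: (u, v, z) -> camera-space (x, y, z)."""
    u, v, z = uvz[:, 0], uvz[:, 1], uvz[:, 2]
    return np.stack([u * z / f, v * z / f, z], axis=1)

def project(xyz, f):
    """Perspective projection: camera-space (x, y, z) -> (u, v, z)."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    return np.stack([f * x / z, f * y / z, z], axis=1)

def transfer(uvz, T, f):
    """Move keypoints to the other view: unproject, apply the 4x4 T, reproject."""
    xyz = unproject(uvz, f)
    xyz_h = np.concatenate([xyz, np.ones((len(xyz), 1))], axis=1)
    moved = (T @ xyz_h.T).T[:, :3]
    return project(moved, f)

def consistency_loss(uvz_a, uvz_b, T, f):
    """Symmetric squared pixel error; depth is used only for the transfer."""
    a_in_b = transfer(uvz_a, T, f)                 # view A keypoints seen in view B
    b_in_a = transfer(uvz_b, np.linalg.inv(T), f)  # view B keypoints seen in view A
    err_b = np.sum((uvz_b[:, :2] - a_in_b[:, :2]) ** 2, axis=1)
    err_a = np.sum((uvz_a[:, :2] - b_in_a[:, :2]) ** 2, axis=1)
    return 0.5 * np.mean(err_a + err_b)
```

Only the first two columns (pixel coordinates) enter the error, while the third column (depth) is needed to unproject each keypoint before applying the transformation.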
Enforcing multi-view consistency is sufficient to infer a consistent set of 2D keypoint positions (and depths) across different views. However, this consistency alone often leads to a degenerate solution where all keypoints collapse to a single location, which is not useful. One can encode an explicit notion of diversity to prevent collapsing, but there still exist infinitely many solutions that satisfy multi-view consistency. Rather, what we need is a notion of optimality for selecting keypoints, which has to be defined with respect to some downstream task. For that purpose, we use pose estimation as a task that naturally encourages keypoint separation so as to yield well-posed estimation problems.
3.2 Relative pose estimation
One important application of keypoint detection is to recover the relative transformation between a given pair of images. Accordingly, we define a differentiable objective that measures the misfit between the estimated relative rotation (computed via Procrustes alignment of the two sets of keypoints) and the ground truth . Given the translation equivariance property of our keypoint prediction network (Section 4) and the view consistency loss above, we omit the translation error in this objective. The pose estimation objective is defined as:
(4) 
which measures the angular distance between the optimal least-squares estimate computed from the two sets of keypoints, and the ground truth relative rotation matrix . Fortunately, we can formulate this objective in terms of fully differentiable operations.
To estimate , let and denote two matrices comprising unprojected 3D keypoint coordinates for the two views. In other words, let and , where returns the first 3 coordinates of its input. Similarly denotes unprojected points in . Let and denote the mean-subtracted versions of and , respectively. The optimal least-squares rotation between the two sets of keypoints is then given by:
(5) 
where . This estimation problem to recover is known as the orthogonal Procrustes problem schonemann1966procrustes . To ensure that is invertible and to increase the robustness of the keypoints, we add Gaussian noise to the 3D coordinates of the keypoints ( and ) and instead seek the best rotation under some noisy predictions of keypoints. To minimize the angular distance (4), we backpropagate through the SVD operator using matrix calculus ionescu2015matrix ; giles2008extended .
Empirically, the pose estimation objective helps significantly in producing a reasonable and natural selection of latent keypoints, leading to the automatic discovery of interesting parts such as the wheels of a car, the cockpit and wings of a plane, or the legs and back of a chair. We believe this is because these parts are geometrically consistent within an object class (e.g., circular wheels appear in all cars), easy to track, and spatially varied, all of which improve the performance of the downstream task.
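The orthogonal Procrustes step and the angular misfit described in Section 3.2 can be sketched in NumPy as follows. This is an illustrative sketch only: the function names are hypothetical, the solver is the standard SVD-based (Kabsch) solution with a determinant correction, and the angular distance uses one common closed form based on the Frobenius norm, which may differ in detail from the expression in Equation (4). The noise injection and the backpropagation through the SVD are omitted.

```python
import numpy as np

def procrustes_rotation(X, Y):
    """Least-squares rotation R (3x3) with R @ Xc ~ Yc, for 3xN point sets."""
    Xc = X - X.mean(axis=1, keepdims=True)  # mean-subtracted keypoints, view 1
    Yc = Y - Y.mean(axis=1, keepdims=True)  # mean-subtracted keypoints, view 2
    U, _, Vt = np.linalg.svd(Xc @ Yc.T)
    V = Vt.T
    # Flip the last axis if needed so that det(R) = +1 (a rotation, not a reflection).
    D = np.diag([1.0, 1.0, np.linalg.det(V @ U.T)])
    return V @ D @ U.T

def angular_distance(R_a, R_b):
    """Rotation angle (radians) between two rotations: ||R_a - R_b||_F = 2*sqrt(2)*sin(angle/2)."""
    s = np.linalg.norm(R_a - R_b, ord="fro") / (2.0 * np.sqrt(2.0))
    return 2.0 * np.arcsin(np.clip(s, 0.0, 1.0))
```

Because the point sets are mean-subtracted before the SVD, the recovered rotation does not depend on the translation between the two views, which is consistent with omitting the translation error from the pose objective.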
4 KeypointNet Architecture
One important property for the mapping from images to keypoints is translation equivariance at the pixel level. That is, if we shift the input image, e.g., to the left by one pixel, the output locations of all keypoints should also be changed by one unit. Training a standard CNN without this property would require a larger training set that contains objects at every possible location, while still providing no equivariance guarantees at inference time.
We propose the following simple modifications to achieve equivariance. Instead of regressing directly to the coordinate values, we ask the network to output a probability distribution map that represents how likely keypoint is to occur at pixel , with . We use a spatial softmax layer to produce such a distribution over image pixels goroshin2015learning . We then compute the expected values of these spatial distributions to recover a pixel coordinate:
(6) 
For the coordinates, we also predict a depth value at every pixel, denoted , and compute
(7) 
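The spatial softmax and the expectation step can be sketched in NumPy for a single keypoint. This is a minimal sketch with assumed names and conventions: `logits` and `depth` stand for the per-pixel network outputs for one keypoint, and `u` is taken as the column index and `v` as the row index.

```python
import numpy as np

def spatial_softmax(logits):
    """Softmax over all pixels of an (H, W) map, so the result sums to 1."""
    e = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return e / e.sum()

def expected_keypoint(logits, depth):
    """Return (u, v, z) as expectations under the spatial distribution."""
    H, W = logits.shape
    g = spatial_softmax(logits)
    vs, us = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    u = np.sum(us * g)     # expected column
    v = np.sum(vs * g)     # expected row
    z = np.sum(depth * g)  # expected depth
    return u, v, z
```

Shifting a sharply peaked `logits` map by one pixel shifts the expected coordinate by one unit, which is the pixel-level translation equivariance discussed above.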
To produce a probability map with the same resolution and equivariance property, we use stride-one fully convolutional architectures fcn , also used for semantic segmentation. To increase the receptive field of the network, we stack multiple layers of dilated convolutions, similar to wavenet .
Our emphasis on designing an equivariant network not only helps significantly reduce the number of training examples required to achieve good generalization, but also removes the computational burden of converting between two representations (spatial-encoded in image to value-encoded in coordinates) from the network, so that it can focus on other critical tasks such as inferring depth.
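To illustrate why stacking dilated convolutions enlarges the receptive field without reducing resolution, the following sketch computes the receptive field of a stack of stride-one layers. The kernel size and dilation rates in the usage note are hypothetical values chosen for illustration and are not the configuration used in this paper.

```python
def receptive_field(kernel_size, dilation_rates):
    """Receptive field (in pixels, along one axis) of stacked stride-one convolutions."""
    rf = 1
    for d in dilation_rates:
        # Each layer widens the field by (kernel_size - 1) taps spaced d pixels apart.
        rf += (kernel_size - 1) * d
    return rf
```

For example, with a hypothetical kernel size of 3, four layers with dilation rates 1, 2, 4, 8 see 31 pixels along each axis, whereas four undilated layers see only 9.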
Architecture details. All kernels for all layers are , and we stack layers of dilated convolutions with dilation rates of , all with output channels except the last layer which has output channels, split between and . We use leaky ReLU and Batch Normalization batchnorm for all layers except the last layer. The output layers for have no activation function, and the channels are passed through a spatial softmax to produce . Finally, and are then converted to actual coordinates using Equations (6) and (7).
Breaking symmetry. Many object classes are symmetric across at least one axis, e.g., the left side of a sedan looks like the right side flipped. This presents a challenge to the network because different parts can appear visually identical, and can only be resolved by understanding global context. For example, distinguishing the left wheels from the right wheels requires knowing the car's orientation (i.e., whether it is facing left or right). Both supervised and unsupervised techniques benefit from some global conditioning to aid in breaking ties and to make the keypoint prediction more deterministic.
To help break symmetries, one can condition the keypoint prediction on some coarse quantization of the pose. Such a coarse-to-fine approach to keypoint detection is discussed in more depth in tulsiani2015viewpoints . One simple such conditioning is a binary flag that indicates whether the dominant direction of an object is facing left or right. This dominant direction comes from the ShapeNet dataset we use (Section 6), where the 3D models are consistently oriented. To infer keypoints without this flag at inference time, we train a network with the same architecture, although half the size, to predict this binary flag.
In particular, we train this network to predict the projected pixel locations of two 3D points and , transformed into each view in a training pair. These points correspond to the front and back of a normalized object. This network has a single loss between the predicted and the ground-truth locations. The binary flag is 1 if the coordinate of the projected pixel of the first point is greater than that of the second point. This flag is then fed into the keypoint prediction network.
5 Additional Keypoint Characteristics
In addition to the main objectives introduced above, there are common, desirable characteristics of keypoints that can benefit many possible downstream tasks, in particular:

No two keypoints should share the same 3D location.

Keypoints should lie within the object’s silhouette.
Separation loss penalizes two keypoints if they are closer than a hyperparameter in 3D:
(8) 
Unlike the consistency loss, this loss is computed in 3D to allow multiple keypoints to occupy the same pixel location as long as they have different depths. We prefer a robust, bounded support loss over an unbounded one (e.g., exponential discounting) because it does not exhibit a bias towards certain structures, such as a honeycomb, or towards placing points infinitely far apart. Instead, it encourages the points to be sufficiently far from one another.
Ideally, a well-distributed set of keypoints will automatically emerge without constraining the distance of keypoints. However, in the absence of keypoint location supervision, our objective with latent keypoints can converge to a local minimum with two keypoints collapsing to one. The main goal of this separation loss is to prevent such degenerate cases, and not to directly promote separation.
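A bounded hinge penalty of the kind described above can be sketched as follows. The function name, the use of squared distances, and the normalization by the square of the number of keypoints are assumptions made for this illustration; `delta` stands for the separation hyperparameter.

```python
import numpy as np

def separation_loss(X, delta):
    """Hinge penalty on squared 3D distances below delta**2. X has shape (N, 3)."""
    N = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)            # (N, N) pairwise squared distances
    penalty = np.maximum(0.0, delta ** 2 - sq_dist)
    np.fill_diagonal(penalty, 0.0)                  # a point is not penalized against itself
    return penalty.sum() / N ** 2
```

Pairs that are already farther apart than `delta` contribute nothing, so the loss stops pushing points once they are sufficiently separated, and two keypoints at the same pixel but different depths incur no penalty when their depth gap exceeds `delta`.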
Silhouette consistency encourages the keypoints to lie within the silhouette of the object of interest. As described above, our network predicts coordinates of the keypoint via a spatial distribution, denoted , over possible keypoint positions. One way to ensure silhouette consistency is by only allowing a nonzero probability inside the silhouette of the object, as well as encouraging the spatial distribution to be concentrated, i.e., unimodal with a low variance.
During training, we have access to the binary segmentation mask of the object in each image, where means foreground object. The silhouette consistency loss is defined as
(9) 
Note that this binary mask is only used to compute the loss and not used at inference time. This objective incurs a zero cost if all of the probability mass lies within the silhouette. We also include a term to minimize the variance of each of the distribution maps:
(10) 
This term encourages the distributions to be peaky, which has the added benefit of helping keep their means within the silhouette in the case of nonconvex object boundaries.
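The two terms of this section can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' code: the function names, the negative-log form of the silhouette term, the small `eps` added for numerical safety, and the averaging over keypoints are assumptions. Here `g` holds one normalized probability map per keypoint and `mask` is the binary foreground mask.

```python
import numpy as np

def silhouette_loss(g, mask, eps=1e-12):
    """Mean over keypoints of -log(probability mass inside the mask).
    g: (N, H, W), each map sums to 1. mask: (H, W) with values in {0, 1}."""
    inside = np.sum(g * mask[None], axis=(1, 2))
    return np.mean(-np.log(inside + eps))

def variance_loss(g):
    """Mean over keypoints of the spatial variance of each probability map."""
    H, W = g.shape[1:]
    vs, us = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    mu_u = np.sum(g * us, axis=(1, 2))  # expected column per keypoint
    mu_v = np.sum(g * vs, axis=(1, 2))  # expected row per keypoint
    sq = (us[None] - mu_u[:, None, None]) ** 2 + (vs[None] - mu_v[:, None, None]) ** 2
    return np.mean(np.sum(g * sq, axis=(1, 2)))
```

A map whose mass lies entirely inside the mask gives a silhouette term of approximately zero, and a map concentrated on a single pixel gives a variance term of exactly zero.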
6 Experiments
Training data. Our training data is generated from ShapeNet chang2015shapenet , a large-scale database of approximately 51K 3D models across 270 categories. We create separate training datasets for various object categories, including car, chair, and plane. For each model in each category, we normalize the object so that the longest dimension lies in , and render 200 images of size under different viewpoints to form 100 training pairs. The camera viewpoints are randomly sampled around the object from a fixed distance, all above the ground with zero roll angle. We then add small random shifts to the camera positions.
Implementation details. We implemented our network in TensorFlow tensorflow2015whitepaper , and trained with the Adam optimizer with a learning rate of , and a total batch size of . We use the following weights for the losses: . We train the network for steps using synchronous training with replicas.
6.1 Comparison with a supervised approach
To evaluate against a supervised approach, we collected human landmark labels for three object categories (cars, chairs, and planes) from ShapeNet using Amazon Mechanical Turk. For each object, we ask three different users to click on points corresponding to reference points shown as an example to the user. These reference points are based on the Pascal3D+ dataset (12 points for cars, 10 for chairs, 8 for planes). We render the object from multiple views so that each specified point is facing outward from the screen. We then compute the average pixel location over user annotations for each keypoint, and triangulate corresponding points across views to obtain 3D keypoint coordinates.
For each category, we train a network with the same architecture as in Section 4 using the supervised labels to output keypoint locations in normalized coordinates , as well as depths, using an loss to the human labels. We then compute the angular distance error on 10% of the models for each category held out as a test set. (This test set corresponds to 720 models of cars, 200 chairs, and 400 planes. Each individual model produces 100 test image pairs.) In Figure 2, we plot the histograms of angular errors of our method vs. the supervised technique trained to predict the same number of keypoints, and show error statistics in Table 1. For a fair comparison against the supervised technique, we provide an additional orientation flag to the supervised network. This is done by training another version of the supervised network that receives the orientation flag predicted from a pretrained orientation network. Additionally, we tested a more comparable version of our unsupervised network where we use and fix the same pretrained orientation network during training. The mean and median accuracy of the predicted orientation flags on the test sets are as follows: cars: (, ), planes: (, ), chairs: (, ).
Our unsupervised technique produces lower mean and median rotation errors than both versions of the supervised technique. Note that our technique sometimes predicts keypoints that are flipped relative to the correct orientation because the orientation prediction is incorrect.


Method | Cars Mean | Cars Median | Cars 3DSE | Planes Mean | Planes Median | Planes 3DSE | Chairs Mean | Chairs Median | Chairs 3DSE
a) Supervised | 16.268 | 5.583 | 0.240 | 18.350 | 7.168 | 0.233 | 21.882 | 8.771 | 0.269
b) Supervised with pretrained ONet | 13.961 | 4.475 | 0.197 | 17.800 | 6.802 | 0.230 | 20.502 | 8.261 | 0.248
c) Ours with pretrained ONet | 13.500 | 4.418 | 0.165 | 18.561 | 6.407 | 0.223 | 14.238 | 5.607 | 0.203
d) Ours | 11.310 | 3.372 | 0.171 | 17.330 | 5.721 | 0.230 | 14.572 | 5.420 | 0.196

Table 1: Mean and median angular distance errors between the ground-truth rotation and the Procrustes estimate computed from two sets of predicted keypoints on test pairs. ONet is the network that predicts a binary orientation. 3DSE is the standard error described in Section 6.1.
Keypoint location consistency. To evaluate the consistency of predicted keypoints across views, we transform the keypoints predicted for the same object under different views to object space using the known camera matrices used for rendering. Then we compute the standard error of 3D locations for all keypoints across all test cars (3DSE in Table 1). To disregard outliers when the network incorrectly infers the orientation, we compute this metric only for keypoints whose error in rotation estimate is less than (left halves of the histograms in Figure 2), for both the supervised method and our unsupervised approach.
6.2 Generalization across views and instances
In this section, we show qualitative results of our keypoint predictions on test cars, chairs, and planes using a default number of 10 keypoints for all categories. (We show results with varying numbers of keypoints in the Appendix.) In Figure 4, we show keypoint prediction results on single objects from different views. Some of these views are quite challenging, such as the top-down view of the chair. However, our network is able to infer the orientation and predict occluded parts such as the chair legs. In Figure 4, we run our network on many instances of test objects. Note that during training, the network only sees a pair of images of the same model, but it is able to utilize the same keypoints for semantically similar parts across all instances from the same class. For example, the blue keypoints always track the cockpit of the planes. In contrast to prior work thewlis2017unsupervised ; hinton2011transforming ; zhang2018unsupervised that learns latent representations by training with restricted classes of transformations, such as affine or 2D optical flow, and demonstrates results on images with small pose variations, we learn through physical 3D transformations and are able to produce a consistent set of 3D keypoints from any angle. Our method can also be used to establish correspondence between two views under out-of-plane or even 180° rotations when there is no visual overlap.
Failure cases. When our orientation network fails to predict the correct orientation, the output keypoints will be flipped as shown in Figure 5. This happens for cars whose front and back look very similar, or for unusual wing shapes that make inference of the dominant direction difficult.
7 Discussion & Future work
We explore the possibility of optimizing a representation based on a sparse set of keypoints or landmarks, without access to keypoint annotations, but rather based on an end-to-end geometric reasoning framework. We show that, indeed, one can discover consistent keypoints across multiple views and object instances by adopting two novel objective functions: a relative pose estimation loss and a multi-view consistency objective. Our translation equivariant architecture is able to generalize to unseen object instances of ShapeNet categories chang2015shapenet . Importantly, our discovered keypoints outperform those from a direct supervised learning baseline on the problem of rigid 3D pose estimation.
We present preliminary results on the transfer of the learned keypoint detectors to real-world images by training on ShapeNet images with random backgrounds (see supplemental material). Further improvements may be achieved by leveraging recent work in domain adaptation johnson2017driving ; tremblay2018training ; tobin2017domain ; bousmalis2017unsupervised ; tzeng2017adversarial . Alternatively, one can train KeypointNet directly on real images given relative pose labels. Such labels may be estimated automatically using Structure-from-Motion longuet1981computer . Another interesting direction would be to jointly solve for the relative transformation or rely on a coarse pose initialization, inspired by triggs1999bundle , to extend this framework to objects that lack 3D models or pose annotations.
Our framework could also be extended to handle an arbitrary number of keypoints. For example, one could predict a confidence value for each keypoint, then threshold to identify distinct ones, while using a loss that operates on unordered sets of keypoints. Visual descriptors could also be incorporated under our framework, either through a postprocessing task or via joint endtoend optimization of both the detector and the descriptor.
8 Acknowledgement
We would like to thank Chi Zeng, who helped set up the Mechanical Turk tasks for our evaluations.
References

 (1) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems, 2015.
 (2) Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. ICCV, 2015.
 (3) Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2D human pose estimation: New benchmark and state of the art analysis. CVPR, 2014.
 (4) M. Arie-Nachimson and R. Basri. Constructing implicit 3D shape models for pose estimation. ICCV, 2009.
 (5) Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. CVPR, 2017.
 (6) Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. arXiv:1512.03012, 2015.
 (7) Ching-Hang Chen and Deva Ramanan. 3D human pose estimation = 2D pose estimation + matching. CVPR, 2017.
 (8) Yu Chen, Chunhua Shen, Xiu-Shen Wei, Lingqiao Liu, and Jian Yang. Adversarial learning of structure-aware fully convolutional networks for landmark localization. arXiv:1711.00253, 2017.
 (9) Christopher B Choy, JunYoung Gwak, Silvio Savarese, and Manmohan Chandraker. Universal correspondence network. NIPS, 2016.
 (10) Mike Giles. An extended collection of matrix derivative results for forward and reverse mode automatic differentiation. Oxford University, 2008.
 (11) Ross Goroshin, Michael F Mathieu, and Yann LeCun. Learning to linearize under uncertainty. NIPS, 2015.
 (12) Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. arXiv:1802.00434, 2018.
 (13) Kai Han, Rafael S Rezende, Bumsub Ham, Kwan-Yee K Wong, Minsu Cho, Cordelia Schmid, and Jean Ponce. SCNet: Learning semantic correspondence. ICCV, 2017.
 (14) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. ICCV, 2017.
 (15) Mohsen Hejrati and Deva Ramanan. Analyzing 3d objects in cluttered images. NIPS, 2012.
 (16) Geoffrey Hinton, Nicholas Frosst, and Sara Sabour. Matrix capsules with EM routing. ICLR, 2018.
 (17) Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming autoencoders. Int. Conf. on Artificial Neural Networks, 2011.

 (18) Sina Honari, Pavlo Molchanov, Stephen Tyree, Pascal Vincent, Christopher Pal, and Jan Kautz. Improving landmark localization with semi-supervised learning. CVPR, 2018.
 (19) Qi-Xing Huang and Leonidas Guibas. Consistent shape maps via semidefinite programming. Computer Graphics Forum, 2013.
 (20) Shaoli Huang, Mingming Gong, and Dacheng Tao. A coarse-fine network for keypoint localization. ICCV, 2017.
 (21) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
 (22) Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Matrix backpropagation for deep networks with structured layers. ICCV, 2015.
 (23) Tomas Jakab, Ankush Gupta, Hakan Bilen, and Andrea Vedaldi. Conditional image generation for learning the structure of visual objects. arXiv preprint arXiv:1806.07823, 2018.
 (24) Matthew Johnson-Roberson, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan. Driving in the Matrix: Can virtual worlds replace human-generated annotations for real world tasks? ICRA, pages 746–753, 2017.
 (25) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.
 (26) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
 (27) Vincent Lepetit and Pascal Fua. Keypoint recognition using randomized trees. IEEE Trans. PAMI, 28(9):1465–1479, 2006.
 (28) Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. IJCV, 2008.
 (29) Yan Li, Leon Gu, and Takeo Kanade. A robust shape model for multi-view car alignment. CVPR, 2009.
 (30) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. ECCV, 2014.
 (31) Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. CVPR, 2015.
 (32) H Christopher Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293(5828):133, 1981.
 (33) Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3D human pose estimation. ICCV, 2017.
 (34) Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3D human pose estimation in the wild using improved CNN supervision. 3DV, 2017.
 (35) Dushyant Mehta, Oleksandr Sotnychenko, Franziska Mueller, Weipeng Xu, Srinath Sridhar, Gerard Pons-Moll, and Christian Theobalt. Single-shot multi-person 3D body pose estimation from monocular RGB input. arXiv:1712.03453, 2017.
 (36) Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera. ACM Transactions on Graphics, 2017.
 (37) Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. ECCV, 2016.
 (38) George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. Towards accurate multi-person pose estimation in the wild. arXiv:1701.01779, 2017.
 (39) Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter Gehler, and Bernt Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. CVPR, June 2016.
 (40) Gerard Pons-Moll, Jonathan Taylor, Jamie Shotton, Aaron Hertzmann, and Andrew Fitzgibbon. Metric regression forests for correspondence estimation. IJCV, 113(3):163–175, 2015.
 (41) Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Reconstructing 3d human pose from 2d image landmarks. ECCV, 2012.
 (42) Helge Rhodin, Mathieu Salzmann, and Pascal Fua. Unsupervised geometry-aware representation for 3d human pose estimation. arXiv preprint arXiv:1804.01110, 2018.
 (43) Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. NIPS, 2017.
 (44) Christos Sagonas, Epameinondas Antonakos, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300 faces in-the-wild challenge: Database and results. Image and Vision Computing, 2016.
 (45) Samuele Salti, Federico Tombari, Riccardo Spezialetti, and Luigi Di Stefano. Learning a descriptor-specific 3d keypoint detector. ICCV, 2015.
 (46) Peter Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 1966.
 (47) Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3D. ACM transactions on graphics (TOG), 2006.
 (48) Hao Su, Charles R Qi, Yangyan Li, and Leonidas J Guibas. Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. ICCV, 2015.
 (49) James Thewlis, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object frames by dense equivariant image labelling. NIPS, 2017.
 (50) Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. IROS, pages 23–30, 2017.
 (51) Jonathan Tompson, Murphy Stein, Yann LeCun, and Ken Perlin. Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics, 33, 2014.
 (52) Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. NIPS, 2014.
 (53) Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. CVPR, 2014.
 (54) Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, and Stan Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. arXiv preprint arXiv:1804.06516, 2018.
 (55) Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. International workshop on vision algorithms, pages 298–372, 1999.
 (56) Shubham Tulsiani and Jitendra Malik. Viewpoints and keypoints. CVPR, 2015.
 (57) Hsiao-Yu Tung, Hsiao-Wei Tung, Ersin Yumer, and Katerina Fragkiadaki. Self-supervised learning of motion capture. In Advances in Neural Information Processing Systems, pages 5236–5246, 2017.
 (58) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. CVPR, 2017.
 (59) Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv:1609.03499, 2016.
 (60) Qianqian Wang, Xiaowei Zhou, and Kostas Daniilidis. Multi-image semantic matching by mining consistent features. arXiv preprint arXiv:1711.07641, 2017.
 (61) Jiajun Wu, Tianfan Xue, Joseph J Lim, Yuandong Tian, Joshua B Tenenbaum, Antonio Torralba, and William T Freeman. Single Image 3D Interpreter Network. ECCV, 2016.

 (62) Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. CVPR, 2010.
 (63) Wei Yang, Shuang Li, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Learning feature pyramids for human pose estimation. ICCV, 2017.
 (64) Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, and Honglak Lee. Unsupervised discovery of object landmarks as structural representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2694–2703, 2018.
 (65) Tinghui Zhou, Philipp Krahenbuhl, Mathieu Aubry, Qixing Huang, and Alexei A Efros. Learning dense correspondence via 3D-guided cycle consistency. CVPR, 2016.
 (66) Xiaowei Zhou, Menglong Zhu, and Kostas Daniilidis. Multi-image matching via fast alternating minimization. CVPR, 2015.
 (67) Xiaowei Zhou, Menglong Zhu, Spyridon Leonardos, Konstantinos G Derpanis, and Kostas Daniilidis. Sparseness meets deepness: 3D human pose estimation from monocular video. CVPR, 2016.
 (68) Xingyi Zhou, Arjun Karpur, Chuang Gan, Linjie Luo, and Qixing Huang. Unsupervised domain adaptation for 3d keypoint prediction from a single depth scan. arXiv:1712.05765, 2017.
Appendix A Histograms for individual categories
We show histograms similar to Figure 2 in the paper for individual object categories.
Appendix B Ablation study
We present an ablation study for the primary losses as well as how their weights affect the results (Figure 7).
Removing multi-view consistency loss. Without this loss, some keypoints drift as the viewing angle changes and do not lock onto any particular part of the object. On its own, the pose estimation loss may provide a strong gradient for only a subset of the keypoints, since it is satisfied as soon as those keypoints yield a good rotation estimate; it does not explicitly force every keypoint to be consistent across views.
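As an illustration of what a per-keypoint consistency penalty looks like, the sketch below maps the keypoints predicted in one view into the other view with the known relative pose and penalizes the residual of every keypoint individually. This is a simplified 3D version written for this appendix under our own assumptions (function name, camera-frame 3D coordinates, squared Euclidean penalty); the loss defined in the main text may differ, for example by measuring the error after projection to pixels.

```python
import numpy as np


def multiview_consistency(kp_a, kp_b, rotation, translation):
    """Mean squared distance between view-A keypoints mapped into view B
    and the keypoints predicted directly in view B.

    kp_a, kp_b: (n, 3) keypoints in the camera frames of views A and B.
    rotation (3, 3) and translation (3,) describe the known rigid motion
    from frame A to frame B (hypothetical inputs for this sketch).
    """
    kp_a = np.asarray(kp_a, dtype=float)
    kp_b = np.asarray(kp_b, dtype=float)
    mapped = kp_a @ np.asarray(rotation, dtype=float).T + translation
    # Every keypoint contributes its own residual, so no keypoint is
    # free to drift between views.
    return float(np.mean(np.sum((mapped - kp_b) ** 2, axis=1)))


# 90-degree rotation about the z axis plus a shift along z.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t = np.array([0.0, 0.0, 1.0])
kp_a = np.eye(3)
kp_b = kp_a @ R.T + t  # perfectly consistent predictions
print(multiview_consistency(kp_a, kp_b, R, t))
```

In contrast to a rotation-estimation objective, which can be satisfied by a few well-placed keypoints, this penalty is nonzero whenever any single keypoint moves relative to the object.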
Pose estimation loss & noise. Removing the pose estimation loss completely leads the network to place keypoints near the center of the object, which is the region with the least rotational motion and thus the least pixel displacement across views. Increasing the noise added to the keypoints before rotation estimation encourages the keypoints to spread away from the center.
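The effect of noise on keypoint spread can be reproduced with a small numerical experiment. The sketch below estimates a rotation between two keypoint sets with the SVD-based orthogonal Procrustes solution [46] and measures the angular error when Gaussian noise is added to both sets. The helper names, the noise level, and the use of the geodesic angle as the error measure are assumptions made for this illustration, not the exact training setup.

```python
import numpy as np


def estimate_rotation(x, y):
    """Least-squares rotation R such that y_i is close to R @ x_i.

    x, y: (n, 3) corresponding point sets. Uses the SVD solution of the
    orthogonal Procrustes problem with a determinant correction so that
    the result is a proper rotation (no reflection).
    """
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    u, _, vt = np.linalg.svd(xc.T @ yc)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T


def rotation_angle_error(r_est, r_true):
    """Geodesic angle (radians) between two rotation matrices."""
    cos = (np.trace(r_est.T @ r_true) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def rotation_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def mean_error_under_noise(x, r_true, sigma, trials, rng):
    """Average rotation error when noise of scale sigma perturbs both sets."""
    errors = []
    for _ in range(trials):
        xa = x + rng.normal(scale=sigma, size=x.shape)
        xb = x @ r_true.T + rng.normal(scale=sigma, size=x.shape)
        errors.append(rotation_angle_error(estimate_rotation(xa, xb), r_true))
    return float(np.mean(errors))


rng = np.random.default_rng(0)
r_true = rotation_z(0.7)
spread_out = rng.normal(size=(8, 3))  # keypoints far from the center
clustered = 0.01 * spread_out         # same layout, collapsed to the center
for name, pts in [("spread", spread_out), ("clustered", clustered)]:
    err = mean_error_under_noise(pts, r_true, sigma=0.05, trials=50, rng=rng)
    print(name, err)
```

Under the same noise level, the clustered configuration yields a far less reliable rotation estimate than the spread-out one, which is why adding noise before rotation estimation pushes keypoints away from the center.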
Removing silhouette consistency. Without this loss, the keypoints can lie outside the object. Interestingly, they still satisfy multi-view consistency and occupy positions in a virtual 3D space that rotates with the object.
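A silhouette term of this kind can be sketched as follows, assuming that each keypoint is represented by a spatial probability map over the image and that a binary object mask is available. The penalty is the negative log of the probability mass that falls inside the mask. The function name, the epsilon constant, and the averaging over keypoints are assumptions of this sketch rather than a restatement of the exact loss in the main text.

```python
import numpy as np


def silhouette_loss(prob_maps, mask, eps=1e-8):
    """Penalize keypoint probability mass that falls outside the object.

    prob_maps: (k, h, w) array, one spatial distribution per keypoint,
               each summing to 1 over its (h, w) grid.
    mask:      (h, w) binary silhouette, 1 inside the object.
    eps:       small constant (assumed) that keeps the log finite.
    """
    prob_maps = np.asarray(prob_maps, dtype=float)
    mask = np.asarray(mask, dtype=float)
    # Probability mass inside the silhouette, one value per keypoint.
    inside = (prob_maps * mask[None, :, :]).sum(axis=(1, 2))
    return float(np.mean(-np.log(inside + eps)))


mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0

on_object = np.zeros((1, 4, 4))
on_object[0, 1, 1] = 1.0   # all mass inside the silhouette
off_object = np.zeros((1, 4, 4))
off_object[0, 0, 0] = 1.0  # all mass outside the silhouette

print(silhouette_loss(on_object, mask), silhouette_loss(off_object, mask))
```

The loss is near zero when a keypoint sits inside the silhouette and grows large when it sits outside, which matches the behaviour observed when the term is removed.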
Appendix C Results on deformed object
To evaluate the robustness of the keypoints under shape variations such as the length of the car, and to test whether the network detects local parts from local features rather than placing keypoints on a fixed rigid structure, we run our network on a non-rigidly deformed car (Figure 8). The network is able to predict where the wheels are and to follow the overall deformation of the car structure.
Appendix D Results using different numbers of keypoints
We trained our network with varying numbers of keypoints. The network starts by discovering the most prominent components, such as the head and wings, and gradually tracks more parts as the number of keypoints increases.
Appendix E Proof-of-concept results on real-world images
To predict keypoints on real images, we train our network by adding random backgrounds, taken from the SUN397 dataset [62], to our rendered training examples. Surprisingly, this simple modification allows the network to predict keypoints on some cars in ImageNet. We show a few hand-picked results as well as some failure cases in Figure 10. The network has particular difficulty with large perspective distortion and with cars that have strong patterns or specular highlights.
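The background augmentation amounts to alpha compositing a rendered object over a randomly cropped background image. The sketch below shows one way to do this, assuming the renderer outputs an RGBA image whose alpha channel marks the object; the function name, the array layout, and the uniform random crop are assumptions of this illustration and not a description of the exact pipeline.

```python
import numpy as np


def composite_on_background(rgba, background, rng):
    """Paste a rendered RGBA image over a random crop of a background.

    rgba:       (h, w, 4) uint8 rendering; alpha is 255 on the object.
    background: (H, W, 3) uint8 image with H >= h and W >= w.
    rng:        numpy random Generator used to pick the crop location.
    """
    h, w = rgba.shape[:2]
    bh, bw = background.shape[:2]
    if bh < h or bw < w:
        raise ValueError("background must be at least as large as the rendering")
    # Upper bound of Generator.integers is exclusive.
    top = rng.integers(0, bh - h + 1)
    left = rng.integers(0, bw - w + 1)
    crop = background[top:top + h, left:left + w, :3].astype(float)
    alpha = rgba[..., 3:4].astype(float) / 255.0
    blended = alpha * rgba[..., :3].astype(float) + (1.0 - alpha) * crop
    return np.round(blended).astype(np.uint8)


rng = np.random.default_rng(0)
render = np.zeros((4, 4, 4), dtype=np.uint8)
render[1:3, 1:3] = [255, 0, 0, 255]          # opaque red object in the middle
scene = np.full((8, 8, 3), 10, dtype=np.uint8)  # stand-in for a SUN397 image
image = composite_on_background(render, scene, rng)
print(image[1, 1], image[0, 0])
```

Because the keypoint supervision depends only on the rendered geometry, the labels and camera poses of a training example are unchanged by this operation.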