1 Introduction
Semantic correspondence for general objects is an important research area in machine vision. Understanding how different objects of the same category relate is still a challenging topic. Quite a few methods tackle this problem in the 2D image domain: [choy2016universal, han2017scnet, kim2017fcss, horn1981determining, liu2010sift] propose to match local regions between pairs of images, while [rocco2017convolutional, rocco2018end, seo2018attentive, ham2017proposal] treat it as a global image alignment problem. However, these works all operate on 2D image features, and very few focus on the internal 3D structures of the images to be matched. We argue that by explicitly exploiting the 3D structures of objects, one can easily infer self-occlusion and spatial relationships. This idea is explored in some recent works [kulkarni2019canonical, zhou2016learning], which use a 3D model as an intermediate medium. However, they assume that a single template model fits all images, which does not hold in most cases.
To solve these problems, we propose a novel semantic transfer method that predicts 3D structure from a single RGB image and then projects 3D semantic labels back onto the 2D image plane. The 2D-to-3D shape prediction is inspired by Wu et al. [wu2018learning]. For the 3D-to-2D projection, we estimate viewpoints directly from 2D images and then leverage a 3D semantic prediction model trained on KeypointNet [you2020keypointnet] to produce 3D semantic labels together with their 2D projections. Viewpoints are further fine-tuned with a differentiable renderer. The main advantages of this method lie in two aspects. 1) The amount of training data required is reduced drastically. Previous 2D image transfer networks require numerous images per object class in order to extract robust semantic features; these images must cover objects of different shapes viewed from different angles. In contrast, if we can infer 3D structures from 2D images, all we need is to utilize labels on existing 3D models and project them onto 2D image planes. One may consider 3D structure inference a data-heavy task, but virtual 2D images can be generated from 3D models on the fly, as done by Wu et al. [wu2018learning]. 2) Visibility reasoning is made explicit. When we project semantic labels onto 2D image planes, points on the back are naturally culled. In the 2D image domain, visibility is handled implicitly by 2D CNNs, which makes it hard to interpret. As shown in Figure 1, for direct 2D-2D methods, the generated 2D warping from source to target does not account for self-occlusion and may be erroneous. Semantic transfer becomes much easier if we first estimate the corresponding 3D models and camera poses.
Although this idea is appealing, there are still many challenges in this 2D-3D-2D cycle. Compared with directly learning correspondence maps from 2D images, our pipeline is more involved: each stage incurs some error, and the final result can be biased through error accumulation. To this end, we propose a camera pose estimation module fine-tuned by a differentiable renderer. The final result is competitive and even outperforms state-of-the-art methods on some benchmarks. In addition, we conduct comprehensive and detailed experiments to quantify the effectiveness of each stage in the proposed 2D-3D-2D cycle. We hope this analysis helps future research further improve 2D-3D, 3D-2D, or 2D-3D-2D predictions.
Our main contributions are listed below:
-
We propose a novel 2D-3D-2D pipeline that solves the 2D semantic correspondence problem by lifting it to the 3D domain.
-
Our proposed method sets a new state of the art on several semantic correspondence benchmarks.
-
We conduct detailed and comprehensive experiments to decompose each stage's/part's effect on the accuracy of the final results.
-
We will make all our code, together with detailed experiment results, publicly available.
2 Related Work
2.1 2D Semantic Correspondence
Image semantic correspondence has a long history dating back to optical flow [horn1981determining] and multi-baseline stereo [okutomi1993multiple]. More recently, local-descriptor-based methods like proposal flow [ham2017proposal] and SIFT flow [liu2010sift] have been explored to find dense correspondences across different objects. With the advance of deep learning, neural features [hariharan2015hypercolumns, lin2017feature, kong2016hypernet] are broadly used as they are more robust and generalizable. Methods like A2Net [seo2018attentive], NC-Net [rocco2018neighbourhood] and HPF [min2019hyperpixel] view semantic correspondence as a matching problem in high-dimensional feature maps. In addition, [florence2018dense, schmidt2016self] leverage unsupervised methods to learn consistent dense embeddings across different objects with the help of SLAM.
2.2 3D Semantic Correspondence
[allen2003space, blanz1999morphable] are pioneers in detecting 3D semantic correspondences between human bodies and faces. Recently, [halimi2018self, roufosse2019unsupervised, groueix20183d] propose unsupervised methods for learning dense correspondences between humans and animals. With the help of recent large-scale model datasets such as ShapeNet [chang2015shapenet] and PartNet [mo2019partnet], finding semantic correspondences on general objects becomes possible. Deep functional dictionaries [sung2018deep] and SyncSpecCNN [yi2017syncspeccnn] both learn a set of synchronized basis functions in order to obtain dense correspondences from functional maps. In addition to ShapeNet, [pavlakos20176, kim2013learning, you2019fine, you2020keypointnet] provide additional keypoint or correspondence annotations for semantic object understanding.
CSM [kulkarni2019canonical] and Zhou et al. [zhou2016learning] are perhaps the closest to this paper. However, they assume that a single template 3D model fits all images well, which makes them not directly applicable to categories where shapes differ significantly in topology across instances or undergo large articulation. Besides, they infer 3D models implicitly by generating a 2D-to-3D pixel map, whereas we explicitly predict each image's corresponding 3D shape.
2.3 Single View Shape Reconstruction
Recently, many works have addressed single-view shape reconstruction. Among supervised methods, where a ground-truth model is available, PSGN [fan2017point] and the pseudo-renderer [lin2018learning] reconstruct point clouds from single-view RGB images. Front2back [yao2019front2back] predicts per-pixel depth, which is then converted into a point cloud. [wu2017marrnet, chen2009learning, wu2016learning] predict voxel grids at relatively small resolutions, while others like [park2019deepsdf, mescheder2019occupancy, liu2019learning] reconstruct implicit surface functions, whose resolution is not limited compared to voxels. In addition, plenty of research [gkioxari2019mesh, wen2019pixel2mesh++] focuses on triangle mesh reconstruction, which is constrained by mesh topology; Pan et al. [pan2019deep] try to modify the mesh topology during reconstruction. Other directions, such as reconstructing objects as collections of geometric primitives [gao2019sdm, tian2019learning] or as octree structures [riegler2017octnet, tatarchenko2017octree], are also explored.
For unsupervised single-view shape prediction, [sitzmann2019deepvoxels, niemeyer2019differentiable, rematas2019neural, eslami2018neural] utilize only 2D image annotations, together with a multi-view consistency prior, to reconstruct implicit 3D models. Other works like [wang2020deep, li2019synthesizing] focus on large collections of images in the wild and reconstruct a model for each individual image.
2.4 Differentiable Renderer
Differentiable rendering is an emerging topic in recent years; [sitzmann2019deepvoxels, niemeyer2019differentiable] utilize differentiable projections to learn a 3D shape from its 2D image projections. Neural Mesh Renderer [kato2018neural] first brings this idea to mesh rasterization, and Li et al. [li2018differentiable] introduce differentiable Monte Carlo ray tracing. SDFDiff [jiang2019sdfdiff] renders implicit surfaces defined by signed distance functions in a differentiable way.
3 Methodology
3.1 Overview
Our method includes four essential parts: (a) 2D-3D shape prediction, (b) viewpoint estimation from RGB images, (c) a keypoint database with semantic embeddings, and (d) 3D-2D projection of semantic points. The full pipeline is illustrated in Figure 2.
For 2D-3D shape prediction, we use a structure similar to ShapeHD [wu2018learning], which first estimates silhouettes, normals, and depths from 2D images and then predicts 3D shapes using 3D convolutions. This pattern can be summarized as 2D-2.5D-3D and was first proposed by MarrNet [wu2017marrnet]. For viewpoint estimation, azimuth and elevation are classified into discrete bins using a ResNet architecture. For 3D keypoint transfer, we utilize KeypointNet [you2020keypointnet] annotations with a nearest-neighbor search. For 3D-2D semantic projection, the predicted 3D voxels are converted into meshes with the marching cubes algorithm [lorensen1987marching]. These meshes can then be projected back onto 2D image planes given the estimated viewpoints, which are fine-tuned by a differentiable renderer. Note that we assume objects are centered in the image and are not occluded or cut off by other objects. Clutter occlusions and incomplete objects are out of the scope of this paper.
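As a concrete illustration (not the exact implementation), the snippet below sketches the voxel-to-mesh conversion with marching cubes and a pinhole projection of the resulting vertices under an azimuth/elevation viewpoint; the occupancy threshold, camera distance, and focal length are illustrative assumptions.

```python
# Minimal sketch: predicted voxel occupancy grid -> mesh -> projected 2D coordinates.
# Threshold and camera parameters below are illustrative, not taken from the paper.
import numpy as np
from skimage import measure

def voxels_to_mesh(voxels, threshold=0.5):
    # verts come out in voxel coordinates; normalize to a unit cube centered at the origin
    verts, faces, _, _ = measure.marching_cubes(voxels, level=threshold)
    verts = verts / voxels.shape[0] - 0.5
    return verts, faces

def project_vertices(verts, azimuth_deg, elevation_deg, distance=2.0, focal=1.0):
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    # rotate the object into the camera frame (azimuth about y, elevation about x)
    Ry = np.array([[np.cos(az), 0, np.sin(az)],
                   [0, 1, 0],
                   [-np.sin(az), 0, np.cos(az)]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(el), -np.sin(el)],
                   [0, np.sin(el), np.cos(el)]])
    cam = verts @ (Rx @ Ry).T
    cam[:, 2] += distance                      # push the object in front of the camera
    uv = focal * cam[:, :2] / cam[:, 2:3]      # pinhole projection onto the image plane
    return uv
```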
3.2 Single View 3D Reconstruction
There are a number of works focusing on single-view 3D reconstruction, such as MarrNet [wu2017marrnet], ShapeHD [wu2018learning], and Mesh R-CNN [gkioxari2019mesh]. We adopt an architecture similar to ShapeHD, since it penalizes unrealistic 3D shapes. For completeness, we briefly describe the components used in ShapeHD, which is itself inspired by MarrNet. It has a 2.5D sketch estimator, an encoder-decoder that predicts the object's depth, surface normals, and silhouette from an RGB image. This is followed by a 3D estimator, also with an encoder-decoder structure, which predicts a 3D shape of the object in the canonical view from the 2.5D sketches. In addition, the authors introduce a deep naturalness regularizer that penalizes unrealistic shape predictions. The regularizer is implemented with a 3D generative adversarial network whose discriminator is used to compute a naturalness score. The architecture of this module is shown in Figure 3, and a simplified schematic is sketched below.
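The schematic below summarizes the three components in highly simplified form; layer sizes, depths, and the voxel resolution are placeholders, not the ShapeHD configuration.

```python
# Schematic sketch (simplified, not the ShapeHD implementation): a 2.5D sketch
# estimator, a 2.5D-to-3D estimator, and a 3D discriminator whose score serves
# as a naturalness penalty.
import torch
import torch.nn as nn

class SketchEstimator(nn.Module):          # RGB -> depth, normals, silhouette (2.5D)
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU())
        # 5 output channels: 1 depth + 3 normal + 1 silhouette
        self.dec = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(64, 5, 4, 2, 1))
    def forward(self, rgb):
        return self.dec(self.enc(rgb))

class ShapeEstimator(nn.Module):           # 2.5D sketches -> voxel grid in canonical view
    def __init__(self, res=32):
        super().__init__()
        self.res = res
        self.enc = nn.Sequential(nn.Conv2d(5, 64, 4, 2, 1), nn.ReLU(),
                                 nn.Flatten(), nn.LazyLinear(res ** 3))
    def forward(self, sketches):
        occ = torch.sigmoid(self.enc(sketches))
        return occ.view(-1, 1, self.res, self.res, self.res)

class NaturalnessDiscriminator(nn.Module): # 3D GAN discriminator penalizing unrealistic shapes
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv3d(1, 32, 4, 2, 1), nn.LeakyReLU(0.2),
                                 nn.Conv3d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),
                                 nn.Flatten(), nn.LazyLinear(1))
    def forward(self, voxels):
        return self.net(voxels)
```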

3.3 Viewpoint Estimation from RGB Images
Given a predicted 3D shape, it is necessary to estimate the viewpoint in order to project the shape back onto the image plane. To do so, we design a network that predicts viewpoints directly from RGB images. Note that this differs from Pix3D [sun2018pix3d], where viewpoints are estimated from 2.5D sketches; we argue that errors accumulate if the preceding 2.5D sketch prediction is inaccurate. Direct estimation reduces the number of stages from two to one, which helps improve accuracy. This is also verified in our experiments.
We treat viewpoint estimation as a classification problem, where azimuth is divided into 24 bins and elevation into 12 bins. The circularity of azimuth is handled carefully with an additional circular bin. The architecture of this module is shown in Figure 4; a minimal sketch follows below.
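The sketch below shows one way to realize this classifier; the backbone depth (ResNet-18 here) and the cross-entropy loss are our own illustrative choices, and the extra circular bin is omitted for brevity.

```python
# Minimal sketch of a direct RGB-to-viewpoint classifier with two heads:
# 24 azimuth bins and 12 elevation bins. Backbone depth is illustrative.
import torch
import torch.nn as nn
from torchvision import models

class ViewpointClassifier(nn.Module):
    def __init__(self, n_az=24, n_el=12):
        super().__init__()
        backbone = models.resnet18(weights=None)   # no pretrained weights, for brevity
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()                # keep the pooled features only
        self.backbone = backbone
        self.az_head = nn.Linear(feat_dim, n_az)
        self.el_head = nn.Linear(feat_dim, n_el)

    def forward(self, rgb):
        f = self.backbone(rgb)
        return self.az_head(f), self.el_head(f)

def viewpoint_loss(az_logits, el_logits, az_bin, el_bin):
    # az_bin / el_bin: integer bin indices of the ground-truth azimuth / elevation
    ce = nn.CrossEntropyLoss()
    return ce(az_logits, az_bin) + ce(el_logits, el_bin)
```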

3.4 3D Semantic Prediction
Predicting semantic labels on 3D models is quite challenging within this 2D-3D-2D loop. First, the 3D shape predicted in the previous stage is imperfect and may be corrupted. Second, training directly on 3D models may be prohibitive, since current semantic image datasets usually do not come with corresponding 3D models. Even for datasets with 3D models, such as PASCAL-3D [xiang_wacv14], the number of models is relatively small, so overfitting is a real concern.
Therefore, we resort to an existing large-scale 3D keypoint dataset to train a semantic prediction network. KeypointNet [you2020keypointnet] contains millions of keypoint annotations on ShapeNet models. By training on this dataset, one obtains a semantic embedding for each keypoint, which can then be transferred to the predicted 3D models with a nearest-neighbor search. In other words, we train a semantic prediction network on a 3D object database and then generalize it to our predicted 3D shapes. To account for corruption, we augment the dataset with random Gaussian noise near the object surface. This idea is illustrated in Figure 5; a sketch of the transfer step is given below.
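The following sketch illustrates the two ingredients described above, noise augmentation and nearest-neighbor transfer in embedding space; the noise scale and array shapes are illustrative assumptions, and the embedding network itself is assumed to exist.

```python
# Illustrative sketch of the keypoint transfer step (our reading, not the released code):
# keypoints annotated in the database carry semantic embeddings; they are mapped onto a
# predicted shape by nearest-neighbor search in embedding space. Surface points are
# jittered with Gaussian noise during training to mimic corrupted reconstructions.
import numpy as np

def augment_with_noise(points, sigma=0.01):
    # points: (N, 3) samples near the object surface; sigma is an assumed noise scale
    return points + np.random.normal(scale=sigma, size=points.shape)

def transfer_keypoints(pred_points, pred_embed, kp_embed):
    # pred_points: (N, 3) points of the predicted shape
    # pred_embed:  (N, D) embeddings of those points
    # kp_embed:    (K, D) embeddings of annotated keypoints from the database
    d = np.linalg.norm(pred_embed[None, :, :] - kp_embed[:, None, :], axis=-1)  # (K, N)
    nearest = d.argmin(axis=1)          # closest predicted point for each keypoint
    return pred_points[nearest]         # (K, 3) keypoint locations on the predicted shape
```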

3.5 Differentiable Rendering in 2D Projection
As a final step, we are now ready to project our predicted 3D shapes, together with the inferred semantic points, back onto 2D image planes. This last step is important because errors accumulate through all previous stages. To have a chance of correcting earlier predictions, we extract a mesh from the predicted voxels with the marching cubes algorithm and fine-tune the viewpoint with the help of Neural Mesh Renderer [kato2018neural]. Specifically, denote the ground-truth silhouette image as $S$, the predicted 3D model as $M$, and the rendered silhouette of $M$ under viewpoint $v$ as $R(M, v)$. The fine-tuned viewpoint $v^{*}$ is

$$v^{*} = \operatorname*{arg\,min}_{v} \sum_{p \in \mathcal{P}} \big\| S(p) - R(M, v)(p) \big\|^{2}, \tag{1}$$

where $\mathcal{P}$ is the set of all 2D image coordinates. We run several gradient-descent steps to find the best viewpoint.
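A minimal sketch of this optimization is given below, assuming a differentiable silhouette renderer is available; `render_silhouette` is a hypothetical wrapper (e.g., around Neural Mesh Renderer), and the learning rate and step count are illustrative.

```python
# Sketch of the viewpoint fine-tuning loop in Eq. (1). render_silhouette is a
# hypothetical differentiable renderer returning a silhouette image that is
# differentiable w.r.t. the azimuth and elevation parameters.
import torch

def finetune_viewpoint(render_silhouette, verts, faces, gt_silhouette,
                       az_init, el_init, steps=100, lr=0.05):
    az = torch.tensor(float(az_init), requires_grad=True)
    el = torch.tensor(float(el_init), requires_grad=True)
    opt = torch.optim.Adam([az, el], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        sil = render_silhouette(verts, faces, az, el)   # rendered silhouette under (az, el)
        loss = ((sil - gt_silhouette) ** 2).sum()       # squared silhouette error, Eq. (1)
        loss.backward()
        opt.step()
    return az.detach(), el.detach()
```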
To summarize, we first predict the 3D shape and viewpoint from a single-view RGB image; 3D semantic keypoints (or any other semantic information) are then transferred from an existing 3D model database to the predicted shape; finally, the predicted 3D shape and its semantics are projected onto the 2D image plane using the fine-tuned viewpoint.

4 Experiments
Our experiments are divided into three parts. The first compares our proposed method with current state-of-the-art methods. The second presents ablation studies on the proposed viewpoint estimation and fine-tuning modules. The third is a detailed and thorough investigation of each component's/stage's influence on the final results in our full pipeline. We hope this analysis helps future researchers better understand each individual component in the 2D-3D-2D loop. Note that this analysis also covers the earlier 2D-2.5D-3D reconstruction pipeline and can be used to further improve single-view 3D reconstruction.
Datasets

We use ShapeNet Synthetic rendered images [wu2018learning] for training and Pix3D [sun2018pix3d], PF-PASCAL [ham2016proposal], and SPair-71k [min2019spair] images for evaluation (except in Section 4.4). Chairs are chosen here for clearer illustration; results on other classes can be found in the supplementary material. The differences among these four datasets are shown in Figure 7. ShapeNet Synthetic provides abundant training data, though the images are synthetically rendered. Pix3D is a real dataset with little clutter occlusion and well-centered objects, so it is relatively clean. PF-PASCAL and SPair-71k exhibit more extreme occlusions, cut-offs, and scale variations. Though clutter occlusions and incomplete objects are not the focus of this paper, our method still achieves competitive scores on these datasets compared with the state of the art.
Metric
We use the common evaluation metric of percentage of correct keypoints (PCK), which counts the fraction of correctly predicted keypoints given a tolerance threshold. Given a predicted keypoint $k_{\text{pred}}$ and its ground-truth keypoint $k_{\text{gt}}$, the prediction is considered correct if the Euclidean distance between them is smaller than a given threshold. The correctness of each keypoint can be expressed as

$$\mathbb{1}\!\left[\, \| k_{\text{pred}} - k_{\text{gt}} \|_{2} \le \alpha \cdot \max(w_{\tau}, h_{\tau}) \,\right], \tag{2}$$

where $w_{\tau}$ and $h_{\tau}$ are the width and height of either the entire image or the object bounding box, $\tau \in \{\text{img}, \text{bbox}\}$, and $\alpha$ is a tolerance factor.
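For reference, a small helper implementing this criterion, under the assumption that the threshold is $\alpha$ times the larger of the two side lengths, might look like:

```python
# PCK sketch: a keypoint counts as correct when its distance to the ground truth
# is at most alpha * max(width, height) of the image or bounding box.
import numpy as np

def pck(pred_kps, gt_kps, width, height, alpha=0.1):
    # pred_kps, gt_kps: (K, 2) pixel coordinates; width/height of image or bbox
    dists = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return float(np.mean(dists <= alpha * max(width, height)))
```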
We select keypoints that lie in the intersection of KeypointNet and Pix3D/PF-PASCAL/SPair-71k for evaluation. For our method, we directly estimate keypoints for each single input image and compute PCK per keypoint; for baseline methods like Hyperpixel Flow, PCK is computed on an image pair, where keypoints in one image serve as ground truth and keypoints in the other image are predicted by a semantic warp or transfer. Our method can also be viewed as transferring keypoints from an implicit ground-truth keypoint template. PCK results are averaged over all keypoints.
4.1 Comparison with State-of-the-Arts
In this section, we compare our method with several state-of-the-art approaches that either directly perform 2D image semantic transfer [min2019hyperpixel, seo2018attentive] or utilize 3D templates [kulkarni2019canonical].
Our method is trained on ShapeNet Synthetic renderings, while the state-of-the-art methods are trained directly on real-world images. Interestingly, even though our method faces a domain gap when applied to real-world images, it still outperforms state-of-the-art methods on SPair-71k and Pix3D. Quantitative results are shown in Table 1. On PF-PASCAL, our method is inferior due to the difficulty of handling severely occluded and incomplete objects. Qualitative results on SPair-71k and PF-PASCAL are shown in Figures 8 and 9.


| Models | PCK-chair () |  |  | PCK-chair () |  |  |
|  | PF-PASCAL | SPair71k | Pix3D | PF-PASCAL | SPair71k | Pix3D |
|---|---|---|---|---|---|---|
| ours | 0.444 | 0.565 | 0.560 | 0.342 | 0.346 | 0.323 |
| HPF-res101 [min2019hyperpixel] | 0.602 | 0.419 | 0.534 | 0.435 | 0.324 | 0.434 |
| A2Net-res101 [seo2018attentive] | 0.516 | 0.358 | 0.506 | 0.302 | 0.200 | 0.356 |
| CSM-unet [kulkarni2019canonical] | 0.164 | 0.115 | 0.152 | 0.164¹ | 0.115¹ | 0.152¹ |
4.2 Ablation Study on Viewpoint Estimation
In this section, we explore the effect of the proposed viewpoint estimation module and the viewpoint fine-tuning module.
For the viewpoint estimation module, we compare against viewpoints predicted from the estimated 2.5D sketch (w/o v.p. from RGB). Quantitative results are shown in Table 2: our proposed module greatly improves viewpoint accuracy by not accumulating the error of the 2D-2.5D sketch prediction. From Figure 10, it can be seen that viewpoints estimated from 2.5D sketches are much more biased and reduce the quality of the final transferred keypoints.
For the viewpoint fine-tuning module, fine-tuning improves overall accuracy on SPair-71k and Pix3D but degrades it on PF-PASCAL under the first PCK setting, as shown in Table 2. This is because PF-PASCAL contains more occluded chairs than SPair-71k and is therefore harder; occlusion breaks the prior that objects are centered in the image, making viewpoint fine-tuning vulnerable. Qualitative results are given in Figure 10.
| Models | PCK-chair () |  |  | PCK-chair () |  |  |
|  | PF-PASCAL | SPair71k | Pix3D | PF-PASCAL | SPair71k | Pix3D |
|---|---|---|---|---|---|---|
| ours | 0.444 | 0.565 | 0.560 | 0.342 | 0.346 | 0.323 |
| ours w/o v.p. from RGB | 0.171 | 0.185 | 0.238 | 0.075 | 0.098 | 0.117 |
| ours w/o v.p. fine-tune | 0.497 | 0.538 | 0.560 | 0.316 | 0.332 | 0.322 |


4.3 Detailed Analysis of Each Stage
In this section, we investigate how each stage influences the final result by replacing the following three components with their ground truths: (a) 2D-to-3D shape reconstruction, (b) viewpoint estimation, and (c) the semantic 3D model. Here, "semantic 3D model" refers to whether the ground-truth 3D model of the input image is used when transferring keypoints (the database input to PointNet in Figure 5).
We start with our full pipeline and replace each component with its ground-truth counterpart to measure the respective accuracy improvement. We also evaluate our method with all components replaced by their ground truths. Notice that although ground-truth viewpoints with azimuth and elevation are given, they are not ground-truth 6D camera poses; therefore, even the all-ground-truth configuration fails to reach 100% accuracy.
Results are given in Table 3. The PCK contribution of the ground-truth semantic 3D model is the largest: if we had the ground-truth model for computing keypoint embeddings instead of the models in our keypoint database, we would gain about 15.5% relative improvement. The contribution of ground-truth 3D reconstruction is small, suggesting that the 2D-2.5D-3D single-view reconstruction pipeline encounters few difficulties when applied to real datasets. Replacing predicted viewpoints with ground truths gives an 8.2% PCK improvement, indicating that there is still room for improvement in single-view camera pose estimation. More visualization results are shown in Figure 11.
| Models | PCK-chair () |  | PCK-chair () |  |
|  | Pix3D | ShapeNet | Pix3D | ShapeNet |
|---|---|---|---|---|
| ours | 0.560 | 0.323 | 0.513 | 0.243 |
| ours w/ 2.5D model GT | - | - | 0.518 | 0.243 |
| ours w/ 3D model GT | 0.571 | 0.363 | 0.521 | 0.263 |
| ours w/ viewpoint GT | 0.606 | 0.351 | 0.545 | 0.262 |
| ours w/ semantic model GT | 0.647 | 0.434 | 0.631 | 0.331 |
| all GT | 0.741 | 0.560 | 0.723 | 0.419 |
4.4 Domain Gap Exploration
Many single-view 3D reconstruction methods are trained on virtually rendered datasets and validated on real-world data. This greatly reduces the need for human-annotated 2D keypoint and 3D model label pairs. However, it also introduces an unavoidable gap due to rendering error.
In this section, we evaluate our method on virtually rendered ShapeNet images, which come from the same domain as the training data, although none of the evaluated images are seen during training. This complements Section 4.1, whose evaluation is on real datasets. Since we are evaluating on rendered data, the 2.5D ground truth (silhouette, normal, and depth) is available, so we add an extra experiment replacing the 2.5D predictions with ground-truth data.
Quite interestingly, results on Pix3D are better than on ShapeNet Synthetic. One reason is that camera positions in ShapeNet Synthetic are sampled uniformly over the entire unit sphere, while real datasets seldom contain large elevations, which makes generalization from the synthetic training data to real images comparatively easy. Quantitative results are given in Table 3 and visualizations are shown in Figure 12.


5 Future Work
Here we show some failure cases of our method in Figure 13, where severe clutter occlusions occur. It would be interesting to extend our method to explicitly reason about clutter occlusions and incomplete objects. Besides, as our current viewpoint estimation only covers two degrees of freedom, full 6D pose estimation is possible, and we leave it as future work.
6 Conclusions
In this paper, we propose a new pipeline for predicting semantic correspondences by lifting the problem to the 3D domain and then projecting the corresponding 3D models, together with their semantic labels, back to the 2D domain. This method explicitly reasons about object self-occlusion and visibility. We show that our method gives competitive and even superior results on standard semantic correspondence benchmarks. We also conduct thorough and detailed experiments to analyze the components of our network.
Supplementary
Dense Embedding Prediction
In the main document, we show that our method outperforms other methods on the keypoint transfer task. Dense embeddings can also be obtained from single RGB images by propagating 3D semantic embeddings onto the 2D image plane. Some qualitative results are illustrated in Figures 14 and 15.


Keypoint Transfer Results on Other Classes
In this section, we provide additional PCK results on the car and aeroplane classes, evaluated on PF-PASCAL and SPair-71k.
| Models | PCK-car () |  | PCK-car () |  |
|  | PF-PASCAL | SPair71k | PF-PASCAL | SPair71k |
|---|---|---|---|---|
| ours | 0.574 | 0.500 | 0.349 | 0.328 |
| HPF-res101 | 0.533 | 0.364 | 0.437 | 0.276 |
| A2Net-res101 | 0.478 | 0.332 | 0.326 | 0.201 |
| CSM-unet | 0.339 | 0.234 | 0.339 | 0.234 |
| Models | PCK-aeroplane () |  | PCK-aeroplane () |  |
|  | PF-PASCAL | SPair71k | PF-PASCAL | SPair71k |
|---|---|---|---|---|
| ours | 0.440 | 0.390 | 0.194 | 0.182 |
| HPF-res101 | 0.401 | 0.280 | 0.294 | 0.212 |
| A2Net-res101 | 0.388 | 0.234 | 0.256 | 0.155 |
| CSM-unet | 0.220 | 0.139 | 0.220 | 0.139 |



