Semantic Correspondence via 2D-3D-2D Cycle

by   Yang You, et al.
Shanghai Jiao Tong University

Visual semantic correspondence is an important topic in computer vision and can help machines understand objects in daily life. However, most previous methods train directly on correspondences in 2D images, which is end-to-end but discards much of the information available in 3D space. In this paper, we propose a new method for predicting semantic correspondences by lifting the problem to the 3D domain and then projecting the corresponding 3D models, with their semantic labels, back to the 2D domain. Our method leverages the advantages of 3D vision and can explicitly reason about object self-occlusion and visibility. We show that our method gives competitive and even superior results on standard semantic correspondence benchmarks. We also conduct thorough and detailed experiments to analyze our network components. The code and experiments are publicly available at



1 Introduction

Semantic correspondence for general objects is an important research area in machine vision, and understanding different objects of the same category remains a challenging problem. Quite a few methods address it in the 2D image domain: [choy2016universal, han2017scnet, kim2017fcss, horn1981determining, liu2010sift] propose to match local regions between pairs of images, while [rocco2017convolutional, rocco2018end, seo2018attentive, ham2017proposal] treat it as a global image alignment problem. However, these works all rely on 2D image features, and very few consider the underlying 3D structure of the objects being matched. We argue that by explicitly exploiting the 3D structure of objects, one can readily infer self-occlusion and spatial relationships. This idea is explored in some recent works [kulkarni2019canonical, zhou2016learning], which use a 3D model as an intermediate medium. However, they assume that a template model exists for all images, which does not hold in most cases.

To solve these problems, we propose a novel semantic transfer method that predicts 3D structure from a single RGB image and then projects 3D semantic labels back onto the 2D image plane. The 2D-to-3D shape prediction is inspired by Wu et al. [wu2018learning]. For the 3D-to-2D projection, we estimate viewpoints directly from 2D images and then use a 3D semantic prediction model trained on KeypointNet [you2020keypointnet] to produce 3D semantic labels together with their 2D projections. Viewpoints are further fine-tuned with a differentiable renderer. The main advantages of this method are twofold. 1) The amount of training data required is reduced drastically. Previous 2D image transfer networks require numerous images per object class in order to extract robust semantic features, and these images must show objects of different shapes from different angles. In contrast, if we can infer 3D structure from 2D images, all we need is to take the labels on existing 3D models and project them onto the 2D image plane. One may consider 3D structure inference a data-heavy task, but virtual 2D images can be generated from 3D models on the fly, as done by Wu et al. [wu2018learning]. 2) Visibility reasoning is made explicit. When we project semantic labels onto the 2D image plane, points on the back of the object are naturally culled. In the 2D image domain, visibility is handled implicitly by 2D CNNs, which makes it hard to interpret. As shown in Figure 1, for direct 2D-2D methods, the generated 2D warping from source to target does not account for self-occlusion and may be erroneous. Semantic transfer becomes much easier if we first estimate the corresponding 3D models and camera poses.

Figure 1: Direct 2D-2D correspondence vs. our 2D-3D-2D pipeline. Left: direct 2D semantic warping. Directly warping one image to another in the 2D domain is error-prone because of self-occlusion and viewpoint variation. Right: our method estimates the corresponding 3D model and projects the predicted 3D semantic labels back onto the 2D images.

Although this idea is appealing, the 2D-3D-2D cycle still poses many challenges. Compared with directly learning correspondence maps from 2D images, our pipeline is more involved. Each stage incurs some error, and the final result becomes biased through error accumulation. To mitigate this, we propose a camera pose estimation module that is fine-tuned by a differentiable renderer. The final result is competitive and even outperforms state-of-the-art methods on some benchmarks. In addition, we conduct comprehensive and detailed experiments to measure the effectiveness of each stage in the proposed 2D-3D-2D cycle. We hope this analysis helps future research to further improve 2D-3D, 3D-2D or 2D-3D-2D prediction.

Our main contributions are listed below:

  • We propose a novel 2D-3D-2D pipeline that solves the 2D semantic correspondence problem by lifting it to the 3D domain.

  • Our proposed method sets a new state of the art on several semantic correspondence benchmarks.

  • We conduct detailed and comprehensive experiments to isolate the effect of each stage on the accuracy of the final results.

  • We will make all our code with detailed experiments publicly available.

2 Related Work

2.1 2D Semantic Correspondence

Image semantic correspondence has a long history that dates back to optical flow [horn1981determining] and multi-baseline stereo [okutomi1993multiple]. More recently, local descriptor based methods such as proposal flow [ham2017proposal] and SIFT flow [liu2010sift] were explored to find dense correspondences across different objects. With the advance of deep learning, neural features [hariharan2015hypercolumns, lin2017feature, kong2016hypernet] are broadly used as they are more robust and generalizable. Methods like A2Net [seo2018attentive], NC-Net [rocco2018neighbourhood] and HPF [min2019hyperpixel] view semantic correspondence as a matching problem in high-dimensional feature images. In addition, [florence2018dense, schmidt2016self] use unsupervised methods with SLAM to learn consistent dense embeddings across different objects.

2.2 3D Semantic Correspondence

[allen2003space, blanz1999morphable] pioneered the detection of 3D semantic correspondence between human bodies and faces. Recently, [halimi2018self, roufosse2019unsupervised, groueix20183d] proposed unsupervised methods for learning dense correspondences between humans and animals. With the help of recent large-scale model datasets such as ShapeNet [chang2015shapenet] and PartNet [mo2019partnet], finding semantic correspondences on general objects has become possible. Deep functional dictionaries [sung2018deep] and SyncSpecCNN [yi2017syncspeccnn] both learn a set of synchronized basis functions in order to obtain dense correspondence from functional maps. In addition to ShapeNet, [pavlakos20176, kim2013learning, you2019fine, you2020keypointnet] provide additional keypoint or correspondence annotations for object semantic understanding.

CSM [kulkarni2019canonical] and Zhou et al. [zhou2016learning] are perhaps the closest to this paper. However, they assume that a single template 3D model fits all images well, so they are not directly applicable to categories whose instances differ significantly in topology or undergo large articulation. Besides, they infer 3D models implicitly by generating 2D-3D pixel maps, while we explicitly predict the 3D shape corresponding to each image.

2.3 Single View Shape Reconstruction

Recently, many works have been introduced on single view shape reconstruction. Among supervised methods, where a ground-truth model is available, PSGN [fan2017point] and pseudo-renderer [lin2018learning] reconstruct point clouds from single-view RGB images. Front2back [yao2019front2back] predicts per-pixel depth, which is then converted into a point cloud. [wu2017marrnet, chen2009learning, wu2016learning] predict voxel grids at a relatively low resolution, while others like [park2019deepsdf, mescheder2019occupancy, liu2019learning] reconstruct implicit surface functions, whose resolution is not limited in the way voxel grids are. There is also plenty of research [gkioxari2019mesh, wen2019pixel2mesh++] on triangle mesh reconstruction, which is constrained by mesh topology. Pan et al. [pan2019deep] try to modify the mesh topology during reconstruction. Other directions, such as reconstructing images as collections of geometric primitives [gao2019sdm, tian2019learning] and as complex octree structures [riegler2017octnet, tatarchenko2017octree], are also explored.

For unsupervised single view shape prediction, [sitzmann2019deepvoxels, niemeyer2019differentiable, rematas2019neural, eslami2018neural] use only 2D image annotations, together with a multi-view consistency prior, to reconstruct implicit 3D models. Other works like [wang2020deep, li2019synthesizing] focus on large collections of images in the wild and reconstruct a model for each distinct image.

2.4 Differentiable Renderer

Differentiable rendering has been an emerging topic in recent years. [sitzmann2019deepvoxels, niemeyer2019differentiable] both use differentiable projections to learn a 3D shape from its 2D image projections. Neural Mesh Renderer [kato2018neural] first brought this idea to mesh rasterization, and Li et al. [li2018differentiable] introduced differentiable Monte Carlo ray tracing. SDFDiff [jiang2019sdfdiff] renders implicit surfaces defined by signed distance functions in a differentiable way.

3 Methodology

3.1 Overview

Our method includes four essential parts: (a) 2D-3D shape prediction, (b) viewpoint estimation from RGB images, (c) a keypoint database with semantic embeddings, and (d) 3D-2D projection of semantic points. The full pipeline is illustrated in Figure 2.

For 2D-3D shape prediction, we use a structure similar to ShapeHD [wu2018learning], which first estimates silhouettes, normals and depths from 2D images and then predicts 3D shapes using 3D convolutions. This pattern can be summarized as 2D-2.5D-3D and was first proposed by MarrNet [wu2017marrnet]. For viewpoint estimation, azimuth and elevation are classified into discrete bins using a ResNet architecture. For 3D keypoint transfer, we use KeypointNet [you2020keypointnet] annotations with nearest neighbor search. For 3D-2D semantic projection, 3D voxel predictions are converted into meshes with the marching cubes algorithm [lorensen1987marching]. These meshes are then projected back onto the 2D image plane using the estimated viewpoints, which are fine-tuned by a differentiable renderer. Note that we assume objects are centered in the image and are not occluded or truncated by other objects. Clutter occlusion and incomplete objects are out of the scope of this paper.

Figure 2: Our full pipeline. (a) 3D models are reconstructed from single RGB images. (b) Viewpoints are also estimated from RGB images. (c) We obtain a keypoint descriptor database by training on existing 3D keypoint datasets, and then transfer these keypoints with nearest neighbor search. (d) Given viewpoints, 3D models and transferred keypoints, we project them onto the original image plane.

3.2 Single View 3D Reconstruction

A number of works focus on single view 3D reconstruction, such as MarrNet [wu2017marrnet], ShapeHD [wu2018learning] and Mesh R-CNN [gkioxari2019mesh]. We use an architecture similar to ShapeHD because it penalizes unrealistic 3D shapes. For completeness, we briefly describe the ShapeHD components that we use. ShapeHD is inspired by MarrNet. It has a 2.5D sketch estimator, an encoder-decoder that predicts the object's depth, surface normals and silhouette from an RGB image. This is followed by a 3D estimator, which also has an encoder-decoder structure and predicts the 3D shape of the object in the canonical view from the 2.5D sketches. In addition, the authors introduce a deep naturalness regularizer that penalizes unrealistic shape predictions. The regularizer is implemented as a 3D generative adversarial network, whose discriminator is then used to compute the naturalness score. The architecture of this module is shown in Figure 3.

Figure 3: Single view 3D model reconstruction. (a) First, 2.5D sketches including normals, silhouettes and depths are estimated. (b) Then, 3D transposed convolutions are used to recover object voxels. (c) In addition, a shape naturalness score is used to ensure that generated shapes do not diverge from real shapes.

3.3 Viewpoint Estimation from RGB Images

Given the predicted 3D shape, we need to estimate the viewpoint in order to project the shape back onto the image plane. To do so, we design a network that predicts viewpoints directly from RGB images. Note that this differs from Pix3D [sun2018pix3d], where viewpoints are estimated from 2.5D sketches. We argue that errors accumulate if the preceding 2.5D sketch prediction is inaccurate. Direct estimation reduces the number of stages the input passes through from two to one, which helps improve accuracy. This is also verified in our experiments.

We treat view estimation as a classification problem, where azimuth is divided into 24 bins and elevation is divided into 12 bins. The circularity of azimuth is handled carefully with an additional circular bin. The architecture of this module is shown in Figure 4.

Figure 4: Viewpoint Estimation Module. The viewpoint is estimated by a 2D CNN followed by several fully connected layers, trained with a KL divergence loss.
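To make the classification setup concrete, the following is a minimal sketch of how viewpoints could be discretized and compared with a KL divergence. Only the bin counts (24 and 12) and the use of a KL divergence loss come from the text. The elevation range of [-90, 90] degrees, the Gaussian-smoothed target, the wrap-around distance and all function names are assumptions made for illustration. In particular, the wrap-around distance is just one way to respect azimuth circularity and is not necessarily the additional circular bin used by the authors.

```python
import math

N_AZ, N_EL = 24, 12  # bin counts stated in the text


def azimuth_bin(az_deg):
    """Map an azimuth in degrees to one of N_AZ bins, wrapping around 360."""
    width = 360.0 / N_AZ
    return int((az_deg % 360.0) // width) % N_AZ


def elevation_bin(el_deg, lo=-90.0, hi=90.0):
    """Map an elevation to one of N_EL bins; the [lo, hi] range is an assumption."""
    width = (hi - lo) / N_EL
    idx = int((el_deg - lo) // width)
    return min(max(idx, 0), N_EL - 1)


def soft_target(center, n_bins, circular, sigma=1.0):
    """Gaussian-smoothed target over bins (hypothetical); wraps if circular."""
    weights = []
    for i in range(n_bins):
        d = abs(i - center)
        if circular:
            d = min(d, n_bins - d)  # shortest distance around the circle
        weights.append(math.exp(-0.5 * (d / sigma) ** 2))
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Usage: target distribution for an azimuth of 355 degrees versus a uniform guess.
target = soft_target(azimuth_bin(355.0), N_AZ, circular=True)
uniform = [1.0 / N_AZ] * N_AZ
loss_value = kl_divergence(target, uniform)
```

With a circular target, bins 0 and 22 receive the same weight when the true bin is 23, so a prediction just across the 0/360 boundary is not penalized as if it were far away.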

3.4 3D Semantic Prediction

Predicting semantic labels on 3D models is challenging in this 2D-3D-2D loop. First, the 3D shape predicted by the previous stage is imperfect and may be corrupted. Second, training directly on 3D models may be infeasible, as current semantic image datasets usually do not come with corresponding 3D models. Even for datasets with 3D models, such as PASCAL-3D [xiang_wacv14], the number of models is relatively small, so overfitting is likely.

Therefore, we resort to existing large-scale 3D keypoint datasets to train a semantic prediction network. KeypointNet [you2020keypointnet] contains millions of keypoint annotations on ShapeNet models. By training on this dataset, one can obtain a semantic embedding for each keypoint in the dataset and then transfer it to the predicted 3D models with a nearest neighbor search. In other words, we train a semantic prediction network on a 3D object database and then generalize it to our predicted 3D shapes. To account for corruption, we augment the dataset with random Gaussian noise near the object surface. This idea is illustrated in Figure 5.

Figure 5: Semantic Keypoint Transfer. The database has a large collection of models (may not necessarily contain the model to be evaluated). We extract their keypoint embeddings using PointNet trained with contrastive loss. For the input model, its dense embeddings are extracted with the same pretrained PointNet. Afterwards, keypoint locations are identified by a nearest neighbor search.
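The nearest neighbor transfer step can be sketched as follows. This is a simplified illustration rather than the authors' code: it assumes that per-point embeddings of the input model and per-keypoint embeddings from the database have already been computed by some network, and it uses plain Euclidean distance in embedding space. The function name, the keypoint names and the toy 2D embeddings are hypothetical.

```python
import math


def transfer_keypoints(db_keypoint_embeddings, point_embeddings):
    """For each database keypoint, find the closest point of the input model.

    db_keypoint_embeddings: dict mapping a keypoint name to its embedding.
    point_embeddings: list of embeddings, one per point of the input model.
    Returns a dict mapping keypoint name to (point index, distance).
    """
    result = {}
    for name, kp_emb in db_keypoint_embeddings.items():
        best_idx, best_dist = None, math.inf
        for idx, emb in enumerate(point_embeddings):
            dist = math.dist(kp_emb, emb)  # Euclidean distance in embedding space
            if dist < best_dist:
                best_idx, best_dist = idx, dist
        result[name] = (best_idx, best_dist)
    return result


# Usage with toy 2D embeddings (real embeddings would be higher dimensional).
database = {"leg_front_left": [1.0, 0.0], "seat_corner": [0.0, 1.0]}
model_points = [[0.9, 0.1], [0.1, 0.8], [0.5, 0.5]]
matches = transfer_keypoints(database, model_points)
```

A brute-force search is enough for a sketch. For dense point clouds, a spatial index such as a k-d tree would be the usual replacement.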

3.5 Differentiable Rendering in 2D Projection

As a final step, we project the predicted 3D shapes, together with the inferred semantic points, back onto the 2D image plane. This step is important because errors accumulate through all the previous stages. To have a chance of correcting earlier predictions, we extract a mesh from the voxel predictions by thresholding them with the marching cubes algorithm and then fine-tune the viewpoint with the help of Neural Mesh Renderer [kato2018neural]. Specifically, denote the ground-truth silhouette image as $S$ and the predicted 3D model as $M$; our fine-tuned viewpoint $v^*$ is:

$$v^* = \arg\min_{v} \sum_{p \in \Omega} \left\lVert S(p) - \pi(M, v)(p) \right\rVert_2^2,$$

where $\Omega$ is the set of all 2D image coordinates and $\pi(M, v)$ is the projected image of the 3D model $M$ under viewpoint $v$. We run several gradient descent steps in order to find the best viewpoint.

To summarize, we first predict 3D shapes and viewpoints from single view RGB images; then 3D semantic keypoints or any other semantic information is transferred from an existing 3D model database to the predicted 3D shapes; finally, the predicted 3D shapes together with their semantics are projected onto 2D image planes, with viewpoints fine-tuned.

Figure 6: Differentiable Rendering. Given the current camera pose/viewpoint, we back-propagate through the neural mesh renderer to optimize the pose by comparing the rendered silhouette with the predicted silhouette.
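The viewpoint fine-tuning loop can be illustrated with a toy stand-in. The sketch below is not the authors' implementation. It assumes an orthographic camera parameterized only by azimuth and elevation, replaces the silhouette difference with a squared reprojection error on a handful of 3D points, and uses central finite differences instead of back-propagating through Neural Mesh Renderer. The function names (`project`, `loss`, `refine_viewpoint`), the step size and the number of steps are all hypothetical.

```python
import math


def project(points, azimuth, elevation):
    """Rotate by azimuth (about y), then elevation (about x), and drop depth.

    Angles are in radians; the orthographic camera is an assumption.
    """
    ca, sa = math.cos(azimuth), math.sin(azimuth)
    ce, se = math.cos(elevation), math.sin(elevation)
    projected = []
    for x, y, z in points:
        x1 = ca * x + sa * z      # rotation about the vertical axis
        z1 = -sa * x + ca * z
        y2 = ce * y - se * z1     # rotation about the horizontal axis
        projected.append((x1, y2))
    return projected


def loss(points, target, azimuth, elevation):
    """Sum of squared 2D distances between projected points and targets."""
    proj = project(points, azimuth, elevation)
    return sum((u - tu) ** 2 + (v - tv) ** 2
               for (u, v), (tu, tv) in zip(proj, target))


def refine_viewpoint(points, target, azimuth, elevation,
                     steps=200, lr=0.05, h=1e-5):
    """Gradient descent on the viewpoint using central finite differences."""
    for _ in range(steps):
        g_az = (loss(points, target, azimuth + h, elevation)
                - loss(points, target, azimuth - h, elevation)) / (2 * h)
        g_el = (loss(points, target, azimuth, elevation + h)
                - loss(points, target, azimuth, elevation - h)) / (2 * h)
        azimuth -= lr * g_az
        elevation -= lr * g_el
    return azimuth, elevation


# Usage: start from a slightly wrong viewpoint and refine it.
shape = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
observed = project(shape, 0.6, 0.3)  # stands in for the observed image
refined_az, refined_el = refine_viewpoint(shape, observed, 0.4, 0.2)
```

As in the pipeline described above, the refinement only helps when the initial estimate is already close to the correct viewpoint, because the objective is not convex in the angles.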

4 Experiments

Our experiments are divided into three parts. The first part compares our proposed method with current state-of-the-art methods. The second part presents ablation studies on our proposed viewpoint estimation and fine-tuning modules. The third part is a detailed investigation of how each component/stage in our full pipeline influences the final results. We hope this kind of analysis helps future researchers better understand the individual components of the 2D-3D-2D loop. Note that this analysis also covers the earlier 2D-2.5D-3D reconstruction pipeline and can be used to further improve single-view 3D reconstruction.


Figure 7: Dataset Visualization (chair) on ShapeNet, Pix3D, PF-PASCAL and SPair-71k. From top to bottom: difficulty increases from easy to hard.

We use ShapeNet Synthetic rendered images [wu2018learning] for training and Pix3D [sun2018pix3d], PF-PASCAL [ham2016proposal] and SPair-71k [min2019spair] images for evaluation (except for Section 4.4). Chairs are chosen here for better illustration; more results on other classes can be found in the supplementary material. Differences among these four datasets are shown in Figure 7. ShapeNet Synthetic provides abundant training data, although it is synthetically rendered rather than real. Pix3D is a real dataset that is relatively clean: it has little clutter occlusion, and objects are well centered in the images. PF-PASCAL and SPair-71k contain more extreme occlusion, truncation and scale variation. Though clutter occlusion and incomplete objects are not the focus of this paper, our method still gives competitive scores on these datasets compared with the state of the art.


We use the common evaluation metric of percentage of correct keypoints (PCK), which measures the fraction of keypoints that are predicted correctly within a tolerance threshold. Given a predicted keypoint $\hat{k}$ and a ground-truth keypoint $k$, the prediction is considered correct if the Euclidean distance between them does not exceed a given threshold. The correctness of each keypoint can be expressed as

$$c(\hat{k}) = \begin{cases} 1 & \text{if } \lVert \hat{k} - k \rVert_2 \le \alpha_\tau \cdot \max(w_\tau, h_\tau), \\ 0 & \text{otherwise}, \end{cases}$$

where $w_\tau$ and $h_\tau$ are the width and height of either the entire image or the object bounding box, $\tau \in$ {img, bbox}, and $\alpha_\tau$ is a tolerance factor.
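As a worked illustration of the metric, the sketch below computes PCK for a few keypoints. It assumes the common convention that the distance threshold is the tolerance factor times the larger of the reference width and height, where the reference is either the whole image or the object bounding box. The function name `pck`, the coordinates and the tolerance value of 0.1 are made up for the example and are not taken from the experiments.

```python
import math


def pck(predicted, ground_truth, width, height, alpha):
    """Fraction of keypoints within alpha * max(width, height) of the ground truth.

    width and height describe the reference region: the entire image for the
    image-based variant, or the object bounding box for the bbox-based variant.
    """
    threshold = alpha * max(width, height)
    correct = [math.dist(p, g) <= threshold
               for p, g in zip(predicted, ground_truth)]
    return sum(correct) / len(correct)


# Usage: three keypoints, scored against the image size and the bounding box.
pred = [(10.0, 10.0), (50.0, 50.0), (90.0, 20.0)]
gt = [(12.0, 11.0), (70.0, 50.0), (90.0, 29.0)]
pck_img = pck(pred, gt, width=100, height=80, alpha=0.1)   # threshold of 10 px
pck_bbox = pck(pred, gt, width=40, height=30, alpha=0.1)   # threshold of 4 px
```

Because the bounding box is never larger than the image, the bbox-based score is the stricter of the two for the same tolerance factor, which is why the second score in this example is lower.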

For evaluation, we select the keypoints that lie in the intersection of the KeypointNet annotations and the Pix3D/PF-PASCAL/SPair-71k annotations. For our method, we directly estimate keypoints for each single input image and calculate PCK for each keypoint. For baseline methods such as Hyperpixel Flow, PCK is calculated on an image pair, where keypoints in one image are treated as ground truth and keypoints in the other image are predicted by a semantic warp or transfer. Our method can also be viewed as transferring keypoints from an implicit ground-truth keypoint template. PCK results are averaged over all keypoints.

All our networks are implemented in PyTorch. Each stage in Figure 2 is trained independently. All input images are cropped and resized to a fixed resolution so that the object is centered in the image.

4.1 Comparison with State-of-the-Arts

In this section, we compare our method with several state-of-the-art methods that either perform 2D image semantic transfer directly [min2019hyperpixel, seo2018attentive] or use 3D templates [kulkarni2019canonical].

Our method is trained on ShapeNet Synthetic renderings, while the state-of-the-art methods are trained directly on real-world images. Interestingly, although our method faces a domain gap when applied to real-world images, it still outperforms the state-of-the-art methods on SPair-71k and, under the image-size threshold ($\alpha_{img}$), on Pix3D. Quantitative results are shown in Table 1. On PF-PASCAL, our method is inferior because of the difficulty of handling severely occluded and incomplete objects. Qualitative results on PF-PASCAL and SPair-71k are shown in Figures 8 and 9, respectively.

Figure 8: Qualitative results on PF-PASCAL.
Figure 9: Qualitative results on SPair-71k.
Models                              PCK-chair ($\alpha_{img}$)          PCK-chair ($\alpha_{bbox}$)
                                    PF-PASCAL  SPair-71k  Pix3D         PF-PASCAL  SPair-71k  Pix3D
ours                                0.444      0.565      0.560         0.342      0.346      0.323
HPF (res101) [min2019hyperpixel]    0.602      0.419      0.534         0.435      0.324      0.434
A2Net (res101) [seo2018attentive]   0.516      0.358      0.506         0.302      0.200      0.356
CSM (unet) [kulkarni2019canonical]  0.164      0.115      0.152         0.164 (1)  0.115 (1)  0.152 (1)

Table 1: Comparison of our method with state-of-the-art methods. (1) CSM crops input images with object bounding boxes, so its results for $\alpha_{img}$ and $\alpha_{bbox}$ are the same.

4.2 Ablation Study on Viewpoint Estimation

In this section, we explore the effect of the proposed view estimation module and viewpoint fine-tuning module.

For the view estimation module, we compare against viewpoints predicted from the estimated 2.5D sketch (w/o v.p. from RGB). Quantitative results are shown in Table 2. Our proposed method greatly improves the accuracy of view estimation by avoiding the error accumulated in the 2D-2.5D sketch prediction. Figure 10 shows that viewpoints estimated from the 2.5D sketch are much more biased and reduce the quality of the final transferred keypoints.

For the viewpoint fine-tuning module, fine-tuning improves the overall accuracy on SPair-71k and Pix3D, while it degrades accuracy on PF-PASCAL under the image-size threshold ($\alpha_{img}$), as shown in Table 2. This is because PF-PASCAL includes more occluded chairs than SPair-71k and is therefore harder. Such occlusion breaks the prior that objects are centered in the image, which makes viewpoint fine-tuning vulnerable. Qualitative results are given in Figure 10.

Models                     PCK-chair ($\alpha_{img}$)          PCK-chair ($\alpha_{bbox}$)
                           PF-PASCAL  SPair-71k  Pix3D         PF-PASCAL  SPair-71k  Pix3D
ours                       0.444      0.565      0.560         0.342      0.346      0.323
ours w/o v.p. from RGB     0.171      0.185      0.238         0.075      0.098      0.117
ours w/o v.p. fine-tune    0.497      0.538      0.560         0.316      0.332      0.322

Table 2: Ablation study on viewpoint estimation modules.
Figure 10: Visualization on viewpoint estimation modules. From left to right: input image; viewpoint estimated from predicted 2.5D sketches instead of RGB images; viewpoint not fine-tuned; models with all modules.
Figure 11: Visualization of each stage’s effect on Pix3D. From left to right: input image; replaced with ground-truth 3D model; replaced with ground-truth viewpoint; replaced with ground-truth 3D model in semantic transfer; all ground-truths.

4.3 Detailed Analysis of Each Stage

In this section, we investigate how each stage influences the final result by replacing the following three components with their ground truths: (a) 2D to 3D shape reconstruction, (b) viewpoint estimation and (c) the semantic 3D model. Here, the semantic 3D model setting refers to whether the ground-truth 3D model of the input image is used during keypoint transfer (as the database input to PointNet in Figure 5).

We start with our full pipeline and then replace each component in turn with its ground-truth counterpart to measure the accuracy improvement. We also evaluate our method with all components replaced by their ground truths. Notice that although ground-truth viewpoints with azimuth and elevation are given, they are not the ground-truth 6D camera poses. Therefore, even with all components replaced by ground truth, the pipeline does not reach 100% accuracy.

Results are given in Table 3. The PCK contribution of the ground-truth semantic 3D model is the largest: if we had the ground-truth model for computing keypoint embeddings instead of the models in our keypoint database, we would gain about 15.5% relative improvement (from 0.560 to 0.647 on Pix3D). The contribution of ground-truth 3D reconstruction is small, which suggests that the 2D-2.5D-3D single view reconstruction pipeline transfers well to real datasets. Replacing predicted viewpoints with ground truths gives an 8.2% relative PCK improvement (from 0.560 to 0.606), which means that there is still room for future work in single view camera pose estimation. More visualization results are shown in Figure 11.
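The relative improvements quoted above can be checked with a few lines of arithmetic. The sketch below uses the values in the first numeric column of Table 3 (Pix3D); the helper name `relative_gain` is hypothetical.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# Values from the first numeric column of Table 3 (Pix3D).
full_pipeline = 0.560
with_model_gt = 0.571
with_viewpoint_gt = 0.606
with_semantic_gt = 0.647

gain_model = 100 * relative_gain(full_pipeline, with_model_gt)          # about 2.0
gain_viewpoint = 100 * relative_gain(full_pipeline, with_viewpoint_gt)  # about 8.2
gain_semantic = 100 * relative_gain(full_pipeline, with_semantic_gt)    # about 15.5
```

The ordering of the three gains matches the discussion above: the semantic model contributes most, followed by the viewpoint, with 3D reconstruction contributing least.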

Models                       PCK-chair, Pix3D                      PCK-chair, ShapeNet
                             $\alpha_{img}$   $\alpha_{bbox}$      $\alpha_{img}$   $\alpha_{bbox}$
ours                         0.560            0.323                0.513            0.243
ours w/ 2.5D model GT        -                -                    0.518            0.243
ours w/ 3D model GT          0.571            0.363                0.521            0.263
ours w/ viewpoint GT         0.606            0.351                0.545            0.262
ours w/ semantic model GT    0.647            0.434                0.631            0.331
all GT                       0.741            0.560                0.723            0.419

Table 3: Detailed analysis of each stage on Pix3D and ShapeNet.

4.4 Domain Gap Exploration

Many single view 3D reconstruction methods are trained on virtually rendered datasets and validated on real-world data. This greatly reduces the need for human-annotated pairs of 2D keypoints and 3D model labels. However, it also introduces an unavoidable domain gap due to rendering error.

In this section, we evaluate our method on virtually rendered ShapeNet images, which come from the same domain as the training data, although none of the evaluated images are seen during training. This contrasts with Section 4.1, where evaluation is on real datasets. Since we are evaluating on rendered data, 2.5D ground truth (silhouette, normal and depth) is available, so we add an extra experiment that replaces the 2.5D predictions with ground-truth data.

Interestingly, results on Pix3D are better than on ShapeNet Synthetic. One reason is that in the ShapeNet Synthetic dataset, camera positions are sampled uniformly from the entire unit sphere, while real datasets seldom have large elevations, which makes generalization from virtual datasets easier. Quantitative results are given in Table 3 and visualizations are shown in Figure 12.

Figure 12: Detailed analysis of each stage on ShapeNet Synthetic. From left to right: input image; replaced with ground-truth 2.5D sketch; replaced with ground-truth 3D model; replaced with ground-truth viewpoint; replaced with ground-truth 3D model in semantic transfer; all ground-truths.
Figure 13: Failure cases of our method on SPair-71k/PF-PASCAL under severe clutter occlusion.

5 Future Work

Figure 13 shows some failure cases of our method, where severe clutter occlusion is present. It would be interesting to extend our method to explicitly reason about clutter occlusion and incomplete objects. Besides, our current viewpoint estimation includes only two degrees of freedom; full 6D pose estimation is possible, and we leave it as future work.

6 Conclusions

In this paper, we propose a new pipeline for predicting semantic correspondences by lifting the problem to the 3D domain and then projecting the corresponding 3D models, with their semantic labels, back to the 2D domain. This method explicitly reasons about object self-occlusion and visibility. We show that our method gives competitive and even superior results on standard semantic correspondence benchmarks. We also conduct thorough and detailed experiments to analyze our network components.


Dense Embedding Prediction

In the main document, we show that our method outperforms other methods on the keypoint transfer task. Dense embeddings for single RGB images can also be obtained by propagating 3D semantic embeddings onto the 2D image plane. Some qualitative results are illustrated in Figures 14 and 15.

Figure 14: Predicted dense embeddings on PF-PASCAL cars. Notice how the generated embeddings are consistent across different models despite viewpoint variations. Similar colors indicate similar embeddings.
Figure 15: Predicted dense embeddings on PF-PASCAL aeroplanes. Notice how the generated embeddings are consistent across different models despite viewpoint variations. Similar colors indicate similar embeddings.

Keypoint Transfer Results on Other Classes

In this section, we provide additional PCK results on cars and aeroplanes, evaluated on PF-PASCAL and SPair-71k.

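Models            PCK-car ($\alpha_{img}$)     PCK-car ($\alpha_{bbox}$)
ours              0.574    0.500               0.349    0.328
HPF (res101)      0.533    0.364               0.437    0.276
A2Net (res101)    0.478    0.332               0.326    0.201
CSM (unet)        0.339    0.234               0.339    0.234

Table 4: Comparison of our method with state-of-the-art methods on cars. Each threshold group has one column per evaluation dataset (PF-PASCAL and SPair-71k).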
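Models            PCK-aeroplane ($\alpha_{img}$)     PCK-aeroplane ($\alpha_{bbox}$)
ours              0.440    0.390                     0.194    0.182
HPF (res101)      0.401    0.280                     0.294    0.212
A2Net (res101)    0.388    0.234                     0.256    0.155
CSM (unet)        0.220    0.139                     0.220    0.139

Table 5: Comparison of our method with state-of-the-art methods on aeroplanes. Each threshold group has one column per evaluation dataset (PF-PASCAL and SPair-71k).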
Figure 16: Qualitative results on PF-PASCAL cars.
Figure 17: Qualitative results on SPair-71k cars.
Figure 18: Qualitative results on PF-PASCAL aeroplanes.
Figure 19: Qualitative results on SPair-71k aeroplanes.