Fine-grained Object Semantic Understanding from Correspondences

12/29/2019 ∙ by Yang You, et al. ∙ Shanghai Jiao Tong University

Fine-grained semantic understanding of 3D objects is crucial in many applications such as object manipulation. However, it is hard to give a universal definition of point-level semantics that everyone would agree on. We observe that people are pretty sure about semantic correspondences between two areas from different objects, but less certain about what each area means in semantics. Therefore, we argue that by providing human labeled correspondences between different objects from the same category, one can recover rich semantic information of an object. In this paper, we propose a method that outputs dense semantic embeddings based on a novel geodesic consistency loss. Accordingly, a new dataset named CorresPondenceNet and its corresponding benchmark are designed. Several state-of-the-art networks are evaluated based on our proposed method. We show that our method could boost the fine-grained understanding of heterogeneous objects and the inference of dense semantic information is possible.




Code Repositories


Implementation of the ECCV '20 paper: "Human Correspondence Consensus for 3D Object Semantic Understanding".


1 Introduction

Object understanding [leng20163d, mo2019partnet, zhou2019semantic] is one of the holy grails in computer vision. Being able to fully understand object semantics is crucial for various applications such as self-driving [bojarski2016end, paden2016survey] and attribute transfer [liao2017visual]. Recently, significant advances have been made in both category-level and instance-level understanding of objects [chang2015shapenet, kundu20183d]. However, having category-level or instance-level knowledge of objects is far from enough for fine-grained tasks such as object manipulation [levine2018learning, matas2018sim]. Fine-grained semantic understanding of objects is of great importance and still remains challenging.

One of the key problems with fine-grained semantic understanding lies in the ambiguous definitions of semantics. In the past decades, researchers have proposed keypoints [leutenegger2011brisk, lin2014microsoft, salti2015learning, suwajanakorn2018discovery] and skeletons [au2008skeleton] to explicitly define object semantics. These methods have been successful in tasks like human body parsing [kalayeh2018human]. However, it is hard or even impossible to give consistent definitions of keypoints or skeletons for a general object. Recently, part based representations of objects have also been adopted [chang2015shapenet, yi2016scalable, mo2019partnet], where an object is decomposed into semantic parts by experts, with a predefined semantic label on each part. The above methods all impose an explicit definition of object semantics, which is inevitably biased or flawed since different people may hold different opinions of what the semantics of an object are.

In this paper, we explore a brand new way to deal with this vagueness in fine-grained object understanding. Instead of explicitly giving semantic components and labels, we leverage the semantic correspondence between objects to implicitly infer their semantic meanings. This is based on the observation that while it is hard to tell the exact meanings of some sub-object areas, almost everyone would agree on their semantic correspondence across different objects, as shown in Figure 1. Consequently, comprehensive object understanding can be achieved by collecting multiple unambiguous semantic correspondences from a large population.

Figure 1: We observe that it is hard to tell the exact meanings of some areas on an object, while correspondences between different objects are clear.

To that end, we introduce CorresPondenceNet (CPNet): a diverse and high-quality dataset on top of ShapeNet [chang2015shapenet] with cross-object, point-level fine-grained 3D semantic correspondence annotations. In this dataset, every annotator gives multiple sets of semantic-consistent points across different intra-class objects, which we call “correspondence sets”, as shown in Figure 2.

Using these correspondence sets, we aim to obtain the pointwise embeddings of an object to represent its fine-grained semantic information. We propose a novel method to learn these embeddings. While a simple push-pull loss fails to generate meaningful embeddings, we leverage a geodesic consistency loss. On one hand, points in the same correspondence set get pulled in the embedding space. On the other hand, points across different correspondence sets get pushed according to their average geodesic distances. By considering geodesic relationships between different correspondence sets, points with similar semantics are more likely to be grouped together in the embedding space.

In summary, our key contributions are as follows:

  • We explore a new way towards fine-grained semantic understanding of objects, where explicit definitions are avoided but point-level semantic correspondences across heterogeneous objects are leveraged.

  • We introduce CPNet, the first correspondence based dataset for 3D object understanding, which contains 100K+ high-quality semantic-consistent points.

  • We design a novel geodesic consistency loss to learn dense embeddings. To evaluate these embeddings, a brand-new semantic understanding benchmark — semantic correspondence estimation, is proposed. A variety of state-of-the-art neural networks are evaluated.

2 Related Work

Datasets on Semantic Analysis

Driven by big data and deep learning, several large 2D/3D datasets have been introduced in recent years to parse semantic information from objects. In the world of 2D images, SPAIR-71k [min2019spair] is a large-scale dataset with rich annotations of viewpoints, keypoints and segmentations, which is mainly used for semantic matching between different images. Recently, Ham et al. [ham2017proposal] and Taniai et al. [taniai2016joint] have introduced datasets with ground-truth correspondences. Since then, PF-WILLOW and PF-PASCAL [ham2017proposal] have been used for evaluation in many works. In addition, plenty of datasets on human pose analysis [andriluka14cvpr, PoseTrack] have been proposed recently. These 2D image datasets have the advantage of being relatively large and diverse across different scenes and objects.

On the other hand, there exists a rich set of 3D model datasets that directly process meshes or point clouds. They are generally of two types: those that focus on rigid models and those that focus on non-rigid models. For rigid model analysis, ShapeNet Core 55 [chang2015shapenet] was proposed to help object-level classification, while the ShapeNet part dataset [yi2016scalable] pushes it one step further with intra-object part classification. As a follow-up, PartNet [mo2019partnet] provides a much more complete, manually defined hierarchical structure of parts. Alternatively, the dataset proposed by Dutagaci et al. [dutagaci2012evaluation] focuses on sparse semantic keypoints on objects. For non-rigid (deformable) models, FAUST [bogo2014faust] and TOSCA [bronstein2008numerical] provide dense correspondence labels for humans and animals, respectively. These datasets leverage the clear anatomical structure underlying humans and animals and can be applied to pose transfer, pose synthesis, etc.

Methods on Fine-grained Semantic Understanding

In the last decade, plenty of methods have been proposed to find semantic correspondences between paired images. Earlier methods like Okutomi et al. [okutomi1993multiple], Horn et al. [horn1993determining] and Matas et al. [matas2004robust] find semantic correspondences within the same scene. Semantic flows like SIFT flow [liu2010sift] and ProposalFlow [ham2017proposal] go further and find dense correspondences across different scenes. Kulkarni et al. [kulkarni2019canonical] and Zhou et al. [zhou2016learning] utilize a synthesized 3D model as a medium to enforce semantic cycle-consistency. Florence et al. [florence2018dense] and Schmidt et al. [schmidt2016self] leverage unsupervised methods to learn consistent dense embeddings across different objects.

When it comes to the domain of 3D shapes, Blanz et al. [blanz1999morphable] and Allen et al. [allen2003space] pioneered finding 3D correspondences between human faces and bodies. Recently, 3D dense semantic correspondence has been boosted by a variety of deep learning methods. Halimi et al. [halimi2018self], Groueix et al. [groueix20183d] and Roufosse et al. [roufosse2019unsupervised] propose unsupervised methods for learning dense correspondences between humans and animals. Deep functional dictionaries [sung2018deep] give a small dictionary of basis functions for each shape, a dictionary whose span includes the semantic functions provided for that shape.

Figure 2: CPNet dataset. Each person annotates multiple sets of corresponding points. Points in the same correspondence set are in the same color. Each annotator can have his/her own understanding of semantic points, as long as the points are consistent across different models within the same category.

3 Understanding Semantic Information from Humans

Understanding semantics from arbitrary objects is of great importance. However, explicitly expressing semantics in a well defined format is extremely hard as the definition of semantics is vague and diverse.

We observe that people are pretty sure about the correspondence between two areas but less sure about what each area means in semantics. As shown in Figure 1, almost everyone would agree on the lined correspondences between two helmets. However, it is hard to tell the exact semantic meanings of the colored areas.

Therefore, unlike all previous methods where an explicit definition of keypoints or parts is given, we instead focus on sparse correspondences annotated by humans, based on the assumption that all the corresponding points labeled by the same person share the same semantic meaning.

4 CorresPondenceNet

CorresPondenceNet (CPNet) is a collection of 2000+ models from 25 categories, based on ShapeNetCore. Each model is annotated with a number of semantic points from multiple annotators, as shown in Figure 2. Unlike other 2D or 3D keypoint datasets, which manually set a keypoint template and ask annotators to follow it, semantic points in our dataset are not deliberately defined by anyone. The key is that every annotator can have his/her own understanding of semantic points, as long as they are consistent across different models within the same category. In the following subsections, we discuss in detail how we collect models, how we annotate models, and the annotation types.

4.1 Dataset Collections

Our dataset is based on ShapeNetCore [chang2015shapenet]. ShapeNetCore is a subset of the full ShapeNet dataset with single clean 3D models and manually verified category and alignment annotations. There are 51,300 unique 3D models from 55 common object categories in ShapeNetCore. We select 25 categories that are commonly seen in daily life to build our dataset. To keep the dataset balanced, we keep at most 100 models for each category. For categories with fewer than 100 models, all the models are selected.

Airplane Bathtub Bed Bench Bottle Bus Cap Car Chair Dishwasher Display Earphone Faucet
NK 5527 6033 6464 5421 4489 6404 949 7938 6140 5343 4509 904 1612
NA 10 10 10 10 10 10 10 10 10 10 10 10 10
NM 100 100 100 100 100 100 38 100 100 77 100 58 100
Cmin 35 40 40 30 41 50 20 64 50 60 20 14 10
Cmed 54 60 60 50 45 64 25 80 70 70 50 15 15
Cmax 72 96 80 70 46 81 30 82 78 84 51 21 22
Guitar Helmet Knife Lamp Laptop Motorcycle Mug Pistol Rocket Skateboard Table Vessel All
NK 2832 1500 2109 1683 2987 3878 7668 3358 2315 3822 4008 5214 104861
NA 10 10 10 10 10 10 10 10 10 10 10 10 -
NM 100 95 100 100 100 100 100 100 66 100 100 100 2334
Cmin 19 27 10 13 20 30 66 17 21 20 39 40 -
Cmed 30 35 12 15 30 40 77 35 32 40 40 54 -
Cmax 32 37 15 21 36 40 78 41 49 43 44 56 -
Table 1: CPNet statistics. NK gives the number of annotated points of each category; NA gives the number of annotators for each category; NM is the number of models in each category; Cmin, Cmed, Cmax give minimum, median and maximum number of correspondence sets per instance in each category.

4.2 Annotation Process

We hire 80 professional annotators in total. Each model is annotated by at least 10 persons to enrich the dataset.

For each category, every annotator is allowed to create templates with his/her own understanding of semantic points. The templates are then listed to guide the annotation of the remaining models, so that he/she is able to stay consistent. Consider an airplane as an example: if one annotator marks the nose as the No.1 semantic point, then he/she is supposed to mark the noses on all other airplanes as No.1. It does not matter if another annotator marks the nose as the No.2 semantic point, or even neglects it, as long as each annotator obeys his/her own rules across all the models. For points that may not exist on all the models, such as a propeller, one can simply skip the point on the models without it. The annotator is free to choose any points from his/her perspective.

Each annotator is asked to mark at most 16 semantic points per model. All points are annotated on raw meshes, which is more accurate than annotating on point clouds. Moreover, it is straightforward to extend these annotations to point clouds by sampling from the mesh while fixing the locations of semantic points.
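The extension from mesh annotations to point clouds can be sketched as follows. This is a minimal illustration and not the dataset's actual tooling: the function names, the area-weighted triangle sampling scheme and the argument layout are all assumptions. The annotated semantic points are copied into the cloud unchanged, and the remaining points are drawn uniformly from the mesh surface.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of the 3D triangle (a, b, c) from the cross product of two edges."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (
        ab[1] * ac[2] - ab[2] * ac[1],
        ab[2] * ac[0] - ab[0] * ac[2],
        ab[0] * ac[1] - ab[1] * ac[0],
    )
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_with_fixed_points(vertices, faces, semantic_points, n_total, seed=0):
    """Hypothetical helper: build a point cloud of n_total points.

    The annotated semantic points come first and keep their exact locations;
    the rest are sampled uniformly over the surface (area-weighted faces,
    uniform barycentric coordinates inside each face).
    """
    rng = random.Random(seed)
    areas = [triangle_area(*(vertices[i] for i in f)) for f in faces]
    n_random = max(0, n_total - len(semantic_points))
    cloud = [tuple(p) for p in semantic_points]
    for f in rng.choices(faces, weights=areas, k=n_random):
        a, b, c = (vertices[i] for i in f)
        r1, r2 = rng.random(), rng.random()
        s = math.sqrt(r1)  # square root makes the sample uniform in the triangle
        w = (1.0 - s, s * (1.0 - r2), s * r2)
        cloud.append(tuple(w[0] * a[i] + w[1] * b[i] + w[2] * c[i] for i in range(3)))
    return cloud
```

Placing the semantic points at the front of the cloud keeps their indices stable, which makes it easy to look up their embeddings later.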

4.3 Annotation Type

Denote all the models as $\mathcal{M} = \{M_i\}$, where $M_i$ represents a single model. Each model $M_i$ is associated with a set of semantic points $\{p^{(i)}_{j,k}\}$, where $p^{(i)}_{j,k}$ denotes the $k$-th semantic point of the $j$-th annotator on the $i$-th model.

In addition, we ask each annotator to give consistent points across different models, so that $p^{(i)}_{j,k}$ and $p^{(i')}_{j,k}$ have the same semantic meaning. Therefore, we define a set of correspondence sets $\mathcal{C} = \{C_l\}$, where each correspondence set $C_l$ contains all the points with the same semantic label. Note that we drop the index of the annotator, since distinct point correspondences from the same person can be treated the same as those from different persons.

Each annotated point contains four attributes: (1) xyz coordinate, (2) rgb color, (3) face index and (4) uv coordinate. By providing these attributes, methods based on either point clouds or meshes can be applied easily.
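To make the annotation structure concrete, the following sketch groups annotated points into correspondence sets. The class and field names are illustrative assumptions and do not describe the actual CPNet file format; only the grouping rule (same annotator, same semantic index, any model) follows the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SemanticPoint:
    """One annotated point; field names are hypothetical."""
    model_id: str                    # model the point lies on
    annotator_id: int                # index of the annotator
    label: int                       # index of the semantic point for that annotator
    xyz: Tuple[float, float, float]  # 3D coordinate on the mesh surface
    face_index: int                  # mesh face that contains the point


def build_correspondence_sets(points: List[SemanticPoint]) -> List[List[SemanticPoint]]:
    """Group points that share (annotator, label) into one correspondence set."""
    groups: Dict[Tuple[int, int], List[SemanticPoint]] = defaultdict(list)
    for p in points:
        groups[(p.annotator_id, p.label)].append(p)
    # The annotator index is dropped here: the result is a flat list of sets.
    return [groups[key] for key in sorted(groups)]
```

Two annotators who both mark the same physical location still produce two separate correspondence sets, which matches the remark that the annotator index can be dropped after grouping.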

4.4 Statistics

CPNet provides instance-level keypoint annotation for 2,334 models with 104,861 keypoints from 25 object categories. Table 1 gives the detailed statistics of our dataset.

5 Proposed Method

We now propose a method for learning dense semantic embeddings from human labeled correspondences across various intra-class models.

5.1 Problem Statement

Given a set of 3D models $\mathcal{M}$ and a set of correspondence sets $\mathcal{C}$ defined in Section 4.3, our goal is to produce a set of pointwise embeddings for each model $M_i$. The embeddings encode semantic information across different models, and points with similar semantics are close in embedding space. We define $f$ as an embedding function, such that $f(p)$ gives the embedding for point $p$ on the model. In practice, we approximate $f$ with a deep neural network and explain how to optimize $f$ as follows.

5.2 Method Details

Pull Loss

It is natural to start from a pull loss, since we would like to ensure semantic consistency within every correspondence set. As illustrated in Figure 3, points with the same color belong to the same correspondence set and carry similar semantic information. For one specific correspondence set, such as the green line in Figure 3, we aim to pull the embedding vectors of its points together. Any two points in the same correspondence set form a positive pair. The pairwise embedding distances are then summed over all positive pairs to form our pull loss:

$$L_{pull} = \frac{1}{N_{pull}} \sum_{C \in \mathcal{C}} \sum_{\substack{p, q \in C \\ p \neq q}} \lVert f(p) - f(q) \rVert \tag{1}$$

where $N_{pull}$ is the number of all possible positive point pairs.
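A minimal sketch of the pull loss, written in plain Python for readability rather than as a training-ready implementation. It assumes the embedding distance is Euclidean and that each unordered pair is counted once; the function name and the dictionary-based inputs are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, Hashable, List, Sequence


def pull_loss(corr_sets: List[List[Hashable]],
              emb: Dict[Hashable, Sequence[float]]) -> float:
    """Mean embedding distance over all positive pairs.

    corr_sets: list of correspondence sets, each a list of point ids.
    emb:       maps a point id to its embedding vector.
    """
    total, n_pairs = 0.0, 0
    for c in corr_sets:
        # Any two points of the same correspondence set form a positive pair.
        for p, q in combinations(c, 2):
            total += math.dist(emb[p], emb[q])  # Euclidean distance (assumption)
            n_pairs += 1
    return total / n_pairs if n_pairs else 0.0
```

A constant embedding drives this value to zero for any input, which is exactly the trivial solution that a push term has to rule out.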

Geodesic Consistency Loss

The pull loss in Equation 1 forces points in the same correspondence set to have similar embeddings. However, it admits a trivial solution in which $f$ outputs a constant embedding (e.g. $\mathbf{0}$) for all points, which is a global optimum when minimizing $L_{pull}$ alone. This trivial solution arises because an important principle is ignored: points with distinct semantics should have a large embedding distance. Therefore, a push loss guided by geodesic consistency is proposed to fulfill this goal. We leverage a prior to determine whether two different correspondence sets have distinct semantics: if all pairs of points from these two sets have large geodesic distances on the models, they are more likely to carry different semantic information.

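Based on this insight, we design a distance measure for a pair of correspondence sets $C_i$ and $C_j$:

$$d(C_i, C_j) = \frac{1}{|P_{ij}|} \sum_{(p, q) \in P_{ij}} d_g(p, q), \quad P_{ij} = \{(p, q) : p \in C_i,\ q \in C_j,\ p \text{ and } q \text{ lie on the same model}\} \tag{2}$$

where $d_g(p, q)$ is the geodesic distance between points $p$ and $q$. This distance measure represents the average geodesic distance between point pairs from two correspondence sets.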
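The set-to-set distance can be sketched as below. Since a geodesic distance is only defined between points on the same surface, this sketch assumes that the average runs over pairs lying on the same model; that restriction, the `(model_id, vertex)` point representation and the `geodesic` callback are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Point = Tuple[str, int]  # (model id, vertex index); hypothetical representation


def set_distance(ci: List[Point], cj: List[Point],
                 geodesic: Callable[[str, int, int], float]) -> float:
    """Average geodesic distance between two correspondence sets.

    geodesic(model_id, u, v) is assumed to return the geodesic distance
    between vertices u and v on the given model.
    """
    dists = [
        geodesic(m1, u, v)
        for (m1, u) in ci
        for (m2, v) in cj
        if m1 == m2  # only compare points that lie on the same model
    ]
    return sum(dists) / len(dists) if dists else 0.0
```

In practice the geodesic distances would be precomputed per model, for example with a shortest-path search on the mesh graph, and looked up inside `geodesic`.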

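Then, the push loss can be written as

$$L_{push} = \frac{1}{N_{push}} \sum_{C_i \neq C_j} \sum_{p \in C_i,\ q \in C_j} \max\left(0,\ d(C_i, C_j) - \lVert f(p) - f(q) \rVert\right) \tag{3}$$

where $N_{push}$ is the number of all possible negative pairs formed by points from different correspondence sets.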

In Equation 3, the push loss is only activated when the embedding distance $\lVert f(p) - f(q) \rVert$ is smaller than $d(C_i, C_j)$. In other words, the larger $d(C_i, C_j)$ is, the further $f(p)$ and $f(q)$ are separated in the embedding space. This is based on the observation that some points in two correspondence sets may have similar semantic information (like the red and orange lines in Figure 3), while others have totally different meanings (like the orange and green lines in Figure 3). Therefore, a large distance between embeddings is expected only for correspondence sets with a large average geodesic distance.
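The behavior described above corresponds to a hinge with an adaptive margin. The sketch below assumes the form max(0, d(C_i, C_j) - embedding distance) with a Euclidean embedding distance, and takes the set-to-set distances as a precomputed matrix; the function name and input layout are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Sequence


def push_loss(corr_sets: List[List[Hashable]],
              emb: Dict[Hashable, Sequence[float]],
              set_dist: List[List[float]]) -> float:
    """Mean hinge penalty over all negative pairs.

    set_dist[i][j] is the average geodesic distance between correspondence
    sets i and j and acts as the margin for every pair drawn from them.
    """
    total, n_pairs = 0.0, 0
    for i in range(len(corr_sets)):
        for j in range(i + 1, len(corr_sets)):
            margin = set_dist[i][j]
            for p in corr_sets[i]:
                for q in corr_sets[j]:
                    # Active only while the embeddings are closer than the margin.
                    total += max(0.0, margin - math.dist(emb[p], emb[q]))
                    n_pairs += 1
    return total / n_pairs if n_pairs else 0.0
```

With a small margin, two nearby correspondence sets incur no penalty for having similar embeddings, while sets that are far apart on the surface are pushed until their embedding distance reaches the margin.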

Figure 3: Correspondence sets across different airplanes. The colored lines denote three semantic correspondence sets.

Our final loss is

$$L = L_{pull} + \lambda L_{push} \tag{4}$$

where $\lambda$ is a weight factor.

Hard Negative Mining

In practice, there are combinatorially more negative pairs to be pushed than positive pairs to be pulled, since negative pairs are sampled from different correspondence sets. In this case, we borrow the idea of hard negative mining from [dalal2005histograms]. Within each batch, only the negative pairs with the smallest embedding distances are taken into consideration, and their number matches the number of positive pairs.
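The selection step can be sketched as follows, assuming a Euclidean embedding distance; the function name and the pair representation are hypothetical.

```python
import heapq
import math
from typing import Dict, Hashable, List, Sequence, Tuple


def mine_hard_negatives(negative_pairs: List[Tuple[Hashable, Hashable]],
                        emb: Dict[Hashable, Sequence[float]],
                        num_positive: int) -> List[Tuple[Hashable, Hashable]]:
    """Keep the num_positive negative pairs with the smallest embedding distance."""
    return heapq.nsmallest(
        num_positive,
        negative_pairs,
        key=lambda pq: math.dist(emb[pq[0]], emb[pq[1]]),
    )
```

The pairs that survive are the ones the network currently confuses most, so the push term spends its budget where the embeddings are least separated.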

Figure 4: Given correspondence sets, we pull the points in the same correspondence set and push points from different correspondence sets adaptively, according to their average geodesic distances. The blue and orange correspondence sets are close so that they can stay close in embedding space, while the orange and green ones are far away in average geodesic distance so their embeddings are pushed further from each other.

Our method is summarized in Figure 4.

6 Experiments

Figure 5: Predicted semantic embeddings for PointConv. Same colors indicate similar embeddings.

Airplane Bathtub Bed Bench Bottle Bus Cap Car Chair Dishwasher Display Earphone Faucet
PointNet 0.088 0.245 0.231 0.198 0.106 0.082 0.123 0.074 0.198 0.124 0.180 0.101 0.170
PointNet++ 0.083 0.307 0.254 0.210 0.218 0.142 0.123 0.077 0.199 0.168 0.207 0.130 0.189
RS-Net 0.095 0.280 0.212 0.280 0.105 0.084 0.086 0.065 0.187 0.149 0.183 0.092 0.150
PointConv 0.078 0.284 0.237 0.220 0.107 0.107 0.099 0.083 0.185 0.142 0.186 0.096 0.150
DGCNN 0.075 0.273 0.223 0.216 0.144 0.115 0.110 0.087 0.215 0.159 0.239 0.092 0.141
GraphCNN 0.091 0.291 0.256 0.217 0.139 0.123 0.121 0.113 0.212 0.166 0.220 0.138 0.164
Minkowski 0.108 0.286 0.270 0.243 0.152 0.136 0.144 0.099 0.238 0.189 0.253 0.117 0.161
SHOT 0.229 0.488 0.539 0.530 0.382 0.405 0.345 0.386 0.474 0.515 0.455 0.495 0.274
Random 0.290 0.489 0.526 0.507 0.427 0.396 0.484 0.401 0.478 0.507 0.459 0.599 0.337
Guitar Helmet Knife Lamp Laptop Motorcycle Mug Pistol Rocket Skateboard Table Vessel Average
PointNet 0.095 0.177 0.061 0.265 0.171 0.123 0.070 0.168 0.186 0.155 0.075 0.119 0.143
PointNet++ 0.116 0.186 0.079 0.263 0.183 0.128 0.106 0.185 0.163 0.179 0.093 0.159 0.166
RS-Net 0.110 0.167 0.054 0.273 0.138 0.122 0.110 0.161 0.152 0.166 0.089 0.135 0.146
PointConv 0.109 0.176 0.076 0.270 0.137 0.128 0.085 0.173 0.168 0.156 0.097 0.144 0.148
DGCNN 0.124 0.173 0.068 0.261 0.181 0.148 0.139 0.174 0.172 0.150 0.069 0.162 0.156
GraphCNN 0.135 0.184 0.116 0.279 0.168 0.152 0.132 0.185 0.181 0.169 0.099 0.199 0.170
Minkowski 0.148 0.213 0.105 0.290 0.206 0.170 0.149 0.194 0.195 0.173 0.109 0.172 0.181
SHOT 0.305 0.387 0.194 0.425 0.543 0.340 0.414 0.334 0.271 0.381 0.607 0.377 0.404
Random 0.326 0.406 0.426 0.451 0.543 0.358 0.488 0.375 0.298 0.378 0.544 0.378 0.435
Table 2: Mean Geodesic Error (mGE) results.
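
  Input: model set $\mathcal{M}$, an embedding function $f$ to be evaluated
  Output: mean Geodesic Error (mGE) of $f$
  for $C$ in $\mathcal{C}$ do
     for $p$ in $C$ do
        for $q$ in $C \setminus \{p\}$ do
           $mGE \leftarrow mGE + d_g\big(q,\ \arg\min_{q' \in M(q)} \lVert f(q') - f(p) \rVert\big)$, where
              $M(q)$ denotes the model that point $q$ lies on.
        end for
     end for
  end for
  $mGE \leftarrow mGE$ / (number of accumulated pairs $(p, q)$)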
Algorithm 1 mean Geodesic Error calculation

In this section, we demonstrate that our proposed method for learning pointwise embeddings can effectively help fine-grained object semantic understanding. We first introduce a new metric to evaluate predicted embeddings. Then seven state-of-the-art neural network architectures are chosen as our method’s backbones and benchmarked. We additionally compare our approach, which is based on human labeled correspondences, with one based on part-level supervision.

Evaluation Metric

We introduce mean Geodesic Error (mGE) to evaluate predicted semantic embeddings. mGE is calculated individually for each category and measures how well the generated embedding vectors fit the annotated correspondence sets. Algorithm 1 presents the calculation procedure of mGE for a given embedding function $f$. Intuitively, for each annotated point on a model, we find the point on every other model that minimizes the embedding distance to it. After that, the geodesic distances between these predicted points and the human labeled corresponding points are accumulated. It is easy to verify that if all the embeddings are identical within the same correspondence set but distinct across different correspondence sets, then mGE = 0, which means that the predicted semantic embeddings are consistent with human labels.
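The procedure can be sketched in plain Python as follows. This is an illustrative reading of the description above, not the benchmark's reference code: points are represented as hypothetical `(model_id, vertex)` tuples, the embedding distance is assumed to be Euclidean, `geodesic` is an assumed callback, and the accumulated error is averaged over all visited pairs.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[str, int]  # (model id, vertex index); hypothetical representation


def mean_geodesic_error(corr_sets: List[List[Point]],
                        model_points: Dict[str, List[Point]],
                        emb: Dict[Point, Sequence[float]],
                        geodesic: Callable[[str, int, int], float]) -> float:
    """Average geodesic error of nearest-neighbor matches in embedding space.

    model_points maps a model id to all candidate points on that model.
    """
    total, count = 0.0, 0
    for c in corr_sets:
        for p in c:
            for q in c:
                if q == p:
                    continue
                model_id = q[0]
                # Predicted match for p on q's model: closest embedding to f(p).
                best = min(model_points[model_id],
                           key=lambda x: math.dist(emb[x], emb[p]))
                # Error: how far the prediction is from the labeled point q.
                total += geodesic(model_id, q[1], best[1])
                count += 1
    return total / count if count else 0.0
```

The search over `model_points` is brute force, which is fine for a sketch; a nearest-neighbor index over the embeddings would be the natural replacement for dense point clouds.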

Benchmark Neural Networks

We benchmark three kinds of backbones: point cloud, graph and voxel based neural networks. The point cloud based architectures PointNet [qi2017pointnet], PointNet++ [qi2017pointnet++], RS-Net and PointConv [pointconv] take unordered point sets as input and generate embeddings directly from these point sets. The graph based architectures DGCNN [wang2019dynamic] and GraphCNN [defferrard2016convolutional] use graph convolutional neural networks to extract embeddings. The voxel based architecture MinkowskiNet [choy20194d] takes voxels as input and utilizes sparse 3D convolutions. In addition, we report the performance of a local geometry based descriptor, SHOT [tombari2010unique], and of random embeddings.

Figure 6: Predicted embeddings for SHOT. Same colors indicate similar embeddings.

Evaluation and Results

We split our dataset into train (70%), validation (15%) and test (15%) sets. The train and validation sets are used during training and all the results are reported on the test set. We use the ADAM optimizer [kingma2014adam] with initial learning rate , , and batch size 4. The learning rate is multiplied by 0.9 every 10 epochs and the hyperparameter $\lambda$ in Equation 4 is set to 1. The output point embedding vector is 128-dimensional for all neural networks.
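The learning rate schedule can be written as a one-line step decay. The sketch assumes the decay is applied in discrete steps at epochs 10, 20, and so on; the initial learning rate is left as a parameter, and the function name is hypothetical.

```python
def learning_rate(epoch: int, initial_lr: float,
                  decay: float = 0.9, step: int = 10) -> float:
    """Step decay: multiply the learning rate by `decay` every `step` epochs."""
    return initial_lr * decay ** (epoch // step)
```

For example, after 25 epochs the rate has been decayed twice, so it equals the initial value times 0.81.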

Table 2 gives the mGE of all the compared architectures. SHOT fails to predict correct semantic correspondences across objects, and its performance is only slightly better than that of random point embeddings (average mGE 0.404 vs. 0.435). The reason is that SHOT only considers local geometric properties, without aggregating global structure and semantic information. The visualization of embeddings computed by SHOT is shown in Figure 6. In contrast, all deep learning based methods using our geodesic consistency loss achieve much smaller mGE. Among them, PointNet, RS-Net and PointConv are relatively superior to the other networks at extracting semantic correspondence information.

Figure 7: Comparison between our method and part-level supervision. Given a point on the source model, we find its closest point in embedding space on the target model and post-process the found correspondences with PMF [vestner2017product] to ensure bijectivity. The corresponding points are in the same color.

The visualization of learned embeddings by PointConv is shown in Figure 5. From Figure 5, we can see that consistent pointwise embeddings are generated across heterogeneous objects. We get reasonable dense embeddings of all points on objects though only sparse correspondence annotations are used. A possible explanation is that the annotated correspondences impose a sparse set of pairwise constraints on the embedding function approximated by a deep neural network. Deep neural networks are usually Lipschitz-continuous and therefore, by fitting these imposed correspondence constraints, dense continuous embeddings could be inferred.

Comparison to Part-level Supervision

To further illustrate the advantage of our proposed semantic correspondence sets, we compare our method with that supervised by part-level annotations.

We train a PointNet using correspondence labels and part labels respectively. For the PointNet trained on part labels, we use the same experimental settings for part segmentation as the original paper [qi2017pointnet] and extract features from the penultimate layer as point embeddings. Then, given a point on a source model, we use the embeddings to find its corresponding point on the target model; results are shown in Figure 7. Qualitatively, we can see that when trained on our correspondence labels, points with the same semantics have similar embeddings, while part-level supervision fails to give consistent semantic embeddings across objects. In addition, we compare them quantitatively using mGE, as shown in Table 3. PointNet trained on our correspondence labels achieves a lower mGE on 14 of the 15 categories (Laptop is the exception) and on average (0.130 vs. 0.212). In contrast, with only part-level supervision, points in the same part are hard to distinguish from each other, resulting in inferior performance. Note that the amount of training data for part-level supervision (10240) is more than seven times that for correspondence based supervision (1362).

PointNet PointNet(Part)
Airplane 0.088 0.182
Cap 0.123 0.271
Car 0.074 0.246
Chair 0.198 0.278
Earphone 0.101 0.140
Guitar 0.095 0.114
Knife 0.061 0.065
Lamp 0.265 0.313
Laptop 0.171 0.114
Motorcycle 0.123 0.237
Mug 0.070 0.182
Pistol 0.168 0.204
Rocket 0.186 0.218
Skateboard 0.155 0.330
Table 0.075 0.282
Average 0.130 0.212
Table 3: Comparison of mGE results for PointNet trained on human labeled correspondences and on part annotations. Part-level supervision is far from enough for inferring finer object semantics, while our method based on human labeled correspondences helps improve the semantic understanding of objects.

7 Conclusion

In this paper, we explored a new way to obtain fine-grained semantic understanding of 3D objects. Instead of explicitly defining semantic labels on an object, we leveraged the observation that, although semantic meanings on a single object can be ambiguous and hard to describe, the correspondences of certain points across objects are clear. We thus built a dataset named CorresPondenceNet (CPNet) based on human labeled correspondences, and proposed a novel geodesic guided push-pull loss to recover dense and rich semantic information of objects. A Mean Geodesic Error (mGE) metric is introduced to evaluate our method with various backbones. As shown in the experiments, our method can effectively learn pointwise semantic embeddings, which are implicitly inferred from correspondences.