Transductive Zero-Shot Learning for 3D Point Cloud Classification

12/16/2019 ∙ by Ali Cheraghian, et al. ∙ Australian National University

Zero-shot learning, the task of learning to recognize new classes not seen during training, has received considerable attention in the case of 2D image classification. However, despite the increasing ubiquity of 3D sensors, the corresponding 3D point cloud classification problem has not been meaningfully explored and introduces new challenges. This paper extends, for the first time, transductive Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) approaches to the domain of 3D point cloud classification. To this end, a novel triplet loss is developed that takes advantage of unlabeled test data. While designed for the task of 3D point cloud classification, the method is also shown to be applicable to the more common use-case of 2D image classification. An extensive set of experiments is carried out, establishing state-of-the-art for ZSL and GZSL in the 3D point cloud domain, as well as demonstrating the applicability of the approach to the image domain.


1 Introduction

Capturing 3D point cloud data from complex scenes has been facilitated recently by inexpensive and accessible 3D depth camera technology. This in turn has increased the interest in, and need for, 3D object classification methods that can operate on such data. However, much if not most of the data collected will belong to classes for which a classification system may not have been explicitly trained. In order to recognize such previously “unseen” classes, it is necessary to develop Zero-Shot Learning (ZSL) methods in the domain of 3D point cloud classification. While such methods are typically trained on a set of so-called “seen” classes, they are capable of classifying certain “unseen” classes as well. Knowledge about unseen classes is introduced to the network via semantic feature vectors that can be derived from networks pre-trained on image attributes or on a very large corpus of texts [29, 3, 65, 57].

Figure 1: The challenge of zero-shot learning for 3D point cloud data. (a) and (b) are pre-trained 2D image and 3D point cloud feature spaces respectively. (c) The average intra-class distance between an unseen feature vector and a semantic feature vector of the corresponding class after inductive learning in the visual and point cloud domains respectively. The embedding space quality is much higher in (a) than (b) because image-based pre-trained models, such as ResNet, use deeper networks trained on millions of images, whereas point cloud-based models, such as PointNet, use shallower networks trained on only a few thousand point clouds.

Performing ZSL for the purpose of 3D object classification is a more challenging task than ZSL applied to 2D images [34, 3, 4, 29, 21, 57]. ZSL methods in the 2D domain commonly take advantage of pre-trained models, like ResNet [15], that have been trained on millions of labeled images featuring thousands of classes. As a result, the extracted 2D features are very well clustered. By contrast, there is no parallel in the 3D point cloud domain; labeled 3D datasets tend to be small and have only limited sets of classes. For example, pre-trained models like PointNet [31] are trained on only a few thousand samples from a small number of classes. This leads to poor-quality 3D features with clusters that are not nearly as well separated as their visual counterparts. This gives rise to the problem of projection domain shift [14]. In essence, this means that the function learned from seen samples is biased, and cannot generalize well to unseen classes. In the inductive learning approach, where only seen classes are used during training, projected semantic vectors tend to move toward the seen feature vectors, making the intra-class distance between corresponding unseen semantic and feature vectors large. This intuition is visualized in Figure 1.

Now, the key question is to what extent these problems can be mitigated by adopting a transductive learning approach, where the model is trained using both labeled and unlabeled samples. Our goal is to design a strategy that reduces the bias and encourages the projected semantic vectors to align with their true feature vector counterparts, minimizing the average intra-class distance. In 2D ZSL, the transductive setting has been shown to be effective [14, 66, 49]; however, it is more challenging for 3D point cloud data. Pre-trained 3D features are poorly clustered and exhibit large intra-class distances. As a result, state-of-the-art transductive methods suitable for image data [14, 66, 49] are unable to reduce the bias problem for 3D data.

In order to take advantage of the transductive learning approach for 3D point cloud zero-shot learning, we propose a transductive ZSL method using a novel triplet loss that is employed in an unsupervised manner. Unlike the traditional triplet formulation [44, 32], our proposed triplet loss works on unlabeled (test) data and can operate without the need for ground-truth supervision. The loss is applied to the unlabeled data such that intra-class distances are minimized while inter-class distances are maximized, reducing the bias problem. As a result, a prediction function with greater generalization ability and effectiveness on unseen classes is learned. Moreover, our proposed method is also applicable to 2D ZSL, which demonstrates its generalization to other sensor modalities.

Our main contributions are: (1) extending and adapting transductive zero-shot learning and generalized zero-shot learning to 3D point cloud classification for the first time; (2) developing a novel triplet loss that takes advantage of unlabeled test data, applicable to both 3D point cloud data and 2D images; and (3) performing extensive experiments, establishing state-of-the-art on four 3D datasets, ModelNet10 [56], ModelNet40 [56], McGill [47], and SHREC2015 [26].

2 Related Work

Zero-Shot Learning: For the ZSL task, there has been significant progress, including on image recognition [34, 65, 3, 4, 29, 21, 57], multi-label ZSL [22, 36], and zero-shot detection [35]. Despite this progress, these methods solve a constrained problem in which test instances are restricted to unseen classes only, rather than coming from either seen or unseen classes. The setting where both seen and unseen classes are considered at test time is called Generalized Zero-Shot Learning (GZSL). To address this problem, some methods decrease the scores that seen classes produce by a constant value [5], while others perform a separate training stage intended to balance the probabilities of the seen and unseen classes [34]. Schonfeld et al. [43] learned a shared latent space of image features and semantic representations based on modality-specific VAEs. In our work, we use a novel unsupervised triplet loss to address the bias problem, leading to significantly better GZSL results.

Transductive Zero-Shot Learning: The transductive learning approach takes advantage of unlabeled test samples in addition to the labeled seen samples. For example, Rohrbach et al. [40] exploited the manifold structure of unseen classes using a graph-based learning algorithm to leverage the neighborhood structure within unseen classes. Yu et al. [63] proposed a transductive approach that predicts class labels via an iterative refinement process. More recently, transductive ZSL methods have started exploring how to improve the accuracy of both the seen and unseen classes in generalized ZSL tasks [66, 49]. Zhao et al. [66] proposed a domain-invariant projection method that projects visual features to the semantic space and reconstructs the same features from the semantic representation in order to narrow the domain gap. In another approach, Song et al. [49] identified the model bias problem of inductive learning, namely that a trained model assigns higher prediction scores to seen classes than to unseen ones, and proposed a quasi-fully supervised learning method to solve the GZSL task. Xian et al. [59] proposed f-VAEGAN-D2, which takes advantage of both VAEs and GANs to learn the feature distribution of unlabeled data. All of these approaches are designed for transductive ZSL tasks on 2D image data. In contrast, we explore to what extent a transductive ZSL setting helps to improve 3D point cloud recognition.

Learning with a Triplet Loss: Triplet losses have been widely used in computer vision [44, 32, 12, 16, 11]. Schroff et al. [44] demonstrated how to select positive and negative anchor points from visual features within a batch. Qiao et al. [32] introduced the use of a triplet loss to train an inductive ZSL model. More recently, Do et al. [11] proposed a tight upper bound of the triplet loss by linearizing it using class centroids, Zakharov et al. [64] explored the triplet loss in manifold learning, Srivastava et al. [50] investigated weighting hard negative samples more heavily than easy negatives, and Li et al. [25] proposed the angular triplet-center loss, a variant that reduces the similarity distance between features. Triplet-loss-based methods typically work in inductive settings, where the ground-truth label of an anchor point remains available during training. In contrast, we describe a triplet formation technique for the transductive setting: our method utilizes test data without knowing its true labels, and we choose positive and negative samples for an anchor from word vectors instead of from features.

ZSL on 3D Point Clouds: Despite much progress on 3D point cloud classification using deep learning [31, 30, 55, 61, 54, 60, 7, 38, 39, 37], only two works have addressed the ZSL problem for 3D point clouds. Cheraghian et al. proposed a bilinear compatibility function to associate a PointNet [31] feature vector with a semantic feature vector [9], and separately proposed an unsupervised skewness loss to mitigate the hubness problem [8]. Both works use inductive inference and are therefore less able than our proposed method to handle the bias towards seen classes in the GZSL task.

3 Transductive ZSL for 3D Point Clouds

Zero-shot learning is heavily dependent on good pre-trained models generating well-clustered features [28, 4, 3, 63], as the performance of established ZSL methods otherwise degrades rapidly. In the 2D case, pre-trained models are trained on thousands of classes and millions of images [57]. However, pre-trained models of similar quality are typically unavailable for 3D point cloud objects. Therefore, 3D point cloud features cluster more poorly than image features. To illustrate this point, in Figure 2 we visualize 3D features of unseen classes from the 3D datasets ModelNet10 [56] and McGill [47], and 2D features of unseen classes from the 2D datasets AwA2 [57] and CUB [53]. Here, we use unseen classes to highlight the generalization ability of the pre-trained model. Because a large dataset (such as ImageNet) is available for the 2D case, the cluster structure is more separable in 2D than in 3D. As 3D features are not as robust and separable as 2D features, relating those features to their corresponding semantic vectors is more difficult than in the corresponding 2D case. To address the poor feature quality of typical 3D datasets, we propose to use a triplet loss in the transductive setting of ZSL. Our method specifically addresses the alignment of poor features (like those coming from 3D feature extractors) with semantic vectors. Therefore, while our method improves the results for both 2D and 3D modalities, the largest gain is observed in the 3D case.

Figure 2: tSNE [52] visualizations of unseen 3D point cloud features of (a) ModelNet10 [56] (b) McGill [47] and unseen 2D image features of (c) AwA2 [57] (d) CUB [53] . The cluster structure in the 2D feature space is much better defined, with tighter and more separated clusters than those in the 3D point cloud.

3.1 Problem Formulation

Let $\mathcal{P} = \{p_k\}_{k=1}^{M}$, with $p_k \in \mathbb{R}^3$, denote a 3D point cloud. Also let $\mathcal{Y}^s$ and $\mathcal{Y}^u$ denote disjoint ($\mathcal{Y}^s \cap \mathcal{Y}^u = \emptyset$) seen and unseen class label sets with sizes $S$ and $U$ respectively, and let $\mathcal{E}^s = \{e^s_i\}_{i=1}^{S}$ and $\mathcal{E}^u = \{e^u_i\}_{i=1}^{U}$ denote the sets of associated semantic embedding vectors for the embedding function $\phi(\cdot)$, with $e = \phi(y) \in \mathbb{R}^d$. Then we define the set of seen instances as $\mathcal{Z}^s = \{(\mathcal{P}^s_i, y^s_i, e^s_i)\}_{i=1}^{N_s}$, where $\mathcal{P}^s_i$ is the $i$th point cloud of the seen set with label $y^s_i \in \mathcal{Y}^s$ and semantic vector $e^s_i \in \mathcal{E}^s$. The set of unseen instances is defined similarly as $\mathcal{Z}^u = \{(\mathcal{P}^u_i, y^u_i, e^u_i)\}_{i=1}^{N_u}$, where $\mathcal{P}^u_i$ is the $i$th point cloud of the unseen set with label $y^u_i \in \mathcal{Y}^u$ and semantic vector $e^u_i \in \mathcal{E}^u$.

We consider two learning problems in this work: zero-shot learning and its generalized variant. The goal of each problem is defined as follows.

  • Zero-Shot Learning (ZSL): To predict a class label $\hat{y} \in \mathcal{Y}^u$ from the unseen label set, given an unseen point cloud $\mathcal{P}^u$.

  • Generalized Zero-Shot Learning (GZSL): To predict a class label $\hat{y} \in \mathcal{Y}^s \cup \mathcal{Y}^u$ from the seen or unseen label sets, given a point cloud $\mathcal{P}$.

3.2 Model Training

Zero-shot learning can be addressed using inductive or transductive inference. For inductive ZSL, the model is trained in a fully-supervised manner with seen instances only, from the set $\mathcal{Z}^s$.

To learn an inductive model, an objective function

$$\mathcal{L}_{ind}(W) \;=\; \frac{1}{n}\sum_{i=1}^{n} \big\| x_i - \theta(e_{y_i}; W) \big\|_2^2 \;+\; \lambda \|W\|_2^2 \qquad (1)$$

is minimized, where $n$ is the number of instances in the batch, $x_i$ is the point cloud feature vector associated with point cloud $\mathcal{P}_i$, $W$ are the weights of the nonlinear projection function $\theta(\cdot\,; W)$ that maps from the semantic embedding space to the point cloud feature space, and the parameter $\lambda$ controls the amount of regularization.
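To make the shape of this objective concrete, the following is a minimal NumPy sketch of Eq. (1), under the assumption that the projection $\theta(\cdot\,; W)$ is the two-layer tanh network described in Section 3.4; the names `project`, `inductive_loss`, `W1`, and `W2` are illustrative and are not taken from the authors' implementation.

```python
import numpy as np

def project(e, W1, W2):
    # Illustrative stand-in for the projection theta(e; W): two fully-connected
    # layers with tanh non-linearities, mirroring the network in Section 3.4.
    # e: (n, d_sem) semantic vectors; W1: (d_sem, 512); W2: (512, d_feat).
    return np.tanh(np.tanh(e @ W1) @ W2)

def inductive_loss(x_feat, sem, W1, W2, lam):
    # Sketch of Eq. (1): mean squared distance between each point cloud feature
    # x_i and the projection of its class semantic vector e_{y_i}, plus an
    # L2 penalty on the projection weights (weighted by lam).
    # x_feat: (n, d_feat) features; sem: (n, d_sem) semantic vector of each
    # instance's ground-truth class.
    proj = project(sem, W1, W2)                       # (n, d_feat)
    data_term = np.mean(np.sum((x_feat - proj) ** 2, axis=1))
    reg_term = lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return data_term + reg_term
```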

In contrast, transductive ZSL additionally uses the set of unlabeled, unseen instances $\{\mathcal{P}^u_i\}$ and the set of unseen semantic embedding vectors $\mathcal{E}^u$ during training. To learn a transductive model in a semi-supervised manner, an objective function

$$\mathcal{L}_{trans}(W) \;=\; \frac{1}{n_s}\sum_{i=1}^{n_s} \big\| x^s_i - \theta(e^s_{y_i}; W) \big\|_2^2 \;+\; \omega \,\mathcal{L}_u(W) \;+\; \lambda \|W\|_2^2 \qquad (2)$$

is minimized, where $n_s$ is the batch size of seen instances, $\mathcal{L}_u$ is the unsupervised loss, $\omega$ controls the influence of the unsupervised loss, and $\lambda$ controls the amount of regularization. For the $\mathcal{L}_u$ term, a triplet loss is proposed, which will be outlined in the next section.
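Reusing the illustrative `project` and `inductive_loss` helpers from the sketch above, the transductive objective of Eq. (2) can be viewed as the supervised loss on the labeled seen batch plus a weighted unsupervised term on the unlabeled batch; the sketch below is an assumption about structure, not the released code.

```python
def transductive_objective(x_seen, sem_seen, unsup_loss, W1, W2, omega, lam):
    # Sketch of Eq. (2): supervised loss on the labeled seen batch plus
    # omega times the unsupervised (triplet) loss already computed on the
    # unlabeled batch (unsup_loss is the value of L_u from Section 3.3).
    return inductive_loss(x_seen, sem_seen, W1, W2, lam) + omega * unsup_loss
```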

Transductive ZSL addresses the problem of the projection domain shift [14] inherent in inductive ZSL approaches. In ZSL, the seen and unseen classes are disjoint and often only very weakly related. Since the underlying distributions of the seen and unseen classes may be quite different, the ideal projection function between the semantic embedding space and point cloud feature space is also likely to be different for seen and unseen classes. As a result, using the projection function learned from only the seen classes without considering the unseen classes will cause an unknown bias. Transductive ZSL reduces the domain gap and the resulting bias by using unlabeled unseen class instances during training, improving the generalization performance. The effect of the domain shift in ZSL is shown in Figure 3. When inductive learning is used (a), the projected unseen semantic embedding vectors are far from the cluster centres of the associated point cloud feature vectors, however when transductive learning is used (b), the vectors are much closer to the cluster centres.

Figure 3: 2D tSNE [52] visualization of unseen point cloud feature vectors (circles) and projected semantic feature vectors (squares) based on (a) inductive and (b) transductive learning on ModelNet10. The projected semantic feature vectors are much closer to the cluster centres of the point cloud feature vectors for transductive ZSL than for inductive ZSL, showing that the transductive approach is able to narrow the domain gap between seen and unseen classes.

3.3 Unsupervised Triplet Loss

In this work, we propose an unsupervised triplet loss for the $\mathcal{L}_u$ term in (2). It is unsupervised because the computation of $\mathcal{L}_u$ operates on test data, which remains unlabeled, and receives no ground-truth supervision throughout transductive training. To compute a triplet loss, a positive and a negative sample need to be found for each anchor sample [44]. In the fully-supervised setting, selecting positive and negative samples is not difficult, because all training samples have ground-truth labels. However, it is much more challenging in the unsupervised setting, where ground-truth labels are not available. For transductive ZSL, we define a positive sample using a pseudo-labeling approach [23]. For each anchor $x^u_i$, we assign a pseudo-label by choosing as the positive sample $e^+_i$ the semantic embedding vector whose projection $\theta(e; W)$ is closest to the anchor feature vector, as follows:

$$e^+_i \;=\; \underset{e \in \mathcal{E}^u}{\arg\min}\ \big\| x^u_i - \theta(e; W) \big\|_2 \qquad (3)$$

Such pseudo-labeling is different from the usual practice [23] because it chooses a semantic vector as the positive sample in the triplet formation instead of a plausible ground-truth label. For GZSL, the unlabeled data can be from either the seen or the unseen classes during training. As a result, a pseudo-label must be found for both unlabeled seen and unlabeled unseen samples. Importantly, if the pseudo-label indicates that an unlabeled sample is from a seen class, then that sample is discarded. This reduces the impact of incorrect, noisy pseudo-labels on the model for seen classes. Samples from seen classes (with ground-truth labels) will instead influence the supervised loss function. Hence, we use true supervision where possible (seen classes), and only use pseudo-supervision where there is no alternative (unseen classes). The positive sample for GZSL is therefore chosen as follows:

$$e^+_i \;=\; \underset{e \in \mathcal{E}^s \cup \mathcal{E}^u}{\arg\min}\ \big\| x_i - \theta(e; W) \big\|_2 \qquad (4)$$

The negative sample is selected from the seen semantic embedding set $\mathcal{E}^s$ for both ZSL and GZSL, since all elements of this set will have a different label from the unseen anchor. We choose the negative sample as the seen semantic embedding vector whose projection is closest to the anchor vector $x^u_i$:

$$e^-_i \;=\; \underset{e \in \mathcal{E}^s}{\arg\min}\ \big\| x^u_i - \theta(e; W) \big\|_2 \qquad (5)$$

Finally, the unsupervised loss function associated with the unlabeled instances, for both the ZSL and GZSL tasks, is defined as follows:

$$\mathcal{L}_u(W) \;=\; \frac{1}{n_u}\sum_{i=1}^{n_u} \max\!\Big( 0,\ \big\| x^u_i - \theta(e^+_i; W) \big\|_2^2 - \big\| x^u_i - \theta(e^-_i; W) \big\|_2^2 + m \Big) \qquad (6)$$

where $m$ is a margin that encourages separation between the clusters, and $n_u$ is the batch size of the unlabeled instances. We describe the overall training process in Algorithm 1. In the proposed algorithm, an inductive model is first learned; the transductive model is then initialized with the inductive model and subsequently trained.
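The sketch below (again reusing the illustrative `project` helper and NumPy import from Section 3.2) shows one way Eqs. (3)-(6) can be realised: the positive is the nearest projected semantic vector, the negative is the nearest projected seen semantic vector, and, in the GZSL case, anchors whose pseudo-label falls in the seen set are dropped. Names and batch handling are illustrative, not the authors' code.

```python
def unsupervised_triplet_loss(x_unlabeled, sem_unseen, sem_seen, W1, W2,
                              margin, gzsl=False):
    # Sketch of Eqs. (3)-(6). x_unlabeled: (n_u, d_feat) features of the
    # unlabeled (test) batch; sem_unseen / sem_seen: unseen / seen semantic
    # embedding matrices.
    proj_unseen = project(sem_unseen, W1, W2)   # candidate positives
    proj_seen = project(sem_seen, W1, W2)       # candidate negatives

    losses = []
    for x in x_unlabeled:
        if gzsl:
            # Eq. (4): pseudo-label over seen + unseen classes; anchors whose
            # nearest projected semantic vector is a seen class are discarded.
            cand = np.vstack([proj_seen, proj_unseen])
            idx = int(np.argmin(np.sum((cand - x) ** 2, axis=1)))
            if idx < len(proj_seen):
                continue
            pos = cand[idx]
        else:
            # Eq. (3): positive = nearest projected unseen semantic vector.
            pos = proj_unseen[np.argmin(np.sum((proj_unseen - x) ** 2, axis=1))]
        # Eq. (5): negative = nearest projected seen semantic vector.
        neg = proj_seen[np.argmin(np.sum((proj_seen - x) ** 2, axis=1))]
        # Eq. (6): hinge on the difference of squared distances plus a margin.
        d_pos = np.sum((x - pos) ** 2)
        d_neg = np.sum((x - neg) ** 2)
        losses.append(max(0.0, d_pos - d_neg + margin))
    return float(np.mean(losses)) if losses else 0.0
```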

This proposed triplet loss is distinct from recent literature [44, 32] in two ways. (1) Popular methods of triplet formation select a similar feature to the input feature as a positive sample, whereas we choose a semantic word vector for this purpose. This helps to better align the 3D point cloud features with the semantic vectors. (2) We employ a triplet loss in a transductive setting to utilize unlabeled (test) data, whereas established methods consider the triplet loss for inductive training only. This extends the role of the triplet loss beyond inductive learning.

1:  Input: seen instances Z^s, unlabeled (unseen) point clouds, seen semantic vectors E^s, unseen semantic vectors E^u, and hyper-parameters (unsupervised loss weight, regularization weight, margin)
2:  Output: a trained model to find the predicted label ŷ for all unlabeled point clouds
3:  Inductive training stage
4:      train an inductive model W_ind using Eq. (1) with only the seen data Z^s
5:  Transductive training stage
6:      W ← W_ind, initializing the transductive model with the inductive model
7:  repeat
8:      if GZSL then
9:          use W to assign positive and negative samples to each anchor using Eq. (4) and Eq. (5) for triplet formation
10:     else
11:         use W to assign positive and negative samples to each anchor using Eq. (3) and Eq. (5) for triplet formation
12:     for each batch do
13:         calculate the overall transductive loss using Eq. (2)
14:         backpropagate and update W
15: until convergence
16: Return: the class decision made with the trained W using Eq. (7) for ZSL or Eq. (8) for GZSL
Algorithm 1: Transductive ZSL for 3D point cloud objects

3.4 Model Architecture

The proposed model architecture is shown in Figure 4, consisting of two branches: the point cloud network, which extracts a feature vector $x$ from a point cloud $\mathcal{P}$, and the semantic projection network $\theta(\cdot\,; W)$, which projects a semantic feature vector $e$ into the point cloud feature space. Any network that learns a feature space from 3D point sets and is invariant to permutations of the points can be used as the point cloud network in our method [31, 30, 55, 24, 61, 54, 60]. The projection network, with trainable weights $W$, consists of two fully-connected layers, with 512 and 1024 dimensions respectively, each followed by a nonlinearity.
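As an illustration only (not the released implementation), the semantic projection branch could be written with the TensorFlow Keras functional API as two fully-connected layers with tanh activations, using the 512- and 1024-dimensional layer sizes and 300-dimensional word2vec input reported in Section 4.1; the function name and defaults are assumptions.

```python
import tensorflow as tf

def build_projection_network(sem_dim=300, feat_dim=1024):
    # Sketch of the semantic projection branch theta(.; W): two fully-connected
    # layers (512, 1024) with tanh non-linearities, mapping a semantic vector
    # into the point cloud feature space (Sec. 4.1 layer sizes).
    inputs = tf.keras.Input(shape=(sem_dim,))
    h = tf.keras.layers.Dense(512, activation='tanh')(inputs)
    outputs = tf.keras.layers.Dense(feat_dim, activation='tanh')(h)
    return tf.keras.Model(inputs, outputs)
```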

Figure 4: The proposed architecture for ZSL and GZSL. For inductive learning, the inputs are the seen point clouds and their semantic representations ($\mathcal{P}^s$, $\mathcal{E}^s$). For transductive learning, the unlabeled point clouds and the unseen semantic representations ($\mathcal{P}^u$, $\mathcal{E}^u$) are additionally used.

3.5 Inference

For the zero-shot learning task, given the optimal weights $W^\ast$ learned from training with labeled seen instances $\mathcal{Z}^s$ and unlabeled unseen instances $\{\mathcal{P}^u_i\}$, the label of the input point cloud $\mathcal{P}$ is predicted as

$$\hat{y} \;=\; \underset{c \in \mathcal{Y}^u}{\arg\min}\ \big\| x - \theta(e_c; W^\ast) \big\|_2 \qquad (7)$$

For the generalized zero-shot learning task, the label of the input point cloud $\mathcal{P}$ is predicted as

$$\hat{y} \;=\; \underset{c \in \mathcal{Y}^s \cup \mathcal{Y}^u}{\arg\min}\ \big\| x - \theta(e_c; W^\ast) \big\|_2 \qquad (8)$$
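A sketch of this inference rule, reusing the illustrative `project` helper from Section 3.2: the predicted class is the one whose projected semantic vector is nearest to the extracted point cloud feature, with the candidate set being the unseen classes for ZSL or the union of seen and unseen classes for GZSL.

```python
def predict_label(x_feat, sem_vectors, class_labels, W1, W2):
    # Sketch of Eqs. (7)/(8): assign the class whose projected semantic vector
    # is nearest (in squared Euclidean distance) to the extracted feature.
    # For ZSL, pass only the unseen classes; for GZSL, pass seen + unseen.
    proj = project(sem_vectors, W1, W2)               # (C, d_feat)
    dists = np.sum((proj - x_feat) ** 2, axis=1)
    return class_labels[int(np.argmin(dists))]
```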

4 Results

4.1 Experimental Setup

Datasets: We evaluate our approach on four well-known 3D datasets, ModelNet10 [56], ModelNet40 [56], McGill [47], and SHREC2015 [26], and two 2D datasets, AwA2 [57] and CUB [53]. The dataset statistics as used in this work are given in Table 1. For the 3D datasets, we follow the seen/unseen splits proposed by Cheraghian et al. [9], where the seen classes are those in ModelNet40 that do not occur in ModelNet10, and the unseen classes are those from the test sets of ModelNet10, McGill, and SHREC2015 that are not in the set of seen classes. These splits allow us to test on unseen classes drawn from distributions different from that of the seen classes. For the 2D datasets, we follow the Standard Splits (SS) and Proposed Splits (PS) of Xian et al. [57].

Semantic features: We use the 300-dimensional word2vec [27] semantic feature vectors for the 3D dataset experiments, the 85-dimensional attribute vectors from Xian et al. [57] for the AwA2 experiments, and the 312-dimensional attribute vectors from Wah et al. [53] for the CUB experiments.

Evaluation: We report the top-1 accuracy as a measure of recognition performance, where the predicted label (the class with minimum distance from the test sample) must match the ground-truth label to be considered a successful prediction. For generalized ZSL, we also report the Harmonic Mean (HM) [57] of the accuracy of the seen and unseen classes, computed as

$$HM \;=\; \frac{2 \times Acc_s \times Acc_u}{Acc_s + Acc_u} \qquad (9)$$

where $Acc_s$ and $Acc_u$ are the seen and unseen class top-1 accuracies respectively.
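For reference, Eq. (9) as a small helper; the example value reproduces the harmonic mean of the "Ours" row for ModelNet10 in Table 3.

```python
def harmonic_mean(acc_seen, acc_unseen):
    # Eq. (9): harmonic mean of seen and unseen top-1 accuracies.
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2.0 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# e.g. harmonic_mean(74.6, 23.4) ~= 35.6, matching the "Ours" row for
# ModelNet10 in Table 3.
```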

Cross-validation: We used Monte Carlo cross-validation to find the best hyper-parameters, averaging over multiple repetitions. For ModelNet40, 5 of the 30 seen classes were randomly selected as an unseen validation set, and a subset of the seen classes was likewise held out for the AwA2 and CUB datasets. The selected hyper-parameter values were 0.15 and 0.0001 for ModelNet40, 0.1 and 0.001 for AwA2, and 0.25 and 0.001 for CUB.

Implementation details: For the 3D experiments, we used PointNet [31] as the point cloud feature extraction network, with five multi-layer perceptron layers (64, 64, 64, 128, 1024) followed by a max-pooling layer and two fully-connected layers (512, 1024). Batch normalization (BN) [17] and ReLU activations were used for each layer. The 1024-dimensional input feature embedding was extracted from the last fully-connected layer. The network was pre-trained on the 30 seen classes of ModelNet40. For the 2D experiments, we used a 101-layer ResNet architecture [15], where the 2048-dimensional input feature embedding was obtained from the top-layer pooling unit. The network was pre-trained on ImageNet 1K [10] without fine-tuning. We fixed the pre-trained weights for both the 3D and 2D networks. For the semantic projection network, we used two fully-connected layers (512, 1024) with tanh non-linearities; these parameters are fully learnable. To train the network, we used the Adam optimizer [20] with an initial learning rate of 0.0001, and batch sizes of 32 and 128 for the 3D and 2D experiments respectively. We implemented the architecture in TensorFlow [1] and trained and tested it on an NVIDIA GTX Titan V GPU.

     Dataset            Total classes   Seen/Unseen   Train/Valid/Test
3D   ModelNet40 [56]    40              30/–          5852/1560/–
     ModelNet10 [56]    10              –/10          –/–/908
     McGill [47]        19              –/14          –/–/115
     SHREC2015 [26]     50              –/30          –/–/192
2D   AwA2 SS [57]       50              40/10         30337/–/6985
     AwA2 PS [57]       50              40/10         23527/5882/7913
     CUB SS [53]        200             150/50        8855/–/2933
     CUB PS [53]        200             150/50        7057/1764/2967
Table 1: Statistics of the 3D and 2D datasets. The total number of classes in each dataset is reported, alongside the splits used in this paper dividing the classes into seen and unseen and the instances into those used for training, validation, or testing. The 3D splits are from [9] and the 2D Standard Splits (SS) and Proposed Splits (PS) are from Xian et al. [57].

4.2 3D Point Cloud Experiments

For the experiments on 3D data, we compare with two 3D ZSL methods, ZSLPC [9] and MHPC [8], and three 2D ZSL methods, f-CLSWGAN [58], CADA-VAE [43], and QFSL [49]. These state-of-the-art image-based methods were re-implemented and adapted to point cloud data to facilitate comparison. We also report results for a baseline inductive method, which uses the inductive loss function (1) and is trained only on labeled seen classes, and for a transductive baseline method, which replaces our unsupervised triplet loss on the unlabeled data with a standard Euclidean loss.

The results on the ModelNet10, McGill, and SHREC2015 datasets are shown in Table 2. Our method significantly outperforms the other approaches on these datasets. Several observations can be made from the results. (1) Transductive learning is much more effective than inductive learning for point cloud ZSL. This is likely due to inductive approaches being more biased towards seen classes, while transductive approaches alleviate the bias problem by using unlabeled, unseen instances during training. (2) Although generative methods [58, 43] have shown successful results on 2D ZSL, they fail to generalize to 3D ZSL. We hypothesize that they rely more strongly on high quality pre-trained models and attribute embeddings, both of which are not available for 3D data. (3) Our proposed method performs better than QFSL, which is likely due to our triplet loss formulation. While noisy, the positive and negative samples of unlabeled data provide useful supervision, unlike the unsupervised approach for only unlabeled data in QFSL. (4) The triplet loss performs much better than the Euclidean loss for this problem, since it maximizes the inter-class distance as well as minimizing the intra-class distance. (5) Our proposed method does not perform as well on the McGill and SHREC2015 datasets when compared to the ModelNet10 results, because the distributions of semantic feature vectors in the unseen McGill and SHREC2015 datasets are significantly different from the distribution in the seen ModelNet40 dataset, much more so than that of ModelNet10 [9].

Method ModelNet10 McGill SHREC2015
I ZSLPC [9] 28.0 10.7 5.2
MHPC [8] 33.9 12.5 6.2
f-CLSWGAN [58] 20.7 10.2 5.2
CADA-VAE [43] 23.0 10.7 6.2
Baseline 23.5 13.0 5.2
T QFSL [49] 38.8 18.8 9.5
Baseline 37.8 21.7 5.2
Ours 46.9 21.7 13.0
Table 2: ZSL results on the 3D ModelNet10 [56], McGill [47], and SHREC2015 [26] datasets. We report the top-1 accuracy (%) for each method. “I” and “T” denote inductive and transductive learning respectively.
Figure 5: Individual performance on unseen classes from ModelNet10. Our transductive method consistently outperforms both ZSLPC [9] and the inductive baseline.

Generalized ZSL, which is more realistic than standard ZSL, is more challenging than ZSL as there are both seen and unseen classes during inference. As a result, methods proposed for ZSL do not usually report results for GZSL. The results are shown in Table 3. Our method obtained the best performance with respect to the harmonic mean (HM) on all datasets, and the best performance with respect to the unseen class accuracy on most datasets, which demonstrates the utility of our method for GZSL as well as ZSL for 3D point cloud recognition.

     Method               ModelNet10              McGill                  SHREC2015
                          Acc_s   Acc_u   HM      Acc_s   Acc_u   HM      Acc_s   Acc_u   HM
I MHPC [8] 53.8 26.2 35.2 - - - - - -
f-CLSWGAN [58] 76.3 3.7 7.0 75.3 2.3 4.5 74.2 0.8 1.6
CADA-VAE [43] 84.7 1.3 2.6 83.3 1.6 3.1 80.0 1.7 3.3
Baseline 83.7 0.4 0.8 80.0 0.9 1.8 82.1 0.9 1.8
T QFSL [49] 58.1 21.8 31.7 65.3 13.0 21.6 72.3 7.8 14.1
Baseline 77.7 21.0 33.1 75.5 12.2 21.0 83.4 4.2 8.0
Ours 74.6 23.4 35.6 74.4 13.9 23.4 78.6 10.6 18.4
Table 3: GZSL results on the 3D ModelNet10 [56], McGill [47], and SHREC2015 [26] datasets. We report the top-1 accuracy (%) on seen classes (Acc_s) and unseen classes (Acc_u) for each method, as well as the harmonic mean (HM) of both measures. “I” and “T” denote inductive and transductive learning respectively.

We also show, in Figure 5, the performance on individual classes from ModelNet10. Our method achieves the best accuracy on most classes, while the inductive baseline and ZSLPC [9] have close to zero accuracy on many classes (e.g., desk, night stand, toilet, and bed). This is likely due to the hubness problem, to which inductive methods are more sensitive than transductive methods.

4.3 2D Image Experiments

While our method was designed to address ZSL and GZSL tasks for 3D point cloud recognition, we also adapt and evaluate our method for the case of 2D image recognition. The results for ZSL and GZSL are shown in Tables 4 and 5 respectively.

For ZSL, our proposed method is evaluated on the AwA2 [57] and CUB [53] datasets using the SS and PS splits [57]. Our method achieves very competitive results on these datasets, indicating that the method can generalize to image data. Note that we do not fine-tune the image feature extraction network in our model, unlike the models listed with asterisks, for fair comparison with existing work. However, the literature demonstrates that fine-tuning can improve performance considerably, particularly on the CUB dataset.

For GZSL, we evaluate our method on the same datasets and compare with state-of-the-art GZSL methods [48, 5, 66, 49]. As shown in Table 5, our method is again competitive with the other methods on the AwA2 dataset with respect to both unseen class accuracy and harmonic mean accuracy. Our results lag state-of-the-art on the CUB dataset, although fine-tuning the feature extraction network may go some way to closing this gap.

Method AwA2 CUB
SS PS SS PS
I SJE [2] 69.5 61.9 55.3 53.9
ESZSL [41] 75.6 58.6 55.1 53.9
SYNC [4] 71.2 46.6 54.1 55.6
f-CLSWGAN [58] - - - 57.3
f-VAEGAN-D2 [59] - 71.1 - 61.0
f-VAEGAN-D2* [59] - 70.3 - 72.9
Baseline 71.2 69.0 59.3 54.2
T DIPL [66] - - 68.2 65.4
QFSL* [49] 84.8 79.7 69.7 72.1
f-VAEGAN-D2 [59] - 89.8 - 71.1
f-VAEGAN-D2* [59] - 89.3 - 82.6
Baseline 83.3 75.6 70.6 58.3
Ours 88.1 87.3 72.0 62.2
Table 4: ZSL results on the Standard Splits (SS) and Proposed Splits (PS) of the 2D AwA2 and CUB datasets. We report the top-1 accuracy (%) for each method. “I” and “T” denote inductive and transductive learning respectively. *Image feature extraction model fine-tuned (we do not fine-tune our model).
     Method                    AwA2                    CUB
                               Acc_s   Acc_u   HM      Acc_s   Acc_u   HM
I CMT[48] 89.0 8.7 15.9 60.1 4.7 8.7
CS[5] 77.6 45.3 57.2 49.4 48.1 48.7
f-CLSWGAN [58] - - - 43.7 57.7 49.7
CADA-VAE [43] 75.0 55.8 63.9 53.5 51.6 52.6
f-VAEGAN-D2 [59] 57.6 70.6 63.5 48.4 60.1 53.6
f-VAEGAN-D2* [59] 57.1 76.1 65.2 63.2 75.6 68.9
Baseline 88.9 22.1 35.4 69.4 8.4 14.9
T DIPL[66] - - - 44.8 41.7 43.2
QFSL*[49] 93.1 66.2 77.4 74.9 71.5 73.2
f-VAEGAN-D2 [59] 84.8 88.6 86.7 61.4 65.4 63.2
f-VAEGAN-D2* [59] 86.3 88.7 87.5 73.8 81.4 77.3
Baseline 88.0 67.2 76.2 51.4 40.2 45.1
Ours 81.8 83.1 82.4 50.5 50.2 50.3
Table 5: GZSL results on the 2D AwA2 and CUB datasets. We report the top-1 accuracy (%) on seen classes (Acc_s) and unseen classes (Acc_u) for each method, as well as the harmonic mean (HM) of both measures. “I” and “T” denote inductive and transductive learning respectively. *Image feature extraction model fine-tuned (we do not fine-tune our model).

4.4 Discussion

Challenges with 3D data: Recent deep learning methods for classifying point cloud objects have achieved over 90% accuracy on several standard datasets, including ModelNet40 and ModelNet10. Moreover, due to significant progress in depth camera technology  [6, 18], it is now possible to capture 3D point cloud objects at scale much more easily. It is therefore likely that many classes of 3D objects will not be present in the labeled training set. As a result, zero-shot classification systems will be needed to leverage other more easily-obtainable sources of information in order to classify unseen objects. However, we observe that the difference in accuracy between ZSL and supervised learning is still very large for 3D point cloud classification, 46.9% as compared to 95.7% [24] for ModelNet10. As such, there is significant potential for improvement for zero-shot 3D point cloud classification. While the performance is still quite low, this is also the case for 2D ZSL, with state-of-the-art being 31.1% top-5 accuracy on the ImageNet2010/12 [42] datasets, reflecting the challenging nature of the problem.

Hubness: ZSL methods either (a) map the input feature space to the semantic space using a hinge loss or a least mean squares loss [13, 48], (b) map both spaces to an intermediate space using a binary cross-entropy or a hinge loss [19, 62], or (c) map the semantic space to the input feature space [65]. We use the last approach, projecting semantic vectors to the input feature space, since it has been shown that this alleviates the hubness problem [46, 65]. We validate this claim by measuring the skewness of the nearest-neighbour (hub) distribution [46, 33] for each projection direction, together with the associated classification accuracy. We report these values in Table 6 for the ModelNet10 dataset. The degree of skewness is much lower when projecting the semantic feature space to the point cloud feature space, and this direction achieves a significantly higher accuracy. This provides additional evidence that this projection direction is preferable for mitigating the problem of hubs and the consequent bias.

Skewness (Accuracy)   Semantic space → input space   Input space → semantic space
Inductive             2.67 (23.5%)                   3.07 (19.5%)
Transductive          -0.19 (46.9%)                  2.03 (31.2%)
Table 6: The skewness (and accuracy) on ModelNet10 with different projection directions in both inductive and transductive settings. The skewness is lower when projecting the semantic space to the input point cloud feature space, mitigating the hubness problem and leading to more accurate transductive ZSL.
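One plausible way to compute such a skewness statistic (a sketch following the spirit of [46, 33]; the exact protocol, e.g. the neighbourhood size, may differ from the one used for Table 6) is to count how often each projected class vector is the nearest neighbour of a test feature and take the skewness of those counts.

```python
import numpy as np
from scipy.stats import skew

def nearest_neighbour_skewness(test_features, projected_class_vectors):
    # Count how often each (projected) class vector is the nearest neighbour
    # of a test feature, then report the skewness of those counts. A large
    # positive skew means a few classes act as "hubs" attracting most predictions.
    counts = np.zeros(len(projected_class_vectors))
    for x in test_features:
        dists = np.sum((projected_class_vectors - x) ** 2, axis=1)
        counts[int(np.argmin(dists))] += 1
    return float(skew(counts))
```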

5 Conclusion

In this paper, we identified and addressed issues that arise in the inductive and transductive settings of zero-shot learning and its generalized variant when applied to the domain of 3D point cloud classification. We observed that the embeddings produced by pre-trained 2D feature extractors are of significantly higher quality than those produced by their 3D counterparts, due to the vast difference in the amount of labeled training data they have been exposed to. To mitigate this, a novel triplet loss was developed that makes use of unlabeled test data in a transductive setting. The utility of this method was demonstrated via an extensive set of experiments that showed significant benefit in the 2D domain and established state-of-the-art results in the 3D domain for ZSL and GZSL tasks.

References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In OSDI, Vol. 16, pp. 265–283. Cited by: §4.1.
  • [2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele (2015) Evaluation of output embeddings for fine-grained image classification. In CVPR, Vol. 07-12-June-2015, pp. 2927–2936. External Links: Document Cited by: Table 4.
  • [3] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid (2016-07) Label-Embedding for Image Classification. IEEE TPAMI 38 (7), pp. 1425–1438. External Links: Document Cited by: §1, §1, §2, §3.
  • [4] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha (2016) Synthesized classifiers for zero-shot learning. In CVPR, Vol. 2016-January, pp. 5327–5336. Cited by: §1, §2, §3, Table 4.
  • [5] W. Chao, B. Changpinyo, and F. Sha (2016) An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), pp. 52–68. Cited by: §2, §4.3, Table 5.
  • [6] C. Chen, B. Yang, S. Song, M. Tian, J. Li, W. Dai, and L. Fang (2018) Calibrate multiple consumer rgb-d cameras for low-cost and efficient 3d indoor mapping. Remote Sensing 10 (2). External Links: Link, ISSN 2072-4292, Document Cited by: §4.4.
  • [7] A. Cheraghian and L. Petersson (2019-01) 3DCapsule: extending the capsule architecture to classify 3d point clouds. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Vol. , pp. 1194–1202. External Links: Document, ISSN 1550-5790 Cited by: §2, §6.2.
  • [8] A. Cheraghian, S. Rahman, D. Campbell, and L. Petersson (2019) Mitigating the hubness problem for zero-shot learning of 3d objects. In British Machine Vision Conference (BMVC’19), External Links: 1907.06371 Cited by: §2, §4.2, Table 2, Table 3.
  • [9] A. Cheraghian, S. Rahman, and L. Petersson (2019) Zero-shot learning of 3d point cloud objects. In International Conference on Machine Vision Applications (MVA), Cited by: §2, Figure 5, §4.1, §4.2, §4.2, §4.2, Table 1, Table 2.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §4.1.
  • [11] T. Do, T. Tran, I. Reid, V. Kumar, T. Hoang, and G. Carneiro (2019-06) A theoretically sound upper bound on the triplet loss for improving the efficiency of deep distance metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [12] X. Dong and J. Shen (2018-09) Triplet loss in siamese network for object tracking. In The European Conference on Computer Vision (ECCV), Cited by: §2.
  • [13] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) DeViSE: a deep visual-semantic embedding model. In NIPS, Cited by: §4.4.
  • [14] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong (2015-11) Transductive multi-view zero-shot learning. IEEE Trans. Pattern Anal. Mach. Intell. 37 (11), pp. 2332–2345. External Links: ISSN 0162-8828 Cited by: §1, §1, §3.2.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §1, §4.1.
  • [16] X. He, Y. Zhou, Z. Zhou, S. Bai, and X. Bai (2018-06) Triplet-center loss for multi-view 3d object retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [17] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.1.
  • [18] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon (2011) KinectFusion: real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, UIST ’11, New York, NY, USA, pp. 559–568. External Links: ISBN 978-1-4503-0716-1, Link, Document Cited by: §4.4.
  • [19] M. Jaderberg, K. Simonyan, A. Zisserman, and k. kavukcuoglu (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2017–2025. External Links: Link Cited by: §4.4.
  • [20] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [21] C. H. Lampert, H. Nickisch, and S. Harmeling (2014-03) Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (3), pp. 453–465. External Links: Document, ISSN 0162-8828 Cited by: §1, §2.
  • [22] C. Lee, W. Fang, C. Yeh, and Y. Frank Wang (2018-06) Multi-label zero-shot learning with structured knowledge graphs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [23] D. Lee (2013-07) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. ICML 2013 Workshop: Challenges in Representation Learning (WREPL). Cited by: §3.3, §3.3.
  • [24] J. Li, B. M. Chen, and G. H. Lee (2018) SO-net: self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9397–9406. Cited by: §3.4, §4.4, §6.2.
  • [25] Z. Li, C. Xu, and B. Leng (2019) Angular triplet-center loss for multi-view 3d shape retrieval. In AAAI, Cited by: §2.
  • [26] Z. Lian, J. Zhang, S. Choi, H. ElNaghy, J. El-Sana, T. Furuya, A. Giachetti, R. A. Guler, L. Lai, C. Li, H. Li, F. A. Limberger, R. Martin, R. U. Nakanishi, A. P. Neto, L. G. Nonato, R. Ohbuchi, K. Pevzner, D. Pickup, P. Rosin, A. Sharf, L. Sun, X. Sun, S. Tari, G. Unal, and R. C. Wilson (2015) Non-rigid 3D Shape Retrieval. In Eurographics Workshop on 3D Object Retrieval, I. Pratikakis, M. Spagnuolo, T. Theoharis, L. V. Gool, and R. Veltkamp (Eds.), External Links: Document Cited by: §1, §4.1, Table 1, Table 2, Table 3, Table 7.
  • [27] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119. Cited by: §4.1.
  • [28] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean (2014) Zero-shot learning by convex combination of semantic embeddings. In ICLR, Cited by: §3.
  • [29] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell (2009) Zero-shot learning with semantic output codes. In NIPS, Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta (Eds.), pp. 1410–1418. Cited by: §1, §1, §2.
  • [30] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108. Cited by: §2, §3.4, §6.2.
  • [31] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE 1 (2), pp. 4. Cited by: §1, §2, §3.4, §4.1, §6.2, Table 7.
  • [32] R. Qiao, L. Liu, C. Shen, and A. van den Hengel (2017) Visually aligned word embeddings for improving zero-shot learning. In British Machine Vision Conference (BMVC’17), Cited by: §1, §2, §3.3.
  • [33] M. Radovanovic, A. Nanopoulos, and M. Ivanovic (2010) Hubs in space: popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research 11, pp. 2487–2531. Cited by: §4.4.
  • [34] S. Rahman, S. Khan, and F. Porikli (2018-11) A unified approach for conventional zero-shot, generalized zero-shot, and few-shot learning. IEEE Transactions on Image Processing 27 (11), pp. 5652–5667. External Links: Document, ISSN 1057-7149 Cited by: §1, §2.
  • [35] S. Rahman, S. Khan, and F. Porikli (2018-12) Zero-shot object detection: learning to simultaneously recognize and localize novel concepts. In Asian Conference on Computer Vision (ACCV), Cited by: §2.
  • [36] S. Rahman and S. Khan (2018-12) Deep multiple instance learning for zero-shot image tagging. In Asian Conference on Computer Vision (ACCV), Cited by: §2.
  • [37] S. Ramasinghe, S. Khan, N. Barnes, and S. Gould (2019) Blended convolution and synthesis for efficient discrimination of 3d shapes. External Links: 1908.10209 Cited by: §2.
  • [38] S. Ramasinghe, S. Khan, N. Barnes, and S. Gould (2019) Representation learning on unit ball with 3d roto-translational equivariance. External Links: 1912.01454 Cited by: §2.
  • [39] S. Ramasinghe, S. Khan, N. Barnes, and S. Gould (2019) Spectral-gans for high-resolution 3d point-cloud generation. External Links: 1912.01800 Cited by: §2.
  • [40] M. Rohrbach, S. Ebert, and B. Schiele (2013) Transfer learning in a transductive setting. In NIPS, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 46–54. Cited by: §2.
  • [41] B. Romera-Paredes and P. Torr (2015) An embarrassingly simple approach to zero-shot learning. In ICML, pp. 2152–2161. Cited by: Table 4.
  • [42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. IJCV 115 (3), pp. 211–252. External Links: Document Cited by: §4.4.
  • [43] E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata (2019-06) Generalized zero- and few-shot learning via aligned variational autoencoders. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §4.2, §4.2, Table 2, Table 3, Table 5.
  • [44] F. Schroff, D. Kalenichenko, and J. Philbin (2015-06) FaceNet: a unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823. External Links: Document, ISSN 1063-6919 Cited by: §1, §2, §3.3, §3.3.
  • [45] Y. Shen, C. Feng, Y. Yang, and D. Tian (2018-06) Neighbors do help: deeply exploiting local structures of point clouds. Cited by: §6.2.
  • [46] Y. Shigeto, I. Suzuki, K. Hara, M. Shimbo, and Y. Matsumoto (2015) Ridge regression, hubness, and zero-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 135–151. Cited by: §4.4.
  • [47] K. Siddiqi, J. Zhang, D. Macrini, A. Shokoufandeh, S. Bouix, and S. Dickinson (2008-05) Retrieving articulated 3-d models using medial surfaces. Mach. Vision Appl. 19 (4), pp. 261–275. External Links: ISSN 0932-8092, Link, Document Cited by: §1, Figure 2, §3, §4.1, Table 1, Table 2, Table 3, Table 7.
  • [48] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng (2013) Zero-shot learning through cross-modal transfer. In NIPS, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 935–943. Cited by: §4.3, §4.4, Table 5.
  • [49] J. Song, C. Shen, Y. Yang, Y. P. Liu, and M. Song (2018) Transductive unbiased embedding for zero-shot learning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1024–1033. Cited by: §1, §2, §4.2, §4.3, Table 2, Table 3, Table 4, Table 5, §6.3, Table 8, Supplementary Material.
  • [50] S. Srivastava and B. Lall (2019-02) DeepPoint3D: learning discriminative local descriptors using deep metric learning on 3d point clouds. Pattern Recognition Letters, pp. . External Links: Document Cited by: §2.
  • [51] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M. Yang, and J. Kautz (2018) SPLATNet: sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2530–2539. Cited by: §6.2.
  • [52] L. Van Der Maaten (2014) Accelerating t-sne using tree-based algorithms.. Journal of machine learning research 15 (1), pp. 3221–3245. Cited by: Figure 2, Figure 3, Figure 7.
  • [53] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: Figure 2, §3, §4.1, §4.1, §4.3, Table 1.
  • [54] C. Wang, B. Samari, and K. Siddiqi (2018) Local spectral graph convolution for point set feature learning. arXiv preprint arXiv:1803.05827. Cited by: §2, §3.4, §6.2.
  • [55] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2018) Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829. Cited by: §2, §3.4, §6.2, Table 7, Supplementary Material.
  • [56] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §1, Figure 2, §3, §4.1, Table 1, Table 2, Table 3, Table 7, Table 8.
  • [57] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata (2018) Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence (), pp. 1–1. External Links: Document, ISSN 0162-8828 Cited by: §1, §1, §2, Figure 2, §3, §4.1, §4.1, §4.1, §4.3, Table 1.
  • [58] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata (2018-06) Feature generating networks for zero-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.2, §4.2, Table 2, Table 3, Table 4, Table 5.
  • [59] Y. Xian, S. Sharma, B. Schiele, and Z. Akata (2019-06) F-vaegan-d2: a feature generating framework for any-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, Table 4, Table 5.
  • [60] S. Xie, S. Liu, Z. Chen, and Z. Tu (2018-06) Attentional shapecontextnet for point cloud recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §3.4, §6.2.
  • [61] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao (2018) SpiderCNN: deep learning on point sets with parameterized convolutional filters. arXiv preprint arXiv:1803.11527. Cited by: §2, §3.4, §6.2.
  • [62] Y. Yang and T. Hospedales (2015) A unified perspective on multi-domain and multi-task learning. In 3rd International Conference on Learning Representations (ICLR), (English). Cited by: §4.4.
  • [63] Y. Yu, Z. Ji, X. Li, J. Guo, Z. Zhang, H. Ling, and F. Wu (2018-10) Transductive zero-shot learning with a self-training dictionary approach. IEEE Transactions on Cybernetics 48 (10), pp. 2908–2919. External Links: ISSN 2168-2267 Cited by: §2, §3.
  • [64] S. Zakharov, W. Kehl, B. Planche, A. Hutter, and S. Ilic (2017-09) 3D object instance recognition and pose estimation using triplet loss with dynamic margin. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 552–559. Cited by: §2.
  • [65] L. Zhang, T. Xiang, and S. Gong (2017-07) Learning a deep embedding model for zero-shot learning. In CVPR, Cited by: §1, §2, §4.4.
  • [66] A. Zhao, M. Ding, J. Guan, Z. Lu, T. Xiang, and J. Wen (2018) Domain-invariant projection learning for zero-shot recognition. In Advances in neural information processing systems (NIPS), Cited by: §1, §2, §4.3, Table 4, Table 5.

Supplementary Material

In this supplementary material, we further assess our proposed method with additional quantitative and qualitative evaluations. In the quantitative evaluation section, we evaluate (1) the effect of the batch size on 3D Zero-Shot Learning (ZSL) using ModelNet10, (2) the effect of using a different point cloud architecture, EdgeConv [55], and (3) the effect of using the experimental protocol for Generalized Zero-Shot Learning (GZSL) proposed by Song [49]. In the qualitative evaluation section, we show success and failure cases on unseen classes from ModelNet10.

6 Additional Quantitative Evaluation

6.1 Batch Size

In this experiment, we evaluate the effect of the batch size on the accuracy of our proposed method for the 3D ModelNet10 dataset. As can be seen in Figure 6, the size of the batch has a significant impact on the performance, with the best performance on this dataset being achieved at a batch size of 32.

Figure 6: Top-1 accuracy on the ModelNet10 dataset as the batch size varies.

6.2 Point Cloud Architecture

In this paper, we used PointNet [31] as the backbone point cloud architecture for our 3D experiments. However, while PointNet is one of the first deep learning methods proposed for point cloud classification, many other methods [31, 30, 55, 24, 61, 54, 60, 45, 51, 7] have since been introduced that tend to achieve better performance for supervised 3D point cloud classification. Here, we compare PointNet with EdgeConv [55] to study the effect of using a more advanced point cloud architecture for the task of 3D ZSL classification. In supervised 3D point cloud classification, EdgeConv achieves 92.2% accuracy on ModelNet40 while PointNet achieves 89.2%. In this additional experiment, we use ModelNet10 as the unseen set to compare these two methods. As shown in Table 7, both PointNet and EdgeConv achieve similar performance. We would expect to see some improvement when using EdgeConv, since it works better in the case of supervised classification. In Figure 7, however, it can be seen that both PointNet and EdgeConv cluster unseen point cloud features similarly and imperfectly. This again shows the difficulty of the ZSL task on 3D data, where there is a lack of good pre-trained models.

Method ModelNet10 McGill SHREC2015
PointNet [31] 46.9 21.7 13.0
EdgeConv [55] 45.2 20.6 13.0
Table 7: ZSL results on the 3D ModelNet10 [56], McGill [47], and SHREC2015 [26] datasets using different point cloud architectures, PointNet and EdgeConv.
Figure 7: 2D tSNE [52] visualization of unseen point cloud feature vectors (circles) based on (a) PointNet (b) EdgeConv on ModelNet10. The unseen point cloud features are clustered similarly in both PointNet and EdgeConv, despite EdgeConv performing better than PointNet on the task of supervised point cloud classification.

6.3 QFSL’s Generalized ZSL Evaluation Protocol

In this experiment, we evaluate the effect of using a different evaluation protocol for the GZSL experiments, as proposed by Song et al. [49]. Under this protocol, the unlabeled data, which consists of seen and unseen instances, is divided into halves, and two models are trained. For each model, half of the unlabeled data is used for training and the other half for testing. The final performance is calculated by averaging the performance of these two models. The authors suggest that this allows for fairer evaluation, although it is an imperfect solution. Nonetheless, we show in Table 8 for the ModelNet10 dataset that our method performs better than QFSL with respect to all accuracy measures under both this protocol and the original protocol from our paper. In fact, both methods perform better under this alternative protocol, which suggests that splitting the unlabeled data in this way makes the task easier. As a result, we use our more conservative GZSL evaluation protocol in the main paper.

Method       Acc_s           Acc_u           HM
QFSL [49]    58.1 / 68.2     21.8 / 24.3     31.7 / 35.6
Ours         74.6 / 72.0     23.4 / 29.2     35.6 / 41.5
Table 8: GZSL results on the 3D ModelNet10 dataset [56] under evaluation protocols (A) / (B), where (A) is the evaluation protocol from our paper and (B) is the protocol proposed by Song et al. [49]. We report the top-1 accuracy (%) on seen classes (Acc_s) and unseen classes (Acc_u) for each method, as well as the harmonic mean (HM) of both measures.

7 Qualitative Evaluation

In this section, we visualize five unseen classes from the ModelNet10 dataset, with examples that our method classified correctly shown in Figure 8 and examples that it classified incorrectly shown in Figure 9. The incorrect predictions occur mostly on hard examples: those that differ substantially from typical examples of their class, or those whose classes overlap in geometry, such as dresser and night stand.

Figure 8: Visualization of five classes from the ModelNet10 dataset with examples of correctly classified point clouds.
Figure 9: Visualization of five classes from the ModelNet10 dataset with examples of incorrectly classified point clouds. The predicted classes are shown below each model.