Capturing 3D point cloud data from complex scenes has been facilitated recently by inexpensive and accessible 3D depth camera technology. This in turn has increased the interest in, and need for, 3D object classification methods that can operate on such data. However, much if not most of the data collected will belong to classes for which a classification system may not have been explicitly trained. In order to recognize such previously “unseen” classes, it is necessary to develop Zero-Shot Learning (ZSL) methods in the domain of 3D point cloud classification. While such methods are typically trained on a set of so-called “seen” classes, they are capable of classifying certain “unseen” classes as well. Knowledge about unseen classes is introduced to the network via semantic feature vectors that can be derived from networks pre-trained on image attributes or on a very large corpus of texts [29, 3, 65, 57].
Performing ZSL for the purpose of 3D object classification is a more challenging task than ZSL applied to 2D images [34, 3, 4, 29, 21, 57]. ZSL methods in the 2D domain commonly take advantage of pre-trained models, like ResNet, that have been trained on millions of labeled images featuring thousands of classes. As a result, the extracted 2D features are very well clustered. By contrast, there is no parallel in the 3D point cloud domain; labeled 3D datasets tend to be small and have only limited sets of classes. For example, pre-trained models like PointNet are trained on only a few thousand samples from a small number of classes. This leads to poor-quality 3D features with clusters that are not nearly as well separated as their visual counterparts, which gives rise to the problem of projection domain shift. In essence, this means that the function learned from seen samples is biased, and cannot generalize well to unseen classes. In the inductive learning approach, where only seen classes are used during training, projected semantic vectors tend to move toward the seen feature vectors, making the intra-class distance between corresponding unseen semantic and feature vectors large. This intuition is visualized in Figure 1.
Now, the key question is how far these problems can be mitigated by adopting a transductive learning approach, where the model is trained using both labeled and unlabeled samples. Our goal is to design a strategy that reduces the bias and encourages the projected semantic vectors to align with their true feature vector counterparts, minimizing the average intra-class distance. In 2D ZSL, the transductive setting has been shown to be effective [14, 66, 49]; however, in the case of 3D point cloud data it is a more challenging task. Pre-trained 3D features are poorly clustered and exhibit large intra-class distances. As a result, state-of-the-art transductive methods suitable for image data [14, 66, 49] are unable to reduce the bias problem for 3D data.
In order to take advantage of the transductive learning approach for 3D point cloud zero-shot learning, we propose a transductive ZSL method using a novel triplet loss that is employed in an unsupervised manner. Unlike the traditional triplet formulation [44, 32], our proposed triplet loss works on unlabeled (test) data and can operate without ground-truth supervision. This loss is applied to unlabeled data such that intra-class distances are minimized while inter-class distances are maximized, reducing the bias problem. As a result, a prediction function with greater generalization ability and effectiveness on unseen classes is learned. Moreover, our proposed method is also applicable in the case of 2D ZSL, which demonstrates the generalization strength of our method to other sensor modalities.
Our main contributions are: (1) extending and adapting transductive zero-shot learning and generalized zero-shot learning to 3D point cloud classification for the first time; (2) developing a novel triplet loss that takes advantage of unlabeled test data, applicable to both 3D point cloud data and 2D images; and (3) performing extensive experiments, establishing state-of-the-art on four 3D datasets, ModelNet10, ModelNet40, McGill, and SHREC2015.
2 Related Work
Zero-Shot Learning: For the ZSL task, there has been significant progress, including on image recognition [34, 65, 3, 4, 29, 21, 57], multi-label ZSL [22, 36], and zero-shot detection. Despite this progress, these methods solve the constrained problem where the test instances are restricted to only unseen classes, rather than being from either seen or unseen classes. The setting where both seen and unseen classes are considered at test time is called Generalized Zero-Shot Learning (GZSL). To address this problem, some methods decrease the scores that seen classes produce by a constant value, while others perform a separate training stage intended to balance the probabilities of the seen and unseen classes. Schonfeld et al. learned a shared latent space of image features and semantic representation based on a modality-specific VAE model. In our work, we use a novel unsupervised triplet loss to address the bias problem, leading to significantly better GZSL results.
Transductive Zero-shot Learning: The transductive learning approach takes advantage of unlabeled test samples, in addition to the labeled seen samples. For example, Rohrbach et al. exploited the manifold structure of unseen classes using a graph-based learning algorithm to leverage the neighborhood structure within unseen classes. Yu et al. proposed a transductive approach to predict class labels via an iterative refining process. More recently, transductive ZSL methods have started exploring how to improve the accuracy of both the seen and unseen classes in generalized ZSL tasks [66, 49]. Zhao et al. proposed a domain invariant projection method that projects visual features to semantic space and reconstructs the same feature from the semantic representation in order to narrow the domain gap. In another approach, Song et al. identified the model bias problem of inductive learning, that is, a trained model assigns higher prediction scores for seen classes than for unseen classes. To address this, they proposed a quasi-fully supervised learning method to solve the GZSL task. Xian et al. proposed f-VAEGAN-D2, which takes advantage of both VAEs and GANs to learn the feature distribution of unlabeled data. All of these approaches are designed for transductive ZSL tasks on 2D image data. In contrast, we explore to what extent a transductive ZSL setting helps to improve 3D point cloud recognition.
Learning with a Triplet Loss:
Triplet losses have been widely used in computer vision [44, 32, 12, 16, 11]. Schroff et al. demonstrated how to select positive and negative anchor points from visual features within a batch. Qiao et al. introduced using a triplet loss to train an inductive ZSL model. More recently, Do et al. proposed a tight upper bound of the triplet loss by linearizing it using class centroids, Zakharov et al. explored the triplet loss in manifold learning, Srivastava et al. investigated weighting hard negative samples more than easy negatives, and Zhaoqun et al. proposed the angular triplet-center loss, a variant that reduces the similarity distance between features. Triplet loss related methods typically work under inductive settings, where the ground-truth label of an anchor point remains available during training. In contrast, we describe a triplet formation technique in the transductive setting. Our method utilizes test data without knowing its true label. Moreover, we choose positive and negative samples of an anchor from word vectors instead of features.
ZSL on 3D Point Clouds:
Despite much progress on 3D point cloud classification using deep learning [31, 30, 55, 61, 54, 60, 7, 38, 39, 37], only two works have addressed the ZSL problem for 3D point clouds. Cheraghian et al. [9, 8] proposed a bilinear compatibility function to associate a PointNet feature vector with a semantic feature vector, and separately proposed an unsupervised skewness loss to mitigate the hubness problem. Both works use inductive inference and are therefore less able to handle the bias towards seen classes in the GZSL task than our proposed method.
3 Transductive ZSL for 3D Point Clouds
Zero-shot learning is heavily dependent on good pre-trained models generating well-clustered features [28, 4, 3, 63], as the performance of established ZSL methods otherwise degrades rapidly. In the 2D case, pre-trained models are trained by considering thousands of classes and millions of images. However, similar quality pre-trained models are typically unavailable for 3D point cloud objects. Therefore, 3D point cloud features cluster more poorly than image features. To illustrate this point, in Figure 2 we visualize 3D features of unseen classes from the 3D datasets ModelNet10 and McGill, and 2D features of unseen classes from the 2D datasets AwA2 and CUB. Here, we use unseen classes to highlight the generalization ability of the pre-trained model. Because of the use of a large dataset (like ImageNet) for the 2D case, the cluster structure is more separable in 2D than in 3D. As 3D features are not as robust and separable as 2D features, relating those features to their corresponding semantic vectors is more difficult than for the corresponding 2D case. Addressing the poor feature quality of typical 3D datasets, we propose to use a triplet loss in the transductive setting of ZSL. Our method specifically addresses the alignment of poor features (like those coming from 3D feature extractors) with semantic vectors. Therefore, while our method improves the results for both 2D and 3D modalities, the largest gain is observed in the 3D case.
3.1 Problem Formulation
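Let $\mathcal{X} = \{\mathbf{x}_i\}_{i=1}^{n}$ for $\mathbf{x}_i \in \mathbb{R}^3$ denote a 3D point cloud. Also let $\mathcal{Y}^s = \{y^s_1, \dots, y^s_S\}$ and $\mathcal{Y}^u = \{y^u_1, \dots, y^u_U\}$ denote disjoint ($\mathcal{Y}^s \cap \mathcal{Y}^u = \emptyset$) seen and unseen class label sets with sizes $S$ and $U$ respectively, and $\mathcal{E}^s = \{\phi(y^s_1), \dots, \phi(y^s_S)\}$ and $\mathcal{E}^u = \{\phi(y^u_1), \dots, \phi(y^u_U)\}$ denote the sets of associated semantic embedding vectors for the embedding function $\phi(\cdot)$, with $\phi(y) \in \mathbb{R}^d$. Then we define the set of $n_s$ seen instances as $\mathcal{Z}^s = \{(\mathcal{X}^s_i, l^s_i, \mathbf{e}^s_i)\}_{i=1}^{n_s}$, where $\mathcal{X}^s_i$ is the $i$th point cloud of the seen set with label $l^s_i \in \mathcal{Y}^s$ and semantic vector $\mathbf{e}^s_i = \phi(l^s_i) \in \mathcal{E}^s$. The set of $n_u$ unseen instances is defined similarly as $\mathcal{Z}^u = \{(\mathcal{X}^u_i, l^u_i, \mathbf{e}^u_i)\}_{i=1}^{n_u}$, where $\mathcal{X}^u_i$ is the $i$th point cloud of the unseen set with label $l^u_i \in \mathcal{Y}^u$ and semantic vector $\mathbf{e}^u_i = \phi(l^u_i) \in \mathcal{E}^u$.
We consider two learning problems in this work: zero-shot learning and its generalized variant. The goal of each problem is defined as follows.
Zero-Shot Learning (ZSL): To predict a class label $\hat{y}^u \in \mathcal{Y}^u$ from the unseen label set given an unseen point cloud $\mathcal{X}^u$.
Generalized Zero-Shot Learning (GZSL): To predict a class label $\hat{y} \in \mathcal{Y}^s \cup \mathcal{Y}^u$ from the seen or unseen label sets given a point cloud $\mathcal{X}$.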
3.2 Model Training
Zero-shot learning can be addressed using inductive or transductive inference. For inductive ZSL, the model is trained in a fully-supervised manner with seen instances only from the set $\mathcal{Z}^s$.
To learn an inductive model, an objective function
$$L_I = \frac{1}{N} \sum_{i=1}^{N} \left\| \varphi(\mathcal{X}^s_i) - \Theta(\mathbf{e}^s_i; W) \right\|_2^2 + \lambda \left\| W \right\|_2^2 \tag{1}$$
is minimized, where $N$ is the number of instances in the batch, $\varphi(\mathcal{X}^s_i)$ is the point cloud feature vector associated with point cloud $\mathcal{X}^s_i$, $W$ are the weights of the nonlinear projection function $\Theta(\cdot)$ that maps from the semantic embedding space to the point cloud feature space, and the parameter $\lambda$ controls the amount of regularization.
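In contrast, transductive ZSL additionally uses the set of unlabeled, unseen instances $\{\mathcal{X}^u_i\}_{i=1}^{n_u}$ and the set of unseen semantic embedding vectors $\mathcal{E}^u$ during training. To learn a transductive model in a semi-supervised manner, an objective function
$$L_T = \frac{1}{N} \sum_{i=1}^{N} \left\| \varphi(\mathcal{X}^s_i) - \Theta(\mathbf{e}^s_i; W) \right\|_2^2 + \alpha L_U + \lambda \left\| W \right\|_2^2 \tag{2}$$
is minimized, where $N$ is the batch size of seen instances, $\varphi(\cdot)$ is the point cloud feature extractor, $\Theta(\cdot; W)$ is the semantic projection function with weights $W$, $L_U$ is the unsupervised loss, $\alpha$ controls the influence of the unsupervised loss, and $\lambda$ controls the amount of regularization. For the $L_U$ term, a triplet loss is proposed, which will be outlined in the next section.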
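The relationship between the inductive and transductive objectives can be illustrated with a small numeric sketch. It assumes that the supervised term is the mean squared Euclidean distance between each seen feature vector and its projected semantic vector, and that the regularizer is a squared L2 penalty on the projection weights; the function names and the flat `weights` list are hypothetical and chosen only for illustration.

```python
from typing import Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def supervised_term(features: Sequence[Vector], projected: Sequence[Vector]) -> float:
    """Mean squared distance between seen features and their projected semantic vectors."""
    return sum(sq_dist(f, p) for f, p in zip(features, projected)) / len(features)


def inductive_objective(features: Sequence[Vector], projected: Sequence[Vector],
                        weights: Sequence[float], lam: float) -> float:
    """Supervised term plus an L2 penalty (assumed form) on the projection weights."""
    return supervised_term(features, projected) + lam * sum(w * w for w in weights)


def transductive_objective(features: Sequence[Vector], projected: Sequence[Vector],
                           unsup_loss: float, weights: Sequence[float],
                           alpha: float, lam: float) -> float:
    """Inductive objective plus the unsupervised loss scaled by alpha."""
    return inductive_objective(features, projected, weights, lam) + alpha * unsup_loss
```

Setting `alpha` to zero recovers the inductive objective, which makes explicit that the transductive model differs only through the additional loss computed on unlabeled data.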
Transductive ZSL addresses the problem of the projection domain shift inherent in inductive ZSL approaches. In ZSL, the seen and unseen classes are disjoint and often only very weakly related. Since the underlying distributions of the seen and unseen classes may be quite different, the ideal projection function between the semantic embedding space and point cloud feature space is also likely to be different for seen and unseen classes. As a result, using the projection function learned from only the seen classes without considering the unseen classes will cause an unknown bias. Transductive ZSL reduces the domain gap and the resulting bias by using unlabeled unseen class instances during training, improving the generalization performance. The effect of the domain shift in ZSL is shown in Figure 3. When inductive learning is used (a), the projected unseen semantic embedding vectors are far from the cluster centres of the associated point cloud feature vectors; when transductive learning is used (b), the vectors are much closer to the cluster centres.
3.3 Unsupervised Triplet Loss
In this work, we propose an unsupervised triplet loss for $L_U$ in (2). It is unsupervised because the computation of $L_U$ operates on test data, which remains unlabeled, and receives no ground-truth supervision throughout transductive training. To compute a triplet loss, a positive and negative sample need to be found for each anchor sample $\mathcal{X}^u$. In the fully-supervised setting, selecting positive and negative samples is not difficult, because all training samples have ground-truth labels. However, it is much more challenging in the unsupervised setting, where ground-truth labels are not available. For transductive ZSL, we define a positive sample using a pseudo-labeling approach. For each anchor $\mathcal{X}^u$, we assign a pseudo-label that chooses a positive sample $\mathbf{e}^p$ among the unseen semantic embedding vectors $\mathcal{E}^u$ which is the closest to the anchor feature vector $\varphi(\mathcal{X}^u)$ after projection $\Theta(\cdot; W)$, as follows
$$\mathbf{e}^p = \arg\min_{\mathbf{e} \in \mathcal{E}^u} \left\| \varphi(\mathcal{X}^u) - \Theta(\mathbf{e}; W) \right\|_2^2.$$
Such pseudo-labeling is different from the usual practice because it chooses a semantic vector as a positive sample in the triplet formation instead of a plausible ground-truth label. For GZSL, the unlabeled data $\mathcal{X}^c$ for $c \in \{s, u\}$ can be from the seen or unseen classes during training. As a result, a pseudo-label must be found for both unlabeled seen and unlabeled unseen samples. Importantly, if the pseudo-label indicates that an unlabeled sample is from a seen class, then that sample is discarded. This reduces the impact of incorrect, noisy pseudo-labels on the model for seen classes. Samples from seen classes (with ground-truth labels) will instead influence the supervised loss function. Hence, we use true supervision where possible (seen classes), and only use pseudo-supervision where there is no alternative (unseen classes). The positive sample for GZSL is therefore chosen as follows
$$\mathbf{e}^p = \arg\min_{\mathbf{e} \in \mathcal{E}^s \cup \mathcal{E}^u} \left\| \varphi(\mathcal{X}^c) - \Theta(\mathbf{e}; W) \right\|_2^2, \quad \text{with the sample retained only if } \mathbf{e}^p \in \mathcal{E}^u.$$
The negative sample is selected from the seen semantic embedding set $\mathcal{E}^s$ for both ZSL and GZSL, since all elements of this set will have a different label from the unseen anchor. We choose the negative sample $\mathbf{e}^n$ as the seen semantic embedding vector whose projection is closest to the anchor vector $\varphi(\mathcal{X}^u)$,
$$\mathbf{e}^n = \arg\min_{\mathbf{e} \in \mathcal{E}^s} \left\| \varphi(\mathcal{X}^u) - \Theta(\mathbf{e}; W) \right\|_2^2.$$
Finally, the unsupervised loss function associated with the unlabeled instances for both ZSL and GZSL tasks is defined as follows:
$$L_U = \frac{1}{N'} \sum_{i=1}^{N'} \max\left(0, \left\| \varphi(\mathcal{X}^u_i) - \Theta(\mathbf{e}^p_i; W) \right\|_2^2 - \left\| \varphi(\mathcal{X}^u_i) - \Theta(\mathbf{e}^n_i; W) \right\|_2^2 + m\right),$$
where $m$ is a margin that encourages separation between the clusters, and $N'$ is the batch size of the unlabeled instances. We describe the overall training process in Algorithm 1. In the first stage of the proposed algorithm, an inductive model is learned. The transductive model is then initialized with the inductive model, and finally the transductive model is learned.
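The triplet formation described in this section can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes squared Euclidean distances, takes already-projected semantic vectors as input, and lets discarded GZSL samples contribute zero while still dividing by the full unlabeled batch size. The function and parameter names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(anchor: Vector, candidates: Sequence[Vector]) -> int:
    """Index of the candidate closest to the anchor."""
    return min(range(len(candidates)), key=lambda i: sq_dist(anchor, candidates[i]))


def unsupervised_triplet_loss(anchors: Sequence[Vector],
                              proj_seen: Sequence[Vector],
                              proj_unseen: Sequence[Vector],
                              margin: float,
                              generalized: bool = False) -> float:
    """Hinge triplet loss with pseudo-labeled positives and seen-class negatives.

    anchors     : feature vectors of unlabeled samples
    proj_seen   : projected semantic vectors of seen classes
    proj_unseen : projected semantic vectors of unseen classes
    """
    total = 0.0
    for a in anchors:
        if generalized:
            # GZSL: pseudo-label over seen and unseen classes together.
            pool: List[Vector] = list(proj_seen) + list(proj_unseen)
            idx = nearest(a, pool)
            if idx < len(proj_seen):
                continue  # pseudo-label is a seen class: discard the sample
            positive = pool[idx]
        else:
            # ZSL: the closest projected unseen semantic vector is the positive.
            positive = proj_unseen[nearest(a, proj_unseen)]
        # The closest projected seen semantic vector is the (hard) negative.
        negative = proj_seen[nearest(a, proj_seen)]
        total += max(0.0, sq_dist(a, positive) - sq_dist(a, negative) + margin)
    return total / len(anchors) if anchors else 0.0
```

Because both the positive and the negative are semantic vectors rather than other features, the loss pulls each unlabeled feature toward its pseudo-labeled unseen prototype and pushes it away from the nearest seen prototype.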
This proposed triplet loss is distinct from recent literature [44, 32] in two ways. (1) Popular methods of triplet formation select a similar feature to the input feature as a positive sample, whereas we choose a semantic word vector for this purpose. This helps to better align the 3D point cloud features with the semantic vectors. (2) We employ a triplet loss in a transductive setting to utilize unlabeled (test) data, whereas established methods consider the triplet loss for inductive training only. This extends the role of the triplet loss beyond inductive learning.
3.4 Model Architecture
The proposed model architecture is shown in Figure 4, consisting of two branches: the point cloud network that extracts a feature vector $\varphi(\mathcal{X})$ from a point cloud $\mathcal{X}$, and the semantic projection network $\Theta(\cdot)$ that projects a semantic feature vector $\mathbf{e}$ into point cloud feature space. Any network that learns a feature space from 3D point sets and is invariant to permutations of points in the point cloud can be used in our method as the point cloud network [31, 30, 55, 24, 61, 54, 60]. The projection network $\Theta(\cdot)$ with trainable weights $W$ consists of two fully-connected layers, with 512 and 1024 dimensions respectively, each followed by a tanh nonlinearity.
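As a rough illustration of the semantic projection branch, the forward pass of a two-layer fully-connected network with tanh nonlinearities can be written in plain Python. The helper names, the uniform initialization, and the zero biases are assumptions made for this sketch; the experiments in the paper use TensorFlow.

```python
import math
import random
from typing import List, Sequence, Tuple

# One layer is (weight rows, biases); each row holds the weights of one output unit.
Layer = Tuple[List[List[float]], List[float]]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Small uniform random weights and zero biases (an illustrative choice)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_projection(dims: Sequence[int], seed: int = 0) -> List[Layer]:
    """Stack of fully-connected layers, e.g. dims = [300, 512, 1024]."""
    rng = random.Random(seed)
    return [init_layer(dims[i], dims[i + 1], rng) for i in range(len(dims) - 1)]


def dense_tanh(x: Sequence[float], layer: Layer) -> List[float]:
    """Affine map followed by a tanh nonlinearity."""
    weights, biases = layer
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def project(e: Sequence[float], layers: Sequence[Layer]) -> List[float]:
    """Map a semantic vector into the point cloud feature space."""
    out = list(e)
    for layer in layers:
        out = dense_tanh(out, layer)
    return out
```

Calling `project(word_vector, build_projection([300, 512, 1024]))` with a 300-dimensional word vector would return a 1024-dimensional vector, matching the dimensionality of the PointNet feature used for the 3D experiments.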
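For the zero-shot learning task, given the learned optimal weights $W^*$ from training with labeled seen instances $\mathcal{Z}^s$ and unlabeled unseen instances $\mathcal{X}^u$, the label of the input point cloud $\mathcal{X}^u$ is predicted as
$$\hat{y} = \arg\min_{y \in \mathcal{Y}^u} \left\| \varphi(\mathcal{X}^u) - \Theta(\phi(y); W^*) \right\|_2^2.$$
For the generalized zero-shot learning task, the label of the input point cloud $\mathcal{X}^c$ for $c \in \{s, u\}$ is predicted as
$$\hat{y} = \arg\min_{y \in \mathcal{Y}^s \cup \mathcal{Y}^u} \left\| \varphi(\mathcal{X}^c) - \Theta(\phi(y); W^*) \right\|_2^2.$$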
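The prediction rule for both tasks amounts to a nearest-neighbour search over projected semantic vectors, differing only in the set of candidate classes. The following sketch assumes squared Euclidean distance; the class names and two-dimensional vectors are made up purely for illustration.

```python
from typing import Dict, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_label(feature: Vector, projected_by_label: Dict[str, Vector]) -> str:
    """Return the label whose projected semantic vector is closest to the feature."""
    return min(projected_by_label,
               key=lambda label: sq_dist(feature, projected_by_label[label]))


# Illustrative projected semantic vectors (not real data).
unseen = {"sofa": [1.0, 0.0], "toilet": [0.0, 1.0]}
seen = {"airplane": [0.9, 0.1]}

feature = [0.8, 0.1]
zsl_prediction = predict_label(feature, unseen)               # candidates: unseen only
gzsl_prediction = predict_label(feature, {**seen, **unseen})  # candidates: seen and unseen
```

In this toy example the ZSL prediction is `sofa`, while the GZSL prediction switches to the seen class `airplane` because its prototype lies slightly closer, which is the kind of bias towards seen classes that the transductive training aims to reduce.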
4.1 Experimental Setup
Datasets: We evaluate our approach on four well-known 3D datasets, ModelNet10, ModelNet40, McGill, and SHREC2015, and two 2D datasets, AwA2 and CUB. The dataset statistics as used in this work are given in Table 1. For the 3D datasets, we follow the seen/unseen splits proposed by Cheraghian et al., where the seen classes are those in ModelNet40 that do not occur in ModelNet10, and the unseen classes are those from the test sets of ModelNet10, McGill and SHREC2015 that are not in the set of seen classes. These splits allow us to test unseen classes from different distributions than that of the seen classes. For the 2D datasets, we follow the Standard Splits (SS) and Proposed Splits (PS) of Xian et al.
Semantic features: We use the 300-dimensional word2vec semantic feature vectors for the 3D dataset experiments, the 85-dimensional attribute vectors from Xian et al. for the AwA2 experiments, and the 312-dimensional attribute vectors from Wah et al. for the CUB experiments.
Evaluation: We report the top-1 accuracy as a measure of recognition performance, where the predicted label (the class with minimum distance from the test sample) must match the ground-truth label to be considered a successful prediction. For generalized ZSL, we also report the Harmonic Mean (HM) of the accuracy of the seen and unseen classes, computed as
$$\mathrm{HM} = \frac{2 \times acc_s \times acc_u}{acc_s + acc_u},$$
where $acc_s$ and $acc_u$ are seen and unseen class top-1 accuracies respectively.
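The harmonic mean rewards methods that balance seen and unseen accuracy, since it is dominated by the smaller of the two values. A minimal sketch follows; the function name and the guard against a zero denominator are additions made for this illustration.

```python
def harmonic_mean(acc_seen: float, acc_unseen: float) -> float:
    """Harmonic mean of seen and unseen class accuracies (both in [0, 1] or both in %)."""
    total = acc_seen + acc_unseen
    if total == 0:
        return 0.0  # both accuracies are zero; avoid division by zero
    return 2.0 * acc_seen * acc_unseen / total
```

For example, accuracies of 0.8 on seen classes and 0.2 on unseen classes give a harmonic mean of 0.32, well below their arithmetic mean of 0.5.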
Cross-validation: We used Monte Carlo cross-validation to find the best hyper-parameters, averaging over repeated random splits. For ModelNet40, 5 of the 30 seen classes were randomly selected as an unseen validation set, and a portion of the seen classes was held out in the same way for the AwA2 and CUB datasets. The hyper-parameters $\alpha$ and $\lambda$ were 0.15 and 0.0001 for ModelNet40, 0.1 and 0.001 for AwA2, and 0.25 and 0.001 for CUB.
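Monte Carlo cross-validation over classes can be sketched as repeated random hold-outs of seen classes that then play the role of unseen validation classes. The function name, the seed handling, and the number of repetitions used below are assumptions for illustration only.

```python
import random
from typing import List, Sequence, Tuple


def monte_carlo_class_splits(classes: Sequence[int], n_val_unseen: int,
                             repetitions: int, seed: int = 0
                             ) -> List[Tuple[List[int], List[int]]]:
    """Return (training classes, held-out validation classes) for each repetition."""
    rng = random.Random(seed)
    splits = []
    for _ in range(repetitions):
        held_out = set(rng.sample(list(classes), n_val_unseen))
        train = [c for c in classes if c not in held_out]
        splits.append((train, sorted(held_out)))
    return splits


# Example: 30 seen classes, 5 held out per repetition (repetition count is illustrative).
example_splits = monte_carlo_class_splits(range(30), n_val_unseen=5, repetitions=3)
```

Each hyper-parameter setting would then be scored on the held-out classes of every split and the scores averaged before picking the best setting.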
Implementation details: For the 3D data experiments, we used PointNet as the point cloud feature extraction network, with five multi-layer perceptron layers (64, 64, 64, 128, 1024) followed by max-pooling layers and two fully-connected layers (512, 1024). Batch normalization (BN) and ReLU activations were used for each layer. The 1024-dimensional input feature embedding was extracted from the last fully-connected layer. The network was pre-trained on the 30 seen classes of ModelNet40. For the 2D data experiments, we used a 101-layered ResNet architecture, where the 2048-dimensional input feature embedding was obtained from the top-layer pooling unit. The network was pre-trained on ImageNet 1K without fine-tuning. We fixed the pre-trained weights for both the 3D and 2D networks. For the semantic projection layers, we used two fully-connected layers (512, 1024) with tanh non-linearities. These parameters are fully learnable. To train the network, we used the Adam optimizer with an initial learning rate of 0.0001, and batch sizes of 32 and 128 for 3D and 2D experiments respectively. We implemented the architecture using TensorFlow and trained and tested it on an NVIDIA GTX Titan V GPU.
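| Modality | Dataset | Total classes | Seen/Unseen classes | Train/Valid/Test instances |
|---|---|---|---|---|
| 2D | AwA2 SS | 50 | 40/10 | 30337/–/6985 |
| 2D | AwA2 PS | 50 | 40/10 | 23527/5882/7913 |
| 2D | CUB SS | 200 | 150/50 | 8855/–/2933 |
| 2D | CUB PS | 200 | 150/50 | 7057/1764/2967 |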
4.2 3D Point Cloud Experiments
For the experiments on 3D data, we compare with two 3D ZSL methods, ZSLPC and MHPC, and three 2D ZSL methods, f-CLSWGAN, CADA-VAE, and QFSL. These state-of-the-art image-based methods were re-implemented and adapted to point cloud data to facilitate comparison. We also report results for a baseline inductive method, which uses the inductive loss function (1) and is trained only on labeled seen classes, and for a transductive baseline method, which replaces our triplet unlabeled loss with a standard Euclidean loss.
The results on the ModelNet10, McGill, and SHREC2015 datasets are shown in Table 2. Our method significantly outperforms the other approaches on these datasets. Several observations can be made from the results. (1) Transductive learning is much more effective than inductive learning for point cloud ZSL. This is likely due to inductive approaches being more biased towards seen classes, while transductive approaches alleviate the bias problem by using unlabeled, unseen instances during training. (2) Although generative methods [58, 43] have shown successful results on 2D ZSL, they fail to generalize to 3D ZSL. We hypothesize that they rely more strongly on high quality pre-trained models and attribute embeddings, neither of which is available for 3D data. (3) Our proposed method performs better than QFSL, which is likely due to our triplet loss formulation. While noisy, the positive and negative samples of unlabeled data provide useful supervision, unlike the unsupervised approach for only unlabeled data in QFSL. (4) The triplet loss performs much better than the Euclidean loss for this problem, since it maximizes the inter-class distance as well as minimizing the intra-class distance. (5) Our proposed method does not perform as well on the McGill and SHREC2015 datasets when compared to the ModelNet10 results, because the distributions of semantic feature vectors in the unseen McGill and SHREC2015 datasets are significantly different from the distribution in the seen ModelNet40 dataset, much more so than that of ModelNet10.
Generalized ZSL, which is more realistic than standard ZSL, is more challenging than ZSL as there are both seen and unseen classes during inference. As a result, methods proposed for ZSL do not usually report results for GZSL. The results are shown in Table 3. Our method obtained the best performance with respect to the harmonic mean (HM) on all datasets, and the best performance with respect to the unseen class accuracy on most datasets, which demonstrates the utility of our method for GZSL as well as ZSL for 3D point cloud recognition.
We also show, in Figure 5, the performance of individual classes from ModelNet10. Our method achieves the best accuracy on most classes, while the inductive baseline and ZSLPC have close to zero accuracy on many classes (e.g., desk, night stand, toilet, and bed). This is likely due to the hubness problem, to which inductive methods are more sensitive than transductive methods.
4.3 2D Image Experiments
While our method was designed to address ZSL and GZSL tasks for 3D point cloud recognition, we also adapt and evaluate our method for the case of 2D image recognition. The results for ZSL and GZSL are shown in Tables 4 and 5 respectively.
For ZSL, our proposed method is evaluated on the AwA2 and CUB datasets using the SS and PS splits. Our method achieves very competitive results on these datasets, indicating that the method can generalize to image data. Note that we do not fine-tune the image feature extraction network in our model, unlike the models listed with asterisks, for fair comparison with existing work. However, the literature demonstrates that fine-tuning can improve performance considerably, particularly on the CUB dataset.
For GZSL, we evaluate our method on the same datasets and compare with state-of-the-art GZSL methods [48, 5, 66, 49]. As shown in Table 5, our method is again competitive with the other methods on the AwA2 dataset with respect to both unseen class accuracy and harmonic mean accuracy. Our results lag state-of-the-art on the CUB dataset, although fine-tuning the feature extraction network may go some way to closing this gap.
Challenges with 3D data: Recent deep learning methods for classifying point cloud objects have achieved over 90% accuracy on several standard datasets, including ModelNet40 and ModelNet10. Moreover, due to significant progress in depth camera technology [6, 18], it is now possible to capture 3D point cloud objects at scale much more easily. It is therefore likely that many classes of 3D objects will not be present in the labeled training set. As a result, zero-shot classification systems will be needed to leverage other more easily-obtainable sources of information in order to classify unseen objects. However, we observe that the difference in accuracy between ZSL and supervised learning is still very large for 3D point cloud classification, 46.9% as compared to 95.7% for ModelNet10. As such, there is significant potential for improvement for zero-shot 3D point cloud classification. While the performance is still quite low, this is also the case for 2D ZSL, with state-of-the-art being 31.1% top-5 accuracy on the ImageNet2010/12 datasets, reflecting the challenging nature of the problem.
Hubness: ZSL methods either (a) map the input feature space to semantic space using a hinge loss or least mean squares loss [13, 48], (b) map both spaces to an intermediate space using a binary cross entropy or a hinge loss [19, 62], or (c) map the semantic space to the input feature space. We use the last approach, projecting semantic vectors to input feature space, since it has been shown that this alleviates the hubness problem [46, 65]. We validate this claim by measuring the skewness of the distribution [46, 33] when projected in each direction, and the associated accuracy. We report these values in Table 6 for the ModelNet10 dataset. The degree of skewness is much lower when projecting the semantic feature space to the point cloud feature space, and this direction also achieves a significantly higher accuracy. This provides additional evidence that this projection direction is preferable for mitigating the problem of hubs and the consequent bias.
| Skewness (Accuracy) | Semantic space → input space | Input space → semantic space |
|---|---|---|
| Inductive | 2.67 (23.5%) | 3.07 (19.5%) |
| Transductive | -0.19 (46.9%) | 2.03 (31.2%) |
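A skewness measurement of this kind can be sketched as follows, assuming (as is common in the hubness literature) that the distribution in question is the k-occurrence count: how often each class prototype appears among the k nearest prototypes of the test samples. The function names, the choice of squared Euclidean distance, and the population (biased) skewness estimator are assumptions of this sketch.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_occurrence_counts(queries: Sequence[Vector], prototypes: Sequence[Vector],
                        k: int = 1) -> List[int]:
    """Count how often each prototype is among the k nearest prototypes of a query."""
    counts = [0] * len(prototypes)
    for q in queries:
        ranked = sorted(range(len(prototypes)), key=lambda j: sq_dist(q, prototypes[j]))
        for j in ranked[:k]:
            counts[j] += 1
    return counts


def skewness(values: Sequence[float]) -> float:
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        return 0.0  # all counts equal: no skew
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5
```

A large positive skewness indicates that a few prototypes act as hubs and attract most of the test samples, whereas a value near zero indicates that predictions are spread evenly across classes.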
In this paper, we identified and addressed issues that arise in the inductive and transductive settings of zero-shot learning and its generalized variant when applied to the domain of 3D point cloud classification. We observed that in the 2D domain the embeddings produced by pre-trained feature extractors are of significantly higher quality than those produced by their 3D counterparts, due to the vast difference in the amount of labeled training data they have been exposed to. To mitigate this, a novel triplet loss was developed that makes use of unlabeled test data in a transductive setting. The utility of this method was demonstrated via an extensive set of experiments that showed significant benefit in the 2D domain and established state-of-the-art results in the 3D domain for ZSL and GZSL tasks.
TensorFlow: a system for large-scale machine learning. In OSDI, Vol. 16, pp. 265–283. Cited by: §4.1.
-  (2015) Evaluation of output embeddings for fine-grained image classification. In CVPR, Vol. 07-12-June-2015, pp. 2927–2936. External Links: Cited by: Table 4.
-  (2016-07) Label-Embedding for Image Classification. IEEE TPAMI 38 (7), pp. 1425–1438. External Links: Cited by: §1, §1, §2, §3.
-  (2016) Synthesized classifiers for zero-shot learning. In CVPR, Vol. 2016-January, pp. 5327–5336. Cited by: §1, §2, §3, Table 4.
-  (2016) An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), pp. 52–68. Cited by: §2, §4.3, Table 5.
-  (2018) Calibrate multiple consumer rgb-d cameras for low-cost and efficient 3d indoor mapping. Remote Sensing 10 (2). External Links: Cited by: §4.4.
-  (2019-01) 3DCapsule: extending the capsule architecture to classify 3d point clouds. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Vol. , pp. 1194–1202. External Links: Cited by: §2, §6.2.
-  (2019) Mitigating the hubness problem for zero-shot learning of 3d objects. In British Machine Vision Conference (BMVC’19), External Links: Cited by: §2, §4.2, Table 2, Table 3.
-  (2019) Zero-shot learning of 3d point cloud objects. In International Conference on Machine Vision Applications (MVA), Cited by: §2, Figure 5, §4.1, §4.2, §4.2, §4.2, Table 1, Table 2.
-  (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §4.1.
A theoretically sound upper bound on the triplet loss for improving the efficiency of deep distance metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2018-09) Triplet loss in siamese network for object tracking. In The European Conference on Computer Vision (ECCV), Cited by: §2.
-  (2013) DeViSE: a deep visual-semantic embedding model. In NIPS, Cited by: §4.4.
-  (2015-11) Transductive multi-view zero-shot learning. IEEE Trans. Pattern Anal. Mach. Intell. 37 (11), pp. 2332–2345. External Links: Cited by: §1, §1, §3.2.
-  (2016) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §1, §4.1.
-  (2018-06) Triplet-center loss for multi-view 3d object retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.1.
-  (2011) KinectFusion: real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, UIST ’11, New York, NY, USA, pp. 559–568. External Links: Cited by: §4.4.
-  (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2017–2025. External Links: Cited by: §4.4.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
-  (2014-03) Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (3), pp. 453–465. External Links: Cited by: §1, §2.
Multi-label zero-shot learning with structured knowledge graphs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2013-07) . ICML 2013 Workshop : Challenges in Representation Learning (WREPL), pp. . Cited by: §3.3, §3.3.
-  (2018) SO-net: self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9397–9406. Cited by: §3.4, §4.4, §6.2.
-  (2019) Angular triplet-center loss for multi-view 3d shape retrieval. In AAAI, Cited by: §2.
-  (2015) Non-rigid 3D Shape Retrieval. In Eurographics Workshop on 3D Object Retrieval, I. Pratikakis, M. Spagnuolo, T. Theoharis, L. V. Gool, and R. Veltkamp (Eds.), External Links: Cited by: §1, §4.1, Table 1, Table 2, Table 3, Table 7.
-  (2013) Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119. Cited by: §4.1.
-  (2014) Zero-shot learning by convex combination of semantic embeddings. In ICLR, Cited by: §3.
-  (2009) Zero-shot learning with semantic output codes. In NIPS, Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta (Eds.), pp. 1410–1418. Cited by: §1, §1, §2.
-  (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108. Cited by: §2, §3.4, §6.2.
-  (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE 1 (2), pp. 4. Cited by: §1, §2, §3.4, §4.1, §6.2, Table 7.
-  (2017) Visually aligned word embeddings for improving zero-shot learning. In British Machine Vision Conference (BMVC’17), Cited by: §1, §2, §3.3.
-  (2010) Hubs in space: popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research 11, pp. 2487–2531. Cited by: §4.4.
-  (2018-11) A unified approach for conventional zero-shot, generalized zero-shot, and few-shot learning. IEEE Transactions on Image Processing 27 (11), pp. 5652–5667. Cited by: §1, §2.
-  (2018-12) Zero-shot object detection: learning to simultaneously recognize and localize novel concepts. In Asian Conference on Computer Vision (ACCV), Cited by: §2.
-  (2018-12) Deep multiple instance learning for zero-shot image tagging. In Asian Conference on Computer Vision (ACCV), Cited by: §2.
-  (2019) Blended convolution and synthesis for efficient discrimination of 3d shapes. Cited by: §2.
-  (2019) Representation learning on unit ball with 3d roto-translational equivariance. Cited by: §2.
-  (2019) Spectral-gans for high-resolution 3d point-cloud generation. Cited by: §2.
-  (2013) Transfer learning in a transductive setting. In NIPS, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 46–54. Cited by: §2.
-  (2015) An embarrassingly simple approach to zero-shot learning. In ICML, pp. 2152–2161. Cited by: Table 4.
-  (2015) ImageNet Large Scale Visual Recognition Challenge. IJCV 115 (3), pp. 211–252. Cited by: §4.4.
-  (2019) Generalized zero- and few-shot learning via aligned variational autoencoders. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §4.2, §4.2, Table 2, Table 3, Table 5.
-  (2015) FaceNet: a unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823. Cited by: §1, §2, §3.3, §3.3.
-  (2018-06) Neighbors do help: deeply exploiting local structures of point clouds. Cited by: §6.2.
-  (2015) Ridge regression, hubness, and zero-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 135–151. Cited by: §4.4.
-  (2008-05) Retrieving articulated 3-d models using medial surfaces. Mach. Vision Appl. 19 (4), pp. 261–275. Cited by: §1, Figure 2, §3, §4.1, Table 1, Table 2, Table 3, Table 7.
-  (2013) Zero-shot learning through cross-modal transfer. In NIPS, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 935–943. Cited by: §4.3, §4.4, Table 5.
-  (2018) Transductive unbiased embedding for zero-shot learning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1024–1033. Cited by: §1, §2, §4.2, §4.3, Table 2, Table 3, Table 4, Table 5, §6.3, Table 8, Supplementary Material.
-  (2019-02) DeepPoint3D: learning discriminative local descriptors using deep metric learning on 3d point clouds. Pattern Recognition Letters. Cited by: §2.
-  (2018) SPLATNet: sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2530–2539. Cited by: §6.2.
-  (2014) Accelerating t-sne using tree-based algorithms. Journal of machine learning research 15 (1), pp. 3221–3245. Cited by: Figure 2, Figure 3, Figure 7.
-  (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: Figure 2, §3, §4.1, §4.1, §4.3, Table 1.
-  (2018) Local spectral graph convolution for point set feature learning. arXiv preprint arXiv:1803.05827. Cited by: §2, §3.4, §6.2.
-  (2018) Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829. Cited by: §2, §3.4, §6.2, Table 7, Supplementary Material.
-  (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §1, Figure 2, §3, §4.1, Table 1, Table 2, Table 3, Table 7, Table 8.
-  (2018) Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. Cited by: §1, §1, §2, Figure 2, §3, §4.1, §4.1, §4.1, §4.3, Table 1.
-  (2018-06) Feature generating networks for zero-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.2, §4.2, Table 2, Table 3, Table 4, Table 5.
-  (2019-06) F-vaegan-d2: a feature generating framework for any-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, Table 4, Table 5.
-  (2018-06) Attentional shapecontextnet for point cloud recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §3.4, §6.2.
-  (2018) SpiderCNN: deep learning on point sets with parameterized convolutional filters. arXiv preprint arXiv:1803.11527. Cited by: §2, §3.4, §6.2.
-  (2015) A unified perspective on multi-domain and multi-task learning. In 3rd International Conference on Learning Representations (ICLR), (English). Cited by: §4.4.
-  (2018-10) Transductive zero-shot learning with a self-training dictionary approach. IEEE Transactions on Cybernetics 48 (10), pp. 2908–2919. Cited by: §2, §3.
-  (2017) 3D object instance recognition and pose estimation using triplet loss with dynamic margin. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 552–559. Cited by: §2.
-  (2017-07) Learning a deep embedding model for zero-shot learning. In CVPR, Cited by: §1, §2, §4.4.
-  (2018) Domain-invariant projection learning for zero-shot recognition. In Advances in neural information processing systems (NIPS), Cited by: §1, §2, §4.3, Table 4, Table 5.
In this supplementary material, we further assess our proposed method with additional quantitative and qualitative evaluations. In the quantitative evaluation section, we evaluate (1) the effect of the batch size on 3D Zero-Shot Learning (ZSL) using ModelNet10, (2) the effect of using a different point cloud architecture, EdgeConv, and (3) the effect of using the experimental protocol for Generalized Zero-Shot Learning (GZSL) proposed by Song et al. In the qualitative evaluation section, we show success and failure cases on unseen classes from ModelNet10.
6 Additional Quantitative Evaluation
6.1 Batch Size
In this experiment, we evaluate the effect of the batch size on the accuracy of our proposed method for the 3D ModelNet10 dataset. As can be seen in Figure 6, the size of the batch has a significant impact on the performance, with the best performance on this dataset being achieved at a batch size of 32.
6.2 Point Cloud Architecture
In this paper, we used PointNet as the backbone point cloud architecture in our 3D experiments. However, while PointNet is one of the first deep learning methods proposed for point cloud classification, many other methods [31, 30, 55, 24, 61, 54, 60, 45, 51, 7] were introduced later and tend to achieve better performance for supervised 3D point cloud classification. Here, we compare PointNet with EdgeConv to study the effect of using a more advanced point cloud architecture for the task of 3D ZSL classification. In supervised 3D point cloud classification, EdgeConv achieves 92.2% accuracy on ModelNet40 while PointNet achieves 89.2%. In this additional experiment, we use ModelNet10 as the unseen set to compare the two methods. As shown in Table 7, PointNet and EdgeConv achieve similar performance. One might expect some improvement when using EdgeConv, since it performs better in supervised classification. However, Figure 7 shows that both PointNet and EdgeConv cluster unseen point cloud features similarly and imperfectly. This again shows the difficulty of the ZSL task on 3D data, where there is a lack of good pre-trained models.
6.3 QFSL’s Generalized ZSL Evaluation Protocol
In this experiment, we evaluate the effect of using a different evaluation protocol for the GZSL experiments, as proposed by Song et al. Under this protocol, the unlabeled data, which consists of seen and unseen instances, is divided into halves, and two models are trained. For each model, one half of the unlabeled data is used for training and the other half for testing. The final performance is calculated by averaging the performance of these two models. The authors suggest that this allows for fairer evaluation, although it is an imperfect solution. Nonetheless, we show in Table 8 for the ModelNet10 dataset that our method performs better than QFSL with respect to all accuracy measures under both this protocol and the original protocol from our paper. In fact, both methods generally perform better under this different protocol, which suggests that splitting the unlabeled data in this way makes the task easier. As a result, we use our more conservative GZSL evaluation protocol in the main paper.
| QFSL | 58.1 / 68.2 | 21.8 / 24.3 | 31.7 / 35.6 |
| Ours | 74.6 / 72.0 | 23.4 / 29.2 | 35.6 / 41.5 |
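The two-model protocol described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: `two_fold_gzsl`, `train_fn`, and `eval_fn` are hypothetical names, the callbacks stand in for whatever training and evaluation routines are used, and we assume that the harmonic mean of seen and unseen accuracy is computed per model and then averaged.

```python
import random
from statistics import mean


def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean of seen and unseen accuracy, a common GZSL summary."""
    total = acc_seen + acc_unseen
    return 0.0 if total == 0 else 2.0 * acc_seen * acc_unseen / total


def two_fold_gzsl(unlabeled, train_fn, eval_fn, seed=0):
    """Split the unlabeled pool in half, train one model per half,
    test each model on the other half, and average the results.

    train_fn(train_half) -> model                (hypothetical callback)
    eval_fn(model, test_half) -> (acc_s, acc_u)  (hypothetical callback)
    """
    pool = list(unlabeled)
    random.Random(seed).shuffle(pool)  # seeded shuffle for a reproducible split
    mid = len(pool) // 2
    first, second = pool[:mid], pool[mid:]

    per_fold = []
    for train_half, test_half in ((first, second), (second, first)):
        model = train_fn(train_half)
        acc_s, acc_u = eval_fn(model, test_half)
        # Assumption: the harmonic mean is computed per fold, then averaged.
        per_fold.append((acc_s, acc_u, harmonic_mean(acc_s, acc_u)))

    # Average each measure over the two models.
    return tuple(mean(column) for column in zip(*per_fold))
```

As a consistency check, `harmonic_mean(58.1, 21.8)` rounds to 31.7 and `harmonic_mean(74.6, 23.4)` rounds to 35.6, which matches the first value in the last column of each row of Table 8 if the three columns are read as seen accuracy, unseen accuracy, and harmonic mean.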
7 Qualitative Evaluation
In this section, we visualize five unseen classes from the ModelNet10 dataset with examples where our method correctly classified the point cloud, shown in Figure 8, and examples where it incorrectly classified the point cloud, shown in Figure 9. The network appears to make incorrect predictions mostly on hard examples: those that differ markedly from standard examples of their class, or those from classes that overlap in geometry, such as dresser and night stand.