In a real-life application, a trained system that classifies object instances into a fixed number of classes may need to readjust itself to classify a new set of classes in addition to the old ones, without retraining from scratch. For example, a self-driving car already recognizes street objects (vehicles, traffic lights, etc.). Now, the car manufacturer wants to extend the car’s capability to recognize roadside objects (buildings, trees, etc.) by retraining only on instances of the new classes of interest. The main issue with such retraining is catastrophic forgetting of old class knowledge: since this setup does not allow old class instances, the model learns the new classes but forgets the old ones. Researchers have proposed Learning without Forgetting (LwF) methods [12, 26, 9, 41, 24] to address this problem. Traditionally, this problem has been investigated using 2D image data. This paper explores LwF on 3D point cloud object data.
Modern 3D camera technology makes capturing 3D point cloud data more accessible than ever. Now, it is time to equip 3D point cloud recognition models with LwF capabilities. We identify some key difficulties in addressing this problem. Firstly, in comparison to image datasets like ImageNet, very large-scale 3D point cloud datasets are not available; 3D datasets usually contain only a handful of classes and instances [38, 32]. Secondly, a typical pre-trained model for a 3D recognition system is not as robust as its 2D counterparts because it has not been trained on a large dataset. Thirdly, 3D point cloud data (especially real scanned objects) contains more noise than 2D image data. This paper investigates how far a 3D point cloud recognition model can obtain LwF capabilities given all the difficulties mentioned above.
We first train a 3D point cloud model with instances belonging to a set of pre-defined old classes. Then, we update the trained model using a popular knowledge distillation technique [8] to address the forgetting problem. Because of the difficulties of 3D data, this approach alone exhibits a large amount of forgetting of old classes. To minimize forgetting, we employ semantic word vectors of classes inside the network pipeline [23, 4, 42]. During both old and new task training, the network tries to align point cloud features to their corresponding semantics. The class semantics encode similarities and dissimilarities of different objects from the natural language domain. While learning new classes, the network learns to project new instance features around the previously obtained and fixed semantic vectors. By performing feature-semantic alignment in both old and new tasks, the network forgets less than the traditional method without semantic embeddings. For example, during the old model training, the model learns to classify ‘bed’ via its semantic representation (like isFurniture, isIndoor). Later, during the new model training, the model cannot see ‘bed’, but it observes similar classes (like sofa, chair, and table, which share semantics with ‘bed’), which helps it retain its knowledge of ‘bed’. Experimenting on the ModelNet40 [38], ScanObjectNN [32], MIT Scenes [22], and CUB [33] datasets, we show that our proposed method outperforms traditional knowledge distillation methods in both 3D and 2D data cases. The contributions of this paper are summarized below:
- To the best of our knowledge, we are the first to experiment with learning without forgetting on 3D point cloud object data.
- Our method applies knowledge distillation to retain the previously gained experience of the old model and minimize catastrophic forgetting while learning a set of new classes. In addition, we investigate the advantage of semantic word vectors in the network distillation process.
2 Related Works
3D Point Cloud Architecture: There are two streams of work for 3D point cloud classification: feature-based and end-to-end approaches. Feature-based methods mostly use multi-view representations and volumetric CNNs. Multi-view representation methods [31, 20, 40] convert a 3D point cloud into 2D images, which are then classified using 2D convolutional networks. Volumetric CNNs [14, 34] project point cloud objects onto a volumetric grid or a set of octrees. Then, they apply a computationally expensive 3D convolutional neural network. The main drawback of feature-based methods is that they do not work directly on the raw point cloud. End-to-end approaches like PointNet [19] and PointNet++ [21] use raw point cloud data as input to multi-layer perceptron networks followed by max-pooling layers. Several other works [37, 10, 11] apply improved convolution operations on point cloud objects. Moreover, [29, 35] use graph neural networks to extract features from 3D point clouds. In this paper, we build our model on several end-to-end architectures.
Learning Without Forgetting: Many methods have been proposed to solve the catastrophic forgetting problem [15, 6, 2]. Exemplar-free methods [12, 1, 41] do not require any samples from the base/old task. Li et al. [12] proposed to use Hinton’s [8] knowledge distillation loss to preserve the old task’s knowledge in 2D images, but the domain shift between tasks weakens this method. Rehearsal methods [26, 3, 9] keep a small number of exemplars from the old task. Rebuffi et al. [26] first introduced a replay-based method with bounded memory, but it fails to represent the main distribution of the old task when there is a lot of variation. The pseudo-rehearsal process used in [28, 36, 17] learns to produce examples from the old task. Some methods [1, 13] minimize additional parameters to solve the problem of model expansion. All of the approaches mentioned above address catastrophic forgetting on 2D image data. Our method is the first to use knowledge distillation to address LwF on 3D data.
Word embedding for catastrophic forgetting: The use of semantic representations to prevent catastrophic forgetting is relatively new [23, 4, 42]. Such approaches explore the semantic relation between old and new classes to reduce the forgetting of old classes while training on new classes. Zhu et al. [42] suggested using semantic representations to train an object detection model by projecting the feature vector into the semantic space. Similarly, Rahman et al. [23] proposed to use semantic representations of class labels as anchors in the prediction space so as not to forget the acquired knowledge of old classes. Cheraghian et al. [4] proposed a knowledge distillation strategy that uses semantic representations as auxiliary knowledge. Even though semantic representations have yielded promising results, the experiments are limited to 2D data. In this paper, we use word vectors in knowledge distillation for 3D point cloud object classification.
Problem Formulation: Assume we have a set of old classes, $\mathcal{Y}^o$, and a set of new classes, $\mathcal{Y}^n$, where $\mathcal{Y}^o \cap \mathcal{Y}^n = \emptyset$ and $\mathcal{Y} = \mathcal{Y}^o \cup \mathcal{Y}^n$. The 3D point cloud recognition model initially observes the $\mathcal{Y}^o$ classes and gets trained to classify only old classes. Later, the $\mathcal{Y}^n$ classes are added to the model to update the previous training. Suppose a 3D point cloud input, $X = \{x_j\}_{j=1}^{k}$ with $x_j \in \mathbb{R}^3$, can get a label from either the old or the new classes. Additionally, there is a set of $d$-dimensional semantic class embeddings for the old and new classes, denoted as $E^o$ and $E^n$, respectively. We define the old set as $\mathcal{D}^o = \{(X_i^o, y_i^o, e_i^o)\}_{i=1}^{n_o}$, where $X_i^o$ is the $i$-th point cloud object belonging to the old set, with class label $y_i^o \in \mathcal{Y}^o$ and class embedding $e_i^o \in E^o$, and $n_o$ is the number of old class instances. Similarly, there is a set for the new classes, $\mathcal{D}^n = \{(X_i^n, y_i^n, e_i^n)\}_{i=1}^{n_n}$, where $X_i^n$ is the $i$-th point cloud object, with class label $y_i^n \in \mathcal{Y}^n$ and class embedding $e_i^n \in E^n$, and $n_n$ is the number of new class instances. We build a 3D point cloud object recognition model (termed the old model) using the $\mathcal{D}^o$ set. Then, we aim to update the same model (termed the new model) using only the newly available data $\mathcal{D}^n$, so that it can predict a class label for a test sample belonging to either the old or the new set, i.e., $y \in \mathcal{Y}^o \cup \mathcal{Y}^n$. We assume that, during inference, the model knows in advance whether the test sample belongs to the old or the new classes.
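As a minimal sketch of this setup, the old and new sets can be represented as lists of (point cloud, label, embedding) triples. The `Sample` class, the `check_split` helper, and the toy values below are illustrative assumptions rather than part of the method itself; the check encodes the assumption that old and new classes do not overlap.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class Sample:
    """One training instance: a point cloud, its label, and its class embedding."""
    points: Sequence[Point]      # a point cloud object: a set of 3D points
    label: str                   # class label
    embedding: Sequence[float]   # d-dimensional semantic class embedding


def check_split(old_set: List[Sample], new_set: List[Sample]) -> Tuple[Set[str], Set[str]]:
    """Return (old classes, new classes) after verifying that they are disjoint."""
    old_classes = {s.label for s in old_set}
    new_classes = {s.label for s in new_set}
    overlap = old_classes & new_classes
    if overlap:
        raise ValueError(f"old and new classes overlap: {sorted(overlap)}")
    return old_classes, new_classes


# Toy data (hypothetical values; 2-dimensional embeddings for brevity).
old_set = [Sample([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], "bed", [0.9, 0.1])]
new_set = [Sample([(0.0, 1.0, 0.0), (0.0, 0.0, 1.0)], "sofa", [0.8, 0.2])]
old_classes, new_classes = check_split(old_set, new_set)
```

Only `new_set` would be available when the model is updated; `old_set` is used solely for training the old model.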
Main challenges: While updating the model with new data, $\mathcal{D}^n$, the model gradually forgets the old training (done on $\mathcal{D}^o$) because of the restriction of not using old class instances. Previous works address this problem with 2D image data. In this paper, we address the same problem on 3D point cloud objects. Due to the unavailability of large-scale datasets and pre-trained models, the problem becomes more complex in the 3D domain than in the 2D domain.
3.1 Model Overview
Our proposed method is shown in Fig. 2, which includes the old and new models. The new model is the updated variant of the old model that accommodates new classes. Both models are presented together because, during the training of the new model, we use the output of the old model. For both models, the point cloud input $X$ is fed into the backbone $F$, which can be any point cloud architecture (PointNet, DGCNN, PointConv, etc.), to extract the input feature, i.e., $g = F(X)$. Additionally, a semantic representation unit $S$ is employed to generate the class embedding, i.e., $e = S(y)$, given the class label $y$. While training the old model using old classes, the input feature $g$ and the old class embeddings $E^o$ are mapped into a common $m$-dimensional space using projection functions $P_o$ and $Q_o$, respectively. The new representations of the point cloud feature and the class embeddings are $\hat{g}_o = P_o(g)$ and $\hat{e}_o = Q_o(E^o)$, respectively. Finally, the dot product between $\hat{g}_o$ and $\hat{e}_o$ forms the output $\mathbf{y}_o$ for the old classes. A cross-entropy loss, $\mathcal{L}_{ce}$, is adopted to train the model for the old classes. While updating the same model with the new classes, we add a parallel pipeline from the output of the backbone $F$. Two projection modules, $P_n$ and $Q_n$, are added to map the features of new classes and the new class embeddings $E^n$ into the common $m$-dimensional space. The new representations of the feature and the class semantics are $\hat{g}_n = P_n(g)$ and $\hat{e}_n = Q_n(E^n)$, respectively. At the end, $\hat{g}_n$ is dot-multiplied with $\hat{e}_n$ to generate the output, $\hat{\mathbf{y}}_n$, for the new classes. In order to prevent forgetting of the old classes, a knowledge distillation loss function, $\mathcal{L}_{kd}$ [8], is employed between the outputs of the old and new models.
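The feature-semantic alignment described above can be sketched in plain Python. This is a simplified illustration, not the actual implementation: each projection is reduced to a single bias-free fully connected layer with ReLU, the backbone feature is taken as given, and all names (`project`, `class_scores`) and toy weights are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def project(weights: Matrix, vector: Vector) -> List[float]:
    """One fully connected layer followed by ReLU (bias omitted for brevity)."""
    return [max(0.0, sum(w * x for w, x in zip(row, vector))) for row in weights]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def class_scores(feature: Vector, class_embeddings: Sequence[Vector],
                 w_feature: Matrix, w_semantic: Matrix) -> List[float]:
    """Score each class by the dot product of the projected point cloud
    feature and the projected class embedding in the common space."""
    projected_feature = project(w_feature, feature)
    return [dot(projected_feature, project(w_semantic, e)) for e in class_embeddings]


# Toy example: 3-d backbone feature, 2-d class embeddings, 2-d common space.
feature = [1.0, 2.0, 0.5]
class_embeddings = [[1.0, 0.0], [0.0, 1.0]]
w_feature = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_semantic = [[1.0, 0.0], [0.0, 1.0]]
scores = class_scores(feature, class_embeddings, w_feature, w_semantic)
```

The same function would serve both pipelines: called with the old projections and old class embeddings it yields old class scores, and with the new projections and new class embeddings it yields new class scores.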
3.2 Training and Inference
We train the proposed architecture in two stages: old and new model training. Unlike traditional approaches for learning without forgetting (LwF) [12], both stages use semantic word vectors of classes to remember past knowledge.
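Training old model: At the first stage of training, we learn the old model using the training data of the old set, $\mathcal{D}^o$, employing a cross-entropy loss. Unlike 2D image cases, we perform this training from scratch because no pre-trained model is available to initialize the weights of the backbone, $F$. The output of the old model for the $i$-th 3D point cloud instance, $X_i$, is

$$\mathbf{y}_o = P_o\big(F(X_i);\, W_{P_o}\big)^{\top}\, Q_o\big(E^o;\, W_{Q_o}\big), \qquad (1)$$

where $W_{P_o}$ and $W_{Q_o}$ are learnable weights associated with the feature projection layer $P_o$ and the semantic projection layer $Q_o$, respectively, and $E^o$ denotes the old class embeddings. After finishing the training, the old model can classify the old set of classes, $\mathcal{Y}^o$. This old model remains frozen during the second-stage training.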
Training new model: We build a new model during the second training stage by updating a copy of the old model trained in the previous stage. This new model gives predictions for both old and new classes. However, we are not allowed to use any old class instances while training the new model. We add the new projection layers $P_n$ and $Q_n$ alongside the old projection layers $P_o$ and $Q_o$. Only the training data of new classes is used to train the new model. Similar to Eq. 1, the two pipelines of the new model produce outputs for old and new classes:

$$\hat{\mathbf{y}}_o = P_o\big(F(X_i);\, W_{P_o}\big)^{\top}\, Q_o\big(E^o;\, W_{Q_o}\big), \qquad \hat{\mathbf{y}}_n = P_n\big(F(X_i);\, W_{P_n}\big)^{\top}\, Q_n\big(E^n;\, W_{Q_n}\big),$$

where $W_{P_n}$ and $W_{Q_n}$ are weights associated with the $P_n$ and $Q_n$ layers, respectively. Among all trainable weights of the new model, $W_{P_o}$ and $W_{Q_o}$ are initialized from the old model, but $W_{P_n}$ and $W_{Q_n}$ are trained from scratch. While forwarding an input 3D point cloud object $X_i$, the old model outputs $\mathbf{y}_o$ for old classes, and the new model outputs $\hat{\mathbf{y}}_o$ and $\hat{\mathbf{y}}_n$ for old and new classes, respectively.
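We calculate a traditional cross-entropy loss, $\mathcal{L}_{ce}$, between the new class output $\hat{\mathbf{y}}_n$ and the ground truth $y$. This loss is used to learn the new classes. Additionally, using the old class outputs from the old and new models, we calculate a knowledge distillation [8] loss, $\mathcal{L}_{kd}$. This loss is employed to prevent the forgetting of the old classes. Unlike the traditional $\mathcal{L}_{kd}$, we consider class semantics in the pipeline, which further helps the LwF process. The entire loss ($\mathcal{L}$) to train this model is

$$\mathcal{L} = \mathcal{L}_{ce} + \lambda\, \mathcal{L}_{kd},$$

where the hyperparameter $\lambda$ controls the contribution of $\mathcal{L}_{kd}$. To calculate $\mathcal{L}_{ce}$, we use the negative log-likelihood loss common in 3D backbones. To calculate $\mathcal{L}_{kd}$, we record the old class output $\mathbf{y}_o$ of the old model for the new class dataset’s 3D point clouds $X_i$ and compare it with the old class output $\hat{\mathbf{y}}_o$ of the new model. The equations for $\mathcal{L}_{ce}$ and $\mathcal{L}_{kd}$ are:

$$\mathcal{L}_{ce} = -\log \sigma(\hat{\mathbf{y}}_n)_{y}, \qquad \mathcal{L}_{kd} = -\sum_{c} \sigma(\mathbf{y}_o / T)_{c}\, \log \sigma(\hat{\mathbf{y}}_o / T)_{c},$$

where $T$ is the temperature, $\sigma$ is the softmax function, and $c$ runs over the old classes.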
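The loss described above can be sketched for a single instance in plain Python. This is a hedged illustration: it assumes the common temperature-softened cross-entropy form of distillation and an additive weighting of the two terms, and the function and argument names (`nll_loss`, `distillation_loss`, `total_loss`, `weight`, `temperature`) are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax of scores divided by a temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll_loss(new_scores: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the ground-truth new class."""
    return -math.log(softmax(new_scores)[target])


def distillation_loss(old_model_scores: Sequence[float],
                      new_model_old_scores: Sequence[float],
                      temperature: float) -> float:
    """Cross-entropy between softened old-model and new-model old-class outputs."""
    teacher = softmax(old_model_scores, temperature)
    student = softmax(new_model_old_scores, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


def total_loss(new_scores: Sequence[float], target: int,
               old_model_scores: Sequence[float],
               new_model_old_scores: Sequence[float],
               weight: float, temperature: float) -> float:
    """Classification loss on new classes plus weighted distillation on old classes."""
    kd = distillation_loss(old_model_scores, new_model_old_scores, temperature)
    return nll_loss(new_scores, target) + weight * kd
```

The distillation term is smallest when the new model reproduces the old model's softened old-class distribution, which is what discourages drift on old classes while only new class instances are seen.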
Inference: For any test instance, a forward pass through the new model calculates old and new class scores. We classify old and new classes by selecting the maximum score from the old class output $\hat{\mathbf{y}}_o$ and the new class output $\hat{\mathbf{y}}_n$, respectively.
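A minimal sketch of this inference rule follows, under the stated assumption that the model is told whether a test sample belongs to the old or the new classes. The function name `predict`, its arguments, and the toy class names are illustrative.

```python
from typing import Sequence


def predict(old_scores: Sequence[float], new_scores: Sequence[float],
            old_classes: Sequence[str], new_classes: Sequence[str],
            is_old: bool) -> str:
    """Pick the highest-scoring class within the group the sample is known to belong to."""
    if is_old:
        scores, names = old_scores, old_classes
    else:
        scores, names = new_scores, new_classes
    best = max(range(len(scores)), key=lambda k: scores[k])
    return names[best]


# Toy scores from one forward pass of the new model (hypothetical values).
old_prediction = predict([0.1, 0.7], [0.9, 0.2], ["bed", "chair"], ["sofa", "desk"], is_old=True)
new_prediction = predict([0.1, 0.7], [0.9, 0.2], ["bed", "chair"], ["sofa", "desk"], is_old=False)
```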
Dataset: We evaluate our method on three 3D datasets, ModelNet10, ModelNet40 [38], and ScanObjectNN [32], and two 2D datasets, MIT Scenes [22] and CUB [33]. For the 3D experiments, we use two different settings related to synthetic and real scanned point cloud data. The synthetic setting, ModelNet40 → ModelNet10, uses 30 classes of ModelNet40 as old classes and the 10 non-overlapping classes of ModelNet10 as new classes. The real-scanned setting, ModelNet40 → ScanObjectNN, uses 26 classes of ModelNet40 as old classes and 11 classes of ScanObjectNN as new classes. Both of these setups were previously introduced in [5]. For the 2D experiments, Scenes → CUB considers the 67 classes of MIT Scenes as old and the 200 classes of CUB as new. In another setup, CUB (150) → CUB (50), 150 and 50 classes of the CUB dataset are used as old and new, respectively. These setups are proposed in [12, 39]. The statistics of the train and test instances are summarized in Table 1.
|Dataset||Settings||Task||# Classes||# Train||# Test|
Semantic embedding: For the semantic representation of classes, we use 300-dimensional word2vec (w2v) [16] and GloVe (glo) [18] word vectors. Word vector models are usually trained on an unannotated text corpus. Unless explicitly mentioned, all results in this paper use word2vec vectors.
Evaluation protocol: We evaluate our method using top-1 accuracy. We denote the old model’s accuracy on old classes as $Acc_{b}$. Similarly, we calculate $Acc_{o}$ and $Acc_{n}$ to represent the performance on old and new classes, respectively, using the final model. To measure the extent of forgetting, we calculate $\Delta$, the drop in old class accuracy from the old model to the final model. A lower $\Delta$ indicates less forgetting by the new model.
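As an illustration only, the following sketch assumes that forgetting is measured as the relative drop (in percent) of old class accuracy from the old model to the final model. This formula and the name `forgetting` are assumptions made for the example, not a definition taken from the text above.

```python
def forgetting(base_accuracy: float, old_accuracy: float) -> float:
    """Relative drop (in percent) of old-class accuracy after the update.

    base_accuracy: old-class accuracy of the old model.
    old_accuracy:  old-class accuracy of the final (updated) model.
    """
    if base_accuracy <= 0.0:
        raise ValueError("base_accuracy must be positive")
    return 100.0 * (base_accuracy - old_accuracy) / base_accuracy


# Hypothetical accuracies: 80% before the update, 76% after it.
example = forgetting(80.0, 76.0)
```

Under this assumed definition, a value of zero means the updated model kept all of its old-class accuracy, and larger values mean more forgetting.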
Validation strategy: We further randomly divide the set of old classes into val-old and val-new classes for validation experiments. In the ModelNet40 → ModelNet10 and ModelNet40 → ScanObjectNN experiments, we choose 24 and 20 of the old classes, respectively, as val-old and the rest of the classes as val-new to find values for the hyperparameters. We choose the hyperparameters for our 3D experiments by performing a grid search on this validation split.
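The random division of old classes into val-old and val-new can be sketched as follows. The helper name `split_old_classes`, the fixed seed, and the placeholder class names are assumptions for illustration; the counts (30 old classes, 24 of them used as val-old) follow the ModelNet40 → ModelNet10 setting.

```python
import random
from typing import List, Sequence, Tuple


def split_old_classes(old_classes: Sequence[str], num_val_old: int,
                      seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly split old classes into val-old and val-new subsets."""
    if not 0 < num_val_old < len(old_classes):
        raise ValueError("num_val_old must leave at least one class in each subset")
    shuffled = list(old_classes)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    return sorted(shuffled[:num_val_old]), sorted(shuffled[num_val_old:])


# 30 placeholder class names standing in for the old ModelNet40 classes.
old_classes = [f"class_{i:02d}" for i in range(30)]
val_old, val_new = split_old_classes(old_classes, num_val_old=24)
```

Hyperparameters would then be tuned by treating val-old as the old task and val-new as the new task, without touching the real new classes.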
We use VGG16 [30] (pretrained on ImageNet [27]) as the 2D image backbone to obtain the feature vector. For the feature vector projection layers, we use two fully connected layers of sizes (512, 256) and (1024, 512) with ReLU activations in the 3D and 2D experiments, respectively. For the projection layer of the semantic representation, we use one fully connected layer of size 256 and 512 with ReLU in the 3D and 2D experiments, respectively. In all experiments, we use the Adam optimizer with a learning rate of 0.0001 and a batch size of 32 during training. We implement our work using the PyTorch framework.
Compared methods: In this paper, we compare the results of the following methods. (a) Baseline-1: A backbone model is trained using the instances of old classes. Then, the trained backbone is further fine-tuned using new class instances only. (b) LwF [12]: The backbone training is the same as in Baseline-1. Then, the fine-tuning on new class samples uses a knowledge distillation loss [8] so as not to forget the old class knowledge. (c) Baseline-2: This method is an intermediate stage of our proposed approach. We first train the old model of Fig. 2 using semantic word vector information inside the architecture, but without any fine-tuning stage. Its performance can be considered a zero-shot learning [5, 25] result because it treats new classes as unseen: the method can classify new (unseen) classes without having been trained on new instances. (d) Ours: This is our final recommendation, as described in Sec. 3.1 and 3.2. On top of the Baseline-2 training, it adds fine-tuning on new class instances.
|Backbone||ModelNet40 ModelNet10||ModelNet40 ScanObjectNN|
4.1 3D point cloud experiments
Overall results: Table 2 shows the overall results for the two settings, ModelNet40 → ModelNet10 and ModelNet40 → ScanObjectNN. Our observations are as follows. (1) Baseline-1 gets the worst results on the forgetting issue, showing the highest forgetting values, because the fine-tuning for the new model does not consider the old classes. Its high new class accuracy and low old class accuracy indicate that this method learns the new classes but forgets the old classes significantly. (2) LwF [12] obtains better results on the forgetting issue (lower forgetting values) than Baseline-1 because this method applies a knowledge distillation loss so as not to forget the old classes. (3) Baseline-2 shows the performance after old class training using our method. Without receiving training on new classes, this model can still classify new classes by treating them as unseen classes. Although no forgetting occurs in this case, old and new class performance is not balanced. (4) Our method strikes a good balance between old and new class accuracy while keeping forgetting minimal. (5) Although both settings achieve similar results on old classes across methods, ModelNet40 → ScanObjectNN gets lower accuracy on new classes than ModelNet40 → ModelNet10. The reason is that the ScanObjectNN (new) classes are real-scanned 3D objects with more noise than synthetic data.
|CUB (150)||LwF ||58.2||57.1||66.2||1.8|
Ablation studies: In Table 3, we perform an ablation study by varying the 3D point cloud backbone. Among all backbones, PointNet performs consistently well in both 3D experiment settings. PointConv and DGCNN have some success on the forgetting issue with the synthetic data of ModelNet10 but fail to generalize to the real-scanned ScanObjectNN classes. The global features extracted by PointNet may be more helpful than the local features from the PointConv and DGCNN backbones. Table 4(a) also compares two different word vector models (word2vec and GloVe) as semantics. In most cases, word2vec achieves better accuracy and less forgetting than GloVe.
Hyperparameter sensitivity: We experiment with varying and in Fig. 3. By fixing one hyperparameter and adjusting another, we observe hyperparameter sensitivity within the range . We notice that increasing and from 0 to 3 improves the old () and new () class performance. From to higher, results do not deflect much, but values decrease gradually. We achieve best results using .
4.2 2D experiments
In addition to the 3D point cloud experiments, we conduct 2D image experiments. We report our results in Table 4(b) using MIT Scenes [22] and CUB [33]. For the two experiment setups, Scenes → CUB and CUB (150) → CUB (50), our method achieves better performance than LwF [12] in terms of less forgetting. Moreover, we observe that the results of the 2D experiments are better than those of the 3D experiments (Tables 2 and 3). The amount of forgetting is higher in the 3D cases than in the 2D cases (5-6% vs. 1-2%). The main reason is that the 2D backbone (VGG16 [30]) has been pre-trained on a large dataset, ImageNet [27], which has more than a million training instances and thousands of classes. In contrast, the 3D backbones (PointNet, DGCNN, PointConv) used in the 3D experiments are not pre-trained on a huge dataset. Therefore, the feature vector obtained from the 2D backbone is richer and more clustered than the feature vector obtained from the 3D backbone. We notice that the feature-semantic alignment is tighter in the 2D experiment than in the 3D experiment, as shown in Fig. 4.
In this paper, we investigate LwF on 3D point cloud objects. Because of the lack of large-scale 3D datasets and powerful pre-trained models, popular knowledge distillation on prediction scores performs poorly on 3D data. To improve the performance further, we use semantic word vectors in the network pipeline, which improves on the traditional knowledge distillation performance. We also report performance with different 3D recognition backbones and word embeddings. We notice that forgetting in the 3D case is still more severe than in the 2D image case. Future research in this area may investigate this issue further.
Acknowledgment: This work was supported by NSU CTRG 2020–2021 grant #CTRG-20/SEPS/04.
References
[1] (2018) Memory aware synapses: learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV).
[2] (2017) Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[3] (2018) End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV).
[4] (2021) Semantic-aware knowledge distillation for few-shot class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[5] (2021) Zero-shot learning on 3D point cloud objects and beyond. arXiv preprint arXiv:2104.04980.
[6] (2014) An empirical investigation of catastrophic forgetting in gradient-based neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Conference Track Proceedings.
[7] (2020) Deep learning for 3D point clouds: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[8] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[9] (2019) Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[10] (2019) A-CNN: annularly convolutional neural networks on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[11] (2018) PointCNN: convolution on X-transformed points. In Advances in Neural Information Processing Systems.
[12] (2018) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947.
[13] (2018) PackNet: adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[14] (2015) VoxNet: a 3D convolutional neural network for real-time object recognition. In IROS.
[15] (1989) Catastrophic interference in connectionist networks: the sequential learning problem. Psychology of Learning and Motivation - Advances in Research and Theory.
[16] (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 3111–3119.
[17] (2019) Learning to remember: a synaptic plasticity driven framework for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] (2014) GloVe: global vectors for word representation. In EMNLP.
[19] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[20] (2016) Volumetric and multi-view CNNs for object classification on 3D data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[21] (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st International Conference on Neural Information Processing Systems.
[22] (2009) Recognizing indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[23] (2020) Any-shot object detection. In Proceedings of the Asian Conference on Computer Vision (ACCV).
[24] (2019) Transductive learning for zero-shot object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
[25] (2020) Zero-shot object detection: joint recognition and localization of novel concepts. International Journal of Computer Vision 128 (12), pp. 2979–2999.
[26] (2017) iCaRL: incremental classifier and representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[27] (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV).
[28] (2017) Continual learning with deep generative replay. In Advances in Neural Information Processing Systems.
[29] (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[30] (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
[31] (2015) Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 945–953.
[32] (2019) Revisiting point cloud classification: a new benchmark dataset and classification model on real-world data. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[33] (2011) Multiclass recognition and part localization with humans in the loop. In International Conference on Computer Vision (ICCV).
[34] (2017) O-CNN: octree-based convolutional neural networks for 3D shape analysis. ACM Transactions on Graphics (TOG).
[35] (2019) Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG).
[36] (2018) Memory replay GANs: learning to generate images from new categories without forgetting. In Advances in Neural Information Processing Systems.
[37] (2019) PointConv: deep convolutional networks on 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[38] (2015) 3D ShapeNets: a deep representation for volumetric shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[39] (2018) Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[40] (2019) Learning relationships for multi-view 3D object recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[41] (2020) Class-incremental learning via deep model consolidation. In Workshop on Applications of Computer Vision (WACV).
[42] (2021) Semantic relation reasoning for shot-stable few-shot object detection. arXiv preprint arXiv:2103.01903.