Deep Class Incremental Learning from Decentralized Data

In this paper, we focus on a new and challenging decentralized machine learning paradigm in which there are continuous inflows of data to be addressed and the data are stored in multiple repositories. We initiate the study of data decentralized class-incremental learning (DCIL) by making the following contributions. Firstly, we formulate the DCIL problem and develop the experimental protocol. Secondly, we introduce a paradigm to create a basic decentralized counterpart of typical (centralized) class-incremental learning approaches, and as a result, establish a benchmark for the DCIL study. Thirdly, we further propose a Decentralized Composite Knowledge Incremental Distillation framework (DCID) to continually transfer knowledge from historical models and multiple local sites to the general model. DCID consists of three main components, namely local class-incremental learning, collaborated knowledge distillation among local models, and aggregated knowledge distillation from local models to the general one. We comprehensively investigate our DCID framework using different implementations of the three components. Extensive experimental results demonstrate the effectiveness of our DCID framework. The code of the baseline methods and the proposed framework will be released at https://github.com/zxxxxh/DCIL.


I Introduction

Deep models have achieved great success in a wide range of artificial intelligence research fields [18, 47, 33, 56, 43, 8, 70]. Nevertheless, they have been shown to be prone to the catastrophic forgetting problem [49]. Catastrophic forgetting refers to the phenomenon where the performance of a deep model degrades severely when the model is updated with new data. In response to this urgent problem, incremental learning (IL) [7, 35, 36, 64, 78, 46, 77], a.k.a. continual learning [52, 65, 60, 75], which targets learning from continuously incoming data streams while avoiding catastrophic forgetting, has drawn increasing attention.

Current incremental learning frameworks require deep neural network models to process continuous streams of information in a centralized manner. Despite their success, we argue that such a centralized setting is often impossible or impractical. More and more data emerge from and remain in "isolated islands", which may be subject to various regulations or privacy requirements. Moving data out of their owners' repositories, or using them elsewhere, is not always permitted. In addition, continuous inflows lead to a huge amount of data located in different repositories, and bringing them together into a single repository for learning may impose a heavy communication and computational burden.

Therefore, it is crucial to enable learning models to be deployed in scenarios where data are located in different places and to allow the learning process to be performed across time beyond the bounds of a single repository. Nevertheless, no existing machine learning paradigm, such as incremental learning or distributed learning, is able to handle such complex scenarios, which leaves a significant challenge, as illustrated by Table I. In incremental learning [7, 35, 36, 64, 78, 55, 63, 26], a model is continually updated given a data stream coming from one single repository. On the contrary, in Distributed Learning (DL) and Federated Learning (FL) [39, 30], multiple models learnt at different repositories are aggregated into a general model. Clearly, IL cannot process data from multiple repositories, while DL and FL provide no mechanism to handle continuous data streams.

Fig. 1: Comparisons of (a) the traditional deep (class-) Incremental Learning (IL) paradigm and (b) the Decentralized deep (class-) Incremental Learning (DIL) paradigm. The area in gray indicates the current session of learning. Ⓐ denotes the model aggregation function.

                                   CIL   DL / FL   DCIL
data from multiple repositories     ✗       ✓       ✓
continuous data streams             ✓       ✗       ✓
TABLE I: Different nature of data sources among traditional centralized class incremental learning (CIL), Distributed Learning (DL) / Federated Learning (FL), and our proposed decentralized class incremental learning (DCIL).

In this paper, we raise the concern about such a new and challenging scenario, where deep incremental learning must be performed in a decentralized manner, as illustrated in Figure 1. To meet this challenge, deep learning models must learn from new data residing on local sites such as end devices, and in turn continually promote the performance of the general model on the main site. Take a Smart Family Photo Album as an example, where a considerable number of photos can be taken, on different occasions and separately, by different smartphones. A pre-trained CNN model is deployed on a central console as a general model shared by the family members. To process and analyze the inflows of photos in real time, a group of local models is deployed and maintained on each smartphone. As the number of photos may increase rapidly, the local CNN models are required to learn from and adapt to newly emerging photos, which may differ distinctly from historical ones. Once the local models are updated, the general model shall communicate with the local ones, learn from them, and update itself accordingly, without copying all users' photos into a single repository. A decentralized incremental learning algorithm can also be used in multi-robot collaboration, such as welcoming robots that recognize human identities. Each welcoming robot learns to identify guests over time and to recognize new guests without forgetting previous ones, even if the data of previous guests are unavailable. The robots can exchange models, rather than raw data, with each other so as to preserve guest privacy while generating a main model that can recognize all guests.

In response to this demand, we study the paradigm of decentralized class incremental learning (DCIL). In DCIL, uploading the data at local sites to the main site is avoided, so learning a general model with such limited information becomes challenging. To kick off relevant research, we define a rational DCIL learning and evaluation protocol on mainstream class-incremental-learning datasets. Moreover, under the protocol, we develop a DCIL paradigm to transform typical (centralized) class incremental learning approaches into their corresponding decentralized counterparts and build baseline DCIL results.

(a) local model 0
(b) local model 1
(c) general model
Fig. 2: t-SNE visualization of the model drift phenomenon after simple averaging under a basic DCIL setting. An example of a DCIL model aggregated from two local sites on subImageNet is shown, where the numbers of base classes and new classes in the current session are 50 and 6, respectively. (a) and (b): the feature spaces of the two individually trained local models; (c): the feature space of the general model, where noisy decision boundaries are observed on the entire set of training samples of the current session.

Along with its promising prospects, DCIL still faces several problems. On the one hand, as data are distributed among multiple repositories and streams, a local site can only access a portion of the entire data set, which inherently induces deflected local optima during decentralized learning [30]. On the other hand, directly averaging the local model weights to form a general model may have detrimental effects on its performance. One reason may lie in the permutation-invariant property of neural network parameters, as one neural network has multiple equivalent counterparts with different orderings of the same parameters [73]. As a result, a phenomenon of model drift may occur on the general model, as shown in Figure 2.

To further address the challenges in DCIL and the limitations of the basic DCIL paradigm, we propose a novel DCIL framework, termed the Decentralized Composite Knowledge Incremental Distillation (DCID) framework, which enables learning knowledge across time and across multiple repositories. There are three main steps in the proposed DCID framework. Firstly, we introduce a data decentralized learning mechanism to perform class incremental learning. Secondly, we propose a collaborated knowledge distillation method to exchange knowledge among local models, so that local models can evolve self-consistently without supervision from the general model. Finally, we design an aggregated knowledge distillation method to transfer knowledge from the multiple local models to update the general model. The proposed approach outperforms the baseline DCIL algorithms at a comparable communication cost.

Briefly, the contributions of this paper are threefold:

  • We recognize the importance and initiate the study of decentralized class incremental learning (DCIL). Compared with the popularly studied class incremental learning (CIL), the problem setting of DCIL is more practical and challenging.

  • We propose a basic DCIL paradigm to decentralize state-of-the-art class incremental learning approaches and provide baseline results for the DCIL study.

  • We propose a decentralized deep class incremental learning framework DCID, which consistently outperforms the baselines under various settings.

II Related Work

This study is relevant to deep neural networks, class incremental learning, and distributed / federated learning.

II-A Deep Neural Networks

Deep neural networks (DNNs), especially deep convolutional neural networks (CNNs) [33], have shown great ability to represent highly complex functions and have yielded breakthroughs in image classification and detection tasks. A large number of deep network structures and training techniques have been proposed. After the success of AlexNet [33], the deep Residual Network (ResNet) [19] has become one of the most groundbreaking networks in the deep learning community in recent years. By using residual connections, researchers can train networks with up to hundreds of layers and achieve excellent performance. To address the overfitting issue when training deep neural networks, the dropout technique [20], [71] regularizes the model by randomly dropping hidden nodes of DNNs during training to avoid co-adaptation of these nodes. Lately, the non-regularity of data structures has led to advances in Graph Convolutional Networks (GCNs) [38]. The work [23] presents a minibatch GCN that reduces the computational cost of traditional GCNs. Moreover, due to its powerful representation ability with multiple levels of abstraction, deep multi-modal representation learning [51] has attracted much attention in recent years. The study [24] provides a general multi-modal deep learning framework and a new fusion architecture for geoscience and remote sensing image classification applications.

II-B Class Incremental Learning

Continual / incremental learning [53] aims at learning from evolving streams of training data. There are two branches of incremental learning: online incremental learning [4], in which the model makes a single pass through the data without task boundaries, and offline incremental learning, in which the model can be trained offline in each incremental session. In this paper, we mainly focus on the latter. There are also two major categories of incremental learning: task incremental learning and class incremental learning. A group of studies works on the task-incremental learning scenario [48, 72, 9, 68, 31, 37, 54], where a multi-head structure is used. On the contrary, the class incremental learning (CIL) task maintains and updates a unified classification head and is thus more challenging. This paper mainly focuses on class-incremental learning approaches.

Class incremental learning (CIL) is targeted at continually learning a unified classifier until all encountered classes can be recognized. To prevent the catastrophic forgetting problem, a group of CIL approaches transfers the knowledge of old classes by preserving a few old-class anchor samples in an external memory buffer. Many approaches, such as iCaRL [55] and EEIL [6], use knowledge distillation to compute different types of distillation loss functions. Knowledge Distillation (KD) is a technique to transfer learned knowledge from a trained neural network (the teacher model) to a new one (the student model) [22, 3, 40, 15]. KD for class incremental learning is typically used in centralized settings before deployment, in order to reduce model complexity without weakening predictive power [55, 6, 58, 69, 26]. Later, studies such as LUCIR [27], BiC [69] and MDF [76] focus on the critical bias problem that causes the classifier's predictions to be biased towards the new classes, using cosine-distance classifiers or an extra bias-correction layer to fix the output bias. More recently, TPCIL [63] puts forward an elastic Hebbian graph and a topology-preserving loss to maintain the topology of the network's feature space. CER [29] proposes a Coordinating Experience Replay approach to constrain the rehearsal process, which shows superiority under diverse incremental learning settings. Furthermore, to utilize the memory buffer more efficiently, MeCIL [79] proposes keeping more auxiliary low-fidelity anchor samples rather than the original high-fidelity anchor samples. Recently, [45] recognizes the importance of the global property of the whole anchor set and designs an efficient differentiable ranking algorithm to calculate loss functions.

It is worth noting that existing (class) incremental learning approaches are studied in a centralized manner. They can only work in situations where data keeps coming from a single repository, and thus cannot handle cases where new data emerges from distributed sources.

II-C Distributed Learning and Federated Learning

Distributed learning mainly targets scenarios in which excessively large amounts of data are trained in parallel. Both data and workloads are divided among multiple work nodes (sites) so that the burden of learning the local data at each node remains tolerable. At each site, a local model is trained. The local models then communicate with other work nodes in accordance with certain rules, such as the Parameter Server [61]. The server node receives the local models from the different work nodes. To integrate them into a general machine learning model, existing studies either simply average the model parameters, solve a consensus optimization problem such as ADMM [2] or BMUF [11], or integrate models via ensemble learning [62].

Federated Learning (FL) is a popular distributed framework that enables the creation of a general model from many local sites. The global model is aggregated from the parameters learned at the local sites on their local data. It involves training models on remote end devices, such as mobile phones. One typical federated learning method, Federated Averaging (FedAvg) [50], aggregates local parameters with weights proportional to the sizes of the data on each client. To reduce the communication costs, STC [59] and TWAFL [13] compress both the upstream and downstream communications. To reduce global model drift, FedProx [39] incorporates a proximal term to keep local models closer to the global model. FedMA [66] matches individual neurons of the neural networks layer-wise before averaging the parameters, owing to the permutation invariance of neural network parameters. There are also relevant works such as FedMeta [10], which combines federated learning with meta learning and shares a parameterized algorithm (or meta learner) instead of a global model. FedMAX [12] introduces a prior based on the principle of maximum entropy for accurate FL. Moreover, data-sharing FedAvg [80] uses a public dataset shared between the server side and the local sides. Furthermore, FedDF [42] leverages knowledge distillation to aggregate knowledge from local models into a robust global model, and performs parameter averaging as done in FedAvg [50].

Nevertheless, most research on distributed learning, including federated learning, still solves closed tasks and hardly extends to more open-world, long-term problems where things keep changing over time. There are only a few exceptions aiming at incremental learning over multiple nodes without aggregating data [17, 16], but they are at a very preliminary stage. First, they use linear neural networks, which seriously limits their applications and fails to connect to modern deep incremental learning methods. Second, they only investigate a one-learning-session setting, which is far from real continual learning over time. Thus these studies cannot be applied to complicated decentralized deep incremental learning scenarios.

III Decentralized Class Incremental Learning

III-A Problem Description

We now define the Decentralized Class Incremental Learning (DCIL) problem setting as follows. The training data set $\mathcal{D}$ consists of images from an image set $\mathcal{X}$ and their labels from a predefined common label space $\mathcal{C}$, where $C = |\mathcal{C}|$ is the total number of classes. $\mathcal{D}$ is divided into $T$ independent training sessions $\{\mathcal{D}^1, \dots, \mathcal{D}^T\}$, where session $t$ has label set $\mathcal{C}^t \subset \mathcal{C}$. Note that the training sets of different sessions are disjoint, and so are their label sets, i.e., $\mathcal{D}^s \cap \mathcal{D}^t = \emptyset$ and $\mathcal{C}^s \cap \mathcal{C}^t = \emptyset$ for $s \neq t$.

The goal of DCIL is to obtain a general model $\Theta$ that generalizes well to classify new samples of all seen classes, which may appear at any local site.

The learning phase of each session is as follows. At session $t$, a general model $\Theta^{t-1}$ is prepared before the training stage of the session starts. The goal of this session is to update $\Theta^{t-1}$ to a new general model $\Theta^t$ so that its performance on all classes seen so far can be improved. Unlike conventional incremental learning settings, where the data $\mathcal{D}^t$ of the session is centralized, in the DCIL setting $\mathcal{D}^t$ is decentralized and distributed to $K$ data owners (local sites). Let $\mathcal{D}^t = \bigcup_{k=1}^{K} \mathcal{D}^t_k$, where $\mathcal{D}^t_k$ is held by the $k$-th site and $\mathcal{D}^t_j \cap \mathcal{D}^t_k = \emptyset$ for $j \neq k$. Note that all data owners share the common class label set $\mathcal{C}^t$ of this session. As the general model is not allowed to access $\mathcal{D}^t$ directly, $\Theta^{t-1}$ has to be distributed to each data owner, and thus $K$ copies of $\Theta^{t-1}$, i.e., $\theta^t_1, \dots, \theta^t_K$, are deployed locally at the data owners when the session's learning starts. They then continually learn and update to maximize their performance on each $\mathcal{D}^t_k$ separately, without forgetting the knowledge learnt in previous sessions. Finally, the learnt knowledge embedded in the updated local models is transmitted to the main site for updating the general model $\Theta^t$.

Fig. 3: Illustration of the DCID framework. The dotted lines refer to the transmission of model outputs on the shared dataset $\mathcal{S}^t$.

III-B DCID

In this section, we propose the Decentralized Composite Knowledge Incremental Distillation framework (DCID). As shown in Figure 3, DCID mainly consists of three steps. Firstly, Decentralized Incremental Knowledge Distillation (DID) performs class incremental learning in a data decentralized setting. Secondly, Decentralized Collaborated Knowledge Distillation (DCD) performs collaborative knowledge distillation among the local models. Thirdly, Decentralized Aggregated Knowledge Distillation (DAD) provides an aggregated knowledge distillation mechanism to update the general model.

Decentralized Incremental Knowledge Distillation

In the step of Decentralized Incremental Knowledge Distillation (DID), there are two categories of knowledge that the DID learner has to acquire: knowledge from the data of the current session (i.e., new-class data) and knowledge from the data or models of historical sessions (i.e., old-class data). First, as constrained by the data-sharing policy in the DCIL setting, the new data of the session can only be accessed by the corresponding local data owners (local sites) and cannot be shared with other sites. As a result, the model of the previous session has to be distributed and then deployed locally in a decentralized manner. Second, to avoid catastrophic forgetting, it is important to transfer knowledge from the models of the previous session to the local models of the current session, as affirmed by [41], [55]. Knowledge distillation [21] is a typical model compression and acceleration technique that transfers knowledge from large teacher models to lighter, easier-to-deploy student models. This technique, acting as a regularizer, improves the performance of student models by providing extra soft targets with higher information entropy than the one-hot labels (hard targets). Modern CIL methods usually rely on a small anchor set to maintain the memory of historical sessions, which has been shown to be an effective mechanism to avoid catastrophic forgetting [52].

To meet these challenges, we provide a paradigm to distribute existing class incremental learning approaches to multiple local sites, where class incremental learning can be performed locally and distributively. More specifically, suppose there are $K$ local sites; the loss function for the $k$-th local site in DID is composed of an anchor loss $\mathcal{L}_{anc}$ and a classification loss $\mathcal{L}_{cls}$ for the new data:

$\mathcal{L}^{DID}_k = \lambda \, \mathcal{L}_{anc} + \mathcal{L}_{cls},$   (1)

where $\lambda$ is a parameter controlling the relative strength of the two losses.

The anchor loss $\mathcal{L}_{anc}$ regularizes the new local model by minimizing the gap between its behavior and that of the old local model on the anchor set of each data owner. Through $\mathcal{L}_{anc}$, knowledge from the models of the previous session is transferred to the local models of the current session.
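For concreteness, the following PyTorch sketch shows one plausible way to assemble the per-site DID objective of Eq. 1, assuming a temperature-scaled logit-matching anchor loss; in practice, the exact forms of $\mathcal{L}_{anc}$ and $\mathcal{L}_{cls}$ are determined by the chosen base CIL learner (e.g., iCaRL, LUCIR, ERDIL, TPCIL), and all tensor names here are illustrative.

```python
import torch
import torch.nn.functional as F

def did_loss(new_logits, new_labels, anchor_logits_new, anchor_logits_old,
             lam=1.0, temperature=2.0):
    """Sketch of the per-site DID objective (Eq. 1): anchor distillation + new-class CE.

    new_logits:        logits of the current local model on new-class samples
    new_labels:        ground-truth labels of those samples
    anchor_logits_new: logits of the current local model on the stored anchors
    anchor_logits_old: logits of the previous-session model on the same anchors (teacher)
    """
    # Classification loss on the new-class data of this local site.
    cls_loss = F.cross_entropy(new_logits, new_labels)

    # Anchor (distillation) loss: keep the new model close to the old model on the
    # anchors, here realized as a temperature-scaled KL divergence (an assumption).
    t = temperature
    anc_loss = F.kl_div(
        F.log_softmax(anchor_logits_new / t, dim=1),
        F.softmax(anchor_logits_old / t, dim=1),
        reduction="batchmean",
    ) * (t * t)

    return lam * anc_loss + cls_loss
```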

Generally speaking, the anchor set is composed of the most representative instances of the local data that the local models encountered in previous sessions. We use the herding method [55, 67] to construct the anchor sets for the local models. For each newly encountered class $c$, the instances whose feature vectors are closest to the average feature vector $\mu_c$ of the class are selected as anchors, iteratively. Given the anchors already selected, the $m$-th anchor $p_m$ of the current class is added to the anchor set representing this class at the local model as follows:

$p_m = \arg\min_{x \in X^k_c} \Big\| \mu_c - \frac{1}{m} \Big( f_{\theta}(x) + \sum_{j=1}^{m-1} f_{\theta}(p_j) \Big) \Big\|,$   (2)

where $X^k_c$ is the instance set of class $c$ at local site $k$ and $f_{\theta}(\cdot)$ is the feature function of a backbone CNN model with parameters $\theta$. Note that the anchor set $\mathcal{P}^t_k$ is obtained by adding the new anchors of session $t$ to the previous anchor set $\mathcal{P}^{t-1}_k$.
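A minimal sketch of the herding selection in Eq. 2 is given below, assuming the features of one class at one local site have already been extracted by the backbone $f_{\theta}$; the function and variable names are illustrative, and it assumes $m$ does not exceed the number of available samples.

```python
import torch

@torch.no_grad()
def herding_anchors(features, m):
    """Select m anchors whose running mean best approximates the class mean (Eq. 2).

    features: (N, d) tensor of feature vectors for one class at one local site.
    Returns the indices of the selected anchors, in selection order.
    """
    mu = features.mean(dim=0)               # class mean feature vector
    selected, acc = [], torch.zeros_like(mu)
    for k in range(1, m + 1):
        # Distance between the class mean and the mean of the anchors chosen so far
        # together with each remaining candidate.
        dists = torch.norm(mu - (acc + features) / k, dim=1)
        dists[selected] = float("inf")      # never pick the same sample twice
        idx = int(torch.argmin(dists))
        selected.append(idx)
        acc = acc + features[idx]
    return selected
```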

It is worth mentioning that Eq. 1 provides a generic way of performing the incremental learning process at local sites. By choosing a particular local incremental learner, $\mathcal{L}_{anc}$ and $\mathcal{L}_{cls}$ are determined and $\mathcal{L}^{DID}_k$ can be minimized. Thus a large group of modern anchor-based CIL approaches, as mentioned in Section II-B, can serve as options. In Section IV, we adapt four state-of-the-art CIL methods, including iCARL [55], LUCIR [26], ERDIL [14] and TPCIL [63], to the DCIL setting and show that our proposed DCID performs well regardless of the particular CIL approach selected for DID.

Decentralized Collaborated Knowledge Distillation

As the local users can only access a small portion of the data and are not allowed to reach the images owned by other local sites, the cluster centers estimated locally are inevitably biased. We therefore propose an efficient yet effective knowledge distillation mechanism, termed Decentralized Collaborated Knowledge Distillation (DCD).

The key idea of DCD is to use the ensemble of local models as the teacher's knowledge to guide local model learning, since the ensemble carries richer information entropy than any individual local model [1]. To this end, each local model (as a student) regards the ensemble of all local models as its teacher. The procedure of DCD is as follows.

First, at the start of the new session $t$, a small shared dataset $\mathcal{S}^t$ is built by collecting a few samples, without their labels, from each local site. $\mathcal{S}^t$ contains the information that can be shared among local sites, and its size must be set carefully: when $\mathcal{S}^t$ is too large, the communication burden increases, while when $\mathcal{S}^t$ is too small, the performance of DCD declines. As $\mathcal{S}^t$ is accessible to all local sites, given an instance $x \in \mathcal{S}^t$, the output $z_k(x)$ (before the Softmax layer) of each local model can be computed.

Second, for each $x \in \mathcal{S}^t$, the ensemble output $z_{ens}(x)$ is computed as a linear combination of the outputs provided by the $K$ local models. We use this weighted-average output as a reasonable, high-quality soft target for knowledge distillation. Let $z_{ens}(x) = \sum_{k=1}^{K} w_k z_k(x)$, where the weights $w_k$ are positive and $\sum_{k=1}^{K} w_k = 1$.

Third, each local model learns the knowledge from the ensemble model. For each local site $k$, the decentralized collaborated knowledge distillation loss over the shared dataset $\mathcal{S}^t$ is minimized:

$\mathcal{L}^{DCD}_k = \sum_{x \in \mathcal{S}^t} \mathcal{L}_{KL}\Big( \sigma\big( z_{ens}(x)/\tau \big) \,\big\|\, \sigma\big( z_k(x)/\tau \big) \Big),$   (3)

where $\tau$ is the distillation temperature parameter that controls the shape of the distribution for distilling richer knowledge from the ensemble teacher output, $\sigma(\cdot)$ denotes the Softmax function, $C^t$ is the number of new classes in the current session, and the Kullback-Leibler divergence is chosen as the fundamental distillation loss function $\mathcal{L}_{KL}$.

After that, each local model $\theta^t_k$ is updated accordingly. It is worth mentioning that only a few samples, without their labels, and the corresponding outputs are shared among the local models. This mechanism requires no further supervised signals and is easy to implement. Moreover, the communication cost of the DCD process is moderate. Further, data privacy is preserved, as only a few images are shared and their labels are never provided.
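The sketch below illustrates the DCD step under simplifying assumptions: the local models are simulated in a single process, the ensemble weights are uniform ($w_k = 1/K$), and the temperature-scaled KL loss of Eq. 3 is applied batch-wise; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def dcd_step(local_models, shared_loader, optimizers, temperature=5.0, device="cpu"):
    """One pass of Decentralized Collaborated Knowledge Distillation (sketch).

    local_models:  list of K local networks (students; together they form the teacher)
    shared_loader: DataLoader yielding unlabeled batches from the shared dataset S^t
    optimizers:    one optimizer per local model
    """
    t = temperature
    for x in shared_loader:                      # labels of S^t are never used
        x = x.to(device)
        with torch.no_grad():
            # Equal-weight ensemble of local outputs acts as the teacher (w_k = 1/K).
            teacher = torch.stack([m(x) for m in local_models]).mean(dim=0)
        for model, opt in zip(local_models, optimizers):
            student = model(x)
            loss = F.kl_div(
                F.log_softmax(student / t, dim=1),
                F.softmax(teacher / t, dim=1),
                reduction="batchmean",
            ) * (t * t)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

In an actual deployment, only the teacher outputs on $\mathcal{S}^t$ would be exchanged between sites; simulating all models in one process here is purely for illustration.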

Decentralized Aggregated Knowledge Distillation

In the final stage, the Decentralized Aggregated Knowledge Distillation (DAD) method transfers the knowledge of the multiple local models back to the main site to update the general model $\Theta^t$. One naive solution is to use the most popular aggregation method in federated learning, i.e., FedAvg [50], to fulfill this function. Nevertheless, the shared dataset $\mathcal{S}^t$ and the ensemble outputs maintained in the second-stage DCD provide extra opportunities to distill knowledge and further improve the aggregation result. We therefore design an effective two-step DAD solution to refine the general model as follows.

First, the general model is initialized by aggregating all local models using FedAvg [50]. That is, the general model aggregates the local models by weighted averaging:

$\Theta^t = \sum_{k=1}^{K} \frac{n_k}{n}\, \theta^t_k,$   (4)

where $n = \sum_{k=1}^{K} n_k$ is the total number of training samples over all local sites and $n_k$ is the number of training samples at the $k$-th local site.
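A minimal sketch of the weighted parameter averaging in Eq. 4 over PyTorch state dicts could look as follows; the handling of non-floating-point buffers is an implementation choice not specified in the text.

```python
import copy
import torch

@torch.no_grad()
def fedavg(local_state_dicts, sample_counts):
    """Weighted-average the local model parameters (Eq. 4)."""
    total = float(sum(sample_counts))
    avg = copy.deepcopy(local_state_dicts[0])
    for key in avg:
        if avg[key].dtype.is_floating_point:
            avg[key] = sum(
                (n / total) * sd[key]
                for sd, n in zip(local_state_dicts, sample_counts)
            )
        # Non-float buffers (e.g. batch-norm counters) are kept from the first model.
    return avg

# Usage sketch: general_model.load_state_dict(fedavg([m.state_dict() for m in locals_], counts))
```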

Then, the main site distills the ensemble of all local models (as teachers) into the single general model (as a student). To this end, we use the same ensemble approach as in the DCD process. The local models are evaluated on mini-batches of data from the current shared dataset $\mathcal{S}^t$, and their outputs are aggregated as the teacher output $z_{ens}$. The student output is generated by the initial general model evaluated on $\mathcal{S}^t$.

To secure enhanced performance, the data in $\mathcal{S}^t$ and the corresponding teacher and student outputs are shuffled. We then minimize the decentralized aggregated knowledge distillation loss over the shared dataset $\mathcal{S}^t$:

$\mathcal{L}^{DAD} = \sum_{x \in \mathcal{S}^t} \mathcal{L}_{KL}\Big( \sigma\big( z_{ens}(x)/\tau \big) \,\big\|\, \sigma\big( z_{\Theta}(x)/\tau \big) \Big),$   (5)

where we use the same form of $\mathcal{L}_{KL}$ as in Eq. 3 and $\tau$ is the distillation temperature parameter. Note that only the outputs of each local site on $\mathcal{S}^t$ are transferred to the main site, incurring only a small amount of extra communication, the same as in DCD.
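Assuming the same notation as the DCD sketch above, the second DAD step could be implemented roughly as follows, finetuning the FedAvg-initialized general model against the uploaded ensemble outputs (Eq. 5); this is a sketch, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def dad_finetune(general_model, shared_batches, optimizer, temperature=5.0, device="cpu"):
    """Refine the aggregated general model with the ensemble teacher outputs (sketch).

    shared_batches yields (x, teacher_logits) pairs, where teacher_logits are the
    ensemble outputs on the shared set uploaded by the local sites.
    """
    t = temperature
    general_model.train()
    for x, teacher in shared_batches:
        x, teacher = x.to(device), teacher.to(device)
        student = general_model(x)
        loss = F.kl_div(
            F.log_softmax(student / t, dim=1),
            F.softmax(teacher / t, dim=1),
            reduction="batchmean",
        ) * (t * t)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```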

Overall Learning Procedure

We recap the operations performed by the local sites and the main site at session $t$ as follows.

Each local site uploads a few randomly chosen new-class samples, without their labels, to the main site, and downloads the shared dataset $\mathcal{S}^t$. Then the following processes are repeated until convergence. Firstly, each local site updates the general model of the previous session using Eq. 1 to obtain its local model $\theta^t_k$. Secondly, each local site performs DCD, finetuning its model with the ensemble outputs according to Eq. 3. Thirdly, each local site uploads its model parameters and its outputs on $\mathcal{S}^t$ to the main site.

The main site receives the uploaded samples from all local sites and constructs the shared dataset $\mathcal{S}^t$. Then the following processes are repeated for $R$ rounds, where each round covers distributing the general model and obtaining a new general model after local training and aggregation. The main site firstly broadcasts the latest general model to the local sites. Secondly, it calculates the ensemble outputs from the uploaded local outputs and distributes them back to the local sites. Thirdly, the main site aggregates the model parameters sent from all local sites by taking a weighted average of them to initialize the general model parameters, and then finetunes the model with the ensemble outputs from the local sites to obtain $\Theta^t$. Finally, the main site broadcasts the updated $\Theta^t$ to all local sites.

Fig. 4: Comparison between DCID and the baselines on CIFAR100. The accuracy values reported here are the mean and standard deviation over three runs.

The overall learning procedure of the proposed Decentralized Composite Knowledge Incremental Distillation framework in one session is summarized in Algorithm 1. It is worth mentioning that, besides the upload and download of model parameters, our DCID shares only a very limited number of training samples, without their labels, between the local sites and the main site, which not only protects data privacy but also minimizes the communication cost.

III-C Baseline Approaches

As we focus on a new learning paradigm, it is desirable to provide baseline results and build a benchmark for the decentralized class incremental learning study. To meet this need, we develop a basic decentralized framework to expand typical class-incremental learning methods, such as the four introduced in the previous subsection, to their DCIL counterparts. The overall procedure of the proposed basic DCIL framework is summarized in Algorithm 2 and described as follows:

Input : Training set $\mathcal{D}^t$ of the current session, the anchor sets $\{\mathcal{P}^{t-1}_k\}_{k=1}^K$, the shared dataset $\mathcal{S}^t$, and the general model $\Theta^{t-1}$ after the previous session.
Output : The updated general model $\Theta^t$ of the current session.
1 Copy the general model $\Theta^{t-1}$ of the main site and distribute it to the $K$ data owners as $\theta^t_1, \dots, \theta^t_K$;
2 for each round $r = 1$ to $R$ do
3       for each local site $k = 1$ to $K$ in parallel do
4             Update the local model $\theta^t_k$ by minimizing $\mathcal{L}^{DID}_k$ (Eq. 1);
5             Compute the local outputs on $\mathcal{S}^t$;
6             Compute the anchor set $\mathcal{P}^t_k$ according to Eq. 2;
7       end for
8       Compute the ensemble outputs $z_{ens}$ on $\mathcal{S}^t$ and distribute them to the local sites;
9       for each local site $k = 1$ to $K$ in parallel do
10             Update the local model $\theta^t_k$ by minimizing $\mathcal{L}^{DCD}_k$ (Eq. 3);
11       end for
12       Aggregate the local models into the general model $\Theta^t$ by weighted averaging (Eq. 4);
13       for each local site $k = 1$ to $K$ in parallel do
14             Compute the local outputs on $\mathcal{S}^t$;
15       end for
16       Compute the ensemble outputs $z_{ens}$ on $\mathcal{S}^t$;
17       Update the general model $\Theta^t$ by minimizing $\mathcal{L}^{DAD}$ (Eq. 5);
18 end for
Algorithm 1 The DCID framework
Input : Training set $\mathcal{D}^t$ of the current session, the anchor sets $\{\mathcal{P}^{t-1}_k\}_{k=1}^K$, and the general model $\Theta^{t-1}$ after the previous session.
Output : The updated general model $\Theta^t$ of the current session.
1 Copy the general model $\Theta^{t-1}$ of the main site and distribute it to the $K$ data owners as $\theta^t_1, \dots, \theta^t_K$;
2 for each round $r = 1$ to $R$ do
3       for each local site $k = 1$ to $K$ in parallel do
4             Update the local model $\theta^t_k$ by minimizing the local loss $\mathcal{L}_k$ (Eq. 6, 7 or 8);
5             Compute the anchor set $\mathcal{P}^t_k$ according to Eq. 2;
6       end for
7       Aggregate the local models into the general model $\Theta^t$ by weighted averaging (Eq. 4);
8 end for
Algorithm 2 The basic DCIL framework

Given a particular class-incremental learner, at the beginning of session $t$, the main site distributes the general model $\Theta^{t-1}$ to the $K$ local data owners (Step 1). Then, each data owner updates its local model using its own training data $\mathcal{D}^t_k$ to adapt to the new classes, while performing a certain kind of forgetting-alleviation mechanism using a few old-class anchors (Step 4). Note that $\mathcal{D}^t_k$ and the anchor set of each local site are never transmitted to other local sites or to the main site. The local anchor set is then updated to $\mathcal{P}^t_k$ (Step 5). Finally, the updated local models are transmitted to the main site, and the general model evolves as the simple weighted average of all local models based on the amount of data trained at the current session (Step 7). It is worth mentioning that Algorithm 2 also works for incremental learning methods that do not use a historical anchor set, by simply setting the anchor set to be empty and removing Step 5.

The proposed framework is carefully designed so that popular federated updating and aggregation methods such as FedAvg, FedMAX [12], and FedProx [39] can be used as plug-in modules to update the local models and gather them into a general one. The selection of these methods is controlled by defining a different local loss $\mathcal{L}_k$ in Step 4. When using FedAvg for aggregation, $\mathcal{L}_k$ is defined as follows, and the resulting baseline is referred to as “DCIL w/ FedAvg”:

$\mathcal{L}_k = \lambda \, \mathcal{L}_{anc} + \mathcal{L}_{cls},$   (6)

where $\lambda$, $\mathcal{L}_{anc}$, and $\mathcal{L}_{cls}$ are the same as those in Eq. 1. Alternatively, the loss for “DCIL w/ FedMAX” can be defined as:

$\mathcal{L}_k = \lambda \, \mathcal{L}_{anc} + \mathcal{L}_{cls} + \frac{\beta}{B} \sum_{i=1}^{B} \mathcal{L}_{KL}\big( \mathcal{U} \,\|\, \sigma(a^k_i) \big),$   (7)

where $a^k_i$ refers to the activation vector at the input of the last fully-connected layer for sample $i$ on the $k$-th local site, $\mathcal{U}$ stands for a uniform distribution over the activation vectors, $B$ is the mini-batch size of the local training data, $\mathcal{L}_{KL}$ is the Kullback-Leibler divergence, and $\beta$ is a hyper-parameter used to control the scale of the regularization loss. Moreover, the local loss for “DCIL w/ FedProx” becomes

$\mathcal{L}_k = \lambda \, \mathcal{L}_{anc} + \mathcal{L}_{cls} + \frac{\mu}{2} \big\| \theta^t_k - \Theta^{t-1} \big\|^2,$   (8)

where $\mu$ is the parameter controlling the scale of the proximal term. Please refer to [12] and [39] for more details of the FedMAX and FedProx methods.
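To make the difference between Eqs. 6-8 concrete, the hedged sketch below shows FedMAX-style and FedProx-style regularizers that could be added to the base local loss of Eq. 6; the KL direction in the FedMAX term and all names are assumptions rather than the exact formulations of [12] and [39].

```python
import torch
import torch.nn.functional as F

def fedmax_regularizer(activations, beta=500.0):
    """FedMAX-style term (sketch): push last-layer input activations toward a uniform
    distribution; the chosen KL direction is an assumption."""
    log_act = F.log_softmax(activations, dim=1)
    uniform = torch.full_like(activations, 1.0 / activations.size(1))
    return beta * F.kl_div(log_act, uniform, reduction="batchmean")

def fedprox_regularizer(local_model, global_params, mu=0.1):
    """FedProx-style proximal term (sketch): keep local weights close to the global model."""
    prox = 0.0
    for p, g in zip(local_model.parameters(), global_params):
        prox = prox + torch.sum((p - g.detach()) ** 2)
    return 0.5 * mu * prox

# Usage sketch: total_loss = base_loss + fedmax_regularizer(acts)        # "DCIL w/ FedMAX"
#               total_loss = base_loss + fedprox_regularizer(model, gp)  # "DCIL w/ FedProx"
```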

IV Experiments

To facilitate the study of DCIL, we conduct comprehensive experiments under the DCIL setting to provide baseline results of the DCIL frameworks and evaluate the proposed DCID approach.

IV-A Data and Setup

Experiments are performed in PyTorch under the DCIL setting on two challenging image classification datasets, CIFAR100 [32] and subImageNet [55, 27].

CIFAR100 contains 60,000 natural RGB images over 100 classes, including 50,000 training images and 10,000 test images. It is very popular in both incremental learning [5, 55] and distributed / federated learning [50] works. Each image has a size of 32×32. We randomly flip images for data augmentation during training.

SubImageNet contains images of 100 classes randomly selected from ImageNet [57]. There are about 130,000 RGB images for training and 5,000 RGB images for testing. For data augmentation during training, we randomly flip the image and crop a 224×224 patch as the network's input. During testing, we crop a single center patch of each image for evaluation.

Experimental setting

For each dataset, we randomly choose half of the classes, i.e., 50 classes, as the base classes for the base session (session 0), and divide the remaining classes into 5 or 10 sessions for incremental learning.

In each session, there are $K$ local sites where the general model of the previous session is copied, deployed, and updated locally. We randomly sample the dataset of each session under a balanced, independent and identically distributed (IID) partition for the experiments in Sections IV-B, IV-C and IV-D. Evaluations for non-IID settings are provided in Section IV-E. $K$ is set to 5 as the basic setting for most experiments in Sections IV-B, IV-C, IV-D and IV-E. More comprehensive evaluations with different settings of $K$ in the range of {2, 5, 10, 20} are also provided in Section IV-D.

Fig. 5: Comparison between DCID and the baselines on subImageNet. The accuracy values reported here are the mean and standard deviation over three runs.

We choose four representative class incremental learning approaches as the basic incremental learners: iCARL [55], LUCIR [26], ERDIL [14] and TPCIL [63]. iCARL classifies using the nearest-mean-of-exemplars rule and transfers knowledge via traditional knowledge distillation. LUCIR incorporates three components, i.e., cosine normalization, feature knowledge distillation, and inter-class separation, to alleviate catastrophic forgetting. ERDIL uses an exemplar relation graph to explore the relational information of exemplars from the old classes and leverages graph-based relational knowledge distillation to transfer old knowledge for new-class learning. TPCIL maintains the topology of the knowledge space via an elastic Hebbian graph and a topology-preserving loss. We decentralize these methods to multiple local sites in the DID stage to verify the proposed framework. At the start of each session, each local site initializes the classification-layer parameters for new classes identically before training, for better convergence [50].

A series of session accuracies is recorded on the test set at the end of each session, from session 0 to the last session. Two measures are reported to evaluate performance [44, 55, 63, 64, 26, 14]: the average accuracy over this series of session accuracies, and the final accuracy, namely the accuracy of the last session.
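The two measures can be computed directly from the recorded series of session accuracies, for example:

```python
def summarize(session_accuracies):
    """Average accuracy over all sessions and final (last-session) accuracy."""
    avg_acc = sum(session_accuracies) / len(session_accuracies)
    final_acc = session_accuracies[-1]
    return avg_acc, final_acc

# Example with the DCID (LUCIR) accuracies from Table III: average 68.01, final 57.92.
print(summarize([83.52, 73.43, 68.91, 63.98, 60.31, 57.92]))
```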

The detailed parameter settings are as follows. The anchor number is set to 20 for each class at the local sites, and the number of shared samples per class is set to 20. During knowledge distillation, the learning rate is kept small and the number of epochs is set to 5. The distillation temperature parameter $\tau$ in Eqs. 3 and 5 is set to 5. We evaluated $\mu$ in Eq. 8 for “DCIL w/ FedProx” over a range of values and found the results insensitive to this setting, so we fix $\mu$ accordingly. Analogously, we set $\beta$ = 500 in Eq. 7 for “DCIL w/ FedMAX”.

Considering the different scales of the two datasets, we follow mainstream class-incremental learning studies and choose different backbone networks with corresponding settings. On CIFAR100, we choose the popular 32-layer ResNet [19] as the backbone, as in [27]. Initially, we train the base model for 160 epochs using minibatch SGD with a minibatch size of 128. Following the recommended settings of the original papers, we set the hyper-parameter $\lambda$ in Eq. 1 separately for TPCIL, ERDIL, iCARL and LUCIR. The initial learning rate is set to 0.1 and decreased to 0.01 and 0.001 at epoch 80 and 120, respectively. During the decentralized deep incremental learning sessions, we choose the local epoch number $E = 10$, with a minibatch size of 128 for each local site. The learning rate is initially set to 0.01 and decreased by 10 times at epoch 10. The number of rounds $R$ is set to 10 in both Algorithms 1 and 2. On subImageNet, we follow [26] and use the 18-layer ResNet as the backbone. We set the hyper-parameter $\lambda$ in Eq. 1 separately for ERDIL and for the other CIL methods. We train the base model with a minibatch size of 128 and an initial learning rate of 0.1. We decrease the learning rate to 0.01 and 0.001 after epoch 30 and 50, respectively, and stop training at epoch 100. Then, we finetune the model on each subsequent decentralized training set. The learning rate is initially set to 0.1 and decreased by 10 times at epoch 10. We choose the local epoch number $E = 30$, with a minibatch size of 128 for each local site. $R$ is set to 3.

IV-B Comparison Results under the IID Setting

Under the IID setting, the training data of each new class in a session is randomly partitioned across the local sites, similar to McMahan et al. [50]. Figures 4 and 5 show the comparison results between our proposed DCID and the baseline approaches described in Section III-C on the two datasets, CIFAR100 and subImageNet. For stable evaluation, we run the experiments three times and report the mean results using the four representative class-incremental methods iCARL, LUCIR, ERDIL and TPCIL. Each curve reports the mean test accuracy over sessions. To obtain upper-bound performance for reference, we retrain the model at each session with a centralized data setting, denoted as “Centralized”. The green curve reports the accuracy achieved by our proposed DCID, while the crimson, blue, pink, and amber curves report the accuracies of “Centralized”, “DCIL w/ FedAvg”, “DCIL w/ FedMAX”, and “DCIL w/ FedProx”, respectively. The main results are summarized as follows:

  • In all experiments with the 5- and 10-session settings on both datasets, our DCID consistently outperforms the baseline results in each incremental session by a large margin, especially on the challenging subImageNet. The superiority of our method becomes more obvious after learning all the sessions, which shows its effectiveness for long-term incremental learning from decentralized data.

  • On CIFAR100, our DCID framework with iCARL, LUCIR, ERDIL and TPCIL achieves average accuracies of 63.87%, 65.05%, 64.76%, and 65.31%, respectively, under the 5-session setting. In comparison, the second-best “DCIL w/ FedMAX” framework achieves average accuracies of 62.78%, 63.77%, 64.05% and 64.86%, respectively. As a result, DCID outperforms the second-best method by up to 1.28% in terms of average accuracy with the LUCIR method. After learning all the sessions, DCID further outperforms “DCIL w/ FedMAX” by up to 1.88% in terms of final accuracy with the LUCIR method. Analogously, under the 10-session setting, our DCID also outperforms the baseline approaches in both average accuracy and final accuracy.

  • On subImageNet, under the 5-session setting, our DCID with iCARL, LUCIR, ERDIL and TPCIL achieves average accuracies of 65.23%, 68.01%, 66.99% and 70.93%, respectively, which are 4.36%, 3.83%, 3.88% and 4.20% higher than the second-best method, correspondingly. Furthermore, at the last session, our DCID greatly outperforms the second-best method by up to 5.70%, 6.36%, 6.40% and 5.62%, respectively. Under the 10-session setting, our DCID also outperforms the second-best method with iCARL, LUCIR, ERDIL and TPCIL by 4.35%, 3.77%, 4.14% and 3.15% in terms of average accuracy, respectively. Moreover, at the end of the entire learning process, our method outperforms the second-best method by 4.59%, 6.10%, 6.08% and 4.69%, respectively.

IV-C Ablation Study

We provide ablation studies on subImageNet to investigate each component’s contribution to the final performance gain and prove the effectiveness and generalization ability of DCID.

Component Encountered Classes Average
50 60 70 80 90 100 Acc.
Baseline 83.52 65.39 60.34 55.52 51.97 48.46 60.86
Baseline w/ DCD 83.52 67.28 64.67 59.24 55.98 53.79 64.05
Baseline w/ DAD 83.52 67.21 63.93 58.62 54.71 53.67 63.61
DCID 83.52 69.96 65.82 60.85 57.11 54.16 65.23
TABLE II: Ablation study on subImageNet with iCARL.
Component Encountered classes Average
50 60 70 80 90 100 Acc.
Baseline 83.52 70.83 63.34 58.25 53.16 50.14 63.21
Baseline w/ DCD 83.52 71.93 68.14 63.7 59.51 56.48 67.21
Baseline w/ DAD 83.52 71.64 67.02 62.02 57.88 55.72 66.30
DCID 83.52 73.43 68.91 63.98 60.31 57.92 68.01
TABLE III: Ablation study on subImageNet with LUCIR.
Component Encountered Classes Average
50 60 70 80 90 100 Acc.
Baseline 83.52 70.39 66.01 62.2 58.35 54.84 65.89
Baseline w/ DCD 83.52 73.68 69.26 66.43 62.27 61.17 69.38
Baseline w/ DAD 83.52 72.92 68.01 65.28 61.14 59.73 68.43
DCID 83.52 74.83 70.62 67.75 65.67 63.24 70.93
TABLE IV: Ablation study on subImageNet with TPCIL.

We conduct the experiments using three different centralized CIL approaches: iCARL, LUCIR and TPCIL. The experiments are performed on subImageNet under the 5-session incremental learning setting with 5 local sites. We explore the impact of the Decentralized Collaborated Knowledge Distillation (DCD) module (Eq. 3) and the Decentralized Aggregated Knowledge Distillation (DAD) module (Eq. 5), respectively. Tables II, III, and IV report the comparative results using the iCARL, LUCIR and TPCIL methods, respectively. Our “Baseline” method is “DCIL w/ FedAvg”, in other words, DCID with the DAD and DCD modules removed. “Baseline w/ DCD” and “Baseline w/ DAD” refer to the “Baseline” method with the DCD or the DAD module added, respectively. We summarize the results as follows:

  • Both the DCD and the DAD modules improve the performance of the baseline no matter which of the three basic CIL methods is used. Baseline with DCD outperforms the baseline by up to 3.19%, 4.00%, and 3.49% using iCARL, LUCIR and TPCIL, respectively. Baseline with DAD exceeds the baseline by up to 2.75%, 3.09%, and 2.54%, correspondingly.

  • When combining the two modules with the baseline, DCID achieves the best average accuracy. It exceeds Baseline with DCD by up to 1.18%, 0.80% and 1.55% using iCARL, LUCIR and TPCIL, and Baseline with DAD by up to 1.64%, 1.71% and 2.50%, respectively.

All these results show that our proposed method consistently outperforms the baselines, regardless of the (centralized) CIL methods used. The effectiveness of the proposed DCID and its components is demonstrated.

IV-D Key Issues of DCID

The effect of the size of shared dataset

To investigate the effect of the size of the shared dataset, we further evaluate the methods using different numbers of shared samples per class at a local site. With the 5-session and 5-local-site settings, we compare the performance using the LUCIR method in Table V. The number of shared samples per class is in the range of {0, 2, 5, 10, 20, 30}. Note that the method with 0 shared samples equals the baseline method “DCIL w/ FedAvg”.

We can observe that a larger shared dataset performs better, while it also brings higher communication costs and more data-privacy concerns. Moreover, the test accuracy tends to saturate when the number of shared samples per class at a local site exceeds 20, which is why we choose this value for our experiments.

Number of Encountered Classes Average
Shared Samples 50 60 70 80 90 100 Acc.
0 83.52 70.83 63.34 58.25 53.16 50.14 63.21
2 83.52 73.06 67.14 62.45 58.13 55.22 66.70
5 83.52 73.26 68.35 63.72 59.04 57.04 67.48
10 83.52 73.57 68.28 64.06 60.23 57.14 67.80
20 83.52 73.43 68.91 63.98 60.31 57.92 68.01
30 83.52 73.45 69.13 64.34 60.62 58.10 68.19
TABLE V: Evaluation of our method with different numbers of shared samples per class at a local site under the 5-session setting on subImageNet.

The effect of the number of local epochs

The number of local epochs $E$ is a hyperparameter worth exploring in distributed and federated learning, as it affects the computation at the local sites and the trade-off between communication cost and performance [42]. Figure 6 compares the test accuracy for $E$ in the range of {5, 10, 20, 30, 50} with 300 total training epochs. The experiments use the LUCIR method with the 5-session and 5-local-site settings.

Fig. 6: Comparison of average accuracy (a) and final accuracy (b) between our method DCID and the baseline method “DCIL w/ FedAvg” with different numbers of local epochs.

We can observe from Figure 6 that our DCID method consistently outperforms the baseline method (“DCIL w/ FedAvg”) on both average accuracy and final accuracy for all numbers of local epochs.

Specifically, we found that our method improves the average accuracy and the final accuracy of the baseline method by clear margins on average. Moreover, we observe that the longer the local training period, the better our method works. The reason is that longer local training leads to a higher-quality ensemble and hence a better distillation result for the models [34]. In contrast, the performance of the baseline method saturates and even degrades after $E = 30$, which is consistent with the observations in previous literature [50, 66]. Fortunately, this phenomenon is alleviated in the proposed DCID.

The effect of the number of anchors

The class incremental learning methods in our experiments all store a number of old-class anchors to represent the old knowledge. Though storing more anchors may help performance, it also brings more memory overhead and higher computation cost. Table VI reports the average accuracy achieved using different numbers of anchors per class at a local site. We observe that the test accuracy tends to saturate when the number of anchors per class exceeds 20. As the anchor number used in the original LUCIR [26] and TPCIL [63] algorithms is 20, which strikes a good balance between accuracy and memory cost, we set the number of anchors per class to 20 in the experiments.

Methods Number of Anchors
5 10 20 30 40
Baseline 57.83 60.46 63.21 64.18 64.67
DCID 62.79 65.52 68.01 69.48 69.71
TABLE VI: Comparison of average accuracy with different numbers of anchors per class.

The effect of the number of local sites

We further analyze the influence of the number of local sites with which our model communicates. In the experiments, we use LUCIR as the CIL method on the subImageNet dataset with the 5-session setting. The number of local sites $K$ is in the range of {2, 5, 10, 20}. The anchor number is fixed to 100 per class. Table VII shows the average accuracy and the final accuracy on the test set achieved under different settings of the local-site number $K$.

We can observe that as the number of local sites $K$ increases, the accuracies of both the baseline and our method decrease, which is consistent with the observations in [50, 66]. However, our DCID still achieves significantly better performance than the baseline in all settings, which demonstrates the effectiveness of our method. Moreover, as the number of local sites increases, the improvement of our method over the baseline grows. This is because a larger number of local sites enriches the diversity of the models and yields higher-quality ensemble outputs [34].

Methods Number of Local Sites
2 5 10 20
Average Acc. Baseline 67.13 63.21 62.86 61.71
Average Acc. DCID (Ours) 70.07 68.01 67.61 66.27
Final Acc. Baseline 56.32 51.14 48.92 45.46
Final Acc. DCID (Ours) 61.66 58.92 56.54 53.82
TABLE VII: Comparison of average accuracy and final accuracy using our method and baseline with different numbers of local sites.

The training time cost

We further evaluate the computational costs of typical class incremental learning (CIL) approaches and our decentralized class incremental learning (DCIL) approach, DCID. As both typical CIL and our proposed DCIL approaches generate a unified model, when using the same backbone, the inference time will be the same. As a result, we only have to compare the total training time.

Table VIII reports the total training time of the four CIL algorithms and our DCID on the CIFAR100 dataset with 5 local sites in one incremental learning session. All training times are measured from the start of training to convergence. The hyperparameters of the four typical CIL methods are consistent with their original papers. All experiments are conducted on TITAN XP GPUs.

Based on these results, we discuss the influence of decentralizing incremental learning on the computational costs. Typical incremental learning approaches train on a centralized dataset and generate one model. Instead, DCID consists of three processes: Decentralized Incremental Knowledge Distillation (DID), Decentralized Collaborated Knowledge Distillation (DCD) and Decentralized Aggregated Knowledge Distillation (DAD). Compared to typical class incremental learning approaches, the training data of DCID are distributed across different local sites, and there are extra costs for model distribution and aggregation. Nevertheless, each local site only has to train its local model on a part of the dataset, and the local models are trained synchronously in parallel. As a result, the total training cost is reduced.

Methods iCARL LUCIR ERDIL TPCIL
Typical CIL 767s 1041s 1094s 1283s
DCID 558s 662s 745s 871s
TABLE VIII: Comparison of training time between typical CIL methods and DCID for one session on CIFAR100.

We can see that the training time of DCID is shorter than that of the typical CIL methods, mainly thanks to the fact that the local models can be trained simultaneously on their local datasets. Therefore, the proposed DCID framework is efficient.

The robustness to data variability

It is of great importance that machine learning algorithms are robust to potential data perturbations. To quantitatively validate the robustness of our method, inspired by the experiments in [25], we perturb the test images by jittering their hue and evaluate the performance under chromatic changes. We denote by $h_0$ the hue of the original image and by $h$ the parameter controlling the magnitude of the hue shift. The hue of the processed image is randomly sampled within the interval $[h_0 - h, h_0 + h]$. We conducted experiments on subImageNet with the LUCIR method to investigate the performance of the baseline and our DCID under different $h$. The results are shown in Table IX.
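A minimal sketch of this hue perturbation, assuming torchvision's ColorJitter (which samples the hue shift uniformly from [-h, h]):

```python
from torchvision import transforms

def hue_jitter_transform(h):
    """Perturb test images by randomly jittering their hue within [-h, +h]."""
    return transforms.Compose([
        transforms.ColorJitter(hue=h),   # h must lie in [0, 0.5], as in the settings above
        transforms.ToTensor(),
    ])

# Evaluated settings: h = 0 (no jitter), 0.1, 0.3, 0.5.
jitter_eval_tf = hue_jitter_transform(0.3)
```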

Methods Hue Shift Magnitude $h$
0 0.1 0.3 0.5
Average Acc. Baseline 63.21 61.67 58.11 56.23
Average Acc. DCID (Ours) 68.01 65.25 60.40 58.45
Final Acc. Baseline 50.14 47.98 44.40 43.38
Final Acc. DCID (Ours) 57.92 53.90 50.06 47.70
TABLE IX: Results with LUCIR by jittering the hue of test images on subImageNet.

Though the test accuracy declines with hue jittering, our DCID still outperforms the baseline method, demonstrating the robustness of our method to chromatic perturbations.

IV-E Evaluation under the Non-IID Setting

In heterogeneously distributed data settings, the training data of new classes in a session are distributed unevenly across local sites with respect to class labels. Training data are usually partitioned via a Dirichlet distribution over classes in the non-IID settings of distributed and federated learning, as used in [74, 28, 42].

Therefore, we follow the Dirichlet distribution to synthesize non-IID training data distributions in our experiments. The value of $\alpha$ is a concentration parameter controlling the degree of non-IID-ness among the local sites, while the prior characterizes a class distribution over the classes of the incremental session. With $\alpha \to 0$, each local site holds training samples from only one random class; with $\alpha \to \infty$, all local sites have distributions identical to the prior class distribution. The prior class distribution is set to the uniform distribution in our experiments. Therefore, a smaller $\alpha$ indicates higher data heterogeneity. In this work, we use the 5-session setting as an example, so each session contains 10 classes in both CIFAR100 and subImageNet. To better understand the local data distributions considered in the experiments, we visualize the effect of different $\alpha$ under the 5-session setting on CIFAR100 and subImageNet in Figure 7.
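One common way to generate such a split, and a reasonable reading of the protocol above, is to draw per-class Dirichlet proportions over the local sites; the sketch below is an assumption-level illustration rather than the exact partitioning code.

```python
import numpy as np

def dirichlet_partition(labels, num_sites, alpha, seed=0):
    """Split sample indices across local sites with per-class Dirichlet proportions.

    A smaller alpha yields a more heterogeneous (non-IID) partition.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    site_indices = [[] for _ in range(num_sites)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(np.full(num_sites, alpha))
        # Cut points implied by the class-wise proportions.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for site, chunk in enumerate(np.split(idx, cuts)):
            site_indices[site].extend(chunk.tolist())
    return site_indices

# Example: 10 new classes in a session, 5 local sites, strong heterogeneity.
# parts = dirichlet_partition(session_labels, num_sites=5, alpha=0.1)
```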

Fig. 7: Visualization of the training samples per class allocated to each local site in one session (5-session setting), for different values $\alpha$ of the Dirichlet distribution. The x-axis indicates class labels and the y-axis indicates the IDs of the local sites. The size of each dot reflects the number of samples.

We conduct experiments with several values of $\alpha$ on subImageNet with LUCIR and on CIFAR100 with iCARL. We can observe that our DCID consistently outperforms the baseline method “DCIL w/ FedAvg” in both final accuracy and average accuracy, as shown in Figures 8 and 9. This demonstrates the effectiveness of our proposed framework for heterogeneous distributions of local data. It is worth noting that the gain of DCID remains notable even when the data distributions are highly heterogeneous (with a small $\alpha$).

Fig. 8: Comparison of average accuracy (a) and final accuracy (b) between our method DCID and the baseline method “DCIL w/ FedAvg” using LUCIR, w.r.t. different values $\alpha$ of the Dirichlet distribution, on subImageNet.
Fig. 9: Comparison of average accuracy (a) and final accuracy (b) between our method DCID and the baseline method “DCIL w/ FedAvg” using iCARL, w.r.t. different values $\alpha$ of the Dirichlet distribution, on CIFAR100.

V Conclusion

We initiate the study of decentralized deep incremental learning, which handles continuous streams of data coming from different sources. It is distinct from existing studies on (deep) incremental learning and distributed learning: incremental learning can only update a model given a data stream coming from a single repository, while neither distributed learning nor federated learning can handle continuous data streams. The study of decentralized deep incremental learning is thus significant and challenging. To facilitate this study, we establish a benchmark. We then propose a decentralized composite knowledge incremental distillation method, which consistently outperforms the baseline methods by a large margin under different IID and non-IID settings. In the future, the proposed method will be applied to multi-robot systems.

Acknowledgment

This work is funded by the National Key Research and Development Project of China under Grant 2019YFB1312000 and by the National Natural Science Foundation of China under Grant No. 62076195. The authors are also grateful to Ms. Liangfei Zhang and Ms. Yu Liu for their comments and discussions.

References

  • [1] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton (2018) Large scale distributed neural network training through online distillation. Proc. of International Conference on Learning Representations (ICLR). Cited by: §III-B.
  • [2] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning 3 (1), pp. 1–122. Cited by: §II-C.
  • [3] C. Bucila, R. Caruana, and A. Niculescu-Mizil (2006) Model compression. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(KDD’06), Cited by: §II-B.
  • [4] Z. Cai, O. Sener, and V. Koltun (2021) Online continual learning with natural distribution shifts: an empirical study with visual data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8281–8290. Cited by: §II-B.
  • [5] F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari (2018) End-to-end incremental learning. In Proc. of ECCV, pp. 233–248. Cited by: §IV-A.
  • [6] F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari (2018) End-to-end incremental learning. In ECCV, Cited by: §II-B.
  • [7] G. Cauwenberghs and T. Poggio (2001) Incremental and decremental support vector machine learning. Advances in neural information processing systems, pp. 409–415. Cited by: §I, §I.
  • [8] D. Chang, Y. Ding, J. Xie, A. K. Bhunia, X. Li, Z. Ma, M. Wu, J. Guo, and Y. Song (2020) The devil is in the channels: mutual-channel loss for fine-grained image classification. IEEE Transactions on Image Processing 29, pp. 4683–4695. Cited by: §I.
  • [9] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny (2018) Efficient lifelong learning with a-gem. In International Conference on Learning Representations, Cited by: §II-B.
  • [10] F. Chen, M. Luo, Z. Dong, Z. Li, and X. He (2018) Federated meta-learning with fast convergence and efficient communication. arXiv preprint arXiv:1802.07876. Cited by: §II-C.
  • [11] K. Chen and Q. Huo (2016) Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In Proc. of ICASSP, Cited by: §II-C.
  • [12] W. Chen, K. Bhardwaj, and R. Marculescu (2020) Fedmax: mitigating activation divergence for accurate and communication-efficient federated learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 348–363. Cited by: §II-C, §III-C.
  • [13] Y. Chen, X. Sun, and Y. Jin (2019) Communication-efficient federated deep learning with layerwise asynchronous model update and temporally weighted aggregation. IEEE transactions on neural networks and learning systems 31 (10), pp. 4229–4238. Cited by: §II-C.
  • [14] S. Dong, X. Hong, X. Tao, X. Chang, X. Wei, and Y. Gong (2021) Few-shot class-incremental learning via relation knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 1255–1263. Cited by: §III-B, §IV-A, §IV-A.
  • [15] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar (2018) Born again neural networks. In International Conference on Machine Learning (ICML), Cited by: §II-B.
  • [16] H. Guan, Y. Wang, X. Ma, and Y. Li (2019) A distributed class-incremental learning method based on neural network parameter fusion. In IEEE 21st International Conference on High Performance Computing and Communications, Cited by: §II-C.
  • [17] H. Guan, Y. Wang, X. Ma, and Y. Li (2019) DCIGAN: a distributed class-incremental learning method based on generative adversarial networks. In IEEE ISPA, pp. 768–775. Cited by: §II-C.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §II-A, §IV-A.
  • [20] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. Cited by: §II-A.
  • [21] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §III-B.
  • [22] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §II-B.
  • [23] D. Hong, L. Gao, J. Yao, B. Zhang, A. Plaza, and J. Chanussot (2021) Graph convolutional networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 59 (7), pp. 5966–5978. Cited by: §II-A.
  • [24] D. Hong, L. Gao, N. Yokoya, J. Yao, J. Chanussot, Q. Du, and B. Zhang (2020) More diverse means better: multimodal deep learning meets remote-sensing imagery classification. IEEE Transactions on Geoscience and Remote Sensing 59 (5), pp. 4340–4354. Cited by: §II-A.
  • [25] D. Hong, N. Yokoya, J. Chanussot, and X. X. Zhu (2019) An augmented linear mixing model to address spectral variability for hyperspectral unmixing. IEEE Transactions on Image Processing 28 (4), pp. 1923–1938. External Links: Document Cited by: §IV-D.
  • [26] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin (2019-06) Learning a unified classifier incrementally via rebalancing. In Proc. of CVPR, Cited by: §I, §II-B, §III-B, §IV-A, §IV-A, §IV-A, §IV-D.
  • [27] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin (2019) Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE International Conference on Computer Vision (CVPR), Cited by: §II-B, §IV-A, §IV-A.
  • [28] T. H. Hsu, H. Qi, and M. Brown (2019) Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335. Cited by: §IV-E.
  • [29] Z. Ji, J. Liu, Q. Wang, and Z. Zhang (2021-12) Coordinating experience replay: a harmonious experience retention approach for continual learning. Knowledge-Based Systems 234, pp. 107589. External Links: Document Cited by: §II-B.
  • [30] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh (2020) Scaffold: stochastic controlled averaging for federated learning. International Conference on Machine Learning, pp. 5132–5143. Cited by: §I, §I.
  • [31] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. PNAS. Cited by: §II-B.
  • [32] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto. Cited by: §IV-A.
  • [33] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, pp. 1097–1105. Cited by: §I, §II-A.
  • [34] L. I. Kuncheva and C. J. Whitaker (2003) Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine learning 51 (2), pp. 181–207. Cited by: §IV-D, §IV-D.
  • [35] I. Kuzborskij, F. Orabona, and B. Caputo (2013) From n to n+ 1: multiclass transfer incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3358–3365. Cited by: §I, §I.
  • [36] S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang (2017) Overcoming catastrophic forgetting by incremental moment matching. Advances in neural information processing systems 30. Cited by: §I, §I.
  • [37] S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang (2017) Overcoming catastrophic forgetting by incremental moment matching. In Advances in NIPS, pp. 4652–4662. Cited by: §II-B.
  • [38] Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §II-A.
  • [39] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith (2020) Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, 2:429–450, Cited by: §I, §II-C, §III-C.
  • [40] X. Li, S. Li, B. Omar, F. Wu, and X. Li (2021) Reskd: residual-guided knowledge distillation. IEEE Transactions on Image Processing 30, pp. 4735–4746. Cited by: §II-B.
  • [41] Z. Li and D. Hoiem (2018) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947. External Links: Document Cited by: §III-B.
  • [42] T. Lin, L. Kong, S. U. Stich, and M. Jaggi (2020) Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems 33, pp. 2351–2363. Cited by: §II-C, §IV-D, §IV-E.
  • [43] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) Sphereface: deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 212–220. Cited by: §I.
  • [44] Y. Liu, Y. Su, A. Liu, B. Schiele, and Q. Sun (2020) Mnemonics training: multi-class incremental learning without forgetting. In Proc. of CVPR, pp. 12245–12254. Cited by: §IV-A.
  • [45] Y. Liu, X. Hong, X. Tao, S. Dong, J. Shi, and Y. Gong (2022) Model behavior preserving for class-incremental learning. IEEE Transactions on Neural Networks and Learning Systems (), pp. 1–12. External Links: Document Cited by: §II-B.
  • [46] Y. Liu, S. Parisot, G. Slabaugh, X. Jia, A. Leonardis, and T. Tuytelaars (2020) More classifiers, less forgetting: a generic multi-classifier paradigm for incremental learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16, pp. 699–716. Cited by: §I.
  • [47] J. Long, E. Shelhamer, and T. Darrell (2015-06) Fully convolutional networks for semantic segmentation. In IEEE conference on computer vision and pattern recognition(CVPR), Cited by: §I.
  • [48] A. Mallya and S. Lazebnik (2018) Packnet: adding multiple tasks to a single network by iterative pruning. Proc. of CVPR. Cited by: §II-B.
  • [49] M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of learning and motivation, Vol. 24, pp. 109–165. Cited by: §I.
  • [50] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proc. of AISTATS, Cited by: §II-C, §III-B, §III-B, §IV-A, §IV-A, §IV-B, §IV-D, §IV-D.
  • [51] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng (2011) Multimodal deep learning. In ICML, Cited by: §II-A.
  • [52] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks 113, pp. 54–71. Cited by: §I, §III-B.
  • [53] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks 113, pp. 54–71. Cited by: §II-B.
  • [54] G. M. Park, S. M. Yoo, and J. H. Kim (2020) Convolutional neural network with developmental memory for continual learning. IEEE Transactions on Neural Networks and Learning Systems PP (99), pp. 1–15. Cited by: §II-B.
  • [55] S. A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) ICaRL: incremental classifier and representation learning. In Proc of CVPR, Cited by: §I, §II-B, §III-B, §III-B, §III-B, §IV-A, §IV-A, §IV-A, §IV-A.
  • [56] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems 28. Cited by: §I.
  • [57] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision. Cited by: §IV-A.
  • [58] H. Saihui, P. Xinyu, L. Chen Change, W. Zilei, and L. Dahua (2018) Lifelong learning via progressive distillation and retrospection. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §II-B.
  • [59] F. Sattler, S. Wiedemann, K. R. Muller, and W. Samek (2019) Robust and communication-efficient federated learning from non-i.i.d. data. IEEE Transactions on Neural Networks and Learning Systems PP (99), pp. 1–14. Cited by: §II-C.
  • [60] H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. Advances in neural information processing systems 30. Cited by: §I.
  • [61] A. J. Smola and S. Narayanamurthy (2010) An architecture for parallel topic models. Proc. of the Vldb Endowment 3 (1), pp. 703–710. Cited by: §II-C.
  • [62] S. Sun, W. Chen, J. Bian, X. Liu, and T. Liu (2017) Ensemble-compression: a new method for parallel training of deep neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 187–202. Cited by: §II-C.
  • [63] X. Tao, X. Chang, X. Hong, X. Wei, and Y. Gong (2020) Topology-preserving class-incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §I, §II-B, §III-B, §IV-A, §IV-A, §IV-D.
  • [64] X. Tao, X. Hong, X. Chang, S. Dong, X. Wei, and Y. Gong (2020) Few-shot class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12183–12192. Cited by: §I, §I, §IV-A.
  • [65] X. Tao, X. Hong, X. Chang, and Y. Gong (2020) Bi-objective continual learning: learning ‘new’while consolidating ‘known’. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 5989–5996. Cited by: §I.
  • [66] H. Wang, M. Yurochkin, Y. Sun, D. Papailiopoulos, and Y. Khazaeni (2020) Federated learning with matched averaging. In International Conference on Learning Representations, Cited by: §II-C, §IV-D, §IV-D.
  • [67] M. Welling (2009) Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1121–1128. Cited by: §III-B.
  • [68] C. Wu, L. Herranz, X. Liu, J. van de Weijer, B. Raducanu, et al. (2018) Memory replay gans: learning to generate new categories without forgetting. In Advances In NIPS, pp. 5962–5972. Cited by: §II-B.
  • [69] Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, and Y. Fu (2019) Large scale incremental learning. In CVPR, Cited by: §II-B.
  • [70] J. Xie, Z. Ma, D. Chang, G. Zhang, and J. Guo (2021) GPCA: a probabilistic framework for gaussian process embedded channel attention. IEEE Transactions on Pattern Analysis and Machine Intelligence (), pp. 1–1. External Links: Document Cited by: §I.
  • [71] J. Xie, Z. Ma, J. Lei, G. Zhang, J. Xue, Z. Tan, and J. Guo (2021) Advanced dropout: a model-free methodology for bayesian dropout optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §II-A.
  • [72] J. Yoon, E. Yang, J. Lee, and S. J. Hwang (2018) Lifelong learning with dynamically expandable networks. International Conference on Learning Representations. Cited by: §II-B.
  • [73] M. Yurochkin, M. Agarwal, S. Ghosh, K. Greenewald, T. N. Hoang, and Y. Khazaeni (2019) Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, PMLR, pp. 7252–7261. Cited by: §I.
  • [74] M. Yurochkin, M. Agarwal, S. Ghosh, K. Greenewald, N. Hoang, and Y. Khazaeni (2019) Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, pp. 7252–7261. Cited by: §IV-E.
  • [75] F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In International Conference on Machine Learning, pp. 3987–3995. Cited by: §I.
  • [76] B. Zhao, X. Xiao, G. Gan, B. Zhang, and S. Xia (2020) Maintaining discrimination and fairness in class incremental learning. In Proceedings of the IEEE International Conference on Computer Vision (CVPR), Cited by: §II-B.
  • [77] H. Zhao, Y. Fu, X. Li, S. Li, B. Omar, and X. Li (2020) Few-shot class-incremental learning via feature space composition. CoRR abs/2006.15524. External Links: 2006.15524 Cited by: §I.
  • [78] H. Zhao, X. Qin, S. Su, Y. Fu, Z. Lin, and X. Li (2021) When video classification meets incremental classes. Proceedings of the 29th ACM International Conference on Multimedia, pp. 880–889. Cited by: §I, §I.
  • [79] H. Zhao, H. Wang, Y. Fu, F. Wu, and X. Li (2021) Memory efficient class-incremental learning for image classification. IEEE Transactions on Neural Networks and Learning Systems (), pp. 1–12. External Links: Document Cited by: §II-B.
  • [80] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra (2018) Federated learning with non-iid data. arXiv preprint arXiv:1806.00582. Cited by: §II-C.