Introduction
A key signature of human intelligence is the ability to quickly acquire knowledge from few examples. Although artificial intelligence has made remarkable progress in a wide range of applications, it remains challenging to perform well in situations with little available data or limited computational resources. Such a scenario is typically referred to as few-shot learning, which has attracted considerable interest recently [aaai1, aaai2, aaai4]. Rather than simply augmenting data [hallucinate] or adding regularization to compensate for the lack of data, an emerging line of work tackles few-shot learning with meta-learning. By leveraging previous learning experience to obtain a prior over tasks at meta-train time, the efficiency of later learning is improved at meta-test time. In particular, the learned prior for discovering transferable knowledge can act as an inductive bias that reduces generalization error.
Generally, mainstream meta-learning models follow the episodic training paradigm, where the meta-learner extracts domain-general information across episodes so that it can assist the task-specific learner in recognizing unlabeled samples (the query set) based on a few labeled points (the support set). The meta-learner can be implemented in various ways: as an optimizer that gathers gradient flows from different tasks [maml, reptile, layerwise]; as a classification weight generator that hallucinates classifiers for novel classes [param_prediction, dynamic, gene_GNN]; or as a metric that measures similarity between query and support samples [matching, proto]. Nevertheless, existing meta-learning methods are far from optimal due to the lack of relational inductive bias modeling [relational], and thereby fail to exploit structured representations of intra- and inter-task relations.
Motivated by capturing more interactions among instances, another line of work has explored graph structures [GNN, EGNN, TPN] or second-order statistics such as covariance [covariance] in the meta-learning framework. More concretely, Garcia and Bruna [GNN] cast few-shot learning as a node classification problem with graph neural networks, where nodes are represented by the images in an episode and edges are given by a trainable similarity kernel. In the same vein, Liu et al. [TPN] proposed a transductive propagation network (TPN) for label propagation, thus enabling transductive inference over the whole query set. Alternatively, Kim et al. [EGNN] modeled learning as an edge-labeling problem, in order to directly predict whether two associated nodes belong to the same class.
While promising, most existing graph-based meta-learning approaches suffer from two major limitations, i.e., catastrophic forgetting [lifelong] and insufficient robustness [BGNN], which make it difficult to transfer knowledge over long time spans or to handle uncertainty in the graph structure. On the one hand, when a meta-learner gradually encounters a sequence of learning problems, it tends to wash out past knowledge as it learns new things. On the other hand, current graph-based models incorrectly treat the predefined edge initialization as a reliable topology for message passing, so inaccurate or uncertain relationships among query-support pairs can lead to severe error accumulation through multi-layer propagation.
To address the issues mentioned above, in this paper we propose a novel Continual Meta-Learning framework with Bayesian Graph Neural Networks (CML-BGNN) for few-shot classification, which is illustrated in Figure 1. To alleviate catastrophic forgetting, we jointly model the long-term inter-task correlations and the short-term intra-class adjacency with the derived continual graph neural networks, which can retain and later access important prior information associated with newly encountered episodes. Specifically, the node update block aggregates adjacent embeddings within each episode and feeds context-aware node representations to gated recurrent units, which refine node features with previous history. Such aggregations can be naturally chained and stacked into multiple layers to enhance model expressiveness. Moreover, as uncertainty is rife in edge initialization, we provide a Bayesian approach to edge inference so that classification weights can be dynamically adjusted for discriminating specific tasks. The conceived amortization networks approximate the posterior distribution of classification weights with a Gaussian distribution defined by a mean and variance over possible values. Accordingly, task-specific parameters are sampled to mitigate the bias of node embeddings and further enhance the robustness of the graph neural networks. Overall, our contributions can be briefly summarized as follows:

We propose a novel Continual Meta-Learning framework that leverages both long-term inter-task and short-term intra-task correlations for few-shot learning. Different from existing graph-based meta-learning approaches, we introduce a memory-augmented graph neural network to enable flexible knowledge transfer across episodes.

To remedy uncertainty among query-support pairs, a Bayesian edge inference is derived by amortizing the posterior inference of task-specific parameters.

We show the effectiveness of the proposed architecture through extensive experiments on the miniImageNet and tieredImageNet benchmarks, with relative improvements over state-of-the-art counterparts. For the robustness analysis, we perform semi-supervised learning to verify the efficiency and effectiveness of the proposed method.
Related Work
Meta-Learning
Meta-learning studies how to distill prior knowledge from past experience and enable fast adaptation to novel tasks with only a limited number of samples. Much effort has been devoted to this problem in recent work, which can be broadly categorized into several groups. (1) Optimization-based methods either learn a good parameter initialization or leverage an optimizer as the meta-learner to adjust model weights. Typical examples include learning to approximate gradient descent with an LSTM [optimizationasmodel], learning model-agnostic initial parameters [maml] and its variants with probabilistic estimation [pmaml, bmaml, metaSGD], first-order approximation [reptile], layer selection [layerwise], learning the learner's update direction and learning rate [metaSGD], and relation embedding [leo]. (2) Generation-based methods learn to augment few-shot data with a generative meta-learner [PMN], or learn to predict classification weights for novel classes [dynamic, param_prediction, gene_GNN]. (3) Metric-based approaches address the few-shot classification problem by learning a proper distance metric as the meta-learner, such as cosine similarity [matching], Euclidean distance to class prototypes [proto, softkmeans] [regression], relation networks [relation], task attention [attention], category traversal [ctm], and graph modeling [GNN, TPN, covariance, EGNN]. Rather than purely relying on graph-based metric learning, our method exploits long-term information from previous tasks and jointly models the topological uncertainty.
Catastrophic Forgetting
Catastrophic forgetting [lifelong] has been a long-standing issue in the machine learning community due to the stability-plasticity dilemma [dilemma]. In the recent literature, a number of methods have been proposed on the basis of continual learning, which can be roughly subdivided into the following groups. Regularization approaches [regularization] alleviate catastrophic forgetting by imposing constraints on the updates of the neural weights in order to prevent "overwriting" what was previously encoded. Alternatively, in ensemble algorithms [dynamic], the architecture itself is altered to accommodate new tasks by retraining a pool of pretrained models. In dual-memory algorithms [replay], one estimates the distribution of the old data either by saving a small fraction of the original dataset into a memory buffer or by training a generator to mimic the lost data and labels [replay1]. Most related to the last group, our work, for the first time, tackles catastrophic forgetting in a meta-learning framework and validates its effectiveness and efficiency on practical tasks.
Background and Problem Definition
Given the training set $\mathcal{D}^{train}$, the goal is to learn a model that generalizes well to the unseen test set $\mathcal{D}^{test}$, where the class sets of $\mathcal{D}^{train}$ and $\mathcal{D}^{test}$ are disjoint. Meta-learning approaches commonly adopt an episodic training strategy to minimize the generalization error across a series of tasks that are randomly sampled from a task distribution. Specifically, an $N$-way, $K$-shot classification setting is used for both the training and testing stages, where $N$ indicates the number of unique classes in each episode and $K$ denotes the number of training samples per class. Each episode or task $\mathcal{T}_i$ is composed of a support set $\mathcal{S}_i$ and a query set $\mathcal{Q}_i$. To discover the commonalities and variability across tasks, we decouple the model into two sub-modules, i.e., the feature learner and the task-dependent classifiers, where $\theta$ denotes the shared parameters that suit all tasks and $\psi_i$ denotes the task-specific parameters. To mathematically illustrate the proposed continual meta-learning procedure, we first recall the unified definition of existing meta-learning models: as discussed in the literature [bayes, pmaml], meta-learning can be viewed as approximate inference for the posterior, given the following definition.
Definition 1.
Meta-Learning. Given a task $\mathcal{T}_i$ sampled from the task distribution $p(\mathcal{T})$, the posterior predictive distribution for query points $(x^q, y^q) \in \mathcal{Q}_i$ is calculated as

$p(y^q \mid x^q, \mathcal{S}_i, \theta) = \int p(y^q \mid x^q, \psi_i, \theta)\, p(\psi_i \mid \mathcal{S}_i, \theta)\, d\psi_i \approx p(y^q \mid x^q, \psi_i^{*}, \theta),$

where $\psi_i^{*}$ is the maximum a posteriori (MAP) value of $\psi_i$, which can be obtained via point estimates.
For instance, optimization-based meta-learning regards all model parameters as $\psi_i$ and forms a point estimate by taking several steps of gradient descent initialized at $\theta$ with learning rate $\alpha$, i.e.,

$\psi_i^{*} = \theta - \alpha \nabla_{\theta} \mathcal{L}(\theta; \mathcal{S}_i).$   (1)
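To make the point estimate in Eq. (1) concrete, the following PyTorch sketch performs a single inner-loop gradient step on the support set; the model, loss function, and inner learning rate are illustrative assumptions rather than the exact settings of this paper.

```python
import torch
import torch.nn.functional as F

def inner_loop_step(model, support_x, support_y, lr_inner=0.01):
    """One gradient step on the support set, starting from the shared
    initialization theta (cf. Eq. 1); returns task-specific parameters psi."""
    logits = model(support_x)
    loss = F.cross_entropy(logits, support_y)
    grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    # psi_i = theta - alpha * grad_theta L(theta; S_i)
    psi = [p - lr_inner * g for p, g in zip(model.parameters(), grads)]
    return psi
```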
In contrast, generation-based meta-learning focuses on estimating classification weight vectors for novel classes, given an initial value trained on $\mathcal{D}^{train}$ and a learning rate, i.e., (2)
where $\phi$ is the learnable weight generator. Metric-based meta-learning takes the parameters of the top layer of the neural network as $\psi_i$ for all classes, obtained by averaging the top-layer activations for each class $c$, i.e.,

$\psi_{i,c}^{*} = \frac{1}{|\mathcal{S}_{i,c}|} \sum_{(x_j, y_j) \in \mathcal{S}_{i,c}} f_\theta(x_j),$   (3)

where $|\mathcal{S}_{i,c}|$ denotes the number of support samples belonging to class $c$, and $f_\theta$ indicates the embedding network.
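As an illustration of the class-averaging in Eq. (3), a minimal prototype-computation sketch is given below; tensor shapes and variable names are assumptions for exposition.

```python
import torch

def class_prototypes(embeddings, labels, num_classes):
    """Average the top-layer activations per class (cf. Eq. 3).
    embeddings: [N*K, D] support features from the embedding network f_theta.
    labels:     [N*K] integer class indices in [0, num_classes)."""
    protos = []
    for c in range(num_classes):
        protos.append(embeddings[labels == c].mean(dim=0))
    return torch.stack(protos)  # [num_classes, D]
```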
However, current meta-learning suffers from catastrophic forgetting and lacks uncertainty estimation. Without fixing these problems, a single deep model is incapable of adapting itself to long-run learning, since it forgets old knowledge as it deals with new tasks. Therefore, we generalize meta-learning to a preferable continual form and give the following definition.
Definition 2.
Continual Meta-Learning. Given a task $\mathcal{T}_i$ sampled from the task distribution $p(\mathcal{T})$, the posterior predictive distribution for query points $(x^q, y^q) \in \mathcal{Q}_i$ is

$p(y^q \mid x^q, \mathcal{T}_{1:i}, \theta) \approx \int p\big(y^q \mid x^q, \mathcal{T}_i, \mathcal{H}_{i-1}, \psi_i, \theta\big)\; p\big(\mathcal{H}_{i-1} \mid \mathcal{T}_{1:i-1}, \theta\big)\; q\big(\psi_i \mid \mathcal{T}_i, \theta\big)\, d\mathcal{H}_{i-1}\, d\psi_i,$

where $\mathcal{H}_{i-1}$ indicates the history knowledge that governs the transition of long-term memory.
The first term can be interpreted as follows: given a novel sample $x^q$, its label $y^q$ is conditioned not only on the current task but also on the history information $\mathcal{H}_{i-1}$, which naturally reuses supervision without piling up complexity. The second term captures information updating and resetting from related tasks, which is jointly learned with the shared parameters $\theta$ in the continual graph neural networks and will be elaborated in the next section. Lastly, to ensure a tractable likelihood, the distribution over classifier weights in the last term is approximated with a few steps of Bayes by Backprop.
Proposed Approach
In this section, we detail the two major components of the proposed method: the continual graph neural networks and the Bayes by Backprop procedure for edge inference.
Continual Graph Neural Networks
As shown in Figure 1, the support data (circles) and query data (triangles) in each episode form an undirected acyclic graph. Each vertex is associated with a node embedding vector, and each edge represents the interaction between its endpoints. The adjacency matrix $A$ is defined as the semantic similarity between connected nodes and is updated dynamically. Considering the given labels of the support set, the adjacency matrix can be initialized as,
(4) 
where support-support pairs are initialized w.r.t. their labels and query-support pairs are softly assigned an uncertain value. Even though ambiguity is explicitly injected, we remedy the noise via the probabilistic edge inference presented in the next section. For efficient implementation, we split the meta-train and meta-test datasets into several sequences of $T$ episodes, which are processed by the node updating (see Figure 2), history transition, and edge inference modules consecutively, where $T$ is the length of the sequence.
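A minimal sketch of one possible realization of the initialization in Eq. (4) follows; taking 0.5 as the "uncertain value" for query-related edges and ordering support nodes before query nodes are our assumptions.

```python
import torch

def init_adjacency(support_labels, num_query, uncertain=0.5):
    """Support-support edges are set from the given labels (1 if same class,
    0 otherwise); every edge touching a query node is softly assigned an
    uncertain value, as described for Eq. (4).
    support_labels: [S] class indices of the support nodes (placed first)."""
    S = support_labels.size(0)
    V = S + num_query
    A = torch.full((V, V), uncertain)
    same = (support_labels.unsqueeze(0) == support_labels.unsqueeze(1)).float()
    A[:S, :S] = same
    return A
```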
Node Interactions Modeling
To make each node aggregate information from neighbors multiple hops away, the proposed graph model stacks several aggregation blocks. Following the generic propagation rule, the node vectors in an arbitrary graph at the $\ell$-th layer can be updated as,
(5) 
where $\mathcal{N}(i)$ indicates the neighbor set of node $i$, $\|$ is the concatenation operation, and $f_v$ is a transformation block consisting of two convolutional layers, one LeakyReLU activation, and one dropout layer. The node embedding is initialized with the representation extracted by the backbone embedding model, i.e., $v_i^{(0)} = f_\theta(x_i)$. We leave out the layer index in the next section for simplicity.
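The aggregation rule of Eq. (5) could be realized roughly as follows; the layer widths, the use of 1x1 convolutions, and the dropout rate are assumptions for illustration.

```python
import torch
import torch.nn as nn

class NodeUpdate(nn.Module):
    """Neighbor aggregation followed by a transformation block
    (two 1x1 convolutions, LeakyReLU, dropout), sketching Eq. (5)."""
    def __init__(self, in_dim, out_dim, dropout=0.3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(2 * in_dim, out_dim, kernel_size=1),
            nn.LeakyReLU(),
            nn.Conv1d(out_dim, out_dim, kernel_size=1),
            nn.Dropout(dropout),
        )

    def forward(self, v, A):
        # v: [V, D] node features, A: [V, V] adjacency features
        agg = A @ v                       # adjacency-weighted neighbor messages
        h = torch.cat([v, agg], dim=-1)   # [V, 2D] self || aggregated
        return self.block(h.t().unsqueeze(0)).squeeze(0).t()  # [V, out_dim]
```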
Task History Transition
After receiving messages from the current episode $\mathcal{T}_t$, each node embedding is further transformed with a gated updater. Different from the Gated Graph Neural Network [GGNN], which utilizes the gated updater to extend the depth of the graph neural network on the same batch of data, we feed a different embedding at each time step in order to capture long-term task correlations. Specifically, the hidden state for each node is transferred as follows,

$z_i^t = \sigma(W_z v_i^t + U_z h_i^{t-1} + b_z),\quad r_i^t = \sigma(W_r v_i^t + U_r h_i^{t-1} + b_r),$
$\tilde{h}_i^t = \tanh\big(W_h v_i^t + U_h (r_i^t \odot h_i^{t-1}) + b_h\big),\quad h_i^t = (1 - z_i^t) \odot h_i^{t-1} + z_i^t \odot \tilde{h}_i^t,$   (6)

where $h_i^t$ is the updated feature of node $i$, $\sigma$ is the sigmoid function, $W_z$, $U_z$, $W_r$, $U_r$, $W_h$, $U_h$ are learnable weights, and $b_z$, $b_r$, $b_h$ are biases of the updating function. $z_i^t$ and $r_i^t$ are the update gate and reset gate vectors, respectively. We denote all these parameters as $\theta_h$ for shorthand. The hidden state is initialized as a zero vector.
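A sketch of the history transition using PyTorch's built-in GRUCell is shown below; it is functionally equivalent to the gating equations in Eq. (6). The 96-dimensional hidden size follows the implementation details, while the fixed node count per episode is an assumption.

```python
import torch
import torch.nn as nn

gru = nn.GRUCell(input_size=96, hidden_size=96)  # dims follow the stated setting

def transfer_history(node_feats_per_episode):
    """node_feats_per_episode: list of [V, 96] tensors, one per episode in the
    sequence. The hidden state starts at zero and is carried across episodes,
    so later tasks can read long-term information from earlier ones."""
    h = torch.zeros(node_feats_per_episode[0].size(0), 96)
    refined = []
    for v in node_feats_per_episode:
        h = gru(v, h)          # update/reset gating of Eq. (6)
        refined.append(h)
    return refined
```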
Adjacency Feature Update

After the hidden-state transition steps, we obtain a set of final node representations at the $\ell$-th layer. The adjacency features at the $\ell$-th layer are then calculated as,

(7)

where $D$ is the degree matrix of the adjacency matrix, and $f_e$ is a non-linear transformation network, parameterized by $\phi_e$, that includes four convolutional blocks, a batch normalization, a LeakyReLU activation, and a dropout layer.
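Since the exact form of Eq. (7) is not reproduced here, the following sketch shows one plausible degree-normalized edge update with a small scoring network f_e; the linear layers (in place of the convolutional blocks) and the absolute-difference pairing are simplifying assumptions.

```python
import torch
import torch.nn as nn

class EdgeUpdate(nn.Module):
    """Score each node pair from the absolute difference of their final
    embeddings, then degree-normalize the result (a sketch of Eq. 7)."""
    def __init__(self, dim, dropout=0.3):
        super().__init__()
        self.f_e = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.LeakyReLU(),
            nn.Dropout(dropout), nn.Linear(dim, 1),
        )

    def forward(self, h, A_prev):
        # h: [V, D] final node representations, A_prev: [V, V] previous adjacency
        V = h.size(0)
        diff = (h.unsqueeze(1) - h.unsqueeze(0)).abs().view(V * V, -1)
        sim = torch.sigmoid(self.f_e(diff)).view(V, V) * A_prev
        deg_inv = torch.diag(1.0 / sim.sum(dim=1).clamp(min=1e-6))
        return deg_inv @ sim   # row-normalized adjacency features
```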
Bayes by Backprop for Edge Inference
Through multiple graph aggregation layers, we aim to infer a full predictive distribution over the unknown query labels, relying on distributional Bayesian decision theory [pmaml, versa]. Notably, we amortize the posterior distribution of the classification weights as $q_\phi(\psi_i \mid \mathcal{S}_i)$ to enable quick prediction at the meta-test stage, and learn the parameters by minimizing the average expected loss over tasks, i.e., $\min_{\theta, \phi}\ \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})}\big[\mathcal{L}_i(\theta, \phi)\big]$, where

$\mathcal{L}_i(\theta, \phi) = \mathbb{E}_{q_\phi(\psi_i \mid \mathcal{S}_i)}\Big[-\sum_{(x^q, y^q) \in \mathcal{Q}_i} \log p\big(y^q \mid x^q, \psi_i, \theta\big)\Big].$   (8)
To ensure a tractable likelihood, we use a factorized Gaussian distribution for $q_\phi$, with means and variances set by the amortization network,

$q_\phi(\psi_i \mid \mathcal{S}_i) = \mathcal{N}\big(\psi_i;\ \mu_\phi(\mathcal{S}_i),\ \mathrm{diag}(\sigma_\phi^2(\mathcal{S}_i))\big).$   (9)
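The amortization network behind Eq. (9) could be sketched as below, producing a mean and log-variance from pooled support features and drawing classifier weights with the reparameterization trick; the dimensions and the mean-pooling summary are assumptions.

```python
import torch
import torch.nn as nn

class AmortizedPosterior(nn.Module):
    """Factorized Gaussian q(psi | S) with mean and variance produced
    by an amortization network (cf. Eq. 9)."""
    def __init__(self, feat_dim=96, weight_dim=96):
        super().__init__()
        self.mu = nn.Linear(feat_dim, weight_dim)
        self.log_var = nn.Linear(feat_dim, weight_dim)

    def forward(self, support_feats):
        pooled = support_feats.mean(dim=0)       # task-level summary
        mu, log_var = self.mu(pooled), self.log_var(pooled)
        std = torch.exp(0.5 * log_var)
        psi = mu + std * torch.randn_like(std)   # reparameterized sample
        return psi, mu, log_var
```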
With the generated posterior distribution, we can adaptively transform the predictive logits from the $\ell$-th layer based on the learned adjacency matrix and the sampled classifier weights for each specific task,

(10)

where $\sigma$ and $\odot$ indicate the sigmoid function and the Hadamard product, respectively. $M_q$ indicates the mask matrix that selects query predictions, whose value is 1 when the row index belongs to the query set and 0 otherwise. Similarly, $M_s$ selects the columns whose column index belongs to the support set.
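An illustrative way to construct the query/support selection masks described above is shown below; placing support nodes before query nodes and using 0/1 mask values are assumptions.

```python
import torch

def build_masks(num_support, num_query):
    """Row mask selects query rows, column mask selects support columns,
    so only query-support predictions contribute to the logits (cf. Eq. 10)."""
    V = num_support + num_query
    row_mask = torch.zeros(V, V)
    row_mask[num_support:, :] = 1.0   # rows indexed by query nodes
    col_mask = torch.zeros(V, V)
    col_mask[:, :num_support] = 1.0   # columns indexed by support nodes
    return row_mask, col_mask
```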
During meta-training, the proposed model is optimized by minimizing the binary cross-entropy loss on query edges together with the Bayes by Backprop loss, i.e.,
(11) 
where the shared parameters $\theta$ and $\phi$ are jointly optimized and the loss aggregates predictions from all layers. The overall algorithm for meta-training is shown in Algorithm 1.
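A hedged sketch of the meta-training objective in Eq. (11) is given below, combining per-layer binary cross-entropy on query edges with a Bayes-by-Backprop KL term against a standard normal prior; the prior choice and the weighting scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def meta_train_loss(edge_logits_per_layer, edge_targets, query_mask,
                    mu, log_var, coeff=1.0):
    """Aggregate BCE over all graph layers plus a Bayes-by-Backprop KL term
    against a standard normal prior (a sketch of Eq. 11).
    edge_targets and query_mask are float [V, V] tensors."""
    bce = sum(
        F.binary_cross_entropy_with_logits(logits[query_mask.bool()],
                                           edge_targets[query_mask.bool()])
        for logits in edge_logits_per_layer
    )
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return bce + coeff * kl
```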
Experiments
Datasets
For fair comparisons with state-of-the-art baselines, we conduct extensive experiments on two benchmark few-shot classification datasets:
miniImageNet is a subset of the ILSVRC dataset, where 600 images for each of 100 classes are randomly chosen to form the dataset. We follow the class split used by [optimizationasmodel], where 64 classes are used for training, 16 for validation, and 20 for testing. All input images have a size of 84x84.
tieredImageNet is a larger subset of ILSVRC, which contains 608 classes grouped into 34 higher-level categories sampled from the high-level nodes of the ImageNet hierarchy. The standard split includes 351 classes for training, 97 classes for validation, and 160 classes for testing. The average number of images in each class is 1281.
Baselines
We compare our approach with the following baseline methods to justify its effectiveness:
Optimization-based: Meta-learner LSTM [optimizationasmodel], MAML [maml], REPTILE [reptile], Meta-SGD [metaSGD], SNAIL [snail], LEO [leo].
Generation-based: PLATIPUS [pmaml], VERSA [versa], LwoF [dynamic], Param_Predict [param_prediction], wDAE [gene_GNN].
Metric-based: Matching Net [matching], Prototypical Net [proto], Relation Net [relation], TADAM [TADAM], CTM [ctm].
Graph-based: GNN [GNN], CovaMNet [covariance], TPN [TPN], EGNN [EGNN].
Implementation Details
Our source code is implemented in PyTorch and is attached in the supplementary material for reference. All experiments are conducted on a server with two GeForce GTX 1080 Ti and two RTX 2080 Ti GPUs.
Module Architecture.
Despite the generality of the backbone embedding module, we adopt the same architecture used in some recent work [relation, EGNN]. Specifically, the network consists of four convolutional blocks, each including a 2D convolutional layer, a batch normalization, a max-pooling, and a LeakyReLU activation. Regarding the recurrent units, the dimension of the hidden states and all embedding sizes are fixed to 96. We fix the number of hidden states to 8.
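For reference, the described backbone could be sketched as follows; the 3x3 kernel size and intermediate channel widths are assumptions, while the 96-dimensional embedding follows the stated setting.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One backbone block: conv, batch norm, max-pool, LeakyReLU.
    # The 3x3 kernel and channel widths are assumptions for illustration.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.MaxPool2d(2),
        nn.LeakyReLU(),
    )

backbone = nn.Sequential(
    conv_block(3, 64), conv_block(64, 96),
    conv_block(96, 96), conv_block(96, 96),
)
```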
Parameter Settings.
The mini-batch size for all graph-based models is 80 and 64 for the 1-shot and 5-shot experiments, respectively. The proposed model was trained with the Adam optimizer with an initial learning rate of and a weight decay of . The dropout rate is set to 0.3 and the loss coefficient is set to 1. We report the final results of the proposed model trained for 70K and 160K iterations on miniImageNet and tieredImageNet, respectively.
Models  Backbone  1-shot  5-shot

Optimization-based
Meta-learner LSTM  Conv4
MAML  Conv4
REPTILE  Conv4
Meta-SGD  Conv4
SNAIL  ResNet12
LEO  WRN28
Generation-based
PLATIPUS  Conv4
VERSA  Conv4
LwoF  Conv4
Param_Predict  WRN28
wDAE  WRN28
Metric-based
Matching Net  Conv4
Prototypical Net  Conv4
Relation Net  Conv4
TADAM  ResNet12
CTM  Conv4
Graph-based
GNN  Conv4
CovaMNet  Conv4
TPN  Conv4
EGNN  Conv4
EGNN  Conv4
Ours
CML-BGNN w/o C  Conv4
CML-BGNN w/o B  Conv4
CML-BGNN  Conv4
Models  Backbone  1-shot  5-shot

Optimization-based
MAML  Conv4
REPTILE  Conv4
Meta-SGD  Conv4
LEO  WRN28
Generation-based
LwoF  Conv4
wDAE  WRN28
Metric-based
Matching Net  Conv4
Prototypical Net  Conv4
Relation Net  Conv4
CTM  Conv4
Graph-based
GNN  Conv4
TPN  Conv4
EGNN-3  Conv4
EGNN-1  Conv4
EGNN-2  Conv4
EGNN-3  Conv4
Ours
CML-BGNN-1  Conv4
CML-BGNN-2  Conv4
CML-BGNN-3  Conv4
Comparisons with State-of-the-Art
To verify the effectiveness of the proposed continual meta-learning model, we compare it with state-of-the-art meta-learning methods on the miniImageNet and tieredImageNet datasets. We report the best performance for every model in Table 1 and Table 2, along with the specifications of the backbone embedding models used for feature extraction.
Conv4 refers to a 4-layer convolutional network, ResNet12 [resnet] denotes 4 residual blocks of depth 3 with short connections, and WRN28 is a 28-layer wide residual network. Generally, a deeper embedding network leads to better classification performance, yet with a risk of overfitting. From Table 1, we observe that our CML-BGNN equipped with three graph layers surpasses all compared meta-learning methods by a large margin, especially in the challenging 1-shot scenario. More concretely, the proposed model with the basic Conv4 embedding structure gains relative improvements over the previous best optimization-based LEO [leo], generation-based wDAE [gene_GNN], metric-based CTM [ctm], and graph-based EGNN [EGNN] methods in the 5-way 1-shot miniImageNet experiment. This is mainly owing to the learned history transition, which reinforces the memory of rare samples and of correlations between classes. Furthermore, we re-implemented the most powerful graph-based baseline, EGNN, with a mini-batch size of 80 for fairness and present a detailed comparison in boxplots. All parameters are randomly initialized in three trials with fixed seeds 111, 222, and 333 for reproducibility. As depicted in Figure 3, the validation accuracy in both the 1-shot and 5-shot settings tends to rise as training iterations increase. The proposed method reaches its peak at an early stage and achieves much higher performance, yet shows some sensitivity to seed selection in the 5-way 5-shot case. We infer this variance is mainly introduced by edge inference sampling and can be alleviated by averaging predictions over multiple samples. From Table 2, we observe that our re-implemented EGNN obtains better performance by enlarging the batch size from 40 to 80. This phenomenon consistently verifies that task correlations are more likely to contribute positively to few-shot learning.
Ablation Study
Effect of Components.
The major ablation results for CML-BGNN with different components on the miniImageNet dataset are shown in the gray blocks of Table 1. All variants are trained with three graph layers and a mini-batch size of 80. Removing the history transition module, the variant CML-BGNN w/o C can only mine patterns from the local neighborhood without maintaining related prior information for reference, thus inevitably leading to inferior performance, e.g., a clear drop in average 5-way 1-shot accuracy. CML-BGNN w/o B indicates the variant of our model that directly utilizes the adjacency matrix to predict query labels without inferring task-specific parameters. Accordingly, the classification accuracy suffers a slight drop on both datasets, e.g., in the 5-way 5-shot setting, which demonstrates the necessity of the full CML-BGNN formulation.
Methods  Layer 1  Layer 2  Layer 3

1-shot  GNN
EGNN
CML-BGNN
5-shot  GNN
EGNN
CML-BGNN
Effect of GNN’s Depth.
To further investigate the impact of the GNN's depth, we test our model in both 5-way 1-shot and 5-shot settings with different depths of graph neural networks on both miniImageNet (shown in Table 3) and tieredImageNet (shown in Table 2). Generally, a larger depth enables each node to learn from a more global perspective and thus enhances the expressive power of graph neural networks. For instance, the proposed CML-BGNN, EGNN, and GNN equipped with a 3-layer structure all improve 5-way 1-shot classification accuracy compared with their one-layer counterparts.
Effect of Number of Hidden States.
To study the impact of the number of hidden states in the history transition module, we compare nine variants of our model on both datasets and show validation curves in terms of node classification accuracy and edge binary-classification results in Figure 4. CML-BGNN-L{1,2,3}-C16 denotes the variants that leverage unrolled gated recurrent units with 16 time steps to transfer history messages, with different numbers of graph layers; these significantly outperform the variants with fewer hidden states, CML-BGNN-L{1,2,3}-C{4,8}. This confirms that the augmented memory module effectively enhances node representation learning by bridging prior task learning, regardless of model depth.
Methods  5-way 5-shot

labeled  labeled  labeled
GNN-LabeledOnly
GNN-Semi
EGNN-LabeledOnly
EGNN-Semi
CML-BGNN-LabeledOnly
CML-BGNN-Semi
Robustness Evaluation by Semi-supervised Learning
To quantitatively analyze the model's capacity for handling uncertainty, we conduct 5-way 5-shot semi-supervised experiments on the miniImageNet dataset and report the major results in Table 4. In this semi-supervised regime, the support data is only partially labeled while remaining balanced across classes, which poses a greater challenge of modeling uncertain relationships between labeled and unlabeled samples. In particular, the corresponding labeled column indicates that each episode contains 4 labeled support instances and 1 unlabeled instance. Here, LabeledOnly denotes the strategy that trains with only the labeled support samples, and Semi denotes training with both labeled and unlabeled data. Compared with all graph-based counterparts, the proposed method outperforms them by a large margin when 20% of the support samples are labeled. The superior performance results from our uncertainty modeling, which effectively adapts to the noise and misguidance in the adjacency initialization via task-specific parameters.
Conclusion
In this work, we propose a continual meta-learning model with Bayesian graph neural networks for the few-shot classification problem. To preserve more history messages associated with related tasks, the proposed CML-BGNN mines prior knowledge patterns by updating a memory-augmented graph neural network and handles topological uncertainty with Bayes by Backprop. Distinguishing our work from conventional graph-based meta-learning methods, the model naturally alleviates the catastrophic forgetting and insufficient robustness issues, and thus enables efficient adaptation and generalization to novel tasks.