Learning from the Past: Continual Meta-Learning via Bayesian Graph Modeling

11/12/2019 ∙ by Yadan Luo, et al. ∙ The University of Queensland

Meta-learning for few-shot learning allows a machine to leverage previously acquired knowledge as a prior, thus improving the performance on novel tasks with only small amounts of data. However, most mainstream models suffer from catastrophic forgetting and insufficient robustness, thereby failing to fully retain or exploit long-term knowledge while being prone to severe error accumulation. In this paper, we propose a novel Continual Meta-Learning approach with Bayesian Graph Neural Networks (CML-BGNN) that mathematically formulates meta-learning as continual learning of a sequence of tasks. With each task formed as a graph, the intra- and inter-task correlations can be well preserved via message-passing and history transition. To remedy topological uncertainty from graph initialization, we utilize the Bayes by Backprop strategy, which approximates the posterior distribution of task-specific parameters with amortized inference networks that are seamlessly integrated into end-to-end edge learning. Extensive experiments conducted on the miniImageNet and tieredImageNet datasets demonstrate the effectiveness and efficiency of the proposed method, improving the performance by 42.8% compared with state-of-the-art on the miniImageNet 5-way 1-shot classification task.


Introduction

A key signature of human intelligence is the ability to quickly acquire knowledge from a few examples. Although artificial intelligence has made remarkable progress in a wide range of applications, it remains challenging to perform well in situations with little available data or limited computational resources. Such a scenario is typically referred to as few-shot learning, which has attracted vast interest recently [aaai1, aaai2, aaai4].

Rather than simply augmenting data [hallucinate] or adding regularization to compensate for the lack of data, an emerging line of work tackles few-shot learning with meta-learning. By leveraging previous learning experience to obtain a prior over tasks at meta-train time, the efficiency of later learning is further improved at meta-test time. Particularly, the learned prior for discovering the transferable knowledge can act as an inductive bias to minimize generalization error.

Generally, mainstream meta-learning models follow the episodic training paradigm, where the meta-learner extracts domain-general information among episodes so that it can assist the task-specific learner to recognize unlabeled samples (query set) based on the few labeled points (support set). The meta-learner can be implemented in various ways: as an optimizer that gathers gradient flows from different tasks [maml, reptile, layer-wise]; as a classification weight generator that hallucinates classifiers for novel classes [param_prediction, dynamic, gene_GNN]; or as a metric that measures similarity between the query and support samples [matching, proto]. Nevertheless, existing meta-learning methods are far from optimal due to the lack of relational inductive bias modeling [relational], thereby failing to manipulate the structured representations of intra- and inter-task relations.

Motivated to capture more interactions among instances, another line of work has explored graph structures [GNN, EGNN, TPN] or second-order statistics such as covariance [covariance] in the meta-learning framework. More concretely, Garcia and Bruna [GNN] cast few-shot learning as a node classification problem with graph neural networks, where nodes are represented by the images in an episode and edges are given by trainable similarity kernels. In the same vein, Liu et al. [TPN] proposed a transductive propagation network (TPN) for label propagation, thereby enabling transductive inference over the entire query set. Alternatively, Kim et al. [EGNN] modeled few-shot learning as an edge-labeling problem, in order to directly predict whether the two associated nodes belong to the same class.

While promising, most existing graph-based meta-learning approaches suffer from two major limitations, i.e., catastrophic forgetting [lifelong] and insufficient robustness [BGNN], which make it difficult to transfer knowledge over long time spans or handle uncertainty in the graph structure. On the one hand, catastrophic forgetting means that as a meta-learner gradually encounters a sequence of learning problems, it tends to attenuate past knowledge while learning new tasks. On the other hand, current graph-based models incorrectly treat the pre-defined edge initialization as a reliable topology for message-passing, so inaccurate or uncertain relationships among query-support pairs can lead to severe error accumulation through multi-layer propagation.

To address the issues mentioned above, in this paper we propose a novel Continual Meta-Learning approach with Bayesian Graph Neural Networks (CML-BGNN) for few-shot classification, which is illustrated in Figure 1. To alleviate catastrophic forgetting, we jointly model the long-term inter-task correlations and short-term intra-class adjacency with the derived continual graph neural networks, which can retain and then access important prior information associated with newly encountered episodes. Specifically, the node update block aggregates the adjacent embeddings from each episode and feeds context-aware node representations to gated recurrent units, which refine node features with previous history. Such aggregations can be naturally chained and combined into multiple layers to enhance model expressiveness. Moreover, as uncertainty is rife in edge initialization, we provide a Bayesian approach for edge inference so that classification weights can be dynamically adjusted for discriminating specific tasks. The conceived amortization networks approximate the posterior distribution of classification weights with a Gaussian distribution defined by a mean and variance over possible values. Accordingly, task-specific parameters are sampled to mitigate the bias of node embeddings and further enhance the robustness of the graph neural networks. Overall, our contributions can be briefly summarized as follows:

  • We propose a novel Continual Meta-Learning framework that leverages both long-term inter-task and short-term intra-task correlations for few-shot learning. Different from existing graph-based meta-learning approaches, we introduce a memory-augmented graph neural network to enable flexible knowledge transfer across episodes.

  • To remedy uncertainty among query-support pairs, a Bayesian edge inference is derived by amortizing posterior inference of task-specific parameters.

  • We show the effectiveness of the proposed architecture through extensive experiments on the miniImageNet and tieredImageNet benchmarks, with a relative improvement over state-of-the-art counterparts. Regarding robustness, we further perform semi-supervised learning experiments to verify the efficiency and effectiveness of the proposed method.

Related Work

Meta-Learning

Meta-learning studies how to distill prior knowledge from past experience and enable fast adaptation to novel tasks with only a limited amount of samples. Much effort has been devoted in recent work, which can be broadly categorized into several groups: (1) Optimization-based methods either learn a good parameter initialization or leverage an optimizer as the meta-learner to adjust model weights. Typical examples include learning to approximate gradient descent with an LSTM [optimizationasmodel], learning model-agnostic initial parameters [maml] and its variants with probabilistic estimation [pmaml, bmaml, meta-SGD], first-order approximation [reptile], layer selection [layer-wise], learning the learner's update direction and learning rate [meta-SGD], and relation embedding [leo]; (2) Generation-based methods learn to augment few-shot data with a generative meta-learner [PMN], or learn to predict classification weights for novel classes [dynamic, param_prediction, gene_GNN]; (3) Metric-based approaches address the few-shot classification problem by learning a proper distance metric as the meta-learner, such as cosine similarity [matching], Euclidean distance to class prototypes [proto, softkmeans], ridge regression [regression], relation networks [relation], task attention [attention], category traversal [ctm], and graph modeling [GNN, TPN, covariance, EGNN]. Rather than purely relying on graph-based metric learning, our methodology exploits long-term information from previous tasks and jointly models the topological uncertainty.

Catastrophic Forgetting

Catastrophic forgetting [lifelong] has been a long-standing issue in the machine learning community due to the stability-plasticity dilemma [dilemma]. In recent literature, a number of methods have been proposed on the basis of continual learning, which can be roughly subdivided into the following groups. Regularization approaches [regularization] alleviate catastrophic forgetting by imposing constraints on the update of the neural weights in order to prevent "overwriting" what was previously encoded. Alternatively, in ensemble algorithms [dynamic], the architecture itself is altered to accommodate new tasks by retraining a pool of pre-trained models. In dual-memory algorithms [replay], one estimates the distribution of the old data either by saving a small fraction of the original dataset into a memory buffer or by training a generator to mimic the lost data and labels [replay1]. Being most related to the last group, our work, for the first time, tackles catastrophic forgetting in a meta-learning framework and validates its effectiveness and efficiency on practical tasks.

Figure 1: The general flowchart of the proposed continual meta-learning.

Background and Problem Definition

Given the training set $\mathcal{D}_{train}$, the goal is to learn a model that generalizes well to the unseen test set $\mathcal{D}_{test}$, whose label space is disjoint from that of $\mathcal{D}_{train}$. Meta-learning approaches commonly adopt an episodic training strategy to minimize the generalization error across a series of tasks that are randomly sampled from a task distribution. Specifically, an $N$-way, $K$-shot classification setting is used for both the training and testing stages, where $N$ indicates the number of unique classes in each episode and $K$ denotes the number of training samples per class. Each episode or task $\mathcal{T}_t$ is composed of a support set $\mathcal{S}_t$ and a query set $\mathcal{Q}_t$. To discover the commonalities and variability across tasks, we decouple the model into two sub-modules, i.e., the feature learner $f_\theta$ and the task-dependent classifiers $g_{\phi_t}$, where $\theta$ indicates the shared parameters that suit all tasks and $\phi_t$ indicates the task-specific parameters. To mathematically illustrate the proposed continual meta-learning procedure, we first recall the unified definition of existing meta-learning models: as discussed in the literature [bayes, pmaml], meta-learning can be viewed as approximate inference for the posterior, given the following definition.

Definition 1.

Meta-Learning. Given a task $\mathcal{T}_t = \{\mathcal{S}_t, \mathcal{Q}_t\}$ sampled from the task distribution $p(\mathcal{T})$, the posterior predictive distribution for query points $(x^*, y^*) \in \mathcal{Q}_t$ is calculated as

$$p(y^* \mid x^*, \mathcal{S}_t, \theta) = \int p(y^* \mid x^*, \phi_t, \theta)\, p(\phi_t \mid \mathcal{S}_t, \theta)\, d\phi_t \approx p(y^* \mid x^*, \hat{\phi}_t, \theta),$$

where $\hat{\phi}_t$ is the maximum a posteriori (MAP) value of $\phi_t$, which can be obtained via point estimates.

For instance, optimization-based meta-learning regards all model parameters as $\phi_t$ and forms a point estimate by taking several steps of gradient descent initialized at $\theta_0$ with learning rate $\alpha$, i.e.,

$$\hat{\phi}_t = \theta_0 - \alpha \nabla_{\theta} \mathcal{L}(\mathcal{S}_t; \theta_0). \tag{1}$$

In contrast, generation-based meta-learning focuses on estimating the classification weight vector $\phi_t$ for novel classes, given an initial value $\phi_0$ trained on the base classes and a learning rate $\alpha$, i.e.,

$$\hat{\phi}_t = \phi_0 + \alpha\, G(\mathcal{S}_t; \phi_0), \tag{2}$$

where $G(\cdot)$ is the learnable weight generator. Metric-based meta-learning takes the parameters of the top layer of the neural network as $\phi_t = \{\phi_c\}_{c=1}^{N}$ for all classes, obtained by averaging the top-layer activations for each class $c$, i.e.,

$$\hat{\phi}_c = \frac{1}{N_c} \sum_{(x_i, y_i) \in \mathcal{S}_t,\; y_i = c} f_{\theta}(x_i), \tag{3}$$

where $N_c$ denotes the number of samples belonging to class $c$, and $f_{\theta}$ indicates the embedding network.
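To make the metric-based instantiation in Eq. (3) concrete, the following minimal PyTorch-style sketch computes class prototypes from the support set of a single N-way K-shot episode; the tensor shapes and the toy feature dimensions are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def class_prototypes(support_feats, support_labels, n_way):
    """Eq. (3): average the embedded support samples of each class.

    support_feats:  (N*K, d) features from the shared backbone f_theta
    support_labels: (N*K,)   integer class indices in [0, n_way)
    returns:        (N, d)   one prototype (classifier weight) per class
    """
    d = support_feats.size(1)
    protos = torch.zeros(n_way, d, device=support_feats.device)
    for c in range(n_way):
        protos[c] = support_feats[support_labels == c].mean(dim=0)
    return protos

# toy usage: a 5-way 1-shot episode with 64-dim features
feats = torch.randn(5, 64)
labels = torch.arange(5)
phi_hat = class_prototypes(feats, labels, n_way=5)  # point estimate of phi_t
```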

However, current meta-learning suffers from catastrophic forgetting and lacks uncertainty estimation. Without fixing these problems, a single deep model is incapable of adapting itself to long-run learning, since it forgets old knowledge as it learns new tasks. Therefore, we generalize meta-learning to a continual setting and give the following definition.

Definition 2.

Continual Meta-Learning. Given a task $\mathcal{T}_t$ sampled from the task distribution $p(\mathcal{T})$, the posterior predictive distribution for query points $(x^*, y^*) \in \mathcal{Q}_t$ is

$$p(y^* \mid x^*, \mathcal{T}_{1:t}, \theta) \approx \iint p(y^* \mid x^*, \mathcal{H}_t, \phi_t, \theta)\; p(\mathcal{H}_t \mid \mathcal{T}_{1:t-1}, \theta)\; q(\phi_t \mid \mathcal{S}_t, \theta)\; d\mathcal{H}_t\, d\phi_t,$$

where $\mathcal{H}_t$ indicates the history knowledge that gives the transition of long-term memory.

The first term can be interpreted as follows: for a novel sample $x^*$, its label $y^*$ is conditioned not only on the current task $\mathcal{T}_t$ but also on the history information $\mathcal{H}_t$, which naturally reuses supervision without piling up complexity. The second term represents information updating and resetting from related tasks, which is jointly learned with the shared parameters $\theta$ in the continual graph neural networks elaborated in the next section. Lastly, to ensure a tractable likelihood, in the last term the distribution over classifier weights $\phi_t$ is approximated with a few steps of Bayes by Backprop.
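A minimal sketch of how the factorization in Definition 2 plays out at inference time, assuming hypothetical `history_gru`, `amortized_posterior`, and `predict` callables (the concrete modules are detailed in the next section): the history state is carried across tasks, and classifier weights are sampled per task.

```python
import torch

def continual_predict(tasks, history_gru, amortized_posterior, predict, hidden_dim=96):
    """Carry long-term history H_t across a sequence of tasks (Definition 2)."""
    h = torch.zeros(1, hidden_dim)                   # H_0: empty history
    predictions = []
    for support, query in tasks:                     # sequence of episodes T_1, ..., T_T
        h = history_gru(support, h)                  # p(H_t | T_{1:t-1}): update long-term memory
        mu, sigma = amortized_posterior(support, h)  # q(phi_t | S_t): amortized Gaussian posterior
        phi = mu + sigma * torch.randn_like(sigma)   # sample task-specific weights
        predictions.append(predict(query, h, phi))   # p(y* | x*, H_t, phi_t)
    return predictions
```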

Figure 2: An illustration of the node update and edge inference block at the $\ell$-th layer.

Proposed Approach

In this section, we detail two major components of the proposed method, the continual graph neural networks and Bayes by Backprop procedure for edge inference.

Continual Graph Neural Networks

As shown in Figure 1, the support (shown as circles) and query data (shown as triangles) in each episode form an undirected graph $\mathcal{G}_t = (\mathcal{V}_t, \mathcal{E}_t)$. Each vertex $v_i \in \mathcal{V}_t$ is associated with a node embedding vector $\mathbf{v}_i$ and each edge $e_{ij} \in \mathcal{E}_t$ represents the interaction between nodes. The adjacency matrix $A$ is defined as the semantic similarity between two connected nodes $v_i$ and $v_j$, and is updated dynamically. Considering the given labels of the support set, the adjacency matrix can be initialized as

$$A_{ij}^{(0)} = \begin{cases} 1, & \text{if } y_i = y_j \text{ and } x_i, x_j \in \mathcal{S}_t, \\ 0, & \text{if } y_i \neq y_j \text{ and } x_i, x_j \in \mathcal{S}_t, \\ 0.5, & \text{otherwise}, \end{cases} \tag{4}$$

where support-support pairs are initialized w.r.t. their labels and query-support pairs are softly assigned an uncertain value. Even though ambiguity is explicitly injected, we remedy the noise via the probabilistic edge inference presented in the next section. For efficient implementation, we split the meta-train and meta-test datasets into several sequences of episodes $\{\mathcal{T}_1, \ldots, \mathcal{T}_T\}$, each processed by the node updating (see Figure 2), history transition, and edge inference modules consecutively, where $T$ is the length of the sequence.
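A small sketch of the initialization in Eq. (4), assuming the common convention of a 0.5 soft value for any pair involving a query node and assuming support nodes are ordered first; both are illustrative assumptions rather than the authors' exact code.

```python
import torch

def init_adjacency(labels, num_support):
    """Eq. (4): label-based init for support-support pairs, 0.5 elsewhere.

    labels:      (V,) class index per node; query labels are unused here
    num_support: number of support nodes, assumed to come first in the ordering
    returns:     (V, V) initial adjacency matrix A^(0)
    """
    V = labels.size(0)
    A = torch.full((V, V), 0.5)                       # uncertain query-support / query-query pairs
    same = (labels[:num_support, None] == labels[None, :num_support]).float()
    A[:num_support, :num_support] = same              # 1 if same class, 0 otherwise
    return A

# toy usage: 5-way 1-shot episode with 5 support + 5 query nodes
labels = torch.cat([torch.arange(5), torch.randint(0, 5, (5,))])
A0 = init_adjacency(labels, num_support=5)
```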

Node Interactions Modeling

To make each node aggregate information from neighbors multiple hops away, the proposed graph model stacks $L$ aggregation blocks. Following the generic propagation rule, the node vectors of an arbitrary graph at the $\ell$-th layer can be updated as

$$\mathbf{v}_i^{(\ell)} = f_v^{(\ell)}\Big(\big[\textstyle\sum_{j \in \mathcal{N}(i)} A_{ij}^{(\ell-1)} \mathbf{v}_j^{(\ell-1)} \,;\, \mathbf{v}_i^{(\ell-1)}\big]\Big), \tag{5}$$

where $\mathcal{N}(i)$ indicates the neighbor set of node $v_i$, $[\cdot\,;\,\cdot]$ is the concatenation operation, and $f_v^{(\ell)}$ is a transformation block consisting of two convolutional layers, one LeakyReLU activation, and one dropout layer. The node embedding is initialized with the representation extracted from the backbone embedding model, i.e., $\mathbf{v}_i^{(0)} = f_{\text{emb}}(x_i)$. Here we leave out the layer mark $\ell$ in the next section for simplicity.
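A hedged PyTorch sketch of the node update in Eq. (5); the use of 1x1 convolutions over node features and the exact block composition are assumptions following the description above, not the authors' released code.

```python
import torch
import torch.nn as nn

class NodeUpdate(nn.Module):
    """Eq. (5): aggregate neighbor features weighted by A, concat with self, transform."""
    def __init__(self, in_dim, out_dim, dropout=0.3):
        super().__init__()
        # transformation block f_v: two conv layers + LeakyReLU + dropout (per the description)
        self.f_v = nn.Sequential(
            nn.Conv1d(2 * in_dim, out_dim, kernel_size=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(out_dim, out_dim, kernel_size=1),
            nn.Dropout(dropout),
        )

    def forward(self, v, A):
        # v: (B, V, d) node features, A: (B, V, V) adjacency
        agg = torch.bmm(A, v)                          # weighted sum over neighbors
        h = torch.cat([agg, v], dim=-1)                # concatenation [agg ; self]
        return self.f_v(h.transpose(1, 2)).transpose(1, 2)

# toy usage
v = torch.randn(2, 10, 64)          # batch of 2 episodes, 10 nodes, 64-dim features
A = torch.softmax(torch.randn(2, 10, 10), dim=-1)
v_next = NodeUpdate(64, 96)(v, A)   # (2, 10, 96)
```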

Task History Transition

After receiving messages from the current episode $\mathcal{T}_t$, each node embedding is further transformed with a gated updater. Different from the Gated Graph Neural Network [GGNN], which utilizes the gate updater to extend the depth of the graph neural network for the same batch of data, we feed different embeddings at each time step in order to capture long-term task correlations. Specifically, the hidden state for each node $v_i$ is transferred as follows:

$$\begin{aligned}
\mathbf{z}_i^{t} &= \sigma\big(W_z \mathbf{v}_i^{t} + U_z \mathbf{h}_i^{t-1} + b_z\big), \\
\mathbf{r}_i^{t} &= \sigma\big(W_r \mathbf{v}_i^{t} + U_r \mathbf{h}_i^{t-1} + b_r\big), \\
\mathbf{h}_i^{t} &= (1 - \mathbf{z}_i^{t}) \odot \mathbf{h}_i^{t-1} + \mathbf{z}_i^{t} \odot \tanh\big(W_h \mathbf{v}_i^{t} + U_h (\mathbf{r}_i^{t} \odot \mathbf{h}_i^{t-1}) + b_h\big),
\end{aligned} \tag{6}$$

where $\mathbf{h}_i^{t}$ is the updated feature of node $v_i$, $\sigma(\cdot)$ is the sigmoid function, $W_z, W_r, W_h, U_z, U_r, U_h$ are learnable weights, and $b_z, b_r, b_h$ are biases of the updating function. $\mathbf{z}_i^{t}$ and $\mathbf{r}_i^{t}$ are the update gate vector and reset gate vector, respectively. Here we denote all these parameters as $\omega$ for shorthand. The hidden state $\mathbf{h}_i^{0}$ is initialized as a zero vector.
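A minimal sketch of the history transition with a standard GRU cell standing in for Eq. (6); using `nn.GRUCell` is an assumption for brevity, since the paper spells out the same style of gating equations explicitly.

```python
import torch
import torch.nn as nn

class HistoryTransition(nn.Module):
    """Eq. (6): refine per-node features with a hidden state carried across episodes."""
    def __init__(self, feat_dim, hidden_dim=96):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim, hidden_dim)

    def forward(self, node_feats, hidden):
        # node_feats: (V, d) context-aware node features of the current episode
        # hidden:     (V, hidden_dim) hidden states carried over from previous episodes
        return self.gru(node_feats, hidden)

# toy usage: a sequence of 4 episodes, 10 nodes each, 96-dim features
trans = HistoryTransition(feat_dim=96, hidden_dim=96)
hidden = torch.zeros(10, 96)                  # h^0 initialized as zeros
for _ in range(4):
    node_feats = torch.randn(10, 96)          # output of the node update block
    hidden = trans(node_feats, hidden)        # long-term memory update
```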

Adjacency Feature Update

After the $T$-step hidden transition, we obtain a set of final node representations $\{\mathbf{h}_i^{(\ell)}\}$ at the $\ell$-th layer. In this way, the adjacency features at the $\ell$-th layer are calculated as

$$A_{ij}^{(\ell)} = D^{-\frac{1}{2}}\, f_e^{(\ell)}\big(\lVert \mathbf{h}_i^{(\ell)} - \mathbf{h}_j^{(\ell)} \rVert\big)\, D^{-\frac{1}{2}}, \tag{7}$$

where $D$ is the degree matrix of the adjacency matrix and $f_e^{(\ell)}$ is the non-linear transformation network parameterized by $\theta_e$, which includes four convolutional blocks, a batch normalization, a LeakyReLU activation, and a dropout layer.
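A hedged sketch of an EGNN-style edge update consistent with Eq. (7): pairwise node differences pass through a small conv network and the result is degree-normalized. The block composition here is simplified relative to the four conv blocks described above, and the use of absolute differences is an assumption.

```python
import torch
import torch.nn as nn

class EdgeUpdate(nn.Module):
    """Eq. (7): recompute adjacency features from pairwise node representations."""
    def __init__(self, feat_dim, hidden=96, dropout=0.3):
        super().__init__()
        # f_e: conv blocks over the pairwise feature map, ending in a single similarity channel
        self.f_e = nn.Sequential(
            nn.Conv2d(feat_dim, hidden, 1), nn.BatchNorm2d(hidden), nn.LeakyReLU(0.2),
            nn.Conv2d(hidden, hidden, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(hidden, 1, 1), nn.Dropout(dropout),
        )

    def forward(self, h):
        # h: (V, d) node representations after the history transition
        diff = (h[:, None, :] - h[None, :, :]).abs()          # (V, V, d) pairwise |h_i - h_j|
        A = self.f_e(diff.permute(2, 0, 1).unsqueeze(0))      # (1, 1, V, V) similarity map
        A = torch.sigmoid(A.squeeze(0).squeeze(0))            # scores in (0, 1)
        d_inv_sqrt = A.sum(dim=-1).clamp(min=1e-6).pow(-0.5)
        return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]  # D^{-1/2} A D^{-1/2}

# toy usage
h = torch.randn(10, 96)
A_next = EdgeUpdate(96)(h)   # (10, 10) updated adjacency
```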

Bayes by Backprop for Edge Inference

Through multiple graph aggregation layers, we aim to infer a full predictive distribution over the unknown query labels, relying on distributional Bayesian decision theory [pmaml, versa]. Notably, we amortize the posterior distribution of the classification weights as $q_\psi(\phi_t \mid \mathcal{S}_t)$ to enable quick prediction at the meta-test stage, and learn the parameters $\psi$ by minimizing the average expected loss over tasks, i.e., $\mathcal{L}(\psi) = \mathbb{E}_{\mathcal{T}_t \sim p(\mathcal{T})}[\mathcal{L}_t(\psi)]$, where

$$\mathcal{L}_t(\psi) = \mathrm{KL}\big(q_\psi(\phi_t \mid \mathcal{S}_t)\,\|\,p(\phi_t)\big) - \mathbb{E}_{q_\psi(\phi_t \mid \mathcal{S}_t)}\big[\log p(\mathcal{Q}_t \mid \phi_t, \theta)\big]. \tag{8}$$

To ensure a tractable likelihood, we use a factorized Gaussian distribution for $q_\psi$, with means and variances set by the amortization network:

$$q_\psi(\phi_t \mid \mathcal{S}_t) = \mathcal{N}\big(\phi_t;\, \mu_\psi(\mathcal{S}_t),\, \mathrm{diag}(\sigma_\psi^2(\mathcal{S}_t))\big). \tag{9}$$

With the generated posterior distribution, we can adaptively transform the predictive logits from the $\ell$-th layer based on the learned adjacency matrix $A^{(\ell)}$ and the sampled classifier weights $\phi_t \sim q_\psi(\phi_t \mid \mathcal{S}_t)$ for each specific task:

$$\hat{Y}^{(\ell)} = \big(M_q \odot \varsigma(\phi_t \odot A^{(\ell)})\big) \odot M_s, \tag{10}$$

where $\varsigma(\cdot)$ and $\odot$ indicate the sigmoid function and the Hadamard product, respectively. $M_q$ indicates the mask matrix that selects query predictions, whose entries are 1 when the row index belongs to the query set and 0 otherwise. Similarly, $M_s$ selects columns whose indices belong to the support set.
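A hedged sketch of the amortized edge inference in Eqs. (9)-(10): an amortization network maps pooled support features to a Gaussian mean and variance, a weight sample is drawn with the reparameterization trick, and query-to-support edge scores are read off the modulated adjacency. The network shapes and the exact way $\phi_t$ modulates $A$ are assumptions.

```python
import torch
import torch.nn as nn

class AmortizedEdgeInference(nn.Module):
    """Eqs. (9)-(10): sample task-specific weights and score query-support edges."""
    def __init__(self, feat_dim, num_nodes):
        super().__init__()
        # amortization network: pooled support features -> mean and log-variance of phi_t
        self.to_mu = nn.Linear(feat_dim, num_nodes)
        self.to_logvar = nn.Linear(feat_dim, num_nodes)

    def forward(self, support_feats, A, query_mask, support_mask):
        # support_feats: (S, d), A: (V, V), masks: (V,) boolean
        pooled = support_feats.mean(dim=0)                        # task representation
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        phi = mu + (0.5 * logvar).exp() * torch.randn_like(mu)    # reparameterized sample
        scores = torch.sigmoid(phi[None, :] * A)                  # phi_t modulates the adjacency
        # keep query rows and support columns only
        return scores[query_mask][:, support_mask], (mu, logvar)

# toy usage: 10 nodes (5 support + 5 query), 96-dim support features
model = AmortizedEdgeInference(feat_dim=96, num_nodes=10)
A = torch.rand(10, 10)
q_mask = torch.tensor([False] * 5 + [True] * 5)
s_mask = ~q_mask
edge_probs, (mu, logvar) = model(torch.randn(5, 96), A, q_mask, s_mask)  # (5, 5)
```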

During meta-training, the proposed model is optimized by minimizing the binary cross-entropy loss of the query edges together with the Bayes by Backprop loss, i.e.,

$$\mathcal{L} = \sum_{\ell=1}^{L} \mathcal{L}_{\mathrm{BCE}}\big(\hat{Y}^{(\ell)}, Y\big) + \lambda\, \mathcal{L}_{\mathrm{BBB}}, \tag{11}$$

where the shared parameters $\{\theta, \omega, \psi\}$ are jointly optimized, $\lambda$ is a loss coefficient, and the loss aggregates the predictions from all layers. The overall algorithm for meta-training is shown in Algorithm 1.

1:Inputs:
2:      Task distribution $p(\mathcal{T})$;
3:Outputs:
4:      Model weights $\{\theta, \omega, \psi\}$;
5:Initialize:
6:      Hyper-parameters; adjacency matrix $A^{(0)}$ as in Equation (4); hidden states $\mathbf{h}^{0}$ as zero vectors; visual features for all tasks; minibatch size and learning rate;
7:for M iterations do
8:     Sample a batch of tasks from $p(\mathcal{T})$;
9:     Perform message passing and update node representations via Equation (5) and Equation (6);
10:     Compute the adjacency matrix via Equation (7);
11:     Generate the task-specific parameters $\mu_\psi$, $\sigma_\psi$ and sample $\phi_t$ via Equation (9);
12:     Compute predictions of query samples via Equation (10);
13:     Update parameters by descending stochastic gradients according to Equation (11).
14:end for
Algorithm 1 Meta-training of the Proposed CML-BGNN.
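A compressed sketch of Algorithm 1 as a PyTorch training loop, assuming the modules sketched earlier in this section are bundled in a hypothetical `model` that returns per-layer edge predictions and the KL term, together with a hypothetical `task_sampler`; hyperparameter values are illustrative only.

```python
import torch
import torch.nn.functional as F

def meta_train(model, task_sampler, iterations=1000, lr=1e-3, lam=1.0):
    """Algorithm 1 (sketch): episodic meta-training with the combined loss of Eq. (11)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(iterations):
        support, query, edge_targets = task_sampler()       # one batch of episodes
        edge_preds, kl = model(support, query)               # per-layer edge predictions + KL term
        bce = sum(F.binary_cross_entropy(p, edge_targets) for p in edge_preds)
        loss = bce + lam * kl                                # Eq. (11)
        opt.zero_grad()
        loss.backward()
        opt.step()
```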

Experiments

Datasets

For fair comparisons with state-of-the-art baselines, we conduct extensive experiments on two benchmark few-shot classification datasets:

miniImageNet is a subset of the ILSVRC-12 dataset, where 600 images for each of 100 classes are randomly chosen to form the dataset. We follow the class split used by [optimizationasmodel], where 64 classes are used for training, 16 for validation, and 20 for testing. All input images have a size of 84×84.

tieredImageNet is a larger subset of ILSVRC-12, which contains 608 classes grouped into 34 higher-level categories sampled from the high-level nodes of the ImageNet hierarchy. The standard split includes 351 classes for training, 97 classes for validation, and 160 classes for testing. The average number of images in each class is 1,281.

Baselines

We compare our approach with the following baseline methods to justify its effectiveness:

Optimization-based: Meta-learner LSTM [optimizationasmodel], MAML [maml], REPTILE [reptile], Meta-SGD [meta-SGD], SNAIL [snail], LEO [leo].

Generation-based: PLATIPUS [pmaml], VERSA [versa], LwoF [dynamic], Param_Predict [param_prediction], wDAE [gene_GNN].

Metric-based: Matching Net [matching], Prototypical Net [proto], Relation Net [relation], TADAM [TADAM], CTM [ctm].

Graph-based: GNN [GNN], CovaMNet [covariance], TPN [TPN], EGNN [EGNN].

Implementation Details

Our source code is implemented based on PyTorch; it is attached in the supplementary material for reference. All experiments are conducted on a server with two GeForce GTX 1080 Ti and two RTX 2080 Ti GPUs.

Module Architecture.

Despite the generality of the backbone embedding module, we adopt the same architecture used in recent work [relation, EGNN]. Specifically, the network consists of four convolutional blocks, each including a 2D convolutional layer with a 3×3 kernel, a batch normalization, a max-pooling, and a LeakyReLU activation. Regarding the specification of the recurrent units, the dimension of the hidden states and all embedding sizes are fixed to 96. We fix the number of hidden states to 8.
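A hedged sketch of the Conv4 backbone described above (the standard four-block embedding network used in [relation, EGNN]); the 64-channel width and the final projection to 96 dimensions are assumptions consistent with the stated embedding size, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One of the four blocks: 3x3 conv, batch norm, max-pooling, LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.MaxPool2d(2),
        nn.LeakyReLU(0.2),
    )

class Conv4Backbone(nn.Module):
    """Four-block embedding network f_emb mapping 84x84 images to 96-dim features."""
    def __init__(self, emb_dim=96, width=64):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(3, width), conv_block(width, width),
            conv_block(width, width), conv_block(width, width),
        )
        self.proj = nn.Linear(width * 5 * 5, emb_dim)   # 84 -> 42 -> 21 -> 10 -> 5 after pooling

    def forward(self, x):
        # x: (B, 3, 84, 84) episode images
        return self.proj(self.blocks(x).flatten(1))     # (B, 96)

# toy usage
feats = Conv4Backbone()(torch.randn(4, 3, 84, 84))      # (4, 96)
```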

Parameter Settings.

The mini-batch size for all graph-based models is 80 and 64 for the 1-shot and 5-shot experiments, respectively. The proposed model is trained with the Adam optimizer with an initial learning rate of and a weight decay of . The dropout rate is set to 0.3 and the loss coefficient is set to 1. We report the final results of the proposed model trained for 70K and 160K iterations on miniImageNet and tieredImageNet, respectively.

Models Backbone 1-shot 5-shot
Optimization-based
Meta-learner LSTM Conv4
MAML Conv4
REPTILE Conv4
Meta-SGD Conv4
SNAIL ResNet-12
LEO WRN-28
Generation-based
PLATIPUS Conv4 -
VERSA Conv4
LwoF Conv4
Param_Predict WRN-28
wDAE WRN-28
Metric-based
Matching Net Conv4
Prototypical Net Conv4
Relation Net Conv4
TADAM ResNet-12
CTM Conv4
Graph-based
GNN Conv4
CovaMNet Conv4
TPN Conv4
EGNN Conv4 -
EGNN Conv4
Ours
CML-BGNN w/o C Conv4
CML-BGNN w/o B Conv4
CML-BGNN Conv4
Table 1: The 5-way 1-shot and 5-shot classification accuracies (%) on the test split of the miniImageNet dataset, with 95% confidence intervals. indicates our re-implementation. “w/o” indicates without.
Models Backbone 1-shot 5-shot
Optimization-based
MAML Conv4
REPTILE Conv4
Meta-SGD Conv4
LEO WRN-28
Generation-based
LwoF Conv4
wDAE WRN-28
Metric-based
Matching Net Conv4
Prototypical Net Conv4
Relation Net Conv4
CTM Conv4
Graph-based
GNN Conv4
TPN Conv4
EGNN-3 Conv4 -
EGNN-1 Conv4
EGNN-2 Conv4
EGNN-3 Conv4
Ours
CML-BGNN-1 Conv4
CML-BGNN-2 Conv4
CML-BGNN-3 Conv4
Table 2: The 5-way 1-shot and 5-shot classification accuracies (%) on the test split of the tieredImageNet dataset, with 95% confidence intervals. indicates our re-implementation.
Figure 3: The 5-way 1-shot and 5-shot accuracies of query classification on the validation split of miniImageNet (left) and tieredImageNet dataset (right).

Comparisons with State-of-The-Art

To verify the effectiveness of our proposed continual meta-learning model, we compare it with state-of-the-art meta-learning methods on the miniImageNet and tieredImageNet datasets. We report the best performance for every model in Table 1 and Table 2, along with the specifications of the backbone embedding models used for feature extraction.

Conv4 refers to a 4-layer convolutional network, ResNet-12 [resnet] denotes four residual blocks of depth 3 with 3×3 kernels and shortcut connections, and WRN-28 is a 28-layer wide residual network. Generally, a deeper embedding network leads to better classification performance, yet with a risk of overfitting. From Table 1, we observe that our CML-BGNN equipped with three graph layers surpasses all compared meta-learning methods by a large margin, especially in the challenging 1-shot scenario. More concretely, the proposed model with the basic Conv4 embedding structure gains notable relative improvements over the previous best optimization-based LEO [leo], generation-based wDAE [gene_GNN], metric-based CTM [ctm], and graph-based EGNN [EGNN] methods in the 5-way 1-shot miniImageNet experiment. This is mainly owing to the learned history transition, which reinforces the memory of rare samples and correlations between classes. Furthermore, we re-implemented the most powerful graph-based baseline, EGNN, with a mini-batch size of 80 for fairness and present a detailed comparison in boxplots. All parameters are randomly initialized in three trials with fixed seeds 111, 222, 333 for reproducibility. As depicted in Figure 3, the validation accuracy in both the 1-shot and 5-shot settings tends to increase as training iterations accumulate. The proposed method reaches its peak at an early stage and achieves much higher performance, yet shows sensitivity to seed selection in the 5-way 5-shot case. We infer this variance is mainly introduced by the edge inference sampling, which can be alleviated by averaging predictions over multiple samples. From Table 2, we observe that our re-implemented EGNN obtains better performance (as indicated in Table 2) by enlarging the batch size from 40 to 80. This phenomenon consistently verifies that task correlations are likely to contribute positively to few-shot learning.

Figure 4: The 5-way 1-shot accuracies (%) of query classification on the validation split of miniImageNet with different number of hidden states. Best viewed in color.

Ablation Study

Effect of Components.

The major ablation results regarding CML-BGNN with different components on the miniImageNet dataset are shown in the gray blocks of Table 1. All variants are trained with three graph layers and a mini-batch size of 80. Removing the history transition module, the variant CML-BGNN w/o C can only mine patterns from the local neighborhood without maintaining related prior information for reference, thus inevitably leading to inferior performance, e.g., a lower average 5-way 1-shot accuracy. CML-BGNN w/o B indicates the variant of our proposed model that directly utilizes the adjacency matrix to predict query labels without inferring task-specific parameters. Accordingly, the classification accuracy suffers a slight drop on both datasets in the 5-way 5-shot setting, which demonstrates the necessity of the full CML-BGNN formulation.

Methods Layer-1 Layer-2 Layer-3
1-shot GNN
EGNN
CML-BGNN
5-shot GNN
EGNN
CML-BGNN
Table 3: 5-Way 5-shot and 1-shot classification accuracies (%) on miniImageNet dataset with different depths of graph neural networks. indicates our re-implementation.

Effect of GNN’s Depth.

To investigate the impact of the GNN's depth, we test our model in both 5-way 1-shot and 5-shot settings with different depths of graph neural networks on both the miniImageNet (shown in Table 3) and tieredImageNet (shown in Table 2) datasets. Generally, a larger depth enables nodes to learn from a global perspective and thus enhances the expressive power of the graph neural network. For instance, the proposed CML-BGNN, EGNN, and GNN equipped with a 3-layer structure each improve 5-way 1-shot classification accuracy compared with their one-layer counterparts.

Effect of Number of Hidden States.

To study the impact of the number of hidden states in the history transition module, we compare nine variants of our model on both datasets and show validation curves in terms of node classification accuracy and edge binary classification results in Figure 4. The CML-BGNN-L-{1,2,3}-C-16 variants leverage unrolled gated recurrent units with 16 time steps to transfer history messages under different numbers of graph layers, and they significantly outperform the variants with fewer hidden states, CML-BGNN-L-{1,2,3}-C-{4,8}. This confirms that the augmented memory module effectively enhances node representation learning by bridging prior task learning, regardless of model depth.

Methods 5-way 5-shot
-labeled -labeled -labeled
GNN-LabeledOnly
GNN-Semi
EGNN-LabeledOnly
EGNN-Semi
CML-BGNN-LabeledOnly
CML-BGNN-Semi
Table 4: Semi-supervised few-shot classification accuracies (%) on miniImageNet with 95% confidence intervals.

Robustness Evaluation by Semi-supervised Learning

To quantitatively analyze the model's capacity for handling uncertainty, we conduct 5-way 5-shot semi-supervised experiments on the miniImageNet dataset and present the major results in Table 4. In this semi-supervised regime, the support data is only partially labeled, though balanced across all classes, which poses a greater challenge of modeling uncertain relationships between labeled and unlabeled samples. In particular, the -labeled column indicates that each episode contains 4 labeled support instances and 1 unlabeled instance. Here we use LabeledOnly to denote the strategy trained with only labeled support samples, and Semi to denote training with both labeled and unlabeled data. Compared with all graph-based counterparts, the proposed method outperforms by a large margin when 20% of the support samples are labeled. This superior performance results from our uncertainty modeling, which effectively adapts to the noise and misguidance from the adjacency initialization via task-specific parameters.

Conclusion

In this work, we propose a continual meta-learning model with Bayesian graph neural networks for the few-shot classification problem. Towards preserving more history messages associated with related tasks, the proposed CML-BGNN mines prior knowledge patterns by updating a memory-augmented graph neural network and handles topological uncertainty with Bayes by Backprop. In contrast to conventional graph-based meta-learning methods, it naturally alleviates the catastrophic forgetting and insufficient robustness issues, and thus enables efficient adaptation and generalization to novel tasks.

References