
Few-Shot and Continual Learning with Attentive Independent Mechanisms

Deep neural networks (DNNs) are known to perform well when deployed on test distributions that share high similarity with the training distribution. Feeding DNNs with new data sequentially that was unseen in the training distribution poses two major challenges – fast adaptation to new tasks and catastrophic forgetting of old tasks. Such difficulties paved the way for the ongoing research on few-shot learning and continual learning. To tackle these problems, we introduce Attentive Independent Mechanisms (AIM). We incorporate the idea of learning using fast and slow weights in conjunction with the decoupling of the feature extraction and higher-order conceptual learning of a DNN. AIM is designed for higher-order conceptual learning, modeled by a mixture of experts that compete to learn independent concepts to solve a new task. AIM is a modular component that can be inserted into existing deep learning frameworks. We demonstrate its capability for few-shot learning by adding it to SIB and training on MiniImageNet and CIFAR-FS, showing significant improvement. AIM is also applied to ANML and OML trained on Omniglot, CIFAR-100 and MiniImageNet to demonstrate its capability in continual learning. Code is made publicly available at https://github.com/huang50213/AIM-Fewshot-Continual.


1 Introduction

Humans have the ability to learn new concepts continually while retaining previously learned concepts [11]. While learning new concepts, prior concepts that were learned are leveraged to form new connections in the brain [4, 52]. The plasticity of the human brain plays an important role in the formation of novel neuronal connections for learning new concepts. Current deep learning methods are inefficient at remembering old concepts after being fed with new concepts, a phenomenon widely known as catastrophic forgetting [34, 23]. Deep neural networks (DNNs) trained in an end-to-end fashion also have difficulty learning new tasks in a sample-efficient manner [12]. It is conjectured that the cause of catastrophic forgetting and the inefficiency in learning new tasks is the stability-plasticity dilemma [2]. Stability is required so that previously learned information can be retained by limiting abrupt weight changes. Plasticity, on the other hand, encourages large weight changes, resulting in the fast acquisition of new concepts with the trade-off of forgetting old ones.

It is believed that by scaling up currently available architectures, DNNs are able to generalize better [7, 41, 10]. Tremendous effort has been placed into neural architecture search (NAS) [28, 54, 49, 39, 32] with the hypothesis that improvements on a structural level introduce inductive biases that improve the generalizability of a neural network. As most of the prior art is evaluated on benchmark datasets that are distributed similarly to the training set, the evaluation results are not a good measure of generalization. We argue that the ability to adapt, acquire new knowledge and recall previously learned information plays an important role in reaching true generalization. The importance of learning to learn, i.e. meta-learning, has shone the spotlight on two major research directions that we will focus on — few-shot learning and continual learning. In few-shot learning [12, 37, 45, 14], the goal is to learn novel concepts with as few samples as possible, i.e. evaluating the capability of adapting to new tasks. In continual learning, the ability to learn an increasing number of concepts while not forgetting old ones is evaluated.

Following OML [22], we separate the feature extraction part and the decision making part of the network, defined in OML as the representation learning network (RLN) and prediction learning network (PLN) respectively. The fast and slow learning in OML is performed at an architectural level, i.e. the RLN is updated in the outer loop (slow weights) and the PLN is updated in the inner loop (fast weights). Such an approach has proven to be helpful in learning sparse representations that are beneficial for fast adaptation and the prevention of catastrophic forgetting. We take one step further by introducing sparsity at an architectural level, accomplished through the introduction of Attentive Independent Mechanisms (AIM). AIM is composed of a set of mechanisms that competitively attend to the input representation, with mechanisms that are closely related to the input representation being activated during inference. AIM can be understood as a mixture of experts competing to explain an incoming representation; hence only the mechanisms that best explain the input representation will be updated, leading to sparse representation or modeling at an architectural level. Sparse modeling of higher-order representations at an architectural level has its benefits, as only the experts or mechanisms that best explain a task will be involved in the learning process, helping to accelerate the learning of new concepts and to mitigate catastrophic forgetting. To demonstrate the potential of AIM as a fundamental building block for fast learning without forgetting, we demonstrate its strength on few-shot classification [12, 43, 53] and continual learning [5, 22, 23] benchmarks.

Our contributions are as follows: (1) In Section 3, we give a detailed description and formulation of AIM — a novel module that can be used for few-shot and continual learning. (2) We apply AIM to few-shot learning and continual learning tasks in Section 4.1 and Section 4.2 respectively. Qualitative and quantitative results are shown for both learning tasks, giving readers an insight into the importance of having AIM in the context of few-shot and continual learning. For few-shot classification, experiments are performed on CIFAR-FS and MiniImageNet, whereas for continual learning, experiments are performed on Omniglot, CIFAR-100 and MiniImageNet. Substantial improvement in accuracy over the prior art is shown.

2 Related Work

Meta-learning revolves around the idea of learning to learn, hoping that through the observation of training iterations on a few tasks, we are able to generalize to unseen tasks with only a few or zero samples. Meta-learning is usually composed of a support set and a query set. The support set is used for fast adaptation and the query set is used to evaluate the adapted model and to meta-learn the adaptation procedure. Model-based meta-learning methods include the work by [35] that uses a meta-learner based on an LSTM [18] which attends over all previously seen samples, i.e. all support samples of a task are considered during the class prediction of query samples through an attentive mechanism. Another similar work by [44] augments an LSTM with an external memory bank. [36] incorporates fast and slow weights for few-shot classification.

Metric-based meta-learning methods include the Siamese Network proposed by [24], which predicts whether two images originate from the same class. [50] proposed Matching Networks, which use cosine distance in an attention kernel to measure the similarity of images in an embedding space. [45] later found that using Euclidean distance as a metric instead of cosine distance improves performance. A generalization of all the aforementioned works is achieved by modeling the metric using a graph neural network, as proposed by [13].

Optimization-based meta-learning includes [42], which proposed using an LSTM meta-learner that provides gradients to a convolutional network-based fast learner. [12, 37] proposed an inner- and outer-loop optimization method with fast adaptation in the inner loop and an outer-loop update that backpropagates through the inner-loop updates. [53] used the concept of inner- and outer-loop updates by having the context parameters (embeddings of tasks) updated in the inner loop. LEO [43] has its classifier weights generated by a low-dimensional latent embedding updated in the inner loop. [15] proposed a similar approach where classification weights are generated using feature vectors that correspond to the support set. SIB [20] performs transductive inference using synthetic gradients [21] on the feature-averaging variant of the classifier proposed by [15]. Transductive inference was first introduced to the context of few-shot classification by [33], with a graph constructed from the support set and the query set and labels propagated within the graph. As the architecture proposed by [33] is restrictive, [19] proposed a more general approach that uses a cross-attention module to model the semantic relevance between the support and query sets.

In continual learning, the objective is to mitigate catastrophic forgetting [23]. Earlier works are based on regularization methods, with [17] proposing the use of fast and slow training weights, borrowing the ideas of plasticity and stability for network training. This idea was then adopted by OML [22] to learn representations that are useful for future learning and help in mitigating catastrophic forgetting. Similarly, fast and slow learning is applied in ANML [5], with a neuromodulatory network modeled using slow weights. [1] uses task-specific gate modules and prediction heads to reduce the competitive effect between classes. A criterion is designed in [3] to store the most-interfered samples in a fixed-size rehearsal memory.

3 Method

Figure 1: AIM is inserted right after the feature extractor $f_{\theta}$ and before the output classifier $C$. Only mechanisms closely related to the input representation are active (green boxes) and updated during the training phase (blue dashed lines).

As Attentive Independent Mechanisms (AIM) is used to model higher-order information, we place it right after a feature extractor, defined as $r = f_{\theta}(x)$, where $f_{\theta}$ is a series of convolutional layers parameterized by $\theta$, $x$ is an input sample and $r$ is its corresponding representation. AIM is a module parameterized by $\phi$ and is defined as $a = g_{\phi}(r)$. The representation $a$ from AIM is then fed to a linear classifier $C$ for the task of classification. An illustration of AIM as a module is shown in Figure 1. We also show an illustration of the application of AIM to existing meta-learning frameworks used for few-shot learning and continual learning in Figure 2. We first describe the implementation of AIM as a module in Section 3.1, followed by its integration into SIB [20] for few-shot learning in Section 3.2 and into OML [22] and ANML [5] for continual learning in Section 3.3.

3.1 Attentive Independent Mechanisms

The goal of AIM is to learn a sparse set of mechanisms, i.e. a mixture of experts, to decouple the modeling of higher-order information from the feature extraction pipeline. These mechanisms compete and attend to the input representation in a top-down fashion using cross-attention [30, 29]. Through the strict selection of mechanisms, a sparse set of mechanisms will be selected for every task, inducing an architectural bias that helps in fast adaptation to new tasks and in mitigating catastrophic forgetting. The structure of AIM is composed of a set of independent mechanisms, each parameterized by its own set of parameters. Each mechanism acts as an independent expert that collaborates with other experts to solve a particular task. AIM can be viewed as a static version of RIMs [16], i.e. the temporal modeling of hidden states using an LSTM [18] found in RIMs is removed. For RIMs, the model is fed with a continuous stream of inputs, making dynamical modeling using an LSTM intuitive. For AIM, the assumption of a continuous stream of inputs does not hold, as the practice in few-shot classification and continual learning is to feed i.i.d. data into the model during training and inference. Departing from RIMs, the objective of AIM is to show that through a mixture of experts, new concepts can be learned easily with minimal catastrophic forgetting. We hypothesize that with a set of independent mechanisms, a sparse set of factorized representations or concepts can be extracted from the input representation. Such concepts have task-invariant properties that can be helpful in learning new tasks. The learning of concepts in AIM can also be understood as an amortized version of memory-based models that store samples either in the form of images or representations [44], which scale with the number of tasks in the system without limitation. AIM, on the other hand, performs implicit modeling of samples, analogous to amortized modeling using a DNN instead of a non-parametric method that stores samples from the training set for inference [8].

Following RIMs, AIM has a null vector $\varnothing$ that is concatenated with the input representation $r$, giving us $X = [r; \varnothing]$. The $M$ mechanisms then attend to the incoming latent representation as:

$a = \sum_{m \in \mathcal{S}} \omega_m W_m r$ (1)

which could be understood as the passing of the input representation through a weighted summation of the mechanism weights $W_m$. The summation of the outputs of the mechanisms makes the extension to an arbitrary number of mechanisms trivial, compared to the concatenation of outputs used in RIMs. Concatenation is also infeasible when the output dimension of $W_m$ is large, resulting in a wide input dimension for the upcoming layer. The summation of mechanisms also has the property of permutation invariance, reducing the complexity of the output classifier $C$.

To encourage sparsity, we enforce the mechanisms to compete with each other to attend to the incoming representation. This is done by having only the weights of mechanisms that are closely related to the input representation be selected, i.e. only the top $k$ mechanisms out of a total of $M$ mechanisms are selected for the downstream prediction task. The strict selection of mechanisms forces the mechanisms to compete with each other to attend to the incoming signal, modeling the biased competition theory of selective attention [9]. The selection of mechanisms is given as:

$\mathcal{S} = \operatorname{topk}\left( \{\omega_m\}_{m=1}^{M},\, k \right)$ (2)

The indices corresponding to the top $k$ values in a set are returned by the $\operatorname{topk}$ operator. The weights $\omega$ used to weight the importance of the selected mechanisms are composed of the softmax of the normalized inner product between the mechanisms' hidden states $h$ and the input representation $X$, which are first mapped to lower-dimensional embeddings by the query weight $W^q$ and key weight $W^e$ of output dimension $d_e$ respectively, given as:

$\omega = \operatorname{softmax}\left( \dfrac{(h W^q)(X W^e)^{\top}}{\sqrt{d_e}} \right)$ (3)

Note that the softmax is applied locally for each mechanism, i.e. it transforms the attention values corresponding to $r$ and $\varnothing$ into a probability distribution. The value that corresponds to the input (not null) dimension from (3) is used for the top-$k$ comparison in (2).

Intervention during training.

The training of AIM can be understood as an intervention procedure, with the model selecting a few mechanisms to be included during the forward pass of training. Mechanisms that perform well on the training data are rewarded by having gradient updates directed to the activated mechanisms, with the sensitivity to novel inputs reflected in the attention weights $\omega$. As one might expect, there is a possibility of mechanism-overfitting, where only a fixed set of mechanisms is active for all training tasks, defeating the original motivation of having a sparse set of mechanisms acting as experts on different tasks. Mechanism-overfitting is also equivalent to having a DNN with multiple residual paths, resembling a single Inception layer [48], diverging from our original goal of building models that are invariant across tasks.

To prevent collapse towards having only a few active mechanisms for all tasks, the trick is to enable the exploration of a different number of mechanisms during training, instead of locking down to the top $k$ mechanisms. Stochasticity is introduced into the selection process by sampling the top $k'$ mechanisms (where $k' \geq k$ is the stochastic sampling count) instead of the top $k$. We then perform uniform sampling without replacement of $k$ mechanisms from the top $k'$ mechanisms, where the original selection condition of (2) can now be written as:

$\mathcal{S} \sim \mathcal{U}\left( \operatorname{topk}\left( \{\omega_m\}_{m=1}^{M},\, k' \right) \right), \quad |\mathcal{S}| = k$ (4)

Here, $|\cdot|$ is the cardinality operator, ensuring that the sampled subset $\mathcal{S}$ is of size $k$, and $\mathcal{S}$ is sampled without replacement. Such intervention is analogous to stochastic intervention [25] and dropout [46], which add stochasticity to the training of AIM, preventing it from locking down to the few mechanisms that are attended to upon initialization.
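To make the procedure in (1)–(4) concrete, below is a minimal PyTorch sketch of a single AIM layer. The module layout, initialization and names (AIMSketch, sample_k standing in for $k'$, a zero row standing in for the null vector) are illustrative assumptions, not the released implementation at the repository above.

```python
import torch
import torch.nn as nn

class AIMSketch(nn.Module):
    """Single AIM layer: M competing mechanisms with top-k selection, Eq. (1)-(4)."""
    def __init__(self, in_dim, out_dim, n_mechanisms=32, hidden_dim=256,
                 key_dim=128, top_k=8, sample_k=12):
        super().__init__()
        self.top_k, self.sample_k = top_k, sample_k
        # Learned hidden state h_m for each mechanism (one per expert).
        self.h = nn.Parameter(torch.randn(n_mechanisms, hidden_dim))
        self.W_q = nn.Linear(hidden_dim, key_dim, bias=False)   # query weight W^q
        self.W_e = nn.Linear(in_dim, key_dim, bias=False)       # key weight W^e
        # Independent mechanism weights W_m, one linear map per mechanism.
        self.W_m = nn.Parameter(0.01 * torch.randn(n_mechanisms, in_dim, out_dim))

    def forward(self, r):
        # X = [r; null]: stack the input row and a zero "null" row, so each
        # mechanism distributes its attention between input and null, Eq. (3).
        X = torch.stack([r, torch.zeros_like(r)], dim=1)        # (B, 2, in_dim)
        q = self.W_q(self.h)                                    # (M, key_dim)
        e = self.W_e(X)                                         # (B, 2, key_dim)
        att = torch.einsum('md,bnd->bmn', q, e) / e.shape[-1] ** 0.5
        att = att.softmax(dim=-1)                               # local softmax per mechanism
        omega = att[..., 0]                                     # input (not null) column, (B, M)
        # Competitive selection: top-k' candidates, then k sampled uniformly
        # without replacement at training time, Eq. (4); plain top-k at test time, Eq. (2).
        n_cand = self.sample_k if self.training else self.top_k
        cand = omega.topk(n_cand, dim=-1).indices               # (B, n_cand)
        if self.training:
            pick = torch.multinomial(torch.ones_like(cand, dtype=torch.float),
                                     self.top_k)                # uniform, no replacement
            cand = cand.gather(-1, pick)
        mask = torch.zeros_like(omega).scatter_(-1, cand, 1.0)  # 1 = active mechanism
        # Weighted summation of the selected mechanisms' outputs, Eq. (1); only
        # active mechanisms contribute, so only they receive gradient updates.
        out = torch.einsum('bi,mio->bmo', r, self.W_m)          # (B, M, out_dim)
        return ((omega * mask).unsqueeze(-1) * out).sum(dim=1)  # (B, out_dim)
```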

Training and evaluation of AIM.

Weight updates in AIM are similar to those of a typical layer in DNNs, i.e. gradients are backpropagated from the final loss function. A distinct difference from a conventional module in DNNs is that only the mechanisms activated during a forward pass are updated, resulting in a sparse set of weight updates. As AIM is designed to model higher-order concepts, it is placed in the higher levels of a DNN and has fast weights that are updated in the inner loop of a meta-learning pipeline. The role of AIM as a module is shown in Figure 1. The procedure for the meta-training of AIM for both few-shot learning and continual learning is shown in Algorithm 1, whereas the meta-testing counterpart is shown in Algorithm 2. The algorithms shown are applicable to both few-shot and continual learning, with the distinction between the two highlighted in different colors — few-shot learning using SIB in green and continual learning using OML and ANML in blue. Step sizes for the inner loop and outer loop are defined as $\alpha$ and $\beta$ respectively. The step size for the synthetic gradient update used in SIB is defined as $\epsilon$. For few-shot learning, the fast adaptation of AIM is evaluated using the meta-testing test set of the sampled task in the outer loop. For continual learning, evaluation is performed after the completion of meta-training and is tested on the entire meta-test train set and meta-test test set.

1: Require: sequential tasks $\{\mathcal{T}_i\}$; step sizes $\alpha$, $\beta$ ($\epsilon$ for SIB only); inner iterations $K$; modules $f_{\theta}$, $g_{\phi}$, $C$ (weight generator $\lambda$ and synthetic gradient model $\xi$ for SIB only)
2: while not done do
3:     Sample task $\mathcal{T}_i$ (SIB: i.i.d.; continual: sequential)
4:     for $t = 1, \dots, K$ do
5:         Update fast weights using step size $\alpha$:
6:             SIB: $g_{\phi}$, $C$; OML: $g_{\phi}$, PLN; ANML: $g_{\phi}$, prediction network
7:     end for
8:     Update classification weights using transductive inference step size $\epsilon$ (SIB only)
9:     Update slow weights using step size $\beta$:
10:        SIB: $\lambda$, $\xi$; OML: RLN $f_{\theta}$; ANML: neuromodulatory network
11: end while
Algorithm 1 Meta-Training: Training of AIM
1: Require: sequential unseen tasks $\{\mathcal{T}_j\}$; step size $\alpha$ ($\epsilon$ for SIB only); inner iterations $K$; modules $\lambda$, $\xi$ (SIB only)
2: Initialize an empty trajectory set
3: for each unseen task $\mathcal{T}_j$ do
4:     Sample task data (SIB: i.i.d.; continual: sequential)
5:     Store trajectory
6:     for $t = 1, \dots, K$ do
7:         Update fast weights using step size $\alpha$:
8:             SIB: $g_{\phi}$, $C$; OML: $g_{\phi}$, PLN; ANML: $g_{\phi}$, prediction network
9:     end for
10:    Update classification weights using transductive inference step size $\epsilon$ (SIB only)
11:    Evaluate on the test set of $\mathcal{T}_j$ (few-shot)
12: end for
13: Evaluate on the entire meta-test training trajectory (continual)
14: Evaluate on the entire meta-test testing set (continual)
Algorithm 2 Meta-Testing: Evaluation of AIM
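As a code-level companion to Algorithms 1 and 2, the following is a schematic, first-order sketch of one meta-training step with fast and slow weights. The helper signatures (loss_fn, a task given as support/query pairs, outer_opt constructed over the slow parameters only) are assumptions for illustration; the framework-specific parts (synthetic gradients for SIB, neuromodulation for ANML) are omitted.

```python
import torch

def meta_train_step(f_slow, g_fast, loss_fn, task, alpha, outer_opt, inner_steps=5):
    """One fast/slow meta-training step (schematic, first-order)."""
    (x_s, y_s), (x_q, y_q) = task                     # support / query split
    # Inner loop (lines 4-7 of Algorithm 1): adapt the fast weights (AIM and
    # the classifier) on the support set with plain SGD at step size alpha.
    for _ in range(inner_steps):
        loss = loss_fn(g_fast(f_slow(x_s)), y_s)
        grads = torch.autograd.grad(loss, tuple(g_fast.parameters()))
        with torch.no_grad():
            for p, g in zip(g_fast.parameters(), grads):
                p -= alpha * g                        # fast-weight update
    # Outer loop (lines 9-10): evaluate the adapted model on the query set and
    # update the slow weights (e.g. the feature extractor) via outer_opt, an
    # optimizer built over the slow parameters only, with step size beta.
    outer_opt.zero_grad()
    loss_fn(g_fast(f_slow(x_q)), y_q).backward()
    outer_opt.step()
    # In the full algorithm the fast weights are re-initialized or copied per
    # task; that bookkeeping is omitted from this sketch.
```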
(a) Synthetic Information Bottleneck (SIB) [20]
(b) Online aware Meta-Learning (OML) [22]
(c) A Neuromodulated Meta-Learning Algorithm (ANML) [5]
Figure 2: Applying AIM to both few-shot learning ((a) SIB) and continual learning ((b) OML and (c) ANML) frameworks. For all frameworks, AIM (yellow) is placed directly after the feature extractor $f_{\theta}$. With the different learning schemes (fast and slow) used in meta-learning, weights or modules that correspond to fast updates are highlighted in red, slow updates in blue and frozen weights in green.

3.2 Few-Shot Learning Using SIB

SIB is composed of two works: synthetic gradient modeling [21] and a feature-averaging classifier [15]. In [21], the idea is to use a synthetic gradient model, $\xi$, that is meta-learned to generate gradients when labeled data is absent for transductive inference, i.e. the update of weights without gradients propagated from a label-dependent loss. In [15], a classifier is defined as the cosine similarity between feature representations $r$ and classification weight vectors $\mathbf{w}$. $\mathbf{w}$ is generated using an external classification weight generator parameterized by $\lambda$, followed by iterative updates by the synthetic gradient model $\xi$. Feature vectors of training samples of a novel category are fed as input to generate a new set of classification weights. In SIB, feature-averaging based weight inference is used, i.e. the classification weight vector of class $c$ is obtained as $\mathbf{w}_c = \lambda \odot \bar{r}_c$, where $\odot$ is the Hadamard product and $\bar{r}_c$ is the average of $\hat{r}$ over the support samples of class $c$ ($\hat{r}$ is the $\ell_2$-normalized version of $r$). The classification weight vector is then updated iteratively using the synthetic gradient model $\xi$. Both the synthetic gradient model $\xi$ and the weights of the weight generator $\lambda$ are meta-learned, i.e. updated in the outer loop. To encourage sparse modeling of higher-order concepts in the network, AIM is inserted right after the feature extractor and before the output linear classifier whose weights are generated using $\lambda$ and refined by $\xi$, or,

$\hat{y} = C\big(g_{\phi}(f_{\theta}(x));\, \mathbf{w}\big)$ (5)
Training.

Following the training pipeline in SIB [20], the weights of the feature extractor $f_{\theta}$ are frozen to simplify the training procedure. The weights of AIM, $\phi$, and the output linear classifier are updated as fast weights, i.e. in the inner loop. Only the weights of the classification weight generator $\lambda$ are updated as slow weights, i.e. in the outer loop. The application of AIM to SIB is shown in Figure 2(a).
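For reference, a hedged sketch of the feature-averaging classifier and the label-free synthetic-gradient refinement described above is given below; the names lam (the generator output), xi (the synthetic gradient model) and tau (the cosine scale) are our illustrative assumptions, not SIB's released code.

```python
import torch
import torch.nn.functional as F

def init_class_weights(support_feats, support_labels, n_way, lam):
    # Feature averaging: w_c = lam ⊙ (mean of l2-normalized support features).
    feats = F.normalize(support_feats, dim=-1)
    w = torch.stack([feats[support_labels == c].mean(dim=0) for c in range(n_way)])
    return lam * w                                     # (n_way, feat_dim)

def cosine_logits(feats, w, tau=10.0):
    # Cosine-similarity classifier between normalized features and weights.
    return tau * F.normalize(feats, dim=-1) @ F.normalize(w, dim=-1).t()

def refine_with_synthetic_grad(w, query_feats, xi, eps=1e-3, steps=3):
    # Transductive refinement: xi maps unlabeled query logits to a scalar
    # surrogate whose gradient stands in for the true (label-dependent) one,
    # so w can be updated without query labels.
    w = w.clone().requires_grad_(True)
    for _ in range(steps):
        surrogate = xi(cosine_logits(query_feats, w)).sum()
        grad, = torch.autograd.grad(surrogate, w)
        w = (w - eps * grad).detach().requires_grad_(True)
    return w
```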

3.3 Continual Learning: Learning Fast and Slow

It has been shown in the task of continual learning that learning fast and slow, in the meta-learning sense, is helpful for the mitigation of catastrophic forgetting [22, 5]. OML [22] and ANML [5] are example frameworks for continual learning that use this methodology, showing promising results. To validate our claim on the importance of incorporating sparse modeling at an architectural level for the mitigation of catastrophic forgetting, we insert AIM into both OML and ANML and observe the resulting performance.

OML.

The entire architecture is split into two parts — a representation learning network (RLN) and a prediction learning network (PLN). The RLN uses slow weights and the PLN uses fast weights. Following our notation, the RLN is the feature extractor $f_{\theta}$ in our work, and the PLN is the classifier $C$ (not limited to a single layer). AIM is inserted after the RLN and before the PLN, or,

$\hat{y} = C\big(g_{\phi}(f_{\theta}(x))\big)$ (6)

AIM is trained jointly with the PLN, i.e. both have fast weights. The application of AIM to OML is shown in Figure 2(b).

ANML.

Two sets of feature extractors are used in ANML — a neuromodulatory network, $f_{\text{NM}}$, and a prediction network. The role of the neuromodulatory network is to modulate the latent representation of the prediction network, i.e. the output of the prediction network's feature extractor $f_{\theta}$ in Figure 2(c). The output of the neuromodulatory network is element-wise multiplied with the output of $f_{\theta}$ before being passed to the final classifier, or $C\big(f_{\text{NM}}(x) \otimes f_{\theta}(x)\big)$. Only the neuromodulatory network has slow weights and the entire prediction network has fast weights. Similar to SIB and OML, AIM is inserted right after the feature extractor and uses fast weights,

$\hat{y} = C\big(g_{\phi}(f_{\text{NM}}(x) \otimes f_{\theta}(x))\big)$ (7)
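A small sketch of the composition in (7), with module names assumed for clarity:

```python
import torch.nn as nn

class ANMLWithAIM(nn.Module):
    def __init__(self, f_nm, f_pred, aim, classifier):
        super().__init__()
        self.f_nm = f_nm          # neuromodulatory net: slow weights
        self.f_pred = f_pred      # prediction net feature extractor: fast weights
        self.aim = aim            # AIM module: fast weights
        self.classifier = classifier

    def forward(self, x):
        gated = self.f_nm(x) * self.f_pred(x)   # element-wise modulation
        return self.classifier(self.aim(gated))
```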

4 Experiments

4.1 Few-Shot Learning

Datasets.

For all datasets, class splits are disjoint. MiniImageNet [50] contains a total of 100 classes, which are split into 64 training, 16 validation and 20 testing classes; images are of size 84×84. CIFAR-FS [6] is created by dividing CIFAR-100 into 64 training, 16 validation and 20 testing classes; images are of size 32×32. For few-shot classification, each task (episode) consists of a train set and a test set. For each task, $N$ classes are sampled from the class pool mentioned ($N$-way). For each class, $K$ examples are drawn and relabeled as disjoint classes, forming the train set ($K$-shot). For the test set, 15 samples per class are used. We show results for $N = 5$ with both $K = 1$ and $K = 5$.
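As an illustration of this episode protocol, the following sketch samples an $N$-way, $K$-shot task with 15 query samples per class; the class_pool layout (a dict mapping class id to an image tensor) is an assumption.

```python
import random
import torch

def sample_episode(class_pool, n_way=5, k_shot=1, n_query=15):
    """Sample one N-way, K-shot episode; class_pool: {class_id: (num, C, H, W) tensor}."""
    classes = random.sample(sorted(class_pool), n_way)
    xs, ys, xq, yq = [], [], [], []
    for new_label, c in enumerate(classes):            # relabel as disjoint 0..N-1
        idx = torch.randperm(len(class_pool[c]))[:k_shot + n_query]
        imgs = class_pool[c][idx]
        xs.append(imgs[:k_shot]); ys += [new_label] * k_shot
        xq.append(imgs[k_shot:]); yq += [new_label] * n_query
    return (torch.cat(xs), torch.tensor(ys)), (torch.cat(xq), torch.tensor(yq))
```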

Network architecture.

We follow the setup in [20, 15, 40, 14] by using a 4-layer convolutional network with 64 feature channels (Conv-4-64) or a WideResNet (WRN-28-10) [51] as our feature extractor $f_{\theta}$. $f_{\theta}$ is pretrained in a typical end-to-end supervised learning fashion, i.e. the entire training set is used for batch updates. Our classifier is adopted directly from [20, 15]. For transductive inference [20], the synthetic gradient network $\xi$ is modeled by an MLP of 3 layers with hidden size 8. Classification is done using the cosine-similarity based classifier found in [20, 15]. For AIM, all weights are linear layers. The hidden states of the mechanisms are of dimension 256. The key and query weights ($W^e$ and $W^q$) map the input and hidden state to a dimension of 128 to perform the distance measurement. For the output dimension of the mechanism weights $W_m$, we picked 400 for CIFAR-FS trained on Conv-4-64 and 800 for the rest; this decision is based on the dimension of the flattened feature map at the output of the feature extractor (not cherry-picked).

Training details.

We use $M = 32$ mechanisms, with the top $k = 8$ mechanisms selected during inference and stochasticity induced by sampling from the top $k' > k$ mechanisms during training. SGD is used with a batch size of 1 for 50,000 steps, with learning rate $\epsilon$ for SIB's synthetic classifier update, $\beta$ for the outer-loop update and $\alpha$ for the inner-loop update. The feature extractor is frozen during training. 1,000 tasks are sampled from the validation set for hyperparameter selection at each training epoch. All experiments are run on a single GTX 1080 Ti using PyTorch. A complete run of Conv-4-64 on CIFAR-FS and WRN-28-10 on MiniImageNet takes less than 2 hours and 5 hours respectively.

(a) SIB + AIM using Conv-4-64 on CIFAR-FS, 1-shot.
(b) OML + AIM on CIFAR-100.
Figure 3: With 1 indicating an active mechanism and 0 an inhibited mechanism, and with the top $k$ mechanisms selected for every inference, the average activation for each class across the entire validation set is shown. The active mechanisms can be categorized into two sets: 1. a fixed set of shared active mechanisms; 2. a sparse set of mechanisms with class-dependent activations.
Method Backbone Transductive MiniImageNet, 5-Way CIFAR-FS, 5-Way
1-shot 5-shot 1-shot 5-shot
Matching Net [50] Conv-4-64 44.2% 57% - -
MAML [12] Conv-4-64
Prototypical Net [45] Conv-4-64
Relation Net [47] Conv-4-64
TPN [33] Conv-4-64 55.5% 69.9% - -
Gidaris [14] Conv-4-64
SIB [20] Conv-4-64
SIB + Linear layer Conv-4-64
AIM (ours) Conv-4-64
TADAM [38] ResNet-12 - -
SNAIL [44] ResNet-12 - -
CTM [31] ResNet-18 - -
LEO [43] WRN-28-10 - -
Gidaris [14] WRN-28-10
SIB [20] WRN-28-10
SIB + Linear layer WRN-28-10
AIM (ours) WRN-28-10
Table 1:

Average classification accuracies with 95% confidence intervals on the test-set of MiniImageNet and CIFAR-FS. 2000 episodes are sampled for MiniImageNet and CIFAR-FS using Conv-4-64 and WRN-28-10 as the feature extractor.

4.1.1 Qualitative Study: Activation of AIM

We show heatmaps that illustrate the mechanisms activated for different classes from the validation set in Figure 3. The heatmap is plotted by averaging the mechanisms' activity for each class over the entire validation set, with 1 and 0 indicating an active and an inhibited mechanism respectively. We can observe that there is a set of mechanisms that is shared among tasks and another set that is distributed sparsely. The sharing of mechanisms can be understood as different classes sharing similar concepts. The sparse allocation of mechanisms over different classes shows that there are features that are invariant for certain classes only, improving resiliency to covariate shift among distributions.
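The per-class heatmap in Figure 3 can be reproduced along the following lines, assuming the binary top-$k$ masks of every validation sample have been collected; this is a sketch, not the plotting code used for the figure.

```python
import torch

def class_activation_heatmap(masks, labels, n_classes):
    # masks: (num_samples, num_mechanisms) binary top-k activations;
    # labels: (num_samples,) class ids. Returns (n_classes, num_mechanisms)
    # with the mean activation of each mechanism per class, as in Figure 3.
    heat = torch.zeros(n_classes, masks.shape[1])
    for c in range(n_classes):
        heat[c] = masks[labels == c].float().mean(dim=0)
    return heat
```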

(a)
(b)
Figure 4: Accuracy obtained by varying (a) the stochastic sampling count ($k$ is fixed and $k'$ is manipulated) and (b) the active mechanism count ($k' = k$ and $k$ is manipulated). The zero-meaned accuracy is shown to better demonstrate the change in accuracy across different model-dataset pairs. $|\cdot|$ is the cardinality operator.
(a)
(b)
(c)
Figure 5:

Evaluation of continual learning methods using dataset of various scales. Meta-test testing (training) trajectories are shown in solid (dashed) lines. All curves are averaged over 10 runs with standard deviation shown.

4.1.2 Quantitative Study

Stochastic sampling count.

To show the importance of inducing stochasticity in the mechanism selection process, we perform an empirical study by varying the stochastic sampling count $k'$. We fix $k = 8$ and vary $k' - k$ from 0 to 24. As we can see from Figure 4(a), the accuracy obtained by varying $k'$ has a different maximum for different datasets, models and numbers of shots. For most cases, the peak accuracy occurs at a small value of $k' - k$ and slowly deteriorates as more stochasticity is introduced.

Number of active mechanisms.

An interesting question is how many active mechanisms are required to reap the benefits of sparse activations. An empirical study is performed as shown in Figure 4(b), showing the accuracy obtained by varying the number of active mechanisms $k$ from 1 to 32. The results show that accuracy is low when $k$ is small and saturates for larger values of $k$. This shows that a limited set of active mechanisms is sufficient. Sparsity in representation can still be met when the number of active mechanisms is large, but it is cost-inefficient during both training and inference.

Benchmark evaluation.

As AIM is introduced as an additional component integrated into SIB [20], the gain in accuracy shows the importance of having a mixture of experts for fast adaptation. We also show results for SIB with a linear layer (with parameters equal to the total parameters found in the AIM module) added before the classifier (SIB + Linear), to show that the gain in accuracy from AIM is not solely from the increase in parameters. From Table 1, we can see that AIM outperforms existing few-shot classification methods by a noticeable margin. As only a single layer of AIM is explored, the coupling between AIM layers as found in RIMs [16] is not considered here. We believe that further improvements can be attained if layers of AIM are stacked, with the coupling between them considered.

4.2 Continual Learning

Datasets.

Omniglot [27] has 1,623 characters from 50 different alphabets, where each character has 20 hand-written images. The dataset is split into 963 classes for meta-training and 660 classes for meta-testing. In each trajectory, 15 images are used for training and 5 images for testing, in both meta-training and meta-testing. CIFAR-100 [26] is composed of 60,000 images of size 32×32 distributed uniformly over 100 classes, i.e. 500 train images and 100 test images per class. 70 classes are used for meta-training and 30 classes for meta-testing. MiniImageNet [50] has 64 training classes and 20 testing classes with images of size 84×84. Each class has 600 images, with 540 for training and 60 for testing. In each trajectory of CIFAR-100 and MiniImageNet, we sample 30 train images per class for training and use all test images for testing, in both meta-training and meta-testing.

Network architecture.

We adopt the models from OML [22] and ANML [5] with slight modifications for our experiments. For OML, the feature extractor $f_{\theta}$ is a 6-layer convolutional network with 112 channels and the classifier $C$ is a single linear layer, with AIM in between $f_{\theta}$ and $C$. For ANML, both the neuromodulatory network $f_{\text{NM}}$ and the prediction network are 3-layer convolutional networks and $C$ is a single linear layer, with AIM placed after $f_{\text{NM}}$ and $f_{\theta}$. $f_{\text{NM}}$ has 112 channels while the prediction network has 256 channels. For CIFAR-100 and MiniImageNet, an additional linear layer is placed before AIM for dimension reduction. The hidden-state dimension of the mechanisms and the embedding dimensions that $W^q$ and $W^e$ map their corresponding inputs to follow the few-shot setup.

Training details.

We use $M$ mechanisms in our system, with the top $k$ mechanisms selected during inference and stochasticity induced by sampling from the top $k' > k$ mechanisms during training. We follow the first-order MAML strategy in [22, 5]. We use a batch size of 1 for 20,000 steps, with step size $\beta$ for the outer loop (slow weights) and $\alpha$ for the inner loop (fast weights). A complete meta-training of AIM using OML or ANML on Omniglot, CIFAR-100 and MiniImageNet takes less than 2 hours, 3 hours and 6 hours respectively.

4.2.1 Qualitative Study: Activation of AIM

Following the settings in few-shot learning, activations of AIM when applied to OML are shown in Figure 3(b). The activations are similar to what we observed in few-shot learning, i.e. a set of common mechanisms shared by all classes and another set of mechanisms that are sparsely activated.

4.2.2 Quantitative Study

To evaluate the capability of AIM to continually learn new concepts and mitigate catastrophic forgetting, we show the results of meta-test training and testing in Figure 5. To demonstrate that the accuracy gain using AIM is not due to the increase in parameters, a baseline is plotted, defined as swapping AIM for a linear layer containing the same number of parameters as the AIM added to OML. Samples of new classes are continuously fed without replacement, and samples of old classes are not stored. Prior works use the results from meta-test training as a measure of forgetting and meta-test testing to measure both forgetting and generalization error. We argue that memorizing features that do not transfer well to the testing set is not a good measure of forgetting. Results show that through the application of AIM, the difference between train and test accuracy is marginal, i.e. a small generalization error, demonstrating that AIM is not only useful for adaptation to new knowledge and the mitigation of catastrophic forgetting; it also plays an important role in learning concepts that generalize to the test set. A consistent improvement in accuracy is observed when AIM is applied to existing continual learning frameworks. The only exception is the application of AIM to ANML trained on Omniglot, which could be remedied through a better selection of hyperparameters.
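Schematically, the evaluation protocol described above can be written as follows, with adapt_step and accuracy left as assumed helpers; classes arrive once, in sequence, with no rehearsal memory.

```python
def meta_test(model, train_traj, test_traj, adapt_step, accuracy):
    # Meta-test training: classes arrive sequentially, each sample seen once,
    # with no rehearsal memory and only fast-weight updates.
    for x, y in train_traj:
        adapt_step(model, x, y)
    # Meta-test evaluation: accuracy on the training trajectory measures
    # forgetting; accuracy on the held-out test set additionally measures
    # the generalization error discussed above.
    return accuracy(model, train_traj), accuracy(model, test_traj)
```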

5 Conclusion

We have shown that AIM, as a mixture of experts, is an important building block for modeling higher-order concepts, translating to the capability of fast adaptation and the mitigation of catastrophic forgetting. Through the sparse modeling of higher-order concepts, substantial improvement over the prior art can be seen for both few-shot and continual learning. It would be interesting to see the extension of AIM to multiple layers for the hierarchical modeling of higher-order concepts.

Acknowledgement

This project is supported by MOST under grant 107-2221-E-009-125-MY3. Eugene Lee is partially supported by the Novatek Ph.D. Fellowship Award. The authors are grateful for the suggestions provided by Dr. Eugene Wong from the University of California, Berkeley and Dr. Jian-Ming Ho from Academia Sinica, Taiwan.

References

  • [1] D. Abati, J. Tomczak, T. Blankevoort, S. Calderara, R. Cucchiara, and B. E. Bejnordi (2020) Conditional channel gated networks for task-aware continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3931–3940.
  • [2] W. C. Abraham and A. Robins (2005) Memory retention – the synaptic stability versus plasticity dilemma. Trends in Neurosciences 28 (2), pp. 73–78.
  • [3] R. Aljundi, E. Belilovsky, T. Tuytelaars, L. Charlin, M. Caccia, M. Lin, and L. Page-Caccia (2019) Online continual learning with maximal interfered retrieval. In Advances in Neural Information Processing Systems, pp. 11849–11860.
  • [4] A. J. Bauer and M. A. Just (2015) Monitoring the growth of the neural representations of new animal concepts. Human Brain Mapping 36 (8), pp. 3213–3226.
  • [5] S. Beaulieu, L. Frati, T. Miconi, J. Lehman, K. O. Stanley, J. Clune, and N. Cheney (2020) Learning to continually learn. arXiv preprint arXiv:2002.09571.
  • [6] L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi (2018) Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136.
  • [7] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
  • [8] C. Chang and C. Lin (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3), pp. 1–27.
  • [9] R. Desimone and J. Duncan (1995) Neural mechanisms of selective visual attention. Annual Review of Neuroscience 18 (1), pp. 193–222.
  • [10] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [11] J. Fagot and R. G. Cook (2006) Evidence for large long-term memory capacities in baboons and pigeons and its implications for learning and the evolution of cognition. Proceedings of the National Academy of Sciences 103 (46), pp. 17564–17567.
  • [12] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1126–1135.
  • [13] V. Garcia and J. Bruna (2017) Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043.
  • [14] S. Gidaris, A. Bursuc, N. Komodakis, P. Pérez, and M. Cord (2019) Boosting few-shot visual learning with self-supervision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8059–8068.
  • [15] S. Gidaris and N. Komodakis (2018) Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4367–4375.
  • [16] A. Goyal, A. Lamb, J. Hoffmann, S. Sodhani, S. Levine, Y. Bengio, and B. Schölkopf (2019) Recurrent independent mechanisms. arXiv preprint arXiv:1909.10893.
  • [17] G. E. Hinton and D. C. Plaut (1987) Using fast weights to deblur old memories. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, pp. 177–186.
  • [18] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • [19] R. Hou, H. Chang, M. Bingpeng, S. Shan, and X. Chen (2019) Cross attention network for few-shot classification. In Advances in Neural Information Processing Systems, pp. 4005–4016.
  • [20] S. X. Hu, P. Moreno, Y. Xiao, X. Shen, G. Obozinski, N. Lawrence, and A. Damianou (2020) Empirical Bayes transductive meta-learning with synthetic gradients. In International Conference on Learning Representations (ICLR).
  • [21] M. Jaderberg, W. M. Czarnecki, S. Osindero, O. Vinyals, A. Graves, D. Silver, and K. Kavukcuoglu (2017) Decoupled neural interfaces using synthetic gradients. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1627–1635.
  • [22] K. Javed and M. White (2019) Meta-learning representations for continual learning. In Advances in Neural Information Processing Systems, pp. 1820–1830.
  • [23] R. Kemker, M. McClure, A. Abitino, T. Hayes, and C. Kanan (2017) Measuring catastrophic forgetting in neural networks. arXiv preprint arXiv:1708.02072.
  • [24] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2.
  • [25] K. B. Korb, L. R. Hope, A. E. Nicholson, and K. Axnick (2004) Varieties of causal intervention. In Pacific Rim International Conference on Artificial Intelligence, pp. 322–331.
  • [26] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images.
  • [27] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338.
  • [28] E. Lee and C. Lee (2020) NeuralScale: efficient scaling of neurons for resource-constrained deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1478–1487.
  • [29] J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh (2019) Set Transformer: a framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pp. 3744–3753.
  • [30] K. Lee, X. Chen, G. Hua, H. Hu, and X. He (2018) Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 201–216.
  • [31] H. Li, D. Eigen, S. Dodge, M. Zeiler, and X. Wang (2019) Finding task-relevant features for few-shot learning by category traversal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–10.
  • [32] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34.
  • [33] Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. J. Hwang, and Y. Yang (2018) Learning to propagate labels: transductive propagation network for few-shot learning. arXiv preprint arXiv:1805.10002.
  • [34] M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation, Vol. 24, pp. 109–165.
  • [35] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel (2017) A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141.
  • [36] T. Munkhdalai and H. Yu (2017) Meta networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2554–2563.
  • [37] A. Nichol, J. Achiam, and J. Schulman (2018) On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999.
  • [38] B. Oreshkin, P. R. López, and A. Lacoste (2018) TADAM: task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 721–731.
  • [39] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268.
  • [40] S. Qiao, C. Liu, W. Shen, and A. L. Yuille (2018) Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7229–7238.
  • [41] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9.
  • [42] S. Ravi and H. Larochelle (2016) Optimization as a model for few-shot learning.
  • [43] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2018) Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960.
  • [44] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap (2016) Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842–1850.
  • [45] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087.
  • [46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
  • [47] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208.
  • [48] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
  • [49] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
  • [50] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638.
  • [51] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146.
  • [52] D. Zeithamova, M. L. Mack, K. Braunlich, T. Davis, C. A. Seger, M. T. van Kesteren, and A. Wutz (2019) Brain mechanisms of concept learning. Journal of Neuroscience 39 (42), pp. 8259–8266.
  • [53] L. M. Zintgraf, K. Shiarlis, V. Kurin, K. Hofmann, and S. Whiteson (2018) Fast context adaptation via meta-learning. arXiv preprint arXiv:1810.03642.
  • [54] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.

Appendix A Few-Shot Learning

A.1 Observation of Attention Weight Change over Training Epochs

In this section, we provide an empirical study of the change of the attention weight corresponding to the input dimension of the attention score over the training epochs. This study provides two insights: 1. the dynamics of the activation of mechanisms based on the number of training samples (shots); 2. the distribution of active and inhibited mechanisms over the training iterations. To show the dynamic change in a 2D plot, we sample classes from the validation set and observe them over the entire training process. Plots that use Conv-4-64 and WRN-28-10 as backbones are shown in Figure 6 and Figure 7 respectively. From the plots, we can see that the activations of mechanisms are initially distributed uniformly, followed by slow convergence to a sparse distribution over the training epochs, with only a few active mechanisms upon convergence. The active and inhibited attention weights are also clearly separated in all examples. Another observation is that with a larger number of training samples (shots), a smoother convergence of the activation weights across the training epochs is obtained. Smoother convergence is also obtained when a deeper backbone (WRN-28-10) is used compared to a shallower one (Conv-4-64). This observation is intuitive, as AIM is able to learn more efficiently when more samples or higher-quality input features are provided, enabling the mechanisms to better model higher-order factorized information.

Competitive selection of mechanisms.

From the figures shown, a distinct gap between active and inhibited mechanisms can be clearly observed. This motivates the idea of basing the activation of mechanisms on their corresponding attention values (a soft decision) instead of making a hard decision that selects a total of $k$ mechanisms on every inference. To test whether basing the activation of AIM on the attention value would work, a simple experiment can be performed by allowing a mechanism to be active if its attention value is above 0.5 (similar to ReLU [nair2010rectified]), or:

$\mathcal{S} = \{\, m \mid \omega_m > 0.5 \,\}$ (8)

where $\omega_m$ is given as,

$\omega_m = \bar{\omega}_{m,0}$, the entry of $\bar{\omega}_m$ corresponding to the input (not null) dimension, (9)

and $\bar{\omega}$ is defined as,

$\bar{\omega} = \operatorname{softmax}\left( \dfrac{(h W^q)(X W^e)^{\top}}{\sqrt{d_e}} \right)$ (10)

To keep our experiments simple, we do not induce stochasticity in the original approach and fix $k' = k$ to provide a fair comparison. We name the original method that keeps $k = 8$ mechanisms active the hard decision, and the method in (8)–(10) the soft decision. A comparison between the hard decision and the soft decision is shown in Table 2. From the results, we can see that when Conv-4-64 is used as the backbone, higher accuracy is obtained with the hard decision. The opposite is observed when WRN-28-10 is used as the backbone. We deduce that when the extracted features are more reliable, i.e. through the use of a deeper backbone or a higher number of shots, the attention weights are of higher quality, leading to a clear distinction between relevant and less relevant mechanisms.
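The two selection rules compared here can be sketched as below, operating on the per-mechanism attention values of the input (not null) dimension; the tensor att and its shape are assumptions for illustration.

```python
import torch

def hard_decision(att, k=8):
    # Keep exactly the top-k mechanisms (proposed method, k' = k).
    # att: (batch, num_mechanisms) attention values on the input dimension.
    idx = att.topk(k, dim=-1).indices
    return torch.zeros_like(att).scatter_(-1, idx, 1.0)

def soft_decision(att, threshold=0.5):
    # Keep any mechanism whose attention value exceeds 0.5, as in (8);
    # the number of active mechanisms now varies per sample.
    return (att > threshold).float()
```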

Backbone Method MiniImageNet, 5-Way CIFAR-FS, 5-Way
1-shot 5-shot 1-shot 5-shot
Conv-4-64 Hard decision
Soft decision
WRN-28-10 Hard decision
Soft decision
Table 2: Results for the comparison of using either the hard decision (proposed method, with $k' = k$) or the soft decision (8)–(10) for the activation of mechanisms during inference. Average classification accuracies with 95% confidence intervals on the test set are shown.

A.2 Observation of Attention Weights for All Classes

Different from the previous section, we show the masks instead of the attention weights here. The masks have a value of 1 for active mechanisms and 0 for inhibited mechanisms, with the competitive selection based on the attention weights; for all experiments, $k = 8$ mechanisms are active during both training and inference. We set $k' > k$ to induce stochasticity during training. Instead of sampling a single sample from each class, we take the average of the masks of each class accumulated across the entire validation set. We show heatmaps covering all classes and all 32 AIM mechanisms in the first and final training epochs. Results that use Conv-4-64 as backbone with the #shots-dataset pairs 1-shot CIFAR-FS, 5-shot CIFAR-FS, 1-shot MiniImageNet and 5-shot MiniImageNet are shown in Figure 8, Figure 9, Figure 10 and Figure 11 respectively. Results that use WRN-28-10 as backbone with the same #shots-dataset pairs are shown in Figure 12, Figure 13, Figure 14 and Figure 15 respectively. By observing the heatmaps, we can see that the activations of mechanisms in the first epoch are uniformly distributed, whereas in the final epoch only a small set of mechanisms is jointly used between classes, accompanied by a sparse set of mechanisms that are invariant among samples from the same class. This observation meets our expectation of learning a set of experts that are each responsible for a certain task. To give a better illustration of the transition of the heatmaps over the training epochs, we have attached several .mp4 files that follow the naming convention mask-DATASET_MODEL_#shot (italicized as wildcard strings) along with the supplementary materials.

A.3 Manipulating the Stochastic Sampling and Active Mechanism Counts

In this section, we show tabulated results for the manipulation of the stochastic sampling and active mechanism counts as found in the main paper. The plots in the main paper show zero-meaned results, whereas the actual accuracy is reported in Table 3 and Table 4 for the manipulation of the stochastic sampling count and the active mechanism count respectively. From the tabulated results, we can say that introducing some stochasticity into the competitive selection of mechanisms during training is beneficial for overall performance.

Appendix B Continual Learning

B.1 Quantitative Analysis

We show the plots from the main paper comparing different continual learning methods in Figure 19, with the accompanying tabulated data in Table 5. The baseline shown corresponds to swapping the AIM layer with a single linear layer with a number of parameters close to the originally introduced AIM layer, to demonstrate that the increase in accuracy is not from over-parameterization. From the results, we can observe that with the addition of AIM as a module for continual learning, a consistent improvement in accuracy can be obtained. It is also shown that the gain in accuracy does not result from the increase in parameters, as shown by the accuracy attained using the baseline method.

B.2 Activation of AIM

Similar to the analysis of the activations of mechanisms for few-shot learning, we show the activations of mechanisms when AIM is used for continual learning. We apply AIM to both OML [22] and ANML [5], with activation heatmaps when trained on Omniglot [27], CIFAR-100 [26] and MiniImageNet [50] shown in Figure 16, Figure 17 and Figure 18 respectively. We can observe that for Omniglot, the activations of mechanisms are sparsely distributed compared to the activations obtained using CIFAR-100 and MiniImageNet. We conjecture that this is due to the simplicity of the extracted representations, resulting in simpler higher-order modeling by the mechanisms. For natural images like CIFAR-100 and MiniImageNet, the features are not as distinct as the characters found in Omniglot, hence the higher-order modeling of representations is not as sparsely distributed. The sparsely distributed activations found on Omniglot result in a distinctive increase in accuracy compared to the other datasets, as shown in Table 5, e.g. at a trajectory containing 600 classes, the relative increase in accuracy when AIM is applied to OML is +20.70%. Even when MiniImageNet is used, a distinctive increase in accuracy can also be observed, e.g. a +19.40% relative increase when AIM is applied to ANML, which we believe is due to the richness of information embedded in the latent representation resulting from the larger image size of 84×84. We believe that a larger gain in accuracy can be attained through the introduction of a better feature extractor, i.e. an alternative to convolutional layers.

(a) CIFAR-FS, 1-shot.
(b) CIFAR-FS, 5-shot.
(c) MiniImageNet, 1-shot.
(d) MiniImageNet, 5-shot.
Figure 6: Change of the attention weight or score corresponding to the input dimension over the training epochs. Different dataset-shot pairs with different amounts of training samples (shots) are shown here, using Conv-4-64 as the backbone. Each line represents an independent mechanism.
(a) CIFAR-FS, 1-shot.
(b) CIFAR-FS, 5-shot.
(c) MiniImageNet, 1-shot.
(d) MiniImageNet, 5-shot.
Figure 7: Change of the attention weight or score corresponding to the input dimension over the training epochs. Different dataset-shot pairs with different amounts of training samples (shots) are shown here, using WRN-28-10 as the backbone. Each line represents an independent mechanism.
(a) First epoch.
(b) Last epoch.
Figure 8: Activation of AIM from few-shot learning. Training on CIFAR-FS with 1-shot using Conv-4-64 as backbone.
(a) First epoch.
(b) Last epoch.
Figure 9: Activation of AIM from few-shot learning. Training on CIFAR-FS with 5-shot using Conv-4-64 as backbone.
(a) First epoch.
(b) Last epoch.
Figure 10: Activation of AIM from few-shot learning. Training on MiniImageNet with 1-shot using Conv-4-64 as backbone.
(a) First epoch.
(b) Last epoch.
Figure 11: Activation of AIM from few-shot learning. Training on MiniImageNet with 5-shot using Conv-4-64 as backbone.
(a) First epoch.
(b) Last epoch.
Figure 12: Activation of AIM from few-shot learning. Training on CIFAR-FS with 1-shot using WRN-28-10 as backbone.
(a) First epoch.
(b) Last epoch.
Figure 13: Activation of AIM from few-shot learning. Training on CIFAR-FS with 5-shot using WRN-28-10 as backbone.
(a) First epoch.
(b) Last epoch.
Figure 14: Activation of AIM from few-shot learning. Training on MiniImageNet with 1-shot using WRN-28-10 as backbone.
(a) First epoch.
(b) Last epoch.
Figure 15: Activation of AIM from few-shot learning. Training on MiniImageNet with 5-shot using WRN-28-10 as backbone.
(a) OML+AIM
(b) ANML+AIM
Figure 16: Activation of AIM from continual learning. Subset of classes from Omniglot are shown.
(a) OML+AIM
(b) ANML+AIM
Figure 17: Activation of AIM from continual learning. Subset of classes from CIFAR-100 are shown.
(a) OML+AIM
(b) ANML+AIM
Figure 18: Activation of AIM from continual learning. Subset of classes from MiniImageNet are shown.
Backbone Stochastic sampling count MiniImageNet, 5-Way CIFAR-FS, 5-Way
1-shot 5-shot 1-shot 5-shot
Conv-4-64 8
10
12
16
20
24
28
32
WRN-28-10 8
10
12
16
20
24
28
32
Table 3: Results for varying the stochastic sampling count $k'$. The zero-meaned plot is found in the main paper. Throughout the experiment, $k = 8$ and $k'$ is varied. Average classification accuracies with 95% confidence intervals on the test set are shown.
Backbone Active Mechanism Count MiniImageNet, 5-Way CIFAR-FS, 5-Way
1-shot 5-shot 1-shot 5-shot
Conv-4-64 1
2
4
8
12
16
20
24
28
32
WRN-28-10 1
2
4
8
12
16
20
24
28
32
Table 4: Results for varying the active mechanism count $k$. The zero-meaned plot is found in the main paper. Throughout the experiment, $k$ is varied and $k' = k$. Average classification accuracies with 95% confidence intervals on the test set are shown.
(a)
(b)
(c)
Figure 19: Evaluation of continual learning methods using dataset of various scales. Meta-test testing (training) trajectories are shown in solid (dashed) lines. All curves are averaged over 10 runs with standard deviation shown.
Method Number of classes
10 50 75 100 200 300 400 500 600
Dataset: Omniglot
Baseline 10.00 2.00 1.33 1.00 0.50 0.33 0.25 0.20 0.17
OML 94.34 80.69 76.30 73.61 62.96 56.61 51.27 47.34 44.56
ANML 82.60 84.92 83.60 81.92 76.85 72.83 69.12 65.74 63.41
OML+AIM 97.70 (+3.45) 98.60 (+1.28) 98.55 (+5.50) 97.75 (+8.41) 96.28 (+11.76) 94.08 (+17.58) 91.09 (+24.26) 85.51 (+22.93) 80.37 (+20.70)
ANML+AIM 94.25 (+11.65) 97.32 (+12.40) 93.05 (+9.45) 89.34 (+7.42) 84.52 (+7.67) 76.50 (+3.67) 66.83 (-2.29) 62.58 (-3.17) 59.68 (-3.74)
2 4 6 8 10 15 20 25 30
Dataset: CIFAR-100
Baseline 50.00 25.00 16.67 12.50 10.00 6.67 5.00 4.00 3.33
OML 81.57 61.46 58.24 52.39 53.65 39.42 33.58 28.53 26.78
ANML 86.86 73.03 63.66 56.60 50.01 45.02 40.06 36.29 34.15
OML+AIM 85.65 (+4.08) 79.46 (+18.00) 68.03 (+9.79) 60.44 (+8.04) 53.39 (-0.26) 46.16 (+6.74) 42.02 (+8.43) 36.17 (+7.64) 33.59 (+6.81)
ANML+AIM 84.10 (-2.76) 70.10 (-2.93) 60.90 (-2.76) 58.35 (+1.76) 55.88 (+5.87) 46.72 (+1.70) 39.84 (-0.23) 36.47 (+0.18) 28.81 (-5.34)
2 4 6 8 10 12 15 18 20
Dataset: MiniImageNet
Baseline 50.00 25.00 16.67 12.50 10.00 8.33 6.67 5.56 5.00
OML 63.00 42.17 30.80 28.40 23.31 19.26 16.82 13.97 11.54
ANML 73.25 48.42 33.42 28.44 23.43 19.49 16.66 13.81 12.73
OML+AIM 75.75 (+12.75) 56.13 (+13.96) 42.92 (+12.12) 38.94 (+10.53) 33.52 (+10.20) 28.57 (+9.30) 27.81 (+10.99) 23.85 (+9.89) 23.03 (+11.48)
ANML+AIM 85.25 (+12.00) 64.13 (+15.71) 52.97 (+19.56) 52.79 (+24.35) 44.90 (+21.47) 39.28 (+19.79) 36.67 (+20.01) 33.01 (+19.20) 32.13 (+19.40)
Table 5: Average meta-testing test accuracy of continual learning on various datasets. During training, a trajectory of samples is introduced, i.e. meta-test train images are fed to the model sequentially without the use of rehearsal memory, and evaluation on the meta-testing test set is performed at the end. The relative increase (decrease) in accuracy through the introduction of AIM is shown in green (red), e.g. when 10 classes in a trajectory are introduced for Omniglot, a relative increase in accuracy of +3.45 over OML is attained when AIM is inserted, shown as 97.70 (+3.45).