Deep neural networks (DNNs) are known to perform well when deployed to test distributions that share high similarity with the training distribution. Sequentially feeding DNNs with new data that were unseen in the training distribution poses two major challenges: fast adaptation to new tasks and catastrophic forgetting of old tasks. Such difficulties paved the way for the ongoing research on few-shot learning and continual learning. To tackle these problems, we introduce Attentive Independent Mechanisms (AIM). We incorporate the idea of learning using fast and slow weights in conjunction with the decoupling of the feature extraction and higher-order conceptual learning of a DNN. AIM is designed for higher-order conceptual learning, modeled by a mixture of experts that compete to learn independent concepts to solve a new task. AIM is a modular component that can be inserted into existing deep learning frameworks. We demonstrate its capability for few-shot learning by adding it to SIB and training on MiniImageNet and CIFAR-FS, showing significant improvement. AIM is also applied to ANML and OML trained on Omniglot, CIFAR-100 and MiniImageNet to demonstrate its capability in continual learning. Code is made publicly available at https://github.com/huang50213/AIM-Fewshot-Continual.
Humans have the ability to learn new concepts continually while retaining previously learned concepts. While learning new concepts, prior concepts that were learned are leveraged to form new connections in the brain [4, 52]. The plasticity of the human brain plays an important role in the forming of novel neuronal connections for learning new concepts. Current deep learning methods are inefficient at remembering old concepts after being fed with new concepts, a problem widely known as catastrophic forgetting [34, 23]. Deep neural networks (DNNs) trained in an end-to-end fashion also have difficulty in learning new tasks in a sample-efficient manner. It is conjectured that catastrophic forgetting and inefficiency in learning new tasks both stem from the stability-plasticity dilemma. Stability is required so that previously learned information can be retained by limiting abrupt weight changes. Plasticity, on the other hand, encourages large weight changes, resulting in the fast acquisition of new concepts at the cost of forgetting old ones.
It is believed that by scaling up currently available architectures, DNNs are able to generalize better [7, 41, 10]. Tremendous effort is placed into neural architecture search (NAS) [28, 54, 49, 39, 32] with the hypothesis that improvements on a structural level introduce inductive bias that improves the generalizability of a neural network. As most prior work is evaluated on benchmark datasets whose test sets are distributed similarly to the training set, the evaluation results are not a good measure of generalization. We argue that the ability to adapt, acquire new knowledge and recall previously learned information plays an important role in reaching true generalization. The importance of learning to learn, i.e. meta-learning, has put the spotlight on the two major research directions that we focus on: few-shot learning and continual learning. In few-shot learning [12, 37, 45, 14], the goal is to learn novel concepts with as few samples as possible, i.e. the capability of adapting to new tasks is evaluated. In continual learning, the ability to learn an increasing number of concepts without forgetting old ones is evaluated.
Following OML, we separate the feature extraction part and the decision-making part of the network, defined in OML as the representation learning network (RLN) and the prediction learning network (PLN) respectively. The fast and slow learning in OML is performed on an architectural level, i.e. the RLN is updated in the outer loop (slow weights) and the PLN is updated in the inner loop (fast weights). Such an approach has proven to be helpful in learning sparse representations that are beneficial for fast adaptation and the prevention of catastrophic forgetting. We take one step further by introducing sparsity on an architectural level through Attentive Independent Mechanisms (AIM). AIM is composed of a set of mechanisms that competitively attend to the input representation, so that only the mechanisms closely related to the input representation are activated during inference. AIM can be understood as a mixture of experts competing to explain an incoming representation: only the mechanisms that best explain the input representation are updated, leading to sparse modeling on an architectural level. Sparse modeling of higher-order representations on an architectural level has its benefits, as only the experts or mechanisms that best explain a task are involved in the learning process, which accelerates the learning of new concepts and mitigates catastrophic forgetting. To demonstrate the potential of AIM as a fundamental building block for fast learning without forgetting, we evaluate it on few-shot classification [12, 43, 53] and continual learning [5, 22, 23] benchmarks.
Our contributions are as follows: (1) In Section 3, we give a detailed description and formulation of AIM, a novel module that can be used for few-shot and continual learning. (2) We apply AIM to few-shot learning and continual learning tasks in Section 4.1 and Section 4.2 respectively. Qualitative and quantitative results are shown for both learning tasks, giving readers insight into the importance of AIM in the context of few-shot and continual learning. For few-shot classification, experiments are performed on CIFAR-FS and MiniImageNet, whereas for continual learning, experiments are performed on Omniglot, CIFAR-100 and MiniImageNet. Substantial improvements in accuracy over prior art are shown.
Meta-learning revolves around the idea of learning to learn, hoping that through the observation of training iterations on a few tasks, we are able to generalize to unseen tasks with only a few or zero samples. A meta-learning task is usually composed of a support set and a query set. The support set is used for fast adaptation and the query set is used to evaluate the adapted model and to meta-learn the adaptation procedure. Model-based meta-learning methods include a meta-learner based on an LSTM that takes all previously seen samples into account, i.e. all support samples of a task are considered during the class prediction of query samples through an attentive mechanism. Another similar work augments an LSTM with an external memory bank. Fast and slow weights have also been incorporated for few-shot classification.
Metric-based meta-learning methods include the Siamese Network, which predicts whether two images originate from the same class. Matching Networks use cosine distance in an attention kernel to measure the similarity of images in their embedding space. It was later found that using Euclidean distance as the metric instead of cosine distance improves performance. All of the mentioned works are generalized by modeling the metric with a graph neural network.
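To make the difference between the two metrics concrete, the following is a minimal sketch of nearest-prototype classification in an embedding space. It is a simplified illustration rather than the model of any cited method: the helper names (`prototype`, `classify`) and the toy 2-D embeddings are assumptions made for this example.

```python
import math


def prototype(vectors):
    """Mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def classify(query, support, distance):
    """Return the label whose class prototype is closest to the query.

    support maps each label to a list of embeddings (hypothetical toy data).
    """
    protos = {label: prototype(vecs) for label, vecs in support.items()}
    return min(protos, key=lambda label: distance(query, protos[label]))


support = {
    "cat": [[1.0, 0.0], [0.8, 0.2]],
    "dog": [[0.0, 1.0], [0.2, 0.8]],
}
query = [0.7, 0.1]
pred_euclid = classify(query, support, euclidean)
pred_cosine = classify(query, support, cosine_distance)
```

On this toy data both metrics agree; they can disagree when embeddings differ mainly in magnitude, since cosine distance ignores vector length.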
Among optimization-based methods, an inner and outer-loop optimization method has been proposed that performs fast adaptation in the inner loop and an outer-loop update that backpropagates through the inner-loop updates. The concept of inner and outer-loop updates has also been used with context parameters (embeddings of tasks) that are updated in the inner loop. LEO has its classifier weights generated by a low-dimensional latent embedding that is updated in the inner loop. A similar approach generates classification weights using feature vectors that correspond to the support set. SIB performs transductive inference using synthetic gradients on a feature-averaging variant of this classifier. Transductive inference was first introduced to few-shot classification by constructing a graph over the support set and the query set and propagating labels within the graph. As that architecture is restrictive, a more general approach was later proposed that uses a cross-attention module to model the semantic relevance between the support and query sets.
In continual learning, the objective is to mitigate catastrophic forgetting. Earlier works are based on regularization methods, including the use of fast and slow training weights, which borrows the idea of plasticity and stability for network training. This idea is then adopted by OML to learn representations that are useful for future learning and help in mitigating catastrophic forgetting. Similarly, fast and slow learning is applied in ANML, which has a neuromodulatory network modeled using slow weights. Other works use a task-specific gate module and prediction head to reduce the competitive effect between classes, or design a criterion to store the most-interfered samples in a fixed-size rehearsal memory.
As Attentive Independent Mechanisms (AIM) is used to model higher-order information, we place it right after a feature extractor, a series of convolutional layers with its own parameters that maps an input sample to its corresponding representation. AIM is a separately parameterized module that takes this representation as input. The representation from AIM is then fed to a linear layer for the task of classification. An illustration of AIM as a module is shown in Figure 1. We also illustrate the application of AIM to existing meta-learning frameworks used for few-shot learning and continual learning in Figure 2. We first describe the implementation of AIM as a module in Section 3.1, followed by its integration into SIB for few-shot learning in Section 3.2 and into OML and ANML for continual learning in Section 3.3.
The goal of AIM is to learn a sparse set of mechanisms, i.e. a mixture of experts, to decouple the modeling of higher-order information from the feature extraction pipeline. These mechanisms compete and attend to the input representation in a top-down fashion using cross-attention [30, 29]. Through the strict selection of mechanisms, a sparse set of mechanisms is selected for every task, inducing an architectural bias that helps in fast adaptation to new tasks and in mitigating catastrophic forgetting. AIM is composed of a set of independent mechanisms, each with its own set of parameters. Each mechanism acts as an independent expert that collaborates with other experts to solve a particular task. AIM can be viewed as a static version of RIMs, i.e. the temporal modeling of hidden states using an LSTM found in RIMs is removed. RIMs are fed with a continuous stream of inputs, making dynamical modeling using an LSTM intuitive. For AIM, the assumption of a continuous stream of inputs does not hold, as few-shot classification and continual learning feed i.i.d. data into the model during training and inference. Departing from RIMs, the objective of AIM is to show that through a mixture of experts, new concepts can be easily learned with minimal catastrophic forgetting. We hypothesize that with a set of independent mechanisms, a sparse set of factorized representations or concepts can be extracted from the input representation. Such concepts have task-invariant properties that can be helpful in learning new tasks. The learning of concepts in AIM can also be understood as the amortized version of memory-based models, which store samples either in the form of images or representations and therefore grow without limit as the number of tasks in the system increases. AIM, on the other hand, performs implicit modeling of samples, analogous to amortized modeling with a DNN instead of a non-parametric method that stores samples from the training set for inference.
Following RIMs, AIM concatenates a null vector with the input representation. The mechanisms then attend to the resulting latent representation, which can be understood as passing the input representation through an attention-weighted summation of the mechanism weights. Summing the outputs of the mechanisms makes the extension to an arbitrary number of mechanisms trivial compared to the concatenation of outputs used in RIMs. Concatenation is also infeasible when the output dimension of each mechanism is large, as it results in a wide input dimension for the following layer. The summation of mechanisms also has the property of permutation invariance, reducing the complexity of the output classifier.
To encourage sparsity, we force the mechanisms to compete with each other to attend to the incoming representation. Only the mechanisms that are most closely related to the input representation are selected, i.e. only the top-ranked few mechanisms out of the full set are used for the downstream prediction task. This strict selection forces the mechanisms to compete for the incoming signal, modeling the biased competition theory of selective attention. The selection is implemented by an operator that returns the indices corresponding to the largest values of a set. The weights that determine the importance of the selected mechanisms are the softmax of the normalized inner product between each mechanism's hidden state and the input representation, after both are mapped to a lower-dimensional embedding by the query weight and the key weight respectively. Note that the softmax is applied locally for each mechanism, i.e. it turns the pair of attention values corresponding to the input and the null vector into a probability distribution. The value that corresponds to the input (not null) dimension is the one used for the top-ranked comparison.
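The attention and selection steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the null vector is assumed to be all zeros, the score is assumed to be scaled by the square root of the embedding size, and every name (`aim_forward`, `w_query`, `w_key`, `w_mech`, the toy dimensions) is hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(values):
    """Numerically stable softmax over a short list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aim_forward(z, hidden, w_query, w_key, w_mech, k):
    """Attend to [input, null], keep the top-k mechanisms, sum their outputs."""
    d_k = len(w_query)                      # attention embedding size
    null = [0.0] * len(z)                   # assumed all-zero null vector
    key_input, key_null = matvec(w_key, z), matvec(w_key, null)
    attn = []
    for h in hidden:                        # one hidden state per mechanism
        q = matvec(w_query, h)
        s_input = sum(a * b for a, b in zip(q, key_input)) / math.sqrt(d_k)
        s_null = sum(a * b for a, b in zip(q, key_null)) / math.sqrt(d_k)
        # Softmax is local to the mechanism: input versus null.
        attn.append(softmax([s_input, s_null])[0])
    ranked = sorted(range(len(hidden)), key=lambda i: attn[i], reverse=True)
    active = sorted(ranked[:k])             # indices of selected mechanisms
    out = [0.0] * len(w_mech[0])
    for i in active:                        # attention-weighted sum of outputs
        y = matvec(w_mech[i], z)
        out = [o + attn[i] * v for o, v in zip(out, y)]
    return out, active, attn


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


num_mech, d_in, d_h, d_k, d_out = 6, 4, 3, 2, 5
z = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
hidden = rand_matrix(num_mech, d_h)
w_query = rand_matrix(d_k, d_h)
w_key = rand_matrix(d_k, d_in)
w_mech = [rand_matrix(d_out, d_in) for _ in range(num_mech)]
out, active, attn = aim_forward(z, hidden, w_query, w_key, w_mech, k=2)
```

Because the outputs are summed rather than concatenated, the output size stays `d_out` regardless of how many mechanisms exist or are selected.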
The training of AIM can be understood as an intervention procedure in which the model selects a few mechanisms to be included during the forward pass. Mechanisms that perform well on the training data are rewarded by having gradient updates directed to them, with the sensitivity to novel inputs reflected in the attention weights. As one can predict, mechanism-overfitting may occur, where only a fixed set of mechanisms is active for all training tasks, losing the original motivation of having a sparse set of mechanisms acting as experts on different tasks. Mechanism-overfitting is also equivalent to having a DNN with multiple residual paths, resembling a single layer of Inception, which deviates from our original goal of building models that are invariant across tasks.
To prevent the collapse towards having only a few active mechanisms for all tasks, we enable the exploration of different mechanisms during training instead of locking onto the top-ranked ones. Stochasticity is introduced into the selection process by enlarging the candidate pool: instead of keeping only the usual number of top-ranked mechanisms, we take a larger top set whose extra size is called the stochastic sampling count. We then uniformly sample, without replacement, the usual number of mechanisms from this enlarged top set. A cardinality constraint ensures that the sampled subset has the required size. Such intervention is analogous to stochastic intervention and dropout: it adds stochasticity to the training of AIM and prevents it from locking onto the few mechanisms that happen to be attended to upon initialization.
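The stochastic selection can be sketched as follows. This is an illustrative reading of the procedure, assuming the enlarged pool holds the selection size plus the stochastic sampling count; the names `select_mechanisms`, `k` and `s` are hypothetical labels chosen for this example.

```python
import random


def select_mechanisms(attn, k, s=0, rng=random):
    """Pick k mechanism indices from the top (k + s) by attention value.

    With s == 0 this reduces to plain top-k selection. With s > 0, k indices
    are drawn uniformly without replacement from the enlarged top set.
    """
    ranked = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    pool = ranked[:k + s]
    if s == 0:
        return sorted(pool)
    return sorted(rng.sample(pool, k))


scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
greedy = select_mechanisms(scores, k=2)
explored = select_mechanisms(scores, k=2, s=2, rng=random.Random(0))
```

Setting `s` to the number of remaining mechanisms makes the draw uniform over all mechanisms, which is the most stochastic end of the range.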
Weight updates in AIM are similar to those of a typical layer in DNNs, i.e. gradients are backpropagated from the final loss function. A distinct difference from a conventional module in DNNs is that only the mechanisms activated during a forward pass are updated, resulting in a sparse set of weight updates. As AIM is designed to model higher-order concepts, it is placed in the higher level of a DNN and has fast weights that are updated in the inner loop of a meta-learning pipeline. The role of AIM as a module is shown in Figure 1. The procedure for the meta-training of AIM for both few-shot learning and continual learning is shown in Algorithm 1, whereas the meta-testing counterpart is shown in Algorithm 2. The algorithms apply to both few-shot and continual learning, with the distinction between the two highlighted in different colors: few-shot learning using SIB in green and continual learning using OML and ANML in blue. The inner loop, the outer loop and the synthetic gradient update used for SIB each have their own step size. For few-shot learning, the fast adaptation of AIM is evaluated using the meta-testing test set of the sampled task, i.e. in the outer loop. For continual learning, evaluation is performed after the completion of meta-training, and is tested on the entire meta-test train set and meta-test test set.
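The sparsity of the weight updates can be illustrated with an explicit mask. In an automatic differentiation framework, unselected mechanisms would simply receive zero gradient; the sketch below makes that effect visible with plain lists. The function name `sparse_sgd_step` and the toy values are assumptions for illustration only.

```python
def sparse_sgd_step(weights, grads, active, lr):
    """Apply one SGD step to the active mechanisms only.

    weights and grads hold one flat parameter list per mechanism; mechanisms
    whose index is not in `active` are returned unchanged.
    """
    active = set(active)
    return [
        [w - lr * g for w, g in zip(ws, gs)] if i in active else list(ws)
        for i, (ws, gs) in enumerate(zip(weights, grads))
    ]


weights = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
updated = sparse_sgd_step(weights, grads, active=[0, 2], lr=0.5)
```

Here the second mechanism keeps its parameters, which is the property that limits interference with knowledge held by inactive experts.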
SIB builds on two works: synthetic gradient modeling and a feature averaging classifier. In the former, the idea is to use a meta-learned synthetic gradient model that generates gradients when labeled data is absent for transductive inference, i.e. weights are updated without gradients propagated from a label-dependent loss. In the latter, a classifier is defined by the cosine similarity between feature representations and classification weight vectors. The weight vectors are generated by an external classification weight generator with its own parameters and are then iteratively updated by the synthetic gradient model. Feature vectors of training samples of a novel category are fed as input to generate a new set of weights for classification. In SIB, feature averaging based weight inference is used, i.e. the classification weight vector of a class is obtained as the Hadamard product between a learned scaling vector and the average of the L2-normalized feature vectors of that class. The classification weight vector is then updated iteratively using the synthetic gradient model in SIB. Both the synthetic gradient model and the weights of the weight generator are meta-learned, i.e. updated in the outer loop. To encourage sparse modeling of higher-order concepts in the network, AIM is inserted right after the feature extractor and before the output linear classifier that is generated using the weight generator and the synthetic gradient model.
Following the training pipeline in SIB, the weights of the feature extractor are frozen to simplify the training procedure. The weights of AIM and of the output linear classifier are updated as fast weights, i.e. in the inner loop. Only the weights of the classification weight generator are updated as slow weights, i.e. in the outer loop. The application of AIM to SIB is shown in Figure 2(a).
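A cosine-similarity classifier with feature-averaged weights can be sketched as below. This is a simplified illustration: the synthetic gradient refinement used in SIB is omitted, and the `temperature` value, the function names and the toy features are assumptions rather than settings taken from the source.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def averaged_weights(support, scale):
    """Class weight = scale (Hadamard) mean of normalized support features."""
    weights = {}
    for label, feats in support.items():
        normed = [l2_normalize(f) for f in feats]
        mean = [sum(col) / len(normed) for col in zip(*normed)]
        weights[label] = [s * m for s, m in zip(scale, mean)]
    return weights


def cosine_logits(feature, weights, temperature=10.0):
    """Scaled cosine similarity between a feature and each class weight."""
    f = l2_normalize(feature)
    return {
        label: temperature * sum(a * b for a, b in zip(f, l2_normalize(w)))
        for label, w in weights.items()
    }


support = {"a": [[2.0, 0.0], [1.0, 0.0]], "b": [[0.0, 3.0]]}
class_weights = averaged_weights(support, scale=[1.0, 1.0])
logits = cosine_logits([0.9, 0.1], class_weights)
prediction = max(logits, key=logits.get)
```

Normalizing both features and weights means only direction matters, so a class with few or low-magnitude support features is not penalized for scale.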
It has been shown for continual learning that learning fast and slow in the context of meta-learning helps mitigate catastrophic forgetting [22, 5]. OML and ANML are example frameworks for continual learning that use this methodology, showing promising results. To validate our claim on the importance of incorporating sparse modeling on an architectural level for the mitigation of catastrophic forgetting, we insert AIM into both OML and ANML and observe the resulting performance.
The entire architecture is split into two parts: the representation learning network (RLN) and the prediction learning network (PLN). The RLN uses slow weights and the PLN uses fast weights. In our notation, the RLN is the feature extractor and the PLN is the classifier (not limited to a single layer). AIM is inserted after the RLN and before the PLN. AIM is trained jointly with the PLN, i.e. both have fast weights. The application of AIM to OML is shown in Figure 2(b).
Two sets of feature extractors are used in ANML: a neuromodulatory network and a prediction network. The role of the neuromodulatory network is to modulate the latent representation of the prediction network, as illustrated in Figure 2(c). The output of the neuromodulatory network is element-wise multiplied with the output of the prediction network before being passed to the final classifier. Only the neuromodulatory network has slow weights, whereas the entire prediction network has fast weights. Similar to SIB and OML, AIM is inserted right after the feature extractor and uses fast weights.
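The element-wise modulation can be illustrated with a small gating function. This is a sketch under one assumption that the text does not state: the neuromodulatory output is squashed with a sigmoid so that each gate lies between 0 and 1. The names `neuromodulate`, `pred_features` and `gate_logits` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuromodulate(pred_features, gate_logits):
    """Element-wise product of prediction features and sigmoid gates."""
    return [p * sigmoid(g) for p, g in zip(pred_features, gate_logits)]


gated = neuromodulate([2.0, 4.0], [0.0, -50.0])
```

A strongly negative gate suppresses its feature almost entirely, which is how a slowly learned gate could shield parts of the representation from fast-weight updates.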
For all datasets, class splits are disjoint. MiniImageNet contains a total of 100 classes, which are split into 64 training, 16 validation and 20 testing classes. CIFAR-FS is created by dividing CIFAR-100 into 64 training, 16 validation and 20 testing classes. For few-shot classification, each task (episode) consists of a train set and a test set. For each task, a fixed number of classes is sampled from the class pool mentioned. For each class, a small number of examples is drawn and relabeled with disjoint class labels, forming the train set. For the test set, 15 samples are used. We report 5-way results for both the 1-shot and 5-shot settings.
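Episode construction along these lines can be sketched as follows. The sketch assumes that the 15 test samples are drawn per class and that train and test samples of a class do not overlap; neither detail is spelled out in the text. The function name `sample_episode` and the synthetic class pool are illustrative.

```python
import random


def sample_episode(pool, n_way=5, k_shot=1, n_query=15, rng=random):
    """Sample one few-shot task from a dict mapping class name -> examples.

    Sampled classes are relabeled 0..n_way-1. Each class contributes k_shot
    train examples and n_query test examples, drawn without overlap.
    """
    classes = rng.sample(sorted(pool), n_way)
    train, test = [], []
    for new_label, cls in enumerate(classes):
        items = rng.sample(pool[cls], k_shot + n_query)
        train += [(x, new_label) for x in items[:k_shot]]
        test += [(x, new_label) for x in items[k_shot:]]
    return train, test


# Hypothetical pool: 20 classes with 20 placeholder examples each.
pool = {f"class{c}": [f"class{c}_img{i}" for i in range(20)] for c in range(20)}
train_set, test_set = sample_episode(pool, k_shot=1, rng=random.Random(0))
```

Relabeling inside each episode means the model cannot rely on a fixed output index for a class and has to adapt from the train set of the task.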
The feature extractor is pretrained in a typical end-to-end supervised learning fashion, i.e. the entire training set is used for batch updates. Our classifier is adopted directly from [20, 15]. For transductive inference, the synthetic gradient network is modeled by an MLP with 3 layers and a hidden size of 8. Classification is done using the cosine-similarity based classifier found in [20, 15]. For AIM, all weights are linear layers. The hidden states of the mechanisms are of dimension 256. The key and query weights map the input and hidden state to a dimension of 128 to perform distance measurement. For the output dimension of the mechanism weights, we picked 400 for CIFAR-FS trained on Conv-4-64 and 800 for the rest; this decision is based on the dimension of the flattened feature map at the output of the feature extractor (not cherry-picked).
We use 32 mechanisms, of which the top 8 are selected during inference, with stochasticity induced during training through the stochastic sampling count. SGD is used with a batch size of 1 for 50,000 steps, with separate learning rates for SIB's classifier synthetic update, the outer-loop update and the inner-loop update. The feature extractor is frozen during training. 1,000 tasks are sampled from the validation set for hyperparameter selection at each training epoch. All experiments are run on a single GTX 1080 Ti using PyTorch. A complete run of Conv-4-64 on CIFAR-FS and WRN-28-10 on MiniImageNet takes less than 2 hours and 5 hours respectively.
|Method||Backbone||Transductive||MiniImageNet, 5-Way||CIFAR-FS, 5-Way|
|Matching Net ||Conv-4-64||44.2%||57%||-||-|
|Prototypical Net ||Conv-4-64|
|Relation Net ||Conv-4-64|
|SIB + Linear layer||Conv-4-64||✓|
|SIB + Linear layer||WRN-28-10||✓|
Table 1: Average classification accuracies with 95% confidence intervals on the test set of MiniImageNet and CIFAR-FS. 2000 episodes are sampled for MiniImageNet and CIFAR-FS using Conv-4-64 and WRN-28-10 as the feature extractor.
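For readers who want to reproduce this style of reporting, a 95% confidence interval over per-episode accuracies is commonly computed with a normal approximation, as sketched below. The source does not state which formula was used, so the 1.96 multiplier and the function name `mean_with_ci95` are assumptions, and the accuracies are synthetic.

```python
import math
import statistics


def mean_with_ci95(accuracies):
    """Return (mean, half-width) of a normal-approximation 95% interval."""
    n = len(accuracies)
    mean = statistics.mean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(n)
    return mean, half_width


# Synthetic per-episode accuracies for 2000 episodes.
episode_acc = [0.6, 0.8] * 1000
mean_acc, ci95 = mean_with_ci95(episode_acc)
```

The half-width shrinks with the square root of the number of episodes, which is why a large number of sampled episodes gives tight intervals.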
We show heatmaps that illustrate the mechanisms activated for different classes from the validation set in Figure 3. Each heatmap is plotted by averaging the mechanisms' activity for each class over the entire validation set, with 1 and 0 indicating an active and an inhibited mechanism respectively. We can observe that there is a set of mechanisms that is shared among tasks and another set that is distributed sparsely. The sharing of mechanisms can be understood as different classes sharing similar concepts. The sparse allocation of mechanisms over different classes shows that there are features that are invariant for certain classes only, improving resiliency to covariate shift among distributions.
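The per-class averaging behind such a heatmap can be sketched as follows. The function name `class_activity` and the toy masks are illustrative assumptions; each mask is a 0/1 list with one entry per mechanism.

```python
def class_activity(masks, labels, num_mechanisms):
    """Average binary activation masks per class.

    Returns a dict mapping each label to the fraction of its samples for
    which each mechanism was active (one heatmap row per class).
    """
    sums, counts = {}, {}
    for mask, label in zip(masks, labels):
        row = sums.setdefault(label, [0.0] * num_mechanisms)
        for i, m in enumerate(mask):
            row[i] += m
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in row] for label, row in sums.items()}


masks = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
labels = ["a", "a", "b"]
activity = class_activity(masks, labels, num_mechanisms=3)
```

Columns close to 1 for every class correspond to shared mechanisms, while columns that are high for only a few classes correspond to the sparsely allocated ones.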
Evaluation of continual learning methods using dataset of various scales. Meta-test testing (training) trajectories are shown in solid (dashed) lines. All curves are averaged over 10 runs with standard deviation shown.
To show the importance of inducing stochasticity in the mechanism selection process, we perform an empirical study by varying the stochastic sampling count. We fix the number of selected mechanisms and vary the stochastic sampling count from 0 to 24. As panel (a) of the accompanying figure shows, the accuracy peaks at different stochastic sampling counts for different datasets, models and numbers of shots. In most cases, the peak accuracy occurs at a small stochastic sampling count and slowly deteriorates as more stochasticity is introduced.
An interesting question is how many active mechanisms are required to reap the benefits of sparse activations. An empirical study is shown in panel (b) of the accompanying figure, where the number of active mechanisms is varied from 1 to 32. The results show that accuracy is low when the number of active mechanisms is small and saturates for larger values. This shows that a limited set of active mechanisms is sufficient. Sparsity in representation can still be achieved when the number of active mechanisms is large, but it is cost-inefficient during both training and inference.
As AIM is introduced as an additional component that’s integrated into SIB , the gain in accuracy shows the importance of having a mixture of experts for fast adaptation. We also show the results for SIB with a linear layer (parameters equal the total parameters found in the AIM module) added before the classifier (SIB + Linear) to show that the gain in accuracy from AIM is not solely from the increase in parameters. From Table 1, we can see that AIM outperforms all existing few-shot classification methods by a noticeable margin. As only a single layer of AIM is explored, the coupling between AIM as found in RIMs  is not considered here. We believe that further improvements can be attained if layers of AIM are stacked, with coupling between them considered.
Omniglot has 1,623 characters from 50 different alphabets, where each character has 20 hand-written images. The dataset is split into 963 classes for meta-training and 660 classes for meta-testing. In each trajectory, 15 images are used for training and 5 images for testing in both meta-training and meta-testing. CIFAR-100 is composed of 60,000 images distributed uniformly over 100 classes, i.e. 500 train images and 100 test images for each class. 70 classes are used for meta-training and 30 classes are used for meta-testing. MiniImageNet has 64 training classes and 20 testing classes. Each class has 600 images, with 540 for training and 60 for testing. In each trajectory of CIFAR-100 and MiniImageNet, we sample 30 train images per class for training and use all test images for testing, for both meta-training and meta-testing.
We adopt the models from OML and ANML with a slight modification for our experiments. For OML, the feature extractor is a 6-layer convolutional network with 112 channels and the classifier is a single linear layer, with AIM placed in between the two. For ANML, both the neuromodulatory network and the prediction network are 3-layer convolutional networks and the classifier is a single linear layer, with AIM placed after the two networks. The former has 112 channels while the latter has 256 channels. For CIFAR-100 and MiniImageNet, an additional linear layer is placed before AIM for dimension reduction. Each mechanism has its own hidden state, and the query and key weights map their corresponding inputs to a common lower-dimensional embedding.
We use a fixed number of mechanisms in our system, of which the top-ranked ones are selected during inference, with stochasticity induced during training through the stochastic sampling count. We follow the MAML strategy in [22, 5]. We use a batch size of 1 for 20,000 steps, with separate step sizes for the outer loop (slow weights) and the inner loop (fast weights). A complete meta-training of AIM using OML or ANML on Omniglot, CIFAR-100 and MiniImageNet takes less than 2 hours, 3 hours and 6 hours respectively.
Following the settings in few-shot learning, activations of AIM when applied to OML are shown in Figure 3(b). The activations are similar to what we observed in few-shot learning, i.e. a set of common mechanisms for all classes and another set of mechanisms that are sparsely activated.
To evaluate the capability of AIM to continually learn new concepts and mitigate catastrophic forgetting, we show the results of meta-test training and testing in Figure 19. To demonstrate that the accuracy gain using AIM is not due to the increase in parameters, a baseline is plotted in which AIM is swapped for a linear layer containing the same number of parameters as the AIM added to OML. Samples of new classes are continuously fed without replacement, and samples of old classes are not stored. Prior works use the results from meta-test training as a measure of forgetting and meta-test testing to measure both forgetting and generalization error. We argue that memorizing features that do not transfer well to the testing set is not a good measure of forgetting. Results show that with AIM, the difference between train and test accuracy is marginal, i.e. the generalization error is small. This demonstrates that AIM is not only useful for adapting to new knowledge and mitigating catastrophic forgetting, but also plays an important role in learning concepts that generalize to the test set. Consistent improvement in accuracy is observed when AIM is applied to existing continual learning frameworks. The only exception is the application of AIM to ANML trained on Omniglot, which could be remedied through a better selection of hyperparameters.
We have shown that AIM as a mixture of experts is an important building block for modeling higher-order concepts, translating to the capability of fast adaptation and mitigation of catastrophic forgetting. Through the sparse modeling of higher-order concepts, substantial improvement over prior arts can be seen for both few-shot and continual learning. It would be interesting to see the extension of AIM to multiple layers for hierarchical modeling of higher-order concepts.
This project is supported by MOST under code 107-2221-E-009-125-MY3. Eugene Lee is partially supported by the Novatek Ph.D. Fellowship Award. The authors are grateful for the suggestions provided by Dr. Eugene Wong from University of California in Berkeley and Dr. Jian-Ming Ho from Academia Sinica of Taiwan.
LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3), pp. 1–27.
Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135.
Pacific Rim International Conference on Artificial Intelligence, pp. 322–331.
EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
In this section, we provide an empirical study on the change of the attention weight corresponding to the input dimension of the attention score over the training epochs. This study provides two insights: 1. the dynamics of the activation of mechanisms based on the number of training samples (shots); 2. the distribution of active and inhibited mechanisms over the training iterations. To show the dynamical change in a 2D plot, we sample classes from the validation set and observe them for the entire training process. Plots that use Conv-4-64 and WRN-28-10 as the backbone are shown in Figure 6 and Figure 7 respectively. From the plots, we can see that the activations of mechanisms are initially distributed uniformly and then slowly converge to a sparse distribution over the training epochs, with only a few active mechanisms upon convergence. The active and inhibited attention weights are also clearly separated for all examples. Another observation is that a larger number of training samples (shots) yields a smoother convergence of the activation weights across the training epochs. Smoother convergence is also obtained when a deeper backbone (WRN-28-10) is used compared to a shallower one (Conv-4-64). This observation is intuitive, as AIM is able to learn more efficiently when more samples or higher-quality input features are provided, enabling the mechanisms to better model higher-order factorized information.
From the figures shown, a distinct gap between active and inhibited mechanisms can be clearly observed. This motivates the idea of basing the activation of each mechanism on its corresponding attention value (soft decision) instead of making a hard decision that selects a fixed number of mechanisms on every inference. To test whether basing the activation of AIM on the attention value works, a simple experiment can be performed by allowing a mechanism to be active if its attention value is above 0.5 (similar to ReLU[nair2010rectified]), or:
where is given as,
and is defined as,
To keep our experiments simple, we do not induce stochasticity in the original approach, fixing its number of active mechanisms, to provide a fair comparison. We name the original method, which keeps 8 mechanisms active, hard decision, and the method in (8)–(10) soft decision. A comparison between hard decision and soft decision is shown in Table 2. From the results, we can see that when Conv-4-64 is used as backbone, hard decision obtains higher accuracy. The opposite is observed when WRN-28-10 is used as backbone. We deduce that when the extracted features are more reliable, i.e. through the use of a deeper backbone or a higher number of shots, the attention weights are of higher quality, leading to a clear distinction between relevant and less relevant mechanisms.
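The difference between the two selection rules can be sketched in a few lines of plain Python. This is an illustrative sketch only and is not taken from the released code: the function names `hard_decision_mask` and `soft_decision_mask` and the attention values are hypothetical, and the attention values are assumed to lie in [0, 1].

```python
from typing import List


def hard_decision_mask(attention: List[float], k: int) -> List[int]:
    """Hard decision: activate exactly the k mechanisms with the largest attention."""
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    active = set(ranked[:k])
    return [1 if i in active else 0 for i in range(len(attention))]


def soft_decision_mask(attention: List[float], threshold: float = 0.5) -> List[int]:
    """Soft decision: activate every mechanism whose attention exceeds the threshold."""
    return [1 if a > threshold else 0 for a in attention]


# Hypothetical attention values for six mechanisms (assumed, not measured).
attention = [0.91, 0.08, 0.77, 0.52, 0.03, 0.49]

hard = hard_decision_mask(attention, k=2)  # always two active mechanisms
soft = soft_decision_mask(attention)       # count depends on the attention values
print(hard, soft)
```

With the hard rule the number of active mechanisms is fixed in advance, whereas with the thresholded rule it varies with how confidently the attention separates relevant from less relevant mechanisms, which is consistent with the rule benefiting from more reliable features.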
|Backbone||Method||MiniImageNet, 5-Way||CIFAR-FS, 5-Way|
Different from the previous section, we show the mask instead of the attention weights here. The masks have a value of 1 for active mechanisms and 0 for inhibited mechanisms, with the competitive selection based on the attention weights. For all experiments, the number of active mechanisms is the same during training and inference, and stochasticity is induced during training. Instead of taking a single sample from each class, we average the masks of each class accumulated across the entire validation set. We show heatmaps covering all classes and all 32 AIM mechanisms at the first and final training epochs. Results that use Conv-4-64 as backbone with the #shots-dataset pairs 1-shot-CIFAR-FS, 5-shot-CIFAR-FS, 1-shot-MiniImageNet and 5-shot-MiniImageNet are shown in Figure 8, Figure 9, Figure 10 and Figure 11, respectively. Results that use WRN-28-10 as backbone with the same #shots-dataset pairs are shown in Figure 12, Figure 13, Figure 14 and Figure 15, respectively. By observing the heatmaps, we can see that the activations of the mechanisms in the first epoch are uniformly distributed, whereas in the final epoch only a small set of mechanisms is used jointly across classes, accompanied by a sparse set of mechanisms that is invariant among samples from the same class. This observation meets our expectation of learning a set of experts that are each responsible for a certain task. To better illustrate the transition of the heatmaps over the training epochs, we have attached several .mp4 files that follow the naming convention mask-DATASET_MODEL_#shot (italicized as wildcard strings) along with the supplementary materials.
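The per-class averaging that produces one heatmap row per class can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `class_average_masks`, the class labels and the four-mechanism masks are hypothetical and are not part of the released code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def class_average_masks(labels: Sequence[str],
                        masks: Sequence[Sequence[int]]) -> Dict[str, List[float]]:
    """Average the binary activation masks of all samples that share a class label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for label, mask in zip(labels, masks):
        if label not in sums:
            sums[label] = [0.0] * len(mask)
        for j, value in enumerate(mask):
            sums[label][j] += value
        counts[label] += 1
    # Each entry is the fraction of the class's samples that activated that mechanism.
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


# Hypothetical validation samples: three samples, four mechanisms.
labels = ["cat", "cat", "dog"]
masks = [[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]]

heatmap = class_average_masks(labels, masks)
print(heatmap)
```

A value of 1.0 in a row means the mechanism was active for every sample of that class, which corresponds to the class-invariant mechanisms described above; intermediate values indicate mechanisms that are selected for only some samples.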
In this section, we show tabulated results for the manipulation of the stochastic sampling count and the active mechanism count found in the main paper. The plots in the main paper show zero-meaned results, whereas the actual accuracy is reported in Table 3 and Table 4 for the manipulation of the stochastic sampling count and the active mechanism count, respectively. The tabulated results indicate that introducing some stochasticity into the competitive selection of mechanisms during training is beneficial for the overall performance.
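The relation between zero-meaned curves and raw accuracies can be illustrated with a short sketch. It assumes that zero-meaning subtracts the mean accuracy across the swept settings of one configuration, which is an assumption about how the plots were produced; the helper `zero_mean` and the accuracy values are hypothetical.

```python
from statistics import mean
from typing import List


def zero_mean(accuracies: List[float]) -> List[float]:
    """Subtract the mean so that only the variation across settings remains."""
    center = mean(accuracies)
    return [a - center for a in accuracies]


# Hypothetical accuracies (%) for three settings of one configuration.
accuracies = [60.0, 62.0, 61.0]

centered = zero_mean(accuracies)
# Adding the mean back recovers the raw accuracies, as reported in the tables.
restored = [c + mean(accuracies) for c in centered]
print(centered, restored)
```

Centering makes trends across settings comparable between backbones and datasets with different absolute accuracy levels, while the tables retain the absolute values.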
We show the plots from the main paper comparing different continual learning methods in Figure 19, with the accompanying tabulated data in Table 5. The baseline shown corresponds to replacing the AIM layer with a single linear layer whose number of parameters is close to that of the AIM layer, to demonstrate that the increase in accuracy does not come from over-parameterization. From the results, we can observe that adding AIM as a module for continual learning yields a consistent improvement in accuracy. The accuracy attained by the baseline method also shows that the gain does not result from the increase in parameters.
Similar to the analysis done for the activations of mechanisms for few-shot learning, we show the activations of mechanisms when AIM is used for continual learning. We apply AIM to both OML and ANML, with activation heatmaps for models trained on Omniglot, CIFAR-100 and MiniImageNet shown in Figures (a), (b) and (c), respectively. We can observe that for Omniglot, the activations of the mechanisms are more sparsely distributed than the activations obtained using CIFAR-100 and MiniImageNet. We conjecture that this is due to the simplicity of the extracted representations, resulting in simpler higher-order modeling by the mechanisms. For natural images such as those in CIFAR-100 and MiniImageNet, the features are not as distinct as the characters found in Omniglot, hence the higher-order modeling of representations is not as sparsely distributed. The sparsely distributed activations found in Omniglot result in a distinctive increase in accuracy compared to the other datasets, as shown in Table 5; e.g. at a trajectory containing 600 classes, the accuracy gain relative to plain OML when AIM is applied is 20.70%. Even when MiniImageNet is used, a distinctive increment in accuracy can be observed, e.g. a 19.40% increment relative to plain ANML when AIM is applied, which we believe is due to the richness of information embedded in the latent representation resulting from the larger image size. We believe that a larger gain in accuracy can be attained through the introduction of a better feature extractor, i.e. an alternative to convolutional layers.
|Backbone||Stochastic sampling count||MiniImageNet, 5-Way||CIFAR-FS, 5-Way|
|Backbone||Active Mechanism Count||MiniImageNet, 5-Way||CIFAR-FS, 5-Way|
|Method||Number of classes|
|Omniglot|
|OML+AIM||97.70 (+3.45)||98.60 (+1.28)||98.55 (+5.50)||97.75 (+8.41)||96.28 (+11.76)||94.08 (+17.58)||91.09 (+24.26)||85.51 (+22.93)||80.37 (+20.70)|
|ANML+AIM||94.25 (+11.65)||97.32 (+12.40)||93.05 (+9.45)||89.34 (+7.42)||84.52 (+7.67)||76.50 (+3.67)||66.83 (-2.29)||62.58 (-3.17)||59.68 (-3.74)|
|CIFAR-100|
|OML+AIM||85.65 (+4.08)||79.46 (+18.00)||68.03 (+9.79)||60.44 (+8.04)||53.39 (-0.26)||46.16 (+6.74)||42.02 (+8.43)||36.17 (+7.64)||33.59 (+6.81)|
|ANML+AIM||84.10 (-2.76)||70.10 (-2.93)||60.90 (-2.76)||58.35 (+1.76)||55.88 (+5.87)||46.72 (+1.70)||39.84 (-0.23)||36.47 (+0.18)||28.81 (-5.34)|
|MiniImageNet|
|OML+AIM||75.75 (+12.75)||56.13 (+13.96)||42.92 (+12.12)||38.94 (+10.53)||33.52 (+10.20)||28.57 (+9.30)||27.81 (+10.99)||23.85 (+9.89)||23.03 (+11.48)|
|ANML+AIM||85.25 (+12.00)||64.13 (+15.71)||52.97 (+19.56)||52.79 (+24.35)||44.90 (+21.47)||39.28 (+19.79)||36.67 (+20.01)||33.01 (+19.20)||32.13 (+19.40)|