1 Introduction
The ability to learn and generalize from a few examples is present in a wide range of species, from the fruit fly to humans. It is therefore not surprising that biological systems have served as inspiration for the exploration of continual or incremental learning. How to achieve the right balance between plasticity and consolidation, and therefore how to avoid catastrophic forgetting, has long been a central question in continual learning. However, the design of systems intended for edge computing brings an additional set of constraints, such as how to design systems that must learn inputs that are not known during the design phase, how to achieve high accuracy when the system has no external way of replaying data (i.e., stream processing), or what the tradeoff is between performance and size.
These resource constraints may cause performance penalties with respect to online and continual learning strategies that are not bounded by memory and that keep a detailed record of past performance and data. On the other hand, these constraints are also present in biological systems, and yet they excel in many continual learning tasks. Neuroinspired approaches can therefore provide useful insights that complement other machine learning approaches.
One of the salient features of the central nervous system is its heterogeneity in terms of architecture, types of neurons, and neurochemistry, including where, when, and how learning takes place within the brain. If we focus on synaptic plasticity mechanisms, we see a wide diversity, not only in the hyperparameters of a specific mechanism or in where learning takes place, but also in the nature of the synaptic plasticity mechanism itself. For instance, while Hebbian rules and spike-timing-dependent plasticity have attracted the bulk of the attention in the machine learning and neuromorphic computing communities, anti-Hebbian and non-Hebbian rules abound (Dayan, ). In some key learning centers, for instance, researchers have shown that synaptic plasticity is independent of postsynaptic activity (Hige et al., 2015). Moreover, variations in the local learning rules have been found even in areas that are otherwise morphologically equivalent but that are assigned to different tasks. Clearly, if we want to develop robust continual learning approaches, we need to pay attention to this diversity and develop a better understanding of the interplay between heterogeneity and continual learning across very different tasks, an aspect that is often overlooked. To this end, we make the following contributions.


We propose a neuromorphic architecture that consists of multiple layers for feature extraction and neuromodulated supervised learning. It incorporates diverse synaptic plasticity mechanisms and hence is capable of online continual learning without explicit data storage and replay.

We cast the problem of meta-learning task-specific architecture choices as a mixed-integer black-box optimization problem and develop a technique based on transfer coefficients to explain the accuracy differences when the meta-learned configurations are transferred across datasets and tasks.

We evaluate our approach in online non-incremental and class-incremental learning scenarios and demonstrate that its accuracy is on par with that of shallow networks in the former scenario, while in the latter it outperforms other memory-free continual learning approaches and obtains accuracies similar to those of memory-replay-based approaches without the memory overhead.
2 Multilayer Neuromodulated Architecture
We consider multilayer, recurrent architectures that integrate both processing and learning into the network, as shown in Fig. 1.
2.1 Processing Component
The processing component comprises a feedforward neural network. It has two types of layers:
Feature extraction layers:
It is based on a sparse projection (SPARSE) of the $n$-dimensional input into a much larger $m$-dimensional space. Each neuron receives $f$ inputs, with $f \ll n$. In addition, dynamic thresholding followed by rectification ensures that only the most salient features are included in the representation:

$$y_i = \max\left(0,\ (Wx)_i - \mu - \theta\sigma\right) \qquad (1)$$

where $W$ is the sparse projection matrix, $\mu$ and $\sigma$ are the mean and standard deviation of the matrix-vector product $Wx$, and $\theta$ is a constant controlling the cutoff. This approach highlights correlations between different input channels and represents the broadest possible prior for feature extraction, since it does not assume any spatial dependence or correlations between inputs. This layer is similar to that used in previous models inspired by the cerebellum and the mushroom body, two of the biological instances of this architecture (Litwin-Kumar et al., 2017; Yanguas-Gil, 2019). A key distinction is that $m$, $f$, and $\theta$ can all be adjusted through our optimization algorithm.

Neuromodulated learning layer:
This comprises one or more layers where supervised learning occurs. Synaptic weights are treated as firstclass network elements that can be updated in real time by using a series of local learning rules codified in the learning component.
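As a concrete illustration, the feature extraction step of Eq. (1) can be sketched as follows. This is a minimal sketch rather than the actual implementation: the function name `sparse_expand` and the use of unit weights for the $f$ nonzero entries of each row of the projection matrix are our assumptions.

```python
import random
import statistics

def sparse_expand(x, m, f, theta, seed=0):
    """Sparse expansion with dynamic thresholding, in the spirit of Eq. (1).

    Each of the m expansion neurons samples f of the n inputs at random
    (f << n); a dynamic threshold at mu + theta*sigma followed by
    rectification keeps only the most salient features.
    """
    rng = random.Random(seed)
    n = len(x)
    # Each row of the sparse projection matrix has exactly f nonzero
    # (here: unit) entries, so the projection is a sum of f sampled inputs.
    z = []
    for _ in range(m):
        idx = rng.sample(range(n), f)
        z.append(sum(x[i] for i in idx))
    mu = statistics.mean(z)
    sigma = statistics.pstdev(z)
    cutoff = mu + theta * sigma
    # Rectify: keep only activity above the dynamic threshold.
    return [max(0.0, v - cutoff) for v in z]
```

Note that $m$, $f$, and $\theta$ appear directly as arguments, which is what makes them natural targets for the optimization described in Sec. 2.3.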
2.2 Learning Component
It comprises the recurrent version of the network, implementing the local learning rules. These local learning rules are defined as functions:

$$\Delta w_{ij} = f\left(w_{ij}, x_i, y_j, \mathbf{m}; \boldsymbol{\theta}\right) \qquad (2)$$

where $w_{ij}$ represents the synaptic weights; $x_i$ and $y_j$ are the presynaptic and postsynaptic activities, respectively; $\mathbf{m}$ represents a vector of modulatory signals controlling learning in real time; and $\boldsymbol{\theta}$ represents a discrete number of hyperparameters that are specific to a given learning rule. By using a consistent interface, we can swap out learning rules without having to further modify our architecture. This allows us to treat meta-learning as a combined architecture and hyperparameter optimization process.

The architecture receives real-time feedback on its performance while learning (i.e., the label). The modulatory component transforms this feedback into the modulatory signals that are fed into the local learning rules. Here, we use the following synaptic plasticity rules.
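In code, the common interface of Eq. (2) amounts to a single function signature shared by all rules, so rules become categorical choices. The names below (`hebbian_rule`, `apply_rule`, the `lr` hyperparameter) are illustrative assumptions, not the paper's implementation:

```python
def hebbian_rule(w, x, y, m, theta):
    """One example rule behind the shared interface of Eq. (2):
    a modulated Hebbian update, delta_w = lr * m * x * y."""
    return theta["lr"] * m * x * y

def apply_rule(rule, w, x, y, m, theta):
    """Apply any local rule with the (w, x, y, m, theta) signature to one
    synapse and return the updated weight; rules are interchangeable."""
    return w + rule(w, x, y, m, theta)
```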


Generalized Hebbian model (GEN): the generalized Hebbian model is a modulated version of the covariance rule commonly used in neuroscience. The synaptic weight evolution is given by

$$\Delta w_{ij} = \alpha\, m_j\, x_i\, y_j \qquad (3)$$

It has long been known that this rule is unstable and that clamping or a regularization mechanism needs to be included in order to keep the synaptic weights bounded. This rule is at the core of some of the neuromodulation-based approaches recently developed (Miconi et al., 2018).

Oja rule (OJA): a modification of the basic Hebb's rule providing a normalization mechanism through a first-order loss term (Oja, 1982):

$$\Delta w_{ij} = \alpha\, m_j \left( x_i y_j - y_j^2 w_{ij} \right) \qquad (4)$$
MSE, non-Hebbian rule (MSE): it is based on recent experimental results on synaptic plasticity mechanisms in the mushroom body (Hige et al., 2015). The key assumptions are that the weight update is independent of postsynaptic activity and that postsynaptic activity instead regulates synaptic plasticity through the modulatory signal:

$$\Delta w_{ij} = \alpha\, m_j\, x_i \qquad (5)$$

This rule is consistent with the MSE cost function used in stochastic gradient descent methods (Yanguas-Gil et al., 2019).
These are just a few of the many non-equivalent possible formulations of synaptic plasticity rules (Madireddy et al., 2019). We treat these models as categorical variables that can be swapped during the meta-learning optimization process. Moreover, we add two options that allow us to externally tune the learning rate as a function of time and to provide symmetric or positive clamping of the synaptic weights in order to prevent instabilities in the algorithms.
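The three rules above can be written against the common interface of Eq. (2), with the clamping option applied after the update. The sketch below is a plausible reading rather than the authors' code: the functional forms, the parameter names `lr` and `w_max`, and the reading of the MSE rule as independent of postsynaptic activity are our assumptions.

```python
def gen_rule(w, x, y, m, p):
    # Generalized Hebbian: modulated product of pre- and postsynaptic activity.
    return p["lr"] * m * x * y

def oja_rule(w, x, y, m, p):
    # Oja: Hebbian term plus a first-order decay that keeps w bounded.
    return p["lr"] * m * (x * y - y * y * w)

def mse_rule(w, x, y, m, p):
    # Non-Hebbian, MSE-like rule: the update depends only on presynaptic
    # activity; supervision enters through the modulatory signal m.
    return p["lr"] * m * x

# Rules as categorical variables that meta-learning can swap.
RULES = {"GEN": gen_rule, "OJA": oja_rule, "MSE": mse_rule}

def step(rule_name, w, x, y, m, p):
    """One plasticity step with symmetric clamping to [-w_max, w_max]."""
    dw = RULES[rule_name](w, x, y, m, p)
    w_max = p["w_max"]
    return max(-w_max, min(w_max, w + dw))
```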
2.3 MixedInteger Optimization Framework
The parameter space of the proposed multilayer neuromodulated architecture is composed of categorical variables (e.g., the selection of the local learning rule), integer parameters (e.g., the dimension of the hidden layer), and continuous parameters for each of the learning rules. We adopt a parallel asynchronous model-based search (AMBS) approach (Balaprakash et al., 2018) to find high-performing parameter configurations in this mixed (categorical, continuous, integer) search space. Our AMBS approach consists of sampling a number of parameter configurations and progressively fitting a surrogate model over the accuracies of the evaluated configurations. This surrogate model is updated asynchronously as new configurations are evaluated by the parallel processes and is then used to propose the configurations evaluated in the next iteration.
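The core of one AMBS iteration can be sketched as follows. The random-forest surrogate is abstracted into a `surrogate_fit` callable and the mixed-space sampler into `sample_fn`; these names, and the treatment of accuracy as a loss to minimize, are our assumptions rather than the DeepHyper implementation.

```python
import random

def lcb_pick(candidates, surrogate, kappa):
    """Select the configuration minimizing mu - kappa*sigma (LCB).
    A large kappa favors configurations the surrogate is uncertain about,
    increasing exploration."""
    return min(candidates, key=lambda c: surrogate(c)[0] - kappa * surrogate(c)[1])

def ambs_step(history, sample_fn, surrogate_fit, kappa, n_candidates=32, seed=0):
    """One asynchronous model-based search step: fit a surrogate on the
    evaluated (config, loss) history, sample candidates from the mixed
    space, and return the most promising one under the LCB criterion."""
    rng = random.Random(seed)
    surrogate = surrogate_fit(history)   # returns c -> (mu, sigma)
    candidates = [sample_fn(rng) for _ in range(n_candidates)]
    return lcb_pick(candidates, surrogate, kappa)
```

With `kappa = 0` the search is purely exploitative (it trusts the surrogate mean); the large `kappa` used in this work shifts the balance toward exploration.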
Crucial to the optimization approach is the choice of the surrogate model, since this model generates the configurations to evaluate in the mixed search space. AMBS adopts a random-forest approach to build efficient regression models on this search space. The random forest is an ensemble learning approach that builds multiple decision trees and uses bootstrap aggregation (or bagging) to combine them, producing a model with better predictive accuracy and lower variance. Another key choice for the AMBS approach is the acquisition function, which encapsulates the criteria for choosing the most promising configurations to evaluate next. Hence, the acquisition function is key to maintaining the exploration-exploitation balance during the search. AMBS adopts the lower confidence bound (LCB) acquisition function. We set kappa to a large value, which increases exploration; we do so to accommodate the high variability we observed in the accuracy metrics within and across the local learning rules, which led the search to local minima when a smaller value of kappa was used.

3 Online and Continual Learning Experiments
We focus on a specific type of experiment in which the system is subject to a stream of data and labels during a predetermined number of epochs. This constitutes a single episode. Through the synaptic plasticity mechanisms implemented in the network, the weights evolve during the episode. At the end of the episode, the system is evaluated against the testing dataset to validate its accuracy. By concatenating multiple episodes involving different tasks and datasets, we can create a curriculum to evaluate the system’s ability to carry out continual learning.
The resulting accuracy is the metric that the optimization framework uses to carry out the exploration of the architecture and hyperparameter space and find the optimal configurations. As part of the optimization process, multiple episodes are run for systems with different architectures and specific hyperparameters. In all these cases, the system starts from identical starting conditions, so that no knowledge is transferred between episodes that do not belong to the same curriculum.
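The episode/curriculum protocol above can be sketched as follows. The `observe`/`predict` interface and the `MemoModel` stand-in are our assumptions; the point is only that weights evolve online within episodes and accuracy is measured once, at the end of the curriculum.

```python
class MemoModel:
    """Stand-in learner with the online interface the episodes assume:
    observe(x, label) updates state in real time; predict(x) returns a
    label. (A placeholder, not the neuromorphic network itself.)"""
    def __init__(self):
        self.assoc = {}
    def observe(self, x, label):
        self.assoc[x] = label
    def predict(self, x):
        return self.assoc.get(x)

def run_episode(model, stream, epochs):
    # One episode: the model sees the (input, label) stream for a fixed
    # number of epochs; its state evolves only through online updates.
    for _ in range(epochs):
        for x, label in stream:
            model.observe(x, label)

def run_curriculum(model, episodes, test_set, epochs):
    # Concatenated episodes form a curriculum; accuracy is evaluated once,
    # at the end, over test data spanning all episodes.
    for stream in episodes:
        run_episode(model, stream, epochs)
    correct = sum(1 for x, label in test_set if model.predict(x) == label)
    return correct / len(test_set)
```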
We consider two different learning modalities to evaluate the proposed approach:
(a) Non-Incremental Learning: There is a single episode in the curriculum, which has access to all the data and hence performs a single task (e.g., multi-class classification);
(b) Class-Incremental Learning: The synaptic weights are updated incrementally, where a specific task (e.g., two-class classification) is performed in each episode using data streams from distinct (clearly separated) classes. The curriculum involves a concatenation of multiple episodes such that the final accuracy measures the model's ability to predict on test data from all the classes, spanning across episodes.
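Building a class-incremental curriculum from a labeled dataset amounts to grouping classes in order, as sketched below (so MNIST digits {0,1}, {2,3}, ... yield the five two-class SplitMNIST tasks). The function name is ours.

```python
def make_class_incremental(dataset, classes_per_episode):
    """Split a labeled dataset into a curriculum of class-incremental
    episodes: each episode streams only its own group of classes, while
    the final test still spans all classes."""
    labels = sorted({y for _, y in dataset})
    episodes = []
    for i in range(0, len(labels), classes_per_episode):
        group = set(labels[i:i + classes_per_episode])
        episodes.append([(x, y) for x, y in dataset if y in group])
    return episodes
```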
By changing tasks, datasets, and learning modalities, we build a collection of highperforming configurations for each type of experiment. We use these configurations to explore the transferability of optimal learning conditions to other datasets and to different types of tasks.
4 Results and Discussion
We first describe the results obtained in the non-incremental learning scenario, followed by the incremental learning scenario. For both scenarios, we demonstrate that the accuracies we obtain are on par with (or outperform) those of other state-of-the-art approaches that make no explicit use of memory. Finally, we perform experiments exploring the transferability of optimal learning conditions across tasks and datasets and find that it is strongly correlated with the similarity between the datasets.
4.1 NonIncremental Learning
In this experiment, we demonstrate the learning capability of our multilayer neuromodulated learning framework on a single-episode curriculum, specifically multi-class classification. We consider the MNIST, Fashion MNIST (Xiao et al., 2017) (FMNIST), and Extended MNIST (Cohen et al., 2017) (EMNIST) datasets because of the homogeneity in image and class sizes and the existence of continual learning benchmarks for them. For each dataset, we jointly optimize over the local learning rules and their parameters to find the optimal configuration.
We employ SPARSE as the feature representation layer and the three supervised learning rules (MSE, GEN, OJA) for label prediction. While the number of parameters in each learning rule differs, the search space is defined to exploit the common parameters, and its dimensionality is set by the rule with the most parameters. In addition, only the subset of parameters present in the selected local learning rule is active at search time. For example, the weight-clamping parameter is common to the three rules; the GEN rule has all of its parameters active, whereas MSE has only a subset active, so the remaining parameters are set to None while exploring the search space. For each configuration evaluated during the optimization, the model is run for a fixed number of epochs. The best accuracy obtained and the corresponding optimal parameters for all the datasets are shown in Table 1. The search trajectory, showing the learning rule and test accuracy of the configurations evaluated as a function of time, is shown in Fig. 2 for MNIST and FMNIST. In both cases the initial configurations range across the different learning rules, but the algorithm quickly identifies the most promising learning rule and evaluates more configurations from it. The corresponding accuracies for each of the three datasets, obtained by using the optimal configurations but run for four epochs, are shown in Table 2. These accuracies are on par with (or outperform) those of other shallow network architectures (Yanguas-Gil et al., 2019; Xiao et al., 2017; Cohen et al., 2017).
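The conditional search space described above can be sketched as follows. The specific parameter names (`decay`, `w_max`, `lr`) and which rules use them are hypothetical; the mechanism shown, deactivating (setting to None) the parameters the chosen rule does not use, is the one described in the text.

```python
# Hypothetical parameter names: the real search space uses the union of all
# rule parameters, activating only those the chosen rule needs.
SPACE = {
    "rule": ["GEN", "OJA", "MSE"],   # categorical
    "hidden_dim": (500, 20000),      # integer
    "lr": (1e-4, 1e-1),              # continuous, shared
    "w_max": (0.1, 10.0),            # continuous, shared (weight clamping)
    "decay": (0.0, 1.0),             # continuous, rule-specific (example)
}
ACTIVE = {"GEN": {"lr", "w_max", "decay"},
          "OJA": {"lr", "w_max", "decay"},
          "MSE": {"lr", "w_max"}}

def mask_config(config):
    """Set parameters that the selected rule does not use to None, as in
    the conditional search-space definition."""
    active = ACTIVE[config["rule"]] | {"rule", "hidden_dim"}
    return {k: (v if k in active else None) for k, v in config.items()}
```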
MNIST  FMNIST  EMNIST 

97.45  94.32  97.34 
4.2 Incremental Learning
In this experiment, we demonstrate the learning capability of our approach in the class-incremental learning scenario, where multiple episodes are present per learning curriculum. We evaluate our method on the SplitMNIST (Farquhar and Gal, 2018) and PermutedMNIST (Goodfellow et al., 2013) incremental learning benchmarks that have been extensively adopted in the literature (Shin et al., 2017; Zenke et al., 2017; Nguyen et al., 2017). The SplitMNIST data is prepared by splitting the original MNIST dataset (both training and testing splits), consisting of ten digits, into five two-class classification tasks. This defines an incremental learning scenario in which the model sees these five tasks incrementally, one after the other. PermutedMNIST is also prepared from the original MNIST data, by permuting the pixels in the images. A unique permutation is applied to all the images in order to generate a new ten-class classification task, and the number of distinct permutations applied determines the length of the task sequence. We adopt a ten-permutation PermutedMNIST, which produces ten (ten-class classification) tasks that are seen sequentially by the model. The first task is the original MNIST data, and the subsequent nine tasks are obtained by permuting the original data.
Following (Hsu et al., 2018), for SplitMNIST a simple two-layer multilayer perceptron (MLP) with 400 neurons per layer is adopted to evaluate the baseline accuracy, as well as the accuracy with the other incremental learning algorithms. Similarly, for PermutedMNIST we use a two-layer MLP with 1,000 neurons per layer to evaluate the accuracy of the baseline and the other incremental learning algorithms. See (Hsu et al., 2018) for more details on the hyperparameters used for both models. The first baseline is a naive approach in which the MLP network (with the Adagrad optimizer) is trained by progressively updating the parameters through backpropagation as new tasks are observed. The second baseline is a naive rehearsal (experience replay) approach that randomly stores a fraction of the data from previous tasks and uses it while training incrementally. Among the popular incremental learning approaches, we choose representative ones from the regularization-based methods (Online EWC (Schwarz et al., 2018), SI (Zenke et al., 2017), MAS (Aljundi et al., 2018)) and the memory-based methods (GEM (Lopez-Paz and Ranzato, 2017), DGR (Shin et al., 2017), Rtf (van de Ven and Tolias, 2018)). We also compare the non-incremental learning scenario with the corresponding baseline models for both datasets as well as with our multilayer neuromodulated learning approach. The results of the incremental learning experiments are shown in Table 3.

For the non-incremental learning case, where the data from all tasks are provided at the same time, the baseline MLP model is trained for 4 epochs on SplitMNIST and for 10 epochs on PermutedMNIST, following (Hsu et al., 2018). We use the best-performing parameter configuration for the MNIST data obtained in the non-incremental learning experiments (Table 1) to evaluate the performance of our architecture on these two datasets. This transfer of parameter configurations from the non-incremental to the incremental learning scenario is discussed further in the next section. With this parameter configuration, the accuracy of our model on SplitMNIST trained for 4 epochs and on PermutedMNIST trained for just 1 epoch is very close to that of the baseline MLP model.
For class-incremental learning on SplitMNIST and PermutedMNIST, our model (using the MNIST parameter configuration from Table 1) significantly outperforms both the baseline and the other non-memory-based incremental learning approaches considered. In addition, we obtain accuracy comparable to the state-of-the-art memory-based models on the SplitMNIST data and outperform all the other incremental learning algorithms on the PermutedMNIST data with only three epochs (as compared with ten epochs for all the other algorithms). The naive rehearsal approach outperforms all the other approaches, but it comes with an additional memory overhead.
4.3 Transferability Study
Identifying the best configuration for each group of classes (a dataset) is based on the assumption that all the data is available at the beginning of the training procedure. However, for many online learning scenarios, both incremental and non-incremental, this configuration might not be known a priori. This raises the question of transfer meta-learning: how to effectively optimize a learning algorithm to learn unknown tasks and data. To address this problem, we empirically study the effect of transferring optimal configurations across tasks for the incremental learning scenario, and across datasets for the non-incremental learning case.
Across tasks: The best configuration obtained through the joint optimization on each dataset separately (Table 1) is used to evaluate the incremental learning accuracy on the SplitMNIST and PermutedMNIST datasets. The results using the configurations corresponding to MNIST, FMNIST, and EMNIST are shown in Table 4. We observe that the configuration learned on the MNIST dataset is readily transferable to incremental learning on the SplitMNIST and PermutedMNIST data, as seen in the preceding section. The configuration learned with the EMNIST data leads to an accuracy difference of only 0.71 and 0.50 points, respectively, on these incremental learning datasets. The FMNIST configuration, on the other hand, leads to accuracy drops of 0.98 and 4.14 points, respectively. This observation suggests that transferring a learning configuration (both the local learning rule and its corresponding parameters) to the incremental learning experiments needs to be done carefully; otherwise it can lead to suboptimal accuracy.
Across datasets: We seek to understand and characterize the dependence between dataset similarity (defined through a distance metric) and the transferability of optimal configurations learned on one dataset to other (different) datasets. To this end, we choose the best configuration obtained through the joint optimization on each dataset (Table 1) and use it for transfer learning, evaluating the accuracy on the remaining datasets. The results for transfer learning across MNIST, FMNIST, and EMNIST are summarized in Table 4, where a column represents the accuracy on a particular dataset obtained by transfer learning through the optimal configurations learned on the standalone MNIST, FMNIST, and EMNIST datasets. The diagonal elements are the same as the accuracy values obtained previously for four-epoch runs with the matching dataset-configuration combination (Table 2). We find that the configurations learned on MNIST and EMNIST are readily exchangeable without significant loss in accuracy. These configurations, however, do not transfer well to the FMNIST dataset, as seen by the accuracy decreases of 11.45 and 11.12 points, respectively, with the MNIST and EMNIST configurations. Conversely, the FMNIST configuration leads to decreases in accuracy of 5.53 and 3.75 points, respectively, when transferred to the MNIST and EMNIST data.
Non-Incremental Learning  Class-Incremental Learning

Config (rows), Dataset (columns)  MNIST  FMNIST  EMNIST  SplitMNIST  PMNIST
MNIST  97.35  82.87  97.69  92.76  96.62 
FMNIST  91.82  94.32  93.59  91.78  92.48 
EMNIST  96.93  83.20  97.34  93.47  96.12 
Distance metrics as a measure of transferability: To rationalize the results shown in Table 4, we have explored the correlation between the drop in performance between datasets and both their effective dimensionality and the distance between datasets. We focus on a definition of dimensionality derived from the covariance matrix of the complete dataset. The eigenvalues $\lambda_i$ of the covariance matrix yield the principal components of the dataset. We consider as a metric of dimensionality the normalized sum of squares of the eigenvalues, so that

$$D = \frac{\left(\sum_i \lambda_i\right)^2}{\sum_i \lambda_i^2} \qquad (6)$$

This metric has been used in the past to characterize sparse representations (Litwin-Kumar et al., 2017). The interpretation of this metric is that when the variance is equally distributed across all dimensions, Eq. 6 yields a dimension $D = n$, where $n$ is the number of dimensions in the space (784 in this case). If all the variance is concentrated in a single eigenvalue, then $D = 1$.
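Computing this dimensionality from the covariance eigenvalues is a one-liner; the sketch below verifies the two limiting cases described in the text (equal variance gives $D = n$, fully concentrated variance gives $D = 1$).

```python
def eigen_dimension(eigenvalues):
    """Dimensionality of Eq. (6):
    D = (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    s = sum(eigenvalues)
    s2 = sum(v * v for v in eigenvalues)
    return s * s / s2
```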
Data  MNIST  FMNIST  EMNIST
Eigen-Dim  30.69  7.91  27.09
To quantify the distance and separability between categories in the dataset, we have used the cosine distance metric. In Table 6 we have applied the traditional definition of distance between two clusters and calculated the minimum distance between any two categories of two datasets. We also show the maximum separation for comparison. When these metrics are used within a given dataset, we obtain the minimum and maximum distance between two categories of the same dataset, which could be interpreted as a measure of separability.
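The (minimum, maximum) pairs reported below can be computed as sketched here. Reading "distance between two clusters" as the cosine distance between category centroids is our assumption; the source does not specify the cluster representative.

```python
import math

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def min_max_category_distance(centroids_a, centroids_b):
    """Minimum and maximum cosine distance between any category of dataset A
    and any category of dataset B; with A == B this measures separability
    within a single dataset."""
    dists = [cosine_distance(u, v) for u in centroids_a for v in centroids_b]
    return min(dists), max(dists)
```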
Data  MNIST  FMNIST  EMNIST 

MNIST  (0.073,0.55)  (0.17,0.54)  (0.044,0.59) 
FMNIST  (0.012,0.56)  (0.13,0.57)  
EMNIST  (0.098,0.52) 
Figure 3 shows the drop in classification accuracy during transfer meta-learning as a function of a transfer coefficient obtained from two different metrics: the relative difference in the eigenvalue dimension given by Eq. 6 and the minimum cosine distance between the two datasets. Both values are normalized as

$$\tau = \frac{M_{\mathrm{exp}} - M_{\mathrm{conf}}}{M_{\mathrm{exp}}} \qquad (7)$$

where $M_{\mathrm{exp}}$ is the metric of the dataset selected to run the experiment and $M_{\mathrm{conf}}$ is the metric of the dataset whose optimal configuration is used. Note that this transfer coefficient is not symmetric because of the normalization in Eq. 7, as should be expected, since transfer meta-learning is directional. The results clearly show that as the distance between the datasets increases, transferring the configurations across them leads to a decrease in accuracy.
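The transfer coefficient and its asymmetry can be illustrated directly with the eigen-dimensions from Table 5. Reading the normalization of Eq. (7) as the difference between the two metrics divided by the metric of the experiment dataset is our assumption.

```python
def transfer_coefficient(m_exp, m_conf):
    """Relative mismatch between the metric of the dataset being learned
    (m_exp) and the metric of the dataset whose optimal configuration is
    transferred (m_conf). Normalizing by m_exp makes it asymmetric."""
    return (m_exp - m_conf) / m_exp

# Eigen-dimensions from Table 5.
DIM = {"MNIST": 30.69, "FMNIST": 7.91, "EMNIST": 27.09}
```

For example, transferring the MNIST configuration to FMNIST and the reverse give different coefficients, matching the directional nature of transfer meta-learning noted above.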
5 Related Work
Incremental learning, also referred to as continual or lifelong learning (Thrun and Pratt, 2012), describes a learning modality in which a model seeks to learn from data and tasks that are sequentially presented to it. Several incremental learning approaches have been presented in the literature; they can loosely be categorized into three classes: (1) novel neural architectures or customization of common ones; (2) regularization strategies that impose constraints to boost knowledge retention; and (3) meta-learning, which uses a series of tasks to learn a common parameter configuration that is easily adaptable to new tasks. Algorithms in the first category include a bio-inspired dual-memory architecture (Parisi et al., 2019); progressive neural networks (Rusu et al., 2016), which explicitly support information transfer across sequences of tasks through network expansion; and deep generative replay (Shin et al., 2017), which proposed a cooperative dual-model architecture, inspired by the hippocampus, that retains past knowledge through the concurrent replay of generated pseudo-data. The second category consists of algorithms such as elastic weight consolidation (EWC) (Kirkpatrick et al., 2017), which computes synaptic importance using a Fisher-information-matrix-based regularization; synaptic intelligence (SI) (Zenke et al., 2017), whose regularization penalty is similar to EWC's but is computed online at the per-synapse level; and learning without forgetting (Li and Hoiem, 2017), which applies a distillation loss on attention-enabled deep networks, seeking to minimize task overlap. The third category consists of algorithms such as online meta-learning (Javed and White, 2019), a neuromodulated meta-learning algorithm (Beaulieu et al., 2020), and incremental task-agnostic meta-learning (Rajasegaran et al., 2020), which show great promise but are not particularly suited for the memory-constrained continual learning scenarios relevant to edge computing. These algorithms are characterized by large network sizes and memory buffer requirements and make a few restrictive assumptions about the structure of the data; hence we do not include a comparison with this class of approaches. The multilayer neuromodulated architecture we propose falls under the first class of continual learning methods discussed earlier.

6 Conclusions
We developed a multilayer, recurrent neuromorphic architecture capable of online continual learning in a memory-constrained setting, where large models and data storage/replay are not feasible. The proposed architecture consists of a processing component and a learning component, where learning can be viewed as a dynamic process that alters the architecture itself via recurrent interactions as it processes inputs, assigns valence to certain inputs, and creates associations over time. Our approach parameterizes multiple synaptic plasticity mechanisms as layers of this network sharing a common interface. This allows us to cast the optimization of the architecture's learning capabilities as an optimization problem and to employ a Bayesian-optimization-based search to find optimal task-specific configurations in the mixed (categorical, continuous, integer) space spanning the choice of learning algorithm, its specific hyperparameters, and the feature extraction layers.
We demonstrate our approach using two different learning scenarios. The first is a non-incremental learning scenario, where the learning curriculum consists of a single episode that has access to all tasks, and the model is updated online as the data streams in. The second is a class-incremental learning scenario, where data from each task is presented sequentially such that each episode in the learning curriculum consists of a unique task. The model is then updated online in each episode and continually over a sequence of these episodes. In the non-incremental learning case, our algorithm identified configurations capable of obtaining online learning accuracies of 97.45 for MNIST, 94.32 for Fashion MNIST, and 97.34 for Extended MNIST, which were on par with (or outperformed) the performance of static shallow neural networks run for four epochs. The optimal configurations in our architecture were obtained by imposing a limit on the number of epochs used to learn each dataset, which was computationally cheap and led to good configurations. In the class-incremental learning case, we obtained accuracies of 92.76 and 96.62 on the SplitMNIST and PermutedMNIST data, which consisted of five and ten sequential tasks, respectively. Hence, our approach clearly outperformed the memory-free approaches. The memory-replay-based continual learning approaches achieved their best accuracies with Rtf and GEM using four and ten epochs of training, respectively, on the two datasets, whereas our approach produced an accuracy of 92.76 with four epochs on SplitMNIST and 96.62 with one epoch on PermutedMNIST. These results demonstrate that memoryless approaches such as the one proposed here can achieve performance on par with the memory-replay-based approaches used as benchmarks, without the additional memory overhead. This suggests the need to identify more challenging continual learning assays.
Our approach also allowed us to explore the transferability of optimal learning conditions across datasets and tasks, in order to understand the interplay between task heterogeneity and continual learning across very different tasks. We demonstrated through systematic experiments that the accuracy of this transfer meta-learning to previously unseen datasets can be largely explained through a transfer coefficient based on metrics of dimensionality and distance between datasets.
Acknowledgements
This work was supported through the Lifelong Learning Machines (L2M) program from DARPA/MTO. The material is also based in part on work supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. We gratefully acknowledge the computing resources provided on Bebop (and/or Blues), a high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory.
References

Aljundi et al. (2018). Memory aware synapses: learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 139-154.

Balaprakash et al. (2018). DeepHyper: asynchronous hyperparameter search for deep neural networks. In 2018 IEEE 25th International Conference on High Performance Computing (HiPC), pp. 42-51.

Beaulieu et al. (2020). Learning to continually learn. arXiv preprint arXiv:2002.09571.

Cohen et al. (2017). EMNIST: extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921-2926.

Dayan. Cambridge, Mass. ISBN 0262041995.

Farquhar and Gal (2018). Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733.

Goodfellow et al. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211.

Hige et al. (2015). Heterosynaptic plasticity underlies aversive olfactory learning in Drosophila. Neuron 88(5), pp. 985-998. doi: 10.1016/j.neuron.2015.11.003.

Hsu et al. (2018). Re-evaluating continual learning scenarios: a categorization and case for strong baselines. arXiv preprint arXiv:1810.12488.

Javed and White (2019). Meta-learning representations for continual learning. In Advances in Neural Information Processing Systems, pp. 1818-1828.

Kirkpatrick et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13), pp. 3521-3526.

Li and Hoiem (2017). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12), pp. 2935-2947.

Litwin-Kumar et al. (2017). Optimal degrees of synaptic connectivity. Neuron 93(5), pp. 1153-1164. doi: 10.1016/j.neuron.2017.01.030.

Lopez-Paz and Ranzato (2017). Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467-6476.

Madireddy et al. (2019). Neuromorphic architecture optimization for task-specific dynamic learning. In Proceedings of the International Conference on Neuromorphic Systems, pp. 1-5.

Miconi et al. (2018). Differentiable plasticity: training plastic neural networks with backpropagation. CoRR abs/1804.02464.

Nguyen et al. (2017). Variational continual learning. arXiv preprint arXiv:1710.10628.

Oja (1982). Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology 15(3), pp. 267-273.

Parisi et al. (2019). Continual lifelong learning with neural networks: a review. Neural Networks.

Rajasegaran et al. (2020). iTAML: an incremental task-agnostic meta-learning approach. arXiv preprint arXiv:2003.11652.

Rusu et al. (2016). Progressive neural networks. arXiv preprint arXiv:1606.04671.

Schwarz et al. (2018). Progress & compress: a scalable framework for continual learning. In International Conference on Machine Learning, pp. 4528-4537.

Shin et al. (2017). Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pp. 2990-2999.

Thrun and Pratt (2012). Learning to learn. Springer Science & Business Media.

van de Ven and Tolias (2018). Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635.

Xiao et al. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.

Xiao et al. (2017). Fashion-MNIST repository. https://github.com/zalandoresearch/fashion-mnist.git.

Yanguas-Gil (2019). The insect brain as a model system for low power electronics and edge processing applications. In 2019 IEEE Space Computing Conference (SCC), pp. 60-66.

Yanguas-Gil et al. (2019). Memristor design rules for dynamic learning and edge processing applications. APL Materials 7(9), 091102.

Zenke et al. (2017). Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3987-3995.