Introduction and historical context
Several definitions of intelligence have emerged in machine learning literature in recent years (Legg and Hutter, 2007; Tenenbaum et al., 2011). Although many of these differ in particularities, one consistent requirement is that an intelligent agent must have the capacity to accumulate and maintain task proficiency across experiences (Hinton and others, 1986; Hassabis et al., 2017). In contrast to biological agents, existing neural network approaches have proved lacking in this regard, with sequential task learning often resulting in catastrophic forgetting of previously encountered tasks (French, 1999; McCloskey and Cohen, 1989; Ratcliff, 1990). A proliferation of research seeking to alleviate the catastrophic forgetting problem has emerged in recent years, motivated by the requirement for machine learning pipelines to accumulate and analyse vast data streams in real-time (Parisi et al., 2019; Hadsell et al., 2020). Despite significant progress being made through such research, both in theory and application, the sub-field of continual learning research is vast, and therefore benefits from clarification and unifying critical appraisal. Simple categorisation of these approaches according to network architecture, regularisation, and training paradigm proves useful in structuring the literature of this increasingly important sub-field. Furthermore, many existing approaches to alleviating catastrophic forgetting in neural networks draw inspiration from neuroscience (Hassabis et al., 2017). In this review, both of these issues will be addressed, providing a broad critical appraisal of current approaches to continual learning, while interrogating the extent to which insight might be provided by the rich literature of learning and memory in neuroscience.
The influence of network architecture on task performance has been widely described in machine learning (LeCun et al., 2015), and represents a fruitful area of continual learning research (Parisi et al., 2019; Kemker and Kanan, 2017). In particular, attention has focused on network architectures which dynamically reconfigure in response to increasing training data availability, primarily by recruiting the training of additional neural network units or layers.
Progressive neural networks are one such example, employing a dynamically expanding neural network architecture (Rusu et al., 2016). For each subsequent task, this model recruits an additional neural network which is trained on that task, while transfer of learned knowledge across tasks is facilitated by learned ‘lateral’ connections between the constituent networks (Figure 1A). Together, this alleviates catastrophic forgetting across a range of reinforcement learning benchmarks, such as Atari games, and compares favourably with baseline methods which leverage pre-training or fine-tuning of model parameters (Figure 1B,C). Despite the empirical successes of progressive networks, an obvious conceptual limitation is that the number of network parameters grows with the number of tasks experienced: for sequential training on n tasks, n constituent networks (and their lateral connections) must be maintained, so the broader applicability of this method remains unclear.
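To make the column-and-lateral-connection idea concrete, the following is a minimal sketch (not the authors' implementation; the single hidden layer, function names, and ReLU choice are all illustrative) of how a new column's forward pass might combine its own trainable weights with lateral input from a frozen, previously trained column:

```python
# Illustrative sketch of a progressive-network layer: a new task's
# "column" receives both the raw input and the frozen activations of
# the previous column via learned lateral weights.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def relu(x):
    return [max(0.0, v) for v in x]

def progressive_forward(x, column_weights, lateral_weights, prev_hidden):
    """One hidden layer of a newly added column.

    column_weights  : input weights of the new (trainable) column
    lateral_weights : learned adapters from the frozen previous column
    prev_hidden     : hidden activations of the frozen previous column
    """
    own = [dot(row, x) for row in column_weights]
    lateral = [dot(row, prev_hidden) for row in lateral_weights]
    return relu([o + l for o, l in zip(own, lateral)])
```

Because the previous column's weights are never updated, its task performance is preserved by construction; only the new column and the lateral adapters are trained.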
An alternative approach was proposed by Draelos et al. (2017), which takes direct inspiration from hippocampal neurogenesis, a well-described phenomenon from neuroscience in which new neurons are formed in adulthood, raising intriguing questions of learning and plasticity in both natural and artificial intelligence settings (Aimone et al., 2014). The method proposed in this paper, termed neurogenesis deep learning, is conceptually similar to that of progressive neural networks, but in this instance involves additional neurons in deep neural network layers being recruited as the network is trained on subsequent tasks. Draelos et al. (2017) implement this as an autoencoder trained on the MNIST dataset of handwritten digits. As a greater range of digits is added incrementally to the training distribution, units are added in parallel to the autoencoder, thereby giving rise to the ‘neurogenesis’ in this dynamic network architecture. The autoencoder network in this instance preserves weights associated with previously learned tasks using a form of replay, while the reconstruction error provides an indication of how well representations of previously learned digits are preserved across learning of subsequent digits (that is, subsequent task learning). This paper presents an elegant idea for mitigating catastrophic forgetting, but further experiments are required to fully appraise its potential. For instance, the incremental training data used in this paper consist solely of discrete, one-hot categories, rather than the more challenging (and more naturalistic) scenario of novel data accumulating gradually, or without clear boundaries.
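The core ‘neurogenesis’ operation can be sketched very simply (a hypothetical helper, not the paper's code; the constant initialisation and the returned frozen-index set are illustrative):

```python
# Hypothetical sketch of 'neurogenesis' in one dense layer: new units
# are appended as the data distribution grows, while the indices of
# pre-existing units are recorded so they can be protected (e.g. via
# replay) during further training.

def grow_layer(weights, n_new, n_inputs, init=0.01):
    """weights: list of per-unit input-weight rows.

    Returns the grown layer plus the indices of the original units.
    """
    frozen = set(range(len(weights)))
    grown = weights + [[init] * n_inputs for _ in range(n_new)]
    return grown, frozen
```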
Both of the approaches discussed so far have involved dynamic network architectures, but nonetheless ones in which networks or units are recruited and incorporated in response to subsequent tasks. An alternative method has been advanced by Cortes et al. (2017), in which no network architecture is explicitly pre-specified. Instead, the proposed AdaNet algorithm adaptively selects both the optimal network architecture and weights for the given task. When tested on binary classification tasks drawn from the popular CIFAR-10 image recognition dataset, this approach performed well, with the algorithm automatically learning appropriate network architectures for the given task. Although AdaNet has not been tested exhaustively in the context of continual learning, it represents an appealing method of dynamically reconfiguring the network to mitigate catastrophic forgetting with subsequent tasks. Overall, some combination of these approaches – a dynamic network architecture and an algorithm for automatically inferring the optimal architecture for newly encountered tasks – might offer novel solutions to the continual learning problem.
Imposing constraints on the neural network weight updates is another major area of continual learning research (Goodfellow et al., 2013). Such regularisation approaches have proved popular in recent years, and many derive inspiration from models of memory consolidation in theoretical neuroscience (Fusi et al., 2005; Losonczy et al., 2008).
Learning without forgetting
Learning without forgetting (LwF) is one such proposed regularisation method for continual learning (Li and Hoiem, 2017) and draws on knowledge distillation (Hinton et al., 2015). Proposed by Hinton and colleagues, knowledge distillation is a technique in which the learned knowledge from a large, regularised model (or ensemble of models) is distilled into a model with many fewer parameters (the details of this technique, however, are beyond the scope of this work). This concept was subsequently employed in the LwF algorithm to provide a form of functional regularisation, whereby the outputs of the network on previously learned tasks are encouraged to remain similar to those produced before training on novel tasks. Informally, LwF takes a ‘snapshot’ of the network’s responses before training on new tasks, and penalises deviation from this snapshot during subsequent training. In Li and Hoiem (2017), this was implemented as a convolutional neural network, in which only novel task data were used to train the network, while the ‘snapshot’ of the prior network preserved good performance on previous tasks. This approach has garnered significant attention in recent years, and offers a novel perspective on the use of knowledge distillation techniques in alleviating catastrophic forgetting. However, Learning without Forgetting has some notable limitations. Firstly, it is highly influenced by task history, and is thus susceptible to forming sub-optimal representations for novel tasks. Indeed, balancing stability of existing representations with the plasticity required to efficiently learn new ones is a major unresolved topic of research in continual learning. A further limitation of LwF is that, due to the nature of the distillation protocol, training time for each subsequent task increases linearly with the number of tasks previously learned. In practice, this limits the capacity of the technique to handle pipelines of training data in which novel tasks are encountered regularly.
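A distillation-style objective of this general shape can be sketched as follows (a hedged illustration, not Li and Hoiem's implementation; the temperature T, weighting λ, and function names are assumptions):

```python
import math

# Sketch of an LwF-style objective: the task loss on new data plus a
# distillation term tying the current network's outputs to those
# recorded from the frozen pre-training 'snapshot' network.

def softmax(logits, T=1.0):
    m = max(logits)                      # subtract max for stability
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(new_logits, old_logits, T=2.0):
    # cross-entropy between temperature-softened old and new outputs
    p_old = softmax(old_logits, T)
    p_new = softmax(new_logits, T)
    return -sum(po * math.log(pn) for po, pn in zip(p_old, p_new))

def lwf_loss(task_loss, new_logits, old_logits, lam=1.0, T=2.0):
    return task_loss + lam * distillation_loss(new_logits, old_logits, T)
```

When the current network reproduces the snapshot's outputs exactly, the distillation term reduces to the entropy of the softened distribution, its minimum over matched outputs.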
Elastic weight consolidation
In recent years, one of the most prominent regularisation approaches to prevent catastrophic forgetting is that of elastic weight consolidation (EWC) (Kirkpatrick et al., 2017). EWC, which is suitable for supervised and reinforcement learning paradigms, takes direct inspiration from neuroscience, where synaptic consolidation is thought to preserve sequential task performance by consolidating the most important features of previously encountered tasks (Yang et al., 2009). Intuitively, EWC works by slowing learning of the network weights which are most relevant for solving previously encountered tasks. This is achieved by applying a quadratic penalty to the difference between the parameters of the prior and current network weights, with the objective of preserving or consolidating the most task-relevant weights. It is this quadratic penalty, with its ‘elastic’ preservation of existing network weights, which takes inspiration from synaptic consolidation in neuroscience, and is schematically represented in Figure 2A. More formally, the loss function of EWC, L(θ), is given by:

L(θ) = L_B(θ) + Σ_i (λ/2) F_i (θ_i − θ*_{A,i})²

where θ represents the parameters of the network, L_B(θ) represents the loss for task B, λ is a hyperparameter indicating the relative importance of previously encountered tasks compared to new tasks, F is the Fisher information matrix, and finally θ*_{A,i} represents the trainable parameters of the network important for solving the previously encountered task A. Intuitively, this loss function can be understood as penalising large differences between previous and current network weights (the term within the brackets). In EWC, the Fisher information matrix is used to estimate the importance of weights for solving tasks, with an importance weighting proportional to the diagonal of the Fisher information matrix over the old parameters for the previous task. While conceptually elegant, this presents a notable limitation of EWC: exact computation of the Fisher diagonal has complexity linear in the number of outputs, limiting the applicability of this method to low-dimensional output spaces.
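The quadratic penalty itself is straightforward to compute once a Fisher-diagonal estimate is available; the following sketch implements it directly from the loss function above (function names are illustrative, and estimating the Fisher diagonal itself is omitted):

```python
# Sketch of the EWC quadratic penalty: (λ/2) Σ_i F_i (θ_i − θ*_A,i)²,
# where fisher_diag holds the per-parameter importance estimates and
# theta_star the parameters consolidated after the previous task.

def ewc_penalty(theta, theta_star, fisher_diag, lam):
    return 0.5 * lam * sum(
        f * (t - ts) ** 2
        for f, t, ts in zip(fisher_diag, theta, theta_star)
    )

def ewc_loss(task_b_loss, theta, theta_star, fisher_diag, lam):
    # total loss = loss on the current task B + consolidation penalty
    return task_b_loss + ewc_penalty(theta, theta_star, fisher_diag, lam)
```

Note that parameters with a near-zero Fisher estimate are effectively free to change, while high-importance parameters are ‘elastically’ anchored to their previous values.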
Empirically, EWC performs well on supervised and reinforcement learning tasks, such as MNIST digit classification and sequential Atari 2600 games, respectively. However, the suitability of the quadratic penalty term (which is only derived for the two-task case in the original paper) has been questioned for cases of multi-task learning (Huszár, 2018). Additionally, building on the strong empirical results presented in the paper, Kemker et al. (2018) demonstrated in further experiments that EWC fails to learn new classes incrementally in response to accumulating training data (Figure 2C). Overall, EWC is a promising method for alleviating catastrophic forgetting, but several details regarding its broader applicability (and theoretical underpinning) remain unresolved.
An approach closely related to EWC is that of synaptic intelligence (Zenke et al., 2017). In this method, however, individual synapses (the connections between neurons or units) estimate their own importance in solving a given task. Such ‘intelligent’ synapses can then be preserved for subsequent tasks by penalising weight updates, thereby mitigating catastrophic forgetting of previously learned tasks. Intuitively, synaptic intelligence can be considered a mechanism of anchoring network weights relevant for previous tasks to their existing values, and decelerating updates of these weights to prevent over-writing of previous task performance.
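The per-synapse importance estimate can be sketched as follows (a hedged simplification of Zenke et al.'s path-integral measure; the variable names and the damping constant xi are illustrative):

```python
# Sketch of synaptic-intelligence importance: each weight accumulates
# the path integral of gradient × parameter update over training, and
# the total is normalised by how far that weight actually moved.

def accumulate_omega(omega, grads, deltas):
    # one training step: add -g_i * Δθ_i to each running contribution
    return [w - g * d for w, g, d in zip(omega, grads, deltas)]

def si_importance(omega, theta_end, theta_start, xi=0.1):
    # xi damps the estimate for weights that barely changed
    return [w / ((te - ts) ** 2 + xi)
            for w, te, ts in zip(omega, theta_end, theta_start)]
```

Weights whose updates consistently reduced the loss accumulate large importance, and a quadratic penalty (as in EWC) then anchors them for subsequent tasks.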
In summary, numerous regularisation methods have been developed to aid continual learning. By modulating gradient-based weight updates, these methods aim to preserve the performance of the model across multiple tasks trained in sequence. Many such regularisation methods have garnered interest from the research community, due to both their theoretical appeal and strong empirical validation. Ultimately, however, none of these approaches has offered a comprehensive solution to the continual learning problem, and it is likely that a deeper understanding of credit assignment within neural networks will drive this research further.
Beyond the model itself, the training regime employed is critical to sequential task performance, and represents a rich avenue of continual learning research. Although numerous training paradigms have been described, this review will focus on those most directly developed to alleviate catastrophic forgetting.
Historically, limitations of dataset size have motivated the study of transfer learning for machine learning systems, with the aim of initially training ANNs on large datasets before transferring the trained network parameters to other tasks (Bengio, 2012; Higgins et al., 2016). One influential 2014 study demonstrated the efficacy of transfer learning for computer vision across a range of benchmark datasets, and found that successive layers in the hierarchy of the neural network display representational features reminiscent of human visual cognition (such as edge detection in lower layers and image-specific feature detection in higher layers).
Successes of transfer learning in few-shot learning paradigms, which aim to perform well upon first presentation of a novel task, have been well-described in recent years (Palatucci et al., 2009; Vinyals et al., 2016). However, translation of this potential into alleviating catastrophic forgetting has proved more challenging. One of the earliest attempts to leverage transfer for continual learning was implemented in the form of a hierarchical neural network, termed CHILD, trained to solve increasingly challenging reinforcement learning problems (Ring, 1998). This model was not only capable of learning to solve complex tasks, but also demonstrated a degree of continual learning – after learning nine task structures, the agent could still successfully perform the first task when returned to it. The impressive (and perhaps overlooked) performance of CHILD draws on two main principles: firstly, transfer of previously learned knowledge to novel tasks; secondly, incremental addition of network units as more tasks are learned. In many ways, this model serves as a precursor of progressive neural networks (Rusu et al., 2016), and offers an appealing account of transfer learning in aiding continual task learning.
In parallel to this, curriculum learning has gained attention as a means of improving continual learning capacities in ANNs (Bengio et al., 2009; Graves et al., 2016). Broadly, curriculum learning can be defined as the phenomenon by which both natural and artificial intelligence agents learn more efficiently when the training dataset contains some inherent structure, and when exemplars provided to the learning system are organised in a meaningful way (for instance, progressing in difficulty) (Bengio et al., 2009). When trained on datasets such as MNIST, curriculum learning both accelerates the learning process (as measured by time or training steps to reach the global minimum) and helps prevent catastrophic forgetting (Bengio et al., 2009). However, one limitation of curriculum learning in ANNs is the assumption that task difficulty can be represented in a linear and uniform manner (often described as a ‘single axis of difficulty’), disregarding the nuances of each task structure. Nevertheless, curriculum learning is a promising, and underexplored, avenue of research for better continual learning performance in neural networks.
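The simplest instantiation of such a curriculum, corresponding to the ‘single axis of difficulty’ assumption criticised above, is just to sort the training stream by a scalar difficulty score (an illustrative sketch; real curricula are often staged or adaptive rather than a one-off sort):

```python
# Minimal curriculum: present examples from easiest to hardest under a
# user-supplied difficulty function, instead of in random order.

def curriculum_order(examples, difficulty):
    return sorted(examples, key=difficulty)
```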
The neural phenomenon of hippocampal (or, more generally, memory) replay has recently garnered attention with respect to the design of ANNs (Skaggs and McNaughton, 1996; Shin et al., 2017; Kemker et al., 2018). Shin et al. (2017) designed the model most obviously inspired by hippocampal replay, with a training regime termed Deep Generative Replay. This comprises a generative model and a task-directed model, the former of which is used to generate representative data from previous tasks, from which a sample is selected and interspersed with the dataset of the new task. In this way, the model mimics hippocampal replay, drawing on statistical regularities from previous experiences when completing novel tasks (Liu et al., 2019). This approach, which shares some conceptual similarities to the model proposed by Draelos et al. (2017), displays a substantial improvement in continual learning compared to complexity-matched models lacking replay.
Several other implementations of replay in neural networks have also been described, including a straightforward experience replay buffer of all prior events for a reinforcement learning agent (Rolnick et al., 2018). This method, called CLEAR, attempts to address the stability-plasticity tradeoff of sequential task learning, using off-policy learning and replay-based behavioural cloning to enhance stability, while maintaining plasticity via on-policy learning. This outperforms existing deep reinforcement learning approaches with respect to catastrophic forgetting, but might prove unsuitable in cases where storage of a complete memory buffer is intractable.
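The buffer-based interleaving at the heart of such methods can be sketched as follows (a hedged illustration in the spirit of experience replay; the 50/50 split, function name, and sampling scheme are assumptions, not the published CLEAR implementation):

```python
import random

# Sketch of replay interleaving: each training batch mixes freshly
# collected experience with samples drawn from a buffer of past
# experience, so old tasks keep contributing gradient signal.

def make_batch(new_data, buffer, batch_size, replay_frac=0.5, seed=0):
    rng = random.Random(seed)
    n_replay = min(int(batch_size * replay_frac), len(buffer))
    n_new = batch_size - n_replay
    batch = rng.sample(new_data, n_new) + rng.sample(buffer, n_replay)
    rng.shuffle(batch)  # interleave replayed and fresh experiences
    return batch
```

The replay fraction directly controls the stability-plasticity trade-off the paragraph above describes: more replay protects old tasks, more fresh data speeds learning of the new one.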
Drawing on the successes of replay in alleviating catastrophic forgetting, Hayes et al. (2020) proposed a novel method which more accurately reflects the role of memory replay as described by modern neuroscience, implemented as a convolutional neural network. Standard replay approaches for convolutional neural networks (CNNs) rely on presenting raw pixel input from previously encountered data interleaved with novel data. Although effective, this approach is both biologically implausible and memory-intensive. By contrast, the REMIND (replay using memory indexing) algorithm proposed in this paper replays a compressed representation of previously encountered training data, thereby enabling efficient replay and online training. This approach is directly inspired by hippocampal indexing theory (Teyler and Rudy, 2007), and proposes a solution to the issue with prior replay implementations of having to store raw training data from previous tasks. In REMIND, this compression of input data is achieved using layers of the neural network, where the raw input data is compressed into a lower-dimensional tensor representation (for instance, in the context of CNNs, a feature map) for replay (see Figure 3C). Hayes et al. implement this using product quantization, a technique for compressing data which has a significantly lower reconstruction error compared to other methods (the details of product quantization are beyond the scope of this review, but see Jegou et al. (2010) for a comprehensive account). This compression proves highly effective in maximising memory efficiency: REMIND can store one million compressed representations compared to just 20,000 raw inputs in alternative models matched for memory capacity. Empirically, replay of compressed training data was shown to confer strong benefits, whereby REMIND outperforms constraint-matched methods on incremental class learning tasks derived from the ImageNet dataset (Deng et al., 2009).
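The encode/decode step of product quantization can be illustrated with a toy sketch (hand-written codebooks are used here purely for illustration; in practice, and in REMIND, codebooks are learned, e.g. with k-means, and operate on CNN feature maps):

```python
# Toy product quantization: split a vector into sub-vectors and store
# each as the index of its nearest codebook centroid. Storage per
# vector shrinks from len(vec) floats to len(codebooks) small integers.

def pq_encode(vec, codebooks):
    d = len(vec) // len(codebooks)  # sub-vector dimensionality
    codes = []
    for j, book in enumerate(codebooks):
        sub = vec[j * d:(j + 1) * d]
        codes.append(min(
            range(len(book)),
            key=lambda k: sum((s - c) ** 2 for s, c in zip(sub, book[k]))))
    return codes

def pq_decode(codes, codebooks):
    # reconstruct (approximately) by concatenating chosen centroids
    out = []
    for j, k in enumerate(codes):
        out.extend(codebooks[j][k])
    return out
```

It is this many-fold reduction in per-sample storage that lets REMIND hold vastly more replay entries than raw-input buffers of the same memory footprint.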
Can inspiration be drawn from neuroscience?
Neuroscience (in particular, cognitive neuroscience) and artificial intelligence have long been intertwined both in aetiology and research methodologies (Hassabis et al., 2017; Hinton and others, 1986), and this intersection has already inspired numerous approaches to continual learning research. Many of the methods described previously in this review draw inspiration from neuroscience, either implicitly or explicitly (for instance, generative replay, the REMIND algorithm, and transfer learning are all conceptually indebted to decades of neuroscience research).
One line of justification for this approach is that there are studies demonstrating a phenomenon analogous to catastrophic forgetting in humans, suggesting that a shared mechanistic framework might underlie continual learning (and its inherent limitations) in both humans and ANNs (Pallier et al., 2003; Mareschal et al., 2007). The first of these studies, Pallier et al. (2003), examined language acquisition and overwriting in Korean-born subjects whose functional mother tongue was French (due to being adopted before the age of 8), and who had no conscious knowledge of the Korean language, as verified by behavioural testing. Functional neuroimaging (fMRI) demonstrated that the Korean-born francophone subjects displayed no greater (cortical) response to the Korean language in the setting of passive listening compared to French subjects with no exposure to Korean. This was interpreted as a form of over-writing, or catastrophic forgetting, of the first language by the second. The significance of these results is unclear, particularly given the limited literature on human catastrophic forgetting, but such work represents an interesting mandate for the use of cognitive neuroscience as a source of inspiration for continual learning research.
Replay in brains and neural networks
From the perspective of neuroscience, several mechanistic features underpinning human continual learning have been dissected, such as memory replay (perhaps as a means of transferring learned knowledge from short-term to long-term storage), curriculum and transfer learning paradigms, structural plasticity, and the integration of multiple sensory modalities to provide rich sensory context for memories (Parisi et al., 2019).
One phenomenon widely considered to contribute to continual learning is memory replay or hippocampal replay, defined as the re-activation (hence, replay) of patterns of activity in hippocampal neurons during states of slow-wave sleep and passive, resting awake (Skaggs and McNaughton, 1996; Dave and Margoliash, 2000; Rudoy et al., 2009). Such replay episodes are thought to provide additional trials serving to rehearse task learning and generalise knowledge during so-called ‘offline’ learning, and were first identified by recording hallmarks of brain activity during learning and mapping these onto covarying activity patterns identified during sleep (Rasch, 2018). An elegant demonstration of this phenomenon in humans was provided by Rudoy et al. (2009), whereby subjects learned the position of arbitrary objects on a computer screen, with each object presented in association with a unique and characteristic sound. The participants then slept for a short period of time, with electroencephalography (EEG) used to identify different stages of sleep. During slow-wave sleep, the characteristic sounds for half of the objects were played at an audible but unobtrusive volume. It was found that the participants subsequently recalled the positions of these sound-consolidated objects with greater accuracy. Replay approaches in machine learning have already proved fruitful, and studies such as these from neuroscience only serve to further motivate replay as a topic of continual learning research.
Replay, as well as the lesser-understood hippocampal pre-play (whereby hippocampal neurons display what is thought to represent simulated activity which can be mapped onto future environments) might aid continual learning through this ‘offline’ consolidation (Dragoi and Tonegawa, 2011; Bendor and Spiers, 2016). Although the mechanism remains incompletely described, it has been proposed that replay could contribute to continual learning by promoting the consolidation of previous task knowledge (Ólafsdóttir et al., 2018). In some ways, this can be considered analogous to pre-training neural networks with task-relevant data, a technique already demonstrated to confer considerable advantages (Erhan et al., 2010). Indeed, the value of neuroscience-inspired research was recently underlined by a study from van de Ven et al. (2020). Here, a more biologically plausible form of replay was implemented, whereby instead of storing previous task data, a learned generative model was trained to replay internal, compressed representations of that data. These internal representations for replay were then interleaved with current training data in a context-dependent manner, modulated by feedback connections. This approach also overcomes a potential issue with existing replay methods in machine learning – cases where storing previously encountered data is not permissible due to safety or privacy concerns. Replay of compressed representations, or replay in conjunction with some form of federated learning (Kairouz et al., 2019), might offer a solution in these instances.
Complementary learning systems for continual learning
Complementary learning systems (CLS) theory, first advanced by McClelland et al. (1995) and recently updated by Kumaran et al. (2016), delineates two distinct structural and functional circuits underlying human learning (McClelland et al., 1995; Kumaran et al., 2016; Girardeau et al., 2009). The hippocampus serves as the substrate for short-term memory and the rapid, ‘online’ learning of knowledge relevant to the present task; in parallel, the neocortex mediates long-term, generalised memories, structured over experience. Transfer of knowledge from the former to the latter occurs with replay, and it is intuitive that the catastrophic forgetting of previously learned knowledge in machine learning systems could be mitigated to some extent by such complementary learning systems. This has proved an influential theory in neuroscience, offering an account of the mechanisms by which humans accumulate task performance over time, and has started to provide ideas for how complementary learning systems might aid continual learning in artificial agents.
Transfer and curriculum learning
Furthermore, the learning strategies employed by humans, of which transfer learning and curriculum learning have come into recent focus, are themselves likely to contribute further to continual learning (Barnett and Ceci, 2002; Holyoak and Thagard, 1997). Humans are capable of transferring knowledge between domains with minimal interference, and this capability derives from both continual task learning and generalisation of previously learned knowledge. A human learning to play tennis, for instance, can generalise some features of this task learning to a different racquet sport (Goode and Magill, 1986). Such transfer learning is poorly understood at the level of neural computations, and has perhaps been neglected by the neuroscience research community until recently, when attempts to endow artificial agents with these abilities have re-focussed attention on the mechanisms underpinning transfer learning in humans (Weiss et al., 2016).
Attempts to explain this abstract transfer of generalised knowledge in humans have themselves recapitulated many features of continual learning (Doumas et al., 2008; Barnett and Ceci, 2002; Pan and Yang, 2009; Holyoak and Thagard, 1997). For instance, Doumas et al. (2008) proposed that this is achieved by the neural encoding of relational information between objects comprising a sensory environment. Critically, such relational information would be invariant to nuances and specific features in these objects, and this could aid continual learning by providing a generalised task learning framework. Although the neural coding for such a relational framework has not yet been elucidated, an intriguing recent paper by Constantinescu et al. (2016) proposed that abstract concepts are encoded by grid cells in the human entorhinal cortex in a similar way to maps of physical space. Just as replay and memory consolidation have already informed approaches to continual learning, such research might inspire novel approaches to alleviating catastrophic forgetting.
A related, and perhaps lesser-studied, learning paradigm is that of curriculum learning (Bengio et al., 2009; Elman, 1993). Intuitively, curriculum learning states that agents (both natural and artificial) learn more effectively when learning examples are structured and presented in a meaningful manner. The most obvious instantiation of this is to increase the difficulty of the learning examples throughout the sequence presented; indeed, this is consistent with the structure of most human educational programmes (Krueger and Dayan, 2009; Goldman and Kearns, 1995). It has been appreciated for some time that this form of non-random learning programme aids human continual learning (Elman, 1993; Krueger and Dayan, 2009); however, the theoretical underpinnings of this are only starting to be elucidated. Curriculum learning has the potential to enhance continual learning in neural networks by providing more structured training regimes, which emphasise the features of the training dataset most relevant to the tasks. Ultimately, however, more work is required to explore this promise.
Multi-sensory integration and attention
It has been appreciated for some time in the field of cognitive neuroscience that humans receive a stream of rich multi-sensory input from the environment, and that the ability to integrate this information into a multi-modal representation is critical for cognitive functions ranging from reasoning to memory (Spence, 2010, 2014; Stein and Meredith, 1993; Stein et al., 2014). By enriching the context and representation of individual memories, multi-sensory integration is thought to aid human continual learning.
There is also evidence that attention is a cognitive process contributing to continual learning in humans (Flesch et al., 2018). Here, when the task learning curriculum was designed in a manner permitting greater attention (namely, with tasks organised in blocks for training, rather than ‘interleaved’ task training), continual learning of the task in the (human) subjects was enhanced. Even if optimal training regimes differ across biological and artificial agents, this underlines the importance of curriculum and attention in addressing catastrophic forgetting.
Future perspective: bridging neuroscience and machine learning to inspire continual learning research
The inherent efficacy of human continual learning and its cognitive substrates is perhaps most impressive when contrasted with the current inability to endow AI agents with similar properties. With a projected increase in global data generation from 16 zettabytes annually in 2018 to over 160 zettabytes annually by 2025 (and the consequent intractability of comprehensive storage), there is a clear motivation for developing machine learning systems capable of continual learning in a manner analogous to humans (IDC White Paper, 2017; Tenenbaum et al., 2011).
The super-human performance of deep reinforcement learning agents on a range of complex learning tasks, from Atari 2600 video games to chess, has been well-publicised in recent years (Mnih et al., 2015; Silver et al., 2016; Kasparov, 2018; LeCun et al., 2015). However, these successes conceal a profound limitation of such machine learning systems: an inability to sustain performance when trained sequentially over a range of tasks. Traditionally, approaches to the issue of catastrophic forgetting have focussed on training regime, and often remain tangential to the cause of such forgetting. In the future, bridging the conceptual gap between continual learning research and the rich literature of learning and memory in neuroscience might prove fruitful, as motivated by several of the examples already discussed in this review.
Parallels of CLS theory in machine learning
For example, with respect to CLS theory, biologically inspired neural network architectures involving two sets of connection weights (a slowly changing weight storing long-term knowledge, and a rapidly updating ‘fast’ weight) have existed for decades (Hinton and Plaut, 1987). Indeed, these networks outperformed state-of-the-art ANNs in continual learning-related tasks at their time of development. When considered from the perspective of CLS theory, this suggests that the parallel and complementary functions of the hippocampus and neocortex in human memory contribute to continual learning.
More recent models, such as the Differentiable Neural Computer (DNC), also support the view that complementary memory systems aid continual learning (Graves et al., 2016). The DNC architecture consists of an artificial neural network coupled to an external memory matrix, to which the network has read and write access for storing and manipulating data structures (broadly analogous to random-access memory). As such, a DNC can be interpreted as having 'short-term' and 'long-term' memory proxies, and the capacity to relay information between them. This model is capable of solving complex reinforcement learning problems and answering natural language questions constructed to mimic reasoning, lending further support to the contribution of complementary learning systems to human continual learning. Subsequently, models directly inspired by neuroscience have studied this principle of fast- and slow-learning systems (Whittington et al., 2020), but these have not yet been explored more broadly through the lens of continual learning.
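The core mechanism of content-based access to an external memory matrix can be illustrated with a minimal sketch. This is emphatically not the full DNC, which learns differentiable write weightings, usage tracking, and temporal link matrices; the sketch below keeps only content-based reading, and all names and values are illustrative assumptions:

```python
import numpy as np

# Minimal content-addressable external memory, loosely in the spirit of the
# DNC's read heads (Graves et al., 2016). Reading returns a weighted sum of
# memory rows, weighted by a softmax over cosine similarity with a query key.
N, W = 8, 4               # number of memory slots, width of each slot
memory = np.zeros((N, W))

def _address(key, beta=10.0):
    # Softmax over cosine similarity between the key and each memory row;
    # beta sharpens the weighting towards the best-matching slot.
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    sim = (memory @ key) / norms
    w = np.exp(beta * sim)
    return w / w.sum()

def write(slot, vector):
    # For simplicity, write to an explicit slot (the DNC instead computes
    # a differentiable write weighting over all slots).
    memory[slot] = vector

def read(key):
    # Content-based read: blend of memory rows under the address weighting.
    w = _address(key)
    return w @ memory

write(0, np.array([1.0, 0.0, 0.0, 0.0]))
write(1, np.array([0.0, 1.0, 0.0, 0.0]))
recalled = read(np.array([0.9, 0.1, 0.0, 0.0]))  # noisy query for slot 0
```

Because addressing is by content rather than location, a noisy or partial cue still retrieves the stored item, which is the property that lets the controller network treat the matrix as a long-term store it can query later.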
Emerging significance of multi-sensory integration and attention in artificial intelligence agents for continual learning
In the context of continual learning in machine learning, multi-sensory integration (often called 'multi-modal integration' in this context) offers the clear benefit of providing complementary information across modalities when the environment is uncertain or has high entropy. Indeed, multi-modal machine learning has demonstrated efficacy in a range of task learning paradigms, such as lip reading, where the presence of both audio (phoneme) and visual (viseme) information improves performance compared to a uni-sensory training approach (Ngiam et al., 2011). Ultimately, greater investigation of multi-modal machine learning could clarify the value of such integration across domains, and offer approaches to aiding continual learning in settings where the environment is unpredictable or multimodal. The role of attention in continual learning was underlined by a recent study endowing ANNs with a 'hard attention mask', an attentional gating mechanism directly inspired by human cognitive function (Serra et al., 2018). This substantially decreased catastrophic forgetting when the model was trained on sequences of image classification tasks, thereby emphasising attention as an important contributor to continual learning.
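The hard attention idea can be sketched as a per-task, near-binary gate over hidden units. The snippet below is a simplified illustration rather than the published implementation: in the original work the task embeddings are learned jointly with the network and the sigmoid slope is annealed during training, whereas here the embeddings are random and the slope is fixed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sketch of a per-task "hard attention" gate on hidden units, in the spirit
# of Serra et al. (2018). Each task has an embedding per layer; a steep
# sigmoid of that embedding yields a near-binary mask deciding which units
# the task may use. Names and values are illustrative assumptions.
n_hidden, n_tasks = 6, 2
rng = np.random.default_rng(1)
task_embeddings = rng.normal(size=(n_tasks, n_hidden))  # learned in practice

def task_mask(task_id, s=50.0):
    # Large s pushes the soft gate towards a hard 0/1 mask.
    return sigmoid(s * task_embeddings[task_id])

def gated_forward(h, task_id):
    # Element-wise gating of a hidden activation vector for a given task.
    return h * task_mask(task_id)

h = np.ones(n_hidden)
out_t0 = gated_forward(h, 0)
out_t1 = gated_forward(h, 1)
```

Because each task is routed through a largely disjoint subset of units, gradient updates for a new task are blocked from the units that previous tasks rely on, which is the mechanism by which the mask limits interference.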
Advances in deep learning have accelerated in recent years (Krizhevsky et al., 2012; Vaswani et al., 2017), capturing the imagination of researchers and the public alike with their capacity to achieve superhuman performance on tasks (Silver et al., 2016) and aid scientific discovery (Senior et al., 2020). However, if machine learning pipelines are ever to learn new tasks dynamically and in real time, with interfering goals and multiple input datasets, the continual learning problem must be addressed. In this review, several of the most promising avenues of research have been appraised, many of them deriving inspiration from neuroscience. Although much progress has been made, no existing approach adequately solves the continual learning problem. This review argues that more directly bridging continual learning research with neuroscience might offer fresh insights and inspiration, ultimately guiding the development of new approaches to catastrophic forgetting which bring the performance of artificial agents closer to that of humans: assimilating knowledge and skills over time and experience.
A repository of selected papers and implementations accompanying this review can be found at https://github.com/mccaffary/continual-learning.
- Regulation and function of adult neurogenesis: from genes to cognition. Physiological reviews 94 (4), pp. 991–1026. Cited by: Architectural considerations.
- When and where do we apply what we learn? A taxonomy for far transfer. Psychological bulletin 128 (4), pp. 612. Cited by: Transfer and curriculum learning.
- Does the hippocampus map out the future? Trends in cognitive sciences 20 (3), pp. 167–169. Cited by: Replay in brains and neural networks.
- Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: Curriculum learning, Transfer and curriculum learning.
- Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade, pp. 437–478. Cited by: Transfer learning.
- Organizing conceptual knowledge in humans with a gridlike code. Science 352 (6292), pp. 1464–1468. Cited by: Transfer and curriculum learning.
- Adanet: adaptive structural learning of artificial neural networks. In International conference on machine learning, pp. 874–883. Cited by: Architectural considerations.
- Song replay during sleep and computational rules for sensorimotor vocal learning. Science 290 (5492), pp. 812–816. Cited by: Replay in brains and neural networks.
- Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: REMIND.
- A theory of the discovery and predication of relational concepts. Psychological review 115 (1), pp. 1. Cited by: Transfer and curriculum learning.
- Neurogenesis deep learning: extending deep networks to accommodate new classes. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 526–533. Cited by: Architectural considerations, Generative replay.
- Preplay of future place cell sequences by hippocampal cellular assemblies. Nature 469 (7330), pp. 397–401. Cited by: Replay in brains and neural networks.
- Learning and development in neural networks: the importance of starting small. Cognition 48 (1), pp. 71–99. Cited by: Transfer and curriculum learning.
- Why does unsupervised pre-training help deep learning?. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 201–208. Cited by: Replay in brains and neural networks.
- Comparing continual task learning in minds and machines. Proceedings of the National Academy of Sciences 115 (44), pp. E10313–E10322. Cited by: Multi-sensory integration and attention.
- Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3 (4), pp. 128–135. Cited by: Introduction and historical context.
- Cascade models of synaptically stored memories. Neuron 45 (4), pp. 599–611. Cited by: Regularisation.
- Selective suppression of hippocampal ripples impairs spatial memory. Nature neuroscience 12 (10), pp. 1222–1223. Cited by: Complementary learning systems for continual learning.
- On the complexity of teaching. Journal of Computer and System Sciences 50 (1), pp. 20–31. Cited by: Transfer and curriculum learning.
- Contextual interference effects in learning three badminton serves. Research quarterly for exercise and sport 57 (4), pp. 308–314. Cited by: Transfer and curriculum learning.
- An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211. Cited by: Regularisation.
- Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471–476. Cited by: Curriculum learning, Parallels of CLS theory in machine learning.
- Embracing change: continual learning in deep neural networks. Trends in cognitive sciences. Cited by: Introduction and historical context.
- Neuroscience-inspired artificial intelligence. Neuron 95 (2), pp. 245–258. Cited by: Introduction and historical context, Figure 2, Can inspiration be drawn from neuroscience?.
- Remind your neural network to prevent catastrophic forgetting. In European Conference on Computer Vision, pp. 466–483. Cited by: Figure 3, REMIND.
- Early visual concept learning with unsupervised deep learning. arXiv preprint arXiv:1606.05579. Cited by: Transfer learning.
- Learning distributed representations of concepts. In Proceedings of the eighth annual conference of the cognitive science society, Vol. 1, pp. 12. Cited by: Introduction and historical context, Can inspiration be drawn from neuroscience?.
- Using fast weights to deblur old memories. In Proceedings of the 9th Annual Conference of the Cognitive Science Society, pp. 177–186. Cited by: Parallels of CLS theory in machine learning.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: Learning without forgetting.
- The analogical mind. American psychologist 52 (1), pp. 35. Cited by: Transfer and curriculum learning.
- Note on the quadratic penalties in elastic weight consolidation. Proceedings of the National Academy of Sciences, pp. 201717042. Cited by: Elastic weight consolidation.
- Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: REMIND.
- Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977. Cited by: Replay in brains and neural networks.
- Chess, a drosophila of reasoning. American Association for the Advancement of Science. Cited by: Future perspective: bridging neuroscience and machine learning to inspire continual learning research.
- Fearnet: brain-inspired model for incremental learning. arXiv preprint arXiv:1711.10563. Cited by: Architectural considerations.
- Measuring catastrophic forgetting in neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: Figure 2, Elastic weight consolidation, Generative replay.
- Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: Figure 2, Elastic weight consolidation.
- Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, pp. 1097–1105. Cited by: Conclusion.
- Flexible shaping: how learning in small steps helps. Cognition 110 (3), pp. 380–394. Cited by: Transfer and curriculum learning.
- What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in cognitive sciences 20 (7), pp. 512–534. Cited by: Complementary learning systems for continual learning.
- Deep learning. Nature 521 (7553), pp. 436–444. Cited by: Architectural considerations, Future perspective: bridging neuroscience and machine learning to inspire continual learning research.
- Universal intelligence: a definition of machine intelligence. Minds and machines 17 (4), pp. 391–444. Cited by: Introduction and historical context.
- Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: Learning without forgetting.
- Human replay spontaneously reorganizes experience. Cell 178 (3), pp. 640–652. Cited by: Generative replay.
- Compartmentalized dendritic plasticity and input feature storage in neurons. Nature 452 (7186), pp. 436–441. Cited by: Regularisation.
- Neuroconstructivism: how the brain constructs cognition. Vol. 1, Oxford University Press. Cited by: Can inspiration be drawn from neuroscience?.
- Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review 102 (3), pp. 419. Cited by: Complementary learning systems for continual learning.
- Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of learning and motivation, Vol. 24, pp. 109–165. Cited by: Introduction and historical context.
- Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: Future perspective: bridging neuroscience and machine learning to inspire continual learning research.
- Multimodal deep learning. In ICML, Cited by: Emerging significance of multi-sensory integration and attention in artificial intelligence agents for continual learning.
- The role of hippocampal replay in memory and planning. Current Biology 28 (1), pp. R37–R50. Cited by: Replay in brains and neural networks.
- Zero-shot learning with semantic output codes. Advances in Neural Information Processing Systems. Cited by: Transfer learning.
- Brain imaging of language plasticity in adopted adults: can a second language replace the first? Cerebral cortex 13 (2), pp. 155–161. Cited by: Can inspiration be drawn from neuroscience?.
- A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22 (10), pp. 1345–1359. Cited by: Transfer and curriculum learning.
- Continual lifelong learning with neural networks: a review. Neural Networks 113, pp. 54–71. Cited by: Introduction and historical context, Architectural considerations, Replay in brains and neural networks.
- Memory formation: let’s replay. Elife 7, pp. e43832. Cited by: Replay in brains and neural networks.
- Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review 97 (2), pp. 285. Cited by: Introduction and historical context.
- CHILD: a first step towards continual learning. In Learning to learn, pp. 261–292. Cited by: Transfer learning.
- Experience replay for continual learning. arXiv preprint arXiv:1811.11682. Cited by: Generative replay.
- Strengthening individual memories by reactivating them during sleep. Science 326 (5956), pp. 1079–1079. Cited by: Replay in brains and neural networks.
- Progressive neural networks. arXiv preprint arXiv:1606.04671. Cited by: Figure 1, Architectural considerations, Transfer learning.
- Improved protein structure prediction using potentials from deep learning. Nature 577 (7792), pp. 706–710. Cited by: Conclusion.
- Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pp. 4548–4557. Cited by: Emerging significance of multi-sensory integration and attention in artificial intelligence agents for continual learning.
- Continual learning with deep generative replay. arXiv preprint arXiv:1705.08690. Cited by: Generative replay.
- Mastering the game of go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: Future perspective: bridging neuroscience and machine learning to inspire continual learning research, Conclusion.
- Replay of neuronal firing sequences in rat hippocampus during sleep following spatial experience. Science 271 (5257), pp. 1870–1873. Cited by: Generative replay, Replay in brains and neural networks.
- Crossmodal spatial attention. Annals of the New York Academy of Sciences 1191 (1), pp. 182–200. Cited by: Multi-sensory integration and attention.
- Orienting attention: a crossmodal perspective. Oxford University Press. Cited by: Multi-sensory integration and attention.
- The merging of the senses. The MIT Press. Cited by: Multi-sensory integration and attention.
- Development of multisensory integration from the perspective of the individual neuron. Nature Reviews Neuroscience 15 (8), pp. 520–535. Cited by: Multi-sensory integration and attention.
- How to grow a mind: statistics, structure, and abstraction. science 331 (6022), pp. 1279–1285. Cited by: Introduction and historical context.
- The hippocampal indexing theory and episodic memory: updating the index. Hippocampus 17 (12), pp. 1158–1169. Cited by: REMIND.
- Brain-inspired replay for continual learning with artificial neural networks. Nature communications 11 (1), pp. 1–14. Cited by: Replay in brains and neural networks.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: Conclusion.
- Matching networks for one shot learning. Advances in neural information processing systems 29, pp. 3630–3638. Cited by: Transfer learning.
- A survey of transfer learning. Journal of Big data 3 (1), pp. 1–40. Cited by: Transfer and curriculum learning.
- The tolman-eichenbaum machine: unifying space and relational memory through generalization in the hippocampal formation. Cell 183 (5), pp. 1249–1263. Cited by: Parallels of CLS theory in machine learning.
- Stably maintained dendritic spines are associated with lifelong memories. Nature 462 (7275), pp. 920–924. Cited by: Elastic weight consolidation.
- How transferable are features in deep neural networks?. arXiv preprint arXiv:1411.1792. Cited by: Transfer learning.
- Continual learning through synaptic intelligence. In International Conference on Machine Learning, pp. 3987–3995. Cited by: Synaptic intelligence.