1 Introduction
Learning a continual stream of tasks has been a longstanding challenge in machine learning (Ring, 1997; Chen and Liu, 2018). Continual learning with deep neural networks has been an active area of research over the past few years (Delange et al., 2021), and it has multiple applications in a range of problem domains (Lesort et al., 2020; Lee and Lee, 2020; Maschler et al., 2021). Catastrophic forgetting of existing knowledge for tasks learned sequentially has been the main challenge (Delange et al., 2021). A variety of methods for this problem in supervised continual learning have been proposed, including approaches for replaying examples (Lopez-Paz and Ranzato, 2017), regularisation-based methods (Kirkpatrick et al., 2017) and network expansion methods (Ostapenko et al., 2019).
Knowledge transfer has recently been explored as an alternative for improving the performance of continual learning systems. Transferring knowledge in the forward direction has demonstrated some gains (Ke et al., 2021). Backward transfer, on the other hand, has received much less attention in continual learning with deep neural networks (Riemer et al., 2018; Ke et al., 2020; Vogelstein et al., 2020; New et al., 2022). However, backward transfer has succeeded in other lifelong learning studies that use techniques such as Support Vector Machines (SVMs) (Benavides-Prado et al., 2020), and continues to be a desired property of continual learning systems (Rish, 2022).
We develop a theory for knowledge transfer in continual learning. We first derive error bounds for individual tasks, when these are subject to forward transfer when learned for the first time, or to backward transfer from future tasks when these are learned. We then consider the order of arrival of tasks, since this influences the amount of transfer that a task is subject to. Based on the bounds derived for individual tasks, we calculate error bounds for a continual learner that learns related tasks sequentially using forward and backward transfer.
Our framework relies on three core assumptions. First, the continual learner is embedded into an environment of related tasks. This allows us to treat the problem of learning a sequence of tasks as the problem of learning a bias for the whole environment incrementally. Learning this bias is helpful since the continual learner will perform better at any task in that environment. Our second assumption is that relatedness between these tasks relies on the similarity between their example-generating distributions. This assumption allows us to use a set of transformation functions as a tool for constraining the hypothesis family for learning a particular task, based on its similarity to other tasks in the environment (from which forward or backward transfer is to be performed). This tool has been used in other studies in multitask learning (Ben-David and Borbely, 2008). Our final assumption is that each task has a sufficient number of examples from which to learn. This assumption distinguishes our framework from approaches in zero-shot or few-shot learning. However, in Section 4 we demonstrate that the number of examples required to learn decreases with the number of tasks.
Our proposed framework is generic and does not rely on any practical assumptions about the implementation of the continual learner (e.g. in terms of the learning technique, the implementation architecture, the mechanisms used for transfer or the continual learning scenario: task-incremental, domain-incremental or class-incremental (Van de Ven and Tolias, 2019)). We aim to provide a rigorous theoretical analysis to show the potential of knowledge transfer while learning sequentially, and to encourage more research in this direction.
This paper is organised as follows. Section 2 describes previous research in knowledge transfer for continual learning. Section 3 provides preliminaries and notation. Section 4 describes error bounds derived for tasks learned continually using forward knowledge transfer. Section 5 describes error bounds for tasks learned continually using backward transfer. Section 6 describes error bounds for a continual learner that uses both forward and backward transfer. Finally, Section 8 provides some discussion and final remarks.
2 Previous Research
Catastrophic forgetting or interference of new tasks with previously acquired knowledge has been studied extensively in supervised continual learning with deep neural networks (Delange et al., 2021). Several methods to avoid catastrophic forgetting have been proposed, ranging from example replay (Lopez-Paz and Ranzato, 2017; van de Ven et al., 2020) to regularisation-based (Kirkpatrick et al., 2017; Zeng et al., 2019) to dynamic networks (Yoon et al., 2017; Hung et al., 2019). Beyond catastrophic forgetting, the classic aim of continual learning systems has been to achieve increasingly knowledgeable systems (Ring, 1997; Chen and Liu, 2018). Knowledge transfer has been proposed as a mechanism to achieve this (Ke et al., 2020; Rostami et al., 2020; Benavides-Prado, 2020). Forward transfer with continual deep neural networks has been studied recently (Ke et al., 2021). Backward transfer, in contrast, has received much less attention (Riemer et al., 2018; Ke et al., 2020; Vogelstein et al., 2020), although it was explored with alternative techniques such as SVMs (Benavides-Prado et al., 2020).
Ben-David and Borbely (2008) and Baxter (2000) studied the effects of learning multiple related tasks jointly with multitask learning. Baxter (2000) derived the expected average error for a group of tasks learned jointly. Ben-David and Borbely (2008) derived similar bounds for a single task learned under the same framework. More recently, Benavides-Prado, Koh and Riddle (2020) derived error bounds of knowledge transfer across SVM models in supervised continual learning. This research showed that given a set of related tasks, backward transfer with SVM can be used to achieve systems that improve their performance with each incoming task (Benavides-Prado et al., 2020). Furthermore, forward transfer can also be used to aid learning of new tasks. Although novel, these bounds were specific to the implementation using SVM. Here we extend this work by deriving error bounds that are agnostic to the implementation, for both forward transfer and backward transfer. We also derive error bounds for a continual learner that uses knowledge transfer whilst learning related tasks sequentially.
Other theoretical frameworks in transfer learning have studied how the degree of relatedness among tasks helps transfer (Lampinen and Ganguli, 2018), and how transfer helps curriculum learning (Weinshall et al., 2018). Theoretical studies in continual learning have studied the effects of task similarity in catastrophic forgetting (Lee et al., 2021), and discovered that optimal continual learning is NP-hard and requires perfect memory (Knoblauch et al., 2020). However, to the best of our knowledge there is no prior study that evaluates the effects of forward and backward knowledge transfer in learning a continual stream of supervised tasks.
3 Preliminaries and Definitions
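Supervised continual learning is about learning a stream of tasks $T_1, \ldots, T_n$. A given task $T_i$ in the sequence has an underlying probability distribution $P_i$ on $X \times Y$ (or simply $P$, which we use later interchangeably). For that task, the aim is to learn a function $f: X \rightarrow Y$ that maps the input space $X$ to the output space $Y$. Learning works by exploring a hypothesis space $\mathcal{H}$ on that task, and finding the hypothesis $h \in \mathcal{H}$ such that:
(1) $h = \arg\min_{h' \in \mathcal{H}} Er^{P}(h'), \quad Er^{P}(h') = \mathbb{E}_{(x, y) \sim P}\left[ l(h'(x), y) \right]$
where $l$ is a loss function. Naturally, estimating the error of $h$ over the actual distribution is difficult since $P$ cannot be observed directly. Instead, a sample $S$ of $m$ examples drawn repeatedly from $P$ is used, such that:
(2) $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$
The empirical error of $h$ over $S$ is:
(3) $\hat{Er}^{S}(h) = \frac{1}{m} \sum_{j=1}^{m} l(h(x_j), y_j)$
To find the best $h$ under Eq. 3, the learner aims to find the hypothesis that best fits this sample, such that:
(4) $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{Er}^{S}(h)$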
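The step from expected error to empirical error minimisation can be made concrete with a small sketch. It assumes a finite hypothesis space of one-dimensional threshold classifiers and a zero-one loss; the names `threshold`, `empirical_error` and `erm`, and the toy sample, are illustrative choices and not part of the framework.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]           # (input x, label y)
Hypothesis = Callable[[float], int]   # maps an input to a predicted label


def zero_one_loss(prediction: int, label: int) -> float:
    """Loss of 1 for a wrong prediction, 0 for a correct one."""
    return 0.0 if prediction == label else 1.0


def empirical_error(h: Hypothesis, sample: Sequence[Example]) -> float:
    """Average loss of h over the observed sample."""
    return sum(zero_one_loss(h(x), y) for x, y in sample) / len(sample)


def erm(hypotheses: Sequence[Hypothesis], sample: Sequence[Example]) -> Hypothesis:
    """Return the hypothesis with the smallest empirical error on the sample."""
    return min(hypotheses, key=lambda h: empirical_error(h, sample))


def threshold(t: float) -> Hypothesis:
    """Hypothetical hypothesis: predict 1 when x exceeds the threshold t."""
    return lambda x: 1 if x > t else 0


# Assumed toy data: labels switch from 0 to 1 somewhere between 0.8 and 1.4.
sample: List[Example] = [(-0.5, 0), (0.3, 0), (0.8, 0), (1.4, 1), (1.9, 1), (2.5, 1)]
hypothesis_space = [threshold(t) for t in (-1.0, 0.0, 1.0, 2.0)]

errors = [empirical_error(h, sample) for h in hypothesis_space]
best = erm(hypothesis_space, sample)
print(errors, empirical_error(best, sample))
```

The learner never sees the underlying distribution here, only the six examples, which is why the selection is made on the empirical error alone.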
Knowledge transfer for continual learning aims to share knowledge across tasks observed sequentially. In our framework, we distinguish two types of transfer: 1) forward transfer, which aims to learn new or target tasks better or faster by transferring knowledge gained during tasks learned earlier, and 2) backward transfer, which aims to improve future performance over previous or source tasks by using knowledge collected while learning new tasks. We assume that tasks observed by the continual learner are related. Therefore, these tasks are assumed to belong to the same environment, and the continual learner can become better at learning in this environment as more tasks are observed.
Formally, we define the environment of the continual learner as follows:
Definition 1.
An environment
of related tasks, corresponds to the set of all probability distributions on
, denoted , and a distribution on , denoted . Instead of exploring a single hypothesis space, the continual learner has access to a family of hypothesis spaces , one for each task. In practice, the learner has access to multiple samples to learn from, one sample for each task, such that are drawn at random from underlying probability distributions .
Access to a family of hypothesis spaces rather than a single hypothesis space, as in single-task learning, gives the continual learner the potential to learn a good bias that can generalise well to novel tasks from the same environment. Rather than producing a hypothesis that with high probability will perform well on future examples of a particular task, by learning related tasks continually the learner will produce a hypothesis space that with high probability will perform well on future tasks within the same environment. This main result has been demonstrated in the context of multitask learning (Baxter, 2000), and is the main result we demonstrate in Sections 4-6 for a stream of tasks learned continually using knowledge transfer.
The notion of relatedness for tasks in the environment of the continual learner relies on the similarity of their example-generating distributions (Ben-David and Borbely, 2008). Formally, given a set of transformation functions such that , tasks in the environment are related if, for some fixed probability distribution over , the examples in each of these tasks can be generated by applying some to that distribution. Therefore, we can define the equivalence relation (Raczkowski and Sadowski, 1990) on , where is a family of hypothesis spaces for all tasks in the environment, as follows:
Definition 2.
Let be the underlying probability distributions of a set of tasks over a domain . Let be a set of transformations . Let and be related if one can be generated from the other by applying some , such that (and therefore ) or (and therefore ). The samples to be used during learning tasks are said to be related if these samples come from related probability distributions.
Let be a family of hypothesis spaces over the domain , and be closed under the action of . Let be a family of hypothesis spaces that consist of sets of hypotheses which are equivalent up to transformations in . If acts as a group over because:

For every and every , , and

is closed under transformation composition and inverses, i.e. for every , the inverse transformation, , and the composition, are also members of
Then the equivalence relation on is defined by: there exists such that .
Therefore this framework considers the family of hypothesis spaces , which is the family of all equivalence classes of under .
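A toy instance may help to see how a group of transformations partitions a hypothesis family into equivalence classes. The sketch below assumes points on a small integer grid, the four rotations by multiples of 90 degrees as the transformation set, and half-plane classifiers as hypotheses; all names (`rotation`, `half_plane`, `equivalence_classes`) are hypothetical, and equality of hypotheses is only checked on the finite grid.

```python
from itertools import product

# Hypothetical toy setting: points on an integer grid in the plane.
GRID = list(product(range(-3, 4), repeat=2))
DIRECTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # unit normals, 90 degrees apart
OFFSETS = [0, 1, 2]


def rotation(k):
    """Rotation of the plane by k * 90 degrees (counter-clockwise)."""
    def f(point):
        x, y = point
        for _ in range(k % 4):
            x, y = -y, x
        return (x, y)
    return f


# The four rotations are closed under composition and inverses, so they form a group.
TRANSFORMATIONS = [rotation(k) for k in range(4)]


def half_plane(d, b):
    """Hypothesis h_{d,b}: label 1 when the point lies beyond offset b along direction d."""
    ux, uy = DIRECTIONS[d]
    return lambda point: 1 if ux * point[0] + uy * point[1] > b else 0


def signature(h):
    """Labels of h on every grid point; a finite stand-in for equality of hypotheses."""
    return tuple(h(p) for p in GRID)


def equivalent(h1, h2):
    """h1 ~ h2 if some transformation f gives h2 composed with f equal to h1 (on the grid)."""
    return any(signature(lambda p, f=f: h2(f(p))) == signature(h1) for f in TRANSFORMATIONS)


def equivalence_classes(family):
    """Partition a dict {name: hypothesis} into classes of mutually equivalent hypotheses."""
    classes = []
    for name, h in family.items():
        for cls in classes:
            if equivalent(family[cls[0]], h):
                cls.append(name)
                break
        else:
            classes.append([name])
    return classes


family = {(d, b): half_plane(d, b) for d in range(4) for b in OFFSETS}
classes = equivalence_classes(family)
for cls in classes:
    print(cls)
```

In this toy family the offset is the aspect that is invariant under the transformations, so each class collects the four directions that share one offset. Selecting a class therefore fixes the offset and leaves only the direction to be chosen per task.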
The original setting of this framework is in multitask learning (Ben-David and Borbely, 2008), where the equivalence class for a target task is first found using samples from all tasks. This requires first identifying aspects of all tasks that are invariant under . A second step restricts the learning of a particular task to selecting a hypothesis as the hypothesis for that task. Therefore, the target task can benefit from transfer during this second step by exploring only the hypothesis space that contains these invariances.
In continual learning we are faced with a similar problem, but rather than learning tasks jointly these are observed sequentially. However, provided these tasks are related, we can adopt a similar framework to derive error bounds for a target task that is learned with forward transfer from a set of source tasks, and for source tasks whose knowledge is updated with backward transfer from a recently learned target task. In the following sections we develop a theory of knowledge transfer across continual tasks that use these two transfer mechanisms.
4 Forward Knowledge Transfer across Related Tasks
In this and following sections, we will use to refer to a target task, or target probability distribution or target sample, and to denote a source task, or source probability distribution or source sample. In forward transfer, the aim is to learn a target task helped by knowledge obtained during previous source tasks , with probability distributions and and their corresponding observed samples and . Forward transfer for a continual learner which observes related tasks is defined as follows:
Definition 3.
Given classes and , and a set of labeled samples for a set of source tasks and a labeled sample for a target task, in forward knowledge transfer while learning task , the continual learner:

Has access to , obtained as a result of minimising over all .

Selects that minimises over all , and outputs as the hypothesis for .
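The two steps of Definition 3 can be sketched on a toy problem. This is one possible reading, not a prescribed implementation: it assumes half-plane classifiers grouped into classes by a shared offset, source tasks that agree on the offset but differ in direction, and a class chosen by the average of the best empirical error per source task. All function names and data are hypothetical.

```python
from itertools import product

DIRECTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # task-specific part of a hypothesis
OFFSETS = [0, 1, 2]                              # assumed invariant; indexes the classes
GRID = list(product(range(-3, 4), repeat=2))


def half_plane(d, b):
    """Label 1 when the point lies beyond offset b along direction d."""
    ux, uy = DIRECTIONS[d]
    return lambda point: 1 if ux * point[0] + uy * point[1] > b else 0


def labelled(points, d, b):
    """Noise-free sample generated by the hypothesis with direction d and offset b."""
    h = half_plane(d, b)
    return [(p, h(p)) for p in points]


def empirical_error(h, sample):
    """Fraction of examples in the sample that h misclassifies."""
    return sum(h(p) != y for p, y in sample) / len(sample)


def best_in_class(b, sample):
    """(error, direction) of the best member of class b on the given sample."""
    return min((empirical_error(half_plane(d, b), sample), d) for d in range(4))


def select_class(source_samples):
    """Step 1: class whose best members have the lowest average source error."""
    def average_best_error(b):
        return sum(best_in_class(b, s)[0] for s in source_samples) / len(source_samples)
    return min(OFFSETS, key=average_best_error)


def forward_transfer(source_samples, target_sample):
    b = select_class(source_samples)         # step 1: uses source samples only
    _, d = best_in_class(b, target_sample)   # step 2: search inside the class only
    return d, b


# Two well-sampled source tasks and a target task with only four examples.
sources = [labelled(GRID, 0, 1), labelled(GRID, 1, 1)]
target = labelled([(0, 3), (0, -3), (3, 0), (-3, 0)], 3, 1)

chosen = forward_transfer(sources, target)
consistent_without_transfer = [
    (d, b) for d in range(4) for b in OFFSETS
    if empirical_error(half_plane(d, b), target) == 0
]
print("with forward transfer:", chosen)
print("consistent without transfer:", consistent_without_transfer)
```

With only four target examples, several hypotheses in the full family fit the target sample equally well, whereas restricting the search to the class chosen from the sources leaves a single consistent candidate. This is the sense in which the bias learned from earlier tasks reduces what the target sample has to determine.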
In practice, having access to during a target task implies that the continual learner can access some representation of the knowledge obtained during previous tasks (e.g. a neural network representing that knowledge). We derive error bounds for learning a target task helped by knowledge transfer from related source tasks as follows:
Theorem 1.
Let , …, and be a set of related probability distributions, and , …, and random samples representing these distributions. Let and be defined as in Definition 2. Let . Let . Let be selected according to Definition 3. Then, for every constant , , , with and defined similarly to Theorem in Ben-David and Borbely (2008):
(5) 
and, for all i ≤ n:
(6) 
then with probability greater than :
(7) 
Proof.
Let be the best label predictor in , i.e. . Let be the equivalence class picked according to Definition 3. By the choice of :
(8) 
By Theorem 2 in Ben-David and Borbely (2008), with probability greater than :
(9) 
and:
(10) 
Then, combining the inequalities above, with probability greater than :
(11) 
Since , with probability greater than , will have an error for which is within of the best hypothesis there, i.e. . Therefore:
(12) 
∎
Theorem 1 implies that, for a sufficiently large number of examples for the sources and the target tasks, forward transfer is expected to benefit learning of a target task. This result is achieved by choosing a hypothesis space for which is biased towards the hypothesis space learned for previous related tasks from the same environment. The extent of this benefit depends on the number of examples per task (see Eq. 5 and Eq. 6). Baxter (2000) demonstrated that the number of examples required per task decreases along with an increasing number of tasks, in particular:
(13) 
where is the capacity of the learner given an error and a set of sets of loss functions for the family of hypothesis spaces . Provided that this capacity increases sublinearly with , the number of examples required per task will decrease with an increasing number of tasks.
The amount of transfer to a target task and therefore the extent to which the bound in Theorem 1 is satisfied depends on how many source tasks are used for transfer. Intuitively, the larger this number, the smaller the bound, since the target task will have a better bias of its environment with more related tasks having been observed, which would lead to a better hypothesis space to be selected for that task. Therefore, the later a target task is observed, the greater the opportunity for it to benefit from forward transfer. This is in accordance with previous research that demonstrated that a larger number of tasks learned continually benefits transfer (Benavides-Prado et al., 2017, 2020). Next, we analyse the effect of task order on forward transfer, and derive error bounds for a target task that account for its position in the sequence.
Definition 4.
Given classes and , a set of labeled samples {, …, for a set of source tasks and a labeled sample for a target task. Let:

be the result of minimising over all , at time .

be the result of minimising over all , at time .

that minimises over all , and outputs as the hypothesis for task at time .

that minimises over all , and outputs as the hypothesis for task at time .
Corollary 1.
Let , …, , , …, and , , , , be defined as in Theorem 1, at time . Similarly, let , …, and be a set of related probability distributions, , …, and random samples representing these distributions, at time . Let and be selected according to Definition 4, at time and , respectively. Then, for every , , , if:
(14) 
and, at time , for all i ≤ n:
(15) 
while, at time , for all i ≤ (n+z):
(16) 
then with probability greater than :
(17) 
See Appendix A for the proof of this corollary. The main part of the proof in Appendix A lies in Eq. 46. Since the best hypothesis space for a larger number of tasks is better than the best hypothesis space for a smaller number of tasks in the same environment, i.e. the bias over the environment gets refined over time, tasks observed later in the sequence will benefit more from transfer.
5 Backward Knowledge Transfer across Related Tasks
Backward transfer works by updating a source task using knowledge gained during the most recent target task . Transfer occurs from the space of a target probability distribution , represented by a sample , to the space of a probability distribution that uses a sample for learning that source task. In continual learning, the aim is to use , and its corresponding sample , to bias the update of a refined version of towards aspects that are invariant with , provided these are related. Benavides-Prado, Koh and Riddle (2020) analysed the special case of two tasks, one source and one target , for a specific implementation of a continual learner based on SVM. Here, we present bounds for an agnostic continual learner, as follows:
Definition 5.
Given classes and , and a pair of labeled samples , for tasks , , during backward transfer the continual learner:

Selects that minimises over all .

Selects that minimises over all , and outputs as the hypothesis for task .
In practice, the two steps in Definition 5 could be performed sequentially or jointly. For example, selecting in the first step could be performed by jointly training an auxiliary learner with examples from both and , and then transferring back this information to during the second step. Alternatively, both could be selected jointly while training for aided by .
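The sequential variant of these two steps can be sketched on a toy problem. The sketch rests on several assumptions that are not part of the framework: half-plane classifiers grouped into classes by a shared offset, a source task first learned alone from very few examples, a later target task with a richer sample, and a class chosen by the average of the best empirical errors on both samples. All names are hypothetical.

```python
from itertools import product

DIRECTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # task-specific part of a hypothesis
OFFSETS = [0, 1, 2]                              # assumed invariant; indexes the classes
GRID = list(product(range(-3, 4), repeat=2))


def half_plane(d, b):
    """Label 1 when the point lies beyond offset b along direction d."""
    ux, uy = DIRECTIONS[d]
    return lambda point: 1 if ux * point[0] + uy * point[1] > b else 0


def labelled(points, d, b):
    """Noise-free sample generated by the hypothesis with direction d and offset b."""
    h = half_plane(d, b)
    return [(p, h(p)) for p in points]


def empirical_error(h, sample):
    """Fraction of examples in the sample that h misclassifies."""
    return sum(h(p) != y for p, y in sample) / len(sample)


def best_in_class(b, sample):
    """(error, direction) of the best member of class b on the given sample."""
    return min((empirical_error(half_plane(d, b), sample), d) for d in range(4))


def learn_alone(sample):
    """Baseline without transfer: ERM over the whole family, ties broken by smallest offset."""
    _, b, d = min((empirical_error(half_plane(d, b), sample), b, d)
                  for b in OFFSETS for d in range(4))
    return d, b


def backward_transfer(source_sample, target_sample):
    def joint_error(b):
        return (best_in_class(b, source_sample)[0] + best_in_class(b, target_sample)[0]) / 2
    b = min(OFFSETS, key=joint_error)        # step 1: class chosen with both samples
    _, d = best_in_class(b, source_sample)   # step 2: refit the source inside that class
    return d, b


source = labelled([(0, 3), (0, -3), (3, 0), (-3, 0)], 3, 1)  # few source examples
target = labelled(GRID, 0, 1)                                # richer, related target task
held_out = labelled(GRID, 3, 1)                              # fresh examples of the source task

before = learn_alone(source)
after = backward_transfer(source, target)
error_before = empirical_error(half_plane(*before), held_out)
error_after = empirical_error(half_plane(*after), held_out)
print(before, error_before)
print(after, error_after)
```

In this toy run the source hypothesis learned alone fits its four examples but picks the wrong offset, and the update after observing the target task corrects it. The joint variant mentioned above would differ only in how the class is searched, not in the role of the two samples.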
Based on Definition 5, in the special case of two tasks and :
Theorem 2.
Let and be a set of related probability distributions, and and random samples representing these distributions on tasks and respectively. Let and be defined as in Definition 2. Let and be defined as in Theorem 1. Let be selected according to Definition 5. Then, for every , , , if:
(18) 
and:
(19) 
then with probability greater than :
(20) 
Proof.
Let be the best label predictor in , i.e. . Let be the equivalence class picked according to Definition 5. By the choice of :
(21) 
By Theorem 2 in Ben-David and Borbely (2008), in the case of two tasks:
(22) 
then with probability greater than :
(23) 
and:
(24) 
Then, combining the inequalities above, with probability greater than :
(25) 
Since , with probability greater than , will have an error for which is within of the best hypothesis there, i.e. . Therefore:
(26) 
∎
Similar to forward transfer, these bounds depend on the difference between , , and . Section 4 provides details on the meaning of these parameters and their relation to each other.
The main result from Theorem 2 and its corresponding proof is that an existing source task can also benefit from knowledge acquired during a related target task. This benefit is expected to be smaller than that of transferring forward, since forward transfer benefits from multiple sources (see Eq. ) while backward transfer benefits from a single target task (see Eq. ). We show that doing backward transfer helps to select a better hypothesis space and therefore provides a better bound on the performance of that task (see Eq. ). Therefore, a natural next question is whether backward transfer from a sequence of target tasks, learned one at a time, can help improve these bounds. Next, we prove that doing backward transfer multiple times sequentially helps to decrease the error on a source task sequentially as well.
Definition 6.
Given classes and , a set of labeled samples for a source task, and labeled samples , for target tasks at times and . Let:

be the result of minimising + over all , at time .

be the result of minimising + + over all , at time .

that minimises over all , and outputs as the hypothesis for task at time .

that minimises over all , and outputs as the hypothesis for task at time .
Corollary 2.
Let , and be a set of related probability distributions, , and random samples representing these distributions. Let and be defined as in Definition 2. Let and be defined as in Theorem 1. Let and be selected according to Definition 6. Then, for every , , , , if:
(27) 
and, at time :
(28) 
while, at time :
(29) 
then with probability greater than :
(30) 
See Appendix B for the proof of this corollary. These results imply that doing backward transfer sequentially whilst learning target tasks will lead to more refined hypothesis spaces in a source task, beyond the hypothesis space learned initially (with or without forward transfer). Furthermore, this suggests that continually learning related tasks while doing both forward and backward transfer can lead to a better bias over the learning environment of these tasks, i.e. the result demonstrated by Baxter (2000) for multitask learning, which we demonstrate in the next section.
6 Continual Learning of Related Tasks using Knowledge Transfer
Based on the bounds derived in Section 4 and Section 5, we are now ready to derive bounds for a continual learner that observes supervised related tasks sequentially while doing knowledge transfer. First, let us recall from Definition 1 that the continual learner is embedded in an environment of related tasks, , where is the set of all probability distributions on and is a distribution on . The error of a selected hypothesis space for all tasks in such an environment is defined as:
(31) 
for any drawn at random from according to . Let’s define as the average when performing forward transfer to a new task, i.e. corresponds to in Theorem 1, averaged across all tasks. Similarly, let’s define as the average when performing backward transfer to a new task, i.e. corresponds to in Theorem 2 averaged across all tasks. Although the definitions of and oversimplify the continual learner to the case of all tasks achieving roughly the same error bounds by means of transfer, this will serve to demonstrate how forward and backward transfer help to improve the bounds for the continual learner as a whole. For a task , the error bound of applying forward and backward transfer and selecting instead of as the hypothesis space for that task is:
(32) 
As demonstrated in Corollaries 1 and 2, the extent to which transfer helps to improve the error bounds of a particular task depends on the order of that task in the sequence, which in Eq. 32 impacts the total amount of transfer through for forward transfer and for backward transfer. Given Eq. 32, for a sequence of tasks , we can define the error bounds on the environment that learns those tasks by means of transfer as follows.
Theorem 3.
Let be a set of distributions, one for each task, drawn at random from , the set of all probability distributions on , according to , a distribution on . Let be the family of hypothesis spaces for tasks to be learned in the environment , according to Definition 2, with selected according to Theorem 1 and Theorem 2. Let be the family of hypothesis spaces for tasks with no transfer, and let be selected as the hypothesis space for the tasks with no transfer. If the number of tasks satisfies:
(33) 
with = , i.e. the set of all hypothesis spaces in the hypothesis space family such that each is defined by:
(34) 
and, for all , the number of examples per task, satisfies:
(35) 
where = (i.e. is the union of all sequences of hypotheses , each of size , subject to loss function ), and with:
(36) 
then, with probability at least , will satisfy:
(37) 