A Theory for Knowledge Transfer in Continual Learning

Continual learning of a stream of tasks is an active area in deep neural networks. The main challenge investigated has been the phenomenon of catastrophic forgetting or interference of newly acquired knowledge with knowledge from previous tasks. Recent work has investigated forward knowledge transfer to new tasks. Backward transfer for improving knowledge gained during previous tasks has received much less attention. There is in general limited understanding of how knowledge transfer could aid tasks learned continually. We present a theory for knowledge transfer in continual supervised learning, which considers both forward and backward transfer. We aim at understanding their impact for increasingly knowledgeable learners. We derive error bounds for each of these transfer mechanisms. These bounds are agnostic to specific implementations (e.g. deep neural networks). We demonstrate that, for a continual learner that observes related tasks, both forward and backward transfer can contribute to an increasing performance as more tasks are observed.

1 Introduction

Learning a continual stream of tasks has been a long-standing challenge in machine learning (Ring, 1997; Chen and Liu, 2018). Continual learning with deep neural networks has been an active area of research over the past few years (Delange et al., 2021), and it has multiple applications in a range of problem domains (Lesort et al., 2020; Lee and Lee, 2020; Maschler et al., 2021). Catastrophic forgetting of existing knowledge for tasks learned sequentially has been the main challenge (Delange et al., 2021). A variety of methods for this problem in supervised continual learning have been proposed, including approaches for replaying examples (Lopez-Paz and Ranzato, 2017), regularisation-based methods (Kirkpatrick et al., 2017) and network expansion methods (Ostapenko et al., 2019).

Knowledge transfer has recently been explored as an alternative for improving the performance of continual learning systems. Transferring knowledge in the forward direction has demonstrated some gains (Ke et al., 2021). Backward transfer, on the other hand, has received much less attention in continual learning with deep neural networks (Riemer et al., 2018; Ke et al., 2020; Vogelstein et al., 2020; New et al., 2022). However, backward transfer has succeeded in other lifelong learning studies that use techniques such as Support Vector Machines (SVMs) (Benavides-Prado et al., 2020), and continues to be a desired property of continual learning systems (Rish, 2022).

We develop a theory for knowledge transfer in continual learning. We first derive error bounds for individual tasks, when these are subject to forward transfer when learned for the first time, or to backward transfer from future tasks when these are learned. We then consider the order of arrival of tasks, since this influences the amount of transfer that each task is subject to. Based on the bounds derived for individual tasks, we calculate error bounds for a continual learner that learns related tasks sequentially using forward and backward transfer.

Our framework relies on three core assumptions. First, the continual learner is embedded into an environment of related tasks. This allows us to treat the problem of learning a sequence of tasks as the problem of learning a bias for the whole environment incrementally. Learning this bias is helpful since the continual learner will then perform better at any task in that environment. Our second assumption is that relatedness between these tasks relies on the similarity between their example generating distributions. This assumption allows us to use a set of transformation functions as a tool for constraining the hypothesis family for learning a particular task, based on its similarity to other tasks in the environment (from which forward or backward transfer are to be performed). This tool has been used in other studies in multitask learning (Ben-David and Borbely, 2008). Our final assumption is that each task has a sufficient number of examples from which to learn. This assumption distinguishes our framework from approaches in zero-shot or few-shot learning. However, in Section 4 we demonstrate that the number of examples required to learn decreases with the number of tasks.

Our proposed framework is generic and does not rely on any practical assumptions about the implementation of the continual learner (e.g. in terms of the learning technique, the implementation architecture, the mechanisms used for transfer or the continual learning scenario - task-incremental, domain-incremental or class-incremental (Van de Ven and Tolias, 2019)). We aim to provide a rigorous theoretical analysis to show the potential of knowledge transfer while learning sequentially, and to encourage more research in this direction.

This paper is organised as follows. Section 2 describes previous research in knowledge transfer for continual learning. Section 3 provides preliminaries and notation. Section 4 describes error bounds derived for tasks learned continually using forward knowledge transfer. Section 5 describes error bounds for tasks learned continually using backward transfer. Section 6 describes error bounds for a continual learner that uses both forward and backward transfer. Finally, Section 8 provides some discussion and final remarks.

2 Previous Research

Catastrophic forgetting or interference of new tasks with previously acquired knowledge has been studied extensively in supervised continual learning with deep neural networks (Delange et al., 2021). Several methods to avoid catastrophic forgetting have been proposed, ranging from example replay (Lopez-Paz and Ranzato, 2017; van de Ven et al., 2020) to regularisation-based (Kirkpatrick et al., 2017; Zeng et al., 2019) to dynamic networks (Yoon et al., 2017; Hung et al., 2019). Beyond catastrophic forgetting, the classic aim of continual learning systems has been to achieve increasingly knowledgeable systems (Ring, 1997; Chen and Liu, 2018). Knowledge transfer has been proposed as a mechanism to achieve this (Ke et al., 2020; Rostami et al., 2020; Benavides-Prado, 2020). Forward transfer with continual deep neural networks has been studied recently (Ke et al., 2021). Backward transfer, in contrast, has received much less attention (Riemer et al., 2018; Ke et al., 2020; Vogelstein et al., 2020), although it was explored with alternative techniques such as SVM (Benavides-Prado et al., 2020).

Ben-David and Borbely (2008) and Baxter (2000) studied the effects of learning multiple related tasks jointly with multitask learning. Baxter (2000) derived the expected average error for a group of tasks learned jointly. Ben-David and Borbely (2008) derived similar bounds for a single task learned under the same framework. More recently, Benavides-Prado, Koh and Riddle (2020) derived error bounds of knowledge transfer across SVM models in supervised continual learning. This research showed that given a set of related tasks, backward transfer with SVM can be used to achieve systems that improve their performance with each incoming task (Benavides-Prado et al., 2020). Furthermore, forward transfer can also be used to aid learning of new tasks. Although novel, these bounds were specific to the implementation using SVM. Here we extend this work by deriving error bounds that are agnostic to the implementation, for both forward transfer and backward transfer. We also derive error bounds for a continual learner that uses knowledge transfer whilst learning related tasks sequentially.

Other theoretical frameworks in transfer learning have studied how the degree of relatedness among tasks helps transfer (Lampinen and Ganguli, 2018), and how transfer helps curriculum learning (Weinshall et al., 2018). Theoretical studies in continual learning have examined the effects of task similarity in catastrophic forgetting (Lee et al., 2021), and discovered that optimal continual learning is NP-hard and requires perfect memory (Knoblauch et al., 2020). However, to the best of our knowledge there is no prior study that evaluates the effects of forward and backward knowledge transfer in learning a continual stream of supervised tasks.

3 Preliminaries and Definitions

Supervised continual learning is about learning a stream of tasks. A given task in the sequence has an underlying probability distribution (also written in a shorter form, which we use interchangeably later). For that task, the aim is to learn a function that maps the input space to the output space. Learning works by exploring a hypothesis space on that task, and finding the hypothesis such that:

(1)

where the error is measured by a loss function. Naturally, estimating the error of a hypothesis over the actual distribution is difficult, since that distribution cannot be observed directly. Instead, a sample of examples drawn from the distribution is used such that:

(2)

The empirical error of a hypothesis over this sample is such that:

(3)

To find the best hypothesis that satisfies Eq. 3, the learner aims to find the hypothesis that best fits this sample, such that:

(4)
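
The four steps above follow the usual risk-minimisation template. As a sketch only, in generic textbook notation (the symbols $P$, $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{H}$, $\ell$, $S$, $m$, $h^{*}$ and $\hat{h}$ are assumptions chosen for illustration and may differ from the notation used in the rest of this paper), they can be written as:

```latex
% Generic empirical risk minimisation template (illustrative notation only).
\begin{align*}
  % Best hypothesis with respect to the true distribution P on X x Y
  h^{*} &= \arg\min_{h \in \mathcal{H}} \;
           \mathbb{E}_{(x,y) \sim P}\big[\ell(h(x), y)\big] \\
  % Finite sample of m examples drawn from P
  S &= \{(x_1, y_1), \dots, (x_m, y_m)\}, \qquad (x_j, y_j) \sim P \\
  % Empirical error of a hypothesis h on the sample S
  \widehat{Er}_{S}(h) &= \frac{1}{m} \sum_{j=1}^{m} \ell\big(h(x_j), y_j\big) \\
  % Hypothesis that best fits the sample
  \hat{h} &= \arg\min_{h \in \mathcal{H}} \; \widehat{Er}_{S}(h)
\end{align*}
```

The first line is the quantity the learner would like to minimise, and the last line is the quantity it can actually minimise from data.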

Knowledge transfer for continual learning aims to share knowledge across tasks observed sequentially. In our framework, we distinguish two types of transfer: 1) forward transfer, which aims to learn new or target tasks better or faster by transferring knowledge gained during tasks learned earlier, and 2) backward transfer, which aims to improve future performance over previous or source tasks by using knowledge collected while learning new tasks. We assume that tasks observed by the continual learner are related. Therefore, these tasks are assumed to belong to the same environment, and the continual learner can become better at learning in this environment as more tasks are observed.

Formally, we define the environment of the continual learner as follows:

Definition 1.

An environment

of related tasks, corresponds to the set of all probability distributions on

, denoted , and a distribution on , denoted . Instead of exploring a single hypothesis space, the continual learner has access to a family of hypothesis spaces , one for each task. In practice, the learner has access to multiple samples to learn from, one sample for each task, such that are drawn at random from underlying probability distributions .

Access to a family of hypothesis spaces rather than a single hypothesis space, as in single-task learning, gives the continual learner the potential to learn a good bias that can generalise well to novel tasks from the same environment. Rather than producing a hypothesis that with high probability will perform well on future examples of a particular task, by learning related tasks continually the learner will produce a hypothesis space that with high probability will perform well on future tasks within the same environment. This main result has been demonstrated in the context of multitask learning (Baxter, 2000), and is the main result we demonstrate in Sections 4-6 for a stream of tasks learned continually using knowledge transfer.

The notion of relatedness for tasks in the environment of the continual learner relies on the similarity of their example generating distributions (Ben-David and Borbely, 2008). Formally, given a set of transformation functions such that , tasks in the environment are -related if, for some fixed probability distribution over , the examples in each of these tasks can be generated by applying some to that distribution. Therefore, we can define the equivalence relation (Raczkowski and Sadowski, 1990) on , where is a family of hypothesis spaces for all tasks in the environment, as follows:

Definition 2.

Let be the underlying probability distributions of a set of tasks over a domain . Let be a set of transformations . Let and be related if one can be generated from the other by applying some , such that (and therefore ) or (and therefore ). The samples to be used during learning tasks are said to be -related if these samples come from -related probability distributions.

Let be a family of hypothesis spaces over the domain , and be closed under the action of . Let be a family of hypothesis spaces that consist of sets of hypotheses which are equivalent up to transformations in . If acts as a group over because:

  • For every and every , , and

  • is closed under transformation composition and inverses, i.e. for every , the inverse transformation, , and the composition, are also members of

Then the equivalence relation on is defined by: there exists such that .

Therefore this framework considers the family of hypothesis spaces , which is the family of all equivalence classes of under .

The original setting of this framework is in multitask learning (Ben-David and Borbely, 2008), where the equivalence class for a target task is first found using samples from all tasks. This requires first identifying aspects of all tasks that are invariant under . A second step restricts the learning of a particular task to selecting a hypothesis as the hypothesis for that task. Therefore, the target task can benefit from transfer during this second step, because the hypothesis space explored for the target task already contains these invariances.

In continual learning we are faced with a similar problem, but rather than learning tasks jointly these are observed sequentially. However, provided these tasks are -related, we can adopt a similar framework to derive error bounds of a target task that is learned with forward transfer from a set of source tasks, and of source tasks for which knowledge is updated with backward transfer from a recently learned target task. In the following sections we develop a theory of knowledge transfer across continual tasks that use these two transfer mechanisms.

4 Forward Knowledge Transfer across Related Tasks

In this and following sections, we will use to refer to a target task, or target probability distribution or target sample, and to denote a source task, or source probability distribution or source sample. In forward transfer, the aim is to learn a target task helped by knowledge obtained during previous source tasks , with probability distributions and and their corresponding observed samples and . Forward transfer for a continual learner which observes -related tasks is defined as follows:

Definition 3.

Given classes and , and a set of labeled samples for a set of source tasks and a labeled sample for a target task, in forward knowledge transfer while learning task , the continual learner:

  1. Has access to , obtained as a result of minimising over all .

  2. Selects that minimises over all , and outputs as the hypothesis for .
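
The two steps above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the construction used in the paper: hypotheses are halfspaces in the plane, the transformations are translations along the halfspace normal, so each equivalence class is indexed by an orientation `theta` and its members differ only in an `offset`. The grids `thetas` and `offsets`, the helper names, and the choice to score a class by the average best-in-class error on the source samples only are all hypothetical.

```python
import numpy as np

def unit(theta):
    """Unit normal vector of a halfspace with orientation theta."""
    return np.array([np.cos(theta), np.sin(theta)])

def empirical_error(theta, offset, X, y):
    """Fraction of the sample misclassified by h(x) = 1[x . unit(theta) >= offset]."""
    predictions = (X @ unit(theta) >= offset).astype(int)
    return float(np.mean(predictions != y))

def best_in_class(theta, X, y, offsets):
    """Lowest-error member of the class indexed by theta: returns (offset, error)."""
    errors = [empirical_error(theta, b, X, y) for b in offsets]
    best = int(np.argmin(errors))
    return float(offsets[best]), errors[best]

def forward_transfer(sources, target, thetas, offsets):
    """Toy two-step selection in the spirit of Definition 3 (assumed setup)."""
    # Step 1: choose the class whose best members fit the source samples on average.
    def class_score(theta):
        return np.mean([best_in_class(theta, X, y, offsets)[1] for X, y in sources])
    theta_hat = min(thetas, key=class_score)
    # Step 2: use the target sample only to pick a hypothesis inside that class.
    offset_hat, target_error = best_in_class(theta_hat, *target, offsets)
    return theta_hat, offset_hat, target_error

def make_task(rng, theta, shift, m):
    """Sample a base task, then translate its inputs by shift * unit(theta)."""
    X = rng.normal(size=(m, 2))
    y = (X @ unit(theta) >= 0).astype(int)
    return X + shift * unit(theta), y

rng = np.random.default_rng(0)
thetas = np.linspace(0.0, np.pi, 12, endpoint=False)  # candidate equivalence classes
offsets = np.linspace(-3.0, 3.0, 61)                  # members within each class
true_theta = thetas[3]
sources = [make_task(rng, true_theta, s, 200) for s in (-1.0, 0.5, 1.5)]
target = make_task(rng, true_theta, 1.0, 30)          # small target sample
theta_hat, offset_hat, target_error = forward_transfer(sources, target, thetas, offsets)
print(theta_hat, offset_hat, target_error)
```

In this sketch the target sample is never used to choose the orientation. The source tasks fix the class, and the 30 target examples only need to locate one offset, which is the sense in which the search space for the target task is constrained by earlier tasks.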

In practice, having access to during a target task implies that the continual learner can access some representation of the knowledge obtained during previous tasks (e.g. access to a neural network representing that knowledge). We derive error bounds for learning a target task helped by knowledge transfer from -related source tasks as follows:

Theorem 1.

Let , …, and be a set of -related probability distributions, and , …, and random samples representing these distributions. Let and be defined as in Definition 2. Let . Let . Let be selected according to Definition 3. Then, for every constant , , , with and defined similarly to Theorem in Ben-David and Borbely (2008):

(5)

and, for all i ≤ n:

(6)

then with probability greater than :

(7)

Proof.

Let be the best label predictor in , i.e. . Let be the equivalence class picked according to Definition 3. By the choice of :

(8)

By Theorem 2 in Ben-David and Borbely (2008), with probability greater than :

(9)

and:

(10)

Then, combining the inequalities above, with probability greater than :

(11)

Since , with probability greater than , will have an error for which is within of the best hypothesis there, i.e. . Therefore:

(12)

Theorem 1 implies that, for a sufficiently large number of examples for the sources and the target tasks, forward transfer is expected to benefit learning of a target task. This result is achieved by choosing a hypothesis space for which is biased towards the hypothesis space learned for previous -related tasks from the same environment. The extent of this benefit depends on the number of examples per task (see Eq. 5 and Eq. 6). Baxter (2000) demonstrated that the number of examples required per task decreases along with an increasing number of tasks, in particular:

(13)

where is the capacity of the learner given an error and a set of sets of loss functions for the family of hypothesis spaces . Provided that this capacity increases sublinearly with , the number of examples required per task will decrease with an increasing number of tasks.

The amount of transfer to a target task and therefore the extent to which the bound in Theorem 1 is satisfied depends on how many source tasks are used for transfer. Intuitively, the larger this number, the smaller the bound, since the target task will have a better bias of its environment with more related tasks having been observed, which would lead to a better hypothesis space to be selected for that task. Therefore, the later a target task is observed, the greater the opportunity for it to benefit from forward transfer. This is in accordance with previous research that demonstrated that a larger number of tasks learned continually benefits transfer (Benavides-Prado et al., 2017, 2020). Next we analyse the effect of task order in forward transfer, and derive error bounds for a target task that account for its position in the sequence.

Definition 4.

Given classes and , a set of labeled samples {, …, for a set of source tasks and a labeled sample for a target task. Let:

  • be the result of minimising over all , at time .

  • be the result of minimising over all , at time .

  • that minimises over all , and outputs as the hypothesis for task at time .

  • that minimises over all , and outputs as the hypothesis for task at time .

Corollary 1.

Let , …, , , …, and , , , , be defined as in Theorem 1, at time . Similarly, let , …, and be a set of -related probability distributions, , …, and random samples representing these distributions, at time . Let and be selected according to Definition 4, at time and , respectively. Then, for every , , , if:

(14)

and, at time , for all i ≤ n:

(15)

while, at time , for all i ≤ (n+z):

(16)

then with probability greater than :

(17)

See Appendix A for the proof of this corollary. The main part of the proof in Appendix A lies in Eq. 46. Since the best hypothesis space for a larger number of tasks is better than the best hypothesis space for a smaller number of tasks in the same environment, i.e. the bias over the environment gets refined over time, tasks observed later in the sequence will benefit more from transfer.

Bounds in Theorem 1 and Corollary 1 depend on the difference between and , and , with and , and . Ben-David and Borbely (2008) showed that, for a sufficiently large number of tasks , . We refer readers to Section 6 of Ben-David and Borbely (2008) for details.

5 Backward Knowledge Transfer across Related Tasks

Backward transfer works by updating a source task using knowledge gained during the most recent target task . Transfer occurs from the space of a target probability distribution , represented by a sample , to the space of a probability distribution that uses a sample for learning that source task. In continual learning, the aim is to use , and its corresponding sample , to bias the update of a refined version of towards aspects that are invariant with , provided these are related. Benavides-Prado, Koh and Riddle (2020) analysed the special case of two tasks, one source and one target , for a specific implementation of a continual learner based on SVM. Here, we present bounds for an agnostic continual learner, as follows:

Definition 5.

Given classes and , and a pair of labeled samples , for tasks , , during backward transfer the continual learner:

  1. Selects that minimises over all .

  2. Selects that minimises over all , and outputs as the hypothesis for task .

In practice, the two steps in Definition 5 could be performed sequentially or jointly. For example, selecting in the first step could be performed by jointly training an auxiliary learner with examples from both and , and then transferring back this information to during the second step. Alternatively, both could be selected jointly while training for aided by .
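
As a rough illustration of the sequential variant, the sketch below applies the two steps of Definition 5 to a toy problem. It rests on assumptions that are not part of the paper: hypotheses are halfspaces in the plane, related tasks differ by a translation along the halfspace normal, an equivalence class is indexed by an orientation `theta`, and the class is scored by the sum of the best-in-class empirical errors on the source and target samples. All function names and grids are hypothetical.

```python
import numpy as np

def unit(theta):
    """Unit normal vector of a halfspace with orientation theta."""
    return np.array([np.cos(theta), np.sin(theta)])

def class_error(theta, X, y, offsets):
    """Lowest empirical error inside the class theta, and the offset attaining it."""
    projections = X @ unit(theta)
    errors = [np.mean((projections >= b).astype(int) != y) for b in offsets]
    best = int(np.argmin(errors))
    return float(errors[best]), float(offsets[best])

def backward_transfer(source, target, thetas, offsets):
    """Toy two-step update of a source task in the spirit of Definition 5."""
    # Step 1: choose the class that minimises the summed best-in-class error
    # over the source sample and the newly observed target sample.
    def joint_score(theta):
        return (class_error(theta, *source, offsets)[0]
                + class_error(theta, *target, offsets)[0])
    theta_hat = min(thetas, key=joint_score)
    # Step 2: re-select the hypothesis for the source task inside that class.
    source_error, offset_hat = class_error(theta_hat, *source, offsets)
    return theta_hat, offset_hat, source_error

def make_task(rng, theta, shift, m):
    """Sample a base task, then translate its inputs by shift * unit(theta)."""
    X = rng.normal(size=(m, 2))
    y = (X @ unit(theta) >= 0).astype(int)
    return X + shift * unit(theta), y

rng = np.random.default_rng(1)
thetas = np.linspace(0.0, np.pi, 12, endpoint=False)  # candidate equivalence classes
offsets = np.linspace(-3.0, 3.0, 61)                  # members within each class
true_theta = thetas[3]
source = make_task(rng, true_theta, 0.5, 15)          # source learned from few examples
target = make_task(rng, true_theta, -1.0, 300)        # later, larger target task

# Orientation chosen from the source sample alone, before the target is seen.
theta_alone = min(thetas, key=lambda t: class_error(t, *source, offsets)[0])
theta_hat, offset_hat, source_error = backward_transfer(source, target, thetas, offsets)
print(theta_alone, theta_hat, source_error)
```

With only 15 source examples, several orientations may fit the source sample equally well, so the orientation chosen from the source alone is not well determined. Adding the target sample to step 1 narrows that choice, and step 2 then revisits the source hypothesis inside the refined class.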

Based on Definition 5, in the special case of two tasks and :

Theorem 2.

Let and be a set of -related probability distributions, and and random samples representing these distributions on tasks and respectively. Let and be defined as in Definition 2. Let and be defined as in Theorem 1. Let be selected according to Definition 5. Then, for every , , , if:

(18)

and:

(19)

then with probability greater than :

(20)

Proof.

Let be the best label predictor in , i.e. . Let be the equivalence class picked according to Definition 5. By the choice of :

(21)

By Theorem 2 in Ben-David and Borbely (2008), in the case of two tasks:

(22)

then with probability greater than :

(23)

and:

(24)

Then, combining the inequalities above, with probability greater than :

(25)

Since , with probability greater than , will have an error for which is within of the best hypothesis there, i.e. . Therefore:

(26)

Similar to forward transfer, these bounds depend on the difference between , , and . Section 4 provides details on the meaning of these parameters and their relation to each other.

The main result from Theorem 2 and its corresponding proof is that an existing source task can also benefit from knowledge acquired during a related target task. This benefit is expected to be smaller than that of transferring forward, since forward transfer benefits from multiple sources (see Eq. ) while backward transfer benefits from a single target task (see Eq. ). We show that doing backward transfer helps to select a better hypothesis space and therefore provides a better bound on the performance of that task (see Eq. ). Therefore, a natural next question is whether backward transfer from a sequence of target tasks, learned one at a time, can help improve these bounds. We prove that doing backward transfer multiple times sequentially also helps to decrease the error on a source task with each transfer.

Definition 6.

Given classes and , a set of labeled samples for a source task, and labeled samples , for target tasks at times and . Let:

  • be the result of minimising + over all , at time .

  • be the result of minimising + + over all , at time .

  • that minimises over all , and outputs as the hypothesis for task at time .

  • that minimises over all , and outputs as the hypothesis for task at time .

Corollary 2.

Let , and be a set of -related probability distributions, , and random samples representing these distributions. Let and be defined as in Definition 2. Let and be defined as in Theorem 1. Let and be selected according to Definition 6. Then, for every , , , , if:

(27)

and, at time :

(28)

while, at time :

(29)

then with probability greater than :

(30)

See Appendix B for the proof of this corollary. These results imply that doing backward transfer sequentially whilst learning target tasks will lead to more refined hypothesis spaces in a source task, beyond the hypothesis space learned initially (with or without forward transfer). Furthermore, this suggests that continually learning -related tasks while doing both forward and backward transfer can lead to a better bias over the learning environment of these tasks, i.e. the result demonstrated by Baxter (2000) for multitask learning, which we demonstrate in the next section.

6 Continual Learning of Related Tasks using Knowledge Transfer

Based on the bounds derived in Section 4 and Section 5, we are now ready to derive bounds for a continual learner that observes supervised related tasks sequentially while doing knowledge transfer. First, let us recall from Definition 1 that the continual learner is embedded in an environment of related tasks, , where is the set of all probability distributions on and is a distribution on . The error of a selected hypothesis space for all tasks in such an environment is defined as:

(31)

for any drawn at random from according to . Let’s define as the average when performing forward transfer to a new task, i.e. corresponds to in Theorem 1, averaged across all tasks. Similarly, let’s define as the average when performing backward transfer to a new task, i.e. corresponds to in Theorem 2 averaged across all tasks. Although the definitions of and oversimplify the continual learner to the case of all tasks achieving roughly the same error bounds by means of transfer, this will serve to demonstrate how forward and backward transfer help to improve the bounds for the continual learner as a whole. For a task , the error bound of applying forward and backward transfer and selecting instead of as the hypothesis space for that task is:

(32)

As demonstrated in Corollaries 1 and 2, the extent to which transfer helps to improve the error bounds of a particular task depends on the order of that task in the sequence, which in Eq. 32 impacts the total amount of transfer through for forward transfer and for backward transfer. Given Eq. 32, for a sequence of tasks , we can define the error bounds on the environment that learns those tasks by means of transfer as follows.

Theorem 3.

Let be a set of distributions, one for each task, drawn at random from , the set of all probability distributions on , according to , a distribution on . Let be the family of hypothesis spaces for tasks to be learned in the environment , according to Definition 2, with selected according to Theorem 1 and Theorem 2. Let be the family of hypothesis spaces for tasks with no transfer, and let be selected as the hypothesis space for the tasks with no transfer. If the number of tasks satisfies:

(33)

with = , i.e. the set of all hypothesis spaces in the hypothesis space family such that each is defined by:

(34)

and, for all , the number of examples per task, satisfies:

(35)

where = (i.e. is the union of all sequences of hypothesis , each of size , subject to loss function ), and with:

(36)

then, with probability at least , will satisfy:

(37)

Proof.

According to Eq. 32, for all :

(38)

which leads to:

(39)

with:

(40)

and:

(41)

then: