Most analyses of learning algorithms assume that the algorithm starts learning from scratch when presented with a new dataset. However, in real life, it is often the case that we will use the same algorithm on many different tasks, and that information should be transferred from one task to another. For example, a key problem in pattern recognition is to learn a dictionary of features helpful for image classification: it makes perfect sense to assume that features learnt to classify dogs against other animals can be re-used to recognize cats. This idea is at the core of transfer learning, see (Thrun and Pratt, 1998; Balcan et al., 2015; Baxter, 1997, 2000; Cavallanti et al., 2010; Maurer, 2005; Maurer et al., 2013; Pentina and Lampert, 2014; Maurer et al., 2016) and references therein.
The setting in which the tasks are presented simultaneously is often referred to as learning-to-learn (Baxter, 2000), whereas when the tasks are presented sequentially, the term lifelong learning is often used (Thrun, 1996). In either case, a substantial improvement over “learning in isolation” can be expected, especially when the sample size per task is relatively small. We use this terminology throughout the paper.
Although a substantial amount of work has been done on the theoretical study of learning-to-learn (Baxter, 2000; Maurer, 2005; Pentina and Lampert, 2014; Maurer et al., 2016), to our knowledge there is no analysis of the statistical performance of lifelong learning algorithms. Ruvolo and Eaton (2013) studied the convergence of certain optimization algorithms for lifelong learning. However, no statistical guarantees are provided. Furthermore, in all the aforementioned works, the authors propose a technique for transfer learning which constrains the within-task algorithm to be of a certain kind, e.g. regularized empirical risk minimization.
The main goal of this paper is to show that it is possible to perform a theoretical analysis of lifelong learning with minimal assumptions on the form of the within-task algorithm. Given a learner with her/his own favourite algorithm(s) for learning within tasks, we propose a meta-algorithm for transferring information from one task to the next. The algorithm maintains a prior distribution on the set of representations, which is updated after each new task is encountered using the exponentially weighted aggregation (EWA) procedure; we therefore call it EWA for lifelong learning, or EWA-LL.
A standard way to provide theoretical guarantees for online algorithms is through regret bounds, which measure the discrepancy between the prediction error of the forecaster and the error of an ideal predictor. We prove that, as long as the within-task algorithms have good statistical properties, EWA-LL inherits these properties. Specifically, in Theorem 3.1 we present regret bounds for EWA-LL, in which the regret bounds for the within-task algorithms are combined into a regret bound for the meta-algorithm.
We also show, using an online-to-batch analysis, that it is possible to derive a strategy for learning-to-learn, and provide risk bounds for this strategy. The bounds are generally in the order of , where is the number of tasks and is the sample size per task. Moreover, in some specific situations we derive rates in . To our knowledge these rates are novel, and they justify the use of transfer learning with very small sample sizes .
The paper is organized as follows. In Section 2 we introduce the lifelong learning problem. In Section 3 we present the EWA-LL algorithm and provide a bound on its expected regret. This bound is very general, but may be hard to parse at first sight. So, in Section 4 we present more explicit versions of our bound in two classical examples: a finite set of predictors and dictionary learning. We also provide a short simulation study for dictionary learning. At this point, we hope that the reader will have a clear overview of the problem under study. The rest of the paper is devoted to theoretical refinements: in online learning, uniform bounds are the norm rather than bounds in expectation (Cesa-Bianchi and Lugosi, 2006). In Section 5 we establish such bounds for EWA-LL. Section 6 provides an online-to-batch analysis that allows one to use a modification of EWA-LL for learning-to-learn. The supplementary material includes proofs (Appendix A), improvements for dictionary learning (Appendix B) and extended results (Appendix C).
In this section, we introduce our notation and present the lifelong learning problem.
Let and be some sets. A predictor is a function , where for regression and for binary classification. The loss of a predictor on a pair is a real number denoted by . As mentioned above, we want to transfer the information (a common data representation) gained from the previous tasks to a new one. Formally, we let be a set and prescribe a set of feature maps (also called representations) , and a set of functions . We shall design an algorithm that is useful when there is a function , common to all the tasks, and task-specific functions such that
is a good predictor for task , in the sense that the corresponding prediction error (see below) is small.
We are now ready to describe the learning problem. We assume that tasks are dealt with sequentially. Furthermore, we assume that each task dataset is itself revealed sequentially and refer to this setting as online-within-online lifelong learning. Specifically, at each time step , the learner is challenged with a task, corresponding to a dataset
where . The dataset is itself revealed sequentially, that is, at each inner step :
The object is revealed,
The learner has to predict , let denote the prediction,
The label is revealed, and the learner incurs the loss .
The task ends at time , at which point the prediction error is
This process is repeated for each task , so that at the end of all the tasks, the average error is
Ideally, if for a given representation , the best predictor for task was known in advance, then an ideal learner using for prediction would incur the error
The above expression is slightly different from the usual notion of regret (Cesa-Bianchi and Lugosi, 2006), which does not contain the factor . This normalization is important in that it allows us to give equal weights to different tasks.
Note that an oracle who would have known the best common representation for all tasks in advance would have only suffered, on the entire sequence of datasets, the error
We are now ready to state our principal objective: we wish to design a procedure (meta-algorithm) that, at the beginning of each task , produces a function so that, within each task, the learner can use its own favorite online learning algorithm to solve task on the sequence . We wish to control the compound regret of our procedure
which may succinctly be written as . This objective is accomplished in Section 3 under the assumption that a regret bound for the within-task-algorithm is available.
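To make the objective concrete, the quantities of this section can be computed on a toy problem. The following is a minimal sketch, assuming finite sets of representations and predictors (so that the infima become minima) and a nested-list layout of per-step losses; the function names and the data layout are our own illustrative choices and are not part of the formal setting.

```python
from statistics import mean


def task_error(step_losses):
    """Average loss over the inner steps of one task (note the normalization)."""
    return mean(step_losses)


def oracle_error(comparator_losses):
    """Error of the oracle that knows the best common representation.

    comparator_losses[g][t][h] is the list of per-step losses that the
    predictor h composed with representation g suffers on task t.
    """
    return min(
        mean(min(task_error(per_h) for per_h in per_task) for per_task in per_g)
        for per_g in comparator_losses
    )


def compound_regret(learner_losses, comparator_losses):
    """Average error of the learner minus the error of the oracle.

    learner_losses[t] is the list of per-step losses suffered on task t.
    """
    average_error = mean(task_error(losses) for losses in learner_losses)
    return average_error - oracle_error(comparator_losses)
```

Because each task error is an average rather than a sum, a task with many observations does not dominate a task with few observations.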
We end this section with two examples included in the framework.
Example 2.1 (Dictionary learning).
Set , and call a dictionary, where each is a real-valued function on . Furthermore choose to be a set of linear functions on , so that, for each task
In practice, depending on the value of , we can use least squares estimators or LASSO to learn. In (Maurer et al., 2013; Ruvolo and Eaton, 2013), the authors consider and for some matrix , and the goal is to learn jointly the predictors and the dictionary .
Example 2.2 (Finite set ).
We choose and any set. While this example is interesting in its own right, it is also instrumental in studying the continuous case via a suitable discretization process. A similar choice has been considered by Crammer and Mansour (2012) in the multitask setting, in which the goal is to bound the average error on a prescribed set of tasks.
We notice that a slightly different learning setting is obtained when each dataset is given all at once. We refer to this as batch-within-online lifelong learning; this setting is briefly considered in Appendix C. On the other hand when all datasets are revealed all at once, we are in the well-known setting of learning-to-learn (Baxter, 2000). In Section 6, we explain how our lifelong learning analysis can be adapted to this setting.
Algorithm 1 (EWA-LL)
Data: A sequence of datasets associated with different learning tasks; the points within each dataset are also given sequentially.
Input: A prior , a learning parameter and a learning algorithm for each task which, for any representation returns a sequence of predictions and suffers a loss
ii. Run the within-task learning algorithm on and suffer loss .
In this section, we present our lifelong learning algorithm, derive its regret bound and then specialize it to two popular within-task online algorithms.
3.1 EWA-LL Algorithm
Algorithm 1 takes as input a prior and a learning parameter, and updates a probability distribution on the set of representations before the encounter of task . The effect of Step iii is that any representation which does not perform well on task is less likely to be reused on the next task. We stress that this procedure allows the user to freely choose the within-task algorithm, which does not even need to be the same for each task.
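For intuition, the meta-algorithm can be sketched on a finite list of representations. This is only an illustration: the update used below is the standard exponentially weighted rule (each weight is multiplied by the exponential of minus the learning parameter times the loss, then renormalized), and the callback `within_task_loss`, the function name `ewa_ll` and the uniform default prior are assumptions of this sketch.

```python
import math
import random


def ewa_ll(tasks, representations, within_task_loss, eta, prior=None, seed=0):
    """Sketch of the meta-algorithm on a finite list of representations.

    within_task_loss(g, task) is a hypothetical callback: it runs the user's
    within-task algorithm with representation g and returns its average loss.
    """
    rng = random.Random(seed)
    n = len(representations)
    weights = list(prior) if prior is not None else [1.0 / n] * n
    drawn, losses = [], []
    for task in tasks:
        # Step i: draw a representation from the current distribution.
        idx = rng.choices(range(n), weights=weights)[0]
        # Step ii: run the within-task algorithm. The update below needs the
        # loss of every candidate representation, not only the drawn one.
        all_losses = [within_task_loss(g, task) for g in representations]
        drawn.append(idx)
        losses.append(all_losses[idx])
        # Step iii: exponentially weighted update, then renormalization.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, all_losses)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return drawn, losses, weights
```

Since the within-task algorithm only enters through the callback, it can be changed from one task to the next without modifying the meta-algorithm.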
3.2 Bounding the Expected Regret
Since Algorithm 1 involves a randomization strategy, we can only get a bound on the expected regret, the expectation being with respect to the drawing of the function at step i in the algorithm. Let denote the expectation of when . Note that the expected overall-average loss that we want to upper bound is then
If, for any , and the within-task algorithm has a regret bound , then
where the infimum is taken over all probability measures and is the Kullback-Leibler divergence between and .
The proof is given in Appendix A. Some comments are in order as the bound in Theorem 3.1 might not be easy to read. First, similar to standard analyses in online learning, the parameter is a decreasing function of , hence the bound vanishes as grows. Second, corollaries are derived in Section 4 that are easier to read, as they are more similar to usual regret inequalities (Cesa-Bianchi and Lugosi, 2006), that is, the right hand side of the bound is of the form
The bound in Theorem 3.1 looks slightly different, but is quite similar in spirit. Indeed, instead of an infimum with respect to we have an infimum on all the possible aggregations with respect to ,
where the remainder term depends on . In order to look like (3.1), we could consider a measure highly concentrated around the representation minimizing (3.1). When is finite, this is a reasonable strategy and the bound is given explicitly in Section 4.1 below. However, in some situations, this would cause the term to diverge. Studying accurately the minimizer in usually leads to an interesting regret bound, and this is exactly what is done in Section 4.
3.3 Examples of Within Task Algorithms
We now specialize the general bound in Theorem 3.1 to two popular online algorithms which we use within tasks.
3.3.1 Online Gradient
The first algorithm assumes that is a parametric family of functions , and that for any , is convex, -Lipschitz and upper bounded by ; we denote by a subgradient.
The EWA-LL algorithm using the OGA within task with step size satisfies
We note that under additional assumptions on loss functions, (Hazan et al., 2007, Theorem 1) provides bounds for that are in .
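As an illustration of the online gradient step within a task, here is a minimal sketch for linear predictors acting on the features produced by a fixed representation. The absolute loss, the zero initialization and all function names are assumptions made for this example; any convex Lipschitz loss with a computable subgradient could be plugged in instead.

```python
def online_gradient(features, labels, step, loss, loss_subgradient):
    """Online (sub)gradient descent for a linear predictor on one task.

    features[i] is the representation of the i-th input, as a list of floats.
    Returns the average loss suffered over the task.
    """
    theta = [0.0] * len(features[0])
    total = 0.0
    for z, y in zip(features, labels):
        # Predict before the label is used, as required by the online protocol.
        pred = sum(t * zj for t, zj in zip(theta, z))
        total += loss(pred, y)
        # Chain rule: subgradient in theta = (d loss / d prediction) * z.
        scale = loss_subgradient(pred, y)
        theta = [t - step * scale * zj for t, zj in zip(theta, z)]
    return total / len(labels)


def abs_loss(pred, y):
    return abs(pred - y)


def abs_loss_subgradient(pred, y):
    # Sign of (pred - y); 0 is a valid subgradient at pred == y.
    return (pred > y) - (pred < y)
```

The returned average loss is the quantity that the meta-algorithm uses to reweight the representation.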
3.3.2 Exponentially Weighted Aggregation
The second algorithm is based on the EWA procedure on the space for a prescribed representation .
A task .
; a prior probability distribution on .
is revealed, update
Recall that a function is called -exp-concave if is concave.
Assume that is finite and that there exists such that for any , the function is -exp-concave and upper bounded by a constant . Then the EWA-LL algorithm using the EWA within task with satisfies
A typical example is the quadratic loss function . When there is some such that and , then the exp-concavity assumption is verified with and the boundedness assumption with .
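The within-task EWA procedure of this subsection can be sketched for a finite set of candidate predictors as follows. This is a minimal illustration, assuming a uniform prior over the candidates and forecasts given as a table; the function name, the argument layout and the name `zeta` for the learning rate are our own choices.

```python
import math


def ewa_within_task(predictions, labels, zeta, loss):
    """Exponentially weighted aggregation over a finite set of predictors.

    predictions[i][k] is the forecast of the k-th candidate at inner step i.
    Returns the average loss of the aggregated forecast and the final weights.
    """
    n = len(predictions[0])
    weights = [1.0 / n] * n  # uniform prior: an assumption of this sketch
    total = 0.0
    for preds, y in zip(predictions, labels):
        # Predict with the weighted average of the candidates.
        forecast = sum(w * p for w, p in zip(weights, preds))
        total += loss(forecast, y)
        # Once the label is revealed, downweight candidates with large loss.
        weights = [w * math.exp(-zeta * loss(p, y)) for w, p in zip(weights, preds)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
    return total / len(labels), weights
```

With the quadratic loss and bounded labels and forecasts, the weights concentrate on the best candidate after a few steps.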
Note that when the exp-concavity assumption does not hold, Gerchinovitz (2011) derives a bound with . Moreover, PAC-Bayesian type bounds in various settings (including infinite ) can be found in (Catoni, 2004; Audibert, 2006; Gerchinovitz, 2013). We refer the reader to (Gerchinovitz, 2011) for a comprehensive survey.
In this section, we discuss two important applications. To ease our presentation, we assume that all the tasks have the same sample size, that is for all .
4.1 Finite Subset of Relevant Predictors
Under the assumptions of Theorem 3.1, if we set and uniform on ,
Fix , as the Dirac mass on and note that . ∎
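The identity used in this proof can be checked directly. Writing $\delta_g$ for the Dirac mass at a representation $g$, $\pi$ for the uniform prior and $\mathcal{G}$ for the finite set of representations (symbols chosen here for illustration), and using the convention $0\log 0 = 0$, only the term at $g$ contributes to the Kullback-Leibler divergence:

```latex
\mathcal{K}(\delta_g,\pi)
  = \sum_{g'\in\mathcal{G}} \delta_g(g') \log\frac{\delta_g(g')}{\pi(g')}
  = \log\frac{1}{\pi(g)}
  = \log|\mathcal{G}| .
```

The complexity term therefore grows only logarithmically with the number of candidate representations.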
Assume that is finite, that for some , for any , the function is -exp-concave and upper bounded by a constant . Then the EWA-LL algorithm using the EWA within task with satisfies
In Section 6, we derive from Theorem 3.1 a bound in the batch setting. As we shall see, in the finite case the bound is exactly the same as the bound on the compound regret. This allows us to compare our results to previous ones obtained in the learning-to-learn setting. In particular, our bound improves upon (Pentina and Lampert, 2014) who derived an bound.
4.2 Dictionary Learning
We now give details on Example 2.1 in the linear case. Specifically, we let , we let be the set formed by all matrices , whose columns have Euclidean norm equal to one, and we define . Within this subsection we assume that the loss is convex and -Lipschitz with respect to its first argument, that is, for every and , it holds . We also assume that for all and , . Assume .
We define the prior as follows: the columns of are i.i.d., uniformly distributed on the -dimensional unit sphere.
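Drawing from this prior is straightforward, since a standard Gaussian vector divided by its norm is uniformly distributed on the unit sphere. The sketch below stores a dictionary as a list of columns; the function name and the argument names `dim` and `n_atoms` are illustrative assumptions.

```python
import math
import random


def sample_dictionary(dim, n_atoms, rng):
    """Draw n_atoms i.i.d. columns, each uniform on the unit sphere in dimension dim."""
    columns = []
    for _ in range(n_atoms):
        # A normalized standard Gaussian vector is uniform on the sphere.
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        columns.append([x / norm for x in v])
    return columns
```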
Under the assumptions of Theorem 3.1, with ,
When we use OGA within tasks, we can use Corollary 3.2 with and so for any . Moreover,
so Theorem 4.3 leads to the following corollary.
Algorithm EWA-LL for dictionary learning, with , and using the OGA algorithm within tasks, with step , satisfies
Note that the upper bound (4.1) may be loose. For example, when the are i.i.d. on the unit sphere, is close to . In this case, it is possible to improve the term employed in the calculation of the bound; we postpone the lengthy details to Appendix B.
4.2.1 Algorithmic Details and Simulations
We implement our meta-algorithm Randomized EWA in this setting. The algorithm used within each task is the simple version of the online gradient algorithm outlined in Section 3.3.1.
In order to draw from , we use steps of the Metropolis-Hastings algorithm with a normalized Gaussian proposal (see, for example, Robert and Casella, 2013). In order to ensure a short burn-in period, we use the previous draw as a starting point. The procedure is given in Algorithm 4.
Algorithm 4
Data: As in Algorithm 1.
Input: A learning rate for EWA and a learning rate for the online gradient. A number of steps for the Metropolis-Hastings algorithm.
Run the within-task learning algorithm and suffer loss .
Metropolis-Hastings algorithm. Repeat times:
a. Draw and then set .
b. Set with probability ; remains unchanged otherwise.
Note the bottleneck of the algorithm: in step b we have to compare and on the whole dataset so far.
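The Metropolis-Hastings update can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure: the target is taken proportional to the exponential of minus the learning rate times the cumulative loss (with the uniform prior on unit-norm columns), the proposal perturbs every column with Gaussian noise and renormalizes it, and the acceptance rule is the standard Metropolis ratio for a symmetric proposal. The names `mh_steps`, `cumulative_loss` and `scale` are hypothetical.

```python
import math
import random


def normalise(col):
    norm = math.sqrt(sum(v * v for v in col))
    return [v / norm for v in col]


def mh_steps(dictionary, cumulative_loss, eta, n_steps, scale, rng):
    """Run n_steps Metropolis-Hastings moves starting from the previous draw.

    dictionary is a list of unit-norm columns; cumulative_loss(D) returns the
    loss of D summed over all tasks seen so far (the costly part of each step).
    """
    current, current_loss = dictionary, cumulative_loss(dictionary)
    for _ in range(n_steps):
        # (a) Gaussian perturbation of each column, projected back to the sphere.
        proposal = [
            normalise([v + scale * rng.gauss(0.0, 1.0) for v in col])
            for col in current
        ]
        proposal_loss = cumulative_loss(proposal)
        # (b) Accept with probability min(1, exp(-eta * loss increase)).
        log_ratio = -eta * (proposal_loss - current_loss)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, current_loss = proposal, proposal_loss
    return current
```

Each call to `cumulative_loss` re-evaluates the proposal on all past tasks, which is the source of the bottleneck noted above.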
We now present a short simulation study. We generate data in the following way: we let , , and . The columns of are drawn uniformly on the unit sphere, and task regression vectors are also independent and have i.i.d. coordinates in . We generate the datasets as follows: all the are i.i.d. from the same distribution as , and where the are i.i.d. and .
We compare Algorithm 4 with to an oracle who knows the representation , but not the task regression vectors , and learns them using the online gradient algorithm with step size . Notice that after each chunk of observations, a new task starts, so the parameter changes. Thus, the oracle incurs a large loss until it learns the new (usually within a few steps). This explains the “stair” shape of the cumulative loss of the oracle in Figure 1. Figure 2 indicates that after a few tasks, the dictionary is learnt by EWA-LL: its cumulative loss becomes parallel to the one of the oracle. Due to the bottleneck mentioned above, the algorithm becomes quite slow to run when grows. In order to improve the speed of the algorithm, we also tried Algorithm 4 with . There is no theoretical justification for this choice, but the algorithm is then 10 times faster. As the red line in Figure 2 shows, this version of the algorithm still learns , but it takes more steps. Note that this is not completely unexpected: the Markov chain generated by this algorithm is no longer stationary, but it can still enjoy good mixing properties. It would be interesting to study the theoretical performance of Algorithm 4. However, this would require considerable technical tools from Markov chain theory which are beyond the scope of this paper.
5 Uniform Bounds
In this section, we show that it is possible to obtain a uniform bound, as opposed to a bound in expectation as in Theorem 3.1. From a theoretical perspective, the price to pay is very low: we only have to assume that the loss function is convex with respect to its first argument. However, in practice, there is an aggregation step that might not be feasible. This is discussed at the end of the section. The algorithm is outlined in Algorithm 5.
Assume that for any and that the algorithm used within-task has a regret . Assume further that is convex with respect to its first argument. Then it holds that
At each step , the loss suffered by the algorithm is
and we can just apply Theorem 3.1. ∎
Data and Input: same as in Algorithm 1.
Run the within-task learning algorithm on for each and return as predictions:
In practice, for an infinite set we cannot run the within-task algorithm simultaneously for all . So, we cannot compute the prediction (5.1) exactly. A possible strategy is to draw elements of i.i.d. from , say , and to replace (5.1) by
An application of Hoeffding’s inequality shows that, for any , with probability at least , the bound in Theorem 5.1 will still hold, up to an additional term .
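The Monte Carlo approximation of the aggregated prediction can be sketched in a few lines. The callbacks `sample_representation` and `predict_with`, as well as the function name, are hypothetical placeholders for the current distribution on representations and for the within-task predictor.

```python
import random


def aggregated_prediction(sample_representation, predict_with, x, n_draws, rng):
    """Approximate the aggregated prediction by an empirical average.

    sample_representation(rng) draws one representation from the current
    distribution; predict_with(g, x) is the within-task prediction at x when
    the representation g is used.
    """
    draws = [sample_representation(rng) for _ in range(n_draws)]
    return sum(predict_with(g, x) for g in draws) / n_draws
```

The within-task algorithm then has to be run once per drawn representation, so the number of draws trades accuracy against computation.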
In this section, we show how our analysis of lifelong learning can be used to derive bounds for learning-to-learn. In this setting, the tasks and their datasets are generated by first sampling task distributions i.i.d. from a “meta-distribution”, called environment by Baxter (2000), and then for each task , a dataset is sampled i.i.d. from . We stress that in this setting, the entire data is given all at once to the learner. Note that for simplicity, we assumed that all the sample sizes are the same.
We wish to design a strategy which, given a new task and a new sample i.i.d. from , computes a function , that will predict well when . For this purpose we propose the following strategy:
Run EWA-LL on . We obtain a sequence of representations ,
Draw uniformly and put ,
Run the within task algorithm on the sample , obtaining a sequence of functions,
Draw uniformly and put .
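The last three steps of this strategy amount to two uniform draws around one run of the within-task algorithm, as the following sketch shows. It assumes that the sequence of representations has already been produced by the lifelong run on the training tasks; the callback `run_within_task` and the function name are hypothetical.

```python
import random


def learning_to_learn_predictor(representations, run_within_task, new_sample, rng):
    """Online-to-batch conversion: pick one representation and one predictor.

    representations is the sequence produced by the lifelong run on the
    training tasks; run_within_task(g, sample) returns the list of predictors
    visited by the within-task algorithm along the new sample.
    """
    g_hat = rng.choice(representations)  # uniform draw over the task index
    predictors = run_within_task(g_hat, new_sample)
    h_hat = rng.choice(predictors)  # uniform draw over the inner step
    return lambda x: h_hat(g_hat(x))
```

The uniform draws are what turn the regret bounds, which control averages over tasks and inner steps, into a bound on the expected risk of a single predictor.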
Our next result establishes that the strategy leads indeed to safe predictions.
Let be the expectation over all data pairs , , , , and also over the randomized decisions of the learner , and . Then
The proof is given in Appendix A. As in Theorem 3.1, the result is given in expectation with respect to the randomized decisions of the learner. Assuming that is convex with respect to its first argument, we can state a similar result for a non-random procedure, as was done in Section 5. Details are left to the reader.
In (Baxter, 2000; Maurer et al., 2013; Pentina and Lampert, 2014), the results on learning-to-learn are given with large probability with respect to the observations , rather than in expectation. Using the machinery in (Cesa-Bianchi and Lugosi, 2006, Lemma 4.1) we conjecture that it is possible to derive a bound in probability from Theorem 6.1.
7 Concluding Remarks
We presented a meta-algorithm for lifelong learning and derived a fully online analysis of its regret. An important advantage of this algorithm is that it inherits the good properties of any algorithm used to learn within tasks. Furthermore, using online-to-batch conversion techniques, we derived bounds for the related framework of learning-to-learn.
We discussed the implications of our general regret bounds for two applications: dictionary learning and a finite set of representations. Further applications of this algorithm which may be studied within our framework are deep neural networks and kernel learning. In the latter case, which has been addressed by Pentina and Ben-David (2015) in the learning-to-learn setting, is a feature map to a reproducing kernel Hilbert space , and . In the former case, and is typically a linear function. The vector-valued function models a multilayer network with shared hidden weights. This is discussed in (Maurer et al., 2016), again in the learning-to-learn setting.
- Audibert (2006) Audibert, J.-Y. (2006). A randomized online learning algorithm for better variance control. In Proc. 19th Annual Conference on Learning Theory, pages 392–407. Springer.
- Balcan et al. (2015) Balcan, M.-F., Blum, A., and Vempala, S. (2015). Efficient representations for lifelong learning and autoencoding. In Proc. 28th Conference on Learning Theory, pages 191–210.
- Baxter (1997) Baxter, J. (1997). A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1):7–39.
- Baxter (2000) Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198.
- Catoni (2004) Catoni, O. (2004). Statistical learning theory and stochastic optimization. Saint-Flour Summer School on Probability Theory 2001 (Jean Picard ed.), Lecture Notes in Mathematics. Springer-Verlag, Berlin.
- Cavallanti et al. (2010) Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. (2010). Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11:2901–2934.
- Cesa-Bianchi and Lugosi (2006) Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge University Press.
- Crammer and Mansour (2012) Crammer, K. and Mansour, Y. (2012). Learning multiple tasks using shared hypotheses. In Advances in Neural Information Processing Systems 25, pages 1475–1483.
- Gerchinovitz (2011) Gerchinovitz, S. (2011). Prédiction de suites individuelles et cadre statistique classique: étude de quelques liens autour de la régression parcimonieuse et des techniques d’agrégation. PhD thesis, Paris 11.
- Gerchinovitz (2013) Gerchinovitz, S. (2013). Sparsity regret bounds for individual sequences in online linear regression. Journal of Machine Learning Research, 14(1):729–769.
- Hazan et al. (2007) Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192.
- Maurer (2005) Maurer, A. (2005). Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6:967–994.
- Maurer et al. (2013) Maurer, A., Pontil, M., and Romera-Paredes, B. (2013). Sparse coding for multitask and transfer learning. In Proc. 30th International Conference on Machine Learning, pages 343–351.
- Maurer et al. (2016) Maurer, A., Pontil, M., and Romera-Paredes, B. (2016). The benefit of multitask representation learning. Journal of Machine Learning Research, 17(81):1–32.
- McAllester (1998) McAllester, D. A. (1998). Some PAC-Bayesian theorems. In Proc. 11th Annual Conference on Computational Learning Theory, pages 230–234. ACM.
- Pentina and Ben-David (2015) Pentina, A. and Ben-David, S. (2015). Multi-task and lifelong learning of kernels. In Proc. 26th International Conference on Algorithmic Learning Theory, pages 194–208.
- Pentina and Lampert (2014) Pentina, A. and Lampert, C. (2014). A PAC-Bayesian bound for lifelong learning. In Proc. 31st International Conference on Machine Learning, pages 991–999.
- Robert and Casella (2013) Robert, C. and Casella, G. (2013). Monte Carlo statistical methods. Springer Science & Business Media.
- Ruvolo and Eaton (2013) Ruvolo, P. and Eaton, E. (2013). ELLA: An efficient lifelong learning algorithm. In Proc. 30th International Conference on Machine Learning, pages 507–515.
- Shalev-Shwartz (2011) Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194.
- Thrun (1996) Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? In Advances in neural information processing systems, pages 640–646.
- Thrun and Pratt (1998) Thrun, S. and Pratt, L. (1998). Learning to Learn. Kluwer Academic Publishers.
- Vapnik (1998) Vapnik, V. (1998). Statistical Learning Theory. Wiley.
Appendix A Proofs
Proof of Theorem 3.1.
It is enough to show that the EWA strategy leads to
Once this is done, we only have to use the assumption that the regret of the within-task algorithm on task is upper bounded by to obtain that
and we obtain the statement of the result.
where we introduce the notation for brevity. Put . Using Hoeffding’s inequality on the bounded random variable we have, for any , that
which can be rewritten as
Next, we note that