1 Introduction
^{1} Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia, 16163 Genoa, Italy
^{2} Department of Mathematics, University of Genoa, 16146 Genoa, Italy
^{3} Department of Electrical and Electronic Engineering, Imperial College of London, SW7 1AL, London, UK
^{4} Department of Computer Science, University College London, WC1E 6BT, London, UK
The problem of learning-to-learn (LTL) [4, 30] has received increasing attention in recent years, due to its practical importance [11, 26] and the theoretical challenge of designing statistically principled and efficient solutions [1, 2, 21, 23, 9, 10, 12]. The principal aim of LTL is to design a meta-learning algorithm that selects a supervised learning algorithm well suited to learning tasks from a prescribed family. To highlight the difference between the meta-learning algorithm and the learning algorithm, throughout the paper we refer to the latter as the inner or within-task algorithm.
The meta-algorithm is trained from a sequence of datasets, associated with different learning tasks sampled from a meta-distribution (also called environment in the literature). The performance of the selected inner algorithm is measured by the transfer risk [4, 18], that is, the average risk of the algorithm when trained on a random dataset from the same environment. A key insight is that, when the learning tasks share specific similarities, the LTL framework provides a means to leverage such similarities and select an inner algorithm of low transfer risk.
In this work, we consider environments of linear regression or binary classification tasks and we assume that the associated weight vectors are all close to a common vector. Because of the increasing interest in procedures of low computational complexity, we focus on the family of within-task algorithms given by Stochastic Gradient Descent (SGD) working on the regularized true risk. Specifically, motivated by the above assumption on the environment, we consider as regularizer the squared distance of the weight vector to a bias vector, which plays the role of a common mean among the tasks. Knowledge of this common mean can substantially help the inner algorithm, and the main goal of this paper is to design a meta-algorithm that learns a good bias and is supported by both computational and statistical guarantees.
Contributions.
The first contribution of this work is to show that, when the variance of the tasks’ weight vectors sampled from the environment is small, SGD regularized with the “right” bias yields a model with smaller error than its unbiased counterpart when applied to a similar task. Indeed, the latter approach does not exploit the relatedness among the tasks, that is, it corresponds to learning the tasks in isolation, an approach also known as independent task learning (ITL). The second and principal contribution of this work is to propose a meta-algorithm that estimates the bias term, so that the transfer risk of the corresponding SGD algorithm is as small as possible. Specifically, we consider the setting in which we receive as input a sequence of datasets and we propose an online meta-algorithm which efficiently updates the bias term used by the inner SGD algorithm. Our meta-algorithm consists in applying Stochastic Gradient Descent to a proxy of the transfer risk, given by the expected minimum value of the regularized empirical risk of a task. We provide a bound on the statistical performance of the biased SGD inner algorithm found by our procedure. It establishes that, when the number of observed tasks grows and the variance of the tasks’ weight vectors is significantly smaller than their second moment, running the inner SGD algorithm with the estimated bias brings an improvement in comparison to learning the tasks in isolation with no bias. The bound is coherent with the state-of-the-art LTL analysis for other families of algorithms, but it applies for the first time to a fully online meta-algorithm. Our results hold for Lipschitz loss functions both in the regression and binary classification setting.
Our proof technique combines ideas from online learning, stochastic and convex optimization, with tools from LTL. A key insight in our approach is to exploit the inner SGD algorithm to compute an approximate subgradient of the surrogate objective, in such a way that the degree of approximation can be controlled, without affecting the overall performance or the computational cost of the meta-algorithm.
Paper Organization. We start by recalling in section 2 the basic concepts of LTL. In section 3 we cast the problem of choosing the right bias term in SGD on the regularized objective in the LTL framework. Thanks to this formulation, in section 4 we characterize the situations in which SGD with the right bias term is beneficial in comparison to SGD with no bias. In section 5 we propose an online meta-algorithm to estimate the bias vector from a sequence of datasets and we analyze its statistical properties. In section 6 we report on the empirical performance of the proposed approach, while in section 7 we discuss some future research directions.
Previous Work. The online LTL setting [1, 9, 10, 24] has received limited attention and is less developed than standard LTL approaches, in which the data are processed in one batch as opposed to incrementally, see for instance [4, 19, 20, 21, 23]. The idea of introducing a bias in the learning algorithm is not new, see e.g. [10, 15, 23] and the discussion in section 3. In this work, we consider the family of inner SGD algorithms with biased regularization and we develop a theoretically grounded meta-learning algorithm that learns the bias. We are not aware of other works dealing with such a family in the LTL framework. Unlike other online methods [1, 9], our approach does not need to keep previous training points in memory and it runs online both across and within the tasks. As a result, low space and time complexity are the strengths of our method.
2 Preliminaries
In this section, we recall the standard supervised (i.e. single-task) learning setting and the learning-to-learn setting.
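We first introduce some notation used throughout. We denote by $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ the data space, where $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}$ (regression) or $\mathcal{Y} = \{-1, +1\}$ (binary classification). Throughout this work we consider linear supervised learning tasks $\mu$, namely distributions over $\mathcal{Z}$, parametrized by a weight vector $w \in \mathbb{R}^d$. We measure the performance by a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ such that, for any $y \in \mathcal{Y}$, $\ell(\cdot, y)$ is convex and closed. Finally, for any positive $k \in \mathbb{N}$, we let $[k] = \{1, \dots, k\}$ and we denote by $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ the standard inner product and Euclidean norm. In the rest of this work, when specified, we make the following assumptions.
Assumption 1 (Bounded Inputs).
Let $\mathcal{X} \subseteq \mathcal{B}(0, R)$, where $\mathcal{B}(0, R) = \{x \in \mathbb{R}^d : \|x\| \le R\}$, for some radius $R \ge 0$.
Assumption 2 (Lipschitz Loss).
Let $\ell(\cdot, y)$ be $L$-Lipschitz for any $y \in \mathcal{Y}$.
For example, for any $y, \hat{y} \in \mathcal{Y}$, the absolute loss $\ell(\hat{y}, y) = |\hat{y} - y|$ and the hinge loss $\ell(\hat{y}, y) = \max\{0, 1 - y\hat{y}\}$ are both $1$-Lipschitz. We now briefly recall the main notion of single-task learning.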
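As a small illustration of Assumption 2, the two losses mentioned above can be written down together with one valid subgradient each. This is only a sketch: the function names are ours, and the value returned at the non-differentiable point is one arbitrary admissible choice.

```python
def absolute_loss(pred, y):
    """Absolute loss |pred - y| (regression)."""
    return abs(pred - y)


def absolute_subgrad(pred, y):
    """One subgradient of the absolute loss with respect to pred."""
    if pred > y:
        return 1.0
    if pred < y:
        return -1.0
    return 0.0  # at the kink, any value in [-1, 1] is a valid subgradient


def hinge_loss(pred, y):
    """Hinge loss max(0, 1 - y * pred) for labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * pred)


def hinge_subgrad(pred, y):
    """One subgradient of the hinge loss with respect to pred."""
    return -float(y) if y * pred < 1.0 else 0.0
```

In both cases the subgradients never exceed 1 in absolute value, which is the subgradient form of 1-Lipschitz continuity in the prediction.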
2.1 Single-Task Learning
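In standard linear supervised learning, the goal is to learn a linear functional relation $f_w : \mathcal{X} \to \mathcal{Y}$, $f_w(\cdot) = \langle \cdot, w \rangle$, between the input space $\mathcal{X}$ and the output space $\mathcal{Y}$. This target can be reformulated as that of finding a weight vector $w_\mu$ minimizing the expected risk (or true risk)
$$\mathcal{R}_\mu(w) = \mathbb{E}_{(x, y) \sim \mu}\, \ell(\langle x, w \rangle, y) \tag{1}$$
over the entire space $\mathbb{R}^d$. The expected risk measures the prediction error that the weight vector $w$ incurs on average with respect to points sampled from the distribution $\mu$. In practice, the task $\mu$ is unknown and only partially observed through a corresponding dataset of $n$ i.i.d. points $Z_n = (z_i)_{i=1}^n \sim \mu^n$, where, for every $i \in [n]$, $z_i = (x_i, y_i) \in \mathcal{Z}$. In the sequel, we often use the more compact notation $Z_n = (X_n, \mathbf{y}_n)$, where $X_n \in \mathbb{R}^{n \times d}$ is the matrix containing the input vectors $x_i$ as rows and $\mathbf{y}_n \in \mathbb{R}^n$ is the vector with entries given by the labels $y_i$. A learning algorithm is a function $A$ that, given such a training dataset $Z_n \in \mathcal{Z}^n$, returns a “good” estimator, that is, in our case, a weight vector $A(Z_n) \in \mathbb{R}^d$, whose expected risk is small and tends to the minimum of Eq. (1) as $n$ increases.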
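Since the expected risk cannot be computed without knowing the task, its empirical counterpart on a dataset is what an algorithm actually sees. A minimal sketch of this average loss for a linear predictor follows; the helper names and the toy values are hypothetical and chosen only for illustration.

```python
def dot(u, v):
    """Standard inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def empirical_risk(w, inputs, labels, loss):
    """Average loss of the linear predictor x -> <x, w> on a dataset."""
    total = sum(loss(dot(x, w), y) for x, y in zip(inputs, labels))
    return total / len(labels)


def absolute_loss(pred, y):
    return abs(pred - y)


# Toy dataset with n = 3 points in dimension d = 2 (hypothetical values).
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [2.0, -1.0, 1.0]

risk_fit = empirical_risk([2.0, -1.0], inputs, labels, absolute_loss)
risk_zero = empirical_risk([0.0, 0.0], inputs, labels, absolute_loss)
print(risk_fit, risk_zero)
```

The first weight vector interpolates the three points, so its empirical risk is zero, while the zero vector pays the average absolute value of the labels.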
2.2 Learning-to-Learn (LTL)
In the LTL framework, we assume that each learning task $\mu$ we observe is sampled from an environment $\rho$, that is, a (meta-)distribution on the set of probability distributions on $\mathcal{Z}$. The goal is to select a learning algorithm (hence the name learning-to-learn) that is well suited to the environment.
Specifically, we consider the following setting. We receive a stream of tasks $\mu_1, \dots, \mu_T$, which are independently sampled from the environment $\rho$ and only partially observed through corresponding i.i.d. datasets $Z_n^{(1)}, \dots, Z_n^{(T)}$, each formed by $n$ datapoints. Starting from these datasets, we wish to learn an algorithm $A$ such that, when we apply it to a new dataset (composed of $n$ points) sampled from a new task $\mu \sim \rho$, the corresponding true risk is low. We reformulate this target as requiring that the algorithm $A$, trained with $n$ points^{1} over the environment $\rho$, has small transfer risk
$$\mathcal{E}_n(A) = \mathbb{E}_{\mu \sim \rho}\, \mathbb{E}_{Z_n \sim \mu^n}\, \mathcal{R}_\mu(A(Z_n)). \tag{2}$$
The transfer risk measures the expected true risk that the inner algorithm $A$, trained on the dataset $Z_n$, incurs on average with respect to the distribution of tasks $\mu$ sampled from $\rho$. Therefore, the process of learning a learning algorithm is a meta-learning one, in that the inner learning algorithm is applied to tasks from the environment and is chosen from a sequence of training tasks (datasets) in an attempt to minimize the transfer risk.
^{1} In order to simplify the presentation, we assume that all datasets are composed of the same number of points $n$. The general setting can be addressed by introducing a slightly different definition of the transfer risk.
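The nested expectation in the transfer risk can be approximated by simulation: sample a task, sample a training set, run the algorithm, and evaluate the output on fresh points from the same task. The sketch below does this for a hypothetical environment of noise-free linear regression tasks whose weight vectors are Gaussian around a common mean; the environment, the sample sizes and all function names are assumptions made for illustration and are not taken from the experiments reported later.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sample_unit_vector(rng, d):
    """Uniform sample from the unit sphere (normalized Gaussian vector)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(g, g))
    return [gi / norm for gi in g]


def sample_dataset(rng, w_task, n):
    """Draw n noise-free points (x, <x, w_task>) from the task."""
    xs = [sample_unit_vector(rng, len(w_task)) for _ in range(n)]
    return xs, [dot(x, w_task) for x in xs]


def transfer_risk(algorithm, mean, std, n, n_tasks=200, n_test=20, seed=0):
    """Monte Carlo estimate of the transfer risk with the absolute loss."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_tasks):
        w_task = [rng.gauss(m, std) for m in mean]            # task ~ environment
        w_hat = algorithm(*sample_dataset(rng, w_task, n))    # train on Z_n
        test_x, test_y = sample_dataset(rng, w_task, n_test)  # fresh points
        risk = sum(abs(dot(x, w_hat) - y) for x, y in zip(test_x, test_y))
        total += risk / n_test
    return total / n_tasks


mean = [4.0] * 5


def predict_zero(xs, ys):
    """Ignores the data and returns the zero vector."""
    return [0.0] * 5


def predict_mean(xs, ys):
    """Ignores the data and returns the environment mean."""
    return mean


risk_zero = transfer_risk(predict_zero, mean, std=1.0, n=10)
risk_mean = transfer_risk(predict_mean, mean, std=1.0, n=10)
print(risk_zero, risk_mean)
```

Even these two data-independent "algorithms" already show the effect studied in the paper: when the tasks concentrate around a common vector, predicting with that vector has a much smaller transfer risk than predicting with zero.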
As we will see in the following, in this work we consider a family of learning algorithms parametrized by a bias vector $h \in \mathbb{R}^d$.
3 SGD on the Biased Regularized Risk
In this section, we introduce the LTL framework for the family of within-task algorithms we analyze in this work: SGD on the biased regularized true risk.
The idea of introducing a bias in a specific family of learning algorithms is not new in the LTL literature, see e.g. [10, 15, 23] and references therein. A natural choice is given by regularized empirical risk minimization, in which we introduce a bias in the square norm regularizer (which we simply refer to as ERM throughout), namely
$$w_h(Z_n) = \operatorname*{argmin}_{w \in \mathbb{R}^d} \mathcal{R}_{Z_n, h}(w), \tag{3}$$
where, for any $w, h \in \mathbb{R}^d$ and $\lambda > 0$, we have defined the empirical error and its biased regularized version as
$$\mathcal{R}_{Z_n}(w) = \frac{1}{n} \sum_{k=1}^{n} \ell(\langle x_k, w \rangle, y_k), \qquad \mathcal{R}_{Z_n, h}(w) = \mathcal{R}_{Z_n}(w) + \frac{\lambda}{2} \|w - h\|^2. \tag{4}$$
Intuitively, if the weight vectors of the tasks sampled from $\rho$ are close to each other, then running ERM with $h$ equal to their mean should have a smaller transfer risk than running ERM with, for instance, $h = 0$. We make this statement precise in section 4. Recently, a number of works have considered how to learn a good bias in an LTL setting, see e.g. [23, 10]. However, one drawback of these works is that they assume the ERM solution to be known exactly, without leveraging the interplay between the optimization and the generalization error. Furthermore, in LTL settings, data naturally arise in an online manner, both between and within tasks. Hence, an ideal LTL approach should focus on inner algorithms that process a single data point at a time.
Motivated by the above reasoning, in this work, we propose to analyze an online learning algorithm that is computationally and memory efficient while retaining (on average with respect to the sampling of the data) the same statistical guarantees as the more expensive ERM estimator. Specifically, for a training dataset $Z_n \sim \mu^n$, a regularization parameter $\lambda > 0$ and a bias vector $h \in \mathbb{R}^d$, we consider the learning algorithm defined as
$$\bar{w}_h(Z_n) = \frac{1}{n} \sum_{k=1}^{n} w_h^{(k)}, \tag{5}$$
where $\bar{w}_h(Z_n)$ is the average of the first $n$ iterations $w_h^{(1)}, \dots, w_h^{(n)}$ of Alg. 1, in which, for any $k \in [n]$, we have introduced the notation $\ell_k(\cdot) = \ell(\cdot, y_k)$.
Alg. 1 coincides with the online subgradient algorithm applied to the strongly convex function $\mathcal{R}_{Z_n, h}$. Moreover, thanks to the assumption that $Z_n \sim \mu^n$, Alg. 1 is equivalent to SGD applied to the regularized true risk
$$\mathcal{R}_{\mu, h}(w) = \mathcal{R}_\mu(w) + \frac{\lambda}{2} \|w - h\|^2. \tag{6}$$
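A single pass of the within-task algorithm can be sketched as follows. Since Alg. 1 is not reproduced in this text, two details are assumptions on our side: the iterates start at the bias, and the step size at iteration k is 1/(λk), the usual choice for online subgradient descent on a λ-strongly convex objective. The function names are ours, and the absolute loss is used only as a default.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def absolute_subgrad(pred, y):
    """One subgradient of |pred - y| with respect to pred."""
    return (pred > y) - (pred < y)


def biased_sgd(inputs, labels, h, lam, loss_subgrad=absolute_subgrad):
    """One pass of SGD on the risk regularized by (lam / 2) * ||w - h||^2.

    Returns (average of the first n iterates, last iterate).
    Assumed details: start at w = h, step size 1 / (lam * k).
    """
    n = len(labels)
    w = list(h)
    average = [0.0] * len(h)
    for k, (x, y) in enumerate(zip(inputs, labels), start=1):
        average = [a + wi / n for a, wi in zip(average, w)]
        u = loss_subgrad(dot(x, w), y)  # subgradient of the loss at <x, w>
        step = 1.0 / (lam * k)
        # Subgradient of the regularized loss: u * x + lam * (w - h).
        w = [wi - step * (u * xi + lam * (wi - hi))
             for wi, xi, hi in zip(w, x, h)]
    return average, w


average, last = biased_sgd([[1.0, 0.0], [0.0, 1.0]], [1.0, -1.0],
                           h=[0.0, 0.0], lam=1.0)
print(average, last)
```

Each point is visited once and nothing is stored besides the current iterate and the running average, which is the memory profile the paper is after.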
Relying on a standard online-to-batch argument, see e.g. [8, 13] and references therein, it is easy to link the true error of such an algorithm with the minimum of the regularized empirical risk, that is, $\mathcal{R}_{Z_n, h}(w_h(Z_n))$. This fact is reported in the proposition below and it will often be used in our subsequent statistical analysis. We give a proof in Appendix F for completeness.
We remark that at this level of the analysis, one could also avoid the logarithmic factor in the above bound, see e.g. [29, 25, 16]. However, in order to not complicate our presentation and proofs, we avoid this refinement of the analysis.
In the next section we study the impact of the bias vector on the statistical performance of the inner algorithm. Specifically, we investigate under which circumstances there is an advantage in perturbing the regularization in the objective used by the algorithm with an appropriate ideal bias term $h$, as opposed to fixing $h = 0$. Throughout the paper, we refer to the choice $h = 0$ as independent task learning (ITL), although strictly speaking, when $h$ is fixed in advance, SGD is applied to each task independently regardless of the value of $h$. Then, in section 5 we address the question of estimating this appropriate bias from the data.
4 The Advantage of the Right Bias Term
In this section, we study the statistical performance of the model returned by Alg. 1, on average with respect to the tasks sampled from the environment $\rho$, for different choices of the bias vector $h$. To present our observations, we require, for any $\mu \sim \rho$, that the corresponding true risk admits minimizers and we denote by $w_\mu$ the minimum norm minimizer^{2}. With these ingredients, we introduce the oracle
$$\mathcal{E}_\rho = \mathbb{E}_{\mu \sim \rho}\, \mathcal{R}_\mu(w_\mu),$$
representing the averaged minimum error over the environment of tasks, and, for a candidate bias $h$, we give a bound on the quantity $\mathcal{E}_n(\bar{w}_h) - \mathcal{E}_\rho$. This gap coincides with the averaged excess risk of Alg. 1 with bias $h$ over the environment of tasks, that is
$$\mathcal{E}_n(\bar{w}_h) - \mathcal{E}_\rho = \mathbb{E}_{\mu \sim \rho}\, \mathbb{E}_{Z_n \sim \mu^n} \big[ \mathcal{R}_\mu(\bar{w}_h(Z_n)) - \mathcal{R}_\mu(w_\mu) \big].$$
Hence, this quantity is an indicator of the performance of the bias $h$ with respect to our environment. In the rest of this section, we study the above gap for a bias $h$ which is fixed and does not depend on the data. Before doing this, we introduce the notation
$$\mathrm{Var}_h^2 = \frac{1}{2}\, \mathbb{E}_{\mu \sim \rho} \|w_\mu - h\|^2, \tag{8}$$
which is used throughout this work and we observe that
$$m \equiv \mathbb{E}_{\mu \sim \rho}\, w_\mu = \operatorname*{argmin}_{h \in \mathbb{R}^d} \mathrm{Var}_h^2. \tag{9}$$
^{2} This choice is made in order to simplify our presentation. However, our analysis holds for different choices of a minimizer $w_\mu$, which may potentially improve our bounds.
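Theorem 2 (Excess Transfer Risk Bound for a Fixed Bias $h$).
Proof. For $\mu \sim \rho$, consider the following decomposition
$$\mathbb{E}_{Z_n \sim \mu^n} \big[ \mathcal{R}_\mu(\bar{w}_h(Z_n)) - \mathcal{R}_\mu(w_\mu) \big] = \mathrm{A} + \mathrm{B}, \tag{12}$$
where, denoting by $w_h(Z_n)$ the ERM solution of Eq. (3) and by $\mathcal{R}_{Z_n, h}$ the biased regularized empirical risk of Eq. (4), A and B are respectively defined by
$$\mathrm{A} = \mathbb{E}_{Z_n \sim \mu^n} \big[ \mathcal{R}_\mu(\bar{w}_h(Z_n)) - \mathcal{R}_{Z_n, h}(w_h(Z_n)) \big], \qquad \mathrm{B} = \mathbb{E}_{Z_n \sim \mu^n} \big[ \mathcal{R}_{Z_n, h}(w_h(Z_n)) - \mathcal{R}_\mu(w_\mu) \big]. \tag{13}$$
In order to bound the term A, we use Prop. 1. Regarding the term B, we exploit the definition of the ERM algorithm and the fact that, since $w_\mu$ does not depend on $Z_n$, $\mathbb{E}_{Z_n \sim \mu^n} \mathcal{R}_{Z_n}(w_\mu) = \mathcal{R}_\mu(w_\mu)$. Consequently, we can upper bound the term B as
$$\mathrm{B} \le \mathbb{E}_{Z_n \sim \mu^n} \big[ \mathcal{R}_{Z_n, h}(w_\mu) - \mathcal{R}_\mu(w_\mu) \big] = \frac{\lambda}{2} \|w_\mu - h\|^2. \tag{14}$$
The desired statement follows by combining the above bounds on the two terms, taking the average with respect to $\mu \sim \rho$ and optimizing over $\lambda$. ∎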
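The final step of the proof, optimizing over the regularization parameter, can be made explicit under one assumption: that Prop. 1 (whose statement is not reproduced here) bounds the term A by a quantity of the form $c_n / \lambda$, as is typical for subgradient methods on $\lambda$-strongly convex objectives. Here $c_n$ is a placeholder of ours for the constants and the dependence on the number of points, and $V_h$ is our shorthand for half the averaged squared distance between the tasks' minimizers $w_\mu$ and the bias $h$, which is what the bound on B produces after averaging over the tasks.

```latex
\begin{gather*}
\mathbb{E}_{\mu\sim\rho}\,[\mathrm{A} + \mathrm{B}]
  \;\le\; \frac{c_n}{\lambda} + \lambda V_h,
\qquad
V_h = \tfrac{1}{2}\,\mathbb{E}_{\mu\sim\rho}\,\|w_\mu - h\|^2, \\[4pt]
\frac{d}{d\lambda}\Big(\frac{c_n}{\lambda} + \lambda V_h\Big) = 0
\;\Longleftrightarrow\;
\lambda_\star = \sqrt{\frac{c_n}{V_h}},
\qquad
\frac{c_n}{\lambda_\star} + \lambda_\star V_h = 2\sqrt{c_n V_h}.
\end{gather*}
```

Under this assumed form, the optimal regularization parameter decreases both when $V_h$ grows and when $c_n$ shrinks with the number of points, and the resulting bound scales with the square root of $V_h$, so a bias close to the tasks' weight vectors directly tightens it.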
Thm. 2 shows that the strength of the regularization that one should use in the within-task algorithm Alg. 1 is inversely proportional to both the variance of the bias and the number of points in the datasets. This is exactly in line with the LTL aim: when solving each task is difficult, knowing a good bias a priori can bring a substantial benefit over learning with no bias. To further investigate this point, in the following corollary, we specialize Thm. 2 to two choices of the bias which are particularly meaningful for our analysis. The first choice is $h = 0$, which coincides, as remarked earlier, with learning each task independently, while the second choice considers an ideal bias, namely, assuming that the transfer risk admits a minimizer, we set $h = h_n \in \operatorname*{argmin}_{h \in \mathbb{R}^d} \mathcal{E}_n(\bar{w}_h)$.
Corollary 3 (Excess Transfer Risk Bound for ITL and the Oracle).
Proof. The first statement directly follows from the application of Thm. 2 with $h = 0$. The second statement is a direct consequence of the definition of $h_n$ as a minimizer of the transfer risk, which implies $\mathcal{E}_n(\bar{w}_{h_n}) - \mathcal{E}_\rho \le \mathcal{E}_n(\bar{w}_m) - \mathcal{E}_\rho$, where $m = \mathbb{E}_{\mu \sim \rho}\, w_\mu$ is the mean of the tasks’ weight vectors, and of the application of Thm. 2 with $h = m$ to the right-hand side. ∎
From the previous bounds we can observe that using the ideal bias in the regularizer brings a substantial benefit with respect to the unbiased case when the number of points in each dataset is not very large (hence learning each task is quite difficult) and the variance of the tasks’ weight vectors sampled from the environment is much smaller than their second moment, i.e. when
$$\mathbb{E}_{\mu \sim \rho} \|w_\mu - m\|^2 \ll \mathbb{E}_{\mu \sim \rho} \|w_\mu\|^2, \qquad m = \mathbb{E}_{\mu \sim \rho}\, w_\mu.$$
Driven by this observation, when the environment of tasks satisfies the above characteristics, we would like to take advantage of this similarity among the tasks. However, since in practice we are not able to compute the ideal bias explicitly, in the following section we propose an efficient online LTL approach to estimate the bias directly from the observed sequence of tasks.
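The condition above is easy to check numerically on a simulated environment. The sketch below compares half the mean squared distance of sampled weight vectors to their mean with half their second moment; the Gaussian environment, its centre and the sample sizes are hypothetical values chosen only for illustration.

```python
import random


def half_mean_sq_dist(weights, h):
    """Estimate (1/2) * E ||w - h||^2 from sampled weight vectors."""
    total = sum(sum((wi - hi) ** 2 for wi, hi in zip(w, h)) for w in weights)
    return 0.5 * total / len(weights)


rng = random.Random(0)
d, n_tasks = 10, 2000                      # hypothetical sizes
center = [4.0] * d                         # hypothetical common mean
weights = [[rng.gauss(c, 1.0) for c in center] for _ in range(n_tasks)]

mean = [sum(w[j] for w in weights) / n_tasks for j in range(d)]
var_around_mean = half_mean_sq_dist(weights, mean)      # close to d / 2
second_moment = half_mean_sq_dist(weights, [0.0] * d)   # close to 17 * d / 2

print(var_around_mean, second_moment, second_moment / var_around_mean)
```

With unit variance per coordinate and a centre far from the origin, the second quantity is roughly seventeen times the first, which is the regime in which a bias at the mean is expected to help.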
5 Estimating the Bias
In this section, we study the problem of designing an estimator for the bias vector that is computed incrementally from a set of observed tasks.
5.1 The Meta-Objective
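Since direct optimization of the transfer risk is not feasible, a standard strategy used in LTL consists in introducing a proxy objective that is easier to handle, see e.g. [18, 19, 20, 21, 9, 10]. In this paper, motivated by Prop. 1, according to which the expected true risk of the inner algorithm is controlled by the expected minimum of the regularized empirical risk, we substitute in the definition of the transfer risk the true risk of the algorithm with the minimum of the regularized empirical risk
$$\mathcal{L}_{Z_n}(h) = \min_{w \in \mathbb{R}^d} \mathcal{R}_{Z_n, h}(w) = \mathcal{R}_{Z_n, h}(w_h(Z_n)). \tag{15}$$
This leads us to the following proxy for the transfer risk
$$\hat{\mathcal{E}}_n(h) = \mathbb{E}_{\mu \sim \rho}\, \mathbb{E}_{Z_n \sim \mu^n}\, \mathcal{L}_{Z_n}(h). \tag{16}$$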
Some remarks about this choice are in order. First, convexity is usually a rare property in LTL. In our case, as described in the following proposition, the definition of the function $\mathcal{L}_{Z_n}$ as the partial minimum of a jointly convex function ensures convexity and other nice properties, such as differentiability and a closed expression of its gradient.
Proposition 4 (Properties of $\mathcal{L}_{Z_n}$).
The above statement is a known result in the optimization community, see e.g. [3, Prop. ] and Appendix C for more details. In order to minimize the proxy objective in Eq. (16), one standard choice in stochastic optimization, also adopted in this work, is to use first-order methods, which require computing an unbiased estimate of the gradient of the stochastic objective. In our case, according to the above proposition, this step would require computing the minimizer of the regularized empirical problem in Eq. (15) exactly. A key observation of our work is that we can easily design a “satisfactory” approximation (see the last paragraph in section 5) of its gradient, simply by substituting the minimizer in the expression of the gradient in Eq. (17) with the last iterate of Alg. 1. An important aspect to stress here is that this strategy does not require any additional computational effort. Formally, this reasoning is linked to the concept of $\epsilon$-subgradient of a function. We recall that, for a given convex, proper and closed function $f$ and for a given point $\hat{h}$ in its domain, $u$ is an $\epsilon$-subgradient of $f$ at $\hat{h}$ if, for any $h$, $f(h) \ge f(\hat{h}) + \langle u, h - \hat{h} \rangle - \epsilon$.
Proposition 5 (An $\epsilon$-Subgradient for $\mathcal{L}_{Z_n}$).
The above result is a key tool in our analysis. The proof requires some preliminaries on the $\epsilon$-subdifferential of a function (see Appendix A) and the introduction of the dual formulation of both the within-task learning problem and Alg. 1 (see Appendix B and Appendix E, respectively). With these two ingredients, the proof of the statement is deduced in subsection E.3 by the application of a more general result reported in Appendix D, describing how an $\epsilon$-minimizer of the dual of the within-task learning problem can be exploited in order to build an $\epsilon$-subgradient of the meta-objective function. We stress that this result could be applied to a more general class of algorithms, going beyond Alg. 1 considered here.
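To make the role of the minimizer in the meta-gradient concrete, consider a one-dimensional toy case with a single data point, input equal to one and the absolute loss, where the partial minimum over $w$ can be computed in closed form. For a partial minimum of this kind, the standard Moreau-envelope identity gives the gradient $-\lambda (w_h - h)$, with $w_h$ the minimizer; we assume this is the expression referred to as Eq. (17), which is not reproduced in this text. The sketch checks the identity by finite differences and shows how an inexact minimizer perturbs it. All names and numerical values are ours.

```python
def exact_minimizer(h, y, lam):
    """argmin_w |w - y| + (lam / 2) * (w - h)^2, by soft-thresholding."""
    r = h - y
    shrink = max(abs(r) - 1.0 / lam, 0.0)
    return y + (shrink if r > 0 else -shrink)


def meta_objective(h, y, lam):
    """Minimum value of the regularized one-point empirical risk."""
    w = exact_minimizer(h, y, lam)
    return abs(w - y) + 0.5 * lam * (w - h) ** 2


def meta_gradient(h, w, lam):
    """-lam * (w - h): exact if w is the minimizer, approximate otherwise."""
    return -lam * (w - h)


y, lam, eps = 0.0, 2.0, 1e-6
for h in (3.0, 0.2, -1.5):
    finite_diff = (meta_objective(h + eps, y, lam)
                   - meta_objective(h - eps, y, lam)) / (2 * eps)
    w_exact = exact_minimizer(h, y, lam)
    exact = meta_gradient(h, w_exact, lam)
    approx = meta_gradient(h, w_exact + 0.05, lam)  # inexact minimizer
    print(h, finite_diff, exact, approx)
```

An error of size $\delta$ in the minimizer changes the gradient estimate by $\lambda \delta$, so the accuracy of the inner solver directly controls the accuracy of the meta-gradient.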
5.2 The Meta-Algorithm to Estimate the Bias
In order to estimate the bias from the data, we apply SGD to the stochastic function introduced in Eq. (16). More precisely, in our setting, the sampling of a “meta-point” corresponds to the incremental sampling of a dataset from the environment (more precisely, we first sample a distribution $\mu$ from $\rho$ and then a dataset $Z_n \sim \mu^n$). We refer to Alg. 2 for more details. In particular, we propose to take the estimator obtained by averaging the iterations returned by Alg. 2. An important feature to stress here is that the meta-algorithm uses $\epsilon$-subgradients of the meta-objective, which are computed as described above. Specifically, for any $t \in [T]$, we define
$$\hat{\nabla}^{(t)} = -\lambda \big( w^{(t)}_{\mathrm{last}} - h^{(t)} \big), \tag{21}$$
where $w^{(t)}_{\mathrm{last}}$ is the last iterate of Alg. 1 applied with the current bias $h^{(t)}$ and the dataset $Z_n^{(t)}$. To simplify the presentation, throughout this work, we use the shorthand notation
Some technical observations follow. First, we stress that Alg. 2 processes one single instance at a time, without the need to store previously encountered data points, either across the tasks or within them. Second, the implementation of Alg. 2 does not require computing the meta-objective, which would increase the computational effort of the entire scheme. The rest of this section is devoted to the statistical analysis of Alg. 2.
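The two observations above can be seen in a compact sketch of the whole scheme. Alg. 2 is not reproduced in this text, so several details here are assumptions on our side: the bias starts at zero, the meta step size `gamma` is constant, the inner pass starts at the current bias with step size 1/(λk), and the environment is a hypothetical stream of noise-free regression tasks with Gaussian weight vectors. Only the use of the last inner iterate in the meta-gradient and the final averaging follow the description in the text.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def last_inner_iterate(inputs, labels, h, lam):
    """Last iterate of the inner SGD pass (absolute loss, step 1/(lam*k))."""
    w = list(h)
    for k, (x, y) in enumerate(zip(inputs, labels), start=1):
        pred = dot(x, w)
        u = (pred > y) - (pred < y)
        step = 1.0 / (lam * k)
        w = [wi - step * (u * xi + lam * (wi - hi))
             for wi, xi, hi in zip(w, x, h)]
    return w


def meta_sgd(datasets, d, lam, gamma):
    """Online estimate of the bias; returns the average of the iterates."""
    h = [0.0] * d
    h_sum = [0.0] * d
    count = 0
    for inputs, labels in datasets:  # one dataset at a time, never stored
        h_sum = [s + hi for s, hi in zip(h_sum, h)]
        count += 1
        w_last = last_inner_iterate(inputs, labels, h, lam)
        grad = [-lam * (wl - hi) for wl, hi in zip(w_last, h)]  # approximate
        h = [hi - gamma * g for hi, g in zip(h, grad)]
    return [s / count for s in h_sum]


def task_stream(rng, mean, n_tasks, n):
    """Hypothetical environment: weights ~ N(mean, I), unit-sphere inputs."""
    for _ in range(n_tasks):
        w_task = [rng.gauss(m, 1.0) for m in mean]
        inputs = []
        for _ in range(n):
            g = [rng.gauss(0.0, 1.0) for _ in mean]
            norm = math.sqrt(dot(g, g))
            inputs.append([gi / norm for gi in g])
        yield inputs, [dot(x, w_task) for x in inputs]


mean = [4.0, 4.0, 4.0]
h_bar = meta_sgd(task_stream(random.Random(0), mean, n_tasks=300, n=10),
                 d=3, lam=1.0, gamma=0.5)
print(h_bar)
```

The meta-objective itself is never evaluated, and the datasets are consumed from a generator, so the memory footprint does not grow with the number of tasks or points.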
5.3 Statistical Analysis of the Meta-Algorithm
In the following theorem we study the statistical performance of the bias returned by Alg. 2. More precisely, we bound the excess transfer risk of the inner SGD algorithm run with the bias term learned by the meta-algorithm.
Theorem 6 (Excess Transfer Risk Bound for the Bias Estimated by Alg. 2).
We consider the following decomposition
(24) 
where we have defined the terms
(25) 
Now, in order to bound the term A, noting that
we use Prop. 1 with $h$ set to the bias returned by Alg. 2 and then take the average. As regards the term C, we apply the inequality given in Eq. (14) with $h$ set to the mean of the tasks’ weight vectors and we again take the average. Finally, the term B is the convergence rate of Alg. 2 and its study requires analyzing the error that we introduce in the meta-gradients by Prop. 5. The bound we use for this term is the one described in Prop. 22 (see Appendix G). The result now follows by combining the bounds on the three terms and optimizing over the free parameters. ∎
We remark that the bound in Thm. 6 is stated with respect to the mean of the tasks’ weight vectors only for simplicity, and the same result holds for a generic bias vector $h$. Specializing this rate to ITL ($h = 0$) recovers the rate in Cor. 3 for ITL (up to a constant factor). Consequently, even when the tasks are not “close to each other” (i.e. their variance is high), our approach is not prone to negative transfer, since, in the worst case, it recovers the ITL performance. Moreover, the above bound is coherent with the state-of-the-art LTL bounds given in other papers studying other variants of Ivanov or Tikhonov regularized empirical risk minimization algorithms, see e.g. [18, 19, 20, 21]. Specifically, in our case, the bound has the form
(26) 
where the environment-dependent quantity reflects the advantage in exploiting the relatedness among the tasks sampled from the environment $\rho$. More precisely, in section 4 we noticed that, if the variance of the weight vectors of the tasks sampled from our environment is significantly smaller than their second moment, running Alg. 1 with the ideal bias on a future task brings a significant improvement in comparison to the unbiased case. One natural question arising at this point is whether, under the same conditions on the environment, the same improvement is obtained by running Alg. 1 with the bias vector returned by our online meta-algorithm in Alg. 2. Looking at the bound in Thm. 6, we can say that, when the number of training tasks used to estimate the bias is sufficiently large, the above question has a positive answer and our LTL approach is effective.
In order to also have a more precise benchmark for the biased setting considered in this work, in Appendix H we repeat the statistical study described in the paper for the more expensive ERM algorithm described in Eq. (3). In this case, we assume access to an oracle providing us with this exact estimator, ignoring any computational costs. As before, we perform the analysis both for a fixed bias and for the one estimated from the data by running Alg. 2. We remark that, thanks to the assumption on the oracle, in this case Alg. 2 is assumed to run with exact meta-gradients. Looking at the results reported in Appendix H, we immediately see that, up to constants and logarithmic factors, the LTL bounds we have stated in the paper for the low-complexity SGD family are equivalent to the ones we have reported in Appendix H for the more expensive ERM family.
All the above facts justify the informal statement given before Prop. 5, according to which the trick of approximating the meta-gradient with the last iterate of the inner algorithm not only requires no additional effort, but is also accurate enough from the statistical viewpoint, matching a state-of-the-art bound for more expensive within-task algorithms based on ERM.
We conclude by observing that, exploiting the explicit form of the error on the meta-gradients, it is possible to extend the analysis presented in Thm. 6 above to the adversarial case, where no assumption on the data generation process is made. The result in our statistical setting can be derived from this more general adversarial setting by the application of two online-to-batch conversions, one within-task and one outer-task.
6 Experiments
In this section, we test the effectiveness of the LTL approach proposed in this paper on synthetic and real data^{4}. In all experiments, the regularization parameter and the step size were tuned by validation, see Appendix I for more details.
^{4} The code used for the following experiments is available at https://github.com/prolearner/onlineLTL.
Synthetic Data. We considered two different settings, regression with the absolute loss and binary classification with the hinge loss. In both cases, we generated an environment of tasks in which SGD with the right bias is expected to bring a substantial benefit in comparison to the unbiased case. Motivated by our observations in section 4, we generated linear tasks with weight vectors characterized by a variance which is significantly smaller than their second moment. Specifically, for each task, we sampled a weight vector from a Gaussian distribution whose mean is a vector with all components equal to a common constant. Each task corresponds to a dataset of input-label pairs. In the regression case, the inputs were uniformly sampled on the unit sphere and the labels were generated by adding to the linear prediction of the task’s weight vector a noise term sampled from a zero-mean Gaussian distribution, with standard deviation chosen to give a fixed signal-to-noise ratio for each task. In the classification case, the inputs were uniformly sampled on the unit sphere, excluding those points with margin smaller than a fixed threshold, and the binary labels were generated according to a logistic model. In Fig. 1 we report the performance of Alg. 1 with different choices of the bias: the bias estimated by Alg. 2 (our LTL estimator), $h = 0$ (ITL) and the mean of the environment’s weight vectors, a reasonable approximation of the oracle minimizing the transfer risk. The plots confirm our theoretical findings: estimating the bias with our LTL approach leads to substantial benefits with respect to the unbiased case, as the number of observed training tasks increases.
Real Data. We ran experiments on the computer survey data from [17], in which 180 people (tasks) rated the likelihood of purchasing one of 20 different personal computers. The input represents 13 different computer characteristics (price, CPU, RAM, etc.) while the output is an integer rating. Similarly to the synthetic data experiments, we consider a regression setting with the absolute loss and a classification setting. In the latter case each task is to predict whether the rating is above a fixed threshold. We compare the LTL bias with ITL. The results are reported in Fig. 2. They are in line with the results obtained in the synthetic experiments, indicating that the biased LTL framework proposed in this work is effective for this dataset. Moreover, the results for regression are also in line with what was observed in the multi-task setting with variance regularization [22]. The classification setting has not been used before and was created ad hoc for our purpose. In this case we observe an increased variance, probably due to the datasets being highly unbalanced.
In order to investigate the impact of passing through the data only once in the different steps of our method, we conducted additional experiments. The results, presented in Appendix J, indicate that the single-pass strategy is competitive with the more expensive ERM.
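The synthetic generation procedure described earlier in this section can be sketched as follows. Several specifics are assumptions made only to obtain a runnable example: the dimension, the number of points, the mean of the weight vectors, the definition of signal-to-noise ratio (ratio between the empirical standard deviation of the clean labels and the noise standard deviation), the definition of margin (absolute value of the linear score) and the exact form of the logistic model. For the settings actually used in the paper, the released code is the reference.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit_vector(rng, d):
    """Uniform sample from the unit sphere (normalized Gaussian vector)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(g, g))
    return [gi / norm for gi in g]


def regression_task(rng, mean, n, snr):
    """One synthetic regression dataset with Gaussian label noise.

    Assumption: `snr` is the ratio between the empirical standard deviation
    of the clean labels and the standard deviation of the noise.
    """
    w_task = [rng.gauss(m, 1.0) for m in mean]
    inputs = [unit_vector(rng, len(mean)) for _ in range(n)]
    clean = [dot(x, w_task) for x in inputs]
    avg = sum(clean) / n
    signal_std = math.sqrt(sum((c - avg) ** 2 for c in clean) / n)
    noise_std = signal_std / snr
    labels = [c + rng.gauss(0.0, noise_std) for c in clean]
    return inputs, labels, w_task


def classification_task(rng, mean, n, min_margin):
    """One synthetic classification dataset with logistic labels in {-1, +1}."""
    w_task = [rng.gauss(m, 1.0) for m in mean]
    inputs, labels = [], []
    while len(inputs) < n:
        x = unit_vector(rng, len(mean))
        score = dot(x, w_task)
        if abs(score) < min_margin:  # discard small-margin points
            continue
        p_positive = 1.0 / (1.0 + math.exp(-score))
        inputs.append(x)
        labels.append(1 if rng.random() < p_positive else -1)
    return inputs, labels, w_task


rng = random.Random(0)
mean = [4.0] * 5  # hypothetical values throughout
X_reg, y_reg, _ = regression_task(rng, mean, n=10, snr=1.0)
X_cls, y_cls, _ = classification_task(rng, mean, n=10, min_margin=0.5)
```

Because the weight vectors are drawn around a mean that is far from the origin relative to their spread, an environment generated this way satisfies the small-variance condition discussed in section 4.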
7 Conclusion and Future Work
We have studied the performance of Stochastic Gradient Descent on the true risk regularized by the squared Euclidean distance to a bias vector, over a class of tasks. Drawing upon a learning-to-learn framework, we have shown that, when the variance of the tasks is relatively small, the introduction of an appropriate bias vector can be beneficial in comparison to the standard unbiased version, corresponding to learning the tasks independently. Then, we have proposed an efficient online meta-learning algorithm to estimate this bias and we have theoretically shown that the bias returned by our method can bring a comparable benefit. In the future, it would be interesting to investigate other kinds of relatedness among the tasks and to extend our analysis to other classes of loss functions, as well as to a Hilbert space setting. Finally, another valuable research direction is to derive fully dependent bounds, in which the hyperparameters are self-tuned during the learning process, see e.g. [31].
References

[1]
P. Alquier, T. T. Mai, and M. Pontil.
Regret bounds for lifelong learning.
In
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics
, volume 54 of Proceedings of Machine Learning Research, pages 261–269, 2017. 
[2]
M.F. Balcan, A. Blum, and S. Vempala.
Efficient representations for lifelong learning and autoencoding.
In Conference on Learning Theory, pages 191–210, 2015.  [3] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator theory in Hilbert Spaces, volume 408. Springer, 2011.
 [4] J. Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12(149–198):3, 2000.
 [5] A. Beck and M. Teboulle. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
 [6] J. Borwein and Q. Zhu. Techniques of variational analysis, ser, 2005.
 [7] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.
 [8] N. CesaBianchi, A. Conconi, and C. Gentile. On the generalization ability of online learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
 [9] G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Incremental learningtolearn with statistical guarantees. In Proc. 34th Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
 [10] G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Learning to learn around a common mean. In Advances in Neural Information Processing Systems, pages 10190–10200, 2018.
 [11] C. Finn, P. Abbeel, and S. Levine. Modelagnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135. PMLR, 2017.
 [12] R. Gupta and T. Roughgarden. A pac approach to applicationspecific algorithm selection. SIAM Journal on Computing, 46(3):992–1017, 2017.
 [13] E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2016.
 [14] H.U. JeanBaptiste. Convex analysis and minimization algorithms: advanced theory and bundle methods. SPRINGER, 2010.
 [15] I. Kuzborskij and F. Orabona. Fast rates by transferring from auxiliary hypotheses. Machine Learning, 106(2):171–195, 2017.
 [16] S. LacosteJulien, M. Schmidt, and F. Bach. A simpler approach to obtaining an o (1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.
 [17] P. J. Lenk, W. S. DeSarbo, P. E. Green, and M. R. Young. Hierarchical bayes conjoint analysis: Recovery of partworth heterogeneity from reduced experimental designs. Marketing Science, 15(2):173–191, 1996.
 [18] A. Maurer. Algorithmic stability and metalearning. Journal of Machine Learning Research, 6:967–994, 2005.
 [19] A. Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350, 2009.

 [20] A. Maurer, M. Pontil, and B. Romera-Paredes. Sparse coding for multitask and transfer learning. In International Conference on Machine Learning, 2013.
 [21] A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884, 2016.
 [22] A. M. McDonald, M. Pontil, and D. Stamos. New perspectives on k-support and cluster norms. Journal of Machine Learning Research, 17(155):1–38, 2016.
 [23] A. Pentina and C. Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pages 991–999, 2014.
 [24] A. Pentina and R. Urner. Lifelong learning with weighted majority votes. In Advances in Neural Information Processing Systems, pages 3612–3620, 2016.
 [25] A. Rakhlin, O. Shamir, K. Sridharan, et al. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, volume 12, pages 1571–1578. Citeseer, 2012.
 [26] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In 5th International Conference on Learning Representations, 2017.
 [27] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
 [28] S. Shalev-Shwartz and S. M. Kakade. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems, pages 1457–1464, 2009.
 [29] O. Shamir and T. Zhang. Stochastic gradient descent for nonsmooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
 [30] S. Thrun and L. Pratt. Learning to Learn. Springer, 1998.
 [31] Z. Zhuang, A. Cutkosky, and F. Orabona. Surrogate losses for online learning of stepsizes in stochastic nonconvex optimization. arXiv preprint arXiv:1901.09068, 2019.
Appendix
The appendix is organized as follows. In Appendix A we report some basic facts regarding the $\epsilon$-subdifferential of a function which are used in the subsequent analysis. In Appendix B we give the primal-dual formulation of the biased regularized empirical risk minimization problem for each single task and, in Appendix C, we recall some well-known properties of our meta-objective function. In Appendix D, we show how an $\epsilon$-minimizer of the dual problem can be exploited in order to build an $\epsilon$-subgradient of our meta-objective function. As described in Appendix E, interpreting our within-task algorithm as a coordinate descent algorithm on the dual problem, we can adapt this result to our setting and prove, in this way, Prop. 5. In Appendix F, we report the proof of Prop. 1 and, in Appendix G, we give the convergence rate of Alg. 2 which is used in the paper, during the proof of Thm. 6. In Appendix H, we repeat the statistical study described in the paper also for the family of ERM algorithms introduced in Eq. (3) and, in Appendix I, we describe how to perform the validation procedure in our LTL setting. Finally, in Appendix J we report additional experiments comparing our method to ERM variants.
Appendix A Basic Facts on Subgradients
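In this section, we report some basic concepts about the $\epsilon$-subdifferential which are then used in the subsequent analysis. This material is based on [14, Chap. XI]. Throughout this section we consider a convex, closed and proper function $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ with domain $\mathrm{Dom} f$ and we always let $\epsilon \ge 0$.
Definition 7 ($\epsilon$-Subgradient, [14, Chap. XI, Def. ]).
Given $\hat{w} \in \mathrm{Dom} f$, the vector $u \in \mathbb{R}^d$ is called an $\epsilon$-subgradient of $f$ at $\hat{w}$ when the following property holds for any $w \in \mathbb{R}^d$
$$f(w) \ge f(\hat{w}) + \langle u, w - \hat{w} \rangle - \epsilon. \tag{27}$$
The set of all $\epsilon$-subgradients of $f$ at $\hat{w}$ is the $\epsilon$-subdifferential of $f$ at $\hat{w}$, denoted by $\partial_\epsilon f(\hat{w})$.
The standard subdifferential $\partial f$ is retrieved with $\epsilon = 0$. The following lemma, which is a direct consequence of Def. 7, points out the link between $\partial_\epsilon f$ and an $\epsilon$-minimizer of $f$.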
Lemma 8 (See [14, Chap. XI, Thm. ]).
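For $\hat{w} \in \mathrm{Dom} f$ and $\epsilon \ge 0$, the following two properties are equivalent.
$$0 \in \partial_\epsilon f(\hat{w}) \iff f(\hat{w}) \le f(w) + \epsilon \ \text{ for any } w \in \mathbb{R}^d. \tag{28}$$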
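The subsequent lemma describes the behavior of the $\epsilon$-subdifferential with respect to duality.
Lemma 9 (See [14, Chap. XI, Prop. ]).
Let $f^*$ be the Fenchel conjugate of $f$, namely, $f^*(\cdot) = \sup_{w \in \mathbb{R}^d} \langle \cdot, w \rangle - f(w)$. Then, given $\hat{w} \in \mathrm{Dom} f$, the vector $u \in \mathbb{R}^d$ is an $\epsilon$-subgradient of $f$ at $\hat{w}$ iff
$$f^*(u) + f(\hat{w}) - \langle u, \hat{w} \rangle \le \epsilon. \tag{29}$$
As a result,
$$u \in \partial_\epsilon f(\hat{w}) \iff \hat{w} \in \partial_\epsilon f^*(u). \tag{30}$$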
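As a quick numerical sanity check of Def. 7 and Lemma 9, the following Python sketch considers the one-dimensional function $f(w) = w^2/2$, an illustrative choice that is not taken from the paper, whose Fenchel conjugate is $f^*(u) = u^2/2$. For this function the criterion of Lemma 9 reduces to $(u - \hat{w})^2/2 \le \epsilon$, so the $\epsilon$-subdifferential at $\hat{w}$ should be the interval $[\hat{w} - \sqrt{2\epsilon}, \hat{w} + \sqrt{2\epsilon}]$. All function names are hypothetical, and the defining inequality is only tested on a finite grid.

```python
import math

def f(w):
    return 0.5 * w * w

def f_conj(u):
    # Fenchel conjugate of w -> w^2 / 2 (assumed example function)
    return 0.5 * u * u

def satisfies_definition(u, w_hat, eps, grid, tol=1e-12):
    # Def. 7 on a finite grid: f(w) >= f(w_hat) + u * (w - w_hat) - eps
    return all(f(w) >= f(w_hat) + u * (w - w_hat) - eps - tol for w in grid)

def satisfies_conjugate_criterion(u, w_hat, eps, tol=1e-12):
    # Lemma 9: f*(u) + f(w_hat) - u * w_hat <= eps
    return f_conj(u) + f(w_hat) - u * w_hat <= eps + tol

w_hat, eps = 1.0, 0.08
grid = [i / 100.0 for i in range(-1000, 1001)]
radius = math.sqrt(2.0 * eps)      # half-width of the eps-subdifferential
u_inside = w_hat + 0.9 * radius
u_outside = w_hat + 1.1 * radius

for u in (u_inside, u_outside):
    print(u, satisfies_definition(u, w_hat, eps, grid),
          satisfies_conjugate_criterion(u, w_hat, eps))
```

Both tests should agree on each candidate: the point inside the interval passes and the point outside fails.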
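We now describe some properties of the $\epsilon$-subdifferential which are used in the following analysis.
Lemma 10 (See [14, Chap. XI, Thm. ]).
Let $f_1$ and $f_2$ be two convex, closed and proper functions. Then, given $\hat{w} \in \mathrm{Dom} f_1 \cap \mathrm{Dom} f_2$, we have that
$$\bigcup_{\epsilon_1, \epsilon_2 \ge 0,\ \epsilon_1 + \epsilon_2 \le \epsilon} \big\{ \partial_{\epsilon_1} f_1(\hat{w}) + \partial_{\epsilon_2} f_2(\hat{w}) \big\} \subset \partial_\epsilon (f_1 + f_2)(\hat{w}). \tag{31}$$
Moreover, denoting by $\mathrm{ri}(A)$ the relative interior of a set $A$, when $\mathrm{ri}(\mathrm{Dom} f_1) \cap \mathrm{ri}(\mathrm{Dom} f_2) \ne \emptyset$, equality holds.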
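Lemma 11 (See [14, Chap. XI, Prop. ]).
Let $a > 0$ be a scalar. Then, for a given $\hat{w} \in \mathrm{Dom} f$, we have that
$$\partial_\epsilon (a f)(\hat{w}) = a\, \partial_{\epsilon / a} f(\hat{w}). \tag{32}$$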
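Lemma 12.
Let $X$ be a matrix. Then, for a given $\hat{w}$ such that $X \hat{w} \in \mathrm{Dom} f$, we have that
$$X^\top \partial_\epsilon f(X \hat{w}) \subset \partial_\epsilon (f \circ X)(\hat{w}). \tag{33}$$
Proof. Let $u$ be in $X^\top \partial_\epsilon f(X \hat{w})$. Then, by definition, there exists $v \in \partial_\epsilon f(X \hat{w})$ such that $u = X^\top v$. Consequently, for any $w$, we can write
$$\langle u, w - \hat{w} \rangle = \langle v, X w - X \hat{w} \rangle \le f(X w) - f(X \hat{w}) + \epsilon, \tag{34}$$
where, in the inequality, we have used the fact that $v \in \partial_\epsilon f(X \hat{w})$. This gives the desired statement. ∎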
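The next two results characterize the $\epsilon$-subdifferential of two functions, which are useful in our subsequent analysis. In the following we denote by $\mathbb{S}_+^d$ the set of the $d \times d$ symmetric positive semidefinite matrices.
Example 1 (Quadratic Functions, [14, Chap. XI, Ex. ]).
For a given matrix $Q \in \mathbb{S}_+^d$ and a given vector $b \in \mathbb{R}^d$, consider the function
$$f(w) = \frac{1}{2} \langle w, Q w \rangle + \langle b, w \rangle. \tag{35}$$
Then, given $\hat{w} \in \mathbb{R}^d$, we can express the $\epsilon$-subdifferential of $f$ at $\hat{w}$ with respect to the gradient $\nabla f(\hat{w}) = Q \hat{w} + b$ as follows
$$\partial_\epsilon f(\hat{w}) = \Big\{ \nabla f(\hat{w}) + Q s \ : \ \frac{1}{2} \langle s, Q s \rangle \le \epsilon \Big\}. \tag{36}$$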
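The characterization in Example 1 can be probed numerically. The sketch below, written with NumPy, builds a random positive semidefinite $Q$ (the dimension, the seed and all names are assumptions made only for illustration), forms a candidate $u = \nabla f(\hat{w}) + Q s$ with $\frac{1}{2}\langle s, Q s \rangle = \epsilon$, and measures by how much the inequality of Def. 7 is violated. For a quadratic the violation at $w = \hat{w} + \delta$ equals $\langle Q s, \delta \rangle - \frac{1}{2}\langle \delta, Q \delta \rangle$, which is at most $\epsilon$ and is attained at $\delta = s$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = rng.standard_normal((d, d))
Q = A.T @ A                        # symmetric positive semidefinite
b = rng.standard_normal(d)

def f(w):
    return 0.5 * w @ Q @ w + b @ w

def grad_f(w):
    return Q @ w + b

def violation(u, w_hat, w):
    # Amount by which f(w) >= f(w_hat) + <u, w - w_hat> fails at w;
    # u is an eps-subgradient at w_hat iff this never exceeds eps.
    return f(w_hat) + u @ (w - w_hat) - f(w)

w_hat = rng.standard_normal(d)
eps = 0.5
s = rng.standard_normal(d)
s *= np.sqrt(2.0 * eps / (s @ Q @ s))   # rescale so that 0.5 * <s, Q s> = eps
u = grad_f(w_hat) + Q @ s               # candidate from the formula of Example 1

# Worst violation over random test points, and at w_hat + s where it is attained
samples = w_hat + 5.0 * rng.standard_normal((5000, d))
worst_sampled = max(violation(u, w_hat, w) for w in samples)
at_maximizer = violation(u, w_hat, w_hat + s)
print(worst_sampled, at_maximizer, eps)

# Doubling s gives 0.5 * <2s, Q 2s> = 4 * eps, so this vector should fail
u_bad = grad_f(w_hat) + Q @ (2.0 * s)
print(violation(u_bad, w_hat, w_hat + 2.0 * s))
```

Random sampling can only refute membership, not prove it, so the check at the analytic maximizer is the informative one here.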
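Example 2 (Moreau Envelope, [14, Chap. XI, Ex. ]).
For $\lambda > 0$ and a fixed vector $h \in \mathbb{R}^d$, consider the Moreau envelope of $f$ at the point $h$ with parameter $\lambda$, given by
$$\mathcal{L}(h) = \min_{w \in \mathbb{R}^d} f(w) + \frac{\lambda}{2} \| w - h \|^2. \tag{37}$$
Denote by $\hat{w}_h$ the unique minimizer of the above function, namely, the vector characterized by the optimality conditions
$$0 \in \partial f(\hat{w}_h) + \lambda (\hat{w}_h - h). \tag{38}$$
Then, for any $h \in \mathbb{R}^d$ and $\epsilon \ge 0$, we have that
$$\partial_\epsilon \mathcal{L}(h) = \bigcup_{0 \le \alpha \le \epsilon} \partial_{\epsilon - \alpha} f(\hat{w}_h) \cap \mathcal{B}\big( -\lambda (\hat{w}_h - h), \sqrt{2 \lambda \alpha} \big), \tag{39}$$
where, for any center $c \in \mathbb{R}^d$ and any radius $r \ge 0$, we recall the notation
$$\mathcal{B}(c, r) = \big\{ u \in \mathbb{R}^d : \| u - c \| \le r \big\}. \tag{40}$$
For $\epsilon = 0$ we retrieve the well-known result according to which $\mathcal{L}$ is differentiable, with $\lambda$-Lipschitz gradient given by
$$\nabla \mathcal{L}(h) = -\lambda (\hat{w}_h - h). \tag{41}$$
Finally, from Eq. (39), we can deduce that, if $u \in \partial_\epsilon \mathcal{L}(h)$, then
$$\| \nabla \mathcal{L}(h) - u \| \le \sqrt{2 \lambda \epsilon}. \tag{42}$$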
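To make Example 2 concrete, the following sketch takes the one-dimensional function $f(w) = |w|$, an assumption made only for illustration. Its minimizer in Eq. (37) is given by soft-thresholding at level $1/\lambda$, and the resulting envelope is a Huber-type function. The code compares the gradient formula of Eq. (41) with a central finite difference and checks that a vector farther than $\sqrt{2\lambda\epsilon}$ from the gradient fails the $\epsilon$-subgradient inequality on a grid, in line with Eq. (42). All names are hypothetical.

```python
import math

def prox_abs(h, lam):
    # Minimizer of |w| + (lam / 2) * (w - h)^2: soft-thresholding at level 1 / lam
    t = 1.0 / lam
    if h > t:
        return h - t
    if h < -t:
        return h + t
    return 0.0

def envelope(h, lam):
    w = prox_abs(h, lam)
    return abs(w) + 0.5 * lam * (w - h) ** 2

def envelope_grad(h, lam):
    # Gradient formula of Eq. (41): -lam * (w_h - h)
    return -lam * (prox_abs(h, lam) - h)

def is_eps_subgradient(u, h, lam, eps, grid, tol=1e-12):
    # Def. 7 applied to the envelope, tested on a finite grid only
    return all(envelope(g, lam) >= envelope(h, lam) + u * (g - h) - eps - tol
               for g in grid)

lam, eps, delta = 2.0, 0.01, 1e-6
grid = [i / 100.0 for i in range(-500, 501)]
for h in (0.3, 2.0, -1.2):
    finite_diff = (envelope(h + delta, lam) - envelope(h - delta, lam)) / (2 * delta)
    print(h, envelope_grad(h, lam), finite_diff)

# A vector farther than sqrt(2 * lam * eps) from the gradient should fail Def. 7
h = 0.3
u_far = envelope_grad(h, lam) + 1.1 * math.sqrt(2.0 * lam * eps)
print(is_eps_subgradient(envelope_grad(h, lam), h, lam, eps, grid),
      is_eps_subgradient(u_far, h, lam, eps, grid))
```

The bound in Eq. (42) is a necessary condition, so the sketch only tests its contrapositive: violating the bound rules out membership in the $\epsilon$-subdifferential.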
Appendix B Primal-Dual Formulation of the Within-Task Problem
In this section, we give the primal-dual formulation of the biased regularized empirical risk minimization problem outlined in Eq. (3) for each single task. Specifically, rewriting for any and , the empirical risk
(43) 
for any , we can express our meta-objective function in Eq. (15) as
(44) 
We remark that, in the optimization community, this function is known as the Moreau envelope of the empirical error at the point $h$; see also Example 2. In this section, in order to simplify the presentation, we omit the dependence on the dataset in the notation. The unique minimizer of the above function
(45) 
is known as the proximity operator of the empirical error at the point and it coincides with the ERM algorithm introduced in Eq. (3) in the paper. We interpret the vector in Eq. (3)–(45) as the solution of the primal problem
(46) 
The next proposition is a standard result stating that, in this setting, strong duality holds and the optimality conditions, also known as Karush–Kuhn–Tucker (KKT) conditions, provide a unique way to determine the primal variables from the dual ones.
Proposition 13 (Strong Duality, [6, Thm. ], [3, Prop. ]).
Consider the primal problem in Eq. (46). Then, its dual problem admits a solution
(47) 
where, thanks to the separability of , for any , we have that
(48) 
Moreover, strong duality holds, namely,
(49) 
and the optimality (KKT) conditions read as follows
(50) 
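The statement of Prop. 13 can be illustrated numerically in a special case. The sketch below assumes the square loss $\ell_k(z) = \frac{1}{2}(z - y_k)^2$, an empirical risk averaged with weight $1/n$, and a dual written as a maximization problem obtained by Fenchel duality; these conventions, the synthetic data and all names are assumptions of the sketch and may differ in sign or scaling from Eqs. (46)–(50). Under these assumptions the primal solution is a biased ridge regression available in closed form, the dual solution follows from the KKT relation $u_k = \ell_k'(\langle x_k, \hat{w}_h \rangle)$, and the primal vector is recovered as $h - X^\top u / (\lambda n)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 5, 0.5
X = rng.standard_normal((n, d))     # synthetic inputs, one row per example
y = rng.standard_normal(n)          # synthetic labels
h = rng.standard_normal(d)          # bias vector

def primal(w):
    # (1/n) sum_k 0.5 * (<x_k, w> - y_k)^2 + (lam/2) * ||w - h||^2
    return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * np.sum((w - h) ** 2)

def dual(u):
    # -(1/n) sum_k l_k^*(u_k) + (1/n) <X^T u, h> - ||X^T u||^2 / (2 lam n^2),
    # with l_k^*(s) = s * y_k + s^2 / 2 for the square loss (assumed convention)
    v = X.T @ u
    return -np.mean(u * y + 0.5 * u ** 2) + (v @ h) / n - (v @ v) / (2 * lam * n ** 2)

# Primal solution (biased ridge regression) in closed form
w_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n + lam * h)
# Dual solution from the KKT conditions: u_k = l_k'(<x_k, w_hat>)
u_hat = X @ w_hat - y
# Primal solution recovered from the dual one
w_from_dual = h - X.T @ u_hat / (lam * n)

print(primal(w_hat), dual(u_hat))             # equal under strong duality
print(np.max(np.abs(w_hat - w_from_dual)))    # close to zero
```

Evaluating `dual` at any other vector should give a value no larger than `primal` at any other point, which is the weak duality inequality.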
Appendix C Properties of the Meta-Objective
In this section we recall some properties of the meta-objective function already outlined in the text in Prop. 4.
See Prop. 4