
# Learning-to-Learn Stochastic Gradient Descent with Biased Regularization

We study the problem of learning-to-learn: inferring a learning algorithm that works well on tasks sampled from an unknown distribution. As a class of algorithms we consider Stochastic Gradient Descent on the true risk regularized by the squared Euclidean distance to a bias vector. We present an average excess risk bound for such a learning algorithm. This result quantifies the potential benefit of using a bias vector with respect to the unbiased case. We then address the problem of estimating the bias from a sequence of tasks. We propose a meta-algorithm which incrementally updates the bias as new tasks are observed. The low space and time complexity of this approach makes it appealing in practice. We provide guarantees on the learning ability of the meta-algorithm. A key feature of our results is that, when the number of tasks grows and their variance is relatively small, our learning-to-learn approach has a significant advantage over learning each task in isolation by Stochastic Gradient Descent without a bias term. We report on numerical experiments which demonstrate the effectiveness of our approach.


## 1 Introduction

¹ Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia, 16163 Genoa, Italy

² Department of Mathematics, University of Genoa, 16146 Genoa, Italy

³ Department of Electrical and Electronic Engineering, Imperial College of London, SW7 1AL, London, UK

⁴ Department of Computer Science, University College London, WC1E 6BT, London, UK

The problem of learning-to-learn (LTL) [4, 30] has received increasing attention in recent years, due to its practical importance [11, 26] and the theoretical challenge of statistically principled and efficient solutions [1, 2, 21, 23, 9, 10, 12]. The principal aim of LTL is to design a meta-learning algorithm to select a supervised learning algorithm that is well suited to learn tasks from a prescribed family. To highlight the difference between the meta-learning algorithm and the learning algorithm, throughout the paper we will refer to the latter as the inner or within-task algorithm.

The meta-algorithm is trained from a sequence of datasets, associated with different learning tasks sampled from a meta-distribution (also called environment in the literature). The performance of the selected inner algorithm is measured by the transfer risk [4, 18], that is, the average risk of the algorithm, trained on a random dataset from the same environment. A key insight is that, when the learning tasks share specific similarities, the LTL framework provides a means to leverage such similarities and select an inner algorithm of low transfer risk.

In this work, we consider environments of linear regression or binary classification tasks and we assume that the associated weight vectors are all close to a common vector. Because of the increasing interest in low computational complexity procedures, we focus on the family of within-task algorithms given by Stochastic Gradient Descent (SGD) working on the regularized true risk. Specifically, motivated by the above assumption on the environment, we consider as regularizer the square distance of the weight vector to a bias vector, playing the role of a common mean among the tasks. Knowledge of this common mean can substantially facilitate the inner algorithm and the main goal of this paper is to design a meta-algorithm to learn a good bias that is supported by both computational and statistical guarantees.

Contributions.

Our proof technique combines ideas from online learning, stochastic and convex optimization, with tools from LTL. A key insight in our approach is to exploit the inner SGD algorithm to compute an approximate subgradient of the surrogate objective, in such a way that the degree of approximation can be controlled, without affecting the overall performance or the computational cost of the meta-algorithm.

Paper Organization. We start by recalling in section 2 the basic concepts of LTL. In section 3 we cast the problem of choosing the right bias term in SGD on the regularized objective in the LTL framework. Thanks to this formulation, in section 4 we characterize the situations in which SGD with the right bias term is beneficial in comparison to SGD with no bias. In section 5 we propose an online meta-algorithm to estimate the bias vector from a sequence of datasets and we analyze its statistical properties. In section 6 we report on the empirical performance of the proposed approach, while in section 7 we discuss some future research directions.

Previous Work. The LTL literature in the online setting [1, 9, 10, 24] has received limited attention and is less developed than standard LTL approaches, in which the data are processed in one batch as opposed to incrementally, see for instance [4, 19, 20, 21, 23]. The idea of introducing a bias in the learning algorithm is not new, see e.g. [10, 15, 23] and the discussion in section 3. In this work, we consider the family of inner SGD algorithms with biased regularization and we develop a theoretically grounded meta-learning algorithm that learns the bias. We are not aware of other works dealing with such a family in the LTL framework. Unlike other online methods [1, 9], our approach does not need to keep previous training points in memory and it runs online both across and within the tasks. As a result, low space and time complexity are the strengths of our method.

## 2 Preliminaries

In this section, we recall the standard supervised (i.e. single-task) learning setting and the learning-to-learn setting.

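We first introduce some notation used throughout. We denote by $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ the data space, where $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}$ (regression) or $\mathcal{Y} = \{-1, +1\}$ (binary classification). Throughout this work we consider linear supervised learning tasks $\mu$, namely distributions over $\mathcal{Z}$, parametrized by a weight vector $w \in \mathbb{R}^d$. We measure the performance by a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ such that, for any $y \in \mathcal{Y}$, $\ell(\cdot, y)$ is convex and closed. Finally, for any positive $k \in \mathbb{N}$, we let $[k] = \{1, \dots, k\}$, and we denote by $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ the standard inner product and Euclidean norm. In the rest of this work, when specified, we make the following assumptions.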

###### Assumption 1 (Bounded Inputs).

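Let $\mathcal{X} \subseteq \mathcal{B}(0, R)$, where $\mathcal{B}(0, R) = \{x \in \mathbb{R}^d : \|x\| \le R\}$, for some radius $R \ge 0$.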

###### Assumption 2 (Lipschitz Loss).

Let $\ell(\cdot, y)$ be $L$-Lipschitz for any $y \in \mathcal{Y}$.

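For example, for any $y, \hat{y} \in \mathcal{Y}$, the absolute loss $\ell(\hat{y}, y) = |\hat{y} - y|$ and the hinge loss $\ell(\hat{y}, y) = \max\{0, 1 - y\hat{y}\}$ are both $1$-Lipschitz. We now briefly recall the main notion of single-task learning.

### 2.1 Single-Task Learning

In standard linear supervised learning, the goal is to learn a linear functional relation $f_w : \mathcal{X} \to \mathcal{Y}$, $f_w(\cdot) = \langle \cdot, w \rangle$, between the input space $\mathcal{X}$ and the output space $\mathcal{Y}$. This target can be reformulated as the one of finding a weight vector $w_\mu$ minimizing the expected risk (or true risk)

$$R_\mu(w) = \mathbb{E}_{(x, y) \sim \mu}\, \ell(\langle x, w \rangle, y) \tag{1}$$

over the entire space $\mathbb{R}^d$. The expected risk measures the prediction error that the weight vector $w$ incurs on average with respect to points sampled from the distribution $\mu$. In practice, the task $\mu$ is unknown and only partially observed through a corresponding dataset of $n$ i.i.d. points $Z_n = (z_k)_{k=1}^n \sim \mu^n$, where, for every $k \in [n]$, $z_k = (x_k, y_k) \in \mathcal{Z}$. In the sequel, we often use the more compact notation $Z_n = (X_n, \mathbf{y}_n)$, where $X_n \in \mathbb{R}^{n \times d}$ is the matrix containing the input vectors $x_k$ as rows and $\mathbf{y}_n \in \mathbb{R}^n$ is the vector with entries given by the labels $y_k$. A learning algorithm is a function $A$ that, given such a training dataset $Z_n$, returns a “good” estimator, that is, in our case, a weight vector $A(Z_n) \in \mathbb{R}^d$, whose expected risk is small and tends to the minimum of Eq. (1) as $n$ increases.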

### 2.2 Learning-to-Learn (LTL)

In the LTL framework, we assume that each learning task $\mu$ we observe is sampled from an environment $\rho$, that is, a (meta-)distribution on the set of probability distributions on $\mathcal{Z}$. The goal is to select a learning algorithm (hence the name learning-to-learn) that is well suited to the environment.

Specifically, we consider the following setting. We receive a stream of tasks $\mu_1, \dots, \mu_T$, which are independently sampled from the environment $\rho$ and only partially observed through corresponding i.i.d. datasets $Z_n^{(1)}, \dots, Z_n^{(T)}$, each formed by $n$ datapoints. Starting from these datasets, we wish to learn an algorithm $A$ such that, when we apply it to a new dataset (composed of $n$ points) sampled from a new task $\mu \sim \rho$, the corresponding true risk is low. (To simplify the presentation, we assume that all datasets are composed of the same number of points $n$; the general setting can be addressed by introducing a slightly different definition of the transfer risk.) We reformulate this target as requiring that the algorithm $A$, trained with $n$ points over the environment $\rho$, has small transfer risk

$$\mathcal{E}_n(A) = \mathbb{E}_{\mu \sim \rho}\, \mathbb{E}_{Z_n \sim \mu^n}\, R_\mu(A(Z_n)). \tag{2}$$

The transfer risk measures the expected true risk that the inner algorithm $A$, trained on the dataset $Z_n$, incurs on average with respect to the distribution of tasks $\mu$ sampled from $\rho$. Therefore, the process of learning a learning algorithm is a meta-learning one, in that the inner learning algorithm is applied to tasks from the environment and then chosen from a sequence of training tasks (datasets) in an attempt to minimize the transfer risk.

As we will see in the following, in this work we consider a family of learning algorithms $A_h$ parametrized by a bias vector $h \in \mathbb{R}^d$.

## 3 SGD on the Biased Regularized Risk

In this section, we introduce the LTL framework for the family of within-task algorithms we analyze in this work: SGD on the biased regularized true risk.

The idea of introducing a bias in a specific family of learning algorithms is not new in the LTL literature, see e.g. [10, 15, 23] and references therein. A natural choice is given by regularized empirical risk minimization, in which we introduce a bias in the square norm regularizer – which we simply refer to as ERM throughout – namely

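$$A_h^{\mathrm{ERM}}(Z_n) \equiv w_h(Z_n) = \operatorname*{argmin}_{w \in \mathbb{R}^d} R_{Z_n, h}(w), \tag{3}$$

where, for any $w, h \in \mathbb{R}^d$ and $\lambda > 0$, we have defined the empirical error and its biased regularized version as

$$R_{Z_n}(w) = \frac{1}{n} \sum_{k=1}^{n} \ell_k(\langle x_k, w \rangle), \qquad R_{Z_n, h}(w) = R_{Z_n}(w) + \frac{\lambda}{2} \|w - h\|^2, \tag{4}$$

with $\ell_k(\cdot) = \ell(\cdot, y_k)$.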

Intuitively, if the weight vectors of the tasks sampled from $\rho$ are close to each other, then running ERM with $h = m \equiv \mathbb{E}_{\mu \sim \rho} w_\mu$ should have a smaller transfer risk than running ERM with, for instance, $h = 0$. We make this statement precise in section 4. Recently, a number of works have considered how to learn a good bias in an LTL setting, see e.g. [23, 10]. However, one drawback of these works is that they assume the ERM solution to be known exactly, without leveraging the interplay between the optimization and the generalization error. Furthermore, in LTL settings, data naturally arise in an online manner, both between and within tasks. Hence, an ideal LTL approach should focus on inner algorithms processing one single data point at a time.
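The biased regularized empirical risk of Eq. (4) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names (`empirical_risk`, `biased_regularized_risk`) and the toy data are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def absolute_loss(pred, y):
    """Absolute loss |pred - y| (1-Lipschitz in pred)."""
    return np.abs(pred - y)

def hinge_loss(pred, y):
    """Hinge loss max(0, 1 - y * pred) for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * pred)

def empirical_risk(w, X, y, loss=absolute_loss):
    """Average loss of the linear predictor <x_k, w> over the dataset."""
    return float(np.mean(loss(X @ w, y)))

def biased_regularized_risk(w, h, X, y, lam, loss=absolute_loss):
    """Empirical risk plus (lam / 2) * ||w - h||^2, as in Eq. (4)."""
    return empirical_risk(w, X, y, loss) + 0.5 * lam * float(np.sum((w - h) ** 2))

# Toy data (hypothetical): 5 points in dimension 3.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 3)), rng.normal(size=5)
w = rng.normal(size=3)
```

With $h = w$ the regularizer vanishes and the two risks coincide, while moving $h$ away from $w$ adds exactly $\frac{\lambda}{2}\|w - h\|^2$.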

Motivated by the above reasoning, in this work we propose to analyze an online learning algorithm that is computationally and memory efficient while retaining (on average with respect to the sampling of the data) the same statistical guarantees as the more expensive ERM estimator. Specifically, for a training dataset $Z_n \sim \mu^n$, a regularization parameter $\lambda > 0$ and a bias vector $h \in \mathbb{R}^d$, we consider the learning algorithm defined as

$$A_h^{\mathrm{SGD}}(Z_n) \equiv \bar{w}_h(Z_n), \tag{5}$$

where $\bar{w}_h(Z_n)$ is the average of the first $n$ iterations of Alg. 1, in which, for any $k \in [n]$, we have introduced the notation $\ell_k(\cdot) = \ell(\cdot, y_k)$.

Alg. 1 coincides with the online subgradient algorithm applied to the strongly convex function $R_{Z_n, h}$. Moreover, thanks to the assumption that $Z_n \sim \mu^n$, Alg. 1 is equivalent to SGD applied to the regularized true risk

$$R_{\mu, h}(w) = R_\mu(w) + \frac{\lambda}{2} \|w - h\|^2. \tag{6}$$
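Since the pseudocode of Alg. 1 is not reproduced in this text, the following is a minimal sketch of a within-task algorithm of this kind rather than the authors' exact procedure. It assumes the initialization $w^{(1)} = h$ and the step size $1/(\lambda k)$, which is the standard choice for online subgradient descent on a $\lambda$-strongly convex objective; the name `inner_sgd` and the default absolute loss are also assumptions.

```python
import numpy as np

def abs_loss_subgrad(pred, y):
    """A subgradient of the absolute loss |pred - y| with respect to pred."""
    return np.sign(pred - y)

def inner_sgd(X, y, h, lam, loss_subgrad=abs_loss_subgrad):
    """Single pass of SGD on the biased regularized risk.

    Returns the average of the first n iterates and the last iterate w^(n+1).
    """
    n, d = X.shape
    h = np.asarray(h, dtype=float)
    w = h.copy()                          # assumed initialization at the bias
    w_sum = np.zeros(d)
    for k in range(1, n + 1):
        w_sum += w                        # accumulate w^(k) before updating it
        x_k, y_k = X[k - 1], y[k - 1]
        u = loss_subgrad(x_k @ w, y_k)    # subgradient of the loss at <x_k, w^(k)>
        s = u * x_k + lam * (w - h)       # subgradient of l_k(<x_k, .>) + lam/2 ||. - h||^2
        w = w - s / (lam * k)             # assumed step size 1 / (lam * k)
    return w_sum / n, w
```

Each data point is used once and then discarded, so the memory footprint is a handful of $d$-dimensional vectors regardless of $n$.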

Relying on a standard online-to-batch argument, see e.g. [8, 13] and references therein, it is easy to link the true error of such an algorithm with the minimum of the regularized empirical risk, that is, $R_{Z_n, h}(w_h(Z_n))$. This fact is reported in the proposition below and it will often be used in our subsequent statistical analysis. We give a proof in Appendix F for completeness.

###### Proposition 1.

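Let Asm. 1 and Asm. 2 hold and let $\bar{w}_h$ be the output of Alg. 1. Then, we have that

$$\mathbb{E}_{Z_n \sim \mu^n} \big[ R_\mu(\bar{w}_h(Z_n)) - R_{Z_n, h}(w_h(Z_n)) \big] \le \frac{2 R^2 L^2 (\log(n) + 1)}{\lambda n}. \tag{7}$$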

We remark that at this level of the analysis, one could also avoid the logarithmic factor in the above bound, see e.g. [29, 25, 16]. However, in order to not complicate our presentation and proofs, we avoid this refinement of the analysis.

In the next section we study the impact of the bias vector on the statistical performance of the inner algorithm. Specifically, we investigate under which circumstances there is an advantage in perturbing the regularization in the objective used by the algorithm with an appropriate ideal bias term $h$, as opposed to fixing $h = 0$. Throughout the paper, we refer to the choice $h = 0$ as independent task learning (ITL), although strictly speaking, when $h$ is fixed in advance, SGD is applied on each task independently regardless of the value of $h$. Then, in section 5 we address the question of estimating this appropriate bias from the data.

## 4 The Advantage of the Right Bias Term

In this section, we study the statistical performance of the model returned by Alg. 1, on average with respect to the tasks sampled from the environment $\rho$, for different choices of the bias vector $h$. To present our observations, we require, for any $\mu \sim \rho$, that the corresponding true risk admits minimizers, and we denote by $w_\mu$ the minimum norm minimizer. (This choice is made in order to simplify our presentation. However, our analysis holds for different choices of a minimizer $w_\mu$, which may potentially improve our bounds.) With these ingredients, we introduce the oracle

$$\mathcal{E}_\rho = \mathbb{E}_{\mu \sim \rho}\, R_\mu(w_\mu),$$

representing the averaged minimum error over the environment of tasks, and, for a candidate bias $h$, we give a bound on the quantity $\mathcal{E}_n(\bar{w}_h) - \mathcal{E}_\rho$. This gap coincides with the averaged excess risk of Alg. 1 with bias $h$ over the environment of tasks, that is

$$\mathcal{E}_n(\bar{w}_h) - \mathcal{E}_\rho = \mathbb{E}_{\mu \sim \rho}\, \mathbb{E}_{Z_n \sim \mu^n} \big[ R_\mu(\bar{w}_h(Z_n)) - R_\mu(w_\mu) \big].$$

Hence, this quantity is an indicator of the performance of the bias with respect to our environment. In the rest of this section, we study the above gap for a bias which is fixed and does not depend on the data. Before doing this, we introduce the notation

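$$\mathrm{Var}_h^2 = \frac{1}{2}\, \mathbb{E}_{\mu \sim \rho} \|w_\mu - h\|^2, \tag{8}$$

which is used throughout this work, and we observe that

$$m \equiv \mathbb{E}_{\mu \sim \rho}\, w_\mu = \operatorname*{argmin}_{h \in \mathbb{R}^d} \mathrm{Var}_h^2. \tag{9}$$

###### Theorem 2 (Excess Transfer Risk Bound for a Fixed Bias h).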

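Let Asm. 1 and Asm. 2 hold and let $\bar{w}_h$ be the output of Alg. 1 with regularization parameter

$$\lambda = \frac{RL}{\mathrm{Var}_h} \sqrt{\frac{2 (\log(n) + 1)}{n}}. \tag{10}$$

Then, the following bound holds

$$\mathcal{E}_n(\bar{w}_h) - \mathcal{E}_\rho \le \mathrm{Var}_h\, 2RL \sqrt{\frac{2 (\log(n) + 1)}{n}}. \tag{11}$$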

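Proof. For $\mu \sim \rho$, consider the following decomposition

$$\mathbb{E}_{Z_n \sim \mu^n}\, R_\mu(\bar{w}_h(Z_n)) - R_\mu(w_\mu) \le A + B, \tag{12}$$

where A and B are respectively defined by

$$A = \mathbb{E}_{Z_n \sim \mu^n} \big[ R_\mu(\bar{w}_h(Z_n)) - R_{Z_n, h}(w_h(Z_n)) \big], \qquad B = \mathbb{E}_{Z_n \sim \mu^n} \big[ R_{Z_n, h}(w_h(Z_n)) \big] - R_\mu(w_\mu). \tag{13}$$

In order to bound the term A, we use Prop. 1. Regarding the term B, we exploit the definition of the ERM algorithm and the fact that, since $w_\mu$ does not depend on $Z_n$, we have $R_{\mu, h}(w_\mu) = \mathbb{E}_{Z_n \sim \mu^n} R_{Z_n, h}(w_\mu)$. Consequently, we can upper bound the term B as

$$\begin{aligned} B &= \mathbb{E}_{Z_n \sim \mu^n} \big[ R_{Z_n, h}(w_h(Z_n)) - R_{\mu, h}(w_\mu) \big] + \frac{\lambda}{2} \|w_\mu - h\|^2 \\ &= \mathbb{E}_{Z_n \sim \mu^n} \big[ R_{Z_n, h}(w_h(Z_n)) - R_{Z_n, h}(w_\mu) \big] + \frac{\lambda}{2} \|w_\mu - h\|^2 \le \frac{\lambda}{2} \|w_\mu - h\|^2. \end{aligned} \tag{14}$$

The desired statement follows by combining the above bounds on the two terms, taking the average with respect to $\mu \sim \rho$ and optimizing over $\lambda$. ∎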
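For readers who want the final optimization step spelled out, here is a short worked sketch. It only combines the bound of Prop. 1 on the term A with the bound of Eq. (14) on the term B, averaged over the environment; the annotations under the braces are ours.

```latex
% Combined bound, using (1/2) E ||w_mu - h||^2 = Var_h^2:
\mathcal{E}_n(\bar{w}_h) - \mathcal{E}_\rho
  \;\le\; \underbrace{\frac{2R^2L^2(\log(n)+1)}{\lambda n}}_{\text{bound on } A}
  \;+\; \underbrace{\lambda\,\mathrm{Var}_h^2}_{\text{average of the bound on } B}.

% Setting the derivative with respect to \lambda to zero:
-\frac{2R^2L^2(\log(n)+1)}{\lambda^2 n} + \mathrm{Var}_h^2 = 0
\quad\Longrightarrow\quad
\lambda = \frac{RL}{\mathrm{Var}_h}\sqrt{\frac{2(\log(n)+1)}{n}}.

% At this value both terms equal RL Var_h sqrt(2(log(n)+1)/n), hence
\mathcal{E}_n(\bar{w}_h) - \mathcal{E}_\rho
  \;\le\; \mathrm{Var}_h\, 2RL\,\sqrt{\frac{2(\log(n)+1)}{n}}.
```

The resulting value of $\lambda$ is the one in Eq. (10) and the resulting bound is the one in Eq. (11).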

Thm. 2 shows that the strength of the regularization that one should use in the within-task algorithm Alg. 1 decreases with both the variance $\mathrm{Var}_h$ of the bias and the number of points $n$ in the datasets: by Eq. (10), $\lambda$ scales as $1/\mathrm{Var}_h$ and, up to a logarithmic factor, as $1/\sqrt{n}$. This is exactly in line with the LTL aim: when solving each task is difficult, knowing a priori a good bias can bring a substantial benefit over learning with no bias. To further investigate this point, in the following corollary we specialize Thm. 2 to two choices of the bias which are particularly meaningful for our analysis. The first choice is $h = 0$, which coincides, as remarked earlier, with learning each task independently, while the second choice considers an ideal bias, namely, assuming that the transfer risk admits a minimizer, we set $h = h_n \in \operatorname*{argmin}_{h \in \mathbb{R}^d} \mathcal{E}_n(\bar{w}_h)$.

###### Corollary 3 (Excess Transfer Risk Bound for ITL and the Oracle).

Let Asm. 1 and Asm. 2 hold.

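1. Independent Task Learning. Let $\bar{w}_0$ be the output of Alg. 1 with bias $h = 0$ and regularization parameter as in Eq. (10) with $h = 0$. Then, the following bound holds

$$\mathcal{E}_n(\bar{w}_0) - \mathcal{E}_\rho \le \mathrm{Var}_0\, 2RL \sqrt{\frac{2 (\log(n) + 1)}{n}}.$$

2. The Oracle. Let $\bar{w}_{h_n}$ be the output of Alg. 1 with bias $h = h_n$ and regularization parameter as in Eq. (10) with $h = m$. Then, the following bound holds

$$\mathcal{E}_n(\bar{w}_{h_n}) - \mathcal{E}_\rho \le \mathrm{Var}_m\, 2RL \sqrt{\frac{2 (\log(n) + 1)}{n}}.$$

Proof. The first statement directly follows from the application of Thm. 2 with $h = 0$. The second statement is a direct consequence of the definition of $h_n$, implying $\mathcal{E}_n(\bar{w}_{h_n}) - \mathcal{E}_\rho \le \mathcal{E}_n(\bar{w}_m) - \mathcal{E}_\rho$, and the application of Thm. 2 with $h = m$ on the second term. ∎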

From the previous bounds we can observe that using the bias $h = m$ in the regularizer brings a substantial benefit with respect to the unbiased case when the number of points in each dataset is not very large (hence learning each task is quite difficult) and the variance of the tasks' weight vectors sampled from the environment is much smaller than their second moment, i.e. when

$$\mathrm{Var}_m^2 = \frac{1}{2}\, \mathbb{E}_{\mu \sim \rho} \|w_\mu - m\|^2 \ll \frac{1}{2}\, \mathbb{E}_{\mu \sim \rho} \|w_\mu\|^2 = \mathrm{Var}_0^2.$$

Driven by this observation, when the environment of tasks satisfies the above characteristics, we would like to take advantage of this similarity among the tasks. But, since in practice we are not able to explicitly compute $m$, in the following section we propose an efficient online LTL approach to estimate the bias directly from the observed sequence of tasks.
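The gap between $\mathrm{Var}_m$ and $\mathrm{Var}_0$ is easy to inspect numerically. The sketch below uses a purely hypothetical environment (Gaussian task vectors around a common mean, with arbitrary sizes chosen for illustration) and empirical averages in place of the expectations over $\rho$.

```python
import numpy as np

def var_sq(W, h):
    """Empirical version of Var_h^2 = 0.5 * E ||w_mu - h||^2 over the rows of W."""
    return 0.5 * float(np.mean(np.sum((W - h) ** 2, axis=1)))

rng = np.random.default_rng(0)
d, num_tasks = 10, 1000                          # hypothetical sizes
mean_vec = 2.0 * np.ones(d)                      # hypothetical common mean
W = mean_vec + rng.normal(size=(num_tasks, d))   # one task weight vector per row

m_hat = W.mean(axis=0)                           # empirical counterpart of m
var_m_sq = var_sq(W, m_hat)
var_0_sq = var_sq(W, np.zeros(d))

# Ratio between the oracle bound and the ITL bound of Cor. 3.
ratio = np.sqrt(var_m_sq / var_0_sq)
print(f"Var_m^2 = {var_m_sq:.3f}, Var_0^2 = {var_0_sq:.3f}, bound ratio = {ratio:.3f}")
```

For the empirical mean the identity $\mathrm{Var}_0^2 = \mathrm{Var}_m^2 + \frac{1}{2}\|m\|^2$ holds exactly, so the advantage of the biased regularizer grows with the norm of the common mean relative to the spread of the tasks.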

## 5 Estimating the Bias

In this section, we study the problem of designing an estimator for the bias vector that is computed incrementally from a set of observed tasks.

### 5.1 The Meta-Objective

Since direct optimization of the transfer risk is not feasible, a standard strategy used in LTL consists in introducing a proxy objective that is easier to handle, see e.g. [18, 19, 20, 21, 9, 10]. In this paper, motivated by Prop. 1, according to which

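$$\mathbb{E}_{Z_n \sim \mu^n} \big[ R_\mu(\bar{w}_h(Z_n)) \big] \le \mathbb{E}_{Z_n \sim \mu^n} \big[ R_{Z_n, h}(w_h(Z_n)) \big] + \frac{2 R^2 L^2 (\log(n) + 1)}{\lambda n},$$

we substitute in the definition of the transfer risk the true risk of the algorithm $R_\mu(\bar{w}_h(Z_n))$ with the minimum of the regularized empirical risk

$$\mathcal{L}_{Z_n}(h) = \min_{w \in \mathbb{R}^d} R_{Z_n, h}(w) = R_{Z_n, h}(w_h(Z_n)). \tag{15}$$

This leads us to the following proxy for the transfer risk

$$\hat{\mathcal{E}}_n(h) = \mathbb{E}_{\mu \sim \rho}\, \mathbb{E}_{Z_n \sim \mu^n}\, \mathcal{L}_{Z_n}(h). \tag{16}$$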

Some remarks about this choice are in order. First, convexity is usually a rare property in LTL. In our case, as described in the following proposition, the definition of the function $\mathcal{L}_{Z_n}$ as the partial minimum of a jointly convex function ensures convexity and other nice properties, such as differentiability and a closed-form expression of its gradient.

###### Proposition 4 (Properties of LZn).

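The function $\mathcal{L}_{Z_n}$ in Eq. (15) is convex and $\lambda$-smooth over $\mathbb{R}^d$. Moreover, for any $h \in \mathbb{R}^d$, its gradient is given by the formula

$$\nabla \mathcal{L}_{Z_n}(h) = -\lambda \big( w_h(Z_n) - h \big), \tag{17}$$

where $w_h(Z_n)$ is the output of the ERM algorithm in Eq. (3). Finally, when Asm. 1 and Asm. 2 hold, $\mathcal{L}_{Z_n}$ is $LR$-Lipschitz.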
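The gradient formula in Eq. (17) can be checked numerically. The sketch below is an illustration under an explicit assumption: it uses the squared loss $\frac{1}{2}(\langle x, w \rangle - y)^2$, chosen only because the ERM solution then has a closed form. This loss does not satisfy Asm. 2, but the convexity and gradient statements of the proposition do not rely on that assumption. All function names are hypothetical.

```python
import numpy as np

def erm_solution(X, y, h, lam):
    """Minimizer of (1/2n) ||Xw - y||^2 + (lam/2) ||w - h||^2 in closed form."""
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)
    b = X.T @ y / n + lam * h
    return np.linalg.solve(A, b)

def meta_objective(X, y, h, lam):
    """L_Z(h): the biased regularized empirical risk evaluated at its minimizer."""
    n = X.shape[0]
    w = erm_solution(X, y, h, lam)
    return 0.5 * np.sum((X @ w - y) ** 2) / n + 0.5 * lam * np.sum((w - h) ** 2)

def meta_gradient(X, y, h, lam):
    """Gradient formula of Eq. (17): -lam * (w_h - h)."""
    return -lam * (erm_solution(X, y, h, lam) - h)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 4)), rng.normal(size=20)
h, lam, eps = rng.normal(size=4), 0.5, 1e-5

# Central finite differences of L_Z along each coordinate of h.
fd = np.array([
    (meta_objective(X, y, h + eps * e, lam) - meta_objective(X, y, h - eps * e, lam)) / (2 * eps)
    for e in np.eye(4)
])
max_err = float(np.max(np.abs(fd - meta_gradient(X, y, h, lam))))
```

The finite-difference gradient and the closed-form expression agree up to rounding error, which is the content of Eq. (17) in this special case.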

The above statement is a known result in the optimization community, see e.g. [3, Prop. ] and Appendix C for more details. In order to minimize the proxy objective in Eq. (16), one standard choice in stochastic optimization, also adopted in this work, is to use first-order methods, which require the computation of an unbiased estimate of the gradient of the stochastic objective. In our case, according to the above proposition, this step would require computing the minimizer of the regularized empirical problem in Eq. (15) exactly. A key observation of our work is to show below that we can easily design a “satisfactory” approximation (see the last paragraph in section 5) of its gradient, just by substituting the minimizer in the expression of the gradient in Eq. (17) with the last iterate of Alg. 1. An important aspect to stress here is that this strategy does not require any additional computational effort. Formally, this reasoning is linked to the concept of $\epsilon$-subgradient of a function. We recall that, for a given convex, proper and closed function $f$ and for a given point $\hat{h}$ in its domain, $u$ is an $\epsilon$-subgradient of $f$ at $\hat{h}$ if, for any $h$, $f(h) \ge f(\hat{h}) + \langle u, h - \hat{h} \rangle - \epsilon$.

###### Proposition 5 (An ϵ-Subgradient for LZn).

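Let $w_h^{(n+1)}(Z_n)$ be the last iterate of Alg. 1. Then, under Asm. 1 and Asm. 2, the vector

$$\hat{\nabla} \mathcal{L}_{Z_n}(h) = -\lambda \big( w_h^{(n+1)}(Z_n) - h \big) \tag{18}$$

is an $\epsilon$-subgradient of $\mathcal{L}_{Z_n}$ at point $h$, where $\epsilon$ is such that

$$\mathbb{E}_{Z_n \sim \mu^n} [\epsilon] \le \frac{2 R^2 L^2 (\log(n) + 1)}{\lambda n}. \tag{19}$$

Moreover, introducing $\Delta_{Z_n}(h) = \nabla \mathcal{L}_{Z_n}(h) - \hat{\nabla} \mathcal{L}_{Z_n}(h)$,

$$\mathbb{E}_{Z_n \sim \mu^n} \big\| \Delta_{Z_n}(h) \big\|^2 \le \frac{4 R^2 L^2 (\log(n) + 1)}{n}. \tag{20}$$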

The above result is a key tool in our analysis. The proof requires some preliminaries on the $\epsilon$-subdifferential of a function (see Appendix A) and introducing the dual formulation of both the within-task learning problem and Alg. 1 (see Appendix B and Appendix E, respectively). With these two ingredients, the proof of the statement is deduced in subsection E.3 by the application of a more general result reported in Appendix D, describing how an $\epsilon$-minimizer of the dual of the within-task learning problem can be exploited in order to build an $\epsilon$-subgradient of the meta-objective function $\mathcal{L}_{Z_n}$. We stress that this result could be applied to a more general class of algorithms, going beyond Alg. 1 considered here.

### 5.2 The Meta-Algorithm to Estimate the Bias h

In order to estimate the bias from the data, we apply SGD to the stochastic function $\hat{\mathcal{E}}_n$ introduced in Eq. (16). More precisely, in our setting, the sampling of a “meta-point” corresponds to the incremental sampling of a dataset from the environment (that is, we first sample a distribution $\mu$ from $\rho$ and then a dataset $Z_n \sim \mu^n$). We refer to Alg. 2 for more details. In particular, we propose to take the estimator $\bar{h}_T$ obtained by averaging the iterations returned by Alg. 2. An important feature to stress here is that the meta-algorithm uses $\epsilon$-subgradients of the function $\mathcal{L}_{Z_n}$, which are computed as described above. Specifically, for any $t \in [T]$, we define

$$\hat{\nabla} \mathcal{L}_{Z_n^{(t)}}(h^{(t)}) = -\lambda \big( w_{h^{(t)}}^{(n+1)}(Z_n^{(t)}) - h^{(t)} \big), \tag{21}$$

where $w_{h^{(t)}}^{(n+1)}(Z_n^{(t)})$ is the last iterate of Alg. 1 applied with the current bias $h^{(t)}$ and the dataset $Z_n^{(t)}$. To simplify the presentation, throughout this work, we use the short-hand notation

$$\mathcal{L}_t(\cdot) = \mathcal{L}_{Z_n^{(t)}}(\cdot), \qquad \nabla^{(t)} = \nabla \mathcal{L}_t(h^{(t)}), \qquad \hat{\nabla}^{(t)} = \hat{\nabla} \mathcal{L}_t(h^{(t)}).$$

Some technical observations follow. First, we stress that Alg. 2 processes one single instance at a time, without the need to store previously encountered data points, neither across the tasks nor within them. Second, the implementation of Alg. 2 does not require computing the meta-objective $\mathcal{L}_{Z_n}$, which would increase the computational effort of the entire scheme. The rest of this section is devoted to the statistical analysis of Alg. 2.
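Since the pseudocode of Alg. 2 is not reproduced in this text, the following is a minimal sketch of a meta-algorithm of this kind, not the authors' implementation. It assumes the initialization $h^{(1)} = 0$, a constant meta step size `gamma`, the absolute loss, and an inner SGD with step size $1/(\lambda k)$ started at the current bias. The toy environment and all names (`meta_sgd`, `sample_dataset`) are hypothetical.

```python
import numpy as np

def last_iterate_inner_sgd(X, y, h, lam):
    """Last iterate w^(n+1) of SGD on the biased regularized risk (absolute loss)."""
    w = h.copy()                                       # assumed initialization at the bias
    for k in range(1, X.shape[0] + 1):
        x_k, y_k = X[k - 1], y[k - 1]
        u = np.sign(x_k @ w - y_k)                     # subgradient of |pred - y| in pred
        w = w - (u * x_k + lam * (w - h)) / (lam * k)  # assumed step size 1 / (lam * k)
    return w

def meta_sgd(sample_dataset, d, T, lam, gamma):
    """Online estimate of the bias: average of the meta-iterates h^(1), ..., h^(T)."""
    h = np.zeros(d)                                    # assumed initialization h^(1) = 0
    h_sum = np.zeros(d)
    for _ in range(T):
        h_sum += h
        X, y = sample_dataset()                        # one new task, seen only once
        w_last = last_iterate_inner_sgd(X, y, h, lam)
        approx_grad = -lam * (w_last - h)              # approximate meta-gradient, Eq. (21)
        h = h - gamma * approx_grad
    return h_sum / T

# Toy environment (hypothetical): task vectors are Gaussian around a common mean m.
rng = np.random.default_rng(0)
d, n = 5, 20
m = 4.0 * np.ones(d)

def sample_dataset():
    w_task = m + rng.normal(size=d)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)      # inputs on the unit sphere
    y = X @ w_task + 0.1 * rng.normal(size=n)
    return X, y

h_bar = meta_sgd(sample_dataset, d=d, T=200, lam=1.0, gamma=0.5)
```

The loop keeps only the current bias, the running sum of the meta-iterates and the inner iterate in memory, which reflects the low space complexity discussed above. The values of `lam` and `gamma` here are arbitrary and not the tuned choices of Thm. 6.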

### 5.3 Statistical Analysis of the Meta-Algorithm

In the following theorem we study the statistical performance of the bias returned by Alg. 2. More precisely, we bound the excess transfer risk of the inner SGD algorithm run with the bias term learned by the meta-algorithm.

###### Theorem 6 (Excess Transfer Risk Bound for the Bias ¯hT Estimated by Alg. 2).

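Let Asm. 1 and Asm. 2 hold and let $\bar{h}_T$ be the output of Alg. 2 with step size

$$\gamma = \frac{\sqrt{2}\, \|m\|}{LR} \sqrt{\Big( T \Big( 1 + \frac{4 (\log(n) + 1)}{n} \Big) \Big)^{-1}}. \tag{22}$$

Let $\bar{w}_{\bar{h}_T}$ be the output of Alg. 1 with bias $h = \bar{h}_T$ and regularization parameter

$$\lambda = \frac{2RL}{\mathrm{Var}_m} \sqrt{\frac{\log(n) + 1}{n}}. \tag{23}$$

Then, the following bound holds

$$\mathbb{E} \big[ \mathcal{E}_n(\bar{w}_{\bar{h}_T}) \big] - \mathcal{E}_\rho \le \mathrm{Var}_m\, 4RL \sqrt{\frac{\log(n) + 1}{n}} + \|m\|\, LR \sqrt{2 \Big( 1 + \frac{4 (\log(n) + 1)}{n} \Big) \frac{1}{T}},$$

where the expectation above is with respect to the sampling of the datasets $Z_n^{(1)}, \dots, Z_n^{(T)}$ from the environment $\rho$.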

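Proof. We consider the following decomposition

$$\mathbb{E} \big[ \mathcal{E}_n(\bar{w}_{\bar{h}_T}) \big] - \mathcal{E}_\rho \le A + B + C, \tag{24}$$

where we have defined the terms

$$A = \mathcal{E}_n(\bar{w}_{\bar{h}_T}) - \hat{\mathcal{E}}_n(\bar{h}_T), \qquad B = \mathbb{E}\, \hat{\mathcal{E}}_n(\bar{h}_T) - \hat{\mathcal{E}}_n(m), \qquad C = \hat{\mathcal{E}}_n(m) - \mathcal{E}_\rho. \tag{25}$$

Now, in order to bound the term A, noting that

$$A = \mathbb{E}_{\mu \sim \rho}\, \mathbb{E}_{Z_n \sim \mu^n} \big[ R_\mu(\bar{w}_{\bar{h}_T}(Z_n)) - R_{Z_n, \bar{h}_T}(w_{\bar{h}_T}(Z_n)) \big],$$

we use Prop. 1 with $h = \bar{h}_T$ and then take the average over $Z_n^{(1)}, \dots, Z_n^{(T)}$. As regards the term C, we apply the inequality given in Eq. (14) with $h = m$ and we again average with respect to $\mu \sim \rho$. Finally, the term B is the convergence rate of Alg. 2 and its study requires analyzing the error that we introduce in the meta-gradients by Prop. 5. The bound we use for this term is the one described in Prop. 22 (see Appendix G), applied with the vector $m$. The result now follows by combining the bounds on the three terms and optimizing over $\lambda$. ∎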

We remark that the bound in Thm. 6 is stated with respect to the mean $m$ of the tasks' vectors only for simplicity, and the same result holds for a generic bias vector $h \in \mathbb{R}^d$. Specializing this rate to ITL ($h = 0$) recovers the rate in Cor. 3 for ITL (up to a constant factor). Consequently, even when the tasks are not “close to each other” (i.e. their variance is high), our approach is not prone to negative transfer, since, in the worst case, it recovers the ITL performance. Moreover, the above bound is coherent with the state-of-the-art LTL bounds given in other papers studying other variants of Ivanov or Tikhonov regularized empirical risk minimization algorithms, see e.g. [18, 19, 20, 21]. Specifically, in our case, the bound has the form

$$\mathcal{O}\Big( \frac{\mathrm{Var}_m}{\sqrt{n}} \Big) + \mathcal{O}\Big( \frac{1}{\sqrt{T}} \Big), \tag{26}$$

where $\mathrm{Var}_m$ reflects the advantage in exploiting the relatedness among the tasks sampled from the environment $\rho$. More precisely, in section 4 we noticed that, if the variance of the weight vectors of the tasks sampled from our environment is significantly smaller than their second moment, running Alg. 1 with the ideal bias $h = m$ on a future task brings a significant improvement in comparison to the unbiased case. One natural question arising at this point is whether, under the same conditions on the environment, the same improvement is obtained by running Alg. 1 with the bias vector $\bar{h}_T$ returned by our online meta-algorithm in Alg. 2. Looking at the bound in Thm. 6, we can say that, when the number of training tasks $T$ used to estimate the bias is sufficiently large, the above question has a positive answer and our LTL approach is effective.
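To get a feel for when the LTL bound of Thm. 6 improves on the ITL bound of Cor. 3, both right-hand sides can be evaluated directly. The sketch below simply transcribes the two formulas; the function names and the numerical values in the example are hypothetical and do not come from the paper's experiments.

```python
import math

def itl_bound(var_0, R, L, n):
    """Right-hand side of the ITL bound in Cor. 3."""
    return var_0 * 2 * R * L * math.sqrt(2 * (math.log(n) + 1) / n)

def ltl_bound(var_m, m_norm, R, L, n, T):
    """Right-hand side of the bound in Thm. 6."""
    c = (math.log(n) + 1) / n
    within_task = var_m * 4 * R * L * math.sqrt(c)
    across_tasks = m_norm * L * R * math.sqrt(2 * (1 + 4 * c) / T)
    return within_task + across_tasks

# Hypothetical environment: small spread around a mean of large norm.
params = dict(R=1.0, L=1.0, n=10)
itl = itl_bound(var_0=16.0, **params)
for T in (1, 10, 100, 1000):
    ltl = ltl_bound(var_m=3.9, m_norm=21.9, T=T, **params)
    print(f"T = {T:5d}: LTL bound = {ltl:.2f}, ITL bound = {itl:.2f}")
```

With few training tasks the $1/\sqrt{T}$ term dominates and the LTL bound is worse than the ITL one; as $T$ grows, the LTL bound approaches its within-task term, which is smaller whenever $\mathrm{Var}_m$ is sufficiently small compared with $\mathrm{Var}_0$.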

In order to have also a more precise benchmark for the biased setting considered in this work, in Appendix H we have repeated the statistical study described in the paper also for the more expensive ERM algorithm described in Eq. (3). In this case, we assume to have an oracle providing us with this exact estimator, ignoring any computational costs. As before, we have performed the analysis both for a fixed bias and the one estimated from the data which is returned by running Alg. 2. We remark that, thanks to the assumption on the oracle, in this case, Alg. 2 is assumed to run with exact meta-gradients. Looking at the results reported in Appendix H, we immediately see that, up to constants and logarithmic factors, the LTL bounds we have stated in the paper for the low-complexity SGD family are equivalent to the ones we have reported in Appendix H for the more expensive ERM family.

All the above facts justify the informal statement given before Prop. 5: the trick of approximating the meta-gradient with the last iterate of the inner algorithm not only requires no additional effort, but is also accurate enough from the statistical viewpoint, matching a state-of-the-art bound for more expensive within-task algorithms based on ERM.

We conclude by observing that, exploiting the explicit form of the error on the meta-gradients, it is possible to extend the analysis presented in Thm. 6 above to the adversarial case, where no assumption on the data generation process is made. The result in our statistical setting can be derived from this more general adversarial setting by the application of two online-to-batch conversions, one within-task and one outer-task.

## 6 Experiments

In this section, we test the effectiveness of the LTL approach proposed in this paper on synthetic and real data. The code used for the following experiments is available at https://github.com/prolearner/onlineLTL. In all experiments, the regularization parameter $\lambda$ and the step size $\gamma$ were tuned by validation, see Appendix I for more details.

Synthetic Data. We considered two different settings, regression with the absolute loss and binary classification with the hinge loss. In both cases, we generated an environment of tasks in which SGD with the right bias is expected to bring a substantial benefit in comparison to the unbiased case. Motivated by our observations in section 4, we generated linear tasks with weight vectors characterized by a variance which is significantly smaller than their second moment. Specifically, for each task $\mu$, we created a weight vector $w_\mu$ from a Gaussian distribution whose mean $m$ has all components equal to the same value. Each task corresponds to a dataset of $n$ points. In the regression case, the inputs were uniformly sampled on the unit sphere and the labels were generated by adding to $\langle x, w_\mu \rangle$ a noise term sampled from a zero-mean Gaussian distribution, with standard deviation chosen to fix the signal-to-noise ratio for each task. In the classification case, the inputs were uniformly sampled on the unit sphere, excluding those points with margin smaller than a fixed threshold, and the binary labels were generated according to a logistic model. In Fig. 1 we report the performance of Alg. 1 with different choices of the bias: $h = \bar{h}_T$ (our LTL estimator resulting from Alg. 2), $h = 0$ (ITL) and $h = m$, a reasonable approximation of the oracle minimizing the transfer risk. The plots confirm our theoretical findings: estimating the bias with our LTL approach leads to a substantial benefit with respect to the unbiased case, as the number of observed training tasks increases.

Real Data. We ran experiments on the computer survey data from [17], in which 180 people (tasks) rated the likelihood of purchasing one of 20 different personal computers. The input represents 13 different computer characteristics (price, CPU, RAM, etc.) while the output is an integer rating. Similarly to the synthetic data experiments, we consider a regression setting with the absolute loss and a classification setting. In the latter case each task is to predict whether the rating is above a fixed threshold. We compare the LTL bias with ITL. The results are reported in Fig. 2. They are in line with the results obtained on synthetic experiments, indicating that the biased LTL framework proposed in this work is effective for this dataset. Moreover, the results for regression are also in line with what was observed in the multitask setting with variance regularization [22]. The classification setting has not been used before and has been created ad hoc for our purpose. In this case we observe an increased variance, probably due to the datasets being highly unbalanced. In order to investigate the impact of passing through the data only once in the different steps of our method, we conducted additional experiments. The results, presented in Appendix J, indicate that the single-pass strategy is competitive with the more expensive ERM.

## 7 Conclusion and Future Work

We have studied the performance of Stochastic Gradient Descent on the true risk regularized by the squared Euclidean distance to a bias vector, over a class of tasks. Drawing upon a learning-to-learn framework, we have shown that, when the variance of the tasks is relatively small, the introduction of an appropriate bias vector could be beneficial in comparison to the standard unbiased version, corresponding to learning the tasks independently. Then, we have proposed an efficient online meta-learning algorithm to estimate this bias and we have theoretically shown that the bias returned by our method can bring a comparable benefit. In the future, it would be interesting to investigate other kinds of relatedness among the tasks and to extend our analysis to other classes of loss functions, as well as to a Hilbert space setting. Finally, another valuable research direction is to derive fully dependent bounds, in which the hyperparameters are self-tuned during the learning process, see e.g. [31].

## References

• [1] P. Alquier, T. T. Mai, and M. Pontil. Regret bounds for lifelong learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 261–269, 2017.
• [2] M.-F. Balcan, A. Blum, and S. Vempala. Efficient representations for lifelong learning and autoencoding. In Conference on Learning Theory, pages 191–210, 2015.
• [3] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces, volume 408. Springer, 2011.
• [4] J. Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12:149–198, 2000.
• [5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
• [6] J. Borwein and Q. Zhu. Techniques of Variational Analysis. Springer, 2005.
• [7] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
• [8] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
• [9] G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Incremental learning-to-learn with statistical guarantees. In Proc. 34th Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
• [10] G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Learning to learn around a common mean. In Advances in Neural Information Processing Systems, pages 10190–10200, 2018.
• [11] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135. PMLR, 2017.
• [12] R. Gupta and T. Roughgarden. A PAC approach to application-specific algorithm selection. SIAM Journal on Computing, 46(3):992–1017, 2017.
• [13] E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2016.
• [14] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms: Advanced Theory and Bundle Methods. Springer, 2010.
• [15] I. Kuzborskij and F. Orabona. Fast rates by transferring from auxiliary hypotheses. Machine Learning, 106(2):171–195, 2017.
• [16] S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.
• [17] P. J. Lenk, W. S. DeSarbo, P. E. Green, and M. R. Young. Hierarchical Bayes conjoint analysis: Recovery of partworth heterogeneity from reduced experimental designs. Marketing Science, 15(2):173–191, 1996.
• [18] A. Maurer. Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6:967–994, 2005.
• [19] A. Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350, 2009.
• [20] A. Maurer, M. Pontil, and B. Romera-Paredes. Sparse coding for multitask and transfer learning. In International Conference on Machine Learning, 2013.
• [21] A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884, 2016.
• [22] A. M. McDonald, M. Pontil, and D. Stamos. New perspectives on k-support and cluster norms. Journal of Machine Learning Research, 17(155):1–38, 2016.
• [23] A. Pentina and C. Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pages 991–999, 2014.
• [24] A. Pentina and R. Urner. Lifelong learning with weighted majority votes. In Advances in Neural Information Processing Systems, pages 3612–3620, 2016.
• [25] A. Rakhlin, O. Shamir, K. Sridharan, et al. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, volume 12, pages 1571–1578. Citeseer, 2012.
• [26] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In 5th International Conference on Learning Representations, 2017.
• [27] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
• [28] S. Shalev-Shwartz and S. M. Kakade. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems, pages 1457–1464, 2009.
• [29] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
• [30] S. Thrun and L. Pratt. Learning to Learn. Springer, 1998.
• [31] Z. Zhuang, A. Cutkosky, and F. Orabona. Surrogate losses for online learning of stepsizes in stochastic non-convex optimization. arXiv preprint arXiv:1901.09068, 2019.

## Appendix

The appendix is organized as follows. In Appendix A we report some basic facts regarding the $\epsilon$-subdifferential of a function which are used in the subsequent analysis. In Appendix B we give the primal-dual formulation of the biased regularized empirical risk minimization problem for each single task and, in Appendix C, we recall some well-known properties of our meta-objective function. In Appendix D, we show how an $\epsilon$-minimizer of the dual problem can be exploited in order to build an $\epsilon$-subgradient of our meta-objective function. As described in Appendix E, interpreting our within-task algorithm as a coordinate descent algorithm on the dual problem, we can adapt this result to our setting and prove, in this way, Prop. 5. In Appendix F, we report the proof of Prop. 1 and, in Appendix G, we give the convergence rate of Alg. 2 which is used in the paper, during the proof of Thm. 6. In Appendix H, we repeat the statistical study described in the paper also for the family of ERM algorithms introduced in Eq. (3) and, in Appendix I, we describe how to perform the validation procedure in our LTL setting. Finally, in Appendix J we report additional experiments comparing our method to ERM variants.

## Appendix A Basic Facts on ϵ-Subgradients

In this section, we report some basic concepts about the $\epsilon$-subdifferential which are then used in the subsequent analysis. This material is based on [14, Chap. XI]. Throughout this section we consider a convex, closed and proper function $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ with domain $\mathrm{Dom} f$ and we always let $\epsilon \ge 0$.

###### Definition 7 (ϵ-Subgradient, [14, Chap. XI, Def. 1.1.1]).

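Given $\hat h \in \mathrm{Dom} f$, the vector $u \in \mathbb{R}^d$ is called an $\epsilon$-subgradient of $f$ at $\hat h$ when the following property holds for any $h \in \mathbb{R}^d$:

$$f(h) \ge f(\hat h) + \langle u, h - \hat h \rangle - \epsilon. \tag{27}$$

The set of all $\epsilon$-subgradients of $f$ at $\hat h$ is the $\epsilon$-subdifferential of $f$ at $\hat h$, denoted by $\partial_\epsilon f(\hat h)$.

The standard subdifferential is retrieved with $\epsilon = 0$. The following lemma, which is a direct consequence of Def. 7, points out the link between $\partial_\epsilon f$ and an $\epsilon$-minimizer of $f$.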

###### Lemma 8 (See [14, Chap. XI, Thm. 1.1.5]).

The following two properties are equivalent.

$$0 \in \partial_\epsilon f(\hat h) \iff f(\hat h) \le f(h) + \epsilon \quad \text{for any } h \in \mathbb{R}^d. \tag{28}$$
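As an illustration of Def. 7 and Lemma 8, the following minimal sketch checks the defining inequality by brute force on a finite grid. The choice $f(h) = |h|$ in one dimension, the grid, the tolerance, and the helper name `is_eps_subgradient` are assumptions made for this example only; a grid check is a sanity test, not a proof.

```python
def is_eps_subgradient(f, u, h_hat, eps, grid, tol=1e-12):
    """Check f(h) >= f(h_hat) + u*(h - h_hat) - eps for every h in a finite grid."""
    base = f(h_hat)
    return all(f(h) >= base + u * (h - h_hat) - eps - tol for h in grid)

# Finite grid standing in for "any h" (an approximation of the definition).
grid = [i / 100.0 for i in range(-1000, 1001)]

# For f(h) = |h|, h_hat = 1 and eps = 0.5, the eps-subdifferential is [0.5, 1].
inside = [is_eps_subgradient(abs, u, 1.0, 0.5, grid) for u in (0.5, 0.75, 1.0)]
outside = [is_eps_subgradient(abs, u, 1.0, 0.5, grid) for u in (0.4, 1.1)]

# Lemma 8: 0 is an eps-subgradient at h_hat exactly when h_hat is an eps-minimizer.
zero_at_eps_minimizer = is_eps_subgradient(abs, 0.0, 0.3, 0.5, grid)  # |0.3| <= 0 + 0.5
zero_at_non_minimizer = is_eps_subgradient(abs, 0.0, 1.0, 0.5, grid)  # |1.0| >  0 + 0.5
```

With $\epsilon = 0$ the same check reduces to the usual subgradient inequality, so for this example only $u = 1$ would pass at $\hat h = 1$.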

The subsequent lemma describes the behavior of the $\epsilon$-subdifferential with respect to duality.

###### Lemma 9 (See [14, Chap. XI, Prop. 1.2.1]).

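Let $f^*$ be the Fenchel conjugate of $f$, namely, $f^*(u) = \sup_{h \in \mathbb{R}^d} \langle u, h \rangle - f(h)$. Then, given $\hat h \in \mathrm{Dom} f$, the vector $u \in \mathbb{R}^d$ is an $\epsilon$-subgradient of $f$ at $\hat h$ iff

$$f^*(u) + f(\hat h) - \langle u, \hat h \rangle \le \epsilon. \tag{29}$$

As a result,

$$u \in \partial_\epsilon f(\hat h) \iff \hat h \in \partial_\epsilon f^*(u). \tag{30}$$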
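As a worked instance of the criterion in Eq. (29), consider the squared norm $f(h) = \tfrac{1}{2}\|h\|^2$, which is its own Fenchel conjugate. This choice of $f$ is an assumption made purely for illustration. The left-hand side of Eq. (29) becomes a squared distance, so the $\epsilon$-subdifferential is a ball centered at $\hat h$:

```latex
\begin{align*}
f(h) &= \tfrac{1}{2}\|h\|^2, \qquad f^*(u) = \tfrac{1}{2}\|u\|^2, \\
f^*(u) + f(\hat h) - \langle u, \hat h \rangle
  &= \tfrac{1}{2}\|u\|^2 + \tfrac{1}{2}\|\hat h\|^2 - \langle u, \hat h \rangle
   = \tfrac{1}{2}\|u - \hat h\|^2, \\
\partial_\epsilon f(\hat h) &= \bigl\{ u \in \mathbb{R}^d : \|u - \hat h\| \le \sqrt{2\epsilon} \bigr\}.
\end{align*}
```

The condition is symmetric in $u$ and $\hat h$, which is exactly the equivalence in Eq. (30) for this self-conjugate function, and for $\epsilon = 0$ the ball collapses to the gradient $\nabla f(\hat h) = \hat h$.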

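We now describe some properties of the $\epsilon$-subdifferential which are used in the following analysis.

###### Lemma 10 (See [14, Chap. XI, Thm. 3.1.1]).

Let $f_1$ and $f_2$ be two convex, closed and proper functions. Then, given $\hat h \in \mathrm{Dom} f_1 \cap \mathrm{Dom} f_2$, we have that

$$\bigcup_{0 \le \epsilon_1 + \epsilon_2 \le \epsilon} \partial_{\epsilon_1} f_1(\hat h) + \partial_{\epsilon_2} f_2(\hat h) \subset \partial_\epsilon (f_1 + f_2)(\hat h). \tag{31}$$

Moreover, denoting by $\mathrm{ri}(S)$ the relative interior of a set $S$, when $\mathrm{ri}(\mathrm{Dom} f_1) \cap \mathrm{ri}(\mathrm{Dom} f_2) \neq \emptyset$, equality holds.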

###### Lemma 11 (See [14, Chap. XI, Prop. 1.3.1]).

Let $a$ be a scalar. Then, for a given $\hat h$, we have that

$$\partial_\epsilon (f \circ a)(\hat h) = a \, \partial_\epsilon f(a \hat h). \tag{32}$$

###### Lemma 12.

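Let $X$ be a matrix. Then, for a given $\hat h$ such that $X \hat h \in \mathrm{Dom} f$, we have that

$$X^\top \partial_\epsilon f(X \hat h) \subset \partial_\epsilon (f \circ X)(\hat h). \tag{33}$$

Proof. Let $u \in X^\top \partial_\epsilon f(X \hat h)$. Then, by definition, there exists $v \in \partial_\epsilon f(X \hat h)$ such that $u = X^\top v$. Consequently, for any $h$, we can write

$$\langle u, h - \hat h \rangle = \langle X^\top v, h - \hat h \rangle = \langle v, Xh - X\hat h \rangle \le f(Xh) - f(X\hat h) + \epsilon = (f \circ X)(h) - (f \circ X)(\hat h) + \epsilon, \tag{34}$$

where, in the inequality, we have used the fact that $v \in \partial_\epsilon f(X \hat h)$. This gives the desired statement. ∎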

The next two results characterize the $\epsilon$-subdifferential of two functions, which are useful in our subsequent analysis. In the following we denote by $\mathbb{S}^d_+$ the set of the $d \times d$ symmetric positive semi-definite matrices.

###### Example 1 (Quadratic Functions, [14, Chap. XI, Ex. 1.2.2]).

For a given matrix $Q \in \mathbb{S}^d_+$ and a given vector $b \in \mathbb{R}^d$, consider the function

$$f : h \in \mathbb{R}^d \mapsto \frac{1}{2} \langle Qh, h \rangle + \langle b, h \rangle. \tag{35}$$

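Then, given $\hat h \in \mathbb{R}^d$, we can express the $\epsilon$-subdifferential of $f$ at $\hat h$ with respect to the gradient $\nabla f(\hat h) = Q \hat h + b$ as follows

$$\partial_\epsilon f(\hat h) = \Bigl\{ \nabla f(\hat h) + Qs : \tfrac{1}{2} \langle Qs, s \rangle \le \epsilon \Bigr\}. \tag{36}$$

###### Example 2 (Moreau Envelope [14, Chap. XI, Ex. 3.4.4]).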

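For $\lambda > 0$ and a fixed vector $h \in \mathbb{R}^d$, consider the Moreau envelope of $f$ at the point $h$ with parameter $\lambda$, given by

$$L(h) = \min_{w \in \mathbb{R}^d} f(w) + \frac{\lambda}{2} \|w - h\|^2. \tag{37}$$

Denote by $w_h$ the unique minimizer of the above function, namely, the vector characterized by the optimality conditions

$$0 \in \partial f(w_h) + \lambda (w_h - h). \tag{38}$$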

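Then, for any $h \in \mathbb{R}^d$ and $\epsilon \ge 0$, we have that

$$\partial_\epsilon L(h) = \bigcup_{0 \le \alpha \le \epsilon} \partial_{\epsilon - \alpha} f(w_h) \cap B\bigl(-\lambda (w_h - h), \sqrt{2 \lambda \alpha}\bigr), \tag{39}$$

where, for any center $c \in \mathbb{R}^d$ and any radius $r \ge 0$, we recall the notation

$$B(c, r) = \bigl\{ u \in \mathbb{R}^d : \|u - c\| \le r \bigr\}. \tag{40}$$

For $\epsilon = 0$ we retrieve the well-known result according to which $L$ is differentiable, with $\lambda$-Lipschitz gradient given by

$$\nabla L(h) = -\lambda (w_h - h). \tag{41}$$

Finally, from Eq. (39), we can deduce that, if $u \in \partial_\epsilon L(h)$, then

$$\bigl\| \nabla L(h) - u \bigr\| \le \sqrt{2 \lambda \epsilon}. \tag{42}$$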
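The gradient formula in Eq. (41) and the bound in Eq. (42) can be sanity-checked numerically. The sketch below assumes the one-dimensional function $f(w) = |w|$, for which the minimizer $w_h$ is given by soft-thresholding; the function names, the grids, and the values of $\lambda$, $h$ and $\epsilon$ are all choices made for this illustration. The $\epsilon$-subgradients of $L$ are found by brute force on a finite grid, so this is a check rather than a proof.

```python
import math

def prox_abs(h, lam):
    """Minimizer w_h of |w| + (lam/2)*(w - h)**2, i.e. soft-thresholding at 1/lam."""
    return math.copysign(max(abs(h) - 1.0 / lam, 0.0), h)

def moreau_abs(h, lam):
    """Moreau envelope L(h) = min_w |w| + (lam/2)*(w - h)**2."""
    w = prox_abs(h, lam)
    return abs(w) + 0.5 * lam * (w - h) ** 2

def grad_moreau_abs(h, lam):
    """Gradient of the envelope: -lam * (w_h - h)."""
    return -lam * (prox_abs(h, lam) - h)

def is_eps_subgradient(u, h, lam, eps, grid, tol=1e-12):
    """Brute-force test of the eps-subgradient inequality for L on a finite grid."""
    base = moreau_abs(h, lam)
    return all(moreau_abs(x, lam) >= base + u * (x - h) - eps - tol for x in grid)

lam, h, eps = 0.5, 3.0, 0.1
grid = [i / 20.0 for i in range(-400, 401)]        # points x in [-20, 20]
candidates = [i / 100.0 for i in range(0, 201)]    # candidate slopes u in [0, 2]

grad = grad_moreau_abs(h, lam)
eps_subgrads = [u for u in candidates if is_eps_subgradient(u, h, lam, eps, grid)]
max_dev = max(abs(grad - u) for u in eps_subgrads)
bound = math.sqrt(2 * lam * eps)

# Central finite difference of L as an independent check of the gradient formula.
delta = 1e-6
fd_grad = (moreau_abs(h + delta, lam) - moreau_abs(h - delta, lam)) / (2 * delta)
```

In this example every $\epsilon$-subgradient found on the grid stays within $\sqrt{2\lambda\epsilon}$ of the gradient, and the finite-difference estimate agrees with $-\lambda(w_h - h)$.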

## Appendix B Primal-Dual Formulation of the Within-Task Problem

In this section, we give the primal-dual formulation of the biased regularized empirical risk minimization problem outlined in Eq. (3) for each single task. Specifically, rewriting, for any $w \in \mathbb{R}^d$ and $u \in \mathbb{R}^n$, the empirical risk as

$$R_{Z_n}(w) = (g \circ X_n)(w), \qquad g(u) = \frac{1}{n} \sum_{k=1}^n \ell_k(u_k), \tag{43}$$

for any $h \in \mathbb{R}^d$, we can express our meta-objective function in Eq. (15) as

$$L_{Z_n}(h) = \min_{w \in \mathbb{R}^d} (g \circ X_n)(w) + \frac{\lambda}{2} \|w - h\|^2. \tag{44}$$

We remark that, in the optimization literature, this function is known as the Moreau envelope of the empirical error at the point $h$, see also Example 2. In this section, in order to simplify the presentation, we omit the dependence on the dataset in the notation. The unique minimizer of the above function

$$w_h = \operatorname*{argmin}_{w \in \mathbb{R}^d} (g \circ X_n)(w) + \frac{\lambda}{2} \|w - h\|^2 \tag{45}$$

is known as the proximity operator of the empirical error at the point $h$ and it coincides with the ERM algorithm introduced in Eq. (3) in the paper. We interpret the vector $w_h$ in Eq. (3)–(45) as the solution of the primal problem

$$w_h = \operatorname*{argmin}_{w \in \mathbb{R}^d} \Phi_h(w), \qquad \Phi_h(w) = (g \circ X_n)(w) + \frac{\lambda}{2} \|w - h\|^2. \tag{46}$$

The next proposition is a standard result stating that, in this setting, strong duality holds and the optimality conditions, also known as Karush–Kuhn–Tucker (KKT) conditions, provide a unique way to determine the primal variables from the dual ones.

###### Proposition 13 (Strong Duality, [6, Thm. 4.4.2], [3, Prop. 15.18]).

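Consider the primal problem in Eq. (46). Then, its dual problem admits a solution

$$u_h \in \operatorname*{argmin}_{u \in \mathbb{R}^n} \Psi_h(u), \qquad \Psi_h(u) = g^*(u) + \frac{1}{2\lambda} \bigl\| X_n^\top u \bigr\|^2 - \langle X_n h, u \rangle, \tag{47}$$

where, thanks to the separability of $g$, for any $u \in \mathbb{R}^n$, we have that

$$g^*(u) = \frac{1}{n} \sum_{k=1}^n \ell_k^*(n u_k). \tag{48}$$

Moreover, strong duality holds, namely,

$$L(h) = \Phi_h(w_h) = \min_{w \in \mathbb{R}^d} \Phi_h(w) = -\min_{u \in \mathbb{R}^n} \Psi_h(u) = -\Psi_h(u_h) \tag{49}$$

and the optimality (KKT) conditions read as follows

$$\begin{aligned} w_h = -\frac{1}{\lambda} X_n^\top u_h + h &\iff \lambda (w_h - h) = -X_n^\top u_h, \\ u_h \in \partial g(X_n w_h) &\iff X_n w_h \in \partial g^*(u_h). \end{aligned} \tag{50}$$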
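To make the primal-dual relations above concrete, here is a minimal sketch that verifies the KKT conditions and strong duality numerically. It assumes, for illustration only, the squared loss $\ell_k(y) = \tfrac{1}{2}(y - y_k)^2$ and dimension $d = 1$, so that $X_n$ is a vector of scalar inputs and the primal minimizer has a closed form. Under this assumption $\ell_k^*(v) = v y_k + \tfrac{1}{2} v^2$, hence $g^*(u) = \sum_k \bigl(u_k y_k + \tfrac{n}{2} u_k^2\bigr)$. The function name and the data are hypothetical.

```python
def primal_dual_check(x, y, h, lam):
    """Return (w_h, w recovered from the dual, primal value, dual value) for d = 1."""
    n = len(x)
    # Primal minimizer of (1/n) * sum 0.5*(x_k*w - y_k)^2 + (lam/2)*(w - h)^2.
    w = (sum(xk * yk for xk, yk in zip(x, y)) / n + lam * h) / (
        sum(xk * xk for xk in x) / n + lam
    )
    primal = (
        sum(0.5 * (xk * w - yk) ** 2 for xk, yk in zip(x, y)) / n
        + 0.5 * lam * (w - h) ** 2
    )
    # Dual solution from the KKT condition u_h = grad g(X_n w_h) (g is smooth here).
    u = [(xk * w - yk) / n for xk, yk in zip(x, y)]
    xtu = sum(xk * uk for xk, uk in zip(x, u))  # X_n^T u_h, a scalar when d = 1
    # Conjugate of g for the squared loss (see the lead-in paragraph).
    g_conj = sum(uk * yk + 0.5 * n * uk ** 2 for uk, yk in zip(u, y))
    dual = g_conj + xtu ** 2 / (2 * lam) - h * xtu
    # Primal variable recovered from the dual one: w_h = -(1/lam) X_n^T u_h + h.
    w_from_dual = -xtu / lam + h
    return w, w_from_dual, primal, dual

w, w_from_dual, primal, dual = primal_dual_check(
    x=[1.0, 2.0, -1.0], y=[1.0, 3.0, 0.5], h=0.5, lam=0.1
)
duality_gap = primal + dual  # strong duality: Phi_h(w_h) = -Psi_h(u_h)
```

For non-smooth losses such as the absolute or hinge loss the dual solution is no longer given by a gradient, so this shortcut does not apply directly; the sketch only illustrates how the primal and dual quantities fit together.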

## Appendix C Properties of the Meta-Objective

In this section we recall some properties of the meta-objective function already outlined in the text in Prop. 4.

See Prop. 4.

The first part of the statement is a well-known fact, see [3, Prop. ] and also Example 2. In order to prove the second part of the statement, we exploit Asm. 1 and Asm. 2 and we proceed as follows. According to the change of variables , exploiting the fact that, for any two convex functions and