# On the Statistical Benefits of Curriculum Learning

Curriculum learning (CL) is a commonly used machine learning training strategy. However, we still lack a clear theoretical understanding of its benefits. In this paper, we study the benefits of CL in the multitask linear regression problem under both structured and unstructured settings. For both settings, we derive the minimax rates for CL with an oracle that provides the optimal curriculum and without the oracle, where the agent has to adaptively learn a good curriculum. Our results reveal that adaptive learning can be fundamentally harder than oracle learning in the unstructured setting, whereas it merely introduces a small extra term in the structured setting. To connect theory with practice, we provide justification for a popular empirical method that selects the tasks with the highest local prediction gain by comparing its guarantees with the minimax rates mentioned above.


## 1 Introduction

It has long been realized that we can design more efficient learning algorithms if we can make them learn on multiple tasks. Transfer learning, multitask learning, and meta-learning are just a few of the sub-areas of machine learning where this idea has been pursued vigorously. Often the goal is to minimize the weighted average loss over a set of tasks that are expected to be similar. While previous literature often assumes a predetermined (and often equal) number of observations for all the tasks, in many applications we are allowed to decide the order in which the tasks are presented and the number of observations from each task. Any strategy that tries to improve performance through better task scheduling is usually referred to as curriculum learning (CL) (Bengio et al., 2009). The agent that schedules tasks at each step is often referred to as the task scheduler.

Though curriculum learning has been extensively used in modern machine learning (Gong et al., 2016; Sachan and Xing, 2016; Tang et al., 2018; Narvekar et al., 2020), there is very little theoretical understanding of the actual benefits of CL. We also do not know whether the heuristic methods used in many empirical studies can be theoretically justified. Even the problem itself has not been rigorously formulated. To address these challenges, we first formulate the curriculum learning problem in the context of linear regression. We analyze the minimax optimal rate of CL in two settings: an unstructured setting where the parameters of different tasks are arbitrary, and a structured setting where they share a low-rank structure. Finally, we discuss the theoretical justification of a popular heuristic task scheduler that greedily selects the task with the highest local prediction gain.

## 2 Background

We review previous work on three crucial aspects of CL: the two types of benefits one may expect from CL, task similarity assumptions, and the task schedulers used in empirical studies.

#### Two types of benefits.

There are two distinct ways to understand the benefits of CL. From the perspective of optimization, some papers argue that the benefit of a curriculum can be interpreted as learning from more convex and smoother objective functions, whose minimizers serve as better initialization points for the non-convex target objective function (Bengio et al., 2009). The order of task scheduling is essential here. As an example, Figure 1 shows the objective functions of a problem with four source tasks and one target task of increasing difficulty (non-convexity). Directly minimizing the target objective (the purple line) using gradient descent can be hard due to the non-convexity. However, simple gradient descent can converge to the global optimum of the current task if it starts from the global optimum of the previous task. We refer to the benefit of faster convergence in optimization as the optimization benefit. Optimization benefits depend heavily on the order of scheduling. Generally speaking, if one directly considers the empirical risk minimizer (ERM), which requires global minimization of the empirical risk, there may not be any optimization benefit.

The second type of benefit comes from carefully choosing the number of observations from each task, independent of their order; we call this the statistical benefit. For example, suppose we have two linear regression tasks that are identical except for the standard deviation of the Gaussian noise on the response variables. If we consider the OLS estimator on the joint dataset of the two tasks, there is a reduction in estimation error when more samples are allocated to the task with the lower noise level, and the benefit is independent of the order by the nature of OLS. A statistical benefit can be seen as any benefit one can obtain other than a reduction in the difficulty of optimization. Weinshall and Amir (2020) focused on a special curriculum learning setting where each sample is considered a task, and analyzed the convergence rate on samples with different noise levels. The benefits they analyzed concern the convergence rate and thus count as statistical benefits.
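To illustrate the two-task example above, here is a minimal simulation (a sketch; the dimensions, noise levels, and sample splits are illustrative assumptions, not values from the paper) of pooled OLS over two tasks that are identical except for their noise levels:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 2000
theta_true = rng.normal(size=d)

def sample_task(n, sigma):
    # Both tasks share the same true parameter; only the noise level differs.
    X = rng.normal(size=(n, d))
    y = X @ theta_true + sigma * rng.normal(size=n)
    return X, y

def pooled_ols_error(n_low_noise, n_high_noise):
    # Fit OLS on the joint dataset of the two tasks and measure the
    # squared estimation error of the recovered parameter.
    X1, y1 = sample_task(n_low_noise, sigma=0.1)
    X2, y2 = sample_task(n_high_noise, sigma=2.0)
    X, y = np.vstack([X1, X2]), np.concatenate([y1, y2])
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(np.sum((theta_hat - theta_true) ** 2))

# Allocating most samples to the low-noise task yields a better estimate,
# and the order in which the samples arrive is irrelevant to OLS.
err_low = np.mean([pooled_ols_error(1800, 200) for _ in range(20)])
err_high = np.mean([pooled_ols_error(200, 1800) for _ in range(20)])
```

Averaged over repetitions, `err_low` is markedly smaller than `err_high`, which is exactly the order-independent statistical benefit described above.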

In general, the two types of benefits can coexist. A good curriculum should account for both the non-convexity and the noise levels. However, due to the significantly different mechanisms underlying the two types of benefits, it is natural to study them separately. This paper focuses on the analysis of statistical benefits. Thus, we analyze algorithms that map datasets to an estimator for each task, which may involve finding global minima of the empirical errors of non-convex functions.

#### Similarity assumptions.

We discussed the problem with two almost identical tasks above, where we can achieve perfect transfer and the trivial curriculum that allocates all the samples to the simpler task is optimal. However, tasks are generally not identical. Understanding how much benefit the target task can gain from learning source tasks has been a central problem in the transfer learning and multitask learning literature. The key is to propose meaningful similarity assumptions.

Let $\mathcal{X}$ and $\mathcal{Y}$ be the input and output spaces. Assume we have $T$ tasks with data distributions over $\mathcal{X} \times \mathcal{Y}$. Let $(X_t, Y_t)$ be a sample from task $t$. Let $f^*_t$, a mapping from $\mathcal{X}$ to $\mathcal{Y}$, be the mean function. In this paper, we adopt a simple parametric model on the mean function, with $f^*_t$ represented by a parameter $\theta^*_t$.

We consider two scenarios: structured and unstructured. In Section 3, we adopt simple linear regression models and do not assume any further internal structure on the true parameters. Two tasks $s$ and $t$ are similar if $\|\theta^*_s - \theta^*_t\|$ is small. A learned parameter is directly transferred to the target task. This setting has been applied in many previous studies (Yao et al., 2018; Bengio et al., 2009; Xu et al., 2021). In Section 4, we study the multitask representation learning setting (Maurer et al., 2016; Tripuraneni et al., 2020; Xu and Tewari, 2021), where a stronger internal structure is assumed. To be specific, we write $\theta^*_t = B^*\beta^*_t$, where $B^* \in \mathbb{R}^{d \times k}$ is the linear representation mapping and $\beta^*_t \in \mathbb{R}^k$ is the task-specific parameter. Generally, the input dimension $d$ is much larger than the representation dimension $k$ ($d \gg k$).

These two settings, while representative, do not exhaust all of the settings in the literature. We refer the reader to Teshima et al. (2020) for a brief summary of theoretical assumptions on the task similarity.

#### Task schedulers in empirical studies.

Many empirical methods have been developed to automatically schedule tasks. Liu et al. (2020) proposed adaptive task sampling for meta-learning. Cioba et al. (2021) discussed several meta-learning scenarios where the optimal data allocations differ, which interestingly aligns with our theoretical results. For more general use, one major family of task schedulers is based on the intuition that the scheduler should select the task that leads to the highest local gain on the target loss (Graves et al., 2017). Since the accurate prediction gain is not accessible, online decision-making algorithms (bandits and reinforcement learning) are frequently used to adaptively allocate samples (Narvekar et al., 2020). However, there is no theoretical guarantee that such greedy algorithms lead to the optimal curriculum.

#### Notation.

For any positive integer $n$, we let $[n] = \{1, \dots, n\}$. We use the standard $O$ and $\Omega$ notation to hide universal constant factors. We also use $\lesssim$ and $\gtrsim$ to indicate $\leq$ and $\geq$ up to universal constant factors.

## 3 Unstructured Linear Regression

In this section, we study the problem of learning from $T$ tasks to generate an estimate for a single target task.

### 3.1 Formulations

We consider $T$ linear regression tasks. Let $\theta^*_t \in \mathbb{R}^d$ denote the true parameter of task $t \in [T]$. The response of task $t$ is generated in the following manner:

$$Y_t = X_t^\top \theta^*_t + \epsilon_t,$$

where $\epsilon_t$ is assumed to be Gaussian noise with variance $\sigma_t^2$ and $X_t$ has a covariance matrix $\Sigma_t$ that is positive definite. Any task, therefore, can be fully represented by a triple $(\theta^*_t, \Sigma_t, \sigma_t)$.

Throughout the paper, we are more interested in the unknown parameters than in the covariate distributions or the noise levels. We simply denote by $\theta \in \mathbb{R}^{d \times T}$ the parameters of a problem ($T$ tasks) and let $\theta_t$ be the $t$-th column of the matrix.

We make a uniform assumption on the covariance matrix of input variables. The same assumption is also used by Du et al. (2020).

###### Assumption 1 (Coverage of covariate distribution).

We assume that $C_1 I \preceq \Sigma_t \preceq C_2 I$ for some constants $C_2 \geq C_1 > 0$ and any $t \in [T]$.

#### Goal.

Let $S_t$ be random samples from task $t$. Let $l$ be a loss function and let $L_t(\theta)$ be the expected loss of a given hypothesis $\theta$ evaluated on task $t$. Moreover, we denote the excess risk by

$$G_t(\theta) = L_t(\theta) - \inf_{\theta'} L_t(\theta').$$

Our goal in this section is to minimize the expected loss on the last task $T$, which we call the target task. Throughout the paper, we use the square loss function.

#### Transfer distance.

Algorithms tend to perform better when the tasks are similar to each other, so that observations collected from non-target tasks bear less transfer bias. We define the transfer distance between tasks $s$ and $t$ as $\Delta_{s,t} = \|\theta^*_s - \theta^*_t\|$.

It is not fair to compare performance between problems with different transfer distances. To study a minimax rate, we are interested in the worst performance over a set of problems with similar transfer distances. Let $Q = (Q_1, \dots, Q_T)$ be the distance vector encoding upper bounds on the distance between the target task and every task. We define the hypothesis set with known transfer distance as $\Theta(Q) = \{\theta : \Delta_{t,T} \leq Q_t \text{ for all } t \in [T]\}$. The hypothesis set with unknown transfer distance can be defined as $\tilde{\Theta}(Q) = \{\theta : \Delta_{t,T} \leq Q_{\pi(t)} \text{ for all } t \in [T]\}$, where $\pi$ is any permutation of $[T]$. We say this hypothesis set has unknown transfer distance because even if there exists some small $Q_t$ such that some transfer distance is low, the agent does not know which task has the low transfer distance.

#### Curriculum learning and task scheduler.

This paper concerns only the statistical benefits. Since the order of selecting tasks does not affect the outcome of the algorithm, we denote a curriculum by $c = (c_1, \dots, c_T)$, where each $c_t$ is the total number of observations from task $t$ and $\sum_t c_t = N$. Note that $c$ can consist of random variables depending on the task scheduler. The set of all curricula with a total number of $N$ observations is denoted by $\mathcal{C}_N$.

Any curriculum learning procedure involves a multitask learning algorithm, which is defined as a mapping from a set of datasets to a hypothesis for the target task.

A task scheduler runs the following procedure. At the start of step $i$, we have some observations from each task. The task scheduler at step $i$ is defined as a mapping from the past observations to a task index. Then a new observation from the selected task is sampled.
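The procedure above can be sketched as a simple interaction loop. The interface below is a hypothetical illustration of the abstract definition, not an algorithm from the paper:

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a task scheduler maps past observations to a task index.
Scheduler = Callable[[Dict[int, List[tuple]]], int]

def run_curriculum(scheduler: Scheduler, sample: Callable[[int], tuple],
                   T: int, N: int) -> Dict[int, List[tuple]]:
    # At the start of each step, the scheduler inspects all observations
    # collected so far, selects a task, and one fresh observation from
    # that task is sampled and stored.
    data: Dict[int, List[tuple]] = {t: [] for t in range(T)}
    for _ in range(N):
        t = scheduler(data)
        data[t].append(sample(t))
    return data

# A trivial round-robin scheduler: always pick the least-sampled task.
data = run_curriculum(
    scheduler=lambda d: min(d, key=lambda t: len(d[t])),
    sample=lambda t: (t, 0.0),
    T=4, N=12)
```

The resulting sample counts per task form exactly the curriculum $c = (c_1, \dots, c_T)$ defined above, and an adaptive scheduler makes $c$ a random variable.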

One of the goals of this work is to understand the minimax rate of the excess risk on the target task over all possible combinations of multitask learning algorithms and task schedulers. We first attempt to understand a limit on that rate by considering an oracle scenario that provides the optimal curriculum for any problem.

Rigorously, we denote the loss of a fixed curriculum $c$ with respect to a fixed algorithm $A$ and problem $\theta$ by

$$R_N^T(c, A \mid \theta) = \mathbb{E}_{S_1^{c_1}, \dots, S_T^{c_T}}\, G_T\big(A(S_1^{c_1}, \dots, S_T^{c_T})\big).$$

We define the following oracle rate, which takes the infimum over all possible fixed curriculum designs for each problem $\theta$ in a hypothesis set $\Theta$:

$$R_N^T(\Theta) \coloneqq \inf_{A} \sup_{\theta \in \Theta} \inf_{c \in \mathcal{C}_N} R_N^T(c, A \mid \theta). \tag{1}$$

In general, the above oracle rate considers an ideal case, because the optimal curriculum depends on the unknown problem and any learning algorithm has to adaptively learn the problem to decide the optimal curriculum.

We ask the following question: can an adaptively learned curriculum perform as well as the optimal one in Equ. (1)? To answer the question, we define the minimax rate for adaptive learning:

$$\tilde{R}_N^T(\Theta) \coloneqq \inf_{A} \inf_{\mathcal{T}} \sup_{\theta \in \Theta} \mathbb{E}\, G_T\big(A(S_1^{c_{\mathcal{T},1}}, \dots, S_T^{c_{\mathcal{T},T}})\big), \tag{2}$$

where $c_{\mathcal{T}} = (c_{\mathcal{T},1}, \dots, c_{\mathcal{T},T})$ is the curriculum adaptively selected by the task scheduler $\mathcal{T}$ and the expectation is taken over both the datasets and $c_{\mathcal{T}}$.

In this section, we compare the oracle rate in (1) to naive strategies that allocate all the samples to one task. This answers how much benefit one can achieve over naive learning schedules. We are also interested in the gap between Equ. (1) and Equ. (2).

### 3.2 Oracle rate

In this section, we analyze the oracle rate defined in Equ. (1). We first give an overview of our results. For any problem instance, there exists a single task $t$ such that the naive curriculum with $c_t = N$ matches a lower bound for the oracle rate defined in Equ. (1).

For any task $t$, the direct transfer performance of its OLS estimator on the target task can be roughly bounded by $\Delta_{t,T}^2 + d\sigma_t^2 / N$.

Thus, our result implies that, essentially, the goal of curriculum learning is to identify the best task, namely the one that balances the transfer distance against the noise level.

###### Theorem 1.

Let $Q$ be a fixed distance vector as defined above. The oracle rate over $\Theta(Q)$ in Equ. (1) can be lower bounded by

$$R_N^T(\Theta(Q)) \gtrsim C_0 \min_t \Big\{ Q_t^2 + \frac{d \sigma_t^2}{N} \Big\}. \tag{3}$$
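As a concrete numeric instance of the min on the right-hand side (the distance/noise pairs below are hypothetical, chosen only to illustrate the trade-off), a moderately similar, moderately noisy task can beat both an identical-but-noisy task and a clean-but-distant one:

```python
import numpy as np

d, N = 10, 1000
# Hypothetical (transfer distance Q_t, noise std sigma_t) pairs.
tasks = [(0.0, 2.0),   # identical to the target but very noisy
         (0.1, 0.5),   # slightly different, moderate noise
         (0.5, 0.1)]   # far from the target, nearly noiseless
# Score each task by the quantity inside the min of Equ. (3).
scores = [q ** 2 + d * s ** 2 / N for q, s in tasks]
t_star = int(np.argmin(scores))  # the task achieving the min
```

Here the scores are roughly 0.04, 0.0125, and 0.25, so the balanced task wins even though it is neither the closest nor the least noisy.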

### 3.3 Minimax rate for adaptive learning

More generally, even if we have some data from the target task, we will show that one cannot avoid a term reflecting the learning difficulty of the target task. We now formally introduce our results.

###### Theorem 2.

Let $Q$ be a fixed distance vector as above. The minimax rate in Equ. (2) can be lower bounded by

$$\tilde{R}_N^T(\tilde{\Theta}(Q)) \gtrsim \min\Big\{ \frac{\sigma_T^2 \log(T)}{N},\, Q_{\mathrm{sub}}^2 \Big\} + \min_t \frac{d \sigma_t^2}{N}. \tag{4}$$

Theorem 2 implies that without knowing the transfer distances, any adaptively learned curriculum of any multitask learning algorithm suffers an unavoidable loss of $\min\{\sigma_T^2\log(T)/N,\, Q_{\mathrm{sub}}^2\}$ when $N$ is large. Compared to the rate without transfer learning, there is still a potential improvement by a factor of $d/\log(T)$ when $Q_{\mathrm{sub}}$ and the noise levels are small.

#### Upper bound.

As we showed above, there is a potential improvement of a factor of $d/\log(T)$. This is because, given the prior information that one of the source tasks is identical to the target task, the problem reduces from estimating a $d$-dimensional vector to identifying the best task from a candidate set, whose complexity scales with $\log(T)$ instead of $d$.

In fact, a simple fixed curriculum can achieve the above minimax rate. Assume that every $Q_t$ is bounded by some constant. Let $c_T = N/2$ and $c_t = N/(2(T-1))$ for all the other tasks $t \in [T-1]$. For each $t \in [T-1]$, let $\hat{\theta}_t$ be the OLS estimator using only task $t$'s own samples, projected onto the feasible parameter set. Then we choose the one estimator from $\{\hat{\theta}_t\}_{t \in [T-1]}$ that minimizes the empirical loss on the target task:

$$t^* = \operatorname*{argmin}_{t \in [T-1]} \sum_{i=1}^{N/2} \big( Y_{T,i} - X_{T,i}^\top \hat{\theta}_t \big)^2. \tag{5}$$
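The selection step can be sketched in a few lines. This is a stylized simulation (all dimensions, noise levels, and the task construction are illustrative assumptions): each source task gets its own OLS estimator, and held-out target samples pick the winner, mirroring the argmin over the empirical target loss.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_src, n_val = 8, 400, 200
sigma = 0.5
theta_T = rng.normal(size=d)
# Hypothetical source parameters: task 0 is close to the target, the rest far.
source_thetas = [theta_T + 0.05 * rng.normal(size=d)] + \
                [theta_T + 2.0 * rng.normal(size=d) for _ in range(4)]

def ols_from_task(theta_t, n):
    # Each source estimator is fit only on that task's own samples.
    X = rng.normal(size=(n, d))
    y = X @ theta_t + sigma * rng.normal(size=n)
    return np.linalg.lstsq(X, y, rcond=None)[0]

theta_hats = [ols_from_task(th, n_src) for th in source_thetas]

# Held-out target samples are used only to compare the candidates.
X_val = rng.normal(size=(n_val, d))
y_val = X_val @ theta_T + sigma * rng.normal(size=n_val)
val_losses = [float(np.mean((y_val - X_val @ th) ** 2)) for th in theta_hats]
t_star = int(np.argmin(val_losses))
```

Comparing a handful of fixed candidates on target data is statistically much cheaper than estimating the $d$-dimensional target parameter from scratch, which is the source of the $\log(T)$ complexity discussed above.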
###### Theorem 3.

Assume there exists a task $t$ such that $\Delta_{t,T}$ and $\sigma_t$ are suitably bounded. With probability at least $1 - \delta$, $\hat{\theta}_{t^*}$ satisfies

$$G_T(\hat{\theta}_{t^*}) \lesssim C_0 \log(T/\delta) \Big( \frac{C_2 \sigma_T^2}{N} + \frac{d\, T\, \sigma_{t^*}^2}{C_1 N} \Big). \tag{6}$$

Note that $t^*$ is a random variable. However, when all the candidate tasks satisfy the assumption, the first term is dominant and our bound matches the lower bound in (4); this could happen, for instance, when the same condition holds for all tasks. For a fixed problem instance, as long as $N$ is sufficiently large, one should be able to identify the optimal source task, which removes the dependence on $T$ in the second term above. To this end, we introduce another task scheduler based on task elimination in Appendix C.

#### General function class.

As mentioned before, though it is difficult to identify the good source tasks, the complexity of doing so is still lower than that of learning the parameters directly. We remark that this result can be generalized to function classes beyond linear functions. Keeping all the other setup unchanged, we assume that the mean function $f^*_t$ belongs to a function class $\mathcal{F}$ shared by all the tasks, for some input space $\mathcal{X}$ and output space $\mathcal{Y}$. For convenience, we assume there is no covariate shift, i.e., the input distributions are the same across tasks. We give an analogue of Theorem 3.

###### Assumption 2 (Assumption B in Jin et al. (2021)).

Assume the loss $l$ is strongly convex and Lipschitz in its first argument. Furthermore, for all $t \in [T]$ and $x \in \mathcal{X}$,

$$\mathbb{E}\big[ \nabla l(f^*_t(X), Y) \mid X = x \big] = 0.$$

Assume we have $N/T$ observations for each task, including the target task. Let $\hat{f}_t$ be the empirical risk minimizer of task $t$. Similarly to (5), let

$$t^* = \operatorname*{argmin}_{t \in [T-1]} l_N^T(\hat{f}_t),$$

where $l_N^T$ is the empirical loss on the target task. We will use Rademacher complexity to measure the hardness of learning a function class. We refer readers to Bartlett and Mendelson (2002) for the detailed definition of Rademacher complexity.

###### Proposition 1.

Given the above setting and Assumption 2, with probability at least $1 - \delta$,

$$G_T(\hat{f}_{t^*}) \lesssim L^* + L_1 L_2 \Big( \mathcal{R}_{N/T}(\mathcal{F}'_t) + \sqrt{\frac{\log(1/\delta)}{N/T}} \Big),$$

where $\mathcal{R}_{N/T}(\mathcal{F}'_t)$ is the Rademacher complexity of the function space $\mathcal{F}'_t$.

This bound improves on the bound for learning the target task alone when the complexity term is small. The underlying proof idea is still that identifying good tasks is easier than learning the model itself.

## 4 Structured Linear Regression

Now we consider a slightly different setting, where we want to learn a shared linear representation that generalizes to any target task within a set of interest.

Many recent papers have shown that to achieve good generalization of the learned representation, the algorithm has to choose diverse source tasks (Tripuraneni et al., 2020; Du et al., 2020; Xu and Tewari, 2021). They all study the performance of a given choice of source tasks, while it has remained unclear whether an algorithm can adaptively select diverse tasks.

### 4.1 Problem setup

We adopt the setup of Du et al. (2020). Let $d$ and $k$ be the dimensions of the input and the representation, respectively ($d \gg k$). Let $B^* \in \mathbb{R}^{d \times k}$ be the shared representation. Let $\beta^*_t \in \mathbb{R}^k$ be the linear coefficients of the prediction function of task $t$. The model setup is essentially the same as in Section 3.1 except that the true parameters are $\theta^*_t = B^* \beta^*_t$. We call this setting structured because if one stacks the true parameters into a matrix, the matrix has a low-rank structure. To be specific, the output of task $t$ is given by

$$Y_t = X_t^\top B^* \beta^*_t + \epsilon_t.$$

We use the same setup for the covariates as in Section 3, and we consider a uniform noise level $\sigma_t = \sigma$ for all $t$.

#### Diversity.

Let $t_i$ be the task selected by the scheduler at step $i$. It is well understood that to learn a representation that generalizes to any target task with arbitrary coefficients, we need a lower bound on the following term:

$$\lambda_k\Big( \sum_{i=1}^{N} \beta^*_{t_i} \beta^{*\top}_{t_i} \Big) \eqqcolon \lambda_{N,k}, \tag{7}$$

where $\lambda_k(\cdot)$ denotes the $k$-th largest eigenvalue of a matrix, i.e., here the smallest one. Basically, we hope the source tasks cover all the possible directions so that any new task is similar to at least some of the source tasks. Equ. (7) serves as an assumption in Du et al. (2020). When the true $\beta^*_t$ are known, we can simply pick diverse tasks. When the $\beta^*_t$ are unknown, the trivial strategy that allocates samples equally can perform badly. For example, let $k$ of the tasks be diverse and let all the remaining tasks be identical. The trivial strategy will only cover one direction sufficiently, which ruins the generalization ability.
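The failure of equal allocation can be checked numerically. In the sketch below (a toy instance with hypothetical dimensions and coefficients), the first $k$ tasks span orthogonal directions while the remaining tasks are identical copies of one of them:

```python
import numpy as np

k, T, N = 3, 30, 600
# Hypothetical coefficients: the first k tasks are diverse (orthonormal
# directions); the remaining T - k tasks are all identical copies.
betas = [np.eye(k)[i] for i in range(k)] + [np.eye(k)[0]] * (T - k)

def smallest_eigenvalue(counts):
    # lambda_k of the weighted sum of outer products, as in Equ. (7).
    M = sum(n * np.outer(b, b) for n, b in zip(counts, betas))
    return float(np.linalg.eigvalsh(M)[0])

uniform = [N // T] * T                   # naive equal allocation
diverse = [N // k] * k + [0] * (T - k)   # sample only the diverse tasks
lam_uniform = smallest_eigenvalue(uniform)
lam_diverse = smallest_eigenvalue(diverse)
```

With these numbers, equal allocation wastes 90% of the budget on one redundant direction ($\lambda_{N,k}/N = 1/30$), while concentrating on the diverse tasks attains $\lambda_{N,k}/N = 1/3$.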

In this section, we will show that it is possible to adaptively schedule tasks to achieve diversity even in the hard case discussed above.

### 4.2 Lower bounding diversity

In this section, we introduce an OFU (optimism in the face of uncertainty) algorithm that adaptively selects diverse source tasks.

#### Two-phase estimator.

We first introduce an estimator of the unknown parameters. Assume that up to step $i$, we have a dataset $S_{i,t}$ for each task $t$. We evenly split each dataset into two datasets $S^{(1)}_{i,t}$ and $S^{(2)}_{i,t}$, both with a sample size of $n_{i,t}/2$. We solve the optimization problems below:

$$\hat{B}_i = \operatorname*{argmin}_{B \in \mathbb{R}^{d \times k}} \min_{\beta_t \in \mathbb{R}^k,\, t \in [T]} \sum_{t \in [T]} \sum_{(x,y) \in S^{(1)}_{i,t}} \| y - x^\top B \beta_t \|^2, \qquad \hat{\beta}_{i,t} = \operatorname*{argmin}_{\beta_t \in \mathbb{R}^k} \sum_{(x,y) \in S^{(2)}_{i,t}} \| y - x^\top \hat{B}_i \beta_t \|^2.$$

Note that we split the dataset so that $\hat{B}_i$ and $\hat{\beta}_{i,t}$ are independent.

Our algorithm runs by keeping a confidence bound on $B^* \beta^*_t$ for each task $t$ and each step $i$. Lemma 1 introduces a suitable upper bound construction; it holds under Assumption 1.

###### Lemma 1.

Let $\kappa = C_2 / C_1$. Assume Assumption 1 holds. There exist universal constants $C_1, C_5$ such that, at every step $i$, with probability $1 - \delta$, we have for all $t \in [T]$,

$$\| \hat{B}_i \hat{\beta}_{i,t} - B^* \beta^*_t \|_2^2 \lesssim \frac{C_5 \sigma^2 d k \log(\kappa N / (T\delta))}{C_1^2 n_{i,t}},$$

where $n_{i,t}$ is the number of observations from task $t$ up to step $i$.

Following the bound in Lemma 1, we construct the confidence set with width

$$W_{i,t} \coloneqq \frac{C_5 \sigma^2 d k \log(\kappa N / (T\delta))}{C_1^2 n_{i,t}}.$$

At each step $i$, for each task $t$, we construct a confidence set around $\hat{B}_i \hat{\beta}_{i,t}$:

$$\mathcal{B}_{i,t} = \big\{ \theta \in \mathbb{R}^d : \| \hat{B}_i \hat{\beta}_{i,t} - \theta \|_2^2 \leq W_{i,t} \big\}. \tag{8}$$

Then, following the principle of optimism in the face of uncertainty, we select the task

$$t_i \in \operatorname*{argmax}_{t \in [T]} \max_{\theta \in \mathcal{B}_{i,t}} \lambda_k\Big( \sum_{j=1}^{i-1} \tilde{\theta}_j \tilde{\theta}_j^\top + \theta \theta^\top \Big) \tag{9}$$

and let $\tilde{\theta}_i$ be the corresponding maximizer over $\mathcal{B}_{i, t_i}$. Here $\tilde{\theta}_i$ is our belief about task $t_i$ at step $i$.
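The optimistic selection rule can be sketched as follows. This is a stylized simulation, not the paper's Algorithm 1: the inner maximization over the confidence ball is replaced by an additive exploration bonus proportional to the confidence width, and the estimation model (error shrinking like $1/\sqrt{n}$, mimicking the rate in Lemma 1) is a hypothetical stand-in for the two-phase estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
k, T, n_steps = 3, 6, 60
# Hypothetical ground truth directions B* beta_t* in R^k; the scheduler
# never sees these directly, only noisy estimates.
true_dirs = [np.eye(k)[t % k] for t in range(T)]

def noisy_estimate(t, n):
    # Estimation error shrinking like 1/sqrt(n).
    return true_dirs[t] + rng.normal(size=k) / np.sqrt(n)

counts = np.ones(T)          # pretend one warm-up sample per task
M = np.zeros((k, k))         # running sum of selected outer products
for i in range(n_steps):
    scores = []
    for t in range(T):
        est = noisy_estimate(t, counts[t])
        width = 1.0 / counts[t]   # confidence width ~ 1/n, as in W_{i,t}
        # Optimistic surrogate for Equ. (9): estimated diversity gain
        # plus an exploration bonus from the confidence width.
        gain = np.linalg.eigvalsh(M + np.outer(est, est))[0]
        scores.append(gain + width)
    t_i = int(np.argmax(scores))
    belief = noisy_estimate(t_i, counts[t_i])
    M += np.outer(belief, belief)
    counts[t_i] += 1

lam_k = float(np.linalg.eigvalsh(M)[0])
```

Because the bonus shrinks as a task is sampled, the scheduler rotates through under-explored directions instead of collapsing onto a single one, which is the mechanism the theorem formalizes.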

Now we are ready to present our lower bound results for diversity. Our results hold under two assumptions. The first assumption requires that the representation matrix is not degenerate. We also assume boundedness of the $\beta^*_t$'s.

###### Assumption 3.

Assume the largest singular value of $B^*$ is smaller than $C_4$ for some constant $C_4$.

###### Assumption 4 (Boundedness).

We also assume that $\|\beta^*_t\|$ is bounded for all $t \in [T]$.

###### Theorem 4.

Suppose Assumption 3 and 4 hold. Assume for all , there exists some task such that for some . Let be the tasks select by Algorithm 1 for some constant . Then there exists some , such that with a probability at least ,

 λN,kN≳λC4k− ⎷σ2C25dkTlog(κN/(Tδ))C24C21λN.

If we are provided with the oracle, we only incur the first term above. When $N$ is sufficiently large, the second term in Theorem 4 is negligible and we achieve diversity asymptotically as long as $\lambda > 0$. Our proof follows the standard framework for OFU algorithms. We first show the correctness of the confidence sets implied by Lemma 1. The key steps are then to show optimism and to bound the difference between the belief $\tilde{\theta}_i$ and the actual value $B^* \beta^*_{t_i}$. We provide the proof in Appendix E.

### 4.3 Upper bound results

Though the lower bound in Theorem 4 is already satisfactory, we still want to shed some light on whether the dependency on $T$ is avoidable, by showing an upper bound result in Theorem 5.

###### Theorem 5.

For any curriculum learning algorithm, there exist $T$ tasks such that

$$\mathbb{E}\Big[ \frac{\lambda_{N,k}}{N} \Big] \lesssim \max_{t_1, \dots, t_N \in [T]} \frac{\lambda_k\big( \sum_{i=1}^{N} \beta^*_{t_i} \beta^{*\top}_{t_i} \big)}{N} - \sqrt{\frac{\sigma^2 T}{N k^3}}.$$

Theorem 5 states that the dependency on $T$ is unavoidable, while there remains a gap between the upper bound and the lower bound. Our hard-case construction is inspired by the failure of the naive strategy that allocates samples evenly. To be specific, we consider $T$ tasks such that $k$ of them are diversely specified and all the other tasks are identical. Naive strategies fail because they cover only one direction sufficiently. We divide the tasks into blocks and construct a family of similar problems whose diverse tasks lie in different blocks. The difficulty then becomes identifying the block containing the diverse tasks, which is analogous to a stochastic bandit problem. From here, we follow a proof similar to the lower-bound arguments for stochastic bandits (Lattimore and Szepesvári, 2020). The full proofs can be found in Appendix F.

## 5 Analysis of Prediction Gain

In this section, we give theoretical guarantees for the prediction-gain driven task scheduler under the unstructured setting discussed in Section 3. We do not consider the structured setting because it is not clear how to apply the prediction-gain driven method to the multitask representation learning setting.

#### Prediction Gain and convergence rate.

We define prediction gain in the following way. At step $i$, a multitask learning algorithm $A$ maps any trajectory $H_i$ to a parameter for the target task. Let the estimate at step $i$ be $\theta_i$. The prediction gain is defined as

$$G(A, H_{i+1}) \coloneqq L_T(\theta_i) - L_T(\theta_{i+1}).$$

At the start of round $i$, the prediction-gain based task scheduler selects the task for which the prediction gain is maximized.

Note that, in general, the prediction gain is not observable to the algorithm before the new observations are actually sampled. There are simple ways to estimate it, for example, from several random samples from each task.
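The greedy scheduler can be sketched as follows. This is a stylized simulation, not the paper's algorithm: the "accurate" prediction gain is computed with oracle access to the target parameter (the idealization Theorem 6 assumes), and the tasks, noise levels, and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_steps = 5, 2000
theta_T = rng.normal(size=d)
# Hypothetical tasks as (parameter, noise std): one identical but noisy,
# one slightly shifted with low noise.
tasks = [(theta_T.copy(), 2.0),
         (theta_T + 0.2 * rng.normal(size=d), 0.3)]

def target_loss(th):
    # Identity covariance: the excess target loss is the squared distance.
    return float(np.sum((th - theta_T) ** 2))

theta = np.zeros(d)
theta_sum = np.zeros(d)
for i in range(1, n_steps + 1):
    eta = 1.0 / (i + 10)
    candidates = []
    for theta_t, sigma in tasks:
        # One fresh sample from each task yields a candidate SGD step.
        x = rng.normal(size=d)
        y = x @ theta_t + sigma * rng.normal()
        candidates.append(theta - eta * x * (x @ theta - y))
    # Greedy: take the step with the highest (oracle) prediction gain.
    gains = [target_loss(theta) - target_loss(c) for c in candidates]
    theta = candidates[int(np.argmax(gains))]
    theta_sum += theta

theta_bar = theta_sum / n_steps        # averaging-SGD estimate
final_loss = target_loss(theta_bar)
init_loss = target_loss(np.zeros(d))
```

The averaged iterate ends up far closer to the target parameter than the initialization, with the scheduler implicitly trading off each task's transfer bias against its noise at every step.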

In a linear model, the prediction gain is equivalent to the convergence rate:

$$L_T(\theta_i) - L_T(\theta_{i+1}) = \| \theta_i - \theta^*_T \|^2_{\Sigma_T} - \| \theta_{i+1} - \theta^*_T \|^2_{\Sigma_T}.$$

Weinshall and Amir (2020) discussed various benefits of curriculum learning by showing that their strategy gives a higher local convergence rate. It is not clear, however, that the greedy strategy that selects the highest local prediction gain gives the best total prediction gain in the long run.

#### Decomposing prediction gain.

Considering an identity covariance matrix $\Sigma_T = I$, the loss of a given parameter $\theta$ can be written as $L_T(\theta) = \| \theta - \theta^*_T \|^2 + \sigma_T^2$.

Assume the gradient is calculated from a sample of task $t$. By the SGD update, at step $i$ we have

$$\theta_{i+1} - \theta^*_T = \big( I - \eta_i x^{(t)}_i x^{(t)\top}_i \big)(\theta_i - \theta^*_T) + \eta_i x^{(t)}_i \big( \epsilon_i + x_i^\top \theta_{\Delta_{t,T}} \big),$$

where $\theta_{\Delta_{t,T}} = \theta^*_t - \theta^*_T$.

The one-step prediction gain is

$$\| \theta_i - \theta^*_T \|^2 - \| \theta_{i+1} - \theta^*_T \|^2 = \eta_i \| \theta_i - \theta^*_T \|^2_{(2 - \eta_i \| x^{(t)}_i \|_2^2)\, x_i x_i^\top} - \eta_i^2 \big\| x^{(t)}_i \big( \epsilon_i + x_i^\top \theta_{\Delta_{t,T}} \big) \big\|_2^2 - \eta_i (\theta_i - \theta^*_T)^\top \big( I - \eta_i x^{(t)}_i x_i^\top \big) x^{(t)}_i \big( \epsilon_i + x_i^\top \theta_{\Delta_{t,T}} \big).$$

The first term on the R.H.S. is the absolute gain shared by all the tasks. In expectation, the second term is

$$-\mathbb{E}\, \eta_i^2 \big\| x_i \big( \epsilon_i + x_i^\top \theta_{\Delta_{t,T}} \big) \big\|_2^2 = -\mathbb{E}\, \eta_i^2 \| x_i \|_2^2 \big( \sigma_t^2 + \| \theta_{\Delta_{t,T}} \|^2_{x_i x_i^\top} \big). \tag{10}$$

In expectation, the third term is

$$-\mathbb{E}\, \eta_i (\theta_i - \theta^*_T)^\top \big( I - \eta_i x_i x_i^\top \big) x_i x_i^\top \theta_{\Delta_{t,T}} = -\mathbb{E}\, \big( 1 - \eta_i \| x_i \|_2^2 \big)\, \eta_i (\theta_i - \theta^*_T)^\top x_i x_i^\top \theta_{\Delta_{t,T}}. \tag{11}$$

Now we discuss terms (10) and (11), respectively. Term (11) is independent of the noise level; it is a dynamic effect depending on the current estimate $\theta_i$, so it is independent of the task difficulty and constantly changes during training. When $(\theta_i - \theta^*_T)^\top x_i x_i^\top \theta_{\Delta_{t,T}} < 0$, task $t$ has a larger prediction gain. This is the case when the gradient descent direction is consistent between the target and task $t$.

For term (10), we notice that the task difficulty and the transfer distance play equally important roles in the prediction gain, regardless of the number of observations.

#### Optimality of prediction gain.

Let $t^*$ be the optimal task defined by

$$t^* = \operatorname*{argmin}_t\; \Delta_{t,T}^2 + \frac{d \sigma_t^2}{N}.$$

We consider an averaging SGD algorithm with a decreasing step size. Let $\bar{\theta}_N = \frac{1}{N} \sum_{i=1}^{N} \theta_i$ denote the averaged iterate. The following theorem shows that the performance of averaging SGD with an accurate prediction-gain based task scheduler matches the minimax lower bound in Theorem 1.

###### Theorem 6.

Under the assumptions above, given $T$ tasks with noise levels $\sigma_t$ and transfer distances $\Delta_{t,T}$, let $\bar{\theta}_N$ be the averaging SGD estimator with an accurate prediction-gain based task scheduler as defined above. We have

$$G_T(\bar{\theta}_N) \lesssim \Delta_{t^*,T}^2 + \frac{( d \sigma_{t^*}^2 + C_5 ) \log(N)}{N}. \tag{12}$$

Theorem 6 gives an upper bound on $G_T(\bar{\theta}_N)$ that matches the lower bound in Theorem 1 up to a logarithmic factor.

## 6 Discussion

In this paper, we discussed the benefits of curriculum learning under two settings: multitask linear regression and multitask representation learning. In the multitask linear regression setting, it is fundamentally hard to adaptively identify the optimal source task to transfer from. In the multitask representation learning setting, a good curriculum is one that diversifies the source tasks; we showed that the extra error caused by adaptive learning is small and that a near-optimal curriculum is achievable. We then provided theoretical justification for the popular prediction-gain driven task scheduler that has been used in empirical work.

Our results suggest some natural directions for future work. We showed an upper bound (Thm. 5) on the diversity in the multitask representation learning setting, leaving a gap compared to our lower bound (Thm. 4). We believe this gap is due to a loose construction of the hard cases that ignores the difficulty of learning the shared representation. Another direction is to determine whether prediction-gain methods without accurate gain estimates can still perform close to the lower bounds for the adaptive learning setting.

## References

• P. L. Bartlett and S. Mendelson (2002) Rademacher and gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3 (Nov), pp. 463–482. Cited by: §3.3.
• Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §1, §2, §2.
• A. Cioba, M. Bromberg, Q. Wang, R. Niyogi, G. Batzolis, D. Shiu, and A. Bernacchia (2021) How to distribute data across tasks for meta-learning?. arXiv preprint arXiv:2103.08463. Cited by: §2.
• S. B. David, T. Lu, T. Luu, and D. Pál (2010) Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 129–136. Cited by: §3.3.
• S. S. Du, W. Hu, S. M. Kakade, J. D. Lee, and Q. Lei (2020) Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434. Cited by: §E.1, §E.1, §3.1, §4.1, §4.1, §4, Lemma 7.
• C. Gong, D. Tao, S. J. Maybank, W. Liu, G. Kang, and J. Yang (2016) Multi-modal curriculum learning for semi-supervised image classification. IEEE Transactions on Image Processing 25 (7), pp. 3249–3260. Cited by: §1.
• A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu (2017) Automated curriculum learning for neural networks. In International Conference on Machine Learning, pp. 1311–1320. Cited by: §2.
• C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan (2021) On nonconvex optimization for machine learning: gradients, stochasticity, and saddle points. Journal of the ACM (JACM) 68 (2), pp. 1–29. Cited by: Assumption 2.
• S. M. M. Kalan, Z. Fabian, A. S. Avestimehr, and M. Soltanolkotabi (2020) Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. arXiv preprint arXiv:2006.10581. Cited by: Appendix A, Appendix A, Appendix A, Appendix B, §3.2, Lemma 2.
• T. Lattimore and C. Szepesvári (2020) Bandit algorithms. Cambridge University Press. Cited by: Appendix F, Appendix F, §4.3.
• C. Liu, Z. Wang, D. Sahoo, Y. Fang, K. Zhang, and S. C. Hoi (2020) Adaptive task sampling for meta-learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pp. 752–769. Cited by: §2.
• A. Maurer, M. Pontil, and B. Romera-Paredes (2016) The benefit of multitask representation learning. Journal of Machine Learning Research 17 (81), pp. 1–32. Cited by: §2.
• S. Narvekar, B. Peng, M. Leonetti, J. Sinapov, M. E. Taylor, and P. Stone (2020) Curriculum learning for reinforcement learning domains: a framework and survey. arXiv preprint arXiv:2003.04960. Cited by: §1, §2.
• M. Sachan and E. Xing (2016) Easy questions first? a case study on curriculum learning for question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 453–463. Cited by: §1.
• Y. Tang, X. Wang, A. P. Harrison, L. Lu, J. Xiao, and R. M. Summers (2018) Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs. In International Workshop on Machine Learning in Medical Imaging, pp. 249–258. Cited by: §1.
• T. Teshima, I. Sato, and M. Sugiyama (2020) Few-shot domain adaptation by causal mechanism transfer. In International Conference on Machine Learning, pp. 9458–9469. Cited by: §2.
• N. Tripuraneni, M. I. Jordan, and C. Jin (2020) On the theory of transfer learning: the importance of task diversity. arXiv preprint arXiv:2006.11650. Cited by: §2, §4.
• J. A. Tropp (2015) An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571. Cited by: Lemma 5.
• D. Weinshall and D. Amir (2020) Theory of curriculum learning, with convex loss functions. Journal of Machine Learning Research 21 (222), pp. 1–19. Cited by: §2, §5.
• Z. Xu, A. Meisami, and A. Tewari (2021) Decision making problems with funnel structure: a multi-task learning approach with application to email marketing campaigns. In International Conference on Artificial Intelligence and Statistics, pp. 127–135. Cited by: §2.
• Z. Xu and A. Tewari (2021) Representation learning beyond linear prediction functions. arXiv preprint arXiv:2105.14989. Cited by: §2, §4.
• J. Yao, T. Killian, G. Konidaris, and F. Doshi-Velez (2018)

Direct policy transfer via hidden parameter markov decision processes

.
In LLARLA Workshop, FAIM, Vol. 2018. Cited by: §2.

## References

• P. L. Bartlett and S. Mendelson (2002) Rademacher and gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3 (Nov), pp. 463–482. Cited by: §3.3.
• Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §1, §2, §2.
• A. Cioba, M. Bromberg, Q. Wang, R. Niyogi, G. Batzolis, D. Shiu, and A. Bernacchia (2021) How to distribute data across tasks for meta-learning?. arXiv preprint arXiv:2103.08463. Cited by: §2.
• S. B. David, T. Lu, T. Luu, and D. Pál (2010) Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 129–136. Cited by: §3.3.
• S. S. Du, W. Hu, S. M. Kakade, J. D. Lee, and Q. Lei (2020) Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434. Cited by: §E.1, §E.1, §3.1, §4.1, §4.1, §4, Lemma 7.
• C. Gong, D. Tao, S. J. Maybank, W. Liu, G. Kang, and J. Yang (2016) Multi-modal curriculum learning for semi-supervised image classification. IEEE Transactions on Image Processing 25 (7), pp. 3249–3260. Cited by: §1.
• A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu (2017) Automated curriculum learning for neural networks. In International Conference on Machine Learning, pp. 1311–1320. Cited by: §2.
• C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan (2021) On nonconvex optimization for machine learning: gradients, stochasticity, and saddle points. Journal of the ACM (JACM) 68 (2), pp. 1–29. Cited by: Assumption 2.
• S. M. M. Kalan, Z. Fabian, A. S. Avestimehr, and M. Soltanolkotabi (2020) Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. arXiv preprint arXiv:2006.10581. Cited by: Appendix A, Appendix A, Appendix A, Appendix B, §3.2, Lemma 2.
• T. Lattimore and C. Szepesvári (2020) Bandit algorithms. Cambridge University Press. Cited by: Appendix F, Appendix F, §4.3.
• C. Liu, Z. Wang, D. Sahoo, Y. Fang, K. Zhang, and S. C. Hoi (2020) Adaptive task sampling for meta-learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pp. 752–769. Cited by: §2.
• A. Maurer, M. Pontil, and B. Romera-Paredes (2016) The benefit of multitask representation learning. Journal of Machine Learning Research 17 (81), pp. 1–32. Cited by: §2.
• S. Narvekar, B. Peng, M. Leonetti, J. Sinapov, M. E. Taylor, and P. Stone (2020) Curriculum learning for reinforcement learning domains: a framework and survey. arXiv preprint arXiv:2003.04960. Cited by: §1, §2.
• M. Sachan and E. Xing (2016) Easy questions first? a case study on curriculum learning for question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 453–463. Cited by: §1.
• Y. Tang, X. Wang, A. P. Harrison, L. Lu, J. Xiao, and R. M. Summers (2018) Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs. In International Workshop on Machine Learning in Medical Imaging, pp. 249–258. Cited by: §1.
• T. Teshima, I. Sato, and M. Sugiyama (2020) Few-shot domain adaptation by causal mechanism transfer. In International Conference on Machine Learning, pp. 9458–9469. Cited by: §2.
• N. Tripuraneni, M. I. Jordan, and C. Jin (2020) On the theory of transfer learning: the importance of task diversity. arXiv preprint arXiv:2006.11650. Cited by: §2, §4.
• J. A. Tropp (2015) An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571. Cited by: Lemma 5.
• D. Weinshall and D. Amir (2020) Theory of curriculum learning, with convex loss functions. Journal of Machine Learning Research 21 (222), pp. 1–19. Cited by: §2, §5.
• Z. Xu, A. Meisami, and A. Tewari (2021) Decision making problems with funnel structure: a multi-task learning approach with application to email marketing campaigns. In International Conference on Artificial Intelligence and Statistics, pp. 127–135. Cited by: §2.
• Z. Xu and A. Tewari (2021) Representation learning beyond linear prediction functions. arXiv preprint arXiv:2105.14989. Cited by: §2, §4.
• J. Yao, T. Killian, G. Konidaris, and F. Doshi-Velez (2018) Direct policy transfer via hidden parameter Markov decision processes. In LLARLA Workshop, FAIM, Vol. 2018. Cited by: §2.

## Appendix A Proof of Theorem 1

###### Proof.

Our proof is inspired by that of Kalan et al. (2020), which gives a lower-bound construction for the two-task transfer learning problem. Our result can be seen as an extension of their construction to the multiple-source-task setting.

 t^* = \arg\min_{t}\left\{ Q_t^2 + \frac{d\sigma_t^2}{N} \right\}.

Let . In general, we construct parameters with the -th row corresponding to the hypothesis set of the -th task.

We start by constructing the hypothesis set of the target task and the task . Let . By definition, we have .

Consider the set . Let be a -packing of this set in the -norm (). We can find such a packing with . Since , we also have for any .

Now we construct the hypothesis set for the target task. For all , we choose such that . Thus the construction for the target task satisfies

Now we discuss two cases. For any task with , we randomly pick a parameter in the hypothesis set of the target task, which we denote by , and we set for all . This construction is valid since any .

For any task with , we use the same construction as for .

Let be a random variable uniform over , representing the true hypothesis. The samples for each task are generated i.i.d. from the linear model described in Section 3.1 with parameter . Our goal is to show that, in expectation, any algorithm performs as badly as stated in Theorem 1.

Let be a random sample from task , given that the true parameter is . Similarly to (5.2) in Kalan et al. (2020), using Fano's inequality, we conclude that

 R_{NT}(\Theta(Q)) \ge \delta^2 \left( 1 - \frac{\log(2) + \sum_{t=1}^{T} n_t \, I(J; E_t)}{\log(M)} \right). (13)
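For completeness, this bound is an instance of the generalized Fano inequality; the recap below is a standard statement (not part of the paper's own derivation), written for an index $J$ uniform over $M$ hypotheses:

```latex
% Generalized Fano: J uniform over M hypotheses, \hat{J} estimated from data E
\Pr\big(\hat{J} \neq J\big) \;\ge\; 1 \;-\; \frac{I(J; E) + \log 2}{\log M}.
% Given J, samples are independent across tasks, so
% I(J; E) \;\le\; \textstyle\sum_{t=1}^{T} n_t \, I(J; E_t);
% since any misidentified hypothesis incurs excess risk at least \delta^2,
% multiplying through by \delta^2 recovers the bound in (13).
```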

We proceed by giving a uniform bound on the mutual information. We will need the following lemma to upper bound the mutual information term.

###### Lemma 2 (Lemma 1 in Kalan et al. (2020)).

The mutual information between and any sample can be upper bounded by , where is the distribution induced by the parameter . Furthermore, we have

 D_{\mathrm{KL}}\big(P_{\theta_{t,i}} \,\|\, P_{\theta_{t,j}}\big) = \big\|\Sigma_t^{1/2}(\theta_{t,i} - \theta_{t,j})\big\|_2^2 / (2\sigma_t^2) \le C_0 \|\theta_{t,i} - \theta_{t,j}\|_2^2 / (2\sigma_t^2).
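As a quick numerical sanity check of the equality in Lemma 2 (not part of the original proof): for the Gaussian linear model $y = x^\top \theta + \varepsilon$ with $x \sim \mathcal{N}(0, \Sigma_t)$ and noise variance $\sigma_t^2$, the per-sample KL divergence averaged over the design equals $\|\Sigma_t^{1/2}(\theta_{t,i} - \theta_{t,j})\|_2^2 / (2\sigma_t^2)$. The sketch below uses illustrative values for $\Sigma_t$, $\sigma_t$, and the two parameters; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                           # noise standard deviation (illustrative)
Sigma_t = np.diag([1.0, 2.0, 0.5])    # feature covariance of task t (illustrative)
theta_i = np.array([1.0, -0.3, 0.2])  # two candidate parameters
theta_j = np.array([0.4, 0.1, -0.5])
diff = theta_i - theta_j

# Closed form from Lemma 2: ||Sigma_t^{1/2} diff||_2^2 / (2 sigma^2)
closed_form = diff @ Sigma_t @ diff / (2 * sigma**2)

# Monte Carlo: for a fixed x, the KL between N(x^T theta_i, sigma^2) and
# N(x^T theta_j, sigma^2) is (x^T diff)^2 / (2 sigma^2); average over
# x ~ N(0, Sigma_t).
x = rng.multivariate_normal(np.zeros(3), Sigma_t, size=200_000)
monte_carlo = np.mean((x @ diff) ** 2) / (2 * sigma**2)
```

The two quantities agree up to Monte Carlo error, confirming that the first expression in Lemma 2 is the design-averaged Gaussian KL divergence.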

Using Lemma 2, we bound the mutual information of any task .

###### Lemma 3.

Under the constructions introduced above, the mutual information

 I(J; E_t) \le \frac{512\, C_0}{7\, \sigma_{t^*}^2}\, \delta'^2 \quad \text{for all } t \in [T].
###### Proof.

For any task in the first case discussed above (), the mutual information is 0. Thus the statement holds trivially.

Now we discuss the second case above. By definition, we have

 Q_t^2 + \frac{d\sigma_t^2}{N} \ge Q_{t^*}^2 + \frac{d\sigma_{t^*}^2}{N} = 64\,\delta^2. (14)

Note that

 Q_t \le 5\delta' = 5\big(Q_{t^*}/16 + \delta\big) \le 7.5\,\delta.

Plugging back into (14), we have and by definition we have . Therefore, we have .

Since the constructions are the same for the second case, the mutual information can be uniformly bounded by

 I(J; E_t) \le \frac{1}{M^2} \sum_{i,j} \frac{32\, C_0}{7\, \sigma_{t^*}^2} \|\theta_{t^*,i} - \theta_{t^*,j}\|_2^2 \le \frac{512\, C_0}{7\, \sigma_{t^*}^2}\, \delta'^2.

Finally, we follow the analysis in Section 7.4 of Kalan et al. (2020). Using Lemma 3 in Equation (13), we have

 R_{NT}(\Theta(Q)) \ge