The Sample Complexity of Meta Sparse Regression

by Zhanyu Wang, et al.

This paper addresses the meta-learning problem in sparse linear regression with infinitely many tasks. We assume that the learner can access several similar tasks. The goal of the learner is to transfer knowledge from the prior tasks to a similar but novel task. For p parameters, support-set size k, and l samples per task, we show that T ∈ O((k log p)/l) tasks are sufficient in order to recover the common support of all tasks. With the recovered support, we can greatly reduce the sample complexity for estimating the parameter of the novel task, i.e., l ∈ O(1) with respect to T and p. We also prove that our rates are minimax optimal. A key difference between meta-learning and classical multi-task learning is that meta-learning focuses only on the recovery of the parameters of the novel task, while multi-task learning estimates the parameters of all tasks, which requires l to grow with T. In contrast, our efficient meta-learning estimator allows l to be constant with respect to T (i.e., few-shot learning).






1 Introduction

Current machine learning algorithms have shown great flexibility and representational power. On the downside, in order to obtain good generalization, a large amount of data is required for training. Unfortunately, in some scenarios, the cost of data collection is high. Thus, an inevitable question is how to train a model in the presence of few training samples. This is also called Few-Shot Learning (Wang and Yao, 2019). Indeed, there might not be much information about an underlying task when only a few examples are available. A way to tackle this difficulty is Meta-Learning (Vanschoren, 2018): we gather many similar tasks instead of many examples for a single task, and use the data from the different tasks to train a model that generalizes well across similar tasks. This hopefully also guarantees good performance on a novel task, even when only a few examples are available for it. In this sense, the model can rapidly adapt to the novel task using prior knowledge extracted from other similar tasks.

As a meta-learning example, for the particular model class of neural networks, researchers have developed algorithms such as Matching Networks (Vinyals et al., 2016), Prototypical Networks (Snell et al., 2017), long short-term memory-based meta-learning (Ravi and Larochelle, 2016), and Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017), among others. These algorithms are empirical works that have proven successful in some cases. Unfortunately, there is a lack of theoretical understanding of the generalization of meta-learning, in general, for any model class. Some of the algorithms perform very well in specific applications, but it is still unclear why those methods can learn across different tasks with only a few examples given for each task. For example, in few-shot learning, the case of 5-way 1-shot classification requires the model to learn to classify images from 5 classes with only one example shown for each class. In this case, the model should be able to identify useful features (among a very large learned feature set) in the 5 examples instead of building the features from scratch.

There have been some efforts on building the theoretical foundation of meta-learning. For MAML, Finn et al. (2019) showed a logarithmic regret bound in the online learning regime, and Fallah et al. (2019) showed that MAML can converge and find an approximate first-order stationary point after sufficiently many iterations. A natural question is how we can obtain a theoretical understanding of the meta-learning problem for any algorithm, i.e., the lower bound of the sample complexity of the problem. Upper and lower bounds of the sample complexity are commonly analyzed in simple but well-defined statistical learning problems. Since we are learning a novel task with few samples, meta-learning falls in the same regime as sparse regression with a large number of covariates p and a small sample size, which is usually solved by regularized (sparse) linear regression such as LASSO, albeit for a single task. Even for a sample-efficient method like LASSO, we still need the sample size to be of order k log p to achieve correct support recovery, where k is the number of non-zero coefficients among the p coefficients. This rate has been proved to be optimal (Wainwright, 2009). If we consider meta-learning, we may be able to bring prior information from similar tasks to reduce the sample complexity of LASSO. In this respect, researchers have considered the multi-task problem, which assumes similarity among different tasks, e.g., tasks sharing a common support. Then, one learns all tasks at once. While it seems that considering many similar tasks together can bring information to each single task, noise or error is also introduced. In the results from previous papers, e.g., (Yuan and Lin, 2006), (Negahban and Wainwright, 2008), (Lounici et al., 2009), (Jalali et al., 2010), in order to achieve good performance on all tasks, one needs the per-task sample size l to scale with the number of tasks T, which is not useful in the few-shot regime where l ∈ O(1) with respect to T.

Our contribution in this paper is as follows. First, we propose a meta sparse regression problem and a corresponding generative model that are amenable to solid statistical analysis and also capture the essence of meta-learning. Second, we prove upper and lower bounds on the sample complexity of this problem, and show that they match in the sense that a total of Tl ∈ O(k log p) samples is sufficient while Tl ∈ Ω(k log p) samples are necessary. Here p is the number of coefficients in one task, k is the number of non-zero coefficients among the p coefficients, and l is the sample size of each task. In short, we assume that we have access to a possibly infinite number of tasks from a distribution of tasks, and for each task we only have a limited number of samples. Our goal is to first recover the common support of all tasks and then use it for learning a novel task. We prove that simply by merging all the data from the different tasks and solving a regularized (sparse) regression problem (LASSO), we can achieve the best sample complexity rate for identifying the common support and learning the novel task. To the best of our knowledge, our results are the first to give upper and lower bounds on the sample complexity of meta-learning problems.

2 Method

In this section, we present the meta sparse regression problem as well as our regularized regression method.

2.1 Problem Setting

We consider the following meta sparse regression model. The dataset containing samples from multiple tasks is generated in the following way:

y_ij = x_ij^T (w* + δw^(i)) + ε_ij,   i = 0, 1, …, T,   j = 1, …, l,

where i indicates the i-th task, w* is constant across all tasks, and δw^(i) is the individual parameter of each task. Note that tasks 1, …, T are the related tasks we collect for helping solve the novel task i = 0. Each task contains l training samples. The sample size of the novel task is denoted by l_0, which is equal to l in the setting above, but generally it could also be larger than l.

Tasks are independently drawn from one distribution, i.e., the task-specific parameters δw^(i) are i.i.d. We assume δw^(i) follows a sub-Gaussian distribution with mean 0 and a finite variance proxy. The latter is a very mild assumption, since the class of sub-Gaussian random variables includes, for instance, Gaussian random variables, any bounded random variable (e.g., Bernoulli, multinomial, uniform), any random variable with strictly log-concave density, and any finite mixture of sub-Gaussian variables. We denote the support set of each task i as S^(i) = supp(w* + δw^(i)). For simplicity, here we consider the case that S^(i) = S and |S| = k for all i.

We assume that the noise terms ε_ij are i.i.d. and follow a sub-Gaussian distribution with mean 0 and a finite variance proxy. Sample covariates x_ij are independent and all entries in them are independent. These entries are i.i.d. from a sub-Gaussian distribution with mean 0 and a variance proxy bounded by a constant.
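As an illustration, the generative model above can be simulated as follows. This is a minimal sketch: the values of p, k, T, and l, the 0.1 deviation and noise scales, and the choice of the first k coordinates as the common support are all hypothetical choices for the demo, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

p, k, T, l = 20, 3, 50, 10            # dimension, support size, number of tasks, samples per task
support = np.arange(k)                 # common support S (first k coordinates, for simplicity)

w_star = np.zeros(p)
w_star[support] = 1.0                  # shared parameter w*, non-zero only on S

def sample_task(n):
    """One task: a support-preserving deviation delta_w, covariates X, responses y."""
    delta_w = np.zeros(p)
    delta_w[support] = 0.1 * rng.standard_normal(k)   # task-specific sub-Gaussian deviation on S
    X = rng.standard_normal((n, p))                   # sub-Gaussian covariates
    eps = 0.1 * rng.standard_normal(n)                # sub-Gaussian noise
    y = X @ (w_star + delta_w) + eps
    return X, y

tasks = [sample_task(l) for _ in range(T)]   # the T prior tasks
X0, y0 = sample_task(l)                      # the novel task (here with l_0 = l samples)
```

Every task shares the support of w*, while the coefficient values vary slightly from task to task, which is exactly the structure the method below exploits.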

2.2 Our Method

In meta sparse regression, our goal is to use the prior tasks and their corresponding data to recover the common support of all tasks. We then estimate the parameters for the novel task. For the setting explained above, this is equivalent to recovering the support S and then the parameter w* + δw^(0) of the novel task.

First, we determine the common support over the prior tasks from the support of the estimate ŵ formally introduced below, i.e., Ŝ = supp(ŵ), where

ŵ ∈ argmin_w (1/(2Tl)) Σ_{i=1..T} Σ_{j=1..l} (y_ij − x_ij^T w)² + λ ‖w‖₁.

Note that we have T tasks in total, and l samples for each task.

Second, we use the support Ŝ as a constraint for recovering the parameters of the novel task i = 0. That is,

ŵ^(0) ∈ argmin_{w : supp(w) ⊆ Ŝ} (1/(2 l_0)) Σ_{j=1..l_0} (y_0j − x_0j^T w)² + λ_0 ‖w‖₁.
We point out that our method makes a proper application of regularized (sparse) regression, and in that sense is somewhat intuitive. In what follows, we show that this method correctly recovers the common support and the parameter of the novel task. At the same time, our method is minimax optimal, i.e., it achieves the optimal sample complexity rate.
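The two-step estimator can be sketched numerically as follows. This is a self-contained toy implementation, not the paper's exact procedure: a small ISTA (proximal gradient) routine stands in for the LASSO solver of problem (3), the second step uses plain least squares restricted to the recovered support rather than the paper's problem (5), and all dimensions, regularization values, and the 1e-3 support threshold are illustrative assumptions.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize (1/2n)||y - Xw||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    w = np.zeros(p)
    step = n / np.linalg.norm(X, 2) ** 2      # 1/L, where L = lambda_max(X^T X) / n
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = w - step * grad
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)   # soft-thresholding
    return w

rng = np.random.default_rng(1)
p, k, T, l = 20, 3, 50, 10
w_star = np.zeros(p)
w_star[:k] = 1.0                               # common parameter, supported on S = {0, 1, 2}

# Merge the data of all T prior tasks; each task deviates slightly on the support.
X = rng.standard_normal((T * l, p))
task_deltas = 0.1 * rng.standard_normal((T, p)) * (np.arange(p) < k)
deltas = np.repeat(task_deltas, l, axis=0)
y = np.einsum('ij,ij->i', X, w_star + deltas) + 0.05 * rng.standard_normal(T * l)

# Step 1: run a single LASSO on the merged dataset and read off the support.
w_hat = lasso_ista(X, y, lam=0.1)
S_hat = np.flatnonzero(np.abs(w_hat) > 1e-3)

# Step 2: refit the novel task by least squares restricted to the recovered support.
X0 = rng.standard_normal((l, p))
y0 = X0 @ w_star + 0.05 * rng.standard_normal(l)
w0 = np.zeros(p)
w0[S_hat] = np.linalg.lstsq(X0[:, S_hat], y0, rcond=None)[0]
```

Note that the novel task is fit with only l = 10 samples, far fewer than the p = 20 coordinates, which is feasible precisely because the support was recovered from the pooled prior tasks.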

3 Main Results

First, we state our result for the recovery of the common support among the prior tasks.

Theorem 3.1.

Let ŵ be the solution of the optimization problem (3). If the total number of samples satisfies Tl ≳ k log p and the regularization parameter λ is chosen appropriately, then with high probability we have that

  1. the support of ŵ is contained within S (i.e., supp(ŵ) ⊆ S);

  2. the error ‖ŵ − w*‖_∞ is bounded by a constant multiple of λ,

where the hidden factors are constants.

Note that in Theorem 3.1, one term in the bound is typically encountered in the analysis of single-task sparse regression, or LASSO (Wainwright, 2009). The additional term is due to the difference in the coefficients among tasks.

Next, we state our result for the recovery of the parameters of the novel task.

Theorem 3.2.

Let ŵ^(0) be the solution of the optimization problem (5). With the support recovered in Theorem 3.1, if the novel-task sample size l_0 is large enough relative to k (but constant with respect to T and p) and the regularization parameter of problem (5) is chosen appropriately, then with high probability we have that

  1. the support of ŵ^(0) is contained within S (i.e., supp(ŵ^(0)) ⊆ S);

  2. the estimation error of ŵ^(0) in the ∞-norm is bounded by a constant multiple of that regularization parameter,

where the hidden factors are constants.

The theorems above provide an upper bound of the sample complexity, which can be achieved by our method. The lower bound of the sample complexity is an information-theoretic result, and it relies on the construction of a restricted class of parameter vectors. We consider a special case of the setting we previously presented: all non-zero entries in w* are 1, and all non-zero entries in δw^(i) are also 1. We use Θ to denote the set of all possible parameters. Since each parameter in this restricted class is determined by its support, the number of possible outcomes of the parameters is |Θ| = C(p, k).

If the parameter is chosen uniformly at random from Θ, then for any algorithm estimating this parameter, the estimate is wrong with probability greater than a positive constant whenever the total sample size falls below the order of k log p. Here we use l_i to denote the sample size of task i. This fact is proved in the following theorem.

Theorem 3.3.

Assume that the parameter is chosen uniformly at random from Θ. If the total number of samples Σ_i l_i is at most a constant multiple of k log p, then any estimator fails to recover the true parameter with probability at least a positive constant,

where the hidden factors are constants.

In the following section, we prove that the mutual information between the true parameter and the data is bounded by a quantity proportional to the total sample size. In order to prove Theorem 3.3, we use Fano’s inequality and the construction of a restricted class of parameter vectors. The use of Fano’s inequality and restricted ensembles is customary for information-theoretic lower bounds (Wang et al., 2010; Santhanam and Wainwright, 2012; Tandon et al., 2014).

Note that from Theorem 3.3, we know that if the total sample size Tl falls below the order of k log p, then any algorithm will very likely fail to recover the true parameter. On the other hand, if Tl is of order k log p or larger (and the novel task has enough samples), then by Theorems 3.1 and 3.2 we can recover the support of w* and the parameter of the novel task (via the estimators of problems (3) and (5)). Therefore we claim that our rates of sample complexity are minimax optimal.
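Collecting the statements above (using only the quantities defined in the abstract, with all constants left unspecified), the matching of the rates can be summarized as:

```latex
% Sufficiency (Theorems 3.1 and 3.2): pooling the T tasks recovers S, and the
% novel task then needs only l_0 = O(1) samples with respect to T and p.
% Necessity (Theorem 3.3): below this order, every estimator fails with
% constant probability.
T\,l \;\in\; O\!\left(k \log p\right) \quad \text{suffices},
\qquad
T\,l \;\in\; \Omega\!\left(k \log p\right) \quad \text{is necessary}.
```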

4 Sketch of the proofs

In this section, we provide a sketch of the proofs of our main results.

4.1 Proof of Theorem 3.1

We use the primal-dual witness framework (Wainwright, 2009) to prove our results. First we construct the primal-dual candidate; then we show that the construction succeeds with high probability. Here we outline the steps in the proof. (See the supplementary materials for detailed proofs.)

We first introduce some useful notation:

X^(i) is the matrix that stacks the covariate vectors x_ij^T of all samples in the i-th task. Similarly for Y^(i) and ε^(i).

X is the matrix that stacks the X^(i) of all tasks. Similarly for Y and ε.

X_S is the sub-matrix of X containing only the columns corresponding to the support S of w*; similarly, w_S, δw_S^(i), and ŵ_S are the corresponding sub-vectors.

X_S^T X_S is the sub-matrix of X^T X containing only the rows and columns corresponding to the support S.

4.1.1 Primal-dual witness

Step 1: Prove that the objective function has a positive definite Hessian when restricted to the support, i.e., that the Gram matrix restricted to the support is positive definite.

We prove this condition in the following lemma.

Lemma 4.1.

For the design sub-matrix restricted to the support, assume that each of its elements is an i.i.d. sub-Gaussian random variable with mean 0 and a finite variance proxy. Then, with high probability, the minimum eigenvalue of the associated sample covariance is bounded away from 0,

where the hidden factors are constants.

Using the lemma above, and for an appropriate choice of the constants involved, we show that the minimum singular value of the Gram matrix restricted to the support is larger than 0 with high probability.
Step 2: Set up a restricted problem:


Step 3: Choose the corresponding dual variable to fulfill the complementary slackness condition:

ẑ_j = sign(ŵ_j) if ŵ_j ≠ 0, and ẑ_j ∈ [−1, 1] otherwise.

Step 4: Solve for the remaining variables so that the pair (ŵ, ẑ) fulfills the stationarity condition:


Step 5: Verify that the strict dual feasibility condition is fulfilled, i.e., |ẑ_j| < 1 for every coordinate j outside the support:

To prove support recovery, we only need to show that step 5 holds. In the next subsection we indeed show that this holds with high probability.
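For reference, the stationarity (Step 4), complementary slackness (Step 3), and strict dual feasibility (Step 5) conditions for a generic ℓ1-regularized least-squares problem take the following standard form. This is the textbook primal-dual witness setup in generic notation, not necessarily the paper's exact display:

```latex
\frac{1}{n} X^{\top}\!\left(X\hat{w} - y\right) + \lambda \hat{z} = 0,
\qquad
\hat{z}_j = \operatorname{sign}(\hat{w}_j) \;\text{ if } \hat{w}_j \neq 0,
\qquad
|\hat{z}_j| \le 1 \;\text{ otherwise},
```

where strict dual feasibility asks for the last inequality to be strict, |ẑ_j| < 1, for every coordinate j outside the support.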

4.1.2 Strict dual feasibility condition

We first rewrite (10) as follows:

Then we solve for the error ŵ_S − w*_S on the support, and plug it into the equation below (obtained by rewriting (11)). We have

where Π is an orthogonal projection matrix, and ẑ_S is the dual variable we chose at step 3.

One can bound the norm of the first term by the techniques used in (Wainwright, 2009). Specifically, if we set the regularization parameter appropriately and the sample size is large enough, with the mutual incoherence condition being satisfied (i.e., ‖X_{S^c}^T X_S (X_S^T X_S)^{−1}‖_∞ ≤ 1 − γ for some γ > 0), we have

Note that the remaining two terms, which involve the task-specific parameters δw^(i), are new to the meta-learning problem and need to be handled with novel proof techniques.

We first rewrite the first of these terms with respect to each of its entries as follows: for each coordinate outside the support, we have

We know that the covariates, the task-specific parameters, and the noise terms are sub-Gaussian random variables. It is well known that the product of two sub-Gaussian random variables is sub-exponential (whether they are independent or not). To characterize the product of three sub-Gaussians, and the sum of the i.i.d. products, we need to use Orlicz norms and a corresponding concentration inequality.
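The claim that products of sub-Gaussians are heavier-tailed than Gaussians can be checked numerically. The following rough Monte Carlo sketch compares fourth moments (a standard Gaussian has E[Z⁴] = 3, while the product of two independent standard Gaussians has E[(Z₁Z₂)⁴] = E[Z₁⁴]E[Z₂⁴] = 9); the sample size and the moment comparison are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)
prod = z1 * z2                       # product of two independent sub-Gaussians

# Empirical fourth moments: the product's is about three times the Gaussian's,
# reflecting its heavier (sub-exponential) tails.
m4_gauss = np.mean(z1 ** 4)
m4_prod = np.mean(prod ** 4)
```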

4.1.3 Orlicz norm

Here we introduce the concept of the exponential Orlicz norm. For any random variable X and α > 0, we define the (quasi-)norm ‖X‖_{ψ_α} as

‖X‖_{ψ_α} = inf{ t > 0 : E[exp(|X|^α / t^α)] ≤ 2 }.

This concept is a generalization of sub-Gaussianity and sub-exponentiality, since the family of random variables with finite exponential Orlicz norm corresponds to the family with α-sub-exponential tail decay, which is defined by

P(|X| ≥ t) ≤ c₁ exp(−c₂ t^α) for all t ≥ 0,

where c₁, c₂ are constants. More specifically, if ‖X‖_{ψ_α} < ∞, the constants c₁, c₂ can be chosen so that X fulfills the α-sub-exponential tail decay property above. We have two special cases of Orlicz norms: α = 2 corresponds to the family of sub-Gaussian distributions, and α = 1 corresponds to the family of sub-exponential distributions.

A good property of the Orlicz norm is that the product or the sum of many random variables with finite Orlicz norms has a finite Orlicz norm as well (possibly with a different α). We state this property in the two lemmas below.

Lemma 4.2.

[Lemma A.1 in (Götze et al., 2019)] Let X₁, …, X_m be random variables such that ‖X_i‖_{ψ_{α_i}} < ∞ for some α_i > 0, and define α by 1/α = Σ_{i=1}^m 1/α_i. Then the product X₁ ⋯ X_m has finite ψ_α norm and

Lemma 4.3.

[Lemma A.3 in (Götze et al., 2019)] For any α ∈ (0, 1] and any random variables X and Y, the ψ_α quasi-norm of X + Y is bounded by a constant (depending only on α) times the sum of the individual ψ_α quasi-norms.

By the lemmas above, we know that the sum (with respect to the samples) of the products in (12) is an α-sub-exponential tail decay random variable. The details are shown in the next subsection. This result does not require any independence conditions, and thus we will also use this fact later for bounding the second term.
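For instance, assuming the Hölder-type exponent rule of Lemma 4.2, the product of three sub-Gaussian (ψ₂) factors appearing in each summand lands in the Orlicz scale at:

```latex
\frac{1}{\alpha} \;=\; \frac{1}{2} + \frac{1}{2} + \frac{1}{2} \;=\; \frac{3}{2}
\qquad \Longrightarrow \qquad
\alpha \;=\; \tfrac{2}{3},
```

i.e., each summand has (2/3)-sub-exponential tail decay.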

4.1.4 α-sub-exponential tail decay random variables

For each entry of (12), we first apply Lemma 4.2 to the product of three sub-Gaussian factors, which yields a finite ψ_{2/3} norm, where the hidden factor is a constant. From Lemma 4.3, the sum of such terms also has a finite ψ_{2/3} norm. Applying Lemma 4.2 once more controls the remaining products, again with a finite Orlicz norm.

4.1.5 Concentration inequality for the first term

Recall that

We know that for different samples and tasks, these random variables are independent with mean 0 and finite Orlicz norm. Now we use a concentration inequality to bound their sum.

Lemma 4.4 (Theorem 1.4 in (Götze et al., 2019)).

Let X₁, …, X_n be a set of independent random variables satisfying ‖X_i‖_{ψ_α} < ∞ for some α ∈ (0, 1]. There is a constant C such that for any t > 0, we have


We set the parameters of Lemma 4.4 appropriately. Then, when the total sample size is large enough, the first term is bounded with high probability.

4.1.6 Bound on the second term

By definition,

Here we define

Since Lemma 4.2 does not require independence between the random variables, we use the same technique as above to bound this term, obtaining an analogous high-probability bound.

We define the event on which the bound above holds. Then we know

We bound the remaining quantity by breaking it into two terms.

Here we know .