Current machine learning algorithms have shown great flexibility and representational power. On the downside, they require a large amount of training data to generalize well. Unfortunately, in some scenarios, the cost of data collection is high. Thus, an inevitable question is how to train a model when only a few training samples are available. This problem is also called Few-Shot Learning (Wang and Yao, 2019). Indeed, there might not be much information about an underlying task when only a few examples are available. A way to tackle this difficulty is Meta-Learning (Vanschoren, 2018): instead of gathering many examples for one task, we gather many similar tasks and use the data from these tasks to train a model that generalizes well across similar tasks. The hope is that the model also performs well on a novel task, even when only a few examples are available for it. In this sense, the model can rapidly adapt to the novel task using prior knowledge extracted from other similar tasks.
As a meta-learning example, for the particular model class of neural networks, researchers have developed algorithms such as Matching Networks (Vinyals et al., 2016), Prototypical Networks (Snell et al., 2017), long short-term memory-based meta-learning (Ravi and Larochelle, 2016), and Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017), among others. These algorithms are empirical works that have proven successful in some cases. Unfortunately, there is a lack of theoretical understanding of the generalization of meta-learning in general, for any model class. Some of the algorithms perform very well on tasks related to specific applications, but it is still unclear why those methods can learn across different tasks with only a few examples given for each task. For example, in few-shot learning, the case of 5-way 1-shot classification requires the model to learn to classify images from 5 classes with only one example shown for each class. In this case, the model should be able to identify useful features (among a very large learned feature set) in the 5 examples instead of building the features from scratch.
There have been some efforts to build the theoretical foundation of meta-learning. For MAML, Finn et al. (2019) showed that the regret bound in the online learning regime is , and Fallah et al. (2019) showed that MAML can converge and find an $\epsilon$-first-order stationary point with iterations. A natural question is how we can gain a theoretical understanding of the meta-learning problem for any algorithm, i.e., a lower bound on the sample complexity of the problem. Upper and lower bounds on sample complexity are commonly analyzed in simple but well-defined statistical learning problems. Since we are learning a novel task with few samples, meta-learning falls in the same regime as sparse regression with a large number of covariates $p$ and a small sample size $n$, which is usually solved by regularized (sparse) linear regression such as LASSO, albeit for a single task. Even for a sample-efficient method like LASSO, we still need the sample size $n$ to be of order $k \log p$ to achieve correct support recovery, where $k$ is the number of non-zero coefficients among the $p$ coefficients. This rate has been proved to be optimal (Wainwright, 2009). If we consider meta-learning, we may be able to bring prior information from similar tasks to reduce the sample complexity of LASSO. In this respect, researchers have considered the multi-task problem, which assumes similarity among different tasks, e.g., tasks share a common support. Then, one learns for all tasks at once. While considering many similar tasks together seems to bring information to each single task, it also introduces additional noise or error. In the results from previous papers, e.g., (Yuan and Lin, 2006), (Negahban and Wainwright, 2008), (Lounici et al., 2009), (Jalali et al., 2010), in order to achieve good performance on all tasks, one needs the number of samples to scale with the number of tasks . More specifically, one requires or for each task, which is not useful in the regime where with respect to .
Our contribution in this paper is as follows. First, we propose a meta-sparse regression problem and a corresponding generative model that are amenable to solid statistical analysis and also capture the essence of meta-learning. Second, we prove upper and lower bounds on the sample complexity of this problem, and show that they match in the sense that and . Here is the number of coefficients in one task, is the number of non-zero coefficients among the coefficients, and is the sample size of each task. In short, we assume that we have access to a possibly infinite number of tasks from a distribution of tasks, and for each task we only have a limited number of samples. Our goal is to first recover the common support of all tasks and then use it for learning a novel task. We prove that simply by merging all the data from different tasks and solving a regularized (sparse) regression problem (LASSO), we can achieve the best sample complexity rate for identifying the common support and learning the novel task. To the best of our knowledge, our results are the first to give upper and lower bounds on the sample complexity of meta-learning problems.
In this section, we present the meta sparse regression problem as well as our regularized regression method.
2.1 Problem Setting
We consider the following meta sparse regression model. The dataset containing samples from multiple tasks is generated in the following way:
where indicates the -th task, is a constant across all tasks, and is the individual parameter for each task. Note that the tasks are the related tasks we collect to help solve the novel task . Each task contains training samples. The sample size of task is denoted by , which is equal to in the setting above, but in general it could also be larger than .
Tasks are independently drawn from one distribution, i.e., , or equivalently . We assume . The latter is a very mild assumption, since the class of sub-Gaussian random variables includes, for instance, Gaussian random variables, any bounded random variable (e.g., Bernoulli, multinomial, uniform), any random variable with strictly log-concave density, and any finite mixture of sub-Gaussian variables. We denote the support set of each task as . For simplicity, here we consider the case that and , .
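As a reminder of the terminology, one common definition of a sub-Gaussian random variable with variance proxy $\sigma^2$ is stated through its moment generating function. The symbols $X$, $s$ and $\sigma$ are placeholders chosen for this sketch and are not taken from the text above.

```latex
\mathbb{E}\bigl[\exp\bigl(s\,(X - \mathbb{E}[X])\bigr)\bigr]
\;\le\; \exp\!\left(\frac{\sigma^2 s^2}{2}\right)
\quad \text{for all } s \in \mathbb{R}.
```

For a Gaussian variable with variance $\sigma^2$ this holds with equality, and a variable bounded in $[a, b]$ satisfies it with $\sigma^2 = (b-a)^2/4$ by Hoeffding's lemma.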
We assume that are i.i.d. and follow a sub-Gaussian distribution with mean 0 and variance proxy . Sample covariates are independent, and all entries in them are independent. These entries are i.i.d. from a sub-Gaussian distribution with mean 0 and variance proxy no greater than .
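To make the setting concrete, the following Python sketch draws a synthetic dataset of this kind. It is only an illustration under stated assumptions: the response is taken to be linear in the covariates with a coefficient vector equal to a shared part plus a task-specific perturbation, the perturbation is nonzero only on the shared support, and Gaussian draws stand in for the sub-Gaussian distributions. All names (`sample_task`, `sample_meta_dataset`, `delta_scale`, `noise_sd`) are hypothetical.

```python
import random

def sample_task(w_star, support, l, delta_scale=0.3, noise_sd=0.1, rng=random):
    """Draw one task with l samples and return (X, y, w_task).

    Assumed model: y = x^T (w_star + delta) + eps, where delta is a
    task-specific perturbation living on the common support.
    """
    p = len(w_star)
    # Task-specific offset, nonzero only on the shared support (assumption).
    delta = [rng.gauss(0.0, delta_scale) if j in support else 0.0
             for j in range(p)]
    w_task = [w_star[j] + delta[j] for j in range(p)]
    X, y = [], []
    for _ in range(l):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]  # Gaussian is sub-Gaussian
        eps = rng.gauss(0.0, noise_sd)               # zero-mean noise
        X.append(x)
        y.append(sum(a * b for a, b in zip(x, w_task)) + eps)
    return X, y, w_task

def sample_meta_dataset(p, k, T, l, seed=0):
    """Draw T prior tasks plus one novel task that share a size-k support."""
    rng = random.Random(seed)
    support = set(rng.sample(range(p), k))
    w_star = [1.0 if j in support else 0.0 for j in range(p)]
    tasks = [sample_task(w_star, support, l, rng=rng) for _ in range(T + 1)]
    return w_star, support, tasks
```

The last element of `tasks` plays the role of the novel task; the first `T` elements are the prior tasks.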
2.2 Our Method
In meta sparse regression, our goal is to use the prior tasks and their corresponding data to recover the common support of all tasks. We then estimate the parameters for the novel task. For the setting we explained above, this is equivalent to recovering .
First, we determine the common support over the prior tasks by the support of formally introduced below, i.e., , where
Note that we have tasks in total, and samples for each task.
Second, we use the support as a constraint for recovering the parameters of the novel task . That is
We point out that our method makes a proper application of regularized (sparse) regression, and in that sense is somewhat intuitive. In what follows, we show that this method correctly recovers the common support and the parameter of the novel task. At the same time, our method is minimax optimal, i.e., it achieves the optimal sample complexity rate.
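The two-step procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the LASSO objective is normalized as (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1, solves it by plain coordinate descent, and fits the novel task by unpenalized least squares restricted to the estimated support. The function names and the parameters `lam`, `tol` and `n_iter` are hypothetical.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, active=None, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1.

    Only coordinates in `active` are updated; the rest stay at zero.
    """
    n, p = len(X), len(X[0])
    active = range(p) if active is None else sorted(active)
    w = [0.0] * p
    resid = list(y)  # residual y - Xw, starting from w = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in active:
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            if new_wj != w[j]:
                diff = new_wj - w[j]
                for i in range(n):
                    resid[i] -= X[i][j] * diff
                w[j] = new_wj
    return w

def meta_sparse_regression(prior_tasks, novel_task, lam, tol=1e-8):
    """prior_tasks: list of (X, y) pairs; novel_task: a single (X, y) pair."""
    # Step 1: pool all prior-task samples and run a single LASSO.
    X_pool = [x for X, _ in prior_tasks for x in X]
    y_pool = [v for _, y in prior_tasks for v in y]
    w_pool = lasso_cd(X_pool, y_pool, lam)
    support = {j for j, v in enumerate(w_pool) if abs(v) > tol}
    # Step 2: fit the novel task on the estimated support only
    # (unpenalized here, lam = 0; this choice is an assumption).
    X_new, y_new = novel_task
    w_new = lasso_cd(X_new, y_new, 0.0, active=support)
    return support, w_new
```

Restricting the second step to the estimated support is what lets the novel task be fitted from very few samples: only the coordinates in the support need to be estimated, instead of all coefficients.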
3 Main Results
First, we state our result for the recovery of the common support among the prior tasks.
Note that in Theorem 3.1, the term in is typically encountered in the analysis of the single-task sparse regression or LASSO (Wainwright, 2009). The additional term in is due to the difference in the coefficients among tasks.
Next, we state our result for the recovery of the parameters of the novel task.
The theorems above provide an upper bound of the sample complexity, which can be achieved by our method. The lower bound of the sample complexity is an information-theoretic result, and it relies on the construction of a restricted class of parameter vectors. We consider a special case of the setting we previously presented: all non-zero entries in are 1, and all non-zero entries in are also 1. We use to denote the set of all possible parameters . Therefore the number of possible outcomes of the parameters .
If the parameter is chosen uniformly at random from , for any algorithm estimating this parameter by , the answer is wrong (i.e., ) with probability greater than if . Here we use to denote the sample size of task . This fact is proved in the following theorem.
Let . Furthermore, assume that is chosen uniformly at random from . We have:
where are constants.
In the following section, we prove that the mutual information between the true parameter and the data is bounded by . In order to prove Theorem 3.3, we use Fano’s inequality and the construction of a restricted class of parameter vectors. The use of Fano’s inequality and restricted ensembles is customary for information-theoretic lower bounds (Wang et al., 2010; Santhanam and Wainwright, 2012; Tandon et al., 2014).
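For reference, one standard form of Fano's inequality reads as follows. Here $\theta$ is drawn uniformly at random from a finite set $\Theta$, $S$ denotes the observed data, and $\hat{\theta}$ is any estimator computed from $S$; these symbols are placeholders chosen for this sketch and need not match the paper's notation.

```latex
\mathbb{P}\bigl[\hat{\theta} \neq \theta\bigr]
\;\ge\; 1 - \frac{I(\theta; S) + \log 2}{\log |\Theta|}
```

In particular, the error probability is at least $1/2$ whenever $I(\theta; S) + \log 2 \le \tfrac{1}{2}\log|\Theta|$, so an upper bound on the mutual information combined with a count of the restricted ensemble translates directly into a lower bound on the amount of data needed.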
4 Sketch of the proofs
In this section, we provide details about the proofs of our main results.
4.1 Proof of Theorem 3.1
We use the primal-dual witness framework (Wainwright, 2009) to prove our results. First we construct the primal-dual candidate; then we show that the construction succeeds with high probability. Here we outline the steps in the proof. (See the supplementary materials for detailed proofs.)
We first introduce some useful notation:
is the matrix of collocated (covariates of all samples in the -th task). Similarly, and .
is the matrix of collocated (covariates of all samples in all tasks). Similarly, .
is the sub-matrix of containing only the rows corresponding to the support of , i.e., with . Similarly, , , and .
is the sub-matrix of containing only the rows and columns corresponding to the support of .
4.1.1 Primal-dual witness
Step 1: Prove that the objective function has positive definite Hessian when restricted to the support, i.e., .
We know that
We prove the above condition in the following lemma.
For , assume that each element in is i.i.d. sub-Gaussian random variable with mean and variance proxy . We have
and are constants.
Using the lemma above, we show that the minimum singular value of is larger than 0 with high probability. We let to have .
Thus, with probability , we have
where we set
Step 2: Set up a restricted problem:
Step 3: Choose the corresponding dual variable to fulfill the complementary slackness condition:
, if , otherwise
Step 4: Solve to let fulfill the stationarity condition:
Step 5: Verify that the strict dual feasibility condition is fulfilled for :
To prove support recovery, we only need to show that step 5 holds. In the next subsection we indeed show that this holds with high probability.
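The strict dual feasibility check in Step 5 can be illustrated numerically. The Python sketch below assumes the LASSO objective is normalized as (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1, so that stationarity gives the dual vector z = X^T (y - Xw) / (n * lam). This normalization and the function names `lasso_dual_certificate` and `strict_dual_feasibility` are assumptions made for illustration.

```python
def lasso_dual_certificate(X, y, w, lam):
    """Return z = X^T (y - Xw) / (n * lam).

    If w solves the LASSO with objective
    (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1, then z is a subgradient
    of the l1 norm at w: z_j = sign(w_j) on the support, |z_j| <= 1 elsewhere.
    """
    n, p = len(X), len(X[0])
    resid = [y[i] - sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]
    return [sum(X[i][j] * resid[i] for i in range(n)) / (n * lam)
            for j in range(p)]

def strict_dual_feasibility(z, support, tol=1e-12):
    """True if |z_j| < 1 for every coordinate outside the support."""
    return all(abs(zj) < 1.0 - tol
               for j, zj in enumerate(z) if j not in support)

# Tiny orthogonal-design example: the LASSO solution is soft-thresholding.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.1]
lam = 0.5
w_hat = [2.0, 0.0]          # soft-threshold of y at level n * lam = 1
z = lasso_dual_certificate(X, y, w_hat, lam)
print(z, strict_dual_feasibility(z, support={0}))
```

In this toy example the off-support entry of `z` has magnitude 0.1, well below 1, so the zero pattern of `w_hat` is certified; a value at or above 1 would signal that the candidate support is not certified by this dual vector.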
4.1.2 Strict dual feasibility condition
We first rewrite (10) as follows:
Then we solve for . That is
and plug it in the equation below (by rewriting (11)).
where is an orthogonal projection matrix, and is the dual variable we choose at step 3.
One can bound the norm of by the techniques used in (Wainwright, 2009). Specifically, if we set and , with the mutual incoherence condition being satisfied (i.e., ), we have
Note that the remaining two terms containing are new to the meta-learning problem and need to be handled with novel proof techniques.
We first rewrite with respect to each of its entries (denoted by ) as follows: , we have
We know that are sub-Gaussian random variables. It is well known that the product of two sub-Gaussian random variables is sub-exponential (whether they are independent or not). To characterize the product of three sub-Gaussian random variables and the sum of the i.i.d. products, we need to use Orlicz norms and a corresponding concentration inequality.
4.1.3 Orlicz norm
Here we introduce the concept of exponential Orlicz norm. For any random variable and , we define the (quasi-) norm as
We define . This concept is a generalization of sub-Gaussianity and sub-exponentiality since the random variable family with finite exponential Orlicz norm corresponds to the -sub-exponential tail decay family which is defined by
where are constants. More specifically, if , we set so that fulfills the -sub-exponential tail decay property above. We have two special cases of Orlicz norms: $\alpha = 2$ corresponds to the family of sub-Gaussian distributions and $\alpha = 1$ corresponds to the family of sub-exponential distributions.
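For reference, the exponential Orlicz (quasi-)norm is commonly defined as follows. This is the standard textbook form, written with placeholder symbols $X$, $\alpha$ and $t$; it is offered as a sketch and not as the exact display used by the authors.

```latex
\|X\|_{\psi_\alpha}
\;=\; \inf\left\{ t > 0 \;:\;
\mathbb{E}\left[\exp\!\left(\frac{|X|^{\alpha}}{t^{\alpha}}\right)\right] \le 2 \right\},
\qquad \alpha > 0.
```

With this convention, a finite $\psi_2$ norm characterizes sub-Gaussian variables and a finite $\psi_1$ norm characterizes sub-exponential ones. For $\alpha < 1$ the triangle inequality holds only up to a constant, which is why the term quasi-norm is used.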
A useful property of the Orlicz norm is that the product or the sum of several random variables with finite Orlicz norm also has finite Orlicz norm (possibly with a different ). We state this property in the two lemmas below.
Lemma 4.2 (Lemma A.1 in (Götze et al., 2019)).
Let be random variables such that for some and let . Then and
Lemma 4.3 (Lemma A.3 in (Götze et al., 2019)).
For any and any random variables , we have
By the lemmas above, we know that the sum (with respect to ) of the products in (12) is a -sub-exponential tail decay random variable. The details are shown in the next subsection. This result does not require any independence conditions, so we will use this fact later for bounding .
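To make the exponent bookkeeping concrete, here is a small worked calculation. It assumes the product rule has the usual Hölder-type form, in which factors with finite $\psi_{\alpha_i}$ norms yield a product with finite $\psi_\alpha$ norm; the symbols $\alpha_i$, $\alpha$ and $m$ are placeholders for this sketch.

```latex
\frac{1}{\alpha} = \sum_{i=1}^{m} \frac{1}{\alpha_i},
\qquad
\text{two sub-Gaussian factors: } \frac{1}{\alpha} = \frac{1}{2} + \frac{1}{2} = 1
\;\Rightarrow\; \alpha = 1,
\qquad
\text{three sub-Gaussian factors: } \frac{1}{\alpha} = \frac{3}{2}
\;\Rightarrow\; \alpha = \frac{2}{3}.
```

The first case recovers the familiar fact that a product of two sub-Gaussian variables is sub-exponential; the second gives a tail heavier than sub-exponential for a product of three, which is why a concentration inequality beyond the standard sub-exponential one is needed.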
4.1.4 -sub-exponential tail decay random variable
For , we have
From Lemma 4.2, we know
where and is a constant.
From Lemma 4.3, we have
From Lemma 4.2 again, we know
4.1.5 Concentration inequality for
We know that for different and , the random variables are independent with and . Now we use a concentration inequality to bound .
Lemma 4.4 (Theorem 1.4 in (Götze et al., 2019)).
Let be a set of independent random variables satisfying for some . There is a constant such that for any , we have
We set . Then we have
When , we have
Therefore, can be bounded by with probability
4.1.6 Bound on
Here we define
Since the independence between random variables is not necessary in Lemma 4.2, we use the same technique for bounding to bound . More specifically, can be bounded by with probability
We define event . Then we know
We bound by breaking it into two terms and .
Here we know .