# The Sample Complexity of Meta Sparse Regression

02/26/2020


## 1 Introduction

Current machine learning algorithms show great flexibility and representational power. On the downside, a large amount of training data is required to obtain good generalization. Unfortunately, in some scenarios the cost of data collection is high. Thus, an inevitable question is how to train a model in the presence of few training samples, a problem commonly addressed through meta-learning, or learning-to-learn.

As a meta-learning example, for the particular model class of neural networks, researchers have developed algorithms such as Matching Networks (Vinyals et al., 2016), Prototypical Networks (Snell et al., 2017), long short-term memory-based meta-learning (Ravi and Larochelle, 2016), and Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017), among others. These algorithms are empirical works that have proved successful in some cases. Unfortunately, there is a lack of theoretical understanding of the generalization of meta-learning, in general, for any model class. Some of these algorithms perform very well in specific applications, but it is still unclear why they can learn across different tasks with only a few examples given for each task. For example, in few-shot learning, the case of 5-way 1-shot classification requires the model to learn to classify images from 5 classes with only one example shown per class. In this case, the model should be able to identify useful features (among a very large learned feature set) in the 5 examples instead of building the features from scratch.

## 2 Method

In this section, we present the meta sparse regression problem as well as our regularized regression method.

### 2.1 Problem Setting

We consider the following meta sparse regression model. The dataset containing samples from multiple tasks is generated in the following way:

$$y_{t_i,j} = X_{t_i,j}^\top\left(w^* + \Delta^*_{t_i}\right) + \epsilon_{t_i,j}, \tag{1}$$

where $t_i$ indicates the $i$-th task, $w^*$ is a parameter vector constant across all tasks, and $\Delta^*_{t_i}$ is the individual parameter for each task. Note that the tasks $t_1, \dots, t_T$ are the related tasks we collect for helping solve the novel task $t_{T+1}$. Each task contains $l$ training samples. The sample size of task $t_{T+1}$ is denoted by $l_{T+1}$, which is equal to $l$ in the setting above, but generally it could also be larger than $l$.

Tasks are independently drawn from one distribution, i.e., the task-specific parameters satisfy $\Delta^*_{t_i} \sim \mathcal{P}_\Delta$ i.i.d. for all $i$. We assume $\mathcal{P}_\Delta$ is a sub-Gaussian distribution with mean $0$ and variance proxy $\sigma_\Delta^2$. The latter is a very mild assumption, since the class of sub-Gaussian random variables includes, for instance, Gaussian random variables, any bounded random variable (e.g., Bernoulli, multinomial, uniform), any random variable with strictly log-concave density, and any finite mixture of sub-Gaussian variables. We denote the support set of each task $t_i$ as $S_{t_i} := \mathrm{Supp}(w^* + \Delta^*_{t_i})$. For simplicity, here we consider the case that $S_{t_i} = S$ and $|S| = k$, for all $i \in \{1, \dots, T+1\}$.

We assume that the noise terms $\epsilon_{t_i,j}$ are i.i.d. and follow a sub-Gaussian distribution with mean $0$ and variance proxy $\sigma_\epsilon^2$. Sample covariates $X_{t_i,j} \in \mathbb{R}^p$ are independent of each other, and all entries within them are independent. These entries are drawn i.i.d. from a sub-Gaussian distribution with mean $0$ and variance proxy no greater than $\sigma_x^2$.

### 2.2 Our Method

In meta sparse regression, our goal is to use the prior tasks and their corresponding data to recover the common support of all tasks. We then estimate the parameters for the novel task. For the setting explained above, this is equivalent to recovering $S$ and $w^* + \Delta^*_{t_{T+1}}$.

First, we determine the common support over the $T$ prior tasks as the support of the estimator $\hat w$ formally introduced below, i.e., $\hat S := \mathrm{Supp}(\hat w)$, where

$$\ell(w) = \frac{1}{2Tl}\sum_{i=1}^T\sum_{j=1}^l\left\|y_{t_i,j} - X_{t_i,j}^\top w\right\|_2^2 \tag{2}$$

$$\hat w = \operatorname*{argmin}_{w}\left\{\ell(w) + \lambda\|w\|_1\right\} \tag{3}$$

Note that we have $T$ tasks in total, and $l$ samples for each task.

Second, we use the support $\hat S$ as a constraint for recovering the parameters of the novel task $t_{T+1}$. That is,

$$\ell_{T+1}(w) = \frac{1}{2l}\sum_{j=1}^l\left\|y_{t_{T+1},j} - X_{t_{T+1},j}^\top w\right\|_2^2 \tag{4}$$

$$\hat w_{T+1} = \operatorname*{argmin}_{w:\,\mathrm{Supp}(w)\subseteq\hat S}\left\{\ell_{T+1}(w) + \lambda_{T+1}\|w\|_1\right\} \tag{5}$$

We point out that our method makes a proper application of regularized (sparse) regression, and in that sense is somewhat intuitive. In what follows, we show that this method correctly recovers the common support and the parameter of the novel task. At the same time, our method is minimax optimal, i.e., it achieves the optimal sample complexity rate.
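To make the two-stage method concrete, here is a minimal numerical sketch. All instance sizes ($p$, $k$, $T$, $l$), signal and noise scales, and regularization values below are illustrative assumptions rather than the paper's choices, and the Lasso solver is a plain ISTA (iterative soft-thresholding) loop rather than a particular library routine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: p features, common support of size k, T prior tasks, l samples each.
p, k, T, l = 30, 3, 40, 20
S = np.arange(k)                        # true common support {0, 1, 2}
w_star = np.zeros(p)
w_star[S] = 2.0                         # shared parameter w* (strong signal, assumed)

def make_task():
    """Draw one task following the generative model (1)."""
    delta = np.zeros(p)
    delta[S] = 0.1 * rng.standard_normal(k)          # task-specific shift, supported on S
    X = rng.standard_normal((l, p))                  # sub-Gaussian (Gaussian) covariates
    y = X @ (w_star + delta) + 0.1 * rng.standard_normal(l)
    return X, y

def lasso_ista(X, y, lam, iters=2000):
    """Minimize (1/(2n))||y - Xw||_2^2 + lam * ||w||_1 by iterative soft-thresholding."""
    n, d = X.shape
    w = np.zeros(d)
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()   # 1 / Lipschitz constant
    for _ in range(iters):
        w -= step * (X.T @ (X @ w - y) / n)              # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft threshold
    return w

# Stage 1: pooled Lasso over the T prior tasks, as in (2)-(3) -> common support S_hat.
Xs, ys = zip(*(make_task() for _ in range(T)))
w_hat = lasso_ista(np.vstack(Xs), np.concatenate(ys), lam=0.1)
S_hat = np.flatnonzero(np.abs(w_hat) > 1e-3)

# Stage 2: Lasso for the novel task restricted to S_hat, as in (4)-(5).
X_new, y_new = make_task()
w_new = np.zeros(p)
w_new[S_hat] = lasso_ista(X_new[:, S_hat], y_new, lam=0.05)
print(sorted(S_hat.tolist()))
```

With enough prior tasks, stage 1 should recover the true support, and stage 2 then only estimates coefficients inside it, which is the dimension reduction the method exploits.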

## 3 Main Results

First, we state our result for the recovery of the common support among the prior tasks.

###### Theorem 3.1.

Let $\hat w$ be the solution of the optimization problem (3). If $\lambda$ is chosen appropriately (as specified in Section 4) and

$$T \in O\left(\max\left\{(\log p)^2 k^3 l^6,\; \frac{k\log(p-k)}{l\,\epsilon^2}\right\}\right),$$

then with probability greater than $1 - c_1\exp(-c_2\min\{k, \log(p-k)\})$, we have that

1. the support of $\hat w$ is contained within $S$ (i.e., $\mathrm{Supp}(\hat w) \subseteq S$);

2. $\|\hat w - w^*\|_\infty \le \epsilon$,

where $c_1, c_2$ are constants.

Note that in Theorem 3.1, the term $\frac{k\log(p-k)}{l\,\epsilon^2}$ in the bound on $T$ is typically encountered in the analysis of single-task sparse regression, or LASSO (Wainwright, 2009). The additional term $(\log p)^2k^3l^6$ in the bound is due to the difference in the coefficients among tasks.

Next, we state our result for the recovery of the parameters of the novel task.

###### Theorem 3.2.

Let $\hat w_{T+1}$ be the solution of the optimization problem (5). With the support $\hat S$ recovered from Theorem 3.1, if $\lambda_{T+1}$ and the novel-task sample size $l_{T+1}$ satisfy conditions analogous to those of Theorem 3.1 (stated explicitly in the supplementary materials), then with probability greater than $1 - c_3\exp(-c_4\min\{k, \log(p-k)\})$, we have that

1. the support of $\hat w_{T+1}$ is contained within $S$ (i.e., $\mathrm{Supp}(\hat w_{T+1}) \subseteq S$);

2. $\|\hat w_{T+1} - (w^* + \Delta^*_{t_{T+1}})\|_\infty \le \epsilon$,

where $c_3, c_4$ are constants.

The theorems above provide an upper bound on the sample complexity, which is achieved by our method. The lower bound on the sample complexity is an information-theoretic result, and it relies on the construction of a restricted class of parameter vectors. We consider a special case of the setting we previously presented: all non-zero entries in $w^*$ are 1, and all non-zero entries in each $\Delta^*_{t_i}$ are also 1. We use $\Theta$ to denote the set of all possible parameters $\theta^* = (w^*, \Delta^*_{t_1}, \dots, \Delta^*_{t_{T+1}})$, so that $|\Theta|$ is the number of possible outcomes of the parameters.

If the parameter $\theta^*$ is chosen uniformly at random from $\Theta$, then for any algorithm estimating this parameter by $\hat\theta$, the answer is wrong (i.e., $\hat\theta \ne \theta^*$) with probability bounded away from zero whenever the total sample size is too small relative to $\log|\Theta|$. Here we use $l_{T+1}$ to denote the sample size of task $t_{T+1}$. This fact is proved in the following theorem.

###### Theorem 3.3.

Let $\hat\theta$ be any estimator of $\theta^*$ computed from the observed data. Furthermore, assume that $\theta^*$ is chosen uniformly at random from $\Theta$. We have:

$$P\left[\hat\theta \ne \theta^*\right] \ge 1 - \frac{\log 2 + c''_1\cdot Tl + c''_2\cdot l_{T+1}}{\log|\Theta|}$$

where $c''_1, c''_2$ are constants.

In the following section, we prove that the mutual information between the true parameter and the data is bounded by $c''_1\cdot Tl + c''_2\cdot l_{T+1}$. In order to prove Theorem 3.3, we use Fano's inequality and the construction of a restricted class of parameter vectors. The use of Fano's inequality and restricted ensembles is customary for information-theoretic lower bounds (Wang et al., 2010; Santhanam and Wainwright, 2012; Tandon et al., 2014).
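Spelling out the argument, with $D$ denoting all observed data and using the mutual information bound above, Fano's inequality gives:

```latex
% Fano's inequality for \theta^* drawn uniformly from \Theta:
P[\hat{\theta} \neq \theta^*]
  \;\ge\; 1 - \frac{I(\theta^*; D) + \log 2}{\log|\Theta|}
% Plugging in the bound I(\theta^*; D) \le c''_1 \cdot Tl + c''_2 \cdot l_{T+1}:
P[\hat{\theta} \neq \theta^*]
  \;\ge\; 1 - \frac{\log 2 + c''_1 \cdot Tl + c''_2 \cdot l_{T+1}}{\log|\Theta|}
```

which is exactly the bound in Theorem 3.3, and it stays bounded away from zero whenever $Tl$ and $l_{T+1}$ are small relative to $\log|\Theta|$.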

Note that from Theorem 3.3, we know that if $Tl$ and $l_{T+1}$ are small relative to $\log|\Theta|$, then any algorithm will very likely fail to recover the true parameter. On the other hand, if $T$ and $l_{T+1}$ satisfy the conditions of Theorems 3.1 and 3.2, we can recover the support of $w^*$ as well as the parameter $w^* + \Delta^*_{t_{T+1}}$ (by $\hat w_{T+1}$). Therefore we claim that our rates of sample complexity are minimax optimal.

## 4 Sketch of the proofs

In this section, we provide details about the proofs of our main results.

### 4.1 Proof of Theorem 3.1

We use the primal-dual witness framework (Wainwright, 2009) to prove our results. First we construct the primal-dual candidate; then we show that the construction succeeds with high probability. Here we outline the steps in the proof. (See the supplementary materials for detailed proofs.)

We first introduce some useful notations:

$X_{t_i} \in \mathbb{R}^{l\times p}$ is the matrix collocating the covariate vectors of all samples in the $i$-th task (one sample per row). Similarly, $y_{t_i} \in \mathbb{R}^l$ and $\epsilon_{t_i} \in \mathbb{R}^l$.

$X_{[T]} \in \mathbb{R}^{Tl\times p}$ is the matrix collocating the $X_{t_i}$ (covariates of all samples in all tasks). Similarly, $\epsilon_{[T]} \in \mathbb{R}^{Tl}$.

$X_{t_i,S}$ is the sub-matrix of $X_{t_i}$ containing only the columns corresponding to the support of $w^*$, i.e., $S$ with $|S| = k$. Similarly, $X_{[T],S}$; the sub-vectors $\Delta^*_{t_i,S}$ and $w^*_S$ keep only the entries in $S$.

$\hat\Sigma_{S,S} := \frac{1}{Tl}\left[X_{[T]}^\top X_{[T]}\right]_{S,S}$ is the sub-matrix of the empirical covariance containing only the rows and columns corresponding to the support of $w^*$.

#### 4.1.1 Primal-dual witness

Step 1: Prove that the objective function has a positive definite Hessian when restricted to the support, i.e.,

$$(\forall w_S \in \mathbb{R}^{|S|})\quad \left[\nabla^2\ell((w_S, 0))\right]_{S,S} \succ 0 \tag{6}$$

We know that

$$\left[\nabla^2\ell((w_S, 0))\right]_{S,S} \succ 0 \;\Longleftrightarrow\; \frac{1}{Tl}\left[X_{[T]}^\top X_{[T]}\right]_{S,S} \succ 0 \tag{7}$$

We prove the above condition in the following lemma.

###### Lemma 4.1.

For $X \in \mathbb{R}^{n\times k}$, assume that each element in $X$ is an i.i.d. sub-Gaussian random variable with mean $0$ and variance proxy $\sigma_x^2$. We have

$$P\left[\left|\left|\left|\frac{1}{n}X^\top X - \sigma_x^2 I\right|\right|\right|_2 \ge \sigma_x^2\,\delta(n,k,t)\right] \le 2e^{-nC_1t^2} \tag{8}$$

where

$$\delta(n,k,t) := \left(C_2\sqrt{k/n} + t\right) + \left(C_2\sqrt{k/n} + t\right)^2$$

and $C_1, C_2$ are constants.

Using the lemma above, we show that the minimum singular value of $\hat\Sigma_{S,S}$ is larger than $0$ with high probability. We apply the lemma with $n = Tl$ and choose $t$ so as to have $\delta(Tl, k, t) \le 3/4$.

Thus, with probability $1 - 2e^{-TlC_1t^2}$, we have

$$\lambda_{\min}\left[\hat\Sigma_{S,S}\right] = x_0^\top\hat\Sigma_{S,S}x_0 = x_0^\top\left(\hat\Sigma_{S,S} - \sigma_x^2 I\right)x_0 + \sigma_x^2 \ge \sigma_x^2 - \delta(Tl,k,t)\,\sigma_x^2 \ge \sigma_x^2/4 > 0,$$

where $x_0$ is a unit-norm eigenvector corresponding to $\lambda_{\min}[\hat\Sigma_{S,S}]$.
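As a quick numerical illustration of this step (a sketch with assumed sizes $n = Tl = 500$, $k = 10$, $\sigma_x = 1$, and Gaussian entries), the minimum eigenvalue of the empirical covariance indeed stays above $\sigma_x^2/4$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma_x = 500, 10, 1.0                  # assumed sizes: n = T*l pooled samples, k = |S|

X = sigma_x * rng.standard_normal((n, k))     # sub-Gaussian (here Gaussian) design
Sigma_hat = X.T @ X / n                       # empirical covariance, k x k
lam_min = np.linalg.eigvalsh(Sigma_hat).min() # smallest eigenvalue

# The proof's requirement: lam_min >= sigma_x^2 / 4 > 0.
print(lam_min > sigma_x**2 / 4)
```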

Step 2: Set up a restricted problem:

$$\tilde w_S = \operatorname*{argmin}_{w_S\in\mathbb{R}^{|S|}}\;\ell((w_S, 0)) + \lambda\|w_S\|_1 \tag{9}$$

Step 3: Choose the corresponding dual variable $\tilde z_S$ to fulfill the complementary slackness condition: $\tilde z_j = \mathrm{sign}(\tilde w_j)$ if $\tilde w_j \ne 0$, otherwise $\tilde z_j \in [-1, 1]$.

Step 4: Solve for $\tilde z_{S^c}$ to let $(\tilde w, \tilde z)$ fulfill the stationarity condition:

$$\left[\nabla\ell((\tilde w_S, 0))\right]_S + \lambda\tilde z_S = 0 \tag{10}$$

$$\left[\nabla\ell((\tilde w_S, 0))\right]_{S^c} + \lambda\tilde z_{S^c} = 0 \tag{11}$$

Step 5: Verify that the strict dual feasibility condition is fulfilled for $\tilde z_{S^c}$:

$$\|\tilde z_{S^c}\|_\infty < 1$$

To prove support recovery, we only need to show that step 5 holds. In the next subsection we indeed show that this holds with high probability.

#### 4.1.2 Strict dual feasibility condition

We first rewrite (10) as follows:

$$\frac{1}{Tl}\sum_{i=1}^T X_{t_i,S}^\top X_{t_i,S}\left(\tilde w_S - w^*_S\right) = -\lambda\tilde z_S + \frac{1}{Tl}\sum_{i=1}^T X_{t_i,S}^\top\epsilon_{t_i} + \frac{1}{Tl}\sum_{i=1}^T X_{t_i,S}^\top X_{t_i,S}\Delta^*_{t_i,S}$$

Then we solve for $\tilde w_S - w^*_S$. That is,

$$\tilde w_S - w^*_S = \left\{\frac{1}{Tl}\sum_{i=1}^T X_{t_i,S}^\top X_{t_i,S}\right\}^{-1}\left(\frac{1}{Tl}\sum_{i=1}^T X_{t_i,S}^\top\epsilon_{t_i} - \lambda\tilde z_S + \frac{1}{Tl}\sum_{i=1}^T X_{t_i,S}^\top X_{t_i,S}\Delta^*_{t_i,S}\right)$$

and plug it into the equation below (obtained by rewriting (11)):

$$\tilde z_{S^c} = \frac{1}{\lambda Tl}\sum_{i=1}^T\left(X_{t_i,S^c}^\top\epsilon_{t_i} - X_{t_i,S^c}^\top X_{t_i,S}\left(\tilde w_S - w^*_S - \Delta^*_{t_i,S}\right)\right)$$

We have

$$\tilde z_{S^c} = \underbrace{X_{[T],S^c}^\top\left\{\frac{1}{Tl}X_{[T],S}\left(\hat\Sigma_{S,S}\right)^{-1}\tilde z_S + \Pi^\perp_{X_{[T],S}}\left(\frac{\epsilon_{[T]}}{\lambda Tl}\right)\right\}}_{\tilde z_{S^c,1}} + \underbrace{\frac{1}{\lambda Tl}\sum_{i=1}^T X_{t_i,S^c}^\top X_{t_i,S}\Delta^*_{t_i,S}}_{\tilde z_{S^c,2}} - \underbrace{\frac{1}{\lambda(Tl)^2}X_{[T],S^c}^\top X_{[T],S}\left(\hat\Sigma_{S,S}\right)^{-1}\left(\sum_{i=1}^T X_{t_i,S}^\top X_{t_i,S}\Delta^*_{t_i,S}\right)}_{\tilde z_{S^c,3}}$$

where $\Pi^\perp_{X_{[T],S}}$ is the orthogonal projection matrix onto the complement of the column space of $X_{[T],S}$, and $\tilde z_S$ is the dual variable we chose at step 3.

One can bound the norm of $\tilde z_{S^c,1}$ by the techniques used in (Wainwright, 2009). Specifically, with an appropriate choice of $\lambda$ and the mutual incoherence condition being satisfied (i.e., $\left|\left|\left|\Sigma_{S^c,S}\left(\Sigma_{S,S}\right)^{-1}\right|\right|\right|_\infty \le 1 - \gamma$ for some $\gamma \in (0, 1]$), we have

$$P\left[\|\tilde z_{S^c,1}\|_\infty \ge \gamma\right] \le c_1\exp\left(-c_2\min\{k, \log(p-k)\}\right)$$

Note that the remaining two terms, which contain $\Delta^*_{t_i,S}$, are new to the meta-learning problem and need to be handled with novel proof techniques.

We first rewrite $\tilde z_{S^c,2}$ with respect to each of its entries (denoted by $\tilde z_{n,2}$) as follows: for every $n \in S^c$, we have

$$\tilde z_{n,2} = \frac{1}{\lambda Tl}\sum_{i=1}^T\sum_{j=1}^l\sum_{m\in S}X_{t_i,j,n}X_{t_i,j,m}\Delta^*_{t_i,m} \tag{12}$$

We know that $X_{t_i,j,n}$, $X_{t_i,j,m}$, and $\Delta^*_{t_i,m}$ are sub-Gaussian random variables. It is well known that the product of two sub-Gaussian random variables is sub-exponential (whether they are independent or not). To characterize the product of three sub-Gaussian random variables, as well as the sum of such i.i.d. products, we need to use Orlicz norms and a corresponding concentration inequality.
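A small simulation illustrates the loss of Gaussian-type tails under multiplication (assumed setup: standard normal factors and tail threshold 4, chosen only for illustration): the product of two independent sub-Gaussians exceeds a large threshold far more often than a single sub-Gaussian does.

```python
import numpy as np

rng = np.random.default_rng(2)
n, thr = 200_000, 4.0

g = rng.standard_normal(n)
h = rng.standard_normal(n)

tail_single = np.mean(np.abs(g) > thr)       # sub-Gaussian tail, roughly exp(-t^2/2)
tail_product = np.mean(np.abs(g * h) > thr)  # sub-exponential tail, roughly exp(-t)

print(tail_product > tail_single)
```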

#### 4.1.3 Orlicz norm

Here we introduce the concept of the exponential Orlicz norm. For any random variable $X$ and $\alpha > 0$, we define the $\psi_\alpha$ (quasi-) norm as

$$\|X\|_{\psi_\alpha} = \inf\left\{t > 0 : \mathbb{E}\exp\left(|X|^\alpha/t^\alpha\right) \le 2\right\}$$

We define $\|X\|_{\psi_\alpha} = \infty$ when no such $t$ exists. This concept is a generalization of sub-Gaussianity and sub-exponentiality, since the family of random variables with finite exponential Orlicz norm corresponds to the $\alpha$-sub-exponential tail decay family defined by

$$P(|X| \ge t) \le c\exp\left(-t^\alpha/C\right) \quad \forall t \ge 0,$$

where $c, C$ are constants. More specifically, if $\|X\|_{\psi_\alpha} \le M < \infty$, we can set $c = 2$ and $C = M^\alpha$ so that $X$ fulfills the $\alpha$-sub-exponential tail decay property above (by Markov's inequality). We have two special cases of Orlicz norms: $\alpha = 2$ corresponds to the family of sub-Gaussian distributions, and $\alpha = 1$ corresponds to the family of sub-exponential distributions.
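As a concrete instance of the definition, for a Rademacher variable $X \in \{-1, +1\}$ the expectation $\mathbb{E}\exp(|X|^2/t^2) = \exp(1/t^2)$ is exact, so the definition gives $\|X\|_{\psi_2} = 1/\sqrt{\log 2}$. A short bisection over $t$ (an illustrative computation, not taken from the paper) recovers this value:

```python
import math

def psi2_norm_rademacher(tol=1e-8):
    """inf{t > 0 : E exp(X^2/t^2) <= 2} for Rademacher X, found by bisection.

    For X in {-1, +1}, E exp(|X|^2/t^2) = exp(1/t^2), which decreases in t,
    so the infimum solves exp(1/t^2) = 2, i.e. t = 1/sqrt(log 2).
    """
    def moment(t):
        return math.exp(1.0 / t**2)  # E exp(|X|^2 / t^2), exact for Rademacher

    lo, hi = 0.5, 5.0                # moment(lo) > 2 > moment(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if moment(mid) <= 2.0:
            hi = mid                 # mid is feasible; the infimum is <= mid
        else:
            lo = mid
    return hi

print(round(psi2_norm_rademacher(), 4))
```

The bisection converges to $1/\sqrt{\log 2} \approx 1.2011$, matching the closed-form solution of $\exp(1/t^2) = 2$.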

A good property of the Orlicz norm is that the product or the sum of several random variables with finite Orlicz norms has finite Orlicz norm as well (possibly with a different $\alpha$). We state this property in the two lemmas below.

###### Lemma 4.2.

[Lemma A.1 in (Götze et al., 2019)] Let $X_1, \dots, X_k$ be random variables such that $\|X_i\|_{\psi_{\alpha_i}} < \infty$ for some $\alpha_i > 0$, and let $\frac{1}{t} = \sum_{i=1}^k\frac{1}{\alpha_i}$. Then $\left\|\prod_{i=1}^k X_i\right\|_{\psi_t} < \infty$ and

$$\left\|\prod_{i=1}^k X_i\right\|_{\psi_t} \le \prod_{i=1}^k\|X_i\|_{\psi_{\alpha_i}}$$
###### Lemma 4.3.

[Lemma A.3 in (Götze et al., 2019)] For any $\alpha \in (0, 1]$ and any random variables $X_1, \dots, X_l$, we have

$$\left\|\sum_{i=1}^l X_i\right\|_{\psi_\alpha} \le l^{1/\alpha}\left(\sum_{i=1}^l\|X_i\|_{\psi_\alpha}\right)$$

By the lemmas above, we know that the sum (with respect to $i$) of the products in (12) is a $2/3$-sub-exponential tail decay random variable. The details are shown in the next subsection. This result does not require any independence conditions; we will thus also use this fact later for bounding $\zeta_S$.

#### 4.1.4 2/3-sub-exponential tail decay random variable

For $j \in S^c$, we have

$$\tilde z_{j,2} := \frac{1}{\lambda Tl}\sum_{i=1}^T X_{t_i,j}^\top X_{t_i,S}\Delta^*_{t_i,S} = \frac{1}{\lambda Tl}\sum_{i=1}^T\sum_{m=1}^l X_{t_i,j,m}\,X_{t_i,S,m}^\top\Delta^*_{t_i,S} = \frac{1}{\lambda Tl}\sum_{i=1}^T\sum_{q\in S}\Delta^*_{t_i,q}\left(\sum_{m=1}^l X_{t_i,j,m}X_{t_i,q,m}\right)$$

where $X_{t_i,j,m}$ denotes the value of covariate $j$ in the $m$-th sample of task $t_i$.

From Lemma 4.2, we know

$$\left\|X_{t_i,j,m}X_{t_i,q,m}\right\|_{\psi_1} \le \left\|X_{t_i,j,m}\right\|_{\psi_2}\left\|X_{t_i,q,m}\right\|_{\psi_2} = M_X^2$$

where $\|X_{t_i,j,m}\|_{\psi_2} = \|X_{t_i,q,m}\|_{\psi_2} = M_X$ and $M_X$ is a constant.

From Lemma 4.3, we have

$$\left\|\sum_{m=1}^l X_{t_i,j,m}X_{t_i,q,m}\right\|_{\psi_1} \le l^2M_X^2$$

From Lemma 4.2 again, we know

$$\left\|\Delta^*_{t_i,q}\left(\sum_{m=1}^l X_{t_i,j,m}X_{t_i,q,m}\right)\right\|_{\psi_{2/3}} \le \left\|\Delta^*_{t_i,q}\right\|_{\psi_2}\left\|\sum_{m=1}^l X_{t_i,j,m}X_{t_i,q,m}\right\|_{\psi_1} \le M_\Delta l^2M_X^2 =: M_S$$

where $\|\Delta^*_{t_i,q}\|_{\psi_2} = M_\Delta$.

#### 4.1.5 Concentration inequality for $\tilde z_{S^c,2}$

Recall that

$$\tilde z_{j,2} = \frac{1}{\lambda Tl}\sum_{i=1}^T\sum_{q\in S}\Delta^*_{t_i,q}\left(\sum_{m=1}^l X_{t_i,j,m}X_{t_i,q,m}\right) =: \frac{1}{T}\sum_{i=1}^T\sum_{q\in S}\tilde z_{j,2,i,q}$$

We know that for different $i$ and $q$, the random variables $\tilde z_{j,2,i,q}$ are independent, with $\mathbb{E}[\tilde z_{j,2,i,q}] = 0$ and $\|\tilde z_{j,2,i,q}\|_{\psi_{2/3}} \le M_S/(\lambda l)$. Now we use a concentration inequality to bound $\tilde z_{j,2}$.

###### Lemma 4.4 (Theorem 1.4 in (Götze et al., 2019)).

Let $X_1, \dots, X_n$ be a set of independent random variables satisfying $\|X_i\|_{\psi_{2/3}} \le M$ for some constant $M$. There is a constant $C_3$ such that for any $t \ge 0$, we have

$$P\left(\left|\frac{1}{n}\sum_{i=1}^n\left(X_i - \mathbb{E}X_i\right)\right| \ge t\right) \le 2\exp\left(-\frac{f_3(M,t,n)}{C_3}\right)$$

where

$$f_3(M,t,n) = \min\left(\frac{t^2n}{M^2},\; \frac{tn}{M},\; \left(\frac{tn}{M}\right)^{2/3}\right)$$

We set $\omega := \frac{\gamma/k}{M_S/(\lambda l)} = \frac{\gamma\lambda l}{kM_S}$ and apply Lemma 4.4 with $n = Tk$ and $t = \gamma/k$. Then we have

$$P\left(\left|\tilde z_{j,2}\right| \ge \gamma\right) = P\left(\left|\frac{1}{Tk}\sum_{i=1}^T\sum_{q\in S}\tilde z_{j,2,i,q}\right| \ge \frac{\gamma}{k}\right) \le 2\exp\left(-\frac{\min\left(Tk\omega^2,\; Tk\omega,\; (Tk\omega)^{2/3}\right)}{C_3}\right)$$

With the choice of $\lambda$ described above, we have

$$\omega \in O\left(\frac{(\log p)^{1/2}}{kT^{1/2}l^{3/2}}\right),\quad Tk\omega^2 \in O\left(\frac{\log p}{kl^3}\right),\quad Tk\omega \in O\left(\left(\frac{\log p}{kl^3}\right)^{3/2}\right),\quad (Tk\omega)^{2/3} \in O\left(\frac{\log p}{kl^3}\right).$$

Therefore, $\|\tilde z_{S^c,2}\|_\infty$ can be bounded by $\gamma$ with high probability; more precisely,

$$P\left[\|\tilde z_{S^c,2}\|_\infty \ge \gamma\right] \le c_5\exp\left(-c_6\log(p)/(kl^3)\right)$$
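As a sanity check on this concentration step, one can simulate mean-zero variables with finite $\psi_{2/3}$ norm, for example products of three independent standard normals, which have finite $\psi_{2/3}$ norm by the product rule of Lemma 4.2 (sizes and seed below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Products of three independent standard normals: mean 0, variance 1,
# and finite psi_{2/3} Orlicz norm by the product rule (Lemma 4.2).
samples = rng.standard_normal(n) * rng.standard_normal(n) * rng.standard_normal(n)

mean_abs = abs(samples.mean())
print(mean_abs < 0.05)   # the empirical mean concentrates near E = 0
```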

#### 4.1.6 Bound on $\|\tilde z_{S^c,3}\|_\infty$

By definition,

$$\tilde z_{S^c,3} = \frac{1}{\lambda(Tl)^2}X_{[T],S^c}^\top X_{[T],S}\left(\hat\Sigma_{S,S}\right)^{-1}\left(\sum_{i=1}^T X_{t_i,S}^\top X_{t_i,S}\Delta^*_{t_i,S}\right) = \frac{1}{Tl}X_{[T],S^c}^\top X_{[T],S}\left(\hat\Sigma_{S,S}\right)^{-1}\cdot\frac{1}{\lambda Tl}\left(\sum_{i=1}^T X_{t_i,S}^\top X_{t_i,S}\Delta^*_{t_i,S}\right) =: \frac{1}{Tl}X_{[T],S^c}^\top X_{[T],S}\left(\hat\Sigma_{S,S}\right)^{-1}\zeta_S$$

Here we define

$$\zeta_S := \frac{1}{\lambda Tl}\left(\sum_{i=1}^T X_{t_i,S}^\top X_{t_i,S}\Delta^*_{t_i,S}\right)$$

Since independence between the random variables is not necessary in Lemma 4.2, we can use the same technique employed for bounding $\tilde z_{S^c,2}$ to bound $\zeta_S$. More specifically, $\|\zeta_S\|_\infty$ can be bounded by $\gamma$ with high probability:

$$P\left[\|\zeta_S\|_\infty \ge \gamma\right] \le c_5\exp\left(-c_6\log(p)/(kl^3)\right)$$

We define the event $\mathcal{T}_2 := \{\|\zeta_S\|_\infty \ge \gamma\}$. Then we know

$$P\left[\|\tilde z_{S^c,3}\|_\infty \ge \gamma'\right] \le P\left[\|\tilde z_{S^c,3}\|_\infty \ge \gamma' \,\middle|\, \mathcal{T}_2^c\right] + P[\mathcal{T}_2]$$

We bound each entry $j \in S^c$ of $\tilde z_{S^c,3}$ by breaking it into two terms $A'_j$ and $B'_j$:

$$\tilde z_{S^c,3,j} = A'_j + B'_j$$

$$A'_j := E_j^\top\,\frac{1}{Tl}X_{[T],S}\left(\hat\Sigma_{S,S}\right)^{-1}\zeta_S$$

$$B'_j := \Sigma_{jS}\left(\Sigma_{SS}\right)^{-1}\zeta_S$$

Here $E_j$ denotes the component of the $j$-th column $X_{[T],j}$ that is uncorrelated with $X_{[T],S}$, following the decomposition of (Wainwright, 2009).

$$P\left[\max_j|A'_j| \ge \gamma'\right] \le P\left[\max_j|A'_j| \ge \gamma' \,\middle|\, \mathcal{T}_2^c\right] + P[\mathcal{T}_2] \le 2(p-k)\exp\left(-\frac{\gamma'^2}{2\rho_u(\Sigma_{S^c|S})M'_1(\epsilon)}\right) + 4e\ldots