# A Theory of Multiple-Source Adaptation with Limited Target Labeled Data

We study multiple-source domain adaptation, when the learner has access to abundant labeled data from multiple-source domains and limited labeled data from the target domain. We analyze existing algorithms for this problem, and propose a novel algorithm based on model selection. Our algorithms are efficient, and experiments on real data-sets empirically demonstrate their benefits.


## 1 Introduction

A common assumption in supervised learning is that training and test samples are drawn from the same distribution. When the training sample is sufficiently large, an adequately complex model trained on that sample can achieve good accuracy. However, in many applications such as cloud computing, the amount of training data is limited. In such scenarios, a natural approach is to use one or several proxy data-sets during training to improve the model accuracy on the test data. This approach is referred to as domain adaptation (DA). Domain adaptation arises in many applications such as natural language processing (Blitzer07Biographies; dredze2007frustratingly; jiang2007instance), speech processing (gauvain1994maximum; jelinek1997statistical) (leggetter1995maximum), etc.

More formally, in domain adaptation, the learner has access to a large number of samples from one or more source domains, but only a small number of labeled samples from the target domain. When there is more than one source domain, the problem is often referred to as multiple-source domain adaptation (MSDA). In this paper, we study MSDA when the learner has access to a small number of labeled samples from the target domain.

Theoretical studies of domain adaptation can be broadly viewed along three axes: number of source domains, availability of labeled target data, and availability of labeled source data.

ben2007analysis considered the problem of single-source DA, when the learner has access to labeled source data and unlabeled target data. blitzer2008learning and ben2010theory extended this work to the case where a small amount of labeled target data is available and gave a bound on the error rate of a hypothesis derived from a weighted combination of the source datasets for empirical risk minimization.

MansourMohriRostamizadeh2009a, MansourMohriRostamizadeh2009 and HoffmanMohriZhang2018 considered MSDA when the learner has access to unlabeled samples and learned models for source domains. This approach has been further used in many applications such as object recognition (hoffman_eccv12; gong_icml13; gong_nips13). zhao2018adversarial and wen2019domain considered MSDA with only unlabeled target data available and provided generalization bounds for classification and regression.

Recently, there have been many experimental studies of domain adaptation in various tasks. ganin2016domain proposed to learn features that cannot discriminate between source and target domains. tzeng2015simultaneous proposed a CNN architecture to exploit unlabeled and sparsely labeled target domain data. motiian2017unified, motiian2017few and wang2019few proposed to train maximally separated features via adversarial learning.

In contrast to the above theoretical studies on MSDA, we consider supervised MSDA, where the learner has access to both labeled source data and a small amount of labeled target data, and ask what is the best that the learner can do. Our setting is similar to that of konstantinov2019robust, which considered the problem of learning from multiple untrusted sources and a single target domain.

In Section 2, we formulate the learning problem and discuss a natural baseline. In Section 3, we present our main solution for this adaptation problem and prove that it benefits from near-optimal guarantees. We first present a model selection solution that is nearly-optimal. Next, we give an alternative algorithm based on a boosting-style formulation. Then, in Section 4, we compare our algorithm with natural discrepancy-based solutions and explain why they may not benefit from the same theoretical optimality in general. In Section 5, we report the results of experiments with our CMSA algorithms compared to several other algorithms.

## 2 Preliminaries

In this section, we state the notation, characterize the baseline, describe the problem formulation, and state an oracle-aided solution.

### 2.1 Notation

Before we proceed further, we introduce some general notation and definitions used throughout the paper. Let $\mathcal{X}$ denote the input space and $\mathcal{Y}$ the output space. We focus on the multi-class classification problem where $\mathcal{Y}$ is a finite set of classes, but many of our results can be extended straightforwardly to regression and other problems. The hypotheses we consider are of the form $h\colon \mathcal{X} \to \Delta_{\mathcal{Y}}$, where $\Delta_{\mathcal{Y}}$ stands for the simplex over $\mathcal{Y}$. Thus, $h(x)$ is a probability distribution over the classes or categories that can be assigned to $x \in \mathcal{X}$. We denote by $\mathcal{H}$ a family of such hypotheses. We also denote by $\ell$ a loss function defined over $\Delta_{\mathcal{Y}} \times \mathcal{Y}$ and taking non-negative values. The loss of $h \in \mathcal{H}$ for a labeled sample $(x, y) \in \mathcal{X} \times \mathcal{Y}$ is given by $\ell(h(x), y)$. Let $M$ be an upper bound for the loss function $\ell$. We denote by $L_D(h)$ the expected loss of a hypothesis $h$ with respect to a distribution $D$ over $\mathcal{X} \times \mathcal{Y}$:

$$L_D(h) = \mathbb{E}_{(x,y)\sim D}\big[\ell(h(x), y)\big],$$

and by $h_D$ its minimizer: $h_D = \operatorname*{argmin}_{h \in \mathcal{H}} L_D(h)$.

We denote by $D_0$ the target domain distribution and by $D_1, \ldots, D_p$ the source domain distributions. During training, we observe $m_k$ independent samples from distribution $D_k$. We denote by $\hat{D}_k$ the corresponding empirical distribution. We also denote by $m = \sum_{k=1}^{p} m_k$ the total number of samples from all the source domains. In practice, we expect $m$ to be significantly larger than $m_0$ ($m \gg m_0$).

### 2.2 Baseline model trained on the target distribution

What is the best that one can achieve without data from any source distribution? Suppose we train on the target domain samples alone, and obtain a model $h_{\hat{D}_0}$. By standard learning theoretic tools (MohriRostamizadehTalwalkar2012), the generalization bound for this model can be stated as follows: with probability at least $1 - \delta$, the minimizer of the empirical risk satisfies

$$L_{D_0}(h_{\hat{D}_0}) \le \min_{h \in \mathcal{H}} L_{D_0}(h) + O\left(\frac{\sqrt{d} + \sqrt{\log\frac{1}{\delta}}}{\sqrt{m_0}}\right), \tag{1}$$

where $d$ is the pseudo-dimension of the hypothesis class $\mathcal{H}$. For simplicity, we provide generalization bounds in terms of pseudo-dimension throughout the paper. They can be easily extended to bounds based on Rademacher complexity. For 0-1 loss, pseudo-dimension also coincides with VC dimension. Finally, there exist distributions and hypotheses where (1) is tight (MohriRostamizadehTalwalkar2012, Theorem 3.23).

### 2.3 Formulation

Let $\Delta_p$ be the set of probability distributions over $[p]$. In order to provide meaningful bounds and improve upon (1), we assume that the target distribution is approximated by a convex combination of source domains, i.e., letting $D_\lambda = \sum_{k=1}^{p} \lambda_k D_k$, there exists $\lambda \in \Delta_p$ such that

$$D_0 \approx D_\lambda. \tag{2}$$

This assumption is fairly broad, and if $D_0 = D_k$ for some $k$, then (2) holds exactly. It is also equivalent to that of a non-negative factorization of the underlying distributions, often adopted in topic modelling or LDA (hofmann2017probabilistic; lee2001algorithms). Previous work in MSDA (MansourMohriRostamizadeh2009a) made a similar assumption.

To measure the approximation, we need a notion of divergence. Divergence between distributions is usually measured in terms of Bregman divergences such as KL divergence. However, they do not consider the classification task at hand. To obtain better bounds, we use the notion of discrepancy between distributions (mansour2009domain; mohri2012new), which generalizes the distance proposed by ben2007analysis.

For two distributions over features and labels, $D$ and $D'$, and a class of hypotheses $\mathcal{H}$, following mohri2012new, we define label-discrepancy as

$$\operatorname{disc}_{\mathcal{H}}(D, D') = \max_{h \in \mathcal{H}} \big| L_D(h) - L_{D'}(h) \big|.$$

The standard notion of discrepancy in mansour2009domain is only defined on features. Since target labels are available in our setting, as outlined in mohri2012new, this notion of label-discrepancy is tighter. For the above notion of label-discrepancy, the loss function does not need to satisfy the triangle inequality.

If the loss of all the hypotheses in the class is equal between $D$ and $D'$, then the discrepancy is zero, hence models trained on $D$ generalize well on $D'$ and vice versa. If the discrepancy is large, then there is a certain hypothesis where the models behave very differently. Thus, discrepancy takes into account both the hypothesis set and the loss functions, hence the structure of the learning problem. Additionally, the discrepancy can be upper bounded in terms of the total variation and relative entropy.
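As an illustration of the definition above, the following sketch estimates the label-discrepancy between two labeled samples. It assumes a small finite hypothesis set and the 0-1 loss, so the maximum can be taken by enumeration; the names `zero_one_loss` and `label_discrepancy` and the toy threshold classifiers are hypothetical and not part of the paper.

```python
import numpy as np

def zero_one_loss(pred, y):
    """Average 0-1 loss of hard predictions `pred` against labels `y`."""
    return float(np.mean(pred != y))

def label_discrepancy(hypotheses, sample_a, sample_b, loss=zero_one_loss):
    """Empirical label-discrepancy: max over h of |L_A(h) - L_B(h)|.

    Assumption: `hypotheses` is a finite list of callables, so the max is
    computed by enumeration rather than by optimization.
    """
    (xa, ya), (xb, yb) = sample_a, sample_b
    return max(abs(loss(h(xa), ya) - loss(h(xb), yb)) for h in hypotheses)

# Toy hypothesis set: threshold classifiers on a 1-D feature.
hypotheses = [lambda x, t=t: (x > t).astype(int) for t in (-1.0, 0.0, 1.0)]
x = np.array([-2.0, -0.5, 0.5, 2.0])
same = (x, (x > 0).astype(int))      # labels given by the threshold at 0
flipped = (x, (x <= 0).astype(int))  # same features, opposite labels

d_same = label_discrepancy(hypotheses, same, same)
d_flip = label_discrepancy(hypotheses, same, flipped)
```

Two samples with identical features but different labels can have a large label-discrepancy, which a feature-only notion would not detect.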

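With the above definitions, we can define how good a mixture weight $\lambda$ is. For a given $\lambda$, a natural algorithm is to combine samples from the empirical distributions $\hat{D}_k$ to obtain the mixed empirical distribution $\bar{D}_\lambda = \sum_{k=1}^{p} \lambda_k \hat{D}_k$, and minimize the loss on $\bar{D}_\lambda$. Let $h_{\bar{D}_\lambda}$ be the minimizer of this loss. A good $\lambda$ should lead to $h_{\bar{D}_\lambda}$ with performance close to that of the optimal estimator for $D_0$. In other words, the goal is to find $\lambda$ that minimizes

$$L_{D_0}(h_{\bar{D}_\lambda}) - L_{D_0}(h_{D_0}).$$

The above term is difficult to minimize and hence we upper bound it as follows:

$$L_{D_0}(h_{\bar{D}_\lambda}) - L_{D_0}(h_{D_0}) \le 2 \max_{h \in \mathcal{H}} \big| L_{\bar{D}_\lambda}(h) - L_{D_\lambda}(h) \big| + 2 \operatorname{disc}_{\mathcal{H}}(D_0, D_\lambda). \tag{3}$$

We refer readers to Appendix A for the derivation of (3). Let the uniform bound on the excess risk for a given $\lambda$ be

$$\mathcal{E}(\lambda) = 2 \max_{h \in \mathcal{H}} \big| L_{\bar{D}_\lambda}(h) - L_{D_\lambda}(h) \big| + 2 \operatorname{disc}_{\mathcal{H}}(D_0, D_\lambda), \tag{4}$$

and $\lambda^*$ be the mixture weight that minimizes the above uniform excess bound, i.e.

$$\lambda^* = \operatorname*{argmin}_{\lambda \in \Delta_p} \mathcal{E}(\lambda).$$

Our goal is to produce a model with error close to $\mathcal{E}(\lambda^*)$, without the knowledge of $\lambda^*$. Before we review the existing algorithms, we provide a bound on $\mathcal{E}(\lambda)$.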

### 2.4 The known mixture setting

The problem of domain adaptation can be broadly broken down into two parts: (i) finding $\lambda^*$; (ii) finding the hypothesis that minimizes the loss over the corresponding $D_{\lambda^*}$. In this section, we discuss guarantees on (ii) if $\lambda^*$ is known.

Let $\mathbf{m} = (m_1/m, \ldots, m_p/m)$ denote the empirical distribution of samples. Skewness between distributions is defined as $s(\lambda \| \mathbf{m}) = \sum_{k=1}^{p} \lambda_k^2 / \mathbf{m}_k$. Skewness is a divergence and measures how far $\lambda$ and the empirical distribution of samples $\mathbf{m}$ are from each other. It naturally arises in the generalization bounds of weighted mixtures. For example, if $\lambda = \mathbf{m}$, then $s(\lambda \| \mathbf{m}) = 1$ and the generalization bound in Lemma 2.4 will be the same as the bound for the uniform weighted model. If $\lambda = (1, 0, \ldots, 0)$, then $s(\lambda \| \mathbf{m}) = m/m_1$ and the generalization bound will be the same as the bound for training on a single domain. Thus skewness smoothly interpolates between the uniform weighted model and the single domain model. For a fixed $\lambda$, the following result of mohri2019agnostic proves a generalization bound.

Lemma 2.4 (mohri2019agnostic). Let $\lambda \in \Delta_p$. Then with probability at least $1 - \delta$,

$$\mathcal{E}(\lambda) \le 4M \sqrt{\frac{s(\lambda \| \mathbf{m})}{m} \cdot \left( d \log\frac{em}{d} + \log\frac{1}{\delta} \right)} + 2 \operatorname{disc}_{\mathcal{H}}(D_0, D_\lambda).$$

The above bound for 0-1 loss can also be found in blitzer2008learning. Since $m \gg m_0$, the above bound is much stronger than the bound for the model just trained on the target data (1).
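To make the role of skewness concrete, the sketch below evaluates it for a few mixture weights and plugs it into the estimation term of the bound above. It assumes the skewness of mohri2019agnostic, i.e. the sum over sources of the squared weight divided by the fraction of samples from that source; the function names and the sample counts are hypothetical.

```python
import numpy as np

def skewness(lam, counts):
    """Skewness of mixture weights `lam` w.r.t. per-source sample counts.

    Assumption: s(lam || m) = sum_k lam_k^2 / (m_k / m), which equals 1 when
    lam matches the sample proportions and m / m_k when all weight is on
    source k.
    """
    lam = np.asarray(lam, dtype=float)
    counts = np.asarray(counts, dtype=float)
    proportions = counts / counts.sum()
    return float(np.sum(lam ** 2 / proportions))

def estimation_term(lam, counts, d, delta, M=1.0):
    """First term of the bound: 4M * sqrt(s/m * (d log(em/d) + log(1/delta)))."""
    m = float(np.sum(counts))
    s = skewness(lam, counts)
    return 4.0 * M * np.sqrt(s / m * (d * np.log(np.e * m / d) + np.log(1.0 / delta)))

counts = [100, 300]                       # hypothetical m_1, m_2
s_matched = skewness([0.25, 0.75], counts)  # weights equal to sample proportions
s_single = skewness([1.0, 0.0], counts)     # all weight on the first source
```

Under this definition the estimation term for weights concentrated on one source scales with that source's sample count alone, which is the interpolation described above.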

## 3 Proposed algorithms

Our goal is to come up with a hypothesis, without knowledge of $\lambda^*$, whose excess risk is close to the uniform bound on the excess risk in the known mixture setting. Since $\lambda^* \in \Delta_p$, one approach is to find the best minimizer for all values of $\lambda \in \Delta_p$ and use the one that performs best on $\hat{D}_0$. This approach is similar to model selection in machine learning, where the goal is to select both the model class and the model parameters. We refer to this as cover-based multiple source adaptation (CMSA). The complete algorithm is in Figure 1. The algorithm takes as input $\Lambda$, a subset of $\Delta_p$. For each element $\lambda$ of the cover, it finds the best estimator for $\bar{D}_\lambda$, denoted by $h_\lambda$. Let $\mathcal{H}_\Lambda$ be the resulting set of hypotheses. The algorithm then finds the best hypothesis from this set using $\hat{D}_0$. The algorithm is relatively parameter free and easy to implement.
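The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (which is given in their Figure 1): the cover is taken to be a regular grid on the simplex, and `fit` and `loss` are hypothetical user-supplied callables for weighted training and evaluation.

```python
from itertools import product
import numpy as np

def simplex_grid(p, k):
    """All weight vectors over p sources with entries in {0, 1/k, ..., 1}.

    Assumption: a regular grid stands in for the cover; enumeration is
    exponential in p, so this is only practical for a few sources.
    """
    return [np.array(c) / k for c in product(range(k + 1), repeat=p) if sum(c) == k]

def cmsa(sources, target, fit, loss, cover):
    """Train one hypothesis per mixture weight and keep the best on target data.

    sources: list of (X, y) pairs; target: (X0, y0).
    fit(X, y, w) returns a hypothesis minimizing the w-weighted training loss.
    loss(h, X, y) returns the average loss of h on (X, y).
    """
    X = np.concatenate([Xk for Xk, _ in sources])
    y = np.concatenate([yk for _, yk in sources])
    best = None
    for lam in cover:
        # Each sample of source k gets weight lam_k / m_k, so the weights
        # define the mixed empirical distribution for this lam.
        w = np.concatenate([np.full(len(yk), l / len(yk))
                            for (_, yk), l in zip(sources, lam)])
        h = fit(X, y, w)
        err = loss(h, *target)  # model selection on the target sample
        if best is None or err < best[0]:
            best = (err, lam, h)
    return best

# Toy usage: constant predictors under squared loss.
sources = [(np.zeros((2, 1)), np.zeros(2)), (np.ones((3, 1)), np.ones(3))]
target = (np.ones((4, 1)), np.full(4, 0.75))
fit = lambda X, y, w: (lambda Xq, c=np.average(y, weights=w): np.full(len(Xq), c))
loss = lambda h, X, y: float(np.mean((h(X) - y) ** 2))
err, lam, h = cmsa(sources, target, fit, loss, simplex_grid(2, 4))
```

In the toy usage the target labels equal a 1:3 mixture of the two source labels, so the selection step recovers the corresponding grid point.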

### 3.1 Generalization bounds for finite cover

We provide the following generalization guarantee for CMSA when $\Lambda$ is a finite cover of $\Delta_p$.

Theorem 3.1. Let $\Lambda$ be a minimal cover of $\Delta_p$ such that for each $\lambda \in \Delta_p$, there exists a $\lambda_\epsilon \in \Lambda$ such that $\|\lambda - \lambda_\epsilon\|_1 \le \epsilon$. Let . With probability at least $1 - \delta$, the output $h_m$ of CMSA satisfies

$$L_{D_0}(h_m) - \min_{h \in \mathcal{H}} L_{D_0}(h) \le \mathcal{E}(\lambda^*) + 2\epsilon M + \frac{2M \sqrt{p \log\frac{p}{\delta\epsilon}}}{\sqrt{m_0}}.$$

Due to space constraints, we provide the proof in Appendix B.1. Note that the guarantee for CMSA is close to that of the known mixture setting, where $\lambda^*$ is known. The algorithm finds a $\lambda$ that not only has small discrepancy to the distribution $D_0$, but also has small skewness and hence generalizes better. In particular, if there are multiple distributions that are very close to $D_0$, then it chooses the one that generalizes better. Furthermore, if there is a $\lambda^*$ such that $D_0 = D_{\lambda^*}$, then the algorithm chooses either $\lambda^*$ or another $\lambda$ that is slightly worse in terms of discrepancy, but generalizes much better.

Finally, the last term in Theorem 3.1 is the penalty for model selection; it depends only on the number of samples from $D_0$ and is independent of $m$. However, note that for this algorithm to be better than the bound for the local model (1), we need $p \ll d$, which is a fairly reasonable assumption in practice, as the number of domains is at most in the hundreds, while the number of model parameters in machine learning is typically in the millions. Furthermore, by combining the cover based bound in (7) with VC-dimension bounds, one can reduce the penalty of model selection to .

If the time complexity of finding $h_\lambda$ for a given $\lambda$ is $T$, then the time complexity of CMSA is $O(T \cdot |\Lambda|)$. Hence the algorithm is efficient for small values of $p$.
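To see why the approach is practical only for a small number of source domains, the sketch below counts the points of one possible cover: the regular grid on the simplex with step $1/k$. The grid is an assumption made for illustration; the count itself is the standard stars-and-bars formula, and the function name is hypothetical.

```python
from math import comb

def grid_cover_size(p, k):
    """Number of weight vectors over p sources with entries in {0, 1/k, ..., 1}.

    Counts the non-negative integer solutions of c_1 + ... + c_p = k.
    """
    return comb(k + p - 1, p - 1)

# One model is trained per grid point, so the training cost grows with this count.
sizes = {p: grid_cover_size(p, 10) for p in (2, 3, 5, 10)}
```

With step $1/10$, two sources need 11 models while ten sources already need 92378, so the number of training runs grows quickly with the number of sources at a fixed resolution.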

### 3.2 Generalization for infinite covers

Theorem 3.1 shows that the output of CMSA generalizes well for finite covers. We extend this result to the entire simplex $\Delta_p$. To prove generalization for CMSA, we need an additional assumption, that the loss function is strongly convex in the parameters of optimization. The generalization bound uses the following lemma, proved in Appendix B.2.

Lemma 3.2. Let $\lambda, \lambda' \in \Delta_p$, and $\ell$ be a $\mu$-strongly convex function whose gradient norms are bounded by $G$ for all $h \in \mathcal{H}$. Then for any distribution $D_0$,

$$L_{D_0}(h_\lambda) - L_{D_0}(h_{\lambda'}) \le \frac{G \sqrt{M}}{\sqrt{\mu}} \cdot \|\lambda - \lambda'\|_1^{1/2}.$$

The following theorem, proved in Appendix B.3, provides a generalization guarantee for CMSA.

Theorem 3.2. Let the assumptions in Lemma 3.2 hold. With probability at least $1 - \delta$, the output $h_m$ of CMSA satisfies

$$L_{D_0}(h_m) - \min_{h \in \mathcal{H}} L_{D_0}(h) \le \mathcal{E}(\lambda^*) + \min_{\epsilon \ge 0} \left( \frac{2 \sqrt{p \log\frac{G^2 M}{\epsilon^2 \mu \delta}}}{\sqrt{m_0}} + 2\epsilon M \right).$$

### 3.3 Lower bound

The bounds in Theorems 3.1 and 3.2 contain a model selection penalty of $\tilde{O}(\sqrt{p/m_0})$. Using an information theoretic bound, we show that any algorithm incurs a penalty of $\Omega(\sqrt{p/m_0})$ for some problem settings. We relegate the proof to Appendix C.

Theorem 3.3. For any algorithm $A$, there exists a set of hypotheses $\mathcal{H}$, a loss function $\ell$, and distributions $D_0, D_1, \ldots, D_p$, such that and the following holds. Given infinitely many samples from $D_1, \ldots, D_p$ and $m_0$ samples from $D_0$, the output $h_A$ of the algorithm satisfies

$$\mathbb{E}\big[L_{D_0}(h_A)\big] \ge \min_{h \in \mathcal{H}} L_{D_0}(h) + c \cdot \sqrt{\frac{p}{m_0}},$$

where $c$ is a universal constant.

### 3.4 A boosting-type algorithm

We ask if there are faster algorithms for solving CMSA($\Lambda$). One approach would be to consider a convex relaxation of CMSA($\Lambda$). We propose to minimize

$$L_{D_0}\Big(\sum_{\lambda \in \Lambda} \alpha_\lambda h_\lambda\Big), \tag{5}$$

subject to $\sum_{\lambda \in \Lambda} \alpha_\lambda = 1$ and, for all $\lambda \in \Lambda$, $\alpha_\lambda \ge 0$. Before we discuss optimization objectives, we first show that the above objective has a generalization error similar to that of CMSA($\Lambda$). We refer readers to Appendix B.4 for the proof. Let and $\ell$ be $L$-Lipschitz. With probability at least $1 - \delta$, the solution $h_m$ to (5) satisfies

$$L_{D_0}(h_m) - \min_{h \in \mathcal{H}} L_{D_0}(h) \le \mathcal{E}(\lambda^*) + 2\epsilon M + L \sqrt{\frac{2p \log\frac{p}{\epsilon}}{m_0}} + 2M \sqrt{\frac{\log\frac{1}{\delta}}{m_0}}.$$

Since the loss function is convex, (5) is convex in $\alpha$. However, the number of predictors is $|\Lambda|$, which can be potentially large. This scenario is very similar to boosting, where there are exponentially many learners and the goal is to find a combination that performs well. Boosting is generally defined for binary classification, whereas the problem of interest in general is a multi-class problem. Hence, we propose to use randomized or block randomized coordinate descent (RCD) (nesterov2012efficiency). The convergence guarantees follow from known results on RCD (nesterov2012efficiency).

Motivated by this, the algorithm proceeds as follows. Let $\lambda_t$ be the coordinate chosen at time $t$ and $\alpha_{\lambda_t}$, $h_{\lambda_t}$ be the corresponding mixture weight and the hypothesis at time $t$. We propose to find $\alpha_{t+1}$ and $\lambda_{t+1}$ as follows. The algorithm randomly selects values of $\lambda$, denoted by $S_{t+1}$, and chooses the one that minimizes

$$\alpha_{t+1}, \lambda_{t+1} = \operatorname*{argmin}_{\alpha,\, \lambda \in S_{t+1}} L_{\hat{D}_0}\Big(\sum_{i=1}^{t} \alpha_{\lambda_i} h_{\lambda_i} + \alpha h_\lambda\Big).$$

We refer to this algorithm as CMSA-Boost. It is known that the above algorithm converges to the global optimum (nesterov2012efficiency).
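A single run of the update above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: the base hypotheses are represented by their precomputed outputs on the target sample, the step size is chosen from a fixed grid instead of an exact line search, and all names are hypothetical.

```python
import numpy as np

def cmsa_boost(preds, y, loss, rounds=20, subset=5,
               alphas=np.linspace(0.0, 1.0, 21), seed=0):
    """Greedy randomized coordinate updates over precomputed base predictions.

    preds: array of shape (n_hypotheses, n_samples) holding the outputs of the
           base hypotheses on the target sample.
    loss(output, y): empirical loss of a combined output on the target labels.
    Assumption: alpha is searched over the grid `alphas`; since the grid
    contains 0, the empirical loss never increases from one round to the next.
    """
    rng = np.random.default_rng(seed)
    n = len(preds)
    weights = np.zeros(n)
    current = np.zeros_like(preds[0], dtype=float)
    for _ in range(rounds):
        # Sample a random subset of coordinates (mixture weights) for this round.
        cand = rng.choice(n, size=min(subset, n), replace=False)
        # Pick the (coordinate, step) pair with the lowest target loss.
        err, j, a = min(((loss(current + a * preds[j], y), j, a)
                         for j in cand for a in alphas), key=lambda t: t[0])
        weights[j] += a
        current = current + a * preds[j]
    return weights, current

# Toy usage: two constant base predictors under squared loss.
preds = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
y = np.array([0.5, 0.5, 0.5])
sq_loss = lambda out, y: float(np.mean((out - y) ** 2))
weights, combined = cmsa_boost(preds, y, sq_loss, rounds=5, subset=2)
```

The sketch does not enforce that the weights sum to one; it follows the additive form of the update, and a projection or renormalization step would be needed to satisfy the constraint of (5) exactly.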

In practice, for efficiency, we can use different sampling schemes. Suppose we have a hierarchical clustering of $\Lambda$. At each round, instead of randomly sampling a set of values of $\lambda$, we could sample values of $\lambda$, one from each cluster, and find the $\lambda$ with the maximum decrease in loss. We can then sample values of $\lambda$, one from each sub-cluster of the chosen cluster. This process is repeated until the reduction in loss is small, at which point we can choose the corresponding $\lambda$ as $\lambda_{t+1}$. This algorithm is similar to boosting with decision trees.

## 4 Alternative algorithms

A natural approach to find $\lambda$ is to use Bregman divergence based non-negative matrix factorization. However, such an algorithm would incur higher generalization errors; due to space constraints, we discuss it in Appendix D.1. We now discuss two algorithms based on discrepancy. However, as the bounds suggest, these may not be optimal for all problem scenarios.

### 4.1 A pairwise discrepancy approach

Discrepancy arises in classification tasks as a natural divergence between distributions. Hence, we can use discrepancy to find . One can assign higher weight to a source domain that is closer to the target distribution e.g., (wen2019domain; konstantinov2019robust). However, we demonstrate that this approach is not ideal in all scenarios with a simple example. Let and and . Furthermore let the number of samples from each source domain be very large. In this case, observe that . If we just use the pairwise discrepancies between and to set , then would satisfy , which is far from optimal. The example is illustrated in Figure 2.

Some of the recent works on MSDA with unlabeled target data use pairwise discrepancy to determine the mixture weights $\lambda$, see e.g., (wen2019domain). Recently, konstantinov2019robust proposed an algorithm based on pairwise discrepancy for learning from untrusted multiple sources, where limited labeled target data is available. They propose to assign $\lambda$ by minimizing

 p∑k=1λkdisc(Dk,D0)+γ√ms(λ||m),

for some regularization parameter . If the number of samples from each source domain is the same and if , then for any value , their approach assigns , which is sub-optimal for scenarios such as Example 4.1. Since the convergence guarantees of their proposed algorithms are based on pairwise discrepancies, loosely speaking, their bounds are tight in our formulation when is close to . However, for examples similar to above, such an algorithm would be sub-optimal.

### 4.2 A convex combination discrepancy-based algorithm

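Since pairwise discrepancies would result in identifying a sub-optimal $\lambda$, instead of just considering the pairwise discrepancies, one can consider the discrepancy between $D_0$ and any $D_\lambda$. Since

$$L_{D_0}(h) \le \min_{\lambda \in \Delta_p} \Big[ L_{D_\lambda}(h) + \operatorname{disc}_{\mathcal{H}}(D_0, D_\lambda) \Big],$$

and the learner has more data from $D_\lambda$ than from $D_0$, a natural algorithm is to minimize $L_{D_\lambda}(h) + \operatorname{disc}_{\mathcal{H}}(D_0, D_\lambda)$. However, note that this requires estimating both the discrepancy and the expected loss over $D_\lambda$. In order to account for both terms, we propose to minimize the upper bound on ,

$$\min_{\lambda} L_{\bar{D}_\lambda}(h) + C_\epsilon(\lambda), \tag{6}$$

where $C_\epsilon(\lambda)$ is given by

$$\operatorname{disc}_{\mathcal{H}}(\hat{D}_0, \bar{D}_\lambda) + \frac{c \sqrt{d + \log\frac{1}{\delta}}}{\sqrt{m_0}} + \epsilon M + \frac{c M \sqrt{s(\lambda \| \mathbf{m})}}{\sqrt{m}} \cdot \left( \sqrt{d \log\frac{em}{d} + p \log\frac{1}{\epsilon\delta}} \right),$$

for some constant $c$. Let $h_R$ be the solution to (6). We now give a generalization bound for the above algorithm (proof in Appendix D.2). With probability at least $1 - \delta$, the solution $h_R$ for (6) satisfies

$$L_{D_0}(h_R) \le \min_{h \in \mathcal{H}} L_{D_0}(h) + 2 \min_{\lambda} C_\epsilon(\lambda).$$

The above bound is comparable to that of the model trained on only target data, as $C_\epsilon(\lambda)$ contains a term of order $\sqrt{d/m_0}$, which can be large for small values of $m_0$. This bound can be improved on certain favorable cases when for some known . In this case if we use the same set of samples for and , then the bound can be improved to , which in favorable cases such that is large, yields a better bound than the target-only model.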

## 5 Experiments

We evaluated our algorithms on a standard MSDA dataset comprising four domains, MNIST (lecun-mnisthandwrittendigit-2010), MNIST-M (ganin_icml15), SVHN (netzer2011reading), and SynthDigits (ganin_icml15), by treating one of MNIST, MNIST-M, or SVHN as the target domain, and the rest as sources. We used the same preprocessing and data split as (zhao2018adversarial), i.e., labeled training samples for each domain when used as a source. When a domain is used as the target, we use the first examples from the . We also used the same convolutional neural network as the digit classification model in (zhao2018adversarial), with the exception that we used a regular ReLU instead of leaky ReLU. Unlike (zhao2018adversarial), we trained the models using stochastic gradient descent with a fixed learning rate without weight decay.

We also propose an additional gradient descent based algorithm, CMSA-min-max, for solving the CMSA objective. Due to space constraints, we describe the algorithm in Appendix E. The CMSA-min-max algorithm rewrites the objective of CMSA in a min-max formulation, which can be viewed as a two player game between a player who minimizes the loss on the target domain and a player who certifies whether the hypothesis belongs to . Such a min-max optimization can be solved using stochastic optimization techniques (NemirovskiYudin1983). However, CMSA-min-max is non-convex and currently we do not have any provable convergence guarantees.

We evaluated several baselines: best-single-source: best model trained only on one of the sources; combined-sources: model trained on the dataset obtained by concatenating all the sources; target-only: model trained only on the limited target data; sources+target: models trained by combining sources and target; sources+target (equal weight): models trained by combining sources and target where all of them get the same weight; and pairwise discrepancy: the pairwise discrepancy approach of konstantinov2019robust. The combined-sources, sources+target, and sources+target (equal weight) baselines involve data concatenation. For the pairwise discrepancy baseline and the proposed algorithms CMSA, CMSA-Boost, and CMSA-min-max, we report the better results of the following two approaches: one where all target samples are treated as and one where random samples are treated as a separate new source and samples are treated as samples from .

The results are in Table 1. The proposed algorithms perform well compared to the baselines. We note that CMSA-min-max performed better using all target samples as , whereas konstantinov2019robust, CMSA, and CMSA-Boost performed better using target samples as a separate new source domain. As expected, the performance of proposed algorithms is better than that of the unsupervised domain adaptation algorithms of (zhao2018adversarial) (see Table 2 in their paper), due to the availability of labeled target samples.

Figure 3 shows the performance of the CMSA as a function of the number of target samples. Of the three target domains, MNIST is the easiest domain and requires very few target samples to achieve good accuracy; and MNIST-M is the hardest and requires many target samples to achieve good accuracy. We omit the curves for CMSA-Boost and CMSA-min-max because they are similar.

## 6 Conclusion

We studied multiple-source domain adaptation with limited target labeled data. We proposed a model selection type algorithm, provided generalization guarantees, and showed that it is optimal by providing information theoretic lower bounds for any algorithm. We also proposed an efficient boosting algorithm with similar guarantees. We demonstrated the practicality of algorithms by evaluating them on synthetic and public datasets.

## Appendix A Proof of equation (3)

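By the definition of discrepancy,

$$L_{D_0}(h_{\bar{D}_\lambda}) \le L_{D_\lambda}(h_{\bar{D}_\lambda}) + \operatorname{disc}(D_\lambda, D_0).$$

Similarly,

$$\begin{aligned}
L_{D_0}(h_{D_0}) &= L_{D_0}(h_{D_0}) - L_{D_\lambda}(h_{D_0}) + L_{D_\lambda}(h_{D_0}) \\
&\ge L_{D_\lambda}(h_{D_0}) - \operatorname{disc}(D_\lambda, D_0) \\
&\ge L_{D_\lambda}(h_{D_\lambda}) - \operatorname{disc}(D_\lambda, D_0).
\end{aligned}$$

Combining the above two inequalities yields

$$L_{D_0}(h_{\bar{D}_\lambda}) - L_{D_0}(h_{D_0}) \le L_{D_\lambda}(h_{\bar{D}_\lambda}) - L_{D_\lambda}(h_{D_\lambda}) + 2 \operatorname{disc}(D_\lambda, D_0).$$

Next observe that, by rearranging terms,

$$\begin{aligned}
L_{D_\lambda}(h_{\bar{D}_\lambda}) - L_{D_\lambda}(h_{D_\lambda}) = {} & L_{D_\lambda}(h_{\bar{D}_\lambda}) - L_{\bar{D}_\lambda}(h_{\bar{D}_\lambda}) + L_{\bar{D}_\lambda}(h_{D_\lambda}) - L_{D_\lambda}(h_{D_\lambda}) \\
& + L_{\bar{D}_\lambda}(h_{\bar{D}_\lambda}) - L_{\bar{D}_\lambda}(h_{D_\lambda}).
\end{aligned}$$

However, by the definition of $h_{\bar{D}_\lambda}$,

$$L_{\bar{D}_\lambda}(h_{\bar{D}_\lambda}) \le L_{\bar{D}_\lambda}(h_{D_\lambda}).$$

Hence,

$$\begin{aligned}
L_{D_\lambda}(h_{\bar{D}_\lambda}) - L_{D_\lambda}(h_{D_\lambda}) &\le L_{D_\lambda}(h_{\bar{D}_\lambda}) - L_{\bar{D}_\lambda}(h_{\bar{D}_\lambda}) + L_{\bar{D}_\lambda}(h_{D_\lambda}) - L_{D_\lambda}(h_{D_\lambda}) \\
&\le 2 \sup_{h} \big| L_{\bar{D}_\lambda}(h) - L_{D_\lambda}(h) \big|,
\end{aligned}$$

where the last inequality follows by taking the supremum. Combining the above inequalities gives the proof.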

## Appendix B Proofs for the proposed algorithms

### B.1 Proof of Theorem 3.1

We first bound the number of elements in the cover . Consider the cover given as follows. For each coordinate , the domain weight belongs to the set , and is determined by the fact that . Note that and has at most elements and for every , there is a such that . Hence the size of the minimal cover is at most . Hence, by the McDiarmid’s inequality and the union bound, with probability at least ,

$$L_{D_0}(h_m) \le \min_{h \in \mathcal{H}_\Lambda} L_{D_0}(h) + \frac{2M \sqrt{p \log\frac{p}{\delta\epsilon}}}{\sqrt{m_0}}. \tag{7}$$

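Let $h_\lambda$ denote $h_{D_\lambda}$ and $\hat{h}_\lambda$ denote $h_{\bar{D}_\lambda}$. For any $\lambda \in \Delta_p$,

$$\begin{aligned}
\min_{h \in \mathcal{H}_\Lambda} L_{D_0}(h) - \min_{h \in \mathcal{H}} L_{D_0}(h)
&\overset{(a)}{\le} \min_{h \in \mathcal{H}_\Lambda} L_{D_\lambda}(h) - \min_{h \in \mathcal{H}} L_{D_\lambda}(h) + 2 \operatorname{disc}_{\mathcal{H}}(D_0, D_\lambda) \\
&\overset{(b)}{\le} L_{D_\lambda}(\hat{h}_{\lambda_\epsilon}) - L_{D_\lambda}(h_\lambda) + 2 \operatorname{disc}_{\mathcal{H}}(D_0, D_\lambda) \\
&\le L_{D_\lambda}(\hat{h}_{\lambda_\epsilon}) - L_{\bar{D}_\lambda}(\hat{h}_{\lambda_\epsilon}) + L_{\bar{D}_\lambda}(\hat{h}_{\lambda_\epsilon}) - L_{D_\lambda}(h_\lambda) + 2 \operatorname{disc}_{\mathcal{H}}(D_0, D_\lambda) \\
&\overset{(c)}{\le} \mathcal{E}(\lambda) + 2\epsilon M,
\end{aligned} \tag{8}$$

where (a) follows from the definition of discrepancy and (b) follows from the definition of $h_\lambda$. For (c), observe that by the definition of $\lambda_\epsilon$ and $M$,

$$L_{\bar{D}_\lambda}(\hat{h}_{\lambda_\epsilon}) \le L_{\bar{D}_{\lambda_\epsilon}}(\hat{h}_{\lambda_\epsilon}) + \epsilon M \le L_{\bar{D}_{\lambda_\epsilon}}(\hat{h}_\lambda) + \epsilon M \le L_{\bar{D}_\lambda}(\hat{h}_\lambda) + 2\epsilon M \le L_{\bar{D}_\lambda}(h_\lambda) + 2\epsilon M,$$

where the second inequality follows by observing that $\hat{h}_{\lambda_\epsilon}$ is the optimal estimator for $L_{\bar{D}_{\lambda_\epsilon}}$. The last inequality follows similarly. Combining equations (7) and (8) and taking the minimum over $\lambda$ yields the theorem.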

### B.2 Proof of Lemma 3.2

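By the strong convexity of $L_{\bar{D}_\lambda}$,

$$\begin{aligned}
L_{\bar{D}_\lambda}(h_{\lambda'}) - L_{\bar{D}_\lambda}(h_\lambda) &\ge \nabla L_{\bar{D}_\lambda}(h_\lambda) \cdot (h_{\lambda'} - h_\lambda) + \frac{\mu}{2} \|h_{\lambda'} - h_\lambda\|^2 \\
&= \frac{\mu}{2} \|h_{\lambda'} - h_\lambda\|^2,
\end{aligned}$$

where the equality follows from the definition of $h_\lambda$. Similarly, since the function is bounded by $M$,

$$\begin{aligned}
L_{\bar{D}_\lambda}(h_{\lambda'}) - L_{\bar{D}_\lambda}(h_\lambda) &\le L_{\bar{D}_{\lambda'}}(h_{\lambda'}) - L_{\bar{D}_{\lambda'}}(h_\lambda) + \|\lambda - \lambda'\|_1 M \\
&\le -\nabla L_{\bar{D}_{\lambda'}}(h_{\lambda'}) \cdot (h_{\lambda'} - h_\lambda) - \frac{\mu}{2} \|h_{\lambda'} - h_\lambda\|^2 + \|\lambda - \lambda'\|_1 M \\
&= -\frac{\mu}{2} \|h_{\lambda'} - h_\lambda\|^2 + \|\lambda - \lambda'\|_1 M.
\end{aligned}$$

Combining the above equations,

$$\mu \|h_{\lambda'} - h_\lambda\|^2 \le M \|\lambda - \lambda'\|_1.$$

Hence for any distribution $D_0$,

$$\begin{aligned}
\big| L_{D_0}(h_{\lambda'}) - L_{D_0}(h_\lambda) \big| &\le \big| \nabla L_{D_0}(h_\lambda) \cdot (h_{\lambda'} - h_\lambda) \big| \\
&\le \big| \nabla L_{D_0}(h_D) \cdot (h_{\lambda'} - h_\lambda) \big| \\
&= G \|h_{\lambda'} - h_\lambda\| \\
&= \frac{G \sqrt{M}}{\sqrt{\mu}} \cdot \|\lambda - \lambda'\|_1^{1/2}.
\end{aligned}$$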

### B.3 Proof of Theorem 3.2

Let be the minimal cover of in the distance such that any two elements of the cover has distance at most . Such a cover will have at most elements. Hence, by Lemma 3.2, McDiarmid’s inequality, together with union bound over the above cover, we get with probability at least ,

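$$\begin{aligned}
L_{D_0}(h_m) - \min_{h \in \mathcal{H}_{\Delta_p}} L_{D_0}(h) &\le 2 \max_{h \in \mathcal{H}_{\Delta_p}} \big( L_{D_0}(h) - L_{\hat{D}_0}(h) \big) \\
&\le 2 \max_{h \in \mathcal{H}_\Lambda} \big( L_{D_0}(h) - L_{\hat{D}_0}(h) \big) + 2\epsilon M \\
&\le \frac{2M \sqrt{p \log\frac{G^2 M^2}{\epsilon^2 \mu \delta}}}{\sqrt{m_0}} + 2\epsilon M.
\end{aligned} \tag{9}$$

Similar to the proof of Theorem 3.1,

$$\min_{h \in \mathcal{H}_{\Delta_p}} L_{D_0}(h) - \min_{h \in \mathcal{H}} L_{D_0}(h) \le \mathcal{E}(\lambda). \tag{10}$$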

Combining (9) and (10) and taking minimum over , yields the theorem.

### B.4 Proof of Lemma 5

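Let $\operatorname{conv}(\mathcal{H}_\Lambda)$ denote the convex hull of $\mathcal{H}_\Lambda$. We show that

$$L_{D_0}(h_m) - \min_{h \in \operatorname{conv}(\mathcal{H}_\Lambda)} L_{D_0}(h) \le L \sqrt{\frac{2p \log\frac{p}{\epsilon}}{m_0}}.$$

The rest of the proof is similar to that of Theorem 3.1 and is omitted. For any algorithm output that is trained over ,

$$L_{D_0}(h_m) - \min_{h \in \operatorname{conv}(\mathcal{H}_\Lambda)} L_{D_0}(h) \le 2 \max_{h \in \operatorname{conv}(\mathcal{H}_\Lambda)} \big| L_{D_0}(h) - L_{\hat{D}_0}(h) \big|.$$

By McDiarmid's inequality, with probability at least $1 - \delta$,

$$\max_{h \in \operatorname{conv}(\mathcal{H}_\Lambda)} \big| L_{D_0}(h) - L_{\hat{D}_0}(h) \big| \le \mathbb{E} \max_{h \in \operatorname{conv}(\mathcal{H}_\Lambda)} \big| L_{D_0}(h) - L_{\hat{D}_0}(h) \big| + 2M \sqrt{\frac{\log\frac{1}{\delta}}{m_0}}.$$

By the definition of Rademacher complexity,

$$\mathbb{E} \max_{h \in \operatorname{conv}(\mathcal{H}_\Lambda)} \big| L_{D_0}(h) - L_{\hat{D}_0}(h) \big| \le \mathfrak{R}\big(\operatorname{conv}(\ell(\mathcal{H}_\Lambda))\big).$$

Since the Rademacher complexity of a convex hull is the same as the Rademacher complexity of the class,

$$\mathfrak{R}\big(\ell(\operatorname{conv}(\mathcal{H}_\Lambda))\big) \le L\, \mathfrak{R}\big(\operatorname{conv}(\mathcal{H}_\Lambda)\big) = L\, \mathfrak{R}(\mathcal{H}_\Lambda) \le L \sqrt{\frac{2p \log\frac{p}{\epsilon}}{m_0}}.$$

Hence the lemma.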

## Appendix C Proof of Theorem 3.3

Let be a multiple of four. Let and . For all , and , let . For every even , let

and for every odd

,