# On Fundamental Limits of Robust Learning

We consider the problems of robust PAC learning from distributed and streaming data, which may contain malicious errors and outliers, and analyze their fundamental complexity questions. In particular, we establish lower bounds on the communication complexity for distributed robust learning performed on multiple machines, and on the space complexity for robust learning from streaming data on a single machine. These results demonstrate that gaining robustness of learning algorithms is usually at the expense of increased complexities. As far as we know, this work gives the first complexity results for distributed and online robust PAC learning.


## 1 Introduction

The last decade has witnessed tremendous growth in the amount of data involved in machine learning tasks. In many cases, the data volume has outgrown the memory capacity of a single machine, and it is increasingly common that learning tasks are performed in a distributed fashion on many machines [2, 5, 25] or in an online fashion on a single machine [15].

This gives rise to the following fundamental, yet rarely investigated, complexity questions for distributed and online learning algorithms. Consider a concrete example (our analysis goes beyond this case): suppose the positive and negative examples are stored on two separate machines. How much communication is necessary to learn a good hypothesis to a specified error rate? What about the case where more machines are involved? In addition, if the positive and negative examples arrive at a single machine in arbitrary order, how much memory must be used to store the information necessary to learn a good hypothesis? If multiple passes over the data are allowed, how does the memory cost scale?

In this work, we investigate the above communication and space complexities of distributed and online probably approximately correct (PAC) learning [19], in the presence of malicious errors (outliers) in the training examples. PAC learning with malicious errors was first proposed in [20] and then generalized in [10]. In this learning setting, there is a fixed probability of an error occurring independently on each request for a training example. This error may be of an arbitrary nature. In particular, it may be chosen by an adversary with unbounded computational resources and exact knowledge of the target representation, the target distributions, and the current internal state of the learning algorithm. Though various robust learning algorithms have been proposed in the literature to tackle outliers in the data, the fundamental limits of these algorithms, as well as of the problem itself, in terms of communication and space complexity are rarely investigated.

### 1.1 Our Contributions

We analyze and provide general lower bounds on the amount of communication and memory cost needed to learn a good hypothesis, from either distributed or streaming data, in the presence of malicious errors and outliers.

Our results reveal the connection between the communication complexity and the VC-dimension $d$ of the hypothesis to learn, as well as the outlier probability $\lambda$. We first derive a lower bound of $\Omega\left(\frac{1-\epsilon}{1-\lambda}\, d\right)$ with the specified error rate $\epsilon$, for the simplest distributed learning protocol, in which only two machines are involved and only a single message may be communicated. We then extend the lower bound analysis to more general distributed learning protocols, allowing $k$-machine and $t$-round communications, and obtain a general lower bound of $\Omega\left(\frac{1-2\epsilon}{1-\lambda}\cdot\frac{dk}{t^2}\right)$. All these lower bounds contain an additional factor proportional to $\frac{1}{1-\lambda}$, which explains the extra communication expense brought by the outliers in the learning process.

We also analyze $r$-pass online robust PAC learning algorithms. We demonstrate that their space complexity lower bounds can be conveniently deduced from the communication complexity results for distributed learning algorithms, and hence obtain the space complexity lower bound stated in Theorem 6.

Finally, we provide a communication-efficient distributed robust PAC learning algorithm along with its performance guarantee. Experimental studies on synthetic and real data demonstrate that the proposed algorithm consumes significantly less communication to achieve accuracy comparable to that of a naive distributed learning algorithm, which aggregates all the examples on a single machine.

### 1.2 Related Works

Since its introduction in [23], communication complexity [13] has proven to be a powerful technique for establishing lower bounds in a variety of settings, including distributed [25, 7, 9] and streaming data models [15].

Several recent works on communication complexity for distributed PAC learning include [2, 5, 6], and there are also some works on distributed statistical estimation [25, 7]. In particular, Duchi et al. [8] demonstrated an analysis tool based on information-theoretic Fano inequalities. Garg et al. [9] investigated how the communication cost scales with the parameter dimensionality for distributed statistical estimation. Kremer et al. [12] provide a connection between communication complexity and VC-dimension. However, those works generally focus on the case where no outliers or malicious errors are present in the learning process.

On the other hand, studies on streaming algorithms focus on the scenario where input arrives very rapidly and there is limited memory to store the input [15], and investigate the necessary space cost under the most adversarial data orders for various data mining problems, including quantile queries [25], frequent elements queries [14, 1], regression [22] and model monitoring [11]. Those algorithms do not consider the case with malicious errors either.

## 2 Problem Setup and Preliminaries

#### Robust PAC Learning

In a classical distribution-free learning setting [19], the probably approximately correct (PAC) learning model typically assumes that the oracles Pos and Neg always faithfully return positive and negative examples from a sample domain, drawn according to the respective target distributions. However, in many real environments, it is possible that an erroneous or even adversarial example is given to the learning algorithm. In this work we consider such a generalized PAC learning problem, termed robust PAC learning, where malicious outliers may be present in the training examples [10].

In particular, we consider the following erroneous oracles in robust PAC learning. For an error probability $\lambda$, we have access to two noisy oracles, a noisy version of Pos and a noisy version of Neg, which behave as follows: when the noisy positive oracle is called, with probability $1-\lambda$ it returns a positive example as in the error-free model. But with probability $\lambda$, an example about which absolutely no assumptions can be made is returned. In particular, this example may be dynamically and maliciously chosen by an adversary. The noisy negative oracle behaves similarly.
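The oracle behavior described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function `noisy_oracle`, the Gaussian clean sampler, and the fixed adversarial point are hypothetical choices, not taken from the paper.

```python
import random

def noisy_oracle(clean_sampler, adversary, lam, rng):
    """Return one example: faithful with probability 1 - lam, arbitrary otherwise."""
    if rng.random() < lam:
        return adversary()       # no assumption can be made about this example
    return clean_sampler()       # behaves like the error-free oracle

rng = random.Random(0)

# Hypothetical positive-class distribution and a hypothetical adversarial outlier.
clean_pos = lambda: (rng.gauss(2.0, 1.0), +1)
adversary = lambda: (-100.0, +1)

draws = [noisy_oracle(clean_pos, adversary, 0.1, rng) for _ in range(10000)]
outlier_fraction = sum(x == -100.0 for x, _ in draws) / len(draws)
```

With an error probability of 0.1, `outlier_fraction` should come out close to 0.1. A real adversary may of course adapt its examples to the learner's state, which this fixed-point adversary does not model.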

Formally, the task of a robust PAC learning algorithm is to find a hypothesis $h$, with access only to the above noisy oracles and with predefined parameters $\epsilon$ and $\delta$, such that for any input values of $\epsilon$ and $\delta$, the output hypothesis $h$ has error at most $\epsilon$ with probability at least $1-\delta$. Throughout the paper, we assume a constant failure probability $\delta$ for simplicity of analysis. Note that the outlier rate that can be tolerated is always bounded in terms of $\epsilon$, as proved in [10].

#### Distributed and Online PAC Learning

As mentioned above, in this work we consider a realistic case where the entire data set is too large to store on a single machine. The following two learning frameworks are therefore the natural choices for dealing with the scale of the dataset:

1. Distributed robust learning. Data are distributed on $k$ machines. The problem of interest is how to communicate among the machines, and especially how much communication is necessary to produce a low-error hypothesis. Namely, we are concerned with the communication complexity of distributed robust learning.

2. Online robust learning. Data are revealed sequentially (in multiple passes) to a single machine. The question we want to answer is how much memory is necessary to maintain the intermediate information in order to obtain a low-error hypothesis, i.e., the space complexity of online robust learning.

We regard communication (the number of bits communicated) and memory as limited resources, and therefore we want to optimize not only the hypothesis error but also the resource cost of the whole learning procedure. We aim at providing a fundamental understanding of the communication and space cost necessary to learn a hypothesis with a specified error $\epsilon$, in the presence of a constant outlier fraction $\lambda$.

#### Communication Protocols and Complexity

In the above distributed robust learning setting, several different communication protocols among the machines can be employed. For instance, in a $1$-round $2$-machine public-randomness protocol between two machines $M_1$ and $M_2$, denoted as $M_1 \to M_2$, $M_1$ is only allowed to send a single message to $M_2$, which must then be able to output the hypothesis $h$. Similarly, when multiple rounds of communication are allowed between $M_1$ and $M_2$, i.e., multiple messages can be exchanged between the two machines, we have a $t$-round $2$-machine protocol, which is denoted as $M_1 \leftrightarrow M_2$. Extending the above protocols to the cases involving $k$ machines gives the $1$-round $k$-machine and the $t$-round $k$-machine protocols respectively.

The communication complexity of a protocol is the minimal number of bits that must be exchanged between machines for learning the hypothesis $h$. Assuming a specified error rate $\epsilon$ for the learned hypothesis $h$, we specifically consider the following communication complexities: $R^{M_1 \to M_2}_{\epsilon}(h)$, the communication complexity of the randomized $1$-round $2$-machine protocol, and $R^{M_1 \leftrightarrow M_2}_{\epsilon}(h)$, the complexity of the $M_1 \leftrightarrow M_2$ protocol.

We start the communication complexity analysis from the case where there are two machines and the $1$-round protocol is employed. After demonstrating the lower bound on the complexity $R^{M_1 \to M_2}_{\epsilon}(h)$, we proceed to analyze the complexity $R^{M_1 \leftrightarrow M_2}_{\epsilon}(h)$. The complexity lower bounds for other protocols, such as $k$-machine $1$-round and $k$-machine $t$-round protocols, are provided in the appendix.

## 3 Main Results on Distributed Robust PAC Learning

In this section we analyze the lower bounds on the communication cost for distributed robust PAC learning. We then extend the results to an online robust PAC learning setting in Section 4, through a one-direction equivalent lemma between distributed and online learning, in terms of communication complexity and space complexity.

### 3.1 Communication Models

For distributed learning, popular communication models include: (1) Blackboard model: any message sent by a machine is written on a blackboard visible to all machines; (2) Message passing model: a machine sending a message specifies another machine that will receive this message; (3) Coordinator model: there is an additional machine called the coordinator, who receives no input. Machines can only communicate with the coordinator, and not with each other directly.

We will focus on the message-passing model and the coordinator model, since the blackboard model may introduce extra communication overhead and is less practical. Note that the coordinator model is almost equivalent to the message-passing model, up to a multiplicative factor of $2$: instead of machine $i$ sending a message to machine $j$, machine $i$ can transmit the message to the coordinator, and the coordinator forwards it to machine $j$. Therefore, in the following sections, we mainly provide the complexity results for the message-passing model, which can then be used to deduce the complexities of the coordinator model easily.
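The relay argument can be made concrete with a small accounting sketch in Python. The message list and function names below are hypothetical and serve only to illustrate that relaying through a coordinator transmits every message twice.

```python
def message_passing_cost(messages):
    """Total bits when each sender transmits directly to its receiver."""
    return sum(bits for _src, _dst, bits in messages)

def coordinator_cost(messages):
    """Total bits when each message goes sender -> coordinator -> receiver."""
    return sum(2 * bits for _src, _dst, bits in messages)

# Hypothetical traffic: (sender, receiver, message size in bits).
messages = [(1, 2, 64), (2, 3, 128), (3, 1, 32)]
direct = message_passing_cost(messages)
relayed = coordinator_cost(messages)
```

Here `relayed` is exactly twice `direct`, so a lower bound for one model carries over to the other up to a constant factor.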

### 3.2 Lower Bound for 1-Round 2-Party Communication

We now derive the main results of this work, the communication complexity lower bounds for distributed robust PAC learning: how much communication is needed to learn a hypothesis with a specified error rate $\epsilon$. Since we are mainly concerned with how the communication complexity scales with $\epsilon$, the outlier fraction $\lambda$ and the VC-dimension $d$ of the hypothesis $h$, in developing the results we assume the failure probability is fixed as a constant.

We start with the simplest distributed learning setting: only two machines are involved and only a single message between them is allowed. We first construct a connection between the communication complexity and the VC-dimension [21] of the hypothesis to learn, for this $1$-round $2$-machine communication protocol.

The result is obtained by following Yao's principle [24] (Theorem 7 in the appendix) to investigate the communication complexity for a "difficult" distribution of the examples over the two machines, which then provides a communication complexity lower bound for general example distributions over the machines. The lower bound on the communication complexity for the specific "difficult" sample distribution is constructed by limiting the information capacity (depending on the VC-dimension $d$) of the communication channel between the two machines, and we show that with such limited capacity, the inevitable encoding error results in a lower-bounded error rate of the learned hypothesis $h$. Based on this technique, we obtain the lower bound on the communication complexity of the protocol $M_1 \to M_2$. More details on the tools and proofs for the following claim are provided in the appendix.

###### Theorem 1 (Communication Complexity Lower Bound for 1-Round 2-Machine Protocol).

For learning a hypothesis $h$ with a constant error $\epsilon$, where the oracles have an outlier rate of $\lambda$, we have the following communication complexity lower bound for the $1$-round $2$-machine communication protocol,

$$R^{M_1 \to M_2}_{\epsilon}(h) = \Omega\left(\frac{1-\epsilon}{1-\lambda}\, d\right),$$

for a constant success probability. Here $d$ is the VC-dimension of the hypothesis $h$.

From the above lower bound, one can observe that the number of bits that must be communicated between the two machines increases monotonically with the outlier rate $\lambda$ when the error rate $\epsilon$ is fixed, and decreases monotonically with the specified error rate $\epsilon$. When there are no outliers in the samples, i.e., $\lambda = 0$, the lower bound for learning a hypothesis with constant error $\epsilon$ reduces to $\Omega(d)$, which matches the result provided in [2] for faithful oracles.
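The monotonicity claims can be checked numerically. The following is a minimal sketch that reads the bound of Theorem 1 as $\frac{1-\epsilon}{1-\lambda} d$ with the hidden constant dropped; the function name and the sample values are hypothetical.

```python
def one_round_bound(eps, lam, d):
    """Theorem 1 bound up to constants: (1 - eps) / (1 - lam) * d."""
    if not (0 < eps < 1 and 0 <= lam < 1):
        raise ValueError("expected 0 < eps < 1 and 0 <= lam < 1")
    return (1 - eps) / (1 - lam) * d

d = 10
no_outliers = one_round_bound(0.1, 0.0, d)      # (1 - eps) * d, i.e. Omega(d)
few_outliers = one_round_bound(0.1, 0.02, d)
more_outliers = one_round_bound(0.1, 0.05, d)
looser_error = one_round_bound(0.2, 0.05, d)
```

The values grow as the outlier rate increases and shrink as the allowed error increases, in line with the discussion above.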

Applying Theorem 1 directly provides a communication lower bound for learning linear half spaces in the real space $\mathbb{R}^p$, whose VC-dimension is known to be $p+1$.

###### Corollary 1.

For the hypothesis $\hbar$ of linear half spaces in $\mathbb{R}^p$, applying the above theorem shows that the one-round communication complexity between two machines to learn a half space is lower bounded as

$$R^{M_1 \to M_2}_{\epsilon}(\hbar) = \Omega\left(\frac{1-\epsilon}{1-\lambda}\,(p+1)\right),$$

for a constant success probability.

In Section 3.5, we compare the above lower bound for half space learning with the bound given in [5], and we propose a communication-efficient distributed half-space learning algorithm, whose communication complexity is close to the above lower bound.

### 3.3 Lower Bound for t-Round 2-Machine Communication

We then consider a more complicated and practical communication protocol, which allows the two machines to exchange messages in turn. Such a protocol is called a $t$-round $2$-machine protocol. Here we assume, by convention, that one machine sends the first message and the recipient of the last message announces the learned hypothesis $h$. To lower bound the complexity of multiple-round communication, we need to investigate the size of the message sent in each round and invoke the technique of round elimination. We now elaborate on how to obtain the desired complexity bound.

We first provide a more detailed description on the multiple round communication protocol, which explicitly specifies the size of message sent in each round.

###### Definition 1.

A $t$-round communication protocol started by $M_1$ (resp. $M_2$) with message sizes $\ell_1, \dots, \ell_t$ is the protocol where the machine $M_1$ (resp. $M_2$) starts the communication, the size of the $i$-th message is $\ell_i$ bits, and the communication goes on for $t$ rounds.

The communication complexity for such a protocol is then proportional to $\sum_{i=1}^{t} \ell_i$. We employ a round elimination technique [18] to deduce the communication complexity lower bound for a $t$-round protocol from that of a $(t-1)$-round protocol. After applying round elimination $t-1$ times, we only need to bound the complexity of a $1$-round protocol, which has been provided in Theorem 1.

The round elimination, which is formally described by the following lemma, is based on the intuition that the existence of a "good" $t$-round protocol with $M_1$ starting implies that there exists a "good" $(t-1)$-round protocol where $M_2$ communicates first.

###### Lemma 1 (Round Elimination, adapted from [18]).

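Assume $h$ is a hypothesis to learn. Suppose the communication between two machines has a $t$-round protocol, started by $M_1$ with message sizes $\ell_1, \dots, \ell_t$, which produces a hypothesis with error less than $\epsilon$. Then there is a $(t-1)$-round protocol, started by $M_2$, for learning $h$ with error less than $\epsilon + \sqrt{\ell_1 / d}$. Here $d$ is the VC-dimension of $h$.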

Applying the above round elimination lemma, we can reduce any $t$-round communication protocol to a $1$-round communication protocol. Applying the communication complexity bound for the $1$-round protocol in Theorem 1 then gives the following lower bound.

###### Theorem 2 (Communication Complexity Lower Bound for t-Round 2-Machine Protocols).

Let $\epsilon$ be the specified error rate of $h$. Let $\lambda$ be the outlier rate in the oracles. Consider a $t$-round communication protocol to learn the hypothesis $h$. Then the communication complexity is lower bounded as

$$R^{M_1 \leftrightarrow M_2}_{\epsilon} = \Omega\left(\frac{1-\epsilon}{1-\lambda}\cdot\frac{d}{t^2}\right),$$

for a constant success probability. Here $d$ is the VC-dimension of $h$.

### 3.4 Lower Bound for t-Round k-Machine Communication

We now proceed to obtain the communication complexity lower bound for the more general protocol $M_1 \leftrightarrow M_2 \leftrightarrow \dots \leftrightarrow M_k$. To extend the above lower bound for $2$-machine communication to the $k$-machine case, we use the symmetrization technique [17] to reduce $k$-machine communication to $2$-machine communication. The symmetrization constructs a $2$-machine protocol from a $k$-machine protocol and relates the communication costs of the two in expectation, where the expectation is taken over the randomness of the samples returned by the oracles. We can then derive the bound for the $k$-machine protocol from the bound for the $2$-machine protocol. The result is formally stated in the following theorem, whose proof is given in the appendix.

###### Theorem 3 (Communication Complexity Lower Bound for 1-Round k-Machine Protocols).

Let $\epsilon$ be the specified error rate of the learned hypothesis $h$. Let $\lambda$ be the outlier rate in the oracles. For a $1$-round $k$-machine protocol, the communication complexity is lower bounded as

$$R^{M_1 \to \dots \to M_k}_{\epsilon} = \Omega\left(\frac{1-2\epsilon}{1-\lambda}\, dk\right),$$

for a constant success probability. Here $d$ is the minimal VC-dimension of the hypothesis $h$.

In the above theorem, we observe a factor of $1-2\epsilon$ instead of $1-\epsilon$. It appears because of the symmetrization technique employed, and it introduces a gap with respect to the lower bound for the $2$-machine protocol. Narrowing this gap is left for future work.

Then, similarly to how the results in Theorem 2 and Theorem 3 were obtained, applying round elimination together with machine elimination by symmetrization gives a general lower bound on the communication complexity of the $t$-round $k$-machine protocol, as stated in the following theorem.

###### Theorem 4 (Communication Complexity Lower Bound for t-Round k-Machine Protocols).

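Let $\epsilon$ be the specified error rate of the learned hypothesis $h$. Let $\lambda$ be the outlier rate in the oracles. Then for a $t$-round $k$-machine protocol, the communication complexity is lower bounded as

$$R^{M_1 \leftrightarrow M_2 \leftrightarrow \dots \leftrightarrow M_k}_{\epsilon} = \Omega\left(\frac{1-2\epsilon}{1-\lambda}\cdot\frac{dk}{t^2}\right),$$

for a constant success probability. Here $d$ is the minimal VC-dimension of the hypothesis $h$.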
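To see how the number of rounds and the number of machines enter the bounds of this section, the sketch below evaluates them with constants dropped. It assumes the two-machine bounds scale as $\frac{1-\epsilon}{1-\lambda}\cdot\frac{d}{t^2}$ and the $k$-machine bounds as $\frac{1-2\epsilon}{1-\lambda}\cdot\frac{dk}{t^2}$; the function name, the `multi_machine` flag and the parameter values are hypothetical.

```python
import math

def lower_bound(eps, lam, d, k=2, t=1, multi_machine=False):
    """Lower bound on communicated bits, up to constant factors.

    multi_machine=False uses the two-machine form (factor 1 - eps, no k).
    multi_machine=True uses the k-machine form (factor 1 - 2*eps, times k).
    """
    if multi_machine:
        return (1 - 2 * eps) / (1 - lam) * d * k / t ** 2
    return (1 - eps) / (1 - lam) * d / t ** 2

eps, lam, d = 0.1, 0.05, 11
one_round = lower_bound(eps, lam, d, t=1)
two_rounds = lower_bound(eps, lam, d, t=2)
four_machines = lower_bound(eps, lam, d, k=4, t=1, multi_machine=True)
eight_machines = lower_bound(eps, lam, d, k=8, t=1, multi_machine=True)
```

Doubling the number of rounds divides the bound by four, while doubling the number of machines doubles it.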

### 3.5 A Communication-Efficient Protocol: Weighted Sampling

We present an efficient distributed PAC learning algorithm, focusing on the $2$-machine case, where the machines are denoted as $M_1$ and $M_2$. The proposed algorithm is mainly based on communicating learned hypotheses and difficult examples, similar to boosting: each hypothesis learned by one machine is communicated to the other machine and tested there. The difficult examples on the second machine are then randomly sampled with probability proportional to how many times the hypotheses fail on them, and communicated back to the first machine to update the hypothesis. This two-machine protocol is developed based on the algorithm proposed in [5].
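The sampling step described above can be illustrated as follows. This is a minimal sketch and not the paper's Alg. 1 or Alg. 2: the function `sample_difficult`, the sampling with replacement, and the toy failure counts are assumptions made for illustration.

```python
import random

def sample_difficult(examples, fail_counts, sample_size, rng):
    """Draw examples with probability proportional to their failure counts.

    Sampling is with replacement; examples never misclassified so far
    (failure count 0) are never selected.
    """
    if sum(fail_counts) == 0:
        return []  # all hypotheses so far classify the local data correctly
    return rng.choices(examples, weights=fail_counts, k=sample_size)

rng = random.Random(1)
examples = ["a", "b", "c", "d"]     # hypothetical local examples
fail_counts = [0, 1, 0, 3]          # how often past hypotheses failed on each
sample = sample_difficult(examples, fail_counts, 1000, rng)
```

In this toy run only "b" and "d" are ever sent back, and "d" appears roughly three times as often as "b".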

The following lemma from [5] states that random weighted sampling can always find a good representative subset of the entire set held by a machine.

###### Lemma 2.

Let a machine hold a weighted set of points with an associated weight function. For any given constant, the machine can send a subset, of the size specified in [5], such that any hypothesis that correctly classifies all points in the subset will misclassify points in the full set with a total weight bounded in terms of that constant. The subset can be constructed by a weighted random sample from the full set as in Alg. 2, which succeeds with constant probability.

Based on Lemma 2, we can obtain the following communication complexity result for Algorithm 1.

###### Theorem 5.

The two-machine two-way protocol Weighted Sampling in Alg. 1 for linear separator in mis-classifies at most points after rounds and uses bits of communication.

Comparing the communication complexity of Ws in Theorem 5 with the upper bound in Corollary 3 (where the sample description length is and ) provides that Ws reduces the communication complexity by a factor of . This demonstrates Ws is much more communication efficient than naively aggregating sufficiently many samples.

Recall that the VC-dimension of a linear classifier in $\mathbb{R}^p$ is $p+1$. Applying Theorem 2 with $d = p+1$ gives a communication complexity lower bound of $\Omega\left(\frac{1-\epsilon}{1-\lambda}\cdot\frac{p+1}{t^2}\right)$ for learning linear classifiers from two machines. Omitting constant factors, a gap remains between this theoretical lower bound and the practical complexity of Ws in Theorem 5. The gap comes from the algorithm Mwu sampling a data subset whose size depends on the problem parameters. It also shows that there is room to improve the communication efficiency of the Ws algorithm, if a data subset whose size is independent of those parameters can be constructed.

This $2$-machine communication protocol for distributed classifier learning can be straightforwardly generalized to $k$-machine communication, using the coordinator communication model as follows. Fix an arbitrary machine as the coordinator for the other $k-1$ machines. The coordinator runs the $2$-machine communication protocol from the perspective of the first machine, and the other machines serve jointly as the second machine. Each other machine reports the total weight of its data to the coordinator, which then reports back to each machine what fraction of the total weight it owns. Then each machine sends the coordinator a random sample (with the sample size specified in Alg. 2). The coordinator learns a classifier on the union of its own sample and the joint samples from the other machines, and sends the updated classifier back to all the machines.

## 4 Online Robust PAC Learning

We briefly discuss the space complexity of online robust PAC learning. We use our results on communication complexity from the previous sections to derive lower bounds on the space complexity of robust PAC learning in the online setting. Space complexity has been extensively investigated in the context of streaming methods [15]. However, a specific space complexity lower bound for robust learning is still absent.

The following lemma shows the connection between distributed and online learning algorithms, in particular between the communication complexity and the space complexity in an online data model.

###### Lemma 3.

Suppose that we can learn a hypothesis using an online algorithm that has bits of working storage and makes passes over the samples. Then there is a -machine distributed algorithm for learning that uses bits of communications.
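The idea behind this kind of reduction can be sketched as a simulation: two machines run the online algorithm over their concatenated data and hand over the working state whenever the stream crosses from one machine's data to the other's. The code below is a sketch under that standard construction, not the paper's proof; the function names and the toy update rule are hypothetical, and the exact bit count stated in the lemma is not reproduced here.

```python
def simulate_streaming_on_two_machines(update, state, part1, part2, passes):
    """Run a multi-pass online algorithm over part1 followed by part2.

    The working state is handed across machines at every boundary.
    Returns the final state and the number of state transfers.
    """
    transfers = 0
    for p in range(passes):
        for x in part1:               # machine 1 processes its samples
            state = update(state, x)
        transfers += 1                # state sent from machine 1 to machine 2
        for x in part2:               # machine 2 processes its samples
            state = update(state, x)
        if p < passes - 1:
            transfers += 1            # state sent back for the next pass
    return state, transfers

# Toy online algorithm: count the items seen and track the running maximum.
update = lambda st, x: (st[0] + 1, max(st[1], x))
final_state, transfers = simulate_streaming_on_two_machines(
    update, (0, float("-inf")), [3, 1], [7, 2], passes=3
)
```

In this simulation the state crosses the boundary `2 * passes - 1` times, so the total communication is that many copies of the working storage.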

The above lemma allows us to deduce the space lower bound for the online learning, given the communication complexity of distributed learning algorithms.

###### Theorem 6 (Space Complexity Lower Bound for r-Pass Online Learning).

Let be the specified error rate of learned . Let be the outlier rate in the oracles with . For a -pass online PAC learning algorithm, its space complexity is lower bounded as , where is the VC-dimension of hypothesis .

For an online PAC learning algorithm, in order to learn a hypothesis with an error at most , a space at the order of must be maintained. However, if we allow the data to be passed to the learning algorithm for passes, then the lower bound on the space cost can be reduced by a factor of .

## 5 Simulations

In this section, we present simulation studies on the distributed learning algorithm Ws, in Algorithm 1, for finding a linear classifier in $\mathbb{R}^p$ in two-machine and $k$-machine scenarios. We empirically compare it with a Naive approach, which sends all samples from the machines to a coordinator machine and then learns the classifier at the coordinator. For any dataset, this accuracy is the best possible.

For the two compared methods, a linear SVM is used as the underlying classifier. We report their training accuracies and communication costs. The cost of communicating one sample or one linear classifier in $\mathbb{R}^p$ is assumed to be $p+1$ ($p$ for describing the coordinates and $1$ for the sign or offset). For instance, the total communication cost of the Naive method is the number of samples sent by the machines multiplied by the description length $p+1$ of each sample. We always normalize the communication cost of the Naive method to $1$ and report the communication cost ratio of Ws to the Naive method.
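The cost accounting can be written out as a short sketch, assuming a description length of $p+1$ per sample or classifier. The function names and the numbers in the example are hypothetical and are not results from the paper.

```python
def description_length(p):
    """Cost of one sample or one linear classifier: p coordinates plus one extra."""
    return p + 1

def naive_cost(num_samples_sent, p):
    """Naive method: every sample is shipped to the coordinator."""
    return num_samples_sent * description_length(p)

def relative_cost(num_items_sent, num_samples_naive, p):
    """Cost of a protocol relative to the Naive method (Naive normalized to 1)."""
    protocol_cost = num_items_sent * description_length(p)
    return protocol_cost / naive_cost(num_samples_naive, p)

p = 10
baseline = naive_cost(1000, p)
# Hypothetical protocol run that sends 50 samples and 5 classifiers in total.
ratio = relative_cost(50 + 5, 1000, p)
```

Because samples and classifiers share the same description length, the ratio reduces to the number of items sent divided by the number of samples the Naive method would send.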

We report results for two-machine and four-machine protocols on both synthetic and real-world datasets. Four datasets, two each for two-machine and four-machine cases, are generated synthetically from mixture of two Gaussians with and two randomly generated diagonal matrices. Each Gaussian is carefully seeded to make sure the generated data from two components are separated well. Algorithms access training examples via noisy oracles and with a malicious error rate of and . That is, among the training examples, or of them come from a noisy distribution , which is significantly different from the above two Gaussian components. In addition, we conduct empirical studies on two real-world datasets from UCI repository, including the Cancer and Mushroom data sets. Statistics on the used data sets are given in Table 2 and corresponding results, including accuracies and relative communication costs (regarding the cost of Naive as ), are shown in Table 1. Observations on the results demonstrate: (1) the proposed multiplicative weighted sampling algorithm achieves prediction accuracy matching the best results provided by naive algorithm, and (2) increasing (error rate of oracles) generally brings extra communication cost to achieve comparable performance with the naive algorithm.

## 6 Conclusions

In this work, we provided several theoretical results regarding fundamental limits of the communication complexity for distributed PAC learning and the space complexity for online PAC learning, in the presence of malicious errors in the training examples. We demonstrated how the complexities increase along with the malicious error rates for various distributed learning settings, from the simplest two-machine one-round communication protocols to general multi-machine multi-round communication protocols. A connection between online learning and distributed learning was presented, which gives the space complexity for various online learning protocols. We also provided a boosting-flavored distributed robust linear classifier learning algorithm, Ws, which exhibits significantly higher communication efficiency than the naive distributed learning algorithm with negligible loss in classification accuracy.

## Appendix A Simulations

More details on the used datasets for the simulation evaluation are summarized in Table 2.

## Appendix B Communication Complexity Lower Bounds for Other Communication Protocols

### B.1 Lower Bound for 1-Round k-Machine Communication

Now we consider a slightly more general communication protocol than the $1$-round $2$-machine one, where $k$ machines are involved and only one round of communication is allowed between two neighboring machines in the communication sequence.

We first note that the multi-machine one-round protocol is not more efficient than the two-machine protocol [3], in terms of the maximal size of the message communicated. The total communication complexity lower bound for the $1$-round $k$-machine protocol is provided in Section 3.4.

To show this, we associate with every disjoint sample partition $I_1, \dots, I_k$ over $k$ machines a family of disjoint sample partitions for two machines: for $1 \le j < k$, $J_j^1 = I_1 \cup \dots \cup I_j$ and $J_j^2 = I_{j+1} \cup \dots \cup I_k$. We then have the following result.

###### Lemma 4.

For every hypothesis $h$, every disjoint input partition dividing the input into $I_1, \dots, I_k$, and every specified error rate $\epsilon$,

$$R^{I_1 \to \dots \to I_k,\,\max}_{\epsilon}(h) \ge \max_j R^{J_j^1 \to J_j^2}_{\epsilon}(h),$$

where the superscript $\max$ in the communication complexity denotes the length of the longest message.

###### Proof.

The two machines can simulate the $k$-machine protocol as follows: the first machine simulates the first $j$ machines; it then transmits to the second machine what the $j$-th machine would have sent to the $(j+1)$-st machine; the second machine completes the computation by simulating the last $k-j$ machines. ∎
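The simulation argument in this proof can be traced with a toy chain protocol. In the sketch below, the function names, the list-valued messages and the concatenating `step` rule are all hypothetical; the point is only that the single message crossing the cut in the two-machine simulation is one of the messages of the chain, so it cannot be longer than the longest chain message.

```python
def run_chain(parts, step, init):
    """One-round k-machine chain: each machine folds in its data and forwards.

    Returns the messages sent between consecutive machines and the final output.
    """
    msg, sent = init, []
    for i, part in enumerate(parts):
        msg = step(msg, part)
        if i < len(parts) - 1:
            sent.append(msg)
    return sent, msg

def run_two_machine(parts, step, init, j):
    """Two-machine simulation: the first machine plays machines 1..j, the second the rest."""
    msg = init
    for part in parts[:j]:
        msg = step(msg, part)
    transmitted = msg                 # the only message that crosses the cut
    for part in parts[j:]:
        msg = step(msg, part)
    return transmitted, msg

step = lambda msg, part: msg + part   # toy protocol: forward everything seen so far
parts = [[1], [2, 3], [4], [5, 6]]    # hypothetical partition over k = 4 machines
sent, chain_output = run_chain(parts, step, [])
longest = max(len(m) for m in sent)
cuts = [run_two_machine(parts, step, [], j) for j in range(1, len(parts))]
```

For every cut `j`, the transmitted message equals `sent[j - 1]` and the two-machine output equals the chain output.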

The result also holds for the $\mu$-distributional complexity for any input distribution $\mu$. It therefore follows from Theorem 1 that:

###### Corollary 2 (Communication Complexity Lower Bound on Maximum Message Size for 1-Round k-Machine Protocols).

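Let $\lambda$ be the outlier rate in the oracles. Let $\epsilon$ be the specified error rate of $h$. For every hypothesis $h$ and every disjoint input partition $I_1, \dots, I_k$, we have

$$R^{M_1 \to \dots \to M_k,\,\max}_{\epsilon}(h) = \Omega\left(\frac{1-\epsilon}{1-\lambda}\, d'\right),$$

for a constant success probability. Here $d'$ is the maximal VC-dimension among the hypotheses learned from $I_1, \dots, I_k$.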

## Appendix C Proofs of Main Results

### C.1 Tools

The following essential theorem is a consequence of the minmax theorem in [16], which states the randomized public-coin communication complexity is always lower bounded by the distribution complexity.

###### Theorem 7 (Yao’s principle, [24]).

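For every function $h$, and for every error rate $\epsilon$,

$$R^{A \to B,\,\mathrm{pub}}_{\epsilon}(h) = \max_{\mu} D^{A \to B,\,\mu}_{\epsilon}(h),$$

where $\mu$ ranges over all distributions on the inputs of $h$.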

This important result provides us with a convenient way to lower bound the communication complexity, by finding a "hard" distribution with respect to which the communication complexity is easy to evaluate. Throughout the paper, we usually take a product distribution, i.e., one under which the inputs held by the two machines are independent of each other, as the hard distribution $\mu$.

### C.2 Proof of Communication Complexity Lower Bound for 1-Round 2-Machine Protocol, Theorem 1

To prove the above lower bound, we need the following lemma, which says that a lower-bounded decoding error is inevitable when the communication channel capacity is limited. Define $\mathrm{dist}(z, U)$ to be the minimum Hamming distance between $z$ and a vector in $U$.

###### Lemma 5.

Suppose . For every set , , where with ,

$$\mathbb{E}[\mathrm{dist}(z, U)] \ge \frac{\epsilon-\lambda}{1-\lambda}\, d,$$

where the expectation is taken with respect to the uniform distribution over $z \in \{0,1\}^d$.

###### Proof.

For each , let . Denote , and we have . For each ,

$$|N_u| = \sum_{i=0}^{td} \binom{d}{i} \le \left(\frac{ed}{td}\right)^{td} = 2^{td\log(e/t)}.$$

Thus,

$$\sum_{u} |N_u| < 2^{cd + td\log(e/t)} \le 2^{d-1}.$$

The last inequality follows from the assumption that . We show that , and hence by Markov's inequality. ∎
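The Hamming-ball size bound used in the proof above can be checked numerically. In this minimal sketch, `radius` plays the role of $td$ and is assumed to be an integer between 1 and $d$; the function names are illustrative only.

```python
from math import comb, e


def hamming_ball_size(d: int, radius: int) -> int:
    """Number of vectors in {0,1}^d within Hamming distance `radius` of a fixed vector."""
    return sum(comb(d, i) for i in range(radius + 1))


def ball_size_upper_bound(d: int, radius: int) -> float:
    """The bound (e*d/radius)**radius, valid for 1 <= radius <= d."""
    return (e * d / radius) ** radius


def bound_holds(d: int) -> bool:
    """Check the bound for every admissible radius at dimension d."""
    return all(
        hamming_ball_size(d, r) <= ball_size_upper_bound(d, r)
        for r in range(1, d + 1)
    )
```

For example, `bound_holds(20)` evaluates the inequality for every radius from 1 to 20.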

Now we proceed to prove Theorem 1.

###### Proof.

According to Yao’s theorem (Theorem 7), , where the maximum is taken over all distributions .

In order to prove this lower bound, we describe a product distribution for which , where is the VC-dimension of . By definition of the VC-dimension, there exists a set of size which is shattered by , the hypothesis learned from . Namely, for every subset there exists such that iff . For each , fix such an . Let be the uniform distribution over the set of pairs . Let be the outliers.

Let be a single-round deterministic protocol for learning whose cost is at most . Thus, induces two mappings: determines which bits should send to for every given , and determines the value of computed by for every , given the bits sent by . Combining these two mappings, induces a mapping from into a set , where . The expected error of is

$$\epsilon(h) = (1-\lambda)\,\frac{1}{d\,2^d} \sum_{z\in\{0,1\}^d} \mathrm{dist}(z, P_{1,2}(z)) + \lambda,$$

where denotes the Hamming distance between the vectors. Then Lemma 5 gives us

$$\epsilon(h) \ge (1-\lambda)\,\frac{1}{d\,2^d} \sum_{z\in\{0,1\}^d} \frac{\epsilon-\lambda}{1-\lambda}\,d + \lambda = \epsilon.$$

Therefore, if the communication complexity is limited to less than , the error of the learned hypothesis will always be greater than the specified error . Thus, we get the lower bound in Theorem 1. ∎
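The expected-error expression in the proof above can be evaluated by brute force for small $d$. The sketch below assumes the most favourable case for the protocol, namely that every $z$ is mapped to its nearest vector in the image set (called `codebook` here, a hypothetical name); any other mapping into the same set can only increase the error.

```python
from itertools import product
from typing import Sequence, Tuple

Vector = Tuple[int, ...]


def hamming(a: Vector, b: Vector) -> int:
    """Hamming distance between two equal-length binary vectors."""
    return sum(x != y for x, y in zip(a, b))


def dist_to_set(z: Vector, codebook: Sequence[Vector]) -> int:
    """Minimum Hamming distance from z to any vector in the codebook."""
    return min(hamming(z, u) for u in codebook)


def expected_error(d: int, codebook: Sequence[Vector], lam: float) -> float:
    """(1 - lam) * average normalized distance over uniform z in {0,1}^d, plus lam."""
    total = sum(dist_to_set(z, codebook) for z in product((0, 1), repeat=d))
    return (1 - lam) * total / (d * 2 ** d) + lam
```

With a single-vector codebook the average normalized distance is one half, while a codebook containing all of $\{0,1\}^d$ leaves only the outlier term `lam`.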

### C.3 Proof of Communication Complexity Lower Bound for t-Round 2-Machine Protocols, Theorem 2

###### Proof.

Suppose the hypothesis can be learned with -error following a -round randomized protocol . Let denote the learned hypothesis. Applying the round elimination Lemma 1 to repeatedly for times, we see that has a protocol with communication complexity and

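$$\mathrm{err}(Q) \le \epsilon + \sqrt{\frac{\ell_1}{d}} + \sqrt{\frac{\ell_1+\ell_2}{d}} + \cdots + \sqrt{\frac{\ell_1+\cdots+\ell_{t-1}}{d}} \le \epsilon + t\sqrt{\frac{\ell_1+\cdots+\ell_t}{d}}.$$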

Suppose the communication complexity of satisfies . Then , so with communication cost less than . This contradicts Theorem 1. Therefore, the communication complexity is at least . ∎
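The second inequality in the error bound above holds because each of the $t-1$ prefix sums is at most the full sum $\ell_1+\cdots+\ell_t$. A minimal numerical sketch, with hypothetical function names and `ells` standing for the per-round message lengths:

```python
from math import sqrt
from typing import Sequence


def accumulated_error(eps: float, ells: Sequence[int], d: int) -> float:
    """eps plus the t-1 terms sqrt((l_1 + ... + l_j) / d) for j = 1, ..., t-1."""
    total = eps
    prefix = 0
    for length in ells[:-1]:  # the last round contributes no term
        prefix += length
        total += sqrt(prefix / d)
    return total


def coarse_bound(eps: float, ells: Sequence[int], d: int) -> float:
    """eps + t * sqrt((l_1 + ... + l_t) / d), the simplified upper bound."""
    return eps + len(ells) * sqrt(sum(ells) / d)
```

For any non-negative message lengths, `accumulated_error` never exceeds `coarse_bound`.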

### C.4 Proof of Communication Complexity Lower Bound for 1-Round k-Machine Protocols, Theorem 3

###### Proof.

To extend the above lower bound for -machine communication to the -machine case, we use the symmetrization technique [17] to reduce -machine communication to -machine communication.

The symmetrization is conducted as follows. Consider a protocol for this -machine problem, which works on this distribution and communicates bits in expectation. We build from a new protocol for a -machine problem. In the -machine problem, suppose that gets input and gets input , where are independent random vectors. Then works as follows: and randomly choose two distinct indices using the public randomness, and they simulate the protocol , where plays machine and lets , plays machine and lets , and they both play all of the remaining machines; the inputs of the remaining machines are chosen from shared randomness. and then begin simulating the run of . Every time machine should speak, sends to the message that machine was supposed to communicate, and vice versa. When any other machine () should speak, both and know its input and hence what it should communicate, so no communication is actually needed. A key observation is that the inputs of the machines are uniform and independent and thus entirely symmetrical. Since the indices and were chosen uniformly at random, the expected communication performed by the protocol is . Since (Theorem 1), we have . ∎
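The accounting behind the symmetrization step can be sketched as follows. This is a simplified illustration, not the paper's argument: it assumes each machine `i` sends a fixed number of bits `bits_sent[i]` (in the proof these are random but identically distributed), and that only messages of the two machines played by a single party have to be transmitted.

```python
from fractions import Fraction
from itertools import permutations
from typing import Sequence


def expected_two_party_cost(bits_sent: Sequence[int]) -> Fraction:
    """Average of bits_sent[i] + bits_sent[j] over uniformly random distinct (i, j)."""
    k = len(bits_sent)
    pairs = list(permutations(range(k), 2))  # all ordered pairs of distinct indices
    total = sum(bits_sent[i] + bits_sent[j] for i, j in pairs)
    return Fraction(total, len(pairs))


def two_over_k_share(bits_sent: Sequence[int]) -> Fraction:
    """The fraction 2/k of the total communication of the k-machine protocol."""
    return Fraction(2 * sum(bits_sent), len(bits_sent))
```

Under these assumptions the two quantities coincide for every cost vector, so the two-party simulation pays a $2/k$ share of the total cost on average.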

### C.5 Proof on Equivalence between Distributed and Online Algorithms, Lemma 3

###### Proof.

An -pass, -space online algorithm for learning a hypothesis on a set yields an -round, -machine communication protocol for learning when is partitioned into subsets and the th machine receives : the th machine randomly permutes to generate stream and the machines emulate the online algorithm on the concatenated stream . The emulation requires bits of communication. ∎
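The emulation can be sketched in a few lines. The code below is an illustrative assumption rather than the paper's construction: `update` stands for one step of a generic streaming algorithm, and a "handoff" is counted each time the algorithm's memory state is forwarded from one machine to the next (including the wrap-around between passes). Each handoff costs at most the size of the state.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

State = TypeVar("State")


def emulate_streaming(parts: Sequence[Sequence[int]],
                      passes: int,
                      init_state: State,
                      update: Callable[[State, int], State],
                      seed: int = 0) -> Tuple[State, int]:
    """Emulate a multi-pass streaming algorithm over data split across machines."""
    rng = random.Random(seed)
    streams: List[List[int]] = []
    for part in parts:
        local = list(part)
        rng.shuffle(local)  # each machine permutes its own subset once
        streams.append(local)

    state = init_state
    handoffs = 0
    for p in range(passes):
        for idx, stream in enumerate(streams):
            if p > 0 or idx > 0:
                handoffs += 1  # state received from the previous machine
            for item in stream:
                state = update(state, item)
    return state, handoffs
```

With `k` machines and `p` passes this sketch performs `p * k - 1` handoffs, so the total communication scales with the number of passes, the number of machines, and the space used by the online algorithm.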

## Appendix D Proof on The Ws Algorithm, Theorem 5

###### Proof.

At the start of each round , let be the potential function given by the sum of weights of all points in that round. Initially, since by definition for each point we have .

Then in each round, constructs a hypothesis at to correctly classify the set of points that accounts for at least fraction of the total weight by Lemma 2. All other misclassified points are upweighted by . Hence, for round we have .

Consider the weight of the points in the set that have been misclassified by a majority of the classifiers (after the protocol ends). Every point in has therefore been misclassified at least times and at most times, so the minimum weight of a point in is and the maximum weight is .

Let be the number of points in that have weight where . The potential function value of after rounds is . Our claim is that . Each of these at most points has a weight of at least . Hence we have

$$\phi^T_S = \sum_{i=T/2}^{T} n_i (1+\rho)^i \ge (1+\rho)^{T/2} \sum_{i=T/2}^{T} n_i = (1+\rho)^{T/2}\,|S|.$$

Combining these two inequalities, we obtain

$$|S|\,(1+\rho)^{T/2} \le \phi^T_S \le \phi^T = n(1+c\rho)^T.$$

Hence (using )

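$$|S| \le n\left(\frac{1+c'\rho}{(1+\rho)^{1/2}}\right)^{T} = n\left(\frac{1+c'\rho}{(1+\rho)^{1/2}}\right)^{5\log_2(1/\epsilon)} = n\,(1/\epsilon)^{5\log_2\left(\frac{1+c'\rho}{(1+\rho)^{1/2}}\right)}.$$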

Setting , which means and , we get and thus , as desired since . Thus each round uses points, each requiring bits of communication, yielding a total communication of . ∎
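The first half of the potential argument above can be checked on synthetic data. The sketch assumes that every point starts with weight 1 and that each misclassification multiplies its weight by $(1+\rho)$; the misclassification counts are drawn at random purely for illustration, and all names are hypothetical.

```python
import random
from fractions import Fraction
from typing import Tuple


def check_potential_inequality(n: int, T: int, rho: Fraction,
                               seed: int = 0) -> Tuple[Fraction, Fraction, Fraction]:
    """Return (|S| * (1+rho)^(T/2), phi_S, phi) for random misclassification counts."""
    rng = random.Random(seed)
    # miss_counts[p] = number of rounds (out of T) in which point p was misclassified
    miss_counts = [rng.randint(0, T) for _ in range(n)]
    weights = [(1 + rho) ** m for m in miss_counts]

    # S: points misclassified in at least half of the rounds
    in_S = [p for p in range(n) if 2 * miss_counts[p] >= T]
    phi_S = sum((weights[p] for p in in_S), Fraction(0))
    phi = sum(weights, Fraction(0))
    lower = len(in_S) * (1 + rho) ** (T // 2)
    return lower, phi_S, phi
```

Exact rational arithmetic is used so that the comparison `lower <= phi_S <= phi` is not affected by floating-point rounding.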

## Appendix E Upper Bound for Communication Complexity

We first establish an upper bound on the communication complexity for distributed PAC learning, which is simply derived from the sample complexity result for PAC learning with malicious errors [10]. Before stating the sample complexity result, let us recall the Occam algorithm [4], whose existence is a sufficient condition for the learnability of a PAC learning problem [19].

###### Definition 2 (Occam Algorithm).

An algorithm is called an Occam algorithm if it draws training examples and outputs a hypothesis such that is consistent with these examples.

In the presence of malicious outliers (with error rate) in the training examples, we desire to develop a -tolerant algorithm, defined as follows [10].

###### Definition 3 (λ-tolerant Algorithm).

Given a malicious error rate for the oracles, if an algorithm with access to the oracles outputs a hypothesis with error probability with a probability of , then the algorithm is a -tolerant algorithm.

Basically, a -tolerant algorithm is able to output a hypothesis with bounded error even in the presence of a fraction of outliers in the training examples. Similarly, we call an Occam algorithm a -tolerant Occam algorithm if outputs a hypothesis whose error over provided training examples containing malicious outlier with a fraction of .

The following theorem demonstrates that if we have a -tolerant Occam algorithm, we can always develop a -tolerant algorithm for learning any target representation with sufficiently many training examples.

###### Theorem 8 (Sample Complexity for Robust PAC Learning).

Suppose an algorithm is a -tolerant Occam algorithm for by . Then is also a -tolerant algorithm for by , and the sample size required by is , for achieving an error rate of with a success probability at least .

###### Proof.

Let be such that its error rate on positive examples . Then the probability that agrees with a point received from the oracle is bounded above by

$$(1-\lambda)(1-\epsilon) + \lambda = 1 - (1-\lambda)\epsilon.$$

Thus the probability that agrees with at least a fraction of examples received from is bounded above by

$$e^{-m\epsilon(1-2\lambda)^2/8(1-\lambda)},$$

by Chernoff bound. From this it follows that the probability that some with agrees with a fraction of the examples is at most . Solving , we obtain . ∎
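The last step of the proof, solving the tail bound for the sample size, can be sketched numerically. The code below rests on two assumptions that are not confirmed by the text: the exponent is read as $m\epsilon(1-2\lambda)^2 / (8(1-\lambda))$, and the union bound is taken over a finite hypothesis class of size `num_hypotheses`. All names are hypothetical.

```python
from math import ceil, exp, log


def agreement_probability(eps: float, lam: float) -> float:
    """Upper bound on the chance that a bad hypothesis agrees with one oracle example."""
    return (1 - lam) * (1 - eps) + lam


def failure_bound(m: int, eps: float, lam: float, num_hypotheses: int) -> float:
    """Union bound (assumed finite class) times the exponential tail bound."""
    rate = eps * (1 - 2 * lam) ** 2 / (8 * (1 - lam))
    return num_hypotheses * exp(-m * rate)


def sample_size(eps: float, lam: float, delta: float, num_hypotheses: int) -> int:
    """Smallest m for which failure_bound(m, ...) <= delta, assuming lam < 1/2."""
    rate = eps * (1 - 2 * lam) ** 2 / (8 * (1 - lam))
    return ceil(log(num_hypotheses / delta) / rate)
```

Under this reading the required sample size grows like $\frac{1-\lambda}{\epsilon(1-2\lambda)^2}$ times a logarithmic term, and it diverges as the outlier rate approaches one half.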

Based on the above sample complexity result, the simplest communication protocol is to have each machine out of machines send a random sample of size to a specific machine, which then performs the learning, so that there is just one round of communication. Therefore, we immediately obtain the following upper bound on the communication complexity.

###### Corollary 3 (Communication Complexity Upper Bound).

Assume a -tolerant Occam algorithm always exists. Any target representation can be learned from to error using round and total examples communicated. Suppose each example is represented by bits, then the total communication complexity is upper bounded by .

## References

• [1] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 20–29. ACM, 1996.
• [2] Maria-Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. Distributed learning, communication complexity and privacy. arXiv preprint arXiv:1204.3514, 2012.
• [3] Ziv Bar-Yossef, TS Jayram, Ravi Kumar, and D Sivakumar. Information theory methods in communication complexity. In Computational Complexity, 2002. Proceedings. 17th IEEE Annual Conference on, pages 72–81. IEEE, 2002.
• [4] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Occam’s razor. Information processing letters, 24(6):377–380, 1987.
• [5] Hal Daumé III, Jeff M Phillips, Avishek Saha, and Suresh Venkatasubramanian. Efficient protocols for distributed classification and optimization. pages 154–168. Springer, 2012.
• [6] Hal Daumé III, Jeff M Phillips, Avishek Saha, and Suresh Venkatasubramanian. Protocols for learning classifiers on distributed data. arXiv preprint arXiv:1202.6078, 2012.
• [7] John C Duchi, Michael I Jordan, Martin J Wainwright, and Yuchen Zhang. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. arXiv preprint arXiv:1405.0782, 2014.
• [8] John C Duchi and Martin J Wainwright. Distance-based and continuum fano inequalities with applications to statistical estimation. arXiv preprint arXiv:1311.2669, 2013.
• [9] Ankit Garg, Tengyu Ma, and Huy Nguyen. On communication cost of distributed statistical estimation and dimensionality. In NIPS, pages 2726–2734, 2014.
• [10] Michael Kearns and Ming Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4):807–837, 1993.
• [11] Flip Korn, S Muthukrishnan, and Yunyue Zhu. Checks and balances: Monitoring data quality problems in network traffic databases. In Proceedings of the 29th international conference on Very large data bases-Volume 29, pages 536–547. VLDB Endowment, 2003.
• [12] Ilan Kremer, Noam Nisan, and Dana Ron. On randomized one-round communication complexity. Computational Complexity, 8(1):21–49, 1999.
• [13] E Kushilevitz and N Nisan. Communication Complexity. Cambridge University Press, Cambridge, 1996.
• [14] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In Database Theory-ICDT 2005, pages 398–412. Springer, 2005.
• [15] Shanmugavelayutham Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.
• [16] J v Neumann. Zur theorie der gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
• [17] Jeff M Phillips, Elad Verbin, and Qin Zhang. Lower bounds for number-in-hand multiparty communication complexity, made easy. In SODA, pages 486–501. SIAM, 2012.
• [18] Pranab Sen. Lower bounds for predecessor searching in the cell probe model. In Computational Complexity, 2003. Proceedings. 18th IEEE Annual Conference on, pages 73–83. IEEE, 2003.
• [19] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
• [20] Leslie G Valiant. Learning disjunction of conjunctions. In IJCAI, pages 560–566, 1985.
• [21] Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.
• [22] Angeline Wong, Leejay Wu, Phillip B Gibbons, and Christos Faloutsos. Fast estimation of fractal dimension and correlation integral on stream data. Information Processing Letters, 93(2):91–97, 2005.
• [23] Andrew Chi-Chih Yao. Some complexity questions related to distributive computing (preliminary report). In Proceedings of the eleventh annual ACM symposium on Theory of computing, pages 209–213. ACM, 1979.
• [24] Andrew Chi-Chih Yao. Probabilistic computations: Toward a unified measure of complexity. In FOCS, pages 222–227. IEEE, 1977.
• [25] Yuchen Zhang, John Duchi, Michael Jordan, and Martin J Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pages 2328–2336, 2013.