# Learning without Interaction Requires Separation

One of the key resources in large-scale learning systems is the number of rounds of communication between the server and the clients holding the data points. We study this resource for systems with two types of constraints on the communication from each of the clients: local differential privacy and a limited number of bits communicated. For both models the number of rounds of communication is captured by the number of rounds of interaction when solving the learning problem in the statistical query (SQ) model. For many learning problems, known efficient algorithms require many rounds of interaction. Yet little is known about whether this is actually necessary. In the context of classification in the PAC learning model, Kasiviswanathan et al. (2008) constructed an artificial class of functions that is PAC learnable with respect to a fixed distribution but cannot be learned by an efficient non-interactive (or one-round) SQ algorithm. Here we show that a similar separation holds for learning linear separators and decision lists without assumptions on the distribution. To prove this separation we show that non-interactive SQ algorithms can only learn function classes of low margin complexity, that is, classes of functions that can be represented as large-margin linear separators.


## 1 Overview

We consider learning in distributed systems where each client (or user) holds a data point drawn i.i.d. from some unknown distribution and the goal of the server is to solve some statistical learning problem using the data stored at the clients. In addition, the communication from the client to the server is constrained. The first constraint we consider is that of local differential privacy (LDP) [KasiviswanathanLNRS11]. In this model each user applies a differentially-private algorithm to their point and then sends the result to the server. The specific algorithm applied by each user is determined by the server. In the general version of the model the server can determine which LDP algorithm the user should apply on the basis of all the previous communications the server has received. In practice, however, waiting for the client's response often takes a relatively large amount of time. Therefore in such systems it is necessary to limit the number of rounds of interaction. That is, the queries of the server need to be split into a small number of batches such that the LDP algorithms used in each batch depend only on responses to queries in previous batches (a query specifies the LDP algorithm to apply). Indeed, currently deployed systems that use local differential privacy use very few rounds (usually just one) [ErlingssonPK14, appledp, DKY17-Microsoft].

The second constraint we consider is the small number of bits communicated from each client. Namely, each client applies a function with range $\{0,1\}^\ell$ to their input and sends the result to the server (for some small $\ell$). As in the case of LDP, the specific function used is chosen by the server. One motivation for this model is collection of data from remote sensors where the cost of communication is highly asymmetric. In the context of learning this model was introduced by [Ben-DavidD98] and generalized by [SteinhardtVW16]. Identical and closely related models are often studied in the context of distributed statistical estimation with communication constraints ([luo2005universal, rajagopal2006universal, ribeiro2006bandwidth, ZhangDJW13, SteinhardtD15, suresh2016distributed]). As in the setting of LDP, the number of rounds of interaction that the server uses to solve a learning problem in this model is a critical resource.

To understand the round complexity of solving a learning problem in these models we will use the fact that both of these models are known to be closely related to learning in the statistical query model of [Kearns:98]. In this model an algorithm has access to a statistical query oracle for the unknown data distribution $P$ in place of i.i.d. samples from $P$. The most commonly studied SQ oracle gives an estimate of the mean of any bounded function with fixed tolerance. Let $P$ be a distribution over a domain $Z$ and $\tau > 0$. A statistical query oracle $\mathrm{STAT}_P(\tau)$ is an oracle that, given as input any function $\phi: Z \to [-1,1]$, returns some value $v$ such that $|v - \mathbf{E}_{z \sim P}[\phi(z)]| \le \tau$. The tolerance $\tau$ of statistical queries roughly corresponds to the number of random samples in the traditional setting. Namely, the Chernoff-Hoeffding bounds imply that $n$ i.i.d. samples allow estimation of $\mathbf{E}_{z \sim P}[\phi(z)]$ with tolerance $\tau = O(1/\sqrt{n})$ (with high probability). A special case of statistical queries are counting or linear queries in which the distribution $P$ is uniform over the elements of a given database $S \in Z^n$. In other words the goal is to estimate the empirical mean of $\phi$ on the given set of data points. This setting is studied extensively in the literature on differential privacy (see [DworkRoth:14] for an overview) and our discussion applies to this setting as well.
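The correspondence between tolerance and sample size can be illustrated with a minimal sketch that answers a statistical query by an empirical mean. The function names, the two-point data distribution, and the parameter values are assumptions made for illustration and do not come from the paper; the sample size uses Hoeffding's inequality for functions with values in $[-1,1]$.

```python
import math
import random


def hoeffding_sample_size(tau, delta):
    """Number of i.i.d. samples sufficient for the empirical mean of a
    [-1, 1]-valued function to be within tau of its expectation with
    probability at least 1 - delta (Hoeffding's inequality)."""
    return math.ceil(2.0 * math.log(2.0 / delta) / tau ** 2)


def simulate_sq_oracle(phi, samples):
    """Answer the statistical query phi with its empirical mean."""
    return sum(phi(z) for z in samples) / len(samples)


rng = random.Random(0)
tau, delta = 0.05, 0.01
n = hoeffding_sample_size(tau, delta)

# Hypothetical data distribution: z = +1 with probability 0.7, else -1,
# so the true expectation of phi(z) = z is 0.4.
samples = [1 if rng.random() < 0.7 else -1 for _ in range(n)]
estimate = simulate_sq_oracle(lambda z: z, samples)
```

With these values the required sample size grows as $1/\tau^2$, matching the $\tau = O(1/\sqrt{n})$ relationship stated above.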

An algorithm in the SQ model is said to have $k$ rounds of interaction if the queries asked by the algorithm can be split into $k$ batches in such a way that the queries in each batch depend only on answers to queries in the previous batches. We also say that the algorithm is non-interactive (also referred to as non-adaptive) if it requires only one round of interaction. Reductions between learning in the SQ model and the two constrained communication models above were given by [KasiviswanathanLNRS11] and [SteinhardtVW16]. For our purposes it is important to note that all four of these reductions preserve the number of rounds of interaction of a learning algorithm. In particular, this implies that it is sufficient for our purposes to study the round complexity of solving the learning problem in the SQ model.

In this paper we will focus on the standard PAC learning of a class of Boolean functions $C$ over some domain $X$. In this setting the input distribution is over labeled examples $(x, y)$ where $x$ is drawn from some distribution $D$ over $X$ and $y = f(x)$ for some unknown $f \in C$ (referred to as the target function). The goal of the learning algorithm is to output a function $h$ such that the error $\Pr_{x \sim D}[f(x) \neq h(x)]$ is small. In the distribution-independent setting $D$ is not known to the learning algorithm while in the distribution-specific setting the learning algorithm only needs to succeed for some specific $D$.

For many of the important classes of functions all known learning algorithms require many rounds of interaction. Yet there are almost no known lower bounds on the round complexity of SQ algorithms. The only example that we are aware of is the result of [KasiviswanathanLNRS11] who were also motivated by the local differential privacy model. They constructed a class of functions $C$ over $\{0,1\}^n$ that can be PAC learned relative to the uniform distribution over $\{0,1\}^n$ in the SQ model. At the same time $C$ cannot be learned by an efficient non-interactive SQ algorithm (the complexity is exponential in $n$). The class $C$ is highly unnatural. It splits the domain into two parts. The target function learned on the first half gives the key to the learning problem on the second half of the domain. That problem is exponentially hard to solve without the key. This approach does not extend to the distribution-independent learning setting (intuitively, the learning algorithm will not be able to obtain the key if the distribution does not place any probability on the first half of the domain).

### 1.1 Our Result

We give a separation between the power of interactive and non-interactive algorithms for distribution-independent PAC learning of two natural classes of Boolean functions. Specifically, we prove that only classes that have polynomially small margin complexity can be efficiently PAC learned by a non-interactive SQ algorithm. The margin complexity of a class of Boolean functions $C$ over a domain $X$, denoted by $\mathrm{MC}(C)$, measures (the inverse of) the largest margin of separation achievable by an embedding of $X$ in $\mathbb{R}^d$ that makes the positive and negative examples of each function in $C$ linearly separable (see Definition 2.4). It is a well-studied measure of complexity of classes of functions and corresponding sign matrices in learning theory and communication complexity ([Novikoff:62, AizermanBR67, BoserGV92, ForsterSS01, Sherstov:08, LinialS:2009, KallweitSimon:11]).

Theorem 1.1. Let be a class of Boolean functions closed under negation. Assume that for some there exists a non-interactive possibly randomized SQ algorithm that, with success probability at least , PAC learns with error less than using at most queries to . Then .

The class of decision lists and the class of linear separators (or halfspaces) over are known to have exponentially large margin complexity [GHR:92, BuhrmanVW:07, Sherstov:08] (and are also negation closed). In contrast, these classes are known to be learnable efficiently by SQ algorithms [Kearns:98, DunaganVempala:04]. Combining these results with Theorem 1.1 gives the claimed separation for the SQ model. Using the reductions between the SQ model and the two models of distributed learning we obtain the separation in those models.

### 1.2 Related work

[SmithTU17] address the question of the power of non-interactive LDP algorithms in the closely related setting of stochastic convex optimization. They derive new non-interactive LDP algorithms for the problem, albeit ones requiring a number of queries that is exponential in the dimension. They also give a strong lower bound for non-interactive algorithms that are further restricted to obtain only local information about the optimized function. Upper and lower bounds on the number of queries to the gradient oracle for algorithms with few rounds of interaction have been recently studied by [DuchiRY18]. In the context of discrete optimization from queries for the value of the optimized function the round complexity has been recently investigated by [BalkanskiRS17, BalkanskiS18]. To the best of our knowledge, the techniques used in these works are unrelated to ours.

The round complexity of PAC learning a class of functions by an SQ algorithm has been shown to determine the number of generations necessary to evolve in a variant of Valiant’s model of evolvability [Valiant:09, Kanade11]. The number of data samples necessary to answer statistical queries chosen interactively has recently been studied in a line of work on adaptive data analysis [DworkFHPRR14:arxiv, HardtU14, BassilyNSSSU16, SteinkeU15].

## 2 Preliminaries

For integer $n \ge 1$ let $[n] \doteq \{1, \ldots, n\}$.

### 2.1 Local Differential Privacy

In the local differential privacy (LDP) model [Warner, KasiviswanathanLNRS11] it is assumed that each data sample obtained by the server is randomized in a differentially private way. This is modeled by assuming that the server running the learning algorithm accesses the dataset via an oracle defined below. An $\varepsilon$-local randomizer $R: Z \to W$ is a randomized algorithm that satisfies, for all $z_1, z_2 \in Z$ and $w \in W$, $\Pr[R(z_1) = w] \le e^{\varepsilon} \Pr[R(z_2) = w]$. For a dataset $S = (z_1, \ldots, z_n) \in Z^n$, an $LR_S$ oracle takes as an input an index $i$ and a local randomizer $R$ and outputs a random value $w$ obtained by applying $R(z_i)$. An algorithm is $\varepsilon$-LDP if it accesses $S$ only via the $LR_S$ oracle with the following restriction: for all $i \in [n]$, if $LR_S(i, R_1), \ldots, LR_S(i, R_k)$ are the algorithm's invocations of $LR_S$ on index $i$ where each $R_j$ is an $\varepsilon_j$-randomizer then $\varepsilon_1 + \cdots + \varepsilon_k \le \varepsilon$. This model can be contrasted with the standard, or central, model of differential privacy where the entire dataset is held by the learning algorithm whose output needs to satisfy differential privacy [DworkMNS:06]. This is a stronger model and an $\varepsilon$-LDP algorithm also satisfies $\varepsilon$-differential privacy.
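As a concrete illustration of a local randomizer, the following minimal sketch implements classical randomized response on a single $\{-1,+1\}$-valued bit together with a debiasing step on the server side. It is only an example of the definition above and is not taken from the paper; the function names and the simulated data are assumptions.

```python
import math
import random


def randomized_response(bit, epsilon, rng):
    """Local randomizer for bit in {-1, +1}: report the true bit with
    probability e^eps / (1 + e^eps) and the flipped bit otherwise, so the
    ratio of output probabilities for any two inputs is at most e^eps."""
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < keep_prob else -bit


def debiased_mean(reports, epsilon):
    """Unbiased estimate of the mean of the true bits. Each report has
    expectation bit * (e^eps - 1) / (e^eps + 1), so we rescale."""
    scale = (math.exp(epsilon) + 1.0) / (math.exp(epsilon) - 1.0)
    return scale * sum(reports) / len(reports)


rng = random.Random(1)
epsilon = 1.0
# Hypothetical client data: 20000 users, each holding one bit.
true_bits = [1 if rng.random() < 0.7 else -1 for _ in range(20000)]
reports = [randomized_response(b, epsilon, rng) for b in true_bits]
estimate = debiased_mean(reports, epsilon)
true_mean = sum(true_bits) / len(true_bits)
```

Every client is contacted once and the randomizer does not depend on other clients' reports, so this protocol uses a single round.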

### 2.2 Bounded communication

In the bounded communication model [Ben-DavidD98, SteinhardtVW16] it is assumed that the total number of bits learned by the server about each data sample is bounded by for some . As in the case of LDP this is modeled by using an appropriate oracle for accessing the dataset. We say that an algorithm extracts bits. For a dataset , an oracle takes as an input an index and an algorithm and outputs a random value obtained by applying . An algorithm is -bit communication bounded if it accesses only via the oracle with the following restriction: for all , if are the algorithm’s invocations of on index where each extracts bits then .

### 2.3 Equivalence to statistical queries

The third model we consider is the statistical query model of [Kearns:98] that is defined by having access to the $\mathrm{STAT}_P(\tau)$ oracle, where $P$ is the unknown data distribution (see Def. 1). To solve a learning problem in this model an algorithm needs to succeed for any valid (that is, satisfying the guarantees on the tolerance) responses of the oracle. In other words, the guarantees of the algorithm should hold in the worst case over the responses of the oracle. A randomized learning algorithm needs to succeed for any SQ oracle whose responses may depend on all the queries asked so far but not on the internal randomness of the learning algorithm.

For an algorithm in any of these oracle models we say that the algorithm is non-interactive (or non-adaptive) if all its queries are determined before observing any of the oracle's responses.

[KasiviswanathanLNRS11] show that one can simulate oracle with success probability by an -LDP algorithm using oracle for containing i.i.d. samples from . This has the following implication for simulating SQ algorithms.

Theorem ([KasiviswanathanLNRS11]). Let be an algorithm that makes at most queries to . Then for every and there is an -local algorithm that uses oracle for containing i.i.d. samples from and produces the same output as (for some valid answers of ) with probability at least . Further, if is non-interactive then is non-interactive.

[KasiviswanathanLNRS11] also prove a converse of this theorem.

Theorem ([KasiviswanathanLNRS11]). Let be an -LDP algorithm that makes at most queries to for drawn i.i.d. from . Then for every there is an SQ algorithm that in expectation makes queries to for and produces the same output as with probability at least . Further, if is non-interactive then is non-interactive.

As first observed by [Ben-DavidD98], it is easy to simulate a single query to by extracting a single bit from each of the samples. This gives the following simulation.

Theorem ([Ben-DavidD98]). Let be an algorithm that makes at most queries to . Then for every there is an -local algorithm that uses oracle for containing i.i.d. samples from and produces the same output as (for some valid answers of ) with probability at least . Further, if is non-interactive then is non-interactive.

The converse of this theorem for the simpler oracle that accesses each sample once was given in [Ben-DavidD98, FeldmanGRVX:12]. For the stronger oracle in Definition 2.2, the converse was given by [SteinhardtVW16].

Theorem ([SteinhardtVW16]). Let be an -bit communication bounded algorithm that makes queries to for drawn i.i.d. from . Then for every , there is an SQ algorithm that makes queries to and produces the same output as with probability at least . Further, if is non-interactive then is non-interactive.

Note that in this simulation we do not need to assume a separate bound on the number of queries since at most queries can be asked.
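One standard way to answer a statistical query from one bit per sample is for each client to send a random bit whose bias encodes the query value. The sketch below illustrates this idea under stated assumptions: the function names, the assignment of disjoint client groups to queries, and the test data are hypothetical and are not the exact construction of the cited works.

```python
import random


def extract_bit(phi, z, rng):
    """Client side: send one bit equal to 1 with probability
    (1 + phi(z)) / 2, where phi takes values in [-1, 1]."""
    return 1 if rng.random() < (1.0 + phi(z)) / 2.0 else 0


def answer_queries_one_round(queries, dataset, rng):
    """Server side: all queries and client groups are fixed up front, so
    no query depends on an earlier answer (a single round)."""
    group = len(dataset) // len(queries)
    answers = []
    for j, phi in enumerate(queries):
        clients = dataset[j * group:(j + 1) * group]
        bits = [extract_bit(phi, z, rng) for z in clients]
        # E[2b - 1] = E[phi(z)], so the rescaled average is unbiased.
        answers.append(2.0 * sum(bits) / len(bits) - 1.0)
    return answers


rng = random.Random(2)
# Hypothetical data: 40000 clients, each holding z uniform in [-1, 1].
dataset = [rng.uniform(-1.0, 1.0) for _ in range(40000)]
queries = [lambda z: z, lambda z: z * z]  # expectations 0 and 1/3
answers = answer_queries_one_round(queries, dataset, rng)
```

Because each client contributes a single bit to a single query, the number of clients needed grows with both the number of queries and the inverse square of the tolerance.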

### 2.4 PAC Learning and Margin complexity

Our results are for the standard PAC model of learning [Valiant:84]. Let be a domain and be a class of Boolean functions over . An algorithm is said to PAC learn with error if for every distribution over and , given access (via oracle or samples) to the input distribution over examples for , the algorithm outputs a function such that . We say that the learning algorithm is efficient if its running time is polynomial in , and .

We say that a class of Boolean ($\{-1,1\}$-valued) functions $C$ is closed under negation if for every $f \in C$, $-f \in C$. For dimension $d$, we denote by $B^d$ the unit ball in $\ell_2$ norm in $\mathbb{R}^d$. Let $X$ be a domain and $C$ be a class of Boolean functions over $X$. The margin complexity of $C$, denoted $\mathrm{MC}(C)$, is the minimal number $M \ge 0$ such that for some $d$, there is an embedding $\Psi: X \to B^d$ for which the following holds: for every $f \in C$ there is $w \in B^d$ such that

$$\min_{x \in X} \{ f(x) \cdot \langle w, \Psi(x) \rangle \} \ge \frac{1}{M}.$$

We remark that margin complexity is closely related to the smallest dimension $d$ for which there exists a mapping of $X$ to $\{-1,1\}^d$ such that every $f \in C$ becomes expressible as a majority function over some subset $T \subseteq [d]$ of variables (a majority function is equal to $1$ if and only if the number of variables with indices in $T$ that are equal to $1$ is larger than the number of those in $T$ set to $-1$).
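To make the definition concrete, the following minimal sketch evaluates the margin achieved by one particular embedding on a toy class. The class (the four functions $\pm x_1, \pm x_2$ over $\{-1,1\}^2$), the embedding, and the weight vectors are assumptions chosen for illustration; the computation only certifies an upper bound on the margin complexity of this toy class, since other embeddings might achieve a larger margin.

```python
import itertools
import math


def margin(f, w, embed, domain):
    """Smallest value of f(x) * <w, embed(x)> over the domain."""
    return min(
        f(x) * sum(wi * pi for wi, pi in zip(w, embed(x))) for x in domain
    )


# Toy domain: the four points of {-1, 1}^2.
domain = list(itertools.product([-1, 1], repeat=2))


def embed(x):
    """Scale each point to unit Euclidean norm."""
    return tuple(xi / math.sqrt(2.0) for xi in x)


# Toy class closed under negation, each function paired with a unit-norm
# weight vector that separates its positive and negative examples.
concepts = [
    (lambda x: x[0], (1.0, 0.0)),
    (lambda x: -x[0], (-1.0, 0.0)),
    (lambda x: x[1], (0.0, 1.0)),
    (lambda x: -x[1], (0.0, -1.0)),
]

worst_margin = min(margin(f, w, embed, domain) for f, w in concepts)
mc_upper_bound = 1.0 / worst_margin
```

Here every function attains margin $1/\sqrt{2}$, so this embedding shows that the margin complexity of the toy class is at most $\sqrt{2}$.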

As pointed out in [Feldman:08ev], margin complexity is equivalent (up to a polynomial) to the existence of a (possibly randomized) algorithm that outputs a small set of functions such that with significant probability one of those functions is correlated with the target function. The upper bound in [Feldman:08ev] was sharpened by [KallweitSimon:11] although they proved it only for deterministic algorithms (which corresponds to a single fixed set of functions). It is, however, easy to see that their sharper bound extends to randomized algorithms and we give the resulting statement below.

Lemma ([Feldman:08ev, KallweitSimon:11]). Let be a domain and be a class of Boolean functions over . Assume that there exists a (possibly randomized) algorithm that generates a set of functions satisfying: for every and distribution over with probability at least (over the randomness of ) there exists such that . Then

 MC(C)≤β2m3/2.

The conditions in Lemma 2.4 are also known to be necessary for low margin complexity.

Lemma ([Feldman:08ev, KallweitSimon:11]). Let be a domain, be a class of Boolean functions over and . Then for , there exists a set of functions satisfying: for every and distribution over there exists such that .

Specifically, the correlational SQ dimension of is defined as follows. [[Feldman:08ev]] Let be a domain and be a class of Boolean functions over . The correlational SQ dimension of , denoted , is the minimal number such that there exist Boolean functions over , satisfying: for every and distribution over there exists such that . We will use the following upper-bound on in terms of proved in [Feldman:08ev] and tightened in [KallweitSimon:11]. [[KallweitSimon:11]] For every class of Boolean functions over domain :

$$\mathrm{MC}(C) \le \mathrm{CSQdim}(C)^{3/2}.$$

## 3 Lower Bounds for Non-Interactive Algorithms

We start by proving our main result. Let be a class of Boolean functions closed under negation. Assume that for some there exists a non-interactive SQ algorithm that PAC learns with error less than using at most queries to . Then .

###### Proof of Theorem 1.1.

We first recall a simple observation from [BshoutyFeldman:02] that allows us to decompose each statistical query into correlational and target-independent parts. Namely, for a function $\phi: X \times \{-1,1\} \to [-1,1]$,

$$\phi(x,y) = \frac{1-y}{2}\,\phi(x,-1) + \frac{1+y}{2}\,\phi(x,1) = \frac{\phi(x,-1)+\phi(x,1)}{2} + y \cdot \frac{\phi(x,1)-\phi(x,-1)}{2}.$$

For a query $\phi$, we will use $h$ and $g$ to denote the parts of the decomposition $\phi(x,y) = g(x) + y \cdot h(x)$:

$$h(x) \doteq \frac{\phi(x,1)-\phi(x,-1)}{2}$$

and

$$g(x) \doteq \frac{\phi(x,1)+\phi(x,-1)}{2}.$$
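The decomposition can be checked numerically with a minimal sketch. The particular query `phi` below is a hypothetical example chosen so that its values stay within $[-1,1]$ for $x \in [-1,1]$; the helper names are likewise assumptions.

```python
def correlational_part(phi):
    """h(x) = (phi(x, 1) - phi(x, -1)) / 2, the part multiplied by the label."""
    return lambda x: (phi(x, 1) - phi(x, -1)) / 2.0


def target_independent_part(phi):
    """g(x) = (phi(x, 1) + phi(x, -1)) / 2, which ignores the label."""
    return lambda x: (phi(x, 1) + phi(x, -1)) / 2.0


def phi(x, y):
    """Hypothetical query over x in [-1, 1] and label y in {-1, +1}."""
    return 0.5 * x * y + 0.25 * y - 0.1 * x


h = correlational_part(phi)
g = target_independent_part(phi)

# Largest deviation between phi(x, y) and g(x) + y * h(x) on a small grid.
grid = [i / 10.0 for i in range(-10, 11)]
max_gap = max(
    abs(phi(x, y) - (g(x) + y * h(x))) for x in grid for y in (-1, 1)
)
```

For this query $h(x) = 0.5x + 0.25$ and $g(x) = -0.1x$, so only $h$ interacts with the label, which is the property the proof relies on.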

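For every input distribution $D$ and target function $f$, we define the following SQ oracle. Given a query $\phi$ with parts $g$ and $h$, if $|\mathbf{E}_{x \sim D}[f(x) h(x)]| \ge \tau$, where $\tau$ is the tolerance of the oracle, then the oracle provides the exact expectation $\mathbf{E}_{x \sim D}[\phi(x, f(x))]$ as the response. Otherwise it answers with $\mathbf{E}_{x \sim D}[g(x)]$. Note that, by the properties of the decomposition, $\mathbf{E}_{x \sim D}[\phi(x, f(x))] = \mathbf{E}_{x \sim D}[g(x)] + \mathbf{E}_{x \sim D}[f(x) h(x)]$, so this is a valid implementation of the SQ oracle.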

Let denote with its random bits set to , where is drawn from some distribution . Let be the non-interactive statistical queries asked by . Let and denote the decomposition of these queries into correlational and target-independent parts. Let denote the hypothesis output by when used with the SQ oracle defined above.

We claim that if achieves error with probability at least , then for every and distribution , with probability at least , there exists such that (satisfying the conditions of Lemma 2.4 with ). To see this, assume for the sake of contradiction that for some distribution and function ,

$$\Pr_{r \sim R}[r \in T(f,D)] > 2/3,$$

where is the set of all random strings such that for all , . Let denote the set of random strings for which succeeds (with the given SQ oracle), that is .

By our assumption, $\Pr_{r \sim R}[r \in S(f,D)] \ge 2/3$ and therefore

$$\Pr_{r \sim R}[r \in T(f,D) \cap S(f,D)] > 1/3. \tag{3}$$

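Now, observe that $T(-f,D) = T(f,D)$ and, in particular, the answers of our SQ oracle to the algorithm's queries are identical for $f$ and $-f$ whenever $r \in T(f,D)$. Further, if the output hypothesis has error less than $1/2$ with respect to $f$ then it has error greater than $1/2$ with respect to $-f$. This means that for every $r \in T(f,D) \cap S(f,D)$, the algorithm fails when the target function is $-f$ and the distribution is $D$ (by definition, $-f \in C$). By eq. (3) we obtain that the algorithm fails with probability greater than $1/3$ for $-f$ and $D$. This contradicts our assumption and therefore we obtain that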

$$\Pr_{r \sim R}[r \notin T(f,D)] \ge 1/3.$$

By Lemma 2.4, we obtain the claim. ∎

We will now apply this result to obtain the claimed separations. We start with the class of halfspaces over which we denote by . We will use the following rather involved result about the margin complexity of halfspaces. [[GHR:92, Sherstov:08]] . Combining this result with Theorem 1.1 we obtain that: Any non-interactive SQ algorithm that PAC learns over with error less than and success probability using at most queries to must have .

On the other hand, [DunaganVempala:04] give an efficient algorithm for PAC learning halfspaces (their description is not in the SQ model but it is known that their algorithm can be easily converted to the SQ model [BalcanF15]). [[DunaganVempala:04, BalcanF15]] There exists an SQ learning algorithm that for every , PAC learns with error using queries to .

A similar separation also holds for the more restricted class of decision lists over that we denote by (see [KearnsVazirani:94] for a definition). [[BuhrmanVW:07]] . Any non-interactive SQ algorithm that PAC learns over with error less than and success probability using at most queries to must have . On the other hand a simple algorithm of [Kearns:98] shows that decision lists are efficiently PAC learnable in the SQ model. [[Kearns:98]] There exists an SQ learning algorithm that for every , learns over with error using queries to .

It is known that by using the SQ algorithm for learning halfspaces (or decision lists) and simulations of SQ algorithms by -LDP algorithms (Theorem 2.3) and -bit communication bounded algorithms (Theorem 2.3) one obtains efficient PAC learning algorithms for these classes in these models. Hence we only state the lower bounds for non-interactive algorithms here that follow from Theorems 2.3 and 2.3.

Any non-interactive -LDP algorithm that PAC learns over with error less than and success probability at least using at most queries to for drawn i.i.d. from must have .

Any non-interactive -communication bounded algorithm that PAC learns over with error less than and success probability at least using queries to for drawn i.i.d. from must have .

## 4 Discussion

Our work shows that polynomial margin complexity is a necessary condition for learning a class of binary classifiers by a non-interactive SQ/LDP/limited-communication algorithm. We first point out a sense in which our condition is also sufficient. Lemma 2.4 implies that for every class of margin complexity , there exists a PAC learning algorithm that uses non-interactive SQ queries to and learns with error of at most for (since the correlated function or its negation will have error of at most ). Therefore our result implies that margin complexity characterizes (up to a polynomial) the complexity of weak PAC learning by non-interactive algorithms in the three models we consider. On the other hand, for PAC learning with small constant error (say ) all known algorithms for learning large-margin halfspaces require many rounds of interaction [BlumFKV:97, FeldmanGV:15]. Proving that this is necessary is a natural open problem.

A significant limitation of our result is that it does not rule out even a -round algorithm for learning halfspaces (or decision lists). This is, again, in contrast to the fact that learning algorithms for these classes require at least rounds of interaction. We believe that extending our lower bounds to multiple-round algorithms and quantifying the tradeoff between the number of rounds and the complexity of learning is an important direction for future work.

### Acknowledgements

We thank Kobbi Nissim, Adam Smith and Justin Thaler for insightful discussions of this problem.