Adaptive distributed methods under communication constraints

04/03/2018 ∙ by Botond Szabo, et al. ∙ Universiteit Leiden ∙ University of Amsterdam

We study distributed estimation methods under communication constraints in a distributed version of the nonparametric signal-in-white-noise model. We derive minimax lower bounds and exhibit methods that attain those bounds. Moreover, we show that adaptive estimation is possible in this setting.


1 Introduction

In this paper we take up the study of the fundamental possibilities and limitations of distributed methods for high-dimensional or nonparametric problems. The design and study of such methods has attracted substantial attention recently. This is in large part motivated by the ever-increasing size of datasets, which makes it necessary to analyze data while they are distributed over multiple machines and/or cores. Other reasons to consider distributed methods include privacy considerations, or the simple fact that in some situations data are physically collected at multiple locations.

By now a variety of methods are available for fitting nonparametric or high-dimensional models to data in a distributed manner. A (certainly incomplete) list of recent references includes the papers [18, 25, 13, 6, 19, 20, 11, 1, 15]. Some of these papers propose new methods, some study theoretical aspects of such methods, and some do both. The number of theoretical papers on the fundamental performance of distributed methods is, however, still rather limited. In the paper [21] we recently introduced a distributed version of the canonical signal-in-white-noise model to serve as a benchmark model for studying aspects like convergence rates and optimal tuning of distributed methods. We used it to compare the performance of a number of distributed nonparametric methods recently introduced in the literature. The study illustrated the intuitively obvious fact that in order to achieve an optimal bias-variance trade-off, or, equivalently, to find the correct balance between over- and under-fitting, distributed methods need to be tuned differently than methods that handle all data at once. Moreover, our comparison showed that some of the proposed methods are more successful at this than others.

A major challenge and fundamental question for nonparametric distributed methods is whether or not it is possible to achieve a form of adaptive inference; in other words, whether we can design methods that do automatic, data-driven tuning in order to achieve the optimal bias-variance trade-off. We illustrated by example in [21] that naively using methods that are known to achieve optimal adaptation in non-distributed settings can lead to sub-optimal performance in the distributed case. In the recent paper [27], which considers the same distributed signal-in-white-noise model and was written independently of and at the same time as the present paper, it is in fact conjectured that adaptation is not possible in the particular distributed model considered.

In order to study convergence rates and adaptation for distributed methods in a meaningful way the class of methods should be restricted somehow. Indeed, if there is no limitation on communication or computation, then we could simply communicate all data from the various local machines to a central machine, aggregate it, and use some existing adaptive, rate-optimal procedure. In this paper we consider a setting in which the communication between the local and the global machines is restricted, much in the same way as the communication restrictions imposed in [25] in a parametric framework and recently in the simultaneously written paper [27] in the context of the distributed signal-in-white-noise model we introduced in [21].

In the distributed nonparametric regression model with communication constraints that we consider, we derive minimax lower bounds for the best possible rate that any distributed procedure can achieve under smoothness conditions on the true regression function. Technically, this essentially relies on an extension of the information-theoretic approach of [25] to the infinite-dimensional setting (this is different from the approach taken in [27], which relies on results from [23]). It turns out that there are different regimes, depending on how much communication is allowed. At one extreme, and in accordance with intuition, if enough communication is allowed, it is possible to achieve the same convergence rates in the distributed setting as in the non-distributed case. At the other extreme, so little communication is allowed that combining different machines does not help: the optimal rate under the communication restriction can already be obtained by just using a single local machine and discarding the others. The interesting case is the intermediate regime. For that case we show that there exists an optimal strategy that involves grouping the machines in a certain way and letting them work on different parts of the regression function.

These first rate-optimal distributed estimators are not adaptive, in the sense that the optimal procedures depend on the regularity of the unknown regression function. The same holds true for the procedure obtained in parallel in [27]. In this paper we go a step further and show that, perhaps contrary to intuition, and contrary to the conjecture in [27], adaptation is in fact possible. Indeed, we exhibit an adaptive distributed method which involves a very specific grouping of the local machines, in combination with a Lepski-type method that is carried out in the central machine. We prove that the resulting distributed estimator adapts to a range of smoothness levels of the unknown regression function and that, up to logarithmic factors, it attains the minimax lower bound.

Although our analysis is theoretical, we believe it contains interesting messages that are ultimately very relevant for the development of applied distributed methods in high-dimensional settings. First of all, we show that depending on the communication budget, it might be advantageous to group local machines and let different groups work on different aspects of the high-dimensional object of interest. Secondly, we show that it is possible to have adaptation in communication-restricted distributed settings, i.e. to have data-driven tuning that automatically achieves the correct bias-variance trade-off. We note, however, that although our proof of this fact is constructive, the method we exhibit still appears to be rather impractical. We view our adaptation result primarily as a first proof of concept that hopefully invites the development of more practical adaptation techniques for distributed settings.

1.1 Notations

For two positive sequences we use the notation if there exists a universal positive constant such that . Similarly, denotes that and hold simultaneously. Furthermore, we write if . We denote by and the upper and lower integer parts of the real number , respectively. The sum for a positive real number denotes the sum . For a set , let denote the size of the set. For we denote the standard -norm by , while for bounded functions denotes the -norm. The function evaluates to on and on . Furthermore, we use the notation . Throughout the paper, and denote global constants whose value may change from one line to another.

2 Main results

We work with the distributed version of the random design regression model. We assume that we have ‘local’ machines and that in the th machine we observe pairs of random variables , , (with ) satisfying

(2.1)

where (which is the same for all machines) is the unknown functional parameter of interest. We denote the local distribution and expectation corresponding to the th machine in (2.1) by and , respectively, and the joint distribution and expectation over all machines by and , respectively. We assume that the total sample size is known to every local machine. For our theoretical results we assume that the unknown true function belongs to some regularity class. In our analysis we work with Besov smoothness classes; more specifically, we assume that for some degree of smoothness we have or . The first class is of Sobolev type, while the second one is of Hölder type, with minimax estimation rates and , respectively. For precise definitions, see Section B in the supplementary material. Each local machine carries out (in parallel with the others) a local statistical procedure and transmits the results to a central machine, which produces an estimator for the signal by somehow aggregating the messages received from the local machines.

We study these distributed procedures under communication constraints between the local machines and the central machine. We allow each local machine to send at most bits on average to the central machine. More formally, a distributed estimator is a measurable function of binary strings, or messages, passed down from the local machines to the central machine. We denote by the finite binary string transmitted from machine to the central machine, which is a measurable function of the local data . For a class of potential signals , we restrict the communication between the machines by assuming that for numbers , it holds that for every and , where denotes the length of the string . We denote the resulting class of communication-restricted distributed estimators by . The number of machines and the communication constraints are allowed to depend on the overall sample size ; in fact, that is the interesting situation. To alleviate the notational burden somewhat, however, we do not make this explicit in the notation.
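To make this definition concrete, here is a minimal sketch of the protocol structure in Python (our own illustration; the names LocalData, encoder, decoder and the budget check are not from the paper): each local machine maps its own data to a finite binary string, the central machine maps the collection of strings to an estimate, and the constraint bounds the message lengths (in the paper the bound is on the expected length).

```python
from typing import Callable, List, Sequence, Tuple

LocalData = Sequence[Tuple[float, float]]   # the pairs (X_i, Y_i) held by one machine
BitString = str                             # a finite binary message, e.g. "01101..."

def distributed_estimate(encoders: List[Callable[[LocalData], BitString]],
                         decoder: Callable[[List[BitString]], object],
                         datasets: List[LocalData]) -> object:
    """Local machines encode their own data into binary messages; the central
    machine aggregates the messages into an estimator of the signal."""
    messages = [encode(data) for encode, data in zip(encoders, datasets)]
    return decoder(messages)

def within_budget(messages: List[BitString], budgets: List[float]) -> bool:
    """Communication constraint: machine j may use at most budgets[j] bits."""
    return all(len(msg) <= budget for msg, budget in zip(messages, budgets))
```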

2.1 Distributed minimax lower bounds for the L_2-risk

The first theorem we present gives a minimax lower bound for distributed procedures for the -risk, uniformly over Sobolev-type Besov balls, see Section B in the supplement for rigorous definitions.

Theorem 2.1.

Consider , and communication constraints . Let the sequence be defined as the solution to the equation

(2.2)

Then in the distributed random design nonparametric regression model (2.1) we have that

Proof.

See Section 3.1

We briefly comment on the derived result. First of all, note that the quantity in (2.2) is well defined, since the left-hand side of the equation is increasing, while the right-hand side is decreasing in . The proof of the theorem is based on an application of a version of Fano’s inequality, frequently used to derive minimax lower bounds. More specifically, as a first step we find, as usual, a large enough finite subset of the functional space over which the minimax rate is the same as over the whole space. This is done by finding the ‘effective resolution level’ in the wavelet representation of the function of interest and perturbing the corresponding wavelet coefficients, while setting the rest of the coefficients to zero. For -smooth functions this effective resolution level is usually in the case of the -norm for non-distributed models (e.g. [10]). However, in our distributed setting the effective resolution level changes to , which can be substantially different from the non-distributed case, as it strongly depends on the number of transmitted bits. The dependence on the expected number of transmitted bits enters the formula through a variation of Shannon’s source coding theorem. Many of the information-theoretic manipulations in the proof are extended and adapted versions of the approach introduced in [25], where similar results were derived in the context of distributed methods with communication constraints for parametric models.

To understand the result it is instructive to consider the special case in which the communication constraints are the same for all machines, i.e. for some . We can then distinguish three regimes: 1 the case ; 2 the case ; and 3 the case .

In regime 1 we have a large communication budget, and by elementary computations we get that the minimum in (2.2) is attained by the second fraction, hence . This means that in this case the derived lower bound corresponds to the usual non-distributed minimax rate . In the other extreme case, regime 3, the minimum is attained by the first term in (2.2) and , so the lower bound is of the order . This rate is, up to the factor, equal to the minimax rate corresponding to the sample size . Consequently, in this case it does not make sense to consider distributed methods, since the best possible rate can already be obtained (up to a logarithmic factor) by just using a single machine. In the intermediate case 2 it is straightforward to see that . It follows that if , i.e. if we are only allowed to communicate ‘strictly’ less than in case 1, then the lower bound is strictly worse than the minimax rate corresponding to the non-distributed setting.

The findings above are summarized in the following corollary.

Corollary 2.2.

Consider , a communication constraint and assume that . Then

  1. if ,

  2. if ,

  3. if ,

2.2 Non-adaptive rate-optimal distributed procedures for the L_2-risk

Next we show that the derived lower bounds are sharp by exhibiting distributed procedures that attain the bounds (up to logarithmic factors). We note that it is sufficient to consider only the case , since otherwise distributed techniques do not perform better than standard techniques carried out on one of the local servers. In case 3, therefore, one would probably prefer to use a single local machine instead of a complicated distributed method with (possibly) worse performance.

As a first step, consider Daubechies wavelets , , with at least vanishing moments (for details, see Section B in the supplement). We then estimate the wavelet coefficients of the underlying function in each local problem, i.e. for every and we construct

and note that

Since one can only transmit a finite number of bits, we have to approximate the estimators of the wavelet coefficients. Take an arbitrary and write it in scientific binary representation, i.e. , with , . Then take , consisting of the same digits as up to the th digit after the binary dot (and truncated there), for some , i.e. ; see also Algorithm 1.

1:procedure TransApprox()
2:     Transmit: , .
3:     Construct: .
Algorithm 1 Transmitting a finite-bit approximation of a number
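The idea behind Algorithm 1 can be illustrated with a short sketch (our own reconstruction in Python, not the authors' exact scheme; the bit layout and the choice of the truncation level in the paper may differ): transmit the sign, the binary digits of the integer part, and the first D binary digits after the binary dot, so that the central machine can rebuild the number up to a truncation error of at most 2^{-D}.

```python
import math

def trans_approx(x: float, num_frac_bits: int) -> str:
    """Encode sign, integer part and the first num_frac_bits binary digits
    after the binary dot; everything beyond that is truncated."""
    sign = '1' if x < 0 else '0'
    y = abs(x)
    int_part = int(y)
    frac = y - int_part
    frac_bits = []
    for _ in range(num_frac_bits):      # truncate after num_frac_bits digits
        frac *= 2
        bit = int(frac)
        frac_bits.append(str(bit))
        frac -= bit
    # message length is roughly log2(1 + int_part) + num_frac_bits + 1 bits
    return sign + format(int_part, 'b') + ''.join(frac_bits)

def reconstruct(message: str, num_frac_bits: int) -> float:
    """Central machine: rebuild the truncated value from the binary string."""
    sign = -1.0 if message[0] == '1' else 1.0
    int_bits, frac_bits = message[1:-num_frac_bits], message[-num_frac_bits:]
    value = int(int_bits, 2) + sum(int(b) * 2.0 ** -(i + 1) for i, b in enumerate(frac_bits))
    return sign * value

# The truncation error is at most 2 ** -num_frac_bits:
approx = reconstruct(trans_approx(math.pi, 16), 16)
assert abs(approx - math.pi) <= 2 ** -16
```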

Observe that the length of (viewed as a binary string) is bounded from above by bits. The following lemma asserts that if , then the expected length of the constructed binary string approximating is less than (for sufficiently large and by choosing ) and the approximation is sufficiently close to .

Lemma 2.3.

Assume that . Then the approximation of given in Algorithm 1 satisfies that

Proof.

See Section 3.4. ∎

After these preparations we can exhibit procedures attaining (nearly) the theoretical limits obtained in Corollary 2.2.

We first consider case 1, i.e. the case that . In this case each local machine transmits the approximations (given in Algorithm 1) of the first wavelet coefficients , i.e. for . Then in the central machine we simply average the transmitted approximations to obtain the estimated wavelet coefficients

The final estimator for is the function in with these wavelet coefficients, i.e. . The method is summarized as Algorithm 2 below.

1:In the local machines:
2:for  to  do:
3:     for  1 to  do
4:          :=TransApprox()      
5:In the central machine:
6:for  to  do
7:     .
8:Construct: .
Algorithm 2 Nonadaptive L_2-method, case 1
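As an illustration of Algorithm 2, the following toy simulation implements the 'embarrassingly parallel' scheme under our own simplifying assumptions (Haar wavelets instead of the smoother Daubechies wavelets required by the theory, a made-up test signal, and the bit-level quantization of Algorithm 1 omitted): every machine estimates the same block of empirical wavelet coefficients from its own data, and the central machine simply averages them.

```python
import numpy as np

def haar_psi(j: int, k: int, x: np.ndarray) -> np.ndarray:
    """Haar wavelet psi_{j,k}(x) = 2^{j/2} psi(2^j x - k) on [0, 1]."""
    t = 2.0 ** j * x - k
    return 2.0 ** (j / 2) * (np.where((t >= 0) & (t < 0.5), 1.0, 0.0)
                             - np.where((t >= 0.5) & (t < 1.0), 1.0, 0.0))

def local_coefficients(X, Y, levels):
    """One machine's empirical wavelet coefficients under uniform design:
    hat f_{jk} = mean of Y_i * psi_{jk}(X_i)."""
    return {(j, k): float(np.mean(Y * haar_psi(j, k, X)))
            for j in range(levels) for k in range(2 ** j)}

rng = np.random.default_rng(0)
m, n_local, levels = 20, 500, 4                      # illustrative sizes only
f = lambda x: np.sin(2 * np.pi * x)                  # stand-in regression function
local = []
for _ in range(m):
    X = rng.uniform(size=n_local)
    Y = f(X) + rng.normal(size=n_local)
    local.append(local_coefficients(X, Y, levels))   # would be quantized via TransApprox
aggregated = {key: np.mean([c[key] for c in local]) for key in local[0]}
```

Averaging the m independent local estimates brings the variance of each coefficient down to the order it would have if all n observations were pooled, which is why this simple scheme can match the non-distributed rate when the communication budget is large enough.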

We note again that the procedure outlined in Algorithm 2 is just a simple averaging, sometimes called “divide and conquer” or “embarrassingly parallel” in the learning literature (e.g. [26], [17]). The following theorem asserts that the constructed estimator indeed attains the lower bound in case 1 (up to a logarithmic factor for close to the threshold).

Theorem 2.4.

Let , , and suppose that . Then the distributed estimator described in Algorithm 2 belongs to and satisfies

Proof.

See Section 3.2

Next we consider case 2 of Corollary 2.2, i.e. the case that the communication restriction satisfies . For technical reasons we also assume that . Using Algorithm 2 in this case would result in a highly sub-optimal procedure, as we prove at the end of Section 3.3. It turns out that under this more severe communication restriction we can do much better if we form different groups of machines that work on different parts of the signal.

We introduce the notation . Then we group the local machines into groups and let the different groups work on different parts of the wavelet domain as follows: the machines with numbers each transmit the approximations of the estimated wavelet coefficients for ; the next machines, with numbers , each transmit the approximations for , and so on. The last machines, with numbers , transmit the for . Then in the central machine we average the corresponding transmitted noisy coefficients in the obvious way. Formally, using the notation , the aggregated estimator is the function with wavelet coefficients given by

The procedure is summarized as Algorithm 3.

1:In the local machines:
2:for  to  do
3:     for  to  do
4:         for  to  do
5:               :=TransApprox()               
6:In the central machine:
7:for  to  do
8:     
9:Construct: .
Algorithm 3 Nonadaptive L_2-method, case 2
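The grouping idea of Algorithm 3 can be sketched as simple index bookkeeping (our own Python illustration; the paper's specific group sizes and the blocks of resolution levels assigned to each group are determined by quantities elided above): the coefficient indices are split into consecutive blocks, the machines are split into the same number of groups, each group only estimates and transmits its own block, and the central machine averages within groups.

```python
import numpy as np

def assign_blocks(num_coeffs: int, num_machines: int, num_groups: int):
    """Return, for each machine, the list of coefficient indices it handles."""
    coeff_blocks = np.array_split(np.arange(num_coeffs), num_groups)
    machine_groups = np.array_split(np.arange(num_machines), num_groups)
    assignment = {}
    for block, machines in zip(coeff_blocks, machine_groups):
        for machine in machines:
            assignment[int(machine)] = block.tolist()
    return assignment

def aggregate(local_estimates, assignment, num_coeffs):
    """Central machine: for each coefficient, average the estimates sent by the
    machines in the group responsible for that coefficient."""
    sums, counts = np.zeros(num_coeffs), np.zeros(num_coeffs)
    for machine, coeffs in assignment.items():
        for idx in coeffs:
            sums[idx] += local_estimates[machine][idx]
            counts[idx] += 1
    return sums / np.maximum(counts, 1)
```

The point of the grouping is that each machine only spends its communication budget on its own block, so the scheme as a whole can cover more coefficients under the same per-machine constraint; the price is that each individual coefficient is averaged over fewer machines.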

The following theorem asserts that this estimator attains the lower bound in case 2 (up to a logarithmic factor).

Theorem 2.5.

Let , and suppose that . Then the distributed estimator described in Algorithm 3 belongs to and satisfies

with .

Proof.

See Section 3.3

2.3 Distributed minimax results for the L_∞-risk

When we replace the -norm by the -norm and correspondingly change the type of Besov balls we consider, we can derive a lower bound similar to Theorem 2.1 (see Section B in the supplement for the rigorous definition of Besov balls).

Theorem 2.6.

Consider , communication constraints and assume that . Let the sequence be defined as the solution to the equation (2.2). Then in the distributed random design regression model (2.1) we have that

Proof.

See Section 4.1

The proof of the theorem is very similar to that of Theorem 2.1. The first term on the right-hand side follows from the usual non-distributed minimax lower bound. For the second term we use the standard version of Fano’s inequality. We again consider a large enough finite subset of . The effective resolution level for the -norm in the non-distributed case is . Similarly to the case, the effective resolution level changes to in the distributed setting, which can again be substantially different from the non-distributed case. The rest of the proof follows the same line of reasoning as the proof of Theorem 2.1.

Similarly to the -norm we consider again the specific case where all communication budgets are taken to be equal, i.e. . One can easily see that there are again three regimes of (slightly different compared to the -case).

Corollary 2.7.

Consider , communication constraint and assume that .

  1. If , then

  2. If , then

  3. If , then

Next we provide matching upper bounds (up to a factor) in the first two cases, i.e. 1 and 2. In the third case the lower bound matches (up to a logarithmic factor) the minimax rate corresponding to a single local machine, hence it is not advantageous at all to develop complicated distributed techniques, as a single server with only a fraction of the total information performs at least as well. In the previous section, dealing with estimation, we provided two algorithms (one where the machines had the same tasks and one where the machines were divided into groups and assigned different tasks) to highlight the differences between the cases. Here, for simplicity, we combine the algorithms into a single one, but essentially the same techniques are used as before.

In each local machine we compute the local estimators of the wavelet coefficients and transmit finite-digit approximations of them, as in the -case. Then we divide the machines into equally sized groups ( corresponds to case 1, while corresponds to case 2). Similarly to before, machines with numbers transmit the approximations of the estimated wavelet coefficients for , and so on; the last machines, with numbers , transmit the approximations for . In the central machine we average the corresponding transmitted coefficients in the obvious way, i.e. the aggregated estimator is the function with wavelet coefficients given by

where . The procedure is summarized as Algorithm 4 and the (up to a logarithmic factor) optimal behaviour is given in Theorem 2.8 below.

1:In the local machines:
2:for  to  do
3:     for  to  do
4:         for  to  do
5:               :=TransApprox()               
6:In the central machine:
7:for  to  do
8:     
9:Construct: .
Algorithm 4 Nonadaptive L_∞-method, combined
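Algorithm 4 reuses the same grouping machinery: in terms of the sketch given after Algorithm 3, case 1 corresponds to a single group (plain averaging over all machines) and case 2 to several groups, each handling its own slice of coefficients. A hypothetical call (the actual numbers of groups and coefficients are determined by quantities elided above):

```python
# Case 1: one group, every machine handles every coefficient (plain averaging).
case1 = assign_blocks(num_coeffs=64, num_machines=20, num_groups=1)
# Case 2: several groups, each handling its own slice of coefficients.
case2 = assign_blocks(num_coeffs=64, num_machines=20, num_groups=4)
```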
Theorem 2.8.

Let . Then the distributed estimator described in Algorithm 4 belongs to and satisfies

  • for ,

  • for ,

    with .

Proof.

See Section 4.2

We can draw similar conclusions for the -norm as for the -norm. If we do not transmit a sufficient amount of bits (at least up to a factor) from the local machines to the central one, then the lower bound from the theorem exceeds the minimax risk corresponding to the non-distributed case. Furthermore, by transmitting a sufficient amount of bits (i.e. up to a factor) corresponding to the class , the lower bound coincides with the non-distributed minimax estimation rate.

2.4 Adaptive distributed estimation

The (almost) rate-optimal procedures considered so far have in common that they are non-adaptive, in the sense that they all use knowledge of the regularity level of the unknown functional parameter of interest. In this section we exhibit a distributed algorithm attaining the lower bounds (up to a logarithmic factor) across a whole range of regularities simultaneously. In the non-distributed setting it is well known that this is possible, and many adaptation methods exist, including for instance the block Stein method, Lepski’s method, wavelet thresholding, and Bayesian adaptation methods, to mention just a few (e.g. [22, 10]). In the distributed case the matter is more complicated. Using the usual adaptive tuning methods in the local machines will typically not work (see [21]), and in fact it was recently conjectured that adaptation, if at all possible, would require more communication than is allowed in our model (see [27]).

We will show, however, that in our setting, if all machines have the same communication restriction given by , it is possible to adapt to regularities ranging in the interval , where

(2.3)

and is the regularity of the considered Daubechies wavelet and can be chosen arbitrarily large. Note that is well defined. If , then we are in one of the non-trivial cases 1 or 2 of Corollary 2.2. We will construct a distributed method which, up to logarithmic factors, attains the corresponding lower bounds, without using knowledge about the regularity level .

Remark 2.9.

We provide some examples of the value of for different choices of and . Taking we have for all that . For and we get . For and we have that . Note that it is intuitively clear that if the number of machines is large, then it is typically advantageous to use a distributed method rather than a single local machine, as in the latter case we would lose too much information. However, if we have a small number of machines and can transmit only a very limited amount of information, then it might be more advantageous to use only a single machine for inference.

In the non-adaptive case we saw that different strategies were required to attain the optimal rate, with case 2 requiring a particular grouping of the local machines. The cut-off between cases 1 and 2 depends, however, on the value of , so in the present adaptive setting we do not know beforehand in which of the two cases we are. To tackle this problem we introduce a somewhat more involved grouping of the machines, which essentially allows us to carry out both strategies simultaneously. This is combined with a modified version of Lepski’s method, carried out in the central machine, ultimately leading to (nearly) optimal distributed concentration rates for every regularity class , simultaneously. (We note that in our distributed regression setting, deriving an appropriate version of Lepski’s method requires some non-standard technical work; see Section 3.5.)

As a first step in our adaptive procedure we divide the machines into groups. To begin with, let us take the first machines and denote the set of their index numbers by . Then the remaining machines are split into equally sized groups (for simplicity each group has machines and the leftovers are discarded), where

The corresponding sets of indexes are denoted by . Note that , for . Then the machines in the group transmit the approximations (with in Algorithm 1) of the local estimators of the wavelet coefficients , for , with

to the central machine. The machines in group , , will be responsible for transmitting the coefficients at resolution level . First, for every , the machines in group are split again into equally sized groups (for simplicity each group has machines and the leftovers are discarded again), denoted by . A machine in one of the groups for transmits the approximations (again with in Algorithm 1) of the local estimators of the wavelet coefficients , for and , to the central machine.
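The grouping just described can be sketched as index bookkeeping (a hypothetical Python illustration; the parameters n_coarse, coarse_level and n_fine_levels below are placeholders, whereas the paper fixes the group sizes and the range of resolution levels in terms of quantities elided above): one block of machines covers all coarse resolution levels, every remaining level gets its own block of machines, and each of those blocks is split further into sub-blocks that handle disjoint slices of the coefficients at that level.

```python
import numpy as np

def adaptive_grouping(n_machines: int, n_coarse: int, coarse_level: int, n_fine_levels: int):
    """Return the machine groups: 'coarse' lists the machines responsible for all
    resolution levels up to coarse_level; each finer level gets a block of machines,
    split into sub-blocks that handle disjoint slices of that level's coefficients."""
    groups = {'coarse': list(range(n_coarse))}
    remaining = np.arange(n_coarse, n_machines)
    per_level = np.array_split(remaining, n_fine_levels)      # one block per fine level
    for offset, machines in enumerate(per_level):
        level = coarse_level + 1 + offset
        n_coeffs_at_level = 2 ** level                         # wavelet coefficients at this level
        n_sub = max(1, min(len(machines), n_coeffs_at_level))
        groups[level] = [blk.tolist() for blk in np.array_split(machines, n_sub)]
    return groups

# Example: 100 machines, 20 on the coarse levels, five finer levels handled by the rest.
groups = adaptive_grouping(n_machines=100, n_coarse=20, coarse_level=3, n_fine_levels=5)
```

Each sub-block then transmits its slice of coefficients exactly as in Algorithm 3, and the central machine averages within sub-blocks before carrying out the Lepski-type selection described next.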

In the central machine we first average the transmitted approximations of the corresponding coefficients. We define

(2.4)

Using these coefficients we can construct for every the preliminary estimator

(2.5)

This gives us a sequence of estimators from which we select the appropriate one using a modified version of Lepski’s method. We consider