Semi-Cyclic Stochastic Gradient Descent

04/23/2019 · Hubert Eichner, et al.

We consider convex SGD updates with a block-cyclic structure, i.e. where each cycle consists of a small number of blocks, each with many samples from a possibly different, block-specific, distribution. This situation arises, e.g., in Federated Learning where the mobile devices available for updates at different times during the day have different characteristics. We show that such block-cyclic structure can significantly deteriorate the performance of SGD, but propose a simple approach that allows prediction with the same performance guarantees as for i.i.d., non-cyclic, sampling.


1 Introduction

When using Stochastic Gradient Descent (SGD), it is important that samples are drawn at random. In contrast, cycling over a specific permutation of data points (e.g. sorting data by label, or even alphabetically or by time of collection) can be detrimental to the performance of SGD. But in some applications, such as in Federated Learning (FL3; FL4) or when running SGD in “real time”, some cyclic patterns in the data samples are hard to avoid. In this paper we investigate how such cyclic structure hurts the effectiveness of SGD, and how this can be corrected. We model this semi-cyclicity through a “block-cyclic” structure, where during the course of training, we cycle over blocks in a fixed order and samples for each block are drawn from a block-specific distribution.

Our primary motivation is Federated Learning. In this setting, mobile devices collaborate in the training of a shared model while keeping the training data decentralized. Devices communicate updates (e.g. gradients) to a coordinating server, which aggregates the updates and applies them to the global model. In each iteration (or round) of Federated SGD, typically a few hundred devices are chosen randomly by the server to participate; critically, however, only devices that are idle, charging, and on a free wireless connection are selected (FL_BLOG; bonawitz19sysml). This ensures Federated Learning does not impact the user’s experience when using their phone, but also can produce significant diurnal variation in the devices available, since devices are more likely to meet the training eligibility requirements at night local time. For example, for a language model (gboard), devices from English speakers in India and America are likely available at different times of day.

When the number of blocks in each cycle is high and the number of samples per block is low, the setting approaches fully-cyclic SGD, which is known to be problematic. But what happens when the number of blocks is fairly small and each block consists of a large number of samples? For example, there might be only two blocks, corresponding to "day" and "night", with possibly millions of samples per block. An optimist would hope that this "slightly cyclic" case is much easier than the fully cyclic case, and that the performance of SGD degrades gracefully as the number of blocks increases.

Unfortunately, in Section 3 we show that even with only two blocks, the block-cyclic sampling can cause an arbitrarily bad slowdown. One might ask whether alternative forms of updates, instead of standard stochastic gradient updates, can alleviate this degradation in performance. We show that such a slowdown is unavoidable for any iterative method based on semi-cyclic samples.

Instead, the solution we suggest is to embrace the heterogeneity in the data and resort to a pluralistic solution, allowing a potentially different model for each block in the cycle (e.g. a “daytime” model and a “nighttime” model). A naive pluralistic approach would still suffer a significant slowdown as it would not integrate information between blocks (Section 4). In Section 5 we show a remarkably simple and practical pluralistic approach that allows us to obtain exactly the same guarantees as with i.i.d. non-cyclic sampling, thus entirely alleviating the problem introduced by such data heterogeneity—as we also demonstrate empirically in Section 7. In Section 6 we go even further and show how we can maintain the same guarantee without any deterioration while also being competitive with separately learned predictors, hedging our bets in case the differences between components are high.

2 Setup

We consider a stochastic convex optimization problem

$$\min_{\|w\| \le B} F(w), \qquad F(w) := \frac{1}{m} \sum_{i=1}^{m} F_i(w), \qquad F_i(w) := \mathbb{E}_{z \sim \mathcal{D}_i}\big[ f(w; z) \big] \tag{1}$$

where each component $F_i$ represents the data distribution $\mathcal{D}_i$ associated with one of the $m$ blocks. For simplicity, we assume a uniform mixture; our results can be easily extended to non-uniform mixing weights and corresponding non-uniform block lengths. In a learning setting, $z$ represents a labeled example and the instantaneous objective $f(w; z)$ is the loss incurred if using the model $w$ on example $z$.

We assume $f(\cdot; z)$ is convex and 1-Lipschitz with respect to $w$, which lives in some high-, possibly infinite-, dimensional Euclidean or Hilbert space. Our goal is to compete with the best possible predictor of some bounded norm $B$, that is, to learn a predictor $\hat w$ such that $F(\hat w) \le F(w^\star) + \epsilon$ where $w^\star = \arg\min_{\|w\| \le B} F(w)$. We consider Stochastic Gradient Descent (SGD) iterates on the above objective:

$$w_{t+1} = w_t - \eta_t \nabla f(w_t; z_t) \tag{2}$$

If the samples $z_t$ are chosen independently from the data distribution $\mathcal{D} = \frac{1}{m}\sum_{i=1}^m \mathcal{D}_i$, then with appropriately chosen stepsizes $\eta_t$, SGD attains the minimax optimal error guarantee (e.g. shalev2009stochastic):

$$\mathbb{E}\big[F(\bar w_T)\big] \;\le\; F(w^\star) + O\!\left(\frac{B}{\sqrt{T}}\right) \tag{3}$$

where $\bar w_T = \frac{1}{T}\sum_{t=1}^T w_t$ and $T$ is the total number of iterations, and thus also the total number of samples used.

But here we study Block-Cyclic SGD, which consists of $K$ cycles (e.g., days) of $mn$ iterations each, for a total of $T = Kmn$ iterations of the update (2). In each cycle, the first $n$ samples are drawn from $\mathcal{D}_1$, the next $n$ from $\mathcal{D}_2$, and so forth. That is, samples are drawn according to

$$z_{t(k,i,j)} \sim \mathcal{D}_i \qquad\text{where}\qquad t(k,i,j) = (k-1)mn + (i-1)n + j \tag{4}$$

where $k = 1, \ldots, K$ indexes the cycles, $i = 1, \ldots, m$ indexes blocks, and $j = 1, \ldots, n$ indexes iterations within a block, so that $t(k,i,j)$ indexes the overall sequence of samples and corresponding updates. The samples are therefore no longer identically distributed and the standard SGD analysis is no longer valid. We study the effect of such block-cyclic sampling on the SGD updates (2).

One can think of the $\mathcal{D}_i$ as population distributions, with each step of SGD being based on a fresh sample, or as empirical distributions over a finite training set that is partitioned into $m$ groups. It is important, however, not to confuse a cycle over the $m$ different components with an epoch over all samples: the notion of an epoch, and the size of the support of each $\mathcal{D}_i$ (i.e. the number of training points if $F$ is viewed as an empirical objective), do not enter our analysis. Either view is consistent with our development, though our discussion will mostly refer to the population view of SGD based on fresh samples.
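To make the setting concrete, here is a minimal NumPy sketch of the update (2) run on the block-cyclic sample stream (4). The interface (`grad_f`, `samplers`) and all names are ours, chosen for illustration; the paper itself specifies no implementation. The later pluralistic variants differ only in how the iterates produced here are averaged.

```python
import numpy as np

def block_cyclic_sgd(grad_f, samplers, K, n, eta, dim):
    """Run the SGD updates (2) on the block-cyclic sample stream (4).

    grad_f(w, z) -- (sub)gradient of f(.; z) at w
    samplers     -- list of m callables; samplers[i]() draws z ~ D_i
    K, n         -- number of cycles and iterations per block
    eta          -- constant step size (e.g. B / sqrt(K * m * n))
    """
    m = len(samplers)
    w = np.zeros(dim)
    iterates = []                      # w_t for t = 1..T, with T = K*m*n
    for k in range(K):                 # cycles (e.g. days)
        for i in range(m):             # blocks within a cycle
            for j in range(n):         # iterations within a block
                z = samplers[i]()      # z_t ~ D_i, as in (4)
                w = w - eta * grad_f(w, z)
                iterates.append(w)
    return np.array(iterates)
```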

3 Lower Bounds for Block-Cyclic Sampling

How badly can the block-cyclic sampling (4) hurt the SGD updates (2)? Unfortunately, we will see that the effect can be quite detrimental, especially when the number $n$ of samples in each block is very large compared to the overall number of cycles $K$, and even if the number of components $m$ is small. This is a typical regime in Federated Learning if we consider daily cycles: we would hope training takes only a small number of days, while the number of SGD iterations per block can be very high. For example, recent models for Gboard (gboard) were trained over 4-5 days, i.e. $K = 4$ or $5$ cycles, with e.g. $m = 6$ blocks if the day is divided into six 4-hour blocks, and with the number $n$ of sequential steps per block far larger than $K$.[1]

[1] The cited work uses the Federated Averaging algorithm with communication every 400 sequential steps, as opposed to averaging gradients every step, so while not exactly matching our setting, it provides a rough estimate for realistic scenarios.

To see how even $m = 2$ can be very bad, consider an extreme setting where the objective depends only on the component index $i$, and is the same for all samples from each $\mathcal{D}_i$. Formally, there are only two possible examples, which we call "1" (the only example in the support of the first component $\mathcal{D}_1$) and "2" (the only example in the support of $\mathcal{D}_2$). The problem then reduces to a deterministic optimization problem of minimizing $F(w) = \frac{1}{2}\big(f(w;1) + f(w;2)\big)$, but in each block we see gradients of only one of the two terms. With a large number of iterations $n$ per block, we might be able to optimize each term separately very well. But we also know that in order to optimize such a composite objective to within $\epsilon$, even if we could optimize each component separately, we must alternate between the two components at least $\Omega(B^2/\epsilon^2)$ times (woodworth2016tight), establishing a lower bound on the number of required cycles $K$ that is independent of the number of iterations per cycle. This extreme deterministic situation establishes the limits of what can be done with a limited number of cycles, and captures the essence of the difficulty with block-cyclic sampling.

In fact, the lower bound we shall see applies not only to the precise SGD updates (2) but to any method that uses block-cyclic samples as in (4) and at each iteration accesses $f(\cdot; z_t)$, i.e. the objective on example $z_t$. More precisely, consider any method that at iteration $t$ chooses a "query point" $w_t$ and evaluates the value $f(w_t; z_t)$ and the gradient (or a sub-gradient) $\nabla f(w_t; z_t)$, where $w_t$ may be chosen in any way based on all previous function and gradient evaluations (the SGD iterates (2) are just one example of such a method). The output $\hat w$ of the algorithm can then be chosen as any function of the query points $w_1, \ldots, w_T$, the evaluated function values and the gradients.

Theorem 1.

Consider any (possibly randomized) optimization method of the form described in the previous paragraph, i.e. where access to the objective is by evaluating $f(w_t; z_t)$ and $\nabla f(w_t; z_t)$ on semi-cyclic samples (4), where $w_t$ is chosen based on all preceding evaluations, and where the output $\hat w$ is based on all the iterates.[2] For any $B$, $K$ and $m \ge 2$ there exists a 1-Lipschitz convex problem over a high enough dimension such that $\mathbb{E}\big[F(\hat w)\big] \ge \min_{\|w\| \le B} F(w) + \Omega\big(\frac{B}{\sqrt{K}}\big)$, where the expectation is over the samples $z_t$ and any randomization in the method.

[2] This theorem, as well as Theorem 2, holds even if the method is allowed "prox queries" of the form $\arg\min_{w} \big( f(w; z_t) + \frac{\lambda_t}{2}\|w - w_t\|^2 \big)$.

Proof.

Let $m = 2$, and let $f(w; 1)$ and $f(w; 2)$ be the following pair of functions, taken from woodworth2018graph, which in turn is based on the constructions in arjevani2015communication; woodworth2017lower; carmon2017lower:

(5)

where $v_0, v_1, v_2, \ldots$ are orthogonal vectors, and for now consider methods whose queries remain in the span of the gradients returned so far. The main observation is that each vector $v_r$ is only revealed once the query includes a component in the direction of $v_{r-1}$ (more formally: $v_r$ is not revealed if $\langle v_{r-1}, w_t \rangle = 0$), and only when the appropriate one of the two components is queried (woodworth2018graph, Lemma 9). That is, each cycle will reveal at most two new vectors: one for the queries during the first block and one for the queries during the second block. After $K$ cycles, the method will thus only have encountered vectors in the span of the first $2K$ vectors $v_1, \ldots, v_{2K}$. But for any $w$ in this span, we have $F(w) \ge \min_{\|w\| \le B} F(w) + \Omega\big(\frac{B}{\sqrt{K}}\big)$ (woodworth2018graph, Lemma 8). These arguments apply as long as the method does not leave the span of the gradients returned so far. Intuitively, in high enough dimension, it is futile to investigate new directions aimlessly. More formally, to ensure that trying out new directions over the $T$ queries wouldn't help, following Appendix C of woodworth2018graph, we can choose the vectors $v_r$ randomly in a high-dimensional space and use a piecewise-quadratic modification of the construction that is 0 when the relevant inner products are small and equal to the original construction when they are large. ∎

Theorem 1 establishes that once we have a block-cyclic structure with multiple blocks per cycle, we cannot ensure an excess error better than

$$\mathbb{E}\big[F(\hat w)\big] - F(w^\star) \;\ge\; \Omega\!\left(\frac{B}{\sqrt{K}}\right). \tag{6}$$

Compared to using i.i.d. sampling as in Eq. (3), this is worse by a factor of $\sqrt{mn}$, which is very large when the number of samples per block $n$ is much larger than the number of cycles $K$, as we would expect in many applications.

For smooth objectives (i.e., with Lipschitz gradients) the situation is quantitatively a bit better, but we can again be arbitrarily worse than i.i.d. sampling as the number of iterations per block increases:

Theorem 2.

Under the same setup as in Theorem 1, for any $B$, $K$ and $m \ge 2$ there exists a 1-Lipschitz convex problem where the gradient is also 1-Lipschitz (i.e., a 1-smooth problem), such that $\mathbb{E}\big[F(\hat w)\big] \ge \min_{\|w\| \le B} F(w) + \Omega\big(\frac{B^2}{K^2}\big)$.

Proof.

Use the same construction as in Theorem 1, but with the components smoothed as in woodworth2018graph. The objective is then smooth (woodworth2018graph, Lemma 7), and the same arguments as in the proof of Theorem 1 hold, except that now any $w$ spanned by $v_1, \ldots, v_{2K}$ has $F(w) \ge \min_{\|w\| \le B} F(w) + \Omega\big(\frac{B^2}{K^2}\big)$ (woodworth2018graph, Lemma 8). ∎

That is, the best we can ensure with block-cyclic sampling, even if the objective is smooth, is an excess error of

$$\Omega\!\left(\frac{B^2}{K^2}\right), \tag{7}$$

i.e., worse by a factor of $\frac{B^2/K^2}{B/\sqrt{Kmn}} = B\sqrt{mn}/K^{3/2}$ compared to i.i.d. sampling, a factor which again grows without bound as the number of iterations per block $n$ increases.

4 A Pluralistic Approach

The slowdown discussed in the previous section is due to the difficulty of finding a consensus solution that is good for all components $\mathcal{D}_i$. But thinking of the problem as a learning problem, this should be avoidable. Why should we insist on a single consensus predictor? In a learning setting, our goal is to be able to perform well on future samples $z$. If the data arrives cyclically in blocks, we should be able to associate each sample with the mixture component $i$ it was generated from. Formally, one can consider a joint distribution over samples and components, where the component $i$ is uniform over $\{1, \ldots, m\}$ and $z \mid i \sim \mathcal{D}_i$. In settings where the data follows a block-cyclic structure, it is reasonable to consider the component index $i$ as observed, and to leverage this at prediction (test) time.

For example, in Federated Learning, this could be done by associating each client with the hours it is available for training. If it is available during multiple blocks, we can associate with it a distribution over components $i$ reflecting its availability.

Instead of learning a single consensus model for all components, we can therefore take a pluralistic approach and learn a separate model $w_i$ for each one of the $m$ components, with the goal of minimizing:

$$\frac{1}{m} \sum_{i=1}^{m} F_i(w_i) \tag{8}$$

where $w_1, \ldots, w_m$ are the per-component models and $F_i(w) = \mathbb{E}_{z \sim \mathcal{D}_i}\big[f(w; z)\big]$ as in (1).

How can this be done? A simplistic approach would be to learn a predictor for each component separately, based only on samples from block $i$. This could be done by maintaining $m$ separate SGD chains, and updating chain $i$ only using samples from block $i$:

$$w_{t+1}^{(i)} = w_t^{(i)} - \eta_t \nabla f\big(w_t^{(i)}; z_t\big) \qquad \text{for } t \in T_i, \tag{9}$$

where $T_i$ denotes the set of steps belonging to blocks of component $i$. We will therefore only have $Kn$ samples per model, but for each model the samples are now i.i.d., and we can learn predictors $\tilde w_i$ (the average of chain $i$'s iterates) such that:

$$\mathbb{E}\big[F_i(\tilde w_i)\big] \;\le\; F_i(w_i^\star) + O\!\left(\frac{B}{\sqrt{Kn}}\right) \tag{10}$$

where $w_i^\star = \arg\min_{\|w\| \le B} F_i(w)$, and so:

$$\mathbb{E}\Big[\frac{1}{m}\sum_{i=1}^{m} F_i(\tilde w_i)\Big] \;\le\; \frac{1}{m}\sum_{i=1}^{m} F_i(w_i^\star) + O\!\left(\frac{B}{\sqrt{Kn}}\right) \tag{11}$$
$$\le\; F(w^\star) + O\!\left(\frac{B}{\sqrt{Kn}}\right). \tag{12}$$

That is, ensuring excess error $\epsilon$ requires $Kn = \Omega(B^2/\epsilon^2)$ iterations per chain, and hence $T = Kmn = \Omega(mB^2/\epsilon^2)$ iterations overall, which represents a slow-down by a factor of $m$ relative to overall i.i.d. sampling.
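For contrast with the single-chain sketch of Section 2, here is a sketch of the separate-chains approach (9), under the same assumed `grad_f`/`samplers` interface; the iterate averaging used to produce each $\tilde w_i$ is the standard online-to-batch conversion, with names of our choosing:

```python
import numpy as np

def per_component_sgd(grad_f, samplers, K, n, eta, dim):
    """Maintain m separate SGD chains as in (9): chain i is updated only
    on the n samples of block i in each of the K cycles, so each chain
    sees K*n i.i.d. samples from its own D_i."""
    m = len(samplers)
    w = [np.zeros(dim) for _ in range(m)]      # one chain per component
    sums = [np.zeros(dim) for _ in range(m)]   # running sums of iterates
    for _ in range(K):
        for i in range(m):
            for _ in range(n):
                z = samplers[i]()
                sums[i] += w[i]                # accumulate w_t before the update
                w[i] = w[i] - eta * grad_f(w[i], z)
    return [s / (K * n) for s in sums]         # averaged predictor per component
```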

The pluralistic approach can have other gains, owing to its multi-task nature, if the sub-problems are very different from each other. An extreme case is when the different subproblems are in direct conflict, in a sense "competing" with each other, and the optimal predictors for the different sub-populations are not compatible with each other. For example, this could happen if a very informative feature has different effects in different subgroups. In this case we might have $\frac{1}{m}\sum_{i} F_i(w_i^\star) \ll F(w^\star)$, and there is a strong benefit to the pluralistic approach regardless of the semi-cyclic sampling.

The lower bounds of Section 3 involve a subtly different conflicting situation, where there is a decent consensus predictor, but finding it requires, in a sense, “negotiations” between the different blocks, and this requires going back-and-forth between them many times, as is possible with i.i.d. non-cyclic data, but not in a block-cyclic situation. Learning separate predictors would bypass this required conflict resolution.

An intermediate situation is when the sub-problems are "orthogonal" to each other, e.g. different features are used in different problems. In this case the optimal per-component predictors $w_i^\star$ are orthogonal to each other, and we might have $w^\star = \sum_i w_i^\star$ with $F(w^\star) = \frac{1}{m}\sum_i F_i(w_i^\star)$. However, in this case we would also have that, on average over $i$, $\|w_i^\star\|^2 = \|w^\star\|^2 / m$, and so with the pluralistic approach we can learn relative to a norm bound that is $\sqrt{m}$ times smaller than would be required when using a single consensus model. This precisely negates the slow-down by a factor of $m$ of the pluralistic learn-separate-predictors approach (9), and we recover the same performance as when learning a consensus predictor based on i.i.d. non-cyclic data.
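To spell out the norm calculation behind this claim (a step the text uses implicitly): by orthogonality,

$$\|w^\star\|^2 = \Big\|\sum_{i=1}^m w_i^\star\Big\|^2 = \sum_{i=1}^m \|w_i^\star\|^2 \qquad\Longrightarrow\qquad \frac{1}{m}\sum_{i=1}^m \|w_i^\star\|^2 = \frac{B^2}{m},$$

so the per-chain guarantee (10), applied with the (average) norm bound $B/\sqrt{m}$ in place of $B$, gives error $O\big(\frac{B/\sqrt{m}}{\sqrt{Kn}}\big) = O\big(\frac{B}{\sqrt{Kmn}}\big) = O\big(\frac{B}{\sqrt{T}}\big)$, matching the i.i.d. guarantee (3).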

The regime where we would see a significant slow-down is when the distributions $\mathcal{D}_i$ are fairly similar. By separating the problem into $m$ distinct problems, and not sharing information between components, we are effectively cutting the amount of data, and updates, by a factor of $m$, causing the slow-down. But if the distributions are indeed similar, then at least intuitively the semi-cyclicity shouldn't be too much of a problem, and a simple single-SGD approach might be OK. At an extreme, if all components are identical ($\mathcal{D}_1 = \cdots = \mathcal{D}_m$), the block-cyclic sampling is actually i.i.d. sampling and we do not have a problem in the first place.

We see, then, that in extreme situations, semi-cyclicity is not a real problem: if the components are extremely "competing", we would be better off with separate predictors, while if they are identical we can just use a single SGD chain and lose nothing. But how do we know which situation we are in? Furthermore, what we actually expect is a combination of the above scenarios, with some aspects being in direct competition between the components, some being orthogonal, and others being aligned and similar. Is there a simple approach that would always allow us to compete with training a single model based on i.i.d. data? That is, a procedure for learning pluralistic predictors $\hat w_1, \ldots, \hat w_m$ such that:

$$\mathbb{E}\Big[\frac{1}{m}\sum_{i=1}^{m} F_i(\hat w_i)\Big] \;\le\; F(w^\star) + O\!\left(\frac{B}{\sqrt{T}}\right) \tag{13}$$

We would like to make a distinction between our objective here and that of multi-task learning. In multi-task learning (caruana1997multitask, and many others), one considers several different but possibly related tasks (e.g. the tasks specified by our component distributions $\mathcal{D}_i$), and the goal is to learn different "specialized" predictors for each task so as to improve over the consensus error, while leveraging the relatedness between tasks so as to reduce sample complexity. In order to do so, it is necessary to target a particular way in which tasks are related (baxter2000model; ben2003exploiting). For example, one can consider shared sparsity (e.g. turlach2005simultaneous), shared linear features or low-rank structure (ando2005framework), shared deep features (e.g. caruana1997multitask), shared kernel or low nuclear-norm structure (e.g. argyriou2008convex; amit2007uncovering), low-norm additive structure (e.g. evgeniou2004regularized), or graph-based relatedness (e.g. evgeniou2005learning; maurer2006bounds). The success of multi-task learning then rests on whether the chosen relatedness assumptions hold, and the sample complexity depends on the specificity of this inductive bias. But we do not want to make any assumptions about relatedness. We would like a pluralistic method that always achieves the guarantee (13), without any additional or even lower-order terms that depend on relatedness. We can hope to achieve this since in (13) we are only trying to compete with the fixed consensus solution $w^\star$, and are resorting to pluralism only in order to overcome data heterogeneity, not in order to leverage it.

5 Pluralistic Averaging

We give a surprisingly simple solution to the above problem. It is possible to compete with $w^\star$ in a semi-cyclic setting, without any additional performance deterioration (at least on average) and with no assumptions about the specific relatedness structure between different components. In fact, this can be done by running a single semi-cyclic SGD chain (2), which we previously argued was problematic. The only modification is that instead of averaging all iterates to obtain a single final predictor (or using a single final iterate), we create $m$ different pluralistic models by averaging, for each component $i$, only the iterates corresponding to that block:

$$\hat w_i = \frac{1}{Kn} \sum_{k=1}^{K} \sum_{j=1}^{n} w_{t(k,i,j)} \tag{14}$$
Theorem 3.

Consider semi-cyclic samples as in (4). The pluralistic averaged solutions given in (14), in terms of the iterates of (2) with step size $\eta_t = B/\sqrt{T}$ and starting at $w_1 = 0$, satisfy

$$\mathbb{E}\Big[\frac{1}{m}\sum_{i=1}^{m} F_i(\hat w_i)\Big] \;\le\; F(w^\star) + \frac{B}{\sqrt{T}} \tag{15}$$

where the expectation is w.r.t. the samples $z_1, \ldots, z_T$.

The main insight is that the SGD guarantee can be obtained through an online-to-batch conversion, and that the online guarantee itself does not require any distributional assumption and so is valid also for semi-cyclic data. Going from Online Gradient Descent to Stochastic Gradient Descent, the i.i.d. sampling is vital for the online-to-batch conversion. But by averaging only over iterates corresponding to samples from the same component, we are in a sense doing an online-to-batch conversion for each component separately, and thus over i.i.d. samples.
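In code, the entire modification relative to the single-chain sketch of Section 2 is the bookkeeping of per-block averages. As before, this is only a sketch under an assumed `grad_f`/`samplers` interface of our own naming:

```python
import numpy as np

def pluralistic_averaging(grad_f, samplers, K, n, eta, dim):
    """Single semi-cyclic SGD chain (2), with per-block averaging (14):
    the returned model for component i averages only the iterates from
    block-i steps, i.e. an online-to-batch conversion per component."""
    m = len(samplers)
    w = np.zeros(dim)
    avg = [np.zeros(dim) for _ in range(m)]
    for _ in range(K):
        for i in range(m):
            for _ in range(n):
                z = samplers[i]()
                avg[i] += w / (K * n)     # accumulate w_t into block-i average
                w = w - eta * grad_f(w, z)
    return avg                            # hat{w}_1, ..., hat{w}_m as in (14)
```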

Proof.

Viewing the updates (2) as implementing online gradient descent (zinkevich2003online),[3] we have the following online regret guarantee for any sequence $z_1, \ldots, z_T$ (and in particular any sequence obtained from any sort of sampling) and any $w$ with $\|w\| \le B$:

[3] The development in zinkevich2003online is for updates that also involve a projection: $w_{t+1} = \Pi_{\{\|w\| \le B\}}\big(w_t - \eta_t \nabla f(w_t; z_t)\big)$. See, e.g., shalev2012online for a development of SGD without the projection, as in (2), and a proof of the regret guarantee (16) for these updates. Although we present the analysis without a projection, our entire development is also valid with a projection as in zinkevich2003online, resulting in all the same guarantees.

$$\sum_{t=1}^{T} f(w_t; z_t) - \sum_{t=1}^{T} f(w; z_t) \;\le\; B\sqrt{T} \tag{16}$$

Choosing $w = w^\star$ on the right hand side, dividing by $T$, and rearranging the summation according to blocks, we have:

$$\frac{1}{m}\sum_{i=1}^{m} \frac{1}{Kn} \sum_{k=1}^{K}\sum_{j=1}^{n} f\big(w_{t(k,i,j)}; z_{t(k,i,j)}\big) \;\le\; \frac{1}{T}\sum_{t=1}^{T} f(w^\star; z_t) + \frac{B}{\sqrt{T}} \tag{17}$$

The above is valid for any sequence $z_1, \ldots, z_T$. Taking the samples to be random, regardless of their distribution, we can take expectations on both sides. For the semi-cyclic sampling (4), and since $w_t$ is independent of $z_t$ (it depends only on the preceding samples), we have $\mathbb{E}\big[f(w_{t(k,i,j)}; z_{t(k,i,j)})\big] = \mathbb{E}\big[F_i(w_{t(k,i,j)})\big]$, and therefore:

$$\frac{1}{m}\sum_{i=1}^{m} \frac{1}{Kn} \sum_{k=1}^{K}\sum_{j=1}^{n} \mathbb{E}\big[F_i\big(w_{t(k,i,j)}\big)\big] \;\le\; \mathbb{E}\Big[\frac{1}{T}\sum_{t=1}^{T} f(w^\star; z_t)\Big] + \frac{B}{\sqrt{T}} \tag{18}$$

where the outer expectation on the left-hand side is w.r.t. the iterates $w_t$. On the right hand side we have that $\mathbb{E}\big[\frac{1}{T}\sum_{t} f(w^\star; z_t)\big] = F(w^\star)$. On the left hand side, we can use the convexity of $F_i$ and apply Jensen's inequality:

$$F_i(\hat w_i) \;\le\; \frac{1}{Kn} \sum_{k=1}^{K}\sum_{j=1}^{n} F_i\big(w_{t(k,i,j)}\big) \tag{19}$$

Substituting (19) back into (18) we have:

$$\mathbb{E}\Big[\frac{1}{m}\sum_{i=1}^{m} F_i(\hat w_i)\Big] \;\le\; F(w^\star) + \frac{B}{\sqrt{T}} \tag{20}$$

Recalling the definition (8) of the pluralistic objective, and that $T = Kmn$, this is exactly the expectation bound (15). ∎

An important, and perhaps surprising, feature of the guarantee of Theorem 3 is that it does not depend on the number of blocks $m$, and does not deteriorate as $m$ increases. That is, in terms of learning, we could partition our cycle into as many blocks as we would like, ensuring homogeneity inside each block, without any accuracy cost. E.g., we could learn a separate model for each hour, minute, or second of the day. The cost here is only the memory and engineering cost of storing, distributing and handling the plenitude of models; there is no accuracy nor direct computational cost. This is in sharp contrast to separate training for each component, as well as to most multi-task approaches.

6 Pluralistic Hedging

The pluralistic averaging approach shows that we can always compete with the best single possible solution $w^\star$, even when faced with semi-cyclic data. But as discussed in Section 4, perhaps in some extreme cases learning separate predictors $\tilde w_i$, each based only on data from the $i$th component as in (9), might be better. That is, depending on the similarity and conflict between components, the guarantees (10) might be better than that of Theorem 3, at least for some of the predictors. But if we do not know in advance which regime we are in, nor which components are better off with a separately-learned model (because they are very different from the others) and which are better off leveraging also the other components, can we still ensure the individual guarantees (10) and the guarantee of Theorem 3 simultaneously?

Here we give a method for doing so, based on running both the single SGD chain (2), with iterates $w_t$, and the $m$ separate SGD chains (9), with iterates $w_t^{(i)}$, and carefully combining them using learned weights for each chain. Let $p_0$ be the weight for the full SGD chain (2), which will be kept fixed throughout, and let $p_t^{(i)}$ be the weights assigned to the block-specific SGD chains (9) on step $t$. We learn the weights using a multiplicative update rule. At step $t$, in which block $i$ is active, this update takes the form:

$$p_{t+1}^{(i)} = p_t^{(i)} \cdot \Big(1 + \eta' \big( f(w_t; z_t) - f\big(w_t^{(i)}; z_t\big) \big)\Big)$$

where $\eta'$ is a learning rate (separate from those of the SGD chains). Then, we let $S_t = p_0 + p_t^{(i)}$ and choose between the full and block-specific SGD chains by:

$$u_t = \begin{cases} w_t & \text{with probability } p_0 / S_t, \\ w_t^{(i)} & \text{with probability } p_t^{(i)} / S_t. \end{cases}$$

Finally, we obtain the final predictors via pluralistic averaging; namely, for each component $i$, we average only the chosen iterates $u_t$ within the corresponding blocks:

$$\hat w_i = \frac{1}{Kn} \sum_{k=1}^{K} \sum_{j=1}^{n} u_{t(k,i,j)} \tag{21}$$

For the averaged solutions $\hat w_i$, we have the following guarantee.

Theorem 4.

Set $\eta = B/\sqrt{T}$ for the SGD chains, and set $\eta'$ and the initial weights $p_0$, $p_1^{(i)}$ as in Corollary 1 (Appendix A), with the full chain playing the role of the anchor expert. Then, for all $i$ we have

$$\mathbb{E}\big[F_i(\hat w_i)\big] \;\le\; F_i(w_i^\star) + O\!\left(\frac{B \log(Kn)}{\sqrt{Kn}}\right),$$

and further, provided that $m \le Kn$,

$$\mathbb{E}\Big[\frac{1}{m}\sum_{i=1}^{m} F_i(\hat w_i)\Big] \;\le\; F(w^\star) + O\!\left(\frac{B}{\sqrt{T}}\right).$$

The requirement that $m \le Kn$ is extremely mild, as we would generally expect a large number of iterations per block $n$ and only a mild number of blocks $m$, i.e. $m \ll Kn$.
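The following sketch implements the hedging scheme as reconstructed above (the full chain as the Prod "anchor", one block chain and one learned weight per component). The multiplicative update and the constants here reflect our reading of the scheme, not the authors' tuned implementation, and the interface is the same assumed one as in the earlier sketches:

```python
import numpy as np

def pluralistic_hedging(grad_f, f, samplers, K, n, eta, eta_w, dim, seed=0):
    """Run the full chain w and the m block chains wb simultaneously;
    keep a fixed weight p0 for the full chain and Prod-style weights
    p[i] for the block chains; pick u_t at random between the two
    chains; average the picked u_t per block as in (21).
    Note: Prod requires |eta_w * loss gap| <= 1/2 for valid weights."""
    rng = np.random.default_rng(seed)
    m = len(samplers)
    w = np.zeros(dim)                        # full semi-cyclic chain (2)
    wb = [np.zeros(dim) for _ in range(m)]   # per-block chains (9)
    p0, p = 0.5, [0.5] * m                   # illustrative initial weights
    avg = [np.zeros(dim) for _ in range(m)]
    for _ in range(K):
        for i in range(m):
            for _ in range(n):
                z = samplers[i]()
                # choose a chain with probability proportional to its weight
                u = w if rng.random() < p0 / (p0 + p[i]) else wb[i]
                avg[i] += u / (K * n)
                # Prod-style multiplicative update on the loss gap
                p[i] *= 1.0 + eta_w * (f(w, z) - f(wb[i], z))
                # SGD updates of both chains
                w = w - eta * grad_f(w, z)
                wb[i] = wb[i] - eta * grad_f(wb[i], z)
    return avg                               # hat{w}_1, ..., hat{w}_m as in (21)
```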

The proofs in this section use the following notation. For each $i$, let $T_i = \{\, t(k,i,j) : k \in [K],\, j \in [n] \,\}$ be the set of time steps in which we receive a sample from distribution $\mathcal{D}_i$ (so that $|T_i| = Kn$). We first prove the following.

Lemma 1.

For each $i$ we have

$$\mathbb{E}\Big[\sum_{t \in T_i} f(u_t; z_t)\Big] \;\le\; \mathbb{E}\Big[\sum_{t \in T_i} f(w_t; z_t)\Big] + O(B)$$

and

$$\mathbb{E}\Big[\sum_{t \in T_i} f(u_t; z_t)\Big] \;\le\; \mathbb{E}\Big[\sum_{t \in T_i} f\big(w_t^{(i)}; z_t\big)\Big] + O\big(B\sqrt{Kn}\,\log(Kn)\big).$$

Here, the expectation is taken w.r.t. the samples $z_t$ as well as the internal randomization of the algorithm.

The proof (in Appendix A) is based on classic analysis of the Prod algorithm (Cesa-Bianchi2007; see also Even-Dar2008; sani2014exploiting). We can now prove the main theorem of this section.

Proof of Theorem 4.

The proof follows from a combination of Lemma 1 with regret bounds for the SGD chains (2) and (9), viewed as trajectories of the Online Gradient Descent algorithm. Standard regret bounds for the latter (zinkevich2003online; see also shalev2012online; hazan2016introduction) imply that for any sequence $z_1, \ldots, z_T$ and for any $w$ with $\|w\| \le B$, it holds that

$$\sum_{t=1}^{T} f(w_t; z_t) \;\le\; \sum_{t=1}^{T} f(w; z_t) + O\big(B\sqrt{T}\big) \tag{22}$$

and, for each $i$ and for any $w$ such that $\|w\| \le B$,

$$\sum_{t \in T_i} f\big(w_t^{(i)}; z_t\big) \;\le\; \sum_{t \in T_i} f(w; z_t) + O\big(B\sqrt{Kn}\big). \tag{23}$$

Now, fix $i$; the second bound of Lemma 1 together with Eq. (23), applied with $w = w_i^\star$, imply

$$\mathbb{E}\Big[\sum_{t \in T_i} f(u_t; z_t)\Big] \;\le\; \mathbb{E}\Big[\sum_{t \in T_i} f(w_i^\star; z_t)\Big] + O\big(B\sqrt{Kn}\,\log(Kn)\big).$$

Using the facts that $\mathbb{E}[f(u_t; z_t)] = \mathbb{E}[F_i(u_t)]$ and $\mathbb{E}[f(w; z_t)] = F_i(w)$ for any fixed $w$ and $t \in T_i$ (as $u_t$ is independent of $z_t$), we have

$$\mathbb{E}\Big[\sum_{t \in T_i} F_i(u_t)\Big] \;\le\; Kn \, F_i(w_i^\star) + O\big(B\sqrt{Kn}\,\log(Kn)\big).$$

Appealing to the convexity of $F_i$ and applying Jensen's inequality on the left-hand side to lower bound $\sum_{t \in T_i} F_i(u_t) \ge Kn\, F_i(\hat w_i)$, and dividing through by $Kn$, we obtain

$$\mathbb{E}\big[F_i(\hat w_i)\big] \;\le\; F_i(w_i^\star) + O\!\left(\frac{B \log(Kn)}{\sqrt{Kn}}\right),$$

which implies the first guarantee of the theorem. For the second claim, the first bound of Lemma 1, summed over $i$, implies

$$\sum_{i=1}^{m} \mathbb{E}\Big[\sum_{t \in T_i} f(u_t; z_t)\Big] \;\le\; \mathbb{E}\Big[\sum_{t=1}^{T} f(w_t; z_t)\Big] + O(mB).$$

Summing this with Eq. (22) (applied with $w = w^\star$) and dividing through by $T$ gives

$$\frac{1}{T}\sum_{i=1}^{m} \mathbb{E}\Big[\sum_{t \in T_i} f(u_t; z_t)\Big] \;\le\; \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[f(w^\star; z_t)\big] + O\!\left(\frac{B}{\sqrt{T}}\right) + O\!\left(\frac{mB}{T}\right).$$

Next, as before, substitute the conditional expectations and use Jensen's inequality to lower bound the left-hand side; this yields

$$\mathbb{E}\Big[\frac{1}{m}\sum_{i=1}^{m} F_i(\hat w_i)\Big] \;\le\; F(w^\star) + O\!\left(\frac{B}{\sqrt{T}}\right) + O\!\left(\frac{mB}{T}\right).$$

Recalling now the definitions $T = Kmn$ and $F = \frac{1}{m}\sum_i F_i$, we have shown the desired bound up to the additive $O(mB/T)$ term. Noting that $m \le Kn$ implies $m \le \sqrt{T}$, and hence $mB/T \le B/\sqrt{T}$, concludes the proof. ∎

7 Experiments

To illustrate the challenges of optimizing on block-cyclic data, we train and evaluate a small logistic regression model on the Sentiment140 Twitter dataset (go09), a binary classification task over roughly 1.6 million examples. We split the data into training (90%) and test (10%) sets, partition it into $m = 6$ components based on the timestamps (but not dates) of the Tweets (12am-4am, 4am-8am, etc.), and then divide each component across cycles (days); for more details, see Appendix B. For simplicity, we keep the model architecture (a linear bag-of-words classifier over the top 1024 words) and minibatch size (128) fixed; we used a learning rate determined through log-space grid search, except for the per-component SGD approach (9), where a different learning rate was optimal due to the smaller number of iterations per chain.

To illustrate the differences between the proposed methods, as well as to capture the diurnal variation expected in practical Federated Learning settings, we vary the label balance as a function of the time of day, ensuring the component distributions $\mathcal{D}_i$ are somewhat different. In particular, we randomly drop negative posts from the midnight component so its overall rate is 2/3 positive sentiment, we randomly drop positive posts from the noon component so its overall rate is 1/3 positive sentiment, and we linearly interpolate for the other 4 components; a code sketch of this construction follows below. For discussion, and a figure giving positive and negative label counts over components, see Appendix B.2. We write $A_i(w)$ for the empirical accuracy of a model $w$ on component $i$ of the test set, with $A(w) = \frac{1}{m}\sum_{i=1}^m A_i(w)$ measuring the overall test accuracy.
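A sketch of this label-skew construction (our own code; the bookkeeping in the paper's appendix may differ). Posts are assumed to be dicts with a binary `label` field, grouped into the $m = 6$ time-of-day components, with component 0 being midnight:

```python
import numpy as np

def skew_labels(components, rng=np.random.default_rng(0)):
    """Drop posts so the positive rate is 2/3 at midnight, 1/3 at noon,
    linearly interpolated (by block distance from midnight) in between."""
    m = len(components)                     # m = 6 four-hour components
    out = []
    for i, posts in enumerate(components):
        d = min(i, m - i)                   # block distance from midnight
        target = 2/3 - (1/3) * d / (m // 2) # positive rate: 2/3 -> 1/3 -> 2/3
        pos = [p for p in posts if p["label"] == 1]
        neg = [p for p in posts if p["label"] == 0]
        if target >= 0.5:                   # drop negatives to hit the target
            keep = min(len(neg), int(len(pos) * (1 - target) / target))
            idx = rng.choice(len(neg), size=keep, replace=False)
            neg = [neg[j] for j in idx]
        else:                               # drop positives to hit the target
            keep = min(len(pos), int(len(neg) * target / (1 - target)))
            idx = rng.choice(len(pos), size=keep, replace=False)
            pos = [pos[j] for j in idx]
        out.append(pos + neg)
    return out
```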

We consider the following approaches:

  1. A non-pluralistic block-cyclic consensus model, subject to the lower bounds of Section 3. The accuracy of this procedure for day $k$ is given by the expected test-set accuracy from randomly picking $i \in \{1, \ldots, m\}$ and evaluating $A\big(w_{t(k,i,n)}\big)$, i.e. evaluating the model after completing a random block on each day $k$.

  2. The per-component SGD approach (9), where we run $m$ SGD chains, with each model training on only one block per day. For each day $k$, we evaluate $\frac{1}{m}\sum_{i=1}^m A_i\big(\tilde w_i^{(k)}\big)$, where $\tilde w_i^{(k)}$ is the model for component $i$ trained on the first $k$ blocks for component $i$.

  3. The pluralistic single SGD chain approach of Section 5. The SGD chain is the same as for the consensus model, but evaluation is pluralistic as above: for each day $k$ we evaluate $\frac{1}{m}\sum_{i=1}^m A_i\big(w_{t(k,i,n)}\big)$, i.e. for each test component we use the model at the end of the corresponding training block (see the code sketch below).

  4. An (impractical in real settings) idealized i.i.d. model, trained on the complete shuffled training set. For purposes of evaluation, we treat the data as if it had cyclic block structure and evaluate identically to the block-cyclic consensus model.

See Appendix B.3 for more details on evaluation. Note that we do not use a model average, but rather take the most recent relevant iterate, matching how such models are typically deployed in practice.
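A sketch of the pluralistic evaluation (approach 3), with hypothetical helpers of our own naming: `accuracy(w, i)` computes $A_i(w)$ on the test split, and `iterates` holds the chain's iterates $w_t$ (e.g. as returned by the Section 2 sketch):

```python
def pluralistic_day_accuracy(accuracy, iterates, k, m, n):
    """Mean over components i of A_i evaluated at the iterate that ends
    training block i of day k (0-indexed); no averaging of iterates."""
    return sum(
        accuracy(iterates[(k * m + i + 1) * n - 1], i) for i in range(m)
    ) / m
```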

Figure 1:

Accuracy of a model trained on block-cyclic data, tested on two components of data: posts from the midnight component (12am-4am) and posts from the noon component (12pm-4pm). Means and standard deviations are computed from ten training repetitions.

Figure 1 illustrates how the block-cyclic structure of data can hurt the accuracy of a consensus model. We plot how the accuracy on two test data components (the midnight and the noon components) changes as a function of time of day, evaluated on the iterates of the SGD chain (2). Training is quick to leverage the label bias and gravitates toward a model too specific to the block being processed, instead of finding a better predictive solution based on other features that would help predictions across all groups.

In Figure 2 we compare results from the four different training and test methods introduced above. First, pluralistic models with separate SGD chains take longer to converge because they are trained on less data. Depending on how long training proceeds, the size of the data set, and the number of components, these models may or may not surpass the idealized i.i.d. SGD and pluralistic single SGD chain models in accuracy. Second, the pluralistic models from a single SGD chain consistently perform better than a single consensus model from the same chain. Third, the performance of the pluralistic models is on par with or better than the idealized i.i.d. SGD model, reflecting the fact that these models better match the target data distribution than a single model (i.i.d. or block-cyclic consensus) can.

8 Summary

We considered training in the presence of block-cyclic data, showed that ignoring this source of data heterogeneity can be detrimental, but that a remarkably simple pluralistic approach can entirely resolve the problem and ensure, even in the worst case, the same guarantee as with homogeneous i.i.d. data, and without any degradation based on the number of blocks. When the component distributions are actually different, pluralism can outperform the “ideal” i.i.d. baseline, as our experiments illustrate. An important lesson we want to stress is that pluralism can be a critical tool for dealing with heterogeneous data, by embracing this heterogeneity instead of wastefully insisting on a consensus solution.

Dealing with heterogeneous data, users or clients can be difficult in many machine learning settings, and especially in Federated Learning where learning happens on varied devices with different characteristics, each handling its own distinct data. Other heterogeneous aspects we can expect are variabilities in processing time and latency, amount of data per client, and client availability and reliability, which can all be correlated in different ways with different data components. All these pose significant challenges when training. We expect pluralistic solutions might be relevant in many of these situations.

Figure 2: Comparison of various training and testing methodologies for a simple model on a sentiment classification task, using i.i.d. and block-cyclic data. Means and standard deviations shown are computed from ten training repetitions.

In this paper we considered only optimizing convex objectives using sequential SGD. In many cases we are faced with non-convex objectives, as when training deep neural networks. As with many other training techniques, we expect this study of convex semi-cyclic training to be indicative also of non-convex scenarios (except that a random predictor would need to be used instead of an averaged predictor in the online-to-batch conversion), and also to serve as a basis for further analysis of the non-convex case. We are also interested in analyzing the effect of block-cyclic data on methods beyond sequential SGD, most prominently parallel SGD (a.k.a. Federated Averaging). Unfortunately, there is currently no satisfying and tight analysis of parallel SGD even for i.i.d. data, making a detailed analysis of semi-cyclicity for this method beyond our current reach. Nevertheless, we again hope our analysis will be both indicative and serve as a basis for future exploration of parallel SGD and other distributed optimization approaches.

Acknowledgements

NS was supported by NSF-BSF award 1718970 and a Google Faculty Research Award.

References

Appendix A Deferred Proofs from Section 6

In this section, we give a proof of Lemma 1. Such a result has been previously shown in Even-Dar2008; sani2014exploiting, building on a lemma of Cesa-Bianchi2007. We give a full proof for completeness. We start with a simple lemma.

Lemma 2.

For any $x \ge -\frac{1}{2}$,

$$x - x^2 \;\le\; \ln(1 + x) \;\le\; x.$$

Proof.

The upper bound $\ln(1+x) \le x$ is standard, and follows e.g. from the concavity of the log function. For the lower bound, write

$$g(x) = \ln(1+x) - x + x^2.$$

Then $g'(x) = \frac{1}{1+x} - 1 + 2x = \frac{x(1+2x)}{1+x}$. Thus $g$ is decreasing in $[-\frac{1}{2}, 0]$ and increasing in $[0, \infty)$. Thus in the range $x \ge -\frac{1}{2}$, $g(x) \ge g(0) = 0$. ∎

We consider the more general setting of $N$ experts with losses in $[0, C]$, and a chosen "anchor" expert $a$, with respect to which we want constant regret. We consider the Prod algorithm that starts out with initial weights $p_1^1, \ldots, p_1^N$ summing to one. At time step $t$, it picks an expert $e_t$ at random with probability proportional to the current weights:

$$\Pr[e_t = e] = \frac{p_t^e}{\sum_{e'} p_t^{e'}}.$$

Finally, on receiving the loss function $\ell_t : \{1, \ldots, N\} \to [0, C]$, it updates the weights according to the multiplicative update

$$p_{t+1}^e = p_t^e \cdot \big(1 + \eta' \delta_t^e\big), \qquad \delta_t^e := \ell_t(a) - \ell_t(e),$$

so that, in particular, the anchor's weight $p_t^a$ never changes.

Lemma 3.

Assume that $\eta' C \le \frac{1}{2}$. Then this Prod algorithm achieves

$$\mathbb{E}\Big[\sum_{t} \ell_t(e_t)\Big] \;\le\; \sum_{t} \ell_t(e) + \frac{\ln(1/p_1^e)}{\eta'} + \eta' \sum_{t} (\delta_t^e)^2$$

for all $e$, and

$$\mathbb{E}\Big[\sum_{t} \ell_t(e_t)\Big] \;\le\; \sum_{t} \ell_t(a) + \frac{\ln(1/p_1^a)}{\eta'}.$$

Proof.
Proof.

Let $W_t = \sum_e p_t^e$, and let $\delta_t^e = \ell_t(a) - \ell_t(e)$ denote the gap between the chosen expert $a$ and expert $e$ at step $t$. Note that $|\eta' \delta_t^e| \le \eta' C \le \frac{1}{2}$.

On the one hand, for any expert $e$,

$$\ln W_{T+1} \;\ge\; \ln p_{T+1}^e \;=\; \ln p_1^e + \sum_{t=1}^{T} \ln\big(1 + \eta' \delta_t^e\big).$$

On the other hand, for any $t$,

$$\ln \frac{W_{t+1}}{W_t} \;=\; \ln \sum_e \frac{p_t^e}{W_t}\big(1 + \eta' \delta_t^e\big) \;=\; \ln\Big(1 + \eta'\, \mathbb{E}_{e \sim p_t}\big[\delta_t^e\big]\Big) \;\le\; \eta'\, \mathbb{E}\big[\delta_t^{e_t}\big],$$

where the last inequality holds since $\ln(1+x) \le x$. It follows that, for any $e$ (using $W_1 = 1$),

$$\eta' \sum_{t=1}^{T} \mathbb{E}\big[\delta_t^{e_t}\big] \;\ge\; \ln W_{T+1} \;\ge\; \ln p_1^e + \sum_{t=1}^{T} \ln\big(1 + \eta' \delta_t^e\big).$$

Moreover, using the lower bound in Lemma 2 (valid since $|\eta' \delta_t^e| \le \frac{1}{2}$), we also have

$$\sum_{t=1}^{T} \ln\big(1 + \eta' \delta_t^e\big) \;\ge\; \eta' \sum_{t=1}^{T} \delta_t^e - (\eta')^2 \sum_{t=1}^{T} (\delta_t^e)^2.$$

This implies, after substituting $\mathbb{E}[\delta_t^{e_t}] = \ell_t(a) - \mathbb{E}[\ell_t(e_t)]$ and $\delta_t^e = \ell_t(a) - \ell_t(e)$, cancelling the $\sum_t \ell_t(a)$ terms, and dividing through by $\eta'$, that

$$\mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(e_t)\Big] \;\le\; \sum_{t=1}^{T} \ell_t(e) + \frac{\ln(1/p_1^e)}{\eta'} + \eta' \sum_{t=1}^{T} (\delta_t^e)^2.$$

Taking $e = a$, for which $\delta_t^a = 0$, gives the second bound. ∎

Optimizing parameters, we get the corollary:

Corollary 1.

Set $\eta' = \frac{1}{2C\sqrt{T}}$, initialize $p_1^a = 1 - \eta' C$ and split the remaining weight equally among the other experts, and note that then $\eta' C \le \frac{1}{2}$. Then the algorithm achieves the following regret bounds:

$$\mathbb{E}\Big[\sum_{t} \ell_t(e_t)\Big] \;\le\; \sum_{t} \ell_t(e) + O\big(C\sqrt{T}\,\ln(NT)\big) \quad\text{for all } e,$$

and

$$\mathbb{E}\Big[\sum_{t} \ell_t(e_t)\Big] \;\le\; \sum_{t} \ell_t(a) + O(C).$$

Lemma 1 follows from the two-expert ($N = 2$) version of this corollary, applied within each block $i$ with horizon $Kn$ and losses bounded by $C = O(B)$, where the two experts are the full-chain iterates $(w_t)$ (as the anchor) and the block-specific iterates $(w_t^{(i)})$.

Appendix B Experimental Details

B.1 Dataset

The Sentiment140 dataset (go09) was collected by querying Twitter (a popular social network) for posts (a.k.a. Tweets) containing positive and negative emoticons, and labeling the retrieved posts (with the emoticons removed) as positive and negative sentiment, respectively.

The data sets used for the above scenarios are created by first shuffling the data randomly and splitting it into a training set (90% of the examples) and a test set (10%). This data is used as-is for training and evaluating the idealized i.i.d. model. For the other scenarios, trained on block-cyclic data, we group the shuffled training set and test set into $m = 6$ blocks each, by the time of day of the post (e.g. midnight block: posts from 12am-4am; noon block: posts from 12pm-4pm). This results in blocks of varying sizes.

We simulate $K$ cycles (days). Observing that one pass (epoch) over the entire i.i.d. data set was sufficient for convergence of our relatively small model, we make a single pass over the training data, so that each day processes a $1/K$ fraction of the training examples, and each block within a day a $1/(Km)$ fraction on average.

B.2 Artificially balanced labels

Figure 3:

Sentiment bias as a function of time of day in the Sentiment140 dataset. For the experiments in this paper, we introduced additional time-of-day-dependent label skew to allow for a clearer illustration of how the pluralistic approaches differ.

The raw data grouped by time of day exhibits some block-cyclic characteristics; for instance, positive tweets are slightly more likely at night-time hours than day-time hours (see Figure 3). However, we believe this dataset has an artificially balanced label distribution, which is not ideal for illustrating semi-cyclic behavior (go09). In particular, the data collection process separately queried the Twitter API every 2 minutes for positive tweets (defined to be those containing the :) emoticon), and simultaneously for negative sentiment via :(. Since only up to 100 results are returned per API query, this will generally produce an (artificially) balanced label distribution, as in Fig. 3. Due to this fact, because large diurnal variations are likely in practice in Federated Learning (e.g., differences in the use of the English language between the US and India), and because it better illustrates our theoretical results, we adjust the positive-sentiment rate as a function of time as described in Section 7.

B.3 Details of evaluation methodology

For the block-cyclic consensus model, picking a random end-of-block iterate $w_{t(k,i,n)}$ ensures we evaluate a set of models that have the same expected number of iterations as for the single-chain pluralistic approach, without using block-specific models. In the implementation, we compute the expectation of this quantity by evaluating all end-of-block iterates against all test components and averaging.