# Adaptive Sampling Distributed Stochastic Variance Reduced Gradient for Heterogeneous Distributed Datasets

We study distributed optimization algorithms for minimizing the average of heterogeneous functions distributed across several machines, with a focus on communication efficiency. In such settings, naively using classical stochastic gradient descent (SGD) or its variants (e.g., SVRG) with uniform sampling of machines typically yields poor performance: the convergence rate often depends on the maximum Lipschitz constant of the gradients across the devices. In this paper, we propose a novel adaptive sampling of machines tailored to these settings. Our method relies on an adaptive estimate of local Lipschitz constants based on information from past gradients. We show that this sampling scheme improves the dependence of the convergence rate from the maximum Lipschitz constant to the average Lipschitz constant across machines, thereby significantly accelerating convergence. Our experiments demonstrate that our method indeed speeds up the convergence of the standard SVRG algorithm in heterogeneous environments.


## 1 Introduction

In this paper, we study distributed optimization algorithms to solve finite-sum problems of the form:

$$\min_{x \in \mathbb{R}^d} F(x) := \frac{1}{M} \sum_{m=1}^{M} F_m(x), \tag{1}$$

where $F_m(x) := \frac{1}{|S_m|} \sum_{j \in S_m} f_j(x)$ and the index sets $S_1, \dots, S_M$ are disjoint. Here the sets $S_m$ partition a large dataset so that each worker's dataset contains only its own share of the data points (the size $|S_m|$ may vary across the workers, and the analysis of our method would stay the same). This problem arises naturally in machine learning in the form of empirical risk minimization. We are particularly interested in the decentralized distributed learning setting where each $S_m$ is stored locally on a worker. In this setting, each function $F_m$ is a local average of the total average loss function $F$. We aim to minimize the total average loss function with minimal communication amongst the workers.
Traditional distributed machine learning settings assume that each worker has independent and identically distributed (i.i.d.) samples from an underlying distribution. This implies that each local average loss function is statistically similar to the total average loss function $F$, due to the law of large numbers. In contrast, we assume that the data on each worker may be generated from different distributions. Consequently, the local average loss functions can be very different from each other and from the total average loss function. A typical example of this setting is one where a large dataset is gathered on a server and then distributed unevenly to the workers, in the sense that each worker only holds some of the main features of the whole data. Another canonical example is learning a machine learning model from the data of mobile phone users. Here each mobile phone user is a worker and holds data such as photos and texts that reflect their interests. As a result, the characteristics of the data on each mobile phone vary by user. Our setting is a particular case of a more general framework, Federated Learning [konevcny2016federated], which is a challenging and exciting setting for distributed optimization.

In the settings above, the change of the gradients of some workers' local functions could dominate the change of the gradient of the global function $F$. We refer to these workers as informative workers. In contrast, the gradients of some workers might change so slowly that their contribution to the change of the gradient of $F$ is almost negligible. Hereafter, we refer to such workers as non-informative workers. Naively using SGD or its variance-reduced variants (e.g., SVRG) with uniform sampling often yields poor performance in such settings because the majority of the computation is spent on non-informative workers. This insight was exploited in [chen2018lag] to avoid frequently recomputing the gradients of non-informative workers in deterministic gradient descent (GD). Our work can be viewed as its stochastic counterpart.

Our primary goal in this paper is to design an adaptive sampling strategy for SVRG, a variance-reduced variant of SGD, that works efficiently in the heterogeneous setting of interest by paying more attention to informative workers. We emphasize that the information held at each worker may be very different, so treating all workers in the same way may result in inefficiency. For instance, using the uniform distribution to select workers, as in standard SGD and SVRG, slows down convergence since it keeps revisiting non-informative workers. More precisely, since the gradients of non-informative workers are very small compared to the gradients of informative workers, the optimization makes very small (or almost zero) progress by following these directions. Thus, it is desirable to design an adaptive optimization method that is able to select useful workers during the training process. By selecting workers actively, we can reach a predetermined precision in fewer iterations than with uniform sampling.

Contributions. In light of the above discussion, we state the main contributions of this work.

• First, we develop an adaptive sampling strategy for the SVRG algorithm and show that it improves the convergence of SVRG in the heterogeneous setting. Our method is also robust to the case where data are homogeneous across most machines while a few machines hold outlier data with much larger Lipschitz constants. In detail, our adaptive sampling technique pays more attention to informative workers. Consequently, the dependence of the SVRG convergence rate on the maximum of the Lipschitz constants is reduced to a dependence on their average. In addition, our experiments show that our adaptive algorithm is more stable with large step sizes than the standard SVRG algorithm.

• Second, we design an efficient adaptive local Lipschitz estimation method, which yields another version of the importance sampling algorithm of [xiao2014proximal]. Our method improves on that algorithm in the sense that it needs no prior information about the exact or estimated values of the Lipschitz constants. We provide a theoretical analysis of the estimation method and show that its convergence rate is almost the same as that of the importance sampling strategy.

• Third, we propose a new parallel communication method with optimal cost. This method enables sampling with respect to weights when, initially, each machine knows only its own weight. In detail, we show that our parallel sampling technique can choose workers using only worker-to-worker communications, at a transfer cost that does not grow with the number of workers sampled.

## 2 Related Work

#### Single-machine Setting:

Although there were some efficient SGD-based approaches for the single-machine setting [bottou2010large, robbins1951stochastic], none of them achieved better than a sub-linear convergence rate. This led to SVRG [johnson2013accelerating] and other methods [le2013stochastic, defazio2014saga, bouchard2015online, zhao2015stochastic] that address variance reduction and hence improve the convergence rate. Serving the same purpose, gradient-based approximate sampling methods [alain2015variance, katharopoulos2017biased, katharopoulos2018not] were proposed, but they suffer from a high computation cost. To solve this problem, more robust and less computationally demanding methods based on gradient norms [johnson2018training, stich2017safe] were used to reduce the sampling cost while still maintaining the variance reduction goal.

#### Distributed Learning:

Distributing large-scale datasets across multiple servers is an effective way to reduce per-server storage and memory utilization [dean2008mapreduce, zaharia2010spark, dean2012large]. The first and traditional approach is synchronous parallel minibatch SGD [dekel2012optimal, li2014communication]. Although it splits the workload across many nodes to speed up jobs, this method suffers from high latency when one or a few nodes are slow. This issue is addressed by a second group of methods, the asynchronous ones [recht2011hogwild, reddi2015variance, duchi2013estimation].

#### Communication Efficiency:

To overcome the communication burden in distributed optimization, communication-efficient methods have been proposed [zinkevich2010parallelized, zhang2013information, zhang2012communication, shamir2014communication, reddi2016aide, chen2018lag]. The methods of [zinkevich2010parallelized, shamir2014communication, reddi2016aide] reduce the number of communication rounds by increasing the computation on local workers. However, those approaches assume the i.i.d. setting, unlike ours. In contrast, the work of [chen2018lag] tackles the non-i.i.d. setting. Specifically, they propose an algorithm that detects slowly varying gradients and skips their calculation when computing the full gradients, which reduces the communication cost.

## 3 Preliminaries

#### Notations:

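The standard inner product and the norm induced by it are denoted by $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ respectively. Conditional and full expectations are denoted by $\mathbb{E}[\cdot \mid \cdot]$ and $\mathbb{E}[\cdot]$. The sets $\{1, \dots, M\}$, $\{N, \dots, M\}$ and $\{1, \dots, M\}^N$ are represented by $[M]$, $[N, M]$ and $[M]^N$ respectively. Adding a scalar $a$ to a set $S$ corresponds to the set in which each entry is increased by $a$: $S + a = \{s + a : s \in S\}$. $\lceil \cdot \rceil$ stands for the usual ceiling function.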

#### Problem Setup:

We consider the finite-sum optimization problem (1) in the distributed learning setting where each function $F_m$ is stored on a local worker. We assume workers can communicate with each other and also with the server. However, each type of communication has its own cost. In practice, servers can broadcast to multiple workers at once, but not vice versa [chen2018lag]. Therefore, we assume that server-to-worker communication is cheaper than worker-to-worker communication, while worker-to-server communication is more expensive. The cost of information flow is thus dominated by worker-to-server and worker-to-worker communications. For the convergence analysis, we assume that each function $F_j$ is convex with $L_j$-Lipschitz gradients. In other words, for any $x, y \in \mathbb{R}^d$ and $j \in [M]$, we have the following:

$$F_j(y) - F_j(x) - \langle \nabla F_j(x),\, y - x \rangle \le \frac{L_j}{2} \|y - x\|^2.$$

Moreover, we use $\bar{L}$ to denote the average of the Lipschitz constants and $\tilde{L}$ for their maximum. Due to the non-i.i.d. data distribution, we assume that the $L_j$'s vary highly across the workers. We assume each $F_j$ is $\lambda_j$-strongly convex (and $F$ is $\lambda$-strongly convex), i.e. for any $x, y \in \mathbb{R}^d$ we have:

$$F_j(y) - F_j(x) - \langle \nabla F_j(x),\, y - x \rangle \ge \frac{\lambda_j}{2} \|y - x\|^2.$$

Finally, we denote by $R$ a parameter defined through the Lipschitz constants of the gradients of the atomic functions $f_j$. $R$ is a parameter of each machine's dataset that represents how much its data points differ from the dataset average, and in this paper we assume that it is not excessively large.

#### Motivation:

The main workhorse in large-scale optimization for machine learning is the SGD algorithm. At each iteration, this method picks a function uniformly at random and then uses the gradient of the chosen function in the descent update instead of the full gradient. Although computation is saved, the convergence rate of SGD depends strongly on the variance of the stochastic gradients. [bottou2018optimization] shows that, under a bound on the second moment of the stochastic gradients that holds for some positive constants, the following statement is satisfied.

Theorem: If we choose the step size for some and such that , then for all , the expected optimality gap satisfies

$$\mathbb{E}[F(x_t) - F(x^*)] \le \frac{\nu}{\sigma + t}$$

where and is the Lipschitz constant of .

The convergence rate above is sublinear, and it also depends on the Lipschitz constant of the gradient, which may be very large in practice. The SVRG algorithm was therefore proposed in [johnson2013accelerating] to overcome these issues.

At each inner iteration the algorithm chooses a function $F_m$ according to a distribution $p = (p_1, \dots, p_M)$ and constructs an unbiased estimate $v_t$ of the gradient $\nabla F(x_{t-1})$. Similarly to the SGD algorithm, the convergence rate of this method is then affected by the term $\mathbb{E}\|v_t\|^2$. Intuitively, smaller values of this quantity give a better rate. After algebraic manipulations, we obtain the following equation:

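$$\mathbb{E}\|v_t\|^2 = \sum_{m=1}^{M} \frac{\|\nabla F_m(x_{t-1}) - \nabla F_m(\bar{x}_{k-1})\|^2}{M^2 p_m} + \|\nabla F(x_{t-1})\|^2 - \|\nabla F(x_{t-1}) - \nabla F(\bar{x}_{k-1})\|^2 \tag{2}$$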

Notice that the first term above depends on the distribution $p$, which means that the choice of $p$ has some effect on the convergence rate. One standard choice of $p$ in practice is the uniform distribution. In this case, the standard analysis of the SVRG algorithm [johnson2013accelerating] shows that:

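$$\mathbb{E}[F(\bar{x}_k) - F(x^*)] \le \left( \frac{1}{\lambda \eta T (1 - 2\eta \tilde{L})} + \frac{2\eta \tilde{L}}{1 - 2\eta \tilde{L}} \right) [F(\bar{x}_0) - F(x^*)].$$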

We notice that this rate depends on the maximum Lipschitz constant $\tilde{L}$. Although this rate is better than that of previous methods, the dependence on $\tilde{L}$ can be inefficient when the data are non-i.i.d., especially in the distributed machine learning setting, since it may cause many communications. Given that our goal is to design a communication-efficient algorithm, it is crucial to reduce this dependency. [xiao2014proximal] proposed a solution to this problem when the Lipschitz constants of the gradients are known in advance, by fixing the distribution above to $p_m = L_m / \sum_{j=1}^{M} L_j$. They showed that the convergence rate is as follows:

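$$\mathbb{E}[F(\bar{x}_k) - F(x^*)] \le \left( \frac{1}{\lambda \eta T (1 - 2\eta \bar{L})} + \frac{2\eta \bar{L}}{1 - 2\eta \bar{L}} \right) [F(\bar{x}_0) - F(x^*)],$$

which depends on a smaller constant, the average $\bar{L}$ of the Lipschitz constants.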
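Identity (2) can be sanity-checked numerically. The sketch below assumes the estimator has the importance-weighted form $v_t = (\nabla F_m(x_{t-1}) - \nabla F_m(\bar{x}_{k-1}))/(M p_m) + \nabla F(\bar{x}_{k-1})$, as in [xiao2014proximal]; this form is not spelled out above, and the function names and random stand-in gradients are purely illustrative.

```python
import random

def sq_norm(v):
    return sum(c * c for c in v)

def mean_rows(rows):
    # Coordinate-wise average of a list of equally sized vectors.
    return [sum(col) / len(rows) for col in zip(*rows)]

def second_moment_direct(g_cur, g_snap, probs):
    """E||v||^2 computed from the definition, with worker m drawn w.p. probs[m]."""
    M = len(g_cur)
    full_snap = mean_rows(g_snap)  # grad F at the snapshot point
    total = 0.0
    for gc, gs, p in zip(g_cur, g_snap, probs):
        v = [(a - b) / (M * p) + f for a, b, f in zip(gc, gs, full_snap)]
        total += p * sq_norm(v)
    return total

def second_moment_formula(g_cur, g_snap, probs):
    """Right-hand side of (2)."""
    M = len(g_cur)
    full_cur, full_snap = mean_rows(g_cur), mean_rows(g_snap)
    first = sum(sq_norm([a - b for a, b in zip(gc, gs)]) / (M * M * p)
                for gc, gs, p in zip(g_cur, g_snap, probs))
    drift = [a - b for a, b in zip(full_cur, full_snap)]
    return first + sq_norm(full_cur) - sq_norm(drift)

# Hypothetical per-worker gradients at the current point and at the snapshot.
rng = random.Random(0)
M, d = 4, 3
g_cur = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(M)]
g_snap = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(M)]
probs = [0.1, 0.2, 0.3, 0.4]
lhs = second_moment_direct(g_cur, g_snap, probs)
rhs = second_moment_formula(g_cur, g_snap, probs)
```

The two quantities agree up to floating-point error for any strictly positive distribution, which makes explicit that only the first term of (2) depends on the sampling probabilities.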

## 4 Theoretical Results

In this section, we discuss the details of the theoretical contributions of this work. First, we provide the intuition behind estimating the local Lipschitz constants of each machine. Second, we present the main algorithm, which uses the idea of local Lipschitz values to obtain the adaptive SVRG algorithm. Third, we describe our novel sampling strategy, and finally we present the tools used in the proof of the main algorithm.
One crucial question about the method of [xiao2014proximal] is what to do when we have no access to the exact or estimated values of the Lipschitz constants. Should we return to uniform sampling, or are there alternative methods? Given that the maximum of the Lipschitz constants can be drastically different from their average, returning to uniform sampling would give a much slower convergence rate. Estimating the Lipschitz constants by querying many points before executing the algorithm can also be very slow, as the estimation process can take time exponential in the dimension $d$.
Therefore, we suggest a solution which estimates local Lipschitz constants efficiently, and we prove that the algorithm still converges as fast when sampling is performed with these weights. Going back to equation (2), we notice that choosing a distribution that adaptively minimizes the first summand at each iteration will improve the performance of the SVRG algorithm. To make this precise, we analyze the following optimization problem.

$$\min_{p \in \Delta_M} \sum_{m=1}^{M} \frac{\|\nabla F_m(x_{t-1}) - \nabla F_m(\bar{x}_{k-1})\|^2}{p_m}.$$

By applying the KKT conditions, the solution of the above problem is

$$p_m^{k,t} = \frac{\|\nabla F_m(x_{t-1}) - \nabla F_m(\bar{x}_{k-1})\|}{\sum_{j=1}^{M} \|\nabla F_j(x_{t-1}) - \nabla F_j(\bar{x}_{k-1})\|}$$
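As a small illustration of this solution, the sketch below compares the objective $\sum_m \|\Delta_m\|^2 / p_m$ under uniform sampling and under probabilities proportional to the norms $\|\Delta_m\|$ of the gradient differences. The norm values are made up to mimic three non-informative workers and one informative worker; the function names are illustrative.

```python
def adaptive_probs(delta_norms):
    """Probabilities proportional to the gradient-difference norms."""
    total = sum(delta_norms)
    if total == 0:
        # Degenerate case: nothing changed anywhere, fall back to uniform.
        return [1.0 / len(delta_norms)] * len(delta_norms)
    return [n / total for n in delta_norms]

def sampling_objective(delta_norms, probs):
    """sum_m ||Delta_m||^2 / p_m; workers with zero norm contribute nothing."""
    return sum(n * n / p for n, p in zip(delta_norms, probs) if n > 0)

# Hypothetical norms: three slowly changing workers and one informative worker.
norms = [0.01, 0.02, 0.05, 4.0]
uniform = [1.0 / len(norms)] * len(norms)
adaptive = adaptive_probs(norms)

obj_uniform = sampling_objective(norms, uniform)    # equals M * sum of squared norms
obj_adaptive = sampling_objective(norms, adaptive)  # equals (sum of norms) ** 2
```

By the Cauchy-Schwarz inequality the adaptive value $(\sum_m \|\Delta_m\|)^2$ never exceeds the uniform value $M \sum_m \|\Delta_m\|^2$, and the gap is largest when a few workers dominate.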

This probability distribution depends only on local gradient information, which is easily accessible. The only requirement is computing the gradient difference $\nabla F_m(x_{t-1}) - \nabla F_m(\bar{x}_{k-1})$ on each machine, which can be rewritten as the following:

$$\frac{1}{|S_m|} \sum_{j \in S_m} \left( \nabla f_j(x_{t-1}) - \nabla f_j(\bar{x}_{k-1}) \right) \tag{3}$$

However, naively computing each of these values is expensive, as it requires a pass over every data point. To overcome this issue, we first propose an efficient estimation method for this expression, and then prove that the weights obtained from the estimates also give a fast convergence rate.
In the following lemma (an extension of [concentrationwithoutreplacement]), we show that taking a small subsample $\tilde{S}_m \subseteq S_m$ at each machine and computing the average:

$$\frac{1}{|\tilde{S}_m|} \sum_{j \in \tilde{S}_m} \left( \nabla f_j(x_{t-1}) - \nabla f_j(\bar{x}_{k-1}) \right) \tag{4}$$

of this sample in this machine will successfully estimate the expression in (3). Setting below lets us to bound with respect to and this helps us to use the lemma 4.
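The subsampling step in (4) can be sketched as follows. The per-datapoint gradient differences are random stand-ins rather than gradients of an actual model, and the function names and sample size are illustrative assumptions.

```python
import math
import random

def mean_vector(vectors):
    # Coordinate-wise average of a list of equally sized vectors.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def exact_weight(grad_diffs):
    """Norm of the full average in (3): one pass over every data point."""
    return norm(mean_vector(grad_diffs))

def subsampled_weight(grad_diffs, sample_size, rng):
    """Norm of the subsample average in (4), drawn uniformly without replacement."""
    sample = rng.sample(grad_diffs, sample_size)
    return norm(mean_vector(sample))

# Stand-in data: 1000 per-datapoint gradient differences close to (1, 2).
rng = random.Random(0)
grad_diffs = [[1.0 + rng.uniform(-0.1, 0.1), 2.0 + rng.uniform(-0.1, 0.1)]
              for _ in range(1000)]
w_exact = exact_weight(grad_diffs)
w_est = subsampled_weight(grad_diffs, 100, rng)
```

Here the estimate touches only a tenth of the data points on the machine; how large the subsample must be for a given relative error is what the lemma quantifies.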

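Let $S$ be a set of $N$ vectors in $\mathbb{R}^d$ such that $a \le X \le b$ coordinate-wise for any $X \in S$ and fixed vectors $a, b$. Let $\mu$ denote the average of the vectors in $S$, and let $\{X_1, \dots, X_n\}$ be a set of size $n$ sampled uniformly without replacement from $S$. Given that $n \ge \frac{\|b - a\|_2^2 \log(2d/\delta)}{2 \tau^2 \|\mu\|_2^2}$, the following inequality is satisfied with probability at least $1 - \delta$:

$$\left\| \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \right\|_2 \le \tau \|\mu\|_2$$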

As an illustration, consider a fixed choice of $\tau$, which bounds the relative estimation error of each weight. For instance, suppose the estimation changes the initial weights as:

$$w = (40, 40, 60, 60) \implies \tilde{w} = (39, 41, 58, 61),$$

then the categorical probabilities change as:

$$p = (0.2, 0.2, 0.3, 0.3) \implies \tilde{p} = (0.195, 0.206, 0.291, 0.306).$$

One can show that the probability of each category can change only by a small multiplicative factor. To see this in the example above, we rewrite $\tilde{p}$ as

$$(0.195, 0.206, 0.291, 0.306) = 0.97\,(0.2, 0.2, 0.3, 0.3) + 0.03\,(0.03, 0.4, 0, 0.57).$$

More generally, $\tilde{p} = (1 - \gamma)\, p + \gamma\, q$, where $q$ is another categorical distribution. To generalize this decomposition, we prove the following lemma:

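Let $P$ be a categorical distribution with weights $w_1, \dots, w_m$ and let $\tilde{P}$ be a perturbation of $P$ with new weights $w_1 + \delta_1, \dots, w_m + \delta_m$. Let $Q$ be a categorical distribution with weights $\delta_j - w_j \min_{1 \le i \le m} \frac{\delta_i}{w_i}$ for $j = 1, \dots, m$, and let $\gamma$ be defined as

$$\gamma = 1 - \min_{1 \le i \le m} \frac{(w_i + \delta_i) / \sum_{l=1}^{m} (w_l + \delta_l)}{w_i / \sum_{l=1}^{m} w_l}.$$

Then $\tilde{P}$ can be decomposed into a combination $\Psi$ of $P$ and $Q$ as follows:

$$\Psi = \begin{cases} \text{sample with respect to } P & \text{with probability } 1 - \gamma, \\ \text{sample with respect to } Q & \text{with probability } \gamma. \end{cases}$$

Moreover, this is the smallest $\gamma$ that allows $\tilde{P}$ to be decomposed into $P$ and some other distribution.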
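The decomposition can be checked on the weights from the example above. The sketch below computes $\gamma$ and the residual distribution directly from the two normalized distributions; the function name is illustrative, and all weights are assumed to be strictly positive.

```python
def decompose(weights, perturbed_weights):
    """Return (gamma, p, p_tilde, q) with p_tilde = (1 - gamma) * p + gamma * q."""
    total, total_pert = sum(weights), sum(perturbed_weights)
    p = [w / total for w in weights]
    p_tilde = [w / total_pert for w in perturbed_weights]
    # gamma = 1 - min_i p_tilde_i / p_i, as in the lemma.
    gamma = 1.0 - min(pt / pi for pt, pi in zip(p_tilde, p))
    if gamma < 1e-12:
        # No perturbation of the probabilities: the residual part is irrelevant.
        return 0.0, p, p_tilde, list(p)
    q = [(pt - (1.0 - gamma) * pi) / gamma for pt, pi in zip(p_tilde, p)]
    return gamma, p, p_tilde, q

gamma, p, p_tilde, q = decompose([40, 40, 60, 60], [39, 41, 58, 61])
mixture = [(1 - gamma) * pi + gamma * qi for pi, qi in zip(p, q)]
```

With exact arithmetic $\gamma$ comes out slightly below the rounded value 0.03 used in the illustration, and the category attaining the minimum ratio (the third one here) receives zero mass in the residual distribution.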

### 4.1 Communication Algorithm

In the previous section, we discussed how to efficiently estimate the weight (the importance) of each machine at a given time. The next important question is how to share this information among machines so that we can sample the important machines. Given that worker-to-server communication is more expensive than other types of communication, sending all weights to the server directly should be avoided.
Therefore, it is natural to use communication among workers to support the sampling process. Hence, we aim to design a communication method that minimizes the number of bytes transferred while keeping the runtime efficient. To provide intuition, we show how to sample efficiently from the set {2,3,5,7} with weights {1,1,3,2}.
We have four machines, and each of them holds one of the prime numbers above with its corresponding weight. The idea is simple. Machine one sends its information (number 2 and weight 1) to machine two, and the latter samples between the numbers 2 and 3 with respect to weights 1 and 1. Meanwhile, machine three sends its information (number 5 and weight 3) to machine four, which samples between 5 and 7 with respect to weights 3 and 2. In the second phase, machine two sends its sampled number to machine four together with its cumulative weight from the first phase, 1 + 1 = 2. Then machine four makes the final sampling with weights 2 and 5 and announces the final result. The probability that number 7 is selected at the end is equal to (2/5) · (5/7) = 2/7, which is the desired probability. A simple analysis shows that, for M machines, this method finishes in O(log M) rounds using M − 1 worker-to-worker messages.
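The single-sample scheme from this example can be simulated on one machine. The sketch below is only a sequential stand-in for the message passing (each pairwise merge plays the role of one worker-to-worker message); the function name and the handling of an odd number of machines are assumptions.

```python
import random

def tree_sample(items, weights, rng):
    """Pairwise tournament: neighbours merge in each round, and the receiver keeps
    one candidate with probability proportional to the accumulated weights.
    Returns the selected item and the number of rounds used."""
    level = list(zip(items, weights))
    rounds = 0
    while len(level) > 1:
        merged = []
        for i in range(0, len(level) - 1, 2):
            (a, wa), (b, wb) = level[i], level[i + 1]
            # Keep a with probability wa / (wa + wb), otherwise keep b.
            keep = a if rng.random() * (wa + wb) < wa else b
            merged.append((keep, wa + wb))
        if len(level) % 2 == 1:
            merged.append(level[-1])  # unpaired machine waits for the next round
        level = merged
        rounds += 1
    return level[0][0], rounds

rng = random.Random(1)
trials = 20000
hits = sum(tree_sample([2, 3, 5, 7], [1, 1, 3, 2], rng)[0] == 7
           for _ in range(trials))
freq_of_7 = hits / trials  # should be close to 2/7
```

Each item survives a merge with probability equal to its side's share of the accumulated weight, so the product along its path telescopes to its weight divided by the total weight.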
Note that after each iteration/update, the weight of each machine changes only incrementally. Therefore, recomputing weights and resampling every time is not efficient. That is why we extend the idea above to enable the sampling of many machines at once. A natural extension is to send several candidates at each step instead of one, which enables sampling several machines at the end; however, the transfer cost of this approach grows with the number of machines sampled. In the following algorithm, we show how to perform this task with the same order of bytes transferred, no matter how many machines we sample. We give an illustration in the figure above and provide the analysis in the lemma below.

In the following lemma, we show that the algorithm above is optimal for communication.

The Parallel Communication sampling technique above samples many workers with replacement using just worker to worker communication for any . Furthermore, sampling process ends in total time of .

We show how to extend the idea of the parallelization further, to decrease the time of the sampling process in appendices. The details of the method can be found in Optimal Communication algorithm.

### 4.2 Main Algorithm

In this section, we merge the ideas discussed in the previous sections to build an adaptive distributed SVRG algorithm. Our algorithm outperforms previous algorithms under the conditions that i) there is no prior information regarding the Lipschitz constants and ii) the maximum of the Lipschitz constants is much higher than their average.
We use the estimation discussed in expression (4) and lemma 4 to efficiently approximate the local Lipschitz constant (weight) of each worker in line 8. Then, using the communication algorithm of Section 4.1, we transfer this information among workers and perform the sampling in the next line. Finally, using lemma 4 we complete the analysis of the proposed algorithm.
The final convergence rate of the algorithm is described in theorem 4.2. Notice that the convergence rate of our method depends on the average of the Lipschitz constants. Thus, our method is better than uniform-sampling SVRG. Moreover, in comparison with the importance sampling method of [xiao2014proximal], our algorithm is still at least as good as that method. The sub-sampling step in line 8 is crucial when each machine holds a very large amount of data. Even though it introduces a small error in the sampling weights, we show that it does not significantly affect the convergence rate.

Given , and small. The iteration in converges to the optimal solution linearly in expectation. Moreover, under the condition each of the satisfies the condition in lemma 4, then the following inequality get satisfied:

$$\mathbb{E}[F(\bar{x}_k) - F(x^*)] \le \rho\, \mathbb{E}[F(\bar{x}_{k-1}) - F(x^*)],$$

with probability of where defined as

$$\rho = \frac{1}{\lambda T \eta \left[ 1 - \eta \bar{L} (8R + 1) \right]} + \frac{8 \eta \bar{L} R}{1 - \eta \bar{L} (8R + 1)}.$$
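To get a feel for this contraction factor, the sketch below evaluates $\rho$ for given problem constants. The numerical values are arbitrary illustrations and are not taken from the experiments; the function name is hypothetical.

```python
def contraction_factor(lam, T, eta, L_bar, R):
    """Evaluate rho from the theorem; requires eta * L_bar * (8 * R + 1) < 1."""
    denom = 1.0 - eta * L_bar * (8 * R + 1)
    if denom <= 0:
        raise ValueError("step size too large: need eta * L_bar * (8R + 1) < 1")
    return 1.0 / (lam * T * eta * denom) + 8 * eta * L_bar * R / denom

# Arbitrary illustrative constants (not from the paper).
rho = contraction_factor(lam=0.1, T=10000, eta=0.01, L_bar=1.0, R=1.0)
```

The first term shrinks as the inner loop length $T$ grows, while the second is controlled by the step size, so linear convergence ($\rho < 1$) needs $\eta$ small relative to $1/(\bar{L}(8R+1))$ and $T$ large relative to $1/(\lambda \eta)$.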

## 5 Experimental Results

To empirically validate our proposed distributed algorithm ASD-SVRG, we compare it to two baselines, Distributed SGD and Distributed SVRG, both of which treat all workers uniformly.
To make an objective comparison, we use the same settings for all of them, so that each run has the same number of epochs. From the distributed systems perspective, we employ data parallelism, in which the whole dataset is split across the workers. Furthermore, each worker employs the same model architecture, with the same initialization of weights. We describe these experiments in more detail in the following sections.

We design two synthetic datasets for two tasks that have strongly convex objective functions: linear regression and logistic regression. For each one, we create Lipschitz constants that increase with the worker index, using an exponential function of the index. For linear regression, we generate 500 samples of dimension 10, and for logistic regression (in the form of binary classification), we generate 300 samples of dimension 100.

#### Experimental Setup

To build a convex objective function for both tasks, we use a simple neural network that has only one fully-connected layer, which directly maps the input space to 1 output for linear regression, or 2 outputs for logistic regression. This layer is equivalent to a matrix multiplication with an added bias, i.e. an affine map of the input.

For linear regression, we apply the mean-squared error (MSE) loss. For logistic regression, we use a log-softmax activation on the logits (for numerical stability), followed by a negative log-likelihood (NLL) loss. The combination of log-softmax and NLL is equivalent to the cross-entropy loss. For both problems, we apply L2 regularization to SGD, which is equivalent to weight decay [loshchilov2017decoupled]. However, for SVRG and ASD-SVRG, we do not apply this regularization mechanism. Finally, to make a fair comparison, we do a grid search over the learning rates of each algorithm and compare the best version of each one. For each problem, the best performance of each algorithm is obtained with a learning rate different from those of the other two.
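The equivalence between log-softmax followed by NLL and the cross-entropy loss can be verified with a few lines of plain Python. This is an independent illustration and not the authors' PyTorch code; in PyTorch the same relation holds between `LogSoftmax` plus `NLLLoss` and `CrossEntropyLoss`.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

def cross_entropy_naive(logits, target):
    """Cross-entropy via an explicit softmax; overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))

logits, target = [2.0, -1.0], 0
loss_stable = nll(log_softmax(logits), target)
loss_naive = cross_entropy_naive(logits, target)

# The stable version also handles logits for which math.exp(z) would overflow.
loss_large = nll(log_softmax([1000.0, 0.0]), 0)
```

Both routes give the same loss on moderate logits, while only the log-softmax route remains finite when the logits are large, which is the numerical-stability point made above.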

In terms of hardware, we use a set of 8 parallel CPUs in a single physical host to run each experiment. Because we implement our code in PyTorch [paszke2017automatic], all three distributed algorithms can be easily adapted to other network architectures (convex or non-convex) or to other distributed configurations, such as parallel GPUs or multiple nodes with many GPUs/CPUs each.

#### Results

For linear regression, as shown in Figure 3, ASD-SVRG clearly outperforms the others in training and testing: it converges much earlier and more efficiently, especially in training, which is the main goal from the optimization perspective. In particular, it does so with a much higher learning rate, which plays an essential role in training speed.
For logistic regression, Figure 2 shows a similar behavior in the first two plots. In more detail, our algorithm significantly outperforms the others on the training set (the main goal), with a much larger learning rate than SVRG. We also observe that ASD-SVRG converges faster than SVRG, which is consistent with our theoretical analysis. Furthermore, ASD-SVRG does not trade generalization for the optimization goal. In particular, as shown in the third plot (test accuracy), ASD-SVRG achieves much higher prediction accuracy while achieving a test loss similar to that of SVRG. Hence, ASD-SVRG clearly outperforms the baseline methods in both tasks.

#### Ablation Study

To investigate the difference between ASD-SVRG and its direct counterpart SVRG, we fix the settings of both algorithms and vary the learning rate of SVRG (to the left and right of its best value) to compare its performance with the best setting of ASD-SVRG for both tasks. As shown in Figure 5, ASD-SVRG outperforms SVRG in every case, for both training and testing, for linear regression. Additionally, if we increase the learning rate of SVRG towards that of ASD-SVRG (which is many times larger), SVRG becomes unstable and diverges even at much lower rates.
The same observations hold for logistic regression, as shown in Figure 4, in both train and test losses. Although some of the SVRG learning rates are better in terms of accuracy (only by a margin of 1% to 3%), all of them are worse in terms of both losses. All in all, from the optimization perspective, our ablation studies clearly show the advantages of ASD-SVRG over SVRG.

## 6 Conclusion

In this paper, we have designed and presented a distributed optimization algorithm, ASD-SVRG, which assumes no prior knowledge about the functions being optimized. Instead, our algorithm is adaptive: at each step it samples the most important machines based on the data themselves to guide the updates in the optimization process. In this way, our algorithm converges faster, since the dependence of the convergence rate moves from the maximum to the average Lipschitz constant across the distributed machines. We also provide a decomposition method for categorical distributions, which relates a noisy distribution to its noiseless version. Moreover, we created a novel communication method that minimizes the number of bytes transferred to the parameter server while keeping the overall run time efficient, both of which are important for distributed algorithms in practice. For the experiments, we implement all algorithms in PyTorch for ease of adaptation and extension in the future, and we will release the code publicly so that the results can be replicated. We hope that the theoretical results and empirical tools provided in this paper will inspire and help future work in this area.

## Appendix A Proof of Lemma 4

###### Lemma

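Let $S$ be a set of $N$ vectors in $\mathbb{R}^d$ such that $a \le X \le b$ coordinate-wise for any $X \in S$ and fixed vectors $a, b$. Let $\mu$ denote the average of the vectors in $S$, and let $\{X_1, \dots, X_n\}$ be a set of size $n$ sampled uniformly without replacement from $S$. Given that $n \ge \frac{\|b - a\|_2^2 \log(2d/\delta)}{2 \tau^2 \|\mu\|_2^2}$, the following inequality is satisfied with probability at least $1 - \delta$:

$$\left\| \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \right\|_2 \le \tau \|\mu\|_2$$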

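We use the following concentration inequality to bound the estimation error [concentrationwithoutreplacement]:
Lemma: Let $S = \{x_1, \dots, x_N\}$ be a set of real points with average $\mu$ which satisfies $a \le x_i \le b$ for any $i$ and real numbers $a$ and $b$. Draw $X_1, \dots, X_n$ uniformly at random without replacement from the set $S$. Then, with probability higher than $1 - \delta$, the following inequality is satisfied:

$$\frac{1}{n} \sum_{i=1}^{n} X_i - \mu \le (b - a) \sqrt{\frac{\rho_n \log(1/\delta)}{2n}}$$

where we define

$$\rho_n = \begin{cases} 1 - \frac{n-1}{N} & \text{if } n \le N/2, \\ 1 - \frac{n}{N} & \text{if } n > N/2. \end{cases}$$

Using the fact that $\rho_n \le 1$, we conclude:

$$\frac{1}{n} \sum_{i=1}^{n} X_i - \mu \le (b - a) \sqrt{\frac{\rho_n \log(1/\delta)}{2n}} \le (b - a) \sqrt{\frac{\log(1/\delta)}{2n}}$$

and applying the same inequality to the set $-S$ we conclude that, with probability at least $1 - \delta$,

$$-\frac{1}{n} \sum_{i=1}^{n} X_i + \mu \le (b - a) \sqrt{\frac{\log(1/\delta)}{2n}}.$$

The union bound then gives that, with probability at least $1 - 2\delta$, the following inequality is satisfied:

$$\left| \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \right| \le (b - a) \sqrt{\frac{\log(1/\delta)}{2n}}$$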

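To extend this inequality to vectors, we assume that $X_i \in \mathbb{R}^d$ and we denote by $\mu^j$ the average of the $j$'th coordinates. $b^j$ and $a^j$ stand for the corresponding upper and lower bounds for the $j$'th coordinate. Finally, $X_i^j$ is the $j$'th coordinate of the $i$-th randomly selected element. Then, for each $j$, the following is satisfied with probability $1 - 2\delta$:

$$\left| \frac{1}{n} \sum_{i=1}^{n} X_i^j - \mu^j \right| \le (b^j - a^j) \sqrt{\frac{\log(1/\delta)}{2n}}$$

Again using the union bound, we conclude that with probability $1 - 2d\delta$ all of these inequalities are satisfied simultaneously for $j = 1, \dots, d$. Squaring and summing the left and right sides over $j$, we conclude that with probability $1 - 2d\delta$

$$\left\| \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \right\|_2^2 \le \frac{\|b - a\|_2^2 \log(1/\delta)}{2n}$$

is satisfied. Hence taking $n \ge \frac{\|b - a\|_2^2 \log(1/\delta)}{2 \tau^2 \|\mu\|_2^2}$ guarantees

$$\left\| \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \right\|_2^2 \le \tau^2 \|\mu\|_2^2$$

with probability $1 - 2d\delta$. Hence, replacing $\delta$ by $\delta / (2d)$ and taking the square root, choosing $n \ge \frac{\|b - a\|_2^2 \log(2d/\delta)}{2 \tau^2 \|\mu\|_2^2}$ guarantees the following inequality with probability $1 - \delta$:

$$\left\| \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \right\|_2 \le \tau \|\mu\|_2$$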

## Appendix B Proof of Lemma 4

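Similar to the main algorithm, here we denote the norm of the average of the gradient differences on machine $m$ by $w_m$ (in the lemma above it corresponds to $\|\mu\|$) and its estimate by $\tilde{w}_m$. Then from lemma 4 we have $|\tilde{w}_m - w_m| \le \tau w_m$ with the probability guaranteed by that lemma. Hence we can write $\tilde{w}_m = w_m + \delta_m$, where $|\delta_m| \le \tau w_m$ with the same probability. Lemma 4 gives an interesting property of noisy categorical distributions: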

###### Lemma

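Let $P$ be a categorical distribution with weights $w_1, \dots, w_m$ and let $\tilde{P}$ be the perturbed distribution of $P$ with modified weights $w_1 + \delta_1, \dots, w_m + \delta_m$. Let $Q$ be a categorical distribution with weights $\delta_j - w_j \min_{1 \le i \le m} \frac{\delta_i}{w_i}$ for $j = 1, \dots, m$, and

$$\gamma = 1 - \min_{1 \le i \le m} \frac{(w_i + \delta_i) / \sum_{l=1}^{m} (w_l + \delta_l)}{w_i / \sum_{l=1}^{m} w_l}.$$

Then $\tilde{P}$ can be decomposed into a combination $\Psi$ of $P$ and $Q$ as follows:

$$\Psi = \begin{cases} \text{sample with respect to } P & \text{with probability } 1 - \gamma, \\ \text{sample with respect to } Q & \text{with probability } \gamma. \end{cases}$$

Moreover, this is the smallest $\gamma$ that allows $\tilde{P}$ to be decomposed into $P$ and some other distribution.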

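It is straightforward to see that $Q$ is well-defined, since its weights are non-negative:

$$\delta_j - w_j \min_{1 \le i \le m} \frac{\delta_i}{w_i} \ge \delta_j - w_j \frac{\delta_j}{w_j} = 0.$$

Moreover, since this inequality is tight for the index $j$ attaining the minimum, $\gamma$ is the smallest number for which $\tilde{P}$ can be decomposed into $P$ and some other distribution. It remains to show that the probability of selecting category $j$ under the proposed method $\Psi$ is equal to the probability of category $j$ under $\tilde{P}$, which is $\frac{w_j + \delta_j}{\sum_{l=1}^{m} (w_l + \delta_l)}$. Let us find the probability of category $j$ for the distribution $\Psi$: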

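$$P_\Psi(j) = (1 - \gamma) \frac{w_j}{\sum_{l=1}^{m} w_l} + \gamma\, \frac{\delta_j - w_j \min_{i} \frac{\delta_i}{w_i}}{\sum_{l=1}^{m} \left( \delta_l - w_l \min_{i} \frac{\delta_i}{w_i} \right)} \tag{5}$$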

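Let us analyze each of these summands in detail, starting with the left one.

$$\begin{aligned}
(1 - \gamma) \frac{w_j}{\sum_{l=1}^{m} w_l}
&= \frac{w_j}{\sum_{l=1}^{m} w_l} \min_{1 \le i \le m} \frac{(w_i + \delta_i) / \sum_{l=1}^{m} (w_l + \delta_l)}{w_i / \sum_{l=1}^{m} w_l} \\
&= \frac{w_j}{\sum_{l=1}^{m} (w_l + \delta_l)} \min_{1 \le i \le m} \frac{w_i + \delta_i}{w_i} \\
&= \frac{w_j}{\sum_{l=1}^{m} (w_l + \delta_l)} \left( 1 + \min_{1 \le i \le m} \frac{\delta_i}{w_i} \right)
\end{aligned}$$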

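Now we turn to the right summand. First we simplify the value of $\gamma$:

$$\begin{aligned}
\gamma
&= 1 - \min_{1 \le i \le m} \frac{(w_i + \delta_i) / \sum_{l=1}^{m} (w_l + \delta_l)}{w_i / \sum_{l=1}^{m} w_l} \\
&= 1 - \frac{\sum_{l=1}^{m} w_l}{\sum_{l=1}^{m} (w_l + \delta_l)} \min_{1 \le i \le m} \frac{w_i + \delta_i}{w_i} \\
&= \frac{\sum_{l=1}^{m} (w_l + \delta_l)}{\sum_{l=1}^{m} (w_l + \delta_l)} - \frac{\left( \sum_{l=1}^{m} w_l \right) \min_{1 \le i \le m} \frac{w_i + \delta_i}{w_i}}{\sum_{l=1}^{m} (w_l + \delta_l)} \\
&= \frac{\sum_{l=1}^{m} (w_l + \delta_l) - \left( \sum_{l=1}^{m} w_l \right) \left( 1 + \min_{1 \le i \le m} \frac{\delta_i}{w_i} \right)}{\sum_{l=1}^{m} (w_l + \delta_l)} \\
&= \frac{\sum_{l=1}^{m} \delta_l - \min_{1 \le i \le m} \frac{\delta_i}{w_i} \sum_{l=1}^{m} w_l}{\sum_{l=1}^{m} (w_l + \delta_l)}
\end{aligned}$$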

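Therefore, the right summand in (5) above is simply equal to:

$$\begin{aligned}
\gamma \times \frac{\delta_j - w_j \min_{i} \frac{\delta_i}{w_i}}{\sum_{l=1}^{m} \left( \delta_l - w_l \min_{i} \frac{\delta_i}{w_i} \right)}
&= \frac{\sum_{l=1}^{m} \delta_l - \min_{i} \frac{\delta_i}{w_i} \sum_{l=1}^{m} w_l}{\sum_{l=1}^{m} (w_l + \delta_l)} \cdot \frac{\delta_j - w_j \min_{i} \frac{\delta_i}{w_i}}{\sum_{l=1}^{m} \delta_l - \min_{i} \frac{\delta_i}{w_i} \sum_{l=1}^{m} w_l} \\
&= \frac{\delta_j - w_j \min_{i} \frac{\delta_i}{w_i}}{\sum_{l=1}^{m} (w_l + \delta_l)}.
\end{aligned}$$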

Finally, putting these summands together concludes:

 wj(1+min1≤i≤mδiwi