# vqSGD: Vector Quantized Stochastic Gradient Descent

In this work, we present a family of vector quantization schemes, vqSGD (Vector-Quantized Stochastic Gradient Descent), that provide an asymptotic reduction in the communication cost with convergence guarantees in distributed computation and learning settings. In particular, we consider a randomized scheme, based on the convex hull of a point set, that returns an unbiased estimator of a d-dimensional gradient vector with bounded variance. We provide multiple efficient instances of our scheme that require only O(log d) bits of communication. Further, we show that vqSGD also provides strong privacy guarantees. Experimentally, we show that vqSGD performs comparably to other state-of-the-art quantization schemes, while substantially reducing the communication cost.


## 1 Introduction

The recent surge in the volume of available data has motivated the development of large-scale distributed learning algorithms. Synchronous Stochastic Gradient Descent (SGD) is one such learning algorithm, widely used to train large models. In order to minimize the empirical loss, the SGD algorithm in every iteration takes a small step in the negative direction of the stochastic gradient, which is an unbiased estimate of the true gradient of the loss function.

In this work, we consider the data-distributed model of distributed SGD, where the data sets are partitioned across compute nodes. In each iteration of SGD, the compute nodes send their locally computed gradients to a parameter server that averages them and updates the global parameters. The distributed SGD model is highly scalable; however, with the exploding dimensionality of data and the increasing number of compute nodes (such as in a federated learning setup [18]), communication becomes a bottleneck to the efficiency and speed of learning using SGD [8].

In recent years, various quantization and sparsification techniques [2, 4, 6, 17, 21, 25, 26, 28] have been developed to alleviate this communication bottleneck. The goal of a quantization scheme is to efficiently compute either a low-precision or a sparse unbiased estimate of the d-dimensional gradients. One also requires the estimates to have a bounded second moment in order to achieve guaranteed convergence.

Moreover, the data samples used to train the model often contain sensitive information. Hence, preserving the privacy of the participating clients is crucial. Differential privacy [10, 11] is a mathematically rigorous and standard notion of privacy, considered both in the literature and in practice. Informally, it ensures that the information in the released data (e.g., the gradient estimates) cannot be used to distinguish between two neighboring data sets.

#### Our Contribution:

In this work, we present a family of privacy-preserving vector-quantization schemes that incur low communication costs while providing convergence guarantees. In particular, we propose quantization schemes based on the convex hull of specific structured point sets in ℝ^d that require only O(log d) bits to communicate an unbiased gradient estimate with bounded variance.

At a high level, our scheme is based on the idea that any vector with bounded norm can be represented as a convex combination of a carefully constructed point set C. This convex combination essentially allows us to choose a point c_i ∈ C with probability proportional to its coefficient, which makes the chosen point an unbiased estimator of the vector. The bound on the variance is obtained from the circumradius of the convex hull of C. Moreover, communicating the unbiased estimate is equivalent to communicating the index of c_i (according to some fixed ordering of C), which requires only log |C| bits.

Large convex hulls have small variation in the coefficients of the convex combination of any two points of bounded norm. This observation allows us to obtain ε-differential privacy, where ε depends on the choice of the point set. We also propose Randomized Response (RR) [27] and RAPPOR [12] based mechanisms that can be used on top of the proposed quantization to achieve ε-differential privacy (for any ε > 0) with a small trade-off in the variance of the estimates.

The family of schemes described above is fairly general and can be instantiated with different structured point sets. The cardinality of the point set bounds the communication cost of the quantization scheme, whereas the diameter of the point set dictates its variance bounds and privacy guarantees. We propose some specific structured point sets and show the tradeoffs in the various parameters they guarantee. Our results (summarized in Table 1) are the first quantization schemes in the literature to achieve privacy directly through quantization.

Empirically, we compare our quantization schemes to the state-of-the-art schemes [4, 25]. We observe that our cross-polytope vqSGD performs equally well in practice, while providing an asymptotic reduction in the communication cost.

## 2 Related Work

The foundations of gradient quantization were laid by [19] and [24] with schemes that require the compute nodes to send exactly one bit per coordinate of the gradient. They also suggested using local error accumulation to correct the global gradient in every iteration. While these novel techniques worked well in practice, no theoretical guarantees were provided for the convergence of the schemes. These seminal works fueled multiple research directions.

Quantization & Sparsification: [4, 28, 26] propose stochastic quantization techniques to represent each coordinate of the gradient using a small number of bits. The proposed schemes always return an unbiased estimator of the local gradient and require O(√d) bits of communication to compute the global gradient with variance bounded by a constant multiplicative factor. The quantization techniques for distributed SGD can be used in the more general setting of the communication-efficient distributed mean estimation problem, which was the focus of [25]. The quantization schemes proposed in [25] require O(d) bits of communication per compute node to estimate the global mean with a constant (independent of d) squared error (variance). Even though the tradeoffs between communication and accuracy achieved by the above-mentioned schemes are near optimal [29], they were unable to break the barrier of Ω(√d) bits of communication. In this work, we propose quantization schemes that require about O(log d) bits of communication and are almost optimal as well.

Gradient sparsification techniques with provable convergence (under standard assumptions) were studied in [2, 5, 15, 23]. The main idea in these techniques is to communicate only the top-k components of the d-dimensional local gradients, which can be accumulated globally to obtain a good estimate of the true gradient. Unlike the quantization schemes described above, gradient sparsification techniques can achieve sublinear (in d) communication, but they usually do not give unbiased estimates of the true gradients. [21] suggest randomized sparsification schemes that are unbiased, but these are not known to provide any theoretical convergence guarantees in very low sparsity regimes.

See Table 2 for a comparison of our results with the state of the art quantization schemes.

Error Feedback: Many works focused on providing techniques to reduce the error incurred due to quantization [14, 16] using locally accumulated errors. In this work, we focus primarily on gradient quantization techniques, and note that the variance reduction techniques of [14] can be used on top of the proposed quantization schemes.

Privacy: While differential privacy for gradient-based algorithms [1, 22] was considered earlier in the literature, cpSGD [3] is the only work that considers achieving differential privacy for gradient-based algorithms while simultaneously minimizing the gradient communication cost. The authors propose a binomial mechanism that adds discrete noise to the quantized gradients to achieve communication-efficient (ε, δ)-differentially private gradient descent with convergence guarantees. The quantization schemes used are similar to those presented in [25] and hence require O(d) bits of communication per compute node. The parameters of the binomial noise are dictated by the required privacy guarantees, which in turn control the communication cost.

In this work, we show that certain instantiations of our quantization schemes are ε-differentially private. Note that this is a much stronger privacy notion than (ε, δ)-privacy. Moreover, we get this privacy guarantee directly from the quantization schemes, and hence the communication cost remains sublinear (O(log d)) in the dimension. We also propose a Randomized Response [27] based private-quantization scheme that requires only O(log d) bits of communication per compute node to achieve ε-differential privacy, while losing a small factor in the convergence rate. Table 3 compares the guarantees provided by our private quantization schemes with the results of cpSGD [3].

## 3 Background

For any x, y ∈ ℝ^d, we denote the Euclidean (ℓ₂) distance between them by ∥x − y∥₂. For any vector v ∈ ℝ^d, we denote its i-th coordinate by v_i. For any x ∈ ℝ^d and R > 0, let B_d(x, R) denote the d-dimensional ball of radius R centered at x. Let e_i denote the i-th standard basis vector, which has 1 in the i-th position and 0 everywhere else. Also, let 1_d and 0_d denote the all 1's vector and the all 0's vector in ℝ^d, respectively. By [n] we denote the set {1, …, n}.

For a discrete set of points C ⊂ ℝ^d, let conv(C) denote the convex hull of the points in C, i.e.,

 conv(C) := { ∑_{c∈C} a_c c ∣ a_c ≥ 0, ∑_{c∈C} a_c = 1 }.

Let w ∈ ℝ^d be the parameters of a function to be learned (such as the weights of a neural network). In each step of the SGD algorithm, the parameters are updated as w_{t+1} = w_t − η_t ĝ(w_t), where η_t is a possibly time-varying learning rate and ĝ(w) is a stochastic unbiased estimate of g(w), the true gradient of some loss function with respect to w. The convergence rate of the SGD algorithm depends on the variance (mean squared error) of the unbiased estimate [20].

The goal of any gradient quantization scheme is to reduce the cost of communicating the gradient while not compromising too much on the quality of the gradient estimate. The quality of the gradient estimate is measured in terms of the convergence guarantees it provides. Given an unbiased estimator ĝ of the true gradient g, the convergence of SGD is known to depend on the variance of this estimate.

In a distributed setting with N worker nodes, let g_i and ĝ_i be the local true gradient and its unbiased estimate computed at the i-th compute node, for i ∈ [N]. For ĝ = (1/N) ∑_{i=1}^N ĝ_i, the variance of the estimate is defined as

 Var(ĝ) := E[∥ (1/N) ∑_{i=1}^N g_i − (1/N) ∑_{i=1}^N ĝ_i ∥₂²] = (1/N²) ∑_{i=1}^N E[∥ g_i − ĝ_i ∥₂²].

In this work, our goal is to design quantization schemes that efficiently compute an unbiased estimate ĝ of g such that Var(ĝ) is minimized.

For the privacy-preserving gradient quantization schemes, we consider the standard notion of (ε, δ)-differential privacy (DP) as defined in [11]. Consider data sets from a domain X. Two data sets A, B ∈ X^n are neighboring if they differ in at most one data point.

###### Definition 1.

A randomized algorithm M with domain X^n is (ε, δ)-differentially private (DP) if for all S ⊆ Range(M) and for all neighboring data sets A, B ∈ X^n,

 Pr[M(A) ∈ S] ≤ e^ε Pr[M(B) ∈ S] + δ,

where the probability is over the randomness of M. If δ = 0, we say that M is ε-DP.

## 4 Quantization Scheme

We first present our quantization scheme in full generality. Individual quantization schemes with different tradeoffs are then obtained as specific instances of this general scheme.

Let C = {c_1, …, c_m} ⊂ ℝ^d be a discrete set of points and let conv(C) be its convex hull, satisfying

 B_d(0_d, 1) ⊂ conv(C) ⊆ B_d(0_d, R), for some R > 1. (1)

For any vector v ∈ ℝ^d and some large number B such that ∥v∥₂ ≤ B, let ṽ = v/B be the vector inside the 0_d-centered unit ball in the direction of v. Since ṽ ∈ B_d(0_d, 1) ⊂ conv(C), we can write ṽ as a convex linear combination of the points in C. Let

 ṽ = ∑_{i=1}^{|C|} a_i c_i, where a_i ≥ 0, ∑_{i=1}^{|C|} a_i = 1.

We can view the coefficients (a_1, …, a_{|C|}) of the convex combination as a probability distribution over the points in C. Define the quantization of v with respect to the point set C as follows:

 Q_C(v) := B · c_i with probability a_i.

It follows from the definition of the quantization that Q_C(v) is an unbiased estimator of v.

###### Lemma 1.

For any v ∈ ℝ^d with ∥v∥₂ ≤ B, E[Q_C(v)] = v.

###### Proof.

 E[Q_C(v)] = ∑_{i=1}^{|C|} a_i B · c_i = B ṽ = v. ∎

Assume that the set C is fixed in advance and is known to both the compute nodes and the parameter server. Communicating the quantization of any vector v then amounts to sending a floating point number B and the index of the point c_i, which requires log |C| + O(1) bits. For many loss functions, such as Lipschitz functions, the bound B on the norm of the gradients is known to both the compute nodes and the parameter server. Therefore, we can avoid sending B, and the cost of communicating the gradients is then exactly ⌈log |C|⌉ bits.

Any point set that satisfies Condition (1) gives the following bound on the variance of the quantizer.

###### Lemma 2.

Let C be a point set satisfying Condition (1). For any v ∈ ℝ^d with ∥v∥₂ ≤ B, let v̂ = Q_C(v). Then,

 E[∥v − v̂∥₂²] ≤ (R + 1)² B² = O(R² B²).
###### Proof.

From the definition of the quantization function,

 E[∥v − Q_C(v)∥₂²] = ∑_{i=1}^{|C|} a_i ∥v − B c_i∥₂² = ∑_{i=1}^{|C|} a_i (∥v∥₂² + B² ∥c_i∥₂² − 2B ⟨v, c_i⟩).

Since C satisfies Condition (1), each point c_i ∈ C has bounded norm, ∥c_i∥₂ ≤ R. Also, from the assumption that ∥v∥₂ ≤ B, we get the following upper bound:

 E[∥v − Q_C(v)∥₂²] ≤ ∑_{i=1}^{|C|} a_i (B² + B² R² + 2B · B · R) (since ⟨v, c_i⟩ ≥ −∥v∥₂ ∥c_i∥₂) = B² (1 + R)² ∑_{i=1}^{|C|} a_i = B² (1 + R)².

The last equality follows from the fact that (a_1, …, a_{|C|}) is a convex linear combination. ∎

From the above-mentioned properties, we get a family of quantization schemes depending on the choice of the point set C satisfying Condition (1). For any choice of quantization scheme from this family, we get the following convergence guarantee for distributed SGD.

###### Theorem 3.

Let C be a point set satisfying Condition (1). Let g_i be the local gradient computed at the i-th compute node, and let B be an upper bound on ∥g_i∥₂ for all i ∈ [N]. Define ĝ = (1/N) ∑_{i=1}^N ĝ_i, where ĝ_i = Q_C(g_i). Then,

 E[ĝ] = g and E[∥g − ĝ∥₂²] = O(R² B² / N).
###### Proof.

Since ĝ is the average of unbiased estimators, the fact that E[ĝ] = g follows from Lemma 1. For the variance computation, note that

 E[∥g − ĝ∥₂²] = (1/N²) ∑_{i=1}^N E[∥g_i − ĝ_i∥₂²] (since each ĝ_i is an unbiased estimator of g_i) ≤ (1/N²) ∑_{i=1}^N B² (1 + R)² = B² (1 + R)² / N (from Lemma 2). ∎

###### Remark 1.

The assumption that ∥g_i∥₂ ≤ B for all i ∈ [N] is made to ease the presentation. The bound on the variance can be stated more precisely as E[∥g − ĝ∥₂²] ≤ ((1 + R)² / N²) ∑_{i=1}^N ∥g_i∥₂².

###### Remark 2.

In order to compute the quantization Q_C(v), we first have to compute the convex combination of ṽ with respect to the point set C. This requires solving a system of d + 1 linear equations in |C| non-negative variables. For general point sets C, this takes time polynomial in d (since |C| ≥ d + 1). We will show that there exist certain structured point sets for which we can compute these probabilities in linear time.
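For a general point set, the convex-combination step of Remark 2 can be carried out as a linear-programming feasibility solve. The sketch below is our own illustration (assuming NumPy and SciPy are available; `convex_coefficients` is a hypothetical helper name, not from the paper), and the structured sets in the following sections replace it with closed forms:

```python
import numpy as np
from scipy.optimize import linprog

def convex_coefficients(C, v):
    """Find a >= 0 with sum(a) = 1 and sum_i a_i c_i = v for a general point
    set (rows of C are the points c_i), via an LP feasibility solve."""
    m = C.shape[0]
    A_eq = np.vstack([C.T, np.ones((1, m))])   # d equations for v, one for sum(a)
    b_eq = np.append(v, 1.0)
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x if res.success else None
```

Any feasible solution is a valid probability distribution for the quantizer; the LP objective is zero because we only need feasibility, not optimality.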

From Theorem 3, we observe that the communication cost of the quantization scheme depends on the cardinality of C, while the convergence is dictated by the circumradius R of the convex hull of C. In the next two sections, we present two constructions of point sets that achieve optimal communication and optimal variance, respectively.

### 4.1 Cross Polytope Scheme

In this section, we present an explicit construction of a small point set that gives a quantization scheme requiring only O(log d) bits to communicate an unbiased estimate of a vector in ℝ^d.

Consider the following point set of 2d points in ℝ^d:

 C_cp := { ±√d e_i ∣ i ∈ [d] }.

The convex hull conv(C_cp) is a scaled cross polytope that satisfies Condition (1) with R = √d (see Proposition 4 for the proof). Let Q_cp be the instantiation of the quantization scheme described in Section 4 with the point set C_cp.

To compute the convex combination of any point ṽ ∈ B_d(0_d, 1), we need a non-negative solution to the following system of equations:

 √d [ I_d  −I_d ] a = ṽ, ∑_{i=1}^{2d} a_i = 1, (2)

where I_d is the d × d identity matrix and a ∈ ℝ^{2d}. Equation (2) leads to the following closed form solution that can be computed in O(d) time:

 a_i = ṽ_i/√d + γ/(2d)  if ṽ_i > 0 and i ≤ d,
 a_i = −ṽ_{i−d}/√d + γ/(2d)  if ṽ_{i−d} ≤ 0 and i > d,
 a_i = γ/(2d)  otherwise, (3)

where γ := 1 − ∥ṽ∥₁/√d is a non-negative quantity for every ṽ ∈ B_d(0_d, 1).
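To make Equation (3) concrete, the following is a minimal sketch of the cross-polytope quantizer (our own illustrative code using NumPy, not the paper's implementation; we take B = ∥v∥₂ and assume v ≠ 0):

```python
import numpy as np

def crosspolytope_quantize(v, rng):
    """Sample Q_cp(v): one vertex of the scaled cross polytope, chosen with
    the probabilities of Equation (3). Assumes v != 0; here B = ||v||_2."""
    d = v.shape[0]
    B = np.linalg.norm(v)
    vt = v / B                                         # tilde-v, inside the unit ball
    gamma = 1.0 - np.linalg.norm(vt, 1) / np.sqrt(d)   # leftover mass, >= 0
    a = np.full(2 * d, gamma / (2 * d))                # baseline gamma/(2d) everywhere
    pos = vt > 0
    a[:d][pos] += vt[pos] / np.sqrt(d)                 # indices i <= d: +sqrt(d) e_i
    a[d:][~pos] += -vt[~pos] / np.sqrt(d)              # indices i > d: -sqrt(d) e_i
    i = rng.choice(2 * d, p=a)                         # only this index must be sent
    c = np.zeros(d)
    c[i % d] = np.sqrt(d) if i < d else -np.sqrt(d)
    return B * c, a
```

The receiver reconstructs B · c_i from the sampled index alone, so the per-vector payload is ⌈log 2d⌉ bits (plus B, when it is not known in advance).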

The bound on the variance of the quantizer follows directly from Lemma 2.

###### Proposition 4.

For any v ∈ ℝ^d with ∥v∥₂ ≤ B, let v̂ = Q_cp(v). Then,

 E[v̂] = v and E[∥v − v̂∥₂²] = O(d B²).
###### Proof.

The proof of Proposition 4 follows directly from Lemma 2, provided the point set C_cp satisfies Condition (1) with R = √d. We now prove this fact.

Since each vertex is of the form ±√d e_i, all the vertices of C_cp, and hence the entire convex hull, lie inside a ball of radius √d, i.e., conv(C_cp) ⊆ B_d(0_d, √d).

To prove that the unit ball is contained in the convex hull, we pick an arbitrary point ṽ ∈ B_d(0_d, 1) and show that it can be written as a convex combination of the points in C_cp. This follows from the solution to the system of linear equations (2) given in Equation (3). Note that the solution satisfies a_i ≥ 0 and ∑_{i=1}^{2d} a_i = 1 for any point ṽ ∈ B_d(0_d, 1). ∎

###### Remark 3.

We remark that the quantization Q_cp is also differentially private, but with a privacy parameter that grows with d. This bound is not practical, and in Section 5 we provide schemes that achieve better privacy bounds.

### 4.2 Scaled ϵ-nets

On the other end of the spectrum, we now show the existence of point sets of exponential size that are contained in a constant-radius ball. Such a point set gives a gradient quantization scheme with O(d) bits of communication and variance independent of d.

###### Definition 2 (ϵ-net).

A set of points N_ε ⊂ S^{d−1} is an ε-net for the unit sphere S^{d−1} if for any point x ∈ S^{d−1} there exists a net point y ∈ N_ε such that ∥x − y∥₂ ≤ ε.

There exist various constructions of ε-nets over the unit sphere in ℝ^d of size at most (1 + 2/ε)^d [9]. We now show that an appropriate constant scaling of the net points satisfies Condition (1).

###### Lemma 5.

For any ε ∈ (0, 1), let γ = 1/(1 − ε). The point set C_net := { γ y ∣ y ∈ N_ε } satisfies Condition (1) with R = γ.

###### Proof.

Let P be the convex hull of the ε-net points of the unit sphere, and let B_d(0_d, r) be the inscribed ball of P, for some r ≤ 1. We show that r ≥ 1 − ε.

Consider a face F of P that is tangent to B_d(0_d, r) at a point m. Extend the line joining 0_d and m to meet the unit sphere at a point p. Since p ∈ S^{d−1}, there exists a net point q ∈ N_ε at a distance of at most ε from it. Let H be the hyperplane containing F. Since q ∈ P while p lies beyond H, the hyperplane H separates p from q, and hence the distance of p from H is at most ∥p − q∥₂ ≤ ε. Therefore, r = ∥m∥₂ = ∥p∥₂ − dist(p, H) ≥ 1 − ε.

Therefore, scaling all the points of N_ε by γ = 1/(1 − ε), we see that B_d(0_d, 1) ⊆ conv(C_net) ⊆ B_d(0_d, γ). ∎

Let Q_net be the instantiation of the quantization scheme with the point set C_net. From Lemma 2, we then directly get the following guarantees for the quantization scheme obtained from scaled ε-nets, for some constant ε.

###### Proposition 6.

For any v ∈ ℝ^d with ∥v∥₂ ≤ B, let v̂ = Q_net(v). Then, E[v̂] = v and E[∥v − v̂∥₂²] = O(B²).

Moreover, Q_net requires only O(d) bits to represent the unbiased gradient estimate.

### 4.3 Reducing Variance

In this section, we propose a simple repetition technique to reduce the variance of the quantization scheme. For any s ≥ 1, let Q_C(s, v) be the average over s independent applications of the quantization Q_C(v). Note that even though Q_C(s, v) is not a point in C, we can communicate it using an equivalent representation as a tuple of s independent applications of Q_C, which requires s · ⌈log |C|⌉ bits:

 Q_C(s, v) ≡ (Q_C(v), …, Q_C(v))  (s times).

Using this repetition technique, the variance reduces by a factor of s while the communication increases by exactly the same factor.

###### Proposition 7.

Let C be a point set satisfying Condition (1). For any v ∈ ℝ^d such that ∥v∥₂ ≤ B and any s ≥ 1, let v̂ = Q_C(s, v). Then,

 E[∥v − v̂∥₂²] = O(R² B² / s).

In particular, with s = d, the cross polytope based quantization scheme (Proposition 4) achieves a variance of O(B²) at the cost of communicating O(d log d) bits.
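The repetition technique can be sketched generically as follows (our own illustration; `toy_quantize` is a hypothetical stand-in quantizer for coordinates in [0, 1], not one of the paper's schemes):

```python
import numpy as np

def repeat_quantize(v, quantize, s, rng):
    """Q_C(s, v): average of s independent applications of an unbiased
    quantizer, cutting the variance by a factor of s at s times the
    communication cost."""
    return np.mean([quantize(v, rng) for _ in range(s)], axis=0)

def toy_quantize(v, rng):
    """Stand-in unbiased quantizer (NOT one of the paper's schemes): round
    each coordinate of v in [0, 1]^d to {0, 1} with probability v_i."""
    return (rng.random(v.shape) < v).astype(float)
```

Since the s draws are independent and each is unbiased, the average stays unbiased and its per-coordinate variance shrinks by exactly 1/s.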

## 5 Private Quantization

In this section, we show that under certain conditions the quantization scheme Q_C obtained from the point set C is also ε-differentially private. We first see why the quantization scheme described in Section 4 is not privacy preserving in general.

Let C be any point set satisfying Condition (1). For any point x ∈ B_d(0_d, 1), let supp(x) ⊆ [|C|] denote the indices of the points in C that appear with a non-zero coefficient in the convex combination of x, i.e., the support of the distribution of Q_C(x).

In order for Q_C to be differentially private for some ε₀, we need to show that for any two gradient vectors x, y of neighboring data sets and any z in the range of Q_C,

 Pr[Q_C(x) = z] / Pr[Q_C(y) = z] ≤ e^{ε₀}. (4)

Since the convex combination need not be supported on all of C, there may exist two gradient vectors x, y such that supp(x) ⊄ supp(y). Therefore, there may not exist any finite ε₀ for which Equation (4) holds.

The discussion above suggests a sufficient condition for the quantization scheme to be differentially private: we want all points in B_d(0_d, 1) to have full support on all the points in C. This is definitely possible when |C| = d + 1. Therefore, if the point set satisfying Condition (1) has size d + 1, then the quantization scheme is ε₀-differentially private for some finite ε₀.

### 5.1 Simplex Scheme

Consider the following set of d + 1 points:

 C_S := { 2d e_i ∣ i ∈ [d] } ∪ { −4 · 1_d }.

The convex hull of C_S satisfies Condition (1) with R = 2d (see Proposition 8 for the proof). Since the size of the set is exactly d + 1, every point in the unit ball can be represented as a convex combination of all the points in C_S (i.e., all coefficients of the convex combination are non-zero). The scaling of the points in C_S is large enough that, for any two data sets, their scaled gradients contained in B_d(0_d, 1) have similar representations. The ε-differential privacy follows from this fact.

The coefficients of the convex combination of any point ṽ ∈ B_d(0_d, 1) can be computed from the following system of linear equations:

 [ −4 · 1_d  2d I_d ] [a_0, a_1, …, a_d]^T = ṽ, such that ∑_{i=0}^d a_i = 1. (5)

Equation (5) leads to the following closed form solution that can be computed in linear time:

 a_0 = (2d − ∑_{i=1}^d ṽ_i) / (6d), a_i = (ṽ_i + 4 a_0) / (2d) ∀ i ≥ 1. (6)
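The closed form (6) is cheap to evaluate; the following is a minimal sketch (our own illustrative code using NumPy), where `a0` pairs with the vertex −4 · 1_d and `a[i]` with 2d e_i:

```python
import numpy as np

def simplex_coefficients(v):
    """Closed form (6): convex-combination coefficients of a point v in the
    unit ball over C_S = {2d e_i : i in [d]} U {-4 * 1_d}."""
    d = v.shape[0]
    a0 = (2 * d - v.sum()) / (6 * d)   # coefficient of -4 * 1_d
    a = (v + 4 * a0) / (2 * d)         # coefficient of 2d e_i
    return a0, a
```

One can check the reconstruction directly: the j-th coordinate of a_0 (−4 · 1_d) + ∑_i a_i (2d e_i) is 2d a_j − 4a_0 = v_j.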
###### Proposition 8.

For any v ∈ ℝ^d with ∥v∥₂ ≤ B, let v̂ = Q_{C_S}(v). Then, E[v̂] = v and E[∥v − v̂∥₂²] = O(d² B²). Moreover, Q_{C_S} is ε-differentially private with ε = ln 7.

###### Proof.

First, we show that the point set C_S satisfies Condition (1) with R = 2d. The fact that conv(C_S) ⊆ B_d(0_d, 2d) follows trivially from the observation that each point in C_S has norm at most 2d.

To show that B_d(0_d, 1) ⊂ conv(C_S), consider any face F of the convex hull. We show that F is at a distance of at least 1 from 0_d. This in turn shows that any point outside the convex hull must be outside the unit ball as well.

First, consider the face

F = conv({ 2d e_i ∣ i ∈ [d] }), which is contained in the hyperplane

{ x ∣ ∑_{i=1}^d x_i = 2d }, and therefore is at a distance of 2d/√d = 2√d ≥ 1 from the origin.

Now consider a face containing the vertex −4 · 1_d, say F_j = conv({ 2d e_i ∣ i ≠ j } ∪ { −4 · 1_d }). We note that F_j ⊂ H_u, where H_u is the hyperplane defined by the unit normal vector u of F_j, and a direct computation shows that H_u is at a distance of at least 1 from 0_d.

Since all such faces are symmetric, the proof for every remaining face follows similarly.

#### Privacy:

We now show that the quantization scheme Q_{C_S} is ε-differentially private with ε = ln 7. From the definition of ε-DP, it is sufficient to show that for any x, y ∈ B_d(0_d, 1) and any c ∈ C_S,

 Pr[Q_{C_S}(x) = c] / Pr[Q_{C_S}(y) = c] ≤ 7.

Since x, y ∈ B_d(0_d, 1), we can express them as convex combinations of the points in C_S. Let x = ∑_{i=0}^d a(x)_i c_i, and similarly, let y = ∑_{i=0}^d a(y)_i c_i. Then, from the construction of the quantization function Q_{C_S}, we know that

 Pr[Q_{C_S}(x) = c] / Pr[Q_{C_S}(y) = c] = a(x)_c / a(y)_c.

We now show that this ratio is at most 7 for any pair x, y ∈ B_d(0_d, 1) and any c ∈ C_S. The privacy bound follows from this observation.

First, consider the case c = −4 · 1_d. From the closed form solution described in Equation (6), we know that a_0 = (2d − ∑_{i=1}^d v_i)/(6d). For any v ∈ B_d(0_d, 1), |∑_{i=1}^d v_i| ≤ √d. Therefore, a_0 ∈ [1/3 − 1/(6√d), 1/3 + 1/(6√d)]. It then follows that for any x and y,

 a(x)_c / a(y)_c ≤ (1/3 + 1/(6√d)) / (1/3 − 1/(6√d)) = 1 + 2/(2√d − 1) ≤ 3.

Now consider the case c = 2d e_i. From the closed form solution in Equation (6), the corresponding coefficient for any v ∈ B_d(0_d, 1) is a_i = (v_i + 4a_0)/(2d). Note that this quantity is maximized for v = e_i and minimized for v = −e_i. Therefore, the ratio for any x, y ∈ B_d(0_d, 1) is at most

 a(x)_c / a(y)_c ≤ (7d − 2)/(d + 2) ≤ 7.

The ratio for all other vertices can be computed in a similar fashion and is bounded by the same quantity. ∎

Next, we present a point set based on Hadamard matrices that gives slightly better privacy guarantees. The vertices obtained from a Hadamard matrix allow us to efficiently construct a set of points, all of which are far from the coordinate axes. Moreover, computing the coefficients of the convex combination is efficient, using the orthogonality properties of the Hadamard matrix.

In this section, we assume that d + 1 is a power of 2, i.e., d + 1 = 2^p for some integer p. Our quantization scheme is based on the columns of a Hadamard matrix: a square matrix with entries in {+1, −1} whose rows are mutually orthogonal [13]. Let H_p denote the 2^p × 2^p Hadamard matrix. Starting from H_0 = [1], we can recursively construct H_p as follows:

 H_p := [ H_{p−1}  H_{p−1} ; H_{p−1}  −H_{p−1} ].

For any i ∈ [d + 1], let h_i denote the i-th column of H_p with the first coordinate punctured. Consider the following set of points obtained from the punctured columns of H_p:

 C_H := { 2√d h_i ∣ i ∈ [d + 1] }.

The quantization scheme Q_{C_H} can be implemented efficiently, since computing the probabilities only requires a matrix-vector product with H_p, which has a closed form solution for each i ∈ [d + 1]:

 a_i = (1/(d + 1)) (1 + h_i^T ṽ / (2√d)). (7)
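The following is a minimal sketch of Equation (7) for a point in the unit ball (our own illustrative code using NumPy; it builds H_p by the Sylvester recursion rather than the FFT-style fast transform):

```python
import numpy as np

def hadamard_coefficients(v):
    """Closed form (7) for C_H: coefficients over the punctured columns of
    the (d+1) x (d+1) Sylvester Hadamard matrix; requires d+1 = 2^p."""
    d = v.shape[0]
    n = d + 1
    assert n & (n - 1) == 0, "d + 1 must be a power of two"
    H = np.array([[1]])
    while H.shape[0] < n:                         # Sylvester recursion H_p
        H = np.block([[H, H], [H, -H]])
    Hcols = H[1:, :]                              # puncture the first coordinate
    a = (1 + Hcols.T @ v / (2 * np.sqrt(d))) / n  # one coefficient per column
    return a, Hcols
```

Unbiasedness follows since the punctured columns sum to the zero vector and their outer products sum to (d + 1) I_d, so ∑_i a_i · 2√d h_i recovers v.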
###### Proposition 9.

For any v ∈ ℝ^d such that ∥v∥₂ ≤ B, let v̂ = Q_{C_H}(v). Then,

 E[v̂] = v and E[∥v − v̂∥₂²] = O(d² B²).

Moreover, Q_{C_H} is ε-differentially private with ε = ln(1 + √2).

###### Proof.

First, we show that C_H satisfies Condition (1) with R = 2d. The fact that conv(C_H) ⊆ B_d(0_d, 2d) is trivial, and follows since every point in C_H has norm 2√d · ∥h_i∥₂ = 2d.

To show that B_d(0_d, 1) ⊂ conv(C_H), consider any ṽ ∈ B_d(0_d, 1) and the closed form solution for the coefficients given by Equation (7). We now show that these coefficients indeed give a convex combination. Note that a_i ≥ 0. This holds since |h_i^T ṽ| ≤ ∥h_i∥₂ ∥ṽ∥₂ ≤ √d < 2√d. Moreover, from the properties of Hadamard matrices,

 ∑_{i=1}^{d+1} a_i = (1/(d + 1)) [1 … 1] H_p^T [1; ṽ/(2√d)] = 1.

The last equality follows from the following property of Hadamard matrices, which can be proved by induction:

 [1 … 1] H_p^T = [2^p 0 … 0].

Therefore, any ṽ ∈ B_d(0_d, 1) can be expressed as a convex combination of the points in C_H, i.e., ṽ = ∑_{i=1}^{d+1} a_i c_i, for c_i = 2√d h_i. ∎

#### Privacy:

We now show that the quantization scheme Q_{C_H} is ε-differentially private with ε = ln(1 + √2). From the definition of ε-DP, it is sufficient to show that for any x, y ∈ B_d(0_d, 1) and any c ∈ C_H,

 Pr[Q_{C_H}(x) = c] / Pr[Q_{C_H}(y) = c] ≤ 1 + √2.

Since x, y ∈ B_d(0_d, 1), we can express them as convex combinations of the points in C_H. Let x = ∑_{i=1}^{d+1} a(x)_i c_i, and similarly, let y = ∑_{i=1}^{d+1} a(y)_i c_i. Then, from the construction of the quantization function Q_{C_H}, we know that

 Pr[Q_{C_H}(x) = c] / Pr[Q_{C_H}(y) = c] = a(x)_c / a(y)_c. (8)

From the closed form solution in Equation (7), we know that for any v ∈ B_d(0_d, 1), the coefficient of c_i in the convex combination of v is given by a_i = (1 + h_i^T v/(2√d))/(d + 1). Plugging this into Equation (8), we get

 Pr[Q_{C_H}(x) = c_i] / Pr[Q_{C_H}(y) = c_i] = a(x)_{c_i} / a(y)_{c_i} = (2√d + h_i^T x) / (2√d + h_i^T y).