# Quantized Frank-Wolfe: Communication-Efficient Distributed Optimization

How can we efficiently mitigate the overhead of gradient communications in distributed optimization? This problem is at the heart of training scalable machine learning models and has been mainly studied in the unconstrained setting. In this paper, we propose Quantized Frank-Wolfe (QFW), the first projection-free and communication-efficient algorithm for solving constrained optimization problems at scale. We consider both convex and non-convex objective functions, expressed as a finite-sum or more generally a stochastic optimization problem, and provide strong theoretical guarantees on the convergence rate of QFW. This is done by proposing quantization schemes that efficiently compress gradients while controlling the variance introduced during this process. Finally, we empirically validate the efficiency of QFW in terms of communication and the quality of returned solution against natural baselines.

## 1 Introduction

The Frank-Wolfe (FW) algorithm (Frank and Wolfe, 1956), also known as the conditional gradient or projection-free method, has recently received a lot of attention in the machine learning community. Because it needs no projections yet maintains feasibility throughout its execution, FW is very efficient for various constrained convex (Jaggi, 2013; Garber and Hazan, 2014; Lacoste-Julien and Jaggi, 2015; Garber and Hazan, 2015; Hazan and Luo, 2016; Mokhtari et al., 2018b) and non-convex (Lacoste-Julien, 2016; Reddi et al., 2016) optimization problems. In many scenarios, projection operations are computationally prohibitive (e.g., projections onto a nuclear norm ball or onto a matroid polytope). To avoid this cost, FW replaces the projection step with solving a linear program.
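To make the "linear program instead of projection" idea concrete, here is a minimal sketch of one Frank-Wolfe step. It assumes, purely for illustration, that the constraint set is an $\ell_1$ ball, for which the linear program has a closed-form solution; the function names `lmo_l1_ball` and `frank_wolfe_step` are hypothetical and do not come from the paper.

```python
def lmo_l1_ball(grad, radius=1.0):
    """Linear minimization oracle: argmin of <v, grad> over {v : ||v||_1 <= radius}.

    Assumption for illustration: the feasible set is an l1 ball. A linear
    function over this ball is minimized at a signed vertex, so no projection
    is needed.
    """
    # Index of the gradient coordinate with the largest magnitude.
    i = max(range(len(grad)), key=lambda k: abs(grad[k]))
    v = [0.0] * len(grad)
    if grad[i] != 0:
        # Move against the sign of the gradient on that coordinate.
        v[i] = -radius if grad[i] > 0 else radius
    return v


def frank_wolfe_step(x, grad, step, radius=1.0):
    """One projection-free update: a convex combination of x and the LMO output."""
    v = lmo_l1_ball(grad, radius)
    return [xi + step * (vi - xi) for xi, vi in zip(x, v)]
```

Because the new iterate is a convex combination of two feasible points, it stays inside the ball whenever the step size lies in $[0, 1]$.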

In order to apply the FW algorithm to large-scale optimization and machine learning problems (e.g., training deep neural networks, SVMs, AdaBoost, experimental design, etc.), parallelization is unavoidable. To this end, distributed FW variants have been proposed for specific problems, e.g., learning low-rank matrices (Zheng et al., 2018) and optimization under block-separable constraint sets (Wang et al., 2016).

A significant performance bottleneck of distributed optimization methods is the cost of communicating gradients, typically handled by using a parameter-server framework. Intuitively, if each worker/processor in the distributed system transmits the entire gradient, then at least $d$ floating-point numbers are communicated for each worker, where $d$ is the dimension of the optimization problem. This communication cost can be a huge burden on the performance of parallel optimization algorithms (Chilimbi et al., 2014; Seide et al., 2014; Strom, 2015). To circumvent this drawback, communication-efficient parallel algorithms have recently received significant attention. One major approach is to quantize/compress the gradients while maintaining sufficient information (De Sa et al., 2015; Abadi et al., 2016; Wen et al., 2017). For the unconstrained optimization setting, where Stochastic Gradient Descent (SGD) does not need to perform any projection, various communication-efficient distributed algorithms have been proposed, including QSGD (Alistarh et al., 2017), SIGN-SGD (Bernstein et al., 2018), and Sparsified-SGD (Stich et al., 2018).

In the constrained setting, and in particular for distributed FW algorithms, the communication-efficient versions were only studied for specific problems such as sparse learning (Bellet et al., 2015; Lafond et al., 2016). In this paper, however, we develop Quantized Frank-Wolfe (QFW), a general communication-efficient distributed FW for both convex and non-convex objective functions. We study the performance of QFW in stochastic and finite-sum optimization settings.

Problem formulation: The focus of this paper is on constrained optimization in two widely recognized settings: 1) stochastic and 2) finite-sum optimization. Let $\mathcal{K}$ denote the constraint set, which is assumed to be convex and compact throughout the paper. In constrained stochastic optimization the goal is to solve the following problem:

$$\min_{x \in \mathcal{K}} f(x) := \min_{x \in \mathcal{K}} \mathbb{E}_{z \sim P}\big[\tilde{f}(x, z)\big], \tag{1}$$

where $x$ is the optimization variable, $z$ is a random variable drawn from a distribution $P$, and together they determine the choice of a stochastic function $\tilde{f}(x, z)$. In constrained finite-sum optimization we further assume that $P$ is a uniform distribution over $\{1, \dots, N\}$ and the goal is to solve a special case of Problem (1), namely,

$$\min_{x \in \mathcal{K}} f(x) := \min_{x \in \mathcal{K}} \frac{1}{N} \sum_{i=1}^{N} f_i(x). \tag{2}$$

In parallel settings, we suppose that there is a master and $M$ workers. Each worker maintains a local copy of $x$. At every iteration of the stochastic case, each worker has access to independent stochastic gradients of $f$; whereas in the finite-sum case, the objective function can be decomposed as

$$f(x) = \sum_{m \in [M],\, i \in [n]} \frac{f_{m,i}(x)}{Mn},$$

where worker $m$ holds the $n$ component functions $f_{m,i}$, $i \in [n]$.

Our contributions: In this paper, we propose two general-purpose quantization schemes that can be readily applied to distributed optimization settings. We then propose Quantized Frank-Wolfe, a distributed framework that handles quantization for constrained convex and non-convex optimization problems in finite-sum and stochastic cases. It achieves a favorable trade-off between communication complexity and convergence rate in distributed computation. The results are summarized in Table 1.

## 2 Gradient Quantization Schemes

As mentioned earlier, the communication cost can be reduced effectively by sending quantized gradients. In this section, we propose two quantization schemes which have different degrees of compression and information content. Depending on the specific requirements of the optimization task, one might choose one over the other.

### 2.1 Single-Partition Quantization

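Consider a gradient vector $g \in \mathbb{R}^d$ and let $g_i$ be the $i$-th coordinate of the gradient. To transmit the scalar $g_i$, Sign Encoding Scheme sends the product of the sign of $g_i$ and a properly chosen random variable $b_i$, defined as

$$b_i = \begin{cases} 1, & \text{w.p. } \dfrac{|g_i|}{\|g\|_\infty}, \\[2mm] 0, & \text{w.p. } 1 - \dfrac{|g_i|}{\|g\|_\infty}, \end{cases}$$

where $\|g\|_\infty$ is the $\ell_\infty$-norm of $g$. Note that the product of $\mathrm{sign}(g_i)$ and $b_i$ belongs to the set $\{-1, 0, 1\}$ and can be transmitted using two bits. On the receiver side, given access to the norm $\|g\|_\infty$, one can recover the scalar $g_i$ (in expectation) by computing $\mathrm{sign}(g_i)\, b_i\, \|g\|_\infty$, since we have $\mathbb{E}[\mathrm{sign}(g_i)\, b_i\, \|g\|_\infty \mid g] = g_i$. According to this observation, for each coordinate $i$, Sign Encoding Scheme only needs to communicate the encoded scalar $\mathrm{sign}(g_i)\, b_i$, alongside the norm $\|g\|_\infty$. Therefore, if we define $\mathrm{sign}(g)$ as a vector containing the $\mathrm{sign}(g_i)$'s, and $b$ as a vector containing the random variables $b_i$, the Sign Encoding Scheme is a tuple $\phi(g) = (\mathrm{sign}(g) \circ b,\ \|g\|_\infty)$, where $\circ$ is the Hadamard product operator. Similarly, the corresponding decoding scheme is $\phi'(\phi(g)) = \|g\|_\infty\, (\mathrm{sign}(g) \circ b)$, which is an unbiased estimator of the gradient vector $g$.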

Since transmitting each element of the vector $\mathrm{sign}(g) \circ b$ requires two bits, the total communicated bits for $\mathrm{sign}(g) \circ b$ are $2d$. Assuming that sending the norm $\|g\|_\infty$, which is a scalar, requires 32 bits, the overall communicated bits of Sign Encoding Scheme for each worker are $32 + 2d$ per round. Finally, note that the elements of $\mathrm{sign}(g) \circ b$ can only be $-1$, $0$, or $1$. Thus intuitively, Sign Encoding Scheme compresses the gradient aggressively and may lead to a loss of information. In the following lemma we formally characterize this loss in terms of the variance of Sign Encoding Scheme.

###### Lemma 1 (Proof in Appendix A).

For any input vector $g$, the variance of Sign Encoding Scheme is given by

$$\mathrm{Var}[\phi'(g) \mid g] = \|g\|_1 \|g\|_\infty - \|g\|_2^2. \tag{3}$$

Lemma 1 implies that if the absolute values of the elements of the vector $g$ are in the same range (that is, its energy is divided among its elements in a balanced way), then the variance of Sign Encoding Scheme, namely $\|g\|_1 \|g\|_\infty - \|g\|_2^2$, is small. For instance, if all the elements of the vector $g$ are equal in absolute value, then the variance is zero. Conversely, if the absolute values of a few elements of $g$ are significantly larger than the rest, then the variance becomes large. For instance, if one of the elements of is and the remaining ones are , then the variance is .
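The Sign Encoding Scheme and the variance formula of Lemma 1 can be sketched in a few lines. This is an illustrative, single-machine sketch using only the standard library; the function names (`sign_encode`, `sign_decode`, `sign_variance`) and the optional `rng` argument are assumptions made for the example, not part of the paper.

```python
import random


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_encode(g, rng=random):
    """Encode g as (vector with entries in {-1, 0, 1}, l_inf norm of g)."""
    norm_inf = max(abs(gi) for gi in g)
    if norm_inf == 0:
        return [0] * len(g), 0.0
    # b_i = 1 with probability |g_i| / ||g||_inf, and 0 otherwise.
    code = [sign(gi) * (1 if rng.random() < abs(gi) / norm_inf else 0) for gi in g]
    return code, norm_inf


def sign_decode(code, norm_inf):
    """Unbiased reconstruction: rescale the transmitted signs by the norm."""
    return [c * norm_inf for c in code]


def sign_variance(g):
    """Closed form from Lemma 1: ||g||_1 * ||g||_inf - ||g||_2^2."""
    norm_inf = max(abs(gi) for gi in g)
    return sum(abs(gi) for gi in g) * norm_inf - sum(gi * gi for gi in g)
```

For a balanced vector such as `[2.0, -2.0, 2.0]` every $b_i$ equals 1 with probability one, so decoding recovers the vector exactly and `sign_variance` returns zero. For the unbalanced vector `[4.0, 1.0, 1.0]` the formula gives $6 \cdot 4 - 18 = 6$.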

###### Remark 1.

For the probability distribution of the random variable $b_i$, instead of $\|g\|_\infty$, we can use other norms $\|g\|_p$ (where $p \ge 1$). But it can be verified that the $\ell_\infty$-norm leads to the smallest variance for Sign Encoding Scheme.

### 2.2 Multi-Partition Quantization

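Now we focus on the Multi-Partition Quantization Scheme, which has a lower variance compared to Sign Encoding Scheme, but at the cost of sending more bits at each round of communication. Unlike Sign Encoding Scheme, which codes each element into a scalar from $\{-1, 0, 1\}$, in $s$-Partition Encoding Scheme each element is encoded into an element from the set $\{0, \pm 1/s, \pm 2/s, \dots, \pm 1\}$. To transmit the $i$-th element $g_i$ of the gradient vector $g$, $s$-Partition Encoding Scheme first computes the ratio $|g_i| / \|g\|_\infty$ and finds the index $l_i \in \{0, 1, \dots, s-1\}$ for which the following inequality is satisfied

$$\frac{l_i}{s} \le \frac{|g_i|}{\|g\|_\infty} \le \frac{l_i + 1}{s}. \tag{4}$$

After finding $l_i$, we define the random variable $b_i$ by the following probability distribution

$$b_i = \begin{cases} l_i / s, & \text{w.p. } 1 - \dfrac{|g_i|}{\|g\|_\infty} s + l_i, \\[2mm] (l_i + 1) / s, & \text{w.p. } \dfrac{|g_i|}{\|g\|_\infty} s - l_i. \end{cases} \tag{5}$$

Then, instead of transmitting $g_i$, $s$-Partition Encoding Scheme sends the product of $\mathrm{sign}(g_i)$ and the random variable $b_i$. It can be verified that $\mathbb{E}[b_i \mid g] = |g_i| / \|g\|_\infty$. So we define the corresponding decoding scheme as $\phi'(\mathrm{sign}(g_i)\, b_i) = \mathrm{sign}(g_i)\, b_i\, \|g\|_\infty$ to ensure that it is an unbiased estimator of $g_i$.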
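The unbiasedness claim follows from a one-line computation with the probabilities in (5). Writing $r_i$ as shorthand for the ratio $|g_i| / \|g\|_\infty$ (a symbol introduced here only for brevity), we have

$$
\begin{aligned}
\mathbb{E}[b_i \mid g]
&= \frac{l_i}{s}\,\big(1 - s r_i + l_i\big) + \frac{l_i + 1}{s}\,\big(s r_i - l_i\big) \\
&= \frac{l_i}{s}\,\Big[\big(1 - s r_i + l_i\big) + \big(s r_i - l_i\big)\Big] + \frac{1}{s}\,\big(s r_i - l_i\big) \\
&= \frac{l_i}{s} + r_i - \frac{l_i}{s} = r_i .
\end{aligned}
$$

Multiplying by $\mathrm{sign}(g_i)\,\|g\|_\infty$ then gives $\mathrm{sign}(g_i)\,|g_i| = g_i$ in expectation.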

Intuitively, in $s$-Partition Encoding Scheme, we partition the interval $[0, 1]$ into $s$ parts of equal length, find the specific sub-interval in which $|g_i| / \|g\|_\infty$ falls, and estimate it by one of the two end points of that sub-interval at random. The probability for each end point is chosen according to (5) to make sure that the estimate is unbiased. Note that the output of the Sign Encoding Scheme can only take values from the set $\{-1, 0, 1\}$. Hence, Sign Encoding Scheme only considers the interval $[0, 1]$ and estimates $|g_i| / \|g\|_\infty$ by one of the two end points, namely, $0$ and $1$. This observation implies that the Sign Encoding Scheme can be interpreted as a single-partition quantization.

In Multi-Partition Quantization, for each coordinate $i$, we need 1 bit to transmit $\mathrm{sign}(g_i)$. Moreover, since $b_i \in \{0, 1/s, \dots, 1\}$, we need $z = \log_2(s + 1)$ bits to send $b_i$. Finally, we need 32 bits to transmit $\|g\|_\infty$. Hence, the total number of communicated bits is $32 + d(z + 1)$.

One major advantage of the $s$-Partition Encoding Scheme is that by tuning the partition parameter $s$ or the corresponding assigned bits $z$, we can smoothly control the trade-off between gradient quantization and information loss, which helps distributed algorithms to attain their best performance. In the following lemma, we formally characterize the variance of the $s$-Partition Encoding Scheme and highlight the accuracy vs. communication cost trade-off.

###### Lemma 2 (Proof in Appendix B).

The variance of $s$-Partition Encoding Scheme for any vector $g \in \mathbb{R}^d$ is bounded by

$$\mathrm{Var}[\phi'(g) \mid g] \le \frac{d}{s^2} \|g\|_\infty^2. \tag{6}$$

Lemma 2 demonstrates the trade-off between the quantization error and the communication cost for $s$-Partition Encoding Scheme. In a nutshell, a larger $s$ yields a smaller variance, but at the price of a higher communication cost.
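The following sketch illustrates the $s$-Partition Encoding Scheme together with its exact variance and bit count. It is a minimal single-machine illustration, not the authors' implementation: the function names are hypothetical, and the bit count rounds $\log_2(s+1)$ up to an integer, which is an assumption made here so that non-power-of-two values of $s + 1$ give a whole number of bits.

```python
import math
import random


def _level_and_prob(gi, norm_inf, s):
    """Return (l_i, probability of rounding up) for one coordinate, as in (4)-(5)."""
    ratio = abs(gi) / norm_inf if norm_inf > 0 else 0.0
    low = min(int(math.floor(ratio * s)), s - 1)  # l_i in {0, ..., s-1}
    return low, ratio * s - low


def partition_encode(g, s, rng=random):
    """Encode g as (signs, integer levels in {0, ..., s}, l_inf norm)."""
    norm_inf = max(abs(gi) for gi in g)
    signs, levels = [], []
    for gi in g:
        low, p_up = _level_and_prob(gi, norm_inf, s)
        # Randomized rounding to one of the two end points l_i/s or (l_i+1)/s.
        levels.append(low + 1 if rng.random() < p_up else low)
        signs.append((gi > 0) - (gi < 0))
    return signs, levels, norm_inf


def partition_decode(signs, levels, norm_inf, s):
    """Unbiased reconstruction sign(g_i) * b_i * ||g||_inf with b_i = level / s."""
    return [sg * lv / s * norm_inf for sg, lv in zip(signs, levels)]


def partition_variance(g, s):
    """Exact variance of the decoded vector (sum over coordinates)."""
    norm_inf = max(abs(gi) for gi in g)
    total = 0.0
    for gi in g:
        _, p_up = _level_and_prob(gi, norm_inf, s)
        # A two-point variable with gap 1/s has variance p(1-p)/s^2.
        total += p_up * (1.0 - p_up) * (norm_inf / s) ** 2
    return total


def bits_per_message(d, s):
    """32 bits for the norm plus (1 sign bit + z level bits) per coordinate."""
    return 32 + d * (1 + math.ceil(math.log2(s + 1)))
```

With $s = 1$ the bit count reduces to $32 + 2d$, matching the Sign Encoding Scheme, and the exact variance computed by `partition_variance` never exceeds the bound $d\,\|g\|_\infty^2 / s^2$ of Lemma 2.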

## 3 Stochastic Optimization

In this section, we aim to solve the constrained stochastic optimization problem defined in (1) in a distributed fashion. In particular, we are interested in projection-free (Frank-Wolfe type) methods and apply quantization to reduce the communication cost between the workers and the master. Recall that we assume that at each round $t$, each worker $m$ has access to an unbiased estimator of the objective function gradient $\nabla f(x_t)$, which is denoted by $g_t^m(x_t)$, i.e., $\mathbb{E}[g_t^m(x_t) \mid x_t] = \nabla f(x_t)$. We further assume that the stochastic gradients are independent of each other.

As shown in Fig. 1, the workflow of our proposed algorithm is straightforward. At iteration $t$, each worker $m$ first computes its local stochastic gradient $g_t^m(x_t)$. Then, it encodes $g_t^m(x_t)$ as $\Phi(g_t^m(x_t))$, which is quantized and can be transmitted at a low communication cost, and sends it to the master. Once the master receives all the coded stochastic gradients $\{\Phi(g_t^m(x_t))\}_{m=1}^{M}$, it uses the corresponding decoding scheme to evaluate $\{\Phi'(g_t^m(x_t))\}_{m=1}^{M}$, which are the decoded versions of the received signals. By design, each of the decoded signals is an unbiased estimator of the objective function gradient $\nabla f(x_t)$. Then, the master evaluates the average of the decoded signals, which we denote by $\tilde{g}_t$, i.e., $\tilde{g}_t = \frac{1}{M} \sum_{m=1}^{M} \Phi'(g_t^m(x_t))$. After applying a quantization scheme, the master broadcasts the coded signal $\Phi(\tilde{g}_t)$ to all the workers. The workers decode the received signal and use the resulting vector $\Phi'(\tilde{g}_t)$ to improve their local stochastic gradient approximation.

Note that the vector $\Phi'(\tilde{g}_t)$ is an unbiased estimator of $\nabla f(x_t)$. If we ignore the influence of quantization, it has a lower variance compared to the local vector $g_t^m(x_t)$, as its computation incorporates the information of $M$ stochastic gradients. Still, if we use $\Phi'(\tilde{g}_t)$ instead of the actual but unavailable gradient $\nabla f(x_t)$, Frank-Wolfe may diverge (Mokhtari et al., 2018b). To overcome this issue, we need to further reduce the variance. To do so, each worker uses a local momentum vector $\bar{g}_t$ to update the iterates, which is defined by

$$\bar{g}_t \leftarrow (1 - \rho_t)\, \bar{g}_{t-1} + \rho_t\, \Phi'(\tilde{g}_t). \tag{7}$$

As the decoded vectors $\Phi'(\tilde{g}_t)$ for the workers are identical, if they all initialize the sequence $\bar{g}_t$ in the same way, then for all iterations the local vectors $\bar{g}_t$ of all the workers are the same. As the update of $\bar{g}_t$ in (7) computes a weighted average of the previous stochastic gradient approximation $\bar{g}_{t-1}$ and the updated network-average stochastic gradient $\Phi'(\tilde{g}_t)$, it has a lower variance compared to the vector $\Phi'(\tilde{g}_t)$ (note that it is not an unbiased estimator of $\nabla f(x_t)$). The key fact that allows us to prove convergence is that the estimation error of $\bar{g}_t$ with respect to $\nabla f(x_t)$ approaches zero as time passes, which is formally characterized in Lemma 3 in Appendix C.

After computing the vector $\bar{g}_t$ based on the update in (7), workers can update their variables in the standard way, i.e., $x_{t+1} = x_t + \eta_t (v_t - x_t)$. The vector $v_t$ is defined as $v_t = \arg\min_{v \in \mathcal{K}} \langle v, \bar{g}_t \rangle$. Similar to the argument above, if the iterates of all the workers are initialized at the same point $x_1$, then for all the iterations $t$, the local variables $x_t$ of all the workers are identical. Note that the update of $x_t$ is slightly different from that of the Frank-Wolfe method, as the exact but unavailable gradient $\nabla f(x_t)$ is replaced by its stochastic approximation $\bar{g}_t$. The full description of our proposed Stochastic Quantized Frank-Wolfe is outlined in Algorithm 1. Finally, note that we can use different quantization schemes in Algorithm 1, which lead to different convergence rates and communication costs. We explore their effects empirically in our set of experiments.
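One round of the procedure described above can be simulated on a single machine as follows. This is a hedged sketch rather than Algorithm 1 itself: the $\ell_1$-ball constraint, the function names, and the `quantize` argument (standing for "encode, transmit, decode"; the identity by default) are assumptions made for illustration, and `rho` and `eta` are simply passed in rather than set by any schedule from the paper.

```python
def lmo_l1_ball(grad, radius=1.0):
    """argmin of <v, grad> over the l1 ball (assumed constraint set for this sketch)."""
    i = max(range(len(grad)), key=lambda k: abs(grad[k]))
    v = [0.0] * len(grad)
    if grad[i] != 0:
        v[i] = -radius if grad[i] > 0 else radius
    return v


def qfw_round(x, worker_grads, g_bar_prev, rho, eta,
              quantize=lambda g: list(g), radius=1.0):
    """Simulate one round: workers -> master -> workers, then the local update.

    `quantize` is a stand-in for encoding followed by decoding; any unbiased
    quantizer could be plugged in.
    """
    # Workers push quantized stochastic gradients; the master decodes them.
    decoded = [quantize(g) for g in worker_grads]
    # The master averages the decoded gradients coordinate-wise.
    m = len(decoded)
    g_tilde = [sum(col) / m for col in zip(*decoded)]
    # The master broadcasts a quantized average; workers decode it.
    g_recv = quantize(g_tilde)
    # Momentum averaging, as in update (7).
    g_bar = [(1 - rho) * gb + rho * gr for gb, gr in zip(g_bar_prev, g_recv)]
    # Projection-free step: linear minimization, then a convex combination.
    v = lmo_l1_ball(g_bar, radius)
    x_next = [xi + eta * (vi - xi) for xi, vi in zip(x, v)]
    return x_next, g_bar
```

Since every worker receives the same broadcast vector and applies the same deterministic update, running this function once per round mirrors the fact that all workers keep identical copies of the iterate and of the momentum vector.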

### 3.1 Convex Optimization

In this subsection, we focus on the convergence rate of Stochastic Quantized Frank-Wolfe when applied to convex objective functions. To do so, we first make the following assumptions on the constraint set $\mathcal{K}$, the objective function $f$, the local stochastic gradients $g_t^m$, and the quantization scheme $\Phi$.

###### Assumption 1.

The constraint set $\mathcal{K}$ is convex and compact. We also denote its diameter by $D$.

###### Assumption 2.

The objective function $f$ is convex, bounded, i.e., $\sup_{x \in \mathcal{K}} |f(x)| \le M_0$, and $L$-smooth over $\mathcal{K}$.

###### Assumption 3.

For each worker $m$ and iteration $t$, the stochastic gradient $g_t^m$ is unbiased and has a uniformly bounded variance, i.e., for all $m$ and $t$,

$$\mathbb{E}[g_t^m(x_t) \mid x_t] = \nabla f(x_t), \qquad \mathrm{Var}[g_t^m(x_t) \mid x_t] \le \sigma_1^2.$$

###### Assumption 4.

For any $m$ and $t$, and vectors $g_t^m(x_t)$ and $\tilde{g}_t$ generated by Stochastic Quantized Frank-Wolfe, the quantization scheme $\Phi$ satisfies

$$\mathbb{E}[\Phi'(g_t^m(x_t)) \mid g_t^m(x_t)] = g_t^m(x_t), \qquad \mathbb{E}\big[\|\Phi'(g_t^m(x_t)) - g_t^m(x_t)\|^2\big] \le \sigma_2^2,$$

$$\mathbb{E}[\Phi'(\tilde{g}_t) \mid \tilde{g}_t] = \tilde{g}_t, \qquad \mathbb{E}\big[\|\Phi'(\tilde{g}_t) - \tilde{g}_t\|^2\big] \le \sigma_3^2.$$

By considering the above assumptions, in the following theorem we show the convergence rate of Stochastic Quantized Frank-Wolfe.

###### Theorem 1 (Proof in Appendix D).

Under Assumptions 1, 2, 3 and 4, if we set in Algorithm 1, then after $T$ iterations, the output $x_{T+1}$ is a feasible point, i.e., $x_{T+1} \in \mathcal{K}$, and satisfies the inequality

$$\mathbb{E}[f(x_{T+1})] - f(x^*) \le \frac{Q_0}{(T+4)^{1/3}},$$

where , and $x^*$ is the global minimizer of $f$ on $\mathcal{K}$.

Theorem 1 shows that the suboptimality gap of Stochastic Quantized Frank-Wolfe converges to zero at a sublinear rate of $O(1/T^{1/3})$. In other words, after running at most $O(1/\epsilon^3)$ iterations we can find a solution that is $\epsilon$-close to the optimum. Next, we can incorporate the concrete Sign Encoding Scheme into Stochastic Quantized Frank-Wolfe. We first need the following assumption on the stochastic gradients.

###### Assumption 5.

The stochastic gradients have uniformly bounded $\ell_1$ and $\ell_\infty$ norms, i.e.,

###### Corollary 1 (Proof in Appendix E).

Under Assumptions 1, 2, 3 and 5, if we set , and apply Sign Encoding Scheme in Algorithm 1, then after $T$ iterations, the output $x_{T+1}$ satisfies

$$\mathbb{E}[f(x_{T+1})] - f(x^*) \le \frac{Q_0}{(T+4)^{1/3}},$$

where , and $x^*$ is the global minimizer of $f$ on $\mathcal{K}$.

The idea of the proof is straightforward. To apply Theorem 1, we only need to calculate the quantization variance bounds $\sigma_2^2$ and $\sigma_3^2$ of Assumption 4 for the specific quantization scheme; the rate then follows from Theorem 1 directly. Considering the fact that each round of communication in Sign Encoding Scheme requires bits, the overall communication cost to find an $\epsilon$-suboptimal solution is .

### 3.2 Non-Convex Optimization

With slightly different parameters, Stochastic Quantized Frank-Wolfe can be applied to non-convex settings as well. In unconstrained non-convex optimization problems, the gradient norm $\|\nabla f(x)\|$ is usually a good measure of convergence, as $\|\nabla f(x)\| \to 0$ implies convergence to a stationary point. However, in the constrained setting it is not a good benchmark, and instead we need to look at the Frank-Wolfe gap (Jaggi, 2013; Lacoste-Julien, 2016) defined as

$$G(x) = \max_{v \in \mathcal{K}} \langle v - x, -\nabla f(x) \rangle. \tag{8}$$

For the constrained optimization problem (1), if a point $x$ satisfies $G(x) = 0$, then it is a first-order stationary point. Also, by definition, we have $G(x) \ge 0$ for all $x \in \mathcal{K}$.
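As a small worked illustration of definition (8), the sketch below evaluates the Frank-Wolfe gap in closed form for the special case where the constraint set is an $\ell_1$ ball. That choice of set and the function name `frank_wolfe_gap_l1` are assumptions made for the example; for a general $\mathcal{K}$ the maximization requires a call to a linear optimization oracle.

```python
def frank_wolfe_gap_l1(x, grad, radius=1.0):
    """Frank-Wolfe gap G(x) = max_{||v||_1 <= radius} <v - x, -grad>.

    Over the l1 ball, max_v <v, -grad> equals radius * ||grad||_inf, so the
    gap is radius * ||grad||_inf + <x, grad>.
    """
    norm_inf = max(abs(g) for g in grad)
    inner = sum(xi * gi for xi, gi in zip(x, grad))
    return radius * norm_inf + inner


def grad_quadratic(x, center):
    """Gradient of the illustrative objective f(x) = 0.5 * ||x - center||^2."""
    return [xi - ci for xi, ci in zip(x, center)]
```

For the objective $f(x) = \tfrac{1}{2}\|x - c\|^2$ with $c = (2, 0)$ on the unit $\ell_1$ ball, the constrained minimizer is $(1, 0)$ and the gap there is zero, while at the origin the gap equals 2.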

We will analyze the convergence rate of Algorithm 1 based on the following assumption on the objective function .

###### Assumption 6.

The objective function $f$ is bounded, i.e., $\sup_{x \in \mathcal{K}} |f(x)| \le M_0$, and $L$-smooth over $\mathcal{K}$.

###### Theorem 2 (Proof in Appendix F).

Under Assumptions 1, 3, 4 and 6, and given the iteration horizon $T$, if we set in Algorithm 1, then after $T$ iterations we have

$$\mathbb{E}[G(x_o)] \le \frac{8M_0 + 20 D Q^{1/2}/3}{(T+3)^{1/4}} + \frac{L D^2}{2 (T+3)^{3/4}},$$

where .

In other words, Theorem 2 indicates that in the non-convex setting, Stochastic Quantized Frank-Wolfe finds an $\epsilon$-first order stationary point after at most $O(1/\epsilon^4)$ iterations. This result combined with the concrete quantization method Sign Encoding Scheme leads to the following corollary.

###### Corollary 2.

Under Assumptions 1, 3, 5 and 6, if we set , and apply Sign Encoding Scheme in Algorithm 1, then the output $x_o$ satisfies

$$\mathbb{E}[G(x_o)] \le \frac{8M_0 + 20 D Q^{1/2}/3}{(T+3)^{1/4}} + \frac{L D^2}{2 (T+3)^{3/4}},$$

where .

By using Sign Encoding Scheme, each round of communication requires bits. Therefore, to find an $\epsilon$-first order stationary point, Corollary 2 indicates that we need $O(1/\epsilon^4)$ rounds with the overall communication cost of .

## 4 Finite-Sum Optimization

In this section, we focus on the finite-sum problem (2), where we assume that there are $N = Mn$ functions in total and each worker $m$ has access to $n$ functions $f_{m,i}$ for $i \in [n]$. The major difference with the stochastic setting is that we can use a more aggressive variance reduction for communicating quantized gradients. More specifically, we adopt the Stochastic Path-Integrated Differential Estimator (SPIDER) technique, first introduced by Fang et al. (2018) for unconstrained optimization in centralized settings. We generalize SPIDER to the constrained and distributed settings.

To do so, let us define a period parameter $p$. At the beginning of each period, namely, when $\mathrm{mod}(t, p) = 1$, each worker $m$ computes the full average of its local gradients and sends it to the master. The master calculates the average of these signals, i.e., the average of gradients for all the component functions, and broadcasts it to all the workers. The workers then update the gradient estimate as follows:

$$\bar{g}_t \leftarrow \sum_{m=1}^{M} \sum_{i=1}^{n} \frac{\nabla f_{m,i}(x_t)}{Mn}.$$

Note that $\bar{g}_t$ is identical for all the workers. In the remaining iterations of that period, i.e., when $\mathrm{mod}(t, p) \ne 1$, each worker $m$ samples a set of local component functions, denoted as $S_{t,m}$, of size $S$ uniformly at random, computes the average of these local gradients, and sends it to the master. The master calculates the average of the signals and broadcasts it to all the workers. The workers then update the gradient estimate as follows:

$$\bar{g}_t \leftarrow \sum_{m=1}^{M} \sum_{i \in S_{t,m}} \frac{\nabla f_{m,i}(x_t) - \nabla f_{m,i}(x_{t-1})}{MS} + \bar{g}_{t-1}.$$

So $\bar{g}_t$ is still identical for all the workers. In order to incorporate quantization, each worker simply pushes the quantized version of the average gradients. Then the master decodes the quantized signals, encodes the average of the decoded signals in a quantized fashion, and broadcasts it. Finally, each worker decodes the quantized signal and updates locally. The full description of our proposed Finite-Sum Quantized Frank-Wolfe algorithm is outlined in Algorithm 2.
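The two gradient-estimate updates above can be sketched as follows. This is a simplified single-process illustration that omits the worker/master split and the quantization step; the function names and the representation of component gradients as Python callables are assumptions made for the example.

```python
def full_gradient(grad_fns, x):
    """Start of a period: average of all component gradients at x."""
    grads = [g(x) for g in grad_fns]
    return [sum(gr[k] for gr in grads) / len(grads) for k in range(len(x))]


def spider_update(g_bar_prev, sampled_grad_fns, x, x_prev):
    """Inside a period: previous estimate plus the averaged gradient difference
    of the sampled components between the current and previous iterates."""
    n = len(sampled_grad_fns)
    diffs = [
        [a - b for a, b in zip(g(x), g(x_prev))]
        for g in sampled_grad_fns
    ]
    return [gb + sum(df[k] for df in diffs) / n for k, gb in enumerate(g_bar_prev)]
```

As a sanity check on the recursion, if every component is sampled then the update reproduces the full gradient at the new iterate exactly, because the gradient differences telescope. With a strict subset of components, the estimate is only an approximation whose error is controlled by how far the iterate moved.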

Compared with Stochastic Quantized Frank-Wolfe, one advantage of Finite-Sum Quantized Frank-Wolfe is that we can use different quantization schemes at different iterations $t$, which makes Algorithm 2 more flexible in solving various optimization problems. We will explore their effects empirically in our set of experiments.

Finally, note again that FW is very sensitive to the accuracy of gradients in order to converge. Nevertheless, we next give strong theoretical guarantees on the convergence rate of Finite-Sum Quantized Frank-Wolfe for both convex and non-convex settings while using quantized and local gradients distributed over machines.

### 4.1 Convex Optimization

To analyze the convex case, we first make an assumption on the component functions.

###### Assumption 7.

The component functions $f_{m,i}$'s are convex, $L$-smooth on $\mathcal{K}$, and uniformly bounded, i.e., $\sup_{x \in \mathcal{K}} |f_{m,i}(x)| \le M_0$.

Since the functions $f_{m,i}$ are all bounded and $L$-smooth on the compact constraint set $\mathcal{K}$, their gradients are also bounded on $\mathcal{K}$. Moreover, we only have a finite number of component functions $f_{m,i}$'s; thus, there will always be a uniform bound on $\|\nabla f_{m,i}(x)\|_\infty$. For simplicity of analysis, we assume an explicit upper bound $G_\infty$ in the following theorem, although this is a direct implication of the other assumptions.

###### Theorem 3 (Proof in Appendix H).

Let us set , and apply -Partition Encoding Scheme , and -Partition Encoding Scheme as in Algorithm 2 where . Under Assumptions 1 and 7, and , and after $T$ iterations, the output $x_{T+1}$ satisfies

$$\mathbb{E}[f(x_{T+1})] - f(x^*) \le \frac{Q_0}{T},$$

where , , and $x^*$ is the global minimizer of $f$ on $\mathcal{K}$.

Theorem 3 indicates that in the convex setting, if we use the recommended quantization schemes, then the output of Finite-Sum Quantized Frank-Wolfe is $\epsilon$-suboptimal with at most $O(1/\epsilon)$ rounds, i.e., the Linear-optimization Oracle (LO) complexity is $O(1/\epsilon)$. Also, the total Incremental First-order Oracle (IFO) complexity is . The average communication bits per round are at most .

Similar to the stochastic case, the key part of our analysis is to bound the estimation error of $\bar{g}_t$, which is addressed in Lemma 4 in Appendix G. Also, since finite-sum optimization can be regarded as a special case of stochastic optimization, the recursive inequality for the suboptimality gap can be derived directly from the proof of Theorem 1. Combining these two ingredients, Theorem 3 can be derived.

### 4.2 Non-convex Optimization

Algorithm 2 can also be applied to the non-convex setting with a slight change in parameters. We first make a standard assumption on the component functions.

###### Assumption 8.

The component functions $f_{m,i}$'s are $L$-smooth on $\mathcal{K}$ and uniformly bounded, i.e., $\sup_{x \in \mathcal{K}} |f_{m,i}(x)| \le M_0$.

###### Theorem 4 (Proof in Appendix I).

Under Assumptions 1 and 8, and , if we set , and apply -Partition Encoding Scheme and -Partition Encoding Scheme as in Algorithm 2, where , then the output $x_o$ of Algorithm 2 satisfies

$$\mathbb{E}[G(x_o)] \le \frac{2M_0 + D\sqrt{L^2 D^2 + 2 G_\infty^2}}{\sqrt{T}} + \frac{L D^2}{2\sqrt{T}}.$$

Theorem 4 shows that for non-convex minimization, if we adopt the recommended quantization schemes, then Algorithm 2 can find an $\epsilon$-first order stationary point with at most $O(1/\epsilon^2)$ rounds, i.e., the LO complexity is $O(1/\epsilon^2)$. Also, the total IFO complexity is . The average communication bits per round are .

## 5 Experiments

We evaluate the performance of the algorithms from two aspects. The first one is how the loss (the objective function) changes with an increasing number of epochs, while the second one is the number of bits (i.e., the communication complexity) that the master and worker nodes exchange per iteration.

We use the MNIST dataset and consider a convex model and a non-convex model. The convex model consists of a two-layer fully connected neural network with no hidden layer. The output layer has 10 neurons and the log loss for multiclass classification is used. This model is equivalent to multinomial logistic regression. The non-convex model adds two hidden layers with 10 neurons. The constraint is that the -norm should be at most .

For both the convex and non-convex models and in both the stochastic and finite-sum settings, we vary the quantization level and compute the loss after each epoch. Additionally, we compute the average number of bits exchanged by the master and the worker nodes per iteration in order to quantify the communication complexity. A total number of 20 workers are used. In the stochastic setting, each batch of a worker contains 500 images. In the finite-sum setting, each sample of a worker contains 100 images.

The performance for the convex and non-convex models is quantified by the log loss and the average Frank-Wolfe gap, respectively. Recall that according to Theorems 1, 2, 3 and 4, the output for the convex model is $x_{T+1}$ and the output for the non-convex model (denoted by $x_o$) is chosen uniformly at random from $\{x_1, \dots, x_T\}$. We have $\mathbb{E}[G(x_o)] = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[G(x_t)]$. Therefore, for the non-convex model, we plot the running average $\frac{1}{t} \sum_{i=1}^{t} G(x_i)$ as the Frank-Wolfe gap at the $t$-th epoch. The results for the convex model are presented in Fig. 2 while the results for the non-convex model are presented in Fig. 3. In both figures, the subfigures in the first row (Figs. 1(b) and 1(a) and Figs. 2(b) and 2(a)) show loss vs. epoch in the stochastic setting. Those in the second row (Figs. 1(d) and 1(c) and Figs. 2(d) and 2(c)) show loss vs. epoch in the finite-sum setting. The third row shows the communication complexity in both settings.

Recall that $s_1$ is the quantization level that is used when workers send their local gradients to the master and $s_2$ is used when the master broadcasts the tensor to all workers. Figs. 2(c), 2(a), 1(c) and 1(a) show how the loss changes if we fix $s_1$ and vary $s_2$. We can observe that increasing $s_2$ improves the convergence performance significantly. In Figs. 2(a) and 1(a), choosing achieves a similar performance to the situation where all tensors are transferred in their raw form without any quantization. Similarly, in Figs. 2(c) and 1(c), using recommended by Theorems 3 and 4 results in a performance almost identical to that without quantization. According to Figs. 2(e) and 1(e), using is merely at the cost of a slight increase in communication complexity compared with . In contrast, it can be seen from Figs. 2(d), 2(b), 1(d) and 1(b) that if one fixes $s_2$, the improvement from choosing a larger $s_1$ is limited. This suggests that it is more worthwhile to invest communication in a finer quantization when broadcasting tensors from the master node to the workers. If one chooses a smaller $s_1$ (which results in a coarse quantization when the workers transfer their local gradients to the master), the noise incurred by the coarse quantization can be reduced by averaging the local gradients received from the workers. However, the noise associated with the tensor broadcast by the master cannot be mitigated.

As illustrated in Figs. 2(f), 1(f), 2(e) and 1(e), the unquantized setting suffers from the highest communication complexity. A slight increase in the quantization level $s_2$ produces a convergence performance similar to the unquantized setting while preserving a low communication complexity.

## 6 Conclusion

In this paper, we developed Quantized Frank-Wolfe, the first general-purpose projection-free and communication-efficient framework for constrained optimization. Along with proposing various quantization schemes, Quantized Frank-Wolfe can address both convex and non-convex optimization settings in stochastic and finite-sum cases. We provided theoretical guarantees on the convergence rate of Quantized Frank-Wolfe