The Frank-Wolfe (FW) algorithm (Frank and Wolfe, 1956), also known as the conditional gradient or projection-free method, has recently received considerable attention in the machine learning community. As it requires no projections while maintaining feasibility throughout its execution, FW is very efficient for various constrained convex (Jaggi, 2013; Garber and Hazan, 2014; Lacoste-Julien and Jaggi, 2015; Garber and Hazan, 2015; Hazan and Luo, 2016; Mokhtari et al., 2018b) and non-convex (Lacoste-Julien, 2016; Reddi et al., 2016) optimization problems. It is known that in many scenarios, projection operations are computationally prohibitive (e.g., projections onto a nuclear norm ball or onto a matroid polytope). To avoid this cost, FW replaces the projection step with solving a linear program.
In order to apply the FW algorithm to large-scale optimization and machine learning problems (e.g., training deep neural networks, SVMs, AdaBoost, experimental design, etc.), parallelization is unavoidable. To this end, distributed FW variants have been proposed for specific problems, e.g., low-rank matrix learning (Zheng et al., 2018) and optimization under a block-separable constraint set (Wang et al., 2016).
A significant performance bottleneck of distributed optimization methods is the cost of communicating gradients, typically handled by using a parameter-server framework. Intuitively, if each worker/processor in the distributed system transmits the entire gradient, then at least $d$ floating-point numbers are communicated by each worker, where $d$ is the dimension of the optimization problem. This communication cost can be a huge burden on the performance of parallel optimization algorithms (Chilimbi et al., 2014; Seide et al., 2014; Strom, 2015). To circumvent this drawback, communication-efficient parallel algorithms have recently received significant attention. One major approach is to quantize/compress the gradients while maintaining sufficient information (De Sa et al., 2015; Abadi et al., 2016; Wen et al., 2017). For the unconstrained optimization setting, where Stochastic Gradient Descent (SGD) does not need to perform any projections, various communication-efficient distributed algorithms have been proposed, including QSGD (Alistarh et al., 2017), SIGN-SGD (Bernstein et al., 2018), and Sparsified-SGD (Stich et al., 2018).
In the constrained setting, and in particular for distributed FW algorithms, communication-efficient versions have only been studied for specific problems such as sparse learning (Bellet et al., 2015; Lafond et al., 2016). In this paper, in contrast, we develop Quantized Frank-Wolfe (QFW), a general communication-efficient distributed FW framework for both convex and non-convex objective functions. We study the performance of QFW in both stochastic and finite-sum optimization settings.
Problem formulation: The focus of this paper is on constrained optimization in two widely recognized settings: 1) stochastic and 2) finite-sum optimization. Let $\mathcal{C}$ denote the constraint set, which is assumed to be convex and compact throughout the paper. In constrained stochastic optimization the goal is to solve the following problem:

$$\min_{x \in \mathcal{C}} F(x) \triangleq \mathbb{E}_{z \sim P}\left[\tilde{F}(x, z)\right], \qquad (1)$$

where $x$ is the optimization variable, $z$ is a random variable drawn from a distribution $P$, and together they determine the choice of a stochastic function $\tilde{F}(x, z)$. In constrained finite-sum optimization we further assume that $P$ is a uniform distribution over $n$ component functions, and the goal is to solve a special case of Problem (1), namely,

$$\min_{x \in \mathcal{C}} F(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} f_i(x). \qquad (2)$$
In parallel settings, we suppose that there is a master and $m$ workers. Each worker maintains a local copy of the decision variable $x$. At every iteration of the stochastic case, each worker has access to independent stochastic gradients of $F$; whereas in the finite-sum case, the objective function can be decomposed as in (2) and each worker has access to the exact gradients of its own subset of the component functions $f_i$. This way the task of computing gradients is divided among the workers. The master node aggregates local gradients from the workers, and sends the aggregated gradients back to them so that each one of them can update the model (i.e., its own iterate) locally. Note that the information that workers send to the master consists of local gradients. Thus, by transmitting quantized gradients, we can reduce the communication complexity (i.e., the number of transmitted bits) significantly. The workflow diagram of a distributed quantization scheme is summarized in Fig. 1. Finally, we should highlight that there is a trade-off between gradient quantization and information flow. Intuitively, more intensive quantization reduces the communication cost, but also loses more information, which may decelerate the convergence rate. The goal of Quantized Frank-Wolfe is to provide a communication-efficient and fast-converging parallel FW algorithm in the stochastic and finite-sum cases, for both constrained convex and non-convex optimization problems.
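To make the communication pattern concrete, the following minimal single-round sketch shows the master-worker exchange of Fig. 1. The deterministic grid-rounding quantizer here is a toy stand-in (the unbiased schemes we actually use are developed in Section 2), and all names are illustrative:

```python
def toy_quantize(g, step=0.25):
    """Toy stand-in quantizer: deterministic rounding to a coarse grid.
    It is biased; the unbiased schemes of Section 2 replace it in practice."""
    return [round(x / step) * step for x in g]

def communication_round(worker_grads):
    """One master-worker exchange: every worker pushes a quantized local
    gradient, the master averages the decoded vectors and broadcasts a
    quantized average, which each worker then uses to update its iterate."""
    pushed = [toy_quantize(g) for g in worker_grads]              # workers -> master
    dim = len(worker_grads[0])
    avg = [sum(p[j] for p in pushed) / len(pushed) for j in range(dim)]
    return toy_quantize(avg)                                     # master -> workers
```

Only the quantized vectors cross the network in either direction, which is where the savings in transmitted bits come from.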
Our contributions: In this paper, we propose two general-purpose quantization schemes that can be readily applied to distributed optimization settings. We then propose Quantized Frank-Wolfe, a distributed framework that handles quantization for constrained convex and non-convex optimization problems in the finite-sum and stochastic cases. It achieves a favorable trade-off between communication complexity and convergence rate in distributed computation. The results are summarized in Table 1.
2 Gradient Quantization Schemes
As mentioned earlier, the communication cost can be reduced effectively by sending quantized gradients. In this section, we propose two quantization schemes with different degrees of compression and information content. Depending on the specific requirements of the optimization task, one might choose one over the other.
2.1 Single-Partition Quantization
Consider the gradient vector $g \in \mathbb{R}^d$ and let $g_i$ be the $i$-th coordinate of the gradient. To transmit the scalar $g_i$, the Sign Encoding Scheme sends the product of the sign of $g_i$ and a properly chosen random variable $\eta_i$, defined as

$$\eta_i = \begin{cases} 1 & \text{with probability } |g_i| / \|g\|_\infty, \\ 0 & \text{otherwise}, \end{cases} \qquad (3)$$

where $\|g\|_\infty$ is the max-norm of $g$. Note that the product of $\mathrm{sign}(g_i)$ and $\eta_i$ belongs to the set $\{-1, 0, +1\}$ and can be transmitted using two bits. On the receiver side, given access to the norm $\|g\|_\infty$, one can recover the scalar $g_i$ (in expectation) by computing $\|g\|_\infty \cdot \mathrm{sign}(g_i)\,\eta_i$, since we have $\mathbb{E}\left[\|g\|_\infty \cdot \mathrm{sign}(g_i)\,\eta_i\right] = g_i$. According to this observation, for each coordinate $i$, the Sign Encoding Scheme only needs to communicate the encoded scalar $\mathrm{sign}(g_i)\,\eta_i$, alongside the norm $\|g\|_\infty$. Therefore, if we define $\mathrm{sign}(g)$ as the vector containing the $\mathrm{sign}(g_i)$'s, and $\eta$ as the vector containing the random variables $\eta_i$, the Sign Encoding Scheme is the tuple $\left(\|g\|_\infty,\ \mathrm{sign}(g) \circ \eta\right)$, where $\circ$ is the Hadamard product operator. Similarly, the corresponding decoding scheme is $\|g\|_\infty \cdot \mathrm{sign}(g) \circ \eta$, which is an unbiased estimator of the gradient vector.
Since transmitting each element of the vector $\mathrm{sign}(g) \circ \eta$ requires two bits, the total number of communicated bits for it is $2d$. Assuming that sending the norm $\|g\|_\infty$, which is a scalar, requires 32 bits, the overall number of bits communicated by the Sign Encoding Scheme per worker is $2d + 32$ per round. Finally, note that the elements of the decoded vector can only be $-\|g\|_\infty$, $0$, or $+\|g\|_\infty$. Thus, intuitively, the Sign Encoding Scheme compresses the gradient intensively, and may lead to a loss of information. In the following lemma we formally characterize this loss in terms of the variance of the Sign Encoding Scheme.
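The encoder and decoder just described can be sketched as follows. This is a minimal single-vector sketch, assuming the transmitted norm is the max-norm of $g$ (any norm that upper-bounds every $|g_i|$ yields valid probabilities):

```python
import math
import random

def sign_encode(g, rnd=random):
    """Sign Encoding: each coordinate becomes -1, 0, or +1 (two bits each),
    plus one 32-bit scalar norm, so roughly 2d + 32 bits travel per gradient."""
    norm = max((abs(x) for x in g), default=0.0)   # max-norm of g
    if norm == 0.0:
        return [0.0] * len(g), 0.0
    # eta_i = 1 with probability |g_i| / ||g||, else 0
    q = [math.copysign(1.0, x) if rnd.random() < abs(x) / norm else 0.0
         for x in g]
    return q, norm

def sign_decode(q, norm):
    # unbiased: E[norm * q_i] = norm * sign(g_i) * |g_i| / norm = g_i
    return [norm * x for x in q]
```

Averaging many independent encode/decode round trips recovers the original vector, which is the unbiasedness property used throughout the analysis.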
Lemma 1 (Proof in Appendix A).
For any input vector $g \in \mathbb{R}^d$, the variance of the Sign Encoding Scheme is given by

$$\sum_{i=1}^{d} |g_i| \left( \|g\|_\infty - |g_i| \right) = \|g\|_\infty \|g\|_1 - \|g\|_2^2.$$
Lemma 1 implies that if the absolute values of the elements of the vector $g$ are in the same range (i.e., its energy is divided among its elements in a balanced way), then the variance of the Sign Encoding Scheme is small. For instance, if all the elements of the vector $g$ are equal to each other, then the variance is zero. Conversely, if the absolute values of a few elements of $g$ are significantly larger than the rest, then the variance becomes large. For instance, if one element of $g$ equals $\sqrt{d}$ and the remaining ones all equal $1$, then the variance is of order $d^{3/2}$.
For the probability distribution of the random variable $\eta_i$, instead of the max-norm $\|g\|_\infty$, we can use other $\ell_p$ norms (where $p \geq 1$). However, it can be verified that the max-norm leads to the smallest variance for the Sign Encoding Scheme.
2.2 Multi-Partition Quantization
Now we focus on the Multi-Partition Quantization Scheme, which has a lower variance compared to the Sign Encoding Scheme, but at the cost of sending more bits in each round of communication. Unlike the Sign Encoding Scheme, which encodes each element into a scalar from $\{-1, 0, +1\}$, the s-Partition Encoding Scheme encodes each element into an element of the set $\{0, \pm 1/s, \pm 2/s, \dots, \pm 1\}$. To transmit the $i$-th element of the gradient vector $g$, the s-Partition Encoding Scheme first computes the ratio $|g_i| / \|g\|$ and finds the index $l \in \{0, 1, \dots, s-1\}$ for which the following inequality is satisfied:

$$\frac{l}{s} \leq \frac{|g_i|}{\|g\|} \leq \frac{l+1}{s}. \qquad (4)$$

After finding $l$, we define the random variable $\eta_i$ by the following probability distribution:

$$\eta_i = \begin{cases} \dfrac{l+1}{s} & \text{with probability } \dfrac{|g_i|}{\|g\|}\, s - l, \\[4pt] \dfrac{l}{s} & \text{otherwise}. \end{cases} \qquad (5)$$

Then, instead of transmitting $g_i$, the s-Partition Encoding Scheme sends the product of $\mathrm{sign}(g_i)$ and the random variable $\eta_i$. It can be verified that $\mathbb{E}[\eta_i] = |g_i| / \|g\|$. So we define the corresponding decoding scheme as $\|g\| \cdot \mathrm{sign}(g_i)\,\eta_i$ to ensure that it is an unbiased estimator of $g_i$.
Intuitively, in the s-Partition Encoding Scheme, we partition the interval $[0, 1]$ into $s$ parts of equal length, find the specific sub-interval in which the ratio $|g_i| / \|g\|$ falls, and estimate the ratio by one of the two end points of that sub-interval at random. The probability of each end point is chosen according to (5) to make sure that the estimate is unbiased. Note that the output of the Sign Encoding Scheme can only take values from the set $\{-1, 0, +1\}$. Hence, the Sign Encoding Scheme considers only the single interval $[0, 1]$ and estimates the ratio by one of its two end points, namely, $0$ and $1$. This observation implies that the Sign Encoding Scheme can be interpreted as a single-partition quantization.
In Multi-Partition Quantization, for each coordinate $i$, we need 1 bit to transmit $\mathrm{sign}(g_i)$. Moreover, since $\eta_i \in \{0, 1/s, \dots, 1\}$, we need $\lceil \log_2(s+1) \rceil$ bits to send $\eta_i$. Finally, we need 32 bits to transmit the norm $\|g\|$. Hence, the total number of communicated bits is $d \left(1 + \lceil \log_2(s+1) \rceil\right) + 32$.
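A sketch of the s-Partition encoder and its decoder follows, assuming (as an illustration, consistent with the single-partition case) that the transmitted norm is the max-norm:

```python
import random

def partition_encode(g, s, rnd=random):
    """s-Partition Encoding: |g_i|/||g|| lands in some [l/s, (l+1)/s] and is
    rounded to one of the two endpoints at random so the estimate is unbiased.
    Per coordinate: 1 sign bit + ceil(log2(s+1)) bits for the level."""
    norm = max((abs(x) for x in g), default=0.0)   # max-norm of g
    signs = [1.0 if x >= 0 else -1.0 for x in g]
    if norm == 0.0:
        return signs, [0] * len(g), 0.0
    levels = []
    for x in g:
        r = abs(x) / norm                # ratio in [0, 1]
        l = min(int(r * s), s - 1)       # sub-interval [l/s, (l+1)/s]
        p_up = r * s - l                 # P(round up) makes E[level/s] = r
        levels.append(l + 1 if rnd.random() < p_up else l)
    return signs, levels, norm

def partition_decode(signs, levels, s, norm):
    # unbiased estimator of g: E[sign_i * (level_i / s) * norm] = g_i
    return [sg * (lv / s) * norm for sg, lv in zip(signs, levels)]
```

With $s = 1$ this reduces to the Sign Encoding Scheme; increasing $s$ shrinks the rounding intervals and hence the variance, at the price of more bits per coordinate.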
One major advantage of the $s$-Partition Encoding Scheme is that by tuning the partition parameter $s$, or the corresponding number of assigned bits, we can smoothly control the trade-off between gradient quantization and information loss, which helps distributed algorithms attain their best performance. In the following lemma, we formally characterize the variance of the $s$-Partition Encoding Scheme and highlight the accuracy vs. communication cost trade-off.
Lemma 2 (Proof in Appendix B).
The variance of the $s$-Partition Encoding Scheme for any vector $g \in \mathbb{R}^d$ is bounded by

$$\mathbb{E}\left[ \left\| \|g\| \cdot \mathrm{sign}(g) \circ \eta - g \right\|^2 \right] \leq \frac{d\, \|g\|^2}{4 s^2}.$$
Lemma 2 demonstrates the trade-off between the quantization error and the communication cost for the $s$-Partition Encoding Scheme. In a nutshell, for larger choices of $s$ the variance is smaller, but the communication cost is higher.
3 Stochastic Optimization
In this section, we aim to solve the constrained stochastic optimization problem defined in (1) in a distributed fashion. In particular, we are interested in projection-free (Frank-Wolfe type) methods and apply quantization to reduce the communication cost between the workers and the master. Recall that at each round, each worker has access to an unbiased estimator of the gradient of the objective function. We further assume that these stochastic gradients are independent of each other.
As shown in Fig. 1, the workflow of our proposed algorithm is easy to understand. At each iteration, each worker first computes its local stochastic gradient. Then, it encodes this gradient, so that it is quantized and can be transmitted at a low communication cost, and sends it to the master. Once the master receives all the encoded stochastic gradients, it uses the corresponding decoding scheme to recover the decoded versions of the received signals. Indeed, by design, each of the decoded signals is an unbiased estimator of the gradient of the objective function. The master then evaluates the average of the decoded signals. After applying a proper quantization scheme, the master broadcasts the encoded average to all the workers. The workers decode the received signal and use the resulting vector to improve their local stochastic gradient approximation.
Note that the decoded average is an unbiased estimator of the gradient $\nabla F(x_t)$. If we ignore the influence of quantization, it has a lower variance compared to each local vector, as its computation incorporates the information of $m$ stochastic gradients. Still, if we use it in place of the actual but unavailable gradient $\nabla F(x_t)$, Frank-Wolfe may diverge (Mokhtari et al., 2018b). To overcome this issue, we need to further reduce the variance caused by quantization. To do so, each worker uses a momentum-averaged local vector $\bar{g}_t$ to update the iterates, defined by

$$\bar{g}_t = (1 - \rho_t)\, \bar{g}_{t-1} + \rho_t\, \tilde{g}_t, \qquad (7)$$

where $\tilde{g}_t$ denotes the decoded network average of the quantized stochastic gradients and $\rho_t$ is a momentum parameter.
As the decoded vectors are identical across the workers, if they all initialize the sequence $\bar{g}_0$ in the same way, the local vectors $\bar{g}_t$ of all the workers coincide at every iteration. As the update of $\bar{g}_t$ in (7) computes a weighted average of the previous stochastic gradient approximation $\bar{g}_{t-1}$ and the updated network average stochastic gradient $\tilde{g}_t$, it has a lower variance compared to the vector $\tilde{g}_t$ (note, however, that it is not an unbiased estimator of $\nabla F(x_t)$). The key fact that allows us to prove convergence is that the estimation error $\|\bar{g}_t - \nabla F(x_t)\|$ approaches zero as time passes, which is formally characterized in Lemma 3 in Appendix C.
After computing the vector $\bar{g}_t$ based on the update in (7), the workers can update their variables in the standard Frank-Wolfe manner, i.e., $x_{t+1} = x_t + \gamma_t (v_t - x_t)$, where the vector $v_t$ is defined as $v_t = \arg\min_{v \in \mathcal{C}} \langle v, \bar{g}_t \rangle$. Similar to the argument above, if the iterates of all the workers are initialized at the same point $x_1 \in \mathcal{C}$, then the local variables of all the workers are identical at every iteration. Note that the update of $x_t$ differs slightly from that of the Frank-Wolfe method, as the exact but unavailable gradient $\nabla F(x_t)$ is replaced by its stochastic approximation $\bar{g}_t$. The full description of our proposed Stochastic Quantized Frank-Wolfe is outlined in Algorithm 1. Finally, note that we can use different quantization schemes in Algorithm 1, which lead to different convergence rates and communication costs. We explore their effects empirically in our experiments.
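As a sanity check, the momentum update (7) and the Frank-Wolfe step can be sketched on a single machine. Quantization and the master-worker exchange are omitted to isolate the momentum logic; the $\ell_1$-ball constraint, the quadratic objective in the test, and the step-size schedules are illustrative assumptions rather than the choices from our theorems:

```python
import math

def lmo_l1(g, radius):
    """Linear minimization oracle for the l1-ball: argmin_{||v||_1 <= r} <g, v>
    puts all mass on the coordinate with the largest |g_i|, with opposite sign."""
    i = max(range(len(g)), key=lambda j: abs(g[j]))
    v = [0.0] * len(g)
    v[i] = -radius * math.copysign(1.0, g[i])
    return v

def stochastic_fw_momentum(grad_oracle, dim, radius, T):
    """Single-machine sketch of update (7) followed by the FW step.
    The schedules below are illustrative, not those of the theorems."""
    x, gbar = [0.0] * dim, [0.0] * dim
    for t in range(1, T + 1):
        rho = min(1.0, 4.0 / math.sqrt(t))   # momentum weight, decays to 0
        gamma = 2.0 / (t + 2)                # classical FW step size
        g = grad_oracle(x)                   # stands in for the decoded average
        gbar = [(1 - rho) * bg + rho * gi for bg, gi in zip(gbar, g)]
        v = lmo_l1(gbar, radius)
        x = [xi + gamma * (vi - xi) for xi, vi in zip(x, v)]
    return x
```

Because each iterate is a convex combination of feasible points, every $x_t$ stays in the constraint set without any projection, which is exactly the appeal of FW-type methods.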
3.1 Convex Optimization
In this subsection, we focus on the convergence rate of Stochastic Quantized Frank-Wolfe when applied to convex objective functions. To do so, we first make the following assumptions on the constraint set $\mathcal{C}$, the objective function $F$, the local stochastic gradients, and the quantization scheme.
The constraint set $\mathcal{C}$ is convex and compact. We denote its diameter by $D$.
The objective function $F$ is convex, bounded, and $L$-smooth over $\mathcal{C}$.
For each worker and each iteration, the stochastic gradients are unbiased and have a uniformly bounded variance.
For any iteration and any gradient vector generated by Stochastic Quantized Frank-Wolfe, the quantization scheme is unbiased and has bounded variance.
Under the above assumptions, the following theorem characterizes the convergence rate of Stochastic Quantized Frank-Wolfe.
Theorem 1 (Proof in Appendix D).
Theorem 1 shows that the suboptimality gap of Stochastic Quantized Frank-Wolfe converges to zero at a sublinear rate. In other words, after a bounded number of iterations we can find a solution that is arbitrarily close to the optimum. Next, we incorporate the concrete Sign Encoding Scheme into Stochastic Quantized Frank-Wolfe. We first need the following assumption on the stochastic gradients.
The stochastic gradients have uniformly bounded norms.
Corollary 1 (Proof in Appendix E).
The idea of the proof is straightforward: to apply Theorem 1, we only need to compute the quantization variance of the specific scheme, after which the rate follows from Theorem 1 directly. Considering the fact that each round of communication in the Sign Encoding Scheme requires $2d + 32$ bits, the overall communication cost needed to find an $\epsilon$-suboptimal solution follows immediately.
3.2 Non-Convex Optimization
With slightly different parameters, Stochastic Quantized Frank-Wolfe can be applied to non-convex settings as well. In unconstrained non-convex optimization problems, the gradient norm $\|\nabla F(x)\|$ is usually a good measure of convergence, as $\|\nabla F(x)\| \to 0$ implies convergence to a stationary point. However, in the constrained setting it is not a good benchmark, and instead we need to look at the Frank-Wolfe gap (Jaggi, 2013; Lacoste-Julien, 2016), defined as

$$\mathcal{G}(x) \triangleq \max_{v \in \mathcal{C}} \, \langle v - x, \, -\nabla F(x) \rangle.$$

For the constrained optimization problem (1), if a point $x$ satisfies $\mathcal{G}(x) = 0$, then it is a first-order stationary point. Also, by definition, we have $\mathcal{G}(x) \geq 0$ for all $x \in \mathcal{C}$.
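For a concrete constraint set, the gap can be computed with a single call to the linear minimization oracle, since the maximizing $v$ is exactly the LMO output for $\nabla F(x)$. The $\ell_1$-ball oracle below is an illustrative assumption:

```python
import math

def lmo_l1(g, radius):
    # argmin over the l1-ball of <g, v>: a single signed vertex
    i = max(range(len(g)), key=lambda j: abs(g[j]))
    v = [0.0] * len(g)
    v[i] = -radius * math.copysign(1.0, g[i])
    return v

def fw_gap(grad, x, lmo):
    """Frank-Wolfe gap G(x) = max_{v in C} <grad, x - v>: nonnegative for
    feasible x, and zero exactly at first-order stationary points."""
    v = lmo(grad)  # the maximizer over v is the LMO minimizer of <grad, v>
    return sum(gi * (xi - vi) for gi, xi, vi in zip(grad, x, v))
```

For example, over the unit $\ell_1$-ball with gradient $(1, 0)$, the point $(-1, 0)$ has zero gap (it is the constrained minimizer of the linearization), while the origin has a strictly positive gap.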
We will analyze the convergence rate of Algorithm 1 based on the following assumption on the objective function $F$.
The objective function $F$ is bounded and $L$-smooth over $\mathcal{C}$.
Theorem 2 (Proof in Appendix F).
In other words, Theorem 2 indicates that in the non-convex setting, Stochastic Quantized Frank-Wolfe finds an $\epsilon$-first-order stationary point after a bounded number of iterations. This result, combined with the concrete Sign Encoding Scheme, leads to the following corollary.
By using the Sign Encoding Scheme, each round of communication requires $2d + 32$ bits. Therefore, Corollary 2 bounds the number of rounds, and hence the overall communication cost, needed to find an $\epsilon$-first-order stationary point.
4 Finite-Sum Optimization
In this section, we focus on the finite-sum problem (2), where we assume that there are $n$ functions in total and the task of computing their gradients is divided among the $m$ workers. The major difference from the stochastic setting is that we can use a more aggressive variance reduction for communicating quantized gradients. More specifically, we adopt the Stochastic Path-Integrated Differential Estimator (SPIDER) technique, first introduced in (Fang et al., 2018) for unconstrained optimization in centralized settings, and properly generalize it to the constrained and distributed settings.
To do so, let us define a period parameter $K$. At the beginning of each period, namely, when $t \bmod K = 0$, each worker computes the full average of its local gradients and sends it to the master; the master calculates the average of these signals, i.e., the average of the gradients of all the component functions, and broadcasts it to all the workers; the workers then set the gradient estimate to

$$g_t = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_t).$$

Note that $g_t$ is identical for all the workers. In the remaining iterations of the period, i.e., when $t \bmod K \neq 0$, each worker samples a subset of its local component functions uniformly at random, computes the average of the corresponding gradient differences between the current and previous iterates, and sends it to the master; the master averages these signals and broadcasts the result to all the workers; the workers then update the gradient estimate as follows:

$$g_t = g_{t-1} + \frac{1}{|\mathcal{S}_t|} \sum_{i \in \mathcal{S}_t} \left( \nabla f_i(x_t) - \nabla f_i(x_{t-1}) \right),$$

where $\mathcal{S}_t$ denotes the overall set of sampled component functions.
So $g_t$ is still identical for all the workers. In order to incorporate quantization, each worker simply pushes the quantized version of its average; the master decodes the received quantizations, encodes the average of the decoded signals in a quantized fashion, and broadcasts it. Finally, each worker decodes the quantized signal and updates $g_t$ locally. The full description of our proposed Finite-Sum Quantized Frank-Wolfe algorithm is outlined in Algorithm 2.
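On a single machine, and with quantization and the master averaging stripped away, the SPIDER-style estimator above can be sketched as follows (the function names and the with-replacement sampling are illustrative assumptions):

```python
import random

def spider_gradients(grad_i, n, K, batch, xs, rnd=random):
    """Single-machine sketch of the SPIDER-style estimator: grad_i(i, x)
    returns the gradient of the i-th component at x, and xs is the sequence
    of iterates. Returns the gradient estimate g_t at every iterate."""
    g, out = None, []
    for t, x in enumerate(xs):
        if t % K == 0:
            # checkpoint: exact average gradient over all n components
            vecs = [grad_i(i, x) for i in range(n)]
            g = [sum(v[j] for v in vecs) / n for j in range(len(x))]
        else:
            # cheap step: correct the running estimate with sampled differences
            S = [rnd.randrange(n) for _ in range(batch)]
            prev = xs[t - 1]
            for j in range(len(x)):
                d = sum(grad_i(i, x)[j] - grad_i(i, prev)[j] for i in S) / batch
                g[j] += d
        out.append(list(g))
    return out
```

A convenient check: for quadratic components $f_i(x) = \tfrac{1}{2}\|x - a_i\|^2$, every gradient difference $\nabla f_i(x_t) - \nabla f_i(x_{t-1}) = x_t - x_{t-1}$ is independent of $i$, so the estimator reproduces the exact full gradient regardless of which indices are sampled.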
Compared with Stochastic Quantized Frank-Wolfe, one advantage of Finite-Sum Quantized Frank-Wolfe is that we can use different quantization schemes at different iterations, which makes Algorithm 2 more flexible in solving various optimization problems. We explore these effects empirically in our experiments.
Finally, note again that FW is very sensitive to the accuracy of the gradients. Nevertheless, we give strong theoretical guarantees on the convergence rate of Finite-Sum Quantized Frank-Wolfe for both convex and non-convex settings, while using quantized and local gradients distributed over $m$ machines.
4.1 Convex Optimization
To analyze the convex case, we first make an assumption on the component functions.
The component functions $f_i$ are convex, $L$-smooth on $\mathcal{C}$, and uniformly bounded.
Since the functions $f_i$ are all bounded and $L$-smooth on the compact constraint set $\mathcal{C}$, their gradients are also bounded on $\mathcal{C}$. Moreover, as there are only finitely many component functions, there is always a uniform bound on their gradient norms. For simplicity of analysis, we assume an explicit upper bound on the gradient norms in the following theorem, but this is a direct implication of the other assumptions.
Theorem 3 (Proof in Appendix H).
Theorem 3 indicates that in the convex setting, if we use the recommended quantization schemes, then the output of Finite-Sum Quantized Frank-Wolfe is $\epsilon$-suboptimal within a bounded number of rounds; the theorem correspondingly bounds the Linear-optimization Oracle (LO) complexity, the total Incremental First-order Oracle (IFO) complexity, and the average number of communicated bits per round.
Similar to the stochastic case, the key part of our analysis is to bound the gradient estimation error, which is addressed in Lemma 4 in Appendix G. Also, since finite-sum optimization can be regarded as a special case of stochastic optimization, the required recursive inequality can be derived directly from the proof of Theorem 1. Combining these two ingredients yields Theorem 3.
4.2 Non-convex Optimization
Algorithm 2 can also be applied to the non-convex setting with a slight change of parameters. We first make a standard assumption on the component functions.
The component functions $f_i$ are $L$-smooth on $\mathcal{C}$ and uniformly bounded.
Theorem 4 (Proof in Appendix I).
5 Experiments
We evaluate the performance of the algorithms from two aspects: first, how the loss (the objective function) changes with an increasing number of epochs; and second, the number of bits (i.e., the communication complexity) that the master and worker nodes exchange per iteration.
We use the MNIST dataset and consider a convex model and a non-convex model. The convex model consists of a two-layer fully connected neural network with no hidden layer. The output layer has 10 neurons and the log loss for multiclass classification is used; this model is equivalent to multinomial logistic regression. The non-convex model adds two hidden layers with 10 neurons each. The constraint is that the norm of the model parameters is at most a fixed radius.
For both the convex and non-convex models, and in both the stochastic and finite-sum settings, we vary the quantization level and compute the loss after each epoch. Additionally, we compute the average number of bits exchanged by the master and the worker nodes per iteration in order to quantify the communication complexity. A total of 20 workers is used. In the stochastic setting, each batch of a worker contains 500 images. In the finite-sum setting, each sample of a worker contains 100 images.
The performance of the convex and non-convex models is quantified by the log loss and the average Frank-Wolfe gap, respectively. Recall that according to Theorems 1-4, the output for the convex model is the final iterate, while the output for the non-convex model is chosen uniformly at random from all iterates, so its expected Frank-Wolfe gap equals the average of the gaps over the iterates. Therefore, for the non-convex model, we plot the average Frank-Wolfe gap up to the current epoch. The results for the convex model are presented in Fig. 2, while the results for the non-convex model are presented in Fig. 3. In both figures, the subfigures in the first row (Figs. 1(a) and 1(b) and Figs. 2(a) and 2(b)) show loss vs. epoch in the stochastic setting. Those in the second row (Figs. 1(c) and 1(d) and Figs. 2(c) and 2(d)) show loss vs. epoch in the finite-sum setting. The third row shows the communication complexity in both settings.
For the highest quantization level, the tensors are transferred in their raw form without any quantization. The quantization level "thm" denotes that the partition parameters are chosen according to Theorem 3.
Recall that one quantization level governs how the workers send their local gradients to the master, while another governs how the master broadcasts the averaged tensor back to all workers. Figs. 2(a), 2(c), 1(a) and 1(c) show how the loss changes if we fix the worker-to-master level and vary the master-to-worker level. We observe that increasing the master-to-worker level improves the convergence performance significantly. In Figs. 2(a) and 1(a), the largest tested level achieves a performance similar to the situation where all tensors are transferred in their raw form without any quantization. Similarly, in Figs. 2(c) and 1(c), using the levels recommended by Theorems 3 and 4 results in performance almost identical to that without quantization. According to Figs. 2(e) and 1(e), this comes merely at the cost of a slight increase in communication complexity. In contrast, it can be seen from Figs. 2(b), 2(d), 1(b) and 1(d) that if one fixes the master-to-worker level, the improvement from choosing a larger worker-to-master level is limited. This suggests that it is more worthwhile to invest communication budget in a finer quantization when broadcasting tensors from the master node to the workers. If one chooses a coarse quantization when the workers transfer their local gradients to the master, the noise it incurs can be reduced by averaging the local gradients received from the workers; however, the noise associated with the tensor broadcast by the master cannot be mitigated in this way.
In this paper, we developed Quantized Frank-Wolfe, the first general-purpose projection-free and communication-efficient framework for constrained optimization. Along with proposing various quantization schemes, Quantized Frank-Wolfe can address both convex and non-convex optimization settings in stochastic and finite-sum cases. We provided theoretical guarantees on the convergence rate of Quantized Frank-Wolfe and validated its efficiency empirically by training multinomial logistic regression models and neural networks. Our theoretical results highlighted the importance of variance reduction techniques for stabilizing Frank-Wolfe and achieving a favorable trade-off between communication complexity and convergence rate in distributed settings.
- Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
- Alistarh et al. (2017) Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
- Bellet et al. (2015) Aurélien Bellet, Yingyu Liang, Alireza Bagheri Garakani, Maria-Florina Balcan, and Fei Sha. A distributed frank-wolfe algorithm for communication-efficient sparse learning. In