
It is well known that many optimization methods, including SGD, SAGA, and Accelerated SGD for over-parameterized models, do not scale linearly in the parallel setting. In this paper, we present a new version of block coordinate descent that solves this issue for a number of methods. The core idea is to make the sampling of coordinate blocks on each parallel unit independent of the others. Surprisingly, we prove that the optimal number of blocks to be updated by each of n units in every iteration is equal to m/n, where m is the total number of blocks. As an illustration, this means that when n=100 parallel units are used, 99% of work is a waste of time. We demonstrate that with m/n blocks used by each unit the iteration complexity often remains the same. Among other applications which we mention, this fact can be exploited in the setting of distributed optimization to break the communication bottleneck. Our claims are justified by numerical experiments which demonstrate almost a perfect match with our theory on a number of datasets.


1 Introduction

In this work we are concerned with parallel/distributed algorithms for solving finite-sum minimization problems of the form

$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$

where each $f_i$ is convex and smooth. In particular, we are interested in methods which employ $n$ parallel units/workers/nodes/processors, each of which has access to a single function $f_i$ and its gradients (or unbiased estimators thereof). Let $x^*$ be an optimal solution of (1). In practical parallel or distributed scenarios, $f_i$ is often of the form

$$f_i(x) = \mathbb{E}_{\xi \sim \mathcal{D}_i}\left[ f_{\xi}(x) \right], \qquad (2)$$

where the expectation is with respect to a distribution $\mathcal{D}_i$ of training examples stored locally at machine $i$. More typically, however, each machine contains a very large but finite number of examples (for simplicity, say there are $l$ examples on each machine), and $f_i$ is of the form

$$f_i(x) = \frac{1}{l} \sum_{j=1}^{l} f_{ij}(x). \qquad (3)$$

In the rest of this section we provide some basic motivation and intuition in support of our approach. To this purpose, assume, for simplicity of exposition, that $f_i$ is of the finite-sum form (3). In typical modern machine learning workloads, the number of machines $n$ is much smaller than the number of data points $l$ on each machine. In a large scale regime (i.e., when the model size $d$, the number of data points, or both are large), problem (1) needs to be solved by a combination of efficient methods and modern hardware. In recent years there has been a lot of progress in designing new algorithms for solving this problem using techniques such as stochastic approximation Robbins & Monro (1985), variance reduction Schmidt et al. (2017); Johnson & Zhang (2013); Defazio et al. (2014), coordinate descent Nesterov (2012); Richtárik & Takáč (2014); Wright (2015) and acceleration Nesterov (1983), resulting in excellent theoretical and practical performance.

The computational power of hardware is increasing as well. In recent years, a very significant part of this increase has been due to parallelism. Since many methods, such as minibatch Stochastic Gradient Descent (SGD), are embarrassingly parallel, it is very simple to use them in big data applications. However, it has been observed in practice that adding more resources beyond a certain limit does not improve iteration complexity significantly. Moreover, having more parallel units makes their synchronization harder due to the so-called communication bottleneck. Minibatch versions of most variance reduced methods, such as SAGA Defazio et al. (2014) or SVRG Johnson & Zhang (2013), scale even worse in the parallel setting: they do not guarantee, in the worst case, any speedup from using more than one function at a time. Unfortunately, numerical experiments show that this is not a flaw of the analysis, but rather a real property of these methods Gower et al. (2018). (We shall mention that there are already a few variance reduced methods that scale, up to some level, linearly in a parallel setup: Quartz for sparse data Qu et al. (2015), Katyusha Allen-Zhu (2017), or SAGA/SVRG/SARAH with importance sampling for non-convex problems Horváth & Richtárik (2018).)

In this paper, we demonstrate that a simple trick of independent block sampling can remedy the scaling problem. To illustrate one of the key insights of our paper on a simple example, consider a thought experiment in which GD is the baseline method we want to improve upon.

1.1 From gradient descent to block coordinate descent and back

A simple benchmark in the distributed setting is a parallel implementation of gradient descent (GD). GD arises as a special case of the more general class of block coordinate descent (BCD) methods Richtárik & Takáč (2016). The conventional way to run BCD for problem (1) is to update a single block or several blocks of $x$ (we assume the entries of $x$ are partitioned into several non-overlapping blocks), chosen at random, on all machines Richtárik & Takáč (2016); Fercoq & Richtárik (2015), followed by an update aggregation step. Such updates on each worker typically involve a gradient step on the subspace corresponding to the selected blocks. Importantly, and this is a key structural property of BCD methods, the same set of blocks is updated on each machine. If communication is expensive, it often makes sense to do more work on each machine, which in the context of BCD means updating more blocks (it is possible to consider higher-order block updates as well, but this is beyond the scope of this work). A particular special case is to always update all blocks, which leads to a parallel implementation of GD for problem (1), as alluded to above. Moreover, it is known that the theoretical iteration complexity of BCD improves as the number of updated blocks increases Richtárik & Takáč (2016); Qu & Richtárik (2016a, b). For these and similar reasons, GD (or one of its variants, such as GD with momentum) is often preferred to BCD. Having said that, we did not describe BCD only to discard it at this point; we shall soon return to it, albeit with a twist.

1.2 From gradient descent to independent block coordinate descent

Because of what we just said, the iteration complexity of GD will not improve by running any variant of BCD; it can only get worse. Despite this, we propose to run BCD, but a new variant which allows each worker to sample an independent subset of blocks instead. This variant of BCD for (1) has not been considered before. As we shall show, our independent sampling approach leads to a better-behaved aggregated gradient estimator when compared to that of BCD, which in turn leads to a better overall iteration complexity. We call our method independent block coordinate descent (IBCD). We provide a unified analysis of our method, allowing for a random subset of $\tau m$ out of a total of $m$ blocks to be sampled on each machine, independently from the other machines. GD arises as a special case of this method by setting $\tau = 1$. However, as we show (see Corollary 1), the same iteration complexity guarantee can be obtained by choosing $\tau$ as low as $\frac{1}{n}$. The immediate consequence of this result is that GD is suboptimal in terms of communication complexity. Indeed, GD needs to communicate all $m$ blocks per machine, while IBCD achieves the same rate with only $m/n$ blocks per machine. Coming back to the abstract, consider an example with $n = 100$ machines. In this case, when compared to GD, IBCD communicates only 1% of the data. Because the iteration complexities of the two methods are the same, and if the communication cost is dominant, this means that the problem can be solved in just 1% of the time. In contrast, and when compared to the potential of IBCD, a parallel implementation of GD inevitably wastes 99% of the time.

The intuition behind why our approach works lies in the law of large numbers. By averaging independent noise we reduce the total variance of the resulting estimator by a factor of $n$. If, however, the noise is already tiny, as in non-accelerated variance reduced methods, there is no improvement. On the other hand, (uniform) block coordinate descent (CD) has variance proportional to $1-\tau$, where $\tau$ is the ratio of used blocks. Therefore, after the averaging step the variance is of order $\frac{1-\tau}{n}$, which illustrates why setting any $\tau > \frac{1}{n}$ should not yield a significant speedup when compared to the choice $\tau = \frac{1}{n}$. It also indicates that it should be possible to throw away a $\left(1-\frac{1}{n}\right)$ fraction of blocks while keeping the same convergence rate.
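To see this effect numerically, here is a minimal NumPy simulation (not from the paper; the dimension, the number of workers, and the stand-in gradient are arbitrary) comparing an average of $n$ independently sparsified copies of a gradient with a single shared-block sparsification. Both estimators have mean $\tau \nabla f(x)$; the averaged independent one deviates from this mean roughly $\sqrt{n}$ times less.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, tau = 1000, 100, 0.01           # dimension, number of workers, fraction of coordinates kept
k = int(tau * d)                      # coordinates sampled per worker

def sparsify(g, rng):
    """Keep a uniformly sampled subset of k coordinates of g, zero out the rest."""
    idx = rng.choice(g.size, size=k, replace=False)
    out = np.zeros_like(g)
    out[idx] = g[idx]
    return out

g = rng.standard_normal(d)            # a fixed stand-in for the true gradient
target = tau * g                      # both estimators below have this expectation

independent = np.mean([sparsify(g, rng) for _ in range(n)], axis=0)  # IBCD-style averaging
shared = sparsify(g, rng)             # standard parallel CD: every worker keeps the same blocks

print("deviation, independent blocks:", np.linalg.norm(independent - target))
print("deviation, shared blocks:     ", np.linalg.norm(shared - target))
```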

1.3 Beyond gradient descent and further contributions

The goal of the above discussion was to introduce one of the key ideas of this paper in a gentle way. However, our independent sampling idea has consequences well beyond the realm of GD, as we show in the rest of the paper. Let us summarize the contributions here:

  • We show that the independent sampling idea can be coupled with variance reduction/SAGA (see Section 4), SGD for problem (1)+(2) (see Section 5), acceleration (under mild assumption on stochastic gradients; see Section 6) and regularization/SEGA (see Section 7). We call the new methods ISAGA, ISGD, IASGD and ISEGA, respectively.

  • We present two versions of the SAGA algorithm coupled with IBCD. The first one is for a distributed setting, where each machine owns a subset of the data and runs a SAGA iteration with block sampling locally, followed by aggregation. The second version is for a shared data setting, where each machine has access to all functions. This allows for linear convergence even if $\nabla f_i(x^*) \neq 0$.

  • We show that when combined with IBCD, the SEGA trick Hanzely et al. (2018) leads to a method that enjoys a linear rate for problems where $\nabla f_i(x^*) \neq 0$ and allows for more general objectives which may include a non-separable non-smooth regularizer.

2 Practical Implications and Limitations

In this section we outline some further practical implications and point to two key limitations of our framework.

2.1 Practical implications

The main body of this work focuses on theoretical analysis and on verifying our claims via experiments. However, there are several straightforward and important applications of our technique.

Distributed synchronous learning.

A common way to run a distributed optimization method is to perform a local update, communicate the result to a parameter server using a 'reduce' operation, and inform all workers using 'broadcast'. Typically, if the number of workers is significantly large, the bottleneck of such a system is communication. In particular, the 'reduce' operation takes much more time than 'broadcast', as it requires adding up $n$ different vectors computed locally, while 'broadcast' sends the same data to all workers. Nevertheless, if every worker can instead send to the parameter server only a $\frac{1}{n}$ fraction of the $d$-dimensional update, essentially the server node will receive just one full $d$-dimensional vector, and thus our approach can compete against methods like QSGD Alistarh et al. (2017), signSGD Bernstein et al. (2018), TernGrad Wen et al. (2017), DGC Lin et al. (2017) or ATOMO Wang et al. (2018). In fact, this may completely remove the communication bottleneck. In this work, we focus on this direction.
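The following mpi4py sketch (not the paper's code; the toy objective, stepsize, and problem sizes are made up) illustrates the communication pattern: each worker ships only the blocks it sampled, and rank 0 plays the parameter server.

```python
# Run with e.g.: mpiexec -n 4 python ibcd_sync_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, n = comm.Get_rank(), comm.Get_size()

d, m = 1000, 100                                    # dimension and number of blocks
blocks = np.array_split(np.arange(d), m)
t = np.random.default_rng(42).standard_normal(d)    # common data, so Assumption 2 holds
x = np.zeros(d)
gamma = 0.5                                         # placeholder stepsize
rng = np.random.default_rng(rank)                   # independent block sampling per worker

def local_gradient(x):
    return x - t                                    # gradient of f_i(x) = ||x - t||^2 / 2

for _ in range(200):
    sampled = rng.choice(m, size=max(m // n, 1), replace=False)   # tau = 1/n
    idx = np.concatenate([blocks[b] for b in sampled])
    payload = (idx, -gamma * local_gradient(x)[idx])              # only a tau-fraction of numbers
    contributions = comm.gather(payload, root=0)                  # 'reduce'-like step
    if rank == 0:
        delta = np.zeros(d)
        for i, u in contributions:
            delta[i] += u
        x = x + delta / n
    x = comm.bcast(x, root=0)                                     # 'broadcast' step

if rank == 0:
    print("distance to optimum:", np.linalg.norm(x - t))
```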

Distributed asynchronous learning. The main difference with the synchronous case is that only one-to-one communications are used instead of the highly efficient 'reduce' and 'broadcast'. Clearly, the communication to the server will be much faster with $\tau = \frac{1}{n}$, so the main question is how to make the communication back equally fast. Fortunately, the parameter server can copy the current vector and send it using non-blocking communication, such as isend() in MPI4PY (Dalcin et al., 2011). Then, the communication back will not prevent the server from receiving new updates. While we do not consider asynchronous methods in this paper, we believe our results can be extended to that setting.

Distributed sparse learning. Large datasets, such as binary classification data from LibSVM, often lead to sparse gradients. In this case, the 'reduce' operation is not efficient and one needs to communicate the data by sending the positions of the nonzeros and their values. Moreover, as we prove later, one can use independent sampling with an $\ell_1$-penalty, which makes the problem solution sparse. In that case, only the communication from a worker to the parameter server is slow, so both synchronous and asynchronous methods gain in performance.

Methods with local subproblems. One can also try to extend our analysis to methods with exact block-coordinate minimization or primal-dual and proximal methods such as Point-SAGA Defazio (2016), PDHG Chambolle & Pock (2011), DANE Shamir et al. (2014), etc. There, by restricting ourselves to a subset of coordinates, we may obtain a subproblem that is easier to solve by orders of magnitude.

Block-separable problems within machines. If we have a problem with different coordinate blocks responsible for different data on a given machine, independent sampling improves scalability at no cost, since one can evaluate block partial derivatives very cheaply. Such problems can be obtained by considering the dual problem, as is done in Ma et al. (2015), for example.

2.2 Main limitations

Our approach, however, also has two main limitations:

First, independent sampling does not generally result in a sparse aggregated update. Indeed, since each machine might sample a different subset of blocks, all these updates add up to a dense one, and this problem gets worse as $n$ increases, other things equal. For instance, if every parallel unit updates a single unique block (assume $x$ is partitioned into several "blocks" of variables), the total number of updated blocks is equal to $n$. In contrast, standard coordinate descent, which samples the same block on each worker, would update a single block only. For simple linear problems, such as logistic regression, sparse updates allow for a fast implementation of coordinate descent via memorization of the residuals. Provided the problem dimensionality is high enough, standard coordinate descent methods are state of the art for the mentioned class of problems.

Second, coordinate descent under independent sampling might not be a fixed-point iteration. In other words, having converged to the optimum, it is possible to jump away from it, as the local gradients do not have to be zero there. Nevertheless, it turns out that if $\nabla f_i(x^*) = 0$ for all $i$ (see Remark 1 about the significance of this class of problems), independent sampling does not violate the variance reduction property (the algorithm does not jump away from the optimum) and preserves fast rates. Even if $\nabla f_i(x^*) \neq 0$, it is possible to invoke the recent SEGA Hanzely et al. (2018) approach in order to make independent sampling variance reduced. SEGA also allows for a non-separable, non-smooth convex regularizer in (1), handled using the corresponding proximal operator, which standard coordinate descent is not capable of; this is the main motivation behind SEGA. SEGA and its accelerated variant enjoy the same favourable theoretical iteration complexity rates as coordinate descent Hanzely et al. (2018).

For a comprehensive list of frequently used notation, see Table 1 in the supplementary material.

3 Independent Block Coordinate Descent

3.1 Technical assumptions

We present the most common technical assumptions required in order to derive convergence rates.

Definition 1.

A function $\phi$ is $L$-smooth if for all $x, y \in \mathbb{R}^d$ we have:

$$\phi(y) \le \phi(x) + \langle \nabla \phi(x), y - x \rangle + \tfrac{L}{2}\|y - x\|^2. \qquad (4)$$

Similarly, $\phi$ is $\mu$-strongly convex if for all $x, y \in \mathbb{R}^d$:

$$\phi(y) \ge \phi(x) + \langle \nabla \phi(x), y - x \rangle + \tfrac{\mu}{2}\|y - x\|^2. \qquad (5)$$

In the vast majority of the results we present, all $f_i$ will be required to be $L$-smooth and convex, and $f$ to be $\mu$-strongly convex.

Assumption 1.

For every $i$, the function $f_i$ is convex and $L$-smooth, and the function $f$ is $\mu$-strongly convex.

As mentioned, since independent sampling does not preserve the variance reduction property in general, in some of our results we shall require $\nabla f_i(x^*) = 0$ for all $i$.

Assumption 2.

For all $i$ we have $\nabla f_i(x^*) = 0$.

Remark 1.

Although assuming $\nabla f_i(x^*) = 0$ for all $i$ might sound restrictive, for many optimization problems it is satisfied. For example, in the least squares setting (which is clearly an instance of (1)), it is equivalent to the existence of $x$ such that $Ax = b$, i.e., to the consistency of the underlying linear system. On the other hand, current state-of-the-art deep learning models are often over-parameterized so that they allow for zero training loss, which is again equivalent to $\nabla f_i(x^*) = 0$ for all $i$ (however, such problems are typically non-convex).
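As a quick numerical sanity check of this remark (illustrative code, not from the paper): if the stacked least-squares system is consistent, every worker's gradient indeed vanishes at the common solution, so Assumption 2 holds.

```python
import numpy as np

rng = np.random.default_rng(1)
n, l, d = 4, 20, 10                       # workers, rows per worker, dimension
A = [rng.standard_normal((l, d)) for _ in range(n)]
x_star = rng.standard_normal(d)
b = [Ai @ x_star for Ai in A]             # consistent by construction: A_i x* = b_i

# f_i(x) = ||A_i x - b_i||^2 / 2, so grad f_i(x) = A_i^T (A_i x - b_i)
grads = [Ai.T @ (Ai @ x_star - bi) for Ai, bi in zip(A, b)]
print([float(np.linalg.norm(g)) for g in grads])  # all (numerically) zero
```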

In Section 7 we show that Assumption 2 can be dropped when applying the SEGA trick.

3.2 Block structure of $\mathbb{R}^d$

Let the coordinates $\{1, 2, \dots, d\}$ be partitioned into $m$ blocks $B_1, \dots, B_m$ of arbitrary sizes, so that the parameter space is $\mathbb{R}^d = \mathbb{R}^{d_1} \times \dots \times \mathbb{R}^{d_m}$. For any vector $x \in \mathbb{R}^d$ and a set of blocks $U$, we denote by $x_U$ the vector that agrees with $x$ on the coordinates in the blocks of $U$ and is zero elsewhere.

3.3 IBCD

In order to provide a quick taste of our results, we first present the IBCD method described in the introduction and formalized as Algorithm 1.

1:  Input: $x^0 \in \mathbb{R}^d$, partition of $\{1, \dots, d\}$ into $m$ blocks $B_1, \dots, B_m$, ratio of blocks to be sampled $\tau$, stepsize $\gamma$, # of parallel units $n$
2:  for $k = 0, 1, 2, \dots$ do
3:     for $i = 1, \dots, n$ in parallel do
4:        Sample independently and uniformly a subset of $\tau m$ blocks $U_i^k \subseteq \{1, \dots, m\}$
5:        $x_i^{k+1} = x^k - \gamma \left(\nabla f_i(x^k)\right)_{U_i^k}$
6:     end for
7:     $x^{k+1} = \frac{1}{n} \sum_{i=1}^{n} x_i^{k+1}$
8:  end for
Algorithm 1 Independent Block Coordinate Descent (IBCD)

A key parameter of the method is $\tau$ (chosen so that $\tau m$ is an integer), representing the fraction of blocks to be sampled by each worker. At iteration $k$, each machine independently samples a subset of $\tau m$ blocks $U_i^k \subseteq \{1, \dots, m\}$, uniformly at random. The $i$-th worker then performs a subspace gradient step of the form $x_i^{k+1} = x^k - \gamma \left(\nabla f_i(x^k)\right)_{U_i^k}$, where $\gamma$ is a stepsize. Note that only the coordinates of $x^k$ belonging to the blocks in $U_i^k$ get updated. This is then followed by aggregating all the updates: $x^{k+1} = \frac{1}{n} \sum_{i=1}^{n} x_i^{k+1}$.
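The following single-process NumPy sketch (an illustration with assumed toy objectives and a conservative stepsize, not the paper's experimental code) simulates Algorithm 1 on a problem where Assumption 2 holds.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10, 20, 10                         # workers, dimension, blocks
tau = 1.0 / n                                # fraction of blocks sampled per worker
blocks = np.array_split(np.arange(d), m)

# Toy local objectives f_i(x) = ||A_i x - b_i||^2 / 2 sharing a common minimizer x*,
# so that Assumption 2 (grad f_i(x*) = 0) holds.
A = [rng.standard_normal((50, d)) for _ in range(n)]
x_star = rng.standard_normal(d)
b = [Ai @ x_star for Ai in A]
grad = lambda i, x: A[i].T @ (A[i] @ x - b[i])

L = max(np.linalg.norm(Ai.T @ Ai, 2) for Ai in A)   # a valid smoothness constant
gamma = 1.0 / (2 * L)                               # conservative, illustrative stepsize

x = np.zeros(d)
for k in range(1000):
    local_iterates = []
    for i in range(n):
        U = rng.choice(m, size=max(int(tau * m), 1), replace=False)   # independent blocks
        idx = np.concatenate([blocks[u] for u in U])
        xi = x.copy()
        xi[idx] -= gamma * grad(i, x)[idx]          # subspace gradient step
        local_iterates.append(xi)
    x = np.mean(local_iterates, axis=0)             # aggregation

print("distance to x*:", np.linalg.norm(x - x_star))
```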

3.4 Convergence of IBCD

Theorem 1 provides a convergence rate for Algorithm 1. Admittedly, the assumptions of Theorem 1 are somewhat restrictive; in particular, we require $\nabla f_i(x^*) = 0$ for all $i$. However, this is necessary. Indeed, in general one cannot expect to have $\frac{1}{n}\sum_{i=1}^{n} \left(\nabla f_i(x^*)\right)_{U_i} = 0$ (which would be required for the method to remain at $x^*$ once it reaches it) for independently sampled sets of blocks $U_i$, unless $\nabla f_i(x^*) = 0$ for all $i$. As mentioned, this issue is resolved in Section 7 using the SEGA trick Hanzely et al. (2018).

Theorem 1.

Suppose that Assumptions 1 and 2 hold. For Algorithm 1 with a suitably chosen stepsize $\gamma$ we have

As a consequence of Theorem 1, we can choose $\tau$ as small as $\frac{1}{n}$ and get, up to a constant factor, the same convergence rate as gradient descent, as described next.

Corollary 1.

Under the setting of Theorem 1 with $\tau = \frac{1}{n}$, the iteration complexity (the number of iterations to reach an $\varepsilon$-accurate solution) of Algorithm 1 is $\mathcal{O}\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$.

3.5 Optimal block sizes

If we naively use individual coordinates as blocks, i.e., all blocks have size 1, the update will be very sparse and the efficient way to send it is by providing the positions of the nonzeros together with the corresponding values. If, however, we partition $x$ into roughly $\frac{1}{\tau}$ blocks of size approximately $\tau d$ each, then on average only one block will be updated by each worker. This means that it is enough for each worker to communicate just the block number and its entries, which is about half as much data as when using individual coordinates as blocks.
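A back-of-the-envelope illustration of this accounting (the numbers are arbitrary):

```python
d, tau = 10_000, 0.01
k = int(tau * d)                 # coordinates updated per worker

single_coordinate_blocks = 2 * k # a (position, value) pair per updated coordinate
one_matching_block = 1 + k       # one block index plus its k entries
print(single_coordinate_blocks, one_matching_block)   # 200 vs 101: about half the traffic
```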

4 Variance Reduction

As the first extension of IBCD, we inject independent coordinate sampling into SAGA Defazio et al. (2014), resulting in a new method we call ISAGA. We consider two different settings for ISAGA. The first one is the standard distributed setup (1), where each $f_i$ is of the finite-sum form (3). The idea is to run SAGA with independent coordinate sampling locally on each worker, followed by aggregating the updates. However, as for IBCD, we require $\nabla f_i(x^*) = 0$ for all $i$. The second setting is a shared data/memory setup; i.e., we assume that all workers have access to all functions from the finite sum. This allows us to drop Assumption 2.

Independent coordinate sampling is not limited to SAGA and can be similarly applied to other variance reduction techniques.

4.1 Distributed ISAGA

In this section we consider problem (1) with $f_i$ of the finite-sum structure (3). Just like SAGA, every machine remembers the freshest gradient information of all local functions (stored in arrays $\alpha_{ij}$), and updates them once new gradient information is observed. Given that index $j_i^k$ is sampled on the $i$-th machine at iteration $k$, the iterate update step within each machine is taken only on the sampled set of coordinates:

$$x_i^{k+1} = x^k - \gamma\left(\nabla f_{i j_i^k}(x^k) - \alpha_{i j_i^k}^k + \bar{\alpha}_i^k\right)_{U_i^k}.$$

Above, $\bar{\alpha}_i^k$ stands for the average of the $\alpha$ variables on the $i$-th machine, i.e., it is a delayed estimate of $\nabla f_i(x^k)$.

Since the new gradient information is a set of partial derivatives of $f_{i j_i^k}$, we shall update

$$\alpha_{ij}^{k+1} = \begin{cases} \alpha_{ij}^k + \left(\nabla f_{ij}(x^k) - \alpha_{ij}^k\right)_{U_i^k}, & j = j_i^k, \\ \alpha_{ij}^k, & j \neq j_i^k. \end{cases} \qquad (6)$$

Lastly, the local results are aggregated. See Algorithm 5 in the supplementary for details.
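A rough Python sketch of the local step just described (variable names, the sampling helpers, and the data layout are illustrative assumptions; the paper's exact stepsize and parameter rules are not reproduced):

```python
import numpy as np

def isaga_local_step(x, grad_fn, alpha, blocks, tau_m, gamma, rng):
    """One local step of distributed ISAGA on a single machine.

    x       : current iterate (broadcast to all machines)
    grad_fn : grad_fn(j, x) -> gradient of the j-th local function at x
    alpha   : (l, d) array, the machine's table of delayed gradients
    blocks  : list of index arrays partitioning range(d)
    tau_m   : number of blocks sampled per iteration
    """
    l = alpha.shape[0]
    j = rng.integers(l)                                  # sampled local function
    U = rng.choice(len(blocks), size=tau_m, replace=False)
    idx = np.concatenate([blocks[u] for u in U])         # sampled coordinates

    g = grad_fn(j, x)
    direction = g - alpha[j] + alpha.mean(axis=0)        # SAGA search direction
    x_new = x.copy()
    x_new[idx] -= gamma * direction[idx]                 # step on the sampled blocks only

    alpha[j, idx] = g[idx]                               # table update in the spirit of (6)
    return x_new                                         # the master then averages these over i
```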

The next result provides a convergence rate of distributed ISAGA.

Theorem 2.

Suppose that Assumptions 1 and 2 hold. If the stepsize is chosen appropriately, then for the iterates of distributed ISAGA we have

where , and .

The choice $\tau = \frac{1}{n}$ yields a convergence rate which is, up to a constant factor, the same as the convergence rate of the original SAGA. Thus, distributed ISAGA enjoys the desired linear parallel scaling. Corollary 2 formalizes this claim.

Corollary 2.

Consider the setting from Theorem 2. Set and . Then , and the complexity of distributed ISAGA is .

4.2 Shared data ISAGA

We now present a different setup for ISAGA in which the requirement $\nabla f_i(x^*) = 0$ is not needed. Instead of (1), we rather solve the problem

$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{N} \sum_{j=1}^{N} f_j(x), \qquad (7)$$

with $n$ workers, all of which have access to all the data describing $f$. Therefore, each worker can evaluate $\nabla f_j(x)$ for any $j$. Similarly to the distributed setup, in the shared data setup we remember the freshest gradients in arrays $\alpha_j$, and update them as

$$\alpha_j^{k+1} = \begin{cases} \alpha_j^k + \left(\nabla f_j(x^k) - \alpha_j^k\right)_{U_i^k}, & j = j_i^k \text{ for some } i, \\ \alpha_j^k, & \text{otherwise}, \end{cases} \qquad (8)$$

where $j_i^k$ is the index sampled at iteration $k$ by machine $i$, and the second case refers to all indices that were not sampled at iteration $k$ by any machine. The iterate updates within each machine are again taken only on the sampled set of coordinates:

$$x_i^{k+1} = x^k - \gamma\left(\nabla f_{j_i^k}(x^k) - \alpha_{j_i^k}^k + \bar{\alpha}^k\right)_{U_i^k}.$$

Above, $\bar{\alpha}^k$ stands for the average of all the $\alpha_j^k$, and thus it is a delayed estimate of $\nabla f(x^k)$.

1:  Input: $x^0 \in \mathbb{R}^d$, partition of $\{1, \dots, d\}$ into $m$ blocks $B_1, \dots, B_m$, ratio of blocks to be sampled $\tau$, stepsize $\gamma$, # parallel units $n$
2:  Set $\alpha_j^0 = \nabla f_j(x^0)$ for $j = 1, \dots, N$, and $\bar{\alpha}^0 = \frac{1}{N}\sum_{j=1}^{N} \alpha_j^0$
3:  for $k = 0, 1, 2, \dots$ do
4:     Sample uniformly a set of $n$ indices $\{j_1^k, \dots, j_n^k\} \subseteq \{1, \dots, N\}$ without replacement
5:     for $i = 1, \dots, n$ in parallel do
6:        Sample independently and uniformly a subset of $\tau m$ blocks $U_i^k \subseteq \{1, \dots, m\}$
7:        $x_i^{k+1} = x^k - \gamma\left(\nabla f_{j_i^k}(x^k) - \alpha_{j_i^k}^k + \bar{\alpha}^k\right)_{U_i^k}$
8:        $\alpha_{j_i^k}^{k+1} = \alpha_{j_i^k}^k + \left(\nabla f_{j_i^k}(x^k) - \alpha_{j_i^k}^k\right)_{U_i^k}$
9:     end for
10:     For $j \notin \{j_1^k, \dots, j_n^k\}$ set $\alpha_j^{k+1} = \alpha_j^k$
11:     $x^{k+1} = \frac{1}{n}\sum_{i=1}^{n} x_i^{k+1}$
12:     $\bar{\alpha}^{k+1} = \bar{\alpha}^k + \frac{1}{N}\sum_{i=1}^{n}\left(\alpha_{j_i^k}^{k+1} - \alpha_{j_i^k}^k\right)$
13:  end for
Algorithm 2 ISAGA with shared data
Theorem 3.

Suppose that function is strongly convex and each is smooth and convex. If , then for iterates of Algorithm 2 we have

where , and .

As in Section 4.1, the choice $\tau = \frac{1}{n}$ yields a convergence rate which is, up to a constant factor, the same as the convergence rate of SAGA. Therefore, Algorithm 2 enjoys the desired linear parallel scaling, without the additional requirement of Assumption 2. Corollary 3 formalizes the claim.

Corollary 3.

Consider the setting from Theorem 3. Set and . Then , and the complexity of Algorithm 2 is .

5 SGD

In this section, we apply independent sampling in a setup with a stochastic objective. In particular, we consider problem (1) where $f_i$ is given as an expectation; see (2). We assume we have access to a stochastic gradient oracle which, when queried at $x$, outputs a random vector $g_i(x)$ whose mean is $\nabla f_i(x)$: $\mathbb{E}\left[g_i(x)\right] = \nabla f_i(x)$.

Our proposed algorithm, ISGD, evaluates a subset of stochastic partial derivatives of the local objective on each machine and takes a step in the resulting direction. Next, the results are averaged and the next iteration follows. We stress that the coordinate blocks have to be sampled independently within each machine.

1:  Input: $x^0 \in \mathbb{R}^d$, partition of $\{1, \dots, d\}$ into $m$ blocks $B_1, \dots, B_m$, ratio of blocks to be sampled $\tau$, stepsize sequence $\{\gamma^k\}_{k \ge 0}$, # parallel units $n$
2:  for $k = 0, 1, 2, \dots$ do
3:     for $i = 1, \dots, n$ in parallel do
4:        Sample independently and uniformly a subset of $\tau m$ blocks $U_i^k \subseteq \{1, \dots, m\}$
5:        Sample blocks $\left(g_i(x^k)\right)_{U_i^k}$ of a stochastic gradient such that $\mathbb{E}\left[g_i(x^k)\right] = \nabla f_i(x^k)$
6:        $x_i^{k+1} = x^k - \gamma^k \left(g_i(x^k)\right)_{U_i^k}$
7:     end for
8:     $x^{k+1} = \frac{1}{n}\sum_{i=1}^{n} x_i^{k+1}$
9:  end for
Algorithm 3 ISGD
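A schematic Python sketch of one synchronized ISGD round (the oracle interface and all names are illustrative assumptions):

```python
import numpy as np

def isgd_round(x, stoch_grads, blocks, tau_m, gamma, rng):
    """One round of Algorithm 3 (single-process simulation).

    x           : current iterate, shape (d,)
    stoch_grads : list of callables; stoch_grads[i](x) returns an unbiased
                  stochastic gradient of f_i at x (e.g. a minibatch gradient)
    blocks      : list of index arrays partitioning range(d)
    tau_m       : number of blocks each worker samples
    gamma       : stepsize for this round (typically decreasing over rounds)
    """
    local_iterates = []
    for g_i in stoch_grads:
        U = rng.choice(len(blocks), size=tau_m, replace=False)   # independent per worker
        idx = np.concatenate([blocks[u] for u in U])
        xi = x.copy()
        xi[idx] -= gamma * g_i(x)[idx]
        local_iterates.append(xi)
    return np.mean(local_iterates, axis=0)                       # aggregation
```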

In order to establish a convergence rate of ISGD, we shall assume boundedness of stochastic gradients for each worker.

Assumption 3.

Consider the sequence of iterates $\{x^k\}$ of Algorithm 3. Assume that $g_i(x^k)$ is an unbiased estimator of $\nabla f_i(x^k)$ satisfying

Next, we present the convergence rate of Algorithm 3. Since SGD is not a variance reduced algorithm, it does not enjoy a linear convergence rate and one shall use decreasing step sizes. As a consequence, Assumption 2 is not required anymore since there is no variance reduction property to be broken.

Theorem 4.

Let Assumptions 1 and 3 hold. If , where , , then for Algorithm 3 we can upper bound by

where .

Note that the residuals decrease as $\mathcal{O}(1/k)$, which is the behavior one expects from standard SGD. Moreover, the leading complexity term scales linearly: if the number of workers $n$ is doubled, one can afford to halve $\tau$ and keep the same complexity.

Corollary 4.

Consider the setting from Theorem 4. Then, iteration complexity of Algorithm 3 is

Although problem (1) explicitly assumes convex $f_i$, we also consider a non-convex extension, in which smoothness of the individual functions $f_i$ is not required either. Theorem 5 provides the result.

Theorem 5 (Non-convex rate).

Assume that $f$ is smooth, that Assumption 3 holds, and that for all $x$ the difference between the gradients of $f$ and of the $f_i$'s is bounded by some constant. If the output point is sampled uniformly from the first $K$ iterates, then for Algorithm 3 we have

Again, the convergence rate from Theorem 5 scales almost linearly with $\tau$: when doubling the number of workers, one can afford to halve $\tau$ and keep essentially the same guarantees. Note that if the stochastic gradient noise is sufficiently large, increasing $\tau$ beyond a certain threshold does not improve the convergence. This is a slightly weaker conclusion than for the rest of our results, where increasing $\tau$ beyond $\frac{1}{n}$ might still offer some speedup. The main reason behind this is the fact that SGD may be noisy enough on its own to still benefit from the averaging step.

Corollary 5.

Consider the setting from Theorem 5. i) Choose and . Then . ii) For any there is sufficiently large such that choosing yields complexity . The complexity does not improve significantly when is increased.

6 Acceleration 

Here we describe an accelerated variant of IBCD in the sense of Nesterov (1983). In fact, we do something more general and accelerate ISGD, obtaining the IASGD algorithm. We again assume that machine $i$ owns $f_i$, which is itself a stochastic objective as in (2), with access to an unbiased stochastic gradient $g_i(x)$ at every iteration: $\mathbb{E}\left[g_i(x)\right] = \nabla f_i(x)$. A key assumption for the accelerated SGD used to derive the best known rates Vaswani et al. (2018) is the so-called strong growth condition on the unbiased gradient estimator.

Definition 2.

Function satisfies the strong growth condition with parameters , if for all we have

In order to derive a strong growth property of the gradient estimator coming from independent block coordinate sampling, we require a strong growth condition on $f$ with respect to the functions $f_i$, and also a variance bound on the stochastic gradients of each individual $f_i$.

Assumption 4.

The function $f$ satisfies the strong growth condition with respect to the functions $f_i$:

(9)

Similarly, given that $g_i(x)$ provides an unbiased estimator of $\nabla f_i(x)$, i.e., $\mathbb{E}\left[g_i(x)\right] = \nabla f_i(x)$, the variance of $g_i(x)$ is bounded as follows for all $x$:

(10)

Note that the variance bound (10) is weaker than the strong growth property as we always have .

Given that Assumption 4 is satisfied, we derive a strong growth property for the unbiased gradient estimator in Lemma 1. Next, IASGD is nothing but the scheme from Vaswani et al. (2018) applied to stochastic gradients . For completeness, we state IASGD in the supplementary material as Algorithm 6.

Lemma 1.

Suppose that Assumption 4 is satisfied. Then, we have for

(11)
(12)

It remains to use the stochastic gradient (with the strong growth bound from Lemma 1) as a gradient estimate in Vaswani et al. (2018)[Theorem 6], which we restate as Theorem 6 for completeness.

Theorem 6.

Suppose that is smooth, strongly convex and Assumption 4 holds. Then, for a specific choice of parameter sequences (See Vaswani et al. (2018)[Theorem 6] for details), iterates of IASGD admit an upper bound on of the form

The next corollary provides the complexity of Algorithm 6 in a simplified setting where . Note that this implies for all $i$. It again shows the desired linear scaling: if we double the number of workers, we can halve the number of blocks to be evaluated on each machine and still keep the same convergence guarantees. It also shows that increasing $\tau$ beyond $\frac{1}{n}$ does not improve the convergence significantly.

Corollary 6.

Suppose that . Then, complexity of IASGD is

Theorem 6 gives an accelerated rate for strongly convex functions, obtained by applying Vaswani et al. (2018)[Thm 6] to the bound from Lemma 1. A non-strongly convex rate can be obtained analogously from Vaswani et al. (2018)[Thm 7].

7 Regularization

For this section only, let us consider a regularized objective of the form

$$\min_{x \in \mathbb{R}^d} F(x) := f(x) + R(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x) + R(x), \qquad (13)$$

where $R$ is a closed convex regularizer such that its proximal operator is computable: $\mathrm{prox}_{\gamma R}(x) := \arg\min_{u} \left\{ R(u) + \frac{1}{2\gamma}\|u - x\|^2 \right\}$. In this section we propose ISEGA: an independent sampling variant of SEGA Hanzely et al. (2018). We do this in order to both i) avoid Assumption 2 (while keeping linear convergence) and ii) allow for a nonzero regularizer $R$. Original SEGA learns gradients from sketched gradient information via the so-called sketch-and-project process Gower & Richtárik (2015), constructing a vector sequence $h^k$. In ISEGA, on each machine $i$ we iteratively construct a sequence of vectors $h_i^k$ which play the role of estimates of $\nabla f_i(x^k)$. This is done via the following rule:

$$h_i^{k+1} = h_i^k + \left(\nabla f_i(x^k) - h_i^k\right)_{U_i^k}. \qquad (14)$$

The key idea is again that these vectors are constructed from random blocks sampled independently on each machine. Next, using $h_i^k$, SEGA builds an unbiased gradient estimator $g_i^k$ of $\nabla f_i(x^k)$ as follows:

$$g_i^k = h_i^k + \frac{1}{\tau}\left(\nabla f_i(x^k) - h_i^k\right)_{U_i^k}. \qquad (15)$$

Then, we average the vectors $g_i^k$ and take a proximal step.

Unlike coordinate descent, SEGA (or ISEGA) is not limited to separable proximal operators since, as follows from our analysis, $h_i^k \to \nabla f_i(x^*)$. Therefore, ISEGA can be seen as a variance reduced version of IBCD for problems with non-separable regularizers. The price to be paid for dropping Assumption 2 and handling the more general objective (13) is that the updates from each worker are dense, in contrast to those in Algorithm 1.

In order to be consistent with the rest of the paper, we only develop a simple variant of ISEGA (Algorithm 4) in which we consider block coordinate sketches with uniform probabilities and the non-weighted Euclidean metric (i.e., $B = I$ in the notation of Hanzely et al. (2018)). It is possible to develop the theory in full generality as in Hanzely et al. (2018); however, we do not do so for the sake of simplicity.

1:  Input: $x^0 \in \mathbb{R}^d$, initial gradient estimates $h_1^0, \dots, h_n^0 \in \mathbb{R}^d$, partition of $\{1, \dots, d\}$ into $m$ blocks $B_1, \dots, B_m$, ratio of blocks to be sampled $\tau$, stepsize $\gamma$, # parallel units $n$
2:  for $k = 0, 1, 2, \dots$ do
3:     for $i = 1, \dots, n$ in parallel do
4:        Sample independently and uniformly a subset of $\tau m$ blocks $U_i^k \subseteq \{1, \dots, m\}$
5:        $g_i^k = h_i^k + \frac{1}{\tau}\left(\nabla f_i(x^k) - h_i^k\right)_{U_i^k}$
6:        $h_i^{k+1} = h_i^k + \left(\nabla f_i(x^k) - h_i^k\right)_{U_i^k}$
7:     end for
8:     $x^{k+1} = \mathrm{prox}_{\gamma R}\left(x^k - \frac{\gamma}{n}\sum_{i=1}^{n} g_i^k\right)$
9:  end for
Algorithm 4 ISEGA
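A rough sketch of one ISEGA round (an illustration assuming an $\ell_1$ regularizer, whose proximal operator is soft-thresholding; names and parameters are not the paper's):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def isega_round(x, grads, h, blocks, tau, gamma, lam, rng):
    """grads[i](x) returns the gradient of f_i at x; h is an (n, d) array of estimates."""
    n, d = h.shape
    m = len(blocks)
    g_avg = np.zeros(d)
    for i in range(n):
        U = rng.choice(m, size=max(int(tau * m), 1), replace=False)
        idx = np.concatenate([blocks[u] for u in U])
        diff = grads[i](x) - h[i]
        g = h[i].copy()
        g[idx] += diff[idx] / tau        # unbiased estimator, cf. (15)
        h[i, idx] += diff[idx]           # update of the running estimate, cf. (14)
        g_avg += g / n
    x_new = soft_threshold(x - gamma * g_avg, gamma * lam)   # proximal step
    return x_new, h
```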

We next present the convergence rate of ISEGA (Algorithm 4).

Theorem 7.

Suppose Assumption 1 holds. Algorithm 4 with satisfies where .

Note that if the condition number of the problem is not too small (which is usually the case in practice), ISEGA scales linearly in the parallel setting. In particular, when doubling the number of workers, each worker can afford to evaluate only half of the block partial derivatives while keeping the same convergence speed. Moreover, setting $\tau = \frac{1}{n}$, the rate corresponds, up to a constant factor, to the rate of gradient descent. Corollary 7 states the result.

Corollary 7.

Consider the setting from Theorem 7. Suppose that and choose . Then, complexity of Algorithm 4 is .



Figure 1: Comparison of SAGA and Algorithm 2 for various values of $n$ (number of workers), with $\tau = \frac{1}{n}$, on LibSVM datasets. The stepsize is chosen in the same way in each case.


Figure 2: Comparison of Algorithm 4 for various $(n, \tau)$ such that $n\tau = 1$, and of GD, on LibSVM datasets. Analogous stepsizes were chosen for Algorithm 4 and for GD.

8 Experiments

In this section, we numerically verify our theoretical claims. Recall that there are various settings in which it is possible to perform practical experiments (see Section 2); however, we do not restrict ourselves to any particular one of them, in order to deliver as clear a message as possible.

Due to space limitations, we only present a small fraction of the experiments here. A full and exhaustive comparison, together with the complete experiment setup description, is presented in Section C of the supplementary material.

In the first experiment presented here, we compare SAGA against ISAGA in a shared data setup (Algorithm 2) for various values of $n$ with $\tau = \frac{1}{n}$, in order to demonstrate linear scaling. We consider a logistic regression problem on LibSVM data Chang & Lin (2011). The results (Figure 1) corroborate our theory: indeed, setting $\tau = \frac{1}{n}$ does not lead to a decrease of the convergence rate when compared to the original SAGA.

The next experiment (Figure 2) supports an analogous claim for ISEGA (Algorithm 4). We run the method for several pairs $(n, \tau)$ for which $n\tau = 1$, on logistic regression problems and LibSVM data. We also plot the convergence of gradient descent with the analogous stepsize. As our theory predicts, all the methods exhibit almost the same convergence rate. (We have chosen this stepsize for GD as it is the baseline to Algorithm 4 with zero variance; one can in fact use a larger stepsize for GD and get faster convergence, but this is only a constant factor.) Note that for $n = 100$, Algorithm 4 throws away 99% of the partial derivatives while keeping the same convergence speed as GD, which justifies the title of the paper.

References

  • Alistarh et al. (2017) Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720, 2017.
  • Allen-Zhu (2017) Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205. ACM, 2017.
  • Bernstein et al. (2018) Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. SignSGD: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.
  • Chambolle & Pock (2011) Chambolle, A. and Pock, T. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
  • Chang & Lin (2011) Chang, C.-C. and Lin, C.-J. LibSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
  • Csiba & Richtárik (2018) Csiba, D. and Richtárik, P. Importance sampling for minibatches. The Journal of Machine Learning Research, 19(1):962–982, 2018.
  • Dalcin et al. (2011) Dalcin, L. D., Paz, R. R., Kler, P. A., and Cosimo, A. Parallel distributed computing using Python. Advances in Water Resources, 34(9):1124–1139, 2011.
  • Defazio (2016) Defazio, A. A simple practical accelerated method for finite sums. In Advances in Neural Information Processing Systems, pp. 676–684, 2016.
  • Defazio et al. (2014) Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.
  • Fercoq & Richtárik (2015) Fercoq, O. and Richtárik, P. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015.
  • Gower & Richtárik (2015) Gower, R. M. and Richtárik, P. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015.
  • Gower et al. (2018) Gower, R. M., Richtárik, P., and Bach, F. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching. arXiv preprint arXiv:1805.02632, 2018.
  • Hanzely & Richtárik (2018) Hanzely, F. and Richtárik, P. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. arXiv preprint arXiv:1809.09354, 2018.
  • Hanzely et al. (2018) Hanzely, F., Mishchenko, K., and Richtárik, P. SEGA: Variance reduction via gradient sketching. In Advances in Neural Information Processing Systems, pp. 2083–2094, 2018.
  • Horváth & Richtárik (2018) Horváth, S. and Richtárik, P. Nonconvex variance reduced optimization with arbitrary sampling. arXiv preprint arXiv:1809.04146, 2018.
  • Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.
  • Lin et al. (2017) Lin, Y., Han, S., Mao, H., Wang, Y., and Dally, W. J. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.
  • Ma et al. (2015) Ma, C., Smith, V., Jaggi, M., Jordan, M. I., Richtárik, P., and Takáč, M. Adding vs. averaging in distributed primal-dual optimization. In The 32nd International Conference on Machine Learning, pp. 1973–1982, 2015.
  • Mishchenko et al. (2018a) Mishchenko, K., Iutzeler, F., and Malick, J. A distributed flexible delay-tolerant proximal gradient algorithm. arXiv preprint arXiv:1806.09429, 2018a.
  • Mishchenko et al. (2018b) Mishchenko, K., Iutzeler, F., Malick, J., and Amini, M.-R. A delay-tolerant proximal-gradient algorithm for distributed learning. In International Conference on Machine Learning, pp. 3584–3592, 2018b.
  • Nesterov (1983) Nesterov, Y. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pp. 543–547, 1983.
  • Nesterov (2012) Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
  • Qu & Richtárik (2016a) Qu, Z. and Richtárik, P. Coordinate descent with arbitrary sampling I: Algorithms and complexity. Optimization Methods and Software, 31(5):829–857, 2016a.
  • Qu & Richtárik (2016b) Qu, Z. and Richtárik, P. Coordinate descent with arbitrary sampling II: Expected separable overapproximation. Optimization Methods and Software, 31(5):858–884, 2016b.
  • Qu et al. (2015) Qu, Z., Richtárik, P., and Zhang, T. Quartz: Randomized dual coordinate ascent with arbitrary sampling. In Advances in Neural Information Processing Systems 28, pp. 865–873, 2015.
  • Richtárik & Takáč (2014) Richtárik, P. and Takáč, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.
  • Richtárik & Takáč (2016) Richtárik, P. and Takáč, M. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433–484, 2016.
  • Robbins & Monro (1985) Robbins, H. and Monro, S. A stochastic approximation method. In Herbert Robbins Selected Papers, pp. 102–109. Springer, 1985.
  • Schmidt et al. (2017) Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
  • Shamir et al. (2014) Shamir, O., Srebro, N., and Zhang, T. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pp. 1000–1008, 2014.
  • Vaswani et al. (2018) Vaswani, S., Bach, F., and Schmidt, M. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. arXiv preprint arXiv:1810.07288, 2018.
  • Wang et al. (2018) Wang, H., Sievert, S., Liu, S., Charles, Z., Papailiopoulos, D., and Wright, S. ATOMO: Communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems, pp. 9872–9883, 2018.
  • Wen et al. (2017) Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pp. 1509–1519, 2017.
  • Wright (2015) Wright, S. J. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.

Appendix A Table of Frequently Used Notation 

General
$x^*$ : Optimal solution of the optimization problem
$n$ : Number of parallel workers/machines (Sec. 3.2)
$\tau$ : Ratio of coordinate blocks to be sampled by each machine (Sec. 3.2)
$d$ : Dimensionality of the space $\mathbb{R}^d$ (Sec. 3.2)
$m$ : Number of coordinate blocks (Sec. 3.2)
$f_i$ : Part of the objective owned by machine $i$ (1)
$L$ : Each $f_i$ is $L$-smooth (As. 1 and (4))
$\mu$ : $f$ is $\mu$-strongly convex (As. 1 and (5))
$U_i^k$ : Subset of blocks sampled at iteration $k$ by worker $i$
$\gamma$ : Stepsize
$g_i^k$ : Unbiased gradient estimator
SAGA
$\alpha_{ij}^k$, $\alpha_j^k$ : Delayed estimate of the gradient of the $j$-th objective ((6), (8))
$N$ : Finite sum size for the shared data problem (7)
$l$ : Number of datapoints per machine in the distributed setup (3)
Lyapunov function (24)
SGD
$g_i(x^k)$ : Unbiased stochastic gradient; $\mathbb{E}\left[g_i(x^k)\right] = \nabla f_i(x^k)$
$\sigma^2$ : An upper bound on the variance of the stochastic gradients (As. 3)
SEGA
$R$ : Regularizer (13)
$h_i^k$ : Sequence of biased estimators for $\nabla f_i(x^k)$ (14)
$g_i^k$ : Sequence of unbiased estimators for $\nabla f_i(x^k)$ (15)
Lyapunov function (Thm. 7)
Table 1: Summary of frequently used notation.

Appendix B Future Work

We sketch several possible extensions of this work.

  • Combining the tricks from the paper. Distributed ISAGA requires $\nabla f_i(x^*) = 0$ for all $i$. We believe it would be possible to develop a SEGA approach on top of it (just as SEGA is developed on top of coordinate descent) and drop the mentioned requirement. We also believe it should be possible to accelerate the combination of SEGA and ISAGA.

  • Convergence in the asynchronous setup. We have provided theoretical results for parallel algorithms in the synchronous setting, and an asynchronous theory is a very natural next step. Moreover, as we mentioned before, it has very direct practical implications. We believe that it is possible to extend the works (Mishchenko et al., 2018b, a) to design a method with a proximable regularizer (for an $\ell_1$ penalty) that would communicate little in both directions.

  • Importance sampling. Standard coordinate descent exploits a smoothness structure of objective (either via coordinate-wise smoothness constants or more generally using a smoothness matrix) in order to sample coordinates non-uniformly Qu & Richtárik (2016b); Csiba & Richtárik (2018); Hanzely & Richtárik (2018). It would be interesting to derive an importance sampling in our setting in order to converge even faster.

Appendix C Extra Experiments

We present exhaustive numerical experiments to verify the theoretical claims of the paper. The experiments are performed in a simulated environment instead of a genuinely distributed setup, as we only aim to verify the iteration complexity of the proposed methods.

First, Section C.1 provides the simplest setting, chosen in order to gain the best possible insight: Algorithm 1 is tested on an artificial quadratic minimization problem. We compare Algorithm 1 against both gradient descent (GD) and standard CD (in our setting: each machine samples the same subset of coordinates). We also study the effect of changing $\tau$ on the convergence speed.

In the remaining parts, we consider a logistic regression problem on LibSVM data Chang & Lin (2011). Recall that the logistic regression problem is given as

$$f(x) = \frac{1}{N}\sum_{j=1}^{N} \log\left(1 + \exp\left(-b_j \, a_j^\top x\right)\right) + \frac{\lambda}{2}\|x\|^2, \qquad (16)$$

where $A$ (with rows $a_j^\top$) is the data matrix and $b$ is the vector of data labels, $b_j \in \{-1, 1\}$. (The datapoints, i.e., the rows of $A$, have been normalized so that each has norm 1; therefore, each summand is smooth. We set the regularization parameter $\lambda$ to the same value in all cases.) In the distributed scenario (everything except Algorithm 2), we imitate that the data is evenly distributed among the $n$ workers (i.e., each worker owns a subset of rows of $A$ and the corresponding labels, and all subsets have almost the same size).
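For concreteness, a small helper (an illustrative sketch; the paper's exact normalization and regularization value are not reproduced) for evaluating the objective (16) and its gradient could look as follows.

```python
import numpy as np

def logreg_loss_and_grad(x, A, b, lam):
    """A: (N, d) data matrix, b: labels in {-1, +1}, lam: regularization weight."""
    margins = b * (A @ x)
    loss = np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * (x @ x)
    # d/dx log(1 + exp(-b_j a_j^T x)) = -b_j * sigmoid(-b_j a_j^T x) * a_j
    coeff = -b / (1.0 + np.exp(margins))
    grad = A.T @ coeff / A.shape[0] + lam * x
    return loss, grad
```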

As our experiments are not aimed at practicality at this point (we aim to properly demonstrate the conceptual idea), we consider several rather small datasets: a1a, mushrooms, phishing, w1a. The experiments are essentially of two types. The first type shows that setting $\tau = \frac{1}{n}$ does not significantly hurt the convergence of the original method. In the second type of experiments we study the behavior for varying $\tau$, and show that beyond a certain threshold, increasing $\tau$ does not significantly improve the convergence. This threshold becomes smaller as $n$ increases, as predicted by the theory.

C.1 Simple, well-understood experiment

In this section we study the simplest possible setting: we test the behavior of Algorithm 1 on an artificial quadratic minimization problem. The considered quadratic objective is set as

(17)

where the entries of the problem data are sampled independently from the standard normal distribution.

In the first experiment (Figure 3), we compare Algorithm 1 against gradient descent (GD) and two versions of coordinate descent: a default version, and a coordinate descent with importance sampling (sampling proportionally to the coordinate-wise smoothness constants) and optimal stepsizes (the inverses of the coordinate-wise smoothness constants). In all experiments, gradient descent enjoys twice better iteration complexity than Algorithm 1, which is caused by its twice larger stepsize. However, in each case, Algorithm 1 requires fewer iterations than CD with importance sampling, which is itself significantly faster than plain CD.




Figure 3: Comparison of gradient descent, (standard) coordinate descent, (standard) coordinate descent with importance sampling and Algorithm 1 on artificial quadratic problem (17).

Next, we study the effect of changing $\tau$ on the iteration complexity of Algorithm 1. Figure 4 provides the result. The behavior predicted by the theory is observed: increasing $\tau$ above $\frac{1}{n}$ does not significantly improve the convergence speed, while decreasing it below $\frac{1}{n}$ slows the algorithm down notably.




Figure 4: Behavior of Algorithm 1 for different values of $\tau$ on a simple artificial quadratic problem (17).

C.2 ISGD

In this section we numerically test Algorithm 3 on the logistic regression problem. As mentioned, $f_i$ consists of a set of (uniformly distributed) rows of $A$ from (16). We consider the most natural unbiased stochastic oracle for $f_i$: the gradient computed on a subset of the data points belonging to $f_i$.

In all experiments of this section, we consider constant stepsizes in order to keep the setting as simple as possible and to gain as much insight from the experiments as possible. Therefore, one cannot expect convergence to the exact optimum.

In the first experiment, we compare standard SGD (the stochastic gradient is computed on a single, randomly chosen datapoint every iteration) against Algorithm 3, varying $n$ and choosing $\tau = \frac{1}{n}$ for each $n$. The results are presented in Figure 5. We see that, as our theory suggests, SGD and Algorithm 3 always have very similar performance.





Figure 5: Comparison of SGD (gradient evaluated on a single datapoint) and Algorithm 3 with $\tau = \frac{1}{n}$. A constant stepsize was used for each algorithm. The label "batch_size" indicates the minibatch size used for each worker's stochastic gradient.

Next, we study the dependence of the convergence speed on $\tau$ for various values of $n$. Figure 6 presents the results. In each case, $\tau$ influences the convergence rate (or the size of the region where the iterates oscillate) significantly; however, the effect is much weaker for larger $\tau$. This is in correspondence with Corollary 4.