SySCD: A System-Aware Parallel Coordinate Descent Algorithm

11/18/2019 · by Nikolas Ioannou, et al.

In this paper we propose a novel parallel stochastic coordinate descent (SCD) algorithm with convergence guarantees that exhibits strong scalability. We start by studying a state-of-the-art parallel implementation of SCD and identify scalability as well as system-level performance bottlenecks of that implementation. We then take a principled approach to develop a new SCD variant which is designed to avoid the identified system bottlenecks, such as limited scaling due to coherence traffic of model sharing across threads, and inefficient CPU cache accesses. Our proposed system-aware parallel coordinate descent algorithm (SySCD) scales to many cores and across numa nodes, and offers a consistent bottom-line speedup in training time of up to ×12 compared to an optimized asynchronous parallel SCD algorithm, and up to ×42 compared to state-of-the-art GLM solvers (scikit-learn, Vowpal Wabbit, and H2O), on a range of datasets and multi-core CPU architectures.


1 Introduction

Today’s individual machines offer dozens of cores and hundreds of gigabytes of RAM that can, if used efficiently, significantly improve the training performance of machine learning models. In this respect, parallel versions of popular machine learning algorithms such as stochastic gradient descent (Recht et al., 2011) and stochastic coordinate descent (Liu et al., 2015; Hsieh et al., 2015; Richtarik and Takac, 2016b) have been developed. These methods either introduce asynchronicity into the sequential algorithms, or they use a mini-batch approach, in order to enable parallelization and better utilization of compute resources. However, all of these methods treat machines as a simple, uniform collection of cores. This is far from reality. While modern machines offer ample computation and memory resources, they are also elaborate systems with complex topologies, memory hierarchies, and CPU pipelines. As a result, maximizing the performance of parallel training requires algorithms and implementations that are aware of these system-level characteristics and respect their bottlenecks.

Setup. In this work we focus on the training of generalized linear models (GLMs). Our goal is to efficiently solve the following partially separable convex optimization problem using the full compute power available in modern CPUs:

\min_{\alpha \in \mathbb{R}^n} \; \mathcal{F}(\alpha) := f(A\alpha) + \sum_{i=1}^{n} g_i(\alpha_i) \qquad (1)

The model α ∈ ℝⁿ is learned from the training data A ∈ ℝ^{d×n}, the function f is convex and smooth, and the g_i are general convex functions. The objective (1) covers primal as well as dual formulations of many popular machine learning models which are widely deployed in industry (Kaggle, 2017). To develop such a system-aware training algorithm we build on the popular stochastic coordinate descent (SCD) method (Wright, 2015; Shalev-Shwartz and Zhang, 2013). We first identify its performance bottlenecks and then propose algorithmic optimizations to alleviate them.
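As a concrete illustration (our example, not one taken from the paper's experiments), regularized least squares over the coefficients α is one valid instance of the template (1):

% Illustrative instance of objective (1): f is the smooth least-squares
% term, the g_i are separable per-coordinate ell_2 penalties.
\min_{\alpha \in \mathbb{R}^n} \; \mathcal{F}(\alpha)
  = \underbrace{\tfrac{1}{2}\,\lVert A\alpha - b \rVert_2^2}_{f(A\alpha)}
  + \underbrace{\sum_{i=1}^{n} \tfrac{\lambda}{2}\,\alpha_i^2}_{\sum_i g_i(\alpha_i)},
  \qquad \lambda > 0,

with b a target vector: f is convex and smooth, and each g_i is convex (here even strongly convex), as required.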

Contributions.

The main contributions of this work can be summarized as follows:

  1. We propose SySCD, the first system-aware coordinate descent algorithm that is optimized for

    • cache access patterns: We introduce buckets to design data access patterns that are well aligned with the system architecture.

    • thread scalability: We increase data parallelism across worker threads to avoid data access bottlenecks and benefit from the buckets to reduce permutation overheads.

    • numa-topology: We design a hierarchical numa-aware optimization pattern that respects non-uniform data access delays of threads across numa-nodes.

  2. We give convergence guarantees for our optimized method and motivate a dynamic re-partitioning scheme to improve its sample efficiency.

  3. We evaluate the performance of SySCD on diverse datasets and across different CPU architectures, and we show that SySCD drastically improves implementation efficiency and scalability compared to state-of-the-art GLM solvers (scikit-learn (Pedregosa et al., 2011), Vowpal Wabbit (Langford, 2007), and H2O (The H2O.ai team, 2015)), resulting in significantly faster training on average.

2 Background

Stochastic coordinate descent (SCD) methods (Wright, 2015; Shalev-Shwartz and Zhang, 2013) have become one of the key tools for training GLMs, due to their ease of implementation, cheap iteration cost, and effectiveness in the primal as well as in the dual. Their popularity has been driving research beyond sequential stochastic solvers, and a lot of work has been devoted to mapping these methods to parallel hardware. We give a short summary in the following, putting emphasis on the assumptions made about the underlying hardware.

Previous works on parallel coordinate descent (Hsieh et al., 2015; Parnell et al., 2017; Richtarik and Takac, 2016b; Liu et al., 2015) assume that parallel processes are homogeneous and that data as well as model information resides in shared memory accessible by all processes. Building on these assumptions, Hsieh et al. (2015); Liu et al. (2015); Liu and Wright (2015) propose asynchronous methods for scaling up SCD: the model resides in shared memory and all processes simultaneously read and write this model vector. A fundamental limitation of such an approach is that its convergence relies on the model information used to compute each update not being too stale. Thus, asynchronous algorithms are prone to diverge when scaled up to a large number of processes. In addition, the heavy load on the model vector can cause significant runtime delays. Both limitations are more pronounced for dense data, so we use a dense synthetic dataset to illustrate these effects in Fig 1(a): the orange, dashed line shows that convergence suffers from staleness, the gray line shows the respective runtime assuming perfect thread scalability, and the yellow line depicts the measured runtime. The algorithm diverges when scaled across more than 8 threads. Taking another route, Richtarik and Takac (2016b); Bradley et al. (2011) propose a synchronous approach to parallelizing SCD. Such methods come with more robust convergence properties. However, depending on the inherent separability of the objective, the potential for acceleration can be small. For synthetic, well separable problems, the mini-batch SDCA proposed by Richtarik and Takac (2016b) shows almost linear scaling, whereas for correlated objectives or dense datasets the potential for acceleration, as given by their theory, diminishes. In addition, updates to the shared vector in the synchronous setting are guaranteed to conflict across parallel threads – mini-batch SDCA uses atomic operations¹ to serialize those updates; this does not scale as the thread count increases, and especially not on numa machines. We applied this method to the same synthetic example used in Fig 1 and observed virtually no speedup as the thread count increases.

¹ code available at https://code.google.com/archive/p/ac-dc/downloads

(a) PASSCoDe
(b) CoCoA
Figure 1: Scalability of existing methods: training of a logistic regression classifier on a synthetic dense dataset with 100k training examples and 100 features – (a) training using PASSCoDe-wild (Hsieh et al., 2015) and (b) training using CoCoA (Smith et al., 2018), deployed across an increasing number of threads. Details can be found in the appendix.

Orthogonal to parallel methods, distributed coordinate-based methods have also been the focus of many works, including (Yang, 2013; Ma et al., 2015; Richtarik and Takac, 2016a; Dünner et al., 2018a; Smith et al., 2018; Lee and Chang, 2018). Here the standard assumption on the hardware is that processes are physically separate, data is partitioned across them, and communication is expensive. To this end, state-of-the-art distributed first- and second-order methods attempt to pair good convergence guarantees with efficient distributed communication. However, enabling this often means trading convergence for data parallelism (Kaufmann et al., 2018). We illustrate this tradeoff in Fig 1(b), where we employ CoCoA (Smith et al., 2018) across an increasing number of threads; with 32 threads the number of epochs needed for convergence increases substantially, resulting in only a small speedup even assuming perfect thread scalability. This small payback makes distributed algorithms generally not well suited to achieving acceleration; they are primarily designed to enable training on large datasets that do not fit into a single machine (Smith et al., 2018).

The fundamental trade-off between statistical efficiency (how many iterations are needed to converge) and hardware efficiency (how efficiently they can be executed) of deploying machine learning algorithms on modern CPU architectures has previously been studied in Zhang and Ré (2014). The authors identified data parallelism as a critical tuning parameter and demonstrated that its choice can significantly affect the performance of any given algorithm.

The goal of this work is to go one step further and enable better trade-offs by directly incorporating mitigations of critical system-level bottlenecks into the algorithm design. We exploit the shared-memory performance available to worker threads within modern individual machines to enable new algorithmic features that improve the scalability of parallel coordinate descent while maintaining statistical efficiency.

3 Bottleneck Analysis

We start by analyzing state-of-the-art implementations of sequential and parallel coordinate descent to identify bottlenecks and scalability issues. For the parallel case, we use an optimized implementation of PASSCoDe (Hsieh et al., 2015) as the baseline for this study, which is vectorized and reasonably efficient. The parallel algorithm operates in epochs and repeatedly divides the shuffled coordinates among the parallel threads. Each thread then operates asynchronously: it reads the current state of the model α, computes an update for its coordinate, and writes out the update to the model as well as to the shared vector v = Aα. The auxiliary vector v is kept in memory to avoid recurring computations. Write-contention on v is resolved opportunistically in a wild fashion (a sketch follows below), which in practice is preferred over expensive locking (Parnell et al., 2017; Hsieh et al., 2015). The parallel SCD algorithm is stated in Appendix A.1 for completeness.
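To make the access pattern concrete, the following is a minimal sketch (ours, not the PASSCoDe code) of one such wild epoch, assuming a dense column-major data matrix and closed-form quadratic updates as in the least-squares instance of (1) given above; all identifiers are illustrative.

// Minimal sketch of an asynchronous "wild" SCD epoch, assuming a dense
// d x n column-major matrix A and the least-squares instance of (1).
// v holds the residual A*alpha - b. Illustrative only.
#include <omp.h>
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

void wild_epoch(const std::vector<double>& A, int d, int n,
                std::vector<double>& alpha, std::vector<double>& v,
                double lambda) {
    std::vector<int> perm(n);
    std::iota(perm.begin(), perm.end(), 0);
    std::shuffle(perm.begin(), perm.end(), std::mt19937{std::random_device{}()});

    #pragma omp parallel for schedule(static)
    for (int k = 0; k < n; ++k) {
        const int i = perm[k];
        const double* Ai = &A[(size_t)i * d];                  // column i of A
        double grad = std::inner_product(Ai, Ai + d, v.begin(), 0.0)
                      + lambda * alpha[i];                     // stale read of v
        double delta = -grad / (std::inner_product(Ai, Ai + d, Ai, 0.0) + lambda);
        alpha[i] += delta;
        for (int j = 0; j < d; ++j)
            v[j] += delta * Ai[j];                             // unlocked shared write
    }
}

The unlocked writes to v in the inner loop are exactly the contention point analyzed as (B2) below.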

One would expect that, especially for large datasets (e.g., datasets that do not fit in the CPU caches), the runtime would be dominated by (a) the time to compute the inner products required for the coordinate updates and (b) the time to retrieve the data from memory. While these costs generally cannot be avoided, our performance analysis identified four other bottlenecks that in some cases vastly dominate the runtime:

(B1) Access to model vector. When the model vector α does not fit in the CPU cache, a lot of time is spent accessing the model. The origin of this overhead is the random nature of the accesses to α: there is very little cache line re-use, since a full cache line (64B or 128B) is brought from memory, out of which only a single 8B model entry is used, i.e., at most one eighth of the transferred data. This issue affects both the parallel and the sequential implementation. For the latter this bottleneck dominates, and we found that, by accessing the model in a sequential manner, we can reduce the runtime considerably (see Table 1 in the appendix).

(B2) Access to the shared vector. For the parallel implementation, we found that writing the updates to the shared vector v across the different threads was the main bottleneck. On top of dominating the runtime, staleness in the shared vector can also negatively impact convergence.

(B3) Non-uniform memory access. When the parallel implementation is deployed across multiple numa nodes, bottleneck (B2) becomes catastrophic, often leading to divergence of the algorithm (see Fig. 1(a)). This effect can be explained by the fact that the inter-node delay when writing updates is far more pronounced than the intra-node delay.

(B4) Shuffling coordinates. A significant amount of time is spent permuting the coordinates before each epoch, in both the parallel and the sequential case. For the latter, we found that removing the permutation, effectively performing cyclic coordinate descent, yields a further runtime speed-up on top of removing (B1).

1:Input: Training data matrix A ∈ ℝ^{d×n}
2:Initialize model α := 0 and shared vector v := Aα.
3:Partition coordinates into buckets of size B.
4:Partition buckets across numa nodes according to {𝒫_k}.
5:for t = 0, 1, 2, … do
6:     parfor k = 1, …, K across numa nodes do
7:          v_k := v
8:          for r = 0, 1, …, R−1 do
9:               create random partitioning of local buckets across threads
10:               parfor p = 1, …, P across threads do
11:                    v_{k,p} := v_k
12:                    for m = 0, 1, …, M−1 do
13:                         randomly select a bucket b
14:                         for h = 0, 1, …, H−1 do
15:                              randomly sample a coordinate i in bucket b
16:                              δ := coordinate update for α_i computed on the local subproblem (5)
17:                              α_i := α_i + δ
18:                              v_{k,p} := v_{k,p} + δ A_i
19:                         end for
20:                    end for
21:               end parfor
22:               v_k := v_k + Σ_p (v_{k,p} − v_k)
23:          end for
24:     end parfor
25:     v := v + Σ_k (v_k − v)
26:end for
Algorithm 1 SySCD for minimizing (1)

4 Algorithmic Optimizations

In this section we present the main algorithmic optimizations of our new training algorithm, which are designed to simultaneously address the system performance bottlenecks (B1)-(B4) identified in the previous section as well as the scalability issue demonstrated in Fig. 1(b). Our system-aware parallel training algorithm (SySCD) is summarized in Alg. 1 and its convergence properties are analyzed in Sec. 4.4. The following subsections are accompanied by experimental results illustrating the effect of the individual optimizations. They show training of a logistic regression classifier on the criteo-kaggle dataset (Criteo-Labs, 2013) on a 4-node system with 8 threads per numa node (for the experimental setup, see Sec 5). Results for two additional datasets can be found in the appendix.

4.1 Bucket Optimization

We have identified the cache line access pattern (B1) and the random shuffling computation (B4) as two critical bottlenecks in the sequential as well as the parallel coordinate descent implementation. To alleviate these in our new method, we introduce the concept of buckets: we partition the coordinates and the respective columns of A into buckets and then train a bucket of consecutive coordinates at a time. Thus, instead of randomizing all coordinates at once, we randomize the order in which buckets are processed, as well as the order of coordinates within a bucket (see the sketch below). This modification improves performance in several ways: (i) the model vector is accessed in a cache-line efficient manner, (ii) the computation overhead of randomizing the coordinates is reduced, and (iii) CPU prefetching efficiency when accessing the different coordinates of A is implicitly improved. For our test case this optimization leads to a notable average speedup with only a small toll on convergence, as depicted in Fig. 2.
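A minimal sketch of this bucketized randomization (our illustration; update_coordinate stands in for the coordinate update of Alg. 1):

// Sketch of bucketized randomization: shuffle the bucket order once per
// epoch (n/B items instead of n), then shuffle only within each bucket.
#include <algorithm>
#include <functional>
#include <numeric>
#include <random>
#include <vector>

void bucketized_epoch(int n, int B, std::mt19937& rng,
                      const std::function<void(int)>& update_coordinate) {
    const int n_buckets = (n + B - 1) / B;
    std::vector<int> bucket_order(n_buckets);
    std::iota(bucket_order.begin(), bucket_order.end(), 0);
    std::shuffle(bucket_order.begin(), bucket_order.end(), rng);  // cheap shuffle

    std::vector<int> within(B);
    for (int b : bucket_order) {
        std::iota(within.begin(), within.end(), 0);
        std::shuffle(within.begin(), within.end(), rng);
        for (int off : within) {
            int i = b * B + off;
            if (i < n) update_coordinate(i);  // model accesses stay in one cache line
        }
    }
}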

The bucket size B will appear in our convergence rate (Theorem 1) and can be used to control the scope of the randomization, trading off convergence speed against implementation efficiency. We illustrate the sensitivity of our algorithm to the bucket size in Fig. 3. We see that the bottom-line training time decreases significantly across the board by introducing buckets. The optimal bucket size in Fig. 3 is eight, which coincides with the number of model-vector elements that fit in one cache line of the CPU. Thus we do not need to introduce an additional hyperparameter and can choose the bucket size at runtime based on the cache line size of the CPU, using linux sysfs (a short sketch of this runtime detection follows).
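// Sketch of deriving the bucket size from the CPU cache line size at
// runtime. The sysfs attribute below is the standard Linux location;
// on read failure we fall back to a common 64B line.
#include <fstream>

int bucket_size_from_cacheline() {
    std::ifstream f("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size");
    int line_bytes = 0;
    if (!(f >> line_bytes) || line_bytes <= 0) line_bytes = 64;   // fallback
    return line_bytes / static_cast<int>(sizeof(double));        // e.g. 64B/8B = 8
}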

Figure 2: Bucket Optimization: Gain achieved by using buckets. Solid lines indicate time, dashed lines depict the number of epochs.
Figure 3: Sensitivity analysis of the bucket size w.r.t. training time and epochs for convergence.

4.2 Increasing Data Parallelism

Our second algorithmic optimization mitigates the main scalability bottleneck (B2) of the asynchronous implementation: write-contention on the shared vector v. We completely avoid this write-contention by replicating the shared vector across threads, increasing data parallelism. To realize this data parallelism algorithmically we borrow ideas from distributed learning. In particular, we employ the CoCoA method (Smith et al., 2018) and map it to a parallel architecture: we partition the (buckets of) coordinates across the threads and replicate the shared vector in each one. The global shared vector is therefore only accessed at coarse-grained intervals (e.g., epoch boundaries), where it is updated based on the replicas and broadcast anew to each thread (a minimal sketch follows). Similar to CoCoA, we can exploit the typical asymmetry of large datasets and map our problem such that the shared vector takes the smaller of the two data dimensions.
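The following sketch (ours, with illustrative names) shows the resulting pattern: thread-local replicas of v, touched without contention, reduced into the global v only at the synchronization point.

// Sketch of a synchronized round with replicated shared vector:
// each thread solves on its private copy; deltas are reduced at the end.
#include <omp.h>
#include <functional>
#include <vector>

void synchronized_round(std::vector<double>& v, int d, int P,
                        const std::function<void(std::vector<double>&, int)>& local_solve) {
    std::vector<std::vector<double>> replica(P, v);   // one replica per thread
    #pragma omp parallel num_threads(P)
    {
        int p = omp_get_thread_num();
        local_solve(replica[p], p);                   // no write-contention on v
    }
    for (int j = 0; j < d; ++j) {                     // reduce: add up the deltas
        double acc = 0.0;
        for (int p = 0; p < P; ++p) acc += replica[p][j] - v[j];
        v[j] += acc;
    }
}

Because each replica is private between synchronization points, the coherence traffic of (B2) disappears; the price is the periodic reduction, whose frequency is controlled by the hyperparameters of Alg. 1.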

We have seen in Sec 2 that distributed algorithms such as CoCoA are generally not suited to achieving significant acceleration through parallelism. This behavior of distributed methods is caused by the static partitioning of the training data across workers, which increases the number of epochs needed for convergence (Smith et al., 2018; Kaufmann et al., 2018) (e.g., see Fig 1(b)). To alleviate this issue, we propose to combine our multi-threaded implementation with a dynamic re-partitioning scheme: we shuffle all the (buckets of) coordinates at the beginning of each local optimization round (Step 9 of Alg. 1), so that each thread picks a different set of buckets each time (a minimal sketch is given below). Such a re-partitioning approach is very effective for convergence when compared to a default static partitioning, as depicted in Fig. 4: it significantly reduces the number of epochs at the cost of only a small implementation overhead. To the best of our knowledge we are the first to consider such a re-partitioning approach in combination with distributed methods and to demonstrate a practical scenario where it pays off – in a classical distributed setting the cost of re-partitioning would be unacceptable.
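// Sketch of the dynamic re-partitioning step (Step 9 of Alg. 1, our
// illustration): bucket ids are reshuffled before every local round and
// dealt round-robin, so each thread sees a different partition each time.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

std::vector<std::vector<int>> repartition(int n_buckets, int P, std::mt19937& rng) {
    std::vector<int> ids(n_buckets);
    std::iota(ids.begin(), ids.end(), 0);
    std::shuffle(ids.begin(), ids.end(), rng);        // fresh partition every round
    std::vector<std::vector<int>> parts(P);
    for (int b = 0; b < n_buckets; ++b)
        parts[b % P].push_back(ids[b]);               // thread p gets every P-th bucket
    return parts;
}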

Figure 4: Static and dynamic partitioning: Gain achieved by dynamic re-partitioning. Solid lines indicate time, dashed lines depict the number of epochs.
Figure 5: Numa-level optimizations: Gain achieved by numa-awareness. Solid lines indicate time, dashed lines depict the number of epochs.

The intuition behind this approach is the following: in CoCoA (Smith et al., 2018) a block-separable auxiliary model of the objective is constructed. In this model the correlation matrix AᵀA is approximated by a block-diagonal version where the blocks are aligned with the partitioning of the data. This allows one to decouple the local optimization tasks. However, it also means that correlations between data points on different worker nodes are not considered. A dynamic re-partitioning scheme has the effect of choosing a different block-diagonal approximation of AᵀA in each step. By randomly re-partitioning coordinates, the off-diagonal elements of AᵀA are sampled uniformly at random, and thus in expectation a good estimate of AᵀA is used. A rigorous analysis of this effect would be an interesting study for future work. However, note that SySCD inherits the strong convergence guarantees of the CoCoA method, independent of the partitioning, and can thus be scaled safely to a large number of cores, in contrast to our asynchronous reference implementation.

4.3 Numa-Level Optimizations

Subsequently, we focus on optimizations related to the numa topology in a multi-numa-node system. Depending on the numa node where the data resides and the node on which a thread is running, data access performance can be non-uniform across threads. As we have seen in Fig. 1(b) and discussed in Sec. 3, this amplifies bottleneck (B3). To avoid this in SySCD, we add an additional level of parallelism and treat each numa node as an independent training node, in the distributed sense. We then deploy a hierarchical scheme: we statically partition the buckets across the numa nodes, and within each numa node we use the dynamic re-partitioning scheme introduced in Sec 4.2. We exploit the fact that the training dataset is read-only and thus does not incur expensive coherence traffic across numa nodes; we do not replicate the training dataset across the nodes. The model vector is local to each node k, which holds the coordinates corresponding to its partition 𝒫_k. Crucially, each node holds its own replica of the shared vector, which is reduced across nodes periodically. The frequency of synchronization can be steered in Alg. 1 by balancing the number of node-local rounds R against the number of local updates performed per round. This again offers a trade-off between fast convergence (see Theorem 1) and implementation efficiency. This hierarchical optimization pattern reflecting the numa topology results in a significant speedup over a numa-oblivious implementation, as shown in Fig 5. To avoid additional hyperparameters, we dynamically detect the numa topology of the system, as well as the number of physical cores per node, using libnuma and the sysfs interface (a sketch follows below). If the number of threads requested by the user is less than or equal to the number of cores in one node, we schedule a single-node solver. We detect the numa node on which the dataset resides using the move_pages system call.
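A sketch of this detection logic (ours), using libnuma and move_pages as described; error handling is omitted and the program must be linked with -lnuma:

// Sketch: detect the numa topology with libnuma, and the node holding a
// given data page with move_pages(2). Illustrative, minimal error handling.
#include <numa.h>
#include <numaif.h>

int detect_nodes() {
    if (numa_available() < 0) return 1;   // no numa support: treat as one node
    return numa_num_configured_nodes();   // number of numa nodes in the system
}

int node_of(void* data) {
    // With nodes == NULL, move_pages does not move anything; it reports the
    // node currently holding each page in 'status'.
    void* pages[1] = { data };
    int status[1] = { -1 };
    move_pages(0 /* this process */, 1, pages, nullptr, status, 0);
    return status[0] >= 0 ? status[0] : 0;
}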

4.4 Convergence Analysis

We derive an end-to-end convergence rate for SySCD with all its optimizations as described in Alg. 1. We focus on strongly convex g_i, while every single component of SySCD is also guaranteed to converge for general convex g_i; see Remark 2 in the Appendix.

Theorem 1.

Consider Algorithm 1 applied to (1). Assume f is L-smooth and the g_i are μ-strongly convex. Let K be the number of numa nodes and P the number of threads per numa node. Let B be the bucket size. Denote by H the number of SDCA updates performed on each bucket, let M be the number of buckets processed locally in each iteration, and let R be the number of communication rounds performed independently on each numa node before global synchronization. Then, after T outer rounds, the suboptimality can be bounded as

(2)

where the contraction factor combines the accuracies of the nested levels as detailed in

(3)
Proof Sketch.

To derive a convergence rate for Alg. 1 we start at the outermost level. We focus on the two nested for-loops in Step 6 and Step 10 of Alg. 1. They implement a nested version of CoCoA, where the outer level corresponds to CoCoA across numa nodes and the inner level to CoCoA across threads. The number of inner iterations R is a hyper-parameter of our algorithm, steering the accuracy to which the local subproblem assigned to each numa node is solved. Convergence guarantees for such a scheme can be derived from a nested application of (Smith et al., 2018, Theorem 3), similar to (Dünner et al., 2018b, Appendix B). Subsequently, we combine this result with the convergence guarantees of the local solver used by each thread. This solver, implementing the bucketing optimization, can be analyzed as a randomized block coordinate descent method, similar to (Dünner et al., 2017, Theorem 1), where each block corresponds to a bucket of coordinates. Each block update is then computed using SDCA (Shalev-Shwartz and Zhang, 2013). Again, the number of coordinate descent steps H forms a hyper-parameter to steer the accuracy of each block update. Combining all these results in a nested manner yields the convergence guarantee presented in Theorem 1. We refer to Appendix A.3 for a detailed proof. ∎

5 Evaluation

In this section, we evaluate the performance of SySCD in two different single-server multi-numa-node environments. We first analyze the scalability of our method and the performance gains achieved over the reference implementation. Then, we compare SySCD with other state-of-the-art GLM solvers available in scikit-learn (Pedregosa et al., 2011) (0.19.2), H2O (The H2O.ai team, 2015) (3.20.0.8), and Vowpal Wabbit (VW) (Langford, 2007) (commit: 5b020c4). We take logistic regression with ℓ2 regularization as a test case. We use two systems with different CPU architectures and numa topologies: a 4-node Intel Xeon (E5-4620) with 8 cores and 128GiB of RAM in each node, and a 2-node IBM POWER9 with 20 cores and 512GiB in each node (1TiB total). We evaluate on three datasets: (i) the sparse dataset released by Criteo Labs as part of their 2014 Kaggle competition (Criteo-Labs, 2013) (criteo-kaggle), (ii) the dense HIGGS dataset (Baldi et al., 2014) (higgs), and (iii) the dense epsilon dataset from the PASCAL Large Scale Learning Challenge (Epsilon, 2008) (epsilon). Results on epsilon and additional details can be found in the appendix.

Remark 1 (Hyperparameters).

The hyperparameters in Alg 1 can be used to optimally tune SySCD to different CPU architectures. However, a good default choice is

(4)

such that one epoch (n coordinate updates) is performed across the threads before each synchronization step. We use these values for all our experiments and did not tune our method further. Further, recall that the bucket size is set equal to the number of model elements per CPU cache line, and the number of numa nodes as well as the number of threads per node is detected automatically.

5.1 Scalability

We first investigate the thread scalability of SySCD. Results, showing the speedup in time per epoch (an epoch corresponds to n coordinate updates) over the sequential version, are depicted in Fig 6. We see that SySCD scales almost linearly on both systems, and thus the main scalability bottleneck (B2) of our reference implementation is successfully mitigated. The 4-node system shows a slightly lower absolute speedup beyond one node (8 threads), which is expected due to the higher overhead of accessing memory on different numa nodes compared to the 2-node system.

(a) higgs
(b) criteo-kaggle
Figure 6: Strong thread scalability of SySCD w.r.t. runtime per epoch with increasing thread counts, for the two different datasets and systems: a 2-node P9 machine (blue) and a 4-node x86_64 machine (orange).

Note that in our experiments we disable simultaneous multi-threading (SMT), since in practice we often find that enabling SMT leads to worse overall performance. Therefore, the maximal thread count corresponds to the number of physical cores present in the machine. To illustrate how SySCD scales when the number of threads exceeds the number of physical cores, we enabled SMT4 (4 hardware threads per core) on the P9 machine and re-ran the experiment from Fig. 6(b). The results are shown in Figure 16 in the appendix. As expected, we see linear scaling up to the number of physical CPU cores (in this case 40), after which we start to see diminishing returns due to the inherent inefficiency of SMT4 operation. We thus recommend disabling SMT when deploying SySCD.

5.2 Bottom Line Performance

Second, we compare the performance of our new SySCD algorithm to the PASSCoDe baseline implementation. Convergence is declared if the relative change in the learned model across iterations falls below a threshold (a sketch of such a criterion is given below). We have verified that all implementations exhibit the same test loss after training, apart from the PASSCoDe implementation, which can converge to an incorrect solution when using many threads (Hsieh et al., 2015). Fig 7 illustrates the results for the two systems. Comparing against PASSCoDe with the best performing thread count, SySCD achieves a consistent speedup on both the P9 and the x86_64 machine, averaged across datasets (up to ×12, cf. Sec. 1). The larger performance improvement observed for the 2-node system relative to the 4-node system, in particular on the higgs dataset, can be attributed to its higher memory bandwidth.
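For concreteness, a sketch of such a stopping criterion (ours; the tolerance value used in the experiments is not stated in the text):

// Sketch of a relative-change convergence test: stop when the model moved
// by less than 'tol' (relative to its norm) between consecutive epochs.
#include <algorithm>
#include <cmath>
#include <vector>

bool converged(const std::vector<double>& prev, const std::vector<double>& cur,
               double tol) {
    double diff = 0.0, norm = 0.0;
    for (size_t i = 0; i < cur.size(); ++i) {
        diff += (cur[i] - prev[i]) * (cur[i] - prev[i]);
        norm += cur[i] * cur[i];
    }
    return std::sqrt(diff) <= tol * std::max(std::sqrt(norm), 1e-12);
}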

(a) higgs
(b) criteo-kaggle
Figure 7: Training time of the reference PASSCoDe and our optimized SySCD implementation for different thread counts. Results are presented on two different datasets and CPU architectures: a 2-node (P9) and a 4-node (X86_64) machine.

5.3 Comparison with sklearn, VW, and H2O

We finally compare the performance of our new solver against widely used frameworks for training GLMs. We compare with scikit-learn (Pedregosa et al., 2011), using different solvers (liblinear, lbfgs, sag), with H2O (The H2O.ai team, 2015), using its multi-threaded auto solver, and with VW (Langford, 2007), using its default solver. Care was taken to ensure that the regularization strength was equivalent across all experiments, and that the reported time did not include parsing of text and loading of data. Results showing training time against test loss for the different solvers, on the two systems, are depicted in Fig 8. We add results for SySCD with single (SySCD 1T) and maximum (SySCD MT) thread counts. Overall, SySCD MT is substantially faster, on average, than the best performing alternative solver (up to ×42, cf. Sec. 1). The best competitor is VW for criteo-kaggle and H2O for higgs. H2O results are not shown in Fig 8(a) and 8(b) because we could not train the criteo-kaggle dataset in a reasonable amount of time, even by using the max_active_predictors option.

(a) criteo-kaggle - x86_64
(b) criteo-kaggle - P9
(c) higgs - x86_64
(d) higgs - P9
Figure 8: Comparing single- and multi-threaded implementations of SySCD against state-of-the-art GLM solvers available in scikit-learn, VW, and H2O.

6 Conclusion

We have shown that the performance of existing parallel coordinate descent algorithms, which assume a simplistic model of the parallel hardware, suffers significantly from system bottlenecks that prevent them from taking full advantage of modern CPUs. In this light we have proposed SySCD, a new system-aware parallel coordinate descent algorithm that respects the cache structures, data access patterns, and numa topology of modern systems to improve implementation efficiency, and that exploits fast data access by all parallel threads to reshuffle data and improve convergence. Our new algorithm achieves a gain of up to ×12 compared to a state-of-the-art system-agnostic parallel coordinate descent algorithm. In addition, SySCD enjoys strong scalability and convergence guarantees and is thus well suited for deployment in production.

References

  • P. Baldi, P. Sadowski, and D. O. Whiteson (2014) Searching for exotic particles in high-energy physics with deep learning. Nature Communications 5, pp. 4308. Cited by: §B.3, §5.
  • J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin (2011) Parallel coordinate descent for l1-regularized loss minimization. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pp. 321–328. External Links: ISBN 978-1-4503-0619-5 Cited by: §2.
  • Criteo-Labs (2013) Terabyte click logs dataset. Note: http://labs.criteo.com/2013/12/download-terabyte-click-logs/Online; Accessed: 2018-01-25 Cited by: §4, §5.
  • C. Dünner, S. Forte, M. Takac, and M. Jaggi (2016) Primal-dual rates and certificates. In Proceedings of The 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 48, New York, New York, USA, pp. 783–792. Cited by: Remark 2.
  • C. Dünner, A. Lucchi, M. Gargiani, A. Bian, T. Hofmann, and M. Jaggi (2018a) A distributed second-order algorithm you can trust. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 1358–1366. Cited by: §2.
  • C. Dünner, T. Parnell, and M. Jaggi (2017) Efficient use of limited-memory accelerators for linear learning on heterogeneous systems. In Advances in Neural Information Processing Systems 30, pp. 4258–4267. Cited by: §A.2, §A.3.2, §A.3.2, §4.4, Remark 2.
  • C. Dünner, T. Parnell, D. Sarigiannis, N. Ioannou, A. Anghel, G. Ravi, M. Kandasamy, and H. Pozidis (2018b) Snap ML: A Hierarchical Framework for Machine Learning. In Advances in Neural Information Processing Systems 31, pp. 250–260. Cited by: §A.3.1, §C.3, §4.4, Remark 2, Theorem 2.
  • Epsilon (2008) Pascal large scale learning challenge. Note: http://www.k4all.org/project/large-scale-learning-challenge Cited by: §B.3, §B.4, §5.
  • C. Hsieh, H. Yu, and I. Dhillon (2015) PASSCoDe: parallel asynchronous stochastic dual co-ordinate descent. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 2370–2379. Cited by: §A.1, §B.1, §1, Figure 1, §2, §3, §5.2.
  • Kaggle (2017) Kaggle machine learning and data science survey. Note: https://www.kaggle.com/surveys/2017 Cited by: §1.
  • Kaggle (2019) LIBSVM data: classification, regression, and multi-label. Note: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ Cited by: §C.1.
  • M. Kaufmann, T. P. Parnell, and K. Kourtis (2018) Elastic cocoa: scaling in to improve convergence. arXiv:1811.02322. External Links: 1811.02322 Cited by: §2, §4.2.
  • J. Langford (2007) Vowpal wabbit. Note: https://github.com/JohnLangford/vowpal_wabbit/wiki Cited by: §C.3, item 3, §5.3, §5.
  • C. Lee and K. Chang (2018) Distributed block-diagonal approximation methods for regularized empirical risk minimization. arXiv:1709.03043. Cited by: §2.
  • J. Liu and S. Wright (2015) Asynchronous stochastic coordinate descent: parallelism and convergence properties. SIAM Journal on Optimization 25 (1), pp. 351–376. Cited by: §2.
  • J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar (2015) An asynchronous parallel stochastic coordinate descent algorithm. The Journal of Machine Learning Research 16 (1), pp. 285–322. Cited by: §1, §2.
  • C. Ma, V. Smith, M. Jaggi, M. I. Jordan, P. Richtárik, and M. Takáč (2015) Adding vs. Averaging in Distributed Primal-Dual Optimization. In Proceedings of the 32th International Conference on Machine Learning, ICML, pp. 1973–1982. Cited by: §2.
  • T. P. Parnell, C. Dünner, K. Atasu, M. Sifalakis, and H. Pozidis (2017) Large-Scale Stochastic Learning Using GPUs. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 419–428. Cited by: §2, §3.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: item 3, §5.3, §5.
  • B. Recht, C. Re, S. Wright, and F. Niu (2011) Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701. Cited by: §1.
  • P. Richtarik and M. Takac (2016a) Distributed coordinate descent method for learning with big data. Journal of Machine Learning Research 17, pp. 1–25. Cited by: §2.
  • P. Richtarik and M. Takac (2016b) Parallel coordinate descent methods for big data optimization. Mathematical Programming 156, pp. 433–484. Cited by: §C.3, §1, §2.
  • S. Shalev-Shwartz and T. Zhang (2013) Stochastic dual coordinate ascent methods for regularized loss. JMLR 14 (1), pp. 567–599. External Links: ISSN 1532-4435 Cited by: §A.3.2, §A.3.2, §1, §2, §4.4, Remark 2.
  • V. Smith, S. Forte, C. Ma, M. Takáč, M. Jordan, and M. Jaggi (2018) CoCoA: a general framework for communication-efficient distributed optimization. Journal of Machine Learning Research 18. Cited by: §A.2, §A.3, §C.3, Figure 1, §2, §4.2, §4.2, §4.2, §4.4, Remark 2.
  • The H2O.ai team (2015) H2o: python interface for h2o. Note: http://www.h2o.aiPython package version 3.20.0.8 Cited by: §C.3, item 3, §5.3, §5.
  • S. J. Wright (2015) Coordinate descent algorithms. Mathematical Programming 151 (1), pp. 3–34. Cited by: §1, §2.
  • T. Yang (2013) Trading computation for communication: distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 629–637. Cited by: §2.
  • C. Zhang and C. Ré (2014) DimmWitted: a study of main-memory statistical analytics. Proc. VLDB Endow. 7 (12), pp. 1283–1294. External Links: ISSN 2150-8097 Cited by: §2.

Appendix A Appendix

a.1 Reference Algorithm

Algorithm 2 summarizes the procedure of PASSCoDe-wild (Hsieh et al., 2015), which serves as the baseline for the bottleneck analysis in Section 3 and as the reference scheme in Fig 7. The updates of the shared vector v in Step 10 of Algorithm 2 are performed without any locking.

1:Input: Training data matrix A ∈ ℝ^{d×n}
2:Initialize model α := 0 and shared vector v := Aα.
3:for t = 0, 1, 2, … do
4:     parfor i over shuffled coordinates {1, …, n} do
5:          Read current state of model α_i
6:          Read current state of shared vector v
7:          δ := coordinate update for α_i given α_i and v
8:          α_i := α_i + δ
9:          for j with A_{j,i} ≠ 0 do
10:               v_j := v_j + δ A_{j,i}
11:          end for
12:     end parfor
13:end for
Algorithm 2 Parallel SCD for training GLMs of the form (1)

a.2 Local Subproblems in SySCD

Let α_{[k]} denote the vector with only non-zero elements for indices i ∈ 𝒫_k, i.e., coordinates assigned to numa node k. Similarly, let α_{[k,p]} denote the vector with only non-zero elements for coordinates assigned to thread p on node k. Then, the local optimization task in Step 16 of Algorithm 1, assigned to thread p on numa node k, is defined as:

(5)

where

(6)
(7)
(7)
Derivation.

We have K numa nodes and P threads on each numa node. Then, (6) and (7) in Step 16 of Alg. 1 define the local optimization task solved by each thread. The respective formulations (6), (7) can be derived from a nested application of the block-separable upper bound used in CoCoA (Smith et al., 2018, Equation (10)). A similar two-level approach has been taken by (Dünner et al., 2017, Appendix B) in the context of hierarchical training across GPUs. This yields the following block-separable objective, which is solved across the numa nodes:

(8)
(9)

Thus, numa node k is assigned the following subproblem:

(10)

In order to compute the update on numa node k across the threads, we again apply the block-separable upper bound to (10). This yields the following separable problem:

(11)

The optimization problem (11) is solved repeatedly, in parallel, across the locally available threads. After each iteration the model is updated, and we locally keep track of the auxiliary information v_k on each numa node k. Consequently, thread p on numa node k is assigned the subproblem stated in (5).

Crucially, each thread updates a different, distinct subset of coordinates of α and only requires access to the respective columns of A. Thereby, memory access bottlenecks on α as well as on v are avoided in SySCD. In addition, v_k is local to each numa node and only needs to be synchronized periodically, mitigating the main scalability bottleneck (B3).

a.3 Convergence Analysis

In this section we prove the convergence result stated in Theorem 1. We proceed level by level and in the end nest the individual rates. We focus on strongly convex g_i for simplicity; however, every single component used in the proof also has non-asymptotic convergence guarantees for non-strongly convex g_i, as can be found in the respective references. For our analysis the following definition of the approximation quality of a solution, following Smith et al. (2018), will be useful:

Definition 1 (θ-approximate update).

A solution Δα of a subproblem 𝒢 is called θ-approximate iff

𝒢(Δα) − 𝒢(Δα⋆) ≤ θ [𝒢(0) − 𝒢(Δα⋆)],

where Δα⋆ denotes a minimizer of 𝒢.

a.3.1 Data-Parallelism across Numa-Nodes and Threads

Our hierarchical numa-aware optimization pattern described in Section 4.3 implements a hierarchical version of CoCoA across numa nodes and threads. First, each numa node is assigned the CoCoA subproblem (10), which is then solved across the available threads, again using CoCoA. This results in subproblem (5) being assigned to thread p on numa node k. Let us for now ignore how this problem is solved; we just assume it is solved θ-approximately.

Theorem 2 (Hierarchical CoCoA – adapted from (Dünner et al., 2018b)).

Consider Algorithm 1 for solving (1). Assume f is L-smooth and the g_i are μ-strongly convex. Let P be the number of threads per numa node and K the number of numa nodes. Let R be the number of synchronization steps performed on each numa node before global synchronization. Assume the local subtasks (5) assigned to each thread are solved θ-approximately. Then, after T outer rounds, the suboptimality can be bounded as

(12)

where the constants follow the hierarchical CoCoA rate and ε^(0) denotes the initial suboptimality.

Proof.

For analyzing this hierarchical method we can build on the convergence result derived by Dünner et al. (2018b, Appendix B.1) for hierarchical CoCoA in a heterogeneous GPU-CPU setting. In the context of Algorithm 1, the outer level of CoCoA spans the numa nodes (instead of physical CPUs) and the inner level the threads on each numa node (instead of GPUs). ∎

a.3.2 SCD using Buckets

To find a θ-approximate update to the subproblem (5), on each thread SySCD applies SCD with bucketized randomization. This means that in each step a bucket is selected uniformly at random and then H steps of SCD are performed on the coordinates assigned to it. We call this local optimization loop performed on each thread (Step 12 of Alg. 1) bucketized SDCA and derive the following convergence guarantee:

Theorem 3.

Consider a local optimization problem of the form (5). Assume f is L-smooth and the g_i are μ-strongly convex. Then SDCA with bucketized randomization, performing H SCD steps on each bucket, converges as

after s steps, where B denotes the bucket size.

Proof.

For analyzing bucketized SDCA we combine results on randomized block coordinate descent from (Dünner et al., 2017, Theorem 1) with classical convergence results for SDCA (Shalev-Shwartz and Zhang, 2013, Theorem 4).

Let a block of coordinates correspond to the coordinates within the selected bucket b. Then, the results in (Dünner et al., 2017, Theorem 1) state that after s block coordinate updates on (5) the suboptimality satisfies

(13)

where ε denotes the suboptimality of (5) and, accordingly, ε^(0) denotes the initial suboptimality when starting the thread-local optimization in Step 12 of Alg. 1, and θ_b denotes the approximation quality of the individual block updates, according to Definition 1.

For computing the θ_b-approximate block update, SySCD uses H steps of SDCA. The SDCA algorithm has a linear rate when applied to strongly convex objectives (Shalev-Shwartz and Zhang, 2013, Theorem 4), and thus the suboptimality decays linearly as

(14)

Hence, after H steps of SDCA, the approximation quality of the block update (Definition 1) satisfies

(15)

Combining this with the rate of block coordinate descent (13) yields the convergence guarantee presented in Theorem 3. ∎

a.3.3 Overall Rate

Building on the above convergence results for the individual components, we can derive an end-to-end rate for SySCD. To do so, we again use the definition of a θ-approximate update in Theorem 2, regarding the updates computed on each thread. Using Theorem 3 we have

(16)

Note that the optimization problem assigned to each thread is defined in (5). Combining this with Theorem 2 yields the convergence rate for Algorithm 1 presented in Theorem 1.

Remark 2 (Non-Strongly Convex g_i).

For the case where the g_i are non-strongly convex, CoCoA, hierarchical CoCoA, block coordinate descent, and SDCA all have sublinear convergence rates; see (Smith et al., 2018, Theorem 2), (Dünner et al., 2018b, Equation 2), (Dünner et al., 2017, Theorem 2), and (Shalev-Shwartz and Zhang, 2013, Theorem 2), (Dünner et al., 2016, Theorem 9), respectively. These guarantees for the individual components ensure a function decrease in each update step of SySCD; hence our method converges. However, to derive an end-to-end non-asymptotic convergence rate, the definition of multiplicative subproblem accuracy (Definition 1) is not well suited.

Appendix B Additional Experiments

In the following we present additional experiments to more broadly support the claims of the paper. These results were moved to the appendix due to space limitations.

b.1 Motivational Example

The motivational examples shown in Fig 1 are performed on a synthetic dense dataset. The equivalent results for a sparse dataset are depicted in Fig 9. The dataset has 1k features and 100k examples and uniform sparsity. We see that the algorithm is less prone to diverge when the data is sparse, because collisions on the shared vector are less likely. This is in line with the theoretical results presented by Hsieh et al. (2015). However, the number of epochs to converge increases, and together with implementation overheads this results in no payback beyond 8 threads.

Figure 9: Training time of PASSCoDe-wild with increasing thread count on a sparse synthetic dataset.
Figure 10: Bucket Optimization: Gain achieved by using buckets on the higgs dataset on the 4-node system. Solid lines indicate time, dashed lines depict the number of epochs.

b.2 Bottlenecks

The results of the bottleneck analysis referred to in Section 3 are depicted in Table 1 and Fig 11.

Table 1: Training time per epoch of the single-threaded solver on the higgs dataset (a) for the original algorithm, (b) when the model vector is accessed sequentially, and (c) when the random shuffling of coordinates is also removed.

      shuffling   random access to α   time per epoch [s]
(a)   yes         yes                  4.33
(b)   yes         no                   2.60
(c)   no          no                   2.18

Figure 11: Multi-threaded performance of PASSCoDe-wild without shared updates, and without shuffling, on the dense synthetic dataset.

b.3 Individual Optimizations

(a) higgs
(b) epsilon
Figure 12: Static and dynamic partitioning: Gain achieved by dynamic re-partitioning on the higgs and epsilon datasets on the 4-node system. Solid lines indicate time, dashed lines depict the number of epochs.
(a) higgs
(b) epsilon
Figure 13: Numa-level optimizations: Gain achieved by numa-awareness on the higgs and epsilon datasets on the 4-node system. Solid lines indicate time, dashed lines depict the number of epochs.
(a) x86_64
(b) P9
Figure 14: Strong thread scalability of our SySCD implementation w.r.t. training time per epoch with increasing thread counts.

In Section 4 we presented and evaluated the individual algorithmic components of SySCD on the criteo-kaggle dataset. In the following we evaluate the individual optimizations on two additional datasets: the dense HIGGS dataset (Baldi et al., 2014) (higgs), and the epsilon dataset from the PASCAL Large Scale Learning Challenge (Epsilon, 2008) (epsilon). See Fig 10 for the effect of using buckets on higgs. For the epsilon dataset the bucket optimization is not applied, due to the small model vector (2,000 elements) that fits in the CPU cache. See Fig 12 for the effect of dynamic re-partitioning. We observe that for the higgs dataset, reshuffling gives diminishing returns. This is because the data is very imbalanced and has a lot of redundancy in the examples; a small subset is thus representative of the full data, and reshuffling does not add much. See Fig 13 for the effect of the numa-level optimizations. For the accumulated gains see Sec. B.5.

b.4 Scaling

In Section 5.1 we demonstrated the thread scalability of our implementation on the criteo-kaggle and higgs datasets. In Fig 14 we add one more dataset: the epsilon dataset (Epsilon, 2008). Performance scales almost linearly across datasets and the two systems. Training on higgs on the 4-node machine is an exception: scaling gradually degrades when going from 1 to 2 and more numa nodes. By profiling its run-time, we observe that most of the time is spent on memory accesses to the training dataset. On the 2-node system, however, which has higher memory bandwidth and fewer numa nodes, those memory accesses are no longer the bottleneck.

In all our experiments we disable simultaneous multi-threading (SMT). Figure 16 illustrates how SySCD scales when the number of threads exceeds the number of physical cores. In this example we enabled SMT4 (4 hardware threads per core) on the P9 machine and re-ran the experiment from Fig. 6(b) on the criteo-kaggle dataset. We see linear scaling up to the number of physical CPU cores (in this case 40), after which we start to see diminishing returns due to the inherent inefficiency of SMT4 operation. We thus do not use SMT when deploying SySCD.

Figure 15: Training time w.r.t. thread count for the reference PASSCoDe-wild and our optimized SySCD implementation on a 2 node (P9) and a 4 node (X86_64) machine.
Figure 16: Scalability of SySCD w.r.t. runtime per epoch with increasing thread counts beyond the number of physical cores, on the P9 machine in SMT4 mode, for the criteo-kaggle dataset.

b.5 Bottom Line Performance

The comparison of SySCD and our baseline implementation PASSCoDe-wild on the epsilon dataset is shown in Fig 15. It shows the accumulated gain of all the optimizations.

b.6 Benchmarking Results

Results in Fig 17 augment the results of Section 5.3, where we compare the performance of SySCD to different GLM solvers. Here we use the epsilon dataset for benchmarking. Also for this dataset, SySCD MT is significantly faster than all alternative solvers, on both the 4-node and the 2-node system. We observe that for the epsilon dataset H2O is the slowest competitor. The performance of H2O is somewhat extreme: its multi-threaded solver takes over all the cores in the system and is able to achieve the expected test loss, but its performance varies dramatically across datasets. We expect this to be an issue with the large number of features (epsilon has 2,000): by artificially reducing the number of features to 200 using the max_active_predictors H2O parameter, we get an order of magnitude speedup in time, with, however, a dramatic degradation of the test loss.

(a) epsilon - x86_64
(b) epsilon - P9
Figure 17: Comparing our single- and multi-threaded implementations against different solvers in scikit-learn, VW, and H2O.

Appendix C Experimental Details for Reproducability

All experiments performed in this paper show the training of an ℓ2-regularized logistic regression classifier. This means the objective (1) is defined as

(17)

where x_i denote the columns of A, corresponding to training samples, and y_i denote the corresponding labels. The regularization parameter λ was verified to be a reasonable choice through cross validation for each of the criteo-kaggle, higgs, and epsilon datasets.

c.1 Datasets

All datasets used for performance evaluation are publicly available and can be downloaded from the libsvm repository Kaggle (2019). Data was used as provided, without any additional preprocessing. The data used for the motivational example in Fig 1 is a synthetic dense dataset. It was created by sampling the individual elements of the data matrix uniformly at random. The labels were also chosen uniformly at random.
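A sketch of how such a dataset can be generated (ours; the sampling interval and the label alphabet are assumptions, as the exact values are not preserved in the text):

// Sketch: generate a dense synthetic dataset with uniform random entries
// and uniform random binary labels. Interval [0,1] and labels {-1,+1}
// are assumptions for illustration.
#include <random>
#include <vector>

void make_synthetic(int d, int n, std::vector<double>& A, std::vector<int>& y) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> unif(0.0, 1.0);   // assumed interval
    A.resize((size_t)d * n);
    for (auto& a : A) a = unif(rng);
    y.resize(n);
    std::bernoulli_distribution coin(0.5);
    for (auto& label : y) label = coin(rng) ? 1 : -1;        // assumed labels
}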

dataset #examples #features
higgs 11’000’000 28
epsilon 400’000 2’000
criteo-kaggle 45’840’617 1’000’000
synthetic dense (Fig 1) 100’000 100
synthetic sparse (Fig 10) 100’000 1’000

c.2 Infrastructure

We use two systems: a 4-node Intel Xeon (E5-4620) with 8 cores and 128GiB of RAM in each node, and a 2-node IBM POWER9 with 20 cores and 512GiB in each node, 1TiB total. We disable simultaneous multi-threading (hyper-threading on the Xeons, SMT on the P9s) and fix the CPU frequency to the maximum supported (2.2GHz for x86, and 3.8GHz for P9).

c.3 Implementation Details

PASSCoDe-wild. The procedure of the PASSCoDe-wild baseline is given in Algorithm 2. The algorithm is implemented in OpenMP.

mini-batch SDCA. Claims about mini-batch SDCA (Richtarik and Takac, 2016b) are based on experimental results obtained using the publicly available code² by the authors. Functionality to generate and load data has been added.

²https://code.google.com/archive/p/ac-dc/

CoCoA. CoCoA is implemented across threads, where each thread represents a worker in CoCoA. We use the hyperparameter settings suggested by the authors Smith et al. (2018).

H2O. We use the python package version 3.20.0.8 provided by The H2O.ai team (2015), with the multi-threaded auto solver. H2O could not work directly with the numpy binary files containing the training and test examples for each dataset, so we first converted them to csv files; directly converting them to H2OFrames took too long (hours) for the large datasets.

VW. We use the code provided at (Langford, 2007) with its default solver.

SySCD. Our new algorithm is available as part of the Snap ML software package (Dünner et al., 2018b), and examples of how to use it can be found in the documentation.