1 Introduction
Today’s individual machines offer dozens of cores and hundreds of gigabytes of RAM that can, if used efficiently, significantly improve the training performance of machine learning models. In this respect parallel versions of popular machine learning algorithms such as stochastic gradient descent
(Recht et al., 2011) and stochastic coordinate descent (Liu et al., 2015; Hsieh et al., 2015; Richtarik and Takac, 2016b) have been developed. These methods either introduce asynchronicity to the sequential algorithms, or they use a minibatch approach, in order to enable parallelization and better utilization of compute resources. However, all of these methods treat machines as a simple, uniform, collection of cores. This is far from reality. While modern machines offer ample computation and memory resources, they are also elaborate systems with complex topologies, memory hierarchies, and CPU pipelines. As a result, maximizing the performance of parallel training requires algorithms and implementations that are aware of these systemlevel characteristics and respect their bottlenecks.Setup. In this work we focus on the training of generalized linear models (GLMs). Our goal is to efficiently solve the following partially separable convex optimization problem using the full compute power available in modern CPUs:
(1) 
The model is learned from the training data , the function is convex and smooth, and are general convex functions. The objective (1) covers primal as well as dual formulations of many popular machine learning models which are widely deployed in industry (Kaggle, 2017). For developing such a systemaware training algorithm we will build on the popular stochastic coordinate descent (SCD) method (Wright, 2015; ShalevShwartz and Zhang, 2013). We first identify its performance bottlenecks and then propose algorithmic optimizations to alleviate them.
Contributions.
The main contributions of this work can be summarized as follows:

[leftmargin=20pt]

We propose SySCD, the first systemaware coordinate descent algorithm that is optimized for

cache access patterns: We introduce buckets to design data access patterns that are well aligned with the system architecture.

thread scalability: We increase data parallelism across worker threads to avoid data access bottlenecks and benefit from the buckets to reduce permutation overheads.

numatopology: We design a hierarchical numaaware optimization pattern that respects nonuniform data access delays of threads across numanodes.


We give convergence guarantees for our optimized method and motivate a dynamic repartitioning scheme to improve its sample efficiency.

We evaluate the performance of SySCD on diverse datasets and across different CPU architectures, and we show that SySCD drastically improves the implementation efficiency and the scalability when compared to stateoftheart GLM solvers (scikitlearn Pedregosa et al. (2011), Vowpal Wabbit Langford (2007), and H2O The H2O.ai team (2015)), resulting in faster training on average.
2 Background
Stochastic coordinate descent (SCD) methods (Wright, 2015; ShalevShwartz and Zhang, 2013) have become one of the key tools for training GLMs, due to their ease of implementation, cheap iteration cost, and effectiveness in the primal as well as in the dual. Their popularity has been driving research beyond sequential stochastic solvers and a lot of work has been devoted to map these methods to parallel hardware. We will give a short summary in the following, putting emphasis on the assumptions made on the underlying hardware.
Previous works on parallel coordinate descent (Hsieh et al., 2015; Parnell et al., 2017; Richtarik and Takac, 2016b; Liu et al., 2015) assume that parallel processes are homogeneous and data as well as model information resides in shared memory which is accessible by all processes. Building on these assumptions, Hsieh et al. (2015); Liu et al. (2015); Liu and Wright (2015)
propose asynchronous methods for scaling up SCD: the model resides in shared memory and all processes simultaneously read and write this model vector. A fundamental limitation of such an approach is that its convergence relies on the fact that the model information used to compute each update is not too stale. Thus, asynchronous algorithms are prone to diverge when scaled up to a large number of processes. In addition, the heavy load on the model vector can cause significant runtime delays. Both limitations are more pronounced for dense data, thus we use a dense synthetic dataset to illustrate these effects in Fig
(a)a; the orange, dashed line shows that convergence suffers from staleness, the gray line shows the respective runtime assuming perfect thread scalability and the yellow lines depicts the measured runtime. The algorithm diverges when scaled across more than 8 threads. Taking another route, Richtarik and Takac (2016b); Bradley et al. (2011) propose a synchronous approach for parallelizing SCD. Such methods come with more robust convergence properties. However, depending on the inherent separability of , the potential of acceleration can be small. For synthetic, well separable problems, minibatch SDCA proposed by Richtarik and Takac (2016b) show almost linear scaling, whereas for correlated objectives or dense datasets, the potential for acceleration, as given in their theory diminishes. In addition, updates to the shared vector in the synchronous setting are guaranteed to conflict across parallel threads – minibatch SDCA uses atomic operations^{1}^{1}1code available at https://code.google.com/archive/p/acdc/downloads to serialize those updates; this does not scale as the thread count increases, and especially not in numa machines. We have applied this method to the same synthetic example used in Fig 1 and we observed virtually no speedup () when using threads.Scalability of existing methods: Training of a logistic regression classifier on a synthetic dense dataset with
k training examples and features – (a) training using PASSCoDewild (Hsieh et al., 2015) and (b) training using CoCoA (Smith et al., 2018) deployed across threads. Details can be found in the appendix.Orthogonal to parallel methods, distributed coordinatebased methods have also been the focus of many works, including (Yang, 2013; Ma et al., 2015; Richtarik and Takac, 2016a; Dünner et al., 2018a; Smith et al., 2018; Lee and Chang, 2018). Here the standard assumption on the hardware is that processes are physically separate, data is partitioned across them, and communication is expensive. To this end, stateoftheart distributed first and secondorder methods attempt to pair good convergence guarantees with efficient distributed communication. However, enabling this often means trading convergence for data parallelism (Kaufmann et al., 2018). We have illustrated this tradeoff in Fig (b)b where we employ CoCoA Smith et al. (2018)
across threads; using 32 threads the number of epochs is increased by
resulting in a speedup of assuming perfect thread scalability. This small payback makes distributed algorithms generally not well suited to achieving acceleration; they are primarily designed to enable training of large datasets that do not fit into a single machine (Smith et al., 2018).The fundamental tradeoff between statistical efficiency (how many iterations are needed to converge) and hardware efficiency (how efficient they can be executed) of deploying machine learning algorithms on modern CPU architectures has previously been studied in Zhang and Ré (2014). The authors identified data parallelism as a critical tuning parameter and demonstrate that its choice can significantly affect performance of any given algorithm.
The goal of this work is to go one step further and enable better tradeoffs by directly incorporate mitigations to critical systemlevel bottlenecks into the algorithm design. We exploit the shared memory performance available to worker threads within modern individual machines to enable new algorithmic features that improve scalability of parallel coordinate descent, while at the same time maintaining statistical efficiency.
3 Bottleneck Analysis
We start by analyzing stateoftheart implementations of sequential and parallel coordinate descent to identify bottlenecks and scalability issues. For the parallel case, we use an optimized implementation of PASSCoDe (Hsieh et al., 2015) as the baseline for this study, which is vectorized and reasonably efficient. The parallel algorithm operates in epochs and repeatedly divides the shuffled coordinates among the parallel threads. Each thread then operates asynchronously: reading the current state of the model , computing an update for this coordinate and writing out the update to the model as well as the shared vector . The auxiliary vector is kept in memory to avoid recurring computations. Writecontention on is solved opportunistically in a wild fashion, which in practice is the preferred approach over expensive locking (Parnell et al., 2017; Hsieh et al., 2015). The parallel SCD algorithm is stated in Appendix A.1 for completeness.
One would expect that, especially for large datasets (e.g., datasets that do not fit in the CPU caches), the runtime would be dominated by (a) the time to compute the inner product required for the coordinate update computation and (b) retrieving the data from memory. While these bottlenecks can generally not be avoided, our performance analysis identified four other bottlenecks that in some cases vastly dominate the runtime:
(B1) Access to model vector. When the model vector does not fit in the CPU cache, a lot of time is spend in accessing the model. The origin of this overhead is the random nature of the accesses to , there is very little cache line reuse: a cache line is brought from memory (64B or 128B), out of which only 8B are used. This issue affects both the parallel and the sequential implementation. For the latter this bottleneck dominates and we found that, by accessing the model in a sequential manner, we can reduce the runtime by .
(B2) Access to the shared vector. For the parallel implementation, we found that writing the updates to the shared vector across the different threads was the main bottleneck. On top of dominating the runtime, staleness in the shared vector can also negatively impact convergence.
(B3) Nonuniform memory access. When the parallel implementation is deployed across multiple numa nodes, bottleneck (B2) becomes catastrophic, often leading to divergence of the algorithm (see Fig. (a)a). This effect can be explained by the fact that the internode delay when writing updates is far more pronounced than the intranode delay.
(B4) Shuffling coordinates. A significant amount of time is spent permuting the coordinates before each epoch in both the parallel and the sequential case. For the latter, we found that by removing the permutation, effectively performing cyclic coordinate descent, we could achieve a further speedup in runtime on top of removing (B1).
4 Algorithmic Optimizations
In this section we present the main algorithmic optimizations of our new training algorithm which are designed to simultaneously address the system performance bottlenecks (B1)(B4) identified in the previous section as well as the scalability issue demonstrated in Fig. (b)b. Our systemaware parallel training algorithm (SySCD) is summarized in Alg. 1 and its convergence properties are analyzed in Sec. 4.4. The following subsections will be accompanied by experimental results illustrating the effect of the individual optimizations. They show training of a logistic regression classifier on the criteokaggle dataset (CriteoLabs, 2013) on a 4 node system with 8 threads per numa node (for the experimental setup, see Sec 5) . Results for two additional datasets can be found in the appendix.
4.1 Bucket Optimization
We have identified the cache line access pattern (B1) and the random shuffling computation (B4) as two critical bottlenecks in the sequential as well as the parallel coordinate descent implementation. To alleviate these in our new method, we introduce the concept of buckets: We partition the coordinates and the respective columns of into buckets and then train a bucket of consecutive coordinates at a time. Thus, instead of randomizing all coordinates at once, the order in which buckets are processed is randomized, as well as the order of coordinates within a bucket. This modification to the algorithm improves performance in several ways; (i) the model vector is accessed in a cacheline efficient manner, (ii) the computation overhead of randomizing the coordinates is reduced by , and (iii) CPU prefetching efficiency on accessing the different coordinates of is implicitly improved. For our test case this optimization leads to an average speedup of with only a small toll on convergence, as depicted in Fig. 3.
The bucket size will appear in our convergence rate (Theorem 1) and can be used to control the scope of the randomization to tradeoff between convergence speed and implementation efficiency. We illustrate the sensitivity of our algorithm to the bucket size in Fig. 3. We see that the bottom line training time decreases significantly across the board by introducing buckets. The optimal bucket size in Fig. 3 is eight which coincides with the cache line size of the CPU with respect to the model vector
accesses. Thus we do not need to introduce an additional hyperparameter and can choose the bucket size
at runtime based on the cache line size of the CPU, using linux sysfs.4.2 Increasing Data Parallelism
Our second algorithmic optimization mitigates the main scalability bottleneck (B2) of the asynchronous implementation: writecontention on the shared vector . We completely avoid this writecontention by replicating the shared vector across threads to increase data parallelism. To realize this data parallelism algorithmically we transfer ideas from distributed learning. In particular, we employ the CoCoA method (Smith et al., 2018) and map it to a parallel architecture where we partition the (buckets of) coordinates across the threads and replicate the shared vector in each one. The global shared vector is therefore only accessed at coarse grain intervals (e.g., epoch boundaries), where it is updated based on the replicas and broadcasted anew to each thread. Similar to CoCoA we can exploit the typical asymmetry of large datasets and map our problem such that the shared vector has dimensionality .
We have seen in Sec 2 that distributed algorithms such as CoCoA are generally not suited to achieve significant acceleration with parallelism. This behavior of distributed methods is caused by the static partitioning of the training data across workers which increases the epochs needed for convergence (Smith et al., 2018; Kaufmann et al., 2018) (e.g., see Fig (b)b). To alleviate this issue, we propose to combine our multithreaded implementation with a dynamic repartitioning scheme. That is, we shuffle all the (buckets of) coordinates at the beginning of each local optimization round (Step 9 of Alg. 1), and then, each thread picks a different set of buckets each time. Such a repartitioning approach is very effective for convergence when compared to a default static partitioning, as depicted in Fig. 5. It reduces the number of epochs by at the cost of only a small implementation overhead. To the best of our knowledge we are the first to consider such a repartitioning approach in combination with distributed methods and demonstrate a practical scenario where it pays off – in a classical distributed setting the cost of repartitioning would be unacceptable.
The intuition behind this approach is the following: In CoCoA (Smith et al., 2018) a blockseparable auxiliary model of the objective is constructed. In this model the correlation matrix is approximated by a blockdiagonal version where the blocks are aligned with the partitioning of the data. This allows one to decouple local optimization tasks. However, this also means that correlations between data points on different worker nodes are not considered. A dynamic repartitioning scheme has the effect of choosing a different block diagonal approximation of in each step. By randomly repartitioning coordinates, the offdiagonal elements of
are sampled uniformly at random and thus in expectation a good estimate of
is used. A rigorous analysis of this effect would be an interesting study for future work. However, note that SySCD inherits the strong convergence guarantees of the CoCoA method, independent of the partitioning, and can thus be scaled up safely to a large number of cores in contrast to our asynchronous reference implementation.4.3 NumaLevel Optimizations
Subsequently, we focus on optimizations related to the numa topology in a multinuma node system. Depending on the numa node where the data resides and the node on which a thread is running, data access performance can be nonuniform across threads. As we have seen in Fig. (b)b and discussed in Sec. 3 this amplifies bottleneck (B3). To avoid this in SySCD, we add an additional level of parallelism and treat each numa node as an independent training node, in the distributed sense. We then deploy a hierarchical scheme: we statically partition the buckets across the numa nodes, and within the numa nodes we use the dynamic repartitioning scheme introduced in Sec 4.2. We exploit the fact that the training dataset is readonly and thus it does not incur expensive coherence traffic across numa nodes. We do not replicate the training dataset across the nodes and the model vector is local to each node which holds the coordinates corresponding to its partition . Crucially, each node holds its own replica of the shared vector, which is reduced across nodes periodically. The frequency of synchronization can be steered in Alg. 1 by balancing the total number of updates between and . This again offers a trade off between fast convergence (see Theorem 1) and implementation efficiency. This hierarchical optimization pattern that reflects the numatopology results in a speedup of over a numaoblivious implementation, as shown in Fig 5. To avoid additional hyperparameters, we dynamically detect the numa topology of the system, as well as the number of physical cores per node, using libnuma and the sysfs interface. If the number of threads requested by the user is less or equal to the number of cores in one node, we schedule a single node solver. We detect the numa node on which the dataset resides using the move_pages system call.
4.4 Convergence Analysis
We derive an endtoend convergence rate for SySCD with all its optimizations as described in Alg. 1. We focus on strongly convex while every single component of SySCD is also guaranteed to converge for general convex , see Remark 2 in the Appendix.
Theorem 1.
Consider Algorithm 1 applied to (1). Assume is smooth and are strongly convex functions. Let be the number of numa nodes and the number of threads per numa node. Let be the bucket size. Denote the number of SDCA updates performed on each bucket, let be the number of buckets processed locally in each iteration and let be the number of communication rounds performed independently on each numa node before global synchronization. Then, after outer rounds the suboptimality can be bounded as
(2) 
where and
(3) 
Proof Sketch.
To derive a convergence rate of Alg. 1 we start at the outer most level. We focus on the two nested forloops in Step 6 and Step 10 of Alg. 1. They implement a nested version of CoCoA where the outer level corresponds to CoCoA across numa nodes and the inner level to CoCoA across threads. The number of inner iterations is a hyperparameter of our algorithm steering the accuracy to which the local subproblem assigned to each numa node is solved. Convergence guarantees for such a scheme can be derived from a nested application of (Smith et al., 2018, Theorem 3) similar to (Dünner et al., 2018b, Appendix B). Subsequently, we combine this result with the convergence guarantees of the local solver used by each thread. This solver, implementing the bucketing optimization, can be analyzed as a randomized block coordinate descent method, similar to (Dünner et al., 2017, Theorem 1), where each block corresponds to a bucket of coordinates. Each block update is then computed using SDCA (ShalevShwartz and Zhang, 2013). Again, the number of coordinate descent steps forms a hyperparameter to steer the accuracy of each block update. Combining all these results in a nested manner yields the convergence guarantee presented in Theorem 1. We refer to the Appendix A.3 for a detailed proof. ∎
5 Evaluation
In this section, we evaluate the performance of SySCD in two different singleserver multi numanode environments. We first analyze the scalability of our method and the performance gains achieved over the reference implementation. Then, we compare SySCD with other stateoftheart GLM solvers available in scikitlearn (Pedregosa et al., 2011)(0.19.2), H2O (The H2O.ai team, 2015) (3.20.0.8), and Vowpal Wabbit (VW) (Langford, 2007) (commit: 5b020c4). We take logistic regression with regularization as a test case. We use two systems with different CPU architectures and numa topologies: a 4node Intel Xeon (E54620) with 8 cores and 128GiB of RAM in each node, and a 2node IBM POWER9 with 20 cores and 512GiB in each node, 1TiB total. We evaluate on three datasets: (i) the sparse dataset released by Criteo Labs as part of their 2014 Kaggle competition (CriteoLabs, 2013) (criteokaggle), (ii) the dense HIGGS dataset (Baldi et al., 2014) (higgs), and (iii) the dense epsilon dataset from the PASCAL Large Scale Learning Challenge (Epsilon, 2008) (epsilon). Results on epsilon and additional details can be found in the appendix.
Remark 1 (Hyperparameters).
The hyperparameters in Alg 1 can be used to optimally tune SySCD to different CPU architectures. However, a good default choice is
(4) 
such that one epoch ( coordinate updates) is performed across the threads before each synchronization step. We will use these values for all our experiments and did not further tune our method. Further, recall that the bucket size is set to be equal to the cache line size of the CPU and the number of numa nodes as well as the number of threads is automatically detected.
5.1 Scalability
We first investigate the thread scalability of SySCD. Results, showing the speedup in time per epoch (an epoch corresponds to coordinate updates) over the sequential version, are depicted in Fig 6. We see that SySCD scales almost linearly across the two systems and thus the main scalability bottleneck (B2) of our reference implementation is successfully mitigated. The 4 node system show a slightly lower absolute speedup beyond 1node (8 threads), which is expected due to the higher overhead when accessing memory on different numa nodes compared to the 2node system.
Note that in our experiments we disable simultaneous multithreading (SMT), since in practice we often find enabling SMT leads to worse overall performance. Therefore, the maximal thread count corresponds to the number of physical cores present in the machine. In order to illustrate how SySCD scales when the number of threads exceeds the number of physical cores, we enabled SMT4 (4 hardware threads per core) on the P9 machine and reran the experiment from Fig. (b)b. The results are shown in Figure 16 in the appendix. As expected, we see linear scaling up to the number of physical CPU cores (in this case 40), after which we start to see diminishing returns due to the inherent inefficiency of SMT4 operation. We thus recommend disabling SMT when deploying SySCD.
5.2 Bottom Line Performance
Second, we compare the performance of our new SySCD algorithm to the PASSCoDe baseline implementation. Convergence is declared if the relative change in the learned model across iterations is below a threshold. We have verified that all implementations exhibit the same test loss after training, apart from the PASSCoDe implementation which can converge to an incorrect solution when using many threads (Hsieh et al., 2015). Fig 7 illustrates the results for two different systems. Comparing against PASSCoDe with the best performing thread count, SySCD achieves a speedup of (P9) and (X86_64) on average across datasets. The larger performance improvement observed for the 2node system relative to the 4node system, in particular on the higgs dataset, can be attributed to the increased memory bandwidth.
5.3 Comparison with sklearn, VW, and H2O
We finally compare the performance of our new solver against widely used frameworks for training GLMs. We compare with scikitlearn (Pedregosa et al., 2011), using different solvers (liblinear, lbfgs, sag), with H2O (The H2O.ai team, 2015), using its multithreaded auto solver and with VW (Langford, 2007), using its default solver. Care was taken to ensure that the regularization strength was equivalent across all experiments, and that the reported time did not include parsing of text and loading of data. Results showing training time against test loss for the different solvers, on the two systems, are depicted in Fig 8. We add results for SySCD with single (SySCD 1T) and maximum (SySCD MT) thread counts. Overall SySCD MT is over faster, on average, than the best performing alternative solver. The best competitor is VW for criteokaggle and H2O for higgs. H20 results are not shown in Fig (a)a and (b)b because we could not train the criteokaggle dataset in a reasonable amount of time ( hours), even by using the max_active_predictors option.
6 Conclusion
We have shown that the performance of existing parallel coordinate descent algorithms which assume a simplistic model of the parallel hardware, significantly suffers from system bottlenecks which prevents them from taking full advantage of modern CPUs. In this light we have proposed SySCD, a new systemaware parallel coordinate descent algorithm that respects cache structures, data access patterns and numa topology of modern systems to improve implementation efficiency and exploit fast data access by all parallel threads to reshuffle data and improve convergence. Our new algorithm achieves a gain of up to compared to a stateoftheart systemagnostic parallel coordinate descent algorithm. In addition, SySCD enjoys strong scalability and convergence guarantees and is thus suited to be deployed in production.
References

Searching for exotic particles in highenergy physics with deep learning.
. Nature communications 5, pp. 4308. Cited by: §B.3, §5.  Parallel coordinate descent for l1regularized loss minimization. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pp. 321–328. External Links: ISBN 9781450306195 Cited by: §2.
 Terabyte click logs dataset. Note: http://labs.criteo.com/2013/12/downloadterabyteclicklogs/Online; Accessed: 20180125 Cited by: §4, §5.
 Primaldual rates and certificates. In Proceedings of The 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 48, New York, New York, USA, pp. 783–792. Cited by: Remark 2.
 A distributed secondorder algorithm you can trust. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 1358–1366. Cited by: §2.
 Efficient use of limitedmemory accelerators for linear learning on heterogeneous systems. In Advances in Neural Information Processing Systems 30, pp. 4258–4267. Cited by: §A.2, §A.3.2, §A.3.2, §4.4, Remark 2.
 Snap ML: A Hierarchical Framework for Machine Learning. In Advances in Neural Information Processing Systems 31, pp. 250–260. Cited by: §A.3.1, §C.3, §4.4, Remark 2, Theorem 2.
 Pascal large scale learning challenge. Note: http://www.k4all.org/project/largescalelearningchallenge Cited by: §B.3, §B.4, §5.
 PASSCoDe: parallel asynchronous stochastic dual coordinate descent. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 2370–2379. Cited by: §5.2.
 Passcode: parallel asynchronous stochastic dual coordinate descent. In International Conference on Machine Learning, pp. 2370–2379. Cited by: §A.1, §B.1, §1, Figure 1, §2, §3.

Kaggle machine learning and data science survey
. Note: https://www.kaggle.com/surveys/2017 Cited by: §1.  LIBSVM data: classification, regression, and multilabel. Note: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ Cited by: §C.1.
 Elastic cocoa: scaling in to improve convergence. arXiv:1811.02322. External Links: 1811.02322 Cited by: §2, §4.2.
 Vowpal wabbit. Note: https://github.com/JohnLangford/vowpal_wabbit/wiki Cited by: §C.3, item 3, §5.3, §5.
 Distributed blockdiagonal approximation methods for regularized empirical risk minimization. arXiv:1709.03043. Cited by: §2.
 Asynchronous stochastic coordinate descent: parallelism and convergence properties. SIAM Journal on Optimization 25 (1), pp. 351–376. Cited by: §2.
 An asynchronous parallel stochastic coordinate descent algorithm. The Journal of Machine Learning Research 16 (1), pp. 285–322. Cited by: §1, §2.
 Adding vs. Averaging in Distributed PrimalDual Optimization. In Proceedings of the 32th International Conference on Machine Learning, ICML, pp. 1973–1982. Cited by: §2.
 LargeScale Stochastic Learning Using GPUs. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 419–428. Cited by: §2, §3.
 Scikitlearn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: item 3, §5.3, §5.
 Hogwild: A lockfree approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701. Cited by: §1.
 Distributed coordinate descent method for learning with big data. Journal of Machine Learning Research 17, pp. 1–25. Cited by: §2.
 Parallel coordinate descent methods for big data optimization. Mathematical Programming 156, pp. 433–484. Cited by: §C.3, §1, §2.
 Stochastic dual coordinate ascent methods for regularized loss. JMLR 14 (1), pp. 567–599. External Links: ISSN 15324435 Cited by: §A.3.2, §A.3.2, §1, §2, §4.4, Remark 2.
 CoCoA: a general framework for communicationefficient distributed optimization. 18. Cited by: §A.2, §A.3, §C.3, Figure 1, §2, §4.2, §4.2, §4.2, §4.4, Remark 2.
 H2o: python interface for h2o. Note: http://www.h2o.aiPython package version 3.20.0.8 Cited by: §C.3, item 3, §5.3, §5.
 Coordinate descent algorithms. Mathematical Programming 151 (1), pp. 3–34. Cited by: §1, §2.
 Trading computation for communication: distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 629–637. Cited by: §2.
 DimmWitted: a study of mainmemory statistical analytics. Proc. VLDB Endow. 7 (12), pp. 1283–1294. External Links: ISSN 21508097 Cited by: §2.
Appendix A Appendix
a.1 Reference Algorithm
a.2 Local Subproblems in SySCD
Let denote the vector with only nonzero elements for indices , i.e., coordinates assigned to numanode . Similarly, let denotes the vector with only nonzero elements for coordinates . Then, the local optimization tasks in Step 16 of Algorithm 1, assigned to thread on numanode , is defined as:
(5) 
where
(6)  
(7) 
Derivation.
We have numa nodes and threads on each numa node. Then, and in Step 16 of Alg. 1 define the local optimization task solved by each thread. The respective formulation (6), (7) can be derived from a nested application of the block separable upperbound used in CoCoA (Smith et al., 2018, Equation (10)). A similar twolevel approach has been taken by (Dünner et al., 2017, Appendix B) in the context of hierarchical training across GPUs. This yields the following block separable objective which is solved across the numa nodes
(8)  
(9) 
where . Thus, numanode is assigned the following subproblem:
(10) 
In order to compute the update on numa node across the threads, we again apply the to (10). This, yields the following separable problem:
(11) 
To find the optimization problem (11) is solved repeatedly, in parallel, across the locally available threads. After each iteration the model is updated as and we locally keep track of the auxiliary information on each numa node . Consequently thread and numa node is assigned the subproblem stated in (5).
Crucially, each thread updates a different, distinct subset of coordinates of and only requires access to the respective columns of . Thereby, memory access bottlenecks on as well as are avoided in SySCD. In addition is local to each numa node and only needs to be synchronized periodically, mitigating the main scalability bottleneck (B3).
∎
a.3 Convergence Analysis
In this section we will proof the convergence result stated in Theorem 1. We proceed level by level and then in the end nest the individual rates. We will focus on strongly convex for the reason of simplicity, however, every single component used in the proof has also nonasymptotic convergence guarantees for nonstrongly convex , as can be found in the respective references. For our analysis the following definition for the approximation quality of a solution, according to Smith et al. (2018), will be useful:
Definition 1 (approximate update).
A solution is called approximate for iff
where .
a.3.1 DataParallelism across NumaNodes and Threads
Our hierarchical numaaware optimization pattern described in Section 4.3 implements a hierarchical version of CoCoA across numanodes and threads. First, each numanode is assigned the CoCoA subproblem (10) which is then solved across the available threads, again using CoCoA. This results in subproblem (5) assigned to thread on numa node . Let us for now ignore how this problem is solved, we just assume it is solved approximately.
Theorem 2 (Hierarchical CoCoA– adapted from (Dünner et al., 2018b)).
Consider Algorithm 1 for solving (1). Assume is smooth and are strongly convex functions. Let be the number of threads per numa node and the number of numa nodes. Let be the number of synchronization steps performed on each numa node before global synchronization. Assume the local subtasks (5) assigned to each thread are solved approximately. Then, after outer rounds the suboptimality can be bounded as
(12) 
where and and denotes the initial suboptimality.
Proof.
For analyzing this hierarchical method we can build on the convergence result derived by Dünner et al. (2018b, in Appendix B.1) for hierarchical CoCoA in a heterogeneous GPUCPU setting. In the context of Algorithm 1 the outer level of CoCoA spans across numa nodes (instead of physical CPUs) and the inner level across threads on each numa nodes (instead of GPUs). ∎
a.3.2 SCD using Buckets
To find a approximate update to the subproblem (5) on each thread SySCD applies SCD with bucketized randomization. This means, in each step a bucked is selected uniformly at random and then steps of SCD are performed on the coordinates assigned to . We call this local optimization loop performed on each thread (Step 12 of Alg. 1) bucketized SDCA and derive the following convergence guarantee:
Theorem 3.
Consider the following optimization problem . Assume is smooth and are strongly convex. Then SDCA with bucketized randomization and SCD steps on each bucket, converges as
after steps where denotes the bucket size.
Proof.
For analyzing bucketized SDCA we will combine results of randomized block coordinate descent from (Dünner et al., 2017, Theorem 1) with classical convergence results of SDCA (ShalevShwartz and Zhang, 2013, Theorem 4).
Let a block of coordinates corresponds to the coordinates within the selected bucket . Then, results in (Dünner et al., 2017, Theorem 1) say that after block coordinate updates on (5) the suboptimality satisfies
(13) 
where denotes the suboptimality of (5) and, accordingly, denotes the initial suboptimality when starting the threadlocal optimization in Step 12 of Alg. 1. denotes the approximation quality of the individual block updates, according to Definition 1.
For computing a the approximate update SySCD uses steps of SDCA. The SDCA algorithm has a linear rate when applied to strongly convex objectives (ShalevShwartz and Zhang, 2013, Theorem 4) and thus the suboptimality decays linearly as
(14) 
with . Hence, after steps of SDCA the approximation quality of the block update (Definition 1) satisfies
(15) 
Combining this with the rate of block coordinate descent (13) yields the convergence guarantee presented in Theorem 3. ∎
a.3.3 Overall Rate
Building on the above convergence results of the individual components we can derive an endtoend rate for SySCD. Therefore we again exploit the definition of a approximate update in Theorem (2), regarding the updates computed on each thread. Using Theorem 3 we have
(16) 
Note that the optimization problem assigned to each thread is defined as (5), thus we have and . Combining this with Theorem 2 yields the convergence rate for Algorithm 1 presented in Theorem 1.
Remark 2 (NonStrongly Convex ).
For the case where are nonstrongly convex CoCoA, hierarchical CoCoA, blockcoordinate descent, as well as SDCA have a sublinear convergence rates, see (Smith et al., 2018, Theorem 2), (Dünner et al., 2018b, Equation 2) (Dünner et al., 2017, Theorem 2) and (ShalevShwartz and Zhang, 2013, Theorem 2)(Dünner et al., 2016, Theorem 9), respectively. These guarantees for the individual components guarantee a function decrease in each update step of SySCD. Hence our method will converge. However, to derive an endtoend nonasymptotic convergence rate the definition of multiplicative subproblem accuracy (Definition 1) is not well suited.
Appendix B Additional Experiments
In the following we will present additional experiments to more broadly support the claims of the paper. These results were offloaded to the appendix due to space limitations.
b.1 Motivational Example
The motivational examples shown in Fig 1 are performed on a synthetic dense dataset. The equivalent results for a sparse dataset are depicted in Fig 10. The dataset has 1k features and 100k examples and a uniform sparsity of . We see that the algorithm is less prone to diverge if the data is more sparse because collisions on the shared vector are less likely. This is in line with the theoretical results presented by Hsieh et al. (2015). However, the number of epochs to converge as well as implementation overheads resulting in no payback beyond 8 threads.
b.2 Bottlenecks
b.3 Individual Optimizations
In Section 4 we have presented and evaluated the individual algorithmic components of SySCD on the criteokaggle dataset. In the following we will evaluate the individual optimizations on two additional datasets: the dense HIGGS dataset (Baldi et al., 2014), and the epsilon dataset from the PASCAL Large Scale Learning Challenge (Epsilon, 2008) (epsilon). See Fig 10 for the effect of using buckets on higgs. For the epsilon dataset the bucket optimization is not applied due to the small model vector ( elements) that fits in CPU cache. See Fig 12 for the effect of dynamic repartitioning. We observe that for the higgs dataset, reshuffling gives diminishing return. This is because the data is very imbalances and has a lot of redundancy in the examples. Thus a small subset is representative of the full data and reshuffling does not add much. See Fig 13 for the effect of the numalevel optimizations. For the accumulated gains see Sec. B.5.
b.4 Scaling
In Section 5.1 we have demonstrated the thread scalability of our implementation on the criteokaggle and the higgs dataset. In Fig 14 we add one more dataset: the epsilon dataset (Epsilon, 2008). Performance scales almost linearly across datasets and the two systems. Training on higgs on the 4 node machine is an exception: scaling gradually degrades when going from 1 to 2 and more numa nodes. By profiling its runtime, we observe that most of the time is spent on memory accesses to the training dataset. On the 2 node system, however, which has higher memory bandwidth and less numa nodes, those memory accesses are no longer the bottleneck.
In all our experiments we disable simultaneous multithreading (SMT). Figure 16 illustrates how SySCD would scale when the number of threads exceeds the number of physical cores. In this example we enabled SMT4 (4 hardware threads per core) on the P9 machine and reran the experiment from Fig. (b)b on the criteokaggle dataset. We see linear scaling up to the number of physical CPU cores (in this case 40), after which we start to see diminishing returns due to the inherent inefficiency of SMT4 operation. We thus do not use SMT when deploying SySCD.
b.5 Bottom Line Performance
The comparison of SySCD and our baseline implementation PASSCoDewild on the epsilon dataset is shown in Fig 16. It shows the accumulated gain of all the optimizations.
b.6 Benchmarking Results
Results in Fig 17 augment the results of Section 5.3 where we compare the performance of SySCD to different GLM solvers. Here we use the epsilon dataset for benchmarking. Also for this dataset SySCD MT is significantly than all alternative solver; faster on the 4 node system and faster on the 2node system. We observe that for the epsilon dataset H2O is the slowest competitor. The performance of H2O is somewhat extreme: its multithreaded solver takes over all the cores in the system and is able to achieve the expected test loss, but the performance varies dramatically across datasets. We expect this to be an issue with large number of features (epsilon has ): by artificially reducing the number of features to 200 using the max_active_predictors H2O parameter, we get an order of magnitude speedup in time, with, however, a dramatic degradation of the test loss.
Appendix C Experimental Details for Reproducability
All experiments performed in this paper show the training of an regularized logistic regression classifier. This means the objective (1) is defined as
(17) 
where denote the columns of , corresponding to training samples and denote the corresponding labels. The regularization parameter was verified to be a reasonable choice through cross validation; for the criteo dataset, for the higgs dataset and for the epsilon dataset.
c.1 Datasets
All datasets used for performance evaluation are publicly available and downloadable from the libsvm repository Kaggle (2019). Data was used as provided without any additional preprocessing. The data used for the motivational example in Fig 1 is a synthetic dense dataset. It was created by sampling the individual elements of the data matrix uniformly at random from the interval . The labels where also chosen uniformly at random from .
c.2 Infrastructure
We use two systems: a 4node Intel Xeon (E54620) with 8 cores and 128GiB of RAM in each node, and a 2node IBM POWER9 with 20 cores and 512GiB in each node, 1TiB total. We disable simultaneous multithreading (hyperthreading on the Xeons, SMT on the P9s) and fix the CPU frequency to the maximum supported (2.2GHz for x86, and 3.8GHz for P9).
c.3 Implementation Details
PASSCoDewild. The procedure of the PASSCoDewild baseline is given in Algorithm 2. The algorithm is implemented in OpenMP.
minibatch SDCA. Claims about minibatch SDCA(Richtarik and Takac, 2016b) are based on experimental results obtained using the publicly available code^{2}^{2}2https://code.google.com/archive/p/acdc/ by the authors. Functionalities to generate and load data have been added.
CoCoA. CoCoA is implemented across threads, where each thread represents a worker in CoCoA. We use and as suggested by the authors Smith et al. (2018).
H2O. We used the python package version 3.20.0.8. provided by The H2O.ai team (2015). We use the multithreaded auto solver. H2O could not work directly with the numpy binary files containing the training and test examples for each dataset, so we first convert them to csv files; directly converting them to H2OFrames took too long (hours) for the large datasets.
VW. We use the code provided at (Langford, 2007) with its default solver.
SySCD. Our new algorithm is available as part of the Snap ML software package (Dünner et al., 2018b) and example of how to use it can be found in the documentation.
Comments
There are no comments yet.