1 Introduction
Today’s individual machines offer dozens of cores and hundreds of gigabytes of RAM that can, if used efficiently, significantly improve the training performance of machine learning models. To this end, parallel versions of popular machine learning algorithms such as stochastic gradient descent [11] and stochastic coordinate descent [8, 5] have been developed. These methods introduce asynchronicity into the sequential algorithms in order to enable parallelization and better utilization of compute resources. However, they treat machines as a simple, uniform collection of cores. This is far from reality. While modern machines offer ample computation and memory resources, they are also elaborate systems with complex topologies, memory hierarchies, and CPU pipelines. As a result, maximizing the performance of parallel training requires implementations that are aware of these system-level details and address their bottlenecks.
In this paper, we focus on the popular stochastic coordinate descent algorithm [15, 12] and take a system-aware approach to building a parallel model trainer. We start with a system-oblivious, state-of-the-art asynchronous multi-threaded implementation written in OpenMP [3]. As a first step, we identify bottlenecks and scalability issues within the execution of a single epoch. We address these issues by modifying the algorithm to be more aligned with the system architecture, leading to a faster runtime per epoch on average. However, these modifications come at the cost of convergence: they increase the number of epochs required to converge. To address this, we combine our previous optimizations with a novel dynamic data partitioning algorithm that achieves both efficient execution and fast convergence. These combined optimizations lead to an average speedup in convergence time of compared with our previous implementation, and a speedup of on average, when comparing against scikit-learn and H2O.
2 Baseline Implementation
We use the Snap ML accelerated ML framework [3] as the basis of this study. Snap ML offers state-of-the-art sequential and multi-threaded implementations of SDCA [12], with the multi-threaded implementation being an asynchronous parallel training algorithm as detailed in Algorithm 1. The algorithm operates in epochs and repeatedly divides the shuffled coordinates amongst the parallel threads. Each thread then operates asynchronously: reading the current state of the model, computing an update for its coordinate, and writing out the update to the model as well as to the shared vector. While no two threads operate on the same coordinate of the model, the shared vector is accessed and updated by all the threads. To avoid an expensive locking mechanism, this is done opportunistically in a “wild” fashion, i.e., without synchronization. Previous studies [5, 9] showed that this approach performs reasonably well when the probability of concurrent updates to the shared vector is small, e.g., for extremely sparse datasets and for small thread counts. However, as we will see, if the thread count increases, or the solver is deployed on a multi-NUMA-node machine without taking NUMA affinity into account, the convergence behavior as well as the execution efficiency deteriorate drastically.
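To make the read, compute, and write steps of one coordinate update concrete, the following is a minimal sequential sketch of an SDCA epoch. It is not the Snap ML code and does not reproduce Algorithm 1: it assumes ridge regression (squared loss) rather than logistic regression, because the squared loss has a closed-form coordinate update, and the names `sdca_epoch`, `train`, `alpha`, and `shared` are hypothetical.

```python
import random


def sdca_epoch(X, y, alpha, shared, lam, rng):
    """One sequential SDCA epoch for ridge regression (assumed squared loss).

    X      : list of examples, each a list of feature values
    y      : list of labels
    alpha  : per-example model coordinates, updated in place
    shared : shared vector, kept equal to sum_i alpha[i] * X[i] / (lam * n)
    """
    n = len(X)
    order = list(range(n))
    rng.shuffle(order)                      # permute the coordinates once per epoch
    for i in order:
        x = X[i]
        # Read: inner product of the example with the shared vector.
        pred = sum(xj * sj for xj, sj in zip(x, shared))
        sq_norm = sum(xj * xj for xj in x)
        # Compute: closed-form coordinate update for the squared loss.
        delta = (y[i] - pred - alpha[i]) / (1.0 + sq_norm / (lam * n))
        # Write: update the model coordinate and the shared vector.
        alpha[i] += delta
        scale = delta / (lam * n)
        for j, xj in enumerate(x):
            shared[j] += scale * xj


def train(X, y, lam=0.1, epochs=200, seed=0):
    rng = random.Random(seed)
    alpha = [0.0] * len(X)
    shared = [0.0] * len(X[0])
    for _ in range(epochs):
        sdca_epoch(X, y, alpha, shared, lam, rng)
    return alpha, shared
```

In a “wild” multi-threaded variant, several threads would run the loop body concurrently on disjoint coordinates `i`, while the writes to `shared` would race with each other; this is the contention discussed above.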
To illustrate these issues, we train a logistic regression model on multiple threads using two synthetic datasets of k training examples each: one dense with features and one sparse with k features and a uniform sparsity of . The results for training these datasets running on one and four NUMA nodes are depicted in Fig 1. We see that when running on a single NUMA node, training on the sparse dataset (Fig 1b) scales well with the number of threads, in direct contrast to training on the dense dataset. This difference can be attributed to the highly reduced true sharing among parallel updates to the shared vector for the non-skewed and highly sparse dataset. When running on multiple NUMA nodes, we observe a different behavior for both the dense (Fig 1a) and the sparse (Fig 1b) datasets. The performance of the algorithm deteriorates significantly because “wild” updates on the shared vector result in expensive cache-line coherence traffic across the NUMA nodes. On the dense dataset, this is even more pronounced, due to the higher probability of concurrent updates to the same cache line across the threads.
3 Optimizing GLM training on CPU
Single-Threaded Implementation. We start by profiling the already vectorized and efficient sequential implementation of the SDCA algorithm. Naively, we would expect that for large datasets (e.g., datasets that do not fit in the CPU caches), the runtime would be dominated by a) the inner product computations required for the coordinate update and b) retrieving the data from memory, with the ratio between the two depending on the number of features per training example. In our analysis, we detected two additional bottlenecks:

When the model does not fit in the cache, a lot of time is spent accessing the model. Due to the random nature of the accesses to the model vector, there is very little cache-line reuse: a cache line is brought from memory (64B or 128B), out of which only 8B are used.

A significant amount of time is spent in the permutation of the example indices before each epoch.
To alleviate these issues, we introduce the concept of buckets. We partition the training examples into buckets, and train on a bucket of consecutive training examples at a time. The bucket size is chosen at runtime based on the cache line size of the CPU, which we obtain from Linux sysfs. This modification to the algorithm improves performance in several respects: (i) the model vector is accessed in a cache-line-efficient manner, (ii) the number of indices to be randomized is reduced by a factor equal to the bucket size (e.g., 8- or 16-fold), and (iii) CPU prefetching efficiency on accessing the different coordinates of the training examples is implicitly improved.
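The bucket mechanism can be sketched as follows. This is an illustrative sketch rather than the Snap ML implementation: the helper names are hypothetical, the sysfs path is one common location for the cache line size on Linux, and the fallback of 64 bytes is an assumption for systems where that file is unavailable.

```python
import random

# One common sysfs location for the cache line size on Linux (assumed here).
SYSFS_LINE_SIZE = "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"


def cache_line_bytes(path=SYSFS_LINE_SIZE, default=64):
    """Read the cache line size from sysfs, falling back to a default."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return default


def bucketed_order(num_examples, bucket_size, rng):
    """Yield example indices bucket by bucket, visiting buckets in random order."""
    num_buckets = (num_examples + bucket_size - 1) // bucket_size
    buckets = list(range(num_buckets))
    rng.shuffle(buckets)                    # shuffle num_buckets entries, not num_examples
    for b in buckets:
        start = b * bucket_size
        stop = min(start + bucket_size, num_examples)
        for i in range(start, stop):        # consecutive examples share model cache lines
            yield i
```

With 8-byte model entries, `bucket_size = cache_line_bytes() // 8` gives 8 examples per bucket for a 64B line and 16 for a 128B line, which matches the 8- or 16-fold reduction in shuffled indices mentioned above.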
Note that bucketing can decrease the randomness with which training examples are chosen, and thus degrade the convergence of the algorithm. This is especially true for datasets with a small number of training examples. However, as we will see in the experimental section, this tradeoff pays off. Further, we observe that the bucket optimization reaps limited benefits if the model vector is small enough to fit in the last-level cache of the CPU. We thus decide at runtime whether to use buckets: if the model vector does not fit in the last-level cache of the CPU (typically this cutoff point is in the range of k entries), we use buckets; otherwise we do not.
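A minimal sketch of this runtime decision is given below. The function name and its parameters are hypothetical, the cache sizes are passed in as arguments rather than detected, and the 8-byte entry size is an assumption taken from the 8B figure quoted earlier.

```python
def choose_bucket_size(model_entries, llc_bytes, line_bytes, entry_bytes=8):
    """Return the bucket size in examples; a size of 1 disables bucketing."""
    if model_entries * entry_bytes <= llc_bytes:
        return 1                            # model fits in the last-level cache
    return max(1, line_bytes // entry_bytes)
```

For example, a model of 100 million entries on a CPU with a 32MiB last-level cache and 64B lines would use buckets of 8 examples, while a model of 1000 entries would not be bucketed.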
Multi-Threaded Implementation. We now turn to an asynchronous multi-threaded implementation of our optimized sequential SDCA algorithm. Unsurprisingly, the shared vector updates over shared memory across the threads turn out to be a scalability bottleneck: simply disabling those updates results in improved scaling, as depicted in Fig 2a. The next scalability bottleneck is the sequential shuffling of the training example indices, also shown in Fig 2a.
Based on these observations, we propose to increase the data parallelism of the algorithm to improve scalability. To achieve this, we transfer ideas from distributed learning [13], where the training examples are partitioned across worker nodes that independently work on a local version of the shared vector, which is synchronized periodically. We map this approach to a parallel architecture where we partition the examples across the threads and replicate the shared vector in each one. As a result, the global shared vector needs to be accessed by different threads much less frequently.
Again, this system optimization has a price: convergence suffers. The static partitioning of the training examples across processes is known to increase the number of epochs needed to converge [13]. We illustrate this on a toy example in Fig 2b. To alleviate this issue, we propose a dynamic partitioning for the multi-threaded implementation: we shuffle all the (buckets of) examples at the beginning of each epoch, and each thread picks a different set of buckets in each epoch. Such a repartitioning approach is very effective but has not been adopted by distributed algorithms previously because it is too expensive in a distributed environment.
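The difference between the two partitioning schemes can be sketched as follows. The function names are hypothetical, and the even split into contiguous slices is an assumption made for illustration; the text only specifies that the buckets are reshuffled every epoch and that each thread receives a different set.

```python
import random


def static_partition(num_buckets, num_threads):
    """Assign each thread the same contiguous range of buckets in every epoch."""
    bounds = [t * num_buckets // num_threads for t in range(num_threads + 1)]
    return [list(range(bounds[t], bounds[t + 1])) for t in range(num_threads)]


def dynamic_partition(num_buckets, num_threads, rng):
    """Reshuffle all buckets, then hand each thread a slice of the new order."""
    order = list(range(num_buckets))
    rng.shuffle(order)
    bounds = [t * num_buckets // num_threads for t in range(num_threads + 1)]
    return [order[bounds[t]:bounds[t + 1]] for t in range(num_threads)]
```

Calling `dynamic_partition` once per epoch gives every thread a fresh, disjoint set of buckets, whereas `static_partition` returns the same assignment each time. In shared memory the reshuffle only moves indices, not data, which is why it is cheap here and expensive in a distributed setting.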
NUMA-Level Optimizations. Subsequently, we focus on optimizations related to the NUMA topology in a multi-NUMA-node system. We treat each NUMA node as an independent training node in a distributed setting, and deploy a hierarchical scheme: we statically partition the training examples across the nodes in a distributed fashion, and within each NUMA node we perform the dynamic partitioning introduced in Sec 3. Because the training dataset is read-only and thus does not incur expensive coherence traffic across nodes, we do not replicate it across the nodes. Each node holds its own replica of the shared vector, which is reduced across nodes at the end of each epoch. The model vector is also local to each node, which holds the coordinates corresponding to the part of the training examples it handles.
We dynamically detect the NUMA topology of the system, as well as the number of physical cores per node, using libnuma and the sysfs interface. If the number of threads requested by the user is less than or equal to the number of cores in one node, we schedule a single-node solver. Otherwise, we evenly distribute the requested number of threads over the minimum number of nodes that can accommodate them w.r.t. physical cores. We detect the NUMA node on which the dataset resides using the move_pages system call, and always include that node in our selection.
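The thread placement rule can be sketched as a small planning function. This is an illustration under stated assumptions, not the actual scheduler: the topology and the dataset's node are passed in as plain arguments instead of being queried through libnuma, sysfs, and move_pages, the function name is hypothetical, and the handling of requests that exceed the total core count (all nodes are used) is an assumption.

```python
def plan_threads(requested, cores_per_node, num_nodes, data_node):
    """Map NUMA node id -> thread count for a requested total thread count."""
    if requested <= cores_per_node:
        return {data_node: requested}       # single-node solver on the data's node
    # Minimum number of nodes whose physical cores can host all threads
    # (capped at the number of nodes available).
    needed = min(num_nodes, -(-requested // cores_per_node))
    # Always include the node holding the dataset, then add the others.
    others = [n for n in range(num_nodes) if n != data_node]
    nodes = [data_node] + others[:needed - 1]
    # Spread the threads as evenly as possible over the selected nodes.
    base, extra = divmod(requested, needed)
    return {n: base + (1 if k < extra else 0) for k, n in enumerate(nodes)}
```

For instance, on a 4-node machine with 8 cores per node and the dataset on node 1, a request for 20 threads selects three nodes and places 7, 7, and 6 threads on nodes 1, 0, and 2.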
4 Evaluation
In this section, we evaluate the performance of our optimized implementation within the Snap ML framework in single-server, multi-NUMA-node environments. First, we compare the different multi-threaded implementations and investigate the performance sensitivity of the implementation to the optimizations introduced in Sec 3. We then compare with the widely used scikit-learn ML framework [10] (0.19.2), as well as the H2O framework [14] (3.20.0.8). (Footnote 1: We also tried to evaluate VowpalWabbit [7] using its scikit-learn API, but were unable to do so, primarily because the data conversion in the tovw function, which is part of the fit function, takes a significant amount of time (e.g., 4800s for higgs) and thereby distorts the measured training time.)
We use two systems with different CPU architectures and NUMA topologies: a 4-node Intel Xeon (E5-4620) with 128GiB of RAM at each node, 512GiB total, and a 2-node IBM Power9 with 512GiB at each node, 1TiB total. We disable simultaneous multi-threading and fix the CPU frequency to the maximum supported (2.2GHz for x86, and 3.8GHz for P9). We evaluate against 3 datasets: (i) the sparse dataset released by Criteo Labs as part of their 2014 Kaggle competition [2] (criteo-kaggle), (ii) the dense HIGGS dataset [1] (higgs), and (iii) the dense epsilon dataset from the PASCAL Large Scale Learning Challenge [4] (epsilon). Data loading time is not included in the training time in any of the results.
Bottom line performance. First, we evaluate the performance in terms of time to convergence between the “wild” implementation and our new “domesticated” one. Convergence is declared if the relative change in the learned model from one epoch to the next is below a threshold. We have verified that all implementations exhibit the same test loss after training, apart from the “wild” implementation, which can converge to an incorrect solution when using many threads [6]. Fig 3 illustrates the results for all 3 datasets, across the two different systems. Comparing against the best “wild” version that converges to a similar test loss (4 and 8 threads, for the 4-node and 2-node systems), the “domesticated” optimizations result in a speedup of , , and , for criteo-kaggle, higgs, and epsilon, respectively. For the 2-node machine, the speedups are , , and , for criteo-kaggle, higgs, and epsilon, and on average. The “wild” implementation exhibits significantly better performance on the 2-node system relative to the 4-node system: this is due to increased memory bandwidth.
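The convergence criterion used above can be sketched as follows. The text does not specify the norm or the threshold, so the Euclidean norm and the default tolerance in this sketch are assumptions, and the function names are hypothetical.

```python
import math


def relative_change(prev, curr):
    """Euclidean norm of the model change relative to the previous model's norm."""
    diff = math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))
    base = math.sqrt(sum(p * p for p in prev))
    return diff / base if base > 0.0 else float("inf")


def has_converged(prev, curr, tol=1e-3):
    """Declare convergence when the model barely moved during the last epoch."""
    return relative_change(prev, curr) < tol
```

A training loop would keep a copy of the model from the previous epoch and stop once `has_converged` returns true. An all-zero previous model is treated as not converged, so the first epoch never terminates training.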
Scalability. Second, we focus on the strong scalability behavior of the “domesticated” implementation w.r.t. time per epoch. Results, showing the speedup over the sequential version, are depicted in Fig 4. Performance scales almost linearly for all the datasets, across the two systems. The 4-node system shows a slightly lower absolute speedup beyond one node (8 threads), which is expected due to the higher overhead of accessing memory on different NUMA nodes compared to the 2-node system. Training on higgs on the 4-node machine is an exception: scaling gradually degrades when going from 1 to 2 and more NUMA nodes. By profiling its runtime, we observe that most of the time is spent on memory accesses to the training dataset. On the 2-node system, however, which has higher memory bandwidth and fewer NUMA nodes, those memory accesses are no longer the bottleneck.
Training example data partitioning. We now evaluate the effect of the dynamic data partitioning scheme, presented in Sec 3, against a default static partitioning. Fig 5a compares the two schemes on the criteo-kaggle dataset, for the 4-node system. By dynamically shuffling the training examples across worker threads within each node after every epoch, we gain an improvement in total training time of on average compared to the static partitioning, realizing most of the achieved average reduction in epochs. A similar improvement holds for the epsilon dataset, with an average improvement of . Higgs, however, is less sensitive to data partitioning choices, with the different schemes performing virtually the same. Similar observations are made for the 2-node machine ( and , for criteo-kaggle and epsilon).
Buckets. Next, the bucket optimization is evaluated in Fig 5b. This optimization results in an average speedup of for criteo-kaggle and for higgs. When training on epsilon we cannot benefit from this optimization at all, since it has only k training examples, and the model vector fits completely in the last-level cache of the CPU. Based on the heuristic described in Sec 3, our implementation does not apply this optimization to epsilon. The 2-node machine exhibits similar results: and speedup, for criteo-kaggle and higgs.
NUMA optimizations. We now focus on the impact of the NUMA optimizations on performance. Results, depicted in Fig 5c, indicate a speedup of on criteo-kaggle. The NUMA-level optimizations result in an improvement of and , on average, for higgs and epsilon, on the 4-node machine. These optimizations had a smaller impact on the 2-node machine, with average speedups of , , and , for criteo-kaggle, higgs, and epsilon, respectively.
Comparison with scikit-learn and H2O. Last, we compare the performance of our solver for training a logistic regression model against the widely used scikit-learn library [10], which implements different solvers (liblinear, lbfgs, sag). We also compare with H2O [14], using its multi-threaded auto solver. (Footnote 2: We could not get the binary files for the sparse dataset (criteo-kaggle) to work with H2O in a reasonable amount of time and leave this for future work.)
Results comparing the training time against test loss for the different solvers, on the two systems, are depicted in Fig 6. We report results for single (snap.ml 1T) and maximum (snap.ml MT) thread counts for our optimized implementation. snap.ml MT is consistently faster than the best-performing alternative solver, across the board: , , and for criteo-kaggle, higgs, and epsilon, respectively, on the 2-node system; , , and , respectively, on the 4-node system.
We observe that the only multi-threaded scikit-learn solver (lbfgs) performs worse than the other scikit-learn solvers, both in test loss and time to converge; liblinear is the best choice. The only exception to this is higgs on the 2-node system, where lbfgs is faster than liblinear, but operates at a higher test loss. H2O performance is somewhat extreme: its multi-threaded solver takes over all the cores in the system and is able to achieve the expected test loss, but the performance varies dramatically across datasets. For higgs, it is second only to snap.ml MT, and significantly faster than the alternatives ( faster than scikit-learn on the 2-node machine). However, for epsilon, it is by far the slowest solver (by at least ). We suspect that this is caused by the large number of features (epsilon has ): by artificially reducing the number of features to 200 using the max_active_predictors H2O parameter, we get an order of magnitude speedup in time, but with a dramatic degradation of the test loss.
5 Conclusion
We started with a state-of-the-art multi-threaded implementation of SDCA and proposed several modifications to make it more aligned with the hardware architecture of a modern CPU while preserving good convergence behavior: 1) we proposed a bucketing approach to improve the memory access pattern of a randomized stochastic algorithm, 2) we introduced a novel dynamic data partitioning scheme to alleviate issues of false sharing between threads, and 3) we proposed a hierarchical NUMA-aware partitioning of the workload across threads to enable efficient scaling of the algorithm across NUMA nodes. Combining all these optimizations, we achieve a remarkable gain of up to compared to a state-of-the-art, optimized, system-agnostic parallel implementation of the same algorithm.
References

[1] Pierre Baldi, Przemysław Sadowski, and D. O. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.
[2] CriteoLabs. Terabyte click logs dataset. http://labs.criteo.com/2013/12/downloadterabyteclicklogs/, 2013.
[3] Celestine Dünner, Thomas P. Parnell, Dimitrios Sarigiannis, Nikolas Ioannou, Andreea Anghel, and Haralampos Pozidis. Snap ML: A hierarchical framework for machine learning. CoRR, abs/1803.06333, 2018.
[4] Epsilon. Pascal large scale learning challenge. http://www.k4all.org/project/largescalelearningchallenge, 2008.
[5] Cho-Jui Hsieh, Hsiang-Fu Yu, and Inderjit Dhillon. Passcode: Parallel asynchronous stochastic dual coordinate descent. In International Conference on Machine Learning, pages 2370–2379, 2015.
[6] Cho-Jui Hsieh, Hsiang-Fu Yu, and Inderjit Dhillon. Passcode: Parallel asynchronous stochastic dual coordinate descent. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2370–2379, Lille, France, 07–09 Jul 2015. PMLR.
[7] John Langford. Vowpal wabbit. https://github.com/JohnLangford/vowpal_wabbit/wiki, 2007.
[8] Ji Liu, Stephen J Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. The Journal of Machine Learning Research, 16(1):285–322, 2015.
[9] Thomas Parnell, Celestine Dünner, Kubilay Atasu, Manolis Sifalakis, and Haralampos Pozidis. Tera-scale coordinate descent on GPUs. Future Generation Computer Systems, 2018.
[10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[11] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
[12] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. JMLR, 14(1):567–599, 2013.
[13] Virginia Smith, Simone Forte, Chenxin Ma, Martin Takáč, Michael Jordan, and Martin Jaggi. Cocoa: A general framework for communication-efficient distributed optimization. In Journal of Machine Learning Research, volume 18, 2018.
[14] The H2O.ai team. H2O, 2018.
[15] Stephen J Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.