As data has become a major source of insight, machine learning (ML) has become a dominant workload in many public and private cloud environments. The ever-increasing collection of data further drives the development of efficient algorithms and systems for distributed ML smith2018 ; dunner2018_2 , as resource demands often exceed the capacity of single nodes. However, distributed execution and the usage of cloud resources pose additional challenges in terms of efficient and flexible resource utilization. Recently, several works have aimed to improve the resource utilization and flexibility of ML applications harlap2017 ; qiao2017 ; zhang2017 .
In this paper, we focus on CoCoA smith2018 , a state-of-the-art framework for efficient, distributed training of generalized linear models (GLMs). CoCoA significantly outperforms other distributed methods, such as mini-batch versions of SGD and SDCA, by minimizing the amount of communication necessary between training steps.
Our work is motivated by two characteristics of the CoCoA algorithm. First, even assuming perfect scalability and no overheads, increasing the number of workers $K$ does not, in general, reduce the time to reach a solution. This is because the convergence rate of CoCoA degrades as $K$ increases jaggi2014 . Overall, CoCoA execution is split into epochs, and increasing $K$ reduces the execution time of each epoch, but also decreases the per-epoch convergence rate, requiring more epochs to reach a solution. Finding the $K$ that minimizes execution time is not trivial and depends on the dataset.
Second, the number of workers that minimizes execution time changes as the training progresses. Figures 1a and 1b show the convergence rate for different values of $K$, using the KDDA and Higgs datasets as examples. We evaluate the convergence rate by plotting the duality gap, which is given by the distance between the primal and dual formulations of the training objective and has been shown to provide a robust certificate of convergence dunner2016 ; smith2018 . Both examples show that for larger values of $K$, the duality gap converges faster initially, but slows down earlier than for smaller values of $K$; thus, smaller values of $K$ lead to a shorter time-to-(high-)accuracy¹ than larger values of $K$. However, this is not universally true, as Figure 1c shows for the RCV1 dataset, which scales almost perfectly with $K$.
¹ When we refer to the training accuracy, we mean that a highly accurate solution to the optimization problem has been found (i.e., a small value of the duality gap), rather than the classification accuracy of the resulting classifier.
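For completeness, the certificate used above can be written as follows; this is a sketch in generic primal–dual notation (following dunner2016), not necessarily the exact formulation used in the paper:

```latex
% Duality gap as a convergence certificate (generic form).
% P = primal training objective, D = dual objective,
% w(\alpha) = primal candidate derived from the dual iterate \alpha.
\mathrm{gap}(\alpha) \;=\; \mathcal{P}\bigl(w(\alpha)\bigr) \;+\; \mathcal{D}(\alpha)
\;\geq\; \mathcal{P}\bigl(w(\alpha)\bigr) \;-\; \mathcal{P}(w^{\star})
```

Since the gap upper-bounds the primal suboptimality, driving it below a threshold certifies that a solution of corresponding accuracy has been found without knowing the optimum $w^{\star}$.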
Based on these observations, we built Chicle, an elastic distributed machine learning framework based on CoCoA that reduces time-to-accuracy, robustly finds (near-)optimal settings of $K$ automatically, and optimizes resource usage by exploiting the drift of the optimal $K$.
CoCoA smith2018 is a distributed machine learning framework to train GLMs across $K$ workers. The training data matrix is partitioned column-wise across all workers and processed by local optimizers that independently apply updates to a shared vector, which is synchronized periodically. In contrast to the mini-batch approach, local optimizers apply intermediate updates directly to their local version of the shared vector, thus benefiting from previous updates within the same epoch.
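The epoch structure described above can be sketched as follows; `local_step` stands in for the local (e.g., SDCA) optimizer update, and averaging the deltas represents one possible choice of CoCoA's aggregation rule:

```python
import numpy as np

def cocoa_epoch(partitions, v, local_steps, local_step):
    """One CoCoA outer round (sketch): every worker immediately applies
    updates to its *local* copy of the shared vector, then all deltas are
    combined in a single synchronization step."""
    deltas = []
    for A_k in partitions:                        # each iteration models one worker
        v_local = v.copy()                        # local version of the shared vector
        for _ in range(local_steps):
            v_local += local_step(A_k, v_local)   # immediate local update
        deltas.append(v_local - v)
    return v + sum(deltas) / len(partitions)      # periodic synchronization
```

In a mini-batch scheme, by contrast, every update within an epoch would be computed against the same stale shared vector, which is exactly what CoCoA's immediate local updates avoid.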
Due to these immediate local updates to the shared vector, CoCoA outperforms previous state-of-the-art mini-batch versions of SGD and SDCA. However, for the same reason, it is not trivial to efficiently scale out CoCoA, as increasing the number of workers does not guarantee a decrease in time-to-accuracy, even assuming perfect linear scaling and zero communication costs between epochs. The reason for this counter-intuitive behavior is that, as each local optimizer gets a smaller partition of the training data, i.e., as it sees a smaller picture of the entire problem, the number of identifiable correlations within each partition decreases as well, leaving more correlations to be identified across partitions, which is slower due to the infrequent synchronization steps.
Moreover, as indicated in the previous section, there is no single $K$ for which the convergence rate is maximal at all times. This poses a challenge for the selection of the best $K$: it is up to the user to decide in advance whether to train quickly to a low accuracy and wait longer to reach a high accuracy, or vice versa. A wrong decision can lead to longer training times and wasted resources as well as money, as resources – at least in cloud offerings – are typically billed by the hour.
Ideally, the system would automatically and dynamically select $K$ such that the convergence rate is maximal at any point in time, in order to minimize training time and resource waste. As Figure 1b shows, the convergence rate, i.e., the slope of the curve starting from the same level of accuracy, differs between different settings of $K$. For example, as the curve for the largest $K$ flattens, the curves for smaller values of $K$ become relatively steeper, until they too, one by one, flatten out. Hence, in order to stay within a region of fast convergence for as long as possible, the system should switch to a smaller $K$ once the curve for the current $K$ starts to flatten. We assume that the convergence rate, when switching from $K$ to $K' < K$ workers at a certain level of accuracy, will follow a similar trajectory as if the training had reached said level of accuracy starting with $K'$ workers in the first place. However, the validity of this assumption is not obvious, given that the learned models in the two cases are not guaranteed to be identical.
Apart from the algorithmic side, adjusting $K$ also poses very practical challenges on the system side. Every change in $K$ incurs a transfer of potentially several gigabytes of training data between nodes – a task that overwhelms many systems zaharia2010 ; stuedi2017 ; sikdar2017 , as data (de-)serialization and transfer can be very time-consuming.² It is therefore crucial that the overhead introduced by the adjustment of $K$ is small, such that a net benefit can be realized.
² Initially, we attempted to implement the concept of Chicle in Spark. This, however, failed to a large degree due to the very time-consuming (de-)serialization of the training data.
Chicle³ is a distributed, auto-elastic machine learning system based on the state-of-the-art CoCoA smith2018 framework that enables efficient ML training with minimized time-to-accuracy and optimized resource usage. The core concept of Chicle is to dynamically reduce the number of workers (and therefore training-data partitions), starting from a set maximum number, based on feedback from the training algorithm. This is rooted in the observation of a knee in the convergence rate, after which convergence slows down significantly, and that this knee typically occurs at a lower duality gap for fewer workers compared to more workers. This can be observed in Figure 1b, where the knee occurs at a higher duality gap for 16 workers than for 2 workers. The reasoning for adjusting the number of workers is the assumption that CoCoA can be accelerated if, by reducing the number of workers, it can stay before the knee for as long as possible.
³ Chicle is the Mexican-Spanish word for latex from the sapodilla tree that is used as a basis for chewing gum.
Chicle implements a master/slave design in which a central driver (master) coordinates one or more workers (slaves), each running on a separate node. Driver and workers communicate via a custom RPC framework based on RDMA to enable fast data transfer with minimal overhead. Chicle is implemented in 3,000 lines of C++ code, including the RDMA-based RPC subsystem.
The driver is responsible for loading, partitioning and distributing the training data; hence, no shared file system is required to store the training data. It partitions the data into $P$ partitions for $K$ workers, such that each worker is assigned $P/K$ partitions, with $P$ being the least common multiple of $K$ and all potential scale-in sizes $K'$. Moreover, the central CoCoA component is implemented as a driver module. The workers implement an SDCA optimizer. Each optimizer instance works on all partitions assigned to a worker, such that it can train with a bigger picture once partitions get reassigned to a smaller set of workers. For each epoch, workers compute the partial primal and dual objectives for their assigned partitions, which are sent to the driver, where the duality gap is computed and passed to the scale-in policy module.
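The partition-count choice can be sketched as follows (the function names are ours, not Chicle's API):

```python
from math import lcm  # math.lcm requires Python 3.9+

def num_partitions(k_init, scale_in_sizes):
    """P: least common multiple of the initial worker count and all
    potential scale-in sizes, so the partitions always divide evenly
    among the workers (a sketch of the scheme described above)."""
    return lcm(k_init, *scale_in_sizes)

def partitions_per_worker(p, k):
    """Each of the k current workers holds p // k whole partitions."""
    assert p % k == 0, "by construction of p, every scale-in size divides it"
    return p // k
```

For example, starting with 16 workers and allowing scale-ins to 4 and 1 workers yields P = lcm(16, 4, 1) = 16: one partition per worker initially, and four per worker after scaling in to 4 workers.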
Chicle enables efficient adjustment of the number of workers (and the corresponding number of data partitions per worker process) using a decision policy and an RDMA-based data-copy mechanism. In the context of this paper, Chicle only scales in, i.e., it reduces the number of workers and redistributes the partitions across the fewer remaining workers.
Our scale-in policy attempts to determine the earliest point in time at which it is beneficial to reduce the number of workers (i.e., the beginning of the knee) while, at the same time, being robust against occasional outlier (i.e., exceptionally long) epochs. To that end, we use the slope of the duality gap over time to identify the knee. The policy computes two slopes (see Figure 2): a long-term slope, which considers the convergence of the duality gap since the last scale-in event, and a short-term slope, which considers only the last $N$ epochs. As soon as the short-term slope becomes significantly flatter than the long-term slope, the policy directs the driver process to initiate the scale-in mechanism. Larger values for $N$ generally lead to a more robust decision w.r.t. occasional outlier epochs; however, they also increase the decision latency, thus potentially failing to maximize the benefits of an earlier scale-in. Empirically, we have determined settings that work well across all evaluated datasets. Our policy does not determine the optimal scale-in factor. We use a fixed factor of 4, as tests have shown that the convergence-rate difference for smaller factors is often very small.
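A minimal sketch of such a policy is shown below. The exact trigger condition and the constants `n_short` and `theta` are our assumptions, since the empirically chosen values are not reproduced here; we compare least-squares slopes of the log duality gap:

```python
import math

def slope(points):
    """Least-squares slope of a list of (time, value) pairs."""
    n = len(points)
    mt = sum(t for t, _ in points) / n
    mv = sum(v for _, v in points) / n
    num = sum((t - mt) * (v - mv) for t, v in points)
    den = sum((t - mt) ** 2 for t, _ in points)
    return num / den

def should_scale_in(history, n_short=3, theta=0.5):
    """history: list of (time, duality_gap) epochs since the last scale-in
    event. Trigger when the short-term slope over the last n_short epochs
    is markedly flatter than the long-term slope over the whole history."""
    if len(history) <= n_short:
        return False
    pts = [(t, math.log10(g)) for t, g in history]
    s_long = slope(pts)               # long-term: since last scale-in event
    s_short = slope(pts[-n_short:])   # short-term: last n_short epochs only
    # the gap decreases over time, so both slopes are negative; "flatter"
    # means closer to zero than a theta-fraction of the long-term descent
    return s_short > theta * s_long
```

On a run where the gap drops by one order of magnitude per epoch and then plateaus, this policy stays quiet during the steady descent and fires once the plateau dominates the short-term window.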
We implement a simple, RDMA-based foreground data-copy mechanism to copy data from to-be-removed workers to the remaining workers. As the data transfer occurs in parallel between multiple pairs of workers, we are able to exceed the maximal single-link bandwidth: for a scale-in from $K$ to $K'$ workers and a single-link bandwidth of $B$ (e.g., 10 Gb/s), we can achieve a total transfer rate of $K' \cdot B$, e.g., 40 Gb/s to scale in from 16 to 4 workers on a 10 Gb/s network.
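The back-of-the-envelope rate calculation reads:

```python
def scale_in_transfer_rate(k_from, k_to, link_bw_gbps):
    """Idealized aggregate transfer rate when scaling in from k_from to
    k_to workers: each of the k_to remaining workers receives over its
    own link in parallel, so the total is k_to times the single-link
    bandwidth (ignoring control-plane overheads)."""
    assert 0 < k_to < k_from
    return k_to * link_bw_gbps
```

For the example above, `scale_in_transfer_rate(16, 4, 10)` yields 40 Gb/s.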
3.3 Data partitioning and in-memory representation
While we do not employ a sophisticated data-partitioning scheme – we simply split the data into equally sized chunks as it is laid out in the input file – we use an in-memory layout optimized for efficient local access as well as efficient data transfer between workers (see Listing 1). In Chicle, the data for each partition is stored consecutively in the Partition::data array, which eliminates the need for costly serialization. On the receiving side, a simple deserialization step is required to restore the Example::dp pointer into the Partition::data array for each Example. This data layout, combined with the use of RDMA, enables us to transfer data at a rate close to the hardware limit.
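Listing 1 itself is not reproduced here, but the idea can be illustrated with a short sketch (names other than Partition::data and Example::dp are ours): payloads live in one contiguous buffer, per-example records hold offsets instead of pointers, and "deserialization" after a transfer merely rebuilds views into the buffer, the analogue of restoring the Example::dp pointers:

```python
class Partition:
    """Contiguous in-memory layout: all example payloads live in one
    buffer (the analogue of Partition::data), indexed by (offset, length)
    pairs, so the whole partition can be shipped as raw bytes."""

    def __init__(self, examples):
        self.data = b"".join(examples)      # contiguous payload buffer
        self.index = []                     # (offset, length) per example
        off = 0
        for ex in examples:
            self.index.append((off, len(ex)))
            off += len(ex)

    def example(self, i):
        """Zero-copy view of example i (restores the 'pointer')."""
        off, length = self.index[i]
        return memoryview(self.data)[off:off + length]

def transfer(partition):
    """Model a network transfer: the buffer travels byte-for-byte,
    no per-example serialization is needed."""
    received = Partition.__new__(Partition)
    received.data = bytes(partition.data)   # raw copy, as over RDMA
    received.index = list(partition.index)
    return received
```

The cheap fix-up on the receiving side is why the measured deserialization cost stays small.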
While we have considered an anticipatory background-transfer mechanism, our evaluation (see Table 3) shows that the overhead introduced by our foreground mechanism is small enough not to necessitate it.
In our evaluation, we attempt to answer the question of how much the CoCoA algorithm can be improved by scaling in the training and thus staying in front of the knee for as long as possible.
To answer this question, we compare the time-to-accuracy (duality gap) of our static CoCoA implementation with our elastic version, using an SVM training algorithm⁴ and the six datasets shown in Table 1. We evaluate static settings with 1, 2, 4, 8 and 16 workers, as well as two elastic settings. In the first elastic setting, we start with 16 workers and scale in to a single worker. This represents cases where the entire dataset fits inside a single node’s memory but limited CPU resources make distribution beneficial anyway. In the second elastic setting, we start with 16 workers but scale in to only two workers. This represents cases where a dataset exceeds a single node’s memory capacity and therefore has to be distributed. As the convergence behavior for 2+ nodes is similar (see Figure 3), this also indicates how our method works in a larger cluster, e.g., when scaling from 64 to 8 nodes.
⁴ We use a constant regularizer term.
Table 1: Evaluated datasets.
| Dataset | Examples | Features | Size | Density |
| RCV1 | 667,399 | 47,236 | 1.2 GB | 0.16 % |
| KDDA | 20,216,830 | 8,407,752 | 2.5 GB | 1.8e-04 % |
| Higgs | 11,000,000 | 28 | 7.5 GB | 92.11 % |
| KDD12 | 54,686,452 | 149,639,105 | 21 GB | 2e-05 % |
| Webspam | 350,000 | 16,609,143 | 24 GB | 0.02 % |
| Criteo | 45,840,617 | 999,999 | 35 GB | 3.9e-03 % |
All tests are executed on a 17-node cluster equipped with Intel Xeon E5-2640v3/E5-2650v2 CPUs, 160–256 GB RAM and CentOS/Fedora 26 Linux, running 16 workers and 1 driver, connected by an FDR (56 Gb/s) InfiniBand fabric. The initial set of nodes is always chosen randomly. The results, shown in Figure 3 and Table 2, represent the best results over 6 test runs for all schemes, to account for potential node-speed variations. We set a test time limit of 10 minutes (not including data loading). Time results include computing the duality gap.
Our evaluation shows that the basic concept of Chicle – adjusting the number of workers based on feedback from the training algorithm – has benefits for most evaluated datasets. When scaling down to a single worker, Chicle shows an average speedup of 2× compared to the best static setting, and 2.2× when scaling down to two workers. While our method does not improve upon all evaluated settings and target accuracies (e.g., for KDDA, Webspam, RCV1), the slowdown (compared to the respective best static setting) is tolerable, and speedups are still achieved compared to non-optimal static settings. It is important to note that the optimal static setting is not necessarily known in advance and may require several test runs to determine. Chicle, on the other hand, is able to find an optimal or near-optimal setting automatically, which shows its robustness.
Table 3: Scale-in overhead.
| 1–16 workers | 0.12 s | 0.73 s | 0.71 s | 5.04 s | 2.78 s | 4.52 s |
| 2–16 workers | 0.06 s | 0.39 s | 0.38 s | 2.78 s | 1.53 s | 2.18 s |
Finally, we measured data-copy rates and overhead due to scaling-in. Both metrics include the actual data-transfer, control plane overhead and data deserialization. We measured data transfer rates of up to 5.8 GiB/s (1.4 GiB/s on average) and overheads as shown in Table 3. As the measured times do not constitute a significant overhead on our system, we did not implement background data transfer. For slower networks, such a method could be used to hide data transfer times behind regular computation.
5 Related Work
To our knowledge, Chicle is the first elastic CoCoA implementation. Several other elastic ML systems exist, but in contrast to Chicle, they target efficient resource utilization rather than reducing overall execution time. Litz qiao2017 is an elastic ML framework that over-partitions the training data into more partitions than physical workers. Elasticity is achieved by increasing or decreasing the number of partitions per node. In contrast to Chicle, Litz neither scales based on feedback from the training algorithm nor improves the per-epoch convergence rate of the training algorithm when doing so, as partitions are always processed independently of each other. SLAQ zhang2017 is a cluster scheduler for ML applications. Like Chicle, SLAQ uses feedback from ML applications, but instead of optimizing the time to arbitrary accuracy for one application, SLAQ tries to minimize the time to low accuracy for many applications at the same time, by shifting resources from applications with low convergence rates to those with high ones, assuming that resources can be used more effectively there. Proteus harlap2017 enables the execution of ML applications using transient revocable resources, such as EC2’s spot instances, by keeping worker state minimal at the cost of increased communication.
6 Conclusion and Future Work
In this paper, we have shown experimentally that the optimal number of workers for CoCoA changes over the course of the training. Based on this observation we built Chicle, an elastic ML framework, and have shown that it can outperform static CoCoA for several datasets and settings by a factor of 2–2.2 on average, often while using fewer resources. Future work includes additional ways to dynamically optimize CoCoA in terms of training time and resource usage, as well as related use cases, e.g., neural networks lin2018 . Furthermore, we are working towards a theoretical foundation for our observations.
- (1) Dünner, C., Forte, S., Takáč, M., and Jaggi, M. Primal-dual rates and certificates. arXiv preprint arXiv:1602.05205 (2016).
- (2) Dünner, C., Parnell, T. P., Sarigiannis, D., Ioannou, N., Anghel, A., and Pozidis, H. Snap ML: A hierarchical framework for machine learning. CoRR abs/1803.06333 (2018).
- (3) Harlap, A., Tumanov, A., Chung, A., Ganger, G. R., and Gibbons, P. B. Proteus: agile ml elasticity through tiered reliability in dynamic resource markets. In Proceedings of the Twelfth European Conference on Computer Systems (2017), ACM, pp. 589–604.
- (4) Jaggi, M., Smith, V., Takac, M., Terhorst, J., Krishnan, S., Hofmann, T., and Jordan, M. I. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 3068–3076.
- (5) Lin, T., Stich, S. U., and Jaggi, M. Don’t use large mini-batches, use local sgd. CoRR abs/1808.07217 (2018).
- (6) Qiao, A., Aghayev, A., Yu, W., Chen, H., Ho, Q., Gibson, G. A., and Xing, E. P. Litz: An elastic framework for high-performance distributed machine learning.
- (7) Sikdar, S., Teymourian, K., and Jermaine, C. An experimental comparison of complex object implementations for big data systems. In Proceedings of the 2017 Symposium on Cloud Computing (New York, NY, USA, 2017), SoCC ’17, ACM, pp. 432–444.
- (8) Smith, V., Forte, S., Ma, C., Takáč, M., Jordan, M. I., and Jaggi, M. CoCoA: A general framework for communication-efficient distributed optimization. JMLR 18 (2018), 1–49.
- (9) Stuedi, P., Trivedi, A., Pfefferle, J., Stoica, R., Metzler, B., Ioannou, N., and Koltsidas, I. Crail: A high-performance i/o architecture for distributed data processing. IEEE Data Eng. Bull. 40, 1 (2017), 38–49.
- (10) Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (Berkeley, CA, USA, 2010), HotCloud’10, USENIX Association, pp. 10–10.
- (11) Zhang, H., Stafman, L., Or, A., and Freedman, M. J. Slaq: quality-driven scheduling for distributed machine learning. In Proceedings of the 2017 Symposium on Cloud Computing (2017), ACM, pp. 390–404.