1 Introduction
As data has become a major source of insight, machine learning (ML) has become a dominant workload in many public and private cloud environments. The ever-increasing collection of data further drives the development of efficient algorithms and systems for distributed ML [2, 8], as resource demands often exceed the capacity of single nodes. However, distributed execution and the use of cloud resources pose additional challenges in terms of efficient and flexible resource utilization. Recently, several works have aimed to improve the resource utilization and flexibility of ML applications [3, 6, 11].
In this paper, we focus on CoCoA [8], a state-of-the-art framework for the efficient, distributed training of generalized linear models (GLMs). CoCoA significantly outperforms other distributed methods, such as mini-batch versions of SGD and SDCA, by minimizing the amount of communication necessary between training steps.
Our work is motivated by two characteristics of the CoCoA algorithm. First, even assuming perfect scalability and no overheads, increasing the number of workers K does not, in general, reduce the time to reach a solution. This is because the convergence rate of CoCoA degrades as K increases [4]. Overall, CoCoA execution is split into epochs; increasing K reduces the execution time of each epoch but also decreases the per-epoch convergence rate, requiring more epochs to reach a solution. Finding the K that minimizes execution time is not trivial and depends on the dataset. Second, the number of workers that minimizes execution time changes as the algorithm progresses. Figures 1(a) and 1(b) show the convergence rate for varying numbers of workers K, using the KDDA and Higgs datasets as examples. We evaluate the convergence rate by plotting the duality gap, which is given by the distance between the primal and dual formulations of the training objective and has been shown to provide a robust certificate of convergence [1, 8]. Both examples show that for larger values of K the duality gap converges faster initially but slows down earlier than for smaller values of K; smaller values of K thus lead to a shorter time-to-(high-)accuracy¹ than larger values of K. However, this is not universally true, as Figure 1(c) shows for the RCV1 dataset, which scales almost perfectly with K. Based on these observations, we built Chicle, an elastic distributed machine learning framework based on CoCoA that reduces time-to-accuracy, robustly finds (near-)optimal settings of K automatically, and optimizes resource usage by exploiting the drift of the optimal K.

¹ When we refer to training accuracy, we mean that a highly accurate solution to the optimization problem has been found (i.e., a small value of the duality gap), rather than the classification accuracy of the resulting classifier.
2 Background
CoCoA [8] is a distributed machine learning framework that trains GLMs across K workers. The training data matrix is partitioned column-wise across all workers and processed by local optimizers that independently apply updates to a shared vector v, which is synchronized periodically. In contrast to the mini-batch approach, local optimizers apply intermediate updates directly to their local version of the shared vector v, thus benefiting from previous updates within the same epoch.

Due to these immediate local updates to v by the local optimizers, CoCoA outperforms previous state-of-the-art mini-batch versions of SGD and SDCA. However, for the same reason, it is not trivial to efficiently scale out CoCoA, as increasing the number of workers does not guarantee a decrease in time-to-accuracy, even assuming perfect linear scaling and zero communication costs between epochs. The reason for this counterintuitive behavior is that each local optimizer gets a smaller partition of the training data, i.e., it sees a smaller part of the entire problem. The number of correlations identifiable within each partition therefore decreases as well, leaving more correlations to be identified across partitions, which is slower due to the infrequent synchronization steps.
Moreover, as indicated in the previous section, there is no single K for which the convergence rate is maximal at all times. This poses a challenge for selecting the best K: it is up to the user to decide in advance whether to reach a low accuracy quickly but a high accuracy slowly, or vice versa. A wrong decision can lead to longer training times and wasted resources as well as money, as resources – at least in cloud offerings – are typically billed by the hour.
Ideally, the system would automatically and dynamically select K such that the convergence rate is maximal at any point in time, in order to minimize training time and resource waste. As Figure 1(b) shows, the convergence rate, i.e., the slope of the curve starting from the same level of accuracy, differs between different settings of K. For example, as the curve for a large K flattens, the curves for smaller K become relatively steeper, until they too, one by one, flatten out. Hence, in order to stay within a region of fast convergence for as long as possible, the system should switch to a smaller K once the curve for the current K starts to flatten. We assume that the convergence rate when switching from K to K' < K workers at a certain level of accuracy will follow a similar trajectory as if the training had reached said level of accuracy starting with K' workers in the first place. However, the validity of this assumption is not obvious, given that the learned models in both cases are not guaranteed to be identical.
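The duality gap that all of these convergence curves plot can be stated compactly. For a primal objective P(w) with optimum w* and its dual D(α) (a standard formulation, following [1]; w(α) denotes the primal candidate constructed from the dual variables α):

```latex
\mathrm{gap}(\alpha) \;=\; \mathcal{P}\big(w(\alpha)\big) - \mathcal{D}(\alpha)
\;\ge\; \mathcal{P}\big(w(\alpha)\big) - \mathcal{P}(w^{\star}) \;\ge\; 0
```

Weak duality guarantees D(α) ≤ P(w*), so the gap upper-bounds the primal suboptimality and, unlike the suboptimality itself, can be evaluated during training – which is what makes run-time decisions based on it feasible in the first place.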
Apart from the algorithmic side, adjusting K also poses very practical challenges on the system side. Every change in K incurs a transfer of potentially several gigabytes of training data between nodes – a task that overwhelms many systems [10, 9, 7], as data (de)serialization and transfer can be very time consuming². It is therefore crucial that the overhead introduced by the adjustment of K is small, such that a net benefit can be realized.

² Initially, we attempted to implement the concept of Chicle in Spark. This failed to a large degree due to the very time-consuming (de)serialization of the training data.
3 Chicle
Chicle³ is a distributed, auto-elastic machine learning system based on the state-of-the-art CoCoA [8] framework that enables efficient ML training with minimized time-to-accuracy and optimized resource usage. The core concept of Chicle is to dynamically reduce the number of workers (and therefore training data partitions), starting from a set maximum number, based on feedback from the training algorithm. This is rooted in the observation of a knee in the convergence rate, after which convergence slows down significantly, and that this knee typically occurs at a lower duality gap for fewer workers than for more workers. This can be observed in Figure 1(b), where the knee for 16 workers occurs at a larger duality-gap value than for 2 workers. The reasoning behind adjusting the number of workers is the assumption that CoCoA can be accelerated if, by reducing the number of workers, it can stay ahead of the knee for as long as possible.

³ Chicle is the Mexican-Spanish word for the latex from the sapodilla tree that is used as a basis for chewing gum.
3.1 Overview
Chicle implements a master/slave design in which a central driver (master) coordinates one or more workers (slaves), each running on a separate node. The driver and workers communicate via a custom RPC framework based on RDMA to enable fast data transfer with minimal overhead. Chicle is implemented in 3,000 lines of C++ code, including the RDMA-based RPC subsystem.
The driver is responsible for loading, partitioning and distributing the training data; hence, no shared file system is required to store the training data. It partitions the data into P partitions for K workers, such that each worker is assigned P/K partitions, with P being the least common multiple of K and all potential scale-in sizes K' < K. Moreover, the central CoCoA component is implemented as a driver module. The workers implement an SDCA optimizer. Each optimizer instance works on all partitions assigned to its worker, so that it can train with a bigger picture once partitions are reassigned to a smaller set of workers. For each epoch, workers compute the partial primal and dual objectives for their assigned partitions, which are sent to the driver, where the duality gap is computed and passed to the scale-in policy module.
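The partition count can be computed directly; the sketch below (our own illustration of the scheme, with hypothetical names) derives P as the least common multiple of the initial worker count and all scale-in sizes, so that every intermediate configuration receives a whole number of partitions.

```cpp
#include <numeric>  // std::lcm (C++17)
#include <vector>

// Number of partitions P such that the initial K workers and every potential
// scale-in size K' < K can each be assigned a whole number of partitions.
long long partition_count(int k, const std::vector<int>& scale_in_sizes) {
    long long p = k;
    for (int s : scale_in_sizes) p = std::lcm(p, static_cast<long long>(s));
    return p;
}

// Partitions held by each worker in a configuration with `workers` active workers.
long long partitions_per_worker(long long p, int workers) { return p / workers; }
```

For example, starting with 6 workers and allowing a scale-in to 4, P = lcm(6, 4) = 12, so each of the 6 initial workers holds 2 partitions and each of the 4 remaining workers holds 3 after scale-in. With scale-in sizes that divide K, such as 16 → 4 → 1, P = 16 and scale-in never splits data, it only reassigns whole partitions.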
3.2 Scale-in
Chicle enables efficient adjustment of the number of workers (and the corresponding number of data partitions per worker process) using a decision policy and an RDMA-based data copy mechanism. In the context of this paper, Chicle only scales in, i.e., it reduces the number of workers and redistributes the partitions across fewer workers.
Scale-in policy.
Our scale-in policy attempts to determine the earliest point in time at which it is beneficial to reduce the number of workers (i.e., the beginning of the knee) while, at the same time, being robust against occasional outlier (i.e., exceptionally long) epochs. To that end, we use the slope of the duality gap over time to identify the knee. The policy computes two slopes (see Figure 2): a long-term slope, which considers the convergence of the duality gap since the last scale-in event, and a short-term slope, which considers only the last N epochs. As soon as the short-term slope indicates a flattening of the curve relative to the long-term slope, the policy directs the driver process to initiate the scale-in mechanism. Larger observation windows generally lead to a more robust decision w.r.t. occasional outlier epochs; however, they also increase the decision latency, thus potentially failing to maximize the benefit of an earlier scale-in. Empirically, we have determined parameter values that work well across all evaluated datasets. Our policy does not determine the optimal scale-in factor, i.e., the ratio K/K'. We use a fixed factor of 4, as tests have shown that the convergence-rate difference for smaller factors is often very small.

Scale-in mechanism.
We implement a simple RDMA-based foreground data-copy mechanism to copy data from to-be-removed workers to the remaining workers. As the data transfer occurs in parallel between multiple pairs of workers, we are able to exceed the maximal single-link bandwidth: for a scale-in from K to K' workers and a single-link bandwidth of B (e.g., 10 Gb/s), we can achieve a total transfer rate of K'·B, e.g., 40 Gb/s when scaling in from 16 to 4 workers on a 10 Gb/s network.
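The slope-based trigger described under the scale-in policy above can be sketched as follows. The use of raw duality-gap values, the window handling, and the flatness threshold are our assumptions for illustration; the actual parameter values in Chicle are determined empirically.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Decide whether to scale in, given the duality gap recorded at the end of
// each epoch since the last scale-in event. A long-term slope (over the whole
// history since the last scale-in) is compared with a short-term slope (over
// the last `short_window` epochs): once recent convergence is much flatter
// than the long-term trend, we assume the knee has been reached.
bool should_scale_in(const std::vector<double>& gap_history,
                     std::size_t short_window, double flatness_threshold) {
    if (gap_history.size() < short_window + 1) return false;  // not enough data yet
    const std::size_t n = gap_history.size();
    // Average decrease per epoch (slopes are negative while converging).
    double long_slope  = (gap_history[n - 1] - gap_history[0]) / (n - 1);
    double short_slope = (gap_history[n - 1] - gap_history[n - 1 - short_window])
                         / short_window;
    // Trigger once the short-term decrease is only a small fraction of the
    // long-term decrease, i.e., the curve has flattened out.
    return std::fabs(short_slope) < flatness_threshold * std::fabs(long_slope);
}
```

Requiring several epochs of history and comparing averaged slopes rather than single-epoch differences is what provides the robustness against outlier epochs, at the cost of some decision latency.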
3.3 Data partitioning and in-memory representation
While we do not employ a sophisticated data partitioning scheme – we simply split the data into equally sized chunks as it is laid out in the input file – we use an in-memory layout optimized for efficient local access as well as efficient data transfer between workers (see Listing 1). In Chicle, the data for each partition is stored consecutively in the Partition::data array, which eliminates the need for costly serialization. On the receiving side, a simple deserialization step is required to restore the Example::dp pointer into the Partition::data array for each Example. This data layout, combined with the use of RDMA, enables us to transfer data at a rate close to the hardware limit.
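The idea behind this layout can be sketched as follows (field and type names beyond Partition::data and Example::dp are our guesses, not Listing 1 itself): all feature values of a partition live in one contiguous buffer, examples store offsets alongside their convenience pointer, and after a transfer the receiver only needs to re-base each Example::dp pointer into its own copy of the buffer.

```cpp
#include <cstddef>
#include <vector>

// Sketch of a transfer-friendly partition layout: all values live in one
// contiguous buffer, so sending a partition is a single large transfer of
// the data array plus the (pointer-free) per-example metadata.
struct Example {
    std::size_t offset;   // start of this example's values inside data[]
    std::size_t length;   // number of values
    const double* dp;     // convenience pointer into Partition::data
};

struct Partition {
    std::vector<double> data;       // consecutive values, no per-example allocation
    std::vector<Example> examples;  // metadata; dp is the only pointer to fix up
};

// "Deserialization" on the receiving side: no parsing, just re-basing each
// Example::dp into the receiver's copy of the buffer.
void fixup_pointers(Partition& p) {
    for (Example& e : p.examples) e.dp = p.data.data() + e.offset;
}
```

Because examples carry offsets rather than only raw pointers, a received partition becomes usable after a single linear pass over its metadata – which is why the measured transfer rate can stay close to the hardware limit.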
While we have considered an anticipatory background-transfer mechanism, our evaluation (see Table 3) shows that the overhead introduced by our mechanism does not necessitate one.
4 Evaluation
In our evaluation, we attempt to answer the question of how much the CoCoA algorithm can be improved by scaling in during training and thus staying ahead of the knee for as long as possible.
To answer this question, we compare the time-to-accuracy (duality gap) of our static CoCoA implementation with our elastic version, using an SVM training algorithm⁴ and the six datasets shown in Table 1. We evaluate static settings with 1, 2, 4, 8 and 16 workers as well as two elastic settings. In the first elastic setting, we start with 16 workers and scale in to a single worker. This represents cases where the entire dataset fits into a single node's memory, but limited CPU resources make distribution beneficial anyway. In the second elastic setting, we start with 16 workers but scale in to only two workers. This represents cases where a dataset exceeds a single node's memory capacity and therefore has to be distributed. As the convergence behavior for two or more nodes is similar (see Figure 3), this also indicates how our method would work in a larger cluster, e.g., when scaling from 64 to 8 nodes.

⁴ We use a constant regularizer term.
Dataset  Examples  Features  Size (SVM)  Sparsity
RCV1  667,399  47,236  1.2 GB  0.16 %
KDDA  20,216,830  8,407,752  2.5 GB  1.8e-04 %
Higgs  11,000,000  28  7.5 GB  92.11 %
KDD12  54,686,452  149,639,105  21 GB  2e-05 %
Webspam  350,000  16,609,143  24 GB  0.02 %
Criteo  45,840,617  999,999  35 GB  3.9e-03 %
All tests are executed on a 17-node cluster, equipped with Intel Xeon E5-2640 v3/E5-2650 v2 CPUs, 160–256 GB RAM and CentOS/Fedora 26 Linux, running 16 workers and 1 driver, connected by an FDR (56 Gb/s) InfiniBand fabric. The initial set of nodes is always chosen randomly. The results, shown in Figure 3 and Table 2, represent the best results over 6 test runs for all schemes, to account for potential node speed variations. We set a test time limit of 10 minutes (not including data loading). Time results include computing the duality gap.
Our evaluation shows that the basic concept of Chicle – adjusting the number of workers based on feedback from the training algorithm – benefits most evaluated datasets. When scaling down to a single worker, Chicle shows an average speedup of 2× compared to the best static setting, and 2.2× when scaling down to two workers. While our method does not improve upon all evaluated settings and target accuracies (e.g., for KDDA, Webspam and RCV1), the slowdown compared to the respective best static setting is tolerable, and speedups are still achieved compared to non-optimal static settings. It is important to note that the optimal static setting is not necessarily known in advance and may require several test runs to determine. Chicle, on the other hand, is able to find an optimal or near-optimal setting automatically, which shows its robustness.


Setting  RCV1  KDDA  Higgs  KDD12  Webspam  Criteo
1–16 workers  0.12 s  0.73 s  0.71 s  5.04 s  2.78 s  4.52 s
2–16 workers  0.06 s  0.39 s  0.38 s  2.78 s  1.53 s  2.18 s
Finally, we measured data-copy rates and the overhead due to scaling in. Both metrics include the actual data transfer, control-plane overhead and data deserialization. We measured data transfer rates of up to 5.8 GiB/s (1.4 GiB/s on average) and overheads as shown in Table 3. As the measured times do not constitute a significant overhead on our system, we did not implement background data transfer. On slower networks, such a mechanism could be used to hide data transfer times behind regular computation.
5 Related Work
To our knowledge, Chicle is the first elastic CoCoA implementation. Several other elastic ML systems exist, but in contrast to Chicle, they target efficient resource utilization rather than reducing overall execution time. Litz [6] is an elastic ML framework that over-partitions the training data into more partitions than there are physical workers. Elasticity is achieved by increasing or decreasing the number of partitions per node. In contrast to Chicle, Litz neither scales based on feedback from the training algorithm nor improves the per-epoch convergence rate of the training algorithm when doing so, as partitions are always processed independently of each other. SLAQ [11] is a cluster scheduler for ML applications. Like Chicle, SLAQ uses feedback from ML applications, but instead of optimizing the time to arbitrary accuracy for one application, SLAQ tries to minimize the time to low accuracy for many applications at the same time, by shifting resources from applications with low convergence rates to those with high ones, assuming that resources can be used more effectively there. Proteus [3] enables the execution of ML applications on transient revocable resources, such as EC2's spot instances, by keeping worker state minimal at the cost of increased communication.
6 Conclusion and Future Work
In this paper we have shown experimentally that the optimal number of workers K for CoCoA changes over the course of the training. Based on this observation we built Chicle, an elastic ML framework, and have shown that it can outperform static CoCoA for several datasets and settings by a factor of 2–2.2× on average, often while using fewer resources. Future work includes additional ways to dynamically optimize CoCoA in terms of training time and resource usage, as well as related use cases, e.g., neural networks [5]. Furthermore, we are working towards a theoretical foundation for our observations.

References
[1] Dünner, C., Forte, S., Takáč, M., and Jaggi, M. Primal-dual rates and certificates. arXiv preprint arXiv:1602.05205 (2016).
[2] Dünner, C., Parnell, T. P., Sarigiannis, D., Ioannou, N., Anghel, A., and Pozidis, H. Snap ML: A hierarchical framework for machine learning. CoRR abs/1803.06333 (2018).
[3] Harlap, A., Tumanov, A., Chung, A., Ganger, G. R., and Gibbons, P. B. Proteus: Agile ML elasticity through tiered reliability in dynamic resource markets. In Proceedings of the Twelfth European Conference on Computer Systems (2017), ACM, pp. 589–604.
[4] Jaggi, M., Smith, V., Takáč, M., Terhorst, J., Krishnan, S., Hofmann, T., and Jordan, M. I. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 3068–3076.
[5] Lin, T., Stich, S. U., and Jaggi, M. Don't use large mini-batches, use local SGD. CoRR abs/1808.07217 (2018).
[6] Qiao, A., Aghayev, A., Yu, W., Chen, H., Ho, Q., Gibson, G. A., and Xing, E. P. Litz: An elastic framework for high-performance distributed machine learning.
[7] Sikdar, S., Teymourian, K., and Jermaine, C. An experimental comparison of complex object implementations for big data systems. In Proceedings of the 2017 Symposium on Cloud Computing (New York, NY, USA, 2017), SoCC '17, ACM, pp. 432–444.
[8] Smith, V., Forte, S., Ma, C., Takáč, M., Jordan, M. I., and Jaggi, M. CoCoA: A general framework for communication-efficient distributed optimization. JMLR 18 (2018), 1–49.
[9] Stuedi, P., Trivedi, A., Pfefferle, J., Stoica, R., Metzler, B., Ioannou, N., and Koltsidas, I. Crail: A high-performance I/O architecture for distributed data processing. IEEE Data Eng. Bull. 40, 1 (2017), 38–49.
[10] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (Berkeley, CA, USA, 2010), HotCloud '10, USENIX Association, pp. 10–10.
[11] Zhang, H., Stafman, L., Or, A., and Freedman, M. J. SLAQ: Quality-driven scheduling for distributed machine learning. In Proceedings of the 2017 Symposium on Cloud Computing (2017), ACM, pp. 390–404.