SLAQ: Quality-Driven Scheduling for Distributed Machine Learning

02/13/2018 · Haoyu Zhang, et al.

Training machine learning (ML) models with large datasets can incur significant resource contention on shared clusters. This training typically involves many iterations that continually improve the quality of the model. Yet in exploratory settings, better models can be obtained faster by directing resources to jobs with the most potential for improvement. We describe SLAQ, a cluster scheduling system for approximate ML training jobs that aims to maximize the overall job quality. When allocating cluster resources, SLAQ explores the quality-runtime trade-offs across multiple jobs to maximize system-wide quality improvement. To do so, SLAQ leverages the iterative nature of ML training algorithms by collecting quality and resource usage information from concurrent jobs, and then generating highly tailored quality-improvement predictions for future iterations. Experiments show that SLAQ achieves an average quality improvement of up to 73% and an average delay reduction of up to 44% compared to resource-fairness schedulers.


1 Background and Motivation

Machine learning (ML) is an increasingly important tool for large-scale data analytics. A key challenge in analyzing massive amounts of data with ML arises from the fact that model complexity and data volume are growing much faster than hardware speed improvements. Thus, time-sensitive ML on large datasets necessitates the use and efficient management of cluster resources. Three key features of ML are particularly relevant to resource management.

ML algorithms are intrinsically approximate.

ML models are approximate functions for input-output mapping. We use quality to measure how well the model maps input to the correct output. Training ML models is a process of optimizing the model parameters to maximize the quality on a dataset.

ML training is typically iterative with diminishing returns.

Algorithms such as Gradient Descent, L-BFGS, and Expectation Maximization (EM) are widely used to iteratively solve the numerical optimization problem. The quality improvement diminishes as more iterations are completed (Figure 1).

ML training is an exploratory process.

ML practitioners retrain their models repeatedly to explore feature validity [2], tune hyperparameters [3, 4, 5, 6], and adjust model structures [7], in order to operationalize the final model with the best quality. Practitioners in experimental environments often prefer to work with more approximate models (e.g., 95% loss reduction) trained within a short period of time for preliminary testing, rather than wait a significant amount of time for a perfectly converged model with poorly tuned configurations.

Existing schedulers primarily focus on resource fairness [8, 9, 10, 11, 12, 13], but are agnostic to model quality and resource efficiency. Under such policies, a job in its early stages, which could benefit significantly from extra resources, is allocated the same share as a job that has nearly converged and cannot improve much further. This is not efficient. The key intuition behind our system is that in the context of approximate ML training, more resources should be allocated to jobs that have the most potential for quality improvement.

2 Design

Figure 1: A large fraction of the work is done in a small fraction of the training time.
Figure 2: Normalized loss for ML algorithms.
Figure 3: Resource allocation across job groups.
Figure 4: Average of normalized loss values.
Figure 5: Time to achieve a given loss-reduction percentage.
Figure 6: Scheduling time.

We present SLAQ, a cluster scheduling system for ML training jobs that aims to maximize the overall job quality. To achieve this, SLAQ needs to (1) normalize the quality metrics in order to trade off resources and quality across multiple jobs; (2) predict how much progress a job would achieve if it were granted a certain amount of resources; and (3) efficiently allocate cluster CPUs to maximize the system-wide quality improvement.

Normalizing Quality Metrics.

While metrics like accuracy and F1 score [14] are intuitively understandable, they are not applicable to non-classification algorithms. In contrast, loss functions are internally calculated by almost all algorithms in each iteration, but each loss function has a different real-world interpretation, and its range, convexity, and monotonicity depend on both the model and the optimization algorithm. Directly normalizing loss values requires a priori knowledge of the loss range, which is impractical in an online setting.

We choose to normalize the change in loss values between iterations, with respect to the largest change we have seen so far. Figure 2 shows the normalized changes of loss values for common ML algorithms. Even though the algorithms have diverse loss ranges, we observe that the changes generally follow similar convergence properties and can be normalized to decrease from 1 to 0. This helps SLAQ track and compare the progress of different jobs and, for each job, correctly project the time to reach a certain loss reduction with a given resource allocation. Note that this approach currently does not support some non-convex algorithms (such as training deep neural networks) due to the lack of an analytical model of their convergence.
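
For concreteness, a minimal sketch of this per-job normalization (our illustrative code, not SLAQ's implementation; the class and method names are hypothetical):

class LossTracker:
    """Tracks one job's loss and reports each iteration's change in loss,
    normalized by the largest change seen so far (so values fall in [0, 1])."""

    def __init__(self):
        self.prev_loss = None
        self.max_delta = 0.0

    def update(self, loss):
        if self.prev_loss is None:
            self.prev_loss = loss
            return 1.0  # convention of this sketch: the first report counts as a maximal change
        delta = abs(self.prev_loss - loss)
        self.prev_loss = loss
        self.max_delta = max(self.max_delta, delta)
        return delta / self.max_delta if self.max_delta > 0 else 0.0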

Predicting Quality Improvement.

Previous work [15, 16] estimates the runtime of general-purpose big-data jobs by analyzing the job's computation and communication structure, using offline analysis or code profiling. Because the computation and communication patterns change as ML model configurations are tuned, this offline analysis would need to be repeated each time, incurring significant overhead.

We use online quality prediction by leveraging the convergence properties of the loss functions. Based on the optimizers used for minimizing the loss function, we can broadly categorize the algorithms by their convergence rate.

I. Algorithms with sublinear convergence rates. First-order algorithms (assuming the loss function is convex and differentiable, and its gradient is Lipschitz continuous), such as gradient descent, have a convergence rate of O(1/k), where k is the number of iterations [17]. The convergence rate can be improved to O(1/k^2) with acceleration techniques such as Nesterov's accelerated gradient.

II. Algorithms with linear or superlinear convergence rates. Algorithms in this category (assuming the loss function is convex and twice continuously differentiable, so that optimizers can take advantage of second-order derivatives for faster convergence) have a convergence rate of O(c^k) for some constant 0 < c < 1. For example, L-BFGS, a widely used quasi-Newton method, has a superlinear convergence rate, which lies between linear and quadratic.

With these assumptions about convergence rates, we fit a curve to exponentially weighted historical loss values: an inverse-polynomial curve for sublinear algorithms, and an exponential-decay curve for linear and superlinear algorithms. Intuitively, loss values obtained in the near past are more informative for predicting loss values in the near future. Experiments show that this prediction achieves low error for all the algorithms in Figure 2 when predicting the loss 10 iterations ahead.
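
A minimal sketch of such weighted curve fitting, assuming an inverse-polynomial form for sublinearly converging algorithms (the exact functional forms and weighting schedule used by SLAQ may differ):

import numpy as np
from scipy.optimize import curve_fit

def sublinear_model(k, a, b, c):
    # Inverse-polynomial shape consistent with an O(1/k) convergence rate.
    return a / (k + b) + c

def predict_loss(history, k_future, decay=0.9):
    """Fit the model to (iteration, loss) history, weighting recent points more
    heavily, then extrapolate the loss at a future iteration. Needs at least a
    handful of history points for a stable fit."""
    ks = np.array([k for k, _ in history], dtype=float)
    losses = np.array([l for _, l in history], dtype=float)
    ages = ks.max() - ks                # 0 for the newest point
    sigma = 1.0 / (decay ** ages)       # older points get larger sigma, i.e., lower weight
    p0 = (losses[0] - losses[-1], 1.0, losses[-1])
    params, _ = curve_fit(sublinear_model, ks, losses, p0=p0, sigma=sigma, maxfev=10000)
    return sublinear_model(k_future, *params)

# Example: extrapolate the loss at iteration 15 from five observed iterations.
# predict_loss([(1, 0.90), (2, 0.55), (3, 0.42), (4, 0.36), (5, 0.33)], k_future=15)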

Scheduling Based on Quality Improvements.

We schedule a set of jobs running concurrently on the shared cluster over fixed-length scheduling epochs. At the beginning of each epoch, SLAQ solves an optimization problem that maximizes the total predicted normalized loss reduction over the next epoch, subject to the constraint that the sum of resources allocated across jobs cannot exceed the cluster's resource capacity.
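
Written out in notation we introduce for illustration (a sketch of the formulation, not necessarily the paper's exact statement), with $a_j$ the resources allocated to job $j$:

\[
\max_{a_1,\ldots,a_J} \; \sum_{j=1}^{J} \Delta\mathrm{Loss}_j(a_j, T)
\qquad \text{subject to} \qquad \sum_{j=1}^{J} a_j \le C,
\]

where $\Delta\mathrm{Loss}_j(a_j, T)$ is job $j$'s predicted normalized loss reduction over the next epoch of length $T$ given allocation $a_j$, and $C$ is the cluster's total resource capacity.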

The algorithm starts by allocating one unit of resource (one CPU core) to each job to prevent starvation. At each step, we consider increasing each job's allocation by one unit and compute its predicted loss reduction; among all jobs, we pick the one that gains the highest predicted loss reduction and grant it the extra unit. We repeat this until we run out of available resources.
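
A sketch of this greedy allocation loop (illustrative code; predict_reduction is a hypothetical stand-in for SLAQ's per-job loss-reduction predictor):

def allocate(jobs, total_cores, predict_reduction):
    """Greedy core allocation: every job gets one core, then each remaining core
    goes to the job with the highest predicted marginal loss reduction."""
    alloc = {job: 1 for job in jobs}    # one core each to prevent starvation
    free = total_cores - len(jobs)
    while free > 0:
        best = max(
            jobs,
            key=lambda j: predict_reduction(j, alloc[j] + 1) - predict_reduction(j, alloc[j]),
        )
        alloc[best] += 1
        free -= 1
    return alloc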

3 Evaluation

Setup.

We implemented SLAQ within the Apache Spark framework [18] and utilize its accompanying MLlib machine learning library [19]. Our testbed consists of a cluster of 20 c3.8xlarge EC2 instances on AWS. We tested SLAQ with the most common ML algorithms, including (i) classification: SVM, Neural Network (MLPC), Logistic Regression, GBT, and our extension to Spark MLlib with SVM polynomial kernels; (ii) regression: Linear/GBT Regression; and (iii) unsupervised learning: K-Means and LDA. Each algorithm is further diversified to construct different models. We collected datasets from various online sources, spanning numerical data, plain text [20], images [21], audio meta features [22], and more [23]. The baseline we compare against is a work-conserving fair scheduler, the widely used scheduling policy in cluster computing frameworks [8, 9, 10, 12, 13].

Scheduler Quality and Runtime Improvement.

We submit a set of ML training jobs with different models, with arrival times following a Poisson process. Figure 4 shows the average normalized loss value across running jobs during a window of the experiment. When a new job arrives, its initial normalized loss is 1, raising the average loss value; the spikes indicate new job arrivals. The average loss value achieved by SLAQ is consistently lower than that of the fair scheduler.

Figure 5 shows the average time it takes a job to achieve different loss values. Because SLAQ allocates more resources to jobs that have the most potential for quality improvement, it significantly reduces the average time to reach a high loss-reduction target (e.g., 90% or 95%) compared to the fair scheduler. For exploratory training, this level of accuracy is frequently sufficient. Thus, in an environment where users submit exploratory ML training jobs, SLAQ can substantially reduce users' wait times.

Figure 3 explains SLAQ's benefits by plotting the allocation of CPU cores in the cluster over time. Here we group the active jobs by their normalized loss: (i) jobs with high loss values; (ii) jobs with medium loss values; (iii) jobs with low loss values (almost converged). With a fair scheduler, the cluster CPUs are allocated to the three groups roughly in proportion to the number of jobs in each group. In contrast, SLAQ allocates far more resources to group (i) than to group (iii), which is the underlying reason for the improvement in Figures 4 and 5.

Scalability and Efficiency.

SLAQ is a fine-grained job-level scheduler: it allocates resources between competing ML jobs, but does so over short time intervals to ensure the continued rebalancing of resources across jobs. Figure 6 plots the time to schedule tens of thousands of concurrent jobs on large clusters (simulating both the jobs and worker nodes). SLAQ makes its scheduling decisions in hundreds of milliseconds to a few seconds, even when scheduling jobs across 16K worker cores. SLAQ is sufficiently fast and scalable for (rather aggressive) real-world needs.

4 Conclusion and Future Work

SLAQ is a quality-driven scheduling system designed for large-scale ML training jobs in shared clusters. SLAQ leverages the iterative nature of ML algorithms and obtains highly tailored predictions to maximize the quality of models produced by a large class of ML training jobs. As a result, SLAQ improves the overall quality of executing ML jobs while delivering useful results faster, particularly under resource contention.

Non-convex optimization.

Loss functions of non-convex optimization are not guaranteed to converge to global minima, nor do they necessarily decrease monotonically. The lack of an analytical model of the convergence properties interferes with our prediction mechanism, causing SLAQ to under- or overestimate the potential loss reduction. One potential solution is to let users provide the scheduler with a hint of their target loss or performance, which could be acquired from state-of-the-art results on similar problems or from previous training trials. The convergence properties of non-convex algorithms are being actively studied in the ML research community [24, 25]. We leave modeling the convergence of these algorithms to future work, and consider it an interesting topic for discussion at SysML.


References