## I Introduction

Most leading IT companies have deployed distributed machine learning (ML) systems, which train various ML models over large datasets to provide AI-driven services. For example, Google uses its scalable ML framework, TensorFlow, to power products such as Google Photos and Google Cloud Speech [1]. Microsoft employs its distributed cognitive toolkit, CNTK, for speech recognition and image-related learning tasks [2]. Baidu developed the PArallel Distributed Deep LEarning (PaddlePaddle) system and extensively uses large-scale ML for advertising, group shopping, etc. [3]. Tencent has applied its large-scale ML system, Angel [4], to social advertising, user portrait mining, and other recommendation services. In these scenarios, large ML clusters with hundreds or thousands of (GPU) servers are deployed, where many internal/external training jobs are run to derive various prediction/inference models, e.g., Deep Neural Networks (DNNs), Logistic Regression (LR), and Latent Dirichlet Allocation (LDA).

Training machine learning models is typically resource-intensive and time-consuming. For example, it takes hours to train a GoogLeNet model on the ImageNet dataset using a Titan supercomputer server with 32 NVIDIA K20 GPUs [5][6]. A fundamental challenge faced by an ML cluster operator is how to efficiently schedule submitted training jobs to maximally exploit the available server resources (especially the expensive GPU cards) and to complete training in an expedited fashion. In representative distributed ML systems [1][2][3][7], training is done in parallel by multiple concurrent workers. There are two parallelism models: data parallelism, where the input dataset is partitioned among the workers, and each worker holds a local copy of the entire ML model, computes model parameter changes using its allocated data chunks, and exchanges computation results with other workers to arrive at the correct global parameter updates [8][7]; and model parallelism, where the ML model is partitioned among workers and each worker updates part of the parameters using the entire dataset [9]. Data parallelism has been more widely adopted than model parallelism, given that most ML models can be stored entirely in the memory of modern GPUs, eliminating the need for partitioning a model. For example, the latest NVIDIA GPU models (TITAN X and Tesla) have onboard memory sufficient for most state-of-the-art models (e.g., [10][11]). We focus on data-parallel training jobs in this work.

A typical approach to exchanging parameter changes among workers is the parameter server framework [8][9]: there are one or multiple parameter servers (typically implemented as virtualized instances using virtual machines or containers), and the model parameters are evenly divided among and maintained by the parameter servers. In each training iteration, a worker sends its computed parameter changes to the parameter servers; the parameter servers update their maintained parameters respectively, and send the updated parameters back to the worker. The number of concurrent workers, together with the number of parameter servers supporting parameter exchange, determines the training speed and completion time of a job.
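The push/pull exchange between a worker and the parameter servers can be sketched as follows (an illustrative toy simulation, not the API of any of the systems above; all class and function names are ours):

```python
class ParameterServer:
    """Maintains one shard of the model parameters (illustrative sketch)."""
    def __init__(self, shard, lr=0.1):
        self.shard = list(shard)   # this server's slice of the parameters
        self.lr = lr

    def push_pull(self, grad_slice):
        # Apply a worker's gradients to this shard, return the updated values.
        self.shard = [w - self.lr * g for w, g in zip(self.shard, grad_slice)]
        return list(self.shard)

def worker_step(params, servers, grad_fn):
    """One iteration of a data-parallel worker: push gradients, pull parameters."""
    grads = grad_fn(params)              # local gradient computation on a mini-batch
    k = len(params) // len(servers)      # parameters are evenly sharded across servers
    updated = []
    for i, s in enumerate(servers):
        updated += s.push_pull(grads[i * k:(i + 1) * k])
    return updated                       # refreshed local copy of the full model

# Toy usage: minimize sum(w_i^2), whose gradient is 2*w.
servers = [ParameterServer([1.0, 1.0]), ParameterServer([1.0, 1.0])]
w = [1.0] * 4
for _ in range(50):
    w = worker_step(w, servers, lambda p: [2 * x for x in p])
print(max(abs(x) for x in w) < 1e-3)  # → True: parameters converge toward 0
```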

How are training jobs scheduled in existing ML systems? Google uses Borg [12] as its ML cluster scheduler. Microsoft, Tencent, and Baidu use customized versions of YARN-like schedulers [13] for managing ML jobs, based on our exchanges with their employees (there is little open discussion available). The default scheduling policies of these schedulers are typically FIFO (as in Spark [14]), Dominant Resource Fairness scheduling [15] (as in YARN [13] and Mesos [16]), or priority-based greedy approaches (as in Borg [12]). To our knowledge, none of these systems allows a varying number of concurrent workers in a training job: the number is specified by the job owner and remains fixed throughout the training course. Such static resource allocation may not fully utilize the (often expensive) ML cluster resources, preventing the best possible training speeds.

We propose an online job scheduling algorithm tailored for operating a shared ML cluster running multiple training jobs. The algorithm, referred to as OASiS, computes the best job execution schedule upon the arrival of each job, based on projected future resource availability and the potential utility the job can achieve (contingent on its completion time). Judging whether the potential job utility outweighs the resource consumption, the algorithm decides whether to admit the job, and runs the job according to the best schedule if admitted. Under this schedule, the numbers of workers and parameter servers, as well as their deployment on servers, are dynamically adjusted during the course of the job for expedited training, adapting to resource availability at different times. Over the long run, we seek to maximize the overall utility of all training jobs.

Our online algorithm design utilizes an online primal-dual framework coupled with dual subroutines to efficiently tackle the combinatorial online optimization problem. Based on the primal-dual framework, we maintain meticulously computed (dual) resource prices according to time-varying resource consumption levels (fewer resources are available when new jobs are admitted, and more when jobs are completed), and decide job admission and resource allocation accordingly. Given the resource prices, the dual subroutines are efficient, optimal algorithms that compute the best schedule of worker and parameter server deployment for each job, exploiting a dynamic programming structure of the underlying multi-timeslot, multi-dimensional resource packing problem.

We rigorously prove the polynomial running time of our online algorithm and its long-term performance guarantee, in terms of a good competitive ratio in total job utility. We evaluate the practical effectiveness of OASiS using trace-driven simulation and testbed experiments, implementing it as a new scheduler module in Kubernetes [17] for MXNet, a popular distributed machine learning platform [7]. The results show that OASiS outperforms commonly adopted scheduling policies, especially in systems with resource scarcity.

## II Related Work

### II-A Distributed Machine Learning Systems

A number of distributed ML frameworks have been designed and deployed, e.g., TensorFlow [1], CNTK [2], PaddlePaddle [3], and MXNet [7]. The parameter server framework, due mainly to Li et al. [8], has been incorporated into several of them (e.g., [7][9]). In these systems, a static set of workers is employed; new workers are deployed only upon the failure of existing ones. Most adopt Borg or YARN-like schedulers for ML cluster management [12][13].

Recently in the literature, Dorm [18] advocates partitioning an ML cluster, running one ML application per partition, and dynamically resizing the partitions for resource efficiency and fairness, by solving a mixed integer linear program (MILP) with a standard solver. In comparison, we design an online algorithm that guides resource allocation over time with proven performance. Dolphin [19] solves a cost-minimizing problem to find the optimal number of nodes to use for an ML job, and reconfigures the system dynamically. It focuses on runtime optimization of a single ML job, rather than optimal resource allocation among multiple concurrent jobs. Similarly, Yan et al. [20] develop performance models to quantify the impact of model and data partitioning and system provisioning on the training performance of a DNN; online job scheduling and resource sharing are not considered.

### II-B Job Scheduling and Resource Allocation in Cloud Systems

There have been many studies on admission control and job scheduling/resource allocation in general cloud systems. Rayon [21] performs online admission control by accepting all jobs that fit in the cluster agenda and rejecting those it cannot satisfy, considering reservation of future resources. YARN [13] uses admission control to delay allocating fallow cluster resources so as to protect its own availability, and schedules admitted jobs using a dominant resource fairness strategy. Apollo [22] utilizes various admission control policies and decides how and when to assign resource quotas to submitted jobs in a virtual cluster, using estimation-based scheduling, i.e., minimizing estimated task completion time based on relevant historical factors. In comparison, we maintain virtual resource prices to decide job admission, which, together with optimal resource scaling, achieves long-term overall job utility maximization.

In the offline setting, Huang et al. [23] and Chen et al. [24] study cloud job scheduling problems targeting max-min fairness among jobs. For online scheduling, Azar et al. [25] propose an online preemptive job scheduling algorithm achieving a constant competitive ratio, for jobs with constant utility running on a single machine. Lucier et al. [26] propose an efficient heuristic for online job scheduling with preemption, aiming to maximize the total value of all jobs; the resources allocated to each job are fixed over time, and job value is not influenced by completion time. Zhou et al. [27] and Zhang et al. [28] design mechanisms for online cloud resource allocation and pricing, where no adjustment of a job's allocated resources is considered. Xiao et al. [29] design a scheduler for automatic scaling of Internet applications in a cloud, targeting a high demand satisfaction ratio and short request-response time. TetriSched [30] enables resource scaling by periodically solving a schedule optimization problem among all pending jobs to compute the amounts of resources they need. These works do not provide theoretical guarantees on long-term performance.

## III Problem Model

### III-A Distributed Machine Learning System

Fig. 1 illustrates an ML cluster in which a set of training jobs are submitted in an online fashion over the system timespan. The training jobs come with large input datasets and derive potentially different ML models using data-parallel training and the parameter server framework [8]. Each job arrives at its own time,¹ using a number of workers and parameter servers for model training.

¹We define [X] = {1, 2, …, X} throughout the paper, where X can be different quantities.

Workers and parameter servers are implemented as virtual machines (VMs) or containers on physical servers. The ML cluster hosts a set of physical servers for worker deployment, each with a given capacity of each type of resource, and another set of physical servers for running parameter servers, each likewise with its own resource capacities. The resource types include GPU, CPU, memory, disk storage, and the bandwidth capacity of the server NIC. We ignore the disk I/O constraint, as SSDs are widely used in ML clusters and the data read/write delay is often negligible. We practically assume two types of physical machines, running workers and parameter servers separately, given that parameter servers are typically placed on machines with high bandwidth but no GPUs, while workers run on GPU servers. Such a separation between workers and parameter servers is common in existing ML systems [8][9][31].

Workers and parameter servers are customized for each job and not shared among different jobs. Each worker (parameter server) of a job occupies a given amount of each type of resource, and an amount of bandwidth is reserved for each worker (parameter server) of the job. We do not distinguish upload and download bandwidth, but assume they are symmetric. Bandwidth reservation for a VM or container is common for accelerated computing in cloud platforms, to guarantee the data transfer performance of each instance; e.g., EC2 P2 GPU instances on AWS come with fixed reserved bandwidth [32].

### III-B Asynchronous Training Workflow

The input dataset of a training job is stored in a distributed storage system (e.g., HDFS [33]). The dataset is divided into equal-sized data chunks trained by different workers.² Each data chunk is further divided into equal-sized mini-batches.

²We assume data chunks are assigned to workers based on data locality: the data chunks are stored in an HDFS-like distributed file system, and each chunk is assigned to workers in the preference order of workers on the same server holding a replica of the chunk, workers on the same rack as a replica, and then other workers.

Upon start, a worker fetches a data chunk.³ The worker then processes the first mini-batch in the chunk, i.e., computes what changes should be made to the parameters of the ML model (to move them toward their optimal values) using the data in the mini-batch. Parameter changes are typically expressed as gradients (directions of change), and a distributed stochastic gradient descent method is typically used by the workers to jointly improve the parameters [8]. For example, when training an LR model for ad click-through-rate prediction, the parameters are the weights of features (e.g., the text and images used in an ad) in the prediction model, and the gradients are the changes to those weights [34].

³The ML framework, e.g., PaddlePaddle, assigns data chunks to workers.

After processing a mini-batch, the worker sends the gradients to the parameter servers for parameter updates. The parameter servers in a job are usually each responsible for an evenly divided share of the parameters. In the above example, if there are two parameter servers, each is responsible for half of the weights, and the gradients computed by a worker are divided and sent to the parameter servers maintaining the respective weights. Upon receiving updated parameters from all parameter servers, the worker continues computing gradients using the next mini-batch, and so on. After an entire data chunk is processed, the worker moves on to the next data chunk assigned to it.

Fig. 2 illustrates the asynchronous training workflow in our system: the training progress at different workers in a job is not synchronized, and each parameter server updates its parameters each time it receives gradients from a worker. In the above example, a parameter server updates its weights by a gradient descent rule, and then sends the updated weights back to the worker. Another representative training mode in today's ML systems is synchronous training, where training progress at all workers is synchronized and each parameter server updates its parameters only after it has collected gradients from all workers in each training iteration (i.e., the training of one mini-batch). Asynchronous training achieves better bandwidth utilization, as gradients and updated parameters are sent from/to workers at different times, and hence potentially faster convergence. Further, the model accuracy achieved with asynchronous training is not affected by changes in the worker population over the course of training [8][9] (as we advocate), while with synchronous training it varies if different numbers of concurrent workers are used [6][20].

Each job has a given number of input data chunks, each divided into a given number of mini-batches. The training time (gradient computation) for each mini-batch of a job is assumed to be equal for all mini-batches on all workers in the same job, given the same resource allocation per worker. The size of the gradients produced by each worker after processing a mini-batch equals the size of the updated parameters the worker will receive from all parameter servers, since the total numbers of gradients and parameters are always the same and both use the same floating-point representation [6]. The time for sending gradients to, or receiving updated parameters from, all parameter servers can then be computed from the gradient size and the worker's reserved bandwidth (the bandwidth at a parameter server is typically large enough to exchange gradients/parameters with multiple workers). When training the ResNet-152 model on the ImageNet dataset [10][5], training one mini-batch takes about one second, while training a data chunk takes less than one minute; the gradients/parameters exchanged between a worker and the parameter servers per mini-batch amount to hundreds of MB.
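The per-iteration timing described above can be sketched as a back-of-the-envelope helper (the function names and the illustrative numbers below are ours, not the paper's, since the original symbols were not preserved in this copy):

```python
def time_per_mini_batch(train_secs, grad_mb, worker_gbps):
    """Gradient computation plus one gradient push and one parameter pull,
    both over the worker's reserved bandwidth (parameter-server bandwidth
    is assumed large enough not to be the bottleneck)."""
    transfer = grad_mb * 8 / (worker_gbps * 1000)   # seconds for one direction
    return train_secs + 2 * transfer                # compute + push + pull

def mini_batches_per_slot(slot_secs, train_secs, grad_mb, worker_gbps):
    """How many mini-batches one worker can finish within one scheduling slot."""
    return int(slot_secs // time_per_mini_batch(train_secs, grad_mb, worker_gbps))

# Illustrative numbers: 1 s of computation per mini-batch, 200 MB of gradients,
# a 10 Gbps bandwidth reservation, and 1-hour scheduling slots.
print(time_per_mini_batch(1.0, 200, 10))          # → 1.32 seconds per iteration
print(mini_batches_per_slot(3600, 1.0, 200, 10))  # → 2727
```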

We ignore worker/parameter server setup time, since the image containing the training program can be pre-stored on a physical machine or fetched in a very short time (e.g., a container image of hundreds of MB can be fetched within seconds over a 10 Gbps network). We also ignore the time for a worker to fetch data chunks from the distributed storage, since a worker only needs to explicitly retrieve the first chunk; the fetching of later chunks can be hidden behind training through pipelining. Fetching one data chunk takes much less time than training on it, e.g., a fraction of a second over a 10 Gbps network. With asynchronous training, the computation time at a parameter server for updating parameters using gradients from a single worker is very small (around tens of milliseconds for ResNet-152) and hence negligible as well.

### III-C Offline Optimization Problem

Upon the arrival of an ML job, the following decisions are made:⁴

⁴We focus on internal ML jobs within a company, such that the numbers of workers and parameter servers can be specified by our algorithm.

(i) Whether the job should be admitted, denoted by a binary variable that equals 1 if the job is admitted and 0 otherwise. Admission control is common in cloud management systems [12][13], and jobs that are not admitted can be queued or resubmitted at a later time. (ii) The number of workers of the job to deploy on each physical server in each time slot at and after its arrival, indicated by an integer variable. (iii) The number of parameter servers of the job to deploy on each physical server in each time slot at and after its arrival, also an integer variable. Given that it is not practical to adjust worker and parameter server deployment frequently, the length of each time slot is potentially much larger than the duration of a training epoch; for example, one time slot can be an hour or longer.

Each job has a completion time slot and a non-negative utility, non-increasing in the completion time, specifying the job's value under different completion times [23][35]. The offline optimization problem maximizing the overall utility is formulated as follows. Important notation is summarized in Table I.

(1) maximize the total utility of all admitted jobs [equation not recovered]

subject to:

(2)–(13) [equations not recovered; the constraints are described below]

Constraint (2) ensures that for each admitted job, a sufficient number of workers are deployed to accomplish training of its dataset for the required number of epochs: the per-mini-batch time covers training a mini-batch, sending gradients to the parameter servers, and receiving updated parameters back, and the total count of mini-batches to be trained in the job must be matched by the total work time of all deployed workers. (3) specifies that the number of concurrent workers of a job should be no more than its number of data chunks, ensuring that each data chunk is processed by at most one worker in each time slot (so that data chunks are trained evenly over time). (4) and (5) are resource capacity constraints on the physical machines for worker and parameter server deployment, respectively. (6) guarantees that the total bandwidth of the parameter servers is no smaller than the total bandwidth of all workers in each job, i.e., the parameter servers will not become bottlenecks during gradient/parameter exchange. (7) upper-bounds the number of parameter servers by the number of workers at any time in each job, which is common in practical ML systems [8][9]. (8) gives the completion time slot of each job. (9) and (10) set worker and parameter server numbers to 0 before a job's arrival.
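For concreteness, the per-job constraints can be checked against a candidate schedule roughly as follows (a sketch with hypothetical names; the per-server capacity constraints (4) and (5) are left out, since they require placement decisions rather than just per-slot counts):

```python
def feasible(schedule_w, schedule_ps, chunks, epochs, batches_per_chunk,
             iter_time, slot_len, w_bw, ps_bw):
    """Check constraints (2), (3), (6), (7) for one job's candidate schedule.
    schedule_w[t] / schedule_ps[t]: total workers / parameter servers in slot t."""
    # (2): enough worker time to train every chunk for the required epochs.
    batches_done = sum(schedule_w) * (slot_len // iter_time)
    if batches_done < chunks * epochs * batches_per_chunk:
        return False
    for w, p in zip(schedule_w, schedule_ps):
        if w > chunks:               # (3): at most one worker per data chunk
            return False
        if p * ps_bw < w * w_bw:     # (6): servers can absorb all worker traffic
            return False
        if p > w:                    # (7): no more parameter servers than workers
            return False
    return True

# A 2-slot schedule for a toy job: 2 chunks, 1 epoch, 10 mini-batches per chunk,
# 1 s per iteration, 20 s slots, 1 Gbps worker and 2 Gbps server reservations.
print(feasible([2, 0], [1, 0], chunks=2, epochs=1, batches_per_chunk=10,
               iter_time=1, slot_len=20, w_bw=1, ps_bw=2))  # → True
```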

The optimization problem involves integer variables and the non-conventional constraint (8). We design an efficient algorithm that solves it in an online fashion, without assuming any knowledge of future job arrivals.

TABLE I: Important notation (symbols not recovered in this copy)

- \# of jobs; system timespan
- completion time of a job; arrival time of a job
- \# of resource types; \# of data chunks in a job
- accept a job or not; a job's utility
- \# of training epochs for a job
- \# of mini-batches in a data chunk of a job
- \# of servers to deploy workers (parameter servers)
- capacity of each resource type on a server deploying workers (parameter servers)
- per-type resource occupation of a worker (parameter server) of a job
- \# of workers of a job deployed on a server in a time slot
- \# of parameter servers of a job deployed on a server in a time slot
- bandwidth of a worker (parameter server) of a job
- time to train a mini-batch in a job
- size of gradients/parameters exchanged between a worker and parameter servers in a job
- select a schedule for a job or not
- completion time slot of a job under a schedule
- \# of workers (parameter servers) on a server in a slot under a schedule of a job
- the set of feasible schedules of a job

## IV Online Algorithm

### IV-A Problem Reformulation

To circumvent the non-conventional constraint (8), we reformulate problem (1) into the following integer linear program (ILP). Here we consider the set of feasible schedules for each job, each schedule corresponding to a set of decisions satisfying constraints (2), (3), (6), (7), and (9)–(13). There is potentially an exponential number of feasible schedules per job, due to the combinatorial nature of those constraints. The decision variables in the ILP are binary variables indicating whether a job is admitted and scheduled according to a particular schedule. A job's completion time, and its numbers of workers (parameter servers) on each server in each time slot, are given constants under a fixed schedule (not decision variables in (14)).

(14) maximize the total utility of jobs over their selected schedules [equation not recovered]

s.t.

(15)–(18) [equations not recovered: server resource capacity constraints (15)(16), at most one schedule per admitted job (17), integrality of the selection variables (18)]

We use an indicator to denote whether a schedule uses a given server to deploy worker(s) or parameter server(s) for a job in a given time slot. (14), (15), and (16) are equivalent to (1), (4), and (5), respectively. (17) and (18) correspond to (2), (3), and (6)–(13). Problems (1) and (14) are equivalent, since a feasible solution to (1) has a corresponding feasible solution to (14), and vice versa, with the same objective values. Though the number of variables in (14) is potentially exponential, we will design an efficient online algorithm that solves (14) in polynomial time, exploiting the primal-dual framework [36]. We formulate the dual of (14) by relaxing the integrality constraints (18) and associating dual variables with (15), (16), and (17), respectively.

(19)–(20) [dual program not recovered: minimize the dual objective (19), subject to the dual constraint (20)]

The dual variables associated with the primal capacity constraints on the worker and parameter-server machines can be interpreted as the unit costs of each resource type on each server in each time slot, from which the total resource cost of all workers (parameter servers) of a job under a schedule follows. The RHS of (20) is then the job's utility minus its overall resource cost under a schedule. To minimize the dual objective, the dual variable associated with (17) should equal the maximum of this quantity over all feasible schedules (and zero if that maximum is negative); hence it can be nicely interpreted as the payoff of admitting the job under its best schedule:

(21) [equation not recovered: the payoff equals the maximum, over the job's feasible schedules, of job utility minus total resource cost, floored at zero]

### IV-B Online Algorithm

These observations inspire the design of an online algorithm: upon the arrival of a job, we compute its best schedule (assuming the job is admitted). Then we check whether the RHS of (20) achieved by this schedule is positive. If so (i.e., a positive payoff), we accept the job and run it according to the best schedule; otherwise (zero payoff), the job is rejected. The rationale is that, since resources are limited, we wish to accept jobs with larger utility and lower resource consumption, to maximize (14). A positive payoff indicates that the job's utility is high enough to justify its resource consumption, and we schedule the job in the way that maximizes its payoff.
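The admission rule can be sketched as follows (illustrative only; the schedule encoding and the price tables are hypothetical stand-ins for the dual prices maintained by OASiS):

```python
def admission_decision(schedule, utility_fn, prices_w, prices_ps):
    """Admit a job iff the payoff of its best schedule -- utility minus the
    priced cost of all resources it would consume -- is strictly positive.
    schedule: list of (slot, server, kind, per-resource demand) entries."""
    cost = 0.0
    for slot, server, kind, demand in schedule:
        prices = prices_w if kind == "worker" else prices_ps
        cost += sum(prices[slot][server][r] * d for r, d in enumerate(demand))
    completion = max(slot for slot, *_ in schedule)
    payoff = utility_fn(completion) - cost
    return payoff > 0, payoff

# Toy example: one worker on server 0 for slots 0 and 1, each slot using 1 GPU
# and 4 CPU cores, flat unit prices, and a utility decaying with completion time.
prices = {t: {0: [1.0, 0.1]} for t in range(4)}          # [GPU price, CPU price]
sched = [(0, 0, "worker", [1, 4]), (1, 0, "worker", [1, 4])]
admit, payoff = admission_decision(sched, lambda t: 10 - 2 * t, prices, prices)
print(admit, payoff)  # → True 5.2
```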

To implement this idea, we need to resolve the following: (i) solve (21) to find the best schedule for each job. Simply enumerating all feasible schedules is impractical, given the exponential size of the schedule set; we design an efficient subroutine that produces the best schedule in polynomial time in Sec. IV-C. (ii) Compute the dual resource prices so as to ensure a positive payoff for job schedules achieving high utilities (if there are enough resources to accommodate them), and a non-positive payoff for job schedules with low utilities or without available resources.

The sketch of our online algorithm, OASiS, is given in Alg. 1. In line 5, Alg. 2 is the subroutine that computes the best schedule. In lines 9 and 11, we record the amount of allocated resource of each type on each server for the (future) time slots. In lines 10 and 12, we update the dual resource prices using carefully designed price functions:

(22)–(26) [price functions not recovered: exponentially increasing functions of the allocated resource amounts, parameterized by the utility bounds and scaling factors described below]

The upper bound is the maximum per-unit-resource job utility for each resource type on the physical servers deploying workers (parameter servers), among all jobs; it derives from the largest utility a job can achieve by using the maximum number of workers at all times to finish its training epochs in the shortest possible completion time. The lower bound represents the minimum unit-time-unit-resource job utility among all jobs; it derives from the smallest utility a job may achieve, when it completes only at the end of the timespan. The scaling factors are chosen to ensure the initial value of the dual objective is bounded.

The rationale behind our price functions is as follows. (i) The prices should be low enough at the beginning to accept many incoming jobs: when no resources have been allocated, the price equals the lower bound, and any job can be admitted at this point, since the lower bound represents the lowest unit job utility (a formal proof is given in Appendix A). (ii) The prices increase exponentially as the allocated amounts of resources grow, to filter out low-utility jobs that arrive early and to reserve resources for higher-utility jobs that may arrive later. (iii) The price should be high enough once a resource on a server is exhausted, so that no more jobs requiring this resource are admitted: when a resource is fully allocated, its price reaches the upper bound, and no further jobs requiring it are admitted, since the upper bound is the largest unit job utility (proof in Appendix A). The price functions are key to guaranteeing a good competitive ratio for our online algorithm.
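A minimal sketch of a price function with these three properties, assuming the standard exponential form used in online primal-dual algorithms (the paper's exact functions are those in (22)–(26)):

```python
def price(allocated, capacity, U, L):
    """Exponential marginal price for one resource type on one server:
    equals L (the lowest unit job utility) when the server is empty, and
    U (the highest unit job utility) when the resource is exhausted."""
    return L * (U / L) ** (allocated / capacity)

U, L, C = 100.0, 1.0, 8        # utility bounds and, e.g., 8 GPUs on the server
print(price(0, C, U, L))       # → 1.0   (empty server: admit almost any job)
print(price(C, C, U, L))       # → 100.0 (full server: only top-utility jobs)
print(price(4, C, U, L))       # → 10.0  (half load: geometric midpoint)
```

The geometric growth is what reserves capacity for late-arriving, high-utility jobs: each additional unit of load multiplies the price by a constant factor rather than adding to it.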

### IV-C Subroutine for Finding the Best Job Schedule

The optimization problem in (21) that computes the best schedule for a job is equivalent to the following:

(27) [equation not recovered: maximize the job's payoff over its worker and parameter server deployment decisions]

We next show that (27) can be efficiently and optimally solved using dynamic programming and a greedy algorithm. When we fix the job's completion time slot, (27) simplifies to the following ILP:

(28)–(31) [equations not recovered: minimize the total resource cost over the job's running window (28), subject to the workload constraint (29) and per-slot constraints (30)(31)]

In problem (28), deployment decisions in different time slots are coupled only through constraint (29), which requires sufficient workers and parameter servers to be deployed so that all data chunks are trained for the required number of epochs before the chosen completion time. We refer to the RHS of (29) as the training workload, indicating the total count of data-chunk trainings (a data chunk is counted multiple times if trained multiple times). Since the time for training a data chunk is much smaller than the duration of a time slot, we may safely assume a worker trains an integer number of data chunks in each time slot. The training workload is distributed over the different time slots of the job's running window. If we know how much training workload is to be fulfilled in a given time slot, we are left with a further simplified problem:

(32) [equation not recovered: minimize the resource cost of fulfilling the given time slot's training workload]

Though (32) is an ILP, it can be optimally solved with a greedy algorithm (presented in Alg. 2 and analyzed in Theorem 1). We therefore arrive at the following algorithm for finding the best schedule of a job: enumerate the job's end time; for each candidate end time, use dynamic programming to compute how to best distribute the training workload over the preceding time slots; then use the greedy algorithm to decide the deployment of workers and parameter servers in each slot. The full procedure is given in Alg. 2.
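The distribute-over-slots structure can be illustrated with a toy stand-in for the dynamic program (hypothetical names; the real DP_COST also invokes the greedy server-packing step to obtain each slot's cost, here abstracted as a `slot_cost` callback):

```python
from functools import lru_cache

def best_distribution(T, workload, max_per_slot, slot_cost):
    """DP sketch: split a job's training workload (in data-chunk trainings)
    over slots 1..T at minimum total priced cost. slot_cost(t, w) is the cost
    of running w workers (plus matching parameter servers) in slot t."""
    @lru_cache(maxsize=None)
    def dp(t, remaining):
        if remaining <= 0:
            return 0.0, ()               # all workload placed
        if t > T:
            return float("inf"), ()      # infeasible by this deadline
        best = (float("inf"), ())
        for w in range(min(max_per_slot, remaining) + 1):
            tail_cost, tail_plan = dp(t + 1, remaining - w)
            total = slot_cost(t, w) + tail_cost
            if total < best[0]:
                best = (total, (w,) + tail_plan)
        return best
    return dp(1, workload)

# Toy prices: slot t charges t per worker, so later slots are more expensive.
cost, plan = best_distribution(T=3, workload=4, max_per_slot=2,
                               slot_cost=lambda t, w: w * t)
print(cost, plan)  # → 6.0 (2, 2): front-load work into the cheaper slots
```

The per-slot cap mirrors constraint (3) (no more concurrent workers than data chunks), and enumerating `T` on the outside reproduces the end-time loop of Alg. 2.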

In Alg. 2, we enumerate the job's completion time slot (line 4) and find the optimal schedule for each candidate by calling function DP_COST (line 5). We then compare the payoffs achieved by the schedules with different completion times and select the best schedule, i.e., the one achieving the highest payoff (lines 6-9).
