1 Introduction
Machine Learning (ML) has become a primary mechanism for distilling structured information and knowledge from raw data, turning them into automatic predictions and actionable hypotheses for diverse applications, such as analyzing social networks [4], reasoning about customer behaviors [3], interpreting texts, images and videos [67], identifying disease and treatment paths [34], driving vehicles without the need for a human [53], and tracking anomalous activity for cybersecurity [9], amongst others. The majority of ML applications are supported by a moderate number of families of well-developed ML approaches, each of which embodies a continuum of technical elements from model design, to algorithmic innovation, and even to perfection of the software implementation, and which attracts ever-growing novel contributions from the research and development community. Modern examples of such approaches include Graphical Models [54, 28, 58], Regularized Bayesian models [72, 70, 71], Nonparametric Bayesian models [18, 49], Sparse Structured models [63, 27], Large-margin methods [8, 46, 21, 29], Matrix Factorization [31, 41], Sparse Coding [44, 32], and Latent Space Modeling [4, 68]. A common ML practice that ensures mathematical soundness and outcome reproducibility is for practitioners and researchers to write an ML program (using any generic high-level programming language) for an application-specific instance of a particular ML approach (e.g. semantic interpretation of images via a deep learning model such as a convolutional neural network). Ideally, this program is expected to execute quickly and accurately on a variety of hardware and cloud infrastructure: laptops, server machines, GPUs, cloud compute and virtual machines, distributed network storage, Ethernet and Infiniband networking, just to name a few. Thus, the program is
hardware-agnostic but ML-explicit (i.e., following the same mathematical principles when trained on data, and attaining the same result regardless of hardware choices). With the advancements in sensory, digital storage, and Internet communication technologies, conventional ML research and development — which excels in model, algorithm, and theory innovations — is now challenged by the growing prevalence of Big Data collections, such as hundreds of hours of video uploaded to video-sharing sites every minute (https://www.youtube.com/yt/press/statistics.html), or petabytes of social media on billion-plus-user social networks (https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/). The rise of Big Data is also being accompanied by an increasing appetite for higher-dimensional and more complex ML models with billions to trillions of parameters, in order to support the ever-increasing complexity of data, or to obtain still higher predictive accuracy (e.g. for better customer service and medical diagnosis) and support more intelligent tasks (e.g. driverless vehicles and semantic interpretation of video data) [62, 11]. Training such Big ML Models over such Big Data is beyond the storage and computation capabilities of a single machine, and this gap has inspired a growing body of recent work on distributed ML, where ML programs are executed across research clusters, data centers and cloud providers with 10s to 1000s of machines.
Given $P$ machines instead of one machine, one would expect a nearly $P$-fold speedup in the time taken by a distributed ML program to complete, in the sense of attaining a mathematically equivalent or comparable solution to that produced by a single machine; yet, the reported speedup often falls far below this mark — for example, even recent state-of-the-art implementations of topic models [2] (a popular method for text analysis) cannot achieve $P$-fold speedup with $P$ machines, because of mathematical incorrectness in the implementation (as shown in [68]), while deep learning on MapReduce-like systems such as Spark has yet to achieve $P$-fold speedup with $P$ machines [42]. Solving this scalability challenge is therefore a major goal of distributed ML research, in order to reduce the capital and operational cost of running Big ML applications.
Given the iterative-convergent nature of most — if not all — major ML algorithms powering contemporary large-scale applications, at first glance one might naturally identify two possible avenues toward scalability: faster convergence as measured by iteration number (also known as convergence rate in the ML community), and faster per-iteration time as measured by the actual speed at which the system executes an iteration (also known as throughput in the systems community). Indeed, a major current focus of many distributed ML researchers is on algorithmic correctness as well as faster convergence rates over a wide spectrum of ML approaches [1, 43]. However, many of the "accelerated" algorithms from this line of research face difficulties in making their way to industry-grade implementations, because of their idealized assumptions about the system — for example, the assumption that networks are infinitely fast (i.e. zero synchronization cost), or the assumption that all machines make algorithm progress at the same rate (implying no background tasks and only a single user of the cluster, which are unrealistic expectations for real-world research and production clusters shared by many users). On the other hand, systems researchers focus on high iteration throughput (more iterations per second) and fault-recovery guarantees, but may choose to assume that the ML algorithm will work correctly under non-ideal execution models (such as fully asynchronous execution), or that it can be rewritten easily under a given abstraction (such as MapReduce or Vertex Programming) [15, 20, 64]. In both ML and systems research, issues from the other side can become oversimplified, which may in turn obscure new opportunities to reduce the capital cost of distributed ML.
In this paper, we propose a strategy that combines ML-centric and system-centric thinking, in which the nuances of both ML algorithms (mathematical properties) and systems hardware (physical properties) are brought together to allow insights and designs from both ends to work in concert and amplify each other.
Many of the existing general-purpose Big Data software platforms present a unique tradeoff among correctness, speed of execution, and ease-of-programmability for ML applications. For example, dataflow systems such as Hadoop and Spark [64] are built on a MapReduce-like abstraction [15] and provide an easy-to-use programming interface, but have paid less attention to ML properties such as error tolerance, or to fine-grained scheduling of computation and communication to speed up ML programs — as a result, they offer correct ML program execution and easy programming, but are slower than ML-specialized platforms [57, 36]. This (relative) lack of speed can be partly attributed to the bulk synchronous parallel (BSP) synchronization model used in Hadoop and Spark, where machines assigned to a group of tasks must wait at a barrier for the slowest machine to finish, before proceeding with the next group of tasks (e.g. all mappers must finish before the reducers can start) [23]. Another example is the graph-centric platforms such as GraphLab and Pregel, which rely on a graph-based "vertex programming" abstraction that opens up new opportunities for ML program partitioning, computation scheduling, and flexible consistency control — hence, they are usually correct and fast for ML. However, ML programs are not usually conceived as vertex programs (instead, they are mathematically formulated as iterative-convergent fixed-point equations), and it requires non-trivial effort to rewrite them as such. In a few cases, the graph abstraction may lead to incorrect execution or suboptimal execution speed [30, 33]. Of recent note is the parameter server paradigm [23, 12, 55, 2, 36], which provides a "design template" or philosophy for writing distributed ML programs from the ground up, but is not a programmable platform or work partitioning system in the same sense as Hadoop, Spark, GraphLab and Pregel.
Taking into account the common ML practice of writing ML programs for application-specific instances, a usable software platform for ML practitioners could instead offer two utilities: (1) a ready-to-run set of ML workhorse implementations — such as stochastic proximal descent algorithms [6, 69], coordinate descent algorithms [16], and Markov Chain Monte Carlo algorithms [19] — that can be reused across different ML algorithm families. In turn, these workhorse implementations are supported by (2) an ML Distributed Cluster Operating System, which partitions and executes these workhorses across a wide variety of hardware. Such a software platform not only realizes the capital cost reductions obtained through distributed ML research, but even complements it by reducing the human cost (scientist- and engineer-hours) of Big ML applications, through easier-to-use programming libraries and cluster management interfaces.
With the growing need to enable data-driven knowledge distillation, decision making, and perpetual learning — which are representative hallmarks of the vision for machine intelligence — in the coming years, the major form of computing workloads on Big Data is likely to undergo a rapid shift from database-style operations for deterministic storage, indexing, and queries, to ML-style operations such as probabilistic inference, constrained optimization, and geometric transformation. To best fulfill these computing tasks, which must perform a large number of passes over the data and solve a high-dimensional mathematical program, there is a need to revisit the principles and strategies in traditional system architectures, and explore new designs that optimally balance correctness, speed, programmability, and deployability. A key insight necessary for guiding such explorations is an understanding that ML programs are optimization-centric, and frequently admit iterative-convergent algorithmic solutions rather than one-step or closed-form solutions.
Furthermore, ML programs are characterized by three properties: (1) error tolerance, which makes ML programs robust against limited errors in intermediate calculations; (2) dynamic structural dependencies, where the changing correlations between model parameters must be accounted for in order to achieve efficient, near-linear parallel speedup; and (3) non-uniform convergence, where each of the billions (or trillions) of ML parameters can converge at vastly different iteration numbers (typically, some parameters will converge in 2-3 iterations, while others take hundreds). These properties can be contrasted with traditional programs (such as sorting and database queries), which are transaction-centric and only guaranteed to execute correctly if every step is performed with atomic correctness [15, 64]. In this paper, we shall derive unique design principles for distributed ML systems based on these properties; these design principles strike a more effective balance between ML correctness, speed and programmability (while remaining generally applicable to almost all ML programs), and are organized into four upcoming sections: (I) How to distribute ML programs; (II) How to bridge ML computation and communication; (III) How to communicate; (IV) What to communicate. Before delving into the principles, let us first review some necessary background information about iterative-convergent ML algorithms.
2 Background: Iterative-Convergent ML Algorithms
With a few exceptions, almost all ML programs can be viewed as optimization-centric programs that adhere to a general mathematical form:

(1)   $\mathcal{L}(\theta, D) = f(\theta, D) + r(\theta)$

where $D = \{x_i, y_i\}_{i=1}^{N}$ denotes the input data and $\theta$ denotes the model parameters.
In essence, an ML program tries to fit $N$ data samples (which may be labeled or unlabeled, depending on the real-world application being considered), represented by $\{x_i, y_i\}_{i=1}^{N}$ (where $y_i$ is present only for labeled data samples), to a model represented by $\theta$. This fitting is performed by optimizing (maximizing or minimizing) an overall objective function $\mathcal{L}$, composed of two parts: a loss function $f$ that describes how data should fit the model, and a structure-inducing function $r$ that incorporates domain-specific knowledge about the intended application, by placing constraints or penalties on the values $\theta$ can take.
The apparent simplicity of Eq. 1 belies the potentially complex structure of the functions $f, r$, and the potentially massive size of the data $D$ and model $\theta$. Furthermore, ML algorithm families are often identified by their unique characteristics on $f$, $r$, $D$ and $\theta$. For example, a typical deep learning model for image classification, such as [29], will contain 10s of millions through billions of matrix-shaped model parameters in $\theta$, while the loss function $f$ exhibits a deep recursive structure that learns a hierarchical representation of images similar to the human visual cortex. Structured sparse regression models [34] for identifying genetic disease markers may use overlapping structure-inducing functions $r(\theta) = \sum_{g} r_g(\theta_g)$, where the $\theta_g$ are overlapping subsets of the coordinates of $\theta$, in order to respect the intricate process of chromosomal recombination. Graphical models, particularly Topic models, are routinely deployed on billions of documents — i.e. $N$ on the order of $10^9$, a volume that is easily generated by social media such as Facebook and Twitter — and can involve up to trillions of parameters in order to capture rich semantic concepts over so much data [62].
Apart from specifying Eq. 1, one must also find the model parameters $\theta$ that optimize $\mathcal{L}$. This is accomplished by selecting one out of a small set of algorithmic techniques, such as stochastic gradient descent [6], coordinate descent [16], Markov Chain Monte Carlo (MCMC) [19], and variational inference (to name just a few). (Strictly speaking, MCMC algorithms do not perform the optimization in Eq. 1 directly — rather, they generate samples from the function $\mathcal{L}$, and additional procedures are applied to these samples to find an optimizer $\theta^*$.) The chosen algorithmic technique is applied to Eq. 1 to generate a set of iterative-convergent equations, which are implemented as program code by ML practitioners, and repeated until a convergence or stopping criterion is reached (or just as often, until a fixed computational budget is exceeded). Iterative-convergent equations have the following general form:

(2)   $\theta^{(t)} = F\big(\theta^{(t-1)},\ \Delta_{\mathcal{L}}(\theta^{(t-1)}, D)\big)$
where the parenthesized superscript $(t)$ denotes the iteration number. This general form produces the next iteration's model parameters $\theta^{(t)}$, from the previous iteration's $\theta^{(t-1)}$ and the data $D$, using two functions: (1) an update function $\Delta_{\mathcal{L}}$ (which increases the objective $\mathcal{L}$) that performs computation on the data $D$ and previous model state $\theta^{(t-1)}$, and outputs intermediate results. These intermediate results are then combined to form $\theta^{(t)}$ by (2) an aggregation function $F$. For simplicity of notation, we will henceforth omit $\mathcal{L}$ from the subscript of $\Delta$ — with the implicit understanding that all ML programs considered in this paper bear an explicit loss function $\mathcal{L}$ (as opposed to heuristics or procedures lacking such a loss function).
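In code, the general form of Eq. 2 is simply a loop around an update and an aggregation function. The following is an illustrative sketch only (the function names, the stopping rule, and the NumPy representation of $\theta$ are our own choices, not part of any particular system):

```python
import numpy as np

def train(theta0, D, update, aggregate, max_iter=10000, tol=1e-6):
    """Generic iterative-convergent loop: theta(t) = F(theta(t-1), Delta(theta(t-1), D))."""
    theta = theta0
    for t in range(max_iter):
        intermediate = update(theta, D)             # Delta: compute on data and previous state
        new_theta = aggregate(theta, intermediate)  # F: combine into the next model state
        if np.linalg.norm(new_theta - theta) < tol: # one possible stopping criterion
            return new_theta
        theta = new_theta
    return theta  # fixed computational budget exceeded
```

Any of the workhorse algorithms discussed below (gradient descent, coordinate descent, sampling) can be expressed by plugging suitable `update` and `aggregate` functions into this loop.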
Let us now look at two concrete examples of Eqs. 1 and 2, which will prove useful for understanding the unique properties of ML programs. In particular, we shall pay special attention to the 4 key components of any ML program: (1) the data $D$ and model $\theta$; (2) the loss function $f$; (3) the structure-inducing function $r$; (4) the algorithmic techniques that can be used for the program.
Lasso Regression: Lasso regression [51] is perhaps the simplest exemplar from the structured sparse regression ML algorithm family, and is used to predict a response variable $y_i$ given vector-valued features $x_i$ (i.e. regression, which uses labeled data) — but under the assumption that only a few dimensions or features in $x_i$ are informative about $y_i$. As input, Lasso is given $N$ training pairs of the form $D = \{x_i, y_i\}_{i=1}^{N}$, where the features $x_i$ are $m$-dimensional vectors. The goal is to find a linear function, parametrized by the $m$-dimensional weight vector $\theta$, such that (1) $x_i^\top \theta \approx y_i$, and (2) the $m$-dimensional parameters $\theta$ are sparse (most elements are zero). (Sparsity has two benefits: it automatically controls the complexity of the model — i.e. if the data requires fewer parameters, then the ML algorithm will use fewer — and it improves human interpretation by focusing the ML practitioner's attention on just a few parameters.) Formally:

(3)   $\min_{\theta}\ \frac{1}{2} \sum_{i=1}^{N} \big(x_i^\top \theta - y_i\big)^2 + \lambda \sum_{j=1}^{m} |\theta_j|$
or more succinctly in matrix notation:

(4)   $\min_{\theta}\ \frac{1}{2} \|X\theta - y\|_2^2 + \lambda \|\theta\|_1$

where $X$ is the $N \times m$ matrix whose $i$-th row is $x_i^\top$, $y$ is the $N$-dimensional vector of responses, $\|\cdot\|_2$ is the Euclidean norm, $\|\cdot\|_1$ is the $\ell_1$ norm, and $\lambda$ is some constant that balances model fit (the $\|X\theta - y\|_2^2$ term) and sparsity (the $\|\theta\|_1$ term). Many algorithmic techniques can be applied to this problem, such as stochastic proximal gradient descent or coordinate descent. We shall present the coordinate descent iterative-convergent equation (more specifically, the form known as "block coordinate descent", which is one of many possible forms of coordinate descent):
(5)   $\theta_j^{(t)} \leftarrow S\Big(X_{\cdot j}^\top y - \sum_{k \neq j} X_{\cdot j}^\top X_{\cdot k}\, \theta_k^{(t-1)},\ \lambda\Big)$

where $S(z, \lambda) := \operatorname{sign}(z)\max(|z| - \lambda,\ 0)$ is the "soft-thresholding operator", and we assume the data is normalized so that for all $j$, $\sum_{i=1}^{N} x_{ij}^2 = 1$. Tying this back to the general iterative-convergent update form, we have the following explicit forms for $\Delta, F$:

(6)   $[\Delta(\theta^{(t-1)}, D)]_j = X_{\cdot j}^\top y - \sum_{k \neq j} X_{\cdot j}^\top X_{\cdot k}\, \theta_k^{(t-1)}$,   $[F(\theta^{(t-1)}, \Delta)]_j = S\big([\Delta]_j,\ \lambda\big)$

where $X_{\cdot j}$ is the $j$-th column of $X$ (i.e. the vector of the $j$-th feature across all data samples).
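A minimal NumPy sketch of the block coordinate descent update in Eq. 5 follows. It assumes, as in the text, that each column of $X$ is normalized to unit squared sum; the function names and the cyclic coordinate order are our own illustrative choices:

```python
import numpy as np

def soft_threshold(z, lam):
    # S(z, lambda) = sign(z) * max(|z| - lambda, 0)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, num_iters=100):
    """Block coordinate descent for Lasso (Eq. 5).
    Assumes each column of X is normalized: sum_i x_ij^2 = 1."""
    N, m = X.shape
    theta = np.zeros(m)
    for t in range(num_iters):
        for j in range(m):
            # x_j^T y - sum_{k != j} x_j^T x_k theta_k; adding theta[j] back
            # cancels the k = j term because x_j^T x_j = 1 under normalization
            r_j = X[:, j] @ y - X[:, j] @ (X @ theta) + theta[j]
            theta[j] = soft_threshold(r_j, lam)
    return theta
```

With a sufficiently large `lam`, most entries of the returned `theta` are exactly zero, which is the sparsity behavior exploited later in the discussion of non-uniform convergence.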
Latent Dirichlet Allocation Topic Model: Latent Dirichlet Allocation (LDA) [5] is a member of the graphical models ML algorithm family, and is also known as a "topic model" for its ability to identify commonly-recurring topics within a large corpus of text documents. As input, LDA is given $N$ unlabeled documents $D = \{w_i\}_{i=1}^{N}$, where each document $w_i$ contains $N_i$ words (referred to as "tokens" in the LDA literature), represented by $w_i = (w_{i1}, \dots, w_{iN_i})$. Each token $w_{ij}$ is an integer representing one word out of a vocabulary of $V$ words — for example, the phrase "machine learning algorithm" might be represented as a sequence of three such integers (the correspondence between words and integers is arbitrary, and has no bearing on the accuracy of the LDA algorithm).
The goal is to find a set of parameters $\theta = \{z_{ij}, \delta_i, B_k\}$ — "token topic indicators" $z_{ij}$ for each token in each document, "document-topic vectors" $\delta_i$ for each document, and $K$ "word-topic vectors" $B_k$ (or simply, "topics") — that maximizes the following log-likelihood equation (a log-likelihood is the natural logarithm of a probability distribution; as a member of the graphical models ML algorithm family, LDA specifies a probability distribution, and hence has an associated log-likelihood):

(7)   $\mathcal{L}(\{z_{ij}\}, \{\delta_i\}, \{B_k\}, D) = \sum_{i=1}^{N} \sum_{j=1}^{N_i} \big[ \log \mathrm{Cat}(w_{ij} \mid B_{z_{ij}}) + \log \mathrm{Cat}(z_{ij} \mid \delta_i) \big] + \sum_{i=1}^{N} \log \mathrm{Dir}(\delta_i \mid \alpha) + \sum_{k=1}^{K} \log \mathrm{Dir}(B_k \mid \beta)$

where $\mathrm{Cat}(\cdot)$ is the Categorical (a.k.a. discrete) probability distribution, $\mathrm{Dir}(\cdot)$ is the Dirichlet probability distribution, and $\alpha, \beta$ are constants that balance model fit (the $\mathrm{Cat}$ terms) with the practitioner's prior domain knowledge about the document-topic vectors $\delta_i$ and the topics $B_k$ (the $\mathrm{Dir}$ terms). Similar to Lasso, many algorithmic techniques such as Gibbs sampling and variational inference (to name just two) can be used on the LDA model; we shall consider the Collapsed Gibbs sampling equations (note that Collapsed Gibbs sampling re-represents $\delta_i, B_k$ as integer-valued count vectors instead of simplex vectors; details can be found in [61]):
(8)   $\delta_{i, z_{ij}}\,{-}{-};\quad B_{z_{ij}, w_{ij}}\,{-}{-};\quad z_{ij} \sim P\big(z_{ij} \mid \delta_i, B, w_{ij}\big);\quad \delta_{i, z_{ij}}\,{+}{+};\quad B_{z_{ij}, w_{ij}}\,{+}{+}$

where ${+}{+}, {-}{-}$ are the self-increment and self-decrement operators (i.e. $\delta, B$ are being modified in-place), $a \sim P$ means "to sample $a$ from distribution $P$", and $P(z_{ij} \mid \delta_i, B, w_{ij})$ is the conditional probability of $z_{ij}$ given the current values of the other parameters. (There are a number of efficient ways to compute this probability; in the interest of keeping this article focused, we refer the reader to [61] for an appropriate introduction.) The update $\Delta$ proceeds in two stages: (1) execute Eq. 8 over all document tokens $w_{ij}$; (2) output $\{z_{ij}, \delta_i, B_k\}$ as $\theta^{(t)}$. The aggregation $F$ turns out to simply be the identity function.
2.1 Unique Properties of ML Programs
To speed up the execution of large-scale ML programs over a distributed cluster, we wish to understand their properties, with an eye towards how they can inform the design of distributed ML systems. It is helpful to first understand what an ML program is not: let us consider a traditional, non-ML program, such as sorting on MapReduce. This algorithm begins by distributing the elements to be sorted randomly across a pool of mappers. The mappers hash each element $x$ into a key-value pair $(h(x), x)$, where $h$ is an "order-preserving" hash function that satisfies $h(x) > h(y)$ if $x > y$. Next, for every unique key $k$, the MapReduce system sends all key-value pairs $(k, x)$ to a reducer labeled "$k$". Each reducer then runs a sequential sorting algorithm on its received values, and finally, the reducers take turns (in ascending key order) to output their sorted values.
The first thing to note about MapReduce sort is that it is single-pass and non-iterative — only a single Map and a single Reduce step are required. This stands in contrast to ML programs, which are iterative-convergent and repeat Eq. 2 many times. More importantly, MapReduce sort is operation-centric and deterministic, and does not tolerate errors in individual operations: for example, if some Mapper were to output a mis-hashed pair whose key violates the order-preserving property (for the sake of argument, let us say this is due to improper recovery from a power failure), then the final output will be mis-sorted, because that element will be output in the wrong position. It is for this reason that Hadoop and Spark (which are systems that support MapReduce) provide strong operational correctness guarantees via robust fault-tolerant systems. These fault-tolerant systems certainly require additional engineering effort, and impose additional running time overheads in the form of hard-disk-based checkpoints and lineage trees [14, 64] — yet they are necessary for operation-centric programs, which may fail to execute correctly in their absence.
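The operation-centric nature of MapReduce sort can be made concrete with a toy single-process simulation; the particular order-preserving hash below (a range partitioner) is our own illustrative choice:

```python
def mapreduce_sort(elements, num_reducers=4, max_val=100):
    """Toy simulation of MapReduce sort with an order-preserving hash."""
    # Map: range-partitioning hash h; monotone in x, so h(x) >= h(y) whenever x >= y
    h = lambda x: min(x * num_reducers // (max_val + 1), num_reducers - 1)
    buckets = {r: [] for r in range(num_reducers)}
    for x in elements:                  # "mappers" emit (h(x), x)
        buckets[h(x)].append(x)
    # Reduce: each reducer sorts its own bucket; emit buckets in ascending key order
    out = []
    for r in range(num_reducers):
        out.extend(sorted(buckets[r]))
    return out
```

If even one element were placed in the wrong bucket (a single mis-hashed pair), the concatenated output would be mis-sorted — there is no later iteration to repair the mistake, which is precisely the contrast with iterative-convergent ML programs drawn above.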
This leads us to the first property of ML programs: error tolerance. Unlike the MapReduce sort example, ML programs are usually robust against minor errors in intermediate calculations. In Eq. 2, even if a limited number of updates $\Delta$ are incorrectly computed or transmitted, the ML program is still mathematically guaranteed to converge to an optimal set of model parameters $\theta^*$ — that is to say, the ML algorithm terminates with a correct output (even though it might take more iterations to do so) [23, 12]. A good example is stochastic gradient descent (SGD), a frequently-used algorithmic workhorse for many ML programs, ranging from deep learning to matrix factorization and logistic regression [66, 17, 13]. When executing an ML program that uses SGD, even if a small random vector $\epsilon$ is added to the model after every iteration, i.e. $\theta^{(t)} \leftarrow \theta^{(t)} + \epsilon$, convergence is still assured — intuitively, this is because SGD always computes the correct direction towards the optimum $\theta^*$ for the next update; moving $\theta$ around simply results in the direction being recomputed to suit [23, 12]. This property has important implications for distributed system design, as the system no longer needs to guarantee perfect execution, inter-machine communication, or recovery from failure (which requires substantial engineering and running time overheads) — it is often cheaper to do these approximately, especially when resources are constrained or limited (e.g. limited inter-machine network bandwidth) [23, 12].
In spite of error tolerance, ML programs can in fact be harder to execute than operation-centric programs, because of dependency structures that are not immediately obvious from a cursory look at the objective $\mathcal{L}$ or the update functions $\Delta, F$. It is certainly the case that dependency structures occur in operation-centric programs: in MapReduce sort, the reducers must wait for the mappers to finish, otherwise the sort will be incorrect. In order to see what makes ML dependency structures unique, let us consider the Lasso regression example in Eq. 3: at first glance, the update equations 6 may look like they can be executed in parallel, but this is only partially true. A more careful inspection reveals that, for the $j$-th model parameter $\theta_j$, its update depends on the other parameters through the term $\sum_{k \neq j} X_{\cdot j}^\top X_{\cdot k}\, \theta_k$ — in other words, potentially every other parameter $\theta_k$ is a possible dependency, and therefore the order in which the model parameters are updated has an impact on the ML program's progress or even correctness [33]. Even more, there is an additional nuance not present in operation-centric programs: the Lasso parameter dependencies are not binary (i.e. not only on or off), but can be soft-valued and influenced by both the ML program state and input data: notice that if $X_{\cdot j}^\top X_{\cdot k} = 0$ (meaning that data column $j$ is uncorrelated with column $k$), then $\theta_j$ and $\theta_k$ have zero dependency on each other, and can be updated safely in parallel [33]. Similarly, even if $X_{\cdot j}^\top X_{\cdot k} \neq 0$, as long as $\theta_k = 0$, then the update for $\theta_j$ does not currently depend on $\theta_k$. Such dependency structures are not limited to one ML program; careful inspection of the LDA topic model update equations 8 reveals that the Gibbs sampler update for $z_{ij}$ (word token $j$ in document $i$) depends on (1) all other word tokens in document $i$, and (2) all other word tokens in other documents that represent the exact same word [68]. If these ML program dependency structures are not respected, the result is either sub-ideal scaling with additional machines (e.g. far less than $P$-fold speedup with $P$ machines) [68], or even outright program failure that overwhelms the intrinsic error tolerance of ML programs [33].
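The error tolerance property described above can be demonstrated numerically with a toy experiment of our own construction: SGD for least-squares regression still converges to (a small neighborhood of) the true solution even when a random perturbation is deliberately injected into the model after every update. The step size and noise level below are illustrative choices:

```python
import numpy as np

def noisy_sgd(X, y, step=0.05, noise=0.01, iters=3000, seed=0):
    """SGD for least-squares regression, with a small random perturbation
    injected into the model after every update (simulating imperfect execution)."""
    rng = np.random.default_rng(seed)
    N, m = X.shape
    theta = np.zeros(m)
    for t in range(iters):
        i = rng.integers(N)                      # pick one data sample
        grad = (X[i] @ theta - y[i]) * X[i]      # stochastic gradient of 0.5*(x.theta - y)^2
        theta -= step * grad
        theta += noise * rng.standard_normal(m)  # deliberate error injection
    return theta
```

Because each stochastic gradient is recomputed from the perturbed model, the injected errors are continually corrected rather than accumulated — the behavior the text attributes to SGD's self-healing direction computation.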
A third property of ML programs is non-uniform convergence, the observation that not all model parameters will converge to their optimal values in the same number of iterations — a property that is absent from single-pass algorithms like MapReduce sort. In the Lasso example Eq. 3, the $\ell_1$ term $\lambda\|\theta\|_1$ encourages model parameters to be exactly zero, and it has been empirically observed that once a parameter reaches zero during algorithm execution, it is unlikely to revert to a nonzero value [33] — to put it another way, parameters that reach zero are (with high, though not certain, probability) already converged. This suggests that computation may be better prioritized towards parameters that are still nonzero, by executing more frequently on them — and such a strategy indeed reduces the time taken by the ML program to finish [33]. Similar non-uniform convergence has been observed and exploited in PageRank, another iterative-convergent algorithm [37].
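The prioritization idea can be sketched as follows. This is a simplified, hypothetical scheme in the spirit of (but not identical to) the prioritization in [33]: coordinates are sampled with probability proportional to their current magnitude, with a small floor so that zeroed (likely converged) parameters are still occasionally revisited:

```python
import numpy as np

def prioritized_coords(theta, num_picks, floor=1e-3, rng=None):
    """Choose coordinates to update, favoring parameters far from zero.
    Zeroed (likely converged) parameters are still picked occasionally via `floor`."""
    rng = rng or np.random.default_rng()
    weights = np.abs(theta) + floor          # larger magnitude -> more frequent updates
    probs = weights / weights.sum()
    return rng.choice(len(theta), size=num_picks, replace=False, p=probs)
```

A scheduler built on such a sampler spends most of its update budget on the slowly-converging parameters, which is the mechanism behind the reported reduction in time-to-convergence.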
Finally, it is worth noting that a subset of ML programs exhibit compact updates, in that the updates $\Delta$ are, upon careful inspection, significantly smaller than the size of the model parameters $\theta$. In both Lasso (Eq. 3) and LDA topic models [5], the updates generally touch just a small number of model parameters, due to sparse structure in the data. Another salient example is that of "matrix-parametrized" models, where $\theta$ is a matrix (such as in deep learning [22]), yet individual updates $\Delta$ can be decomposed into a few small vectors (a so-called "low-rank" update). Such compactness can dramatically reduce storage, computation, and communication costs if the distributed ML system is designed with it in mind, resulting in order-of-magnitude speedups [56, 65].
2.2 On Data and Model Parallelism
For ML applications involving terabytes of data, using complex ML programs with up to trillions of model parameters, execution on a single desktop or laptop often takes days or weeks [29]; this computational bottleneck has spurred the development of many distributed systems for parallel execution of ML programs over a cluster [20, 64, 36, 57]. ML programs are parallelized by subdividing the updates over either the data or the model — referred to respectively as data parallelism and model parallelism.
It is crucial to note that the two types of parallelism are complementary and asymmetric — complementary, in that simultaneous data and model parallelism is possible (and even necessary, in some cases), and asymmetric, in that data parallelism can be applied generically to any ML program with an independent and identically distributed (i.i.d.) assumption over the data samples $x_i$; such i.i.d. ML programs (from deep learning, to logistic regression, to topic modeling and many others) make up the bulk of practical ML usage, and are easily recognized by a summation over data indices $i$ in the objective $\mathcal{L}$ (for example, Lasso Eq. 3). Consequently, when a workhorse algorithmic technique (e.g. stochastic gradient descent) is applied to $\mathcal{L}$, the derived update equations $\Delta$ will also have a summation over $i$ (for Lasso coordinate descent, Eq. 5, the summation over $i$ is implicit in the inner products $X_{\cdot j}^\top y$ and $X_{\cdot j}^\top X_{\cdot k}$), which can be easily parallelized over multiple machines, particularly when the number of data samples is in the millions or billions. In contrast, model parallelism requires special care, because model parameters do not always enjoy this convenient i.i.d. assumption (Figure 1) — therefore, which parameters are updated in parallel, as well as the order in which the updates happen, can lead to a variety of outcomes: from near-ideal $P$-fold speedup with $P$ machines, to no additional speedup with additional machines, or even to complete program failure. The dependency structures discussed for Lasso (Section 2.1) are a good example of the non-i.i.d. nature of model parameters. Let us now discuss the general mathematical forms of data and model parallelism, respectively.
Data Parallelism: In data parallel ML execution, the data $D$ is partitioned and assigned to $P$ parallel computational workers or machines (indexed by $p = 1, \dots, P$); we shall denote the $p$-th data partition by $D_p$. If the update function $\Delta$ has an outermost summation over data samples $i$ (as seen in ML programs with the commonplace i.i.d. assumption on data), we can split $\Delta$ over data subsets and obtain a data parallel update equation, in which $\Delta(\theta^{(t-1)}, D_p)$ is executed on the $p$-th parallel worker:

(9)   $\theta^{(t)} = F\Big(\theta^{(t-1)},\ \sum_{p=1}^{P} \Delta\big(\theta^{(t-1)}, D_p\big)\Big)$
It is worth noting that the summation $\sum_{p=1}^{P}$ is the basis for a host of established techniques for speeding up data-parallel execution, such as minibatches and bounded-asynchronous execution [23, 12]. As a concrete example, we can write the Lasso block coordinate descent equations 6 in a data parallel form, by applying a bit of algebra:

(10)   $[\Delta(\theta^{(t-1)}, D_p)]_j = \sum_{i \in D_p} x_{ij} y_i - \sum_{i \in D_p} \sum_{k \neq j} x_{ij} x_{ik}\, \theta_k^{(t-1)}$,   $[F(\cdot)]_j = S\Big(\sum_{p=1}^{P} [\Delta(\theta^{(t-1)}, D_p)]_j,\ \lambda\Big)$

where $\sum_{i \in D_p}$ means (with a bit of notation abuse) to sum over all data indices $i$ included in the partition $D_p$.
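The per-partition summation of Eq. 9 can be made concrete with a toy sketch. For simplicity we use a plain least-squares gradient update rather than the Lasso form of Eq. 10, and a sequential loop stands in for the $P$ parallel workers; the point illustrated is that partial updates computed per shard and then summed reproduce the single-machine update exactly:

```python
import numpy as np

def data_parallel_step(theta, partitions, step=0.1):
    """One data-parallel gradient step for least squares (Eq. 9 pattern):
    each 'worker' computes a partial update on its own data shard,
    and the partial updates are combined by summation."""
    total = np.zeros_like(theta)
    for X_p, y_p in partitions:              # in a real system: one worker per shard
        total += X_p.T @ (y_p - X_p @ theta) # partial sum over this shard's data indices
    n = sum(len(y_p) for _, y_p in partitions)
    return theta + step * total / n          # aggregation F: sum, scale, apply
```

Because addition is associative and commutative, the result is independent of how the data is sharded — which is exactly why the outermost summation over $i$ makes data parallelism generic.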
Model Parallelism: In model parallel ML execution, the model $\theta$ is partitioned and assigned to $P$ workers/machines, and updated therein by running $P$ parallel update functions $\Delta_p$. Unlike data-parallelism, each update function $\Delta_p$ also takes a scheduling or selection function $S_p$, which restricts $\Delta_p$ to operate on a subset of the model parameters $\theta$ (one basic use is to prevent different workers from trying to update the same parameters):

(11)   $\theta^{(t)} = F\Big(\theta^{(t-1)},\ \big\{\Delta_p\big(\theta^{(t-1)},\ S_p(\theta^{(t-1)})\big)\big\}_{p=1}^{P}\Big)$

where we have omitted the data $D$ since it is not being partitioned over. $S_p(\theta^{(t-1)})$ outputs a set of indices $\{j_1, j_2, \dots\}$, so that $\Delta_p$ only performs updates on $\theta_{j_1}, \theta_{j_2}, \dots$ — we refer to such selection of model parameters as scheduling. The model parameters $\theta_j$ are not, in general, independent of each other, and it has been established that model parallel algorithms are effective only when each iteration of parallel updates is restricted to a subset of mutually independent (or weakly-correlated) parameters [33, 7, 47, 38], which can be performed by $S_p$.
The Lasso block coordinate descent updates (Eq. 6) can be easily written in a simple model parallel form. Here, $S_p$ chooses the same fixed set of parameters for worker $p$ on every iteration, which we refer to by $S_p^{\mathrm{fix}}$:

(12)   for each $j \in S_p^{\mathrm{fix}}$:   $[\Delta_p(\theta^{(t-1)})]_j = X_{\cdot j}^\top y - \sum_{k \neq j} X_{\cdot j}^\top X_{\cdot k}\, \theta_k^{(t-1)}$,   $[F(\cdot)]_j = S\big([\Delta_p]_j,\ \lambda\big)$
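A toy sketch of the Eq. 12 pattern follows, with a fixed coordinate partition playing the role of $S_p^{\mathrm{fix}}$ and a sequential loop standing in for the $P$ parallel workers; as before, the column-normalization assumption ($\sum_i x_{ij}^2 = 1$) is carried over, and the function names are our own:

```python
import numpy as np

def model_parallel_sweep(theta, X, y, lam, schedules):
    """One model-parallel Lasso sweep (Eq. 12 pattern): worker p updates only
    the coordinate subset schedules[p] chosen by its scheduling function S_p.
    Assumes columns of X are normalized (sum_i x_ij^2 = 1)."""
    soft = lambda z: np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
    for coords in schedules:            # in a real system: one worker per subset
        for j in coords:
            # residual correlation for coordinate j (k = j term cancelled by + theta[j])
            r_j = X[:, j] @ (y - X @ theta) + theta[j]
            theta[j] = soft(r_j)
    return theta
```

Whether such a sweep makes good progress depends on the schedules: if two strongly correlated coordinates land in subsets updated concurrently, the updates interfere — which is why scheduling over weakly-correlated parameters matters.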
On a closing note, simultaneous data and model parallelism is also possible, by partitioning the space of data samples and model parameters into disjoint blocks. The LDA topic model Gibbs sampling equations (Eq. 8) can be partitioned in such a block-wise manner (Figure 2), in order to achieve near-perfect $P$-fold speedup with $P$ machines [68].
3 Principles of ML System Design
The unique properties of ML programs, when coupled with the complementary strategies of data and model parallelism, interact to produce a complex space of design considerations that goes beyond the ideal mathematical view suggested by the general iterative-convergent update equation Eq. 2. In this ideal view, one hopes that the functions $\Delta, F$ simply need to be implemented equation-by-equation (e.g. following the Lasso regression data and model parallel equations earlier), and then executed by a general-purpose distributed system — for example, if we chose a MapReduce abstraction, one could write $\Delta$ as Map and $F$ as Reduce, and then use a system such as Hadoop or Spark to execute them. The reality, however, is that the highest-performing ML implementations are not built in such a naive manner, and furthermore, they tend to be found in ML-specialized systems rather than on general-purpose MapReduce systems [43, 36, 57, 62]. The reason is that high-performance ML goes far beyond an idealized MapReduce-like view, and involves numerous considerations that are not immediately obvious from the mathematical equations: considerations such as what data batch size to use for data parallelism, how to partition the model for model parallelism, when to synchronize model views between workers, step size selection for gradient-based algorithms, and even the order in which to perform updates.
The space of ML performance considerations can be intimidating to even veteran practitioners, and it is our view that a systems interface for parallel ML is needed, both to (a) facilitate the organized, scientific study of ML considerations, and also to (b) organize these considerations into a series of high-level principles for developing new distributed ML systems. As a first step towards organizing these principles, we shall divide them according to four high-level questions: if an ML program’s equations (Eq. 2) tell the system “what to compute”, then the system must consider: (1) How to distribute the computation? (2) How to bridge computation with inter-machine communication? (3) How to communicate between machines? (4) What to communicate? By systematically addressing the ML considerations that fall under each question, we show that it is possible to build subsystems whose benefits complement and accrue with each other, and which can be assembled into a full distributed ML system that enjoys orders-of-magnitude speedups in ML program execution time.
3.1 How to Distribute: Scheduling and Balancing Workloads
In order to parallelize an ML program, we must first determine how best to partition it into multiple tasks — that is to say, we must partition the monolithic update in Eq. 2 into a set of parallel tasks, following the data parallel form (Eq. 9) or the model parallel form (Eq. 11) — or even a more sophisticated hybrid of both forms. Then, we must schedule and balance those tasks for execution on a limited pool of P workers or machines: that is to say, we decide (i) which tasks go together in parallel (and just as importantly, which tasks should not be executed in parallel), (ii) the order in which tasks will be executed, while simultaneously ensuring (iii) each machine’s share of the workload is well-balanced.
These three decisions have been carefully studied in the context of operation-centric programs (such as the MapReduce sort example), giving rise (for example) to the scheduler system used in Hadoop and Spark [64]. Such operation-centric scheduler systems may come up with a different execution plan — the combination of decisions (i)-(iii) — depending on the cluster configuration, existing workload, or even machine failure; yet, crucially, they ensure that the outcome of the operation-centric program is perfectly consistent and reproducible every time. However, for ML iterative-convergent programs, the goal is not perfectly reproducible execution, but rather convergence of the model parameters to an optimum of the objective function (that is to say, the parameters approach to within some small distance of an optimum). Accordingly, we would like to develop a scheduling strategy whose execution plans allow ML programs to provably terminate with the same quality of convergence every time — we shall refer to this as “correct execution” for ML programs. Such a strategy can then be implemented as a scheduling system, which creates ML program execution plans that are distinct from operation-centric ones.
Dependency Structures in ML Programs: In order to generate a correct execution plan for ML programs, it is necessary to understand how ML programs have internal dependencies, and how breaking or violating these dependencies through naive parallelization will slow down convergence. Unlike operation-centric programs such as sorting, ML programs are error-tolerant, and can automatically recover from a limited number of dependency violations — but too many violations will increase the number of iterations required for convergence, and cause the parallel ML program to experience sub-optimal, less-than-P-fold speedup with P machines.
Let us understand these dependencies through the Lasso and LDA topic model example programs. In the model parallel version of Lasso (Eq. 12), each parallel worker updates its assigned parameters by computing partial residuals over the data. Observe that the calculation for one parameter depends on all other parameters, with the magnitude of the dependency between two parameters being proportional to (1) the correlation between the two corresponding data dimensions, and (2) the current value of the other parameter. In the worst case, both the correlation and the parameter value could be large, and therefore updating the two parameters sequentially (that is to say, over two different iterations) will lead to a different result from updating them in parallel (i.e. at the same time in one iteration). [7] noted that, if the correlation is large, then the parallel update will take more iterations to converge than the sequential update. It intuitively follows that we should not “waste” computation trying to update highly correlated parameters in parallel — rather, we should seek to schedule uncorrelated groups of parameters for parallel updates, while performing updates for correlated parameters sequentially [33].
For LDA topic modeling, let us recall the updates (Eq. 8): for every word token (at position i in document j), the LDA Gibbs sampler updates four elements of the topic-count model parameters: it decrements the two counts associated with the token’s old topic assignment, and increments the two counts associated with its new assignment. These updates give rise to many dependencies between different word tokens; one obvious dependency occurs when two tokens are the same word, leading to a chance that they will update the same model elements (which happens when their old or new topic assignments coincide). Furthermore, there are more complex dependencies inside the conditional probability used for sampling; in the interest of keeping this article at a suitably high level, we will summarize by noting that elements within the same column of the topic-count matrix are mutually dependent, as are elements within the same row. Due to these intricate dependencies, high-performance parallelism of LDA topic modeling requires a simultaneous data-and-model parallel strategy (Figure 2), where word tokens must be carefully grouped by both their word value and their document, so as to avoid violating the column/row dependencies [68].
Scheduling in ML Programs: In light of these dependencies, how can we schedule the updates in a manner that avoids violating as many dependency structures as possible (noting that we do not have to avoid all dependencies, thanks to ML error tolerance) — yet, at the same time, does not leave any of the worker machines idle due to lack of tasks or poor load balance? These two considerations have distinct yet complementary effects on ML program execution time: avoiding dependency violations prevents the progress per iteration of the ML program from degrading compared to sequential execution (i.e. the program will not need more iterations to converge), while keeping worker machines fully occupied with useful computation ensures that the iteration throughput (iterations executed per second) from P machines is as close as possible to P times that of a single machine. In short, near-perfect P-fold ML speedup results from combining near-ideal progress per iteration (equal to sequential execution) with near-ideal iteration throughput (P times that of sequential execution) — thus, we would like an ideal ML scheduling strategy that attains both goals.
To explain how ideal scheduling can be realized, we return to our running Lasso and LDA examples. In Lasso, the degree to which two parameters are interdependent is influenced by the data correlation between their two feature dimensions — we refer to this and other similar operations as a dependency check. If the correlation is below a small threshold, then the two parameters will have little influence on each other. Hence, the ideal scheduling strategy is to find all pairs whose correlation exceeds the threshold, and then partition the parameter indices into independent subsets — where two subsets are said to be independent if every pair of parameters drawn from the two subsets has correlation below the threshold. These subsets can then be safely assigned to parallel worker machines (Figure 3), and each machine will update its parameters sequentially (thus preventing dependency violations) [33].
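Such dependency checks and independent-subset construction can be sketched as follows (our own toy implementation: parameters are grouped into the connected components of the thresholded correlation graph, so any two parameters in different subsets have correlation at most theta):

```python
def independent_subsets(X_cols, theta=0.1):
    """Group parameter indices into subsets such that any two indices in
    different subsets have absolute correlation <= theta (the dependency
    check). Connected components of the correlation graph give the grouping."""
    d = len(X_cols)

    def corr(i, j):
        # Absolute inner product between feature columns i and j.
        return abs(sum(a * b for a, b in zip(X_cols[i], X_cols[j])))

    # Adjacency: an edge means the two coordinates must stay together.
    adj = {i: [] for i in range(d)}
    for i in range(d):
        for j in range(i + 1, d):
            if corr(i, j) > theta:
                adj[i].append(j)
                adj[j].append(i)

    seen, subsets = set(), []
    for start in range(d):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:                      # depth-first search
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.append(u)
            stack.extend(adj[u])
        subsets.append(sorted(comp))
    return subsets
```

With two identical feature columns and one orthogonal column, the first two parameters are grouped together and the third forms its own subset, safe to update in parallel.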
As for LDA, careful inspection reveals that the update equations for a word token (Eq. 8) may (1) touch any element of the matrix column corresponding to that word, and (2) touch any element of the row corresponding to its document. In order to prevent parallel worker machines from operating on the same columns/rows, we must partition the space of words into P subsets, as well as partition the space of documents into P subsets. We may now perform ideal data-and-model parallelization as follows: first, we assign the p-th document subset to machine p (out of P machines). Then, each machine p will only Gibbs sample word tokens that belong to both its document subset and the p-th word subset. Once all machines have finished, they rotate word subsets amongst each other, so that machine p will now sample tokens from the (p+1)-th word subset (or, for machine P, the 1st word subset). This process continues until P rotations have completed, at which point the iteration is complete (every word token has been sampled exactly once) [68]. Figure 2 illustrates this process.
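The rotation schedule above can be written down directly; this sketch (our own notation) returns, for each of the P rounds, the (document block, word block) pair that each machine works on:

```python
def rotation_schedule(P):
    """Data-and-model-parallel schedule for LDA (Figure 2): machine p holds
    document block p throughout, and word blocks rotate among machines so
    that after P rounds every (document block, word block) pair is covered,
    with no two machines ever touching the same word block at once."""
    rounds = []
    for r in range(P):
        # In round r, machine p works on word block (p + r) mod P.
        rounds.append([(p, (p + r) % P) for p in range(P)])
    return rounds
```

For P = 3 this yields three rounds covering all nine (document block, word block) pairs, with the word blocks disjoint within each round.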
In practice, ideal schedules like the ones above may not be practical to use. For instance, in Lasso, computing the correlation for every pair of parameters is intractable for high-dimensional problems with millions to billions of features. We will return to this issue shortly, when we introduce Structure Aware Parallelization (SAP), a provably near-ideal scheduling strategy that can be computed quickly.
Compute Prioritization in ML Programs: Because ML programs exhibit non-uniform parameter convergence, an ML scheduler has an opportunity to prioritize slower-to-converge parameters, thus improving the progress per iteration of the ML algorithm (i.e. it requires fewer iterations to converge). For example, in Lasso, it has been empirically observed that the sparsity-inducing ℓ1 norm (Eq. 4) causes most parameters to (1) become exactly zero after a few iterations, after which (2) they are unlikely to become non-zero again. The remaining parameters, which are typically a small minority, take much longer to converge (such as 10 times more iterations) [33].
A general yet effective prioritization strategy is to select parameters with probability proportional to their squared rate of change, plus a small constant that ensures stationary parameters still have a small chance of being selected. Depending on the ratio of fast- to slow-converging parameters, this prioritization strategy can yield an order-of-magnitude reduction in the number of iterations required by Lasso regression to converge [33]. Similar strategies have been applied to PageRank, another iterative-convergent algorithm [37].
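A minimal sketch of this sampling strategy, assuming the recent per-parameter changes are available (the function name and defaults are ours):

```python
import random

def prioritized_pick(deltas, num_picks, eps=1e-6, rng=None):
    """Select parameter indices with probability proportional to their
    squared recent change plus a small constant eps, so that stationary
    parameters still have a small chance of being picked."""
    rng = rng or random.Random(0)
    weights = [d * d + eps for d in deltas]
    total = sum(weights)
    picks = []
    for _ in range(num_picks):
        # Inverse-CDF sampling over the (unnormalized) weights.
        r, acc = rng.random() * total, 0.0
        for j, w in enumerate(weights):
            acc += w
            if r <= acc:
                picks.append(j)
                break
    return picks
```

A parameter that is still moving rapidly dominates the selection, while already-converged (zero-change) parameters are chosen only rarely via the eps floor.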
Balancing Workloads in ML Programs: When executing ML programs over a distributed cluster, the machines may have to stop in order to exchange parameter updates, i.e. synchronize — for example, at the end of Map or Reduce phases in Hadoop and Spark. In order to reduce the time spent waiting, it is desirable to load-balance the work on each machine, so that they all proceed at close to the same rate. This is especially important for ML programs, which may exhibit skewed data distributions: for example, in LDA topic models, the word tokens are distributed in a power-law fashion, where a few words occur across many documents, while most other words appear rarely. A typical ML load-balancing strategy might apply the classic bin packing algorithm from computer science (where each worker machine is one of the “bins” to be packed), or any other strategy that works for operation-centric distributed systems such as Hadoop and Spark. However, a second, less-appreciated challenge is that machine performance may fluctuate in real-world clusters, due to subtle reasons such as changing datacenter temperature, machine failures, background jobs, or other users. Thus, load-balancing strategies that are predetermined at the start of an iteration will often suffer from stragglers, machines that randomly become slower than the rest of the cluster, and which all other machines must wait for when performing parameter synchronization at the end of an iteration [23, 12, 10]. An elegant solution to this problem is to apply slow-worker agnosticism [30], where the system takes direct advantage of the iterative-convergent nature of ML algorithms, and allows the faster workers to repeat their updates whilst waiting for the stragglers to catch up. This not only solves the straggler problem, but can even correct for imperfectly-balanced workloads. We note that another solution to the straggler problem is to use bounded-asynchronous execution (as opposed to synchronous MapReduce-style execution) — we shall discuss this in more detail in Section 3.2.
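The bin-packing idea can be illustrated with a simple greedy variant (our own sketch: largest task to the currently least-loaded machine):

```python
def balance_workloads(costs, num_machines):
    """Greedy load balancing in the spirit of classic bin packing: assign
    the largest remaining task to the currently least-loaded machine, so
    that all machines finish at roughly the same time."""
    loads = [0.0] * num_machines
    assignment = [[] for _ in range(num_machines)]
    # Process tasks in decreasing order of cost.
    for task in sorted(range(len(costs)), key=lambda i: -costs[i]):
        m = min(range(num_machines), key=lambda k: loads[k])
        loads[m] += costs[task]
        assignment[m].append(task)
    return assignment, loads
```

For the skewed task costs [5, 3, 3, 2, 2, 1] on two machines, the greedy rule produces a perfectly balanced 8/8 split — but, as noted above, such a static plan can still be defeated at runtime by stragglers.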
Structure Aware Parallelization:
Scheduling, prioritization and load-balancing are complementary yet intertwined — the choice of parameters to prioritize will influence which dependency checks the scheduler needs to perform, and in turn, the “independent subsets” produced by the scheduler can make the load-balancing problem more or less difficult. These three functionalities can be combined into a single programmable abstraction, to be implemented as part of a distributed system for ML. We call this abstraction Structure Aware Parallelization (SAP), in which ML programmers can specify how to (1) prioritize parameters to speed up convergence; (2) perform dependency checks on the prioritized parameters, and schedule them into independent subsets; (3) load-balance the independent subsets across the worker machines. SAP exposes a simple, MapReduce-like programming interface, where ML programmers implement three functions: (1) schedule(), in which a small number of parameters are prioritized, and then exposed to dependency checks; (2) push(), which performs the update computations in parallel on worker machines; (3) pull(), which aggregates those updates back into the model. Load balancing is automatically handled by the SAP implementation, through a combination of classic bin packing and slow-worker agnosticism.
Importantly, SAP schedule() does not naively perform dependency checks over the whole model — instead, a small number of parameters are first selected via prioritization (far fewer than the full model size). The dependency checks are then performed on this small set, and the resulting independent subsets are updated via push() and pull(). Thus, SAP only updates a few parameters per invocation of schedule(), push(), pull(), rather than the full model. This strategy is provably near-ideal for a broad class of model parallel ML programs:
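The schedule/push/pull interface can be sketched as a driver loop (hypothetical signatures; a real SAP system would dispatch push() to distributed workers rather than evaluating it in a list comprehension):

```python
def sap_run(schedule, push, pull, A, iters):
    """Skeleton of the SAP execution loop: schedule() picks a few
    prioritized, dependency-checked independent subsets; push() computes
    updates for one subset (in parallel across workers, in a real system);
    pull() aggregates the updates back into the model A."""
    for _ in range(iters):
        subsets = schedule(A)                      # independent subsets
        updates = [push(A, s) for s in subsets]    # one per (virtual) worker
        A = pull(A, updates)
    return A
```

Instantiated with a toy schedule that splits four parameters into two independent subsets, and a push that takes half-steps toward per-parameter targets, the loop converges to those targets.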
Theorem 1 (adapted from [57])
SAP is close to ideal execution: Consider objective functions of the form f(A) = g(A) + h(A), where h is separable and g has Lipschitz continuous gradient in the following sense:
(13)   g(A + Δ) ≤ g(A) + Δᵀ∇g(A) + (L/2)‖Δ‖²
Let x_1, …, x_d be the data samples re-represented as feature vectors. W.l.o.g., we assume that each feature vector is normalized, i.e., ‖x_i‖_2 = 1. Therefore |x_iᵀx_j| ≤ 1 for all i, j.
Suppose we want to minimize f via model parallel coordinate descent. Let an oracle (i.e. ideal) schedule always propose P random features with zero correlation, with corresponding parameter trajectory A_ideal^(t), and let A_SAP^(t) be the parameter trajectory under SAP scheduling. Then,
(14) 
for appropriate problem-specific constants.
This theorem says that the difference between the SAP parameter estimate and the ideal oracle estimate rapidly vanishes, at a fast rate. In other words, one cannot do much better than SAP scheduling — it is near-optimal. SAP’s slow-worker agnostic load-balancing also comes with a theoretical performance guarantee — it not only preserves correct ML convergence, but also improves convergence per iteration over naive scheduling:
Theorem 2 (adapted from [30])
SAP slow-worker agnosticism improves convergence progress per iteration:
Let the current variance (intuitively, the uncertainty) in the model be denoted Var(A^(t)), and let n_p be the number of updates performed by worker p (including additional updates due to slow-worker agnosticism). After these n_p updates, the variance is reduced to
(15)
where η^(t) is a step-size parameter that approaches zero as t → ∞; the equation also involves problem-specific constants, the stochastic gradient of the ML objective function, a covariance term, and a remainder of 3rd-order and higher terms that shrink rapidly towards zero.
A low variance indicates that the ML program is close to convergence (because the parameters have stopped changing quickly). The above theorem shows that the additional updates do indeed lower the variance — therefore, the convergence of the ML program is accelerated. To see why this is the case, we note that the 2nd and 3rd terms in Eq. 15 are always negative; furthermore, they are of lower order in η^(t), so they dominate the 4th, positive term (which is of higher order in η^(t) and therefore shrinks towards zero faster) as well as the 5th, positive term (which consists of 3rd-order and higher terms and shrinks even faster than the 4th).
Empirically, SAP systems achieve order-of-magnitude speedups over non-scheduled and non-balanced distributed ML systems. One example is the Strads system [33], which implements SAP schedules for several algorithms, such as Lasso Regression, Matrix Factorization, and Latent Dirichlet Allocation topic modeling, and achieves superior convergence times compared to other systems (Fig. 4).
3.2 How to Bridge Computation and Communication: Bridging Models and Bounded Asynchrony
Many parallel programs require worker machines to exchange program state between each other — for example, MapReduce systems like Hadoop take the key-value pairs created by all Map workers, and transmit all pairs sharing the same key to the same Reduce worker. For operation-centric programs, this step must be executed perfectly without error — recall the MapReduce sort example (Section 2), where sending the same key to two different Reducers results in a sorting error. This notion of operational correctness in parallel programming is underpinned by Bulk Synchronous Parallel (BSP) [52, 40], a bridging model that provides an abstract view of how parallel program computations are interleaved with inter-worker communication. Programs that follow the BSP bridging model alternate between a computation phase and a communication phase, or synchronization barrier (Figure 6), and the effects of each computation phase are not visible to worker machines until the next synchronization barrier has completed.
Because BSP creates a clean separation between computation and communication phases, many parallel ML programs running under BSP can be shown to be serializable — that is to say, they are equivalent to a sequential ML program. Serializable BSP ML programs enjoy all the correctness guarantees of their sequential counterparts, and these strong guarantees have made BSP a popular bridging model for both operation-centric programs and ML programs [15, 39, 64]. One disadvantage of BSP is that workers must wait for each other to reach the next synchronization barrier, meaning that load balancing is critical for efficient BSP execution. Yet, even well-balanced workloads can fall prey to stragglers, machines that become randomly and unpredictably slower than the rest of the cluster [10], due to real-world conditions such as temperature fluctuations in the datacenter, network congestion, and other users’ programs or background tasks. When this happens, the program’s efficiency drops to match that of the slowest machine (Figure 6) — and in a cluster with 1000s of machines, there may even be multiple stragglers. A second disadvantage is that communication between workers is not instantaneous, so the synchronization barrier itself can take a non-trivial amount of time. For example, in LDA topic modeling running on 32 machines under BSP, the synchronization barriers can be up to six times longer than the iterations themselves [23]. Due to these two disadvantages, BSP ML programs may suffer from low iteration throughput, i.e. P machines do not produce a P-fold increase in throughput.
As an alternative to running ML programs under BSP, asynchronous parallel execution has been explored [2, 13, 20], in which worker machines never wait for each other, and always communicate model information throughout the course of each iteration. Asynchronous execution usually obtains a near-ideal P-fold increase in iteration throughput, but unlike BSP (which ensures serializability and hence ML program correctness), it often suffers from decreased convergence progress per iteration. The reason is that asynchronous communication causes model information to become delayed or stale (because machines do not wait for each other), and this in turn causes errors in the computed updates. The magnitude of these errors grows with the delays, and if the delays are not carefully bounded, the result is extremely slow or even incorrect convergence [23, 12]. In a sense, there is “no free lunch” — model information must be communicated in a timely fashion between workers.
BSP and asynchronous execution face different challenges in achieving ideal P-fold ML program speedups — empirically, BSP ML programs have difficulty reaching the ideal P-fold increase in iteration throughput [23], while asynchronous ML programs have difficulty maintaining the ideal progress per iteration observed in sequential ML programs [23, 12, 68]. A promising solution is bounded-asynchronous execution, in which asynchronous execution is permitted up to a limit. To explore this idea further, we present a bridging model called Stale Synchronous Parallel (SSP) [23, 50], which generalizes and improves upon BSP.
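The effect of bounded staleness can be seen in a toy simulation (our own construction): each gradient step reads a parameter view that is up to s steps old, yet with a modest step size the iterates still reach the optimum.

```python
def stale_sgd(grad, a0, lr, staleness, steps):
    """Gradient descent where each step reads a parameter value that is up
    to `staleness` iterations old, mimicking bounded-asynchronous execution
    on a single scalar parameter."""
    history = [a0]
    a = a0
    for _ in range(steps):
        # Read the oldest view permitted by the staleness bound.
        stale_view = history[max(0, len(history) - 1 - staleness)]
        a = a - lr * grad(stale_view)
        history.append(a)
    return a
```

Minimizing (a - 3)² with staleness 2 and step size 0.1 still converges to 3; with a much larger step size or unbounded delay, the same recursion can oscillate or diverge, which is exactly the failure mode that bounding staleness prevents.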
Stale Synchronous Parallel: SSP is a bounded-asynchronous bridging model, which enjoys a similar programming interface to the popular BSP bridging model. An intuitive, high-level explanation goes as follows: we have P parallel workers or machines, which perform ML computations in an iterative fashion. At the end of each iteration, the SSP workers signal that they have completed their iteration — at this point, if the workers were instead running under BSP, a synchronization barrier would be enacted for inter-machine communication. However, SSP does not enact a synchronization barrier. Instead, workers may be stopped or allowed to proceed as SSP sees fit; more specifically, SSP will stop a worker if it is more than s iterations ahead of any other worker, where s is called the staleness threshold (Figure 7).
More formally, under SSP, every worker machine keeps an iteration counter and a local view of the model parameters. SSP worker machines “commit” their updates, and then invoke a clock() function that (1) signals that their iteration has ended, (2) increments their iteration counter, and (3) informs the SSP system to start communicating the updates to other machines, so they can update their local views of the model. This clock() is analogous to BSP’s synchronization barrier, but is different in that updates from one worker do not need to be immediately communicated to other workers — as a consequence, workers may proceed even if they have only received a partial subset of the updates. This means that the local views of the model can become stale, if some updates have not been received yet. Given a user-chosen staleness threshold s, an SSP implementation enforces at least the following bounded staleness conditions:

Bounded clock difference: The iteration counters on the slowest and fastest workers must be no more than s apart — otherwise, SSP forces the fastest worker to wait for the slowest worker to catch up.

Timestamped updates: At the end of each iteration (right before calling clock()), each worker commits an update, which is timestamped with the worker’s current iteration counter.

Model state guarantees: When a worker with iteration counter t computes an update, its local view of the model is guaranteed to include all updates with timestamp at most t − s − 1. The local view may or may not contain updates from other workers with more recent timestamps.

Read-my-writes: Each worker always includes its own updates in its own local view of the model.
Since the fastest and slowest workers are at most s clocks apart, a worker’s local view of the model at iteration t will include all updates from all workers with timestamps in [0, t − s − 1], plus some (or possibly none) of the updates whose timestamps fall in the range [t − s, t + s − 1]. Note that SSP is a strict generalization of BSP for ML programs: when s = 0, the first range becomes [0, t − 1] while the second range becomes empty, which corresponds exactly to BSP execution of an ML program.
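The bounded clock difference condition can be sketched in a few lines (our own simulation; a real SSP system would block the worker's thread rather than return False):

```python
def ssp_step(clocks, p, staleness):
    """Advance worker p's iteration counter under SSP: the advance is
    allowed only if it would not put p more than `staleness` iterations
    ahead of the slowest worker; otherwise p must wait."""
    if clocks[p] + 1 - min(clocks) > staleness:
        return False          # worker p is blocked at the staleness bound
    clocks[p] += 1
    return True
```

With staleness 2, a fast worker can run at most two iterations ahead before it is forced to wait for the stragglers; once any straggler advances, the fast worker is released again.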
Because SSP always limits the maximum staleness between any pair of workers to s, it enjoys strong theoretical convergence guarantees for both data parallel and model parallel execution. We state two complementary theorems to this effect:
Theorem 3 (adapted from [12])
SSP data parallel Convergence Theorem: Consider convex objective functions of the form f(A) = Σ_t f_t(A), where the individual components f_t are also convex. We search for a minimizer via data parallel stochastic gradient descent on each component under SSP, with staleness parameter s and P workers. Let the data parallel updates be the negative gradients of the components, with a step size decreasing as O(1/√t). Under suitable conditions (the components f_t are Lipschitz, and the divergence between parameter iterates is bounded), we have the following convergence rate guarantee:
The bound vanishes as the number of iterations T → ∞. In particular, it depends on the maximum staleness under SSP, the average staleness actually experienced by the distributed system, and the variance of that staleness.
This data parallel SSP theorem has two implications: first, data parallel execution under SSP is correct (just like BSP), because the difference between the SSP parameter estimate and the true optimum converges to zero in probability, with an exponential tail-bound. Second, it is important to keep the actual staleness and asynchrony as low as possible: the convergence bound becomes tighter with lower maximum staleness, and with lower average and variance of the staleness experienced by the workers. For this reason, naive asynchronous systems (e.g. Hogwild! [43] and YahooLDA [2]) may experience poor convergence in complex production environments, where machines may temporarily slow down due to other tasks or users — in turn causing the maximum staleness and staleness variance to become arbitrarily large, leading to poor convergence rates.
Theorem 4 (to appear in 2016)
SSP model parallel Asymptotic Consistency: We consider minimizing objective functions of the form f(A) = g(A) + h(A), using a model parallel proximal gradient descent procedure that keeps a centralized “global view” of the parameters (e.g. on a key-value store) and stale local worker views on each worker machine. If the descent step size is chosen suitably small, then the global view and local worker views will satisfy:

the sequence of global views converges, i.e. successive global views eventually stop changing;

the difference between every local worker view and the global view vanishes, for all workers;

The limit points of the local worker views coincide with those of the global view, and both are critical points of f.
Items 1 and 2 imply that the global view will eventually stop changing (i.e. converge), and the stale local worker views will converge to the global view — in other words, SSP model parallel execution will terminate to a stable answer. Item 3 further guarantees that the local and global views will reach an optimal solution — in other words, SSP model parallel execution outputs the correct solution. Given additional technical conditions, we can further establish a rate of convergence for SSP model parallel execution.
The above two theorems show that both data parallel and model parallel ML programs running under SSP enjoy near-ideal convergence progress per iteration (approaching that of BSP and sequential execution). For example, the Bösen system [23, 12, 55] uses SSP to achieve up to 10-fold shorter convergence times compared to the BSP bridging model — and SSP with properly selected staleness values will not exhibit non-convergence, unlike asynchronous execution (Figure 8). In summary, when SSP is effectively implemented and tuned, it can come close to enjoying the best of both worlds: near-ideal progress per iteration, close to BSP, and near-ideal P-fold iteration throughput, similar to asynchronous execution — and hence, a near-ideal P-fold speedup in ML program execution time.
3.3 How to Communicate: Managed Communication and Topologies
The bridging models (BSP and SSP) just discussed place constraints on when ML computation should occur, relative to the communication of updates to the model parameters, in order to guarantee correct ML program execution. However, within the constraints set by a bridging model, there is still room to prescribe how, or in what order, the updates should be communicated over the network. Consider the MapReduce sort example, under the BSP bridging model: the Mappers need to send key-value pairs with the same key to the same Reducer. While this can be performed via a bipartite topology (every Mapper communicates with every Reducer), one could instead use a star topology, where a third set of machines first aggregates all key-value pairs from the Mappers, and then sends them to the Reducers.
ML algorithms under the SSP bridging model open up an even wider design space — because SSP only requires updates to arrive no more than s iterations late, we could choose to send more important updates first, following the intuition that this should naturally improve algorithm progress per iteration. These considerations are important because every cluster or datacenter has its own physical switch topology and available bandwidth along each link, and we shall discuss them with the view that choosing the correct communication management strategy will lead to a noticeable improvement in both ML algorithm progress per iteration and iteration throughput. We now discuss several ways in which communication management can be applied to distributed ML systems.
Continuous communication: In the first implementations of the SSP bridging model, all inter-machine communication occurred right after the end of each iteration (i.e. right after the SSP clock() command) [23], while leaving the network idle at most other times (Figure 11). The resulting burst of communication (GBs to TBs) may cause synchronization delays (where updates take longer than expected to reach their destination); these delays can be optimized away by adopting a continuous style of communication, where the system waits for existing updates to finish transmission before starting new ones [55].
Continuous communication can be achieved by a rate limiter in the SSP implementation, which queues up outgoing communications, and waits for previous communications to finish before sending out the next in line. Importantly, regardless of whether the ML algorithm is data parallel or model parallel, continuous communication still preserves the SSP bounded staleness conditions — and therefore, it continues to enjoy the same worst-case convergence progress per iteration guarantees as SSP. Furthermore, because managed communication reduces synchronization delays, it also provides a small (2-to-3-fold) speedup to overall convergence time [55], which is partly due to improved iteration throughput (because of fewer synchronization delays), and partly due to improved progress per iteration (fewer delays also mean lower average staleness in local parameter views, and hence SSP’s progress per iteration improves according to Theorem 3).
Wait-free Backpropagation: The deep learning family of ML models [29, 13] presents a special opportunity for continuous communication, due to their highly layered structure. Two observations stand out in particular: (1) the “backpropagation” gradient descent algorithm — used to train deep learning models such as Convolutional Neural Networks (CNNs) — proceeds in a layer-wise fashion; (2) the layers of a typical CNN (such as “AlexNet” [29]) are highly asymmetric in terms of model size and required computation for the backpropagation — usually, the top fully-connected layers hold the vast majority of the parameters, while the bottom convolutional layers account for the bulk of the backpropagation computation [65]. This allows for a specialized type of continuous communication, which we call wait-free backpropagation: after performing backpropagation on the top layers, the system communicates their parameters while performing backpropagation on the bottom layers. This spreads the computation and communication out in an optimal fashion, in essence “overlapping computation with communication”.
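A back-of-the-envelope timeline illustrates the benefit (our own simplified model: a single network link, layers processed top to bottom, and compute able to overlap with communication):

```python
def backprop_schedule(layer_compute, layer_comm):
    """Wait-free backpropagation timeline sketch: communication of a
    layer's gradients starts as soon as its backward pass finishes,
    overlapping with the backward passes of the layers below it. Returns
    the total wall-clock time without and with overlap."""
    # Sequential baseline: all compute, then all communication.
    sequential = sum(layer_compute) + sum(layer_comm)
    # Overlapped: track when each layer's communication can start/finish.
    t_compute, t_comm_free = 0.0, 0.0
    for c, m in zip(layer_compute, layer_comm):
        t_compute += c                       # backward pass of this layer done
        start = max(t_compute, t_comm_free)  # comm waits for the link to free
        t_comm_free = start + m
    return sequential, max(t_compute, t_comm_free)
```

With a communication-heavy top layer (compute times [1, 1, 1], communication times [3, 1, 1]), the overlapped timeline finishes in 6 units instead of the sequential 8, mirroring the parameter-heavy top layers of a CNN.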
Update prioritization: Another communication management strategy is to prioritize available bandwidth, by focusing on communicating those updates (or parts of updates) that contribute most to convergence. This idea has a close relationship with the Structure Aware Parallelization discussed in Section 3.1 — while SAP prioritizes computation towards more important parameters, update prioritization ensures that the changes to these important parameters are quickly propagated to other worker machines, so that their effects are immediately felt. As a concrete example, in ML algorithms that use stochastic gradient descent (e.g. Logistic Regression and Lasso Regression), the objective function changes in proportion to the magnitude of the parameter changes, and hence the fastest-changing parameters are often the largest contributors to solution quality.
Thus, the SSP implementation can be further augmented by a prioritizer, which rearranges the updates in the rate limiter's outgoing queue so that more important updates are sent out first. The prioritizer can support strategies such as the following: (1) Absolute magnitude prioritization: updates to parameters are reordered according to the magnitude of their recent accumulated change |δ|, which works well for ML algorithms that use stochastic gradient descent; (2) Relative magnitude prioritization: the same as absolute magnitude, but the sorting criterion is |δ/A|, i.e. the accumulated change δ normalized by the current parameter value A. Empirically, these prioritization strategies yield a further speedup on top of SSP and continuous communication [55], and there is potential to explore strategies tailored to a specific ML program (similar to the SAP prioritization criterion for Lasso).
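Both sorting criteria can be expressed as a small reordering function. This is a minimal sketch, not the system's actual prioritizer: `queued` maps parameter indices to their accumulated changes δ, and `current` holds the present parameter values A; both are assumed inputs for illustration.

```python
def prioritize(queued, current, mode="absolute"):
    """Reorder queued parameter updates so larger-magnitude changes
    are transmitted first (illustrative sketch).

    queued  : dict mapping parameter index -> accumulated change delta
    current : dict mapping parameter index -> current parameter value
    mode    : "absolute" sorts by |delta|; "relative" sorts by |delta / A|
    """
    if mode == "absolute":
        key = lambda j: abs(queued[j])
    else:
        key = lambda j: abs(queued[j] / current[j])
    return sorted(queued, key=key, reverse=True)

queued = {0: 0.5, 1: -2.0, 2: 0.1}     # accumulated deltas per parameter
current = {0: 0.5, 1: 100.0, 2: 0.01}  # current parameter values
by_abs = prioritize(queued, current, "absolute")  # parameter 1 first
by_rel = prioritize(queued, current, "relative")  # parameter 2 first
```

Note how the two criteria can disagree: parameter 1 has the largest raw change, but relative to its large current value that change is tiny, so relative prioritization demotes it.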
Parameter Storage and Communication Topologies: A third communication management strategy is to consider the placement of model parameters across the network (parameter storage), as well as the network routes along which parameter updates are communicated (communication topologies). The choice of parameter storage strongly influences the communication topologies that can be used, which in turn impacts the speed at which parameter updates can be delivered over the network (as well as their staleness). Hence, we begin by discussing two commonly-used paradigms for storing model parameters (Fig 12): (1) Centralized storage: a "master view" of the parameters A is partitioned across a set of server machines, while worker machines maintain local views of the parameters. Communication is asymmetric in the following sense: updates are sent from the workers to the servers, and workers receive the most up-to-date version of the parameters from the servers. (2) Decentralized storage: every worker maintains its own local view of the parameters, without a centralized server. Communication is symmetric: workers send updates to each other, in order to keep their local views of A up to date.
The centralized storage paradigm can be supported by a master-slave network topology (Fig 13), where machines are organized into a bipartite graph with servers on one side and workers on the other, whereas the decentralized storage paradigm can be supported by a peer-to-peer (P2P) topology (Fig 14), where each worker machine broadcasts to all other workers. An advantage of the master-slave topology is that it reduces the number of messages that need to be sent over the network: workers only need to send their updates to the servers, which aggregate them (e.g. by summation) and apply them to the master view of the parameters A. The updated parameters can then be broadcast to the workers as a single message, rather than a collection of individual updates; in total, only O(P) messages need to be sent among P workers. In contrast, P2P topologies must send O(P²) messages every iteration, because each worker must broadcast to every other worker.
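The asymptotic message counts follow directly from the two topologies' definitions, and can be made concrete with a short calculation. The exact constant for the master-slave case depends on bookkeeping conventions (here we count one inbound and one outbound message per worker, treating the servers as a single logical endpoint); the counts below are illustrative, not a claim about any particular system.

```python
def master_slave_messages(num_workers):
    """Each worker sends its update to the servers, and receives one
    aggregated parameter broadcast back: O(P) messages in total."""
    return 2 * num_workers

def p2p_messages(num_workers):
    """Every worker broadcasts its update to every other worker:
    P * (P - 1), i.e. O(P^2) messages per iteration."""
    return num_workers * (num_workers - 1)

small = (master_slave_messages(8), p2p_messages(8))     # (16, 56)
large = (master_slave_messages(64), p2p_messages(64))   # (128, 4032)
```

The gap widens quadratically with cluster size, which is why the compressibility of P2P messages (discussed next) matters so much: P2P sends far more messages, but each one can potentially be far smaller.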
However, when the updates have a compact or compressible structure, such as low-rankness in matrix-parameterized ML programs like deep learning, or sparsity in Lasso regression, the P2P topology can achieve considerable communication savings over the master-slave topology. By compressing or re-representing each update in a more compact low-rank or sparse form, each of the O(P²) P2P messages can be made much smaller than the master-to-slave messages, which may not admit compression (because those messages consist of the actual parameters A, not the compressible updates). Furthermore, even the number of P2P messages can be reduced, by switching from a full P2P topology to a partially connected Halton Sequence topology (Fig 15) [35], in which each worker only communicates directly with a subset of workers. Workers can reach any other worker by routing messages through intermediate nodes: for example, a message from worker 1 to worker 6 might be routed through two intermediate workers. The intermediate nodes can combine messages meant for the same destination, thus reducing the number of messages per iteration (and further reducing network load). However, one drawback of the Halton Sequence topology is that routing increases the time taken for messages to reach their destination, which raises the average staleness of parameters under the SSP bridging model; e.g. a message routed over three hops from worker 1 to worker 6 would be three iterations stale. The Halton Sequence topology is nevertheless a good option for very large cluster networks, which have limited peer-to-peer bandwidth.
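The routing-versus-staleness trade-off can be seen in a tiny simulation. The adjacency structure below is an invented partially connected topology over six workers, NOT the actual Halton Sequence construction from [35]; it only illustrates the general point that in a partially connected topology, each message's hop count equals the number of iterations of staleness it accrues under SSP.

```python
from collections import deque

def shortest_hops(adj, src, dst):
    """Breadth-first search for the minimum number of hops (and hence
    iterations of staleness) between two workers in a partially
    connected topology."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if node == dst:
            return hops
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    return None  # unreachable

# Invented topology: each worker sends directly to only 2 of the
# other 5 workers, instead of broadcasting to all of them.
adj = {1: [2, 4], 2: [3, 5], 3: [4, 6],
       4: [5, 1], 5: [6, 2], 6: [1, 3]}

direct = shortest_hops(adj, 1, 2)   # 1 hop: fresh by SSP standards
routed = shortest_hops(adj, 1, 6)   # 3 hops: three iterations stale
```

With only two outgoing links per worker, per-iteration message count drops from P(P−1) = 30 to 2P = 12, at the price of multi-hop delivery for distant pairs.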
By combining the various aspects of "how to communicate" (continuous communication, update prioritization, and a suitable combination of parameter storage and communication topology), we can design a distributed ML system that enjoys multiplicative speed benefits from each aspect, resulting in an almost order-of-magnitude speed improvement on top of what SAP (how to distribute) and SSP (bridging models) can offer. For example, the Bösen SSP system enjoys up to an additional 4-fold speedup from continuous communication and update prioritization, as shown in Figures 9 and 10 [55].
3.4 What to Communicate
Going beyond how to store and communicate updates between worker machines, we may also ask "what" needs to be communicated in each update: in particular, is there any way to reduce the number of bytes required to transmit the updates, and thus further alleviate the communication bottleneck in distributed ML programs [56]? This question is related to the idea of lossless compression in operation-centric programs; for example, Hadoop MapReduce is able to compress key-value pairs to reduce their transmission cost from Mappers to Reducers. For data-parallel ML programs, a commonly-used strategy for reducing the size of messages is to aggregate (i.e. sum) the updates before transmission over the network, taking advantage of their additive structure (such as in the Lasso data-parallel example, Eq 10). Such early aggregation is preferred for centralized parameter storage paradigms that communicate full parameters from servers to workers [23, 12], and it is natural to ask if there are other strategies that may be better suited to different storage paradigms.
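Early aggregation is straightforward to sketch. The example below is illustrative: each dict stands for the sparse update produced by one local mini-batch, and because updates combine additively, summing them locally lets the worker send one aggregated message instead of one message per mini-batch.

```python
def early_aggregate(updates):
    """Sum a worker's per-mini-batch updates locally before
    transmission, exploiting their additive structure.

    updates: list of dicts, each mapping parameter index -> delta
    returns: a single dict of summed deltas (one outgoing message)
    """
    total = {}
    for delta in updates:
        for j, value in delta.items():
            total[j] = total.get(j, 0.0) + value
    return total

# Three mini-batches of sparse updates computed on one worker:
local = [{0: 0.1, 1: -0.2}, {0: 0.3}, {1: 0.1, 2: 0.5}]
msg = early_aggregate(local)   # one message replaces three
```

Beyond saving messages, the aggregated update is often no larger than a single raw update when the nonzero supports overlap, so bytes on the wire shrink as well.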
To answer this question, we may inspect the mathematical structure of the ML parameters A and the nature of their updates. A number of popular ML programs have matrix-structured parameters (written in boldface to distinguish them from the generic A): examples include multiclass logistic regression (MLR), neural networks (NN) [10], distance metric learning (DML) [59] and sparse coding [44]. We refer to these as matrix-parameterized models (MPMs), and note that the parameter matrix can be very large in modern applications: in one application of MLR to Wikipedia [45], it contains several billion entries (10s of GB). It is also worth pointing out that typical computer cluster networks can transmit at most a few GB per second between two machines, hence naive synchronization of such matrices and their updates is far from instantaneous. Because parameter synchronization occurs many times across the lifetime of an iterative-convergent ML program, the time required for synchronization can become a substantial bottleneck.
More formally, an MPM is an ML objective function with the following specialized form: