ASYNC: Asynchronous Machine Learning on Distributed Systems

07/19/2019 ∙ by Saeed Soori, et al. ∙ University of Toronto ∙ Rutgers University

ASYNC is a framework that supports the implementation of asynchronous machine learning methods on cloud and distributed computing platforms. The popularity of asynchronous optimization methods has increased in distributed machine learning. However, their applicability and practical experimentation on distributed systems are limited because current engines do not support many of the algorithmic features of asynchronous optimization methods. ASYNC implements the functionality and the API to provide practitioners with a framework to develop and study asynchronous machine learning methods and execute them on cloud and distributed platforms. The synchronous and asynchronous variants of two well-known optimization methods, stochastic gradient descent and SAGA, are implemented in ASYNC and examples of implementing other algorithms are also provided.


1. Introduction

Asynchronous optimization methods have gained significant traction in the algorithmic machine learning community. These methods demonstrate near-linear speedups with an increasing number of processors and, because of their asynchronous execution model, can be significantly more efficient in processing large data on cloud and distributed platforms (Dean et al., 2012; Li et al., 2014; Zhou et al., 2017). Machine learning practitioners are actively introducing novel algorithms to improve the convergence and performance of asynchronous optimization methods. However, frameworks that support robust implementation of these algorithmic features and their execution on distributed systems are not available. State-of-the-art cluster computing engines such as Spark (Zaharia et al., 2012) support the implementation of synchronous optimization methods; however, because of their bulk synchronous and deterministic execution model, they do not support asynchronous algorithms.

The asynchrony in asynchronous optimization methods and the presence of stragglers, i.e. slow workers (Lee et al., 2017), in distributed systems introduce staleness to the execution, which increases the time to convergence. The machine learning community is working to mitigate this staleness with numerous strategies such as bounding the staleness (Cipar et al., 2013), using dynamic barrier controls (Zhang et al., 2018), formulating staleness-dependent learning strategies with hyperparameter selection (Zhang et al., 2015b), and reducing the variance of noise from randomization (Huo and Huang, 2017). The algorithms that mitigate staleness are developed with information about the distributed system's state, such as the number of stragglers. These algorithms also control task assignment and task scheduling strategies in the system. Thus, to perform well in a real distributed setting, practitioners need to tune and implement asynchronous optimization algorithms with easy-to-use frameworks that support asynchronous execution and provide information about, and control over, the execution environment. To the best of our knowledge, such a framework does not exist; thus, the implementation of asynchronous optimization algorithms on distributed systems is tedious and often impossible for machine learning practitioners.

Parameter server frameworks are frequently used by machine learning practitioners to implement distributed machine learning methods. Work such as DistBelief (Dean et al., 2012) and Project Adam (Chilimbi et al., 2014) uses parameter server architectures to train deep neural networks on distributed platforms. The Petuum (Xing et al., 2015) framework supports a more general class of optimization methods and enables the implementation of bounded delays with parameter server architectures. Aside from support for bounded delays, however, because they do not provide control over worker-specific features such as status and staleness, parameter server frameworks do not support many of the strategies used in state-of-the-art asynchronous algorithms, such as flexible barrier control methods, variance reduction, and worker-specific hyperparameter selection.

Cloud-based engines known as distributed dataflow systems, such as Hadoop (Hadoop, 2011) and Spark (Zaharia et al., 2012), have gained tremendous popularity in large-scale machine learning. However, these frameworks use a bulk synchronous execution model and synchronous all-reduce paradigms, which limit their application for asynchronous machine learning. Libraries such as Mllib (Meng et al., 2016) provide implementations of some synchronous optimization methods in Spark; synchronous algorithms that use historical gradients for variance reduction are not supported in Mllib. Furthermore, Mllib does not support implementations of asynchronous algorithms. Glint (Jagerman and Eickhoff, 2016) implements the parameter server architecture on Spark; however, because worker-specific gradient reduction is not implemented in Glint, it does not support mini-batch asynchronous optimization methods.

This work presents ASYNC, a framework that supports asynchronous machine learning applications on cloud and distributed platforms. ASYNC implements the functionality and components needed in cloud engines to enable the execution of asynchronous optimization methods on distributed platforms. The ASYNC-specific API provides machine learning practitioners with the information needed to analyze and develop many of the novel strategies used to mitigate the effects of staleness in asynchronous executions. ASYNC comes with a robust interface that enables the implementation of algorithmic features such as barrier control, history of gradients, and dynamic hyperparameter selection. The major contributions of this paper are:

  • A novel framework for machine learning practitioners to implement and dispatch asynchronous machine learning applications on cloud and distributed platforms. ASYNC introduces three components to cloud engines, the ASYNCcoordinator, the ASYNCbroadcaster, and the ASYNCscheduler, to enable the asynchronous gathering, broadcasting, and scheduling of tasks and results.

  • A novel broadcast strategy with bookkeeping structures, applicable to both synchronous and asynchronous algorithms, to facilitate the implementation of variance reduced optimization methods that operate on historical gradients.

  • A robust programming model built on top of Spark that enables the implementation of asynchronous optimization methods while preserving the in-memory and fault tolerant features of Spark.

  • A demonstration of ease-of-implementation in ASYNC with the implementation and performance analysis of two well-known optimization methods, stochastic gradient descent (SGD) (Robbins and Monro, 1951) and SAGA (Defazio et al., 2014), and their asynchronous variants on a distributed platform with straggling machines. Our results demonstrate that asynchronous SAGA (ASAGA) (Leblond et al., 2016) and asynchronous SGD (ASGD) outperform their synchronous variants by up to 4× on a distributed system with production-cluster straggler patterns.

2. Preliminaries

Distributed machine learning often results in solving an optimization problem in which an objective function is optimized by iteratively updating the model parameters until convergence. Distributed implementation of optimization methods includes workers that are assigned tasks to process parts of the training data, and one or more servers, i.e. masters, that store and update the model parameters. Distributed machine learning models often result in the following structure:

(1)    $\min_{w \in \mathbb{R}^d} \; f(w) := \frac{1}{N} \sum_{i=1}^{N} f_i(w)$

where $w \in \mathbb{R}^d$ is the model parameter to be learned, $N$ is the number of servers, and $f_i$ is the local loss function computed by server $i$ based on its assigned training data. Each server $i$ has access to $n_i$ data points, where the local cost has the form

(2)    $f_i(w) = \frac{1}{n_i} \sum_{j=1}^{n_i} f_{i,j}(w)$

for some loss functions $f_{i,j}$ (see e.g. (Shamir et al., 2014; Vapnik, 2013)).

For example, in supervised learning, given an input-output pair $(x_j, y_j)$, the loss function can be $f_{i,j}(w) = \ell(c(w; x_j), y_j)$, where $c$ is a fixed prediction function of choice and $\ell$ is a convex loss function that measures the loss if $y_j$ is predicted from $x_j$ based on the model parameter $w$. This setting covers empirical risk minimization problems in machine learning that include linear and non-linear regression, support vector machines, and other classification problems such as logistic regression (see e.g. (Vapnik, 2013; Shalev-Shwartz et al., 2009)). In particular, if $c(w; x_j) = x_j^T w$ and $\ell$ is the square of the Euclidean distance, we obtain the familiar least squares problem

(3)    $\min_{w \in \mathbb{R}^d} \frac{1}{N} \sum_{i=1}^{N} f_i(w)$

where

(4)    $f_i(w) = \frac{1}{n_i} \lVert A_i w - y_i \rVert_2^2$

with $y_i$ a column vector of length $n_i$ and $A_i$ the data matrix whose $j$-th row is given by the input $x_j^T$.

Machine learning optimization problems can be solved with first-order algorithms, such as gradient descent (GD), or with second-order methods that are based on adaptations of classical Newton and quasi-Newton methods (Shamir et al., 2014; Reddi et al., 2016; Zhang and Lin, 2015; Wang et al., 2017; Dünner et al., 2018). Deterministic variants of these methods are costly and are not practical for solving (1) in large-scale distributed settings, so their stochastic versions are more popular.

In the following we use gradient descent as an example to introduce stochastic optimization and other terminology used throughout the paper, such as mini-batch size and hyperparameter selection. The introduced terms apply to all of the optimization problems considered here and are widely used in the machine learning literature. GD iteratively computes the gradient of the loss function to update the model parameters, i.e. $w_{k+1} = w_k - \eta_k \nabla f(w_k)$ at iteration $k$. To implement gradient descent on a distributed system, each server computes its local gradient $\nabla f_i(w_k)$; the local gradients are aggregated by the master when ready. The full pass over the data at every iteration of the algorithm with synchronous updates leads to large overheads. Distributed stochastic gradient descent (SGD) methods and their variants (Cutkosky and Busa-Fekete, 2018; Gemulla et al., 2011; Chen et al., 2016) are, on the other hand, scalable and popular methods for solving (1); they go back to the seminal work of (Robbins and Monro, 1951) in centralized settings. Distributed SGD replaces the local gradient $\nabla f_i(w_k)$ with an unbiased stochastic estimate $\tilde{\nabla} f_i(w_k)$ of it, computed from a subset of the local data points:

(5)    $\tilde{\nabla} f_i(w_k) = \frac{1}{|S_{i,k}|} \sum_{j \in S_{i,k}} \nabla f_{i,j}(w_k)$

where $S_{i,k}$ is a random subset that is sampled with or without replacement at iteration $k$, and $|S_{i,k}|$ is the number of elements in $S_{i,k}$ (Bottou, 2012), also called the mini-batch size. To obtain desirable accuracy and performance, implementations of stochastic optimization methods such as SGD require tuning algorithm parameters, a process often referred to as hyperparameter selection. For example, the step size $\eta_k$ and the mini-batch size are parameters to tune in SGD methods (Bottou, 2012). Since gradients contain noise in SGD, convergence to the optimum requires a decaying step size (Moulines and Bach, 2011), which may lead to poor convergence behavior, especially if the step size is not tuned well to the dataset (Bottou, 2012; Tan et al., 2016). A common practice is to assume that the step size has the form $\eta_k = c_1 / (c_2 + k)$ for some scalar constants $c_1$ and $c_2$ and choose these constants with heuristics before running the algorithm, but there are also more recent methods that choose the step size adaptively over the iterations based on the dataset (Tan et al., 2016; De et al., 2016).
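Combining (5) with the decaying step size above, one server-side iteration of distributed mini-batch SGD can be summarized as follows (a sketch in the notation of this section, not ASYNC-specific pseudocode):

$w_{k+1} = w_k - \frac{c_1}{c_2 + k} \cdot \frac{1}{N} \sum_{i=1}^{N} \tilde{\nabla} f_i(w_k), \qquad \tilde{\nabla} f_i(w_k) = \frac{1}{|S_{i,k}|} \sum_{j \in S_{i,k}} \nabla f_{i,j}(w_k).$

Each server contributes its own mini-batch estimate and the master applies the aggregated step; the asynchronous variants discussed next relax the requirement that all $N$ estimates arrive before the update.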

3. Synchronous and Asynchronous Optimization

This section discusses the algorithmic properties that distinguish distributed asynchronous optimization methods from their synchronous variants. In distributed optimization, workers compute local gradients of the objective function and then communicate the computed gradients to the server. To proceed to the next iteration of the algorithm, the server updates the shared model parameters with the received gradients, broadcasts the most recent model parameter, and schedules new tasks.

In synchronous implementations of optimization methods, the server proceeds to the next iteration only after all workers have communicated their gradients. In asynchronous implementations, however, the server can update and broadcast the model parameters without having to wait for all worker tasks to complete. Asynchronous execution allows the algorithm to make progress even in the presence of stragglers, which is known as an increase in hardware efficiency (Cipar et al., 2013). This progress comes at a cost: asynchrony inevitably adds staleness to the system, wherein some workers compute gradients using model parameters that may be several gradient steps behind the most recent model parameters, which can lead to poor convergence. This is also referred to as a worsening in statistical efficiency (Chen et al., 2016).

Asynchronous optimization algorithms are formulated and developed with properties that balance statistical efficiency and hardware efficiency to maximize the performance of the optimization methods on distributed systems. Properties in the design of asynchronous optimization methods that enable this balance are barrier control, historical gradients, and dynamic hyperparameter selection.

Barrier control. Barrier control strategies in asynchronous algorithms determine whether a worker should proceed to the computations of the next iteration or wait until a specific number of workers have communicated their results to the server. Synchronous algorithms follow a Bulk Synchronous Parallel (BSP) execution, in which a worker cannot proceed until the model parameters are fully updated by all workers. Barrier control strategies in asynchronous algorithms are classified into Asynchronous Parallel (ASP), in which a worker proceeds to the next iteration without waiting for the latest model parameters, and Stale Synchronous Parallel (SSP), in which workers synchronize when parameter staleness (determined by the number of stragglers) exceeds a threshold. BSP can be implemented in available cluster computing engines such as Spark. However, because of the synchronous and deterministic execution model of such engines, ASP and SSP are not supported. ASYNC supports ASP and SSP and also facilitates the implementation of other barrier control methods that use metrics besides staleness, such as worker task completion time (Zhang et al., 2018).

Historical gradients. Popular distributed optimization methods arising in machine learning applications are typically stochastic (see e.g. (Agarwal and Duchi, 2011; Shamir et al., 2014; Zhang and Lin, 2015)). Stochastic optimization methods use a noisy gradient computed from random data samples instead of the true gradient that operates on the entire data, which can lead to poor convergence. Variance reduction techniques, used in both synchronous and asynchronous optimization, augment the noisy gradient to reduce this variance. A class of variance-reduced asynchronous algorithms that has led to significant improvements over traditional methods memorizes the gradients computed in previous iterations, i.e. historical gradients (Defazio et al., 2014). Historical gradients cannot be implemented in cluster computing engines such as Spark, primarily because Spark can only broadcast the entire history of the model parameters, which can be very large and lead to significant overheads.
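To make the role of historical gradients concrete, the (serial) SAGA update of (Defazio et al., 2014), with $\alpha_j$ denoting the stored historical gradient of sample $j$, reads:

$w_{k+1} = w_k - \eta \Big( \nabla f_j(w_k) - \alpha_j + \frac{1}{n} \sum_{l=1}^{n} \alpha_l \Big), \qquad \alpha_j \leftarrow \nabla f_j(w_k),$

where $j$ is the sampled index at iteration $k$; the correction terms built from the history are what reduce the variance of the noisy gradient. Supporting this history efficiently on a cluster is precisely what the ASYNCbroadcaster in Section 4.3 addresses.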

Dynamic hyperparameter selection. The staleness associated with each gradient computation increases the time to convergence in asynchronous optimization methods. The staleness grows with increasing stragglers and delays. Some of the algorithm parameters can be dynamically adjusted, i.e. dynamic hyperparameter selection, to compensate for staleness. For example, some asynchronous optimization methods adjust the learning rate using the staleness value (Zhang et al., 2015b; Hu et al., 2018). Dynamic hyperparameter selection is not supported in engines such as Spark because these frameworks do not collect or expose the required parameters. Therefore, available frameworks cannot be used to analyze and experiment with the effects of hyperparameter selection on the convergence of asynchronous optimization methods.
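As an illustration, the staleness-dependent learning rate of (Zhang et al., 2015b) divides the step size by the staleness $\tau$ of the task result being applied, i.e.

$w_{k+1} = w_k - \frac{\eta}{\tau} \, \tilde{\nabla} f_i(w_{k-\tau}),$

so that gradients computed from older model parameters take proportionally smaller steps; Listing 1 in Section 5.3 shows how this rule can be expressed with ASYNC's worker attributes.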

Figure 1. An overview of the ASYNC framework.

4. ASYNC: A Framework for Asynchronous Learning

ASYNC is a framework for the implementation and execution of asynchronous optimization algorithms while retaining the map-reduce model, scalability, and fault tolerance of state-of-the-art cluster computing engines. The components in ASYNC enable the implementation of algorithmic features in asynchronous optimization on cluster computing platforms. ASYNC is implemented on top of Spark (Zaharia et al., 2012). Because of its deterministic execution model, e.g. synchronous all-reduce, asynchronous execution is not supported in Spark. ASYNC changes Spark internals to provide this support.

The three main components in ASYNC are the ASYNCcoordinator, the ASYNCbroadcaster, and the ASYNCscheduler, which, along with a number of bookkeeping structures, enable the implementation of important algorithmic properties in asynchronous optimization such as barrier control, historical gradients, and dynamic hyperparameter selection. In the following we introduce the internal elements of ASYNC and discuss how these components work together to facilitate the implementation of asynchronous optimization methods. Figure 1 shows the internals of ASYNC.

4.1. Bookkeeping structures in ASYNC

To support the implementation of algorithmic features in asynchronous methods, ASYNC collects and stores information on the workers and the state of the system. This information is used by the internal components of ASYNC to facilitate the asynchronous execution; Spark does not collect or store these structures as it does not support asynchronous applications. For each submitted task result, the server stores the worker's ID, staleness, mini-batch size, and the task result itself. The per-task stored data, such as staleness, is used to implement features such as dynamic hyperparameter selection. For each worker, with the help of the ASYNCcoordinator, the server also stores the worker's most recent status in a table called STAT; the status includes the worker's staleness, average-task-completion time, and availability. A worker is available if it is not executing a task and unavailable otherwise. The average-task-completion time is the average time the worker takes to execute a task. The number of available workers and the maximum overall worker staleness are also stored on the server.
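A minimal sketch of what these bookkeeping structures could look like is given below; the class and field names are illustrative assumptions and do not correspond to ASYNC's actual internal definitions:

object BookkeepingSketch {
  // Per-task record: stored by the server for every submitted task result.
  case class TaskResult[T](
    workerId: Int,      // worker that produced the result
    staleness: Int,     // iterations elapsed since the model it used was broadcast
    miniBatchSize: Int, // number of samples processed by the task
    result: T)          // the task result itself, e.g. a partial gradient

  // Per-worker record kept in the STAT table.
  case class WorkerStat(
    staleness: Int,      // most recent staleness observed for this worker
    avgTaskTime: Double, // average-task-completion time, in seconds
    available: Boolean)  // true if the worker is not currently executing a task

  type StatTable = Map[Int, WorkerStat]

  // Server-side summaries that are also tracked: available workers and maximum staleness.
  def availableWorkers(stat: StatTable): Int = stat.values.count(_.available)
  def maxStaleness(stat: StatTable): Int =
    if (stat.isEmpty) 0 else stat.values.map(_.staleness).max
}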

4.2. The ASYNCcoordinator

The main building block of ASYNC is the ASYNCcoordinator, which collects the bookkeeping structures and coordinates the function of the other components. The ASYNCcoordinator annotates a task result with the worker attributes to be used for the implementation of dynamic hyperparameter selection and for updating the model parameters. The worker task results are processed by the ASYNCcoordinator and are then forwarded to other modules in the framework. When a worker submits a task result, the coordinator extracts worker attributes using information on the server, such as the iteration at which the gradient was computed. This information is used to tag the task result with the worker attributes. The ASYNCcoordinator also updates the worker STAT table. It monitors worker availability, updates the list of available workers, and computes the workers' average-task-completion time and staleness. The workers' status is passed to the ASYNCscheduler to facilitate the implementation of barrier control strategies.
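The sketch below illustrates the kind of coordinator-side logic described above for tagging an incoming task result and refreshing the STAT table; the names, fields, and averaging rule are assumptions made only for illustration:

object CoordinatorSketch {
  case class WorkerStat(staleness: Int, avgTaskTime: Double, available: Boolean)

  // Called when a worker submits a task result: compute its staleness, update the
  // worker's entry in STAT, and return the tagged staleness for downstream use.
  def onTaskResult(stat: Map[Int, WorkerStat],
                   workerId: Int,
                   broadcastIteration: Int, // iteration whose model parameter the task used
                   currentIteration: Int,
                   taskTime: Double): (Map[Int, WorkerStat], Int) = {
    val staleness = currentIteration - broadcastIteration
    val prev = stat.getOrElse(workerId, WorkerStat(0, taskTime, available = true))
    val avgTaskTime = 0.5 * (prev.avgTaskTime + taskTime) // simple running average
    (stat.updated(workerId, WorkerStat(staleness, avgTaskTime, available = true)), staleness)
  }
}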

4.3. The ASYNCbroadcaster

The ASYNCbroadcaster is implemented in addition to the broadcast module in Spark to support optimization methods that benefit from operating on historical gradients. To obtain historical gradients for a mini-batch sample of data, the workers need access to all the previous model parameters for the data samples. Broadcast is implemented in Spark using a unique identifier (ID) and a value, which is broadcast to the workers along with the tasks. To implement historical gradients in Spark, all previous model parameters, which can be as large as the mini-batch size, have to be broadcast to the workers in each iteration, which leads to significant communication overheads. The ASYNCbroadcaster resolves this by broadcasting to the worker only the ID of the previously broadcast parameters, without broadcasting the value itself. The worker uses the IDs to determine if the model parameter is already stored locally in its memory. This enables the worker to recompute the gradient without additional communication. If the parameter is not local to the worker, the server sends the model parameter for the specified iteration to the worker. The ASYNCbroadcaster enables the implementation of historical gradients in both synchronous and asynchronous optimization methods.
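A sketch of the worker-side lookup that this ID-based broadcast enables is shown below; the cache layout and the fetch call are assumptions for illustration only:

object BroadcastCacheSketch {
  import scala.collection.mutable

  // Model parameters cached locally on the worker, keyed by broadcast ID.
  private val cache = mutable.Map.empty[Long, Array[Double]]

  // Receiving only an ID triggers a network fetch from the server on a cache miss;
  // otherwise the locally stored parameter is reused to recompute a historical gradient.
  def parameterFor(id: Long, fetchFromServer: Long => Array[Double]): Array[Double] =
    cache.getOrElseUpdate(id, fetchFromServer(id))
}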

4.4. The ASYNCscheduler

To implement barrier control, the framework should expose to the user information such as worker availability and worker staleness, so that the user can decide the strategies with which workers are assigned tasks. Because Spark does not provide such information, barrier control strategies cannot be implemented with the Spark framework. With the help of the ASYNCscheduler, ASYNC provides the algorithm designer with the flexibility to define new barrier control strategies. The ASYNCscheduler communicates with the ASYNCcoordinator to obtain information such as worker availability and worker status. This information is used to enable the implementation of barrier controls. The ASYNCscheduler determines the strategies with which available workers should proceed, based on staleness or average-task-completion time. ASYNC allows the user to define customized filters that selectively choose from the available workers, which allows users to implement a variety of barrier controls.
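The sketch below shows two user-defined barrier filters of the kind described above, one based on staleness and one based on average-task-completion time; the WorkerStat fields and the exact predicate form expected by ASYNCbarrier are assumptions here:

object BarrierFilterSketch {
  case class WorkerStat(staleness: Int, avgTaskTime: Double, available: Boolean)

  // SSP-style filter: proceed only while every worker's staleness is below a threshold s.
  def boundedStaleness(s: Int)(stat: Seq[WorkerStat]): Boolean =
    stat.forall(_.staleness < s)

  // Performance-based filter (cf. Zhang et al., 2018): proceed only if no available
  // worker is slower than a chosen average-task-completion-time threshold.
  def fastWorkersOnly(maxAvgTaskTime: Double)(stat: Seq[WorkerStat]): Boolean =
    stat.filter(_.available).forall(_.avgTaskTime <= maxAvgTaskTime)
}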

5. Programming with ASYNC

To use ASYNC, developers are provided with an additional set of ASYNC-specific functions, on top of what Spark provides, to access the bookkeeping structures and to introduce asynchrony into the execution. The programming model in ASYNC is close to that of Spark. It operates on resilient distributed datasets (RDDs) to preserve the fault-tolerant and in-memory execution of Spark. The ASYNC-specific functions either transform the RDDs, known as transformations in Spark, or perform actions. In this section, ASYNC's programming model and API are first discussed. We then demonstrate how ASYNC can be used to implement two well-known asynchronous optimization methods, ASGD and ASAGA. We also discuss the implementation of other state-of-the-art asynchronous methods and show that ASYNC can be used to implement a variety of optimization algorithms.

5.1. The ASYNC programming model

Asynchronous Context (AC) is the entry point to ASYNC and should be created only once in the beginning of the application. The ASYNCscheduler, the ASYNCbroadcaster, and the ASYNCcoordinator communicate through AC and with this communication create barrier controls, broadcast variables, and store workers’ task results and status. AC maintains the bookkeeping structures and ASYNC-specific functions, including several actions and transformations that operate on RDDs. Workers use ASYNC functions to interact with AC and to store their results and attributes in the bookkeeping structures. The server queries AC to update the model parameters or to access workers’ status. Table 1 lists the main functions available in ASYNC. We show the signature of each operation by demonstrating the type parameters in square brackets.

Collective operations in ASYNC. ASYNCreduce is an action that aggregates the elements of the RDD on the worker using a function and returns the result to the server. ASYNCreduce differs from Spark's reduce in two ways. First, Spark aggregates data across each partition and then combines the partial results together to produce a final value, while ASYNCreduce executes only on the worker and for each partition. Second, reduce returns only when all partial results are combined on the server, but ASYNCreduce returns immediately. Task results on the server are accessed using the ASYNCcollect and ASYNCcollectAll methods. ASYNCcollect returns task results in FIFO (first-in-first-out) order, while ASYNCcollectAll also returns the worker status attributes. The workers' status can also be accessed with AC.STAT.

Barrier and broadcast in ASYNC. ASYNCbarrier is a transformation, i.e. a deterministic operation that creates a new RDD based on the workers' status. ASYNCbarrier takes the recent status of the workers (AC.STAT) and decides which workers to assign new tasks to, based on a user-defined function. For example, for a fully asynchronous barrier model, a function that returns true regardless of the workers' status is declared (see Listing 2). In Spark, broadcast parameters are "broadcast variable" objects that wrap around the to-be-broadcast value. ASYNCbroadcast also uses broadcast variables and, similar to Spark, the method value can be used to access the broadcast value. However, ASYNCbroadcast differs from the broadcast implementation in Spark in that it has access to an index. The index is used internally by ASYNCbroadcast to get the ID of the previously broadcast variables for the specified index. ASYNCbroadcast eliminates the need to broadcast values when accessing the history of broadcast values.

5.2. Case studies

ASYNC is an easy-to-use and powerful framework developed for the algorithmic machine learning community to implement a large variety of asynchronous optimization methods on distributed platforms as well as develop novel asynchronous algorithms. The robust programming model in ASYNC provides control of low-level features in both the algorithm and the execution platform to facilitate the experimentation and investigation of asynchrony in optimization methods. The following demonstrates the implementation of two well-known asynchronous optimization methods ASGD and ASAGA in ASYNC as examples.

ASGD with ASYNC. An implementation of mini-batch stochastic gradient descent (SGD) using the map-reduce model in Spark is shown in Algorithm 1. The map phase applies the gradient function to the input data independently on the workers. The reduce phase has to wait for all the map tasks to complete. Afterwards, the server aggregates the task results and updates the model parameter w. The asynchronous implementation of SGD in ASYNC is shown in Algorithm 2. With only a few extra lines from the ASYNC API, the synchronous implementation of SGD in Spark is transformed to ASGD. An ASYNCcontext is created in line 1 and is used in line 4 to create a barrier based on the current workers' status, AC.STAT. We implement a bounded staleness barrier strategy by defining a function f that allows tasks to be submitted to the available workers only when the number of available workers is at least a chosen fraction of P, where P is the number of workers. The partial results from each partition are then obtained and stored in AC in line 4. Finally, these partial results are accessed in line 6 and are used to update the model parameter in line 7.

ASAGA with ASYNC. The SAGA implementation in Spark is shown in Algorithm 3. This implementation is inefficient and not practical for large datasets: Spark requires broadcasting a table of stored model parameters to each worker (Algorithm 3, line 5). The table stores the model parameters and its size increases after each iteration. Broadcasting this table leads to large communication overheads. As a result of this overhead, machine learning libraries that are built on top of Spark, such as Mllib (Meng et al., 2016), do not provide implementations of optimization methods such as SAGA that require the history of gradients. ASYNC resolves the overhead with ASYNCbroadcast. The implementation of ASAGA is shown in Algorithm 4. ASYNCbroadcast is used to define a dynamic broadcast in line 4. Then, the broadcast variable is used to compute the historical gradients in line 5. To access the last model parameter for a sample index, the method value is called in line 5. As shown in Algorithm 4, there is no need to broadcast a table of parameters, which allows for efficient implementation of both SAGA and ASAGA in ASYNC.

Input : points, numIterations, learning rate η, sampling rate b
Output : model parameter w
1 for i <- 1 to numIterations do
2     w_br = sc.broadcast(w)
3     gradient = points.sample(b).map(p => ∇f(p, w_br.value)).reduce(_+_)
4     w -= η * gradient
5 end for
return w
Algorithm 1 The SGD Algorithm
Input : points, numIterations, learning rate η, sampling rate b
Output : model parameter w
1 AC = new ASYNCcontext
2 for i <- 1 to numIterations do
3     w_br = sc.broadcast(w)
4     points.ASYNCbarrier(f, AC.STAT).sample(b).map(p => ∇f(p, w_br.value)).ASYNCreduce(_+_, AC)
5     while AC.hasNext() do
6         gradient = AC.ASYNCcollect()
7         w -= η * gradient
8     end while
9 end for
return w
Algorithm 2 The ASGD Algorithm
Input : points, numIterations, learning rate η, sampling rate b, number of points n
Output : model parameter w
1 averageHistory = 0
2 store w in table for every sample index
3 for i <- 1 to numIterations do
4     w_br = sc.broadcast(w)
5     (gradient, history) = points.sample(b).map((index, p) => (∇f(p, w_br.value), ∇f(p, table(index)))).reduce(_+_)
6     averageHistory += (gradient - history) * bn
7     w -= η * (gradient - history + averageHistory)
8     update table with w for the sampled indices
9 end for
return w
Algorithm 3 The SAGA Algorithm
Input : points, numIterations, learning rate η, sampling rate b, #points n, #partitions P
Output : model parameter w
1 AC = new ASYNCcontext
2 averageHistory = 0
3 for i <- 1 to numIterations do
4     w_br = AC.ASYNCbroadcast(w)
5     points.ASYNCbarrier(f, AC.STAT).sample(b).map((index, p) => (∇f(p, w_br.value), ∇f(p, w_br.value(index)))).ASYNCreduce(_+_, AC)
6     while AC.hasNext() do
7         (gradient, history) = AC.ASYNCcollect()
8         averageHistory += (gradient - history) * bn/P
9         w -= η * (gradient - history + averageHistory)
10    end while
11 end for
return w
Algorithm 4 The ASAGA Algorithm
Actions
  ASYNCreduce(f: (T, T) => T, AC)
      Reduces the elements of the RDD using the specified associative binary operator.
  ASYNCaggregate(zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U, AC)
      Aggregates the elements of the partition using the combine functions and a neutral "zero" value.
Transformations
  ASYNCbarrier(f: T => Bool, Seq[T])
      Returns an RDD containing elements that satisfy the predicate f.
Methods
  ASYNCcollect()
      Returns a task result.
  ASYNCcollectAll()
      Returns a task result and its attributes, including index, staleness, and mini-batch size.
  ASYNCbroadcast(T)
      Creates a dynamic broadcast variable.
  AC.STAT
      Returns the current status of all workers.
  AC.hasNext()
      Returns true if a task result exists.
Table 1. Transformations, actions, and methods in ASYNC. AC is the ASYNCcontext and Seq[T] is a sequence of elements of type T.

5.3. Other asynchronous machine learning methods in ASYNC

In this section we briefly discuss how ASYNC can be used to implement a large class of asynchronous optimization methods. The support for hyperparameter selection via the ASYNCcollectAll method enables the user to implement a variety of staleness-aware stochastic gradient descent methods. The task results and attributes returned by ASYNCcollectAll can be used together with other components in ASYNC to implement staleness-based algorithms (Odena, 2016; Zhang et al., 2015b; McMahan and Streeter, 2014). These algorithms use worker-specific information to adapt asynchronous stochastic gradient descent to the staleness of workers. Listing 1 shows an example of using worker attributes to implement the staleness-dependent learning rate modulation technique in (Zhang et al., 2015b), where each task result is assigned a weight based on its staleness.

ASYNC also enables the implementation of various barrier control strategies. It provides the interface to implement user-defined functions that selectively choose from the available workers based on their status, such as staleness or average-task-completion time. Along with well-known barrier control strategies such as SSP and ASP, recent strategies that use the performance of computational nodes (Zhang et al., 2018) as a metric for barrier control are also easy to implement in ASYNC. Listing 2 shows the pseudo-code for implementing three of the most common barrier control strategies in ASYNC.

while(AC.hasNext()){
    (gradient, attr) = AC.ASYNCcollectAll()
    w -= alpha/attr.staleness * gradient
}
Listing 1: The pseudo-code for implementing staleness-dependent learning rate methods in ASYNC.
f: STAT.foreach(true) % The ASP barrier control
f: STAT.foreach(Available_Workers == P) % The BSP barrier control, P is the total number of workers
f: STAT.foreach(MAX_Staleness < s) % The SSP barrier control with a staleness threshold 's'
points.ASYNCbarrier(f, AC.STAT) % Apply the barrier
Listing 2: The pseudo-code for implementing different barrier control methods in ASYNC.

ASYNC benefits from Spark's built-in synchronous transformations and actions because of its robust integration on top of Spark. As a result, ASYNC can be used to implement variance-reduced asynchronous methods that require periodic synchronization (Reddi et al., 2015; Huo and Huang, 2017; Zhang et al., 2015c). These methods are epoch based since the algorithm periodically does a full pass over the input data and collects task results from all workers before proceeding to the next epoch. Asynchronous updates to the model parameters occur in between the epochs. A pseudo-code of an epoch-based variance-reduced algorithm is shown in Listing 3.

for (epoch = 1 to T){
    // synchronous reduction: a full pass over the data at the start of each epoch
    fullGradient = points.map('gradient function').reduce(_+_)
    for (i = 1 to numIterations){
        // asynchronous reduction within the epoch
        points.sample().map('gradient function').ASYNCreduce(_+_, AC)
        while(AC.hasNext()){
            gradient = AC.ASYNCcollect()
            ...
        }
    }
}
Listing 3: The pseudo-code for implementing the class of epoch-based variance reduced algorithms.

6. Results

We evaluate the performance of ASYNC by implementing two asynchronous optimization methods, namely ASGD and ASAGA, to solve least squares problems. The performance of ASGD and ASAGA is compared to that of their synchronous implementations in Spark. To the best of our knowledge, no library or implementation of asynchronous optimization methods exists on Spark. However, to demonstrate that the synchronous implementations of the algorithms using ASYNC are well-optimized, we first compare the performance of the synchronous variants of the tested optimization methods in ASYNC with the state-of-the-art machine learning library Mllib (Meng et al., 2016), which provides implementations of a number of synchronous optimization methods. In subsection 6.3 we evaluate the performance of ASGD and ASAGA in ASYNC in the presence of stragglers. Our experiments show that the asynchronous algorithms in ASYNC outperform the synchronous implementations and can lead to up to 2× speedup on 8 workers with a single controlled-delay straggler and up to 4× speedup on 32 workers with straggler patterns from real production clusters.

6.1. Experimental setup

We consider the distributed least squares problem defined in (4). Our experiments use the datasets listed in Table 2 from the LIBSVM library (Chang and Lin, 2011), all of which vary in size and sparsity. The first dataset, rcv1_full.binary, is about documents in the Reuters Corpus Volume I (RCV1) archive, which are newswire stories (Lewis et al., 2004). The second dataset, mnist8m, contains handwritten digits commonly used for training various image processing systems (Loosli et al., 2007), and the third dataset, epsilon, is from the Pascal Challenge 2008 and predicts the presence/absence of an object in an image. For the experiments, we use ASYNC, Scala 2.11, Mllib (Meng et al., 2016), and Spark 2.3.2. Breeze 0.13.2 and netlib 1.1.2 are used for the (sparse/dense) BLAS operations in ASYNC. XSEDE Comet CPUs (Towns et al., 2014) are used to assemble the cluster. We experiment with two different straggler behaviours: in the “Controlled Delay Straggler (CDS)” experiments a single worker is delayed with different intensities, while in the “Production Cluster Stragglers (PCS)” experiments straggler patterns from real production clusters are used. The CDS experiments are run with all three datasets on a cluster composed of a server and 8 workers. The PCS experiments require a larger cluster and thus are conducted on a cluster of 32 workers and one server using the two larger datasets (mnist8m and epsilon). In all configurations a worker runs an executor with 2 cores. The number of data partitions is 32 for all datasets and in the implemented algorithms. The experiments are repeated three times and the average is reported.

Dataset            Rows       Columns   Size
rcv1_full.binary   697,641    47,236    851.2MB
mnist8m            8,100,000  784       19GB
epsilon            400,000    2000      12.16GB
Table 2. Datasets for the experimental study.

Parameter tuning: A sampling rate of b = 10% is selected for mini-batch SGD on mnist8m and epsilon, and b = 5% is used for rcv1_full.binary. SAGA and ASAGA use b = 10% for epsilon, b = 2% for rcv1_full.binary, and b = 1% for mnist8m. For the PCS experiments, we use b = 1% for both mnist8m and epsilon.

The SGD implementation in Mllib uses a decaying step size strategy in which the initial step size is divided by $\sqrt{k}$ at iteration $k$. Our synchronous implementation uses Mllib's step size decay rate. We tune the initial step size for SGD so that it converges faster to the optimal solution. For SAGA we use a fixed step size throughout the algorithm, which is also tuned for faster convergence. The step size is not tuned for the asynchronous algorithms. Instead, we use the following heuristic: the step size of ASGD and ASAGA is computed by dividing the initial step size of their synchronous variants by the number of workers (Recht et al., 2011). We run the SGD algorithm in Mllib for 15000 iterations with a sampling rate of 10% and use its final objective value as the baseline for the least squares problem.

6.2. Comparison with Mllib

We use ASYNC for the implementations of both the synchronous and the asynchronous variants of the algorithms because (i) ASYNC's performance for synchronous methods is similar to that of Mllib; (ii) asynchronous methods are not supported in Mllib; and (iii) synchronous methods that require a history of gradients cannot be implemented in Mllib because of communication overheads. To demonstrate that our implementations in ASYNC are optimized, we compare the performance of SGD in ASYNC and Mllib for solving the least squares problem (Johnson and Zhang, 2013). Both implementations use the same initial step size. The error is defined as the objective function value minus the baseline. Figure 2 shows the error for the three datasets. The figure demonstrates that SGD in ASYNC has a performance similar to that of Mllib on 8 workers; the same pattern is observed on 32 workers. Therefore, for the rest of the experiments, we compare the asynchronous and synchronous implementations in ASYNC.

Figure 2. The performance of SGD implemented in ASYNC versus Mllib.

6.3. Robustness to stragglers

Controlled Delay Straggler: We demonstrate the effect of different delay intensities in a single worker on SGD, ASGD, SAGA, and ASAGA by simulating a straggler with a controlled delay (Karakus et al., 2017; Cipar et al., 2013). From the 8 workers in the cluster, a delay of between 0% and 100% of the time of an iteration is added to one of the workers. The delay intensity, which we denote delay-value %, is the percentage by which a worker is delayed; e.g. a 100% delay means the worker is executing jobs at half speed. The controlled delay is implemented with the sleep command. The first 100 iterations of both the synchronous and asynchronous algorithms are used to measure the average iteration execution time.
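One way to emulate such a controlled-delay straggler inside a task is sketched below; the variable names are illustrative and this is not the exact code used in the experiments:

// Delay one designated worker by a fraction of the measured average iteration time,
// e.g. delayIntensity = 1.0 delays it by 100%, making it roughly half as fast.
def maybeDelay(workerId: Int, stragglerId: Int,
               avgIterationMillis: Long, delayIntensity: Double): Unit = {
  if (workerId == stragglerId) {
    Thread.sleep((avgIterationMillis * delayIntensity).toLong)
  }
}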

The performance of SGD and ASGD for different delay intensities is shown in Figure 3; for the same delay intensity, the asynchronous implementation always converges to the optimal solution faster than the synchronous variant of the algorithm. As the delay intensity increases, the straggler has a more negative effect on the runtime of SGD. However, ASGD converges to the optimal point at almost the same rate for all delay intensities. This is because the ASYNCscheduler continues to assign tasks to workers without having to wait for the straggler. When the task result from the straggling worker is ready, it independently updates the model parameter. Thus, while ASGD in ASYNC requires more iterations to converge, its overall runtime is considerably shorter than that of the synchronous method. With a delay intensity of 100%, a speedup of up to 2× is achieved with ASGD compared to SGD.

(a) mnist8m
(b) epsilon
(c) rcv1_full.binary
Figure 3. The performance of ASGD and SGD in ASYNC with 8 workers for different delay intensities of 0%, 30%, 60% and 100% which are shown with ASYNC/SYNC, ASYNC/SYNC-0.3, ASYNC/SYNC-0.6 and ASYNC/SYNC-1.0 respectively.
Figure 4. Average wait time per iteration with 8 workers for ASGD and SGD in ASYNC for different delay intensities.

Figure 4 shows the average wait time for each worker over all iterations for SGD and ASGD. The wait time is defined as the time from when a worker submits its task result to the server until it receives a new task. In the asynchronous algorithm, workers proceed without waiting for stragglers; thus the average wait time does not change with changes in delay intensity. However, in the synchronous implementation, worker wait times increase with a slower straggler. For example, for the mnist8m dataset in Figure 4, the average wait time for SGD increases significantly when the straggler is two times slower (delay = 100%). Comparing Figure 3 with Figure 4 shows that the overall runtime of ASGD and SGD is directly related to their average wait time, where an increase in the wait time negatively affects the algorithm's convergence rate.

The slow worker pattern used for the ASGD experiments is also used for ASAGA. Figure 5 shows the experiment results for SAGA and ASAGA. The communication pattern in ASAGA is different from ASGD because of the broadcast required to compute historical gradients. In ASAGA, the straggler and its delay intensity only affect the computation time of a worker and do not change the communication cost. Therefore, the delay intensity does not have a linear effect on the overall runtime. However, Figure 5 shows that increasing the delay intensity negatively affects the convergence rate of SAGA, while ASAGA maintains the same convergence rate for all delay intensities.

(a) mnist8m
(b) epsilon
(c) rcv1_full.binary
Figure 5. The performance of ASAGA and SAGA in ASYNC for different delay intensities of 0%, 30%, 60% and 100% which are shown with ASYNC/SYNC, ASYNC/SYNC-0.3, ASYNC/SYNC-0.6 and ASYNC/SYNC-1.0 respectively.
Figure 6. Average wait time per iteration with 8 workers for ASAGA and SAGA in ASYNC for different delay intensities.

The workers' average wait time for ASAGA is shown in Figure 6. With an increase in delay intensity, workers in SAGA wait longer for new tasks. The difference between the average wait times of SAGA and ASAGA is more noticeable when the delay increases to 100%. In this case, the computation time is significant enough to affect the performance of the synchronous algorithm; ASAGA, however, has the same wait time for all delay intensities.

Production Cluster Stragglers: Our PCS experiments are conducted on 32 workers with straggler patterns from real production clusters (Moreno et al., 2014; Ananthanarayanan et al., 2010); these clusters are used frequently by machine learning practitioners. We use the straggler behaviors reported in previous research (Garraghan et al., 2016; Ouyang et al., 2016; Garraghan et al., 2015), all of which are based on empirical analysis of production clusters from Microsoft Bing (Ananthanarayanan et al., 2010) and Google (Moreno et al., 2014). This analysis concludes that approximately 25% of machines in cloud clusters are stragglers. Of those, 80% have a uniform probability of being delayed by between 150% and 250% of the average-task-completion time. The remaining 20% of the stragglers have abnormal delays and are known as long tail workers, with a random delay of between 250% and 10× the average-task-completion time. From the 32 workers in our experiment, 6 are assigned a random delay between 150% and 250% and two are long tail workers with a random delay over 250% and up to 10×. The randomized delay seed is fixed across the three executions of the same experiment.
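A sketch of how such a delay pattern can be assigned to workers is shown below; the exact sampling choices are illustrative assumptions that follow the distribution described above:

import scala.util.Random

// Assign delay factors to roughly 25% of the workers: 80% of the stragglers get a
// uniform delay of 150%-250% of the average task time, the rest are long-tail
// workers with a delay of 250% up to 10x. The seed is fixed across repetitions.
def assignDelays(numWorkers: Int, seed: Long): Map[Int, Double] = {
  val rng = new Random(seed)
  val stragglers = rng.shuffle((0 until numWorkers).toList).take(numWorkers / 4)
  stragglers.map { id =>
    val delayFactor =
      if (rng.nextDouble() < 0.8) 1.5 + rng.nextDouble() * 1.0 // 150%-250%
      else 2.5 + rng.nextDouble() * 7.5                        // long tail: up to 10x
    id -> delayFactor
  }.toMap
}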

The performance of SGD and ASGD on 32 workers with PCS is shown in Figure 7. As shown, ASGD converges to the solution considerably faster than SGD and leads to a speedup of 3× for mnist8m and 4× for epsilon. From Figure 8, ASAGA obtains a speedup over SAGA of 3.5× and 4× for mnist8m and epsilon, respectively. The average wait time for both algorithms on 32 workers is shown in Table 3. The wait time increases considerably for all synchronous implementations, which results in slower convergence of the synchronous methods.

Figure 7. The performance of ASGD and SGD in ASYNC on 32 workers shown with ASYNC and SYNC respectively.
Figure 8. The performance of ASAGA and SAGA in ASYNC on 32 workers shown with ASYNC and SYNC respectively.

7. Related work

Asynchronous optimization methods have been demonstrated to be efficient in improving the performance of large-scale machine learning applications on shared-memory and distributed systems. The most widely used asynchronous optimization algorithms are stochastic gradient methods (Dean et al., 2012; Duchi et al., 2015; Recht et al., 2011) and coordinate descent algorithms (Avron et al., 2015; Hsieh et al., 2015; Lian et al., 2015). An asynchronous variant of stochastic gradient descent with a constant step size, known as Hogwild! (Recht et al., 2011), achieves sub-linear convergence rates but only converges to a neighborhood of the optimal solution. Other asynchronous methods such as DownpourSGD (Dean et al., 2012) and PetuumSGD (Xing et al., 2015) guarantee convergence to the optimal solution at the expense of a slower convergence rate.

To mitigate the negative effects of stale gradients on the convergence rate, machine learning practitioners have proposed numerous algorithms and strategies. Recent work proposes to alter the execution model and bound staleness (Xing et al., 2015; Ming et al., 2018; Ho et al., 2013; Agarwal and Duchi, 2011; Zinkevich et al., 2009; Cipar et al., 2013), theoretically adapt the method to the stale gradients (Zhang et al., 2015b; McMahan and Streeter, 2014), and use barrier control strategies (Zhang et al., 2018; Wang et al., 2017) to improve the convergence rate of asynchronous algorithms. In other work (Zhang et al., 2015b; McMahan and Streeter, 2014) staleness is compensated with staleness-dependent learning strategies that scale the algorithm hyperparameters. Variance reduction is frequently used to reduce noise in optimization methods (Mania et al., 2017; Ming et al., 2018; Reddi et al., 2015; Johnson and Zhang, 2013) in which the model is periodically synchronized to achieve a linear convergence rate. To extend variance reduction to asynchronous methods, ASAGA (Leblond et al., 2016; Pedregosa et al., 2017) eliminates the need for synchronization by storing a table of historical gradients. The alternating direction method of multipliers (ADMM) (Boyd et al., 2011), a well-known method for distributed optimization, has also been extended to support asynchrony for convex (Zhang and Kwok, 2014) and non-convex (Hong, 2018; Chang et al., 2016) problems. ASYNC provides the functionality and API to support the implementation of a large class of these asynchronous algorithms and enables practitioners to experiment and develop strategies that mitigate the staleness problem in distributed executions of asynchronous machine learning algorithms.

Asynchronous optimization methods on shared memory. Asynchronous optimization methods that are implemented for execution on shared-memory systems rely on a "shared view" of the model parameters and on local communication. Recent work on first-order asynchronous optimization and beyond (Recht et al., 2011; Mania et al., 2017; De Sa et al., 2017; Qin and Rusu, 2017) and its extension to parallel iterative linear solvers (Liu et al., 2014; Avron et al., 2015) demonstrates that linear speedups are generically achievable in the asynchronous shared-memory setting, even when applied to non-convex problems (Dean et al., 2012; Chilimbi et al., 2014). Frameworks such as (Pan et al., 2016; Nawab et al., 2017) propose lock-free parallelization of stochastic optimization algorithms while maintaining serial equivalence by partitioning updates among cores and ensuring no conflict exists across partitions. While asynchronous optimization has demonstrated great promise on shared-memory systems, shared-memory executions are not practical in the era of big data. Moreover, datasets are often generated and stored on different machines in a distributed system, so shared-memory machine learning models cannot be used.

Dataset    SAGA        ASAGA      SGD        ASGD
mnist8m    42.8367 ms  9.8125 ms  6.4433 ms  3.5745 ms
epsilon    6.9926 ms   1.1721 ms  5.3112 ms  1.4165 ms
Table 3. Average wait time per iteration on 32 workers.

Asynchronous optimization methods on distributed systems. The demand for large-scale machine learning has led to the development of numerous cloud and distributed frameworks. Many of these frameworks support a synchronous execution model and thus cannot be used to implement asynchronous optimization methods. Commodity distributed dataflow systems such as Hadoop (Hadoop, 2011) and Spark (Zaharia et al., 2012) are optimized for coarse-grained, often bulk synchronous, parallel data transformations and thus do not provide the fine-grained communication and control required by asynchronous algorithms (Zaharia et al., 2012; Hadoop, 2011; Mahout, 2008; Saadon and Mokhtar, 2017; Zhang et al., 2015a). Consequently, machine learning libraries that use these dataflow systems, such as Mllib, only support synchronous machine learning algorithms (Meng et al., 2016).

Recent work has modified frameworks such as Hadoop and Spark to support asynchronous optimization methods. The iiHadoop framework (Saadon and Mokhtar, 2017) extends Hadoop (Hadoop, 2011) to perform incremental computations on the fraction of the data that has changed, instead of operating on the entire data. iiHadoop can only execute tasks asynchronously if they are not dependent. However, in a large class of asynchronous optimization methods, tasks are dependent through updates to the shared model parameters. ASIP (Gonzalez et al., 2015) introduces a communication layer to Spark to enable fine-grained asynchronous communication amongst workers. ASIP only supports fully asynchronous algorithms and cannot be used to implement adaptive asynchronous optimization methods such as (Zhang et al., 2015b) and (Dai et al., 2019), in which worker-specific data is required. Glint (Jagerman and Eickhoff, 2016) integrates the parameter server model with Spark using a communication method that allows workers to push their updates to the shared model. However, workers are not allowed to locally reduce their updates and then submit the aggregated update. As a result, Glint does not support mini-batch asynchronous optimization methods, where the gradients have to be reduced locally per worker. In addition, Glint only supports fully asynchronous execution models.

Parameter server architectures such as (Xing et al., 2015; Chilimbi et al., 2014; Mai et al., 2015) are widely used in distributed machine learning since they support asynchronous and bounded asynchronous parallelism. DistBelief (Dean et al., 2012) and Project Adam (Chilimbi et al., 2014) use parameter server models to train deep neural networks. Petuum (Xing et al., 2015) improves on (Ahmed et al., 2012) with a bounded delay parameter server model. Other parameter server frameworks include MLNET (Mai et al., 2015), Litz (Qiao et al., 2018), and (Zhou et al., 2017). MLNET deploys a communication layer that uses tree-based overlays to implement distributed aggregation and multicast. However, the server in MLNET only has access to aggregated results, while many asynchronous methods (Huo and Huang, 2017; Leblond et al., 2016) rely on individual task results to update the model parameter. Overall, parameter server architectures do not support asynchronous optimization methods that rely on worker-specific attributes such as staleness, do not facilitate flexible implementations of barrier control strategies, and cannot be used in variance reduction methods that need historical gradients.

8. Conclusion

This work introduces the ASYNC framework, which facilitates the implementation of asynchronous machine learning methods on cloud and distributed platforms. ASYNC is built with three fundamental components, the ASYNCcoordinator, the ASYNCbroadcaster, and the ASYNCscheduler. Along with bookkeeping structures, the components in ASYNC facilitate the implementation of numerous strategies and algorithms in asynchronous optimization, such as barrier control and dynamic hyperparameter selection. The broadcast functionality in ASYNC enables communication-efficient implementation of variance reduction optimization methods that need historical gradients. ASYNC is built on top of Spark to benefit from Spark's in-memory computation model and fault-tolerant execution. We present the programming model and interface that come with ASYNC and implement the synchronous and asynchronous variants of two well-known optimization methods as examples. The support for the implementation of some other well-known asynchronous optimization methods is also presented. These examples only scratch the surface of the types of algorithms that can be implemented in ASYNC. We hope that ASYNC helps machine learning practitioners with the implementation and investigation of the promise of asynchronous optimization methods.

Acknowledgments

This work was supported in part by the grants NSF DMS-1723085, NSF CCF-1814888 and NSF CCF-1657175. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) (Towns et al., 2014), which is supported by National Science Foundation grant number ACI-1548562.

References

  • Agarwal and Duchi (2011) Alekh Agarwal and John C Duchi. 2011. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems. 873–881.
  • Ahmed et al. (2012) Amr Ahmed, Moahmed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and Alexander J Smola. 2012. Scalable inference in latent variable models. In Proceedings of the fifth ACM international conference on Web search and data mining. ACM, 123–132.
  • Ananthanarayanan et al. (2010) Ganesh Ananthanarayanan, Srikanth Kandula, Albert G Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. 2010. Reining in the Outliers in Map-Reduce Clusters using Mantri. In OSDI, Vol. 10. 24.
  • Avron et al. (2015) Haim Avron, Alex Druinsky, and Anshul Gupta. 2015. Revisiting asynchronous linear solvers: Provable convergence rate through randomization. Journal of the ACM (JACM) 62, 6 (2015), 51.
  • Bottou (2012) Léon Bottou. 2012. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade. Springer, 421–436.
  • Boyd et al. (2011) Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. 2011. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends® in Machine Learning 3, 1 (2011), 1–122. https://doi.org/10.1561/2200000016
  • Chang and Lin (2011) Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST) 2, 3 (2011), 27.
  • Chang et al. (2016) Tsung-Hui Chang, Mingyi Hong, Wei-Cheng Liao, and Xiangfeng Wang. 2016. Asynchronous distributed ADMM for large-scale optimization—Part I: Algorithm and convergence analysis. IEEE Transactions on Signal Processing 64, 12 (2016), 3118–3130.
  • Chen et al. (2016) Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. 2016. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981 (2016).
  • Chilimbi et al. (2014) Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 571–582.
  • Cipar et al. (2013) James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R Ganger, Garth Gibson, Kimberly Keeton, and Eric Xing. 2013. Solving the straggler problem with bounded staleness. In Presented as part of the 14th Workshop on Hot Topics in Operating Systems.
  • Cutkosky and Busa-Fekete (2018) Ashok Cutkosky and Róbert Busa-Fekete. 2018. Distributed Stochastic Optimization via Adaptive SGD. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 1910–1919. http://papers.nips.cc/paper/7461-distributed-stochastic-optimization-via-adaptive-sgd.pdf
  • Dai et al. (2019) Wei Dai, Yi Zhou, Nanqing Dong, Hao Zhang, and Eric Xing. 2019. Toward Understanding the Impact of Staleness in Distributed Machine Learning. In International Conference on Learning Representations. https://openreview.net/forum?id=BylQV305YQ
  • De et al. (2016) Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. 2016. Big batch SGD: Automated inference using adaptive batch sizes. arXiv preprint arXiv:1610.05792 (2016).
  • De Sa et al. (2017) Christopher De Sa, Matthew Feldman, Christopher Ré, and Kunle Olukotun. 2017. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In ACM SIGARCH Computer Architecture News, Vol. 45. ACM, 561–574.
  • Dean et al. (2012) Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems. 1223–1231.
  • Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. 2014. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems. 1646–1654.
  • Duchi et al. (2015) John C Duchi, Sorathan Chaturapruek, and Christopher Ré. 2015. Asynchronous stochastic convex optimization. arXiv preprint arXiv:1508.00882 (2015).
  • Dünner et al. (2018) Celestine Dünner, Aurelien Lucchi, Matilde Gargiani, An Bian, Thomas Hofmann, and Martin Jaggi. 2018. A Distributed Second-Order Algorithm You Can Trust. arXiv preprint arXiv:1806.07569 (2018).
  • Garraghan et al. (2015) Peter Garraghan, Xue Ouyang, Paul Townend, and Jie Xu. 2015. Timely long tail identification through agent based monitoring and analytics. In 2015 IEEE 18th International Symposium on Real-Time Distributed Computing. IEEE, 19–26.
  • Garraghan et al. (2016) Peter Garraghan, Xue Ouyang, Renyu Yang, David McKee, and Jie Xu. 2016. Straggler root-cause and impact analysis for massive-scale virtualized cloud datacenters. IEEE Transactions on Services Computing (2016).
  • Gemulla et al. (2011) Rainer Gemulla, Erik Nijkamp, Peter J Haas, and Yannis Sismanis. 2011. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 69–77.
  • Gonzalez et al. (2015) Joseph E Gonzalez, Peter Bailis, Michael I Jordan, Michael J Franklin, Joseph M Hellerstein, Ali Ghodsi, and Ion Stoica. 2015. Asynchronous complex analytics in a distributed dataflow architecture. arXiv preprint arXiv:1510.07092 (2015).
  • Hadoop (2011) Apache Hadoop. 2011. Apache Hadoop. URL http://hadoop.apache.org (2011).
  • Ho et al. (2013) Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B Gibbons, Garth A Gibson, Greg Ganger, and Eric P Xing. 2013. More effective distributed ML via a stale synchronous parallel parameter server. In Advances in neural information processing systems. 1223–1231.
  • Hong (2018) Mingyi Hong. 2018. A Distributed, Asynchronous, and Incremental Algorithm for Nonconvex Optimization: An ADMM Approach. IEEE Transactions on Control of Network Systems 5, 3 (2018), 935–945.
  • Hsieh et al. (2015) Cho-Jui Hsieh, Hsiang-Fu Yu, and Inderjit S Dhillon. 2015. PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent. In ICML, Vol. 15. 2370–2379.
  • Hu et al. (2018) Wenhui Hu, Peng Wang, Qigang Wang, Zhengdong Zhou, Hui Xiang, Mei Li, and Zhongchao Shi. 2018. Dynamic Delay Based Cyclic Gradient Update Method for Distributed Training. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 550–559.
  • Huo and Huang (2017) Zhouyuan Huo and Heng Huang. 2017. Asynchronous mini-batch gradient descent with variance reduction for non-convex optimization. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Jagerman and Eickhoff (2016) Rolf Jagerman and Carsten Eickhoff. 2016. Web-scale Topic Models in Spark: An Asynchronous Parameter Server. arXiv preprint arXiv:1605.07422 (2016).
  • Johnson and Zhang (2013) Rie Johnson and Tong Zhang. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems. 315–323.
  • Karakus et al. (2017) Can Karakus, Yifan Sun, Suhas Diggavi, and Wotao Yin. 2017. Straggler mitigation in distributed optimization through data encoding. In Advances in Neural Information Processing Systems. 5434–5442.
  • Leblond et al. (2016) Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. 2016. ASAGA: asynchronous parallel SAGA. arXiv preprint arXiv:1606.04809 (2016).
  • Lee et al. (2017) Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. 2017. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory 64, 3 (2017), 1514–1529.
  • Lewis et al. (2004) David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. 2004. Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research 5, Apr (2004), 361–397.
  • Li et al. (2014) Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 583–598.
  • Lian et al. (2015) Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. 2015. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems. 2737–2745.
  • Liu et al. (2014) Ji Liu, Stephen J Wright, and Srikrishna Sridhar. 2014. An asynchronous parallel randomized Kaczmarz algorithm. arXiv preprint arXiv:1401.4780 (2014).
  • Loosli et al. (2007) Gaëlle Loosli, Stéphane Canu, and Léon Bottou. 2007. Training invariant support vector machines using selective sampling. Large-scale kernel machines 2 (2007).
  • Mahout (2008) Apache Mahout. 2008. Scalable machine-learning and data-mining library. Available at mahout.apache.org (2008).
  • Mai et al. (2015) Luo Mai, Chuntao Hong, and Paolo Costa. 2015. Optimizing network performance in distributed machine learning. In 7th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 15).
  • Mania et al. (2017) Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I Jordan. 2017. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization 27, 4 (2017), 2202–2229.
  • McMahan and Streeter (2014) Brendan McMahan and Matthew Streeter. 2014. Delay-tolerant algorithms for asynchronous distributed online learning. In Advances in Neural Information Processing Systems. 2915–2923.
  • Meng et al. (2016) Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. 2016. Mllib: Machine learning in apache spark. The Journal of Machine Learning Research 17, 1 (2016), 1235–1241.
  • Ming et al. (2018) Yuewei Ming, Yawei Zhao, Chengkun Wu, Kuan Li, and Jianping Yin. 2018. Distributed and asynchronous Stochastic Gradient Descent with variance reduction. Neurocomputing 281 (2018), 27–36.
  • Moreno et al. (2014) Ismael Solis Moreno, Peter Garraghan, Paul Townend, and Jie Xu. 2014. Analysis, modeling and simulation of workload patterns in a large-scale utility cloud. IEEE Transactions on Cloud Computing 2, 2 (2014), 208–221.
  • Moulines and Bach (2011) Eric Moulines and Francis R Bach. 2011. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems. 451–459.
  • Nawab et al. (2017) Faisal Nawab, Divy Agrawal, Amr El Abbadi, and Sanjay Chawla. 2017. COP: Planning Conflicts for Faster Parallel Transactional Machine Learning.. In EDBT. 132–143.
  • Odena (2016) Augustus Odena. 2016. Faster asynchronous SGD. arXiv preprint arXiv:1601.04033 (2016).
  • Ouyang et al. (2016) Xue Ouyang, Peter Garraghan, David McKee, Paul Townend, and Jie Xu. 2016. Straggler detection in parallel computing systems through dynamic threshold calculation. In 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA). IEEE, 414–421.
  • Pan et al. (2016) Xinghao Pan, Maximilian Lam, Stephen Tu, Dimitris Papailiopoulos, Ce Zhang, Michael I Jordan, Kannan Ramchandran, and Christopher Ré. 2016. Cyclades: Conflict-free asynchronous machine learning. In Advances in Neural Information Processing Systems. 2568–2576.
  • Pedregosa et al. (2017) Fabian Pedregosa, Rémi Leblond, and Simon Lacoste-Julien. 2017. Breaking the nonsmooth barrier: A scalable parallel method for composite optimization. In Advances in Neural Information Processing Systems. 56–65.
  • Qiao et al. (2018) Aurick Qiao, Abutalib Aghayev, Weiren Yu, Haoyang Chen, Qirong Ho, Garth A Gibson, and Eric P Xing. 2018. Litz: Elastic framework for high-performance distributed machine learning. In 2018 USENIX Annual Technical Conference (USENIXATC 18). 631–644.
  • Qin and Rusu (2017) Chengjie Qin and Florin Rusu. 2017. Dot-product join: Scalable in-database linear algebra for big model analytics. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management. ACM, 8.
  • Recht et al. (2011) Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems. 693–701.
  • Reddi et al. (2015) Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alexander J Smola. 2015. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems. 2647–2655.
  • Reddi et al. (2016) Sashank J Reddi, Jakub Konečnỳ, Peter Richtárik, Barnabás Póczós, and Alex Smola. 2016. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879 (2016).
  • Robbins and Monro (1951) Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The Annals of Mathematical Statistics (1951), 400–407.
  • Saadon and Mokhtar (2017) Afaf G Bin Saadon and Hoda MO Mokhtar. 2017. iiHadoop: an asynchronous distributed framework for incremental iterative computations. Journal of Big Data 4, 1 (2017), 24.
  • Shalev-Shwartz et al. (2009) Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. 2009. Stochastic Convex Optimization. In 22nd Annual Conference on Learning Theory (COLT).
  • Shamir et al. (2014) Ohad Shamir, Nati Srebro, and Tong Zhang. 2014. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning. 1000–1008.
  • Tan et al. (2016) Conghui Tan, Shiqian Ma, Yu-Hong Dai, and Yuqiu Qian. 2016. Barzilai-Borwein step size for stochastic gradient descent. In Advances in Neural Information Processing Systems. 685–693.
  • Towns et al. (2014) John Towns, Timothy Cockerill, Maytal Dahan, Ian Foster, Kelly Gaither, Andrew Grimshaw, Victor Hazlewood, Scott Lathrop, Dave Lifka, Gregory D Peterson, et al. 2014. XSEDE: accelerating scientific discovery. Computing in Science & Engineering 16, 5 (2014), 62–74.
  • Vapnik (2013) Vladimir Vapnik. 2013. The nature of statistical learning theory. Springer Science & Business Media.
  • Wang et al. (2017) Liang Wang, Ben Catterall, and Richard Mortier. 2017. Probabilistic Synchronous Parallel. arXiv preprint arXiv:1709.07772 (2017).
  • Wang et al. (2017) S. Wang, F. Roosta-Khorasani, P. Xu, and M. W. Mahoney. 2017. GIANT: Globally Improved Approximate Newton Method for Distributed Optimization. arXiv preprint arXiv:1709.03528 (2017).
  • Xing et al. (2015) Eric P Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. 2015. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data 1, 2 (2015), 49–67.
  • Zaharia et al. (2012) Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2–2.
  • Zhang et al. (2018) Jilin Zhang, Hangdi Tu, Yongjian Ren, Jian Wan, Li Zhou, Mingwei Li, and Jue Wang. 2018. An Adaptive Synchronous Parallel Strategy for Distributed Machine Learning. IEEE Access 6 (2018), 19222–19230.
  • Zhang and Kwok (2014) Ruiliang Zhang and James Kwok. 2014. Asynchronous distributed ADMM for consensus optimization. In International Conference on Machine Learning. 1701–1709.
  • Zhang et al. (2015c) Ruiliang Zhang, Shuai Zheng, and James T Kwok. 2015c. Fast distributed asynchronous SGD with variance reduction. arXiv preprint arXiv:1508.01633 (2015).
  • Zhang et al. (2015b) Wei Zhang, Suyog Gupta, Xiangru Lian, and Ji Liu. 2015b. Staleness-aware async-sgd for distributed deep learning. arXiv preprint arXiv:1511.05950 (2015).
  • Zhang et al. (2015a) Yanfeng Zhang, Shimin Chen, Qiang Wang, and Ge Yu. 2015a. i2mapreduce: Incremental mapreduce for mining evolving big data. IEEE Transactions on Knowledge and Data Engineering 27, 7 (2015), 1906–1919.
  • Zhang and Lin (2015) Yuchen Zhang and Xiao Lin. 2015. Disco: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning. 362–370.
  • Zhou et al. (2017) Jun Zhou, Xiaolong Li, Peilin Zhao, Chaochao Chen, Longfei Li, Xinxing Yang, Qing Cui, Jin Yu, Xu Chen, Yi Ding, et al. 2017. Kunpeng: Parameter server based distributed learning systems and its applications in Alibaba and Ant Financial. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1693–1702.
  • Zinkevich et al. (2009) Martin Zinkevich, John Langford, and Alex J Smola. 2009. Slow learners are fast. In Advances in Neural Information Processing Systems. 2331–2339.