Machine learning (ML) is revolutionizing not only the computing industry, but also fields such as healthcare and education, where ML techniques are driving key applications. Thus, there is a race to build new ML systems [6, 1, 5, 33] that efficiently learn complex models from big datasets.
To support large model sizes and training data, most systems execute distributed versions of ML (DML) algorithms across 10s-100s of workers in a cluster. These DML algorithms, e.g., distributed stochastic gradient descent (SGD) and distributed Latent Dirichlet Allocation (LDA) [8, 25], are iterative in nature, and are both computation and communication intensive. In each iteration, a worker computes an update to the large model, which then needs to be disseminated to all other workers. Model updates can run to many megabytes per worker per iteration, yielding large network transfers (§2).
Many DML systems [6, 5, 15, 1] focus on addressing the performance of computing updates at individual workers, e.g., via optimal use of hardware accelerators [19, 27]. In contrast, systematically addressing communication efficiency and network bottlenecks has received limited attention. In most systems, the application (DML algorithm) manages both computation and communication. A simple view of the network, as offering fixed bandwidth between all cluster workers, is adopted. Application-level techniques are then used to reduce the total data transferred to/from a worker to avoid network bottlenecks.
DML systems thus treat the network as a blackbox, and as such, are unable to overcome network issues. Consider Parameter Server (PS) based ML systems [33, 10]. The model is stored at a separate location (server); in every iteration workers pull the latest model and compute an update, which is then shipped to the server and applied to the model. PS systems support flexible consistency schemes, e.g., strict synchronous, stale synchronous, or asynchronous model updates (§2), which help improve DML algorithms' compute efficiency and convergence. However, they deal with network efficiency using ad hoc application-level approaches, such as dropping updates deemed not significant, or coarsely quantizing updates. Unfortunately, these approaches affect algorithm convergence, and yet may not be effective enough in dealing with a serious onset of congestion.
Likewise, MPI-based systems [1, 4] — which support only synchronous SGD — employ MPI_AllReduce operations to minimize data communication. Updates are aggregated along a static topology (e.g., a ring or a tree) among the workers. Unfortunately, network-unawareness of the aggregation topology means that a worker stuck behind a network bottleneck will block the aggregation of updates from other workers.
Treating the network as a blackbox imposes other drawbacks. In particular, DML systems leave on the table potential algorithmic improvements and new framework-level support for further improving large-scale DML that could be achieved by actively leveraging the network. We make these arguments concrete in the context of MLfabric, a new DML communication library we have built; MLfabric applies equally to PS and MPI systems. MLfabric decouples computation from communication: applications hand over the entire responsibility of transferring the model and its updates across the network to MLfabric, which then holistically determines the communication pattern of a DML algorithm at any given time in a network-aware fashion. This offers three benefits.
1. Flexible aggregation to overcome network bottlenecks. Using holistic control, MLfabric can determine in-network aggregation strategies. Workers can be dynamically organized into tree-like topologies over which updates are routed and aggregated before being committed at a server. This helps improve network efficiency in the presence of dynamically changing compute or network contention, which is common in shared environments [19, 34]. It is orthogonal to the algorithm-level approaches above (e.g., update quantization).
2. Leveraging the network for algorithmic advances. In asynchronous SGD, updates from slow workers, e.g., compute stragglers or those stuck behind a network bottleneck, have a high delay, i.e., their update is computed from an old model version. Applying stale updates to the model can affect convergence. To address this, asynchronous algorithms set small learning rates based on the worst-case delay observed, which slows down training. By leveraging control over communication, MLfabric can orchestrate how and when updates are transferred over the network, thereby explicitly controlling the staleness of each update and bounding the worst-case delay. This allows the application to set a high learning rate even under a changing execution environment, which improves convergence. Further, updates with delay so high that they would negatively affect convergence can be dropped at the worker itself, without wasting network resources.
We find that using MLfabric's in-network aggregation and explicit control over update delay has other surprising algorithmic consequences. Empirically, for some popular large deep neural net models (e.g., ResNet-50), we find that these techniques help asynchronous SGD-based training atop PS frameworks to converge faster than synchronous SGD-based training atop MPI in some straggler settings. The latter training approach has been the de facto standard for deep learning because of the significant network bottlenecks faced by the naive use of asynchronous algorithms and/or PS; many foundational systems and algorithmic approaches have optimized the speed of deep learning in MPI settings. In contrast, the use of MLfabric makes asynchronous/PS now a contender, and opens the door to related new systems and algorithmic approaches to improving deep learning.
3. Leveraging the network for framework improvements. Existing PS systems use a hot-standby for server fault tolerance. Chain replication is employed to ensure every model update to the parameter server is also applied to the replica, enforcing strong consistency. However, chain replication introduces additional per-iteration latency, and exacerbates network contention at the server. MLfabric's control over communication supports flexible bounded consistency, which helps significantly control replication overhead. Under bounded consistency, we require that ||w_s − w_r|| ≤ Δ always holds, where w_s and w_r denote the models stored at the server and replica, and Δ is a user-configured divergence limit; a higher Δ leads to lower replication overhead, but higher recovery cost. Bounded consistency is sufficient for ML algorithms due to their stochastic nature; upon server failure, the lost work due to non-replicated updates can be recovered by generating fresh worker updates using the latest model at the replica. In MLfabric, workers replicate updates, and MLfabric carefully schedules original and replica transfers so as to ensure that the divergence stays within Δ.
We implement MLfabric as a thin communication layer between DML applications [11, 6] and MPI communication libraries [18, 2]. It internally uses MPI APIs to aggregate/schedule transfers across the network and/or to a server.
In designing MLfabric, we make four technical contributions. First, we prove that explicitly bounding worst-case update delay yields a constant-factor speedup in the convergence of delay-adaptive asynchronous SGD (§3.1). Our evaluation shows empirically that bounding the delay speeds up convergence even for non-convex optimization problems used in training deep neural networks, and for other asynchronous DML algorithms like distributed LDA using Gibbs sampling (§7). Second, we design a scheduling algorithm that, given a set of worker updates, computes the update transfer schedules in a network-aware fashion (§5). This algorithm transfers updates at non-overlapping times, reserving bandwidth per transfer, and it carefully orders worker updates. We show that the former enables a fast rate of model updates, and that ordering helps bound delays. Third, we develop an in-network aggregation algorithm that determines whether to send each update directly to a server, or to an aggregator first. It performs the best in-network aggregation possible while efficiently utilizing aggregators' bandwidth and preserving update ordering. Finally, we develop a replication algorithm that opportunistically schedules transfers to a replica server, leveraging spare capacity. It prioritizes primary server transfer schedules while always bounding divergence. MLfabric's scheduler runs these three algorithms in sequence for every batch of updates, leading to delay-bounded, divergence-bounded, network-efficient fast model updates.
We evaluate MLfabric using a 30-worker cluster with P100 GPUs and quad-core CPUs under 9 different time-varying network and compute straggler settings (§7). We study large deep neural net (ResNet-50 and ResNet-152) training, and distributed LDA for topic modeling using Gibbs sampling. We show that MLfabric improves overall training times compared to the state-of-the-art under various straggler settings, and offers up to 30X lower replication overhead for PS systems in some scenarios.
2 DML Performance Analysis
The de facto algorithm of choice for various ML applications like deep learning, generalized linear models, etc., is Stochastic Gradient Descent (SGD). SGD is inherently serial; in each iteration the model is updated using a gradient from a single sample or a mini-batch of training data. In order to distribute SGD, ML practitioners have successfully used its different variants, each having different model consistency requirements: (#1) asynchronous SGD [29, 7], (#2) stale synchronous SGD, and (#3) synchronous SGD.
Our primary focus in this paper is on #1 as realized in parameter server (PS) DML systems. The entire suite of MLfabric’s algorithms for network control (scheduling; in-network aggregation; bounded-divergence replication) apply to #1. Subsets of MLfabric also apply to #2 (both PS and MPI) and #3; we discuss these in §6, and evaluate in §7. Furthermore, in §7, we show MLfabric’s benefits for other (non-SGD) synchronous/asynchronous algorithms like distributed LDA.
We provide a brief overview of #1 below, followed by DML algorithms' computation and communication characteristics.
Asynchronous SGD: Here, a worker computes a gradient update using a mini-batch of local data, pushes it to the server and pulls the latest model. The update from each worker is applied independently to the model at the server at each iteration. In iteration t, the update computed by a worker and the model update at the server, respectively, are:

u_t = μ u_{t−1} − η_t (∇ℓ(w_{t−τ_t}; B_i) + λ w_{t−τ_t})    (1)

w_{t+1} = w_t + u_t    (2)

where w_t is the model after iteration t, ℓ is the loss function, B_i is the mini-batch at worker i, λ is the regularization term, η_t is the learning rate and μ is the momentum term.¹ The update, u_t, is calculated using an old version of the model, w_{t−τ_t}, instead of w_t. Here, τ_t is called the delay of update u_t; it is the difference between the version of the model being updated and the one used to compute the update.

¹This update strategy corresponds to SGD with momentum, which has been shown to be beneficial for asynchronous SGD.
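To make the delay bookkeeping concrete, here is a minimal sketch (our illustration, not MLfabric's implementation) of how a parameter server can track the delay τ: each worker remembers the model version it pulled, and the server records the gap when the update is applied.

```python
# Sketch (not MLfabric's code): delay bookkeeping for asynchronous SGD.
# A worker records the model version it pulled; when its update is applied,
# the delay is the number of versions committed in the meantime.

class ParameterServer:
    def __init__(self):
        self.version = 0   # current model version t
        self.delays = []   # observed delay tau per applied update

    def pull(self):
        return self.version  # worker computes its update against this version

    def push(self, pulled_version):
        # delay = version being updated minus version used to compute the update
        tau = self.version - pulled_version
        self.delays.append(tau)
        self.version += 1    # applying the update produces a new model version

ps = ParameterServer()
v_a = ps.pull()   # worker A pulls version 0
v_b = ps.pull()   # worker B pulls version 0
ps.push(v_a)      # A's update applied at version 0 -> delay 0
ps.push(v_b)      # B's update applied at version 1 -> delay 1
print(ps.delays)  # [0, 1]
```

The second worker's update is one version stale even though both workers pulled at the same time, which is exactly the effect eqn. 1 captures via τ.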
Performance based on model complexity:
Most real-world models trained with DML algorithms are complex. Consider the image-recognition neural network model ResNet-50, which achieves up to 75% accuracy in classifying images among 1000 classes. It is 100MB in size; however, GPUs (e.g., NVIDIA P100) can process up to 200 images per second to compute updates for the model. In a distributed training setup (see §7), the number of images used to compute an update at a worker is much lower, typically 32 images. In such a scenario, we find that the computation phase at a DML worker takes less than 100ms. On the other hand, faster compute means that workers have to exchange 100MB worth of updates amongst themselves every 100ms. This causes high communication overhead; even bandwidth-optimal AllReduce strategies like ring AllReduce (used in synchronous SGD algorithms) take at least 320ms (over 3× the computation time) to exchange all updates when all workers are connected by commodity 10Gbps Ethernet. By aggregating updates between GPUs in a worker before exchanging over the network, the communication cost is reduced to 160ms; further, effective pipelining of computation and communication reduces the overall time per iteration from (100 + 160)ms to 200ms. Other update exchange strategies (e.g., binary tree AllReduce, or Parameter Server with a single server) have even higher communication overhead. Increasing the number of parameter servers reduces the network load per server (even though it is still higher than ring AllReduce), at the cost of increased communication between servers to propagate model version information. This makes it undesirable for asynchronous ML algorithms.
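As a sanity check on these numbers, a sketch using the standard ring AllReduce cost model (each worker's NIC carries 2(N−1)/N times the model size; the formula choice is ours) and the 32-worker, 100MB, 10Gbps setting from the text recovers roughly the 160ms post-aggregation figure:

```python
# Back-of-envelope ring AllReduce cost, using the standard lower bound of
# 2*(N-1)/N * model_size traversing each worker's NIC. The 32-worker count,
# 100 MB model, and 10 Gbps link speed are taken from the text above.

def ring_allreduce_seconds(model_bytes, n_workers, link_bps):
    bits = model_bytes * 8
    return 2.0 * (n_workers - 1) / n_workers * bits / link_bps

t = ring_allreduce_seconds(100e6, 32, 10e9)  # ~0.155 s, i.e., roughly 160 ms
print(f"{t * 1000:.0f} ms")
```

With one aggregation participant per worker this gives about 155ms; with twice the participants (e.g., two GPUs per worker exchanging independently), the per-NIC traffic roughly doubles, matching the 320ms figure.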
A similar trend is observed for other algorithms, e.g., distributed LDA. Topic modeling using LDA on the NY Times dataset with 32 parallel workers (see §7) has a computation cost of 180ms at each worker. The communication cost (time to exchange updates) assuming ring AllReduce is 160ms. However, for a PS-based system with 10Gbps server bandwidth, the communication cost is 1.8s, i.e., 10× the computation time.
Performance with stragglers and network bottlenecks: For synchronous algorithms, where progress is determined by the slowest worker, the effect of stragglers and network bottlenecks is prominent. We find (§7) that the per-iteration time increases substantially when 10% of the workers take 4X longer in each iteration and 10% of the incoming and outgoing network links have reduced bandwidth.
In asynchronous SGD, these slow workers observe high delays. This severely impacts convergence speed or converged model accuracy (§7). We find that asynchronous LDA, upon introducing network bandwidth fluctuations (bandwidth on 10% of the links is reduced to 5Gbps every 5s), takes 35% more iterations to converge due to network stragglers.
3 Central ideas
Because today's DML systems are network-agnostic, they suffer slowdowns in the face of compute or network contention (stragglers). In MLfabric, instead of treating the network as a blackbox, all transfers of a DML algorithm are handed off to a communication library, which determines the entire communication pattern at any point in time. For simplicity, we explain MLfabric in the context of PS systems and asynchronous algorithms.
In MLfabric, each update push from a worker to a server is intercepted and fulfilled later, at which time it is either directly forwarded to the server or passed through intermediate hops where the update is aggregated with other workers' updates. We refer to the ability to finely control the transfer of the model and its updates over the network as in-network control. In this section, we explain the benefits of in-network control using theory and qualitative arguments. Algorithms that realize the benefits are presented in §5.
First, in-network control enables network-based delay management (§3.1) – i.e., managing delay observed at the server by controlling the order of updates inside the network. This helps asynchronous SGD by lowering the number of iterations to convergence as well as the average iteration time. Second, network control enables in-network aggregation of updates (§3.2), which further improves per-iteration performance. Third, it enables off-loading model replication from parameter servers to the network by ensuring consistent ordering of updates across primary and replica servers, and bounding model-replica divergence (§3.3). This relieves both server-side and network replication load while enabling fast recovery from server failure.
3.1 Delay management
We describe what delay management is and how it is helpful. Then, we make a case for it to be network-based.
Recall from eqns. 1 and 2 that workers' updates are applied to the model in a delayed fashion. Prior analysis shows that for well-behaved convex loss functions, asynchronous SGD converges as long as the delay for each update is bounded (τ_t ≤ τ_max) and the learning rate or "step size" in iteration t is set as η_t = c/(τ_max √t), where c is a constant. As a result, for execution environments with large observed delays, the learning rate must be set small to guarantee convergence. This increases the number of iterations until convergence. In response, subsequent work advocates making the learning rate a function of the delay observed for a worker; under the assumption that the delay follows a uniform distribution, it shows a convergence bound for delay-adaptive SGD of the form:
where w* is the optimal model minimizing the loss function ℓ, and w̄_T is the estimated model after T iterations.
Building on this result, we show (§10.4) that if the worst-case delay is bounded, delay-adaptive SGD converges at a correspondingly faster rate. In other words, we can get a constant-factor speedup in convergence, where the speedup is inversely proportional to the delay bound.
Based on this result, our first idea is to carefully control the order in which updates are applied to the model. This reduces the variance of the delay distribution in asynchronous SGD and bounds maximum delay.
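A toy illustration of this idea, with made-up version numbers: reordering a batch of pending updates so that the one computed from the oldest model version is applied first reduces the worst-case delay relative to arrival order.

```python
# Sketch: given a batch of pending updates, each tagged with the model version
# it was computed from, applying them oldest-version-first lowers the maximum
# delay. The version numbers here are illustrative, not from the paper.

def delays(current_version, versions_in_apply_order):
    # the k-th applied update sees k more versions committed ahead of it
    return [current_version + k - v
            for k, v in enumerate(versions_in_apply_order)]

current = 10
pending = [9, 7, 10, 10]   # model versions used by 4 pending updates

arrival_order = delays(current, pending)          # stale update applied late
oldest_first = delays(current, sorted(pending))   # stale update applied first

print(max(arrival_order), max(oldest_first))      # 4 3
```

Here the maximum delay drops from 4 to 3 purely by reordering; no update is recomputed or dropped.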
3.1.1 In-network Control: Fresher Model Versions
The above ordering of updates can be realized at the parameter server. However, we argue that it is better to leverage in-network control, thereby enforcing network-based ordering. This is because in-network control helps make fresher model versions available earlier, as argued below.
Consider an execution of asynchronous SGD; let us assume that at some time instant, k (of n total) workers have pending gradient updates that need to be applied to the model. These updates were computed using prior versions of the model, versions v − d_1, …, v − d_k, where v is the current version of the model and the d_i are integers denoting the delay of update i if it is applied to the model at version v. Assume that there exists exactly one update u_j such that d_j > d_i for all i ≠ j, with d_j close to the maximum allowable delay D. In other words, one of the updates (u_j) has been computed with an older version of the model when compared to the others.
In Figure 1, we show how updates are transferred today, and how that may cause the delay to exceed the bound (see caption).
As one alternative, we can enforce that no update with delay greater than the bound should be applied to the model; this causes the old update to be discarded, resulting in lost work.
Another alternative is server-based update ordering, where we buffer at the server those updates that complete before the old update (fig. 1), and apply them only after the old update has been transferred and applied to the model. This ensures that the old update's delay does not grow further; the delay of each buffered update increases, but remains under the bound. The downside is that the workers' interim pull requests do not see new model versions: all pull requests issued in the interim are returned the old version, which is worse than in fig. 1.
The final alternative is in-network control, where we can enforce network time-sharing, i.e., different updates are transmitted by the network at carefully chosen non-overlapping times at bottleneck links (see fig. 1; note: we assume a single bottleneck at the server here). The total time to transfer all the updates is the same, since the total data transferred over the network is the same as before. However, as long as update transfers are scheduled such that the old update is transferred early enough in the order, the delay bound will be satisfied without any need for server-side buffering. Further, by transferring the remaining updates in the order of their completion times, we can emulate shortest-job-first and minimize the average update time. This makes new model versions available earlier than in the alternatives above (fig. 1). Our update scheduling algorithm in §5.1 relies on this idea.
3.2 In-network aggregation

In-network control enables the above ordered updates, once ready to be sent to the server, to be further opportunistically aggregated at network locations before being applied at the server. Thus, network load is lowered, and model updates occur faster (at earlier times) compared to not aggregating in-network.
Say at some time, updates from 4 different workers are available. If all the updates are transferred to the server directly in a time-shared manner, completion times are as shown in fig. 2. Even though the update from the last worker is available at that time, it is queued and starts transmitting only after the updates ahead of it have completed.
With in-network control, say the first two updates are forwarded directly to the server, but the other two are aggregated at an aggregator (fig. 2). Our aggregation algorithm in §5.2 constructs such aggregation topologies dynamically based on current network load. Assuming full-bisection bandwidth, the two aggregated updates will be transferred concurrently with the two direct ones. After aggregation, the result can be transferred to the server, where it updates the model.
Since the aggregate is no larger than a single update, server network load is reduced. Also, pull requests issued after the aggregate is applied can be answered with fresh information reflecting all 4 model updates; without aggregation, earlier pull requests don't capture the last update.
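A back-of-envelope version of this example, with illustrative sizes and bandwidths (not from the paper): when the server NIC is the single bottleneck, aggregating two of the four updates in-network means only three transfers cross it.

```python
# Rough cost model for the 4-update example. The update size S and server NIC
# bandwidth B are illustrative. Time-shared direct transfers serialize all 4
# updates at the server NIC; aggregating 2 of them in-network (full-bisection
# bandwidth assumed, so aggregation overlaps the direct transfers) leaves the
# NIC with only 3 transfers: 2 originals plus 1 aggregate.

S = 1.0   # update size, Gb (illustrative)
B = 1.0   # server NIC bandwidth, Gbps (illustrative)

direct_only = 4 * S / B        # all 4 updates cross the server NIC
with_aggregation = 3 * S / B   # 2 direct updates + 1 aggregated update

print(direct_only, with_aggregation)  # 4.0 3.0
```

The last model update therefore lands at the server a full transfer-time earlier, and the NIC carries 25% less data.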
3.3 Toward Bounded Consistency Replication
In PS-based systems, the server stores the entire model. It is therefore crucial to ensure server fault-tolerance.² Existing PS implementations use chain replication for fault tolerance, which incurs data overhead for replicas. They attempt to reduce the data overhead by aggregating updates at the server and forwarding them to the replica once every few iterations. However, if updates are sparse, infrequent replication only amortizes the server data overhead, since the total data transferred from server to replica is not reduced.

²Server fault-tolerance is needed because individual workers often do not read the entire copy of the model in each iteration (e.g., sparse logistic regression), and/or because, apart from the model, parameter servers also store additional state not visible to workers, like the history of updates, prior learning rates, etc., used for momentum-based model updates.
To reduce server load, we can enable a replication strategy based on workers forwarding a copy of each update directly to the replica. However, such worker-based replication is not easy to achieve without active in-network control. In particular, having workers replicate by themselves fundamentally cannot preserve the ordering of updates. Coupled with the stateful nature of model updates (eqn. 2), this can result in unbounded server-replica model divergence, which makes recovery from the replica slow, if not impossible. We show this next. Then, we discuss how in-network control helps.
Assume at some time, the server and the replica contain identical models (w_s = w_r) and have the same prior update. Let the next two updates to the model at the server be u_t and u_{t+1}; assume the same updates are applied in a different order (u_{t+1}, then u_t) at the replica. Then, by applying eqn. 2 twice, the models at the server and replica can be computed as:
Thus, the divergence is:
Each such re-ordering of updates will add a further non-zero divergence between the server and replica!
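To make the divergence concrete, here is a sketch under the momentum form of eqn. 2, writing u_t = μ u_{t−1} + g_t, with g_t the gradient component of update u_t (our notation, introduced for this sketch):

```latex
% Server applies g_t then g_{t+1}; replica applies them in the opposite order.
u_t = \mu u_{t-1} + g_t, \qquad u_{t+1} = \mu u_t + g_{t+1} \\
u'_t = \mu u_{t-1} + g_{t+1}, \qquad u'_{t+1} = \mu u'_t + g_t \\
% Summing the two applied updates on each side and subtracting:
w_s - w_r = (u_t + u_{t+1}) - (u'_t + u'_{t+1}) = \mu\,(g_t - g_{t+1})
```

The residual divergence μ(g_t − g_{t+1}) is non-zero whenever the two gradients differ, and it is proportional to the momentum term: the reordering poisons the momentum state, not just the current step.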
Enforcing bounded consistency: Using in-network control, we can ensure that all updates to the model at the server are also applied to the model at the replica in the exact same order. If the same updates are always applied at the server and replica, then their models are identical, realizing strict consistency. Alternatively, we can use in-network control to merely bound model-replica divergence, leading to a flexible new notion of bounded consistency: here, we simply ensure that the server-replica divergence is within a user-specified bound. Bounded consistency flexibly trades off the cost of recovery against the network efficiency of updates to primary and replica models. Specifically, a large bound allows delaying several replicated updates, which can be aggressively aggregated later, controlling network replication load. A small bound makes recovery fast, but at the cost of higher replication load.
In-network control enables us to carefully schedule both original and replicated updates to achieve bounded consistency. We now show how in-network control can reduce divergence, which we use in our replication algorithm (§5.3).
Consider a scenario where update u_t is the latest to be applied to the model at both the server and replica; the divergence at this point is zero. Now, consider a schedule of future updates as shown in fig. 3. At a later time, the server leads the replica by two updates. By applying eqn. 2 twice, the model divergence at that time can be computed as:
where H_t denotes the update history at time t.
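When the replica merely lags the server (same updates, same order), the divergence is just the sum of the not-yet-replicated updates, so it can be bounded using the update norms that workers attach to each push (the update_norm argument in Table 1). A sketch, in our notation:

```latex
% Replica has applied everything through u_t; server is two updates ahead.
w_s - w_r = u_{t+1} + u_{t+2}
\quad\Longrightarrow\quad
\lVert w_s - w_r \rVert \;\le\; \lVert u_{t+1} \rVert + \lVert u_{t+2} \rVert
```

The scheduler can therefore defer replica transfers as long as the summed norms of the deferred updates stay within the divergence limit.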
4 Architecture and APIs
Architecture: The main component of MLfabric is a scheduler that interacts with MLfabric daemons on each worker/server; the scheduler processes update and model transfer requests from the daemons and determines the (a) next hop, and (b) schedule for each transfer. The next hop can either be a final destination (worker or server) or an intermediate aggregator deployed alongside workers; aggregators compute the (weighted) sum of incoming updates and forward them to the next hop determined by the scheduler. A network monitor periodically measures and reports available network bandwidth to the scheduler which is used to make scheduling decisions. MLfabric daemons are responsible for interfacing with application entities using MLfabric APIs and enforcing the scheduler’s decisions.
Table 1 (excerpt): MLfabric API additions. worker: push(server, update, update_norm). params: delay bound; divergence bound.
APIs: MLfabric extends existing PS APIs (see Table 1). It also provides an MPI AllReduce API which is realized through PS APIs (§6). These APIs help realize the optimizations discussed in §3. For example, for bounded consistency replication (§3.3), MLfabric allows machines to register as replicas and allows workers to specify the norm of the update when it is pushed.
MLfabric scheduler determines the communication pattern for a batch of updates available from workers. It computes the transfer schedule (i.e., how bytes in an update are transferred at any given time) and forwarding (next hop – i.e., server or intermediate aggregator hop) for each of these updates. This is done so as to (1) minimize the average completion time of update transfers (§3.1.1), improve network efficiency (§3.2), and make fresh models available earlier (§3.1.1 and §3.2), while (2) bounding worst-case delay (§3.1) even under stragglers or changing network conditions. Also, when replica servers are deployed, MLfabric schedules minimal replication traffic to bound primary-replica model divergence (§3.3).
We first formulate an integer linear program (ILP) to jointly determine the optimal schedules of updates and forwarding for aggregation (§10.1). But the ILP is intractable even for a PS-system with one server and no replicas. It is intractable even when determining the schedules alone (i.e., aggregation is also ignored) while considering delay bounds.
To handle this intractability, we decompose the problem and solve it progressively. As mentioned earlier, we process a batch of updates at a time. We first determine an ordering for the batch of updates. Second, given the ordering, we determine the forwarding/aggregation strategy, which results in tentative schedules for transferring updates to either servers or aggregators. Third, given the ordering and schedules of updates, we determine which replica transfers to schedule, and when, so that they finish before updates in the batch are committed at the server; if this replication "falls short", i.e., causes server-replica model divergence, we delay a small number of the tentative primary server transfer schedules such that the divergence bound is met. In the end, we have concrete transfer schedules for all updates in the batch and for those that are replicated. For simplicity, we assume a single server.³ We describe the above three algorithms in turn.

³In §10.2, we consider the case where the model is sharded across multiple parameter servers.
5.1 Update Ordering
Given a set of available worker updates and a single server, we first describe how we determine the order in which updates are transferred over the network. We ignore replication/aggregation for now.
We assume network time-sharing (§3.1.1), i.e., updates transferred on a bottleneck link do not have overlapping transfer times. Given this, we attempt to determine an ordering that (1) minimizes the average update transfer time to the server to ensure a fast rate of updates (§3.1.1), (2) subject to the constraint that delay bounds are met. Since this problem in itself is also intractable, we develop a heuristic that decouples the two goals by first attempting to minimize average transfer time (§5.1.1), and then "fixing" any violated delay bounds (§5.1.2). This heuristic may result in network links or the server NIC lying "fallow", and we show how to alter the ordering to address this inefficiency without violating delay bounds (§5.1.3).
5.1.1 Average completion time
To determine an order that minimizes average update transfer time, we iteratively emulate shortest-job-first ordering for update transfers (alg. 1): in each iteration, given currently available bandwidth, we compute each single update's transfer completion time by factoring in the bottleneck bandwidth the transfer has available over time, and determining how the bytes in the update are transferred by maximally using bottleneck capacity at any time (fig. 4). We pick the transfer with the least completion time, and reserve capacity on its path over time; the amount of reservation equals the time-varying bottleneck bandwidth, and the reservation duration equals the transfer completion time. We then update the remaining network capacities over time, and iterate (alg. 1).
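For intuition, the following simplified sketch specializes this emulation to a single time-shared bottleneck (the server NIC), where reserving the full link per transfer reduces shortest-job-first emulation to sorting by standalone transfer time; the multi-link, time-varying-bandwidth version in Alg. 1 generalizes this. Sizes and bandwidths are illustrative.

```python
# Simplified sketch of Alg. 1's shortest-job-first emulation for the special
# case of one time-shared bottleneck link. With a single bottleneck, reserving
# the whole link per transfer means transfers run back-to-back, so picking the
# least-completion-time transfer each iteration is just a sort by size.

def sjf_schedule(transfers, link_bps):
    """transfers: {name: size_bits}. Returns [(name, completion_time_s)]."""
    order = sorted(transfers.items(), key=lambda kv: kv[1])
    t, out = 0.0, []
    for name, size in order:
        t += size / link_bps   # link fully reserved for this transfer
        out.append((name, t))
    return out

sched = sjf_schedule({"u1": 4e9, "u2": 1e9, "u3": 2e9}, link_bps=1e9)
print(sched)  # u2 completes at 1 s, u3 at 3 s, u1 at 7 s
```

The average completion time here is (1 + 3 + 7)/3 ≈ 3.7s; any other order is worse, which is why the emulation greedily picks the shortest remaining transfer.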
5.1.2 Bounding delays
Shortest-transfer-first ordering can increase delay (potentially beyond the configured upper bound, D) for large updates or those with less bandwidth to the server. To ensure that these are transferred earlier in the order so as to meet delay bounds, we introduce transfer deadlines. For an ordering over the updates in the batch, the deadline for update u_i is:

deadline(u_i) = D − (v_0 − v_i)

where v_i is the model version used to compute update u_i, and v_0 is the version of the model after all updates from previous batches are applied.
We then modify the update ordering algorithm (alg. 1) as follows: in the k-th iteration, if there exists an unscheduled update u_i such that deadline(u_i) = k, then we pick u_i in that iteration and reserve bandwidth for transferring it as above; otherwise, we greedily pick the update with the least transfer time.
5.1.3 Dropping delayed updates
Unfortunately, simply accounting for shortest-transfer-first and deadlines does not suffice to determine a "good" ordering. Unless care is taken while factoring in deadlines, the ordering may unnecessarily lead to network or server resources staying fallow. To see why, consider two workers with updates u_1 and u_2, whose deadlines are 1 and 2, respectively. Let the network topology and current state of the network be as shown in fig. 5. Since u_1 has a deadline of 1, the above approach picks u_1 as the first update to apply to the model, and thus transfers it first. Because its bottleneck bandwidth is 10Mbps, the transfer takes 10s. In the next iteration, the algorithm selects u_2. If u_2 is scheduled immediately, its available bandwidth is 90Mbps (after bandwidth for u_1 is reserved), and the update takes 1.1s to finish. Thus, u_2 completes long before u_1, and applying it first violates u_1's delay bound (recall u_1's deadline is 1). One way to avoid this is to transfer u_2 in a delayed manner such that u_1 is applied first, but this leaves 90Mbps of network capacity on the link to the server unused while only u_1 is being transferred. Alternately, u_2 can be transferred per the above ordering, but applied only after u_1 is applied; in this case the server stays idle while waiting for u_1 even though u_2 is available to be applied.
To ensure work-conserving server and network behavior as well as delay bounds, we modify our algorithm to drop the deadline-constrained update at the worker itself; the other update is then immediately scheduled for transfer.
Thus, our iterative algorithm requires the following fix: in every iteration of the modified algorithm from §5.1.2 where we pick an update (call it "current") to satisfy a deadline, we look ahead and determine the completion time of the next update that would be applied (call it "next"). If "next" completes before "current", we discard "current". The final ordering algorithm, which combines shortest-job-first, meets delay bounds, and avoids wasting resources, is shown in Alg. 2.
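A simplified rendering of this look-ahead rule (single bottleneck, with transfer times mirroring the 10s/1.1s two-update example above; this is a sketch, not Alg. 2 itself):

```python
# Sketch of the deadline fix from Sections 5.1.2-5.1.3: at each position we
# must pick an update whose deadline has arrived, but if the next shortest
# update would finish before it anyway, the deadline-constrained update is
# dropped at the worker, since it cannot be applied in time without leaving
# the network or the server idle.

def order_with_deadlines(updates):
    """updates: {name: (transfer_seconds, deadline_position)}."""
    pending = dict(updates)
    scheduled, dropped, pos = [], [], 1
    while pending:
        due = [n for n, (_, d) in pending.items() if d <= pos]
        best = min(pending, key=lambda n: pending[n][0])  # shortest-first
        current = due[0] if due else best
        if current != best and pending[best][0] < pending[current][0]:
            # look-ahead: "next" finishes before "current" -> drop "current"
            dropped.append(current)
            del pending[current]
            continue
        scheduled.append(current)
        del pending[current]
        pos += 1
    return scheduled, dropped

# u1: 10 s transfer, deadline 1; u2: 1.1 s transfer, deadline 2
sched, dropped = order_with_deadlines({"u1": (10.0, 1), "u2": (1.1, 2)})
print(sched, dropped)  # ['u2'] ['u1']
```

As in the example, the slow deadline-1 update is dropped and the fast update proceeds immediately, keeping both the link and the server busy.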
With the transfer order determined, we describe how to opportunistically aggregate updates in-network. The goal is to use spare compute and network capacity at non-server (aggregator) locations to aggregate as efficiently as possible while preserving the above-determined ordering.
We achieve this by grouping ordered updates in a clever manner, and streaming each group either directly to the server, or first to an aggregator and then to the server, such that the server always has a constant stream of ordered or ordered-and-aggregated updates arriving at its NIC (fig. 6). In more detail, given a set of $k$ aggregators for the server, we partition the ordered updates into $k{+}1$ groups, using an algorithm we describe shortly. Given the partitioning, the first of these groups, if non-empty, is forwarded directly to the server. All gradients in subsequent groups are aggregated before they are forwarded to the server (fig. 6). Further, updates in each group are forwarded to aggregators per the ordering determined above, and the output from each aggregator obeys the same order. Thus we ensure that (a) the delay constraints remain satisfied (fig. 6), and (b) the update to the model is consistent with the case of no aggregation.
Our algorithm for determining the best way to partition updates into groups for the server is key to efficiency and is shown in Alg. 3. The algorithm determines group membership so as to minimize the total time until the aggregated update from the last group is transferred to the server. The partitioning is guided by the following key constraint: aggregating all updates in the group from the corresponding workers should not finish later than the time when all prior groups’ gradient aggregates are transferred to the server. This condition ensures efficiency, i.e., the server NIC is never left fallow, waiting for updates to be aggregated.
We first randomly pre-assign the aggregator to use for each group. Then, the aggregation algorithm works as follows. Given $n$ ordered updates, we exhaustively enumerate $n{+}1$ cases (lines 21-23). In the $i$-th case: (1) the first $i$ updates are forwarded directly to the server (lines 3-7). (2) We greedily assign successive updates to the first aggregator as long as the above-mentioned constraint is satisfied (lines 16-18). (3) When the constraint is violated, we greedily start assigning updates to the second aggregator (lines 10-15), and so on. Figure 6 illustrates one such case.
After every assignment of an update to server/aggregator, we reserve network bandwidth for the transfer (lines 5, 6, 17). We also reserve bandwidth for transferring the aggregated update from the aggregator to the server (lines 11, 12).
Each case results in a different aggregation pattern. We pick the one that takes the least time to transfer all the updates to the server (line 24).
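The enumerate-and-pack structure of the partitioning algorithm can be sketched with a toy cost model. The assumptions here (the server NIC serializes transfers, update $j$ takes `t_s[j]` to send directly to the server or `t_a[j]` into an aggregator, every aggregate takes a fixed `t_f` to forward, and each group uses a fresh aggregator) are simplifications for illustration, not the paper's actual network model.

```python
def pack_groups(t_s, t_a, t_f, i, n_aggs):
    """For the case where the first i ordered updates stream directly
    to the server: greedily pack the remaining updates into aggregator
    groups such that a group's aggregation finishes no later than the
    server is done receiving all earlier traffic, so its NIC never
    idles. Returns (groups, finish_time, feasible)."""
    server_busy = sum(t_s[:i])           # direct prefix drains first
    groups, cur, cur_done = [list(range(i))], [], 0.0
    for j in range(i, len(t_s)):
        if not cur or cur_done + t_a[j] <= server_busy:
            cur.append(j); cur_done += t_a[j]
        else:                            # constraint violated: new group
            groups.append(cur)
            server_busy += t_f           # server ships previous aggregate
            cur, cur_done = [j], t_a[j]
    if cur:
        groups.append(cur); server_busy += t_f
    return groups, server_busy, len(groups) - 1 <= n_aggs

def best_partition(t_s, t_a, t_f, n_aggs):
    """Outer loop (sketch of the enumeration): try every split point i
    and keep the feasible partition with the least total transfer time."""
    best = None
    for i in range(len(t_s) + 1):
        groups, finish, feasible = pack_groups(t_s, t_a, t_f, i, n_aggs)
        if feasible and (best is None or finish < best[1]):
            best = (groups, finish)
    return best
```

For four updates that each take 10s direct but 1s into an aggregator (with a 10s aggregate-forwarding time), the best partition aggregates most of them, halving the finish time versus all-direct.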
Note that our algorithm does not alter the transfer schedules for updates in group 1 compared to those computed by Alg. 2. For all other updates, and aggregates, our algorithm computes new transfer schedules that differ from Alg. 2, because the schedule now accounts for new transfer destinations (aggregators vs. the server) and new transfers/sources (aggregates from aggregators vs. original updates from workers).
We now have transfer schedules for a batch where: delay bounds are met; updates are aggregated efficiently; updates/aggregates are committed as fast as possible to the server ensuring high model update rate; and server NIC and overall network efficiency are high.
Given the above transfer schedules for a batch, we now describe our replication algorithm. It determines the final schedules for transfers to both the server and the replica; thus, we refer to the above-determined schedules (alg. 2) as “tentative”.
Suppose there is just one replica (extending to more replicas is straightforward). The goal is to transfer a prefix of the updates to the replica in the same order as determined above, such that when all the updates in the batch are committed at the server, the divergence bound is satisfied. We assume a separate set of aggregators is earmarked for the replica.
The replication algorithm operates in cognizance of the above “tentative server schedules” and the resulting network state. It first computes “tentative replica schedules” using the aggregation alg. 3 (note that replicated transfers use the already-computed server transfer ordering), where the initial state of the network accounts for the tentative server (original transfer) schedules.
Suppose the last transfer in the tentative server schedules finishes (i.e., commits at the server) at time $T$. We check whether the divergence bound holds at $T$, based on the server and replica updates that would have been committed by this time; we can determine this from the tentative server and replica schedules, as shown shortly. Note that only a prefix of the replica updates would have completed by $T$.
If the bound is satisfied, we freeze the replica schedule (i.e., apply to the replica) for all worker-to-aggregator and aggregator-to-replica transfers that would have finished by $T$; all replicated updates that finish after $T$ are “punted” to be processed along with the next batch.
If the divergence bound is not satisfied at $T$, then we delay just the last update in the tentative server schedule (akin to the example in §3.3) to start after the completion of the earliest update in the replica schedule (say, $r$) such that the divergence bound is satisfied; all replica updates up to $r$ are then frozen (this is still a prefix of the ordering); subsequent updates are punted to the next batch.
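The freeze-or-punt decision above can be sketched as follows. This is a simplified reading: the divergence check is abstracted into a caller-supplied predicate, and the case where the last server update must be delayed is signaled rather than re-scheduled.

```python
def freeze_replica_prefix(replica_finish, T, bound_ok):
    """replica_finish: tentative finish times of replica transfers, in
    server-update order; T: when the batch's last server transfer
    commits; bound_ok(k): True if the divergence bound holds when the
    replica has committed the first k updates. Returns (frozen, punted)
    lists of finish times, or None if the last server update must
    instead be delayed (handled separately, as in the text)."""
    k = sum(1 for t in replica_finish if t <= T)  # prefix done by T
    if bound_ok(k):
        return replica_finish[:k], replica_finish[k:]
    return None
```

Transfers scheduled in order finish in order, so the transfers completing by $T$ always form a prefix of the replica schedule.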
Punting the processing of some replicated updates to the next batch has two advantages for that next batch: (1) punted transfers combine with the set of replicated updates for that batch, which helps increase overall aggregation efficiency in transfers to the replica when that batch is processed (especially when the bottleneck is close to the replica), and (2) in turn, this helps free up resources for server update transfers in that batch, helping them finish faster. With a large divergence bound, more such updates are punted to subsequent batches, magnifying these benefits. We show this empirically in §7.
Thus, we now have concrete schedules for both server and replica updates.
We return to the problem of efficiently estimating the divergence between the sets of updates in the tentative server and replica schedules. Because computing exact divergence requires expensive computation over large updates, MLfabric approximates the actual divergence by an upper bound derived via the Cauchy-Schwarz inequality (see eq. 7); the bound depends on the update history and on constants determined by the momentum parameter. This upper bound can be computed using just the norms of the individual updates provided by the workers/server to the MLfabric scheduler (§4); verifying that the approximation is below the divergence bound is sufficient for bounded consistency.
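The norms-only idea can be illustrated with a minimal sketch. As a stand-in for the paper's Cauchy-Schwarz bound (which additionally carries momentum-dependent constants), this assumes plain additive SGD updates with no momentum, and that the replica holds a prefix of the server's committed updates.

```python
def divergence_upper_bound(server_norms, n_replicated):
    """server_norms: L2 norms of the updates committed at the server,
    in commit order; the replica holds the first n_replicated of them.
    The true divergence equals the norm of the sum of un-replicated
    updates, which the triangle inequality bounds by the sum of their
    norms, computable from the reported norms alone."""
    return sum(server_norms[n_replicated:])

def bounded(server_norms, n_replicated, delta):
    """Sufficient condition for bounded consistency: if the cheap
    upper bound is within delta, the true divergence is too."""
    return divergence_upper_bound(server_norms, n_replicated) <= delta
```

The check is conservative: it may report a violation when the true divergence is still within bound, but never the reverse, which is what bounded consistency requires.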
6 Extending MLfabric
We now describe how MLfabric applies to synchronous and stale synchronous SGD, and to MPI frameworks.
Synchronous SGD/PS: Here, in each iteration, workers read the latest model and compute a local update using a portion of the mini-batch. The updates are then aggregated at the server and applied to the model (incrementing the model version) before the start of the next iteration. MLfabric’s approach of constructing dynamic network-aware aggregation topologies naturally helps synchronous SGD. The workers’ updates for an iteration are batched as they become ready. Since update ordering does not apply to synchronous SGD, aggregation here starts with an unordered list of updates (vs. an ordering in §5.2). Then, directly applying our algorithm from §5.2 ensures that this batch of updates is transferred as efficiently as possible. The next batch may use a different aggregation topology. Our replication algorithm (§5.3) also applies directly. Note that we have to guarantee bounded divergence only at the end of an iteration (after all workers’ updates are applied), as opposed to the end of a batch.
Stale Synchronous SGD/PS: Stale synchronous (SS) SGD is a consistency model that allows slow workers to lag behind fast workers by up to $s$ iterations, with $s$ typically small. This form of delay management in SS is restrictive compared to the delay management that MLfabric enables for asynchronous SGD. SS bounds the maximum staleness of the model a worker reads; however, a worker that is more than twice as slow as the other workers will halt the progress of all other workers until it advances to the next iteration. In contrast, a comparable delay bound in asynchronous SGD with MLfabric does not halt other workers’ progress, while still ensuring that the staleness of every update is bounded.
Further, typical implementations of SS do not aggregate updates. The in-network control offered by MLfabric can be applied to update aggregation for SS in a manner similar to synchronous/PS above.
MPI: MLfabric’s AllReduce API can be used by existing MPI-based systems to implement synchronous SGD. Internally, MLfabric would implement AllReduce through successive calls to (a) push(root, update, norm) and (b) get(root, update), using a synchronous consistency model; root is randomly chosen among the workers and acts as the root of the aggregation topology, which is a dynamically constructed tree (similar to synchronous/PS above). get(root, update) pulls the aggregated update from root once updates from all workers are received.
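The push/get contract can be illustrated with a small in-memory stand-in. The call names follow Table 1, but the class and simplified signatures here are hypothetical; the real library routes these calls over a dynamically constructed aggregation tree, and `get` blocks until the aggregate is ready rather than asserting.

```python
class MLfabricStub:
    """In-memory stand-in for push/get: the root of the aggregation
    tree sums pushed updates and serves the aggregate once every
    worker has pushed. `norm` is what workers report so the scheduler
    can cheaply bound divergence; it is unused in this stub."""
    def __init__(self, n_workers):
        self.n, self.pushed, self.sum = n_workers, 0, None

    def push(self, root, update, norm):
        if self.sum is None:
            self.sum = [0.0] * len(update)
        self.sum = [a + b for a, b in zip(self.sum, update)]
        self.pushed += 1

    def get(self, root):
        # The real get(root, update) blocks until all workers'
        # updates are received; here we just check readiness.
        assert self.pushed == self.n, "aggregate not ready"
        return self.sum
```

An AllReduce round then amounts to every worker pushing its update and pulling the summed aggregate from the root.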
MLfabric can also help with model distribution (§10.3).
Implementation: MLfabric is implemented in C++ as a thin communication control layer between DML applications (e.g., PLDA, Keras) and MPI communication libraries (OpenMPI and NCCL). DML applications interact with MLfabric through the APIs defined in Table 1, and MLfabric internally uses APIs provided by MPI frameworks to aggregate/schedule transfers across the network.
Datasets and ML models: We evaluate MLfabric with two popular communication-intensive distributed ML applications: (1) distributed deep learning for image recognition on the ImageNet1K dataset (using ResNet50 and ResNet152 models), and (2) distributed LDA for topic modelling using Gibbs sampling on the NY Times dataset. The computation and communication structure of distributed LDA is similar to SGD; instead of computing a gradient update using a mini-batch, each worker computes a numerical update to a word-topic matrix using its entire shard of data, and then exchanges the update among all the workers using a PS- or MPI-based system.
Experiment setup: We run the DML applications across 30 workers in a cluster of 15 baremetal machines connected by a 10 Gbps network. The worker computation for distributed LDA runs on a 4-core 2.3GHz CPU, whereas the deep learning applications use NVIDIA P100 GPUs (2 cards per physical machine).
The scheduler, server (single), and replica (single) are hosted on a dedicated machine. The aggregators are co-hosted with worker clients; each aggregator runs on a separate core of a worker machine and does not compete for CPU resources. Requests are batched at the scheduler at a fixed interval.
Background compute and network load: Along the lines of prior work, we emulate compute stragglers by slowing down the update-computation stage: in each iteration, a worker has a fixed chance of being slowed down by a constant factor; the default values constitute compute setting C1. We also study two other settings, C2 and C3, with different slowdown probabilities and factors.
We emulate network background traffic by varying the rate limits on the physical hosts’ NICs; incoming and outgoing links are treated independently. Every few (default 5) seconds, each NIC rate is changed to a value drawn from a fixed set of rates with associated probabilities, emulating different numbers of contending flows (the default setting is called N1). We also consider two other settings, N2 and N3, with different rate distributions.
The network monitor reports changes in link bandwidth to the scheduler after a lag (0.2s by default).
Algorithms: We evaluate the following: PS-based asynchronous and synchronous variants of MLfabric, or MLfabric-A & MLfabric-S, respectively; vanilla PS-based asynchronous (Async); and MPI-based (using NCCL library) synchronous algorithms – we study two variants, ring-reduce and tree-reduce, or RR-Sync and Tr-Sync, respectively.
7.1 Performance of MLfabric-A
We compare MLfabric-A with Async and RR-Sync. We also study the effect of varying the delay bound.
We plot the top-1 test error (in %) for a ResNet50 model (100 MB) trained on the ImageNet1K dataset as a function of training epochs and wall-clock time, respectively (for the setting C1-N1). We use a mini-batch size of 32 per worker and a learning rate schedule that reduces the rate by a factor of 10 after epochs 30, 60, and 90. We compare MLfabric-A only with RR-Sync; the communication bottleneck at the parameter server makes Async prohibitive to run even over a few days.
With a delay bound of 30, MLfabric-A-30 can train a deep neural network to 74% accuracy; it alleviates the server bottleneck through update aggregation. The convergence rate as a function of the number of epochs is similar to RR-Sync’s.
In terms of wall-clock time, MLfabric-A-30 is faster than RR-Sync. The speedup can be attributed to two factors: (1) asynchronous algorithms are not prone to compute stragglers, and (2) unlike synchronous algorithms, MLfabric-A does not have to send traffic over low-bandwidth links in each iteration; updates from workers behind slow links can be dropped. Over the entire training process, MLfabric-A-30 dropped 30% of the updates at the worker for violating delay bounds.
We comment on the impact of delay control. A higher delay bound than the 30 used above can reduce the number of dropped updates and speed up training at the cost of a loss in accuracy. MLfabric-A-60, with a delay bound of 60, achieves only 70% test accuracy; however, it trains faster than RR-Sync. We also experimented with an intermediate delay bound (45) and found that its accuracy and run time lie between those of MLfabric-A-30 and MLfabric-A-60.
Varying compute and network settings: Next, we varied the compute and network settings to evaluate the benefits of MLfabric-A in different kinds of heterogeneous environments. Table 2 shows the speedup of MLfabric-A relative to RR-Sync across 9 different combinations of compute and network background loads. Here, to minimize the overall running time, we start with a pre-trained model (i.e., the model after epoch 50 for synchronous SGD). Run time is measured as the time taken to reach 74% test accuracy.
The speedup is highest (3X) when some workers are slower than others (C2) and the network is not the bottleneck (N1). This is because in RR-Sync, AllReduce is triggered only after receiving update-ready notifications from all workers. Thus, in the presence of stragglers, network bandwidth remains fallow while waiting for a slow worker to compute its update.
Varying reporting lag:
For network setting N2, increasing the reporting lag up to 2s (from 0.2s, with the network re-configured every 5 seconds) increases the per-iteration time averaged over 10 epochs (for ResNet50) by 100ms, i.e., 7.6% of the overall 1300ms. However, for a skewed distribution of link bandwidths, the per-iteration time increased by 40% with a 2s lag. Thus, the gains from MLfabric are robust to lags in monitoring unless bandwidth skews are significant.
Distributed LDA: Figure 7 compares the performance of RR-Sync, Async, and MLfabric-A based on the number of iterations and time taken to converge for the topic modelling task on the NY Times dataset (vocabulary size = 102660, number of documents = 300K); we use compute setting C1 and network setting N1 for this experiment. We learn a model with 100 topics; the model is said to have converged when the log-likelihood reaches -7.94 on test data of size 1500. RR-Sync, MLfabric-A-30, MLfabric-A-60, and Async converge in 145 (182s), 188 (139s), 239 (169s), and 300 (1080s) iterations (wall-clock time), respectively. This corresponds to a 1.6X and 1.26X speedup (in number of iterations) for MLfabric over Async with delay bounds 30 and 60, respectively. Further, even though MLfabric-A takes more iterations, it reduces the overall run time w.r.t. RR-Sync by 24% and 7% for delay bounds 30 and 60. Due to update aggregation, MLfabric is up to 7.8X faster than Async in wall-clock time. Similar gains were obtained for other compute and network settings.
Importance of delay control and aggregation: The results above show the relative importance of these two aspects of MLfabric. MLfabric’s aggregation plays a crucial role in supporting large-model training; without it, training is prohibitively slow. Note that MLfabric enables aggregation for the first time for asynchronous algorithms. Delay control is also important, because without it either accuracy (ResNet-50) or run time (LDA) suffers.
7.2 Performance of MLfabric-S
We compare the performance of MLfabric-S (using ResNet50) with RR-Sync for different compute and network settings, measuring the overall time to complete 5 epochs for both algorithms. We find that the bandwidth-optimal RR-Sync is faster than MLfabric-S for all combinations of compute and network settings except C2-N1. When some workers are slowed down (C2) and the network is not the bottleneck (N1), the rest of the workers are idle (no computation or communication) for 50% of the overall run time. For all other settings, communication is generally not idle. MLfabric-S reduces the idle time by eagerly aggregating available worker updates (even over low-bandwidth links), resulting in a 16% improvement in overall run time in C2-N1. For the ResNet-152 model (240MB) with the above compute and network settings, RR-Sync is the optimal algorithm, since communication is always the bottleneck.
We also compare MLfabric-S with another (non-bandwidth-optimal) variant of MPI AllReduce (Tr-Sync) that aggregates and distributes updates along a binary tree. In the presence of stragglers (C2) and network contention (N2), MLfabric-S reduces the per-iteration communication time for ResNet-152 by 21.7%: from 3.05s with Tr-Sync to 2.38s. For ResNet-50 the gain is 18.42%. Clearly, the advantages of network-aware aggregation are more prominent for larger models. Since compute time is relatively small, the reduction in per-iteration time directly translates into a reduction in the overall running time for large models. The benefits arise from dynamically avoiding nodes with low current bandwidth (and aggregating more updates at nodes with high current bandwidth); figure 9 plots the number of updates aggregated as a function of the bandwidth available on the incoming link. Note that since Tr-Sync uses a binary tree, the number of messages exchanged between workers is higher. Being network-aware, MLfabric-S forwards only 816 of the overall 20000 messages to aggregators with low bandwidth (2.5Gbps in N2), whereas Tr-Sync sends 1800 messages over such links.
7.3 Bounded consistency replication
For MPI-based systems, fault tolerance is provided today by checkpointing the model at the worker with rank 0. We compare the cost of checkpointing with the cost of transferring updates to a hot-standby replica over the network. We measure the overhead of fault tolerance as follows: for MPI-based systems, the overhead is the difference in time between two runs, with and without checkpointing, over 6 epochs (with no stragglers or network bottlenecks). For PS-based systems, it is the time difference between two runs with and without in-network replication. The runs with fault tolerance are parameterized by the maximum allowable divergence (measured in number of updates) between server and replica; for MPI systems, this translates to a checkpointing frequency. The overhead of checkpointing every iteration and every 20 iterations is 76 minutes and 4 minutes, respectively, for MPI-based systems; the corresponding overheads for in-network replication are 16 minutes and 10 minutes. As the divergence bound increases (600 updates in the case of 20 iterations), the network cost savings due to aggregation plateau for 30 workers (see fig. 9). Further, if all workers write only part of their model to disk every iteration and replicate it 3-way, the 76-minute overhead can be reduced substantially. Thus, overall, in-network replication does not help MPI-based systems.
For PS-based DML systems running asynchronous SGD, checkpointing at the server every 30 and every 600 updates has an overhead of 96 minutes (6X worse than in-network replication) and 6 minutes (0.6X), respectively. Thus, in-network replication is advantageous over checkpointing for scenarios that warrant tight divergence bounds (e.g., where compute nodes are highly susceptible to failure).
Chain replication is commonly used in many PS frameworks. We also experimented with it, but found that, given the large model sizes, it adds prohibitive overhead (up to 30X) compared to in-network replication in MLfabric.
7.4 Scheduler performance
For our experimental setting with 30 workers, the transfer schedules were computed within a few milliseconds per batch. To test the scalability of the scheduler computation (because the scheduler processes only small control messages that are received on a dedicated TCP socket and take only 1 RTT, high network utilization does not affect scheduler response time), we studied the effect of varying batch sizes. We measured the time taken by the scheduler to determine the concrete batch schedules when given batches of updates (with random deadlines) for a larger network topology with a congestion-free core. Across update batch sizes of 100, 500, and 1000, the scheduler overhead grows quadratically with the batch size. However, we note that the inner loops in alg. 2 (line 3, function ShrtUp) and alg. 3 (lines 21-23) can be parallelized (which our current implementation does not do), leading to better scaling.
8 Related work
Prior works propose various techniques to reduce the overall training time of ML algorithms that employ SGD for learning.
Algorithmic approaches: Some approaches for mitigating stragglers involve aggregating gradients from only a subset of fast workers in each iteration of synchronous SGD, which is complementary to MLfabric’s aggregation, and delay-aware learning rates for asynchronous SGD, which can benefit from MLfabric’s delay management. Prior work advocates variance-reduction SGD, where a series of asynchronous updates is interspersed with intermediate synchronous updates, and performing partial updates of the model to reduce the total data sent over the network [33, 22]; both techniques can benefit from MLfabric.
System-level approaches: To speed up gradient computation, ML systems leverage the SIMD processing capabilities of hardware accelerators like GPUs or TPUs, and such systems can leverage MLfabric for further speedup. Communication overhead is typically managed by (1) workers leveraging data sparsity and pulling only parts of the model, or (2) quantizing the floating-point values used to represent gradients. MLfabric is complementary to both.
Our overall approach can be viewed as a type of coflow scheduling [12, 14, 13, 28, 36]. The differences in our setting are: (1) flows in our coflow have an intrinsic order, (2) we can drop or re-order flows in the coflow, and (3) flows in our coflow can be aggregated in-network using ML-algorithm-specific aggregation functions. These aspects make our problem markedly different and more difficult.
We designed MLfabric, a communication library for speeding up large-scale distributed machine learning (DML) systems in dynamic cluster settings. We showed that fine-grained in-network control helps MLfabric to (1) algorithmically speed up convergence, (2) improve network efficiency via dynamic update aggregation, and (3) offload model replication responsibilities from servers to the network in a network-efficient manner. Our experiments on a 30-worker GPU cluster using real-world datasets and realistic straggler settings show that MLfabric substantially reduces model training time compared to state-of-the-art algorithms. Finally, this work does not raise any ethical issues.
10.1 ILP formulation for joint ordering and forwarding for aggregation
Let $W = \{w_1, \ldots, w_n\}$ be the workers and $s$ be the server storing a DML application’s model. Let $A = \{a_1, \ldots, a_m\}$ be the aggregators that serve as intermediate hops. Let $G = (V, E)$ denote a directed graph representing the underlying communication network: $V$ is the set of all hosts and switches, and $E$ is the set of network links, including host-to-switch links. Let $c_e$ denote the capacity of link $e$. We assume the path $p_i$ from $w_i$ to $s$ over a set of links is fixed and does not change over time. Different paths can share a network link.
To exploit dynamic aggregation and re-ordering, we jointly determine the schedule for a batch of requests (requests are batched temporally, so that earlier requests are not starved); let $U$ denote a batch of ready updates.
Variables: Let $r_i(t)$ denote the rate at which update $u_i$ is transmitted by worker $w_i$ at time $t$. Modeling the rate as a function of time allows us to capture network time-sharing and ordering between updates. For example, if updates $u_1$ and $u_2$ from $w_1$ and $w_2$ have to time-share a link of capacity $c$ such that $w_1$ sends its data first, followed by $w_2$, then $r_1(t) = c$ for $t \in [0, S_1/c)$ and $0$ otherwise, while $r_2(t) = c$ for $t \in [S_1/c, (S_1+S_2)/c)$ and $0$ otherwise.
Here, $S_i$ is the size of update $u_i$. Update ordering hinges on the transfer start/end times $b_i$ and $f_i$: $u_i$ is applied before $u_j$ if $f_i < f_j$.
Let the integer variable $h_i$ denote the immediate next hop for $u_i$; since a worker can forward its update either to the server or to an aggregator, $h_i \in \{s\} \cup A$. We also determine the schedule of aggregated updates: let $\bar{r}_j(t)$ be the rate at which aggregator $a_j$ forwards the aggregated update to the server, with $\bar{b}_j$ and $\bar{f}_j$ the corresponding start/end times.
Objective: For synchronous SGD, we minimize the time at which the last aggregated update finishes transferring to the server; this minimizes the total time to aggregate all updates in the batch. For asynchronous SGD, we instead optimize the average completion time per update, weighting each aggregator’s completion time by the number of updates it aggregates. Further, the completion times should be such that the delay bounds are satisfied.
Modeling the destination of each update and its transfer rate at each discrete point in time results in a large number of discrete variables. Solving an ILP with a large number of variables is expensive in time and is not a straightforward choice given the low-latency requirements at the scheduler. Thus, MLfabric breaks the complex ILP down into smaller sub-problems (ordering, aggregation, replication) and develops computationally efficient heuristics to solve them.
10.2 Model split across multiple servers
Our algorithms in §5 considered the case of a single parameter server; we now briefly describe the case where the model is split across a set of servers. An update from a worker then consists of multiple components, one per server. All components of an update are computed from the same version of the model and thus have the same deadline. Below, for simplicity, we ignore aggregation and replication and consider just scheduling (§5.1); the modifications we suggest naturally apply to the algorithms for replication (§5.3) and aggregation (§5.2).
One option for scheduling updates to multiple servers is to use an algorithm similar to the one described in §5.1.2, defining deadlines for each individual server and update component. For example, consider two updates, $u_1$ and $u_2$, each with components destined to two servers, $s_1$ and $s_2$. Treating all the components as independent and choosing them in shortest-transfer-first order might reserve network resources for all components destined to $s_1$ before any destined to $s_2$, if the updates to $s_1$ are smaller. With a large number of workers, this would result in some parts of the model being updated less frequently than others.
To ensure a uniform number of updates to all components of the ML model, network resources for all of an update’s components are reserved together. In each iteration of our algorithm, we pick the update whose largest per-component completion time is smallest, i.e., a min-max rule over components.
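Under one consistent reading of this rule, the per-iteration pick reduces to a one-liner; `completion_times` is a hypothetical helper standing in for the scheduler's per-component transfer-time estimates.

```python
def pick_next_update(pending, completion_times):
    """Min-max pick for a sharded model: completion_times(u) returns
    the estimated completion time of each of u's per-server
    components; choose the update whose slowest component finishes
    earliest, so all model shards advance at a similar rate."""
    return min(pending, key=lambda u: max(completion_times(u)))
```

For instance, an update with component times (1s, 9s) loses to one with (4s, 5s): the latter commits to every shard sooner, even though its fastest component is slower.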
10.3 Model distribution
Aggregating updates reduces the overall run time by reducing the amount of data forwarded to the server. However, if each request for the model is answered individually, the down-link at the server becomes the bottleneck. To reduce the load on the down-link, we use a distribution tree to propagate the model to the workers. At the server, requests are batched and answered with the same version of the model. The distribution pattern is determined similarly to the aggregation pattern: for a batch of requests, a set of distributors is earmarked, and workers are mapped to distributors using a variant of alg. 3 in which the transfer times are replaced by the times taken to transfer the model from server to distributor and from distributor to worker. Once the partitioning is determined, we first transfer the model from the server to the last distributor and then proceed backwards. The workers in the first group receive the model directly from the server.
10.4 SGD convergence analysis under bounded delay
We extend the proof of convergence under uniform delay from prior work. Specifically, we modify Lemma A.3 under the assumption that the delay is uniformly distributed over a bounded interval. We bound the delay-dependent term (see A.15 in the original proof, also defined below) under the new delay model.
We then derive the corresponding bound on the delay-dependent term. The expected loss after a given number of iterations can then be bounded as in Corollary 3.2 of the original analysis. The proof proceeds by expanding the delay-dependent term, observing that the delay is not independent of the update, bounding the inner summation, and then bounding the remaining outer term.
-  Caffe2: A new lightweight, modular, and scalable deep learning framework. https://caffe2.ai/.
-  NVIDIA Collective Communication Library. https://github.com/NVIDIA/nccl. Accessed: 2018-01-01.
-  NY Times Dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words.
-  PyTorch -Distributed communication package. http://pytorch.org/docs/master/distributed.html.
-  Tensors and Dynamic neural networks in Python with strong GPU acceleration. http://pytorch.org/.
-  Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (GA, 2016), USENIX Association, pp. 265–283.
-  Agarwal, A., and Duchi, J. C. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2011, pp. 873–881.
-  Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet allocation. Journal of machine Learning research 3, Jan (2003), 993–1022.
-  Bottou, L., Curtis, F. E., and Nocedal, J. Optimization Methods for Large-Scale Machine Learning. ArXiv e-prints (June 2016).
-  Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs/1512.01274 (2015).
-  Chollet, F., et al. Keras. https://keras.io, 2015.
-  Chowdhury, M., an