Deep learning has achieved phenomenal advances in various fields, including image recognition, speech processing, machine translation, gaming, and health care. The key to this success is the increasing size of models, which enables high accuracy. At the same time, such large and complex models are difficult to train: training a model may take hours or even days. Therefore, it is crucial to accelerate training in a distributed manner to promote wider applications of deep learning.
In distributed training, multiple workers running on a number of compute nodes cooperatively train a model with the help of communication between workers. The currently most widely used approach is data parallelism, in which each worker keeps a replica of the whole model, processes training samples independently, and synchronizes the parameters every iteration. Parameter Server (PS) is the first approach to support distributed training; it introduces a central node that manages one or more shared versions of the parameters of the whole model. More recently, All-Reduce, an alternative distributed solution utilizing the advanced Ring All-Reduce algorithm, has been shown to provide performance superior to PS [26, 45, 52, 31]. To fundamentally improve scalability, general decentralized training [35, 34, 37, 21, 22, 33, 48, 47] has also received intensive research interest. It has recently been theoretically shown, for the first time, that decentralized algorithms can outperform centralized ones. While PS and All-Reduce are both special cases of the decentralized method, a general decentralized training scheme can use an arbitrary communication graph, with spectral gap, doubly stochastic averaging, and independence properties, to specify point-to-point communication between workers.
The first key problem of distributed learning is the intensive communication among workers. During execution, gradients or parameter updates are transferred between workers in different nodes to reach the eventually trained model. In PS, all workers need to communicate with the parameter servers, easily causing a communication bottleneck even when the number of workers is relatively small. In All-Reduce, communication is more evenly distributed among workers; however, since it logically implements all-to-all communication, the amount of parameters transferred is still high. More importantly, to hide communication latency, All-Reduce uses delicately pipelined operations among all workers, which makes this solution vulnerable to system heterogeneity, i.e., differences in the performance of nodes (workers) and in the speed of communication links. Specifically, because All-Reduce requires global synchronization in every step, its performance is strongly bounded by the slowest worker, and thus it cannot tolerate heterogeneity well. We believe that heterogeneity is the second key challenge of distributed training.
To tolerate heterogeneity, both system-level and algorithm-level techniques have been proposed. At the system level, backup workers and bounded staleness have been shown to be effective in mitigating the effects of random worker slowdown in both PS [5, 2, 20, 51, 42] and decentralized training. However, if some workers experience severe and continuous slowdown, the benefits of system-level solutions are limited, since the whole system will eventually be dragged down by the slow workers or communication links. This motivates more fundamental algorithm-level solutions. In particular, AD-PSGD probabilistically reduces the effects of heterogeneity with randomized communication. In an additional synchronization thread, each worker randomly selects one other worker, averages the parameters of the two, and atomically updates both versions. Moreover, a worker needs to wait for the current synchronization to finish before starting another, regardless of whether it actively initiated the synchronization or was passively selected by another worker. While the slow workers inevitably have staler parameters and will drag down others' progress, this only happens when they happen to be selected. Unfortunately, the existing implementation only supports a certain type of communication graph and suffers from deadlock otherwise. More importantly, the parameter update protocol in AD-PSGD incurs significant synchronization overhead to ensure atomicity.
Figure 1 shows the training performance (defined as the time for the training loss to reach a target value) of the VGG-16 model on the CIFAR-10 dataset, for All-Reduce and AD-PSGD running on GPUs of GTX nodes as workers, in a homogeneous and a heterogeneous (one worker slowed down) execution environment. In Figure 1, we see AD-PSGD's excellent ability to tolerate heterogeneity: it is much faster than All-Reduce in the heterogeneous setting. However, the figure also shows that All-Reduce is much faster than AD-PSGD in the homogeneous environment. Thus, the open question is whether it is possible to improve AD-PSGD so that its performance is comparable to All-Reduce in a homogeneous environment while still maintaining its superior ability to tolerate heterogeneity.
In this paper, we propose Ripples, a high-performance heterogeneity-aware asynchronous decentralized training approach. Compared to the state-of-the-art solutions, Ripples gets the best of both worlds: it achieves better performance than All-Reduce in a homogeneous environment and significantly outperforms AD-PSGD in both homogeneous and heterogeneous environments. We achieve this almost ideal solution with intensive synchronization optimization, emphasizing the interplay between algorithm and system implementation. To reduce synchronization cost, we propose a novel communication primitive, Partial All-Reduce, that allows a large group of workers to synchronize quickly. To reduce synchronization conflicts, we propose static group scheduling for homogeneous environments and simple but smart techniques (Group Buffer and Group Division) that avoid conflicts at the cost of slightly reduced randomness.
We perform experiments on the Maverick2 cluster of the Texas Advanced Computing Center (TACC). We train a common model, VGG-16, on the CIFAR-10 dataset to look deeply into the different algorithms. We also train a large model, ResNet-50, on a large dataset, ImageNet, to validate the optimizations. Our experiments show that in a homogeneous environment, Ripples is faster than the state-of-the-art implementation of All-Reduce, faster than Parameter Server, and faster than AD-PSGD. In a heterogeneous setting, Ripples shows a speedup over All-Reduce, and also obtains a speedup over the Parameter Server baseline.
2 Background and Motivation
2.1 Distributed Training
In distributed training, a single model is trained collaboratively by multiple workers, which run in distributed compute nodes. Training is most commonly accomplished with Stochastic Gradient Descent (SGD), which is an iterative algorithm that reaches the minimum of the loss function by continuously applying approximate gradients computed over randomly selected data samples. In each iteration, there are typically three steps: (1) randomly select samples from the data set; (2) compute gradients based on the selected data; and (3) apply gradients to the model parameters.
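The three steps above can be sketched in a few lines. The following toy example is our own illustration, with a linear least-squares model standing in for a deep network; the learning rate and batch size are hypothetical choices, not values from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))              # toy data set
true_w = np.array([1.0, -2.0, 0.5, 3.0])   # ground-truth parameters
y = X @ true_w                             # noiseless labels

w = np.zeros(4)                            # model parameters
lr = 0.1                                   # hypothetical learning rate
for step in range(200):
    idx = rng.integers(0, len(X), size=32)      # (1) randomly select samples
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ w - yb) / len(idx)  # (2) compute gradients
    w -= lr * grad                              # (3) apply gradients to parameters

final_loss = float(np.mean((X @ w - y) ** 2))
```

After a few hundred iterations of this loop, `w` approaches `true_w`; distributed data parallelism, described next, parallelizes exactly this loop across workers.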
There are a number of schemes to achieve parallelism among multiple workers in distributed training: data parallelism [41, 45], model parallelism, hybrid parallelism [27, 50], and pipeline parallelism. Among them, data parallelism can be easily deployed without significant efficiency loss compared with the other schemes; thus, it is supported by many popular machine learning frameworks such as TensorFlow, MXNet, and PyTorch. Recent papers [27, 50] discussed the trade-offs between data parallelism and model parallelism and proposed hybrid approaches. Due to the space limit, we do not discuss the other approaches in detail. Given the popularity of data parallelism and its unresolved open problems, we focus on this model in this paper.
In data parallelism, each worker consumes training data independently and computes gradients based on its own selected data. The gradients obtained by distributed workers are then gathered and applied to model parameters during synchronization, and the updated model is subsequently used in the next iteration. Synchronization is both an essential part of parallelizing SGD and a critical factor in determining the training performance.
2.2 Existing Synchronization Approaches
There are three main categories of approaches to performing synchronization in data parallelism: Parameter Servers (PS), All-Reduce, and generalized decentralized approaches.
Training with PS involves one or more central nodes, called Parameter Servers, that gather gradients from all workers and send the updated model back to the workers. This straightforward approach enables relatively easy management of the training process. However, PS has limited scalability due to the communication bottleneck at the Parameter Servers. Parameter Hub removes this bottleneck by introducing a new network device that serves as the Parameter Server. While promising, it requires special hardware support that does not exist in common distributed environments (e.g., Amazon AWS).
In contrast to PS, All-Reduce replaces the central nodes with carefully scheduled global communication to achieve better parallelism. The state-of-the-art solutions [41, 45, 31] leverage Ring All-Reduce, an advanced all-reduce algorithm that effectively utilizes the bandwidth between computation devices. Specifically, workers are organized as a ring, and gradients are divided into chunks and passed around the ring in parallel. Different chunks of gradients are first accumulated at different workers and are then broadcast to all workers, again in parallel. This algorithm achieves parallelism close to the theoretical upper bound. Another algorithm, Hierarchical All-Reduce [7, 31], has been successfully scaled to large numbers of nodes and GPUs. Utilizing All-Reduce algorithms based on MPI [9, 14, 11] and NCCL, Horovod enables high-performance data parallelism and has proved effective and efficient; building on All-Reduce algorithms and high-performance implementations, researchers were able to use the fastest supercomputer, Summit, to train a deep learning model at exascale.
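The chunked ring schedule described above can be sketched as an in-memory simulation. The function below is an illustrative reconstruction, not an excerpt of NCCL, MPI, or Horovod; real implementations overlap these steps with actual network transfers.

```python
import numpy as np

def ring_all_reduce(vectors):
    """Simulate Ring All-Reduce over a list of equal-length arrays, one
    per worker. Returns the list of (identical) summed arrays that each
    worker ends up holding."""
    n = len(vectors)
    chunks = [list(np.array_split(v.astype(float).copy(), n)) for v in vectors]
    # Reduce-scatter: at step s, worker i sends chunk (i - s) mod n to its
    # ring successor; after n-1 steps, worker i holds the full sum of
    # chunk (i + 1) mod n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] += data
    # All-gather: circulate the fully reduced chunks around the ring so
    # that every worker obtains all of them.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data
    return [np.concatenate(chunks[i]) for i in range(n)]
```

Dividing the result by the number of workers yields the mean, which is what gradient averaging in data parallelism needs; each worker sends only about 2(n-1)/n of one model copy in total, which is why the algorithm utilizes bandwidth so well.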
Recently, the general decentralized approaches allow the point-to-point communication between workers by specifying a communication graph. Both PS and All-Reduce can be considered as special case of the communication graph. Two main algorithms proposed so far are Decentralized Parallel SGD (D-PSGD)  and Asynchronous D-PSGD (AD-PSGD) . In D-PSGD, every worker has its own version of parameters, and only synchronizes with its neighbors in the graph. As training proceeds, local information at a worker propagates along edges of the communication graph and gradually reaches every other worker, and thus models at different workers converge collaboratively to the same optimal point. The convergence rate has been proved to be similar to that of PS and All-Reduce . Like All-Reduce, D-PSGD does not suffer from communication bottleneck. However, it relies on a fixed communication topology, which may be susceptible to heterogeneity (more discussion in Section 2.3).
To tolerate heterogeneity, AD-PSGD  introduces a random communication mechanism on top of D-PSGD. Instead of synchronizing with all the neighbors specified by the communication graph, a worker randomly selects a single neighbor, and performs an atomic model averaging with the neighbor, regardless of whether they are in the same iteration or not. While the slow workers inevitably have staler parameters and will affect the training of the global model, it will not block the progress of other workers unless it is selected, which happens only occasionally.
2.3 Challenges and Problems
Communication With the continuously increasing compute capability (e.g., GPUs), communication has become more important and the focus of recent optimizations. The communication bottleneck in PS has been eliminated by approaches based on Ring All-Reduce, but the latter’s strongly synchronized communication pattern has lower heterogeneity tolerance. The generalized decentralized training captures both schemes and enables more optimization opportunities.
Heterogeneity With the communication problem largely mitigated, performance degradation in heterogeneous distributed environments becomes a major challenge. Also known as the straggler problem, it occurs due to performance differences among workers and fluctuations of communication speed and bandwidth. Heterogeneity is pervasive and can be caused by multiple factors such as resource sharing in data centers, paging, caching, and hardware faults. The trend of heterogeneity and the “long tail effects” have also been discussed and confirmed in other recent works [5, 12, 28, 24, 35]. A number of countermeasures for different synchronization schemes have been proposed, such as asynchronous execution, bounded staleness, backup workers, adjusting the learning rate of stale gradients, and sending accumulated gradients over bandwidth-scarce links only when they reach a significance threshold. Unfortunately, these techniques are mostly applicable to PS and decentralized training.
For All-Reduce, with its delicate communication schedule, it is difficult to apply these ideas, making it inherently vulnerable to heterogeneity. From the computation aspect, a global barrier is introduced by the All-Reduce operation, so the throughput of computation is determined by the slowest worker in the cluster. From the communication aspect, although the Ring All-Reduce algorithm is ideal in theory, the speed of sending chunks along the ring is bounded by the edge with the slowest connection.
Considering the delicacy of All-Reduce, and due to the well-known limits of PS, tolerating heterogeneity in decentralized approach is particularly important. Recent work Hop  presented the first detailed distributed protocol to support general decentralized training  with backup worker and bounded staleness to tolerate random slowdown. Although the results are promising, the proposed methods are essentially system techniques to mitigate the effects of heterogeneity. The alternative way is algorithmic technique, with AD-PSGD  as an excellent example. While AD-PSGD is both communication-efficient and tolerates heterogeneity well, the atomic model averaging step poses a key challenge of synchronization.
Synchronization Conflict Atomic model averaging requires that two model averaging operations be serialized if they involve the same worker. This requirement ensures fast convergence; a more relaxed semantics would increase the mutual influence of model updates from different workers, making the globally trained model more vulnerable to “staler” updates. Note that the problem is different from the synchronization relaxation in HOGWILD!, where a conflict happens when two workers try to update the same shared parameter. There, conflicts are expected to be rare, since HOGWILD! requires the cost function to be “sparse” and separable: workers only update a small fraction of the parameters in each iteration, and the sparsity ensures that updates from different workers rarely involve the same parameter. Therefore, that algorithm can still converge even without any locks. In AD-PSGD, however, the conflict is of a different nature and is expected to be frequent, because every worker can initiate model averaging and two of them are likely to end up choosing the same worker.
To ensure atomic model averaging and avoid deadlock, as exemplified in Figure 2(a), AD-PSGD divides the workers into two sets, an active set and a passive set, and requires that edges in the communication graph exist only between the two sets; i.e., neighbors of active workers can only be passive workers, and vice versa. This division is only possible when the communication graph is bipartite. In the implementation, only active workers are allowed to initiate model averaging, while passive workers can only respond; this is slightly different from the algorithm, in which every worker can initiate averaging. When an active worker needs to synchronize, it sends its model to the selected neighbor and blocks until it gets a response. A violation of atomicity can only happen when two active workers select the same passive worker, and it is avoided by letting the passive worker handle the requests one by one. Note that this scheme will incur deadlock if all workers are allowed to initiate model averaging or if the graph is not bipartite.
Besides restricting the communication graph between workers, the synchronization overhead is an even more crucial problem in a distributed environment. When training the VGG-16 model on CIFAR-10 and the ResNet-50 model on ImageNet using AD-PSGD on GPUs, Figure 2(b) shows that a large fraction of the time can be spent on synchronization. This is measured by comparing the per-iteration time of workers without synchronization (i.e., skipping the synchronization operation to obtain the pure computation time) and workers with synchronization enabled.
(a) An example deadlock: all workers first lock themselves (①), and then try to lock their neighbors in a cycle (②), which blocks forever. (b) Computation and synchronization ratio of different algorithms on different tasks.
3 Partial All-Reduce
Based on the results in Section 2.3, we mainly focus on the synchronization challenge for decentralized training. This section first presents a deep analysis of AD-PSGD, which motivates our key contribution, the Partial All-Reduce primitive.
3.1 AD-PSGD Insights
The AD-PSGD algorithm is shown in Figure 3. Similar to traditional training schemes such as PS and All-Reduce, in one iteration it first computes gradients and then performs synchronization; the difference is that it only synchronizes with a randomly selected neighbor, instead of all the workers. Therefore, the global barrier is removed, enabling higher training throughput and better heterogeneity tolerance.
In AD-PSGD, each worker has a local version of the parameters, which can be seen as a single concatenated vector, since the shapes of the individual tensors do not matter in synchronization. Concatenating all the workers' weight vectors together, the global state can be represented as a matrix W ∈ R^{N×n}, where N is the total size of the weights in the model and n is the number of workers.
In this formalization, one iteration of the AD-PSGD algorithm at a worker can be seen as one update to W. Formally, it can be represented as: W_{k+1} = W_k · T_k − γ · G_k. Here, G_k = G(Ŵ_k; ξ_k, i_k) is the update to W according to gradient computation at a random worker i_k, based on the previous version Ŵ_k of that worker's parameters and a random subset ξ_k of the training samples; it is non-zero only in the column of worker i_k. T_k is a synchronization matrix that represents the process of model averaging: W ← W · T_k.
Figure 4 shows an example of T_k, in which one worker performs a synchronization with another. More generally, for an averaging update between worker i and worker j, the non-zero entries of the matrix T_k are: T_k[i,i] = T_k[j,j] = T_k[i,j] = T_k[j,i] = 1/2, and T_k[l,l] = 1 for every other worker l.
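As a sanity check, the synchronization matrix above can be built and applied in a few lines of numpy. The orientation below (one column of W per worker, averaging via right-multiplication) and the concrete worker indices are our own illustrative choices.

```python
import numpy as np

def sync_matrix(n, i, j):
    """Pairwise synchronization matrix T for averaging workers i and j.

    Non-zero entries: T[i,i] = T[j,j] = T[i,j] = T[j,i] = 1/2, and
    T[l,l] = 1 for every other worker l. Right-multiplying the weight
    matrix W (one column per worker) by T replaces columns i and j with
    their average and leaves the rest untouched."""
    T = np.eye(n)
    T[i, i] = T[j, j] = 0.5
    T[i, j] = T[j, i] = 0.5
    return T

# Example: 4 workers; workers 0 and 2 average their models.
T = sync_matrix(4, 0, 2)
W = np.arange(8.0).reshape(2, 4)   # 2 weights, one column per worker
W_new = W @ T                      # columns 0 and 2 become their mean
```

The matrix is doubly stochastic by construction, which is one of the convergence requirements revisited in Section 3.3.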
In AD-PSGD, a conflict happens when two workers both select the same third worker for synchronization. In order to keep the atomicity of weight updating, the two operations need to be serialized. In the matrix formalization, assume that T_1 represents the synchronization between workers a and c, and T_2 represents the synchronization between workers b and c. Ignoring the gradient term in the update, the updated weights can be represented as: W_{k+2} = W_k · T_1 · T_2.
Figure 5 shows an example of two workers a and b requiring synchronization with the same worker c. The matrix on the right of Figure 5 shows the product of T_1 and T_2 as a fused synchronization matrix T_f = T_1 · T_2, which gives the final update over all the weights.
We can observe that the product is commutative in AD-PSGD: T_1 and T_2 can be exchanged (not mathematically, but logically), because the order of synchronization is determined by the order of lock acquisition, which is completely random. Based on the atomicity requirement, the key insight is that in AD-PSGD, although the two synchronizations can be mathematically fused, they have to be executed sequentially.
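The serialized execution of two conflicting synchronizations corresponds to a matrix product, which can be checked numerically. The small numpy sketch below (worker indices are our own choice for illustration) shows that the fused product remains doubly stochastic, yet the two orders differ mathematically, matching the observation that commutativity holds only logically.

```python
import numpy as np

def sync_matrix(n, i, j):
    """Pairwise averaging matrix between workers i and j (see Figure 4)."""
    T = np.eye(n)
    T[i, i] = T[j, j] = T[i, j] = T[j, i] = 0.5
    return T

# Three workers: T1 averages workers 0 and 2, T2 averages workers 1 and 2.
T1 = sync_matrix(3, 0, 2)
T2 = sync_matrix(3, 1, 2)
fused = T1 @ T2   # serialized execution: first T1, then T2 (W @ T1 @ T2)
```

Either order is an acceptable outcome for the algorithm, since lock acquisition order is random; the point of the next subsection is that both orders can be replaced by a single approximate averaging step.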
3.2 Partial All-Reduce and Group Fusion
We propose Group Fusion: fusing multiple synchronizations approximately into one with reduced synchronization cost. In the precise fused synchronization, according to the fused matrix, several workers update their weights to a certain linear combination of the weights of each worker in the group.
Next, we seek a proper approximation of the fused synchronization that admits an efficient implementation. Our goal is to leverage Ring All-Reduce, the high-performance algorithm that computes the mean of several copies of the weights in near-optimal time. We cannot directly use All-Reduce to execute the synchronization among the three workers in Figure 5, because All-Reduce produces the same result for each worker, which differs from the outcome produced by multiplying a sequence of synchronization matrices in a certain order (shown on the right of Figure 5).
Thanks to the commutative property of the T's, our key idea is to slightly relax the entries of the fused matrix so that All-Reduce can be leveraged to perform the synchronization it specifies. Generally, assume a group G of workers performs a single fused synchronization together; the synchronization modifies the weights of all the workers in G. The approximate fused matrix T_G is defined by the following non-zero entries: T_G[i,j] = 1/|G| for all i, j ∈ G, and T_G[l,l] = 1 for every worker l ∉ G.
Figure 6 shows an example of T_G with the modified entries. Although the example only involves a few workers, the group can contain an arbitrary number of workers.
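A numpy sketch of the approximate matrix T_G (group membership chosen arbitrarily for illustration) confirms that applying it averages the group's columns, which is exactly what an All-Reduce restricted to the group computes, while leaving other workers untouched.

```python
import numpy as np

def p_reduce_matrix(n, group):
    """Approximate fused synchronization matrix T_G.

    Entries: T[i, j] = 1/|group| for all i, j in the group, and
    T[l, l] = 1 for workers outside it. Right-multiplying W by T_G
    replaces every group column with the group mean."""
    T = np.eye(n)
    g = list(group)
    for i in g:
        T[i, i] = 0.0          # clear the identity entry for group members
    for i in g:
        for j in g:
            T[i, j] = 1.0 / len(g)
    return T

# Example: 5 workers; workers 0, 1, and 3 form the group.
T_G = p_reduce_matrix(5, {0, 1, 3})
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 5))    # 4 weights, one column per worker
W_new = W @ T_G
```

Note that T_G is symmetric and doubly stochastic, which is what makes the convergence argument in Section 3.3 go through.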
Applying T_G is equivalent to performing All-Reduce (with averaging) within the group G. We define this operation as Partial All-Reduce, or P-Reduce, to distinguish our primitive from the conventional All-Reduce in deep learning training, which is performed among all workers. Based on P-Reduce, we present a formal description of the new algorithm in Figure 7.
Compared to the original AD-PSGD algorithm, there are two key differences. First, in Step 3, each worker can randomly generate a group that may be larger than 2, as long as the group contains the worker itself. The AD-PSGD group of size 2 (one worker randomly selects a neighbor) becomes a special case. This essentially enlarges the unit of synchronization to groups of any size. Larger groups have two implications: (1) they potentially enable faster propagation of model parameter updates among workers, speeding up convergence; and (2) they increase the chance of conflicts. Thus, the new algorithm allows the system to explore this trade-off. The second difference from AD-PSGD is that the synchronization is performed by the new P-Reduce primitive involving all the workers in the group, instead of by individual messages among workers. This directly reduces the cost of synchronization.
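The modified algorithm can be simulated on a toy objective. The sketch below is our own illustration with a hypothetical quadratic loss, not the paper's implementation: each worker holds its own parameter column, and an iteration is a local gradient step followed by a P-Reduce (averaging) over a random group that contains the updating worker.

```python
import numpy as np

rng = np.random.default_rng(1)
n_workers, dim, lr = 8, 4, 0.1
target = np.array([1.0, -2.0, 0.5, 3.0])   # minimizer of the toy loss
W = rng.normal(size=(dim, n_workers))      # one parameter column per worker

for it in range(500):
    i = int(rng.integers(n_workers))       # worker finishing an iteration
    grad = 2 * (W[:, i] - target)          # local gradient of ||w - target||^2
    W[:, i] -= lr * grad
    k = int(rng.integers(2, n_workers + 1))   # random group size >= 2
    others = [w for w in range(n_workers) if w != i]
    group = [i] + list(rng.choice(others, size=k - 1, replace=False))
    # P-Reduce: every group member's column becomes the group mean.
    W[:, group] = W[:, group].mean(axis=1, keepdims=True)

spread = float(np.max(np.abs(W - W.mean(axis=1, keepdims=True))))
err = float(np.max(np.abs(W.mean(axis=1) - target)))
```

Despite never taking a global barrier, all columns drift toward the common optimum, and the averaging steps keep them close to one another.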
Although group fusion inspired the idea of P-Reduce, the algorithm in Figure 7 does not need to fuse groups during execution. In fact, the effect of fusing two groups of size 2 in AD-PSGD is reflected in generating a group of arbitrary size in Step 3 of Figure 7. As a result, Ripples only needs to deal with group generation, not group fusion. The system still needs to satisfy the atomicity requirement: if two groups do not share common workers, their P-Reduces can be executed concurrently; in an unrealistic but ideal situation, none of the generated groups would conflict. Compared to All-Reduce, P-Reduce retains the efficient implementation while avoiding the global barrier.
3.3 Convergence Property Analysis
To guarantee that models at different workers converge to the same point, three requirements for the synchronization matrices are proposed in AD-PSGD. In the following, we show that although T_G is not exactly the same as the result of multiplying a sequence of pairwise synchronization matrices in a certain order, our definition of T_G satisfies all three convergence properties, just as AD-PSGD does.
Doubly stochastic averaging T_k is doubly stochastic for all k: the sum of each row and each column equals 1, in both the pairwise matrices and our T_G.
Spectral gap There exists a ρ < 1 such that max{|λ_2(E[T_k^T T_k])|, |λ_n(E[T_k^T T_k])|} ≤ ρ. Basically, the second-largest eigenvalue magnitude of the expected synchronization is bounded away from 1, and E[T_k] can be regarded as a Markov transition matrix. According to expander graph theory, the spectral gap condition is fulfilled if the graph corresponding to the random walk is connected. That means an update on any worker can be propagated through a sequence of groups to the whole graph. When designing the group generation methods in the following sections, we always keep this property in mind to guarantee convergence.
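The spectral gap condition can be checked numerically for a given communication scheme. The sketch below (our own illustration) builds the expected synchronization matrix for uniformly random pairwise averaging over a graph's edges and inspects its eigenvalues; a connected ring exhibits a gap, while a disconnected graph does not.

```python
import numpy as np

def expected_sync_matrix(n, edges):
    """Expectation E[T] when each synchronization averages a uniformly
    chosen edge (i, j) of the communication graph."""
    E = np.zeros((n, n))
    for (i, j) in edges:
        T = np.eye(n)
        T[i, i] = T[j, j] = T[i, j] = T[j, i] = 0.5
        E += T
    return E / len(edges)

# Connected 6-worker ring: eigenvalue 1 is simple, so there is a gap.
ring = [(i, (i + 1) % 6) for i in range(6)]
E = expected_sync_matrix(6, ring)
eig = np.sort(np.abs(np.linalg.eigvalsh(E)))[::-1]
```

The same check on a disconnected graph yields a repeated eigenvalue 1 (no gap), reflecting that updates can never cross between the components.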
Dependence of random variables T_k is a random variable dependent on i_k (the worker initiating the synchronization), but independent of W_k and ξ_k. So far, the only requirement on the generated group is that it contains the initiating worker i_k; the group is generated randomly without any dependence on W_k or ξ_k. Therefore, this condition is fulfilled.
4 Group Generation and Conflict Detection
With P-Reduce, a group of workers becomes the basic unit of the synchronization procedure. As a collective operation, P-Reduce must be called by all workers in the group, which means that all group members need the same group information to initiate it. Establishing such a consistent view of the group among all of its members is non-trivial. This section discusses how to generate the groups and how to serialize conflicting groups.
4.1 Group Generator
In Figure 7, each worker needs to randomly generate a group. This can be performed by each worker based on the communication graph with randomly selected neighbors. The workers in each group then collectively perform a P-Reduce. The system needs to ensure atomicity: P-Reduces of groups with overlapping workers must be serialized. This can be implemented in either a centralized or a distributed manner. In general, a distributed protocol involves multiple rounds of communication and coordination between workers. For simplicity, Ripples implements a centralized component, onto which we also offload the group generation functionality from the workers; we therefore call it the Group Generator (GG). When a worker needs to perform a synchronization, it simply contacts the GG without any group information, and the GG selects the group on behalf of the worker while maintaining atomicity. In the following, we explain the protocol using an example. Note that the communications between workers and the GG are only small messages, and do not introduce a communication or scalability bottleneck.
In Figure 8, we consider four workers among a total of 8. In the beginning, two of them (call them w_1 and w_2) finish an iteration and need to perform a synchronization. Instead of generating groups locally, they both send a synchronization request to the GG, indicated by ① and ②. The GG maintains atomicity with a local lock vector, a bit vector indicating whether each worker is currently performing a P-Reduce; it is initialized to all 0s. Assume that no other synchronization is being performed in the system, and the GG receives the request from w_1 first. The GG randomly generates a group on behalf of w_1 (③) and sets the corresponding bits in the lock vector (④). Then, the GG notifies the workers in the group (⑤) so that they can collectively perform the P-Reduce. Later, the GG receives the synchronization request from w_2 and randomly generates another group. Unfortunately, this group conflicts with the first one due to two overlapping workers, and needs to be serialized. We achieve this by simply blocking the group and storing it in a pending-group queue (⑥). In the meantime, the members of the first group receive the notifications from the GG and perform the P-Reduce (⑦). They also acknowledge the GG to release the locks (⑧). After the locks for the first group are released, the pending group can proceed after setting the corresponding bits in the lock vector.
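The lock-vector protocol can be summarized in a small single-threaded sketch. This is a hypothetical simplification: the real GG handles network messages, group generation, and concurrency, all of which are omitted here.

```python
from collections import deque

class GroupGenerator:
    """Sketch of the centralized Group Generator's serialization logic.

    One lock bit per worker; a group whose members are all unlocked may
    start its P-Reduce immediately, otherwise it waits in a pending
    queue until a release frees its members."""
    def __init__(self, n_workers):
        self.locked = [False] * n_workers
        self.pending = deque()

    def _try_start(self, group):
        if any(self.locked[w] for w in group):
            return False
        for w in group:
            self.locked[w] = True          # set bits in the lock vector
        return True

    def request_sync(self, group):
        """Returns True if the group's P-Reduce may start now,
        False if it was queued behind a conflicting group."""
        if self._try_start(group):
            return True
        self.pending.append(group)
        return False

    def release(self, group):
        """Acknowledgement after a P-Reduce finishes: free the locks and
        start any pending groups that are now conflict-free."""
        for w in group:
            self.locked[w] = False
        started = []
        for _ in range(len(self.pending)):
            g = self.pending.popleft()
            if self._try_start(g):
                started.append(g)
            else:
                self.pending.append(g)
        return started
```

Non-overlapping groups proceed concurrently, while overlapping ones are serialized in request order, mirroring steps ④, ⑥, and ⑧ of the example.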
4.2 Decentralized Static Scheduler
As we have seen in the example in Figure 8, two overlapping groups need to be serialized to ensure atomicity, causing delay in the execution. We can eliminate the conflict by statically scheduling the groups to be non-overlapping.
We design a conflict-free schedule as shown in Figure 9. There are 16 workers in total, and the schedule is periodic with a cycle length of 4. Every row corresponds to an iteration, and colored blocks with group indices indicate the grouping of workers. For example, in the first row, four workers are colored yellow with the index “G1”, which means that these 4 workers form one group in every iteration that maps to this row of the cycle. Group indices do not indicate the sequence of execution; in fact, groups in the same row are expected to execute concurrently. In addition, some workers do not participate in synchronization in certain iterations, shown by gray blocks marked with a hyphen "-". Skipping synchronization decreases the frequency of communication and thus shortens the training time; this technique has been proved helpful in [29, 49].
To implement static scheduling, a naive way is to store the schedule table in the GG, and workers can access it by contacting the GG. Alternatively, we can store the table inside each worker, saving a round trip of communication between the worker and the GG. Since every worker has the same schedule table stored locally, a consistent view of the groups is naturally ensured.
|Phase||L.W. 0||L.W. 1||L.W. 2||L.W. 3|
|0||Sync with L.W. 0s on ALL NODES||No sync||Sync with L.W. 3||Sync with L.W. 2|
|1||Sync L.W. 0-3|
|2||Sync with L.W. 3||Sync with L.W. 1 on the opposite node on the ring||No sync||Sync with L.W. 0|
|3||Sync L.W. 0-3|
Notes: This table shows the rules that generate the schedule for workers running on one node; the rules are the same for all 4 nodes. L.W. i stands for Local Worker i, the i-th worker on this node. The schedule has 4 phases, each corresponding to one training step, and repeats itself after every 4 steps.
In fact, storing a table is unnecessary, since the schedule is generated in a rule-based manner. For example, our previously proposed schedule is based on a worker's rank within its node. For the case of 4 workers per node, the rule of scheduling is shown in Figure 10. In this way, a worker can simply call a local function to obtain its group in an iteration. The logic of the rule function guarantees that the schedule is consistent among all the workers, and a conflict-free static schedule is therefore enforced.
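One possible shape of such a rule function, reconstructed from the phase table in Figure 10 under the assumption of 4 nodes with 4 local workers each (the function and parameter names are ours), is sketched below.

```python
def group_of(rank, step, n_nodes=4, workers_per_node=4):
    """Return the sorted list of global ranks in this worker's group for
    this step, or None if the worker skips synchronization.
    Hypothetical reconstruction of the rule-based static schedule."""
    node, lw = divmod(rank, workers_per_node)
    phase = step % 4
    if phase in (1, 3):                        # all local workers on this node
        return [node * workers_per_node + w for w in range(workers_per_node)]
    if phase == 0:
        if lw == 0:                            # L.W.0 across ALL nodes
            return [n * workers_per_node for n in range(n_nodes)]
        if lw == 1:
            return None                        # no sync this phase
        other = 5 - lw                         # pairs L.W.2 <-> L.W.3
        return sorted([rank, node * workers_per_node + other])
    # phase == 2
    if lw in (0, 3):
        other = 3 - lw                         # pairs L.W.0 <-> L.W.3
        return sorted([rank, node * workers_per_node + other])
    if lw == 1:                                # L.W.1 with the opposite node's L.W.1
        opp = (node + n_nodes // 2) % n_nodes
        return sorted([rank, opp * workers_per_node + 1])
    return None                                # L.W.2 skips this phase
```

Every member of a group computes the same group locally, and groups within a phase are disjoint, so no lock vector or pending queue is needed.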
4.3 Discussion: Random vs. Static
Although static scheduling can ideally eliminate conflict and speed up execution, randomized group generation is more suitable for heterogeneous environment. We compare the different characteristics of the two approaches below.
Random GG is centralized, but it is different from Parameter Servers in that it does not involve massive weight transfer. It only costs minor CPU and network resources compared with gradient accumulation or weight synchronization. In our experiment, it is found that GG can be put on a node together with workers without incurring any performance loss. However, in random GG, contacting the GG induces communication overhead, and conflicting groups need to be serialized, resulting in additional wait time.
In contrast, a GG implemented as a static scheduler has no communication latency. With a proper design of the rule function, it can not only fully parallelize synchronization, but also utilize the architecture of the worker devices to accelerate every single P-Reduce operation; for example, it can schedule more intra-node synchronizations and reduce the number of large-scale inter-node synchronizations. However, the rule function is only pseudo-random, which breaks the strict convergence condition of AD-PSGD, although the resulting algorithm still converges well in our experiments.
When a certain worker is slower than the others, the original AD-PSGD algorithm is able to tolerate the slowdown. The static scheduler does not have such ability, as the schedule is fixed: synchronizations with the slow worker will slow down the whole training. With random GG, the stragglers' effect can be largely ameliorated. A well-designed group generation strategy can ensure that, at any time, most workers are able to proceed without depending on the few slow workers, thus relieving the slowdown problem. Also, slowdown detection and conflict avoidance mechanisms, discussed in the following section, can be easily integrated into random GG, making it better adapted to heterogeneous environments.
5 Smart Randomized Group Generation
The basic implementation of the scheduler in GG is to always randomly generate a group, as specified in Step 3 of Figure 7. With the centralized GG, our objective is to leverage global and runtime information to generate groups more intelligently, to: (1) avoid conflicts; and (2) embrace heterogeneity. For example, a worker may have already been assigned to several groups and thus have several pending P-Reduces to perform; if that worker is selected for yet another group, the other workers in it will have to wait for all the previously scheduled P-Reduces to finish. Similarly, when a slow worker is in a group, the whole group may be blocked by this worker. Moreover, performing P-Reduce in different groups takes different amounts of time due to architectural factors, and group selection can even introduce contention on communication links. Based on these insights, we propose intelligent scheduling mechanisms for GG to further improve performance.
5.1 Conflict Avoidance by Global Division
An intuitive way of reducing conflicts is to keep a Group Buffer (GB) for each worker: the ordered list of groups that include that worker. When a group is formed, its information is inserted into the GB of every worker involved. A consensus group order across all GBs is easily ensured, since the GG, as a centralized structure, generates groups serially. Based on the GB, when the GG receives a synchronization request from a worker, it first looks up the worker's GB: if it is empty, a new group is generated for the worker; otherwise, the first existing group in the worker's GB serves as the selected group.
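The GB lookup above can be sketched as follows. This is a minimal stand-in, assuming a simple in-process dictionary of deques; class and method names (`GroupGenerator`, `request_group`) are illustrative, not the paper's API:

```python
from collections import defaultdict, deque
import random

class GroupGenerator:
    """Centralized GG with a per-worker Group Buffer (GB)."""

    def __init__(self, num_workers, group_size=3):
        self.num_workers = num_workers
        self.group_size = group_size
        self.gb = defaultdict(deque)  # worker id -> ordered pending groups

    def request_group(self, worker):
        # Serve the first pending group if the worker is already scheduled.
        if self.gb[worker]:
            return self.gb[worker].popleft()
        # Otherwise randomly generate a new group containing the worker.
        others = [w for w in range(self.num_workers) if w != worker]
        group = tuple(sorted([worker] + random.sample(others, self.group_size - 1)))
        # Record the group in the GBs of the *other* members, so their next
        # request is served from the buffer instead of spawning a new group.
        for member in group:
            if member != worker:
                self.gb[member].append(group)
        return group
```

Since the GG is single-threaded and serial, every GB sees groups in the same order, which is what guarantees the consensus order mentioned above.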
The main insight is that P-Reduce is a collective operation: if worker i initiates a synchronization with worker j, i.e., i and j are in the same group, the P-Reduce of this group is only performed when j also requests its synchronization. Therefore, this simple mechanism avoids generating a new group for j when it is already scheduled (and ready) to execute a P-Reduce. However, with random group generation, nothing prevents the selection of j into a different group not initiated by i. In this case, the overlapping groups and the corresponding P-Reduce operations are still serialized.
Inspired by static scheduling, we propose an operation called Global Division (GD) that divides all current workers with empty GBs into several non-conflicting groups. A GD is called whenever a worker needs to generate a group and its GB is empty. A simple example is shown in Figure 11. In total we have 4 workers, and initially all GBs are empty. On the left, random selection shows a possible scenario without the GD optimization: the groups are randomly generated, so if a group initiated by one worker includes two of the others, another group initiated by the remaining worker can still include one of them as an overlapping worker, thus introducing a conflict. On the right, with GD, when a worker requests a group, the GG not only generates one for it, but also randomly generates groups for the other idle workers (just one more group in this example, as there are only 4 workers). In this way, when another worker later requests a group, the GG directly provides the non-conflicting group generated before.
It is worth emphasizing two conditions. First, a GD only generates groups for the currently "idle" workers (including the caller) that are not assigned to any group; thus, when a worker requests a group, groups may be generated in the above manner for just a subset of workers. Second, a GD is only called when the initiator's GB is empty; otherwise the first group in the initiator's GB is returned.
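Under these two conditions, a GD amounts to a random partition of the idle workers. A minimal sketch (the helper name and the handling of a leftover worker are our assumptions):

```python
import random

def global_division(idle_workers, group_size):
    """Global Division: randomly partition the idle workers (those with empty
    GBs) into non-conflicting groups of roughly `group_size` members."""
    workers = list(idle_workers)
    random.shuffle(workers)
    groups = [workers[i:i + group_size] for i in range(0, len(workers), group_size)]
    # Merge a trailing singleton into the previous group, since a group needs
    # at least two members to perform a meaningful P-Reduce.
    if len(groups) > 1 and len(groups[-1]) == 1:
        groups[-2].extend(groups.pop())
    return groups
```

The resulting groups are pairwise disjoint by construction, so none of the P-Reduces scheduled by one GD call can conflict with another from the same call.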
Indeed, the proposed schemes to avoid conflicts make group generation not fully random. However, we argue that the effects are not critical. For the first optimization, based on GB, we only reuse the existing group involving the worker that is requesting synchronization; this group is still generated in a fully random manner (if we do not use GD). For GD, we essentially generate a random group partition over all idle workers, triggered by the first worker in the set to initiate a synchronization. So the difference is between randomly generating each group and generating a random partition. We acknowledge that they are not the same, but believe that our method does not significantly damage the randomness; we leave the theoretical analysis as future work. Based on the results shown in our evaluation, these ideas work very well in practice.
5.2 Architecture-Aware Scheduling
If the groups are randomly divided, multiple groups may need to use the network bandwidth at the same time, causing congestion, which is suboptimal from an architectural perspective. In fact, All-Reduce is fast because it has a balanced utilization of the different connections between devices, such as InfiniBand HCA cards, QPI paths (the Intel QuickPath Interconnect between CPU sockets within one node), and PCIe slots. To better utilize the bandwidth of these different connections, we propose a new communication pattern called Inter-Intra Synchronization that can be naturally incorporated with GD. Here, a node, commonly running multiple workers, is considered a unit. The scheme has an Inter phase and an Intra phase.
Inter phase. One worker on each node is selected as the Head Worker of that node. All Head Workers are randomly divided into several groups that synchronize in an inter-node manner. At the same time, the non-Head Workers are randomly assigned to groups containing only local workers on the same node. In this way, only the Head Workers generate inter-node communication, while the others incur only local communication, which can be carefully arranged to avoid congestion on PCIe switches or QPI.
Intra phase. Workers within a node synchronize with all other local workers collectively. In other words, this phase is a P-Reduce among all the workers in the same node, without any inter-node communication. Following the Inter phase, the updates from workers on other nodes can be quickly propagated among local workers in this phase.
The two phases can be realized easily with GD operations. Specifically, two groups are inserted into the GB of each worker, each generated by a GD: one mainly among Head Workers on different nodes (the Inter phase), the other purely among local workers on the same node (the Intra phase). An example is shown in Figure 12.
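The group structure produced by one Inter-Intra round can be sketched as below. The function shape is illustrative (in the paper the two phases are scheduled through GDs inside the GG, not by a standalone helper), and the head-worker choice here is simply random:

```python
import random

def inter_intra_groups(nodes, group_size=3):
    """One Inter-Intra round. `nodes` maps node id -> list of worker ids on
    that node. Returns (inter_phase_groups, intra_phase_groups)."""
    heads, locals_per_node = [], {}
    for node, workers in nodes.items():
        head = random.choice(workers)          # one Head Worker per node
        heads.append(head)
        locals_per_node[node] = [w for w in workers if w != head]

    # Inter phase: Head Workers form inter-node groups; the remaining
    # workers form purely local groups on their own node.
    random.shuffle(heads)
    inter = [heads[i:i + group_size] for i in range(0, len(heads), group_size)]
    for workers in locals_per_node.values():
        if len(workers) >= 2:
            inter.append(workers)              # local-only group

    # Intra phase: each node synchronizes all of its local workers together.
    intra = [list(workers) for workers in nodes.values()]
    return inter, intra
```

Only the `heads` groups cross node boundaries, so at most one worker per node uses the inter-node links in the Inter phase, while the Intra phase touches no inter-node link at all.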
It is worth noting that the proposed Inter-Intra Synchronization is not the same as hierarchical All-Reduce, which is mathematically equivalent to an All-Reduce among all workers in one step, accelerated by the hierarchical architecture. After an All-Reduce, all workers end up with the same weights. In contrast, the Inter-Intra synchronization strategy spreads multiple partial updates through P-Reduce in an architecture-aware and controlled manner; thus, workers end up with different weights after the synchronization.
5.3 Tolerating Slowdown
The mechanisms proposed so far are mainly effective in a homogeneous execution environment but do not help with slowdowns: as mentioned earlier, slow workers involved in groups can block both their own group and other groups.
We propose a simple solution by keeping track of execution information in the GG. Specifically, an additional counter is kept in the GG for each worker, recording how many times the worker has requested a group. When a worker is significantly slower than the others, the value of its counter will be much smaller than the average. Since a GD starts when a worker with an empty GB requests a group, an additional rule filters the workers that can receive a group in the division: a worker's counter, c_w, should not be significantly smaller than the initiator's counter, c_init, i.e., c_w >= c_init - d, where d is a constant that can be adjusted.
This filter works as follows. When a fast worker initiates a GD, only fast workers are assigned to groups, avoiding the problem of being blocked by slow workers. When a slow worker initiates a division, some faster workers may be involved to synchronize with it; but the selected workers have empty buffers, as required by the GD operation, so neither the fast workers nor the slow worker needs to wait long for the synchronization. With this filter rule, the effect of slow workers is minimized.
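The filter rule reduces to a one-line predicate over the counters. A sketch, assuming the difference-threshold form of the rule described above (the function name and default threshold are ours):

```python
def eligible_for_division(counters, initiator, d=2):
    """Slowdown filter for GD: keep only workers whose request counter is not
    significantly smaller than the initiator's, i.e. c_w >= c_init - d."""
    c_init = counters[initiator]
    return [w for w, c in counters.items() if c >= c_init - d]
```

When a fast worker (high counter) initiates, slow workers fail the test and are skipped; when a slow worker initiates, everyone passes, so it can still synchronize with idle fast workers.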
6 Implementation
We implement the proposed algorithms and protocols using TensorFlow and its extensions. Specifically, Ripples is implemented as customized TensorFlow operators.
6.1 Partial All-Reduce
Partial All-Reduce is implemented as a GPU TensorFlow operator. It takes the variables and the group as input tensors, and outputs a new tensor representing the result of the synchronization. NCCL is used to execute the All-Reduce, and MPI is used to help create the NCCL communicator. We use a simple but effective strategy for faster P-Reduce: all weights are flattened and concatenated into one tensor before the operation, then separated and reshaped afterwards.
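The flatten/restore round trip can be sketched with NumPy as a stand-in for the GPU tensor operations inside the TensorFlow op (function names are illustrative):

```python
import numpy as np

def flatten_weights(weights):
    """Concatenate all weight arrays into one flat buffer for a single
    P-Reduce, remembering the shapes so the result can be restored."""
    shapes = [w.shape for w in weights]
    flat = np.concatenate([w.ravel() for w in weights])
    return flat, shapes

def restore_weights(flat, shapes):
    """Split the reduced buffer and reshape each slice back into a weight."""
    out, offset = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        out.append(flat[offset:offset + size].reshape(shape))
        offset += size
    return out
```

Reducing one large buffer instead of many small tensors amortizes the per-call launch and communication overhead, which is why the concatenation pays off.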
NCCL bounds the number of communicators that can exist simultaneously, yet it is inefficient to destroy all communicators after use. To save the time of creating communicators, a distributed cache is used, which provides consistent presence of communicators: it does not remove cached items, but simply stops caching once its size exceeds a threshold.
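The caching policy (never evict, stop caching past a threshold) can be sketched as below; the class is a simplified single-process stand-in for the distributed cache, and `create_fn` stands in for the expensive NCCL communicator setup:

```python
class CommunicatorCache:
    """Cache of communicators keyed by (sorted) group membership.
    Items are never evicted; caching simply stops past `max_size`."""

    def __init__(self, create_fn, max_size=1000):
        self.create_fn = create_fn   # expensive communicator construction
        self.max_size = max_size
        self.cache = {}

    def get(self, group):
        key = tuple(sorted(group))   # same workers -> same communicator
        if key in self.cache:
            return self.cache[key]
        comm = self.create_fn(key)
        if len(self.cache) < self.max_size:  # stop caching beyond threshold
            self.cache[key] = comm
        return comm
```

Never evicting keeps the "consistent presence" property: a group that was cached on one worker stays cached everywhere, so no worker is surprised by a missing communicator mid-P-Reduce.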
6.2 Group Generator
The Group Generator is a centralized controller serving all workers, and it requires low-latency remote function calls, for which RPC is used. The server is a lightweight Python program built on the gRPC Python package, with the core of the algorithms written in C++. It can be started and stopped easily.
The client is wrapped as another TensorFlow Python operator. One function implements the static scheduler according to the scheduling rules; another implements the dynamic group generator backed by the centralized GG, also over gRPC. We can easily switch between the group generation methods using execution flags.
7 Evaluation
7.1 Evaluation Setup
7.1.1 Hardware Environment
We conduct our experiments on the Maverick2 cluster at TACC. Maverick2 is managed by SLURM. In the GTX partition, each node is configured as shown in the table in Figure 14.
Model: Super Micro X10DRG-Q motherboard
Processor: 2 x Intel Xeon E5-2620 v4
GPUs: 4 x NVIDIA GTX 1080 Ti
7.1.2 Dataset and Model
To test the performance of Ripples and compare it with other works, we train models on both medium and large datasets. First, we train the VGG-16 model on the CIFAR-10 image classification dataset; the model's trainable weights are 32-bit floating-point values, and a typical training setup is used for the SGD optimizer's learning rate and the per-worker batch size. Second, we train ResNet-50 on the ImageNet dataset, whose images are classified into 1000 classes, to verify that Ripples is a valid algorithm that converges well across different tasks. A Momentum optimizer is used, and the initial learning rate decays on a fixed epoch schedule. The training models are implemented using TensorFlow.
7.1.3 Baseline Setup
Parameter Server is already integrated into TensorFlow. We implement AD-PSGD using the remote variable access supported by the TensorFlow distributed module. Horovod is adopted as a high-performance state-of-the-art baseline, which significantly outperforms many other implementations of All-Reduce; it is configured with NCCL2 to achieve the best All-Reduce speed, and we tune the size of its fusion buffer for better utilization of the InfiniBand network. In all test runs, each worker occupies a whole GPU. For better affinity, we bind each worker's process to the CPU socket its GPU is directly attached to. In random GG, the group size is 3.
We use the time it takes for the model (randomly initialized with a fixed random seed across different experiments) to reach a target accuracy as the performance metric on VGG-16. We also inspect the loss-versus-iteration curve and the average duration of an iteration to analyze the effect of our optimizations.
7.2 Interactions between Computation, Communication and Convergence
To better understand how much time communication takes in deep learning training relative to computation, we first measured the computation time with different batch sizes and the communication time with different settings (note that the size of the weights to be synchronized is independent of the batch size). Figure 15 shows the comparison. Because of better utilization of SIMD devices, computation is slightly more efficient when the batch size is larger. Interestingly, an All-Reduce among workers within a single node, or among workers placed on separate nodes, is significantly faster than one spanning multiple nodes that each run multiple workers.
Although reducing communication by lowering the synchronization frequency can increase training throughput, it makes convergence harder. Figure 16 presents a simple experiment showing that the number of iterations needed to converge increases as the communication frequency drops. To obtain the best convergence time, a proper level of synchronization intensity is necessary; this result shows that we cannot improve AD-PSGD simply by enlarging the amount of computation between synchronizations.
7.3 Speedup in Homogeneous Environment
In a homogeneous environment, VGG-16 trained on CIFAR-10 is used to compare Ripples, with its different ways of group generation, against Parameter Server, All-Reduce, and AD-PSGD. The per-iteration and convergence-time speedups are shown in Figure 17. Ripples is much faster than Parameter Server and the original AD-PSGD. All-Reduce is also much faster than these two baselines, due to the high throughput provided by Horovod. However, Ripples with either the static scheduler or smart GG outperforms even All-Reduce, thanks to its smaller synchronization groups and architecture-aware scheduling.
As shown in Figure 18, AD-PSGD has the best convergence speed in terms of number of iterations. All-Reduce is mathematically equivalent to Parameter Server; they differ slightly due to random sampling and competition in synchronization. Ripples with the static scheduler has similar convergence speed to Parameter Server, but gains speedup from its higher throughput. The number of iterations with random GG is smaller than with smart GG, which in turn is smaller than with static scheduling; this reflects the decreasing amount of randomness from random GG to smart GG to static scheduling.
These results further demonstrate the trade-off between execution efficiency and statistical efficiency. Although AD-PSGD needs fewer iterations to converge to the same error, the execution time of each iteration is seriously affected by synchronization overhead, as shown in Figure 2(b). Ripples successfully exploits this trade-off by slightly sacrificing statistical efficiency, i.e., running more iterations (0.96x vs. 0.78x), mainly due to the reduced randomness, to gain significant speedup in per-iteration execution time (5.10x vs. 1.18x), eventually leading to an overall execution-time speedup (5.26x vs. 1.42x).
7.4 Heterogeneity Tolerance
One of the key advantages of Ripples is its better tolerance of heterogeneity. Using the same setup as Section 7.3, heterogeneity is simulated by adding 2x or 5x the normal iteration time as sleep, every iteration, on one specific worker, the slow worker. The results are shown in Figure 19. In terms of the capability to tolerate slowdown, the 2x-slowdown results show that: (1) random GG (3.03x vs. 2.13x) degrades slightly more than AD-PSGD (1.42x vs. 1.37x), but remains much faster thanks to the more efficient P-Reduce synchronization primitive; (2) smart GG (5.26x vs. 4.23x) is better than random GG (3.03x vs. 2.13x); and (3) while both suffer under slowdown, Ripples static (5.01x vs. 2.47x) is still considerably better than All-Reduce (4.27x vs. 1.66x). We also see that with 2x slowdown, All-Reduce is still faster than AD-PSGD, although much slower than itself in the homogeneous setting; with 5x slowdown, All-Reduce achieves only a little more than half the performance of AD-PSGD. Random GG degrades slightly more than AD-PSGD because the larger group size (3) in Ripples increases the chance of conflicts; nevertheless, smart GG outperforms AD-PSGD by a large margin.
7.5 Validation on Large Model and Dataset
This section examines the training performance of ResNet-50 on ImageNet by running only 10 hours of training for each algorithm. We conduct the experiment in this manner to avoid affecting other experiments on the cluster, as the TACC supercomputer is shared by thousands of researchers.
The training accuracy and loss curves for the 10-hour executions are shown in Figure 20. Note that the execution environment is homogeneous, without slower workers. All-Reduce performs best in this case, followed by Ripples with smart GG, while AD-PSGD suffers from throughput issues. For ResNet-50 on ImageNet, the upper bound on the effective batch size is very large; therefore, although we make our best effort to enlarge the batch size, All-Reduce obtains a much bigger numerical convergence advantage per iteration, while Ripples can train more iterations in the same time. Smart GG performs better than the static scheduler because it has more randomness in synchronization. Observing the loss curves, Ripples still has competitive convergence speed compared with the state-of-the-art approach, All-Reduce, on large datasets.
8 Conclusion
In this paper, we propose Ripples, a high-performance heterogeneity-aware asynchronous decentralized training approach. To reduce synchronization cost, we propose a novel communication primitive, Partial All-Reduce, that allows a large group of workers to synchronize quickly. To reduce synchronization conflicts, we propose static group scheduling for homogeneous environments and simple techniques (Group Buffer and Global Division) that avoid conflicts with only slightly reduced randomness. Our experiments show that in a homogeneous environment, Ripples is faster than the state-of-the-art implementation of All-Reduce, Parameter Server, and AD-PSGD. In a heterogeneous setting, Ripples retains its speedup over All-Reduce and AD-PSGD.
References
- (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
- (2018) Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. arXiv:1802.09941.
- Maverick2 user guide - TACC user portal. https://portal.tacc.utexas.edu/user-guides/maverick2
- (2016) Revisiting distributed synchronous SGD. In International Conference on Learning Representations Workshop Track.
- (2015) MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs/1512.01274.
- (2013) Deep learning with COTS HPC systems. In Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3), Atlanta, GA, USA, pp. 1337–1345.
- (2015) MPI: a message-passing interface standard. https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
- Summit - IBM Power System AC922, IBM POWER9 22C 3.07GHz, NVIDIA Volta GV100, dual-rail Mellanox EDR InfiniBand | TOP500 supercomputer sites. https://www.top500.org/system/179397
- Intel MPI Library | Intel Software. https://software.intel.com/en-us/mpi-library
- (2013) The tail at scale. Communications of the ACM 56(2), pp. 74–80.
- (2016) The impact of translation technologies on the process and product of translation. International Journal of Communication 10, pp. 969.
- (2004) Open MPI: goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, pp. 97–104.
- (2017) Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR abs/1706.02677.
- (2018) PipeDream: fast and efficient pipeline parallel DNN training. CoRR abs/1806.03377.
- (2018) Applied machine learning at Facebook: a datacenter infrastructure perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 620–629.
- Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645.
- Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29, pp. 82–97.
- (2013) More effective distributed ML via a stale synchronous parallel parameter server. In Advances in Neural Information Processing Systems (NIPS'13), pp. 1223–1231.
- (2018) Decentralized distributed deep learning in heterogeneous WAN environments. In Proceedings of the ACM Symposium on Cloud Computing (SoCC '18), pp. 505–505.
- (2019) DLion: decentralized distributed deep learning in micro-clouds. In 11th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 19), Renton, WA.
- (2006) Expander graphs and their applications. Bulletin of the American Mathematical Society 43, pp. 439–561.
- (2017) Gaia: geo-distributed machine learning approaching LAN speeds. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), Boston, MA, pp. 629–647.
- (2017) NCCL 2.0. GTC.
- (2018) Highly scalable deep learning training system with mixed-precision: training ImageNet in four minutes. arXiv:1807.11205.
- (2018) Beyond data and model parallelism for deep neural networks. CoRR abs/1807.05358.
- (2017) Heterogeneity-aware distributed parameter servers. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17), pp. 463–478.
- (2019) Accelerating distributed stochastic gradient descent with adaptive periodic parameter averaging: poster. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19), pp. 403–404.
- (2009) Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto.
- (2018) Exascale deep learning for climate analytics. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 51.
- Scaling distributed machine learning with the parameter server. In International Conference on Big Data Science and Computing, pp. 3.
- (2018) Pipe-SGD: a decentralized pipelined SGD framework for distributed deep net training. In Advances in Neural Information Processing Systems (NIPS'18), pp. 8056–8067.
- (2017) Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems 30, pp. 5330–5340.
- (2018) Asynchronous decentralized parallel stochastic gradient descent. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, pp. 3049–3058.
- (2018) Parameter Hub: a rack-scale parameter server for distributed deep neural network training. CoRR abs/1805.07891.
- (2019) Hop: heterogeneity-aware decentralized training. CoRR abs/1902.01064.
- (2009) Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing 69(2), pp. 117–124.
- (2011) Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, pp. 693–701.
- (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), pp. 211–252.
- (2018) Horovod: fast and easy distributed deep learning in TensorFlow. arXiv:1802.05799.
- (2016) Tornado: a system for real-time iterative analysis over evolving data. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16), pp. 417–430.
- (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529, pp. 484–503.
- (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
- (2019) Optimizing network performance for distributed DNN training on GPU clusters: ImageNet/AlexNet training in 1.5 minutes. arXiv:1902.06855.
- Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR).
- (2018) Communication compression for decentralized training. In NeurIPS.
- (2018) D2: decentralized training over decentralized data. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80, Stockholm, Sweden, pp. 4848–4856.
- (2018) Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. arXiv:1810.08313.
- (2018) Supporting very large models using automatic dataflow graph partitioning. CoRR abs/1807.08887.
- (2015) Petuum: a new platform for distributed machine learning on big data. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15), pp. 1335–1344.
- (2019) Yet another accelerated SGD: ResNet-50 training on ImageNet in 74.7 seconds. arXiv:1903.12650.
- (2018) Artificial intelligence in healthcare. Nature Biomedical Engineering 2.
- (2014) DimmWitted: a study of main-memory statistical analytics. Proceedings of the VLDB Endowment 7(12), pp. 1283–1294.