Distributed Learning over Unreliable Networks

10/17/2018 ∙ by Hanlin Tang, et al. ∙ ETH Zurich ∙ University of Rochester

Most of today's distributed machine learning systems assume reliable networks: whenever two machines exchange information (e.g., gradients or models), the network should guarantee the delivery of the message. At the same time, recent work exhibits the impressive tolerance of machine learning algorithms to errors or noise arising from relaxed communication or synchronization. In this paper, we connect these two trends and consider the following question: can we design machine learning systems that are tolerant to network unreliability during training? With this motivation, we focus on a theoretical problem of independent interest: given a standard distributed parameter server architecture, if every communication between a worker and a server has a non-zero probability p of being dropped, does there exist an algorithm that still converges, and at what speed? In the context of prior art, this problem can be phrased as distributed learning over random topologies. The technical contribution of this paper is a novel theoretical analysis proving that distributed learning over random topologies can achieve a convergence rate comparable to centralized or distributed learning over reliable networks. Further, we prove that the influence of the packet drop rate diminishes with the growth of the number of parameter servers. We map this theoretical result onto a real-world scenario, training deep neural networks over an unreliable network layer, and conduct network simulations to validate the system improvement enabled by allowing the network to be unreliable.


1 Introduction

Distributed learning has attracted significant interest from both academia and industry. Over the last decade, researchers have come up with a range of designs for more efficient learning systems. An important subset of this work focuses on understanding the impact of different system relaxations on the convergence and performance of distributed stochastic gradient descent, such as the compression of communication, e.g., Seide and Agarwal (2016), decentralized communication Lian et al. (2017a); Sirb and Ye (2016); Lan et al. (2017); Tang et al. (2018a); Stich et al. (2018), and asynchronous communication Lian et al. (2017b); Zhang et al. (2013); Lian et al. (2015). Most of these works are motivated by real-world system bottlenecks, abstracted into general problems for the purposes of analysis.

In this paper, we are motivated by a new type of system relaxation: the reliability of the communication channel. We abstract this problem as a theoretical one, conduct a novel convergence analysis for this scenario, and then validate our results in a practical setting. Specifically, considering a network of $n$ nodes, we model unreliable communication as a setting where the communication channel between any two machines has a probability $p$ of not delivering a message. Alternatively, this setting can be abstracted as the simple, general problem of learning in a decentralized multi-worker system over a random topology that changes upon every single communication.

Figure 1: An illustration of the communication pattern of distributed learning with three parameter servers and four workers: each server serves a partition of the model, and each worker holds a replica of the whole model. In this paper, we focus on the case in which every communication between a worker and a server has a non-zero probability $p$ of being dropped.

We assume a standard decentralized system implementation for aggregating gradients or models while executing standard data-parallel stochastic gradient descent (SGD). In a nutshell, the algorithm, which we call RPS (for Robust Parameter Server), works as follows on a reliable network. Given $n$ machines, each maintaining its own local model, each machine alternates local SGD steps with global communication steps, in which machines exchange their local models. The communication step is performed in two phases: first, in the Reduce-Scatter step, the model is partitioned into $n$ blocks, one for each machine; for each block $b$, the machines average their models on that block by sending it to the corresponding machine. In the subsequent All-Gather step, each machine broadcasts its averaged block to all others, so that all machines have a consistent model copy. Our modeling covers two standard distributed settings: the Parameter Server model (Li et al., 2014; Abadi et al., 2016) (our modeling above fits the case of $n$ workers and $n$ parameter servers, although our analysis extends to any setting of these parameters), as well as standard implementations of the AllReduce averaging operation in a decentralized setting (Seide and Agarwal, 2016; Renggli et al., 2018).

If the underlying network is unreliable (He et al., 2018), the two aggregation steps change as follows. In the Reduce-Scatter step, only a uniform random subset of machines contributes to the average of each model block $b$. In the All-Gather step, it is again a uniform random subset of machines that receives the resulting average. Specifically, machines not chosen for the Reduce-Scatter step do not contribute to the average, and machines not chosen for the All-Gather step do not receive updates on their model block $b$. This is a realistic model of running an AllReduce operator implemented with Reduce-Scatter/All-Gather on an unreliable network.
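To make the communication pattern concrete, the following sketch simulates one unreliable Reduce-Scatter/All-Gather round as described above. It is a minimal illustration rather than the paper's implementation: each model is a plain NumPy vector, block b is aggregated by worker b, and the drop_rate argument plays the role of the drop probability p, with every transmission lost independently.

```python
import numpy as np

def rps_average(models, drop_rate, rng):
    """One unreliable Reduce-Scatter/All-Gather round over local model copies.

    models: list of n equal-length NumPy vectors (one per worker).
    Every individual transmission is lost independently with probability drop_rate.
    """
    n = len(models)
    blocks = [np.array_split(m, n) for m in models]   # blocks[i][b] = worker i's b-th block

    # Reduce-Scatter: worker b averages the b-th blocks it actually receives.
    averaged = []
    for b in range(n):
        received = [blocks[b][b]]                     # the aggregator always has its own block
        for j in range(n):
            if j != b and rng.random() >= drop_rate:  # delivered with probability 1 - p
                received.append(blocks[j][b])
        averaged.append(np.mean(received, axis=0))

    # All-Gather: worker b broadcasts its averaged block; on a drop, the receiving
    # worker keeps its own original block in that position.
    new_models = []
    for i in range(n):
        out = []
        for b in range(n):
            delivered = (i == b) or (rng.random() >= drop_rate)
            out.append(averaged[b] if delivered else blocks[i][b])
        new_models.append(np.concatenate(out))
    return new_models

rng = np.random.default_rng(0)
workers = [rng.normal(size=8) for _ in range(4)]
print(rps_average(workers, drop_rate=0.1, rng=rng)[0])
```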

Our main technical contribution is characterizing the convergence properties of the RPS algorithm. To the best of our knowledge, this is a novel theoretical analysis of this faulty communication model. Compared with previous work on decentralized learning, e.g. Lian et al. (2017a), this paper considers the impact of a random topology on convergence. Specifically, we prove that decentralized learning over random topologies can converge as efficiently as decentralized learning over a fixed topology, and that it admits a linear speedup in the number of machines. Moreover, we also prove that the impact of the packet drop rate diminishes as the number of workers increases.

We apply our theoretical result to a real-world use case, illustrating the potential benefit of allowing an unreliable network. We focus on a realistic scenario where the network is shared among multiple applications or tenants, for instance in a data center, with all applications communicating over the same network. In this case, if the machine learning traffic is tolerant to some packet loss, the other applications can potentially be made faster by receiving priority for their network traffic. Via network simulations, we find that tolerating a moderate drop rate for the learning traffic can make a simple (emulated) Web service substantially faster. (Even small speedups are significant for such services; for example, Google actively pursues minimizing its Web services’ response latency.) At the same time, this degree of loss does not impact the convergence rate for a range of machine learning applications, such as image classification and natural language processing.

2 Related Work

There has been significant work on distributed deep learning, e.g., Seide and Agarwal (2016); Abadi et al. (2016); Goyal et al. (2017); Colin et al. (2016). Due to space constraints, we mainly focus on work considering data-parallel SGD with decentralized, randomized communication.

2.1 Distributed Learning

In Scaman et al. (2017), the optimal convergence rate for both centralized and decentralized distributed learning is given, with the time cost of communication included. Lin et al. (2018) and Stich (2018) investigate the trade-off between computing more mini-batches locally and communicating more often. To save communication cost, several sparsity-based distributed learning algorithms have been proposed (Shen et al., 2018b; Wu et al., 2018; Wen et al., 2017; McMahan et al., 2016; Wang et al., 2016). Recent work indicates that many distributed learning algorithms are delay-tolerant under an asynchronous setting (Zhou et al., 2018; Lian et al., 2015; Sra et al., 2015; Leblond et al., 2016). Blanchard et al. (2017); Yin et al. (2018); Alistarh et al. (2018) study Byzantine-robust distributed learning when Byzantine workers are present in the network.

Many optimization algorithms have been shown to achieve much better performance with more workers. For example, Hajinezhad et al. (2016) utilize a primal-dual method for optimizing a finite-sum objective function and prove that it is possible to achieve a speedup proportional to the number of workers. Xu et al. (2017) propose an adaptive consensus ADMM, and Goldstein et al. (2016) study the performance of transpose ADMM on entire distributed datasets.

2.2 Gossip-like Communication

Closest to this paper is a line of work considering gossip-like communication patterns for distributed learning. Specifically, Jin et al. (2016) propose to scale the gradient aggregation process via a gossip-like mechanism. Blot et al. (2016) consider a more radical approach, called GoSGD, where each worker exchanges gradients with a random subset of other workers in each round. They show that GoSGD can be faster than Elastic Averaging SGD Zhang et al. (2015) on CIFAR-10, but provide no large-scale experiments or theoretical justification. Recently, Daily et al. (2018) proposed GossipGrad, a more complex gossip-based scheme with upper bounds on the time for workers to communicate indirectly, periodic rotation of partners, and shuffling of the input data, which provides strong empirical results on large-scale deployments. The authors also provide an informal justification for why GossipGrad should converge. Despite the promising empirical results, very little is known in terms of convergence guarantees.

2.3 AllReduce SGD and Decentralized Learning

To overcome the limitations of the parameter-server model (Li et al., 2014), recent research, e.g., Iandola et al. (2016), has explored communication patterns which do not depend on the existence of a coordinator node. These references typically consider an all-to-all topology, where each worker in the network can communicate reliably with all others.

Another direction of related work considers decentralized learning over fixed, but not fully connected, graph topologies. A recent result by Lian et al. (2017a) provided strong convergence bounds for an algorithm similar to the one we consider, in a setting where the communication graph is fixed and regular. Tang et al. (2018b) study a new approach that outperforms decentralized SGD when the data across workers differs significantly. Shen et al. (2018a) generalize the decentralized optimization problem to monotone operators. Here, we consider random, dynamically changing topologies, and therefore require a different analytic approach.

2.4 Random Topology Decentralized Algorithms

In Boyd et al. (2006), a randomized decentralized SGD is studied. The weight matrix for randomized algorithms can be time-varying, which means workers are allowed to change the communication network based on its availability. However, most previous work (Li and Zhang, 2010; Lobel and Ozdaglar, 2011) assumes that the randomized network is doubly stochastic, an assumption that is not satisfied in our setting. Recently, Nedic et al. (2017); Nedić and Olshevsky (2015) relaxed this assumption and considered the situation where the communication network is directed, but the row sums of the weight matrix are still required to be 1. These works focus on the case where the time-varying communication pattern is pre-designed, which does not extend to our setting.

In this paper, we consider a general communication model, which covers both the Parameter Server Li et al. (2014) and AllReduce Seide and Agarwal (2016) distribution strategies. We explicitly include the unreliability of the network in our theoretical analysis, which to our knowledge is the first that does not require a doubly stochastic communication matrix. In addition, our analysis highlights that the system can tolerate additional packet drops as we increase the number of worker nodes.

3 Problem Setup

We consider the following distributed optimization problem:

$$\min_{x \in \mathbb{R}^N} \; f(x) := \frac{1}{n} \sum_{i=1}^{n} \underbrace{\mathbb{E}_{\xi \sim \mathcal{D}_i} F_i(x; \xi)}_{=: f_i(x)}, \qquad (1)$$

where $n$ is the number of workers, $\mathcal{D}_i$ is the local data distribution for worker $i$ (in other words, we do not assume that all nodes can access the same data set), and $F_i(x; \xi)$ is the local loss function of model $x$ given data sample $\xi$ for worker $i$.

Unreliable Network Connection

Nodes can communicate with all other workers, but with packet drop rate $p$ (here we do not use the more common phrase “packet loss rate” because we use “loss” to refer to the loss function). That is, whenever any node forwards models or data to any other node, the destination worker fails to receive it with probability $p$. For simplicity, we assume that all packet drop events are independent and that they occur with the same probability $p$.

Definitions and notations

Throughout, we use the following notation and definitions:

  • $\nabla f(\cdot)$ denotes the gradient of a function $f$.

  • $\lambda_i(\cdot)$ denotes the $i$-th largest eigenvalue of a matrix.

  • $\mathbf{1}_n = [1, 1, \ldots, 1]^\top$ denotes the full-one vector.

  • $\mathbf{1}_{n \times n}$ denotes the all-$1$'s $n$-by-$n$ matrix.

  • $\|\cdot\|$ denotes the $\ell_2$ norm for vectors.

  • $\|\cdot\|_F$ denotes the Frobenius norm of matrices.

4 Algorithm

In this section, we describe the RPS algorithm in detail, followed by its interpretation from a global viewpoint.

4.1 The RPS Algorithm

In the RPS algorithm, each worker maintains an individual local model. We use $x_t^{(i)}$ to denote the local model on worker $i$ at time step $t$. At each iteration $t$, each worker $i$ first performs a regular SGD step

$$x_{t+\frac{1}{2}}^{(i)} = x_t^{(i)} - \gamma \nabla F_i\big(x_t^{(i)}; \xi_t^{(i)}\big),$$

where $\gamma$ is the learning rate and $\xi_t^{(i)}$ are the data samples of worker $i$ at iteration $t$.

We would like to reliably average the intermediate models $x_{t+\frac{1}{2}}^{(i)}$ among all workers via the RPS procedure. In brief, the Reduce-Scatter (RS) step performs communication-efficient model averaging, and the All-Gather (AG) step performs communication-efficient model sharing.

The Reduce-Scatter (RS) step:

In this step, each worker divides $x_{t+\frac{1}{2}}^{(i)}$ into $n$ equally-sized blocks:

$$x_{t+\frac{1}{2}}^{(i)} = \Big[\big(x_{t+\frac{1}{2}}^{(i,1)}\big)^\top, \big(x_{t+\frac{1}{2}}^{(i,2)}\big)^\top, \ldots, \big(x_{t+\frac{1}{2}}^{(i,n)}\big)^\top\Big]^\top. \qquad (2)$$

The reason for this division is to reduce the communication cost and to parallelize model averaging, since each worker is assigned to average only one of those blocks. For example, one worker can be assigned to average the first block while another worker is assigned to deal with the third block. For simplicity, we proceed with our discussion in the case that worker $i$ is assigned to average the $i$-th block.

After the division, each worker sends its $j$-th block to worker $j$. Upon receiving those blocks, each worker averages all the blocks it receives. As noted, some packets might be dropped; in this case, worker $i$ averages only the blocks it actually receives, using

$$\hat{x}_t^{(i)} = \frac{1}{\big|\mathcal{N}_t^{(i)}\big|} \sum_{j \in \mathcal{N}_t^{(i)}} x_{t+\frac{1}{2}}^{(j,i)},$$

where $\mathcal{N}_t^{(i)}$ is the set of workers whose packets worker $i$ receives (including itself).

The AllGather (AG) step:

After computing $\hat{x}_t^{(i)}$, each worker $i$ attempts to broadcast $\hat{x}_t^{(i)}$ to all other workers, so that every worker can recover the averaged original vector by concatenating the averaged blocks.

Note that it is entirely possible that some workers in the network may not receive some of the averaged blocks. In this case, they simply keep the corresponding original block. Formally,

$$x_{t+1}^{(i,b)} = \begin{cases} \hat{x}_t^{(b)}, & \text{if worker } i \text{ receives the broadcast of the averaged block } b, \\ x_{t+\frac{1}{2}}^{(i,b)}, & \text{otherwise}, \end{cases} \qquad (3)$$

where $\hat{x}_t^{(b)}$ is the averaged $b$-th block computed in the RS step. In other words, each worker simply replaces the corresponding blocks of $x_{t+\frac{1}{2}}^{(i)}$ with the averaged blocks it receives. The complete algorithm is summarized in Algorithm 1.

1:  Input: Initialize all local models $x_0^{(i)}$ with the same value, learning rate $\gamma$, and number of total iterations $T$.
2:  for $t = 0, 1, \ldots, T-1$ do
3:     Randomly sample $\xi_t^{(i)}$ from the local data of the $i$-th worker, for all $i$.
4:     Compute a local stochastic gradient $\nabla F_i(x_t^{(i)}; \xi_t^{(i)})$ based on $\xi_t^{(i)}$ and the current optimization variable $x_t^{(i)}$, for all $i$.
5:     Compute the intermediate model $x_{t+\frac{1}{2}}^{(i)} = x_t^{(i)} - \gamma \nabla F_i(x_t^{(i)}; \xi_t^{(i)})$ and divide it into $n$ blocks $x_{t+\frac{1}{2}}^{(i,1)}, \ldots, x_{t+\frac{1}{2}}^{(i,n)}$, for all $i$.
6:     For every block index $b$, randomly choose one worker $i_b$ without replacement (here $i_b$ indicates which worker is assigned to average the $b$-th block). Every worker attempts to send the $b$-th block of its intermediate model to worker $i_b$; worker $i_b$ then averages all the blocks it receives.
7:     Worker $i_b$ broadcasts its averaged block to all workers (the broadcast may be dropped due to packet drops), for all $b$.
8:     Each worker $i$ forms $x_{t+1}^{(i)}$ by replacing its blocks with the averaged blocks it receives, as in (3), for all $i$.
9:  end for
10:  Output: the resulting local models $x_T^{(i)}$.
Algorithm 1 RPS
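Putting the pieces of Algorithm 1 together, the sketch below runs a few RPS iterations on a toy least-squares problem. It assumes the rps_average helper from the earlier sketch is in scope; the objective, data, and step size are placeholders chosen only for illustration.

```python
import numpy as np
# Assumes rps_average(models, drop_rate, rng) from the earlier sketch is defined.

def rps_sgd(local_data, iters=100, lr=0.1, drop_rate=0.1, seed=0):
    """Toy RPS loop: a local SGD step followed by unreliable model averaging.

    local_data: list of (A_i, b_i) least-squares problems, one per worker.
    """
    rng = np.random.default_rng(seed)
    dim = local_data[0][0].shape[1]
    models = [np.zeros(dim) for _ in local_data]        # Algorithm 1: start from 0
    for _ in range(iters):
        # Local stochastic gradient step on one random row of each worker's data.
        half = []
        for (A, b), x in zip(local_data, models):
            k = rng.integers(len(A))
            grad = (A[k] @ x - b[k]) * A[k]
            half.append(x - lr * grad)
        # Unreliable averaging of the intermediate models (RS + AG with drops).
        models = rps_average(half, drop_rate, rng)
    return models

rng = np.random.default_rng(1)
x_true = rng.normal(size=8)
data = [(A, A @ x_true) for A in (rng.normal(size=(32, 8)) for _ in range(4))]
models = rps_sgd(data)
print("distance of averaged model to solution:",
      np.linalg.norm(np.mean(models, axis=0) - x_true))
```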

4.2 RPS From a Global Viewpoint

It can be seen that at time step $t$, the $b$-th block of worker $i$'s local model, that is, $x_{t+1}^{(i,b)}$, is a linear combination of the $b$-th blocks of all workers' intermediate models $x_{t+\frac{1}{2}}^{(j,b)}$:

$$x_{t+1}^{(i,b)} = \sum_{j=1}^{n} \big(G_t^{(b)}\big)_{ij}\, x_{t+\frac{1}{2}}^{(j,b)}, \qquad (4)$$

where $G_t^{(b)} \in \mathbb{R}^{n \times n}$ is the coefficient matrix indicating the communication outcome for block $b$ at time step $t$. The $(i,j)$-th element of $G_t^{(b)}$ is denoted by $(G_t^{(b)})_{ij}$. A nonzero $(G_t^{(b)})_{ij}$ means that worker $i$ receives worker $j$'s individual $b$-th block (that is, $x_{t+\frac{1}{2}}^{(j,b)}$), whereas $(G_t^{(b)})_{ij} = 0$ means that the packet may have been dropped either in the RS step (worker $j$ fails to send) or in the AG step (worker $i$ fails to receive). Hence $G_t^{(b)}$ is time-varying because of the randomness of the packet drops. Also, $G_t^{(b)}$ is not doubly stochastic (in general), because packet drops are independent between the RS step and the AG step.
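For concreteness, the sketch below constructs the coefficient matrix of a single block for one round from the success or failure of the individual transmissions. The function name, the matrix name G, and its arguments are illustrative choices of ours rather than notation from the paper; the point it demonstrates is that each row sums to one while the column sums generally do not, so the matrix is row stochastic but not doubly stochastic.

```python
import numpy as np

def block_coefficient_matrix(sent_ok, bcast_ok, aggregator, n):
    """Coefficient matrix G of one block for one RPS round (illustrative notation).

    sent_ok[j]  : True if worker j's block reached the aggregator (RS step).
    bcast_ok[i] : True if worker i received the aggregator's broadcast (AG step).
    aggregator  : index of the worker assigned to average this block.
    Row i of G expresses worker i's new block as a combination of all
    workers' intermediate blocks.
    """
    G = np.zeros((n, n))
    received = [j for j in range(n) if sent_ok[j] or j == aggregator]
    avg_row = np.zeros(n)
    avg_row[received] = 1.0 / len(received)          # uniform average over received blocks
    for i in range(n):
        if bcast_ok[i] or i == aggregator:
            G[i] = avg_row                           # worker i got the averaged block
        else:
            G[i, i] = 1.0                            # worker i kept its own block
    return G

rng = np.random.default_rng(0)
n, p = 4, 0.3
G = block_coefficient_matrix(rng.random(n) >= p, rng.random(n) >= p, aggregator=0, n=n)
print(G)
print("row sums:", G.sum(axis=1), "column sums:", G.sum(axis=0))
```

Re-running this with different seeds shows how the matrix changes from round to round, which is exactly the time-varying behavior discussed above.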

Figures: the constants in (5) and (6) under different numbers of workers n and packet drop rates p.

In fact, it can be shown that all the coefficient matrices $G_t^{(b)}$ (for all $t$ and $b$) satisfy the following properties

(5)
(6)

for some constants whose bounds are established in Lemmas 6, 7, and 8, respectively (see the Supplementary Material for proof details); these bounds are what make the algorithm convergent.

Simply speaking, both constants depend on the number of workers $n$ and the packet drop rate $p$, and it can be proved that the first constant is always larger than the second. Our result also characterizes their limiting behavior as $n$ grows and as $p$ approaches its extreme values, which shows the tightness of our bounds for both constants. We plot both constants in the figures above; a detailed discussion is included in Section D of the Supplementary Material.

5 Theoretical Guarantees and Discussion

Below we will show that, for certain parameter values, RPS with unreliable communication admits the same convergence rate as the standard algorithms. In other words, the impact of network unreliability can be seen as negligible. We begin by stating our analytic assumptions, which are standard in analyzing stochastic optimization algorithms.

Assumption 1.

We make the following commonly used assumptions:

  1. Lipschitzian gradient: All functions $f_i(\cdot)$ have $L$-Lipschitzian gradients, which means

     $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|, \quad \forall x, y, \; \forall i.$

  2. Bounded variance: Assume the variance of the stochastic gradient is bounded for any $x$ on each worker $i$:

     $\mathbb{E}_{\xi \sim \mathcal{D}_i}\left\|\nabla F_i(x;\xi) - \nabla f_i(x)\right\|^2 \le \sigma^2.$

  3. Start from 0: We assume $x_0^{(i)} = 0$ for all $i$, for simplicity and w.l.o.g.

Next, we are ready to present our main result.

Theorem 1 (Convergence of Algorithm 1).

Under Assumption 1, choosing the learning rate $\gamma$ in Algorithm 1 to be small enough to satisfy the required condition, we have the following convergence rate for Algorithm 1:

(7)

where the constants follow the definitions in (5) and (6).

To make the result clearer, we choose the learning rate appropriately, as follows:

Corollary 2.

Choosing the learning rate $\gamma$ in Algorithm 1 appropriately, under Assumption 1, we have the following convergence rate for Algorithm 1:

where the constants follow the definitions in Theorem 1, and we treat the problem-dependent quantities (such as $L$ and $\sigma$) as constants.

We discuss our theoretical results below.

  • (Comparison with centralized SGD and decentralized SGD) The dominant term in the convergence rate is $O(1/\sqrt{nT})$ (the remaining terms are of lower order, as shown by Lemma 8 in the Supplement), which is consistent with the rate of centralized SGD and decentralized SGD Lian et al. (2017a).

  • (Linear Speedup) Since the leading term of the convergence rate is $O(1/\sqrt{nT})$, our algorithm admits a linear speedup with respect to the number of workers $n$ (see the schematic bound sketched after this list).

  • (Better performance for larger networks) Fixing the packet drop rate $p$ (its influence is characterized in Section D), the convergence rate under a larger network (increasing $n$) is superior, because the drop rate only affects the leading term through factors that shrink with $n$. This indicates that the effect of the packet drop rate diminishes as we increase the number of workers and parameter servers.
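For intuition, the observations above can be summarized by the following schematic form of the bound. This is a simplified restatement in which all drop-rate-dependent quantities are absorbed into a single lower-order term; it is not the exact statement of Theorem 1 or Corollary 2.

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left\|\nabla f\big(\overline{x}_t\big)\right\|^{2} \;\lesssim\; \frac{\sigma}{\sqrt{nT}} \;+\; \frac{C(n, p)}{T}, \qquad \overline{x}_t := \frac{1}{n}\sum_{i=1}^{n} x_t^{(i)},$$

where $\sigma^2$ is the variance bound from Assumption 1 and $C(n, p)$ collects the drop-rate-dependent constants. The leading $\sigma/\sqrt{nT}$ term is independent of $p$ and shrinks with $n$, which is precisely the linear-speedup and diminishing-drop-rate behavior noted above.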

6 Experiments: Convergence of RPS

We now validate empirically the scalability and accuracy of the RPS algorithm, given reasonable message arrival rates.

6.1 Experimental Setup

Datasets and models

We evaluate our algorithm on two state-of-the-art machine learning tasks: (1) image classification and (2) natural language understanding (NLU). We train ResNets He et al. (2016) with different numbers of layers on CIFAR-10 Krizhevsky and Hinton (2009) for classifying images. We perform the NLU task on the Air Travel Information System (ATIS) corpus with a one-layer LSTM network.

Implementation

We simulate packet losses by adapting the latest version 2.5 of the Microsoft Cognitive Toolkit Seide and Agarwal (2016). We implement the RPS algorithm using MPI. During training, we use a local batch size of 32 samples per worker for image classification. We adjust the learning rate by applying a linear scaling rule Goyal et al. (2017) and a decay of 10 percent after 80 and 120 epochs, respectively. To achieve the best possible convergence, we apply a gradual warmup strategy Goyal et al. (2017) during the first 5 epochs. We deliberately do not use any regularization or momentum during the experiments, in order to be consistent with the described algorithm and proof. The NLU experiments are conducted with the default parameters given by the CNTK examples, scaling the learning rate accordingly and omitting momentum and regularization terms on purpose. The models are trained on 16 NVIDIA TITAN Xp GPUs connected by Gigabit Ethernet; we use each GPU as a worker. We describe the results in terms of training-loss convergence, although the validation trends are similar.
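As a concrete illustration of this schedule, the snippet below sketches a learning-rate function with linear scaling, a 5-epoch gradual warmup, and decays after epochs 80 and 120. The base rate, the worker count, and the interpretation of the decay as multiplying the rate by 0.1 are assumptions made for the sketch, not values taken from the paper.

```python
def learning_rate(epoch: int,
                  base_lr: float = 0.1,      # assumed single-worker base rate
                  num_workers: int = 16,
                  warmup_epochs: int = 5,
                  decay_epochs=(80, 120),
                  decay_factor: float = 0.1  # assumed 10x decay at each boundary
                  ) -> float:
    """Hypothetical schedule: linear scaling, gradual warmup, stepwise decay."""
    target_lr = base_lr * num_workers               # linear scaling rule
    if epoch < warmup_epochs:
        # Gradual warmup: ramp linearly from base_lr up to target_lr.
        return base_lr + (target_lr - base_lr) * (epoch + 1) / warmup_epochs
    lr = target_lr
    for boundary in decay_epochs:                   # decay after epochs 80 and 120
        if epoch >= boundary:
            lr *= decay_factor
    return lr

if __name__ == "__main__":
    for e in (0, 4, 5, 79, 80, 119, 120, 159):
        print(e, round(learning_rate(e), 4))
```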

(a) ResNet20 - CIFAR10
(b) ResNet110 - CIFAR10
(c) LSTM - ATIS
Figure 2: Convergence of RPS on different datasets.

Convergence of Image Classification

We perform convergence tests using the analyzed algorithm, model-averaging SGD, on both ResNet110 and ResNet20 with CIFAR-10. Figure 2(a,b) shows the results. We vary the probability of each packet being correctly delivered at each worker between 80%, 90%, 95%, and 99%. The baseline is a 100% message delivery rate. The baseline achieves a training loss of 0.02 using ResNet110 and 0.09 for ResNet20. Dropping 1% does not increase the training loss achieved after 160 epochs. For 5%, the training loss is identical on ResNet110 and increased by 0.02 on ResNet20. A 90% probability of arrival leads to a training loss of 0.03 for ResNet110 and 0.11 for ResNet20, respectively.

Convergence of NLU

We perform full convergence tests for the NLU task on the ATIS corpus with a single-layer LSTM. Figure 2(c) shows the result. The baseline achieves a training loss of 0.01. Dropping 1, 5, or 10 percent of the communicated partial vectors results in an increase of 0.01 in training loss.

(a) ResNet20 - CIFAR10
(b) ResNet110 - CIFAR10
(c) LSTM - ATIS
Figure 3: Why RPS? The behavior of standard SGD in the presence of message drops.

Comparison with Gradient Averaging

We conduct experiments with an identical setup and a 99 percent probability of arrival, using a gradient-averaging method instead of model averaging. When running data-distributed SGD, gradient averaging is the most widely used technique in practice, and it is implemented by default in most deep learning frameworks Abadi et al. (2016); Seide and Agarwal (2016). As expected, the baseline (all transmissions are successful) converges to the same training loss as its model-averaging counterpart when omitting momentum and regularization terms. As seen in Figures 3(a,b), a communication loss of even 1 percent results in worse convergence in terms of accuracy for both ResNet architectures on CIFAR-10. This behavior is not visible in the NLU experiments. The reason for achieving a similar training loss when dropping gradients randomly in this case most probably lies in the naturally sparse nature of the gradients and the model for this task. Nevertheless, this insight suggests that one should favor a model-averaging algorithm over gradient averaging if the underlying network connection is unreliable.
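To give some intuition for this observation, the toy sketch below contrasts gradient averaging and model averaging under message drops by measuring how far the worker replicas drift apart. The quadratic objective, noise level, and drop handling are simplifications invented for illustration, not the experimental setup used above.

```python
import numpy as np

def replica_drift(drop_rate, mode, iters=200, n=8, dim=16, lr=0.1, seed=0):
    """Toy comparison of gradient vs. model averaging under message drops.

    Each worker minimizes 0.5 * ||x - target||^2 with a noisy gradient; a drop
    means a worker's contribution is missing from that particular average.
    Returns the mean distance between replicas and their average after `iters`
    rounds, a proxy for how far the copies have drifted apart.
    """
    rng = np.random.default_rng(seed)
    target = rng.normal(size=dim)
    models = [np.zeros(dim) for _ in range(n)]
    for _ in range(iters):
        grads = [(x - target) + 0.1 * rng.normal(size=dim) for x in models]
        if mode == "gradient":
            # Each worker averages the gradients it happens to receive, then steps.
            for i in range(n):
                got = [g for j, g in enumerate(grads)
                       if j == i or rng.random() >= drop_rate]
                models[i] = models[i] - lr * np.mean(got, axis=0)
        else:
            # Model averaging: local step first, then average the received models.
            half = [x - lr * g for x, g in zip(models, grads)]
            for i in range(n):
                got = [h for j, h in enumerate(half)
                       if j == i or rng.random() >= drop_rate]
                models[i] = np.mean(got, axis=0)
    center = np.mean(models, axis=0)
    return float(np.mean([np.linalg.norm(x - center) for x in models]))

for mode in ("gradient", "model"):
    print(mode, round(replica_drift(drop_rate=0.05, mode=mode), 4))
```

Under the same drop rate, the gradient-averaging replicas typically end up noticeably further apart than the model-averaging ones, because model averaging repeatedly pulls the copies back together while gradient averaging never re-synchronizes them.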

7 Case study: Speeding up Colocated Applications

Figure: Allowing an increasing rate of losses for model updates speeds up the Web service. Figure: Allowing more losses for model updates reduces the cost for the Web service.

Our results on the resilience of distributed learning to losses of model updates open up an interesting use case. That model updates can be lost (within some tolerance) without deteriorating model convergence implies that model updates transmitted over the physical network can be de-prioritized compared to other, more “inflexible,” delay-sensitive traffic, such as that of Web services. Thus, we can colocate other applications with the training workloads and reduce the infrastructure costs of running them. Equivalently, workloads that are colocated with learning workers can benefit from prioritized network traffic (at the expense of some model update losses), and thus achieve lower latency.

To demonstrate this in practice, we perform a packet simulation over 16 servers, each connected to a network switch by a multi-gigabit link. Over this network of servers, we run two workloads: (a) replaying traces from the machine learning process of ResNet110 on CIFAR-10 (which translates to a load of 2.4 Gbps), sent unreliably, and (b) a simple emulated Web service running on all servers. Web services often produce significant background traffic between servers within the data center, consisting typically of small messages fetching distributed pieces of content to compose a response (e.g., a Google query response potentially consists of advertisements, search results, and images). We emulate this intra-data-center traffic for the Web service as all-to-all traffic between these servers, with small messages (of KB scale, a reasonable size for such services) sent reliably between these servers. The inter-arrival time for these messages follows a Poisson process, parametrized by the expected message rate, aggregated across the servers.
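For reference, the snippet below shows one way to sample such Poisson message arrivals; the aggregate rate and the duration are placeholders, not the values used in the simulation.

```python
import numpy as np

def poisson_arrivals(rate_per_s, duration_s, seed=0):
    """Sample Web-service message arrival times (in seconds) from a Poisson process.

    Inter-arrival times are exponential with mean 1 / rate_per_s, where rate_per_s
    is the expected message rate aggregated across the servers.
    """
    rng = np.random.default_rng(seed)
    times, t = [], 0.0
    while t < duration_s:
        t += rng.exponential(1.0 / rate_per_s)
        if t < duration_s:
            times.append(t)
    return np.array(times)

arrivals = poisson_arrivals(rate_per_s=50_000, duration_s=0.01)  # placeholder values
print(len(arrivals), "messages in 10 ms; first arrivals:", arrivals[:3])
```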

Different degrees of prioritization of the Web service traffic over learning traffic result in different degrees of loss in learning updates transmitted over the network. As the Web service is prioritized to a greater extent, its performance improves: its message exchanges take less time; we refer to this reduction in (average) completion time for these messages as a speed-up. Note that even small speedups are significant for such services; for example, Google actively pursues minimizing its Web services’ response latency. An alternative way of quantifying the benefit for the colocated Web service is to measure how many additional messages the Web service can send while maintaining a fixed average completion time. This translates to running more Web service queries and achieving more throughput over the same infrastructure, thus reducing the cost per request / message.

The two figures above show results for the Web service speedup and cost reduction, respectively. In the speedup experiment, the arrival rate of Web service messages is fixed. As the network prioritizes the Web service more and more over learning-update traffic, more learning traffic suffers losses (shown on the horizontal axis), but performance for the Web service improves. With only small losses for learning updates, the Web service can be sped up substantially.

In the cost-reduction experiment, we set a target average transmission time for the Web service’s messages and increase the message arrival rate, thus causing more and more losses for learning updates (horizontal axis). Accommodating a higher message rate over the same infrastructure translates to a lower cost of running the Web service (with this reduction shown on the vertical axis).

Thus, tolerating small amounts of loss in model update traffic can result in significant benefits for colocated services, while not deteriorating convergence.

8 Conclusion

We present a novel analysis for a general model of distributed machine learning, under a realistic unreliable communication model. Mathematically, the problem we considered is that of decentralized, distributed learning over a randomized topology. We present a novel theoretical analysis for such a scenario, and evaluate it while training neural networks on both image and natural language datasets. We also provide a case study of application colocation, to illustrate the potential benefit that can be provided by allowing learning algorithms to take advantage of unreliable communication channels.

References

Supplemental Materials


Appendix A Notations

In order to unify notation, we define the following notation for gradients:

We define $I$ as the identity matrix. Also, we suppose the packet drop rate is $p$.

The following equation is used frequently:

(8)

A.1 Matrix Notations

We aggregate the vectors into matrices and use matrix notation to simplify the proof.

A.2 Averaged Notations

We define averaged vectors as follows:

(9)
(10)
(11)

A.3 Block Notations

Recall that in (2) and (3) we divided the models into blocks:

We apply the same division to some other quantities, as follows (the dimension of each block is the same as that of the corresponding block above):

A.4 Aggregated Block Notations

Now we define some additional notation that is used throughout the following proof:

A.5 Relations between Notations

We have the following relations between these notations:

(12)
(13)
(14)
(15)
(16)
(17)

A.6 Expectation Notations

There are different conditions when taking expectations in the proof, so we list these conditions below:

The first denotes taking the expectation over the stochastic-gradient computation at the $t$-th iteration, conditioned on the history before the $t$-th iteration.

The second denotes taking the expectation over the packet drops in the block sending and receiving procedure at the $t$-th iteration, conditioned on the history before the $t$-th iteration and on the SGD step at the $t$-th iteration.

The third denotes taking the expectation over all randomness during the $t$-th iteration, conditioned on the history before the $t$-th iteration.

The last denotes taking the expectation over all history information.

Appendix B Proof of Theorem 1

A critical property for a decentralized algorithm to be successful is that the local model on each node converges to their average model. We summarize this critical property in the next lemma.

Lemma 3.

From the updating rule (4) and Assumption 1, we have

(18)

where the two constants are specified in the proof below.

We will prove this critical property first. Then, after proving some auxiliary lemmas, we will prove the final theorem. Throughout the proof, we use properties of the weight matrices, which are shown in Section D.

B.1 Proof of Lemma 3

Proof of Lemma 3.

According to updating rule (4) and Assumption 1, we have

(19)

We also have

(20)

Combining (19) and (20), and defining

we get

(21)

where $\eta > 0$ is a scale factor to be chosen later. The last inequality holds because $\|A + B\|_F^2 \le (1+\eta)\|A\|_F^2 + (1+\eta^{-1})\|B\|_F^2$ for any matrices $A$, $B$ and any $\eta > 0$.

For