Coding for Straggler Mitigation in Federated Learning

We present a novel coded federated learning (FL) scheme for linear regression that mitigates the effect of straggling devices while retaining the privacy level of conventional FL. The proposed scheme combines one-time padding to preserve privacy with gradient codes to provide resiliency against stragglers, and consists of two phases. In the first phase, the devices share a one-time padded version of their local data with a subset of other devices. In the second phase, the devices and the central server collaboratively and iteratively train a global linear model using gradient codes on the one-time padded local data. To apply one-time padding to real data, our scheme exploits a fixed-point arithmetic representation of the data. Unlike the coded FL scheme recently introduced by Prakash et al., the proposed scheme maintains the same level of privacy as conventional FL while achieving a similar training time. Compared to conventional FL, we show that the proposed scheme achieves training speed-up factors of 6.6 and 9.2 on the MNIST and Fashion-MNIST datasets for an accuracy of 95% and 85%, respectively.


I Introduction

Federated learning (FL) [McM17, Kon16, Tian20] is a distributed learning paradigm that trains an algorithm across multiple devices without exchanging the training data directly, thus limiting the privacy leakage and reducing the communication load. More precisely, FL enables multiple devices to collaboratively learn a global model under the coordination of a central server. The devices compute partial gradients based on their local data and send them to the central server, which aggregates them and sends an updated global model to the devices.

In many applications of FL, such as in the Internet of Things (IoT), the training latency can be severely impaired by straggling devices, i.e., devices that do not provide timely updates, due to the heterogeneous nature of the training devices and the instability of the communication links. Various FL algorithms have been proposed in the literature to tackle stragglers. The most popular is federated averaging [McM17], which mitigates the effect of stragglers by dropping the slowest devices at the cost of reduced accuracy. When data is non-identically distributed across devices, which is typically the case in practice, the loss in accuracy may be significant: dropping stragglers makes the algorithm suffer from the client drift phenomenon, i.e., the learning converges to the optimum of one of the local models [Cha20, Mit21]. Straggler-mitigating schemes for scenarios in which the data is identically distributed across devices were presented in [Dut18, Rei20], while the authors of [Mit21, Xie19, Li19, Li20, Wan20, Wu21] introduce asynchronous schemes to deal with scenarios in which the data is non-identically distributed across devices. The key idea here is to make use of the stale information (e.g., gradients) from the stragglers rather than discarding it at the central server. Generally, schemes of this nature do not converge to the global optimum. In particular, the authors of [Li20] present a scheme that controls the client drift, but with a nonlinear convergence rate to the global optimum [Mit21].

Mitigating the impact of stragglers has also been addressed in the neighboring problem of distributed computing over multiple servers in data centers. The key idea is to introduce redundant computations by means of an erasure correcting code, thereby increasing the computational load at each server, so that the result of a computation task can be obtained from the subtasks completed by a subset of the servers. Coded distributed computing has been considered for, e.g., matrix-vector and matrix-matrix multiplication [Li16, Yu2017, Lee18, Sev19, amir2019, Dutta2019, Dutta2020], distributed gradient descent [Tan17], and distributed optimization [Kar17].

The use of erasure correcting codes to mitigate the impact of stragglers has also been considered in the context of wireless edge computing [Sch20, Zha19, Frigaard2021] and FL [Pra21]. In contrast to distributed computing, where the data can be pre-processed and distributed across servers by a master server, in FL the raw data is distributed across the devices beforehand, which precludes introducing redundant computations in the same manner as in distributed computing. The main idea in [Pra21] is that devices generate parity data, which is shared with the central server to facilitate the training and provide resilience against straggling devices. The raw local data, on the other hand, is not shared. By adding redundancy, the training is still performed on the data of all devices, which helps in mitigating the adverse effects on the learning accuracy when the data is non-identically distributed across devices. The sharing of the parity (coded) data with the central server, however, leaks information about the raw data to the central server, i.e., the coded FL scheme in [Pra21] yields a lower level of privacy than conventional FL.

In this paper, we propose a novel privacy-preserving coded FL scheme for linear regression that mitigates the effect of straggling devices and converges to the global optimum. Hence, the proposed scheme yields no penalty on the accuracy even for highly non-identically distributed data across devices. Furthermore, unlike the scheme in [Pra21], it retains the privacy level of conventional FL against the central server and honest-but-curious devices. The proposed scheme combines one-time padding to yield privacy with gradient codes [Tan17] to provide straggler resilience. One-time padding cannot be applied to real data. To circumvent this problem, the proposed scheme considers a fixed-point arithmetic representation of the real data and subsequently fixed-point arithmetic operations, which allows the application of one-time padding. The scheme consists of two phases: in the first phase, the devices share a padded version of their local data with a subset of other devices. The sharing of one-time padded data does not reveal any information about the data to other devices but enables the use of erasure correcting codes in the second phase. Particularly, in this phase each device uses a gradient code to generate a partial gradient on the local data and the padded data received from other devices. The partial gradient is then shared with the central server, which aggregates the received partial gradients (after removing the random keys) and sends an updated global model to the devices. Similar to the scheme in [Pra21], our scheme can be used to perform nonlinear classification by pre-processing the dataset using kernel embedding. We show that, for a realistic IoT environment, the proposed coded FL scheme using kernel embedding achieves a speed-up factor of 6.6 and 9.2 compared to conventional FL when training on the MNIST [Cun10] and Fashion-MNIST [Xia17] datasets for an accuracy of 95% and 85%, respectively.

Notation.

We use uppercase and lowercase bold letters for matrices and vectors, respectively, italics for sets, and uppercase sans-serif letters for random variables, e.g., $\bm{X}$, $\bm{x}$, $\mathcal{X}$, and $\mathsf{X}$ represent a matrix, a vector, a set, and a random variable, respectively. An exception to this rule is the model parameter $\bm{\Theta}$, which denotes a matrix. Vectors are represented as row vectors throughout the paper. For natural numbers $a$ and $b$, $\mathbf{1}_{a\times b}$ denotes the all-one matrix of size $a\times b$. The transpose of a matrix $\bm{X}$ is denoted as $\bm{X}^\top$. The support of a vector $\bm{x}$ is denoted by $\mathrm{supp}(\bm{x})$, while the gradient of a function $f$ with respect to $\bm{X}$ is denoted by $\nabla_{\bm{X}} f$. Furthermore, we represent the Euclidean norm of a vector $\bm{x}$ by $\|\bm{x}\|$, while the Frobenius norm of a matrix $\bm{X}$ is denoted by $\|\bm{X}\|_{\mathrm{F}}$. Given integers $a\le b$, we denote $[a,b]=\{a,a+1,\ldots,b\}\subset\mathbb{Z}$, where $\mathbb{Z}$ is the set of integers, and $[a]=[1,a]$ is a shorthand for a positive integer $a$. Additionally, for a real number $x$, $\lfloor x\rfloor$ is the largest integer less than or equal to $x$. The expectation of a random variable $\mathsf{X}$ is denoted by $\mathbb{E}[\mathsf{X}]$, and we write $\mathsf{N}\sim\mathrm{Geo}(p)$ to denote that $\mathsf{N}$ follows a geometric distribution with failure probability $p$.

II Preliminaries

II-A Fixed-Point Numbers

Fixed-point numbers are rational numbers that can be split into an integer part and a fractional part. Let $(s, b_{e-2}\cdots b_0.b_{-1}\cdots b_{-f})$ be the binary representation of a fixed-point number $x$, of value $x=(-1)^{s}\sum_{j=-f}^{e-2} b_j 2^{j}$, where $s$ is the sign of $x$, $e$ is the length of the integer part (including the sign), and $f$ the length of the fractional part. Also, let $k=e+f$. Then, $x=\bar{x}\cdot 2^{-f}$ for an integer $\bar{x}$, i.e., fixed-point numbers can be seen as integers scaled by a factor $2^{-f}$. Let $r=2^{e-1}$. We define $\langle k,f\rangle$ as the set of all fixed-point numbers with range $[-r,r)$ and resolution $2^{-f}$.
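As an illustration, a minimal Python sketch of this representation (function names and default wordlengths are illustrative, not taken from the paper) converts between a real number and its integer representative $\bar{x}$:

```python
# Minimal sketch (illustrative, not the paper's implementation): a real number is
# quantized to the fixed-point set <k, f> by scaling with 2**f and rounding, so that
# every fixed-point value is an integer multiple of the resolution 2**(-f).

def to_fixed(x: float, e: int = 8, f: int = 24) -> int:
    """Return the integer representative x_bar such that x ~= x_bar * 2**(-f)."""
    k = e + f                      # total wordlength
    x_bar = round(x * (1 << f))    # scale by 2**f and round to the nearest integer
    lo, hi = -(1 << (k - 1)), (1 << (k - 1)) - 1
    return max(lo, min(hi, x_bar)) # saturate to the representable range

def to_real(x_bar: int, f: int = 24) -> float:
    """Map the integer representative back to its real value."""
    return x_bar / (1 << f)

if __name__ == "__main__":
    x = 3.14159
    x_bar = to_fixed(x)
    print(x_bar, to_real(x_bar))   # reconstruction error is at most 2**(-f-1)
```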

II-B Cyclic Gradient Codes

Gradient codes [Tan17] are a class of codes designed to mitigate the effect of stragglers in distributed gradient descent in data centers. Consider a piece of data partitioned into $n$ partitions, which are distributed among $n$ servers, each storing $k$ partitions. An $(n,k)$ fixed-point gradient code is characterized by the matrices $\bm{A}$ and $\bm{B}$ of size $F\times n$ and $n\times n$, respectively, where $F$ denotes the number of straggling patterns that the code can deal with. The $i$-th row of $\bm{B}$, $\bm{b}_i$, is associated with the $i$-th server; the support of the row corresponds to the partitions of the data assigned to that server. Furthermore, we assume that the supports of the rows of $\bm{B}$, each of size $k$, follow a cyclic pattern. In other words, the data partitions are placed cyclically across the servers. We will refer to such cyclic gradient codes simply as gradient codes throughout the rest of the paper. Now, let $\bm{g}_j$ denote the gradient of the $j$-th partition. The encoding of the gradients at each server is given by $\bm{C}=\bm{B}\,(\bm{g}_1^\top,\ldots,\bm{g}_n^\top)^\top$, where the $i$-th row of $\bm{C}$ corresponds to server $i$. Each server then sends the encoding of the local gradients to a master server, whose aim is to linearly combine any $n-k+1$ of them to obtain $\sum_{j=1}^{n}\bm{g}_j$, thus mitigating the impact of stragglers. We refer to this operation as the decoding operation, which is determined by $\bm{A}$. Particularly, if the master server receives gradients from a subset of $n-k+1$ servers, it applies the linear combination of these gradients with the coefficient for server $i$ given by the $i$-th element of the row of $\bm{A}$ with support equal to this subset. Moreover, it is required that

$\bm{A}\bm{B}=\mathbf{1}_{F\times n}$.   (1)

We refer the interested reader to [Tan17, Algs. 1 and 2] for the construction of $\bm{A}$ and $\bm{B}$.
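To make the encoding and decoding concrete, the following Python sketch instantiates one valid cyclic gradient code for $n=3$ servers and $k=2$ stored partitions; the matrices are an illustrative choice satisfying the condition in (1), not necessarily the ones produced by the algorithms in [Tan17]:

```python
# Illustrative sketch of a cyclic gradient code with n = 3 servers, each storing
# k = 2 of the 3 data partitions (cyclic placement), tolerating 1 straggler.
import numpy as np

# Encoding matrix B (row i = server i, support = its two assigned partitions).
B = np.array([[0.5, 1.0, 0.0],
              [0.0, 1.0, -1.0],
              [0.5, 0.0, 1.0]])

# Decoding matrix A: one row per straggler pattern, support = the surviving servers.
A = np.array([[2.0, -1.0, 0.0],   # server 3 straggles
              [1.0, 0.0, 1.0],    # server 2 straggles
              [0.0, 1.0, 2.0]])   # server 1 straggles

assert np.allclose(A @ B, np.ones((3, 3)))   # the gradient-code condition (1)

# Per-partition gradients (here random vectors) and their encodings at the servers.
rng = np.random.default_rng(0)
g = rng.standard_normal((3, 4))              # g[j] = gradient of partition j
c = B @ g                                    # c[i] = coded gradient sent by server i

# Decode from the two fastest servers, e.g. servers {1, 2} (server 3 straggles).
recovered = A[0, :2] @ c[:2]                 # 2*c_1 - 1*c_2
assert np.allclose(recovered, g.sum(axis=0)) # equals the full gradient sum
```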

III System Model

We consider an FL scenario in which $D$ devices collaborate to train a machine learning model with the help of a central server. Device $i\in[D]$ has local data consisting of $n_i$ training points. We denote by $m$ the total number of data points across all devices, i.e., $m=\sum_{i=1}^{D} n_i$. The scheme proposed in Section IV is based on one-time padding, which cannot be applied over the reals. To circumvent this shortcoming, our scheme works on the fixed-point representation of the data. Hereafter, we assume that the feature vectors and labels are the fixed-point representations of the corresponding real-valued vectors. Note that practical systems operate in fixed point, hence the proposed scheme does not rest on a limiting assumption.

We represent the data of device $i$ in matrix form as

$\bm{X}^{(i)}=\bigl((\bm{x}^{(i)}_1)^\top,\ldots,(\bm{x}^{(i)}_{n_i})^\top\bigr)^\top, \qquad \bm{Y}^{(i)}=\bigl((\bm{y}^{(i)}_1)^\top,\ldots,(\bm{y}^{(i)}_{n_i})^\top\bigr)^\top,$

where $\bm{X}^{(i)}$ is of size $n_i\times d$ and $\bm{Y}^{(i)}$ of size $n_i\times c$. The devices and the central server collaboratively try to infer a linear global model $\bm{\Theta}$ as $\hat{\bm{y}}=\bm{x}\bm{\Theta}$, where $\bm{x}$ is a feature vector and $\bm{y}$ its corresponding label, using federated gradient descent.

III-A Federated Gradient Descent

For convenience, we collect the whole data (consisting of $m$ data points) in matrices $\bm{X}$ and $\bm{Y}$ as

$\bm{X}=\bigl((\bm{X}^{(1)})^\top,\ldots,(\bm{X}^{(D)})^\top\bigr)^\top, \qquad \bm{Y}=\bigl((\bm{Y}^{(1)})^\top,\ldots,(\bm{Y}^{(D)})^\top\bigr)^\top,$

where $\bm{X}$ is of size $m\times d$ and $\bm{Y}$ of size $m\times c$. Inferring the linear model can be formalized as the minimization problem

$\bm{\Theta}^{*}=\operatorname{arg\,min}_{\bm{\Theta}}\, f(\bm{X},\bm{Y},\bm{\Theta}),$   (2)

where $f(\bm{X},\bm{Y},\bm{\Theta})=\frac{1}{m}\|\bm{X}\bm{\Theta}-\bm{Y}\|_{\mathrm{F}}^2+\lambda\|\bm{\Theta}\|_{\mathrm{F}}^2$ is the global loss function and $\lambda\ge 0$ the regularization parameter.

Let $f_i(\bm{X}^{(i)},\bm{Y}^{(i)},\bm{\Theta})=\frac{1}{m}\|\bm{X}^{(i)}\bm{\Theta}-\bm{Y}^{(i)}\|_{\mathrm{F}}^2$ denote the local loss function corresponding to the data at device $i$. Then, $f$ in (2) can be expressed as $f(\bm{X},\bm{Y},\bm{\Theta})=\sum_{i=1}^{D} f_i(\bm{X}^{(i)},\bm{Y}^{(i)},\bm{\Theta})+\lambda\|\bm{\Theta}\|_{\mathrm{F}}^2$.

Federated gradient descent proceeds iteratively to train the model $\bm{\Theta}$. At each epoch, the devices compute the gradients of the respective loss functions and send them to the central server, which aggregates the received gradients to update the model. More precisely, during the $e$-th epoch, device $i$ computes the gradient

$\bm{G}^{(e)}_i=\nabla_{\bm{\Theta}} f_i(\bm{X}^{(i)},\bm{Y}^{(i)},\bm{\Theta}^{(e)})=\frac{2}{m}\bigl((\bm{X}^{(i)})^\top\bm{X}^{(i)}\bm{\Theta}^{(e)}-(\bm{X}^{(i)})^\top\bm{Y}^{(i)}\bigr),$   (3)

where $\bm{\Theta}^{(e)}$ denotes the current model estimate. Upon reception of the gradients, the central server aggregates them to update the model according to

$\bm{G}^{(e)}=\sum_{i=1}^{D}\bm{G}^{(e)}_i+2\lambda\bm{\Theta}^{(e)},$   (4)

$\bm{\Theta}^{(e+1)}=\bm{\Theta}^{(e)}-\mu\,\bm{G}^{(e)},$   (5)

where $\mu$ is the learning rate. The updated model is then sent back to the devices, and (3), (4), and (5) are iterated until convergence, i.e., until the change in the model between consecutive epochs falls below a prescribed threshold.
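As an illustration of the update rule in (3)-(5), the following Python sketch runs plain (uncoded) federated gradient descent on synthetic data; the $2/m$ scaling and the placement of the regularizer follow the reconstruction above and are one consistent choice rather than the paper's exact constants:

```python
# Sketch of plain (uncoded) federated gradient descent for regularized linear
# regression, following the structure of (2)-(5).
import numpy as np

def local_gradient(X_i, Y_i, Theta, m):
    """Partial gradient of device i, cf. (3)."""
    return (2.0 / m) * X_i.T @ (X_i @ Theta - Y_i)

def federated_gd(devices, d, c, lam=0.01, mu=0.1, epochs=100):
    """devices: list of (X_i, Y_i) pairs held locally; returns the global model."""
    m = sum(X.shape[0] for X, _ in devices)
    Theta = np.zeros((d, c))
    for _ in range(epochs):
        # Each device computes and uploads its partial gradient.
        grads = [local_gradient(X, Y, Theta, m) for X, Y in devices]
        # The central server aggregates, adds the regularizer, and updates, cf. (4)-(5).
        G = sum(grads) + 2 * lam * Theta
        Theta = Theta - mu * G
    return Theta

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_Theta = rng.standard_normal((5, 2))
    devices = []
    for _ in range(4):
        X = rng.standard_normal((50, 5))
        devices.append((X, X @ true_Theta))
    Theta_hat = federated_gd(devices, d=5, c=2, lam=0.0, mu=0.5, epochs=500)
    print(np.linalg.norm(Theta_hat - true_Theta))  # small residual error
```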

III-B Computation and Communication Latency

Let $\mathsf{T}^{\mathrm{cmp}}_i$ be the time required by device $i$ to compute a given number $\sigma$ of multiply and accumulate (MAC) operations. Similar to [Zha19], we model $\mathsf{T}^{\mathrm{cmp}}_i$ as a shifted exponential random variable,

$\mathsf{T}^{\mathrm{cmp}}_i=\frac{\sigma}{\tau_i}\bigl(1+\mathsf{\Lambda}_i\bigr),$

where the $\mathsf{\Lambda}_i$ are independent exponential random variables representing the random setup times required by the devices, and $\tau_i$ is the number of MAC operations per second performed by device $i$.

We assume that communication between the central server and the devices may fail. To enable communication, the devices and the central server repetitively transmit during the uplink and downlink phases until the first successful transmission occurs. Let $\mathsf{N}^{\mathrm{u}}_i\sim\mathrm{Geo}(p_i)$ and $\mathsf{N}^{\mathrm{d}}_i\sim\mathrm{Geo}(p_i)$ denote the number of transmissions needed for successful communication in the uplink and downlink, respectively, where $p_i$ denotes the failure probability of a single transmission between the central server and device $i$. Also, let $r^{\mathrm{u}}$ and $r^{\mathrm{d}}$ be the transmission rates between the central server and the devices in the uplink and downlink, respectively. Then, the time required to successfully communicate $b$ bits during uplink and downlink, denoted by $\mathsf{T}^{\mathrm{u}}_i$ and $\mathsf{T}^{\mathrm{d}}_i$, respectively, is

$\mathsf{T}^{\mathrm{u}}_i=\mathsf{N}^{\mathrm{u}}_i\,\frac{b}{r^{\mathrm{u}}}, \qquad \mathsf{T}^{\mathrm{d}}_i=\mathsf{N}^{\mathrm{d}}_i\,\frac{b}{r^{\mathrm{d}}}.$

In our model, all communication between any two devices happens over a secured link and is relayed through the central server, i.e., any two devices share an encrypted communication link and the central server learns nothing about the exchanged messages.
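The latency model can be simulated with a few lines of Python; the multiplicative shifted-exponential form and all parameter values below are illustrative assumptions rather than the exact model and numbers used in Section V:

```python
# Sketch of the latency model: computation time as a shifted exponential and
# communication time with geometric retransmissions. Parameter values are
# illustrative placeholders.
import numpy as np

rng = np.random.default_rng(2)

def compute_time(num_macs, mac_rate, eta=0.5):
    """Time to perform `num_macs` MAC operations: deterministic part plus an
    exponential setup term proportional to it (one possible reading of the model)."""
    setup = rng.exponential(eta)              # random setup, E[setup] = eta
    return (num_macs / mac_rate) * (1.0 + setup)

def comm_time(bits, rate, p_fail):
    """Time to deliver `bits` over a link that fails with probability p_fail;
    the number of attempts is geometric with success probability 1 - p_fail."""
    attempts = rng.geometric(1.0 - p_fail)
    return attempts * bits / rate

# Example: per-epoch latency of one device = computation + upload of its gradient.
t = compute_time(num_macs=1e6, mac_rate=1e7) + comm_time(bits=8e5, rate=1e6, p_fail=0.1)
print(f"simulated per-epoch latency: {t:.3f} s")
```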

IV Low-Latency Federated Gradient Descent

The proposed scheme builds on one-time padding and gradient codes. Note, however, that one-time padding cannot be applied to data over the reals. To bypass this problem, we consider a fixed-point representation of the data and apply fixed-point arithmetic operations. In the following, we first discuss how to preserve privacy in performing operations using fixed-point arithmetic and then present the proposed scheme.

IV-A Privacy-Preserving Operations on Fixed-Point Numbers

The authors of [Cat10] were the first to address the problem of performing secure computations (in the context of multiparty computation) using fixed-point numbers. The idea is to map fixed-point numbers to finite-field elements, and then perform secure operations (addition, multiplication, and division) on two secretly-shared numbers over the finite field. In this paper, we use a similar approach to the one in [Cat10] but define a different mapping and a simplified multiplication operation, leveraging the fact that we only need to multiply a secretly-shared number with a public number, as discussed in the next subsection. The resulting protocol is more efficient than the one in [Cat10].

Consider the fixed-point datatype $\langle k,f\rangle$ (see Section II-A). Secure addition on $\langle k,f\rangle$ can be performed via simple integer addition with an additional modulo operation. Let $\phi$ be the map from the integers onto the set $\{-2^{k-1},\ldots,2^{k-1}-1\}$ given by the modulo-$2^{k}$ operation. Furthermore, let $a,b\in\langle k,f\rangle$, with $a=\bar{a}\cdot 2^{-f}$ and $b=\bar{b}\cdot 2^{-f}$. For $c=a+b$, with $\bar{c}=\phi(\bar{a}+\bar{b})$, we have $c=\bar{c}\cdot 2^{-f}$.

Multiplication on $\langle k,f\rangle$ is performed via integer multiplication with scaling over the reals in order to retain the precision of the datatype and an additional modulo operation. For $c=a\cdot b$, with $\bar{c}=\phi\bigl(\lfloor\bar{a}\,\bar{b}\cdot 2^{-f}\rfloor\bigr)$, we have $c=\bar{c}\cdot 2^{-f}$.

Proposition 1 (Perfect privacy).

Consider a secret $a\in\langle k,f\rangle$ and a key $r\in\langle k,f\rangle$ that is picked uniformly at random. Then, $a+r$ is uniformly distributed in $\langle k,f\rangle$, i.e., $a+r$ does not reveal any information about $a$.

Proposition 1 is an application of a one-time pad, which was proven secure by Shannon in [Shannon49]. It follows that if an adversary (having unbounded computational power) obtains the sum of the secret and the key, $a+r$, but does not know the key $r$, it cannot determine the secret $a$.

Proposition 2 (Retrieval).

Consider a public fixed-point number $w\in\langle k,f\rangle$, a secret $a\in\langle k,f\rangle$, and a key $r\in\langle k,f\rangle$ that is picked uniformly at random. Suppose we have the weighted sum $w\cdot(a+r)$ and the key; then we can retrieve $w\cdot a$.

The above proposition tells us that, given $w\cdot(a+r)$, $w$, and $r$, it is possible to obtain an approximation of $w\cdot a$. Moreover, if we choose a sufficiently large $f$, then we can retrieve $w\cdot a$ with negligible error.
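The following Python sketch illustrates Propositions 1 and 2 on the integer representatives of the fixed-point ring; deferring the rescaling by $2^{-f}$ until after the key has been removed is an assumption made here for simplicity, not a description of the paper's exact protocol:

```python
# Sketch of Propositions 1 and 2: values are stored as integers with F fractional
# bits and all arithmetic is done modulo 2**K. The key removal below defers the
# rescaling until after the key has been subtracted, one simple way to make the
# retrieval exact up to the quantization of the inputs (an assumption of this
# sketch, not a claim about the paper's exact protocol).
import secrets

K, F = 64, 16
MOD, HALF = 1 << K, 1 << (K - 1)

def phi(v: int) -> int:
    """Reduce an integer to the signed representative in [-2**(K-1), 2**(K-1))."""
    return ((v + HALF) % MOD) - HALF

def to_fixed(x: float) -> int:
    return phi(round(x * (1 << F)))

# --- Proposition 1: one-time padding ---
a = to_fixed(1.25)                      # the secret
r = phi(secrets.randbelow(MOD) - HALF)  # key, uniform over the ring
padded = phi(a + r)                     # shared value: uniform, independent of a
assert phi(padded - r) == a             # whoever knows the key recovers a exactly

# --- Proposition 2: retrieval of a public-weighted, padded value ---
w = to_fixed(0.75)                      # public weight
weighted = phi(w * padded)              # computed by a party that only sees `padded`
retrieved = phi(weighted - phi(w * r))  # key removal by the party that knows r
# `retrieved` carries 2F fractional bits; rescale once at the end.
print(retrieved / (1 << (2 * F)), 0.75 * 1.25)
```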

IV-B Data Sharing Scheme

We are now ready to introduce the proposed privacy-preserving scheme. It consists of two phases: in the first phase, discussed in this subsection, we secretly share data between devices, which enables the use of gradient codes in the second phase to perform privacy-preserving coded federated gradient descent while conferring straggler mitigation.

The central server first generates two sets of keys, $\{\bm{R}_i\}$ and $\{\bm{R}'_i\}$, where $\bm{R}_i$ and $\bm{R}'_i$ are sent to device $i$ (we consider the communication cost of transmitting keys to be negligible since, in practice, it is enough to send a much smaller pseudorandom number generator seed instead of the random numbers), $\bm{R}_i$ is a matrix of size $d\times c$, and $\bm{R}'_i$ is a symmetric matrix of size $d\times d$. Using its keys and its data $(\bm{X}^{(i)},\bm{Y}^{(i)})$, device $i$ computes

$\tilde{\bm{G}}_i=\bm{G}^{(1)}_i+\bm{R}_i,$   (6)

$\tilde{\bm{A}}_i=\frac{2}{m}(\bm{X}^{(i)})^\top\bm{X}^{(i)}+\bm{R}'_i,$   (7)

where $\bm{G}^{(1)}_i$ is the gradient of device $i$ in the first epoch (see (3)).

The above matrices are one-time padded versions of the gradient and of the transformed data. Sharing $\tilde{\bm{G}}_i$ and $\tilde{\bm{A}}_i$ does not leak any information about the data of device $i$, but it is critical nevertheless, as it introduces redundancy of the data across devices, which enables the use of gradient codes in the second phase. In the following, we describe the sharing process in detail.

Let $\alpha$ be the number of local datasets to be stored at each device (including its own), and $\bm{B}$ the encoding matrix of a $(D,\alpha)$ gradient code in fixed-point representation with entries $b_{i,j}$. Define the transmission matrix

$\bm{T}=(t_{i,j})$ of size $\alpha\times D$,

where $t_{i,j}$ identifies a device that shares its padded data with device $j$. We assume that devices are equipped with full-duplex technology and that simultaneous transmission between the devices and the central server via parallel channels is possible. Thus, the transmission schedule in $\bm{T}$ is completed after $\alpha-1$ transmissions (encompassing both upload and download).

Example 1.

Consider $D$ devices and $\alpha=2$. We have the transmission matrix $\bm{T}$ whose first row is $(1,2,\ldots,D)$ and whose second row is $(2,3,\ldots,D,1)$, where, for instance, $t_{2,1}=2$ denotes that device 2 shares its padded gradient and data, $\tilde{\bm{G}}_2$ and $\tilde{\bm{A}}_2$, with device 1. The first row says that each device should share its data with itself, making communication redundant, whereas the second row says that device $j+1$ (with indices taken modulo $D$) should share its padded gradient and data with device $j$. Since the devices operate in full-duplex mode, and communication is performed over parallel channels, a single transmission is enough.
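A hypothetical helper for building such a cyclic schedule (0-based indices, and an arbitrary choice of sharing direction) could look as follows:

```python
# Sketch of a cyclic transmission schedule: row 0 is each device's own data (no
# transmission needed), and row t says which device's padded data device j receives
# in the t-th transmission round. Indices here are 0-based for convenience.
def transmission_matrix(D: int, alpha: int):
    """Return an alpha x D matrix T with T[t][j] = (j + t) % D."""
    return [[(j + t) % D for j in range(D)] for t in range(alpha)]

for row in transmission_matrix(D=4, alpha=2):
    print(row)
# [0, 1, 2, 3]   <- own data
# [1, 2, 3, 0]   <- device j receives the padded data of device j+1 (mod D)
```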

Remark 1.

Note that $\tilde{\bm{A}}_i$ is symmetric, as $(\bm{X}^{(i)})^\top\bm{X}^{(i)}$ and $\bm{R}'_i$ are symmetric. Thus, in order to obtain $\tilde{\bm{A}}_i$ at device $j$, device $i$ only needs to communicate the upper triangular half of the matrix to device $j$.

Once the padded gradient and data have been shared, using a $(D,\alpha)$ gradient code, device $i$ computes

$\bm{C}^{\mathrm{G}}_i=\sum_{j\in\mathrm{supp}(\bm{b}_i)} b_{i,j}\,\tilde{\bm{G}}_j,$   (8)

$\bm{C}^{\mathrm{A}}_i=\sum_{j\in\mathrm{supp}(\bm{b}_i)} b_{i,j}\,\tilde{\bm{A}}_j,$   (9)

which completes the sharing phase. Equation (8) corresponds to the encoding via a gradient code of the padded gradient of device $i$ at epoch 1 and the padded gradients (at epoch 1) received from other devices. Similarly, (9) corresponds to the encoding of the padded data of device $i$ as well as the padded data received from other devices.

IV-C Coded Gradient Descent

After the transmission phase, the central server and the devices iteratively train a global model using gradient descent. Consider the $e$-th epoch and let

$\bm{\Theta}^{(e)}=\bm{\Theta}^{(1)}+\bm{\Delta}^{(e)}$   (10)

be the model parameter at the $e$-th epoch, where $\bm{\Delta}^{(e)}$ is the update matrix (with $\bm{\Delta}^{(1)}=\bm{0}$) and $\bm{\Theta}^{(1)}$ the initial model estimate. Instead of sending $\bm{\Theta}^{(e)}$ to the devices, as is standard for gradient descent, in the proposed coded gradient descent the central server sends the update matrix $\bm{\Delta}^{(e)}$.

Upon reception of $\bm{\Delta}^{(e)}$, the devices compute gradients on the encoded padded data. Particularly, in the $e$-th epoch, device $i$ computes the gradient

$\tilde{\bm{G}}^{(e)}_i=\bm{C}^{\mathrm{G}}_i+\bm{C}^{\mathrm{A}}_i\bm{\Delta}^{(e)} \overset{(a)}{=}\sum_{j\in\mathrm{supp}(\bm{b}_i)} b_{i,j}\Bigl(\bm{G}^{(1)}_j+\bm{R}_j+\bigl(\tfrac{2}{m}(\bm{X}^{(j)})^\top\bm{X}^{(j)}+\bm{R}'_j\bigr)\bm{\Delta}^{(e)}\Bigr) \overset{(b)}{=}\sum_{j\in\mathrm{supp}(\bm{b}_i)} b_{i,j}\Bigl(\bm{G}^{(1)}_j+\tfrac{2}{m}(\bm{X}^{(j)})^\top\bm{X}^{(j)}\bm{\Delta}^{(e)}\Bigr)+\sum_{j\in\mathrm{supp}(\bm{b}_i)} b_{i,j}\bigl(\bm{R}_j+\bm{R}'_j\bm{\Delta}^{(e)}\bigr) \overset{(c)}{=}\sum_{j\in\mathrm{supp}(\bm{b}_i)} b_{i,j}\bigl(\bm{G}^{(e)}_j+\bm{R}_j+\bm{R}'_j\bm{\Delta}^{(e)}\bigr),$

where $(a)$ follows from (8) and (9) together with (6) and (7), $(b)$ is a reordering, and $(c)$ follows from (3) and (10). Device $i$ then sends $\tilde{\bm{G}}^{(e)}_i$ to the central server, which updates the global model as explained next. The central server waits for the first $D-\alpha+1$ gradients it receives, subtracts the keys (as it knows $\bm{\Delta}^{(e)}$ and the keys $\bm{R}_j$ and $\bm{R}'_j$), and performs a decoding operation based on the matrix $\bm{A}$, where $\bm{A}$ is the decoding matrix for the gradient code given by $\bm{B}$. Let $\mathcal{F}^{(e)}$, $|\mathcal{F}^{(e)}|=D-\alpha+1$, be the set of indices of the fastest devices to finish the computation of $\tilde{\bm{G}}^{(e)}_i$. After removing the keys from $\tilde{\bm{G}}^{(e)}_i$, $i\in\mathcal{F}^{(e)}$, the central server obtains $\hat{\bm{G}}^{(e)}_i=\sum_{j\in\mathrm{supp}(\bm{b}_i)} b_{i,j}\,\bm{G}^{(e)}_j$. Next, it decodes according to $\bm{A}$ as follows. Let $\bm{a}$ be the row of $\bm{A}$ such that $\mathrm{supp}(\bm{a})=\mathcal{F}^{(e)}$. Then,

$\sum_{i\in\mathcal{F}^{(e)}} a_i\,\hat{\bm{G}}^{(e)}_i \overset{(a)}{=}\sum_{j=1}^{D}\bm{G}^{(e)}_j \overset{(b)}{=}\bm{G}^{(e)}-2\lambda\bm{\Theta}^{(e)},$   (11)

where $(a)$ follows from the property of gradient codes in (1) and $(b)$ follows from (4). Lastly, $\bm{\Theta}^{(e+1)}$ is obtained according to (5) for the next epoch. Note that for the central server to obtain the correct global model update, the devices can perform only one epoch of local training between two successive global updates. This restriction means that our scheme can only be applied to federated gradient descent and not to federated averaging, where devices perform multiple local model updates before the central server updates the global model.
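The following toy Python sketch puts one epoch of the scheme together, using the small gradient code from the earlier sketch and real-valued arithmetic for readability (the actual scheme operates over the fixed-point ring); the scalings and variable names follow the reconstruction above and are assumptions, not the paper's exact implementation:

```python
# Toy sketch of one epoch of the coded scheme, in real arithmetic for readability.
import numpy as np

rng = np.random.default_rng(3)
D, d, c, n_i = 3, 4, 2, 20            # devices, features, outputs, samples/device
m = D * n_i

# The small (3, 2) cyclic gradient code from the earlier sketch (tolerates 1 straggler).
B = np.array([[0.5, 1.0, 0.0], [0.0, 1.0, -1.0], [0.5, 0.0, 1.0]])
A = np.array([[2.0, -1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 2.0]])

# Local data, keys, and first-epoch padded quantities, cf. (6)-(7).
X = [rng.standard_normal((n_i, d)) for _ in range(D)]
Y = [rng.standard_normal((n_i, c)) for _ in range(D)]
Theta1 = np.zeros((d, c))
R  = [rng.standard_normal((d, c)) for _ in range(D)]      # pads for the gradients
Rp = [rng.standard_normal((d, d)) for _ in range(D)]
Rp = [(M + M.T) / 2 for M in Rp]                          # symmetric keys
grad1 = [(2/m) * X[j].T @ (X[j] @ Theta1 - Y[j]) for j in range(D)]
G_pad = [grad1[j] + R[j] for j in range(D)]
A_pad = [(2/m) * X[j].T @ X[j] + Rp[j] for j in range(D)]

# Coded quantities held by each device, cf. (8)-(9).
CG = [sum(B[i, j] * G_pad[j] for j in range(D)) for i in range(D)]
CA = [sum(B[i, j] * A_pad[j] for j in range(D)) for i in range(D)]

# Epoch e: the server broadcasts the update Delta = Theta_e - Theta_1.
Delta = rng.standard_normal((d, c))
Theta_e = Theta1 + Delta
coded_grads = [CG[i] + CA[i] @ Delta for i in range(D)]   # computed by the devices

# Server: devices {0, 1} are fastest (device 2 straggles); A[0] has support {0, 1}.
fast = [0, 1]
keys = [sum(B[i, j] * (R[j] + Rp[j] @ Delta) for j in range(D)) for i in fast]
clean = [coded_grads[i] - keys[t] for t, i in enumerate(fast)]
decoded = sum(A[0, i] * clean[t] for t, i in enumerate(fast))

# Check: the decoded quantity equals the sum of the true partial gradients at Theta_e.
true_sum = sum((2/m) * X[j].T @ (X[j] @ Theta_e - Y[j]) for j in range(D))
print("decoded full gradient matches:", np.allclose(decoded, true_sum))
```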

Fig. 1: An example showcasing the system model as well as an epoch of the proposed coded FL scheme. The system consists of $D$ devices and a central server. The devices share the padded quantities $\tilde{\bm{G}}_i$ and $\tilde{\bm{A}}_i$. During the $e$-th epoch of the coded FL scheme, the central server sends $\bm{\Delta}^{(e)}$ to the devices. The devices compute coded gradients $\tilde{\bm{G}}^{(e)}_i$ using a $(D,\alpha)$ gradient code, and send them to the central server, which decodes them to compute the model update.

The proposed coded FL scheme is schematized in Fig. 1. It is easy to see that our scheme achieves the global optimum.

Proposition 3.

The proposed coded federated learning scheme is resilient to stragglers, and achieves the global optimum, i.e., the optimal model obtained through gradient descent for linear regression.

Proof:

From (11), we see that during each epoch $e$ the central server obtains $\sum_{j=1}^{D}\bm{G}^{(e)}_j$, and hence the full gradient $\bm{G}^{(e)}$, using the coded gradients received from the $D-\alpha+1$ fastest devices. It further obtains an updated linear model using (5), which is exactly the update rule for gradient descent. ∎

V Numerical Results

We simulate a wireless setting with $D$ devices and a central server that wish to perform FL on the MNIST [Cun10] and Fashion-MNIST [Xia17] datasets. Following the norm in the machine learning literature, we divide each dataset into training and test data, where we use the training data to train the model and the test data to validate it. To simulate non-identically distributed data, we sort the training data according to the labels and then divide the training data into $D$ equal parts, one for each device. Each device pre-processes its assigned data using kernel embedding, as done by the radial basis function sampler of Python's sklearn package (with a fixed kernel parameter and number of features), to obtain the (random) features, and then stores the embedded data. We assume that the pre-processing step is performed offline. For conventional FL, the devices use 32-bit floating-point arithmetic, whereas in the proposed coded FL scheme the devices work on fixed-point numbers with $k$ bits, out of which $f$ bits are for the fractional part. Furthermore, for conventional FL the data at the devices is divided into five smaller batches and we perform mini-batch learning to speed up the process by reducing the epoch times. The mini-batch size is chosen as a compromise between two corner cases: a very small mini-batch is difficult to parallelize efficiently, whereas a very large mini-batch may exceed the devices' limited parallelization capabilities, leading to an increased computational latency at each epoch. Splitting the local data into five mini-batches is a middle ground that utilizes the parallelization capabilities of the considered chips while keeping the computational load at each epoch reasonable.
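The pre-processing step can be sketched with sklearn's RBF sampler as follows; the kernel parameter, the number of features, and the seed are placeholders rather than the values used in Section V:

```python
# Sketch of the offline pre-processing step: each device maps its raw features to
# random Fourier features via sklearn's RBF sampler, so that a linear model on the
# embedded data can act as a nonlinear classifier. gamma and n_components are
# illustrative placeholders.
import numpy as np
from sklearn.kernel_approximation import RBFSampler

def embed_local_data(X_raw, gamma=0.01, n_components=2000, seed=0):
    """Return the kernel-embedded feature matrix for one device's local data."""
    sampler = RBFSampler(gamma=gamma, n_components=n_components, random_state=seed)
    return sampler.fit_transform(X_raw)

if __name__ == "__main__":
    X_raw = np.random.default_rng(4).random((100, 784))   # e.g. flattened MNIST images
    X_emb = embed_local_data(X_raw)
    print(X_emb.shape)                                     # (100, 2000)
```

Note that all devices should use the same random seed so that they share a common feature map for the global linear model.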

We consider devices with heterogeneous computation capabilities, which we model by varying the MAC rates $\tau_i$. In particular, we have four classes of devices, each with its own MAC rate, whereas the central server has a much higher MAC rate. We chose these MAC rates based on the performance that can be expected from devices with chips from the Texas Instruments TI MSP430 family [TIchip]. The setup times have an expected value equal to a fixed fraction of the deterministic computation time. The communication between the central server and the devices is based on the LTE Cat 1 standard for IoT applications with the corresponding uplink and downlink rates. The probability of transmission failure between the central server and the devices is constant across devices, i.e., $p_i=p$ for all $i$, and we assume a fixed header overhead for each packet. Lastly, we use a fixed regularization parameter $\lambda$ and an initial learning rate $\mu$ that is decreased at two later epochs.

Fig. 2: Training time on the (a) MNIST and (b) Fashion-MNIST datasets for the proposed coded FL scheme with different values of $\alpha$, the coded FL scheme in [Pra21], and conventional FL.

In Fig. 2(a), we plot the training time of the proposed coded FL scheme and compare it with that of conventional FL and the scheme in [Pra21] on the MNIST dataset. For our scheme, we consider several values of $\alpha$. Note that $\alpha=D$ corresponds to a replication scheme in which all the padded data is available at all devices. As $\alpha$ increases, so does the time required to complete the encoding and sharing phase (note that there is no encoding and sharing phase for conventional FL). This induces a delay in the start of the training phase, which can be observed in the figure as the initial offset of the coded FL curves. However, once the sharing phase is completed, the time required to finish an epoch decreases as $\alpha$ increases, since the central server only needs to wait for the gradients from the $D-\alpha+1$ fastest devices to perform the model update. We see that coded FL with a properly chosen $\alpha$ requires the least training time to achieve an accuracy of 95%, yielding a speed-up of approximately 6.6 compared to conventional FL, where we have to wait for the slowest device in each epoch. For different target accuracy levels, different values of $\alpha$ yield the lowest latency. If the target accuracy is low enough, conventional FL outperforms the proposed scheme. Furthermore, too low values of $\alpha$ never yield a lower latency for a given accuracy than conventional FL. The scheme in [Pra21] trades off efficient training with privacy. Briefly, the scheme ensures low-latency training by making the devices share parity data with the central server. The more parity data is shared with the server, the more load is put on the central server instead of the devices, thereby reducing the latency per epoch. To quantify the amount of parity data introduced, the authors of [Pra21] define a parameter given by the amount of parity data per device over the total amount of raw data across devices. Here, we choose two extreme values for this parameter, a small one and a large one. Note that the larger this parameter is, the more data is leaked to the central server. Our proposed scheme achieves a faster training time than the scheme in [Pra21] with the small amount of parity data for an accuracy of 95%, while it achieves a slightly worse training time than the scheme with the large amount of parity data. It is important to realize, though, that a large amount of parity data goes against the spirit of FL: it leaks almost all data to the central server and the training becomes almost equivalent to standard, centralized machine learning, whereas the idea of FL is to utilize local data at the devices without leaking it to a central server.

Fig. 3: Training time on the (a) MNIST and (b) Fashion-MNIST datasets for the proposed coded FL scheme and conventional FL with a subset of the fastest devices.

A similar behavior is observed for the Fashion-MNIST dataset in Fig. 2(b), for which the best choice of $\alpha$ yields a speed-up factor of approximately 9.2 compared to conventional FL for an accuracy of 85%. However, for a range of lower target accuracies, nontrivial coding schemes (i.e., intermediate values of $\alpha$) perform best.

In Figs. 3(a) and 3(b), we compare the performance of the proposed scheme with that of conventional FL where a fixed number of the slowest devices is dropped at each epoch. For conventional FL, we plot the average performance and the worst-case performance. Dropping devices in a heterogeneous network with strongly non-identically distributed data can have a big impact on the accuracy. This is highlighted in the figures; while dropping devices causes a limited loss in accuracy on average, in some cases the loss is significant. For the MNIST dataset, in the worst simulated case, the achievable accuracy drops noticeably, and it drops further as more devices are dropped (see Fig. 3(a)). The same behavior is observed for the Fashion-MNIST dataset (see Fig. 3(b)). The proposed scheme outperforms conventional FL in all cases; it has the benefit of dropping the slow devices in each epoch while not suffering from a loss in accuracy, thanks to the redundancy of the data across the devices (see Proposition 3).

VI Conclusion

We proposed a novel coded federated learning scheme for linear regression that provides resiliency to straggling devices while preserving the privacy level of conventional FL. The proposed scheme combines one-time padding, exploiting a fixed-point arithmetic representation of the data, to retain privacy, with gradient codes to mitigate the effect of stragglers. For a given target accuracy, the proposed scheme can be optimized to minimize the latency. For the MNIST dataset and an accuracy of 95%, our proposed coded FL scheme achieves a training speed-up factor of 6.6 compared to conventional FL, while for the Fashion-MNIST dataset our scheme achieves a training speed-up factor of 9.2 for an accuracy of 85%. Furthermore, our scheme yields comparable latency performance to the coded FL scheme in [Pra21], without incurring the additional loss in privacy of that scheme. For high accuracy levels, the proposed scheme also outperforms conventional FL where devices are dropped to trade lower latency for lower accuracy. While the focus of this paper is on linear regression, the proposed scheme can also be applied to nonlinear optimization problems, e.g., classification, by transforming the dataset using kernel embedding.

References