Asynchronous Online Federated Learning for Edge Devices

11/05/2019 · Yujing Chen et al.

Federated learning (FL) is a machine learning paradigm where a shared central model is learned across multiple distributed client devices while the training data remains on edge devices or local clients. Most prior work on federated learning uses Federated Averaging (FedAvg) as an optimization method for training in a synchronized fashion. This involves independent training at multiple edge devices with synchronous aggregation steps. However, the assumptions made by FedAvg are not realistic given the heterogeneity of devices. In particular, the volume and distribution of collected data vary in the training process due to different sampling rates of edge devices. The edge devices themselves also vary in their available communication bandwidth and system configurations, such as memory, processor speed, and power requirements. This leads to vastly different training times as well as model/data transfer times. Furthermore, availability issues at edge devices can lead to a lack of contribution from specific edge devices to the federated model. In this paper, we present an Asynchronous Online Federated Learning (ASO-fed) framework, where the edge devices perform online learning with continuous streaming local data and a central server aggregates model parameters from local clients. Our framework updates the central model in an asynchronous manner to tackle the challenges associated with both varying computational loads at heterogeneous edge devices and edge devices that lag behind or drop out. Experiments on three real-world datasets show the effectiveness of ASO-fed on lowering the overall training cost and maintaining good prediction performance.


1 Introduction.

As massive amounts of data are generated from modern edge devices (e.g., mobile phones, wearable devices, and GPS), distributed model training over a large number of computing nodes has become essential for machine learning. However, the sensitive nature of these data requires a secure and private computing environment. Additionally, the non-IID (not independent and identically distributed) and highly imbalanced characteristics of these data, coupled with the need for high-throughput networks for data transfer, lead to challenges in effective model training [3]. Federated learning [1] trains a shared global model from a federation of distributed devices under the coordination of a central server, while the training data is kept on device. Each device performs training on its local data and sends model parameter updates to a central server for aggregation. Both model training and prediction are performed locally, which has privacy and communication advantages compared to transferring all data to a centralized cloud center [1]. Many potential applications can leverage a federated framework, such as learning the activities of mobile device users, forecasting weather pollutants, and predicting health events like heart rate.

Prior work on federated learning usually follows a synchronous setting with a fixed amount of available data during training: the central server aggregates after receiving updates from all local clients [1, 2, 7, 28]. However, there are several challenges that synchronous federated learning cannot handle: 1) data on local devices may grow during the training process, since sensors on these distributed devices usually have a high sampling frequency; therefore, in the online setting, inter-client relatedness could potentially vary over time; 2) mobile devices can be frequently offline or have poor communication bandwidth due to network constraints, so synchronized federated learning frameworks can be extremely slow; 3) edge devices may lag or even drop out due to data or system heterogeneity [3].

In this work, we propose an asynchronous online federated learning framework, where distributed clients with continuously arriving data collaboratively learn an effective shared model. Previous online learning approaches with multiple clients [14, 15, 17] are not capable of solving the aforementioned challenges because they share the training samples of each client with other clients, which raises the same privacy concerns as centralized data centers. Jin et al. present a distributed online learning method [16] that trains on local clients and a central server alternately to reduce the communication cost, but it still requires each client to send a small portion of its data to the server. An illustration of our model is shown in Figure 1. The main contributions of the proposed ASO-fed approach are summarized as follows: 1) it allows asynchronous updates from multiple clients with continuously arriving data and is robust to network connections with high communication delays between the central server and some local clients; 2) it mitigates the straggler problem caused by device heterogeneity; 3) it learns inter-client relatedness effectively using regularization and a global feature learning module; and 4) it improves model prediction performance and reduces computation cost with personalized learning step sizes on local clients.

Figure 1: Overview of the proposed asynchronous federated learning framework, which: i) allows model updates from multiple distributed devices; ii) provides a mechanism to update the central model asynchronously; iii) is robust to stragglers and dropouts.

2 Preliminaries and Definitions.

In this section, we present the notations used in this paper along with common loss functions used in federated learning. Then we briefly introduce the commonly used FedAvg [1] model and identify the issues in synchronized federated settings.

2.1 Definitions and Loss Functions.

Assume that we have $K$ distributed devices, and for device $k$ we are given training data of $n_k$ data samples, where $X_k \in \mathbb{R}^{n_k \times d}$ collects the feature vectors of the training data and $Y_k$ is the corresponding label matrix. To facilitate the learning, for each data sample $(x_i, y_i)$, let $\ell(w; x_i, y_i)$ be the corresponding loss function, written $\ell_i(w)$ in short. Then for each dataset on device $k$, a loss function $F_k(w)$ is defined over the $n_k$ data samples of this device:

$F_k(w) = \frac{1}{n_k} \sum_{i=1}^{n_k} \ell_i(w)$    (1)

We set $\ell$ as the cross-entropy loss for classification models and the mean absolute error for regression models. Federated learning methods (e.g., [1, 7]) are designed to handle $K$ distributed devices and a central server that coordinates the global learning objective across the network. We denote $N = \sum_{k=1}^{K} n_k$ as the total number of samples across the $K$ devices. The global loss function over all distributed devices and examples is defined as:

$F(w) = \sum_{k=1}^{K} \frac{n_k}{N} F_k(w)$    (2)
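As an illustration, here is a minimal numpy sketch of these two losses; the per-sample loss `sample_loss` and the toy client data are hypothetical stand-ins, not the paper's setup:

```python
import numpy as np

def client_loss(w, X_k, y_k, sample_loss):
    """F_k(w): average per-sample loss over the n_k samples held by one device (Eq. (1))."""
    return np.mean([sample_loss(w, x, y) for x, y in zip(X_k, y_k)])

def global_loss(w, clients, sample_loss):
    """F(w): per-client losses weighted by each device's share n_k / N of the data (Eq. (2))."""
    N = sum(len(X_k) for X_k, _ in clients)
    return sum(len(X_k) / N * client_loss(w, X_k, y_k, sample_loss)
               for X_k, y_k in clients)

# Toy usage with a squared-error per-sample loss and random client data (hypothetical).
sq_loss = lambda w, x, y: 0.5 * (w @ x - y) ** 2
clients = [(np.random.randn(50, 3), np.random.randn(50)),
           (np.random.randn(20, 3), np.random.randn(20))]
print(global_loss(np.zeros(3), clients, sq_loss))
```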

2.2 Synchronized Federated Optimization.

Prior work on federated optimization (e.g., [7, 9, 10]) is usually based on FedAvg. It assumes a synchronized update scheme that proceeds in rounds of communication. At each round (global iteration), a fraction of the clients are randomly selected and local solvers (e.g., stochastic gradient descent) are used to optimize the local objective function on each of the selected clients. The clients then send their local model parameters to a central server, and the central model updates are averaged after all local parameters have been received.
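For reference, a minimal sketch of one such synchronous round, with a plain gradient step standing in for the local solver; `grad_fn`, the client fraction, and the toy client data are placeholders, not the paper's configuration:

```python
import numpy as np

def fedavg_round(w_global, clients, grad_fn, C=0.5, local_epochs=1, lr=0.01):
    """One synchronous FedAvg round: sample a fraction C of clients, run local SGD,
    then average the returned models weighted by the local sample counts."""
    m = max(1, int(C * len(clients)))
    selected = np.random.choice(len(clients), size=m, replace=False)
    updates, sizes = [], []
    for k in selected:
        X_k, y_k = clients[k]
        w_k = w_global.copy()
        for _ in range(local_epochs):
            w_k -= lr * grad_fn(w_k, X_k, y_k)   # local solver (one gradient step per epoch)
        updates.append(w_k)
        sizes.append(len(X_k))
    sizes = np.asarray(sizes, dtype=float)
    # The server waits for *all* selected clients before averaging -- the synchronization point.
    return np.average(updates, axis=0, weights=sizes / sizes.sum())

# Toy usage: linear-regression gradient on random client data (hypothetical).
grad = lambda w, X, y: X.T @ (X @ w - y) / len(y)
clients = [(np.random.randn(n, 4), np.random.randn(n)) for n in (80, 40, 120)]
w = np.zeros(4)
for _ in range(5):
    w = fedavg_round(w, clients, grad)
print(w)
```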

The disadvantage of synchronized optimization is that, at each round, when one or more clients suffer from high network delays, or have more data and hence need longer training time, all the other clients must wait. Since the central server aggregates only after all clients finish, the extended waiting time in synchronized optimization leads to idling and wasted computing resources.

In addition, with data stored on a large number of local clients, communication efficiency is of utmost importance. Algorithms in federated learning should handle training data with the following characteristics (a small illustrative sketch follows the list):


  • Non-IID: Data on each client may have different distributions, i.e., the overall distribution cannot be learned from data on a single client.

  • Imbalanced data: Data can be biased to certain labels, e.g., users may have different habits or edge devices are monitoring different locations.

  • Heterogeneity: Data size and device performance may vary on different local clients.

  • Increasing data: Data may continuously arrive at local clients during the training processes.
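To make the first two properties concrete, here is a small sketch of a hypothetical label-skewed partition that splits a labeled dataset into non-IID, imbalanced client shards; the split rule and all parameters are illustrative assumptions:

```python
import numpy as np

def label_skewed_split(X, y, n_clients, labels_per_client=2, seed=0):
    """Assign each client samples from only a few labels, with unequal shard sizes,
    to mimic non-IID and imbalanced federated data (illustrative only)."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    shards = []
    for _ in range(n_clients):
        own = rng.choice(classes, size=labels_per_client, replace=False)
        idx = np.where(np.isin(y, own))[0]             # only this client's labels
        take = rng.integers(low=max(1, len(idx) // 4), high=len(idx) + 1)  # imbalanced sizes
        pick = rng.choice(idx, size=take, replace=False)
        shards.append((X[pick], y[pick]))
    return shards

# Toy usage: 1000 samples, 5 classes, 4 clients with skewed, unequal shards.
X = np.random.randn(1000, 8)
y = np.random.randint(0, 5, size=1000)
clients = label_skewed_split(X, y, n_clients=4)
print([len(y_k) for _, y_k in clients])
```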

McMahan et al. [1] proposed FedAvg to address the non-IID and imbalanced properties of data in federated learning. However, FedAvg cannot handle growing data on clients, or the data/system heterogeneity that leads to stragglers or dropouts.

3 Proposed Method

We propose to perform asynchronous online federated learning where the central server begins to update model parameters after receiving one to several clients’ updates, without waiting for the other clients to finish. The details of the proposed ASO-fed approach will be explained in the following sections.

3.1 Regularized Federated Learning.

An important assumption of FedAvg is the relatedness among clients. We use the canonical distributed gradient-descent algorithm widely used in state-of-the-art federated learning systems (e.g., [1, 7]). Each client $k$ has its local model parameters $w_k$, and $w$ denotes the server model parameter aggregated from all clients. We observe that just minimizing $F(w)$ cannot achieve the desired knowledge transfer among clients because the minimization problems are decoupled for each local model $w_k$. Thus we add a penalty term to the global loss [10, 13, 21], which yields the following minimization problem:

$\min_{W} \; \sum_{k=1}^{K} \frac{n_k}{N} F_k(w_k) + \lambda\, \mathcal{R}(W)$    (3)

where $\mathcal{R}(W)$ represents the relatedness of the local clients, $W = [w_1, \ldots, w_K]$ stacks the local model parameters, and $\lambda$ is the regularization parameter that controls the amount of knowledge transfer among clients. As shown in [12], by penalizing the $\ell_{2,1}$ norm of the model parameter matrix $W$, the relatedness of clients can be efficiently learned. Therefore, we choose $\mathcal{R}(W) = \|W\|_{2,1}$. The grouped sparsity introduced by the $\ell_{2,1}$ norm penalization encourages many rows of $W$ to be zero, which is a way of compromising between finding small weights and minimizing the original cost function $F(w)$. The objective function is therefore updated as follows:

$\min_{W} \; \sum_{k=1}^{K} \frac{n_k}{N} F_k(w_k) + \lambda \|W\|_{2,1}$    (4)
Figure 2: Illustration of update procedure for the proposed ASO-fed model.
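As a concrete illustration of the penalty in Eq. (4), here is a small numpy sketch, under the assumption that $W$ is a matrix whose columns are the per-client parameter vectors (rows correspond to features):

```python
import numpy as np

def l21_norm(W):
    """Sum of the l2 norms of the rows of W; penalizing it drives whole rows
    (i.e., individual features across all clients) toward zero."""
    return np.sum(np.linalg.norm(W, axis=1))

def regularized_global_loss(global_loss_value, W, lam):
    """Eq. (4): original federated loss plus the grouped-sparsity penalty."""
    return global_loss_value + lam * l21_norm(W)

# Toy check: a W with one dense row and otherwise zero rows is penalized less
# than a uniformly dense W with the same Frobenius norm.
W_sparse = np.zeros((6, 4)); W_sparse[0] = 2.0
W_dense = np.full((6, 4), 2.0 / np.sqrt(6))
print(l21_norm(W_sparse), l21_norm(W_dense))
```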

3.2 Proposed Framework.

Figure 2 illustrates the update procedure for ASO-fed. We borrow the concept of global iterations from synchronous federated learning, where each aggregation on the central server is treated as one round. At the start, the server distributes the current model $w$ to a fraction of randomly selected available clients (two clients in the example of Figure 2). The first round then begins and these two clients initiate their local training. When the first of them finishes its local training, it uploads its local model to the central server for aggregation. At the central server, the new model is obtained by applying feature learning to the aggregated parameters. The server then starts the next round and distributes $w$ to the next fraction of randomly selected available clients (a third client in the example). This client starts its local training and later uploads its local model to the server; before that, the remaining client from the first round uploads its own local model. We can observe that the asynchronous update scheme introduces an inconsistency in when clients obtain model parameters from the central server. Such inconsistency is common in real-world settings and is caused by data and system heterogeneity or network delay. We address this problem by adding feature learning on the central server and a dynamic learning step size to the local clients' training. The approach of ASO-fed is detailed in Algorithm 1.
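The asynchronous round structure described above can be sketched with a simple queue-based server loop. This is only an illustrative skeleton under our own assumptions (threads standing in for client devices, a queue as the server's inbox), not the authors' implementation:

```python
import queue, random, threading
import numpy as np

def run_server(w, client_fns, aggregate, feature_learn, rounds=10, frac=0.5):
    """Skeleton of an asynchronous server loop: dispatch the current model to a sampled
    fraction of clients, then aggregate whatever updates have arrived, without waiting
    for every dispatched client to finish."""
    inbox = queue.Queue()   # finished (local_model, n_samples) updates land here
    for t in range(rounds):
        picked = random.sample(range(len(client_fns)), max(1, int(frac * len(client_fns))))
        for k in picked:    # each client trains in its own thread and may lag behind
            threading.Thread(target=lambda k=k, w=w: inbox.put(client_fns[k](w.copy())),
                             daemon=True).start()
        updates = [inbox.get()]          # block only until the first update arrives
        while not inbox.empty():         # drain anything else that is already done
            updates.append(inbox.get_nowait())
        w = feature_learn(aggregate(w, updates))   # server-side steps, Eqs. (5)-(7)
    return w

# Toy usage: clients take one noisy "gradient step"; aggregation is a weighted mean.
fns = [lambda w: (w - 0.1 * np.random.randn(*w.shape), 100) for _ in range(5)]
agg = lambda w, ups: np.average([u for u, _ in ups], axis=0, weights=[n for _, n in ups])
print(run_server(np.zeros(3), fns, agg, feature_learn=lambda w: w))
```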

3.3 Learning on Central Server.

We propose an asynchronous update procedure for the server: the central server begins to update the model parameter $w$ after it receives an update from one client (or several updates if multiple clients finish their local computations at the same time), without waiting for the other clients to finish their computations. The copies of $w$ on the server and the clients may therefore differ. At round $t$, assume the server receives updates from a subset $S_t$ of clients, where $1 \le |S_t| \le K$. Let the central server model be $w^{t}$ and $n_{S_t}$ be the total number of data samples of these clients. Then the server update is computed by aggregating the client-side updates:

$w^{t} = \sum_{k \in S_t} \frac{n_k}{n_{S_t}}\, w_k^{t}$    (5)

where $w_k^{t}$ is the local model parameter of client $k$ at round $t$.
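A minimal sketch of this aggregation step, assuming each received update carries the client's current parameters and its local sample count (following the weighted-average form of Eq. (5) as reconstructed above):

```python
import numpy as np

def aggregate(updates):
    """Weighted average of the local models that arrived this round,
    with weights n_k / n_St given by each client's sample count."""
    n_St = sum(n_k for _, n_k in updates)
    return sum((n_k / n_St) * w_k for w_k, n_k in updates)

# Example: two clients report this round while a third is still training.
w_a, w_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(aggregate([(w_a, 300), (w_b, 100)]))   # -> [0.75, 0.25]
```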

Feature Learning. We apply feature learning on the central server to learn a better feature representation. Attention mechanisms have shown effectiveness in extracting feature representations [25, 27]. Our feature learning approach is inspired by this and, additionally, combines weight normalization to reduce the computation cost [23, 24]. For each element of the column vectors of $w$, we adopt the operations below to obtain the updated $w$:

(6)
(7)

3.4 Learning on Local Clients.

At local clients, data continues to arrive during the global iterations, so each client needs to perform online learning. For this process, each client requests the latest model from the central server and updates the model with its new data. The updated model will affect the learning on other clients, so the cumulative misclassification/regression loss over the entire sequence of local data needs to be minimized.

Yang et al. [29] propose an online multi-task learning framework that tackles the insufficiency of batch-mode training algorithms with a combination of two norm regularizers. Inspired by their work, we propose a new online learning approach with $\ell_{2,1}$-norm regularization for local client learning. We assume that the continuously arriving data has the same distribution as the original data. At round $t$, client $k$ receives the model $w$ from the central server. Assuming that a set of data samples has newly arrived since the last local model update, the optimization of client $k$ at this round is formulated as:

(8)

where the local client gradient and the central server gradient are calculated as below:

(9)
(10)

With $\eta_k$ being the learning rate for client $k$, the closed-form solution is given by:

(11)
Input: multiple related learning clients distributed at client devices, regularization parameter $\lambda$, decay rate of EMA $\beta$.
Procedure at Central Server:
  for each global iteration $t$ do
    /* get the update on $w$ */
    compute $w^{t}$ [Eq. (5)]
    update $w^{t}$ with feature learning [Eq. (6) - Eq. (7)]
  end for
Procedure of Local Client $k$ at round $t$:
  for each local epoch $i$ from 1 to $E$ do
    receive $w$ from the central server
    compute the central and local gradients [Eq. (9) - Eq. (10)]
    set the updated local parameters with the dynamic step size [Eq. (12)]
  end for
  update the local model $w_k$
  perform EMA and get $\hat{w}_k$ [Eq. (13)]
  upload $\hat{w}_k$ to the central server
Algorithm 1: ASO-fed

Dynamic Learning Step Size. In real-world settings, the activation rates (i.e., how often clients provide updates to the global model) for different clients vary due to a host of reasons. Thus, we apply a dynamic learning step size with the intuition that if a client has less data or stable communication bandwidth, the activation rate of this client towards the global update will be large and thus the corresponding learning step size should be small. Dynamic learning step sizes are used in asynchronous optimization to achieve better learning performance [18, 11]. The updating process (11) can be revised as:

(12)

where the time-related multiplier is computed from the average delay of the client's past rounds. The actual learning step size is therefore scaled by the past communication delays: the longer the delay, the larger the step size, in order to compensate for the client's lower activation rate.
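The exact multiplier is given by the paper's Eq. (12); as a rough, hypothetical sketch of the idea only (not the authors' formula), one could scale a base learning rate by the client's average past delay relative to the average across clients:

```python
def dynamic_lr(base_lr, my_avg_delay, mean_avg_delay, eps=1e-8):
    """Hypothetical scaling: clients that update rarely (long average delay)
    take larger steps; frequently-updating clients take smaller ones."""
    return base_lr * (my_avg_delay / (mean_avg_delay + eps))

# A client that is twice as slow as average gets roughly twice the step size.
print(dynamic_lr(0.01, my_avg_delay=4.0, mean_avg_delay=2.0))
```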

Exponential Moving Average. Since data generated on devices is chronological, Exponential Moving Average (EMA) places a greater weight and significance on the most recent data points. EMA has been widely used in many sequential data prediction problems [19, 20]. We apply EMA on all trainable parameters in the local training. At the end of local training, EMA updates the local model as follows:

$\hat{w}_k = \beta\, \hat{w}_k + (1 - \beta)\, w_k$    (13)

where $\beta$ is the decay rate. Finally, the local parameter $\hat{w}_k$ is uploaded to the central server.
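A minimal sketch of this EMA step on the local parameters, assuming the standard form with decay rate $\beta$ (consistent with the reconstruction of Eq. (13) above):

```python
import numpy as np

def ema_update(shadow_w, w, beta=0.9):
    """Exponential moving average: maintain a decayed running copy of the local
    parameters, so more recent updates contribute more than older ones."""
    return beta * shadow_w + (1.0 - beta) * w

# The shadow parameters are what the client uploads to the server.
shadow = np.zeros(3)
for w_t in [np.ones(3), 2 * np.ones(3), 3 * np.ones(3)]:   # successive local models
    shadow = ema_update(shadow, w_t)
print(shadow)   # weighted toward the most recent w_t
```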

4 Experimental Setup

Dataset                   FitRec¹       Air Quality²   ExtraSensory³
# of clients              30            9              60
# of samples per client   30k-200k      8k-9k          2k-12k
Dimensionality            12            15             276
Labels                    c/r           r              c

¹ https://sites.google.com/eng.ucsd.edu/fitrec-project/home
² https://biendata.com/competition/kdd_2018/data/
³ http://extrasensory.ucsd.edu/

Table 1: Details of datasets used in the experiments. Label c denotes classification and r denotes regression.

We perform extensive experiments on three real-world datasets and compare against state-of-the-art methods. In particular, we evaluate 1) if ASO-fed has better prediction performance than FedAvg with continuous streaming data and 2) if asynchronous update schemes save computation cost in federated learning.

Datasets. We consider three real-world datasets as shown in Table 1.


  • FitRec Dataset: User sport records generated on mobile devices and uploaded to Endomondo, including sequential features such as heart rate, speed, and GPS, as well as the sport type (e.g., biking, hiking) [33]. We use the data of 30 randomly selected users for heart rate and speed prediction.

  • Air Quality Dataset: Air quality data collected from multiple weather devices distributed over 9 locations in Beijing, with weather features such as temperature and barometric pressure. Each area is modeled as a separate client, and the observed weather data is used to predict measures of six air pollutants (e.g., PM2.5).

  • ExtraSensory Dataset: Mobile phone data (e.g., location services, audio, and accelerometer) collected from 60 users [22]. We model the device of each user as a client and predict their activities (e.g., walking, talking, running).

Baselines. We compare the proposed ASO-fed approach to single-client learning and federated learning approaches. We select the following approaches as baselines:


  • FedAvg: a synchronous federated learning approach proposed by McMahan et al. [1].

  • AsyFL: an asynchronous version of FedAvg.

  • Local-S: single client learning approach with the same model structure as ASO-fed.

  • ASO-fed-D: the proposed ASO-fed without dynamic learning step size.

4.1 Training Details.

For each dataset, we split each client's data into training, validation, and testing sets. For each client's training data, we start with a random portion of the total training size and increase it each round to simulate arriving data. We fix the client fraction of FedAvg and the decay rate of the EMA, with one hyperparameter setting for the FitRec and Air Quality datasets and another for the ExtraSensory dataset. We use a single-layer LSTM with dropout for both the federated learning models and the single-client learning model, and fix the number of local epochs per client. We employ early stopping on the validation loss with a patience measured in global iterations. All experiments are conducted with an Intel E5-2683 v3 56-core CPU at 2.00GHz [5].
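The streaming setup above (start with a random portion of each client's training data and grow it every round) can be simulated with a small helper; the starting fraction and per-round increment below are illustrative placeholders, not the values used in the paper:

```python
import numpy as np

def streaming_view(X_train, y_train, round_idx, start_frac=0.3, growth=0.05, seed=0):
    """Return the portion of a client's training data visible at a given round,
    growing each round to simulate continuously arriving samples (placeholder rates)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X_train))            # fixed arrival order per client
    frac = min(1.0, start_frac + growth * round_idx)
    cutoff = max(1, int(frac * len(X_train)))
    idx = order[:cutoff]
    return X_train[idx], y_train[idx]

# Round 0 exposes ~30% of the data, round 5 ~55%, and so on (with these placeholders).
X, y = np.random.randn(1000, 12), np.random.randn(1000)
print(len(streaming_view(X, y, round_idx=0)[0]), len(streaming_view(X, y, round_idx=5)[0]))
```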

5 Experimental Results

Method (columns: FitRec mae/smape for Speed and mae/smape for HeartRate; Air Quality mae/smape; ExtraSensory F1/Precision/Recall/BA)
Local-S: 12.76, 0.61, 13.27, 0.63, 0.56
FedAvg: 80.13, 0.90, 80.85, 0.89, 44.29, 0.51, 0.50, 0.56, 0.76
AsyFL: 121.45, 0.82, 120.74, 0.83, 44.58, 0.58, 0.42, 0.45, 0.48, 0.72
ASO-fed-D: 37.71, 0.45, 0.41, 0.57, 0.76
ASO-fed:
improv.(1): 87.99%, 74.44%, 87.77%, 73.03%, 18.35%, 0.00%, 29.41%, 42.00%, 12.50%, 7.89%
improv.(2): 10.51%, 28.12%, 9.19%, 31.42%, 1.23%, 16.00%, 1.53%, -1.38%, 3.27%, 3.79%
Table 2: Prediction performance comparison. Bold numbers are the best performance; underlined numbers are the second best. improv.(1) shows the percentage improvement of ASO-fed over FedAvg. improv.(2) shows the percentage improvement of ASO-fed over the best baseline results.

5.1 Performance Comparison.

Table 2 compares the prediction performance of the different methods. For regression problems, we report the average mae and smape values, and for classification problems we report the average F1, Precision, Recall, and Balanced Accuracy (BA). We observe that ASO-fed achieves the lowest mae and smape values on FitRec and Air Quality, and has the best F1, Recall, and BA scores on ExtraSensory. AsyFL has the worst performance on all three datasets, which also indicates that the feature learning in ASO-fed helps learn a better representation across clients. From Table 2 we notice that ASO-fed significantly outperforms FedAvg on the FitRec dataset, lowering the average Speed mae and smape values by 87.99% and 74.44%, respectively.

Features in the FitRec dataset are not strongly correlated with each other (e.g., distance, altitude), and ExtraSensory has high-dimensional and noisy features; therefore FedAvg, without any feature learning, does not learn an effective feature representation. We also compare against a single-client learning approach (Local-S). We observe that ASO-fed outperforms Local-S on FitRec and Air Quality, and obtains performance close to that of Local-S on ExtraSensory. The training data of each client in FitRec is biased toward one sport type (e.g., biking, hiking), and the data distribution varies across different locations in Air Quality. Thus, the clients in FitRec and Air Quality follow a non-IID setting, while the clients in ExtraSensory do not.

Evaluation of Dynamic Learning Step Size. In Table 2 and Table 3, we also report the performance of ASO-fed-D to evaluate the effectiveness of the dynamic learning step size. The results show that the dynamic learning step size boosts model performance and lowers the computation cost.

5.2 Computation Time.

The run times of the synchronous and asynchronous approaches are reported in Table 3. As seen from this table, FedAvg has the highest computation cost on two out of three datasets. This is reasonable given that in FedAvg, each client node has to wait for the other client nodes to finish their computations. ASO-fed and AsyFL have much lower computation costs, which demonstrates that an asynchronous update scheme can greatly reduce training time. ASO-fed has slightly higher computation costs than AsyFL because ASO-fed performs additional computation, such as feature learning on the central server and the dynamic step-size calculation on local clients.

Method       FitRec     Air Quality   ExtraSensory
Local-S      1875.83    416.33        546.94
FedAvg       924.31     1008.14       998.72
AsyFL
ASO-fed-D    283.68     312.56        328.72
ASO-fed
Table 3: Training time (in seconds) comparison with baseline models.
Figure 3: Performance comparison as the dropout rate of clients increases. Panels: (a) FitRec; (b) Air Quality.

5.3 Robustness to Stragglers and Dropouts.

Stragglers are clients that lag in performing computation due to a variety of reasons: communication bandwidth, computation load, and data variability. In ASO-fed, clients update in an asynchronous manner, where each client can upload its local model to the central server immediately after its local training, without waiting for other clients to finish. Therefore, ASO-fed can handle the stragglers caused by these situations. We also investigate the common real-world scenario in which some clients do not respond during the whole training process; these clients are referred to as dropouts.

We explore the performance of FedAvg, AsyFL, and ASO-fed when some fraction of the clients drop out. We randomly select a certain portion of local clients that will not participate in the training process; the reported results, however, are evaluated on test data from all clients. As shown in Figure 3, for the FitRec dataset, we observe that as the rate of dropout clients increases, the smape values of AsyFL fluctuate. For FedAvg, there is less fluctuation but larger smape values compared to ASO-fed. As for ASO-fed, there is a slight increase in smape values, which then becomes steady as the dropout rate grows further. Even when a large portion of the local clients drop out during training, ASO-fed can still achieve the best performance. AsyFL has the worst performance on the Air Quality data. ASO-fed and FedAvg have close performance as the dropout rate increases, while the performance of ASO-fed is relatively stable.

5.4 Feature learning on Central Server.

In this section, we present qualitative results of the proposed feature learning on the central server. In Figure 4, we show the features learned from one client of each of the three datasets. For the client in ExtraSensory, the highlighted features are 'Gyroscope', 'Accelerometer', and 'Location', and the corresponding labels are 'walking' and 'at_home'. For the client from the Air Quality dataset, we observe that the features with high weights are 'Wind Speed' and 'Temperature'. This makes sense given that the target values are air pollutants (e.g., PM2.5, SO2) and 'Wind Speed' determines whether these pollutants can be dispersed. Air pollutants also vary with the seasons, with higher concentrations appearing in winter due to fuel consumption for heating; therefore 'Temperature' is also a strong indicator for air pollutants. For the client from FitRec, the extracted features are 'gender', 'sport type', and 'time'. Since the prediction targets are speed and heart rate, these three features have strong correlations with the targets. The above results show the effectiveness of feature learning in ASO-fed.

5.5 Results of Varying Training Samples.

Figure 4: Features learned on central server of three datasets. Each column is the weights vector within 48 time steps over the input series.
Figure 5: Average performance comparison (smape, F1, BA) on three datasets as training data increases. Panels: (a) FitRec; (b) Air Quality; (c) ExtraSensory (F1); (d) ExtraSensory (BA).

To evaluate the incremental online learning process more explicitly, we show in Figure 5 how the prediction performance changes as the rate of available training samples increases. We perform experiments with different rates of all clients' training data and depict the average performance over all local clients. For the FitRec dataset, ASO-fed achieves the lowest smape values across the varying rates of training data. Large fluctuations are observed in the results of FedAvg and AsyFL. Similar fluctuations are observed for AsyFL on Air Quality, which indicates an unstable model performance for AsyFL as local data increases. ASO-fed obtains performance similar to FedAvg on the Air Quality data. The Local-S method does not perform well on the two regression datasets. For ExtraSensory, Local-S has performance similar to ASO-fed on both F1 score and Balanced Accuracy. As mentioned before, FedAvg does not perform well on ExtraSensory because of the noisy feature characteristics of this dataset. The analysis shows that ASO-fed learns an effective model with a smaller portion of the training data and, as the local data increases, ASO-fed still maintains high prediction performance.

6 Related Work

6.1 Federated Learning and Optimization.

Federated learning was first proposed by McMahan et al. [1] and was benchmarked on image and language datasets. The approach of McMahan et al. [1] used a fixed global aggregation frequency and did not conduct experiments with variable client configurations. Many extensions have been explored based on this original federated learning setting. For instance, Hard et al. [2] used a variant of the LSTM to realize next-word prediction in a virtual keyboard for smartphones. Nishio et al. [4] proposed a protocol for the selection of local clients in federated learning. Konečnỳ et al. [7] used secure aggregation to protect the privacy of each user's model gradient when dealing with an untrusted server. A better approach to deal with non-IID data distributions was proposed by sharing a small amount of data with other devices [6]. Methods have also been proposed to compress the information exchanged within one global aggregation step [28]. A benchmarking framework for federated settings was developed by Caldas et al. [8]. However, most of these studies update the federated model in a synchronous fashion and do not tackle the problem of stragglers and dropouts. Smith et al. [3] developed a federated multi-task framework to deal with the statistical and system challenges in federated learning. However, this approach operates in a multi-task framework where separate models are learned for each local client, and it does not take the computational cost into consideration. All these approaches are designed for datasets of fixed size and are not suitable for real-time online learning.

6.2 Online Learning with Multiple Clients.

Online learning methods operate on data examples that arrive sequentially in a streaming fashion. Most existing work on online learning with multiple tasks (clients) focuses on taking advantage of task relationships. The online learning problem with multiple tasks was first introduced by Dekel et al. [30], where the relatedness of the participating tasks was captured by a global loss and the goal was to reduce the cumulative loss over online rounds; however, this approach did not take task relationship information into consideration. To better model task relationships, Lugosi et al. [17] imposed a hard constraint on the simultaneous actions taken by the learner in the expert setting, Agarwal et al. [31] used matrix regularization, and Murugesan et al. [26] proposed a method to learn the task relationship matrix automatically from the data. All these methods were proposed in synchronized frameworks and are not adaptable to real-world asynchronous learning.

Jin et al. [16] presented a distributed framework that performs local training and global learning alternately with a soft confidence-weighted classifier. Although this is an asynchronous approach, it assumes a Gaussian distribution of the local data, which is not a good fit for non-convex neural network objectives. Besides, it also requires each client to send a portion of its local data to the central server, which violates privacy.

Different from the above online learning approaches, our proposed ASO-fed updates in an asynchronous manner and the data remains on the local clients during the training process, which suits the real-world federated learning scenario.

7 Conclusions and Future Work

We propose a novel asynchronous online federated learning approach to tackle learning problems on distributed edge devices. The proposed ASO-fed updates an aggregated model in an asynchronous fashion while keeping data on the clients. Compared to synchronized FL approaches, ASO-fed is computationally efficient because clients do not need to wait for other clients to perform gradient updates. Training times are compared on three real-world datasets, and the results show that the proposed ASO-fed is faster than single-client learning and synchronized FL. We also perform feature learning on the central server and regularization at the local clients to learn client relationships effectively. The prediction results show that ASO-fed achieves comparable or even better performance than synchronized FL models on real-world benchmarks. In the future, we will study the theoretical privacy guarantees provided by ASO-fed when sharing gradient updates.

References