As massive data is generated on modern edge devices (e.g., mobile phones, wearable devices, and GPS units), distributed model training over a large number of computing nodes has become essential for machine learning. However, the sensitive nature of these data requires a secure and private computing environment. Additionally, the non-IID (not independent and identically distributed) and highly imbalanced characteristics of these data, coupled with the need for high-throughput networks for data transfer, pose challenges for effective model training. Federated learning trains a shared global model from a federation of distributed devices under the coordination of a central server, while the training data is kept on device. Each device performs training on its local data and uploads model parameters to a central server for aggregation. Both model training and prediction are performed locally, which offers privacy and communication advantages over transferring all data to a centralized cloud center. Many potential applications can leverage a federated framework, such as learning the activities of mobile device users, forecasting weather pollutants, and predicting health events like heart rate.
Prior work on federated learning usually follows a synchronous setting with a fixed amount of data available during training. The central server aggregates after receiving updates from all local clients [1, 2, 7, 28]. However, there are several challenges that synchronous federated learning cannot handle: 1) data on local devices may grow during the training process, since sensors on these distributed devices usually have a high sampling frequency; therefore, in the online setting, inter-client relatedness can vary over time; 2) mobile devices can be frequently offline or have poor communication bandwidth due to network constraints, so synchronized federated learning frameworks can be extremely slow; 3) edge devices may lag or even drop out due to data or system heterogeneity.
In this work, we propose an asynchronous online federated learning framework, where distributed clients with continuously arriving data collaboratively learn an effective shared model. Previous online learning approaches [14, 15, 17] with multiple clients are not capable of solving the aforementioned challenges because these approaches share each client's training samples with the other clients, which raises the same privacy concerns as centralized data centers. Jin et al. present a distributed online learning method that trains on local clients and a central server alternately to reduce the communication cost, but it still needs each client to send a small portion of its data to the server. An illustration of our model is given in Figure 1. The main contributions of the proposed ASO-fed approach are summarized as follows: 1) it allows asynchronous updates from multiple clients with continuously arriving data and is robust against network connections with high communication delays between the central server and some local clients; 2) it mitigates the straggler problem caused by device heterogeneity; 3) it learns inter-client relatedness effectively using regularization and a global feature learning module; and 4) it improves model prediction performance and reduces computation cost with personalized learning step sizes on local clients.
2 Preliminaries and Definitions.
In this section, we present the notation used in this paper along with common loss functions used in federated learning. We then briefly introduce the commonly used FedAvg model and identify the issues in synchronized federated settings.
2.1 Definitions and Loss Functions.
Assume that we have $K$ distributed devices, and for device $k$ we are given the training data $\mathcal{D}_k = \{(x_i^k, y_i^k)\}_{i=1}^{n_k}$ of $n_k$ data samples, where
$X^k$ is the feature matrix of the training data and $Y^k$ is the corresponding label matrix. To facilitate the learning, for each data sample $(x_i^k, y_i^k)$, let $\ell(w; x_i^k, y_i^k)$ be the corresponding loss function, written $\ell_i(w)$ in short. Then for each dataset on device $k$, a loss function is defined on the $n_k$ data samples of this device:
$$F_k(w) = \frac{1}{n_k} \sum_{i=1}^{n_k} \ell_i(w).$$
We set $\ell$ as the cross-entropy loss for classification models and the mean absolute error for regression models. Federated learning methods (e.g., [1, 7]) are designed to handle $K$ distributed devices and a central server that coordinates the global learning objective across the network. We denote $N = \sum_{k=1}^{K} n_k$ as the total number of samples across the $K$ devices. The global loss function over all distributed devices and examples is defined as:
$$F(w) = \sum_{k=1}^{K} \frac{n_k}{N} F_k(w).$$
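Concretely, the global objective is the sample-size-weighted average of the local losses; a minimal sketch (function names are ours, for illustration only):

```python
def global_loss(client_losses, client_sizes):
    """Weighted global objective F(w) = sum_k (n_k / N) * F_k(w),
    where F_k is client k's local loss value and n_k its sample count."""
    total = sum(client_sizes)
    return sum(f_k * n_k / total for f_k, n_k in zip(client_losses, client_sizes))

# Example: three clients with local losses 0.2, 0.5, 0.3 and sizes 100, 50, 50.
# The larger client's loss carries proportionally more weight.
loss = global_loss([0.2, 0.5, 0.3], [100, 50, 50])
```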
2.2 Synchronized Federated Optimization.
Prior work on federated optimization (e.g., [7, 9, 10]) is usually based on FedAvg. It assumes a synchronized update scheme that proceeds in rounds of communication. At each round (global iteration), a fraction of clients is randomly selected, and local solvers (e.g., stochastic gradient descent) are used to optimize the local objective function on each of the selected clients. The clients then send their local model parameters to a central server, which averages them after receiving all local parameters.
The disadvantage of synchronized optimization is that, at each round, when one or more clients suffer from high network delays, or have more data and hence need longer training time, all the other clients must wait. Since the central server aggregates only after all clients finish, the extended waiting time in synchronized optimization leads to idling and wasted computing resources.
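The synchronous scheme just described can be sketched in a few lines (function and variable names are ours for illustration, not the paper's implementation):

```python
def fedavg_round(global_model, client_updates):
    """One synchronous FedAvg aggregation step: the server must wait for
    the (parameters, sample_count) pair from EVERY selected client before
    it can take the sample-weighted average -- the barrier that causes
    the idling described above."""
    total = sum(n for _, n in client_updates)
    dim = len(global_model)
    return [sum(w[i] * n / total for w, n in client_updates) for i in range(dim)]

# Two clients with 100 and 300 samples: the larger client dominates.
new_w = fedavg_round([0.0, 0.0], [([1.0, 2.0], 100), ([3.0, 6.0], 300)])
```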
In addition, with data stored on a large number of local clients, communication efficiency is of utmost importance. Algorithms in federated learning should handle training data with the following characteristics:
Non-IID: Data on each client may have different distributions, i.e., the overall distribution cannot be learned from data on a single client.
Imbalanced data: Data can be biased to certain labels, e.g., users may have different habits or edge devices are monitoring different locations.
Heterogeneity: Data size and device performance may vary on different local clients.
Increasing data: Data may continuously arrive at local clients during the training processes.
McMahan et al. proposed FedAvg to address the non-IID and imbalanced properties of data in federated learning. However, FedAvg cannot handle increasing data on clients, nor the data and system heterogeneity that leads to stragglers or dropouts.
3 Proposed Method
We propose to perform asynchronous online federated learning where the central server begins to update model parameters after receiving one to several clients’ updates, without waiting for the other clients to finish. The details of the proposed ASO-fed approach will be explained in the following sections.
3.1 Regularized Federated Learning.
An important assumption of FedAvg is the relatedness among clients. We use the canonical distributed gradient-descent algorithm that is widely used in state-of-the-art federated learning systems (e.g., [1, 7]). Each client $k$ has its local model parameters $w_k$, and $W$ denotes the server model parameter matrix stacking those of all clients. We observe that merely minimizing each local loss $F_k(w_k)$ cannot achieve the desired knowledge transfer among clients because the minimization problems are decoupled across the local models $w_k$. Thus we add a penalty term to the global loss [10, 13, 21], which yields the following minimization problem:
where $\Omega(W)$ represents the relatedness of local clients, and $\lambda$ is the regularization parameter that controls the amount of knowledge transfer among clients. As shown in prior work, by penalizing the $\ell_{2,1}$ norm of the model parameter matrix $W$, the relatedness of clients can be efficiently learned. Therefore, we choose $\Omega(W) = \|W\|_{2,1}$. The grouped sparsity introduced by the $\ell_{2,1}$-norm penalization encourages many rows of $W$ to be zero. This is a way of compromising between finding small weights and minimizing the original cost function. The new objective function is updated as follows:
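As a concrete illustration of the grouped sparsity, the $\ell_{2,1}$ penalty sums the $\ell_2$ norms of the rows of the parameter matrix; a minimal sketch (names are ours):

```python
import math

def l21_norm(W):
    """||W||_{2,1}: sum over rows of the row-wise l2 norm. Penalizing it
    drives entire rows of W toward zero (grouped sparsity), so a feature
    is kept or discarded jointly across all clients."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in W)

# A zeroed row contributes nothing; the penalty grows with row magnitude.
W = [[3.0, 4.0],   # row norm 5
     [0.0, 0.0],   # zeroed-out row: no cost
     [0.0, 1.0]]   # row norm 1
penalty = l21_norm(W)
```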
3.2 Proposed Framework.
Figure 2 illustrates the update procedure of ASO-fed. We borrow the concept of global iterations from synchronous federated learning, where each aggregation on the central server is treated as one round. Initially, the server distributes the current model to a fraction of randomly selected available clients (two clients in the example of Figure 2). The first round then begins and these clients initiate their local training. When the first of them finishes its local training, it uploads its local model to the central server for aggregation. At the central server, the new model is obtained by applying feature learning to the aggregated parameters. The server then starts the next round and distributes the updated model to the next fraction of randomly selected available clients, while the slower client from the first round uploads its local model whenever it finishes. We can observe that this asynchronous update scheme introduces an inconsistency in the model parameters that different clients obtain from the central server. Such inconsistency is common in real-world settings and is caused by data and system heterogeneity or network delay. We address this problem by adding feature learning on the central server and a dynamic learning step size in the local clients' training. The approach of ASO-fed is detailed in Algorithm 1.
3.3 Learning on Central Server.
We propose an asynchronous update procedure for the server: the central server begins to update the model parameters after it receives an update from one client (or several updates, if multiple clients finish their local computations at the same time), without waiting for the other clients to finish. The copies of the model on the server and on the clients may therefore differ. At round $t$, assume the server receives updates from a subset $S_t$ of clients. Let the central server model be $w^t$ and $n_{S_t} = \sum_{k \in S_t} n_k$ be the total number of data samples of these clients. Then the server update is computed by aggregating the client-side updates:
$$w^{t+1} = \sum_{k \in S_t} \frac{n_k}{n_{S_t}} w_k^t,$$
where $w_k^t$ is the local model parameter of client $k$ at round $t$.
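The asynchronous aggregation rule above can be sketched as follows (a toy illustration; the names and data structures are our own):

```python
def async_aggregate(server_model, received, sizes):
    """Asynchronous server step: aggregate only the subset of clients that
    have reported this round, weighted by their sample counts -- no waiting
    for stragglers. `received` maps client id -> local parameters; `sizes`
    maps client id -> n_k; `server_model` supplies the dimensionality."""
    n_total = sum(sizes[k] for k in received)
    dim = len(server_model)
    return [sum(received[k][i] * sizes[k] / n_total for k in received)
            for i in range(dim)]

# Round t: only clients 0 and 2 (of several) have finished their updates.
new_w = async_aggregate([0.0], {0: [2.0], 2: [4.0]}, {0: 10, 1: 99, 2: 30})
```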
Feature Learning. We apply feature learning on the central server to learn a better feature representation. Attention mechanisms have shown effectiveness at extracting feature representations [25, 27]. Our feature learning approach is inspired by this; additionally, we combine it with weight normalization to reduce the computation cost [23, 24]. For each element of a column vector of the server model, we apply the operations below to obtain the updated value:
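One plausible instantiation of this attention-and-normalization step, purely as an illustration of the idea (the specific softmax-and-rescale form below is our assumption, not the paper's exact formula):

```python
import math

def attention_feature_weights(column):
    """Illustrative feature-learning step: compute softmax-normalized
    attention scores over the elements of a parameter column, then
    rescale the column by those (normalized) weights so that more
    informative features are emphasized."""
    m = max(column)
    exps = [math.exp(v - m) for v in column]       # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]            # weights sum to 1
    d = len(column)
    return [v * w * d for v, w in zip(column, weights)]

scores = attention_feature_weights([0.1, 2.0, 0.1])
```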
3.4 Learning on Local Clients.
At local clients, data continues arriving during the global iterations; so each client needs to perform online learning. For this process, each client requests the latest model from the central server and updates the model with its new data. The updated model will affect the learning on other clients, so the cumulative misclassification/regression loss over the entire sequence of local data needs to be minimized.
Yang et al. propose an online multi-task learning framework that aims to tackle the insufficiency of batch-mode training algorithms with a combination of norm regularizers. Inspired by their work, we propose a new online learning approach with $\ell_{2,1}$-norm regularization for local client learning. We assume that the continuously arriving data has the same distribution as the original data. Client $k$ at round $t$ receives the model $w^t$ from the central server. Assuming a set of newly arrived data samples since the last local model update, the optimization of client $k$ at this round is formulated as:
where the local client gradient and the central server gradient are calculated as below:
With $\eta_k$ being the learning rate for client $k$, the closed-form solution is given by:
Dynamic Learning Step Size. In real-world settings, the activation rates (i.e., how often clients provide updates to the global model) of different clients vary for a host of reasons. Thus, we apply a dynamic learning step size with the intuition that if a client has less data or a stable communication bandwidth, its activation rate toward the global update will be large and thus its learning step size should be small. Dynamic learning step sizes have been used in asynchronous optimization to achieve better learning performance [18, 11]. The updating process (11) can then be revised as:
where the actual learning step size is scaled by a time-related multiplier computed from the average delay of the past rounds. The longer the delay, the larger the step size, in order to compensate for the client's low activation rate.
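The delay-scaled step size can be sketched as follows (the linear scaling in the average delay is our assumption; the paper's exact multiplier may differ):

```python
def dynamic_step_size(base_lr, recent_delays):
    """Scale a client's learning rate by a delay-dependent multiplier:
    the longer the client's average communication delay over the past
    rounds, the larger its step, compensating for a low activation rate.
    Delays below 1 (frequent updaters) keep the base rate unchanged."""
    avg_delay = sum(recent_delays) / len(recent_delays)
    return base_lr * max(1.0, avg_delay)

# A frequently-updating client (small delays) keeps the base rate;
# a straggler (large delays) takes proportionally bigger steps.
fast = dynamic_step_size(0.01, [0.5, 0.8, 0.6])
slow = dynamic_step_size(0.01, [3.0, 4.0, 5.0])
```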
Exponential Moving Average. Since data generated on devices is chronological, we use an Exponential Moving Average (EMA), which places greater weight and significance on the most recent data points. EMA has been widely used in sequential data prediction problems [19, 20]. We apply EMA to all trainable parameters in the local training. At the end of local training, EMA updates the local model as follows:
where $\alpha$ is the decay rate. Finally, the local parameters are uploaded to the central server.
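The EMA update over the trainable parameters can be sketched as:

```python
def ema_update(ema_params, new_params, decay):
    """Exponential moving average over trainable parameters:
    ema <- decay * ema + (1 - decay) * new, so recent local updates
    carry more weight while older values decay geometrically."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_params, new_params)]

# Two local steps pull the averaged parameters toward [2, 0].
params = [1.0, 1.0]
for step_value in ([2.0, 0.0], [2.0, 0.0]):
    params = ema_update(params, step_value, decay=0.9)
```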
4 Experimental Setup
Table 1: Statistics of the three datasets.

| Dataset | FitRec | Air Quality | ExtraSensory |
| --- | --- | --- | --- |
| # of clients | 30 | 9 | 60 |

FitRec: https://sites.google.com/eng.ucsd.edu/fitrec-project/home; Air Quality: https://biendata.com/competition/kdd_2018/data/
We perform extensive experiments on three real-world datasets and compare against state-of-the-art methods. In particular, we evaluate 1) if ASO-fed has better prediction performance than FedAvg with continuous streaming data and 2) if asynchronous update schemes save computation cost in federated learning.
Datasets. We consider three real-world datasets as shown in Table 1.
FitRec Dataset: user sport records generated on mobile devices and uploaded to Endomondo, including sequential features such as heart rate, speed, and GPS, along with the sport type (e.g., biking, hiking). We use the data of 30 randomly selected users for heart rate and speed prediction.
Air Quality Dataset: air quality data collected from multiple weather devices distributed over 9 locations in Beijing, with features such as thermometer and barometer readings. Each area is modeled as a separate client, and the observed weather data is used to predict measures of six air pollutants (e.g., PM2.5).
ExtraSensory Dataset: mobile phone data (e.g., location services, audio, and accelerometer) collected from 60 users. We model the device of each user as a client and predict their activities (e.g., walking, talking, running).
Baselines. We compare the proposed ASO-fed approach to single-client learning and federated learning approaches. We select the following approaches as baselines:
FedAvg: a synchronous federated learning approach proposed by McMahan et al.
AsyFL: an asynchronous version of FedAvg.
Local-S: a single-client learning approach with the same model structure as ASO-fed.
ASO-fed-D: the proposed ASO-fed without the dynamic learning step size.
4.1 Training Details.
For each dataset, we split each client's data into training, validation, and testing sets. For each client's training data, we start with a random portion of the total training size and increase it by a fixed fraction each round to simulate arriving data. We set the client fraction of FedAvg, the decay rate of EMA, and the remaining hyperparameters separately for the FitRec, Air Quality, and ExtraSensory datasets. We use a single-layer LSTM with dropout for both the federated learning models and the single-client learning model, and fix the number of local epochs per client. We employ early stopping on the validation loss with a patience measured in global iterations. All experiments are conducted on an Intel E5-2683 v3 56-core CPU at 2.00GHz.
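The simulation of continuously arriving training data can be sketched as follows (the fractions below are illustrative placeholders, not the paper's exact values):

```python
import random

def streaming_batches(train_data, start_frac, growth_frac, seed=0):
    """Simulate continuously arriving data: begin with a random portion of
    a client's training set and reveal a further fixed fraction every
    round, yielding the data visible to the client at each round."""
    rng = random.Random(seed)
    data = train_data[:]
    rng.shuffle(data)             # random initial portion
    n = len(data)
    visible = max(1, int(start_frac * n))
    while visible < n:
        yield data[:visible]      # data available at this round
        visible = min(n, visible + max(1, int(growth_frac * n)))
    yield data                    # final round sees everything

rounds = list(streaming_batches(list(range(100)), start_frac=0.2, growth_frac=0.1))
```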
5 Experimental Results
5.1 Performance Comparison.
Table 2 compares the prediction performance of the different methods. For regression problems, we report the average MAE and SMAPE values, and for classification problems we report the average F1, Precision, Recall, and Balanced Accuracy (BA). We observe that ASO-fed achieves the lowest MAE and SMAPE values on FitRec and Air Quality, and has the best F1, Recall, and BA scores on ExtraSensory. AsyFL has the worst performance on all three datasets, which also shows that ASO-fed learns a better feature representation across clients. From Table 2 we notice that ASO-fed significantly outperforms FedAvg on the FitRec dataset, lowering the average MAE and SMAPE values for speed prediction.
Features in the FitRec dataset are not strongly correlated with each other (e.g., distance, altitude), and ExtraSensory has high-dimensional and noisy features; therefore FedAvg, without any feature learning, does not learn an effective feature representation. We also compare against the single-client learning approach (Local-S). ASO-fed outperforms Local-S on FitRec and Air Quality, and obtains performance close to Local-S on ExtraSensory. The training data of each client in FitRec are biased toward one sport type (e.g., biking, hiking), and the data distribution varies across locations in Air Quality; thus clients in FitRec and Air Quality follow a non-IID setting while clients in ExtraSensory do not.
5.2 Computation Time.
The run times of the synchronous and asynchronous approaches are reported in Table 3. FedAvg has the highest computation cost on two of the three datasets, which is expected given that in FedAvg each client node must wait for the other client nodes to finish their computations. ASO-fed and AsyFL have much lower computation costs, demonstrating that asynchronous update schemes can greatly reduce computation time. ASO-fed has slightly higher computation costs than AsyFL because it performs additional computation, such as feature learning on the central server and dynamic learning rate calculation on the local clients.
5.3 Robustness to Stragglers and Dropouts.
Stragglers are clients that lag in performing computation for a variety of reasons: communication bandwidth, computation load, and data variability. In ASO-fed, clients update in an asynchronous manner: each client can upload its local model to the central server immediately after its local training without waiting for other clients to finish, so ASO-fed handles the stragglers caused by these situations. We also investigate a common real-world scenario in which some clients give no response during the whole training process; these clients are referred to as dropouts.
We explore the performance of FedAvg, AsyFL, and ASO-fed with some fraction of clients suffering from dropout. We randomly select a certain portion of local clients that do not participate in the training process; however, the reported results are evaluated on the test data of all clients. As shown in Figure 3, for the FitRec dataset, we observe that as the rate of dropout clients increases, the SMAPE values of AsyFL fluctuate. For FedAvg, there is less fluctuation but larger SMAPE values compared to ASO-fed. For ASO-fed, there is a slight increase in SMAPE values, which becomes steady as the dropout rate grows. Even when a large fraction of local clients drop out during training, ASO-fed still achieves the best performance. AsyFL has the worst performance on the Air Quality data. ASO-fed and FedAvg have close performance as the dropout rate increases, while the performance of ASO-fed is relatively stable.
5.4 Feature learning on Central Server.
In this section, we present qualitative results of the proposed feature learning on the central server. In Figure 4, we show the features learned from one client of each of the three datasets. For the client from ExtraSensory, the highlighted features are 'Gyroscope', 'Accelerometer', and 'Location', and the corresponding labels are 'walking' and 'at_home'. For the client from the Air Quality dataset, we observe that the features with high weights are 'Wind Speed' and 'Temperature'. This makes sense given that the target values are air pollutants (e.g., PM2.5, SO2), and 'Wind Speed' determines whether these pollutants can be dispersed. Air pollutants vary with the seasons, and higher concentrations appear in winter due to fuel consumption for heating; therefore 'Temperature' is also a strong indicator of air pollutants. For the client from FitRec, the extracted features are 'gender', 'sport type', and 'time'. Since the prediction targets are speed and heart rate, these three features have strong correlations with the targets. These results show the effectiveness of the feature learning in ASO-fed.
5.5 Results of Varying Training Samples.
To evaluate the incremental online learning process more explicitly, Figure 5 shows how the prediction performance changes as the training sample rate increases. We perform experiments with different rates of all clients' training data and depict the average performance over all local clients. For the FitRec dataset, ASO-fed achieves the lowest SMAPE values across the varying rates of training data. Large fluctuations are observed in the results of FedAvg and AsyFL. Similar fluctuations are observed for AsyFL on Air Quality, which indicates unstable model performance for AsyFL as local data increases. ASO-fed obtains performance similar to FedAvg on the Air Quality data. The Local-S method does not perform well on the two regression datasets. For ExtraSensory, Local-S has performance similar to ASO-fed on both F1 score and Balanced Accuracy. As mentioned before, FedAvg does not perform well on ExtraSensory because of this dataset's noisy features. The analysis shows that ASO-fed learns an effective model with a smaller portion of the training data and, as local data increases, still maintains high prediction performance.
6 Related Work
6.1 Federated Learning and Optimization.
Federated learning was first proposed by McMahan et al., and was benchmarked on image datasets and a language dataset. Their approach used a fixed global aggregation frequency and did not conduct experiments with variable client configurations. Many extensions have been explored based on this original federated learning setting. For instance, Hard et al. used a variant of an LSTM to realize next-word prediction in a virtual keyboard for smartphones. Nishio et al. proposed a protocol for the selection of local clients in federated learning. Konečný et al. used secure aggregation to protect the privacy of each user's model gradient when dealing with an untrusted server. A better approach to deal with non-IID data distributions was proposed by sharing a small amount of data with other devices. Methods were proposed to compress the information exchanged within one global aggregation step. A benchmarking framework for federated settings was developed by Caldas et al. However, most of these studies update the federated model in a synchronous fashion and do not tackle the problem of stragglers and dropouts. Smith et al. developed a federated multi-task framework to deal with the statistical and systems challenges in federated learning. However, this approach learns a separate model for each local client and does not take the computational cost into consideration. All these approaches are designed for datasets with a fixed size and are not suitable for real-time online learning.
6.2 Online Learning with Multiple Clients.
Online learning methods operate on a group of data examples that arrive sequentially in streaming fashion. Most existing work in online learning with multiple tasks (clients) focuses on taking advantage of task relationships. The online learning problem with multiple tasks was first introduced by Dekel et al. The relatedness of the participating tasks was captured by a global loss, and the goal was to reduce the cumulative loss over online rounds. However, this approach did not take task relationship information into consideration. To better model task relationships, Lugosi et al. imposed a hard constraint on the simultaneous actions taken by the learner in the expert setting, Agarwal et al. used matrix regularization, and Murugesan et al. proposed a method to learn the task relationship matrix automatically from the data. All these methods were proposed in synchronized frameworks and are not adaptable to real-world asynchronous learning.
Jin et al. presented a distributed framework that performs local training and global learning alternately with a soft confidence-weighted classifier. Although this is an asynchronous approach, it assumes a Gaussian distribution of the local data, which is not a good fit for non-convex neural network objectives. Besides, it also requires each client to send a portion of its local data to the central server, which violates privacy.
Different from the above online learning approaches, our proposed ASO-fed updates in an asynchronous manner, and the data remains on the local clients during the training process, which suits real-world federated learning scenarios.
7 Conclusions and Future Work
We propose a novel asynchronous online federated learning approach to tackle learning problems on distributed edge devices. The proposed ASO-fed updates an aggregated model in an asynchronous fashion while keeping the data on the clients. Compared to synchronized FL approaches, ASO-fed is computationally efficient because clients do not need to wait for other clients to perform gradient updates. Training times are compared on three real-world datasets, and the results show that the proposed ASO-fed is faster than single-client learning and synchronized FL. We also perform feature learning on the central server and regularization at the local clients to learn effective client relationships. The prediction results show that ASO-fed achieves performance close to or even better than synchronized FL models on real-world benchmarks. In the future, we will study the theoretical privacy guarantees of ASO-fed when sharing gradient updates.
-  McMahan, H. Brendan, Eider Moore, Daniel Ramage, and Seth Hampson. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629 (2016).
-  Hard, Andrew, Kanishka Rao, Rajiv Mathews, Françoise Beaufays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604 (2018).
-  Smith, Virginia, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S. Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems, pp. 4424-4434. 2017.
-  Nishio, Takayuki, and Ryo Yonetani. Client selection for federated learning with heterogeneous resources in mobile edge. ICC 2019-2019 IEEE International Conference on Communications (ICC). IEEE, 2019.
-  Dmitry Duplyakin, Robert Ricci, Aleksander Maricq, et al. The Design and Operation of CloudLab. Proceedings of the USENIX Annual Technical Conference (ATC), pp. 1-14. 2019.
-  Zhao, Yue, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582 (2018).
-  Konečný, Jakub, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527 (2016).
-  Caldas, Sebastian, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097 (2018).
-  Wang, Shiqiang, Tiffany Tuor, Theodoros Salonidis, Kin K. Leung, Christian Makaya, Ting He, and Kevin Chan. Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications 37, no. 6 (2019): 1205-1221.
-  Bonawitz, Keith, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for federated learning on user-held data. arXiv preprint arXiv:1611.04482 (2016).
-  Baytas, Inci M., Ming Yan, Anil K. Jain, and Jiayu Zhou. Asynchronous multi-task learning. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 11-20. IEEE, 2016.
-  Liu, Jun, Shuiwang Ji, and Jieping Ye. Multi-task feature learning via efficient l2,1-norm minimization. Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2009.
-  Zhou, Jiayu, Jianhui Chen, and Jieping Ye. Malsar: Multi-task learning via structural regularization. Arizona State University 21 (2011).
-  Sun, Xu, Hisashi Kashima, and Naonori Ueda. Large-scale personalized human activity recognition using online multitask learning. IEEE Transactions on Knowledge and Data Engineering 25.11 (2012): 2551-2563.
-  Dinuzzo, Francesco, Gianluigi Pillonetto, and Giuseppe De Nicolao. Client–server multitask learning from distributed datasets. IEEE Transactions on Neural Networks 22, no. 2 (2010): 290-303.
-  Jin, Xin, Ping Luo, Fuzhen Zhuang, Jia He, and Qing He. Collaborating between local and global learning for distributed online multiple tasks. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 113-122. ACM, 2015.
-  Cavallanti, Giovanni, Nicolo Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. Journal of Machine Learning Research 11, no. Oct (2010): 2901-2934.
-  Cheung, Yun Kuen, and Richard Cole. Amortized analysis on asynchronous gradient descent. arXiv preprint arXiv:1412.0159 (2014).
-  Yu, Adams Wei, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. Qanet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541 (2018).
-  Williams, Billy M., Priya K. Durvasula, and Donald E. Brown. Urban freeway traffic flow prediction: application of seasonal autoregressive integrated moving average and exponential smoothing models. Transportation Research Record 1644, no. 1 (1998): 132-141.
-  Evgeniou, Theodoros, and Massimiliano Pontil. Regularized multi–task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 109-117. ACM, 2004.
-  Vaizman, Yonatan, et al. Extrasensory app: Data collection in-the-wild with rich user interface to self-report behavior. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 2018.
-  Wang, Ying-Ming, and Taha MS Elhag. On the normalization of interval and fuzzy weights. Fuzzy sets and systems 157, no. 18 (2006): 2456-2471.
-  Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
-  Firat, Orhan, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv preprint arXiv:1601.01073 (2016).
-  Murugesan, Keerthiram, Hanxiao Liu, Jaime Carbonell, and Yiming Yang. Adaptive smoothed online multi-task learning. In Advances in Neural Information Processing Systems, pp. 4296-4304. 2016.
-  Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998-6008. 2017.
-  Konečný, Jakub, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016).
-  Yang, Haiqin, Irwin King, and Michael R. Lyu. Online learning for multi-task feature selection. Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, 2010.
-  Dekel, Ofer, Philip M. Long, and Yoram Singer. Online multitask learning. In International Conference on Computational Learning Theory, pp. 453-467. Springer, Berlin, Heidelberg, 2006.
-  Agarwal, Alekh, Alexander Rakhlin, and Peter Bartlett. Matrix regularization techniques for online multitask learning. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2008-138 (2008).
-  Wang, Jialei, Peilin Zhao, and Steven CH Hoi. Exact soft confidence-weighted learning. arXiv preprint arXiv:1206.4612 (2012).
-  Ni, Jianmo, Larry Muhlstein, and Julian McAuley. Modeling Heart Rate and Activity Data for Personalized Fitness Recommendation. The World Wide Web Conference. ACM, 2019.