Federated learning (FL) [f1] is a booming distributed learning technique that supports large, data-intensive deep neural network models. The FL framework applies when clients refuse to provide personal information while IT companies still want to build profiling models collectively. As such, FL supports AI algorithms in many privacy-sensitive applications, including social networks [f22], smart assistants [f23], and traffic surveillance [f24]. The common feature of these applications is that the smart devices involved communicate only metadata, e.g., model gradients, so that a cloud can retrain a global model, with or without trust. In this way, federated learning is claimed to mitigate the impact of privacy leaks. Internet giants such as Google and Facebook adopt FL in their learning frameworks in order to capture the privacy-conscious data market, which has a $200 billion net worth [f25].
Federated learning is supposed to support co-building models among multiple devices while preserving privacy between them. Each device holds one replica of a versioned model and its local data objects. Usually, a small set of powerful devices, called servers, publish a master model and compute the alignment between all trained local models. The rest of the devices, called workers, subscribe to models from the servers and repeat the training process in sync, i.e., in federation. As such, raw data objects are not exchanged among parties, while the target model may converge after the training process is repeated for a sufficient time. In theory, the federated learning framework is anticipated to scale to millions or even billions of devices [f48].
In practice, this simple FL framework can be expensive and may leak privacy, for three major reasons. First, federated learning implicitly assumes that the model to be trained is simple enough that all synced workers can respond and servers can compute the alignment within a given time threshold. However, this collaborative learning behavior can reveal the correlation between similar users, and this similarity model can be reused to reveal private information. Moreover, the model can be complex per se, so it may fail to converge. Second, workers may spend a large amount of resources on model training, which affects the response time and eventually the service-level objective (SLO). The authors of the original federated learning design [f26] made the strong assumption that models are only trained while the device is plugged in and connected to WiFi. But this assumption prevents training on data at a fine granularity, which defeats the purpose of adopting distributed learning and ignores the ubiquity of mobile devices. In addition, real-time operation is a critical performance requirement for today’s machine learning systems. Letting a device train on fresh local data in real time costs expensive battery energy. When scaling out, this energy footprint during training would grow exponentially, not to mention that millions of battery-powered devices are involved [f27]. Thus, a practical federated learning framework should intelligently adapt to privacy concerns while taking energy efficiency into account.
To address the aforementioned challenges, we need to analyze and further understand the nature of federated learning. We first study state-of-the-practice FL frameworks on several applications. Our study yields two important observations. (1) Regarding model complexity and computational cost, we can expect the whole framework to start from a simple model with a naïve configuration. However, when FL scales out, not all workers are active to the same degree. As such, the FL framework may allow workers to forget some trained features at any time. That is, the models on different workers may differ in their versions, which is equivalent to the case where all devices are online but their model versions are not all up to date. This gives us a tradeoff when publishing aligned models to workers at the beginning of each period: we may select a subset of workers without waking all devices up, in order to reduce the energy footprint while still maximizing the achievable reward. This optimization can speed up convergence while reducing latency, and thus local energy footprints. (2) To further address the concern of a privacy leak in a federated learning framework, we not only allow each worker to remove its old and sensitive data, but also allow the model to delete the learnt sensitive features; in our terms, the framework “forgets”. This technique is commonly known as decremental learning in the community. This feature is orthogonal to the local energy management in the system kernel. Thus, we can treat it as a control signal that wakes the energy management unit when the computation demand decreases because the device forgets.
In this way, previous efforts on system energy management techniques, such as dynamic voltage and frequency scaling (DVFS), process migration, and IC thermal shutdown, can be adopted in the federated learning process in order to save training energy. These two observations and the derived approaches are the key to providing an energy-efficient federated learning system that forgets.
Based on the above findings, we propose a Decremental Energy-Aware Learning framework (DEAL) that builds an energy-efficient design from decremental learning and energy-saving techniques. DEAL uses a two-layered design to reduce the overall energy footprint of federated learning. When a federated job is created, DEAL triggers an optimization process that models the selection among candidate devices as a multi-armed bandit (MAB) problem and solves it to maximize the objective reward, which accounts for training latency, data volume, and energy footprint. In this way, the whole learning process can be conducted with a performance guarantee. When learning starts, DEAL deploys a local middleware layer that carefully manages the local learning process as incremental and decremental updates, based on the specific models. The middleware intelligently tunes the local energy state of mobile devices within the decremental learning process. When a bad decision is made in prediction or in a decremental update, DEAL can resolve the problem by recovering the model with the corresponding decremental and incremental updating algorithms. We have prototyped DEAL on current mobile operating systems, supporting multiple learning frameworks. The prototype is evaluated with machine learning models that are widely adopted in real-time machine learning systems, including Personalized PageRank and Tikhonov Regularization. The evaluation results show that DEAL can reduce learning completion time by up to 2-4X compared to the classic federated learning framework. Against all state-of-the-practice baselines, our design uses 75.6%–82.4% less energy in all workloads.
The contributions of our work are summarized as follows:
We identify the high system resource usage and the privacy problem of a learning federation for mobile systems.
We propose a two-layer energy-efficient learning framework, DEAL, which reduces the energy footprint with a forgetting feature. The framework's performance comes with a worst-case mathematical guarantee.
Our prototype demonstrates the efficiency and effectiveness of DEAL with real-world datasets. Compared to the conventional federated learning design, DEAL can save 75.6%–82.4% of the energy cost, while all learning processes are faster than the classic federated learning framework by up to 2-4 orders of magnitude.
The rest of the paper is organized as follows. Section II introduces the background on the key concerns that motivate the design of DEAL. Section III discusses the system design and implementation. Sections IV and V present the evaluation and related work, respectively. Section VI concludes the paper.
II Risen Awareness of Privacy and Resource
Federated learning (FL) is resource-expensive, yet it may fail at its privacy-preserving task. In this section, we first introduce the background of federated learning. After that, we outline the potential privacy leak from collaborative federated learning with a real-world example. Last, we summarize the resource use issues in federated learning.
Federated Learning. Federated learning is designed to train a shared model collaboratively with the data generated on edge devices while preserving data privacy, in a mobile federation. A federated learning procedure commonly consists of the following steps:
At the start, the server selects a group of mobile candidates, i.e., workers, to participate in the training process.
The workers subscribe to the current model and parameters.
Each worker starts local training with the subscribed model and local data.
When the local training process is completed, each worker sends its local coefficients to the server.
The server checks convergence over the received gradients. The process repeats from the first step until the model converges.
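The steps above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the model is a single scalar weight, `local_train`, `federated_round`, and `run` are hypothetical names, aggregation is a data-size-weighted average (in the style of federated averaging), and convergence is declared when the global weight stops changing.

```python
import random


def local_train(weight, data, lr=0.1, epochs=5):
    """Hypothetical local step: fit y ~ w * x with plain gradient descent."""
    w = weight
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(global_w, workers, fraction=0.5, rng=random):
    """One round: select workers, train locally, average the returned weights."""
    k = max(1, int(len(workers) * fraction))
    selected = rng.sample(sorted(workers), k)       # server selects workers
    updates, sizes = [], []
    for name in selected:
        data = workers[name]                        # raw data stays on the worker
        updates.append(local_train(global_w, data))  # subscribe + local training
        sizes.append(len(data))                     # only coefficients are sent back
    total = sum(sizes)
    return sum(w * n for w, n in zip(updates, sizes)) / total


def run(workers, rounds=50, tol=1e-6, seed=0):
    """Repeat rounds until the global weight stops moving (assumed stop rule)."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(rounds):
        new_w = federated_round(w, workers, rng=rng)
        if abs(new_w - w) < tol:
            return new_w
        w = new_w
    return w
```

In this sketch, `workers` maps a worker name to its private list of `(x, y)` pairs; only the scalar weight crosses the worker boundary.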
Note that the whole training time is not critical for each individual worker, as the job can be throttled. However, the whole training process may take exponentially longer to converge if some workers are delayed during sync. While in sync, all training threads keep the device awake, which drains a lot of energy. Therefore, the training completion time is critical and makes FL very expensive when scaling out.
Privacy in Learning. Here we illustrate a real-world example of a privacy leak from the FL process, shown in Figure 1. The Retailrocket [f28] e-commerce dataset contains events such as clicks, add-to-cart actions, and transactions over a period of four and a half months, covering 32,000 de-identified users. Suppose a user (e.g., user A) has touched the following items: The Godfather, Titanic, Flipped, and Linear Algebra. All of this information is potentially sensitive, as every item relates to some personal privacy. Regulations such as the European General Data Protection Regulation (GDPR) [f29] give users the right to remove all such sensitive data. However, federated learning can still reveal private user information even after the records of user A are removed from the database. For example, FL collects click and transaction history from all users and computes a similarity matrix between pairs of users, which can be used for personalized ranked item recommendation. The similarity distance in this scenario can be computed with a simple Jaccard similarity [f42]. Unfortunately, we can still guess the (already deleted) browsing history of user A. As shown in Figure 1, the similarity matrix shows that the average similarities between user A and some other users are remarkably high, such as user C and user B with similarities of 0.97 and 0.81, respectively. Digging into this information, we can check the undeleted browsing history of users B and C to recover the deleted information of user A on The Godfather, Flipped, and Titanic. Therefore, we conclude that these overlapping sets can still reveal personal privacy through clustering, which can be considered a level of privacy leak.
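The leak described above can be reproduced on toy data. The sketch below is illustrative only: the helper names (`jaccard`, `guess_deleted_history`) and the threshold are assumptions, and the data used with it should be read as made-up examples rather than values from the Retailrocket dataset or Figure 1. It shows how a stale row of a user-user similarity matrix, kept after a user's records are deleted, points to neighbors whose remaining histories approximate the deleted one.

```python
def jaccard(a, b):
    """Jaccard similarity between two item collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def guess_deleted_history(stale_similarity, histories, threshold=0.5):
    """Union the remaining histories of users that a stale similarity row
    (computed before the deletion) marks as close to the deleted user."""
    guess = set()
    for other, sim in stale_similarity.items():
        if sim >= threshold and other in histories:
            guess |= set(histories[other])
    return guess
```

The guess is a candidate set: it may contain extra items from the neighbors and may miss items that only the deleted user touched.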
System Resource Use in Federated Training. Federated training has two problems: idle energy leakage and unnecessary memory footprint. As explained in Section I, the workers in the FL framework are usually battery-powered edge devices. Previous work [f30] already reveals that a federated learning process has a heavy energy footprint, shortening the overall device service time by 40%. Bonawitz et al. [f9] argue that this is only a technical issue, as the learning job is only processed when the device is recharging. However, this assumption violates two important features behind the purpose of federated learning: (1) the freshness of the data objects to be trained on; and (2) the ubiquity of mobile devices. Moreover, restricting FL to recharging periods prevents scaling out. Any unnecessary energy leak in a single device may affect all other workers, leading to an exponential energy waste. In addition, the learning process needs to repeatedly retrieve all local data from memory or secondary storage during training, which can cause a large number of page faults and thus page swaps with extra delay. Thus, to obtain a locally efficient design, we need to understand the energy footprint and memory use of local training, and we need an effective middleware that the OS can use to elastically manage the data in memory and to couple the learning algorithm with the energy management policy.
III Design and Implementation
Previously, we have discussed our concerns about privacy and efficiency in current FL frameworks. In this section, we introduce DEAL, then analyze our system modeling from the global and local perspectives, and finally describe on-device energy control with a decremental learning feature.
III-A System Overview
Figure 2 shows the architecture of DEAL. DEAL consists of a global layer that provides a selection optimization process with the MAB algorithm, and a local layer that manages local decremental learning through incremental and decremental updates, together with the associated energy control. The details are as follows.
Global Layer. DEAL supports federated learning in a client-server manner. In the global selection layer shown in Figure 2, when a learning job is created at the server, DEAL selects a worker subset from all live candidates. All workers in this subset shall have the required training data and sufficient computation resources to complete the learning job with a reward. The whole selection process shall maximize this reward as an optimization process. As shown in the Device column of Figure 2, DEAL initializes the federated learning setup in a PUB/SUB model. All selected workers are notified by the server via the PUB method and receive the models to be trained. Gradually, each worker finishes its local training and sends back model gradients via the SUB method. During this process, workers can leave. DEAL allows the server to communicate with workers via the SUB method periodically, and starts the convergence process when it has received signals from the majority of the selected workers or when a Time To Live (TTL) expires.
In the local layer for learning and system control, each worker introduces a hyperparameter that specifies how much the worker shall “forget” its data [f40]. Therefore, though we still compute similarity models between users, as shown in Figure 1, after some epochs DEAL overwrites the model with newly arrived data and forgets the deleted data as well as their impact on the model. In this way, we not only balance model training against local energy reduction, but also preserve privacy better for each worker.
In summary, DEAL exhibits a two-level design, globally and locally, to optimize the energy efficiency and privacy for federated learning. Next, we introduce our system modeling within these two layers.
III-B System Modeling
As shown in Figure 2, our global selection process assumes that each FL job always starts from the server side. Each server receives SUB signals from all candidate devices, with a profile of their local data volume, available resources, and battery capacity. We model all variables in two categories, namely global and local metrics. In the global metrics, we use device statistics, data object information, etc., to find the most suitable subset for one federated learning process. In the local metrics, we describe how we treat local data objects for learning, and we model the local power footprint as well as the training time.
Global Metrics. In the global metrics, DEAL captures the set of mobile devices that participate in the federated learning process. Each device is associated with a reward in every round, with rounds indexed 0, 1, 2, …. Specifically, the reward is a normalized variable on [0,1], and DEAL can compute its distribution and mean. As mentioned above, these reward distributions are i.i.d., as defined in a basic FL framework. The vector of mean rewards is unknown before the whole learning process starts.
In each round of learning, devices can join and leave at any time, due to issues such as network outages and drained batteries. In the PUB/SUB model, any newly arrived device can only subscribe in the next round of learning. Dropped devices are considered “asleep” when they violate the TTL in a learning round. We describe the set of available mobile devices in a given round as an element of the power set of all devices. The available set follows a distribution that is also i.i.d., as defined in the FL framework, and this distribution is unknown before the training process starts. When workers start to subscribe, DEAL can reveal the available set at the server side at the beginning of each training round.
In each training round, the central server selects among the available devices. Each subset of the available devices forms a candidate worker subset. To account for the convergence delay, we bound the size of the selected set by a fixed maximum, i.e., the server starts to compute convergence and update models after selecting at most that many workers to publish to. The feasible groups for an observed available set are thus all of its subsets whose cardinality does not exceed this bound. In a given training round, the server selects one feasible group and receives a reward that is a weighted sum of the rewards of the participating devices, where each weight is the calculated gradient of the device. We assume that these weights are fixed, known positive numbers provided by the models, with a known upper bound. Moreover, DEAL is designed to maximize the expected time-average reward within a given time horizon of rounds.
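The selection model in this subsection can be sketched as follows. This is a minimal illustration with hypothetical names (`feasible_groups`, `group_reward`, `best_group`); it enumerates the feasible groups for an observed available set and evaluates the weighted-sum reward. The sketch excludes the empty group and assumes the mean rewards are known, which corresponds to the idealized case rather than the online setting DEAL actually faces.

```python
from itertools import combinations


def feasible_groups(available, max_size):
    """All non-empty subsets of the available devices with at most max_size members."""
    devices = sorted(available)
    groups = []
    for k in range(1, min(max_size, len(devices)) + 1):
        groups.extend(frozenset(c) for c in combinations(devices, k))
    return groups


def group_reward(group, weights, rewards):
    """Weighted sum of the per-device rewards of a selected group."""
    return sum(weights[d] * rewards[d] for d in group)


def best_group(available, max_size, weights, mean_rewards):
    """Oracle choice: the feasible group with the largest weighted mean reward."""
    return max(
        feasible_groups(available, max_size),
        key=lambda g: group_reward(g, weights, mean_rewards),
    )
```

Brute-force enumeration is exponential in the size bound and is used here only to make the definitions concrete; with positive weights and non-negative rewards, ranking devices by weighted reward gives the same oracle answer.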
Local Metrics. In the local metrics, we first model the data objects to be trained on. The learnt model is the output of a learning process applied to the training data from the devices, together with a user-defined coefficient that defines the percentage of focus on newly created data objects. Now suppose we need to delete some portion of the private data of one device. The model we actually want is the one learnt without that data, which can be trivially obtained by repeating the training process on the remaining data.
Traditionally, producing a complete model requires retraining on all requested data, which might be costly in energy footprint. Meanwhile, the data loss in a single worker (or a handful of workers, not the whole set) may significantly change the predictability of the model. Therefore, we update the existing model to the desired model, or equivalently replace the learning process with an updating process. This “forget” process completes much faster than retraining while still converging, as follows:
This methodology is called decremental learning [f20], which is similar to online learning, where only the new observations are used to incrementally update the existing model. However, we not only need to use new observations to update incrementally, but also need to delete such updates through the reverse operation.
In DEAL, we consider the practical scenario of deploying FL on real devices, and mainly focus on the energy and latency of local training and modeling. To understand the training completion time and the estimated energy consumption of one local training, we adopt models from previous research [f30]. The energy consumption is a linear combination of device utilization and energy states.
The terms of this model are the average utilization, an energy coefficient for a given frequency, the training completion time, and a static energy profile for the other devices in their specific power states, based on state-of-the-art state-machine models [f39, f45].
Next, we model the training completion time. Previous research [f30] reveals a linear correlation between training time and local training data size under some model specifications. Hence, the model for training completion time is simplified as follows:
Here, the computation time is positively correlated with a function of the priority weight, the local model, and the affected data size, under the current CPU frequency in each round of training; the remaining parameters are correlation coefficients. With all the aforementioned models, DEAL can provide a decremental learning version of local training based on the data and performance models (i.e., the energy and latency models). Note that we need to do the corresponding work for each specific type of model. Besides, DEAL may pick the wrong devices due to a bad prediction, and it can only fix the bad selection when another federated learning job initializes. Results show that this may only affect the 95th-percentile performance. On average, DEAL can sustain a better federated learning service than other state-of-the-practice designs. More discussion on fault tolerance can be found in Section IV.
III-C Global Selection Optimization
To describe the decision of worker selection, we use a binary vector with one entry per device to indicate whether each device is selected to participate in the training in a given round: the entry is 1 if the device is selected, and 0 otherwise. The action vector must then correspond to a feasible group of the available devices in every round.
If the mean reward vector is known in advance, we can always formulate reward maximization in federated learning with a minimum selection fraction as an optimization problem [f46]:
As the reward vector is unknown, DEAL uses the estimated mean rewards to maximize the reward (i.e., exploitation) and also has to obtain a more precise estimate of the rewards (i.e., exploration) by learning simultaneously. We therefore pick an online optimization algorithm to handle the exploitation-exploration tradeoff globally.
We treat the DEAL selection optimization as a multi-armed bandit (MAB) problem. The MAB problem considers a fixed, limited set of resources to be allocated between competing (alternative) choices in a way that maximizes the expected gain, when each choice’s properties are only partially known at the time of allocation [f49]. In MAB, our main challenges are how to maximize the reward with an unknown mean reward distribution and out-of-date workers, and how to minimize the power consumption without violating the TTL constraint.
The first critical point in maximizing the reward under uncertainty is to quickly retrieve the reward distribution of the top workers. For each worker, we track the number of times it has been selected by the end of each round; this counter is zero when the system starts. Meanwhile, we maintain the observed mean reward of each worker by the end of each round, with a default value for a worker that has not been played yet. The estimate of a worker's reward in a given round combines these quantities in two terms,
which correspond to exploitation and exploration, respectively. The upper limit of this truncated reward estimate is 1, because the actual reward must lie in [0,1]. Similarly, a default estimate applies to a worker that has not been selected yet. Due to the page limit, we refer the reader to prior work [f35] for the detailed optimization analysis. Next, we introduce our local decremental learning and energy control to address the local resource problem while forgetting.
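The exploitation and exploration terms described above can be illustrated with an upper-confidence-bound style selector. This is a sketch under explicit assumptions and should not be read as DEAL's exact estimator: the class name `UcbSelector`, the exploration bonus `sqrt(3 ln(t) / (2 n))`, the optimistic default of 1 for unplayed workers, and the tie-breaking by device name are all illustrative choices, and the minimum selection fraction constraint is ignored.

```python
import math


class UcbSelector:
    """Pick up to max_size available workers by weighted, truncated UCB estimate."""

    def __init__(self, devices, weights, max_size):
        self.weights = dict(weights)
        self.max_size = max_size
        self.count = {d: 0 for d in devices}   # times each worker was selected
        self.mean = {d: 0.0 for d in devices}  # observed mean reward per worker
        self.round = 0

    def estimate(self, device):
        n = self.count[device]
        if n == 0:
            return 1.0  # assumed optimistic default for an unplayed worker
        # exploitation term (observed mean) plus exploration bonus (assumed form)
        bonus = math.sqrt(3.0 * math.log(self.round + 1) / (2.0 * n))
        return min(self.mean[device] + bonus, 1.0)  # rewards lie in [0, 1]

    def select(self, available):
        ranked = sorted(
            available,
            key=lambda d: (-self.weights[d] * self.estimate(d), d),
        )
        return ranked[: self.max_size]

    def update(self, observed):
        """observed maps each selected worker to its reward in this round."""
        for d, r in observed.items():
            self.count[d] += 1
            self.mean[d] += (r - self.mean[d]) / self.count[d]
        self.round += 1
```

A server loop would call `select` with the currently available workers at the start of a round and `update` with the observed rewards at the end of it.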
III-D Local Learning and Control
To solve the previously described privacy issues and heavy resource footprint, we propose local decremental learning, which respects the selected worker set with maximum rewards. The main idea of our method is that the training algorithm of the target model retains the intermediate results of the model calculation. We can update the intermediate results efficiently in two ways: an incremental method to merge new user data, and a decremental method to delete user data. As such, we are able to fine-tune the local energy configuration, as well as provide a more accurate profile of the local device for the next round of global selection.
DEAL adapts learning algorithms into local training in the following procedure:
Model Construction: Construct the prediction model according to the characteristic of the specific learning algorithm.
Update Procedure: Design the corresponding decremental and incremental updating algorithms according to the model established in Model Construction, and derive power savings from them via the dynamic voltage and frequency scaling (DVFS) tuning function.
Data Recovery: Analyze how to recover deleted user data from the stale model.
To adapt a specific learning algorithm to the local DEAL middleware, DEAL first rebuilds this algorithm into a decremental version. This process is usually done offline, and DEAL provides these decremental learning algorithms as a local learning algorithm library. The decremental learning algorithms focus on the incremental/decremental updates, associated with local energy control. When notified of a degree of “forget”, i.e., the user-defined variable, the DEAL middleware adopts an LRU variant that only replaces the corresponding percentage of the allocated pages recently used. This algorithm can significantly reduce the frequency of page replacement, as well as the number of swaps. DEAL keeps track of the level of forgetting in the decremental learning algorithms using data recovery policies, in order to prevent aggressive forgetting and convergence failure. Next, we present DEAL on two learning algorithm cases, namely Personalized PageRank and Tikhonov Regularization, which are widely used in real-time mobile-based machine learning, in order to highlight the design and implementation of DEAL in the local layer and to show that DEAL can be easily adapted to effectively support other algorithms and systems.
Case 1: Personalized PageRank. Personalized PageRank (PPR) is a fundamental operation first proposed by Google [f41]. The PPR algorithm is similar to a recommendation algorithm: both calculate the distance between users from the perspective of user-item correlation. For example, in web page recommendation, the approach records the browsing history on each device. Pairs of co-occurring pages are then ranked and form the basis for later recommendations. In its simplest form, the input of PPR is a binary history matrix Y, which represents the interactions between a set of devices and a set of items. If device $u$ interacts with item $i$, the entry $Y_{ui}$ equals 1, and 0 otherwise.
Model Construction. The model of PPR consists of a similarity matrix L, which represents the interaction similarity between item pairs. A common training method for this model is to first calculate the co-occurrence matrix $C = Y^\top Y$, which represents the number of users interacting with each pair of items. In addition, we need a vector v to represent the number of interactions for each item (the sum of the rows of Y). Next, we obtain the similarity matrix L from the co-occurrence counts. Many similarity measures between items can be computed from the co-occurrence matrix [f31]. The Jaccard similarity between items $i$ and $j$, calculated by $L_{ij} = C_{ij} / (v_i + v_j - C_{ij})$, is a better choice. Results can be obtained by querying the similarity matrix L, as described in our motivation example in Figure 1. We retrieve recommendations of item pairs by querying the most similar items for each item, and calculate the preference estimates as the weighted sum of the item similarities and the corresponding user history to generate the items to recommend for a specific device [f32].
Update Procedure. We need three intermediate data structures to enable the incremental and decremental updates for PPR: 1) the item interaction count vector v; 2) the co-occurrence matrix C; and 3) the similarity matrix L.
The FORGET function (Lines 10-17) of Algorithm 1 gives the whole process of deleting the data of the u-th user (corresponding to the u-th row of the history matrix Y) from the model. We first update the co-occurrence counts by traversing all item pairs in the user's history. Finally, we 1) traverse each item in the usage history of the user and 2) renew the corresponding row of the similarity matrix. The incremental update of the model, given in the UPDATE function (Lines 2-8) of Algorithm 1, works similarly; the difference is that the counts are incremented instead of decremented.
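The incremental and decremental updates can be sketched with the three intermediate structures named above. This is a minimal illustration in the spirit of the UPDATE and FORGET functions, not a transcription of Algorithm 1: the class `ItemSimilarityModel` and its dictionary-based sparse storage are assumptions, and the similarity is the Jaccard measure over co-occurrence counts.

```python
from collections import defaultdict
from itertools import combinations


class ItemSimilarityModel:
    """Keeps v (item counts), C (co-occurrence counts), and L (Jaccard similarity)."""

    def __init__(self):
        self.v = defaultdict(int)  # number of interactions per item
        self.C = defaultdict(int)  # co-occurrence count per sorted item pair
        self.L = {}                # Jaccard similarity per sorted item pair

    def _apply(self, history, delta):
        items = sorted(set(history))
        for i in items:
            self.v[i] += delta
        for pair in combinations(items, 2):
            self.C[pair] += delta
        # Any stored pair that touches one of the user's items may change,
        # because its item counts changed even if its co-occurrence did not.
        touched = set(items)
        for (i, j) in list(self.C):
            if i in touched or j in touched:
                c = self.C[(i, j)]
                if c <= 0:
                    del self.C[(i, j)]
                    self.L.pop((i, j), None)
                else:
                    self.L[(i, j)] = c / (self.v[i] + self.v[j] - c)

    def update(self, history):
        """Incremental update: merge one user's interaction history."""
        self._apply(history, +1)

    def forget(self, history):
        """Decremental update: remove one user's interaction history."""
        self._apply(history, -1)
```

After `forget`, the similarity entries match those of a model trained only on the remaining users, without retraining from scratch.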
Space Complexity. PPR is composed of a similarity matrix whose space is quadratic in the number of items. As we need to keep the co-occurrence matrix C and the vector v, in the worst case we need to adjust a quadratic number of entries of the co-occurrence matrix and recalculate as many entries of the similarity matrix L. The intermediate data structures of the decremental learning algorithm double the required memory, and the update has quadratic complexity in the number of items in the worst case. In practice, for example, for a given forget configuration and number of items, DEAL's LRU variant avoids up to 378 page swaps in memory replacement during a single round. However, most users interact with very few items in real-world data. In addition, we only retain the top-k entries of each item in L. At the same time, we introduce bounds on the maximum number of interactions to reduce the memory required for the intermediate data structures and the update complexity [f40]. The number of energy control function calls is linear in the number of function calls for incremental/decremental updates. Although DEAL uses the similarity matrix to find corresponding users, it may selectively reduce the data. As the training proceeds, the new data overwrite the old data. Moreover, the new data can be detected after a few training rounds. The detailed analysis is discussed in Section IV.
Data Recovery. We analyze how to recover deleted user data from the stale model in the case of deleting the data of a single device from the database. When the row corresponding to the deleted device is removed from the original matrix Y, it becomes an updated matrix $Y'$. If we still have access to the similarity matrix L calculated from the original matrix Y, then we can calculate the corresponding similarity matrix $L'$ from the updated matrix $Y'$ and compare it with the stale similarity matrix L. All items with differences in the entries of the similarity matrices (i.e., $L_{ij} \neq L'_{ij}$) are exactly the items that were included in the interaction history of the deleted device. In this way we can recover the deleted data.
Case 2: Tikhonov Regularization. Tikhonov regularization [f33] is a technique widely used to analyze multiple regression data that suffer from multicollinearity. The model input is the matrix M of d-dimensional observations and the corresponding numerical target variable r. Without loss of generality, we assume that the data of a specific user is captured in the i-th row vector $m_i$ of the input. If more than one row represents a particular user, we can simply perform the decremental update several times.
Model Construction. A common method to compute the Tikhonov regularization model is to solve the normal equation $h = (M^\top M + \lambda I)^{-1} M^\top r$ for the weight vector h, where $\lambda$ is the regularization constant. As shown in the PREDICT function (Line 12) of Algorithm 2, the estimate for a new observation $m_{new}$ is its dot product with the weight vector: $\hat{r}_{new} = h^\top m_{new}$.
Update Procedure. We use an efficient calculation to delete the data of a certain device (corresponding to the i-th row $m_i$ of matrix M, with target value $r_i$) from the Tikhonov regularization model with regularization constant $\lambda$:
$$h' = (M^\top M - m_i m_i^\top + \lambda I)^{-1} (M^\top r - m_i r_i)$$
In this way, we retain two intermediates from the calculation: the vector $z = M^\top r$ and a QR factorization of the regularized Gram matrix $M^\top M + \lambda I$. We use the FORGET function (Lines 7-10) in Algorithm 2 to solve for the updated model h. First, we recalculate z by subtracting $m_i r_i$, and then we update the QR decomposition factors Q and R using the fast rank-one update algorithm [f34] with $-m_i$ and $m_i$ as parameters. In this way, we obtain the new model.
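The decremental update can be illustrated with a compact sketch. Note the substitution: instead of maintaining a QR factorization with a rank-one QR update as in the text, this sketch keeps the explicit inverse of the regularized Gram matrix and applies the Sherman-Morrison formula, which is shorter to write but less numerically robust. The class `RidgeModel` and its method names are hypothetical.

```python
class RidgeModel:
    """Tikhonov-regularized regression with rank-one incremental/decremental updates."""

    def __init__(self, dim, lam=1.0):
        self.z = [0.0] * dim  # running M^T r
        # inverse of (M^T M + lam * I); with no data this is I / lam
        self.A_inv = [
            [(1.0 / lam if i == j else 0.0) for j in range(dim)]
            for i in range(dim)
        ]

    def _rank_one(self, m, sign):
        """Sherman-Morrison update of the inverse for A + sign * m m^T."""
        d = len(m)
        Am = [sum(row[k] * m[k] for k in range(d)) for row in self.A_inv]
        denom = 1.0 + sign * sum(m[k] * Am[k] for k in range(d))
        for i in range(d):
            for j in range(d):
                self.A_inv[i][j] -= sign * Am[i] * Am[j] / denom

    def update(self, m, r):
        """Incremental update: add observation m with target r."""
        self._rank_one(m, +1.0)
        self.z = [zi + mi * r for zi, mi in zip(self.z, m)]

    def forget(self, m, r):
        """Decremental update: remove a previously added observation."""
        self._rank_one(m, -1.0)
        self.z = [zi - mi * r for zi, mi in zip(self.z, m)]

    def weights(self):
        """Weight vector h = A^{-1} z."""
        return [sum(row[k] * self.z[k] for k in range(len(self.z))) for row in self.A_inv]

    def predict(self, m_new):
        return sum(h * x for h, x in zip(self.weights(), m_new))
```

Forgetting an observation yields the same weights as training on the remaining observations, at a cost quadratic in the number of features per update.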
Space Complexity. The decremental variant of the model maintains two additional matrices, Q and R, as well as the vector z. The space required by Tikhonov regularization is quadratic in the number of features and independent of the number of examples; the number of features is usually much smaller than the number of examples. A decremental/incremental update requires scaling and adding to z, the rank-one QR update [f34], two matrix-vector multiplications, and solving for h by back substitution. DEAL introduces the FORGET function (Lines 7-10 of Algorithm 2). Thus, -LRU can significantly reduce page faults in this linearly correlated algorithm. In all, the complexity of our update improves on the original retraining complexity.
Data Recovery. It is difficult to obtain information about the feature vector of the deleted device from the model h. Although access to the complete target variable r lets us constrain the candidate vectors to a subspace, doing so can still produce a large prediction error, so more knowledge of M is needed to further narrow the candidate vector space.
In this section, we first introduce the experimental setup and then discuss the corresponding evaluation results from different perspectives in detail.
Table I: Hardware information of the mobile devices (columns: Device, Android Version, Core, Maximum Frequency).
IV-A Experimental Setup
We evaluate the effectiveness of DEAL with both a physical testbed and simulation. For the physical testbed, we prototype a federated learning system using mobile devices with different hardware configurations. Table I shows the hardware information of the mobile devices adopted in the experiments. The on-device learning process is implemented on the deep learning framework DL4J [f43]. In addition, a Monsoon Power Monitor [f44] is used to measure the power consumption of the participating devices. For the simulation, we emulate different mobile devices with independent docker images and deploy hundreds of FL docker images to simulate the corresponding mobile devices. We provide the docker image of the complete experimental code on DockerHub (https://hub.docker.com/r/goodlab/deal). To evaluate the effectiveness of DEAL, we compare it with the following baselines from different perspectives.
Original is a federated learning system that always retrains on all data objects created for the model.
NewFL is a modified federated learning system implemented using the DL4J framework [f21], which trains only on new data.
Models and Datasets: To evaluate the effectiveness of DEAL, we build four different models (Personalized PageRank, K-Nearest Neighbors, Multinomial Naïve Bayes and Tikhonov Regularization) and train them on eight datasets. Specifically, for Personalized PageRank, we use two datasets [f50] about movie ratings (movielens) and joke ratings (jester). For the classification models (K-Nearest Neighbors and Multinomial Naïve Bayes), we use datasets [f51] about mushrooms, phishing websites (phishing) and cartographic forest data (covtype). For the Tikhonov Regularization model, we use datasets [f51] on housing prices (housing, cadata) and music (YearPredictionMSD). Finally, to understand the impact of training on new data only, we adopt the image classification dataset Cifar-10.
Comparison of Training Completion Time. We first train a model on each dataset and load it into the smartphone. Next, we compare the overall training completion time of DEAL with all baselines in different application scenarios, repeating every experiment for twenty randomly selected users. Figure 3 shows the experimental results under different CPU frequencies on the Huawei Honor 8 Lite. Specifically, Figure 3(a) shows the training time of the Personalized PageRank model on the movielens and jester datasets. On the movielens dataset, DEAL is one and two orders of magnitude faster than NewFL and the Original, respectively. This is because DEAL not only trains on less data than the Original but also forgets some updates during training, which greatly reduces the overall training time.
Similarly, Figure 3(b) shows the training time of a non-linear model, a k-Nearest Neighbors algorithm with Locality Sensitive Hashing, on the mushrooms and phishing datasets. DEAL shows a similar improvement in training time, namely one order of magnitude and three orders of magnitude faster than NewFL and the Original, respectively. Moreover, when we allow aggressive DVFS on the device, DEAL achieves four orders of magnitude better performance than the Original on the phishing dataset. There are two reasons: (1) the phishing dataset is very large and the whole training is more I/O intensive than other workloads, which leaves more potential for power saving; (2) the Original always trains on the whole dataset, which has a much larger memory footprint than DEAL.
In the remaining two cases, Figures 3(c) and 3(d) both show that DEAL trains a converged model 2-4 orders of magnitude faster than the two baselines. However, when the data is more coarsely distributed, as in the YearPredictionMSD dataset, the performance of DEAL slowly converges to that of NewFL. Even so, DEAL always achieves better modeling performance than NewFL because DEAL captures features from older data objects.
Comparison of Convergence Time. We deploy hundreds of FL docker images to simulate the corresponding mobile devices. Figure 4 shows the simulation results as a CDF of the convergence time of DEAL and the Original on the Personalized PageRank model with the default power governor (interactive). On the Movielens dataset, DEAL trains the converged model faster than the Original on 92% of the simulated devices, and the median convergence times of DEAL and the Original are 158ms and 94,988ms (normalized to 0.18 and 0.55). On the Jester dataset, DEAL is faster on 85% of the simulated devices, and the median convergence times of DEAL and the Original are 1ms and 6,598ms (normalized to 0.25 and 0.36). In our simulation, DEAL therefore converges roughly three orders of magnitude faster than the Original on more than half of the mobile devices. This is because DEAL lets the server communicate with workers periodically via the SUB method and start the central convergence once it has received SUB signals from more than half of the selected workers or once a TTL has expired. DEAL does not need to wait for all model updates. However, the tail of the convergence time distribution is still not good enough.
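The trigger described above (start the central convergence on a majority of SUB signals or on TTL expiry) can be sketched in a few lines. The function name, parameter names, and time unit are hypothetical; the paper does not specify how the server tracks SUB signals.

```python
def should_aggregate(sub_count, selected_count, elapsed_s, ttl_s):
    """Decide whether the server starts the central convergence step.

    sub_count:      SUB signals received so far in this round (assumed counter)
    selected_count: number of workers selected for this round
    elapsed_s:      seconds since the round started
    ttl_s:          time-to-live of the round in seconds (assumed unit)
    """
    majority = sub_count > selected_count / 2   # strictly more than half
    expired = elapsed_s >= ttl_s                # the TTL has passed
    return majority or expired

# Toy round with 10 selected workers and a 30-second TTL.
early_majority = should_aggregate(6, 10, elapsed_s=1.0, ttl_s=30.0)
exact_half = should_aggregate(5, 10, elapsed_s=1.0, ttl_s=30.0)
ttl_fallback = should_aggregate(2, 10, elapsed_s=30.0, ttl_s=30.0)
```

Because the server never waits for every worker, slow devices affect only their own local models, which is consistent with the observation that the tail of the distribution remains long.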
Comparison of Model Accuracy. Figure 5 compares the accuracy of the Tikhonov regularization model on different datasets. On the phishing dataset, the model accuracy of DEAL is only 9% lower than that of the Original. The housing dataset shows the largest accuracy reduction, at 12%. On the remaining datasets, the accuracy differs from that of the Original by only about 3%. These results show that DEAL effectively improves the learning speed of on-device federated learning at a limited cost in model accuracy, and the subsequent results show that DEAL also improves the energy efficiency of federated learning on the devices.
Comparison of Energy Consumption. Besides reducing training time, DEAL allows adaptive power control in the training algorithm, as described in Section III. Here, we provide an energy consumption analysis based on our measurements under different CPU frequencies on the Huawei Honor 8 Lite, shown in Figure 6. One common pattern behind the power saving is that, for every system including the baselines, the total energy consumed gradually decreases with the CPU frequency.
Figure 6(a) shows the energy consumed for training the Personalized PageRank model on the movielens and jester datasets. On the movielens dataset, DEAL saves 253.2uAh and 3687.1uAh of energy compared to NewFL and the Original, respectively. On the jester dataset, DEAL saves about 300uAh.

Figure 6(b) shows the energy consumed for training k-Nearest Neighbors with Locality Sensitive Hashing on the mushrooms and phishing datasets. On these datasets, DEAL consumes an order of magnitude less energy than NewFL, saving about 250uAh. Compared to the Original, DEAL saves approximately 110,000uAh.

Figure 6(c) shows the energy consumed for training Multinomial Naïve Bayes models on the mushrooms, phishing, and covtype datasets. On these three datasets, DEAL consumes two to three orders of magnitude less energy than NewFL, saving about 263uAh. On the mushrooms and phishing datasets, DEAL consumes three orders of magnitude less energy than the Original. On the covtype dataset, DEAL consumes four orders of magnitude less energy than the Original: because the cardinality of the covtype dataset is much larger than that of the mushrooms and phishing datasets, the training time and power consumption required for a full retraining increase accordingly. Specifically, on the covtype dataset DEAL saves 17,908.1uAh compared to the Original.

Figure 6(d) shows the energy consumed for training the Tikhonov regularization model on the housing, cadata, and YearPredictionMSD datasets. On the housing dataset, DEAL saves only 6.7uAh compared to the Original, because the dataset is so small that retraining consumes little energy. On the YearPredictionMSD dataset, DEAL saves 77,497.6uAh compared to the Original.
Figure 7 compares the energy consumption of DEAL and the Original on the Tikhonov regularization model for six different datasets: housing, mushrooms, phishing, cadata, YearPredictionMSD and covtype. Regardless of the dataset, DEAL consumes more than one order of magnitude less energy than the Original. On some datasets, it even saves three orders of magnitude of energy.
In short, DEAL can save up to 81.7% and 80.6% of the energy cost on average, compared to the Original and NewFL, respectively. Even with the smallest dataset, housing, in the Tikhonov regularization model, DEAL still saves 75.6% of energy compared to the Original. Because the datasets used by each model differ in size, the behavior of DEAL changes accordingly across datasets.
Comparison of Privacy. There is currently no approach that can effectively quantify privacy on mobile phones [f47, f48], so we measure privacy by observing the proportion of data objects. We add 10 new data objects in each round of training and use the proportion of these 10 new objects among all training data objects as the privacy measure. Figure 8 shows that NewFL performs best: it trains only on new data, so its proportion is always 100%. The Original must train on all the data (the 10 newly added objects plus the previous old data), so its proportion keeps decreasing as the number of training rounds increases. DEAL exhibits jitter because it combines two learning methods, decremental learning and incremental learning. In practice, old data is generally what gets deleted, so DEAL pays less and less attention to old data, and new data always overwrites old data. When new data must be deleted, DEAL can also delete it in a specific training round.
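The proportion used as a privacy proxy can be written down directly. The sketch below assumes that training starts with no prior data and that exactly 10 objects arrive per round; the retained-object counts used for DEAL are made-up values that only illustrate how forgetting produces jitter, and `proportion_of_new` is a hypothetical helper.

```python
def proportion_of_new(new_count, old_count):
    """Share of newly added data objects among all objects trained in a round."""
    return new_count / (new_count + old_count)

NEW_PER_ROUND = 10
rounds = range(1, 6)

# NewFL trains on the new objects only, so no old objects are involved.
newfl = [proportion_of_new(NEW_PER_ROUND, 0) for _ in rounds]

# Original retrains on everything accumulated in the earlier rounds.
original = [proportion_of_new(NEW_PER_ROUND, NEW_PER_ROUND * (k - 1)) for k in rounds]

# DEAL keeps a varying number of old objects after forgetting (hypothetical counts).
retained_old = [0, 10, 5, 15, 5]
deal = [proportion_of_new(NEW_PER_ROUND, old) for old in retained_old]
```

With these assumptions the NewFL curve stays flat at 100%, the Original curve decays as 1/k, and the DEAL curve moves up and down between the two depending on how much old data is forgotten in each round.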
V Related Work
Our work is closely related to two major research topics, distributed learning and federated learning.
Distributed Learning. Distributed learning has attracted a lot of attention as a way to effectively train different neural network models with large amounts of data located at different places. Previous research has improved the system performance of distributed learning from different perspectives. Zhang et al. [f10] design a scheduling algorithm that approximates the training performance of deep learning jobs in order to maximize the overall performance of tasks in a cluster. Li et al. [f12] propose a parameter server framework for distributed learning that manages asynchronous data communication between working nodes and supports flexible consistency models, elastic scalability and continuous fault tolerance. Though these approaches can effectively improve the system performance of distributed learning, they cannot be directly applied to mobile-based federated learning. Compared with servers located in a data center, mobile devices are far more constrained in computing capacity and battery lifetime.
Federated Learning. Federated learning is proposed to let multiple mobile devices collaboratively train a shared deep learning model while guaranteeing data privacy [f1, f2, f3, f4, f5, f6, f7, f8, f9, f36]. Lalitha et al. [f1] design a distributed learning algorithm to train a machine learning model over a network of users in a fully decentralized manner. Bonawitz et al. [f9] build a scalable production system for federated learning in the domain of mobile devices, based on Tensorflow. Konecny et al. [f2] aim to improve the communication efficiency of a federated learning system and propose two schemes (sketched update and structured update) to reduce the uplink communication costs. Smith et al. [f3] propose a system-aware optimization approach to address high communication cost, stragglers, and fault tolerance in distributed multi-task learning. Wu et al. [f36] adopt a data-driven approach to present the opportunities and design challenges faced by Facebook in enabling machine learning inference locally on smartphones and other edge platforms. However, the problem of effectively reducing energy consumption while guaranteeing model accuracy, which is critical to battery-powered mobile devices, has not been sufficiently investigated.
This paper proposes an energy-efficient learning framework, DEAL, that achieves energy saving with a decremental learning design. DEAL improves the energy efficiency of the training process at two levels. The first level selects a subset of workers with sufficient capacity in order to maximize the rewards, i.e., the energy saving potential. The second level consists of a dedicated decremental learning algorithm that provides decremental and incremental update functions and adaptively tunes the DVFS of the local mobile device. DEAL is prototyped as containerized services with modern smartphone profiles and evaluated on different learning benchmarks with real-world traces. The evaluation results show that DEAL achieves a 75.6%–82.4% smaller energy footprint on different datasets compared to traditional methods. Moreover, all learning processes are up to 2-4 orders of magnitude faster than in the classic federated learning framework. Immediate future work includes evaluating DEAL with more applications and models on more smartphones at scale.