Federated Evaluation of On-device Personalization

by Kangkang Wang, et al.

Federated learning is a distributed, on-device computation framework that enables training global models without exporting sensitive user data to servers. In this work, we describe methods to extend the federation framework to evaluate strategies for personalization of global models. We present tools to analyze the effects of personalization and evaluate conditions under which personalization yields desirable models. We report on our experiments personalizing a language model for a virtual keyboard for smartphones with a population of tens of millions of users. We show that a significant fraction of users benefit from personalization.



1 Introduction

As users increasingly shift to mobile devices as their primary computing device (Anderson, 2015), we hypothesize that the information on devices allows for personalizing global models to better suit the needs of individual users. This can be achieved in a privacy-preserving way by fine-tuning a global model using standard optimization methods on data stored locally on a single device. While we expect personalization to be beneficial for most users, we need to make sure it doesn’t make things worse for some users, e.g. by overfitting.

In this paper, we describe extensions to the Federated Learning (Bonawitz et al., 2019) (FL) framework for evaluating the personalization of global models. We study this using an RNN language model for the keyboard next-word prediction task (Hard et al., 2018). We show that we can derive and impose conditions under which a personalized model is deployed if and only if it makes the user’s experience better. We further show that it is possible to personalize models that benefit a significant fraction of users.

2 Federated Personalization Evaluation

Federated Learning is a distributed model training paradigm where data never leaves users’ devices. Only minimal and ephemeral updates to the model are transmitted by the clients to the server where they are aggregated into a single update to the global model (McMahan et al., 2017). FL can be further combined with other privacy-preserving techniques like secure multi-party computation (Bonawitz et al., 2017) and differential privacy (McMahan et al., 2018; Agarwal et al., 2018; Abadi et al., 2016b). Hard et al. (2018) showed that FL can be used to train an RNN language model that outperforms an identical model trained using traditional server-side techniques, when evaluated on the keyboard next-word prediction task.

Figure 1: An illustration of Federated Personalization Evaluation: (A) the global model (gray circle) is sent to client devices, (B) the device computes SGD updates on the train partition of the local data, resulting in a personalized model (the green square), (C) the device computes a report of metrics for the global and personalized model on the test partition of the local data, (D) pairs of metric reports are sent by various devices to the server, (E) the server computes histograms of various delta metrics.

Such a global model is necessarily a consensus model, and it stands to reason that population-wide accuracy can be further improved through personalization on individual users’ data. However, such on-device refinements cannot be tested server-side because the training/eval data is not collected centrally. It is reasonable to expect that, given the nature of neural network training, personalizing models might make the experience of some users worse. We will show that we can prevent such undesired effects by carefully calibrating the model hyperparameters, and by building a gating mechanism that accepts or rejects personalized models for use in inference.

In this paper, we introduce an extension to the FL framework for evaluating personalization accuracy and for determining the training and acceptance hyperparameters: Federated Personalization Evaluation (FPE). As in the FL setting, mobile phones connect to a server when idle, charging, and on an unmetered network (Bonawitz et al., 2019). Selected devices are served a baseline model along with instructions on how to train it using the device’s dataset in the form of a TensorFlow graph (Abadi et al., 2016a). In FL, the device would compute and send its model update to the server for aggregation, but in FPE, the device instead performs five steps: it splits the local on-device dataset into a train and test partition using practitioner-defined criteria; it computes metrics of the baseline model on the test data; it fine-tunes the model on the training set; it computes metrics of the personalized model on the test set; and finally it computes and uploads the change in metrics between the personalized and baseline variants. The server aggregates the metrics it receives from various clients to compute histograms of various delta metrics.
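The five on-device steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the production implementation: the "model" is a toy lookup table, `toy_fine_tune` stands in for on-device SGD, and all function names, the `Example` type, and the 0.8 split fraction default are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Example = Tuple[str, str]   # hypothetical (context, next_word) pair
Model = Dict[str, str]      # toy stand-in: context -> predicted next word


def split_by_time(examples: Sequence[Example], train_fraction: float = 0.8):
    """Split temporally ordered examples into (train, test) partitions."""
    cut = int(len(examples) * train_fraction)
    return list(examples[:cut]), list(examples[cut:])


def accuracy(model: Model, examples: Sequence[Example]) -> float:
    """Fraction of examples whose next word the model predicts correctly."""
    if not examples:
        return 0.0
    correct = sum(model.get(ctx) == word for ctx, word in examples)
    return correct / len(examples)


def toy_fine_tune(model: Model, train: Sequence[Example]) -> Model:
    """Stand-in for on-device SGD: memorize the latest word per context."""
    personalized = dict(model)
    personalized.update(train)
    return personalized


def fpe_client_report(examples, baseline, fine_tune=toy_fine_tune):
    train, test = split_by_time(examples)            # 1. split local data
    baseline_acc = accuracy(baseline, test)          # 2. baseline metrics
    personalized = fine_tune(baseline, train)        # 3. fine-tune on train
    personalized_acc = accuracy(personalized, test)  # 4. personalized metrics
    return {                                         # 5. report the change
        "baseline": baseline_acc,
        "personalized": personalized_acc,
        "delta": personalized_acc - baseline_acc,
    }
```

Only the small metrics dictionary would leave the device in this sketch; the examples and the personalized model stay local.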

Figure 1 illustrates this process. FPE allows us to evaluate the benefit of personalization and identify good hyperparameters using the existing infrastructure for federated learning, without any user-visible impact. These conclusions can then be used for live inference using personalized models, though live inference is beyond the scope of this paper.

3 Method

3.1 Network Architecture

The network architecture of the next-word prediction model is described in Hard et al. (2018). We use a variant of the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) recurrent neural network called the Coupled Input and Forget Gate (CIFG) (Greff et al., 2017). The input embedding and output projection matrices are tied to reduce the model size (Press and Wolf, 2017; Inan et al., 2016). For a vocabulary of size $V$, a one-hot encoding $v \in \mathbb{R}^{V}$ is mapped to a dense embedding vector $d \in \mathbb{R}^{D}$ by $d = Wv$ with an embedding matrix $W \in \mathbb{R}^{D \times V}$. The output projection of the CIFG, $h$, also in $\mathbb{R}^{D}$, is mapped to the output vector $W^{\top} h \in \mathbb{R}^{V}$. A softmax layer converts the raw logits into normalized probabilities. Cross-entropy loss over the output and target labels is used for training. The vocabulary of $V$ words includes the special beginning-of-sentence, end-of-sentence, and out-of-vocabulary tokens. The input embedding and CIFG output projection dimension $D$ is set to 96. A single layer CIFG with 670 units is used. The network has 1.4 million parameters.
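The tied input embedding and output projection can be illustrated with a short NumPy sketch. The sizes here are toy values chosen for readability, the names `W`, `V`, `D`, `embed`, and `output_probs` are notation assumed for this example, and the recurrent CIFG cell itself is omitted: the embedding vector is fed straight to the output layer only to show the shapes.

```python
import numpy as np

V, D = 8, 4                      # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(D, V))      # single matrix shared by input and output


def embed(token_id: int) -> np.ndarray:
    """Map a one-hot token encoding to its dense embedding, d = W v."""
    one_hot = np.zeros(V)
    one_hot[token_id] = 1.0
    return W @ one_hot


def output_probs(h: np.ndarray) -> np.ndarray:
    """Map a D-dimensional hidden vector to probabilities with the tied W."""
    logits = W.T @ h                       # shape (V,)
    shifted = np.exp(logits - logits.max())  # numerically stable softmax
    return shifted / shifted.sum()


def cross_entropy(probs: np.ndarray, target_id: int) -> float:
    """Negative log-probability assigned to the target token."""
    return float(-np.log(probs[target_id]))
```

Because the same matrix is used in both directions, the embedding and output layers together cost $D \times V$ parameters rather than $2 \times D \times V$.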

3.2 Global Model Training

The next-word prediction model is trained using FL on a population of users whose language is set to US English, as described in Hard et al. (2018). The FederatedAveraging algorithm (McMahan et al., 2017) is used to aggregate distributed client SGD updates. Training progresses synchronously in “rounds”. Every client, indexed by $k$, participating in a given round, indexed by $t$, computes the average gradient, $g_k$, on its $n_k$ local examples, with the current model $w_t$, using stochastic gradient descent (SGD). For a client learning rate $\epsilon$, the local client update, $w_{t+1}^{k}$, is given by $w_{t+1}^{k} = w_t - \epsilon g_k$. The server performs a weighted aggregation of the client models to obtain a new global model, $w_{t+1}$: $w_{t+1} = \sum_k \frac{n_k}{N} w_{t+1}^{k}$, where $N = \sum_k n_k$. The server update is achieved via the Momentum optimizer, using Nesterov accelerated gradient (Nesterov, 1983; Sutskever et al., 2013), a momentum hyperparameter of 0.9, and a server learning rate of 1.0. Training converges after 3000 training rounds, over the course of which 600 million sentences are processed by 1.5 million clients. Training typically takes 4 to 5 days.
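As a rough illustration of one aggregation round, the sketch below applies a single client SGD step and then averages the client models weighted by their local example counts. It is a simplification under stated assumptions: the function names are hypothetical, each client takes one gradient step, and the server-side Nesterov momentum described above is left out, so the weighted average is used directly as the new global model.

```python
import numpy as np


def client_update(global_model: np.ndarray, avg_gradient: np.ndarray,
                  client_lr: float) -> np.ndarray:
    """One local SGD step from the current global model."""
    return global_model - client_lr * avg_gradient


def federated_average(client_models, client_sizes) -> np.ndarray:
    """Average client models, weighting each by its number of examples."""
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()
    return sum(w * m for w, m in zip(weights, client_models))


# Toy round with two clients holding 1 and 3 examples respectively.
w_t = np.array([1.0, 1.0])
models = [
    client_update(w_t, np.array([1.0, 0.0]), client_lr=0.5),
    client_update(w_t, np.array([0.0, 1.0]), client_lr=0.5),
]
w_next = federated_average(models, client_sizes=[1, 3])
```

The client with more data pulls the average further toward its own update, which is the point of weighting by example count.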

3.3 Model Personalization Strategies

A personalization strategy consists of the model graph, the initial parameter values, and the training hyperparameters: client learning rate, train batch size, and stopping criteria. Throughout our experiments, the model graph and initial parameter values are those of the federated-trained next-word prediction model described in Section 3.2. We evaluate the effect of personalization by varying the training hyperparameters.

Given a personalization strategy, the personalized model can then be trained from the initial global model using an individual client’s training cache data. Data are cached on mobile devices on which the language is set to US English. During training, the client data is first split into train and test partitions (80% and 20%, based on temporal order). Stochastic gradient descent is used for model training with a specified learning rate ($L$) and batch size ($B$). The stopping criteria are based on the number of tokens observed and the number of epochs trained. The training process stops when either criterion is satisfied.
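One way to picture a personalization strategy and its two stopping criteria is the sketch below. The class, field, and function names are hypothetical, the limits in the usage example are arbitrary, and the SGD step itself is replaced by a counter so that only the stopping logic is shown.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class PersonalizationStrategy:
    learning_rate: float   # client learning rate
    batch_size: int        # sentences per SGD step
    max_tokens: int        # stop once this many tokens have been observed
    max_epochs: int        # stop once this many epochs have been trained


def should_stop(s: PersonalizationStrategy, tokens_seen: int,
                epochs_done: int) -> bool:
    """Training stops as soon as either criterion is satisfied."""
    return tokens_seen >= s.max_tokens or epochs_done >= s.max_epochs


def run_schedule(s: PersonalizationStrategy,
                 sentence_lengths: Sequence[int]) -> Tuple[int, int, int]:
    """Return (sgd_steps, tokens_seen, epochs_done) for a training run."""
    steps = tokens_seen = epochs_done = 0
    while not should_stop(s, tokens_seen, epochs_done):
        for start in range(0, len(sentence_lengths), s.batch_size):
            batch = sentence_lengths[start:start + s.batch_size]
            tokens_seen += sum(batch)
            steps += 1                     # an SGD step would happen here
            if tokens_seen >= s.max_tokens:
                break                      # token budget hit mid-epoch
        else:
            epochs_done += 1               # full pass over the train data
    return steps, tokens_seen, epochs_done
```

For example, with a batch size of 2, four sentences of two tokens each, a 10-token budget, and a 3-epoch limit, the run ends on the third step, partway through the second epoch.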

4 Experiments

The performance of the personalized model is evaluated using the prediction accuracy metric, defined as the ratio of the number of correct predictions to the total number of tokens.

Table 1: The results from personalization eval experiments (columns: hyperparameters, accuracy delta). Metrics are reported from over 500,000 clients. Mean of baseline accuracy is 0.166.

Experiments are conducted to study the influence of client train batch size and learning rate on personalization performance. The stopping criteria are set to fixed token and epoch limits. We wish to assess the benefits brought by personalization across users. However, each user experiences a different baseline accuracy, depending on their style, use of language register, etc. Therefore, we measure the difference between prediction accuracies before and after personalization for each user, and observe the distribution of these differences in addition to their average. The histograms report only prediction-accuracy metrics and are computed over tens of thousands of users. Metric reports, including the baseline accuracies and the personalized model accuracies, are received from over 300,000 users’ devices. The delta metrics between the personalized model and the baseline model are summarized in Table 1.
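The per-user analysis described here amounts to computing an accuracy delta for each metric report and summarizing the distribution. The sketch below shows one possible server-side summary; the function name, the bin width, and the use of "at least 0.02" as the improvement threshold are assumptions made for illustration, and the sample reports are invented toy values rather than experimental data.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple


def summarize_deltas(reports: Sequence[Tuple[float, float]],
                     threshold: float = 0.02,
                     bin_width: float = 0.01) -> Dict[str, object]:
    """Summarize (baseline_accuracy, personalized_accuracy) report pairs."""
    deltas = [personalized - baseline for baseline, personalized in reports]
    n = len(deltas)
    return {
        "mean_delta": sum(deltas) / n,
        # Share of users whose accuracy improves by at least `threshold`.
        "fraction_improved": sum(d >= threshold for d in deltas) / n,
        # Share of users whose accuracy gets worse after personalization.
        "fraction_degraded": sum(d < 0 for d in deltas) / n,
        # Histogram keyed by bin index: bin i covers [i*w, (i+1)*w).
        "histogram": Counter(math.floor(d / bin_width) for d in deltas),
    }


# Toy reports (invented values, not experimental results).
toy_reports = [(0.10, 0.15), (0.20, 0.19), (0.30, 0.305), (0.25, 0.30)]
summary = summarize_deltas(toy_reports)
```

Looking at the degraded fraction alongside the mean is what makes the histogram useful: a setting can raise the average while still hurting a visible share of users.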

Figure 2: Accuracy delta histograms for different learning rates ($L$) and batch sizes ($B$). The share of users achieving an accuracy improvement of 0.02 is 47% in panel (a), 39% in panel (b), and 29% in panel (c).

For the best-performing combination of learning rate and batch size, personalization starts from a mean baseline prediction accuracy of 0.166 and reaches a mean personalized accuracy of 0.19, a mean relative accuracy increase of 14.5%.

While the mean metrics show how much personalization improves the model performance in general, the distribution reveals how personalization influences the experience of individual users. Histograms of the sampled accuracy deltas are shown in Figure 2.

As shown in Figure 2(a), with a small batch size, a large portion of users encounter model degradation with learning rate 1.0. This is not entirely surprising, since high learning rates with a small batch size can cause the parameter updates to jump over or even diverge from the minima. In Figure 2(b) and Figure 2(c), with larger batch sizes, histograms of learning rate 1.0 tend to have heavier tails both on the left and on the right, compared with histograms of learning rate 0.1. Though the average accuracy improvement of learning rate 1.0 at batch size 20 is lower than that of learning rate 0.1 (0.015 vs. 0.018), neither is clearly superior, since more users (39% vs. 29%) achieve a significant accuracy improvement (0.02) with learning rate 1.0.

Figure 3: Analysis of accuracy deltas by learning rate (L) and batch size (B) sliced by (a) number of user tokens and (b) baseline accuracy.

All personalization evaluation experiments in our study use the data stored in a user’s on-device training cache. Variations in the quantity and quality of training cache data across different devices are expected. We conduct experiments to evaluate how model personalization is influenced by factors associated with the user data. The factors considered are the number of training tokens and the baseline accuracy on the user data. The stopping criteria are set to fixed token and epoch limits. The accuracy improvements for each slice are summarized in Figure 3.

In Figure 3(a), user token counts are placed into 4 buckets. As one might expect, we observe larger improvements for more data. For a learning rate of 0.1, the accuracy improvements of the last two buckets get closer, indicating the saturation of the improvement. A learning rate of 1.0 achieves the best performance with very few tokens. The graph suggests that adjusting the learning rate based on the number of user tokens leads to better results.

In Figure 3(b), baseline accuracies of users are placed into 4 buckets. With learning rates 0.1 and 1.0, the accuracy improvement for users in the bucket with the worst baseline accuracy is greater than 0.25, while the accuracy improvement for users in the bucket with the best baseline accuracy is smaller than 0.2. The results indicate that users who deviate the most from the global model predictions are those who benefit the most from personalization.
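Slicing the accuracy deltas by a per-user factor, whether token count or baseline accuracy, reduces to bucketing users and averaging within each bucket. The sketch below is one possible way to do that; the function name, the bucket edges, and the sample records are hypothetical and are not the buckets or data used in the experiments.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple


def mean_delta_by_bucket(records: Sequence[Tuple[float, float]],
                         edges: Sequence[float]) -> List[Optional[float]]:
    """Average accuracy delta per bucket of a per-user factor.

    `records` holds (factor_value, accuracy_delta) pairs and `edges` are
    ascending bucket boundaries, giving len(edges) + 1 buckets. A value
    equal to an edge falls into the bucket above that edge.
    """
    sums = [0.0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for value, delta in records:
        index = bisect_right(edges, value)
        sums[index] += delta
        counts[index] += 1
    # Empty buckets are reported as None rather than as a zero mean.
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy (token_count, accuracy_delta) records and three hypothetical edges,
# which yield four buckets.
toy_records = [(50, 0.01), (150, 0.02), (150, 0.04), (5000, 0.05)]
bucket_means = mean_delta_by_bucket(toy_records, edges=[100, 1000, 2000])
```

The same function can be reused for the baseline-accuracy slice by passing (baseline_accuracy, accuracy_delta) pairs and accuracy-valued edges.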

5 Conclusion

This work describes tools to perform Federated Personalization Evaluation and analyze results in a privacy-preserving manner. Through experiments on live traffic, we show that personalization benefits users across a large population. We explore personalization strategies and demonstrate how they can be tuned to achieve better performance. To our knowledge, this represents the first evaluation of personalization using privacy-preserving techniques on a large population of live users.


The authors would like to thank colleagues on the Google Assistant and Google AI teams for many helpful discussions. We’re especially grateful to Emily Glanz and Brendan McMahan for help with our experiments, and Andrew Hard for help editing the manuscript. We’re also grateful for the many contributions made by researcher Jeremy Kahn to an early iteration of this work.


  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng (2016a) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
  • M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016b) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318.
  • N. Agarwal, A. T. Suresh, F. Yu, S. Kumar, and B. McMahan (2018) CpSGD: communication-efficient and differentially-private distributed SGD. In Neural Information Processing Systems.
  • M. Anderson (2015) Technology device ownership: 2015. http://www.pewinternet.org/2015/10/29/technology-device-ownership-2015/
  • K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. M. Kiddon, J. Konečný, S. Mazzocchi, B. McMahan, T. V. Overveldt, D. Petrou, D. Ramage, and J. Roselander (2019) Towards federated learning at scale: system design. In SysML 2019, to appear.
  • K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth (2017) Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, New York, NY, USA, pp. 1175–1191. ISBN 978-1-4503-4946-8.
  • K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber (2017) LSTM: A search space odyssey. IEEE Trans. Neural Netw. Learning Syst. 28 (10), pp. 2222–2232.
  • A. Hard, K. Rao, R. Mathews, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage (2018) Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. ISSN 0899-7667.
  • H. Inan, K. Khosravi, and R. Socher (2016) Tying word vectors and word classifiers: A loss framework for language modeling. CoRR abs/1611.01462.
  • B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, pp. 1273–1282.
  • B. McMahan, D. Ramage, K. Talwar, and L. Zhang (2018) Learning differentially private recurrent language models. In International Conference on Learning Representations (ICLR).
  • Y. Nesterov (1983) A method for solving the convex programming problem with convergence rate $O(1/k^2)$. Dokl. Akad. Nauk SSSR 269, pp. 543–547.
  • O. Press and L. Wolf (2017) Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pp. 157–163.
  • I. Sutskever, J. Martens, G. Dahl, and G. Hinton (2013) On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, Vol. 28, Atlanta, Georgia, USA, pp. 1139–1147.