In a wide range of fields, from music and advertising recommendations to healthcare and other consumer applications, learning users’ personal tendencies and judgements is essential. Many current approaches demand centralized data storage and computation to aggregate and learn globally. Such central models, along with features known about a given user, make predictions appear personal to that user. While such global models have proven to be widely effective, they bring with them inherent conflicts with privacy. User data must leave the device, and training a central model requires regular communication between a given user and the remote model. Further, if users are in some way truly unique, and exhibit different preferences than seemingly similar users, large centralized models may have trouble quickly adapting to this behavior.
With these disadvantages in mind, we present a definition of personalization that allows for no direct sharing or centralization of user data. We see personalization as the balance between generalization to global information and specialization to a given user’s quirks and biases.
To make this definition concrete, we show how a simple baseline model’s performance changes on a sentiment analysis task as a function of user bias and the way information is shared across models. We hope this work can contribute to framing the discussion around personalization and provide a metric for evaluating the extent to which a model truly provides a user personal recommendations.
We also discuss related areas such as differential privacy and federated learning, which have been motivated by similar considerations. Our work could easily fit into the frameworks of federated learning or differential privacy.
2. Related Work
2.1. Personalized Models
There has been a long history of research into personalization within machine learning. There is a wealth of work on using Bayesian hierarchical models to learn mixes of user and global parameters from data. These works have achieved success in areas from health care, to recommendation systems, to dealing with a mix of implicit and explicit feedback. There has also been increasing work on helping practitioners integrate these Bayesian techniques with deep learning models.
Many approaches to personalization within deep learning have relied on combining personal features, hand-written or learned, with some more global features to make predictions. For example, in deep recommender systems, a feature might be whether a user is a certain gender, or has seen a certain movie. A deep model may learn to embed these features and combine them with some linear model, as in wide-and-deep recommender systems, in order to make recommendations for a specific user. It is also common to learn some vector describing the user end to end for a task, rather than doing this featurization by hand. In such scenarios the input might be a sentence and a user id, and the prediction would be the next sentence, with the user featurized via some learned vector. Similarly, Park et al. learn a vector representation of a user’s context to generate image captions that are personal to the user, and others learn a user vector alongside a language model to determine if a set of answers to a question will satisfy a user. These approaches have the benefit of not requiring any manual description of the important traits of a user.
Here, when we discuss personalization, we focus on personalization work within deep learning. In general, deep learning models are large, complicated, and highly non-linear. This makes it hard to reason about how incorporating a new user, or a new set of training examples, will affect the state of the model at large, a phenomenon known as catastrophic forgetting, a topic which has itself seen a large amount of research. In general, this means that if we add a new user whose behavior is very different from our previous training examples, we need to take extra steps to preserve our performance on previous users. This makes online personalization of models to outlier users an open problem within deep learning.
2.2. Federated Learning
Our other key personalization constraint is privacy related: to get users to trust a model with extremely personal data, it is our belief that it is becoming increasingly necessary, and even legally mandated, to guarantee them a degree of privacy. Research on federated learning has demonstrated that intelligence from users can be aggregated, and centralized models trained, without ever directly storing user data in a central location, alleviating part of these privacy concerns. This research focuses on training models when data is distributed across a very large number of devices, and further assumes each device does not have access to a representative sample of the global data. We therefore believe federated learning is a key part of any personalization strategy.
Federated learning is concerned with training a central model that does well on users globally. However, the contribution from an individual user tends to be washed out after each update to the global model. Konečný et al. admit as much, explicitly saying that the issues of personalization are separate from federated learning. Instead, much of the current research focuses on improving communication speed and on maintaining stability between models when communication drops completely or lags greatly. The work that comes closest to our concerns hypothesizes a system in which each user has a personal set of knowledge and some more global mechanism aggregates knowledge from similar users. It does not, however, propose an exact mechanism for this aggregation or for determining which users are similar. We hope to contribute to the conversation on how to minimally compromise the privacy and decentralization of learning, while not forcing all models to globally cohere and synchronize.
Finally, it is important to note that federation itself does not guarantee privacy. While in practice this aggregation of gradients, in place of storing raw data, will often obscure some user behavior, it may still leak information about users. For example, if an attacker observes a non-zero gradient for a feature representing a location, it may be trivial to infer that some of the users in the group live in that location. Making strong guarantees about the extent to which data gives us information about individual users is the domain of differential privacy. In future work, we hope to incorporate these stronger notions of privacy into our discussion as well, but we believe that federated learning is a good first step towards greater user privacy.
3. Personalization Definition
With these problems in mind, we define personalization as the relative weighting between the performance of a model on a large, multi-user, global dataset and the performance of that model on data from a single user. This definition implies several things. In particular, the extent to which a model can be personalized depends both on the model itself and on the spread of user behavior. On a task in which users always behave the same, there is little room for personalization, as a global model trained on all user data will likely be optimal globally and locally. However, on any task where user behavior varies significantly between individuals, it is possible that a model trained on all users may perform poorly on any specific user. Nonetheless, a specific user may benefit from some global data; for example, a user with less training data may see better performance if they use a model trained with some global data. Therefore, the best personalization strategy will have some ability to incorporate global knowledge, while minimally distorting the predictions for a given user.
In addition, we add the constraint that user specific data be private to a user, and cannot be explicitly shared between models. In particular, this means that even if all user data is drawn from the same distribution, we cannot simply train on all the data. Instead we must determine other ways to share this knowledge, such as federating or ensembling user models.
In this paper, we establish some simple benchmarks for evaluating how well a model respects this definition of personalization.
Formally, suppose we have a number of users, N, and for each user u we have some user-specific data, D_u, and a user-specific model, M_u. Let the global data be D_G = D_1 ∪ … ∪ D_N, and suppose we have a loss function, L, which is a function of both a model and a dataset, L(M, D). We define our success at personalization as:

L_personal(M_u) = λ L(M_u, D_u) + (1 − λ) L(M_u, D_G)    (1)

where λ is between 0 and 1, and determines how much we weight local user versus global data performance. In the case where D_u follows the same distribution as all of D_G, this definition trivially collapses to approximately optimizing L(M_u, D_G), the familiar, non-personal objective function on a dataset. However, as λ increases and D_u diverges from D_G, we introduce a tension between optimizing for the specific user while still not ignoring the whole dataset. Finally, to enforce our definition of privacy, each model, M_u, has access only to D_u and the weights of the other models, M_v for v ≠ u, but does not have access to the other datasets, D_v for v ≠ u.
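To make the weighted objective concrete, here is a minimal sketch of Equation 1 as code. The names (`personalization_loss`, `lam`, `loss_fn`) are our own, and `loss_fn` stands in for any function scoring a model on a dataset:

```python
def personalization_loss(loss_fn, model_u, data_u, data_global, lam):
    """Weighted combination of user-local and global loss (Equation 1).

    lam in [0, 1]: 1.0 scores the model only on the user's own data;
    0.0 reduces to the familiar global objective.
    """
    assert 0.0 <= lam <= 1.0
    return lam * loss_fn(model_u, data_u) + (1.0 - lam) * loss_fn(model_u, data_global)
```

Any model representation works as long as `loss_fn` can evaluate it; the definition itself places no constraint on the model class.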
4. Personalization Motivation And Implications
One question might be why we bother at all with adding global data to the equation, since it is more intuitive to think about personalization as just using the model that does best on a single user’s data, and that data alone. However, that intuition ignores the fact that we may have only observed a small amount of behavior from any given user. If we only fit optimally to a specific user’s data, we risk overfitting and performing poorly on new data even from that same user.
A Bayesian interpretation of our definition is to view the global data term as representing our prior belief about user behavior. Another interpretation is to view the weighting between the two terms as how much catastrophic forgetting we will allow our model to do in order to personalize to a user.
From the Bayesian perspective, the global data serves as a type of regularization that penalizes the local model for moving too far away from prior user data in order to fit a new user. We can think about λ as a hyperparameter representing the strength of our prior belief. The smaller λ is, the less we allow the model to deviate from the global state. There may be no perfect rule for choosing λ, as it may depend on the task and the rate at which we want to adapt to the user.
One strategy could be to slowly increase λ for a given user as we observe more data from them. With this strategy, data-rich users will have large λ and data-poor users will have small λ. Thus data-rich users will be penalized less for moving further away from the global state. This is close to treating our loss as the maximum a posteriori estimate of the user’s data distribution as we observe more data. The rate of changing λ could be chosen so as to minimize the loss of our approach on some held-out user data, following the normal cross-validation strategy for choosing hyperparameters. Alternatively, we may have domain-specific intuition on how much personalization matters, and λ provides an easy way to express this.
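As an illustration of such a schedule, one possible form (the saturating shape and the `halfway_point` and `lam_max` parameters are our own choices, not specified in the text) might be:

```python
def lambda_schedule(n_user_examples, halfway_point=100, lam_max=0.9):
    """Smoothly increase the personalization weight lambda as we observe
    more data from a user.

    halfway_point: number of examples at which lambda reaches lam_max / 2.
    Data-poor users get a small lambda (staying close to the global model),
    while data-rich users approach lam_max.
    """
    return lam_max * n_user_examples / (n_user_examples + halfway_point)
```

Both `halfway_point` and `lam_max` would be tuned by cross-validation on held-out user data, as described above.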
From the catastrophic forgetting perspective, our definition is similar to work on elastic weight consolidation, which penalizes weights for moving away from the values they had on a previous task. That work upweights the penalty for weights that have a high average gradient on the previous task, reasoning that such weights are likely to be most important. We directly penalize the loss of accuracy on other users, rather than indirectly penalizing that change, as the gradient-based approach does. The indirect approach has the benefits of being scalable, as it may be expensive to recalculate the total global loss, and of potentially adapting to unlabeled data. Still, we see a common motivation: in both cases, we have some weighting for how much we want to allow our model to change in response to new examples.
To calculate the global portion of our loss, we do not need to gather the data in a central location (which would violate our privacy constraint). It is enough to share a user’s model with each other user, or some sampling of other users, and gather summary statistics of how well the model performs. We can then aggregate these summary statistics to evaluate how well that model does on the global data. However, sharing a user’s model with other users still compromises the original user’s privacy, since model weights potentially offer insight into user behavior.
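A sketch of this summary-statistic aggregation, under the assumption that each user evaluates the shared model locally and reports back only a mean loss and an example count (all names here are hypothetical):

```python
def decentralized_global_loss(local_reports):
    """Estimate a model's global loss without centralizing any data.

    local_reports: a list of (mean_loss, n_examples) pairs, one per user,
    each computed on that user's private data. Only these two summary
    statistics, never the raw examples, leave a device.
    """
    total_examples = sum(n for _, n in local_reports)
    # Example-weighted average reproduces the loss over the union of datasets.
    return sum(loss * n for loss, n in local_reports) / total_examples
```

The example-weighted average equals the loss the model would incur on the pooled global dataset, which is what the global term in Equation 1 asks for.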
In practice, we often have a subset of user data that we can centralize, from users who have opted in, or a large public curated dataset that is relevant to our task of interest. We can treat such a dataset as a stand-in for how users generally behave. This approach does not compromise user privacy. Alternately, since our global loss term is meant to regularize and stabilize our local models, there may be other approaches that achieve this global objective without directly measuring performance on global data. In future work, we will more deeply explore how best to measure this global loss without violating user privacy.
We run an experiment with a simple model to demonstrate the trade-offs between personal and global performance, and how the choice of this weighting might affect the way we make future user predictions.
5.1. Setup and Data
We use the Stanford Sentiment Treebank (SSTB) dataset and evaluate how well we can learn models for sentiment. As a first step, we take the 200 most positive and 200 most negative words in the dataset, which we find by training a simple logistic regression model on the train set. We then run experiments simulating the existence of 2, 5, or 8 users. In each experiment, these words are randomly partitioned amongst users, and users are assigned sentences containing those words for their validation, train, and test sets. Sentences that contain none of these top 400 words are randomly assigned, and sentences that contain words from multiple users are randomly assigned to one of the relevant users. This results in a split of the dataset in which each model has a subset of words that are significantly enriched for it, but very underrepresented for all other models.
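A sketch of this partitioning scheme, with hypothetical names and simplified tie-breaking:

```python
import random

def partition_sentences(sentences, user_word_sets, seed=0):
    """Assign each sentence to one simulated user.

    sentences: list of token lists. user_word_sets: one set of 'biased'
    words per user (e.g. slices of the 400 most polar words). A sentence
    goes to a user whose biased words it contains; sentences matching
    several users, or none, are assigned randomly, as in the setup above.
    """
    rng = random.Random(seed)
    assignments = [[] for _ in user_word_sets]
    for sent in sentences:
        matches = [i for i, words in enumerate(user_word_sets)
                   if words & set(sent)]
        user = rng.choice(matches) if matches else rng.randrange(len(user_word_sets))
        assignments[user].append(sent)
    return assignments
```

Applying the same routine independently to the train, validation, and test splits yields the per-user datasets used in the experiments.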
This split is meant to simulate a pathological case of user style; we try to simulate users in our train set that are very biased and almost non-overlapping in terms of the word choice they use to express sentiment. While this may not be the case for this specific review dataset, in general there will be natural language tasks in which users have specific slang, inside jokes, or acronyms that they use that may not be used by others. For such users, an ideal setup would adapt to their personal slang, while still leveraging global models to help understand the more common language they use.
For each user we train a completely separate model with the same architecture. Roughly following the baseline from the original SSTB paper, we classify sentences using a simple two-layer neural network, with an average of word embeddings as input and a tanh non-linearity. We use 35-dimensional word embeddings with dropout, and use ADAM to optimize. We start with an initial learning rate, which we slowly decay if the validation accuracy has not improved after a fixed number of batches, the same across all experiments. Finally, we use early stopping in the case that validation accuracy does not improve after a fixed number of batches, equivalent to 5 epochs on the full train set. Once trained, we evaluate two ways of combining our fixed models: averaging model predictions, and simply taking the prediction of the most confident model, where confidence is defined as the absolute difference between 0.5 and the model’s prediction.
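The two aggregation strategies can be sketched as follows, assuming each fixed model outputs a probability of positive sentiment (function names are our own):

```python
def average_aggregate(predictions):
    """Ensemble by averaging the per-model probabilities of positive sentiment."""
    return sum(predictions) / len(predictions)

def confidence_aggregate(predictions):
    """Ensemble by keeping the output of the single most confident model,
    where confidence is the absolute distance of a model's probability
    from the 0.5 decision boundary."""
    return max(predictions, key=lambda p: abs(p - 0.5))
```

Either aggregate probability is then thresholded at 0.5 to produce the final sentiment label.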
5.3. Evaluation Metrics
To evaluate, we use the train, validation, and test splits as provided with the dataset, and use pytreebank to parse the data. We evaluate only on the sentence-level data for the test and validation sets. This model is not state of the art. However, we have experience putting models of similar size on low-powered and memory-constrained devices, and believe this model could realistically be deployed. Nevertheless, the model is sufficiently complicated to give us a sense of what happens as we try to combine separately trained models. In all tables, we report accuracy, and test accuracy for user-specific data is evaluated solely on sentences that contain only that user’s words and none of the other users’ specific words. Global data scores represent the whole test set. We report all results averaged over 15 independent trials.
6. Results and Analysis
6.1. Single User Performance on User-Specific Data Vs. Single User Performance on Global Data
Unsurprisingly, as the second and third columns of Table 1 show, single user models perform much better on their own heavily biased user-specific test set than on the global data. This makes sense as each model has purposely been trained on more words from their biased test set. Those words were also specifically selected to be polarizing, but the gap makes concrete the extent to which varying word usages can hurt model performance on this task.
|Num. Users||Single user model (user-specific dataset)||Single user model (global dataset)||Average aggregation (global dataset)||Confidence aggregation (global dataset)|
6.2. Single User Performance on User-Specific Data Vs. Ensembled Models on User-Specific Data
As the number of users increases, the single user model outperforms both aggregation methods on user-specific data (Table 2). This is particularly pronounced for the confidence aggregation method: ensembling hurts performance across all experiments, with this effect increasing as we add users. As the number of users increases, for any given prediction we are less likely to choose the specific user’s model, which performs best on their own dataset. The averaging aggregation method outperforms the confidence aggregation method and is competitive with the single user model for up to five users. However, for more than five users, the averaging approach starts to perform worse on the user’s own data, again suggesting that we start to drown out much of the personal judgment and rely on global knowledge.
|Num. Users||Difference (Average Aggregation)||Difference (Confidence Aggregation)|
6.3. User Performance on Global Data Vs. Ensembled Models on Global Data
While it might be easy to conclude that we should just use a single user model, Table 3 demonstrates that the average-aggregated ensembled models outperform the single user model on global data, particularly as the number of users increases. Again, this is what we would expect, since the aggregated models have collectively been trained on more words in more examples, and ought to generalize better to unbiased and unseen data. This global knowledge is still important, as it may contain insights about phrases a user has only used a few times. This may be especially true for a user who has little data. Recall we divide the whole dataset amongst all users, so as the number of users increases, each user-specific model is trained on less data. In this case the lack of a word in the training set may not indicate that a user will never use that word. It may be that the user has not interacted with the system enough for their individual model to have fully learned their language.
|Num. Users||Difference (Average Aggregation)||Difference (Confidence Aggregation)|
6.4. Choosing an Approach Based on λ
These experiments demonstrate the tension between performing well on global data and on user data, the two terms in our loss in Equation 1. We can apply Equation 1, vary the weighting λ between its two terms, and see at what point we should prefer different strategies.
Specifically, suppose we have two approaches we can choose from, with personalized losses of L_p1 and L_p2, and global losses of L_g1 and L_g2, respectively. If

λ L_p1 + (1 − λ) L_g1 < λ L_p2 + (1 − λ) L_g2,

the first approach is superior. We can solve for the λ* such that the two sides are equal, where our loss again comes from Equation 1. Plugging our definition in, we see that

λ* L_p1 + (1 − λ*) L_g1 = λ* L_p2 + (1 − λ*) L_g2.

Rearranging this yields

λ* = (L_g2 − L_g1) / ((L_p1 − L_p2) + (L_g2 − L_g1))

as our break-even personalization point. For this value of λ, we ought to see our two models as equally valid solutions to the problem of personalization. The combined loss of each approach is linear with respect to λ, so if one approach has the lower loss for any λ above our cutoff, it will have the lower loss everywhere above the cutoff, and vice versa. This yields a rule for deciding between multiple types of models: it only requires choosing a single hyperparameter between 0 and 1, representing one’s belief about how much personalization matters to the task at hand.
So, our cutoff value of λ, where the single model and averaged models yield equivalent losses and we are indifferent between them, follows directly from this formula. We can also determine the ranges of λ in which we should prefer each model. Because, as explained above, the loss is linear in λ, evaluating a single point above the cutoff suffices. Doing so, we find that we prefer the single model for values of λ above the cutoff, and the averaged model for values of λ below the cutoff.
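This decision rule is easy to implement. A sketch, with hypothetical names, assuming approach 1 is the one with the lower personal loss (and typically the higher global loss, so that the two loss lines cross inside the unit interval):

```python
def break_even_lambda(l_p1, l_g1, l_p2, l_g2):
    """Value of lambda at which two approaches have equal combined loss
    under Equation 1:

        lambda* = (l_g2 - l_g1) / ((l_p1 - l_p2) + (l_g2 - l_g1))
    """
    return (l_g2 - l_g1) / ((l_p1 - l_p2) + (l_g2 - l_g1))

def preferred(l_p1, l_g1, l_p2, l_g2, lam):
    """Return 1 or 2: whichever approach has the lower combined loss
    at this value of lambda."""
    loss1 = lam * l_p1 + (1 - lam) * l_g1
    loss2 = lam * l_p2 + (1 - lam) * l_g2
    return 1 if loss1 < loss2 else 2
```

For example, with illustrative losses of 0.2 (personal) and 0.5 (global) for the single user model, and 0.4 and 0.3 for the averaged ensemble, the break-even point is λ* = 0.5, with the single model preferred above it and the ensemble below.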
Our definition of personalization allows for a complete decoupling of models at train time, while requiring only aggregate knowledge of other models’ inference in order to potentially benefit from global knowledge. In addition, it gives a practitioner a simple, one-parameter way of deciding how to choose amongst models that may have different strengths and weaknesses. Further, we have shown how this approach might look on a simplified dataset and model, and why the naïve approach of using a single model, or always aggregating all models, may sometimes not be optimal.
In the future, we will work to develop better methods for combining this aggregate global knowledge without hurting user performance. To better protect user privacy, we will also consider alternate methods for regularizing our models outside of the global loss term. We hope that this work will provide a useful framing for future work on personalization and on learning in decentralized architectures, such as Ethereum and Bitcoin, and serve as a guideline for situations in which the normal single-loss, centralized-server training paradigm cannot be used.
-  Martín Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang, Deep learning with differential privacy, Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, ACM, 2016, pp. 308–318.
-  Rami Al-Rfou, Marc Pickett, Javier Snaider, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil, Conversational contextual cues: The case of personalization and history for response ranking, arXiv preprint arXiv:1606.00372 (2016).
-  Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars, Memory aware synapses: Learning what (not) to forget, arXiv preprint arXiv:1711.09601 (2017).
-  Zheqian Chen, Ben Gao, Huimin Zhang, Zhou Zhao, Haifeng Liu, and Deng Cai, User personalized satisfaction prediction via multiple instance deep learning, Proceedings of the 26th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2017, pp. 907–915.
-  Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al., Wide & deep learning for recommender systems, Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, ACM, 2016, pp. 7–10.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, Imagenet: A large-scale hierarchical image database, 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
-  Cynthia Dwork, Differential privacy: A survey of results, International Conference on Theory and Applications of Models of Computation, Springer, 2008, pp. 1–19.
-  Kai Fan, Allison E Aiello, and Katherine A Heller, Bayesian models for heterogeneous personalized health data, arXiv preprint arXiv:1509.00110 (2015).
-  Robert M French, Catastrophic forgetting in connectionist networks, Trends in cognitive sciences 3 (1999), no. 4, 128–135.
-  Ronald Kemker, Angelina Abitino, Marc McClure, and Christopher Kanan, Measuring catastrophic forgetting in neural networks, arXiv preprint arXiv:1708.02072 (2017).
-  Diederik P Kingma and Jimmy Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
-  James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences (2017), 201611835.
-  Jakub Konečnỳ, H Brendan McMahan, Daniel Ramage, and Peter Richtárik, Federated optimization: distributed machine learning for on-device intelligence, arXiv preprint arXiv:1610.02527 (2016).
-  Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon, Federated learning: Strategies for improving communication efficiency, arXiv preprint arXiv:1610.05492 (2016).
-  Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, Microsoft coco: Common objects in context, European conference on computer vision, Springer, 2014, pp. 740–755.
-  Bernd Malle, Nicola Giuliani, Peter Kieseberg, and Andreas Holzinger, The more the merrier-federated learning from local sphere recommendations, International Cross-Domain Conference for Machine Learning and Knowledge Extraction, Springer, 2017, pp. 367–373.
-  H Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Aguera y Arcas, Federated learning of deep networks using model averaging, arXiv preprint arXiv:1602.05629 (2016).
-  Satoshi Nakamoto, Bitcoin: A peer-to-peer electronic cash system, 2008.
-  State of California Department of Justice Office of the Attorney General, Privacy laws, https://oag.ca.gov/privacy/privacy-laws. Accessed: 2018-01-22.
-  Cesc Chunseong Park, Byeongchang Kim, and Gunhee Kim, Attend to you: Personalized image captioning with context sequence memory networks, arXiv preprint arXiv:1704.06485 (2017).
-  The European Parliament and the Council Of The European Union, Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016, http://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679&from=en, 2016. Accessed: 2018-01-23.
-  Jonathan Raiman, Stanford sentiment treebank loader in python, https://github.com/JonathanRaiman/pytreebank. Accessed: 2018-01-05.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (2015), no. 3, 211–252.
-  Jiaxin Shi, Jianfei Chen, Jun Zhu, Shengyang Sun, Yucen Luo, Yihong Gu, and Yuhao Zhou, Zhusuan: A library for bayesian deep learning, arXiv preprint arXiv:1709.05870 (2017).
-  Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet Talwalkar, Federated multi-task learning, arXiv preprint arXiv:1705.10467 (2017).
-  Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts, Recursive deep models for semantic compositionality over a sentiment treebank, Proceedings of the 2013 conference on empirical methods in natural language processing, 2013, pp. 1631–1642.
-  Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting., Journal of machine learning research 15 (2014), no. 1, 1929–1958.
-  Gavin Wood, Ethereum: A secure decentralised generalised transaction ledger, Ethereum Project Yellow Paper 151 (2014).
-  Yi Zhang and Jonathan Koren, Efficient bayesian hierarchical user modeling for recommendation system, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, 2007, pp. 47–54.
-  Philip Zigoris and Yi Zhang, Bayesian adaptive user profiling with explicit & implicit feedback, Proceedings of the 15th ACM international conference on Information and knowledge management, ACM, 2006, pp. 397–404.