1 Introduction
Discovering the heavy hitters in a population of user-generated data streams plays an instrumental role in improving many mobile and web applications. For example, learning popular out-of-dictionary words can improve the auto-complete feature in a smart keyboard, and discovering frequently-taken actions can provide an improved in-app user experience. Naively, a service provider can learn the popular elements by first collecting user data and then applying state-of-the-art heavy-hitter discovery algorithms (Cormode et al., 2003; Cormode & Hadjieleftheriou, 2008; Charikar et al., 2002). However, this approach requires that users trust the service provider with their raw data. Moreover, even with a fully trusted service provider, the publication of the learned heavy hitters might still expose a user's private information. Our work builds on recent advances in federated learning (McMahan & Ramage, 2017; Konečnỳ et al., 2016; McMahan et al., 2017) and differential privacy (Dwork et al., 2006b, a; Dwork, 2008; Dwork et al., 2010; Dwork & Roth, 2014) to address these concerns.
Federated learning (FL) (Konečnỳ et al., 2016; McMahan et al., 2017) is a collaborative learning approach that enables a service provider to learn a prediction model without collecting user data (i.e., while keeping the training data on user devices). The training phase of FL is interactive and executes in multiple rounds. In each round, a randomly chosen small set of online users download the latest model and improve it locally using their training data. Only the updates are then sent back to the service provider where they are aggregated and used to update the global model. Unlike existing FL algorithms where the goal is to learn a prediction model, our work introduces a new federated approach that allows a service provider to discover the heavy hitters. Our algorithm retains much of the essential privacy ingredients of FL: (a) no full data collection (only partial sketches from a random subset of online users are sent back to the service provider), (b) massive decentralization across a large population of users, (c) interactivity in building an aggregate understanding of the population.
When applicable, FL has several privacy advantages. First, it does not expose users’ data directly to the server, because only model updates are uploaded. Further, different sets of users are selected in each round, so the contribution of a specific user to the global model is not likely to be high. This reduces the risk of personal information disclosure. Finally, cryptographic secure aggregation can be used to ensure that the service provider only sees the aggregate update, as opposed to individual user updates (Bonawitz et al., 2016). Despite these advantages, without further assumptions we cannot formally rule out the possibility of one participant (or many colluding participants) in FL discovering sensitive information about another. Differential privacy (DP), a rigorous privacy notion that has been carefully studied over the last decade (Dwork et al., 2006b, a; Dwork, 2008; Dwork & Roth, 2014) and widely adopted in industry (Ding et al., 2017; Apple, 2017; Kenthapadi & Tran, 2018; Erlingsson et al., 2014), provides the ability to make such strong formal privacy guarantees.
Contributions.
We develop a federated heavy-hitter discovery algorithm that achieves DP without requiring data collection. In contrast to classical distribution estimation problems, our goal is to discover the heavy hitters but not their frequencies.[1] For example, in a smart mobile keyboard application, our algorithm allows a service provider to discover out-of-dictionary words and add them to the keyboard's dictionary, allowing these words to be automatically spell-corrected and typed using gesture typing. We assume, without loss of generality,[2] that items in user-generated data streams have a sequential structure. Thus, we refer to items as sequences and critically leverage their sequential structure to build our interactive protocol. Our contributions are as follows.
[1] Observe that once the popular items are discovered, learning their frequencies can be done using off-the-shelf DP techniques.
[2] This is true because regardless of the items' data type, they can always be represented by a sequence of bits.


We introduce a novel interactive algorithm for federated heavy hitters discovery. Our algorithm runs in multiple rounds. In each round, a few active users are randomly selected and asked to transmit prefixes of one of the sequences they own. The server then aggregates the received prefixes using a trie structure, prunes nodes that have counts that fall below a chosen threshold, and continues to the next round.

We prove that our algorithm is inherently differentially private, and show how the parameters of the algorithm can be chosen to obtain precise (ε, δ) privacy guarantees (see Theorem 1). For example, when the number of users is large and the sequences have length at most 10, our algorithm guarantees strong differential privacy. See Table 1 for the DP parameters we can provide for various population sizes and maximum sequence lengths.

We systematically examine the privacy-utility trade-off using worst-case theoretical analyses and experimentation on Sentiment140, a Twitter dataset with 1.6M tweets and over 650k users (Go et al., 2009). For Sentiment140, the top 250 words are recalled at a rate close to 1 with a single-digit ε.
Related work. On the FL front, much of the existing work is in the context of learning prediction models (Konečnỳ et al., 2016; McMahan et al., 2017; McMahan & Ramage, 2017), and some recent works combine FL with DP (Geyer et al., 2017; McMahan et al., 2018). Our work differs in that it focuses on federated algorithms for the discovery of heavy hitters.
On the DP front, there is a rich body of work on distribution learning, frequent sequence mining, and heavy-hitter estimation, both in the central and local models of DP (Diakonikolas et al., 2015; Bonomi & Xiong, 2013; Xu et al., 2016; Zhou & Lin, 2018; Wang et al., 2017; Acharya et al., 2018; Bassily et al., 2017; Kairouz et al., 2016; Ye & Barg, 2018; Avent et al., 2017; Bun et al., 2018). The local model of DP, introduced by Kasiviswanathan et al., assumes that the users do not have any trust in the service provider (Kasiviswanathan et al., 2011). Thus, local DP offers very strong privacy guarantees, but often leads to a significant reduction in utility (Kairouz et al., 2014; Wang et al., 2017; Bassily et al., 2017; Kairouz et al., 2016; Ye & Barg, 2018; Duchi et al., 2013). The utility loss is not as severe in the central model of DP (Dwork et al., 2006b, a; Dwork, 2008; Dwork & Roth, 2014), where the service provider has access to the entire dataset. However, central DP has a major weakness: it assumes that the participating users have full trust in the service provider. Our work bridges these existing lines of work in that it allows a semi-trusted service provider to learn the popular sequences in a centrally differentially private way, without having full access to user data.
Methods that provide DP typically involve adding noise, such as Gaussian noise, to the data before releasing it. In this work, we show that DP can be obtained without the addition of any noise by relying exclusively on random sampling and trie pruning, which together provide anonymity. The connection between DP, sampling, and anonymity has previously appeared in the work of Li et al. (2012), where the authors analyze the DP properties of strongly-safe anonymization algorithms. Since we use a threshold to filter the votes from randomly selected users in each round, our algorithm is also strongly-safe anonymous. However, our work differs from theirs in a number of crucial ways. First, we focus on learning popular sequences in a federated and privacy-preserving way, while they investigate generic connections between strongly-safe anonymous algorithms and DP in the centralized dataset model. Second, we exploit the structure of our application to provide improved and explicit privacy guarantees, and a more precise privacy-utility trade-off analysis. Third, due to the distributed nature of our proposed algorithm, our interactive random sampling method is different from their fixed-rate sampling strategy.
Our algorithm exploits the hierarchical structure of user-generated data streams to interactively maintain a trie structure that contains the frequent sequences. The idea of using trie-like structures for finding frequent sequences in data streams has been explored before in (Cormode et al., 2003; Bassily et al., 2017). However, the work of Cormode et al. (2003) focuses on finding frequent sequences in a single data stream and does not address the distributed nature or associated privacy risks. The TreeHist algorithm of Bassily et al. (2017) is non-interactive and achieves local DP, which leads to a significantly reduced utility.
2 Preliminaries and Organization
Model and notation. We consider a population of n users, where each user has a collection of items. We abuse notation and use the same symbol to refer to both the set of all users and the set of all items. As discussed in the introduction, we assume (without loss of generality) that the items have a sequential structure and refer to them as sequences. More precisely, we express an item as a sequence of smaller units. For example, in our experiments (see Section 5), we focus on discovering heavy-hitter English words in a population of tweets generated by Twitter users. Therefore, each user has a collection of words, and each word can be expressed as a sequence of ASCII characters. We assume that the length of any sequence is at most L.
For any set of users, we build a trie T via a randomized algorithm to obtain an estimate of the heavy hitters. We let w[1:ℓ] denote the prefix of a sequence w of length ℓ. For a trie T and a prefix p, we say that p ∈ T if there exists a corresponding path in T. Also, let T_ℓ denote the subtree of T that contains all nodes and edges from the first ℓ levels of T. Suppose p is a path of length ℓ in T. Growing the trie from T_ℓ to T_{ℓ+1} by "adding prefix p′ to T" means appending a child node to the node at which p ends.
Differential privacy. We follow the standard definition of differential privacy.
Definition 1.
Differential Privacy (Dwork & Roth, 2014). A randomized algorithm A is (ε, δ)-differentially private iff for all measurable sets S of outputs, and for all adjacent datasets D and D′:

Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ.    (1)
We adopt user-level adjacency, where D and D′ are adjacent if D′ can be obtained by removing all the items associated with a single user from D (McMahan et al., 2018). This is stronger than the typically used notion of adjacency, where D and D′ differ by only one item.
Paper organization. To warm up, we focus, in Section 3, on a simpler setting where each user has a single sequence. We present the basic version of our algorithm, prove that it is differentially private, and provide worst-case utility guarantees. Combining key insights from Section 3, we handle the more general case of multiple sequences per user in Section 4. We present, in Section 5, extensive simulation results on the Sentiment140 Twitter dataset of 1.6M tweets (Go et al., 2009). We conclude our paper with a few interesting and non-trivial extensions in Section 6. All proofs and additional experiments are deferred to the supplementary material.
3 Single Sequence per User
In this section, we consider a simple setting where each user has a single sequence. Much of the intuition behind the algorithm and privacy guarantees we present in this section carries over to the more realistic setting of multiple sequences per user.
Figure 1 shows an example of how our algorithm works in the context of learning heavy-hitter English words. Suppose each user in the population has a single word, and there are three popular words: "star" (on 3 devices), "sun" (on 4 devices), and "moon" (on 4 devices). The rest of the words appear once each. We add a "$" to the end of each word as an "end of word" symbol. In each round, the service provider selects 10 random users, asks them to vote for a prefix of their word, and stores in a trie the prefixes that receive at least 2 votes. In the above example, two prefixes of length 1, "s" and "m", grow on the trie after the first round. This means that among the 10 randomly selected users, at least two of them voted for "s" and at least another two voted for "m". Observe that users who have "sun" and "star" share the first character "s", so "s" has a significant chance of being added to the trie. In the second round, 10 users are randomly selected and provided with the depth-1 trie learned so far (containing "s" and "m"). In this round, a selected user votes for the length-2 prefix of their word only if it starts with an "s" or "m". The service provider then aggregates the received votes and adds a prefix to the trie if it receives at least 2 votes. In this particular example, the prefixes "st", "su", and "mo" are learned after the second round. This process is repeated for prefixes of length 3 and 4 in the third and fourth rounds, respectively. After the fourth round, the word "sun$" is completely learned, but the prefix "sta" stops growing. This is because at least two of the three users holding "star" were selected in the second and third rounds, but fewer than two were chosen in the fourth one. The word "moon$" is completely learned in the fifth round. Finally, the algorithm terminates in the sixth round, and the completely learned words are "sun$" and "moon$".
To describe the algorithm formally: for a set of n users, our algorithm runs in multiple rounds and returns a trie that contains the popular sequences in the set. In each round of the algorithm, a batch of m users (with m ≤ n) is selected uniformly at random. Note that m could be chosen in different ways, and there are interesting trade-offs between utility and privacy as a function of m. We will discuss these trade-offs later in this section.
In round ℓ, the m randomly selected users receive the trie containing the popular prefixes that have been learned so far. If a user's sequence has a length-(ℓ−1) prefix that is in the trie, they declare the length-ℓ prefix of their sequence. Otherwise, they do nothing. Prefixes that are declared by at least θ of the selected users grow on the ℓ-th level of the trie. Note that we grow at most one level of the trie in each round of the algorithm. Thus, if a prefix is not in the trie, none of its extensions can be added in later rounds. The final output is the trie returned by the algorithm when it stops growing. Algorithm 1 describes our distributed algorithm, and Algorithm 2 shows a single round of the algorithm that grows one level of the trie.
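A minimal sketch of a single round and the outer loop, assuming each user holds a single string terminated by the end-of-word symbol "$"; the trie is represented as a set of learned prefixes, and the names m (batch size) and theta (threshold) are illustrative, not the paper's pseudocode:

```python
import random

def grow_one_level(users, trie, level, m, theta):
    """One voting round: sample m users, collect length-`level` prefixes
    whose length-(level-1) parent is already in the trie, and keep those
    with at least theta votes."""
    votes = {}
    for user_word in random.sample(users, m):
        if len(user_word) < level:
            continue  # word too short to vote at this level
        prefix = user_word[:level]
        # a user votes only if the parent prefix was already learned
        if level == 1 or prefix[:-1] in trie:
            votes[prefix] = votes.get(prefix, 0) + 1
    grown = False
    for prefix, count in votes.items():
        if count >= theta:
            trie.add(prefix)
            grown = True
    return grown

def federated_heavy_hitters(users, m, theta):
    """Run rounds until the trie stops growing (Algorithm 1)."""
    trie, level = set(), 1
    while True:
        if not grow_one_level(users, trie, level, m, theta):
            return trie
        level += 1
```

On the star/sun/moon example above, with a large enough batch size, the trie converges to the "$"-terminated popular words after L + 1 rounds.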
Given the final trie, we can easily extract the heavy-hitter sequences learned by the algorithm. Suppose each valid sequence has an "end of sequence" (EOS) symbol. Then by extracting all the sequences from the root to leaves containing the EOS symbol, we get the set of sequences learned by Algorithm 1. Note that the non-EOS leaves also represent frequent prefixes in the population, which might still be valuable depending on the application.
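A sketch of this extraction step, assuming the trie is represented as a set of learned prefix strings (an illustrative representation, not the paper's):

```python
def extract_heavy_hitters(trie, eos="$"):
    """Return the fully learned sequences: learned prefixes that end
    with the end-of-sequence symbol."""
    return {p for p in trie if p.endswith(eos)}

def frequent_prefixes(trie, eos="$"):
    """Non-EOS leaves: learned prefixes that stopped growing, which may
    still be valuable depending on the application."""
    leaves = {p for p in trie
              if not any(q != p and q.startswith(p) for q in trie)}
    return {p for p in leaves if not p.endswith(eos)}
```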
Privacy guarantees. We now show that Algorithm 1 is inherently differentially private – without the need for any additional randomization or noise addition.
Theorem 1.
For suitable choices of the population size n, batch size m, threshold θ, and maximum sequence length L, Algorithm 1 is (ε, δ)-differentially private, with ε and δ determined by these parameters.
Proof sketch. Consider adjacent datasets D and D′, where D′ contains the items of one additional user. Fix a prefix p, and assume that k users in D hold p. We show that when k is large, the ratio between the probability that p appears in the output under D′ and under D is small, so it can be bounded by e^ε. When k is small, the probability that p appears in the output at all is small enough that it can be bounded by δ. Intuitively, when k is large, the prefix p is already popular in D, so the fact that D′ has one more user with this prefix does not affect the probability of it appearing in the result by much. When k is small, the chance of p appearing in the result is very small, even with an extra user holding it in D′. ∎

The above result is general and holds for a wide array of algorithm parameters (m, θ, and L). The following corollary shows how strong privacy guarantees can be obtained by tuning the algorithm's parameters.
Corollary 1.
The requirement of having a large minimum number of participants is standard in the federated learning setting. In fact, McMahan et al. (2018) achieve differential privacy with a comparably large population of users.
10  0.105  1.05  
12  0.012  0.12  
14  0.0014  0.014  
16  0.00016  0.0016 
Privacy vs. utility trade-off. By the sampling nature of Algorithm 1, sequences that appear more frequently are more likely to be learned. The batch size m and threshold θ can be tuned to trade off utility for privacy. For a user set of size n, a smaller m and a larger θ achieve better privacy at the expense of lower utility, and vice versa. We now discuss the most important privacy-utility trade-offs and defer the rest to the supplementary material. We note that all the plots we provide in this section represent worst-case lower bounds on utility that apply regardless of the underlying data statistics. We show, in Section 5, that the performance of Algorithm 1 is much better on real data.
To quantify utility under Algorithm 1, we examine the discovery rate of a sequence (the probability of discovering it) as a function of its frequency in the dataset. In particular, we consider the worst-case discovery rate, which captures the probability of discovering a sequence assuming that it shares no prefixes with other sequences in the dataset. In the presence of shared prefixes, the discovery rate only gets better (see Section 5 for experiments on real data).
Proposition 1.
Suppose a sequence appears k times in a dataset of n users, where the longest sequence has length L. Then the worst-case discovery rate under Algorithm 1 is given by

P_discovery(k) = [ Σ_{i=θ}^{min(m,k)} C(k, i) · C(n−k, m−i) / C(n, m) ]^L,    (2)

where C(·,·) denotes the binomial coefficient.
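Equation (2) can be evaluated exactly. The sketch below assumes the worst case in which a sequence held by k of the n users must receive at least θ votes among the m sampled users in every one of the L rounds (a hypergeometric tail raised to the power L):

```python
from math import comb

def worst_case_discovery_rate(n, k, m, theta, L):
    """Probability that a sequence held by k of n users survives all L
    rounds, when each round samples m users without replacement and
    requires at least theta votes (hypergeometric tail, raised to L)."""
    total = comb(n, m)
    tail = sum(comb(k, i) * comb(n - k, m - i)
               for i in range(theta, min(m, k) + 1)) / total
    return tail ** L
```

Note that `math.comb` returns 0 when the lower index exceeds the upper one, which handles the boundary terms of the sum.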
We explore the trade-off between utility and privacy by varying the batch size m. Observe that it is also reasonable to explore other scalings of m with the population size n. However, we fix the remaining parameters, vary m, and calculate ε by Theorem 1. Note that δ is guaranteed to be small irrespective of the choice of m.
We now study how much utility we can get if we wish to target a fixed privacy level ε and a bound on δ. Given these constraints, our goal is to tune the algorithm's parameters to get as much utility as possible. The following corollary shows how to choose the batch size m to guarantee the target privacy level.
Corollary 2.
For a fixed privacy budget ε and a sufficiently large population n, the batch size m can be chosen as a function of n, θ, and L so that Algorithm 1 achieves (ε, δ)-differential privacy.
To get as much utility as possible for a fixed ε, we choose the largest batch size m permitted by Corollary 2. If no valid choice of m exists, our method cannot guarantee the desired (ε, δ)-differential privacy. The worst-case discovery rate can then be calculated by Equation (2). Figure 2 shows the utility for different sequence frequencies and different privacy budgets ε. We also provide results for a larger population in the supplementary material; for a larger n with the same fixed ε, the utility is greatly improved.
Suppose our goal is to learn the sequences that have a frequency greater than 1% in the population with a fixed privacy budget ε, and we want to know how large the population should be to discover these sequences with high probability. Again, using Equation (2) and Corollary 2, we can calculate the worst-case discovery rate for different population sizes. Figure 3 shows how the discovery rate changes as the population size increases, for fixed sequence frequencies. Not surprisingly, lower-frequency sequences and smaller budgets require a larger population.
Finally, we study how large the population should be if we want to learn sequences of a given frequency with high probability for a fixed ε. Figure 4 shows the relationship between sequence frequency and population size if we want the discovery rate to be at least 0.9 for different values of ε. The figure shows that, in order to be discovered with high probability, lower-frequency sequences require a larger population size, and vice versa. We also need larger populations for stronger privacy guarantees (smaller ε).
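The population-size question behind Figures 3 and 4 can be explored numerically. The sketch below assumes the worst-case rate of Proposition 1 (at least θ of the k = f·n holders sampled among m users in each of L rounds) and searches a doubling grid of population sizes; the batch-size rule `m_of_n` is a caller-supplied assumption, not a prescription from the paper:

```python
from math import comb

def hypergeom_tail(n, k, m, theta):
    # P[at least theta of the k holders are among m sampled users]
    return sum(comb(k, i) * comb(n - k, m - i)
               for i in range(theta, min(m, k) + 1)) / comb(n, m)

def min_population(freq, m_of_n, theta, L, target=0.9, n_max=2_000_000):
    """Smallest n on a doubling grid for which a sequence of frequency
    `freq` has worst-case discovery rate >= target; None if not found."""
    n = 1000
    while n <= n_max:
        k, m = int(freq * n), m_of_n(n)
        if k >= theta and hypergeom_tail(n, k, m, theta) ** L >= target:
            return n
        n *= 2
    return None
```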
4 Multiple Sequences per User
In this section, we consider the more general setting where each user could have more than one sequence on their device. Suppose the population is a set of n users, and each user has a set of sequences. Each sequence has a certain number of appearances on the user's device, and if a sequence does not appear on any user's device, we say that it appears 0 times.
Let c_i(w) denote the number of appearances of sequence w on user i's device. We define the local frequency of w on user i's device as f_i(w) = c_i(w) / Σ_{w′} c_i(w′). Note that the sum of all the sequences' local frequencies on user i's device is 1, i.e., Σ_w f_i(w) = 1. If a sequence has 0 appearances on user i's device, then f_i(w) = 0. Similarly, for a certain prefix p, let c_i(p) denote the number of appearances of p on user i's device. Then the local frequency of p on user i's device is f_i(p) = c_i(p) / Σ_{w′} c_i(w′).
We are now ready to generalize Algorithm 1 to accommodate multiple sequences per user. In each round of the algorithm, we select a batch of m users uniformly at random. A chosen user i then randomly selects a sequence w with probability f_i(w), i.e., according to its local frequency. Thus, as in Algorithm 1, we still collect m sequences from m users in every round. The voting step then proceeds in the same way as described in Algorithm 2. Algorithm 3 shows the full algorithm.
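The per-user sampling step can be sketched as follows; `random.choices` with weights implements selection by local frequency, and the `counts` dictionary (sequence → number of appearances on the device) is an assumed local data structure:

```python
import random

def sample_local_sequence(counts):
    """Pick one of the user's sequences with probability proportional to
    its number of appearances, i.e., according to its local frequency."""
    seqs = list(counts)
    return random.choices(seqs, weights=[counts[s] for s in seqs], k=1)[0]
```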
Interestingly, the differential privacy bounds we obtained in the single-sequence setting also hold in the multiple-sequence setting under Algorithm 3. This is formally stated in Theorem 2. Intuitively, in the single-sequence-per-user setting, the probability that a prefix is added to the trie equals the probability that at least θ users having this prefix are selected in that round. In the multiple-sequence setting, each chosen user votes for a given sequence only with probability equal to its local frequency, which is at most 1. This means that the single-sequence setting considered in the previous section is the worst case for the differential privacy guarantees. The rigorous proof is deferred to the supplementary material.
Theorem 2.
Under the same conditions on the population size n and the same choices of m, θ, and L as in Theorem 1, Algorithm 3 is (ε, δ)-differentially private with the same parameters.
5 Experiments
In this section, we apply Algorithm 1 and Algorithm 3 to Sentiment140, a Twitter dataset which contains 1.6M tweets from 658,769 users (Go et al., 2009). We evaluate the performance of Algorithms 1 and 3 via the discovery rate of words with different frequencies and the recall of the top popular words in the whole population.
5.1 Single Word per User
We start by studying the setting where each user has only one word. To simulate this setting using the Sentiment140 dataset, we choose the word with the highest local frequency for each user and assume that the user will only vote for this word if chosen to participate in a round of Algorithm 1. In this case, the frequency of a word w in the whole population is the number of users that have w as their highest-frequency word, divided by the total number of users.
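This simulation setup can be sketched as follows, where `user_counts` (mapping each user to their word-count dictionary) is an illustrative name:

```python
from collections import Counter

def top_word_per_user(user_counts):
    """Keep only each user's highest-local-frequency word."""
    return {u: max(counts, key=counts.get)
            for u, counts in user_counts.items()}

def population_frequencies(top_words):
    """Frequency of a word = fraction of users whose top word it is."""
    n = len(top_words)
    return {w: c / n for w, c in Counter(top_words.values()).items()}
```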
Figure 5 shows the relationship between word frequencies and the discovery rate using Algorithm 1. We limit L to be 10 in our experiments and choose the batch size according to Corollary 2, so that Algorithm 1 achieves (ε, δ)-differential privacy. We use the Monte Carlo method to run Algorithm 1 2000 times for different values of ε. The solid lines show the average discovery rate vs. word frequency, and the shaded areas show the 0.95 confidence interval. The dashed lines represent the theoretical lower bounds on the discovery probability (presented in Section 3). Observe that there is a gap between the experimental results and the theoretical ones. This is because the theoretical bounds assume that a sequence shares no prefixes with others in the dataset, while in Sentiment140 many English words do share prefixes.
We also study the recall of the top-k highest-frequency words in the population. The recall is calculated as follows. We first run Algorithm 1 once on the population. Among all the words discovered by Algorithm 1, suppose r of them are in the top k most frequent words of the population. Then the recall is r/k. Figure 6 shows the recall of the top k words vs. k. We use the Monte Carlo method to run Algorithm 1 10 times. The solid lines show the recall for the top k words, while the shaded areas show the 0.95 confidence interval. Note that we are able to achieve a very small standard deviation with only 10 Monte Carlo runs. More importantly, observe that Algorithm 1 is able to discover the top 250 words at a recall near 1 with a single-digit ε.
5.2 Multiple Words per User
In this section, we evaluate Algorithm 3 in the multiple-words setting. As in Section 4, we use f_i(w) to represent the local frequency of word w on user i's device, and calculate the population frequency of w as the average of its local frequencies across users. We run Algorithm 3 on Sentiment140, choosing the same batch size as in the single-word setting. By Corollary 2, Algorithm 3 achieves (ε, δ)-differential privacy.
Similar to the single-word setting, Figure 7 shows the relationship between word frequency and discovery rate using Algorithm 3. We use the Monte Carlo method to run Algorithm 3 2000 times. Note that in the multiple-words setting, it is difficult to get a non-trivial lower bound on the discovery rate of Algorithm 3, because such a bound depends heavily on the distribution of words. If a large fraction of the words share the same prefixes, the discovery rate will be high, whereas if the words are uniformly distributed and do not share any prefixes, the discovery rate can be made arbitrarily low. Not surprisingly, Figure 7 shows that the larger ε is, the higher the discovery rate (and the lower its standard deviation).

We also study the recall of the top-k highest-frequency words in the multiple-words setting. Figure 8 shows the recall of the top k words vs. k. We use the Monte Carlo method to run Algorithm 3 10 times. Again, the recall is higher for a higher privacy budget (higher ε), and vice versa. Comparing Figure 6 to Figure 8, observe that the recall is much better in the multiple-words setting: the top 350 words can now be recalled at a rate close to 1 with a single-digit ε.
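The recall metric used in Figures 6 and 8 can be sketched as:

```python
def recall_at_k(discovered, population_freqs, k):
    """Fraction of the top-k most frequent words in the population that
    the algorithm discovered."""
    top_k = sorted(population_freqs, key=population_freqs.get,
                   reverse=True)[:k]
    return sum(w in discovered for w in top_k) / k
```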
5.2.1 Different Unit Size
In the experiments above, we assumed a word is a sequence of characters, so the length of a word is the number of characters in it. In this section, we study the performance of Algorithm 3 for different sizes of the smallest unit of a word, such as 1 character, 2 characters, etc. For example, when we consider the word "moon" with unit size = 1, it is a sequence of four characters "m", "o", "o", "n". However, when we consider it with unit size = 2, it is a sequence of the two units "mo" and "on". In this setting, the length of a word is the number of units it contains.
We study words with at most 10 characters using unit sizes of 1, 2, 3, and 4 on the Sentiment140 dataset with a fixed budget ε. L (the maximum length of a word, measured in units) is adjusted according to the unit size. We use the Monte Carlo method to run Algorithm 3 10 times. Both the discovery-rate and recall results show that the utility increases with larger unit size. This is because, with a fixed budget, a larger unit size means a smaller maximum word length, which enables us to use a larger batch size and obtain better utility (discovery and recall rates). Figure 9 shows the recall rate for various choices of unit sizes. Observe that there is a noticeable improvement when moving from a unit size of 1 to a unit size of 2, after which the gains diminish quickly.
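Re-grouping a word into larger units can be sketched as follows (keeping a trailing unit shorter than `unit_size` as-is is an assumption about boundary handling):

```python
def to_units(word, unit_size):
    """Represent a word as a sequence of fixed-size character units;
    the word's length then becomes the number of units."""
    return [word[i:i + unit_size] for i in range(0, len(word), unit_size)]
```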
6 Conclusion and Open Questions
We have introduced a novel federated algorithm for learning the frequent sequences in a population of user-generated data streams. We proved that it is inherently differentially private, investigated the trade-off between privacy and utility, and showed that it can provide excellent utility while achieving strong privacy guarantees. A significant advantage of this approach is that it eliminates the need to centralize raw data while also avoiding the harsh penalty of differential privacy in the local model. Indeed, any individual user only votes for a single-character extension to an existing trie.
Many questions remain to be addressed. For starters, it would be interesting to examine whether varying m and θ from one round to the next (say, by starting with large values and adaptively shrinking them) could yield utility gains without sacrificing privacy. More broadly, it is of fundamental interest to derive a sharp lower bound on the utility of distributed/interactive heavy-hitter discovery algorithms achieving differential privacy. This would help us assess whether or not there is room for improving the proposed algorithm. Another important research direction is to explore ways in which our algorithm can be modified to directly yield frequency estimates for the discovered heavy hitters. Finally, it would be interesting to explore decentralized algorithms that achieve differential privacy with zero probability of catastrophic failure (instead of a small probability δ of failure, as in our proposed algorithm). This may be achieved by introducing local randomization. For instance, a randomly chosen user can choose to remain silent with a small probability even when they have a word that contains a popular prefix already discovered by the service provider. They can also choose to transmit a prefix that they do not have (again, with low probability). Such approaches can strengthen the privacy guarantees and lead to algorithms that do not fail catastrophically with probability δ.
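As an illustration only, the local randomization suggested above might look like the following sketch; the probabilities `p_silent` and `p_fake` and the fabricated-prefix rule are assumptions, not a mechanism analyzed in this paper:

```python
import random
import string

def randomized_vote(prefix, trie, p_silent=0.05, p_fake=0.05):
    """A user stays silent or sends a fabricated prefix with small
    probability, so no single transmission certifies possession."""
    r = random.random()
    if r < p_silent:
        return None  # remain silent even though the prefix qualifies
    if r < p_silent + p_fake:
        # fabricate: extend a random learned prefix by a random character
        base = random.choice(sorted(trie)) if trie else ""
        return base + random.choice(string.ascii_lowercase)
    return prefix  # honest vote
```

Analyzing how such randomization composes with the sampling-based guarantees of Theorems 1 and 2 is left open by the paper.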
References
 Acharya et al. (2018) Acharya, J., Sun, Z., and Zhang, H. Communication efficient, sample optimal, linear time locally private discrete distribution estimation. arXiv preprint arXiv:1802.04705, 2018.

 Apple (2017) Apple. Learning with privacy at scale. Apple Machine Learning Journal, 2017.
 Avent et al. (2017) Avent, B., Korolova, A., Zeber, D., Hovden, T., and Livshits, B. Blender: enabling local search with a hybrid differential privacy model. In Proc. of the 26th USENIX Security Symposium, pp. 747–764, 2017.
 Bassily et al. (2017) Bassily, R., Stemmer, U., Thakurta, A. G., et al. Practical locally private heavy hitters. In Advances in Neural Information Processing Systems, pp. 2288–2296, 2017.
 Bonawitz et al. (2016) Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., Ramage, D., Segal, A., and Seth, K. Practical secure aggregation for federated learning on user-held data. arXiv preprint arXiv:1611.04482, 2016.
 Bonomi & Xiong (2013) Bonomi, L. and Xiong, L. Mining frequent patterns with differential privacy. Proceedings of the VLDB Endowment, 6(12):1422–1427, 2013.
 Bun et al. (2018) Bun, M., Nelson, J., and Stemmer, U. Heavy hitters and the structure of local privacy. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, SIGMOD/PODS ’18, pp. 435–447, New York, NY, USA, 2018. ACM. ISBN 9781450347068. doi: 10.1145/3196959.3196981. URL http://doi.acm.org/10.1145/3196959.3196981.
 Charikar et al. (2002) Charikar, M., Chen, K., and Farach-Colton, M. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pp. 693–703. Springer, 2002.
 Cormode & Hadjieleftheriou (2008) Cormode, G. and Hadjieleftheriou, M. Finding frequent items in data streams. Proc. VLDB Endow., 1(2):1530–1541, August 2008. ISSN 21508097. doi: 10.14778/1454159.1454225. URL http://dx.doi.org/10.14778/1454159.1454225.
 Cormode et al. (2003) Cormode, G., Korn, F., Muthukrishnan, S., and Srivastava, D. Finding hierarchical heavy hitters in data streams. In Proceedings of the 29th International Conference on Very Large Data Bases  Volume 29, VLDB ’03, pp. 464–475. VLDB Endowment, 2003. ISBN 0127224424. URL http://dl.acm.org/citation.cfm?id=1315451.1315492.
 Diakonikolas et al. (2015) Diakonikolas, I., Hardt, M., and Schmidt, L. Differentially private learning of structured discrete distributions. In Advances in Neural Information Processing Systems, pp. 2566–2574, 2015.
 Ding et al. (2017) Ding, B., Kulkarni, J., and Yekhanin, S. Collecting telemetry data privately. In Advances in Neural Information Processing Systems, pp. 3571–3580, 2017.
 Duchi et al. (2013) Duchi, J. C., Jordan, M. I., and Wainwright, M. J. Local privacy and statistical minimax rates. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pp. 429–438. IEEE, 2013.
 Dwork (2008) Dwork, C. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pp. 1–19. Springer, 2008.
 Dwork & Roth (2014) Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
 Dwork et al. (2006a) Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., and Naor, M. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 486–503. Springer, 2006a.
 Dwork et al. (2006b) Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pp. 265–284. Springer, 2006b.

 Dwork et al. (2010) Dwork, C., Naor, M., Pitassi, T., and Rothblum, G. N. Differential privacy under continual observation. In Proceedings of the forty-second ACM Symposium on Theory of Computing, pp. 715–724. ACM, 2010.
 Erlingsson et al. (2014) Erlingsson, Ú., Pihur, V., and Korolova, A. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pp. 1054–1067. ACM, 2014.
 Geyer et al. (2017) Geyer, R. C., Klein, T., and Nabi, M. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.
 Go et al. (2009) Go, A., Bhayani, R., and Huang, L. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(12), 2009.
 Kairouz et al. (2014) Kairouz, P., Oh, S., and Viswanath, P. Extremal mechanisms for local differential privacy. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 2879–2887. Curran Associates, Inc., 2014.
 Kairouz et al. (2016) Kairouz, P., Bonawitz, K., and Ramage, D. Discrete distribution estimation under local privacy. In International Conference on Machine Learning, pp. 2436–2444, 2016.
 Kasiviswanathan et al. (2011) Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S., and Smith, A. What can we learn privately? SIAM J. Comput., 40(3):793–826, June 2011. ISSN 0097-5397. doi: 10.1137/090756090. URL http://dx.doi.org/10.1137/090756090.
 Kenthapadi & Tran (2018) Kenthapadi, K. and Tran, T. T. PriPeARL: A framework for privacy-preserving analytics and reporting at LinkedIn. arXiv preprint arXiv:1809.07754, 2018.
 Konečnỳ et al. (2016) Konečnỳ, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
 Li et al. (2012) Li, N., Qardaji, W., and Su, D. On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, pp. 32–33. ACM, 2012.
 McMahan et al. (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282, 2017.
 McMahan & Ramage (2017) McMahan, H. B. and Ramage, D. Federated learning: Collaborative machine learning without centralized training data, April 2017. URL https://ai.googleblog.com/2017/04/federated-learning-collaborative.html. Google AI Blog.
 McMahan et al. (2018) McMahan, H. B., Ramage, D., Talwar, K., and Zhang, L. Learning differentially private recurrent language models. In ICLR, 2018.
 Wang et al. (2017) Wang, T., Blocki, J., Li, N., and Jha, S. Locally differentially private protocols for frequency estimation. In Proc. of the 26th USENIX Security Symposium, pp. 729–745, 2017.
 Xu et al. (2016) Xu, S., Cheng, X., Su, S., Xiao, K., and Xiong, L. Differentially private frequent sequence mining. IEEE Transactions on Knowledge and Data Engineering, 28(11):2910–2926, 2016.
 Ye & Barg (2018) Ye, M. and Barg, A. Optimal schemes for discrete distribution estimation under locally differential privacy. IEEE Transactions on Information Theory, 2018.
 Zhou & Lin (2018) Zhou, F. and Lin, X. Frequent sequence pattern mining with differential privacy. In International Conference on Intelligent Computing, pp. 454–466. Springer, 2018.
Appendix A Proof of Theorem 1 and Theorem 2
In Theorem 1, we show that when , choosing , , and ensures that Algorithm 1 is differentially private. The theorem is proved by combining two lemmas that handle different cases of the population. In Lemma 1, we first show a bound on the ratio between and for any trie such that . This bound depends on , the number of sequences that have prefix in . Clearly, when , must be 0, but the number of sequences having prefix in is , so is greater than 0; in this case, the ratio between them approaches infinity. On the one hand, if the number of sequences with prefix in is already large, then an extra in affects the probability only slightly, so the ratio between and is small and can be bounded by a small . On the other hand, if the number of sequences with in is actually small, then the probability is small and can be bounded by a reasonably small . This case is handled by Lemma 2.
We start by calculating the probability that a prefix appears at least times if we randomly choose users from a pool of users of size , assuming that appears times in the population.
Proposition 2.
Suppose prefix appears times in a pool of users. If we select users uniformly at random from them, then the probability that prefix appears at least times is
Proof.
The probability that a prefix appears times follows the hypergeometric distribution . To calculate the probability that appears at least times in the chosen subset, we sum over the cases in which appears times. ∎

The above probability expression will be useful in the proof of Lemma 2 below, and when we investigate the privacy-utility tradeoff in Section 3. Proposition 1 also follows from Proposition 2.
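As a concrete sketch of the tail probability in Proposition 2 (with hypothetical parameter names not taken from the paper: a pool of `n` users, `f` of whom hold the prefix, a uniform random sample of `m`, and a threshold `t`), the sum of hypergeometric terms can be computed directly:

```python
from math import comb

def tail_prob(n, f, m, t):
    """P(a prefix held by f of n users appears >= t times in a
    uniform random sample of m users). The count of holders in the
    sample follows the hypergeometric distribution."""
    return sum(comb(f, k) * comb(n - f, m - k)
               for k in range(t, min(f, m) + 1)) / comb(n, m)

# Threshold 0 is always met; a threshold above f can never be met.
assert abs(tail_prob(10, 4, 5, 0) - 1.0) < 1e-12
assert tail_prob(10, 4, 5, 5) == 0.0
```

Summing the exact hypergeometric terms (rather than a normal approximation) keeps the sketch faithful to how the proposition is stated.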
Lemma 1.
such that , , assume there are users in that have prefix , and . Then .
Proof.
Let be a function that counts the number of ways to choose users (denote the set of chosen users as ) from a set of users , such that, using Algorithm 2, . We also denote by the number of ways to choose users under the same condition, given that prefix is added to in this step.
Recall that and differ in only one sequence and , such that , . We denote ’s prefix of length as . For any output trie , consider the step that grows from by . Let . We assume there are users in that have the prefix of . We denote this subset of users in as . Thus the set of users in that have prefix is , with size .
We abuse notation and write instead of , and instead of , for fixed , , , .
We calculate by taking the ratio between (the number of ways to choose users from such that returns ) and (the number of ways to choose users from ).
We can also separate into two parts: not choosing (taking all users from ) or choosing (taking the remaining users from ). Thus,
(3) 
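The two-part split used here (a fixed user is either excluded, so all members come from the remaining pool, or included, so one fewer member does) is the standard counting identity behind Pascal's rule. A minimal numeric check, with generic names `n` and `m` rather than the paper's symbols:

```python
from math import comb

def count_split(n, m):
    """Number of m-subsets of an n-set, split by whether one fixed
    user u is excluded (all m chosen from the other n - 1 users)
    or included (the remaining m - 1 chosen from the other n - 1)."""
    without_u = comb(n - 1, m)       # u not chosen
    with_u = comb(n - 1, m - 1)      # u chosen
    return without_u, with_u

# The two parts together account for every m-subset (Pascal's rule).
a, b = count_split(10, 4)
assert a + b == comb(10, 4)
```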
Consider : because , there must be at least users in the chosen set voting for . We consider the following cases separately: choosing users from and users from (note that is already guaranteed by choosing users from , so we treat as a given condition here), and choosing users from and users from , , i.e.,
Similarly for , we either do not choose (taking all users from ) or choose (taking the remaining users from ).
Thus,
(4) 
can likewise be decomposed into not choosing (taking all users from ) or choosing (taking the remaining users from ). Unlike , however, if is chosen, we can choose to users containing prefix from . Thus,
(5) 
∎
Suppose there are users that have prefix . In Lemma 2, we show that when and , . This means that when is small, the probability that is small, so it can be bounded by a small . The same holds for the round in which there are users that have prefix : if , then .
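The proof of Lemma 2 below bounds the hypergeometric tail by a geometric series, which works because the ratio between adjacent terms of the hypergeometric pmf is decreasing. A numeric sanity check of that fact, with hypothetical parameter names (`n` users in the pool, `f` holding the prefix, a sample of `m`) rather than the paper's notation:

```python
from math import comb

def pmf(n, f, m, k):
    """Hypergeometric pmf: exactly k of the m sampled users hold the
    prefix, when f of the n users in the population hold it."""
    return comb(f, k) * comb(n - f, m - k) / comb(n, m)

def adjacent_ratios(n, f, m):
    """Ratios pmf(k+1)/pmf(k) over the support of k; the tail sum can
    be bounded by a geometric series because these ratios decrease."""
    lo = max(0, m - (n - f))
    hi = min(f, m)
    return [pmf(n, f, m, k + 1) / pmf(n, f, m, k) for k in range(lo, hi)]

# Illustrative parameters: the adjacent-term ratios strictly decrease.
ratios = adjacent_ratios(100, 20, 30)
assert all(a > b for a, b in zip(ratios, ratios[1:]))
```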
Lemma 2.
Again, consider the step that grows from by . We assume there are users in that have the prefix of . Then there are users in that have prefix . If , , , , and , then , and .
Proof.
First, . To calculate , we consider separately the cases of choosing to users voting for . By Proposition 2,
The sum above can be upper bounded by the sum of a geometric series. We know that , , , . Consider the ratio between the first two terms,
(6) 
We denote as .
We now show that the ratio between adjacent terms is decreasing. Consider the ratio between any two adjacent terms and ,
Thus,