Learning a distribution from its samples has been a fundamental task in unsupervised learning dating back to the late nineteenth century Pearson (1894). This task, especially under distributed settings, has gained growing popularity in recent years as data is increasingly generated “at the edge” by countless sensors, smartphones, and other devices. When data is distributed across multiple devices, communication cost and bandwidth often become a bottleneck hampering the training of high-accuracy machine learning models Garg et al. (2014). This is even more so for federated learning and analytics type settings Kairouz et al. (2019), which rely on wireless mobile links for communication.
To resolve this issue, several communication-efficient distribution learning schemes have been recently proposed and studied in the literature (see Section 1.2 for a thorough discussion). On the positive side, the state-of-the-art schemes are known to be worst-case (minimax) optimal as they have been shown to achieve the information-theoretic lower bounds on the global minimax error Han et al. (2018a); Acharya et al. (2019b, a); Han et al. (2018b); Barnes et al. (2019). On the negative side, however, the estimation error achieved by these schemes scales as under a -bit communication constraint on each sample, where is the alphabet size of the unknown discrete distribution . This suggests that without additional assumptions on the error scales linearly in , i.e. the introduction of communication constraints introduces a penalty on the estimation accuracy. This is true even if we allow for interaction between clients Barnes et al. (2019); Acharya et al. (2020).
A recent work Acharya et al. (2021) has moved a step forward from the “global minimax regime” by restricting the target distribution to be -sparse and showing that the error can be reduced to in this case, i.e. the error depends on the sparsity rather than the ambient dimension . More recently, Chen et al. (2021) has improved on this result by developing a group-testing based scheme that reduces the error to , therefore completely removing the dependence on the ambient dimension and matching the information-theoretic lower bound.
However, both of these schemes and their performance guarantees are limited to the -sparse setting. Little is known when the target distribution deviates slightly from being exactly -sparse. Moreover, these schemes are essentially still worst-case schemes as they are designed to be minimax optimal, albeit over a smaller class of distributions, i.e. -sparse.
In this paper, we argue that all these results can be overly pessimistic, as worst-case notions of complexity and schemes designed to optimize these worst-case notions can be too conservative. Instead, we seek a measure of local complexity that captures the hardness of estimating a specific instance . Ideally, we want a scheme that adapts to the hardness of the problem instead of being tuned to the worst-case scenario; that is, a scheme achieving smaller error when is “simpler.”
Our contributions. Motivated by these observations, in this work we consider the local minimax complexity of distribution estimation and quantify the hardness of estimating a specific under communication constraints. In particular, under the loss, we show that the local complexity of estimating is captured by its half-norm (here we generalize the notion of -norm to ): we propose a two-round interactive scheme that uniformly achieves the error under loss, which requires no prior information on (our scheme also guarantees pointwise upper bounds on or general errors; see Theorem 2.2). On the impossibility side, we also show that for any (arbitrarily) interactive scheme, the local minimax error (which is formally defined in Theorem 2.4) must be at least for any when is sufficiently large.
These upper and lower bounds together indicate that plays a fundamental role in distributed estimation and that the bits of communication is both sufficient and necessary to achieve the optimal (centralized) performance when is large enough. Indeed, this quantity is exactly the Rényi entropy of order , i.e. , showing that under the loss, the correct measure of the local communication complexity at is the Rényi entropy of .
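The link between the half-norm and the Rényi entropy can be checked numerically. The sketch below assumes the standard definitions, which are not spelled out in the surviving text: the half-norm of a probability vector `p` is taken to be `(sum_i sqrt(p_i))**2`, and the Rényi entropy of order `alpha` is `log2(sum_i p_i**alpha) / (1 - alpha)`. The function names and the two example distributions are ours, chosen only for illustration.

```python
import math

def half_norm(p):
    """Half-"norm" of a probability vector: (sum_i sqrt(p_i))^2 (assumed definition)."""
    return sum(math.sqrt(x) for x in p) ** 2

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha != 1), measured in bits."""
    return math.log2(sum(x ** alpha for x in p if x > 0)) / (1.0 - alpha)

# Two illustrative distributions (hypothetical examples, not from the paper).
uniform = [1 / 8] * 8
skewed = [0.9, 0.05, 0.03, 0.02]

for name, p in (("uniform", uniform), ("skewed", skewed)):
    # Under the assumed definitions, log2 of the half-norm equals the
    # Renyi entropy of order 1/2, so the last two columns should agree.
    print(name, half_norm(p), math.log2(half_norm(p)), renyi_entropy(p, 0.5))
```

Under these definitions the half-norm of a uniform distribution equals its alphabet size, while a skewed distribution has a much smaller half-norm, and therefore a smaller entropy of order one half.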
Compared to the global minimax results where the error scales as , we see that when we move toward the local regime, the linear dependency on in the convergence rate is replaced by . This dimension independent convergence is also empirically verified by our experiments (see Section 3 for more details). Note that , so our proposed scheme is also globally minimax optimal. Moreover, since , our scheme also achieves the convergence rate as in Chen et al. (2021) under the -sparse model (though admittedly, the scheme in Chen et al. (2021) is designed under the more stringent non-interactive setting). As another immediate corollary, our pointwise upper bounds indicate that
bit suffices to attain the performance of the centralized model when the target distribution is highly skewed, such as the (truncated) Geometric distributions and Zipf distributions with degree greater than two.
Our techniques. Our proposed two-round interactive scheme is based on a local refinement approach, where in the first round, a standard (global) minimax optimal estimation scheme is applied to localize . In the second round, we use additional samples (clients), together with the information obtained from the previous round, to locally refine the estimate. The localization-refinement procedure enables us to tune the encoders (in the second round) to the target distribution and hence attain the optimal pointwise convergence rate uniformly.
On the other hand, our lower bound is based on the quantized Fisher information framework introduced in Barnes et al. (2020). However, in order to obtain a local bound around , we develop a new approach that first finds the best parametric sub-model containing , and then upper bounds its Fisher information in a neighborhood around . To the best of our knowledge, this is the first impossibility result that captures the local complexity of high-dimensional estimation under information constraints, and it may be of independent interest, e.g. for deriving pointwise lower bounds for other estimation models under general information constraints.
1.1 Notation and Setup
The general distributed statistical task we consider in this paper can be formulated as follows. Each one of the clients has local data , where and is the collection of all -dimensional discrete distributions. The -th client then sends a message to the server, who upon receiving aims to estimate the unknown distribution .
At client , the message is generated via a sequentially interactive protocol; that is, samples are communicated sequentially by broadcasting the communication to all nodes in the system including the server. Therefore, the encoding function of the -th client can depend on all previous messages . Formally, it can be written as a randomized mapping (possibly using shared randomness across participating clients and the server) of the form , and the -bit communication constraint restricts . As a special case, when depends only on and is independent of the other messages for all (i.e. ), we say the corresponding protocol is non-interactive. Finally, we call the tuple an estimation scheme, where is an estimator of . We use and to denote the collections of all sequentially interactive and non-interactive schemes respectively.
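To make the distinction between the two protocol classes concrete, here is a minimal sketch of a sequentially interactive protocol as a loop over clients. It assumes each message is an integer in `range(2**b)` and that every encoder sees the full transcript of earlier messages. The names `run_protocol`, `non_interactive`, and `interactive`, and the toy parity encoders, are hypothetical illustrations and are not the encoders used in this paper.

```python
from typing import Callable, List, Sequence

# An encoder maps (own sample, all previous messages) to a b-bit message.
Encoder = Callable[[int, Sequence[int]], int]

def run_protocol(samples: Sequence[int], encoders: Sequence[Encoder], b: int) -> List[int]:
    """Run clients one after another; each message is broadcast to later clients."""
    transcript: List[int] = []
    for x, encode in zip(samples, encoders):
        y = encode(x, tuple(transcript))
        if not 0 <= y < 2 ** b:
            raise ValueError("message exceeds the b-bit budget")
        transcript.append(y)
    return transcript

def non_interactive(x: int, history: Sequence[int]) -> int:
    """Toy 1-bit encoder that ignores the transcript (non-interactive)."""
    return x % 2

def interactive(x: int, history: Sequence[int]) -> int:
    """Toy 1-bit encoder whose output depends on the previous message."""
    previous = history[-1] if history else 0
    return (x + previous) % 2
```

A non-interactive scheme is the special case in which every encoder ignores its `history` argument, which is why impossibility results for the interactive class also cover it.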
Our goal here is to design a scheme that minimizes the (or ) estimation error: , for all , and to characterize the best error achievable by any scheme in . We note that while our impossibility bounds hold for any scheme in , the particular scheme we propose uses only one round of interaction. For error, we replace in the above expectations with . In this work, we mainly focus on the regime , aiming to characterize the statistical convergence rates when is sufficiently large.
1.2 Related works
Estimating discrete distributions is a fundamental task in statistical inference and has a rich literature Barlow (1972); Devroye and Gábor (1985); Devroye and Lugosi (2012); Silverman (1986). Under communication constraints, the optimal convergence rate for discrete distribution estimation was established in Han et al. (2018b, a); Barnes et al. (2019); Acharya et al. (2019b, a); Chen et al. (2020) for the non-interactive setting, and Barnes et al. (2019); Acharya et al. (2020) for the general (blackboard) interactive model. The recent works Acharya et al. (2021); Chen et al. (2021) consider the same task under a sparsity assumption on the distribution, with communication or privacy constraints. However, all these works study the global minimax error and focus on minimizing the worst-case estimation error. Hence the resulting schemes are tuned to minimize the error in the worst case, which may be too pessimistic for most real-world applications.
A slightly different but closely related problem is distributed estimation of distributions Ye and Barg (2017); Wang et al. (2017); Acharya et al. (2019c); Acharya and Sun (2019); Chen et al. (2020) and heavy hitter detection under local differential privacy (LDP) constraints Bassily and Smith (2015); Bassily et al. (2017); Bun et al. (2019); Zhu et al. (2020). Although originally designed for preserving user privacy, some of these schemes can be made communication efficient. For instance, the scheme in Acharya and Sun (2019) suggests that one can use bit communication to achieve error (which is global minimax optimal); the non-interactive tree-based schemes proposed in Bassily et al. (2017); Bun et al. (2019) can be cast into a -bit frequency oracle, but since the scheme is optimized with respect to error, directly applying their frequency oracle leads to a sub-optimal convergence in .
In the line of distributed estimation under LDP constraint, the recent work Duchi and Ruan (2018) studies local minimax lower bounds and shows that the local modulus of continuity with respect to the variation distance governs the rate of convergence under LDP. However, the notion of the local minimax error in Duchi and Ruan (2018) is based on the two-point method, so their characterized lower bounds may not be uniformly attainable in general high-dimensional settings.
The rest of the paper is organized as follows. In Section 2, we present our main results, including pointwise upper bounds and an (almost) matching local minimax lower bound. We provide examples and experiments in Section 3 to demonstrate how our pointwise bounds improve upon previous global minimax results. In Section 4, we introduce our two-round interactive scheme that achieves the pointwise bound. The analysis of this scheme, as well as the proof of the upper bounds, is given in Section 5. Finally, in Section 6 we provide the proof of the local minimax lower bound.
2 Main results
Our first contribution is the design of a two-round interactive scheme (see Section 4 for details) for the problem described in Section 1.1. The analysis of the convergence rate of this scheme leads to the following pointwise upper bound on the error:
Theorem 2.1 (Local upper bound)
For any , there exists a sequentially interactive scheme , such that for all ,
where are some universal constants (which are explicitly specified in Section 5). This implies that as long as .
In addition to the $\ell_2$ error, by slightly tweaking the parameters in our proposed scheme, we can obtain the following pointwise upper bound on the $\ell_1$ (i.e. the total variation) error:
Theorem 2.2 (Local upper bound)
For any , there exists a sequentially interactive scheme , such that for all ,
where are some universal constants. Hence as long as .
Theorem 2.1 implies that the convergence rate of the error is dictated by the half-norm of the target distribution while Theorem 2.2 implies that the error is dictated by the one-third norm of the distribution. In general, we can optimize the encoding function with respect to any loss for and the quantity that determines the convergence rate becomes the -norm (see Remark C.1 for details). We also remark that the large second order terms in Theorem 2.1 and Theorem 2.2 are generally inevitable. See Appendix A.1 for a discussion. The proofs of Theorem 2.1 and Theorem 2.2 are given in Section 5 and Appendix C respectively.
Next, we complement our achievability results with the following minimax lower bounds.
Theorem 2.3 (Global minimax lower bound)
For any (possibly interactive) scheme, it holds that
Note that the lower bound in Theorem 2.3 can be uniformly attained by our scheme even if no information on the target set of distributions (i.e. the parameter ) is available, indicating that our proposed scheme is globally minimax optimal. The proof can be found in Appendix D.
Theorem 2.4 (Local minimax lower bound)
Note that in Theorem 2.4, the neighborhood is strictly smaller than the following neighborhood: as .
Theorem 2.4 indicates that our proposed scheme is (almost) optimal when is sufficiently large and that is the fundamental limit for the estimation error. This conclusion implies that under loss, the right measure of the hardness of an instance is its half-norm, or equivalently its Rényi entropy of order . The proof of Theorem 2.4 is given in Section 6.
When is sufficiently large, bits of communication is both sufficient and necessary to achieve the convergence rate of the centralized setting under loss. Similarly, bits are sufficient for loss.
In addition, observe that the quantities and are the Rényi entropies of order and , denoted as and , respectively. In other words, the communication required to achieve the optimal (i.e. the centralized) rate is determined by the Rényi entropy of the underlying distribution .
Finally, we remark that when the goal is to achieve the centralized rate with minimal communication (instead of achieving the best convergence rate for a fixed communication budget as we have assumed so far), the performance promised in the above corollary can be achieved without knowing beforehand. See Appendix A.2 for a discussion.
3 Examples and experiments
Next, we demonstrate by several examples and experiments that our results can recover earlier global minimax results and can significantly improve them when the target distribution is highly skewed.
Corollary 3.1 (-sparse distributions)
Let . Then as long as is large enough, there exist interactive schemes and such that and .
The above result shows that our scheme is minimax optimal over . This recovers and improves (by a factor of ) the results from Acharya et al. (2021) on the and convergence rates for -sparse distributions. (We remark, however, that their scheme is non-interactive and has a smaller minimum sample size requirement.)
Corollary 3.2 (Truncated geometric distributions)
Let be the truncated geometric distribution. That is, for all and , Then if ,
This result shows that if is constant for the truncated geometric distribution, bit suffices to achieve the centralized convergence rate in this case. Note that this is a significant improvement over previous minimax bounds on the error, which are Han et al. (2018a). This suggests that the corresponding minimax optimal scheme is suboptimal by a factor of when the target distribution is truncated geometric. Our results suggest that the error should not depend on at all, and Figure 1 provides empirical evidence to justify this observation.
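The dimension independence for geometric tails can be illustrated numerically. The sketch below assumes a truncated geometric distribution of the form `p_i` proportional to `beta**i` for `i = 1, ..., d` with `0 < beta < 1`, and the half-norm `(sum_i sqrt(p_i))**2`. This parametrization, the function names, and the value `beta = 0.8` are our assumptions for illustration and may differ from the exact definition in Corollary 3.2.

```python
import math

def truncated_geometric(beta, d):
    """Probabilities p_i proportional to beta**i, i = 1..d (assumed parametrization)."""
    weights = [beta ** i for i in range(1, d + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def half_norm(p):
    """Half-"norm" of a probability vector: (sum_i sqrt(p_i))^2."""
    return sum(math.sqrt(x) for x in p) ** 2

# Compare the half-norm of a geometric tail with that of the uniform
# distribution on the same alphabet as the alphabet size d grows.
for d in (10, 100, 1000):
    geometric = half_norm(truncated_geometric(0.8, d))
    uniform = half_norm([1.0 / d] * d)
    print(d, geometric, uniform)
```

Under this parametrization a geometric-series calculation gives the closed-form bound `(1 + sqrt(beta)) / (1 - sqrt(beta))` on the half-norm for every `d`, whereas the half-norm of the uniform distribution grows linearly in `d`.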
Corollary 3.3 (Truncated Zipf distributions with )
Let be a truncated Zipf distribution with . That is, for , Then the local complexity of is characterized by . So in this case, , as long as .
We leave the complete characterization for all to Appendix E. Note that since both and are not sparse distributions, Acharya et al. (2021) cannot be applied here. Therefore, the best previously known scheme for these two cases is the global minimax optimal scheme achieving an error. Again, our results suggest that this is suboptimal by a factor of .
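For the Zipf case, a short calculation shows why the half-norm stays bounded once the exponent exceeds two. We assume here the parametrization $p_i \propto i^{-\lambda}$ for $i = 1, \dots, d$ and the definition $\lVert p \rVert_{1/2} = (\sum_i \sqrt{p_i})^2$; the symbol $\lambda$ and this normalization are our notation and may differ from the statement of Corollary 3.3.

```latex
\lVert p \rVert_{1/2}
  = \frac{\bigl(\sum_{i=1}^{d} i^{-\lambda/2}\bigr)^{2}}{\sum_{i=1}^{d} i^{-\lambda}}
  \;\le\; \Bigl(\sum_{i=1}^{\infty} i^{-\lambda/2}\Bigr)^{2}
  = \zeta(\lambda/2)^{2} < \infty
  \qquad \text{for } \lambda > 2 .
```

The inequality uses that the denominator is at least one (its first term equals one), and the series in the numerator converges exactly when $\lambda/2 > 1$. The resulting bound does not depend on $d$.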
In Figure 1, we empirically compare our scheme with Han et al. (2018a) (which is globally minimax optimal). In the left figure, we see that the error of our scheme is an order of magnitude smaller than the minimax scheme and remains almost the same under different values of . We illustrate that more clearly in the right figure, where we fix and increase . It can be observed that the error of our scheme remains bounded when increases, while the error of the minimax scheme scales linearly in . This phenomenon is justified by Corollary 3.2.
4 The localization-refinement scheme
In this section, we introduce our two-round localization-refinement scheme and show how it locally refines a global minimax estimator to obtain a pointwise upper bound.
The first round of our two-round estimation scheme (the localization phase) is built upon the global minimax optimal scheme Han et al. (2018a), in which symbols are partitioned into subsets each of size , and each of the first clients is assigned one of these subsets (each subset is assigned to clients). The client reports its observation only if it is in its assigned subset by using bits. This first round allows us to obtain a coarse estimate with error . Then in the second round (the refinement phase), we locally refine the estimate by adaptively assigning more “resources” (i.e. samples), according to obtained from the first round, to the more difficult symbols (i.e. with larger error ). In particular, for the error, the number of samples assigned to symbol will be roughly proportional to .
It turns out that the local refinement step can effectively mitigate the estimation error, enabling us to move from a worst-case bound to a pointwise bound.
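A heuristic calculation, which is not the formal argument of Section 5, suggests why such an allocation leads to a half-norm rate. Suppose symbol $j$ is reported by $n_j$ clients, so that its empirical frequency has variance at most $p_j / n_j$, and suppose the total number of reports is limited by $\sum_j n_j \le N$. The symbols $n_j$ and $N$ are notation introduced only for this sketch. By the Cauchy-Schwarz inequality,

```latex
\sum_{j} \frac{p_j}{n_j}
  \;\ge\; \frac{\bigl(\sum_{j} \sqrt{p_j}\bigr)^{2}}{\sum_{j} n_j}
  \;\ge\; \frac{\lVert p \rVert_{1/2}}{N},
  \qquad \text{with equality when } n_j = N \, \frac{\sqrt{p_j}}{\sum_{k} \sqrt{p_k}} .
```

Under these assumptions the best allocation gives each symbol a number of reports proportional to $\sqrt{p_j}$, and the resulting squared error is the half-norm divided by the report budget. The same calculation applied to $\sum_j \sqrt{p_j / n_j}$ gives $n_j \propto p_j^{1/3}$ and a rate governed by the one-third norm.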
Round 1 (localization):
In the first round, the first clients collaboratively produce a coarse estimate for the target distribution via the grouping scheme described above (this is the minimax optimal scheme proposed in Han et al. (2018a), see Algorithm 1 for details). Note that in general, this scheme can be replaced by any non-interactive minimax optimal scheme.
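The grouping idea of the localization round can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a reproduction of Algorithm 1: symbols are indexed from `0` to `d - 1`, each block holds `2**b - 1` symbols so that one message value remains for "not in my block", and clients are assigned to blocks in round-robin order. The function name `localization_round` is hypothetical.

```python
import random

def localization_round(samples, d, b):
    """Coarse estimate of a distribution on {0, ..., d-1} from b-bit reports."""
    block_size = 2 ** b - 1              # one message value is reserved for "not in my block"
    num_blocks = -(-d // block_size)     # ceil(d / block_size)
    counts = [0] * d
    clients_per_block = [0] * num_blocks
    for i, x in enumerate(samples):
        block = i % num_blocks           # round-robin assignment of clients to blocks
        clients_per_block[block] += 1
        if x // block_size == block:     # report only if the sample lies in the assigned block
            counts[x] += 1
    # Each symbol is observed only by the clients assigned to its block.
    return [counts[j] / max(clients_per_block[j // block_size], 1) for j in range(d)]

# Usage sketch on synthetic data (the distribution below is an arbitrary example).
random.seed(0)
p = [0.5, 0.25, 0.125, 0.125]
samples = random.choices(range(4), weights=p, k=40000)
coarse = localization_round(samples, d=4, b=1)
print(coarse)
```

Because only about a `1 / num_blocks` fraction of the clients can report any given symbol, the per-symbol variance of this estimate grows with the number of blocks, which is the source of the dimension dependence that the second round is meant to remove.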
Round 2 (local refinement):
Upon obtaining from the first round, the server then computes , where the exact definition of depends on the loss function. This quantity will be used in the design of the remaining clients’ encoders. In particular, for loss, we set ; for loss, we set . For notational convenience, we denote as .
To design the encoding functions of the remaining clients, we group them into (possibly overlapping) sets with
for all .
We require the grouping to satisfy the following properties: 1) each consists of distinct clients, and 2) each client is contained in at most groups. Notice that these requirements can always be attained (see, for instance, Algorithm 3 in Appendix B). We further write to be the set of indices of the groups that client belongs to (i.e. ), so the second property implies for all .
As in the first round, client will report its observation if it belongs to the subset . Since , client ’s report can be encoded in bits. Note that with this grouping strategy each symbol is reported by clients in , hence the server can estimate by computing Note that the size of is dictated by the estimate for obtained in the first round. See Algorithm 2 for the details of the second round.
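The refinement round can be sketched along the same lines. The code below is an illustration under our own assumptions rather than Algorithm 2: group sizes are taken proportional to the square root of the coarse estimate (the allocation suggested by the squared-error heuristic), every symbol receives at least one client, and groups are laid out as consecutive cyclic intervals of clients so that each group has distinct members and no client belongs to more than `2**b - 1` groups. The name `refinement_round` and the rounding rule are hypothetical.

```python
import math

def refinement_round(samples, p_hat, b):
    """Refined estimate from second-round samples, given a coarse estimate p_hat."""
    n, d = len(samples), len(p_hat)
    slots = n * (2 ** b - 1)             # each client may join at most 2^b - 1 groups
    if slots < d:
        raise ValueError("need n * (2^b - 1) >= d so that every symbol gets a client")
    weights = [math.sqrt(max(q, 0.0)) for q in p_hat]
    total = sum(weights)
    if total == 0:                       # degenerate coarse estimate: fall back to uniform
        weights, total = [1.0] * d, float(d)
    # One guaranteed client per symbol, the rest split in proportion to sqrt(p_hat).
    sizes = [min(n, 1 + int((slots - d) * w / total)) for w in weights]
    estimate, cursor = [], 0
    for j, size in enumerate(sizes):
        members = [(cursor + k) % n for k in range(size)]   # distinct since size <= n
        cursor = (cursor + size) % n
        hits = sum(1 for i in members if samples[i] == j)   # clients in group j that saw j
        estimate.append(hits / size)
    return estimate, sizes
```

Since the group sizes sum to at most `n * (2**b - 1)` and the groups tile the client indices cyclically, each client lands in at most `2**b - 1` groups, so its report (one of its assigned symbols, or "none") fits in `b` bits.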
In the above two-round scheme, we see that the local refinement step is crucial for moving from the worst-case performance guarantee to a pointwise one. Therefore we conjecture that the local lower bound in Theorem 2.4 cannot be achieved by any non-interactive scheme.
5 Analysis of the error (proof of Theorem 2.1)
In this section, we analyze the estimation errors of the above two-round localization-refinement scheme and prove Theorem 2.1. Before entering the main proof, we give the following lemma that controls the estimation error of between and .
Now consider the scheme described in Section 4. After the first round, we obtain a coarse estimate . Set . Then by Lemma 5.1 and taking the union bound over , we have . In order to distinguish the estimate obtained from the first round from the final estimator, we use to denote the final estimator.
Now observe that the estimation error can be decomposed into
where (a) holds since almost surely. Hence it remains to bound . Next, as described in Section 4, we partition the second clients into overlapping groups according to (2). The reason we choose in this way is to ensure that 1) symbols with larger (which implies larger estimation error) are allocated more samples; 2) every symbol is assigned at least samples.
Clients in then collaboratively estimate . In particular, client reports her observation if , and the server computes Finally, the following lemma controls , completing the proof of Theorem 2.1.
Let be defined as above. Then
6 The local minimax lower bound (proof of Theorem 2.4)
Our proof is based on the framework introduced in Barnes et al. (2019), where a global upper bound on the quantized Fisher information is given and used to derive the minimax lower bound on the error. We extend their results to the local regime and develop a local upper bound on the quantized Fisher information around a neighborhood of .
To obtain a local upper bound, we construct an -dimensional parametric sub-model that contains and is a subset of , where is a tuning parameter and will be determined later. By considering the sub-model , we can control its Fisher information around with a function of and . Optimizing over yields an upper bound that depends on . Finally, the local upper bound on the quantized Fisher information can then be transformed to the local minimax lower bound on the error via the van Trees inequality Gill et al. (1995).
Before entering the main proof, we first introduce some notation that will be used throughout this section. Let be the sorted sequence of in the non-increasing order; that is, for all . Denote as the corresponding sorting function (with a slight abuse of notation, we overload so that , i.e. for all ).
Constructing the sub-model
We construct by “freezing” the smallest coordinates of and letting only the largest coordinates be free parameters. Mathematically, let
where is fixed when are determined. For instance, if (so ) and , then the corresponding sub-model is
Bounding the quantized Fisher information
Now recall that under this model, the score function
can be computed as
The next lemma shows that to bound the quantized Fisher information, it suffices to control the variance of the score function.
Lemma 6.1 (Theorem 1 in Barnes et al. (2019))
Let be any -bit quantization scheme and let be the Fisher information of at , where and . Then for any ,
Therefore, for any unit vector with , we control the variance as follows:
where (a) holds since the score function has zero mean. This allows us to upper bound in a neighborhood around , where is the location of in the sub-model , i.e.
Bounding the error
Applying (Barnes et al., 2019, Theorem 3) on , we obtain
where the second inequality is due to , and the third inequality holds if we pick . Notice that in order to satisfy the condition , must be at most , so we have an implicit sample size requirement: must be at least .
Finally, we maximize over to obtain the best lower bound. The following simple but crucial lemma relates to .
For any and , it holds that
for and a universal constant small enough.
Picking and by Lemma 6.2 and , we obtain that for all
as long as and , where (a) holds due to the second result of Lemma 6.2 and . In addition, the sample size constraint that must be larger than can be satisfied if since , where the first inequality is due to Lemma 6.2 and the second one is due to . The proof is complete by observing that
7 Conclusion and open problems
We have investigated distribution estimation under -bit communication constraints and characterized the local complexity of a target distribution . We show that under loss, the half-norm of dictates the convergence rate of the estimation error. In addition, to achieve the optimal (centralized) convergence rate, bits of communication is both necessary and sufficient.
Many interesting questions remain to be addressed, including investigating if the same lower bound can be achieved by non-interactive schemes, deriving the local complexity under general information constraints (such as -local differential privacy constraint), and extending these results to the distribution-free setting (i.e. the frequency estimation problem).
This work was supported in part by a Google Faculty Research Award, a National Semiconductor Corporation Stanford Graduate Fellowship, and the National Science Foundation under grants CCF-1704624 and NeTS-1817205.
- Inference under information constraints II: communication constraints and shared randomness. arXiv preprint arXiv:1905.08302.
- Inference under information constraints: lower bounds from chi-square contraction. In Conference on Learning Theory, pp. 3–17.
- General lower bounds for interactive high-dimensional estimation under information constraints. arXiv preprint arXiv:2010.06562.
- Estimating sparse discrete distributions under local privacy and communication constraints. In Algorithmic Learning Theory.
- Hadamard response: estimating distributions privately, efficiently, and with little communication. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1120–1129.
- Communication complexity in locally private distribution estimation and heavy hitters. In International Conference on Machine Learning, pp. 51–60.
- Statistical inference under order restrictions; the theory and application of isotonic regression. Technical report.
- Fisher information under local differential privacy. arXiv preprint arXiv:2005.10783.
- Lower bounds for learning distributions under communication constraints via Fisher information.
- Practical locally private heavy hitters. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, pp. 2285–2293.
- Local, private, efficient protocols for succinct histograms. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC ’15, New York, NY, USA, pp. 127–135.
- Heavy hitters and the structure of local privacy. ACM Transactions on Algorithms (TALG) 15 (4), pp. 1–40.
- Breaking the communication-privacy-accuracy trilemma. Advances in Neural Information Processing Systems 33.
- Breaking the dimension dependence in sparse distribution estimation under communication constraints. In Proceedings of Thirty Fourth Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 134, pp. 1028–1059.
- Nonparametric density estimation: the L1 view. Wiley Interscience Series in Discrete Mathematics, Wiley.
- Combinatorial methods in density estimation. Springer Series in Statistics, Springer New York.
- The right complexity measure in locally private estimation: it is not the Fisher information. arXiv preprint arXiv:1806.05756.
- On communication cost of distributed statistical estimation and dimensionality. In Advances in Neural Information Processing Systems, pp. 2726–2734.
- Applications of the van Trees inequality: a Bayesian Cramér-Rao bound. Bernoulli 1 (1-2), pp. 59–79.
- Distributed statistical estimation of high-dimensional and nonparametric distributions. In 2018 IEEE International Symposium on Information Theory (ISIT), pp. 506–510.
- Geometric lower bounds for distributed parameter estimation under communication constraints. In Conference On Learning Theory, pp. 3163–3188.
- Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.
- Communication complexity. In Advances in Computers, Vol. 44, pp. 331–360.
- Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A 185, pp. 71–110.
- Density estimation for statistics and data analysis. Chapman & Hall, London.
- Locally differentially private protocols for frequency estimation. In 26th USENIX Security Symposium (USENIX Security 17), pp. 729–745.
- Optimal schemes for discrete distribution estimation under local differential privacy. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 759–763.
- Federated heavy hitters discovery with differential privacy. In International Conference on Artificial Intelligence and Statistics, pp. 3837–3847.
Appendix A Additional Remarks
A.1 Second-order terms in the upper and lower bounds
In terms of the convergence rate, Theorem 2.1 and Theorem 2.2 admit large second-order terms which may dominate the MSEs when . However, our results improve the sample complexity for the high-accuracy regime. More precisely, as a direct corollary of Theorem 2.1, the sample complexity is . In addition, the requirement is necessary for all of the previous global minimax schemes. For instance, all the minimax upper and lower bounds for distribution estimation (without any additional assumptions on the target distribution) in previous works Han et al. (2018a, b); Barnes et al. (2019); Acharya et al. (2019b) require .
That being said, with additional prior knowledge on the target distribution such as sparse/near-sparse assumptions, we can easily improve our two-stage scheme by replacing the uniform grouping localization step with other sparse estimation schemes Chen et al. (2021), and the resulting sample size requirement can be decreased to .
A.2 Achieving the centralized convergence without the knowledge of
We note that when the goal is to achieve the centralized rate with minimal communication (instead of achieving the best convergence rate for a fixed communication budget as we have assumed so far), the performance promised in the above corollary can be achieved without knowing beforehand. We can do that by modifying our proposed scheme to include an initial round for estimating . More precisely, we can have the first clients communicate only bit about their sample and estimate within bit accuracy. When is sufficiently large (e.g. ), this estimation task will be successful with negligible error probability. Then the remaining clients can implement the two-round scheme we propose to estimate (i.e. the scheme described in Section 4) by using bits per client. Under this strategy, we are guaranteed to achieve the centralized performance while each client communicates no more than bits.