A key property of machine learning systems is their ability to generalize to new and unknown data. Such a system is trained on a particular set of data, but must then perform well even on new datapoints that have not previously been considered. This ability, termed generalization, can be formulated in the language of statistical learning theory by considering the generalization error of an algorithm, i.e., the difference between the population risk of a model trained on a particular dataset and the empirical risk for the same model and dataset. We say that a model generalizes well if it has a small generalization error. Because models are often trained by minimizing empirical risk or some regularized version of it, a small generalization error also implies a small population risk, which is the average loss over new samples taken randomly from the population. It is therefore of interest to upper bound the generalization error and understand which quantities control it, so that we can quantify the generalization properties of a machine learning system and offer guarantees about how well it will perform.
In recent years, it has been shown that information theoretic quantities such as mutual information can be used to bound generalization error under assumptions on the tail of the distribution of the loss function [1, 2, 3]. In particular, when the loss function is sub-Gaussian, the expected generalization error scales at most with the square root of the mutual information between the training dataset and the model weights. These bounds offer an intuitive explanation for generalization and overfitting: if an algorithm uses only limited information from its training data, then this will bound the expected generalization error and prevent overfitting. Conversely, if a training algorithm uses all of the information from its training data in the sense that the model is a deterministic function of the training data, then this mutual information can be infinite and there is the possibility of unbounded generalization error and thus overfitting.
Another modern focus of machine learning systems has been that of distributed and federated learning [4, 5, 6]. In these systems, data is generated and processed in a distributed network of machines. The main differences between the distributed and centralized settings are the information constraints imposed by the network. There has been considerable interest in understanding the impact of both communication constraints [7, 8] and privacy constraints [9, 10, 11, 12] on the performance of machine learning systems, and in designing protocols that efficiently train systems under these constraints.
Since both communication and local differential privacy constraints can be thought of as special cases of mutual information constraints, they should pair naturally with some form of information theoretic generalization bound in order to induce control over the generalization error of the distributed machine learning system. The information constraints inherent to the network can themselves give rise to tighter bounds on generalization error and thus provide better guarantees against overfitting. Along these lines, in recent work [13], a subset of the present authors introduced the framework of using information theoretic quantities to bound both expected generalization error and a measure of privacy leakage in distributed and federated learning systems. The generalization bounds in this work, however, are essentially the same as those obtained by thinking of the entire system, from the data at each node in the network to the final aggregated model, as a single centralized algorithm. Any improved generalization guarantees from these bounds would remain implicit in the mutual information terms involved.
In this work, we develop improved bounds on expected generalization error for distributed and federated learning systems. Instead of leaving the differences between these systems and their centralized counterparts implicit in the mutual information terms, we bring analysis of the structure of the systems directly into the bounds. By working with the contribution from each node separately, we are able to derive upper bounds on expected generalization error that scale with the number of nodes $K$ as $O(1/K)$ instead of $O(1/\sqrt{K})$. This improvement is shown to be tight for certain examples such as learning the mean of a Gaussian with squared loss. We develop bounds that apply to distributed systems in which the submodels from each one of $K$ different nodes are averaged together, as well as bounds that apply to more complicated multiround stochastic gradient descent (SGD) algorithms such as in federated learning. For linear models with Bregman divergence losses, these “per node” bounds are in terms of the mutual information between the training dataset and the trained weights at each node, and are therefore useful in describing the generalization properties inherent to having communication or privacy constraints at each node. For arbitrary nonlinear models that have Lipschitz continuous losses, the improved dependence of $O(1/K)$ can still be recovered, but without a description in terms of mutual information. We demonstrate the improvements given by our bounds over the existing information theoretic generalization bounds via simulation of a distributed linear regression example.
I-A Technical Preliminaries
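Suppose we have independent and identically distributed (i.i.d.) data $Z_i \sim \pi$ for $i = 1, \ldots, n$ and let $S = (Z_1, \ldots, Z_n)$. Suppose further that $W = \mathcal{A}(S)$ is the output of a potentially stochastic algorithm. Let $\ell(w, z)$ be a real-valued loss function and define
$$L(w) = \mathbb{E}_{Z \sim \pi}[\ell(w, Z)]$$
to be the population risk for weights (or model) $w$. We similarly define
$$L_s(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i)$$
to be the empirical risk on dataset $s = (z_1, \ldots, z_n)$ for model $w$. The generalization error for dataset $s$ is then
$$\Delta_{\mathcal{A}}(s) = L(\mathcal{A}(s)) - L_s(\mathcal{A}(s))$$
and the expected generalization error is
$$\mathrm{gen}(\pi, \mathcal{A}) = \mathbb{E}[\Delta_{\mathcal{A}}(S)],$$
where the expectation is also over any randomness in the algorithm. Below we present some standard results on the expected generalization error that will be needed.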
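To make the definitions of empirical risk, population risk, and generalization error concrete, the following is a minimal sketch under illustrative assumptions that are not taken from the paper: the population is uniform over a small finite set (so the population risk can be computed exactly), the loss is the scalar squared loss, and the algorithm is the sample mean. All function names are hypothetical.

```python
from statistics import fmean

def squared_loss(w, z):
    """Assumed loss: squared error of scalar model w on datapoint z."""
    return (w - z) ** 2

def empirical_risk(w, dataset, loss=squared_loss):
    """Average loss of model w over the training dataset."""
    return fmean([loss(w, z) for z in dataset])

def population_risk(w, support, loss=squared_loss):
    """Exact expected loss of w when Z is uniform over a finite support
    (an assumption made here so that no Monte Carlo estimate is needed)."""
    return fmean([loss(w, z) for z in support])

def generalization_error(algorithm, dataset, support, loss=squared_loss):
    """Population risk minus empirical risk of the model trained on dataset."""
    w = algorithm(dataset)
    return population_risk(w, support, loss) - empirical_risk(w, dataset, loss)

if __name__ == "__main__":
    support = [0.0, 1.0, 2.0, 3.0]   # hypothetical population
    dataset = [0.0, 1.0]             # one particular training set
    print(generalization_error(fmean, dataset, support))
```

Averaging `generalization_error` over many independently drawn datasets would give an estimate of the expected generalization error.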
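Theorem 1 (Leave-one-out Expansion – Lemma 11 in [14]).
Let $S^{(i)}$ be a version of the dataset $S = (Z_1, \ldots, Z_n)$ with $Z_i$ replaced by an i.i.d. copy $Z_i'$. Denote $W^{(i)} = \mathcal{A}(S^{(i)})$ and $W = \mathcal{A}(S)$. Then
$$\mathbb{E}[L(W) - L_S(W)] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\ell(W^{(i)}, Z_i) - \ell(W, Z_i)\right].$$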
In many of the results in this paper, we will use one of the two following assumptions.
The loss function satisfies
for , , where are taken independently from the marginals for , respectively,
The next assumption is a special case of the previous one with
The loss function is sub-Gaussian with parameter in the sense that
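Recall that for a continuously differentiable and strictly convex function $F : \mathbb{R}^d \to \mathbb{R}$, we define the associated Bregman divergence [15] between two points $p, q \in \mathbb{R}^d$ to be
$$D_F(p, q) = F(p) - F(q) - \langle \nabla F(q), p - q \rangle,$$
where $\langle \cdot, \cdot \rangle$ denotes the usual inner product.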
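The standard definition of the Bregman divergence can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical, vectors are plain Python lists, and the two generating functions (squared Euclidean norm and negative entropy) are common textbook choices rather than ones singled out by the paper at this point.

```python
import math

def bregman_divergence(F, grad_F, p, q):
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q> for list-valued vectors."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_F(q), p, q))
    return F(p) - F(q) - inner

def sq_norm(x):
    """F(x) = ||x||_2^2."""
    return sum(xi * xi for xi in x)

def grad_sq_norm(x):
    return [2.0 * xi for xi in x]

def neg_entropy(x):
    """F(x) = sum_i x_i log x_i, defined for positive entries."""
    return sum(xi * math.log(xi) for xi in x)

def grad_neg_entropy(x):
    return [math.log(xi) + 1.0 for xi in x]

if __name__ == "__main__":
    p, q = [1.0, 2.0], [3.0, -1.0]
    # With F = squared norm this equals the squared distance ||p - q||^2.
    print(bregman_divergence(sq_norm, grad_sq_norm, p, q))
    # With F = negative entropy and probability vectors this equals KL(p || q).
    print(bregman_divergence(neg_entropy, grad_neg_entropy, [0.5, 0.5], [0.25, 0.75]))
```

The first case is the one that yields the squared loss; the second shows that the same definition covers non-symmetric divergences.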
II Distributed Learning and Model Aggregation
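Now suppose that there are $K$ nodes each with $n$ samples. Each node $k = 1, \ldots, K$ has dataset $S_k = (Z_{1,k}, \ldots, Z_{n,k})$ with $Z_{i,k}$ taken i.i.d. from the data distribution $\pi$. We use $S = (S_1, \ldots, S_K)$ to denote the entire dataset of size $nK$. Each node locally trains a model $W_k = \mathcal{A}_k(S_k)$ with algorithm $\mathcal{A}_k$. After each node locally trains its model, the models $W_k$ are then combined to form the final model $\hat{W}$ using an aggregation algorithm $\hat{W} = \hat{\mathcal{A}}(W_1, \ldots, W_K)$. See Figure 1. In this section we will assume that $W_k \in \mathbb{R}^d$ and that the aggregation is done by simple averaging, i.e.,
$$\hat{W} = \frac{1}{K} \sum_{k=1}^{K} W_k.$$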
Define to be the total algorithm from data to the final weights so that
Suppose that is a convex function of for each and that represents the empirical risk minimization algorithm on local dataset in the sense that
While Theorem 3 seems like a nice characterization of generalization bounds for the aggregate model (in that the aggregate generalization error cannot be any larger than the average of the generalization errors over each node), it does not offer the improvement in expected generalization error that one might expect given $nK$ total samples (from $K$ nodes with $n$ samples each) instead of just $n$ samples. A naive application of the information theoretic generalization bounds from Theorem 2, followed by the data processing inequality, runs into the same problem.
II-A Improved Bounds
In this subsection, we prove bounds on expected generalization error that remedy the above shortcomings. In particular, we would like the following two properties.
(a) The bound should decay with the number of nodes in order to take advantage of the total dataset from all nodes.
(b) The bound should be in terms of the information theoretic quantities which can represent (or be upper bounded by) the capacities of the channels that the nodes are communicating over. This can, for example, represent a communication or local differential privacy constraint for each node.
At a high level, we will improve on the bound from Theorem 3 by taking into account the fact that, with $K$ nodes, a small change in one node's dataset will only change the aggregate model by a fraction $1/K$ of the amount that it will change that node's local model. In the case that the model is a linear or location model, and the loss is a Bregman divergence, we can obtain an upper bound on expected generalization error that satisfies both properties (a) and (b) as follows.
Theorem 4 (Linear or Location Models with Bregman Loss).
Suppose the loss takes the form of one of:
and that Assumption 1 holds. Then
Here we restrict our attention to case (ii), but the two cases have nearly identical proofs. Using Theorem 1,
In (7), we use to denote . Line (6) follows by the linearity of the inner product and by canceling the higher order terms and which have the same expected values. The key step (7) then follows by noting that only differs from in the submodel coming from node , which is multiplied by a factor of when averaging all of the submodels. By backing out step and re-adding the appropriate canceled terms we get
By applying Theorem 2,
Then, by noting that is non-decreasing and concave,
as desired. ∎
The result in Theorem 4 is general enough to apply to many problems of interest. For example, if $F(p) = \|p\|_2^2$, then the Bregman divergence gives the ubiquitous squared loss, i.e., $D_F(p, q) = \|p - q\|_2^2$. Theorem 4 can therefore apply to ordinary least squares regression, which we will look at in more detail in Section IV. Other regression models such as logistic regression have a loss function that cannot be described with a Bregman divergence without the inclusion of an additional nonlinearity. However, the result in Theorem 4 is agnostic to the algorithm that each node uses to fit its individual model. In this way, each node could be fitting a logistic model to its data, and the total aggregate model would then be an average over these logistic models. Theorem 4 would still control the expected generalization error for the aggregate model with the extra $1/K$ factor, where $K$ is the number of nodes. Critically, however, the upper bound would only be for generalization error that is with respect to a loss of the form required by Theorem 4, such as squared loss.
In order to show that the dependence on the number of nodes from Theorem 4 is tight for certain problems, consider the following example from [3]. Suppose that $Z \sim \mathcal{N}(\mu, \sigma^2 I_d)$ and $\ell(w, z) = \|w - z\|_2^2$ so that we are trying to learn the mean $\mu$ of the Gaussian. An obvious algorithm for each node to use is simple averaging of its dataset:
For this algorithm, it can be shown that
However, for this choice of algorithm at each node, the true expected generalization error can be computed to be
Applying our new bound from Theorem 4, we get
which recovers the correct dependence on the number of nodes $K$ and improves upon the result from previous information theoretic methods.
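The $1/K$ behavior in the Gaussian mean example can be checked with a small simulation. The sketch below is illustrative and rests on stated assumptions: isotropic Gaussian data with standard deviation `sigma` in dimension `d`, squared loss, each of `K` nodes averaging its own `n` samples, and the aggregate model being the average of the node means. Under these assumptions a direct calculation (not quoted from the paper) gives an expected generalization error of $2\sigma^2 d/(nK)$, which is what `closed_form_gap` returns. The function names are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two list-valued vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def closed_form_gap(n, K, d, sigma):
    """Expected generalization error 2*sigma^2*d/(n*K) under the stated assumptions."""
    return 2.0 * sigma ** 2 * d / (n * K)

def simulate_gap(n, K, d, sigma, trials, seed=0):
    """Monte Carlo estimate of E[population risk - empirical risk] for the
    average of K node-level sample means."""
    rng = random.Random(seed)
    mu = [0.0] * d  # true mean; the gap does not depend on its value
    total = 0.0
    for _ in range(trials):
        nodes = [[[rng.gauss(m, sigma) for m in mu] for _ in range(n)]
                 for _ in range(K)]
        node_means = [[sum(z[j] for z in data) / n for j in range(d)]
                      for data in nodes]
        w_hat = [sum(m[j] for m in node_means) / K for j in range(d)]
        # Empirical risk over all n*K training points.
        emp = sum(sq_dist(w_hat, z) for data in nodes for z in data) / (n * K)
        # Population risk in closed form: E||w - Z||^2 = ||w - mu||^2 + d*sigma^2.
        pop = sq_dist(w_hat, mu) + d * sigma ** 2
        total += pop - emp
    return total / trials

if __name__ == "__main__":
    for K in (1, 2, 4, 8):
        est = simulate_gap(n=5, K=K, d=2, sigma=1.0, trials=5000)
        print(K, round(est, 3), closed_form_gap(5, K, 2, 1.0))
```

Doubling `K` should roughly halve the estimated gap, which is the dependence that a bound decaying only as $1/\sqrt{K}$ cannot capture.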
II-B General Models and Losses
In this section we briefly describe some results that hold for more general classes of models and loss functions, such as deep neural networks and other nonlinear models.
Theorem 5 (Lipschitz Continuous Loss).
Suppose that is Lipschitz continuous as a function of in the sense that
for any , and that
for each . Then
Starting with Theorem 1,
The bound in Theorem 5 is not in terms of information theoretic quantities, but it does show that the improved $O(1/K)$ dependence on the number of nodes $K$ can be obtained for much more general loss functions and arbitrary nonlinear models.
II-C Privacy and Communication Constraints
Both communication constraints and local differential privacy constraints can be thought of as special cases of mutual information constraints. Motivated by this observation, Theorem 4 immediately implies corollaries for these types of systems.
Corollary 1 (Privacy Constraints).
III Iterative Algorithms
We now turn to considering more complicated multi-round and iterative algorithms. In this setup, after rounds there is a sequence of weights and the final model is a function of that sequence where gives a linear combination of the vectors . The function could represent, for example, averaging over the iterates, picking out the last iterate , or some weighted average over the iterates. On each round , each node produces an updated model based on its local dataset and the previous timestep’s global model . The global model is then updated via an average over all updated submodels:
The particular example that we will consider is that of distributed SGD, where each node constructs its updated model by taking one or more gradient steps starting from the previous global model with respect to random minibatches of its local data. Our model is general enough to account for multiple local gradient steps, as are used in so-called Federated Learning [4, 5, 6], as well as noisy versions of SGD such as in [17, 18]. If only one local gradient step is taken on each iteration, then the update rule for this particular example could be written as
where is a data point (or minibatch) sampled from on timestep , is the learning rate, and is some potential added noise. We assume that the data points are sampled without replacement so that the samples are distinct across different values of .
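The round structure described above (one local gradient step per node from the shared global model, optional added noise, then averaging) can be sketched as follows. This is an illustration under assumptions that are not from the paper: the loss is the squared loss $\|w - z\|_2^2$, the added noise is Gaussian, the learning rate is constant, and round $t$ uses each node's $t$-th sample so that no sample is reused. The function names are hypothetical.

```python
import random

def grad_squared_loss(w, z):
    """Gradient in w of the assumed loss ||w - z||^2."""
    return [2.0 * (wi - zi) for wi, zi in zip(w, z)]

def distributed_sgd(node_data, w0, lr, noise_std=0.0, seed=0):
    """Run T rounds of single-step distributed SGD, where T is the number of
    samples per node. Returns the list of global iterates, one per round."""
    rng = random.Random(seed)
    K = len(node_data)
    T = len(node_data[0])
    w = list(w0)
    iterates = []
    for t in range(T):
        local_models = []
        for k in range(K):
            # Sampling without replacement: node k uses its t-th sample on round t.
            g = grad_squared_loss(w, node_data[k][t])
            local = [wi - lr * gi + rng.gauss(0.0, noise_std)
                     for wi, gi in zip(w, g)]
            local_models.append(local)
        # Global update: plain average of the K updated submodels.
        w = [sum(m[j] for m in local_models) / K for j in range(len(w))]
        iterates.append(w)
    return iterates

if __name__ == "__main__":
    data = [[[1.0], [3.0]], [[3.0], [5.0]]]  # 2 nodes, 2 scalar samples each
    print(distributed_sgd(data, w0=[0.0], lr=0.5))
```

A final model could then be formed as any linear combination of the returned iterates, for instance the last iterate or their average.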
For this type of iterative algorithm, we will consider the following timestep averaged empirical risk quantity:
and the corresponding generalization error
Note that the quantity in (12) is slightly different than the end-to-end generalization error that we would get considering the final model and whole dataset . It is instead an average over the generalization error we would get from each model stopping at iteration . We do this so that when we apply the leave-one-out expansion from Theorem 1, we do not have to account for the dependence of on past samples for and . Since we expect the generalization error to decrease as we use more samples, this quantity should result in a more conservative upper bound and be a reasonable surrogate object to study. The following bound follows as a corollary to Theorem 4.
IV Simulations
We simulated a distributed linear regression example in order to demonstrate the improvement in our bounds over the existing information theoretic bounds. To do this, we generated $n$ synthetic datapoints at each of $K$ different nodes for various values of $K$. Each datapoint consisted of a feature and response pair generated from a randomly drawn true weight that was common to all datapoints. Each node constructed an estimate of the true weight using the well-known normal equations, which minimize the squared loss. The aggregate model was then the average of the node estimates. In order to estimate the old and new information theoretic generalization bounds (i.e., the bounds from Theorems 2 and 4, respectively), this procedure was repeated a number of times and the datapoint and model values were binned in order to estimate the mutual information quantities. The number of repetitions was increased until the mutual information estimates were no longer particularly sensitive to the number and widths of the bins. In order to estimate the true generalization error, the expectations for both the population risk and the dataset were estimated by Monte Carlo trials. The results can be seen in Figure 2, where it is evident that the new information theoretic bound is much closer to the true expected generalization error, and decays with an improved rate as a function of the number of nodes $K$.
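A simulation of this kind can be sketched in a few lines. The code below is not the authors' experiment: it assumes scalar features drawn from a standard Gaussian, responses equal to feature times true weight plus Gaussian noise, and a regression without intercept, so that the normal equations reduce to a ratio of sums. It also includes a simple plug-in mutual information estimate from equal-width bins (in nats) as one possible way to carry out the binning step. All names and parameter values are hypothetical.

```python
import math
import random
from collections import Counter

def fit_node(xs, ys):
    """Scalar normal equation: argmin_w sum (y - x*w)^2 = sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def expected_gap(n, K, noise_std=1.0, trials=3000, seed=0):
    """Monte Carlo estimate of the expected generalization error of the
    averaged model under the assumed data model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        w_true = rng.gauss(0.0, 1.0)
        xs_all, ys_all, node_models = [], [], []
        for _ in range(K):
            xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
            ys = [x * w_true + rng.gauss(0.0, noise_std) for x in xs]
            node_models.append(fit_node(xs, ys))
            xs_all += xs
            ys_all += ys
        w_hat = sum(node_models) / K  # aggregate model: average of node fits
        emp = sum((y - x * w_hat) ** 2 for x, y in zip(xs_all, ys_all)) / (n * K)
        # Closed-form population risk for unit-variance features:
        # E(y - x*w)^2 = (w - w_true)^2 + noise_std^2.
        pop = (w_hat - w_true) ** 2 + noise_std ** 2
        total += pop - emp
    return total / trials

def binned_mutual_information(xs, ys, bins):
    """Plug-in estimate of I(X; Y) in nats from equal-width histogram bins."""
    def to_bins(vals):
        lo, hi = min(vals), max(vals)
        width = (hi - lo) / bins or 1.0  # guard against a constant sample
        return [min(int((v - lo) / width), bins - 1) for v in vals]
    bx, by = to_bins(xs), to_bins(ys)
    m = len(xs)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / m) * math.log(c * m / (px[a] * py[b]))
               for (a, b), c in joint.items())

if __name__ == "__main__":
    for K in (1, 2, 4, 8):
        print(K, round(expected_gap(n=10, K=K), 4))
```

Plug-in estimates of this type are sensitive to the number of samples and bins, which is consistent with the need to increase the number of repetitions until the estimates stabilize.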
- [1] D. Russo and J. Zou, “How much does your data exploration overfit? Controlling bias via information usage,” IEEE Transactions on Information Theory, vol. 66, no. 1, pp. 302–323, 2020.
- [2] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of learning algorithms,” in NIPS, 2017, pp. 2521–2530.
- [3] Y. Bu, S. Zou, and V. V. Veeravalli, “Tightening mutual information-based bounds on generalization error,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 121–130, 2020.
- [4] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of AISTATS, 2017.
- [5] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, “Federated optimization: Distributed machine learning for on-device intelligence,” CoRR, vol. abs/1610.02527, 2016.
- [6] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” in Proceedings of the NIPS Workshop on Private Multi-Party Machine Learning, 2016.
- [7] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” in Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.
- [8] L. P. Barnes, H. A. Inan, B. Isik, and A. Ozgur, “rTop-k: A statistical estimation approach to distributed SGD,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 3, pp. 897–907, 2020.
- [9] S. L. Warner, “Randomized response: A survey technique for eliminating evasive answer bias,” Journal of the American Statistical Association, vol. 60, no. 309, pp. 63–69, 1965.
- [10] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of Cryptography Conference, S. Halevi and T. Rabin, Eds. Springer, Berlin, Heidelberg, 2006.
- [11] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith, “What can we learn privately?” SIAM Journal on Computing, vol. 40, no. 3, pp. 793–826, 2011.
- [12] P. Cuff and L. Yu, “Differential privacy as a mutual information constraint,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 43–54.
- [13] S. Yagli, A. Dytso, and H. Vincent Poor, “Information-theoretic bounds on the generalization error and privacy leakage in federated learning,” in Proceedings of the 2020 IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2020, pp. 1–5.
- [14] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, “Learnability, stability and uniform convergence,” Journal of Machine Learning Research, vol. 11, pp. 2635–2670, 2010.
- [15] L. M. Bregman, “The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming,” USSR Computational Mathematics and Mathematical Physics, vol. 7, no. 3, pp. 200–217, 1967.
- [16] A. Banerjee, S. Merugu, I. S. Dhillon, J. Ghosh, and J. Lafferty, “Clustering with Bregman divergences,” Journal of Machine Learning Research, vol. 6, no. 10, 2005.
- [17] A. Pensia, V. Jog, and P.-L. Loh, “Generalization error bounds for noisy, iterative algorithms,” in Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), 2018, pp. 546–550.
- [18] H. Wang, R. Gao, and F. P. Calmon, “Generalization bounds for noisy iterative algorithms using properties of additive noise channels,” 2021.
- [19] V. Feldman and J. Vondrak, “High probability generalization bounds for uniformly stable algorithms with nearly optimal rate,” in Proceedings of the Thirty-Second Conference on Learning Theory, ser. Proceedings of Machine Learning Research, A. Beygelzimer and D. Hsu, Eds., vol. 99. Phoenix, USA: PMLR, 25–28 Jun 2019, pp. 1270–1279. [Online]. Available: http://proceedings.mlr.press/v99/feldman19a.html
- [20] A. R. Esposito, M. Gastpar, and I. Issa, “Generalization error bounds via Rényi-, f-divergences and maximal leakage,” IEEE Transactions on Information Theory, vol. 67, no. 8, pp. 4986–5004, 2021.
- [21] A. R. Asadi, E. Abbe, and S. Verdú, “Chaining mutual information and tightening generalization bounds,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18, 2018, pp. 7245–7254.
- [22] M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu, “Information-theoretic analysis of stability and bias of learning algorithms,” in Proceedings of the 2016 IEEE Information Theory Workshop (ITW), 2016, pp. 26–30.
- [23] J. Jiao, Y. Han, and T. Weissman, “Dependence measures bounding the exploration bias for general measurements,” in Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT). IEEE, 2017, pp. 1475–1479.