Many problems in machine learning and optimization require that we produce an accurate estimate of a square matrix, i.e., a random matrix such that
. In these cases, taking a uniform average of those independent copies provides a natural strategy for boosting the estimation accuracy, essentially by making use of the law of large numbers:. For many other problems, however, we are more interested in the inverse (Hessian/covariance) matrix , and it is necessary or desirable to work with as the estimator. Here, a naïve averaging approach has certain fundamental limitations (described in more detail below). The basic reason for this is that , i.e., that there is what may be called an inversion bias.
In this paper, we propose a method to address this inversion bias challenge. The method uses a weighted average, where the weights are carefully chosen to compensate for and correct the bias. Our motivation comes from distributed Newton’s method (explained shortly), where combining independent estimates of the inverse Hessian is desired, but our method is more generally applicable, and so we first state our key ideas in a more general context.
Let be independent random variables and
be independent random variables andbe fixed square rank- matrices. If is invertible almost surely, then the inverse of the matrix can be expressed as:
To demonstrate the implications of Theorem 1, suppose that our goal is to estimate for some linear function . For example, in the case of Newton’s method , where is the gradient and is the Hessian. Another example would be , where is the covariance matrix of a dataset and is the matrix trace, which is useful for uncertainty quantification. For these and other cases, consider the following estimation of , which takes an average of the individual estimates , each weighted by the determinant of , i.e.,
By applying the law of large numbers (separately to the numerator and the denominator), Theorem 1 easily implies that if are i.i.d. copies of then this determinantal averaging estimator is asymptotically consistent, i.e., , almost surely. This determinantal averaging estimator is particularly useful when problem constraints do not allow us to compute , e.g., when the matrices are distributed and not easily combined.
To establish finite sample convergence guarantees for estimators obtained via determinantal averaging, we establish the following matrix concentration result. We state it separately since it is technically interesting and since its proof requires novel bounds for the higher moments of the determinant of a random matrix, which is likely to be of independent interest. Below and throughout the paper, denotes an absolute constant and “” is the Löwner order on positive semi-definite (psd) matrices.
Let and , where is a positive definite matrix and are i.i.d. . Moreover, assume that all are psd, and rank-. If for and , then
where and .
1.1 Distributed Newton’s method
To illustrate how determinantal averaging can be useful in the context of distributed optimization, consider the task of batch minimization of a convex loss over vectors, defined as follows:
where , and are convex, twice differentiable and smooth. Given a vector , Newton’s method dictates that the correct way to move towards the optimum is to perform an update , with , where denotes the inverse Hessian of at .111Clearly, one would not actually compute the inverse of the Hessian explicitly [XRKM17, YXRKM18]. We describe it this way for simplicity. Our results hold whether or not the inverse operator is computed explicitly. Here, the Hessian and gradient are:
For our distributed Newton application, we study a distributed computational model, where a single machine has access to a subsampled version of with sample size parameter :
Note that accesses on average loss components ( is the expected local sample size), and moreover, for any . The goal is to compute local estimates of the Newton’s step in a communication-efficient manner (i.e., by only sending parameters from/to a single machine), then combine them into a better global estimate. The gradient has size so it can be computed exactly within this communication budget (e.g., via map-reduce), however the Hessian has to be approximated locally by each machine. Note that other computational models can be considered, such as those where the global gradient is not computed (and local gradients are used instead).
Under the constraints described above, the most natural strategy is to use directly the Hessian of the locally subsampled loss (see, e.g., GIANT [WRKXM18]), resulting in the approximate Newton step . Suppose that we independently construct i.i.d. copies of this estimate: (here, is the number of machines). Then, for sufficiently large , taking a simple average of the estimates will stop converging to because of the inversion bias: . Figure 1 shows this by plotting the estimation error (in Euclidean distance) of the averaged Newton step estimators, when the weights are uniform and determinantal (for more details and plots, see Appendix C).
The only way to reduce the estimation error beyond a certain point is to increase the local sample size (thereby reducing the inversion bias), which raises the computational cost per machine. Determinantal averaging corrects the inversion bias so that estimation error can always be decreased by adding more machines without increasing the local sample size. From the preceding discussion we can easily show that determinantal averaging leads to an asymptotically consistent estimator. This is a corollary of Theorem 1, as proven in Section 2.
Let be i.i.d. samples of (2) and define . Then:
The (unnormalized) determinantal weights can be computed locally in the same time as it takes to compute the Newton estimates so they do not add to the overall cost of the procedure. While this result is only an asymptotic statement, it holds with virtually no assumptions on the loss function (other than twice-differentiability) or the expected local sample size . However, with some additional assumptions we will now establish a convergence guarantee with a finite number of machines by bounding the estimation error for the determinantal averaging estimator of the Newton step.
In the next result, we use Mahalanobis distance, denoted , to measure the error of the Newton step estimate (i.e., the deviation from optimum ), with chosen as the Hessian of . This choice is motivated by standard convergence analysis of Newton’s method, discussed next. This is a corollary of Theorem 2, as explained in Section 3.
For any if expected local sample size satisfies then
where , and , and are defined as in Corollary 3.
We next establish how this error bound impacts the convergence guarantees offered by Newton’s method. Note that under our assumptions is strongly convex so there is a unique minimizer . We ask how the distance from optimum, , changes after we make an update . For this, we have to assume that the Hessian matrix is -Lipschitz as a function of . After this standard assumption, a classical analysis of the Newton’s method reveals that Corollary 4 implies the following Corollary 6 (proof in Appendix B).
The Hessian is -Lipschitz: for any .
The bound is a maximum of a linear and a quadratic convergence term. As goes to infinity and/or goes to the approximation coefficient in the linear term disappears and we obtain exact Newton’s method, which exhibits quadratic convergence (at least locally around ). However, decreasing means increasing and with it the average computational cost per machine. Thus, to preserve the quadratic convergence while maintaining a computational budget per machine, as the optimization progresses we have to increase the number of machines while keeping fixed. This is only possible when we correct for the inversion bias, which is done by determinantal averaging.
1.2 Distributed data uncertainty quantification
Here, we consider another example of when computing a compressed linear representation of the inverse matrix is important. Let be an matrix where the rows represent samples drawn from a population for statistical analysis. The sample covariance matrix holds the information about the relations between the features. Assuming that is invertible, the matrix , also known as the precision matrix, is often used to establish a degree of confidence we have in the data collection [KBCG13]. The diagonal elements of
are particularly useful since they hold the variance information of each individual feature. Thus, efficiently estimating either the entire diagonal, its trace, or some subset of its entries is of practical interest[Ste97, WLK16, BCF09]. We consider the distributed setting where data is separately stored in batches and each local covariance is modeled as:
For each of the local covariances , we compute its compressed uncertainty information: , where we added a small amount of ridge to ensure invertibility222Since the ridge term vanishes as goes to infinity, we are still estimating the ridge-free quantity .. Here, may for example denote the trace or the vector of diagonal entries. We arrive at the following asymptotically consistent estimator for :
Note that the ridge term decays to zero as goes to infinity, which is why . Even though this limit holds for any local sample size , in practice we should choose sufficiently large so that is well-conditioned. In particular, Theorem 2 implies that if , where , then for we have w.p. .
1.3 Related work
Many works have considered averaging strategies for combining distributed estimates, particularly in the context of statistical learning and optimization. This research is particularly important in federated learning [KBRR16, KBY16], where data are spread out accross a large network of devices with small local storage and severely constrained communication bandwidth. Using averaging to combine local estimates has been studied in a number of learning settings [MMS09, MHM10] as well as for first-order stochastic optimization [ZWLS10, AD11]. For example, [ZDW13] examine the effectiveness of simple uniform averaging of emprical risk minimizers and also propose a bootstrapping technique to reduce the bias.
More recently, distributed averaging methods gained considerable attention in the context of second-order optimization, where the Hessian inversion bias is of direct concern. [SSZ14] propose a distributed approximate Newton-type method (DANE) which under certain assumptions exhibits low bias. This was later extended and improved upon by [ZL15, RKR16]. The GIANT method of [WRKXM18] most closely follows our setup from Section 1.1, providing non-trivial guarantees for uniform averaging of the Newton step estimates
(except they use with-replacement uniform sampling, whereas we use without-replacement, but that is typically a negligible difference). A related analysis of this approach is provided in the context of ridge regression by[WGM17]. Finally, [ABH17] propose a different estimate of the Hessian inverse by using the Neuman series expansion, and use it to construct Newton step estimates which exhibit low bias when the Hessian is sufficiently well-conditioned.
Our approach is related to recent developments in determinantal subsampling techniques (e.g. volume sampling), which have been shown to correct the inversion bias in the context of least squares regression [DW17, DWH19]. However, despite recent progress [DW18, DWH18], volume sampling is still far too computationally expensive to be feasible for distributed optimization. Indeed, often uniform sampling is the only practical choice in this context.
With the exception of the expensive volume sampling-based methods, all of the approaches discussed above, even under favorable conditions, use biased estimates of the desired solution (e.g., the exact Newton step). Thus, when the number of machines grows sufficiently large, with fixed local sample size, the averaging no longer provides any improvement. This is in contrast to our determinantal averaging, which converges exactly to the desired solution and requires no expensive subsampling. Therefore, it can scale with an arbitrarily large number of machines.
2 Expectation identities for determinants and adjugates
In this section, we prove Theorem 1 and Corollary 3, establishing that determinantal averaging is asymptotically consistent. To achieve this, we establish a lemma involving two expectation identities.
For a square matrix , we use to denote its adjugate, defined as an matrix whose th entry is , where denotes without th row and
th column. The adjugate matrix provides a key connection between the inverse and the determinant because for any invertible matrix, we have . In the following lemma, we will also use a formula called Sylvester’s theorem, relating the adjugate and the determinant: .
For , where are independently random and are square and rank-,
Proof We use induction over the number of components in the sum. If there is only one component, i.e., , then a.s. unless is , in which case (a) is trivial, and follows similarly. Now, suppose we showed the hypothesis when the number of components is and let . Setting , we have:
showing (a). Finally, (b) follows by applying (a) to each entry .
Similar expectation identities for the determinant have been given before [vdV65, DWH19, Der18]. None of them, however, apply to the random matrix as defined in Lemma 7, or even to the special case we use for analyzing distributed Newton’s method. Also, our proof method is quite different, and somewhat simpler, than those used in prior work. To our knowledge, the extension of determinantal expectation to the adjugate matrix has not previously been pointed out.
Proof of Theorem 1 When is invertible, its adjugate is given by , so the lemma implies that
from which Theorem 1 follows immediately.
3 Finite-sample convergence analysis
In this section, we prove Theorem 2 and Corollary 4, establishing that determinantal averaging exhibits a convergence rate, where is the number of sampled matrices (or the number of machines in distributed Newton’s method). For this, we need a tool from random matrix theory.
Lemma 8 (Matrix Bernstein [Tro12])
Consider a finite sequence of independent, random, self-adjoint matrices with dimension d such that and almost surely. If the sequence satisfies , then the following inequality holds for all :
The key component of our analysis is bounding the moments of the determinant and adjugate of a certain class of random matrices. This has to be done carefully, because higher moments of the determinant grow more rapidly than, e.g., for a sub-gaussian random variable. For this result, we do not require that the individual components of matrix be rank-1, but we impose several additional boundedness assumptions. In the proof below we apply the concentration inequality of Lemma 8 twice: first to the random matrix itself, and then also to its trace, which allows finer control over the determinant.
Let , where are independent, whereas and are psd matrices such that for all and . If for and , then
Proof We start by proving (a). Let and denote as the indicator variable of the event that . Since , we have:
Thus it suffices to bound the two integrals. We will start with the first one. Let . We use the matrix Bernstein inequality to control the extreme eigenvalues of the matrix (note that matrix cancels out because ). To do this, observe that and, moreover, , so:
Thus, applying Lemma 8 we conclude that for any :
Conditioning on the high-probability event given by (4) leads to the lower bound which is very loose. To improve on it, we use the following inequality, where denote the eigenvalues of :
Thus we obtain a tighter bound when is multiplied by , and now it suffices to upper bound the latter. This is a simple application of the scalar Bernstein’s inequality (Lemma 8 with ) for the random variables , which satisfy . Thus the scalar Bernstein’s inequality states that
Thus, for and we obtain that , and consequently,
because for . Note that for , so
In the interval , we have:
where the last inequality follows because
. Noting that
for any concludes the
proof of (a). The proof of (b), given in Appendix A,
follows similarly as above because for any positive
definite we have .
Having obtained bounds on the higher moments, we can now convert them to convergence with high probability for the average of determinants and the adjugates. Since determinant is a scalar variable, this follows by using standard arguments. On the other hand, for the adjugate matrix we require a somewhat less standard matrix extension of the Khintchine/Rosenthal inequalities (see Appendix A).
There is s.t. for as in Lemma 9 with all rank- and ,
where are independent copies of .
We are ready to show the convergence rate of determinantal averaging, which follows essentially by upper/lower bounding the enumerator and denominator separately, using Corollary 10.
Proof of Theorem 2 We will apply Corollary 10 to the matrices . Note that , where each satisfies . Therefore, Corollary 10 guarantees that for , with probability the following average of determinants is concentrated around 1:
along with a corresponding bound for the adjugate matrices. We obtain that with probability ,
It remains to multiply the above expressions by from both sides to recover the desired estimator:
and the lower bound follows identically. Appropriately adjusting the constants concludes the proof.
As an application of the above result, we show how this allows us to bound the estimation error in distributed Newton’s method, when using determinantal averaging.
which completes the proof of the corollary.
MWM would like to acknowledge ARO, DARPA, NSF and ONR for providing partial support of this work. Also, MWM and MD thank the NSF for funding via the NSF TRIPODS program. Part of this work was done while MD and MWM were visiting the Simons Institute for the Theory of Computing.
- [ABH17] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. Journal of Machine Learning Research, 18(116):1–40, 2017.
- [AD11] Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 873–881. Curran Associates, Inc., 2011.
- [BCF09] C. Bekas, A. Curioni, and I. Fedulova. Low cost high performance uncertainty quantification. In Proceedings of the 2Nd Workshop on High Performance Computational Finance, WHPCF ’09, pages 8:1–8:8, New York, NY, USA, 2009. ACM.
Chih-Chung Chang and Chih-Jen Lin.
LIBSVM: A library for support vector machines.ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
- [Der18] Michał Dereziński. Fast determinantal point processes via distortion-free intermediate sampling. CoRR, abs/1811.03717, 2018.
Michał Dereziński and Manfred K. Warmuth.
Unbiased estimates for linear regression via volume sampling.In Advances in Neural Information Processing Systems 30, pages 3087–3096, Long Beach, CA, USA, December 2017.
- [DW18] Michał Dereziński and Manfred K. Warmuth. Reverse iterative volume sampling for linear regression. Journal of Machine Learning Research, 19(23):1–39, 2018.