Algorithmic scalability is a keystone in the realm of modern machine learning. Making high quality inference, on large, feature rich datasets under a constrained computational budget is arguably the primary goal of the learning community. A common hindrance, appearing in Gaussian graphical models, Gaussian Processes (rue2005gaussian; rasmussen2006gaussian), sampling, variational inference (mackay2003information), metric/kernel learning (davis2007information; van2009minimum), Markov random fields (wainwright2006log), Determinantal Point Processes (DPP’s) and Bayesian Neural networks (mackay1992bayesian), is the calculation of the log determinant of a large positive definite matrix.
For a large positive definite matrix , the canonical solution involves the Cholesky decomposition, . The log determinant is then trivial to calculate as . This computation invokes a computational complexity and storage complexity and is thus unfit for purpose for , i.e. even a small sample set in the age of big data.
1.1 Related Work
Recent work in machine learning combined stochastic trace estimation with Taylor approximations for Gaussian process parameter learning(zhang2007approximate; boutsidis2017randomized)
, reducing the computational cost to matrix vector multiplications (MVMs),for a dense matrix and 111Number of non zeros. for a sparse matrix. Variants of the same theme used Chebyshev polynomial approximations, giving a performance improvement over Taylor along with refined error bounds (han2015large)
and the combination of either Chebyshev/Lanczos techniques with structured kernel interpolation (SKI) in order to accelerate MVMs to, where is the number of inducing points (dong2017scalable).
This approach relies on an extension of Kronecker and Toeplitz methods, which are limited to low dimensional (typically ) data, which cannot be assumed in general. Secondly, whilst Lanczos methods have a convergence rate of double that of the Chebyshev approaches, the derived bounds require Lanczos steps (Ubaru2016), where is the matrix condition number. In many practical cases of interest and thus the large number of matrix vector multiplications becomes prohibitive. We restrict ourselves to the high-dimensional, high-condition number, big data limit.
We recast the problem of calculating log determinants as a constrained variational inference problem, which under the assumption of a uniform prior reduces to maximum entropy spectral density estimation given moment information. We develop a novel algorithm using Newton conjugate gradient and Hessian information with a Legendre/Chebyshev basis, as opposed to power moments. Our algorithm is stable for a large number of moments () surpassing the limit of previous MaxEnt algorithms (DBLP:conf/bigdataconf/GranziolR17; bandyopadhyay2005maximum; mead1984maximum). We build on the experimental results of (han2015large; bild)
, adding to the case against using Taylor expansions in Log Determinant approximations by proving that the implied density violates Kolmogorov’s axioms of probability. We compare our algorithm to both Chebyshev and Lanczos methods on real data and synthetic kernel matrices of arbitrary condition numbers, relevant to GP kernel learning and DPP’s.
The work most similar to ours is (ete), which uses stochastic trace estimation as moment constraints with an off the shelf maximum entropy algorithm (bandyopadhyay2005maximum) to estimate the matrix spectral density in order to calculate the log determinant. Their approach outperforms Chebyshev, Taylor, Lanczos and kernel based approximations, yet begins to show pathologies and increasing error for moments, due to convergence issues (DBLP:conf/bigdataconf/GranziolR17), making it unsuitable for machine learning where typical squared exponential kernels have very sharply decaying spectra and thus a larger number of moments is required for high precision.
2 Motivating example
Determinantal point processes (DPPs) (macchi1975coincidence) are probabilistic models capturing global negative correlations. They describe Fermions 222as a consequence of the spin-statistics theorem
in Quantum Physics, Eigenvalues of random matrices and non-intersecting random walks.
In machine learning their natural selection of diversity has found applications in the field of summarization (gong2014diverse), human pose detection (kulesza_2012), clustering (kang2013fast), Low rank kernel matrix approximations (li2016fast) and Manifold learning (wachinger2015sampling).
Formally, it defines a distribution on , where
is the finite ground set. For a random variabledrawn from a given DPP we have
where is a positive definite matrix referred to as the -ensemble kernel. Greedy algorithms that find the most diverse set of that achieves the highest probability, i.e require the calculation of the marginal gain,
with computational complexity. Previous work has looked at limiting the burden of the computational complexity by employed Chebyshev approximations to the Log Determinant (han2017faster). However their work is limited Kernel Matrices with minimum eigenvalues of 333or alternatively low condition numbers, which does not cover the class of realistic kernel matrix spectra, notably the popular squared exponential kernel. We develop a method in section 5.2 and an algorithm 5.4 which is capable of handling very high condition numbered matrices.
3.1 Log Determinants as a Density Estimation Problem
Any symmetric positive definite (PD) matrix , is diagonalizable by a unitary transformation , i.e , where is the matrix with the eigenvalues of along the diagonal. Hence we can write the log determinant as:
Here we have used the cyclicity of the determinant and denotes the expectation under the spectral measure. The latter can be written as:
Given that the matrix is PD, we know that and we can divide the matrix by an upper bound, , via the Gershgorin circle theorem (gershgorin1931uber) such that,
Here , i.e the max sum of the rows of the absolute of the matrix . Hence we can comfortably work with the transformed measure,
as the spectral density is outside of its bounds, which are bounded by respectively.
3.2 Stochastic Trace Estimation
Using the expectation of quadratic forms, for any multivariate random variable with mean
and variance, we can write
where in the last equality we have assumed that the variable possesses zero mean and unit variance. By the linearity of trace and expectation for any we can write
In practice we approximate the expectation over all random vectors with a simple Monte Carlo average. i.e for random vectors ,
where we take the product of the matrix with the vector , times, so as to avoid costly matrix matrix multiplication. This allows us to calculate the non central moment expectations in for dense matrices, or for sparse matrices, where .
The random unit vector can be drawn from any distribution, such as a Gaussian. Choosing the components of to be i.i.d Rademacher random variables i.e (Hutchinson’s method (hutchinson1990stochastic)) has the lowest variance of such estimators (jackmub), satisfying,
Loose bounds exist on the number of samples required to get within a fractional error with probability (han2015large),
as per (roosta2015improved). To get within fractional error with probability , for example, we require samples. In practice we find that as little as gives good accuracy however.
4 Polynomial approximations to the Log Determinant
Recent work (han2015large; dong2017scalable; zhang2007approximate) has considered incorporating knowledge of the non central moments 444Also using stochastic trace estimation. of a normalised eigenspectrum by replacing the logarithm with a finite polynomial expansion,
Given that is not analytic at , it can be seen that, for any density with a large mass near the origin, a very large number of polynomial expansions, and thus moment estimates, will be required to achieve a good approximation, irrespective of the choice of basis.
4.1 Taylor approximations are probabilistically unsound
In the case of a Taylor expansion equation (12) can be written as,
The error in this approximation can be written as the difference of the two sums,
where we have used the Taylor expansion of and denotes the expectation under the spectral measure.
De-Finetti (finetti) showed that Kolmogorov’s axioms of probability (kolmogorov_1950) could be derived by manipulating probabilities in such a manner so as to not make a sure loss on a gambling system based on them. Such a probabilistic framework, of which the Bayesian is a special case (bayesianspecialcase), satisfies the conditions of,
Non Negativity: ,
Finite Additivity: . 555for a sequence of disjoint sets .
The intuitive appeal of De-Finetti’s sure loss arguments, is that they are inherently performance based. A sure loss is a practical cost, which we wish to eliminate.
Keeping within such a very general formulation of probability and thus inference. We begin with complete ignorance about the spectral density (other than its domain ) and by some scheme after seeing the first non-central moment estimates we propose a surrogate density . The error in our approximation can be written as,
For this error to be equal to that of our Taylor expansion (14), our implicit surrogate density must have the first non-central moments of identical to the true spectral density and all others .
Given that and that we have removed the trivial case of the spectral density (and by implication its surrogate) being a delta function at , the sum is manifestly positive and hence for some , which is incompatible with the theory of probability (finetti; kolmogorov_1950).
5 Constrained Variational Method
5.1 Variational Bayes
Variational Methods (mackay2003information; fox2012tutorial) in machine learning pose the problem of intractable density estimation from the application of Bayes’ rule as a functional optimization problem,
and finding finding the appropriate .
Typically, whilst the functional form of is known, calculating is intractable. Using Jensen’s inequality we can show that,
It can be shown that the reverse KL divergence between the posterior and the variational distribution, , can be written as,
Hence maximising the evidence lower bound is equivalent to minimising the reverse KL divergence between and .
By assuming the variational distribution to factor over the set of latent variables, the functional form of the variational marginals, subject to the normalisation condition, can be computed using functional differentiation of the dual,
leading to a Gibbs’ distribution and an iterative update equation.
5.2 Log Determinant as a Variational Inference problem
We consider minimizing the reverse KL divergence between our surrogate posterior and our prior on the eigenspectrum,
such that the normalization and moment constraints are satisfied. Here denotes the differential entropy of the density .
By the theory of Lagrangian duality, the convexity of the KL divergence and the affine nature of the moment constraints, we maximise the dual (boyd_vandenberghe_2009),
or alternatively we minimise
5.3 Link to Information Physics
In the field of information physics the minimization of Equation (23) is known as the method of relative entropy (caticha2012entropic). It can be derived as the unique functional satisfying the axioms of,
Locality: local information has local effects.
Co-ordinate invariance: the co-ordinate set carries no information.
Sub-System Independence: for two independent sub-system it should not matter if we treat the inference separately or jointly.
Objectivity: Under no new information, our inference should not change. Hence under no constraints, our posterior should coincide with our prior.
These lead to the generalised entropic functional,
Here the justification for restricting ourselves to a functional is derived from considering the set of all distributions compatible with the constraints and devising a transitive ranking scheme. It can be shown, further, that Newton’s laws, non-relativistic quantum mechanics and Bayes’ rule can all be derived under this formalism.
In the case of a flat prior over the spectral domain, we reduce to the method of maximum entropy with moment constraints (jaynes1982rationale; inftheoryjaynes). Conditions for the existence of a solution to this problem have been proved for the case of the Hausdorff moment conditions (mead1984maximum), of which our problem is a special case.
The generalised dual objective function which we minimise is,
which can be shown to have gradient
5.4.1 Prior Spectral Belief
If we assume complete ignorance over the spectral domain, then the natural maximally entropic prior is the uniform distribution and hence.888Technically as the log determinant exists and is finite, we cannot have any mass at , hence we must set the uniform between some , where . An alternative prior over the
domain is the Beta distribution, the maximum entropy distribution of that domain under a mean and log mean constraint,
The log mean constraint is particularly interesting as we know that it must exists for a valid log determinant to exist, as is seen for equation (4). We set the parameters of by maximum likelihood, hence,
5.4.2 Analytical surrogate form
Our final equation for can be written as,
for the beta prior and
for the uniform. The exponential factor can be thought of altering the prior beta/uniform distribution so as to fit the observed moment information.
5.4.3 Practical Implementation
For simplicity we have kept all the formula’s in terms of power moments, however we find vastly improved performance and conditioning when we switch to another polynomial basis. Many alternative and orthogonal Polynomial bases exist (so that the errors between moment estimations are uncorrelated), we implement both Chebyshev and Legendre moments in our Lagrangian and find similar performance for both. The use of Chebyshev moments in Machine Learning and Computer Vision has been reported to be of practical siginifcance previously(yap2001chebyshev). We use Python’s SciPy minimize standard newton-conjugate gradient algorithm to solve the objective, given the gradient and hessian to within a gradient tolerance . To make the Hessian better conditioned so as to achieve convergence we add jitter along the diagonal. The pseudo code is given in Algorithm 1. The Log Determinant is then calculated using Algorithm 2.
In order to test the validity and practical applicability of our proposed Algorithm, we test on both Synthetic and real Kernel matrices. We compare against Chebyshev and Lanczos approaches. For completeness in Figures 4,4 and 4 we include results from the Taylor expansion. However in light of the consistently superior performance of the Chebyshev approach (han2015large), the arguments in Section 4.1 and space requirements we do not include Taylor methods in our results tables.
6.1 Synthetic Kernel Data
We simulate the kernel matrices from a Gaussian/Determinental Point Process (Rasmussen2006), by generating a typical squared exponential kernel matrix using the Python GPy package with dimensional, Gaussian inputs. We then add noise of variance along the diagonals. We employ a variety of realistic uniform length-scales (, & ) with condition numbers , & respectively. We use Moments, Hutchinson probe vectors and compare VBALD against the Taylor approximation, Chebyshev (han2015large) and Lanczos (ubaru2017fast). We see that for low condition numbers (Figure 4) The benefit of framing the log determinant as an optimization problem is marginal, whereas for large condition numbers (Figures 4 & 4) the benefits are substantial, with orders of magnitude better results than competing methods. We also provide the number of Chebyshev and Lanczos steps required to achieve similar performance. Results for a wider range of length scales are shown in Table 1.
6.2 Sparse Matrices
Given that the method described above works in general for any positive definite matrices. We look at large sparse spectral problems. Given that good results are reported for Polynomial methods at even a fairly modest number of moments, we consider very very large sparse matrices, such as social networks999Facebook, with users, each with friends is equivalent to a dense matrix, where we wish to limit the number of non-central moment estimates. We set and test on the UFL SuiteSparse dataset. We show our results in table 2. VBALD it competitive with Lanczos and we note that similar to the Kernel Matrix case, VBALD does best relative to other methods when the others perform at their poorest, i.e have a relatively large absolute relative error.
6.3 Citations and References
6.4 Software and Data
Appendix A Do not have an appendix here
Do not put content after the references. Put anything that you might normally include after the references in a separate supplementary file.
We recommend that you build supplementary material in a separate document. If you must create one PDF and cut it up, please be careful to use a tool that doesn’t alter the margins, and that doesn’t aggressively rewrite the PDF file. pdftk usually works fine.
Please do not use Apple’s preview to cut off supplementary material. In previous years it has altered margins, and created headaches at the camera-ready stage.