A Maximum Entropy approach to Massive Graph Spectra

12/19/2019 ∙ by Diego Granziol, et al. ∙ University of Oxford

Graph spectral techniques for measuring graph similarity, or for learning the cluster number, require kernel smoothing. The kernel function and bandwidth are typically chosen in an ad-hoc manner and heavily affect the resulting output. We prove that kernel smoothing biases the moments of the spectral density. We propose an information-theoretically optimal approach to learn a smooth graph spectral density, which fully respects the moment information. Our method's computational cost is linear in the number of edges, and hence it can be applied to large networks with millions of nodes. We apply our method to the problems of graph similarity and cluster number learning, where we outperform comparable iterative spectral approaches on synthetic and real graphs.







1 Introduction: networks, their graph spectra and importance

Many systems of interest can be naturally characterised by complex networks; examples include social networks (Mislove et al., 2007; Flake et al., 2000; Leskovec et al., 2007), biological networks (Palla et al., 2005) and technological networks. Trends, opinions and ideologies spread on a social network, in which people are nodes and edges represent relationships. Networks are mathematically represented by graphs. Of crucial importance to the understanding of the properties of a network or graph is its spectrum, which is defined as the eigenvalues of its adjacency or Laplacian matrix (Farkas et al., 2001; Cohen-Steiner et al., 2018). The spectrum of a graph can be considered as a natural set of graph invariants and has been extensively studied in the fields of chemistry, physics and mathematics (Biggs et al., 1976). Spectral techniques have been extensively used to characterise the global network structure (Newman, 2006b) and in practical applications thereof, such as facial recognition and computer vision (Belkin and Niyogi, 2003), learning dynamical thresholds (McGraw and Menzinger, 2008), clustering (Von Luxburg, 2007), and measuring graph similarity (Takahashi et al., 2012).

A major limitation in utilizing graph spectra to solve problems such as graph similarity and estimating the number of clusters (just two example applications of the general method we propose in this paper for learning graph spectra) is the inability to automatically and consistently learn an everywhere-positive, non-singular approximation to the spectral density. Full eigen-decomposition, which is prohibitive for large graphs, and iterative moment-matched approximations both give a Dirac sum that must be smoothed to be everywhere positive. The choice of smoothing kernel and kernel bandwidth, or number of histogram bins, is usually made in an ad-hoc manner and can significantly affect the resulting output.

The main contributions of this paper are as follows:

  • We prove that the method of kernel smoothing, commonly used in methods to visualize and compare graph spectral densities, biases moment information;

  • We propose a computationally efficient and information theoretically optimal smooth spectral density approximation, based on the method of Maximum Entropy, which fully respects the moment information. It further admits analytic forms for symmetric and non-symmetric KL-divergences and Shannon entropy;

  • We apply our information theoretic spectral density approximation to two example applications: measuring graph similarity and learning the number of clusters in a graph. In both, we outperform iterative smoothed spectral approaches on real and synthetic data-sets.

2 Preliminaries

Graphs are the mathematical structure underpinning the formulation of networks. Let $G = (V, E)$ be an undirected graph with vertex set $V = \{v_1, \ldots, v_n\}$. Each edge between two vertices $v_i$ and $v_j$ carries a non-negative weight $w_{ij} \geq 0$, where $w_{ij} = 0$ corresponds to two disconnected nodes. For un-weighted graphs we set $w_{ij} = 1$ for two connected nodes. The adjacency matrix is defined as $W = (w_{ij})$ with $i, j = 1, \ldots, n$. The degree of a vertex $v_i \in V$ is defined as

$$d_i = \sum_{j=1}^{n} w_{ij}. \tag{1}$$

The degree matrix $D$ is defined as a diagonal matrix that contains the degrees of the vertices along its diagonal, i.e., $D_{ii} = d_i$. The unnormalised graph Laplacian matrix is defined as

$$L = D - W. \tag{2}$$

As $G$ is undirected, $w_{ij} = w_{ji}$, which means that the weight matrix $W$ is symmetric; given that $D$ is also symmetric, the unnormalised Laplacian $L$ is symmetric as well. Real symmetric matrices are Hermitian and therefore have real eigenvalues. Another common characterisation of the Laplacian matrix is the normalised Laplacian (Chung, 1997),

$$L_{\mathrm{norm}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2} = I - W_{\mathrm{norm}}, \tag{3}$$

where $W_{\mathrm{norm}} = D^{-1/2} W D^{-1/2}$ is known as the normalised adjacency matrix (strictly speaking, the second equality only holds for graphs without isolated vertices). The spectrum of the graph is defined as the density of the eigenvalues of the given adjacency, Laplacian or normalised Laplacian matrices corresponding to the graph. Unless otherwise specified, we will consider the spectrum of the normalised Laplacian.
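As an illustration of the definitions above, the following Python sketch builds the normalised Laplacian of a small unweighted graph from an edge list, using only the standard library. The function name `normalised_laplacian` and the toy graph are assumptions made for illustration; they are not taken from the authors' code.

```python
import math

def normalised_laplacian(n, edges):
    """Return (L_norm, degrees) with L_norm = I - D^{-1/2} W D^{-1/2}.

    Assumes an undirected, unweighted graph with no isolated vertices.
    """
    w = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        w[i][j] = w[j][i] = 1.0              # symmetric weights, w_ij = w_ji
    d = [sum(row) for row in w]              # degrees d_i = sum_j w_ij
    lap = [[(1.0 if i == j else 0.0) - w[i][j] / math.sqrt(d[i] * d[j])
            for j in range(n)] for i in range(n)]
    return lap, d

# Toy graph: a 4-cycle with one chord.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
lap, deg = normalised_laplacian(4, edges)

# The vector D^{1/2} 1 is mapped to zero, so L_norm has a zero eigenvalue.
residual = [sum(lap[i][j] * math.sqrt(deg[j]) for j in range(4)) for i in range(4)]
trace = sum(lap[i][i] for i in range(4))     # equals n when there are no self-loops
```

A dense list-of-lists representation is only sensible for toy graphs; for the large sparse graphs considered in this paper one would store the edges sparsely and never form the matrix explicitly.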

3 Motivations for A New Approach on Approximating and Comparing the Spectra of Large Graphs

For large sparse graphs with millions, or billions, of nodes, learning the exact spectrum using eigen-decomposition is infeasible due to the $\mathcal{O}(n^3)$ cost. Powerful iterative methods, such as the Lanczos algorithm, which only require matrix-vector multiplications, and hence have a computational cost scaling with the number of non-zero entries of the graph matrix, are often used. These approaches approximate the graph spectrum with a sum of weighted Dirac delta functions, closely matching the first $m$ moments (where $m$ is the number of iterative steps used, as detailed in Appendix B) of the spectral density (Ubaru et al., 2016), i.e.:

$$p(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \delta(\lambda - \lambda_i) \approx \sum_{i=1}^{m} w_i\, \delta(\lambda - \lambda_i), \tag{4}$$

where $\sum_{i=1}^{m} w_i = 1$, and $\lambda_i$ denotes the $i$-th eigenvalue in the spectrum. However, such an approximation is undesirable because natural divergence measures between densities, such as the information-based relative entropy (Cover and Thomas, 2012; Amari and Nagaoka, 2007) given in Equation (5),

$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(\lambda) \log \frac{p(\lambda)}{q(\lambda)}\, d\lambda, \tag{5}$$

can be infinite for densities that are mutually singular. The use of the Jensen-Shannon divergence simply re-scales the divergence into $[0, 1]$. This can lead to counter-intuitive scenarios, such as an infinite (or maximal) divergence upon the removal or addition of a single edge or node in a large network, or an infinite (or maximal) divergence between two graphs generated using the same random graph model and identical hyper-parameters.
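The singularity problem can be seen in a minimal sketch, assuming the two Dirac mixtures are binned on a common grid so that each becomes a discrete distribution. The helper names `kl` and `js` are illustrative, and base-2 logarithms are an assumption made so that the Jensen-Shannon divergence falls in $[0, 1]$.

```python
import math

def kl(p, q):
    """Discrete relative entropy D_KL(p || q) in bits; inf if q = 0 where p > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total

def js(p, q):
    """Jensen-Shannon divergence in bits, which lies in [0, 1]."""
    mix = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

# Two binned Dirac mixtures whose atoms never fall in the same bin.
p = [0.5, 0.0, 0.5, 0.0]
q = [0.0, 0.5, 0.0, 0.5]

kl_pq = kl(p, q)   # infinite, because the supports are disjoint
js_pq = js(p, q)   # attains the maximal value of 1 bit
```

Both divergences saturate regardless of how close the atoms of the two mixtures are, which is why an everywhere-positive density is needed before spectra can be compared meaningfully.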

3.1 The argument against kernel smoothing:

To alleviate these limitations, practitioners typically generate a smoothed spectral density by convolving the Dirac mixture with a smooth kernel $k_\sigma(\lambda - \lambda_i)$ (Takahashi et al., 2012; Banerjee, 2008), often a Gaussian or Cauchy (Banerjee, 2008), to facilitate visualisation and comparison. The smoothed spectral density, with reference to Equation (4), thus takes the form:

$$\tilde{p}(\lambda) = \int k_\sigma(\lambda - \lambda')\, p(\lambda')\, d\lambda' = \sum_{i=1}^{n} w_i\, k_\sigma(\lambda - \lambda_i). \tag{6}$$

We make some assumptions regarding the nature of the kernel function, $k_\sigma(\lambda - \lambda_i)$, in order to prove our main theoretical result about the effect of kernel smoothing on the moments of the underlying spectral density. Both of our assumptions are met by the commonly employed Gaussian kernel.

Assumption 1

The kernel function $k_\sigma$ is supported on the real line $(-\infty, \infty)$.

Assumption 2

The kernel function is symmetric and permits all moments.

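Theorem 1

The $m$-th moment of a Dirac mixture $\sum_{i=1}^{n} w_i\, \delta(\lambda - \lambda_i)$, which is smoothed by a kernel $k_\sigma$ satisfying Assumptions 1 and 2, is perturbed from its unsmoothed counterpart by an amount $\sum_{i=1}^{n} w_i \sum_{j=1}^{r/2} \binom{m}{2j}\, \mathbb{E}_{k_\sigma}(\lambda^{2j})\, \lambda_i^{m-2j}$, where $r = m$ if $m$ is even and $r = m - 1$ otherwise. $\mathbb{E}_{k_\sigma}(\lambda^{2j})$ denotes the $2j$-th central moment of the kernel function $k_\sigma(\lambda)$.

Proof. The moments of the Dirac mixture are given as

$$\langle \lambda^m \rangle = \sum_{i=1}^{n} w_i \int \delta(\lambda - \lambda_i)\, \lambda^m\, d\lambda = \sum_{i=1}^{n} w_i\, \lambda_i^m. \tag{7}$$

The moments of the modified smooth function (Equation (6)) are

$$\langle \tilde{\lambda}^m \rangle = \sum_{i=1}^{n} w_i \int k_\sigma(\lambda - \lambda_i)\, \lambda^m\, d\lambda = \sum_{i=1}^{n} w_i \int k_\sigma(\lambda')\, (\lambda' + \lambda_i)^m\, d\lambda' = \langle \lambda^m \rangle + \sum_{i=1}^{n} w_i \sum_{j=1}^{r/2} \binom{m}{2j}\, \mathbb{E}_{k_\sigma}(\lambda^{2j})\, \lambda_i^{m-2j}. \tag{8}$$

We have used the binomial expansion, the fact that the infinite domain is invariant under shift reparametrization, and the fact that the odd moments of a symmetric distribution are $0$.

The above proves that kernel smoothing alters moment information, and that this effect becomes more pronounced for higher moments. Furthermore, given that $w_i > 0$, $\mathbb{E}_{k_\sigma}(\lambda^{2j}) > 0$ and (for the normalised Laplacian) $\lambda_i \geq 0$, the corrective term is manifestly positive for $m \geq 2$, so the smoothed moment estimates are biased.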
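The size of this bias can be checked numerically. The sketch below assumes a Gaussian kernel, whose even central moments are $\sigma^{2j}(2j-1)!!$, and compares the moments of a smoothed three-atom mixture, obtained by a midpoint Riemann sum, against the correction term stated above. The atoms, weights, bandwidth and function names are illustrative assumptions.

```python
import math

def gaussian_central_moment(k, sigma):
    """k-th central moment of a Gaussian: 0 for odd k, sigma^k (k-1)!! for even k."""
    if k % 2:
        return 0.0
    double_fact = math.prod(range(1, k, 2))   # (k-1)!!, equal to 1 when k = 0
    return sigma ** k * double_fact

def dirac_moment(atoms, weights, m):
    """m-th moment of the unsmoothed Dirac mixture."""
    return sum(w * lam ** m for lam, w in zip(atoms, weights))

def predicted_bias(atoms, weights, m, sigma):
    """sum_i w_i sum_{j=1}^{r/2} C(m, 2j) E_k[lambda^{2j}] lambda_i^{m-2j}."""
    r = m if m % 2 == 0 else m - 1
    return sum(w * sum(math.comb(m, 2 * j) * gaussian_central_moment(2 * j, sigma)
                       * lam ** (m - 2 * j) for j in range(1, r // 2 + 1))
               for lam, w in zip(atoms, weights))

def smoothed_moment(atoms, weights, m, sigma, lo=-3.0, hi=5.0, steps=40000):
    """m-th moment of the Gaussian-smoothed mixture via a midpoint Riemann sum."""
    dx = (hi - lo) / steps
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for s in range(steps):
        x = lo + (s + 0.5) * dx
        dens = sum(w * norm * math.exp(-0.5 * ((x - lam) / sigma) ** 2)
                   for lam, w in zip(atoms, weights))
        total += dens * x ** m * dx
    return total

atoms, weights, sigma = [0.2, 1.0, 1.7], [0.3, 0.5, 0.2], 0.1
for m in (1, 2, 4):
    bias = smoothed_moment(atoms, weights, m, sigma) - dirac_moment(atoms, weights, m)
    print(m, bias, predicted_bias(atoms, weights, m, sigma))
```

For $m = 2$ the correction reduces to $\sigma^2$ whenever the weights sum to one, so even a modest bandwidth shifts the second moment by a fixed amount that does not depend on the spectrum.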

For large random graphs, the moments of a generated instance converge to those averaged over many instances (Feier, 2012), hence by biasing our moment information we limit our ability to learn about the underlying stochastic process. We include a detailed discussion regarding the relationship between the moments of the graph and the underlying stochastic process in Appendix Section E.

4 An Information Theoretically Optimal Approach to the Problem of Smooth Spectra for Massive Graphs

For large, sparse graphs corresponding to real networks with millions or billions of nodes, where eigen-decomposition is intractable, we may still be able to compute a certain number of matrix-vector products, which we can use to get unbiased estimates of the spectral density moments using stochastic trace estimation (as explained in Appendix A). We can then settle on a unique spectral density that satisfies the given moment information exactly, known as the density of Maximum Entropy, as explained in Section 4.1.

4.1 Maximum Entropy: MaxEnt

The method of maximum entropy, hereafter referred to as MaxEnt (Pressé et al., 2013), is information-theoretically optimal in so far as it makes the fewest additional assumptions about the underlying density (Jaynes, 1957) and is flattest in terms of the KL divergence compared to the uniform (Granziol et al., 2019). To determine the spectral density $p(\lambda)$ using MaxEnt, we maximise the entropic functional

$$S = -\int p(\lambda) \log p(\lambda)\, d\lambda - \sum_{i} \alpha_i \left[ \int p(\lambda)\, \lambda^i\, d\lambda - \mu_i \right], \tag{9}$$

with respect to $p(\lambda)$, where $\mathbb{E}_p[\lambda^i] = \mu_i$ are the power moment constraints on the spectral density, which are estimated using stochastic trace estimation (STE) as explained in Appendix A. The resultant entropic spectral density has the form

$$p(\lambda \mid \{\mu_i\}) = \exp\left[ -\left( 1 + \sum_{i} \alpha_i \lambda^i \right) \right], \tag{10}$$

where the coefficients $\{\alpha_i\}$ are derived from optimising (9). We use the MaxEnt algorithm proposed in (Granziol et al., 2019) to learn these coefficients. For simplicity, we denote $p(\lambda \mid \{\mu_i\})$ as $p(\lambda)$. We make our Python code available at https://github.com/diegogranziol/Python-MaxEnt.

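1:  Input: Normalized Laplacian $L_{\mathrm{norm}}$, number of probe vectors $n_v$, number of moments used $m$
2:  Output: EGS $p(\lambda)$
3:  Moments $\{\mu_i\} \leftarrow$ STE($L_{\mathrm{norm}}$, $n_v$, $m$)
4:  MaxEnt coefficients $\{\alpha_i\} \leftarrow$ MaxEnt Algorithm($\{\mu_i\}$)
5:  Entropic graph spectrum $p(\lambda) = \exp[-(1 + \sum_i \alpha_i \lambda^i)]$
Algorithm 1 Entropic Graph Spectrum (EGS) Learner.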

4.2 The Entropic Graph Spectral Learning algorithm

The full algorithm for learning the entropic graph spectrum (EGS) is summarized in Algorithm 1. We first estimate the moments of the normalised graph Laplacian via STE, then use the moment information to solve for MaxEnt coefficients and compute the EGS via Equation (10).
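The moment-matching step can be sketched in a few lines. The code below is not the MaxEnt algorithm of (Granziol et al., 2019); it is a minimal stand-in, assuming a density on $[0, 1]$ discretised on a midpoint grid and fitted by damped Newton iterations on the convex dual $\log Z(\alpha) + \sum_i \alpha_i \mu_i$. The normalisation is handled through $Z$ rather than through an explicit zeroth coefficient, and all function names are illustrative.

```python
import math

def solve(a, b):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))) / aug[r][r]
    return x

def maxent_density(alpha, grid, dx):
    """Return (p, log Z) with p(x) proportional to exp(-sum_i alpha_i x^i)."""
    expo = [-sum(a * x ** (i + 1) for i, a in enumerate(alpha)) for x in grid]
    top = max(expo)                                   # subtract max for stability
    unnorm = [math.exp(e - top) for e in expo]
    z = sum(unnorm) * dx
    return [u / z for u in unnorm], math.log(z) + top

def maxent_fit(mu, grid, dx, iters=200, tol=1e-9):
    """Damped Newton iterations on the dual log Z(alpha) + alpha . mu."""
    m = len(mu)
    alpha = [0.0] * m

    def dual(al):
        return maxent_density(al, grid, dx)[1] + sum(a * u for a, u in zip(al, mu))

    for _ in range(iters):
        p, _ = maxent_density(alpha, grid, dx)
        mom = [sum(pi * x ** k for pi, x in zip(p, grid)) * dx
               for k in range(1, 2 * m + 1)]          # E_p[x^k] for k = 1..2m
        grad = [mu[i] - mom[i] for i in range(m)]
        if max(abs(g) for g in grad) < tol:
            break
        # Hessian of the dual is the covariance of the monomials under p.
        hess = [[mom[i + j + 1] - mom[i] * mom[j] for j in range(m)] for i in range(m)]
        step = solve(hess, [-g for g in grad])
        slope = sum(g * s for g, s in zip(grad, step))
        t, f0 = 1.0, dual(alpha)
        while t > 1e-8 and dual([a + t * s for a, s in zip(alpha, step)]) > f0 + 1e-4 * t * slope:
            t *= 0.5                                  # backtracking line search
        alpha = [a + t * s for a, s in zip(alpha, step)]
    return alpha

# Raw moments 1..4 of a semi-circle centred at 0.5 with radius 0.5.
mu = [0.5, 0.3125, 0.21875, 0.1640625]
n_grid = 400
dx = 1.0 / n_grid
grid = [(k + 0.5) * dx for k in range(n_grid)]
alpha = maxent_fit(mu, grid, dx)
p, _ = maxent_density(alpha, grid, dx)
fitted = [sum(pi * x ** k for pi, x in zip(p, grid)) * dx for k in range(1, 5)]
```

Power moments make the Hessian increasingly ill-conditioned as the number of moments grows, so a sketch like this is only suitable for a handful of constraints; a better-conditioned polynomial basis would be needed for the larger moment counts used in the experiments.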

5 Visualising the Modelling Power of EGS

Having developed a theory as to why a smooth, exact moment-matched approximation of the spectral density is crucial to learning the characteristics of the underlying stochastic process, and having proposed a method (Algorithm 1) to learn such a density, we test the practical utility of our method and algorithm on examples where the limiting spectral density is known.

5.1 Erdős-Rényi graphs and The semi-circle law

For Erdős-Rényi graphs with edge creation probability $p \in (0, 1)$ and $np \to \infty$, the limiting spectral density of the normalised Laplacian converges to the semi-circle law, and that of the Laplacian converges to the free convolution of the semi-circle law and $\mathcal{N}(0, 1)$ (Jiang, 2012). We consider here to what extent our EGS learnt with finitely many moments can effectively approximate the density. Wigner's density is fully defined by its infinite number of central moments, given by $\mathbb{E}(\lambda^{2n}) = (R/2)^{2n} C_n$, where $R$ is the radius of the semi-circle and $C_n = \frac{1}{n+1}\binom{2n}{n}$ are known as the Catalan numbers. As a toy example we generate a semi-circle centered at $\lambda = 0.5$ with $R = 0.5$ and use the analytical moments to compute its corresponding EGS (FIG 1). As can be seen in FIG 1(a), with the smaller number of moments the central portion of the density is already well approximated, but the end points are not. This is largely corrected with the larger number of moments.

Figure 1: EGS fit to a semi-circle density that is centered at 0.5 and has a radius of 0.5, for different numbers of moments $m$. (a) visualises the quality of fit for two values of $m$. (b) shows the KL divergence between the true semi-circle density and the EGS.
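The analytical moments used for this toy example follow directly from the Catalan numbers. The sketch below computes the raw moments of a shifted semi-circle via the binomial expansion around its centre and checks them against a midpoint-rule integral of the standard semi-circle density $2\sqrt{R^2 - y^2}/(\pi R^2)$. The function names are illustrative assumptions.

```python
import math

def catalan(n):
    """n-th Catalan number C_n = binom(2n, n) / (n + 1)."""
    return math.comb(2 * n, n) // (n + 1)

def semicircle_central_moment(k, radius):
    """Analytic k-th central moment of a semi-circle density of the given radius."""
    if k % 2:
        return 0.0
    return (radius / 2.0) ** k * catalan(k // 2)

def semicircle_raw_moment(m, centre, radius):
    """Raw moment E[lambda^m] via the binomial expansion around the centre."""
    return sum(math.comb(m, k) * semicircle_central_moment(k, radius) * centre ** (m - k)
               for k in range(m + 1))

def numeric_raw_moment(m, centre, radius, steps=100000):
    """Midpoint-rule check against the density 2 sqrt(R^2 - y^2) / (pi R^2)."""
    dy = 2.0 * radius / steps
    total = 0.0
    for s in range(steps):
        y = -radius + (s + 0.5) * dy
        dens = 2.0 * math.sqrt(radius ** 2 - y ** 2) / (math.pi * radius ** 2)
        total += dens * (centre + y) ** m * dy
    return total

# Raw moments 1..4 of the semi-circle centred at 0.5 with radius 0.5.
moments = [semicircle_raw_moment(m, 0.5, 0.5) for m in range(1, 5)]
```

Because the centre and radius are both 0.5, the support is exactly $[0, 1]$, which matches the domain on which the moment constraints are imposed.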

We generate an Erdős-Rényi graph with a fixed number of nodes $n$ and edge probability $p$, and learn the moments using stochastic trace estimation. We then compare the fit between the EGS computed using different numbers of input moments and the graph eigenvalue histogram computed by eigen-decomposition. We plot the results in FIG 2. One striking difference between this experiment and the previous one is the number of moments needed to give a good fit. This can be seen especially clearly in the top left subplot of FIG 2, where the 3-moment, i.e., Gaussian, approximation completely fails to capture the bounded support of the spectral density. Given that the exponential polynomial density is positive everywhere, it needs more moment information to learn the regions of boundedness of the spectral density in its domain. In the previous example we artificially alleviated this phenomenon by letting the support of the semi-circle coincide with the entire domain. It can be clearly seen in FIG 2 that increasing moment information successively improves the fit to the support. Furthermore, the oscillations, which are characteristic of an exponential polynomial function, decay in magnitude for larger numbers of moments.

Figure 2: EGS fit to a randomly generated Erdős-Rényi graph. The number of moments $m$ used increases from 3 to 100, and the eigenvalue histogram uses a fixed number of bins.

Figure 3: EGS fit to a randomly generated Barabási-Albert graph. The number of moments $m$ used for computing the EGSs and the number of bins used for the eigenvalue histogram differ between the left and right panels.

5.2 Beyond the semi-circle law

For the adjacency matrix of an Erdős-Rényi graph with $p \propto 1/n$, the limiting spectral density does not converge to the semi-circle law and has an elevated central portion, and the scale-free limiting density converges to a triangle-like distribution (Farkas et al., 2001). For other random graphs, such as the Barabási-Albert model (Barabási and Albert, 1999), also known as the scale-free network, the probability of a new node being connected to a certain existing node is proportional to the number of links that existing node already has, violating the independence assumption required to derive the semi-circle density. We generate a Barabási-Albert network and, similar to Section 5.1, we learn the EGS and plot the resulting spectral density against the eigenvalue histogram, shown in FIG 3. For the Barabási-Albert network, due to the extremity of the central peak, a much larger number of moments is required to get a reasonable fit. We also note that increasing the number of moments is akin to increasing the number of bins in terms of spectral resolution, as seen in FIG 3.

6 EGS for Measuring Graph Similarity

In this section, we test the use of our EGS in combination with the symmetric KL divergence to measure similarity between different types of synthetic and real world graphs. Note that our proposed EGS, based on the MaxEnt distribution, enables the symmetric KL divergence to be computed analytically, as we show in Appendix F. We first investigate the feasibility of recovering the parameters of random graph models, and then move on to classifying the network type as well as computing graph similarity among various synthetic and real world graphs.
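To see why a closed form is available, note that for two densities of the exponential-polynomial form, $p \propto \exp(-\sum_i \alpha_i \lambda^i)$ and $q \propto \exp(-\sum_i \beta_i \lambda^i)$, the log-ratio is a polynomial plus a constant, so the symmetric KL divergence collapses to $\sum_i (\beta_i - \alpha_i)(\mathbb{E}_p[\lambda^i] - \mathbb{E}_q[\lambda^i])$. This identity is derived here from the definitions and is not necessarily the exact expression given in Appendix F. The sketch below checks it against direct numerical integration on a grid; the coefficients and function names are illustrative assumptions.

```python
import math

def density(alpha, grid, dx):
    """Normalised p(x) proportional to exp(-sum_i alpha_i x^i) on a grid."""
    unnorm = [math.exp(-sum(a * x ** (i + 1) for i, a in enumerate(alpha))) for x in grid]
    z = sum(unnorm) * dx
    return [u / z for u in unnorm]

def moments(p, grid, dx, m):
    """Power moments E_p[x^k] for k = 1..m."""
    return [sum(pi * x ** k for pi, x in zip(p, grid)) * dx for k in range(1, m + 1)]

def symmetric_kl_closed_form(alpha, beta, grid, dx):
    """sum_i (beta_i - alpha_i) (E_p[x^i] - E_q[x^i]); alpha, beta of equal length."""
    p, q = density(alpha, grid, dx), density(beta, grid, dx)
    mp = moments(p, grid, dx, len(alpha))
    mq = moments(q, grid, dx, len(beta))
    return sum((b - a) * (u - v) for a, b, u, v in zip(alpha, beta, mp, mq))

def symmetric_kl_numeric(alpha, beta, grid, dx):
    """Direct evaluation of KL(p||q) + KL(q||p) = integral of (p - q) log(p / q)."""
    p, q = density(alpha, grid, dx), density(beta, grid, dx)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q)) * dx

dx = 1.0 / 500
grid = [(k + 0.5) * dx for k in range(500)]
alpha, beta = [-2.0, 3.0, 0.5], [1.0, -0.5, 2.0]
closed = symmetric_kl_closed_form(alpha, beta, grid, dx)
direct = symmetric_kl_numeric(alpha, beta, grid, dx)
```

The normalisation constants cancel in the symmetric sum, which is why only the coefficients and the moments of the two densities are needed.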

6.1 Inferring parameters of random graph models

We investigate whether one can recover the network parameter values of a graph via its learned EGS. We generate a random graph of a given size and parameter value and learn its entropic spectral characterisation using our EGS learner (Algorithm 1). Then, we generate another graph of the same size but learn its parameter value by minimising the symmetric KL divergence between its entropic spectral surrogate and that of the original graph. We repeat the above procedure for different random graph models, i.e., Erdős-Rényi (ER), Watts-Strogatz (WS) and Barabási-Albert (BA), and different graph sizes ($n = 50, 100, 150$), and the results are shown in Table 1. It can be seen that, given the approximate EGS, we are able to learn the parameters of the graph producing that spectrum well.

Table 1: Average parameters estimated by our MaxEnt-based method for the 3 types of network. The number of nodes in the network is denoted by $n$.

$n$      50      100      150
ER
WS
BA

Table 2: Minimum KL divergence between the EGSs of random networks and that of a large BA graph and the YouTube network.

         Large BA      YouTube
ER
WS
BA

6.2 Learning real world network types

Determining which random graph model best fits a real-world network, as measured by spectral divergence, can lead to a better understanding of its dynamics and characteristics. This has been explored for small biological networks (Takahashi et al., 2012) where full eigen-decomposition is viable. Here, we conduct similar experiments for large networks based on our EGS method. We first test on a large synthetic BA network. By minimising the symmetric KL divergence between its EGS and those of small (1000-node) random networks (ER, WS, BA), we successfully recover its own type. As a real-world use case, we repeat the experiment to determine which random network can best model the YouTube network from the SNAP dataset (Leskovec and Krevl, 2014) and find, as shown in Table 2, that the BA model gives the lowest divergence, which aligns with other findings for social networks (Barabási and Albert, 1999).

Figure 4: Symmetric KL heatmap between 11 graphs from the SNAP dataset: (0) bio-human-gene1, (1) bio-human-gene2, (2) bio-mouse-gene, (3) ca-AstroPh, (4) ca-CondMat, (5) ca-GrQc, (6) ca-HepPh, (7) ca-HepTh, (8) roadNet-CA, (9) roadNet-PA, (10) roadNet-TX.

6.3 Comparing different real world networks

We now consider the feasibility of comparing real world networks using EGSs. Specifically, we take biological networks, citation networks and road networks from the SNAP dataset (Leskovec and Krevl, 2014), and compute the symmetric KL divergences between their EGSs, each learnt with the same number of moments $m$. We present the results in a heat map (FIG 4). We see very clearly that the intra-class divergences between the biological, citation and road networks are much smaller than their inter-class divergences. This strongly suggests that the combination of our EGS method and the symmetric KL divergence can be used to identify similarity in networks. Furthermore, as can be seen in the divergence between the human and mouse networks, the spectra of human genes are more closely aligned with each other than they are with the spectra of mouse genes. This suggests a reasonable amount of intra-class distinguishability as well.

7 EGS for Estimating Cluster Number

It is known from spectral graph theory (Chung, 1997) that the multiplicity of the eigenvalue $0$ of the Laplacian (and of the normalized Laplacian) is equal to the number of connected components in the graph (Von Luxburg, 2007). Previous literature has argued (Ubaru and Saad, ) that, for a small number of inter-cluster connections, matrix perturbation theory (Bhatia, 2013) leads us to expect a number of eigenvalues close to $0$. We make this argument precise with the following theorem.

Theorem 2

The normalised Laplacian eigenvalue, perturbed by adding a single edge between nodes $i$ and $j$ from two previously disconnected clusters $A$ and $B$, is bounded to first order by


where $d_i$ denotes the degree of node $i$ (and similarly $d_j$ for node $j$), and the sums run over all nodes connecting to node $i$ (and similarly for node $j$). Proof. Using Weyl's bound on Hermitian matrices (Bhatia, 2013),


By the definition of the normalized Laplacian


to first order in the binomial expansion. We hence have the result. For two clusters with identical degree $d$, connected by a single inter-cluster link, the zero eigenvalue is perturbed to first order by an amount that shrinks as $d$ grows. Hence, for $k$ inter-cluster connections, the intuition of a small change in the eigenvalue holds if the number of edges between clusters is much smaller than the degree of the nodes within the clusters.

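Algorithm 2 Cluster Number Estimation.
1:  Input: Normalized graph Laplacian $L_{\mathrm{norm}}$, graph dimension $n$, tolerance $\eta$
2:  Output: Number of clusters $N_c$
3:  EGS $p(\lambda) \leftarrow$ Algorithm 1($L_{\mathrm{norm}}$)
4:  Find the minimum $\lambda^*$ that satisfies $\left| \frac{dp(\lambda)}{d\lambda} \right|_{\lambda = \lambda^*} \leq \eta$ and $\frac{d^2 p(\lambda)}{d\lambda^2} \Big|_{\lambda = \lambda^*} > 0$
5:  Calculate $N_c = n \int_{0}^{\lambda^*} p(\lambda)\, d\lambda$

Figure 5: Eigenvalues of the Email dataset with a clear spectral gap. The shaded area multiplied by the number of nodes predicts the number of clusters.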

7.1 Learning the number of clusters in large graphs

For the case of large sparse graphs, where only iterative methods such as the Lanczos algorithm can be used, the same arguments from Section 3 apply. This is because the Dirac delta functions are now weighted, and to obtain a reliable estimate of the eigengap, one must smooth them. We would expect a smoothed spectral density plot to have a spike near $0$. We expect the moments of the spectral density to encode this information and the mass of this peak to be spread. We hence look for the first spectral minimum in the EGS and calculate the number of clusters as shown in Algorithm 2. We conduct a set of experiments to evaluate the effectiveness of our spectral method in Algorithm 2 for learning the number of distinct clusters in a network, where we compare it against the Lanczos algorithm with kernel smoothing on both synthetic and real-world networks.
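The counting step can be sketched on a gridded density as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the density is a hand-built mixture with a small bump near $0$ standing in for the EGS, the first spectral minimum is found by a simple neighbour comparison rather than a derivative tolerance, and the names `first_spectral_minimum` and `estimate_clusters` are hypothetical.

```python
import math

def first_spectral_minimum(density):
    """Index of the first interior local minimum of a gridded density."""
    for k in range(1, len(density) - 1):
        if density[k - 1] > density[k] <= density[k + 1]:
            return k
    raise ValueError("no interior minimum found")

def estimate_clusters(density, dx, n_nodes):
    """n times the probability mass to the left of the first spectral minimum."""
    k = first_spectral_minimum(density)
    return n_nodes * sum(density[: k + 1]) * dx

def bump(x, centre, width):
    """Gaussian bump used only to build the toy density."""
    return math.exp(-0.5 * ((x - centre) / width) ** 2) / (width * math.sqrt(2 * math.pi))

n_nodes, n_clusters = 1000, 30
dx = 0.001
grid = [(k + 0.5) * dx for k in range(2000)]          # eigenvalue range [0, 2]
raw = [n_clusters / n_nodes * bump(x, 0.05, 0.02)
       + (1 - n_clusters / n_nodes) * bump(x, 1.0, 0.15) for x in grid]
total = sum(raw) * dx
density = [r / total for r in raw]                    # normalise on the grid
estimate = estimate_clusters(density, dx, n_nodes)
```

The estimate is only as good as the separation between the near-zero peak and the bulk: if the density has no clear minimum between them, the integration limit is ill-defined and the count becomes unreliable.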

7.1.1 Synthetic networks

The synthetic data consists of disconnected sub-graphs of varying sizes and cluster numbers, to which a small number of inter-cluster edges are added. We use an identical number of matrix-vector multiplications, i.e., $m = 80$ (see Appendix C for experimental details for both EGS and Lanczos methods), estimate the number of clusters and report the fractional error. The results are shown in Table 3. In each case, the method achieving the lowest detection error is highlighted in bold. It is evident that the EGS approach outperforms Lanczos as the number of clusters and the network size increase. We observe a general improvement in performance for larger graphs, visible in the fractional errors for EGS as the graph size increases, but not for kernel-smoothed Lanczos.

$n_c$ ($n$)      9 (270)      30 (900)      90 (2700)      240 (7200)
Table 3: Fractional error in cluster number detection for synthetic networks using EGS and Lanczos methods with 80 moments. $n_c$ denotes the number of clusters in the network and $n$ the number of nodes.

To test the performance of our approach for networks that are too large for eigen-decomposition, we generate two large networks by mixing the ER, WS and BA random graph models. The first large network has 201,600 nodes and comprises 305 interconnected clusters whose sizes vary from 500 to 1000 nodes. The second large network has 404,420 nodes and comprises 1,355 interconnected clusters whose sizes vary from 200 to 400 nodes. The results in FIG 6 show that for both methods, the detection error generally decreases as more moments are used, and our EGS approach again outperforms the Lanczos method for both large synthetic networks.

(a) 305 clusters (b) 1,355 clusters
Figure 6: Log error of cluster number detection using EGS and Lanczos methods on large synthetic networks with (a) 201,600 nodes and 305 clusters and (b) 404,420 nodes and 1,355 clusters.
(a) Email dataset (b) NetScience dataset
Figure 7: Log error of cluster number detection using EGS and Lanczos methods on small-scale real world networks (a) Email network of 1,003 nodes and (b) NetScience network of 1,589 nodes.

7.1.2 Small real world networks

We next experiment with relatively small real world networks, such as the Email network in the SNAP dataset, which is an undirected graph where the nodes represent members of a large European research institution and the edges represent the existence of email communication between them. For such a network, we can still calculate the ground-truth number of clusters by computing the eigenvalues explicitly and finding the spectral gap near $0$. For the Email network, we count the very small eigenvalues before a large jump in magnitude (measured in the log scale) and set this count as the ground-truth. This is shown in FIG 5, where we display the value of each of the eigenvalues in increasing order and how this results in a broadened peak in the EGS. The area under the curve multiplied by the number of network nodes is the number of clusters. We note that this number differs from the number of departments at the research institute given in this dataset. A likely reason for this ground-truth inflation is that certain departments, Astrophysics, Theoretical Physics and Mathematics for example, may collaborate to such an extent that their division in name may not be reflected in terms of node connection structure. We plot the log error against the number of moments for both EGS and Lanczos in FIG 7(a), with EGS showing superior performance. We repeat the experiment on the NetScience collaboration network, which represents a co-authorship network of scientists ($n = 1{,}589$) working on network theory and experiment (Newman, 2006a). The results in FIG 7(b) show that EGS quickly outperforms the Lanczos algorithm as the number of moments grows.

7.1.3 Large real world networks

For large datasets, where the Cholesky decomposition becomes completely prohibitive even for powerful machines, we can no longer define a ground-truth using a complete eigen-decomposition. Alternative “ground-truths” supplied in (Mislove et al., 2007), regarding each set of connected components with more than 3 nodes as a community, are not universally accepted. This definition, along with that of self-declared group membership (Yang and Leskovec, 2015), often leads to contradictions with our definition of a community. A notable example is the Orkut dataset, where the number of stated communities is greater than the number of nodes (Leskovec and Krevl, 2014). Beyond it being impossible to learn such a value from the eigenspectra, such a definition is undesirable if the main reason to learn about clusters is to partition groups and to summarise networks into smaller substructures.

We present our findings for the number of clusters in the DBLP, Amazon and YouTube networks (Leskovec and Krevl, 2014) in Table 4, where we use a varying number of moments. We see that for both the DBLP and Amazon networks, the number of clusters seems to converge with an increasing number of moments $m$, whereas for YouTube such a trend is not visible. This can be explained by looking at the approximate spectral density of the networks implied by maximum entropy in FIG 8. For both DBLP and Amazon (FIG 8(a) and 8(b) respectively), we see that our method implies a clear spectral gap near the origin, indicating the presence of clusters. For the YouTube dataset, shown in FIG 8(c), no such clear spectral gap is visible and hence the number of clusters cannot be estimated accurately.

(a) DBLP
(b) Amazon
(c) YouTube
Figure 8: Spectral densities for DBLP, Amazon and YouTube datasets.
Moments 40 70 100
Amazon
YouTube
Table 4: Cluster number detection by EGS for DBLP, Amazon and YouTube.

8 Conclusion

In this paper, we propose a novel, efficient framework for learning a continuous approximation to the spectrum of large scale graphs, which overcomes the limitations introduced by kernel smoothing. We motivate the informativeness of spectral moments using the link between random graph models and random matrix theory. We show that our algorithm is able to learn the limiting spectral densities of random graph models for which analytical solutions are known. We showcase the strength of this framework in two real world applications, namely, computing the similarity between different graphs and detecting the number of clusters in the graph. Interestingly, we are able to classify different real world networks with respect to their similarity to classical random graph models. The EGS may be of further use to researchers studying network properties and similarities.

Appendix A Stochastic Trace Estimation

The intuition behind stochastic trace estimation is that we can accurately approximate the moments of $\lambda$ with respect to the spectral density $p(\lambda)$ by using computationally cheap matrix-vector multiplications. The moments of $\lambda$ can be estimated using a Monte-Carlo average,

$$n\, \mathbb{E}_p(\lambda^m) = \mathbb{E}_v(v^T X^m v) \approx \frac{1}{n_v} \sum_{j=1}^{n_v} v_j^T X^m v_j,$$

where $v_j$ is any random vector with zero mean and unit covariance and $X$ is an $n \times n$ matrix whose eigenvalues are $\{\lambda_i\}_{i=1}^{n}$. This enables us to efficiently estimate the moments in $\mathcal{O}(n_v \times m \times n_{nz})$ for sparse matrices, where $n_{nz}$ is the number of non-zero entries of $X$ and $n_v \times m \ll n$. We use these as moment constraints in our entropic graph spectrum (EGS) formalism to derive the functional form of the spectral density. Examples of this in the literature include (Ubaru et al., 2017; Fitzsimons et al., 2017).

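1:  Input: Normalized Laplacian $L_{\mathrm{norm}}$, number of probe vectors $n_v$, number of moments required $m$
2:  Output: Moments of normalised Laplacian $\{\mu_i\}$
3:  for $j$ in $1, \ldots, n_v$ do
4:     Initialise random vector $z_j \in \mathbb{R}^{n}$ and set $v \leftarrow z_j$
5:     for $i$ in $1, \ldots, m$ do
6:        $v \leftarrow L_{\mathrm{norm}}\, v$
7:        $\mu_i \leftarrow \mu_i + \frac{1}{n\, n_v} z_j^T v$
8:     end for
9:  end for
Algorithm 3 Learning the Graph Laplacian Moments.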
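A minimal sketch of this moment estimator is given below, assuming Gaussian probe vectors, an unweighted graph stored as adjacency lists, and a matrix-free product with the normalised Laplacian. The function names, the ring graph and the probe count are illustrative assumptions rather than the authors' settings.

```python
import math
import random

def normalised_laplacian_matvec(adj, deg, v):
    """Compute (I - D^{-1/2} W D^{-1/2}) v using adjacency lists only."""
    return [v[i] - sum(v[j] / math.sqrt(deg[i] * deg[j]) for j in adj[i])
            for i in range(len(v))]

def ste_moments(adj, n_probes, n_moments, rng):
    """Estimate mu_k = tr(L^k) / n for k = 1..n_moments with Gaussian probes."""
    n = len(adj)
    deg = [len(nb) for nb in adj]
    mu = [0.0] * n_moments
    for _ in range(n_probes):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]       # zero mean, unit covariance
        v = z
        for k in range(n_moments):
            v = normalised_laplacian_matvec(adj, deg, v)  # v now holds L^{k+1} z
            mu[k] += sum(a * b for a, b in zip(z, v)) / (n * n_probes)
    return mu

# Toy graph: a ring of 30 nodes, each joined to its two neighbours.
n = 30
adj = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
estimates = ste_moments(adj, n_probes=500, n_moments=4, rng=random.Random(0))
```

For the ring graph the normalised Laplacian eigenvalues are $1 - \cos(2\pi j / n)$, so the estimates can be compared with exact moments; each probe costs only $m$ sparse matrix-vector products, which is what makes the approach viable for graphs where eigen-decomposition is not.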

Appendix B Comment on the Lanczos Algorithm

In the state-of-the-art iterative Lanczos algorithm (Ubaru et al., 2017), the tri-diagonal matrix $T$ can be derived from the moment matrix $M$, corresponding to the discrete measure $d\alpha(\lambda)$ satisfying the moments $\mu_i = v^T X^i v = \int \lambda^i\, d\alpha(\lambda)$ for all $i \leq m$ (Golub and Meurant, 1994), and hence it can be seen as a weighted Dirac approximation to the spectral density matching the first $m$ moments. The weight given to every Ritz eigenvalue $\lambda_i$ (the eigenvalues of the matrix $T$) is the square of the first component of the corresponding eigenvector $\phi_i$, i.e., $[\phi_i(1)]^2$, hence the approximated spectral density can be written as

$$p(\lambda) \approx \sum_{i=1}^{m} w_i\, \delta(\lambda - \lambda_i), \qquad w_i = [\phi_i(1)]^2.$$

Appendix C Experimental Details

We use Gaussian random vectors for our stochastic trace estimation, for both EGS and Lanczos (Ubaru et al., 2017). We explain the procedure of going from adjacency matrix to Laplacian moments in Algorithm 3. When comparing EGS with Lanczos, we set the number of moments $m$ equal to the number of Lanczos steps, as both correspond to matrix-vector multiplications in the Krylov subspace. We further use Chebyshev polynomial input instead of power moments for improved performance and conditioning. In order to normalise the moment input we use the normalised Laplacian, with eigenvalues bounded by $[0, 2]$, and divide by $2$. To make a fair comparison we take the output from Lanczos (Ubaru et al., 2017) and apply kernel smoothing (Lin et al., 2016) before applying our cluster number estimator.

Figure 9: Symmetric KL heatmap, obtained using only 3 moments, i.e., a Gaussian approximation, between 11 graphs from the SNAP dataset: (0) bio-human-gene1, (1) bio-human-gene2, (2) bio-mouse-gene, (3) ca-AstroPh, (4) ca-CondMat, (5) ca-GrQc, (6) ca-HepPh, (7) ca-HepTh, (8) roadNet-CA, (9) roadNet-PA, (10) roadNet-TX.
Figure 10: Symmetric KL heatmap, obtained using only a small number of moments $m$, between 11 graphs from the SNAP dataset: (0) bio-human-gene1, (1) bio-human-gene2, (2) bio-mouse-gene, (3) ca-AstroPh, (4) ca-CondMat, (5) ca-GrQc, (6) ca-HepPh, (7) ca-HepTh, (8) roadNet-CA, (9) roadNet-PA, (10) roadNet-TX.

Appendix D EGSs of Real World Networks with Varying Number of Moments

In order to more clearly showcase the practical value of having an EGS based on a large number of moments, we show the symmetric KL divergence between real world networks using a 3-moment Gaussian approximation. The Gaussian is fully defined by its normalization constant, mean and variance and so can be specified with 3 Lagrange multipliers. The results for the same analysis as in Figure 4, but now obtained using a 3-moment Gaussian approximation, are shown in Figure 9. The networks are still somewhat distinguished; however, one can see for example that citation networks and road networks are less clearly distinguished, to the point that the inter-class distance is reduced relative to the intra-class distance, which is not a helpful property for network classification. The problem persists for more moments; for example, when we choose the number of moments that has been reported to be stable for other off-the-shelf maximum entropy algorithms, similar results are observed in Figure 10. In comparison, this is not the case for the larger number of moments used in Figure 4 in the main text.

Appendix E On the Importance of Moments

Given that all iterative methods essentially generate an $m$-moment empirical spectral density (ESD) approximation, it is instructive to ask what information is contained within the first $m$ spectral moments.

To answer this question concretely, we consider the spectra of random graphs. By investigating the finite size corrections and convergence of individual moments of the empirical spectral density (ESD) compared to those of the limiting spectral density (LSD), we see that the observed spectra are faithful to those of the underlying stochastic process. Put simply, given a random graph model, if we compare the moments of the spectral density observed from a single instance of the model to those averaged over many instances, we see that the moments we observe are informative about the underlying stochastic process.
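The comparison between a single instance and an ensemble average can be sketched numerically. The example below is a small illustration, assuming an Erdős–Rényi model, dense eigendecomposition, and power moments of the normalised Laplacian divided by 2; the graph size, edge probability and function names are choices made for the example and are not taken from the paper.

```python
import numpy as np


def er_adjacency(n, p, rng):
    """Sample a symmetric Erdős–Rényi adjacency matrix without self-loops."""
    upper = np.triu(rng.random((n, n)) < p, k=1)
    return (upper | upper.T).astype(float)


def scaled_laplacian_moments(adj, num_moments):
    """Power moments (1/n) sum_i lambda_i^k of the normalised Laplacian divided by 2."""
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    lap = np.eye(adj.shape[0]) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    eigs = np.linalg.eigvalsh(lap) / 2.0
    return np.array([np.mean(eigs ** k) for k in range(1, num_moments + 1)])


def instance_vs_ensemble(n=200, p=0.1, num_moments=5, num_instances=20, seed=0):
    """Return the moments of one sampled graph and the mean over further samples."""
    rng = np.random.default_rng(seed)
    single = scaled_laplacian_moments(er_adjacency(n, p, rng), num_moments)
    ensemble = np.mean(
        [
            scaled_laplacian_moments(er_adjacency(n, p, rng), num_moments)
            for _ in range(num_instances)
        ],
        axis=0,
    )
    return single, ensemble
```

Calling `instance_vs_ensemble()` and inspecting `single - ensemble` gives a direct view of how closely one realisation tracks the ensemble; the difference is expected to shrink as `n` increases.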

E.0.1 ESD moments converge to those of the LSD

The spectra of random graphs with independent edge creation probabilities can be studied through the machinery of random matrix theory (Akemann et al., 2011).

We consider the entries of an $n \times n$ matrix $M_n$ to be zero mean and independent, with bounded moments. For such a matrix, a natural scaling which ensures a bounded norm as $n \to \infty$ is $X_n = M_n/\sqrt{n}$. It can be shown (see, for instance, Feier, 2012) that the moments of a particular instance of a random graph and the related random matrix converge in probability to those of the limiting counterpart, with a finite-size correction that vanishes as $n$ grows.

E.0.2 Finite size corrections to moments get worse with larger moments

A key result, akin to the normal distribution for classical densities, is the semi-circle law for random matrix spectra (Feier, 2012). For matrices with independent entries $a_{ij}$, $i \geq j$, with common element-wise bound $K$, common expectation $\mu$ and variance $\sigma^{2}$, and diagonal expectation $\mathbb{E}\,a_{ii} = \nu$, it can be shown that the corrections to the semi-circle law for the moments of the eigenvalue distribution,

$$\frac{1}{n}\sum_{i=1}^{n}\lambda_{i}^{m},$$

where the $\lambda_{i}$ are the eigenvalues of the scaled matrix, have a corrective factor whose bound grows with the moment order $m$ and decreases with the matrix size $n$ (Füredi and Komlós, 1981).

Hence, the finite size effects are larger for higher moments than for their lower counterparts. This is an interesting result, as it means that for large graphs, the lowest-order moments, which are those learned by any iterative process, best approximate those of the underlying stochastic process.
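The size of these finite-size deviations can be probed with a small Monte Carlo experiment. The sketch below assumes symmetric matrices with independent $\pm 1$ entries scaled by $1/\sqrt{n}$, for which the even moments of the semi-circle law are the Catalan numbers $1, 2, 5$. The entry distribution, the number of trials and the function name are assumptions made for the illustration; it does not reproduce the bound of Füredi and Komlós.

```python
import numpy as np

# Even moments of the semi-circle law with unit variance (Catalan numbers).
CATALAN = {2: 1.0, 4: 2.0, 6: 5.0}


def wigner_moment_errors(n, num_trials=20, seed=0):
    """Mean absolute deviation of empirical moments from the semi-circle moments."""
    rng = np.random.default_rng(seed)
    errors = {k: 0.0 for k in CATALAN}
    for _ in range(num_trials):
        signs = rng.choice([-1.0, 1.0], size=(n, n))
        # Symmetrise: keep the upper triangle (with diagonal) and mirror it.
        sym = np.triu(signs) + np.triu(signs, k=1).T
        eigs = np.linalg.eigvalsh(sym / np.sqrt(n))
        for k, limit in CATALAN.items():
            errors[k] += abs(np.mean(eigs ** k) - limit) / num_trials
    return errors
```

With $\pm 1$ entries the second moment equals 1 exactly for every $n$, so any finite-size effect is only visible from the fourth moment upward. Comparing the returned errors for several values of `n` shows how the deviations of the higher moments behave as the matrix grows.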

Appendix F Analytic Forms for the Differential Entropy and divergence from EGS

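To calculate the differential entropy, we simply note that the logarithm of the maximum entropy density is a polynomial in $\lambda$ whose coefficients are the Lagrange multipliers. The entropy, being the negative expectation of the log-density, therefore reduces to a linear combination of the moment constraints weighted by the Lagrange multipliers.

The KL divergence between two EGSs, $p$ and $q$, can be written in the same way, as a sum over the differences between the Lagrange multipliers of the two densities, each weighted by $\mu_{i}^{p}$, where $\mu_{i}^{p}$ refers to the $i$-th moment constraint of the density $p$. Similarly, the symmetric KL divergence can be written as a sum over products of the Lagrange multiplier differences and the corresponding moment differences between $p$ and $q$, where all the Lagrange multipliers are derived from the optimisation and all the moments are given from the stochastic trace estimation.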
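As a worked illustration of these closed forms, suppose the two EGSs are parametrised as $p(\lambda) = \exp[-(1 + \sum_i \alpha_i \lambda^i)]$ and $q(\lambda) = \exp[-(1 + \sum_i \beta_i \lambda^i)]$. This parametrisation, the symbols $\alpha_i$ and $\beta_i$, and the factor $\tfrac{1}{2}$ in the symmetric divergence are assumptions made here for concreteness; a different sign or normalisation convention changes the constants but not the structure.

```latex
\begin{align}
S(p) &= -\int p(\lambda)\log p(\lambda)\,\mathrm{d}\lambda
      = \int p(\lambda)\Big(1 + \sum_i \alpha_i \lambda^i\Big)\mathrm{d}\lambda
      = 1 + \sum_i \alpha_i \mu_i^{p}, \\
D_{\mathrm{KL}}(p\,\|\,q) &= \int p(\lambda)\big[\log p(\lambda) - \log q(\lambda)\big]\mathrm{d}\lambda
      = \sum_i (\beta_i - \alpha_i)\,\mu_i^{p}, \\
D_{\mathrm{sym}}(p, q) &= \tfrac{1}{2}\big[D_{\mathrm{KL}}(p\,\|\,q) + D_{\mathrm{KL}}(q\,\|\,p)\big]
      = \tfrac{1}{2}\sum_i (\alpha_i - \beta_i)\big(\mu_i^{q} - \mu_i^{p}\big).
\end{align}
```

Here $\mu_i^{p} = \int \lambda^i p(\lambda)\,\mathrm{d}\lambda$, and the sums run over all constraints. No numerical integration is needed once the multipliers and moments are known, which is what makes these quantities cheap to evaluate.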

Appendix G Spectral Density with More Moments

We display the process of spectral learning for both EGS and Lanczos by plotting the spectral density of both methods against the ground truth in Figure 11. In order to make a valid comparison, we smooth the implied density using a Gaussian kernel of fixed bandwidth. Whilst this bandwidth could in theory be optimised over, we considered a range of values and took the smallest for which the density was sufficiently smooth, i.e., everywhere positive on the bounded spectral domain. We note that both EGS and Lanczos approximate the ground truth better with a greater number of moments, and that Lanczos learns the extrema of the spectrum before the bulk of the distribution, while EGS captures the bulk right from the start.
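The smoothing step can be sketched as below. The example assumes the discrete density is given as a set of nodes and weights (for instance, Ritz values and quadrature weights from a Lanczos run); the function names and the bandwidth used in any call are hypothetical and are not the values used for Figure 11.

```python
import numpy as np


def smooth_spectral_density(nodes, weights, grid, bandwidth):
    """Replace each weighted point mass by a Gaussian of the given bandwidth."""
    nodes = np.asarray(nodes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise to a probability measure
    diffs = (grid[:, None] - nodes[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels @ weights


def grid_moment(grid, density, order):
    """Riemann-sum approximation of the moment of a density on a uniform grid."""
    step = grid[1] - grid[0]
    return np.sum(grid ** order * density) * step
```

Convolving with a Gaussian leaves the mean unchanged but adds the squared bandwidth to the second moment, so the smoothed density no longer matches the moments of the original discrete one. The kernel also places some mass outside the bounded spectral domain, which is one reason the bandwidth is kept as small as possible.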

Figure 11: Spectral density for a varying number of moments $m$, for both EGS and Lanczos algorithms, as well as the ground truth.

References


  • G. Akemann, J. Baik, and P. Di Francesco (2011) The Oxford handbook of random matrix theory. Oxford University Press. Cited by: §E.0.1.
  • S. Amari and H. Nagaoka (2007) Methods of information geometry. Vol. 191, American Mathematical Soc.. Cited by: §3.
  • A. Banerjee (2008) The spectrum of the graph Laplacian as a tool for analyzing structure and evolution of networks. Ph.D. Thesis. Cited by: §3.1.
  • A. Barabási and R. Albert (1999) Emergence of scaling in random networks. Science 286 (5439), pp. 509–512. Cited by: §5.2, §6.2.
  • M. Belkin and P. Niyogi (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15 (6), pp. 1373–1396. Cited by: §1.
  • R. Bhatia (2013) Matrix analysis. Vol. 169, Springer Science & Business Media. Cited by: §7, §7.
  • N. Biggs, E. Lloyd, and R. Wilson (1976) Graph theory 1736-1936. Clarendon Press. Cited by: §1.
  • F. R. Chung (1997) Spectral graph theory. American Mathematical Soc.. Cited by: §2, §7.
  • D. Cohen-Steiner, W. Kong, C. Sohler, and G. Valiant (2018) Approximating the spectrum of a graph. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1263–1271. Cited by: §1.
  • T. M. Cover and J. A. Thomas (2012) Elements of information theory. John Wiley & Sons. Cited by: §3.
  • I. J. Farkas, I. Derényi, A. Barabási, and T. Vicsek (2001) Spectra of “real-world” graphs: beyond the semicircle law. Physical Review E 64 (2), pp. 026704. Cited by: §1, §5.2.
  • A. R. Feier (2012) Methods of proof in random matrix theory. Ph.D. Thesis, Harvard University. Cited by: §E.0.1, §E.0.2, §3.1.
  • J. Fitzsimons, D. Granziol, K. Cutajar, M. Osborne, M. Filippone, and S. Roberts (2017) Entropic trace estimates for log determinants. External Links: arXiv:1704.07223 Cited by: Appendix A.
  • G. W. Flake, S. Lawrence, and C. L. Giles (2000) Efficient identification of web communities. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 150–160. Cited by: §1.
  • Z. Füredi and J. Komlós (1981) The eigenvalues of random symmetric matrices. Combinatorica 1 (3), pp. 233–241. Cited by: §E.0.2.
  • G. H. Golub and G. Meurant (1994) Matrices, moments and quadrature. Pitman Research Notes in Mathematics Series, pp. 105–105. Cited by: Appendix B.
  • D. Granziol, B. Ru, S. Zohren, X. Dong, M. Osborne, and S. Roberts (2019) MEMe: an accurate maximum entropy method for efficient approximations in large-scale machine learning. Entropy 21 (6), pp. 551. Cited by: §4.1.
  • E. T. Jaynes (1957) Information theory and statistical mechanics. Phys. Rev. 106, pp. 620–630. Cited by: §4.1.
  • T. Jiang (2012) Empirical distributions of Laplacian matrices of large dilute random graphs. Random Matrices: Theory and Applications 1 (03), pp. 1250004. Cited by: §5.1.
  • J. Leskovec, L. A. Adamic, and B. A. Huberman (2007) The dynamics of viral marketing. ACM Transactions on the Web (TWEB) 1 (1), pp. 5. Cited by: §1.
  • J. Leskovec and A. Krevl (2014) SNAP Datasets: Stanford large network dataset collection. Note: http://snap.stanford.edu/data Cited by: §6.2, §6.3, §7.1.3, §7.1.3.
  • L. Lin, Y. Saad, and C. Yang (2016) Approximating spectral densities of large matrices. SIAM Review 58 (1), pp. 34–65. Cited by: Appendix C.
  • P. N. McGraw and M. Menzinger (2008) Laplacian spectra as a diagnostic tool for network structure and dynamics. Physical Review E 77 (3), pp. 031102. Cited by: §1.
  • A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee (2007) Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pp. 29–42. Cited by: §1.
  • A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee (2007) Measurement and Analysis of Online Social Networks. In Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC’07), San Diego, CA. Cited by: §7.1.3.
  • M. E. Newman (2006a) Finding community structure in networks using the eigenvectors of matrices. Physical Review E 74 (3), pp. 036104. Cited by: §7.1.2.
  • M. E. Newman (2006b) Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103 (23), pp. 8577–8582. Cited by: §1.
  • G. Palla, I. Derényi, I. Farkas, and T. Vicsek (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435 (7043), pp. 814. Cited by: §1.
  • S. Pressé, K. Ghosh, J. Lee, and K. A. Dill (2013) Principles of Maximum Entropy and Maximum Caliber in Statistical Physics. Reviews of Modern Physics 85, pp. 1115–1141. Cited by: §4.1.
  • D. Y. Takahashi, J. R. Sato, C. E. Ferreira, and A. Fujita (2012) Discriminating different classes of biological networks by analyzing the graphs spectra distribution. PLoS One 7 (12), pp. e49949. Cited by: §1, §3.1, §6.2.
  • S. Ubaru, J. Chen, and Y. Saad (2016) Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. Cited by: §3.
  • S. Ubaru, J. Chen, and Y. Saad (2017) Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. SIAM Journal on Matrix Analysis and Applications 38 (4), pp. 1075–1099. Cited by: Appendix A, Appendix B, Appendix C.
  • S. Ubaru and Y. Saad Applications of trace estimation techniques. Cited by: §7.
  • U. Von Luxburg (2007) A tutorial on spectral clustering. Statistics and Computing 17 (4), pp. 395–416. Cited by: §1, §7.
  • J. Yang and J. Leskovec (2015) Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems 42 (1), pp. 181–213. Cited by: §7.1.3.