1 Introduction: Graphs and their Importance
Many systems of interest can be naturally characterised by complex networks; examples include social networks (Mislove et al., 2007b, Flake et al., 2000, Leskovec et al., 2007), biological networks (Palla et al., 2005)
and technological networks. The biological cell can be compactly described as a complex network of chemical reactions. Trends, opinions and ideologies spread on a social network, in which people are nodes and edges represent relationships. The World Wide Web is a complex network of documents with web pages representing nodes and hyperlinks denoting edges. Neural networks, considered state of the art machine learning algorithms for a variety of complex problems, can be seen as directed networks where neurons are the nodes and the synaptic connections between them are the edges. A variety of complex networks have been studied in the literature, from scientific collaborations
Ding (2011), ecological/cellular networks Fath et al. (2007), to sexual contacts (Albert and Barabási, 2002). For a comprehensive introduction we refer the reader to (Newman, 2010).1.1 Network spectra and applications
Networks are mathematically represented by graphs. Of crucial importance to the understanding of the properties of a network or graph is its spectrum, which is defined as the eigenvalues of its adjacency or Laplacian matrix
(Farkas et al., 2001, CohenSteiner et al., 2018). The spectrum of a graph can be considered as a natural set of graph invariants and has been extensively studied in the fields of chemistry, physics and mathematics Biggs et al. (1976). Spectral techniques have been extensively used to characterise the global network structure (Newman, 2006b)and in practical applications thereof, such as facial recognition and computer vision
(Belkin and Niyogi, 2003), learning dynamical thresholds (McGraw and Menzinger, 2008), clustering Von Luxburg (2007), and measuring graph similarity (Takahashi et al., 2012). Applications in quantum chemistry include calculating the electron energy levels in hydrocarbons and their stability Cvetkovic (2009), or minimising the energies of Hamiltonian systems in quantum physics Stevanovic (2011). In the fields of statistics, computer science and machine learning, spectral clustering
(Von Luxburg, 2007) has become a powerful tool for grouping data, regularly outperforming or enhancing other classical algorithms, such as means or single linkage clustering. For most clustering algorithms, estimating the number of clusters is an open problem (Von Luxburg, 2007), with likelihood, adhoc, information theoretic, stability, and spectral approaches advocated. To this end, in spectral clustering, one analyzes the spectral gap in eigenvalues which we refer to as eigengap for short. An accurate estimate of the graph spectrum is therefore critical in this case, which will be discussed later in our paper.1.2 Problem statement
A major limitation in utilizing graph spectra to solve interesting problems such as computing graph similarity and estimating the number of clusters is the inability to learn an everywherepositive nonsingular approximation to the spectral density in an automated and consistent fashion. Current methods rely on either full eigendecompositions, which becomes prohibitive for large graphs, or iterative momentmatched approximations, both of which give a weighted Dirac sum that must be smoothed to be everywhere positive. Beyond requiring a choice of smoothing kernel and kernel bandwidth choice , or number of histogram bins, which are usually chosen in an adhoc manner, we show in this paper that for any smoothing kernel, the spectral moments, which can be seen to be representative of the underlying stochastic process and hence informative, are in fact biased away from their true values. In a nutshell, in order to make certain problems tractable, e.g., comparisons of network spectra, current methods loose the only exact information we have about the network. In this paper, we are interested in an efficient and accurate everywherepositive approximation of the spectral density of large graphs.
1.3 Main contributions
The main contributions of this paper are as follows:

We show that the method of kernel smoothing, commonly used in methods to estimate the spectral density, loses exact moment information;

We propose a computationally efficient smooth spectral density approximation, based on the method of Maximum Entropy, which does not require kernel smoothing. It also admits analytic forms for symmetric and nonsymmetric KLdivergences and Shannon entropy;

We show that our method is able to learn the underlying stochastic process of a network, and can be utilized for computing the similarity among networks from a wide range of synthetic and real world datasets;

We study the behaviour of bounds on changes in the graph spectrum upon perturbation of the graph, and its implication on determining the number of node clusters in the graph. We further demonstrate the superior empirical performance of our method in learning the number of clusters compared to that of the Lanczos algorithm.
2 Preliminaries
Graphs are the mathematical structure underpinning the formulation of networks. Let be an undirected graph with vertex set . Each edge between two vertices and carries a nonnegative weight . corresponds to two disconnected nodes. For unweighted graphs we set for two connected nodes. The adjacency matrix is defined as and . The degree of a vertex is defined as
(1) 
The degree matrix is defined as a diagonal matrix that contains the degrees of the vertices along diagonal, i.e., . The unnormalised graph Laplacian matrix is defined as
(2) 
As is undirected, , which means that the weight matrix is symmetric and hence is symmetric and given is symmetric, the unnormalized Laplacian is also symmetric. As symmetric matrices are special cases of normal matrices, they are Hermitian matrices and have real eigenvalues. Another common characterisation of the Laplacian matrix is the normalised Laplacian (Chung, 1997),
(3) 
where is known as the normalised adjacency matrix ^{1}^{1}1Strictly speaking, the second equality only holds for graphs without isolated vertices.. The spectrum of the graph is defined as the density of the eigenvalues of the given adjacency, Laplacian or normalised Laplacian matrices corresponding to the graph. Unless otherwise specified, we will consider the spectrum of the normalised Laplacian.
3 Motivations for A New Approach on Comparing the Spectra of Large Graphs
In this section, without loss of generality, we motivate the need for a better approach for spectral density approximation in the context of the problem of comparing large graphs.
3.1 Graph comparison using iterative algorithms
For large sparse graphs, with millions or billions of nodes, learning the exact spectrum using eigendecomposition is unfeasible due to the cost. Powerful iterative methods, such as the Lanczos algorithm, are then proposed as costefficient alternatives to approximate the graph spectrum with a sum of weighted Dirac delta functions closely matching the first moments (explained in Appendix 1) of the spectral density Ubaru et al. (2016):
(4) 
where , and denotes the th eigenvalue in the spectrum.
However, such an approximation is undesirable because natural divergence measures between densities, such as the relative entropy from the fields of information theory Cover and Thomas (2012) and information geometry Amari and Nagaoka (2007) shown in equation (5)
(5) 
is infinite for densities that are mutually singular. The use of the JensenShannon divergence simply rescales the divergence into . This puts us in the counterintuitive scenarios, such as,

an infinite (or maximal) divergence upon the removal or addition of a single edge or node in a large network;

an infinite (or maximal) divergence between two graphs generated using the same random graph model and identical hyperparameters.
This does not comply with our notion of network similarity. Two networks generated from the same stochastic process with the same hyperparameters are highly similar and hence should have a low divergence. Similarly, the removal of an edge in a large network, such as for example two people unfriending each other on a large social network, would not in general be considered a fundamental change in the network structure.
One way to circumvent the above problem is to use kernel smoothing. However, as we argue in the following, this results in losing the original moment information.
3.2 On the Importance of Moments
Given that all iterative methods essentially generate a moment empirical spectral density (ESD) approximation, it is instructive to ask what information is contained within the first spectral moments.
To answer this question concretely, we consider the spectra of random graphs. By investigating the finite size corrections and convergence of individual moments of the empirical spectral density (ESD) compared to those of the limiting spectral density (LSD), we see that the observed spectra are faithful to those of the underlying stochastic process. Put simply, given a random graph model, if we compare the moments of the spectral density observed from a single instance of the model to that averaged over many instances, we see that the moments we observe are informative about the underlying stochastic process.
3.2.1 ESD moments converge to those of the LSD
For random graphs, with independent edge creation probabilities, their spectra can be studied through the machinery of random matrix theory
Akemann et al. (2011).We consider the entries of an matrix to be zero mean and independent, with bounded moments. For such a matrix, a natural scaling which ensures we have bounded norm as is . It can be shown (see for instance Feier (2012)) that the moments of a particular instance of a random graph and the related random matrix converge to those of the limiting counterpart in probability with a correction of .
3.2.2 Finite size corrections to moments get worse with larger moments
A key result, akin to the normal distribution for classical densities, is the semicircle law for random matrix spectra
Feier (2012). For matrices with independent entries , , with common elementwise bound , common expectationand variance
, and diagonal expectation , it can be shown that the corrections to the semicircle law for the moments of the eigenvalue distribution,(6) 
have a corrective factor bounded by Füredi and Komlós (1981)
(7) 
Hence, the finite size effects are larger for higher moments than that for the lower counterparts. This is an interesting result, as it means that for large graphs with , the lowest order moments, which are those learned by any iterative process, best approximate those of the underlying stochastic process.
3.3 The argument against kernel smoothing
To alleviate the limitations explained in this section, practitioners typically generate a smoothed spectral density by convolving the Dirac mixture with a smooth kernel Takahashi et al. (2012), Banerjee (2008), typically Gaussian or Cauchy,
(8)  
or simply a histogram Banerjee (2008)
to facilitate visualisation and comparison. However, this introduces hyperparameters, such as the choice of convolving kernel, the smoothing parameter or the number of bins for the histogram, which heavily affect the resolution of the spectra. To show this, the moments of the Dirac mixture are given as,
(9) 
For simplicity we assume that the kernel function is symmetric, defined on the real line, and permits all moments, which are satisfied for the commonly used Gaussian kernel. Consider the moments of the modified smooth function in (8)
(10)  
where the sum over is up to , with being if is even and if
is odd. Further,
denotes the th central moment of the kernel function . We have used the fact that the infinite domain is invariant under shift reparamatrisation and that the odd moments of any symmetric distribution are . This proves that kernel smoothing alters moment information and that this process gets more pronounced for higher moments. Furthermore, given that , and, for the normalised Laplacian with , the corrective term is manifestly positive and so the smoothed moment estimates are biased. For large random graphs, the moments of a generated instance converge to those averaged over many instances Feier (2012), hence by biasing our moment information we limit our ability to learn about the underlying stochastic process.4 The Method of Maximum Entropy
The method of maximum entropy, hereafter referred to as MaxEnt Pressé et al. (2013)
, is a procedure for generating the most conservative density estimate, with respect to the uniform distribution. It can be seen as the maximally uncertain probability distribution possible with the given information. To determine the spectral density
using MaxEnt, we maximise the entropic functional(11) 
with respect to , where are the power moment constraints on the spectral density, which are estimated using stochastic trace estimation as explained in section 4.2. Algorithm 1 presents the procedure to learn them. The resultant MaxEnt spectral density has the form
(12) 
where the coefficients are derived from optimising (11). For simplicity, we denote as . We develop a novel algorithm (Algorithm 2), where NCG denotes Newton conjugate gradient, for determining the density of maximum entropy. We use analytical expressions for the gradient and Hessian explicitly as opposed to approximately Bandyopadhyay et al. (2005). We note that the method of maximum entropy has been used previously for calculating the spectra of large sparse Hamiltonians in Physics, and has been shown to be far superior in terms of moment information than kernel polynomial methods Silver and Röder (1997). Therefore, we do not compare our method against kernel polynomial methods in this paper.
4.1 Analytic forms for the differential entropy and divergence from MaxEnt
To calculate the differential entropy we simply note that
(13) 
The KL divergence between two MaxEnt spectra, and , can be written as,
(14) 
where refers to the th moment constraint of the density . Similarly, the symmetricKL divergence can be written as,
(15) 
where all the and are derived from the optimisation and all the are given from the stochastic trace estimation.
4.2 Stochastic trace estimation
The intuition behind stochastic trace estimation is that we can accurately approximate the moments of with respect to the spectral density
by using computationally cheap matrixvector multiplications. The moments of
can be estimated as,(16) 
where is any random vector with zero mean and unit covariance and is a matrix whose eigenvalues are . This enables us to efficiently estimate the moments in for sparse matrices, where . We use these as moment constraints in our MaxEnt formalism to derive the functional form of the spectral density. Examples of this in the literature include Ubaru et al. (2017), Fitzsimons et al. (2017).
4.3 The Entropic Spectral Learning algorithm
Algorithm 1 learns the moments of the normalised graph Laplacian. The appropriate Lagrange multipliers for the maximum entropy density are learned with algorithm 2. The full algorithm, which takes the matrix as an input and gives the maximum entropy spectral density, is then summarized in Algorithm 3.
5 Visualising the modelling power of MaxEnt
Having developed a theory for why a smooth exact moment matched approximation of the spectral density is crucial to learning the characteristics of the underlying stochastic process, and having proposed a set of Algorithms 1 and 2 to learn such a density, we test the practical utility of our method and algorithm on examples where the limiting spectral density is known.
5.1 The semicircle law
For ErdősRényi graphs with edge creation probability , and , the limiting spectral density of the normalised Laplacian converges to the semicircle law and its Laplacian converges to the free convolution of the semicircle law and Jiang (2012).
We consider here to what extent a MaxEnt distribution with finite moments can effectively approximate the density. Wigner’s density is fully defined by its infinite number of central moments given by , where are known as the Catalan numbers. As a toy example we generate a semi circle centered at with and give the moments analytically to the maximum entropy algorithm. As can be seen in FIG 1, for moments, the central portion of the density is already well approximated, but the end points are not. This is largely corrected for moments.
We plot the relative entropy between the semicircle density and its maximum entropy surrogate, which we show in FIG 2. This shows that even a very modest number of moments can give an excellent fit.
5.2 Application to ErdősRényi normalised Laplacian
We generate a ErdősRényi graph with and , and learn the moments using stochastic trace estimation. We then compare the fit between the MaxEnt distribution computed using a different numbers of input moments and the graph eigenvalue histogram computed by eigendecomposition. We plot the results in FIG 3. One striking difference between this experiment and the previous one is the number of moments needed to give a good fit. This can be seen especially clearly in the top left subplot of FIG 3, where the 3 moment, i.e Gaussian approximation, completely fails to capture the bounded support of the spectral density. Given that the exponential polynomial density is positive everywhere, it needs more moment information to learn the regions of boundedness of the spectral density in its domain. In the previous example we artificially alleviated this phenomenon by putting the support of the semicircle within the entire domain. It can be clearly seen that increasing moment information successively improves the fit to the support FIG 3. Furthermore, the magnitude of the oscillations, which are characteristic of an exponential polynomial function, decay in magnitude for larger moments.
5.3 Beyond the semicircle law
For the adjacency matrix of an ErdősRényi graph with , the limiting spectral density does not converge to the semicircle law and has an elevated central portion, and the scale free limiting density converges to a triangle like distribution Farkas et al. (2001). For other random graph, such as the BarabásiAlbert Barabási and Albert (1999) also known as the scale free network, the probability of a new node being connected to a certain existing node is proportional to the number of links that existing node already has, violating the independence assumption required to derive the semicircle density. We plot a BarabásiAlbert network () and, similar to section 5.2, we learn the moments using stochastic trace estimation and plot the resulting MaxEnt spectral density against the eigenvalue histogram, as shown in FIG 4. For the BarabásiAlbert network, due to the extremity of the central peak, a much larger number of moments is required to get a reasonable fit. We also note that increasing the number of moments is essentially akin to increasing the number of bins in terms of spectral resolution, as seen in FIG 4.
6 MaxEnt for computing graph similarity
In this section, we test the performance of our MaxEntbased method for computing similarity between different types of synthetic and real world graphs. We first investigate the feasibility in recovering the parameters of random graph models, and then move onto classifying the network type as well as computing graph similarity by measuring the symmetric KL divergence among various synthetic and real world graphs.
6.1 Inferring parameters of random graph models
We investigate whether one can recover the network parameter values of a graph via its MaxEnt spectral density. We generate a random graph of a given size and parameter value (e.g., ) and learn its MaxEnt spectral characterisation. Then, we generate another graph of the same size but learn its parameter value by minimising the symmetricKL divergence between its MaxEnt spectral surrogate and that of the original graph. We repeat the above procedures for different random graph models and different graph sizes (), and the results are shown in Table 1. It can be seen that given simply the approximate MaxEnt spectrum, we are able to learn rather well the parameters of the graph producing that spectrum. Determining which random graph models best fit real world networks, characterised by their spectral divergence, so as to better understand their dynamics and characteristics has been explored in biology (Takahashi et al., 2012), a potential application domain for our method.
50  100  150  

ErdősRényi ()  
WattsStrogatz ()  
BarabásiAlbert () 
6.2 Learning real world network types using MaxEnt and the symmetric KL divergence
As a realworld use case, we investigate which random network between ErdősRényi, WattsStrogatz and BarabásiAlbert can best model the YouTube network from the SNAP dataset Leskovec and Krevl (2014). To this end, we compute the divergence between the MaxEnt spectral density of the YouTube network and those of the randomly generated graphs. We find, as shown in Table 2, that the BarabásiAlbert gives the lowest divergence, which aligns with other findings for social networks (Barabási and Albert, 1999).
Synthetic  Youtube  

ErdősRényi  
WattsStrogatz  
BarabásiAlbert 
6.3 Comparing different real world networks
We now consider the feasibility of comparing real world graphs using their MESDs. Specifically, we take biological networks, citation networks and road networks from the SNAP dataset Leskovec and Krevl (2014), and compute the symmetric kl divergences between their MESDs with moments. We plot the results in the form of a heat map in FIG 5. We see very clearly that the intraclass divergences between the biological, citation and road networks are much smaller than their interclass divergences. This strongly suggests that the combination of a MESD and the symmetric KL divergence can be used to identify similarity in networks. Furthermore, as can be seen in the divergence between the human and mouse network, the spectra of the human genes are more closely aligned with each other than they are with the spectra of mouse genes. This suggests a reasonable amount of intraclass distinguishability as well.
7 MaxEnt for estimating cluster number
In this section, we discuss a second application of the proposed MaxEntbased method, which is to estimate the number of clusters (i.e., communities) in a graph. We first recall the following results from spectral graph theory (Chung, 1997).
Proposition 1
Let G be an undirected graph with nonnegative weights. Then the multiplicity of the eigenvalue of the Laplacian is equal to the number of connected components
in the graph. The eigenspace of the eigenvalue
is spanned by the indicator vectors .We refer the reader to Von Luxburg (2007) for a simple proof of this result. Consequently, to learn the number of connected components in a graph, we simply count the number of eigenvalues. We now extend the idea behind this proposition, and consider groups of nodes containing far greater intragroup connections than intergroup connections as clusters. For small changes in the Laplacian, i.e., a very small number of links between the previously disconnected clusters, we expect from matrix perturbation theory (Bhatia, 2013) the next smallest eigenvalues to be close to . This is given by Weyl’s bound on Hermitian matrices,
(17) 
where is the perturbed version of graph with a small number of edges between the previously disconnected clusters, are the perturbed eigenvalues, which differ from (where for ) by an amount , which is bounded by the norm of the difference matrix . Notice that the bound holds for any consistent matrix norm. In the case of an entrywise norm, for every edge added between the th and th node that are from previously separate clusters, the norm of the difference matrix goes as , where and denote the neighborhood of and in , respectively, and denotes the degree of the th node. Assuming that the degrees of the nodes in each cluster to be similar and , we see that for each added intercluster edge the bound grows as .
7.1 Motivations for a new approach on learning the number of clusters in large graphs
For the case of large sparse graphs, where only iterative methods such as the Lanczos algorithm can be used, the same arguments of the previous section 3 apply. This is because the delta functions are now weighted, and to obtain a reliable estimate of the eigengap, one must smooth the delta functions.
7.2 Using MaxEnt to learn the number of clusters
For our definition of a cluster described above, we would expect a smoothed spectral density plot to have a spike near . We expect the moments of the spectral density to encode this information and the mass of this peak to be spread. We hence look for the first spectral minimum in the MaxEnt spectral density and calculate the number of clusters as shown in Algorithm 4.
7.3 Experiments
This set of experiments evaluates the effectiveness of our spectral method in Algorithm 4 for learning the number of distinct clusters in a network, where we compare it against the Lanczos algorithm with kernel smoothing on both synthetic and realworld networks.
7.3.1 Synthetic networks
The synthetic data consists of disconnected subgraphs of varying sizes and cluster numbers, to which a small number of intracluster edges are added. We use an identical number of matrix vector multiplications, i.e., (see Appendix 2 for experimental details for both MaxEnt and Lanczos methods), and estimate the number of clusters and report the fractional error. The results are shown in Table 3. In each case, the method achieving lowest detection error is highlighted in bold. It is evident that the MaxEnt approach outperforms Lanczos as the number of clusters and the network size increase. We observe a general improvement in performance for larger graphs, visible in the differences between fractional errors for MaxEnt as the graph size increases and not kernelsmoothed Lanczos.
()  Lanczos  MaxEnt 

9 (270)  
30 (900)  
90 (2700)  
240 (7200) 
To test the performance of our approach for networks that are too large to apply eigendecomposition, we generate two large networks by mixing the ErdösRényi, WattsStrogatz and BarabásiAlbert random graph models. The first large network has a size of 201,600 nodes and comprises 305 interconnected clusters whose size varies from 500 to 1000 nodes. The second large network has a size of 404,420 nodes and comprises interconnected 1355 clusters whose size varies from 200 to 400 nodes. The results in FIG 6 show that for both methods, the detection error generally decreases as more moments are used, and our maximum entropy approach again outperforms the Lanczos method for both large synthetic networks.
7.3.2 Small real world networks
We next experiment with relatively small real world networks, such as the Email network in the SNAP dataset, which is an undirected graph where the nodes represent members of a large European research institution and the edges represent the existence of email communication betweem them. For such network, we can still calculate the groundtruth number of clusters by computing the eigenvalues explicitly and finding the spectral gap near . For the Email network, we count very small eigenvalues before a large jump in magnitude (measured in the log scale) and set this as the groundtruth. This is shown in FIG 7, where we display the value of each of the eigenvalues in increasing order and how this results in a broadened peak in the MaxEnt spectrum. The area under the curve multiplied by the number of network nodes is the number of clusters . We note that the number differs from the value of given by the number of departments at the research institute in this dataset. A likely reason for this groundtruth inflation is that certain departments, Astrophysics, Theoretical Physics and Mathematics for example, may collaborate to such an extent that their division in name may not be reflected in terms of node connection structure.
We display the process of spectral learning for both MaxEnt and Lancsoz, by plotting the spectral density of both methods against the groundtruth in FIG 8. In order to make a valid comparison, we smooth the implied density using a Gaussian kernel with . Whilst this number could in theory be optimised over, we considered a range of values and took the smallest for which the density was sufficiently smooth, i.e., everywhere positive on the bounded domain . We note that both MaxEnt and Lancsoz approximate the groundtruth better with a greater number of moments and that Lancsoz learns the extrema of the spectrum before the bulk of the distribution while MaxEnt spectruam captures the bulk right from the start.
We plot the log error against the number of moments for both MaxEnt and Lancsoz in FIG 8(a), with MaxEnt showing superior performance. We repeat the experiment on the Net Science collaboration network, which represents a coauthorship network of scientists () working on network theory and experiment (Newman, 2006a). The results in FIG 8(b) show that MaxEnt quickly outperforms the Lanczos algorithm after around moments.
7.3.3 Large real world networks
For large datasets with , where the Cholesky decomposition becomes completely prohibitive even for powerful machines, we can no longer define a groundtruth using a complete eigendecomposition. Alternative “groundtruths” supplied in Mislove et al. (2007a), regarding each set of connected components with more than 3 nodes as a community, are not universally accepted. This definition, along with that of selfdeclared group membership Yang and Leskovec (2015), often leads to contradictions with our definition of a community. A notable example is the Orkut dataset, where the number of stated communities is greater than the number of nodes Leskovec and Krevl (2014). Beyond being impossible to learn such a value from the eigenspectra, if the main reason to learn about clusters is to partition groups and to summarise networks into smaller substructures, such a definition is undesireable.
We present our findings for the number of clusters in the DBLP (), Amazon () and YouTube () networks Leskovec and Krevl (2014) in Table 4, where we use a varying number of moments. We see that for both the DBLP and Amazon networks, the number of clusters seems to converge with increasing moments number , whereas for YouTube such a trend is not visible. This can be explained by looking at the approximate spectral density of the networks implied by maximum entropy in FIG 10. For both DBLP and Amazon (FIG 9(a) and 9(b) respectively), we see that our method implies a clear spectral gap near the origin, indicating the presence of clusters. Whereas for the YouTube dataset, shown in FIG 9(c), no such clear spectral gap is visible and hence the number of clusters cannot be estimated accurately.
Moments  40  70  100 

DBLP  
Amazon  
Youtube 
8 Conclusion
In this paper, we propose a novel, efficient framework for learning a continuous approximation to the spectrum of large scale graphs, which overcomes limitations introduced from kernel smoothing. We motivate the informativeness of spectral moments using the link between random graph models and random matrix theory. We show that our algorithm is able to learn the limiting spectral densities of random graph models for which analytical solutions are known.
We showcase the strength of this framework in two real world applications, namely, computing the similarity between different graphs and detecting the number of clusters in the graph. Interestingly, we are able to classify different real world networks with respect to their similarity to classical random graph models.
In future work, one could look at the temporal evolution of real world networks and a more complete analysis of the effect of the number of moments on the accuracy of network classification.
References
 Akemann et al. (2011) G. Akemann, J. Baik, and P. Di Francesco. The Oxford handbook of random matrix theory. Oxford University Press, 2011.
 Albert and Barabási (2002) R. Albert and A.L. Barabási. Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47, 2002.
 Amari and Nagaoka (2007) S.i. Amari and H. Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2007.
 Bandyopadhyay et al. (2005) K. Bandyopadhyay, A. K. Bhattacharya, P. Biswas, and D. Drabold. Maximum entropy and the problem of moments: A stable algorithm. Physical Review E, 71(5):057701, 2005.
 Banerjee (2008) A. Banerjee. The spectrum of the graph Laplacian as a tool for analyzing structure and evolution of networks. PhD thesis, 2008.
 Barabási and Albert (1999) A.L. Barabási and R. Albert. Emergence of scaling in random networks. science, 286(5439):509–512, 1999.
 Belkin and Niyogi (2003) M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
 Bhatia (2013) R. Bhatia. Matrix analysis, volume 169. Springer Science & Business Media, 2013.
 Biggs et al. (1976) N. Biggs, E. Lloyd, and R. Wilson. Graph theory 17361936, 1976, 1976.
 Chung (1997) F. R. Chung. Spectral graph theory. Number 92. American Mathematical Soc., 1997.
 CohenSteiner et al. (2018) D. CohenSteiner, W. Kong, C. Sohler, and G. Valiant. Approximating the spectrum of a graph. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1263–1271. ACM, 2018.
 Cover and Thomas (2012) T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012.
 Cvetkovic (2009) D. M. Cvetkovic. Applications of graph spectra: An introduction to the literature. Applications of Graph Spectra, 13(21):7–31, 2009.
 Ding (2011) Y. Ding. Scientific collaboration and endorsement: Network analysis of coauthorship and citation networks. Journal of informetrics, 5(1):187–203, 2011.
 Farkas et al. (2001) I. J. Farkas, I. Derényi, A.L. Barabási, and T. Vicsek. Spectra of “realworld” graphs: Beyond the semicircle law. Physical Review E, 64(2):026704, 2001.
 Fath et al. (2007) B. D. Fath, U. M. Scharler, R. E. Ulanowicz, and B. Hannon. Ecological network analysis: network construction. ecological modelling, 208(1):49–55, 2007.
 Feier (2012) A. R. Feier. Methods of proof in random matrix theory. PhD thesis, Harvard University, 2012.
 Fitzsimons et al. (2017) J. Fitzsimons, D. Granziol, K. Cutajar, M. Osborne, M. Filippone, and S. Roberts. Entropic trace estimates for log determinants, 2017.
 Flake et al. (2000) G. W. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 150–160. ACM, 2000.
 Füredi and Komlós (1981) Z. Füredi and J. Komlós. The eigenvalues of random symmetric matrices. Combinatorica, 1(3):233–241, 1981.
 Golub and Meurant (1994) G. H. Golub and G. Meurant. Matrices, moments and quadrature. Pitman Research Notes in Mathematics Series, pages 105–105, 1994.
 Jiang (2012) T. Jiang. Empirical distributions of laplacian matrices of large dilute random graphs. Random Matrices: Theory and Applications, 1(03):1250004, 2012.
 Jones et al. (2001–) E. Jones, T. Oliphant, P. Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. URL http://www.scipy.org/. [Online; accessed ¡today¿].
 Leskovec and Krevl (2014) J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
 Leskovec et al. (2007) J. Leskovec, L. A. Adamic, and B. A. Huberman. The dynamics of viral marketing. ACM Transactions on the Web (TWEB), 1(1):5, 2007.
 Lin et al. (2016) L. Lin, Y. Saad, and C. Yang. Approximating spectral densities of large matrices. SIAM review, 58(1):34–65, 2016.
 McGraw and Menzinger (2008) P. N. McGraw and M. Menzinger. Laplacian spectra as a diagnostic tool for network structure and dynamics. Physical Review E, 77(3):031102, 2008.
 Mislove et al. (2007a) A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and Analysis of Online Social Networks. In Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC’07), San Diego, CA, October 2007a.
 Mislove et al. (2007b) A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pages 29–42. ACM, 2007b.
 Newman (2010) M. Newman. Networks. Oxford University Press, 2010.

Newman (2006a)
M. E. Newman.
Finding community structure in networks using the eigenvectors of matrices.
Physical review E, 74(3):036104, 2006a.  Newman (2006b) M. E. Newman. Modularity and community structure in networks. Proceedings of the national academy of sciences, 103(23):8577–8582, 2006b.
 Palla et al. (2005) G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. nature, 435(7043):814, 2005.
 Pressé et al. (2013) S. Pressé, K. Ghosh, J. Lee, and K. A. Dill. Principles of Maximum Entropy and Maximum Caliber in Statistical Physics. Reviews of Modern Physics, 85:1115–1141, Jul 2013.
 Silver and Röder (1997) R. Silver and H. Röder. Calculation of densities of states and spectral functions by chebyshev recursion and maximum entropy. Physical Review E, 56(4):4822, 1997.
 Stevanovic (2011) D. Stevanovic. Applications of graph spectra in quantum physics. Selected Topics in Applications of Graph Spectra, pages 85–111, 2011.
 Takahashi et al. (2012) D. Y. Takahashi, J. R. Sato, C. E. Ferreira, and A. Fujita. Discriminating different classes of biological networks by analyzing the graphs spectra distribution. PLoS One, 7(12):e49949, 2012.
 Ubaru et al. (2016) S. Ubaru, J. Chen, and Y. Saad. Fast Estimation of tr (f (a)) via Stochastic Lanczos Quadrature. 2016.
 Ubaru et al. (2017) S. Ubaru, J. Chen, and Y. Saad. Fast estimation of tr(f(a)) via stochastic lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099, 2017.
 Von Luxburg (2007) U. Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
 Yang and Leskovec (2015) J. Yang and J. Leskovec. Defining and evaluating network communities based on groundtruth. Knowledge and Information Systems, 42(1):181–213, 2015.
Appendix
1 Comment on Lanczoz algorithm
In the stateoftheart iterative algorithm Lanczos Ubaru et al. (2017), the tridiagonal matrix can be derived from the moment matrix , corresponding to the discrete measure satisfying the moments for all Golub and Meurant (1994) and hence it can be seen as a weighted Dirac approximation to the spectral density matching the first moments. The weight given on every Ritz eigenvalue (the eigenvalues of the matrix ) is the square of the first component of the corresponding eigenvector, i.e., , hence the approximated spectral density can be written as,
(18) 
2 Experimental details
We use Gaussian random vectors for our stochastic trace estimation, for both MaxEnt and Lanczos Ubaru et al. (2017). We explain the procedure of going from adjacency matrix to Laplacian moments in Algorithm 1. When comparing MaxEnt with Lanczos, we set the number of moments equal to the number of Lanczos steps, as they are both matrix vector multiplications in the Krylov subspace. We implement a quadrature MaxEnt method in Algorithm 2. We use a grid size of over the interval and add diagonal noise on the Hessian to improve conditioning and symmetrise it. We further use Chebyshev polynomial input instead of power moments for improved performance and conditioning. In order to normalise the moment input we use the normalised Laplacian with eigenvalues bounded by and divide by . We use Python’s Scipy implementation of the Newton conjugate gradient algorithm Jones et al. (2001–) for the MaxEnt Lagrange multipliers. To make a fair comparison we take the output from Lanczos Ubaru et al. (2017) and apply kernel smoothing Lin et al. (2016) before applying our cluster number estimator.
3 MESDs of real world networks with varying number of moments
In order to more clearly showcase the practical value of having a MESD based on a large number of moments, we show the symmetric KL divergence between real world networks using a moment Gaussian approximation. The Gaussian is fully defined by its normalization constant, mean and variance and so can be specified with Lagrange multipliers. The results for the same analysis as in FIG 5, but now obtained using a moment Gaussian approximation, are shown in FIG 11. The networks are still somewhat distinguished; however, one can see for example that citation networks and road networks are less clearly distinguished to the point that interclass distance is lessened compared to intraclass distance, which for the purpose of network classification is not a particularly helpful property. The problem still persists for more moments; for example, when we choose , which is what has been reported stable for other offtheshelf maximum entropy algorithms, similar results are observed in FIG 12. In comparison, this is not the case for more moments in FIG 5 in the main text.
Comments
There are no comments yet.