1 Introduction
Spectral clustering (Von Luxburg, 2007)
refers to a family of algorithms that share two main steps: first, computing the spectral decomposition of a (possibly regularised) data matrix; and second, applying a clustering algorithm to a point cloud extracted from the eigenvectors. When the matrix holds distances or affinities, spectral clustering allows the estimation of non-circular clusters in pointillist data
(Ng et al., 2002). When the matrix represents a graph, it enables the discovery of communities (Von Luxburg, 2007). In the case of graphs, one can talk quite precisely about the relative merits of different regularisation techniques (e.g. adjacency versus normalised Laplacian), which eigenvectors to select (e.g. those corresponding to large eigenvalues versus large-magnitude eigenvalues), and which clustering algorithm to use (e.g. k-means versus Gaussian mixture modelling). While the first decision is complicated (Tang and Priebe, 2019), asymptotic analysis now clearly favours the second option in each of the two remaining decisions (Rohe et al., 2011; Athreya et al., 2017; Rubin-Delanchy et al., 2018). These determinations are made under the assumption that the data follow a stochastic block model (Holland et al., 1983), where the probability of an edge depends only on the (unknown) community memberships of the corresponding nodes.
The natural extension to a real-valued matrix is to assume the (i, j)th entry is a real random variable whose distribution depends only on the communities of nodes i and j (Xu et al., 2017). (Under the ordinary stochastic block model this distribution would be Bernoulli.) While defining a normalised Laplacian is not entirely straightforward, since for example a node's 'degree' could be negative and would need to be square-rooted, the second and third questions are still pertinent: which eigenvectors and which clustering algorithm should be used?
This paper presents a central limit theorem showing that, asymptotically, the point cloud obtained by spectrally embedding a real-valued matrix from a weighted stochastic block model follows a Gaussian mixture model with elliptical components whose centres and covariance matrices are explicitly calculable. This result implies that, for statistical consistency, eigenvectors selected by eigenvalue magnitude must be used, and that, for optimality, one should use Gaussian clustering, not k-means. Another application of this result is to allow a choice between data representations, for example, whether to embed the matrix of counts or log-counts. Since two data representations produce two different mixture distributions, one can compare how well the components separate in each case. Following Tang and Priebe (2019), we do this using Chernoff information. In a relevant formalisation of the network anomaly detection problem, we are thus able to show that embedding the matrix of log p-values, rather than raw p-values, is statistically more efficient. This theoretical observation is validated in a cybersecurity example.
Finally, affine transformation of a real-valued matrix's entries does not change the Chernoff information of the associated asymptotic clustering problem. In other words, one need not worry about the origin and scale of the measurements in the data matrix, for example, whether temperature is measured in Celsius or Fahrenheit. Yet affine transformation can cause important eigenvalues to flip sign and Gaussian clusters to change shape, and so this invariance hinges on choosing eigenvectors from both sides of the spectrum and using Gaussian clustering; otherwise, performance will vary substantially.
2 Spectral clustering in the weighted stochastic block model
2.1 The weighted stochastic block model
Definition 1 (Weighted stochastic block model).
Given n nodes and K communities, an undirected weighted graph with symmetric adjacency matrix A follows a K-community weighted stochastic block model if there is a partition of the nodes into communities conditional upon which the entries A_ij, for all i ≤ j, are independent with distributions depending only on the pair (z_i, z_j),
where z_i ∈ {1, …, K} is an index denoting the community of node i,
assigned independently according to a probability vector π,
where Σ_k π_k = 1. Define K × K matrices B and C
as the block means and variances respectively of the distributions of the entries A_ij, for i ≤ j, where it is assumed the moments exist. For example, a 2-community unweighted stochastic block model with intra-community (respectively, inter-community) link probability p1
(respectively, p2) has
B = [[p1, p2], [p2, p1]],  C = [[p1(1 − p1), p2(1 − p2)], [p2(1 − p2), p1(1 − p1)]].
The signature of a weighted stochastic block model, (p, q), is defined as the number of strictly positive and strictly negative eigenvalues of B respectively; let d = p + q. We can choose x_1, …, x_K ∈ R^d such that B_kl = x_k^T I_{p,q} x_l, for k, l ∈ {1, …, K}, where I_{p,q} = diag(1, …, 1, −1, …, −1), with p ones followed by q minus ones on its diagonal. One choice is to use the rows of U|Λ|^{1/2}, where UΛU^T is the spectral decomposition of B restricted to its d non-zero eigenvalues. (We will use |Λ| and |Λ|^{1/2} to denote the elementwise absolute value and power of a diagonal matrix Λ.)
The vector x_{z_i} can be interpreted as a canonical latent position for node i in the weighted stochastic block model. Latent positions of a stochastic block model are only identifiable up to transformation by elements of the indefinite orthogonal group O(p, q). Attempts to infer the latent positions from the adjacency matrix of a weighted stochastic block model must take unidentifiability up to transformation from O(p, q) into account.
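As a concrete sketch of the construction above, the canonical latent positions can be recovered from a block mean matrix with a few lines of numpy; the disassortative matrix B below is a hypothetical example chosen so the signature contains a minus sign, not a model from the paper.

```python
import numpy as np

# Hypothetical disassortative 2-community block mean matrix (illustration only).
B = np.array([[0.2, 0.5],
              [0.5, 0.2]])

# Spectral decomposition of B, restricted to its non-zero eigenvalues.
evals, evecs = np.linalg.eigh(B)
keep = np.abs(evals) > 1e-10
evals, evecs = evals[keep], evecs[:, keep]

# Order so positive eigenvalues come first, matching I_{p,q}.
order = np.argsort(-evals)
evals, evecs = evals[order], evecs[:, order]

p, q = int(np.sum(evals > 0)), int(np.sum(evals < 0))
I_pq = np.diag(np.concatenate([np.ones(p), -np.ones(q)]))

# Canonical latent positions: rows of U |Lambda|^{1/2}.
X = evecs * np.sqrt(np.abs(evals))

# Check the indefinite factorisation B_kl = x_k^T I_{p,q} x_l.
assert np.allclose(X @ I_pq @ X.T, B)
```

For this B the eigenvalues are 0.7 and −0.3, so the signature is (1, 1) and the indefinite inner product is genuinely needed to reproduce B.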
2.2 Spectral clustering
Definition 2 (Adjacency spectral embedding).
Given an undirected weighted graph with symmetric adjacency matrix A, consider the truncated spectral decomposition UΛU^T, where Λ is a diagonal matrix containing the d largest eigenvalues of A in magnitude, and U contains the corresponding orthonormal eigenvectors. Define the adjacency spectral embedding X̂ = (X̂_1, …, X̂_n)^T of the graph into R^d by X̂ = U|Λ|^{1/2}.
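A minimal numpy sketch of this definition, assuming only a symmetric input matrix; the test matrix below is synthetic noise, used purely to exercise the function.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Embed a symmetric weighted adjacency matrix into R^d using the
    d largest-magnitude eigenvalues and their eigenvectors."""
    evals, evecs = np.linalg.eigh(A)          # A symmetric => real spectrum
    top = np.argsort(-np.abs(evals))[:d]      # largest |eigenvalue| first
    Lam, U = evals[top], evecs[:, top]
    return U * np.sqrt(np.abs(Lam))           # X_hat = U |Lambda|^{1/2}

# Synthetic symmetric matrix; in practice A is the weighted adjacency matrix.
rng = np.random.default_rng(0)
M = rng.normal(size=(20, 20))
A = (M + M.T) / 2
X_hat = adjacency_spectral_embedding(A, d=2)
assert X_hat.shape == (20, 2)
```

Note that selecting by eigenvalue magnitude, rather than by the largest positive eigenvalues, is exactly the design point discussed below.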
We will interpret this spectral embedding procedure as providing an estimate X̂_i of the latent position for node i
in the network. Heuristically, nodes that are somehow 'close' in this space are likely to belong to the same community. Algorithm
1 (extending Algorithm 1 of Rubin-Delanchy et al. (2018) to real-valued matrices) proposes an approach to recovering these communities. There are two important features of Algorithm 1. Firstly, both sides of the spectral decomposition are used: in Definition 2 the d largest eigenvalues by magnitude are retained (and the corresponding eigenvectors), not just the largest positive eigenvalues. This is needed for statistical consistency in general (Rohe et al., 2011). Large negative eigenvalues in computer network graphs can hold key information for node clustering and link prediction (Rubin-Delanchy et al., 2018). Secondly, the covariance matrices in the Gaussian mixture model are unconstrained, i.e. ellipsoidal with varying volume, shape, and orientation. This is a significant departure from the standard use of k-means (Von Luxburg, 2007).
Both of these algorithmic features are well justified by the theorem in the following section.
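Algorithm 1 itself is not reproduced in this extract, but under the two design features just described (embed by largest-magnitude eigenvalues, then fit a full-covariance Gaussian mixture) a sketch might look as follows. The block probabilities, community sizes, and random seed are hypothetical, chosen only to make the example run.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
n = 200
z = rng.choice(2, size=n, p=[0.5, 0.5])   # community labels

# Hypothetical Bernoulli block probabilities (illustration only).
B = np.array([[0.5, 0.1],
              [0.1, 0.4]])
P = B[z][:, z]
A = rng.binomial(1, np.triu(P, 1))        # sample upper triangle
A = A + A.T                               # symmetrise, no self-loops

# Step 1: embed using the d largest-magnitude eigenvalues.
d = 2
evals, evecs = np.linalg.eigh(A)
top = np.argsort(-np.abs(evals))[:d]
X_hat = evecs[:, top] * np.sqrt(np.abs(evals[top]))

# Step 2: fit a Gaussian mixture with unconstrained (full) covariances,
# rather than k-means with its implicit spherical clusters.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X_hat)

# Compare to the truth up to label switching.
acc = max(np.mean(labels == z), np.mean(labels != z))
print(f"clustering accuracy: {acc:.2f}")
```

The `covariance_type="full"` option is what permits the elliptical components with varying volume, shape, and orientation described above.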
2.3 Central limit theorem
Rubin-Delanchy et al. (2018) derived a central limit theorem for adjacency spectral embedding under a 'generalised random dot product graph'; a novel contribution of the present paper is to extend this theorem to the case of a weighted stochastic block model:
Theorem 1 (Adjacency spectral embedding central limit theorem).
Consider a sequence of adjacency matrices from a weighted stochastic block model with signature (p, q). For any integer m > 0 and points y_1, …, y_m ∈ R^d, conditional on the community labels z_1, …, z_m, there exists a sequence of random matrices Q_n ∈ O(p, q) such that
where
(1)  P( n^{1/2}(Q_n X̂_i − x_{z_i}) ≤ y_i, for i = 1, …, m ) → ∏_{i=1}^m Φ(y_i; 0, Σ(z_i)),
with Σ(k) = I_{p,q} Δ^{−1} ( Σ_{l=1}^K π_l C_kl x_l x_l^T ) Δ^{−1} I_{p,q};
the second moment matrix Δ, assumed to be invertible, is
Δ = Σ_{k=1}^K π_k x_k x_k^T,
and Φ(y; μ, Σ)
denotes the cumulative distribution function of a multivariate normal distribution with mean μ
and covariance Σ. The implication of Theorem 1
is that spectrally embedding an adjacency matrix from a weighted stochastic block model produces a point cloud that is asymptotically a linear transformation (given by Q_n^{−1}) of independent, identically distributed draws from a Gaussian mixture model. Each of its components corresponds to a community and has an explicitly calculable mean and covariance. A finite-sample illustration of the theorem is given in Figure 1. The result motivates the design of Algorithm 1, namely the importance of using both sides of the spectral decomposition and allowing full covariance matrices when fitting a Gaussian mixture model.
2.4 Example: Poisson counts versus Bernoulli presence events
Consider a 2-community weighted stochastic block model where weights represent event counts modelled by Poisson distributions whose rates depend on the community pair; since a Poisson distribution's variance equals its mean, the block mean and variance matrices coincide, B = C. We generate a weighted network from this model with n nodes and probability π_1 of belonging to the first community, and apply Algorithm 1. Figure 1a) shows the two-dimensional point cloud obtained from spectrally embedding the graph (note, d = 2), with colours indicating the true cluster assignment. The red and blue ellipses show the two 95% contours obtained by applying Gaussian clustering using the Python sklearn library. In this example, the predicted community assignment is 98.5% accurate. Black ellipses show the 95% asymptotic contours of the components, calculated using Theorem 1, which are approximately comparable.
Instead, suppose we simply report, for each pair of nodes, whether at least one event occurs. If A_ij is Poisson with rate Λ, then the indicator 1{A_ij ≥ 1} is Bernoulli with probability 1 − e^{−Λ}. The block mean and variance matrices for this unweighted stochastic block model are obtained by applying this transformation entrywise.
We calculate this modified adjacency matrix directly from the original, and Figure 1b) shows the resulting point cloud from spectral embedding, where the contours and true community labels are indicated as before. This time the predicted community assignment based on a Gaussian mixture model is only 96.3% accurate. This loss of accuracy is consistent with the theoretical contours appearing less well separated. We formally quantify cluster separation in Section 3.1, and find that the Poisson representation is indeed preferable in this example.
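The comparison in this example can be imitated on synthetic data. The Poisson rates below are hypothetical (the example's actual rates are not reproduced in this extract), so the accuracies will differ from those quoted above; the point is only the mechanics of embedding the same network under the two representations.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def embed(A, d=2):
    # Adjacency spectral embedding by largest-magnitude eigenvalues.
    evals, evecs = np.linalg.eigh(A)
    top = np.argsort(-np.abs(evals))[:d]
    return evecs[:, top] * np.sqrt(np.abs(evals[top]))

def gmm_accuracy(X, z):
    labels = GaussianMixture(2, covariance_type="full",
                             random_state=0).fit_predict(X)
    return max(np.mean(labels == z), np.mean(labels != z))

rng = np.random.default_rng(2)
n = 500
z = rng.choice(2, size=n, p=[0.5, 0.5])

# Hypothetical Poisson rate matrix (illustration only).
Lam = np.array([[2.0, 1.0],
                [1.0, 1.5]])
rates = Lam[z][:, z]
counts = rng.poisson(np.triu(rates, 1))
A_pois = counts + counts.T

# Bernoulli presence/absence: at least one event, probability 1 - exp(-rate).
A_bern = (A_pois >= 1).astype(float)

acc_pois = gmm_accuracy(embed(A_pois), z)
acc_bern = gmm_accuracy(embed(A_bern), z)
print(acc_pois, acc_bern)
```

The Bernoulli matrix is computed directly from the Poisson one, mirroring the experiment in the text.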
3 Choosing matrix data representation
3.1 Chernoff information
In order to define a measure of cluster separation we take inspiration from Tang and Priebe (2019), where Chernoff information was proposed as a method to compare graph embeddings based on the Laplacian versus the adjacency matrix. In a 2-cluster problem, the Chernoff information provides an upper bound on the probability of error of the Bayes decision rule that assigns each data point to its most likely cluster a posteriori. If the clusters have distributions F_1 and F_2, the Chernoff information is (Chernoff, 1952):
C(F_1, F_2) = sup_{t ∈ (0, 1)} C(t; F_1, F_2),
where C(t; F_1, F_2) = −log ∫ f_1(x)^t f_2(x)^{1−t} dx is the Chernoff divergence
and f_1, f_2
are the probability density functions corresponding to F_1, F_2
respectively. For K > 2 clusters, one reports instead the Chernoff information of the critical pair, min_{k ≠ l} C(F_k, F_l). The Chernoff information of the components in the limiting mixture distribution of Theorem 1 can be written in closed form. Suppose distribution F_k = N(μ_k, Σ_k) and F_l = N(μ_l, Σ_l); then, for t ∈ (0, 1), denoting Σ_t = tΣ_k + (1 − t)Σ_l, we can compute (Pardo, 2005),
(2)  C(t; F_k, F_l) = (t(1 − t)/2) (μ_k − μ_l)^T Σ_t^{−1} (μ_k − μ_l) + (1/2) log( det Σ_t / (det(Σ_k)^t det(Σ_l)^{1−t}) )
In their work motivating the use of Chernoff information to compare graph embeddings, Tang and Priebe (2019) make the point that a simpler criterion such as cluster variance is not satisfactory, since it effectively measures the performance of k-means clustering rather than clustering using a Gaussian mixture model.
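The closed form for Gaussian components and the maximisation over t are straightforward to implement; the sketch below uses scipy for the one-dimensional optimisation, and checks it against the standard equal-covariance case, where the Chernoff information reduces to the squared Mahalanobis distance divided by 8, attained at t = 1/2.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_divergence(t, mu1, S1, mu2, S2):
    """C(t; F1, F2) for F1 = N(mu1, S1), F2 = N(mu2, S2)."""
    St = t * S1 + (1 - t) * S2
    dm = mu1 - mu2
    quad = 0.5 * t * (1 - t) * dm @ np.linalg.solve(St, dm)
    logdet = 0.5 * (np.linalg.slogdet(St)[1]
                    - t * np.linalg.slogdet(S1)[1]
                    - (1 - t) * np.linalg.slogdet(S2)[1])
    return quad + logdet

def chernoff_information(mu1, S1, mu2, S2):
    # Maximise the divergence over t in (0, 1).
    res = minimize_scalar(lambda t: -chernoff_divergence(t, mu1, S1, mu2, S2),
                          bounds=(1e-6, 1 - 1e-6), method="bounded")
    return -res.fun, res.x

# Sanity check: equal covariances give C = |mu1 - mu2|^2 / 8 at t = 1/2.
mu1, S1 = np.array([0.0, 0.0]), np.eye(2)
mu2, S2 = np.array([1.0, 0.0]), np.eye(2)
C, t_star = chernoff_information(mu1, S1, mu2, S2)
print(C, t_star)
```

With unequal covariances the maximising t moves away from 1/2, which is why the optimisation over t cannot be skipped in general.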
3.1.1 Example: Poisson counts versus Bernoulli presence events
Returning to the Poisson versus Bernoulli example of Section 2.4, Figure 1c) shows the Chernoff divergence as a function of t, and hence the Chernoff information, for the two representations. For the Bernoulli data, the Chernoff information is 0.002; for the Poisson data, it is 0.012, and this representation should therefore be preferred.
3.2 Invariance under affine transformation
As mentioned in the introduction, the choice of origin and scale for measurements in the data matrix is often arbitrary. The following lemma shows that cluster separation, as measured through Chernoff information, is not affected by this choice.
Lemma 1 (Chernoff information invariance under affine transformation).
Let A be an adjacency matrix from a weighted stochastic block model and, for a ≠ 0 and b ∈ R, define A' = aA + b·11^T, where 1 is the all-one vector. For any pair of communities k, l,
C(F_k, F_l) = C(F'_k, F'_l),
where F_k and F'_k denote the kth component of the limiting mixture distribution of Theorem 1 associated with A and A' respectively.
This lemma has some interesting consequences regarding common data transformations and their effect on spectral clustering.
Remark 1.
Given an unweighted stochastic block model, Chernoff information invariance suggests that, rather than using 1 and 0 to represent present and missing edges respectively, any other two distinct values could be used.
Remark 2.
Given a weighted stochastic block model where weights represent p-values, Chernoff information invariance suggests that there is no difference between analysing the matrix with entries A_ij or with entries 1 − A_ij.
Based on Lemma 1, it may appear that an affine transformation of the adjacency matrix entries will not affect the geometry of the point cloud. However, by transforming the entries we could potentially change the signature of the model and the underlying geometry of the invariant indefinite orthogonal group, O(p, q).
Lemma 2 (Signature change under affine transformation).
Let B be a block mean matrix with signature (p, q). Then, depending on the signs of a and b, the matrix aB + b·11^T has signature:
Signature  

3.2.1 Example: Beta distributions for p-values
Consider a 2-community weighted stochastic block model where weights represent p-values from a continuous test statistic. We model the p-values using Beta distributions: for i ≤ j,
(3)  A_ij ~ Beta(κ, 1) if z_i = z_j = 1, and A_ij ~ Uniform(0, 1) otherwise, with κ ∈ (0, 1).
Following Remark 2, there is no difference in Chernoff information between using the matrix A, with entries A_ij, or the matrix A', with entries 1 − A_ij, in Algorithm 1. Let B and B' be the corresponding block mean matrices,
B = [[κ/(κ + 1), 1/2], [1/2, 1/2]],  B' = [[1/(κ + 1), 1/2], [1/2, 1/2]].
Since κ/(κ + 1) < 1/2 for κ ∈ (0, 1), the matrix B has signature (1, 1) while B' has signature (2, 0), changing the geometry of latent space. We investigate this model further in Section 4.
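The signature flip can be checked numerically. The block means below assume, for illustration, that p-values within the anomalous community follow Beta(κ, 1) (whose mean is κ/(κ + 1)) and are uniform otherwise; these are hypothetical values standing in for the model of this section.

```python
import numpy as np

def signature(B, tol=1e-10):
    """Count strictly positive and strictly negative eigenvalues."""
    evals = np.linalg.eigvalsh(B)
    return int(np.sum(evals > tol)), int(np.sum(evals < -tol))

for kappa in [0.1, 0.3, 0.7]:
    m = kappa / (kappa + 1)                      # mean of Beta(kappa, 1) < 1/2
    B  = np.array([[m,     0.5], [0.5, 0.5]])    # raw p-values
    Bp = np.array([[1 - m, 0.5], [0.5, 0.5]])    # entries 1 - p
    print(kappa, signature(B), signature(Bp))
```

For every κ in (0, 1) the raw p-value matrix has a negative determinant, giving signature (1, 1), while the complemented matrix is positive definite, giving signature (2, 0), even though the two are related by the affine map of Lemma 1.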
4 Application: network anomaly detection
Consider the problem of detecting a cluster of anomalous activity on a network. Assume that a p-value can be obtained for every unordered pair of nodes on the network, quantifying our level of surprise in their activity. For example, a low p-value might occur if, relative to historical behaviour, a much smaller or larger volume of communication was observed (Heard et al., 2010), if a communication used an unusual channel (Heard and Rubin-Delanchy, 2016), or took place at a rare time of day (Price-Williams et al., 2018).
Assume that the network contains an unknown proportion of nodes of interest whose interactions tend to have low associated p-values. Interactions involving the remaining nodes generate p-values with no signal.
We model this using the 2-community stochastic block model specified in Section 3.2.1. One could hope to discover the anomalous cluster by spectrally embedding the matrix of p-values or, equivalently, the matrix with entries 1 − A_ij. However, familiarity with statistical anomaly detection might suggest using log p-values instead, since the most common method of combining p-values is Fisher's method (Fisher, 1934), which combines p-values p_1, …, p_m through the statistic −2 Σ_{i=1}^m log p_i, distributed as χ² with 2m degrees of freedom under the null hypothesis.
This provides the uniformly most powerful approach if the p-values are Beta(κ, 1) with κ ∈ (0, 1) under the alternative hypothesis (Heard and Rubin-Delanchy, 2018). Under a log transformation, these p-values have an Exponential(κ) distribution under the alternative, whereas they have an Exponential(1)
distribution under the null hypothesis.
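Fisher's method is simple to implement with scipy; the p-values below are made up for illustration, contrasting a null-like set with a set of moderately small values.

```python
import numpy as np
from scipy.stats import chi2

def fisher_combine(pvals):
    """Fisher's method: -2 * sum(log p_i) is chi-squared with 2m degrees
    of freedom under the null that every p-value is Uniform(0, 1)."""
    pvals = np.asarray(pvals)
    stat = -2 * np.sum(np.log(pvals))
    return stat, chi2.sf(stat, df=2 * len(pvals))

# Null-like p-values combine to an unremarkable overall p-value...
stat0, p0 = fisher_combine([0.4, 0.7, 0.2, 0.9])
# ...while several moderately small p-values combine to strong evidence.
stat1, p1 = fisher_combine([0.04, 0.03, 0.05, 0.02])
print(p0, p1)
```

The log transformation inside the statistic is exactly what motivates embedding log p-values rather than raw p-values in this section.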
Figure 2 shows the Chernoff information associated with these two matrix data representations, for κ ∈ (0, 1). The log p-value representation appears to dominate over the full range, and this observation is confirmed in Lemma 3 below. Under this model, it is always preferable to use log p-values.
Lemma 3 (Log p-value dominance).
For the weighted stochastic block model of Section 3.2.1, the Chernoff information obtained by embedding the matrix of log p-values is greater than that obtained by embedding the matrix of p-values, for every κ ∈ (0, 1).
4.1 Real data: detection of a cyber attack
In attacks on computer networks, attackers move between computers, leaving evidence in the form of anomalous connections between machines (Neil et al., 2013; Turcotte et al., 2014). While these anomalous connections can sometimes be detected individually, they are often lost among the many unusual but nonetheless benign connections on the network. This calls for an approach that detects clusters of anomalous scores by exploiting network structure.
In this example we consider network login events between computers on a computer network. Further details on the data acquisition process can be provided by the authors upon request. For each login event, we use Poisson factorisation (Turcotte et al., 2016) to score the likelihood that a given computer logs in to another computer. In the case of login events in both directions, we combine the scores using Fisher’s method to produce a symmetric matrix of pvalues.
In our first experiment, we insert three anomalous edges connected to three random vertices in the graph, drawing the p-values from the alternative distribution. Figure 3a) shows the spectral embedding of the p-value matrix with entries 1 − P_ij, while Figure 3b) shows the embedding corresponding to the matrix with entries −log P_ij. By Section 3.2.1 there is no advantage to using the matrix with entries 1 − P_ij over that with entries P_ij; the former is simply chosen so that large entries in this and the log representation both indicate unusual events. The log representation results in an embedding which better separates the cluster of synthetic anomalous nodes, shown in red.
Our second experiment analyses data from a different computer network, now containing red-team login activity. The embeddings corresponding to the two rival representations are shown in Figure 4. Here, both embeddings seem to do well at separating the red team.
5 Conclusion
The performance of spectral clustering with real-valued matrices was investigated under a weighted stochastic block model assumption, extending recent statistical theory on graphs. Our theory recommends selecting eigenvectors by eigenvalue magnitude and using Gaussian clustering. This allows a choice between data representations using Chernoff information. We have identified cases where this choice is asymptotically immaterial (e.g. when the matrices are equal up to affine transformation) and other cases where one representation always dominates (e.g. favouring the use of log p-values over p-values for network anomaly detection).
References

Athreya et al. (2017)
Athreya, A., Fishkind, D. E., Tang, M., Priebe, C. E., Park, Y., Vogelstein,
J. T., Levin, K., Lyzinski, V., and Qin, Y. (2017).
Statistical inference on random dot product graphs: a survey.
The Journal of Machine Learning Research
, 18(1):8393–8484.  Chernoff (1952) Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507.
 Fisher (1934) Fisher, R. A. (1934). Statistical methods for research workers. Edinburgh: Oliver & Boyd, 4th edition.
 Heard and Rubin-Delanchy (2016) Heard, N. A. and Rubin-Delanchy, P. (2016). Network-wide anomaly detection via the Dirichlet process. In Proceedings of IEEE workshop on Big Data Analytics for Cybersecurity Computing.
 Heard and Rubin-Delanchy (2018) Heard, N. A. and Rubin-Delanchy, P. (2018). Choosing between methods of combining p-values. Biometrika, 105(1):239–246.
 Heard et al. (2010) Heard, N. A., Weston, D. J., Platanioti, K., and Hand, D. J. (2010). Bayesian anomaly detection methods for social networks. The Annals of Applied Statistics, 4(2):645–662.
 Holland et al. (1983) Holland, P. W., Laskey, K. B., and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social networks, 5(2):109–137.
 Horn and Johnson (2012) Horn, R. and Johnson, C. (2012). Matrix analysis, second edition. Cambridge University Press.
 Neil et al. (2013) Neil, J., Hash, C., Brugh, A., Fisk, M., and Storlie, C. B. (2013). Scan statistics for the online detection of locally anomalous subgraphs. Technometrics, 55(4):403–414.

Ng et al. (2002)
Ng, A. Y., Jordan, M. I., and Weiss, Y. (2002).
On spectral clustering: Analysis and an algorithm.
In Advances in neural information processing systems, pages 849–856.  Pardo (2005) Pardo, L. (2005). Statistical inference based on divergence measures. Chapman and Hall/CRC.
 Price-Williams et al. (2018) Price-Williams, M., Turcotte, M., and Heard, N. (2018). Time of day anomaly detection. In IEEE European Intelligence and Security Informatics Conference (EISIC 2018). IEEE. To appear.
 Rohe et al. (2011) Rohe, K., Chatterjee, S., and Yu, B. (2011). Spectral clustering and the highdimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915.
 Rubin-Delanchy et al. (2018) Rubin-Delanchy, P., Priebe, C., Tang, M., and Cape, J. (2018). A statistical interpretation of spectral embedding: the generalised random dot product graph. arXiv preprint at https://arxiv.org/abs/1709.05506.
 Tang and Priebe (2019) Tang, M. and Priebe, C. (2019). Limit theorems for eigenvectors of the normalized Laplacian for random graphs. The Annals of Statistics, To appear.
 Turcotte et al. (2014) Turcotte, M., Heard, N., and Neil, J. (2014). Detecting localised anomalous behaviour in a computer network. In International Symposium on Intelligent Data Analysis, pages 321–332. Springer.
 Turcotte et al. (2016) Turcotte, M., Moore, J., Heard, N., and McPhall, A. (2016). Poisson factorization for peerbased anomaly detection. In 2016 IEEE Conference on Intelligence and Security Informatics (ISI), pages 208–210. IEEE.
 Von Luxburg (2007) Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and computing, 17(4):395–416.
 Xu et al. (2017) Xu, M., Jog, V., and Loh, P.L. (2017). Optimal rates for community estimation in the weighted stochastic block model. arXiv preprint arXiv:1706.01175.
6 Appendix
6.1 Proof of Lemma 1
Proof.
Let B' = aB + b·11^T and C' = a²C be the block mean and variance matrices for the affine-transformed weighted stochastic block model,
If is the spectral decomposition of , then we consider latent positions given by
and . Using this notation, the second moment matrix , where . We can compute the covariance matrices of the asymptotic Gaussian mixture model distribution from Theorem 1,
where . For the Chernoff divergence at , we require . This has the same form as the above equation, replacing with , where we similarly define .
We individually analyse the two terms of the Chernoff divergence in Equation 2. For the first term, we can write the difference of the component means in terms of e_k − e_l, where e_k is the standard basis vector with 1 in position k and 0 elsewhere.
(4)  
where the right-hand side does not depend on a or b.
Next, we consider the second term of the Chernoff divergence,
(5) 
Neither term depends on a or b. Therefore the Chernoff divergence is independent of a and b for all t ∈ (0, 1), which implies that the Chernoff information is unaffected by affine transformation. ∎
6.2 Proof of Lemma 2
Proof.
Firstly, we shall assume that . Using the above result with we have
Since only the first eigenvalues of are negative, this means that either the first or eigenvalues of are negative. Therefore, the signature for is either or .
A version of Corollary 4.3.9 from Horn and Johnson (2012) considers matrices of the form , which proves the lemma for using a similar argument. If , then and have the same eigenvalues and, therefore, the same signature.
If a < 0, the signs of the eigenvalues flip, which swaps the roles of p and q in the signature, but the rest of the proof is unchanged. ∎
6.3 Proof of Lemma 3
Proof.
Consider block mean and variance matrices for edges between the two communities with the following form,
Using Equations 4 and 5, we can compute the Chernoff divergence directly,
where,
(6)  
(7) 
We consider the Chernoff information for the p-value and log p-value stochastic block models. Terms relating to the latter model are denoted using a dash. The block mean and variance matrices for the two models are
From the definition of Chernoff information, we have the following upper and lower bounds for the two different data representations,
The maximum points of these functions can be found by differentiation,
To prove that the Chernoff information of the log p-value model dominates that of the p-value model, it is sufficient to show that, for all t ∈ (0, 1),
This inequality depends on π only through a single term, so we can assume the worst-case scenario. Substituting the maximum points into Equations 6 and 7, the inequality reduces to a function of κ, the only parameter in the two stochastic block models. Numerical analysis shows that this function is positive for all κ ∈ (0, 1), which completes the proof. ∎