Spectral clustering in the weighted stochastic block model

10/12/2019
by   Ian Gallagher, et al.

This paper is concerned with the statistical analysis of a real-valued symmetric data matrix. We assume a weighted stochastic block model: the matrix indices, taken to represent nodes, can be partitioned into communities so that all entries corresponding to a given community pair are replicates of the same random variable. Extending results previously known only for unweighted graphs, we provide a limit theorem showing that the point cloud obtained from spectrally embedding the data matrix follows a Gaussian mixture model where each community is represented with an elliptical component. We can therefore formally evaluate how well the communities separate under different data transformations, for example, whether it is productive to "take logs". We find that performance is invariant to affine transformation of the entries, but this expected and desirable feature hinges on adaptively selecting the eigenvectors according to eigenvalue magnitude and using Gaussian clustering. We present a network anomaly detection problem with cyber-security data where the matrix of log p-values, as opposed to p-values, has both theoretical and empirical advantages.


1 Introduction

Spectral clustering (Von Luxburg, 2007) refers to a number of different algorithms that have in common two main steps: first, computing the spectral decomposition of a (possibly regularised) data matrix; and second, applying a clustering algorithm to a point cloud extracted from the eigenvectors. When the matrix holds distances or affinities, spectral clustering allows the estimation of non-circular clusters in pointillist data (Ng et al., 2002). When the matrix represents a graph, it enables the discovery of communities (Von Luxburg, 2007).

In the case of graphs, one can talk quite precisely about the relative merits of different regularisation techniques (e.g. adjacency versus normalised Laplacian), which eigenvectors to select (e.g. those corresponding to the largest eigenvalues versus the largest-magnitude eigenvalues) and which clustering algorithm to use (e.g. k-means versus Gaussian mixture modelling). While the first decision is complicated (Tang and Priebe, 2019), asymptotic analysis now clearly favours the second option in each of the two remaining choices (Rohe et al., 2011; Athreya et al., 2017; Rubin-Delanchy et al., 2018). These determinations are made under the assumption that the data follow a stochastic block model (Holland et al., 1983), where the probability of an edge depends only on the (unknown) community memberships of the corresponding nodes.

The natural extension to a real-valued matrix is to assume that the (i, j)th entry is a real random variable whose distribution depends only on the communities of nodes i and j (Xu et al., 2017). (Under the ordinary stochastic block model this distribution would be Bernoulli.) While defining a normalised Laplacian is not entirely straightforward, since for example a node’s ‘degree’ could be negative and would need to be square-rooted, the second and third questions are still pertinent: which eigenvectors and which clustering algorithm should be used?

This paper presents a central limit theorem showing that, asymptotically, the point cloud obtained from spectrally embedding a real-valued matrix from a weighted stochastic block model follows a Gaussian mixture model with elliptical components whose centres and covariance matrices are explicitly calculable. This result implies that, for statistical consistency, eigenvectors selected by eigenvalue magnitude must be used and, for optimality, one should use Gaussian clustering rather than k-means.

Another application of this result is to allow a choice between data representations, for example, whether to embed the matrix of counts or log-counts. Since two data representations produce two different mixture distributions, one can compare how well the components separate in each case. Following Tang and Priebe (2019), we do this using Chernoff information. In a relevant formalisation of the network anomaly detection problem, we are thus able to show that embedding the matrix of log p-values, rather than raw p-values, is statistically more efficient. This theoretical observation is validated in a cyber-security example.

Finally, affine transformation of a real-valued matrix’s entries does not change the Chernoff information of the associated asymptotic clustering problem. In other words, one need not worry about the origin and scale of the measurements in the data matrix, for example, whether temperature is measured in Celsius or Fahrenheit. Yet affine transformation can cause important eigenvalues to flip sign and Gaussian clusters to change shape, and so this invariance hinges on choosing eigenvectors from both sides of the spectrum and using Gaussian clustering; otherwise, performance will vary substantially.

2 Spectral clustering in the weighted stochastic block model

2.1 The weighted stochastic block model

Definition 1 (Weighted stochastic block model).

Given n nodes and K communities, an undirected weighted graph with symmetric adjacency matrix A ∈ ℝ^{n×n} follows a K-community weighted stochastic block model if there is a partition of the nodes into K communities, conditional upon which, for all i < j,

A_{ij} ~ W_{z_i z_j}, independently,

where z_i ∈ {1, …, K} is an index denoting the community of node i, assigned independently according to a probability vector π = (π_1, …, π_K), where π_k > 0 for each k and Σ_k π_k = 1.

Define matrices B, C ∈ ℝ^{K×K} as the block means and variances respectively of the distributions W_{kl}, that is, B_{kl} = E(A_{ij} | z_i = k, z_j = l) and C_{kl} = var(A_{ij} | z_i = k, z_j = l), for k, l ∈ {1, …, K}, where it is assumed these moments exist. For example, a 2-community unweighted stochastic block model with intra-community (respectively, inter-community) link probability r (respectively, s) has

B = [ r  s ; s  r ],   C = [ r(1−r)  s(1−s) ; s(1−s)  r(1−r) ].
The signature of a weighted stochastic block model, (p, q), is defined as the number of strictly positive and strictly negative eigenvalues of B respectively, and we let d = p + q. We can choose vectors ν_1, …, ν_K ∈ ℝ^d such that B_{kl} = ν_k^⊤ I_{p,q} ν_l, for k, l ∈ {1, …, K}, where I_{p,q} = diag(1, …, 1, −1, …, −1), with p ones followed by q minus ones on its diagonal. One choice is to use the rows of U |Λ|^{1/2}, where B = U Λ U^⊤ is the rank-d spectral decomposition of B and Λ contains the non-zero eigenvalues of B. (We will use |D| and |D|^c to denote the element-wise absolute value and cth power of a diagonal matrix D.)

The vector ν_{z_i} can be interpreted as a canonical latent position for node i in the weighted stochastic block model. Latent positions of a stochastic block model are only identifiable up to transformation by elements of the indefinite orthogonal group O(p, q) = {Q ∈ ℝ^{d×d} : Q^⊤ I_{p,q} Q = I_{p,q}}. Attempts to infer the latent positions from the adjacency matrix of a weighted stochastic block model must take this unidentifiability up to transformation from O(p, q) into account.
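To make this concrete, the canonical latent positions and the matrix I_{p,q} can be computed from a block mean matrix by an eigendecomposition. The following Python sketch, using a hypothetical B rather than any matrix from the paper, checks the factorisation B_{kl} = ν_k^⊤ I_{p,q} ν_l described above.

```python
import numpy as np

def canonical_latent_positions(B, tol=1e-10):
    """Construct nu_1, ..., nu_K (as rows) and I_{p,q} from a block mean matrix B."""
    evals, evecs = np.linalg.eigh(B)
    keep = np.abs(evals) > tol                     # discard (numerically) zero eigenvalues
    order = np.argsort(-evals[keep])               # positive eigenvalues first, then negative
    evals = evals[keep][order]
    evecs = evecs[:, keep][:, order]
    nu = evecs * np.sqrt(np.abs(evals))            # rows of U |Lambda|^{1/2}
    Ipq = np.diag(np.sign(evals))                  # p ones followed by q minus ones
    return nu, Ipq

# Hypothetical 2-community block mean matrix (not a value from the paper)
B = np.array([[0.6, 0.2],
              [0.2, 0.4]])
nu, Ipq = canonical_latent_positions(B)
assert np.allclose(nu @ Ipq @ nu.T, B)             # B_kl = nu_k^T I_{p,q} nu_l
```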

2.2 Spectral clustering

Definition 2 (Adjacency spectral embedding).

Given an undirected weighted graph with symmetric adjacency matrix A ∈ ℝ^{n×n}, consider the spectral decomposition of A and let Λ̂ ∈ ℝ^{d×d} be the diagonal matrix containing the d largest eigenvalues of A in magnitude, with Û ∈ ℝ^{n×d} the matrix of corresponding orthonormal eigenvectors. Define the adjacency spectral embedding of the graph into ℝ^d by

X̂ = (X̂_1, …, X̂_n)^⊤ = Û |Λ̂|^{1/2} ∈ ℝ^{n×d}.
We will interpret this spectral embedding procedure as providing an estimate X̂_i of the latent position of node i in the network. Heuristically, nodes that are somehow ‘close’ in this space are likely to belong to the same community. Algorithm 1 (extending Algorithm 1 of Rubin-Delanchy et al. (2018) to real-valued matrices) proposes an approach to recovering these communities.

Input: Weighted adjacency matrix A, embedding dimension d, number of communities K
1: Compute the spectral embedding X̂_1, …, X̂_n ∈ ℝ^d of the graph via Definition 2
2: Fit a Gaussian mixture model with K components and full covariance matrices to X̂_1, …, X̂_n
Output: Cluster centres and node memberships
Algorithm 1 Spectral clustering for the weighted stochastic block model
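A minimal Python sketch of Algorithm 1, using scipy for the embedding and the scikit-learn Gaussian mixture model mentioned in Section 2.4, could look as follows; the function names and defaults are illustrative rather than the authors' implementation.

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.mixture import GaussianMixture

def spectral_embedding(A, d):
    """Adjacency spectral embedding into R^d (Definition 2): eigenvalues are
    selected by magnitude, so both ends of the spectrum can contribute."""
    evals, evecs = eigsh(np.asarray(A, dtype=float), k=d, which="LM")
    return evecs * np.sqrt(np.abs(evals))          # rows are the embedded nodes

def weighted_sbm_cluster(A, d, K, seed=0):
    """Algorithm 1: embed, then fit a K-component GMM with full covariances."""
    X = spectral_embedding(A, d)
    gmm = GaussianMixture(n_components=K, covariance_type="full",
                          random_state=seed).fit(X)
    return X, gmm.means_, gmm.predict(X)
```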

There are two important features of Algorithm 1. Firstly, both sides of the spectral decomposition are used: in Definition 2 the d largest eigenvalues by magnitude are retained (along with the corresponding eigenvectors), not just the largest positive eigenvalues. This is needed for statistical consistency in general (Rohe et al., 2011), and large negative eigenvalues in computer network graphs can hold key information for node clustering and link prediction (Rubin-Delanchy et al., 2018). Secondly, the covariance matrices in the Gaussian mixture model are unconstrained, i.e. ellipsoidal with varying volume, shape, and orientation. This is a significant departure from the standard use of k-means (Von Luxburg, 2007).

Both of these algorithm features are well-justified by the theorem in the following section.

2.3 Central limit theorem

Rubin-Delanchy et al. (2018) derived a central limit theorem for adjacency spectral embedding under a ‘generalised random dot product graph’; a novel contribution of the present paper is to consider an extension of this theorem to the case of a weighted stochastic block model:

Theorem 1 (Adjacency spectral embedding central limit theorem).

Consider a sequence of adjacency matrices A ∈ ℝ^{n×n}, n = 1, 2, …, from a weighted stochastic block model with signature (p, q). For any fixed integer m and points x_1, …, x_m ∈ ℝ^d, conditional on the community labels z_1, …, z_m, there exists a sequence of random matrices Q_n ∈ O(p, q) such that

P( ⋂_{i=1}^{m} { n^{1/2} ( Q_n X̂_i − ν_{z_i} ) ≤ x_i } ) → ∏_{i=1}^{m} Φ( x_i ; 0, Σ_{z_i} ),

where

Σ_k = I_{p,q} Δ^{−1} { Σ_{l=1}^{K} π_l C_{kl} ν_l ν_l^⊤ } Δ^{−1} I_{p,q},    (1)

the second moment matrix Δ = Σ_{l=1}^{K} π_l ν_l ν_l^⊤ is assumed to be invertible, and Φ(x; μ, Σ) denotes the cumulative distribution function of a multivariate normal distribution with mean μ and covariance matrix Σ.

The implication of Theorem 1 is that spectrally embedding an adjacency matrix from a weighted stochastic block model produces a point cloud that is asymptotically a linear transformation (given by the unidentifiable matrix Q_n^{−1} ∈ O(p, q)) of independent, identically distributed draws from a Gaussian mixture model. Each of its components corresponds to a community and has an explicitly calculable mean and covariance. A finite sample illustration of the theorem is given in Figure 1.

The result motivates the design of Algorithm 1, namely the importance of using both sides of the spectral decomposition and allowing full covariance matrices when fitting a Gaussian mixture model.
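For instance, the theoretical contours in Figure 1 require the limiting means and covariances. The sketch below computes them from B, C and π following the form of Equation 1, reusing canonical_latent_positions from the earlier sketch; it is an illustrative sketch under the notation above, not a transcription of the authors' code.

```python
import numpy as np

def limiting_components(B, C, pi):
    """Means (the canonical latent positions) and covariances of the limiting
    Gaussian mixture, following the form of Equation 1."""
    nu, Ipq = canonical_latent_positions(B)
    pi = np.asarray(pi, dtype=float)
    Delta = (nu.T * pi) @ nu                       # second moment matrix sum_l pi_l nu_l nu_l^T
    Dinv = np.linalg.inv(Delta)
    covs = []
    for k in range(len(pi)):
        M = sum(pi[l] * C[k, l] * np.outer(nu[l], nu[l]) for l in range(len(pi)))
        covs.append(Ipq @ Dinv @ M @ Dinv @ Ipq)   # Sigma_k, up to the 1/n scaling
    return nu, covs
```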

2.4 Example: Poisson counts versus Bernoulli presence events

Consider a 2-community weighted stochastic block model where weights represent event counts modelled by Poisson distributions, A_{ij} | z_i, z_j ~ Poisson(λ_{z_i z_j}), so that the block mean and variance matrices coincide, B = C with B_{kl} = λ_{kl}.

We generate a weighted network from this model with n nodes and probability π_1 of belonging to the first community, and apply Algorithm 1. Figure 1a) shows the 2-dimensional point cloud obtained from spectrally embedding the graph (note, d = K = 2), with colours indicating the true cluster assignment. The red and blue ellipses show the two 95% contours obtained by applying Gaussian clustering using the Python sklearn library. In this example, the predicted community assignment is 98.5% accurate. Black ellipses show the 95% asymptotic contours of the components, calculated using Theorem 1, and the two sets of contours are in approximate agreement.
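As an illustration, such a network can be simulated and clustered with the Algorithm 1 sketch of Section 2.2; the rate values below are hypothetical, chosen for illustration rather than taken from the paper.

```python
import numpy as np

def simulate_weighted_sbm(n, pi, rates, seed=0):
    """Symmetric weighted SBM with Poisson edge weights (block rate matrix `rates`)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)          # community labels
    lam = rates[np.ix_(z, z)]                      # entry-wise Poisson rates
    A = np.triu(rng.poisson(lam), 1)               # strict upper triangle
    return (A + A.T).astype(float), z              # symmetrise; zero diagonal

# Hypothetical parameters for illustration (not the values used in the paper)
rates = np.array([[4.0, 1.0],
                  [1.0, 2.0]])
A, z = simulate_weighted_sbm(n=1000, pi=[0.6, 0.4], rates=rates)
X, centres, labels = weighted_sbm_cluster(A, d=2, K=2)   # Algorithm 1 sketch
```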

Figure 1: Spectral embedding into ℝ² of a graph from a 2-community weighted stochastic block model with a) Poisson count data and b) Bernoulli presence data, where red and blue points indicate true community membership. Red and blue ellipses show the 95% contours obtained from a fitted Gaussian mixture model, and the black ellipses their theoretical counterparts (calculated from Theorem 1). Panel c) shows the Chernoff divergence as a function of t ∈ (0, 1) for the two examples, as discussed in Section 3.1.

Instead, suppose we simply report, for each pair of nodes, whether at least one event occurs. If A_{ij} ~ Poisson(λ), then 1{A_{ij} > 0} ~ Bernoulli(1 − e^{−λ}). The block mean and variance matrices for this unweighted stochastic block model are therefore B′_{kl} = 1 − e^{−λ_{kl}} and C′_{kl} = B′_{kl}(1 − B′_{kl}).

We calculate this modified adjacency matrix directly from the original, and Figure 1b) shows the resulting point cloud from spectral embedding, where the contours and true community labels are indicated as before. This time the predicted community assignment based on a Gaussian mixture model is only 96.3% accurate. This loss of accuracy is consistent with the theoretical contours appearing less well separated. We formally quantify cluster separation in Section 3.1, and find that the Poisson representation is indeed preferable in this example.
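Continuing the simulation sketch above, the presence/absence representation can be obtained by thresholding the simulated count matrix before embedding.

```python
# Report only whether at least one event occurred; under the Poisson model the
# entries are then Bernoulli(1 - exp(-lambda)).
A_presence = (A > 0).astype(float)
X_b, centres_b, labels_b = weighted_sbm_cluster(A_presence, d=2, K=2)
```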

3 Choosing matrix data representation

3.1 Chernoff information

In order to define a measure of cluster separation we take inspiration from Tang and Priebe (2019), where Chernoff information was proposed as a method to compare graph embeddings based on the Laplacian versus the adjacency matrix. In a 2-cluster problem, the Chernoff information provides an upper bound on the probability of error of the Bayes decision rule that assigns each data point to its most likely cluster a posteriori. If the clusters have distributions F_1 and F_2, the Chernoff information is (Chernoff, 1952)

C(F_1, F_2) = sup_{t ∈ (0,1)} C_t(F_1, F_2),

where C_t(F_1, F_2) is the Chernoff divergence

C_t(F_1, F_2) = −log ∫ f_1(x)^t f_2(x)^{1−t} dx,

and f_1, f_2 are the probability density functions corresponding to F_1, F_2 respectively. For K > 2 clusters, one reports instead the Chernoff information of the critical pair, min_{k ≠ l} C(F_k, F_l).

The Chernoff information of the components in the limiting mixture distribution of Theorem 1 can be written in closed form. Suppose F_1 = Normal(μ_1, Σ_1) and F_2 = Normal(μ_2, Σ_2); then, for t ∈ (0, 1), denoting Σ_t = tΣ_1 + (1 − t)Σ_2, we can compute (Pardo, 2005)

C_t(F_1, F_2) = [t(1 − t)/2] (μ_1 − μ_2)^⊤ Σ_t^{−1} (μ_1 − μ_2) + (1/2) log{ |Σ_t| / ( |Σ_1|^t |Σ_2|^{1−t} ) }.    (2)

In their work motivating the use of Chernoff information to compare graph embeddings, Tang and Priebe (2019) make the point that a simpler criterion such as cluster variance is not satisfactory, since it effectively measures the performance of k-means clustering rather than clustering using a Gaussian mixture model.
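A short sketch of this computation, maximising Equation 2 over t for a pair of fitted or theoretical Gaussian components, might read as follows.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_divergence(t, mu1, Sigma1, mu2, Sigma2):
    """Chernoff divergence C_t between two multivariate Gaussians (Equation 2)."""
    d = mu1 - mu2
    St = t * Sigma1 + (1 - t) * Sigma2
    quad = 0.5 * t * (1 - t) * d @ np.linalg.solve(St, d)
    logdets = [np.linalg.slogdet(S)[1] for S in (St, Sigma1, Sigma2)]
    return quad + 0.5 * (logdets[0] - t * logdets[1] - (1 - t) * logdets[2])

def chernoff_information(mu1, Sigma1, mu2, Sigma2):
    """Maximise C_t over t in (0, 1); returns the information and the maximiser."""
    res = minimize_scalar(
        lambda t: -chernoff_divergence(t, mu1, Sigma1, mu2, Sigma2),
        bounds=(1e-6, 1 - 1e-6), method="bounded")
    return -res.fun, res.x
```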

3.1.1 Example: Poisson counts versus Bernoulli presence events

Returning to the Poisson versus Bernoulli example of Section 2.4, Figure 1c) shows the Chernoff divergence, and hence the Chernoff information, for the two representations. For the Bernoulli data, the Chernoff information is 0.002; for the Poisson data, it is 0.012, and this representation should therefore be preferred.

3.2 Invariance under affine transformation

As mentioned in the introduction, the choice of origin and scale for measurements in the data matrix is often arbitrary. The following lemma shows that cluster separation, as measured through Chernoff information, is not affected by this choice.

Lemma 1 (Chernoff information invariance under affine transformation).

Let A be an adjacency matrix from a weighted stochastic block model and, for a ≠ 0 and b ∈ ℝ, define A^{(a,b)} = aA + b11^⊤, where 1 is the all-one vector. For any t ∈ (0, 1) and k, l ∈ {1, …, K},

C_t( F_k, F_l ) = C_t( F_k^{(a,b)}, F_l^{(a,b)} ),

where F_k and F_k^{(a,b)} denote the kth component of the limiting mixture distribution of Theorem 1 associated with A and A^{(a,b)} respectively.

This lemma has some interesting consequences regarding common data transformations and their effect on spectral clustering.

Remark 1.

Given an unweighted stochastic block model, rather than using 1 and 0 to respectively represent edges and missing edges, Chernoff information invariance suggests that any other two distinct values could be used.

Remark 2.

Given a weighted stochastic block model where the weights represent p-values, Chernoff information invariance suggests that there is no difference between analysing the matrix with entries p_{ij} and the matrix with entries 1 − p_{ij}.

Based on Lemma 1, it may appear that an affine transformation of the adjacency matrix entries will not affect the geometry of the point cloud. However, by transforming the entries we can change the signature of the model, and hence the underlying geometry of the indefinite orthogonal group O(p, q) under which the latent positions are invariant.

Lemma 2 (Signature change under affine transformation).

Let B be a symmetric matrix with signature (p, q). Then, since b11^⊤ is a rank-one perturbation, the signature of aB + b11^⊤ can differ from that of aB in at most one eigenvalue: for a > 0, the matrix aB + b11^⊤ has p or p + 1 strictly positive and q or q − 1 strictly negative eigenvalues when b > 0; p or p − 1 strictly positive and q or q + 1 strictly negative eigenvalues when b < 0; and signature (p, q) when b = 0. For a < 0, the roles of p and q are exchanged in each case.

3.2.1 Example: Beta distributions for p-values

Consider a 2-community weighted stochastic block model where weights represent p-values from a continuous test statistic. We model the p-values using Beta distributions: for some μ ∈ (0, 1),

A_{ij} | z_i, z_j ~ Beta(μ, 1) if z_i = z_j = 1, and A_{ij} | z_i, z_j ~ Uniform(0, 1) = Beta(1, 1) otherwise.    (3)

Following Remark 2, there is no difference in Chernoff information between using the matrix P, with entries p_{ij}, or the matrix with entries 1 − p_{ij}, in Algorithm 1. Let B_P and B_{1−P} be the corresponding block mean matrices,

B_P = [ μ/(μ+1)  1/2 ; 1/2  1/2 ],    B_{1−P} = [ 1/(μ+1)  1/2 ; 1/2  1/2 ].

Since μ < 1, the matrix B_P has signature (1, 1) while B_{1−P} has signature (2, 0), changing the geometry of latent space. We investigate this model further in Section 4.
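As a quick numerical check (with a hypothetical value of μ), one can form the block mean matrices implied by Equation 3 and count the signs of their eigenvalues; the −log p representation of Section 4 is included for comparison.

```python
import numpy as np

def signature(B):
    """Number of strictly positive and strictly negative eigenvalues of B."""
    evals = np.linalg.eigvalsh(B)
    return int((evals > 0).sum()), int((evals < 0).sum())

mu = 0.3                                                    # hypothetical, mu < 1
B_p    = np.array([[mu / (mu + 1), 0.5], [0.5, 0.5]])       # entries p_ij
B_1m_p = np.array([[1 / (mu + 1), 0.5], [0.5, 0.5]])        # entries 1 - p_ij
B_log  = np.array([[1 / mu, 1.0], [1.0, 1.0]])              # entries -log p_ij
print(signature(B_p), signature(B_1m_p), signature(B_log))  # (1, 1) (2, 0) (2, 0)
```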

4 Application: network anomaly detection

Consider the problem of detecting a cluster of anomalous activity on a network. Assume that a p-value for every unordered pair of nodes on a network can be obtained, quantifying our level of surprise in their activity. For example, a low p-value might occur if, relative to historical behaviour, a much smaller or larger volume of communication was observed (Heard et al., 2010), if a communication used an unusual channel (Heard and Rubin-Delanchy, 2016) or took place at a rare time of day (Price-Williams et al., 2018).

Assume that the network contains an unknown proportion of nodes of interest whose interactions tend to have low associated p-values. Interactions involving the remaining nodes generate p-values with no signal.

We model this using the 2-community stochastic block model specified in Section 3.2.1. One could hope to discover the anomalous cluster by spectrally embedding the matrix P of p-values or, equivalently (by Lemma 1), the matrix with entries 1 − p_{ij}. However, familiarity with statistical anomaly detection might suggest using instead the matrix with entries −log p_{ij}, since the most common method of combining p-values is Fisher’s method (Fisher, 1934),

−2 Σ_{i=1}^{m} log p_i ~ χ²_{2m}, under the null hypothesis that p_1, …, p_m are independent and uniformly distributed.

This provides the uniformly most powerful approach if the p-values are Beta(μ, 1) distributed, with μ < 1, under the alternative hypothesis (Heard and Rubin-Delanchy, 2018). Under the negative log transformation, these p-values have an Exponential(μ) distribution, whereas they have an Exponential(1) distribution under the null hypothesis.
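For concreteness, a sketch of Fisher's method for combining m independent p-values:

```python
import numpy as np
from scipy.stats import chi2

def fisher_combine(pvals):
    """Fisher's method: the combined statistic is chi-squared with 2m d.o.f. under the null."""
    pvals = np.asarray(pvals, dtype=float)
    stat = -2.0 * np.log(pvals).sum()
    return chi2.sf(stat, df=2 * len(pvals))
```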

Figure 2 shows the Chernoff information associated with these two matrix data representations, for a range of values of the model parameters μ and π_1. The log p-value representation appears to dominate over the full range, and this observation is confirmed in Lemma 3 below: under this model, it is always preferable to use log p-values.

Figure 2: Chernoff information comparison of spectral clustering using a) the matrix of p-values versus b) the matrix of log p-values, under a weighted stochastic block model representing a network anomaly detection problem. For a range of model parameters, the log p-value representation dominates. Further details in main text.
Lemma 3 (Log p-value dominance).

Consider a 2-community stochastic block model with weights representing p-values modelled by the Beta distributions given in Equation 3, and define L to be the matrix with entries −log p_{ij}. For all μ ∈ (0, 1) and π_1 ∈ (0, 1),

C( F_1^P, F_2^P ) ≤ C( F_1^L, F_2^L ),

where F_k^P and F_k^L denote the kth component of the limiting mixture distribution of Theorem 1 associated with P and L respectively.

4.1 Real data: detection of a cyber attack

In attacks on computer networks, attackers move between computers, leaving evidence in the form of anomalous connections between computers (Neil et al., 2013; Turcotte et al., 2014). While individually these anomalous connections can sometimes be detected, they are often lost among the many unusual but nonetheless benign connections on the network. This calls for an approach that detects clusters of anomalous scores by exploiting network structure.

In this example we consider network log-in events between computers on a computer network. Further details on the data acquisition process can be provided by the authors upon request. For each log-in event, we use Poisson factorisation (Turcotte et al., 2016) to score the likelihood that a given computer logs in to another computer. In the case of log-in events in both directions, we combine the scores using Fisher’s method to produce a symmetric matrix of p-values.
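A sketch of this symmetrisation step, assuming a hypothetical matrix P of directed p-values and reusing the fisher_combine helper sketched above:

```python
import numpy as np

def symmetrise_pvalues(P):
    """Combine the p-values for the two directions of each pair with Fisher's method."""
    n = P.shape[0]
    S = np.ones((n, n))                 # diagonal left at 1 (no self p-values assumed)
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = fisher_combine([P[i, j], P[j, i]])
    return S
```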

In our first experiment, we insert three anomalous edges connected to three random vertices in the graph, drawing the corresponding p-values from a stochastically small (anomalous) distribution. Figure 3a) shows the spectral embedding of the p-value matrix with entries 1 − p_{ij}, while Figure 3b) shows the embedding corresponding to the matrix with entries −log p_{ij}. By Section 3.2.1 there is no advantage to using the matrix with entries 1 − p_{ij} over that with entries p_{ij}; the former is simply chosen so that large entries in both this and the log representation indicate unusual events. The log representation results in an embedding which better separates the cluster of synthetic anomalous nodes, shown in red.

Our second experiment analyses data from a different computer network, now containing red-team log-in activity. The embeddings corresponding to the two rival representations are shown in Figure 4. Here, both embeddings seem to do well at separating the red team.

Figure 3: Spectral embedding into ℝ² of network log-in events data using a p-value matrix with a) entries 1 − p_{ij} and b) entries −log p_{ij}. Nodes with synthetic anomalous connections are shown in red.
Figure 4: Spectral embedding into ℝ² of network log-in events data using a p-value matrix with a) entries 1 − p_{ij} and b) entries −log p_{ij}. Nodes with red-team activity are shown in red.

5 Conclusion

The performance of spectral clustering with real-valued matrices was investigated under a weighted stochastic block model assumption, extending recent statistical theory on graphs. Our theory recommends selecting eigenvectors by eigenvalue magnitude and using Gaussian clustering. This allows a choice between data representations using Chernoff information. We have identified cases where this choice is asymptotically immaterial (e.g. when the matrices are equal up to affine transformation) and other cases where one representation always dominates (e.g. favouring the use of log p-values over p-values for network anomaly detection).

References

  • Athreya et al. (2017) Athreya, A., Fishkind, D. E., Tang, M., Priebe, C. E., Park, Y., Vogelstein, J. T., Levin, K., Lyzinski, V., and Qin, Y. (2017). Statistical inference on random dot product graphs: a survey. The Journal of Machine Learning Research, 18(1):8393–8484.
  • Chernoff (1952) Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507.
  • Fisher (1934) Fisher, R. A. (1934). Statistical methods for research workers. Edinburgh: Oliver & Boyd, 4th edition.
  • Heard and Rubin-Delanchy (2016) Heard, N. A. and Rubin-Delanchy, P. (2016). Network-wide anomaly detection via the Dirichlet process. In Proceedings of IEEE workshop on Big Data Analytics for Cyber-security Computing.
  • Heard and Rubin-Delanchy (2018) Heard, N. A. and Rubin-Delanchy, P. (2018). Choosing between methods of combining p-values. Biometrika, 105(1):239–246.
  • Heard et al. (2010) Heard, N. A., Weston, D. J., Platanioti, K., and Hand, D. J. (2010). Bayesian anomaly detection methods for social networks. The Annals of Applied Statistics, 4(2):645–662.
  • Holland et al. (1983) Holland, P. W., Laskey, K. B., and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social networks, 5(2):109–137.
  • Horn and Johnson (2012) Horn, R. and Johnson, C. (2012). Matrix analysis, second edition. Cambridge University Press.
  • Neil et al. (2013) Neil, J., Hash, C., Brugh, A., Fisk, M., and Storlie, C. B. (2013). Scan statistics for the online detection of locally anomalous subgraphs. Technometrics, 55(4):403–414.
  • Ng et al. (2002) Ng, A. Y., Jordan, M. I., and Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856.
  • Pardo (2005) Pardo, L. (2005). Statistical inference based on divergence measures. Chapman and Hall/CRC.
  • Price-Williams et al. (2018) Price-Williams, M., Turcotte, M., and Heard, N. (2018). Time of day anomaly detection. In IEEE European Intelligence and Security Informatics Conference (EISIC2018). IEEE. to appear.
  • Rohe et al. (2011) Rohe, K., Chatterjee, S., and Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915.
  • Rubin-Delanchy et al. (2018) Rubin-Delanchy, P., Priebe, C., Tang, M., and Cape, J. (2018). A statistical interpretation of spectral embedding: the generalised random dot product graph. arXiv preprint, https://arxiv.org/abs/1709.05506.
  • Tang and Priebe (2019) Tang, M. and Priebe, C. (2019). Limit theorems for eigenvectors of the normalized Laplacian for random graphs. The Annals of Statistics, To appear.
  • Turcotte et al. (2014) Turcotte, M., Heard, N., and Neil, J. (2014). Detecting localised anomalous behaviour in a computer network. In International Symposium on Intelligent Data Analysis, pages 321–332. Springer.
  • Turcotte et al. (2016) Turcotte, M., Moore, J., Heard, N., and McPhall, A. (2016). Poisson factorization for peer-based anomaly detection. In 2016 IEEE Conference on Intelligence and Security Informatics (ISI), pages 208–210. IEEE.
  • Von Luxburg (2007) Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and computing, 17(4):395–416.
  • Xu et al. (2017) Xu, M., Jog, V., and Loh, P.-L. (2017). Optimal rates for community estimation in the weighted stochastic block model. arXiv preprint arXiv:1706.01175.

6 Appendix

6.1 Proof of Lemma 1

Proof.

Let B^{(a,b)} = aB + b11^⊤ and C^{(a,b)} = a²C be the block mean and variance matrices for the affine-transformed weighted stochastic block model.

If B^{(a,b)} = U^{(a,b)} Λ^{(a,b)} (U^{(a,b)})^⊤ is the spectral decomposition of B^{(a,b)}, then we consider latent positions ν_k^{(a,b)} given by the rows of U^{(a,b)} |Λ^{(a,b)}|^{1/2}, with signature (p′, q′). Using this notation, the second moment matrix is Δ^{(a,b)} = Σ_l π_l ν_l^{(a,b)} (ν_l^{(a,b)})^⊤. We can compute the covariance matrices of the asymptotic Gaussian mixture model distribution from Theorem 1,

Σ_k^{(a,b)} = I_{p′,q′} (Δ^{(a,b)})^{−1} { Σ_l π_l a² C_{kl} ν_l^{(a,b)} (ν_l^{(a,b)})^⊤ } (Δ^{(a,b)})^{−1} I_{p′,q′}.

For the Chernoff divergence at t ∈ (0, 1), we require the mixture covariance tΣ_k^{(a,b)} + (1 − t)Σ_l^{(a,b)}, which has the same form as the above equation with C_{k·} replaced by the convex combination tC_{k·} + (1 − t)C_{l·}, since the expression is linear in the block variances.

We analyse the two terms of the Chernoff divergence in Equation 2 individually. For the first term, we write ν_k^{(a,b)} − ν_l^{(a,b)} in terms of e_k − e_l, where e_k is the standard basis vector with 1 in position k and 0 elsewhere, giving

(4)

whose right-hand side does not depend on a or b.

Next, we consider the second term of the Chernoff divergence,

(5)

Neither Equation 4 nor Equation 5 depends on a or b. Therefore the Chernoff divergence is independent of a and b for all t ∈ (0, 1), which implies that the Chernoff information is unaffected by affine transformation. ∎

6.2 Proof of Lemma 2

Proof.

The result follows from Corollary 4.3.9 of Horn and Johnson (2012): if S is a symmetric matrix and v a vector then, for b ≥ 0, the eigenvalues of S and S + bvv^⊤ (in increasing order) interlace,

λ_i(S) ≤ λ_i(S + bvv^⊤) ≤ λ_{i+1}(S).

Firstly, we shall assume that a > 0 and b > 0. Using the above result with S = aB and v = 1, we have

λ_i(aB) ≤ λ_i(aB + b11^⊤) ≤ λ_{i+1}(aB).

Since only the first q eigenvalues of aB are negative, this means that either the first q or the first q − 1 eigenvalues of aB + b11^⊤ are negative and, similarly, that the number of strictly positive eigenvalues is p or p + 1. Therefore, the signature of aB + b11^⊤ is as stated in the lemma for a > 0 and b > 0.

A version of Corollary 4.3.9 from Horn and Johnson (2012) considers perturbations of the form S − bvv^⊤, which proves the lemma for b < 0 using a similar argument. If b = 0, then aB + b11^⊤ and aB have the same eigenvalues and, therefore, the same signature.

If a < 0, then the eigenvalues of aB are those of |a|B with their signs flipped, which swaps the roles of p and q in the signature, but the rest of the proof is unchanged. ∎

6.3 Proof of Lemma 3

Proof.

Consider block mean and variance matrices for a 2-community model in which only the within-community-1 block differs from the rest,

B = [ β₁  β₂ ; β₂  β₂ ],    C = [ γ₁  γ₂ ; γ₂  γ₂ ].
Using Equations 4 and 5, we can compute the Chernoff divergence directly,

where,

(6)
(7)

We consider the Chernoff information for the p-value and log p-value stochastic block models; terms relating to the latter model are denoted using a dash. The block mean and variance matrices for the two models are

B = [ μ/(μ+1)  1/2 ; 1/2  1/2 ],   C = [ μ/((μ+1)²(μ+2))  1/12 ; 1/12  1/12 ],

B′ = [ 1/μ  1 ; 1  1 ],   C′ = [ 1/μ²  1 ; 1  1 ].
From the definition of Chernoff information, we have the following upper and lower bounds for the two different data representations,

The maximum points of these functions can be found by differentiation,

It is sufficient to prove that the Chernoff information of the log p-value model dominates that of the p-value model if, for all μ ∈ (0, 1),

This inequality depends on π_1 only via a single quantity, so we can assume the worst case scenario. Substituting the maximum points into Equations 6 and 7, this inequality leads to a function of μ,

where the remaining quantities are the only parameters in the stochastic block models that depend on μ. Numerical analysis shows that this function is positive for all μ ∈ (0, 1), which completes the proof. ∎