Non-parametric regression for networks

September 30, 2020 · Katie E. Severn et al.

Network data are becoming increasingly available, and so there is a need to develop suitable methodology for statistical analysis. Networks can be represented as graph Laplacian matrices, which are a type of manifold-valued data. Our main objective is to estimate a regression curve from a sample of graph Laplacian matrices conditional on a set of Euclidean covariates, for example in dynamic networks where the covariate is time. We develop an adapted Nadaraya-Watson estimator which has uniform weak consistency for estimation using Euclidean and power Euclidean metrics. We apply the methodology to the Enron email corpus to model smooth trends in monthly networks and highlight anomalous networks. Another motivating application is given in corpus linguistics, which explores trends in an author's writing style over time based on word co-occurrence networks.


1 Introduction

Networks are of wide interest and able to represent many different phenomena, for example social interactions and connections between regions in the brain (kolaczyk2009statistical; ginestet2017hypothesis). The study of dynamic networks, in which networks evolve over time, has recently increased as more data of this type become available (rastelli_latouche_friel_2018). In this paper we develop non-parametric regression methods for modelling and predicting networks where covariates are available. One application for this work is the study of dynamic networks derived from the Enron email corpus, in which each network corresponds to communications between employees in a particular month (Diesner2005). Another motivating application is the study of evolving writing styles in the novels of Jane Austen and Charles Dickens, in which each network is a representation of a novel based on word co-occurrences, and the covariate is the time at which writing of the novel began (Severnetal19). In both applications, the goal is to model smooth trends in the structure of the dynamic networks as they evolve over time.

An example of previous work on dynamic network data is friel2016interlocking, who embedded nodes of bipartite dynamic networks in a latent space, motivated by networks representing the connections between leading Irish companies and board directors. We will also use embeddings in our work, although the bipartite constraints are not present. Further approaches include object functional principal components analysis (Dubeymueller19) applied to time-varying networks from New York taxi trip data; multi-scale time-series modelling (Kangetal17) applied to magnetoencephalography data in neuroscience; and quotient space methodology applied to brain arterial networks (Guosriv20).

The analysis of networks is a type of object oriented data analysis (Marralon14), and important considerations are to decide what the data objects are and how they are represented. We consider datasets where each observation is a weighted network, denoted $G = (V, E)$, comprising a set of nodes, $V = \{1, \ldots, m\}$, and a set of edge weights, $E = \{w_{ij}\}$, indicating that nodes $i$ and $j$ are either connected by an edge of weight $w_{ij} > 0$, or else unconnected (if $w_{ij} = 0$). An unweighted network is the special case with $w_{ij} \in \{0, 1\}$. We restrict attention to networks that are undirected and without loops, so that $w_{ij} = w_{ji}$ and $w_{ii} = 0$; then any such network can be identified with its graph Laplacian matrix $L = (L_{ij})$, defined as

$$L_{ij} = \begin{cases} -w_{ij}, & i \neq j,\\ \sum_{k \neq i} w_{ik}, & i = j, \end{cases}$$

for $i, j = 1, \ldots, m$. The graph Laplacian matrix can be written as $L = D - A$, in terms of the adjacency matrix, $A = (w_{ij})$, and the degree matrix, $D = \mathrm{diag}(A 1_m)$, where $1_m$ is the $m$-vector of ones. The $i$th diagonal element of $D$ equals the degree of node $i$. The space of graph Laplacian matrices is of dimension $m(m-1)/2$ and is

$$\mathcal{L}_m = \left\{ L = (L_{ij}) : L = L^\top, \; L 1_m = 0_m, \; L_{ij} \le 0 \;\, (i \neq j) \right\}, \qquad (1)$$

where $0_m$ is the $m$-vector of zeroes. In fact the space $\mathcal{L}_m$ is a closed convex subset of the cone of centred symmetric positive semi-definite matrices, and is a manifold with corners (ginestet2017hypothesis).
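As a concrete illustration, the following Python sketch (our own; the function names are not from the paper) builds a graph Laplacian from a weight matrix and checks the defining properties in (1):

```python
import numpy as np

def graph_laplacian(W):
    """Graph Laplacian L = D - A from a symmetric weight matrix W
    (non-negative weights, zero diagonal)."""
    W = np.asarray(W, dtype=float)
    assert np.allclose(W, W.T) and np.all(W >= 0), "need symmetric non-negative W"
    D = np.diag(W.sum(axis=1))   # degree matrix
    return D - W

def in_laplacian_space(L, tol=1e-10):
    """Check the defining properties in (1): symmetry, zero row sums,
    non-positive off-diagonal entries."""
    sym = np.allclose(L, L.T, atol=tol)
    rows = np.allclose(L @ np.ones(len(L)), 0.0, atol=tol)
    offd = np.all(L - np.diag(np.diag(L)) <= tol)
    return sym and rows and offd

# Toy 3-node network: nodes 0-1 joined with weight 2, nodes 1-2 with weight 1.
W = np.array([[0., 2., 0.],
              [2., 0., 1.],
              [0., 1., 0.]])
L = graph_laplacian(W)
print(L)
print(in_laplacian_space(L))  # True
```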

Since the sample space for graph Laplacian data is non-Euclidean, standard approaches to non-parametric regression cannot be applied directly. In this paper, we use the statistical framework introduced in Severnetal19 for extrinsic analysis of graph Laplacian data, in which “extrinsic” refers to the strategy of mapping data into a Euclidean space, where analysis is performed, before mapping back to $\mathcal{L}_m$. The choice of mapping enables freedom in the choice of metric used for the statistical analysis, and in various applications of manifold-valued data analysis there is evidence of advantage in using non-Euclidean metrics (Drydenetal09; Pigolietal14).

A summary of the key steps in the extrinsic approach is:

  1. embedding $\mathcal{L}_m$ into another manifold $\mathcal{M}$ by raising the graph Laplacian matrix to a power,

  2. mapping from $\mathcal{M}$ to a (Euclidean) tangent space $T_\nu(\mathcal{M})$ in which to carry out statistical analysis,

  3. inverse mapping from the tangent space back to the embedding manifold $\mathcal{M}$,

  4. reversing the powering in step 1, and projecting back to graph Laplacian space $\mathcal{L}_m$,

which we explain in more detail as follows. First write $L = U \Lambda U^\top$ by the spectral decomposition theorem, with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$, where $\lambda_1, \ldots, \lambda_m$ are the eigenvalues, which are non-negative as any $L \in \mathcal{L}_m$ is positive semi-definite, and the columns of $U$ are the corresponding eigenvectors of $L$. We consider the following map which raises the graph Laplacian to the power $\alpha > 0$:

$$F_\alpha(L) = L^\alpha = U \Lambda^\alpha U^\top. \qquad (2)$$

In this paper we take the embedding manifold $\mathcal{M}$ to be the Euclidean space of symmetric matrices. In terms of $F_\alpha$ we define the power Euclidean distance between two graph Laplacians as

$$d_\alpha(L_1, L_2) = \| F_\alpha(L_1) - F_\alpha(L_2) \| = \| L_1^\alpha - L_2^\alpha \|, \qquad (3)$$

where $\| X \| = \sqrt{\mathrm{trace}(X^\top X)}$ is the Frobenius norm, also known as the Euclidean norm. For the special case $\alpha = 1$, (3) is just the Euclidean distance, $d_E$. Severnetal19 further considered a Procrustes distance which includes minimization over an orthogonal matrix, approximately allowing relabelling of the nodes, and in this case the embedding manifold is a Riemannian manifold known as the size-and-shape space (Drydmard16, p.99). However, in this paper, for simplicity and because the labelling is known, we shall just consider the power Euclidean metric.
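A minimal sketch of the power map (2) and the power Euclidean distance (3), following the eigendecomposition-based definitions above; clipping tiny negative eigenvalues arising from floating-point round-off is our own numerical safeguard:

```python
import numpy as np

def laplacian_power(L, alpha):
    """F_alpha(L) = U diag(lambda^alpha) U^T, via the spectral decomposition."""
    lam, U = np.linalg.eigh(L)      # eigenvalues of a graph Laplacian are >= 0
    lam = np.clip(lam, 0.0, None)   # guard against small negative round-off
    return (U * lam**alpha) @ U.T

def power_euclidean_distance(L1, L2, alpha=1.0):
    """d_alpha(L1, L2) = ||L1^alpha - L2^alpha|| (Frobenius norm)."""
    return np.linalg.norm(laplacian_power(L1, alpha) - laplacian_power(L2, alpha))
```

With `alpha=1.0` this reduces to the Euclidean distance $d_E$; `alpha=0.5` gives the square root Euclidean metric used in the applications below.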

After the embedding, the data are mapped to a tangent space $T_\nu(\mathcal{M})$ of the embedding manifold at $\nu$ using a bijective transformation. In our applications we take $\nu = 0$, and the tangent co-ordinates are $v = \mathrm{vech}(H L^\alpha H^\top)$, where $\mathrm{vech}(\cdot)$ is the vector of elements above and including the diagonal and $H$ is the Helmert sub-matrix (Drydmard16, p.49). The main point of note is that the tangent space is a Euclidean space of dimension $m(m-1)/2$. Hence, multivariate linear statistical analysis can be carried out in the tangent space; for example, Severnetal19 considered estimation, two-sample hypothesis tests, and linear regression.
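A sketch of the tangent co-ordinate computation under the assumptions above; the Helmert sub-matrix follows the standard construction (rows orthonormal and orthogonal to the vector of ones), and `vech` here stacks the upper triangle including the diagonal:

```python
import numpy as np

def helmert_submatrix(m):
    """(m-1) x m Helmert sub-matrix: row j has j entries 1/sqrt(j(j+1)),
    then -j/sqrt(j(j+1)), then zeros."""
    H = np.zeros((m - 1, m))
    for j in range(1, m):
        c = 1.0 / np.sqrt(j * (j + 1))
        H[j - 1, :j] = c
        H[j - 1, j] = -j * c
    return H

def vech(S):
    """Stack the upper-triangular elements (including diagonal) of S."""
    return S[np.triu_indices_from(S)]

def tangent_coords(L, alpha=1.0):
    """Tangent co-ordinates v = vech(H L^alpha H^T), taking nu = 0."""
    m = L.shape[0]
    H = helmert_submatrix(m)
    La = laplacian_power(L, alpha)   # from the earlier sketch
    return vech(H @ La @ H.T)
```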

After carrying out statistical analysis, the fitted values are then transformed back to the graph Laplacian space as follows:

$$\hat{L} = P_{\mathcal{L}}\{ F_{1/\alpha}(S) \}, \qquad (4)$$

where $F_{1/\alpha}$ is a map that reverses the power, $S \in \mathcal{M}$ is the fitted value mapped back to the embedding manifold, and $P_{\mathcal{L}}$ is the projection to the closest point in graph Laplacian space $\mathcal{L}_m$ using Euclidean distance. The projection is obtained by solving a convex optimization problem using quadratic programming, and the solution is therefore unique. See Severnetal19 for full details of this framework.
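The projection $P_{\mathcal{L}}$ can be computed by noting that a graph Laplacian is determined by its $m(m-1)/2$ non-negative edge weights, so projecting a symmetric matrix reduces to a non-negative least squares problem. This is our own formulation via `scipy.optimize.nnls`; a dedicated quadratic programming solver would also work:

```python
import numpy as np
from scipy.optimize import nnls

def project_to_laplacian(S):
    """Closest graph Laplacian to symmetric S in Frobenius norm.
    Parameterise L by edge weights w_ij >= 0: L_ij = -w_ij (i != j),
    L_ii = sum_j w_ij, and solve min_w ||L(w) - S||_F by NNLS."""
    m = S.shape[0]
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    A = np.zeros((m * m, len(pairs)))
    for k, (i, j) in enumerate(pairs):
        A[i * m + j, k] = -1.0   # L_ij = -w_ij
        A[j * m + i, k] = -1.0   # L_ji = -w_ij
        A[i * m + i, k] = 1.0    # w_ij adds to the degree of node i
        A[j * m + j, k] = 1.0    # ... and to the degree of node j
    w, _ = nnls(A, S.flatten())
    return (A @ w).reshape(m, m)
```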

For visualising results, it is useful to map the data and fitted regression lines in $\mathcal{L}_m$ into $\mathbb{R}^2$. In this paper we do so using principal component analysis (PCA), such that the two plotted dimensions reflect the two orthogonal dimensions of greatest sample variability in the tangent space (Severnetal19).

2 Nadaraya-Watson estimator for network data

2.1 Nadaraya-Watson estimator

We first review the classical Nadaraya-Watson estimator (10.2307/25049340; doi:10.1137/1110024) before defining an analogous version for data on $\mathcal{L}_m$. Consider the regression problem of predicting an unknown variable $y \in \mathbb{R}$ from a known covariate $x \in \mathbb{R}$, given a dataset of independent and identically distributed random variables $(x_1, y_1), \ldots, (x_n, y_n)$. The aim is to estimate the regression function $m(x) = \mathrm{E}(y \mid x)$. The Nadaraya-Watson estimator is

$$\hat{m}(x) = \frac{\sum_{i=1}^n K_b(x - x_i)\, y_i}{\sum_{i=1}^n K_b(x - x_i)}, \qquad (5)$$

where $K_b$ is a kernel function with bandwidth $b > 0$.
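A direct transcription of (5), assuming a Gaussian kernel as in (7) below:

```python
import numpy as np

def nadaraya_watson(x0, x, y, b):
    """Classical Nadaraya-Watson estimate m_hat(x0) of E(y | x = x0)."""
    k = np.exp(-0.5 * ((x0 - np.asarray(x)) / b) ** 2)  # kernel weights K_b(x0 - x_i)
    return np.sum(k * np.asarray(y)) / np.sum(k)
```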

Consider now a version of the regression problem with $y$ replaced by an $m \times m$ graph Laplacian matrix $L \in \mathcal{L}_m$, with known covariates and dataset $(x_1, L_1), \ldots, (x_n, L_n)$. We wish to estimate the regression function $\mu(x) = \mathrm{E}(L \mid x)$. A natural analogue of $\hat{m}(x)$ in (5) for graph Laplacian data, given covariate $x$, is

$$\hat{\mu}_E(x) = \sum_{i=1}^n w_i(x)\, L_i, \qquad (6)$$

where

$$w_i(x) = \frac{K_b(x - x_i)}{\sum_{j=1}^n K_b(x - x_j)},$$

and note that $\sum_{i=1}^n w_i(x) = 1$. A common choice of kernel function is the Gaussian kernel

$$K_b(x) = \frac{1}{b\sqrt{2\pi}} \exp\!\left( -\frac{x^2}{2 b^2} \right), \qquad (7)$$

which is bounded above and strictly positive for all $x$. We use a truncated version of (7), setting $K_b(x) = 0$ for $|x| > c$ (with $c$ large), in order that this truncated kernel has compact support, as required by theoretical results presented later. Wherever $\hat{\mu}_E(x)$ is defined (meaning that at least one of the $K_b(x - x_i)$ is non-zero) it is a sum of positively weighted graph Laplacians. Since the space $\mathcal{L}_m$ is a convex cone (ginestet2017hypothesis), any such positively weighted sum remains in the cone, thus $\hat{\mu}_E(x) \in \mathcal{L}_m$ as required.
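A sketch of (6) with the truncated Gaussian kernel; the truncation constant `c` here is our illustrative choice:

```python
import numpy as np

def truncated_gaussian_kernel(u, b, c=10.0):
    """Gaussian kernel K_b(u), set to zero for |u| > c for compact support."""
    k = np.exp(-0.5 * (u / b) ** 2) / (b * np.sqrt(2 * np.pi))
    return np.where(np.abs(u) > c, 0.0, k)

def nw_laplacian(x0, x, Ls, b):
    """Nadaraya-Watson estimate (6): a convex combination of graph Laplacians,
    which stays inside the convex cone of graph Laplacians."""
    k = truncated_gaussian_kernel(x0 - np.asarray(x), b)
    if k.sum() == 0:
        raise ValueError("estimator undefined: all kernel weights are zero")
    w = k / k.sum()
    return sum(wi * Li for wi, Li in zip(w, Ls))
```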

The estimator in (6) can equivalently be written as the graph Laplacian that minimises a weighted sum of squared Euclidean, $d_E$, distances to the sample data,

$$\hat{\mu}_E(x) = \arg\min_{L \in \mathcal{L}_m} \sum_{i=1}^n w_i(x)\, d_E(L, L_i)^2, \qquad (8)$$

using weighted least squares. In principle, the Euclidean distance in (8) can be replaced with a different distance metric, $d$, though solving for the estimator then entails an optimisation on the manifold $\mathcal{L}_m$, which can be theoretically and computationally challenging. Hence instead we generalise to other distances via an extrinsic approach, and define

$$\hat{\mu}_d(x) = P_{\mathcal{L}}\left\{ \arg\min_{\mu \in \mathcal{M}} \sum_{i=1}^n w_i(x)\, d(\mu, L_i)^2 \right\}, \qquad (9)$$

which is simpler provided, as here, the embedding manifold $\mathcal{M}$ is chosen such that the optimisation is straightforward. The projection $P_{\mathcal{L}}$ is needed to map back to graph Laplacian space $\mathcal{L}_m$. For the power Euclidean metric, $d_\alpha$, consider the Nadaraya-Watson estimator in the tangent space,

$$\hat{v}(x) = \sum_{i=1}^n w_i(x)\, v_i, \qquad (10)$$

where $v_i$ are the tangent co-ordinates of $L_i$; after mapping back to $\mathcal{L}_m$ using (4), the resulting Nadaraya-Watson estimator in the graph Laplacian space is

$$\hat{\mu}_\alpha(x) = P_{\mathcal{L}}\left[ F_{1/\alpha}\{ \hat{S}(x) \} \right], \qquad (11)$$

where $\hat{S}(x) \in \mathcal{M}$ is the image of $\hat{v}(x)$ under the inverse tangent mapping. When $\alpha = 1$ this simplifies to (6).
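Combining the earlier sketches gives the estimator (11). Because the tangent mapping is linear and invertible on centred symmetric matrices, averaging tangent co-ordinates and mapping back is equivalent to averaging the powered matrices $L_i^\alpha$ directly; this sketch (our own simplification) uses that shortcut:

```python
import numpy as np

def nw_laplacian_power(x0, x, Ls, b, alpha=0.5):
    """Power Euclidean Nadaraya-Watson estimate (11). Averaging tangent
    co-ordinates is equivalent to averaging the powered Laplacians."""
    k = truncated_gaussian_kernel(x0 - np.asarray(x), b)  # earlier sketch
    w = k / k.sum()
    S = sum(wi * laplacian_power(Li, alpha) for wi, Li in zip(w, Ls))
    back = laplacian_power(S, 1.0 / alpha)   # reverse the power, F_{1/alpha}
    return project_to_laplacian(back)        # project back into Laplacian space
```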

2.2 Uniform weak consistency

First we show that the Nadaraya-Watson estimator for graph Laplacians (6) is uniformly weakly consistent.

Let

$$J_n = \mathrm{E}\left[ \int d_E\{ \hat{\mu}_E(x), \mu(x) \}^2 \, \mathrm{d}P(x) \right], \qquad (12)$$

where $P$ is the probability measure of $x$.

Proposition 1.

Suppose the kernel function $K_b$ is non-negative on $\mathbb{R}$, bounded above, has compact support, and is strictly positive in a neighbourhood of the origin. If $b \to 0$ and $nb \to \infty$ as $n \to \infty$, it follows that $J_n \to 0$. Hence the Nadaraya-Watson estimator $\hat{\mu}_E$ is uniformly weakly consistent for the true regression function $\mu$.

Proof.

Consider the univariate regression problem for the $(j,k)$th element of $L$. From devroye1980 we know that under the conditions of the proposition we have

$$\mathrm{E}\left[ \int \{ \hat{\mu}_E(x)_{jk} - \mu(x)_{jk} \}^2 \, \mathrm{d}P(x) \right] \to 0$$

as $b \to 0$ and $nb \to \infty$; and since $d_E\{ \hat{\mu}_E(x), \mu(x) \}^2 = \sum_{j,k} \{ \hat{\mu}_E(x)_{jk} - \mu(x)_{jk} \}^2$ is a finite sum of such terms, thus $J_n \to 0$. ∎

The result can be extended to the power Euclidean distance based Nadaraya-Watson estimator (11). Let

$$J_{n,\alpha} = \mathrm{E}\left[ \int d_E\{ \hat{\mu}_\alpha(x), \mu(x) \}^2 \, \mathrm{d}P(x) \right], \qquad (13)$$

where $P$ is the probability measure of $x$.

Proposition 2.

Under the conditions of Proposition 1 it follows that $J_{n,\alpha} \to 0$. Hence the power Euclidean Nadaraya-Watson estimator $\hat{\mu}_\alpha$ is uniformly weakly consistent for the true regression function $\mu$.

Proof.

First embed the graph Laplacians in the Euclidean manifold $\mathcal{M}$ and map to the tangent space $T_\nu(\mathcal{M})$. Consider the univariate regression problem for the $j$th element of the tangent co-ordinates $v$. Again from devroye1980; spiegelman1980 we know that under the conditions of the proposition we have uniform weak consistency in the tangent space,

$$\mathrm{E}\left[ \int \{ \hat{v}(x)_j - v(x)_j \}^2 \, \mathrm{d}P(x) \right] \to 0,$$

as $b \to 0$ and $nb \to \infty$. Also, using the continuous mapping theorem and Pythagorean arguments as in Severnetal19, consistency in the tangent space carries over to the graph Laplacian space, so that $J_{n,\alpha} \to 0$ as $b \to 0$ and $nb \to \infty$. ∎

2.3 Bandwidth selection

The result that the power Euclidean Nadaraya-Watson estimator is uniformly weakly consistent gives reassurance that the method is a sensible practical approach to non-parametric regression for predicting networks. The result is asymptotic, however, which leaves open the question of how to choose the bandwidth, $b$, in practice. One way to do so is to select it via cross validation (Efron93) as follows. Denote by $\hat{\mu}_d^{(-i)}(x_i; b)$ the Nadaraya-Watson estimator at $x_i$, based on distance metric $d$, with bandwidth $b$, trained on all the training observations excluding the $i$th. Selection of bandwidth by cross validation then involves choosing $b$ to minimise the criterion

$$\mathrm{CV}(b) = \sum_{i=1}^n d\{ \hat{\mu}_d^{(-i)}(x_i; b), L_i \}^2. \qquad (14)$$
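A leave-one-out implementation of (14) built on the earlier sketches; the grid of candidate bandwidths is illustrative:

```python
import numpy as np

def cv_bandwidth(x, Ls, bandwidths, alpha=0.5):
    """Select the bandwidth b minimising the leave-one-out criterion (14),
    using the power Euclidean metric d_alpha."""
    x = np.asarray(x, dtype=float)
    best_b, best_cv = None, np.inf
    for b in bandwidths:
        cv = 0.0
        for i in range(len(x)):
            x_tr = np.delete(x, i)
            L_tr = [L for j, L in enumerate(Ls) if j != i]
            fit = nw_laplacian_power(x[i], x_tr, L_tr, b, alpha)
            cv += power_euclidean_distance(fit, Ls[i], alpha) ** 2
        if cv < best_cv:
            best_b, best_cv = b, cv
    return best_b

# e.g. b_opt = cv_bandwidth(months, laplacians, bandwidths=[1, 2, 5, 10])
```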

3 Application: Enron email corpus

The Enron dataset was made public during the legal investigation of Enron by the Federal Energy Regulatory Commission (klimt2004introducing), and an overview can be found in Diesner2005. Similar to shetty2004enron, we use these data to form social networks between employees; the data are available from Enrondata. For each month we create a network with employees as nodes. The edges between nodes have weights equal to the number of emails exchanged between the two employees in the given month. The networks we consider are for the 36 whole months from June 1999 (month 1) to May 2002 (month 36), and we standardise each network by dividing its graph Laplacian by its trace. The aim is to model smooth trends in the structure of the dynamic networks as they evolve over time, and we also wish to highlight anomalous months where the network is rather unusual compared to the fitted trend.

Figure 1: Distances, $d(L_t, L_{t+1})$, between consecutive observations for the monthly Enron networks for (a) the Euclidean metric, $d_E$, and (b) the square root Euclidean metric, $d_{1/2}$.

In Figure 1 we plot the distances between consecutive monthly graph Laplacians using the Euclidean distance (a) and the square root Euclidean distance (b). Several of the largest successive distances stand out, and these are possible candidate positions for anomalous networks that are rather different.

Figure 2: PCA plots showing the data and Nadaraya-Watson curves, and residual plots, for the Enron network data. In each plot the red digits indicate the observation number (month index). In the upper plots the black lines show the Nadaraya-Watson regression curves in the space of the first two principal components, using (a) the Euclidean metric, $d_E$, and (b) the square root Euclidean metric, $d_{1/2}$. We performed the calculations for various values of the bandwidth $b$, and the value chosen was whichever was optimal with respect to (14). Plots (c) and (d) are the corresponding residual plots, showing the distance between the fitted values and the observations.
Figure 3: MDS plots using the Mahalanobis metric for (a) the Enron data with $d_E$ and (b) with $d_{1/2}$, using an overall least squares estimate of $\phi$; and the Mahalanobis metric with $\phi$ estimated to maximize the variance explained by PC1, for (c) the Enron data with $d_E$ and (d) with $d_{1/2}$. The red digits indicate the month of the data.

We provide a PCA plot of the first two PC scores in Figure 2(a),(b), and include the Nadaraya-Watson estimator projected into the space of the first two PCs. Here the bandwidth has been chosen by cross-validation, separately for the Euclidean case and for the square root metric. The Nadaraya-Watson estimator provides a smooth path through the data, and the structure is clearer in the square root metric plot.

We are interested in finding anomalies in the Enron dynamic networks, and so we compute the distance from each network to the fitted value from the Nadaraya-Watson estimate. Figure 2 shows these residual distances of each graph Laplacian to the fitted Nadaraya-Watson values for (c) the Euclidean metric and (d) the square root metric. Some of the largest residuals are at months 1, 7 and 35 for the Euclidean metric and months 7, 33, 34 and 35 for the square root metric, and these are candidates for anomalies.

From Figure 2(b) there appears to be an approximate horseshoe shape in the PC score plot, which could be an example of the horseshoe effect (kendall1971abundance; diaconis2008horseshoes; morton2017uncovering). We might conclude from these plots that there is a change point in the data around months 20-26, but this may be misleading (doi:10.1098/rsta.1970.0091). As explained in Mardiaetal79, the horseshoe effect occurs when distances between data points that are “large” appear the same as those that are “moderate”. morton2017uncovering described this as a “saturation property” of the metric, so that on the PCA plot the point corresponding to a “large” time is pulled in closer to time 1 than we would intuitively expect.

As an alternative to PCA, which seeks to address this horseshoe effect, we consider multidimensional scaling (MDS) with a Mahalanobis metric in the tangent space (Mardiaetal79, p.31) between two graph Laplacians $L_s$ and $L_t$, at times $s$ and $t$ respectively, which is

$$d_M(L_s, L_t) = \sqrt{ (v_s - v_t)^\top \Sigma_{st}^{-1} (v_s - v_t) },$$

where $v_s, v_t$ are the tangent co-ordinates and $\mu$ and $\Sigma_{st}$ are the mean and covariance matrix of $v_s - v_t$ respectively. Here we take $\mu$ as zero and consider an isotropic AR(1) model, which has covariance matrix

$$\Sigma_{st} = 2\sigma^2 \left( 1 - \phi^{|s-t|} \right) I,$$

which is a diagonal matrix where the diagonal elements are the variance of the elements of $v_s - v_t$, and we have assumed a zero covariance between any other elements. Writing the observed squared distances $\|v_s - v_t\|^2$ as a function of the lag $|s - t|$, we estimate $\phi$ by least squares, and we take $\sigma^2 = 1$ as this is just an overall scale parameter. The Mahalanobis metric between graph Laplacians, $L_s$ and $L_t$, can now be written as

$$d_M(L_s, L_t) = \frac{ \| v_s - v_t \| }{ \sqrt{ 2 \left( 1 - \phi^{|s-t|} \right) } }.$$

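A sketch of this time-adjusted Mahalanobis distance under the AR(1) reconstruction above; the grid-based least squares fit of $\phi$ is our own illustrative choice:

```python
import numpy as np

def fit_phi(V, times, grid=np.linspace(0.01, 0.99, 99)):
    """Least squares estimate of the AR(1) parameter phi: match observed
    squared tangent distances ||v_s - v_t||^2 to c * (1 - phi^|s-t|)."""
    n = len(times)
    d2, lag = [], []
    for s in range(n):
        for t in range(s + 1, n):
            d2.append(np.sum((V[s] - V[t]) ** 2))
            lag.append(abs(times[s] - times[t]))
    d2, lag = np.array(d2), np.array(lag)
    best_phi, best_rss = grid[0], np.inf
    for phi in grid:
        f = 1.0 - phi ** lag
        c = np.sum(f * d2) / np.sum(f * f)   # optimal scale for this phi
        rss = np.sum((d2 - c * f) ** 2)
        if rss < best_rss:
            best_phi, best_rss = phi, rss
    return best_phi

def mahalanobis_time_distance(vs, vt, s, t, phi):
    """d_M(L_s, L_t) = ||v_s - v_t|| / sqrt(2 (1 - phi^|s-t|))."""
    return np.linalg.norm(vs - vt) / np.sqrt(2.0 * (1.0 - phi ** abs(s - t)))
```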
The plots of MDS with the Mahalanobis distance are given in Figure 3(a)-(b). In both plots there are large distances between the first few and last few observations compared to the central observations, which is broadly in keeping with Figure 1(a),(b), although the middle observations do seem too close together in the MDS plots. We consider an alternative estimate of $\phi$, chosen to maximise the variance explained by the first PC scores for each example, shown in Figure 3(c)-(d). These final MDS plots are more in agreement with the distance plots of Figure 1. In particular, in Figure 3(d) we see that months 7, 34, 35 and 36 look rather different from the rest.

Finally we consider the main features of all the results from Figures 1-3, and we see that the 7th, 34th and 35th months stand out as strong anomalies. The 7th month corresponds to December 1999, which is also picked out as an anomaly in wang2014locality, believed to coincide with Enron’s tentative sham energy deal with Merrill Lynch, created to meet profit expectations and boost the stock price. Months 34 and 35 correspond to March and April 2002, when the former Enron auditor, Arthur Andersen, was indicted for obstruction of justice (guardianEnron).

4 Application: 19th century novel networks

We consider an application where it is of interest to analyze dynamic networks from the novels of Jane Austen and Charles Dickens. The 7 novels of Austen and 16 novels of Dickens were represented as samples of network data by Severnetal19. Each novel is represented by a network in which each node is a word, and edges are formed with weights proportional to the number of times a pair of words co-occurs closely in the text; specifically, words are said to co-occur if they appear within five words of each other. A choice that needs to be made is whether to allow co-occurrences across sentence boundaries and chapter boundaries (evert2008corpora, Section 3), and for this dataset we allow them. The data are obtained from CLiC (doi:10.3366/cor.2016.0102).

Figure 4: Regression paths for Austen novels (blue) between the years 1794 and 1815, numbered 1-7 according to the chronology of the novels, and for Dickens novels (red) between the years 1836 and 1870, numbered 8-23; using (left to right) the metrics $d_E$ and $d_{1/2}$, with three bandwidths $b$ (top to bottom), one of which gave the smallest value of the cross-validation criterion (14).

We take the node set to be the most common words across all the novels of Austen and Dickens. A pre-processing step is to normalise each graph Laplacian, in order to remove the gross effects of the different lengths of the novels, by dividing each graph Laplacian by its own trace, resulting in a trace of 1 for each novel. Our key statistical goal is to investigate the authors’ evolving writing styles, by carrying out non-parametric regression with a graph Laplacian response and, as covariate, the year in which each novel was written.

We apply Nadaraya-Watson regression to the Jane Austen and Charles Dickens networks separately to predict their writing styles at different times. The response is a graph Laplacian and the covariate is time, with a separate regression for each novelist. We compared the metrics $d_E$ and $d_{1/2}$. For each author a Nadaraya-Watson estimate was produced for every 6 months within the period the author was writing, and we compared different bandwidths, $b$, in the Gaussian kernel. The results are shown in Figure 4, plotted in the space of the first and second principal components for all the novels.

For both metrics, the regression lines for Dickens are not at all smooth with the smallest bandwidth. For both metrics with a larger bandwidth, the regression line for Dickens appears to show an initial smooth trend, then a turning point around the years 1850 and 1851 (between David Copperfield and Bleak House, which are novels 16 and 17 in Figure 4); after 1851 there is much less dependence on time. This change in structure is especially evident, for both metrics, in the plot with the bandwidth that gave the smallest value of the cross-validation criterion (14) out of the choices considered. In the year 1851 Dickens had a tragic year, including his wife having a nervous breakdown, his father dying, and his youngest child dying (charlesDickens). The possible turning point is around the same time as these significant events.

As there are far fewer novels written by Austen, it is less obvious whether there is any turning point in her writing; however, it is clear that Lady Susan (novel 1) is an anomaly, not fitting with the regression curve, which follows Austen’s other works more closely. Lady Susan is Austen’s earliest work, a short novella published 54 years after her death.

5 Discussion

The two applications presented involve a scalar covariate, but the Nadaraya-Watson estimator is applicable to more general covariates, e.g. spatial covariates. A further extension would be to adapt the method of kriging, also referred to as Gaussian process prediction. Kriging is a geospatial method for prediction at points on a random field (e.g. see Cressie93), and Pigolietal16 considered kriging for manifold-valued data. The kriging predictor of an unknown graph Laplacian at a location on a random field, given a dataset with known coordinates, is of the form $\sum_{i=1}^n w_i L_i$, where the weights, $w_i$, are determined by minimizing the mean square prediction error for a given covariance function.

The Nadaraya-Watson estimator can also be applied in a reverse setting where some scalar variable $y$ depends on the graph Laplacian $L$. This could be used if, for example, one had the times networks were produced and then wanted to predict the time a new network was produced. In this case the Nadaraya-Watson estimator is a linear combination of known values, weighted by the graph Laplacian distances, given by

$$\hat{m}(L) = \frac{ \sum_{i=1}^n K_b\{ d(L, L_i) \}\, y_i }{ \sum_{i=1}^n K_b\{ d(L, L_i) \} }, \qquad (15)$$

where $d$ can be any metric between two graph Laplacians. Severn19 provided an application of this approach using the Gaussian kernel defined in (7), predicting the time that a novel was written using the network graph Laplacian as a covariate.
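A sketch of (15), reusing the power Euclidean distance from the earlier sketch; any graph Laplacian metric could be substituted for `dist`:

```python
import numpy as np

def nw_reverse(L0, Ls, y, b, dist=power_euclidean_distance):
    """Predict a scalar y (e.g. time) from a graph Laplacian L0 via (15)."""
    d = np.array([dist(L0, Li) for Li in Ls])
    k = np.exp(-0.5 * (d / b) ** 2)   # Gaussian kernel applied to distances
    return np.sum(k * np.asarray(y)) / np.sum(k)
```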

Other metrics could also be used, for example the Procrustes metric of Severnetal19. To solve (9) for the Procrustes metric, the algorithm for obtaining a weighted generalised Procrustes mean given in Drydmard16 can be implemented.

In Euclidean space there are more general results for the Nadaraya-Watson estimator, including weak convergence in $L^p$ norm, rather than the $L^2$ results that we have used (devroye1980; spiegelman1980). More general results also exist, e.g. see Walk2002, including strong consistency. It will be interesting to explore which of these results can be extended to graph Laplacians, although the additive properties of the $L^2$ case have been particularly important in our work.

Acknowledgments

This work was supported by the Engineering and Physical Sciences Research Council [grant number EP/T003928/1]. The datasets were derived from the following resources available in the public domain: The Network Repository http://networkrepository.com and CLiC https://clic.bham.ac.uk

References