Manifold valued data analysis of samples of networks, with applications in corpus linguistics

02/21/2019 ∙ by Katie Severn, et al. ∙ The University of Nottingham

Networks can be used in many applications, such as in the analysis of text documents, social interactions and brain activity. We develop a general framework for extrinsic statistical analysis of samples of networks, motivated by networks representing text documents in corpus linguistics. We identify networks with their graph Laplacian matrices, for which we define metrics, embeddings, tangent spaces, and a projection from Euclidean space to the space of graph Laplacians. This framework provides a way of computing means, performing principal component analysis and regression, and carrying out hypothesis tests, such as for testing for equality of means between two samples of networks. We apply the methodology to the set of novels by Jane Austen and Charles Dickens.


1 Introduction

The statistical analysis of networks dates back to at least the 1930s; however, interest has increased considerably in the 21st century (Kolaczyk, 2009). Networks are able to represent many different types of data, for example social networks, neuroimaging data and text documents. In this paper, each observation is a weighted network, denoted $G_m = (V, E)$, comprising a set of nodes, $V = \{v_1, \ldots, v_m\}$, and a set of edge weights, $E = \{w_{ij} : w_{ij} \ge 0,\ 1 \le i, j \le m\}$, indicating that nodes $v_i$ and $v_j$ are either connected by an edge of weight $w_{ij} > 0$, or else unconnected (if $w_{ij} = 0$). An unweighted network is the special case with $w_{ij} \in \{0, 1\}$. We restrict attention to networks that are undirected and without loops, so that $w_{ij} = w_{ji}$ and $w_{ii} = 0$; any such network can then be identified with its graph Laplacian matrix $L = (l_{ij})$, defined as

$$l_{ij} = \begin{cases} -w_{ij}, & i \ne j, \\ \sum_{k \ne i} w_{ik}, & i = j, \end{cases}$$

for $1 \le i, j \le m$. The graph Laplacian can be written as $L = D - A$, in terms of the adjacency matrix, $A = (w_{ij})$, and degree matrix $D = \mathrm{diag}(A 1_m)$, where $1_m$ is the $m$-vector of ones. The $i$th diagonal element of $D$ equals the degree of node $v_i$. The space of all graph Laplacians of dimension $m \times m$ is

$$\mathcal{L}_m = \{ L = (l_{ij}) : L = L^T;\ l_{ij} \le 0 \ \forall\, i \ne j;\ L 1_m = 0_m \}, \quad (1)$$

where $0_m$ is the $m$-vector of zeroes. The space $\mathcal{L}_m$ is a manifold, in particular a convex subset of the cone of symmetric positive semi-definite matrices with corners (Ginestet et al., 2017).
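As a minimal sketch of the definitions above, the following Python snippet builds a graph Laplacian from a small weighted adjacency matrix and checks the three conditions in (1). The function names and the toy weights are illustrative assumptions, not part of the paper.

```python
import numpy as np

def graph_laplacian(W):
    """Return L = D - A for a symmetric, non-negative weight matrix with zero diagonal."""
    A = np.asarray(W, dtype=float)
    if not np.allclose(A, A.T):
        raise ValueError("weights must be symmetric (undirected network)")
    if np.any(A < 0) or np.any(np.diag(A) != 0):
        raise ValueError("weights must be non-negative with no self-loops")
    D = np.diag(A @ np.ones(A.shape[0]))  # degree matrix diag(A 1_m)
    return D - A

def is_graph_laplacian(L, tol=1e-10):
    """Check symmetry, non-positive off-diagonals and zero row sums."""
    L = np.asarray(L, dtype=float)
    off_diag = L - np.diag(np.diag(L))
    return bool(np.allclose(L, L.T, atol=tol)
                and np.all(off_diag <= tol)
                and np.allclose(L.sum(axis=1), 0.0, atol=tol))

# Toy 3-node network (hypothetical weights)
W = np.array([[0, 2, 0],
              [2, 0, 1],
              [0, 1, 0]])
L = graph_laplacian(W)
```

The diagonal of `L` holds the node degrees, and its eigenvalues are non-negative, which is what allows matrix powers of `L` to be taken later.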

For the tasks we address, the data are a random sample from a population of networks, where each observation is a graph Laplacian representing a network with a common node set $V$. Graph Laplacians, as with most network representations, are not standard Euclidean data, and so for typical statistical tasks, such as computing the mean, performing principal component analysis and regression, and testing equality of means based on two samples, standard Euclidean methods need to be carefully adapted.

To perform statistical analysis on the manifold of graph Laplacians we must define suitable metrics. We will consider two general metrics between graph Laplacians $L_1$ and $L_2$, for a power $\alpha > 0$:

Euclidean power metric: $d_\alpha(L_1, L_2) = \| L_1^\alpha - L_2^\alpha \|$ (2)
Procrustes power metric: $d_{\alpha,S}(L_1, L_2) = \inf_{R \in O(m)} \| L_1^\alpha - L_2^\alpha R \|$ (3)

where the minimising $R$ is an orthogonal matrix for the ordinary Procrustes match of $L_2^\alpha$ to $L_1^\alpha$ (Dryden and Mardia, 2016, chapter 7) and $\| X \| = \{\mathrm{trace}(X^T X)\}^{1/2}$ is the Frobenius norm, which is also known as the Euclidean norm. Common choices of Euclidean power metrics and Procrustes metrics are $d_1$, $d_{1/2}$ and $d_{1/2,S}$, referred to as the Euclidean, square root Euclidean and Procrustes size-and-shape metrics respectively (Dryden, Koloydenko and Zhou, 2009). We provide more detail about these metrics in Section 3.
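The Euclidean, square root Euclidean and Procrustes size-and-shape distances can be sketched numerically as follows. This is an illustrative NumPy version, not the shapes package used in the paper. It assumes the Procrustes match is a minimisation over all orthogonal matrices, which has a closed form in terms of singular values. Function names and the toy Laplacians are hypothetical.

```python
import numpy as np

def matrix_power_psd(L, alpha):
    """L^alpha for a symmetric positive semi-definite L via its spectral decomposition."""
    vals, vecs = np.linalg.eigh(L)
    vals = np.clip(vals, 0.0, None)  # remove tiny negative round-off
    return (vecs * vals**alpha) @ vecs.T

def euclidean_power_distance(L1, L2, alpha=1.0):
    """Frobenius distance between the powered matrices."""
    return np.linalg.norm(matrix_power_psd(L1, alpha) - matrix_power_psd(L2, alpha))

def procrustes_power_distance(L1, L2, alpha=0.5):
    """min over orthogonal R of ||A - B R||_F with A = L1^alpha, B = L2^alpha."""
    A, B = matrix_power_psd(L1, alpha), matrix_power_psd(L2, alpha)
    # ||A - B R||^2 = ||A||^2 + ||B||^2 - 2 tr(R^T B^T A); the trace term is
    # maximised by the sum of the singular values of B^T A.
    s = np.linalg.svd(B.T @ A, compute_uv=False)
    sq = np.linalg.norm(A)**2 + np.linalg.norm(B)**2 - 2.0 * s.sum()
    return np.sqrt(max(sq, 0.0))

# Two toy graph Laplacians on three nodes (hypothetical weights)
L1 = np.array([[2.0, -2.0, 0.0], [-2.0, 3.0, -1.0], [0.0, -1.0, 1.0]])
L2 = np.array([[1.0, -1.0, 0.0], [-1.0, 4.0, -3.0], [0.0, -3.0, 3.0]])
d_euclid = euclidean_power_distance(L1, L2, alpha=1.0)
d_sqrt = euclidean_power_distance(L1, L2, alpha=0.5)
d_proc = procrustes_power_distance(L1, L2, alpha=0.5)
```

Because the identity matrix is one of the candidate orthogonal matrices, the Procrustes distance can never exceed the square root Euclidean distance for the same power.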

Analysing networks by representing them as elements of the space of graph Laplacians $\mathcal{L}_m$ is an approach also used by Ginestet et al. (2017). The authors considered the Euclidean metric $d_1$ and derived a central limit theorem, which they used to develop a test between two samples of networks, driven by an application in neuroimaging. Motivation for our considering metrics other than $d_1$ includes evidence that interpolation of non-Euclidean data based on $d_1$ often has disadvantages, such as swelling (in the context of positive semi-definite matrices (Dryden, Koloydenko and Zhou, 2009)) and lack of interpretability (in the context of graph Laplacians (Bakker, Halappanavar and Sathanur, 2018)).

2 Application: Jane Austen and Charles Dickens novels

In corpus linguistics, networks are used to model documents comprising a text corpus (Phillips, 1983). Each node represents a word, and edges indicate words that co-occur within some span of each other, typically 5 words, which we use henceforth (Evert, 2008). Our dataset is derived from the full text of the novels by Jane Austen and Charles Dickens listed in Table 1, obtained from CLiC (Mahlberg et al., 2016). A Christmas Carol and Lady Susan are short novellas rather than novels, but we shall use the term “novel” for each of the works for ease of explanation. For each of the 7 Austen and 16 Dickens novels, the “year written” refers to the year in which the author started writing the novel; see The Jane Austen Society of North America (2018) and Charles Dickens Info (2018). Our key statistical goals are to investigate the authors’ evolving writing styles, by regressing the networks on “year written”; to explore dominant modes of variability, by developing principal component analysis for samples of networks; and to test for significance of differences in Austen’s and Dickens’ writing styles, via a two-sample test of equality of mean networks.

Author    Novel name                   Abbreviation   Year written
Austen    Lady Susan                   LS             1794
Austen    Sense and Sensibility        SE             1795
Austen    Pride and Prejudice          PR             1796
Austen    Northanger Abbey             NO             1798
Austen    Mansfield Park               MA             1811
Austen    Emma                         EM             1814
Austen    Persuasion                   PE             1815
Dickens   The Pickwick Papers          PP             1836
Dickens   Oliver Twist                 OT             1837
Dickens   Nicholas Nickleby            NN             1838
Dickens   The Old Curiosity Shop       OCS            1840
Dickens   Barnaby Rudge                BR             1841
Dickens   Martin Chuzzlewit            MC             1843
Dickens   A Christmas Carol            C              1843
Dickens   Dombey and Son               DS             1846
Dickens   David Copperfield            DC             1849
Dickens   Bleak House                  BH             1852
Dickens   Hard Times                   HT             1854
Dickens   Little Dorrit                LD             1855
Dickens   A Tale of Two Cities         TTC            1859
Dickens   Great Expectations           GE             1860
Dickens   Our Mutual Friend            OMF            1864
Dickens   The Mystery of Edwin Drood   ED             1870
Table 1: The Jane Austen and Charles Dickens novels from the CLiC database (Mahlberg et al., 2016)

For each Austen and Dickens novel we produce a network representing pairwise word co-occurrence. If the node set $V$ corresponded to every word in all the novels it would be very large, with , but a relatively small number of words are used far more than others. The top words cover of the total word frequency, cover , and cover . We focus on a truncated set of the $m$ most frequent words, and the $w_{ij}$’s are the pairwise co-occurrence counts between these words. In our analysis we choose $m = 1000$ as a sensible trade-off between having very large, very sparse graph Laplacians and small graph Laplacians of just the most common words. For each novel and the truncated node set, the network produced is converted to a graph Laplacian. As a pre-processing step, we normalise each graph Laplacian by dividing it by its own trace, which removes the gross effects of the different lengths of the novels and results in a trace of 1 for each novel.
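The construction described above can be sketched in Python as follows. This is a toy illustration under stated assumptions: the exact windowing convention is not given in the text, so the sketch counts every pair of vocabulary words at most `span` tokens apart, and the function name, toy sentence and vocabulary size are hypothetical.

```python
from collections import Counter
import numpy as np

def cooccurrence_laplacian(tokens, vocab, span=5):
    """Trace-normalised graph Laplacian of co-occurrence counts within `span` tokens."""
    index = {word: i for i, word in enumerate(vocab)}
    m = len(vocab)
    W = np.zeros((m, m))
    for pos, word in enumerate(tokens):
        i = index.get(word)
        if i is None:
            continue  # word is outside the truncated node set
        for other in tokens[pos + 1 : pos + 1 + span]:
            j = index.get(other)
            if j is not None and j != i:  # no self-loops
                W[i, j] += 1
                W[j, i] += 1
    L = np.diag(W.sum(axis=1)) - W
    total = np.trace(L)
    return L / total if total > 0 else L

tokens = "she said that he said that she was not there".split()
vocab = [word for word, _ in Counter(tokens).most_common(3)]  # truncated node set
L = cooccurrence_laplacian(tokens, vocab, span=5)
```

In the application the vocabulary would be the most frequent words across all novels rather than within a single text, so that every novel shares the same node set.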

As an indication of the broad similarity of the most common words we list the top 25 words in the table in Appendix A. Of the top 25 words across all novels 22 appear in the most frequent 25 words for the Dickens novels and 23 for the Austen novels. The words not, be, she do not appear in Dickens’ top 25 and the words mr and said do not appear in Austen’s top 25. Some differences in relative rank are immediately apparent: her, she, not having higher relative rank in Austen and he, his, mr, said having relatively higher rank in Dickens.

Figure 1: Cluster analysis and MDS plots based on (from top to bottom) the Euclidean distance, $d_1$, square root distance, $d_{1/2}$, and Procrustes distance, $d_{1/2,S}$, each with $m = 1000$. The plots display Austen’s novels in blue and lower case, and Dickens’s novels in red and upper case.

We initially compare some choices of distance metrics on the Austen and Dickens data after constructing the graph Laplacians from the $m = 1000$ most frequent words across all 23 novels. Figure 1 (left column) shows the results of a hierarchical cluster analysis using Ward’s method (Ward, 1963), based on pairwise distances between novels using the Euclidean, square root Euclidean and Procrustes size-and-shape metrics $d_1$, $d_{1/2}$ and $d_{1/2,S}$. For computing the Procrustes metric we use the shapes package (Dryden, 2018) in R (R Core Team, 2018).

The dendrograms for square root and Procrustes separate the authors into two very distinct clusters, whereas for Euclidean distance Dickens’ David Copperfield and Great Expectations are clustered with Austen’s Lady Susan which is unsatisfactory. The next sub-division of the Dickens cluster using square root/Procrustes distance splits into groups of the earlier novels versus later novels, with the exception being the historical novel A Tale of Two Cities which is clustered with the earlier novels. There is not such a clear sub-division for Dickens using the Euclidean metric. In the Austen cluster for square root and Procrustes there is clearly a large distance between Lady Susan and the rest, where Lady Susan is her earliest work, a short novella published 54 years after Austen’s death.

Figure 1 (right column) shows corresponding plots of the first two multi-dimensional scaling (MDS) variables from a classical multi-dimensional scaling analysis. The square root and Procrustes MDS plots are visually identical, although they are slightly different numerically. We see that there is a clear separation in MDS space between Austen’s and Dickens’ works with a very strong separation in MDS1 using the square root and Procrustes distances, and less so for Euclidean distance.
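For readers unfamiliar with classical multi-dimensional scaling, a minimal NumPy sketch is given below. It takes any matrix of pairwise distances (for example, distances between novels) and returns low-dimensional coordinates. The function name is an illustrative assumption and this is not the code used to produce Figure 1.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS coordinates (n x k) from an n x n matrix of pairwise distances."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * C @ (D**2) @ C                 # doubly centred squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]        # k largest eigenvalues
    vals_k = np.clip(vals[order], 0.0, None)  # guard against negative eigenvalues
    return vecs[:, order] * np.sqrt(vals_k)

# Toy example: four points on a line, recovered up to sign and shift
D = np.abs(np.subtract.outer([0.0, 1.0, 3.0, 6.0], [0.0, 1.0, 3.0, 6.0]))
coords = classical_mds(D, k=1)
```

When the input distances are not Euclidean, some eigenvalues of the centred matrix can be negative. The sketch simply truncates them at zero.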

3 Framework for the statistical analysis of graph Laplacians

3.1 Preliminary framework

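The general framework we will define in this section for the statistical analysis of graph Laplacians involves embedding the space of graph Laplacians $\mathcal{L}_m$, shown schematically in Figure 2. The identity projection illustrates that $\mathcal{L}_m \subset \mathcal{PSD}_m$, where

$$\mathcal{PSD}_m = \{ X \in \mathbb{R}^{m \times m} : X = X^T,\ y^T X y \ge 0 \ \forall\, y \in \mathbb{R}^m \} \quad (4)$$

is the space of symmetric positive semi-definite matrices of dimension $m \times m$. This is evident because any $L \in \mathcal{L}_m$ is diagonally dominant, as $l_{ii} = \sum_{j \ne i} |l_{ij}|$, which is a sufficient condition for $L \in \mathcal{PSD}_m$ (De Klerk, 2006, page 232).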

Figure 2: Schematic for the general framework for the statistical analysis of graph Laplacians.

Distance metrics such as (2) and (3) on manifolds are referred to as intrinsic or extrinsic. An intrinsic distance is the length of a shortest geodesic path in the manifold, whereas an extrinsic distance is one induced by a Euclidean distance in an embedding of the manifold (Dryden and Mardia, 2016, p112). On the space of graph Laplacians $\mathcal{L}_m$, the Euclidean distance $d_1$ is intrinsic, but in general $d_\alpha$ and $d_{\alpha,S}$ are extrinsic with respect to an embedding defined as follows.

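First, we write $L = U \Xi U^T$ by the spectral decomposition theorem, with $\Xi = \mathrm{diag}(\xi_1, \ldots, \xi_m)$ and $U = (u_1, \ldots, u_m)$, where $\xi_1, \ldots, \xi_m$ and $u_1, \ldots, u_m$ are the eigenvalues and corresponding eigenvectors of $L$. Since $L$ is positive semi-definite, $\xi_i \ge 0$ for all $i$, hence for any $\alpha > 0$

$$F_\alpha(L) = L^\alpha = U \Xi^\alpha U^T \quad (5)$$

embeds the space of graph Laplacians $\mathcal{L}_m$ into an embedding space $\mathcal{M}_m$. The embedding space $\mathcal{M}_m$ is dependent on the choice of metric, and is defined for specific metrics below.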

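Distance metrics (2) and (3) in terms of the embedding $F_\alpha(L) = L^\alpha$, for graph Laplacians $L_1$ and $L_2$, are hence

$$d_\alpha(L_1, L_2) = \| F_\alpha(L_1) - F_\alpha(L_2) \|, \qquad d_{\alpha,S}(L_1, L_2) = \inf_{R \in O(m)} \| F_\alpha(L_1) - F_\alpha(L_2) R \|.$$

These distances in fact hold more generally for symmetric positive semi-definite $L_1$ and $L_2$.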

We consider three choices of $G_\alpha$ for the reverse mapping back from the embedding space, which are suitable for different scenarios. The choice of $G_\alpha$ is dependent on whether we want to project to the space of positive semi-definite matrices before reversing the powering of $\alpha$.

When using the Euclidean power metric, the embedding space is the space of real symmetric $m \times m$ matrices with centred rows and columns, and for $Q = U \Lambda U^T$ in this space, with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$, we use

$$G_\alpha(Q) = Q^{1/\alpha} \quad \text{or} \quad G_\alpha(Q) = \{ U\, \mathrm{diag}(\max(0, \lambda_1), \ldots, \max(0, \lambda_m))\, U^T \}^{1/\alpha}.$$

Note that the second expression before taking the power is the closest symmetric positive semi-definite matrix to $Q$ in terms of Frobenius distance (Higham, 1988).

For the Procrustes power metric, the embedding space is the reflection size-and-shape space (Dryden, Koloydenko and Zhou, 2009; Dryden and Mardia, 2016, p67), and in this case we use

$$G_{\alpha,S}(Q) = (Q Q^T)^{1/(2\alpha)}.$$

We choose this reverse map as it removes the orthogonal matrices from the Procrustes fits, which we will see in the next section are introduced from the exponential map.

3.2 Tangent space

To perform further statistical analysis the inverse exponential map, $\exp_\nu^{-1}$, is used to project into a tangent space to the embedding space, in which standard statistical methods can be applied, where $\nu$ denotes the pole of the projection. Figure 3 shows a simple visualisation of a tangent space. The tangent space at $\nu$ is a Euclidean approximation touching the manifold, in which a geodesic becomes a straight line preserving distance to the pole. In non-Euclidean spaces, distances are the lengths of the shortest geodesic paths between two points on a manifold. The exponential map provides a connection from the tangent space to the manifold, and the inverse exponential map is the map from the manifold to the tangent space (Dryden and Mardia, 2016, Chapter 5).

Figure 3: A simple visualisation of the exp map, mapping X onto the tangent space at the pole.

As the graph Laplacian space has centering constraints on the rows and columns, these constraints are also preserved in our choice of embedding. We can remove the centering constraints and reduce dimension when projecting to a tangent space by pre- and post-multiplying by the Helmert sub-matrix $H$ and its transpose as a component of the projection. The Helmert sub-matrix $H$, of dimension $(m-1) \times m$, has $j$th row defined as

$$(\underbrace{h_j, \ldots, h_j}_{j \text{ times}},\ -j h_j,\ \underbrace{0, \ldots, 0}_{m-j-1 \text{ times}}), \qquad h_j = -\{j(j+1)\}^{-1/2},$$

(page 49, Dryden and Mardia (2016)). Note that $H H^T = I_{m-1}$ and $H^T H = C$, where $C = I_m - \frac{1}{m} 1_m 1_m^T$ is the centering matrix, $I_m$ is the $m \times m$ identity matrix and $1_m$ is the $m$-vector of ones.
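The Helmert sub-matrix is easy to construct directly. The sketch below follows the standard definition in Dryden and Mardia (2016) and checks its two key properties numerically. The function name is an illustrative choice.

```python
import numpy as np

def helmert_submatrix(m):
    """(m-1) x m Helmert sub-matrix: orthonormal rows, each orthogonal to the ones vector."""
    H = np.zeros((m - 1, m))
    for j in range(1, m):
        h = -1.0 / np.sqrt(j * (j + 1))
        H[j - 1, :j] = h        # first j entries equal h_j
        H[j - 1, j] = -j * h    # next entry is -j * h_j, the rest stay zero
    return H

m = 4
H = helmert_submatrix(m)
C = np.eye(m) - np.ones((m, m)) / m  # centering matrix
```

Pre- and post-multiplying a matrix with centred rows and columns by `H` and its transpose gives an (m-1) x (m-1) matrix from which the original can be recovered, since the product of the transpose of `H` with `H` is the centering matrix.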

For the Euclidean power metric we define the inverse exponential map to the tangent space as

$$\exp_\nu^{-1}(X) = \mathrm{vech}(H (X - \nu) H^T), \quad (6)$$

where $\mathrm{vech}$ is the half vectorisation of a matrix including the diagonal. In this case the embedding space is actually Euclidean, with zero curvature, and analysis is unaffected by the choice of the pole $\nu$, and hence we can take $\nu = 0$.

For the Procrustes power metric we define the map to the tangent space as

(7)

where is the vectorise operator obtained from stacking the columns of a matrix, is the ordinary Procrustes match of X to (Dryden and Mardia, 2016, chapter 7) and is the ordinary Procrustes match from to . Note that the reflection size-and-shape space is a space with positive curvature (Kendall et al., 1999) and statistical analysis depends on the choice of . A sensible choice for is the sample Fréchet mean.

3.3 Projection

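The framework illustrated in Figure 2 involves a projection, $P_{\mathcal{L}}$, into the space of graph Laplacians $\mathcal{L}_m$. We seek a $P_{\mathcal{L}}$ that maps a matrix $Y$ to the “closest point” in $\mathcal{L}_m$. For the Euclidean and Procrustes power metrics, intuitive projections are

$$P_{\mathcal{L}}(Y) = \arg\inf_{L \in \mathcal{L}_m} d_\alpha(Y, L) \quad \text{and} \quad P_{\mathcal{L}}(Y) = \arg\inf_{L \in \mathcal{L}_m} d_{\alpha,S}(Y, L). \quad (8)$$

It is desirable that the optimisation involved in computing the projection is convex, since in a convex optimisation problem any local minimum is a global minimum, and the minimiser is unique when the objective is strictly convex (Rockafellar, 1993).

Result 1.

For the Euclidean power metric with $\alpha = 1$, the projection of $Y = (y_{ij})$ can be found by solving a convex optimisation problem with a unique solution, by minimising

$$\| Y - L \|^2 = \sum_{i=1}^m \sum_{j=1}^m (y_{ij} - l_{ij})^2 \quad \text{subject to} \quad l_{ij} = l_{ji}, \quad l_{ij} \le 0 \ (i \ne j), \quad \sum_{j=1}^m l_{ij} = 0. \quad (9)$$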

It is immediately clear that this is a convex optimization problem since the objective function is quadratic with Hessian , which is strictly positive definite, and the constraints are linear. The unique global solution can be found using quadratic programming, and so for if then .

Note that the choice of metric for projection does not need to be the same as the choice of metric for estimation. As the projection for the Euclidean power metric with $\alpha = 1$ involves convex optimisation, we will use it throughout for all our metrics. For $\alpha \ne 1$ the optimisation is not in general convex. To implement this projection we can, for example, use either the CVXR (Fu et al., 2018) or rosqp (Anderson, 2018) packages in R (R Core Team, 2018) to solve the optimisation, and rosqp is particularly fast even for $m = 1000$.
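To illustrate the convex projection without a quadratic programming package, the sketch below swaps in a simple projected gradient method. It is not the CVXR or rosqp approach named above. It assumes the Euclidean case with power 1 and parametrises a graph Laplacian by its non-negative edge weights, so that the only constraint left is non-negativity. The function names and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

def laplacian_from_weights(W):
    """Graph Laplacian D - W for a symmetric weight matrix W with zero diagonal."""
    return np.diag(W.sum(axis=1)) - W

def project_to_laplacians(Y, n_iter=2000):
    """Closest graph Laplacian to Y in Frobenius norm, by projected gradient descent."""
    Y = np.asarray(Y, dtype=float)
    Y = (Y + Y.T) / 2.0                    # work with the symmetric part
    m = Y.shape[0]
    W = np.zeros((m, m))
    step = 1.0 / (4.0 * m)                 # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        R = Y - laplacian_from_weights(W)  # current residual
        d = np.diag(R)
        # <R, (e_i - e_j)(e_i - e_j)^T> for every pair (i, j)
        inner = d[:, None] + d[None, :] - 2.0 * R
        # gradient of ||Y - L(W)||^2 w.r.t. w_ij is -2 * inner_ij; then clip at zero
        W = np.maximum(W + step * 2.0 * inner, 0.0)
        np.fill_diagonal(W, 0.0)
    return laplacian_from_weights(W)

# Toy example: project a symmetric matrix that is not a graph Laplacian
Y = np.array([[2.0, 0.5, -1.0],
              [0.5, 1.0, -1.0],
              [-1.0, -1.0, 2.0]])
P = project_to_laplacians(Y)
```

The result always has zero row sums and non-positive off-diagonal entries by construction. For large node sets a dedicated quadratic programming solver would be expected to be considerably faster than this sketch.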

3.4 Means

There are two main types of means on a manifold, the intrinsic mean and extrinsic mean (Dryden and Mardia, 2016, Chapter 6). We define the mean in the graph Laplacian space using extrinsic means, although the mean when the Euclidean power distance with $\alpha = 1$ is used is in fact an intrinsic mean.

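We define the population mean for graph Laplacians as

$$\mu = P_{\mathcal{L}}(G_\alpha(\eta)), \qquad \eta = \arg\inf_{u} E[d^2(F_\alpha(L), u)], \quad (10)$$

assuming $\mu$ exists, and the sample mean for a set of graph Laplacians $L_1, \ldots, L_n$ as

$$\hat\mu = P_{\mathcal{L}}(G_\alpha(\hat\eta)), \qquad \hat\eta = \arg\inf_{u} \frac{1}{n} \sum_{k=1}^n d^2(F_\alpha(L_k), u), \quad (11)$$

where $F_\alpha$ is the embedding in (5), $G_\alpha$ is the reverse map, $P_{\mathcal{L}}$ is the projection of Section 3.3, the infimum is over the embedding space and $d$ is the Euclidean or Procrustes distance in the embedding space, as appropriate. For the Euclidean power distance we have

$$\eta = E[F_\alpha(L)], \qquad \hat\eta = \frac{1}{n} \sum_{k=1}^n F_\alpha(L_k),$$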
and are unique in this case. For the Procrustes power distance and may be sets, and the conditions for uniqueness rely on the curvature of the space (Le, 1995). In particular the support of the distribution is a geodesic ball such that is regular. We will assume uniqueness exists throughout. For the Euclidean power metric when , we have and the mean is a Fréchet intrinsic mean (Fréchet, 1948; Ginestet et al., 2017) in this case.

Result 2.

Let $L_1, \ldots, L_n$ be a random sample of i.i.d. observations from a distribution with population mean $\mu$ in (10). For the power Euclidean distance $d_\alpha$, the estimator $\hat\mu$, in (11), is a consistent estimator of $\mu$.

The proof of this result can be found in Appendix B. Note that a similar result holds for the Procrustes power distance $d_{\alpha,S}$, where stronger conditions for consistency are given in Bhattacharya and Patrangenaru (2003), but the same projection argument used in the proof for $d_\alpha$ holds.
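A hedged sketch of the sample mean for the Euclidean power metric is given below. It assumes the mean is obtained by embedding each graph Laplacian with a matrix power, averaging in the embedding space, reversing the power, and projecting to the space of graph Laplacians. The projection uses a simple projected gradient method rather than a quadratic programming package, and all function names are illustrative.

```python
import numpy as np

def psd_power(S, p):
    """S^p for a symmetric matrix, truncating negative eigenvalues at zero first."""
    vals, vecs = np.linalg.eigh((S + S.T) / 2.0)
    vals = np.clip(vals, 0.0, None)
    return (vecs * vals**p) @ vecs.T

def project_to_laplacians(Y, n_iter=3000):
    """Closest graph Laplacian in Frobenius norm (projected gradient on edge weights)."""
    Y = (Y + Y.T) / 2.0
    m = Y.shape[0]
    W = np.zeros((m, m))
    for _ in range(n_iter):
        R = Y - (np.diag(W.sum(axis=1)) - W)
        d = np.diag(R)
        W = np.maximum(W + (d[:, None] + d[None, :] - 2.0 * R) / (2.0 * m), 0.0)
        np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W

def power_euclidean_mean(laplacians, alpha=0.5):
    """Embed, average, reverse the power, then project."""
    embedded = [psd_power(L, alpha) for L in laplacians]
    eta_hat = np.mean(embedded, axis=0)          # arithmetic mean in embedding space
    reversed_mean = psd_power(eta_hat, 1.0 / alpha)
    return project_to_laplacians(reversed_mean)

# Toy sample of two graph Laplacians on three nodes (hypothetical weights)
L1 = np.array([[2.0, -2.0, 0.0], [-2.0, 3.0, -1.0], [0.0, -1.0, 1.0]])
L2 = np.array([[1.0, -1.0, 0.0], [-1.0, 4.0, -3.0], [0.0, -3.0, 3.0]])
mean_euclid = power_euclidean_mean([L1, L2], alpha=1.0)
mean_sqrt = power_euclidean_mean([L1, L2], alpha=0.5)
```

With power 1 the projection step leaves the arithmetic mean unchanged, because an average of graph Laplacians is itself a graph Laplacian. For other powers the reversed mean can have positive off-diagonal entries, which is where the projection matters.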

Figure 4: The means of (a) Austen’s novels and (b) Dickens’ novels using the Euclidean metric $d_1$ based on the top $m = 1000$ word pairs. In (c) we see edges present in the Austen mean but not Dickens and in (d) the edges present in Dickens and not Austen means. Zoom in for more detail.

Figure 4 shows an illustration of the sample means for (a) Austen and (b) Dickens novels using , with the 1000 words arranged in a grid and edges drawn between words which co-occur with adjacency weight at least of the sum of the nodes. Plots for the square root Euclidean and Procrustes metric, which are not shown, are visually similar to those for the Euclidean mean. Plots (a) and (b) are very similar, perhaps unsurprisingly as approximately half of the words in each novel are represented by the first 50 words. Figure (c) shows edges present in the Austen mean but not in the Dickens mean, and (d) the edges present in the Dickens mean but not in the Austen mean, to highlight the differences between the two networks. These illustrate more co-occurrences of she, her by Austen and the, his, don’t by Dickens, among many others. These plots are drawn using the program Cytoscape (Shannon et al., 2003) and more detail can be seen by magnifying the view to a large extent. We shall explore the differences in more detail later in Section 4.5.

3.5 Interpolation and extrapolation

We now consider an interpolation path, with $c$ being the position along the path, $0 \le c \le 1$, between the graph Laplacians at $c = 0$ and $c = 1$. For $c < 0$ and $c > 1$ the path is extrapolating from the graph Laplacians at $c = 0$ and $c = 1$. The interpolation and extrapolation path between graph Laplacians for each metric is defined by first finding the geodesic path in the embedding space between the embedded graph Laplacians, which is then projected to the space of graph Laplacians.

The minimal geodesic passing through and is

(12)

For the Euclidean power this simplifies to

(13)
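For the Euclidean power metric the embedding space is flat, so the geodesic between two embedded graph Laplacians is a straight line. A hedged reading of this case is sketched below. The labels $L_1$, $L_2$, $Q(c)$ and $L(c)$ are notation assumed here for illustration, with $F_\alpha$ the embedding, $G_\alpha$ the reverse map and $P_{\mathcal{L}}$ the projection to the space of graph Laplacians.

```latex
Q(c) = (1 - c)\, F_\alpha(L_1) + c\, F_\alpha(L_2), \qquad
L(c) = P_{\mathcal{L}}\bigl( G_\alpha( Q(c) ) \bigr), \qquad c \in \mathbb{R}.
```

Under this reading, $c = 0$ and $c = 1$ return $L_1$ and $L_2$ respectively, values $0 < c < 1$ interpolate, and values outside $[0, 1]$ extrapolate. Outside $[0, 1]$ the matrix $Q(c)$ need not be positive semi-definite, which is why the reverse map and the projection are both needed.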

Figure 5 shows the interpolation and extrapolation paths, for the 25 nodes corresponding to the most frequent words, out of $m = 1000$ nodes, between the mean Austen and Dickens novels, when using . At the feminine words have larger degrees and their edges have larger weights , for example her to to, of and she to to. For the nodes for she and her are actually removed, indicating they have degree 0, which is further evidence of the fact that Austen used female words more than Dickens.

Figure 5: Interpolation () and extrapolation () networks between Dickens’ and Austen’s mean novels using . The top 25 words are displayed where the mean novels for the authors are estimated using and .

4 Further inference

4.1 Principal component analysis

There are several generalisations of PCA to manifold data, and the following approach is similar to Fletcher et al. (2004) in computing PCA in the tangent space and projecting back to the manifold. Earlier approaches to PCA in tangent spaces in shape analysis include Kent (1994) and Cootes et al. (1994).

Let , where for either the Euclidean or Procrustes power metric, then is an estimated covariance matrix. Suppose S is of rank with non-zero eigenvalues , then the corresponding eigenvectors are the principal components (PCs) in the tangent space, and the PC scores are

(14)

The path of the th PC in is

(15)

When for the Euclidean case when is chosen, the importance of the th word in the principal component is given by

(16)
Figure 6: Plot of PC 1 and PC 2 scores for the Austen and Dickens novels, coloured in time order (red to violet) with extrinsic regression lines for Dickens novels (blue) and Austen novels (red) using the (a) Euclidean and (b) square root Euclidean metric.
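Tangent space PCA for the Euclidean power metric can be sketched as below. This is an illustrative simplification under stated assumptions: the pole is taken to be zero, tangent coordinates are the upper triangle (including the diagonal) of the Helmertised embedded matrices, the data are centred at their sample mean, and the scores are left unnormalised. None of these choices is confirmed as the authors' exact formulation, and the function names are hypothetical.

```python
import numpy as np

def helmert_submatrix(m):
    H = np.zeros((m - 1, m))
    for j in range(1, m):
        h = -1.0 / np.sqrt(j * (j + 1))
        H[j - 1, :j] = h
        H[j - 1, j] = -j * h
    return H

def psd_power(S, p):
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.clip(vals, 0.0, None)**p) @ vecs.T

def tangent_coordinates(laplacians, alpha=1.0):
    """One row per network: half vectorisation of H L^alpha H^T."""
    m = laplacians[0].shape[0]
    H = helmert_submatrix(m)
    rows, cols = np.triu_indices(m - 1)
    return np.array([(H @ psd_power(L, alpha) @ H.T)[rows, cols] for L in laplacians])

def tangent_pca(V):
    """Eigenvalues, principal components (rows) and scores of centred tangent vectors."""
    centred = V - V.mean(axis=0)
    _, s, components = np.linalg.svd(centred, full_matrices=False)
    eigenvalues = s**2 / V.shape[0]   # eigenvalues of the sample covariance matrix
    scores = centred @ components.T
    return eigenvalues, components, scores

# Toy sample: six random graph Laplacians on four nodes
rng = np.random.default_rng(0)
sample = []
for _ in range(6):
    W = np.triu(rng.random((4, 4)), 1)
    W = W + W.T
    sample.append(np.diag(W.sum(axis=1)) - W)
V = tangent_coordinates(sample, alpha=0.5)
eigenvalues, components, scores = tangent_pca(V)
explained = eigenvalues / eigenvalues.sum()
```

The proportions in `explained` correspond to the kind of "variance explained" figures quoted for PC 1 and PC 2. Moving along a principal component and mapping back to a network would additionally need the reverse map and the projection.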

We now apply the methods of PCA to the Austen and Dickens text data, for $m = 1000$. The first and second PC scores are plotted in Figure 6 for the Euclidean and square root Euclidean metric. The Procrustes metric is not included as it gave visually identical results to the square root Euclidean. The extrinsic regression lines are included, which we will define and explain below. The percentage of variance explained by PC 1, and by PC 1 and 2 together, was 49 and 70, 37 and 46, and 37 and 46 for the Euclidean, square root Euclidean and Procrustes size-and-shape metrics respectively. A benefit of the square root Euclidean metric is clear here, as it separates the Austen and Dickens novels with a large gap on PC1, whereas David Copperfield (DC) and Persuasion (PE) are very close in PC1 for the Euclidean metric. We now analyse the Euclidean PCs in more detail.

Figure 7: The importance of each word given by (16) in (left) PC 1 and (right) PC 2. The red bar represents the importance of each word in the difference of means, , given by , .

Figure 7 contains plots representing the importance and sign of each word in the first and second Euclidean PC. From Figure 6, a more positive PC 1 score is indicative of an Austen novel whilst a more negative one is indicative of a Dickens novel. For a positive PC1 score the nodes her and she have importance, whilst for a negative score words such as his and he have more importance, which is expected as Austen writes with more female characters. The second PC is actually similar to a fitted regression line, which we describe in the next section. An interesting point to note is that the Austen novels have the second PC increasing over time, as Lady Susan (LS) and Persuasion (PE) are her earliest and latest novels respectively. This is the opposite to Dickens, where PC2 decreases with time: The Pickwick Papers (PP) is Dickens’ earliest novel and The Mystery of Edwin Drood (ED) his latest. The second PC has feminine words like her and she as the most positive words, but more first and second person words, such as I, my and you, as negative words. This is consistent with Austen increasingly using a stylistic device called free indirect speech in her later novels (Shaw, 1990). Free indirect speech has the property that third person pronouns, such as she and her, are used instead of first person pronouns, such as I and my.

4.2 Regression

Here we assume the data are the pairs , for in which the are graph Laplacians to be regressed on covariate vectors , and consider the regression error model

where is the operator but with multiplying the terms corresponding to the off-diagonal. In general has a large number of elements, so in practice it is necessary to restrict to be diagonal or even isotropic, .

When using the power Euclidean metric we take and the parameters in (4.2) are the least squares solution

(17)

and the fitted values are

(18)

and so predicts a graph Laplacian with covariates . A similar model can be used for the Procrustes power metric but with . The optimisation in (18) is convex and the parameters of the regression line are found using the standard least squares approach in the tangent space. This optimisation reduces element-wise for , to independent optimisations.
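The element-wise least squares step can be sketched as below. This is a minimal illustration that assumes the responses have already been mapped to tangent space vectors (one row per network) and that there is a single covariate such as the year written. The function names and toy numbers are hypothetical, and mapping a fitted vector back to a graph Laplacian (reverse map and projection) is not shown.

```python
import numpy as np

def fit_tangent_regression(V, x):
    """Least squares fit of every tangent coordinate on an intercept and one covariate."""
    V = np.asarray(V, dtype=float)
    X = np.column_stack([np.ones(len(x)), np.asarray(x, dtype=float)])
    coef, *_ = np.linalg.lstsq(X, V, rcond=None)  # solves all coordinates at once
    return coef                                    # row 0: intercepts, row 1: slopes

def predict_tangent(coef, x_new):
    """Fitted tangent vector at covariate value x_new."""
    return coef[0] + x_new * coef[1]

# Toy data: three tangent coordinates that change linearly with the year
years = np.array([1836.0, 1840.0, 1849.0, 1860.0, 1870.0])
true_intercept = np.array([0.5, -0.2, 0.1])
true_slope = np.array([0.001, 0.0, -0.002])
V = true_intercept + np.outer(years - 1850.0, true_slope)
coef = fit_tangent_regression(V, years - 1850.0)   # centring the years aids conditioning
fitted_1855 = predict_tangent(coef, 1855.0 - 1850.0)
```

Because the least squares problem separates across coordinates, a single call handles every element, which reflects the reduction to independent element-wise optimisations described above.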

A test for the significance of covariate involves the hypotheses and . By Wilks’ Theorem (Wilks, 1962), if

is true then the likelihood ratio test statistic is

(19)

approximately when is large, where and is the log-likelihood function of under the distribution from (4.2). We assume is a diagonal matrix. Using equation (19) is rejected in favour of at the significance level if is greater than the quantile of .

For the Austen and Dickens data, each novel, represented by a graph Laplacian is paired with the year, , the novel was written. We regress the on the using the method above with for each author. To visualise the regression lines in Figure 6 we find for many values of for the specific metrics, and project these to the PC1 and PC2 space. For each metric the regression lines seem to fit the data well, and could be used to see how writing styles have changed over time. When the test for regression was performed on the novels the p-values were extremely small () for both the Austen and Dickens regression lines, for both the Euclidean and square root Euclidean metrics. Hence there is very strong evidence to believe that the writing style of both authors changes with time, regardless of which metric we choose.

4.3 A central limit theorem

Consider independent random samples where have a distribution with mean . As the extrinsic mean is based on the arithmetic mean for the power Euclidean metrics, a central limit holds for the sample mean graph Laplacian, under the condition var is finite.

Result 3.

For any power Euclidean metric

as , where and recall is the vech operator but with multiplying the terms corresponding to the off-diagonal, and is a finite variance matrix.

When $\alpha = 1$ this result is similar to that in Ginestet et al. (2017), although they work directly in the space of graph Laplacians whereas we work in the embedding space.

4.4 Hypothesis tests

Consider two populations and of graph Laplacians with corresponding population means and defined in (10) . Given two samples and respectively from and , the goal is to test the hypotheses

We define the test statistic as , where and are defined by in (11) for the sets and respectively and using a suitable metric. Any Euclidean or Procrustes power metric is suitable to use, we however will just consider the Euclidean ; the square root Euclidean ; and the Procrustes size-and-shape , where the subscripts refer to whether the Euclidean, square root or Procrustes size-and-shape means have been used, respectively.

Using Result 3 the distribution of the test statistics for large is given as follows.

Result 4.

Consider independent random samples of networks of size and

. For the power Euclidean metrics under the null hypothesis,

: , as , such that :

(20)

in which each is independent and are the non-zero eigenvalues of .

For the Procrustes power metric similar central limit theorem results follow providing the more stringent conditions of Bhattacharya and Patrangenaru (2005) hold. In practice needs to be estimated, which can be very high dimensional. In our application with this is a symmetric matrix with parameters where . One approach is to use the shrinkage estimator from Schäfer and Strimmer (2005), as employed by Ginestet et al. (2017), but this is impractical for our application with . If we assume a diagonal matrix then the

correspond to the variances of individual components of the difference in means, and these can be estimated consistently from method of moments estimators. A further very simple model would be to have an isotropic covariance matrix with covariance matrix

, which requires estimation of a single variance parameter . Note that the likelihood ratio test for regression with test statistic in Section 4.2 gives an alternative test for equality of means when the covariates are group labels, but the additional assumption of normality for the observations needs to be made in that case.

An alternative non-parametric test, which does not depend on large sample asymptotics is a random permutation test, similar to Preston and Wood (2010) as follows.

1:  Calculate the test statistics between and , given by .
2:  Generate random sets and of size and respectively, by randomly sampling without replacement from .
3:  Compute the test statistic of sets and , given by .
4:  Repeat steps 2 and 3 times, to give test statistics .
5:  Order the test statistics .
6:  Calculate the p-value, which is for the minimum satisfying
, unless , in which case the p-value is 1 or if , in which case the p-value is 0.
Algorithm 1 Random permutation test to test the equality of means for two sets of graph Laplacians, and , using the test statistic .
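The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the test statistic is taken to be the squared Frobenius distance between the two arithmetic sample means (the Euclidean case with power 1, where no projection is needed), and the p-value is computed as the proportion of permuted statistics at least as large as the observed one, which gives zero when every permuted value is smaller. Function names and the toy data are hypothetical.

```python
import numpy as np

def mean_difference_statistic(A, B):
    """Squared Frobenius distance between the arithmetic means of two samples."""
    return float(np.sum((A.mean(axis=0) - B.mean(axis=0))**2))

def permutation_test(A, B, n_perm=199, seed=0):
    """Random permutation test for equality of means of two samples of matrices."""
    rng = np.random.default_rng(seed)
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    pooled = np.concatenate([A, B])
    n_a = len(A)
    observed = mean_difference_statistic(A, B)           # step 1
    count = 0
    for _ in range(n_perm):                              # step 4
        idx = rng.permutation(len(pooled))               # step 2: sample without replacement
        t = mean_difference_statistic(pooled[idx[:n_a]], pooled[idx[n_a:]])  # step 3
        if t >= observed:
            count += 1
    return observed, count / n_perm                      # steps 5 and 6

# Toy samples of 3 x 3 matrices with clearly different means
rng = np.random.default_rng(1)
A = rng.normal(0.0, 0.1, size=(5, 3, 3))
B = rng.normal(3.0, 0.1, size=(6, 3, 3))
observed, p_value = permutation_test(A, B, n_perm=199)
```

For other metrics the statistic would be replaced by the squared distance between the corresponding sample means, each of which involves the embedding, reverse map and projection.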

A limitation of using the permutation test is it assumes exchangeability of the observations under the null hypothesis (Amaral, Dryden and Wood, 2007). This means under the null hypothesis the populations and are assumed identical. A test based on the bootstrap is an alternative possibility, which requires weaker assumptions about and , see for example Amaral, Dryden and Wood (2007).

For the Austen and Dickens data have test statistics , , . We compute the p-value from the permutation test with permutations for each of and in each case all permuted values were less than the observed statistics for the data. Hence, in each case the estimated p-value is zero, indicating very strong evidence for a difference in mean graph Laplacian.

4.5 Exploring differences between authors

Given that the Austen and Dickens novels are significantly different in mean we would like to explore how they differ. In particular we examine the off-diagonal elements of

, i.e. the differences in the mean weighted adjacency matrix, and compare them to appropriate measures of standard error of the differences using a

-statistic. The histograms of the off-diagonal individual graph Laplacians are heavy tailed, and a plot of sample standard deviations versus sample means show an overall average linear increase with approximate slope

, but with a large spread. We shall use this relationship in a regularised estimate of our choice of standard error.

For a particular co-occurrence pair of words we have weighted adjacency values and with sample means and , and sample standard deviations and . For our analysis here we use the Euclidean mean graph Laplacians. We estimate the variance in our sample with a weighted average of the sample variance and an estimate based on the linear relationship between the mean and standard deviation, and in particular the population pooled variance is estimated by

where the weights are taken as , where we take . Note that if all values in one of the samples are (due to no word co-occurrence pairings in any of that author’s books) then we drop that word pairing from further analysis, as we are only interested in the relative usage of the word occurrences that are actually used by both authors. A univariate -statistic for comparing adjacencies is then

(21)

where we include the regularizing offset to avoid highlighting very small differences in mean adjacency with very small standard errors. The value for is chosen as the median of all values under consideration.

Figure 8: Networks displaying the top 100 pairs of words ranked according to the -statistic in (21), with more prominent co-occurrences used by Austen (left, in blue) and the more prominent co-occurrences used by Dickens for (right, in yellow).

The exploratory graphical displays in Figure 8 illuminate striking differences between the novelists. For Austen there are very common pairings of words with her, she, herself, which form important hubs in this network. Austen also pairs these hubs with more emotional words feelings, felt, feel, kindness, happiness, affection, pleasure and stronger words power, attention, must, certainly, advantage and opinion. Also we see more use of letter in Austen, which is a literary device often used by the author. For Dickens there are more common uses of abbreviations, especially don’t which is an important hub, and also it’s, i’ll and that’s. In contrast the Austen network highlights not. Dickens also more prominently pairs body parts arm, arms, eyes, feet, hair, hand, hands, head, mouth, face, shoulder, legs in combination with the strong hubs his and the. These hubs are also paired with other objects, such as door, chair, glass. Finally, Dickens has the more prominent use of pairs with a sombre word, such as dark, black and dead, which might have been expected.

5 Conclusion

We have developed a general framework for extrinsic statistical analysis of graph Laplacians and considered in particular the distances , and . Other metrics fit in our framework and could be considered. One example is the log metric used in Bakker, Halappanavar and Sathanur (2018), which uses the embedding and it is easy to see ) where we define in and is the rank of L. The metric is then . The log embedding is the limit of the Box-Cox transform, , when ; the reverse is given as .
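To illustrate the relationship between the log embedding and the Box-Cox transform, the following Python sketch applies a spectral transform to a small graph Laplacian. It rests on assumptions that are not confirmed by the text: that the transform acts only on the non-zero eigenvalues (one plausible reading of the reference to the rank of L), and that the Box-Cox transform has its standard form (x^λ − 1)/λ. The function name `spectral_boxcox` and the tolerance `tol` are hypothetical.

```python
import numpy as np

def spectral_boxcox(L, lam, tol=1e-10):
    """Apply log (lam == 0) or Box-Cox (lam != 0) to the non-zero
    eigenvalues of a symmetric positive semi-definite matrix L.

    Assumption: zero eigenvalues are left out of the embedding entirely.
    """
    L = np.asarray(L, dtype=float)
    vals, vecs = np.linalg.eigh(L)
    pos = vals > tol                      # keep the rank(L) positive eigenvalues
    x = vals[pos]
    fx = np.log(x) if lam == 0 else (x ** lam - 1.0) / lam
    U = vecs[:, pos]
    return (U * fx) @ U.T                 # U diag(fx) U^T

# Laplacian of a weighted path graph on three nodes (edge weights 1 and 2).
L = np.array([[ 1.0, -1.0,  0.0],
              [-1.0,  3.0, -2.0],
              [ 0.0, -2.0,  2.0]])

log_embedding = spectral_boxcox(L, 0.0)
near_limit = spectral_boxcox(L, 1e-6)     # Box-Cox with a very small exponent
gap = np.max(np.abs(log_embedding - near_limit))
```

Here `gap` is small because (x^λ − 1)/λ tends to log x as λ tends to zero for every positive eigenvalue x.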

Another metric to consider is the element-wise metric of the form . Of particular interest would be comparing , which is the Frobenius/Euclidean norm , with which can be similar to the square root norm (and is identical for diagonal matrices).
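The remark about diagonal matrices can be checked numerically. The sketch below is only indicative, since the exact form of the element-wise metric is not reproduced here: it assumes a signed element-wise power, sign(x)|x|^α, followed by the Frobenius norm, and assumes the two exponents being compared are 1 and 1/2. The function names are hypothetical.

```python
import numpy as np

def signed_power(M, alpha):
    """Element-wise signed power sign(x) * |x|**alpha (assumed form)."""
    M = np.asarray(M, dtype=float)
    return np.sign(M) * np.abs(M) ** alpha

def elementwise_distance(L1, L2, alpha):
    """Frobenius norm of the difference of element-wise powers."""
    return np.linalg.norm(signed_power(L1, alpha) - signed_power(L2, alpha), ord="fro")

def sqrt_distance(L1, L2):
    """Frobenius distance between symmetric PSD matrix square roots."""
    def psd_sqrt(M):
        vals, vecs = np.linalg.eigh(np.asarray(M, dtype=float))
        vals = np.clip(vals, 0.0, None)   # guard against tiny negative eigenvalues
        return (vecs * np.sqrt(vals)) @ vecs.T
    return np.linalg.norm(psd_sqrt(L1) - psd_sqrt(L2), ord="fro")

# For diagonal matrices the element-wise and matrix square roots coincide.
D1 = np.diag([1.0, 4.0, 9.0])
D2 = np.diag([4.0, 1.0, 0.0])
d_half = elementwise_distance(D1, D2, 0.5)
d_sqrt = sqrt_distance(D1, D2)
d_one = elementwise_distance(D1, D2, 1.0)   # ordinary Frobenius distance
```

For matrices with non-zero off-diagonal entries, such as graph Laplacians, `d_half` and `d_sqrt` will generally differ.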

Our methodology gives appropriate results for comparing co-occurrence networks for Jane Austen and Charles Dickens novels, but the methodology is widely applicable, for example to neuroimaging networks and social networks, and such applications will be explored in further work.

References

  • Amaral, G. J. A., Dryden, I. L. and Wood, A. T. A. (2007). Pivotal Bootstrap Methods for k-Sample Problems in Directional Statistics and Shape Analysis. Journal of the American Statistical Association 102 695–707. doi:10.1198/016214506000001400
  • Anderson, E. (2018). rosqp: Quadratic Programming Solver using the OSQP Library. R package version 0.1.0.
  • Bakker, C., Halappanavar, M. and Sathanur, A. V. (2018). Dynamic graphs, community detection, and Riemannian geometry. Applied Network Science 3 3.
  • Bhattacharya, R. and Patrangenaru, V. (2003). Large sample theory of intrinsic and extrinsic sample means on manifolds. Ann. Statist. 31 1–29. doi:10.1214/aos/1046294456
  • Bhattacharya, R. and Patrangenaru, V. (2005). Large sample theory of intrinsic and extrinsic sample means on manifolds—II. Ann. Statist. 33 1225–1259. doi:10.1214/009053605000000093
  • Cootes, T. F., Taylor, C. J., Cooper, D. H. and Graham, J. (1994). Image search using flexible shape models generated from sets of examples. In Statistics and Images: Vol. 2 (K. V. Mardia, ed.) 111–139. Carfax, Oxford.
  • De Klerk, E. (2006). Aspects of semidefinite programming: interior point algorithms and selected applications 65. Springer Science & Business Media.
  • Dryden, I. L. (2018). shapes package. R Foundation for Statistical Computing, Vienna, Austria. Contributed package, Version 1.2.4.
  • Dryden, I. L., Koloydenko, A. and Zhou, D. (2009). Non-Euclidean Statistics for Covariance Matrices, with Applications to Diffusion Tensor Imaging. The Annals of Applied Statistics 3 1102–1123.
  • Dryden, I. L. and Mardia, K. V. (2016). Statistical shape analysis with applications in R, second ed. Wiley Series in Probability and Statistics. John Wiley & Sons, Ltd., Chichester. doi:10.1002/9781119072492. MR3559734
  • Evert, S. (2008). Corpora and collocations. Corpus linguistics. An international handbook 2 1212–1248.
  • Fletcher, P. T., Lu, C., Pizer, S. M. and Joshi, S. (2004). Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE transactions on medical imaging 23 995–1005.
  • Fréchet, M. (1948). Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l’institut Henri Poincaré 10 215–310.
  • Fu, A., Narasimhan, B., Diamond, S. and Miller, J. (2018). CVXR: Disciplined Convex Optimization. R package version 0.99.
  • Ginestet, C. E., Li, J., Balachandran, P., Rosenberg, S., Kolaczyk, E. D. et al. (2017). Hypothesis testing for network data in functional neuroimaging. The Annals of Applied Statistics 11 725–750.
  • Higham, N. J. (1988). Computing a nearest symmetric positive semidefinite matrix. Linear algebra and its applications 103 103–118.
  • Charles Dickens Info (2018). Charles Dickens Timeline. https://www.charlesdickensinfo.com/life/timeline/, Last accessed on 2018-11-12.
  • Joyce, D. (2009). On manifolds with corners. arXiv preprint arXiv:0910.3518.
  • Kendall, D. G., Barden, D., Carne, T. K. and Le, H. (1999). Shape and Shape Theory. Wiley, Chichester.
  • Kent, J. T. (1994). The complex Bingham distribution and shape analysis. Journal of the Royal Statistical Society. Series B (Methodological) 285–299.
  • Kolaczyk, E. D. (2009). Statistical analysis of network data: methods and models. Springer Science & Business Media.
  • Le, H. (1995). Mean Size-and-Shapes and Mean Shapes: A Geometric Point of View. Advances in Applied Probability 27 44–55.
  • Mahlberg, M., Stockwell, P., de Joode, J., Smith, C. and O’Donnell, M. B. (2016). CLiC Dickens: novel uses of concordances for the integration of corpus stylistics and cognitive poetics. Corpora 11 433–463. doi:10.3366/cor.2016.0102
  • The Jane Austen Society of North America (2018). Jane Austen’s Works. http://jasna.org/austen/works/, Last accessed on 2018-11-12.
  • Phillips, M. K. (1983). Lexical Macrostructure in Science Text. University of Birmingham.
  • Preston, S. P. and Wood, A. T. A. (2010). Two-Sample Bootstrap Hypothesis Tests for Three-Dimensional Labelled Landmark Data. Scandinavian Journal of Statistics 37 568–587.
  • Rockafellar, R. T. (1993). Lagrange multipliers and optimality. SIAM review 35 183–238.
  • Schäfer, J. and Strimmer, K. (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical applications in genetics and molecular biology 4 Article 32.
  • Shannon, P., Markiel, A., Ozier, O., Baliga, N. S., Wang, J. T., Ramage, D., Amin, N., Schwikowski, B. and Ideker, T. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research 13 2498–2504.
  • Shaw, N. (1990). Free Indirect Speech and Jane Austen’s 1816 Revision of Northanger Abbey. Studies in English Literature, 1500-1900 30 591–601.
  • R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Ward, J. H. Jr. (1963). Hierarchical grouping to optimize an objective function. J. Amer. Statist. Assoc. 58 236–244. MR0148188 (26 #5696)
  • Wilks, S. S. (1962). Mathematical Statistics. Wiley, New York.

Appendix A Most common words

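| Word | Rank in all novels | Rank in Dickens novels | Rank in Austen novels |
|------|--------------------|------------------------|-----------------------|
| the  | 1  | 1  | 1  |
| and  | 2  | 2  | 3  |
| to   | 3  | 3  | 2  |
| of   | 4  | 4  | 4  |
| a    | 5  | 5  | 5  |
| i    | 6  | 6  | 7  |
| in   | 7  | 7  | 8  |
| that | 8  | 8  | 13 |
| it   | 9  | 11 | 10 |
| he   | 10 | 10 | 16 |
| his  | 11 | 9  | 20 |
| was  | 12 | 13 | 9  |
| you  | 13 | 12 | 15 |
| with | 14 | 14 | 21 |
| her  | 15 | 16 | 6  |
| as   | 16 | 15 | 18 |
| had  | 17 | 17 | 17 |
| for  | 18 | 20 | 19 |
| at   | 19 | 21 | 25 |
| mr   | 20 | 18 | 38 |
| not  | 21 | 26 | 12 |
| be   | 22 | 28 | 14 |
| she  | 23 | 31 | 11 |
| said | 24 | 19 | 58 |
| have | 25 | 25 | 23 |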
Table 2: The most common 25 words in the Austen and Dickens novels

Appendix B Proof for result 2

For an estimator to be consistent for a population mean , it must converge in probability to . Let be a sequence of estimates from a sample set . For this sequence to converge in probability to , we require that for any and any there exists a number such that for all , where .
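For reference, the standard definition of convergence in probability can be written out as follows. The notation is assumed for illustration and need not match the symbols used elsewhere in the paper: $\hat{\mu}_n$ denotes the estimate from a sample of size $n$, $\mu$ the population mean, and $d$ the distance on the space in which the estimates lie.

```latex
\[
  \hat{\mu}_n \xrightarrow{\;p\;} \mu
  \quad\Longleftrightarrow\quad
  \forall\, \epsilon > 0,\ \forall\, \delta > 0,\ \exists\, N \ \text{such that}\
  P\bigl( d(\hat{\mu}_n, \mu) > \epsilon \bigr) < \delta
  \ \ \text{for all } n > N .
\]
```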

We can see is a consistent estimator as it converges in probability to

from the law of large numbers, and so by the continuous mapping theorem

converges in probability to , as long as exists and is unique.

Figure 9: (a) Case 1; (b) Case 2.

We now need to show the convergence in probability holds when we project and to