InfiniteWalk: Deep Network Embeddings as Laplacian Embeddings with a Nonlinearity

05/29/2020 · by Sudhanshu Chanpuriya, et al.

The skip-gram model for learning word embeddings (Mikolov et al. 2013) has been widely popular, and DeepWalk (Perozzi et al. 2014), among other methods, has extended the model to learning node representations from networks. Recent work of Qiu et al. (2018) provides a closed-form expression for the DeepWalk objective, obviating the need for sampling for small datasets and improving accuracy. In these methods, the "window size" T within which words or nodes are considered to co-occur is a key hyperparameter. We study the objective in the limit as T goes to infinity, which allows us to simplify the expression of Qiu et al. We prove that this limiting objective corresponds to factoring a simple transformation of the pseudoinverse of the graph Laplacian, linking DeepWalk to extensive prior work in spectral graph embeddings. Further, we show that by applying a simple nonlinear entrywise transformation to this pseudoinverse, we recover a good approximation of the finite-T objective and embeddings that are competitive with those from DeepWalk and other skip-gram methods in multi-label classification. Surprisingly, we find that even simple binary thresholding of the Laplacian pseudoinverse is often competitive, suggesting that the core advancement of recent methods is a nonlinearity on top of the classical spectral embedding approach.


1. Introduction

Vertex embedding is the task of learning representations of graph vertices in a continuous vector space for use in downstream tasks, such as link prediction and vertex classification (Yan et al., 2006). The primary classical approach for this task is spectral embedding: vertices are represented by their corresponding entries in the eigenvectors associated with the smallest eigenvalues of the graph Laplacian. Spectral embedding methods include the Shi-Malik normalized cuts algorithm (Shi and Malik, 2000), Laplacian Eigenmaps (Belkin and Niyogi, 2003), and spectral partitioning methods for stochastic block models (McSherry, 2001). They also include many spectral clustering methods, which apply spectral embeddings to general datasets by first transforming them into a graph based on data point similarity (Ng et al., 2002).

The spectral embedding approach has recently been exceeded in predictive performance on many tasks by methods inspired by Word2vec (Mikolov et al., 2013), which performs the related word embedding task. Word2vec forms representations for words based on the frequency with which they co-occur with other words, called context words, within a fixed distance in natural text. The DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015), and node2vec (Grover and Leskovec, 2016) algorithms, among others, adapt this idea to network data. In particular, DeepWalk takes several random walks on the network, treating the vertices as words, and treating the walks as sequences of words in text. It has been shown in (Levy and Goldberg, 2014) that the representations learned by this approach implicitly form a low-dimensional factorization of a known matrix, which contains the pointwise mutual information (PMI) of co-occurrences between nodes in the random walks. Work by Qiu et al. (2018) gives a closed-form expression for this matrix and shows a connection to the normalized graph Laplacian. This motivates their NetMF algorithm, which performs a direct factorization, improving on the performance of DeepWalk on a number of tasks.

In this work, we consider DeepWalk in the limit as the window size $T$ goes to infinity. We derive a simple expression for the (suitably scaled) PMI matrix in this limit:

$$M_\infty \;=\; \mathrm{vol}(G)\, D^{-1/2} \mathcal{L}^{+} D^{-1/2} \;-\; \mathrm{vol}(G)\, D^{-1} \;+\; J, \qquad (1)$$

where $D$ is the degree matrix, $\mathrm{vol}(G)$ is the trace of $D$ (twice the number of edges in $G$), $\mathcal{L}$ is the normalized Laplacian (i.e., $\mathcal{L} = I - D^{-1/2} A D^{-1/2}$), and $J$ is the all-ones matrix. $\mathcal{L}^{+}$ is the pseudoinverse of $\mathcal{L}$. One can show that $D^{-1/2} \mathcal{L}^{+} D^{-1/2}$ is equal to the unnormalized Laplacian pseudoinverse $L^{+}$, the central object in classic spectral embeddings, up to a rank-2 component (see (6) in Section 3.3). Thus, $M_\infty$ is equal to $\mathrm{vol}(G)\, L^{+}$ plus at most a rank-3 component and a diagonal matrix.
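For concreteness, the following is a minimal NumPy sketch of how the matrix in (1) could be assembled from a dense adjacency matrix. The function and variable names are our own, and this is an illustration rather than a reference implementation.

```python
import numpy as np

def limiting_pmi_matrix(adj):
    """Sketch of the limiting matrix in equation (1) for a dense adjacency matrix
    of an undirected, connected, non-bipartite graph."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)                       # vertex degrees (diagonal of D)
    vol = deg.sum()                             # vol(G) = trace of D = twice the edge count
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Normalized Laplacian: I - D^{-1/2} A D^{-1/2}
    norm_lap = np.eye(adj.shape[0]) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    norm_lap_pinv = np.linalg.pinv(norm_lap)    # dense pseudoinverse (small graphs only)
    # M_inf = vol(G) D^{-1/2} L_norm^+ D^{-1/2} - vol(G) D^{-1} + J
    return (vol * d_inv_sqrt[:, None] * norm_lap_pinv * d_inv_sqrt[None, :]
            - np.diag(vol / deg)
            + np.ones_like(adj))
```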

Not surprisingly, embeddings formed directly using a low-dimensional factorization of $M_\infty$ itself perform poorly on downstream tasks compared to DeepWalk and other skip-gram methods. However, we show that an element-wise application of a linear function followed by a logarithm, in particular $x \mapsto \log(1 + x/T)$, applied to $M_\infty$ approximates the finite-$T$ PMI matrix. Embeddings formed by factorizing this transformed matrix are competitive with DeepWalk and nearly competitive with the NetMF method of Qiu et al. (2018), when evaluated on multi-label node classification tasks. We call our method InfiniteWalk.

Note that the window hyperparameter $T$ only appears in the entrywise nonlinearity in InfiniteWalk and not in the formula for $M_\infty$. This is perhaps surprising, as $T$ is a key hyperparameter in existing methods. Our results suggest that $T$'s importance lies largely in determining the shape of the nonlinearity applied. Since $M_\infty$ is closely related to the Laplacian pseudoinverse, the key difference between DeepWalk and classic spectral embedding seems to be the application of this nonlinearity.

In more detail, note that our results show that InfiniteWalk well approximates DeepWalk by forming a low-rank factorization of a nonlinear entrywise transformation of $L^{+}$. Classic spectral embedding and clustering methods (Shi and Malik, 2000; McSherry, 2001; Belkin and Niyogi, 2003) embed nodes using the eigenvectors corresponding to the smallest eigenvalues of the Laplacian $L$ (or a variant of this matrix), which correspond to the largest eigenvalues of $L^{+}$. Thus, these methods can be viewed as embedding nodes using an optimal low-dimensional factorization of $L^{+}$ (lying in the span of $L^{+}$'s top eigenvectors), without applying an entrywise nonlinearity.

Inspired by this observation, we simplify the idea of a nonlinear transformation of the Laplacian pseudoinverse even further: thresholding it to a binary matrix. In particular, we form the binary matrix $B$ with $B_{ij} = \mathbb{1}[L^{+}_{ij} \geq \tau_q]$, where $\tau_q$ is an element of $L^{+}$ itself at some hyperparameter quantile $q$ (e.g., the median or another chosen percentile element). Surprisingly, embeddings from factorizing this binary matrix are also competitive with DeepWalk and the method of Qiu et al. on many tasks.

Broadly, our results strengthen the theoretical connection between classical methods based on factorizing the graph Laplacian and more recent “deep” methods for vertex embedding. They suggest that these methods are not as different conceptually as they may seem at first glance.

2. Background and Related Work

We begin by surveying existing work on skip-gram-based network embeddings and their connections to matrix factorization, which our work builds on.

2.1. Skip-Gram

In word embedding models, words are typically treated as both words and contexts. A context is simply a word that appears within a window of length $T$ of another word. As formalized in (Goldberg and Levy, 2014), given a set of words $W$, contexts $C$ (typically $C = W$), and a dataset $\mathcal{D}$ of word-context co-occurrence pairs, the "skip-gram" model for training word and context embeddings $\vec{w}, \vec{c}$ (Mikolov et al., 2013) has the following log-likelihood objective:

$$\sum_{(w,c) \in \mathcal{D}} \log \frac{\exp(\vec{w} \cdot \vec{c})}{\sum_{c' \in C} \exp(\vec{w} \cdot \vec{c}\,')}.$$

We can see that the objective encourages co-occurring pairs to have similar embeddings with a large dot product $\vec{w} \cdot \vec{c}$. This exact objective is not used, as repeatedly evaluating the partition function is too computationally expensive; (Mikolov et al., 2013) proposes a substitute: skip-gram with negative sampling (SGNS). Here, an auxiliary 'negative' dataset consisting of random pairs not appearing in $\mathcal{D}$ is used. Pairs in this negative set are encouraged to have dissimilar embeddings with small $\vec{w} \cdot \vec{c}$.

2.2. Implicit PMI Matrix

Levy and Goldberg (2014) prove that SGNS implicitly factors a matrix whose entries give the pointwise mutual information (PMI) of occurrence of a word $w$ and occurrence of a context $c$. Given a dataset $\mathcal{D}$ of these co-occurrences, an element of this matrix is given by

$$\log\left(\frac{\#(w,c)\,|\mathcal{D}|}{\#(w)\,\#(c)}\right) - \log b,$$

where $b$ is the ratio of negative samples to positive samples. Their proof assumes that the dimensionality of the embedding is at least the cardinality of the word/context set; the analysis of Li et al. (2015) relaxes the assumptions under which this equivalence holds. Several works, including (Arora et al., 2016) and (Gittens et al., 2017), propose generative models for word-context co-occurrence to explain the effectiveness of PMI-based methods in linearizing semantic properties of words. More recently, the analysis of Allen et al. in (Allen et al., 2019) and (Allen and Hospedales, 2019) has provided explanations of this phenomenon based on geometric properties of the PMI matrix, without the strong assumptions required by the generative models. The extensive research and results on the skip-gram PMI matrix make it an intrinsically interesting object in representation learning.
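As a toy illustration of this object (our own sketch, not code from the cited works), the shifted PMI matrix can be computed directly from a word-by-context co-occurrence count matrix:

```python
import numpy as np

def shifted_pmi(cooccurrence, b=1.0):
    """Shifted PMI matrix from a word-by-context co-occurrence count matrix.

    Entry (w, c) is log( #(w,c) * |D| / (#(w) * #(c)) ) - log(b), where |D| is the
    total number of co-occurrence pairs and b is the negative-to-positive sampling
    ratio. Zero counts yield -inf entries, which SGNS-style methods typically clip
    (e.g., PPMI takes max(., 0)).
    """
    counts = np.asarray(cooccurrence, dtype=float)
    total = counts.sum()                        # |D|
    word_counts = counts.sum(axis=1)            # #(w)
    context_counts = counts.sum(axis=0)         # #(c)
    with np.errstate(divide="ignore"):
        pmi = np.log(counts * total) - np.log(np.outer(word_counts, context_counts))
    return pmi - np.log(b)
```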

2.3. Networks

The DeepWalk method (Perozzi et al., 2014) applies the SGNS model to networks, where the word and context sets are the nodes in the network, and the co-occurring pairs are node pairs that appear within a window of length $T$ hops in a set of fixed-length random walks run on the graph. Qiu et al. (2018) derived the following expression for the PMI matrix for DeepWalk, in the context of random walks on undirected, connected, and non-bipartite graphs: in the limit as the number of walks originating at each node and the walk length go to infinity, it approaches

$$M_T = \log\Big(\frac{\mathrm{vol}(G)}{b\,T}\Big(\sum_{r=1}^{T} P^{r}\Big) D^{-1}\Big), \qquad (2)$$

where $\log$ is an element-wise natural logarithm, $\mathrm{vol}(G)$ (the "volume" of the graph) is the sum of the degrees of all of the vertices, $P = D^{-1} A$ is the random walk transition matrix, and $b$ is the ratio of negative to positive samples.

Rather than sampling from random walks as in DeepWalk, the NetMF algorithm of Qiu et al. explicitly calculates and factors this matrix to produce embeddings, outperforming DeepWalk significantly on multi-label vertex classification tasks. For low $T$, NetMF computes the exact sum directly, whereas for high $T$, it computes a low-rank approximation via SVD of the symmetrized transition matrix $D^{-1/2} A D^{-1/2}$. While Qiu et al. analyze the effect of increasing $T$ on the spectrum of the resulting matrix, they do not pursue the $T \to \infty$ limit, stopping at $T = 10$ as in the original DeepWalk paper. We show that this limiting matrix is both meaningful and simple to express.
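For small graphs, the matrix in (2) can be computed directly by summing powers of the transition matrix, which mirrors the exact-sum route NetMF takes for low $T$. The sketch below is our own simplified rendering (using the max(., 1) truncation before the log, the PPMI-style choice discussed in Section 3.2), not the NetMF code:

```python
import numpy as np

def netmf_style_matrix(adj, T, b=1.0):
    """Dense sketch of the matrix in equation (2):
    log( vol(G)/(b*T) * (sum_{r=1}^T P^r) * D^{-1} ), log applied entrywise."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    vol = deg.sum()
    P = adj / deg[:, None]                      # random-walk transition matrix D^{-1} A
    partial_sum, P_power = np.zeros_like(P), np.eye(adj.shape[0])
    for _ in range(T):                          # accumulate P + P^2 + ... + P^T
        P_power = P_power @ P
        partial_sum += P_power
    M = (vol / (b * T)) * partial_sum / deg[None, :]   # right-multiply by D^{-1}
    return np.log(np.maximum(M, 1.0))           # PPMI-style truncation of the log at zero
```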

2.4. Other Approaches

Some other node embedding algorithms share significant similarities with DeepWalk. Qiu et al. (2018) showed the LINE method to also be implicit matrix factorization, though its algorithm is based on edge sampling rather than sampling from random walks. In particular, its factorized matrix is a special case of the DeepWalk matrix with $T = 1$. We include the performance of LINE in our empirical results. node2vec (Grover and Leskovec, 2016) is a generalization of DeepWalk which uses second-order random walks: the distribution of the next node in a node2vec walk depends on the current and preceding nodes, rather than only the current node as in DeepWalk. Hyperparameters allow the walk to approach BFS-like or DFS-like behavior as desired, which the authors assert extracts qualitatively different information about node similarities.

Several architectures for applying convolutional neural networks to network data in an end-to-end fashion have been developed in the past few years, including the graph convolutional networks (GCNs) of (Kipf and Welling, 2016) and (Defferrard et al., 2016), and some methods leverage these architectures to produce node embeddings: for example, Deep Graph Infomax (Veličković et al., 2018) uses GCNs to maximize a mutual information objective involving patches around nodes. Recent work from Wu et al. (2019) shows that much of the complexity of GCNs comes from components inspired by other forms of deep learning that have limited utility for network data. In the same way, we seek to further the investigation of the core principles of "deep" network embeddings apart from their inspiration in word embedding and neural networks. We note that, like DeepWalk and related methods, we focus on unsupervised embeddings, derived solely from a graph's structure, without training on, e.g., vertex labels.

3. Methodology

We now present our main contributions, which connect DeepWalk in the infinite window limit to classic spectral embedding with a nonlinearity. We discuss how this viewpoint clarifies the role of the window size parameter in DeepWalk and motivates a very simple embedding technique based on a binary thresholding of the graph Laplacian pseudoinverse.

3.1. Derivation of Limiting PMI Matrix

We start by showing how to simplify the expression in (2) for the DeepWalk PMI given by (Qiu et al., 2018) in the limit $T \to \infty$. We first establish some well-known facts about random walks on graphs. First, $P^{\infty} = \lim_{r \to \infty} P^{r}$ is well-defined under our assumption that the graph is undirected, connected, and non-bipartite. It is rank-1 and equal to $\mathbf{1} \pi^{T}$, where $\mathbf{1}$ is a column vector of ones and $\pi$ is the probability mass of each vertex in the stationary distribution of the random walk as a column vector. Note that $\pi_i = d_i / \mathrm{vol}(G)$. That is, the probability mass of a vertex in the stationary distribution is proportional to its degree. We let $\Pi$ denote the diagonal matrix with the entries of $\pi$ on its diagonal.

Let $w_i$ and $v_i$ be the $i$-th eigenvalue and eigenvector of the symmetrized transition matrix $\mathcal{P} = D^{-1/2} A D^{-1/2}$. We have $w_1 = 1$ and $v_1 = \pi^{1/2}$, the element-wise square root of $\pi$. From (Levinson), for any positive integer $r$,

$$P^{r} = D^{-1/2} \Big( \sum_{i=1}^{n} w_i^{r} \, v_i v_i^{T} \Big) D^{1/2}, \qquad (3)$$

where $n$ is the number of vertices. We rewrite the expression in (3) for $\sum_{r=1}^{T} P^{r}$ and the expression (2) of Qiu et al. for the PMI matrix, setting the negative sampling ratio to $b = 1$ for the latter (i.e., one negative sample per positive sample):

$$\sum_{r=1}^{T} P^{r} = D^{-1/2} \Big( \sum_{i=1}^{n} \Big( \sum_{r=1}^{T} w_i^{r} \Big) v_i v_i^{T} \Big) D^{1/2},
\qquad
M_T = \log \Big( \frac{\mathrm{vol}(G)}{T} \Big( \sum_{r=1}^{T} P^{r} \Big) D^{-1} \Big).$$

Substituting the former into the latter, then rearranging the order of the summations and applying the geometric series formula yields

$$M_T = \log \Big( J + \frac{\mathrm{vol}(G)}{T} \, D^{-1/2} \Big( \sum_{i=2}^{n} \frac{w_i \, (1 - w_i^{T})}{1 - w_i} \, v_i v_i^{T} \Big) D^{-1/2} \Big).$$

Now we consider the limit as $T \to \infty$. Define

$$M_\infty = \lim_{T \to \infty} T \cdot M_T.$$

Since $|w_i| < 1$ for $i \geq 2$ (Levinson), the terms $w_i^{T}$ go to $0$ as $T \to \infty$, and we have:

$$M_\infty = \lim_{T \to \infty} T \cdot \log \Big( J + \frac{\mathrm{vol}(G)}{T} \, D^{-1/2} \Big( \sum_{i=2}^{n} \frac{w_i}{1 - w_i} \, v_i v_i^{T} \Big) D^{-1/2} \Big).$$

Since $(1 + x/T)^{T} \to e^{x}$ as $T \to \infty$ for any constant real number $x$, we have $T \log(1 + x/T) \to x$ as $T \to \infty$. We apply this identity element-wise, then simplify:

$$M_\infty
= \mathrm{vol}(G) \, D^{-1/2} \Big( \sum_{i=2}^{n} \frac{w_i}{1 - w_i} \, v_i v_i^{T} \Big) D^{-1/2}
= \mathrm{vol}(G) \, D^{-1/2} \mathcal{L}^{+} D^{-1/2} - \mathrm{vol}(G) \, D^{-1} + J,$$

where the last step follows from $\frac{w_i}{1 - w_i} = \frac{1}{1 - w_i} - 1$, from $\mathcal{L}^{+} = \sum_{i \geq 2} \frac{1}{1 - w_i} v_i v_i^{T}$ (as $\mathcal{L} = I - \mathcal{P}$), and from $v_1$ being the element-wise square root of $\pi$, so that $\mathrm{vol}(G) \, D^{-1/2} v_1 v_1^{T} D^{-1/2} = J$. This is exactly the expression for $M_\infty$ in (1). Similar analysis also leads to the following general expression for the finite-$T$ PMI matrix:

$$M_T = \log \Big( J + \frac{1}{T} \Big( M_\infty - \mathrm{vol}(G) \, D^{-1/2} \Big( \sum_{i=2}^{n} \frac{w_i^{T+1}}{1 - w_i} \, v_i v_i^{T} \Big) D^{-1/2} \Big) \Big). \qquad (4)$$
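As a sanity check on this limit, one can verify numerically on a tiny graph that $T$ times the finite-$T$ PMI matrix approaches the limiting matrix as $T$ grows. The snippet below is our own sketch; it reuses the hypothetical limiting_pmi_matrix helper sketched in Section 1.

```python
import numpy as np

# Tiny connected, non-bipartite test graph: a 5-cycle plus one chord.
adj = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]:
    adj[i, j] = adj[j, i] = 1.0

deg = adj.sum(axis=1)
vol = deg.sum()
P = adj / deg[:, None]                           # random-walk transition matrix D^{-1} A
m_inf = limiting_pmi_matrix(adj)                 # hypothetical helper sketched in Section 1

for T in (2, 10, 100, 1000):
    partial_sum, P_power = np.zeros_like(P), np.eye(5)
    for _ in range(T):                           # accumulate P + P^2 + ... + P^T
        P_power = P_power @ P
        partial_sum += P_power
    arg = (vol / T) * partial_sum / deg[None, :]    # argument of the log in (2), with b = 1
    print(T, np.abs(T * np.log(arg) - m_inf).max())  # gap should shrink roughly like 1/T
```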

3.2. Approximation of the Finite-T PMI Matrix via the Limiting PMI Matrix

Note that in the expression (4) above for the finite-$T$ matrix, the argument of the logarithm, after subtracting $J$ and multiplying by $T$, differs from the limiting matrix $M_\infty$ only by the term involving $w_i^{T+1}$, which vanishes as $T \to \infty$. So, we may approximate the finite-$T$ matrix by simply dropping this term as follows:

$$M_T \approx \log \Big( \mathrm{ramp} \Big( J + \frac{1}{T} M_\infty \Big) \Big), \qquad (5)$$

where $\mathrm{ramp}$ is any ramp function that ensures the argument of the logarithm is positive. In our implementation, we use the function $\mathrm{ramp}_{\epsilon}(x) = \max(x, \epsilon)$. We use the 64-bit floating-point machine precision limit ($\epsilon \approx 2.2 \times 10^{-16}$) for $\epsilon$. Note that the NetMF method of (Qiu et al., 2018) uses $\epsilon = 1$; we find that a small positive value consistently performs better. Both ramping functions can be interpreted as producing the shifted positive PMI (shifted PPMI) matrix introduced by Levy and Goldberg (2014). Other methods of avoiding invalid arguments to the logarithm are an interesting avenue for future work.

Note that the accuracy of (5) is limited by the second largest eigenvalue $w_2$ of $\mathcal{P}$, which is known as the Fiedler eigenvalue. Smaller magnitudes of this eigenvalue are correlated with a faster mixing rate (Levinson), the rate at which $P^{r}$ approaches $P^{\infty}$ as $r$ increases. In Section 4.2 we show that for typical graphs, the Fiedler eigenvalue is relatively small, and so the approximation is very accurate for large $T$, e.g., $T = 10$, which is a typical setting for DeepWalk. The approximation is less accurate for small $T$, e.g., $T = 1$ (see Table 2).

Effect of the Window Size

Intuitively, the effect of $T$ in (5) is to control the "strength" of the logarithm nonlinearity, since, as noted previously, $T \log(1 + x/T) \to x$ as $T \to \infty$ for any real constant $x$. That is, the logarithm becomes a simple linear scaling in this limit. As we will see, even when the approximation of (5) is inaccurate (when $T$ is very low), this approximated matrix qualitatively has similar properties to the actual $T$-window PMI matrix, and produces embeddings of similar quality, as measured by performance in downstream classification tasks. This finding suggests that the strength of the logarithm nonlinearity can influence the locality of the embedding (as modulated in DeepWalk by the window size $T$) independently of modifying the arguments of the nonlinearity, which contain the actual information from the network as filtered by length-$T$ windows.

3.3. Binarized Laplacian Pseudoinverse

Motivated by the view of DeepWalk as a variant of classic Laplacian factorization with an entrywise nonlinearity, we investigate a highly simplified version of InfiniteWalk. We construct a binary matrix by 1) computing the pseudoinverse $L^{+}$ of the unnormalized Laplacian $L = D - A$, 2) picking a quantile hyperparameter $q$, 3) determining the $q$-quantile element value, and 4) setting all elements less than this value to $0$ and all others to $1$. In other words, an element of this matrix is given by $B_{ij} = \mathbb{1}[L^{+}_{ij} \geq \tau_q]$, where $\tau_q$ is the $q$-quantile element of $L^{+}$. We then produce embeddings by partially factoring this matrix as with the PMI matrices. Interestingly, this can be interpreted as factorizing the adjacency matrix of an implicit derived network whose sparsity is determined by $q$. Gaining a better understanding of the structure and interpretation of this network is an interesting direction for future work.
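A minimal sketch of this binarized variant (our own illustration; the helper name and the default quantile are ours) might look like:

```python
import numpy as np

def binarized_laplacian_embedding(adj, q=0.5, dim=128):
    """Sketch: threshold the unnormalized Laplacian pseudoinverse at its q-quantile
    element and embed nodes by a truncated factorization of the binary matrix.
    The default quantile q is illustrative, not a recommended setting."""
    adj = np.asarray(adj, dtype=float)
    lap = np.diag(adj.sum(axis=1)) - adj        # unnormalized Laplacian L = D - A
    lap_pinv = np.linalg.pinv(lap)              # dense pseudoinverse (small graphs only)
    threshold = np.quantile(lap_pinv, q)        # q-quantile element of L^+
    binary = (lap_pinv >= threshold).astype(float)   # elements below the threshold -> 0
    # Symmetric rank-dim factorization: top-|eigenvalue| eigenpairs, scaled by sqrt(|eigenvalue|).
    vals, vecs = np.linalg.eigh(binary)
    order = np.argsort(-np.abs(vals))[:dim]
    return vecs[:, order] * np.sqrt(np.abs(vals[order]))
```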

Note that in this method, we use the unnormalized Laplacian $L$ rather than the normalized Laplacian $\mathcal{L}$, which appears in the expression (1) for $M_\infty$. This is because, as we will show, the limiting PMI matrix is equal to $\mathrm{vol}(G)\, L^{+}$, a scaling of the pseudoinverse of the unnormalized Laplacian, up to a rank-3 term and a diagonal adjustment. We can rewrite our expression for the limiting matrix by expanding the normalized Laplacian in terms of the unnormalized Laplacian as follows:

$$M_\infty = \mathrm{vol}(G)\, \underbrace{D^{-1/2} \big( D^{-1/2} L D^{-1/2} \big)^{+} D^{-1/2}}_{(\ast)} \; - \; \mathrm{vol}(G)\, D^{-1} \; + \; J.$$

Consider the underbraced term $(\ast)$ above containing $\big( D^{-1/2} L D^{-1/2} \big)^{+}$. If this term had a true inverse rather than a pseudoinverse, the four factors involving the degree matrix would simply cancel, leaving $L^{-1}$. Instead, application of a variant of the Sherman-Morrison formula for pseudoinverses (Meyer, 1973) results in the following expression for this term:

$$D^{-1/2} \big( D^{-1/2} L D^{-1/2} \big)^{+} D^{-1/2}
= L^{+} - \frac{L^{+} d\, \mathbf{1}^{T} + \mathbf{1}\, d^{T} L^{+}}{\mathrm{vol}(G)} + \frac{d^{T} L^{+} d}{\mathrm{vol}(G)^{2}}\, J,$$

where $d$ is the vector of vertex degrees. This yields the following alternate expression for the limiting PMI matrix:

$$M_\infty = \mathrm{vol}(G)\, L^{+} \; - \; \big( L^{+} d\, \mathbf{1}^{T} + \mathbf{1}\, d^{T} L^{+} \big) \; + \; \Big( 1 + \frac{d^{T} L^{+} d}{\mathrm{vol}(G)} \Big) J \; - \; \mathrm{vol}(G)\, D^{-1}. \qquad (6)$$

In our context of binarizing $L^{+}$ by a quantile, note that addition of the all-ones matrix and multiplication by $\mathrm{vol}(G)$ do not affect the ranks of the elements within the matrix, and the subtraction of the diagonal matrix affects relatively few elements. Hence we might expect binarizing $L^{+}$ by thresholding on quantiles to have a similar effect as binarizing the limiting PMI matrix itself.

Binarization is arguably one of the simplest possible methods of augmenting the Laplacian with a nonlinearity. As we will see, this method has good downstream performance compared to DeepWalk and related methods. We argue that this suggests that the core advancement of deep vertex embeddings over classical spectral embedding methods based on factorizing the Laplacian is application of a nonlinearity.

4. Experimental Setup

We now discuss how we empirically validate the performance of the limiting PMI matrix method presented in Section 3.2 and the Laplacian pseudoinverse binarization method of Section 3.3.

4.1. Data Sets

We use three of the four datasets used in the evaluation of the NetMF algorithm (Qiu et al., 2018). Table 1 provides network statistics. Figure 1 provides the eigenvalue distribution of the symmetrized random walk matrix $\mathcal{P}$ for each network.

BlogCatalog (socialcomputing.asu.edu/datasets/BlogCatalog3; Agarwal et al., 2009) is a social network of bloggers. The edges represent friendships between bloggers, and vertex labels represent group memberships corresponding to interests of bloggers.

Protein-Protein Interaction (PPI) (snap.stanford.edu/biodata/datasets/10000/10000-PP-Pathways.html; Stark et al., 2010) is a subgraph of the PPI network for Homo sapiens. Vertices represent proteins, edges represent interactions between proteins, and vertex labels represent biological states. We use only the largest connected component, which contains nearly all of the nodes.

Wikipedia (mattmahoney.net/dc/text.html) is a co-occurrence network of words from a portion of the Wikipedia dump. Nodes represent words, edges represent co-occurrences within a fixed-length window in the corpus, and labels represent inferred part-of-speech (POS) tags.

Dataset              BlogCatalog   PPI      Wikipedia
# Vertices           10,312        3,852    4,777
# Edges              333,983       76,546   184,812
Fiedler Eigenvalue   0.568         0.800    0.504
# Labels             39            50       40
Table 1. Network Statistics.
Figure 1. Sorted eigenvalues of $\mathcal{P}$ for each network. The top eigenvalue of 1 is excluded, and the Fiedler eigenvalues are marked with X's.

4.2. Procedure

Implementation

See Algorithm 1 for the pseudocode of our limiting PMI matrix method. We use the NumPy (Oliphant, 2006) and SciPy (Jones et al., 2001) libraries for our implementation. The most expensive component of the algorithm is the pseudoinversion of the graph Laplacian. While there is significant literature on approximating this matrix, or vector products with it (Spielman and Teng, 2004; Koutis et al., 2014; Kelner et al., 2013; Ranjan et al., 2014), we simply use the dense SVD-based function from NumPy. For graphs of larger scale, this method would not be practical. The truncated sparse eigendecomposition is handled via SciPy's wrapper of ARPACK (Lehoucq et al., 1998), which uses the Implicitly Restarted Lanczos Method. As in (Qiu et al., 2018), to generate a $d$-dimensional embedding, we return the singular vectors corresponding to the $d$ largest singular values, scaling the dimensions of the singular vectors by the square roots of the singular values. For classification, we use the scikit-learn (Pedregosa et al., 2011) wrapper of LIBLINEAR (Fan et al., 2008).

1: Compute $M_\infty = \mathrm{vol}(G)\, D^{-1/2} \mathcal{L}^{+} D^{-1/2} - \mathrm{vol}(G)\, D^{-1} + J$
2: Compute $M_T' = \log\big(\mathrm{ramp}_{\epsilon}\big(J + \tfrac{1}{T} M_\infty\big)\big)$ (element-wise)
3: Rank-$d$ approximation by truncated eigendecomposition: $M_T' \approx U_d \Lambda_d U_d^{T}$
4: return $U_d\, |\Lambda_d|^{1/2}$ as vertex embeddings
Algorithm 1 InfiniteWalk
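For reference, a compact NumPy/SciPy rendering of Algorithm 1 might look as follows. This is a sketch under the dense-pseudoinverse and truncated-eigendecomposition choices described above, not released code:

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def infinitewalk_embedding(adj, T=10, dim=128, eps=np.finfo(np.float64).eps):
    """Sketch of Algorithm 1 (InfiniteWalk) for a dense adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    vol = deg.sum()
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Step 1: limiting PMI matrix M_inf from equation (1).
    norm_lap = np.eye(adj.shape[0]) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    m_inf = (vol * d_inv_sqrt[:, None] * np.linalg.pinv(norm_lap) * d_inv_sqrt[None, :]
             - np.diag(vol / deg) + 1.0)
    # Step 2: finite-T approximation via the ramp-log nonlinearity of equation (5).
    m_T = np.log(np.maximum(1.0 + m_inf / T, eps))
    # Step 3: rank-dim approximation by truncated eigendecomposition (ARPACK via SciPy).
    k = min(dim, adj.shape[0] - 1)
    vals, vecs = eigsh(m_T, k=k, which="LM")
    # Step 4: scale eigenvector dimensions by the square roots of the eigenvalue magnitudes.
    return vecs * np.sqrt(np.abs(vals))
```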

Evaluation Setting

To investigate the usefulness and meaningfulness of the limiting PMI matrix, we evaluate the quality of embeddings generated by its truncated SVD after applying the element-wise ramp-logarithm described in Section 3.2. For this task, we closely follow the same procedure as in (Perozzi et al., 2014) and (Qiu et al., 2018). We use one-vs-rest logistic regression on the embeddings for multi-label prediction on the datasets. The classifiers employ L2 regularization. Classifiers are trained on a portion of the data, with the training ratio being varied from 10% to 90%; the remainder is used for testing. As in (Perozzi et al., 2014) and (Qiu et al., 2018), we assume that the number of labels for each test example is given. In particular, given that a vertex is assigned $k$ labels, the classifier predicts exactly the $k$ classes to which it assigned the highest probability. We use the mean scores over multiple random splits of the training and test data for each training ratio. We evaluate the micro-F1 and macro-F1 scores of classifiers using our embedding.
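A minimal sketch of this prediction rule using scikit-learn follows; the inverse regularization strength C below is a placeholder default, not a reported setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier

def evaluate_split(X_train, Y_train, X_test, Y_test, C=1.0):
    """One-vs-rest logistic regression over embeddings; each test node predicts
    exactly as many labels as it truly has, taking the highest-probability classes.
    Y_* are binary indicator matrices of shape (num_nodes, num_labels)."""
    clf = OneVsRestClassifier(LogisticRegression(C=C, solver="liblinear"))
    clf.fit(X_train, Y_train)
    probs = clf.predict_proba(X_test)
    preds = np.zeros_like(Y_test)
    for i, k in enumerate(Y_test.sum(axis=1).astype(int)):
        preds[i, np.argsort(-probs[i])[:k]] = 1  # k = true number of labels for node i
    return (f1_score(Y_test, preds, average="micro"),
            f1_score(Y_test, preds, average="macro"))
```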

Hyperparameter Choices

We compare our results to those of DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015), and NetMF (Qiu et al., 2018) as reported in (Qiu et al., 2018). The hyperparameters used for DeepWalk are the preferred default settings of its authors: window size $T = 10$, walk length 40, and 80 walks starting from each vertex. Results from the second-order variant of LINE are reported. As the authors of NetMF report results for window sizes $T = 1$ and $T = 10$, we similarly report results for InfiniteWalk with $T = 1$ and $T = 10$. We expect the results of InfiniteWalk, as an approximation of the NetMF method in the high-$T$ limit, to at least be roughly similar for the higher $T = 10$ setting. We also include results with our limiting PMI matrix $M_\infty$ itself, though only for illustrative purposes. As the limiting matrix is essentially a simple linear transformation of the graph Laplacian's pseudoinverse, we expect embeddings derived thereof to perform relatively poorly. The entrywise nonlinearity seems to be critical. The embedding dimensionality is 128 for all methods, as in both (Perozzi et al., 2014) and (Qiu et al., 2018).

4.3. Binarized Laplacian Pseudoinverse

We implement and evaluate the simplified method of factorizing a binarized version of the unnormalized Laplacian pseudoinverse (described in Section 3.3) in the same way. We present results for the best values of the quantile hyperparameter $q$ found by rough grid search. As with the window size $T$, the best value is expected to vary across networks. We compare to the performance of NetMF, the sampling methods LINE and DeepWalk, and a classical method: since the normalized and unnormalized Laplacians themselves both perform poorly on these tasks, we compare to factorizing the adjacency matrix itself. Again, since inverting and binarizing is one of the simplest possible nonlinearities to apply to the Laplacian, good downstream performance suggests that the addition of a nonlinearity is the key advancement of deep node embeddings over classical embeddings of the Laplacian itself.

Figure 2. Multi-label classification performance on the BlogCatalog, PPI, and Wikipedia networks. Micro-F1 score (top) and Macro-F1 score (bottom) versus percent of data used for training. Results for InfiniteWalk (Algorithm 1) all appear as solid lines.

5. Results

We now discuss our experimental results on both the limiting PMI-based algorithm and the simple Laplacian binarization algorithm.

Dataset           BlogCatalog   PPI        Wikipedia
Error             0.001273      0.04152    1.355
Ramped Elements   0.0004901     0.002521   0.08892
Table 2. PMI Approximation Error. The first row gives the Frobenius norm of the difference between the true PPMI matrix and our approximation based on $M_\infty$ (see (5)), divided by the norm of the true matrix. The log-ramp nonlinearity with $\epsilon = 1$, as used in the NetMF method, is applied to both matrices. The second row gives the number of elements that are affected by the ramping component of the nonlinearity in one matrix but not the other, normalized by the size of the matrices. $T = 10$ is used for BlogCatalog and PPI, while $T = 1$ is used for Wikipedia.
Figure 3. Performance of InfiniteWalk relative to NetMF (Qiu et al., 2018). F1 score (%) of InfiniteWalk minus F1 score of NetMF versus percent of data used for training. For both methods, $T = 10$ is used for BlogCatalog and PPI, and $T = 1$ is used for Wikipedia.
Figure 4. Performance of the binarized Laplacian pseudoinverse method relative to NetMF, sampling-based methods, and embedding by factorizing the adjacency matrix. The thresholding quantile $q$ is set to one value for BlogCatalog and PPI and to another for Wikipedia, chosen by the grid search described in Section 4.3. The more suitable setting of $T$ and the more suitable sampling method are plotted for each network.

5.1. PMI Approximation Error

In Table 2 we show how closely the PMI approximation given by (5) matches the true PMI matrix. We can see from Table 1 that the Fiedler eigenvalues of our graphs are bounded away from 1. Thus, as expected, the approximation of the finite-$T$ PMI matrix via the limiting matrix is quite close for the networks where $T = 10$ is used, but not so for the one where $T = 1$ is used. This holds both before and after application of the log-ramp nonlinearity, which is applied in the NetMF method (Qiu et al., 2018). Across the surveyed networks and values of $T$, the elements which are affected by the ramping component of the nonlinearity are similar between our approximation and the true PMI matrix. The accurate approximation at $T = 10$ explains why InfiniteWalk performs similarly to NetMF on downstream classification tasks. Interestingly, at $T = 1$, InfiniteWalk performs competitively in spite of the inaccurate approximation.

5.2. Multi-Label Classification

In Figure 2 we show downstream performance of embeddings based on the limiting approximation, compared to other methods. Across both metrics and all datasets, NetMF and InfiniteWalk are generally on par with or outperform the sampling-based methods, LINE and DeepWalk. As observed in (Qiu et al., 2018), DeepWalk and NetMF with $T = 10$ have better overall performance than LINE and NetMF with $T = 1$ on the BlogCatalog and PPI networks, while the reverse is true for the Wikipedia network. This suggests that shorter-range dependencies better capture Wikipedia's network structure. As expected, InfiniteWalk with the $T = 10$ nonlinearity performs better than the version with the $T = 1$ nonlinearity on the former two datasets, while the reverse is true for Wikipedia. In all cases, the factorization of the limiting PMI matrix $M_\infty$ itself performs poorly. These findings support our hypothesis that changing the strength of the logarithm nonlinearity can largely emulate the effect of actually changing the window size in sampling and deterministic approaches.

While maximizing downstream performance is not the focus of our work, we observe that, overall, InfiniteWalk's performance is competitive with, if slightly inferior to, that of NetMF (see Figure 3). On BlogCatalog, InfiniteWalk underperforms relative to NetMF. On PPI, InfiniteWalk outperforms NetMF when less than half the data is used for training, but underperforms when given more training data. On Wikipedia, InfiniteWalk underperforms relative to NetMF on macro-F1 score, but outperforms NetMF on micro-F1 score.

Figure 5. Distribution of elements of the limiting PMI matrices of the three networks. The distributions are separated between negative and positive elements corresponding to negative and positive correlations between nodes.

Binarized Laplacian Pseudoinverse

In Figure 4 we show downstream performance of our simple method based on factorizing the binarized Laplacian pseudoinverse. This method performs remarkably well on the BlogCatalog and PPI networks. On PPI, it matches the performance of NetMF, and on BlogCatalog, it is nearly on par, again accounting for most of the improvement over the classical method. On the third network, Wikipedia, it is less successful, especially on macro-F1 score, but still improves on the classical method. These results again support our hypothesis that the key ingredient of improved node embeddings seems to be the application of a nonlinearity to the Laplacian pseudoinverse.

Elements of Limiting PMI Matrices

Since we are introducing the limiting PMI matrix as an object for future investigation, we also give a preliminary qualitative description of its elements. See Figure 5 for a visualization of the element distribution for the three networks we investigate. Across these networks, these matrices tend to contain mostly small negative elements, corresponding to weak negative correlations between nodes, as well as some large positive values, corresponding to strong positive correlations. The distributions overall have similar shapes, and, interestingly, have roughly the same ratios of negative values to positive values, corresponding to roughly the same ratios of negative correlations to positive correlations.

6. Discussion and Conclusions

In this work we have simplified known expressions for the finite-$T$ network PMI matrix and derived an expression for the matrix in the $T \to \infty$ limit in terms of the pseudoinverse of the graph Laplacian. This expression strengthens connections between SGNS-based and classic spectral embedding methods.

We show that, with a simple scaled logarithm nonlinearity, this limiting matrix can be used to approximate finite-$T$ matrices which yield competitive results on downstream vertex classification tasks. This finding suggests that the core mechanism of SGNS methods as applied to networks is a simple nonlinear transformation of classical Laplacian embedding methods. We even find that just binarizing the Laplacian pseudoinverse by thresholding often accounts for most of the performance gain over classical approaches, suggesting again the importance of the nonlinearity.

Future Work

We view our work as a step in understanding the core mechanisms of SGNS-based embedding methods. However, many open questions remain.

For example, one may ask why the scaled logarithm non-linearity is a good choice. Relatedly, how robust is performance to changes in the nonlinearity? Our results on binarization of the Laplacian pseudoinverse indicate that it may be quite robust, but this is worth further exploration. Finally, as discussed, our binarization method can be viewed as producing the adjacency matrix of a graph based on the Laplacian pseudoinverse, and then directly factoring this matrix. Understanding how this derived graph relates to the input graph would be a very interesting next step in understanding the surprisingly competitive performance of this method.

Additionally, as discussed previously, node2vec (Grover and Leskovec, 2016) is a generalization of DeepWalk that adds additional hyperparameters to create second-order random walks. Qiu et al. (Qiu et al., 2018) also provide an expression for the matrix that is implicitly factored by node2vec, so pursuing the limit of this matrix may provide insight into node2vec and an interesting generalization of DeepWalk’s limiting PMI matrix.

References

  • N. Agarwal, H. Liu, S. Murthy, A. Sen, and X. Wang (2009) A social identity approach to identify familiar strangers in a social network. In Third International AAAI Conference on Weblogs and Social Media.
  • C. Allen, I. Balazevic, and T. Hospedales (2019) What the vec? Towards probabilistically grounded embeddings. In Advances in Neural Information Processing Systems, pp. 7465–7475.
  • C. Allen and T. Hospedales (2019) Analogies explained: towards understanding word embeddings. In International Conference on Machine Learning, pp. 223–231.
  • S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski (2016) A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics 4, pp. 385–399.
  • M. Belkin and P. Niyogi (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15 (6), pp. 1373–1396.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852.
  • R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin (2008) LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research 9, pp. 1871–1874.
  • A. Gittens, D. Achlioptas, and M. W. Mahoney (2017) Skip-gram - Zipf + uniform = vector additivity. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 69–76.
  • Y. Goldberg and O. Levy (2014) word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
  • A. Grover and J. Leskovec (2016) node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864.
  • E. Jones, T. Oliphant, P. Peterson, et al. (2001) SciPy: open source scientific tools for Python.
  • J. A. Kelner, L. Orecchia, A. Sidford, and Z. A. Zhu (2013) A simple, combinatorial algorithm for solving SDD systems in nearly-linear time. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pp. 911–920.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • I. Koutis, G. L. Miller, and R. Peng (2014) Approaching optimality for solving SDD linear systems. SIAM Journal on Computing 43 (1), pp. 337–354.
  • R. B. Lehoucq, D. C. Sorensen, and C. Yang (1998) ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. Vol. 6, SIAM.
  • H. Levinson. An eigenvalue representation for random walk hitting times and its application to the rook graph.
  • O. Levy and Y. Goldberg (2014) Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pp. 2177–2185.
  • Y. Li, L. Xu, F. Tian, L. Jiang, X. Zhong, and E. Chen (2015) Word embedding revisited: a new representation learning and explicit matrix factorization perspective. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
  • F. McSherry (2001) Spectral partitioning of random graphs. In Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pp. 529–537.
  • C. D. Meyer (1973) Generalized inversion of modified matrices. SIAM Journal on Applied Mathematics 24 (3), pp. 315–323.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
  • A. Y. Ng, M. I. Jordan, and Y. Weiss (2002) On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems, pp. 849–856.
  • T. E. Oliphant (2006) A Guide to NumPy. Vol. 1, Trelgol Publishing USA.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
  • B. Perozzi, R. Al-Rfou, and S. Skiena (2014) DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710.
  • J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang (2018) Network embedding as matrix factorization: unifying DeepWalk, LINE, PTE, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 459–467.
  • G. Ranjan, Z. Zhang, and D. Boley (2014) Incremental computation of pseudo-inverse of Laplacian. In International Conference on Combinatorial Optimization and Applications, pp. 729–749.
  • J. Shi and J. Malik (2000) Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8), pp. 888–905.
  • D. A. Spielman and S. Teng (2004) Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pp. 81–90.
  • C. Stark, B. Breitkreutz, A. Chatr-Aryamontri, L. Boucher, R. Oughtred, M. S. Livstone, J. Nixon, K. Van Auken, X. Wang, X. Shi, et al. (2010) The BioGRID interaction database: 2011 update. Nucleic Acids Research 39 (suppl_1), pp. D698–D704.
  • J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei (2015) LINE: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077.
  • P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2018) Deep graph infomax. arXiv preprint arXiv:1809.10341.
  • F. Wu, T. Zhang, A. Holanda de Souza, C. Fifty, T. Yu, and K. Q. Weinberger (2019) Simplifying graph convolutional networks. Proceedings of Machine Learning Research.
  • S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin (2006) Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (1), pp. 40–51.