Repository for residual2vec
Graph embedding maps a graph into a convenient vector-space representation for graph analysis and machine learning applications. Many graph embedding methods hinge on a sampling of context nodes based on random walks. However, random walks can be a biased sampler due to the structural properties of graphs. Most notably, random walks are biased by the degree of each node, where a node is sampled proportionally to its degree. The implication of such biases has not been clear, particularly in the context of graph representation learning. Here, we investigate the impact of the random walks' bias on graph embedding and propose residual2vec, a general graph embedding method that can debias various structural biases in graphs by using random graphs. We demonstrate that this debiasing not only improves link prediction and clustering performance but also allows us to explicitly model salient structural properties in graph embedding.
On average, your friends tend to be more popular than you. This is a mathematical necessity known as the friendship paradox, which arises from a sampling bias: popular people have many friends and thus are likely to be on your friend list Feld1991 . Beyond being fun trivia, the friendship paradox is a fundamental property of graphs: following an edge is a biased sampling that preferentially selects nodes based on their degree (i.e., the number of neighbors). The fact that random walks are used as the default sampling paradigm across many graph embedding methods raises important questions: what are the implications of this sampling bias in graph embedding? And if it is undesirable, how can we debias it?
Graph embedding maps a graph into a dense vector representation, enabling a direct application of many machine learning algorithms to graph analysis Chai2018 . A widely used framework is to turn a graph into a “sentence of nodes” and then feed the sentence to word2vec Perozzi2014 ; Grover2016 ; Dong2017 ; Mikolov2013 . A crucial difference from word embedding is that, rather than using given sentences, graph embedding methods generate synthetic “sentences” from a given graph. In other words, the generation of synthetic “sentences” from a graph is an implicit modeling decision Eriksson2021-kf , which most graph embedding methods take for granted. A common approach for generating sentences from a graph is based on random walks, which randomly traverse nodes by following edges. The friendship paradox comes into play when a walker follows an edge (e.g., a friendship tie): the walker is more likely to visit a node with many neighbors (e.g., a popular individual). As an example, consider a graph with a core-periphery structure, where core nodes have more neighbors than periphery nodes (Fig. 1A). Although core nodes are the minority, they become the majority in the sentences generated by random walks (Fig. 1B). This is because core nodes have more neighbors than periphery nodes and thus are more likely to be a neighbor of other nodes, which is a manifestation of the friendship paradox. How, then, does this sampling bias affect the embedding?
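To make the degree bias concrete, the following sketch simulates a random walk on a hypothetical toy core-periphery graph (not one of the paper's datasets) and checks that visit frequencies approach the degree-proportional stationary distribution:

```python
# Toy illustration (hypothetical graph): a random walk visits each node
# proportionally to its degree, so the two "core" nodes dominate the walk
# even though they are a minority of the nodes.
import random
from collections import Counter

random.seed(0)

# Adjacency list: nodes 0-1 form the core, nodes 2-5 are periphery nodes.
adj = {
    0: [1, 2, 3, 4, 5],
    1: [0, 2, 3, 4, 5],
    2: [0, 1],
    3: [0, 1],
    4: [0, 1],
    5: [0, 1],
}

def random_walk(adj, start, steps):
    """Generate a node-visit count by uniformly following edges."""
    node, visits = start, Counter()
    for _ in range(steps):
        node = random.choice(adj[node])
        visits[node] += 1
    return visits

visits = random_walk(adj, start=0, steps=100_000)
degree = {v: len(nbrs) for v, nbrs in adj.items()}
total_degree = sum(degree.values())

# Visit frequency approaches degree / total_degree for each node.
for v in adj:
    print(v, visits[v] / 100_000, degree[v] / total_degree)
```

Here the two core nodes hold 10 of the 18 edge endpoints and therefore receive the majority of all visits, mirroring Fig. 1B.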
Previous approaches to mitigate the degree bias in embedding are based on modifying random walks or the post-transformation of the embedding Andersen2006-ki ; Wu2019-qq ; Tang2020-ta ; Ahmed2020-vl ; Ha2020-fs ; Fountoulakis2019-fv ; Palowitch2020-tu ; Faerman2018-pd ; Rahman2019-uh . Here we show that word2vec by itself has an implicit bias arising from the optimization algorithm—skip-gram negative sampling (SGNS)—which happens to negate the bias due to the friendship paradox. To leverage this debiasing feature further, we propose a more general framework, residual2vec, that can also compensate for other systematic biases in random walks. We show that residual2vec performs better than conventional embedding methods in link prediction and community detection tasks. Using a citation graph of 260k journals, we demonstrate that the biases from random walks overshadow the salient features of graphs. By removing the bias, residual2vec better captures the characteristics of journals such as the impact factor and journal subject. The Python code of residual2vec is available at GitHub r2v .
Consider a sentence composed of $N$ unique words. word2vec associates each center word with the words in its surrounding, which are referred to as context words, determined by a prescribed window size $T$. For a center-context word pair $(i, j)$, word2vec models the conditional probability
$$P(j \mid i) = \frac{\exp(u_i^\top v_j)}{\sum_{j'=1}^{N} \exp(u_i^\top v_{j'})}, \qquad (1)$$
where $u_i, v_i \in \mathbb{R}^K$ are the embedding vectors representing word $i$ as a center and a context word, respectively, and $K$ is the embedding dimension. One approach to fit $P(j \mid i)$ is maximum likelihood estimation, which is computationally expensive because the denominator of Eq. (1) involves a sum over all words. Alternatively, several heuristics have been proposed, among which negative sampling is the most widely used Mikolov2013 ; Perozzi2014 ; Grover2016 .
Negative sampling trains word2vec as follows. Given a sentence, a center-context word pair $(i, j)$ is sampled and labeled as $y = 1$. Additionally, one samples $k$ random words as candidate context words from a noise distribution $p_0$, and labels each such pair as $y = 0$. A popular choice of the noise distribution is based on word frequency, i.e., $p_0(j) \propto p_j^{\gamma}$, where $p_j$ is the fraction of word $j$ in the given sentence and $\gamma$ is a hyper-parameter. Negative sampling trains $u_i$ and $v_j$ such that label $y$ is well predicted by the logistic regression model
$$P(y = 1 \mid i, j) = \sigma(u_i^\top v_j),$$
where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function, by maximizing its log-likelihood.
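The training loop above can be sketched as follows; the toy corpus, dimensions, learning rate, and epoch count are illustrative choices, not settings from the paper:

```python
# Minimal sketch of skip-gram negative sampling (SGNS).
# k negatives are drawn from p0(w) ∝ frequency(w)^gamma, and the classifier
# is sigmoid(u_i · v_j), as in the text. All hyper-parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)

corpus = "a b a b a c a b a b a c a b".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
N, dim, window, k, gamma, lr = len(vocab), 8, 2, 5, 0.75, 0.05

# Noise distribution from word frequencies.
freq = np.bincount([idx[w] for w in corpus], minlength=N).astype(float)
p0 = freq**gamma
p0 /= p0.sum()

U = 0.1 * rng.standard_normal((N, dim))  # center vectors u_i
V = 0.1 * rng.standard_normal((N, dim))  # context vectors v_j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(200):  # epochs
    for t, w in enumerate(corpus):
        i = idx[w]
        lo, hi = max(0, t - window), min(len(corpus), t + window + 1)
        for c in corpus[lo:t] + corpus[t + 1:hi]:
            j = idx[c]
            # Positive example (y = 1): push sigmoid(u_i · v_j) toward 1.
            g = 1.0 - sigmoid(U[i] @ V[j])
            U[i] += lr * g * V[j]
            V[j] += lr * g * U[i]
            # k negative examples (y = 0) drawn from the noise distribution.
            for j_neg in rng.choice(N, size=k, p=p0):
                g = -sigmoid(U[i] @ V[j_neg])
                U[i] += lr * g * V[j_neg]
                V[j_neg] += lr * g * U[i]
```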
Negative sampling efficiently produces a good representation Mikolov2013 . An often overlooked fact is that negative sampling is a simplified version of Noise Contrastive Estimation (NCE) Gutmann2010 ; Dyer2014 , and this simplification biases the model estimation. In the following, we show that this estimation bias gives rise to a built-in debiasing feature of SGNS word2vec.
NCE is a generic estimator for a probability model of the form Gutmann2010
$$P(x) = \frac{h(x)}{\sum_{x' \in \mathcal{X}} h(x')}, \qquad (3)$$
where $h$ is a non-negative function of data $x$ in the set $\mathcal{X}$ of all possible values of $x$. word2vec (Eq. (1)) is a special case of Eq. (3), where $x = (i, j)$ and $h(x) = \exp(u_i^\top v_j)$. NCE estimates $h$
by solving the same task as negative sampling—classifying a positive example and $k$ randomly sampled negative examples using logistic regression—but based on a Bayesian framework Gutmann2010 ; Dyer2014 . Specifically, as prior knowledge, we know that 1 in $k + 1$ pairs is taken from the given data, which can be expressed as the prior probabilities Gutmann2010 ; Dyer2014
$$P(y = 1) = \frac{1}{k + 1}, \qquad P(y = 0) = \frac{k}{k + 1}.$$
Substituting these priors into the Bayes rule yields the posterior probability for $y$ given an example $(i, j)$ Gutmann2010 ; Dyer2014
$$P(y = 1 \mid i, j) = \frac{P(j \mid i)}{P(j \mid i) + k\, p_0(j)},$$
which can be rewritten with a sigmoid function as
$$P(y = 1 \mid i, j) = \sigma\!\left(u_i^\top v_j - \ln Z - \ln k\, p_0(j)\right), \qquad (7)$$
where $Z$ is a constant. NCE learns $u_i$ and $v_j$ by logistic regression based on Eq. (7). The key feature of NCE is that it is an asymptotically unbiased estimator of $P(j \mid i)$, whose bias goes to zero as the number of training examples goes to infinity Gutmann2010 .
In the original word2vec paper Mikolov2013 , the authors simplified NCE into negative sampling by dropping the $-\ln Z - \ln k\, p_0(j)$ terms in Eq. (7), because doing so reduced the computation and still yielded a good word embedding:
$$P(y = 1 \mid i, j) = \sigma(u_i^\top v_j). \qquad (8)$$
In the following, we show the impact of this simplification on the final embedding.
Equation (8) makes clear the relationship between negative sampling and NCE: negative sampling is NCE with $Z = 1$ and a noise distribution satisfying $k\, p_0(j) = 1$ murray2020unsupervised . Bearing in mind that NCE is an asymptotically unbiased estimator of Eq. (3), and substituting $h(x) = k\, p_0(j) \exp(u_i^\top v_j)$ into Eq. (3), we find that SGNS word2vec is an asymptotically unbiased estimator for the probability model
$$P(j \mid i) = \frac{1}{Z_i}\, p_0(j) \exp(u_i^\top v_j), \qquad (9)$$
where $Z_i$ is the normalization constant.
Equation (9) clarifies the role of the noise distribution $p_0$. Noise probability $p_0(j)$ serves as a baseline for $P(j \mid i)$, and word similarity $u_i^\top v_j$ represents the deviation from $p_0$, or equivalently, the characteristics of words not captured in $p_0$. Notably, baseline $p_0$ is determined by word frequency and thus negates the word-frequency bias. This realization—that we can explicitly use a noise distribution to obtain “residual” information—is the motivation for our method, residual2vec.
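A minimal numeric illustration of this baseline role (assuming the model form of Eq. (9)): when all similarities $u_i^\top v_j$ are zero, the model reduces exactly to $p_0$, so the embedding only has to encode deviations from the baseline:

```python
# Sketch of P(j|i) ∝ p0(j) exp(u_i · v_j) with zero similarities.
# The p0 values below are illustrative.
import numpy as np

p0 = np.array([0.5, 0.3, 0.2])  # noise/baseline distribution
uv = np.zeros(3)                # u_i · v_j for one center word i
P = p0 * np.exp(uv)
P /= P.sum()
# With uv = 0, P equals p0: the baseline fully explains the model.
```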
We assume that the given graph is undirected and weighted, although our results can be generalized to directed graphs (see Supplementary Information). We allow multi-edges (i.e., multiple edges between the same node pair) and self-loops, and treat unweighted graphs as weighted graphs with all edge weights set to one Fortunato2010 ; Newman2014 .
The presence of baseline $p_0$ effectively negates the bias in random walks due to degree. This bias dictates that, for a sufficiently long trajectory of random walks in an undirected graph, the frequency of node $j$ is proportional to its degree $d_j$ (i.e., the number of neighbors), irrespective of the graph structure MASUDA2017b . Now, if we set $p_0(j) \propto d_j$, baseline $p_0$ matches exactly the node frequency in the trajectory, negating the bias due to degree. But we are free to choose any $p_0$. This consideration leads us to the residual2vec model
$$P(j \mid i) = \frac{1}{Z_i} P_0(j \mid i) \exp(u_i^\top v_j),$$
where we explicitly model the baseline transition probability, denoted by $P_0(j \mid i)$, and $Z_i$ is the normalization constant. In doing so, we can obtain the residual information that is not captured in $P_0$. Figure 2 shows the framework of residual2vec. To negate a bias, we consider “null” graphs, in which edges are randomized while the property inducing the bias is kept intact Fortunato2010 ; Newman2014 . Then, we compute $P_0(j \mid i)$ either analytically or by running random walks in the null graphs. The baseline $P_0$ is then used as the noise distribution to train SGNS word2vec.
Among the many random graph models erdHos1959random ; Karrer2009 ; Garlaschelli2009 ; Karrer2011 ; Expert2011 ; Levy2014 ; fosdick2018 , we focus here on the degree-corrected stochastic block model (dcSBM), which reduces to many fundamental random graph models under certain parameter choices Karrer2011 . With the dcSBM, one partitions nodes into groups and randomizes edges while preserving (i) the degree of each node and (ii) the number of inter- and intra-group edges. Preserving such group connectivity is useful for negating biases arising from less relevant group structure, such as bipartite and multilayer structures. The dcSBM can be mapped to many canonical ensembles that preserve the expectation of structural properties. In fact, when all nodes are placed in a single group, the dcSBM reduces to the soft configuration model, which preserves the degree of each node on average, with self-loops and multi-edges allowed fosdick2018 . Furthermore, by placing all nodes in a single group and setting all degrees equal, the dcSBM reduces to the Erdős–Rényi model for multigraphs, which preserves the number of edges on average, with self-loops and multi-edges allowed. In the dcSBM, the edge weights follow a Poisson distribution and thus take integer values.
Suppose that the nodes in the given graph have discrete labels (e.g., gender), and we want to remove the structural bias associated with the labels. If no such label is available, all nodes are considered to have the same label (i.e., a single group). We fit the dcSBM whose groups consist of the nodes with the same label. The dcSBM generates random graphs that preserve the number of edges within and between the groups (e.g., assortativity by gender types). We can calculate $P_0(j \mid i)$ in closed form, without explicitly generating random graphs (Supplementary Information); the expression involves the degree $d_j$ of node $j$, its group membership $g_j$, the total degree $D_g$ of each group $g$, the Kronecker delta $\delta_{g, g'}$, and a matrix whose $(g, g')$ entry is the fraction of edges in group $g$ that attach to group $g'$. Table 1 lists $P_0$ for the special classes of the dcSBM. See Supplementary Information for the step-by-step derivation.
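As an illustration of the idea (not the paper's exact closed form), a one-step baseline under a dcSBM-style null model can be computed by first moving between groups according to the observed inter-group edge fractions and then landing on a node within the target group proportionally to its degree:

```python
# Hedged sketch of a dcSBM-style baseline transition probability.
# The toy graph and group labels are illustrative.
import numpy as np

A = np.array([  # undirected toy graph, groups {0,1} and {2,3}
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
], dtype=float)
g = np.array([0, 0, 1, 1])  # group labels
d = A.sum(axis=1)           # degrees
B = g.max() + 1

# Pi[r, s]: fraction of edge endpoints in group r that connect to group s.
Pi = np.zeros((B, B))
for r in range(B):
    for s in range(B):
        Pi[r, s] = A[np.ix_(g == r, g == s)].sum() / d[g == r].sum()

D_group = np.array([d[g == s].sum() for s in range(B)])  # total degree per group
# P0[i, j] = Pi[g_i, g_j] * d_j / D_{g_j}: group hop, then degree-weighted landing.
P0 = Pi[np.ix_(g, g)] * (d / D_group[g])[None, :]
```

With a single group, `Pi` collapses to 1 and the baseline reduces to `d_j / 2M`, consistent with the soft configuration model case.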
residual2vec can be considered a general framework for understanding structural graph embedding methods, because many existing methods are special cases of residual2vec. node2vec and NetMF use SGNS word2vec with the noise distribution $p_0(j) \propto d_j^{3/4}$. This is equivalent to the baseline for the soft configuration model fosdick2018 in which each node has degree $d_j^{3/4}$. DeepWalk is also based on word2vec but is trained with an unbiased estimator (i.e., the hierarchical softmax). Because an unbiased estimation corresponds to negative sampling with a uniform noise distribution Gutmann2010 ; Dyer2014 , DeepWalk is equivalent to residual2vec with the Erdős–Rényi random graphs for multigraphs.
Many structural graph embedding methods implicitly factorize a matrix to find embeddings Qiu2018 . residual2vec can also be described as factorizing a matrix that captures residual pointwise mutual information. Just as a $(K-1)$-th order polynomial function can be fit to $K$ points without errors, we can fit word2vec to the given data without errors when the embedding dimension equals the number of unique words Levy2014 , i.e., if $K = N$. Substituting the empirical conditional probability $\hat{P}(j \mid i)$ into the residual2vec model, we obtain an exact-fit condition: the equality holds if
$$u_i^\top v_j = \ln \frac{\hat{P}(j \mid i)}{P_0(j \mid i)} + c$$
for all $i$ and $j$, where $c$ is a constant. The solution is not unique because $c$ can be any real value. We choose $c = 0$ to obtain a solution in the simplest form, yielding the matrix $R$ that residual2vec factorizes:
$$R_{ij} = \ln \frac{\hat{P}(j \mid i)}{P_0(j \mid i)}.$$
Matrix $R$ has an information-theoretic interpretation. We rewrite
$$R_{ij} = \ln \frac{\hat{P}(i, j)}{\hat{P}(i) \hat{P}(j)} - \ln \frac{P_0(i, j)}{P_0(i) P_0(j)} + \ln \frac{\hat{P}(j)}{P_0(j)}.$$
The dcSBM preserves the degree of each node and thus has the same degree bias as the given graph, i.e., $P_0(j) = \hat{P}(j)$ (Supplementary Information), which leads to
$$R_{ij} = \mathrm{PMI}(i, j) - \mathrm{PMI}_0(i, j),$$
where $\mathrm{PMI}(i, j) = \ln [\hat{P}(i, j) / (\hat{P}(i) \hat{P}(j))]$ is the pointwise mutual information that measures the correlation between center $i$ and context $j$ under the joint distribution: $\mathrm{PMI}(i, j) = 0$ if $i$ and $j$ appear independently, and $\mathrm{PMI}(i, j) \neq 0$ otherwise. In sum, $R_{ij}$ reflects the residual pointwise mutual information of $i$ and $j$ relative to the null model.
Although we assume $K = N$ above, in practice we want to find a compact vector representation (i.e., $K \ll N$) that still yields a good approximation Levy2014 ; Qiu2018 . There are several computational challenges in factorizing $R$. First, $R_{ij}$ is ill-defined for any node pair $(i, j)$ that never appears, because $\ln 0 = -\infty$. Second, $R$ is often a dense matrix with $O(N^2)$ space complexity. For these issues, a common remedy is a truncation Levy2014 ; Qiu2018 :
$$R_{ij} \leftarrow \max(0, R_{ij}).$$
This truncation discards negative node associations ($R_{ij} < 0$) while keeping the positive associations ($R_{ij} > 0$), based on the idea that negative associations are common and thus less informative Qiu2018 . In both word and graph embedding, the truncation substantially reduces the computational cost of the matrix factorization Levy2014 ; Qiu2018 .
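A sketch of constructing the truncated residual matrix for a toy graph, using the degree-proportional baseline and a window-averaged transition probability (the graph and window size are illustrative, and the empirical probabilities are taken directly from transition-matrix powers rather than sampled walks):

```python
# Sketch: truncated residual matrix R for a toy graph.
# Baseline: configuration-model-style P0(j|i) = d_j / 2M.
import numpy as np

A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)
d = A.sum(axis=1)
P = A / d[:, None]               # one-step transition matrix
T = 3                            # window size (illustrative)
Pw = sum(np.linalg.matrix_power(P, t) for t in range(1, T + 1)) / T
P0 = np.tile(d / d.sum(), (len(d), 1))  # degree-proportional baseline

R = np.log(Pw / P0)
R_trunc = np.maximum(R, 0.0)     # keep only positive residual associations
```

For instance, the degree-1 node (index 3) visits its sole neighbor far more often than the degree baseline predicts, so that entry survives the truncation.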
We factorize $R$ into embedding vectors $u_i$ and $v_j$ such that $R_{ij} \approx u_i^\top v_j$ by using the truncated singular value decomposition (SVD). Specifically, we factorize $R \approx U \Sigma V^\top$, where the columns of $U$ and $V$ are the left and right singular vectors of $R$ associated with the $K$ largest singular values in magnitude, respectively, and $\Sigma$ is the diagonal matrix of those singular values. Then, we compute $u_i$ and $v_j$ as the rows of $U \Sigma^{1/2}$ and $V \Sigma^{1/2}$, following previous studies Levy2014 ; Qiu2018 .
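The SVD step can be sketched as follows; the matrix here is random, standing in for the residual matrix:

```python
# Sketch: truncated-SVD embedding, u = U_K sqrt(S_K), v = V_K sqrt(S_K),
# so that u_i · v_j approximates R_ij. R is a random stand-in matrix.
import numpy as np

rng = np.random.default_rng(0)
R = rng.random((6, 6))
K = 2  # embedding dimension (illustrative)

U, S, Vt = np.linalg.svd(R)      # singular values in S are sorted descending
u = U[:, :K] * np.sqrt(S[:K])    # center embeddings (rows)
v = Vt[:K].T * np.sqrt(S[:K])    # context embeddings (rows)
approx = u @ v.T                 # best rank-K approximation in Frobenius norm

# With K equal to the matrix size, the reconstruction is exact,
# mirroring the exact-fit case where the dimension equals the number of nodes.
u_full = U * np.sqrt(S)
v_full = Vt.T * np.sqrt(S)
```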
The analytical computation of $R$ is expensive because it scales with the number of node pairs and the window size $T$ Qiu2018 . Alternatively, one can simulate random walks to estimate $R$. Yet, both approaches require $O(N^2)$ space complexity. Here, we reduce the time and space complexity by the block approximation, which approximates the given graph by the dcSBM with a prescribed number of groups and then computes an approximated $R$. The block approximation makes the time and space complexity roughly linear in the numbers of nodes and edges, with a high accuracy (an average Pearson correlation of 0.85 with the exact $R$ for the graphs in Table 2). See Supplementary Information for the block approximation.
Table 2 lists the graphs used in the benchmarks, together with their degree assortativity, maximum degree, clustering coefficient, and source reference.
We test residual2vec using link prediction and community detection benchmarks Lancichinetti2009 ; Newman2006a ; Fortunato2010 ; Grover2016 ; abu2017learning . We use the soft configuration model fosdick2018 as the null graph for residual2vec, denoted by r2v-config, which yields a degree-debiased embedding. The soft configuration model allows self-loops and multi-edges—which are not present in the graphs used in the benchmarks—and thus is not perfectly compatible with the benchmark graphs. Nevertheless, because the multi-edges and self-loops are rare in the case of sparse graphs, the soft configuration model has been widely used for sparse graphs without multi-edges and self-loops Newman2006a ; Newman2014 ; fosdick2018 .
As baselines, we use (i) three random-walk-based methods: node2vec Grover2016 , DeepWalk Perozzi2014 , and FairWalk Rahman2019-uh ; (ii) two matrix-factorization-based methods: GloVe Pennington2014 and Laplacian eigenmap (LEM) Belkin2003 ; and (iii) three graph neural networks: the graph convolutional network (GCN) kipf2017semi , the graph attention network (GAT) Velickovic2018-tl , and GraphSAGE Hamilton2017 . For all random-walk-based methods, we run 10 walkers per node for 80 steps each and set the number of training iterations to 5. We set the node2vec parameters to $p = q = 1$. For GloVe, we input the sentences generated by random walks. We use the two-layer GCN, GraphSAGE, and GAT implemented in the StellarGraph package StellarGraph , with the parameter sets (e.g., the number of layers and activation function) used in kipf2017semi ; Hamilton2017 ; Velickovic2018-tl . Because node features are not available in the benchmarks, we instead use the degree and the eigenvectors associated with the smallest eigenvalues of the normalized Laplacian matrix, which are useful for link prediction and clustering Luxburg2007 ; Kunegis2009 . We set the number of eigenvectors to the embedding dimension. Increasing the number of eigenvectors does not improve performance much (Supplementary Information).
The link prediction task is to find missing edges based on the graph structure, a basic task for applications such as recommending friends and products Grover2016 ; abu2017learning ; zhang2018arbitrary . The task consists of three steps. First, given a graph, a fraction of edges are randomly removed. Second, the edge-removed graph is embedded using a graph embedding method. Third, the removed edges are predicted based on a likelihood score calculated from the generated embedding. In the edge removal process, we keep the edges in a minimum spanning tree of the graph to ensure that the graph remains connected Grover2016 ; abu2017learning , because predicting edges between disconnected components is an ill-defined task: each component has no relation to the others.
We leverage both the embedding and the baseline probability to predict missing edges. Specifically, we calculate the prediction score for a node pair $(i, j)$ as the sum of the embedding similarity $u_i^\top v_j$ and a baseline offset term computed from $P_0$, which has the same unit as $u_i^\top v_j$ for residual2vec (Supplementary Information). GloVe has a bias term that is equivalent to this offset, so we use its bias term as the offset for GloVe. Other methods do not have a parameter that corresponds to the offset, and thus we set the offset to zero for them. We measure the performance by the area under the receiver operating characteristic curve (AUC-ROC) of the prediction scores, with the removed edges and an equal number of randomly sampled non-existent edges as the positive and negative classes, respectively. We perform the benchmark for the graphs in Table 2.
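The AUC-ROC evaluation can be sketched with the rank-based (Mann-Whitney) formulation; the scores and labels below are toy values, not results from the paper:

```python
# Sketch: AUC-ROC of link prediction scores. The AUC equals the probability
# that a randomly chosen positive pair outscores a randomly chosen negative
# pair. Ties are not handled in this minimal version.
import numpy as np

def auc_roc(scores, labels):
    """Rank-based (Mann-Whitney) AUC-ROC."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = (labels == 0).sum()
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy example: removed edges (label 1) vs sampled non-edges (label 0).
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
auc = auc_roc(scores, labels)  # positives all outrank negatives here
```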
r2v-config performs the best or nearly the best for all graphs (Figs. 3A–F). It consistently outperforms the other random-walk-based methods despite the fact that node2vec and r2v-config train the same model. The two methods have two key differences. First, r2v-config uses the degree-proportional baseline $P_0$, whereas node2vec uses a noise distribution that does not exactly fit the degree bias. Second, r2v-config optimizes the model by matrix factorization, which often yields a better embedding than the stochastic gradient descent algorithm used in node2vec Levy2014 ; Qiu2018 . The performance of residual2vec is substantially improved by incorporating the baseline offset, which is itself a strong predictor, as indicated by its high AUC-ROC.
We use the Lancichinetti–Fortunato–Radicchi (LFR) community detection benchmark Lancichinetti2009 . The LFR benchmark generates graphs having groups of densely connected nodes (i.e., communities) with a power-law degree distribution with a prescribed exponent. We set the exponent to generate degree-heterogeneous graphs; see Supplementary Information for the case of degree-homogeneous graphs. In the LFR benchmark, each node has, on average, a specified fraction $\mu$ of neighbors in different communities. We generate graphs with the parameters used in Ref. Lancichinetti2009 and embed them. We evaluate the performance by randomly sampling node pairs and calculating the AUC-ROC for their cosine similarities, with pairs of nodes in the same and in different communities as the positive and negative classes, respectively. A large AUC value indicates that nodes in the same community tend to have a higher similarity than those in different communities.
As $\mu$ increases from zero, the AUC for all methods decreases because nodes have more neighbors in different communities. DeepWalk and LEM have a small AUC value even at small $\mu$. r2v-config consistently achieves the highest or second-highest AUC.
Can debiasing reveal the salient structure of graphs more clearly? We construct a journal citation graph using citation data between 1900 and 2019 indexed in the Web of Science (WoS) (Fig. 4A). Each node represents a pair of a journal and a year, and each undirected edge is weighted by the number of citations between the corresponding journal-year pairs. Because the graph has a high average degree, some algorithms are computationally demanding. For this reason, we omit some node2vec variants, NetMF, and GAT due to memory shortage (1 TB of RAM). Furthermore, we use a reduced setting for GCN, which still took more than 18 hours. We also perform the whole analysis for the directed graph to respect the directionality of citations. Although all methods perform worse in predicting the impact factor and subject category, we find qualitatively the same results; see Supplementary Information for the results for the directed graph.
Here, in addition to the degree bias, there are also temporal biases: for example, there has been an exponential growth in publications, older papers have had more time to accumulate citations, and papers tend to cite those published in the preceding few years Wang2013 . To remove both biases, we use residual2vec with the dcSBM (denoted by r2v-dcSBM), where we group journals by year, so that edges are randomized while the number of citations within and between years is preserved. We then generate the embeddings.
Figures 4B and C show the 2D projection by linear discriminant analysis (LDA) with the journals' subject categories as the class labels. GloVe and node2vec capture the temporal structure prominently, placing many old issues at the center of the embeddings. By contrast, r2v-dcSBM spreads out the old issues over the embedding. To quantify the effect of the temporal bias, we randomly sample node pairs and fit a linear regression model that predicts the cosine similarity for a node pair from the attributes of the two nodes, where the attribute is either the degree or the year of a node. We perform five-fold cross-validation and compute the $R^2$ score (Fig. 4D). A smaller $R^2$ score indicates that node similarity is less dependent on the node attribute and thus less biased. LEM has the smallest $R^2$ score for both degree and year. r2v-dcSBM has a smaller $R^2$ score than r2v-config for year, suggesting that r2v-dcSBM successfully negates the bias due to time.
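The bias probe can be sketched as follows, with synthetic attribute-driven similarities standing in for the real embedding similarities (the data and coefficients are fabricated for illustration only):

```python
# Sketch: regress pairwise cosine similarity on the two endpoints' attributes
# and report R^2. A smaller R^2 means the embedding depends less on the
# attribute (degree or year) and is thus less biased. Synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 500
a_i = rng.random(n_pairs)  # attribute of the first endpoint of each pair
a_j = rng.random(n_pairs)  # attribute of the second endpoint
# Similarities driven mostly by the attributes, plus a little noise.
sim = 0.6 * (a_i + a_j) + 0.1 * rng.standard_normal(n_pairs)

# Ordinary least squares via lstsq (intercept + two attribute terms).
X = np.column_stack([np.ones(n_pairs), a_i, a_j])
beta, *_ = np.linalg.lstsq(X, sim, rcond=None)
pred = X @ beta
r2 = 1 - ((sim - pred) ** 2).sum() / ((sim - sim.mean()) ** 2).sum()
```

For a well-debiased embedding, the analogous `r2` computed on real similarities would be close to zero.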
Is debiasing useful to capture the more relevant structure of graphs? We use the embedding vectors to predict each journal's impact factor (IF) and subject category. Employing the $k$-nearest-neighbor algorithm, we carry out cross-validation and measure the prediction performance by the $R^2$ score (for the IF) and the micro-F1 score (for the subject category). To ensure that the train and test sets do not contain the same journals, we split the set of journals into train and test sets instead of splitting the set of nodes (journal-year pairs). No single method best predicts both the impact factor and the subject category. Yet, r2v-config and r2v-dcSBM consistently achieve the strongest or nearly the strongest prediction power in all the settings we tested (Fig. 4E). This result demonstrates that debiasing an embedding can reveal salient structure of graphs that is otherwise overshadowed by systematic biases.
In this paper, starting from the insight that word2vec with SGNS has a built-in debiasing feature that cancels out the bias due to the degree of nodes, we generalize this debiasing feature further, proposing a method that can selectively remove any structural biases that are modeled by a null random graph. By exposing the bias and explicitly modeling it, we provide a new way to integrate prior knowledge about graphs into graph embedding, and a unifying framework to understand structural graph embedding methods. Under our residual2vec framework, other structural graph embedding methods that use random walks can be understood as special cases with different choices of null models. Through empirical evaluations, we demonstrate that debiasing improves link prediction and community detection performances, and better reveals the characteristics of nodes, as exemplified in the embedding of the WoS journal citation graph.
Our method is highly flexible because any random graph model can be used. Although we focus on two biases arising from degree and group structure in a graph, one can remove other biases such as the degree-degree correlation, clustering, and bipartitivity by considering appropriate null graphs. Beyond these statistical biases, there have been growing concerns about social bias (e.g., gender stereotype) as well as surveillance and privacy in AI applications, which prompted the study of gender and frequency biases in word embedding Bolukbasi2016 ; ethayarajh-etal-2019-towards ; ethayarajh-etal-2019-understanding ; Zhou2021 . The flexibility and power of our selective and explicit debiasing approach may also be useful to address such biases that do not originate from common graph structures.
There are several limitations in residual2vec. First, we assume that random walks have a stationary distribution, which may not be the case for directed graphs; one can ensure stationarity by randomly teleporting walkers Lambiotte2012 . Second, it is not yet clear to what extent debiasing affects downstream tasks (e.g., by losing information about the original graph). Nevertheless, we believe that the ability to understand and control systematic biases is critical for modeling graphs through the prism of embedding.
Although we have not studied social biases in this paper, given the wide usage of graph embedding methods to model social data, our approach may lead to methods and studies that expose and mitigate social biases that manifest as structural properties in graph datasets. Our general idea and approach may also be applied to modeling natural language and may contribute to the study of biases in language models. At the same time, by improving the accuracy of graph embedding, our method may also have negative impacts, such as privacy attacks and the exploitation of personal data (surveillance capitalism) Bose2019 ; Edwards2016 . Nevertheless, we believe that our approach contributes to the effort to create transparent and accountable machine learning methods, especially because it enables us to explicitly model what is structurally expected.
The authors acknowledge support from the Air Force Office of Scientific Research under award number FA9550-19-1-0391.
Wu, J., He, J. & Xu, J. DEMO-Net: Degree-specific graph neural networks for node and graph classification. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 406–415 (New York, NY, USA, 2019).
In Proceedings of the Twenty-Third International Conference on Artificial Intelligence and Statistics, vol. 108, 2687–2697 (2020).
Pennington, J., Socher, R. & Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1532–1543 (Stroudsburg, PA, USA, 2014).
von Luxburg, U. A tutorial on spectral clustering. Statistics and Computing 17, 395–416 (2007).