Low-Norm Graph Embedding

Learning distributed representations for nodes in graphs has become an important problem that underpins a wide spectrum of applications. Existing methods to this problem learn representations by optimizing a softmax objective while constraining the dimension of embedding vectors. We argue that the generalization performance of these methods are probably not due to the dimensionality constraint as commonly believed, but rather the small norm of embedding vectors. Both theoretical and empirical evidences are provided to support this argument: (a) we prove that the generalization error of these methods can be bounded regardless of embedding dimension by limiting the norm of vectors; (b) we show empirically that the generalization performance of existing embedding methods are likely due to the early stopping of stochastic gradient descent. Motivated by our analysis, we propose a new low-norm formulation of the graph embedding problem, which seeks to preserve graph structures while constraining the total squared l_2 norm of embedding vectors. With extensive experiments, we demonstrate that the empirical performance of the proposed method well backs our theoretical analysis. Furthermore, it notably outperforms state-of-the-art graph embedding methods in the tasks of link prediction and node classification.


1 Introduction

Graphs have long been considered as one of the most fundamental structures that can naturally represent interactions between numerous real-life objects (e.g., the Web, social networks, protein-protein interaction networks). Graph embedding, whose goal is to learn distributed representations for nodes while preserving the structure of the given graph, is a fundamental problem in network analysis that underpins many applications. A handful of graph embedding techniques have been proposed in recent years Perozzi2014, Tang2015, GroverL16, along with impressive results in applications like link prediction, text classification Tang2015a, and gene function prediction wang2015exploiting.

Linear graph embedding methods preserve graph structures by converting the inner products of the node embeddings into probability distributions with a softmax function Perozzi2014, Tang2015, GroverL16. Since the exact softmax objective is computationally expensive to optimize, the negative sampling technique mikolov2013distributed is often used in these methods: instead of optimizing the softmax objective function, we try to maximize the probability of positive instances while minimizing the probability of some randomly sampled negative instances. It has been shown that with this negative sampling technique, these graph embedding methods are essentially computing a factorization of the adjacency (or proximity) matrix of the graph levy2014neural. Hence, it is commonly believed that the key to the generalization performance of these methods is the dimensionality constraint.

However, in this paper we argue that the key factor to the good generalization of these embedding methods is not the dimensionality constraint, but rather the small norm of embedding vectors. We provide both theoretical and empirical evidence to support this argument:


• Theoretically, we analyze the generalization error of two linear graph embedding hypothesis spaces (restricting embedding dimension/norm), and show that only the norm-restricted hypothesis class can theoretically guarantee good generalization in typical parameter settings.

• Empirically, we show that the success of existing linear graph embedding methods Perozzi2014, Tang2015, GroverL16 is due to the early stopping of stochastic gradient descent (SGD), which implicitly restricts the norm of the embedding vectors. Furthermore, with prolonged SGD execution and no proper norm regularization, the embedding vectors can severely overfit the training data.

Paper Outline

The rest of this paper is organized as follows. In Section 3, we review the definition of the graph embedding problem and the general framework of linear graph embedding. In Section 4, we present both theoretical and empirical evidence to support our argument that the generalization of embedding vectors is determined by their norm. In Section 5, we present additional experimental results for a hinge-loss linear graph embedding variant, which further support our argument. In Section 6, we discuss the new insights that we gained from previous results. Finally, in Section 8, we conclude our paper. Details of the experiment settings, algorithm pseudo-codes, theorem proofs and the discussion of other related work can all be found in the appendix.

2 Other Related Work

Classical graph embedding algorithms such as multidimensional scaling (MDS) [kruskal1978multidimensional], IsoMap [tenenbaum2000global], and Laplacian Eigenmap [belkin2001laplacian] typically construct an affinity graph from the features of the given data points, and then look for low-dimensional representations of the data points based on that graph. These methods typically require finding the eigenvectors of the affinity matrix, making them hard to scale to large graphs with millions of nodes.

In recent years, the success of Word2Vec [mikolov2013distributed] has inspired a handful of neural-network-based techniques for graph embedding:


• Perozzi et al. [Perozzi2014] propose the DeepWalk method, which first uses random walks to generate a large number of paths from the graph, and then applies the SkipGram model to learn vectorized representations for graph nodes.

• Tang et al. [Tang2015] use node embeddings to model the neighborhood distributions of graph nodes, and learn the embeddings such that the embedding-based distributions align well with the empirical distributions.

• Grover and Leskovec [GroverL16] propose the Node2Vec method, which uses a biased random walk procedure to obtain expanded neighborhoods for graph nodes instead of the observed neighbors.

All of these methods fall within the generic framework of linear graph embedding, with differences primarily on the construction of proximity matrix, so our analysis applies to all these methods.

Apart from the linear graph embedding techniques listed above, there are also studies [li2014lrbm, TianGCCL14, LiTBZ15, wang2016structural] that use deep neural network architectures to preserve graph structures. However, these methods usually require long training times and are sensitive to neural network hyper-parameters. Moreover, it is very difficult to theoretically analyze these methods. Hence, it remains unclear what level of generalization accuracy they can achieve.

Learning embedding vectors with norm constraints and hinge-loss has been proposed by Srebro et al. [srebro2005maximum] as an alternative to low-rank factorization (i.e., SVD). However, Srebro et al. constrain the trace norm of the reconstructed proximity matrix, and the resulting objective function requires a generic semi-definite programming (SDP) solver, which has severe scalability issues.

Dual Coordinate Descent (DCD) [HsiehCLKS08] is an optimization algorithm originally designed for solving SVM. Due to the connection between SVM and our low-norm embedding objective, it can also be adapted as a building block for learning low-norm graph embedding (details will be discussed in this paper).

3 Preliminaries

3.1 The Graph Embedding Problem

We consider a graph $G = (V, E)$, where $V$ is the set of nodes in $G$, and $E$ is the set of edges between the nodes in $V$. For any two nodes $u, v \in V$, an edge $(u, v) \in E$ if $u$ and $v$ are connected, and we assume all edges are unweighted and undirected for simplicity. (All linear graph embedding methods discussed in this paper can be generalized to the weighted case by multiplying the weight into the corresponding loss function of each edge. The directed case is usually handled by associating each node with two embedding vectors, for incoming and outgoing edges respectively, which is equivalent to learning embeddings on a transformed undirected bipartite graph.) The task of graph embedding is to learn a $d$-dimensional vector representation $x_u \in \mathbb{R}^d$ for each node $u \in V$ such that the structure of $G$ can be maximally preserved. These embedding vectors can then be used as features for subsequent applications (e.g., node label classification or link prediction).

3.2 The Linear Graph Embedding Framework

Linear graph embedding Tang2015, GroverL16 is one of the two major approaches for computing graph embeddings. (The other major approach is to use deep neural network structures to compute the embedding vectors; see the discussion of other related work in the appendix for details.) These methods use the inner products of embedding vectors to capture the likelihood of edge existence, and are appealing to practitioners due to their simplicity and good empirical performance. Formally, given a node $u$ and its neighborhood $N_+(u)$ (which can be either the set of direct neighbors in the original graph $G$ Tang2015, or an expanded neighborhood based on measures like random walks GroverL16), the probability of observing node $v$ being a neighbor of $u$ is defined as:

 $$p(v|u) = \frac{\exp(x_u^T x_v)}{\sum_{k \in V} \exp(x_u^T x_k)}.$$

By minimizing the KL-divergence between the embedding-based distribution and the actual neighborhood distribution, the overall objective function is equivalent to:

 $$\mathcal{L} = -\sum_{u \in V} \sum_{v \in N_+(u)} \log p(v|u)$$

Unfortunately, it is quite problematic to optimize this objective function directly, as the softmax term involves normalizing over all vertices. To address this issue, the negative sampling mikolov2013distributed technique is used to avoid computing gradients over the full softmax function. Intuitively, the negative sampling technique can be viewed as randomly selecting a set of nodes that are not connected to each node as its negative neighbors. The embedding vectors are then learned by minimizing the following objective function instead:

 $$\mathcal{L} = -\sum_{u} \sum_{v \in N_+(u)} \log \sigma(x_u^T x_v) - \sum_{u} \sum_{v \in N_-(u)} \frac{\kappa |N_+(u)|}{|N_-(u)|} \log \sigma(-x_u^T x_v). \qquad (1)$$
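As a concrete reference, the objective in Eqn (1) can be sketched in a few lines of numpy. This is a minimal illustration rather than the authors' implementation; the constant `kappa_ratio` stands in for the weighting term $\kappa |N_+(u)| / |N_-(u)|$, which we assume here to be uniform across nodes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(X, pos_pairs, neg_pairs, kappa_ratio=1.0):
    """Negative-sampling objective of Eqn (1).

    X           -- (n, d) array; row u is the embedding vector x_u.
    pos_pairs   -- iterable of (u, v) pairs with v in N+(u).
    neg_pairs   -- iterable of (u, v) pairs with v in N-(u).
    kappa_ratio -- stand-in for kappa * |N+(u)| / |N-(u)| (assumed constant).
    """
    loss = 0.0
    for u, v in pos_pairs:
        loss -= np.log(sigmoid(X[u] @ X[v]))                  # positive term
    for u, v in neg_pairs:
        loss -= kappa_ratio * np.log(sigmoid(-(X[u] @ X[v])))  # negative term
    return loss
```

Note that the loss decreases when embeddings of positive pairs align and embeddings of negative pairs point away from each other, which is exactly what SGD on Eqn (1) drives toward.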

3.3 The Matrix Factorization Interpretation

Although the embedding vectors learned through negative sampling do have good empirical performance, there has been little theoretical analysis of this technique that explains its good empirical performance. The most well-known analysis of negative sampling was done by levy2014neural, which claims that the embedding vectors are approximating a low-rank factorization of the PMI (Pointwise Mutual Information) matrix.

More specifically, the key discovery of levy2014neural is that when the embedding dimension is large enough, the optimal solution to Eqn (1) recovers exactly the PMI matrix, up to the shifted constant $\log \kappa$:

 $$\forall u, v, \quad x_u^T x_v = \log\left(\frac{|E| \cdot \mathbb{1}[(u,v) \in E]}{|N_+(u)|\,|N_+(v)|}\right) - \log \kappa$$

Based on this result, levy2014neural suggest that optimizing Eqn (1) under the dimensionality constraint is equivalent to computing a low-rank factorization of the shifted PMI matrix. This is currently the mainstream opinion regarding the intuition behind negative sampling. Although Levy and Goldberg only analyzed negative sampling in the context of word embedding, it is commonly believed that the same conclusion also holds for graph embedding qiu2018network.
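To make the matrix being factorized concrete, here is a small numpy sketch that builds the shifted matrix above from an undirected edge list. This is our illustration of the stated formula; the function name and interface are not from the original paper:

```python
import numpy as np

def shifted_pmi_matrix(edges, n, kappa=1.0):
    """Shifted PMI-style matrix from the formula above:
        M[u, v] = log(|E| * 1[(u,v) in E] / (|N+(u)| |N+(v)|)) - log(kappa).
    Non-edges get log(0) = -inf, mirroring the indicator in the numerator."""
    A = np.zeros((n, n))
    deg = np.zeros(n)
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
        deg[u] += 1.0
        deg[v] += 1.0
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log(len(edges) * A / np.outer(deg, deg)) - np.log(kappa)
```

Restricting the rank of an approximate factorization of this matrix corresponds to the dimensionality constraint; the low-norm view introduced below instead restricts the total squared norm of the factors.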

4 Generalization Analysis of Linear Graph Embedding Methods

The first major point of this paper is to argue that the generalization of the linear graph embedding methods in Section 3 is more likely determined by the norm of the embedding vectors than by their dimension. We begin by providing a theoretical generalization error analysis of linear graph embedding that characterizes the importance of restricting the norm of the vectors. We consider the following statistical model for graph generation: assume that there exists an unknown probability distribution $\mathcal{D}$ over the Cartesian product $U \times V$ of two vertex sets $U$ and $V$. Each sample $(a, b)$ from $\mathcal{D}$ denotes an edge connecting $a \in U$ and $b \in V$. The graph that we observe consists of the first $m$ i.i.d. samples from the distribution $\mathcal{D}$, and the goal is to use these samples to learn a model that generalizes well to the underlying distribution $\mathcal{D}$. Note that in the above notation, we allow either $U = V$ for homogeneous graphs or $U \cap V = \emptyset$ for bipartite graphs.

Define $\mathcal{D}_-$ to be the uniform distribution over $U \times V$ for generating the negative edges, and let $\mathcal{P}$ be the combined distribution over $U \times V \times \{\pm 1\}$ that generates both positive and negative edges (i.e., $y = +1$ indicates that $(a, b)$ is sampled from $\mathcal{D}$, and $y = -1$ indicates that it is sampled from $\mathcal{D}_-$). Intuitively, a good graph embedding should be able to distinguish whether a future sample from $\mathcal{P}$ is actually drawn from $\mathcal{D}$ or $\mathcal{D}_-$. Using the above notation, we have the following theorem, which bounds the generalization error of a linear embedding model on the link prediction task:

Theorem 1.

Let $(a_1, b_1, y_1), \dots, (a_{m+m'}, b_{m+m'}, y_{m+m'})$ be i.i.d. samples from the distribution $\mathcal{P}$ over $U \times V \times \{\pm 1\}$. Let $x_1, \dots, x_n$ be the embedding vectors for nodes in $U$ and $t_1, \dots, t_k$ be the embedding vectors for nodes in $V$. Then for any loss function $L$ that is Lipschitz continuous and bounded by $C$, with probability $1 - \delta$:

 $$\forall x, t \;\text{s.t.}\; \sum_{i=1}^{n} \|x_i\|^2 \le C_x, \; \sum_{j=1}^{k} \|t_j\|^2 \le C_t:$$
 $$\mathbb{E}_{(a,b,y)\sim\mathcal{P}}\, L(y\, x_a^T t_b) \le \frac{1}{m+m'}\sum_{i=1}^{m+m'} L(y_i\, x_{a_i}^T t_{b_i}) + \frac{2}{m+m'}\,\mathbb{E}_\sigma \|A_\sigma\|_2 \sqrt{C_x C_t} + 4C\sqrt{\frac{2\ln(4/\delta)}{m+m'}}$$

where $\|A_\sigma\|_2$ is the spectral norm of the random matrix $A_\sigma$ defined as follows:

 $$A_\sigma(i,j) = \begin{cases} \sigma_{ij} & \exists y, (i,j,y) \in E \\ 0 & \forall y, (i,j,y) \notin E \end{cases}$$

in which the $\sigma_{ij}$ are i.i.d. Rademacher random variables.

The proof of this theorem can be found in the appendix. As we can see, our generalization error analysis does not depend on the embedding dimension. Therefore, we can enjoy good generalization performance by ensuring the total squared norm of embedding vectors is small, even if the embedding dimension is very large.
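Because the bound depends on the observed edge set only through $\mathbb{E}_\sigma \|A_\sigma\|_2$, this quantity can be estimated numerically for any given graph by direct sampling. A quick numpy sketch (ours, for illustration):

```python
import numpy as np

def expected_spectral_norm(mask, trials=100, seed=0):
    """Monte-Carlo estimate of E_sigma ||A_sigma||_2, where A_sigma puts an
    i.i.d. Rademacher sign on every observed pair (mask[i, j] == 1) and
    zero elsewhere, matching the definition in Theorem 1."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        signs = rng.choice([-1.0, 1.0], size=mask.shape)
        total += np.linalg.norm(mask * signs, 2)  # spectral norm = top singular value
    return total / trials
```

For large sparse graphs one would replace the dense spectral norm computation with an iterative method, but the estimator itself is unchanged.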

For homogeneous graphs, we have the following corollary, which is obtained by setting $U = V$ and $t = x$ in Theorem 1:

Corollary 1.

Let $(a_1, b_1, y_1), \dots, (a_{m+m'}, b_{m+m'}, y_{m+m'})$ be i.i.d. samples from a distribution $\mathcal{P}$ over $V \times V \times \{\pm 1\}$. Let $x_1, \dots, x_n$ be the embedding vectors for nodes in $V$; then for any loss function $L$ that is Lipschitz continuous and bounded by $C$, with probability $1 - \delta$:

 $$\forall x \;\text{s.t.}\; \sum_{i=1}^{n} \|x_i\|^2 \le C_x: \quad \mathbb{E}_{(a,b,y)\sim\mathcal{P}}\, L(y\, x_a^T x_b) \le \frac{1}{m+m'}\sum_{i=1}^{m+m'} L(y_i\, x_{a_i}^T x_{b_i}) + \frac{2}{m+m'}\,\mathbb{E}_\sigma \|A_\sigma\|_2\, C_x + 4C\sqrt{\frac{2\ln(2/\delta)}{m+m'}}$$

We suspect that the vector norm is also the key factor in the generalization of existing graph embedding methods: although not explicitly regularized, the norms of the embedding vectors remain small due to the relatively few iterations of SGD executed on each vector. (Most graph embedding experiments are conducted on large-scale graphs; therefore, even though the total number of iterations is set to be large, the number of iterations executed on each individual embedding vector is small.) The empirical evidence for this argument can be found in Section 7: if we keep running the SGD procedure in LINE [Tang2015], the norms of the embedding vectors continue to increase, and the generalization performance starts to drop.

5 Demonstrating the Importance of Norm Regularization via Hinge-Loss Linear Graph Embedding

In this section, we present the experimental results for a non-standard linear graph embedding formulation, which optimizes the following objective:

 $$\mathcal{L} = \lambda_{+1}\sum_{(u,v)\in E_+} h(x_u^T x_v) + \lambda_{-1}\sum_{(u,v)\in E_-} h(-x_u^T x_v) + \frac{\lambda_r}{2}\sum_{v \in V} \|x_v\|_2^2 \qquad (2)$$

By replacing the logistic loss with the hinge loss, it is now possible to apply the dual coordinate descent (DCD) method HsiehCLKS08 for optimization, which circumvents the issue of vanishing gradients in SGD, allowing us to directly observe the impact of norm regularization. More specifically, consider all terms in Eqn (2) that are relevant to a particular vertex $u$:

 $$\mathcal{L}(u) = \sum_{(x_i, y_i) \in \mathcal{D}_u} \frac{\lambda_{y_i}}{\lambda_r} \max(1 - y_i\, x_u^T x_i,\, 0) + \frac{1}{2}\|x_u\|^2. \qquad (3)$$

in which we define $\mathcal{D}_u$ to be the collection of $u$'s positive and negative neighbors' embedding vectors $x_i$ together with their labels $y_i \in \{\pm 1\}$. Since Eqn (3) takes the same form as a soft-margin linear SVM objective, with $x_u$ being the linear coefficients and $\mathcal{D}_u$ being the training data, we can use any SVM solver to optimize Eqn (3), and then apply it asynchronously over the graph vertices to update their embeddings. The pseudo-code for the optimization procedure using DCD can be found in the appendix.
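The per-vertex sub-problem can be sketched as follows. This is our minimal illustration of a DCD update in the style of Hsieh et al., not the authors' pseudo-code; `C_pos` and `C_neg` stand in for the box constraints induced by $\lambda_{+1}/\lambda_r$ and $\lambda_{-1}/\lambda_r$:

```python
import numpy as np

def dcd_solve_vertex(data, labels, C_pos=1.0, C_neg=1.0, epochs=20):
    """Solve the SVM-shaped sub-problem of Eqn (3) for one vertex u by dual
    coordinate descent: `data` holds the neighbor embeddings x_i (rows) and
    `labels` the corresponding y_i in {+1, -1}. Returns the new x_u."""
    alpha = np.zeros(len(data))
    w = np.zeros(data.shape[1])        # maintained as w = sum_i alpha_i y_i x_i
    for _ in range(epochs):
        for i in range(len(data)):
            xi, yi = data[i], labels[i]
            q = xi @ xi
            if q == 0.0:
                continue
            grad = yi * (w @ xi) - 1.0             # gradient of the dual objective
            box = C_pos if yi > 0 else C_neg
            new_alpha = min(max(alpha[i] - grad / q, 0.0), box)
            w += (new_alpha - alpha[i]) * yi * xi  # incremental update of w
            alpha[i] = new_alpha
    return w
```

Each coordinate step has a closed-form clipped solution, which is why DCD avoids the step-size and vanishing-gradient issues of SGD on this objective.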

Impact of Regularization Coefficient: Figure 1 shows the generalization performance of embedding vectors obtained from the DCD procedure after a fixed number of epochs. As we can see, the quality of the embedding vectors is very poor when the regularization coefficient $\lambda_r$ is too small, indicating that proper norm regularization is necessary for generalization. The value of $\lambda_r$ also affects the gap between training and testing performance, which is consistent with our analysis that $\lambda_r$ controls the model capacity of linear graph embedding.

Impact of Embedding Dimension Choice: The choice of embedding dimension, on the other hand, is not very impactful, as demonstrated in Figure 2: as long as the embedding dimension $d$ is reasonably large, its exact value has very little effect on the generalization performance, even for extremely large settings of $d$. These results are consistent with our theory that the generalization of linear graph embedding is primarily determined by the norm constraints.

6 Discussion

So far, we have seen many pieces of evidence supporting our argument, suggesting that the generalization of embedding vectors in linear graph embedding is determined by the vector norm. Intuitively, it means that these embedding methods are trying to embed the vertices onto a small sphere centered around the origin. The radius of the sphere controls the model capacity, and choosing a proper embedding dimension allows us to control the trade-off between the expressive power of the model and computational efficiency.

Note that the connection between norm regularization and generalization performance is actually very intuitive. To see this, let us consider the semantic meaning of embedding vectors: the probability of any particular edge $(u, v)$ being positive is equal to

 $$\Pr(y = 1 \mid u, v) = \sigma(x_u^T x_v) = \sigma\left(\frac{x_u^T x_v}{\|x_u\|_2 \|x_v\|_2} \cdot \|x_u\|_2 \|x_v\|_2\right)$$

As we can see, this probability value is determined by three factors:

• $\frac{x_u^T x_v}{\|x_u\|_2 \|x_v\|_2}$, the cosine similarity between $x_u$ and $x_v$, evaluates the degree of agreement between the directions of $x_u$ and $x_v$.

• $\|x_u\|_2$ and $\|x_v\|_2$, on the other hand, reflect the degree of confidence we have regarding the embedding vectors of $u$ and $v$.

Therefore, by restricting the norm of embedding vectors, we are limiting the confidence level that we have regarding the embedding vectors, which is indeed intuitively helpful for preventing overfitting.
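This decomposition is easy to check numerically: two vector pairs with identical cosine similarity but different norms yield very different confidence levels. A toy illustration:

```python
import numpy as np

def edge_probability(xu, xv):
    """Pr(y=1 | u, v) = sigma(xu . xv), as in the decomposition above."""
    return 1.0 / (1.0 + np.exp(-(xu @ xv)))

# Same direction (cosine similarity 1) at two different norm scales:
u = np.array([1.0, 0.0])
p_small = edge_probability(u, u)          # sigma(1): moderate confidence
p_large = edge_probability(3 * u, 3 * u)  # sigma(9): near-certain
```

Scaling both vectors up leaves the cosine similarity untouched but pushes the predicted probability toward 0 or 1, which is precisely the "confidence" that norm regularization keeps in check.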

It is worth noting that our results in this paper do not invalidate the analysis of levy2014neural, but rather clarify some key points: as pointed out by levy2014neural, linear graph embedding methods are indeed approximating the factorization of PMI matrices. However, as we have seen in this paper, the embedding vectors are primarily constrained by their norm instead of the embedding dimension, which implies that the resulting factorization is not really a standard low-rank one, but rather a low-norm factorization:

 $$x_u^T x_v \approx \mathrm{PMI}(u, v) \quad \text{s.t.} \quad \sum_u \|x_u\|_2^2 \le C$$

The low-norm factorization represents an interesting alternative to the standard low-rank factorization, of which our current understanding is still very limited. Given the empirical success of linear graph embedding methods, it would be very helpful to have a more in-depth analysis of such factorizations, to deepen our understanding and potentially inspire new algorithms.
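As a starting point for experimenting with such factorizations, here is a small projected-gradient sketch (ours; a naive illustration under these assumptions, not a competitive algorithm) that fits $X X^T$ to the observed entries of a symmetric matrix $M$ while enforcing $\sum_u \|x_u\|_2^2 \le C$:

```python
import numpy as np

def low_norm_factorization(M, d, C, lr=0.1, steps=300, seed=0):
    """Fit X @ X.T to the observed (non-nan) entries of symmetric M, then
    project X back onto the ball sum_u ||x_u||_2^2 <= C after every step."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    X = rng.normal(scale=0.1, size=(n, d))
    observed = ~np.isnan(M)
    target = np.where(observed, M, 0.0)
    for _ in range(steps):
        R = (X @ X.T - target) * observed     # residual on observed entries only
        X = X - lr * (2.0 / n) * (R @ X)      # gradient of 0.5 * ||R||_F^2
        total = float(np.sum(X * X))
        if total > C:
            X *= np.sqrt(C / total)           # projection onto the norm ball
    return X
```

Note that the constraint couples all rows through one global budget, unlike a rank constraint; the projection step is a single rescaling because the feasible set is a Euclidean ball in the stacked coordinates.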

7 Experiments

In this section, we conduct several experiments to further validate the importance of restricting vector norm in linear graph embedding methods. We also compare the empirical performance of the proposed low-norm graph embedding methods against the state-of-the-art graph embedding methods and several other baselines on two tasks: link prediction and node label classification.

7.1 Experimental Settings

Data Sets: We use the following three datasets in our experiments:

• Tweet is an undirected graph that encodes keyword co-occurrence relationships in Twitter data. To construct this graph, we collected 1.1 million English tweets using Twitter’s Streaming API during August 2014. From the collected tweets, we extracted the most frequent 10,000 keywords as graph nodes and their co-occurrences as edges. All nodes with more than 2,000 neighbors are then removed as stop words. There are 9,913 nodes and 681,188 edges in total.

• BlogCatalog [Zafarani+Liu:2009] is an undirected graph that contains the social relationships between BlogCatalog users. It consists of 10,312 nodes and 333,983 undirected edges, and each node belongs to one of the 39 groups.

• YouTube [mislove-2007-socialnetworks] is a social network among YouTube users. It includes 500,000 nodes and 3,319,221 undirected edges. (Available at http://socialnetworks.mpi-sws.org/data-imc2007.html. We only used the subgraph induced by the first 500,000 nodes, since our machine does not have sufficient memory for training on the whole graph. The original graph is directed, but we treat it as undirected as in [Tang2015].) There are also 30,085 groups among these users.

Methods: The following variants of linear embedding methods are evaluated in our experiments:

1. LogSig uses the SGD algorithm to optimize the standard linear graph embedding objective (Eqn (1)) as in LINE [Tang2015].

2. LowNorm uses the DCD algorithm to optimize the low-norm objective function (Eqn (2)).

3. LNPair uses the DCD algorithm to optimize a pairwise variant of the low-norm objective function (defined in the appendix).

4. LNKernel optimizes the dual form of low-norm objective function over the kernel matrix instead of explicit embeddings.

Additionally, we also compare against the following baselines:

1. node2vec [GroverL16] is a state-of-the-art graph embedding method. Compared with LINE [Tang2015], it uses random walks to sample nodes as positive neighborhoods for graph nodes, while LINE uses direct neighbors in the original graph.

2. CommonNeighbor [Liben-NowellK03] is a simple yet strong baseline for link prediction. Given two nodes $u$ and $v$, it computes the score between $u$ and $v$ as $|N(u) \cap N(v)|$, namely the number of common neighbors between $u$ and $v$.

3. SVD computes the Singular Value Decomposition of the adjacency matrix to obtain low-dimensional representations for graph nodes.

4. GRF [zhu2003semi] is a semi-supervised learning method for node label classification on graphs. It first sets the score of each labeled node to $1$ or $0$, then repeatedly updates the score of each unlabeled node as the average score of the nodes in its neighborhood until convergence.

For node2vec, we obtained the implementation from the original paper. For SVD, we used the Randomized SVD algorithm [halko2011finding] in Scikit-learn [pedregosa2011scikit]. All other methods we implemented in C++. Each of the three datasets is partitioned into three parts, for training, validation and testing respectively, with a size ratio of 2:1:1. We conducted all experiments on a Windows machine with a Xeon E3 3.4GHz CPU and 8GB memory. Other details about the experimental protocols and parameter settings are omitted here due to page limitations, and can be found in the appendix of our technical report [TechReport].

7.2 Additional Experiments on Importance of Vector Norm Restriction

In this section, we conduct several experiments to evaluate LogSig and the low-norm methods on the BlogCatalog and YouTube datasets. These experiments complement the ones in Sections 4 and 5 to further validate the importance of restricting vector norm for linear graph embedding methods.

In the first set of experiments, we further demonstrate the early stopping effect of SGD. As in Section 4, we run the SGD algorithm to optimize the LogSig objective and stop the iterations after a certain period of time. Figures 2(a) and 2(b) show the link prediction AP score and the node classification macro-F1 score on BlogCatalog, respectively. As we can see, the generalization AP score for link prediction peaks after relatively few epochs, similar to what we observed in Section 4. The early stopping effect on the macro-F1 score for node label classification is less apparent, probably because the post-processing SVM step compensates for it. However, we still observed the performance drop after sufficiently many epochs (not shown in the figure).

The next set of experiments compares the efficiency of LogSig (SGD) and LowNorm (DCD) on the BlogCatalog and YouTube datasets. Figure 4 shows the testing AP score over the course of the training procedure for both datasets. As we can see, the DCD algorithm converges faster than the SGD algorithm on both datasets, reaching optimal performance well before the SGD algorithm has completely converged.

In the final set of experiments, we demonstrate the importance of choosing an appropriate value for the norm regularization parameter $\lambda_r$. Figure 5 shows the testing AP score of LowNorm with varying values of $\lambda_r$ on the BlogCatalog and YouTube datasets. Similar to what we observed in Section 5, the regularization parameter $\lambda_r$ controls the model capacity, which affects the generalization performance. Therefore, finding the optimal value of $\lambda_r$ is the key to achieving good generalization performance for low-norm embedding methods.

7.3 Link Prediction Performance

Table 1 shows the performance of all methods on the link prediction task. (We are unable to report the performance of node2vec, GRF and LNKernel on YouTube, since they do not scale to this large dataset.) As we can see, the linear graph embedding methods consistently outperform all baselines. Among the four linear graph embedding methods, LNPair provides slightly better performance across all datasets. LogSig slightly outperforms LowNorm on the Tweet and BlogCatalog datasets, but achieving this performance requires both the optimal parameter configuration and sufficiently long training time. As a result, LogSig would be considerably worse than LowNorm and LNPair on larger datasets (such as YouTube), since parameter tuning and actual training are very costly in such scenarios.

7.4 Node Label Classification Performance

Table 2 shows the performance on the node label classification task. On the BlogCatalog dataset, node2vec achieves the best performance overall, which confirms the earlier finding in [GroverL16] that adding additional positive neighbors generated from random walks improves label classification accuracy. We remark that the technique used in node2vec [GroverL16] is also applicable to our low-norm formulation, and hence it is promising that it can lead to a similar performance improvement there. Overall, the linear graph embedding methods significantly outperform all other baselines in F1-score on both BlogCatalog and YouTube. Among the four linear graph embedding variants, LNPair performs slightly better across all datasets.

7.4.1 Efficiency Comparison

Figure 6 demonstrates the empirical comparison of training efficiency between the SGD and DCD algorithms (both with optimal parameter configuration). For fair comparison, the x-axis is the total running time (in seconds) and the y-axis shows the link prediction AP score on the testing dataset. As we can see, the DCD algorithm converges much faster than the SGD algorithm within the measured time window. Note that with sufficiently many additional epochs (not shown in the figure), the SGD algorithm could achieve similar performance as the DCD algorithm, but it requires significantly longer training time.

8 Conclusion

We have shown that the generalization of linear graph embedding methods is determined not by the dimensionality constraint but rather by the norm of the embedding vectors. We proved that limiting the norm of embedding vectors leads to good generalization, and showed that the generalization of existing linear graph embedding methods is due to the early stopping of SGD and vanishing gradients. We experimentally investigated the impact of the embedding dimension choice, and demonstrated that this choice only matters when there is no norm regularization. In most cases, the best generalization performance is obtained by choosing the optimal value for the norm regularization coefficient, in which case the impact of the embedding dimension is negligible. Our findings, combined with the analysis of levy2014neural, suggest that linear graph embedding methods are probably computing a low-norm factorization of the PMI matrix, which is an interesting alternative to the standard low-rank factorization and calls for further study.

Appendix

Datasets and Experimental Protocols

We use the following three datasets in our experiments:

• Tweet is an undirected graph that encodes keyword co-occurrence relationships using Twitter data: we collected 1.1 million English tweets using Twitter’s Streaming API during 2014 August, and then extracted the most frequent 10,000 keywords as graph nodes and their co-occurrences as edges. All nodes with more than 2,000 neighbors are removed as stop words. There are 9,913 nodes and 681,188 edges in total.

• BlogCatalog Zafarani+Liu:2009 is an undirected graph that contains the social relationships between BlogCatalog users. It consists of 10,312 nodes and 333,983 undirected edges, and each node belongs to one of the 39 groups.

• YouTube mislove-2007-socialnetworks is a social network among YouTube users. It includes 500,000 nodes and 3,319,221 undirected edges. (Available at http://socialnetworks.mpi-sws.org/data-imc2007.html. We only used the subgraph induced by the first 500,000 nodes, since our machine does not have sufficient memory for training on the whole graph. The original graph is directed, but we treat it as undirected as in Tang2015.)

For each positive edge in the training and testing datasets, we randomly sampled negative edges, which are used for learning the embedding vectors (in the training dataset) and evaluating average precision (in the testing dataset). In all experiments, we use the parameter setting that achieves the optimal generalization performance according to cross-validation. All initial coordinates of embedding vectors are uniformly sampled from a fixed small interval.

Other Related Works

In the early days of graph embedding research, graphs were used only as an intermediate data model for visualization kruskal1978multidimensional or non-linear dimensionality reduction tenenbaum2000global, belkin2001laplacian. Typically, the first step is to construct an affinity graph from the features of the data points, and then the low-dimensional embedding of the graph vertices is computed by finding the eigenvectors of the affinity matrix.

Among more recent graph embedding techniques, apart from the linear graph embedding methods discussed in this paper, there are also methods wang2016structural, kipf2016semi, hamilton2017inductive that explore the option of using deep neural network structures to compute the embedding vectors. These methods typically learn a deep neural network model that takes the raw features of graph vertices as input and computes their low-dimensional embedding vectors: SDNE wang2016structural uses the adjacency lists of vertices as input to predict their Laplacian eigenmaps; GCN kipf2016semi aggregates the outputs of neighboring vertices in the previous layer to serve as input to the current layer (hence the name “graph convolutional network”); GraphSage hamilton2017inductive extends GCN by allowing other forms of aggregators (in addition to the mean aggregator in GCN). Interestingly though, all of these methods use only a small number of neural network layers in their experiments, and there is also evidence suggesting that using a larger number of layers results in worse generalization performance kipf2016semi. Therefore, it remains unclear whether deep neural network structures are really helpful for the task of graph embedding.

Prior to our work, several existing results already suggested that norm-constrained graph embedding could generalize well. srebro2005maximum studied the problem of computing norm-constrained matrix factorization, and reported superior performance compared to standard low-rank matrix factorization on several tasks. Given the connection between matrix factorization and linear graph embedding levy2014neural, the results in our paper are perhaps not that surprising.

Proof of Theorem 1

Since the $m+m'$ samples are i.i.d. draws from $P$, by the uniform convergence theorem bartlett2002rademacher, shalev2014understanding, with probability at least $1-\delta$:

 \forall x \text{ s.t. } \sum_{u\in U}\|x_u\|^2\le C_U,\ \sum_{v\in V}\|x_v\|^2\le C_V:\quad \mathbb{E}_{(a,b,y)\sim P}\, l(y\, x_a^T x_b) \le \frac{1}{m+m'}\sum_{i=1}^{m+m'} l(y_i x_{a_i}^T x_{b_i}) + 2R(\mathcal{H}_{C_U,C_V}) + 4B\sqrt{\frac{2\ln(4/\delta)}{m+m'}}

where $\mathcal{H}_{C_U,C_V}$ is the hypothesis set of all embeddings satisfying the two norm constraints above, and $R(\mathcal{H}_{C_U,C_V})$ is the empirical Rademacher complexity of $\mathcal{H}_{C_U,C_V}$, which has the following explicit form:

 R(\mathcal{H}_{C_U,C_V}) = \frac{1}{m+m'}\,\mathbb{E}_{\sigma_{a,b}\sim\{-1,1\}}\ \sup_{x\in\mathcal{H}_{C_U,C_V}} \sum_i \sigma_{a_i,b_i}\, l(y_i x_{a_i}^T x_{b_i})

Here the $\sigma_{a_i,b_i}$ are i.i.d. Rademacher random variables: $\Pr(\sigma_{a_i,b_i}=1)=\Pr(\sigma_{a_i,b_i}=-1)=1/2$. Since $l$ is $1$-Lipschitz, based on the Contraction Lemma shalev2014understanding, we have:

 R(\mathcal{H}_{C_U,C_V}) \le \frac{1}{m+m'}\,\mathbb{E}_{\sigma_{a,b}\sim\{-1,1\}} \sup_{x\in\mathcal{H}_{C_U,C_V}} \sum_i \sigma_{a_i,b_i}\, y_i\, x_{a_i}^T x_{b_i} = \frac{1}{m+m'}\,\mathbb{E}_{\sigma_{a,b}\sim\{-1,1\}} \sup_{x\in\mathcal{H}_{C_U,C_V}} \sum_i \sigma_{a_i,b_i}\, x_{a_i}^T x_{b_i}

where the labels $y_i$ can be dropped because $\sigma_{a_i,b_i} y_i$ has the same distribution as $\sigma_{a_i,b_i}$.

Let $X_U$ denote the $|U|d$-dimensional vector obtained by concatenating all vectors $x_u$ for $u\in U$, and $X_V$ the $|V|d$-dimensional vector obtained by concatenating all vectors $x_v$ for $v\in V$:

 X_U = (x_{u_1}, x_{u_2}, \ldots, x_{u_{|U|}}), \qquad X_V = (x_{v_1}, x_{v_2}, \ldots, x_{v_{|V|}})

Then we have:

 \|X_U\|_2 \le \sqrt{C_U}, \qquad \|X_V\|_2 \le \sqrt{C_V}

The next step is to rewrite the term $\sup_{x}\sum_i \sigma_{a_i,b_i}\, x_{a_i}^T x_{b_i}$ in matrix form:

 \sup_{x\in\mathcal{H}_{C_U,C_V}} \sum_i \sigma_{a_i,b_i}\, x_{a_i}^T x_{b_i} = \sup_{\|X_U\|_2\le\sqrt{C_U},\ \|X_V\|_2\le\sqrt{C_V}} X_U^T \left[A_\sigma \otimes I_d\right] X_V = \sqrt{C_U}\,\|A_\sigma \otimes I_d\|_2\,\sqrt{C_V}

where $A_\sigma \otimes I_d$ represents the Kronecker product of $A_\sigma$ and $I_d$, and $\|A_\sigma \otimes I_d\|_2$ represents the spectral norm of $A_\sigma \otimes I_d$ (i.e., its largest singular value).

Finally, since $\|A_\sigma \otimes I_d\|_2 = \|A_\sigma\|_2$, we get the desired result in Theorem 1.
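The last step uses the fact that taking the Kronecker product with an identity matrix does not change the spectral norm (the singular values of $A \otimes I_d$ are those of $A$, each repeated $d$ times). A quick numerical sanity check of this identity (not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.choice([-1.0, 0.0, 1.0], size=(5, 7))  # a signed adjacency-like matrix
d = 3                                          # embedding dimension

# Spectral norm = largest singular value (ord=2 for matrices)
norm_A = np.linalg.norm(A, 2)
norm_kron = np.linalg.norm(np.kron(A, np.eye(d)), 2)
print(np.isclose(norm_A, norm_kron))  # True
```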

Proof Sketch of Claim LABEL:clm:low_dim

We provide the sketch of a constructive proof here.

Firstly, we randomly initialize all embedding vectors. Then for each node $v$, consider all the constraints relevant to $v$:

 C_v = \{(a,b,y)\in E : a=v \text{ or } b=v\}

Since the graph is regular, $|C_v|$ is equal to the degree, which is no larger than the embedding dimension. Therefore, there always exists a vector $x_v^*$ satisfying the following constraints:

 \forall (a,b,y)\in C_v,\quad y\, x_a^T x_b = 1+\epsilon

as long as all the referenced embedding vectors are linearly independent.

Choose any vector in a small neighborhood of $x_v^*$ that is not a linear combination of any other embedding vectors (this is always possible, since the viable set is a sphere of positive dimension minus a finite number of lower-dimensional subspaces), and set $x_v$ to be that vector.

Once we have repeated the above procedure for every node in the graph, it is easy to see that all the constraints are satisfied.
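The per-node existence argument above can be illustrated numerically: with generic (hence almost surely linearly independent) neighbor vectors, the linear system $y\, x_v^T x_b = 1+\epsilon$ is solvable exactly. The dimension, degree, and $\epsilon$ below are arbitrary illustrative values, not taken from the claim:

```python
import numpy as np

rng = np.random.default_rng(1)
d, deg, eps = 5, 3, 0.1   # embedding dimension, node degree, margin slack

# Random neighbor embeddings are almost surely linearly independent
neighbors = rng.normal(size=(deg, d))
signs = np.array([1.0, -1.0, 1.0])  # edge labels y

# Solve y * <x_v, x_b> = 1 + eps for every neighbor b of v
A = signs[:, None] * neighbors
target = np.full(deg, 1.0 + eps)
x_v, *_ = np.linalg.lstsq(A, target, rcond=None)
```

Because `deg <= d` and the rows of `A` are linearly independent, the least-squares solution satisfies all constraints exactly.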

Rough Estimation of $\|A_\sigma\|_2$ on Erdős–Rényi Graphs

By the definition of the spectral norm, $\|A_\sigma\|_2$ is equal to:

 \|A_\sigma\|_2 = \sup_{\|x\|_2=\|y\|_2=1,\ x,y\in\mathbb{R}^n} y^T A_\sigma x

Note that,

 y^T A_\sigma x = \sum_{(i,j)\in E} \sigma_{ij}\, y_i x_j

Now let us assume that the graph is generated from an Erdős–Rényi model (i.e., each ordered pair of vertices is connected independently with equal probability); then we have:

 y^T A_\sigma x = \sum_i \sum_j \sigma_{ij}\, e_{ij}\, y_i x_j

where $e_{ij}$ is the Boolean random variable indicating whether $(i,j)\in E$. Approximately, we have:

 \sum_i \sum_j \sigma_{ij}\, e_{ij}\, y_i x_j \sim N\!\left(0, \frac{m}{n^2}\right)

where $m$ is the expected number of edges, and $n$ is the total number of vertices. Then we have,

 \Pr(y^T A_\sigma x \ge t) \approx O\!\left(e^{-\frac{t^2 n^2}{2m}}\right)

for any fixed pair of unit vectors $x, y$ and any $t > 0$.

Now let $S$ be an $\epsilon$-net of the unit sphere in $n$-dimensional Euclidean space, which has roughly $O(\epsilon^{-n})$ points in total. Consider any unit vectors $x$ and $y$, and let $x_S, y_S$ be the closest points to $x, y$ in $S$; then:

 y^T A_\sigma x = (y_S + y - y_S)^T A_\sigma (x_S + x - x_S) = y_S^T A_\sigma x_S + (y - y_S)^T A_\sigma x_S + y_S^T A_\sigma (x - x_S) + (y - y_S)^T A_\sigma (x - x_S) \le y_S^T A_\sigma x_S + 2\epsilon n + \epsilon^2 n

since $\|A_\sigma\|_2 \le n$ is always true.

By the union bound, the probability that at least one pair $x_S, y_S \in S$ satisfies $y_S^T A_\sigma x_S \ge t$ is at most:

 \Pr(\exists x_S, y_S \in S : y_S^T A_\sigma x_S \ge t) \approx O\!\left(\epsilon^{-2n} e^{-\frac{t^2 n^2}{2m}}\right)

Let $t = \Theta(\sqrt{m \ln n / n})$; then the above inequality becomes:

 \Pr(\exists x_S, y_S \in S : y_S^T A_\sigma x_S \ge t) \approx O(e^{-n\ln n})

Hence, with high probability, $y_S^T A_\sigma x_S < t$ holds for all pairs in $S$, which implies that

 \sup_{\|x\|_2=\|y\|_2=1,\ x,y\in\mathbb{R}^n} y^T A_\sigma x \le t + 2\epsilon n + \epsilon^2 n

Therefore, choosing $\epsilon$ small enough that the $\epsilon n$ terms are negligible, we estimate $\|A_\sigma\|_2$

to be of order $\sqrt{m \ln n / n}$.

Pseudocode of Dual Coordinate Descent Algorithm

The following algorithm shows the full pseudocode of the DCD method for optimizing the hinge-loss variant of linear graph embedding.

DCD Method for Hinge-Loss Linear Graph Embedding:

  Main:
    Randomly initialize x_v for all vertices v
    Initialize the dual variable for all edges
    repeat: DcdUpdate(e) for each edge e
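Since the original algorithm environment did not survive extraction, here is a minimal Python sketch of one plausible DCD pass. The update formulas follow the standard dual coordinate descent scheme for hinge loss; the regularization constant, box bound `C`, and update order are our illustrative assumptions, not the authors' exact pseudocode:

```python
import numpy as np

def hinge_loss(x, edges):
    return sum(max(0.0, 1.0 - y * (x[a] @ x[b])) for a, b, y in edges)

def dcd_pass(x, edges, alpha, C=1.0, lam=0.5):
    """One dual coordinate descent pass over all edges.

    x: dict node -> embedding vector; alpha: per-edge duals in [0, C].
    """
    for e, (a, b, y) in enumerate(edges):
        g = y * (x[a] @ x[b]) - 1.0          # dual gradient for this edge
        q = x[a] @ x[a] + x[b] @ x[b]        # curvature estimate
        if q <= 0.0:
            continue
        new_alpha = min(max(alpha[e] - g / q, 0.0), C)  # box-constrained step
        delta = new_alpha - alpha[e]
        if delta != 0.0:
            # Move both endpoints along the scaled dual direction
            xa_old = x[a].copy()
            x[a] = x[a] + delta * y * x[b] / (2.0 * lam)
            x[b] = x[b] + delta * y * xa_old / (2.0 * lam)
            alpha[e] = new_alpha
    return x, alpha

# Main loop: random initialization, zero duals, repeated passes
rng = np.random.default_rng(0)
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, -1.0)]
x = {v: rng.normal(scale=0.1, size=8) for v in range(4)}
alpha = np.zeros(len(edges))
loss_before = hinge_loss(x, edges)
for _ in range(50):
    x, alpha = dcd_pass(x, edges, alpha)
loss_after = hinge_loss(x, edges)
```

Each pass touches one dual variable per edge and updates only the two endpoint embeddings, which is what makes the method cheap on sparse graphs.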