Learning latent representations for graph nodes has attracted considerable attention in machine learning, with applications in social networks [5, 17], knowledge bases [25, 32], and recommendation systems [29]. Textual networks additionally contain rich semantic information, so text can be combined with the graph structure for downstream tasks such as link prediction [11, 31] and node classification [9, 22]. For instance, social networks have links between users, and typically each user has a profile (text). The goal of textual network embedding is to learn node embeddings by jointly considering textual and structural information in the graph.
Most of the aforementioned textual network embedding methods focus on a fixed graph structure [26, 31, 21]. When new nodes are added to the graph, these frameworks require the whole model to be re-trained to update the existing nodes and add representations for the new nodes, leading to high computational complexity. However, networks are often dynamic. For example, in social networks users and relationships between users change over time (e.g., new users, new friends, unfriending, etc.). It is impractical to update the full model whenever a new user is added. This paper seeks to address this challenge and learn an embedding method that adapts to a changed graph without re-training.
Prior dynamic embedding methods usually focus on predicting how the graph structure changes over time, by training on multiple time steps [33, 20, 4]. In such a model, a dynamic network embedding is approximated by multiple steps of fixed network embeddings. In contrast, our method only needs to train on a single graph and can quickly adapt to related graph structures. Additionally, in prior work textual information is rarely included in dynamic graphs. Two exceptions have looked at dynamic network embeddings with changing node attributes [12, 11]. However, both require pre-trained node features, whereas we show that it is more powerful to learn the text encoder in a joint framework with the structural embedding.
We propose Dynamic Embedding for Textual Networks with a Gaussian Process (DetGP), a novel end-to-end model to perform unsupervised dynamic textual network embedding. DetGP jointly learns textual and structural features for both fixed and dynamic networks. The textual features indicate the intrinsic attributes of each node based on the text alone, while the structural features reveal the node relationships of the whole community. The structural features utilize both the textual features and the graph topology. This is achieved by smoothing the kernel function in a Gaussian process (GP) with a multi-hop graph transition matrix. This GP-based structure can handle newly added nodes or dynamic edges due to its non-parametric properties. To facilitate fast computation, we learn inducing points to serve as landmarks in the GP. Since the inducing points are fixed after training, computing new embeddings only requires calculating similarity to the inducing points, alleviating computational issues caused by changing graph structure. To evaluate the proposed approach, the learned node embeddings are used for link prediction and node classification. Empirically, DetGP learns improved node representations on several real-world datasets, and outperforms other existing baselines for both static and dynamic networks.
Assume the input data is given as an undirected graph $G = (V, E)$, where $V = \{v_n\}_{n=1}^{N}$ is the node set and $E$ is the edge set. Each node $v_n$ is associated with an $L_n$-length text sequence $t_n = (w_1, \dots, w_{L_n})$, where each $w_i$ is a natural language word. The adjacency matrix $A \in \{0, 1\}^{N \times N}$ represents node relationships, where $A_{nn'} = 1$ if $(v_n, v_{n'}) \in E$ and $A_{nn'} = 0$ otherwise. Our objective is to learn a low-dimensional embedding vector $h_n$ for each node $v_n \in V$ that captures both textual and structural features of graph $G$.
Figure 1 gives the framework of the proposed model, DetGP. Text is input to a text encoder with parameters ; in the Supplementary Material we describe further details. The output is the textual embedding of node . This textual embedding is both part of the complete embedding and an input into the structural embedding layer (dotted purple box), where it is combined with the graph structure in a GP framework, discussed in Section 2.1. In addition, multiple hops are modeled in this embedding layer to better reflect the graph architecture and use both local and global graph structure. To scale up the model to large datasets, we adopt the idea of inducing points [24, 23], which serve as nonuniformly situated grid points in the model. The output structural embeddings are denoted as , which are combined to form the complete node embedding .
The model is trained using the negative sampling loss [26], where neighbor nodes should be more similar than non-neighbor nodes (described in Section 2.2). When the graph structure changes, the node embeddings are updated by a single forward-propagation step without relearning any model parameters. This property comes from the non-parametric nature of the GP-based structure, and it greatly increases computational efficiency for dynamic graphs.
2.1 Structural Embedding Layer
The structural embedding layer transforms the encoded text feature into a structural embedding using a GP in conjunction with the graph topology. Before introducing the GP, we introduce the multi-hop transition matrix that will smooth the GP kernel.
Multi-hop transition matrix: Suppose $P$ is the normalized transition matrix, i.e., a normalized version of the adjacency matrix $A$ where each row sums to one and $P_{nn'}$ represents the probability of a transition from node $n$ to node $n'$. If $P$ represents the transition from a single hop, then higher orders of $P$ will give multi-hop transition probabilities. Specifically, $P^j$ is the $j$th power of $P$, where $[P^j]_{nn'}$ gives the probability of transitioning from node $n$ to node $n'$ after $j$ random hops on the graph. Different powers of $P$ provide different levels of smoothing on the graph, and vary from using local to global structure. A priori, though, it is not clear what level of structure is most important for learning the embeddings. Therefore, we combine them in a learnable weighting scheme:

$$P^* = \sum_{j=0}^{J} \alpha_j P^j, \quad \text{s.t.} \quad \sum_{j=0}^{J} \alpha_j = 1, \quad \alpha_j \ge 0, \tag{1}$$

where $J$ is the maximum number of steps considered, and $\alpha = (\alpha_0, \dots, \alpha_J)$ are the learnable weights. The constraint that $\sum_{j} \alpha_j = 1$ in (1) is implemented by a softmax function. Note that $P^0$ is an identity matrix, which treats each node independently. In contrast, a large power of $P$ would typically be very smooth after taking many hops. Therefore, $\alpha$ can learn the importance of local ($P^0$ or $P^1$) and global (large powers of $P$) graph structure for the node embeddings. Equation (1) can be viewed as a generalized form of DeepWalk [17] or GloVe [16]. In practice, learning the weights is more robust than hand-engineering them [1].
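As a rough illustration of the weighting scheme described above, the following NumPy sketch builds the combined multi-hop matrix from an adjacency matrix. The function names (`row_normalize`, `multi_hop_transition`), the unconstrained `logits` that parameterize the weights, and the treatment of isolated nodes are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def row_normalize(adj):
    """Turn an adjacency matrix into a row-stochastic transition matrix."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1, keepdims=True)
    # Assumption: isolated nodes keep an all-zero row.
    return np.divide(adj, deg, out=np.zeros_like(adj), where=deg > 0)

def softmax(logits):
    """Map unconstrained logits to non-negative weights that sum to one."""
    shifted = np.exp(logits - np.max(logits))
    return shifted / shifted.sum()

def multi_hop_transition(adj, logits):
    """Weighted sum of the powers P^0, ..., P^J with softmax weights."""
    P = row_normalize(adj)
    alpha = softmax(np.asarray(logits, dtype=float))
    power = np.eye(P.shape[0])      # P^0 is the identity
    P_star = np.zeros_like(P)
    for weight in alpha:
        P_star += weight * power
        power = power @ P           # move on to the next hop
    return P_star, alpha

# Toy path graph 0 - 1 - 2 with equal logits, so every hop is weighted 1/3.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
P_star, alpha = multi_hop_transition(adj, logits=[0.0, 0.0, 0.0])
```

In a trainable model the logits would be optimized with the other parameters; here they are fixed only to show the forward computation.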
GP structural embedding prior: We define a latent function over the textual embedding with a GP prior . Inspired by [15], instead of using this GP directly to determine the embedding, the learned graph diffusion is used on top of this Gaussian process. For finite samples, the combination of the graph diffusion and the GP yields a conditional structural embedding that can be expressed as a multivariate Gaussian distribution:
where and is an index of our structural embedding feature. Each dimension of the structural embedding follows this Gaussian distribution with the same covariance matrix smoothed by . In the Supplementary Material, we discuss the selection of different kernels.
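The following sketch shows one way a diffusion-smoothed covariance could be formed. It assumes the covariance has the form `P_star @ K @ P_star.T`, where `K` is a linear kernel on the textual embeddings `X` and `P_star` is the multi-hop transition matrix. Both the functional form and all names are assumptions for illustration.

```python
import numpy as np

def smoothed_covariance(X, P_star):
    """Covariance of one structural-embedding dimension across all nodes.

    X:      (N, d) textual embeddings.
    P_star: (N, N) multi-hop transition matrix.
    Assumption: a linear kernel K = X X^T smoothed on both sides by P_star.
    """
    K = X @ X.T
    return P_star @ K @ P_star.T

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
P_uniform = np.full((4, 4), 0.25)     # heavy smoothing: every node sees all nodes
cov_smooth = smoothed_covariance(X, P_uniform)
cov_plain = smoothed_covariance(X, np.eye(4))   # no smoothing: the plain kernel
```

With the identity as `P_star` the result reduces to the unsmoothed kernel matrix, which matches the remark that the zeroth power treats each node independently.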
Inducing Points: GP models are well known to suffer from computational complexity with large data size. To scale up the model, we use inducing points based on the variational Gaussian process. Denote inducing points as with , and their corresponding learnable embeddings as , which follow the same GP function. The textual and structural embeddings of real data samples are denoted as and . Given inducing points, the conditional distribution of our structural embeddings is
Here and . The subscript indicates the th column of a matrix ( is the concatenation of the th element from all node structural embeddings). Each dimension of has a multivariate Gaussian distribution with unique mean value but the same covariance . Theoretically, we can give a posterior of and get the marginal distribution of by integrating out. However, the integral on does not have a closed form. As an approximation, we use a deterministic function for the learned structural embedding.
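A minimal sketch of a deterministic inducing-point embedding, assuming it is the GP conditional mean given the inducing points, passed through the multi-hop matrix. The names `X_u` (inducing inputs), `U` (their embeddings), the linear kernel, and the `jitter` term are assumptions for this example, not details confirmed by the text.

```python
import numpy as np

def linear_kernel(A, B):
    """Linear kernel matrix between the rows of A and the rows of B."""
    return A @ B.T

def structural_embedding(X, P_star, X_u, U, jitter=1e-6):
    """Deterministic structural embedding from M inducing points.

    X:      (N, d) textual embeddings of the nodes.
    P_star: (N, N) multi-hop transition matrix.
    X_u:    (M, d) inducing inputs (hypothetical name).
    U:      (M, k) inducing embeddings (hypothetical name).
    """
    K_xu = linear_kernel(X, X_u)                                 # (N, M)
    K_uu = linear_kernel(X_u, X_u) + jitter * np.eye(len(X_u))   # (M, M)
    # Solve K_uu W = U rather than forming the inverse explicitly.
    W = np.linalg.solve(K_uu, U)                                 # (M, k)
    return P_star @ K_xu @ W                                     # (N, k)

# Sanity check: nodes placed exactly on the inducing inputs, no smoothing.
X_u = np.eye(3, 5)
U = np.arange(6.0).reshape(3, 2)
Z = structural_embedding(X_u, np.eye(3), X_u, U)
```

Only an M-by-M system is solved, so the cost grows with the number of inducing points rather than with the number of nodes.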
2.2 Algorithm Outline
The structural embedding and the textual embedding are concatenated to form the final node embedding . To learn the embeddings in an unsupervised manner, existing works adopt the technique of negative sampling [26], which tries to maximize the conditional probability of a nodal embedding given its neighbors, while maintaining a low conditional probability for non-neighbors. In the proposed framework, this loss is
where is a weighting constant. Equation (4) maximizes the inner product among neighbors in the graph while minimizing the similarity among non-neighbors. Our model is trained end-to-end by taking gradients of loss with respect to , and . The inducing points are initialized as the k-means centers of the encoded text features. Then, and the text encoder are jointly trained to minimize the loss function.
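The sketch below shows a standard negative sampling loss of the kind described above: inner products between neighbors are pushed up and those between sampled non-neighbors are pushed down. The exact form of the paper's loss is not recoverable here, so the log-sigmoid formulation, the name `gamma` for the weighting constant, and the way negative pairs are supplied are all assumptions.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -np.logaddexp(0.0, -x)

def negative_sampling_loss(H, pos_edges, neg_edges, gamma=1.0):
    """Negative sampling loss over node embeddings H of shape (N, D).

    pos_edges: pairs (i, j) that are neighbors in the graph.
    neg_edges: sampled pairs (i, j) that are not neighbors.
    gamma:     weight on the negative term (hypothetical name).
    """
    pos = np.asarray(pos_edges)
    neg = np.asarray(neg_edges)
    pos_scores = np.sum(H[pos[:, 0]] * H[pos[:, 1]], axis=1)
    neg_scores = np.sum(H[neg[:, 0]] * H[neg[:, 1]], axis=1)
    return -(log_sigmoid(pos_scores).sum()
             + gamma * log_sigmoid(-neg_scores).sum())

H = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
# Nodes 0 and 1 are aligned, node 2 points the opposite way.
loss_good = negative_sampling_loss(H, pos_edges=[(0, 1)], neg_edges=[(0, 2)])
loss_bad = negative_sampling_loss(H, pos_edges=[(0, 2)], neg_edges=[(0, 1)])
```

The loss is lower when the embeddings agree with the graph (`loss_good`) than when the roles of neighbor and non-neighbor are swapped (`loss_bad`).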
For newly introduced nodes with text , the transition matrix is first updated, and the embeddings are obtained directly without additional back-propagation. Specifically, we first compute from the text encoder. Then with , the structural embedding of all nodes can be computed as .
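The update for a new node can be sketched as a pure forward pass: extend the adjacency matrix, rebuild the multi-hop matrix, and recompute the embeddings with the trained quantities held fixed. The sketch assumes a linear kernel and an embedding of the form `P_star @ K_xu @ inv(K_uu) @ U`. The names `alpha`, `X_u`, `U`, `add_node`, and `embed_all` are hypothetical.

```python
import numpy as np

def multi_hop(adj, alpha):
    """Weighted sum of transition-matrix powers with fixed weights alpha."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1, keepdims=True)
    P = np.divide(adj, deg, out=np.zeros_like(adj), where=deg > 0)
    power, P_star = np.eye(len(adj)), np.zeros_like(adj)
    for weight in alpha:
        P_star += weight * power
        power = power @ P
    return P_star

def embed_all(X, adj, alpha, X_u, U, jitter=1e-6):
    """Forward pass only: no parameter is updated here."""
    K_uu = X_u @ X_u.T + jitter * np.eye(len(X_u))
    Z = multi_hop(adj, alpha) @ (X @ X_u.T) @ np.linalg.solve(K_uu, U)
    return np.hstack([X, Z])        # concatenate textual and structural parts

def add_node(X, adj, x_new, neighbors):
    """Append one node with textual embedding x_new and the given edges."""
    n = len(adj)
    grown = np.zeros((n + 1, n + 1))
    grown[:n, :n] = adj
    grown[n, neighbors] = 1.0
    grown[neighbors, n] = 1.0
    return np.vstack([X, x_new]), grown

# Trained quantities (toy values), kept fixed when the graph changes.
alpha = np.array([0.5, 0.5])
X_u = np.eye(2)
U = np.array([[1.0, 0.0], [0.0, 2.0]])

X = np.array([[1.0, 0.0], [0.0, 1.0]])
adj = np.array([[0.0, 1.0], [1.0, 0.0]])
X_new, adj_new = add_node(X, adj, x_new=np.array([1.0, 1.0]), neighbors=[0])
H_new = embed_all(X_new, adj_new, alpha, X_u, U)
```

In this sketch `x_new` stands in for the output of the trained text encoder applied to the new node's text.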
To demonstrate the efficacy of DetGP embeddings, we conduct experiments on both static and dynamic textual networks. Here we mainly focus on analyzing results on graphs with newly added nodes. The results on static networks and the experiment setup details are shown in the Supplementary Material. We use the word embedding average (Wavg) [27] as our text encoder.
| Only Text (Wavg) | 61.2 | 77.9 | 87.9 | 90.3 | 68.3 | 83.7 | 84.2 | 86.9 |

| % Training Nodes | 10% | 30% | 50% | 70% | 10% | 30% | 50% | 70% |
|---|---|---|---|---|---|---|---|---|
| Only Text (Wavg) | 60.2 | 76.3 | 83.5 | 84.8 | 56.7 | 67.9 | 70.4 | 73.5 |
Previous works [26, 31, 21] on textual network embedding require the full connection information to train the structural embedding, and so cannot directly assign (without re-training) a structural embedding to a newly arriving node whose connections were unknown during training. Therefore, the aforementioned methods cannot be applied to dynamic networks. To obtain baselines comparable to DetGP, we propose two strategies based on the idea of GraphSAGE [7]: Neighbor-Aggregate and GraphSAGE. Details about the strategies are given in the Supplementary Material.
We evaluate the dynamic embeddings for test nodes on link prediction and node classification tasks. For both tasks, we split the nodes into training and testing sets with different proportions of training nodes (10%, 30%, 50%, 70%). When embedding new testing nodes, only their textual attributes and connections with existing training nodes are provided. For link prediction, we predict the edges between testing nodes based on the inner product between their node embeddings; for node classification, an SVM classifier is trained on the embeddings of the training nodes. When new nodes arrive, we first embed them using the trained model and then use the pre-learned SVM to predict their labels.
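The node split described above can be sketched as follows. Edges are sorted into three groups: those used for training, those that attach a test node to a training node (available when embedding new nodes), and those between two test nodes (held out for link prediction). The function name, the random splitting procedure, and the seed are assumptions for illustration.

```python
import numpy as np

def split_for_dynamic_eval(edges, num_nodes, train_frac, seed=0):
    """Split nodes into train/test sets and partition the edges accordingly."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_nodes)
    n_train = int(round(train_frac * num_nodes))
    train_nodes = set(order[:n_train].tolist())

    train_edges, attach_edges, held_out_edges = [], [], []
    for i, j in edges:
        n_in_train = (i in train_nodes) + (j in train_nodes)
        if n_in_train == 2:
            train_edges.append((i, j))      # seen during training
        elif n_in_train == 1:
            attach_edges.append((i, j))     # given when a test node arrives
        else:
            held_out_edges.append((i, j))   # to be predicted between test nodes
    return train_nodes, train_edges, attach_edges, held_out_edges

# Toy ring graph with 10 nodes and half of them used for training.
ring = [(i, (i + 1) % 10) for i in range(10)]
train_nodes, train_edges, attach_edges, held_out_edges = split_for_dynamic_eval(
    ring, num_nodes=10, train_frac=0.5)
```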
The results for link prediction and node classification are given in Tables 1 and 2, respectively. The proposed DetGP significantly outperforms the other baselines, especially when the proportion of training nodes is small. A reasonable explanation is that, when the training set is small, new nodes have few connections with the training nodes, which causes high variance when aggregating neighborhood embeddings. Instead of aggregating, DetGP infers the structural embedding via a Gaussian process with pre-learned inducing points, which is more robust than the information passed by neighbor nodes.
We propose a novel textual network embedding framework that learns representative node embeddings for static textual networks and also effectively adapts to dynamic graph structures. This is achieved by introducing a GP network structural embedding layer, which first maps each node to the inducing points and then embeds them by taking advantage of the non-parametric representation. We also consider multiple hops to weight local and global graph structures. The graph structure is injected into the kernel matrix, where the kernel between two nodes uses the whole graph information based on multiple hops. Our final embedding contains both structural and textual information. Empirical results demonstrate the practical effectiveness of the proposed algorithm.
References
[1] (2018) Watch your step: learning node embeddings via graph attention. In NeurIPS, pp. 9180–9190.
[2] (2008) Mixed membership stochastic blockmodels. JMLR, pp. 1981–2014.
[3] (2018) Universal sentence encoder. arXiv preprint arXiv:1803.11175.
[4] (2018) Dynamic network embedding: an extended approach for skip-gram based network embedding. In IJCAI, pp. 2086–2092.
[5] Graph neural networks for social recommendation. arXiv preprint arXiv:1902.07243.
[6] (2016) Node2vec: scalable feature learning for networks. In SIGKDD, pp. 855–864.
[7] (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
[8] (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143 (1), pp. 29–36.
[9] (2017) Semi-supervised classification with graph convolutional networks. ICLR.
[10] (2007) Graph evolution: densification and shrinking diameters. TKDD.
[11] (2018) Streaming link prediction on dynamic attributed networks. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 369–377.
[12] (2017) Attributed network embedding for learning in a dynamic environment. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 387–396.
[13] (2017) A structured self-attentive sentence embedding. ICLR.
[14] (2016) Context2vec: learning generic context embedding with bidirectional LSTM. In SIGNLL, pp. 51–61.
[15] Bayesian semi-supervised learning with graph Gaussian processes. In NeurIPS.
[16] (2014) Glove: global vectors for word representation. In EMNLP, pp. 1532–1543.
[17] (2014) Deepwalk: online learning of social representations. In SIGKDD, pp. 701–710.
[18] (2011) Evaluation: from precision, recall and f-measure to ROC, informedness, markedness and correlation.
[19] (2006) Gaussian processes for machine learning. In Springer.
[20] (2018) Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing, pp. 362–373.
[21] (2019) Improved semantic-aware network embedding with fine-grained word alignment. EMNLP.
[22] (2015) Line: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077.
[23] (2010) Bayesian Gaussian process latent variable model. In AISTATS, pp. 844–851.
[24] (2009) Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, pp. 567–574.
[25] Know-evolve: deep temporal reasoning for dynamic knowledge graphs. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3462–3471.
[26] (2017) Cane: context-aware network embedding for relation modeling. In ACL.
[27] (2015) Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198.
[28] Network representation learning with rich text information. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
[29] Graph convolutional neural networks for web-scale recommender systems. In SIGKDD, pp. 974–983.
[30] Gaussian process models for link analysis and transfer learning. In NIPS, pp. 1657–1664.
[31] (2018) Diffusion maps for textual network embedding. In NeurIPS.
[32] (2018) NSCaching: simple and efficient negative sampling for knowledge graph embedding. arXiv preprint arXiv:1812.06410.
[33] (2018) Dynamic network embedding by modeling triadic closure process. In Thirty-Second AAAI Conference on Artificial Intelligence.
Appendix A Model Details
A.1 Text Encoder
There are many existing text encoders [14, 3, 13], often based on deep neural networks. However, a deep neural network encoder can overfit on graphs because of the relatively small size of the textual data. Therefore, various encoders have been proposed to extract rich textual information specifically from graphs [26, 31, 21]. In general, we aim to learn a text encoder with parameters that encodes semantic features . A simple and effective text encoder is the word embedding average (Wavg) , where is the corresponding embedding of from the sequence . This is implemented by a learnable look-up table. Zhang et al. [31] proposed a diffused word average encoder (DWavg) to leverage textual information over multiple hops on the network. Because DetGP focuses mainly on the structural embeddings, we do not develop a new text encoder. Instead, we show that DetGP is compatible with different text encoders, and our experiments use these two text encoders (Wavg and DWavg).
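A word embedding average encoder of the kind described above can be sketched in a few lines: each word index selects a row of a look-up table and the rows are averaged. The table values, the vocabulary size, and the function name `wavg_encode` are made up for this example.

```python
import numpy as np

def wavg_encode(token_ids, table):
    """Average the look-up-table rows of the words in one text sequence."""
    return table[np.asarray(token_ids)].mean(axis=0)

# Hypothetical 3-word vocabulary with 2-dimensional word embeddings.
table = np.array([[0.0, 0.0],
                  [2.0, 4.0],
                  [4.0, 0.0]])
x = wavg_encode([1, 2], table)
```

In a trained model the table would be a learnable parameter; it is a constant here only to keep the sketch self-contained.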
A.2 Kernel Selection
While there are many ways to define the kernel function, excessive non-linearity is not required or desired in this embedding layer because the text encoder is highly non-linear and high-dimensional. In practice, the first-degree polynomial kernel, written as
outperforms others due to its numerical stability. Empirically, the linear kernel in Eq. (5) speeds up computation and increases model stability.
A.3 Analysis of the Structural Embedding
We analyze the kernel function in Eq. (2) and (5) to show how the graph structure is used in the embedding layer. Denote (Eq. (1)) as the transition probability from node to node in hops, then the correlation between node and in Eq. (2) can be expanded as
The covariance is the same for all indices . The first term in Eq. (A.3) measures the kernel function between and . The next two terms show the relationship between and the weighted multi-hop neighbors of and vice versa. controls how much different hops are used. The last term is the pairwise-weighted higher order relationship between any two nodes in the graph, except and . The covariance structure uses the whole graph and learns how to balance local and global information. If node has no edges, then it will not be influenced by other nodes besides textual similarity. In contrast, a node with dense edge connections will be smoothed by its neighbors.
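The four groups of terms described above can be checked numerically. The sketch assumes the smoothed covariance is `P_star @ K @ P_star.T` with `P_star = alpha_0 * I + Q`, where `Q` collects the hops of order one and higher. Splitting on the identity part gives a direct kernel term, two cross terms, and a pairwise multi-hop term. This grouping is one plausible reading of the expansion, not a reproduction of the paper's equation.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5

# Ring graph: every node has two neighbors, so each row of P sums to one.
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = 1.0
    A[(i + 1) % N, i] = 1.0
P = A / A.sum(axis=1, keepdims=True)

alpha = np.array([0.4, 0.3, 0.2, 0.1])      # hypothetical learned hop weights
X = rng.normal(size=(N, 3))
K = X @ X.T                                 # linear kernel (assumed)

# Q gathers the multi-hop part; P_star adds the identity (zero-hop) part.
Q = sum(alpha[j] * np.linalg.matrix_power(P, j) for j in range(1, len(alpha)))
P_star = alpha[0] * np.eye(N) + Q

full = P_star @ K @ P_star.T
direct = alpha[0] ** 2 * K                  # kernel between the two nodes
cross_right = alpha[0] * K @ Q.T            # node n vs. neighbors of n'
cross_left = alpha[0] * Q @ K               # neighbors of n vs. node n'
pairwise = Q @ K @ Q.T                      # neighbors of n vs. neighbors of n'
expanded = direct + cross_right + cross_left + pairwise
```

If `Q` is zero, as for a node without edges, only the direct term survives, in line with the remark that an isolated node is influenced by textual similarity alone.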
With the inducing points, Equation (A.3) can be modified as The covariance between node and the inducing points includes the local information , as well as the smoothed effect from . This can also be viewed as feature smoothing over neighbors. Since inducing points do not contain links to other inducing points, there is no smoothing function for them. Each inducing point can be viewed as a node that already includes global graph information.
Appendix B Experiments
B.1 Setup Details
For a fair comparison with previous work, we follow the setup in [26, 31, 21], where the embedding for each node has dimension 200, a concatenation of a 100-dimensional textual embedding and a 100-dimensional structural embedding. We evaluate DetGP based on two text encoders: the word embedding average (Wavg) encoder and the diffused word embedding average (DWavg) encoder from Zhang et al. [31], introduced in Section A.1. The maximum number of hops in is set to .
- Cora is a paper citation network, with a total of 2,277 vertices and 5,214 edges in the graph, where only nodes with text are kept. Each node has a text abstract about machine learning and belongs to one of seven categories.
- DBLP is a paper citation network with 60,744 nodes and 52,890 edges. Each node represents one paper in computer science in one of four categories: database, data mining, artificial intelligence, and computer vision.
- HepTh (High Energy Physics Theory) [10] is another paper citation network. The original dataset contains 9,877 nodes and 25,998 edges. We only keep nodes with associated text, so this is limited to 1,038 nodes and 1,990 edges.
B.2 Embedding Assigning Strategies
The two embedding assigning strategies are: (a) Neighbor-Aggregate: aggregating the structural embeddings of the neighbors in the training set and using the result as the structural embedding for the new node; (b) GraphSAGE: aggregating the textual embeddings of the neighbors, then passing the result through a fully-connected layer to get the new node's structural embedding. For neighborhood information aggregation, we use the mean aggregator and the max-pooling aggregator described in [7].
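A minimal sketch of the two baseline strategies. The function names, the zero vector returned for a node without training neighbors, and the plain affine layer with weights `W` and bias `b` (no activation) are assumptions. In practice the layer would be trained, whereas here it is fixed to toy values.

```python
import numpy as np

def neighbor_aggregate(emb_train, neighbor_ids, how="mean"):
    """Aggregate the embeddings of a new node's neighbors in the training set."""
    neighbors = emb_train[np.asarray(neighbor_ids, dtype=int)]
    if neighbors.shape[0] == 0:
        # Assumption: a node with no training neighbors gets a zero vector.
        return np.zeros(emb_train.shape[1])
    if how == "mean":
        return neighbors.mean(axis=0)
    if how == "max":
        return neighbors.max(axis=0)        # element-wise max pooling
    raise ValueError(f"unknown aggregator: {how}")

def graphsage_style(text_emb_train, neighbor_ids, W, b, how="mean"):
    """Aggregate neighbor textual embeddings, then apply one dense layer."""
    pooled = neighbor_aggregate(text_emb_train, neighbor_ids, how)
    return W @ pooled + b

emb = np.array([[1.0, 4.0],
                [3.0, 2.0]])
z_mean = neighbor_aggregate(emb, [0, 1], how="mean")
z_max = neighbor_aggregate(emb, [0, 1], how="max")
z_sage = graphsage_style(emb, [0, 1], W=2.0 * np.eye(2), b=np.ones(2))
```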
In both tasks, the Neighbor-Aggregate strategy with the mean aggregator shows a slight improvement over the baseline with only a text encoder. However, it does not work well with the max-pooling aggregator, implying that unsupervised max-pooling on pre-trained neighbor structural embeddings cannot learn a good representation. The GraphSAGE strategies (with both mean and pooling aggregators) show notable improvements compared with Wavg and Neighbor-Aggregate. Unlike the unsupervised pooling, the GraphSAGE pooling aggregator is trained with a fully-connected layer on top, and it shows results comparable to the mean aggregator.
B.3 Link Prediction
The link prediction task seeks to infer whether two nodes are connected, based on the learned embeddings. This standard task tests whether the embedded node features contain graph connection information. For a given network, we randomly keep a certain percentage of edges and learn embeddings. At test time, we calculate the inner product of each pair of node embeddings. A large inner product value indicates a potential edge between two nodes. The AUC score [8] is computed in this setting to evaluate the performance. The results on Cora and HepTh are shown in Table 3. Because the DBLP dataset has only 52,890 edges, which is very sparse relative to its 60,744 nodes, we do not evaluate the AUC score on it, since sampling edges would lead to high variance. The first four models only embed structural features, while the remaining alternatives use both textual and structural embeddings. We also provide the DetGP results with only textual embeddings and only structural embeddings as an ablation study.
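The scoring step above can be sketched as follows: each candidate pair is scored by the inner product of its two node embeddings, and the AUC is the fraction of (true edge, non-edge) pairs in which the true edge scores higher, with ties counted as one half. The function names and the way non-edges are supplied are assumptions for illustration.

```python
import numpy as np

def edge_scores(H, pairs):
    """Inner product of the two node embeddings for every pair (i, j)."""
    pairs = np.asarray(pairs)
    return np.sum(H[pairs[:, 0]] * H[pairs[:, 1]], axis=1)

def auc_from_scores(pos_scores, neg_scores):
    """Probability that a true edge outscores a non-edge (ties count 0.5)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

H = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
pos = edge_scores(H, [(0, 1)])      # held-out true edge
neg = edge_scores(H, [(0, 2)])      # sampled non-edge
auc_toy = auc_from_scores(pos, neg)
auc_mixed = auc_from_scores([0.9, 0.8], [0.1, 0.85])
```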
Table 3 shows that adding textual information to the embedding improves the link prediction results by a large margin. Even using only textual embeddings, DetGP gains a significant improvement over structure-only methods and achieves competitive performance compared with other text-based embedding methods. Using only structural information is slightly better than using only textual embeddings, since link prediction is a more structure-dependent task; this also indicates that DetGP learns inducing points that effectively represent the network structure. Compared with other textual network embedding methods, DetGP has very competitive AUC scores, especially when only a small percentage of edges is given. Since the text encoders in our methods come from the baselines Wavg and DWavg [31], the performance gain can be attributed to the proposed structural embedding framework.
| DetGP (Wavg) only Text | 83.4 | 89.1 | 89.9 | 90.9 | 92.3 | 86.5 | 89.6 | 90.2 | 91.5 | 92.6 |
| DetGP (Wavg) only Struct | 85.4 | 89.7 | 91.0 | 92.7 | 94.1 | 89.7 | 92.1 | 93.5 | 94.8 | 95.1 |

| DetGP (Wavg) only Text | 78.1 | 81.2 | 84.7 | 85.3 | 71.4 | 73.3 | 74.2 | 74.9 |
| DetGP (Wavg) only Struct | 70.9 | 79.7 | 81.5 | 82.3 | 70.0 | 71.4 | 72.6 | 73.3 |
B.4 Node Classification
Node classification requires high-quality textual embeddings because structural embeddings alone do not accurately reflect node category. Therefore, we only compare to methods designed for textual network embedding. After training converges, a linear SVM classifier is learned on the trained node embeddings and performance is estimated on a hold-out set. In Table 4, we compare our methods (Wavg+DetGP, DWavg+DetGP) with recent textual network embedding methods under different proportions of given nodes in the training set. Following the setup in Zhang et al. [31], the evaluation metric is Macro-F1 score. We test on the Cora and DBLP datasets, which have group label information, where DetGP yields the best performance in all settings. This demonstrates that the proposed model can learn both representative textual and structural embeddings. The ablation study results (only textual embeddings vs. only structural embeddings) demonstrate that textual attributes are more important than edge connections in the classification task. To describe the effect of learning the weighting in the diffusion, for the experiment on Cora with nodes given for training, the learned weights in are . Thus, local and second order transition features are more important.
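For reference, the Macro-F1 metric can be computed as the unweighted mean of the per-class F1 scores. The sketch below is a plain NumPy version that takes the set of classes from the true labels, which is an assumption. Library implementations may treat classes that appear only in the predictions differently.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for label in np.unique(y_true):
        tp = np.sum((y_true == label) & (y_pred == label))
        fp = np.sum((y_true != label) & (y_pred == label))
        fn = np.sum((y_true == label) & (y_pred != label))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

score = macro_f1(y_true=[0, 0, 1, 1], y_pred=[0, 1, 1, 1])
```

Because every class contributes equally, a small class that is classified poorly lowers the score as much as a large one.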
B.5 Inducing Points
Figure 2 gives the t-SNE visualization of the learned DetGP structural embeddings on the Cora citation dataset. The model is learned using all edges and all of the nodes with their textual information. We set the number of inducing points to . To avoid the computational instability caused by the inverse matrix , we update inducing points with a smaller learning rate, which is set to one-tenth of the learning rate for the text encoder. The inducing points are visualized as red filled circles in Figure 2. Textual embeddings are plotted with different colors, representing the node classes. Note that the inducing points fully cover the space of the categories, implying that the learned inducing points meaningfully cover the distribution of the textual embeddings.
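The learning-rate scheme described above can be illustrated with a plain gradient step in which the inducing-point parameters use one-tenth of the base rate. The parameter names (`encoder`, `inducing_inputs`, `inducing_embeddings`), the use of vanilla SGD, and the choice to slow both inducing quantities are assumptions. The actual optimizer is not stated in this section.

```python
import numpy as np

def sgd_step(params, grads, lr,
             slow_keys=("inducing_inputs", "inducing_embeddings"),
             slow_factor=0.1):
    """One gradient step with a reduced learning rate for selected parameters."""
    updated = {}
    for name, value in params.items():
        step = lr * slow_factor if name in slow_keys else lr
        updated[name] = value - step * grads[name]
    return updated

params = {"encoder": np.array([1.0]), "inducing_inputs": np.array([1.0])}
grads = {"encoder": np.array([1.0]), "inducing_inputs": np.array([1.0])}
new_params = sgd_step(params, grads, lr=0.1)
```

Moving the inducing points slowly keeps the kernel matrix between them from changing abruptly from one step to the next, which is the stability concern raised above.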