Graphs are important mathematical structures commonly used to represent objects and their relations in real-world systems such as the World Wide Web, social networks, and protein-protein interaction networks. Because networks appear in such a wide range of applications, network analysis methods have attracted great interest from the research community, and numerous techniques have been proposed to better understand and uncover their underlying properties. In recent years, many prominent and powerful approaches have emerged in the field of network representation learning (NRL). The main goal of NRL techniques is to learn feature vectors corresponding to the nodes of the graph (also known as node embeddings) while preserving important structural properties of the network; these vectors can later be used to perform various analysis and mining tasks, including visualization, node classification and link prediction, with the help of machine learning algorithms.
The initial studies in the field of node representation learning have mostly relied on matrix factorization techniques, since various properties and interactions between nodes can be expressed as matrix operations. However, these methods are mainly applicable to small-scale networks due to their high computational cost, which becomes prohibitive for graphs consisting of millions of nodes and edges. More recent studies have concentrated on developing methods suitable for relatively large-scale networks, which effectively approximate the underlying objective functions that capture meaningful information about the nodes of the graph and their properties.
A plethora of node representation learning methods have been inspired by the advancements in the area of natural language processing (NLP), borrowing various ideas originally developed for computing word embeddings. One such successful technique is the Skip-Gram architecture [17], which aims to find latent representations of words by estimating their context in the sentences of a textual corpus. Following this idea, many pioneering studies in NRL use random walks to transform a graph into a collection of sentences, by analogy with natural language, and these sentences (walks) are then used to learn node embeddings.
Although random walk-based approaches are strong enough to capture local connectivity patterns, they often fail to convey sufficient information about the global structural properties of the network. More precisely, real-world networks have an inherent clustering (or community) structure, which can be utilized to further improve the predictive capabilities of node embeddings. One can interpret such structural information by analogy with the concept of topics in a collection of documents. Just as word embeddings can be enhanced with topic-based information [16], here we aim at empowering node embeddings by employing information about the community structure of the graph, which can be obtained by a process similar to topic modeling.
In this paper, we propose topical node embeddings (TNE), a framework in which node and topic embeddings are learned separately from the network, and then they are merged into a single vector – leading to further improvements in the performance on downstream tasks. The main contributions of the paper can be summarized as follows:
A novel node representation learning framework. We propose a new strategy, called TNE, which learns community embeddings from the graph and uses them to improve the node representations extracted by random walk-based methods.
Enriched feature vectors. We perform a detailed empirical evaluation of the embeddings learned by TNE on the tasks of node classification and link prediction. As the experimental results demonstrate, the proposed model provides feature vectors which can boost the performance of downstream tasks.
The rest of the paper is organized as follows. In Section 2, we describe the related work. In Section 3, we formulate the problem, and in Section 4 we present the proposed method. Section 5 presents the experimental results, and finally, in Section 6 we conclude our work providing also future research directions.
2 Related Work
In recent years, many methods have been proposed to learn a latent representation of nodes in an unsupervised manner. Developing a technique for learning network representations poses many challenges, since a good representation should capture various underlying properties of the network. For instance, many real-world networks consist of tightly connected communities and obey a scale-free property with respect to their degree distribution; in other words, a small number of nodes, known as hubs, are connected to the majority of nodes. Hence, a structure-preserving method should produce latent representations in which nodes that link to a hub are close to it in the embedding space, while nodes that belong to entirely different communities are placed far away from each other.
The traditional unsupervised feature learning methods aim at factorizing some matrix representation, which has been designed by taking into account the properties and connections of a given network. MDS [12], Laplacian Eigenmaps [1], Locally Linear Embeddings (LLE) [24] and IsoMap [26] are just some of the approaches that aim to preserve the first-order proximity of nodes. More recently proposed algorithms, including GraRep [4] and HOPE [18], aim at preserving higher-order proximities of nodes. Nevertheless, although matrix factorization approaches offer an elegant way to capture the desired properties, they mainly suffer from high time complexity.
In recent years, random walk-based methods have gained considerable attention, mainly due to their efficiency. In fact, a very recent study [21] shows that DeepWalk and node2vec [20, 10] implicitly perform matrix factorizations. Following this line of research, distinct random sampling strategies have been proposed and various methods have emerged [15, 22].
To the best of our knowledge, very few studies exploit the community structure of real networks to learn node embeddings. The authors of [27] have proposed a matrix factorization-based algorithm that incorporates the community structure into the embedding process, implicitly focusing on the quantity of modularity. The ComE model [5] proposes a closed-loop procedure among the encoding of communities, the learning of node embeddings, and community detection in the network. As we will present shortly, our work aims at independently learning node and community (topic) embeddings, and then combining them into expressive topical feature vectors.
3 Problem Formulation and Latent Models on Graphs
Let $G = (V, E)$ be a graph, where $V$ is the set of nodes and $E \subseteq V \times V$ denotes the set of edges. Our goal is to find a mapping function $\Phi: V \rightarrow \mathbb{R}^{d}$, where $\Phi(v)$ indicates the representation of the vertex $v \in V$ in a lower dimensional space (which we want to learn in order to feed downstream learning tasks) and $d$ is generally referred to as the embedding or dimension size, which is much smaller than the cardinality of the vertex set, $|V|$.
Node embedding methods based on the popular SkipGram architecture 
mainly target to maximize the log-probability, where denotes the set of reachable nodes by starting from the vertex in at most steps. However, we have to deal with a computational problem when we aim to find and for each , mainly because the computational cost grows significantly as the length increases due to the sum over . Therefore, many approaches prefer to approximate the objective function above using random walks as follows:
where is a walk of length , refers to the window size, and is a collection of walks. Note that we obtain two different embedding vectors and for each node , but we will only consider the vector as a node embedding of .
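As a rough illustration of how a collection of walks is turned into training pairs for a Skip-Gram style objective, the following Python sketch enumerates the (center, context) pairs that fall within a given window. The function name `skipgram_pairs` and the toy walk are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
def skipgram_pairs(walks, window):
    """Return (center, context) pairs for nodes at most `window` steps apart in a walk."""
    pairs = []
    for walk in walks:
        for i, center in enumerate(walk):
            lo = max(0, i - window)
            hi = min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a node is not its own context
                    pairs.append((center, walk[j]))
    return pairs


# Toy example (hypothetical): a single walk over four nodes, window size 1.
walks = [["a", "b", "c", "d"]]
pairs = skipgram_pairs(walks, window=1)
print(pairs)
```

Each pair contributes one log-probability term to the sum over walks and window positions, so the cost grows with the number and length of the walks rather than with the size of multi-step neighborhoods.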
Complex networks, such as social or biological networks, consist of latent clusters of different sizes in which the nodes are more likely to be connected to each other . Although some random walk-based methods implicitly benefit from this structural property of networks, our main goal here is to enhance node embedding vectors using clusters of a given network. We mainly rely on two different approaches to extract latent communities: on random walks and on the network structure itself. For a given graph , we will use the symbol to indicate the set of communities of .
3.1 Random walk-based graph topic models
Most real-world networks can be expressed as a combination of nested or overlapping communities . Therefore, when a random walk is initialized, it does not only visit neighboring nodes, but also traverses communities in the network (see Fig. 2). In this regard, we assume that each random walk can be represented as random mixtures over latent communities, and each community can be characterized by a distribution over nodes. In other words, we can write the following generative model for each walk over the network:
For each walk
For each vertex
Here, is the number of walks, is the length of walks and is the number of clusters.
If we consider each random walk as a document and the collection of random walks as a corpus, it can be seen that the statistical process defined above corresponds to the well known Latent Dirichlet Allocation (LDA) model . Therefore, each community corresponds to a distinct topic in the terminology of NLP (we use the terms topic and community interchangeably in the rest of the paper).
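For reference, the standard LDA generative process, read with walks in the role of documents and nodes in the role of words, can be written as below. The symbols are assumptions chosen for this sketch: $K$ is the number of communities, $L$ the walk length, $\mathcal{W}$ the collection of walks, $\alpha$ and $\beta$ the Dirichlet hyper-parameters, $\theta_w$ the community mixture of walk $w$, and $\phi_k$ the node distribution of community $k$.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K, \\
\theta_w &\sim \mathrm{Dirichlet}(\alpha), & w &\in \mathcal{W}, \\
z_{w,i} &\sim \mathrm{Categorical}(\theta_w), & i &= 1, \dots, L, \\
v_{w,i} &\sim \mathrm{Categorical}(\phi_{z_{w,i}}), & i &= 1, \dots, L.
\end{align*}
```

Here $z_{w,i}$ is the latent community of the $i$-th position of walk $w$ and $v_{w,i}$ is the node observed at that position.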
Now we can use community or topic assignments of nodes in the walks to obtain better vector representations. By replacing a node with its topic label, we aim to predict the nodes in the context of the topic. More formally, we can state our objective function to find community or topic representations as follows:
By maximizing the log-probability above, we obtain an embedding vector for each topic label; these vectors are called topic embeddings or topic representations. We will refer to this model as Lda throughout the paper.
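The idea of replacing the center node with its topic label while keeping the context nodes can be sketched as follows. This is a minimal sketch under the assumption that the topic assignments are available as sequences aligned with the walks; the function name `topic_context_pairs` and the toy data are hypothetical.

```python
def topic_context_pairs(walks, topic_walks, window):
    """Return (topic, context_node) pairs: the center node is replaced by its topic label."""
    pairs = []
    for walk, topics in zip(walks, topic_walks):
        if len(walk) != len(topics):
            raise ValueError("each walk needs one topic label per position")
        for i, topic in enumerate(topics):
            lo = max(0, i - window)
            hi = min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((topic, walk[j]))
    return pairs


# Toy example (hypothetical): one walk and the topic label of each position.
walks = [["a", "b", "c"]]
topic_walks = [[0, 0, 1]]
pairs = topic_context_pairs(walks, topic_walks, window=1)
print(pairs)
```

Training a Skip-Gram model on these pairs yields one vector per topic label, while the context side remains indexed by nodes.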
In the Lda model above, the latent community assignment of each node is chosen independently of the topic label of the previous node in the walk. However, the hidden state of the current node can play an important role in determining the next vertex to visit, as the random walk also traverses communities. Therefore, we can modify the Lda model, and define the following generative process:
For each walk
For each vertex , for all
The above model is in fact the well-known Hidden Markov Model with symmetric Dirichlet priors over transition and emission distributions (we will refer to this model as Hmm). Note that, in the generation of each node sequence, the same transition probabilities are used, unlike the topic distribution of the Lda model, and the vectors and contain and components, respectively. Moreover, as shown in Lemma 3.1, the Lda model can also be viewed as a special case of Hmm for the generation of a specific node sequence, after choosing suitable distributions.
The probability of generating the topic and node sequences , by Lda for a given node and topic distributions , is equal to the probability of producing the sequences by Hmm if the initial, transition and emission probabilities are chosen as , and . Please see the Appendix.
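The relation between the two models can be checked numerically on a toy example. The sketch below assumes the natural parameter choice: the initial distribution equals the topic distribution of the walk, every row of the transition matrix equals that same topic distribution, and the emission distributions equal the per-topic node distributions. All names and numbers are illustrative assumptions.

```python
import math


def lda_sequence_prob(theta, phi, topics, nodes):
    """Probability of a (topic, node) sequence when topics are drawn i.i.d. from theta."""
    prob = 1.0
    for z, v in zip(topics, nodes):
        prob *= theta[z] * phi[z][v]
    return prob


def hmm_sequence_prob(initial, transition, emission, topics, nodes):
    """Joint probability of hidden states (topics) and observations (nodes) under an HMM."""
    prob = initial[topics[0]] * emission[topics[0]][nodes[0]]
    for t in range(1, len(topics)):
        prob *= transition[topics[t - 1]][topics[t]] * emission[topics[t]][nodes[t]]
    return prob


# Toy parameters (hypothetical): 2 topics, 3 nodes.
theta = [0.7, 0.3]
phi = [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]
topics = [0, 1, 1, 0]
nodes = [0, 2, 1, 1]

# Every transition row equals theta, so the next topic ignores the previous one.
transition = [list(theta) for _ in theta]

p_lda = lda_sequence_prob(theta, phi, topics, nodes)
p_hmm = hmm_sequence_prob(theta, transition, phi, topics, nodes)
print(p_lda, p_hmm)
```

With identical transition rows the Markov dependence vanishes, which is why the two probabilities coincide for a fixed sequence.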
3.2 Network structure-based modeling
In the previous models, the generated random walks are used to detect the community (or topic) assignment of each node in the given node sequence. Here, we propose two additional models, namely BigC and Louvain, which aim to determine the communities of nodes directly from the given network. The Louvain model uses the Louvain method [2] to extract communities, while the BigC model is based on an overlapping community detection method called BigClam [29].
4 Topical Node Embeddings
In this section, we describe the proposed Topical Node Embeddings (TNE) model in detail. An overview of the model is given in Fig. 2. Our overall goal is to enhance node embeddings using information about the underlying topics of the graph. This can be achieved by learning node and topic embedding vectors independently of each other, jointly maximizing the objectives defined in Equations (1) and (3.1). By combining these objective functions, we derive the following equation:
In the Skip-Gram model , the probability measure in the above equation is considered as a softmax function
and we adopt the negative sampling technique  in order to make our computations more efficient.
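For orientation, the usual Skip-Gram softmax and the negative-sampling surrogate that replaces each log-probability term can be written as follows. The notation is an assumption made for this sketch: $\Phi(v)$ is the embedding of the center node $v$, $\Omega(u)$ the context vector of node $u$, $\sigma$ the logistic function, $S$ the number of negative samples, and $P_n$ the noise distribution over nodes.

```latex
\begin{align*}
\Pr(u \mid v) &= \frac{\exp\big(\Omega(u)^{\top} \Phi(v)\big)}{\sum_{u' \in V} \exp\big(\Omega(u')^{\top} \Phi(v)\big)}, \\
\log \Pr(u \mid v) \;&\text{is replaced by}\;\;
\log \sigma\big(\Omega(u)^{\top} \Phi(v)\big)
+ \sum_{s=1}^{S} \mathbb{E}_{u_s \sim P_n}\Big[\log \sigma\big(-\Omega(u_s)^{\top} \Phi(v)\big)\Big].
\end{align*}
```

The surrogate avoids the normalizing sum over all nodes in the denominator, which is what makes training feasible on large graphs.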
After obtaining the node and topic representations, our final step is to efficiently incorporate these feature vectors. For this purpose, we introduce three simple strategies, namely , , and :
. It produces the final representation for the node by combining the node and community embeddings: . Here, the topic label is equal to the parameter maximizing the expression and the symbol denotes the concatenation operation. For instance, if we select the number of topics as for Zachary’s karate club in Figure 1, then each node is assigned to the topic that has the highest probability.
. The second strategy can be defined as , where .
. The final strategy is formulated as follows: .
We call the final vector obtained by concatenating the node and topic feature vectors a topical node embedding. Algorithm 1 provides the pseudocode of the TNE model.
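The combination step can be sketched in a few lines of Python. The first function follows the description of the first strategy, which concatenates the node vector with the embedding of the most probable topic. The second function is a hypothetical weighted-mean variant included only for contrast; it is an assumption and not necessarily one of the strategies defined above. All names and numbers are illustrative.

```python
def concat_max_topic(node_vec, topic_vecs, topic_probs):
    """Concatenate the node vector with the embedding of its most probable topic."""
    best = max(range(len(topic_probs)), key=lambda k: topic_probs[k])
    return list(node_vec) + list(topic_vecs[best])


def concat_weighted_mean(node_vec, topic_vecs, topic_probs):
    """Hypothetical variant: concatenate with the probability-weighted mean topic vector."""
    dim = len(topic_vecs[0])
    mean = [sum(p * vec[d] for p, vec in zip(topic_probs, topic_vecs)) for d in range(dim)]
    return list(node_vec) + mean


# Toy example (hypothetical): 2-dimensional node vector, two 2-dimensional topic vectors.
node_vec = [1.0, 2.0]
topic_vecs = [[0.0, 1.0], [1.0, 0.0]]
topic_probs = [0.25, 0.75]  # assumed probability of each topic for this node

print(concat_max_topic(node_vec, topic_vecs, topic_probs))
print(concat_weighted_mean(node_vec, topic_vecs, topic_probs))
```

In both cases the resulting vector has the node dimension plus the topic dimension, so downstream classifiers see the two sources of information side by side.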
The general structure of our framework is as follows. First, we need a collection of walks over the network to learn node and topic embeddings, so any approach such as Deepwalk or Node2vec can be used to perform random walks. Then, we choose a strategy to obtain the topic assignment of each node in each walk of this collection, based on the latent models on graphs defined in Section 3. In the first case, we use the stochastic processes Lda and Hmm described in Section 3, obtaining the topical node embedding models tne-Lda and tne-Hmm, respectively. In the second case, the topic assignments are inferred from the network structure based on the BigC and Louvain models, which rely on the BigClam and Louvain methods respectively, and the corresponding topical node embedding models are called tne-BigC and tne-Louvain.
Afterwards, we produce the node-context pairs that serve as input to the Skip-Gram algorithm, and we learn the latent node representations. By replacing each node in a walk with its topic assignment, we obtain a new set of pairs from which we learn topic embeddings. Finally, we combine the feature vectors according to the chosen strategy.
5 Experiments
In this section, we present the datasets used in our experiments and discuss the performance and effectiveness of the four proposed variants of the TNE model on the tasks of node classification and link prediction. Our model has been implemented in Python and the source code can be found at: https://abdcelikkanat.github.io/projects/TNE/.
5.1 Baseline Methods
We will consider two notable random walk-based approaches and apply our framework to the collection of walks generated by these algorithms.
Deepwalk  uses a very natural sampling strategy in producing walks. At each step, it uniformly chooses a node having connections to the one that it currently resides at, and repeats the same procedure until obtaining a walk of the desired length. We will refer to this method as deepwalk-emb.
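The uniform sampling strategy described above can be sketched as follows. This is a minimal illustration on an adjacency-list graph, not the authors' code; the function name, the toy graph and the fixed seed are assumptions.

```python
import random


def uniform_random_walk(adj, start, length, rng):
    """Walk of up to `length` nodes, choosing the next node uniformly among neighbors."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk


# Toy undirected graph (hypothetical), given as adjacency lists.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(0)
walk = uniform_random_walk(adj, start=0, length=10, rng=rng)
print(walk)
```

Repeating this procedure several times from every node yields the collection of walks on which the embeddings are trained.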
Node2vec [10] is an extension of Deepwalk, and its walking behavior is controlled by two parameters, $p$ and $q$, which provide the ability to discover distant regions of the network; it also captures structural similarities between nodes. We will refer to this method as node2vec-emb.
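To make the role of the return and in-out parameters concrete, the sketch below computes the unnormalized second-order transition weights used by Node2vec on an unweighted graph and samples one step. The function names, the toy graph and the values p = 2 and q = 0.5 are illustrative assumptions, not the settings used in the experiments.

```python
import random


def node2vec_weights(adj, prev, curr, p, q):
    """Unnormalized weights for leaving `curr`, given that the walk arrived from `prev`."""
    weights = {}
    for nxt in adj[curr]:
        if nxt == prev:            # returning to the previous node
            weights[nxt] = 1.0 / p
        elif nxt in adj[prev]:     # staying at distance 1 from the previous node
            weights[nxt] = 1.0
        else:                      # moving farther away from the previous node
            weights[nxt] = 1.0 / q
    return weights


def node2vec_step(adj, prev, curr, p, q, rng):
    """Sample the next node proportionally to the weights above."""
    weights = node2vec_weights(adj, prev, curr, p, q)
    candidates = list(weights)
    return rng.choices(candidates, weights=[weights[c] for c in candidates])[0]


# Toy undirected graph (hypothetical), given as neighbor sets.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
weights = node2vec_weights(adj, prev=0, curr=1, p=2.0, q=0.5)
next_node = node2vec_step(adj, prev=0, curr=1, p=2.0, q=0.5, rng=random.Random(0))
print(weights, next_node)
```

A small q makes outward moves more likely, which corresponds to encouraging the walk to explore regions it has not visited yet.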
5.2 Parameter Settings
In this section, we describe the parameters’ settings that we have used for our experiments and clarify the strategies that we follow. Since both of the random walk sampling strategies that we examine here (Deepwalk and Node2vec) share many common parameters, we assign all of them to the same typical values.
More specifically, we consider the number of walks , walk length , window size , and the embedding dimension . The return and in-out hyper-parameters , of Node2vec are simply set to and for all experiments – so, the walk is encouraged to explore previously unvisited regions of the network. To speed up the training process, we use negative sampling 
for all models. We also use stochastic gradient descent (SGD) for optimization, setting the initial learning rate to .
For learning the topic assignment of each node in node sequences, we perform collapsed Gibbs sampling  for tne-Lda model, and variational message passing  for tne-Hmm. For all variants of the TNE framework, the number of topics are selected as in the experiments, and concatenation method is preferred to obtain final embedding vector.
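A compact collapsed Gibbs sampler for LDA over walks, in the spirit of the standard sampler for topic models, might look like the sketch below. It assumes nodes are integer identifiers in `range(num_nodes)` and uses symmetric priors `alpha` and `beta`; the function name, the toy walks and all hyper-parameter values are illustrative assumptions rather than the settings used in the experiments.

```python
import random


def lda_gibbs(walks, num_nodes, num_topics, alpha, beta, iters, rng):
    """Return one topic label per walk position after `iters` Gibbs sweeps."""
    walk_topic = [[0] * num_topics for _ in walks]           # topic counts per walk
    topic_node = [[0] * num_nodes for _ in range(num_topics)]  # node counts per topic
    topic_total = [0] * num_topics                           # total count per topic
    assign = []
    for d, walk in enumerate(walks):                         # random initialization
        labels = []
        for v in walk:
            k = rng.randrange(num_topics)
            labels.append(k)
            walk_topic[d][k] += 1
            topic_node[k][v] += 1
            topic_total[k] += 1
        assign.append(labels)
    for _ in range(iters):
        for d, walk in enumerate(walks):
            for i, v in enumerate(walk):
                k = assign[d][i]                             # remove current assignment
                walk_topic[d][k] -= 1
                topic_node[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (walk_topic[d][t] + alpha)
                    * (topic_node[t][v] + beta)
                    / (topic_total[t] + num_nodes * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k                             # add the new assignment
                walk_topic[d][k] += 1
                topic_node[k][v] += 1
                topic_total[k] += 1
    return assign


# Toy walks (hypothetical) over six nodes forming two loosely separated groups.
walks = [[0, 1, 2, 1, 0], [3, 4, 5, 4, 3], [0, 2, 1, 2, 0]]
assign = lda_gibbs(walks, num_nodes=6, num_topics=2, alpha=0.5, beta=0.1,
                   iters=20, rng=random.Random(0))
print(assign)
```

The returned labels have the same shape as the walks, so they can be used directly to replace nodes by topics when building the training pairs for topic embeddings.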
5.3 Multi-Label Node Classification
In the multi-label node classification experiment, every node of the network is assigned at least one label; the goal is to predict the correct node labels by observing only a certain fraction of the network. We use the learned embedding vectors to carry out the node classification task. We randomly split the collection of feature vectors into training and test sets, and apply a one-vs-rest logistic regression classifier with regularization. In order to provide more reliable experimental results, we repeat the same procedure multiple times. We use the following three datasets in our experiments.
CiteSeer is a citation network extracted from the CiteSeer library, where nodes represent research papers and edges indicate citations between publications.
Protein-Protein Interaction (PPI) is a subgraph of the PPI network for Homo sapiens, in which each label corresponds to a biological state.
Cora is a citation network consisting of machine learning publications divided into seven categories. Every paper in the corpus cites or is cited by at least one other paper.
Table 1 provides the basic statistics of the above datasets.
Figure 3 depicts the Micro-$F_1$ scores for the variants of the TNE framework as well as for the baseline methods, with respect to the number of nodes in the training set. In Table 2, the Macro-$F_1$ scores are shown for the case where the training and test sets are of equal size. As can be seen, tne-BigC improves on both the raw Deepwalk model (deepwalk-emb) and Node2vec (node2vec-emb) on the CiteSeer dataset.
Although the overall performance of the two feature learning methods, Node2vec and Deepwalk, is the same on the PPI network, the tne-Lda model increases the score, while tne-Louvain does not perform as well.
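Since the evaluation reports both micro- and macro-averaged scores, the following sketch shows how the two averages differ for multi-label predictions. The function names and the toy labels are assumptions for illustration; they are unrelated to the actual experimental data.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true positive, false positive and false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred, labels):
    """Micro-F1 pools counts over labels; Macro-F1 averages the per-label F1 scores."""
    per_label = []
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        per_label.append(f1_from_counts(tp, fp, fn))
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro = f1_from_counts(tp_sum, fp_sum, fn_sum)
    macro = sum(per_label) / len(per_label)
    return micro, macro


# Toy example (hypothetical): three nodes, label sets "a" and "b".
y_true = [{"a"}, {"a", "b"}, {"b"}]
y_pred = [{"a"}, {"a"}, {"a", "b"}]
micro, macro = micro_macro_f1(y_true, y_pred, labels=["a", "b"])
print(micro, macro)
```

Micro-averaging is dominated by frequent labels, whereas macro-averaging gives each label the same weight, which is why the two scores can rank methods differently.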
5.4 The effect of the number of topics
In this subsection, we analyze the effect of the number of topics (or clusters) on the performance of our framework. We perform experiments on the CiteSeer network and we examine the tne-Lda and tne-Hmm models on the collection of random walks generated by Deepwalk and Node2vec. All the parameter settings are the same as those described in Subsection 5.2, except the number of topics. Figure 4 indicates that increasing the number of topics makes a positive contribution up to a certain value for the tne-Lda model. On the other hand, this does not hold for tne-Hmm, which performs better at one particular number of topics under both random walk strategies. The chosen number of topics matters most for large training data sizes; the scores get closer to each other as the training size decreases.
5.5 The effect of the concatenation strategy
In Section 4, we have described how to combine the node and topic feature vectors in order to construct topical node embeddings. Here, we perform several experiments to observe the behavior of those strategies over varying training data sizes. Figure 5 depicts the Micro-$F_1$ scores on the CiteSeer network. As can be seen, two of the strategies substantially outperform the third one in all cases, and their scores are very close to each other.
5.6 Link Prediction
In the link prediction task, we have limited access to the edges of the network, and our goal is to predict the missing (unseen) edges between nodes. We divide the edge set of a given network into two parts to form training and test sets, by randomly removing a fraction of the edges (the network remains connected during the process). The removed edges are later used as positive samples in the test set. For each of the training and test sets, the same number of node pairs that do not exist in the initial network is chosen as negative samples. The node embedding vectors are converted into edge features based on the binary operators listed in Table 4.
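Binary operators of this kind map a pair of node vectors to a single edge feature vector. The sketch below implements the four operators commonly used with Node2vec (average, Hadamard, weighted-L1 and weighted-L2); whether these are exactly the operators of Table 4 is an assumption, as are the function name and the toy vectors.

```python
def edge_features(u_vec, v_vec, op):
    """Combine two node embeddings into one edge feature vector, element by element."""
    if op == "average":
        return [(a + b) / 2.0 for a, b in zip(u_vec, v_vec)]
    if op == "hadamard":
        return [a * b for a, b in zip(u_vec, v_vec)]
    if op == "l1":
        return [abs(a - b) for a, b in zip(u_vec, v_vec)]
    if op == "l2":
        return [(a - b) ** 2 for a, b in zip(u_vec, v_vec)]
    raise ValueError(f"unknown operator: {op}")


# Toy embeddings (hypothetical) for the two endpoints of a candidate edge.
u_vec = [1.0, 2.0]
v_vec = [3.0, -2.0]
for op in ("average", "hadamard", "l1", "l2"):
    print(op, edge_features(u_vec, v_vec, op))
```

All four operators are symmetric in their arguments, so the feature of an undirected edge does not depend on the order of its endpoints.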
We perform all experiments using the logistic regression classifier with regularization on the following networks:
Table 3 presents the area under curve (AUC) scores for the link prediction task. As it can be seen, the proposed TNE framework outperforms the baseline methods in all cases. For the Facebook network, tne-BigC gives the best results for all but the average operator – which also corresponds to the best performing model across all different settings.
6 Conclusions and Future Work
In this paper, we have proposed TNE, a latent model for representation learning on networks. TNE takes advantage of the topics (or clusters) that a node belongs to – leading to the concept of topical node embeddings. That way, TNE is capable of producing enriched latent node representations, compared to traditional random walk-based approaches, leading to improved performance results in the tasks of node classification and link prediction.
Currently, TNE can be applied along with random walk-based approaches. An interesting future direction is to extend the framework to other NRL algorithms. Moreover, motivated by the hierarchical community structure that many real networks follow, another promising direction would be to extend the framework towards learning hierarchical node embeddings. Lastly, we plan to evaluate TNE in the task of community detection.
References
- [1] M. Belkin and P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in NIPS, 2002, pp. 585–591.
- [2] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, Fast unfolding of communities in large networks, J. Stat. Mech., 2008 (2008), p. P10008.
- [3] L. Bottou, Stochastic gradient learning in neural networks, in Proceedings of Neuro-Nîmes, EC2, 1991.
- [4] S. Cao, W. Lu, and Q. Xu, GraRep: Learning graph representations with global structural information, in CIKM, 2015, pp. 891–900.
- [5] S. Cavallari, V. W. Zheng, H. Cai, K. C.-C. Chang, and E. Cambria, Learning community embedding with community detection and node embedding on graphs, in CIKM, 2017, pp. 377–386.
- [6] H. Chen, B. Perozzi, Y. Hu, and S. Skiena, HARP: Hierarchical representation learning for networks, in AAAI, 2018.
- [7] D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res., 3 (2003), pp. 993–1022.
- [8] M. Girvan and M. E. J. Newman, Community structure in social and biological networks, PNAS, 99 (2002), pp. 7821–7826.
- [9] T. L. Griffiths and M. Steyvers, Finding scientific topics, PNAS, 101 (2004), pp. 5228–5235.
- [10] A. Grover and J. Leskovec, Node2vec: Scalable feature learning for networks, in KDD, 2016, pp. 855–864.
- [11] W. L. Hamilton, R. Ying, and J. Leskovec, Representation learning on graphs: Methods and applications, IEEE Data Eng. Bull., 40 (2017), pp. 52–74.
- [12] T. Hofmann and J. Buhmann, Multidimensional scaling and data clustering, (1995), pp. 459–466.
- [13] J. Leskovec, J. Kleinberg, and C. Faloutsos, Graph evolution: Densification and shrinking diameters, ACM Trans. Knowl. Discov. Data, 1 (2007).
- [14] J. Leskovec and J. J. McAuley, Learning to discover social circles in ego networks, in NIPS, 2012, pp. 539–547.
- [15] J. Li, J. Zhu, and B. Zhang, Discriminative deep random walk for network classification, in ACL, 2016, pp. 1004–1013.
- [16] Y. Liu, Z. Liu, T.-S. Chua, and M. Sun, Topical word embeddings, in AAAI, 2015, pp. 2418–2424.
- [17] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, in NIPS, 2013, pp. 3111–3119.
- [18] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, Asymmetric transitivity preserving graph embedding, in KDD, 2016, pp. 1105–1114.
- [19] G. Palla, I. Derényi, I. Farkas, and T. Vicsek, Uncovering the overlapping community structure of complex networks in nature and society, Nature, 435 (2005), p. 814.
- [20] B. Perozzi, R. Al-Rfou, and S. Skiena, Deepwalk: Online learning of social representations, in KDD, 2014, pp. 701–710.
- [21] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang, Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec, in WSDM, 2018, pp. 459–467.
- [22] L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo, Struc2vec: Learning node representations from structural identity, in KDD, 2017, pp. 385–394.
- [23] M. Ripeanu, A. Iamnitchi, and I. Foster, Mapping the Gnutella network, IEEE Internet Computing, 6 (2002), pp. 50–57.
- [24] S. T. Roweis and L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science, 290 (2000), pp. 2323–2326.
- [25] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad, Collective classification in network data, (2008).
- [26] J. B. Tenenbaum, V. d. Silva, and J. C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science, 290 (2000), pp. 2319–2323.
- [27] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang, Community preserving network embedding, in AAAI, 2017, pp. 203–209.
- [28] J. Winn and C. M. Bishop, Variational message passing, J. Mach. Learn. Res., 6 (2005), pp. 661–694.
- [29] J. Yang and J. Leskovec, Overlapping community detection at scale: A nonnegative matrix factorization approach, in WSDM, 2013, pp. 587–596.
Proof of Lemma 3.1
Let and be the node and topic sequences that are generated by the Markov model defined in Section 3 for given parameters , and with probability
The probability of generating the same pairs and by the Lda model is
for a given and , where and
are the hyper-parameters. If the emission, transition and initial state probabilities of the Markov chain are chosen as follows:, and , then Eq. (Proof of Lemma 3.1) can be re-written as
which is equal to the probability given in Eq. (4).