Supervised Q-walk for Learning Vector Representation of Nodes in Networks

Automatic feature learning algorithms are at the forefront of modern day machine learning research. We present a novel algorithm, supervised Q-walk, which applies Q-learning to generate random walks on graphs such that the walks prove to be useful for learning node features suitable for tackling the node classification problem. We present another novel algorithm, a k-hops neighborhood based confidence values learner, which learns confidence values of labels for unlabelled nodes in the network without first learning the node embeddings. These confidence values aid in learning an apt reward function for Q-learning. We demonstrate the efficacy of the supervised Q-walk approach over existing state-of-the-art random walk based node embedding learners in solving the single- and multi-label multi-class node classification problem using several real world datasets. In summary, our approach is a novel state-of-the-art technique for learning features of nodes in networks, tailor-made for the node classification problem.

I Introduction

Consider a social network of users where users have professions like scientist, manager, student, etc. Each user can be connected to users from different professions. The problem is to predict the professions of users whose profession information is missing, based upon their context in the network. Such problems belong to the class of problems known as node classification problems.

In the node classification problem, the task is to predict the labels of those nodes in a network whose label information is missing. A node can have one or more labels associated with it; e.g. if the nodes are people, then the labels can be student, artist, dancer, etc. This problem seems solvable using supervised machine learning techniques. The downside is that we do not know the required set of features which needs to be fed into the machine learning system for performing node classification.

In our work, we have developed a technique for learning features of nodes in networks, i.e. node embeddings. Our technique provides a supervised adaptation of node2vec [9], a semi-supervised learning algorithm to learn continuous valued features of nodes in networks.

Our algorithm is based on a simple intuition: we want nodes which have the same labels to have very similar embeddings. Therefore, we would like to perform random walks on the nodes in the network such that if the nodes on a random walk are $v_1, v_2, \ldots, v_l$, where $l$ is the walk length, then the embedding of $v_i$ is very similar to that of $v_j$ for any $i \neq j$.

To realise the above intuition, we lay out a two-step approach to perform the random walks:

  1. We associate confidence values with all node-label pairs. The confidence values give us a hint about the tentative labels for nodes.

  2. We associate a reward function with every edge in the network. These rewards are used by the Q-learning algorithm to learn the Q-values for each node-edge pair. The Q-values then guide the random walks such that, in the ideal scenario, if the random walk is $v_1, v_2, \ldots, v_l$, then the actual labels associated with $v_1, v_2, \ldots, v_l$ are all the same.

The generated random walks are treated as sentences in a document. We then apply Skip-gram [18] with Negative Sampling [19] to learn the node embeddings. These node embeddings are then used to train a classifier for checking the goodness of the learnt embeddings.

Related work is discussed in Section II. The details of the approach are given in Section III. The experimental details are mentioned in Section IV. The conclusion and future work are presented in Sections V and VI respectively.

II Related Work

The field of graph analytics is pacing up due to the growth of large datasets in social network analysis [16][2][15], communication networks [14][13], etc. The area of node classification [6] has been approached earlier from different perspectives, such as factorization based approaches and random walk based approaches.

Factorization based techniques represent the edges in networks as matrices. These matrices are factorized to obtain the embeddings. The matrix representation and its factorization are done using various techniques [24][3][1][8][20]. These methods may suffer from scalability issues on large graph datasets, and sparse matrix representations need special attention.

Random walk based approaches perform random walks on networks to obtain the embeddings. Two popular techniques are DeepWalk [22] and node2vec [9]. node2vec is a semi-supervised algorithmic framework which showcases strategies to perform random walks such that nodes which are homophilic and/or structurally equivalent end up getting similar embeddings. The random walks are guided by a heuristic which involves computing the distance of the next possible nodes from the previous node given the current node. DeepWalk can be considered a special case of node2vec with $p = 1$ and $q = 1$, where $p$ and $q$ are the node2vec hyperparameters which decide the tradeoff between depth-first and breadth-first sampling.

Our approach falls under random walk based approaches. We compare our approach against node2vec in Section IV. Instead of using a hand-crafted random walk strategy, we decided to learn how to do random walks using reinforcement learning. We perform random walks such that nodes which have the same labels, but are not necessarily structurally equivalent, end up getting embeddings close to one another in the embedding space. The random walks are guided by the Q-values of the node-action pairs.

III Learning Vector Representation of Nodes

In Section III-A, we define the problem formally. In Section III-B, we propose a k-hops neighborhood based confidence values learner which learns the confidence values of labels for unlabelled nodes in the network. Using the learnt confidence values, we devise a reward function which aids Q-walk, described in Section III-C, in doing random walks. The generated random walks are then fed into word2vec, briefly described in Section III-D, to get the vector representation of nodes.

III-A Problem Definition

Consider $G = (V, E)$, where $G$ can be any (un)directed, (un)weighted simple graph. We ignore self-loops and parallel edges. $V$ is the set of vertices and $E$ is the set of edges. $V = V_L \cup V_U$ and $V_L \cap V_U = \emptyset$, where $V_L$ is the set of labelled vertices, $V_U$ is the set of unlabelled vertices, and $\emptyset$ denotes the empty set.

We want to learn a mapping $f : V \rightarrow \mathbb{R}^d$ such that for any $u, v \in V$, if $u$ and $v$ have the same labels, then the distance $\lVert f(u) - f(v) \rVert$ is minimal. We use the same objective function and assumptions, namely conditional independence and symmetry in feature space, as used in node2vec [9].

$\max_f \sum_{u \in V} \log \Pr\left( N_S(u) \mid f(u) \right)$   (1)

In (1), we are maximizing the log-probability of observing a network neighborhood $N_S(u)$, obtained by the sampling strategy $S$, starting at node $u$; in our case, $S$ is described in Section III-C. In [9], the authors have derived that

$\max_f \sum_{u \in V} \Big[ -\log Z_u + \sum_{n_i \in N_S(u)} f(n_i) \cdot f(u) \Big], \quad Z_u = \sum_{v \in V} \exp\left( f(u) \cdot f(v) \right)$   (2)

The use of (1) is justified in our case since we want to minimize $\lVert f(u) - f(v) \rVert$ for nodes with the same labels, which is equivalent to maximizing $f(u) \cdot f(v)$.

III-B K-hops neighborhood based confidence values learner

This algorithm is motivated by homophily [17] in networks. Entities of a similar kind tend to stay together, e.g. friends who share the same interests, people in the same profession, etc. We use this heuristic to find the confidence values of labels for unlabelled nodes in the graph. It is imperative to compute the confidence values, since in their absence the agent, as used in Section III-C, would get confused in choosing the appropriate direction of the random walk.

(3)

In (3), $N_k(u)$ is the $k$-hops neighborhood of a node $u \in V$. For a directed $G$, the $k$-hops are based on the outgoing edges from $u$.

(4)

In (4), $L$ is the set of all labels in $G$ and $|L|$ is the cardinality of $L$.

(5)

In (5), $L(u)$ is the set of labels associated with a labelled node $u \in V_L$.

(6)

In (6), $t$ is the iteration counter for computing the confidence values and $T_c$ is the maximum number of such iterations.

(7)

In (7), $c^{(0)}_{u,l}$ is the initial confidence value associated with the node-label pair $(u, l)$, computed for all $u \in V$ and $l \in L$.

(8)

In (8), $c^{(t)}_{u,l}$ is the confidence value of $u$ for label $l$ at iteration $t$.

The final value $c_{u,l}$ is the confidence with which we can state that $u$ has label $l$, as determined from its $k$-hops neighborhood.
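To make the procedure concrete, below is a minimal sketch of a $k$-hops, homophily-based confidence values learner in Python with NetworkX. The uniform initialisation and the neighborhood-averaging update are illustrative assumptions standing in for Eqs. (3)-(8), not a verbatim reproduction of them; the function and argument names are ours.

```python
# A minimal sketch of a k-hops, homophily-based confidence values learner in
# the spirit of Section III-B. The averaging update below is an illustrative
# stand-in for Eqs. (3)-(8).
import networkx as nx

def learn_confidence_values(G, node_labels, all_labels, k=1, iterations=10):
    """node_labels maps each labelled node to its set of labels."""
    # Labelled nodes are certain about their labels; unlabelled nodes start
    # with uniform confidence over all labels.
    conf = {}
    for u in G.nodes():
        if u in node_labels:
            conf[u] = {l: float(l in node_labels[u]) for l in all_labels}
        else:
            conf[u] = {l: 1.0 / len(all_labels) for l in all_labels}

    for _ in range(iterations):
        new_conf = {}
        for u in G.nodes():
            if u in node_labels:                 # labelled nodes stay fixed
                new_conf[u] = conf[u]
                continue
            # k-hops neighborhood; follows outgoing edges if G is directed
            hood = set(nx.single_source_shortest_path_length(G, u, cutoff=k))
            hood.discard(u)
            if not hood:
                new_conf[u] = conf[u]
                continue
            new_conf[u] = {l: sum(conf[v][l] for v in hood) / len(hood)
                           for l in all_labels}
        conf = new_conf
    return conf
```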

III-C Supervised Q-walk

To generate random walks, we look at the graph as a Markov Decision Process (MDP) [4], where each node $u$ is a state and the outgoing edges from $u$ are the actions possible at $u$; taking the action $(u, v)$ moves the agent to the neighbor $v$ with probability 1. Imagine an agent at $u$ which has to decide the next node $v$. To aid the agent in taking the right decision, we perform Q-learning [27], so the agent decides based upon the Q-value $Q(u, v)$. Hence, the generated random walks are called Q-walks. The Q-walks are supervised because the reward function (9) depends on the confidence values, which are learnt in a supervised way as per Section III-B.

(9)

In (9), $r(u, v)$ is the reward obtained when the agent moves from $u$ to $v$ along the edge $(u, v)$, where $(u, v) \in E$. The closer the reward is to its maximum value, the more similar $v$ is to $u$.

(10)

In (10), $t$ is the iteration counter in Q-learning and $T_Q$ is the maximum number of such iterations.

(11)

In (11), $Q_0(u, a)$ is the initial Q-value for the node-action pair $(u, a)$.

(12)

$Q_{t+1}(u, v) = (1 - \alpha_t)\, Q_t(u, v) + \alpha_t \left( r(u, v) + \gamma \max_{w} Q_t(v, w) \right)$   (13)

In (12) and (13), $\alpha_t$ is the learning rate for epoch $t$, its initial value is user-defined, and $\gamma$ is the discount factor; (13) is the standard Q-learning update [27]. In (12), we update the learning rate at each iteration such that its value decreases with the number of iterations and the difference between $Q_{t+1}$ and $Q_t$ diminishes over time; in other words, the Q-values converge.

$x \sim \mathrm{Uniform}(0, 1)$   (14)

$v = \begin{cases} \arg\max_{w} Q(u, w) & \text{if } x \le \epsilon \\ \text{a uniformly random neighbor of } u & \text{otherwise} \end{cases}$   (15)

The agent needs to decide on the next action when it is at $u$. It first samples a random number $x$ as per (14), then it decides upon the action as per (15). In (15), $\epsilon$ is the exploitation probability, with $0 \le \epsilon \le 1$. We generate a fixed number of random walks of length $l$ starting from each node.
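Below is a minimal sketch of this section under stated assumptions: the reward is taken as the negative L1 distance between the endpoints' confidence vectors (an illustrative stand-in for Eq. (9)), and the decay $\alpha_t = \alpha_0 / t$ stands in for the schedule of Eq. (12). It consumes the confidence values produced by the learner sketched in Section III-B.

```python
# A minimal sketch of Q-learning on the graph MDP followed by epsilon-greedy
# Q-walks (Eqs. (14)-(15)). reward() and the learning-rate decay are
# illustrative assumptions, not the paper's exact Eqs. (9) and (12).
import random

def reward(conf, u, v):
    # Negative L1 distance between label-confidence vectors: higher (closer
    # to 0) when v's label confidences resemble u's.
    return -sum(abs(conf[u][l] - conf[v][l]) for l in conf[u])

def learn_q_values(G, conf, iterations=100, alpha0=0.5, gamma=0.3):
    # One Q-value per (node, outgoing-edge) pair; undirected edges get a
    # Q-value in each direction.
    Q = {(u, v): 0.0 for u in G for v in G[u]}
    for t in range(1, iterations + 1):
        alpha_t = alpha0 / t                     # decaying learning rate
        for (u, v) in Q:
            future = max((Q[(v, w)] for w in G[v]), default=0.0)
            Q[(u, v)] = ((1 - alpha_t) * Q[(u, v)]
                         + alpha_t * (reward(conf, u, v) + gamma * future))
    return Q

def q_walk(G, Q, start, walk_length=80, epsilon=0.8):
    walk = [start]
    while len(walk) < walk_length:
        u = walk[-1]
        neighbors = list(G[u])
        if not neighbors:                        # dead end (directed graphs)
            break
        if random.random() <= epsilon:           # exploit: greedy on Q-values
            v = max(neighbors, key=lambda w: Q[(u, w)])
        else:                                    # explore: uniform random
            v = random.choice(neighbors)
        walk.append(v)
    return walk
```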

III-D word2vec

The generated random walks can be considered as sentences in a text document. As per (1), we have to maximize the probability of the context nodes given the node $u$. To achieve this, we use Skip-gram [18] with Negative Sampling [19]. We denote the context window size as $w$ and the number of word2vec epochs as $e$.
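For concreteness, a minimal sketch of this step using Gensim's Word2Vec (assuming the Gensim >= 4 API; the walks and parameter values are placeholders, not the paper's settings):

```python
# Feeding the generated Q-walks, treated as sentences, into Skip-gram (sg=1)
# with negative sampling (negative > 0). All values here are placeholders.
from gensim.models import Word2Vec

walks = [["1", "5", "3", "7"], ["2", "4", "1", "6"]]  # toy Q-walks; node ids as strings
model = Word2Vec(sentences=walks,
                 vector_size=128,  # embedding dimension d
                 window=10,        # context window size w
                 sg=1,             # Skip-gram
                 negative=5,       # negative sampling
                 min_count=0,
                 epochs=5)         # word2vec epochs e
embedding = model.wv["1"]          # learnt vector for node 1
```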

IV Experiments

Figure 1: Mean F1 scores of supervised Q-walk against the baseline node2vec for different settings of the discount factor $\gamma$, with all other hyperparameters fixed.
Figure 2: Mean F1 scores of supervised Q-walk against baseline node2vec for different settings of the exploitation probability $\epsilon$, with all other hyperparameters fixed.
Figure 3: Mean F1 scores of supervised Q-walk against baseline node2vec for different settings of the number of word2vec epochs, with all other hyperparameters fixed.
Figure 4: Mean F1 scores of supervised Q-walk against baseline node2vec for different settings of the ratio of labelled nodes used for learning confidence values as per Section III-B, with all other hyperparameters fixed.
Figure 5: Mean F1 scores of supervised Q-walk against baseline node2vec for different settings of the number of hops $k$ used for learning confidence values as per Section III-B, with all other hyperparameters fixed.
Figure 6: Mean F1 scores of supervised Q-walk against baseline node2vec for different settings of the node feature representation dimensions $d$, with all other hyperparameters fixed.
Figure 7: Mean F1 scores of supervised Q-walk against baseline node2vec for different settings of the number of walks per node, with all other hyperparameters fixed.
Figure 8: Mean F1 scores of supervised Q-walk against baseline node2vec for different settings of the walk length $l$, with all other hyperparameters fixed.
Figure 9: Mean F1 scores of supervised Q-walk against baseline node2vec for different settings of the window size $w$, with all other hyperparameters fixed.
Figure 10: Mean F1 scores of node2vec against supervised Q-walk for different values of $p$ and $q$, with all other hyperparameters fixed.
             Supervised Q-walk     node2vec              % improvement
             Macro-F1  Micro-F1    Macro-F1  Micro-F1    Macro-F1  Micro-F1
Yeast         0.2896    0.3797      0.2409    0.3212      20.21%    18.21%
BlogCatalog   0.4051    0.5420      0.1984    0.3062     104.18%    77.01%
Flickr        0.4340    0.5505      0.2087    0.2853     107.95%    92.95%
Table I: Mean F1 scores of supervised Q-walk and node2vec on different datasets under fixed hyperparameter settings.

IV-A Datasets

We provide a brief overview of the datasets which were used to perform the experiments.

IV-A1 Yeast

The Yeast [7] dataset is a protein-protein interaction network of budding yeast. It is an undirected and unweighted network. It offers a single-label multi-class classification problem.

IV-A2 BlogCatalog

The BlogCatalog [25] dataset is a social network of bloggers. The labels associated with the bloggers refer to their interests, which are obtained from the metadata available on the BlogCatalog site. It consists of 10,312 nodes, 333,983 edges and 39 classes. It is an undirected and unweighted network. It offers a multi-label multi-class classification problem.

IV-A3 Flickr

The Flickr [25] dataset is a contact network of users on the Flickr site. The labels associated with the users refer to their groups. It consists of 80,513 nodes, 5,899,882 edges and 195 classes. It is an undirected and unweighted network. It offers a multi-label multi-class classification problem.

IV-B Performance Evaluation

We evaluate the performance by first computing the vector representation of nodes in the network using both node2vec and supervised Q-walk for specific hyperparameter settings. Then, we compute the mean of the macro and micro F1 scores obtained by performing 5-fold cross validation using a k-nearest neighbors (k-NN) [10] classifier. We use k-NN for a couple of reasons. First, we are interested in showing that our learnt embeddings are similar for nodes with the same labels; such similarity can be measured by the Euclidean distance between the node embeddings of the concerned nodes, in accordance with Section III-A. Second, it is a non-linear classifier, so the learnt embeddings need not be linearly separable to achieve good classification performance. We denote the number of neighbors in k-NN as $k_{nn}$.
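A minimal sketch of this evaluation protocol with scikit-learn, for the single-label case (multi-label datasets such as BlogCatalog and Flickr would need a multi-label wrapper); the embeddings and labels below are random placeholders:

```python
# 5-fold cross validation of a k-NN classifier on learnt node embeddings,
# reporting mean macro and micro F1. X and y are random placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((100, 128))               # placeholder node embeddings
y = rng.integers(0, 5, size=100)         # placeholder single-label classes

knn = KNeighborsClassifier(n_neighbors=5)  # number of neighbors in k-NN
macro = cross_val_score(knn, X, y, cv=5, scoring="f1_macro").mean()
micro = cross_val_score(knn, X, y, cv=5, scoring="f1_micro").mean()
print(f"Macro-F1: {macro:.4f}  Micro-F1: {micro:.4f}")
```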

IV-C Results

It can be observed in Fig. 1 that certain values of the discount factor $\gamma$ provide higher performance than others on the Yeast dataset. Even for $\gamma = 0$, the F1 scores are better than those for a number of other values, which signifies the aptness of the reward function (9), since (13) then reduces to $Q_{t+1}(u, v) = (1 - \alpha_t)\, Q_t(u, v) + \alpha_t\, r(u, v)$, which does not include the future term $\gamma \max_{w} Q_t(v, w)$. On the BlogCatalog dataset, the supervised Q-walk approach achieves mean Macro-F1 and Micro-F1 scores higher than the F1 scores for node2vec, as shown in Table I.

In Fig. 2, we can observe the tradeoff involved between exploration and exploitation. $\epsilon = 0$ means that we do not make use of the learnt Q-values and instead resort to randomly exploring the network, thereby leading to poor performance. $\epsilon = 1$ means that we always make the greedy choice by opting for the action which yields the maximum Q-value; this complete exploitation policy again leads to poor performance. An intermediate $\epsilon$ rightly balances the tradeoff between exploration and exploitation, thereby leading to higher F1 scores than other values of $\epsilon$.

In Fig. 3, we can observe that supervised Q-walk needs only a small number of word2vec epochs to give high F1 scores; in the case of the Yeast dataset, a few epochs were sufficient. The randomness in the curves can be attributed to different weight initializations of word2vec across runs of the experiment. The F1 scores stop improving beyond a certain number of epochs, which signifies that the model is trained and starts overfitting with further epochs.

In Fig. 4, we can observe that the supervised Q-walk approach performs better than node2vec when around 50% or more of the nodes are labelled and used for learning the confidence values as per Section III-B. For lower values of the labelled ratio, the performance is poor because the k-hops neighborhood based confidence values learner does not get enough labelled nodes to learn the confidence values properly. For higher values, the performance improves. In a real world setting, we may not get such a large labelled set to train upon; for all other experiments we choose the labelled ratio in line with the training/test split generally used in machine learning.

It can be observed in Fig. 5 that $k = 1$ is sufficient for learning good confidence values; in other words, the confidence values can be determined by just looking at the immediate neighbors of any node $u$.

It can be observed in Fig. 6 that certain settings of the embedding dimension $d$ give high F1 scores, and the supervised Q-walk approach gives better F1 scores than node2vec across different settings of $d$.

It can be observed in Fig. 7 that the supervised Q-walk approach gives high F1 scores with few walks per node, while node2vec gives high F1 scores only with more walks per node and low scores with few. So, our approach takes a smaller number of walks per node to give better performance than node2vec.

It can be observed in Figs. 8 and 9 that the supervised Q-walk approach gives better F1 scores than node2vec, and moderate values of the walk length $l$ and the window size $w$ are good enough for our approach on the Yeast dataset.

It can be observed in Fig. 10 that the supervised Q-walk approach performs better than node2vec hyperparameterised by different combinations of $p$ and $q$, which control the degree of depth-first sampling (DFS) and breadth-first sampling (BFS).

The mean F1 scores are calculated for specific hyperparameter settings, which leaves room for improvement by fine-tuning the hyperparameters through cross validation with grid search or random search [5] over the hyperparameter space.
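For instance, a simple grid search over two of the hyperparameters could look like the sketch below, where evaluate() is a hypothetical stand-in for learning embeddings with the given settings and returning the mean Micro-F1 from the k-NN cross validation of Section IV-B:

```python
# A simple grid search over the discount factor and exploitation probability.
# evaluate() is a placeholder; in practice it would run the full pipeline.
from itertools import product
import random

def evaluate(gamma, epsilon):
    # Placeholder objective: learn embeddings with these settings and
    # return the mean Micro-F1 from cross validation.
    return random.random()

grid = {"gamma": [0.0, 0.3, 0.6, 0.9], "epsilon": [0.2, 0.5, 0.8, 1.0]}
best = max(product(grid["gamma"], grid["epsilon"]),
           key=lambda ge: evaluate(*ge))
print("best (gamma, epsilon):", best)
```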

IV-D Technologies Deployed

All the experiments were carried out on a server housing a 48-core Intel Xeon processor and 252 GB RAM, running Ubuntu 16.04. The experiments were coded in Python 3.6 using the third-party libraries NetworkX [11], NumPy [26], Matplotlib [12], scikit-learn [21] and Gensim [23].

V Conclusion

We have presented a novel supervised Q-walk approach to generate random walks guided by Q-values and assisted by k-hops neighborhood based confidence values learner. We have shown experimentally that the node embeddings learnt from our approach are similar for the nodes with the same labels. We have also shown that our approach outperforms node2vec in the node classification task.

VI Future Work

Supervised Q-walk works better in cases where the assumption of homophily in networks holds true. In other networks, nodes which are structurally equivalent may have the same labels, e.g. networks of bio-chemical compounds. Our work can be extended by composing another reward function from a term $r_h$ which encourages homophily in networks and a term $r_s$ which encourages structural equivalence of nodes, with a hyperparameter $\lambda$ deciding the tradeoff between homophily and structural equivalence.
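One plausible form of such a composite reward, sketched with hypothetical reward terms $r_h$ and $r_s$:

```python
# A sketch of the composite reward suggested above; r_h and r_s are
# hypothetical homophily- and structure-based reward functions.
def composite_reward(u, v, r_h, r_s, lam=0.5):
    # lam -> 1 favours homophily; lam -> 0 favours structural equivalence
    return lam * r_h(u, v) + (1 - lam) * r_s(u, v)
```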

VII Acknowledgements

We acknowledge the valuable insights provided by Dr Robert West, Assistant Professor, Data Science Lab, School of Computer and Communication Sciences, EPFL, Switzerland. His lab also provided us with the compute infrastructure for carrying out all the experiments.

References

  • [1] Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexander J. Smola. Distributed large-scale natural graph factorization. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, pages 37–48, New York, NY, USA, 2013. ACM.
  • [2] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: Membership, growth, and evolution. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, pages 44–54, New York, NY, USA, 2006. ACM.
  • [3] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems 14, pages 585–591. MIT Press, 2001.
  • [4] Richard Bellman. A Markovian Decision Process. Indiana Univ. Math. J., 6:679–684, 1957.
  • [5] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:281–305, February 2012.
  • [6] Smriti Bhagat, Graham Cormode, and S. Muthukrishnan. Node classification in social networks. CoRR, abs/1101.3291, 2011.
  • [7] D. Bu, Y. Zhao, L. Cai, H. Xue, X. Zhu, H. Lu, J. Zhang, S. Sun, L. Ling, N. Zhang, G. Li, and R. Chen. Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Research, 31:2443–2450, 2003.
  • [8] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, pages 891–900, New York, NY, USA, 2015. ACM.
  • [9] Aditya Grover and Jure Leskovec. Node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 855–864, New York, NY, USA, 2016. ACM.
  • [10] Gongde Guo, Hui Wang, David Bell, Yaxin Bi, and Kieran Greer. KNN Model-Based Approach in Classification, pages 986–996. Springer Berlin Heidelberg, Berlin, Heidelberg, 2003.
  • [11] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy2008), pages 11–15, Pasadena, CA USA, August 2008.
  • [12] J. D. Hunter. Matplotlib: A 2d graphics environment. Computing In Science & Engineering, 9(3):90–95, 2007.
  • [13] Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. Signed networks in social media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’10, pages 1361–1370, New York, NY, USA, 2010. ACM.
  • [14] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Trans. Knowl. Discov. Data, 1(1), March 2007.
  • [15] Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. CoRR, abs/0810.1355, 2008.
  • [16] Jure Leskovec and Julian J. Mcauley. Learning to discover social circles in ego networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 539–547. Curran Associates, Inc., 2012.
  • [17] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001.
  • [18] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
  • [19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
  • [20] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 1105–1114, New York, NY, USA, 2016. ACM.
  • [21] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [22] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 701–710, New York, NY, USA, 2014. ACM.
  • [23] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.
  • [24] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. SCIENCE, 290:2323–2326, 2000.
  • [25] Lei Tang and Huan Liu. Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 817–826, New York, NY, USA, 2009. ACM.
  • [26] Stefan van der Walt, S. Chris Colbert, and Gael Varoquaux. The numpy array: A structure for efficient numerical computation. Computing in Science and Engg., 13(2):22–30, March 2011.
  • [27] Christopher J.C.H. Watkins and Peter Dayan. Technical note: Q-learning. Machine Learning, 8(3):279–292, 1992.