Link Prediction with Mutual Attention for Text-Attributed Networks

02/28/2019 ∙ by Robin Brochier, et al.

In this extended abstract, we present an algorithm that learns a similarity measure between documents from the network topology of a structured corpus. We leverage the Scaled Dot-Product Attention, a recently proposed attention mechanism, to design a mutual attention mechanism between pairs of documents. To train its parameters, we use the network links as supervision. We provide preliminary experimental results on a citation dataset for two prediction tasks, demonstrating the capacity of our model to learn a meaningful textual similarity.




1. Related Works

In this section, we review recent work in the fields of network embedding (NE) and attention mechanisms for natural language processing (NLP).

1.1. Attributed Network Embedding

DeepWalk (Perozzi et al., 2014) first proposed to adapt the word embedding algorithm Word2vec (Mikolov et al., 2013) to networks by generating paths of nodes, akin to sentences, with truncated random walks. DeepWalk and other variants are generalized into a common matrix factorization framework in NetMF (Qiu et al., 2018). To extend DeepWalk to text-attributed networks, TADW (Yang et al., 2015) expresses DeepWalk as a matrix factorization problem and incorporates a matrix of textual features $T$, produced by latent semantic indexing (LSI), into the factorization so that the vertex similarity matrix $M$ can be reconstructed as the product of three matrices $W^\top$, $H$ and $T$.
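To illustrate the truncated random walks mentioned above, the following minimal sketch generates node paths from an adjacency list. It is not DeepWalk's actual implementation: the function name `truncated_random_walks`, its parameters and the toy graph are assumptions made for this example, and the Word2vec training step that would consume the walks is omitted.

```python
import random


def truncated_random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Generate truncated random walks over a graph.

    adjacency: dict mapping each node to the list of its neighbors.
    Returns a list of walks, each walk being a list of nodes
    (the analogue of a sentence of words).
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # isolated node: the walk stops early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical toy graph: a path 0 - 1 - 2 plus an isolated node 3.
toy_graph = {0: [1], 1: [0, 2], 2: [1], 3: []}
walks = truncated_random_walks(toy_graph, walk_length=4, walks_per_node=2)
```

Each walk can then be treated as a sentence whose tokens are node identifiers.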

1.2. Attention Mechanisms for NLP

The Transformer (Vaswani et al., 2017) is a novel neural architecture that outperforms state-of-the-art methods in neural machine translation (NMT) without using convolutional or recurrent units. The Scaled Dot-Product Attention (SDPA) is the main constituent of the Transformer, the part that actually performs attention over a set of words. It takes as input a query vector $q$ and a set of key vectors of dimension $d_k$ and value vectors of dimension $d_v$. The weight of each value is generated by a compatibility function between its corresponding key and the query. Formally, the attention vectors are generated in parallel for multiple queries $Q$, following the formula: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$. The result is a set of $n$ attention vectors ($n$ being the number of queries) of dimension $d_v$. The matrices $Q$, $K$ and $V$ are produced by projecting the initial word representations with three matrices $W^Q$, $W^K$ and $W^V$ whose parameters are meant to be learned.
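The SDPA computation, softmax of the scaled query-key dot products followed by a weighted sum of the values, can be sketched in a few lines. This is a minimal pure-Python illustration, not the Transformer's implementation; the helper names (`matmul`, `softmax`, `scaled_dot_product_attention`) and the toy inputs are assumptions made for the example, and the learned projections are left out.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V for list-of-rows matrices."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)  # one row of compatibility scores per query
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)  # one attention vector per query


# Toy input: a single query that is equally compatible with both keys,
# so the output is the plain average of the two values.
Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0], [4.0]]
attention = scaled_dot_product_attention(Q, K, V)
```

The output has one row per query and as many columns as the value vectors.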

Several works (Devlin et al., 2018; Radford et al., 2018) adapted the Transformer architecture beyond the task of NMT. Their main idea is to train the Transformer in an unsupervised fashion over large corpora of texts and then refine its parameters on a wide variety of supervised tasks. Motivated by these recent works, we present a model, MATAN (Mutual Attention for Text-Attributed Networks), that derives from the SDPA to address the task of link prediction in a network of documents.

2. Proposed Model

We propose an algorithm for link prediction in text-attributed networks. Our model is trained under a NE procedure, presented in Section 2.1. The reconstruction error is optimized via the dot product between contextual document representations $u_{i|j}$ and $u_{j|i}$ (the representation of document $i$ given document $j$, and conversely). These embeddings are generated with a mutual attention mechanism over the textual contents of the documents only, described in Section 2.2.

2.1. Overall Optimization

The model takes as input a network of documents $G = (V, E, T)$, $T$ being the textual content of the documents. We precompute word embeddings and a normalized similarity measure $\mathrm{sim}_G$ between nodes designed from the adjacency matrix of the network. Each document $i$ is associated with a bag of word embeddings matrix $D_i$. For any pair of nodes $(i, j)$, mutual embeddings $(u_{i|j}, u_{j|i})$ are generated with an asymmetric mutual attention function $f_\Theta$ for both documents given their bags of word embeddings $D_i$ and $D_j$, so that $u_{i|j} = f_\Theta(D_i, D_j)$ and $u_{j|i} = f_\Theta(D_j, D_i)$. We define the unnormalized similarity between the two nodes as the dot product of their mutual embeddings, $u_{i|j} \cdot u_{j|i}$. Following Tsitsulin et al. (2018), we aim at learning the parameters $\Theta$ by minimizing the KL divergence from the similarity distribution $\mathrm{sim}_G$ (from the graph) to the normalized distribution of the dot products between the mutual embeddings (from the text associated with the nodes). We achieve this by employing noise-contrastive estimation (Tsitsulin et al., 2018), minimizing the following objective function: $\mathcal{L} = -\sum_{(i,j) \in C} \left[ \log \sigma(u_{i|j} \cdot u_{j|i}) + \sum_{l=1}^{k} \log \sigma(-u_{i|n_l} \cdot u_{n_l|i}) \right]$, where $\sigma$ is the sigmoid function and $n_1, \dots, n_k$ are negative nodes. $C$ is a corpus of pairs of nodes generated by drawing existing links uniformly from the empirical distribution of links. For each positive pair, $k$ negative nodes are drawn uniformly. To minimize this objective, we employ stochastic gradient descent using ADAM (Kingma and Ba, 2014).
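To make the noise-contrastive objective concrete, here is a minimal sketch of the loss contributed by one positive pair and its negative samples. It assumes the standard negative-sampling form (negative log-sigmoid of the positive score, plus negative log-sigmoid of each negated negative score), which may differ in detail from the exact objective used here; `nce_pair_loss` and `log_sigmoid` are hypothetical names, and the scores stand for dot products between mutual embeddings.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def nce_pair_loss(positive_score, negative_scores):
    """Loss for one linked pair and its k negative samples.

    positive_score: dot product of the mutual embeddings of a linked pair.
    negative_scores: dot products obtained with k uniformly drawn nodes.
    """
    loss = -log_sigmoid(positive_score)
    for score in negative_scores:
        loss -= log_sigmoid(-score)
    return loss


# An uninformative model (all scores at zero) pays log(2) per term.
baseline = nce_pair_loss(0.0, [0.0])
# A model that separates the linked pair from the negative pays much less.
separated = nce_pair_loss(10.0, [-10.0])
```

Summing this quantity over the sampled pairs gives a value that a gradient-based optimizer can minimize.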

2.2. Mutual Attention Mechanism

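The role of $f_\Theta(D_i, D_j)$ is to generate a contextual representation $u_{i|j}$ of document $i$ given document $j$, where $D_i$ and $D_j$ are their bags of word embeddings. The parameters $\Theta$ we aim to learn are composed of three matrices $W^Q$, $W^K$ and $W^V$ of dimension $d \times d$ each, $d$ being the dimension of the word embeddings. For all words of the target document $i$, we create queries $Q_i = D_i W^Q$. We similarly create keys and values from the contextual document $j$, such that $K_j = D_j W^K$ and $V_j = D_j W^V$. Attention representations for each target word are then computed, following the SDPA formula: $A_{i|j} = \mathrm{softmax}\left(\frac{Q_i K_j^\top}{\sqrt{d}}\right) V_j$. Note that $A_{i|j}$ has dimension $n_i \times d$, $n_i$ being the number of words of document $i$; that is, we have a mutual attention representation of each word of document $i$ given $j$. Finally, the representation for document $i$ is obtained by averaging its word mutual attention vectors: $u_{i|j} = \frac{1}{n_i} \sum_{w=1}^{n_i} (A_{i|j})_w$, where $(A_{i|j})_w$ is the $w$-th row of $A_{i|j}$. Similarly, $u_{j|i}$ is generated by flipping indices $i$ and $j$. The intuition behind this model is that the matrices $W^Q$ and $W^K$ learn to project pairs of words that explain links in the network such that their dot products produce large weights. $W^V$ is then meant to project the word vectors such that their average produces similar representations for nodes that are close in the network and dissimilar representations for nodes that are far apart.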
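The mutual attention mechanism described above can be sketched as follows: queries come from the target document, keys and values from the contextual document, and the attended word vectors are averaged. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function names, the identity projection matrices and the two-dimensional toy documents are all hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def mutual_embedding(D_target, D_context, W_q, W_k, W_v):
    """Embed the target document given the contextual document."""
    queries = matmul(D_target, W_q)   # one query per target word
    keys = matmul(D_context, W_k)     # one key per context word
    values = matmul(D_context, W_v)   # one value per context word
    scale = math.sqrt(len(keys[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, values)  # one vector per target word
    n_words = len(attended)
    return [sum(col) / n_words for col in zip(*attended)]  # average over words


def link_score(D_i, D_j, W_q, W_k, W_v):
    """Unnormalized similarity: dot product of the two mutual embeddings."""
    u_i_given_j = mutual_embedding(D_i, D_j, W_q, W_k, W_v)
    u_j_given_i = mutual_embedding(D_j, D_i, W_q, W_k, W_v)
    return sum(a * b for a, b in zip(u_i_given_j, u_j_given_i))


# Toy setup: identity projections, a two-word document and a one-word document.
identity = [[1.0, 0.0], [0.0, 1.0]]
D_i = [[1.0, 0.0], [0.0, 1.0]]
D_j = [[1.0, 0.0]]
score = link_score(D_i, D_j, identity, identity, identity)
```

The function `mutual_embedding` is asymmetric in its two documents, whereas `link_score` is symmetric because it takes the dot product of both directions.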

3. Experiments

To assess the quality of our model, we perform two link prediction tasks on a dataset of citation links between scientific abstracts: Cora. The first evaluation, called edges-hidden, consists in hiding a percentage of the links of a network of documents and measuring, with the ROC AUC, the ability of the model to assign higher scores to hidden links than to non-existing ones. The second evaluation, called nodes-hidden, consists in splitting the network into two unconnected networks, keeping a percentage of the nodes in the training network.
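The ROC AUC used in these evaluations equals the probability that a hidden link receives a higher score than a non-existing one, with ties counted as one half. The sketch below computes it directly from that definition; the function name `roc_auc` and the toy scores are assumptions made for illustration, and the quadratic pairwise loop is only practical for small samples.

```python
def roc_auc(hidden_link_scores, non_link_scores):
    """Fraction of (hidden link, non-link) pairs ranked correctly."""
    wins = 0.0
    for positive in hidden_link_scores:
        for negative in non_link_scores:
            if positive > negative:
                wins += 1.0
            elif positive == negative:
                wins += 0.5  # ties count for half a correct ranking
    return wins / (len(hidden_link_scores) * len(non_link_scores))


# Toy scores: three of the four (hidden, non-existing) pairs are well ordered.
auc = roc_auc([0.9, 0.8], [0.1, 0.85])
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of hidden links from non-existing ones.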

We precompute on the full corpus word embeddings using GloVe (Pennington et al., 2014) of dimension 256 with a co-occurrence threshold , a window size and 50 epochs. We precompute LSI (Deerwester et al., 1990) vectors of dimension 128. For the edges-hidden prediction task, we provide results obtained by NetMF with negative samples. TADW is run with epochs and MATAN is performed with negative sample and sampled pairs of documents. As the empirical similarity matrix between the nodes, we chose the normalized adjacency matrix. All produced representations are of dimension 256.
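One common way to normalize an adjacency matrix into a similarity distribution is to divide each row by its sum, so that every node gets a probability distribution over its neighbors. The sketch below assumes this row normalization, which the text does not specify (a symmetric normalization is another possibility); `row_normalize` and the toy matrix are hypothetical.

```python
def row_normalize(adjacency):
    """Divide each row of an adjacency matrix by its sum.

    Rows of isolated nodes (sum of zero) are left as all zeros.
    """
    normalized = []
    for row in adjacency:
        total = sum(row)
        normalized.append([x / total if total else 0.0 for x in row])
    return normalized


# Toy graph: node 0 is linked to nodes 1 and 2.
A = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
similarity = row_normalize(A)
```

Each non-empty row of the result sums to one and can be read as the empirical distribution of links of the corresponding node.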

3.1. Results

% of training data   10%    20%    30%    40%    50%
NetMF                59.0   67.2   77.5   83.2   87.2
TADW                 68.0   82.0   87.1   93.2   94.5
MATAN                82.3   87.1   88.6   90.9   91.0
Table 1. Edges-hidden link prediction ROC AUC (%)

% of training data   10%    20%    30%    40%    50%
TADW                 64.2   75.8   80.3   81.9   82.3
MATAN                69.4   73.0   75.4   77.9   78.6
Table 2. Nodes-hidden link prediction ROC AUC (%)

Tables 1 and 2 show the results of our experiments. MATAN shows promising results when learning from a small percentage of training data: with 10% of training data, it obtains the best ROC AUC on both evaluations. TADW has better scores for nodes-hidden predictions from 20% of training data onward, which might be explained by the capacity of LSI, unlike GloVe, to learn discriminant features on a small dataset. In future work, we would like to deal with bigger datasets from which word embedding methods might capture richer semantic information.


  • Deerwester et al. (1990) Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 701–710.
  • Qiu et al. (2018) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 459–467.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (2018).
  • Tsitsulin et al. (2018) Anton Tsitsulin, Davide Mottin, Panagiotis Karras, and Emmanuel Müller. 2018. VERSE: Versatile Graph Embeddings from Similarity Measures. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 539–548.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
  • Yang et al. (2015) Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang. 2015. Network representation learning with rich text information. In IJCAI. 2111–2117.