Many applications that have become everyday tools offer to search and filter the vast data sources available on the Web. In particular, a multitude of platforms deal with scientific literature: from simple search engines for scientific articles to social networks for researchers, all use the publications produced daily around the world as data. For researchers facing this deluge of information, it has become difficult, if not impossible, to conduct regular and exhaustive monitoring of their areas of expertise. The ongoing research presented in this paper, done in partnership with an industrial player, Digital Scientific Research Technology (DSRT), and its web application Peerus (https://peer.us), deals with the problem of learning representations in heterogeneous networks of documents, applied to the recommendation of scientific literature in real time.
While the scientific information of a publication is mainly contained in its text, rich supplementary information is nested in its metadata. Thus, the networks of co-authors, citations and publication venues of an article contain important information for building a scientific recommender system. As such, the scientific literature constitutes a heterogeneous attributed network (HAN), and since new papers are constantly published, this HAN grows in real time (Figure 1 shows a hypothetical example). However, only limited information may be observed for newly added nodes. For example, a newly published paper has no incoming citation links, and a PhD student has few past co-authorship links. To face this lack of information, the approach considered in the proposed research is to focus on learning strong representations of the attributes, particularly the textual contents of the articles, that can reflect the partially observable underlying network structure.
2. State of the art
The quality and informativeness of data representations greatly influence the performance of machine learning algorithms. For this reason, considerable effort is devoted to devising new ways of learning representations (Bengio et al., 2013). In Section 2.1, I describe how the task of learning representations of nodes, i.e. network embedding, is tightly connected to word embedding. Then, in Section 2.2, I focus on the interplay between natural language processing and network embedding. Finally, in Section 2.3, I present recent works that extend network embedding techniques to HANs.
2.1. From Word Embedding to Network Embedding
The distributional hypothesis (Sahlgren, 2008) forms the basis of word embedding algorithms. It assumes that the distributional similarity of words correlates strongly with their semantic similarity. In other words, if we learn a representation of a word that allows us to predict the other words that occur in its context, we obtain a representation of its meaning.
Skip-Gram (Mikolov et al., 2013) is an algorithm that builds representations of words by maximizing the log-likelihood of a multi-set of co-occurring word pairs. Skip-Gram with negative sampling is a variation proposed in (Mikolov et al., 2013) to efficiently approximate that log-likelihood. This is achieved by reducing the task to a classification problem that consists in distinguishing pairs of words that co-occur from false pairs that do not. An alternative approach, GloVe (Pennington et al., 2014), learns representations of words by factorizing a matrix of co-occurrence counts of the words of a corpus. Its objective is to minimize the reconstruction error of the matrix, considering only the non-zero co-occurrence counts.
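To make the negative-sampling objective concrete, here is a minimal NumPy sketch of one stochastic update; the toy vocabulary, learning rate and function name are illustrative, not the reference implementation:

```python
import numpy as np

def sgns_step(W, C, center, context, negatives, lr=0.1):
    """One stochastic update of Skip-Gram with negative sampling:
    push sigmoid(w . c) toward 1 for the observed (center, context)
    pair and toward 0 for the sampled negative words."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    w, c = W[center], C[context]
    g = 1.0 - sigmoid(w @ c)           # gradient scale of log sigmoid(w . c)
    dw, dc = lr * g * c.copy(), lr * g * w.copy()
    W[center] += dw
    C[context] += dc
    for neg in negatives:              # gradient of log sigmoid(-w . c_neg)
        cn = C[neg]
        gn = -sigmoid(w @ cn)
        dw, dn = lr * gn * cn.copy(), lr * gn * w.copy()
        W[center] += dw
        C[neg] += dn

# toy vocabulary: word 0 co-occurs with word 1, never with word 2
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, (3, 4))   # word embeddings
C = rng.normal(0.0, 0.1, (3, 4))   # context embeddings
for _ in range(200):
    sgns_step(W, C, center=0, context=1, negatives=[2])
```

After training, the dot product of the true pair exceeds that of the false pair, which is exactly the classification the objective encodes.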
Even though the distributional hypothesis originated in linguistics and is naturally leveraged for word embedding, Perozzi et al. establish the connection with network embedding. To do so, they show that the frequency at which nodes appear in short random walks follows a power-law distribution, like the frequency of words in language (Perozzi et al., 2014). They propose DeepWalk, which consists in applying Skip-Gram with hierarchical softmax to a corpus of node sequences, deemed equivalent to sentences, generated with truncated random walks (Perozzi et al., 2014). For some specific tasks, the representations learned with DeepWalk offer large performance improvements. Thus, many subsequent works focus on modifying or extending DeepWalk. Node2vec replaces random walks with biased random walks, in order to better balance the exploration-exploitation trade-off, arguing that the added flexibility in exploring neighborhoods helps learn richer representations (Grover and Leskovec, 2016). VERSE (Tsitsulin et al., 2018) provides a scalable graph embedding algorithm by defining a versatile similarity matrix of the nodes and a learning algorithm using noise-contrastive estimation (Gutmann and Hyvärinen, 2010), which provably converges to its objective, in contrast to negative sampling.
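The corpus generation step of DeepWalk can be sketched in a few lines; the function name and the toy graph are illustrative, and the resulting sequences would then be fed to any Skip-Gram implementation:

```python
import random

def truncated_random_walks(adj, num_walks=10, walk_length=8, seed=0):
    """Generate DeepWalk-style node sequences ("sentences").

    adj: dict mapping each node to the list of its neighbors.
    Each node starts `num_walks` walks of at most `walk_length` nodes.
    """
    rng = random.Random(seed)
    walks = []
    nodes = list(adj)
    for _ in range(num_walks):
        rng.shuffle(nodes)             # new visiting order for each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:      # dead end: truncate the walk
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# toy graph: a 4-cycle
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
corpus = truncated_random_walks(adj, num_walks=2, walk_length=5)
```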
2.2. Natural Language Processing in Networks of Documents
As a special case of attributed networks, graphs of documents bring together the fields of natural language processing (NLP) and network embedding. A wide variety of unsupervised learning algorithms to represent words and documents have been proposed, from the well-known bag-of-words model (Harris, 1954) to the recently introduced attention-based Transformer (Vaswani et al., 2017) adapted for unsupervised pre-training (Devlin et al., 2018). However, few works have fully explored the interplay between NLP techniques and network embedding.
NetPLSA (Mei et al., 2008) adapts a topic modelling algorithm by regularizing a statistical topic model with a harmonic regularizer based on a graph structure. It generates topics that reflect the underlying communities of the network, providing cleaner topics than regular statistical models.
In (Yang et al., 2015), Yang et al. prove that Skip-Gram with hierarchical softmax can be equivalently formulated as a matrix factorization problem. They then propose Text-Associated DeepWalk (TADW) to deal with networks of documents. TADW consists in constraining the factorization problem with a pre-computed representation of documents obtained via LSA (Deerwester et al., 1990). As such, each node can be represented as the concatenation of a network embedding and a projected textual embedding. CANE (Tu et al., 2017) aims to improve node representations in a structured corpus by applying a mutual attention mechanism over the textual contents associated with the vertices of a graph. Given a connected pair of nodes, the model produces a textual representation for each node contextually to the other node. In this manner, a single node has as many representations as it has neighbors. This model has the advantage of producing interpretable weights for the words of a pair of documents, highlighting those that explain the network structure.
2.3. Heterogeneous Attributed Networks
Real-world networks are often composed of several types of links and nodes, which may be associated with attributes. For example, the scientific literature is made of articles and authors, with directed citation links between papers and co-authorship links between authors, while articles are associated with their textual content and their journal of publication. Many works have extended network embedding to handle heterogeneity and attributes in graphs.
With Metapath2vec (Dong et al., 2017), the authors propose to operate meta-path based random walks to handle heterogeneous nodes and links. These meta-paths are hand-crafted schemes that guide a random walker over the network to generate node co-occurrences. Using a Skip-Gram based objective similar to DeepWalk's, Metapath2vec achieves significant improvements on multi-class node classification and node clustering over traditional network embedding algorithms. (Huang et al., 2017) introduces the Label informed Attributed Network Embedding (LANE) framework, which jointly projects an attributed network and its labels into a unified embedding space by extracting their correlations. Mapping the structural proximities of the attributed network and the labels into an identical embedding space via correlation projections significantly improves the embeddings. Some works go beyond factorization-based embedding approaches, introducing models that learn a function to generate embeddings by sampling and aggregating features from a node's local neighborhood. GraphSAGE (Hamilton et al., 2017) makes use of learnable aggregator functions that allow it to infer representations for unseen nodes, given their attributes and links. (Veličković et al., 2018) adapt recent work on attention mechanisms to compute a node representation by attending to its neighbors.
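A meta-path guided walk of the kind used by Metapath2vec can be sketched as follows; the bipartite author/paper graph, the node names and the function are all illustrative:

```python
import random

def metapath_walk(adj, node_type, start, metapath, length, seed=0):
    """One meta-path guided random walk, in the spirit of metapath2vec.

    adj: node -> list of neighbors; node_type: node -> type symbol.
    metapath: a cyclic scheme such as "APA" (author-paper-author); at
    each step the walker only moves to neighbors of the next type.
    """
    rng = random.Random(seed)
    # collapse a cyclic scheme like "APA" into its repeating part "AP"
    scheme = metapath[:-1] if metapath[0] == metapath[-1] else metapath
    walk = [start]
    pos = scheme.index(node_type[start])
    while len(walk) < length:
        next_type = scheme[(pos + 1) % len(scheme)]
        candidates = [v for v in adj[walk[-1]] if node_type[v] == next_type]
        if not candidates:   # no neighbor of the required type: stop early
            break
        walk.append(rng.choice(candidates))
        pos += 1
    return walk

# toy bipartite graph of authors (A) and papers (P)
adj = {"a1": ["p1"], "a2": ["p1", "p2"], "p1": ["a1", "a2"], "p2": ["a2"]}
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P"}
walk = metapath_walk(adj, node_type, start="a1", metapath="APA", length=6)
```

The resulting sequence alternates authors and papers, so that Skip-Gram co-occurrences only relate nodes connected through the chosen semantic scheme.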
3. Proposed Approach
The proposed approach is intended to produce a novel model for learning representations of nodes and documents in a dynamic heterogeneous network, with the goal of computing meaningful recommendations in real time. The novelty lies in the capacity of the model to infer representations of unseen documents, for which no network information is available, in the same embedding space as the previously observed nodes. The approach is divided into three steps:
(1) Design of a first model to learn node representations that can handle textual attributes. In contrast to TADW, which relies on previously learned LSA representations, this model should learn word and document embeddings from scratch to ease inference for unseen documents. This step should validate the possibility of learning meaningful text representations from graph information only.
(2) Improvement of the model by focusing on natural language processing. The model should be able to predict the similarity of documents in a network based on their textual content only. It would employ more advanced NLP techniques and further take advantage of the interplay between word and document representations and the network topology. Compared to CANE, the representations should be produced from text information only (still using the network as training supervision), with a strong emphasis on link prediction for unseen documents.
(3) Integration of heterogeneity to handle different types of nodes and links. This would allow the model to be applied to a wider variety of tasks, such as user-item recommendation and expert finding. Handling the diversity of node and link types should not rely on hand-crafted meta-paths, as proposed in Metapath2vec, but should be learned during training, similarly to GraphSAGE. At this point, the data provided by DSRT from Peerus would serve as a strong online evaluation of the proposed model.
Step (1) has been achieved and is detailed in Section 4.2; the results are presented in Section 5.1. More work on its theoretical background will be done in the near future. Step (2) is ongoing research, which I briefly present in Section 4.3 and for which I provide some preliminary results in Section 5.3, motivating the research direction. For all evaluations, I detail the datasets used and the experimental setups in Section 4.1.
In this section, I first provide an overview of the evaluations used for my research, then I detail a contribution corresponding to the first step of my thesis, and finally I briefly address the planned methodology for the next steps.
I first detail some datasets commonly used in the literature. Then I briefly present traditional experiments conducted for evaluating network embedding.
I present below two small datasets, Cora and CiteSeer, following the preprocessing applied in (Sen et al., 2008), as well as a larger dataset, DBLP (Ley, 2002), widely used by the scientific community:
Cora (McCallum et al., 2000) is a network of scientific articles in the field of machine learning, comprising 7 classes (scientific subdomains), 2,708 documents, 1,433 distinct words in the vocabulary and 5,429 citation links.
CiteSeer (Giles et al., 1998) is a network of scientific articles comprising 6 classes, 3,312 documents, 3,703 distinct words in the vocabulary and 4,732 citation links.
Many other non-scientific datasets presenting similar data structures exist. Among others, Q&A websites and online encyclopedias provide rich sources of data raising challenges similar to those of the scientific literature. Moreover, the industrial player supporting this research provides a large dataset of scientific literature with user activity logs, allowing the proposed models to be applied to and evaluated on online recommendation tasks.
To evaluate network representation learning models, it is common to use the node embeddings as the input space of a linear classification algorithm. For each set of representations produced by a particular algorithm, the proportion of representations used for training is varied from 10% to 50%, and the average prediction accuracy of the classifier is computed over the remaining node representations, given a set of ground-truth labels. This evaluation is used for the results in Section 5.1.
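This protocol is straightforward to script; the sketch below assumes scikit-learn is available and uses synthetic embeddings in place of real ones (the function name and the two-class toy data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def evaluate_embeddings(Z, labels, train_ratios=(0.1, 0.2, 0.3, 0.4, 0.5),
                        seed=0):
    """Node-classification protocol: train a linear classifier on a
    varying fraction of the node embeddings and measure its accuracy
    on the remaining nodes."""
    scores = {}
    for ratio in train_ratios:
        Z_tr, Z_te, y_tr, y_te = train_test_split(
            Z, labels, train_size=ratio, random_state=seed, stratify=labels)
        clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
        scores[ratio] = clf.score(Z_te, y_te)
    return scores

# synthetic embeddings: two well-separated classes of 100 nodes each
rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(0, 1, (100, 16)), rng.normal(4, 1, (100, 16))])
y = np.array([0] * 100 + [1] * 100)
scores = evaluate_embeddings(Z, y)
```

In practice the averages would be taken over several random splits per ratio; a single split per ratio is kept here for brevity.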
An extension of this evaluation consists in observing only 30% to 70% of the nodes when learning the representations. The classification accuracies are then computed on the unobserved nodes, using only their attributes for prediction. As such, we evaluate the algorithm on its capacity to infer representations for new, unseen documents. This evaluation is used in Section 5.3. Note that this differs from the inductive prediction performed by GraphSAGE, which also infers representations for unseen nodes, but with knowledge of the actual new links of these nodes.
Finally, to evaluate the model of step (2), link prediction constitutes a good evaluation task. Several ways to generate a pair of training/test sets exist (random or temporal splits). The goal is then to distinguish held-out links from non-existing ones, for which the most suited metric is the ROC AUC. As with the previous classification task, it is possible to extend this evaluation to unobserved new documents by hiding a proportion of the nodes (rather than of the links) during learning.
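A minimal sketch of the link-prediction metric, assuming scikit-learn is available and scoring candidate pairs by the dot product of their embeddings (a common but not mandatory choice; the toy embeddings are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def link_prediction_auc(Z, positive_edges, negative_edges):
    """ROC AUC for link prediction: score every candidate pair by the
    dot product of its node embeddings, then check that held-out true
    links are ranked above sampled non-existing ones."""
    score = lambda edges: [Z[u] @ Z[v] for u, v in edges]
    y_true = [1] * len(positive_edges) + [0] * len(negative_edges)
    y_score = score(positive_edges) + score(negative_edges)
    return roc_auc_score(y_true, y_score)

# toy embeddings: nodes 0 and 1 are close together, node 2 is far away
Z = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]])
auc = link_prediction_auc(Z,
                          positive_edges=[(0, 1)],
                          negative_edges=[(0, 2), (1, 2)])
```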
4.2. Document Network Embedding
In this section, I present the first contribution of my thesis (Brochier et al., 2019), GVNR (Global Vectors for Node Representation), a model to learn node representations, with an extension, GVNR-t, to handle text-associated nodes. We seek to learn two sets of representations of the nodes, $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{n \times d}$, $n$ being the number of nodes in the network and $d$ the dimension of the learned embeddings.
4.2.1. Factorization Problem
We formulate a factorization problem on a random-walk based co-occurrence count matrix $X = (x_{ij})$ generated from an input network, measuring the error of reconstruction only for positive coefficients and a fraction of randomly sampled zero coefficients:

$$J = \sum_{i=1}^{n} \sum_{j=1}^{n} s(x_{ij}) \left( u_i^\top v_j + b_i + c_j - \log(1 + x_{ij}) \right)^2$$

where $b_i$ and $c_j$ are two learned biases for the pair of node embeddings $(u_i, v_j)$. The function $s(x_{ij})$ effectively selects the coefficients considered for measuring the reconstruction error: it takes the value 1 for all positive coefficients of $X$, while for zero coefficients its value is given by a Bernoulli random variable $\mathcal{B}(p_i)$, whose parameter depends on $n_i^+$, the number of distinct nodes with which node $i$ co-occurs. We introduce a global hyper-parameter $k$ to control the proportion of zero coefficients incorporated into the reconstruction error.
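As a numerical illustration of this objective, the sketch below assumes the reconstruction target is $\log(1 + x_{ij})$ (so that it vanishes for zero coefficients) and that each row samples on the order of $k$ zero entries per positive entry; both choices, like the function name, are assumptions for the sake of the example:

```python
import numpy as np

def gvnr_loss(X, U, V, b, c, k, rng):
    """Reconstruction-error sketch: positive coefficients are always
    measured, and each zero coefficient of row i is kept with a
    probability p_i chosen so that roughly k zeros are sampled per
    positive entry of that row."""
    n = X.shape[0]
    loss = 0.0
    for i in range(n):
        n_pos = np.count_nonzero(X[i])   # distinct co-occurring nodes of i
        n_zero = n - n_pos
        p = min(1.0, k * n_pos / n_zero) if n_zero else 0.0
        for j in range(n):
            keep = X[i, j] > 0 or rng.random() < p
            if keep:
                err = U[i] @ V[j] + b[i] + c[j] - np.log(1.0 + X[i, j])
                loss += err ** 2
    return loss

# tiny symmetric co-occurrence matrix and zero-initialized parameters
rng = np.random.default_rng(0)
X = np.array([[0.0, 3.0, 1.0], [3.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
n, d = 3, 4
U, V = np.zeros((n, d)), np.zeros((n, d))
b, c = np.zeros(n), np.zeros(n)
loss = gvnr_loss(X, U, V, b, c, k=1.0, rng=rng)
```

With zero-initialized parameters, sampled zero coefficients contribute nothing, so the loss reduces to the squared log-targets of the positive entries.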
4.2.2. Extended Model for Networks of Documents
In this brief section, we show how to extend GVNR to deal with networks where nodes are short text documents.
Assuming word order is negligible for short documents (such as scientific abstracts), we can model them as bags of words and thus represent a document $j$ with a count vector $t_j \in \mathbb{N}^{m}$, $m$ being the size of the vocabulary. We can further assume that the meaning of a short text can be captured by averaging the representations of its words (Le and Mikolov, 2014). Therefore, given a word embedding matrix $W \in \mathbb{R}^{m \times d}$, we define the context-vector representation of node $j$ as the average of the embeddings of the words of its document:

$$v_j = \frac{t_j W}{\lVert t_j \rVert_1}$$
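This count-weighted averaging can be sketched in a few lines; the 4-word vocabulary, the 2-dimensional embeddings and the function name are purely illustrative:

```python
import numpy as np

def doc_context_vector(bow, W):
    """Context-vector of a document as the average of its word
    embeddings, weighted by the bag-of-words counts."""
    counts = np.asarray(bow, dtype=float)
    return counts @ W / counts.sum()

# hypothetical 4-word vocabulary with 2-dimensional word embeddings
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
bow = [2, 0, 1, 1]   # the document contains words 0, 0, 2 and 3
v = doc_context_vector(bow, W)
```

Because the document vector is a differentiable function of the word embeddings, the same factorization objective can train both jointly, and unseen documents get a representation from their words alone.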
4.3. Improving Document Network Embedding
GVNR-t is able to jointly learn word, document and node embeddings in a network of documents. However, the textual component could highly benefit from more recent works in the field of NLP. In this direction, the recently introduced Transformer has shown great promise in learning dependencies between words for text representation. Besides its low computational complexity and its strong results on neural machine translation and unsupervised pre-training for language understanding, its core unit, the Scaled Dot-Product Attention, provides a good basis for extending GVNR-t.
The Scaled Dot-Product Attention takes as input a set of keys $K$ and values $V$, corresponding to projected representations of the words in a document, and a query $q$, possibly any vector lying in the same space as the keys. As output, it generates a weighted sum of the values, whose weights are produced by confronting the keys with the query, following the formula $\mathrm{Attention}(q, K, V) = \mathrm{softmax}\!\left(\frac{qK^\top}{\sqrt{d_k}}\right)V$, $d_k$ being the dimension of the query and the keys.
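For a single query vector, the unit amounts to a softmax over scaled key scores; the toy keys, values and query below are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(q, K, V):
    """Attention(q, K, V) = softmax(q K^T / sqrt(d_k)) V
    for a single query vector q; also returns the weights."""
    d_k = K.shape[1]
    scores = q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights = weights / weights.sum()
    return weights @ V, weights

# 3 "words" with 2-dimensional keys and 3-dimensional values
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
q = np.array([10.0, 0.0])   # query strongly aligned with the first key
out, w = scaled_dot_product_attention(q, K, V)
```

The weights concentrate on the keys most aligned with the query (here, the first and third), which is precisely the behavior to exploit for highlighting the words of a document that explain a link.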
My current research focuses on exploring the use of this attention mechanism for mutual attention between pairs of documents in a network. Using pre-trained word embeddings, I try to find a suitable variation of this unit that generates sparse weights (hence using functions other than the softmax), and I explore several ways to build an efficient query for mutual attention.
I first present the results obtained on multi-class classification by GVNR and its extension with text. Then I show some preliminary results indicating that more emphasis should be placed on the representations of the textual content of nodes in a network. Finally, I show an example of a visualization of the weights learned by a preliminary model adapting the Scaled Dot-Product Attention to networks of documents.
5.1. Results for GVNR and its Extension with Text
The results presented in Table 1 show the average accuracies for multi-class classification obtained on Cora. First, we observe that GVNR produces representations competitive with DeepWalk's. Its extension GVNR-t, which integrates the textual content of the documents, significantly improves the quality of the embeddings, achieving even better performance than TADW, which however relies on more time-consuming textual representations obtained with latent semantic analysis.
| % of training data | 10% | 20% | 30% | 40% | 50% |
|---|---|---|---|---|---|
| TADW (text only) | 60.5 | 69.3 | 72.7 | 73.6 | 74.5 |
| GVNR-t (text only) | 74.5 | 76.5 | 78.5 | 78.6 | 79.8 |
5.2. Motivation for a Stronger NLP Component
To gain more insight into the textual representations that are learned, Table 1 shows the accuracies obtained by TADW and GVNR-t with their respective text components only. We see that GVNR-t reaches significantly higher accuracies than TADW, but it is unclear whether this is due to a better underlying natural language understanding. Table 2 shows the classification results when predicting on unseen documents. We observe that while GVNR-t is capable of generalizing on the text attributes of the nodes, TADW largely fails to. However, the results achieved by GVNR-t are still lower than expected, which motivates the use of more advanced NLP techniques to achieve better generalization.
| % of training data | 30% | 40% | 50% | 60% | 70% |
|---|---|---|---|---|---|
5.3. Mutual Attention for Networks of Documents
My ongoing research aims to adapt the Scaled Dot-Product Attention mechanism to networks of documents. The hope is to find a way to effectively infer weights for the words of the documents that strongly support (i.e., highlight evidence for) the links in the network. Figure 2 shows an example of weights generated by a first draft of such a model. Interestingly, the model highlights words related to the field of reinforcement learning in both texts.
6. Conclusion and Future Work
Text data and network data are the two most common types of information on the World Wide Web. Building meaningful representations of both is a crucial step in the design of efficient recommender systems. In particular, the ever-growing scientific literature constitutes a dynamic heterogeneous text-attributed network. The interplay between the textual content of scientific publications and the network dynamics of the actors of research raises strong challenges.
The proposed research aims at devising an efficient model to represent the variety of nodes and links of the scientific HAN in a unique representation space, in order to tackle a wide variety of recommendation tasks. The first contributions of this research validated the complementarity of the two sources of information, text and graph, for learning meaningful representations. Ongoing research now aims at improving the natural language understanding component of the model, to truly be able to generate representations for streams of new documents. A last step will focus on extending the coverage of node and link types the model can handle, and on intensively evaluating it on a wide variety of real-world recommendation tasks.
- Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 8 (2013), 1798–1828.
- Brochier et al. (2019) Robin Brochier, Adrien Guille, and Julien Velcin. 2019. Global Vectors for Node Representations. In Proceedings of the 2019 World Wide Web Conference (WWW ’19). International World Wide Web Conferences Steering Committee.
- Deerwester et al. (1990) Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
- Dong et al. (2017) Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 135–144.
- Giles et al. (1998) C Lee Giles, Kurt D Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the third ACM conference on Digital libraries. ACM, 89–98.
- Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 855–864.
- Gutmann and Hyvärinen (2010) Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 297–304.
- Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
- Harris (1954) Zellig S Harris. 1954. Distributional structure. Word 10, 2-3 (1954), 146–162.
- Huang et al. (2017) Xiao Huang, Jundong Li, and Xia Hu. 2017. Label informed attributed network embedding. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 731–739.
- Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning. 1188–1196.
- Ley (2002) Michael Ley. 2002. The DBLP computer science bibliography: Evolution, research issues, perspectives. In International symposium on string processing and information retrieval. Springer, 1–10.
- Ley (2009) Michael Ley. 2009. DBLP: some lessons learned. Proceedings of the VLDB Endowment 2, 2 (2009), 1493–1500.
- McCallum et al. (2000) Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. 2000. Automating the construction of internet portals with machine learning. Information Retrieval 3, 2 (2000), 127–163.
- Mei et al. (2008) Qiaozhu Mei, Deng Cai, Duo Zhang, and ChengXiang Zhai. 2008. Topic modeling with network regularization. In Proceedings of the 17th international conference on World Wide Web. ACM, 101–110.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
- Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 701–710.
- Sahlgren (2008) Magnus Sahlgren. 2008. The distributional hypothesis. Italian journal of linguistics (2008), 23–53.
- Sen et al. (2008) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. 2008. Collective classification in network data. AI magazine 29, 3 (2008), 93.
- Tsitsulin et al. (2018) Anton Tsitsulin, Davide Mottin, Panagiotis Karras, and Emmanuel Müller. 2018. VERSE: Versatile Graph Embeddings from Similarity Measures. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 539–548.
- Tu et al. (2017) Cunchao Tu, Han Liu, Zhiyuan Liu, and Maosong Sun. 2017. Cane: Context-aware network embedding for relation modeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 1722–1731.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
- Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. International Conference on Learning Representations (2018). https://openreview.net/forum?id=rJXMpikCZ
- Yang et al. (2015) Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang. 2015. Network representation learning with rich text information. In IJCAI. 2111–2117.