RPT: Toward Transferable Model on Heterogeneous Researcher Data via Pre-Training

10/08/2021, by Ziyue Qiao, et al.

With the growth of academic search engines, the mining and analysis of massive researcher data, such as collaborator recommendation and researcher retrieval, have become indispensable, as they can improve the quality of service and the intelligence of academic engines. Most existing studies on researcher data mining focus on a single task for a particular application scenario and learn a task-specific model, which is usually unable to transfer to out-of-scope tasks. Pre-training provides a generalized, shared model that captures valuable information from enormous unlabeled data and can accomplish multiple downstream tasks via a few fine-tuning steps. In this paper, we propose a multi-task self-supervised learning-based researcher data pre-training model named RPT. Specifically, we divide the researchers' data into semantic document sets and a community graph. We design a hierarchical Transformer and a local community encoder to capture information from these two categories of data, respectively. We then propose three self-supervised learning objectives to train the whole model. Finally, we also propose two transfer modes of RPT for fine-tuning in different scenarios. We conduct extensive experiments to evaluate RPT, and the results on three downstream tasks verify the effectiveness of pre-training for researcher data mining.


I Introduction

With the pervasiveness of digital bibliographic search engines, e.g., Google Scholar, Aminer, Microsoft Academic Search, and DBLP, many efforts have been dedicated to mining scientific data. One of the main focuses is researcher data mining, which aims to mine researchers' semantic attributes and community relationships for important applications, including collaborator recommendation[29, 53], academic network analysis[46, 11, 27], and expert finding[62, 7].

Millions of researchers and their associated records are added to digital bibliographic datasets every year, and researchers' scientific output and academic collaborations continue to grow. However, different researcher data mining tasks usually choose particular features and design unique models, and few of these models can be transferred to out-of-domain tasks. For example, previous studies tend to use graph representation learning-based models to explore researcher community graphs in the collaborator recommendation task, whereas in the research field classification task, researchers' publications are the more valuable features and semantic models are more often used. Hence, when multiple mining tasks are performed on a tremendous amount of researcher data, feature selection becomes laborious and training different task-specific models is computationally expensive, especially when these models need to be trained from scratch. Also, most researcher data mining tasks need labeled data, which is usually quite expensive and time-consuming to obtain, especially when manual effort is involved. The deficiency of labeled data makes supervised models prone to over-fitting, while unlabeled researcher data is usually easy and cheap to collect. Inspired by these observations, we aim to exploit the intrinsic information disclosed by unlabeled researcher data to train a generalized model for researcher data mining that is transferable to various downstream tasks.

Pre-training is a proper solution, and it has drawn increasing attention recently in many domains, such as natural language processing (NLP)[20, 57], computer vision (CV)[17, 5], and graph data mining[38, 18]. The idea is to first pre-train a general model to capture useful information from a large unlabeled dataset via self-supervised learning; the pre-trained model is then treated as a good initialization for downstream tasks and is further trained along with the downstream models for different applications with a few fine-tuning steps. The pre-training model is transferable and convenient to train because unlabeled data is easily available. In fine-tuning, the downstream models can be very lightweight, so this sharing mechanism of pre-training models is more efficient than independently training various task-specific models. Early attempts at pre-training mainly focus on learning word or node representations, which optimize the embedding vectors by preserving some similarity measure, such as word co-occurrence frequency in texts or network proximity in graphs, and directly use the learned embeddings for downstream tasks. However, such embeddings are limited in preserving the extensive information in large datasets. In contrast, recent studies consider a transfer setting, where the goal is to pre-train a generic model that can be applied to different downstream tasks. There have been several representative pre-training models that achieved great performance in their areas. Pre-training was first successfully applied in CV by pre-training a model on a large supervised dataset such as ImageNet[8] and then fine-tuning the pre-trained model on a downstream task or directly extracting the representations as features. SimCLR[5] utilizes data augmentation to enhance the generalization ability of the model, and MoCo[17] proposes a pre-training mechanism that builds dynamic dictionaries for contrastive learning, leveraging instance discrimination as a pretext task via momentum contrast. The language pre-training model BERT[20] is designed to pre-train deep bidirectional representations from unlabeled text through two self-supervised text reconstruction tasks, and it obtains state-of-the-art results on eleven NLP tasks. The graph pre-training model GCC[38] can capture universal network topological properties across multiple networks based on contrastive learning and achieves high performance on graph representation learning.

Inspired by these improvements, we aim to conduct pre-training on researcher data for mining tasks. The study in [15] has proved the effectiveness of domain data for pre-training, and some work has applied pre-training models to academic domain data. For example, SciBERT[2] applies BERT to scientific publications to improve performance on downstream scientific NLP tasks, and SPECTER[6] introduces citation information into pre-training to learn document-level embeddings of scientific papers. However, existing pre-training models cannot be directly applied to researcher data pre-training, because researcher data is more heterogeneous, including textual attributes (e.g., profiles, publications, patents, etc.) and graph-structured community relationships (e.g., collaborating, advisor-advisee, etc.), whereas most existing models can only deal with a specific type of data. Also, to extract information from the heterogeneous researcher data, the pre-training model should be more complex and heavy-weight than traditional ones, which brings new challenges to pre-training models on large-scale data.

Fig. 1: An illustration of our proposed pre-training and fine-tuning framework for researcher data mining. (a) We divide the researchers’ data into semantic document sets (containing researchers’ textual attributes) and community graphs (preserving researchers’ community relationships) as inputs for the RPT model and (b) pre-train the RPT model via a multi-task self-supervised learning objective to preserve the researcher data. (c) Then, we fine-tune the pre-trained RPT model via the objectives of different downstream tasks for researcher data mining.

In this paper, we leverage the ideas of multi-task learning and self-supervised learning to design the Researcher data Pre-Training (RPT) model. Specifically, for each researcher, we propose a hierarchical Transformer as the textual encoder to capture the semantic information in the researcher's textual attributes. We also propose a linear local community sampling strategy and an efficient graph neural network (GNN) based local community encoder to capture the local community information of researchers. To jointly optimize these two encoders, the main task of RPT is a contrastive learning objective that discriminates whether the two captured representations belong to the same researcher. We also design two auxiliary tasks, the Hierarchical Masked Language Model (HMLM) and Community Relation Prediction (CRP), to extract token-level semantic information and link-level relation information, respectively, and improve the fine-grained performance of pre-training. Finally, we pre-train RPT on a large unlabeled researcher dataset extracted from real-world scientific data. We set two transfer modes for RPT and fine-tune it on three researcher data mining tasks to verify the effectiveness and transferability of RPT. We also conduct model analysis experiments to evaluate different components and hyper-parameters of RPT. The contributions of our paper are summarized as follows:

  1. We introduce the pre-training idea to handle multiple mining tasks on abundant and heterogeneous researcher data. We propose the RPT framework, including the pre-training model and two fine-tuning modes, which can extract and transfer useful information from big unlabeled scientific data to benefit researcher data mining tasks.

  2. To endow the pre-training model with heterogeneity, generalization, scalability, and transferability, we propose a Transformer- and GNN-based information extractor and a multi-task self-supervised learning objective, including a hierarchical masked language model on the hierarchical Transformer, a relation prediction model on the local community encoder, and a global contrastive learning model over both, to capture both the semantic features and the local community information of researchers.

  3. We apply our pre-training model to real-world researcher data extracted from the DBLP, ACM, and MAG digital libraries. Then, we fine-tune the pre-trained model on three tasks, researcher classification, collaborator prediction, and top-k researcher retrieval, to evaluate the effectiveness and transferability of RPT. The results demonstrate that RPT can significantly benefit various downstream tasks. We also perform ablation studies and hyper-parameter sensitivity experiments to analyze the underlying mechanism of RPT.

II Related Work

II-A Researcher Data Mining

The increasing availability of digital scholarly data offers unprecedented opportunities to explore the structure and evolution of science[12]. Multiple data sources such as Google Scholar, Microsoft Academic, ArnetMiner, Scopus, and PubMed cover millions of data points pertaining to the researcher (also known as scholar or scientist) community and its output. Analysis and mining based on big data technology have been implemented on these data, and the analysis of researchers has been a hot topic. Researcher data analysis tasks include collaborator recommendation[22, 55], collaboration sustainability prediction[53, 52], reviewer recommendation[1, 65], expert finding[63, 26], advisor-advisee discovery[50, 66], academic influence prediction[22, 32], etc. Mainstream works focus on mining researchers' various academic characteristics and community graph properties, and then learn task-specific researcher representations for various tasks. Among recent representative works, [29] recommends context-aware collaborators for researchers by exploring the semantic similarity between researchers' published literature and restricted research topics. [53] uses researchers' personal properties and network properties as input features to predict collaboration sustainability. [62] proposes an expert finding model for reviewer recommendation, which learns hierarchical representations to express the semantic information of researchers. [27] proposes a network representation learning method on scientific collaboration networks to discover advisor-advisee relationships. [3] studies the problem of citation recommendation for researchers and uses a generative adversarial network to integrate network structure and vertex content into researcher representations. [39] incorporates both network structures and researcher features into convolutional neural and attention networks and learns representations for social influence prediction. [60] studies the problem of top-k similarity search of researchers on the academic network, where content and meta-path based structure information is embedded into researcher representations.

II-B Self-Supervised Learning

Self-supervised learning is a form of unsupervised learning that trains a pretext task whose supervision signals are obtained automatically from the data itself; it can guide the learning model to capture the underlying patterns of the data. The key to self-supervised learning is designing the pretext tasks. In computer vision (CV), various self-supervised pretext tasks have been widely exploited, such as predicting image rotations[13], solving jigsaw puzzles[33], and predicting relative patch locations[9]. In natural language processing (NLP), many works propose pretext tasks based on language models, including context-word prediction[31], the Cloze task, next sentence prediction[20, 57], and so on[4]. For graph data, the pretext tasks are usually designed to predict central nodes given node context[35, 59] or sub-graph context[19], or to maximize mutual information between local and global graph views[49, 44]. Recently, many works[10, 40, 51, 56] in different domains have integrated self-supervised learning with multi-task learning, i.e., jointly training multiple self-supervised tasks on the underlying models, which can introduce useful information from different facets and improve generalization performance.

II-C Pre-Training Model

With the idea of self-supervised learning, pre-training models can be applied to big unlabeled data to build more universal representations that work across a wider variety of tasks and datasets[67]. Pre-training models can be classified into feature-based models and end-to-end models. Early pre-training studies are mainly feature-based: they directly parameterize the entity embeddings and optimize them by preserving some similarity measure, and the learned embeddings are then used as input features, in combination with downstream models, to accomplish different tasks. Examples include Word2vec[31, 30], GloVe[36], and Doc2vec[23] in NLP, which are optimized via textual context information. Early graph pre-training models are similar to those in NLP, like DeepWalk[37], LINE[45], node2vec[14], and metapath2vec[11], which aim to learn node embeddings that preserve network proximity or graph-based context. Differently, recent pre-training models pre-train deep neural network-based encoders and fine-tune them along with downstream models in an end-to-end manner. Typical examples include MoCo[17] and SimCLR[5] for unlabeled image datasets; BERT[20], RoBERTa[28], and XLNet[57] for unlabeled text datasets; and GCC[38] and GPT-GNN[18] for unlabeled graph datasets. Our proposed RPT is a meaningful attempt to apply a pre-training model to domain-specific and heterogeneous data, as researcher data is scientific data and contains both researcher textual attributes and the researcher community.

III Proposed Method

III-A Problem Statement and Framework Overview

To leverage pre-training models on big unlabeled researcher data, we need to design proper self-supervised tasks to capture the underlying patterns of the data. The key idea of self-supervised learning is to automatically generate supervisory signals or pseudo-labels based on the data itself. Given the raw data of researchers, we first explore the researcher data and extract two categories of researcher features for pre-training: (1) semantic document sets and (2) a community graph.

Semantic Document Set. A researcher may have multiple textual features, including published literature, patents, profiles, curriculum vitae (CV), and personal homepages, and these features may contain rich semantic information of various lengths and with different properties. We collect the text of these features as documents and combine the documents of each researcher into an organized semantic document set. Formally, the semantic document set of researcher $a$ is expressed as $D_a = \{d_1, d_2, \ldots, d_m\}$, where every document $d_j$ is formed as a token sequence $(w_1, w_2, \ldots, w_n)$. Note that each semantic document set and each document may have an arbitrary length.

Community Graph. Besides the semantic features, researchers' social communications and relationships are also significant and have been widely analyzed in previous studies; they can be expressed as graph-structured data. As the relations between researchers may have different types, we construct the researcher community graph in a heterogeneous-graph manner, expressed as $G = (A, E, R)$, where $A$ is the researcher set, each edge in $E$ represents one link between researchers, and $R$ is the relation set. Multiple types of relations between researchers are automatically extracted from the original data to construct $G$. For example, we can make rules that if two researchers co-author the same paper, there is a Collaborating relation between them, and if two researchers work for the same organization, there is a Colleague relation between them. Note that two researchers can have multiple relations of the same type in $G$; for instance, if they collaborate on several papers, there would be several Collaborating relations between them.

Researcher Data Pre-Training. Formally, the problem of researcher data pre-training is defined as follows: given the researchers, their semantic document sets, and their community graph, we aim to pre-train a generalized model via self-supervised learning, which is expected to capture both the semantic information and the community information of researchers in a low-dimensional representation space, where researchers with similar research topics and academic communities are close to each other. Then, in fine-tuning, the learned model is treated as a generic initialized model for benefiting various downstream researcher mining tasks and is optimized by the task objectives via a few gradient descent steps. Formally, in the pre-training stage, the pre-training model is expressed as $f_{\theta}$ with parameters $\theta$, and its output is the researcher representations. Let $\mathcal{L}_{ssl}$ denote the self-supervised loss functions, which extract samples and pseudo-labels from the researcher data for pre-training. Thus, the objective of pre-training is to optimize the following:

$\theta^{*} = \arg\min_{\theta} \mathcal{L}_{ssl}(f_{\theta})$ (1)

Based on the motivation described above, the pre-training model optimized by this objective should have the following properties:

  • Heterogeneity. The ability to encode semantic information from multi-type textual document data and community information from heterogeneous graph data.

  • Generalization. Without any supervised information, pre-training should integrate heterogeneous information from massive unlabeled data into generalized model parameters and researcher representations.

  • Scalability. As the researcher data is usually massive and contains rich information, the model should, on the one hand, be heavy-weight enough to extract the information; on the other hand, it needs to be friendly to mini-batch training and parallel computing.

  • Transferability. For fine-tuning on multiple tasks, the model should be compatible with the researcher document features and community graphs of the downstream tasks.

As such, the pre-trained model can be adopted for multiple researcher mining tasks. Note that the focus of this work is the practicability of the pre-training framework on researcher data. The goal is to make the model satisfy the above properties, which distinguishes it from common text embedding or graph embedding research.

Framework Overview. Figure 2 shows the architecture of the proposed RPT framework, which consists of three components: (a) the hierarchical Transformer, (b) the local community encoder, and (c) the multi-task self-supervised learning objective. Specifically, for Heterogeneity, the hierarchical Transformer is a two-level text encoder, including the Document Transformer and the Researcher Transformer, which aims to extract the semantic information in researchers' semantic document sets; the local community encoder employs a linear random sampling strategy to sample a researcher's local communities and applies a GNN encoder to capture the community information. For Generalization, we design a multi-task self-supervised learning objective by automatically extracting supervised signals from unlabeled data. The self-supervised tasks include global contrastive learning, the hierarchical masked language model, and community relation prediction, which are proposed to integrate the information from researcher data into generalized model parameters and researcher representations. For Scalability, the model adopts the Transformer, which encodes long sequential text in a parallel manner, and a sub-graph-sampling-based community encoder, which encodes graph data via mini-batch GNN propagation on sub-graphs with parallel computing. For Transferability, the proposed model is transferable to multiple downstream researcher mining tasks, and we propose two fine-tuning modes for different application scenarios.

Fig. 2: A brief illustration of our proposed researcher data pre-training model. The model contains two encoders: (1) the hierarchical Transformer that encodes the researcher semantic attributes and (2) the local community encoder that encodes the researcher community; note that in the researcher community, neighbors with different link types to the central researcher are colored differently. Three self-supervised tasks, including global contrastive learning, the hierarchical masked language model, and community relation prediction, are designed to optimize the whole model.

III-B Hierarchical Transformer

The Transformer is a state-of-the-art text encoder that has demonstrated very competitive ability in combination with language pre-training models[20, 28]. Compared with RNN-like models as text encoders, the Transformer learns long-range dependencies between words more efficiently. RNN-like models extract text information via sequential propagation over the sequence, which requires a number of serial operations proportional to the sequence length, while the self-attention layers in the Transformer connect all positions with a constant number of sequentially executed operations. Therefore, the Transformer encodes text in parallel. A Transformer usually has multiple layers. A layer of the Transformer encoder (i.e., a Transformer block) consists of a multi-head self-attention mechanism followed by residual connection and layer normalization, and a feed-forward layer followed by residual connection and layer normalization, which can be written as:

$H'^{(l)} = \mathrm{LayerNorm}\big(H^{(l)} + \mathrm{MultiHead}(H^{(l)})\big)$ (2)
$H^{(l+1)} = \mathrm{LayerNorm}\big(H'^{(l)} + \mathrm{MLP}(H'^{(l)})\big)$ (3)

where $H^{(l)} \in \mathbb{R}^{n \times d_h}$ is the input sequence of the $l$-th Transformer layer, $n$ is the length of the input sequence, and $d_h$ is the hidden dimension. LayerNorm is layer normalization, MLP denotes a two-layer feed-forward network with a ReLU activation function, and MultiHead denotes the multi-head attention mechanism, which is calculated as follows:

$\mathrm{MultiHead}(H) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$ (4)
$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i)$ (5)
$Q_i = H W_i^{Q}, \quad K_i = H W_i^{K}, \quad V_i = H W_i^{V}$ (6)
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\big(Q K^{\top} / \sqrt{d_k}\big) V$ (7)

where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are weight matrices. The outputs of the attention heads are concatenated and transformed using an output weight matrix $W^{O}$.

In summary, given the input sequence embeddings $H^{(0)}$, after propagating through Equations 2 and 3 over an $L$-layer Transformer, we obtain the final output embeddings of the sequence, where each embedding has captured the contextual information of the sequence. The main hyper-parameters of a Transformer are the number of layers (i.e., Transformer blocks), the number of self-attention heads, and the maximum input length.
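To make the block concrete, the following is a minimal PyTorch sketch of one Transformer encoder block corresponding to Eqs. (2)-(7); it relies on torch.nn.MultiheadAttention, and the class name, hidden sizes, and default arguments are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim=64, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, h):
        # h: (seq_len, batch, dim); self-attention connects all positions in parallel
        attn_out, _ = self.attn(h, h, h)
        h = self.norm1(h + attn_out)        # Eq. (2): multi-head attention + residual + LayerNorm
        h = self.norm2(h + self.mlp(h))     # Eq. (3): feed-forward + residual + LayerNorm
        return h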

For the semantic document set $D_a$ of researcher $a$, we aim to use the Transformer proposed in [48] to encode the text information in $D_a$ into the researcher representation. Each researcher may have multiple documents with rich text, and documents may have different properties and should be processed separately, while the original Transformer can only handle a single input sentence. Thus, we propose a two-level hierarchical Transformer model, which consists of a Document Transformer that first encodes the text of each document into a document representation, and a Researcher Transformer that integrates the multiple documents into the researcher representation.

Document Transformer. For each document $d_j$, suppose its token sequence is $(w_1, \ldots, w_n)$ and the corresponding token embeddings are $(e_1, \ldots, e_n)$. We define a new token named [SOD] (Start Of Document) and randomly initialize its embedding $e_{[SOD]}$, then we concatenate it with the embedding sequence, feed the result into the Document Transformer, which is a multi-layer bidirectional Transformer, and take the final output at the [SOD] position as the document representation:

$h_{d_j} = \mathrm{Trans}_{doc}\big([e_{[SOD]}, e_1, \ldots, e_n]\big)_{[SOD]}$ (8)

where $h_{d_j}$ is document $d_j$'s representation and $\mathrm{Trans}_{doc}$ denotes the forward function of the multi-layer Document Transformer. The input token embeddings are initialized by Word2vec[30]; we collect all the texts in the dataset to train the Word2vec model. Also, for each token, its input representation is constructed by summing its token embedding with its corresponding position embedding in the document.

Researcher Transformer. Then, given researcher $a$'s semantic document set $D_a = \{d_1, \ldots, d_m\}$, we obtain the document representations $h_{d_1}, \ldots, h_{d_m}$ from the Document Transformer and input them into the Researcher Transformer, which is yet another multi-layer bidirectional Transformer but applied at the document level, followed by a mean-pooling layer:

$z_a = \mathrm{MeanPool}\big(\mathrm{Trans}_{res}([h_{d_1}, \ldots, h_{d_m}])\big)$ (9)

where $z_a$ is the semantic representation of researcher $a$, MeanPool averages all documents' final outputs, and $\mathrm{Trans}_{res}$ denotes the forward function of the multi-layer Researcher Transformer. Thus, in the Researcher Transformer, for each researcher's document set, each document in the set can collect information from the other documents with different attention weights contributed by the self-attention mechanism of the Transformer. In this way, we obtain context-aware and researcher-specific document outputs for different researchers, rather than assuming that a document (e.g., a paper) has the same contribution to different owners. Figure 2 shows the architecture of the proposed hierarchical Transformer, where the Document Transformer is shared by different document inputs and the Researcher Transformer is shared by different researcher inputs.
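To make the two-level structure concrete, the following is a minimal PyTorch sketch of the hierarchical encoder. It uses torch.nn.TransformerEncoder as a stand-in for the Document and Researcher Transformers; the learnable [SOD] embedding, the mean-pooling, and the sharing of the two Transformers follow the description above, but the class and variable names, shapes, and layer counts are illustrative assumptions.

import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, dim=64, n_layers=3, n_heads=8):
        super().__init__()
        self.sod = nn.Parameter(torch.randn(1, 1, dim))    # learnable [SOD] embedding
        self.doc_tf = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, n_heads), n_layers)
        self.res_tf = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, n_heads), n_layers)

    def encode_document(self, tokens):
        # tokens: (seq_len, 1, dim) Word2vec + position embeddings of one document
        h = self.doc_tf(torch.cat([self.sod, tokens], dim=0))
        return h[0]                                        # output at the [SOD] position, Eq. (8)

    def forward(self, documents):
        # documents: list of (seq_len_i, 1, dim) tensors, one per document of a researcher
        docs = torch.stack([self.encode_document(d) for d in documents])   # (num_docs, 1, dim)
        out = self.res_tf(docs)
        return out.mean(dim=0).squeeze(0)                  # mean-pooling over documents, Eq. (9)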

III-C Local Community Encoder

Given the researcher community graph, we first aim to extract researchers' community information from this graph. Recently, Graph Neural Networks (GNNs) have achieved state-of-the-art performance in handling graph data. Typically, GNNs output node representations via a message-passing mechanism, i.e., they stack layers to encode node features into low-dimensional representations by aggregating the information of nodes' local neighbors. However, most GNN-based models take the whole graph as input, which can hardly be applied to large-scale graph data due to memory limitations. Also, the inter-connected graph structure prevents parallel computing on the complete graph topology, making GNN propagation on large graph data extremely time-consuming[19]. One prominent direction for improving the scalability of GNNs is to use sub-graph sampling strategies. For example, SEAL[64] extracts k-hop enclosing subgraphs to perform link prediction, and GraphSAINT[61] proposes random-walk samplers to construct mini-batches during training. Thus, we propose to sample a sub-graph of the community graph to represent the local community graph of a researcher, which preserves the researcher's interactions and relations with other researchers. An intuitive way is to directly sample the researcher's k-hop neighborhood, i.e., all neighbors within k hops as well as the corresponding relations. However, the community graph of researchers is usually denser than other kinds of graphs, and each researcher may have dozens of neighbors. With this sampling scheme, the size of the sampled sub-graph might grow geometrically, and it would become expensive to process in training as k increases.

Linear Random Sampling. In our paper, we propose a linear random sampling strategy, which makes the number of sampled neighbors increase linearly with the number of sampling hops. The procedure of linear random sampling is presented as follows:

Input: The researcher $a$, the community graph $G$, the number of sampling hops $K$, and the sampling size $n$.
Output: The local community graph $g_a$ of $a$.
1  $g_a^{0} = \{a\}$;
2  for $k = 1, \ldots, K$ do
3      Randomly sample $n$ 1-hop links of $g_a^{k-1}$ in $G$, denoted by $E_k$;
4      $g_a^{k} = g_a^{k-1} \cup E_k$;
5  end for
6  return $g_a = g_a^{K}$;
Algorithm 1 Procedure of linear random sampling.

Here, step 3 randomly samples $n$ of the current sub-community's 1-hop links in $G$ to compose $E_k$, and $g_a^{k}$ is the sampled local community within $k$ hops. In this procedure, we obtain a sub-graph of $a$'s $K$-hop neighborhood with $n$ newly sampled links in each hop.

This sampling strategy has the following advantages: (1) The size of the local community graph is linearly correlated with the number of sampling hops, so the time complexity of computing community embeddings does not blow up as the number of hops increases. (2) The number of sampled links per hop is fixed at $n$, and neighbors with more links are more likely to be sampled. (3) Note that we re-sample the local community graphs in each training step, so that all the links in the local community may be sampled after multiple training steps. (4) The sampling strategy can be seen as a data augmentation operation that masks part of the neighbors, which is widely used in graph embedding models[19, 16, 49]; it can help improve the generalization of the model, similar to the mechanism of Dropout[43]. A hedged code sketch of this procedure follows.
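The sketch below assumes the community graph is stored as an adjacency dictionary mapping a researcher to a list of (neighbor, relation) pairs; the data structure and the function and variable names are illustrative assumptions, not the released code.

import random

def linear_random_sample(graph, researcher, hops=2, size=8):
    """Sample `size` links per hop, so the sub-graph grows linearly with `hops`."""
    nodes, links = {researcher}, []
    for _ in range(hops):
        # all 1-hop links leaving the current sub-community
        candidates = [(u, v, r) for u in nodes for (v, r) in graph.get(u, [])]
        for (u, v, r) in random.sample(candidates, min(size, len(candidates))):
            links.append((u, v, r))
            nodes.add(v)
    return nodes, links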

GNN Encoder. Having obtained the local community graph $g_a$ of researcher $a$, we use a GNN model to encode it into a community embedding. Traditional GNNs learn the representations of nodes by aggregating the features of their neighborhood nodes; we refer to these node representations as patch representations. Then, we utilize a Readout function to summarize all the obtained patch representations into a fixed-length graph-level representation. Formally, the propagation of an $L$-layer GNN is represented as:

$m_v^{(l)} = \mathrm{Aggregate}^{(l)}\big(\{h_u^{(l-1)} : u \in \mathcal{N}(v)\}\big)$ (10)
$h_v^{(l)} = \mathrm{Combine}^{(l)}\big(h_v^{(l-1)}, m_v^{(l)}\big)$ (11)
$c_a = \mathrm{Readout}\big(\{h_v^{(L)} : v \in g_a\}\big)$ (12)

where $h_v^{(l)}$ is the output hidden vector of node $v$ at the $l$-th GNN layer, $h_v^{(0)}$ is its input feature, $m_v^{(l)}$ represents the neighborhood message passed to $v$ from all its neighbors in $g_a$ at the $l$-th layer, Aggregate and Combine are the component functions of the $l$-th GNN layer, and the Readout in Eq. (12) runs over the node set of $g_a$. After $L$-layer propagation, the output community embedding $c_a$ of $g_a$ is summarized from the node representation vectors through the Readout function, which can be a simple permutation-invariant function such as averaging or a more sophisticated graph-level pooling function[58]. As the relations between researchers may have multiple types, in practice we choose the classic and widely used RGCN[42], which can be applied to heterogeneous graphs, as the sub-graph encoder. Also, we use the averaging function as the Readout for efficiency.
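As a concrete reference, the following is a simplified sketch of the local community encoder: a relational message-passing layer (a lightweight plain-PyTorch stand-in for the RGCN used in the paper) followed by an average Readout. The mean aggregation, default sizes, and all names are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class RelationalLayer(nn.Module):
    def __init__(self, dim=64, num_relations=3):
        super().__init__()
        self.rel_w = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_relations)])
        self.self_w = nn.Linear(dim, dim)

    def forward(self, h, links):
        # h: (num_nodes, dim) node features; links: list of (src, dst, rel) index triples
        inbox = [[] for _ in range(h.size(0))]
        for src, dst, rel in links:
            inbox[dst].append(self.rel_w[rel](h[src]))     # relation-specific message, Eq. (10)
        msg = torch.stack([torch.stack(m).mean(0) if m else torch.zeros_like(h[0])
                           for m in inbox])
        return torch.relu(self.self_w(h) + msg)            # combine with self features, Eq. (11)

def community_embedding(h, links, layers):
    for layer in layers:
        h = layer(h, links)
    return h.mean(dim=0)                                   # average Readout, Eq. (12)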

Substitutability of Local Community Encoder. It is worth mentioning that our method places no constraints on the choice of the local community sampling strategy and the GNN encoder. Our framework is compatible with other neighborhood sampling methods, such as random walk with restart[47] and forest fire[24]. Also, traditional GNNs such as GCN[21] and GraphSAGE[16], as well as GNN models that can encode graph-level representations, such as the graph isomorphism network[54] and DiffPool[58], can work in our framework for local community graph encoding. The design of our framework mainly considers the heterogeneity of the researcher community graph and efficiency, since the pre-training dataset is very large; the comparison of different sampling strategies and encoders is not the focus of this paper.

III-D Multi-Task Self-Supervised Learning

In this section, we propose the multi-task self-supervised objective for pre-training, which consists of the main task, global contrastive learning, and two auxiliary self-supervised tasks, the hierarchical masked language model and community relation prediction.

III-D1 Global Contrastive Learning

We exploit the strong correlation between a researcher's semantic attributes and his/her local community to design a self-supervised contrastive learning task. The assumption is that, given a researcher's local community, we can use the community embedding to infer the semantic information of this researcher, based on the fact that we can usually infer a researcher's research topics from the community he/she belongs to, and vice versa. Given a researcher embedding and the embedding of one sampled local community, this task aims to discriminate whether they belong to the same researcher. We define the similarity between them as the dot-product of their embeddings and adopt the infoNCE[34] loss as our learning objective:

$\mathcal{L}_{GCL} = -\sum_{a} \log \frac{\exp(z_a \cdot c_a / \tau)}{\exp(z_a \cdot c_a / \tau) + \sum_{a^{-} \in \mathcal{N}_a} \exp(z_{a^{-}} \cdot c_a / \tau)}$ (13)

where $z_a$ and $c_a$ are the semantic and local community embeddings of researcher $a$, $\mathcal{N}_a$ is the randomly sampled negative researcher set for $a$, whose size is fixed, and $\tau$ is the temperature hyper-parameter. By minimizing $\mathcal{L}_{GCL}$, we simultaneously optimize the semantic encoder and the local community encoder. Thus, the purpose of the contrastive learning is to preserve the semantic and community information in the model parameters and to help the model integrate these two kinds of information.
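The following is a minimal sketch of this infoNCE objective for a single researcher, assuming one positive community embedding and a set of sampled negative researcher embeddings; variable names and shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def global_contrastive_loss(z_pos, c, z_negs, tau=1.0):
    # z_pos: (dim,) semantic embedding of the researcher the community was sampled from
    # c:     (dim,) embedding of the sampled local community
    # z_negs: (K, dim) semantic embeddings of K negative researchers
    pos = torch.dot(z_pos, c) / tau
    neg = z_negs @ c / tau                                  # (K,)
    logits = torch.cat([pos.unsqueeze(0), neg])             # positive sits at index 0
    # cross-entropy with target 0 equals -log(exp(pos) / sum(exp(logits))), i.e., Eq. (13)
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))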

III-D2 Hierarchical Masked Language Model

In the main task, the researcher-level representations are trained via contrastive learning with their community embeddings, while the document-level and token-level representations are not directly trained. Inspired by the Masked Language Model (MLM) task in BERT, we propose the Hierarchical Masked Language Model (HMLM) on the hierarchical Transformer to train the hidden outputs of documents and tokens, which can further improve the researcher representations. The MLM task masks a few tokens in each document and uses the Transformer outputs corresponding to these tokens, which have captured the contextual token information, to predict the original tokens. As our semantic encoder is a two-level hierarchical Transformer, besides the token-level context captured by the Document Transformer, the Researcher Transformer can capture document-level context (i.e., the other documents belonging to the same researcher), which we assume is also helpful for predicting the masked token. Given a researcher's semantic document set $D_a$, we first mask 15% of the tokens in each document. Suppose $t$ is one masked token in document $d$ and it is replaced with [MASK]; the HMLM task aims to predict the original token based on the sequence context in $d$ and the document context of the researcher. First, we obtain the output of the Document Transformer at the position of $t$:

$o_{t} = \mathrm{Trans}_{doc}\big([e_{[SOD]}, \ldots, e_{[MASK]}, \ldots, e_n]\big)_{t}$ (14)

where the input embedding of $t$ is replaced by the embedding of [MASK]. Then, we obtain the output of the Researcher Transformer at the position of document $d$, i.e., the document that $t$ is in:

$o_{d} = \mathrm{Trans}_{res}\big([h_{d_1}, \ldots, h_{d_m}]\big)_{d}$ (15)

After that, we sum these two outputs and feed the result into a linear transformation and a softmax function:

$p_{t} = \mathrm{softmax}\big(W_{v}(o_{t} + o_{d}) + b_{v}\big)$ (16)

where $p_{t} \in \mathbb{R}^{|V|}$ is the predicted probability of $t$ over all tokens and $|V|$ is the vocabulary size. Finally, the HMLM loss is formulated as a cross-entropy loss for predicting all masked tokens in the researcher's documents:

$\mathcal{L}_{HMLM} = -\sum_{d \in D_a}\sum_{t \in \mathcal{M}_{d}} y_{t}^{\top} \log p_{t}$ (17)

where $y_{t}$ is the one-hot encoding of $t$ and $\mathcal{M}_{d}$ is the set of masked tokens in document $d$.
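A minimal sketch of the HMLM prediction head follows: the Document Transformer output at a masked token and the Researcher Transformer output of its containing document are summed and projected over the vocabulary, as in Eqs. (14)-(17). The class name, default sizes, and shapes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HMLMHead(nn.Module):
    def __init__(self, dim=64, vocab_size=10000):
        super().__init__()
        self.proj = nn.Linear(dim, vocab_size)              # W_v, b_v in Eq. (16)

    def forward(self, token_out, doc_out, target_id):
        # token_out: (dim,) Document Transformer output at the [MASK] position (Eq. 14)
        # doc_out:   (dim,) Researcher Transformer output of the containing document (Eq. 15)
        logits = self.proj(token_out + doc_out)
        # cross-entropy against the original token id realizes Eq. (17) for one masked token
        return F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_id]))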

III-D3 Community Relation Prediction

Also, in the main task, we represent the relation information between researchers as graph-level local community embeddings via linear random sampling and the local community encoder. However, the link-level relatedness between researchers is not directly learned. In this auxiliary task, we propose another self-supervised learning task named Community Relation Prediction (CRP), which utilizes the links in the sampled local communities to construct supervisory signals. The CRP task has two objectives: the first is to predict the relation type between two researchers, and the second is to predict the hop count between two researchers. Specifically, given a researcher's sampled local community graph $g_a$, we randomly select 15% of the links in $g_a$ as the inputs of the CRP task. The type of a link between researchers $u$ and $v$ can be an atomic relation or a composite relation: if $u$ and $v$ are directly connected by a single relation, the link type is that atomic relation; if they are connected by a path of relations, the link type is the composition of the relations along the path. Each selected link thus has a relation type and a hop count, which are what the CRP task aims to predict given the two researchers $u$ and $v$. We first input the element-wise multiplication of $u$'s and $v$'s output representations into two linear transformations followed by softmax functions:

$p_{r} = \mathrm{softmax}\big(W_{r}(h_{u} \odot h_{v}) + b_{r}\big)$ (18)
$p_{k} = \mathrm{softmax}\big(W_{k}(h_{u} \odot h_{v}) + b_{k}\big)$ (19)

where $p_{r}$ is the predicted probability of the link's type over all link types, and $p_{k}$ is the predicted probability of the link's hop count, ranging from 1 to the maximum hop count from $a$ to its neighbors in $g_a$. Next, we feed these two probabilities into respective cross-entropy losses to compose the loss function of the CRP task:

$\mathcal{L}_{CRP} = -\sum_{(u,v)} \big(y_{r}^{\top}\log p_{r} + y_{k}^{\top}\log p_{k}\big)$ (20)

where $y_{r}$ is the one-hot encoding of the link's type index and $y_{k}$ is the one-hot encoding of its hop count. With the guidance of this task, the model can learn fine-grained interactions between researchers, so that the GNN encoder can capture community information more finely.
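A minimal sketch of the CRP heads follows: the element-wise product of the two researchers' output representations is fed to two linear-plus-softmax classifiers, one over relation types and one over hop counts, as in Eqs. (18)-(20). The number of relation types, the maximum hop count, and all names are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CRPHead(nn.Module):
    def __init__(self, dim=64, num_link_types=3, max_hops=2):
        super().__init__()
        self.rel_clf = nn.Linear(dim, num_link_types)        # W_r, b_r in Eq. (18)
        self.hop_clf = nn.Linear(dim, max_hops)              # W_k, b_k in Eq. (19)

    def forward(self, h_u, h_v, rel_label, hop_label):
        z = h_u * h_v                                        # element-wise interaction
        rel_loss = F.cross_entropy(self.rel_clf(z).unsqueeze(0), torch.tensor([rel_label]))
        hop_loss = F.cross_entropy(self.hop_clf(z).unsqueeze(0), torch.tensor([hop_label]))
        return rel_loss + hop_loss                           # Eq. (20) for one sampled link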

III-E Pre-training and Fine-tuning

Pre-training. We leverage a multi-task learning objective by combining the main task and the two auxiliary tasks for pre-training. The final loss function of RPT can be written as:

$\mathcal{L} = \mathcal{L}_{GCL} + \lambda_{1}\mathcal{L}_{HMLM} + \lambda_{2}\mathcal{L}_{CRP}$ (21)

where the loss weights $\lambda_{1}$ and $\lambda_{2}$ are hyper-parameters. We sample mini-batches of researchers' semantic document sets and local community graphs as inputs to train the whole model, and the parameters are optimized consistently by back-propagation.
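A minimal sketch of one pre-training step combining the three objectives of Eq. (21) is given below; the model methods and loss functions are hypothetical placeholders standing for the components sketched earlier, not a released API.

def pretrain_step(batch, model, optimizer, lam1=0.1, lam2=0.1):
    optimizer.zero_grad()
    l_gcl = model.global_contrastive_loss(batch)     # main task, Eq. (13)
    l_hmlm = model.hmlm_loss(batch)                  # auxiliary: hierarchical masked LM, Eq. (17)
    l_crp = model.crp_loss(batch)                    # auxiliary: community relation prediction, Eq. (20)
    loss = l_gcl + lam1 * l_hmlm + lam2 * l_crp      # Eq. (21)
    loss.backward()
    optimizer.step()
    return loss.item()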

Fine-tuning. After pre-training, we obtain the pre-trained RPT model $f_{\theta^{*}}$, which is able to extract valuable information from the researcher semantic document features and the community network. Thus, in the fine-tuning stage, given the researcher data of a specific fine-tuning task, we can obtain the semantic representation and the community representation of each researcher. Suppose $Z_{sem}$ and $Z_{com}$ are the semantic and community representation matrices of the fine-tuning task, respectively; the final researcher representation matrix $Z$ is obtained as follows:

$Z_{sem},\ Z_{com} = f_{\theta^{*}}(D, G)$ (22)
$Z = \mathrm{Merge}(Z_{sem}, Z_{com})$ (23)

where Merge is a merging function that fuses the semantic and community representations of researchers; in practice, we directly use concatenation, as these two representations have already been integrated with each other via the contrastive learning of Eq. 13. In our framework, we propose two transfer modes of RPT for fine-tuning:

  • The first is the feature-based mode, expressed as RPT (fb). We treat RPT as a pre-trained representation generator that first encodes researchers' original features into low-dimensional representation vectors. Then, the encoded representations are used as the initial input features of researchers for the downstream tasks.

  • The second is the end-to-end mode, expressed as RPT (e2e). The pre-trained model, with the hierarchical Transformer and the local community encoder, is trained together with each fine-tuning downstream task. Suppose the model of the fine-tuning task is $g_{\phi}$ with parameters $\phi$; the objective of RPT (e2e) can be written as:

    $\theta^{*}, \phi^{*} = \arg\min_{\theta, \phi} \mathcal{L}_{task}\big(g_{\phi}(f_{\theta}(D, G))\big)$ (24)

    where $\theta$ and $\phi$ are the optimized parameters and $\mathcal{L}_{task}$ is the loss function of the fine-tuning task. All parameters are optimized end-to-end.

RPT (e2e) can further extract semantic and community information useful for the downstream researcher data mining tasks, while RPT (fb), which does not need to store or train the pre-trained model during fine-tuning, is more efficient than RPT (e2e) in the fine-tuning stage. Compared with the traditional solution of training different task-specific models from scratch for different tasks, both transfer modes are relatively inexpensive, as the downstream models can be very lightweight and converge quickly with the help of pre-training. A sketch contrasting the two modes is given below.
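In the sketch, `rpt` stands for a pre-trained model exposing the two encoders and `task_head` for a lightweight downstream model that returns a task loss; both are hypothetical placeholders used only for illustration.

import torch

def fine_tune_step(rpt, task_head, docs, subgraph, labels, mode="e2e"):
    if mode == "fb":
        # RPT (fb): encoders are frozen; representations act as fixed input features
        with torch.no_grad():
            z = torch.cat([rpt.encode_semantic(docs), rpt.encode_community(subgraph)], dim=-1)
    else:
        # RPT (e2e): gradients also flow back into the pre-trained encoders, Eq. (24)
        z = torch.cat([rpt.encode_semantic(docs), rpt.encode_community(subgraph)], dim=-1)
    loss = task_head(z, labels)
    loss.backward()
    return loss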

IV Experiments

In this section, we first introduce the experimental settings, including the dataset, baselines, pre-training and fine-tuning parameter settings, and the hardware and software used. Then we fine-tune the pre-trained model on three tasks, researcher classification, collaborator prediction, and top-k researcher retrieval, to evaluate the effectiveness and transferability of RPT. Lastly, we perform ablation studies and hyper-parameter sensitivity experiments to analyze the underlying mechanism of RPT. The code of RPT is publicly available at https://github.com/joe817/RPT.

IV-A Experimental Settings

Dataset. We perform RPT on the public Aminer citation dataset (https://www.aminer.cn/citation)[46], which is extracted from DBLP, ACM, and MAG (Microsoft Academic Graph) and contains millions of publication records. The available features of researchers in the dataset are their published literature, publication venues, organizations, etc. To prepare for the experiments, we select 40,281 researchers from Aminer who published at least ten papers between 2013 and 2018. The publication records from 2013 to 2015 are used to create the semantic document sets and the community graph of researchers for pre-training. Specifically, we collect each researcher's papers as his/her semantic documents; the textual sequence of each document is composed of the paper's fields of study. We extract three kinds of relations between researchers, Collaborating (if they co-authored a paper), Colleague (if they are in the same organization), and CoVenue (if they published in the same venue), to construct the researcher community graph (note that we randomly sample 100 CoVenue neighbors per researcher). The statistics of the researchers' semantic document sets and the researcher community graph are presented in Table I.

Number of researchers 40,281
Number of documents 225,724
Number of tokens 14,391
Average documents per researcher 13.8
Average tokens per document 19.7
Number of Collaborating 599,612
Number of Colleague 115,891
Number of CoVenue 2,012,935
Average researcher degree of Collaborating 29.8
Average researcher degree of Colleague 5.8
Average researcher degree of CoVenue 100
TABLE I: Pre-training Dataset Details.

Baselines. We choose several pre-training models that can capture the information in semantic document sets and the researcher community in researcher representations as baselines. Based on whether they can be trained end-to-end with the downstream models, we divide these models into feature-based models, including Doc2vec[23], Metapath2vec[11], and ASNE[25], and end-to-end models, including BERT[20], GraphSAGE[16], and RGCN[42]. We also run our model in feature-based mode and end-to-end mode in fine-tuning. The detailed descriptions and implementations of the baselines are as follows:

  • Doc2vec: Doc2vec is a document embedding model. We collect all the researcher documents from the whole dataset to train the model, and use the average embedding of each researcher's documents as his/her pre-trained embedding. We use the Python gensim library to run Doc2vec; we set the training algorithm to PV-DM, the window size to 3, and the number of negative samples to 5. https://pypi.org/project/gensim/.

  • Metapath2vec: Metapath2vec is a network embedding model. We run it on the researcher community graph to obtain node embeddings as pre-trained researcher representations. We set Collaborating-Colleague-CoVenue as the meta-path to sample paths on the researcher community via random walk; we set the walk length to 10, the number of walks per researcher to 5, the window size to 3, and the number of negative samples to 5. https://ericdongyx.github.io/metapath2vec/m2v.html.

  • ASNE: ASNE is an attributed network embedding method which can preserve both the community and the semantic attributes of researchers. It concatenates structure features and node attributes and feeds them into a multi-layer perceptron to learn node embeddings. We use the semantic representations learned by Doc2vec as the input attribute embeddings, set the same weight for the attribute embedding and the structure embedding, and set the number of hidden layers to 2. https://github.com/lizi-git/ASNE.

  • BERT: BERT is a classic text pre-training model. We concatenate all of a researcher's documents into one sequence as input and take the [CLS] output as the researcher representation. For a fair comparison, we train the BERT model on our dataset from scratch. We set the number of Transformer layers to 6 and the number of self-attention heads to 8. The maximum input length is set to the product of the maximum input lengths of the Researcher Transformer and the Document Transformer in RPT. https://github.com/codertimo/BERT-pytorch.

We also design two graph pre-training models based on two state-of-the-art GNN models, GraphSAGE and RGCN. GraphSAGE can be applied to homogeneous graphs and RGCN to heterogeneous graphs, and both can aggregate local neighborhood information and node attributes into researcher embeddings. We run GraphSAGE and RGCN on the researcher community graph and use the self-supervised graph-context-based loss function introduced in GraphSAGE for pre-training; then, in fine-tuning, the pre-trained GraphSAGE and RGCN are trained together with the downstream models.

  • GraphSAGE: We use the Deep Graph Library tools to build the GraphSAGE model. The node attributes in the researcher community are initialized by the researcher semantic representations learned by Doc2vec. We use the mean aggregator of GraphSAGE and set the number of aggregation layers to 2. https://github.com/dmlc/dgl.

  • RGCN: RGCN also uses tools from the Deep Graph Library. The node attribute initialization and the number of aggregation layers are the same as for GraphSAGE; we choose the same graph-based loss function from the GraphSAGE paper, which encourages nearby nodes to have similar representations. https://github.com/dmlc/dgl.

For a fair comparison, we fix the representation dimension of all baselines as 64.

Pre-Training Parameter Setting. We use Adam optimization with a learning rate of 0.01 and a weight decay of 1e-7, and train for 64,000 steps with a batch size of 64. We set the researcher representation dimension and all hidden layer dimensions to 64. The weights $\lambda_1$ and $\lambda_2$ of the two auxiliary tasks are set to 0.1. For the hierarchical Transformer, we set the number of layers to 3, the number of self-attention heads in each Transformer to 8, and the maximum input lengths to 20 and 10 for the Document Transformer and the Researcher Transformer, respectively. For the local community encoder, we set the number of neighbor sampling hops to 2, the size of the sampled neighbor set to 8, and the number of GNN layers to 2. For global contrastive learning, we set the temperature to 1 and the number of negative samples to 3. The code and data are available at https://github.com/joe817/RPT.

Parameter RC CP TRR
batch size 64 256  64
epochs 10 10 20
dropout rate 0.9 0.9 0.9
learning rate 1e-2 1e-3 1e-4
weight decay 1e-7 1e-7 1e-4
adam parameter (β1) 0.9 0.9 0.9
adam parameter (β2) 0.999 0.999 0.999
TABLE II: The hyper-parameter settings on three fine-tuning tasks. RC: Researcher Classification; CP: Collaborating Prediction; TRR: Top-k Researcher Retrieval.

Hardware & Software. All experiments are conducted with the following settings:

  • Operating system: CentOS Linux release 7.7.1908

  • CPU: Intel(R) Xeon(R) Gold 6130 @ 2.10GHz

  • GPU: 4 GeForce GTX 1080 Ti

  • Software versions: Python 3.7; Pytorch 1.7.1; Numpy 1.20.0; SciPy 1.6.0; Gensim 3.8.3; scikit-learn 0.24.0

Fine-tuning Parameter Setting. The hyper-parameter settings of RPT on the three fine-tuning tasks are presented in Table II.

IV-B Task #1: Researcher Classification

The classification task is to predict researcher categories. Researchers are labeled with four areas: Data Mining (DM), Database (DB), Natural Language Processing (NLP), and Computer Vision (CV). For each area, we choose three top venues (DM: KDD, ICDM, WSDM; DB: SIGMOD, VLDB, ICDE; NLP: ACL, EMNLP, NAACL; CV: CVPR, ICCV, ECCV), then we label each researcher with the area holding the majority of his/her publication records in these venues (i.e., the representative area). The researcher representations learned by each model are fed into a multi-layer perceptron (MLP) classifier. The experimental results are shown in Table III. The number of extracted labeled authors is 4,943, and they are randomly split into training, validation, and testing sets with different proportions. We use both Micro-F1 and Macro-F1 as the multi-class classification evaluation metrics.

According to Table III, we can observe that: (1) The proposed RPT outperforms all baselines by a substantial margin in terms of both metrics; RPT (e2e) obtains a 0.6%-10.5% Micro-F1 and a 0.7%-12.4% Macro-F1 improvement over the baselines. (2) Note that the pre-trained model in RPT (fb) is not trained during fine-tuning; compared with the three feature-based baselines, RPT (fb) obtains at least a 4.5% improvement, showing that our designed multi-task self-supervised learning objectives can better capture the semantic and community information. (3) RPT (e2e) consistently outperforms RPT (fb), indicating that the learned parameters in the pre-training model can further contribute to the downstream task via end-to-end fine-tuning.

Proportions(%) 10/10/80 20/10/70 30/10/60
Metric(F1) Micro Macro Micro Macro Micro Macro
Doc2vec 0.781 0.759 0.799 0.780 0.804 0.787
Metapath2vec 0.760 0.733 0.769 0.743 0.784 0.763
ASNE 0.790 0.770 0.811 0.786 0.810 0.797
BERT 0.815 0.792 0.824 0.805 0.838 0.825
GraphSAGE 0.800 0.777 0.812 0.795 0.820 0.808
RGCN 0.835 0.818 0.838 0.823 0.841 0.825
RPT (fb) 0.827 0.805 0.848 0.833 0.852 0.837
RPT (e2e) 0.840 0.824 0.850 0.835 0.855 0.840
TABLE III: Results of Researcher Classification.

IV-C Task #2: Collaborating Prediction

Collaborating prediction is a traditional link prediction problem: given the existing collaboration information between authors, we aim to predict whether two researchers will collaborate on a paper in the future, which can be used to recommend potential new collaborators for researchers. To be practical, we randomly sample collaborating links from 2013 to 2015 for training, from 2016 for validation, and from 2017 to 2018 for testing; note that duplicated collaborators are removed from the evaluation. We use the element-wise multiplication of two candidate researchers' representations as the representation of their collaborating link, then input the link representation into a binary MLP classifier to predict whether this link exists (a sketch of this prediction head is given below). Also, negative links (pairs of researchers who did not collaborate in the dataset) are randomly sampled, with 3 times the number of true links. We sample various numbers of collaborating links and use accuracy and F1 as evaluation metrics.
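In the sketch, the element-wise product of two researcher representations is fed to a binary MLP; layer sizes and names are illustrative assumptions.

import torch
import torch.nn as nn

class LinkClassifier(nn.Module):
    def __init__(self, dim=128, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, z_u, z_v):
        # z_u, z_v: (dim,) representations of the two candidate researchers
        return self.mlp(z_u * z_v)          # logits for collaborate vs. not collaborate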

The experimental results of the different models are reported in Table IV. According to the table, RPT still performs best in all cases. The following insights can be drawn: (1) Graph representation learning models and GNNs achieve better performance than semantic representation learning models, showing that the community information of researchers may be more important for collaborating prediction. (2) RPT and RGCN outperform GraphSAGE, indicating the benefit of incorporating heterogeneous relation information into researcher representations. (3) Our method achieves 82.9% accuracy even when the training set is far smaller than the testing set, indicating the effectiveness of the pre-training model in preserving useful information from unlabeled data.

Number of links 1k/1k/10k 3k/1k/10k 5k/1k/10k
Metric ACC F1 ACC F1 ACC F1
Doc2vec 0.764 0.240 0.767 0.310 0.778 0.363
Metapath2vec 0.778 0.347 0.794 0.391 0.796 0.416
ASNE 0.796 0.483 0.809 0.506 0.816 0.508
BERT 0.785 0.378 0.799 0.520 0.805 0.543
GraphSAGE 0.777 0.308 0.781 0.477 0.795 0.482
RGCN 0.811 0.502 0.833 0.608 0.841 0.640
RPT (fb) 0.805 0.624 0.827 0.622 0.828 0.633
RPT (e2e) 0.829 0.623 0.840 0.641 0.860 0.674
TABLE IV: Results of Collaborating Prediction. 1k = 1000.

IV-D Task #3: Top-K Researcher Retrieval

The problem of top-K researcher retrieval is a typical information retrieval problem, defined as follows: given a researcher, we aim to retrieve the most relevant researchers for him/her. For each input researcher, we use the dot-product between his/her representation and each candidate researcher's representation as the score and feed the scores into a softmax over all researchers to predict the top-K relevant researchers (a sketch of this scoring step is given below); we also use negative sampling[41] in training for efficiency. We randomly select 2,000 researchers for training, 1,000 researchers for validation, and 7,000 researchers for testing. The ground truth for each researcher is defined as his/her coauthor list ordered by the number of collaborations. Note that we do not introduce any extra parameters beyond the pre-trained models in fine-tuning for this task, and for a fair comparison, the researcher representations are treated as trainable parameters for the feature-based models. Finally, we use Precision@K and Recall@K over the top-K retrieval list as the evaluation metrics, and we set K to 1, 5, 10, 15, and 20, respectively.
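In the sketch, dot-product scores over all candidates are softmax-normalized and the top-K indices are returned; function and variable names are illustrative assumptions.

import torch

def top_k_researchers(query_emb, candidate_embs, k=10):
    # query_emb: (dim,); candidate_embs: (num_candidates, dim)
    scores = torch.softmax(candidate_embs @ query_emb, dim=0)
    return torch.topk(scores, k).indices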

The results are shown in Table V. We can observe that: (1) The performance of the traditional feature-based models in this task is far below that of the end-to-end models, in contrast to their performance in the previous tasks; this is because, without downstream parameters, they are inadequate to fit the objective function well. (2) The end-to-end models can adaptively optimize the parameters of the pre-trained models with the downstream objectives, so they achieve better performance. (3) The proposed RPT in end-to-end mode still achieves the best performance, showing that the designed framework is robust when transferred to different downstream tasks.

K 1 5 10 15 20
Metric Pre@K Rec@K Pre@K Rec@K Pre@K Rec@K Pre@K Rec@K Pre@K Rec@K
Doc2vec 0.278 0.041 0.152 0.099 0.107 0.131 0.088 0.154 0.076 0.171
Metapath2vec 0.506 0.076 0.259 0.177 0.189 0.235 0.153 0.272 0.131 0.299
ASNE 0.514 0.077 0.288 0.192 0.198 0.244 0.159 0.279 0.134 0.304
BERT 0.594 0.089 0.299 0.205 0.195 0.246 0.149 0.272 0.124 0.291
GraphSAGE 0.722 0.104 0.389 0.253 0.252 0.304 0.191 0.333 0.158 0.355
RGCN 0.757 0.106 0.480 0.288 0.322 0.352 0.246 0.386 0.201 0.409
RPT (fb) 0.597 0.090 0.337 0.217 0.231 0.273 0.182 0.309 0.154 0.337
RPT (e2e) 0.787 0.106 0.525 0.309 0.357 0.381 0.275 0.421 0.226 0.447
TABLE V: Results of Top-k Researcher Retrieval

IV-E Model Analysis

In this section, we analyze the underlying mechanism of RPT. We conduct several ablation studies and parameter analyses to investigate the effects of different components, stages, and parameters.

Fig. 3: Performance evaluation of variant models on (a) researcher classification and (b) collaborating prediction.
Fig. 4: Training and testing curves of RPT with and without pre-training on (a) researcher classification and (b) collaborating prediction. Solid and dashed lines indicate training and testing curves, respectively.

Ablation Study of Multi-tasks. The proposed RPT is a multi-task learning model with one main task and two auxiliary tasks. How do the different tasks impact model performance? We propose three model variants to validate the effectiveness of these tasks.

  • RPT (main only): RPT with only the main task, global contrastive learning.

  • RPT (main + HMLM): RPT with the main task and the HMLM auxiliary task.

  • RPT (main + CRP): RPT with the main task and the CRP auxiliary task.

We evaluate these variant models on the researcher classification task and the collaborating prediction task; the pre-training setting is the same as for the complete version, and fine-tuning is set to the end-to-end mode. Figure 3 shows the performance of these variants compared with the original RPT. We can observe that: (1) The results of the full RPT are consistently better than all the variants, so it is evident that using the three objectives together achieves better performance. (2) Both RPT (main + HMLM) and RPT (main + CRP) achieve better performance than RPT (main only), indicating the usefulness of both auxiliary tasks. (3) The two single-auxiliary-task variants differ on the two downstream tasks, which implies that one auxiliary task plays a more important role than the other in this framework. (4) Comparing the performance of the variants with the baselines in Tables III and IV, we find that they still achieve very competitive performance, demonstrating that our framework has an outstanding ability to learn researcher representations.

Effect of Pre-training. To verify whether RPT's good performance is due to the pre-training and fine-tuning framework, or only because our designed neural network is powerful at encoding researcher representations, in this experiment we do not pre-train the designed framework and instead fully fine-tune it with all parameters randomly initialized. In Figure 4, we present the training and testing curves of RPT with and without pre-training as the number of epochs increases on two downstream tasks. We can observe that the pre-trained model achieves roughly an order of magnitude faster training and validation convergence than the non-pre-trained model. For example, in the classification task, it took 10 epochs for the non-pre-trained model to reach 77.9% Micro-F1, while it took only 1 epoch for the pre-trained model to reach 83.6% Micro-F1, showing that pre-training can improve the training efficiency of downstream tasks. On the other hand, we can observe that the non-pre-trained model is also inferior to the pre-trained model in final performance. This proves that our designed pre-training objective can preserve rich information in the parameters and provides a better starting point for fine-tuning than random initialization.

Fig. 5: Hyper-parameter sensitivity of RPT.

Hyper-parameter sensitivity. We also conduct experiments to evaluate the effect of two key hyper-parameters in our model, i.e., the number of self-attention heads in the hierarchical Transformer and the number of sampling hops in the local community encoder. We investigate the sensitivity of these two parameters on the researcher classification task and report the results in Figure 5. According to these figures, we observe that: (1) increasing the number of attention heads generally improves the performance of RPT, but the improvement becomes marginal as the number of heads grows further; we also find that more attention heads make pre-training more stable. (2) As the number of sampling hops varies from 1 to 5, the performance of RPT first increases because a suitable number of neighbors is considered, and then decreases slowly as further hops introduce more noise (uncorrelated neighbors).
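A simple way to reproduce this kind of analysis is a grid sweep over the two hyper-parameters. The sketch below assumes hypothetical build_model, pretrain, and evaluate helpers supplied by an experiment harness; it only illustrates the sweep structure, not the paper's exact protocol.

```python
from itertools import product

def sweep(build_model, pretrain, evaluate,
          heads_options=(1, 2, 4, 8), hop_options=(1, 2, 3, 4, 5)):
    """Grid-search the two hyper-parameters analyzed in Figure 5:
    the number of self-attention heads and the number of sampling hops.
    build_model / pretrain / evaluate are assumed callables (hypothetical names).
    """
    results = {}
    for n_heads, n_hops in product(heads_options, hop_options):
        model = build_model(num_heads=n_heads, num_hops=n_hops)
        pretrain(model)
        results[(n_heads, n_hops)] = evaluate(model)  # e.g., Micro-F1 on classification
    return results
```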

V Conclusion

In this paper, we propose a researcher data pre-training framework named RPT to solve researcher data mining problems. RPT jointly considers the semantic information and the community information of researchers. In the pre-training stage, a multi-task self-supervised learning objective is employed on large-scale unlabeled researcher data, while in the fine-tuning stage, we transfer the pre-trained model to multiple downstream tasks with two modes. Experimental results show that RPT is robust and can significantly benefit various downstream tasks. In the future, we plan to apply RPT to more researcher data mining tasks to verify the extensibility of the framework.
