1 Introduction
Heterogeneous networks are ubiquitous. Examples include bibliographic networks [20, 22], movie recommendation networks [32] and many online social networks containing information of heterogeneous types [19]. Different from their homogeneous counterparts, heterogeneous networks contain multiple types of nodes and/or links. For example, in bibliographic networks, node types include paper, author and more; link types include author-write-paper, paper-contain-keyword and so on. Due to the rapid emergence of such data, the problem of mining heterogeneous networks has gained a lot of attention in the past few years [21, 19].
In this work, we are interested in the problem of mining heterogeneous bibliographic networks [21]. To be more specific, we consider the problem of author identification under the double-blind review setting [11], on which many peer-reviewed conferences/journals are based. Authors of a paper under double-blind review are not visible to reviewers, i.e. the paper is anonymized, and only the content/attributes of the paper (such as title, venue, text information, and references) are visible to reviewers. However, in some cases the authors of the paper can still be unveiled from the content and references provided. Because of this phenomenon, questions exist about whether or not the double-blind review process is really effective. In fact, WSDM this year also conducted an experiment trying to answer this question. Here we ponder this issue by formulating the author identification problem, which aims at designing a model to automatically identify potential authors of an anonymized paper. Instead of dealing with full text directly, we treat the information of an anonymized paper as nodes in a bibliographic network, such as keyword nodes, venue nodes, and reference nodes. An illustration of the problem can be found in Figure 1. Other than serving as a study of the existing reviewing system, the problem has broader implications for general information retrieval and recommender systems, where a model is asked to match a queried document with certain targets, such as reviewer recommendation [29, 22].
To tackle the author identification problem, as well as many other network mining problems, good representations of data are very important, as demonstrated by much previous work [16, 15, 17, 26, 7]. Unlike traditional supervised learning, dense vectorized representations [16, 15] are not directly available in networked data [26]. Hence, many traditional methods under network settings rely heavily on problem-specific feature engineering [12, 13, 34, 9, 33]. Although feature engineering can incorporate prior knowledge of the problem and network structure, it is usually time-consuming, problem-specific (thus not transferable), and the extracted features may be too simple for complicated data sets [3]. Several network embedding methods [17, 26, 25] have been proposed to automatically learn feature representations for networked data. A key idea behind network embedding is learning to map nodes into a vector space such that the proximities among nodes are preserved. Similar nodes (in terms of connectivity, or other properties) are expected to be placed near each other in the vector space.
Unfortunately, most existing embedding methods produce general-purpose embeddings that are independent of any task, and they are usually designed for homogeneous networks [17, 26]. When it comes to the author identification problem in heterogeneous networks, existing embedding methods cannot be applied directly. There are two unique challenges brought by this problem: (1) how to embed the network under the guidance of the author identification task, so that the learned embeddings are more suitable for this task than general network embeddings; and (2) how to select the best type of information given the heterogeneity of the network. As shown in previous work [23, 21], proximity in heterogeneous networks is richer than in homogeneous counterparts: the semantics of a connection between two nodes is likely to depend on the type of connection they form.
To address the above-mentioned challenges, we propose a task-guided and path-augmented network embedding method. In our model, nodes are first embedded as vectors. Then the embeddings are shared and jointly trained according to both task-specific and network-general objectives: (1) the author identification task objective, where embeddings are used in a specifically designed model to score possible authors for a given paper, and (2) the general heterogeneous network embedding objective, where embeddings are used to predict the neighbors of a node. By combining both objectives, the learned embeddings can preserve network structures/proximities as well as be beneficial to the author identification task. To better utilize the heterogeneous network structure, we extend existing unsupervised network embedding to incorporate meta paths derived from heterogeneous networks, and select useful paths according to the author identification task. Compared to traditional network embedding [17, 26, 25], our method uses the author identification task as an explicit guidance to influence network embedding via joint learning, and also as an implicit guidance to select the meta paths on which network embedding is performed. It is worth mentioning that although our model is originally targeted at the author identification problem, it can also be extended to other task-oriented embedding problems in heterogeneous networks.
The contributions of our work can be summarized as follows.

We propose a task-guided and path-augmented heterogeneous network embedding framework, which can be applied to the author identification problem under the double-blind review setting and many other tasks.

We demonstrate the effectiveness of task guidance for network embedding when a specific task is of interest, and also show the usefulness of meta-path selection in heterogeneous network embedding.

Our learning algorithm is efficient and parallelizable, and experimental results show that our model can achieve much better results than existing feature-based methods.
2 Preliminaries
In this section, we first introduce the concept of heterogeneous networks and meta paths, and then introduce the embedding representation of nodes. Finally, a formal definition of the author identification problem is given.
2.1 Heterogeneous Networks
Definition 1 (Heterogeneous Networks) A heterogeneous network [21] is defined as a network with multiple types of nodes and/or multiple types of links. It can be denoted as $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of links. A heterogeneous network is also associated with a node type mapping function $\phi: V \rightarrow \mathcal{A}$, which maps each node to a predefined node type, and a link type mapping function $\psi: E \rightarrow \mathcal{R}$, which maps each link to a predefined link type. It is worth noting that a link type automatically defines the node types of its two ends.
The bibliographic network can be seen as a heterogeneous network [21]. It is centered on papers: the information of a paper can be represented by its neighboring nodes. The node types in the network include paper, author, keyword, venue and year, and the set of link types includes author-write-paper, paper-contain-keyword, and so on. The network schema is shown in Figure 2.
Definition 2 (Meta path) A meta path [23] is a path defined on the network schema, denoted in the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$, which represents a compositional relation between two given node types. For each meta path $r$, we can define an adjacency matrix $M^{(r)}$, with size equal to the number of nodes, to denote the connectivity of nodes under that meta path. If multiple meta paths are considered for a given network, we use a set of adjacency matrices $\{M^{(r)}\}$ to represent it.
Examples of meta paths defined on the network schema of Figure 2 include paper-keyword-paper and paper-year-paper. From these two examples, it is easy to see that in a heterogeneous network, even when comparing two nodes of the same type (e.g. paper), connecting them via different paths can lead to different semantic meanings.
2.2 Embedding Representation of Nodes
Networked data is usually high-dimensional and sparse, as there can be many nodes while the links are usually sparse [1]. This brings challenges to representing nodes in the network. For example, given two users, it is hard to calculate their similarity or distance directly. To obtain a better data representation, embedding methods are widely adopted [17, 26, 25], where nodes in the network are mapped into a common latent feature space. With embeddings, we can measure the similarity/distance between two nodes directly based on arithmetic operations, like the dot product of their embedding vectors.
Throughout the paper, we use a matrix $W$ to represent the embedding table for nodes. The size of the matrix is $|V| \times D$, where $|V|$ is the total number of nodes (covering all node types, such as authors, keywords, and so on), and $D$ is the number of embedding dimensions. The feature vector for node $u$ is thus denoted as $W_u$, a $D$-dimensional vector.
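As a concrete illustration, the embedding table is just a matrix with one row per node; a minimal sketch (toy sizes and node indices are hypothetical, not from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 10, 4                            # toy setting: 10 nodes, 4 dimensions
W = rng.normal(scale=0.1, size=(V, D))  # embedding table, one row per node

u, v = 2, 7                             # two arbitrary node indices
W_u = W[u]                              # the D-dimensional feature vector of node u
similarity = float(W_u @ W[v])          # dot-product similarity between nodes u and v
```

Any pair of nodes, regardless of their types, can then be compared through such vector arithmetic.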
2.3 Author Identification Problem
We formalize the author identification problem using bibliographic networks with the network schema shown in Figure 2. For each paper $d$, we represent its neighbors in the given network as $X_d = \{X_d^{(1)}, \ldots, X_d^{(T)}\}$, where $X_d^{(j)}$ is the set of neighbor nodes of the $j$-th node type. The node types include keyword, reference, venue, and year in our task. We use $A_d$ to denote the set of true authors of paper $d$.
Author Identification Problem. Given a set of papers represented as $\{(X_d, A_d)\}$, the goal is to learn a model to rank potential authors for every anonymized paper $d$ based on the information in $X_d$, such that its top ranked authors are in $A_d$. (It is posed as a ranking problem here since each paper may have a different number of authors, and that number is unknown beforehand.)
3 Proposed Model
In this section, we introduce the proposed model in detail. The model is composed of two major components: (1) author identification based on task-specific embedding, and (2) path-augmented general network embedding. We first introduce them separately and then combine them into a single unified framework, where the meta paths are selected according to the author identification task.
3.1 Task-Specific Embedding for Author Identification
In this subsection, we propose a supervised embedding-based model that ranks potential authors given the information of a paper (such as keywords, references, and venue). Our model first maps each node into a latent feature space, and then gradually builds the feature representation of the anonymized paper from its observed neighbors in the network. Finally, the aggregated paper representation is used to score potential authors.
There are two stages of aggregation to build up the feature representation of a paper from node embeddings. In the first stage, a feature vector is built for each node type $j$ by averaging the node embeddings in $X_d^{(j)}$:
$f_d^{(j)} = \frac{1}{|X_d^{(j)}|} \sum_{x \in X_d^{(j)}} W_x$    (1)
where $f_d^{(j)}$ is the feature representation of the $j$-th node type (e.g. the keyword node type), and $W_x$ is the embedding of node $x$ (e.g. a keyword node).
In the second stage, the feature vector for the paper is built as a weighted combination of the feature vectors of the different node types:

$f_d = \sum_{j=1}^{T} w_j f_d^{(j)}$    (2)

where $w_j$ is the combination weight of the $j$-th node type.
The anonymized paper is now represented by the feature vector $f_d$, which can be used to score potential authors (also represented as embedding vectors) by taking the dot product. The score between a paper $d$ and an author $a$ is defined as:

$score(d, a) = f_d^{\top} W_a$    (3)
The computational flow is summarized in Figure 3. Note that the final densely-connected layer has no bias term, so its weight matrix can be seen as the author node embeddings. The final layer output (green dots) is the score vector for the candidate authors.
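The two-stage aggregation and scoring above can be sketched as follows (a minimal illustration with hypothetical node names, type weights, and random embeddings; in the model, embeddings and type weights are learned, not fixed):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # toy embedding dimensionality

# Hypothetical embedding vectors for the paper's neighbor nodes.
emb = {n: rng.normal(size=D) for n in
       ["kdd", "embedding", "network",   # keyword nodes
        "p101", "p102",                  # reference (paper) nodes
        "wsdm"]}                         # venue node

# Observed neighbors of the anonymized paper, grouped by node type.
neighbors = {
    "keyword":   ["kdd", "embedding", "network"],
    "reference": ["p101", "p102"],
    "venue":     ["wsdm"],
}
type_weights = {"keyword": 0.5, "reference": 0.3, "venue": 0.2}

# Stage 1: average node embeddings within each type.
type_vecs = {t: np.mean([emb[n] for n in nodes], axis=0)
             for t, nodes in neighbors.items()}

# Stage 2: weighted combination across types gives the paper vector.
f_d = sum(type_weights[t] * type_vecs[t] for t in type_vecs)

# Score a candidate author by dot product with the paper vector.
author_emb = rng.normal(size=D)
score = float(f_d @ author_emb)
```

Ranking candidate authors then amounts to computing this dot product for each author embedding and sorting.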
To learn the parameters $W$ and $\{w_j\}$, we use stochastic gradient descent (SGD) [5] based on a hinge-loss ranking objective. For each triple $(d, a, a')$, where $a$ is one of the true authors of paper $d$ and $a'$ is not an author of paper $d$, the hinge loss is defined as:

$\mathcal{L}(d, a, a') = \max\left(0,\; score(d, a') - score(d, a) + \xi\right)$    (4)

where $\xi$ is a positive number usually referred to as the margin [4]. A loss penalty is incurred unless the score of the positive pair $(d, a)$ exceeds the score of the negative pair $(d, a')$ by at least $\xi$.
3.2 Path-Augmented General Heterogeneous Network Embedding
In this subsection, we propose a path-augmented general network embedding model to exploit the rich information in heterogeneous networks.
Most existing network embedding techniques [17, 26, 25] are based on the idea that node embeddings can be learned by neighbor prediction: predicting the neighborhood given a node, i.e. the probability of linking from node $u$ to node $v$. In existing network embedding methods, the observed neighborhood of a node is usually defined by the original network [26, 25] or by random walks on the original network [17]. In a heterogeneous network, one can easily enrich the semantics of neighborhoods by considering different types of meta paths [23]. As shown in [23], different meta paths encode different link semantics. For example, a connection between two authors can encode multiple similarities: (1) they are interested in the same topic, or (2) they are associated with the same affiliation. Clearly these two types of connections carry different semantics. Inspired by this observation, we generalize existing network embedding techniques [25] to incorporate different meta paths, and propose path-augmented network embedding.
In path-augmented network embedding, instead of using only the original adjacency matrices $M^{(r)}$, where $r$ is an original link type, i.e. a one-hop meta path (such as author-write-paper), we consider longer meta paths (such as author-write-paper-contain-keyword) and use meta-path-augmented adjacency matrices for network embedding, where each $M^{(r)}$ encodes network connectivity under a specific meta path $r$. Here we normalize each $M^{(r)}$ such that $\sum_{u,v} M^{(r)}_{uv} = 1$, so that the learned embeddings are not dominated by meta paths with large raw weights. Since there can be infinitely many potential meta paths (including original link types), one has to select a limited number of useful meta paths for network embedding. The selection of meta paths is discussed in the next subsection; for now we assume a collection of meta paths has been selected.
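The per-path normalization can be sketched in a few lines (toy matrices with made-up weights; the path names are illustrative):

```python
import numpy as np

# Hypothetical raw path-augmented adjacency matrices, one per meta path.
# Note the very different raw scales between the two paths.
raw = {
    "A-P":   np.array([[2.0, 0.0], [1.0, 3.0]]),
    "A-P-W": np.array([[10.0, 40.0], [0.0, 50.0]]),
}

# Normalize each matrix so its entries sum to 1; this keeps meta paths
# with large raw counts (here A-P-W) from dominating the embedding loss.
norm = {r: M / M.sum() for r, M in raw.items()}
```

After normalization, each meta path contributes comparable total weight to the objective, regardless of how many raw path instances it has.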
To learn embeddings that preserve the proximities among nodes induced by meta paths, we follow the neighbor prediction framework and model the conditional neighbor distribution of nodes. In heterogeneous networks, multiple types of paths can start from a node $u$, so the neighbor distribution is conditioned on both the node $u$ and the given path type $r$, defined as follows:

$p(v \mid u; r) = \frac{\exp(W_u \cdot W_v)}{\sum_{v' \in V_r} \exp(W_u \cdot W_{v'})}$    (5)

where $W_u$ is the embedding of node $u$, and $V_r$ denotes the set of all possible nodes on the destination side of path $r$.
In real networks, the number of nodes in $V_r$ can be very large (e.g. millions of papers), so evaluating Eq. 5 can be prohibitively expensive. Inspired by [16, 15], we apply negative sampling and form the following approximation:

$\log \sigma(W_u \cdot W_v + b_r) + \sum_{k=1}^{K} \mathbb{E}_{v_k \sim P_r(v)}\left[\log \sigma\left(-(W_u \cdot W_{v_k} + b_r)\right)\right]$    (6)

where $v_k$ is a negative node sampled from a predefined noise distribution $P_r(v)$ for path $r$ (the noise distribution only returns nodes of the type specified by the destination endpoint of path $r$), and a total of $K$ negative nodes are sampled for each positive node $v$. Furthermore, a bias term $b_r$ is added to adjust for the density differences among paths.
To learn the parameters $W$ and $\{b_r\}$, we adopt stochastic gradient descent (SGD) with the goal of maximizing the likelihood. The training procedure is as follows. We first sample a path $r$ uniformly, and then randomly sample a link $(u, v)$ according to the weights in $M^{(r)}$. The negative nodes used in Eq. 6 are sampled according to a predefined $P_r(v)$, such as the "smoothed" node degree distribution under the specific edge type [16, 15]. Finally, the parameters are updated according to their gradients, so that the approximated sample log-likelihood is maximized.
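The negative-sampling objective for one sampled link can be sketched as below (a plain-Python illustration; the embedding vectors and bias are hypothetical inputs, and only the objective value is computed, not its gradient):

```python
import math

def neg_sampling_objective(u_vec, v_vec, neg_vecs, bias=0.0):
    """Log-sigmoid score of the observed neighbor v, plus log-sigmoid
    of the negated scores for the K sampled negative nodes."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def log_sigmoid(x):
        return -math.log1p(math.exp(-x))  # log(1 / (1 + e^-x))

    obj = log_sigmoid(dot(u_vec, v_vec) + bias)
    for nv in neg_vecs:
        obj += log_sigmoid(-(dot(u_vec, nv) + bias))
    return obj
```

Maximizing this objective pushes the embedding of $u$ toward its observed neighbor and away from the sampled negatives.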
3.3 The Combined Model
The task-specific embedding sub-model and the path-augmented general embedding sub-model capture different perspectives of a network. The former focuses on direct information related to the specific task, while the latter better explores the more global and diverse information in the heterogeneous information network. This motivates us to model them in a single unified framework.
The two sub-models are combined on two levels as follows.

A joint objective is formed by combining both taskspecific and networkgeneral objectives, and joint learning is performed. Here the task serves as an explicit guidance for network embedding.

The meta paths used in networkgeneral embedding are selected according to the author identification task. Here the task provides an implicit guidance for network embedding as it helps select meta paths.
3.3.1 Joint Objective - An Explicit Guidance
The joint objective function is defined as a weighted linear combination of the two sub-model objectives with a regularization term on the embeddings, where the embedding vectors are shared between both sub-models:

$\mathcal{L} = \omega\, \mathcal{L}_{task} + (1 - \omega)\, \mathcal{L}_{net} + \lambda\, \Omega(W)$    (7)

where $\omega \in [0, 1]$ is the trade-off factor between the task-specific and network-general components. When $\omega = 0$, only network-general embedding is used; when $\omega = 1$, only supervised embedding is used. The regularization term $\Omega(W)$ is added to avoid overfitting.
To optimize the objective in Eq. 7, we utilize Asynchronous Stochastic Gradient Descent (ASGD), where samples are drawn at random and training is performed in parallel [16]. The challenge here is that we have two different tasks learning from two different data sources. To solve this problem, we design a sampling-based task scheduler. Each worker first draws a task according to the trade-off factor, and then draws samples for the selected task and updates the parameters accordingly. To reduce the task sampling overhead, the selected task is trained on a mini-batch of data samples instead of a single sample.
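The sampling-based task scheduler can be sketched as follows (a toy simulation; the probability, round count, and batch size are hypothetical, and the mini-batch update is replaced by a counter):

```python
import random

random.seed(0)

def run_scheduler(p_task, n_rounds, batch_size=32):
    """Each round, a worker draws either the task-specific or the
    network-general objective with probability p_task / (1 - p_task),
    then trains on a mini-batch for the selected objective."""
    counts = {"task": 0, "network": 0}
    for _ in range(n_rounds):
        objective = "task" if random.random() < p_task else "network"
        counts[objective] += batch_size  # stand-in for a mini-batch update
    return counts

counts = run_scheduler(p_task=0.7, n_rounds=1000)
```

Sampling the task once per mini-batch rather than once per sample is what keeps the task-switching overhead small.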
The learning algorithm is summarized in Algorithm 1.
Complexity. Firstly, the algorithm can be run in parallel on multiple CPUs thanks to asynchronous SGD. Secondly, the algorithm is efficient: in each iteration of each thread, there are two major components: (1) both edge and negative node sampling take only constant time with an alias table [30], and (2) the gradient update is linear in the number of links and the number of embedding dimensions. Thirdly, with mini-batches of reasonable size, the overhead of switching tasks is negligible.
3.3.2 Meta Path Selection - An Implicit Guidance
So far we have assumed that the path-augmented adjacency matrices are already provided. Now we discuss how to select a set of meta paths that further enhance the performance of the author prediction task.
The potential meta paths induced from a heterogeneous network are infinite in number, but not every one is relevant and useful for the specific task of interest. We therefore utilize the author identification task as a guidance to select the meta paths that best help the task at hand.
The path selection problem can be formulated as follows: given a set of predefined candidate paths $\mathcal{R}$, we want to select a subset $\mathcal{R}' \subseteq \mathcal{R}$ such that a certain utility is maximized. Since our final goal is the author identification task, we define the utility as the generalization performance of the task (on a validation data set).
It is worth noting that this problem is neither differentiable nor continuous, and the total number of combinations is exponential in the number of candidate paths. So we employ the following two steps to select relevant paths in a greedy fashion.

Single path performance. We first run the joint learning with network embedding based on a single path at a time, and repeat the experiment for every candidate path.

Greedy additive path selection. We sort the paths by their performance (from good to poor) obtained in Step 1 above, and gradually add paths into the selected pool. An experiment is run for each additive combination of paths, and the path combination with the best performance is selected.
We need to run experiments at most $2K$ times, where $K$ is the number of candidate paths. Since every experiment takes about 10 minutes in our case (even with millions of nodes and hundreds of millions of links), this selection scheme is affordable and avoids an exponential number of combinations.
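The two-step greedy procedure can be sketched as follows (the path names and the toy additive utility are purely illustrative; in practice `evaluate` would train the joint model and return validation performance):

```python
def greedy_path_selection(candidate_paths, evaluate):
    """Step 1: rank paths by single-path validation performance.
    Step 2: add paths one at a time in that order and keep the
    best-performing prefix. `evaluate(paths)` returns validation
    performance for a combination of paths."""
    ranked = sorted(candidate_paths,
                    key=lambda p: evaluate([p]), reverse=True)
    best_paths, best_score = [], float("-inf")
    pool = []
    for p in ranked:
        pool = pool + [p]
        score = evaluate(pool)
        if score > best_score:
            best_paths, best_score = list(pool), score
    return best_paths, best_score

# Toy utility: each path contributes a fixed (made-up) amount, so a
# noisy path ("PPV" here) hurts any combination that includes it.
base = {"APP": 0.3, "APW": 0.2, "PA": 0.1, "PPV": -0.2}
toy_eval = lambda paths: sum(base[p] for p in paths)
selected, best = greedy_path_selection(list(base), toy_eval)
```

Note that `evaluate` is called once per single path and once per prefix, i.e. at most $2K$ times, matching the cost analysis above.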
4 Experiments
In this section, we compare the proposed model with baselines, and also evaluate several variants of the proposed model. Case studies are also provided.
4.1 Data
The AMiner citation network [27] is used throughout our experiments. To prepare for the evaluation, we split all papers into training set and test set according to their publication time. Papers published before 2014 are treated as training set, and papers published in 2014 and 2015 are treated as test set.
Based on the training papers, a heterogeneous bibliographic network is extracted. We first extract all papers that contain information about their title, authors, references, and venue. Then we extract keywords by combining unigrams with key phrases extracted using the method proposed in [14]. The schema of the network is the same as in Figure 2.
The extracted network contains millions of nodes and tens of millions of links. The detailed statistics of nodes and links for both training and test sets can be found in Tables 1 and 2, respectively.
       Paper      Author     Keyword  Venue  Year
Train  1,562,139  1,003,836  402,687  7,528  60
Test   33,644     62,030     41,626   868    2
       PA         PP         PV         PW          PY
Train  4,554,740  6,122,252  1,562,139  12,817,479  1,562,139
Test   96,434     388,030    235,508    287,885     235,508
Meta path augmentation. In addition to the length-1 paths present in the original network, we also consider various length-2 meta paths as candidate paths for general heterogeneous network embedding. Although other path similarity measures [23] could be explored, for simplicity we set the weight of a path to the number of its path instances. For example, if Tom attended KDD twice and Jack attended KDD three times, then the path Tom-KDD-Jack has a weight of six. The network augmented with these new meta paths has hundreds of millions of links, far more than the original network. Many of the candidate paths are not symmetric and may carry different information in each direction, so we consider them in both directions. The detailed statistics of the length-2 paths are presented in Table 3.
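Counting path instances for a length-2 meta path amounts to multiplying the length-1 adjacency matrices along the path. A toy sketch (author/paper/venue names and counts are hypothetical):

```python
import numpy as np

# Length-1 adjacencies on the schema.
# A_AP[i, j] = 1 if author i wrote paper j; A_PV[j, k] = 1 if paper j
# was published at venue k.
A_AP = np.array([[1, 1, 0],    # Tom wrote papers 0 and 1
                 [0, 1, 1]])   # Jack wrote papers 1 and 2
A_PV = np.array([[1, 0],       # papers 0 and 1 at venue KDD
                 [1, 0],
                 [0, 1]])      # paper 2 at venue WSDM

# Length-2 meta path A-P-V: the product counts path instances, i.e.
# how many of an author's papers appeared at each venue.
A_APV = A_AP @ A_PV

# Likewise for the Tom-KDD-Jack example: with attendance counts
# A_AV = [[2], [3]], the A-V-A weight between Tom and Jack is 2 * 3 = 6.
A_AV = np.array([[2], [3]])
A_AVA = A_AV @ A_AV.T
```

This is why the augmented matrices grow so quickly: a length-2 path can be far denser than either of the length-1 paths it composes.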
APA         APP         APV        APW         APY        PPV        PPW         VPW         WPW          YPW
17,205,758  18,308,110  4,554,740  38,251,803  4,554,740  3,674,632  27,200,144  12,817,479  118,497,737  12,817,479
To better understand the statistics of the network, Figure 4 shows three different types of degree distributions for papers. As can be seen from the figure, most papers contain quite sparse information about authors, references and keywords: a median of 3 authors, 1 reference (many are missing in the data set), and 8 keywords. This lack of information makes automatic author identification even harder.
4.2 Baselines and Experimental Settings
We mainly consider two types of baselines: (1) traditional feature-based methods, and (2) variations of network embedding methods.

Supervised feature-based baselines. As widely used in similar author identification/disambiguation problems [12, 13, 34, 9, 33], this thread of methods first extracts features for each pair of training data and then applies a supervised learning algorithm to learn a ranking/classification function. Following these methods, we extract 20+ related features for each paper-author pair in the training set (details can be found in the appendix). Since the original network only contains true paper-author pairs, to obtain negative samples we sample 10 negative pairs for each paper-author pair by randomly replacing the authors. For the supervised algorithm, we consider Logistic Regression (LR), Support Vector Machine (SVM), Random Forests (RF), and LambdaMART (for LR, SVM, and RF we use the scikit-learn implementations, and for LambdaMART we use the XGBoost implementation). For all these methods, we use grid search to find the best hyper-parameters, such as the regularization penalty, maximum tree depth, and so on.
Task-specific embedding. This method is introduced in Section 3.1. The node embeddings are learned solely with the task-specific embedding architecture.

Network-general embedding. This method is introduced in Section 3.2. The node embeddings are learned solely with general heterogeneous network embedding, and the learned embeddings are then used to score authors in the same way as in the task-specific author identification framework. Since this method is not directly combined with the author identification task, it cannot perform task-specific path selection. By default, the paths used for embedding come from the original network, i.e. length-1 paths. With length-1 paths, this method takes the same form as PTE [25].

Pretraining + Task-specific embedding. Pretraining has been found useful for improving neural network based supervised learning [10]. So instead of training the task-specific author identification model from randomly initialized embedding vectors, we first pretrain the node embeddings using network-general embedding, and then initialize the supervised embedding training with the pretrained vectors.
Proposed combined model. This is our proposed method, which combines task-specific embedding with meta-path-selection-based network-general embedding.
Candidate authors. There are more than one million authors in the training data, so the total number of candidate authors for each paper is very large. The supervised feature-based baselines cannot scale to such a large candidate set, as it is both time-consuming and storage-intensive to extract and store features for all candidate paper-author pairs. Hence, we conduct comparisons mainly on a subsampled author candidate set: we randomly sample a set of negative authors and combine them with the paper's true authors to form a candidate set of 100 authors in total. For completeness, we also provide both quantitative and qualitative comparisons of the different embedding methods on the whole candidate set of over a million authors.
4.3 Evaluation Metrics
Since the author identification problem is posed as a ranking problem and usually only the top returned results are of interest, we adopt two commonly used ranking metrics: Mean Average Precision at k (MAP@k) and Recall at k (Recall@k).
MAP@K reflects the accuracy of the top authors ranked by a model, and is computed as the mean of AP@K over all papers in the test set. The AP@K of a single paper is computed as follows:

$AP@K = \frac{\sum_{k=1}^{K} P(k) \cdot rel(k)}{\min(K, n)}$    (8)

where $P(k)$ is the precision at cutoff $k$ in the returned list, $rel(k)$ indicates whether the $k$-th returned candidate is a true author, and $n$ is the total number of true authors of the test paper.
Recall@K shows the ratio of true authors retrieved among the top K returned results, and is computed as:

$Recall@K = \frac{|\{\text{top-}K \text{ returned authors}\} \cap \{\text{true authors}\}|}{n}$    (9)
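The two metrics can be sketched in a few lines (one common convention is shown; note the `min(K, n)` denominator for AP@K is an assumption, as normalization conventions vary between papers):

```python
def ap_at_k(ranked, true_set, k):
    """Average precision at cutoff k for one test paper."""
    hits, score = 0, 0.0
    for i, a in enumerate(ranked[:k], start=1):
        if a in true_set:
            hits += 1
            score += hits / i          # precision at cutoff i
    return score / min(len(true_set), k)

def recall_at_k(ranked, true_set, k):
    """Fraction of true authors retrieved in the top k results."""
    return len(set(ranked[:k]) & true_set) / len(true_set)
```

For instance, with the ranked list ["a", "x", "b", "y"] and true authors {"a", "b"}, AP@3 is (1/1 + 2/3) / 2 and Recall@3 is 1.0.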
4.4 Meta-Path Selection Results
We first report the experimental results for path selection, since the selected paths are used in the joint training of our model. The candidate paths we consider are all length-1 and length-2 paths presented in Tables 2 and 3, 15 paths in total. As introduced in Section 3.3.2, a two-stage greedy algorithm is used for path selection: (1) single path performance evaluation, and (2) additive path selection.
Figure 5(a) shows the results of single path performance, i.e., the performance when only a single meta path is used in network-general embedding. Each dot in the plot indicates the performance of the author prediction task on the validation dataset. The horizontal line indicates the performance of the task-specific-only embedding model. Note that the paths are sorted by performance, and only paths that improve the author identification task are shown in the figure.
Figure 5(b) shows the results of additive path selection, demonstrating the performance of the combined model as meta paths are added gradually. Each bar shows the performance of the joint model for a specific additive selection of paths. Paths are added to the network-general embedding sequentially according to their rank in the single path performance experiments. For example, the third bar, labeled "+PA", includes three paths: APP, APW, and PA.
We observe that author identification performance grows during the first several additive selections of paths, and then starts to decrease as more paths are added. This suggests that the first several paths are the most relevant and helpful, while later ones can be less relevant or noisy, and are thus harmful when used in network-general embedding. It also verifies our hypothesis that heterogeneous network embedding based on different meta paths leads to different embeddings. Finally, we select the first three paths, APP, APW, and PA, for the joint learning of the proposed model.
To further investigate the impact of using different meta paths on learning embeddings for the prediction task, we consider several sets of paths: (1) the original length-1 network paths given by the network schema in Figure 2, (2) the augmented paths combining all length-1 and length-2 paths, and (3) the paths selected by our procedure.
Table 4 shows the results of the different embedding models trained with each given set of meta paths. We observe that adding all length-2 paths actually makes the results worse, which might be due to irrelevant or noisy paths. However, this does not mean that considering augmented paths is unnecessary: using the greedily selected paths (APP, APW, and PA) drawn from both length-1 and length-2 paths, the performance of all models improves, which again demonstrates that path selection can play an important role in learning task-related embeddings.
4.5 Performance Comparison with Baselines
Table 5 shows the performance comparison between the baselines and the proposed method. The pretraining and network-general models do not have access to task-specific path selection, so the original length-1 network paths are used for them.
Our method significantly outperforms all baselines, including both the supervised feature-based baselines and the variants of embedding methods. To our surprise, the task-specific embedding model performs quite badly without pretrained embedding vectors, scoring significantly lower than the other methods. We conjecture this is due to overfitting, which can be largely alleviated by pretraining or by joint learning with the unsupervised network-general embedding.
To further examine the superior performance of our method over traditional methods, we group the papers by their median author degree (the author degree is the number of papers an author has published in the training data), and report results for each group. Figure 6 shows that our method outperforms the baseline methods in almost all groups of papers, and most significantly on papers with less frequent authors. This suggests that our method can better understand authors with fewer links: for traditional feature-based methods it is very difficult to extract useful information/features for such authors, but our model can still utilize propagation between authors and learn useful embeddings for them.
Whole author candidate set. To test the real-world author prediction setting, we also evaluate the embedding method variants on the whole candidate set of over a million authors. We only compare embedding methods, as the supervised feature-based methods cannot scale to the whole candidate set. The results are shown in Figure 7. Because the large candidate set makes evaluation slow, we randomly sample 1000 test papers per experiment and average results over 10 experiments. We observe that, among the embedding method variants, the combined method consistently outperforms the other two.


              Network-general  Pretrain + Task  Combined
length-1      0.7563 / 0.7105  0.7722 / 0.7234  0.7590 / 0.7133
length-(1+2)  0.7225 / 0.6847  0.7489 / 0.7082  0.7385 / 0.6973
Selected      0.7898 / 0.7379  0.7914 / 0.7413  0.8113 / 0.7548
(each cell reports MAP@3 / Recall@3)



Models           MAP@3   MAP@10  Recall@3  Recall@10
LR               0.7289  0.7321  0.6721    0.8209
SVM              0.7332  0.7365  0.6748    0.8267
RF               0.7509  0.7543  0.6921    0.8381
LambdaMART       0.7511  0.7420  0.6869    0.8026
Task-specific    0.6876  0.7088  0.6523    0.8298
Pretrain + Task  0.7722  0.7962  0.7234    0.9014
Network-general  0.7563  0.7817  0.7105    0.8903
Combined         0.8113  0.8309  0.7548    0.9215

4.6 Case Studies
We present two types of case studies to illustrate the performance differences among our proposed method and the variants of embedding methods. The first type shows the ranking of authors given some terms, which checks whether the learned node embeddings make sense. The second type shows the ranking of authors given the information of an anonymized paper, which is our original task.
Table 6 shows the ranking of authors given the term “variational inference”. We find that the authors returned by the combined method are the most reasonable (i.e., most likely to be associated with the queried keyword), followed by the network-general embedding, while the task-specific embedding model alone sometimes gives less reasonable results.
Table 7 shows the ranked authors for some selected papers. Since the information provided for a paper is quite limited (keywords and a few references) and the whole candidate set contains more than one million authors, many of the true authors may not appear in the top list. However, our combined method predicts true authors more accurately than the other methods. Moreover, most of the top authors in the returned list are related to the paper’s topic and its true authors, so it is sensible to consider them as potential authors of the paper.
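For intuition, ranking authors for an anonymized paper with learned embeddings can be as simple as scoring each candidate against the average embedding of the paper's observed nodes (keywords, venue, references). This is a minimal sketch with a plain dot-product scorer; the paper's actual scoring function may differ.

```python
import numpy as np

def rank_authors(node_ids, node_emb, author_emb, author_ids, top_k=10):
    """Score candidate authors against the mean embedding of the paper's
    observed nodes and return the top_k (author_id, score) pairs."""
    q = node_emb[node_ids].mean(axis=0)      # query vector for the paper
    scores = author_emb @ q                  # dot-product score per author
    order = np.argsort(-scores)[:top_k]      # highest scores first
    return [(author_ids[i], float(scores[i])) for i in order]
```

The same routine, applied with a single keyword node as the query, produces term-conditioned rankings like those in Table 6.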


Task-specific      Network-general        Combined
Chong Wang         Yee Whye Teh           Michael I. Jordan
Qiang Liu          Mohammad E. Khan       Yee Whye Teh
Sheng Gao          Edward Challis         Zoubin Ghahramani
Song Li            Ruslan Salakhutdinov   John William Paisley
Donglai Zhu        Michael I. Jordan      David M. Blei
Neil D. Lawrence   Zoubin Ghahramani      Max Welling
Sotirios Chatzis   Matthias Seeger        Alexander T. Ihler
Si Wu              David B. Dunson        Eric P. Xing
Huan Wang          Dae Il Kim             Ryan Prescott Adams
Weimin Liu         Pradeep D. Ravikumar   Thomas L. Griffiths

(Example paper queried in Table 7: “Active learning for networked data based on non-progressive diffusion model”)

4.7 Parameter Study and Efficiency Test
We study the hyper-parameter that trades off the task-specific embedding and the network-general embedding objectives. The result is shown in Figure 7(a): the best performance is obtained at an intermediate value of the trade-off, at which the two objectives are combined most appropriately.
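Optimizing a combined objective of the form ω·L_task + (1−ω)·L_net with SGD is commonly done by sampling which objective to update at each step, in proportion to its weight. The following is a generic sketch of that scheme (the paper's actual sampling procedure and trade-off value are not shown here; `task_step`/`net_step` stand in for the two gradient updates).

```python
import random

def train(n_steps, omega, task_step, net_step, seed=0):
    """Interleave SGD updates for the two objectives: with probability
    omega take a task-specific step, otherwise a network-general step."""
    rng = random.Random(seed)
    counts = {"task": 0, "net": 0}
    for _ in range(n_steps):
        if rng.random() < omega:
            task_step()
            counts["task"] += 1
        else:
            net_step()
            counts["net"] += 1
    return counts
```

In expectation, the fraction of steps spent on each objective matches its weight, which realizes the weighted sum without ever forming the combined gradient explicitly.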
Our model can be trained very efficiently with multi-core parallelization. All experiments were conducted on a desktop with a 4-core i7-5860k CPU and 64 GB of memory, and the experiments with embedding methods finish in about 10 minutes. For a quantitative comparison, Figure 7(b) plots the training speedup against the number of threads used. The speedup is almost linear for the first few threads; since the CPU has only 4 physical cores (with hyper-threading), some overhead appears when more than 4 threads are used.
5 Discussion
Although there is a severe lack of information about papers (e.g., the median number of references per paper is 1, and only keywords are used), our embedding-based algorithm can still identify true authors with reasonable accuracy at top ranks, even with a million candidate authors. We believe the model can be further improved by utilizing more complete information and incorporating more advanced text understanding techniques. For now and the near future, a human expert can still be much more accurate at identifying the authors of a paper in a domain he/she is very familiar with, but algorithms may do a much better job on papers from less familiar domains.
An interesting observation from both Figure 6 and Table 7 is that authors with a higher number of past publications are easier for the algorithm to predict, while authors with few publication records are substantially harder. In other words, highly visible authors may be easier to detect, while relatively junior researchers are harder to identify. From this perspective, we think the double-blind review system is still helpful and, in some ways, protects junior researchers.
6 Related Work
Much work has been devoted to mining heterogeneous networks in the past few years [21, 23, 24, 19]. To study such networks with multiple types of nodes and/or links, meta paths have been proposed and studied [21, 23, 24, 19]. Many existing works on mining heterogeneous networks rely on feature engineering [20, 22], whereas we adopt embedding methods for automatic feature learning.
Network embedding has also attracted a lot of attention in recent years [17, 26, 25, 6, 7]. Many of these methods are technically inspired by word embedding [16, 15]. Compared with traditional graph embedding methods [31], such as multidimensional scaling [8], IsoMap [28], LLE [18], and Laplacian Eigenmaps [2], the newer network embeddings are more scalable and have shown better performance [17, 26]. Some existing network embedding methods are designed for homogeneous networks [17, 26], while others target heterogeneous networks [25, 6]. Our work extends existing embedding methods by leveraging meta paths in heterogeneous networks and using a supervised task to guide the selection of meta paths.
The problem of author identification has been briefly studied before [11]. KDD Cup 2013 featured a similar author identification/disambiguation problem [12, 13, 34, 9, 33], where participants were asked to predict which papers were truly written by a given author. However, our setting differs from the KDD Cup in that (1) in our double-blind setting the existing authors are unknown, and (2) we consider the references of the paper, which are one of the most important sources of information. Similar problems in social and information networks have also been studied, such as collaboration prediction [20, 22]. The major difference between those works and ours is the methodology: their methods mostly rely on heavy feature engineering, while ours adopts automatic feature learning.
7 Conclusion and Future Work
In this paper, we study the problem of author identification under the double-blind review setting, which we pose as an author ranking problem in heterogeneous networks. To (1) embed the network under the guidance of the author identification task and (2) better exploit heterogeneous networks with multiple types of nodes and links, we propose a task-guided and path-augmented heterogeneous network embedding model. In our model, nodes are first embedded as vectors in a latent feature space. The embeddings are then shared and jointly trained by both the task-specific and the network-general objectives. We extend existing unsupervised network embedding to incorporate meta paths in heterogeneous networks, and select paths according to the author identification task. The guidance is provided for learning network embedding both explicitly, in a joint objective, and implicitly, in path selection. Our experiments demonstrate the usefulness of meta paths in heterogeneous network embedding, and show that by combining both tasks, our model obtains significantly better accuracy at identifying the true authors compared with existing methods.
Potential future work includes (1) author set prediction, where the interactions between authors are considered in the prediction task, and (2) deeper analysis of text, when the full text of papers is available.
Acknowledgement
We would like to thank anonymous reviewers for helpful suggestions. This work is partially supported by NSF CAREER #1453800.
References
 [1] R. Albert and A.L. Barabási. Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47, 2002.
 [2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems (NIPS’01), Vancouver, 2001.

 [3] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
 [4] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems (NIPS’13), Lake Tahoe, 2013.
 [5] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of the 16th International Conference on Computational Statistics (COMPSTAT’10), 2010.
 [6] S. Chang, W. Han, J. Tang, G.-J. Qi, C. C. Aggarwal, and T. S. Huang. Heterogeneous network embedding via deep architectures. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’15), Sydney, 2015.

 [7] T. Chen, L.-A. Tang, Y. Sun, Z. Chen, and K. Zhang. Entity embedding-based anomaly detection for heterogeneous categorical events. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI’16), Miami, 2016.
 [8] T. F. Cox and M. A. Cox. Multidimensional scaling. 2000.
 [9] D. Efimov, L. Silva, and B. Solecki. KDD Cup 2013 author-paper identification challenge: second place team. In Proceedings of the KDD Cup 2013 Workshop, 2013.

 [10] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660, 2010.
 [11] S. Hill and F. Provost. The myth of the double-blind review?: Author identification using only citations. ACM SIGKDD Explorations Newsletter, 2003.
 [12] C.-L. Li, Y.-C. Su, T.-W. Lin, C.-H. Tsai, W.-C. Chang, K.-H. Huang, T.-M. Kuo, S.-W. Lin, Y.-S. Lin, Y.-C. Lu, et al. Combination of feature engineering and ranking models for paper-author identification in KDD Cup 2013. In Proceedings of the KDD Cup 2013 Workshop, 2013.
 [13] J. Li, X. Liang, W. Ding, W. Yang, and R. Pan. Feature engineering and tree modeling for author-paper identification challenge. In Proceedings of the KDD Cup 2013 Workshop, 2013.
 [14] J. Liu, J. Shang, C. Wang, X. Ren, and J. Han. Mining quality phrases from massive text corpora. In Proceedings of 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD’15), Melbourne, 2015.
 [15] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 [16] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS’13), Lake Tahoe, 2013.
 [17] B. Perozzi, R. AlRfou, and S. Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’14), New York City, 2014.
 [18] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
 [19] C. Shi, Z. Zhang, P. Luo, P. S. Yu, Y. Yue, and B. Wu. Semantic path based personalized recommendation on weighted heterogeneous information networks. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (CIKM’15), Melbourne, 2015.
 [20] Y. Sun, R. Barber, M. Gupta, C. C. Aggarwal, and J. Han. Coauthor relationship prediction in heterogeneous bibliographic networks. In International Conference on Advances in Social Networks Analysis and Mining (ASONAM’11), Taiwan, 2011.
 [21] Y. Sun and J. Han. Mining heterogeneous information networks: principles and methodologies. Synthesis Lectures on Data Mining and Knowledge Discovery, 3(2):1–159, 2012.
 [22] Y. Sun, J. Han, C. C. Aggarwal, and N. V. Chawla. When will it happen?: relationship prediction in heterogeneous information networks. In Proceedings of the fifth ACM international conference on web search and data mining (WSDM’12). Seattle, 2012.
 [23] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu. PathSim: Meta path-based top-k similarity search in heterogeneous information networks. In Proceedings of 2011 International Conference on Very Large Data Bases (VLDB’11), 2011.
 [24] Y. Sun, B. Norick, J. Han, X. Yan, P. S. Yu, and X. Yu. PathSelClus: Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. ACM Transactions on Knowledge Discovery from Data, 7(3):11, 2013.
 [25] J. Tang, M. Qu, and Q. Mei. PTE: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’15), Sydney, 2015.
 [26] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web (WWW’15), Florence, 2015.
 [27] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. Arnetminer: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’08), Las Vegas, 2008.
 [28] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
 [29] S. Van Rooyen, F. Godlee, S. Evans, N. Black, and R. Smith. Effect of open peer review on quality of reviews and on reviewers’ recommendations: a randomised trial. BMJ, 318(7175):23–27, 1999.

 [30] A. J. Walker. An efficient method for generating discrete random variables with general distributions. ACM Transactions on Mathematical Software, 3(3):253–256, 1977.
 [31] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1):40–51, 2007.
 [32] X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han. Personalized entity recommendation: A heterogeneous information network approach. In Proceedings of the 7th ACM international conference on web search and data mining (WSDM’14), New York City, 2014.
 [33] X. Zhao. The scorecard solution to the author-paper identification challenge. In Proceedings of the KDD Cup 2013 Workshop, 2013.
 [34] E. Zhong, L. Li, N. Wang, B. Tan, Y. Zhu, L. Zhao, and Q. Yang. Contextual rule-based feature engineering for author-paper identification. In Proceedings of the KDD Cup 2013 Workshop, 2013.
Appendix A Feature Engineering for Traditional Supervised Models
For the traditional supervised models, we consider both author features and paper-author paired features for ranking authors given a paper. We first list the author features used:

- Total number of papers
- Number of distinct venues
- Number of distinct years

Four types of paper-author paired features are used, as listed below.

Paper references related:

- Number of references cited by the author before
- Ratio of references cited by the author before
- Number of the author’s citations in the references
- Ratio of the author’s citations in the references
- Number of references written by the author
- Ratio of references written by the author
- Ratio of the author’s papers in the references

Paper words related:

- Number of shared words
- Number of unique shared words
- Ratio of shared words
- Ratio of unique shared words

Paper venue related:

- Whether the author attended the venue before
- Number of times the author attended the venue before
- Ratio of times the author attended the venue before

Paper year related:

- Number of papers the author published in the last 3 years
- Ratio of papers the author published in the last 3 years
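The reference-based paired features above reduce to simple set operations over paper ids. A sketch with illustrative names (not the paper's actual feature-extraction code):

```python
def reference_features(paper_refs, author_cited, author_papers):
    """Paper-author paired features derived from the reference list.

    paper_refs    : set of paper ids cited by the query paper
    author_cited  : set of paper ids the candidate author has cited before
    author_papers : set of paper ids the candidate author has written
    """
    n = len(paper_refs)
    cited = len(paper_refs & author_cited)     # references the author cited before
    written = len(paper_refs & author_papers)  # references the author wrote
    return {
        "n_refs_cited_by_author": cited,
        "ratio_refs_cited_by_author": cited / n if n else 0.0,
        "n_refs_written_by_author": written,
        "ratio_refs_written_by_author": written / n if n else 0.0,
    }
```

The word-, venue-, and year-based features follow the same count/ratio pattern over the corresponding sets.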
Appendix B Derivation of Task-specific Embedding for Author Identification
The gradients of the parameters in the task-specific embedding model are calculated as follows.
(10) 
(11) 
where the indicator function is set to one if and only if its argument is greater than zero.
The learning algorithm is illustrated in Algorithm 2.
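The indicator function in the derivation above is characteristic of a margin-based (hinge) ranking loss, whose subgradient is nonzero only when the margin is violated. As an illustration, one SGD step on a simple pairwise hinge objective looks as follows; this is a generic sketch under that assumption, not necessarily the paper's exact objective.

```python
import numpy as np

def hinge_ranking_step(u, v_pos, v_neg, lr=0.01, margin=1.0):
    """One SGD step on the pairwise ranking loss
    max(0, margin - u.v_pos + u.v_neg), where u is the paper (query)
    embedding, v_pos a true author, and v_neg a sampled negative author.
    The indicator [loss > 0] gates the entire gradient."""
    if margin - u @ v_pos + u @ v_neg > 0:   # indicator: update only on violation
        u_new = u + lr * (v_pos - v_neg)     # -dL/du      = v_pos - v_neg
        v_pos_new = v_pos + lr * u           # -dL/dv_pos  = u
        v_neg_new = v_neg - lr * u           # -dL/dv_neg  = -u
        return u_new, v_pos_new, v_neg_new
    return u, v_pos, v_neg
```

Each step increases the score gap between the true author and the sampled negative, which is exactly what a ranking objective for author identification requires.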
Appendix C Derivation of Pathaugmented General Heterogeneous Network Embedding
The gradients of the parameters in the path-augmented general heterogeneous network embedding model can be calculated as follows.
(12) 
(13) 
The learning algorithm of the model is summarized in Algorithm 3.
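As an illustration of the general form of such updates, a LINE/PTE-style negative-sampling step on a single (path-augmented) edge can be written as below. This is a generic sketch of the family of objectives our model extends, not a reproduction of the exact gradients in Equations (12) and (13).

```python
import numpy as np

def edge_step(u, v, neg_vs, lr=0.025):
    """One negative-sampling SGD step for an observed edge (u, v):
    raise sigma(u . v) for the positive node and lower sigma(u . v_k)
    for each sampled negative node v_k."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    g_u = (1.0 - sigmoid(u @ v)) * v            # pull u toward the positive
    v = v + lr * (1.0 - sigmoid(u @ v)) * u     # and the positive toward u
    new_negs = []
    for vn in neg_vs:
        s = sigmoid(u @ vn)
        g_u -= s * vn                           # push u away from negatives
        new_negs.append(vn - lr * s * u)
    return u + lr * g_u, v, new_negs
```

In the path-augmented setting, the "edges" fed to such a step include node pairs connected by selected meta paths, not only the original links.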