1. Introduction
“It takes me a lot more time to find a useful paper… and it takes me even longer to read it…” When a non-English-speaking PhD student voiced this complaint in a seminar, other PhD candidates with similar backgrounds agreed with her, sharing the same frustration when trying to find and consume helpful English publications. The professor (a native speaker) responded later with a touch of relief: “well, I agree, but my problem is even bigger… I cannot read the papers in your language at all…” This dialog initiated our thinking about a new problem, Cross-Language Publication (Citation) Recommendation: how can we propose a useful method/system to help scholars efficiently locate useful publications written in different languages? (A typical scenario of this task is helping non-English-speaking students search for useful English papers.) Increasing academic globalization is forcing scholars to break linguistic boundaries, and English (or any other dominant language) may not always serve as the gatekeeper to scientific discourse.
Unfortunately, existing academic search engines (e.g., Google Scholar, Microsoft Academic Search) along with many sophisticated retrieval and recommendation algorithms (He et al., 2010; Jiang et al., 2015; Tang and Zhang, 2009) cannot cope with this problem efficiently. For instance, most existing citation recommendation algorithms work in a monolingual context, and scholarly graph-based random walks may not work well in a multilingual environment (Section 4 demonstrates this).
Moreover, Cross-language Citation Recommendation (CCR) can be quite challenging compared with classical scholarly recommendation, for the following reasons:
Information need shifting. Different from monolingual citation recommendation, we cannot directly calculate the relevance between papers written in two different languages. A straightforward solution is to utilize machine translation (MT) (Bahdanau et al., 2015) to translate the query content (e.g., keywords, text or a user profile), and then use existing matching models (Zhai and Lafferty, 2001; Guo et al., 2016) to recommend proper papers in the target language. However, MT-based methods and the CCR task can be fundamentally different. The goal of MT is to find a target text given a source text based on the same semantic meaning (Bahdanau et al., 2015) (e.g., find papers containing exactly or closely matched phrases or sentences), while the CCR task focuses on recommending “relevant” papers in the target language given a query in the source language (Tang et al., 2014). When the research context changes, content translation may not perform well. For instance, in the Chinese/Japanese research context, machine learning methods can be important for word segmentation studies, which may not be the case for the English counterpart. The MT approach cannot address this kind of information need shifting.
Sparse inter-repository citation relations. Besides textual information, citation relations are quite important for citation recommendation. In prior studies, recommendation algorithms learn “relevance” by using citation relations on a graph (Ren et al., 2014; Liu et al., 2014). However, compared to the enormous number of monolingual citation relations, cross-language citations can be very sparse. For instance, in a computer-science-related bilingual (Chinese-English) context, we find that papers in ACM and Wanfang (one of the largest Chinese digital libraries), on average, have about 28 times more monolingual citation relations than cross-language ones. It is therefore difficult to effectively employ citation relations for cross-language citation recommendation with classical graph mining methods.
Heterogeneous information environment. Intuitively, one could integrate cross-language content semantics, citation relations and other useful heterogeneous information (e.g., keywords and authors) to address CCR. However, most existing text- or graph-based ranking algorithms rely on a set of human-defined rules (e.g., sequential relation paths (Lao and Cohen, 2010) and meta-paths (Sun et al., 2011)) to integrate different kinds of information. On a complex cross-language scholarly graph, hand-crafting such features can be time-consuming, incomplete and biased.
To address these challenges, in this study, we propose a novel solution, Hierarchical Representation Learning on Heterogeneous Graph (HRLHG), for cross-language citation recommendation. By constructing a novel cross-language heterogeneous graph with various types of vertexes and relations, we “semantically” enrich the basic citation structure to carry richer information. To avoid hand-crafted features, we propose an innovative algorithm to project a vertex (on the heterogeneous graph) into a low-dimensional joint embedding space. Unlike prior hyperedge- or meta-path-based approaches, the proposed algorithm generates a set of Relation Type Usefulness Distributions (RTUD), which enables fully automatic heterogeneous graph navigation. As Figure 1 shows, the hierarchical random walk algorithm enables a two-level random walk guided by two different sets of distributions: the global one (relation type usefulness distributions) is designed for graph-schema-level navigation (task-specific), while the local one (relation transition distributions) targets graph-instance-level walking (task-independent).
Using HRLHG, we can recommend a ranked list of cross-language citations for a given paper/query in the source language. We evaluate the proposed algorithm on Chinese and English scholarly corpora, i.e., the Wanfang and ACM digital libraries. The results demonstrate that the proposed approach is superior to state-of-the-art models for the cross-language citation recommendation task.
The contribution of this paper is fourfold.
First, we propose a novel method (Hierarchical Representation Learning on Heterogeneous Graph) to characterize both global and local semantic plus topological information in the publication representations. Second, we improve the interpretability of the publication representation model: by using an iterative EM (expectation-maximization) approach, the proposed algorithm can learn the implicit biases for cross-language citation recommendation, which significantly differs from classical heterogeneous graph mining algorithms. Third, we apply the proposed embedding method to a novel cross-language citation recommendation task; an experiment on real-world bilingual scientific datasets validates the proposed approach. Last but not least, although this study focuses on the cross-language citation recommendation task, the proposed method can be generalized to other tasks based on heterogeneous graph embedding learning.
2. Problem Formulation
Compared to the homogeneous graph, the heterogeneous graph has been demonstrated to be a more efficient way to model real-world data for many applications: it represents an abstraction of the real world, focusing on objects and the interactions between them (Liu et al., 2014). Formally, following prior work (Sun et al., 2011; Dong et al., 2017), we present the definitions of a heterogeneous graph and its schema.
Definition 1.
Heterogeneous Graph, also called a heterogeneous information network, is defined as a graph G = (V, E), where V denotes the vertex set and E denotes the edge (relation) set. φ: V → A is the vertex type mapping function, where A denotes the set of vertex types; ψ: E → R is the relation type mapping function, where R denotes the set of relation types; and |A| + |R| > 2.
Definition 2.
Graph Schema. The graph schema is a meta template for a heterogeneous graph G = (V, E) with vertex type mapping φ: V → A and relation type mapping ψ: E → R, denoted as T_G = (A, R).
The graph schema is used to specify type constraints on the sets of vertexes and relations of a heterogeneous graph. A graph that follows a graph schema is then called a Graph Instance of that schema (Liu et al., 2014).
Definition 3.
Cross-language Citation Recommendation. The CCR problem can be defined as a conditional probability p(d_t | d_s), i.e., the probability of a paper d_t in the target language given a particular query paper d_s in the source language:

p(d_t | d_s) = f(Φ(d_s), Φ(d_t))

where Φ is a representation function that projects each paper into a low-dimensional embedding space, and f is a probability scoring function based on the learned publication embeddings.
The CCR problem can be formalized as:

Input: A query paper (or partial text/keywords in the query paper) in a source language.

Output: A list of ranked papers in the target language that could potentially be cited by, or be useful for, the input paper.
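To make the input/output above concrete, the CCR task reduces to ranking target-language candidates by a score over learned embeddings. The sketch below is a minimal illustration, not the paper's system: the function name `recommend` and the plain cosine scoring are assumptions.

```python
import numpy as np

def recommend(query_vec, candidate_vecs, top_k=3):
    """Rank target-language candidate papers for a source-language query.

    `query_vec` and the values of `candidate_vecs` are embeddings produced
    by some representation function Phi; cosine similarity stands in for
    the scoring function f here (a hypothetical simplification).
    """
    q = query_vec / np.linalg.norm(query_vec)
    scores = {}
    for doc_id, vec in candidate_vecs.items():
        scores[doc_id] = float(np.dot(q, vec / np.linalg.norm(vec)))
    # Return the top-k candidate ids, highest score first.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

With a toy candidate set, the candidate whose embedding best aligns with the query is ranked first.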
In this study, we investigate a novel method to enhance the representation learning function Φ for CCR. The detailed method is introduced in Section 3.
3. Hierarchical Representation Learning on Heterogeneous Graph
In this section, we discuss the proposed method in detail. We first formulate the heterogeneous-graph-based representation learning framework for the CCR task (Section 3.1); we then introduce the hierarchical random-walk strategy, which leverages the critical relation type usefulness distribution training algorithm (Section 3.2).
3.1. Heterogeneous Graph Representation Learning Framework for CCR
Due to the aforementioned challenges of the CCR task, the proposed representation model can hardly depend on textual or citation information alone. In this study, we integrate various kinds of entities and relations into a heterogeneous graph (as Figure 2 shows; the detailed node and relation type information is listed in Table 1). The goal is then to design a novel representation learning model that encapsulates both semantic and topological information into a low-dimensional joint embedding for the CCR task.
No.  Vertex  Description 
1  Paper in Source Language  
2  Paper in Target Language  
3  Keyword in Source Language  
4  Keyword in Target Language  
No.  Relation  Description 
1  A paper (in source language) is semantically related to another paper (in target language). We use machine translation* and a language model (with Dirichlet smoothing) to generate this relation (Zhai and Lafferty, 2001).
2  A paper (in source language) has a monolingual citation relation to another paper (in source language).
3  A paper (in target language) has a monolingual citation relation to another paper (in target language).
4  A paper (in source language) has a cross-language citation relation to another paper (in target language).
5  A paper (in source language) has a keyword (in source language).
6  A paper (in target language) has a keyword (in target language).
7  A keyword (in source language) has a monolingual citation relation to another keyword (in source language).
8  A keyword (in target language) has a monolingual citation relation to another keyword (in target language).
9  A keyword (in source language) has a cross-language citation relation to another keyword (in target language).
10  A keyword (in source language) is translated into the corresponding keyword (in target language)*.

*As this study does not focus on machine translation, we use the Google Translation API (https://cloud.google.com/translate) to translate paper abstracts and keywords.

The keyword citation relations are derived from paper citation relations.

Because of space limitations, the detailed relation transition probability calculation can be found at https://github.com/GraphEmbedding/HRLHG.
Formally, given a heterogeneous graph G = (V, E), φ: V → A is the vertex type mapping function and ψ: E → R is the relation type mapping function. The goal of vertex representation learning is to obtain latent vertex representations by mapping vertexes into a low-dimensional space R^d, d ≪ |V|, such that the learned representations preserve the information in G. We use Φ: V → R^d as the mapping function from multi-typed vertexes to feature representations. Here, d is a parameter specifying the number of dimensions, and Φ is a matrix of |V| × d parameters. The following objective function should be optimized for heterogeneous graph representation learning:
(1)  arg max_θ ∑_{v∈V} ∑_{t∈A} ∑_{c_t∈N_t(v)} log p(c_t | v; θ)
where N_t(v) denotes v's network neighborhood (“context”) consisting of vertexes of the t-th type. The feature learning method is based on the Skip-gram architecture (Mikolov et al., 2013a; Mikolov et al., 2013c; Bengio et al., 2013), which was originally developed for natural language processing and word embedding. Unlike the linear nature of text, the structural and semantic characteristics of a graph allow the vertex's network neighborhood, N_t(v), to be defined in various ways, e.g., as the direct (one-hop) neighbors of v. Modeling the vertex neighborhood is critical in graph representation learning. Following previous network embedding models (Perozzi et al., 2014; Grover and Leskovec, 2016; Dong et al., 2017), in this study we leverage a random-walk strategy for every vertex to generate N_t(v). For instance, in Figure 2, we can sample a random walk sequence of a fixed length and derive the corresponding neighborhoods within a context window (window size 1). The detailed description of this method is given in Section 3.2. p(c_t | v; θ) defines the conditional probability of having a context vertex c_t given the vertex v's representation, which is commonly modeled as a softmax function:
(2)  p(c_t | v; θ) = exp(X_{c_t} · X_v) / ∑_{u∈V} exp(X_u · X_v)
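Before any softmax is applied, the sampled walks must be turned into (vertex, context) training pairs for the Skip-gram objective. A minimal sketch of that windowing step follows; `context_pairs` is a hypothetical helper name, and the scheme is the standard Skip-gram one rather than anything specific to the paper.

```python
def context_pairs(walk, window=1):
    """Turn one random-walk vertex sequence into (vertex, context) pairs,
    as consumed by a Skip-gram style objective."""
    pairs = []
    for i, v in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center vertex itself
                pairs.append((v, walk[j]))
    return pairs
```

For a walk a → b → c with window size 1, this yields the pairs (a,b), (b,a), (b,c), (c,b).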
In this study, we use a heterogeneous softmax function for the conditional probability calculation (Dong et al., 2017):
(3)  p(c_t | v; θ) = exp(X_{c_t} · X_v) / ∑_{u_t∈V_t} exp(X_{u_t} · X_v)
where V_t is the vertex set of type t in G. Different from the common Skip-gram form, the heterogeneous Skip-gram with the heterogeneous softmax function specifies one set of multinomial distributions for each type of neighborhood in the output layer of the Skip-gram model. Stochastic gradient ascent is used for optimizing the model parameters θ. Negative sampling (Mikolov et al., 2013c) is applied for optimization efficiency; in particular, for the heterogeneous softmax, the negative vertexes are sampled from the graph according to their type information (Dong et al., 2017).
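The key point of the heterogeneous softmax is that normalization runs only over vertexes sharing the context vertex's type. A minimal sketch (the names `hetero_softmax`, `emb` and `vertex_type` are illustrative; the real model computes this inside the Skip-gram output layer, not as a standalone function):

```python
import numpy as np

def hetero_softmax(ctx, v_emb, emb, vertex_type):
    """p(ctx | v) normalized only over vertexes of ctx's type,
    in the spirit of metapath2vec++ (Dong et al., 2017).

    emb:         vertex id -> embedding vector
    vertex_type: vertex id -> type label
    """
    t = vertex_type[ctx]
    same_type = [u for u in emb if vertex_type[u] == t]
    logits = np.array([np.dot(emb[u], v_emb) for u in same_type])
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    return dict(zip(same_type, probs))
```

Vertexes of other types (e.g., keywords when the context is a paper) receive no probability mass at all, rather than a small one.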
Recall the CCR definition in Section 2: given a query paper d_s in the source language, the problem is to compute the recommendation probability of a candidate paper d_t in the target language, with representation function Φ and probability scoring function f. In this study, Φ is the optimized heterogeneous vertex representation, and we use cosine similarity with a ReLU function for f:

(4)  f(Φ(d_s), Φ(d_t)) = ReLU(cos(Φ(d_s), Φ(d_t)))
3.2. Hierarchical Representation Learning on Heterogeneous Graph
In this section, we propose a novel hierarchical random-walk strategy for generating vertex neighborhoods on a heterogeneous graph. Before moving on, let us clarify four challenges for random-walk-based graph embedding models.
(1) Existing homogeneous random-walk-based embedding approaches, e.g., (Perozzi et al., 2014; Grover and Leskovec, 2016), cannot be directly applied to heterogeneous graph problems. For instance, as Figure 2 shows, a pair of papers may be connected by two different types of relations, a semantic relatedness relation and a citation relation, but homogeneous random-walk approaches cannot distinguish between relation types, so the generated neighborhoods can be problematic for further representation learning.
(2) Recent heterogeneous graph embedding algorithms (Dong et al., 2017; Fu et al., 2017) require a domain expert to generate random walk hypotheses, which can be inconvenient and problematic for complex heterogeneous graphs.
(3) Insufficient global information. At each step of the walk, many random-walk-based models depend solely on the (local) network topology around the vertexes, but global information, i.e., graph schema information, may be important for navigating the walker on the graph.
(4) Most existing graph embedding methods aim to encode the topological information of the graph, which is task-independent. For instance, as described in Table 1, the vertexes and relations with their transition probabilities are fixed after the graph is constructed. We argue that the learned representation should be optimized for different tasks, e.g., the cross-language citation recommendation task in this study. A flexible representation mechanism can be important for addressing recommendation problems via heterogeneous graphs. For instance, on a complex graph, some kinds of relations can be more important for the random walk than others, given the task (a conditional relation type usefulness probability given the task).
To address these challenges, we propose a Hierarchical Representation Learning on Heterogeneous Graph (HRLHG) method. By introducing a set of Relation Type Usefulness Distributions (RTUD) on the graph schema, the hierarchical (two-level) random walk algorithm becomes more appropriate for heterogeneous network structure. As RTUD can be learned automatically, we do not need expert knowledge (e.g., meta-path generation) for representation learning. Meanwhile, by using RTUD, we not only bring in global information to guide the random walk, but also utilize task-specific information to optimize random walk generation.
Given a specific task on a heterogeneous graph G: Relation Type Usefulness Distributions (RTUD) are a group of task-preferred (usefulness) probability distributions over relation types, defined at the graph schema level (global level) of G. As Figure 3 shows, RTUD can be represented as a probability matrix of size |A| × |R|, where row i represents the relation type usefulness distribution for a specific vertex type a_i, and entry (i, j) denotes the usefulness probability of relation type r_j given a_i. Correspondingly, Relation Transition Distributions (RTD) are a group of task-independent probability distributions associated with relations, defined at the graph instance level (local level) of G. Given a vertex v of type a_i, RTD specifies the transition distribution over the vertexes reachable from v via relations of a given type. RTD aims to reflect the basic semantics of the different types of relations in G and focuses on the local structure around v. For instance, Table 1 defines the RTD of the cross-language heterogeneous graph.
As Figure 1 shows, with RTUD and RTD we can simulate a hierarchical random walk of fixed length in G. To avoid walking into a dead end, the directions of relations are ignored in the hierarchical random walk algorithm. The hierarchical random walk process is as follows:

For each vertex v in the walk:
(1) Retrieve the relation type usefulness distribution from RTUD based on v's vertex type.
(2) Probabilistically draw a relation type r from that distribution.
(3) For the generated relation type r, retrieve the corresponding transition distribution from RTD based on v.
(4) Based on that distribution, probabilistically draw one vertex u among v's neighbors as the destination vertex.
(5) Walk forward to u.
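The steps above can be sketched as follows, assuming dictionary layouts for RTUD and RTD that the paper leaves unspecified (the function name `hierarchical_walk` and both data layouts are illustrative assumptions):

```python
import random

def hierarchical_walk(start, length, rtud, rtd, vertex_type, rng=random):
    """Two-level random walk: sample a relation type from RTUD
    (schema level), then a destination vertex from RTD (instance level).

    rtud[type_label]       -> {relation_type: usefulness probability}
    rtd[(vertex, rel_type)] -> {neighbor: transition probability}
    """
    walk, v = [start], start
    for _ in range(length - 1):
        type_dist = rtud[vertex_type[v]]
        # Keep only relation types that actually leave v on this graph.
        usable = {r: p for r, p in type_dist.items() if (v, r) in rtd and p > 0}
        if not usable:  # dead end: stop the walk early
            break
        r = rng.choices(list(usable), weights=list(usable.values()))[0]
        nbrs = rtd[(v, r)]
        v = rng.choices(list(nbrs), weights=list(nbrs.values()))[0]
        walk.append(v)
    return walk
```

On a graph where each step has a single possible choice, the walk is deterministic, which makes the two-level sampling easy to trace.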


In this study, with a set of labeled vertex pairs, we propose an iterative K-shortest-paths-ranking-based EM (expectation-maximization) approach to obtain and optimize RTUD. The labeled set is generated based on task-specific relevance. For instance, for the CCR task, a pair of paper vertexes connected via a cross-language citation relation could serve as a labeled pair for RTUD training. For a specific task, the representations of a relevant pair of vertexes should be similar.
Then, in the proposed hierarchical representation learning framework, the goals are: (1) the vertex neighborhood should contain task-relevant vertexes to the greatest extent possible; and (2) the distance (random walk sequence length) between two task-relevant vertexes should be as short as possible. In other words, RTUD should be trained to navigate the random walk between related vertex pairs; that is, with the trained RTUD, a random walk starting at one relevant vertex has a greater chance of reaching the other on the heterogeneous graph.
We formalize this goal as a K-shortest-paths ranking problem. A path p from v_a to v_b in G is a sequence of vertexes and relations, and P(v_a, v_b) denotes the set of all paths from v_a to v_b in G. Given a relation e of type r from a vertex of type a_i to a vertex of type a_j, e's weight integrates RTUD and RTD, i.e., it combines the usefulness probability of r given a_i with the transition probability of e. The weight of a path p is the sum of the weights of all relations on p. The shortest-path objective is to determine a path p* whose weight is no greater than that of any path in P(v_a, v_b) (Gallo and Pallottino, 1986). The K-shortest-paths objective extends this to determine the second, third, …, K-th shortest paths in P(v_a, v_b). Many efficient algorithms exist for this problem; we utilize the method proposed in (de Azevedo et al., 1990). The RTUD training then follows an EM framework, as described in Algorithm 1.
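The E-step idea, counting relation types along the K lightest paths between a labeled pair, can be sketched by brute-force enumeration on a tiny graph (the paper uses the efficient algorithm of de Azevedo et al. (1990); the function name `k_shortest_relation_counts` and the adjacency-list layout here are assumptions):

```python
def k_shortest_relation_counts(graph, src, dst, k):
    """Enumerate simple paths from src to dst, keep the k lightest, and
    count how often each relation type appears on them: a brute-force
    stand-in for the E-step of RTUD training, viable only on tiny graphs.

    graph[v] -> list of (neighbor, relation_type, weight) triples.
    """
    paths = []

    def dfs(v, visited, edges, w):
        if v == dst:
            paths.append((w, list(edges)))
            return
        for u, rel, wt in graph.get(v, []):
            if u not in visited:  # simple paths only
                visited.add(u)
                edges.append(rel)
                dfs(u, visited, edges, w + wt)
                edges.pop()
                visited.remove(u)

    dfs(src, {src}, [], 0.0)
    paths.sort(key=lambda p: p[0])  # lightest first
    counts = {}
    for _, rels in paths[:k]:
        for rel in rels:
            counts[rel] = counts.get(rel, 0) + 1
    return counts
```

Increasing k pulls relation types from progressively heavier paths into the counts, which is exactly the signal the EM iteration feeds back into RTUD.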
In Algorithm 1, Δ is a relation type update factor vector, where Δ_i denotes the update value of the i-th relation type, and c_i denotes one count of an appearance of a type-i relation in the K shortest paths. We explore three forms for the relation type update factor function:

Raw Count (RC): during each iteration, directly accumulate the relation type counts.

Length-Normalized Count (LNC): during each iteration, accumulate the relation type counts normalized by the length of the path they occur on. By doing so, we try to minimize the possible bias from long paths.

Log-Discounted Count (LDC): during each iteration, accumulate the relation type counts discounted by path ranking, where the shortest path has rank 1. With this update function, different shortest paths are given different weights.
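The three update factor functions might be sketched as below; the exact normalization and discount formulas are a hedged reading of the descriptions above, not taken verbatim from the paper.

```python
import math

def update_factor(rel_counts, path_len, rank, mode="LNC"):
    """Relation-type update values contributed by one retrieved path.

    rel_counts: {relation_type: occurrences on this path}
    RC  = raw counts;
    LNC = counts normalized by the path length;
    LDC = counts discounted by the log of the path's rank (rank 1 = shortest).
    """
    if mode == "RC":
        return dict(rel_counts)
    if mode == "LNC":
        return {r: c / path_len for r, c in rel_counts.items()}
    if mode == "LDC":
        return {r: c / math.log2(rank + 1) for r, c in rel_counts.items()}
    raise ValueError(mode)
```

Under LDC, the rank-1 path is undiscounted (log2(2) = 1), while lower-ranked paths contribute progressively less.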
For the RTUD update function, we define two different forms:
Direct Sum (DS): in DS, we update the distribution by directly adding the update values and then normalizing so that the probabilities sum to one.
Sum with a Damping Factor (SDF): in order to avoid extreme probabilities, in SDF we add a damping factor for updating, with normalization over the number of possible relation types for a specific vertex type (as constrained by the graph schema).
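The two RTUD update forms can be sketched as follows; the SDF mixing formula here is a hedged reconstruction in the spirit of PageRank-style smoothing, since the paper's exact expression is not reproduced.

```python
def update_rtud(row, delta, mode="SDF", mu=0.8):
    """Update one RTUD row (distribution over relation types) from the
    accumulated update vector `delta`.

    DS  renormalizes the raw update sums;
    SDF mixes in a uniform component via damping factor mu, so no
    relation type's probability collapses to an extreme value.
    """
    total = sum(delta.values()) or 1.0
    m = len(row)  # number of relation types allowed by the schema
    new = {}
    for r in row:
        p = delta.get(r, 0.0) / total
        if mode == "DS":
            new[r] = p
        elif mode == "SDF":
            new[r] = (1 - mu) / m + mu * p
        else:
            raise ValueError(mode)
    return new
```

With mu = 0.8, one fifth of the probability mass stays uniformly spread over the schema-allowed relation types, which mirrors the paper's finding that a smaller damping factor penalizes dominant relation types.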
Note that RTUD is constrained by the graph schema: given a vertex type, if a relation type violates the graph schema, its probability is set to zero. For instance, for a keyword vertex, the usefulness probability of the paper citation relation (a relation between paper vertex pairs) is zero. RTUD is task-specific, meaning it can change dynamically across tasks even when they share the same graph (e.g., this graph could also serve a collaborator recommendation task, but the corresponding RTUD may differ).
The pseudocode for Hierarchical Representation Learning on Heterogeneous Graph (HRLHG) is given in Algorithm 2. By applying random walks of fixed length starting from each vertex in G, we can minimize the implicit random walk biases. RTUD can be pre-trained by the K-shortest-paths-ranking-based EM approach. The space complexity of HRLHG is linear in the number of relations of G, and the per-step time complexity of a hierarchical random walk is determined by the number of relation types plus the number of relation instances of the sampled type connected to the current vertex. As suggested by (Grover and Leskovec, 2016), the time cost can be further reduced by parallelizing the hierarchical random walk simulations and executing them asynchronously. (The source code of HRLHG, the constructed graph data (with labeled ground truth) and the learned representations are available at https://github.com/GraphEmbedding/HRLHG.)
4. Experiment
4.1. Dataset and Experiment Setting
Dataset. We validated the proposed approach in a citation recommendation task between Chinese and English digital libraries. The goal was to recommend English candidate cited papers for a given Chinese publication. For this experiment, we collected 14,631 Chinese papers from the Wanfang digital library and 248,893 English papers from the Association for Computing Machinery (ACM) digital library (both in computer science). There were 750,557 English-to-English paper citation relations, 11,252 Chinese-to-Chinese paper citation relations, and 27,101 Chinese-to-English paper citation relations; 12,403 English papers had been cited by 7,900 Chinese papers. By using machine translation (the Google Translation API, https://cloud.google.com/translate) and language modeling (with Dirichlet smoothing), we generated 158,000 cross-language semantic matching relations (from Chinese to English). There were 3,953 Chinese keywords associated with the collected Chinese papers, with 7,316 Chinese paper-keyword association relations; and 7,436 English keywords associated with the collected English papers, with 903,265 English paper-keyword association relations. Between keywords, there were 283,268 English-to-English keyword citation relations, 2,973 Chinese-to-Chinese keyword citation relations, and 9,828 Chinese-to-English keyword citation relations. 2,564 Chinese keywords could be successfully translated into corresponding English keywords.
Ground Truth and Evaluation Metrics. For evaluation, we generated a number of positive and negative instances to compare the different algorithms on the CCR task. The actual cross-language citation relations were used as ground truth (with 0/1 relevance scores). For example, if a candidate ACM paper was cited by a testing Wanfang paper, its relevance score was 1; otherwise it was 0. We generated the test and candidate collections as follows: (1) randomly select a certain proportion of papers from the 7,900 Chinese papers that have cross-language citation relations to the English corpus; (2) remove all cross-language citation relations from the selected Chinese papers (other relations, e.g., Chinese citation relations, were kept for model training); (3) use the selected papers as the test collection. All English papers cited by the Chinese papers in the test collection were used as the candidate (cited paper) collection. The different models were compared using mean average precision (MAP), normalized discounted cumulative gain at rank k (NDCG@k), precision (P) and mean reciprocal rank (MRR).

Validation Set. For HRLHG, several hyperparameters (e.g., K for the K-shortest-paths EM method) and algorithm functions (e.g., the update factor and RTUD update functions for RTUD training) needed to be tuned. Meanwhile, for a fair comparison, we also tuned the hyperparameters of the baselines (e.g., the return parameter p and in-out parameter q of the node2vec algorithm) to make sure the baseline algorithms achieved their best performance. Therefore, we constructed a validation set following the process described above (10% of the papers were randomly selected for validation). A comprehensive model component analysis and baseline hyperparameter tuning were conducted on this validation set.
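As an illustration of one of the reported metrics, MRR over per-query rankings can be computed as follows (a textbook implementation, not the paper's evaluation script):

```python
def mrr(ranked_lists, relevant):
    """Mean Reciprocal Rank over queries.

    ranked_lists[q]: the system's ranking (list of doc ids) for query q.
    relevant[q]:     the set of ground-truth cited papers for query q.
    """
    total = 0.0
    for q, ranking in ranked_lists.items():
        for i, doc in enumerate(ranking, start=1):
            if doc in relevant[q]:
                total += 1.0 / i  # reciprocal rank of the first hit
                break
    return total / len(ranked_lists)
```

A query whose first relevant paper appears at rank 2 contributes 0.5; queries with no relevant paper retrieved contribute 0.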
Baselines. We compared against three groups of representation algorithms, from both text and graph viewpoints, to comprehensively evaluate the performance of the proposed method. 10-fold cross-validation was applied to avoid evaluation bias.
Textual Content Based Method.
1. Embedding Transformation (Mikolov et al., 2013b): We transformed the testing Chinese paper's abstract embedding into the English embedding space through a trained transformation matrix, and then recommended English citations based on the transformed embedding; denoted as EF.
2. Machine Translation (Google Translation API) + Language Model (with Dirichlet smoothing) (Zhai and Lafferty, 2001): We translated the testing Chinese paper's abstract into English, and then used a language model to recommend English citations; denoted as MT+LM.
Collaborative Filtering Based Method.
3. Item-based Collaborative Filtering (Sarwar et al., 2001): Recommended English citations using item-based collaborative filtering on the (monolingual + cross-language) citation relations; denoted as CF-item.
4. Popularity-based Collaborative Filtering (Su and Khoshgoftaar, 2009): Recommended English citations using popularity-based collaborative filtering on the (monolingual + cross-language) citation relations; denoted as CF-pop.
Network Representation Learning Based Method.
5. DeepWalk (Perozzi et al., 2014): We used DeepWalk to learn graph embeddings via uniform random walks and recommended English citations based on the learned embeddings. Because DeepWalk was originally designed for homogeneous graphs, for a fair comparison we applied DeepWalk on two graphs: (1) the citation network, denoted as DW-cit; and (2) the all-typed network with accumulated relation weights, in which all relations between two vertexes are integrated into one edge whose weight is the sum of the integrated relations, simplifying the heterogeneous graph to a homogeneous one; denoted as DW-all.

6. LINE (Tang et al., 2015): LINE aims at preserving first-order and second-order proximity in concatenated embeddings. As with DeepWalk, we applied LINE on the two graphs with both the first-order and second-order representation approaches, yielding four baseline models, denoted as LINE-cit-1st, LINE-cit-2nd, LINE-all-1st and LINE-all-2nd.
7. node2vec (Grover and Leskovec, 2016): node2vec learns graph embeddings via second-order random walks in the network. As with DeepWalk and LINE, we applied node2vec on the two graphs for comparison, denoted as N2V-cit and N2V-all. We tuned the return parameter p and in-out parameter q with a grid search on the validation set, and picked the best-performing parameter setting for the experiment, as suggested by (Grover and Leskovec, 2016).
8. metapath2vec++ (Dong et al., 2017): metapath2vec++ was originally designed for heterogeneous graphs. It learns heterogeneous graph embeddings via meta-path-based random walks and heterogeneous negative sampling in the network, and requires a human-defined meta-path scheme to guide the random walks. We tried three different meta-paths for this experiment, denoted as M2V++-p1, M2V++-p2 and M2V++-p3, respectively. We also trained two learning-to-rank models (Coordinate Ascent (Metzler and Croft, 2007) and ListNet (Cao et al., 2007)) to further integrate the three metapath2vec++ models (utilizing each meta-path as a ranking feature), denoted as M2V++-CA and M2V++-LN.
For a fair comparison, all random-walk-based embedding methods used the same parameters: (1) number of walks per vertex: 10; (2) walk length: 80; (3) vector dimension: 128; (4) neighborhood (context window) size: 10. Please note that most of the original baseline papers used these parameter settings, and the proposed method shared the same parameters. For experimental fairness, we did not tune these parameters on the validation set. The parameter sensitivity analysis is presented in Section 4.2.
4.2. Impact of Different Model Components
Algorithm  NDCG@10  NDCG@30  NDCG@50  P@10  P@30  P@50  MAP@10  MAP@30  MAP@50  MRR 

EF  0.0176  0.0301  0.0384  0.0072  0.0060  0.0054  0.0101  0.0129  0.0140  0.0300 
MT+LM  0.3404  0.3811  0.3966  0.1225  0.0573  0.0387  0.2563  0.2739  0.2777  0.4343 
CF-item  0.0980  0.1034  0.1059  0.0330  0.0134  0.0086  0.0772  0.0793  0.0796  0.1290 
CF-pop  0.0041  0.0082  0.0108  0.0026  0.0024  0.0022  0.0017  0.0023  0.0026  0.0090 
DW-cit  0.2713  0.3060  0.3177  0.1037  0.0502  0.0336  0.2162  0.2348  0.2381  0.3053 
DW-all  0.3606  0.4214  0.4416  0.1463  0.0735  0.0499  0.2679  0.2979  0.3033  0.4077 
LINE-cit-1st  0.2258  0.2557  0.2674  0.0854  0.0421  0.0289  0.1777  0.1927  0.1958  0.2628 
LINE-cit-2nd  0.1499  0.1730  0.1822  0.0572  0.0295  0.0205  0.1136  0.1241  0.1263  0.1894 
LINE-all-1st  0.3534  0.4096  0.4302  0.1386  0.0691  0.0473  0.2671  0.2936  0.2990  0.4090 
LINE-all-2nd  0.1047  0.1385  0.1564  0.0453  0.0284  0.0221  0.0663  0.0775  0.0811  0.1544 
N2V-cit  0.2724  0.3040  0.3153  0.1025  0.0489  0.0327  0.2183  0.2353  0.2383  0.3083 
N2V-all  0.4651  0.5194  0.5354  0.1730  0.0809  0.0533  0.3661  0.3951  0.3999  0.5194 
M2V++-p1  0.0195  0.0214  0.0225  0.0052  0.0023  0.0015  0.0144  0.0147  0.0148  0.0312 
M2V++-p2  0.0015  0.0031  0.0045  0.0006  0.0006  0.0006  0.0006  0.0007  0.0009  0.0037 
M2V++-p3  0.0687  0.0933  0.1058  0.0308  0.0195  0.0150  0.0409  0.0481  0.0503  0.1070 
M2V++-CA  0.0198  0.0273  0.0321  0.0084  0.0054  0.0045  0.0113  0.0130  0.0135  0.0389 
M2V++-LN  0.0243  0.0335  0.0380  0.0107  0.0068  0.0052  0.0136  0.0156  0.0161  0.0451 
HRLHG  0.5034†††  0.5522†††  0.5664†††  0.1840†††  0.0832†††  0.0543†††  0.4033†††  0.4309†††  0.4353†††  0.5598†††

Significant test: ^{†}, ^{††}, ^{†††}
The proposed HRLHG has several important parameters and functions. To explore the effects of these model components, we compared the cross-language recommendation performance of the proposed method on the validation set under different model settings (varying the examined component while keeping the others fixed). We mainly focused on the following components: (a) K, the parameter of the K-shortest-paths-based EM algorithm; we compared several candidate values and selected the best. (b) The relation type update factor function for RTUD training: Raw Count (RC), Length-Normalized Count (LNC) and Log-Discounted Count (LDC). (c) The RTUD update function for model training: Direct Sum (DS) and Sum with a Damping Factor (SDF). (d) The damping factor of the SDF form; we compared several candidate values and selected the best. (e) The convergence percentage of the EM algorithm; we tried values from 90% down to 10%. (f) Validation of the relation type usefulness distributions (RTUD) and the heterogeneous skip-gram (HS).
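The three update factor functions in (b) are not given in closed form in this section, so the following is only a plausible sketch of what raw, length-normalized and log-discounted counting over a set of K shortest paths (each path represented as a sequence of relation types) might look like; the exact formulas are an assumption:

```python
import math
from collections import Counter

def raw_count(paths):
    """RC: each relation type occurrence on a path counts as 1."""
    counts = Counter()
    for path in paths:
        counts.update(path)
    return dict(counts)

def length_normalized_count(paths):
    """LNC: an occurrence on a path of length L contributes 1/L,
    so long paths do not dominate the counts."""
    counts = {}
    for path in paths:
        for rel in path:
            counts[rel] = counts.get(rel, 0.0) + 1.0 / len(path)
    return counts

def log_discounted_count(paths):
    """LDC: an occurrence on a path of length L contributes 1/log2(1+L),
    a softer discount than LNC."""
    counts = {}
    for path in paths:
        for rel in path:
            counts[rel] = counts.get(rel, 0.0) + 1.0 / math.log2(1 + len(path))
    return counts
```

Each variant turns the paths found by the K-shortest-paths search into per-relation-type evidence, which is then normalized into the RTUD probabilities during EM training.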
Note that we conducted a comprehensive comparison experiment. For each examined model component, we tried multiple combinations of the other components to avoid possible bias introduced by the component setting choices. For instance, we tested the impact of different K values under one component combination (LNC as the update factor function, DS as the update function and an 80% convergence percentage for RTUD training, with the ordinary skip-gram for embedding) and under another (RC as the update factor function, SDF with damping factor 0.8 and a 20% convergence percentage for RTUD training, with the heterogeneous skip-gram for embedding). Because of space limitations, we cannot report all results in this paper. Representative results on the validation set in terms of NDCG are depicted in Figure 5 (the other comparison groups showed similar trends).
As we can see, considering more shortest paths in RTUD training brings a performance improvement. The length-normalized count (LNC) function achieved the best performance among the three choices. For the update function, sum with a damping factor (SDF) outperforms direct sum (DS). When using SDF, a small damping factor can be superior to a large one. A possible explanation is that a small damping factor penalizes a dominant relation type usefulness probability and thus avoids overfitting. Generally, the algorithm performed better when more distributions became stable across the training iterations.
To validate the effectiveness of RTUD and the heterogeneous skip-gram (HS), we also compared the performance of our model without them. As Figure 5 (f) shows, when RTUD was removed (treating each relation type equally during the hierarchical random walk) and HS was replaced by the ordinary skip-gram, recommendation performance declined significantly. It is clear that RTUD and HS contribute significantly to heterogeneous graph based random walk and to recommendation performance. More importantly, RTUD does not need any human intervention or expert knowledge.
Based on this comparison and analysis, we selected the component setting K = 3, LNC as the update factor function, SDF with damping factor 0.2 and an 80% convergence percentage for RTUD training, with the heterogeneous skip-gram for vertex embedding, for the further experiments against the baselines.
4.3. Comparison with Baselines
The cross-language citation recommendation performance of the different models is displayed in Table 2. Based on the experimental results, we made the following observations: (1) The proposed method significantly outperformed all baseline models on every evaluation metric. For instance, in terms of MAP@10, HRLHG achieved at least a 10% improvement over each of the other 17 baselines. (2) The traditional models that rely on a single kind of information, i.e., the machine translation based methods (EF, MT+LM) and the citation-relation-based collaborative filtering approaches (the two CF variants), cannot work as well as the network embedding based methods. (3) Although designed for homogeneous networks, DeepWalk (DW), LINE and node2vec (N2V) achieved significant improvements when more types of vertices and relations were added, compared with their counterparts trained only on the citation network. This observation confirms that heterogeneous information does enhance the models' representation learning abilities. (4) metapath2vec++, although designed for heterogeneous information networks, did not work well in the experiment. Even after applying learning-to-rank algorithms to integrate multiple metapath2vec++ models, the recommendation results were still poor. A possible explanation is that, for the CCR task, no single meta-path can cover the recommendation requirement. In addition, the meta-path based random walk is too strict to explore potentially useful neighbourhoods for vertex representation learning. This observation also indicates that metapath2vec++ depends on domain expert knowledge: if one cannot find the optimal meta-path, its embedding performance is even worse than that of the homogeneous network representation learning models.
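The evaluation metrics in Table 2 are standard ranking measures. A minimal self-contained implementation (binary relevance; note that MAP@k has several normalization conventions, and the choice below, normalizing by the number of hits in the top k, is one common option) is:

```python
import math

def ndcg_at_k(rels, k):
    """NDCG@k over a relevance list in ranked order."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sum(r / math.log2(i + 2)
                for i, r in enumerate(sorted(rels, reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0

def ap_at_k(rels, k):
    """Average precision@k (binary relevance)."""
    hits, score = 0, 0.0
    for i, r in enumerate(rels[:k]):
        if r:
            hits += 1
            score += hits / (i + 1)  # precision at each hit position
    return score / hits if hits else 0.0

def mrr(ranked_lists):
    """Mean reciprocal rank of the first relevant item in each list."""
    total = 0.0
    for rels in ranked_lists:
        for i, r in enumerate(rels):
            if r:
                total += 1.0 / (i + 1)
                break
    return total / len(ranked_lists)
```

Here `rels` is the ranked list of relevance labels for one query (1 = cited paper, 0 = not cited), and MRR averages over all test queries.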
For each vertex type, HRLHG trains a relation type usefulness distribution. Based on the learned RTUD (available at the project website), we can obtain task-specific knowledge and improve the interpretability of the proposed graph representation model. For instance, in the experimental CCR task, when a random walker reaches an English vertex, the probabilities of the relation types "an English paper cites another English paper" and "an English paper has an English keyword" for the next move are higher than those of the other relation types. This distribution steers the walker to stay in the English repository rather than going back to the Chinese repository. By conducting the hierarchical random walk based on RTUD, this task-specific knowledge can be further embedded into the representations learned by HRLHG.
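This two-level navigation can be illustrated with a sketch (under assumed data structures, not the paper's implementation): at each step the walker first samples a relation type from the RTUD of the current vertex type, then moves uniformly among the neighbors reachable via that relation:

```python
import random

def hierarchical_step(neighbors, rtud, node, node_type, rng=random):
    """One step of a two-level random walk.
    neighbors: dict mapping (node, relation_type) -> list of adjacent nodes.
    rtud: dict mapping node_type -> {relation_type: probability}."""
    # Global level (graph schema): sample a relation type from the learned
    # RTUD, restricted to relations that actually have neighbors here.
    rels = [r for r in rtud[node_type] if neighbors.get((node, r))]
    if not rels:
        return None
    weights = [rtud[node_type][r] for r in rels]
    rel = rng.choices(rels, weights=weights, k=1)[0]
    # Local level (graph instance): uniform move under the chosen relation.
    return rng.choice(neighbors[(node, rel)])
```

With an RTUD like the one described above, a walker at an English paper would mostly follow English-citation and English-keyword edges, which is exactly the repository-staying behavior the learned distribution encodes.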
In sum, for the CCR task, the proposed HRLHG method can automatically learn the relation type usefulness distributions for random walk navigation, and the new method significantly outperformed the current text-based, homogeneous graph and heterogeneous graph embedding methods.
5. Related Work
Citation recommendation aims to recommend a list of citations (references) based on the similarity between candidate papers and a user profile or a sample of in-progress text. For instance, He et al. (He et al., 2010) proposed a probabilistic model to compute a relevance score based on the contexts of a citation and its abstract. Jiang et al. (Jiang et al., 2015) generated a heterogeneous graph with various relations between topics and papers, and used a supervised random walk for citation recommendation. From a bibliographic viewpoint, Shi, Leskovec, and McFarland (Shi et al., 2010) developed citation projection graphs by investigating citations among the publications cited by a given paper. Collaborative filtering algorithms can also be used for recommending citations (McNee et al., 2002). However, all of these prior studies focused on monolingual citation recommendation and cannot be directly used for cross-language citation recommendation. Intuitively, translation-based models can be adapted for cross-language recommendation. Word embedding has recently become a powerful approach for content representation (Mikolov et al., 2013a). Mikolov et al. (Mikolov et al., 2013b) transformed one language's vector space into that of another via a linear projection with a learned transformation matrix. This approach is effective for word translation, but its effectiveness for scholarly text has not yet been demonstrated. Tang et al. (Tang et al., 2014) proposed bilingual embedding algorithms, which are efficient for the cross-language context-aware citation recommendation task. However, they ignored the important citation relations in their work.
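The linear projection of Mikolov et al. (2013b) can be recovered with ordinary least squares. The sketch below (with illustrative variable names) learns a matrix W minimizing ||XW - Z||_F for paired source/target word vectors X and Z:

```python
import numpy as np

def learn_translation_matrix(X_src, X_tgt):
    """Least-squares solution of min_W ||X_src @ W - X_tgt||_F,
    where row i of X_src and X_tgt are embeddings of a bilingual
    word pair (e.g., from a seed dictionary)."""
    W, _, _, _ = np.linalg.lstsq(X_src, X_tgt, rcond=None)
    return W
```

A source-language word is then "translated" by projecting its vector with W and retrieving the nearest target-language vector, which is the word-level analogue of the cross-language matching that CCR needs at the document level.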
Network embedding algorithms, namely graph representation learning models, aim to learn low-dimensional feature representations of the nodes in a network, and they have recently attracted increasing attention. Based on the techniques used, we can briefly classify these algorithms into the following categories: graph factorization based models, e.g., GraRep (Cao et al., 2015); shallow neural network based models, e.g., LINE (Tang et al., 2015); deep neural network based models, e.g., GCN (Kipf and Welling, 2016); and random walk based methods, e.g., DeepWalk (Perozzi et al., 2014), node2vec (Grover and Leskovec, 2016) and metapath2vec++ (Dong et al., 2017). Technically, the random walk based models also use a shallow neural network; the main difference among them is the random walk algorithm used to generate vertex sequences from the graph. A potential problem for GraRep and GCN is their space complexity: the computational cost of these models can be too expensive to embed the large, complex networks found in the real world. For instance, in our CCR experiment (a graph with roughly 200,000 vertices), the memory requirement of GraRep/GCN exceeds 600 GB.

In this study, we address the CCR problem and propose a novel method, HRLHG, to learn a mapping of publications into a low-dimensional joint embedding space for a heterogeneous graph. HRLHG belongs to the random walk based network embedding models. A hierarchical random walk is proposed to cope with the task-specific problem on the heterogeneous graph. To the best of our knowledge, few existing studies have investigated graph embedding approaches for the cross-language citation recommendation problem.
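The 600 GB figure is plausible from quick arithmetic: a single dense |V| x |V| matrix at |V| = 200,000 already needs hundreds of gigabytes, and factorization methods typically hold more than one such matrix. The helper below is only illustrative:

```python
def dense_matrix_gib(n_vertices, bytes_per_entry=8):
    """Memory of one dense n x n matrix (float64 by default), in GiB."""
    return n_vertices ** 2 * bytes_per_entry / 2 ** 30

# One 200,000 x 200,000 float64 matrix is ~298 GiB, so holding
# two such matrices already exceeds the 600 GB noted above.
```

Random walk based models avoid this cost because they only materialize walk sequences and an embedding table of size |V| x d (here d = 128), which is several orders of magnitude smaller.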
6. Conclusion
In this paper, we propose a new problem: cross-language citation recommendation (CCR). Unlike existing scholarly recommendation problems, CCR enables cross-language and cross-repository recommendation. The proposed Hierarchical Representation Learning on Heterogeneous Graph (HRLHG) model can project a publication into a joint embedding space that encapsulates both semantic and topological information. By training a set of relation type usefulness distributions (RTUD) on a heterogeneous graph, we propose a hierarchical two-level random walk: the global level performs graph schema navigation (task-specific), while the local level walks the graph instance (task-independent).
Unlike most prior heterogeneous graph mining methods, which employ expert-generated or rule-based ranking hypotheses to address recommendation problems, in a complex CCR graph it can be difficult to exhaustively examine all of the potentially useful path types to generate meta-paths. Furthermore, if a large number of random walk based ranking functions are used, the computational cost can be prohibitive. Extensive experiments support our hypothesis that the latent heterogeneous graph feature representations learned by HRLHG improve cross-language citation recommendation performance (compared with 17 state-of-the-art baselines). In addition, the learned RTUD can reveal latent task-specific knowledge, which is important for the interpretability of the proposed representation model.
In the future, we will validate the proposed method on other heterogeneous graph embedding based tasks, e.g., music or movie recommendation. Meanwhile, we will investigate more sophisticated methods to generate RTUD, for instance, adding a personalization component to the algorithm to enable personalized heterogeneous graph navigation for random walk optimization.
Acknowledgements.
This work is supported by the National Science Foundation of China (11401601, 61573028, 61472014), the Guangdong Province Frontier and Key Technology Innovative Grant (2015B010110003, 2016B030307003), the Health & Medical Collaborative Innovation Project of Guangzhou City, China (201604020003) and the Opening Project of the State Key Laboratory of Digital Publishing Technology.

References
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR).
 Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 8 (2013), 1798–1828.
 Cao et al. (2015) Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2015. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 891–900.
 Cao et al. (2007) Zhe Cao, Tao Qin, TieYan Liu, MingFeng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning. ACM, 129–136.
 de Azevedo et al. (1990) José Augusto de Azevedo, Joaquim João ER Silvestre Madeira, Ernesto Q Vieira Martins, and Filipe Manuel A Pires. 1990. A shortest paths ranking algorithm. In Proceedings of the Annual Conference of Associazione Italiana di Ricerca Operativa: Models and Methods for Decision Support (AIRO’90). 1–8.
 Dong et al. (2017) Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 135–144.
 Fu et al. (2017) Taoyang Fu, WangChien Lee, and Zhen Lei. 2017. HIN2Vec: Explore Metapaths in Heterogeneous Information Networks for Representation Learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 1797–1806.
 Gallo and Pallottino (1986) Giorgio Gallo and Stefano Pallottino. 1986. Shortest path methods: A unifying approach. Netflow at Pisa (1986), 38–64.
 Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 855–864.
 Guo et al. (2016) Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. 2016. A deep relevance matching model for adhoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 55–64.
 He et al. (2010) Qi He, Jian Pei, Daniel Kifer, Prasenjit Mitra, and Lee Giles. 2010. Contextaware citation recommendation. In Proceedings of the 19th international conference on World wide web. ACM, 421–430.
 Jiang et al. (2015) Zhuoren Jiang, Xiaozhong Liu, and Liangcai Gao. 2015. Chronological Citation Recommendation with InformationNeed Shifting. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 1291–1300.
 Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. SemiSupervised Classification with Graph Convolutional Networks. arXiv preprint arXiv:1609.02907 (2016).
 Lao and Cohen (2010) Ni Lao and William W Cohen. 2010. Relational retrieval using a combination of pathconstrained random walks. Machine learning 81, 1 (2010), 53–67.
 Liu et al. (2014) Xiaozhong Liu, Yingying Yu, Chun Guo, and Yizhou Sun. 2014. MetaPathBased Ranking with Pseudo Relevance Feedback on Heterogeneous Graph for Citation Recommendation. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM, 121–130.
 McNee et al. (2002) Sean M McNee, Istvan Albert, Dan Cosley, Prateep Gopalkrishnan, Shyong K Lam, Al Mamunur Rashid, Joseph A Konstan, and John Riedl. 2002. On the recommending of citations for research papers. In Proceedings of the 2002 ACM conference on Computer supported cooperative work. ACM, 116–125.
 Metzler and Croft (2007) Donald Metzler and W Bruce Croft. 2007. Linear featurebased models for information retrieval. Information Retrieval 10, 3 (2007), 257–274.
 Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
 Mikolov et al. (2013b) Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168 (2013).
 Mikolov et al. (2013c) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013c. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
 Perozzi et al. (2014) Bryan Perozzi, Rami AlRfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 701–710.
 Ren et al. (2014) Xiang Ren, Jialu Liu, Xiao Yu, Urvashi Khandelwal, Quanquan Gu, Lidan Wang, and Jiawei Han. 2014. Cluscite: Effective citation recommendation by information networkbased clustering. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 821–830.
 Sarwar et al. (2001) Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Itembased collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web. ACM, 285–295.
 Shi et al. (2010) Xiaolin Shi, Jure Leskovec, and Daniel A McFarland. 2010. Citing for high impact. In Proceedings of the 10th annual joint conference on Digital libraries. ACM, 49–58.

 Su and Khoshgoftaar (2009) Xiaoyuan Su and Taghi M Khoshgoftaar. 2009. A survey of collaborative filtering techniques. Advances in Artificial Intelligence 2009 (2009), 4.
 Sun et al. (2011) Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S Yu, and Tianyi Wu. 2011. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment 4, 11 (2011), 992–1003.
 Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Largescale information network embedding. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1067–1077.
 Tang and Zhang (2009) Jie Tang and Jing Zhang. 2009. A discriminative approach to topicbased citation recommendation. Advances in Knowledge Discovery and Data Mining (2009), 572–579.
 Tang et al. (2014) Xuewei Tang, Xiaojun Wan, and Xun Zhang. 2014. Crosslanguage contextaware citation recommendation in scientific articles. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM, 817–826.
 Zhai and Lafferty (2001) Chengxiang Zhai and John Lafferty. 2001. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 334–342.