For a project on “information extraction”, who would be able to provide guidelines for problem solving? For a new funding proposal on “ontology alignment”, who would be able to review it and make a sound assessment? For the upcoming PKDD conference on “data mining”, who should be invited to give a keynote speech? Experts.
Expert finding (balog2012expertise, ; deng2008formal, ; wang2013expertrank, ; zhang2007expert, ) is defined as the problem of ranking the candidates with appropriate expertise for a given query. The problem has received increasing attention in academia due to the TREC Expert Finding Track (soboroff2006overview, ). Accurate candidate ranking has broad applications. However, the problem is particularly challenging since a query can be as general as “data mining” and “planning” or as specific as “ontology alignment” and “information extraction”. Such discrepancy in query granularity poses particular challenges for accurate expert identification.
Previous studies in the information retrieval community usually formulate expert finding as a document search problem. Although promising results are obtained (hertzum2000information, ) by standard document search algorithms, the returned results are documents, not candidates. Take a social website as an example. Users actively participate in various online activities, such as posting, commenting, tagging, rating, and reviewing. This online textual information provides evidence for users’ skills and expertise. Moreover, users engage in online communities, collaborating and exchanging information with each other. A user cannot be simply represented by her posts or comments; she has much more complicated personal, social, and collaborative practices (deng2012modeling, ).
Many approaches have been proposed and studied for expert finding. The most popular models are document-based generative probabilistic models (balog2006formal, ; balog2012expertise, ; fang2007probabilistic, ). The major idea of the document-based models is that the expertise of a candidate can be estimated by aggregating textual evidence from relevant documents, which are retrieved by statistical language models. Nevertheless, this method suffers from two drawbacks. On one hand, when applying the statistical language model, there is a vocabulary gap between terms in the query and the documents. On the other hand, such a method ignores network structure; that is, the relationships among the candidates and other objects in the heterogeneous information network.
We attempt to solve the problem of expert finding, particularly focusing on specific queries with narrow semantic meanings without downgrading the accuracy for general queries. We propose a novel framework based on query expansion. It includes two components: textual analysis, which provides evidence for expertise identification, and authority ranking, which ranks the candidates in the heterogeneous bibliographical network.
Locally-trained Embedding Learning via Concept Hierarchy.
Word embedding learning is proposed to project terms into a latent semantic space, such that terms with similar semantic meanings are close to each other in the latent vector space. The vector representations are also known as embeddings or distributed representations. The learned embeddings are based on the co-occurrence statistics derived from the whole corpus, which can be (loosely) interpreted as a low-rank approximation of the observation data in the corpus (cai2011graph, ; deerwester1990indexing, ; levy2014neural, ).
Nevertheless, information regarding some specific queries might be lost by this semantic matching method. A toy example is shown in Figure 1(a), where terms from different domains form different clusters, such as “information retrieval”, “natural language processing”, “data mining”, and “programming language”. Meanwhile, “information extraction” is close to both “natural language processing” and “named entity recognition”. Particularly for the task of expert finding, if we expand the query “information extraction” to “natural language processing”, there will be semantic drift.
In order to address the semantic drift discussed above, we propose to train a local embedding with a concept hierarchy as guidance, as shown in Figure 1(b). For the query “information extraction”, the cluster that “information extraction” belongs to can be identified as “natural language processing”. The local embeddings can then be learned from the documents that are relevant to “natural language processing”, as shown in Figure 1(c). Since the locally-trained embeddings only need to preserve information within the cluster of “natural language processing”, they have stronger representation power. Consequently, the local embeddings better capture the subtle semantic information, so that “information extraction” shares closer semantic meaning with “named entity recognition” than with “natural language processing”.
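To make the two steps concrete, the following is a minimal sketch of selecting a sub-corpus for local training guided by a toy concept hierarchy. The hierarchy, corpus, and helper names are illustrative assumptions, not the paper’s actual data structures.

```python
# Hypothetical concept hierarchy: child concept -> parent cluster.
CONCEPT_PARENT = {
    "information extraction": "natural language processing",
    "named entity recognition": "natural language processing",
    "natural language processing": "computer science",
}

def parent_concept(query):
    """Identify the cluster (parent concept) that the query belongs to."""
    return CONCEPT_PARENT.get(query)

def sub_corpus(corpus, concept, expansions):
    """Keep only documents relevant to the parent concept or its expansions."""
    keywords = {concept} | set(expansions)
    return [doc for doc in corpus if any(k in doc for k in keywords)]

corpus = [
    "named entity recognition with conditional random fields",
    "natural language processing for clinical text",
    "query optimization in relational databases",
]
parent = parent_concept("information extraction")
local = sub_corpus(corpus, parent, {"named entity recognition"})
```

Local embeddings would then be trained on `local` only, rather than on the full corpus.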
Ranking within Relevance Network.
Extensive online textual information is available from candidates’ activities, which serves as evidence for expertise identification. However, the final target of expert finding is to rank candidates, not textual information; there is a disparity between the two.
The document-based models aggregate the relevant documents associated with each candidate and rank the candidates accordingly. The importance of each document is approximated by a monotonic function of the number of citations, such as a logarithm function. Such an aggregation method is inaccurate and sensitive to the choice of the monotonic function. On the other hand, besides textual information, the interactions among candidates and other objects (e.g., other candidates, group discussions in online social communities, venues in academia) offer additional insights for estimating the users’ cognitive capabilities. The interactions among objects of different types naturally form a heterogeneous information network (sun2012mining, ; sun2009ranking, ). A bibliographical information network is a typical heterogeneous information network, which characterizes the academic publication behavior of researchers. In heterogeneous bibliographical networks, researchers have various activities, including publishing, collaborating, and attending venues. In Figure 2(a), the network schema of an example heterogeneous bibliographical network is depicted, with an illustration in Figure 2(b).
To close the gap between textual information analysis and candidate ranking, we propose a coupled random walk algorithm, including both inter-type and intra-type random walks, to estimate the authority of objects in the network and the rank order of candidates. More concretely, the ranking algorithm considers the relative importance of different edge types in the heterogeneous bibliographical network.
To summarize, we study the problem of expert finding in heterogeneous information networks. Specifically, we use the bibliographical network as a case study; the proposed framework can be straightforwardly extended to social networks and other types of networks. The proposed framework includes two phases. The first phase is locally-trained embedding learning with a concept hierarchy as guidance, based on which we obtain an expansion for the given query. The second is an authority ranking algorithm within the heterogeneous bibliographical network, which is retrieved and constructed based on the query expansion. Such a framework is particularly designed for specific queries. We name the new framework LE-expert, short for Locally-trained Embedding for Expert Finding. Our contributions are as follows:
We propose to learn locally-trained embeddings for query expansion with a given concept hierarchy as guidance for the problem of unsupervised expert finding in heterogeneous bibliographical networks.
We establish a new ranking algorithm, tailored for the task of expert finding, in heterogeneous bibliographical networks.
We conduct numerical experiments to corroborate the efficacy of our method.
2. Related Work
In general, there are two major approaches (balog2012expertise, ) to the problem of expert finding: one is profile-based (liu2005finding, ) and the other is document-based (balog2006formal, ; fang2007probabilistic, ) (also known as the candidate and topic models). In the profile-based models, each candidate is represented via a set of terms, and given a query, the candidates are ranked via ad-hoc retrieval models. In contrast, the document-based models first retrieve all the relevant documents of the query and then rank the candidates by aggregating the associated documents. Since the document-based models make use of the whole corpus, they are usually more effective than the profile-based ones (balog2012expertise, ; deng2008formal, ). Besides these two models, there are many other approaches that take advantage of additional information. For instance, Karimzadehgan et al. propose to solve the problem of expert finding by incorporating the organizational hierarchy (karimzadehgan2009enhancing, ). The problem of the vocabulary gap is addressed by query expansion with Normalized Google Distance (yang2014using, ). More recently, an unsupervised embedding learning method has been proposed, where the embeddings are learned based on the co-occurrence between candidates and terms (van2016unsupervised, ). However, these methods mainly focus on the textual information while the rich network structure information is ignored.
Regarding (heterogeneous) network structures, it has been proposed to rank the candidates within an online forum via a propagation-based approach (zhang2007expert, ). Besides, the problem is formalized as searching for reliable users and contents for the task of community-based query answering in a co-training fashion (bian2009learning, ). Regarding collaborative tagging recommendation, Noll et al. assess the expertise of users using a graph-based ranking method similar to the HITS algorithm (noll2009telling, ). Deng et al. propose a joint optimization framework to rank candidates based on the consistency implied by the network structure (deng2012modeling, ). Moreover, there are some other relevant studies, such as co-rank (zhou2007co, ), where authors and their publications are ranked based on a coupled random walk algorithm; NetClus (sun2009rankclus, ), which simultaneously ranks and clusters strongly-typed objects with mutual enhancement in a heterogeneous information network; and RankClass (ji2011ranking, ), which applies a similar philosophy to classification and ranking. Nevertheless, these works are either query independent or consider the query-document relatedness based on a global semantic mapping, which loses information for specific queries. Our method not only considers the network structure, but also captures query expansion for specific queries based on locally-trained embedding learning.
The idea of query expansion via local document analysis has been previously studied for information retrieval (xu1996query, ), combining global analysis and local feedback with a new term-weight ranking function. Recently, Diaz et al. proposed performing query expansion based on locally-trained embeddings for queries with ambiguous semantic meanings (diaz2016query, ). In contrast, our locally-trained embedding is designed for query expansion for specific queries, which is of particular importance for the task of expert finding, while theirs is for the ad-hoc retrieval task (diaz2016query, ). In addition, our locally-trained embeddings are learned with guidance from a concept hierarchy. The details will be discussed in Section 4.
3. Preliminaries
Before detailing our method, we first introduce heterogeneous bibliographical information networks, the document-based model, and word embedding learning.
3.1. Heterogeneous Bibliographical Networks
A heterogeneous bibliographical network is constructed from bibliographical data. Due to the heterogeneity of the object types, a heterogeneous bibliographical network is naturally a heterogeneous information network (sun2009rankclus, ). The formal definition of heterogeneous information networks is as follows.
Definition 3.1 (Heterogeneous Information Network).
Given an information network G = (V, E) with an object type mapping function φ: V → A and an edge type mapping function ψ: E → R, where A and R are the sets of object types and edge types, respectively, if the number of object types |A| > 1 or the number of edge types |R| > 1, then G is a heterogeneous information network.
DBLP is a public bibliographical dataset in the Computer Science domain. We further extract semantic phrases from the text data following the method proposed by Liu et al. (liu2016representing, ). Therefore, we use terms to refer to both words and phrases in the corpus. For each publication entry, DBLP provides detailed information about authors, terms, and venues. Figure 2(a) depicts the network schema and Figure 2(b) shows a sub-network for a user query. We define the set of publications as P, authors as A, terms as T, and venues as V, with |P|, |A|, |T|, and |V| denoting the set sizes accordingly.
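As an illustration of the network schema, the following sketch assembles a small heterogeneous bibliographical network from DBLP-style publication records. The record layout and field names are assumptions for illustration, not DBLP’s actual schema.

```python
def build_network(entries):
    """Collect typed objects and typed edges from publication records."""
    edges = {"writes": [], "published_in": [], "contains": []}
    authors, venues, terms = set(), set(), set()
    for paper_id, rec in entries.items():
        for a in rec["authors"]:
            authors.add(a)
            edges["writes"].append((a, paper_id))      # author -> paper edge
        venues.add(rec["venue"])
        edges["published_in"].append((paper_id, rec["venue"]))
        for t in rec["terms"]:
            terms.add(t)
            edges["contains"].append((paper_id, t))    # paper -> term edge
    return {"authors": authors, "venues": venues, "terms": terms, "edges": edges}

# Two toy publication entries.
entries = {
    "p1": {"authors": ["A. Smith", "B. Chen"], "venue": "KDD",
           "terms": ["data mining", "clustering"]},
    "p2": {"authors": ["B. Chen"], "venue": "ACL",
           "terms": ["information extraction"]},
}
net = build_network(entries)
```

Each edge list corresponds to one edge type in the schema of Figure 2(a); papers act as the hub connecting authors, venues, and terms.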
3.2. The Document-based Models
For completeness, we present probably the most popular method: the document-based models. The family of document-based models formalizes the problem as a retrieval task. Given a query q, the ranking score of a researcher candidate c can be calculated as

p(c|q) ∝ Σ_{d∈D} p(q|d) · p(c|d) · p(d),

where D is the document corpus, p(c|d) is the probability that the candidate c is relevant to the publication d, p(q|d) is the probability that the query q is relevant to the document d, and p(d) denotes the preference over d.
What remains is to estimate p(d), p(c|d), and p(q|d). Following the ideas of Deng et al. (deng2008formal, ), we estimate p(d) via p(d) ∝ ln(e + cite(d)), where cite(d) is the count of citations of d and e is the mathematical constant, which guarantees that the weight factor is no less than one. p(c|d) is generally estimated as 1/|A_d|, with A_d as the set of authors of publication d. Finally, p(q|d) is calculated based on the query generation retrieval method with Dirichlet prior smoothing (zhai2004study, ),

p(q|d) = Π_{t∈q} p(t|d),

where p(t|d) = (tf(t, d) + μ · p(t|C)) / (|d| + μ), with tf(t, d) as the term frequency of term t in d, |d| as the length of d, μ as the Dirichlet prior, and p(t|C) as the background language model of the text corpus C.
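The Dirichlet-smoothed query likelihood above can be sketched in a few lines. The corpus, background model, and the value of μ below are toy assumptions; the scoring itself follows the standard formulation of (zhai2004study, ).

```python
import math

def p_term(term, doc_tokens, background, mu=2000):
    """p(t|d) = (tf(t,d) + mu * p(t|C)) / (|d| + mu), Dirichlet smoothing."""
    tf = doc_tokens.count(term)
    # Small floor for unseen background terms to keep the log finite.
    return (tf + mu * background.get(term, 1e-9)) / (len(doc_tokens) + mu)

def log_p_query(query_tokens, doc_tokens, background, mu=2000):
    """log p(q|d) = sum of log p(t|d) over the query terms."""
    return sum(math.log(p_term(t, doc_tokens, background, mu))
               for t in query_tokens)

# Toy background language model and two short documents.
background = {"information": 0.01, "extraction": 0.005, "database": 0.02}
d1 = ["information", "extraction", "with", "patterns"]
d2 = ["database", "query", "optimization"]
q = ["information", "extraction"]
s1 = log_p_query(q, d1, background)
s2 = log_p_query(q, d2, background)
```

As expected, the document that actually contains the query terms scores higher.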
3.3. Word Embedding Learning
Word embedding learning (mikolov2013distributed, ; pennington2014glove, ) represents the terms of a corpus in a low-dimensional latent semantic space, where each term is represented by a low-dimensional vector, called an embedding or distributed representation. The semantic information of each term is preserved such that terms with similar semantic meanings are close to each other in the Euclidean space. There are many off-the-shelf embedding learning algorithms. We adopt word2vec (mikolov2013distributed, ) to learn the embeddings; other embedding methods, such as Latent Semantic Indexing and GloVe (pennington2014glove, ), can also be applied. In word2vec, for a pair of words that co-occur in a sliding window, one term is denoted as the target w_t and the other as the context w_c. Based on the skip-gram model, the conditional probability of observing w_c given w_t is defined using the softmax function

p(w_c | w_t) = exp(v'_{w_c} · v_{w_t}) / Σ_{w∈W} exp(v'_w · v_{w_t}),   (5)

where v_{w_t} and v'_{w_c} ∈ R^m are the embeddings for w_t and w_c, with m as the dimension of the embedding vectors. In (5), since the denominator sums over all the terms in the vocabulary W, it is computationally intractable. Consequently, negative sampling is proposed (mikolov2013distributed, ). For the term pair (w_t, w_c), instead of (5), the following objective is optimized,

log σ(v'_{w_c} · v_{w_t}) + Σ_{i=1}^{k} E_{w_i ∼ P_n(w)} [log σ(−v'_{w_i} · v_{w_t})],   (6)

where σ(x) = 1/(1 + e^{−x}) is the sigmoid function, v'_{w_i} is the embedding of the noise term w_i, P_n(w) is the noise term distribution, and k is the negative sampling parameter. Due to the space limit, one may refer to the original paper by Mikolov et al. for technical details (mikolov2013distributed, ).
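A toy evaluation of the negative-sampling objective in (6) for a single (target, context) pair can clarify its behavior; the vectors and the sampled noise terms below are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ns_objective(v_target, v_context, noise_vectors):
    """log sigma(v'_c . v_t) + sum_i log sigma(-v'_i . v_t)."""
    obj = math.log(sigmoid(dot(v_context, v_target)))
    obj += sum(math.log(sigmoid(-dot(v_n, v_target))) for v_n in noise_vectors)
    return obj

v_t = [0.5, 0.2]                      # target embedding
v_c = [0.4, 0.1]                      # a true context, roughly aligned with v_t
v_noise = [[-0.3, 0.6], [0.1, -0.2]]  # k = 2 sampled noise terms
score = ns_objective(v_t, v_c, v_noise)
```

Maximizing this objective pushes the true context toward the target and the noise terms away from it, which is why an aligned context yields a higher score than an anti-aligned one.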
4. Local Embedding via Concept Hierarchy
Word embedding learning is proposed for global embedding learning: an embedding vector is learned for each term with respect to the whole corpus. According to Levy et al. (levy2014neural, ), word embedding learning with negative sampling in (6) can be loosely interpreted as an implicit matrix factorization problem, where the shifted positive Pointwise Mutual Information (PMI) matrix is approximated by a low-rank matrix with rank equal to the dimension of the vector space. However, such an approximation may lead to coarse representations of specific terms. The term “information extraction” is not only close to “named entity recognition” but also to “text mining” and “natural language processing”. Suppose that “natural language processing” were used as an expansion of “information extraction”; there would be a semantic drift. Instead of obtaining experts on “information extraction” only, we may also find experts on “natural language processing”. However, not all of the experts on “natural language processing” work on “information extraction”.
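The shifted positive PMI matrix that negative-sampling word2vec implicitly factorizes (levy2014neural, ) can be computed directly from co-occurrence counts. The tiny corpus below is illustrative; every pair of terms within a document is treated as a co-occurrence, a simplification of the sliding-window scheme.

```python
import math
from collections import Counter
from itertools import combinations

def sppmi(docs, k=1):
    """SPPMI(w, c) = max(PMI(w, c) - log k, 0), with k the negative-sampling parameter."""
    pair_counts, word_counts = Counter(), Counter()
    total_pairs = 0
    for doc in docs:
        for w, c in combinations(doc, 2):
            pair_counts[(w, c)] += 1   # count both ordered directions
            pair_counts[(c, w)] += 1
            word_counts[w] += 1        # marginal: pairs where the word is first
            word_counts[c] += 1
            total_pairs += 2
    mat = {}
    for (w, c), n in pair_counts.items():
        pmi = math.log(n * total_pairs / (word_counts[w] * word_counts[c]))
        mat[(w, c)] = max(pmi - math.log(k), 0.0)
    return mat

docs = [["ie", "ner"], ["ie", "ner"], ["ie", "nlp"], ["db", "sql"]]
m = sppmi(docs, k=1)
```

Only observed pairs get entries; a low-rank factorization of this sparse matrix yields embeddings, and restricting `docs` to a sub-corpus yields the locally-trained variant.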
|Global Embedding||LE wo/ CH (diaz2016query, )||LE w/ CH|
|text-mining||knowledge-based||SystemT¹|
¹ IBM SystemT is a declarative information extraction system.
4.1. Concept Hierarchy
In order to address the semantic drift, we relax the global low-rank assumption and propose to represent the terms in the corpus using locally-trained embeddings. In particular, we make the following assumption.
The shifted positive PMI matrix is low-rank for a sub-corpus that is relevant to the query.
The sub-corpus is constructed with guidance from a concept hierarchy (Figure 1(b)). In other words, instead of learning embeddings to preserve the information in the whole corpus, we only preserve information in the sub-corpus. The sub-corpus corresponds to the cluster that “information extraction” belongs to, which is “natural language processing” according to the concept hierarchy in Figure 1(b). Therefore, the sub-corpus comprises publication documents constrained on “natural language processing”.
Why use a concept hierarchy as guidance? For the task of expert finding, given the query “information extraction”, the (implicit) background information is “natural language processing”. By taking advantage of the concept hierarchy, we can identify this background information, as depicted in Figure 1(b). Alternatively, without a concept hierarchy, as proposed by Diaz et al. (diaz2016query, ), the sub-corpus is constructed by retrieving all the documents relevant to “information extraction”. The results obtained following the idea of Diaz et al. (diaz2016query, ) are shown in the second column of Table 1. However, the top-ranked terms are random and irrelevant to “information extraction”. This is because when learning term embeddings on a sub-corpus constrained on “information extraction”, the term “information extraction” becomes the background: it appears in almost all the documents and co-occurs with (almost) all words in the corpus, especially for short documents. In the bibliographical data that we use, around 76% of the document entries contain only titles. Therefore, “information extraction” behaves like a stop word. Meanwhile, if the sub-corpus is constrained on “natural language processing”, the term “natural language processing” becomes the background and is distant from “information extraction”, as shown in the third column of Table 1.
4.2. Locally-trained Embedding Learning
How to use concept hierarchy as guidance for local embedding learning? For brevity, we first consider the case where there is only one term in each query, corresponding to one concept in the concept hierarchy. Also, we assume that terms in the query can be trivially mapped to the concept hierarchy. For queries with more than one concept, we train local embeddings one by one. For each concept, we use the learned local embeddings to expand the concept accordingly.
For a given query q in the concept hierarchy, we denote the path from the root to q as (c_0, c_1, …, c_H), where H is the level of the concept hierarchy that q lies at and c_0 corresponds to the root. We use W_h for h = 0, …, H to denote the learned embeddings for terms at level h. The idea of local embedding learning is to find the nearest neighbors (i.e., expansions) of q based on the term embeddings learned constrained on its parent concept. Therefore, the nearest neighbors of q can be found based on the embeddings learned on a sub-corpus constrained on c_{H−1}. In the following, we use “information extraction” as a running example.
For the (sub-)corpus constrained on the root concept c_0, it is straightforward: we use the whole corpus to train the terms’ embeddings (i.e., global embeddings). For the corpus constrained on concept c_h with h ≥ 1, we first search for the nearest neighbors of c_h, which serve as expansions to close the vocabulary gap while constructing the sub-corpus. For the query “information extraction”, we have c_{H−1} = “natural language processing”.
As we do not have features for each concept (and term), we use the embeddings learned via a sub-corpus constrained on concept c_{h−1} as features. Given W_{h−1} as the embeddings learned constrained on c_{h−1}, we use cosine similarity to measure the similarity between terms. The top terms under cosine similarity are denoted as E(c_h), the expansion of concept c_h. Therefore, a sub-corpus constrained on c_h can be extracted based on E(c_h). In other words, we use the global embeddings (W_0) to first find the query expansions of “natural language processing”, such as “natural language processing”, “nlp”, “natural language understanding”, and “language processing”. We interpolate such semantic similarity into the language model with an interpolation parameter, obtaining a relevance score p(q|d), where the query q contains only one concept, c_h. In order to train local embeddings on the sub-corpus constrained on c_h, we sample each document with probability proportional to p(q|d); equal weights recover uniform sampling. While applying word2vec for embedding learning, in order to estimate the empirical distribution of terms in the sub-corpus constrained on c_h, the sampling weight of each document should be considered. The recursive embedding learning framework is detailed in Algorithm 1.
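The two steps of this subsection can be sketched as follows: (i) expand a concept with its nearest neighbors under cosine similarity, and (ii) sample documents for local training with probability proportional to their relevance. The embedding vectors and relevance scores below are toy assumptions.

```python
import math
import random

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def expand(concept, embeddings, top_k=2):
    """Top-k nearest neighbors of the concept, excluding the concept itself."""
    sims = [(w, cosine(embeddings[concept], vec))
            for w, vec in embeddings.items() if w != concept]
    sims.sort(key=lambda x: -x[1])
    return [w for w, _ in sims[:top_k]]

# (i) toy embedding space: NLP-related terms cluster together.
embeddings = {
    "natural language processing": [0.9, 0.1],
    "nlp": [0.85, 0.15],
    "language processing": [0.8, 0.2],
    "database systems": [0.1, 0.9],
}
exp = expand("natural language processing", embeddings, top_k=2)

# (ii) sample documents proportionally to their relevance score p(q|d).
def sample_docs(doc_scores, n, rng):
    docs, weights = zip(*doc_scores.items())
    return rng.choices(docs, weights=weights, k=n)

sample = sample_docs({"d1": 0.8, "d2": 0.1, "d3": 0.1}, n=5,
                     rng=random.Random(0))
```

Word2vec would then be trained on the sampled documents, so that the empirical term distribution reflects the sampling weights.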
5. Expert Ranking in Relevance Network
In order to rank researcher candidates for each query, we rely on two key insights: (i) A candidate may have papers on many topics; for a given query, only the relevant papers serve as textual evidence of expertise. (ii) Citation counts may suffer from a time-delay factor, and papers published in a higher-ranked venue are more likely to be important. Therefore, venues play an important role in ranking.
5.1. Relevance Network Construction
For a given query q, we first retrieve all the relevant documents, the set of which is denoted as D_q; a document belongs to D_q if it contains at least one expansion term and is excluded otherwise. In other words, we select all the papers that contain, for each term in the query, at least one of its relevant expansion terms.
Based on D_q, a relevance sub-network can be extracted from the heterogeneous bibliographical network by extracting D_q and the associated authors and venues. LE-expert ranks the candidates within this relevance sub-network.
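The extraction step can be sketched as follows: keep every paper containing at least one expansion term, then pull in its authors and venue. The data layout is hypothetical.

```python
def extract_subnetwork(papers, expansion_terms):
    """Relevance sub-network: relevant papers plus their authors and venues."""
    relevant = {pid: rec for pid, rec in papers.items()
                if any(t in rec["terms"] for t in expansion_terms)}
    authors = {a for rec in relevant.values() for a in rec["authors"]}
    venues = {rec["venue"] for rec in relevant.values()}
    return relevant, authors, venues

papers = {
    "p1": {"terms": {"information extraction"}, "authors": ["A"], "venue": "ACL"},
    "p2": {"terms": {"query optimization"}, "authors": ["B"], "venue": "SIGMOD"},
    "p3": {"terms": {"named entity recognition"}, "authors": ["A", "C"],
           "venue": "EMNLP"},
}
rel, authors, venues = extract_subnetwork(
    papers, {"information extraction", "named entity recognition"})
```

Only the objects reachable from the relevant papers survive, which keeps the subsequent random walk confined to the query’s topical neighborhood.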
5.2. Ranking in Relevance Network
To rank candidates for each query, we take advantage of the network structure and propose a ranking algorithm to estimate the authority of objects in the sub-network based on a coupled random walk in the relevance sub-network. We first present the ranking method in a general framework, which can be generalized for other heterogeneous information networks.
Suppose there are m types of objects in the heterogeneous information network and the set of type-i objects is denoted as X_i. The network is represented by a set of relation matrices {R_ij}, where R_ij encodes the edges between objects of type i and type j. For each R_ij, we define a diagonal matrix D_ij such that the diagonal element at (k, k) of D_ij is the sum of the k-th row of R_ij. Therefore, the transition matrix of R_ij is defined as S_ij = D_ij^{-1} R_ij. And the ranking score vector r_i of objects of type i can be updated iteratively:

r_i^{(t+1)} ∝ Σ_j λ_ij · S_ij r_j^{(t)},

where t is the iteration step and r_i^{(t+1)} is normalized after each update. The relative importance of neighbors of different types is controlled by λ_ij.
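A minimal sketch of such a coupled iterative update on a tiny two-type network (papers and authors) is shown below. The matrices, the single mixing weight, and the fixed iteration count are simplifications for illustration; authority flows along the transposed transition matrices so that cited papers gain score from citing papers.

```python
def row_normalize(mat):
    out = []
    for row in mat:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out

def matvec_T(mat, vec):
    """(M^T v)_j = sum_i M[i][j] * v[i]: distribute scores along edges."""
    return [sum(mat[i][j] * vec[i] for i in range(len(mat)))
            for j in range(len(mat[0]))]

def normalize(vec):
    s = sum(vec)
    return [x / s for x in vec]

# R_AP[i][j] = 1 if author i wrote paper j; R_PP[i][j] = 1 if paper i cites paper j.
R_AP = [[1, 1, 0], [0, 0, 1]]
R_PA = [[1, 0], [1, 0], [0, 1]]
R_PP = [[0, 0, 0], [1, 0, 0], [1, 0, 0]]   # papers 2 and 3 cite paper 1

S_AP, S_PA, S_PP = row_normalize(R_AP), row_normalize(R_PA), row_normalize(R_PP)
r_paper = [1 / 3] * 3
r_author = [1 / 2] * 2
lam = 0.5                                  # weight of citation vs. authorship edges

for _ in range(50):
    cite = matvec_T(S_PP, r_paper)         # papers gain from papers citing them
    auth = matvec_T(S_AP, r_author)        # papers gain from their authors
    r_paper = normalize([lam * c + (1 - lam) * a for c, a in zip(cite, auth)])
    r_author = normalize(matvec_T(S_PA, r_paper))
```

Author 1 writes the heavily cited paper 1, so the walk concentrates authority on that author.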
Regarding the task of expert finding in heterogeneous bibliographical networks, the random walk is designed under the following assumptions.²
² Since terms are used to construct the relevance network and do not reflect authority, we do not consider terms while ranking the candidates.
High-quality and relevant papers will be frequently cited by many other relevant papers;
Relevant highly-ranked experts will publish many high-quality and relevant papers, and vice versa.
Relevant and highly-ranked conferences attract many high-quality and relevant papers, and vice versa.
Therefore, the relation types for each object type are as follows. Paper: (i) Citation relations, where R_PP(i, j) = 1 if paper i cites paper j; (ii) Write relations, where R_PA(i, j) = 1 if author j writes paper i; (iii) Publish relations, where R_PV(i, j) = 1 if paper i is published in venue j. Author: (i) Coauthor relations R_AA; (ii) Write relations R_AP = R_PA^T. Venue: (i) Citation relations R_VV; (ii) Publish relations R_VP = R_PV^T.
The underlying philosophy of the ranking module is similar to NetClus (sun2009rankclus, ) and RankClass (ji2011ranking, ). However, NetClus and RankClass are primarily designed for clustering and classification in the whole heterogeneous information network, respectively; while LE-expert is designed for authority ranking within a relevance sub-network. In addition, NetClus can only be applied to star-schema heterogeneous networks while LE-expert is independent of the network schema. Moreover, RankClass is a regularization framework for label propagation whereas LE-expert is based on random walks.
6. Experimental Results
We conduct various experiments to study the effectiveness of the proposed framework in expert finding for both specific and general queries.
6.1. Experimental Setup
Data. To evaluate the proposed framework, we conduct numerical experiments and case studies on the dataset of DBLP. In the DBLP dataset, there are 2,244,018 papers, 1,274,360 authors, 8,882 venues, and 1,812,277 words and phrases. Among all the papers, 529,498 papers (24%) have abstract information. The labelled dataset is from Deng et al. (deng2012modeling, ), which contains 20 queries in total, including both general and specific ones. Details on the queries and the number of experts for each query can be found in (deng2012modeling, ).
Evaluation Measures. To evaluate the task, we employ several popular information retrieval measures (buttcher2016information, ), including Precision at rank k (P@k), Mean Average Precision (MAP), Normalized Discounted Cumulative Gain at rank k (NDCG@k), and bpref (buckley2004retrieval, ). P@k measures the percentage of relevant experts in the top k of the retrieved candidate list, estimated as P@k = (1/k) Σ_{i=1}^{k} rel_i, where rel_i = 1 if the i-th retrieved candidate is relevant to the given query and rel_i = 0 otherwise. Suppose there are R relevant experts; Average Precision is defined as AP = (1/R) Σ_i P@i · rel_i, and MAP is the AP averaged over all queries. Since the relevance labels are binary, NDCG@k is computed with binary gains discounted by rank. Also, we consider bpref, a summation-based measure of the number of relevant candidates ranked before irrelevant ones.
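The measures above can be implemented directly for the binary-relevance setting; the ranked list below is a toy example.

```python
import math

def precision_at_k(ranked_rel, k):
    """Fraction of relevant candidates among the top k (ranked_rel is 0/1)."""
    return sum(ranked_rel[:k]) / k

def average_precision(ranked_rel, num_relevant):
    """Mean of P@i over the ranks i at which a relevant candidate appears."""
    ap = 0.0
    for i, rel in enumerate(ranked_rel, start=1):
        if rel:
            ap += precision_at_k(ranked_rel, i)
    return ap / num_relevant

def ndcg_at_k(ranked_rel, k, num_relevant):
    """Binary-gain NDCG: DCG over the ideal DCG."""
    dcg = sum(rel / math.log2(i + 1)
              for i, rel in enumerate(ranked_rel[:k], start=1))
    ideal = sum(1 / math.log2(i + 1)
                for i in range(1, min(k, num_relevant) + 1))
    return dcg / ideal

ranked = [1, 0, 1, 1, 0]   # relevance of the top-5 retrieved candidates
R = 3                      # total number of relevant experts for this query

p5 = precision_at_k(ranked, 5)
ap = average_precision(ranked, R)
```

MAP is then the mean of `ap` over all queries; bpref additionally needs the ranks of judged irrelevant candidates, so it is omitted from this sketch.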
Baselines. We compare LE-expert with the following baselines:
Balog. It is a classical document-based model for expert finding (balog2006formal, ).
NMF. We apply nonnegative matrix factorization (cai2011graph, ) to the author-term co-occurrence matrix. The ranking of authors is based on the inner product of the corresponding rows and columns of authors and queries.
LSI. We apply latent semantic indexing to identify the similarity of the authors and the queries.
Corank (zhou2007co, ). Co-ranking cannot be directly applied for expert finding since it is query independent. Therefore, we first retrieve relevant documents and then apply co-ranking for each query.
Embed (van2016unsupervised, ). This is a global embedding algorithm designed for the task of unsupervised expert finding without considering the network structure.
JointHyp (deng2012modeling, ). JointHyp is a regularization framework for expert finding in heterogeneous information networks. Specifically, information is propagated through the network based on consistency in the network.
Exact. The relevance sub-network is extracted based on exact term matching.
RankClass. The relevance sub-network is extracted based on query expansion, and the candidates are ranked by RankClass with only one class.
For a fair comparison, we use the same leave-one-out cross-validation dataset and report the best performance of each model. The parameters of LE-expert in (7) and (8) are set as follows. We gradually reduce the dimension of the local vector space with the hierarchy level. For intermediate concepts, the size of the query expansion is fixed, and the final expansion size for each query is set by cross-validation. Recall that the expansion set is used for relevance sub-network construction. It is worth noting that general queries are likely to have more expansions and specific ones fewer.
6.2. Experimental Results
Overall Results Analysis. The experimental results of the different methods are summarized in Table 2. Compared with Balog, NMF, LSI, and Embed, which only utilize the textual information and the overall number of citations as the prior of each document, the methods that take advantage of the network information, including Corank, JointHyp, Exact, RankClass, and LE-expert, achieve significantly better results on all the evaluation measures. This result agrees with our argument that the task of expert finding is different from document retrieval and that the network structure plays an important role. Moreover, we notice that the precision of Embed (van2016unsupervised, ) is even worse than that of classical embedding methods, such as NMF and LSI. This can be partially explained by the fact that, for a candidate and a query, the ranking score can be (loosely) interpreted as scaling with their pointwise mutual information (levy2014neural, ), which favors candidates with more focused expertise. More specifically, a candidate with only one paper on the query topic is likely to be ranked topmost.
Now we consider the methods taking advantage of the heterogeneous network structure. Comparing Corank with Exact, we see that Exact performs slightly better in terms of Precision, NDCG, and MAP. This is because Exact additionally considers the venue information for ranking. Moreover, LE-expert significantly outperforms Exact on all the evaluation measures, which serves as evidence that the proposed query expansion method can close the vocabulary gap. Unlike the global embedding methods (NMF and LSI), LE-expert will not expand specific queries to more general ones, thanks to the locally-trained embeddings, and it achieves better precision and NDCG results. JointHyp (deng2012modeling, ) is also designed for heterogeneous bibliographical information networks; its main idea is to propagate the relevance of documents for each query to the candidates through the strongly-typed edges in the network. However, such a method gives inaccurate estimates for documents regarding specific queries, since the relevance of documents is estimated via global embeddings. Our model is based on coupled random walks, where the weights for all documents are the same. The prediction accuracy of LE-expert is better than that of JointHyp, while JointHyp slightly outperforms ours regarding the overall ranking (bpref). However, it is worth noting that for the task of expert finding, the top-ranked results are more important. We also compare LE-expert against RankClass, which employs a similar ranking philosophy: RankClass is a regularization framework, while LE-expert couples inter-type and intra-type random walks. LE-expert performs better than RankClass on precision and NDCG.
Hyperparameters. As shown in (9), the hyperparameters (the relative importance of different types of edges) play an important role in the final ranking of candidates. The sensitivity of the ranking results to the varying weights is depicted in Figure 3; for simplicity, the weights within each edge type are tied, and except for the weight of interest, all the others are fixed. The y-axis corresponds to Precision@10. Firstly, we observe that the ranking results are more sensitive to the author-related weights than to the others. This can be explained by the fact that the ranking is based on authors. The second observation is that the precision first increases and then decreases as a weight parameter increases. For one edge type, if the corresponding weight goes to zero, it is equivalent to removing that edge type. This observation indicates that all the edge types contribute to the final ranking. Our third observation concerns the weight that balances the relative importance between coauthor relations and write relations: when the other weights go to zero, the performance remains stable, but when this one goes to zero, the performance drops significantly. In that case, the ranking order of candidates is dominated by coauthor relations, and the absence of authority information from papers leads to fallacious ranking results. Meanwhile, when this weight goes to infinity, the ranking model reduces to the document retrieval model (with the relevance of each document being equal), since the other types of edges do not contribute to the authority scores of candidates.
| boosting | | support vector machine | |
|---|---|---|---|
| Robert E. Schapire | Robert E. Schapire | Qi Wu | Bernhard Schölkopf |
| Yoav Freund | Yoav Freund | Isabelle Guyon | Vladimir Vapnik |
| Ron Kohavi | Leo Breiman | Jason Weston | C. J. C. Burges |
| Thomas G. Dietterich | Yoram Singer | Vladimir Vapnik | Thorsten Joachims |
| Yoram Singer | David P. Helmbold | Bao-Kiang Lu | Chih-Jen Lin |
| information extraction | | ontology alignment | |
|---|---|---|---|
| Ralph Grishman | Dayne Freitag | Jerome Euzenat | W. M. Schorlemmer |
| Andrew McCallum | Ralph Grishman | Patrick Lambrix | Yannis Kalfoglou |
| Ellen Riloff | Andrew McCallum | Jason J. Jung | Anhai Doan |
| Oren Etzioni | Nicholas Kushmerick | He Tan | Jerome Euzenat |
| Dayne Freitag | Stephen Soderland | Marc Ehrig | Alon Y. Halevy |
Case Study. Some concrete case studies of candidate ranking are shown in Table 3. For general queries, including “boosting” and “support vector machine”, the query expansions are based on the global embeddings, and LE-expert achieves better precision. Specifically, for “support vector machine”, “Bernhard Schölkopf”, who has made substantial contributions to support vector machines, and “Vladimir Vapnik”, a co-inventor of the support vector machine, rank at the top. This demonstrates the power of the proposed framework on general queries. For specific queries, we consider “information extraction” (a child of “natural language processing”) and “ontology alignment” (a child of “ontology”). The high precision on specific queries indicates that the locally-trained embedding learning method provides accurate and relatively complete expansions for these queries. Moreover, the ranking algorithm contributes to the authority ranking of candidates: taking “information extraction” as an example, “Dayne Freitag”, whose research focuses on machine learning for information extraction, ranks higher than the more senior researchers “Ralph Grishman” and “Andrew McCallum”, given that all of them work on information extraction.
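The locally-trained expansion behind these case studies can be sketched in miniature: embeddings are trained only on a topically local subcorpus (e.g., the documents under the matching concept-hierarchy node), and the query is expanded with its nearest neighbors in that local space. This is a hedged illustration using a PPMI + SVD word embedding on hypothetical toy documents, not the paper's exact training procedure.

```python
import numpy as np
from itertools import combinations

# Hypothetical local subcorpus for the query "information extraction";
# a real system would use the documents under the matching hierarchy node.
local_docs = [
    ["information", "extraction", "named", "entity"],
    ["extraction", "relation", "named", "entity"],
    ["information", "extraction", "relation", "pattern"],
    ["named", "entity", "pattern", "extraction"],
]

def train_embeddings(docs, dim=3):
    """Train word embeddings on the given (local) corpus via PPMI + SVD."""
    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for d in docs:  # document-level co-occurrence counts
        for a, b in combinations(sorted(set(d)), 2):
            C[idx[a], idx[b]] += 1.0
            C[idx[b], idx[a]] += 1.0
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C * total) / (row * row.T))
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)
    U, S, _ = np.linalg.svd(ppmi)
    vecs = U[:, :dim] * S[:dim]  # truncated spectral embedding
    return {w: vecs[idx[w]] for w in vocab}

def expand_query(term, vectors, k=2):
    """Expand `term` with its k nearest neighbors by cosine similarity."""
    q = vectors[term]
    def cos(v):
        return float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q) + 1e-12))
    scored = [(cos(v), w) for w, v in vectors.items() if w != term]
    return [w for _, w in sorted(scored, reverse=True)[:k]]

vectors = train_embeddings(local_docs)
expansion = expand_query("extraction", vectors)
```

Because the co-occurrence statistics come only from the local subcorpus, the neighbors of a specific term stay within its topic, which is the intuition behind not drifting toward more general queries.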
7. Conclusions and Future Work
In this paper, we study the problem of expert finding in heterogeneous bibliographical information networks based on a query expansion approach. First, we propose to perform query expansion via locally-trained embedding learning, applied recursively with a concept hierarchy as guidance. Second, we introduce a ranking algorithm on a relevance sub-network that estimates the expertise of the candidates by coupling inter-type and intra-type random walks. Experimental results on a large-scale heterogeneous bibliographical information network corroborate the effectiveness of the proposed LE-expert.
The proposed framework is general and can be applied to other tasks, such as question answering in online communities or recruiting experts for open problem solving. Moreover, the locally-trained embedding learning with a concept hierarchy as guidance is of independent interest and may be applied to other tasks, such as product recommendation given a product hierarchy. In addition, since our framework requires a concept hierarchy as input, as future work we plan to consider the more challenging scenario where no concept hierarchy is available.
-  K. Balog, L. Azzopardi, and M. De Rijke. Formal models for expert finding in enterprise corpora. In SIGIR, pages 43–50. ACM, 2006.
-  K. Balog, Y. Fang, M. de Rijke, P. Serdyukov, and L. Si. Expertise retrieval. Foundations and Trends in Information Retrieval, 6(2–3):127–256, 2012.
-  J. Bian, Y. Liu, D. Zhou, E. Agichtein, and H. Zha. Learning to recognize reliable users and content in social media with coupled mutual reinforcement. In WWW, 2009.
-  C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. In SIGIR, pages 25–32. ACM, 2004.
-  S. Büttcher, C. L. Clarke, and G. V. Cormack. Information retrieval: Implementing and evaluating search engines. MIT Press, 2016.
-  D. Cai, X. He, J. Han, and T. S. Huang. Graph regularized nonnegative matrix factorization for data representation. PAMI, 33(8):1548–1560, 2011.
-  S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. JASIST, 41(6):391, 1990.
-  H. Deng, J. Han, M. R. Lyu, and I. King. Modeling and exploiting heterogeneous bibliographic networks for expertise ranking. In JCDL, pages 71–80. ACM, 2012.
-  H. Deng, I. King, and M. R. Lyu. Formal models for expert finding on DBLP bibliography data. In ICDM, pages 163–172. IEEE, 2008.
-  F. Diaz, B. Mitra, and N. Craswell. Query expansion with locally-trained word embeddings. arXiv preprint arXiv:1605.07891, 2016.
-  H. Fang and C. Zhai. Probabilistic models for expert finding. In ECIR, 2007.
-  M. Hertzum and A. M. Pejtersen. The information-seeking practices of engineers: searching for documents as well as for people. Information Processing & Management, 2000.
-  M. Ji, J. Han, and M. Danilevsky. Ranking-based classification of heterogeneous information networks. In SIGKDD, pages 1298–1306. ACM, 2011.
-  M. Karimzadehgan, R. W. White, and M. Richardson. Enhancing expert finding using organizational hierarchies. In ECIR, pages 177–188. Springer, 2009.
-  O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In NIPS, pages 2177–2185, 2014.
-  J. Liu, X. Ren, J. Shang, T. Cassidy, C. R. Voss, and J. Han. Representing documents via latent keyphrase inference. In WWW, 2016.
-  X. Liu, W. B. Croft, and M. Koll. Finding experts in community-based question-answering services. In CIKM, pages 315–316. ACM, 2005.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
-  M. G. Noll, C.-m. Au Yeung, N. Gibbins, C. Meinel, and N. Shadbolt. Telling experts from spammers: expertise ranking in folksonomies. In SIGIR, pages 612–619. ACM, 2009.
-  J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
-  I. Soboroff, A. P. de Vries, and N. Craswell. Overview of the TREC 2006 enterprise track. In TREC, 2006.
-  Y. Sun and J. Han. Mining heterogeneous information networks: principles and methodologies. Synthesis Lectures on Data Mining and Knowledge Discovery, 3(2):1–159, 2012.
-  Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu. Rankclus: integrating clustering with ranking for heterogeneous information network analysis. In EDBT, pages 565–576. ACM, 2009.
-  Y. Sun, Y. Yu, and J. Han. Ranking-based clustering of heterogeneous information networks with star network schema. In SIGKDD, pages 797–806. ACM, 2009.
-  C. Van Gysel, M. de Rijke, and M. Worring. Unsupervised, efficient and semantic expertise retrieval. In WWW, 2016.
-  G. A. Wang, J. Jiao, A. S. Abrahams, W. Fan, and Z. Zhang. Expertrank: A topic-aware expert finding algorithm for online knowledge communities. Decision Support Systems, 2013.
-  J. Xu and W. B. Croft. Query expansion using local and global document analysis. In SIGIR, pages 4–11. ACM, 1996.
-  K.-H. Yang, Y.-L. Lin, and C.-T. Chuang. Using google distance for query expansion in expert finding. In ICDIM, pages 104–109. IEEE, 2014.
-  C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS), 22(2):179–214, 2004.
-  J. Zhang, J. Tang, and J. Li. Expert finding in a social network. In International Conference on Database Systems for Advanced Applications, pages 1066–1069. Springer, 2007.
-  D. Zhou, S. A. Orshanskiy, H. Zha, and C. L. Giles. Co-ranking authors and documents in a heterogeneous network. In ICDM, 2007.