Impact of the Query Set on the Evaluation of Expert Finding Systems

06/28/2018, by Robin Brochier et al. (Peerus)

Expertise is a loosely defined concept that is hard to formalize. Much research has focused on designing efficient algorithms for expert finding in large databases in various application domains. The evaluation of such recommender systems relies most of the time on human-annotated sets of experts associated with topics. The evaluation protocol consists in using the names or short descriptions of these topics as raw queries in order to rank the available set of candidates. Several measures taken from the field of information retrieval are then applied to rate the rankings of candidates against the ground truth set of experts. In this paper, we apply this topic-query evaluation methodology with the AMiner data and explore a new document-query methodology that evaluates expert retrieval from a set of queries sampled directly from the experts' documents. Specifically, we describe two datasets extracted from AMiner and three baseline algorithms from the literature based on several document representations, and we provide experimental results showing that using a wide range of more realistic queries yields different evaluation results from the usual topic-queries.







1 Introduction

It is common to consider expertise as implicit knowledge about a domain that someone carries and shares in different manners. Expertise retrieval aims at identifying this knowledge through explicit artifacts such as communications, actions or interactions between people. When someone calls for an expert, she expects to find a candidate able to understand a specific query. Whereas most evaluations for expertise retrieval consist in directly querying the names or descriptions of the ground truth topics of a given dataset, we claim that these queries are of little interest for a real case scenario since:

  • the textual content of the topic names is very limited in terms of language. Using richer (hence noisier) descriptions might better test the robustness of the evaluated algorithms. For example, it is better to query a retrieval algorithm multiple times with several texts relevant to the field of “data mining” than only once with the name of the field itself. In real case scenarios, users have a wide range of behaviors and seldom use the same queries when looking for the same thing;

  • no one really looks for experts in such broad subjects. Most of the time, someone looks for an expert with a very specific application in mind. Indeed, if a recruiter from a company is looking for a researcher to work on a specific subject, it is more likely that she will use the detailed description of the project rather than a generic job title to find the right person.

In this paper, we first provide in Section 3 a formal definition of the expert finding task applied to data extracted from AMiner. In particular, we describe two protocols: the topic-query evaluation and the document-query evaluation. We then describe in Section 4 three baseline algorithms from the literature that we reimplemented and tested using several document representations. Finally, we show and analyze in Section 5 the results of our experiments, demonstrating the impact of the type of query on the behavior of the algorithms and document representations.

Precisely, our contribution is fourfold:

  1. we propose two different procedures for generating queries and study their impact on the evaluation results

  2. we describe two ways of using AMiner’s data for expert finding and detail the preprocessing needed

  3. we reimplement and evaluate 3 algorithms from the literature based on several document representations

  4. the corresponding Python code is made publicly available, which makes it easy to reproduce the experiments or even extend the proposed pipeline.

2 Related Work

The automation of expert finding appeared as a research field along with the creation of large databases, when libraries and the communication tools of large companies started to be digitized. P@noptic Expert [craswell2001p] is one of the first published works on expertise retrieval. The proposed model transforms the expert finding task into a text similarity task by building a meta-document for each candidate, aggregating all documents where the name of this candidate appears. In 2005, the research around expert finding received a boost with the TREC-2005 Enterprise Track, Expert search task. They provided a dataset extracted from the World Wide Web Consortium (W3C). Moreover, they shared an evaluation toolkit to allow researchers to compare their algorithms. As a result, a formal definition of the problem emerged [craswell2005overview]. Following the generative document model of Balog et al. [balog2006formal], we denote by q a query, by d a document and by c a candidate. The expert finding task consists in estimating the probability p(c|q) of a candidate being an expert given a query q. Voting models as in [macdonald2006voting] relax the probabilistic view of the latter equation. As an example, the score of a candidate can be computed by ranking all documents against the query with a document representation such as the bag-of-words based term frequency model. Each candidate is then given a score based on the ranks of the documents she is associated with. In [zhang2007expert] and [serdyukov2008modeling], the authors propose to propagate the affinity between the query and the documents across the collaboration graph in a similar manner to PageRank [page1999pagerank]. More recently, [van2016unsupervised] adapted a word embedding technique to embed words and candidates in the same vector space. Many algorithms presented recently in the field of representation learning, such as TADW [yang2015network] and metapath2vec [dong2017metapath2vec], can be adapted to the task of expert finding, but their authors did not evaluate them on this specific task. Much work has also been done on expert finding in community-based question answering, as shown in [zhao2016expert] with their ranking metric network learning framework and in [zhao2015cold], which addresses the cold-start expert finding problem.

3 Framework for Expert Finding Evaluation

In this section, after formally describing the expert finding task, we present two methodologies to generate queries. The first, usually used in the literature, directly sets topic labels as queries, whereas the second, which we introduce in this paper, samples documents from the experts of each topic. Finally, we detail how we used the data from AMiner to generate two datasets for the expert finding task.

3.1 Formal Description

Let G = (C ∪ D, A) be a bipartite graph whose nodes are a set of candidates C and a set of documents D, and whose undirected links A are candidate-document associations. Let T be the textual features of the documents. The expert finding task, given such a dataset (see Figure 1(b)), consists in scoring the set of candidates C given a textual query q, in order to answer the question “which candidates are most likely to be experts in the topics present in the query q?”. Given a set of queries Q, each associated with an identified set of experts E_q ⊆ E, E being the global set of known experts among the candidates, we want to optimize the ranking of the ground truth experts E_q among the global set of experts E.

(a) Bipartite graph linking candidates and documents.
(b) Adjacency matrix of the bipartite graph and the features matrix of the documents.
Figure 1: Hypothetical example of a dataset for expert finding.
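This data layout can be sketched in a few lines of Python (all names and values here are illustrative toy data, not taken from AMiner):

```python
# Candidates, documents, undirected candidate-document links and the
# textual features of the documents (toy values, for illustration only).
candidates = ["alice", "bob"]
documents = ["d1", "d2", "d3"]
links = {("alice", "d1"), ("alice", "d2"), ("bob", "d2"), ("bob", "d3")}

# Adjacency matrix of the bipartite graph (candidates x documents),
# as in Figure 1(b).
A = [[1 if (c, d) in links else 0 for d in documents] for c in candidates]

# Textual features of the documents.
features = {"d1": "data mining methods",
            "d2": "mining patterns in graphs",
            "d3": "semantic web ontologies"}
```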

3.2 Evaluation

To evaluate the ranking of experts produced by an algorithm given a query, we use several common metrics from information retrieval such as Precision at rank K (P@K), Average Precision (AP) and Reciprocal Rank (RR). To better understand the behavior of the tested algorithms, we also construct the Receiver Operating Characteristic (ROC) curve and compute its Area Under the Curve (AUC). For each of these metrics, we compute the standard deviation across queries, which indicates the robustness of the tested algorithms to variations in the data. When we have multiple queries per topic, we also compute the standard deviation across topics. We now present two ways of generating queries and their corresponding ground truth experts.
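For a single query, these metrics can be computed directly from the ranked candidate list and the ground truth expert set. A minimal sketch (function names are ours, not the paper's released code):

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked candidates that are ground-truth experts."""
    return sum(1 for c in ranking[:k] if c in relevant) / k

def average_precision(ranking, relevant):
    """Mean of P@k over the ranks k at which a relevant candidate appears."""
    hits, total = 0, 0.0
    for k, c in enumerate(ranking, start=1):
        if c in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)

def reciprocal_rank(ranking, relevant):
    """Inverse of the rank of the first relevant candidate (0 if none)."""
    for k, c in enumerate(ranking, start=1):
        if c in relevant:
            return 1.0 / k
    return 0.0

# Toy ranking of five candidates, two of which are experts.
ranking = ["a", "b", "c", "d", "e"]
experts = {"b", "d"}
print(precision_at_k(ranking, experts, 2))  # 0.5
print(reciprocal_rank(ranking, experts))    # 0.5
```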

3.2.1 Topic-query evaluation

This approach is straightforward and is commonly adopted in the expert finding community. For a specific topic, its name or description is directly used as a query, and its associated experts are the ground truth list of candidates to be retrieved. Algorithm 1 shows the complete evaluation procedure. As a result, if the dataset is composed of 10 topics, the evaluation protocol consists of 10 queries. We call this approach the topic-query evaluation. For each measure described above, we report its mean (Mean) and standard deviation (STD) across the queries.

Require:  Ranking_Algorithm
  scores ← [ ]
  for all topics do
     candidates_ranking ← Ranking_Algorithm(current_topic_textual_expression)
     current_score ← Evaluate(candidates_ranking, ground_truth_experts_set)
     scores.append(current_score)
  end for
  return  Mean(scores), STD(scores)
Algorithm 1 Topic-query evaluation procedure. The function Evaluate generates metrics such as P@10 and the ROC AUC based on the produced ranking and the ground truth expert set of a given topic.
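The procedure of Algorithm 1 can be sketched as a small Python function (a minimal illustration with our own names; `topics` maps a topic name to its textual expression and ground truth expert set):

```python
from statistics import mean, pstdev

def topic_query_evaluation(ranking_algorithm, topics, evaluate):
    """Topic-query evaluation sketch: one query per topic.

    `ranking_algorithm` maps a query string to a ranked candidate list;
    `evaluate` maps (ranking, experts) to a scalar score.
    Returns the mean and standard deviation of the scores across queries.
    """
    scores = []
    for text, experts in topics.values():
        ranking = ranking_algorithm(text)
        scores.append(evaluate(ranking, experts))
    return mean(scores), pstdev(scores)
```

For instance, with a trivial P@1-style `evaluate` and a fixed ranking, two topics yield a mean of 0.5 when one of the two queries is answered correctly.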

3.2.2 Document-query evaluation

We propose to sample the documents linked with the experts of a given topic and to use them as queries. Instead of using the topic description, we use the set of documents associated with the ground truth experts of a given topic. Precisely, we create a set of queries and their associated experts by selecting each document of the dataset linked with the ground truth experts. The evaluated algorithm thus produces a ranked list of candidates for each document-query, and its performance is measured by comparing this ranking with the experts of the same topic as the expert who produced the document-query. Since several document-queries are sampled for each topic, we also compute the means and standard deviations across topics, by computing these values over the intra-topic averaged measures. To avoid any bias in the metrics, when evaluating an algorithm on a sampled document, we leave that document out of the data. We call this approach the document-query evaluation. Algorithm 2 shows the complete evaluation procedure.

Require:  Ranking_Algorithm
  scores ← [ ]
  topical_scores ← { }
  for all topics do
     topical_scores[current_topic] ← [ ]
     for all experts_of_current_topic do
        for all documents_of_current_expert do
           candidates_ranking ← Ranking_Algorithm(current_document_textual_expression,
                                                  leave_out = current_document)
           current_score ← Evaluate(candidates_ranking, ground_truth_experts_set)
           scores.append(current_score)
           topical_scores[current_topic].append(current_score)
        end for
     end for
     topical_scores[current_topic] ← Mean(topical_scores[current_topic])
  end for
  return  Mean(scores), STD(scores), STD(topical_scores)
Algorithm 2 Document-query evaluation procedure. Note that the computed metrics are also averaged for each topic in order to compute the inter-topic standard deviation.
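Algorithm 2 can likewise be sketched in Python (names are ours; `docs_of` maps an expert to her documents, and `ranking_algorithm` accepts a `leave_out` argument to exclude the queried document from the data):

```python
from statistics import mean, pstdev

def document_query_evaluation(ranking_algorithm, topics, docs_of, evaluate):
    """Document-query evaluation sketch.

    Every document of every ground-truth expert of a topic becomes a query,
    and the queried document is left out of the data to avoid bias.
    Returns the per-query mean and STD, plus the inter-topic STD computed
    over the per-topic averaged scores.
    """
    scores, topical_scores = [], {}
    for topic, experts in topics.items():
        topic_scores = []
        for expert in experts:
            for doc in docs_of(expert):
                ranking = ranking_algorithm(doc, leave_out=doc)
                score = evaluate(ranking, experts)
                scores.append(score)
                topic_scores.append(score)
        topical_scores[topic] = mean(topic_scores)
    return mean(scores), pstdev(scores), pstdev(list(topical_scores.values()))
```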

3.3 AMiner Data

The AMiner project aims to provide tools for mining researchers' social networks. They provide several datasets [tang2008arnetminer] collecting papers, authors, co-authorship and citation links extracted from DBLP [ley2002dblp], ACM (Association for Computing Machinery) and other sources in the field of computer science. For the task of expert finding, they provide two lists of experts. The first, the machine-annotated list, is composed of 13 topics and has been built from topical web search. The second, the human-annotated list, is composed of 7 topics built with the method of pooled relevance judgments together with human judgments, as described in [zhang2008mixture]. We used the machine-annotated list with the citation dataset V2 and the human-annotated list with the citation dataset V1, both available on the AMiner website.

We preprocessed the two datasets based on the distribution of links between candidates and documents, also taking into account the document string length (number of characters). First, we kept only authors with fewer than 100 document links and at least one link. This reduces author name ambiguity by discarding authors who were originally connected to tens of thousands of documents. Then we composed the textual content of the documents by concatenating their titles and abstracts, keeping only those with a string length greater than 50. As a result, we ended up with two datasets:

  • AMiner expert dataset 1: using the machine-annotated list of experts, composed of 996,110 candidates, 1,125,082 documents and 1,269 experts in 13 topics. The distribution of the experts across topics is given in Table 1(a) (one expert can be linked to several topics), together with the total number of documents linked to those experts;

  • AMiner expert dataset 2: using the human-annotated list of experts, composed of 532,968 candidates, 480,630 documents and 210 experts in 7 topics. The distribution is given in Table 1(b).

Topic Exps Docs
neural networks 105 1941
ontology alignment 49 908
boosting 49 1062
support vector machine 91 1145
intelligent agents 28 729
machine learning 33 1229
computer vision 174 3636
data mining 285 8034
natural language processing 40 1480
semantic web 332 6435
planning 21 531
cryptography 139 3033
TOTAL 1269 30802
(a) AMiner dataset 1.
Table 1: Distribution of experts and their documents counts across topics in both AMiner expert datasets.
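The preprocessing rules described above can be sketched as follows (a minimal illustration; the container names `author_docs`, `doc_title` and `doc_abstract` are ours, not AMiner's actual schema):

```python
def preprocess(author_docs, doc_title, doc_abstract):
    """Apply the filtering rules described in the text: keep authors with
    at least 1 and fewer than 100 document links, and keep documents whose
    concatenated title + abstract is longer than 50 characters."""
    authors = {a: docs for a, docs in author_docs.items()
               if 1 <= len(docs) < 100}
    texts = {}
    for docs in authors.values():
        for d in docs:
            text = (doc_title.get(d, "") + " " + doc_abstract.get(d, "")).strip()
            if len(text) > 50:
                texts[d] = text
    return authors, texts
```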

4 Baseline Algorithms

After a short description of the document representations, we describe three baseline algorithms taken from the literature. We reimplemented them since their original code was either not available or hard to reuse. Moreover, this allowed us to easily extend them to work with any kind of document representation.

4.1 Document Representation

Our three baseline algorithms rely on a measure of semantic similarity between the queries and the corpus of documents. We chose to try several document representations: term frequency (TF), term frequency - inverse document frequency (TF-IDF) and latent semantic indexing (LSI) [papadimitriou1998latent]. We tokenized the text of the documents by lowercasing the characters, removing stop words and concatenating tokens into 2-grams and 3-grams based on their co-occurrence counts. Then, words appearing fewer than 3 times in the corpus or in more than 50% of the documents were discarded to reduce the computational cost without affecting the retrieval performance. The number of dimensions of the singular value decomposition for LSI is 300, chosen to ensure that components above noise level are retained.


4.2 Text-based Approach 1: P@noptic Model

P@noptic Expert [craswell2001p] is a simple algorithm which creates a meta-document for each author. Our implementation first concatenates the contents (title + abstract) of all documents linked with each candidate, then vectorizes these meta-documents using the pretrained document representation models. Finally, it computes the cosine similarities between a query and the meta-documents and ranks the candidates in descending order of their scores.
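A minimal sketch of this approach, using raw TF vectors in place of the pretrained representations (all names are ours, and this is a simplification of P@noptic, not the paper's implementation):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def panoptic_rank(query, candidate_docs):
    """Concatenate each candidate's documents into a meta-document, then
    rank candidates by query/meta-document cosine similarity."""
    q = Counter(query.lower().split())
    scores = {c: cosine(q, Counter(" ".join(docs).lower().split()))
              for c, docs in candidate_docs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```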

4.3 Text-based Approach 2: Voting Model

Our voting model based on [macdonald2006voting] first computes the cosine similarities between the query and the documents of the dataset and then ranks all documents in descending order of their score. The algorithm then sums, for each candidate, the inverse of the rank (Reciprocal Rank - RR) of each document she is linked with. If a candidate is linked with the 2nd, 3rd and 7th closest documents to the query, her score will be 1/2 + 1/3 + 1/7. This algorithm gives a huge boost to candidates who have at least one well-ranked document and tends to promote candidates with more documents than others. We also tried fusion techniques other than the RR, such as CombSUM and CombMNZ, described in [macdonald2006voting], but they provided weaker results.
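The RR-fusion step can be sketched as follows (a minimal illustration with our own names; the document ranking itself would come from the cosine similarities described above):

```python
def vote_rr(doc_ranking, doc_candidates):
    """Voting-model sketch: each candidate's score is the sum of the
    reciprocal ranks of her documents in the query/document ranking."""
    scores = {}
    for rank, doc in enumerate(doc_ranking, start=1):
        for c in doc_candidates.get(doc, []):
            scores[c] = scores.get(c, 0.0) + 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True), scores
```

With documents ranked d1..d7 and a candidate linked to the 2nd, 3rd and 7th documents, her score is 1/2 + 1/3 + 1/7, as in the example above.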

4.4 Graph and Text-based Approach: Propagation Model

The propagation model we implemented is a simpler version of those described in [serdyukov2008modeling]. The algorithm first computes the cosine similarities between the query and the documents, and initializes a score vector s of length |C| + |D| with zeros for the candidates and the document-query similarities for the documents. It then performs several two-step random walks with restart until the score vector converges (until the norm of the difference between its previous and current values falls below a small threshold ε). These random walks are applied iteratively: s ← (1 − λ) T s + λ s_0, where the jumping factor λ is a scalar between 0 and 1 which controls the restart, s_0 is the restart vector representing the global probability of a random walk restarting from its original node, and T is the column-wise normalized adjacency matrix of the bipartite graph, also known as the PageRank transition matrix [page1999pagerank]. At each step, scores jump from documents to candidates, then from candidates to documents. A last step is finally performed to propagate the scores back to the candidates. These scores are then ranked in descending order.
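A self-contained sketch of this random walk with restart (pure Python on a small dense matrix; names and the exact update s ← (1 − λ)·T·s + λ·s0 are our reading of the description above, whereas the real implementation operates on the full sparse candidate-document matrix):

```python
def propagate(adj, restart, lam=0.5, tol=1e-6):
    """Random walk with restart on a bipartite graph.

    `adj` is the symmetric adjacency matrix over all candidate and document
    nodes, `restart` holds the initial scores (query/document similarities
    for documents, zeros for candidates), `lam` is the jumping factor."""
    n = len(adj)
    # Column sums of the adjacency matrix, used to normalize T column-wise.
    col_sums = [sum(adj[i][j] for i in range(n)) or 1.0 for j in range(n)]
    s = restart[:]
    while True:
        nxt = [(1 - lam) * sum(adj[i][j] * s[j] / col_sums[j]
                               for j in range(n))
               + lam * restart[i] for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, s)) < tol:
            return nxt
        s = nxt
```

Because T is column-stochastic, the total score mass is preserved at the fixed point, and the (1 − λ) contraction guarantees convergence.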

5 Experiments

In this section, we present the experiments we conducted with both the topic-query and document-query evaluations. We first show some general results, then analyze the effect of the type of query, and finally focus on the variations of the rankings across queries and topics.

5.1 Settings

We evaluated our baseline models with both the topic-query and the document-query methodologies. We ran two evaluations of the propagation model: one with a jumping factor for which the restart is weak, hence the propagation is wide, and one for which the scores stay close to their initial values. Moreover, for each model, the semantic similarity was computed with the TF, TF-IDF and LSI document representations. Table 2 shows the results on the AMiner expert dataset 1 and Table 3 shows the results on the AMiner expert dataset 2.

5.2 General Results

For both datasets, the TF-IDF document representation generally performs best, except for the AUC score, where LSI performs best, especially on topic queries. Taking a closer look at the ROC curve, we see that LSI better ranks the worst-ranked experts: it smooths the curve in the top right corner and hence improves the area under the curve. Most metrics (P@10 and RR for example) focus on the quality of the very first ranked experts, but the ROC AUC allows us to analyze the behavior of a ranking algorithm over the entire ranking. It is also important to note that the results are more stable across the choice of document representation for the second dataset. This behavior is expected since its ground truth experts have been human curated.

5.3 Effect of the Type of Query

We observe different rankings of the baseline algorithms depending on the type of evaluation performed. For the topic-query procedure, the propagation model performs best (with different jumping factors for the two datasets), whereas the voting model is best for the document-query evaluation. Our explanation is that voting models do well when queries and documents are of the same kind, since a single candidate document similar to the query is enough to push her to the top of the ranking. When the query is as short as “data mining”, the chance of finding such a similar document is low, since few documents about data mining have the words “data mining” in their content. Indeed, scientific articles rarely deal with data mining in general but rather focus on a particular aspect of this field.

In contrast, the propagation model can give a good score to a candidate if, in her neighborhood, the query is similar to some documents. Even if this candidate is an expert in “data mining” without ever actually using the expression, chances are that in her close social network, some other candidates used these two words.

Conversely, the voting model might perform better than propagation for document-queries because the latter tends to mistake, for instance, an information retrieval expert who worked closely with data mining experts for a data mining expert. This situation is less likely to happen when the query is a short and very specific description than with paper contents, which share many similar terms across topics.

Moreover, this difference in results between query types is weaker when using LSI, owing to the ability of this document representation to capture similarity between two texts that share no word in common. The effect of short queries is thus highly reduced compared to TF and TF-IDF.

5.4 Standard Deviation along Queries and Topics

One important aspect is the amount of dispersion the sets of scores show around their means. We computed the standard deviations for each evaluation to get an insight into the robustness of the algorithms to queries and to topics. Interestingly, for the document-query evaluation, the standard deviations across topics are lower than the deviations across queries. This shows that the robustness of the algorithms is not impacted so much by the variation of topics, as the standard deviations of the topic-query evaluation could have suggested, but mainly by the variety of queries within topics. As a result, using only a few topic queries is statistically biased, since some topic names might have a lower chance of appearing in their related documents. Finally, in the second dataset, the deviations across topics for the voting model are significantly lower than for the other models, which is valuable information that a topic-query evaluation cannot reveal if one wants to favor stability over the searched topics of expertise.

5.5 Pros and Cons of the Document-Query Evaluation

Besides the fact that the document-query evaluation seems to better represent a real case application of expert finding, we showed that it provides deeper insight into the robustness of an algorithm. The different rankings of algorithms under the two evaluations and their corresponding inter-topic standard deviations show that using only the names of the topics is not a satisfactory protocol to compare expert finding systems. However, in general, the measures are much better with the topic-query evaluation. This is due to two aspects:

  • document-queries are semantically fine-grained, and it is more difficult to separate two queries of different topics. This makes the expert finding task harder to solve, but that is not a bad thing for the evaluation;

  • in our current configuration, document-queries do not rely on an annotated dataset. As a consequence, some sampled documents might not actually belong to the topic their authors are associated with. This motivates the construction of a ground truth set of documents, each associated with at least one of the human-annotated expert topics.

                      TF               TF-IDF           LSI
P@noptic        AUC   0.777 ± 0.099    0.778 ± 0.102    0.832 ± 0.083
                P@10  0.662 ± 0.262    0.685 ± 0.260    0.615 ± 0.301
                AP    0.398 ± 0.193    0.415 ± 0.204    0.395 ± 0.233
                RR    1.615 ± 0.923    1.538 ± 0.746    1.769 ± 0.890
Vote            AUC   0.714 ± 0.137    0.714 ± 0.138    0.800 ± 0.107
                P@10  0.608 ± 0.312    0.608 ± 0.287    0.538 ± 0.325
                AP    0.373 ± 0.212    0.381 ± 0.209    0.390 ± 0.247
                RR    2.308 ± 2.398    1.538 ± 0.929    2.769 ± 3.765
Prop ()         AUC   0.834 ± 0.093    0.834 ± 0.096    0.824 ± 0.085
                P@10  0.669 ± 0.270    0.677 ± 0.264    0.615 ± 0.298
                AP    0.458 ± 0.239    0.473 ± 0.243    0.389 ± 0.230
                RR    1.462 ± 0.929    1.538 ± 1.082    1.692 ± 1.136
Propagation ()  AUC   0.842 ± 0.086    0.842 ± 0.088    0.833 ± 0.083
                P@10  0.677 ± 0.255    0.685 ± 0.274    0.615 ± 0.296
                AP    0.457 ± 0.232    0.472 ± 0.242    0.395 ± 0.231
                RR    1.308 ± 0.606    1.385 ± 0.738    1.692 ± 1.136
(a) Baseline mean scores and their query (same as topic) standard deviations for the topic-query evaluation.
                      TF                TF-IDF            LSI
P@noptic        AUC   0.593 ± 0.131     0.618 ± 0.134     0.620 ± 0.139
                P@10  0.255 ± 0.266     0.335 ± 0.294     0.302 ± 0.298
                AP    0.150 ± 0.117     0.181 ± 0.133     0.174 ± 0.133
                RR    9.153 ± 16.028    6.210 ± 13.323    8.663 ± 16.262
Vote            AUC   0.606 ± 0.131     0.630 ± 0.137     0.637 ± 0.142
                P@10  0.284 ± 0.263     0.322 ± 0.280     0.275 ± 0.276
                AP    0.169 ± 0.131     0.193 ± 0.145     0.187 ± 0.152
                RR    6.819 ± 15.421    5.783 ± 13.983    8.438 ± 16.987
Propagation ()  AUC   0.591 ± 0.140     0.617 ± 0.143     0.612 ± 0.145
                P@10  0.256 ± 0.253     0.336 ± 0.292     0.300 ± 0.288
                AP    0.152 ± 0.114     0.183 ± 0.132     0.173 ± 0.127
                RR    9.035 ± 16.718    6.773 ± 14.641    8.948 ± 17.422
Propagation ()  AUC   0.598 ± 0.141     0.623 ± 0.142     0.621 ± 0.147
                P@10  0.255 ± 0.245     0.333 ± 0.283     0.298 ± 0.282
                AP    0.155 ± 0.115     0.185 ± 0.133     0.177 ± 0.130
                RR    8.994 ± 17.224    6.419 ± 14.088    9.089 ± 18.358
(b) Baseline mean scores and their query standard deviations for the document-query evaluation.
Table 2: Results of the evaluations on the AMiner dataset 1, composed of the machine-annotated experts and the larger candidate-document set. Bold values are the best scores across the algorithms for each document representation.
                      TF               TF-IDF           LSI
P@noptic        AUC   0.809 ± 0.114    0.815 ± 0.119    0.853 ± 0.059
                P@10  0.757 ± 0.176    0.814 ± 0.146    0.771 ± 0.183
                AP    0.580 ± 0.175    0.613 ± 0.186    0.580 ± 0.157
                RR    1.000 ± 0.000    1.000 ± 0.000    1.429 ± 0.495
Vote            AUC   0.788 ± 0.131    0.793 ± 0.136    0.857 ± 0.048
                P@10  0.786 ± 0.136    0.800 ± 0.141    0.729 ± 0.158
                AP    0.607 ± 0.158    0.636 ± 0.171    0.599 ± 0.123
                RR    1.286 ± 0.452    1.000 ± 0.000    1.000 ± 0.000
Prop ()         AUC   0.860 ± 0.097    0.866 ± 0.100    0.834 ± 0.052
                P@10  0.829 ± 0.148    0.843 ± 0.118    0.686 ± 0.181
                AP    0.647 ± 0.142    0.676 ± 0.139    0.564 ± 0.140
                RR    1.000 ± 0.000    1.000 ± 0.000    1.143 ± 0.350
Prop ()         AUC   0.860 ± 0.098    0.864 ± 0.101    0.843 ± 0.057
                P@10  0.786 ± 0.146    0.814 ± 0.125    0.686 ± 0.181
                AP    0.646 ± 0.139    0.671 ± 0.140    0.579 ± 0.138
                RR    1.000 ± 0.000    1.000 ± 0.000    1.286 ± 0.452
(a) Baseline mean scores and their query (same as topic) standard deviations for the topic-query evaluation.
                      TF               TF-IDF           LSI
P@noptic        AUC   0.599 ± 0.112    0.626 ± 0.123    0.621 ± 0.121
                P@10  0.324 ± 0.278    0.387 ± 0.285    0.343 ± 0.287
                AP    0.282 ± 0.145    0.318 ± 0.164    0.302 ± 0.158
                RR    4.904 ± 5.731    3.557 ± 4.976    4.810 ± 5.895
Vote            AUC   0.611 ± 0.120    0.637 ± 0.129    0.634 ± 0.132
                P@10  0.370 ± 0.257    0.417 ± 0.274    0.370 ± 0.266
                AP    0.303 ± 0.148    0.338 ± 0.168    0.318 ± 0.161
                RR    3.211 ± 4.637    2.752 ± 4.211    3.686 ± 5.174
Propagation ()  AUC   0.596 ± 0.113    0.625 ± 0.121    0.612 ± 0.119
                P@10  0.325 ± 0.269    0.381 ± 0.283    0.339 ± 0.278
                AP    0.283 ± 0.147    0.319 ± 0.163    0.298 ± 0.158
                RR    4.512 ± 5.510    3.557 ± 5.171    4.558 ± 6.141
Propagation ()  AUC   0.598 ± 0.114    0.624 ± 0.122    0.615 ± 0.121
                P@10  0.335 ± 0.268    0.392 ± 0.280    0.351 ± 0.279
                AP    0.286 ± 0.146    0.320 ± 0.162    0.303 ± 0.158
                RR    4.027 ± 5.039    3.280 ± 4.885    4.197 ± 5.902
(b) Baseline mean scores and their query standard deviations for the document-query evaluation.
Table 3: Results of the evaluations on the AMiner dataset 2, composed of the human-annotated experts and the smaller candidate-document set. Bold values are the best scores across the algorithms for each document representation.

6 Summary and Future Work

We compared two evaluation protocols for scientific expert finding that rely on two types of query generation. Evaluating our baseline models within this framework, we showed that using the documents written by the ground truth experts yields different results from the usual topic queries. Specifically, short queries benefit from a propagation model, whereas longer queries are better handled by a simpler voting model. Moreover, the lower standard deviations across topics for the document-query evaluation show that there is a bias in using a single topic name as a query, since the document representations do not handle the similarity of such short queries to the documents well.

To improve the document-query evaluation with the AMiner data, we would like to filter the set of sampled documents by human annotation, in order to keep only those that match the expertise of their authors. This would then justify a deeper analysis of the significance of the measurements, taking into account the variations of the rankings of the evaluated algorithms across queries. Another interesting direction would be to perform an online evaluation of the same expert finding algorithms in a reviewer assignment application, in order to compare the results with our framework.