Modern machine learning algorithms require large amounts of labeled data for training. Crowdsourcing marketplaces, such as Amazon Mechanical Turk (https://mturk.com/), Figure Eight (https://www.figure-eight.com), and Yandex Toloka (https://toloka.yandex.com/), make it possible to obtain labels for large data sets in less time and at a lower cost than relying on a limited number of experts. However, since workers at these marketplaces are non-professional and vary in levels of expertise, such labels are much noisier than those obtained from experts Snow et al. (2008). To reduce this noise, each object is typically labeled by several workers, and these labels are then aggregated in a certain way to infer a more reliable label for the object Ipeirotis et al. (2014).
The ranking problem consists in recovering a total ordering over a set of objects, for example, documents or images. The problem is important for ranking-based applications such as web search and recommender systems. To evaluate such algorithms, preferences over objects are obtained from human workers. Since the sets of objects to be ranked are large (e.g., for a given query, search engines retrieve thousands of objects), workers are asked to express their preferences either in the form of absolute judgments, by grading each object on a given scale, or in the form of pairwise comparisons, by choosing a winner among two objects. Carterette et al. (2008) showed that with the latter approach workers complete tasks more quickly and achieve better inter-worker agreement. This, in turn, motivates the problem of ranking from pairwise comparisons, which is addressed in this paper.
There are several sources of noise that can be observed in crowdsourced data in different tasks. The most studied kind of noise appears in multi-classification tasks, where workers, being non-professional, can confuse classes. A special case is when noisy labels come from spammers, who provide deterministic labels or answer randomly Vuurens et al. (2011). A large part of these workers can be automatic bots Difallah et al. (2012); others are real users trying to earn more money in a short time Carterette & Soboroff (2010). In some cases, workers can even be malicious: they can deliberately provide incorrect labels for some reason Raykar & Yu (2012). To infer the true label in the presence of this type of noise, different unsupervised consensus models were proposed (e.g., Dawid & Skene, 1979; Whitehill et al., 2009; Zhou et al., 2015). These approaches are able to detect and implicitly eliminate or correct the noisy part of the crowdsourced information. Another kind of noise is generated by the stochastic nature of true labels. In pairwise comparisons, a weaker player can beat a stronger one by chance, and a document preferred by the majority of users can be less preferable for some minority of users, who are neither spammers nor malicious. In this case, true labels are often modeled as random variables rather than deterministic values. For example, in the Bradley-Terry model Bradley & Terry (1952), the probability that a player wins is the logistic transformation of the difference between the latent scores (skills) of the player and the opponent. The random outcome of a match is considered as the (unknown) true label, which can be spoiled in the crowdsourced setting by low-quality workers (see, e.g., the Crowd-BT model of Chen et al. (2013)).
In this paper, we study the problem of ranking from pairwise comparisons in the presence of a novel type of noise, which was overlooked in previous studies on consensus modeling. We consider the biases of workers caused by known factors that are irrelevant to the task and cannot a priori influence the unknown true answer, but can actually bias noisy crowdsourced labels. Examples of such factors are the positions of the two objects on the worker's screen or the background design of each document whose relevance should be compared. We propose factorBT, a new model for ranking from pairwise comparisons, which takes into account and eliminates individual biases of workers towards irrelevant factors. FactorBT includes the Bradley-Terry model as a special case. To the best of our knowledge, this work is the first to address the problem of ranking aggregation from pairwise comparisons in the presence of multiple factors that are irrelevant to the task but can bias the results of pairwise comparisons.
2 Related work
Classical score-based probabilistic models (Bradley & Terry, 1952; Thurstone, 1927) for ranking from pairwise comparisons were designed for comparisons without noise. Many previous studies on ranking from noisy pairwise comparisons (Chen et al., 2013; Volkovs & Zemel, 2012; Sunahase et al., 2017; Shah & Wainwright, 2017) were based on the assumption that the noise in crowdsourced labels is independent of the properties of the pairwise tasks. In contrast, in this paper we focus on modelling the dependence of the noise on task factors that are a priori known to be irrelevant to the result of comparisons.
Several works modeled biases of workers towards certain classes for multi-classification (Dawid & Skene, 1979; Raykar & Yu, 2012; Zhou et al., 2015; Welinder & Perona, 2010) and ordinal labels (Joachims & Raman; Lyu, 2018; Kao et al., 2018). Some works modeled in-batch bias (Zhuang et al., 2015; Zhuang & Young, 2015), which occurs when a worker receives a batch of tasks, so that labels for tasks may be affected by the other tasks in the same batch. In this work, we focus on modelling another source of bias, caused by known features of a pairwise task, such as the position bias discussed below.
It was observed long ago that the results of pairwise comparisons are affected by position bias Day (1969). Xu et al. (2016) focused on filtering out workers having position bias rather than on aggregating pairwise comparisons. For this reason, their optimization process is non-standard, and we do not see an easy extension of it that would account for multiple factors of bias. Secondly, in their method, all labels from biased workers are discarded, which wastes the unbiased labels produced by workers who are biased only when they are confused or tired. Both of these shortcomings are addressed by our work.
There have been several works using task factors for better aggregation of crowdsourced multiclass labels (Ruvolo et al., 2013; Jin et al., 2017; Welinder & Perona, 2010; Kajino et al., 2012; Kamar et al., 2015; Wauthier & Jordan, 2011). In contrast to our work, all these works considered factors that are predictive of the true labels of tasks, whereas we consider factors that should not affect the result of comparisons. Secondly, all these models were designed for multiclass labels and cannot obtain a ranking from pairwise comparisons. For this reason, models for aggregating multiclass labels have not been used in previous works on ranking from noisy pairwise comparisons (Chen et al., 2013; Sunahase et al., 2017; Xu et al., 2016).
3 Problem setup
Let a set of items I and a set of workers W be given. We assume that each worker has completed some number of tasks, each a pairwise comparison of two items from I. Namely, in such a task, a worker w receives two different items i and j, compares them on her own, and chooses one of the two answers i ≻ j or j ≻ i (where i ≻ j represents that w prefers i over j). Formally, when all tasks are completed, we have the set D of all pairwise comparisons made by the workers in W. Note that, first, D may not include all possible pairs of items; second, the number of comparisons produced by different workers may vary; and, third, the same pair of items can be compared by several workers. The latter is a standard approach to reducing the amount of noise present in pairwise comparisons obtained via crowdsourcing, where the reliability of workers is unknown Chen et al. (2013).
Assume that each task to compare items i and j, given to a worker w, is described by a vector of features, for brevity written as x below. Denote the collection of the feature vectors for all tasks in D as X. The features describe the presence of particular properties of the task that should not affect the result of the comparison when it is made by a perfect worker. However, in our crowdsourcing setting, non-professional workers may unconsciously be affected by such features of tasks (because of perceptual reasons) and may thus provide biased answers Day (1969); Xu et al. (2016). In other words, biased comparisons reflect the reaction of workers to the features of a given pairwise task rather than the true preference over the items in the task. For example, the features may include the position of the compared items on the screen (e.g., which of the two items is located to the left of the other), which is known as position bias (Day, 1969; Xu et al., 2016). To give examples of other task features that may affect pairwise comparisons, consider the application of comparing the results of two search engines over a pool of queries. It is desirable that comparisons of two search engine result pages (SERPs) assess the relevance of the SERPs' content and are not affected by the order of the results from the two search engines, the SERPs' style, or the presence of non-relevant but colorful objects on them.
In this paper, we consider the problem of constructing a ranking list of the items by means of score-based models for pairwise comparisons Bradley & Terry (1952); Thurstone (1927). Such a model assumes that each item i has a latent "quality" score s_i and models the probability of preferring an item over another one as a specific function of the difference between their scores. For example, in the Bradley-Terry (BT) model, the probability that an item i will be preferred in a comparison over an item j (denoted by i ≻ j) is

P(i ≻ j) = 1 / (1 + exp(-(s_i - s_j))),   (1)

where s_i is the score of item i. The set of scores for all items is denoted as s. Traditionally, scores are inferred by maximizing the log-likelihood of the set D of observed pairwise comparisons. For the BT model, the log-likelihood of the observed data is

L(s) = Σ_{(i ≻ j) ∈ D} log P(i ≻ j).   (2)
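For concreteness, a minimal Python sketch of the BT probability (1) and log-likelihood (2), assuming comparisons are stored as (winner, loser) index pairs (the function names are ours, not from the paper):

```python
import math

def bt_prob(s_i, s_j):
    """Bradley-Terry probability (1) that item i is preferred over item j."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

def bt_log_likelihood(scores, comparisons):
    """Log-likelihood (2) of observed comparisons given as (winner, loser) pairs."""
    return sum(math.log(bt_prob(scores[w], scores[l])) for w, l in comparisons)
```

With equal scores the model is indifferent: the win probability is 0.5 and each observed comparison contributes log(0.5) to the likelihood.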
Given the scores of the items, one can sort the items according to their scores and obtain a ranking, i.e., a permutation mapping each item to its rank. We follow Chen et al. (2013) and say that estimated scores of items are accurate if they agree with the ground truth quality of these items in the following sense. Given ground truth qualities q_i for the items, the accuracy is measured by

Acc = ( Σ_{i, j: q_i > q_j} 1[s_i > s_j] ) / |{(i, j): q_i > q_j}|,   (3)

where the indicator function 1[·] equals 1 when its argument is true and 0 otherwise.
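A hypothetical helper illustrating the accuracy (3) as the fraction of item pairs whose estimated order agrees with the ground-truth order (list-based storage is an assumption of this sketch):

```python
from itertools import combinations

def ranking_accuracy(est_scores, true_quality):
    """Fraction of item pairs with distinct ground-truth quality whose
    estimated order matches the ground-truth order."""
    agree, total = 0, 0
    for i, j in combinations(range(len(true_quality)), 2):
        if true_quality[i] == true_quality[j]:
            continue  # ties in ground truth are not counted
        total += 1
        # the pair is correct if the score difference has the same sign
        # as the ground-truth quality difference
        if (est_scores[i] - est_scores[j]) * (true_quality[i] - true_quality[j]) > 0:
            agree += 1
    return agree / total if total else 0.0
```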
Given the observed results of pairwise comparisons D and the features X describing the pairwise tasks presented to workers, the goal of this paper is to estimate the scores of the items that maximize the likelihood of D. We address the following questions for the problem of estimating item scores from noisy pairwise comparisons: Firstly, how can pairwise comparisons with multiple types of bias be modeled using the features of pairwise tasks? Secondly, can the features of pairwise tasks be used to debias the results of pairwise comparisons and obtain more accurate item scores?
4 Established score-based methods

In this section we briefly describe established score-based methods for ranking from noisy pairwise comparisons and provide references to the papers describing them.
4.1 Bradley-Terry model
The classical Bradley-Terry (BT) model Bradley & Terry (1952), described in Section 3, is the basic approach to ranking from pairwise comparisons. In this model, all workers are treated equally. To estimate the unknown item scores, the log-likelihood (2) of D is maximized. As expected, in the crowdsourcing setting, where the reliability of workers varies, the classical BT model performs poorly Chen et al. (2013).
4.2 CrowdBT model
Chen et al. (2013) described an extension of the BT model to the crowdsourcing setting, where a parameter η_w for each worker w describes the worker's qualification. For their CrowdBT model, the probability that a worker w prefers an item i over another item j is defined as follows:

P_w(i ≻ j) = η_w P(i ≻ j) + (1 - η_w) P(j ≻ i),

where P(i ≻ j) is the BT probability (1). To optimize the parameters in the case of sparse pairwise comparisons, the authors suggested using virtual node regularization: the set of pairwise comparisons D is extended to D_0 by adding fictive comparisons of a virtual item with a fixed score against all other items, performed by a perfect worker (η = 1) and resulting in one virtual win and one virtual loss for each item. The model parameters s and η are estimated by maximizing a sum whose first term is the log-likelihood of D, whose second term is the regularization, and where λ is the regularization parameter. CrowdBT by construction cannot detect biased workers and considers them to be random: for such workers, η_w would be near 0.5, and the log-likelihood terms corresponding to these workers become nearly constant. Thus, CrowdBT cannot extract any information from biased answers.
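The CrowdBT response model can be sketched as a mixture of the BT answer and its flip, with the worker's qualification as the mixing weight (function names are ours):

```python
import math

def bt_prob(s_i, s_j):
    """Bradley-Terry probability that item i beats item j."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

def crowd_bt_prob(eta_w, s_i, s_j):
    """Worker answers according to BT with probability eta_w,
    and gives the flipped answer otherwise."""
    p = bt_prob(s_i, s_j)
    return eta_w * p + (1.0 - eta_w) * (1.0 - p)
```

A worker with η = 0.5 answers uniformly at random regardless of the scores, which is why such terms carry no information about the items.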
4.3 Pairwise HITS
Sunahase et al. (2017) proposed a method called Pairwise HITS, where the relationship between item scores and worker abilities is described by two mutually dependent equations: the score of an item aggregates the ability-weighted comparisons it wins, where D_{i ≻ j} denotes the set of comparisons in D for which i beats j. These equations, summarizing all comparisons in D, can be rewritten in matrix form, and the item scores are then estimated using a standard method. Workers' abilities are approximated as the fraction of their comparisons that are correct according to the current estimate of the scores s.
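A simplified, self-contained sketch of the alternating Pairwise HITS updates; the actual method of Sunahase et al. (2017) uses a matrix formulation, so the data layout and the plain iteration below are assumptions:

```python
def pairwise_hits(comparisons, n_items, n_iter=20):
    """comparisons: list of (worker, winner, loser) tuples.
    Alternates between ability-weighted score updates and
    consistency-based ability updates."""
    workers = {w for w, _, _ in comparisons}
    ability = {w: 1.0 for w in workers}
    scores = [0.0] * n_items
    for _ in range(n_iter):
        # score update: ability-weighted wins minus losses
        scores = [0.0] * n_items
        for w, win, lose in comparisons:
            scores[win] += ability[w]
            scores[lose] -= ability[w]
        # ability update: fraction of the worker's answers consistent
        # with the current scores
        for w in workers:
            own = [(win, lose) for ww, win, lose in comparisons if ww == w]
            correct = sum(1 for win, lose in own if scores[win] >= scores[lose])
            ability[w] = correct / len(own)
    return scores, ability
```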
4.4 Linear model
A linear model incorporating position bias was proposed by Xu et al. (2016). For each comparison, the result y of the comparison is equal to 1 if i beats j and -1 otherwise. The answer is modeled based on the scores of the items and noise:

y = s_i - s_j + Δ_w + ε,

where ε is random sub-Gaussian noise and Δ_w is a position bias specific to each worker w. The problem can be rewritten in matrix form, and the LASSO approach is then used to approximate the optimal parameters. Though the method tries to capture position bias, it is able to model only linear relations between the parameters, and our experiments in Sec. 6 show that the model performs poorly in comparison with our method.
5 The factorBT model

In this section we describe our factorBT model for ranking from pairwise comparisons. In our model, each worker w has parameters η_w and β_w. The intuition for η_w is the following: for a given pairwise task, a worker produces an unbiased answer (according to the BT model (1)) with probability η_w, and with probability 1 - η_w her answer is affected by undesirable features of the task. The parameter β_w models the reaction of the worker to task features. To model the strength of workers' reactions to task features, we follow Ruvolo et al. (2013). In our model, the amount of observed bias in the answers of a worker depends on her sensitivity to certain task features and on the presence of these features in the tasks performed by this worker.
In this way, for factorBT the probability that a worker w prefers an item i over another item j is

P_w(i ≻ j) = η_w P(i ≻ j) + (1 - η_w) σ(⟨β_w, x⟩),

where ⟨β_w, x⟩ is the scalar product of the vectors β_w and x, and σ(t) = 1 / (1 + exp(-t)).
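The factorBT response above can be sketched as follows (the function names and the explicit logistic form are our notation):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def factor_bt_prob(eta_w, beta_w, x, s_i, s_j):
    """Mixture of an unbiased BT answer (weight eta_w) and a
    feature-driven biased answer (weight 1 - eta_w)."""
    p_bt = sigmoid(s_i - s_j)                    # BT probability (1)
    dot = sum(b * f for b, f in zip(beta_w, x))  # <beta_w, x>
    return eta_w * p_bt + (1.0 - eta_w) * sigmoid(dot)
```

With η_w = 1 the model reduces to plain BT; with β_w = 0 the biased component is uninformative (probability 0.5), so the special cases behave as the text describes.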
To estimate the parameters s, η, and β, we maximize the log-likelihood of the observed pairwise comparisons D. Similarly to other score-based models, factorBT needs regularization when the comparisons are sparse. For this purpose, we use the virtual node regularization of Chen et al. (2013), described in Section 4.2, which yields the target function to maximize. To maximize this function, we use a standard conjugate gradient algorithm; the gradients of the target function with respect to each parameter are computed in closed form.
The results of the optimization depend on the initial values of the parameters. All scores (including the score of the virtual item) are initialized by 0. If a worker always gives the same answer, her η_w is initialized to a small value, and otherwise to a large one. The initialization of β_w is more involved and can be explained using our experimental data as an example. In our experiments, task features take values from {-1, 0, 1}. For each feature of a given task: the value 1 means that this feature is present in the first item (for example, a certain background design) and absent in the second one; the value -1 means vice versa; and the value 0 means the feature is either present in both items or absent in both. For each worker and each feature, we estimate a smoothed statistic of her answers, namely the fraction of her answers in favor of the item carrying the feature, over her tasks where the feature is present in exactly one item. Finally, β_w is initialized by the log-odds of these fractions (in factorBT, β_w is used inside σ(⟨β_w, x⟩); for this reason we apply the logarithmic transformation for the initialization).
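The initialization of β_w can be sketched as follows; the exact smoothing constant and the data layout are assumptions of this sketch:

```python
import math

def init_beta(worker_answers, alpha=1.0):
    """worker_answers: list of (x_vector, chose_first) pairs for one worker,
    where x_vector has entries in {-1, 0, 1} and chose_first says whether
    the worker picked the first item. For each feature, compute the
    smoothed fraction of answers in favor of the item carrying the
    feature, then map it through the logit (log-odds)."""
    dim = len(worker_answers[0][0])
    beta = []
    for l in range(dim):
        wins, n = alpha, 2.0 * alpha  # additive smoothing
        for x, chose_first in worker_answers:
            if x[l] == 1:        # feature on the first item
                n += 1
                wins += 1 if chose_first else 0
            elif x[l] == -1:     # feature on the second item
                n += 1
                wins += 0 if chose_first else 1
        p = wins / n
        beta.append(math.log(p / (1.0 - p)))  # logit of the fraction
    return beta
```

A worker who consistently favors the featured item gets a positive initial component for that feature; a feature the worker ignores gives a component near 0.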
6 Experiments

Simulated study. In this experiment, we follow the simulation procedure described in (Chen et al., 2013, Section 5.1). We assume that there are 100 objects and randomly generate 400 unique pairs to be used as pairwise tasks. Ground truth scores of items were randomly chosen without replacement among the integers from 0 to 99. Every task (i.e., pair) has two factors, each taking a value in {-1, 0, 1}; the values of the factors were chosen randomly with equal probability. The parameters of the 100 workers were sampled from the standard normal distribution. For each pair of items, we randomly select 10 different workers to generate their pairwise comparisons, and each worker votes for the first object with the probability given by the factorBT model.
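The simulation setup above can be sketched as follows; the per-worker bias parameters and the full factorBT response are omitted here, and the simple noisy-BT response on rescaled true scores is an assumption of this toy generator:

```python
import math
import random

def simulate(n_items=100, n_pairs=400, n_workers=100,
             workers_per_pair=10, seed=0):
    """Toy data generator: unique random pairs, ground-truth scores as a
    random permutation, two task factors in {-1, 0, 1} per pair, and a
    noisy BT-style response on rescaled true scores."""
    rng = random.Random(seed)
    truth = list(range(n_items))
    rng.shuffle(truth)  # ground-truth quality: a permutation of 0..n_items-1
    pairs = set()
    while len(pairs) < n_pairs:
        i, j = rng.sample(range(n_items), 2)
        if (j, i) not in pairs:
            pairs.add((i, j))
    comparisons = []  # (worker, winner, loser, task factors)
    for i, j in pairs:
        x = (rng.choice([-1, 0, 1]), rng.choice([-1, 0, 1]))  # two factors
        for w in rng.sample(range(n_workers), workers_per_pair):
            p = 1.0 / (1.0 + math.exp(-(truth[i] - truth[j]) / 25.0))
            winner, loser = (i, j) if rng.random() < p else (j, i)
            comparisons.append((w, winner, loser, x))
    return truth, comparisons
```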
To evaluate how well factorBT approximates the true parameters, we use the following metrics. Firstly, as in Chen et al. (2013), we compute the accuracy of the ranking obtained by sorting objects based on the estimated scores: the ground truth ranking generates true comparisons between each pair of elements, and we compute the accuracy (3). Secondly, as in Whitehill et al. (2009), we compute the Pearson correlation coefficient between the true and the estimated parameters of factorBT. The results in Table 1 are averaged over 10 trials. They show that the correlation between the reconstructed and the original parameters is quite good and that the ranking itself is close to the ground truth one.
Table 1:
Pearson correlation for : 0.50
Pearson correlation for : 0.47
Pearson correlation for : 0.81
Pearson correlation for : 0.92
Reading difficulty data set. The data set (available at http://www-personal.umich.edu/~kevynct/datasets/wsdm_rankagg_2013_readability_crowdflower_data.csv) contains 490 different documents (text passages). In each task, a worker was given two passages of text (a passage A and a passage B). The task was to decide whether passage A is more difficult to read and understand, whether passage B is more difficult, or to answer that she cannot decide. Pairs of passages were chosen randomly. Overall, 624 workers made 13856 comparisons (for 1424 tasks), and each worker contributed no more than 40 judgments. For some passages, golden labels assessing their reading difficulty on the scale from 1 to 12 were provided. Only determined answers (i.e., "Passage A is more difficult" and "Passage B is more difficult") were used to compute the ranking accuracy (3) for the different methods.
Apart from the original data set, we used its modifications, in which we add simulated malicious workers. It is known that workers solving pairwise tasks tend to choose the left or first answer (Day, 1969; Xu et al., 2016), so in our simulation each malicious worker always chooses passage A as more difficult for every task. This type of simulated worker is called a uniform spammer by Vuurens et al. (2011). Tasks for each simulated worker are chosen randomly, and the number of tasks per simulated worker equals the mean number of tasks assigned to the original workers (22 tasks for this data set). To evaluate the stability of the methods against uniform spammers, we compute the accuracy of the ranking obtained after adding different fractions of malicious workers, from 0 to 100 percent of the number of original workers (with a step of 20). Accuracy results were averaged over 10 trials. The results are shown in Figure 1.
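Injecting uniform spammers can be sketched as follows (the tuple layout (worker, winner, loser) and the function names are assumptions of this sketch):

```python
import random

def add_uniform_spammers(comparisons, n_real_workers, fraction,
                         tasks_per_spammer, task_pool, seed=0):
    """Append simulated uniform spammers who always choose the first item
    (passage A). task_pool is a list of (first_item, second_item) tasks."""
    rng = random.Random(seed)
    out = list(comparisons)
    for s in range(int(fraction * n_real_workers)):
        worker_id = n_real_workers + s  # fresh id for the spammer
        for a, b in rng.sample(task_pool, tasks_per_spammer):
            out.append((worker_id, a, b))  # the first item always wins
    return out
```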
As we can see, only the FactorBT method is stable when uniform spammers appear in the data set. Moreover, even without spammers, it shows the best result, which means that it detects the position bias present in the data set.
SERP comparison data sets. Finally, we evaluated our algorithm on proprietary industry data. Each data set was collected to compare two systems using pairwise tasks. The two systems in each data set were chosen so that the underlying quality of one of them was a priori known to be better than that of the other, and by the experimental design the weaker system had certain features that were irrelevant to search quality but might attract workers' attention. Each task contains two SERPs of different search engines for the same query. Workers are asked to choose the SERP containing more relevant documents for the given query. Crowdsourced labels for the data sets were collected via Yandex.Toloka. Statistics about the data sets are summarized in Table 2.
Table 2:
                                   Data set 1   Data set 2
  Data set size                    2660         2500
  Number of queries                133          125
  Number of workers                194          194
  Avg. number of tasks per worker  13.7         20
  Avg. number of workers per task  20           12.9
The first data set compares the results of the first page of Google and the fifth page of Yandex. The search engine's name is not shown to the worker, but some workers are able to guess the engine from the background design or other secondary features such as fonts, the distance between snippets, etc. Since Yandex.Toloka is a product of Yandex, apart from positional bias, workers tend to be biased towards the Yandex system. It is obvious, though, that the first page of Google should win.
The second data set contains the Yandex first result page for a given query and the Google first result page for a spoiled query. The idea is that we change the meaning of the query slightly so that the results become less relevant than those for the original one. E.g., the original query is "Tom and Jerry" and the spoiled one is "Tom". In this case, Yandex should beat Google.
Although in our proprietary data sets only SERPs are compared, we would like to obtain an overall quality score for each system. For that purpose, we compute the probability that one system beats the other by averaging, over all queries in a data set, the probabilities (based on the estimated scores) that the first system's page beats the second system's page. This is the main quality metric used for the experiments in this subsection. To evaluate the stability of the methods against uniform spammers, we evaluate the quality scores of each system after adding different fractions of malicious workers, as described in the previous experiment.
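The per-system quality metric can be sketched as the average, over queries, of the BT-style win probability implied by the estimated scores (a hypothetical helper):

```python
import math

def system_win_prob(score_pairs):
    """score_pairs: per-query (s_first, s_second) estimated SERP scores.
    Returns the average probability that the first system's page beats
    the second system's page."""
    probs = [1.0 / (1.0 + math.exp(-(s1 - s2))) for s1, s2 in score_pairs]
    return sum(probs) / len(probs)
```

A value above 0.5 means the first system is estimated to be the better one overall.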
In the next figures, we show the probability of the a priori better system beating the a priori worse system, so if the line is above 0.5, then the overall quality of the better system was estimated correctly. Figure 2 shows the results for the first data set when we add uniform left-biased spammers. As we can see, all methods but FactorBT are unstable when left spammers are added. Only FactorBT and BT show a probability of more than 0.5 for all modifications of the data set. Figure 3 shows the results for the second data set when we add a mixture of uniform left-biased spammers and uniform Yandex-biased spammers (half of the spammers are of the first type and half of the second). Only FactorBT and CrowdBT show correct results on this data set. We noticed that CrowdBT looks more stable than FactorBT when the two kinds of spammers are added, though the former method does not model any biases. We also checked that if just one kind of spammer is added, we observe similarly stable results for FactorBT as for the first data set. So the situation with two kinds of spammers requires further study.
7 Conclusion

We have described the factorBT model for ranking from pairwise comparisons, which has the following properties. Firstly, it is a score-based model that, given a collection of pairwise comparisons, allows one to obtain a ranking of items based on their estimated scores. Secondly, by modelling workers' reactions to known irrelevant features present in pairwise tasks, it reduces the influence of these features on the estimated scores of items. Empirical evaluation on three real data sets has shown that, firstly, FactorBT produces a more accurate ranking compared to previously established baselines and, secondly, factorBT shows much more stable performance under the addition of spammers.
- Bradley & Terry (1952) Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 1952.
- Carterette & Soboroff (2010) Carterette, B. and Soboroff, I. The effect of assessor error on ir system evaluation. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, 2010.
- Carterette et al. (2008) Carterette, B., Bennett, P. N., Chickering, D. M., and Dumais, S. T. Here or there. In European Conference on Information Retrieval, 2008.
- Chen et al. (2013) Chen, X., Bennett, P. N., Collins-Thompson, K., and Horvitz, E. Pairwise ranking aggregation in a crowdsourced setting. In Proceedings of the sixth ACM international conference on Web search and data mining, 2013.
- Dawid & Skene (1979) Dawid, A. P. and Skene, A. M. Maximum likelihood estimation of observer error-rates using the em algorithm. Applied statistics, pp. 20–28, 1979.
- Day (1969) Day, R. L. Position bias in paired product tests. Journal of Marketing Research, 1969.
- Difallah et al. (2012) Difallah, D. E., Demartini, G., and Cudré-Mauroux, P. Mechanical cheat: Spamming schemes and adversarial techniques on crowdsourcing platforms. 2012.
- Ipeirotis et al. (2014) Ipeirotis, P. G., Provost, F., Sheng, V. S., and Wang, J. Repeated labeling using multiple noisy labelers. Data Mining and Knowledge Discovery, 2014.
- Jin et al. (2017) Jin, Y., Carman, M., Kim, D., and Xie, L. Leveraging side information to improve label quality control in crowd-sourcing. Procs of Hcomp2017. AAAI, 2017.
- Joachims & Raman. Joachims, T. and Raman, K. Bayesian ordinal aggregation of peer assessments: A case study on KDD 2015. In Solving Large Scale Learning Tasks. Challenges and Algorithms.
- Kajino et al. (2012) Kajino, H., Tsuboi, Y., and Kashima, H. A convex formulation for learning from crowds. Transactions of the Japanese Society for Artificial Intelligence, 2012.
- Kamar et al. (2015) Kamar, E., Kapoor, A., and Horvitz, E. Identifying and accounting for task-dependent bias in crowdsourcing. In Third AAAI Conference on Human Computation and Crowdsourcing, 2015.
- Kao et al. (2018) Kao, A. B., Berdahl, A. M., Hartnett, A. T., Lutz, M. J., Bak-Coleman, J. B., Ioannou, C. C., Giam, X., and Couzin, I. D. Counteracting estimation bias and social influence to improve the wisdom of crowds. Journal of The Royal Society Interface, 2018.
- Lyu (2018) Lyu, L. Spam elimination and bias correction: ensuring label quality in crowdsourced tasks. 2018.
- Raykar & Yu (2012) Raykar, V. C. and Yu, S. Eliminating spammers and ranking annotators for crowdsourced labeling tasks. Journal of Machine Learning Research, 2012.
- Ruvolo et al. (2013) Ruvolo, P., Whitehill, J., and Movellan, J. R. Exploiting commonality and interaction effects in crowdsourcing tasks using latent factor models. In Neural Information Processing Systems. Workshop on Crowdsourcing: Theory, Algorithms and Applications, 2013.
- Shah & Wainwright (2017) Shah, N. B. and Wainwright, M. J. Simple, robust and optimal ranking from pairwise comparisons. The Journal of Machine Learning Research, 2017.
- Snow et al. (2008) Snow, R., O'Connor, B., Jurafsky, D., and Ng, A. Y. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing, pp. 254–263. Association for Computational Linguistics, 2008.
- Sunahase et al. (2017) Sunahase, T., Baba, Y., and Kashima, H. Pairwise hits: Quality estimation from pairwise comparisons in creator-evaluator crowdsourcing process. In AAAI, 2017.
- Thurstone (1927) Thurstone, L. L. The method of paired comparisons for social values. The Journal of Abnormal and Social Psychology, 1927.
- Volkovs & Zemel (2012) Volkovs, M. N. and Zemel, R. S. A flexible generative model for preference aggregation. In Proceedings of the 21st international conference on World Wide Web, 2012.
- Vuurens et al. (2011) Vuurens, J., de Vries, A. P., and Eickhoff, C. How much spam can you take? an analysis of crowdsourcing results to increase accuracy. In Proc. ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR’11), pp. 21–26, 2011.
- Wauthier & Jordan (2011) Wauthier, F. L. and Jordan, M. I. Bayesian bias mitigation for crowdsourcing. In Advances in neural information processing systems, 2011.
- Welinder & Perona (2010) Welinder, P. and Perona, P. Online crowdsourcing: rating annotators and obtaining cost-effective labels. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, 2010.
- Whitehill et al. (2009) Whitehill, J., Wu, T., Bergsma, J., Movellan, J. R., and Ruvolo, P. L. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pp. 2035–2043, 2009.
- Xu et al. (2016) Xu, Q., Xiong, J., Cao, X., and Yao, Y. False discovery rate control and statistical quality assessment of annotators in crowdsourced ranking. In International Conference on Machine Learning, 2016.
- Zhou et al. (2015) Zhou, D., Liu, Q., Platt, J. C., Meek, C., and Shah, N. B. Regularized minimax conditional entropy for crowdsourcing. arXiv preprint arXiv:1503.07240, 2015.
- Zhuang & Young (2015) Zhuang, H. and Young, J. Leveraging in-batch annotation bias for crowdsourced active learning. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 2015.
- Zhuang et al. (2015) Zhuang, H., Parameswaran, A., Roth, D., and Han, J. Debiasing crowdsourced batches. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.