Consider a standard crowdsourcing task such as image labeling (Deng et al., 2014; McLaughlin, ) or corpus annotation (Sabou et al., 2014). Crowdsourcing is also used for a variety of tasks in which there is no “ground truth,” such as studying vocabulary size (Keuleers et al., 2015), rating the quality of an image (Ribeiro et al., 2011), etc. In this paper, we focus only on questions for which there is a well-defined true answer, as in the examples above.
Such tasks are often used to construct large databases that can later be used to train and test machine learning algorithms. Crowdsourcing workers are usually not experts, and thus answers obtained this way often contain many mistakes (Kazai et al., 2011; Vuurens et al., 2011; Wais et al., 2010).
Truth discovery is a general name for a broad range of methods that aim to extract some underlying ground truth from noisy answers. While the mathematics of truth discovery dates back to the early days of statistics, at least to the Condorcet Jury Theorem (Condorcet and others, 1785), the rise of crowdsourcing platforms suggests an exciting modern application for truth discovery.
A simple approach to improve accuracy in crowdsourcing applications is to ask the same question of a number of workers and to aggregate their answers by some aggregation rule. The use of aggregation rules suggests a natural connection between truth discovery and social choice, which deals with aggregation of voters’ opinions and preferences. Indeed, voting rules have proved useful in the design of truth discovery and crowdsourcing techniques (Conitzer and Sandholm, 2005; Mao et al., 2013; Caragiannis et al., 2013). It is our intention to further explore and exploit these connections in the current paper.
In political and organizational elections, a common practice is to allow voting-by-proxy, where some voters let others vote on their behalf (Green-Armytage, 2015; Brill, 2018). Thus the aggregation is performed over a subset of “active” voters, who are weighted by the number of “inactive” voters with similar opinions. In a recent paper, Cohensius et al. (2017) showed that under some assumptions on the distribution of preferences, proxy voting reduces the variance of the outcome, and thus requires fewer active voters to reach the socially-optimal alternative. Cohensius et al. suggested that an intuitive explanation for the effectiveness of proxy voting lies in the fact that the more ‘representative’ voters tend to also be similar to one another, but did not provide formal justifications of that claim. Further, in their model the designer seeks to approximate the subjective “preferences of the society,” whereas truth-discovery is concerned with questions for which there is an objective ground truth.
In this paper, we consider algorithms for truth discovery that are inspired by proxy voting, with crowdsourcing as our main motivation. Our goal is to develop a simple approach for tackling the following challenge in a broad range of domains: We are given a set of workers, each answering multiple questions, and want to: (a) identify the competent workers; and (b) aggregate workers’ answers such that the outcome will be as close as possible to the true answers. These challenges are tightly related: a good estimate of workers’ competence allows us to use better aggregation methods (e.g., by giving higher weight to good workers); and aggregated answers can be used as an approximation of the ground truth to assess workers’ competence. Indeed, several current approaches in truth discovery tackle these goals jointly (see Related Work below).
Our approach decouples the above goals. For (a), we apply proxy voting to estimate each worker’s competence, where each worker increases the estimated competence of similar workers. For the truth discovery problem (b), we then use a straightforward aggregation function (e.g. Majority or Average), giving a higher weight to more competent workers (i.e. who are closer to others).
We depart from previous work on proxy voting mentioned above by dropping the assumption that each worker delegates her vote to the (single) nearest proxy. While this requirement makes sense in a political setting so as to keep the voting process fair (1 vote per person), it is somewhat arbitrary when our only goal is a good estimation of workers’ competence and the ground truth.
For analysis purposes, we use distances rather than proximity. We assume that each worker has some underlying fault level that is the expected distance between her answers and the ground truth. The question of optimal aggregation when competence/fault levels are known is well studied in the literature, and hence our main challenge is to estimate fault levels. To capture the positive influence of similar workers, we define the proxy distance of each worker as her average distance from all other workers. While the similarity between workers has been considered in the literature (see Related Work), we are unaware of any systematic study of its uses. Our main theoretical result can be written as follows:
Theorem 0 (Anna Karenina principle, informal).
The expected proxy distance of each worker is linear in her fault level .
Essentially, the theorem says that, as in Tolstoy’s novel, “good workers are all alike” (thereby boosting one another’s proxy distances), whereas “each bad worker is bad in her own way” and thus not particularly close to other workers. The exact linear function depends on the statistical model, and in particular on whether the data is categorical or continuous.
The Anna Karenina principle suggests a natural proxy-based truth-discovery (P-TD) algorithm that first estimates fault levels based on proxy distances, and then uses standard techniques from the truth discovery literature to assign workers’ weights and aggregate their answers. We emphasize that a good estimate of the fault levels may be of interest regardless of the aggregation procedure (this is goal (a) above). For example, the operators of the crowdsourcing platform may use it to decide on payments for workers or on terminating the contract with low-quality workers.
1.1. Contribution and paper structure
Our main theoretical contribution is a formal proof of Theorem 1 in the following domains: (i) when answers are continuous with independent normal noise, and the answers of each worker have variance equal to her fault level; (ii) when answers are categorical and each worker fails each question independently with probability equal to her fault level; (iii) when answers are rankings of alternatives sampled from the Condorcet noise model with the fault level as its parameter. In all three domains, the parameters of the linear function depend on the distribution from which fault levels are sampled. We show conditions under which our estimates of the true fault levels and of the ground truth are statistically consistent. In the continuous domain, we further show that our proxy-based method generalizes another common approach for fault estimation.
We devote one section to each domain (continuous, categorical, rankings). In each section, the theoretical results are followed by an extensive empirical evaluation of truth discovery methods on synthetic and real data. We compare P-TD to standard (unweighted) aggregation and to other approaches suggested in the truth discovery literature. We show that P-TD is substantially better than straight-forward (unweighted) aggregation, and often beats other approaches of weighted aggregation. We also show how to extend P-TD to an iterative algorithm that competes well with more advanced approaches.
Due to space constraints and to allow continuous reading, most proofs are deferred to appendices. Appendices also contain additional figures that show that the findings presented in the paper apply broadly.
We denote by the indicator variable of the Boolean condition .
is the set of probability distributions over set.
A domain is a tuple where is a set of possible world states; and is a distance measure.
We assume there is a fixed set of workers, denoted by . An instance in domain is a set of reports where for all ; and a ground truth . To make things concrete, we will consider three domains in particular:
Continuous domain. Here we have questions with real-valued answers. Thus ; and we define to be the squared normalized Euclidean distance.
Categorical domain. Here we have questions with categorical answers in some finite set . Thus ; and we define as the normalized Hamming distance.
Rankings domain. Here is all orders over a set of elements , with the Kendall-tau distance .
A noise model in domain is a function from to . We consider noise models that are informative, in the sense that: (I) returns w.p. 1 (i.e., without noise); and (II) is strictly increasing in , for any moment (i.e., higher means more noise).
A population in domain is a set of workers, each with a fault level . A proto-population is a distribution over fault levels. We denote by , the mean and variance of distribution , respectively. We omit when clear from the context. Note that the higher the fault level is, the more erroneous answers we should expect.
Fix a particular domain . Given a population , a noise model , and a ground truth , we can generate instance by sampling each answer independently from . We can similarly generate instances from a proto-population by first sampling each from , and then sampling an instance from the resulting population.
An aggregation function in a domain is a function . In this work we consider simple aggregation functions: is the Mean function in the continuous domain; and is the Plurality function in the categorical domain. The functions are applied to each question independently. Formally: and
In the ranking domain we apply several popular voting rules, including Plurality, Borda, Veto, Copeland, and Kemeny. (The more accurate term for functions that aggregate several rankings into a single ranking is social welfare functions (Brandt et al., 2016).) All the aggregation functions we consider have natural weighted analogs, which take a weight vector over workers as an additional input.
Given an instance and aggregation function or algorithm in domain , the error of on is defined as . The goal of truth discovery is to find aggregation functions that tend to have low error.
For the convenience of the reader, a list of notation and acronyms is available at the end of the appendix.
2.1. A general Proxy-based Truth-Discovery scheme
Given an instance and a basic aggregation function , we apply the following workflow, whose specifics depend on the domain.
1. Collect the answers from all workers.
2. Compute the pairwise distance for each pair of workers.
3. For each worker :
   (a) Compute proxy distance by averaging over all pairwise distances (Eq. (1)).
   (b) Estimate the fault level from .
   (c) Transform fault level to a weight .
4. Aggregate and return the estimated true answers .
By Theorem 1, the (expected) proxy distance of each agent is a linear transformation of her fault level. Thus, we implement step 3b by reversing the linear transformation to get an estimated fault level , and then use known results to obtain the (estimated) optimal weight in step 3c. The details are given in the respective sections, where we refer to the algorithms that return and as Proxy-based Estimation of Fault Levels (P-EFL) and Proxy-based Truth Discovery (P-TD), respectively.
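The workflow above can be sketched generically as follows. This is a minimal illustration rather than the paper's own pseudocode: the function names (`proxy_distances`, `proxy_truth_discovery`) and the four callables passed in are hypothetical, and each domain would supply its own distance, fault estimator, weight transform, and aggregation rule.

```python
import numpy as np

def proxy_distances(reports, dist):
    """Step 3a: average distance of each worker's report from all other reports."""
    n = len(reports)
    pi = np.zeros(n)
    for i in range(n):
        pi[i] = sum(dist(reports[i], reports[j]) for j in range(n) if j != i) / (n - 1)
    return pi

def proxy_truth_discovery(reports, dist, estimate_fault, fault_to_weight, aggregate):
    """Hypothetical skeleton of steps 2-4; all four callables are domain-specific."""
    pi = proxy_distances(reports, dist)                       # steps 2 and 3a
    fault = np.array([estimate_fault(p) for p in pi])         # step 3b
    weights = np.array([fault_to_weight(f) for f in fault])   # step 3c
    return aggregate(reports, weights), fault, weights        # step 4
```

For instance, in a continuous setting one might pass the mean squared difference as `dist`, the identity as `estimate_fault`, the reciprocal as `fault_to_weight`, and a weighted mean as `aggregate`; outliers then receive a smaller weight than workers who agree with each other.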
3. Continuous Answers Domain
As specified in the preliminaries, each report is a vector of real-valued answers, one per question. The distance measure we use is the normalized squared Euclidean distance. (The squared Euclidean distance is not a true metric, since it violates the triangle inequality, but this is not required for our needs. It is often used as a dissimilarity measure in various clustering applications (Carter et al., 1989; Kosman and Leonard, 2005; Cha, 2007).)
The Independent Normal Noise model
For our theoretical analysis, we assume an independent normal noise (INN). Formally,
, where is a -dimensional noise.
In the simplest case the noise of each worker is i.i.d. across dimensions, whereas workers are also independent but with possibly different variance. We further assume that questions are equally difficult and not correlated given the fault level, i.e., the noise covariance of each worker is her fault level times the identity matrix. (In the INN model the equal-difficulty assumption is a normative decision, since we can always scale the data. Essentially, it means that we measure errors in standard deviations, giving the same importance to all questions.) Note that the expected distance of a worker's report from the ground truth equals her fault level.
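To make the INN model concrete, the following sketch samples an instance given a ground truth and a vector of fault levels. It assumes that the fault level of a worker is the per-question noise variance; the function name `sample_inn_instance` is hypothetical.

```python
import numpy as np

def sample_inn_instance(truth, fault_levels, rng):
    """Return an (n workers x m questions) array of noisy reports.

    Assumption: worker i adds independent N(0, fault_levels[i]) noise
    to every coordinate of the ground truth.
    """
    truth = np.asarray(truth, dtype=float)
    sigma = np.sqrt(np.asarray(fault_levels, dtype=float))
    noise = rng.normal(size=(len(sigma), len(truth))) * sigma[:, None]
    return truth + noise
```

Under this assumption the normalized squared distance of each report from the ground truth concentrates around the worker's fault level as the number of questions grows.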
3.1. Estimating Fault Levels
Given an instance , our first goal is to get a good estimate of the true fault level.
Fault estimation by distance from the empirical mean
When a simple aggregation method is available, a common approach, used as a step in many truth discovery algorithms, is to estimate the quality of each worker according to her distance from the aggregated outcome (Li et al., 2016). We name this approach Estimation of Fault Levels by Distance from the Outcome (D-EFL). In the continuous domain, we use D-EFL (Alg. 1) where is the mean function and is the squared Euclidean distance. However, we leave the notation general, as the algorithm can be used in other domains with appropriate distance and aggregation functions.
We analyze the properties of D-EFL in Section 3.2. First, however, we describe our own approach, which relies on the proxy distance.
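A minimal sketch of D-EFL in the continuous domain, assuming the aggregate is the unweighted mean over all workers and the distance is the squared Euclidean distance normalized by the number of questions (the function name `d_efl` is hypothetical):

```python
import numpy as np

def d_efl(reports):
    """Estimate each worker's fault level as her distance from the mean report."""
    reports = np.asarray(reports, dtype=float)
    outcome = reports.mean(axis=0)                     # aggregated answers
    return np.mean((reports - outcome) ** 2, axis=1)   # normalized squared distance
```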
Fault estimation by proxy distance
Applying Eq. (1) to the continuous domain, we get that the proxy distance of each worker is .
Note that once is fixed, the proxy distance is a random variable that depends on two separate randomizations. First, the sampling of the other workers’ fault levels from the proto-population; and second, the realization of a particular instance , where .
Theorem 1 (Anna Karenina principle for the INN model).
Suppose that instance is sampled from proto-population via the INN model. For every worker ,
Denote , which is a random variable.
We use the fact that the difference of two independent normal variables is also a normal variable whose expectation is the difference of expectations, and whose variance is the sum of the two variances. Denote by then ; since is the variance of for all ,
We continue by bounding the inner expression for .
as required. ∎
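For reference, the core of the calculation can be written out as follows. The notation is an assumption made for this illustration, since the symbols are not fixed in the surrounding text: $s_{ik}$ is worker $i$'s answer to question $k$, $f_i$ is her fault level (noise variance), $m$ is the number of questions, $n$ the number of workers, $\pi_i$ the proxy distance, and $\mu$ the mean of the proto-population.

```latex
\begin{align*}
\mathbb{E}\left[d(s_i, s_j) \mid f_i, f_j\right]
  &= \frac{1}{m}\sum_{k=1}^{m} \mathbb{E}\left[(s_{ik}-s_{jk})^2\right]
   = \frac{1}{m}\sum_{k=1}^{m} (f_i + f_j) = f_i + f_j, \\
\mathbb{E}\left[\pi_i \mid f_i\right]
  &= \frac{1}{n-1}\sum_{j \neq i} \left(f_i + \mathbb{E}[f_j]\right) = f_i + \mu .
\end{align*}
```

The first line uses the fact that $s_{ik}-s_{jk}$ is a zero-mean normal variable with variance $f_i+f_j$; the second takes expectation over the other workers' fault levels.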
What should be the value for ? If we know , we can of course use it. Otherwise, we can estimate it from the data. We first argue that lower values of would result in a more conservative estimation of : Consider two workers with and some . Denote by the estimate we get when using some parameter . Then it is easy to see that the ratio between and gets closer to 1 as we pick smaller .
We define Algorithm 3 for estimating as (where ). If we use , then is the average of . By the argument above, lower values of would result in more conservative estimation, therefore a default conservative value we could use is , in which case the estimated fault level in D-EFL would equal .
We can see the output of the P-EFL algorithm on four different instances in Fig. 1. The blue dots and green dots would be the output when using and , respectively. Ideally, we would like the estimated fault to be on the solid green line. Note that in real datasets, we do not have access to the “true” , and we use the distance from the ground truth instead.
We show in the next subsections that the value (which is less conservative than both) is also of interest.
3.2. Equivalence and Consistency of P-EFL and D-EFL
Theorem 2 (Equivalence of P-EFL and D-EFL).
Denote by , the output of algorithms -P-EFL and D-EFL, respectively. For any instance , and any worker , .
Note that does not depend on the particular instance or on the identity of the worker. Moreover, since in the continuous domain only relative fault matters (multiplying all by a constant is just a change of scale), we get D-EFL as a special case of the proxy-based algorithm. Note that this equivalence does not depend on any statistical assumptions.
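The algebraic identity behind this equivalence can be checked numerically. The constants below are derived here from the definitions (proxy distance averaged over the other n-1 workers, mean taken over all n workers) rather than copied from the theorem statement, so they should be read as an assumption of this sketch: the distance from the mean equals (n-1)/n times the proxy distance minus half the average proxy distance.

```python
import numpy as np

def proxy_distance_sq(reports):
    """Proxy distance under the normalized squared Euclidean distance."""
    reports = np.asarray(reports, dtype=float)
    n = reports.shape[0]
    diff = reports[:, None, :] - reports[None, :, :]
    pairwise = np.mean(diff ** 2, axis=2)      # diagonal entries are zero
    return pairwise.sum(axis=1) / (n - 1)

def distance_from_mean(reports):
    """D-EFL style estimate: distance of each report from the unweighted mean."""
    reports = np.asarray(reports, dtype=float)
    return np.mean((reports - reports.mean(axis=0)) ** 2, axis=1)

def d_efl_via_proxy(reports):
    """Same quantity, computed only from proxy distances (derived identity)."""
    pi = proxy_distance_sq(reports)
    n = len(pi)
    return (n - 1) / n * (pi - pi.mean() / 2)
```

The identity holds for arbitrary data, which is consistent with the remark that the equivalence does not depend on any statistical assumptions.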
Theorem 1 does not guarantee that estimated fault levels are good estimates. We want to verify that, at least for large instances from the INN model, they converge to the correct value. More precisely, an algorithm is consistent under a statistical model if, for any ground truth parameter and any positive tolerance, the probability that the outcome of the algorithm is farther than that tolerance from the ground truth (according to some measure) goes to 0 as the data size grows.
Theorem 3 (Consistency of D-EFL (continuous)).
When has bounded support and is bounded, D-EFL is consistent as , and . That is, for all , as , and .
When fault levels are known, the best aggregation method is well understood.
Proposition 4 ((Aitkin, 1935)).
Under the Independent Normal Noise model, is minimizing .
That is, the optimal way to aggregate the data is by taking a weighted mean of the answers of each question, where the weight of each worker is inversely proportional to her variance (i.e. to her fault level).
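As a small illustration of inverse-variance weighting, the sketch below aggregates each question by a weighted mean with weights proportional to the reciprocal of the fault level. The function name is hypothetical, and fault levels are assumed to be strictly positive.

```python
import numpy as np

def inverse_variance_mean(reports, fault_levels):
    """Weighted mean per question; weight of worker i is 1 / fault_levels[i]."""
    reports = np.asarray(reports, dtype=float)
    weights = 1.0 / np.asarray(fault_levels, dtype=float)  # assumes fault > 0
    return np.average(reports, axis=0, weights=weights)
```

With two workers reporting 0 and 3 and fault levels 1 and 2, the result is pulled toward the more reliable worker.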
We define the Proxy Truth Discovery (P-TD) algorithm for the continuous domain, by similarly combining Algorithm 4 with -P-EFL (Algorithm 2). When using , the P-TD and D-TD algorithms coincide by Thm. 2.
Note that both algorithms are well defined for any instance, whether the assumptions of the INN model hold or not. Moreover, in the INN model, Theorems 1 and 3 guarantee that with enough workers and questions, is a good estimate of the real fault level .
This of course does not yet guarantee that either algorithm returns accurate answers. For this, we need the following two results. The first says that good approximation of entails a good approximation of ; and the second says that in the limit, D-TD (and thus P-TD) return the correct answers. Recall that is the best possible estimation of by Prop. 4.
Theorem 5 ().
For any instance, such that for some ; it holds that .
Theorem 6 (Consistency of D-TD (continuous)).
When has bounded support and is bounded, D-TD is consistent as , and . That is, for any , for all , as , and .
3.4. Empirical Results
We compared the performance of -P-TD (which coincides with D-TD) to the baseline method UA (unweighted aggregation) on synthetic and real data. In addition, we created an “Oracle Aggregation” (OA) baseline, which runs the aggregation skeleton with the true fault level when available, or the empirical fault otherwise.
We also tried other values for the parameter , which had similar results.
We generated instances from the INN model, where (additional distributions in Appendix B). Each instance was generated by first sampling a population from , and then generating the instance from . The Buildings dataset was collected via Amazon Mechanical Turk. The Triangles dataset is from (Hart et al., 2018) (see Appendix B.1 for further details). We used each such dataset as a distribution over instances, where for each instance we sampled questions and workers uniformly at random with replacement. We then normalized each question so that its answers have mean and variance . For every combination of and we sampled 500 instances.
We can see in Fig. 2 that the P-TD/D-TD method is substantially better than unweighted mean almost everywhere.
4. Categorical Answers Domain
In this setting we have categorical (multiple-choice) questions. The ground truth is a vector , where . The distance measure we use is the normalized Hamming distance (note that for binary labels it coincides with the squared Euclidean distance):
We follow the same scheme of Section 3: given an instance , we first show how to estimate workers’ fault levels under a simple noise model, then transform them to weights and aggregate multiple answers.
Independent Errors Model
For our theoretical analysis, we assume an independent error (IER) model. Formally, for every question , and every worker , with probability ; and for any w.p. . That is, all wrong answers occur with equal probability. Note that .
We denote by the probability that two workers who are wrong select the same answer.
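The IER model can be simulated as follows. In this sketch labels are encoded as integers 0..k-1 (an encoding chosen for illustration), and a wrong answer is drawn uniformly from the k-1 incorrect labels, so two workers who are both wrong agree with probability 1/(k-1).

```python
import numpy as np

def sample_ier_instance(truth, fault_levels, k, rng):
    """Return an (n workers x m questions) array of labels in {0, ..., k-1}."""
    truth = np.asarray(truth)
    f = np.asarray(fault_levels, dtype=float)
    n, m = len(f), len(truth)
    wrong = rng.random((n, m)) < f[:, None]       # worker i errs w.p. f[i]
    shift = rng.integers(1, k, size=(n, m))       # uniform over 1..k-1
    return np.where(wrong, (truth + shift) % k, truth)
```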
4.1. Estimating Fault Levels
Fault estimation by distance from the Plurality outcome
As in the continuous case, it is a common practice to use a simple aggregation method (in this case Plurality) to estimate fault levels. We similarly denote the estimated fault level by , and refer to it as the D-EFL algorithm for the categorical domain.
Fault estimation by proxy distance
Applying the definition of the proxy distance (Eq. (1)) to the categorical domain, we get:
Theorem 1 (Anna Karenina principle for the IER model).
Suppose that instance is sampled from proto-population via the IER model. For every worker ,
The proof is somewhat more nuanced than in the continuous case. We first show that for every pair of workers,
Then, for every population,
and then we take expectation again over populations to prove the claim.
We get that there is a positive relation between and exactly when , i.e. when the average fault levels are below those of completely random answers.
To estimate from the data, the P-EFL algorithm (Alg. 5) reverses the linear relation.
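The linear relation can be made explicit under the IER model. Writing f_i for a worker's fault level, mu for the mean fault level, and theta for the probability that two wrong workers agree (symbols assumed for this illustration), two workers disagree on a question with probability f_i + f_j - (1 + theta) f_i f_j, so the expected proxy distance is mu + (1 - (1 + theta) mu) f_i. The sketch below computes proxy distances under the normalized Hamming distance and inverts this relation for a given estimate of mu; function names are hypothetical.

```python
import numpy as np

def proxy_distance_hamming(reports):
    """Average normalized Hamming distance of each worker from all others."""
    reports = np.asarray(reports)
    n = reports.shape[0]
    pairwise = np.mean(reports[:, None, :] != reports[None, :, :], axis=2)
    return pairwise.sum(axis=1) / (n - 1)

def p_efl_categorical(pi, k, mu_hat=0.0):
    """Invert E[pi] = mu + (1 - (1 + theta) * mu) * f, with theta = 1 / (k - 1).

    mu_hat = 0 is the conservative default, for which the estimate equals pi.
    Requires mu_hat < (k - 1) / k so that the slope is positive.
    """
    theta = 1.0 / (k - 1)
    return (np.asarray(pi, dtype=float) - mu_hat) / (1.0 - (1.0 + theta) * mu_hat)
```

The slope is positive exactly when the mean fault level is below (k-1)/k, the error rate of uniformly random answers.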
Setting parameter values
As in the continuous domain, setting means that is the average of . Also as in the continuous domain, we can use as a conservative default value, in which case .
In contrast to the continuous case, it is obvious that the estimates we get from the P-EFL and the D-EFL algorithms are not at all equivalent. To see why, note that a small change in the report of a single worker may completely change the Plurality outcome (and thus the fault estimates of all workers in the D-EFL algorithm), but only has a gradual effect on P-EFL.
We do not know whether D-EFL is consistent, but we can show that P-EFL is.
Theorem 2 (Consistency of P-EFL).
Suppose the support of is a closed subset of and . Then -P-EFL is consistent as and . That is, for any , for all , as and .
How good is this estimation of P-EFL for a given population? Rephrasing Theorem 1, we can write
That is, the proxy distance is a sum of two components. The first is the actual fault level (the “signal”). The second one decreases the signal proportionally to . Thus the lower is, the better estimation we get on average. We can see this effect in Fig. 3: the top left figure presents an instance with lower than the figure to its right, so the dependency of on is stronger and we get a better estimation. The top right figure has the same as the middle one, but higher , thus fault levels are more spread out and easier to estimate. The two figures on the bottom left demonstrate that a good fit is not necessarily due to the IER model, as the estimated faults for the real dataset in the middle are much more accurate.
The bottom right figure shows that P-EFL is somewhat more accurate than D-EFL on average.
The vector that minimizes the expected distance to the ground truth is also the MLE under equal prior. This is since we simply try to find the most likely answer for each question. When fault levels are known, this was studied extensively in a binary setting (Grofman et al., 1983; Shapley and Grofman, 1984; Nitzan and Paroush, 1985). Specifically, Grofman et al. (1983) identified the optimal rule for binary aggregation. Ben-Yashar and Paroush (2001) extended these results to questions with multiple answers. (Ben-Yashar and Paroush (2001) also considered other extensions including unequal priors, distinct utilities for the decision maker, and general confusion matrices instead of equal-probability errors. In all these cases the optimal decision rule is not necessarily a weighted plurality rule, and generally requires comparing all pairs of answers.)
For a worker with fault level , we denote . We refer to as the Grofman weights of population .
Suppose that is a random instance from the IER model. Let . Then is the maximum likelihood estimator of (and thus also minimizes in expectation).
That is, is the result of a weighted plurality rule, where the optimal weight of is her Grofman weight (note that it depends only on ). Note that workers whose fault level is above random error () get a negative weight. Of course, since we have no access to the true fault level, we cannot use Grofman weights directly.
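A sketch of weighted plurality with log-odds weights follows. The weight formula log((1 - f) / (theta f)) with theta = 1/(k-1) is derived here from the IER likelihood and reduces to the familiar log((1 - f)/f) in the binary case; it is stated as an assumption of this illustration, as are the function names and the integer label encoding 0..k-1. Fault levels must lie strictly between 0 and 1.

```python
import numpy as np

def grofman_weights(fault_levels, k):
    """Log-odds weight per worker; negative when f exceeds (k - 1) / k."""
    f = np.asarray(fault_levels, dtype=float)
    theta = 1.0 / (k - 1)
    return np.log((1.0 - f) / (theta * f))

def weighted_plurality(reports, weights, k):
    """Per question, return the label with the largest total weight."""
    reports = np.asarray(reports)
    weights = np.asarray(weights, dtype=float)
    scores = np.zeros((reports.shape[1], k))
    for label in range(k):
        scores[:, label] = ((reports == label) * weights[:, None]).sum(axis=0)
    return scores.argmax(axis=1)
```

In the usage below, one highly reliable worker outvotes two mediocre ones, whereas unweighted plurality would side with the majority.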
Prop. 3 suggests a simple aggregation skeleton, which is the same as Alg. 4, except it uses instead of , and sets weights to . (This skeleton also appears as Alg. 8 in the appendix for completeness.) D-TD and -P-TD are the combinations of this categorical skeleton with D-EFL and with -P-EFL, respectively.
As in the continuous case, the algorithm is well-defined for any categorical dataset, but in the special case of the IER noise model we get that the workers’ weights are a reasonable estimate of the Grofman weights due to Theorems 1 and 2. Lastly for this section, we show that P-TD is consistent.
Theorem 4 (Consistency of P-TD).
Suppose the support of is a closed subset of and . Then -P-TD is consistent as and . That is, for any , as and .
4.3. Empirical results
We compared the performance of P-TD to the competing methods UA (which returns ) and D-TD on synthetic and real data. In all simulations we used the default parameter (i.e., -P-TD). Averages are over 1000 samples for each and . The oracle benchmark OA returns . Note that OA and UA are not affected by the number of questions.
For synthetic data we generated instances from the IER model: one distribution with Yes/No questions (), where ; and another with multiple-choice questions (), where (additional distributions in Appendix D). In addition, we used three datasets from (Shah and Zhou, 2015) (Flags, GoldenGate, Dogs) and one that we collected (DotsBinary). Their description is in Appendix D.1.
4.4. Iterative methods
There are more advanced methods for truth discovery that are based on the following reasoning: a good estimate of workers’ fault levels leads to aggregated answers that are close to the ground truth (by appropriately weighing the workers); and a good approximation of the ground truth can get us a good estimate of workers’ competence (by measuring their distance from the approximate answers). Thus we can iteratively improve both estimates (see, e.g., Li et al. (2016), Section 2.2.2). The Iterative D-TD algorithm (ID-TD, see Alg. 6) captures this reasoning. Note that the D-TD algorithm is a special case of ID-TD with a single iteration.
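The alternation described here can be sketched for categorical data as follows. This is an illustration of the general idea, not a reproduction of Alg. 6: the default number of iterations, the clipping constant `eps`, and the function names are all assumptions made for the example.

```python
import numpy as np

def weighted_plurality(reports, weights, k):
    """Per question, return the label (0..k-1) with the largest total weight."""
    scores = np.zeros((reports.shape[1], k))
    for label in range(k):
        scores[:, label] = ((reports == label) * weights[:, None]).sum(axis=0)
    return scores.argmax(axis=1)

def iterative_d_td(reports, k, iterations=5, eps=1e-3):
    """Alternate between aggregating answers and re-estimating fault levels."""
    reports = np.asarray(reports)
    weights = np.ones(reports.shape[0])           # start from unweighted plurality
    for _ in range(iterations):
        outcome = weighted_plurality(reports, weights, k)
        fault = np.mean(reports != outcome, axis=1)    # distance from the outcome
        fault = np.clip(fault, eps, 1.0 - eps)         # keep the log finite (assumption)
        weights = np.log((1.0 - fault) * (k - 1) / fault)
    return outcome, fault
```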
We can adopt a similar iterative approach for estimating fault levels using proxy voting. Intuitively, in each iteration we compute the proxy distance of a worker according to her weighted average distance from all workers, and then recalculate the estimated fault levels and weights. Note that as in the single step P-EFL, this estimation does not require any aggregation step. We avoid estimating and instead use the default value of , which means that equals the proxy distance. The complete pseudocode is in Alg. 7.
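A rough sketch of the iterative proxy-based estimate is given below. It follows the verbal description only: the exact update in Alg. 7 is not reproduced in this section, so the log-odds weights, the replacement of negative weights by zero, the clipping constant, and the fallback to uniform weights are all assumptions of this illustration.

```python
import numpy as np

def iterative_p_efl(reports, k, iterations=5, eps=1e-3):
    """Estimate fault levels by repeatedly computing weighted proxy distances."""
    reports = np.asarray(reports)
    n = reports.shape[0]
    dist = np.mean(reports[:, None, :] != reports[None, :, :], axis=2)
    weights = np.ones(n)                          # first pass is the unweighted proxy distance
    fault = np.zeros(n)
    for _ in range(iterations):
        for i in range(n):
            others = np.arange(n) != i
            w = weights[others]
            if w.sum() <= 0:                      # fallback (assumption): uniform weights
                w = np.ones(n - 1)
            fault[i] = np.dot(w, dist[i, others]) / w.sum()
        fault = np.clip(fault, eps, 1.0 - eps)    # keep the log finite (assumption)
        weights = np.maximum(np.log((1.0 - fault) * (k - 1) / fault), 0.0)
    return fault
```

Note that no aggregation step appears anywhere in the loop: only pairwise distances between workers are used.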
Intuitive analysis of IP-EFL
For ease of presentation, we assume for the analysis in the remainder of this section. Recall again that for unweighted proxy distance we have by Thm. 1:
Ideally, we would like to make this second part smaller to strengthen the signal. We argue that weighted proxy distance obtains just that. We do not provide a formal proof, but rather approximate calculations that should provide some intuition. Exact calculations are complicated due to correlations among terms.
We assume that fault levels are already determined, and take expectation only over realization of workers’ answers.
Lemma 5 ().
In step of the iterative algorithm,
The proof simply repeats the steps of the proof of Theorem 1.
Recall that is the actual optimal weight (Grofman weight) of worker . We also denote . For values of not too far from , is an approximation of (Grofman et al., 1983). Thus .
Recall that is the variance of . Consider the “noisy” part that multiplies above. In expectation, the numerator satisfies:
Similarly, in expectation, the denominator satisfies .
If we neglect both the correlation between numerator and denominator, and the fact that is only an approximation of , we get that:
We conclude from Lemma 5 and the above discussion that after enough iterations, .
Since , this noise term is not larger than the noise in the unweighted P-EFL algorithm ( in Eq. (6)). We therefore expect that if is already a reasonable estimation of then accuracy will grow with further iterations.
Empirical results for iterative algorithms
The Iterative Proxy-based Truth Discovery (IP-TD) algorithm combines our aggregation skeleton with IP-EFL. We can see how adding iterations affects the performance of P-TD on synthetic data in Fig. 4. We further compare the performance of IP-TD to ID-TD on more distributions and datasets in the third row of Fig. 5 (and in Appendix D). For both algorithms we used iterations. A higher number of iterations had little effect on the results. Note that as we use , our IP-TD algorithm never explicitly estimates or , yet it manages to take advantage of the variance among workers. We do see instances, however, where the initial estimation is off, and any additional iteration makes it worse.
5. Ranking Domain
Consider a set of alternatives , where is the set of all rankings (permutations) over . Each pairwise relation over corresponds to a binary vector where (i.e., each dimension is a pair of candidates). In particular, every ranking has a corresponding vector . A vector is called transitive if it corresponds to some ranking s.t. . The ground truth is a transitive vector (equivalently, a ranking ). A natural metric over rankings is the Kendall-tau distance (a.k.a. swap distance): .
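The correspondence between rankings and binary pairwise vectors, and the resulting Kendall-tau distance, can be sketched as follows. Rankings are represented here as lists of alternatives from best to worst, and the distance is returned as an unnormalized count of disagreeing pairs; both choices are assumptions of this illustration (dividing by the number of pairs gives a normalized variant).

```python
from itertools import combinations

def ranking_to_pairwise(ranking):
    """Map a ranking to {(a, b): 1 if a is ranked above b else 0} for all a < b."""
    pos = {alt: rank for rank, alt in enumerate(ranking)}
    return {(a, b): int(pos[a] < pos[b]) for a, b in combinations(sorted(ranking), 2)}

def kendall_tau(r1, r2):
    """Number of pairs of alternatives on which the two rankings disagree."""
    x, y = ranking_to_pairwise(r1), ranking_to_pairwise(r2)
    return sum(x[pair] != y[pair] for pair in x)
```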
Independent Condorcet Noise Model
According to the Independent Condorcet Noise (ICN) model, an agent with fault level observes a vector where for every pair of candidates , we have with probability . In particular, may not be transitive. (Classically, the Condorcet model assumes all voters have the same parameter (Young, 1988).)
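A sketch of sampling from the ICN model, together with a transitivity check, is given below. The representation (a list of pairs plus a 0/1 vector) and the function names are assumptions of this illustration. The check uses the standard fact that a complete pairwise relation over n alternatives is transitive exactly when the alternatives' win counts are 0, 1, ..., n-1.

```python
import numpy as np
from itertools import combinations

def sample_icn(truth_ranking, fault, rng):
    """Flip each pairwise comparison of the true ranking independently w.p. fault."""
    pos = {alt: rank for rank, alt in enumerate(truth_ranking)}
    pairs = list(combinations(sorted(truth_ranking), 2))
    x = np.array([int(pos[a] < pos[b]) for a, b in pairs])
    flips = rng.random(len(pairs)) < fault
    return pairs, np.where(flips, 1 - x, x)

def is_transitive(pairs, vec):
    """True if the pairwise vector corresponds to some ranking."""
    alts = sorted({alt for pair in pairs for alt in pair})
    wins = {alt: 0 for alt in alts}
    for (a, b), v in zip(pairs, vec):
        wins[a if v else b] += 1
    return sorted(wins.values()) == list(range(len(alts)))
```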
The downside of the Condorcet noise model is that it may result in nontransitive answers. The Mallows model is similar, except it is guaranteed to produce transitive answers (rankings).
Formally, given ground truth and parameter , the probability of observing order is proportional to . Thus for a parameter value of 1 we get a uniform distribution, whereas for low parameter values we get orders concentrated around the ground truth.
In fact, if we throw away all non-transitive samples, the probability of getting rank under the Condorcet noise model with parameter (conditional on the outcome being transitive) is exactly the same as the probability of getting under Mallows Model with parameter .
5.1. Estimating Fault Levels
By definition, the ICN model is a special case of the IER model, where , , and the ground truth is a transitive vector. We thus get the following result as an immediate corollary of Theorem 1, and can therefore use P-EFL directly.
Theorem 1 (Anna Karenina principle for the Condorcet model).
Suppose that instance is sampled from population via the ICN model, where all are sampled independently from proto-population with expected value . For every worker , .
Note that while our results on fault estimation from the binary domain directly apply (at least to the ICN model), aggregation is trickier: an issue-by-issue aggregation may result in a non-transitive (thus invalid) solution. The voting rules we consider are guaranteed to output a valid ranking.
The problem of retrieving given votes is a classical problem, and in fact any social welfare function offers a possible solution (Brandt et al., 2016).
There is a line of work that deals with finding the MLE under various assumptions on the noise model (Ben-Yashar and Paroush, 2001; Drissi-Bakhkhat and Truchon, 2004). In general, these estimators may take a complicated form that depends on all parameters of the distribution. Yet some cases are simpler.
The Kemeny rule and optimal aggregation
It is well known that for both the Condorcet noise model and the Mallows model, when all voters have the same fault level, the maximum likelihood estimator of is obtained by applying the Kemeny-Young voting rule on (Young, 1988).
The Kemeny-Young rule (henceforth, Kemeny) computes the binary vector that corresponds to the majority applied separately on every pair of candidates (that is, ); then .
In particular it can be applied when is composed of transitive vectors.
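The rule as described here (pairwise majority followed by the closest ranking) can be sketched by brute force for a small number of alternatives. The per-voter weights, the tie-breaking (ties in the pairwise majority go to the second alternative of the pair, ties among rankings to the lexicographically first), and the function name are assumptions of this illustration; enumeration over all permutations is only feasible for a handful of alternatives.

```python
from itertools import combinations, permutations

def kemeny_from_majority(rankings, weights):
    """Return the ranking closest (in pairwise disagreements) to the weighted majority vector."""
    alts = sorted(rankings[0])
    pairs = list(combinations(alts, 2))
    majority = {}
    for a, b in pairs:
        margin = sum(w if r.index(a) < r.index(b) else -w
                     for r, w in zip(rankings, weights))
        majority[(a, b)] = int(margin > 0)        # 1 means a is ranked above b

    def distance(order):
        pos = {alt: rank for rank, alt in enumerate(order)}
        return sum(int(pos[a] < pos[b]) != majority[(a, b)] for a, b in pairs)

    return list(min(permutations(alts), key=distance))
```

When the majority vector is itself transitive, the output is simply the corresponding ranking; otherwise the rule resolves the cycle by reversing as few pairs as possible.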
A natural question is whether there is a weighted version of Kemeny-Young (KY) that is an MLE and/or minimizes the expected distance to when fault levels are known. We did not find any explicit reference to this question or to the case of distinct fault levels in general. There are some other extensions: Drissi-Bakhkhat and Truchon (2004) deal with a different variation of the noise model where it is less likely to swap pairs that are further apart. Xia and Conitzer (2011) extend the Kemeny rule to deal with more general noise models and partial orders.
As it turns out, using weighted KY with (binary) Grofman weights provides us with (at least) an approximately optimal outcome.
Proposition 2 ().
Suppose that the ground truth is sampled from a uniform prior on . Suppose that instance is sampled from population via the ICN model. Let . Let be any random variable that may depend on . Then , where expectation is over all instances.
Consider the ground truth binary vector . Let . Let be an arbitrary random variable that may depend on the input profile. For every (i.e., every pair of elements), we know from Prop. 3 that is the MLE for , and thus . Now, recall that by definition of the KY rule, is the closest ranking to . Also denote .
(since is the closest transitive vector to )
In particular, this holds for any transitive vector corresponding to ranking , thus
as required. ∎
Prop. 2 provides some justification for using the Kemeny voting rule with Grofman weights for aggregating rankings when the fault levels are known. We can now apply a similar reasoning as in the binary case to estimate . Given a set of rankings and any voting rule (not necessarily Kemeny!), the P-TD algorithm is a combination of Alg. 8 with and instead of , and Alg. 5 with the Kendall-tau distance and . Since the meaning of negative weights is not clearly defined, we replace every negative weight with (the full description appears as Alg. 9 in the appendix for completeness).
5.3. Empirical results
We compared the performance of P-TD using 8 different voting rules: Borda, Copeland, Kemeny-Young (with weighted and unweighted majority graph), Plurality, Veto, Random dictator, and Best dictator.