1. Introduction
Consider a standard crowdsourcing task such as image labeling (Deng et al., 2014; McLaughlin) or corpus annotation (Sabou et al., 2014).¹

¹Crowdsourcing is also used for a variety of tasks in which there is no “ground truth,” such as studying vocabulary size (Keuleers et al., 2015), rating the quality of an image (Ribeiro et al., 2011), etc. In this paper, we focus only on questions for which there is a well-defined true answer, as in the examples above.
Such tasks are often used to construct large databases that can later be used to train and test machine learning algorithms. Crowdsourcing workers are usually not experts, thus answers obtained this way often contain many mistakes (Kazai et al., 2011; Vuurens et al., 2011; Wais et al., 2010). Truth discovery is a general name for a broad range of methods that aim to extract some underlying ground truth from noisy answers. While the mathematics of truth discovery dates back to the early days of statistics, at least to the Condorcet Jury Theorem (Condorcet and others, 1785), the rise of crowdsourcing platforms suggests an exciting modern application for truth discovery.
A simple approach to improve accuracy in crowdsourcing applications is to ask the same question to a number of workers and to aggregate their answers by some aggregation rule. The use of aggregation rules suggests a natural connection between truth discovery and social choice, which deals with the aggregation of voters’ opinions and preferences. Indeed, voting rules have proven useful in the design of truth discovery and crowdsourcing techniques (Conitzer and Sandholm, 2005; Mao et al., 2013; Caragiannis et al., 2013). It is our intention to further explore and exploit these connections in the current paper.
In political and organizational elections, a common practice is to allow voting by proxy, where some voters let others vote on their behalf (Green-Armytage, 2015; Brill, 2018). Thus the aggregation is performed over a subset of “active” voters, who are weighted by the number of “inactive” voters with similar opinions. In a recent paper, Cohensius et al. (2017) showed that under some assumptions on the distribution of preferences, proxy voting reduces the variance of the outcome, and thus requires fewer active voters to reach the socially optimal alternative. Cohensius et al. suggested that an intuitive explanation for the effectiveness of proxy voting lies in the fact that the more ‘representative’ voters tend to also be similar to one another, but did not provide a formal justification of that claim. Further, in their model the designer seeks to approximate the subjective “preferences of the society,” whereas truth discovery is concerned with questions for which there is an objective ground truth.
In this paper, we consider algorithms for truth discovery that are inspired by proxy voting, with crowdsourcing as our main motivation. Our goal is to develop a simple approach for tackling the following challenge in a broad range of domains: we are given a set of workers, each answering multiple questions, and want to (a) identify the competent workers; and (b) aggregate workers’ answers such that the outcome will be as close as possible to the true answers. These challenges are tightly related: a good estimate of workers’ competence allows us to use better aggregation methods (e.g., by giving higher weight to good workers), and aggregated answers can be used as an approximation of the ground truth to assess workers’ competence. Indeed, several current approaches in truth discovery tackle these goals jointly (see Related Work below).
Our approach decouples the above goals. For (a), we apply proxy voting to estimate each worker’s competence, where each worker increases the estimated competence of similar workers. For the truth discovery problem (b), we then use a straightforward aggregation function (e.g., Majority or Average), giving a higher weight to more competent workers (i.e., those who are closer to others).
We depart from previous work on proxy voting mentioned above by dropping the assumption that each worker delegates her vote to the (single) nearest proxy. While this requirement makes sense in a political setting so as to keep the voting process fair (one vote per person), it is somewhat arbitrary when our only goal is a good estimation of workers’ competence and the ground truth.
For analysis purposes, we use distances rather than proximity. We assume that each worker has some underlying fault level, which is the expected distance between her answers and the ground truth. The question of optimal aggregation when competence/fault levels are known is well studied in the literature, and hence our main challenge is to estimate fault levels. To capture the positive influence of similar workers, we define the proxy distance of each worker as her average distance from all other workers. While the similarity between workers has been considered in the literature (see Related Work), we are unaware of any systematic study of its uses. Our main theoretical result can be stated as follows:
Theorem 0 (Anna Karenina principle, informal).
The expected proxy distance of each worker is linear in her fault level.
Essentially, the theorem says that, as in Tolstoy’s novel, “good workers are all alike” (thereby keeping one another’s proxy distances low), whereas “each bad worker is bad in her own way” and is thus not particularly close to other workers. The exact linear function depends on the statistical model, and in particular on whether the data is categorical or continuous.
The Anna Karenina principle suggests a natural proxy-based truth discovery (PTD) algorithm that first estimates fault levels based on proxy distances, and then uses standard techniques from the truth discovery literature to assign workers’ weights and aggregate their answers. We emphasize that a good estimate of the fault levels may be of interest regardless of the aggregation procedure (this is goal (a) above). For example, the operators of the crowdsourcing platform may use it to decide on payments for workers, or for terminating the contract with low-quality workers.
1.1. Contribution and paper structure
Our main theoretical contribution is a formal proof of Theorem 1 in the following domains: (i) when answers are continuous with independent normal noise, and the answers of each worker have a worker-specific variance; (ii) when answers are categorical and each worker fails each question independently with some worker-specific probability; (iii) when answers are rankings of alternatives sampled from the Condorcet noise model with a worker-specific parameter. In all three domains, the parameters of the linear function depend on the distribution from which fault levels are sampled. We show conditions under which our estimates of the true fault levels and of the ground truth are statistically consistent. In the continuous domain, we further show that our proxy-based method generalizes another common approach for fault estimation.

We devote one section to each domain (continuous, categorical, rankings). In each section, the theoretical results are followed by an extensive empirical evaluation of truth discovery methods on synthetic and real data. We compare PTD to standard (unweighted) aggregation and to other approaches suggested in the truth discovery literature. We show that PTD is substantially better than straightforward unweighted aggregation, and often beats other approaches to weighted aggregation. We also show how to extend PTD to an iterative algorithm that competes well with more advanced approaches.
Due to space constraints and for continuity of reading, most proofs are deferred to appendices. The appendices also contain additional figures showing that the findings presented in the paper apply broadly.
2. Preliminaries
We denote by the indicator variable of the Boolean condition, and by the set of probability distributions over a set.

A domain is a tuple where is a set of possible world states and is a distance measure.
We assume there is a fixed set of workers, denoted by . An instance in domain is a set of reports where for all ; and a ground truth . To make things concrete, we will consider three domains in particular:

Continuous domain. Here we have questions with realvalued answers. Thus ; and we define to be the squared normalized Euclidean distance.

Categorical domain. Here we have questions with categorical answers in some finite set . Thus ; and we define as the normalized Hamming distance.

Rankings domain. Here the possible answers are all orders over a set of elements, with the Kendall-tau distance.
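To make the three distance measures concrete, here is a small sketch in Python. The function names are ours; following the text, the Euclidean and Hamming distances are normalized by the number of questions, while Kendall-tau is left as a raw swap count:

```python
import numpy as np

def squared_euclidean(x, y):
    """Normalized squared Euclidean distance between two real answer vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.mean((x - y) ** 2))

def hamming(x, y):
    """Normalized Hamming distance between two categorical answer vectors."""
    x, y = np.asarray(x), np.asarray(y)
    return float(np.mean(x != y))

def kendall_tau(r, s):
    """Kendall-tau (swap) distance: number of pairs ordered differently
    in the two rankings r and s."""
    pos_s = {a: i for i, a in enumerate(s)}
    n = len(r)
    return sum(1
               for i in range(n)
               for j in range(i + 1, n)
               if pos_s[r[i]] > pos_s[r[j]])
```

For example, two rankings of three alternatives that are exact reverses of each other disagree on all three pairs.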
A noise model in domain is a function from to . We consider noise models that are informative, in the sense that: (I) returns w.p. 1 (i.e., without noise); and (II) is strictly increasing in , for any moment (i.e., a higher fault level means more noise).

A population in domain is a set of workers, each with a fault level. A proto-population is a distribution over fault levels. We denote by and the mean and variance of the distribution, respectively, and omit them when clear from the context. Note that the higher the fault level is, the more erroneous answers we should expect.
Generated instances
Fix a particular domain. Given a population, a noise model, and a ground truth, we can generate an instance by sampling each answer independently. We can similarly generate instances from a proto-population by first sampling the fault levels, and then sampling an instance from the resulting population.
Aggregation
An aggregation function in a domain is a function . In this work we consider simple aggregation functions: is the Mean function in the continuous domain; and is the Plurality function in the categorical domain. The functions are applied to each question independently. Formally: and
In the ranking domain we apply several popular voting rules,² including Plurality, Borda, Veto, Copeland, and Kemeny. All the aggregation functions we consider have natural weighted analogs, denoted as for a weight vector .

²The more accurate term for functions that aggregate several rankings into a single ranking is social welfare functions (Brandt et al., 2016).

Aggregation errors
Given an instance and aggregation function or algorithm in domain , the error of on is defined as . The goal of truth discovery is to find aggregation functions that tend to have low error.
For the convenience of the reader, a list of notation and acronyms is available at the end of the appendix.
2.1. A general Proxy-based Truth-Discovery scheme
Given an instance and a basic aggregation function, we apply the following workflow, whose specifics depend on the domain.

Collect the answers from all workers.

Compute the pairwise distance for each pair of workers.

For each worker :

Compute proxy distance by averaging over all pairwise distances:
(1) 
Estimate the fault level from .

Transform fault level to a weight .


Aggregate and return the estimated true answers .
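The workflow above can be sketched generically in Python. Everything here is our own hypothetical instantiation: the domain-specific pieces (distance, fault-estimate map, fault-to-weight map, weighted aggregation) are passed in as parameters, and the demo below uses the continuous domain with the identity fault map and 1/fault weights:

```python
import numpy as np

def proxy_truth_discovery(reports, distance, estimate_fault, to_weight, aggregate):
    """Generic sketch of the proxy-based scheme.
    reports: list of n answer vectors; distance: pairwise distance function;
    estimate_fault: maps a proxy distance to an estimated fault level;
    to_weight: maps an estimated fault level to a worker weight;
    aggregate: weighted aggregation function over the reports."""
    n = len(reports)
    # Step 2: pairwise distances between all workers.
    d = np.array([[distance(reports[i], reports[j]) for j in range(n)]
                  for i in range(n)])
    # Step 3a: proxy distance = average distance to all other workers (Eq. (1)).
    pi = d.sum(axis=1) / (n - 1)
    # Steps 3b-3c: estimated fault levels and weights.
    faults = np.array([estimate_fault(p) for p in pi])
    weights = np.array([to_weight(f) for f in faults])
    # Step 4: weighted aggregation of the reports.
    return faults, aggregate(reports, weights)
```

In a small continuous example, a worker whose answers are far from everyone else's receives a larger proxy distance, hence a larger estimated fault and a smaller weight.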
By Theorem 1, the (expected) proxy distance of each agent is a linear transformation of her fault level. Thus, we implement step 3b by reversing the linear transformation to get an estimated fault level, and then use known results to obtain the (estimated) optimal weight in step 3c. The details are given in the respective sections, where we refer to the algorithms that return the estimated fault levels and the estimated answers as Proxy-based Estimation of Fault Levels (PEFL) and Proxy-based Truth Discovery (PTD), respectively.

3. Continuous Answers Domain
As specified in the preliminaries, each report is a real-valued vector. The distance measure we use is the normalized squared Euclidean distance.³

³The squared Euclidean distance is not a true metric (it violates the triangle inequality), but this is not required for our needs. The squared Euclidean distance is often used as a dissimilarity measure in various clustering applications (Carter et al., 1989; Kosman and Leonard, 2005; Cha, 2007).
The Independent Normal Noise model
For our theoretical analysis, we assume independent normal noise (INN). Formally, each report is the ground truth plus a noise vector. In the simplest case the noise of each worker is i.i.d. across dimensions, whereas workers are also independent but with possibly different variance. We further assume that questions are equally difficult⁴ and uncorrelated given the fault level.

⁴In the INN model the equal-difficulty assumption is a normative decision, since we can always scale the data. Essentially, it means that we measure errors in standard deviations, giving the same importance to all questions.

3.1. Estimating Fault Levels
Given an instance, our first goal is to get a good estimate of the true fault levels.
Fault estimation by distance from the empirical mean
In situations where there is a simple aggregation method, a common step in many truth discovery algorithms is to estimate the quality of each worker according to her distance from the aggregated outcome (Li et al., 2016). We name this approach Estimating Fault Levels by Distance from the Outcome (DEFL). In the continuous domain, we use DEFL (Alg. 1) where the aggregation function is the mean and the distance is the squared Euclidean distance; however, we keep the notation general, as the algorithm can be used in other domains with appropriate distance and aggregation functions.
We analyze the properties of DEFL later, in Section 3.2. But first, we describe our own approach, which relies on the proxy distance.
Fault estimation by proxy distance
Applying Eq. (1) to the continuous domain, we get that the proxy distance of each worker is .
Note that once the population is fixed, the proxy distance is a random variable that depends on two separate randomizations: first, the sampling of the other workers’ fault levels from the proto-population; and second, the realization of a particular instance.

Theorem 1 (Anna Karenina principle for the INN model).
Suppose that instance is sampled from protopopulation via the INN model. For every worker ,
Proof.
Denote , which is a random variable.
We use the fact that the difference of two independent normal variables is also a normal variable whose expectation is the difference of expectations, and whose variance is the sum of the two variances. Denote by then ; since is the variance of for all ,
(2) 
We continue by bounding the inner expression for .
Finally,
as required. ∎
By Theorem 1, given and an estimate of , we can extract an estimate of , which suggests the PEFL algorithm (Alg. 2).
Estimating parameters
What should be the value for ? If we know , we can of course use it. Otherwise, we can estimate it from the data. We first argue that lower values of would result in a more conservative estimation of : Consider two workers with and some . Denote by the estimate we get when using some parameter . Then it is easy to see that the ratio between and gets closer to 1 as we pick smaller .
We define Algorithm 3 for estimating as (where ). If we use , then is the average of . By the argument above, lower values of would result in more conservative estimation, therefore a default conservative value we could use is , in which case the estimated fault level in DEFL would equal .
We can see the output of the PEFL algorithm on four different instances in Fig. 1. The blue dots and green dots would be the output when using and , respectively. Ideally, we would like the estimated fault to be on the solid green line. Note that in real datasets, we do not have access to the “true” , and we use the distance from the ground truth instead.
We show in the next subsections that the value (which is less conservative than both) is also of interest.
3.2. Equivalence and Consistency of PEFL and DEFL
Theorem 2 (Equivalence of PEFL and DEFL).
Denote by , the output of algorithms PEFL and DEFL, respectively. For any instance , and any worker , .
Note that does not depend on the particular instance or on the identity of the worker. Moreover, since in the continuous domain only relative fault matters (multiplying all fault levels by a constant is just a change of scale), we get DEFL as a special case of the proxy-based algorithm. Note that this equivalence does not depend on any statistical assumptions.
Theorem 1 does not guarantee that the estimated fault levels are good estimates. We want to verify that, at least for large instances from the INN model, they converge to the correct values. More precisely, an algorithm is consistent under a statistical model if, for any ground truth parameter and any tolerance, the probability that the outcome of the algorithm is farther than that tolerance from the ground truth (according to some measure) goes to 0 as the data size grows.
Theorem 3 (Consistency of DEFL (continuous)).
When has bounded support and is bounded, DEFL is consistent as , and . That is, for all , as , and .
3.3. Aggregation
When fault levels are known, the best aggregation method is well understood.
Proposition 4 ((Aitkin, 1935)).
Under the Independent Normal Noise model, is minimizing .
That is, the optimal way to aggregate the data is by taking a weighted mean of the answers of each question, where the weight of each worker is inversely proportional to her variance (i.e. to her fault level).
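Under this proposition, the aggregation step itself is a one-liner. A minimal sketch (the function name is ours; weights are 1/fault, which is inverse-variance weighting up to scaling):

```python
import numpy as np

def weighted_mean(reports, faults):
    """Aggregate continuous answers question by question, weighting each
    worker inversely proportionally to her (estimated) fault level."""
    reports = np.asarray(reports, float)
    weights = 1.0 / np.asarray(faults, float)
    return np.average(reports, axis=0, weights=weights)
```

For instance, a worker with fault level 4 gets a quarter of the weight of a worker with fault level 1, so the aggregate is pulled four times more strongly toward the latter's answers.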
A common approach to truth discovery is to combine Algorithm 4 with Algorithm 1 (DEFL). We refer to this simple algorithm as the Distance-based Truth Discovery (DTD) algorithm.
We define the Proxy Truth Discovery (PTD) algorithm for the continuous domain, by similarly combining Algorithm 4 with PEFL (Algorithm 2). When using , the PTD and DTD algorithms coincide by Thm. 2.
Note that both algorithms are well defined for any instance, whether the assumptions of the INN model hold or not. Moreover, in the INN model, Theorems 1 and 3 guarantee that with enough workers and questions, is a good estimate of the real fault level .
This of course does not yet guarantee that either algorithm returns accurate answers. For this, we need the following two results. The first says that good approximation of entails a good approximation of ; and the second says that in the limit, DTD (and thus PTD) return the correct answers. Recall that is the best possible estimation of by Prop. 4.
Theorem 5 ().
For any instance, such that for some ; it holds that .
Theorem 6 (Consistency of DTD (continuous)).
When has bounded support and is bounded, DTD is consistent as , and . That is, for any , for all , as , and .
3.4. Empirical Results
We compared the performance of PTD (which coincides with DTD) to the baseline method UA on synthetic and real data. In addition, we created an “Oracle Aggregation” (OA) baseline, which runs the aggregation skeleton with the true fault level when available, or the empirical fault otherwise.
We also tried other values for the parameter , which had similar results.
We generated instances from the INN model, where (additional distributions in Appendix B). Each instance was generated by first sampling a population from , and then generating the instance from . The Buildings dataset was collected via Amazon Mechanical Turk. The Triangles dataset is from (Hart et al., 2018) (see Appendix B.1 for further details). We used each such dataset as a distribution over instances, where for each instance we sampled questions and workers uniformly at random with replacement. We then normalized each question so that its answers have mean and variance . For every combination of and we sampled 500 instances.
We can see in Fig. 2 that the PTD/DTD method is substantially better than unweighted mean almost everywhere.
4. Categorical Answers Domain
In this setting we have categorical (multiplechoice) questions. The ground truth is a vector , where . The distance measure we use is the normalized Hamming distance (note that for binary labels it coincides with the squared Euclidean distance):
(3) 
We follow the same scheme of Section 3: given an instance , we first show how to estimate workers’ fault levels under a simple noise model, then transform them to weights and aggregate multiple answers.
Independent Errors Model
For our theoretical analysis, we assume an independent error (IER) model. Formally, for every question , and every worker , with probability ; and for any w.p. . That is, all wrong answers occur with equal probability. Note that .
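A sampler for the IER model is straightforward; this sketch (names are our own) draws each answer independently, with all wrong labels equally likely, as the model specifies:

```python
import numpy as np

def sample_ier_instance(ground_truth, faults, k, rng=None):
    """Sample reports from the independent-errors (IER) model: worker i answers
    each question correctly w.p. 1 - faults[i], and otherwise picks one of the
    k - 1 wrong labels uniformly at random. Labels are 0, ..., k - 1."""
    rng = np.random.default_rng(rng)
    m = len(ground_truth)
    reports = []
    for f in faults:
        answers = list(ground_truth)
        for j in range(m):
            if rng.random() < f:  # worker makes an error on question j
                wrong = [a for a in range(k) if a != ground_truth[j]]
                answers[j] = wrong[rng.integers(len(wrong))]
        reports.append(answers)
    return reports
```

Over many questions, the empirical error rate of each worker concentrates around her fault level.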
We denote by the probability that two workers who are wrong select the same answer.
4.1. Estimating Fault Levels
Fault estimation by distance from the Plurality outcome
As in the continuous case, it is a common practice to use a simple aggregation method (in this case Plurality) to estimate fault levels. We similarly denote the estimated fault level by , and refer to it as the DEFL algorithm for the categorical domain.
Fault estimation by proxy distance
Applying the definition of the proxy distance (Eq. (1)) to the categorical domain, we get:
Theorem 1 (Anna Karenina principle for the IER model).
Suppose that instance is sampled from protopopulation via the IER model. For every worker ,
(4) 
The proof is somewhat more nuanced than in the continuous case. We first show that for every pair of workers,
Then, for every population,
(5) 
and then we take expectation again over populations to prove the claim.
We get that there is a positive relation between and exactly when , i.e. when the average fault levels are below those of completely random answers.
To estimate from the data, the PEFL algorithm (Alg. 5) reverses the linear relation.
Setting parameter values
As in the continuous domain, setting means that is the average of . Also as in the continuous domain, we can use as a conservative default value, in which case .
In contrast to the continuous case, it is obvious that the estimates we get from the PEFL and the DEFL algorithms are not at all equivalent. To see why, note that a small change in the report of a single worker may completely change the Plurality outcome (and thus the fault estimates of all workers in the DEFL algorithm), but only has a gradual effect on PEFL.
We do not know whether DEFL is consistent, but we can show that PEFL is.
Theorem 2 (Consistency of PEFL).
Suppose the support of is a closed subset of and . Then PEFL is consistent as and . That is, for any , for all , as and .
Evaluation
How good is the estimation of PEFL for a given population? Rephrasing Theorem 1, we can write
That is, the proxy distance is a sum of two components. The first is the actual fault level (the “signal”). The second one decreases the signal proportionally to . Thus the lower is, the better estimation we get on average. We can see this effect in Fig. 3: the top left figure presents an instance with lower than the figure to its right, so the dependency of on is stronger and we get a better estimation. The top right figure has the same as the middle one, but higher , thus fault levels are more spread out and easier to estimate. The two figures on the bottom left demonstrate that a good fit is not necessarily due to the IER model, as the estimated faults for the real dataset in the middle are much more accurate.
The bottom right figure shows that PEFL is somewhat more accurate than DEFL on average.
4.2. Aggregation
The vector that minimizes the expected distance to the ground truth is also the MLE under an equal prior, since we simply try to find the most likely answer for each question. When fault levels are known, this was studied extensively in a binary setting (Grofman et al., 1983; Shapley and Grofman, 1984; Nitzan and Paroush, 1985). Specifically, Grofman et al. (1983) identified the optimal rule for binary aggregation. Ben-Yashar and Paroush (2001) extended these results to questions with multiple answers.⁵

⁵Ben-Yashar and Paroush (2001) also considered other extensions, including unequal priors, distinct utilities for the decision maker, and general confusion matrices instead of equal-probability errors. In all these cases the optimal decision rule is not necessarily a weighted plurality rule, and generally requires comparing all pairs of answers.
For a worker with fault level , we denote . We refer to as the Grofman weights of population .
Proposition 3 ((Grofman et al., 1983; BenYashar and Paroush, 2001)).
Suppose that is a random instance from the IER model. Let . Then is the maximum likelihood estimator of (and thus also minimizes in expectation).
That is, is the result of a weighted plurality rule, where the optimal weight of is her Grofman weight (note that it depends only on ). Note that workers whose fault level is above random error () get a negative weight. Of course, since we have no access to the true fault level, we cannot use Grofman weights directly.
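The Grofman weight is a log-likelihood ratio: the log of the probability of the correct answer over the probability of any specific wrong answer. A sketch (the k-ary form is our reading of the equal-error extension; for k = 2 it reduces to the classical log-odds weight, and it is negative exactly for workers worse than random):

```python
import math
from collections import defaultdict

def grofman_weight(f, k=2):
    """Log-odds weight of a worker with fault level f on k-ary questions:
    log((1 - f) / (f / (k - 1))). Zero for a random binary guesser (f = 0.5),
    negative for workers worse than random (f > (k - 1) / k)."""
    return math.log((1 - f) * (k - 1) / f)

def weighted_plurality(reports, weights):
    """Weighted plurality, applied to each question independently."""
    m = len(reports[0])
    result = []
    for j in range(m):
        score = defaultdict(float)
        for answers, w in zip(reports, weights):
            score[answers[j]] += w
        result.append(max(score, key=score.get))
    return result
```

A single highly reliable worker can thus outvote several mediocre ones, which is exactly what the MLE prescribes.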
Prop. 3 suggests a simple aggregation skeleton, which is the same as Alg. 4, except that it uses weighted plurality instead of the weighted mean, and sets weights to the estimated Grofman weights.⁶ DTD and PTD are the combinations of this categorical skeleton with DEFL and with PEFL, respectively.

⁶The skeleton also appears as Alg. 8 in the appendix for completeness.
As in the continuous case, the algorithm is welldefined for any categorical dataset, but in the special case of the IER noise model we get that the workers’ weights are a reasonable estimate of the Grofman weights due to Theorems 1 and 2. Lastly for this section, we show that PTD is consistent.
Theorem 4 (Consistency of PTD).
Suppose the support of is a closed subset of and . Then PTD is consistent as and . That is, for any , as and .
4.3. Empirical results
We compared the performance of PTD to the competing methods UA (which returns ) and DTD on synthetic and real data. In all simulations we used the default parameter (i.e., PTD). Averages are over 1000 samples for each and . The oracle benchmark OA returns . Note that OA and UA are not affected by the number of questions.
For synthetic data we generated instances from the IER model: one distribution with Yes/No questions (), where ; and another with multiplechoice questions (), where (additional distributions in Appendix D). In addition, we used three datasets from (Shah and Zhou, 2015) (Flags, GoldenGate, Dogs) and one that we collected (DotsBinary). Their description is in Appendix D.1.
4.4. Iterative methods
There are more advanced methods for truth discovery that are based on the following reasoning: a good estimate of workers’ fault levels leads to aggregated answers that are close to the ground truth (by appropriately weighing the workers); and a good approximation of the ground truth can get us a good estimate of workers’ competence (by measuring their distance from the approximate answers). Thus we can iteratively improve both estimates (see e.g. in (Li et al., 2016), Section 2.2.2). The Iterative DTD algorithm (IDTD, see Alg. 6) captures this reasoning. Note that the DTD algorithm is a special case of IDTD with a single iteration.
Algorithm 6 (IDTD)
Input: number of iterations; dataset
Output: fault levels; answers
Initialize all workers’ weights uniformly;
for each iteration do
    aggregate the reports with the current weights;
    for every worker, set her estimated fault to her distance from the aggregated outcome;
    for every worker, set her weight from her estimated fault;
end for
Return the final fault levels and the aggregated answers.

Algorithm 7 (IPEFL)
Input: number of iterations; dataset
Output: fault levels
Initialize all workers’ weights uniformly;
Compute the pairwise distance for every pair of workers;
for each iteration do
    for every worker do
        set her estimated fault to her weighted average distance from all other workers;
        set her weight from her estimated fault;
    end for
end for
Return the final fault levels.
Iterative PEFL
We can adopt a similar iterative approach for estimating fault levels using proxy voting. Intuitively, in each iteration we compute the proxy distance of a worker as her weighted average distance from all workers, and then recalculate the estimated fault levels and weights. Note that, as in the single-step PEFL, this estimation does not require any aggregation step. We avoid estimating the population parameters and instead use the default values, which means that the estimated fault level equals the proxy distance. The complete pseudocode is in Alg. 7.
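A sketch of the iterative estimation in Python. This is our own simplification: the fault-to-weight map here is 1/fault rather than a Grofman-style weight, and, matching the description above, fault levels equal weighted proxy distances and no aggregation step is needed:

```python
import numpy as np

def iterative_pefl(dist, T=4):
    """Iterative proxy-based fault estimation. dist is the n-by-n matrix of
    pairwise distances between workers. Each round recomputes every worker's
    proxy distance as her weighted average distance to the other workers,
    then refreshes the weights (here: 1 / fault, clipped away from zero)."""
    n = dist.shape[0]
    weights = np.ones(n)
    faults = dist.sum(axis=1) / (n - 1)  # unweighted proxy distances
    for _ in range(T):
        for i in range(n):
            others = [j for j in range(n) if j != i]
            w = weights[others]
            faults[i] = np.dot(w, dist[i, others]) / w.sum()
        weights = 1.0 / np.maximum(faults, 1e-9)
    return faults
```

In line with the intuition in the text, an outlying worker keeps a large estimated fault across iterations, while the mutual distances of the reliable workers weigh more and more in each other's proxy distances.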
Intuitive analysis of IPEFL
For ease of presentation, we assume for the analysis in the remainder of this section. Recall again that for unweighted proxy distance we have by Thm. 1:
(6) 
Ideally, we would like to make this second part smaller to strengthen the signal. We argue that the weighted proxy distance achieves just that. We do not provide a formal proof, but rather approximate calculations that should provide some intuition. Exact calculations are complicated due to correlations among terms.
We assume that fault levels are already determined, and take expectation only over realization of workers’ answers.
Lemma 5 ().
In step of the iterative algorithm,
(7) 
The proof simply repeats the steps of the proof of Theorem 1.
Recall that is the actual optimal weight (Grofman weight) of worker . We also denote . For values of not too far from , is an approximation of (Grofman et al., 1983). Thus .
Recall that is the variance of . Consider the “noisy” part that multiplies above. In expectation, the numerator satisfies:

Similarly, in expectation, the denominator satisfies .
If we neglect both the correlation between numerator and denominator, and the fact that is only an approximation of , we get that:
We conclude from Lemma 5 and the above discussion that after enough iterations, .
Since , this noise term is not larger than the noise in the unweighted PEFL algorithm ( in Eq. (6)). We therefore expect that if is already a reasonable estimation of then accuracy will grow with further iterations.
Empirical results for iterative algorithms
The Iterative Proxy-based Truth Discovery (IPTD) algorithm combines our aggregation skeleton with IPEFL. We can see how adding iterations affects the performance of PTD on synthetic data in Fig. 4. We further compare the performance of IPTD to IDTD on more distributions and datasets in the third row of Fig. 5 (and in Appendix D). We used the same number of iterations for both algorithms; a higher number of iterations had little effect on the results. Note that since we use the default parameters, our IPTD algorithm never explicitly estimates the population parameters, yet it manages to take advantage of the variance among workers. We do see instances, however, where the initial estimation is off and any additional iteration makes it worse.
5. Ranking Domain
Consider a set of alternatives and the set of all rankings (permutations) over it. Each pairwise relation over the alternatives corresponds to a binary vector, where each dimension is a pair of candidates. In particular, every ranking has a corresponding vector. A vector is called transitive if it corresponds to some ranking. The ground truth is a transitive vector (equivalently, a ranking). A natural metric over rankings is the Kendall-tau distance (a.k.a. swap distance).
Independent Condorcet Noise Model
According to the Independent Condorcet noise (ICN) model, an agent with a given fault level observes a vector in which every pairwise comparison is flipped relative to the ground truth, independently, with probability equal to her fault level.⁷ In particular, the observed vector may not be transitive.

⁷Classically, the Condorcet model assumes all voters have the same parameter (Young, 1988).
Mallows Models
The downside of the Condorcet noise model is that it may result in non-transitive answers. Mallows model is similar, except it is guaranteed to produce transitive answers (rankings).
Formally, given the ground truth and a dispersion parameter, the probability of observing an order decreases exponentially with its Kendall-tau distance from the ground truth. Thus for the extreme parameter value we get a uniform distribution, whereas for low values we get orders concentrated around the ground truth.

In fact, if we throw away all non-transitive samples, the probability of getting a ranking under the Condorcet noise model with a given parameter (conditional on the outcome being transitive) is exactly the same as the probability of getting it under Mallows model with the corresponding parameter.
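This rejection equivalence can be implemented directly. A sketch (function names ours), using the fact that a pairwise tournament is transitive exactly when its win counts are 0, 1, …, n-1:

```python
import itertools
import random

def sample_condorcet(truth, p, rng=random):
    """ICN model: flip each pairwise comparison of the true ranking w.p. p.
    vec[(a, b)] == True means 'a beats b', for a ranked above b in truth."""
    return {(a, b): rng.random() >= p
            for a, b in itertools.combinations(truth, 2)}

def is_transitive(vec, alts):
    """A pairwise tournament corresponds to some ranking iff its win counts
    are exactly 0, 1, ..., n - 1."""
    wins = {a: 0 for a in alts}
    for (a, b), a_beats_b in vec.items():
        wins[a if a_beats_b else b] += 1
    return sorted(wins.values()) == list(range(len(alts)))

def sample_mallows(truth, p, rng=random):
    """Mallows sampling by rejection: redraw ICN vectors until one is
    transitive; conditioning on transitivity yields exactly the Mallows
    distribution with the corresponding parameter, as noted above."""
    while True:
        vec = sample_condorcet(truth, p, rng)
        if is_transitive(vec, truth):
            return vec
```

Rejection is wasteful for many alternatives (most tournaments are non-transitive), but it makes the equivalence between the two models concrete.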
5.1. Estimating Fault Levels
By definition, the ICN model is a special case of the IER model, where , , and the ground truth is a transitive vector. We thus get the following result as an immediate corollary of Theorem 1, and can therefore use PEFL directly.
Theorem 1 (Anna Karenina principle for the Condorcet model).
Suppose that instance is sampled from population via the ICN model, where all are sampled independently from protopopulation with expected value . For every worker , .
5.2. Aggregation
Note that while our results on fault estimation from the binary domain directly apply (at least to the ICN model), aggregation is trickier: issue-by-issue aggregation may result in a non-transitive (thus invalid) solution. The voting rules we consider are guaranteed to output a valid ranking.
The problem of retrieving given votes is a classical problem, and in fact any social welfare function offers a possible solution (Brandt et al., 2016).
There is a line of work that deals with finding the MLE under various assumptions on the noise model (Ben-Yashar and Paroush, 2001; Drissi-Bakhkhat and Truchon, 2004). In general, these estimators may take a complicated form that depends on all parameters of the distribution. Yet some cases are simpler.
The Kemeny rule and optimal aggregation
It is well known that for both the Condorcet noise model and Mallows model, when all voters have the same fault level, the maximum likelihood estimator of the ground truth is obtained by applying the Kemeny-Young voting rule (Young, 1988).
The Kemeny-Young rule (henceforth, Kemeny) computes the binary vector that corresponds to the majority vote applied separately to every pair of candidates, and then returns the ranking closest to that vector.
In particular it can be applied when is composed of transitive vectors.
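The rule as defined here can be implemented by brute force over permutations. A hypothetical sketch (ties in the majority vector are broken arbitrarily, and the optional weights anticipate the weighted variant discussed in the text):

```python
import itertools

def kemeny(rankings, weights=None):
    """Kemeny rule as described above: compute the (weighted) pairwise
    majority vector, then return the ranking whose pairwise vector is closest
    to it. Brute force over all permutations: only for few alternatives."""
    alts = sorted(rankings[0])
    if weights is None:
        weights = [1.0] * len(rankings)
    # Weighted majority margin per pair: positive means 'a before b' wins.
    margin = {}
    for a, b in itertools.combinations(alts, 2):
        margin[(a, b)] = sum(w if r.index(a) < r.index(b) else -w
                             for r, w in zip(rankings, weights))
    def disagreement(perm):
        pos = {a: i for i, a in enumerate(perm)}
        # Count the majority-vector entries this ranking contradicts.
        return sum(1 for (a, b), m in margin.items()
                   if (m > 0) != (pos[a] < pos[b]))
    return min(itertools.permutations(alts), key=disagreement)
```

With unit weights this is the unweighted rule; giving one voter a large weight lets her majority-vector entries dominate, as in the weighted variant.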
A natural question is whether there is a weighted version of Kemeny that is an MLE and/or minimizes the expected distance to the ground truth when fault levels are known. We did not find any explicit reference to this question, or to the case of distinct fault levels in general. There are some other extensions: Drissi-Bakhkhat and Truchon (2004) deal with a different variation of the noise model where it is less likely to swap pairs that are further apart, and Xia and Conitzer (2011) extend the Kemeny rule to deal with more general noise models and partial orders.
As it turns out, using weighted Kemeny with (binary) Grofman weights provides us with (at least) an approximately optimal outcome.
Proposition 2 ().
Suppose that the ground truth is sampled from a uniform prior on . Suppose that instance is sampled from population via the ICN model. Let . Let be any random variable that may depend on . Then , where expectation is over all instances.
Proof.
Consider the ground truth binary vector . Let . Let be an arbitrary random variable that may depend on the input profile. For every (i.e., every pair of elements), we know from Prop. 3 that is the MLE for , and thus . Now, recall that by definition of the KY rule, is the closest ranking to . Also denote .
(since is the closest transitive vector to )  
In particular, this holds for any transitive vector corresponding to ranking , thus
as required. ∎
Prop. 2 provides some justification for using the Kemeny voting rule with Grofman weights to aggregate rankings when the fault levels are known. We can now apply similar reasoning as in the binary case to estimate them. Given a set of rankings and any voting rule (not necessarily Kemeny!), the PTD algorithm combines Alg. 8, with the chosen voting rule in place of plurality, and Alg. 5, with the Kendall-tau distance. Since the meaning of negative weights is not clearly defined, we replace every negative weight with zero (the full description appears as Alg. 9 in the appendix for completeness).
5.3. Empirical results
We compared the performance of PTD using 8 different voting rules: Borda, Copeland, Kemeny-Young (with weighted and unweighted majority graph), Plurality, Veto, Random dictator, and Best dictator.