1 Introduction
Aggregating preferences of agents is a fundamental problem in artificial intelligence and social choice theory
[Conitzer, 2010]. Typically, agents (or voters) express their preferences over alternatives (or candidates). There are many different models for the expression of preferences, ranging from the simplistic (each voter provides his or her favorite choice, also known as plurality voting) to the comprehensive (each voter provides a complete ranking over the set of all candidates). Approval ballots are an intermediate model, where a voter approves or disapproves of each candidate; thus a vote may be thought of as a subset of approved candidates or as a binary string indexed by the candidate set. Approval votes are considered a good compromise between the extreme models: they give the agent an opportunity to comment on every candidate, without incurring the overhead of determining a full ranking on the candidate set [Fishburn, 1978, Brams, 2008, Kilgour, 2010, Lepelley, 2013, Baumeister et al., 2010].
Our work in this paper focuses on finding the “best” subset of candidates when given approval votes over candidates. The set of candidates together with all the votes is usually called an election instance or simply an election. We use the term electorate to refer to the set of votes of all the agents, and the term committee to refer to a subset of candidates. Since a committee is a subset of candidates, and a vote can also be interpreted as a subset of the candidate set (namely, the subset of approved candidates), one might consider various natural notions of distance between these two sets. A fixed notion of distance leads to a measure of suitability of a committee with respect to an election, for instance, by considering the sum of distances to all voters, or the maximum distance incurred from any voter.
For a committee and a vote , the notions of distance that are well-studied in the literature include the size of the symmetric difference (leading to the minisum or minimax rules) [Brams et al., 2007, LeGrand et al., 2007, Caragiannis et al., 2010, Gramm et al., 2003], the size of the intersection (leading to the notion of approval score), or the difference between and (leading to the notion of net approval scores) [Procaccia et al., 2008, Skowron and Faliszewski, 2015, Aziz et al., 2015].
Motivation
The standard approach to the selection of a winning committee is to look for one that optimizes these scoring functions over the entire election, that is, when the sum or max of scores is taken over all the votes. However, it is plausible that for some elections, there is a committee that represents a very good consensus when the scores are taken over a subset of the voters rather than the entire set. For example, consider the approval voting rule mentioned above. The approval score of a committee is the sum of , taken over all votes in the electorate. Let the approval score per vote be the ratio of the approval score of to . For a committee of size , the best approval score per vote possible is . Clearly, the higher the approval score per vote, the more satisfied a voter is on average. Now, consider the following toy example. Let be an arbitrary but fixed subset of candidates. Suppose the election has votes that are all exactly equal to , and one vote that approves . The best approval score per vote that one can hope for here is , whereas if we restrict our attention to the election on the first votes alone, then ’s approval score per vote is , which is now the best possible. Of course, the difference here is not significant for large , but the illustration does provoke the following thought: is there a subset of at least votes that admits a committee whose approval score per vote is at least ? For , we are back to the original question of finding the best committee. This more general setting, however, allows us to explore tradeoffs and structure in the election: if there is a committee that prescribes a very good consensus over a large fraction of the election, then it is likely to be a more suitable choice for the community compared to a committee that optimizes the same score function over the entire election. While these committees may coincide (as in the toy example), there are easy examples where they would not, making this a question worth exploring.
The notion of finding a good consensus over a large subset has been explored in various contexts. In particular, the idea of using outliers is quite common in the literature of Closest String problems. The setting of closest string involves a collection of strings, and the goal is to find a single string that minimizes either the maximum distance, or the sum of distances, from all the input strings. The most commonly studied notion of distance is the Hamming distance. Notice that once we interpret votes as binary strings, we are asking a very similar question, and the main distinction is that our search is only over strings that have a fixed number of ones. This similarity has been noted and explored in some works on voting before (see, for instance, [Byrka and Sornat, 2014]). In the context of Closest String, a question that is often asked is the following: given a budget , is there a string that is “close to” at least strings? This question has been studied for both the minimax and minisum notions of closeness [Boucher et al., 2013, Lo et al., 2014]. Typically, the strings that are left out are called outliers. In the context of voting, one imagines that there may be a few votes that express rather tangential opinions, and that a good consensus emerges once they are removed. We will also refer to such votes as outliers. In the context of social choice theory, the Young voting rule itself can be regarded as finding the minimum number of outliers whose removal makes some candidate a Condorcet winner [Young and Levenglick, 1978, Bartholdi III et al., 1989a, Rothe et al., 2003, Betzler et al., 2010].
Our Framework
One of the advantages of using scoring rules for approval ballots described above is that the winning committees are polynomially computable for most of them (with the notable exception of the minimax voting rule). However, once we pose the question of whether a target score is achievable after the removal of at most outliers, the complexity landscape changes dramatically. We show that this question is a computationally hard problem; in particular, we establish NP-hardness. Having shown hardness in the classical setting, we explore the complexity further, primarily from the perspective of exact exponential algorithms and approximation algorithms. In the context of the former, we use the framework of parameterized complexity. Briefly, a parameterized problem instance comprises an instance $x$ in the usual sense and a parameter $k$. A problem with parameter $k$ is called fixed parameter tractable (FPT) if it is solvable in time $f(k) \cdot p(n)$, where $f$ is an arbitrary function of $k$ and $p$ is a polynomial in the input size $n$. While there have been important examples of traditional algorithms that have been analyzed in this fashion, the theoretical foundations for deliberate design of such algorithms, and a complementary complexity-theoretic framework for hardness, were developed in the late nineties [Downey and Fellows, 1988, Downey and Fellows, 1995a, Abrahamson et al., 1995, Downey and Fellows, 1995b]. Just as NP-hardness is used as evidence that a problem probably is not polynomial time solvable, there exists a hierarchy of complexity classes above FPT, and showing that a parameterized problem is hard for one of these classes is considered evidence that the problem is unlikely to be fixed-parameter tractable. Indeed, assuming the Exponential Time Hypothesis, a problem hard for W[1] does not belong to FPT [Downey and Fellows, 1999a]. The main classes in this hierarchy are: $\mathsf{FPT} \subseteq \mathsf{W}[1] \subseteq \mathsf{W}[2] \subseteq \cdots \subseteq \mathsf{W}[P] \subseteq \mathsf{XP}$, where a parameterized problem belongs to the class XP if there exists an algorithm for it with running time bounded by $n^{g(k)}$. A parameterized problem is said to be para-NP-complete if it is NP-complete even for constant values of the parameter. A classic example of a para-NP-complete problem is graph coloring parameterized by the number of colors; recall that it is NP-complete to determine if a graph can be properly colored with three colors. Observe that a para-NP-complete problem does not belong to XP unless P = NP. We are now ready to describe our contributions in greater detail.
1.1 Our Contributions
We consider three standard approval scoring mechanisms, namely the minisum, approval, and net approval scores. The last two scores, as originally defined in the literature, are designed to simply give us the total amount of approval that a committee incurs from all the voters. The scores themselves are therefore non-decreasing functions of , and as such, the question of outliers is not interesting if we use the scores directly (in particular, it is impossible to improve these scores by removing votes). Therefore, we consider the dual scoring system that complements the original: we score a committee based on the amount of disapproval that it incurs from all the votes, and seek to minimize the total disapproval. Typically, for any notion of approval, there are either one or two natural complementary notions of disapproval that present themselves (discussed in greater detail below). This formulation is consistent with the idea that we want our scores to capture “distance” rather than closeness. We note that in terms of scores, these rules are equivalent to the original, but choosing to ask the minimization question allows us to formulate the problem of finding the best committee in the presence of outliers.
Remark 1.
One might also consider the approval score per vote instead of the total score, as described earlier. However, we chose to use the notion of disapproval because of its consistency with the other distance-based rules (like minisum and minimax). All the variations are equivalent as scoring functions, and we note that our choice is only a matter of exposition.
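Table 1: Measures of approval and the corresponding measures of disapproval between a committee and a vote.
Measures of approval | Measures of disapproval
Minisum | Minisum
Approval | Disapproval
Net Approval | Net Disapproval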
In Table 1, we summarize the notions of distances between a committee and a vote. Each of these notions naturally gives rise to a score-based voting rule. Formally, for any distance function between two subsets of candidates, we overload notation and define the corresponding score function for a set of votes as follows:
For the winner determination problem, the goal is to find a committee of size that minimizes . For all the scoring rules in Table 1, a winning committee of size can be found in polynomial time for any . We refer the reader to Table 2 for an overview of the notation we use in this paper. We are now ready to define the problem of winner determination for a scoring rule in the presence of outliers:
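As an illustration of why winner determination is easy without outliers, here is a minimal sketch for the minisum rule only. It assumes that votes are given as Python sets of approved candidates and that the minisum score is the sum of symmetric-difference sizes; the function names `minisum_score` and `minisum_winner` are hypothetical and not taken from the paper.

```python
from collections import Counter


def minisum_score(votes, committee):
    """Sum of symmetric-difference sizes between the committee and each vote."""
    committee = frozenset(committee)
    return sum(len(committee ^ frozenset(v)) for v in votes)


def minisum_winner(votes, candidates, k):
    """Return a size-k committee minimizing the minisum score.

    For a committee C of size k and n votes,
        sum_v |C ^ v| = sum_v |v| + n*k - 2 * sum_{c in C} approvals(c),
    so it suffices to pick the k candidates with the most approvals.
    Ties are broken by candidate name (an arbitrary choice for this sketch).
    """
    approvals = Counter(c for v in votes for c in v)
    ranked = sorted(candidates, key=lambda c: (-approvals[c], c))
    return frozenset(ranked[:k])
```

The identity in the docstring is what makes a simple sort sufficient here; the other rules in Table 1 would need their own distance functions.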
Outliers
Input: A set of votes over a set of candidates , a committee size , a target number of non-outliers , and a target score .
Question: Does there exist a committee and a set of non-outliers such that , , and ?
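To make the decision problem concrete, the following is a brute-force sketch for the minisum rule. It assumes the committee must have exactly `k` candidates, that at least `n_star` votes must be kept, and that the score of the kept votes must be at most `t`; these readings, along with all names (`hamming`, `minisum_outliers`, `n_star`, `t`), are assumptions made for illustration. The search is exponential in the number of candidates and is not meant as an efficient algorithm.

```python
from itertools import combinations


def hamming(committee, vote):
    """Size of the symmetric difference between a committee and a vote."""
    return len(frozenset(committee) ^ frozenset(vote))


def minisum_outliers(votes, candidates, k, n_star, t):
    """Exhaustively search for a size-k committee and n_star kept votes
    whose total minisum score is at most t.

    Returns (committee, sorted indices of kept votes) or None.
    """
    for committee in combinations(sorted(candidates), k):
        # Distances are non-negative, so for a fixed committee it is
        # optimal to keep exactly the n_star closest votes.
        order = sorted(range(len(votes)),
                       key=lambda i: hamming(committee, votes[i]))
        kept = order[:n_star]
        if sum(hamming(committee, votes[i]) for i in kept) <= t:
            return frozenset(committee), sorted(kept)
    return None
```

For example, with three identical votes `{"a", "b"}` and one vote `{"c", "d"}`, a score of 0 is reachable once a single outlier is allowed, but not when all four votes must be kept.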
Remark 2.
We first show that the Outliers problem is NP-complete for all the scoring rules considered here, even in the special case when every vote approves exactly three candidates and every candidate is approved by exactly three votes. This also establishes the para-NP-hardness of the problem with respect to the combined parameter consisting of the maximum number of candidates approved by any vote and the maximum number of votes that approve a candidate.
Theorem 1.
Let the scoring rule be one of Minisum, Disapproval, or Net-Disapproval. The corresponding Outliers problem is NP-complete even if every vote approves exactly 3 candidates and every candidate is approved by exactly 3 votes.
To initiate the parameterized study of the Outliers problem, we propose the following parameters: the size of the committee (), the number of candidates not in the committee (), the number of non-outliers (), the number of outliers (), and the target score (which one might think of as the “solution size” or the standard parameter).
Table 2: Notation used in this paper.
Number of votes | Set of candidates
Number of candidates | Committee chosen
Number of outliers | Non-committee
Number of non-outliers | Set of votes
Size of committee | Set of non-outliers
Size of non-committee | Set of outliers
Score of the committee | Hamming distance
Our main results are the following. For any subset of these parameters, we establish if the Outliers problem is or hard when , and for the Disapproval voting rule, we have a classification for all cases but one. Specifically, for the minisum voting rule, we establish the following dichotomy.
Theorem 2.
Let , and . The Minisumoutliers problem parameterized by is if contains either , , or , and is hard otherwise.
For the net disapproval voting rule, we establish the following dichotomy.
Theorem 3.
Let , and . The NetDisapprovaloutliers problem parameterized by is if contains either or , and is hard otherwise.
Further, for the disapproval voting rule, we have the following theorem that classifies all cases but one.
Theorem 4.
Let , and . The DisapprovalOutliers problem parameterized by is if contains either or , and is hard for all other cases except for .
Apart from these classification results in the context of exact algorithms, we also pursue approximation algorithms and other special cases. We briefly summarize our contributions in these contexts below.
Approximation Results. We provide a polynomial time approximation algorithm for optimizing the minisum score considering outliers, for every constant [Theorem 5]. We also show a approximation algorithm for optimizing the minisum score considering outliers running in time , for every constant [Theorem 6]. On the hardness side, we show that in the presence of outliers, there does not exist an approximation algorithm for optimizing the score of the selected committee for any computable function , for the net disapproval [Corollary 2] and disapproval voting rules (unless ) [Lemma 8].
Other Special Cases. We show that when every voter approves at most one candidate, the Minisum-outliers problem can be solved in polynomial time. For some hard cases, we show that the problem becomes if the maximum number of candidates approved by a vote or the maximum number of votes that approve of a candidate is a constant. We refer the reader to Section 6 for a more detailed discussion.
1.2 Related Work
The notion of outliers is quite prominent in the literature pertaining to closest strings, where the usual setting is that we are given strings over some alphabet , and the task is to find a string that minimizes either the maximum Hamming distance from any string, or the sum of its Hamming distances from all strings. In the context of social choice theory, the notion of outliers is intimately related to manipulation and control of elections by removing voters and candidates [Bartholdi et al., 1992, Bartholdi III et al., 1989b, Conitzer et al., 2007, Chevaleyre et al., 2007, Hemaspaandra et al., 2007, Brandt et al., 2012, Faliszewski et al., 2009, Faliszewski et al., 2011, Meir et al., 2008, Procaccia et al., 2007, Elkind et al., 2009, Betzler and Uhlmann, 2008].
In the minimax notion of distance, the study of finding a closest string in the presence of outliers was initiated in [Boucher and Ma, 2011], and was further refined in [Boucher et al., 2013]. The work in [Boucher et al., 2013] demonstrates hardness of approximation for variations involving outliers. With the minisum notion of distance, the problem of finding the best string with respect to outliers was studied in [Lo et al., 2014]. In contrast with the minimax rule, this work shows a PTAS for Consensus Sequence with Outliers.
The fact that the closest string problem has similarities to the setup of approval ballots has been explored in the past, see, for instance [Byrka and Sornat, 2014]. More specifically, and again in the spirit of observing similarities with the closest string family of problems, the minimax approval rule has been studied from the perspective of outliers [Misra et al., 2015]. To the best of our knowledge, our work is the first comprehensive study of the complexity behavior of approval ballots in the presence of outliers.
2 Preliminaries
In this section we briefly recall some terminology and notation that we will use throughout.
Approval Voting
Let be the set of all votes and the set of all candidates. Each vote is a subset of . We say voter approves a candidate if ; otherwise we say, voter does not approve the candidate . A voting rule for choosing a committee of size is a mapping , where is the set of all subsets of of size . We call a voting rule a scoring rule if there exists a scoring function such that , for every . We refer to Table 1 for some common scoring rules. Unless mentioned otherwise, we use the notations listed in Table 2.
Algorithmic Terminology
Given a set , we denote the complement of by . For a positive integer we denote the set by . We use the notation to denote .
For a minimization problem , we say an algorithm achieves an approximation factor of if for every problem instance of . In the above, denotes the value of the optimal solution of the problem instance .
Definition 1.
(Parameterized Reduction [Downey and Fellows, 1999b])
Let and be parameterized problems. A parameterized reduction from to is an algorithm that, given an instance of , outputs an instance of such that:


is a yes-instance of if and only if is a yes-instance of ,

for some computable function , and

the running time of the algorithm is for some computable function .
3 Classical Complexity: NP-hardness Results
We begin by showing that even for rules where winner determination is polynomial time solvable, the possibility of choosing outliers makes the winner determination problem significantly harder. In particular, we show that the Outliers problem is NP-hard for minisum, disapproval, and net disapproval voting rules. We reduce from the well-known Vertex Cover problem, which is known to be NP-complete even on 3-regular graphs [Garey and Johnson, 1979].
Vertex-cover
Input: A 3-regular graph and a positive integer .
Question: Is there a subset of at most vertices such that is an independent set?
See Theorem 1.
Proof.
First let us prove the result for the minisum voting rule. The problem is clearly in NP. To prove NP-hardness, we reduce the Vertex-cover problem to the Minisum-outliers problem. Let be an arbitrary instance of the Vertex-cover problem. Let . We will introduce a candidate and a vote corresponding to every vertex, and have candidate approved by a vote if and only if the vertices corresponding to and are adjacent in . We define the corresponding instance of the Minisum-outliers problem as follows:
Notice that, and each candidate is approved in exactly 3 votes since is 3-regular. We claim that the two instances are equivalent. In the forward direction, suppose forms a vertex cover with . Consider the committee and the set of non-outliers . We claim that for every , thereby proving . Notice that, since and for every . Suppose there exists an such that . Then there exists a candidate . However, this implies that the edge is not covered by , which contradicts the fact that is a vertex cover.
For the reverse direction, suppose there exists a set of outliers and a committee such that , , and . Since adding votes to the set of outliers can only reduce , we may assume without loss of generality that . Now, since for every , we have and thus for every . Let and . We claim that covers all the edges incident on the vertices in . Indeed, otherwise there exists a such that all the edges incident on are not covered by and thus we have . We now construct a vertex cover of of size at most as follows. If and are disjoint, then forms a vertex cover of . Otherwise we let and notice that still covers all the edges incident on . We define . We note that and . We iterate this process for many times. Let us call . By the argument above, covers all the edges incident on the vertices in , , and thus we have . Hence, forms a vertex cover of .
The proofs for the other rules are identical, except for the values of the target score. We define for the disapproval, and for the net disapproval voting rule. It is easily checked that the details are analogous. ∎
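The vote construction used in the proof above can be sketched as follows. This covers only the part stated in words (one candidate and one vote per vertex, with a vote approving exactly the candidates of adjacent vertices); the committee size, number of non-outliers, and target score of the reduced instance are not reproduced in this sketch. The function name `votes_from_graph` and the adjacency-dictionary input format are assumptions made for illustration.

```python
def votes_from_graph(adjacency):
    """Build approval votes from an undirected graph.

    adjacency maps each vertex to the set of its neighbours.
    Vertex u yields candidate u and a vote that approves exactly the
    candidates corresponding to the neighbours of u.
    """
    candidates = sorted(adjacency)
    votes = {u: frozenset(adjacency[u]) for u in candidates}
    return candidates, votes


# Usage on the complete graph K4, which is 3-regular.
k4 = {v: {u for u in range(4) if u != v} for v in range(4)}
candidates, votes = votes_from_graph(k4)
```

On a 3-regular input, every vote approves exactly three candidates and every candidate is approved by exactly three votes, which matches the restriction stated in Theorem 1.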
Theorem 1 immediately yields the following corollary.
Corollary 1.
Let the scoring rule be one of Minisum, Disapproval, or Net-Disapproval. Then the corresponding Outliers problem is para-NP-hard when parameterized by the sum of the maximum number of approvals that a candidate obtains and the maximum number of candidates that a vote approves.
4 Parameterized Complexity Results
In this section, we present the results on parameterized complexity of the Outliers problem.
4.1 FPT algorithms
The following result follows from the fact that the Outliers problem is polynomial time solvable for all the voting rules considered here if we know either the committee or the nonoutliers of the solution.
Proposition 1.
Let Minisum, Disapproval, NetDisapproval. Then, there is a time algorithm and a time algorithm for the Outliers problem.
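The observation behind this proposition can be sketched for the minisum rule as two polynomial-time subroutines: one that finds the best non-outliers for a known committee, and one that finds the best committee for a known set of non-outliers. The sketch assumes votes are sets of approved candidates and that exactly `n_star` votes are kept; the names `best_nonoutliers` and `best_committee` are hypothetical, and the enumeration that an exact algorithm would wrap around these subroutines is omitted.

```python
from collections import Counter


def best_nonoutliers(votes, committee, n_star):
    """Given a committee, keep the n_star votes closest to it (minisum)."""
    committee = frozenset(committee)

    def dist(i):
        return len(committee ^ frozenset(votes[i]))

    # sorted() is stable, so ties are resolved in favour of lower indices.
    kept = sorted(range(len(votes)), key=dist)[:n_star]
    return sorted(kept), sum(dist(i) for i in kept)


def best_committee(votes, kept, candidates, k):
    """Given the kept vote indices, pick the k most-approved candidates."""
    approvals = Counter(c for i in kept for c in votes[i])
    ranked = sorted(candidates, key=lambda c: (-approvals[c], c))
    committee = frozenset(ranked[:k])
    score = sum(len(committee ^ frozenset(votes[i])) for i in kept)
    return committee, score
```

Trying every candidate committee with the first routine, or every candidate set of non-outliers with the second, gives an exact procedure whose exponential part depends only on the quantity being enumerated.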
Now, we show that the Minisumoutliers problem, parameterized by , is .
Lemma 1.
There is a time algorithm for the Minisumoutliers problem.
Proof.
We consider the following two cases.
Case : In this case there exists a vote, say , such that , where is the committee selected. Hence, we can iterate over all possible such (if , then do not consider that ) and fix to be . We now identify the non-outliers with respect to the committee .
Case : Use the algorithm in Proposition 1 which runs in time. ∎
4.2 hardness Results
In this section, we establish our hardness results. To begin with, we focus on parameters combined with the target score . Notice that we have tractability when we combine either or along with (follows from Proposition 1). For minisum, we even have tractability for and , from Lemma 1. Therefore, the interesting combinations for the minisum rule are and . We first consider the combination , and show hardness for all the voting rules considered here by exhibiting a parameterized reduction from the Clique problem, which is known to be W[1]-hard when parameterized by clique size [Downey and Fellows, 2013].
Clique
Parameter:
Input: A regular graph and a positive integer .
Question: Is there a clique of size ?
Lemma 2.
The Minisumoutliers problem is hard, when parameterized by . Also, for disapproval, net disapproval, the Outliers problem is hard when parameterized by , even when .
Proof.
First let us prove the result for the Minisumoutliers problem. We exhibit a parameterized reduction from the Clique problem to the Minisumoutliers problem thereby proving the result. Let be an arbitrary instance of the Clique problem. Let and . We define the corresponding instance of the Minisumoutliers problem as follows:
We claim that the two instances are equivalent. In the forward direction, suppose forms a clique with , and let denote the set of edges that have both endpoints in . Then the committee along with the set of nonoutliers achieves minisum score of .
In the reverse direction, suppose there exists a set of non-outliers and a committee such that . We claim that the vertices in form a clique with edge set . If not, then there exists a vote such that . On the other hand, we have for every . This implies that , which is a contradiction.
We define for the disapproval and net disapproval voting rules. It is easily checked that the details are analogous. ∎
We now turn to the combination . Again, we have hardness for all voting rules considered.
Lemma 3.
Let Minisum, NetDisapproval, Disapproval. Then the Outliers problem is hard, when parameterized by , even when every vote approves exactly two candidates.
We now consider the other combinations involving that are nontrivial for voting rules different from minisum, namely and . Lemma 4 below shows hardness for the first combination, for the Disapproval voting rule. This also shows hardness with respect to the parameter alone, for the minisum voting rule.
Lemma 4.
The MinisumOutliers problem is hard when parameterized by . Also, the DisapprovalOutliers problem is hard parameterized by , even when the target score . In particular, Outliers is hard parameterized by .
Proof.
To begin with, let us prove the result for the Minisumoutliers problem. We exhibit a parameterized reduction from the Clique problem to the Minisumoutliers problem thereby proving the result. Let be an arbitrary instance of the Clique problem. Let and . We define the corresponding instance of the Minisumoutliers problem as follows:
We claim that the two instances are equivalent. In the forward direction, suppose forms a clique with , and let denote the set of edges that have both endpoints in . Then the committee and the set of nonoutliers achieves minisum score of .
In the reverse direction, suppose there exists a set of outliers and a committee such that . We claim that the vertices in form a clique. If not, then there exists a candidate that is not approved by at least one vote in . However, this implies that , which is a contradiction.
We define for the disapproval voting rule. It is easily checked that the details are analogous. ∎
Next, we show that NetDisapprovaloutliers is hard with respect to the combined parameter .
Lemma 5.
The NetDisapprovaloutliers problem is hard, when parameterized by , even when the target score is . In particular, NetDisapprovaloutliers problem is hard, when parameterized by .
Proof.
We reduce the Clique problem to the Netapprovaloutlier problem. Let be an arbitrary instance of the Clique problem. Let . We define the corresponding instance of the Netapprovaloutlier problem as follows:
We claim that the two instances are equivalent. In the forward direction, suppose has a clique on a subset of vertices of size ; let be the clique edges. Then we define the set of outlier votes to be and the committee to be the set of candidates . Now we have and , thereby achieving the net approval score of .
In the reverse direction, suppose we have a set of outlier votes and a committee which achieves a net approval score of . First observe that we can assume without loss of generality that is a subset of since irrespective of the outliers chosen, every candidate in receives at least approvals and every candidate not in receives at most approvals. Now since the committee contains , for every , the vote contributes at most to the net approval score, whereas, for every , the vote contributes at least to the net approval score. Hence, we may assume without loss of generality that belongs to the set of nonoutliers for every . Now we claim that the set of edges must form a clique on the set of vertices . If not then, and thereby making the net approval score strictly more than . ∎
Finally, we show that for all rules considered, the Outliers problem is hard when parameterized by . In particular, for the NetDisapprovaloutliers problem, we have hardness even when and every candidate is approved in exactly two votes.
Lemma 6.
Let Minisum, Disapproval, NetDisapproval. Then Outliers is hard, when parameterized by , even when every candidate is approved in exactly two votes. Further, the NetDisapprovaloutliers problem, parameterized by , is hard even when and every candidate is approved in exactly two votes. In particular, the NetDisapprovaloutliers problem is hard parameterized by .
Proof.
First let us prove the result for the Minisumoutliers problem. We exhibit a parameterized reduction from the Clique problem to the Minisumoutliers problem thereby proving the result. Let be an arbitrary instance of the Clique problem. Let and . We define the corresponding instance of the Minisumoutliers problem as follows:
We claim that the two instances are equivalent. In the forward direction, suppose forms a clique with , and let denote the set of edges that have both endpoints in . Consider the committee and the set of outliers . Consider now a vote . Note that , since every edge incident on is an edge that does not belong to , by the definition of . Further, this also implies that . Therefore, . Hence we have .
For the reverse direction, suppose there exists a set of non-outliers and a committee such that . We claim that the vertices in form a clique. If not, then there exists a vote such that . However, for every vote , we have . This makes , which is a contradiction.
The proofs for the other rules are identical, except for the values of the target score. We define for the disapproval voting rule. For the net disapproval voting rule, we add many dummy candidates who are approved by every vote. We keep and make . It is easily checked that the remaining details are analogous. ∎
4.3 Proofs of the Main Theorems
We are now ready to present the proofs of Theorems 2 to 4. First, we recall the dichotomy for the minisum voting rule.
See Theorem 2.
Proof.
Since and , the tractability results follow from Proposition 1 and Lemma 1. Now, we only have to consider subsets of parameters such that:


does not contain both and

does not contain both and

does not contain both and
Among such choices of , we have the following cases.


If neither nor belongs to , then is either a subset of , or . Note that these cases are already subsumed by Case (1) above.
This completes the proof of the theorem. ∎
Now, we turn to the case of the net disapproval voting rule.
See Theorem 3.
Proof.
Since and , the tractability results follow from Proposition 1. Now, we only have to consider subsets of parameters such that does not contain both and , and does not contain both and . Among such choices of , we have the following cases.


If neither nor belongs to , then is either a subset of , or . Note that these cases are already subsumed by the cases above.
This completes the proof of the theorem. ∎
Finally, we turn to the case of the disapproval voting rule.
See Theorem 4.
Proof.
Since and , the tractability results follow from Proposition 1. Now, we only have to consider subsets of parameters such that does not contain both and , and does not contain both and . Among such choices of , we have the following cases.


If neither nor belongs to , then is either a subset of , or . Note that these cases are already subsumed by the cases above.
This completes the proof of the theorem. ∎
5 Approximation Results
In this section, we describe our results in the context of approximation algorithms, where our goal is to minimize the target score, given a committee size and a budget for the number of outliers as a part of input. In the first part, we focus on the minisum voting rule, and show an approximation algorithm, for every constant . We also have a approximation algorithm whose running time is , for every constant . Subsequently, we show that for all the voting rules, the target score is hard to approximate within any factor, as it is hard already to determine if .
5.1 Approximation Algorithms
We now turn to some approximation algorithms for the MinisumOutliers problem. Our first result is the following.
Theorem 5.
There is a approximation algorithm for the Minisumoutliers problem, for every constant .
Our next result is an approximation scheme with a subexponential running time, loosely based on the framework introduced in [Lo et al., 2014]. To this end, we will need the following lemma.
Lemma 7.
Let be any positive constant. Then, given a set of votes , there exists a subset of size such that , where and