Fair Algorithms for Learning in Allocation Problems

08/30/2018 · by Hadi Elzayn, et al.

Settings such as lending and policing can be modeled by a centralized agent allocating a resource (loans or police officers) amongst several groups, in order to maximize some objective (loans given that are repaid or criminals that are apprehended). Often in such problems fairness is also a concern. A natural notion of fairness, based on general principles of equality of opportunity, asks that conditional on an individual being a candidate for the resource, the probability of actually receiving it is approximately independent of the individual's group. In lending this means that equally creditworthy individuals in different racial groups have roughly equal chances of receiving a loan. In policing it means that two individuals committing the same crime in different districts would have roughly equal chances of being arrested. We formalize this fairness notion for allocation problems and investigate its algorithmic consequences. Our main technical results include an efficient learning algorithm that converges to an optimal fair allocation even when the frequency of candidates (creditworthy individuals or criminals) in each group is unknown. The algorithm operates in a censored feedback model in which only the number of candidates who received the resource in a given allocation can be observed, rather than the true number of candidates. This models the fact that we do not learn the creditworthiness of individuals we do not give loans to nor learn about crimes committed if the police presence in a district is low. As an application of our framework, we consider the predictive policing problem. The learning algorithm is trained on arrest data gathered from its own deployments on previous days, resulting in a potential feedback loop that our algorithm provably overcomes. We empirically investigate the performance of our algorithm on the Philadelphia Crime Incidents dataset.


1 Introduction

The bulk of the literature on algorithmic fairness has focused on classification and regression problems (see e.g. [13, 15, 17, 14, 9, 24, 8, 23, 25, 18, 7, 6, 4, 3]), but fairness concerns also arise naturally in many resource allocation settings. Informally, a resource allocation problem is one in which there is a limited supply of some resource to be distributed across multiple groups with differing needs. Resource allocation problems arise in financial applications (e.g. allocating loans), disaster response (allocating aid), and many other domains, but the running example that we will focus on in this paper is policing. In the predictive policing problem, the resource to be distributed is police officers, which can be dispatched to different districts. Each district has a different crime distribution, and the goal (absent additional fairness constraints) might be to maximize the number of crimes caught. (We understand that policing has many goals besides simply apprehending criminals, including preventing crimes in the first place, fostering healthy community relations, and generally promoting public safety. But for concreteness and simplicity we consider the limited objective of apprehending criminals.)

Of course, fairness concerns abound in this setting, and recent work (see e.g. [19, 10, 11]) has highlighted the extent to which algorithmic allocation might exacerbate those concerns. For example, Lum and Isaac [19] show that if predictive policing algorithms such as PredPol are trained using past arrest data to predict future crime, then pernicious feedback loops can arise, which misestimate the true crime rates in certain districts, leading to an overallocation of police. (Predictive policing algorithms are often proprietary, and it is not clear whether, in deployed systems, arrest data rather than 911 reported crime is used to train the models.) Since the communities that Lum and Isaac [19] showed to be overpoliced on a relative basis were primarily poor and minority, this is especially concerning from a fairness perspective. In this work, we study algorithms which both avoid this kind of under-exploration and can incorporate additional fairness constraints.

In the predictive policing setting, Ensign et al. [10] implicitly consider an allocation to be fair if police are allocated across districts in direct proportion to each district's crime rate; suitably generalized, this definition asks that units of a resource be allocated according to each group's share of the total candidates for that resource. In our work, we study a different notion of allocative fairness which has a similar motivation to the notion of equality of opportunity proposed by Hardt et al. [13] in classification settings. Informally speaking, it asks that the probability that a candidate for the resource actually receives it should be independent of his group. In the predictive policing setting, it asks that conditional on committing a crime, the probability that an individual is apprehended should not depend on the district in which they commit the crime.

1.1 Our Results

To define the extent to which an allocation satisfies our fairness constraint, we must model the specific mechanism by which resources deployed to a particular group reach their intended targets. We study two such discovery models, and we view the explicit framing of this modeling step as one of the contributions of our work; the implications of a fairness constraint depend strongly on the details of the discovery model, and choosing one is an important step in making one’s assumptions transparent.

We study two discovery models which capture two extremes of targeting ability. In the random discovery model, however many units of the resource are allocated to a given group, all individuals within that group are equally likely to be assigned a unit, regardless of whether they are a candidate for the resource or not. In other words, the probability that a candidate receives a resource is equal to the ratio of the number of units of the resource assigned to his group to the size of his group (independent of the number of candidates in the group).

At the other extreme, in the precision discovery model, units of the resource are given only to actual candidates within a group, as long as there is sufficient supply of the resource. In other words, the probability that a candidate receives a resource is equal to the ratio of the number of units of the resource assigned to his group, to the number of candidates within his group.

In the policing setting, these models can be viewed as two extremes of police targeting ability for an intervention like stop-and-frisk. In the random model, police are viewed as stopping people uniformly at random. In the precision model, police have the magical ability to identify individuals with contraband, and stop only them. Of course, reality lies somewhere in between.
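To make the two extremes concrete, the following is a minimal sketch of both discovery models as sampling procedures. The function names and the use of Python's standard library are our own illustration (with `units`, `candidates`, and `group_size` playing the roles of the quantities described above), not code from the paper.

```python
import random

def precision_discovery(units, candidates):
    # Precision model: every deployed unit reaches a candidate,
    # until either the units or the candidates run out.
    return min(units, candidates)

def random_discovery(units, candidates, group_size, rng=random):
    # Random model: the units reach a uniformly random subset of the group,
    # so the number of candidates discovered is hypergeometric.
    population = [1] * candidates + [0] * (group_size - candidates)
    sample = rng.sample(population, min(units, group_size))
    return sum(sample)
```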

Fairness in the random model constrains resources to be distributed proportional to group sizes, and so is uninteresting from a learning perspective. But the precision model yields an interesting fairness-constrained learning problem when the distribution of the number of candidates in each group is unknown and must be learned via observation.

We study learning in a censored feedback setting: each round, the algorithm can choose a feasible deployment of resources across groups. Then the number of candidates for the current round in each group is drawn independently from a fixed, but unknown group-dependent distribution. The algorithm does not observe the number of candidates present in each group, but only the number of candidates that received the resource. In the policing setting, this corresponds to the algorithm being able to observe the number of arrests, but not the actual number of crimes in each of the districts. Thus, the extent to which the algorithm can learn about the distribution in a particular group is limited by the number of resources it deploys there. The goal of the algorithm is to converge to an optimal fairness-constrained deployment, where here both the objective value of the solution, and the constraints imposed on it, depend on the unknown distributions.

One trivial solution to the learning problem is to sequentially deploy all of one's resources to each group in turn, for a sufficient amount of time to accurately learn the candidate distributions. This would reduce the learning problem to an offline constrained optimization problem, which we show can be efficiently solved by a greedy algorithm. But this algorithm is unreasonable: it has a large exploration phase in which it uses nonsensical deployments, vastly overpolicing some districts and underpolicing others. A much more realistic, natural approach is a "greedy"-style algorithm, which at each round simply uses its current best-guess estimate for the distribution in each group and deploys an optimal fairness-constrained allocation corresponding to these estimates. Unfortunately, as we show, if one makes no assumptions on the underlying distributions, any algorithm that has a guarantee of converging to a fair allocation must behave like the trivial algorithm, deploying vast numbers of resources to each group in turn.

This impossibility result motivates us to consider the learning problem in which the unknown distributions are from a known parametric family. For any single-parameter Lipschitz-continuous family of distributions (including Poisson), we analyze the natural greedy algorithm — which at each round uses an optimal fair deployment for the maximum likelihood distributions given its (censored) observations so far — and show that it converges to an optimal fair allocation.

Finally, we conduct an empirical evaluation of our algorithm on the Philadelphia Crime Incidents dataset, which records all crimes reported to the Philadelphia Police Department's INCT system between 2006 and 2016. We verify that the crime distributions in each district are in fact well-approximated by Poisson distributions, and that our algorithm converges quickly to an optimal fair allocation (as measured according to the empirical crime distributions in the dataset). We also systematically evaluate the Price of Fairness, and plot the Pareto curves that trade off the number of crimes caught versus the slack allowed in our fairness constraint, for different sizes of police force, on this dataset. For the random discovery model, we prove worst-case bounds on the Price of Fairness.

1.2 Further Related Work

Our precision discovery model is inspired by and has technical connections to  Ganchev et al. [12], who study the dark pool problem from quantitative finance, in which a trader wishes to execute a specified number of trades across a set of exchanges of unknown liquidity. This setting naturally requires learning in a censored feedback model, just as in our setting. Their algorithm can be viewed as a learning algorithm in our setting that is only aiming to optimize utility, absent any fairness constraint. Later Agarwal et al. [1] extend the dark pool problem to an adversarial (rather than distributional) setting. This is quite closely related to the work of Ensign et al. [11] who also consider the precision model (under a different name) in an adversarial predictive policing setting. They provide no-regret algorithms for this setting by reducing the problem to learning in a partial monitoring environment. Since their setting is equivalent to that of Agarwal et al. [1], algorithms in Agarwal et al. [1] can be directly applied to the problem studied by Ensign et al. [11].

Our desire to study the natural greedy algorithm rather than an algorithm which uses “unreasonable” allocations during an exploration phase is an instance of a general concern about exploration in fairness-related problems [5]. Recent works have studied the performance of greedy algorithms in different settings for this reason [2, 16, 22].

Lastly, the term fair allocation appears in the fair division literature (see e.g. [21] for a survey), but that body of work is technically quite distinct from the problem we study here.

2 Setting

We study an allocator who has $V$ units of a resource and is tasked with distributing them across a population partitioned into $k$ groups. Each group is divided into candidates, who are the individuals the allocator would like to receive the resource, and non-candidates, who are the remaining individuals. We let $n_i$ denote the total number of individuals in group $i$. The number of candidates $c_i$ in group $i$ is a random variable drawn from a fixed but unknown distribution $\mathcal{D}_i$ called the candidate distribution. We use $N$ to denote the total size of all groups, i.e. $N = \sum_{i=1}^{k} n_i$. An allocation $v = (v_1, \dots, v_k)$ is a partitioning of these $V$ units, where $v_i$ denotes the units of the resource allocated to group $i$. Every allocation is bound by a feasibility constraint which requires that $\sum_{i=1}^{k} v_i \le V$.

A discovery model is a (possibly randomized) function $f$ mapping the number of units $v_i$ allocated to group $i$ and the number of candidates $c_i$ in group $i$ to the number of candidates discovered (who receive the resource), $f(v_i, c_i)$. In the learning setting, upon fixing an allocation $v$, the learner gets to observe (a realization of) $f(v_i, c_i)$ for the realized value of $c_i$ for each group $i$. Fixing an allocation $v$, a discovery model $f$, and candidate distributions $\mathcal{D}_i$ for all groups $i$, we define the total expected number of discovered candidates as
$$u_f(v) = \sum_{i=1}^{k} \mathbb{E}\big[f(v_i, c_i)\big],$$
where the expectation is taken over $c_i \sim \mathcal{D}_i$ and any randomization in the discovery model $f$. When the discovery model and the candidate distributions are fixed, we will simply write $u(v)$ for brevity. We use the terms total expected number of discovered candidates and (expected) utility interchangeably. We refer to an allocation that maximizes the expected number of discovered candidates over all feasible allocations as an optimal allocation and denote it by $v^*$.

2.1 Allocative Fairness

For the purposes of this paper, we say that an allocation is fair if it satisfies approximate equality of candidate discovery probability across groups. We call this the discovery probability for brevity. This formalizes the intuition that it is unfair if candidates in one group have an inherently higher probability of receiving the resource than candidates in another. Formally, we define our notion of allocative fairness as follows.

Definition 1.

Fix a discovery model $f$ and the candidate distributions $\mathcal{D}_1, \dots, \mathcal{D}_k$. For an allocation $v$, let $p_i(v)$ denote the expected probability that a random candidate from group $i$ receives a unit of the resource at allocation $v$. Then for any $\alpha \in [0, 1]$, $v$ is $\alpha$-fair if
$$\big| p_i(v) - p_j(v) \big| \le \alpha$$
for all pairs of groups $i$ and $j$.

When it is clear from the context, for brevity, we write $p_i$ for the discovery probability in group $i$. We emphasize that this definition (1) depends crucially on the chosen discovery model, and (2) requires nothing about the treatment of non-candidates. We think of this as a minimal definition of fairness, in that one might want to further constrain the treatment of non-candidates — but we do not consider that extension.

Since the discovery probabilities $p_i$ and $p_j$ are in $[0, 1]$, the absolute value of their difference always lies in $[0, 1]$. By setting $\alpha = 1$ we impose no fairness constraint whatsoever on the allocations, and by setting $\alpha = 0$ we require exact fairness.

We refer to an allocation $v$ that maximizes $u(v)$ subject to $\alpha$-fairness and the feasibility constraint as an optimal $\alpha$-fair allocation and denote it by $v^*_\alpha$. In general, $u(v^*_\alpha)$ is monotonically non-decreasing in $\alpha$, since as $\alpha$ diminishes, the utility maximization problem becomes more constrained.

3 The Precision Discovery Model

We begin by introducing the precision model of discovery. Allocating $v_i$ units to group $i$ in the precision model results in the discovery of $\min(v_i, c_i)$ candidates. This models the ability to perfectly discover and reach candidates in a group with the resources deployed to that group, limited only by the number of deployed resources and the number of candidates present.

The precision model results in censored observations that have a particularly intuitive form. Recall that in general, a learning algorithm at each round gets to choose an allocation $v$ and then observe $f(v_i, c_i)$ for each group $i$. In the precision model, this results in the following kind of observation: when $v_i$ is larger than $c_i$, the allocator learns the number of candidates present on that day exactly. We refer to this kind of feedback as an uncensored observation. But when $v_i$ is at most $c_i$, all the allocator learns is that the number of candidates is at least $v_i$. We refer to this kind of feedback as a censored observation.

The rest of this section is organized as follows. In Sections 3.1 and 3.2 we characterize optimal and optimal fair allocations for the precision model when the candidate distributions are known. In Section 3.3 we focus on learning an optimal fair allocation when these distributions are unknown. We show that any learning algorithm that is guaranteed to find a fair allocation in the worst case over candidate distributions has the undesirable property that, at some point, it must allocate a vast number of its resources to each group individually. To bypass this hurdle, in Section 3.4 we show that when the candidate distributions have a parametric form, a natural greedy algorithm which always uses an optimal fair allocation for the current maximum likelihood estimates of the candidate distributions converges to an optimal fair allocation.

3.1 Optimal Allocation

We first describe how an optimal allocation (absent fairness constraints) can be computed efficiently when the candidate distributions are known. In Ganchev et al. [12], the authors provide an algorithm for computing an optimal allocation, where the distributions over the number of shares present in each dark pool are known and the trader wishes to maximize the expected number of traded shares. We can use the exact same algorithm to compute an optimal allocation in our setting. Here we present the high level ideas of their algorithm in the language of our model, and provide full details for completeness in Appendix B.

Let $T_i(j)$ denote the probability that there are at least $j$ candidates in group $i$. We refer to $T_i(j)$ as the tail probability of $\mathcal{D}_i$ at $j$. Recall that the value of the cumulative distribution function (CDF) of $\mathcal{D}_i$ at $j$ is defined to be $F_i(j) = \Pr[c_i \le j]$, so $T_i(j)$ can be written in terms of CDF values as $T_i(j) = 1 - F_i(j - 1)$.

First, observe that the expected total number of candidates discovered by an allocation $v$ in the precision model can be written in terms of the tail probabilities of the candidate distributions, i.e.
$$u(v) = \sum_{i=1}^{k} \mathbb{E}\big[\min(v_i, c_i)\big] = \sum_{i=1}^{k} \sum_{j=1}^{v_i} T_i(j).$$

Since the objective function is concave (as $T_i(j)$ is non-increasing in $j$ for all $i$), a greedy algorithm which iteratively allocates the next unit of the resource to a group in $\arg\max_i T_i(v_i + 1)$, where $v_i$ is the current allocation to group $i$ at that point, achieves an optimal allocation.
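Once the tail probabilities are in hand, this greedy rule is straightforward to implement. The sketch below is a hypothetical helper of our own naming; it assumes `tails[i][j]` stores the tail probability $T_i(j)$ for $j \ge 1$ (index 0 unused).

```python
import heapq

def greedy_optimal_allocation(tails, total_units):
    """Greedily give each unit to the group whose next unit has the largest
    marginal tail probability T_i(v_i + 1)."""
    k = len(tails)
    allocation = [0] * k
    # Max-heap (via negated keys) over the marginal value of each group's next unit.
    heap = [(-tails[i][1], i) for i in range(k) if len(tails[i]) > 1]
    heapq.heapify(heap)
    for _ in range(total_units):
        if not heap:
            break
        neg_gain, i = heapq.heappop(heap)
        allocation[i] += 1
        next_unit = allocation[i] + 1
        if next_unit < len(tails[i]):
            heapq.heappush(heap, (-tails[i][next_unit], i))
    return allocation
```

For example, `greedy_optimal_allocation(tails, 50)` distributes 50 units across the groups under the precision model.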

3.2 Optimal Fair Allocation

We next show how to compute an optimal $\alpha$-fair allocation in the precision model when the candidate distributions are known and do not need to be learned.

To build intuition for how the algorithm works, imagine that group $i^*$ has the highest discovery probability in $v^*_\alpha$, and that the allocation $v_{i^*}$ to that group is somehow known to the algorithm ahead of time. The constraint of $\alpha$-fairness then implies that the discovery probability of each other group $i$ in $v^*_\alpha$ must satisfy $p_{i^*} - \alpha \le p_i \le p_{i^*}$. This in turn implies upper and lower bounds on the feasible allocations to group $i$. The algorithm is then simply a constrained greedy algorithm: subject to these implied constraints, it iteratively allocates units so as to maximize their marginal probability of reaching another candidate. Since the group maximizing the discovery probability in $v^*_\alpha$ and the corresponding allocation are not known ahead of time, the algorithm simply iterates through all possible choices.

Input: candidate distributions $\mathcal{D}_1, \dots, \mathcal{D}_k$, supply $V$, and fairness parameter $\alpha$.
Output: An optimal $\alpha$-fair allocation $v^*_\alpha$.
$v^*_\alpha \leftarrow \emptyset$. Initialize the output.
$u^* \leftarrow 0$. Keep track of the utility of the output.
for $i^* = 1, \dots, k$ do Guess for the group with the highest probability of discovery.
     $v \leftarrow (0, \dots, 0)$.
     for $w = 0, \dots, V$ do Guess for the allocation to that group.
          Set $v_{i^*} = w$ in $v$ and compute $p_{i^*}$.
          $\bar{p} \leftarrow p_{i^*}$. Upper bound on the discovery probability of every group.
          $\underline{p} \leftarrow p_{i^*} - \alpha$. Lower bound on the discovery probability of every group.
          for $i \ne i^*$ do Upper and lower bounds for other groups.
               Update the bounds $h_i$ and $\ell_i$ on $v_i$ using $\bar{p}$, $\underline{p}$ and $\mathcal{D}_i$.
               $v_i \leftarrow \ell_i$. Minimum allocation to group $i$.
          if $\sum_i v_i > V$ then
               continue. Allocation is not feasible.
          for each of the remaining $V - \sum_i v_i$ units do Allocate the remaining resources greedily while obeying fairness.
               $i \leftarrow \arg\max_{i \ne i^*} T_i(v_i + 1)$ s.t. $v_i + 1 \le h_i$.
               $v_i \leftarrow v_i + 1$.
          $u \leftarrow u(v)$. Compute the utility of $v$.
          if $u > u^*$ then Update the best $\alpha$-fair allocation found so far.
               $u^* \leftarrow u$.
               $v^*_\alpha \leftarrow v$.
return $v^*_\alpha$.
Algorithm 1 Computing an optimal fair allocation in the precision model

Pseudocode is given in Algorithm 1. We prove that Algorithm 1 returns an optimal $\alpha$-fair allocation in Theorem 1. We defer the proof of Theorem 1 and all the other omitted proofs in this section to Appendix B.

Theorem 1.

Algorithm 1 computes an optimal $\alpha$-fair allocation for the precision model in time polynomial in $k$ and $V$.

3.3 Learning Fair Allocations Generally Requires Brute-Force Exploration

In Sections 3.1 and 3.2 we assumed the candidate distributions were known. When the candidate distributions are unknown, learning algorithms intending to converge to optimal -fair allocations must learn a sufficient amount about the distributions in question to certify the fairness of the allocation they finally output. Because learners must deal with feedback in the censored observation model, this places constraints on how they can proceed. Unfortunately, as we show in this section, if candidate distributions are allowed to be worst-case, this will force a learner to engage in what we call “brute-force exploration” — the iterative deployment of a large fraction of the resources to each subgroup in turn. This is formalized in Theorem 2.

Theorem 2.

Define $n_{\max}$ to be the size of the largest group, and assume a mild regularity condition on the size of every group. Let $\alpha, \delta \in (0, 1)$, and let $\mathcal{A}$ be any learning algorithm for the precision model which runs for a finite number of rounds and outputs an allocation. Suppose that there is some group to which $\mathcal{A}$ has not allocated a large number of units for a large number of rounds upon termination (the precise thresholds involve $n_{\max}$, $\delta$, and an absolute constant). Then there exists a candidate distribution for that group such that, with probability at least $1 - \delta$, $\mathcal{A}$ outputs an allocation that is not $\alpha$-fair.

Sketch of the Proof.

Let $i$ denote a group to which $\mathcal{A}$ has not allocated the required number of units for the required number of rounds upon its termination, and let $v$ denote an arbitrary allocation. We will design two candidate distributions for group $i$ whose true discovery probabilities under $v$ are more than $\alpha$ apart, but which are indistinguishable given the observations of the algorithm with probability at least $1 - \delta$. To do so, consider distributions $\mathcal{D}$ and $\mathcal{D}'$ which satisfy the following four conditions.

  1. $\mathcal{D}$ and $\mathcal{D}'$ agree on all values less than a threshold $m$.

  2. The total mass of both distributions below $m$ is $1 - q$, for a suitably chosen $q > 0$.

  3. The remaining mass of $\mathcal{D}$ is on the value $m$.

  4. The remaining mass of $\mathcal{D}'$ is on a value larger than $m$.

Distinguishing between $\mathcal{D}$ and $\mathcal{D}'$ requires at least one uncensored observation beyond $m$. However, even conditioned on allocating a large number of units to group $i$ in a given round, the probability of observing such an uncensored observation in that round is small. So to distinguish between $\mathcal{D}$ and $\mathcal{D}'$ with confidence $1 - \delta$, and therefore to guarantee a fair allocation, a learning algorithm must allocate a large number of units to group $i$ for many rounds. ∎

Observe that if the total supply $V$ is below the allocation threshold in Theorem 2, then Theorem 2 implies that no algorithm can guarantee $\alpha$-fairness with high probability. Moreover, even when the supply is large enough, Theorem 2 shows that in general, if we want algorithms that have provable guarantees for arbitrary candidate distributions, it is impossible to avoid something akin to brute-force search (recall that there is a trivial algorithm which simply allocates all resources to each group in turn, for sufficiently many rounds to approximately learn the CDF of the candidate distribution, and then solves the offline problem). In the next section, we circumvent this by giving an algorithm with provable guarantees, assuming a parametric form for the candidate distributions.

3.4 Poisson Distributions and Convergence of the MLE

In this section, we assume that all the candidate distributions have a particular and known parametric form, but that the parameters of these distributions are not known to the allocator. Concretely, we assume that the candidate distribution for each group $i$ is Poisson, denoted $\text{Poisson}(\lambda_i)$ (to match our model, we would technically need to assume a truncated Poisson distribution to satisfy the bounded support condition; however, the distinction will not be important for the analysis, and so to minimize technical overhead, we perform the analysis assuming an untruncated Poisson), and write $\lambda = (\lambda_1, \dots, \lambda_k)$ for the true underlying parameters of the candidate distributions. This choice appears justified, at least in the predictive policing application, as the candidate distributions in the Philadelphia Crime Incidents dataset are well-approximated by Poisson distributions (see Section 4 for further discussion). This assumption allows an algorithm to learn the tails of these distributions without needing to rely on brute-force search, thus circumventing the limitation given in Theorem 2. Indeed, we show that (a small variant of) the natural greedy algorithm incorporating these distributional assumptions converges to an optimal fair allocation.
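For intuition about the maximum likelihood step, the censored Poisson likelihood can be maximized by a simple grid search over the known bounded parameter range. The sketch below is our own illustration: `history` is a hypothetical list of (allocation, observation) pairs for a single group, and `lam_grid` is a grid of candidate (strictly positive) parameter values.

```python
import math

def poisson_log_pmf(lam, j):
    # log Pr[c = j] for c ~ Poisson(lam); lam must be positive.
    return -lam + j * math.log(lam) - math.lgamma(j + 1)

def censored_log_likelihood(lam, history):
    """history: list of (v, o) pairs with o = min(v, c) and c ~ Poisson(lam).
    o < v is an exact count; o == v only reveals that c >= v."""
    total = 0.0
    for v, o in history:
        if o < v:                      # uncensored observation
            total += poisson_log_pmf(lam, o)
        else:                          # censored observation: Pr[c >= v]
            tail = 1.0 - sum(math.exp(poisson_log_pmf(lam, j)) for j in range(v))
            total += math.log(max(tail, 1e-300))
    return total

def mle_censored_poisson(history, lam_grid):
    # Grid search is enough here because the parameter lies in a known bounded interval.
    return max(lam_grid, key=lambda lam: censored_log_likelihood(lam, history))
```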

At a high level, in each round, our algorithm uses Algorithm 1 to calculate an optimal fair allocation with respect to the current maximum likelihood estimates of the group distributions; then, it uses the new observations it obtains from this allocation to refine these estimates for the next round. This is summarized in Algorithm 2. The algorithm differs from this pure greedy strategy in one respect, to overcome the following subtlety: there is a possibility that Algorithm 1, when operating on preliminary estimates of the candidate distributions, will suggest sending no units to some group, even when the optimal allocation for the true distributions sends some units to every group. Such a deployment would result in the algorithm receiving no feedback in that round for the group that did not receive any units. If this suggestion is followed and the lack of feedback is allowed to persist indefinitely, the algorithm's parameter estimate for the halted group will also stop updating — potentially at an incorrect value. In order to avoid this problem and continue making progress in learning, our algorithm chooses another allocation in this case. As we show, any allocation that allocates positive resources to all groups will do; in particular, we make a natural choice: in this case, the algorithm simply re-uses the allocation from the previous round.

Input: $V$, $\alpha$, and $T$ (total number of rounds).
Output: An allocation $v$ and estimates $\hat{\lambda}_1, \dots, \hat{\lambda}_k$ of the parameters $\lambda_1, \dots, \lambda_k$.
$v^1 \leftarrow (V/k, \dots, V/k)$. Allocate uniformly.
for rounds $t = 1, \dots, T$ do
     if $v_i^t = 0$ for some group $i$ then Check whether every group is allocated a resource.
          $v^t \leftarrow v^{t-1}$.
     Observe $o_i^t = \min(v_i^t, c_i^t)$ for each group $i$.
     for $i = 1, \dots, k$ do
          Update the history $h_i^t$ with $v_i^t$ and $o_i^t$.
          $\hat{\lambda}_i^t \leftarrow \arg\max_{\lambda} L_{h_i^t}(\lambda)$. Solve the maximum likelihood estimation problem.
     $v^{t+1} \leftarrow$ output of Algorithm 1 on $\text{Poisson}(\hat{\lambda}_1^t), \dots, \text{Poisson}(\hat{\lambda}_k^t)$, $V$ and $\alpha$. Compute an allocation to be deployed in the next round.
return $v^{T+1}$ and $\hat{\lambda}_1^T, \dots, \hat{\lambda}_k^T$.
Algorithm 2 Learning an optimal fair allocation
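Stripped of the analysis, the loop in Algorithm 2 has a simple shape. The sketch below is a schematic rendering under our own naming: `observe` is a hypothetical callback returning the censored counts for a deployed allocation, `optimal_fair_allocation` stands in for Algorithm 1 run on the current Poisson estimates, and `mle_censored_poisson` is the grid-search estimator sketched above.

```python
def learn_fair_allocation(rounds, total_units, alpha, num_groups, observe, lam_grid):
    allocation = [total_units // num_groups] * num_groups   # uniform start (assumes V >= k)
    histories = [[] for _ in range(num_groups)]
    estimates = [None] * num_groups
    for _ in range(rounds):
        observations = observe(allocation)                  # observations[i] = min(allocation[i], c_i)
        for i in range(num_groups):
            histories[i].append((allocation[i], observations[i]))
            estimates[i] = mle_censored_poisson(histories[i], lam_grid)
        # Stand-in for Algorithm 1 applied to the current maximum likelihood estimates.
        proposal = optimal_fair_allocation(estimates, total_units, alpha)
        # If the proposal would starve some group of feedback, keep the current allocation.
        allocation = proposal if min(proposal) > 0 else allocation
    return allocation, estimates
```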

Notice that Algorithm 2 chooses an allocation at every round which is fair with respect to its estimates of the parameters of the candidate distributions; hence, asymptotic convergence of its output to an optimal -fair allocation follows directly from the convergence of the estimates to true parameters. However, we seek a stronger, finite sample guarantee, as stated in Theorem 3.

Theorem 3.

Let $\epsilon, \delta \in (0, 1)$. Suppose that the candidate distributions are Poisson distributions with unknown parameters $\lambda_1, \dots, \lambda_k$, where each $\lambda_i$ lies in a known bounded interval. Suppose we run Algorithm 2 for $T$ rounds, where $T$ is a sufficiently large, distribution-specific function of $\epsilon$ and $\delta$ (see Corollary 3 for the precise relationship; the bound hides poly-logarithmic terms), to get an allocation $v$ and estimated parameters $\hat{\lambda}_i$ for all groups $i$. Then with probability at least $1 - \delta$:

  1. For all groups $i$, $|\hat{\lambda}_i - \lambda_i| \le \epsilon$.

  2. Let $d = \max_i d_{TV}\big(\text{Poisson}(\lambda_i), \text{Poisson}(\hat{\lambda}_i)\big)$, where $d_{TV}$ denotes the total variation distance between two distributions. Then

    • $v$ is $(\alpha + O(d))$-fair.

    • $v$ has utility smaller than the utility of an optimal $\alpha$-fair allocation $v^*_\alpha$ by at most an additive term that is linear in $d$.

Remark 1.

Theorem 3 implies that in the limit, the allocation from Algorithm 2 converges to an optimal $\alpha$-fair allocation. As $T \to \infty$, $\hat{\lambda}_i \to \lambda_i$ for all $i$, meaning $d \to 0$ and, more importantly, $v$ will be $\alpha$-fair and optimal.

The rest of this section is dedicated to the proof of Theorem 3. First, we introduce notation. Since we assumed the candidate distribution for each group is Poisson, the probability mass function (PMF) and the CDF of the candidate distribution for group $i$ can be written as
$$\Pr[c_i = j] = \frac{e^{-\lambda_i} \lambda_i^{j}}{j!}, \qquad F_{\lambda_i}(j) = \sum_{\ell = 0}^{j} \frac{e^{-\lambda_i} \lambda_i^{\ell}}{\ell!}.$$

Given an allocation of $v_i$ units of the resource to group $i$, we use $o_i = \min(v_i, c_i)$ to denote the (possibly censored) observation received by Algorithm 2. So while the candidates in group $i$ are generated according to $\text{Poisson}(\lambda_i)$, the observations of Algorithm 2 follow a censored Poisson distribution, which we abbreviate by $\text{Poisson}_{v_i}(\lambda_i)$. We can write the PMF of this distribution as
$$\Pr[o_i = j] = \begin{cases} \dfrac{e^{-\lambda_i} \lambda_i^{j}}{j!} & j < v_i, \\[4pt] 1 - F_{\lambda_i}(v_i - 1) & j = v_i, \end{cases}$$
where $F_{\lambda_i}(v_i - 1)$ is the CDF value of $\text{Poisson}(\lambda_i)$ at $v_i - 1$.

Since Algorithm 2 operates in rounds, we use the superscript $t$ throughout to denote the round. For each round $t$, denote the history of the units allocated to group $i$ and the observations received (candidates discovered) in rounds up to $t$ by $h_i^t = \{(v_i^s, o_i^s)\}_{s=1}^{t}$. We use $h^t$ to denote the history of all groups. All the probabilities and expectations in this section are over the randomness of the observations drawn from the censored Poisson distributions unless otherwise noted; hence we suppress related notation for brevity. Finally, an allocation function in round $t$ is a mapping from the history of all groups to the number of units to be allocated to each group, i.e. from $h^{t-1}$ to $v^t$. For convenience, we use $v^t$ to denote the allocation at round $t$. We are now ready to define likelihood functions.

Definition 2.

Let $\ell_\lambda(o \mid v)$ denote the (censored) likelihood of discovering $o$ candidates given an allocation of $v$ units to group $i$, assuming the candidate distribution follows $\text{Poisson}(\lambda)$. So, given any history $h_i^t$, the empirical log-likelihood function for group $i$ is
$$L_{h_i^t}(\lambda) = \frac{1}{t} \sum_{s=1}^{t} \log \ell_\lambda\big(o_i^s \mid v_i^s\big).$$

The expected log-likelihood function, given the history of allocations but taken over the randomness of the candidate distribution, can be written as
$$\bar{L}_{h_i^t}(\lambda) = \frac{1}{t} \sum_{s=1}^{t} \mathbb{E}\Big[\log \ell_\lambda\big(o_i^s \mid v_i^s\big)\Big],$$
where the expectation is over the randomness of $o_i^s$ drawn from $\text{Poisson}_{v_i^s}(\lambda_i)$.

Proof of Theorem 3

To prove Theorem 3, we first show that any sequence of allocations selected by Algorithm 2 will eventually recover the true parameters. There are two conceptual difficulties here: the first is that standard convergence results typically leverage the assumption of independence, which does not hold in this case, as Algorithm 2 computes adaptive allocations which depend on the allocations in previous rounds; the second is the censoring of the observations. Despite these difficulties, we give quantifiable rates at which the estimates converge to the true parameters. Next, we show that computing an optimal $\alpha$-fair allocation using the estimated parameters results in an allocation that is at most $(\alpha + O(d))$-fair with respect to the true candidate distributions, where $d$ denotes the maximum total variation distance between the true and estimated Poisson distributions across all groups. Finally, we show that this allocation also achieves a utility that is comparable to the utility of an optimal $\alpha$-fair allocation. We note that while Theorem 3 is only stated for Poisson distributions, our result can be generalized to any single-parameter Lipschitz-continuous family of distributions (see Remark 2).

Closeness of the Estimated Parameters

Our argument can be stated at a high level as follows: for any group $i$ and any history $h_i^t$, the empirical log-likelihood converges to the expected log-likelihood for any sequence of allocations made by Algorithm 2, as formalized in Lemma 1. We then show in Lemma 2 that the closeness of the empirical and expected log-likelihoods implies that the maximizers of these quantities (corresponding to the estimated and true parameters) will also become close. Since in our analysis we consider the groups separately, we fix a group and drop the subscript $i$ throughout the rest of this section for convenience.

We start by studying the rate of convergence of the empirical log-likelihood to the expected log-likelihood.

Lemma 1.

With probability at least $1 - \delta$, for any parameter $\lambda'$ in the known interval and any history observed by Algorithm 2, the empirical log-likelihood $L(\lambda')$ and the expected log-likelihood $\bar{L}(\lambda')$ differ by at most a quantity that shrinks to zero as the number of rounds grows.

The true and estimated parameters for each group correspond to the maximizers of the expected and empirical log-likelihoods, respectively (see Corollary 1 in Appendix B). We next show that closeness of the empirical and expected log-likelihoods implies that the true and estimated parameters are also close.

Lemma 2.

Let $\hat{\lambda}$ denote the estimate of Algorithm 2 after $T$ rounds. Then with probability at least $1 - \delta$, $|\hat{\lambda} - \lambda| \le \epsilon$.

Proof.

By Corollary 1, the expected log-likelihood $\bar{L}_{h^T}$ has a unique maximizer at the true parameter $\lambda$, and by Corollary 3 there exists some $\gamma > 0$ such that for any $\lambda'$ with $\bar{L}_{h^T}(\lambda) - \bar{L}_{h^T}(\lambda') \le \gamma$, we must have $|\lambda' - \lambda| \le \epsilon$. The estimate $\hat{\lambda}$ is defined to be the maximizer of the empirical log-likelihood, i.e.

$$\hat{\lambda} = \arg\max_{\lambda'} L_{h^T}(\lambda'). \qquad (1)$$

Applying Lemma 1 (for $T$ sufficiently large) implies that, with probability at least $1 - \delta$, for any $\lambda' \in \{\lambda, \hat{\lambda}\}$,

$$\big| L_{h^T}(\lambda') - \bar{L}_{h^T}(\lambda') \big| \le \gamma / 2. \qquad (2)$$

Since $\hat{\lambda}$ is the maximizer in Equation 1, we have that

$$L_{h^T}(\hat{\lambda}) \;\ge\; L_{h^T}(\lambda) \;\ge\; \bar{L}_{h^T}(\lambda) - \gamma/2,$$

where the last inequality is by Equation 2. This implies that

$$\bar{L}_{h^T}(\hat{\lambda}) \;\ge\; L_{h^T}(\hat{\lambda}) - \gamma/2 \;\ge\; \bar{L}_{h^T}(\lambda) - \gamma,$$

where the first inequality is again by Equation 2. So $\bar{L}_{h^T}(\lambda) - \bar{L}_{h^T}(\hat{\lambda}) \le \gamma$, and thus Corollary 3 gives that $|\hat{\lambda} - \lambda| \le \epsilon$. ∎

Combining Lemma 2 with a union bound over all groups shows that, with probability at least $1 - \delta$, if Algorithm 2 is run for $T$ rounds, then $|\hat{\lambda}_i - \lambda_i| \le \epsilon$ for all groups $i$. Note that as $T \to \infty$, the maximum total variation distance between the estimated and the true distributions, $d$, converges in probability to 0.

Fairness of the Allocation

In this section, we show that the fairness violation (i.e. the maximum difference in discovery probabilities over all pairs of groups) exceeds $\alpha$ by an amount that is linear in $d$. Therefore, as the running time of Algorithm 2 increases, and hence $d \to 0$, the fairness violation of $v$ approaches $\alpha$. This is stated formally as follows.

Lemma 3.

Let $v$ denote the allocation returned by Algorithm 2 after $T$ rounds. Then with probability at least $1 - \delta$, the fairness violation of $v$ with respect to the true candidate distributions is at most $\alpha$ plus an additive term proportional to $d$.

Proof.

For any pair of groups $i$ and $j$, we can bound $|p_i(v) - p_j(v)|$ by the triangle inequality: it is at most the difference between the true and estimated discovery probabilities of group $i$, plus the difference between the estimated discovery probabilities of groups $i$ and $j$, plus the difference between the estimated and true discovery probabilities of group $j$. The middle term is at most $\alpha$ because Algorithm 1 returns an $\alpha$-fair allocation with respect to its input distributions. The first and third terms can be bounded by Lemma 12 (in Appendix B), which shows that for any fixed allocation, the difference between the discovery probability in group $i$ with respect to the true and estimated candidate distributions is proportional to the total variation distance between the two distributions. ∎

Utility of the Allocation

In this section we analyze the utility of the allocation $v$ returned by Algorithm 2. Once again, note that as $d \to 0$, which happens as the running time of Algorithm 2 increases, $v$ becomes optimal and $\alpha$-fair.

Lemma 4.

Let $v$ denote the allocation returned by Algorithm 2 after $T$ rounds. Then with probability at least $1 - \delta$, the utility of $v$ is smaller than the utility of an optimal $\alpha$-fair allocation by at most an additive term proportional to $d$.

Proof.

Consider the optimization problem $P(\mathcal{D}, \hat{\mathcal{D}}, \alpha)$: maximize the expected utility computed under the candidate distributions $\mathcal{D}$, subject to the feasibility constraint and to $\alpha$-fairness computed under the candidate distributions $\hat{\mathcal{D}}$.

We can think of this optimization problem as the case where the underlying candidate distributions used for the objective value and for the fairness constraints are different. Let us write $v(\mathcal{D}, \hat{\mathcal{D}}, \alpha)$ to denote an optimal allocation in this problem. So an optimal fair allocation and the allocation returned by Algorithm 2 can be written as $v^*_\alpha = v(\mathcal{D}, \mathcal{D}, \alpha)$ and $v = v(\hat{\mathcal{D}}, \hat{\mathcal{D}}, \alpha)$, respectively.

Note that for any fixed allocation $v$,

$$\big| u_{\mathcal{D}}(v) - u_{\hat{\mathcal{D}}}(v) \big| \;\le\; \sum_{i=1}^{k} \sum_{j=1}^{v_i} \big| T_i(j) - \hat{T}_i(j) \big|, \qquad (3)$$

where $\hat{T}_i$ is the tail probability of $\hat{\mathcal{D}}_i$. This is because $u_{\mathcal{D}}(v) = \sum_i \sum_{j=1}^{v_i} T_i(j)$. In other words, even when the underlying candidate distribution used for the objective value changes, the utility of a fixed allocation changes by at most an amount proportional to the total variation distance between the distributions.

Now the utility of $v$ under the true distributions can be related to the utility of $v^*_\alpha$ by a short chain of inequalities. The first and third steps are by Equation 3, which shows how the utility deteriorates when the underlying distribution for the objective function changes. The middle step follows from Lemma 3, as any fair allocation is a feasible allocation to the mixed problem above, and $v$ is an optimal solution to this problem. ∎

Remark 2.

Although we assumed Poisson distributions in this section, all our results hold for any single-parameter Lipschitz-continuous family of distributions whose parameter is drawn from a compact set. However, the convergence rate in Theorem 3 depends on a distribution-specific quantity (the likelihood gap appearing in Corollary 3), which in turn depends on the family of distributions used to model the candidate distributions.

4 Experiments

In this section, we apply our allocation and learning algorithms for the precision model to the Philadelphia Crime Incidents dataset, and complement the theoretical convergence guarantee of Algorithm 2 to an optimal fair allocation with empirical evidence suggesting fast convergence. We also study the empirical trade-off between fairness and utility in the dataset.

4.1 Experimental Design

The Philadelphia Crime Incidents dataset (https://www.opendataphilly.org/dataset/crime-incidents, accessed 2018-05-16) contains all the crimes reported to the Philadelphia Police Department's INCT system between 2006 and 2016. The crimes are divided into two types. Type I crimes include violent offenses such as aggravated assault, rape, and arson, among others. Type II crimes include simple assault, prostitution, gambling, and fraud. For simplicity, we aggregate all crime of both types, but in practice an actual police department would of course treat different categories of crime differently. We note as a caveat that these are reported incidents and may not represent the entirety of committed crimes.

Figure 1: Frequencies of the number of reported crimes in each district in the Philadelphia Crime Incidents dataset. The red curves display the best Poisson fit to the data.

To create the daily crime frequencies in Figure 1, we first calculate the daily counts of criminal incidents in each of the 21 geographical police districts in Philadelphia by grouping together all the crime reports with the same date; we then normalize these counts to get frequencies. (The current list of 21 districts can be found at https://www.phillypolice.com/districts-units/index.html. The dataset contains 25 districts, 4 of which we removed from consideration: districts 77 and 92 correspond to the airport and to parks, so their crime incident counts are significantly lower than and widely different from those of the other districts, and districts 4 and 23 were both dissolved in 2010.) Each subfigure in Figure 1 represents a police district. The horizontal axis of each subfigure corresponds to the number of reported incidents in a day and the vertical axis represents the frequency of each number on the horizontal axis. These frequencies approximate the true distribution of the number of reported crimes in each of the districts in Philadelphia. Therefore, throughout this section we take these frequencies as the ground truth candidate distributions for the number of reported incidents in each of the districts.

Figure 1 shows that crime distributions in different districts can be quite different; e.g., the average number of daily reported incidents in District 15 is 43.5, which is much higher than the average of 11.35 in District 1 (see Table 1 in Appendix C for more details). Despite these differences, each of the crime distributions can be approximated well by a Poisson distribution. The red curves overlaid on each subfigure correspond to the Poisson distribution obtained via maximum likelihood estimation on data from that district. Throughout, we refer to such distributions as the best Poisson fit to the data (see Table 2 in Appendix C for details about the goodness of fit).
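As a rough sketch of this preprocessing step (the file and column names below are assumptions about the OpenDataPhilly export, not something specified in the paper), the daily counts, the empirical ground truth frequencies, and the best Poisson fit per district can be computed as follows.

```python
import pandas as pd

crimes = pd.read_csv("philadelphia_crime_incidents.csv")          # hypothetical export of the dataset
crimes["date"] = pd.to_datetime(crimes["dispatch_date"]).dt.date  # column name is an assumption
daily_counts = crimes.groupby(["dc_dist", "date"]).size()         # incidents per district per day

# The Poisson maximum likelihood estimate for each district is simply its mean daily count.
poisson_lambda = daily_counts.groupby(level="dc_dist").mean()

# Normalized frequencies of each daily count: the "ground truth" distributions used in this section.
ground_truth = daily_counts.groupby(level="dc_dist").value_counts(normalize=True)
```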

In our experiments, we take the police officers assigned to the districts as the resource to be distributed, the ground truth crime frequencies as candidate distributions, and aim to maximize the sum of the number of crimes discovered under the precision model of discovery.

4.2 Results

We can quantify the extent to which fairness degrades utility in the dataset through a notion we call the Price of Fairness (PoF henceforth). In particular, given the ground truth crime distributions and the precision model of discovery, for a fairness level $\alpha$, we define $\text{PoF}(\alpha) = u(v^*) / u(v^*_\alpha)$. The PoF is simply the ratio of the expected number of crimes discovered by an optimal allocation to the expected number of crimes discovered by an optimal $\alpha$-fair allocation. Since $u(v^*_\alpha) \le u(v^*)$ for all $\alpha$, the PoF is at least one. Furthermore, the PoF is monotonically non-increasing in $\alpha$. We can apply the algorithms given in Sections 3.1 and 3.2, respectively, for computing optimal unconstrained and optimal fair allocations, with the ground truth distributions as input, and numerically compute the PoF. This is illustrated in Figure 2. The horizontal axis corresponds to different $\alpha$ values and the vertical axis displays the inverse PoF, $1/\text{PoF}(\alpha)$. Each curve corresponds to a different number of total police officers, denoted by $V$. Because feasible allocations must be integral, there can sometimes be no feasible $\alpha$-fair allocation for small $\alpha$. Since the PoF in these cases is infinite, we instead opt to display the inverse, $1/\text{PoF}(\alpha)$, which is always bounded in $[0, 1]$. Higher values of inverse PoF are more desirable.

Figure 2: Inverse PoF plots for the Philadelphia Crime Incidents dataset. Smaller values indicate greater sacrifice in utility to meet the fairness constraint.

Figure 2 shows a diverse set of utility/fairness trade-offs depending on the number of available police officers. It also illustrates that the cost of fairness is rather low in most regimes. For example, in the worst case, with only 50 police officers (the black curve), which is much smaller than the average number of daily reported crimes (563.88), the inverse PoF is 1 for $\alpha \ge 0.1$, which corresponds to a 10% difference in the discovery probability across districts. When we increase the number of available police officers to 400 (the magenta curve), tolerating only a 4% difference in the discovery probability across districts is sufficient to guarantee no loss in utility. Figure 2 also shows that for any fixed $\alpha$, the inverse PoF increases as the number of police increases (i.e. the cost of fairness decreases). This captures the intuition that fairness becomes a less costly constraint when resources are in greater supply. Finally, we observe a thresholding phenomenon in Figure 2: in each curve, increasing $\alpha$ beyond a threshold significantly increases the inverse PoF. This is due to discretization effects, since only integral allocations are feasible.

We next turn to analyzing the performance of Algorithm 2 in practice. We run the algorithm instantiated to fit a Poisson distribution, but use observations drawn from the ground truth distributions at each round. As we have shown in Figure 1 and Table 2, the ground truth is well approximated by Poisson distributions.

We measure the performance of Algorithm 2 as follows. First, we fix a police budget $V$ and an unfairness budget $\alpha$ and run Algorithm 2 for 2000 rounds using the dataset as the ground truth. That is, we simulate each round's crime count realizations in each of the districts as samples from the ground truth distributions, and return censored observations under the precision model to Algorithm 2 according to the algorithm's allocations and the drawn realizations. The algorithm returns an allocation after termination, and we can measure the expected number of crimes discovered and the fairness violation (the maximum difference in discovery probabilities over all pairs of districts) of the returned allocation using the ground truth distributions. Varying $\alpha$ while fixing $V$ allows us to trace out the Pareto frontier of the utility/fairness trade-off for a fixed police budget. Similarly, for any fixed $V$ and $\alpha$, we can run Algorithm 1 (the offline algorithm for computing an optimal fair allocation) with the ground truth distributions as input and trace out a Pareto curve by varying $\alpha$. We refer to these two Pareto curves as the learned and optimal Pareto curves, respectively. (We can also generate fitted Pareto curves using the best Poisson fit distributions instead of the ground truth distributions; these curves look very similar to the optimal Pareto curves, see Figure 5 in Appendix C.) So to measure the performance of Algorithm 2, we can compare the learned and optimal Pareto curves.

Figure 3: Pareto frontier of expected crimes discovered versus fairness violation.

In Figure 3, each curve corresponds to a police budget. The two axes represent the expected number of crimes discovered and the fairness violation for allocations on the Pareto frontier. In our simulations we varied $\alpha$ between 0 and 0.15. For each police budget $V$, the 'x' marks connected by dashed lines show the learned Pareto frontier. Similarly, the circles connected by solid lines show the optimal Pareto frontier. We point out that while it is possible for the fairness violations on the learned Pareto curves to be higher than the level of $\alpha$ set as an input to Algorithm 2, the fairness violations on the optimal Pareto curves are always bounded by $\alpha$.

The disparity between the optimal and learned Pareto curves is due to the fact that the learning algorithm has not yet fully converged. This can be attributed to the large number of censored observations received by Algorithm 2, which are significantly less informative than uncensored observations. Censoring happens frequently because the number of police used in every case plotted is less than the daily average of 563.88 crimes across all the districts in the dataset, so it is unavoidable that in any allocation there will be significant censoring in at least some districts.

Figure 3 shows that while the learned curves are dominated by the optimal curves, the performance of the learning algorithm approaches the performance of the offline optimal allocation as $V$ increases. Again, this is because increasing $V$ generally has the effect of decreasing the frequency of censoring.

We study one regime in more detail, to explore the empirical rate of convergence. In Figure 4, we study the round-by-round performance of the allocation computed by Algorithm 2 in a single run, for a fixed choice of $V$ and $\alpha$.

Figure 4: The per-round expected number of crimes discovered and fairness violation of Algorithm 2, for a fixed police budget $V$ and fairness parameter $\alpha$.

In Figure 4, the horizontal axis labels the progression of rounds of the algorithm. The vertical axes measure the fairness violation (left) and the expected number of crimes discovered (right) of the allocation deployed by the algorithm, as measured with respect to the ground truth distributions. The black curves represent Algorithm 2. For comparison, we also show the same quantities for the offline optimal fair allocation as computed with respect to the ground truth (red line), and the offline optimal fair allocation as computed with respect to the best Poisson fit to the ground truth (blue line). Note that in the limit, the allocations chosen by Algorithm 2 are guaranteed to converge to the blue baselines, but not to the red baseline, because the algorithm is itself learning a Poisson approximation to the ground truth. The disparity between the red and blue lines quantifies the degradation in performance due to using Poisson approximations, rather than due to non-convergence of the learning process.

Figure 4 shows that Algorithm 2 converges to the Poisson approximation baseline well before the termination time of 2000 rounds, and substantially before the convergence bound guaranteed by our theory. Examining the estimated Poisson parameters used internally by Algorithm 2 reveals that although the allocation has converged to an optimal fair allocation, the estimated parameters have not yet converged to the parameters of the best Poisson fit in any of the districts. In particular, Algorithm 2 underestimates the parameters in all of the districts, but the underestimation is systematic: the true and estimated parameters are strongly correlated across districts.

We see also in Figure 4 that convergence to the optimum expected number of discovered crimes occurs more quickly than convergence to the target fairness violation level. This is also apparent in Figure 3 where the learning and optimal Pareto curves are generally similar in terms of the maximum number of crimes discovered, while the fairness violations are higher in the learning curves.

5 The Random Discovery Model

We next consider the random model of discovery. In the random model, when $v_i$ units are allocated to a group with $c_i$ candidates, the number of discovered candidates is a random variable corresponding to the number of candidates that appear in a uniformly random sample of $v_i$ individuals from a group of size $n_i$. Equivalently, when $v_i$ units are allocated to a group of size $n_i$ with $c_i$ candidates, the number of candidates discovered is a random variable drawn from the hypergeometric distribution with parameters $n_i$ (population size), $c_i$ (number of candidates), and $v_i$ (number of draws). Furthermore, the expected number of candidates discovered when allocating $v_i$ units to group $i$ satisfies $\mathbb{E}\big[f(v_i, c_i) \mid c_i\big] = v_i c_i / n_i$.
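As a quick sanity check of this expectation (with illustrative numbers, using scipy's parameterization of the hypergeometric distribution):

```python
from scipy.stats import hypergeom

n_i, c_i, v_i = 1000, 40, 50                 # group size, candidates, units allocated
discovered = hypergeom(M=n_i, n=c_i, N=v_i)  # population, successes, draws
assert abs(discovered.mean() - v_i * c_i / n_i) < 1e-9
```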

For simplicity, throughout this section, we assume that the supply $V$ does not exceed the size of any group. This assumption can be completely relaxed (see the discussion in Appendix D). Moreover, let $r_i = \mathbb{E}[c_i]/n_i$ denote the expected fraction of candidates in group $i$. Without loss of generality, for the rest of this section, we assume $r_1 \ge r_2 \ge \cdots \ge r_k$.

5.1 Optimal Allocation

In this section, we characterize optimal allocations. Note that the expected number of candidates discovered by allocating $v_i$ units to group $i$ is simply $v_i r_i$. This suggests a simple algorithm to compute $v^*$: allocate every unit of the resource to group 1. More generally, let $G^*$ denote the subset of groups with the highest expected fraction of candidates. An allocation is optimal if and only if it allocates all resources to groups in $G^*$.

5.2 Properties of Fair Allocations

We next discuss the properties of fair allocations in the random discovery model. First, we point out that the discovery probability simplifies to $p_i(v) = v_i / n_i$, independent of the candidate distribution. So an allocation is $\alpha$-fair in the random model if $|v_i/n_i - v_j/n_j| \le \alpha$ for all groups $i$ and $j$. Therefore, fair allocations (roughly) distribute resources in proportion to the sizes of the groups, essentially ignoring the candidate distributions within each group. We defer the full characterization to Appendix D.

5.3 Price of Fairness

Recall that the PoF quantifies the extent to which constraining the allocation to satisfy $\alpha$-fairness degrades utility. While in Section 4 we studied the PoF on the Philadelphia Crime Incidents dataset, we can define a worst-case variant as follows.

Definition 3.

Fix the random model of discovery and let $\alpha \in [0, 1]$. We define the worst-case PoF as
$$\text{PoF}(\alpha) \;=\; \sup_{\mathcal{D}_1, \dots, \mathcal{D}_k} \frac{u(v^*)}{u(v^*_\alpha)},$$
where the supremum ranges over all possible candidate distributions.

We can fully characterize this worst-case PoF in the random discovery model. We defer the proof of Theorem 4 to Appendix D.

Theorem 4.

The worst-case PoF in the random discovery model admits an exact characterization in terms of the group sizes, the total supply $V$, and $\alpha$; the closed-form expression is given in Appendix D.

The PoF in the random model can grow with the number of groups in the worst case: if all groups are identically sized, it grows linearly with the number of groups.

6 Conclusion and Future Directions

Our presentation of allocative fairness provides a family of fairness definitions, modularly parameterized by a “discovery model.” What counts as “fair” depends a great deal on the choice of discovery model, which makes explicit what would otherwise be unstated assumptions about the process of tasks like policing. The random and precision models of discovery studied in this paper represent two extreme points of a spectrum. In the predictive policing setting, the random model of discovery assumes that officers have no advantage over random guessing when stopping individuals for further inspection. The precision model assumes they can oracularly determine offenders, and stop only them. An interesting direction for future work is to study discovery models that lie in between these two.

We have also made a number of simplifying assumptions that could be relaxed. For example, we assumed the candidate distributions are stationary — fixed independently of the actions of the algorithm. Of course, the deployment of police officers can change crime distributions. Modeling this kind of dynamics, and designing learning algorithms that perform well in such dynamic settings would be interesting. Finally, we have assumed that the same discovery model applies to all groups. One friction to fairness that one might reasonably conjecture is that the discovery model may differ between groups — being closer to the precision model for one group, and closer to the random model for another. We leave the study of these extensions to future work.

Acknowledgements

We thank Sorelle Friedler for giving a talk at Penn which initially inspired this work. We also thank Carlos Scheidegger, Kristian Lum, Sorelle Friedler, and Suresh Venkatasubramanian for helpful discussions at an early stage of this work. Finally we thank Richard Berk and Greg Ridgeway for helpful discussions about predictive policing.

References

  • Agarwal et al. [2010] Alekh Agarwal, Peter Bartlett, and Max Dama. Optimal allocation strategies for the dark pool problem. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 9–16, 2010.
  • Bastani et al. [2017] Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Exploiting the natural exploration in contextual bandits. CoRR, abs/1704.09011, 2017.
  • Berk et al. [2017] Richard Berk, Hoda Heidari, Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth. A convex framework for fair regression. CoRR, abs/1706.02409, 2017.
  • Berk et al. [2018] Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, 2018.
  • Bird et al. [2016] Sarah Bird, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. Exploring or exploiting? social and ethical implications of autonomous experimentation in AI. 2016.
  • Calders et al. [2013] Toon Calders, Asim Karim, Faisal Kamiran, Wasif Ali, and Xiangliang Zhang. Controlling attribute effect in linear regression. In Proceedings of the 13th International Conference on Data Mining, pages 71–80, 2013.
  • Chierichetti et al. [2017] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair clustering through fairlets. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pages 5029–5037, 2017.
  • Corbett-Davies et al. [2017] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806, 2017.
  • Dwork et al. [2012] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science, pages 214–226, 2012.
  • Ensign et al. [2018a] Danielle Ensign, Sorelle Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. Runaway feedback loops in predictive policing. In Conference on Fairness, Accountability and Transparency, pages 160–171, 2018a.
  • Ensign et al. [2018b] Danielle Ensign, Sorelle Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. Decision making with limited feedback. In Proceedings of the 29th Conference on Algorithmic Learning Theory, pages 359–367, 2018b.
  • Ganchev et al. [2009] Kuzman Ganchev, Michael Kearns, Yuriy Nevmyvaka, and Jennifer Wortman Vaughan. Censored exploration and the dark pool problem. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pages 185–194, 2009.
  • Hardt et al. [2016] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 3315–3323, 2016.
  • Jabbari et al. [2017] Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 1617–1626, 2017.
  • Joseph et al. [2016] Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in learning: classic and contextual bandits. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 325–333, 2016.
  • Kannan et al. [2018] Sampath Kannan, Jamie Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. CoRR, abs/1801.03423, 2018.
  • Kleinberg et al. [2017] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In Proceedings of the 8th Conference on Innovations in Theoretical Computer Science, pages 43:1–43:23, 2017.
  • Liu et al. [2018] Lydia Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt. Delayed impact of fair machine learning. In Proceedings of the 35th International Conference on Machine Learning, pages 3156–3164, 2018.
  • Lum and Isaac [2016] Kristian Lum and William Isaac. To predict and serve? Significance, pages 14–18, October 2016.
  • MacKay [2003] David MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
  • Procaccia [2013] Ariel Procaccia. Cake cutting: Not just child’s play. Communications of the ACM, 56(7):78–87, 2013.
  • Raghavan et al. [2018] Manish Raghavan, Aleksandrs Slivkins, Jennifer Wortman Vaughan, and Zhiwei Steven Wu. The externalities of exploration and how data diversity helps exploitation. In Proceedings of the 31st Conference On Learning Theory, pages 1724–1738, 2018.
  • Woodworth et al. [2017] Blake Woodworth, Suriya Gunasekar, Mesrob Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. In Proceedings of the 30th Conference on Learning Theory, pages 1920–1953, 2017.
  • Zafar et al. [2017] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez-Rodriguez, and Krishna Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171–1180, 2017.
  • Zemel et al. [2013] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning, pages 325–333, 2013.

Appendix A Feasibility in Expectation

In this section, we show how to compute an optimal $\alpha$-fair allocation for arbitrary but known candidate distributions and a known discovery model, in a relaxation where the feasibility constraint only needs to be satisfied in expectation.

The first observation is that when the discovery model $f$ and the candidate distributions $\mathcal{D}_i$ are both known, for a group $i$ and an allocation of $j$ units of the resource to that group, the expected number of discovered candidates $\mathbb{E}[f(j, c_i)]$ and the discovery probability $p_i(j)$ can both be computed exactly. The second observation is that when the feasibility condition only needs to hold in expectation, instead of allocating an integral number of units to each group, we can allocate resources to a group using a probability distribution over allocation sizes.

Let $q_{i,j}$ denote the probability that $j$ units of the resource are allocated to group $i$. We can compute an optimal $\alpha$-fair allocation (in this relaxed sense) by writing a linear program with the $q_{i,j}$'s as variables. The objective function maximizes the expected number of candidates discovered given the allocation. The first constraint guarantees that the allocation is feasible in expectation. The second constraint (which is linear in the $q_{i,j}$'s) ensures that $\alpha$-fairness is satisfied by the allocation. The last two constraints guarantee that, for every group $i$, the $q_{i,j}$ values define a valid probability distribution over all the possible allocations to group $i$.
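Written out, the linear program described above takes roughly the following form; the symbols $q_{i,j}$, $p_i(j)$, and $\mathbb{E}[f(j, c_i)]$ are our notation for the quantities defined in this section, and this is a sketch of the program the text describes rather than a verbatim reproduction.

$$
\begin{aligned}
\max_{q} \quad & \sum_{i=1}^{k} \sum_{j=0}^{V} q_{i,j}\, \mathbb{E}\big[f(j, c_i)\big] \\
\text{s.t.} \quad
  & \sum_{i=1}^{k} \sum_{j=0}^{V} j \, q_{i,j} \;\le\; V
      && \text{(feasibility in expectation)} \\
  & \Big|\sum_{j=0}^{V} q_{i,j}\, p_i(j) \;-\; \sum_{j=0}^{V} q_{i',j}\, p_{i'}(j)\Big| \;\le\; \alpha
      \quad \forall\, i, i'
      && \text{($\alpha$-fairness in expectation)} \\
  & \sum_{j=0}^{V} q_{i,j} = 1 \quad \forall\, i,
    \qquad q_{i,j} \ge 0 \quad \forall\, i, j
      && \text{(valid distributions)}
\end{aligned}
$$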

Appendix B Omitted Details from Section 3

b.1 Omitted Details from Section 3.1

We first show how the expected number of discovered candidates in a group in the precision model can be written as a function of the tail probabilities of the group’s candidate distribution.

Lemma 5 (Ganchev et al. [12]).

The expected number of discovered candidates in the precision model when allocating $v_i$ units of the resource to group $i$ can be written as $\mathbb{E}\big[\min(v_i, c_i)\big] = \sum_{j=1}^{v_i} T_i(j)$.
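For completeness, a one-line derivation of this identity via the standard tail-sum formula, using the notation of Section 3.1:

$$\mathbb{E}\big[\min(v_i, c_i)\big] \;=\; \sum_{j=1}^{\infty} \Pr\big[\min(v_i, c_i) \ge j\big] \;=\; \sum_{j=1}^{v_i} \Pr[c_i \ge j] \;=\; \sum_{j=1}^{v_i} T_i(j).$$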