1 Non-discrimination in machine learning
As automated decision systems permeate our world, the problem of implicit biases in these systems has become more serious. Machine learning algorithms are routinely used to make decisions in credit, criminal justice, and education, all of which are domains protected by anti-discrimination law. Although automated decision systems seem to eliminate the biases of a human decision maker, they may perpetuate or even exacerbate biases in the data.
For example, consider an advertising platform which uses demographic information of visitors to a website to decide which credit card offers to show first-time visitors. If the system is trained on historical data where minority visitors were given less advantageous offers, the system may steer similar visitors to less advantageous offers, which is illegal (Steel and Angwin, 2010).
In response, the scientific community has proposed several formal definitions of non-discrimination and various approaches to ensure algorithms are non-discriminatory. Unfortunately, the myriad of definitions and approaches hinders the adoption of this work by practitioners: they must choose from the growing list of definitions and approaches, and there is often no clear choice.
In light of this plethora of definitions, we identify a general notion of non-discrimination in Section 2 that not only includes many recently proposed definitions but also suggests new definitions. In Sections 3 and 4, we study randomization as a general mechanism for achieving conditional parity and a kernel-based test of conditional parity. Finally, in Section 5, we apply this test to determine whether insurance companies charge higher premiums to insure cars in minority neighborhoods.
2 Conditional parity: a notion of non-discrimination
Intuitively, any claim of discrimination or non-discrimination depends on a comparison: a comparison between the outcomes of two groups that differ only in a sensitive attribute. For instance, a claim of gender discrimination by a female employee of a consulting firm implies she was treated differently from male employees of the company in her position. Here the two groups are female and male employees who share her position. By only comparing herself to male employees in her position, she is implicitly permitting the firm to treat male employees in other positions differently. In other words, the firm is allowed to discriminate based on an employee’s position (e.g., paying senior employees higher salaries). We see that in order to fully specify the two groups, we must specify not only the protected attribute (e.g., gender) but also the discriminatory attribute (e.g., position).
Definition 2.1 (conditional parity (CP)).
satisfies parity with respect to conditioned on if the distribution of is constant in :
Similarly, satisfies parity with respect to conditioned on (without specifying a value of ) if it satisfies parity with respect to conditioned on for any .
In terms of independence, conditional parity is . Table 1 is a graphical representation of the groups in the running gender discrimination example. As we shall see, many existing notions of non-discrimination such as demographic parity, equalized odds, equal opportunity, and counterfactual fairness are all instances of CP. We remark that the definition of CP is invariant under post-processing: if satisfies CP with respect to conditioned on , then so does for an arbitrary function . This is especially desirable because it leads to a simple way of eliminating bias in machine learning algorithms.
The intuition of identical conditional distributions that Definition 2.1 formalizes extends easily to yield approximate notions of non-discrimination. To keep things simple, we assume is discrete.
Definition 2.2 (-conditional parity).
Let be a metric on distributions. (Footnote 1: formally, the metric must satisfy and for any pair of distributions and as well as for any distribution .) A random variable satisfies -conditional parity with respect to conditioned on if
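To make the approximate notion concrete, the following sketch estimates how far a discrete data set is from conditional parity. It assumes total variation distance as the metric on distributions (the definition allows any metric), and the function names and the toy records are hypothetical, introduced only for illustration.

```python
from collections import Counter, defaultdict


def total_variation(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


def conditional_parity_gap(records):
    """Largest TV distance between the outcome distributions of any two
    protected groups that share the same conditioning value.

    records: iterable of (outcome, protected, conditioning) triples.
    """
    # counts[z][a] is a Counter of outcomes in stratum z and group a
    counts = defaultdict(lambda: defaultdict(Counter))
    for x, a, z in records:
        counts[z][a][x] += 1

    worst = 0.0
    for groups in counts.values():
        dists = []
        for c in groups.values():
            n = sum(c.values())
            dists.append({x: k / n for x, k in c.items()})
        # compare every pair of protected groups within the stratum
        for i in range(len(dists)):
            for j in range(i + 1, len(dists)):
                worst = max(worst, total_variation(dists[i], dists[j]))
    return worst


# Hypothetical records: (salary band, gender, position)
records = [
    ("high", "F", "senior"), ("high", "M", "senior"),
    ("low", "F", "junior"), ("low", "M", "junior"),
    ("low", "F", "junior"), ("low", "M", "junior"),
]
gap = conditional_parity_gap(records)
```

A gap of zero corresponds to exact conditional parity on the sample; a tolerance on the gap plays the role of the approximation level. With few observations per stratum the empirical gap is noisy, so this is a descriptive summary rather than a test.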
To wrap up, we compare CP to two other notions of non-discrimination: functional blindness and individual fairness.
Definition 2.3 (functional blindness).
A decision rule satisfies functional blindness with respect to iff
In other words, the decision rule has no functional dependence on the protected attribute.
Functional blindness, also known as fairness through unawareness, is a rudimentary but widely used notion of non-discrimination. Although intuitive, it is a weak notion that is easily circumvented because it does not rule out implicit dependence of the decision on the protected attribute.
For decades, insurance companies have charged drivers in predominantly minority neighborhoods higher premiums than drivers in majority white neighborhoods. Although insurers have justified their pricing by citing a higher risk of accidents in minority neighborhoods, consumer advocates suspect the practice is merely a way around laws that ban discriminatory rate-setting: a driver’s zip code is a good proxy for his or her race in segregated areas.
We remark that functional blindness implies parity conditioned on . However, if includes attributes that are proxies for the protected attribute (eg. zip code is a proxy for race in the preceding example), enforcing CP is vacuous. After all, by including an attribute in , we are allowing the decision rule to discriminate based on it.
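The proxy problem can be illustrated with a small simulation. The sketch below assumes a hypothetical population in which the zip code agrees with the protected attribute with probability 0.9, and a hypothetical pricing rule that never reads the protected attribute; all names and numbers are made up for illustration.

```python
import random

random.seed(0)


def simulate(n=20000, segregation=0.9):
    """Hypothetical population: the zip code matches the protected group
    with probability `segregation`, so it acts as a proxy."""
    rows = []
    for _ in range(n):
        a = random.random() < 0.5              # protected attribute
        same = random.random() < segregation
        zip_minority = a if same else (not a)  # proxy for the attribute
        rows.append((a, zip_minority))
    return rows


def premium(zip_minority):
    """Functionally blind rule: depends on the zip code only, never on `a`."""
    return 1200 if zip_minority else 1000


rows = simulate()


def mean_premium(group):
    """Average premium charged to members of one protected group."""
    vals = [premium(z) for a, z in rows if a == group]
    return sum(vals) / len(vals)


# Difference in average premium between the two protected groups
gap = mean_premium(True) - mean_premium(False)
```

Although the rule is functionally blind, the average premium differs between the groups by roughly 200 × (0.9 − 0.1) = 160 in expectation. Conditioning on the zip code would hide this disparity, since within each zip code type the rule charges everyone the same amount.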
Definition 2.4 (individual fairness (Dwork et al., 2012)).
Let be a metric on distributions and be a metric on the space of individuals . A (possibly randomized) decision rule satisfies -individual fairness iff it is -Lipschitz in :
Individual fairness is based on the principle that two similar individuals should be treated similarly by the decision rule. The precise definition of individual fairness depends crucially on the choice of the metrics and . Dwork et al. (2012) suggest the metrics be chosen by a regulatory body or proposed by civil rights organizations and left open to discussion and continual refinement.
Both CP and individual fairness formalize the intuition that similar individuals should be treated similarly. In CP, similar individuals are those that share discriminatory attributes. In individual fairness, similar individuals are determined by the choice of the metric on individuals. Although Dwork et al. (2012) do not distinguish between discriminatory and protected attributes, it is possible to encode the distinction into the choice of metric on .
2.1 Demographic parity and equalized odds
In this subsection, we describe several factual (as opposed to counterfactual) notions of non-discrimination and show that they are instances of CP.
Definition 2.5 (demographic parity (DP)).
The outcome satisfies demographic parity if
As we can see, there is no discriminatory attribute in DP: individuals from the group must be treated the same as individuals from the group . Although DP generally seems too coarse a notion of non-discrimination, there are some scenarios where it is suitable. For example, in the allocation of public resources, DP is a fitting notion of non-discrimination. A concrete example is public secondary school admission. Due to the public service nature of public secondary education, parents should be allowed to send their children to any school in their neighborhood, regardless of their background. In practice, this goal is often attained by lottery: admitted students are selected at random from the pool of applicants.
Example 2.6 (War on Drugs).
According to the American Civil Liberties Union (ACLU), “an African American adult is 2.8 times as likely to have a misdemeanor cannabis charge filed against him or her than does an Anglo American adult” in Washington State (Jensen and Roussell, 2016). By comparing the likelihood of being charged without stratifying the population (e.g., by prevalence of cannabis consumption), the ACLU is claiming the War on Drugs violates DP. This is an example where DP is not a suitable notion of non-discrimination: the disparity in the likelihood of being charged may be due to disparities in the prevalence of cannabis consumption. Thus it is incorrect to conclude from a violation of DP alone that law enforcement targets African Americans.
To avoid the problems of DP, we first segment the population by certain discriminatory attributes (e.g., prevalence of cannabis consumption in the War on Drugs example) and then apply DP to each segment of the population. This leads us to the notion of CP. In supervised learning, a natural instantiation of CP is equalized odds, which appeared in Hardt et al. (2016) and Zafar et al. (2017).
Definition 2.7 (equalized odds (EO) Hardt et al. (2016)).
A prediction of satisfies equalized odds with respect to protected attribute and outcome if
In terms of CP, EO is equivalent to the prediction satisfying parity with respect to the protected attribute conditioned on the outcome . In other words, EO requires the individuals with the same but differing to be treated equally. If both and are binary, then in standard terminology, EO means that the false alarm and detection probabilities are the same under all possible values of . Note that typically the optimal ROC curve depends on the value of , and oftentimes, we sacrifice some efficiency to achieve EO through randomization (Hardt et al., 2016).
Example 2.8 (Example 2.6 continued).
In the War on Drugs example, is whether an individual is charged, is an individual’s race, and may be whether an individual consumes cannabis. If there is a discrepancy between the prevalence of cannabis consumption among African and Anglo Americans, then EO is a more suitable notion of non-discrimination in law enforcement. Since depends on , even perfect prediction violates DP, but it is hardly discriminatory for law enforcement to charge anyone who consumes cannabis with a misdemeanor. On the other hand, it is easy to check that the perfect predictor satisfies EO.
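The claim that a perfect predictor violates DP but satisfies EO can be checked on a toy population. The sketch below uses hypothetical group labels and prevalences (30% in group A, 10% in group B); none of the numbers come from the example above.

```python
# Hypothetical population of (group, outcome) pairs; prevalence differs by group
population = (
    [("A", 1)] * 30 + [("A", 0)] * 70 +
    [("B", 1)] * 10 + [("B", 0)] * 90
)


def predict(group, y):
    """Perfect predictor: returns the true outcome and ignores the group."""
    return y


def positive_rate(group, y=None):
    """Fraction of positive predictions in a group, optionally restricted
    to individuals whose true outcome equals `y`."""
    preds = [
        predict(g, t)
        for g, t in population
        if g == group and (y is None or t == y)
    ]
    return sum(preds) / len(preds)


# Demographic parity compares unconditional positive rates
dp_gap = abs(positive_rate("A") - positive_rate("B"))

# Equalized odds compares positive rates within each outcome stratum
eo_gaps = {y: abs(positive_rate("A", y) - positive_rate("B", y)) for y in (0, 1)}
```

Here the DP gap equals the difference in prevalence (0.2), while both EO gaps are zero, because within each outcome stratum the perfect predictor gives every individual the same prediction.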
In some applications, one of the outcomes is considered “advantaged”. For example, consider the use of historical repayment data to predict default. If the historical data contains biases against minority groups, the prediction system may echo the bias in its predictions. A possible relaxation of EO is to require people who will not default to have an equal chance of getting a loan, regardless of their race.
Definition 2.9 (equal opportunity Hardt et al. (2016)).
Let be the “advantaged” outcome. A prediction satisfies equal opportunity with respect to protected attribute and outcome if
It is easy to see how equalized odds leads to the more general notion of CP. The key idea of comparing segments of the population that share discriminatory attributes but differ in the protected attribute is clear. In classification, there is a natural discriminatory attribute: the outcome . However, it is worth considering other ways of segmenting the population, even in supervised learning.
Example 2.10 (gender bias in UC Berkeley admissions Bickel, Hammel and O’Connell (1977)).
In the autumn of 1973, the graduate division of UC Berkeley admitted 44% of male applicants but only 35% of female applicants, prompting allegations of gender bias in the admissions process. However, adjusting the admissions outcome by department reveals a “small but statistically significant bias in favor of women”. Bickel, Hammel and O’Connell (1977) concluded that women tended to apply to highly competitive departments, which admit a smaller percentage of applicants, while men tended to apply to less competitive departments. In this example, including department as a discriminatory attribute leads to a qualitatively different conclusion.
As an aside, this example also shows that CP generally does not imply DP. Even if the admission rates of male and female applicants are identical in all departments, the admission rates to the graduate division may still differ if male and female applicants apply to departments at different rates. Conversely, even if the admission rates of male and female applicants to the graduate division are identical, the admission rates to each department may differ. This reveals another problem of DP: it permits disparate treatment within segments of the population as long as the disparities “cancel out” on average. Although this is rare in practice, we point it out to emphasize CP and DP are generally incomparable.
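A small numerical illustration of why CP does not imply DP: the counts below are hypothetical (they are not the Berkeley figures) and are chosen so that admission rates are identical within each department while the application patterns differ by gender.

```python
# Hypothetical counts: (department, gender) -> (applicants, admitted)
data = {
    ("easy", "M"): (800, 480), ("easy", "F"): (200, 120),
    ("hard", "M"): (200, 40), ("hard", "F"): (800, 160),
}


def admit_rate(gender, dept=None):
    """Admission rate for a gender, within one department or overall."""
    cells = [v for (d, g), v in data.items() if g == gender and dept in (None, d)]
    applied = sum(n for n, _ in cells)
    admitted = sum(k for _, k in cells)
    return admitted / applied


# Parity conditioned on department: rates agree within each department
per_dept = {d: (admit_rate("M", d), admit_rate("F", d)) for d in ("easy", "hard")}

# Demographic parity: aggregate rates differ
overall = (admit_rate("M"), admit_rate("F"))
```

Within each department both genders are admitted at the same rate (60% and 20%), yet the aggregate rates are 52% for men and 28% for women, because women in this toy data apply mostly to the more competitive department.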
Finally, to highlight the generality of CP, we describe an application of CP in representation learning. In machine learning, feature or representation learning is the task of learning a transformation of raw data to a feature vector that is amenable to machine learning algorithms. By letting be the learned feature vector, CP readily leads to a notion of non-discrimination in representation learning:
As we shall see, this notion of non-discrimination has been implicitly used in natural language processing (NLP).
To wrap up, we describe a post-processing method that returns a new feature vector that satisfies CP. To keep things simple, we assume , , and are jointly Gaussian. Without loss of generality, let
where and are chosen so that . Rearranging, we have
A new feature vector that satisfies is
One way to estimate is to select a subset of feature vectors that are similar in and compute their principal components. This is essentially the approach proposed by Bolukbasi et al. (2016) to remove gender bias in word embeddings.
Example 2.11 (debiasing word embeddings (Bolukbasi et al., 2016)).
A word embedding is a representation of words by vectors in . Word embeddings enable machine learning algorithms to reason semantically by performing arithmetic operations on the word embeddings; eg.
They are learned from text corpora and inherit implicit biases in the texts. For example, according to the popular word2vec embedding, which is trained on a corpus of Google News articles, we have
To remove gender bias in word embeddings, Bolukbasi et al. (2016) propose a method that identifies a gender subspace and projects the embedding onto the orthocomplement of the gender subspace to obtain a debiased word embedding. To identify the gender subspace, the method takes pairs of words whose meanings differ only in gender (e.g., (actor, actress), (father, mother)) and estimates the principal components of the pairwise differences.
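The projection step can be sketched in a few lines. This is a simplified illustration, not the procedure of Bolukbasi et al. (2016): it assumes a one-dimensional gender subspace, estimates it by power iteration on the uncentered second-moment matrix of the pairwise differences, and uses made-up three-dimensional embeddings. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]


def top_direction(diffs, iters=100):
    """Leading principal direction of the difference vectors, found by power
    iteration on sum_i d_i d_i^T (uncentered; a simplifying assumption).
    Assumes the starting vector is not orthogonal to every difference."""
    v = normalize([1.0] * len(diffs[0]))
    for _ in range(iters):
        # multiply v by sum_i d_i d_i^T without forming the matrix
        w = [0.0] * len(v)
        for d in diffs:
            c = dot(d, v)
            w = [wi + c * di for wi, di in zip(w, d)]
        v = normalize(w)
    return v


def project_out(x, g):
    """Remove the component of x along the unit vector g."""
    c = dot(x, g)
    return [xi - c * gi for xi, gi in zip(x, g)]


# Hypothetical 3-dimensional embeddings (illustrative values only)
emb = {
    "actor": [1.0, 0.2, 0.5], "actress": [-1.0, 0.2, 0.5],
    "father": [0.9, -0.4, 0.1], "mother": [-1.1, -0.4, 0.1],
    "nurse": [-0.6, 0.7, 0.3],
}
pairs = [("actor", "actress"), ("father", "mother")]
diffs = [[a - b for a, b in zip(emb[m], emb[f])] for m, f in pairs]

g = top_direction(diffs)                                  # gender direction
debiased = {w: project_out(v, g) for w, v in emb.items()}
```

After the projection every debiased vector is orthogonal to the estimated direction, so any linear function of the debiased features carries no component along it. A higher-dimensional subspace would require several principal components rather than one.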
By the invariance of CP under post-processing, the output of a machine learning algorithm based on features that satisfy CP inherits the property. This suggests using non-discriminatory features as a simple approach to eliminating bias in machine learning algorithms.
2.2 Counterfactual notions of non-discrimination
In order to work with counterfactuals, we must impose some modeling assumptions on the data generating process. In the rest of this subsection, we assume the data is generated by a structural equations model (SEM). An SEM consists of (i) a set of random variables, (ii) a set of (deterministic) equations that assign values to some random variables, and (iii) a probability distribution that assigns values to the rest of the variables. The variables whose values are assigned by the probability distribution are called exogenous.
SEMs are conveniently represented as directed acyclic graphs (DAGs). The nodes represent random variables, and the edges represent direct causal relationships between variables: there is an edge from node to node if the equation that assigns value to variable takes variable as input. The nodes that have no parents represent the exogenous variables.
To sample from an SEM, we start by assigning values to the root nodes by sampling from the probability distribution and recursively assign values to the other nodes by the equations. Thus the nodes whose values are assigned by equations are random variables on the probability space “generated by” the exogenous variables. In this setting, counterfactuals are defined as random variables whose values are assigned by a modified SEM, where the equations and/or the probability distribution are modified according to the premise of the counterfactual. We wrap up our brief overview of counterfactuals with an example and refer to Pearl, Glymour and Jewell (2016), Chapter 4 for further details.
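The sampling procedure and the construction of a counterfactual can be sketched on a toy model. The SEM below is hypothetical (three exogenous variables and a chain from the attribute to a feature to an outcome) and is not one of the models in the figures; the point is only that the factual and counterfactual values are computed from the same exogenous draw.

```python
import random


def sample_exogenous(rng):
    """Draw the exogenous (root) variables of a hypothetical SEM."""
    return {
        "u_a": rng.random(),
        "u_x": rng.gauss(0.0, 1.0),
        "u_y": rng.gauss(0.0, 1.0),
    }


def run_sem(u, do_a=None):
    """Evaluate the structural equations on exogenous values `u`.
    If `do_a` is given, the equation for `a` is replaced by a = do_a."""
    a = (1 if u["u_a"] < 0.5 else 0) if do_a is None else do_a
    x = 2.0 * a + u["u_x"]   # x depends on a and its own noise
    y = x + u["u_y"]         # y depends on x and its own noise
    return {"a": a, "x": x, "y": y}


rng = random.Random(0)
u = sample_exogenous(rng)

factual = run_sem(u)
# Counterfactual: same exogenous draw, modified equation for a
counterfactual = run_sem(u, do_a=1 - factual["a"])

effect = counterfactual["y"] - factual["y"]
```

Because both worlds share the exogenous variables, the factual and counterfactual outcomes live on the same probability space, which is what makes "cross-SEM" probabilities well defined. In this toy model flipping the attribute shifts the outcome by exactly 2 in absolute value.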
Consider the intervention and the counterfactual in the SEM depicted in Figure 0(a). The counterfactual is the counterpart of in the modified SEM depicted on the right of Figure 0(b), in which the equation that assigns the value of is replaced by the equation . We see that the value of ultimately depends on the values of the exogenous variables in the SEM, making it a random variable on the same probability space as . Thus it is possible to evaluate “cross-SEM” probabilities such as . We remark that this SEM formalism allows us to study the effects of more sophisticated interventions such as (cf. Figure 0(c)).
In the rest of this subsection, we describe two counterfactual notions of non-discrimination. The first was proposed recently by Kusner et al. (2017), while the second is suggested by CP. To keep things simple, we specialize to supervised learning and focus on prediction.
Definition 2.13 (counterfactual fairness (CF) Kusner et al. (2017)).
A prediction is counterfactually fair with respect to sensitive attribute in light of evidence iff
If (2.1) holds for all , is counterfactually fair with respect to sensitive attribute in light of evidence .
In Definition 2.13, is the evidence we observe in the real world. Although it plays the part of in CP, we call it evidence and denote it by to emphasize it is observed. To see that CF is an instance of CP, let , be the counterparts of , in a modified SEM, where the step that assigns value to is replaced by , and note that (2.1) is equivalent to
We remark that the law of the intervention is unimportant because we condition on the value of . We pick to keep things concrete.
The notion of CF is best illustrated by the following case on employment discrimination. In Carson (1996), the judges wrote “the central question in any employment-discrimination case is whether the employer would have taken the same action had the employee been of a different race (age, sex, religion, national origin, etc.) and everything else had been the same”. In other words, to ascertain whether discrimination occurred, the judges compared the employee with his counterpart in a counterfactual world, rather than a similar employee in the real world.
We remark that it may not be possible to follow the judges’ directive literally and keep all other attributes the same: the intervention may propagate in the modified SEM and lead to discrepancies with the evidence. For example, consider a female employee who is homosexual. In a counterfactual world where she is male, it is not possible to keep both her sexual orientation and the gender she is attracted to the same as in the real world.
As we saw, CF is an instance of CP where we segment the population by observable evidence. A related notion of non-discrimination is equalized counterfactual odds (ECO): it is an instance of CP that segments the population by counterfactual attributes. It is motivated by Example 2.14.
Consider a system that predicts a driver’s accident risk from his or her driving record. The prediction depends directly on a driver’s driving record , which in turn depends on the driver’s driving ability . Driving ability also directly affects a driver’s accident risk and whether he or she is disabled (poor drivers tend to get into accidents, which cause disabilities). Figure 1 is a DAG that depicts the functional dependencies among the variables.
This example does not satisfy EO: the path between and is not blocked. However, is intuitively non-discriminatory: there is only dependence between and because driving ability is a parent of disability and driving record. The prediction has no causal dependence on and does not penalize good drivers that happen to be disabled.
We see that EO is too stringent a condition in this scenario. It not only prohibits the prediction from treating disabled drivers differently because of their disability, but also prohibits the prediction from happening to put disabled drivers in a disadvantaged position due to the presence of a confounder. In this scenario, disabled people happen to have worse driving records because driving ability affects one’s driving record and causes disability.
Example 2.14 shows that EO is too stringent because it prohibits probabilistic dependence between and , which may arise due to confounding. The notion of equalized counterfactual odds is an amendment of EO that only prohibits causal relationships between and .
Definition 2.15 (equalized counterfactual odds (ECO)).
A prediction of satisfies equalized counterfactual odds with respect to protected attribute conditioned on iff
If (2.3) holds for all , we say satisfies equalized counterfactual odds with respect to .
In a nutshell, ECO is EO on a modified SEM, in which the step that assigns value to is replaced by . The graph of the modified SEM is identical to that of the original SEM, except all the edges that point to are removed. This removes all back door paths between and , which typically represent the effects of confounders. Thus ECO only prohibits probabilistic dependence between and in the original SEM through front door paths. This leads to a simple way of verifying ECO.
A prediction of satisfies ECO with respect to protected attribute if all front door paths from to are blocked by .
ECO is also closely related to CF: both compare the law of the counterfactual prediction on segments of the population. ECO segments the population by the counterfactual target , while CF segments the population by (observable) evidence . In practice, CF is a fairly stringent notion of non-discrimination. As Kusner et al. (2017) point out, there are instances in which perfect prediction does not satisfy CF. On the other hand, perfect prediction always satisfies ECO. To wrap up, we present another example that highlights the difference between the two notions.
Consider an SEM of a church’s priest hiring process. The church’s hiring decision depends on an applicant’s score , where is the applicant’s propensity for priest work and is whether the applicant is Christian. Figure 2 depicts an SEM of the priest hiring process.
Consider an atheist applicant whose talents are well-suited to priest work (e.g., charismatic, persuasive). He applied for the position, but was rejected. Since he would have been hired if he were a Christian, the hiring process is not counterfactually fair with respect to in light of evidence . Graphically, conditioning on in the unmodified SEM does not block the path between and in the graph of the modified SEM. On the other hand, the hiring process clearly satisfies ECO: blocks the only front door path between and in the graph of the (unmodified) SEM.
Before moving on, we mention a few recently proposed counterfactual notions of non-discrimination. Zhang, Wu and Wu (2016) and Nabi and Shpitser (2017) formalize discrimination as the presence of path-specific effects (cf. Pearl (2009), §4.5.3). Although path-specific notions of non-discrimination are also instances of CP, we skip the details here. Kilbertus et al. (2017) address the difficulty of modeling and determining the effect of intervening on protected attributes by considering non-discrimination with respect to proxies of protected attributes.
To wrap up, we cite a few related works on non-discrimination in machine learning. DP as a notion of non-discrimination was studied in Zemel et al. (2013). Friedler, Scheidegger and Venkatasubramanian (2016) extend the notion of individual fairness to distinguish between constructs, which are unobservable attributes (e.g., intelligence), and observations (e.g., score on an IQ test), which are proxies of constructs that enter into the algorithm. Berk et al. (2017) review various notions of non-discrimination in the criminal justice system.
3 Conditional parity by randomization
In supervised learning, we observe realizations of , where is the protected attribute, is a score, and is the outcome. In general, is dependent on . If we wish to obtain a prediction that does not depend on the protected attribute, we have to sacrifice some efficiency and use a randomized procedure. In this section we consider the construction of a (randomized) decision rule such that does not depend on .
Assume for simplicity that is discrete. In that case let be the probability vector corresponding to . A non-discriminatory randomization is a pair of Markov kernels, satisfying
The minor difficulty is due to the fact that the conditional density given should be checked, while the randomization cannot depend on which is unobserved at the time of the randomization.
In general, we need to randomize the score of both categories to achieve EO.
The lemma follows from the fact that a Markov kernel is a contraction in two measures, and in general they are unrelated. Consider two sets of densities such that , but on a small interval, small enough that it contributes little to the distance, . If it were the case that
which contradicts the assumption. On the other hand, if it were the case that
which contradicts the other assumption. Hence it is neither the case that nor that .
The set (3.1) has equality constraints with undefined parameters. It always has the trivial solution. Adding a linear cost function, e.g., , turns the feasibility problem into a linear program. To avoid sparse solutions, we add another set of constraints
That is, the rows of are increasing in the mean.
Consider the following model in which there is a never observed latent variable
In words, is the ability of a random subject, and the two groups are of different ability. Let us assume that . We have a noisy observation which is biased in favor of the stronger group. Since
if we have then the Bayes estimator of given is the same for the two groups. In reality, biasing the score of the stronger group is not done explicitly. However, in order to improve the Bayes estimator, a culturally dependent criterion may be introduced, which in effect biases the score. This situation may seem fair. The score (e.g., the SAT) is used in the same way independently of the attribute, and it is the optimal way to use the score as it is Bayes. Removal of the bias would result in an inferior selection procedure.
However, this is a discriminatory policy by other criteria. In Figure 3 we present the situation. The score distributions of the two groups are different, but, worse, between two subjects with the same latent “ability”, , the subject from the stronger group gets, on average, a higher score .
The Brier score is presented in Table 2.
| Bayes decision based on 2 categories | 0.2064 | 0.1990 |
| The non-discriminatory decision | 0.2092 | 0.2001 |
Finally, in Figure 5 we present the output distribution of the Markov kernel when .
The situation is more complicated when the outcome is more than binary valued, or continuous. However, it is simple enough if the outcome and raw prediction are jointly normal conditioned on the protected attribute. Consider, for example, the situation of Example 3.2. The concluding model is that .
Suppose , and , , are full rank. Then, without loss of generality, we can assume , , and . A minimal randomized procedure that equates the conditional distribution of is given by , where , and is an orthonormal eigen-system of .
We can always transform when by the transformation
which equates the conditional mean of under .
Now, since , does not depend on . Finally, since has full rank, , iff
(the shift is a complete sufficient statistic in the multivariate normal distribution). This is achieved by the randomization given above. ∎
4 Testing conditional parity
In this section, we describe a kernel-based approach to testing CP developed by Zhang et al. (2012). We begin by characterizing the conditional independence condition in terms of cross-covariance operators. Let , , and be positive definite kernels on , , and respectively and , , and be the respective reproducing kernel Hilbert spaces (RKHS). Throughout this section, we assume the kernels satisfy
For any probability distribution on , its RKHS embedding is the unique such that
for any . It is well-defined because the assumptions on imply is a bounded linear functional. By the reproducing property, we see that has the explicit form
The cross-covariance operator of is an operator from to such that
It is the functional analogue of the covariance matrix of a pair of random vectors. In terms of the kernel and the RKHS embeddings of the marginal distributions of and , it has the form
Letting , we see that is
The cross-covariance operator of is a positive self-adjoint operator and is called the covariance operator of . The conditional cross-covariance operator of given is
Under some technical conditions, Fukumizu et al. (2008) show that it is an operator from to such that
Before we state the functional characterization of conditional independence, we define some additional notation. The tensor product is an RKHS equipped with the inner product
We extend this inner product to all of by bilinearity. We see that the representer of evaluation in is the outer product of the representers of evaluation in and :
and the kernel is the pointwise product of and :
We are ready to state the functional characterization of conditional independence by Fukumizu et al. (2008).
Let be a kernel on and be its RKHS. As long as is a characteristic kernel (footnote 2: we call a kernel characteristic if its RKHS embedding is injective; in other words, for all implies ) on and is dense in , where denotes direct sum and is the space of constant functions, we have
The non-trivial implication in Theorem 4.1 is the “only if” implication. If , we have
for any , which implies as long as is rich enough. The assumption is dense in ensures is rich enough.
In light of Theorem 4.1, a natural test statistic is the Hilbert-Schmidt (HS) norm of a plug-in estimator of the conditional cross-covariance operator. Let the empirical cross-covariance operator of be
where (resp. ). The empirical conditional cross-covariance operator is
where is a regularization parameter. It is possible to express its HS norm in terms of the kernel matrices , , , where (resp. , ).
Zhang et al. (2012) show that the test statistic is asymptotically a mixture of independent random variables
and propose two ways to approximate the asymptotic distribution. We refer to their paper for the details.
5 Do minority neighborhoods pay higher insurance premiums?
It has been observed that drivers from predominantly minority zip codes are often charged higher insurance premiums than drivers from non-minority zip codes (Feltner and Heller, 2015). The insurance industry has justified the higher premiums by arguing that drivers from minority neighborhoods have a higher risk of accidents. In this section, we examine the claim of the insurance industry using the proposed framework and the data obtained by Jeff Larson (2017).
Before presenting the results, we briefly describe the data, which was obtained by Jeff Larson (2017) from Quadrant Information Services and S&P Global Inc. It consists of 98,441 insurance quotes for drivers fitting a single profile: A 30-year-old female teacher with a bachelor’s degree, excellent credit, no accidents or moving violations, and who is purchasing a policy for $100,000 of property damage coverage and $100,000 to cover medical bills per person up to $300,000 per accident for the first time. She drives a 2016 Toyota Camry, has a 15 mile daily commute, and drives 13,000 miles a year. The quotes are associated with the zip code of the driver, and by fixing the profile and letting zip code change, we control for factors outside of geography.
The risk of drivers in a zip code is measured by the ratio of dollars paid out for liability claims to the number of insured cars. In California, this ratio is called average loss and is a measure of the cost to the insurer of insuring a car in a zip code. Ideally, we would have data on the claims from drivers that fit the aforementioned profile, but, unfortunately, we do not have such fine-grained data. We refer to Jeff Larson (2017) for further details regarding the data.
We tested two hypotheses in California: the quotes were independent of the percent minority population given the risk in the associated zip codes (); the quotes were independent of whether the associated zip code is underserved given the risk (). The California Department of Insurance defines “underserved” zip codes as zip codes where (i) the fraction of uninsured drivers exceeds the statewide average by at least 10%, (ii) the per capita income is below the statewide median, and (iii) minorities make up at least two-thirds of the population. Among the 1,648 Californian zip codes recorded in the data, there are 145 such underserved zip codes. The results are reported in Table 3.
We examine the data on Progressive Group more closely because there is a discrepancy between the results of the test of (its quotes are independent of the percent minority population given the risk) and that of (its quotes are independent of whether the associated zip code is underserved given the risk). To understand this discrepancy, we redefine underserved zip codes as zip codes where the percent minority population is at least for various values of and test again. The results are reported in Table 4. We see that although the percentage of minority population, as a continuous variable, does not pass the conditional independence test at level, the minority indicator derived from it sometimes does.
Mathematically, the discrepancy between the tests is unsurprising: does not generally imply . However, its practical implication is noteworthy because it exposes one problem of thresholding a continuous protected attribute. Thresholding tolerates discrimination within subgroups (discrimination within the minority or the non-minority subgroup), as long as there is no discrimination across subgroups. We also note that the $p$-value when in Table 4 is quite different from that for in Table 3. This is because only of the zip codes where minority percentage exceeds are truly underserved.
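A toy construction makes the effect of thresholding explicit. In the sketch below, the outcome is a deterministic, increasing function of a continuous attribute within each subgroup, yet its distribution is identical in the two subgroups obtained by thresholding at 0.5. The data are entirely synthetic, the variable names are hypothetical, and the conditioning variable (the risk) is omitted for simplicity, as if it were held fixed.

```python
import numpy as np

n = 1000                               # even, so the grid splits evenly at the threshold
a = (np.arange(n) + 0.5) / n           # continuous protected attribute on a grid in (0, 1)
b = (a >= 0.5).astype(int)             # thresholded indicator derived from a
y = np.where(b == 1, a - 0.5, a)       # outcome: increases with a inside each subgroup

# The outcome has the same distribution in both subgroups ...
y_low = np.sort(y[b == 0])
y_high = np.sort(y[b == 1])
same_distribution = np.allclose(y_low, y_high)

# ... although within each subgroup it is perfectly correlated with a.
corr_low = np.corrcoef(a[b == 0], y[b == 0])[0, 1]
corr_high = np.corrcoef(a[b == 1], y[b == 1])[0, 1]
```

Any test based only on the indicator `b` sees no dependence here, while a test based on `a` would detect it.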
In Illinois and Texas, we tested the hypothesis that the quotes were independent of the percent minority population given the risk in the associated zip codes. In Illinois, we excluded the zip codes in Chicago because Chicago has a law that requires insurers to charge the same price for bodily injury insurance. The results are reported in Table 5.
Finally, we apply the randomization procedure described in Section 3 to adjust the premium. We concentrate on one insurer (Garrison Property and Casualty Insurance Company) and only on the property damage policy premium. There are two protected groups: white non-Hispanics and the rest. However, this attribute is not given (neither to the insurer nor in the data) and is derived from the proportions of the two groups within each zip code area.
Let be the premium in the zip code, the state-defined risk and the proportion of whites (non-Hispanic) within the zip code area. Linear regression of on , , and finds all coefficients to be significant at the 0.001 level. To proceed we make the (clearly unrealistic) assumption that all zip code areas have the same number of cars insured by the discussed insurer. Under this assumption the distribution of the excess premiums paid by the two groups (after controlling for the risk) is plotted in Figure 6(a). In this semi-artificial setup, a white customer pays, on average, 1.54 USD less than predicted, while a minority group member pays 1.65 USD more, with standard deviations equal to 25.7 and 29.2 USD respectively.
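The computation behind such group-level summaries can be sketched as follows. Everything in the block is an assumption made for illustration: the data are synthetic, the variable names are hypothetical, and the excess premium is taken to be the residual of an ordinary least squares fit of premium on risk alone. Under the equal-number-of-cars assumption, each zip code contributes to a group in proportion to that group's share of its population.

```python
import numpy as np

rng = np.random.default_rng(0)
n_zip = 500

# Synthetic zip-code data (not the insurer's data).
risk = rng.gamma(shape=5.0, scale=50.0, size=n_zip)   # risk per zip code
white_share = rng.uniform(0.05, 0.95, size=n_zip)     # proportion of white non-Hispanics
premium = (400.0 + 1.2 * risk - 30.0 * white_share
           + rng.normal(0.0, 25.0, size=n_zip))

# Regress premium on risk (with intercept); residuals play the role of excess premiums.
X = np.column_stack([np.ones(n_zip), risk])
coef, *_ = np.linalg.lstsq(X, premium, rcond=None)
excess = premium - X @ coef

def weighted_mean_sd(values, weights):
    """Weighted mean and standard deviation of values."""
    mean = np.average(values, weights=weights)
    var = np.average((values - mean) ** 2, weights=weights)
    return mean, np.sqrt(var)

# Equal number of insured cars per zip code: a group's weight in a zip code is its share.
white_mean, white_sd = weighted_mean_sd(excess, white_share)
minority_mean, minority_sd = weighted_mean_sd(excess, 1.0 - white_share)
```

Because the residuals of a regression with an intercept sum to zero, the two group means, weighted by group size, cancel exactly.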
Since the group membership is concealed from the insurer, this cannot be corrected directly. In such a situation we suggest a cross-subsidization between zip code areas, in which the premium has a component proportional to the deviation of from its mean and a randomized component proportional to . Practically, randomization can be achieved by making the premium depend slightly on hardly relevant information about the customer or the location. In Figure 6(b) we present the result of this process. Both the white and non-white groups have the same mean and standard deviation (0.0 and 29.8 USD respectively).
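The following minimal sketch shows one construction consistent with this description; it is not necessarily the one used to produce Figure 6(b). It assumes synthetic excess premiums, a deterministic adjustment proportional to the deviation of the white share from its mean, and independent zero-mean noise whose variance is proportional to the white share (or to its complement, whichever keeps the variance non-negative). All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_zip = 500
white_share = rng.uniform(0.05, 0.95, size=n_zip)

# Synthetic excess premiums: lower and less dispersed in whiter zip codes.
excess = (-30.0 * (white_share - white_share.mean())
          + rng.normal(0.0, 1.0, size=n_zip) * (30.0 - 10.0 * white_share))
excess -= excess.mean()

def group_mean(values, weights):
    return np.average(values, weights=weights)

w_white, w_min = white_share, 1.0 - white_share

# Step 1 (cross-subsidy): subtract a term proportional to the deviation of the
# white share from its mean; the slope is chosen so both group means become zero.
dev = white_share - white_share.mean()
slope = np.sum(dev * excess) / np.sum(dev ** 2)
adjusted = excess - slope * dev

# Step 2 (randomization): add independent zero-mean noise with variance t * g,
# where t equalizes the expected second moments of the two groups.
second_white = group_mean(adjusted ** 2, w_white)
second_min = group_mean(adjusted ** 2, w_min)
g = white_share if second_min >= second_white else 1.0 - white_share
t = (second_min - second_white) / (group_mean(g, w_white) - group_mean(g, w_min))
noise_var = t * g

# Expected group variances once the noise is added (group means are zero).
var_white = second_white + group_mean(noise_var, w_white)
var_min = second_min + group_mean(noise_var, w_min)

# One realization of the adjusted, randomized excess premiums.
final = adjusted + rng.normal(0.0, np.sqrt(noise_var))
```

The deterministic adjustment sums to zero across zip codes, so before randomization the total premium collected is unchanged under the equal-number-of-cars assumption; only its allocation across zip codes moves.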
6 Summary and discussion
We identified conditional parity as a general notion of non-discrimination in machine learning. It formalizes the implicit comparison in claims of discrimination and is applicable beyond supervised learning. It also includes many recently proposed notions of non-discrimination, including counterfactual ones.
The main takeaway for practitioners is the necessity of specifying not only protected attributes but also discriminatory ones in any rigorous notion of non-discrimination. Ignoring the discriminatory attributes may lead to ambiguous definitions. Consider the recent debate on whether no sex-based discrimination implies no discrimination based on sexual orientation (Thayer, 2017). In Hively v. Ivy Tech, the majority opinion stated that the “common-sense reality [is] that it is actually impossible to discriminate on the basis of sexual orientation without discriminating on the basis of sex”. However, by letting the gender to which one is attracted be the discriminatory attribute, we see that it is indeed possible to discriminate on sexual orientation but not on gender. The ambiguity in the prohibition of sex-based discrimination lies in the non-specification of discriminatory attributes.
Let be gender and be the gender to which one is attracted. Intuitively, no discrimination based on gender requires that the segments of the population in the same row of Table 6 be treated equally, while no discrimination based on sexual orientation requires that the segments in the same column be treated equally.
Finally, we mention that CP is amenable to statistical analysis. We studied randomization as a general approach to achieving CP, as well as a kernel-based approach to check whether the output of a black-box machine learning algorithm satisfies CP. Most prior work on non-discrimination in machine learning has focused on designing non-discriminatory machine learning algorithms. However, to enforce non-discrimination, methods to detect violations are crucial, and we look forward to developments in future work.
Appendix A Proof of auxiliary lemmas
Proof of Lemma 4.2.
To keep things simple, we evaluate instead of . To obtain an expression of , simply replace by . We also abuse notation and denote linear combinations of the form by , where
is an “infinite matrix”. In this notation, we have
where . The (squared) HS norm of the is
For now, we focus on evaluating the first term. Since is a mapping from a subspace of to a subspace of , we may restrict to the subspaces. Let be an orthonormal basis of . We have
where , and
By the reproducing property, we have
where is the Gram matrix whose entries are , for any . Thus
where is the centered Gram matrix. The first term in (A.1) is