Algorithms have become powerful arbitrators
of a range of significant decisions with far-reaching
societal impact: hiring [articleTimesHiring; articleWired],
welfare allocation [articleSlateWelfare],
prison sentencing [articlePropublica],
policing [articleGuardianCrime; perry2013predictive],
amongst many others.
With the range and sensitivity of algorithmic decisions expanding by the day,
the question of whether an algorithm is fair is a pressing one.
Indeed, the notion of algorithmic fairness has captured the attention of a broad spectrum of experts: machine learning and theory researchers [dwork12; zemel13; feldman15; calders10]; privacy researchers and investigative journalists [datta2015automated; articlePropublica; articWsjstaples; sweeney2013discrimination]; law scholars and social scientists [tutt2016fda; ajunwa2016hiring; barocas2014big]; governmental agencies and NGOs [articleWhitehouse14].
Ultimately, algorithmic fairness is a question about programs and their properties: Is a given program fair, under some definition of fairness? Or, how fair is it? In this paper, we describe a line of work that approaches the question of algorithmic fairness from a program-analytic perspective, in which our goal is to analyze a given decision-making program and construct a proof of its fairness or unfairness, just as a traditional static program verifier would prove correctness of a program with respect to, for example, lack of divisions by zero, integer overflows, null-pointer dereferences, etc.
We start by analyzing the challenges and research questions in checking algorithmic fairness for decision-making programs (Section 2). We then present a simple case study and show how techniques for verifying probabilistic programs can be used to automatically prove or disprove global fairness for a class of programs that subsumes a range of machine learning classifiers (Section 3). Finally, we lay out a list of challenging and interesting questions that the algorithms and programming languages communities need to answer to achieve the ultimate goal of building a fully automated system for verifying and guaranteeing algorithmic fairness in real-world applications (Section 4).
2 Proving Programs Fair
In this section, we describe the components of the fairness verification problem. Intuitively, our goal is to prove whether a certain program is fair with respect to the set of possible inputs over which it operates. Tackling the fairness-verification problem requires answering a number of challenging questions:
What class of decision-making programs should our program model capture?
How can we define the set of possible inputs to the program and capture complex probability distributions that are useful and amenable to verification?
How can we describe what it means for the program to be fair?
How can we fully automate the verification process?
Figure 1 provides a high-level picture of our proposed framework. As shown, the fairness verifier takes a (white-box) decision-making program and a population model. The verifier then proceeds to prove or disprove that the program is fair for the population defined by the model. Here, the model defines a joint probability distribution on the inputs of the program. Existing definitions of fairness define programs as fair or unfair with respect to a given concrete dataset. While using a concrete dataset simplifies the verification problem, it also raises the question of whether the dataset is representative of the population for which we are trying to prove fairness. Our technique moves away from concrete datasets and replaces them with a probabilistic population model. We envision a future in which fairness verification is regulated; the European Union (EU), for instance, has already begun regulating algorithmic decision-making [Goodman16]. A governmental agency could, for example, publish a probabilistic population model (e.g., generated from census data). Any organization employing a decision-making algorithm with potentially significant consequences (e.g., hiring) would then have to quantify the fairness of its algorithmic process against the current picture of the population, as specified by the population model.
In the context of algorithmic fairness, a program takes as input a vector of arguments representing a set of input attributes (features), where one (or more) of the arguments in the vector is sensitive, e.g., gender or race. Evaluating the program may return a Boolean value (e.g., hire or not hire) if the program is a binary classifier, or a numerical value, e.g., a mortgage rate. The set of combinators, operations, and types used by the program can vastly affect the complexity of the verification procedures. For example, loops are the hardest type of programming construct to reason about, but most machine learning classifiers do not contain loops. Similarly, since classifiers typically operate over real values, we can limit the set of possible types allowed in our programs to reals or other types that can be desugared into reals. All these decisions are crucial in the design of a verification procedure.
To be able to reason about the outcome of the program we need to specify what kind of input the program will operate on. For example, although a program that allocates mortgages might be “fair” with respect to a certain set of applicants, it may become unfair when considering a different pool of people. In program verification, the “kind of inputs” over which the program operates is called the precondition and is typically stated as a formal logical property with the program inputs as free variables. An example of a program precondition is a formula stating that none of the program inputs is both a woman and a priest. Of course, there are many possible choices for what language we can use to describe the program’s precondition. In particular, if we want to capture a certain probability distribution over the input of the program, our language will be a logic that can describe probabilities and random variables. For example, we might want to be able to specify that half of the inputs are female, or that the age of the processed inputs has a particular distribution. Again, the choice of the language allowed in the preconditions is crucial in the design of a verification procedure. From now on, we refer to the program precondition as the population model.
The next step is to define a property stating that the program’s outcome is fair with respect to the program’s precondition. In program verification, this is called the postcondition of the program. As observed in the fairness literature, there are many ways to define when and why a program is fair or unfair.
For example, if we want to prove group fairness (i.e., that the algorithm is just as likely to hire a minority applicant as it is to hire other, non-minority applicants), our postcondition will compare the probability that the program returns true for minority applicants with the probability that it returns true for non-minority applicants, where true is the desired return value of the program, e.g., indicating hiring. On the other hand, if we want to prove individual fairness (i.e., similar inputs should have similar outcomes), our postcondition will relate the outcomes of the program on pairs of similar inputs.
Notice that the last postcondition relates the outcomes of the program on different input values. As the two types of properties we described are radically different, they will also require different verification mechanisms.
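As a rough illustration of the two shapes of postcondition discussed above, one possible way to write them is sketched below. All symbols are assumptions introduced here for illustration: P stands for the program, \vec{v} and \vec{v}' for input vectors, \epsilon for a tolerance, and d and \delta for a similarity metric on inputs and its threshold.

```latex
% Group fairness, written as a ratio of conditional probabilities
% (\epsilon is an assumed tolerance):
\frac{\Pr[\,P(\vec{v}) = \mathit{true} \mid \vec{v} \text{ is a minority applicant}\,]}
     {\Pr[\,P(\vec{v}) = \mathit{true} \mid \vec{v} \text{ is not a minority applicant}\,]}
  > 1 - \epsilon

% Individual fairness, relating two runs of the program
% (d and \delta are an assumed similarity metric and threshold):
\forall \vec{v}, \vec{v}'.\;\; d(\vec{v}, \vec{v}') < \delta
  \implies P(\vec{v}) = P(\vec{v}')
```

The first condition mentions a single run of the program under the population model, whereas the second quantifies over pairs of runs, which is why the two call for different verification mechanisms.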
Proofs of (un)fairness
The task of proving whether a program is fair boils down to statically checking whether, on inputs satisfying the precondition, the outcome of the program satisfies the postcondition. For simple definitions, such as group fairness, the verification problem reduces to computing the probability of a number of events with respect to the program and the population model. For more complex definitions, such as individual fairness, proving fairness requires more complex reasoning involving multiple runs of the program (i.e., a hyperproperty [clarkson2010hyperproperties]), a notoriously hard problem. In the case of a negative result, the verifier should provide the users with a proof of unfairness. Depending on the fairness definition, producing a human-readable proof might be challenging, as the argument might involve multiple, and potentially infinitely many, inputs. For example, in the case of group fairness it might be challenging to explain why the program outputs true on 40% of the minority inputs and on 70% of the majority inputs.
3 Case Study
We now describe a simplified case study demonstrating how our fairness verification methodology can be used to prove or disprove fairness of a given decision-making program.
A program and a population model
Consider the program dec, which is a decision-making program that takes a job applicant’s college ranking and years of experience and decides whether they get hired or not (the fairness target). The program implements a decision tree, perhaps one generated by a machine-learning algorithm. A person is hired if they attended a top-5 college (colRank <= 5) or have lots of experience compared to their college’s ranking (expRank > -5). Observe that dec does not access ethnicity.
Now, consider the program popModel, which is a probabilistic program describing a simple model of the population. Here, a member of the population has three attributes, all of which are real-valued: (i) ethnicity; (ii) colRank, the ranking of the college the person attended (lower is better); and (iii) yExp, the years of work experience a person has. We consider a person to be a member of a protected group if ethnicity > 10; we call this the sensitive condition.
The population model can be viewed as a generative model of records of individuals: the more likely a combination is to occur in the population, the more likely it will be generated. For instance, the years of experience an individual has (line 4) follow a Gaussian distribution.
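To make the two programs concrete, here is a minimal Python sketch of dec and popModel, together with a sampling-based estimate of the group fairness ratio. Several details are assumptions made only for illustration: the definition expRank = yExp - colRank, the parameters of the colRank and yExp distributions, reading gauss(0,10) as mean 0 and standard deviation 10, and the shift applied to colRank under the sensitive condition. Sampling is also not the exact inference technique described later in this section; it only gives a rough estimate.

```python
import random


def pop_model(rng):
    """Draw one individual from a hypothetical population model."""
    # gauss(0,10) for ethnicity follows the text; treating 10 as the
    # standard deviation is an assumption.
    ethnicity = rng.gauss(0, 10)
    col_rank = rng.gauss(25, 10)  # hypothetical parameters
    y_exp = rng.gauss(10, 5)      # hypothetical parameters
    if ethnicity > 10:            # sensitive condition from the text
        col_rank += 5             # hypothetical correlation with ethnicity
    return ethnicity, col_rank, y_exp


def dec(col_rank, y_exp):
    """Decision tree: hire top-5 graduates or highly experienced applicants."""
    exp_rank = y_exp - col_rank   # assumed definition of expRank
    return col_rank <= 5 or exp_rank > -5


def estimate_ratio(n=100_000, seed=0):
    """Estimate Pr[hire | min], Pr[hire | not min], and their ratio by sampling."""
    rng = random.Random(seed)
    hired = {True: 0, False: 0}
    total = {True: 0, False: 0}
    for _ in range(n):
        ethnicity, col_rank, y_exp = pop_model(rng)
        minority = ethnicity > 10
        total[minority] += 1
        hired[minority] += int(dec(col_rank, y_exp))
    p_min = hired[True] / total[True]
    p_maj = hired[False] / total[False]
    return p_min, p_maj, p_min / p_maj


if __name__ == "__main__":
    print(estimate_ratio())
```

Under these assumed parameters, dec treats the two groups differently even though it never reads ethnicity, because colRank is correlated with ethnicity in the population model.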
A note on the program model
Note that our program model, while not admitting arbitrary programs, is rich enough to capture programs (classifiers) generated by standard machine learning algorithms. For example, linear support vector machines, decision trees, and neural networks can be represented in our language simply using assignments with arithmetic expressions and conditionals. Similarly, the population model is a probabilistic program, where assignments can be made by drawing values from predefined distributions. Like other probabilistic programming languages, our programming model is rich enough to subsume graphical models like Bayesian networks [gordon2014probabilistic].
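As a small illustration of this point, the sketch below writes a linear classifier and a tiny network with rectified linear units using only assignments, arithmetic expressions, and conditionals. The weights, biases, and function names are hypothetical and are not taken from any trained model.

```python
def linear_svm(col_rank, y_exp):
    """Linear classifier: a single arithmetic expression and a comparison."""
    # Hypothetical weights and bias, chosen only for illustration.
    score = 0.5 * y_exp - 0.2 * col_rank - 1.0
    return score > 0


def relu(x):
    """A rectified linear unit is just a conditional."""
    if x > 0:
        return x
    return 0.0


def tiny_network(col_rank, y_exp):
    """Two hidden ReLU units followed by a linear output (hypothetical weights)."""
    h1 = relu(0.3 * y_exp - 0.1 * col_rank)
    h2 = relu(-0.2 * y_exp + 0.05 * col_rank + 1.0)
    out = 1.0 * h1 - 0.5 * h2
    return out > 0
```

Both functions are loop-free and use only linear expressions in each branch, which is the shape of program the verification procedure in this section relies on.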
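Suppose that our goal is to prove group fairness, following the definition of Feldman et al. [feldman15], which is stated in terms of the ratio between the probability of hiring conditioned on min and the probability of hiring conditioned on its negation, where min is shorthand for the sensitive condition ethnicity > 10.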
Probabilistic inference as volume computation
To prove (un)fairness of the decision-making model with respect to the population, we need to compute the probabilities appearing in the group fairness ratio. For illustration, suppose we are computing the probability that a non-minority applicant is hired. We need to reason about the composition of the two programs, popModel followed by dec. That is, we want to compute the probability that (i) popModel generates a non-minority applicant, and (ii) dec hires that applicant. To do so, we observe that every possible execution of the composition is uniquely characterized by the set of the three probabilistic choices made by popModel. In other words, every execution is characterized by a vector of three real values, one per probabilistic choice.
Thus, our goal is to compute the probability that we draw a vector that results in a non-minority applicant being hired. Probabilistic programming languages, e.g., Church [goodman2012church], R2 [nori2014r2], and Stan [carpenter2015stan], employ approximate inference techniques, like MCMC, which converge in the limit but offer no guarantees on how far we are from the exact result. In our work, we consider exact inference, which has primarily received attention in the Bayesian network setting, and boils down to solving a #SAT instance [chavira2008probabilistic]. In our setting, however, we are dealing with real-valued variables.
Using standard techniques from program analysis and verification, we can characterize the set of all such vectors as a formula comprising Boolean combinations (conjunctions/disjunctions) of linear inequalities, since our program only has linear expressions. Geometrically, the formula is a set of convex polyhedra in three-dimensional real space. Therefore, the probability of interest is the same as the probability of drawing a vector that lies inside this region. In other words, we are interested in the volume of the region, weighted by the probabilistic choices. Formally, this is the integral, over the region, of the product of the probability density functions of the three probabilistic choices; one of them, e.g., is the probability density function of the distribution gauss(0,10), the distribution from which the value of ethnicity is drawn in line 2 of popModel.
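One way to write this weighted volume is sketched below. The symbols are assumptions introduced for illustration: \varphi denotes the formula (and the region it defines), v_1, v_2, v_3 are the three values drawn by popModel, and p_e, p_c, p_y are the probability density functions of the distributions from which ethnicity, colRank, and yExp are drawn, with the three draws assumed independent.

```latex
\Pr[\varphi] \;=\; \int_{\varphi} p_e(v_1)\, p_c(v_2)\, p_y(v_3)\;
  \mathrm{d}v_1\, \mathrm{d}v_2\, \mathrm{d}v_3
```

When the region is a single hyperrectangle [l_1, u_1] x [l_2, u_2] x [l_3, u_3], the integral factors into a product of three one-dimensional integrals, each of which is a difference of two values of the corresponding cumulative distribution function.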
The volume computation problem is a well-studied and hard problem [khachiyan1993complexity; dyer1988complexity]. Indeed, even for a convex polytope, computing its volume is #P-hard. Leveraging the great developments in satisfiability modulo theories (SMT) solvers [barrett09], we developed a procedure that reduces the volume computation problem to a series of calls to the SMT solver, viewed completely as an oracle. Specifically, our procedure uses the SMT solver to sample subregions of the region defined by the formula that are hyperrectangular. Intuitively, for hyperrectangular regions, evaluating the above integral is a matter of evaluating the CDFs of the various distributions. Thus, by systematically sampling more and more non-overlapping hyperrectangles inside the region, we maintain a lower bound on the probability of interest. Figure 2 pictorially illustrates the region and an under-approximation with 4 hyperrectangles. Similarly, to compute an upper bound on the probability, we can simply invoke our procedure on the negation of the formula.
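The CDF-based evaluation for hyperrectangles can be sketched as follows. This is a minimal illustration under stated assumptions: every dimension is an independent Gaussian given as a (mean, standard deviation) pair, and the hyperrectangles are supplied by hand, whereas in the procedure described above they would be obtained from the SMT solver. The function names are hypothetical.

```python
import math


def gauss_cdf(x, mean, std):
    """CDF of a Gaussian with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def box_mass(box, dists):
    """Probability mass of one hyperrectangle.

    box:   list of (lo, hi) bounds, one pair per dimension
    dists: list of (mean, std) pairs, one per dimension (assumed independent)
    """
    mass = 1.0
    for (lo, hi), (mean, std) in zip(box, dists):
        mass *= gauss_cdf(hi, mean, std) - gauss_cdf(lo, mean, std)
    return mass


def lower_bound(boxes, dists):
    """Sum the masses of hyperrectangles.

    This is a valid lower bound only if the boxes are pairwise
    non-overlapping and all lie inside the region of interest.
    """
    return sum(box_mass(b, dists) for b in boxes)
```

Adding further non-overlapping hyperrectangles can only increase the sum, so the lower bound tightens monotonically as more of the region is covered.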
The fairness verification tool terminates when it has computed lower/upper bounds that prove or disprove the desired fairness criterion. The hyperrectangles sampled in the process of computing volumes can serve as proof certificates. That is, an external entity can take the hyperrectangles, compute their volumes, and ensure that they indeed lie in the expected regions.
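The termination test can be sketched as an interval check. The sketch assumes that the fairness criterion has the form ratio > 1 - eps, where the ratio divides the hiring probability of the minority group by that of the majority group, and that lower and upper bounds on both probabilities are already available. The function name, its parameters, and the tolerance eps are hypothetical.

```python
def check_group_fairness(num_lo, num_hi, den_lo, den_hi, eps):
    """Decide whether (numerator / denominator) > 1 - eps from interval bounds.

    num_lo, num_hi: bounds on the hiring probability of the minority group
    den_lo, den_hi: bounds on the hiring probability of the majority group
    Returns "fair", "unfair", or "unknown" (bounds still too loose).
    """
    if not (0 <= num_lo <= num_hi and 0 < den_lo <= den_hi):
        raise ValueError("bounds must be ordered and the denominator positive")
    ratio_lo = num_lo / den_hi  # smallest ratio consistent with the bounds
    ratio_hi = num_hi / den_lo  # largest ratio consistent with the bounds
    if ratio_lo > 1 - eps:
        return "fair"
    if ratio_hi <= 1 - eps:
        return "unfair"
    return "unknown"
```

An "unknown" result means the bounds are still too loose to decide, so more hyperrectangles would need to be sampled before the check is repeated.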
4 Experience and Future Outlook
We have built a fairness-verification tool, called FairSquare, that takes a decision-making program and a population model, and verifies fairness of the program with respect to the model. So far, we have focused on group fairness. The tool uses the popular Z3 SMT solver [de2008z3] for manipulating first-order formulas over arithmetic theories.
We have used FairSquare to prove or disprove fairness of a suite of population models and machine-learning classifiers that were automatically generated from real-world datasets used in other work on algorithmic fairness [feldman15; zemel13; datta2016algorithmic]. Specifically, we have considered linear SVMs, simple neural networks with rectified linear units, and decision trees.
Looking forward, we see a wide range of avenues for improvement and exploration. For instance, we are currently working on the problem of making an unfair program fair. That is, given a program that is considered unfair, what is the smallest tweak that would make it fair? Our goal is to repair the program, making it fair, while ensuring that it is semantically close to the original program.