Fairness as a Program Property

by Aws Albarghouthi, et al.

We explore the following question: Is a decision-making program fair, for some useful definition of fairness? First, we describe how several algorithmic fairness questions can be phrased as program verification problems. Second, we discuss an automated verification technique for proving or disproving fairness of decision-making programs with respect to a probabilistic model of the population.


1 Introduction

Algorithms have become powerful arbitrators of a range of significant decisions with far-reaching societal impact: hiring [articleTimesHiring; articleWired], welfare allocation [articleSlateWelfare], prison sentencing [articlePropublica], and policing [articleGuardianCrime; perry2013predictive], amongst many others. With the range and sensitivity of algorithmic decisions expanding by the day, the question of whether an algorithm is fair is a pressing one. Indeed, the notion of algorithmic fairness has captured the attention of a broad spectrum of experts: machine learning and theory researchers [dwork12; zemel13; feldman15; calders10]; privacy researchers and investigative journalists [datta2015automated; articlePropublica; articWsjstaples; sweeney2013discrimination]; law scholars and social scientists [tutt2016fda; ajunwa2016hiring; barocas2014big]; and governmental agencies and NGOs [articleWhitehouse14].

Ultimately, algorithmic fairness is a question about programs and their properties: Is a given program $P$ fair, under some definition of fairness? Or, how fair is $P$? In this paper, we describe a line of work that approaches the question of algorithmic fairness from a program-analytic perspective, in which our goal is to analyze a given decision-making program and construct a proof of its fairness or unfairness, just as a traditional static program verifier would prove correctness of a program with respect to, for example, lack of divisions by zero, integer overflows, null-pointer dereferences, etc.

We start by analyzing the challenges and research questions in checking algorithmic fairness for decision-making programs (Section 2). We then present a simple case study and show how techniques for verifying probabilistic programs can be used to automatically prove or disprove global fairness for a class of programs that subsumes a range of machine learning classifiers (Section 3). Finally, we lay out a list of challenging and interesting questions that the algorithms and programming languages communities need to answer to achieve the ultimate goal of building a fully automated system for verifying and guaranteeing algorithmic fairness in real-world applications (Section 4).

Figure 1: Overview

2 Proving Programs Fair

In this section, we describe the components of the fairness verification problem. Intuitively, our goal is to prove whether a certain program is fair with respect to the set of possible inputs over which it operates. Tackling the fairness-verification problem requires answering a number of challenging questions:

  • What class of decision-making programs should our program model capture?

  • How can we define the set of possible inputs to the program and capture complex probability distributions that are useful and amenable to verification?

  • How can we describe what it means for the program to be fair?

  • How can we fully automate the verification process?

Figure 1 provides a high-level picture of our proposed framework. As shown, the fairness verifier takes a (white-box) decision-making program $P$ and a population model $M$. The verifier then proceeds to prove or disprove that $P$ is fair for the given population defined by the model $M$. Here, the model $M$ defines a joint probability distribution on the inputs of $P$. Existing definitions of fairness define programs as fair or unfair with respect to a given concrete dataset. While using a concrete dataset simplifies the verification problem, it also raises the question of whether the dataset is representative of the population for which we are trying to prove fairness. Our technique moves away from concrete datasets and replaces them with a probabilistic population model. We envision a future in which fairness verification is regulated. (The European Union (EU), for instance, has already begun regulating algorithmic decision-making [Goodman16].) For example, a governmental agency can publish a probabilistic population model (e.g., generated from census data). Any organization employing a decision-making algorithm with potentially significant consequences (e.g., hiring) must then quantify the fairness of its algorithmic process against the current picture of the population, as specified by the population model.

Decision-making programs

In the context of algorithmic fairness, a program $P$ takes as input a vector of arguments $\vec{v}$ representing a set of input attributes (features), where one (or more) of the arguments in the vector is sensitive, e.g., gender or race. Evaluating $P(\vec{v})$ may return a Boolean value (e.g., hire or not hire) if the program is a binary classifier, or a numerical value (e.g., a mortgage rate). The set of combinators, operations, and types used by the program can vastly affect the complexity of the verification procedures. For example, loops are among the hardest programming constructs to reason about, but most machine learning classifiers do not contain loops. Similarly, since classifiers typically operate over real values, we can limit the types allowed in our programs to reals or other types that can be desugared into reals. All these decisions are crucial in the design of a verification procedure.

Population model

To be able to reason about the outcome of the program, we need to specify what kind of input the program will operate on. For example, although a program that allocates mortgages might be “fair” with respect to a certain set of applicants, it may become unfair when considering a different pool of people. In program verification, the “kind of inputs” over which the program operates is called the precondition and is typically stated as a formal logical property with the program inputs as free variables. An example of a program precondition is

$$\neg(\mathit{gender} = \mathit{female} \wedge \mathit{job} = \mathit{priest}),$$

which indicates that none of the program inputs is both a woman and a priest. Of course, there are many possible choices for the language we use to describe the program’s precondition. In particular, if we want to capture a certain probability distribution over the inputs of the program, our language will be a logic that can describe probabilities and random variables. For example, we might want to be able to specify that half of the inputs are female, $\Pr[\mathit{gender} = \mathit{female}] = 0.5$, or that the age of the processed inputs has a particular distribution, e.g., $\mathit{age} \sim \mathit{gauss}(\mu, \sigma)$ for some mean $\mu$ and standard deviation $\sigma$. Again, the choice of the language allowed in the preconditions is crucial in the design of a verification procedure. From now on, we refer to the program precondition, $M$, as the population model.

Fairness properties

The next step is to define a property stating that the program’s outcome is fair with respect to the program’s precondition. In program verification, this is called the postcondition of the program. As observed in the fairness literature, there are many ways to define when and why a program is fair or unfair.

For example, if we want to prove group fairness, i.e., that the algorithm is just as likely to hire a minority applicant ($\mathit{min}$) as it is to hire other, non-minority applicants, our postcondition will be an expression of the form

$$\Pr[P(\vec{v}) = \mathit{true} \mid \mathit{min}] = \Pr[P(\vec{v}) = \mathit{true} \mid \neg\mathit{min}],$$

where true is the desired return value of the program, e.g., indicating hiring. On the other hand, if we want to prove individual fairness, i.e., that similar inputs should have similar outcomes, our postcondition will be an expression of the form

$$\forall \vec{v}_1, \vec{v}_2.\ \vec{v}_1 \sim \vec{v}_2 \Rightarrow P(\vec{v}_1) = P(\vec{v}_2),$$

where $\sim$ denotes a chosen similarity relation on inputs. Notice that the last postcondition relates the outcomes of the program on different input values. As the two types of properties we described are radically different, they will also require different verification mechanisms.
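To make the difference between the two kinds of postcondition concrete, the following sketch evaluates both on a finite list of inputs. It is only an empirical check over samples, not the static proof discussed in this paper. The helper names (`group_rates`, `individual_violations`), the toy program, the toy inputs, and the similarity relation are all assumptions made for illustration.

```python
from itertools import combinations

def group_rates(program, inputs, is_minority):
    """Positive-outcome rate among minority inputs and among all others."""
    minority = [v for v in inputs if is_minority(v)]
    others = [v for v in inputs if not is_minority(v)]

    def rate(group):
        return sum(1 for v in group if program(v)) / len(group)

    return rate(minority), rate(others)

def individual_violations(program, inputs, similar):
    """Pairs of similar inputs on which the program gives different outcomes."""
    return [(a, b) for a, b in combinations(inputs, 2)
            if similar(a, b) and program(a) != program(b)]

# Hypothetical toy setup: an input is a pair (is_minority, score).
def toy_program(v):
    return v[1] >= 50

def toy_similar(a, b):
    # Two inputs count as similar if their scores differ by at most 1.
    return abs(a[1] - b[1]) <= 1

toy_inputs = [(True, 49), (True, 60), (False, 50), (False, 70)]

minority_rate, other_rate = group_rates(toy_program, toy_inputs, lambda v: v[0])
violations = individual_violations(toy_program, toy_inputs, toy_similar)
print("rates:", minority_rate, other_rate)
print("individual-fairness violations:", violations)
```

The group check only needs outcome frequencies per group, whereas the individual check has to compare pairs of runs, which is why the two properties call for different verification machinery.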

Proofs of (un)fairness

The task of proving whether a program is fair boils down to statically checking whether, on inputs satisfying the precondition, the outcome of the program satisfies the postcondition. For simple definitions, such as group fairness, the verification problem reduces to computing the probability of a number of events with respect to the program and the population model. For more complex definitions, such as individual fairness, proving fairness requires more complex reasoning involving multiple runs of the program (i.e., a hyperproperty [clarkson2010hyperproperties]), a notoriously hard problem. In the case of a negative result, the verifier should provide the users with a proof of unfairness. Depending on the fairness definition, producing a human-readable proof might be challenging, as the argument might involve multiple, potentially infinitely many, inputs. For example, in the case of group fairness it might be challenging to explain why the program outputs true on 40% of the minority inputs and on 70% of the majority inputs.

3 Case Study

We now describe a simplified case study demonstrating how our fairness verification methodology can be used to prove or disprove fairness of a given decision-making program.

A program and a population model

Consider the following program dec, which is a decision-making program that takes a job applicant’s college ranking and years of experience and decides whether they get hired or not (the fairness target, hire). The program implements a decision tree, perhaps one generated by a machine-learning algorithm. A person is hired if they attended a top-5 college (colRank <= 5) or have lots of experience compared to their college’s ranking (expRank > -5). Observe that dec does not access ethnicity.

define dec(colRank, yExp)
  expRank ← yExp - colRank
  if (colRank <= 5)
    hire ← true
  elif (expRank > -5)
    hire ← true
  else
    hire ← false
  return hire
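For readers who want to run the example, dec can be transcribed almost line by line into Python. This is a sketch rather than the authors' implementation; the snake_case names are ours, and the three applicants at the end are made-up values chosen to exercise each branch.

```python
def dec(col_rank: float, y_exp: float) -> bool:
    """Decision tree from the case study.

    Hire if the applicant attended a top-5 college, or if their experience
    is high relative to their college's ranking.
    """
    exp_rank = y_exp - col_rank
    if col_rank <= 5:
        hire = True
    elif exp_rank > -5:
        hire = True
    else:
        hire = False
    return hire

# Hypothetical applicants, one per branch of the tree.
top_college = dec(3, 0)        # colRank <= 5
experienced = dec(20, 16)      # expRank = -4, which is > -5
neither = dec(30, 10)          # expRank = -20
print(top_college, experienced, neither)
```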

Now, consider the program popModel, which is a probabilistic program describing a simple model of the population. Here, a member of the population has three attributes, all of which are real-valued: (i) ethnicity; (ii) colRank, the ranking of the college the person attended (lower is better); and (iii) yExp, the years of work experience a person has. We consider a person to be a member of a protected group if ethnicity > 10; we call this the sensitive condition. The population model can be viewed as a generative model of records of individuals: the more likely a combination is to occur in the population, the more likely it will be generated. For instance, the years of experience an individual has (line 4) follows a Gaussian distribution with mean 10 and standard deviation 5.

define popModel()
  ethnicity ~ gauss(0,10)
  colRank ~ gauss(25,10)
  yExp ~ gauss(10,5)
  if (ethnicity > 10)
    colRank ← colRank + 5
  return colRank, yExp
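The generative reading of popModel can be tried out with a small sampler. The sketch below assumes that the second argument of gauss is a standard deviation, and, unlike the pseudocode, it also returns ethnicity so that the sensitive condition can be inspected afterwards. The seed and sample size are arbitrary choices.

```python
import random

def pop_model(rng: random.Random):
    """Draw one individual from the population model."""
    ethnicity = rng.gauss(0, 10)
    col_rank = rng.gauss(25, 10)
    y_exp = rng.gauss(10, 5)
    if ethnicity > 10:
        # Members of the protected group attend lower-ranked colleges on average.
        col_rank = col_rank + 5
    return ethnicity, col_rank, y_exp

rng = random.Random(0)
sample = [pop_model(rng) for _ in range(20_000)]

protected = [s for s in sample if s[0] > 10]
others = [s for s in sample if s[0] <= 10]

frac_protected = len(protected) / len(sample)
mean_rank_protected = sum(s[1] for s in protected) / len(protected)
mean_rank_others = sum(s[1] for s in others) / len(others)

print(f"fraction with ethnicity > 10: {frac_protected:.3f}")
print(f"mean colRank (protected):     {mean_rank_protected:.2f}")
print(f"mean colRank (others):        {mean_rank_others:.2f}")
```

The sample statistics make the built-in correlation visible: the protected group's college rank is shifted by 5, even though dec never reads ethnicity.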

A note on the program model

Note that our program model, while not admitting arbitrary programs, is rich enough to capture programs (classifiers) generated by standard machine learning algorithms. For example, linear support vector machines, decision trees, and neural networks can be represented in our language simply using assignments with arithmetic expressions and conditionals. Similarly, the population model is a probabilistic program, where assignments can be made by drawing values from predefined distributions. Like other probabilistic programming languages, our programming model is rich enough to subsume graphical models like Bayesian networks.
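As an illustration of this claim, here is how a linear classifier and a one-neuron network with a rectified linear unit might look when written with nothing but assignments, arithmetic, and conditionals. The function names and all weights are hypothetical; they are not taken from any trained model.

```python
def linear_svm(col_rank: float, y_exp: float) -> bool:
    """A linear classifier: one arithmetic expression and one conditional."""
    score = -1.0 * col_rank + 2.0 * y_exp + 5.0   # hypothetical weights and bias
    if score > 0:
        hire = True
    else:
        hire = False
    return hire

def tiny_relu_net(col_rank: float, y_exp: float) -> bool:
    """A network with a single hidden ReLU unit."""
    h = 0.5 * y_exp - 0.1 * col_rank   # hidden pre-activation (hypothetical weights)
    if h < 0:                          # the ReLU is just a conditional
        h = 0.0
    out = 2.0 * h - 1.0                # output layer
    if out > 0:
        hire = True
    else:
        hire = False
    return hire

print(linear_svm(10, 10), linear_svm(25, 10))
print(tiny_relu_net(25, 10), tiny_relu_net(50, 2))
```

Because every operation is linear and every branch tests a linear inequality, programs of this shape fit the loop-free, real-valued fragment described above.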


Group fairness

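Suppose that our goal is to prove group fairness, following the definition of Feldman et al. [feldman15]:

$$\frac{\Pr[\mathit{hire} \mid \mathit{min}]}{\Pr[\mathit{hire} \mid \neg\mathit{min}]} > 1 - \epsilon,$$

where min is shorthand for the sensitive condition ethnicity > 10 and $\epsilon$ is a chosen tolerance.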
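Before turning to exact reasoning, it can help to see roughly what the ratio of the two conditional hiring probabilities looks like for dec and popModel. The sketch below estimates it by plain Monte Carlo sampling. This is an approximation with no formal guarantee, so it is not a substitute for the verification approach of this paper. The function names, seed, and sample size are assumptions, and gauss's second argument is taken to be a standard deviation.

```python
import random

def pop_model(rng):
    """Sample (ethnicity, colRank, yExp) from the population model."""
    ethnicity = rng.gauss(0, 10)
    col_rank = rng.gauss(25, 10)
    y_exp = rng.gauss(10, 5)
    if ethnicity > 10:
        col_rank = col_rank + 5
    return ethnicity, col_rank, y_exp

def dec(col_rank, y_exp):
    """Hiring decision from the case study."""
    exp_rank = y_exp - col_rank
    return col_rank <= 5 or exp_rank > -5

def estimate_group_fairness(n_samples, seed=0):
    """Estimate Pr[hire | min], Pr[hire | not min], and their ratio."""
    rng = random.Random(seed)
    hired = {True: 0, False: 0}
    total = {True: 0, False: 0}
    for _ in range(n_samples):
        ethnicity, col_rank, y_exp = pop_model(rng)
        is_min = ethnicity > 10          # the sensitive condition
        total[is_min] += 1
        if dec(col_rank, y_exp):
            hired[is_min] += 1
    p_min = hired[True] / total[True]
    p_other = hired[False] / total[False]
    return p_min, p_other, p_min / p_other

p_min, p_other, ratio = estimate_group_fairness(100_000)
print(f"Pr[hire | min]     ~ {p_min:.3f}")
print(f"Pr[hire | not min] ~ {p_other:.3f}")
print(f"ratio              ~ {ratio:.3f}")
```

Under these assumptions the estimated ratio comes out well below 1, which suggests that dec disadvantages the protected group through colRank even though it never reads ethnicity.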

Probabilistic inference as volume computation

To prove (un)fairness of the decision-making model with respect to the population, we need to compute the probabilities appearing in the group fairness ratio. For illustration, suppose we are computing the probability $\Pr[\mathit{hire} \wedge \neg\mathit{min}]$. We need to reason about the composition of the two programs, $\mathit{dec} \circ \mathit{popModel}$. That is, we want to compute the probability that (i) popModel generates a non-minority applicant, and (ii) dec hires that applicant. To do so, we observe that every possible execution of the composition is uniquely characterized by the three probabilistic choices made by popModel. In other words, every execution is characterized by a vector $\vec{v} \in \mathbb{R}^3$.

Thus, our goal is to compute the probability that we draw a vector $\vec{v}$ that results in a non-minority applicant being hired. Probabilistic programming languages, e.g., Church [goodman2012church], R2 [nori2014r2], and Stan [carpenter2015stan], employ approximate inference techniques, like MCMC, which converge in the limit but offer no guarantees on how far we are from the exact result. In our work, we consider exact inference, which has primarily received attention in the Bayesian network setting, where it boils down to solving a #SAT instance [chavira2008probabilistic]. In our setting, however, we are dealing with real-valued variables.

Using standard techniques from program analysis and verification, we can characterize the set of all such vectors as a formula $\varphi$, which is composed of Boolean combinations (conjunctions/disjunctions) of linear inequalities, since our program only has linear expressions. Geometrically, the formula $\varphi$ is a set of convex polyhedra in $\mathbb{R}^3$. Therefore, the probability $\Pr[\mathit{hire} \wedge \neg\mathit{min}]$ is the same as the probability of drawing a vector that lies inside $\varphi$. In other words, we are interested in the volume of $\varphi$, weighted by the probabilistic choices. Formally:

$$\Pr[\mathit{hire} \wedge \neg\mathit{min}] = \int_{\varphi} p_e(e)\, p_c(c)\, p_y(y)\; de\, dc\, dy,$$

where $e$, $c$, and $y$ range over the values drawn for ethnicity, colRank, and yExp, and, e.g., $p_e$ is the probability density function of the distribution gauss(0,10), the distribution from which the value of ethnicity is drawn in line 2 of popModel.

The volume computation problem is well studied and hard [khachiyan1993complexity; dyer1988complexity]. Indeed, even for a convex polytope, computing its volume is #P-hard. Leveraging the great developments in satisfiability modulo theories (SMT) solvers [barrett09], we developed a procedure that reduces the volume computation problem to a series of calls to the SMT solver, viewed completely as an oracle. Specifically, our procedure uses the SMT solver to sample subregions of $\varphi$ that are hyperrectangular. Intuitively, for hyperrectangular regions in $\mathbb{R}^3$, evaluating the above integral is a matter of evaluating the CDFs of the various distributions. Thus, by systematically sampling more and more non-overlapping hyperrectangles in $\varphi$, we maintain a lower bound on the probability of interest. Figure 2 pictorially illustrates $\varphi$ and an under-approximation with 4 hyperrectangles. Similarly, to compute an upper bound on the probability, we can simply invoke our procedure on $\neg\varphi$: a lower bound $\ell$ on the probability of $\neg\varphi$ gives the upper bound $1 - \ell$ on the probability of $\varphi$.

Figure 2: Underapproximation of $\varphi$ as hyperrectangles
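The idea of bounding the weighted volume with hyperrectangles can be sketched without an SMT solver. The code below is not the procedure described above: it swaps SMT-based sampling for plain recursive bisection with interval checks, and it hard-codes a formula for $\Pr[\mathit{hire} \wedge \neg\mathit{min}]$ that we derived by hand from dec and popModel, namely (e <= 10) and (c <= 5 or y - c > -5). The truncation window of eight standard deviations, the recursion depth, and all function names are assumptions, and gauss's second argument is taken to be a standard deviation.

```python
import math

INF = float("inf")

def gauss_cdf(x, mu, sigma):
    """CDF of a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def box_mass(box):
    """Probability that (ethnicity, colRank, yExp) lands in the box.

    The three draws are independent, so the integral over a box factors
    into a product of CDF differences, one per dimension.
    """
    (e_lo, e_hi), (c_lo, c_hi), (y_lo, y_hi) = box
    return ((gauss_cdf(e_hi, 0, 10) - gauss_cdf(e_lo, 0, 10))
            * (gauss_cdf(c_hi, 25, 10) - gauss_cdf(c_lo, 25, 10))
            * (gauss_cdf(y_hi, 10, 5) - gauss_cdf(y_lo, 10, 5)))

def classify(box):
    """Conservatively place a box relative to phi (hand-derived, see lead-in)."""
    (e_lo, e_hi), (c_lo, c_hi), (y_lo, y_hi) = box
    if e_hi <= 10 and (c_hi <= 5 or y_lo - c_hi > -5):
        return "inside"
    if e_lo > 10 or (c_lo > 5 and y_hi - c_lo <= -5):
        return "outside"
    return "mixed"

def bounds(box, depth):
    """Return (lower, upper) bounds on the mass of phi within the box."""
    kind = classify(box)
    if kind == "inside":
        mass = box_mass(box)
        return mass, mass
    if kind == "outside":
        return 0.0, 0.0
    if depth == 0:
        return 0.0, box_mass(box)      # undecided: counts only toward the upper bound
    # Bisect the wider of the colRank and yExp intervals.
    e, (c_lo, c_hi), (y_lo, y_hi) = box
    if c_hi - c_lo >= y_hi - y_lo:
        mid = (c_lo + c_hi) / 2
        halves = [(e, (c_lo, mid), (y_lo, y_hi)), (e, (mid, c_hi), (y_lo, y_hi))]
    else:
        mid = (y_lo + y_hi) / 2
        halves = [(e, (c_lo, c_hi), (y_lo, mid)), (e, (c_lo, c_hi), (mid, y_hi))]
    lower = upper = 0.0
    for half in halves:
        lo, hi = bounds(half, depth - 1)
        lower += lo
        upper += hi
    return lower, upper

# Non-minority slab; colRank and yExp are truncated to mean +/- 8 sigma.
non_min = (-INF, 10.0)
window = (non_min, (-55.0, 105.0), (-30.0, 50.0))
# Mass of the slab outside the window: undecided, so it is added to the upper bound.
tail = max(0.0, box_mass((non_min, (-INF, INF), (-INF, INF))) - box_mass(window))

lower, upper = bounds(window, depth=20)
upper += tail
print(f"{lower:.4f} <= Pr[hire and not min] <= {upper:.4f}")
```

Every box classified as inside contributes to the lower bound, and every undecided box only to the upper bound, so the two bounds tighten as the recursion depth grows. An SMT solver plays the role of `classify` in the general case, where the formula is not known in closed form.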

Fairness certificates

The fairness verification tool terminates when it has computed lower/upper bounds that prove or disprove the desired fairness criterion. The hyperrectangles sampled in the process of computing volumes can serve as proof certificates. That is, an external entity can take the hyperrectangles, compute their volumes, and ensure that they indeed lie in the expected regions in $\mathbb{R}^3$.
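A certificate checker along these lines could be quite small. The sketch below assumes a certificate is a list of boxes claimed to lie inside the region for $\Pr[\mathit{hire} \wedge \neg\mathit{min}]$, which we take to be (e <= 10) and (c <= 5 or y - c > -5), together with a claimed lower bound. The certificate format, the function names, and the two example boxes are hypothetical, and the containment test is a simple sufficient check rather than a call to a solver.

```python
import math

INF = float("inf")

def gauss_cdf(x, mu, sigma):
    """CDF of a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def box_mass(box):
    """Weighted volume of a box: a product of per-dimension CDF differences."""
    (e_lo, e_hi), (c_lo, c_hi), (y_lo, y_hi) = box
    return ((gauss_cdf(e_hi, 0, 10) - gauss_cdf(e_lo, 0, 10))
            * (gauss_cdf(c_hi, 25, 10) - gauss_cdf(c_lo, 25, 10))
            * (gauss_cdf(y_hi, 10, 5) - gauss_cdf(y_lo, 10, 5)))

def well_formed(box):
    """Every interval must have a lower end below its upper end."""
    return all(lo < hi for lo, hi in box)

def inside_phi(box):
    """Sufficient check that the whole box satisfies the assumed formula."""
    (_, e_hi), (_, c_hi), (y_lo, _) = box
    return e_hi <= 10 and (c_hi <= 5 or y_lo - c_hi > -5)

def overlap(a, b):
    """True if boxes a and b share a region of positive volume."""
    return all(max(lo1, lo2) < min(hi1, hi2)
               for (lo1, hi1), (lo2, hi2) in zip(a, b))

def check_certificate(boxes, claimed_lower_bound):
    """Accept only if the boxes are valid, disjoint, inside phi, and heavy enough."""
    if not all(well_formed(b) and inside_phi(b) for b in boxes):
        return False
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlap(boxes[i], boxes[j]):
                return False
    return sum(box_mass(b) for b in boxes) >= claimed_lower_bound

# Hypothetical certificate with two boxes (ethnicity, colRank, yExp intervals).
certificate = [
    ((-INF, 10.0), (-INF, 5.0), (-INF, INF)),   # top-5 college, any experience
    ((-INF, 10.0), (5.0, 15.0), (10.5, INF)),   # yExp - colRank > -5 throughout
]
print(check_certificate(certificate, 0.07))
print(check_certificate(certificate + [certificate[0]], 0.01))  # duplicate box
```

The checker never needs to know how the boxes were found, which is what makes them usable as certificates by a third party.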

4 Experience and Future Outlook


We have built a fairness-verification tool, called FairSquare, that takes a decision-making program and a population model and verifies fairness of the program with respect to the model. So far, we have focused on group fairness. The tool uses the popular Z3 SMT solver [de2008z3] for manipulating first-order formulas over arithmetic theories.

We have used FairSquare to prove or disprove fairness of a suite of population models and programs representing machine-learning classifiers that were automatically generated from real-world datasets used in other work on algorithmic fairness [feldman15; zemel13; datta2016algorithmic]. Specifically, we have considered linear SVMs, simple neural networks with rectified linear units, and decision trees.

Future outlook

Looking forward, we see a wide range of avenues for improvement and exploration. For instance, we are currently working on the problem of making an unfair program fair. That is, given a program that is considered unfair, what is the smallest tweak that would make it fair? Our goal is to repair the program, making it fair, while ensuring that it remains semantically close to the original program.