source software is a backbone of reproducible research, especially when considering the nature of artificial intelligence (AI) and machine learning (ML) algorithms where sometimes changing the seed of a random number generator can cause a state-of-the-art solution to become a sub-par one. Despite the effort to release code alongside publications, both fields are struggling with a reproducibility crisis (hutson2018artificial). This may be due to poor reporting, the desire to keep trade secrets or simply aiming to keep an edge over competitors. One way to tackle this problem is to promote publishing high-quality software used for scientific experiments under an open source licence or simply require it as a part of the publishing and peer-review process, which has been advocated for a long time (sonnenburg2007need). Despite their importance, implementations are commonly treated just as a research by-product and are often abandoned after publishing a work based on them. We call this phenomenon paperware – a piece of software whose main purpose is to see a paper towards publication rather than to implement any particular concept with thorough software engineering practice. Alternatively, implementations are provided as standalone packages that often do not follow the best software engineering practices and hence can prove difficult for a wider community to use due to lack of documentation and maintenance, which impacts their usability and reproducibility in general.
Some researchers have realised the widespread reliability issues of machine learning systems and proposed unified frameworks to assess and document them. For example, multiple researchers have proposed approaches to document data sets (gebru2018datasheets; holland2018dataset) to ensure their high quality and reliability. A similar approach has been taken towards machine learning systems that are offered as services (hind2018increasing) accessible through an Application Programming Interface (API). Such efforts are laudable, however they suffer from limited scope and a labour-intensive creation process, which may slow down the ML research and development cycle. Furthermore, self-reporting – they are not audited – means that some of their aspects may be subjective, hence may not reflect the true behaviour of the system, whether done intentionally or not. Certification, on the other hand, creates a need for external bodies, which seems impossible to achieve for all the machine learning systems that are somehow influencing human lives.
To help mitigate such undesired practices in the field of AI and ML fairness, accountability and transparency (FAT), we developed an open source Python package called FAT Forensics. It is designed as an interoperable framework for implementing, testing and deploying novel algorithms invented by the FAT research community and facilitate their evaluation and comparison against the state-of-the-art ones, therefore democratising access to these techniques. In addition to supporting research in this space, the toolbox is capable of analysing all artefacts of the machine learning process – data, models and predictions – by considering their fairness, accountability (robustness, security, safety and privacy) and transparency (interpretability and explainability). The common interface layer (see Section 2.1 for more details) of the toolbox supports several “modes of operation”. A research mode (data in – visualisation out), where the toolbox can be loaded into an interactive Python session (e.g., a Jupyter Notebook), supports prototyping and exploratory analysis. This mode is intended for FAT researchers who could use it to propose new fairness metrics, compare them with the existing ones or use them to inspect a new system or a data set. The second one is a deployment mode (data in – data out) where it can be used as a part of a data processing pipeline to provide a (numerical) FAT analytics, hence support any kind of automated reporting or dashboarding. This mode is intended for ML engineers who may use it to monitor or evaluate an ML system during development and deployment.
Our main contribution is the design and implementation of a software package that collates fairness, accountability and transparency algorithms for the entire predictive pipeline: data (raw and features), models and predictions. The package is supported by thorough and beginner-friendly documentation, which includes tutorials, examples, how-to guides and a user guide. The toolbox is flexible enough to support the work of researchers and practitioners alike since it has been designed with research and deployment modes in mind. We hope that our package will be adopted by the FAT community, who will contribute their approaches here instead of releasing them as standalone code, given the firm foundation of FAT Forensics. Another, minor, contribution of our work is a modular implementation of local surrogate explanations, discussed in Section 3.3, which shows the breadth of transparency algorithms that can be built with our toolbox by simply combining components it implements.
In the following section we introduce our tool, describe its architecture and list algorithms its first public release implements. Next, in Section 3, we describe a number of possible use cases and benefits of having FAT algorithms under a common roof. Then, in Section 4, we survey the landscape of relevant commercial and freely available FAT software that can be used to assess security, privacy, fairness, interpretability and explainability of data processing pipelines, namely: (raw) data, their features, predictive models and algorithmic decisions. In the final section, we conclude this paper with a discussion and envisaged software development time-line.
2. Inspecting FAT of AI and ML Systems
Depending on the maturity of a research field, proposing, adopting and developing a common software infrastructure for implementing novel algorithms and comparing them with others may be difficult. Relatively young and still developing fields, such as FAT of predictive systems, usually lack this type of software solution. Within the past few years we have seen an increasing number of novel FAT algorithms implemented in different programming languages, each one with different requirements and API, making them difficult to use and compare in a systematic way. To address this issue, while the community is still young and flexible enough to adopt it, we developed an open source Python framework for evaluating, comparing and deploying FAT algorithms. We chose Python – compatible with Python 3.5 and higher – because of its popularity in AI and ML research communities and its overall simplicity. We opted for a minimal required dependency on NumPy (1.10.0 or higher) and SciPy (0.13.3 or higher) to facilitate easy deployment in a variety of systems. An optional dependency on Matplotlib (3.0.0 or higher) enables access to basic visualisations. The toolbox is hosted on GitHub (https://github.com/fat-forensics/fat-forensics) to encourage community contributions and it is released under the 3-clause BSD License, which opens it up for commercial applications. To encourage long-term sustainability it has been developed employing the best software engineering practices such as:
unit/integration testing (with perfect code coverage),
detailed technical API documentation that includes function-level, module-level and functional documentation,
tutorials that walk the user step-by-step through the main functionality of the package, and
code examples that can be used as a reference material for more advanced users.
The toolbox collates many state-of-the-art FAT algorithm implementations – with many more to come – and provides a coherent API to make them accessible to the community. Implementing fairness, accountability and transparency algorithms under a common roof allows their components to be reused across FAT implementations. For example, grouping data based on values of a selected feature can be used both for evaluating group-based fairness (disparate impact) such as demographic parity (hardt2016equality) and for uncovering systematic performance bias of a predictive model. The initial development is heavily focused on tabular data and “classic” predictive algorithms. Once a certain level of maturity is reached the development will move towards techniques capable of handling sensory data (images, audio) and neural networks (TensorFlow, PyTorch). We envisage that relevant software packages that are already well established in the FAT community and that adhere to the best software engineering practices can be “wrapped” by our toolbox under a common API to avoid re-implementing them. We will also encourage researchers and practitioners alike to contribute their novel approaches to our software package or make them compatible with it, therefore exposing them to the wider community in a controlled and sustainable environment, hence improving reproducibility in the FAT field.
We envisage two main application areas of our toolbox. The first one is directed toward FAT research communities in ML and AI. We provide them with a platform to develop, test, compare and evaluate their novel algorithms without the burden of setting up a software engineering work-flow (see Section 3 for an example of this application). This in turn will ensure that our framework contains (or is compatible with) implementations of cutting-edge algorithms, which will encourage its use for auditing ML systems – our second intended audience. Our package should also appeal to the latter group since its members can access a low-level API that can be used for FAT reporting and certification (see Section 3 for examples that could contribute to them). Both these application areas give ML researchers and practitioners a tool to inspect quality and security of their systems in a transparent and reproducible manner.
2.1. Design and Architecture
Since most of the FAT software is developed with the intention to highlight research outputs, this often results in unnecessary dependencies, data sets and interactive visualisations being distributed with the code-base, which itself uses non-standardised API (Figure 1).
To mitigate these issues, FAT Forensics decouples the core functionality of a FAT algorithm from its possible presentation to the user (e.g., visualisation) and dependencies that may be used for experiments (e.g., particular data sets or predictive algorithms) – see Figure 2. Since visualisations are a vital part of the first application mode that we advocate – research mode – we provide a basic visualisation module within the package; however, its functionality is conditioned on an optional Matplotlib software dependency. This FAT software infrastructure generalisation is achieved by making minimal assumptions about these operational settings, therefore providing a common interface layer for key FAT concepts, focusing on the interactions between data, models, predictions and people. A predictive model is assumed to be a plain Python object that has fit, predict and, optionally, predict_proba methods, therefore making it compatible with the most popular Python machine learning toolbox scikit-learn without introducing additional dependencies. This also means that our package can easily support arbitrary “black-box” predictive models, e.g., TensorFlow, PyTorch or even ones hosted on the Internet and accessible via a web API, by coding them as a Python object with appropriate methods. Furthermore, model-specific transparency (as well as fairness and accountability) approaches for “glass-box” predictive models (decision trees, linear models, etc.) implemented by standard machine learning libraries will be incorporated into the package over time to improve its versatility. A data set is assumed to be a two-dimensional NumPy array: either a classic or a structured array, with the latter being a welcome addition given that some of the features may be categorical (string-based).
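The interface assumptions above can be illustrated with a minimal sketch. The class name `MajorityClassModel`, the toy records and the feature names are hypothetical and are not part of FAT Forensics; the only point is that a plain object with fit, predict and predict_proba methods, paired with a NumPy structured array, is enough to satisfy the stated contract.

```python
import numpy as np


class MajorityClassModel:
    """Hypothetical toy model exposing fit, predict and predict_proba."""

    def fit(self, X, y):
        # Remember the classes and their relative frequencies.
        self.classes_, counts = np.unique(y, return_counts=True)
        self.priors_ = counts / counts.sum()
        return self

    def predict(self, X):
        # Always predict the most frequent training class.
        majority = self.classes_[np.argmax(self.priors_)]
        return np.full(len(X), majority)

    def predict_proba(self, X):
        # Return the class priors for every row.
        return np.tile(self.priors_, (len(X), 1))


# A structured array keeps numerical and string-based features together,
# with one record per data point (made-up values).
data = np.array(
    [(39, 'Private'), (50, 'Self-emp'), (38, 'Private')],
    dtype=[('age', 'i4'), ('workclass', 'U10')])
labels = np.array(['<=50K', '<=50K', '>50K'])

model = MajorityClassModel().fit(data, labels)
predictions = model.predict(data)
probabilities = model.predict_proba(data)
```

A model served over a web API could be wrapped in the same way, with predict issuing the remote call instead of reading stored priors.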
| | Fairness | Accountability | Transparency |
|---|---|---|---|
| Data & Features | Systemic Bias (disparate treatment labelling). Sub-population Representation (sample size disparity and class imbalance). | Sampling Bias. Data Density Checker. | Data Description and Summary Statistics (e.g., imbalanced classes and [per-class] feature distribution). |
| Models | Group-based (sub-population) Fairness Metrics (disparate impact). | Group-based Performance Metrics (e.g., systematic performance error). | Global Surrogates (bLIMEy). Individual Conditional Expectation. Partial Dependence. |
| Predictions | Counterfactual Fairness (disparate treatment). | Prediction Confidence (via training data density estimation). | Model-agnostic Counterfactuals. Local Surrogates (bLIMEy). LIME. |
In addition to relaxed input requirements, all of the techniques incorporated into the package are decomposed into atomic components that later can be reused to create new functionality. The FAT methods implemented in the initial release of the package are shown in Table 1. To ground the idea of atomic-level decomposition and show their re-usability, even across the FAT borders, we give three examples.
Sample-size disparity, sub-population fairness (e.g., group unaware, equal opportunity, equal accuracy, demographic parity (hardt2016equality)), sub-population predictive performance disparity and summary statistics can all be based on a function that partitions a data set with respect to a chosen feature, which is implemented as one of the core components of the package. This grouping can then be coupled with any standard performance metric to achieve a group-based fairness metric. With the addition of a module that fits a threshold for different groups (given data points ranking) a variety of different fairness criteria, not limited to the ones implemented in the package itself, can be derived – the user just needs to provide a function that measures some sort of performance with predicted and true labels as the only input. Additionally, the grouping functionality can help the user to evaluate the predictive performance, the number of data points and the feature distribution across different (maybe underrepresented) sub-populations – if there is only a small number of samples for some sub-population, it will most likely face bigger predictive errors.
Estimating density (based on training data) of a region in which a data point of interest lies can provide important clues about the robustness of its prediction. To this end, a density score can be treated as a proxy measurement of the confidence of a predictive model (perello2016background). In addition to engendering trust in its predictions, a density estimate can help to compute realistic counterfactual explanations of selected data points. While computing and ranking possible counterfactual explanations a scoring function can discount the ones that lie in a low density region (with respect to the training data distribution), as such counterfactual data points will usually be impossible to achieve in real life. An example of such an undesired explanation can be a person who is 200 years old or a male who gave birth to 3 children.
A black-box counterfactual explainer can be used to generate explicit (of a selected class) or implicit (of any class other than the one of the given instance) counterfactual, i.e., what-if, explanations. By restricting the set of features that a counterfactual explanation can be conditioned on (choosing protected features in this instance), a counterfactual explanation can be used as a disparate treatment measure of individual fairness. Another possible use case of a counterfactual explainer is discovering possible feature variations of a given data point that do not affect its prediction, i.e., counterfactuals of the same class.
Surrogate model (craven1996extracting) explanations (popularised in recent years by their implementation called LIME (lime)) also exhibit a high level of modularity. bLIMEy (build LIME yourself) – our approach to modular surrogate explanations – is composed of the following atomic steps, all of which are part of the FAT Forensics package. (The article describing modular and customisable surrogate explanations, which we call bLIMEy, is currently under review.)
a) feature transformation/extraction – creating a human-understandable representation of the input space (used only when the original feature space is not human-interpretable);
b) data augmentation – sampling new data points in a (local) region of interest;
c) labels generation – predicting the labels of the sampled data with the original predictive model;
d) [optional] feature selection and proximity weighting of the sampled data points – introducing sparsity to the explanations and controlling the locality of the explanation;
e) surrogate model training; and
f) surrogate model (or its predictions) explanation.
An example realisation of this process for human-understandable tabular data, which eliminates the need for step a), can be:
augmenting data with MixUp (zhang2018mixup), which guarantees a local sample that includes instances of opposite classes, therefore eliminating the need for step d);
predicting the labels of the sampled data with the original model;
training a decision tree in the vicinity of a chosen data point; and
explaining its predictions with a root-to-leaf path extracted from the surrogate tree and the region around the selected data point with a feature importance measure extracted from the surrogate tree.
The major difference between global (population-based) and local surrogates is the constraint on the region that is used for sampling (and/or weighting) new data points.
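The atomic steps listed above can be sketched end to end with NumPy alone. This is a minimal illustration under stated assumptions, not the bLIMEy implementation: it swaps in Gaussian sampling for MixUp and a closed-form weighted ridge regression for the decision tree, and the `black_box` function, the function name `local_linear_surrogate` and all parameter values are hypothetical.

```python
import numpy as np


def black_box(X):
    """Hypothetical black box returning the probability of class 1."""
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] - 1.0 * X[:, 1])))


def local_linear_surrogate(instance, predict_fn, n_samples=500,
                           scale=0.5, kernel_width=0.75, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Data augmentation: Gaussian sample centred on the instance.
    sample = instance + rng.normal(0.0, scale, size=(n_samples, instance.size))
    # Label generation with the original model.
    target = predict_fn(sample)
    # Proximity weighting with an exponential kernel (an assumed choice).
    distance = np.linalg.norm(sample - instance, axis=1)
    weights = np.exp(-(distance ** 2) / kernel_width ** 2)
    # Surrogate training: weighted ridge regression in closed form.
    design = np.hstack([np.ones((n_samples, 1)), sample - instance])
    penalty = alpha * np.eye(design.shape[1])
    penalty[0, 0] = 0.0  # do not penalise the intercept
    weighted = design * weights[:, np.newaxis]
    coefficients = np.linalg.solve(design.T @ weighted + penalty,
                                   weighted.T @ target)
    # Explanation: intercept and one weight per feature.
    return coefficients[0], coefficients[1:]


instance = np.array([0.2, 0.1])
intercept, feature_weights = local_linear_surrogate(instance, black_box)
```

Widening `scale` and `kernel_width` moves the same sketch from a local surrogate towards a global one, since only the sampling and weighting region changes.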
Sharing a common functional base between algorithmic implementations of fairness, accountability and transparency tools is one of many advantages of a combined FAT software package. This versatility of the toolbox makes it more appealing to academics and industrial researchers as it allows them to investigate all social aspects of a whole predictive pipeline: data, models and predictions. This in turn will encourage them to contribute their own algorithms and bug fixes back to the package, as doing so is in their best interest. Furthermore, having a software package whose ownership lies outside of a single lab, company or research group ensures its longevity – the contributors are not limited to the package creators and designers of the algorithms implemented therein – and the tools are not biased towards implementations originating from a single group. With all of that in mind, the development of such a package becomes a community effort driving it towards a common goal.
Since the contributions to the package will go through a community review process before becoming part of it, we can easily avoid common pitfalls – such as undesired software engineering practices and spurious (and often unnecessary and difficult to manage) dependencies – that academic software is particularly vulnerable to. It will also help to gear the package towards real-world use cases as opposed to a means of “proving” reproducibility of published research. Having said that, we do not aim to just wrap all of the relevant packages under a common API. If at all, we will only do that for good-quality code to avoid perpetuating issues of these packages. Microsoft’s Interpret and Oracle’s Skater, for example, mainly serve as wrappers for a wide range of explainability packages, hence risking users’ trust as they are prone to errors introduced therein. LIME (lime), which is part of both these packages, has recently been shown to have issues with locality of its explanations (laugel2018defining), which affects both Interpret and Skater. Therefore, in the long term we want to re-implement necessary algorithms from the ground up, which should be possible given the common functional base of the package. In doing so we will be able to enforce high-quality code that is easy to manage and maintain since it is fully under our control.
The major development challenge of the package was not producing the code itself but coming up with the infrastructure (package structure design, versatility, testing, informative error raising and input validation) and the documentation surrounding it. Usually, the main barrier to understanding and adopting a software package, especially for a lay audience, is the lack of appropriate documentation. Many tools in the FAT space are supported by just two types of documentation: a technical API documentation, which is only suitable for (proficient) users who are already familiar with the package and its structure, and code examples, often presented as Jupyter Notebooks, which drop a potential new user into deep water instead of easing them in, therefore discouraging further exploration of a package. These are the two most popular approaches since they usually do not require extra effort: the first one can be generated automatically from the source code and the latter one is usually an artefact of research experiments, hence neither of them is designed with an end user in mind. FAT Forensics mitigates these issues and evens out the learning curve by basing its documentation on four main pillars, which together build up the user confidence in using the package:
narrative-driven tutorials designated for new users, which will guide them step by step through practical use cases of all the main aspects of the package;
how-to guides created for relatively new users of the package, which will showcase the flexibility of the package and show how to use it to solve user-specific FAT challenges, e.g., how to build your own local surrogate model explainer by pairing a data generator and a local glass-box model;
the API documentation describing functional aspects of the algorithms implemented in the package designated for a technical audience as a reference material and complemented by task-focused code examples that put the functions in a context;
the user guide discussing theoretical aspects of the algorithms implemented in the package such as their restrictions, caveats, computational time and memory complexity, among others.
3. FAT Forensics Use Cases
To show how FAT Forensics can be used on real data to analyse their fairness, accountability and transparency, and to demonstrate how the common infrastructure of the package facilitates its broad functionality, we present three distinct use cases. To this end, we use the UCI Census Income (Adult) data set (http://archive.ics.uci.edu/ml/datasets/Census+Income) – a commonly used data set in algorithmic fairness research. The analysis of the Adult data set presented in this section is heavily inspired by the content of the tutorials, which constitute a vital part of the FAT Forensics documentation (https://fat-forensics.org/tutorials/index.html). Results presented in this section can be recreated with a Jupyter Notebook distributed alongside this manuscript (https://nbviewer.jupyter.org/urls/dl.dropbox.com/s/z5n2pn3fvlif6jg/FAT_Forensics.ipynb). All of the examples included below are representative of the FAT Forensics research mode. To demonstrate the deployment mode we present an interactive dashboard built using Plotly’s Dash, which facilitates interactive analysis of the same data set using FAT Forensics as a back-end and is hosted on the Internet (https://fatf.herokuapp.com; please allow 15–30 seconds for the server to start before the web application is loaded).
3.1. Grouping for FAT
One of the basic building blocks of FAT Forensics is grouping data based on selected (sets of) unique feature values for categorical features and threshold-based binning for numerical features. This algorithmic concept proves to be useful for fairness, accountability and transparency applications. Below, we present its three possible applications in the FAT Forensics package.
3.1.1. Grouping for Data Transparency
When analysing a data set prior to any sort of modelling it is usually advised to inspect the ground truth distribution to uncover whether the target classes are balanced. While this is in itself a very important aspect of a data modelling pipeline, asking the same question for each protected group – a sub-population in a data set derived by conditioning on unique values of a feature that can be used for discriminatory treatment, e.g., gender or ethnic group – can help to prevent model biases and systematic under-performance. With FAT Forensics it is easy to inspect the class distribution for each protected sub-population, for example, “race” in the Adult data set – see Figure 3. This figure shows that while the classes are imbalanced for all of the races, the strongest disproportion can be observed for the “Black” and “Amer-Indian-Eskimo” races.
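The per-group class distribution check described above can be sketched as follows. The helper names `group_by_feature` and `class_distribution` are hypothetical and do not mirror the FAT Forensics API, and the records are made up rather than taken from the Adult data set.

```python
import numpy as np


def group_by_feature(data, feature):
    """Map every unique value of a categorical feature to its row indices."""
    column = data[feature]
    return {str(value): np.flatnonzero(column == value)
            for value in np.unique(column)}


def class_distribution(labels, indices):
    """Return the share of every class among the selected rows."""
    classes, counts = np.unique(labels[indices], return_counts=True)
    return dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))


# Made-up records stored in a structured array.
data = np.array(
    [('A', 25), ('A', 40), ('A', 31), ('B', 52), ('B', 47)],
    dtype=[('group', 'U1'), ('age', 'i4')])
labels = np.array(['<=50K', '<=50K', '>50K', '>50K', '>50K'])

groups = group_by_feature(data, 'group')
distribution = {value: class_distribution(labels, rows)
                for value, rows in groups.items()}
```

The same row indices can be reused for any other per-group statistic, such as the group size or a feature histogram.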
3.1.2. Grouping for Model Fairness
Grouping can also be used to investigate (pairwise) group-based fairness metrics to identify a model’s disparate impact. Since some of these metrics are known to be mutually incompatible (miconi2017note), it is usually a good idea to compare them side by side – see Figure 4. We can easily see that there is a fairness disparity for the “Asian-Pac-Islander” group and the “Other” group when we use equal accuracy and demographic parity metrics. In addition, according to the demographic parity metric, the “Other” and “White” groups are also treated unfairly with respect to each other. Interestingly, the equal opportunity metric does not show any signs of disparate impact for any pair of the protected sub-populations.
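To make the side-by-side comparison concrete, the sketch below computes three per-group rates and flags pairwise disparities. It assumes the common readings of the metrics (equal accuracy compares accuracy, demographic parity compares the positive prediction rate, equal opportunity compares the true positive rate); the function names, the toy labels and the 0.2 tolerance are hypothetical choices, not values used in the paper.

```python
import itertools
import numpy as np


def group_rates(y_true, y_pred, groups):
    """Compute per-group accuracy, positive rate and true positive rate."""
    rates = {}
    for value in np.unique(groups):
        mask = groups == value
        positives = mask & (y_true == 1)
        rates[str(value)] = {
            'accuracy': float(np.mean(y_pred[mask] == y_true[mask])),
            'positive_rate': float(np.mean(y_pred[mask] == 1)),
            'true_positive_rate': float(np.mean(y_pred[positives] == 1)),
        }
    return rates


def pairwise_disparity(rates, metric, tolerance=0.2):
    """Flag group pairs whose metric values differ by more than the tolerance."""
    return {
        (a, b): abs(rates[a][metric] - rates[b][metric]) > tolerance
        for a, b in itertools.combinations(sorted(rates), 2)
    }


# Made-up labels and predictions for two groups.
groups = np.array(['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'])
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])

rates = group_rates(y_true, y_pred, groups)
equal_accuracy = pairwise_disparity(rates, 'accuracy')
demographic_parity = pairwise_disparity(rates, 'positive_rate')
equal_opportunity = pairwise_disparity(rates, 'true_positive_rate')
```

In this toy data both groups have the same accuracy, yet their positive prediction rates and true positive rates differ, which is the kind of disagreement between metrics that motivates inspecting them together.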
3.1.3. Grouping for Model Performance Disparity
Grouping can also be used to inspect systematic bias of a predictive model, i.e., whether a predictive model under-performs for any of the sub-populations in our data set. For this experiment, we will again use the “race” feature for which we will investigate two performance metrics: accuracy and true negative rate. Unsurprisingly, for the first metric we get the same results as when analysing equal accuracy of group-based fairness – Figure 5. Analysing the true negative rate, on the other hand, reveals that 4 different pairs of “race”-based sub-populations exhibit significant performance differences, with the “Other” race group suffering from the worst pairwise performance disparity against all other “race”s except “Amer-Indian-Eskimo”.
3.2. Data Density for Robustness and Feasible Counterfactuals
FAT Forensics can also help with investigating robustness of a predictive model and assessing the “usefulness” of explanations. When using counterfactuals for “useful” explanations, two possible applications come to mind: providing data point-specific explanations and assessing individual fairness.
3.2.1. Prediction Robustness
FAT Forensics comes with neighbour-based density estimation. This estimate can be used to validate robustness of a prediction as dense regions in the training data should translate into accurate predictive modelling in this region. To see how this could be used, we estimate the density from the first 10,000 data points and check which elements of the data set have a density estimate of more than 0.5. (The density score – as computed by FAT Forensics’ bespoke density estimator – is between and , where high values indicate that a data point lies in a relatively sparse region since its th neighbour – a parameter to be set by the user – is relatively far away with this distance being proportional to the density score.) With this setting we identify 2 sparse data points (with and density scores) with one of them ( with “>50K” ground truth value) being misclassified by our model. Upon closer inspection we notice that this data point has quite a high () value of the “fnlwgt” feature, which is in the th percentile of the data set – a clue to its high density score.
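The idea of a neighbour-based score in which high values mark sparse regions can be sketched as below. This is not FAT Forensics’ bespoke estimator: the min-max scaling to the unit interval, the function names and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def kth_neighbour_distances(X, k):
    """Distance from every row of X to its k-th nearest other row."""
    differences = X[:, np.newaxis, :] - X[np.newaxis, :, :]
    distances = np.sqrt((differences ** 2).sum(axis=2))
    # Column 0 of each sorted row is the zero distance to the point itself.
    return np.sort(distances, axis=1)[:, k]


def density_scores(X, k=2):
    """Min-max scale the k-th neighbour distances so 1 is the sparsest point."""
    kth = kth_neighbour_distances(X, k)
    return (kth - kth.min()) / (kth.max() - kth.min())


# Synthetic data: a tight cluster and a single far-away point.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.1, size=(50, 2))
outlier = np.array([[5.0, 5.0]])
X = np.vstack([cluster, outlier])

scores = density_scores(X, k=2)
sparse_points = np.flatnonzero(scores > 0.5)
```

The pairwise distance matrix grows quadratically with the number of rows, so a sketch like this suits small samples only.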
3.2.2. Counterfactual Explanation Feasibility
A similar approach can be taken when evaluating “usefulness” of counterfactual explanations. If a counterfactual data point, which serves as an explanation, has a high density score with respect to the training data, it may be an indication that such a data point is not possible in the real life. For example, imagine a counterfactual explanation where the foil states that the age of a person would have to be 155 or a man should have given birth to at least 3 children. Counterfactually explaining the data point with a high density score from the previous section yielded multiple different explanations with the more interesting ones being:
Had this person had “capital-gain” instead of , this person would have been predicted as “>50K”. (Density score: .)
Had this person had “capital-loss” instead of and “fnlwgt” of instead of , this person would have been predicted as “>50K”. (Density score: .)
of “capital-gain” makes sense to be classified as a high-income person, however given the unusual value of the “fnlwgt” feature it does not make it a common data point. The second counterfactual, on the other hand, significantly decreases the value of the “fnlwgt” feature – therefore moving it to a dense region – and also shows that even withof “capital-loss” this person would still be classified as a high-income individual casting even more suspicion on the unusual, original value of the former feature.
3.2.3. Counterfactual Fairness
Counterfactual explanations can also be used to inspect individual fairness by forcing their foils to include at least one protected attribute change. Doing so for the same sparse data point shows us that the decision for this data point is fair as our counterfactual explainer could not identify any explanation that is conditioned on any one of the protected features.
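The restriction to protected features can be illustrated with a brute-force search. This is a minimal sketch under stated assumptions rather than the explainer shipped with the package: the `predict` function is a hypothetical black box, feature 2 stands in for a protected attribute, and the candidate values are made up.

```python
import itertools
import numpy as np


def predict(X):
    """Hypothetical black box: class 1 when feature 0 plus feature 1 exceeds 10."""
    X = np.asarray(X, dtype=float)
    return (X[:, 0] + X[:, 1] > 10).astype(int)


def counterfactuals(instance, predict_fn, candidates, allowed):
    """Try every combination of candidate values for the allowed features.

    Returns the modified instances whose predicted class differs from the
    class of the original instance (implicit counterfactuals).
    """
    instance = np.asarray(instance, dtype=float)
    original_class = predict_fn(instance[np.newaxis, :])[0]
    found = []
    grids = [candidates[index] for index in allowed]
    for values in itertools.product(*grids):
        modified = instance.copy()
        modified[list(allowed)] = values
        if predict_fn(modified[np.newaxis, :])[0] != original_class:
            found.append(modified)
    return found


instance = [4.0, 3.0, 1.0]
candidates = {0: [4.0, 8.0], 1: [3.0, 7.0], 2: [0.0, 1.0]}

# Unrestricted search over the two non-protected features.
unrestricted = counterfactuals(instance, predict, candidates, allowed=(0, 1))
# Search restricted to feature 2, which stands in for a protected attribute.
protected_only = counterfactuals(instance, predict, candidates, allowed=(2,))
```

An empty `protected_only` list means that no change of the protected feature alone flips the prediction within the searched values, which is the sense in which the decision for a data point is called fair above.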
3.3. Local Surrogate Explanations
Explaining predictions of a black-box model using local surrogates has been popularised by an approach called LIME (Local Interpretable Model-agnostic Explanations) (lime). LIME builds a local, sparse linear model in the neighbourhood of the data point that the user wants to explain to approximate the local decision boundary of a global, more complex predictive model. Given the modularity of local surrogate explanations, our package allows the user to construct a custom explainer by putting all the components together and having complete control over the process. Depending on the use case, one local model may have an advantage over another; hence, opening up the modification of this process to the user can yield significant improvements in the quality of an explanation. We support this claim with a visualisation of the local decision approximation for the two moons data set with LIME (to be more precise: replicating LIME with bLIMEy using a local, linear, ridge regression model) and decision tree-based bLIMEy. (Since, by default, LIME computes an interpretable representation of the data being explained – feature binning and discretisation – to improve the readability of explanations, visualising the local surrogate in the original feature space is relatively difficult. We simplify this process by skipping the step responsible for creating the interpretable data representation. This is possible in this particular case since the data set is two-dimensional and we do not need to reduce its dimensionality to convey the explanation to the user.)
3.3.1. Linear Surrogate
Figure 6 shows an example surrogate explanation of the marked data point (the black dot) with a linear model. Even though the decision boundary can be easily approximated with a linear model – an almost vertical line crossing the x-axis around 0.25 – the actual decision boundary is tilted because of the data distribution. This can be averted by weighting the data points in the neighbourhood – but finding an approach to generate these weights that would generalise well is a difficult task.
3.3.2. Tree-based Surrogate
A better local approximation of the decision boundary can be achieved with a tree-based model – one of the ways in which bLIMEy improves on LIME. Figure 7 shows the improvement of this approximation achieved by using a decision tree-based surrogate. This local model precisely cuts off the left arm of the blue half-moon, therefore providing a good approximation of the global model in the neighbourhood of the selected data point (the black dot). Furthermore, as opposed to a feature weight-based explanation provided by a linear surrogate, here, we are given logical conditions on the feature values describing the local decision approximation. Since the tree generates decision rules based on feature splits, we get data discretisation and binning for free, which proves to be very useful in high dimensions where we are unable to visualise these results.
4. Related Work
In this section we discuss approaches to systematic evaluation and comparison of AI and ML solutions across different research communities. We also review software packages and reporting approaches available to people seeking to assess FAT of these systems.
4.1. AI Community Effort
In well-established research communities, e.g., supervised learning or reinforcement learning, a consensus among researchers is emerging: each community is converging towards using a common performance metric or an evaluation software framework. For predictive performance of supervised learning algorithms these can be, for example, accuracy, F1 score or AUC, which are a compulsory component of any such software framework – cf. scikit-learn’s sklearn.metrics (footnote 8: https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) (sklearn_api) and TensorFlow’s tf.metrics (footnote 9: https://www.tensorflow.org/api_docs/python/tf/metrics) modules (tensorflow2015-whitepaper). Given the independence of these metrics from the underlying ML algorithm implementation, some software – e.g., PyCM (footnote 10: https://github.com/sepandhaghighi/pycm) (haghighi2018) – is focused entirely on calculating them. In other research environments, e.g., reinforcement learning, there are common software platforms used to systematically compare novel approaches, hence making research results easier to reproduce and compare. Examples of these are Project Malmo (footnote 11: https://www.microsoft.com/en-us/research/project/project-malmo/) (johnson2016malmo) and OpenAI Gym (footnote 12: https://gym.openai.com/) (1606.01540) (which includes the MuJoCo environment (todorov2012mujoco)). Alternatively, projects such as cookiecutter (footnote 13: https://github.com/audreyr/cookiecutter) allow researchers to create software packages that follow a common structure, hence making them easier to execute and integrate. With all these packages available it is clear that a common software platform for inspecting FAT aspects of ML and AI systems (data sets, predictive models and their predictions) would be a welcome addition.
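As a brief illustration of such shared metrics, the sketch below computes accuracy, F1 score and AUC with scikit-learn’s sklearn.metrics module. The labels and scores are invented toy values, and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy ground truth and classifier scores, invented for illustration.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.6, 0.7, 0.2]
# Threshold the scores at 0.5 to obtain crisp class predictions.
y_pred = [int(score >= 0.5) for score in y_score]

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
# AUC is computed from the scores, not from the thresholded predictions.
auc = roc_auc_score(y_true, y_score)
print(f'accuracy={accuracy:.3f}  F1={f1:.3f}  AUC={auc:.3f}')
```

Because these functions only take labels and predictions, they are independent of how the underlying model was implemented, which is what makes them reusable across frameworks.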
4.2. FAT Software
Many researchers, companies and developers have joined in the effort of making AI systems more transparent and socially acceptable (footnote 14: e.g., TuringBox (epstein2018turingbox), https://turingbox.mit.edu/, an online platform (currently under development) to automatically benchmark and evaluate AI systems with respect to a chosen metric, for example, accuracy and fairness); however, the FAT research software landscape is relatively scattered when compared to mature fields such as supervised learning. A recent attempt to create a common framework for FAT algorithms is the “What-If” tool (footnote 15: https://pair-code.github.io/what-if-tool/), which implements group-based model fairness evaluation and counterfactual prediction explainability (transparency). While its agenda is similar to our project, the “What-If” tool is only compatible with TensorFlow models, which is a significant limitation. Another tool built for TensorFlow is TensorFlow Extended (footnote 16: https://github.com/tensorflow/model-analysis), a platform that facilitates analysis of TensorFlow models by measuring, for example, their performance for multiple sub-populations in a data set.
In addition to general frameworks such as the “What-If” tool we can also find implementations of particular interpretability and explainability algorithms published in the literature. Examples of these are: LIME (footnote 17: https://github.com/marcotcr/lime) (lime), Anchor (footnote 18: https://github.com/marcotcr/anchor) (anchors:aaai18) and PyCEbox (footnote 19: https://github.com/AustinRochford/PyCEbox) (goldstein2015peeking). Many of these have been collected and built into algorithmic transparency packages with the most prominent ones being:
shap (footnote 24: https://github.com/slundberg/shap) (lundberg2017unified), and
AI Explainability 360 (footnote 25: https://github.com/IBM/AIX360).
The open source software landscape of fairness in AI and ML is even more varied: the available packages use different programming languages, often lack a licence or documentation and vary in code quality. The most important ones are:
AI Fairness 360 (footnote 26: https://github.com/IBM/AIF360) (bellamy2018ai),
BlackBoxAuditing (footnote 27: https://github.com/algofairness/BlackBoxAuditing) (adler2018auditing; feldman2015certifying),
Finally, open source software for ML and AI accountability (security and privacy) is even more scarce. The most prominent packages here are TensorFlow Privacy (footnote 32: https://github.com/tensorflow/privacy), OpenMined’s Grid (footnote 33: https://github.com/OpenMined/Grid) (a part of PyTorch) and DeepGame (footnote 34: https://github.com/TrustAI/DeepGame) (a deep network verification tool). An alternative direction of accountability research and software development is robustness of predictive systems against adversarial attacks. Software toolboxes available in this space are: FoolBox (footnote 35: https://github.com/bethgelab/foolbox), CleverHans (footnote 36: https://github.com/tensorflow/cleverhans) and IBM’s adversarial robustness toolbox (footnote 37: https://github.com/IBM/adversarial-robustness-toolbox).
Despite this lack of coherence in the open source world, commercial products are starting to emerge in this space. The most prominent one is IBM’s cloud offering, Watson OpenScale (footnote 38: https://www.ibm.com/cloud/watson-openscale/), where, as part of their cloud ML infrastructure, users can measure disparate impact of a data set and a model (footnote 39: https://www.ibm.com/blogs/watson/2018/09/trust-transparency-ai/) (group-based fairness of ground truth labelling and predicted classes) as well as use some of the model and prediction transparency approaches in addition to the standard model performance monitoring functionality.
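Disparate impact can be sketched in a few lines of plain Python. The example below assumes one common definition, the ratio of positive-outcome rates between an unprivileged and a privileged group, and applies it both to ground truth labels (data set) and to predicted classes (model). It does not reproduce the Watson OpenScale implementation; the function names and toy values are hypothetical.

```python
def positive_rate(outcomes, groups, group):
    """Fraction of positive (1) outcomes among members of `group`."""
    selected = [o for o, g in zip(outcomes, groups) if g == group]
    return sum(selected) / len(selected)


def disparate_impact(outcomes, groups, unprivileged, privileged):
    """Ratio of positive rates: unprivileged group over privileged group."""
    return (positive_rate(outcomes, groups, unprivileged)
            / positive_rate(outcomes, groups, privileged))


# Toy protected attribute, ground truth and predictions (all invented).
groups = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
labels = [1, 1, 1, 0, 1, 0, 0, 0]
predictions = [1, 1, 0, 0, 1, 1, 0, 0]

# Disparate impact of the data set (labels) and of the model (predictions).
data_di = disparate_impact(labels, groups, unprivileged='b', privileged='a')
model_di = disparate_impact(predictions, groups, unprivileged='b', privileged='a')
print(f'data set: {data_di:.3f}, model: {model_di:.3f}')
```

A value of 1 indicates equal positive rates in both groups; values far from 1 indicate that one group receives the favourable outcome more often than the other.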
4.3. FAT Reporting
In addition to evaluating FAT aspects of predictive systems with software, some researchers advocate producing unified reports describing their quality, reliability and other properties of interest. For example, Gebru et al. (gebru2018datasheets) proposed “data sheets for data sets” that aim to provide users with standardised information about technical properties of a data set, its intended use and provenance. Holland et al. (holland2018dataset) have independently come up with a similar idea called “nutrition labels for data sets” that mimics well-known food nutrition labels by providing basic details about a data set. Kelley et al. (kelley2009nutrition) proposed “privacy labels”, the goal of which is to inform users about the ways in which their data is collected, used and shared.
While all these initiatives aim to improve transparency of data usage, the time and effort required to produce them may be prohibitive on a large scale, therefore hindering their uptake. Moreover, their applicability is limited to data, hence leaving out AI and ML models and their predictions. Hind et al. (hind2018increasing) proposed a similar line of research by designing “Supplier’s Declarations of Conformity” for AI services. Their goal is to provide developers and suppliers of AI products-as-a-service with a unified way to report quality, security, interpretability and fairness of their products. Having rigorous and easily comparable reports for such products is of paramount importance given that their use does not require any prior ML or AI knowledge. Reisman et al. (reisman2018algorithmic) came up with a similar idea of “Algorithmic Impact Assessments”, which is a framework that can be used to systematically evaluate automated decision-making systems to keep them accountable. A related concept was introduced by Yang et al. (yang2018nutritional) who created “nutritional labels” for rankings. Alternative solutions include AI systems “checklists” (footnote 40: https://www.oreilly.com/ideas/of-oaths-and-checklists), “Data Ethics Workbook” (footnote 41: https://www.gov.uk/government/publications/data-ethics-workbook/data-ethics-workbook) and “Test-Driven Data Analysis” (footnote 42: http://www.tdda.info/).
Many of these solutions share one disadvantage: they need to be created manually by people who have a deep understanding of the system or data being evaluated, which often means that they cannot be retrofitted. The toolbox described in this paper could be used to automatically generate parts of (or at least provide validated algorithms for generating metrics for) customisable FAT reports, which can constitute a part of the “report cards” introduced earlier, for all aspects of a machine learning system (data, models and predictions), hence eliminating their manual, error-prone and subjective creation process. Furthermore, FAT Forensics has the potential to become a vital component of any ML pipeline development process: where continuous integration is used in software development to ensure high quality of the code, our toolbox could be used to evaluate FAT of any component of an ML pipeline before its deployment.
While software is the primary driver of progress in AI and ML research, its quality is often found lacking. Some research fields such as supervised learning and reinforcement learning have reached a consensus on that matter and have standardised metrics and software frameworks used to evaluate and compare novel algorithms. At the moment, fairness, accountability and transparency research in AI and ML communities lacks such a common software infrastructure to analyse, compare and communicate research results in a coherent manner. In this paper we proposed a flexible and modular open source Python toolbox to facilitate the development, evaluation, comparison and deployment of FAT algorithms.
FAT Forensics has been released to the public on GitHub (footnote 43: https://github.com/fat-forensics/fat-forensics) under the BSD 3-Clause licence with a collection of state-of-the-art FAT algorithms available at the release. Our toolbox has been implemented with two use cases in mind: research – intended for exploratory analysis, and deployment – designed for report generation, monitoring and certification. Since FAT Forensics is an open source effort we envisage the research community contributing their outputs to the software package, therefore making them easily accessible, reproducible and attractive to FAT enthusiasts. We hope and expect that all the software engineering best practices followed during the initial development of FAT Forensics have helped us to create a sustainable software package that is easy to extend and contribute to, serving the community for a long time to come.