Despite much excitement and activity in the field of XAI [montavon2018methods, arya2019explanation, LapNCOMM19, SamPIEEE21, bykov2021explaining], the evaluation of explanation methods remains an unsolved problem [samek2016evaluating, adebayo2020debugging, 2020scs, yona2021revisiting, arras2021ground]. Unlike in traditional ML, the task of explaining inherently lacks “ground-truth” data: there is no universally accepted definition of what constitutes a “correct” explanation, let alone of which properties an explanation ought to fulfill [BAM2019]. Due to this lack of standardised evaluation procedures in XAI, researchers frequently conceive new ways to experimentally examine explanation methods [samek2016evaluating, bach2015pixel, adebayo2020sanity, BAM2019, kindermans2019reliability], oftentimes employing different parameterisations and various kinds of preprocessing and normalisation, each leading to different or even contradictory results, which makes evaluation outcomes difficult to interpret and compare. Critically, we note that XAI papers commonly base their conclusions on one-sided, sometimes methodologically questionable evaluation procedures, which we fear obscures the current state of the art (SOTA) in XAI and may hurt the perceived credibility of the field over time.
For these reasons, researchers often rely on a qualitative evaluation of explanation methods, e.g., [zeiler2014visualizing, ribeiro2016why, shrikumar2017learning], assuming that humans know what an “accurate” explanation would look like (or rather should look like, often disregarding the role that the explained model plays in the explanation process). However, the assumption that humans are able to recognise a correct explanation is generally not justified: not only does the notion of an “accurate” explanation often depend on the specifics of the task at hand, but humans are also questionable judges of explanation quality [wang2019designing, rosenfeld2021better]. To make matters more challenging, recent studies suggest that even quantitative evaluation of explanation methods is far from foolproof [yona2021revisiting, bansal2020sam, budding2021evaluating, hase2021the].
In response to these issues, we developed Quantus to provide the community with a versatile and comprehensive toolkit that collects, organises, and explains a wide range of evaluation metrics proposed for explanation methods. The library is designed to help automate the process of XAI quantification by delivering fast, easily digestible, and at the same time holistic summaries of the quality of the given explanations. As we see it, Quantus makes an important, still-missing contribution to today's XAI research by filling the gap between what the community produces and what it currently needs: a more quantitative, systematic and standardised evaluation of XAI methods.
2 Toolkit overview
Quantus provides its intended users, practitioners and researchers interested in the domains of ML and XAI, with a steadily expanding list of 25+ reference metrics to evaluate explanations of ML predictions. Moreover, it offers comprehensive guidance on how to use these metrics, including information about potential pitfalls in their application.
The library is thoroughly documented and includes in-depth tutorials covering multiple use cases and tasks, from a comparative analysis of XAI methods and attributions to quantifying to what extent evaluation outcomes depend on metrics' parameterisations. In Figure 1, we demonstrate some example analyses that can be produced with Quantus (the full experiment can be reproduced, and the figure obtained, at the repository, under the tutorials folder). Moreover, the library provides an abstraction layer over the APIs of deep learning frameworks, e.g., PyTorch [NEURIPS2019_9015] and TensorFlow [tensorflow2015-whitepaper], and can be employed iteratively both during and after model training in the ML lifecycle. Code quality is ensured by thorough testing with pytest and continuous integration (CI), where every new contribution is automatically checked for sufficient test coverage. Syntax and style are checked with flake8 under various Python versions.
Unlike other XAI-related libraries, Quantus has its primary focus on evaluation and as such supports a breadth of metrics spanning various categories (see Table 1). (Related libraries were selected with respect to their XAI evaluation capabilities; packages including no metrics for the evaluation of explanation methods, e.g., Alibi [alibijanis], iNNvestigate [innvestigatealber], dalex [dalexhubert] and zennit [anders2021software], were excluded.) Detailed descriptions of the different evaluation categories are documented in the repository. The first iteration of the library mainly focuses on attribution-based explanation techniques, i.e., methods that assign an importance value to the model features and arguably constitute the most studied group of explanation methods, for (but not limited to) image classification. In future releases, we plan to extend the applicability of the library further, e.g., by developing additional metrics and functionality that will enable users to perform checks, verifications and sensitivity analyses on top of the metrics.
3 Library design
The user-facing API of Quantus is designed to replace an oftentimes lengthy and open-ended evaluation procedure with structure and speed: with a single line of code, the user can gain quantitative insights into how their explanations behave under various criteria. In the following code snippet, we demonstrate one way in which Quantus can be used to evaluate pre-computed explanations via a PixelFlipping experiment [bach2015pixel], by simply calling the initialised metric instance. In this example, we assume a pre-trained model (model), a batch of input-output pairs (x_batch, y_batch) and a set of attributions (a_batch).
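In Quantus itself, this amounts to initialising a metric and calling it on the model, data, and attributions, roughly `quantus.PixelFlipping(...)(model=model, x_batch=x_batch, y_batch=y_batch, a_batch=a_batch)`. To make the underlying idea concrete, the following is a minimal, self-contained NumPy sketch of what a Pixel-Flipping evaluation [bach2015pixel] computes; all names here are illustrative, not the Quantus API:

```python
import numpy as np

def pixel_flipping(model, x, attribution, baseline=0.0):
    """Sketch of Pixel-Flipping [bach2015pixel]: mask input features in order
    of decreasing attributed relevance and record the model output after each
    step. For a faithful attribution, the output should degrade quickly."""
    order = np.argsort(attribution.ravel())[::-1]  # most relevant first
    x_flipped = x.ravel().astype(float).copy()
    outputs = []
    for i in order:
        x_flipped[i] = baseline           # mask the next-most-relevant feature
        outputs.append(model(x_flipped))
    return outputs

# Toy linear "model" whose weights double as a perfectly faithful attribution.
w = np.array([3.0, 1.0, 2.0, 0.5])
model = lambda x: float(w @ x)
curve = pixel_flipping(model, x=np.ones(4), attribution=w)
print(curve)  # [3.5, 1.5, 0.5, 0.0]
```

The faster this curve drops, the more faithful the attribution; metrics of this family typically summarise it, e.g., by the area under the curve.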
Needless to say, XAI evaluation is intrinsically difficult and there is no one-size-fits-all metric for all tasks; the evaluation of explanations must be understood and calibrated within its context: the application, data, model, and intended stakeholders [arras2021ground, chander2018evaluating]. To this end, we designed Quantus to be highly customisable and easily extendable; documentation and examples of how to create new metrics, as well as how to customise existing ones, are included. Thanks to the API, any supporting function of the evaluation procedure, e.g., perturb_baseline, which determines with what value patches of the input are iteratively masked, can flexibly be replaced by a user-specified function to ensure that the evaluation procedure is appropriately contextualised.
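For instance, a constant masking baseline could be swapped for a user-supplied callable. The sketch below illustrates that pattern with hypothetical names; it mirrors the idea, not the Quantus implementation:

```python
import numpy as np

def mask(x, indices, perturb_baseline):
    """Hypothetical masking step of an evaluation metric: perturb_baseline may
    be a constant or a user-specified function of the current input."""
    out = np.asarray(x, dtype=float).copy()
    if callable(perturb_baseline):
        out[indices] = perturb_baseline(out, indices)
    else:
        out[indices] = perturb_baseline
    return out

x = np.array([0.25, 1.0, 0.5, 0.25])
masked_black = mask(x, [1], 0.0)                    # fixed "black" baseline
masked_mean = mask(x, [1], lambda v, i: v.mean())   # contextual baseline
```

Accepting a callable keeps the metric's control flow unchanged while letting users encode domain knowledge, e.g., masking with a dataset mean rather than zeros for inputs that are not naturally centred at zero.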
It is well known among practitioners, though rarely acknowledged publicly, that evaluation outcomes for explanations can be highly sensitive to the parameterisation of metrics [bansal2020sam, agarwal2020explaining] and to other confounding factors introduced in the evaluation procedure [yona2021revisiting, hase2021out]. Therefore, to encourage a thoughtful and responsible selection and parameterisation of metrics, we added mechanisms such as warnings, checks and user guidelines that caution users to reflect upon their choices. Great care must be taken when interpreting quantification results, and to this end we provide additional functionality that highlights potential interpretation pitfalls.
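A minimal sketch of this kind of guard might flag metric hyperparameters that were left at their defaults; the names below (including the parameter names) are hypothetical, and Quantus' actual warning mechanism differs:

```python
import warnings

def warn_on_defaults(metric_name, user_kwargs, defaults):
    """Hypothetical guard: warn when metric hyperparameters are left at their
    defaults, since evaluation outcomes can be sensitive to these choices."""
    untouched = sorted(set(defaults) - set(user_kwargs))
    if untouched:
        warnings.warn(
            f"{metric_name}: using default values for {untouched}; consider "
            "whether they are appropriate for your task.", UserWarning)
    return {**defaults, **user_kwargs}

params = warn_on_defaults(
    "PixelFlipping",
    user_kwargs={"features_in_step": 28},
    defaults={"features_in_step": 1, "perturb_baseline": "black"},
)  # warns that 'perturb_baseline' was left at its default
```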
4 Broader impact
We built Quantus to raise the bar of XAI quantification: to replace ad-hoc and sometimes ineffective evaluation procedures with reproducibility, simplicity and transparency. From our perspective, Quantus contributes to XAI development by helping researchers speed up the development and application of explanation methods, dissolve existing ambiguities and enable greater comparability. As we see it, steering efforts towards more objective and reproducible evaluations will prove rewarding for the community as a whole. We are convinced that a holistic, multidimensional take on XAI quantification will be imperative to the general success of (X)AI over time.
This work was partly funded by the German Ministry for Education and Research through project Explaining 4.0 (ref. 01IS20055) and BIFOLD (ref. 01IS18025A and ref. 01IS18037A), the Investitionsbank Berlin through BerDiBA (grant no. 10174498), as well as the European Union’s Horizon 2020 programme through iToBoS (grant no. 965221).