AI has been used in many diverse applications where the decisions taken by a model directly impact human lives. It is therefore of utmost importance to make AI models as reliable as possible. Unfortunately, data scientists predominantly look to improve accuracy or other generalization metrics such as precision, recall, etc., often ignoring critical properties such as fairness and robustness while training the model. As a result, we have seen many instances of unfairness/bias and robustness issues in scenarios such as recidivism prediction (Skeem and Eno Louden, 2007), job applications (2), etc.
Even though many testing techniques exist (Zhang et al., 2020), there has been a dearth of comprehensive ML model testing tools which can work across modalities and different types of models, and go beyond generalizability. A few toolkits exist, like AIF360 (9) and CHECKLIST (Ribeiro et al., 2020), which concentrate either on a single property such as fairness or a single modality such as text, but a comprehensive set of testing algorithms under a common framework is scarce.
In (Aggarwal et al., 2021), we presented AITEST, a framework for testing black-box models, to address the above problems. It covered the tabular, text, and time-series modalities and focused on generalizability, fairness, and robustness properties. Yet, our coverage was incomplete when it came to testing the variety of models used by our clients (10).
In this paper, we add a variety of testing algorithms to our tool in three modalities, viz. tabular, image, and speech-to-text. The extensible design of AITEST allowed us to include these analyses without much change to the framework.
One of the major problems of AI models is interpretability. While the accuracy of AI models has increased over the years, the models have become more complex and less interpretable. Interpretability testing deals with whether a given black-box model can be effectively simulated by a given interpretable model. This is essentially related to global explainability (Craven, 1996) and closely related to model auditing, especially in the finance industry, which predominantly uses models built on tabular data. We include tree-simulatability testing to check whether a given black-box model can be simulated by a decision-tree model having interpretable characteristics specified by the user. We further add image-testing capabilities, essentially various kinds of transformations under which the model should be robust, i.e., not change its decision. Specifically, we present eleven generic image transformations (inverse, saturation, etc.) and a Bayesian optimization-based algorithm for adversarial attacks on the black-box model.
Industrial conversation systems often use a Speech-to-Text (STT) engine to convert human speech to text before the text is fed into the dialog system. There are many instances in which the failure of the dialog system is attributable to the STT engine. We therefore introduce testing of fairness and robustness properties for STT models. Our contributions are summarized below:
We present a tool called AITEST with functionalities related to a) interpretability testing of tabular AI models, b) a comprehensive set of image transformations for testing black-box image classifiers, and c) a comprehensive set of audio transformations for testing fairness and robustness properties of Speech-to-Text models.
We present a short evaluation to demonstrate the effectiveness of the above techniques.
Our tool is part of IBM’s Ignite quality platform (10) and is used in testing multiple industrial AI models.
AITEST uses a uniform flow for testing models across different modalities. The user (typically an AI tester, data scientist, or model risk manager) starts by registering a model with AITEST, providing the model API details (endpoint, input template, output template, authentication headers), the type of model (tabular-data classifier, time-series predictor, text classifier, conversation classifier, image classifier, or speech-to-text model), and seed data (training or test data that can be used to create test samples). Users can select the newly registered model or an existing registered model, then select the model-type-specific properties and input the configuration required for each property (the threshold for fairness, transformation control parameters, etc.) to schedule a test run. Users can view the status of the test run, which shows the number of test cases generated and their status (passed/failed) for each selected property. Users can then view the metrics corresponding to the run for each completed property, along with the failure samples and a textual explanation, and can compare the results of multiple runs.
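For concreteness, the registration step can be pictured as a single payload. The field names below are purely illustrative and do not reflect AITEST's actual API schema:

```python
# Hypothetical model-registration payload; every key name here is an
# illustrative assumption, not AITEST's real interface.
model_registration = {
    "model_type": "tabular-data classifier",
    "api": {
        "endpoint": "https://example.com/v1/models/loan-approval/predict",
        "input_template": {"instances": [{"age": 0, "income": 0.0}]},
        "output_template": {"predictions": ["<label>"]},
        "auth_headers": {"Authorization": "Bearer <token>"},
    },
    # Seed data is used to create test samples for the selected properties.
    "seed_data": "train.csv",
}

print(sorted(model_registration["api"]))
```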
3. Testing Algorithms
3.1. Interpretability Testing of Tabular Classifiers
Understanding the decision-making process of most machine learning models is hard for human beings, as such models are typically optimized for performance metrics and not designed to be interpretable. The global explainability problem refers to creating an interpretable model which can mimic the reasoning of the target model for a given set of samples, usually the entire training data. Rule-based models like decision trees or rule sets (Breiman et al., 1984; Lakkaraju et al., 2016; Wang et al., 2017) and sparse linear models (Ribeiro et al., 2016) are popular choices for such surrogates.
In our framework, we assess the interpretability of a given target black-box classification model by first constructing a decision-tree (Breiman et al., 1984) surrogate over the entire training data, but with the predictions from the model instead of the training-data labels. We then evaluate the following three decision-tree interpretability characteristics against thresholds that the user provides. Note that huge decision trees are not really interpretable. While building the surrogate tree, we leverage data generation (Craven, 1996) to overcome sample scarcity when deciding attribute splits at nodes far from the root.
Average Path Length (APL)
It is defined (Wu et al., 2018) as the expected number of nodes along any root-to-leaf decision path, or equivalently, the expected length of a path in the surrogate tree T which simulates the target model M. The expectation is computed over all such paths in the surrogate.
Intuitively, this measures the expected number of decisions to be made to arrive at an outcome for a sample according to the model M. Each decision here is essentially a boolean predicate evaluation on some attribute of the sample.
Maximum Path Length (MPL)
It is the maximum number of nodes present in any decision path in the surrogate tree T.
This metric captures, in the worst case, how many decisions are to be made to predict the outcome of a sample. Interpretability favours small values of APL and MPL.
Fidelity (FID)
It is computed as the percentage of test samples (from a reserved test suite D) for which the predictions from the target model M and the surrogate tree T match, i.e., are the same: FID = (1/|D|) Σ_{x∈D} 1[M(x) = T(x)], where x is a sample from D.
Fidelity is a measure of how accurately the surrogate tree resembles the target model in terms of output and decision boundary.
A successful tree-simulatability test generates a decision-tree model which satisfies all the above characteristics.
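The three characteristics above can be computed with standard library tooling. The following is a minimal sketch using scikit-learn, with a random forest standing in for the black-box target model M; the depth limit and model choices are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the black-box target model M.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
target_model = RandomForestClassifier(random_state=0).fit(X[:1500], y[:1500])

# Build the surrogate tree T on the model's *predictions*, not the labels.
surrogate = DecisionTreeClassifier(max_depth=5, random_state=0)
surrogate.fit(X[:1500], target_model.predict(X[:1500]))

def path_lengths(tree, X):
    """Number of nodes on each sample's root-to-leaf decision path."""
    node_indicator = tree.decision_path(X)  # sparse (samples x nodes)
    return np.asarray(node_indicator.sum(axis=1)).ravel()

lengths = path_lengths(surrogate, X[1500:])
apl = lengths.mean()  # Average Path Length
mpl = lengths.max()   # Maximum Path Length
# Fidelity: fraction of reserved samples where T agrees with M.
fidelity = np.mean(surrogate.predict(X[1500:]) ==
                   target_model.predict(X[1500:]))

print(f"APL={apl:.2f}  MPL={mpl}  Fidelity={fidelity:.3f}")
```

The test passes when all three values fall within the user-provided thresholds.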
3.2. Image Classifier Testing
Deep Neural Networks (DNNs) have found widespread usage in several Computer Vision (CV) tasks. Even though they match (or exceed) the performance of humans on these tasks, in practice DNNs are vulnerable to malicious inputs (Tian et al., 2018; Kurakin et al., 2016). Consequently, there is an increasing need to develop techniques to test these models before deployment. In our framework, we propose a rich set of transforms (see Table 1) to test DNNs for CV tasks, each of which checks whether the prediction of the model changes on the transformed images.
The transforms used by our framework broadly fall into two categories: static transforms such as translation, scaling, rotation, blurring, etc. (Simard et al., 2003), and dynamic ones that generate black-box adversarial examples via Bayesian Optimisation (Shukla et al., 2020). Static transforms can be further classified into three categories: linear, affine, and convolution.
In the case of linear transforms, we add, subtract, or multiply either a constant or random noise to each pixel of an image. For instance, to change the brightness of a given image we add (or subtract) a constant value, and in the case of the random Gaussian noise transform we add random noise sampled from a unit Gaussian distribution to the image.

Affine transforms modify the geometric structure of the image while preserving proportions of lines, but not necessarily lengths and angles. Typically, affine transforms can be represented by a 2D matrix and hence are amenable to composition. In practice, these transforms (or their compositions) are used to simulate several image deformations which a model would encounter during a real-world deployment.

Convolutions are general-purpose filters that work by multiplying a pixel's value and its neighboring pixels' values by a matrix (a kernel). Intuitively, the value of each transformed pixel is computed by adding the products of each surrounding pixel value with the corresponding kernel value. Transforms such as Blur, Fog, etc. are examples of convolution transforms.
| Transform | Type | Description |
| --- | --- | --- |
| Inverse | Linear | Inverts the image |
| Scale | Affine | Scales the image by a factor |
| Rotate | Affine | Rotates the image by an angle |
| Shear | Affine | Slants the image by an angle |
| Saturation | Linear | Perturbs the saturation levels of the image |
| Brightness | Linear | Changes the brightness of the image |
| Contrast | Linear | Changes the contrast of the image |
| Fog | Convolution | Simulates realistic fog on the image |
| Gaussian Blur | Convolution | Blurs the image with a random Gaussian kernel |
| Zoom & Blur | Convolution | Zooms into part of the image & blurs it |
| Gaussian Noise | Linear | Adds random noise sampled from a unit Gaussian distribution to the image |
| Bayesian Optimisation | Adversarial | Generates adversarial examples via Bayesian Optimisation |
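A few of the static transforms above can be sketched directly in NumPy; the image and parameter values below are toy placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32))  # toy grayscale image with values in [0, 1]

# Linear: shift every pixel by a constant (brightness change).
brighter = np.clip(image + 0.2, 0.0, 1.0)

# Linear: add noise sampled from a unit Gaussian to every pixel.
noisy = image + rng.normal(0.0, 1.0, size=image.shape)

# Convolution: 3x3 box blur -- each output pixel is the mean of its
# neighbourhood, i.e. a kernel of ones weighted by 1/9.
kernel = np.ones((3, 3)) / 9.0
padded = np.pad(image, 1, mode="edge")
blurred = np.empty_like(image)
for i in range(image.shape[0]):
    for j in range(image.shape[1]):
        blurred[i, j] = (padded[i:i + 3, j:j + 3] * kernel).sum()

# A robustness test then checks model(image) == model(transformed)
# for each transformed variant.
```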
Adversarial examples are malicious inputs crafted by adversaries to fool DNNs. In our framework, we generate adversarial examples in a black-box setup where an adversary can only query the model via a predictive interface and does not know any other information about the deployed model M. The adversarial noise added to an image x is generated by solving a constrained optimisation problem (Shukla et al., 2020): we aim to find an adversarial input x′ close to x (e.g., with ‖x′ − x‖ bounded) such that the model's prediction changes, i.e., M(x′) ≠ M(x).
We solve the above problem via Bayesian Optimisation, which offers an efficient approach to global optimisation problems. Typically, a Bayesian Optimiser has two key ingredients: a surrogate model, such as a Gaussian Process (GP) or a Bayesian Neural Network (BNN), and an acquisition function which provides the next location at which to query the target function. Intuitively, the acquisition function balances exploration and exploitation by assigning higher value to parts of the input space where the target function is expected to be high or the surrogate model is very uncertain. The approach of using Bayesian Optimisation with a GP surrogate to attack the target model is described in Algorithm 1.
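The loop can be sketched as follows on a toy two-dimensional "image" and a synthetic black-box classifier; the score queried here is an illustrative margin, a simplification of the hard-label objective of (Shukla et al., 2020), and all parameter values are assumptions:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
x0 = np.array([1.0, 1.0])                 # toy 2-D input to attack
margin = lambda d: (x0 + d).sum() - 1.5   # >0 -> original class (illustrative)
label = lambda d: int(margin(d) > 0)
eps = 0.6                                 # perturbation budget (L-infinity ball)

# Initial random queries of the black box.
D = rng.uniform(-eps, eps, size=(5, 2))
f = np.array([margin(d) for d in D])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), normalize_y=True)
for _ in range(15):
    gp.fit(D, f)                                    # GP surrogate of the score
    cand = rng.uniform(-eps, eps, size=(256, 2))    # candidate perturbations
    mu, sigma = gp.predict(cand, return_std=True)
    best = f.min()
    # Expected-improvement acquisition (we minimise the margin).
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    d_next = cand[np.argmax(ei)]                    # next query location
    D = np.vstack([D, d_next])
    f = np.append(f, margin(d_next))
    if label(d_next) != label(np.zeros(2)):         # prediction flipped
        break

print("queries used:", len(f), "best score:", f.min())
```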
3.3. Speech-to-Text Model Testing
Automatic speech recognition engines are widely used in many automated tasks (from voice assistants to a myriad of downstream activities), as such systems make it very convenient for humans to convey their intents. Examples include the Speech-to-Text services from Google, Microsoft, Apple, Amazon, IBM, etc., as well as offline engines like Mozilla's DeepSpeech (Hannun et al., 2014). The performance of a downstream task often depends on the precision of the text transcription of the audio, since it is next fed to language understanding models.

It is important to realize that humans are not always in surroundings conducive to giving clear voice commands, and noise artefacts regularly get blended into the input speech. In our framework, we test the robustness of black-box audio transcription models under white background noise and various interfering environmental perturbations to the input clips. We also test for simple fairness properties, such as changing the gender or accent of input speech having identical textual content.
We measure the Word Error Rates (defined later) of text transcriptions to measure the effectiveness of our testing.
For the following paragraphs, we consider an audio clip as a one-dimensional signal of appropriately sampled and normalized values, and denote the original clip by x and its corresponding perturbed clip by x′.
White Noise Transform
Here we perturb the original clip by adding random Gaussian noise to each of its elements: x′_i = x_i + ε_i, where each element ε_i of the noise sequence is sampled from a zero-mean normal distribution. The variance of the noise is decided by a given SNR (Signal-to-Noise Ratio) value, which also specifies the loudness of the noise. SNR is a logarithmic-scale measure of the ratio of the power of two signals: SNR_dB = 10 log10(P_signal / P_noise). Note that since the noise has zero mean, its power equals its variance. Using the above equations, we can compute the perturbed signal for any desired SNR specified during testing.
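The SNR relations above translate directly into code; in the sketch below, an illustrative sine tone stands in for an audio clip, and the noise variance is derived from the requested SNR:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "audio clip": a 1-second 440 Hz tone sampled at 16 kHz.
clip = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))

def add_white_noise(x, snr_db):
    """Add zero-mean Gaussian noise at a desired signal-to-noise ratio.

    SNR_dB = 10 * log10(P_signal / P_noise); since the noise has zero
    mean, its power equals its variance.
    """
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(p_noise), size=x.shape)
    return x + noise

perturbed = add_white_noise(clip, snr_db=10.0)
# Empirical SNR of the result should be close to the requested 10 dB.
p_n = np.mean((perturbed - clip) ** 2)
print(10 * np.log10(np.mean(clip ** 2) / p_n))
```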
Environmental Noise Transform
In this transformation, we overlay environmental noises that simulate different practical scenarios on top of the original clip. Example scenarios include restaurant chatter, rainfall, festival sounds, water dripping, wind, etc. We control the signal strength by using the SNR parameter while generating the linear combination of the signal and the noise that produces the interference effect.
Gender and Accent Transforms
Speech-to-Text transcription models are expected to have similar performance irrespective of the sensitive or protected attributes of the speaker. In this context, we test the models by synthesizing a perturbed input with the same spoken words or tokens in the same language but with the gender flipped (male to female and vice versa) or the speaker's accent changed (e.g., Indian, American, French).
We synthesize the perturbed inputs using open-source speech generation engines or pretrained Text-to-Speech models (such as Tacotron2 (Shen et al., 2018) and Waveglow (Prenger et al., 2019)) that synthesize audio directly from natural-language transcripts. By training these models on audio from different speakers, we can leverage the learnt transformations.
These transforms try to capture the fairness of the models to the different genders and ethnicities of the end-users who may be consuming the product or using it in a critical decision-making scenario.
4. Evaluation
In this section, we present a basic evaluation of the key novel testing techniques underlying our tool, demonstrating that AITEST can be used effectively.
Interpretability Testing for Tabular Data
In Table 2, we provide the interpretability metrics for some well-known datasets: Adult Census Income (Kohavi and Becker, ), Bank Marketing (Moro et al., 2014), and US Execution (McElwee et al., ), all obtained from the UCI repository (Dua and Graff, 2017). We have taken black-box tabular models from the IBM Watson WML framework. Here the user may set appropriate thresholds for the three metrics to decide whether the test passes or fails.
Image Classifier Testing
To illustrate the effectiveness of our framework, we tested two standard models on popular datasets: LeNet5 (LeCun et al., 1998) on MNIST (Lecun and Cortes, 1998) and ResNet32 (He et al., 2016) on CIFAR10 (Krizhevsky, 2009). From Table 3, we can see that the test cases generated by the various transforms in our framework are quite effective in fooling the models. Users can view such cases in the tool as shown in Figure 2.
| Transform | LeNet5 (MNIST) | ResNet32 (CIFAR10) |
| --- | --- | --- |
| Zoom & Blur | 91 | 82.1 |
Speech-to-Text Model Testing
For evaluating the perturbations, we use Word Error Rate (WER) as the metric, defined as the fraction of words omitted, substituted, or newly inserted in the text transcription of the perturbed clip when compared with the original text. If we have multiple audio samples, we simply average the metric over all samples.
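WER can be computed with a standard word-level edit distance; the sketch below reproduces, for example, the "Restaurant noise" row of Table 4:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance: (substitutions + deletions +
    insertions) divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

print(word_error_rate("can I talk to someone", "can"))  # → 0.8
```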
We have tested it in a client scenario that requires a voice assistant to process simple commands or queries. We provide some example failures detected by our method in Table 4. Users can play the original and transformed audio for the failure cases, as shown in Figure 3. We direct interested readers to (Saha, 2021) and our supplementary video, which contain more examples and details.
| Transform type | Original Text | Perturbed Text | WER |
| --- | --- | --- | --- |
| White noise | I am ready | find great if | 1.0 |
| Restaurant noise | can I talk to someone | can | 0.8 |
| Water drip noise | keep holding | keep clothing | 0.5 |
| Wind noise | I need a minute | I need a man | 0.25 |
| French accent | repeat please | he Peter please | 1.0 |
A short video demonstration of our tool with a subset of the properties included in this paper is available at https://youtu.be/sc0YEHx4m0c.
5. Related Work
In the recent past, there has been a lot of interest in developing techniques for testing DNNs for Computer Vision tasks, varying from simple transforms such as translation, rotation, etc. (Tian et al., 2018) to richer formulations based on GANs (Zhang et al., 2018), concolic testing (Sun et al., 2018), and adversarial example generation (Shukla et al., 2020; Kurakin et al., 2016); we leverage some of these techniques in our framework. Similar techniques have also gained traction for speech recognition models: (Carlini and Wagner, 2018) uses fully white-box gradient-based methods to generate targeted and untargeted attacks on models such as (Hannun et al., 2014). Although the notions of simulatability and tree-simulatability are not new (Wu et al., 2018), testing a model for tree-simulatability or interpretability appears to be novel. Overall, AITEST remains, to the best of our knowledge, the most comprehensive tool for testing across these many modalities.
6. Conclusion and Future Work
The last decade has seen a proliferation of various types of AI models and their applications. Only recently have efforts been made to define a concrete Data and AI lifecycle (Ishizaki, 2021), and testing plays an important role in that lifecycle to ensure the reliability of AI models. We present a tool, AITEST, which encompasses a variety of testing techniques across multiple modalities. In this paper, we include some novel capabilities, such as interpretability testing and fairness testing of Speech-to-Text models, along with the implementation of some known properties. Implementing all these properties under one framework with easy usability is very valuable for serving a variety of client models.
In the future, we plan to add support for testing videos, multi-modal inputs, and model compositions. The current implementation supports only black-box testing, which, once configured, can be applied to a large number of similar models of the same type without much change. We also plan to add support for white-box testing algorithms.
-  (2021) Testing framework for black-box ai models. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), pp. 81–84. Cited by: §1.
-  (Last accessed 18th August, 2021) Amazon scraps secret ai recruiting tool that showed bias against women. Note: https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G Cited by: §1.
-  (1984) Classification and regression trees. CRC press. Cited by: §3.1, §3.1.
-  (2018) Audio adversarial examples: targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 1–7. Cited by: §5.
-  (1996) Extracting comprehensible models from trained neural networks. Ph.D. Thesis, The University of Wisconsin - Madison. Note: AAI9700774 External Links: Cited by: §1, §3.1.
-  (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Cited by: §4.
-  (2014) Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. Cited by: §3.3, §5.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.
-  (Last accessed 18th August, 2021) IBM aif360. Note: https://github.com/IBM/AIF360 Cited by: §1.
-  (Last accessed 18th August, 2021) IBM ignite. Note: https://www.ibm.com/in-en/services/applications/testing Cited by: §1, §1.
-  (2021) AI model lifecycle management: overview. IBM.. External Links: Cited by: §6.
-  Census income data. External Links: Cited by: §4.
-  (2009) Learning multiple layers of features from tiny images. Technical report . Cited by: §4.
-  (2016) Adversarial examples in the physical world. CoRR abs/1607.02533. External Links: Cited by: §3.2, §5.
-  (2016) Interpretable decision sets: a joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1675–1684. Cited by: §3.1.
-  (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.
-  (1998) The MNIST database of handwritten digits. External Links: Cited by: §4.
-  US executions since 1977. External Links: Cited by: §4.
-  (2014) A data-driven approach to predict the success of bank telemarketing. Decis. Support Syst., pp. 22–31. Cited by: §4.
-  (2019) Waveglow: a flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621. Cited by: §3.3.
-  (2016) Why should i trust you?: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §3.1.
-  (2020) Beyond accuracy: behavioral testing of NLP models with CheckList. In ACL, pp. 4902–4912. External Links: Cited by: §1.
-  (2021) Ensuring trustworthy ai through ai testing. IBM Research AI, Bangalore, India.. External Links: Cited by: §4.
-  (2018) Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783. Cited by: §3.3.
-  (2020) Hard label black-box adversarial attacks in low query budget regimes. CoRR abs/2007.07210. External Links: Cited by: §3.2, §3.2, §5.
-  (2003) Best practices for convolutional neural networks applied to visual document analysis. In 7th International Conference on Document Analysis and Recognition (ICDAR 2003), 2-Volume Set, 3-6 August 2003, Edinburgh, Scotland, UK, pp. 958–962. Cited by: §3.2.
-  (2007) Assessment of evidence on the quality of the correctional offender management profiling for alternative sanctions (compas). Unpublished report prepared for the California Department of Corrections and Rehabilitation. Available at: https://webfiles. uci. edu/skeem/Downloads. html. Cited by: §1.
-  (2018) Concolic testing for deep neural networks. External Links: Cited by: §5.
-  (2018) DeepTest: automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, New York, NY, USA, pp. 303–314. External Links: Cited by: §3.2, §5.
-  (2017) A bayesian framework for learning rule sets for interpretable classification. The Journal of Machine Learning Research 18 (1), pp. 2357–2393. Cited by: §3.1.
-  (2018) Beyond sparsity: tree regularization of deep models for interpretability. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §3.1, §5.
-  (2020) Machine learning testing: survey, landscapes and horizons. IEEE Transactions on Software Engineering. Cited by: §1.
-  (2018) DeepRoad: gan-based metamorphic testing and input validation framework for autonomous driving systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France, September 3-7, 2018, M. Huchard, C. Kästner, and G. Fraser (Eds.), pp. 132–142. External Links: Cited by: §5.