Automated Testing of AI Models

10/07/2021 ∙ by Swagatam Haldar, et al. ∙ IBM

The last decade has seen tremendous progress in AI technology and applications. With such widespread adoption, ensuring the reliability of AI models is crucial. In the past, we took the first step of creating a testing framework called AITEST for metamorphic properties, such as fairness and robustness, for tabular, time-series, and text classification models. In this paper, we extend the capability of the AITEST tool to include testing techniques for image and speech-to-text models, along with interpretability testing for tabular models. These novel extensions make AITEST a comprehensive framework for testing AI models.







1. Introduction

AI has been used in many diverse applications where the decision taken by the model directly impacts human life. It is therefore of utmost importance to make AI models as reliable as possible. Unfortunately, data scientists predominantly focus on improving accuracy or other generalization properties such as precision, recall, etc., often ignoring critical properties such as fairness and robustness while training the model. As a result, we have seen many instances of unfairness/bias and robustness issues in scenarios such as recidivism (Skeem and Eno Louden, 2007), job applications (2), etc.

Even though many techniques for testing exist (Zhang et al., 2020), there has been a dearth of comprehensive ML model testing tools which work across modalities and types of models, and go beyond generalizability. There are a few toolkits like AIF360 (9) and CHECKLIST (Ribeiro et al., 2020), which either concentrate on a single property such as fairness or a single modality such as text, but a comprehensive set of testing algorithms under a common framework is scarce.

In (Aggarwal et al., 2021), we presented AITEST, a framework for testing black-box models that addresses the above problems. It covered the tabular, text, and time-series modalities and focused on generalizability, fairness, and robustness properties. Yet, our coverage was incomplete when it came to testing the variety of models used by our clients (10).

In this paper, we include a variety of testing algorithms in our tool across three modalities, viz. tabular, image, and speech-to-text. The extensible design of AITEST allowed us to include these analyses without much change to the framework.

One of the major problems of AI models is interpretability. While the accuracy of AI models has increased over the years, the models have become more complex and less interpretable. Interpretability testing deals with whether a given black-box model can be effectively simulated by a given interpretable model. This is essentially related to global explainability (Craven, 1996) and to model auditing, especially in the finance industry, which predominantly uses models built on tabular data. We include tree-simulatability testing to check whether a given black-box model can be simulated by a decision-tree model having interpretable characteristics specified by the user. We further add image testing capabilities, essentially adding various kinds of transformations under which the model should be robust, i.e., not change its decision. Specifically, we present eleven generic image transformations (inverse, saturation, etc.) and a Bayesian optimization-based algorithm for adversarial attacks on the black-box model.

Industrial conversation systems often use a Speech-to-Text (STT) engine to convert human speech to text before the text is fed into the dialog system. There are many instances in which the failure of the dialog system is attributed to the STT engine. We therefore introduce testing capabilities for fairness and robustness properties of STT models. Our contributions are summarized below:


  • We present a tool called AITEST with functionalities related to a) interpretability testing of tabular AI models, b) a comprehensive set of image transformations for testing black-box image classifiers, and c) a comprehensive set of audio transformations for testing fairness and robustness properties of speech-to-text models.

  • We present a short evaluation to demonstrate the effectiveness of the above techniques.

Our tool is part of IBM’s Ignite quality platform (10) and is used in testing multiple industrial AI models.

The rest of the paper is organized as follows. The next section presents the flow of the tool. Section 3 describes our testing algorithms. Section 4 presents the experimental results. Section 5 presents the related work. We conclude in Section 6.

2. Flow

Figure 1. Flow

AITEST uses a uniform flow for testing models across different modalities (see Figure 1). The user (typically an AI tester, data scientist, or model risk manager) starts by registering a model into AITEST, providing the model API details (endpoint, input template, output template, authentication headers), the type of model (tabular-data classifier, time-series prediction, text classifier, conversation classifier, image classifier, speech-to-text model), and seed data (training or test data which can be used in creating test samples). Users can select the newly registered model or an existing registered model, select the model type-specific properties, and input the configurations required for each property (the threshold for fairness, transformation control parameters, etc.) to schedule a test run. Users can view the status of the test run, which shows the number of test cases generated and their status (passed/failed) for each selected property. Users can then view the metrics for each completed property along with the failure samples and a textual explanation, and can compare the results of multiple runs.

3. Testing Algorithms

3.1. Interpretability Testing of Tabular Classifiers

Understanding the decision-making process of most machine learning models is hard for human beings, as the models are typically optimized for performance metrics and not designed to be interpretable. The global explainability problem refers to creating an interpretable model which can mimic the reasoning of the target model for a given set of samples, usually the entire training data. Rule-based models like decision trees or rule sets

(Breiman et al., 1984; Lakkaraju et al., 2016; Wang et al., 2017) and sparse linear models (Ribeiro et al., 2016) are popular choices for such surrogates.

In our framework, we assess the interpretability of a given target black-box classification model f by first constructing a decision tree (Breiman et al., 1984) surrogate T over the entire training data, but with the predictions from f instead of the training-data labels. We then evaluate the following three decision-tree interpretability characteristics against thresholds that the user provides. Note that huge decision trees are not really interpretable. While building T, we leverage data generation (Craven, 1996) to overcome sample scarcity when deciding attribute splits at nodes far from the root.
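For illustration, the surrogate construction step can be sketched as follows, assuming scikit-learn; the black-box model and the data here are toy stand-ins, not AITEST's actual API.

```python
# Sketch of surrogate-tree construction over black-box predictions.
# The logistic-regression "black box" and synthetic data are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# Stand-in for the target black-box model f: we may only call its API.
blackbox = LogisticRegression().fit(X_train, y_train)

# Fit the surrogate T on f's predictions, NOT on the ground-truth labels.
surrogate = DecisionTreeClassifier(max_depth=5, random_state=0)
surrogate.fit(X_train, blackbox.predict(X_train))
```

The `max_depth` bound plays the role of the user-provided interpretability constraint: it keeps the surrogate from growing into an uninterpretable tree.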

Average Path Length (APL)

It is defined (Wu et al., 2018) as the expected number of nodes along any root-to-leaf decision path or, equivalently, the expected length of a path in the surrogate T which simulates the target model f. The expectation is computed over all such paths in the surrogate:

APL(T) = (1 / |P(T)|) · Σ_{p ∈ P(T)} len(p)

where P(T) is the set of root-to-leaf paths in T and len(p) counts the nodes on path p.

Intuitively, this measures the expected number of decisions to be made to arrive at an outcome for a sample according to the model f. Each decision here is essentially a boolean predicate evaluation on some attribute of the sample.

Maximum Path Length (MPL)

It is the maximum number of nodes present in any decision path in the surrogate tree T:

MPL(T) = max_{p ∈ P(T)} len(p)
This metric captures, in the worst case, how many decisions are to be made to predict the outcome of a sample. Interpretability favours small values of APL and MPL.

Fidelity

It is computed as the percentage of test samples (from a reserved suite D) for which the predicted labels from the target model f and the surrogate tree T match, i.e., are the same:

Fidelity(T, f) = (100 / |D|) · Σ_{x ∈ D} 1[f(x) = T(x)]

where x is a sample from D.

Fidelity is a measure of how accurately the surrogate tree T resembles the target model f in terms of output and decision boundary.

A successful tree-simulatability test generates a decision-tree model which satisfies all of the above characteristics.
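The three characteristics above can be computed directly from a fitted scikit-learn tree; the sketch below uses synthetic black-box predictions and takes the expectation for APL uniformly over leaves, which is one possible reading of the definition.

```python
# Illustrative computation of APL, MPL, and fidelity for a surrogate tree.
# The "black-box predictions" y_bb are a synthetic stand-in.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 3))
y_bb = (X[:, 0] > 0.2).astype(int)          # stand-in for black-box outputs f(x)
X_fit, X_test, y_fit, y_test = X[:400], X[400:], y_bb[:400], y_bb[400:]

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_fit, y_fit)

def path_lengths(t):
    """Number of nodes on every root-to-leaf path of a fitted tree."""
    left, right = t.tree_.children_left, t.tree_.children_right
    lengths, stack = [], [(0, 1)]           # (node id, nodes seen so far)
    while stack:
        node, depth = stack.pop()
        if left[node] == -1:                # leaf node
            lengths.append(depth)
        else:
            stack += [(left[node], depth + 1), (right[node], depth + 1)]
    return lengths

lengths = path_lengths(tree)
apl = float(np.mean(lengths))               # Average Path Length
mpl = max(lengths)                          # Maximum Path Length
fidelity = 100.0 * np.mean(tree.predict(X_test) == y_test)
```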

3.2. Image Classifier Testing

Deep Neural Networks (DNNs) have found widespread usage in several Computer Vision (CV) tasks. Even though they match (or exceed) the performance of humans on these tasks, in practice, DNNs are vulnerable to malicious inputs 

(Tian et al., 2018; Kurakin et al., 2016). Consequently, there is an increasing need to develop techniques to test these models before deployment. In our framework, we propose a rich set of transforms (refer to Table 1) to test DNNs for CV tasks, each of which checks whether the prediction of the model changes on the transformed images.

The transforms used by our framework broadly fall into two categories: static transforms such as translation, scaling, rotation, blurring, etc. (Simard et al., 2003), and dynamic ones that generate black-box adversarial examples via Bayesian Optimisation (Shukla et al., 2020). Static transforms can be further classified into three categories: linear, affine, and convolution.

Static Transforms

In the case of linear transforms, we add, subtract, or multiply either a constant or random noise to each pixel of an image. For instance, to change the brightness of a given image, we add (or subtract) a constant value, and in the case of the random Gaussian noise transform, we add random noise sampled from a unit Gaussian distribution to the image.

Affine transforms modify the geometric structure of the image while preserving proportions of lines, but not necessarily lengths and angles. Typically, affine transforms can be represented by a 2D matrix, and hence are amenable to composition. In practice, these transforms (or their compositions) are used to simulate several image deformations which a model would encounter during real-world deployment. Convolutions are general-purpose filters that work by multiplying a pixel's value and those of its neighboring pixels by a matrix (a kernel matrix). Intuitively, the value of each transformed pixel is computed by summing the products of each surrounding pixel value with the corresponding kernel value. Transforms such as Blur, Fog, etc. are examples of convolution transforms.
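The three static-transform families can be sketched in a few lines of NumPy on a tiny grayscale image; these are minimal illustrations, not AITEST's production transforms.

```python
# Minimal NumPy sketches of linear, affine, and convolution transforms
# on a 5x5 grayscale image; values and sizes are illustrative.
import numpy as np

img = np.linspace(0, 255, 25).reshape(5, 5)

# Linear: brightness shift adds a constant to every pixel (clipped to [0, 255]).
brighter = np.clip(img + 40, 0, 255)

# Affine: a 2x2 rotation matrix applied to pixel coordinates about the center.
theta = np.deg2rad(90)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
c = (np.array(img.shape) - 1) / 2
coords = np.indices(img.shape).reshape(2, -1).T - c   # centered grid coords
src = np.rint(coords @ R.T + c).astype(int)           # rotated sample points
rotated = img[src[:, 0] % 5, src[:, 1] % 5].reshape(5, 5)

# Convolution: 3x3 box-blur kernel; each output pixel is a weighted
# sum of the pixel and its neighbours.
kernel = np.ones((3, 3)) / 9.0
padded = np.pad(img, 1, mode="edge")
blurred = np.zeros_like(img)
for i in range(5):
    for j in range(5):
        blurred[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
```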

Name Type Description
Inverse Linear Inverts the image
Scale Affine Scales the image by a factor
Rotate Affine Rotates the image by an angle
Shear Affine Slants an image by an angle
Saturation Linear Perturbs the saturation levels of the images
Brightness Linear Changes the brightness of the image
Contrast Linear Changes the contrast of the image
Fog Convolution Simulates realistic fog on the image
Gaussian Blur Convolution Blurs the image through a random Gaussian Kernel
Zoom & Blur Convolution Zooms to part of the image & blurs it
Gaussian Noise Linear Adds a random noise sampled from a unit Gaussian distribution to the image
Bayesian Optimisation Adversarial Generates adversarial examples via Bayesian Optimisation
Table 1. Image transforms

Dynamic Transforms

Adversarial examples are malicious inputs crafted by adversaries to fool DNNs. In our framework, we generate adversarial examples in a black-box setup, where the adversary can only query the model f via a predictive interface and does not know any other information about the deployed model. The adversarial noise δ added to an image x is generated by solving the following constrained optimisation problem (Shukla et al., 2020), where we aim to find an adversarial input x + δ, close to x, such that the model's prediction changes:

min_δ ||δ||  subject to  f(x + δ) ≠ f(x)
We solve the above problem via Bayesian Optimisation, which offers an efficient approach to global optimisation. Typically, a Bayesian Optimiser has two key ingredients: a surrogate model, such as a Gaussian Process (GP) or a Bayesian Neural Network (BNN), and an acquisition function which proposes the next location at which to query the target function. Intuitively, the acquisition function balances exploration and exploitation by assigning higher value to parts of the input space where the target function is expected to be high or where the surrogate model is very uncertain. The approach of using Bayesian Optimisation with a GP surrogate to attack the target model is described in Algorithm 1.

INPUT: A black-box function f, seed data D₀, no. of iterations T
OUTPUT: Adversarial noise δ
for m = 1, 2, …, T do
      Select δₘ by maximizing the acquisition function over the surrogate's posterior
      Query vₘ = f(x + δₘ) and record (δₘ, vₘ)
      Update the surrogate model with (δₘ, vₘ)
Algorithm 1. Bayesian Optimisation Attack
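A toy version of this loop can be written with scikit-learn's Gaussian Process regressor and a lower-confidence-bound acquisition. The "black box" here is a hypothetical scoring function whose decision margin goes negative once the noise is large enough; it is a stand-in for a real model's query interface, and the acquisition choice is ours, not necessarily the one used by AITEST.

```python
# Toy sketch of a Bayesian Optimisation attack in a 2-D noise space.
# blackbox_margin is an illustrative stand-in for querying a deployed model.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def blackbox_margin(delta):
    """Stand-in decision margin at x + delta (< 0 means prediction flipped)."""
    return 1.0 - np.sum(np.abs(delta))     # flips once the L1 norm exceeds 1

# Seed data D0: a few random noise candidates and their queried margins.
D = rng.uniform(-0.8, 0.8, size=(5, 2))
v = np.array([blackbox_margin(d) for d in D])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-6)
for m in range(15):                         # T = 15 iterations
    gp.fit(D, v)                            # update the GP surrogate
    cand = rng.uniform(-1.2, 1.2, size=(256, 2))
    mu, sigma = gp.predict(cand, return_std=True)
    acq = -(mu - 1.96 * sigma)              # lower-confidence bound (minimize margin)
    delta_m = cand[np.argmax(acq)]          # select next noise to try
    v_m = blackbox_margin(delta_m)          # query the black box
    D, v = np.vstack([D, delta_m]), np.append(v, v_m)

adv_noise = D[np.argmin(v)]                 # most adversarial noise found
```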

3.3. Speech-to-Text Model Testing

Automatic speech recognition engines are widely used in many automated tasks, from voice assistants to a myriad of downstream activities, as such systems make it very convenient for humans to convey their intents. Examples include Speech-to-Text services from Google, Microsoft, Apple, Amazon, IBM, etc., and also offline engines like Mozilla's DeepSpeech (Hannun et al., 2014). The performance of the downstream tasks often depends on the precision of the text transcription of the audio, since it is fed to language-understanding models next.

It is important to realize that humans are not always in a conducive environment for giving clear voice commands, and noise artefacts regularly get blended into the input speech. In our framework, we test the robustness of black-box audio transcription models under white background noise and various interfering environmental perturbations to their input clips. We also test simple fairness properties, such as changing the gender or accent of input speech having identical textual content.

We use the Word Error Rate (defined later) of the text transcriptions to measure the effectiveness of our testing.

For the following paragraphs, we consider an audio clip as a one-dimensional signal of appropriately sampled and normalized values, and denote the original clip by x and its corresponding perturbed clip by x̃.

White Noise Transform

Here we perturb the original clip by adding random standard normal noise, scaled by σ, to each of its elements:

x̃ = x + σ · ε

where each element of the noise sequence ε is sampled from a standard normal distribution N(0, 1).

The variance of the noise is decided by a given SNR (Signal-to-Noise Ratio) value, which also specifies the loudness of the noise. SNR is a logarithmic-scale measure of the ratio of the power of two signals:

SNR (dB) = 10 · log₁₀(P_x / P_noise)

where P denotes the average power of a signal. Also note that, since the standard normal distribution has zero mean, the power of the scaled noise is simply its variance:

σ² = P_x / 10^(SNR/10)

Using the above equations, we can compute the perturbed signal for any desired SNR specified during testing.
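The computation above can be sketched in NumPy; the 440 Hz sine wave stands in for a real speech clip, and the target SNR is an arbitrary example value.

```python
# Sketch of the white-noise transform at a target SNR (NumPy only).
# The sine-wave "clip" and the 20 dB target are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
snr_db = 20.0                                  # desired Signal-to-Noise Ratio

x = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))   # 1 s synthetic clip
p_signal = np.mean(x ** 2)                     # average signal power P_x

# Noise power follows from SNR_dB = 10 * log10(P_x / P_noise).
p_noise = p_signal / (10 ** (snr_db / 10))
sigma = np.sqrt(p_noise)                       # std dev of the Gaussian noise

x_tilde = x + sigma * rng.standard_normal(x.shape)
achieved = 10 * np.log10(p_signal / np.mean((x_tilde - x) ** 2))
```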

Interference Transform

In this transformation, we overlay environmental noises that simulate different practical scenarios on top of the original clip. Example scenarios include restaurant, rainfall, festival, water-dripping, and windy noise. We control the signal strength using a parameter α while generating the linear combination of signal and noise that produces the interference effect:

x̃ = (1 − α) · x + α · n

where n is the environmental noise clip.
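One common way to realize such a mix is a convex combination of the clip and the noise; the uniform noise below is only a stand-in for a recorded environmental clip, and α = 0.3 is an arbitrary example.

```python
# Sketch of the interference transform as a convex mix of clip and noise.
# The noise source and mixing weight alpha are illustrative.
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.3                                   # interference strength

x = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 8000))   # original clip
n = rng.uniform(-1, 1, size=x.shape)          # stand-in for e.g. rain noise

x_tilde = (1 - alpha) * x + alpha * n         # overlay the environmental noise
```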
Fairness Transforms

Speech-to-Text transcription models are expected to have similar performance irrespective of sensitive or protected attributes of the speaker. In this context, we test the models by synthesizing perturbed inputs with the same spoken words or tokens in the same language, but flipping the gender (male to female and vice versa) or changing the speaker's accent (e.g., Indian, American, French).

We synthesize the perturbed inputs using open-source speech generation engines or pretrained Text-to-Speech models (such as Tacotron2 (Shen et al., 2018) and WaveGlow (Prenger et al., 2019)) that synthesize audio directly from natural-language transcripts. By training these models on audio from different speakers, we can leverage the learnt transformations.

These transforms try to capture the fairness of the models to the different genders and varying ethnicities of end-users who may be consuming the product or using it in a critical decision-making scenario.

4. Evaluation

In this section, we present a basic evaluation of the key underlying novel testing techniques. This demonstrates that AITEST can be used effectively.

Interpretability Testing for Tabular Data

In Table 2, we provide the interpretability metrics for some well-known datasets: Adult Census Income (Kohavi and Becker), Bank Marketing (Moro et al., 2014), and US Execution (McElwee et al.), all obtained from the UCI repository (Dua and Graff, 2017). We have taken black-box tabular models from the IBM Watson WML framework. Here the user may set appropriate thresholds for the three metrics to decide whether the test passes or fails.

Dataset APL MPL Fidelity(%)
Adult Income 11.4 18 91.12
Bank Marketing 11.29 18 94.25
Execution 8.07 12 97.92
Table 2. Interpretability results


Image Classifier Testing

To illustrate the effectiveness of our framework, we tested a couple of standard models on popular datasets: LeNet5 (LeCun et al., 1998) on MNIST (Lecun and Cortes, 1998) and ResNet32 (He et al., 2016) on CIFAR10 (Krizhevsky, 2009). From Table 3, we can see that the test cases generated by the various transforms in our framework are quite effective in fooling the models. Users can view such cases in the tool, as shown in Figure 2.

Figure 2. Failure Image Transformations
Transform MNIST (LeNet5) CIFAR10 (ResNet32)
None 99.8 92.1
Inverse 97.1 86.1
Scale 96.1 88.2
Rotate 93.7 91
Shear 97.8 85.1
Saturation 97.1 90.1
Brightness 97.3 90.3
Contrast 96.5 90.5
Fog 91.7 86.4
Gaussian Blur 94 85.6
Zoom & Blur 91 82.1
Gaussian Noise 92 89.4
Bayesian Optimisation 10 9
Table 3. Accuracy (%) of various models on transformed examples


Speech-to-Text Model Testing

For evaluating the perturbations, we use Word Error Rate (WER) as a metric, defined as the fraction of words omitted, substituted, or newly inserted in the text transcription of the perturbed clip when compared with the original text. If we have multiple audio samples, we simply average the metric over all samples.
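A minimal word-level edit-distance implementation of this definition is shown below; production evaluations would typically rely on an existing library rather than this sketch.

```python
# Minimal word-level WER via dynamic-programming edit distance,
# matching the definition above (omissions, substitutions, insertions).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                          # deletions (omitted words)
    for j in range(len(hyp) + 1):
        dp[0][j] = j                          # insertions (new words)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("keep holding", "keep clothing"))   # 0.5, as in Table 4
```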

We have tested it in a client scenario which requires a voice assistant to process simple commands or queries. We provide some example failures detected by our method in Table 4. Users can play the original and transformed audio for the failure cases, as shown in Figure 3. We direct interested readers to (Saha, 2021) and our supplementary video, which contain more examples and details.

Figure 3. Speech-to-text Screenshots
Transform type Original Text Perturbed Text WER
White noise I am ready find great if 1.0
Restaurant noise can I talk to someone can 0.8
Water drip noise keep holding keep clothing 0.5
Wind noise I need a minute I need a man 0.25
French accent repeat please he Peter please 1.0
Table 4. Speech-to-Text failure examples

A short video demonstration of our tool with a subset of the properties included in this paper is available at

5. Related Work

In the recent past, there has been a lot of interest in developing techniques for testing DNNs for Computer Vision tasks, varying from simple transforms such as translation, rotation, etc. (Tian et al., 2018), to richer formulations based on GANs (Zhang et al., 2018), Concolic Testing (Sun et al., 2018), and adversarial example generation (Shukla et al., 2020; Kurakin et al., 2016); we leverage some of these techniques in our framework. Similar techniques have also gained traction for speech recognition models. (Carlini and Wagner, 2018) uses fully white-box gradient-based methods to generate targeted and untargeted attacks on models such as (Hannun et al., 2014). While the notions of simulatability and tree-simulatability are not new (Wu et al., 2018), testing a model for tree-simulatability, i.e., interpretability, appears to be novel. Overall, AITEST remains, to our knowledge, the most comprehensive tool for testing across these many modalities.

6. Conclusion and Future Work

The last decade has seen various types of models and their applications. Only recently have efforts been made to define a concrete Data and AI lifecycle (Ishizaki, 2021), and testing plays an important role in that lifecycle to ensure the reliability of AI models. We present a tool, AITEST, which encompasses a variety of testing techniques across multiple modalities. In this paper, we include novel capabilities such as interpretability testing and fairness testing of Speech-to-Text models, along with the implementation of some known properties. The implementation of all these properties under one easy-to-use framework is very useful for serving a variety of client models.

In the future, we plan to add support for testing videos, multi-modal inputs, and model compositions. The current implementation only supports black-box testing, which, once configured, can be applied to a large number of other models of the same type without much change. We also plan to add support for white-box testing algorithms.


  • [1] A. Aggarwal, S. Shaikh, S. Hans, S. Haldar, R. Ananthanarayanan, and D. Saha (2021) Testing framework for black-box ai models. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), pp. 81–84. Cited by: §1.
  • [2] (Last accessed 18th August, 2021) Amazon scraps secret ai recruiting tool that showed bias against women. Note: Cited by: §1.
  • [3] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen (1984) Classification and regression trees. CRC press. Cited by: §3.1, §3.1.
  • [4] N. Carlini and D. Wagner (2018) Audio adversarial examples: targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 1–7. Cited by: §5.
  • [5] M. W. Craven (1996) Extracting comprehensible models from trained neural networks. Ph.D. Thesis, The University of Wisconsin - Madison. Note: AAI9700774 External Links: ISBN 0-591-14495-6 Cited by: §1, §3.1.
  • [6] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §4.
  • [7] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. (2014) Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. Cited by: §3.3, §5.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.
  • [9] (Last accessed 18th August, 2021) IBM aif360. Note: Cited by: §1.
  • [10] (Last accessed 18th August, 2021) IBM ignite. Note: Cited by: §1, §1.
  • [11] K. Ishizaki (2021) AI model lifecycle management: overview. IBM.. External Links: Link Cited by: §6.
  • [12] R. Kohavi and B. Becker Census income data. External Links: Link Cited by: §4.
  • [13] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report . Cited by: §4.
  • [14] A. Kurakin, I. J. Goodfellow, and S. Bengio (2016) Adversarial examples in the physical world. CoRR abs/1607.02533. External Links: Link, 1607.02533 Cited by: §3.2, §5.
  • [15] H. Lakkaraju, S. H. Bach, and J. Leskovec (2016) Interpretable decision sets: a joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1675–1684. Cited by: §3.1.
  • [16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.
  • [17] Y. Lecun and C. Cortes (1998) The MNIST database of handwritten digits. External Links: Link Cited by: §4.
  • [18] A. McElwee, A. Zelenak, and M. Di Marco US executions since 1977. External Links: Link Cited by: §4.
  • [19] S. Moro, P. Cortez, and P. Rita (2014) A data-driven approach to predict the success of bank telemarketing. Decis. Support Syst., pp. 22–31. Cited by: §4.
  • [20] R. Prenger, R. Valle, and B. Catanzaro (2019) Waveglow: a flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621. Cited by: §3.3.
  • [21] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) Why should i trust you?: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §3.1.
  • [22] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh (2020) Beyond accuracy: behavioral testing of NLP models with CheckList. In ACL, pp. 4902–4912. External Links: Document Cited by: §1.
  • [23] D. Saha (2021) Ensuring trustworthy ai through ai testing. IBM Research AI, Bangalore, India.. External Links: Link Cited by: §4.
  • [24] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, et al. (2018) Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783. Cited by: §3.3.
  • [25] S. N. Shukla, A. K. Sahu, D. Willmott, and J. Z. Kolter (2020) Hard label black-box adversarial attacks in low query budget regimes. CoRR abs/2007.07210. External Links: Link, 2007.07210 Cited by: §3.2, §3.2, §5.
  • [26] P. Y. Simard, D. Steinkraus, and J. C. Platt (2003) Best practices for convolutional neural networks applied to visual document analysis. In 7th International Conference on Document Analysis and Recognition (ICDAR 2003), 2-Volume Set, 3-6 August 2003, Edinburgh, Scotland, UK, pp. 958–962. External Links: Link, Document Cited by: §3.2.
  • [27] J. Skeem and J. Eno Louden (2007) Assessment of evidence on the quality of the correctional offender management profiling for alternative sanctions (compas). Unpublished report prepared for the California Department of Corrections and Rehabilitation. Available at: https://webfiles. uci. edu/skeem/Downloads. html. Cited by: §1.
  • [28] Y. Sun, M. Wu, W. Ruan, X. Huang, M. Kwiatkowska, and D. Kroening (2018) Concolic testing for deep neural networks. External Links: arXiv:1805.00089 Cited by: §5.
  • [29] Y. Tian, K. Pei, S. Jana, and B. Ray (2018) DeepTest: automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, New York, NY, USA, pp. 303–314. External Links: ISBN 978-1-4503-5638-1, Link, Document Cited by: §3.2, §5.
  • [30] T. Wang, C. Rudin, F. Doshi-Velez, Y. Liu, E. Klampfl, and P. MacNeille (2017) A bayesian framework for learning rule sets for interpretable classification. The Journal of Machine Learning Research 18 (1), pp. 2357–2393. Cited by: §3.1.
  • [31] M. Wu, M. C. Hughes, S. Parbhoo, M. Zazzi, V. Roth, and F. Doshi-Velez (2018) Beyond sparsity: tree regularization of deep models for interpretability. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §3.1, §5.
  • [32] J. M. Zhang, M. Harman, L. Ma, and Y. Liu (2020) Machine learning testing: survey, landscapes and horizons. IEEE Transactions on Software Engineering. Cited by: §1.
  • [33] M. Zhang, Y. Zhang, L. Zhang, C. Liu, and S. Khurshid (2018) DeepRoad: gan-based metamorphic testing and input validation framework for autonomous driving systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France, September 3-7, 2018, M. Huchard, C. Kästner, and G. Fraser (Eds.), pp. 132–142. External Links: Link, Document Cited by: §5.