Test Metrics for Recurrent Neural Networks

11/05/2019 ∙ by Wei Huang, et al.

Recurrent neural networks (RNNs) have been applied to a broad range of application areas such as natural language processing, drug discovery, and video recognition. This paper develops a coverage-guided test framework, including three test metrics and a mutation-based test case generation method, for the validation of a major class of RNNs, i.e., long short-term memory networks (LSTMs). The test metrics are designed with respect to the internal structures of the LSTM layers to quantify the information of the forget gate, the one-step information change of an aggregate hidden state, and the multi-step information evolution of positive and negative aggregate hidden states, respectively. We apply the test framework to several typical LSTM applications, including a network trained on IMDB movie reviews for sentiment analysis, a network trained on the MNIST dataset for image classification, and a network trained on a lipophilicity dataset for scientific machine learning. Our experimental results show that coverage-guided testing can be used not only to extensively exploit the behaviour of the LSTM layer in order to discover safety loopholes (such as adversarial examples) but also to help understand the internal mechanism of how the LSTM layer processes data.




I Introduction

Recurrent neural networks (RNNs) have been widely used in many application areas such as automated language translation [45], robotic control [42], drug discovery [35], automatic speech recognition [18], time series prediction [32], etc. Most efforts for developing RNNs have been spent on improving empirical accuracy and reducing empirical generalisation error, and less work has been done towards their verification and validation. Verification and validation (V&V) are independent procedures that are used together for checking that a product, service, or system meets requirements and specifications and that it fulfills its intended purpose [1]. Unlike software systems, which are programmed based on their specifications, a learning system is obtained by learning from a set of data samples; that is, the specification of a learning system is less explicit. While research has started to discuss how to establish specifications for learning systems [4], this paper moves towards developing techniques to validate RNNs against their specifications. More specifically, we focus on a major class of RNNs, i.e., long short-term memory networks (LSTMs), and an important specification.

The specification to be considered is robustness, which requires that a prediction made by a learning system is invariant with respect to small perturbations of the input. It has been shown that robustness may be at odds with accuracy, which suggests that, in addition to improving the empirical accuracy, efforts are needed to check, and improve, the robustness of learning systems. Since the discovery of adversarial examples in deep feedforward neural networks (FNNs), particularly convolutional neural networks (CNNs) for image classification [39], the robustness of neural networks has been intensively studied from several aspects, including, e.g., attack methods [7], defence methods [27], verification methods [23, 20, 15], and testing methods [31, 38, 25]. However, research on verification and validation methods for RNNs is sparse, due to the following challenges: (1) verification and testing methods are usually white-box, requiring access to the internal structures of a network; (2) the internal structures of RNNs are much more complicated than those of CNNs: continuous components such as sigmoid and tanh cannot be directly handled by existing verification methods such as SMT-based methods [23], and the iterative nature of RNNs easily leads to scalability problems for search-based methods [20] and abstract-interpretation-based methods [15]; and (3) existing deep learning platforms such as TensorFlow do not support direct, easy access to the internal structures of LSTM layers. Nevertheless, the verification and validation of RNNs are needed; see Figure 1 for a few typical adversarial examples (faults with respect to robustness) for the LSTM networks used in our experiments.

Fig. 1: Adversarial examples for LSTM models trained on MNIST handwritten digits dataset for image classification, an IMDB reviews dataset for sentiment analysis, and a SMILES strings dataset for lipophilicity prediction, respectively.

Our approach is based on coverage-guided testing [47], which has been shown successful in software fault detection. Coverage-guided testing has been extended to work with FNNs in recent work such as [31, 44, 25, 37, 38, 24], where a collection of test metrics and test case generation algorithms are proposed. These metrics are based on the structural information of the FNNs, such as the neurons [31, 25], the relations between neurons in neighbouring layers [37, 38], etc. By generating a set of test conditions and a set of test cases, coverage-guided testing exploits the behaviour of a network by claiming that a certain percentage of test conditions are asserted (or satisfied) by the test cases. When working with RNNs, whose internal structures are much more involved, new test metrics and new test case generation methods are needed to take into account the additional structure and complexity.

We develop three test metrics and a generic mutation-based test case generation method. The test metrics are designed to exploit different functional components of the LSTM networks, by considering both the one-step information change and the multi-step information evolution, and both the gate vectors and the hidden state vectors. Gate and hidden state are internal structural components of an LSTM cell. Specifically, we quantify the filtering ability of forget gates with gate coverage (GC), the one-step change of aggregated hidden states with cell coverage (CC), and the multi-step evolution of positive or negative hidden states with sequence coverage (SC). For test case generation, we iteratively apply a set of mutation operations on seed inputs. When a gradient of the network loss function with respect to the input is obtainable (for continuous inputs such as images), we design a set of mutation operations based on the gradient direction. On the other hand, when a gradient is not obtainable (for discrete inputs such as words), we use user-specified mutation operations.

The effectiveness of structural coverage-guided testing has recently been questioned. The critics raise several concerns: structural coverage can be too loose, test case generation may depend on the adversarial attack algorithm, and there can be a lack of correlation between the misclassified inputs in a test set and its structural coverage.

We develop a tool testRNN (available via https://github.com/TrustAI/testRNN) and conduct an extensive set of experiments on three LSTM networks, trained on an IMDB review dataset, the MNIST image dataset, and a lipophilicity (a physicochemical property of drugs) dataset, respectively. Two of them have a single LSTM layer and the other has two connected LSTM layers. The experimental results show that (1) our test framework is effective, in terms of its efficiency in generating test cases, the (non-trivial) possibility of achieving high coverage, and its ability to discover faults; (2) the test metrics are complementary to each other in guiding the generation of test cases, i.e., the test cases generated by one test metric cannot be applied to achieve high coverage of the other metrics; and (3) the generated test cases can be utilised to understand the working principles of the LSTM networks.

Ii Preliminaries

Feedforward neural networks (FNNs) can be represented as a function f : X → Y, mapping from an input domain X to an output domain Y, and are usually used to perform predictions on an input x ∈ X, or to recognise patterns in x. For a sequence of inputs x_1, …, x_n, an FNN handles them individually without considering results from previous predictions; that is, the result of f(x_i) is independent of the results of f(x_j) when i ≠ j.

By contrast, a recurrent neural network (RNN) is used to work with sequential data. Usually, an RNN contains at least one recurrent layer. Figure 2 illustrates how a sequential input is handled by a recurrent layer by unfolding.

Fig. 2: A simple recurrent layer

A recurrent layer can be represented as a function ρ such that (c_t, h_t) = ρ(c_{t−1}, h_{t−1}, x_t) for t = 1, 2, …, where c_t is the cell state and acts as the intermediate memory, and t denotes the t-th time step. Intuitively, the recurrent layer takes as inputs the current input x_t, the previous memory state c_{t−1} and the previous output h_{t−1}, updates the memory state into c_t, and returns h_t as the current output. Initially, we let c_0 and h_0 be 0-valued vectors. For a (finite) sequence of inputs x_1, …, x_n, this function is applied recursively to them. For example, the popular long short-term memory (LSTM) layer can be represented with the following equations for time t:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c · [h_{t−1}, x_t] + b_c)        (1)
h_t = o_t ⊙ tanh(c_t)

where σ is the sigmoid function such that σ(z) = 1/(1 + e^{−z}) for any z ∈ ℝ, tanh is the hyperbolic tangent function such that tanh(z) = (e^z − e^{−z})/(e^z + e^{−z}) for any z ∈ ℝ, W_f, W_i, W_o, W_c are weight matrices, b_f, b_i, b_o, b_c are bias vectors, f_t, i_t, o_t are internal gate variables, h_t is the hidden state variable (utilising o_t), and c_t is the cell state variable. For the connection with successive layers, we only take the last output h_n as the output. For simplicity, when working with finite sequential data, we can also define a recurrent layer as a function ρ^n, which takes as input sequential data of length n and returns the last output h_n. Normally, the recurrent layer is connected to non-RNN layers such as fully connected layers so that the output is processed further; we let these remaining layers be a function φ. Moreover, there can be feedforward layers connecting to the RNN layer, and we let them be a function ψ. Then, given a sequential input x_1, …, x_n, the RNN is a function N such that

N(x_1, …, x_n) = φ(ρ^n(ψ(x_1), …, ψ(x_n)))

In general, an RNN is a function N which takes as input a sequence x_1, …, x_n. Inputs are processed in parallel with the function ψ, with the results being processed within the RNN layer ρ. The result from the RNN layer is then processed further with the function φ to provide the final output. While the function N above is for networks with a single RNN layer, it can be easily extended to networks with multiple RNN layers.
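As a concrete illustration of the LSTM equations, one time step of a vanilla LSTM cell can be sketched in plain Python. This is a minimal sketch with illustrative parameter names (W, U, b stack the four transforms), not the implementation used in the paper:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One time step of a vanilla LSTM cell with hidden size n.
    W maps the input, U maps the previous hidden state, and b is the
    bias; all three stack the forget/input/output/candidate transforms,
    so they have 4n rows."""
    n = len(h_prev)
    z = [a + u + k for a, u, k in zip(matvec(W, x_t), matvec(U, h_prev), b)]
    f_t = [sigmoid(v) for v in z[0:n]]            # forget gate
    i_t = [sigmoid(v) for v in z[n:2 * n]]        # input gate
    o_t = [sigmoid(v) for v in z[2 * n:3 * n]]    # output gate
    g_t = [math.tanh(v) for v in z[3 * n:4 * n]]  # candidate update
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t, f_t
```

Returning the forget gate value f_t alongside h_t and c_t reflects the fact that the test metrics below need access to these internal quantities, which standard frameworks do not expose directly.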

Deep (recurrent) neural networks consist of multiple layers; we write L_k for the k-th layer. We write V_k for the set of variables (or neurons) of layer k, where v_{k,i} is the i-th variable among the s_k variables at layer k.

Iii Internal Information of LSTM Cells

In this part, we first formulate two types of internal information of LSTMs that will be used later to define our coverage criteria. For an LSTM network, there is a sequence of inputs x_1, …, x_n, gate values f_t, i_t, o_t, and hidden states h_t, where each element is a vector of real numbers. We use a variable v to range over these vectors, use v_t to denote the value of v at time t, and write v_{t,j} to denote the j-th component of v_t. Moreover, for any such quantity, we may use a subscript x, written as, e.g., f_{t,x}, h_{t,x}, etc., when it is needed to refer to its value for the corresponding input x.

In the following, we consider abstract information from the internal structure, i.e., hidden state vectors and gate vectors, of an LSTM cell, and present their intuitive meanings. This information will be used to design test metrics in Section IV.

Aggregate information of hidden states

According to Equation (1), the component values of h_t vary between -1 and 1. It is therefore reasonable to divide the possible information of a component into positive (> 0) and negative (< 0). For h_t, we take the aggregate information [28] by summing up positive and negative component values, respectively, such that

ξ_t^+ = Σ_{j : h_{t,j} > 0} h_{t,j}        ξ_t^− = Σ_{j : h_{t,j} < 0} h_{t,j}

Intuitively, ξ_t^+ represents the extent to which the hidden state contains positive information and ξ_t^− represents the extent to which it contains negative information. In the following, we show how ξ_t^+ and ξ_t^− can be utilised to provide intuition into the processing of inputs by an LSTM layer. Basically, we consider both a one-step change and a multi-step change of information based on these quantities.
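The aggregate information can be computed directly from a hidden state vector; a minimal sketch follows. The one-step change below is one plausible form of the comparison with the previous step (total absolute change of both aggregates), stated here as an assumption rather than the paper's exact formula:

```python
def aggregate_info(h_t):
    """Sum the positive and the negative components of a hidden state
    vector separately, giving the (positive, negative) aggregate
    information."""
    xi_pos = sum(v for v in h_t if v > 0)
    xi_neg = sum(v for v in h_t if v < 0)
    return xi_pos, xi_neg

def one_step_change(h_t, h_prev):
    """One-step information change between consecutive hidden states
    (assumed form: total absolute change of both aggregates)."""
    p1, n1 = aggregate_info(h_t)
    p0, n0 = aggregate_info(h_prev)
    return abs(p1 - p0) + abs(n1 - n0)
```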

Fig. 3: Four examples to show how positive and negative elements of hidden state vectors represent the information in MNIST and IMDB models. The x-axis includes the inputs (bottom row) and the y-axis includes (top row), (second row) and (third row) values.

Consider a sentiment analysis example, where an LSTM network is trained to classify the sentiment (positive or negative) of movie reviews. As shown in Figure 3, assume that the input is, e.g., “horrible movie and really bad watching experience” or “I really liked the movie and had fun”; we consider the quantity Δξ_t, which compares the aggregate information (ξ_t^+, ξ_t^−) of the current step t with that of the previous step t−1. Intuitively, Δξ_t is an abstract representation of the information updated between hidden states. From the bar charts in the third row of Figure 3, we can see that sensitive words such as “like”, “horrible”, and “fun” trigger greater Δξ_t values than non-sensitive words such as “movie”, “really”, and “had”. Also, for MNIST images where the t-th column is processed as the cell input x_t, more informative columns, such as the 9th and 10th columns of both the digits “2” and “3”, trigger greater values than the first and last columns.

We can track the change of ξ_t^+ and ξ_t^− over a certain time span. Figure 3 presents two curves for each input: the one in the top row is for ξ_t^+ and the one in the second row is for ξ_t^−.

Abstract information of gates

We now consider the information represented in the gates f_t, i_t, and o_t. In the LSTM, these gates have intuitive meanings. For example, the forget gate f_t controls the portion of information that should pass through the gate, the input gate i_t determines how much information will be added to the cell state, and the output gate o_t determines which part of the cell state to output.

Definition 1

Let f_{t,x} be the value of the forget gate at time t when the input is x. We use the remember rate ζ_t(x) to denote the portion of information passing through the gate at time t that can be remembered, given the input x. Formally, we have

ζ_t(x) = (1/|f_{t,x}|) Σ_j f_{t,x,j}

where |f_{t,x}| denotes the dimension of the gate vector. It is easy to see that ζ_t(x) is within the range [0, 1], since all components in f_{t,x} have their values bounded in [0, 1].
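Under the averaging interpretation of Definition 1 (an assumption, but one consistent with the claim that the rate lies in [0, 1]), the remember rate is a one-liner:

```python
def remember_rate(f_t):
    """Average of the forget-gate components. Each component lies in
    (0, 1), so the average does too. (The averaging form is an assumed
    reading of Definition 1.)"""
    return sum(f_t) / len(f_t)
```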

Iv Test Metrics for LSTM

Existing test coverage metrics for neural networks [31, 25, 37, 38, 44, 24] focus on feedforward structures such as convolutional layers, fully connected layers, max-pooling layers, etc. In this paper, we propose a set of structural coverage metrics to work with LSTM layers and sequential inputs, considering the internal information of the cells as discussed in Section III.

Let C be a set of test conditions, assert an assert method, and T a test suite. A test coverage metric on a network is a function cov(C, assert, T). Intuitively, cov(C, assert, T) quantifies the degree of adequacy to which a DNN is tested by a test suite T with respect to a set of test conditions C and an assert method assert. Usually, the greater the number cov(C, assert, T), the more adequate the testing. In this section, we develop the sets of test conditions and their corresponding assert methods and metrics. A test suite T contains a set of test cases, and we will discuss in Section VI the algorithm we use in our experiments to generate test suites.

Iv-a Test Conditions

Every test condition represents some functionality or behaviour of the system that we want to verify. In the following, we introduce methods to define test conditions for LSTM networks.

Neuron Coverage

As a simple baseline, we extend the idea of neuron coverage [31] to work with LSTM layers. Let C_NC be the set of test conditions, each of which represents a hidden neuron under consideration. For an LSTM layer, C_NC includes the component neurons in the last output h_n, for an input of length n. Note that C_NC does not include component neurons in any h_t for t < n, since they are intermediate results of the LSTM and are not obtainable by simply observing the output of an LSTM layer.

Definition 2

Assume that we are working with the k-th layer, which is an LSTM layer. A neuron is asserted by a test case x if its value at the last time step exceeds a designated threshold. Recall that we use h_{n,j,x} to denote the j-th component of the hidden state at the last time step n when given the input x.

As will be shown in the experiments in Section VIII, 100% neuron coverage (the concept of coverage will be formally defined later in Section IV-B) can be easily achieved with a small number of test cases. This makes neuron coverage less powerful for testing, which aims to exercise the system with a non-trivial number of test cases in order to expose incorrect behaviours of a network. One reason neuron coverage is weak is that it treats neurons as independent units, without considering their collective behaviour. However, the key strength of neural networks is their capability of extracting features (which are usually represented with a set of neurons) and, based on this, combining a set of features to conduct high-level tasks such as pattern recognition. It is not hard to see that a feature in itself can be harder to assert with a set of test cases than the neurons supporting it.

Instead of considering only the individual neurons, in the following we design several test metrics for LSTM layers that take into account the internal information of a vector of neurons. In particular, the aggregate information of hidden states and the abstract information of gates in Section III will be used.

Gate Coverage

The first set of new test conditions is designed to cover the behaviour of the internal gates, i.e., f_t, i_t, o_t. In this paper, we use the forget gate as an example. The behaviour of the forget gate at time t, i.e., f_t, affects not only the behaviour of the cell at time t locally but also the overall behaviour of the LSTM layer. We let C_GC = {ζ_t | 1 ≤ t ≤ n} be the set of test conditions, such that ζ_t is the forget gate value at time t, together with a threshold value v_g.

Definition 3

A test condition ζ_t is asserted by a test case x if ζ_t(x) > v_g, i.e., the forget gate rate at time t is greater than the designated threshold v_g.

An important benefit of studying the forget gate is in understanding the cell state c_t. According to Equation (1), f_t is used to directly control the information flowing from c_{t−1} to c_t. However, because c_t represents long-term memory, its component values are not bounded, which makes the design of test conditions on it harder to fully automate, i.e., they may have to be problem specific. The other benefit is that gate activation statistics have been used to explain the internal mechanisms of LSTMs [22], and, as will be discussed in the experiments, the study of forget gates enables our understanding of the learning process through a memorisation curve.

Cell Coverage

The cell is the basic working unit in an LSTM for dealing with the information flow. According to Equation (1), in addition to the gates (the forget gate f_t and the input gate i_t), the hidden state h_{t−1} is also a factor determining the value of c_t. The first set of test conditions on hidden states is based on quantifying the one-step change of the hidden output, measured by the hidden aggregate information change Δξ_t. Therefore, we let C_CC = {Δξ_t | 1 ≤ t ≤ n} be the set of test conditions, such that Δξ_t is the hidden aggregate information change at time t, together with a threshold value v_c.

Definition 4

A test condition Δξ_t is asserted by a test case x if Δξ_t(x) > v_c. Recall that we use Δξ_t(x) to refer to the value of Δξ_t when the input is x.

Sequence Coverage

As explained earlier, we are also interested in checking, for an LSTM model, how information is updated along the time series. This is essentially a time series classification problem, and we can employ a symbolic representation method to divide the hidden sequence patterns into several classes. We focus on the sequences of positive hidden information ξ_t^+ and negative hidden information ξ_t^−.

Definition 5 (Symbolic Representation of Time Series)

Let D be the (possibly unbounded) range of values of the quantity ξ_t^+. Given a finite set Γ of symbols, we split D into a set of sub-ranges {D_γ | γ ∈ Γ} such that ∪_{γ ∈ Γ} D_γ = D and, for all γ_1 ≠ γ_2, D_{γ_1} ∩ D_{γ_2} = ∅. Based on this, for any value u of ξ_t^+, we associate it with a symbol γ whenever u ∈ D_γ. One step further, every sequence ξ_1^+, …, ξ_n^+ can be associated with a clause γ_1 γ_2 … γ_n such that ξ_t^+ ∈ D_{γ_t} for all 1 ≤ t ≤ n.

While Definition 5 is for ξ_t^+, it can be extended to work with ξ_t^−. Intuitively, it discretises the domain of a continuous quantity into a finite set of sub-ranges, associates each sub-range with a symbol in Γ, and then projects each component of a time series onto a symbol by matching its value with a sub-range. By this process, a time series is transformed into a sequence of symbols, or a clause.

Fig. 4: A time series is transformed into symbolic representation with three symbols

Practically, the symbolic representations of time series are obtained by the following steps, as illustrated in Figure 4. First, a large amount of sequence data is sampled to estimate its distribution. Second, a set of breakpoints is determined by dividing the area under the distribution curve into sub-areas of equal volume; every sub-area is associated with a symbol. This step ensures that all symbols appear with equal probability. Finally, a clause is generated for each time series by replacing each value with the symbol corresponding to its respective sub-area.
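The steps above can be sketched as a SAX-style discretisation. The sketch below assumes a normal distribution has been fitted to the sampled data (as done in Section VIII-B); the function names are illustrative:

```python
from statistics import NormalDist

def breakpoints(num_symbols, mu=0.0, sigma=1.0):
    """Breakpoints that split a fitted normal distribution into
    num_symbols sub-areas of equal probability, so every symbol
    occurs equally often."""
    nd = NormalDist(mu, sigma)
    return [nd.inv_cdf(k / num_symbols) for k in range(1, num_symbols)]

def symbolise(series, bps, symbols):
    """Replace each value of a time series with the symbol of the
    sub-range it falls into, producing a clause."""
    def sym(v):
        for b, s in zip(bps, symbols):
            if v <= b:
                return s
        return symbols[-1]
    return "".join(sym(v) for v in series)
```

For example, with three symbols the breakpoints under a standard normal are roughly ±0.43, so a series dipping below −0.43, staying near 0, and rising above 0.43 maps to the clause "abc".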

After obtaining the symbolic representation of time sequences, we can define test conditions. Given Γ and a range of time steps, we let C_SC be a set of test conditions, each of which is a clause over Γ ∪ {⊥}, where ⊥ is a special symbol representing that the place is not within the considered range. Given a sequence s, we write s_t for 1 ≤ t ≤ n to denote its t-th component.

Definition 6

Given a test case x and a test condition s, we say that s is asserted by x with positive information if, for all 1 ≤ t ≤ n, s_t ≠ ⊥ implies that the symbol of ξ_{t,x}^+ is s_t. Moreover, we say that s is asserted by x with negative information if, for all 1 ≤ t ≤ n, s_t ≠ ⊥ implies that the symbol of ξ_{t,x}^− is s_t.

Intuitively, sequence coverage is designed to test all possible time series of either the positive information ξ_t^+ or the negative information ξ_t^−.

Iv-B Test Metrics

Let C_NC, C_GC, C_CC, and C_SC be the sets of test conditions as defined in Section IV-A. Based on them, we define a generic metric as follows.

Definition 7

Let C be a set of test conditions with assert method assert, and T a set of test cases. The test coverage metric cov(C, assert, T) for a network is defined as the percentage of the test conditions in C asserted by the test suite T with method assert. Formally,

cov(C, assert, T) = |{c ∈ C | ∃x ∈ T : assert(c, x)}| / |C|
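Definition 7 translates directly into code; a minimal sketch, where `assert_fn` stands in for any of the assert methods defined in Section IV-A:

```python
def coverage(conditions, assert_fn, test_suite):
    """Fraction of test conditions asserted by at least one test case
    in the suite (Definition 7)."""
    asserted = sum(
        1 for c in conditions if any(assert_fn(c, x) for x in test_suite)
    )
    return asserted / len(conditions)
```

For instance, with threshold-style conditions, a single test case with value 0.6 asserts the conditions with thresholds 0.2 and 0.5 but not 0.9, giving a coverage of 2/3.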
V Test Oracle

A test oracle is usually employed to determine whether a test passes or fails. Test oracle automation is important to remove a bottleneck that inhibits greater overall test automation [5]. For our testing of LSTM networks, we employ a constraint (or, more broadly, a program) to determine whether a generated test case passes the oracle or not. A test case that does not pass the test oracle suggests a potential safety loophole. It is usually up to the designer (or developer) of the LSTM network to determine which constraint should be taken as the oracle, since it relates to the correctness specification of the network under design. In this paper, as an example to exhibit our test framework (i.e., test metrics and test case generation algorithm), we take the intensively studied adversarial example problem [39] to design our test oracle, and remark that the test oracle in general is orthogonal to the test framework developed in this paper.

Practically, to work with adversarial examples, we define a set of norm balls, each of which is centred around a data sample with a known label. The radius r of the norm balls is chosen such that, intuitively, a human cannot differentiate between inputs within a norm ball. In this paper, we consider the Euclidean distance. We say that a test case x' does not pass the oracle if and only if (1) x' is within the norm ball of some training sample x, i.e., ||x' − x||_2 ≤ r, and (2) x' has a different classification from x, i.e., N(x') ≠ N(x).
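The oracle check can be sketched in a few lines (`classify` is a placeholder for the network's prediction function; the function name is illustrative):

```python
import math

def fails_oracle(test_case, seed, radius, classify):
    """A test case fails the oracle iff it lies inside the Euclidean
    norm ball around its seed yet receives a different classification."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(test_case, seed)))
    return dist <= radius and classify(test_case) != classify(seed)
```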

Vi Test Case Generation

We present a generic test case generation algorithm used in our experiments. To avoid “gaming against criteria”, our test case generation does not use the test metrics as targets, as recommended by Chilenski and Miller in their seminal work [10]. Given an LSTM network and a specified test metric, a test suite is generated; the coverage result and the number of adversarial examples in the test suite are used to evaluate the robustness of the LSTM network under test. The effectiveness of our test generation algorithm is justified by the experiments in Section VIII.

The test generation in Algorithm 1 starts from a set of seed inputs X, and the function m is a pre-defined mutation function. Initially, the test suite T contains only the inputs from X. A dictionary D is defined to map every generated test case/mutant to its original input in X, and every seed input from X is mapped to itself.

As long as the coverage condition, such as neuron coverage, is not satisfied, the test generation loop iterates. At each iteration, a test case t is sampled from the current T and mutated using m to obtain a new input t'. To guarantee the quality of the generated test cases, t' is only added into the test suite if it falls within the norm ball with respect to its original seed input. Finally, the generated test suite is returned.

Input: N: DNN to be tested
X: a set of seed inputs
m: a mutation function
Output: T: a set of test cases
1  T ← X
2  D ← empty dictionary
3  D[t] ← t for all t ∈ X
4  while coverage condition is not satisfied do
5      t ← sample an element in T
6      t' ← m(t)
7      if ||t' − D[t]|| ≤ r then
8          T ← T ∪ {t'}
9          D[t'] ← D[t]
10 return T
Algorithm 1 A Generic Test Case Generation Algorithm

The test case generation in Algorithm 1 is indeed very generic; the core part is defining the mutation function m. In the following, we introduce the mutation operations used in this paper.
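A minimal Python sketch of Algorithm 1 follows (function and parameter names are illustrative; `covered` encapsulates the chosen coverage condition, and an iteration cap is added so the sketch always terminates):

```python
import random

def generate_tests(seeds, mutate, covered, distance, radius, max_iter=1000):
    """Coverage-guided test generation: start from the seeds, repeatedly
    mutate a sampled test case, and keep the mutant only if it stays
    inside the norm ball of its original seed."""
    suite = list(seeds)
    origin = list(range(len(seeds)))   # origin[i]: seed index of suite[i]
    steps = 0
    while not covered(suite) and steps < max_iter:
        steps += 1
        i = random.randrange(len(suite))          # sample a test case
        mutant = mutate(suite[i])
        if distance(mutant, seeds[origin[i]]) <= radius:
            suite.append(mutant)                  # accept the mutant
            origin.append(origin[i])              # remember its seed
    return suite
```

The `origin` list plays the role of the dictionary D in Algorithm 1: every accepted mutant inherits the seed of the test case it was derived from, so the norm-ball check is always made against the original seed rather than against an intermediate mutant.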

Gradient-Based Mutation

For models where the gradient of the network loss function with respect to the input is obtainable (continuous inputs such as images), we can utilise Algorithm 1 to search the input space along the gradient direction. Let J(x, y) be a cost function over the model N, with input x and output y, and let V be a set of variables (hidden or not). We define the gradient of the loss with respect to V as

∇_V J = ∂J(x, y) / ∂V

Normally, we take V to be the set of variables of some layer k; for example, taking V to be the input variables gives the usual gradient over the input dimensions. Finally, we let the mutation function move the input along the direction of this gradient by a given step size. That is, the gradient-based mutation takes two parameters: the variables V and the step size.
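A gradient-step mutation can be sketched as below. Here `grad_fn` returns the loss gradient with respect to the input; the sign-step (FGSM-style) form and the parameter name `alpha` are illustrative assumptions, not the paper's exact mutation:

```python
def gradient_mutate(x, grad_fn, alpha):
    """Move each input dimension one step of size alpha along the sign
    of the loss gradient, so as to increase the loss (FGSM-style)."""
    g = grad_fn(x)
    step = lambda gi: 1.0 if gi > 0 else (-1.0 if gi < 0 else 0.0)
    return [xi + alpha * step(gi) for xi, gi in zip(x, g)]
```

For a toy loss J(x) = Σ(x_i − 1)² the analytic gradient is 2(x − 1), so mutating x = 0 with step size 0.05 moves it to −0.05, away from the loss minimum.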

Random Mutations

For models where the gradient of the cost function is not obtainable (discrete inputs such as words), we define a set M of mutation functions that randomly mutate an input. At each step of the test generation, m is instantiated by one function in M. Given different kinds of input data, the randomisation operations allowed may also differ. The mutation operations used in our experiments are detailed in Section VIII-B.

Fig. 5: Test Framework

Vii Test Framework

Figure 5 presents the overall architecture of our test framework for LSTMs. Given the LSTM network N, a training dataset, an oracle (Section V), a test metric (Section IV), and a mutation method (Section VI), it first selects a set of seed inputs, and then the test case generation method in Algorithm 1 is applied to generate a test suite. The test generation terminates when a certain level of coverage has been reached or a certain number of test cases have been generated. We can then evaluate the results: (1) by considering the oracle, we obtain information about adversarial examples; (2) by considering the test metric, we obtain the coverage rate. Practically, the adversarial examples can be used for data augmentation; given that this is a method well explored in other works such as [39], we do not pursue this direction in the paper. The coverage rate is used to determine whether the network has been adequately tested by the test suite.

Viii Experimental Results

This section presents our experimental results on several typical LSTM networks. We evaluate the effectiveness of coverage-guided LSTM testing using adversarial examples as a proxy, compare different metrics and show their complementarity in guiding test case generation, and utilise the generated test cases to assess and understand the internal mechanism of LSTMs. All experiments are run on a CPU server with an Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70 GHz and 64 GB memory.

Viii-a LSTM Models

In the experiments, we consider the use of LSTM networks in different domains, including a network trained on MNIST dataset for image classification, a network trained on an IMDB review dataset for sentiment analysis, and a network trained on a Lipophilicity dataset for molecular machine learning.

MNIST Handwritten Digits Analysis by LSTM

The MNIST database, containing a set of grey-scale images of handwritten digits of size 28 × 28, is used to train a model with 4 layers; Figure 6 depicts its structure. The first two layers are LSTM layers, which are correspondingly connected and fed with the rows of the input images. That is, each input image is encoded by the first LSTM layer as a sequence representation of shape (28, 128), and the second LSTM layer then does further processing to output a vector representing the whole image. Finally, two fully connected layers, with ReLU and SoftMax activation functions respectively, are used to process the extracted feature information to obtain the classification result. The model achieves high accuracy on both the training dataset (60,000 samples) and the default MNIST test dataset (10,000 samples).

Fig. 6: An illustration of an LSTM network trained on MNIST database.
Sentiment Analysis by LSTM

The sentiment analysis network [36] has three layers: an embedding layer, an LSTM layer, and a fully connected layer. The embedding layer takes as input a vector of length 500 and outputs a matrix, which is then fed into the LSTM layer. Subsequently, there is a fully connected layer of 100 neurons.

Lipophilicity Analysis by LSTM

We trained an LSTM network on a Lipophilicity dataset downloaded from MoleculeNet [46]. The model has four layers: an embedding layer, an LSTM layer, a dropout layer, and a fully connected layer. The input is a SMILES string (Figure 8) representing a molecular structure and the output is its predicted lipophilicity. A dictionary is used to map the symbols in a SMILES string to integers. We use the length of the longest SMILES string in the training dataset as the number of cells of the LSTM layer. Similarly to the text processing in the IMDB model, short SMILES inputs are padded with 0s on the left. We use the root mean square error (RMSE) as the measure of model accuracy. Our model is trained to RMSE = 0.2371 on the training dataset and RMSE = 0.6278 on the test dataset, which is better than the traditional and convolutional methods reported in the literature.


Viii-B Experimental Setting

Symbolic Representation based on Empirical Distribution

As discussed in Section IV, the symbolic representation of a time series is obtained by sampling a number of data points and splitting the area under the curve of the empirical distribution, so that each sub-area is associated with a symbol. In general, the memory data approximately follow a normal distribution; we collect statistics on the three experimental models and find that the abstract information from the LSTM layers conforms to this hypothesis. Figure 7 plots the statistical distribution of the positive information and the negative information for the Sentiment Analysis network. We can see that the empirical distributions largely follow a Gaussian distribution (red curves), as shown in Figure 7.
Fig. 7: Statistical distribution of positive and negative information at LSTM sequence steps 480-484; test cases for the Sentiment Analysis model are considered.
Selection of Mutation Operations

We use Algorithm 1 to generate test cases; here, we detail the mutation operations used for the different models. For the MNIST model, because the input vector space is continuous, we are able to apply the gradient-based mutation, in particular by configuring its two parameters.

Unlike the MNIST model, where a small change to an input image still yields a valid image, the input to the IMDB model is a sequence of words, on which a small change may lead to an unrecognisable (and invalid) text paragraph. Based on this observation, a gradient-based method is not useful, and we take a random mutation approach. This is implemented using the EDA toolkit [43], which was originally designed to augment training data to improve text classification. In our experiments, we consider four mutation operations: (1) synonym replacement, (2) random insertion, (3) random swap, and (4) random deletion. The meaning of the text is preserved under all mutations.
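Two of the four operations can be sketched in a few lines (synonym replacement and random insertion additionally need a thesaurus, so they are omitted; this is an illustrative sketch, not the EDA toolkit's code):

```python
import random

def random_swap(words):
    """Swap two randomly chosen word positions."""
    w = list(words)
    if len(w) >= 2:
        i, j = random.sample(range(len(w)), 2)
        w[i], w[j] = w[j], w[i]
    return w

def random_deletion(words, p=0.1):
    """Delete each word independently with probability p; always keep
    at least one word so the text does not vanish."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]
```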

Fig. 8: The same molecular structure represented by different SMILES strings

For the Lipophilicity model, we consider a set of mutations which change the SMILES string without affecting the molecular structure it represents. The enumeration of possible SMILES for a molecule relies on the Python cheminformatics package RDKit [34]: each input SMILES string is converted into molfile format, the atom order is changed randomly, and the result is converted back. As shown in Figure 8, several SMILES strings can represent the same molecular structure. The enumerated SMILES serve as test cases to check whether the model really learns the mapping from molecular structure to the lipophilicity logD value, rather than merely the syntax of the SMILES representation.

Oracle Setting

In our experiments, we set a different test oracle radius depending on the model. For continuous input problems like MNIST, it is easy to calculate the distance between input images, and we run the experiments (except for those in Section VIII-C5) with . For sentiment analysis, it is ambiguous to quantify text meaning; however, our mutations only apply small perturbations to the original text, with the constraint that the sentiment does not change. The case for the Lipophilicity model is similar: we use the enumeration approach to generate variant SMILES strings of the same molecule, such that the obtained SMILES strings have the same molecular structure as the original string.
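For the continuous-input case, the oracle can be read as an L2 ball of the given radius around each seed; a minimal sketch (the function names are ours):

```python
import math

def within_oracle(seed, mutant, radius):
    # A mutant is a valid test case only if it stays inside the
    # oracle ball: ||mutant - seed||_2 <= radius.
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(seed, mutant)))
    return dist <= radius

def is_adversarial(seed, mutant, predict, radius):
    # An adversarial example is a valid test case that the model
    # classifies differently from its seed.
    return within_oracle(seed, mutant, radius) and predict(mutant) != predict(seed)
```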

VIII-C Effectiveness in Testing

In this part, we show the effectiveness of our coverage-guided LSTM testing from the following aspects: (1) the test case generation is efficient, (2) the generated test suite achieves high coverage that is non-trivial, and (3) there is a significant percentage of adversarial examples in the test suite. All experimental results are averaged over 5 runs with different random seeds.

VIII-C1 Coverage Improvement by Generated Test Cases

First, we compare the coverage obtained by sampling 500 inputs from the training dataset with that obtained by generating test cases from these samples using Algorithm 1. Table I shows the results. We can see that the coverage level is significantly improved with the generated test suite, and this comes with a non-trivial percentage of adversarial examples.
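Algorithm 1 is defined earlier in the paper; purely as a sketch, a coverage-guided generation loop of this kind can be written as follows, assuming a hypothetical `coverage(x)` that returns the set of test-condition identifiers a case satisfies:

```python
import random

def coverage_guided_generation(seeds, mutate, coverage, budget, rng=random):
    # Keep a mutant only if it satisfies some previously uncovered
    # test condition; otherwise discard it and continue mutating.
    suite = list(seeds)
    covered = set()
    for x in suite:
        covered |= coverage(x)
    for _ in range(budget):
        mutant = mutate(rng.choice(suite))
        gained = coverage(mutant) - covered
        if gained:
            suite.append(mutant)
            covered |= gained
    return suite, covered
```

Because a mutant is kept only when it covers something new, the suite grows monotonically in coverage until the budget is spent or all conditions are met.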

VIII-C2 Neuron Coverage is Easy to Cover

In our experiments we found that neuron coverage can be trivially achieved. For the MNIST model, on average 35 generated test cases are sufficient to cover all the test conditions. For the IMDB model, it takes 7 test cases on average to reach full neuron coverage, and only 20 test cases are needed for the Lipophilicity prediction model. The detailed running results are shown in Figure 9. This shows that neuron coverage is not a suitable metric for RNNs, which also justifies the necessity of our new metrics for LSTM networks.

Fig. 9: The updating process of Neuron Coverage in 50 test cases.
MNIST (thresholds = 6 and = 0.78; symbols {a, b}; running time 555 s)
  500 seed inputs:       0 adversarial, avg. perturbation 0 (L2);    coverage 0.75 / 0.77 / 0.72 / 0.66
  generated test suite:  592 adversarial, avg. perturbation 1.99;    coverage 0.89 / 0.89 / 0.86 / 0.84

Sentiment Analysis (thresholds = 7 and = 0.73; symbols {a, b}; running time 726 s)
  500 seed inputs:       0 adversarial, avg. perturbation 0;         coverage 0.68 / 0.65 / 0.91 / 0.94
  generated test suite:  82 adversarial, avg. perturbation 0.11;     coverage 0.90 / 0.91 / 0.94 / 0.94

Lipophilicity prediction (thresholds = 5 and = 0.75; symbols {a, b}; running time 236 s)
  500 seed inputs:       0 adversarial, avg. perturbation 0;         coverage 0.80 / 0.88 / 0.94 / 0.91
  generated test suite:  779 adversarial, avg. perturbation 0.949;   coverage 1.00 / 0.99 / 0.94 / 0.94

TABLE I: Cell, gate & sequence coverage for the LSTM models trained on three different datasets (coverage columns: Cell / Forget Gate / Positive Sequence / Negative Sequence; perturbations in L2 norm; running time in seconds)
Fig. 10: Illustration of cell, gate & sequence coverage updating over 2000 test cases for MNIST (28 cells, seq. in [19-23]), Sentiment Analysis (500 cells, seq. in [480-484]) and Lipophilicity Prediction (80 cells, seq. in [70-74])

VIII-C3 Gate Coverage and Cell Coverage

The coverage threshold is a hyper-parameter which needs to be set before the testing process starts. It is interesting to understand how the setting of the threshold affects the testing result. In our experiments, for both cell and gate coverage, we sample 1000 test cases and use the averaged gate value and the averaged aggregate information change value as the median threshold values, and then gradually decrease and increase them to obtain a set of experimental settings.


MNIST Model

We set thresholds and test the LSTM-2 layer. Since the MNIST model has 28 cells in its LSTM layer, the total number of test conditions to cover is 28 for both cell and gate coverage.

Sentiment Analysis Model

The number of LSTM cells is 500. Thus, there are 500 test conditions for both cell and gate metrics. The threshold values are set as and .

Lipophilicity Prediction Model

We have and . There are 80 cells in the LSTM layer, which means 80 test conditions to be asserted for each metric.

As mutated samples are added to our test suite, the coverage rate gradually increases. However, it is common that the coverage rate becomes difficult to improve after reaching a certain level, because the remaining test conditions can only be asserted by corner cases which are hard to generate. In our experiments, we record the updating process over 2000 mutants. The plots are shown in Figure 10.

A first observation from the coverage updating results is that small threshold values are less effective in finding adversarial examples. E.g., when we set for MNIST, 40 test cases can meet all test conditions. Since the test metrics guide the algorithm to terminate, and weak test conditions are easier to satisfy, fewer test cases are explored. In contrast, if the threshold is assigned a higher value, the coverage rate may stay at a relatively low level and be hard to improve. More test cases are considered, including the corner cases we want to test. Nevertheless, the time consumption rises for the same coverage goal, and it is very likely that some test conditions are too strict to meet.

We list experimental results for certain thresholds and compare the results between the original seed inputs and the mutants in Table I. It shows that our mutation-based test generation has the following advantages. First, a significant number of adversarial examples are found. As stated in Section VIII-A, the three LSTM models have high accuracy on the training dataset. The mutation method enables us to search for misclassified points very close to the training data points; these adversarial data points can be utilized to retrain the model for better robustness. Second, the mutants help raise the coverage rate of a test set. Although there is no strong relationship between coverage rate and adversarial examples, a higher coverage rate in a test set with a certain amount of adversarial examples gives us more confidence that we can find various kinds of adversarial examples induced by different root causes.

VIII-C4 Sequence Coverage

Sequence coverage is more complicated than cell and gate coverage, because the number of sequence patterns is exponential in the sequence length. For example, if we test the whole hidden information sequence of the MNIST model with 2 symbols, there are 2^28 different patterns to be added as test conditions. Moreover, as discussed earlier, it is likely that a certain portion of patterns cannot be asserted: the information update inside an LSTM is a progressive process, which means some hugely-fluctuating patterns are in practice hard to produce in our experiments. In order to simplify the problem and provide more customized options, the testing range of the LSTM sequence is decided by the users, with the test conditions as given in Equation (6). In our experiments, we focus on the coverage of sequence patterns at intermediate cells, i.e., cells [19-23] for MNIST, [480-484] for Sentiment Analysis, and [70-74] for Lipophilicity Prediction.
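The blow-up can be made concrete: the test conditions for a user-chosen cell range are all symbolic patterns over that range (a sketch; the function name is ours):

```python
from itertools import product

def sequence_conditions(alphabet, start, end):
    # All symbolic patterns over cells [start, end]; their number is
    # |alphabet| ** (end - start + 1), hence the exponential blow-up.
    return ["".join(p) for p in product(alphabet, repeat=end - start + 1)]
```

For the 5-cell range [19, 23] this gives 2^5 = 32 conditions with 2 symbols and 3^5 = 243 with 3.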

Our experiments exercise sequence coverage using 2 and 3 symbols for each model to represent the sequence data and test the sequential patterns. For a testing range of 5 cells, the total number of test conditions to be asserted is 2^5 = 32 for 2 symbols and 3^5 = 243 for 3 symbols.

The experimental results in Table I and Figure 10 indicate that sequence coverage is a rigorous test requirement. Although we only try to cover the sequence information inside 5 cells, thousands of test cases are generated. The 2-symbol representation shows a good coverage result within 2000 test cases. However, for the 3-symbol representation, very few sequence patterns are found with the same number of test cases. It is expected that using more symbols for a more precise representation would reduce the coverage rate further, even with more test cases. On the other hand, this reveals the complexity of LSTM memory: if sequence coverage is taken as a test of the memory of RNNs, the sophisticated memory storing mechanism makes the network flexible enough to make predictions for a large variety of sequential tasks.

VIII-C5 The Impact of Oracle on Adversarial Examples

In the above experiments, we fix the oracle radius. However, it is interesting to understand how the setting of the oracle, for example its radius, may affect the quantity and quality of the test cases and adversarial examples. Here, we focus on the MNIST model to show how the adversarial rate and the perturbations of adversarial examples vary with different values of the oracle radius. To give a clearer comparison, we apply large fluctuations to the input images. The hyper-parameter is set with , and the range of oracle values is . The experimental results are averaged over 5 different sets of seed inputs; for each set of seeds, 2000 test cases are generated.

Fig. 11: The number and the average perturbation of adversarial examples against the value of the oracle radius
Fig. 12: An illustration of adversarial examples' quality at different oracle radii for the MNIST model

As can be seen in Figure 11, the quantity and the perturbations of adversarial examples both increase with the oracle radius: a large oracle radius imposes only weak restrictions when filtering adversarial examples. Moreover, the adversarial rate in the test set grows linearly with the oracle radius over an intermediate interval, while the average perturbation of adversarial examples behaves differently, showing a polynomial relationship with the oracle radius. Both figures give an intuitive indication of how robust the model is: a steeper curve, to some extent, indicates worse robustness of the LSTM to adversarial examples.

We also plot some adversarial examples for MNIST in Figure 12, from experiments set up with different oracle radii. As expected, a large oracle radius tolerates adversarial examples carrying more image noise. The handwritten digit 3 is largely perturbed and even somewhat indistinguishable to humans. In contrast, digit 4 carries imperceptible noise and complies much better with the definition of adversarial examples.

VIII-D Comparison between Test Metrics

This section explains how the test metrics are complementary in covering different functionality of the LSTM layer. We consider minimal test suites, in which the removal of any test case may lead to a reduction of the coverage rate. The rationale for taking minimal test suites is to reduce overlaps, so that it is fairer to compare test metrics. Experimental results for the three models are presented in Table II.
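One simple way to obtain such a minimal test suite is greedy reduction: repeatedly drop any case whose removal leaves the covered conditions unchanged. A sketch, assuming a hypothetical `coverage(x)` that returns the set of test conditions a case satisfies:

```python
def minimal_test_suite(suite, coverage):
    # Greedily remove redundant cases; in the result, dropping any
    # remaining case strictly reduces the covered set.
    def covered(cases):
        s = set()
        for c in cases:
            s |= coverage(c)
        return s
    kept = list(suite)
    full = covered(kept)
    i = 0
    while i < len(kept):
        rest = kept[:i] + kept[i + 1:]
        if covered(rest) == full:
            kept = rest          # case i was redundant
        else:
            i += 1
    return kept
```

The result depends on the scan order, so it is a minimal suite, not necessarily a minimum one.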

Model                    | Targeted Test Metric       | Threshold / Symbols | Minimal Test Set | Cell  | Forget Gate | Pos. Seq. | Neg. Seq.
MNIST                    | Cell Coverage              | 6                   | 21               | 0.821 | 0.464       | 0.219     | 0.312
MNIST                    | Forget Gate Coverage       | 0.85                | 12               | 0.214 | 0.821       | 0.125     | 0.250
MNIST                    | Positive Sequence Coverage | {a, b}              | 27               | 0.357 | 0.357       | 0.844     | 0.344
MNIST                    | Negative Sequence Coverage | {a, b}              | 27               | 0.250 | 0.429       | 0.219     | 0.844
Sentiment Analysis       | Cell Coverage              | 7                   | 263              | 0.908 | 0.768       | 0.875     | 0.812
Sentiment Analysis       | Forget Gate Coverage       | 0.73                | 262              | 0.742 | 0.926       | 0.875     | 0.844
Sentiment Analysis       | Positive Sequence Coverage | {a, b}              | 30               | 0.108 | 0.078       | 0.938     | 0.844
Sentiment Analysis       | Negative Sequence Coverage | {a, b}              | 30               | 0.108 | 0.078       | 0.844     | 0.938
Lipophilicity prediction | Cell Coverage              | 6                   | 67               | 0.900 | 0.812       | 0.688     | 0.817
Lipophilicity prediction | Forget Gate Coverage       | 0.77                | 29               | 0.075 | 0.950       | 0.469     | 0.500
Lipophilicity prediction | Positive Sequence Coverage | {a, b}              | 30               | 0.050 | 0.637       | 0.938     | 0.531
Lipophilicity prediction | Negative Sequence Coverage | {a, b}              | 30               | 0.075 | 0.650       | 0.500     | 0.938
TABLE II: Comparison between test metrics on minimal test sets

By considering the coverage rates across the different minimal test sets, it is not difficult to see that the diagonal entry is always the largest value in its row and column. E.g., if we derive a minimal test set with a very high coverage rate of cell values, the other test metrics do not necessarily achieve coverage results as good or better. This indicates that our designed test metrics have little overlap in terms of test requirements. Indeed, the purposes of these test metrics are different: as discussed in their definitions, cell and gate coverage are one-step test conditions, developed to test the basic unit of the LSTM layer, whereas sequence coverage returns to multi-step test conditions. When using sequence coverage as the stopping condition for testing, we care more about the information processing of the whole LSTM layer.

Another thing to notice is that the user-specified threshold values for cell and gate coverage, or the symbols for sequence coverage, may affect the comparison results. Although our test metrics are not tightly correlated with each other, weak test requirements are too easy to achieve. That means the minimal test set targeting one test metric may have a relatively good coverage result for another test metric, if the latter imposes weak test conditions. That is the reason why, in Table II, some minimal test sets show high coverage rates for all test metrics.

VIII-E Exhibition of Internal Working Mechanism

We show how to use the generated test cases to understand LSTM networks. The experiments consider the two models for the MNIST and IMDB datasets and visualise their learning mechanism inside the LSTM layer. Figures 13 and 14 present the coverage times for each feature (or test condition), where the coverage time counts the number of times a test condition is satisfied in a test suite. For both models, we take a test suite with 2,000 test cases. The coverage time exhibits the difficulty of asserting a feature.
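The coverage times can be tallied directly from a test suite (a sketch; `coverage(x)` is a hypothetical function returning the set of conditions a case satisfies):

```python
from collections import Counter

def coverage_times(suite, coverage):
    # For each test condition, count how many test cases satisfy it.
    times = Counter()
    for x in suite:
        times.update(coverage(x))
    return times
```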

Less Active Features

For the MNIST model, the test suite is obtained by setting the thresholds as and . Note that each test condition is defined with respect to a row of pixels of the input image. Figure 13 presents two histogram plots, for cell and gate coverage respectively. From the cell plot (left), we can see that the first and last few test conditions are not covered. This is expected, since we split an MNIST image into rows such that one row is used at each time step. For MNIST images, the top and bottom rows are blank, and therefore the aggregate information change is not significant enough to exceed the threshold. The gate plot (right) uses a relatively large threshold of 0.85. Statistically, in the first several time steps it is hard for the abstract gate value to meet this strict test condition. A reasonable explanation is that the LSTM cell in the MNIST model throws away comparatively more information at the beginning; this unnecessary information is likely the edges of the MNIST images, as mentioned above.

For the IMDB model, in Figure 14, the plots on the bottom line show the coverage times of all 500 test features for cell and gate coverage, respectively. From the graph, we can see that no matter which coverage metric is considered, in contrast to the last 200 cells, the first 300 cells are less active and obviously "lazy" in trying to be asserted. The reason is that the input samples are padded or truncated to the same length: since we set the review length to 500 words, the texts used for training and testing may be too long or too short, and the short reviews are padded on the left with 0s. Therefore, it is reasonable and likely that in many test cases the first few cells are "dead cells".

Active Features

In the cell plot of Figure 13, we can see that the cells around 7 and 8 are easily asserted, which means that, for many inputs, there are great changes in the hidden outputs within this range. In other words, those cells' inputs provide significant information for the classification. In the gate plot, the long-term memory tends to update lazily at the end of the time series; this can be seen from the high coverage times of memory values over 0.85 at the last few cells.

Fig. 13: 2000 test cases are used to demonstrate the coverage times of 28 features in LSTM 2 layer of MNIST model.
Fig. 14: 2000 test cases are used to demonstrate the coverage times of 500 features in LSTM layer of Sentiment Analysis model.
Overall Analysis

If we combine the cell and gate plots, the whole working process of the LSTM inside the MNIST model becomes transparent. The sequence input of an image starts from the top rows and proceeds to the bottom rows. At the beginning, the top rows of MNIST images do not contribute to the model prediction, which can be seen from the fact that they do not cause significant hidden state changes; this unimportant information is gradually thrown away from the long-term memory. When input rows containing digits are fed to the LSTM cells, the model starts learning and the short-term memory, represented by the hidden outputs, starts to have strong reactions. The following process is random due to the diversity of digits. When approaching the end of the sequence, which corresponds to the bottom of the digit images, the LSTM is already confident about the final classification and becomes lazy in updating the memory. Overall, MNIST digit recognition is not a complicated task for the LSTM model, and the top half of the images seems to be enough for the classification.

For the IMDB model, each cell plays an important role in dealing with passed information and even in affecting the final classification. This is based on the observation of the results in Figure 14. We use 2000 reviews, each with length greater than 50; this makes sure that cells 450-500 receive real input words rather than padded 0s. The coverage times are then counted and plotted at the top line of Figure 14. Compared with the MNIST result in Figure 13, the LSTM cells and gates in the sentiment analysis model are activated randomly, which further suggests that this network does not have a fixed working pattern like the MNIST model. For an MNIST input, image rows from top to bottom are gradually changed and combined to form the specific characteristic that can be recognized as the digit we perceive. However, sensitive words in a review text may be located anywhere, and these words are exactly what the model needs to learn for the classification.

IX Related Work

Adversarial Examples for RNN

Adversarial examples represent a major safety/robustness concern for deep neural networks. Adversarial example generation has been an active area, and a large body of research has been conducted on adversarial attacks for recurrent neural networks on tasks such as natural language processing [30, 21, 9, 49, 13, 14, 2] and automated speech recognition [16, 11, 6]. In this paper, adversarial examples are used as a proxy to evaluate the effectiveness of the proposed coverage criteria.

Testing feedforward neural networks

Most existing neural network testing methods focus on the feedforward structure. In [31] neuron coverage is proposed. Though neuron coverage is able to guide the detection of adversarial examples, it is rather easy to construct a trivial test set that achieves a very high neuron coverage level. Several refinements and extensions of neuron coverage are later developed in [25]. Motivated by the usage of MC/DC coverage metrics in high-criticality software, in [37] a family of MC/DC variants is adopted for neural networks, which takes into account the causal relation between features of different layers. It has been formally proved in [37] that the criteria in [31, 25] are special cases of its MC/DC variants for neural networks.

Besides the structural coverage criteria mentioned above, the metrics in [44, 3, 8] define a set of test conditions to partition the input space. Though not a coverage metric, the method in [24] measures the difference between the training dataset and the test dataset based on neural network structural information.

Guided by the coverage metrics, test cases can be generated by neural network testing techniques based on, e.g., heuristic search [48], fuzzing [19, 29], mutation [26, 41] and symbolic encoding [38, 17]. None of the mentioned test generation algorithms has considered RNNs.

Testing RNNs

As far as we know, [12] is the only paper so far that provides coverage metrics for RNNs: it first learns an automaton to abstract the RNN under test, and then generates test cases for this automaton. There are major differences between our work and [12]: (1) the approach in [12] treats the RNN as a black box, whereas our coverage criteria explicitly consider the RNN's internal structure; and (2) the metrics in [12] are based on the abstract automaton, whereas we define test conditions directly on the neural network.

Visualisation for LSTM

For LSTMs, a sequence of inputs or hidden states is a sequence of vectors. To get an overview of the information inside RNN cells, dimensionality reduction methods (e.g., PCA and t-SNE) have been adopted. For example, [33] employs PCA to extract the most important feature of the hidden state at each time step. These methods facilitate the visualisation of RNN hidden behaviours. Our visualisation is completely different: it visualises the working principles based on a set of test cases.

X Conclusions

In this paper, we developed a test framework for the verification and validation of recurrent neural networks, more specifically networks with LSTM layers. The framework includes a set of test metrics that exploit the internal structures of LSTM cells and a test case generation method. Our experimental results show the effectiveness of the test framework on several LSTM networks for different application tasks.


  • [1] G. S. G. 3 (2004) Quality management systems - process validation guidance. Technical report The Global Harmonization Task Force. Cited by: §I.
  • [2] M. Alzantot, Y. Sharma, A. Elgohary, B. Ho, M. B. Srivastava, and K. Chang (2018) Generating natural language adversarial examples. CoRR abs/1804.07998. External Links: Link, 1804.07998 Cited by: §IX.
  • [3] R. Ashmore and M. Hill (2018) Boxing clever: practical techniques for gaining insights into training data and monitoring distribution shift. In First International Workshop on Artificial Intelligence Safety Engineering. Cited by: §IX.
  • [4] A. Banks and R. Ashmore (2019) Requirements assurance in machine learning. In AAAI Workshop on Artificial Intelligence Safety (SafeAI2019), Cited by: §I.
  • [5] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo (2015-05) The oracle problem in software testing: a survey. IEEE Transactions on Software Engineering 41 (5), pp. 507–525. External Links: Document, ISSN 0098-5589 Cited by: §V.
  • [6] N. Carlini and D. A. Wagner (2018) Audio adversarial examples: targeted attacks on speech-to-text. CoRR abs/1801.01944. External Links: Link, 1801.01944 Cited by: §IX.
  • [7] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In Security and Privacy (SP), IEEE Symposium on, pp. 39–57. Cited by: §I.
  • [8] C. Cheng, G. Nührenberg, C. Huang, and H. Yasuoka (2018) Towards dependability metrics for neural networks. In Proceedings of the 16th ACM-IEEE International Conference on Formal Methods and Models for System Design, Cited by: §IX.
  • [9] M. Cheng, J. Yi, H. Zhang, P. Chen, and C. Hsieh (2018) Seq2Sick: evaluating the robustness of sequence-to-sequence models with adversarial examples. CoRR abs/1803.01128. External Links: Link, 1803.01128 Cited by: §IX.
  • [10] J. J. Chilenski and S. P. Miller (1994-Sep.) Applicability of modified condition/decision coverage to software testing. Software Engineering Journal 9 (5), pp. 193–200. External Links: Document, ISSN 0268-6961 Cited by: §VI.
  • [11] M. Cisse, Y. Adi, N. Neverova, and J. Keshet (2017-07) Houdini: Fooling Deep Structured Prediction Models. arXiv e-prints, pp. arXiv:1707.05373. External Links: 1707.05373 Cited by: §IX.
  • [12] X. Du, X. Xie, Y. Li, L. Ma, J. Zhao, and Y. Liu (2018) DeepCruiser: automated guided testing for stateful deep learning systems. CoRR abs/1812.05339. External Links: Link, 1812.05339 Cited by: §IX.
  • [13] J. Ebrahimi, D. Lowd, and D. Dou (2018) On adversarial examples for character-level neural machine translation. CoRR abs/1806.09030. External Links: Link, 1806.09030 Cited by: §IX.
  • [14] J. Ebrahimi, A. Rao, D. Lowd, and D. Dou (2017-12) HotFlip: White-Box Adversarial Examples for Text Classification. arXiv e-prints, pp. arXiv:1712.06751. External Links: 1712.06751 Cited by: §IX.
  • [15] T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, and M. Vechev (2018) AI2: safety and robustness certification of neural networks with abstract interpretation. In Security and Privacy (SP), 2018 IEEE Symposium on, Cited by: §I.
  • [16] Y. Gong and C. Poellabauer (2017) Crafting adversarial examples for speech paralinguistics applications. CoRR abs/1711.03280. External Links: Link, 1711.03280 Cited by: §IX.
  • [17] D. Gopinath, K. Wang, M. Zhang, C. S. Pasareanu, and S. Khurshid (2018) Symbolic execution for deep neural networks. arXiv preprint arXiv:1807.10439. Cited by: §IX.
  • [18] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, New York, NY, USA, pp. 369–376. External Links: ISBN 1-59593-383-2, Link, Document Cited by: §I.
  • [19] J. Guo, Y. Jiang, Y. Zhao, Q. Chen, and J. Sun (2018) DLFuzz: differential fuzzing testing of deep learning systems. In Proceedings of the 2018 12th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2018, Cited by: §IX.
  • [20] X. Huang, M. Kwiatkowska, S. Wang, and M. Wu (2017) Safety verification of deep neural networks. In CAV2017, pp. 3–29. External Links: Document, Link Cited by: §I.
  • [21] R. Jia and P. Liang (2017) Adversarial examples for evaluating reading comprehension systems. CoRR abs/1707.07328. External Links: Link, 1707.07328 Cited by: §IX.
  • [22] A. Karpathy, J. Johnson, and F. Li (2015) Visualizing and understanding recurrent networks. CoRR abs/1506.02078. External Links: Link, 1506.02078 Cited by: §IV-A.
  • [23] G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer (2017) Reluplex: an efficient SMT solver for verifying deep neural networks. In CAV2017, pp. 97–117. Cited by: §I.
  • [24] J. Kim, R. Feldt, and S. Yoo (2018) Guiding deep learning system testing using surprise adequacy. arXiv preprint arXiv:1808.08444. Cited by: §I, §IV, §IX.
  • [25] L. Ma, F. Juefei-Xu, J. Sun, C. Chen, T. Su, F. Zhang, M. Xue, B. Li, L. Li, Y. Liu, J. Zhao, and Y. Wang (2018) DeepGauge: comprehensive and multi-granularity testing criteria for gauging the robustness of deep learning systems. In ASE2018, Cited by: §I, §I, §IV, §IX.
  • [26] L. Ma, F. Zhang, J. Sun, M. Xue, B. Li, F. Juefei-Xu, C. Xie, L. Li, Y. Liu, J. Zhao, et al. (2018) DeepMutation: mutation testing of deep learning systems. In Software Reliability Engineering, IEEE 29th International Symposium on, Cited by: §IX.
  • [27] D. Meng and H. Chen (2017) MagNet: a two-pronged defense against adversarial examples. In ACM Conference on Computer and Communications Security, Cited by: §I.
  • [28] Y. Ming, S. Cao, R. Zhang, Z. Li, Y. Chen, Y. Song, and H. Qu (2017) Understanding hidden memories of recurrent neural networks. In VAST2017, pp. 13–24. Cited by: §III.
  • [29] A. Odena and I. Goodfellow (2018) TensorFuzz: debugging neural networks with coverage-guided fuzzing. arXiv preprint arXiv:1807.10875. Cited by: §IX.
  • [30] N. Papernot, P. D. McDaniel, A. Swami, and R. E. Harang (2016) Crafting adversarial input sequences for recurrent neural networks. CoRR abs/1604.08275. External Links: Link, 1604.08275 Cited by: §IX.
  • [31] K. Pei, Y. Cao, J. Yang, and S. Jana (2017) DeepXplore: automated whitebox testing of deep learning systems. In SOSP2017, pp. 1–18. Cited by: §I, §I, §IV-A, §IV, §IX.
  • [32] G. Petneházi (2019) Recurrent neural networks for time series forecasting. CoRR abs/1901.00069. External Links: Link, 1901.00069 Cited by: §I.
  • [33] P. E. Rauber, S. G. Fadel, A. X. Falcao, and A. C. Telea (2017) Visualizing the hidden activity of artificial neural networks. IEEE transactions on visualization and computer graphics 23 (1), pp. 101–110. Cited by: §IX.
  • [34] RDKit: open-source cheminformatics. Note: http://www.rdkit.org [Online; accessed 11-April-2013]. Cited by: §VIII-B.
  • [35] M. H. S. Segler, T. Kogej, C. Tyrchan, and M. P. Waller (2018) Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS central science 4 (1), pp. 120–131. Cited by: §I.
  • [36] (2018) Sentiment detection with keras, word embeddings and lstm deep learning networks. External Links: Link. Cited by: §VIII-A.
  • [37] Y. Sun, X. Huang, and D. Kroening (2019) Structural coverage metrics for deep neural networks. EMSOFT2019. Cited by: §I, §IV, §IX.
  • [38] Y. Sun, M. Wu, W. Ruan, X. Huang, M. Kwiatkowska, and D. Kroening (2018) Concolic testing for deep neural networks. In ASE, Cited by: §I, §I, §IV, §IX.
  • [39] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In ICLR2014, Cited by: §I, §V, §VII.
  • [40] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry (2019) Robustness may be at odds with accuracy. In International Conference on Learning Representations, External Links: Link Cited by: §I.
  • [41] J. Wang, J. Sun, P. Zhang, and X. Wang (2018) Detecting adversarial samples for deep neural networks through mutation testing. arXiv preprint arXiv:1805.05010. Cited by: §IX.
  • [42] Y. Wang, K. Velswamy, and B. Huang (2017) A long-short term memory recurrent neural network based reinforcement learning controller for office heating ventilation and air conditioning systems. Processes 5 (3), pp. 46. Cited by: §I.
  • [43] J. W. Wei and K. Zou (2019) EDA: easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196. Cited by: §VIII-B.
  • [44] M. Wicker, X. Huang, and M. Kwiatkowska (2018) Feature-guided black-box safety testing of deep neural networks. In TACAS2018, pp. 408–426. Cited by: §I, §IV, §IX.
  • [45] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144. External Links: Link, 1609.08144 Cited by: §I.
  • [46] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. S. Pande (2017) MoleculeNet: A benchmark for molecular machine learning. CoRR abs/1703.00564. External Links: Link, 1703.00564 Cited by: §VIII-A.
  • [47] Q. Yang, J. J. Li, and D. Weiss (2006) A survey of coverage based testing tools. In Proceedings of the 2006 International Workshop on Automation of Software Test, AST ’06, New York, NY, USA, pp. 99–103. External Links: ISBN 1-59593-408-1, Link, Document Cited by: §I.
  • [48] M. Zhang, Y. Zhang, L. Zhang, C. Liu, and S. Khurshid (2018) DeepRoad: GAN-based metamorphic autonomous driving system testing. In Automated Software Engineering (ASE), 33rd IEEE/ACM International Conference on, Cited by: §IX.
  • [49] Z. Zhao, D. Dua, and S. Singh (2017) Generating natural adversarial examples. CoRR abs/1710.11342. External Links: Link, 1710.11342 Cited by: §IX.