# Part-X: A Family of Stochastic Algorithms for Search-Based Test Generation with Probabilistic Guarantees

Requirements driven search-based testing (also known as falsification) has proven to be a practical and effective method for discovering erroneous behaviors in Cyber-Physical Systems. Despite the constant improvements on the performance and applicability of falsification methods, they all share a common characteristic. Namely, they are best-effort methods which do not provide any guarantees on the absence of erroneous behaviors (falsifiers) when the testing budget is exhausted. The absence of finite time guarantees is a major limitation which prevents falsification methods from being utilized in certification procedures. In this paper, we address the finite-time guarantees problem by developing a new stochastic algorithm. Our proposed algorithm not only estimates (bounds) the probability that falsifying behaviors exist, but also it identifies the regions where these falsifying behaviors may occur. We demonstrate the applicability of our approach on standard benchmark functions from the optimization literature and on the F16 benchmark problem.

## Authors

• 2 publications
• 3 publications
• 1 publication
• 2 publications
• 23 publications
• 1 publication
• 16 publications
06/04/2021

### PSY-TaLiRo: A Python Toolbox for Search-Based Test Generation for Cyber-Physical Systems

In this paper, we present the Python package PSY-TaLiRo which is a toolb...
09/01/2020

### Quantifying the Latency and Possible Throughput of External Interrupts on Cyber-Physical Systems

An important characteristic of cyber-physical systems is their capabilit...
10/02/2021

### A Semantic Model for Interacting Cyber-Physical Systems

We propose a component-based semantic model for Cyber-Physical Systems (...
11/08/2018

### Integrating Security in Resource-Constrained Cyber-Physical Systems

Defense mechanisms against network-level attacks are commonly based on t...
02/26/2021

### Engineering Swarms of Cyber-Physical Systems with the CPSwarm Workbench

Engineering swarms of cyber-physical systems (CPSs) is a complex process...
01/30/2013

### Learning the Structure of Dynamic Probabilistic Networks

Dynamic probabilistic networks are a compact representation of complex s...
03/04/2020

### Probabilistic Performance-Pattern Decomposition (PPPD): analysis framework and applications to stochastic mechanical systems

Since the early 1900s, numerous research efforts have been devoted to de...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction and Motivation

Search-based test generation (SBTG) for Cyber-Physical Systems (CPS) [KapinskiEtAl2016csm] refers to a broad class of best-effort methods that attempt to discover system behaviors that do not satisfy a set of functional requirements. In other words, SBTG methods search the operating space of the system for behaviors that falsify (i.e., do not satisfy) the given requirements (also known as falsification process). A variety of methods for SBTG have been developed that range from tree exploration methods [NahhalD07cav, DreossiDDKJD15nfm, ZhangEtAl2018cadics, PlakuKV09tacas] to black-box optimization based methods [abbas2013probabilistic, deshmukh2017testing, ZhangAH2020cadics] and related variations of these techniques [zutshi2014multiple, Waga2020hscc, MenghiEtAl2020icse, YamaguchiKDS2016fmcad]

. More recently, reinforcement learning methods

[AkazakiEtAl2018, LeeEtAl2020jair] have also being explored for SBTG. The allure of SBTG methods is that in general they do not require any information about the System under Test (SUT) and, hence, they can be easily applied to a range of challenging applications such as medical [sankaranarayanan2017model], automotive [TuncaliEtAl2018iv, DreossiDS2019jar], and aerospace [LeeEtAl2020jair, MenghiEtAl2020icse] (see [BartocciEtAl2018survey, AbbasEtAl2018emsoft, CorsoEtAl2020arxivSurvey, KapinskiEtAl2016csm] for surveys).

A common characteristic of all the aforementioned methods is that their goal is to discover with as few tests as possible a falsifying (or a most likely falsifying) behavior of the SUT [ernst2020arch]. In fact, SBTG methods have already established that they can falsify benchmark problems at a fraction of the cost of Monte-Carlo sampling, or even when Monte-Carlo sampling fails, e.g., [ZhangEtAl2018cadics, abbas2013probabilistic]. However, current SBTG methods (with exception maybe of [AbbasHFU14cyber, Fan2020]) cannot answer an important open question: what are the conclusions to be drawn if no falsifying behavior has been discovered? In other words, when the test/simulation budget is exhausted and no violation has been discovered, can we conclude that the SUT is safe, or that at least it is likely safe? This is a challenging problem for almost all black-box SBTG methods since foundationally, the speedup in falsification detection is a result of smart sampling in the search space. This is a challenging problem even for SBTG approaches, e.g., [NahhalD07cav, DreossiDDKJD15nfm], which are driven by coverage metrics over the search space. Nevertheless, a positive answer to this question is necessary if SBTG methods were ever to be adopted and incorporated into assurance procedures [HeimdahlEtl2016faa].

In this paper, we develop a general framework that can assess the probability that falsification regions exist in the search space of the SUT. Our working assumption is that the SUT can be represented as an input-output function

, i.e., a function that takes as input a vector

and returns as output a vector signal representing the (deterministic) behavior of the system. The vector can represent system initial conditions , static parameters , and/or any finite representation of an input signal to the SUT. The vector is usually sampled from a convex bounded space , i.e., , which in our case is a hypercube. These assumptions are standard in the literature and can represent a software- or hardware-in-the-loop SUT, or a simulation model of the SUT. In addition, we assume that the system is checked for correctness against a formal requirement which has a quantitative interpretation of satisfaction usually referred to as specification robustness , e.g., as in the case of Signal Temporal Logic (STL) and its variants [fainekos2009robustness, DonzeM10formats, AkazakiH15cav, BartocciEtAl2018survey]. The robustness is positive when the trajectory satisfies the requirement and negative otherwise. Moreover, the magnitude of represents how robustly the trajectory satisfies (or not) the requirement. Using the robustness function, the falsification problem for CPS can be converted into a search problem where the goal is to find points that belong to the zero level-set of the function for .

In order to identify the zero level-set of and, hence, estimate the probability of falsification, we propose a family of stochastic algorithms, referred to as Part-X (Partitioning with X-distributed sampling). Part-X adaptively partitions the search space in order to enclose the falsifying points. The algorithm uses local Gaussian process estimates in order to adaptively branch and sample within the input space. The partitioning approach not only helps us identify the zero level-set, but also to circumvent issues that rise due to the fact that the function is discontinuous. In fact, the only assumption we need on is that it is a locally continuous function. In order to evaluate our approach, we have built an SBTG library in Python (to be publicly released after publication) that can use the RTAMT [NickovicY2020atva] or TLTk [CralleySHF2020rv] temporal logic robustness Python libraries. We demonstrate our framework on selected functions from the optimization literature and on the F16 benchmark problem  [heidlauf2018verification]. An additional feature of our framework is that Part-X can also be utilized for evaluating test vectors generated by other SBTG tools. Therefore, Part-X can also function as an evaluation tool for other falsification algorithms in assurance procedures, or even in competitions [ernst2020arch].

Besides the aforementioned practical/applied contributions of our work, the Part-X framework also makes the following theoretical and technical contributions. First, it uses multiple local models as opposed to a single global process. This helps the algorithm to handle discontinuities, estimate the worst and best performing inputs, and identify disconnected zero-level sets. Second, Part-X uses a branching criterion to reduce the number of regions generated.

## 2 Literature Review

This work spans two macro-areas: (i) Automatic falsification of cyber-physical systems; (ii) Learning level sets of non-linear non-convex black-box functions . In the following, we first motivate and document the contribution and state of the art within the field of automatic falsification performed with optimization based approaches, and we identify the challenges (section 2.1). Section 2.2 reviews techniques that can support non-stationary learning problems, with focus on the learning of a level set.

### 2.1 Stochastic Optimization for falsification

Global algorithms hold asymptotic convergence guarantees to global optima, but often require a large number of evaluations to effectively reach a good solution

. Stochastic optimization techniques such as simulated annealing, genetic algorithms, ant colony optimization, and the cross-entropy method have been applied in this domain. It is important to highlight that optimization algorithms are stochastic when:

(i) an observed objective function evaluation is subject to noise; or (ii) randomness is injected via the search procedure itself while the function results are noiseless [spall2005introduction], or a mixture of the two

. However, these methods notably lack sample efficiency, partly due to the difficulty in setting the numerous hyperparameters for methods with memory, and the inability of exploiting information from previous iterations for memory free methods. As an example of memory free sampling, hit and run, which is a common implementation of uniform random sampling, epitomizes myopic search: locations iteratively evaluated have no impact upon subsequent sampling decisions. The benefit of these stochastic search techniques is their easy-to-derive guarantees in terms of coverage. On the other hand, local search and hill climbing techniques (which are deterministic with noiseless function evaluations) such as CMA-ES, simplex search, trust region search, response surface methodology, and gradient ascent/descent have increased sample efficiency. However, due to the notorious non-linearity of robustness landscapes in CPS falsification, these local techniques often get trapped in sub-optimal local regions as these searches lack explorative properties. Among recent

local

optimization approaches, Stochastic Trust-Region Response-Surface Method (STRONG) iteratively executes two stages: (1) constructing local models through linear regression, when gradient information is not readily available, and (2) formulating and solving the associated trust-region subproblem

[chang2013stochastic]. In [shashaani2015astro], the proposed Derivative Free Adaptive Sampling Trust Region Optimization (ASTRO-DF) uses derivative free trust-region methods to generate and statistically certify local stochastic models, that are used to update candidate solution values through an adaptive sampling scheme. Without assumptions on the degree of non-linearity in the objective function, the drawback of these methods is that the quality of the discovered local minimum may be poor, relative to the entire surface. Research has been conducted on multi-starts and restart policies which have overcome these issues in some cases [zabinsky2010stopping]. Global algorithms aim to balance the trade-off between exploration and exploitation, and investigate un-sampled regions without the promise of improvement. Notable examples include: improving hit-and-run [Zabinsky1993] with polynomial expected function evaluations, and the Greedy Randomized Adaptive Search Procedure (GRASP) [feo1995greedy]. A drawback to these methods is when function calls are limited. In such cases, meta-model based search uses previous samples to predict function values in regions where sampling has not been performed. Efficient Global Optimization (EGO) [jones1998efficient], for deterministic black-box settings, is an established meta-model based method. In this context, Bayesian optimization (BO) is a popular black-box stochastic optimization method [frazier2018bayesian]. BO balances exploration and exploitation via surrogate modeling to produce high quality solutions in a relatively small number of iterations. However, due to the overhead costs associated to BO, such as surrogate model estimation and acquisition function optimization, this technique should be employed when observations of the objective function are expensive to collect - as in the case of observing the robustness for a given input . BO has proven to be quite successful over CPS falsification problems [deshmukh2017testing, ghosh2018verifying, waga2020falsification]. Recently, BO was combined with a local trust region search in an intelligent global-local optimization framework and proved highly effective for CPS falsification [mathesen2021stochastic, mathesen2019falsification].

### 2.2 Statistical Methods for iterative Partitioning for Function Learning and Level Set Estimation

Partitioning has deep roots in exact optimization (branch and bound methods), in black box optimization (target level set estimation), and in statistical applications such as additive modeling. In fact, partitioning is used: (i) to handle large scale data sets, by iteratively splitting the data; (ii) to handle high dimensional inputs, by iteratively splitting the original space into lower dimensional subspaces; (iii) to handle discontinuities of the reward function with respect to the input space, by iteratively branching the support of the decision space forming a partition . In this paper, we focus on the ability to estimate a surrogate for the robustness function that allows the needed flexibility to represent a function with potentially high rate of variation. Since our Part-X uses Gaussian processes (see section 3.1), we are challenged by the correlation function which is assumed to be constant. Partitioning allows us to change the rate of variation, through the Gaussian process correlation function, across the input space. Furthermore, we are interested in the estimation of the, potentially disconnected, set of locations that violate desired system properties. In this brief review, we focus on methods that use partitioning for level set estimation, and we highlight, when present, the type of guarantees offered by the different algorithms.

#### Surrogate Driven Approaches for Level Set Estimation

In [shekhar2019]

, the authors propose and analyze an algorithm for the estimation of a level set for the case where the reward function can be evaluated with noise. The authors estimate a Gaussian process across the entire solution space and, at each iteration, each subregion within the current partition is further branched based on the estimated distance to the, unknown, level set, and based on the predicted variance at the centroid of the subregion. If a subregion is branched, a new location is also evaluated and more replications are ran for the subregions forming the updated partition thus updating the Gaussian process. While the scope of the paper is different from ours, the authors do not provide and error bound over the level set estimate, and use a unique model across the partition. Also, the sampling is accuracy driven and does not consider the maximization of the probability to identify the level set. Within the falsification literature,

[Fan2020] proposes to use partitioning to identify falsifying level sets. Specifically, assuming a partitioning and sampling schemes are provided as input, sequentially uses Conformal regression to derive an estimate of the maximum and minimum function value within a subregion, with subregion-wide noise. Based on such predictions, a subregion can be eliminated, maintained (if we have confidence that the system satisfies the properties) or further branched. While conformal regression does not allow to produce a point estimate of the response, the method allows to derive a probabilistic guarantee over a subregion for the correctness of the eliminating and maintaining. The work in [Lei2018]

is used to define the conditions over the conformal regression. Besides the lack of point estimates, another potential drawback is that the authors assume that the sampling density is defined as input. Nonetheless, sampling densities can impact the performance of the approach. Even if not using surrogate models, we mention Probabilistic Branch and Bound (PBnB) that uses a directed random search to approximate a target level set associated with a target quantile of the best globally optimal solutions

[Pra05, Zabinsky2020PBnB]. An advantage of PBnB is that it partitions the space iteratively, and performs more function evaluations in the promising regions as it refines its approximation of the target level set. In addition, there is a probabilistic bound on the error of the approximation, which we will use in the analysis of our algorithm (section 4).

#### Surrogates for non-stationary responses

As mentioned, one possible criticality in using one single surrogate model is in the difficulty to capture discontinuities in the function behavior. Within the statistical learning community, so-called treed Gaussian processes have been proposed with main applications in learning from non-stationary and large data sets. This literature does not address level set estimation, but it is relevant in the sense that it addresses non-stationarity of the response and the noise. In [Chipman2002] a binary tree is used to learn partitions based on Bayesian regression. The difficulty to scale the approach to high dimensional inputs/ large data sets led to the computational work in [DENISON2002]. One drawback of these approaches was identified in the irergularity of the variance associated to the different subregions. In particular, it was observed that some subregions tended to exhibit variances orders of magnitude larger. In [Kim2005] this problem is alleviated by using stationary processes in the several subregions (thus leading to a larger number of subregions, but better control over the variance profile). Differently, [Gramacy05bayesiantreed]

deals directly with the problem of non-stationarity of the response and of the variance (heteroscedasticity) proposing a new form of Gaussian process, the Bayesian Treed Gaussian Process model. The approach combines stationary Gaussian processes and partitioning, resulting in treed Gaussian processes, and it implements a tractable non-stationary model for non-parametric regression. Along a similar line,

[Liang2018] also uses a binary tree which to iteratively learn different models and while the method has good fitting results, it also shows good computational performance for large datasets.

#### Reinforcement learning approaches

As previously mentioned, sampling can have an important impact on the ability of the search method to perform effective partitioning, i.e., branching decisions that can increase either the prediction accuracy or the probability to identify a falsifying input. In this regard, it is important to point at relevant references in the area of optimal sequential sampling. In [Wang2019], the latent action Monte Carlo Tree Search (LA-MCTS) sequentially focuses the evaluation budget by iteratively splitting the input space to identify the most promising region. Differently from our case, LA-MCTS uses non-linear boundaries for branching regions, and, like us, learns a local model to select candidates. Another relevant contribution, AlphaX, presented in [Wang2019AlphaX]

, explores the search using distributed Monte Carlo Tree Search (MCTS) coupled with a Meta-Deep Neural Network (DNN) model that guides the search and branching decision, focusing on the sequential selection of the most promising region. In general, while relevant in terms of sampling, sequential optimization methods are designed to learn and focus on

the elite region of the input space. Our premise is different: as long as a point is falsifying the user will be interested in having this information no matter how bad the falsification is.

### 2.3 Contributions

In this paper, we propose Part-X, a partitioning algorithm that relies on local Gaussian process estimates in order to adaptively branch and sample within the input space. Part-X brings the following innovative features over PBnB and prior falsification approaches:

1. Due to the local metamodels, as opposed to a unique global process, Part-X produces point estimates with associated predictions, for the output function. This allows to build estimates for the worst and best performing inputs, as well as the volume of falsifying sets for arbitrary levels of negative robustness.

2. The presence of a branching criterion avoids proliferation of subregions as a subregion is only branched when a criterion is satisfied (e.g., we are not able to classify the region).

3. Under mild assumptions, the algorithm asymptotically achieves minimum error as a function of the minimum allowed size for a partition.

4. Part-X mechanism to partition, classify and model can be used to support and evaluate any other test generation approach. This will be further detailed in section 3.

## 3 Part-X: a family of algorithms for partitioning-driven Bayesian optimization

Given the robustness function , we aim at identifying the minimum for the function , while producing an estimate of the function at locations that satisfy , and an estimate of the associated set of values , such that . The algorithm we propose is an adaptive partitioning and sampling approach that provides: (i) the probability that a falsification exists within the input region ; (ii) the estimate of the falsification volume interpreted as , where represent the measure, volume in our application, associated to the relevant set. The probability in (i) is strongly dependent on the magnitude of the violation (i.e., how negative the robustness function is predicted to be), while the measure in (ii) helps engineers understanding how likely is the system to produce a falsifying behaviour (i.e., convergence of the estimated level set to the true level set w.p.1).

Figure 1 gives an overview of the proposed approach, which we refer to as Part-X (Partitioning with X-distributed sampling). The algorithm sequentially and adaptively partitions the space and evaluates inputs (test vectors) in order to estimate a surrogate for the robustness function in each of the subregions that are sequentially generated. In particular, the surrogates we use are Gaussian processes [santner2013design, mathesen2021stochastic, pedrielli2020extended], which allows us to define a variety of sampling distributions ( in Figure 1). At each iteration, a number of points is sampled in each sub-region to update the corresponding surrogate. Then, we decide whether to stop branching a region based on the fact that the posterior -quantile of the minimum (maximum) predicted robustness is above (below) the zero level, which deems an input to be unsafe. It is important to highlight the complexity associated with the estimation of the minimum and maximum -quantile associated to the Gaussian process , . In this work, we use a Monte-Carlo estimate for these quantities.

At each iteration, a region can be temporarily classified, thus entering the, potentially disconnected, set if the region is classified as not falsifying, while is entered in the opposite scenario. In the case the uncertainty associated to the model(s) is large (which is typically the case at the first iterations of the algorithm) no sub-region is maintained, i.e., everything is branched. If a region reaches the minimum volume , where is the length of the robustness function support along dimension , and is an input parameter. we cannot branch the subregion any more. The algorithm continues until: (i) the maximum number of evaluation is exhausted; (ii) all the subregions have been classified .

Section 3.1 presents the basic definitions for Gaussian processes, which we use to produce predictions of the robustness function, section 3.2, introduces the scheme followed by Part-X to iteratively branch, sample, update subregion models and decide whether to classify each of the subregions.

### 3.1 Modeling the Robustness as a Gaussian Process

A Gaussian process (GP) is a statistical learning model used to build predictions for non-linear, possibly non-convex smooth functions. The basic idea is to interpret the true, unknown function

is a realization from a stochastic process, the Gaussian process. If we can measure the function without noise, then the Gaussian process will interpolate the true function at the evaluated points, while, conditional on the sampled locations

, a Gaussian process produces the conditional density . In particular, , where is the, constant, process mean, and , with being the constant process variance and the correlation matrix. Under the Gaussian correlation assumption, , for . The -dimensional vector of hyperparameters controls the smoothing intensity of the predictor in the different dimensions. The parameters and are estimated through maximum likelihood [santner2013design]: . The best linear unbiased predictor form is [santner2013design]:

 ˆf(x)=ˆμ+rTR−1(f(Xn)−1nˆμ) (1)

where is a set of sampled locations, and is the -dimensional vector having as elements the function value at the sampled locations. The model variance associated to the predictor is:

 s2(x)=τ2⎛⎜ ⎜⎝1−rTR−1r+((1−1TnR−1r)21TnR−11n⎞⎟ ⎟⎠ (2)

where is the -dimensional vector having as elements the Gaussian correlation between location and the elements of , i.e., .

In our application, we use the model in (1) as a surrogate for the, unknown, robustness function. In particular, given a training set of input and associated robustness value , we will predict the robustness value at a new unsampled location . The robustness prediction will have the associated variance .

### 3.2 Sequential adaptive branching with classification

In this section, the model-based sequential adaptive partitioning is presented. Part-X starts considering the entire input and keeps branching until the simulation budget is exhausted (i.e., the maximum number of tests has been executed) or all the non-classified subregions have the a length along all the dimensions (i.e., the subregion is unbranchable), or all the subregions have been classified as either satisfying, violating or unbranchable.

Figure 2 shows an example of the tree produced by the Part-X algorithm at iteration . In general, at each iteration , the Part-X tree be updated with the following subregion (leaf) types:

• New satisfying region (): for each level , at each iteration , the new set of satisfying regions, union of the new satisfying subregions, is , formed by new subregions.

• New positive reclassified region (): for each level , at each iteration , the new set of regions reclassified from positive, union of the new reclassified subregions, is , formed by new subregions.

• New violating region (): for each level , at each iteration , the new set of regions classified as violating, union of the new violating subregions, is , formed by new subregions.

• New negative reclassified region (): for each level , at each iteration , the new set of regions reclassified from violating, union of the new reclassified subregions, is , formed by new subregions.

• New remaining region (): for each level , at each iteration , the new set of regions remaining, union of the new remaining subregions, is , formed by new subregions.

• New unclassified region (): for each level , at each iteration , the new set of regions remaining, union of the new unclassified subregions, is , formed by new subregions.

We refer to as the individual subregion of type , resulting from branching at iteration , at level of the partitioning tree, given the previous definitions, at the th iteration, there will be a number of new subregions. We refer to the union of the positively classified subregions at level , at iteration as for satisfying, reclassified from satisfying, violating, reclassified from violating, undefined and remaining, respectively. Dropping the index will result in the notation for the same sets at each iteration.

Sampling and model estimation Part-X samples in a different way subregions that have been reclassified, or are not classified, i.e., , and regions of type ). Each subregion in the first group requires at least points for the estimation of the Gaussian process (and associated predictor ), and we also add observations that are sampled using a Bayesian optimization approach, thus biasing sampling toward locations that falsify the requirements. In fact, the samples are sequentially collected in a way that maximizes the Expected Improvement [jones1998efficient], and at each sampling iteration , for each subregion , we sample a new location that maximizes the Expected Improvement , namely:

 (3)

Where is the best function value sampled so far in subregion . Once the point has been sampled, we update the Gaussian process, and proceed until evaluations have been performed. We then proceed verifying the branching conditions and possibly updating the partition.

On the other hand, classified subregions receive an overall evaluation budget of evaluations to be distributed across all the subregions in this group. In order to perform such distribution of evaluations, we consider the Gaussian Process predictor , and we add samples to each subregion in this group using the following metric:

 Iγijk=1v(σγijk)∫x0∈σγijk(∫0−∞fγijk(y(x0))dy)dx0. (4)

The basic idea behind the metric in (4) is to sample a region proportionally to the cumulated density below of the Gaussian process. The scalar , representing the volume of the subregion , is used to normalize the indicator so that . Note that, by construction and (4) exists finite.

An overview of the procedure for the sampling phase is reported in Algorithm 1. In general, we refer to as the cumulated number of evaluations in each subregion at level of the partitioning tree at iteration .

Classification Scheme At the end of the sampling stage, we have Gaussian processes, and we need to estimate the -quantile for the minimum and maximum value of the function in each of the subregions. In order to do so, we use the Monte Carlo procedure in Algorithm 2.

Once the estimation procedure is complete, we classify each subregion according to the maximum and minimum quantile of the robustness function across the subregion. Specifically if the maximum quantile in a subregion satisfies the region is classified as violating the requirement and we update . On the other hand, if the region is classified as satisfying the requirement and we update . If a subregion is classified at iteration , we verify if the corresponding classification criteria are violated and, in such scenario, re-classify the region as remaining .

Each time a subregion is classified as either satisfying or violating the requirements, the remaining region () is updated by removing the classified subregions. Given the remaining subregions , a branching algorithm is called that randomly selects a direction and cuts each subregion along that dimension into equal volume subregions. In particular, we allow subregions to be branched in direction only if the size of the hypercube in that dimension is larger than , where is an input parameter, and is the maximum length of the reward support along dimension . Part-X terminates when the maximum number of function evaluations has been reached. We refer to this as the total evaluation budget .

Partitioning with X-distributed sampling The procedure in Algorithm 4 summarizes the phases of the proposed approach. In the algorithm, we use the notation to refer to the volume of a region. Since the regions in Part-X are hyperboxes, volumes are easily computable.

###### Remark 1 (Part-X as evaluation tool).

As mentioned in section 2.3, Part-X can be used by any test generation to evaluate it in its ability to identify falsifying regions. In order to do so Algorithm 4 can be executed without sampling. Doing so, the algorithm will keep branching until a subregion is classified, there are less than points in a subregion, the subregion has reached the minimum volume. Doing so the falsification volume will be computed and different test generation tools can be compared.

## 4 Part-X Theoretical Analysis

Figure 3 shows the 7th, and 8th iterations of Part-X applied to the 2- Himmelblau’s function (section 5.1). The Figure 3 highlights the main quantitie at the core of our algorithm analysis, i.e., the concept of mis-classification volumes for the classified regions (for brevity referred to as and in Figure 3 for the satisfying and violating region, respectively), and mis-classification events , and . The following results provide bounds, with associated guarantees, of the mis-classification volume at each iteration of the algorithm. We start providing an important result from the literature that allows us to analyse the error within a single subregion. Based on this result, we bound misclassification error and recovery of such error, both for satisfying as well violating subregions. Such bound is probabilistic in nature and it is defined at each level of the partitioning tree, and for each iteration of the algorithm (Lemmas 3-4). We then extend these result to the overall error in Theorems 1-2.

### 4.1 Assumptions and Notation

###### Assumption 1.

is a compact space.

###### Assumption 2.

is locally smooth, i.e., there exists a collection of subregions with positive Lebesgue measure such that is smooth within each subregion.

###### Assumption 3.

The hyperparameters , of the Gaussian process model are assumed to be known.

###### Assumption 4.

The initial sample set of points , produces a Gaussian process model that satisfies cross-validation.

###### Assumption 5.

The true function to be optimized, , is bounded over .

###### Definition 1 (Significance).

Let , be the significance of a probabilistic statement at level of the partitioning tree. Then, given any , we will have that the significance at level satisfies:

 αj={αIf j=1αj−1/BIf j>1 (5)
###### Definition 2.

We will refer to the -level set as

 L0={x∈S:f(x)≤0} (6)
###### Definition 3 (ϵ+j,k,ϵ+k,ˆΘ+j,k,ˆΘ+k).

Let denote the volume incorrectly classified as at the th level of the partitioning tree, up to the th iteration of the algorithm. Then, will be the volume incorrectly classified as up to the th iteration of the algorithm. Equivalently, , where .

###### Definition 4 (ϵ−j,k,ϵ−k,ˆΘ−j,k,ˆΘ−k).

Let denote the volume incorrectly classified as at the th level of the partitioning tree, up to the th iteration of the algorithm. Then, will be the volume incorrectly classified as up to the th iteration of the algorithm. Equivalently, , where .

In the first part of the proof, we will assume that (level formulations will also be used applying definitions 3-4).

 0≤ϵ+k≤ϵv(ˆΘ+k)v(S),0≤ϵ−k≤ϵv(ˆΘ−k)v(s).

Where the user-defined parameter represents, for each subregion, the volume that the user tolerates to be wrongly classified.

###### Definition 5 (Misclassification events).

Let denote the mis-classification event at the th iteration to be defined as:

 C+k=⎧⎪ ⎪⎨⎪ ⎪⎩v(ˆΘ+k∩L0)≤ϵv(ˆΘ+k)v(s)⎫⎪ ⎪⎬⎪ ⎪⎭.C−k={v(ˆΘ−k∩L0)≥v(ˆΘ−k)−ϵv(ˆΘ−k)v(s)}. (7)

let event to ensure that the volume of incorrectly classified subregions satisfy the upper bound.

Part-X returns , as an estimate for the true, unknown -level set .

### 4.2 Main Results

The first result we provide exploits the PAC guarantee in [mcallester1999some] (Thm. 1, p. 358), and reformulates it for Part-X when the focus is the classification of the th subregion, at the th level of the tree, at the th iteration of the algorithm. Consider Algorithm 3 that classifies each subregion , then, the following can be proved.

###### Lemma 1 (General PAC guarantee for subregion classification).

Let us refer to as the classification of the subregion . Then, we have for , that with probability at least over the choice of a sample of observations, the subregion , which is such that every element of is consistent with the sample and the subregion is measurable (), satisfies the following.

 εHijk≤ln1P(ˆσHijk)+ln1αj+2lnnjk+1njk.

where, the error rate associated to is defined for a given desired hypothesis (target) to be the probability over the choice of the input over disagrees with .

###### Proof.

This Lemma is a readaptation of the result in Thm. 1, p. 358 of [mcallester1999some], therefore we omit the proof here. ∎

Lemma 2 and 4 bound the classification error for the positively and negatively classified volume, respectively, at each iteration of the Part-X algorithm. Lemma 3 and 5 produce a bound on the reclassified error volume at each level of the partitioning tree, at iteration of the algorithm, conditional upon the event . This volume is important because it is responsible for a decrease in the classification error due to the subregions wrongly classified as safe in previous iterations. Reclassification cannot occur at the first iteration (i.e., ) and the recovered error volume at the th is clearly bounded by the error at the previous iteration .

###### Lemma 2 (Positive classified error bound).

Let denote the -level set defined in (6), and be the set of subregions classified as satisfying at the th level of the Part-X tree, at the th iteration of the algorithm. Refer to as the set of subregions classified as satisfying at the th iteration of the algorithm. Assume the event in Definition 5 to be true. Then, the following holds for the volume misclassified as satisfying at the th iteration of Part-X:

 (8)

where, under event , , and . By construction:

 P(v(ˆσ+k∩L0)≤N+kϵk|Ck)≥k∏j=1N+j,k∏i=1[1−δijk]. (9)

where, under event , .

###### Proof.

By the definition of ,

 P(v(ˆσ+j,k∩L0)≤N+j,kϵ+k|Ck)=P(v(xk:f(xj,k)≤0,xj,k∈ˆσ+j,k)≤N+j,kϵ+k|Ck),∀k. (10)

We assume that there exists , such that

 P(v(xj,k:f(xj,k)≤y+j,k,xj,k∈ˆσ+j,k))=v(xj,k:f(xj,k)≤y+j,k)v(ˆσ+j,k)=N+j,kϵ+kv(ˆσ+j,k). (11)

Therefore, by applying equation (11) to equation (10), we have

 P(v(ˆσ+j,k∩L0)≤N+j,kϵ+k|Ck)=P(v(xj,k:f(xj,k)≤0)≤v(xj,k:f(xj,k)≤y+j,k|Ck)).

which is equal to . Part-X decides the characterization of each subregion based on the minimum quantile. Let us define the event occurring when , such that:

 0≤ϵ+i,j,k≤ϵv(ˆσ+ij,k)v(S).

Consider as the PAC sampling measure, i.e., in Lemma 1, the , where is the set of sampled points in the th subregion at level , at iteration . Such probability is implied by our separable kernel in equations (1)-(2), and it is therefore known at each iteration. Since each subregion has associated a different Gaussian process, we refer to this probability as . Then the probability associated to the event is:

 P(Fijk)=P(v(ˆσ+i,j,k∩L0)≤ϵ+i,j,k|Ck).

where is the th subregion at level at iteration . Let us now refer to the mis-classification probability of this positively classified volume as . Then, from Lemma 1, we know that:

 P⎛⎜⎝ε+ijk≤ln1pijk+ln1αjk+2lnnjk+1njk⎞⎟⎠≥1−αjk.

It is important to connect the classified volume and the error rate . In particular, we have that:

 ε+ijk=[v(ˆσ+i,j,k∩L0)/v(S)]. (12)

And, conditional on the event , the relationship (12) implies the following:

Given the definition of our event , the following holds:

 P(v(ˆσ+i,j,k∩L0)≤ϵ+i,j,k|Ck)≥1−ε+ijk