## 1 Introduction

The multi-armed bandit problem is an idealized mathematical model for sequential decision making in unknown random environments and it has been used to study exploration-exploitation trade-offs. In the problem setup, each arm of the bandit is associated with a cost of playing and an unknown reward distribution . In order to maximize the final reward with a given cost budget, the algorithm plays some arm, collects the stochastically generated reward, and decides on the next arm, until the cost budget is exhausted. Starting from the motivation of designing clinical trials in the 1930s [thompson1935theory, thompson1933likelihood, robbins1952some], there has been major developments in the Bandit theory over the last few decades (see, for example the books [munos:hal-2014, Bubeck:2011, Bubeck12]). Several different strategies have addressed this problem and strong connections have been drawn with other fields such as online learning.

In this paper, we explore how Bandit algorithms can be used for model checking of stochastic systems.
A requirement for a stochastic system usually checks whether the measure of executions of satisfying certain temporal formulas cross certain thresholds [younes2002probabilistic, GburekB18].
Model checking for such requirements can be solved by calculating the exact measure of the executions that satisfy the relevant subformulas of [BustanRV04, HermannsWZ08, JansenKOSZ07, Kwiatkowska:book2004].
In this paper, we focus on the alternative statistical model checking (SMC)
approach which samples some executions of and uses hypothesis testing to infer whether the samples provide
statistical evidence for the satisfaction (or violation) of [younes2002probabilistic, Sen:2005:SMC, Younes05].
Execution data is a costly resource^{1}^{1}1Generating execution data involves running simulations or performing tests., therefore, a number of SMC approaches minimize the expected number of samples needed for verification, for example, using sequential probability ratio tests, Chernoff bound, and Student’s t-distribution.

Several SMC tools have been developed (for example, Ymer [Ymer], VESTA [VESTA:SenVA05], MultiVesta [MVesta], PlasmaLab [Boyer:2013:PFD], MODES [MODES], UPPAAL [UPPAAL], and MRMC [MRMC] ), and they have been used to verify many systems [DDavidLLMPVW11, BaierHHK02, JhaCLLPZ09, KyleHC15, PalaniappanG0HT13, ZulianiPC13, Meseguer:2006:SAD, Martins:2011:SMC]

. Most SMC algorithms crucially rely on fully stochastic models that never make nondeterministic choices. Although recent progress has been made towards verifying Markov Decision Processes with restricted types of schedulers

[lassaigne2015approximate, DBLP:conf/tacas/HartmannsH14, henriques2012statistical], SMC for MDPs remain a challenge problem (see [Agha:2018:SSM] and [Legay:2015] for recent surveys).We will focus on stochastic models that are essentially Discrete Time Markov Chains, except that they are initialized from a (possibly very large) set of states.
In other words, these are Markov Decision Processes (MDPs) where the adversary gets to initialize^{2}^{2}2
Finite number nondeterministic action choices can be encoded in the choice of the initial state..
Further, we restrict our attention to safety requirements^{3}^{3}3All the results in the paper generalize to bounded time properties of the form where is a threshold constant and is a path formula. Generalizing to nested probabilistic operators and unbounded time properties will require further research.

. That is, we study problems that require maximizing (or minimizing) the probability of hitting certain unsafe states, starting from any initial state. Further, this class of models and requirements capture many practical problems like online monitoring where the initialization has to consider worst case error bounds in state estimation, for example, from sensing and perception. We observe that this optimization of a probability measure over a set of initial choices, can coincide with the multi-armed bandit problem for appropriately defined costs and rewards. By building the connection with the Bandit literature, we not only gain algorithmic ideas, but also new types of theoretical (regret) bounds on the sample efficiency of the algorithms. These bounds rely on the smoothness and the near-optimality dimension of the objective function, and are fundamentally different from the existing performance bounds in the SMC literature.

Hierarchical optimistic optimization (HOO) [Bubeck:2011] is a bandit algorithm that builds a tree on a search space by using the so called principle of optimism in the face of uncertainty. It is a black-box optimization method that applies an upper confidence bound (UCB) on a tree search method for finding the optimal points over the uncertain domain. The UCB in the tree search approach takes care of the trade-off between exploiting the most promising parts of the domain and exploring the most uncertain parts of the domain. Multi-fidelity hierarchical optimistic optimization(MFHOO) [sen2019noisy] is a multi-fidelity HOO based method that allows noisy and biased observations from the uncertain domain. The performance of MFHOO is measured by how the regret—the gap between the actual maximum and the computed—scales with the number of samples. A key feature of these algorithms is that they can work with black-box functions and the regret guarantees only rely on certain smoothness parameters and the near-optimality dimension of the problem (see Definition 3.1).

The standard theoretical assumptions required by off-the-shelf bandit algorithms in order to get performance guarantees do not precisely fit our verification problem, and that in-depth analysis and modification is required to get these guarantees in our setting; In addition, to apply these algorithms several functions need to be judiciously determined, a priori, and are at the heart of how the algorithms will perform. These choices are non-trivial and multi-faceted, and we develop and provide such functions explicitly in the context of our SMC problem in order to demonstrate successful application.

The key contributions of the paper are as follows.

First, we show how the MFHOO algorithm, can be used for statistical model checking with provable regret bounds. In the process, we define an appropriate notions of fidelity, bias-functions, and also modify the existing near-optimality dimension required for regret bounds of MFHOO to accommodate the non-smoothness of the typical functions we have to optimize for SMC.

Second, we have built a new SMC tool called HooVer using MFHOO [sen2019noisy]. We have created a practically inspired [simonepresent, FanQM18] suite of benchmark NiMC models that can be useful for safety analysis of driver assistance features in vehicles for standards such as ISO26262 [iso26262]. Using the benchmarks we have carried out a detailed performance analysis of HooVer and our results suggest that the proposed approach can indeed make use of simulations more effectively than existing SMC approaches. A fair comparison of HooVer with other discrete-state model checkers like Prism [HKNP06], Storm [dehnert2017storm], and PlasmaLab [legay2016plasma] is complicated as it relies on a continuous state models. We created discretized models for comparison, and observed that: Compared with exact probabilistic model checking tools like Storm, HooVer is faster, more memory efficient and scales better, and thus it can be used to check models with very large initial state space; Compared with statistical model checking tools like PlasmaLab, HooVer requires much less data to achieve comparable results.

Finally, to our knowledge, this is the first work connecting statistical model checking with the Bandits theory; specifically, the hierarchical tree search using the principle of optimism in the face of uncertainty. Thus we believe that the exposition of these algorithms (Section 3) engender new applications in verification and synthesis algorithms.

## 2 Model and problem statement

Consider a Euclidean space and let denote the non-negative real numbers. For any real-valued function of its support is the set A discrete probability distribution over is a function such that is countable, and We use

to denote the set of discrete probability distributions over

. For a finite set , denotes the cardinality of .### 2.1 Nondeterministically initialized Markov chains

###### Definition 2.1

A Nondeterministically initialized Markov chains (NiMC) is defined by a triple , where: (i) is the state space; (ii) is the set of possible initial states; and (iii) is the probability transition function.

That is, from state , the next state is chosen according to the discrete distribution . The probability of transitioning from state to state is , which we write more compactly as . An execution of length for the NiMC is a sequence of states , where , and for each , We denote the set of all length executions of starting from as . The probability of an execution , given , is .

Given a set , we say an execution *hits* if there exists such that . We denote the subset of executions starting from , of length , that hit by .
From a given initial state the probability of hitting an unsafe state within steps is given by:

(1) |

Note that if then and . We are interested in finding the worst case probability of hitting unsafe states from any initial state of . This can be regarded as determining, for each , the value

(2) |

Importantly, we would like to solve this optimization problem without relying on detailed information about the probability transition function . Further, our solution should not rely on precisely computing for a given , but instead only the use of noisy observations.

### 2.2 Example: Single-lane platoon with two speeds ()

We present an NiMC of a platoon of cars on a single lane (). Variations of this model are used in all our experiments later in Section 5. Each car probabilistically decides to “cruise” or “brake” based on its current gap with the predecessor. These types of models are used for risk analysis of Automatic Emergency Braking (AEB) systems [simonepresent, FanQM18]. The probabilistic parameters of the model are derived from data collected from overhead traffic enforcement cameras on roads. The uncertainty in the initial positions (and gaps) arise from perception inaccuracies, which are modeled as worst-case ranges.

Let be the position of ^{th} car in the sequence. Initially, takes a value in an interval on the -axis such that .
The pseudocode in Figure 1 specifies the probabilistic transition rule that updates the position of all the cars synchronously. Car always moves at a constant breaking speed of .
The variable is the distance of to the predecessor , for each .
If is less than the constant threshold , then continues to cruise with probability and it brakes with probability
Similarly, if greater than , then continues to cruise with probability and brakes with probability

It is straightforward to connect the above description to a formal definition of a NiMC. The state space . The set of initial states is a hyperrectangle in (such that ). For any state, the probability transition function is given by the equations in lines 1, 1, 1, 1 and 1. We define the set of unsafe states for some constant collision threshold .

Given that cars start their motion at any initial state from , the goal is to find the maximum probability of hitting the unsafe set . For , , , , , , and initial , Figure 1 shows estimates of probabilities of hitting the unsafe set from different initial separations between cars. As our intuition suggests, for large enough time horizons the probability of hitting the unsafe set approaches from all initial states, but, for smaller time horizons the maximum probability of unsafety arises when the initial gap is smaller.

## 3 Background: Hierarchical Optimistic Optimization

Multi-Fidelity Hierarchical Optimistic Optimization (MFHOO) [sen2019noisy] is an black-box optimization algorithm from the multi-armed bandits literature [munos:hal-2014, Bubeck:2011, Bubeck12]. The setup is the following: suppose we want to maximize the function , which is assumed to have a unique global maximum. Let . MFHOO allows the choice of evaluating at different fidelities with different costs. This flexibility matters for SMC because it will be beneficial to evaluate the probability of unsafety for certain initial states more precisely, for example, with longer number of simulations, while for other initial states a less precise evaluation may be adequate.

Thus, MFHOO has access to a biased function that depends on fidelity parameter . Setting gives the lowest fidelity (and lowest cost) and corresponds to full fidelity (and highest cost). At full fidelity, , and the evaluation is unbiased. More generally, and evaluating costs , where the functions are respectively, non-increasing and non-decreasing, and called the bias and the cost functions [sen2019noisy].

A bandit algorithm chooses a sequence of sample points (arms) , evaluates them at fidelities , and receives the corresponding sequence of noisy observations (rewards) . We assume that each is drawn from a unknown distribution satisfying

. Further the distribution has a sub-Gaussian component, with variance

, which captures uncertainty in the observations. The algorithm actively chooses based on past choices and observations . When the budget is exhausted, the algorithm decides the optimal point and the optimal value with the aim of minimizing regret, which is defined asThe MFHOO algorithm (Algorithm 1) for selecting estimates by building a binary tree in which each height in the tree represents a partition of the state space . The algorithm maintains estimates of an upper-bound on for each partition subset, and uses the principle of optimism for choosing the next sample . That is, it chooses the samples in those partitions where the estimated upper-bounds are the highest.

Each node in the constructed tree is labeled by a pair of integers , where is the height of the node in the tree, and satisfying is its position within height level . The root is labeled , and each node can have two children and . Node is associated with subset a of , denoted by , where , and for each these disjoint subsets satisfy . Therefore, larger values of represent finer partitions of .

For each node in the tree, the algorithm maintains the following information: (i) : the number of times the node is visited; (ii) : the empirical mean of observations over points visited in ; (iii) : an initial estimate of the maximum of over ; and (iv) : a tighter and optimistic upper bound on the maximum of over

The algorithm proceeds as follows. The starts out with a single node, the root , initializes the -values of its two children and to , and initializes the cost to . At a high-level, in each iteration of the while loop (line 3), the algorithm adds a new node in the and updates all of the above quantities for several nodes in . In more detail, first a from the root to a leaf is found by traversing the child with the higher -value (with ties broken arbitrarily). Let the child with the higher -value of the traversed leaf be (line 4). An arbitrary point in the partition of this node is chosen (line 5). Then, this point is evaluated at fidelity and a reward is received (line 6). Next, is extended by inserting in the (line 8) and for all the nodes in including , the and the empirical mean are updated (line 9). Finally, in line 13, for all nodes in , and are updated using the smoothness parameters and which will discussed later in Section 3.1 and the parameter . Once the sampling budget is exhausted, a leaf with maximum -value is returned by the Algorithm 1 [sen2019noisy].

### 3.1 Regret bounds for MFHOO

In this section, we summarize the assumptions and results from [sen2019noisy]. In order to analyze the regret bounds for the MFHOO algorithm, the following assumption on the smoothness of [sen2019noisy] is made.

###### Assumption 1

There exist and such that for all satisfying (for a constant ), we have that , for all .

This assumption connects the function to the partitioning rule of the binary tree. We now define the concept of near-optimality dimension which plays an important role in the analysis of black-box optimization algorithms [Bubeck:2011, sen2019noisy, grill2015black, valko2013stochastic]. It measures the dimension of sets that are close to optimal. The regret bound for MFHOO uses this notion. First, given a partitioning scheme over , we define as the number of -near optimal partitions, that is, the number of partitions such that satisfies .

###### Definition 3.1

The near-optimality dimension of with respect to parameters is:

With Assumption 1, the regret bound for MFHOO is proved in [sen2019noisy].

###### Theorem 1

## 4 Statistical Model Checking with Optimistic Optimization

We aim to solve the statistical model checking problem of maximizing of Equation (2) for a given NiMC and a time horizon , using MFHOO. In order to apply the MFHOO algorithm, one has to make several critical choices regarding the objective function, the budget, the cost, the parameters for fidelity and smoothness, and the multi-fidelity bias function. In this section we discuss the rationale behind our choices.

### 4.1 Objective function, budget, cost, and fidelity

#### Fidelity parameter .

Consider a NiMC with the unsafe set . We have to maximize over all initial states , and for a long time horizon . Given , the fidelity of evaluating will depend on the actual length of the simulations drawn for creating the observation for the state . Suppose we fix as the shortest simulation to be used. We define the fidelity of an observation (or evaluation) with simulations of length as .

#### Objective function and observations.

A natural choice for the objective function would be to define , for any initial state . Computing this probability, however, is infeasible when the probability transition function is unknown. Even if is known, calculating involves summing over many executions (as in (1)). Instead, we take advantage of the fact that MFHOO can work with noisy observations. For any initial , and execution we define the observation:

(3) |

Recall that for a fixed initial states , is a Markov chain and defines a probability distribution over the set of executions as given by Equation (1). Thus, given an initial state , with probability , and with probability . That is,

is a Bernoulli random variable with mean

at fidelity . In MFHOO, once an initial state is chosen (line 5), we simulate upto steps several times starting from and calculate the empirical mean of , which serves as the noisy observation at fidelity .#### Cost and budget .

In our setup the cost function for any is the computational time required to simulate an execution for any , where is the execution length corresponding to the fidelity . The budget is a computational cost budget.

The next proposition states that with these choices, our goal to maximize over can be achieved using MFHOO.

###### Proposition 1

### 4.2 Multi-fidelity bias function

Recall that the bias function gives an upper bound , over all and for any fidelity . We will derive a bias function satisfying . Of course, a guarantee like this is only possible for known models. Therefore, for this section we will assume that the NiMC is known. We also assume is finite and is the absorbing set for (i.e., all other states are transient; if and only if ).

Fixing an initial state , is a reducible Markov chain. Let be the number of transient states and . Then, the probability transition function can be represented by the matrix :

(4) |

Here is an matrix, is an identity matrix, is an zero matrix, and is a nonzero matrix. The following standard result gives the absorption probabilities in terms of .

###### Proposition 2 ( [grinstead2012introduction])

Suppose is a matrix, where and is as defined in (4). Then is the probability, that starting from state , the chain is absorbed in .

We now state and prove the theorem that defines a multi-fidelity bias function.

###### Theorem 2

For any and any time horizon ,

where

is the maximum eigenvalue,

is the matrix of eigenvectors of the matrix

, and is the (-norm) condition number of .This theorem can be proved using the spectral decomposition of matrix and -norm bounds (for detailed analysis please refer to Appendix B).

###### Remark 1

Using the fidelity parameter in the upper bound given by Theorem 2, the bias function can be rewritten as:

(5) | ||||

This bias function upper bounds the gap between the probability of hitting the unsafe set for a point in the initial set for different time horizons. Thus, it can be used in the analysis of the theoretical regret bound of the Algorithm 1 given in Theorem 1. If the transition matrix is known, then the bias function can be used in updating the -values of nodes in the Algorithm 1. Note, this bias function depends on parameters and . In the case where the full transition model of the NiMC is unknown, but there is access to and , this function can be utilized in the algorithm. More generally, in problems with unknown and no access to the parameters and , we consider a linear parameterized bias function with an unknown parameter , and adaptively estimate the parameter [sen2019noisy].

### 4.3 Non-smoothness and non-uniqueness of hitting probabilities

In this section, we start with the observation that in general the function over is not a continuous function and has infinitely many maxima. Thus this function does not satisfy the Assumption 1 and the assumption of finite number of maxima, that are required for the regret bounds of Theorem 1. However, a modified definition of near-optimality dimension we show that the bounds given by Theorem 1 will hold. Finally, we will compare the theoretical regret with the actual regret achieved by MFHOO.

Consider a discrete time linear system with state space , set of initial states and set of unsafe states given by

for some vector

and . The system evolves according to:(6) |

where is the state vector at time . is a discrete random vector with probability mass distribution at sequence , and and are and matrices, respectively.

Consider two initial states such that , for some small . Writing out the probabilities of hitting unsafe states of this system explicitly for the two initial states and finding the gap between these probabilities we can observe that, for small enough , the gap between the probability is zero^{4}^{4}4
The full analysis is given in Appendix D.. By increasing the , the gap can take a nonzero value. This means that the probability of hitting the unsafe set of the system (13) is not a continuous function in terms of the initial states and has infinitely many maxima. This is because random variable is a discrete-type random variable. is a concrete example of this kind.

Assumption 1 asserts that there exist smoothness parameters such that for all in the tree, and all the partitions that are near optimal, the gap between and the value of the function over those partitions should be bounded. From the above discussion we see that this may not hold for all depths . In addition, the regret bounds in Theorem 1 are for function with finite number of maxima. If there are infinitely many maxima, then for given smoothness parameters the number of -near-optimal partitions would not be bounded for all . However, we observe that any instance of MFHOO runs on a finite budget and the final constructed tree has a maximum height . For this , there exist smoothness parameters and that satisfy the Assumption 1, and we can modify the near-optimality dimension in the Definition 3.1 as:

###### Definition 4.1

-bounded near-optimality dimension of with respect to is: .

With this modified definition, there exists a satisfying the above condition and the corresponding can be used to recover the regret bound of the Theorem 1.

#### Regret in theory and in practice.

Given a budget , let be the point that the Algorithm 1 returns after the budget is exhausted. Then the regret would be . We compare the theoretical regret bound and the actual regret for the model. Consider an instance of with , initial states and . Intuitively, the partition corresponding to and would maximize the probability of hitting . In Figure 2 (top-left), the mean of actual regret obtained by running MFHOO for various smoothness parameters is presented. As the budget increases, as expected, the actual regret decreases monotonically and approximately approaches to .

For budget , the partitioning stops at . We can derive the number of -near-optimal partitions for any and . Setting (actual values used to generate regret results), we see that for , the maximum number of -near-optimal partitions . This is because the total number of partitions at depth is equal to , and for , the -near-optimal partitions belong to the area corresponding to and , whose area is . According to Definition 4.1, for different values of , different -bounded near-optimality dimensions can be obtained. We are looking for the pair of values that minimize the theoretical regret bound of Theorem 1. Figure 2 (top-right) represents the number of -near-optimal partitions for that are upper bounded by for different values of and their corresponding . Figure 2 (bottom) also, represent the theoretical upper bound vs. and their corresponding for different values of smoothness parameters used to generate the actual regret. As it is seen, the best theoretical regret bounds that can be achieved is approximately .

## 5 HooVer tool and experimental evaluation

We have implemented a prototype tool called HooVer which uses MFHOO for solving the SMC problem of (2). We compare the performance of HooVer with that of Prism [HKNP06], Storm [dehnert2017storm], and PlasmaLab [legay2016plasma] on several benchmarks we have created.

### 5.1 Benchmark models

We have created several NiMC models for evaluation of probabilistic and statistical model checking tools. The benchmarks are variants of (Section 2.2). The complete models are described in Appendix C. The executable models are available from [SMCOO-online:2020].

models a sequence of cars on a single lane. At each step, each car can choose to move with one of three speeds: , , and . The first vehicle moves at a constant speed. For all the others, the speed is chosen probabilistically, according to distributions that depend on their distance to the preceding vehicle. For example, if the longitudinal distance is less than a threshold then the speed for vehicle is chosen according to a probability distribution , and so on. The state variables, the initial states, and the unsafe set are defined in the same way as in .

models cars on lanes. At every step, each car probabilistically chooses to either move forward like a vehicle in , or it changes to its left or right lane. These actions are chosen probabilistically, according to probability distributions that depend on their distances to the vehicles on its current lane, as well as left and right lanes. In addition to the longitudinal position , each vehicle has a second state variable which keeps track of its current lane. The initial value of each is chosen as in the case of , and is set to 1. The unsafe set is defined based on the distance to the preceding car on the same lane.

### 5.2 HooVer implementation and metrics

Our implementation of the HooVer tool uses the MFHOO implementation presented in [sen2019noisy] to solve the model checking problem of Equation (2). It works in two stages: First, it uses MFHOO to find the best partition and a putative “best” (most unsafe) initial point with the maximum estimate for the probability of hitting the unsafe set . In the second stage, HooVer uses additional simulations to do a Monte Carlo estimation of the probability . In the experiments reported below, a constant number of K simulations are used in all experiments in the second stage.

To achieve the theoretically optimal performance, MFHOO requires the smoothness parameters and which are unknown for our benchmarks. To circumvent this HooVer chooses several parameter configurations ( sets in our experiments), runs an instance of MFHOO for each, and returns the result with the highest hitting probability. For each instance of MFHOO, we set a time budget which is the maximum time allowed to be consumed by the simulator.

#### Metrics.

We report the regret of HooVer. In order to calculate the regret, first we have to calculate the actual maximum hitting probability for each benchmark. This is computed using Prism [HKNP06] which uses numerical and symbolic analysis. The regret is the difference between the exact probability and the estimated probability. Then, we report the running time and memory usage. The memory usage is measured by calculating the total size of the Python object which contains the constructed tree and all other data of MFHOO. We also report the number of queries for each method, which is total number of simulations used in both stage 1 and 2. All the experiments are conducted on a Linux workstation with two Xeon Silver 4110 CPUs and 32 GB RAM.

Table 1 shows the running time, the memory usage, the number of queries, and the resulting actual regret for HooVer using MFHOO as well as HooVer(1) using HOO. On every benchmark, HooVer gives low regrets. With the same simulation budget, HooVer devotes longer simulations in the interesting parts , as a consequence, it is usually faster than HooVer(1) as shown in Figure 3.

Model | Method | Time(s) | Memory(MB) | #queries | Regret |
---|---|---|---|---|---|

Storm | 68.88 | 1019 | NA | 0 | |

PlasmaLab | 37.73 | NA | 749711 | 0.0017 0.0020 | |

HooVer(1) | 59.37 | 9.13 | 31381 | 0.0011 0.0025 | |

HooVer | 24.03 | 6.23 | 29415 | 0.0002 0.0020 | |

Storm | 89.10 | 1974 | NA | 0 | |

PlasmaLab | 22.61 | NA | 749711 | 0.0022 0.0021 | |

HooVer(1) | 44.52 | 7.30 | 30315 | 0.0038 0.0113 | |

HooVer | 26.46 | 4.53 | 28520 | 0.0010 0.0018 | |

Storm | NA | OOM | NA | NA | |

PlasmaLab | 49.14 | NA | 749711 | 0.0032 0.0026 | |

HooVer(1) | 110.64 | 9.37 | 31555 | 0.0016 0.0025 | |

HooVer | 76.35 | 5.27 | 28955 | 0.0007 0.0043 |

### 5.3 Comparison with other model checkers

We compare the performance of HooVer with other model checkers. Prism [HKNP06] and Storm [dehnert2017storm] are leading probabilistic model checkers for Markov chains and MDPs and compute exact probability of reaching the unsafe states. As Storm has the same functionality as Prism and we found it to be much more efficient than Prism in all our experiments, we only compare HooVer with Storm. PlasmaLab [legay2016plasma] is one of the few statistical model checkers that can handle MDPs. For probability estimation problems, PlasmaLab uses smart sampling algorithm [d2015smart] to efficiently assign the simulation budgets to each scheduler and then estimates the probability for the putative “best” scheduler.

#### Discretizing and scaling the benchmarks.

The theory for MFHOO is based on a continuous state spaces, however, most model checking tools, including the ones mentioned above, are designed for discrete state space models and they do not support floats as state variables. Therefore, a direct comparison of the approaches on the same model is infeasible, and we created equivalent, discretized (quantized) versions of the benchmarks.

In HooVer, the algorithm keeps partitioning hierarchically and stops at a depth , which can be considered as searching over all the partitions at depth . In order to make a fair comparison, we make sure that the discrete version of the benchmark has initial states, i.e. where is the discretized initial state space.

Before stating the discretizing process, we give two key observations of the benchmarks. First, considering the nature of our benchmarks, it is obvious that if we scale the velocities, distance thresholds and initial distances by a constant factor, the probability of hitting unsafe set doesn’t change. Second, taking as an example, we set all the constants (velocities and distance thresholds) in the model as integers, which leads to the function shown in Figure 1. It’s clear that for any state , there exists an integer state such that . If we make sure that for each integer interval in the original continuous space, the discretized space has an value in that interval, then the maximum probability doesn’t change. With these observations, we discretize the benchmark as follows. First, we sample points uniformly in the continuous initial state space. Due to the second observation mentioned above, if is larger than the number of integer points in the original space, the maximum probability doesn’t change. Then, we find an integer scaling factor such that by multiplying all the points become integer points. Other constants in the model are also multiplied by . According to the first observation, scaling with a constant doesn’t change the probability. Thus, we now have a model where all state variables are integers and it has the same maximum probability as the original model. Then we evaluate Storm and PlasmaLab on this new model.

All of the model checkers mentioned above support the Prism [HKNP06] language. Thus, we implement each benchmark in Prism language, and then we check the equivalence between the Prism implementation and the Python implementation by calculating and comparing the probability at all integer points .

For Storm, we report the running time and the memory usage. These metrics are directly measured by the software itself. For PlasmaLab, we report the running time. We do not report the memory usage of PlasmaLab because it doesn’t provide an interface for that and it is hard to track the actual memory used by the algorithm inside the JAVA virtual machine. We also report the regret for PlasmaLab. We use the term “regret” here just for simplicity, which also refers to the difference between the estimated probability and the exact probability.

The results of HooVer, PlasmaLab and Storm are summarized in Table 1. We show in Figure 3 how the performance of different methods changes as the increases. As the size of the initial state space increases, the memory usage of Storm grows quickly, which limits its application on large models. In contrast, HooVer and PlasmaLab scale well. The running time of PlasmaLab is determined by the parameters of the smart sampling [d2015smart] algorithm. We use the same parameters regardless of , and thus the running time of PlasmaLab is almost constant.

As shown in Table 1, Storm fails to check the model due to the memory limitation. Compared with PlasmaLab, HooVer requires more running time. However, we note that PlasmaLab is a considerably more mature tool than HooVer. As shown in Table 1, compared to PlasmaLab, HooVer requires much fewer queries to reach comparable regrets, which attests to the sample efficiency of our proposed method.

## 6 Conclusions

In this paper, we formulated the statistical model checking problem for a special type of MDP models as a multi-armed bandit problem and showed how a hierarchical optimistic optimization algorithm from [sen2019noisy] can be used to solve it. In the process, we modified the existing definition of near-optimality dimension to accommodate the non-smoothness of the typical functions we have to optimize and recovered the regret bounds of the algorithm. We created several benchmarks, developed a SMC tool HooVer, and experimentally established the sample efficiency and scalability of the method.

In order to enlarge the area of application of this method, it is necessary to study more general temporal properties for NiMC beyond bounded time safety. In this paper, we focused on a very special type of MDP models with nondeterminism only in the initial set. The results suggest that it will be interesting to study black-box optimization algorithms for model checking more general MDP models.

## References

## Appendix A Multi-Fidelity Parallel Optimistic Optimization (MFPOO)

The optimal smoothness parameters are generally not known. Algorithm 2 [sen2019noisy] adaptively searches for the optimal smoothness parameters by releasing several MFHOO instances with various and values.

The following theorem gives the upper bound for the algorithm 2.

###### Theorem 3

It is noted that if the Algorithm 1 is run at full fidelity case , then it will be equivalent to a budgeted HOO algorithm. At this case the simple regret bound will be:

(7) |

## Appendix B Proofs

###### Proof 1

(Theorem 2) We define a new Markov chain that starts from (like ) and stops by going entering a new absorbing state called Stop, in steps, in expectation. That is, from any transient state , the one step transition probability of hitting is . For any other transient state , the one step probability of transitioning from to is . Therefore, the probability transition matrix for the new Markov chain is as follows.

where 1 is a vector whose elements are , and the state is the last state. is a

row stochastic matrix. Then after

steps, would beFor the Markov chain ,