I Introduction
Probabilistic Computation Tree Logic (PCTL) is frequently used to formally specify control objectives, such as reachability and safety, on probabilistic systems [1]. To check the correctness of PCTL specifications on these systems, model checking methods are required [2]. Although model checking PCTL by model-based analysis is theoretically possible [1], it is not preferable in practice when the system model is unknown or large. In these cases, model checking by sampling, i.e., statistical model checking (SMC), is needed [3, 4].
The statistical model checking of PCTL specifications on Markov Decision Processes (MDPs) is frequently encountered in many decision problems – e.g., for a robot in a grid world under probabilistic disturbance, checking whether there exists a feasible control policy such that the probability of reaching certain goal states is greater than a probability threshold [5, 6, 7]. In these problems, the main challenge is to search for such a feasible policy for the PCTL specification of interest.
To search for feasible policies for temporal logic specifications, such as PCTL, on MDPs, one approach is model-based reinforcement learning [8, 9, 10, 11] – i.e., first inferring the transition probabilities of the MDP by sampling over each state-action pair, and then searching for the feasible policy via model-based analysis. This approach is often inefficient, since not all transition probabilities are relevant to the PCTL specification of interest. Here instead, we adopt a model-free reinforcement learning approach [12].
Common model-free reinforcement learning techniques cannot directly handle temporal logic specifications. One solution is to find a surrogate reward function such that the policy learned for this surrogate reward function is the one needed for checking the temporal logic specification of interest. For certain temporal logics interpreted under special semantics (usually involving a metric), the surrogate reward can be found based on that semantics [13, 14, 15].
For temporal logics under the standard semantics [16], the surrogate reward functions can be derived by constructing the product MDP [17, 18, 7] of the initial MDP and the automaton realizing the temporal logic specification. The disadvantages of this approach are that 1) the complexity of constructing the automaton from the temporal logic specification is doubly exponential; and 2) the product MDP is much larger than the initial MDP [16].
In this work, we propose a new statistical model checking method for PCTL specifications on MDPs. For a lucid discussion, we only consider non-nested PCTL specifications. PCTL formulas in general form with nested probabilistic operators can be handled in the standard manner using the approach proposed in [19, 20]. Our method uses upper-confidence-bound (UCB) based Q-learning to directly learn the feasible policy of PCTL specifications, without constructing the product MDP. The effectiveness of UCB-based Q-learning has been proven for the bandit problem, and has been numerically demonstrated on many decision-learning problems on MDPs (see [21]).
For bounded-time PCTL specifications, we treat the statistical model checking problem as a finite sequence of bandit problems and use UCB-based Q-learning to learn the desirable decision at each time step. For unbounded-time PCTL specifications, we look for a truncation time to reduce the problem to a bounded-time one by checking the PCTL specification and its negation at the same time. Our statistical model checking algorithm is online; it terminates with probability 1, and only when the statistical error of the learning result is smaller than a user-specified value.
The rest of the paper is organized as follows. The preliminaries on labeled MDPs and PCTL are given in Section II. In Section III, using the principle of optimism in the face of uncertainty, we design Q-learning algorithms to solve finite-time and infinite-time probabilistic satisfaction, and give finite-sample probabilistic guarantees for the correctness of the algorithms. We implement and evaluate the proposed algorithms on several case studies in Section IV. Finally, we conclude this work in Section V.
II Preliminaries and Problem Formulation
The sets of integers and real numbers are denoted by $\mathbb{Z}$ and $\mathbb{R}$, respectively. For $N \in \mathbb{Z}_{>0}$, let $[N] = \{1, \ldots, N\}$. The cardinality of a set $S$ is denoted by $|S|$. The set of finite-length sequences taken from a finite set $S$ is denoted by $S^{*}$.
II-A Markov Decision Process
A Markov decision process (MDP) is a finite-state probabilistic system, where the transition probabilities between the states are determined by the control action taken from a given finite set. Each state of the MDP is labeled by a set of atomic propositions indicating the properties holding on it, e.g., whether the state is a safe/goal state.
Definition 1
A labeled Markov decision process (MDP) is a tuple $\mathcal{M} = (S, A, P, \Pi, L)$ where

- $S$ is a finite set of states.

- $A$ is a finite set of actions.

- $P : S \times A \times S \to [0, 1]$ is a partial transition probability function. For any state $s \in S$ and any action $a \in A$, $\sum_{s' \in S} P(s, a, s') \in \{0, 1\}$. With a slight abuse of notation, let $A(s) = \{a \in A : \sum_{s' \in S} P(s, a, s') = 1\}$ be the set of allowed actions on the state $s$.

- $\Pi$ is a finite set of labels.

- $L : S \to 2^{\Pi}$ is a labeling function.
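For concreteness, the definition above can be rendered as a small data structure. The following Python sketch is illustrative only; the class name `LabeledMDP`, its methods, and the two-state example are ours, not part of the paper (whose implementation is in Scala).

```python
import random

class LabeledMDP:
    """Minimal sketch of a labeled MDP (S, A, P, Pi, L)."""
    def __init__(self, trans, labels):
        # trans: dict (state, action) -> {next_state: probability}
        # labels: dict state -> set of atomic propositions (the function L)
        self.trans = trans
        self.labels = labels

    def actions(self, s):
        # A(s): the actions allowed in state s (where P(s, a, .) sums to 1)
        return [a for (s2, a) in self.trans if s2 == s]

    def step(self, s, a, rng=random):
        # Sample a successor state s' from P(s, a, .)
        r, acc = rng.random(), 0.0
        for s2, p in self.trans[(s, a)].items():
            acc += p
            if r <= acc:
                return s2
        return s2  # guard against floating-point round-off

# Hypothetical two-state example: from s0, action 'u' reaches a goal
# state labeled {'b'} with probability 0.3.
mdp = LabeledMDP(
    trans={("s0", "u"): {"goal": 0.3, "s0": 0.7},
           ("goal", "u"): {"goal": 1.0}},
    labels={"s0": set(), "goal": {"b"}},
)
print(mdp.actions("s0"))  # ['u']
```

A policy in the sense of Definition 2 then simply maps the visited state sequence to one of the actions returned by `actions`.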
Definition 2
A policy $\sigma : S^{*} \to A$ decides the action to take from the sequence of states visited so far. Given a policy $\sigma$ and an initial state $s_0$, the MDP $\mathcal{M}$ becomes purely probabilistic, denoted by $\mathcal{M}_{\sigma}(s_0)$. The system is not necessarily Markovian, since the policy may depend on the whole history.
II-B Probabilistic Computation Tree Logic
The probabilistic computation tree logic (PCTL) is defined inductively from atomic propositions, temporal operators, and probability operators. It reasons about the probabilities of time-dependent properties.
Definition 3 (Syntax)
Let $\Pi$ be a set of atomic propositions. A PCTL state formula is defined by

$\phi ::= \mathrm{P}^{\max}_{\bowtie p}[\psi] \mid \mathrm{P}^{\min}_{\bowtie p}[\psi], \qquad \psi ::= \mathrm{X}\, b \mid a\, \mathrm{U}^{\leq N} b \mid a\, \mathrm{R}^{\leq N} b,$

where $a, b$ are propositional formulas over $\Pi$, $\bowtie \in \{<, >\}$, $N \in \mathbb{Z}_{>0} \cup \{\infty\}$ is a (possibly infinite) time horizon, and $p \in (0, 1)$ is a threshold.¹ (¹This logic is a fragment of PCTL from [16].) The operators $\mathrm{P}^{\max}_{\bowtie p}$ and $\mathrm{P}^{\min}_{\bowtie p}$ are called probability operators, and the "next", "until" and "release" operators $\mathrm{X}$, $\mathrm{U}^{\leq N}$, $\mathrm{R}^{\leq N}$ are called temporal operators.

More operators can be derived by composition: for example, "or" is $a \vee b = \neg(\neg a \wedge \neg b)$; "true" is $\top = a \vee \neg a$; "finally" is $\mathrm{F}^{\leq N} b = \top\, \mathrm{U}^{\leq N} b$; and "always" is $\mathrm{G}^{\leq N} b = \bot\, \mathrm{R}^{\leq N} b$. For simplicity, we write $\mathrm{U}^{\leq \infty}$, $\mathrm{F}^{\leq \infty}$, and $\mathrm{G}^{\leq \infty}$ as $\mathrm{U}$, $\mathrm{F}$, and $\mathrm{G}$, respectively.
Definition 4 (Semantics)
For an MDP $\mathcal{M}$, the satisfaction relation $\models$ is defined for a state or a path inductively in the standard way [16], where $\omega \sim \mathcal{M}_{\sigma}(s)$ means that the path $\omega$ is drawn from the MDP $\mathcal{M}$ under the policy $\sigma$, starting from the state $s$. The PCTL formula $\mathrm{P}^{\max}_{\bowtie p}[\mathrm{X}\, b]$ (or $\mathrm{P}^{\min}_{\bowtie p}[\mathrm{X}\, b]$) means that the maximal (or minimal) probability over policies of satisfying "next $b$" is $\bowtie p$. The PCTL formula $\mathrm{P}^{\max}_{\bowtie p}[a\,\mathrm{U}^{\leq N} b]$ (or $\mathrm{P}^{\min}_{\bowtie p}[a\,\mathrm{U}^{\leq N} b]$) means that the maximal (or minimal) probability over policies that $a$ holds "until" $b$ holds is $\bowtie p$.
III Non-Nested PCTL Specifications
In this section, we consider the statistical model checking of non-nested PCTL specifications using upper-confidence-bound based Q-learning. For simplicity, we focus on specifications of the forms $\mathrm{P}^{\max}_{\bowtie p}[\mathrm{X}\, b]$ and $\mathrm{P}^{\max}_{\bowtie p}[a\,\mathrm{U}^{\leq N} b]$, where $a$ and $b$ are atomic propositions. Other cases can be handled in the same way. We discuss the case of $N = 1$ in Section III-A, the case of finite $N$ in Section III-B, and the case of $N = \infty$ in Section III-C. Similar to other works on statistical model checking [3, 4], we make the following assumption.
Assumption 1
For $\mathrm{P}^{\max}_{\bowtie p}[\psi]$ and $\mathrm{P}^{\min}_{\bowtie p}[\psi]$ with $\bowtie \in \{<, >\}$ and $p \in (0, 1)$, we assume that $\max_{\sigma} \mathbb{P}_{\omega \sim \mathcal{M}_{\sigma}(s_0)}(\omega \models \psi) \neq p$ and $\min_{\sigma} \mathbb{P}_{\omega \sim \mathcal{M}_{\sigma}(s_0)}(\omega \models \psi) \neq p$, respectively.
When it holds, as the number of samples increases, the samples will be increasingly concentrated on one side of the threshold $p$ by the Central Limit Theorem. Therefore, a statistical analysis based on the majority of the samples has increasing accuracy. When it is violated, the samples would be evenly distributed between the two sides of the threshold $p$, regardless of the sample size. Thus, no matter how the sample size increases, the accuracy of any statistical test would not increase. Compared to statistical model checking algorithms based on sequential probability ratio tests (SPRT) [22, 23], no assumption on an indifference region is required here. Finally, by Assumption 1, we have the additional semantic equivalences between PCTL specifications: $\mathrm{P}_{>p} \equiv \mathrm{P}_{\geq p}$ and $\mathrm{P}_{<p} \equiv \mathrm{P}_{\leq p}$; thus, we will not distinguish between them below.

For further discussion, we first identify a few trivial cases. For the atomic propositions $a$ and $b$, let

(1)  $S_{\mathrm{yes}} = \{s \in S : b \in L(s)\}, \qquad S_{\mathrm{no}} = \{s \in S : a \notin L(s),\ b \notin L(s)\}.$

Then for any policy $\sigma$, $\mathbb{P}_{\omega \sim \mathcal{M}_{\sigma}(s)}(\omega \models a\,\mathrm{U}^{\leq N} b) = 1$ if $s \in S_{\mathrm{yes}}$; and $\mathbb{P}_{\omega \sim \mathcal{M}_{\sigma}(s)}(\omega \models a\,\mathrm{U}^{\leq N} b) = 0$ if $s \in S_{\mathrm{no}}$. The same holds for $N = \infty$ by enlarging $S_{\mathrm{no}}$ with the union of the end components of the MDP labeled by $\neg b$ (this only requires knowing the topology of the MDP) [16]. In the rest of this section, we focus on handling the nontrivial case $s \in S \setminus (S_{\mathrm{yes}} \cup S_{\mathrm{no}})$.
III-A Single Time Horizon
When $N = 1$, for any $s \in S \setminus (S_{\mathrm{yes}} \cup S_{\mathrm{no}})$, the PCTL path formula $a\,\mathrm{U}^{\leq 1} b$ (or $\mathrm{X}\, b$) holds on a random path starting from the state $s$ if and only if the next state lies in $S_{\mathrm{yes}}$, where $S_{\mathrm{yes}}$ and $S_{\mathrm{no}}$ are from (1). Thus, it suffices to learn from samples whether

(2)  $\max_{a' \in A(s)} Q(s, a') > p,$

where

$Q(s, a') = \mathbb{P}_{s' \sim P(s, a', \cdot)}\big(s' \in S_{\mathrm{yes}}\big),$

and $s' \sim P(s, a', \cdot)$ means $s'$ is drawn from the transition probability for the state $s$ and the action $a'$. This is a multi-armed bandit problem; we solve it by upper-confidence-bound strategies [24, 25].
Specifically, at the iteration $t$, let $n_t(s, a', s')$ be the number of samples of the one-step path $(s, a', s')$, and with a slight abuse of notation, let

(3)  $n_t(s, a') = \sum_{s' \in S} n_t(s, a', s').$
The unknown transition probability function $P(s, a', s')$ is estimated by the empirical transition probability function

(4)  $\hat{P}_t(s, a', s') = n_t(s, a', s') / n_t(s, a').$
And the estimate of $Q(s, a')$ from the existing samples is

(5)  $\hat{Q}_t(s, a') = \sum_{s' \in S_{\mathrm{yes}}} \hat{P}_t(s, a', s').$
Since the value of the Q-function $Q(s, a')$ is bounded within $[0, 1]$, we can construct a confidence interval for the estimate $\hat{Q}_t(s, a')$ with statistical error at most $\varepsilon$ using Hoeffding's inequality by

(6)  $Q(s, a') \in \big[\hat{Q}_t(s, a') - c_t(s, a'),\ \hat{Q}_t(s, a') + c_t(s, a')\big], \qquad c_t(s, a') = \sqrt{\ln(2/\varepsilon) / (2\, n_t(s, a'))},$

where we set the value of the division to be $+\infty$ for $n_t(s, a') = 0$.
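To make the estimates (4)-(6) concrete, the empirical Q-estimate and its Hoeffding interval can be computed as follows. This Python sketch uses our own names (`hoeffding_halfwidth`, `q_estimate`) and invented sample counts; it follows the standard two-sided Hoeffding bound for means of $[0, 1]$-valued samples.

```python
import math

def hoeffding_halfwidth(n, eps):
    # Half-width of a (1 - eps) confidence interval for the mean of
    # n i.i.d. samples in [0, 1]; +infinity when n = 0, matching the
    # division convention used in the text.
    if n == 0:
        return float("inf")
    return math.sqrt(math.log(2.0 / eps) / (2.0 * n))

def q_estimate(counts, sat_next):
    # counts: {next_state: observed one-step transitions for (s, a)};
    # sat_next: successor states satisfying the target proposition.
    # Returns the empirical Q-estimate and the total sample count.
    n = sum(counts.values())
    q_hat = sum(c for s2, c in counts.items() if s2 in sat_next) / n
    return q_hat, n

# Invented data: 100 samples of one state-action pair, 27 hit the target.
q_hat, n = q_estimate({"goal": 27, "s0": 73}, {"goal"})
half = hoeffding_halfwidth(n, eps=0.01)
print(q_hat, round(half, 3))  # 0.27 0.163
```

Note how slowly the interval shrinks: halving the half-width requires four times as many samples, which is why the choice of sampling policy below matters.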
Remark 1
We use Hoeffding’s bounds to yield hard guarantees on the statistical error of the model checking algorithms. Tighter bounds like Bernstein’s bounds [26] can also be used, but they only yield asymptotic guarantees on the statistical error.
The sample efficiency for learning for the bandit problem (2) depends on the choice of the sampling policy, decided from the existing samples. A provably best solution is to use the Q-learning from [24, 25]. Specifically, an upper confidence bound (UCB) is constructed for each state-action pair using the number of samples and the observed reward, and the action with the highest possible reward, namely the UCB, is chosen. The sampling policy is chosen by greedily maximizing the possible reward:

(7)  $\sigma_t(s) \in \arg\max_{a' \in A(s)} \big(\hat{Q}_t(s, a') + c_t(s, a')\big).$

The action is chosen arbitrarily when there are multiple candidates. The choice of the upper confidence bound in (7) ensures that the policy giving the upper bound of the value function is sampled most frequently in the long run.
To initialize the iteration, the Q-function is set to

(8)  $\hat{Q}_0(s, a') = 1 \quad \text{for all } s \in S \text{ and } a' \in A(s),$

to ensure that every state-action pair is sampled at least once. The termination condition of the above algorithm is

(9)  $\max_{a' \in A(s)} \big(\hat{Q}_t(s, a') - c_t(s, a')\big) > p \quad \text{or} \quad \max_{a' \in A(s)} \big(\hat{Q}_t(s, a') + c_t(s, a')\big) < p,$

returning true in the former case and false in the latter, where $p$ is the probability threshold in the non-nested PCTL formula.
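Putting the estimate (5), the UCB rule (7), and the stopping rule (9) together, the single-horizon check reduces to the bandit loop sketched below. This Python sketch is ours, not the paper's Algorithm 1: it splits the error budget equally across actions as a simplification, and the two-action bandit at the end is a hypothetical example.

```python
import math, random

def smc_single_step(sample, actions, p, eps, max_iter=200000, rng=random):
    # Check max_a Q(a) > p by UCB sampling, in the spirit of (7) and (9).
    # sample(a) returns 1 if the sampled successor satisfies the target.
    n = {a: 0 for a in actions}
    wins = {a: 0 for a in actions}
    eps_a = eps / len(actions)  # naive split of the error budget

    def bounds(a):
        if n[a] == 0:
            return 0.0, 1.0
        c = math.sqrt(math.log(2.0 / eps_a) / (2.0 * n[a]))
        q = wins[a] / n[a]
        return max(q - c, 0.0), min(q + c, 1.0)

    for _ in range(max_iter):
        # optimism in the face of uncertainty: sample the highest UCB
        a = max(actions, key=lambda a: bounds(a)[1])
        wins[a] += sample(a)
        n[a] += 1
        if max(bounds(a)[0] for a in actions) > p:
            return True   # some action provably exceeds the threshold
        if max(bounds(a)[1] for a in actions) < p:
            return False  # no action can reach the threshold
    raise RuntimeError("no decision: threshold may be too close to Q")

# Hypothetical bandit: actions 'u', 'v' succeed with probabilities 0.8, 0.2.
rng = random.Random(0)
truth = {"u": 0.8, "v": 0.2}
ans = smc_single_step(lambda a: int(rng.random() < truth[a]),
                      ["u", "v"], p=0.5, eps=0.01)
print(ans)  # True
```

Note that the loop can stop without ever sampling the clearly inferior arm once the leading arm's lower bound clears the threshold, which is the source of the sample savings discussed in Remark 7.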
Remark 2
For $\mathrm{P}^{\max}_{<p}$ or $\mathrm{P}^{\min}_{\bowtie p}$, it suffices to adapt the termination condition (9) accordingly: return true when the whole confidence interval lies on the satisfying side of the threshold $p$, and return false when it lies on the violating side. The same statements hold for general PCTL specifications, as discussed in Sections III-B and III-C.
Now, we summarize the above discussion by Algorithm 1 and Theorem 1 below.
Theorem 1
The return value of Algorithm 1 is correct with probability at least $1 - \varepsilon$.
We provide the proof of a more general statement in Theorem 2.
Remark 3
The Hoeffding bounds in (6) are conservative. Consequently, as shown in the simulations in Section IV, the actual statistical error of our algorithms can be smaller than the given value. However, as the MDP is unknown, finding tighter bounds is challenging. One possible solution is to use asymptotic bounds, such as Bernstein's bounds [26]. Accordingly, the algorithm will only give asymptotic probabilistic guarantees.
III-B Finite Time Horizon
When $N \in (1, \infty)$, for any $t \in [N]$, let

(10)  $V_t(s) = \max_{\sigma} \mathbb{P}_{\omega \sim \mathcal{M}_{\sigma}(s)}\big(\omega \models a\,\mathrm{U}^{\leq t} b\big), \qquad Q_t(s, a') = \max_{\sigma : \sigma(s) = a'} \mathbb{P}_{\omega \sim \mathcal{M}_{\sigma}(s)}\big(\omega \models a\,\mathrm{U}^{\leq t} b\big),$

i.e., $V_t(s)$ and $Q_t(s, a')$ are the maximal satisfaction probabilities of $a\,\mathrm{U}^{\leq t} b$ for a random path starting from $s$ for any policy and for any policy with the first action being $a'$, respectively. By definition, $V_t$ and $Q_t$ satisfy the Bellman equation

(11)  $V_t(s) = \max_{a' \in A(s)} Q_t(s, a'), \qquad Q_t(s, a') = \sum_{s' \in S} P(s, a', s')\, V_{t-1}(s') \quad \text{for } s \in S \setminus (S_{\mathrm{yes}} \cup S_{\mathrm{no}}),$

with $V_t(s) = 1$ for $s \in S_{\mathrm{yes}}$ and $V_t(s) = 0$ for $s \in S_{\mathrm{no}}$. The second equality of the second equation is derived by conditioning on the first transition, by the semantics of PCTL.
From (11), we check the specification by induction on the time horizon $t \in [N]$. For each $t$, the lower and upper bounds for $Q_t(s, a')$ can be derived using the bounds on the value function for the previous step:

(12)  $Q_t^{\mathrm{lo}}(s, a') = \sum_{s' \in S} \hat{P}_t(s, a', s')\, V_{t-1}^{\mathrm{lo}}(s') - c_t(s, a'), \qquad Q_t^{\mathrm{up}}(s, a') = \sum_{s' \in S} \hat{P}_t(s, a', s')\, V_{t-1}^{\mathrm{up}}(s') + c_t(s, a'),$

and

(13)  $V_t^{\mathrm{lo}}(s) = \max_{a' \in A(s)} Q_t^{\mathrm{lo}}(s, a'), \qquad V_t^{\mathrm{up}}(s) = \max_{a' \in A(s)} Q_t^{\mathrm{up}}(s, a'),$

where $c_t(s, a')$ is a parameter such that $Q_t(s, a') \in [Q_t^{\mathrm{lo}}(s, a'), Q_t^{\mathrm{up}}(s, a')]$ holds with probability at least $1 - \varepsilon_{s, a', t}$. The bounds in (12) are derived from (11) by applying Hoeffding's inequality, using the fact that the value functions and the Q-functions are bounded within $[0, 1]$.
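The backup in (12) for one state-action pair can be sketched as follows. The function name `backup_bounds` and the numbers are ours; the clipping to $[0, 1]$ uses the boundedness of satisfaction probabilities noted above.

```python
import math

def backup_bounds(counts, v_up, v_lo, eps):
    # One optimistic/pessimistic Bellman backup for a single (s, a):
    #   Q_up = sum P_hat(s'|s,a) * V_up(s') + c,  Q_lo = sum ... - c,
    # clipped to [0, 1] since satisfaction probabilities are bounded.
    n = sum(counts.values())
    if n == 0:
        return 0.0, 1.0  # no data: the trivial bounds
    c = math.sqrt(math.log(2.0 / eps) / (2.0 * n))  # Hoeffding half-width
    mean_up = sum(k * v_up[s2] for s2, k in counts.items()) / n
    mean_lo = sum(k * v_lo[s2] for s2, k in counts.items()) / n
    return max(mean_lo - c, 0.0), min(mean_up + c, 1.0)

# Invented successor data: 'goal' already satisfies the target (value 1);
# 's0' carries value bounds [0.2, 0.4] from the previous backup step.
q_lo, q_up = backup_bounds({"goal": 30, "s0": 70},
                           v_up={"goal": 1.0, "s0": 0.4},
                           v_lo={"goal": 1.0, "s0": 0.2}, eps=0.01)
print(round(q_lo, 3), round(q_up, 3))  # 0.277 0.743
```

Iterating this backup from $t = 1$ to $N$ propagates both the sampling error at the current step and the inherited uncertainty of the previous value function.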
From the boundedness of the value function, we note that this confidence interval encompasses the statistical error in both the estimated transition probability function and the bounds $V^{\mathrm{lo}}$ and $V^{\mathrm{up}}$ of the value function. Accordingly, the policy chosen by the OFU principle at the step $t$ is

(14)  $\sigma_t(s) \in \arg\max_{a' \in A(s)} Q_t^{\mathrm{up}}(s, a'),$
with an optimal action chosen arbitrarily when there are multiple candidates, to ensure that the policy giving the upper bound of the value function is sampled the most in the long run. To initialize the iteration, the Q-function is set to

(15)  $Q_t^{\mathrm{up}}(s, a') = 1$

for all $t \in [N]$, $s \in S$, and $a' \in A(s)$, to ensure that every state-action pair is sampled at least once.
Sampling by the updated policy can be performed in either episodic or non-episodic ways [21]. The only requirement is that the state-action pair $(s, \sigma_t(s))$ should be sampled frequently for each $t \in [N]$ and for each nontrivial state $s$. In addition, batch samples may be drawn, namely sampling over the state-action pairs multiple times before updating the policy. In this work, for simplicity, we use a non-episodic, non-batch sampling method, by drawing

(16)  $s' \sim P(s, \sigma_t(s), \cdot)$

for all $t \in [N]$ and each nontrivial state $s$. The Q-function and the value function are set and initialized by (13) and (15). The termination condition is given by

(17)  $V_N^{\mathrm{lo}}(s_0) > p \quad \text{or} \quad V_N^{\mathrm{up}}(s_0) < p,$

returning true in the former case and false in the latter, where $p$ is the probability threshold in the non-nested PCTL formula. The above discussion is summarized by Algorithm 2 and Theorem 2.
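The policy update (14) and the stopping rule (17) can be sketched as two small helpers; the dictionaries and values below are illustrative, not from the paper.

```python
def ofu_policy(q_up, actions):
    # (14): in each state, pick the action maximizing the upper confidence
    # bound of the Q-function (ties broken by list order here).
    return {s: max(acts, key=lambda a: q_up[(s, a)])
            for s, acts in actions.items()}

def decide(v_lo, v_up, s0, p):
    # (17): stop once the confidence interval of V(s0) clears threshold p;
    # None means "keep sampling".
    if v_lo[s0] > p:
        return True
    if v_up[s0] < p:
        return False
    return None

pol = ofu_policy({("s0", "u"): 0.7, ("s0", "v"): 0.5}, {"s0": ["u", "v"]})
print(pol["s0"], decide({"s0": 0.55}, {"s0": 0.9}, "s0", p=0.5))  # u True
```

The main loop then alternates these helpers with the sampling step (16) until `decide` returns a truth value.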
Theorem 2
Algorithm 2 terminates with probability $1$ and its return value is correct with probability at least $1 - \varepsilon$, where $\varepsilon = \sum_{s, a', t} \varepsilon_{s, a', t}$.
By construction, as the number of iterations grows, $V_N^{\mathrm{up}}(s_0) - V_N^{\mathrm{lo}}(s_0) \to 0$. Thus, by Assumption 1, the termination condition (17) will be satisfied with probability $1$. Now, let E be the event that the return value of Algorithm 2 is correct, and let $\mathrm{E}_t$ be the event that Algorithm 2 terminates at the iteration $t$; then we have $\mathbb{P}(\mathrm{E}) = \sum_t \mathbb{P}(\mathrm{E} \mid \mathrm{E}_t)\, \mathbb{P}(\mathrm{E}_t)$. For any $t$, the event E happens given that $\mathrm{E}_t$ holds, if the Hoeffding confidence intervals given by (12) hold for all the sampled state-action pairs. Thus, we have $\mathbb{P}(\mathrm{E} \mid \mathrm{E}_t) \geq 1 - \varepsilon$, where $\varepsilon = \sum_{s, a', t} \varepsilon_{s, a', t}$, implying that the return value of Algorithm 2 is correct with probability at least $1 - \varepsilon$.
By Theorem 2, the desired overall statistical error $\varepsilon$ splits into the statistical errors for each state-action pair through the time horizon. For implementation, we can split it equally by $\varepsilon_{s, a', t} = \varepsilon / (|S|\, |A|\, N)$. The specification $\mathrm{P}^{\min}_{\bowtie p}[a\,\mathrm{U}^{\leq N} b]$ can be handled by replacing argmax with argmin in (14), and max with min in (13). The termination condition is the same as (17).
Remark 4
Due to the semantics in Definition 4, running Algorithm 2 is easier in the direction that only requires exhibiting one witness policy than in the direction that requires bounding all policies; and the difference increases with the number of actions $|A|$ and the time horizon $N$. This is because the former requires only finding and evaluating some policy whose satisfaction probability clears the threshold, while the latter requires evaluating all possible policies with sufficient accuracy. This is illustrated by the simulation results presented in Section IV.
III-C Infinite Time Horizon
The infinite-step satisfaction probability can be estimated from finite-step satisfaction probabilities, using the monotone convergence of the value function in the time horizon $N$:

(18)  $V_N(s_0) \uparrow V_\infty(s_0) \quad \text{as } N \to \infty.$

Therefore, if the satisfaction probability is larger than $p$ for some finite step $N$, then the statistical model checking algorithm should terminate, namely,

(19)  $V_N^{\mathrm{lo}}(s_0) > p \ \text{for some finite } N,$

where $p$ is the probability threshold in the non-nested PCTL formula.
The general idea in using the monotonicity to check the infinite-horizon satisfaction probability in finite time is to check both the specification and its negation at the same time; then one of the two checks should terminate in finite time. Here $a$ and $b$ are treated as atomic propositions. We can use Algorithm 2 to check their satisfaction probabilities for any finite time horizon $N$ simultaneously. Termination in finite time is guaranteed if the time horizons for both computations increase with the iterations. The simplest choice is to increase $N$ by $1$ every fixed number of iterations; however, this brings the problem of tuning that number. Here, we use the convergence of the best policy as the criterion for increasing $N$ for each satisfaction computation. Specifically, for all the steps in each iteration, in addition to finding the optimal policy $\sigma^{\mathrm{up}}_t$ with respect to the upper confidence bounds of the Q-functions by (14), we also consider the optimal policy $\sigma^{\mathrm{lo}}_t$ with respect to the lower confidence bounds of the Q-functions. Obviously, when $\sigma^{\mathrm{up}}_t = \sigma^{\mathrm{lo}}_t$ for all $t$, we know that the policy is optimal for all possible Q-functions within the confidence bounds. This implies that these bounds are fine enough for estimating the current truncated value; thus, if the algorithm does not terminate by the condition (19), we let

(20)  $N \leftarrow N + 1.$
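The horizon-increase criterion described above, that the greedy policies under the upper and lower confidence bounds coincide, can be sketched as follows. The function name and the dictionaries are ours; for brevity the sketch checks agreement per state rather than per step.

```python
def should_increase_horizon(q_up, q_lo, actions):
    # If the greedy policies w.r.t. the upper and lower confidence bounds
    # of Q agree in every state, the bounds are tight enough for the
    # current truncation, so the horizon N may be increased by 1.
    for s, acts in actions.items():
        if max(acts, key=lambda a: q_up[(s, a)]) != \
           max(acts, key=lambda a: q_lo[(s, a)]):
            return False
    return True

acts = {"s0": ["u", "v"]}
agree = should_increase_horizon({("s0", "u"): 0.8, ("s0", "v"): 0.6},
                                {("s0", "u"): 0.5, ("s0", "v"): 0.3}, acts)
print(agree)  # True: both bound-greedy policies pick 'u', so grow N
```

When the two policies disagree in some state, the algorithm keeps sampling at the current horizon instead of growing it, which avoids the tuning problem mentioned above.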
Combining the above procedures, we derive Algorithm 3 and Theorem 3 below for statistically model checking the infinite-horizon PCTL formula.
Theorem 3
Algorithm 3 terminates with probability $1$ and its return value is correct with probability at least $1 - \varepsilon$, where $\varepsilon$ accumulates the per-step statistical errors up to $N^{*}$, the largest time horizon when the algorithm stops, over the checks of the formula and its negation, with the trivial state sets derived from (1) for the formula and its negation, respectively.
Termination with probability $1$ follows easily from (18). Following the proof of Theorem 2, if the procedure of checking either the formula or its negation stops and the largest time horizon is $N^{*}$, then the return value is correct with probability at least $1 - \varepsilon$. Thus, the theorem holds.
Remark 5
By Theorem 3, given the desired overall statistical error $\varepsilon$, we can split it geometrically over the time horizons by $\varepsilon_N = \varepsilon (1 - \gamma)\, \gamma^{N - 1}$, where $\gamma \in (0, 1)$.
Remark 6
Similar to Section III-B, checking the specification for $\mathrm{P}^{\min}$ is derived by replacing argmax with argmin in (14), and max with min in (13). The termination condition is the same as (19).
Remark 7
Finally, we note that the exact savings in sample cost for Algorithms 2 and 3 depend on the structure of the MDP. Specifically, the proposed method is more efficient than [9, 10, 27] when the satisfaction probabilities differ significantly among actions, as it can quickly detect suboptimal actions without oversampling them. On the other hand, if all the state-action pairs yield the same Q-value, then an equal number of samples will be spent on each of them; in this case, the sample cost of Algorithms 2 and 3 is the same as [9, 10, 27].
IV Simulation
To evaluate the performance of the proposed algorithms, we ran them on two different sets of examples. In all the simulations, the transition probabilities are unknown to the algorithm (this is different from [9]).
The first set contains 10 randomly generated MDPs with different sizes. For these MDPs, we considered a bounded-until formula for the finite horizon, and an unbounded-until formula for the infinite horizon. In both cases, the atomic propositions and the threshold $p$ are chosen arbitrarily. The second set contains versions of the Sum of Two Dice program. The standard version [28] models the sum of two fair dice, each with six possibilities numbered $1$ to $6$. To consider MDPs with different sizes, we consider cases where each die is still fair, but has $m$ possibilities numbered $1$ to $m$, with $m = 3$ in the smallest example and $m = 17$ in the largest example. For these MDPs, we considered the same types of formulas for the finite and infinite horizons. In both cases, the target proposition encodes the atomic predicate that is true iff the values of both dice are chosen and their sum is less than an arbitrarily chosen constant. Also, in the finite case, $N = 5$ in the smallest example and $N = 10$ everywhere else.
For each MDP, we first numerically estimate the maximal finite- and infinite-horizon satisfaction probabilities using the known models in PRISM [29]. Then, we run Algorithms 2 and 3 on the example models with only knowledge of the topology of the MDPs and without knowing the exact transition probabilities. For every MDP, we tested each algorithm with two different thresholds $p$, one smaller and the other larger than the estimated probability, with the statistical error set to a fixed value. We ran each randomly generated test and each two-dice variant test a fixed number of times, and report the average running time and average number of iterations. All tests returned the correct answers; this suggests that the Hoeffding bounds used in the proposed algorithms are conservative (see Remark 3). The algorithms are implemented in Scala and run on Ubuntu 18.04 with an i7-8700 CPU at 3.2 GHz and 16 GB memory.
|S|  |A|  N  Iter.  Time (s)  p  PRISM Est.

3  3  4  208.5  0.02  0.25  0.35333347851696 
4  3528.7  0.19  0.45  
4  2  4  171.8  0.01  0.09  0.19346819239686 
4  3671.7  0.22  0.29  
5  2  4  441.7  0.04  0.05  0.11765824006506 
4  4945.5  0.42  0.21  
10  2  4  544.7  0.14  0.04  0.09657378356496 
4  5193.3  1.45  0.19  
15  2  3  873.1  0.28  0.04  0.09460212499566 
3  4216.7  1.28  0.19  
20  4  5  337.6  0.99  0.12  0.22255245872626 
5  9353.3  28.06  0.32  
25  5  10  270.5  7.57  0.09  0.19127323359686 
10  25709.8  728.49  0.29  
30  5  10  355.6  14.35  0.08  0.17318971940206 
10  27161.7  1085.77  0.27  
35  5  10  328.9  18.82  0.09  0.18265118947816 
10  27369.6  1529.84  0.28  
40  5  10  390.0  26.79  0.08  0.16746526170426 
10  30948.8  2122.24  0.26 
m  |S|  N  Iter.  Time (s)  p  PRISM Est.

3  36  5  15.6  0.79  0.27  0.37500000000006 
39.7  1.31  0.47  
5  121  10  32.9  4.07  0.20  0.30273437500006 
93.8  11.30  0.40  
6  169  10  29.2  4.95  0.29  0.39550781250006 
90.4  15.33  0.49  
7  196  10  33.4  6.85  0.31  0.41015625000006 
70.7  13.84  0.51  
9  400  10  41.1  15.87  0.21  0.31054687500006 
110.3  43.47  0.41  
11  529  10  46.4  24.07  0.28  0.38671875000006 
111.9  58.15  0.48  
13  729  10  25.8  18.52  0.20  0.30468750000006 
109.9  80.28  0.40  
15  1156  10  42.1  47.90  0.05  0.10253906250006 
155.3  178.29  0.20  
17  1369  10  24.0  32.17  0.06  0.13281250000006 
155.0  208.78  0.23 
Tables I and II show the results for finite-horizon reachability. An interesting observation in these tables is that in all examples, disproving the formula is 3 to 100 times faster than proving it. We believe this is mainly because, to disprove the formula, all we need is one policy witnessing the violation, whereas to prove it, one needs to show that every policy satisfies the requirement (see Remark 4).
|S|  |A|  Iter.  Time (s)  p  PRISM Est.  Avg. N (formula)  Avg. N (negation)

3  3  126.0  0.03  0.25  0.35378487395856  14.7  15.8 
111.3  0.01  0.45  12.9  13.0  
4  2  103.7  0.01  0.09  0.19348436808996  13.0  27.2 
73.0  0.01  0.29  9.6  19.2  
5  2  239.9  0.06  0.05  0.11765966830036  12.4  59.3 
92.7  0.00  0.21  6.9  24.3  
10  2  891.4  0.24  0.04  0.09657598203426  3.2  225.6 
79.6  0.00  0.19  1.1  21.5  
15  2  1862.0  0.61  0.04  0.09463904842416  1.3  117.8 
161.3  0.01  0.19  1.0  11.0  
20  4  1336.2  0.27  0.12  0.22255363220376  1.0  1.0 
16843.8  3.02  0.32  1.0  1.8  
25  5  1619.2  0.64  0.09  0.19127323049496  1.0  1.0 
154621.6  56.56  0.29  1.0  2.0  
30  5  2246.8  1.05  0.08  0.17318970684216  1.0  1.0 
86617.0  42.53  0.27  1.0  1.3  
35  5  1925.1  0.82  0.09  0.18265118680906  1.0  1.0 
13219.0  5.79  0.28  1.0  1.0  
40  5  2401.5  1.35  0.08  0.16746526033006  1.0  1.0 
12158.6  6.66  0.26  1.0  1.0 
Tables III and IV show the results for infinite-horizon reachability. Note that Algorithm 3 considers both the formula and its negation and, contrary to the finite-horizon case, disproving a formula is not always faster in the infinite case. In most of the larger randomly generated examples, the final truncation horizons for the formula and its negation are very small on average. This shows that in these examples, the algorithm was smart enough to learn that there is no need to increase $N$ in order to solve the problem. However, this is not the case for the two-dice examples. We believe this is because, in the current implementation, the decision to increase $N$ does not consider the underlying graph of the MDP. For example, during the execution, if the policy forces the state to enter a self-loop with only one enabled action, which is the case in the two-dice examples, then after every iteration the value of $N$ will be increased by $1$.
Iter.  Time (s)  PRISM Est.  

3  36  191.1  3.39  0.56 