1 Introduction
Reinforcement learning enables agents to automatically learn policies maximizing a given reward signal. Recent developments combining reinforcement learning with deep learning have had great success in tackling more and more complex domains, such as learning to play video games based on visual input or enabling automated real-time scheduling in production systems (5; 46). Reinforcement learning also provides valuable solutions for systems operating in non-deterministic and partially known environments, such as autonomous systems, socio-technical systems and collective adaptive systems (see e.g. (12; 28; 16; 39)). However, it is often difficult to ensure the quality and the correctness of reinforcement learning solutions. In many applications, learning focuses on achieving and optimizing system behavior, but not on guaranteeing the safety of the system (see e.g. (12; 28; 16)).
Optimizing for both functional effectiveness and system safety at the same time poses a fundamental challenge: If maximizing return and satisfying bounds interfere, what should the learner optimize in a given situation? That is, besides the fundamental dilemma of exploration and exploitation, the learner now faces an additional choice: When should it optimize return, and when feasibility, if it is not possible to optimize both at the same time?
Motivated by this fundamental challenge, we propose Policy Synthesis under probabilistic Constraints (PSyCo), a systematic method for deriving safe policies under probabilistic constraints with reinforcement learning and Bayesian model checking. We treat the problem of learning policies under such probabilistic goals as a constrained optimization problem, and propose a black-box approach based on evolutionary strategies and Bayesian verification for solving it approximately.
PSyCo is organized along the classical phases of systematic software development: a system specification comprising a constrained Markov decision process (CMDP) as domain specification and a requirement specification in terms of probabilistic constraints, an abstract design defined by an algorithm for safe policy synthesis with reinforcement learning, and Bayesian model checking for system verification.
For implementing PSyCo’s abstract design we propose Safe Neural Evolutionary Strategies (SNES). SNES leverages Bayesian model checking while learning to adjust the Lagrangian multiplier of a constrained optimization problem derived from a PSyCo specification. SNES combines three building blocks for learning safe policies: (1) constrained optimization with a Lagrangian multiplier, (2) modeling confidence in constraint satisfaction with a Beta distribution, and (3) using this confidence to adjust the Lagrangian multiplier in the learning process, adaptively balancing the optimization objective of the learner between maximizing return and minimizing cost.
The paper makes the following contributions.

The PSyCo methodology for safe policy synthesis under probabilistic constraints with reinforcement learning and Bayesian model checking. PSyCo accounts for empirical verification based on finite observations by including confidence requirements.

We introduce Safe Neural Evolutionary Strategies (SNES) for learning safe policies under probabilistic constraints. SNES leverages online Bayesian model checking to obtain estimates of the constraint satisfaction probability and a confidence in this estimate.

We empirically evaluate SNES, showing that it is able to synthesize policies that satisfy probabilistic constraints with a required confidence.
The paper is structured as follows: In Section 2 we introduce the PSyCo method. Section 3 describes safe policy synthesis with the Safe Neural Evolutionary Strategy SNES. In Section 4 we present the results of our experiments with the Particle Dance case study. Sections 5 and 6 discuss related work and the limitations of our approach. Finally, Section 7 gives a short summary of PSyCo and addresses further work.
2 The PSyCo Method for Safe Policy Synthesis under Probabilistic Constraints
Our approach to safe policy synthesis comprises three phases: system specification as a constrained Markov decision process with goal-oriented requirements, system design and implementation via safe policy synthesis, and verification by Bayesian model checking. As we will see, we use Bayesian model checking in two ways: to guide the learning process towards feasible solutions, and to verify synthesized policies.
2.1 PSyCo Overview
PSyCo comprises the following three components.

A system specification consisting of

a domain specification given by a CMDP (in the set of all CMDPs) and

a set of goal-oriented requirements including optimization goals and probabilistic hard constraints. We restrict our further discussion to a single optimization goal and a single hard constraint (from the set of all constraint formulas) for the sake of simplicity. We think that extending our results to full sets of constraints is straightforward.


A safe reinforcement learning algorithm L yielding (the parameters of) a policy.

A verification algorithm V to check constraint satisfaction of the learned policy in the given CMDP.
PSyCo leverages two operations: a learning algorithm L for synthesizing safe policies wrt. rewards, costs and constraints (i.e. optimization goals and probabilistic constraints), and a verification algorithm V for statistically verifying synthesized policies.
We optimize the parameters of the policy wrt. the rewards and costs of the CMDP with the safe reinforcement learning algorithm L, taking the given probabilistic requirement into account.
(1) 
We verify the optimized parameters of the synthesized policy wrt. the given CMDP and the constraint specification.
(2) 
Given a CMDP and a constraint, PSyCo works by learning and verifying a policy as follows.
(3) 
(4) 
2.2 System Specification: Constrained Markov Decision Processes and Goal-Oriented Requirements
The system specification consists of a domain specification and a set of requirements.
Domain Specification as Constrained MDP
The domain specification comprises a probabilistic labeled transition system describing the physics of the application domain. It is given by a tuple $(S, A, T)$, where $A$ is a finite set of agent actions, $S$ is a finite set of system states and $T : S \times A \times S \rightarrow [0, 1]$
is a probability distribution describing the transition probabilities of reaching some successor state when executing an action in a given state. For expressing our requirements we extend the labeled transition system by rewards and costs and specify a domain by a constrained Markov decision process (CMDP)
$M = (S, A, T, r, c, \mu_0)$, where $(S, A, T)$ is a probabilistic labeled transition system, $r$ is a reward function, $c$ is a cost function, and $\mu_0$ is the initial state distribution (4).
Note that CMDPs are not restricted to a single cost function in general, however in this paper we restrict ourselves to a single cost function for sake of simplicity. We think that our results could be extended to sets of cost functions straightforwardly.
An episode $\xi$ is a sequence of sequential transitions in the CMDP. By $\Xi_n$ we denote the set of all episodes of length $n$ (where $n \in \mathbb{N}$).
Requirements
We consider two kinds of goals: optimization goals and (hard) constraints. Optimization goals are soft constraints that maximize an objective function; constraints are behavioral goals which impact the possible behaviors of the system, similar to e.g. maintain goals (which restrict the behavior of the system) and achieve goals (which generate behavior), see KAOS (21).
In our MDP setting, we relate the optimization goal to the rewards of the MDP and require it to maximize the return:
$\max \mathbb{E}_\xi[R(\xi)]$ (5)
where $R(\xi)$ is the cumulative sum of rewards in an episode $\xi$ and $\mathbb{E}_\xi[R(\xi)]$ denotes the expectation of the return.
Probabilistic constraints are built over basic constraints. A basic constraint has the form
$e \sim k$ (6)
where $e$ is an expression denoting a function from the set of episodes of length $n$ into the real numbers, $\sim \; \in \{<, \le, =, \ge, >\}$, and $k$ is a constant denoting a real number.
Example
Let $e$ be instantiated with the cumulative cost $C$ of an episode, and let $d$ be a constant denoting the required maximum cumulative cost. Then the basic constraint
$C < d$ (7)
compares $C$ (the cumulative cost) with the value of the constant $d$.
The formula
(8) 
states that the basic constraint holds with at least probability $p$ within $n$ transitions in the domain. In our particular case of statistical verification based on finite observations, we add an additional confidence requirement $\gamma$ to be satisfied and define:
(9) 
Let be a name for the constraint . Then we write the named probabilistic constraint as:
(10) 
Example (continued)
The probabilistic constraint
(11) 
has the name BoundedCost and requires that, with probability $p$ and confidence $\gamma$, the cumulative cost is smaller than $d$.
Combining Optimization Goals and Constraints as Target for Policy Synthesis
Remark: Expressivity of Constraints
With our notion of constraint we can express several other specification formalisms. Examples are step-bounded probabilistic temporal logic and, as a consequence, a probabilistic version of the behavioral goals of KAOS (21), i.e. maintain, achieve, avoid and cease goals.
The set of all step-bounded CTL path formulas with bounds comprises formulas of the form $X \phi$ (”next $\phi$”), $\phi_1 \, U^{\le n} \, \phi_2$ (”step-bounded until”), $F^{\le n} \phi$ (”step-bounded eventually”), and $G^{\le n} \phi$ (”step-bounded always”), where $\phi, \phi_1, \phi_2$ are CTL state formulas (see e.g. (8), page 781).
The formula $\phi_1 \, U^{\le n} \, \phi_2$ (where $n \in \mathbb{N}$) asserts that $\phi_2$ will hold within at most $n$ steps, while $\phi_1$ holds in all states that are visited before a $\phi_2$-state has been reached. The step-bounded eventually operator is defined by $F^{\le n} \phi \equiv \mathrm{true} \; U^{\le n} \phi$ and the step-bounded always operator is defined by $G^{\le n} \phi \equiv \neg F^{\le n} \neg \phi$.
With the help of the characteristic function $\chi$ defined by
$\chi(\phi)(\xi) = 1$ if $\xi \models \phi$, and $\chi(\phi)(\xi) = 0$ otherwise, (13)
we can transform any step-bounded CTL path formula $\phi$ into an expression denoting a function from the set of episodes into the real number $0$ or $1$.
Thus any step-bounded CTL path formula induces a basic constraint and a probabilistic constraint of the form
(14)
asserting that the formula is true with the required probability and confidence. As an example we can define step-bounded maintain goals and step-bounded achieve goals in the sense of KAOS:
(15) 
requiring that the goal condition will hold within at most $n$ steps with probability $p$ and confidence $\gamma$.
(16) 
requiring that, with probability $p$ and confidence $\gamma$, the goal condition holds in all steps of an episode.
Step-bounded cease and avoid goals can be expressed as well. It suffices to replace the goal condition with its negation in the formulas above.
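The transformation from step-bounded path formulas to real-valued episode functions can be illustrated with a small sketch (the episode representation and predicates here are hypothetical, not the paper's notation):

```python
def characteristic(holds):
    # Characteristic function (Eq. 13): maps satisfaction of a path
    # formula to 1 and violation to 0, turning a logical property
    # into a real-valued function over episodes.
    return 1 if holds else 0

def bounded_eventually(episode, predicate):
    # Step-bounded eventually: the predicate holds in some state of
    # the finite episode (the episode length is the step bound).
    return characteristic(any(predicate(s) for s in episode))

def bounded_always(episode, predicate):
    # Step-bounded always: the predicate holds in every state.
    return characteristic(all(predicate(s) for s in episode))
```

A maintain goal then corresponds to a basic constraint over `bounded_always`, and avoid or cease goals are obtained by negating the predicate.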
2.3 Abstract Design: Safe Reinforcement Learning Algorithm L
We now discuss the learning algorithm L for synthesizing a policy by optimizing its parameters wrt. the constrained optimization problem stated in Eq. 12. We denote executing a parameterized policy in a CMDP with a given constraint as follows, yielding a distribution over episode return, cost and satisfaction of the constraint as a result of the execution.
(17) 
We typically sample from this distribution, which we denote as follows.
(18) 
Remark: Notation
We overload the CMDP symbol to describe both the CMDP tuple and the probability distribution the CMDP yields when being executed with a policy and constraints.
We can synthesize a policy that optimizes the constrained optimization problem with safe reinforcement learning. One approach is to use the problem’s dual representation as a reward function for a reinforcement learning algorithm (4). Let $\xi$ be an episode generated by sampling from a CMDP using a policy. In general, we can transform the problem
$\max_\pi \mathbb{E}_\xi[R(\xi)] \;\text{ s.t. }\; \mathbb{E}_\xi[C(\xi)] \le d$ (19)
to its dual representation
$\min_{\lambda} \max_\pi \mathbb{E}_\xi[R(\xi) - \lambda (C(\xi) - d)]$ (20)
where $\lambda$ is a Lagrangian multiplier and $\lambda \ge 0$. Without loss of generality, we use an alternative formulation of Eq. 20 where $\lambda \in [0, 1]$.
$\max_\pi \mathbb{E}_\xi[(1 - \lambda) R(\xi) - \lambda C(\xi)]$ (21)
We outline the general process of safe RL with function approximation in Algorithm 1. The key challenge in this approach is to determine an appropriate $\lambda$ and to adjust it effectively over the learning process.
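Assuming the alternative formulation with a multiplier λ in [0, 1], the per-episode objective optimized in this scheme can be sketched as follows (a minimal illustration, not the paper's exact implementation):

```python
def dual_objective(episode_return, episode_cost, lam):
    # Dual-style objective: lam in [0, 1] trades off maximizing
    # return against minimizing cost; lam = 0 ignores cost,
    # lam = 1 ignores return.
    assert 0.0 <= lam <= 1.0
    return (1.0 - lam) * episode_return - lam * episode_cost
```

The learner then maximizes this weighted objective while a separate rule adjusts `lam` over the course of training.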
2.4 Verification: Bayesian Model Checking Algorithm V
We resort to Bayesian model checking (BMC) (30; 49; 11) for the verification of policies, leveraging it in two ways: to guide the learning process towards feasible solutions, and to verify synthesized policies.
Executing a policy in the CMDP generates an episode that either satisfies the constraint or violates it. Thus, we can treat the generation of multiple episodes as a Bernoulli experiment with a satisfaction probability $\theta$. We are interested in estimating this probability in order to check whether the policy complies with our probabilistic constraint, that is, whether $\theta \ge p$.
Rather than computing a point estimate of $\theta$ via maximum likelihood estimation, we assign a plausibility to each possible value of $\theta$, yielding a Bayesian estimate. We assign a prior distribution to all possible values of $\theta$, and then compute the posterior distribution based on the observations of cost constraint satisfaction or violation.
In general, the posterior is proportional to the prior and the likelihood of the observations given this prior.
$P(\theta \mid D) \propto P(D \mid \theta) \cdot P(\theta)$ (22)
In the particular case of a Bernoulli variable, the conjugate prior is the Beta distribution (22). The Beta distribution is defined by two parameters $\alpha, \beta$, which are given by the observed counts of positive and negative results of the Bernoulli experiment. We use a uniform prior over the possible values of $\theta$, assigning the same plausibility to all values before observing any data. This yields the following equality when assuming $s$ satisfactions and $v$ violations of the constraint.
$P(\theta \mid D) = \mathrm{Beta}(\theta \mid s + 1, v + 1)$ (23)
This update yields a posterior distribution over the possible values of $\theta$ given the observation of $s$ satisfactions and $v$ violations. We can now compute the probability mass of this posterior that lies above the required probability $p$ to obtain a confidence that the current system satisfies our probabilistic constraint.
$c = 1 - F_{\mathrm{Beta}(s + 1, v + 1)}(p)$ (24)
Here, $F_{\mathrm{Beta}(\alpha, \beta)}$ denotes the cumulative distribution function of the Beta distribution.
Algorithm 2 shows the pseudo code for Bayesian verification of a policy that is parametrized with parameters .
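The posterior-mass computation can be sketched numerically without any statistics library; here the Beta posterior under a uniform prior is integrated above the required probability by trapezoidal quadrature (the step count is an illustrative choice):

```python
import math

def beta_pdf(x, a, b):
    # Beta(a, b) density, via log-gamma for numerical stability.
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def confidence(satisfied, violated, p_req, steps=10_000):
    # Posterior is Beta(satisfied + 1, violated + 1) under a uniform
    # prior; the confidence is the posterior mass above the required
    # probability p_req, approximated by trapezoidal integration.
    a, b = satisfied + 1, violated + 1
    xs = [p_req + (1 - p_req) * i / steps for i in range(steps + 1)]
    ys = [beta_pdf(x, a, b) if 0 < x < 1 else 0.0 for x in xs]
    h = (1 - p_req) / steps
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))
```

In practice one would use an incomplete-beta routine (e.g. a library Beta CDF) instead of explicit quadrature; the quantity computed is the same.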
While this approach makes it possible to verify a given policy after training, it does not directly provide a way to synthesize a policy that is likely to be verified. We provide an approach to this problem in the next section.
3 Safe Neural Evolutionary Strategies
The previous section outlined a methodology for engineering safe policy synthesis based on specification as a CMDP, safe RL and Bayesian verification. In this section, we propose Safe Neural Evolutionary Strategies (SNES) for learning policies that are likely to be positively verified. SNES weights return and cost in the process of policy synthesis based on Bayesian confidence estimates obtained in the course of learning.
3.1 Evolutionary Strategies
Evolutionary strategies (ES) is a gradient-free, search-based optimization algorithm that has shown competitive performance in reinforcement learning tasks using deep learning for function approximation (44). ES is attractive as it is not based on backpropagation and can therefore be parallelized straightforwardly. Also, it does not require expensive GPU hardware for efficient computation.
The basic ES procedure is shown in Algorithm 3. ES works by maintaining the parameters of the current solution. It then generates slightly perturbed offspring from this solution to be evaluated on the optimization task, for example optimizing the expected episode return of a policy in an MDP (lines 4 to 7). The current solution is then updated by moving the solution parameters into the direction of offspring weighted by their respective return, in expectation increasing the effectiveness of the solution (lines 8 and 9).
We normalize a set of values $x_1, \dots, x_m$ to zero mean and unit standard deviation with the following normalization procedure.
$\mathrm{normalize}(x_i) = (x_i - \mu) / \sigma$ (25)
Here, $\mu$ and $\sigma$ denote the mean and standard deviation of the set.
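Combining the basic ES procedure with this normalization, one ES iteration can be sketched as follows (population size, noise scale and learning rate are illustrative defaults, not the experiment settings):

```python
import random

def normalize(values):
    # Zero mean, unit standard deviation (Eq. 25); a small epsilon
    # guards against zero variance.
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / (std + 1e-8) for v in values]

def es_step(theta, evaluate, pop_size=50, sigma=0.1, alpha=0.01):
    # One ES iteration: sample Gaussian perturbations, evaluate the
    # perturbed offspring, and move theta towards offspring weighted
    # by their normalized returns.
    noises, returns = [], []
    for _ in range(pop_size):
        eps = [random.gauss(0.0, 1.0) for _ in theta]
        noises.append(eps)
        returns.append(evaluate([t + sigma * e for t, e in zip(theta, eps)]))
    w = normalize(returns)
    return [t + alpha / (pop_size * sigma)
              * sum(w[i] * noises[i][j] for i in range(pop_size))
            for j, t in enumerate(theta)]
```

Repeated calls to `es_step` with an episode-return evaluator move the parameters towards higher expected return.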
3.2 Safe Neural Evolutionary Strategies
Safe Neural Evolutionary Strategies (SNES) uses Bayesian verification in the learning process to adaptively weight return and cost in the course of policy synthesis such that the resulting policy is likely to be positively verified.
SNES works as basic ES, and additionally performs a Bayesian verification step in each iteration. The resulting confidence estimate in positive verification is then used to determine the weighting of return and cost.
In order to account for confidence requirements, the confidence estimate from Bayesian verification is set in relation to the required confidence: if the confidence in constraint satisfaction is lower than required, only costs are reduced in the parameter update; if the constraint is satisfied with enough confidence, the influence of the return is gradually increased.
(26) 
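One hypothetical realization of this adaptation rule (the decay step size and the hard reset to 1 are illustrative assumptions, not the exact update of Eq. 26):

```python
def update_lambda(lam, conf, required_conf, step=0.05):
    # If the Bayesian confidence in constraint satisfaction is below
    # the requirement, weight cost only (lam = 1); otherwise decay
    # lam towards 0, gradually re-emphasizing the return.
    if conf < required_conf:
        return 1.0
    return max(0.0, lam - step)
```

The resulting `lam` is then used to weight normalized return and cost in the parameter update.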
SNES is shown in Algorithm 4. SNES generates offspring and evaluates return, cost, and whether an offspring satisfies or violates the requirement in an episode (lines 6 to 11). It uses this information to update the Lagrangian multiplier (lines 12 to 14) and updates the parameters weighted according to normalized return and cost (lines 15 to 18).
Remark: SNES vs. maximum likelihood calibration of the Lagrangian
To show the effect of the Bayesian treatment and the necessity of computing confidence for adapting $\lambda$ in the learning process, we compared SNES to a naive variant for tuning $\lambda$. Here, we use a maximum likelihood estimate of the satisfaction probability and adjust $\lambda$ according to the following rule.
(27)  
(28) 
Note that the naive approach does not account for confidence in its result.
4 Experiments
In this section, we report on empirical results obtained when evaluating SNES.
4.1 Setup
Particle Dance
In the Particle Dance domain, an agent has to learn to follow a randomly moving particle as closely as possible. Agent and particle each have a position and a velocity. The state space describes the positions and velocities of both agent and particle. We restrict positions and velocities to their respective boundaries by clipping any exceeding values.
(29) 
The initial positions are sampled uniformly at random. The initial velocities are fixed to zero.
(30) 
The agent can choose its acceleration at each time step. This yields the continuous action space A.
(31) 
Both agent and particle are accelerated at each time step. The particle’s acceleration is sampled uniformly at random at each time step. Positions are updated wrt. current velocities. Let $s$ be the system’s current state and $a$ be the action executed by the agent; then the transition distribution is given by:
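A one-dimensional sketch of these dynamics (the bounds `p_max` and `v_max` and the unit time step are assumed placeholders for the elided boundary values):

```python
def step(state, action, particle_accel, dt=1.0, p_max=1.0, v_max=0.1):
    # One transition: velocities are updated by the accelerations,
    # positions by the velocities, and both are clipped to their
    # respective boundaries.
    clip = lambda x, lo, hi: max(lo, min(hi, x))
    ap, av, pp, pv = state  # agent pos/vel, particle pos/vel
    av = clip(av + dt * action, -v_max, v_max)
    pv = clip(pv + dt * particle_accel, -v_max, v_max)
    ap = clip(ap + dt * av, -p_max, p_max)
    pp = clip(pp + dt * pv, -p_max, p_max)
    return (ap, av, pp, pv)
```

Sampling `particle_accel` uniformly at random at each call reproduces the stochastic part of the transition distribution.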
Reward and Cost
The agent gets a reward at each step of an episode that incentivizes it to get as close to the particle as possible. We also define a collision radius, and induce a cost when the agent is closer to the particle than this radius.
(32) 
(33) 
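A direct sketch of these two definitions (the concrete collision radius is an assumption, as the paper's value is elided here):

```python
import math

def reward(agent_pos, particle_pos):
    # Negative Euclidean distance: maximal (zero) when the agent
    # sits exactly on the particle.
    return -math.dist(agent_pos, particle_pos)

def cost(agent_pos, particle_pos, collision_radius=0.1):
    # Unit cost whenever the agent enters the collision radius.
    return 1.0 if math.dist(agent_pos, particle_pos) < collision_radius else 0.0
```

The cumulative cost of an episode is then simply the number of collision steps, which is what the probabilistic constraint bounds.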
Requirements
The reward is the negative distance between particle and agent, so minimizing the distance means maximizing the reward. Thus the optimization goal is to maximize the expectation of the return (see (5)):
(34) 
As constraint, we require the number of collisions in an episode (i.e. the cumulative cost ) to be below a given threshold . We set the episode length to , the required probability for satisfying the constraint and the required confidence .
(35) 
Policy Network
We model the policy of our agents as a feedforward neural network with parameters $\theta$. Our network consists of an input layer with dimension 8 (position and velocity of agent and particle), a hidden layer with dimension 32, and has an output dimension of 2 (two dimensions of acceleration). Let $W_1, W_2$ be the network’s weights in the input and hidden layer respectively, and let $g$ and $h$ be nonlinear activation functions, with $g$ being a rectified linear unit (27) and $h$ being tanh in our case. Then, the network’s output is given as follows.
$f(x) = h(W_2 \, g(W_1 x))$ (36)
Other parameters
We report our results for population size , learning rate and perturbation rate . Experiments with other parameters yielded similar results. We repeated the experiments five times and show mean values as solid lines and standard deviation by shaded areas in our figures.
4.2 Results
The SNES agent learns to follow the particle closely. In Figure 1 we see sample trajectories of the particle and the agent (color gradients denote time). We observe that the synthesized policy successfully learned the task of following the particle.
We can observe the effect of SNES learning unconstrained and constrained policies on the obtained episode return in Figure 2. Return and constraint define a Pareto front in our domain: Strengthening the constraint reduces the space of feasible policies, and also reduces the optimal return that is achievable by a policy due to increased necessary caution when optimizing the goal, i.e. when learning to follow the particle as close as possible.
Figure 3 shows the proportion of episodes that satisfy the given requirement on cost. We can see that the proportion closely reaches the defined bound, shown by the dashed vertical line. Note that the satisfying proportion stays just above the required bound.
Figure 4 shows the confidence of the learning agent in its ability to satisfy the given requirement based on the observations made in the learning process so far. The confidence is determined from the Beta distribution maintained by SNES over the course of training. Note that the confidence is mostly kept above the confidence requirement given in the specification. This shows SNES is effectively incorporating observations, confidence and requirement into its learning process.
Figure 5 shows the results of Bayesian verification performed throughout the learning process every 1000 episodes. Here, we fix the current policy, and perform Bayesian verification of the policy wrt. the given requirement. The quantity measured is the confidence in requirement satisfaction after either surpassing (i.e. the policy satisfies the requirement with high confidence), falling below (i.e. the policy violates the requirement with high confidence) or after a maximum of 1000 verification episodes. We can see that the confidence in having learned a policy that satisfies the requirement is increasing over the course of training.
Figures 6, 7 and 8 show return and collisions (i.e. cost) obtained, split by episodes that satisfy the cost constraint and those that violate it. We can see that the violating episodes are more effective in terms of return but keep collisions well below the requirement, highlighting again the Pareto front of return and cost given by our domain. Note that we smooth the shown quantity over the last 100 episodes, and that collisions can only take discrete values. This may explain why the shown quantity is well below the theoretically given boundaries (one or four in our case). SNES is able to learn policies that exploit return in the defined proportion of episodes, and to optimize wrt. the Pareto front of return and cost otherwise.
Results on SNES vs. maximum likelihood calibration of the Lagrangian
We compared SNES’ Bayesian approach to calibrating (Eq. 26) to a maximum likelihood variant (Eq. 28). Figures 9, 10 and 11 show return and cost for both approaches, separated by satisfying and violating episodes. We show the results for , aggregated for 12 repetitions of the experiment.
The MLE approach oversatisfies the given constraint, and consequently does not yield returns as high as SNES. In contrast, SNES is able to find a local Pareto optimum of return and constraint satisfaction.
5 Related Work
For synthesizing policies of autonomous and adaptive systems, PSyCo comprises a systematic development method, algorithms for safe and robust reinforcement learning, and a Bayesian verification method which is related to MDP model checking and statistical model checking approaches. In this section we discuss related work in these areas.
Systematic Development of Adaptive Systems
PSyCo borrows its notion of goals from KAOS (21), an early method for goal-oriented requirements engineering. KAOS distinguishes hard and soft goals, is formally based on linear temporal logic, and proposes activities for refining goals and deriving operational requirements which serve as the basis for system design. In contrast to PSyCo, it covers neither system design nor implementation.
SOTA (2) is a modern requirements engineering method for autonomous and collective adaptive systems with a specific format for goals. Properties of goals can be analyzed by a model-checking tool (3) based on LTL formulas and the LTSA model checker (37). SOTA addresses neither system design nor implementation; but it was used for requirements specification in the systematic construction process for autonomous ensembles (48) of the ASCENS project (47).
The ensemble development life cycle EDLC (14; 29) of ASCENS is a general agile development process covering all phases of system development and relating them with the “runtime feedback control loop” for awareness and adaptation. Its extension “Continuous Collaboration” (28) integrates a machine-learning approach into EDLC but, like EDLC, it does not address (policy) synthesis.
The following papers address more specific development aspects. In (12), a generic framework for modeling autonomous systems is presented which is centered around simulation-based online planning; Monte Carlo Tree Search and Cross Entropy Open Loop Planning are used for the online generation of adaptive policies, but safety properties are not studied. In (23), Dragomir et al. propose an automated design process based on formal methods targeting partially observable timed systems. They describe how to automatically synthesize runtime monitors for fault detection and recovery strategies for controller synthesis.
Model Checking for MDPs
PSyCo provides a framework for synthesizing policies that maximize return while conforming to a given probabilistic requirement specification. A related line of research treats the problem of verifying general properties of a given MDP, such as reachability. In this case, the verification is done either for all possible policies, or for a particular fixed one, turning the MDP into a Markov chain to be verified.
(8; 6) present overviews of the main model checking techniques for qualitative and quantitative properties of MDPs expressed by LTL and PCTL formulas. For a recent review of this field see (7). There are also software tools available, e.g. the PRISM model checker (32).
Statistical Model Checking
Another direction of research that is related to our work is statistical model checking. Here, a system is observed in its execution and the statistical distribution of system traces is verified wrt. probabilistic requirements. While our work builds on these ideas, policy synthesis is not a core aspect of statistical model checking: usually, information about the verification process is not fed into a learning process (34; 20; 32). Other prior work has discussed the Bayesian approach to model checking based on the Beta distribution, which is a key component of the PSyCo framework. In contrast to our work, these works did not use information about the verification process to guide policy synthesis (30; 11).
Safe and Robust Reinforcement Learning
Closely related to our approach of verifiable policy synthesis are works in the area of safe reinforcement learning modeling the problem in terms of a constrained optimization problem (43; 19; 24). In contrast to our approach, these works do not reason about the statistical distribution of costs and corresponding constraint violations, nor do they provide a statistically grounded verification approach for given constraints. (25; 26) propose a method for safe reinforcement learning which combines verified runtime monitoring with reinforcement learning. In contrast to our approach, their method requires a fully verified set of safe actions for a subset of the state space. While this is an interesting approach guaranteeing safety in the modeled subset, it is infeasible to perform exhaustive a-priori verification for very large or highly complex MDPs.
A notable exception is proposed in (18), which provides statistical optimization wrt. the cost distribution tail. In contrast, PSyCo provides (a) a framework integrating formal goal specifications and policy synthesis and (b) Bayesian verification of synthesized policies including statistical confidence in verification results. Also, to the best of our knowledge, leveraging the Beta distribution for adapting the Lagrangian multiplier of a constrained optimization dual problem is a novel approach.
Another direction in safe reinforcement learning is the use of adversarial methods, which treat the agent’s environment as an adversary to allow for the synthesis of policies that are robust wrt. worst-case performance or differences between simulations used for learning and real-world application domains (42; 31; 41). These approaches optimize for worst-case robustness, but do not provide formal statistical guarantees on the resulting policies.
Another important line of research deals with the quantification of uncertainty and robustness to out-of-distribution data in reinforcement learning, i.e. tries to enable a system to identify its ’known unknowns’ (45; 36). This is important also from a verification perspective, as any verification results achieved before system execution are only valid if the data distribution stays the same at runtime.
6 Limitations
While SNES is able to incorporate information from the learning process into the synthesis of feasible policies given probabilistic constraints as requirements, there are a number of limitations to be aware of.
Feedback loops and nonstationary data
As the policy changes in the course of learning, the estimation of the satisfaction probability and the confidence therein is done on data generated by a non-stationary process. In the other direction, the current estimate is used by SNES to update the policy, thus creating a feedback loop. Therefore, the estimates made by SNES while learning are to be interpreted with care: the degree of non-stationarity may severely influence the validity of the estimates. This does, however, not affect a-posteriori verification results, which are obtained for a stationary CMDP and policy.
Lack of convergence proof
While our current approach of leveraging the Beta distribution for adaptively adjusting the Lagrangian multiplier yields interesting and effective results empirically, SNES so far lacks rigorous proofs of convergence and local optimality. We consider this a relevant direction for future work.
Bounded verification
In its current formulation, SNES performs bounded verification for a given horizon (i.e. episode length). It is unclear how to interpret or model probabilistic system requirements and satisfaction for temporally unbounded systems, as in the limit every possible event will occur almost surely. A promising direction could be the integration of rates as usually performed in Markov chain analysis, or resorting to average-reward formulations of reinforcement learning (38).
No termination criterion
PSyCo combines optimization goals with constraints. While it is possible to decide whether constraints are satisfied, or at least to quantify confidence in the matter, it is usually not possible to decide whether the optimization goal has been reached or not. One approach to this would be to formulate requirements wrt. reward as constraints as well, such as requiring the system to reach a certain reward threshold (10). In this case, policy synthesis could terminate when all given requirements are satisfied with a certain confidence.
7 Conclusion
We have proposed Policy Synthesis under probabilistic Constraints (PSyCo), a systematic method for deriving safe policies under probabilistic constraints with reinforcement learning and Bayesian model checking. PSyCo is organized along the classical phases of systematic software development: a system specification comprising a constrained Markov decision process as domain specification and a requirement specification in terms of probabilistic constraints, an abstract design defined by an algorithm for safe policy synthesis with reinforcement learning and Bayesian model checking for system verification.
As an implementation of PSyCo we introduced Safe Neural Evolutionary Strategies (SNES), a method for learning safe policies under probabilistic constraints. SNES leverages online Bayesian model checking to obtain an estimate of the constraint satisfaction probability and a confidence in this estimate. SNES uses the confidence estimate to weight return and cost adaptively in a principled way, providing a sensible optimization target with respect to the constrained task. SNES thus provides a way to synthesize policies that are likely to satisfy a given specification.
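The core of this confidence-based weighting can be illustrated with a simplified sketch. The linear blend and all names below are illustrative assumptions, not SNES's published update rule; the idea is only that posterior mass on constraint violation shifts the objective from return towards cost:

```python
import random

def adaptive_fitness(ret, cost, successes, failures, p_req, n_samples=50_000):
    """Confidence-weighted objective: the more posterior mass the
    Beta(1+s, 1+f) posterior places on violating the requirement
    p >= p_req, the more the objective emphasises reducing cost
    over maximizing return."""
    # Monte Carlo estimate of the posterior probability of violation.
    p_violate = sum(
        random.betavariate(1 + successes, 1 + failures) < p_req
        for _ in range(n_samples)
    ) / n_samples
    return (1.0 - p_violate) * ret - p_violate * cost

random.seed(0)
# With strong evidence of satisfaction the objective is dominated by
# return; with strong evidence of violation, by (negative) cost.
safe_fitness = adaptive_fitness(10.0, 5.0, successes=99, failures=1, p_req=0.9)
unsafe_fitness = adaptive_fitness(10.0, 5.0, successes=50, failures=50, p_req=0.9)
```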
We have empirically evaluated SNES in a sample domain designed to show the potentially interfering optimization goals of maximizing return while reaching and maintaining constraint satisfaction. We have shown that SNES is able to synthesize policies that are very likely to satisfy probabilistic constraints.
We see various directions for future research in safe system and policy synthesis. As a direct extension to our work, it would be interesting to extend other reinforcement learning algorithms with our approach of online adaptation of the Lagrangian with Bayesian model checking, such as value-based, actor-critic and policy gradient algorithms. We also think that notions of unbounded probabilistic verification are of high interest for policy synthesis with general safety properties. Another direction could be the inclusion of curricula into the learning process, gradually increasing the strength of the constraints over the course of learning, thus potentially speeding up the learning process and allowing for convergence to more effective local optima. Finally, we think that safe learning in multiagent systems dealing with feedback loops, strategic decision making and nonstationary learning dynamics poses interesting challenges for future research.
References
 [1] (2013) 7th IEEE International Conference on Self-Adaptation and Self-Organizing Systems Workshops, SASOW 2013, Philadelphia, PA, USA, September 9-13, 2013. IEEE Computer Society. ISBN 9781479950867.
 [2] (2020) The SOTA approach to engineering collective adaptive systems. International Journal on Software Tools for Technology Transfer 1.
 [3] (2012) Model checking goal-oriented requirements for self-adaptive systems. In IEEE 19th International Conference and Workshops on Engineering of Computer-Based Systems, ECBS 2012, Novi Sad, Serbia, April 11-13, 2012, M. Popovic, B. Schätz, and S. Voss (Eds.), pp. 33–42.
 [4] (1999) Constrained Markov decision processes. Vol. 7, CRC Press.
 [5] (2017) Deep reinforcement learning: a brief survey. IEEE Signal Processing Magazine 34 (6), pp. 26–38.
 [6] (2018) Model checking probabilistic systems. In Handbook of Model Checking, E. M. Clarke, T. A. Henzinger, H. Veith, and R. Bloem (Eds.), pp. 963–999.
 [7] (2019) The 10,000 facets of MDP model checking. In Computing and Software Science, pp. 420–451.
 [8] (2008) Principles of model checking. MIT Press.
 [9] B. Beckert, F. Damiani, F. S. de Boer, and M. M. Bonsangue (Eds.) (2013) Formal Methods for Components and Objects, 10th International Symposium, FMCO 2011, Turin, Italy, October 3-5, 2011, Revised Selected Papers. Lecture Notes in Computer Science, Vol. 7542, Springer. ISBN 9783642358869.
 [10] (2016) QoS-aware multi-armed bandits. In 2016 IEEE 1st International Workshops on Foundations and Applications of Self* Systems (FAS*W), pp. 118–119.
 [11] (2017) Bayesian verification under model uncertainty. In 2017 IEEE/ACM 3rd International Workshop on Software Engineering for Smart Cyber-Physical Systems (SEsCPS), pp. 10–13.
 [12] (2015) OnPlan: a framework for simulation-based online planning. In Formal Aspects of Component Software - 12th International Conference, FACS 2015, Niterói, Brazil, October 14-16, 2015, Revised Selected Papers, pp. 1–30.
 [13] C. Braga and P. C. Ölveczky (Eds.) (2016) Formal Aspects of Component Software - 12th International Conference, FACS 2015, Niterói, Brazil, October 14-16, 2015, Revised Selected Papers. Lecture Notes in Computer Science, Vol. 9539, Springer. ISBN 9783319289335.
 [14] (2013) A life cycle for the development of autonomic systems: the e-mobility showcase. In (1), pp. 71–76.
 [15] C. Canal and A. Idani (Eds.) (2015) Software Engineering and Formal Methods - SEFM 2014 Collocated Workshops: HOFM, SaFoMe, OpenCert, MoKMaSD, WS-FMDS, Grenoble, France, September 1-2, 2014, Revised Selected Papers. Lecture Notes in Computer Science, Vol. 8938, Springer. ISBN 9783319152004.
 [16] (2018) Engineering sustainable and adaptive systems in dynamic and unpredictable environments. In Leveraging Applications of Formal Methods, Verification and Validation. Distributed Systems - 8th International Symposium, ISoLA 2018, Limassol, Cyprus, November 5-9, 2018, Proceedings, Part III, pp. 221–240.
 [17] I. Chatzigiannakis, M. Mitzenmacher, Y. Rabani, and D. Sangiorgi (Eds.) (2016) 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy. LIPIcs, Vol. 55, Schloss Dagstuhl - Leibniz-Zentrum für Informatik. ISBN 9783959770132.
 [18] (2017) Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18 (1), pp. 6070–6120.
 [19] (2018) A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pp. 8092–8101.
 [20] (2011) Statistical model checking for cyber-physical systems. In International Symposium on Automated Technology for Verification and Analysis, pp. 1–12.
 [21] (1993) Goal-directed requirements acquisition. Sci. Comput. Program. 20 (1-2), pp. 3–50.
 [22] (1979) Conjugate priors for exponential families. The Annals of Statistics 7 (2), pp. 269–281.
 [23] (2018) Designing systems with detection and reconfiguration capabilities: a formal approach. In Leveraging Applications of Formal Methods, Verification and Validation. Distributed Systems - 8th International Symposium, ISoLA 2018, Limassol, Cyprus, November 5-9, 2018, Proceedings, Part III, Lecture Notes in Computer Science, Vol. 11246, pp. 155–171.
 [24] (2019) Safety-guided deep reinforcement learning via online Gaussian process estimation. arXiv preprint arXiv:1903.02526.
 [25] (2018) Safe reinforcement learning via formal methods: toward safe control through proof and learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
 [26] (2019) Verifiably safe off-model reinforcement learning. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pp. 413–430.
 [27] (2011) Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323.
 [28] (2016) Continuous collaboration for changing environments. Trans. Found. Mastering Chang. 1, pp. 201–224.
 [29] (2015) The ensemble development life cycle and best practices for collective autonomic systems. In Software Engineering for Collective Autonomic Systems - The ASCENS Approach, Lecture Notes in Computer Science, Vol. 8998, pp. 325–354.
 [30] (2009) A Bayesian approach to model checking biological systems. In International Conference on Computational Methods in Systems Biology, pp. 218–234.
 [31] (2019) Robust temporal difference learning for critical domains. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 350–358.
 [32] (2011) PRISM 4.0: verification of probabilistic real-time systems. In International Conference on Computer Aided Verification, pp. 585–591.
 [33] (2016) Model checking and strategy synthesis for stochastic games: from theory to practice. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pp. 4:1–4:18.
 [34] (2010) Statistical model checking: an overview. In International Conference on Runtime Verification, pp. 122–135.
 [35] (2014) Scalable verification of Markov decision processes. In Software Engineering and Formal Methods - SEFM 2014 Collocated Workshops, Grenoble, France, September 1-2, 2014, Revised Selected Papers, pp. 350–362.
 [36] (2019) Safe reinforcement learning with model uncertainty estimates. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8662–8668.
 [37] (2006) Concurrency: State Models and Java Programs (2nd ed.). Wiley. ISBN 9780470093559.
 [38] (1996) Average reward reinforcement learning: foundations, algorithms, and empirical results. Machine Learning 22 (1-3), pp. 159–195.
 [39] (2018) Coordination model with reinforcement learning for ensuring reliable on-demand services in collective adaptive systems. In Leveraging Applications of Formal Methods, Verification and Validation. Distributed Systems - 8th International Symposium, ISoLA 2018, Limassol, Cyprus, November 5-9, 2018, Proceedings, Part III, pp. 257–273.
 [40] T. Margaria and B. Steffen (Eds.) (2018) Leveraging Applications of Formal Methods, Verification and Validation. Distributed Systems - 8th International Symposium, ISoLA 2018, Limassol, Cyprus, November 5-9, 2018, Proceedings, Part III. Lecture Notes in Computer Science, Vol. 11246, Springer. ISBN 9783030034238.
 [41] (2020) Learning and testing resilience in cooperative multi-agent systems. In Proceedings of the 19th International Conference on Autonomous Agents and Multiagent Systems.
 [42] (2017) Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2817–2826.
 [43] (2019) Benchmarking safe exploration in deep reinforcement learning. Technical report, OpenAI.
 [44] (2017) Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
 [45] (2019) Uncertainty-based out-of-distribution classification in deep reinforcement learning. arXiv preprint arXiv:2001.00496.
 [46] (2018) Optimization of global production scheduling with deep reinforcement learning. Procedia CIRP 72 (1), pp. 1264–1269.
 [47] M. Wirsing, M. M. Hölzl, N. Koch, and P. Mayer (Eds.) (2015) Software Engineering for Collective Autonomic Systems - The ASCENS Approach. Lecture Notes in Computer Science, Vol. 8998, Springer. ISBN 9783319163093.
 [48] (2011) ASCENS: engineering autonomic service-component ensembles. In Formal Methods for Components and Objects, 10th International Symposium, FMCO 2011, Turin, Italy, October 3-5, 2011, Revised Selected Papers, pp. 1–24.
 [49] (2010) Bayesian statistical model checking with application to Simulink/Stateflow verification. In Proceedings of the 13th ACM International Conference on Hybrid Systems: Computation and Control, pp. 243–252.