1 Introduction
Learning to control an uncertain system is a problem with many applications in various engineering fields. This work has been inspired by problems in autonomous and semiautonomous systems. In the majority of practical scenarios, one wishes that the learning process terminates quickly and does not violate safety limits on key variables. It has to learn the control policy directly from experiments since there is no need to first derive an accurate physical model of the system. The main challenge when using such an approach is to ensure safety constraints during the learning process. However, constraining the freedom of the system can negatively affect performance, while attempting to learn less conservative safety constraints might fail to preserve safety if the learned constraints are inaccurate. Following this problem, in this work we seek for an approach to integrate safety in the learning process that relies on a partly known statespace model of the system and regards the unknown dynamics as an additive bounded disturbance.
During related work for this paper has been reviewed different approaches for model to control modern CyberPhysical Systems in the setting described above. We start with the general learning to control approach [Goodfellowetal2016] and concentrate on Bayesian sequential modelbased optimization [SeqOptBayes]
with the purpose of defining technics for additive disturbance estimation. That also gives an idea of possible implementation of exploration step into the algorithm.
Generally, most of the works in safety during model optimization are based on celebrated HamiltonJacobiIsaacs (HJI) reachability analysis, as for example [conf/cdc/AkametaluKFZGT14],[Ding_towardreachabilitybased], which are taken as the basic papers for the future work. In [conf/cdc/AkametaluKFZGT14] constraints bound made through Bayesian learning principles, more precisely has been used Gaussian Process (GP) model over the state space, which allows to model addition disturbances with infinitedimensional basis of functions. This is done by choose to model the covariance function as a squared exponential distance between the states, the proof can be found for example in [konig2013eigenvalue]. By considering worstcase disturbances, this method determines a safe region in the state space and provides a control policy to stay within that region. The main advantage is that in the interior of this region one can execute any desired action as long as the safe control is applied at the boundary, leading to a least restrictive control law. The desired action can be specified by any method, including any learning algorithm.
But the proposed technic is not without shortcomings, as HJI reachability analysis is done through solving PDE, which involves huge computation capacity. Then, in order to guarantee safety, the system designer must often rely on a nominal model that assumes conservative worstcase disturbances, which reduces the computed safe region, and thereby the region where the system can learn. Moreover, the least restrictive control law framework decouples safety and learning, which can lead to poor results, since it assumes drastic changes of control policy, due to discrete type of the response on safety constraints, and can lead to unexpected effects in the physics of the system as well as the learning controller has no notion of the unsafe regions of the state space and may attempt to drive the system into them. Has been reviewed the different approaches in the direction of reachability analysis, and one used in [abate2008probabilistic] has been chosen as a basic, with further extensions. As a benchmark for safe learning approach has been chosen an inverted pendulum as a classic model. This system has the advantage to only have two states so that it can be analyzed simply and accurate, and the results can be illustrated fairly easily.
2 Methodological background
2.1 Problem Setup
For supervised learning it is needed a preliminary information. Assume that we have a training set
of observations, , wheredenotes an input vector of dimension
and denotes a scalar output or target dependent variable; the column vector inputs for the design matrix all cases are aggregated in the design matrix , and the observed values are collected in the vector , so we can write . Additionally, assume that we have a surrogate model, for which we partly know the dynamics of the system, and it is affected by additive i.i.d. noise , where . For model this phenomena, consider the following nonlinear stochastic discrete time dynamical control system:(1) 
where, for any :

is the state space;

is the control input and U is the control input space;

is the real random variable, which models the unknown part of the system.
We model the known parts of a statespace model of the system as and think of the unknown parts of the model as an additive statedependent disturbance.
2.2 Gaussian Process
Assuming jointlyGaussian probability distribution of the disturbances and inputs, we model entire system as a stochastic process
over the states. The method employed for disturbance estimation is a Gaussian Process (GP) regression [Rasmussen]:(2) 
where mean and covariance
describe unknown part and noise of the system. They are chosen to capture the characteristics of the model (linearity, periodicity, etc), and defined by a set of hyperparameters
.Inspired by [conf/cdc/AkametaluKFZGT14] we use the squared exponential covariance function to define the kernel :(3) 
where is a diagonal matrix, with as the th diagonal element being the squared exponential’s characteristic length for the th state,
being the signal variance,
being the measurement noise variance andis already given input in the system. Take in account, that covariance kernel function of output of the system is defined trough the inputs. In such a way, it can be shown that the squared exponential covariance function corresponds to a Bayesian linear regression model with an infinite number of basis functions, as through the Mercers theorem
[konig2013eigenvalue]. All provided coefficients form the hyperparameters , are chosen to maximize the marginal likelihood of the training data set and are thus recomputed for each new batch of data, socalled MLEestimate:In the Bayesian formalism, we can now specify a prior over the parameters, expressing our beliefs about the prior parameters, just writing the joint prior distribution of the training outputs, , and the test (predictive) outputs :
(4) 
If there are training points and test points then denotes the matrix of the covariances evaluated at all pairs of training and test points, and similarly for the other entries , and
. The same way we can obtain joint distribution for one test point
.Deriving the conditional distribution, using Gaussian Identities ( see Appendix A in the full version ), we arrive at the key predictive Equations for Gaussian process regression, and therefore the probability of expected dynamics in (1) can be evaluated at any given state for the future reachability analysis:
(5)  
Regarding safety constraints, technics for reachability analysis used in [article] for Invariance Stochastic Problem has been chosen. Due to the marginalization of the probability distribution over the safe states we can build a reward function with the dependence of safety constraints, in other words, the introduced later algorithm has zero reward outside of the safe region, meanwhile maximizing chances to stay in. We, therefore, collect data samples through interactions with the system and estimate a model and disturbance based on GP regression.
2.3 Reachability Analysis
We seek for the following class of controls :
(6) 
namely, the class of time–varying feedback functions. Let be the class of control inputs sequence such that , for any . Any is called control policy. Given , and , set
(7) 
for any .
Given a safe set representing the set of ‘good’ states within which the state evolution of system (1) must evolve, our problem is to find a control policy that maximizes the probability of the state to be in , for any time within a finite time horizon .
More formally, let be the probability space associated with the system. The Stochastic Invariance Problem we reformulate as follows.
Problem 2.1.
Given a finite time horizon and a safe set subset of X, find the optimal control policy that maximizes
(8) 
In the next section, we used Lemmas and Propositions, proven in [article], changing the framework adapted for our case with the presence of safe set of states and a target set for implicitly define probability quantity from Problem 2.1 and construct a recursive optimization algorithm, that enables the computation of optimal control policy . (See Appendix in section 6.1)
3 Proposed Approach
In accordance with Problem 2.1, reformulate optimal control problem for stochastic systems, as our predictive dynamic described by (1), with integrated safety approach:
Problem 3.1.
Given a nonlinear stochastic discretetime dynamical control system , initial state , the target set of states , set of safe states and finite time horizon , find the optimal control policy that maximizes
(9) 
We now give the main result of the paper [article]: an algorithm that enables the computation of optimal control policy solving the Reachability Problem (3.1) just by greedy solving recursively optimization problem for every step in finite time horizon . Additionally, we reformulate it for the probability distribution, given by GP regression estimation in (2.2).
Theorem 3.2.
The optimal value of the problem 3.1 is equal to
where is given by the last step of the following recurrence algorithm,
(10) 
Furthermore, if maximizes the righthand side of Equation (10) for each and , then the class of policies is optimal.
Proof: Omitted. See Appendix (section 6.2) with following theory needed.
3.1 Exploration
Inspired by [Lee2010_2010arXiv1004.4027G], [BayUncConst2014arXiv1403.5607G], it has been chosen methodology for exploration purposes, which is useful in deriving more precise model through simulations with randomness. After a predefined number of steps, we relocate the target set inside the safe set of states to be close to the state, taking in account marginalization over all the control inputs, with the maximal predicted variance, or the maximal difference between upper and lower confidence bounds in other words. Define as for a fixed radius . Let be a set of feasible control inputs and is the input vector, formed by the combination of states and control inputs , as it was given before . Define constrained optimization problem:
(11) 
As a result, we a form new target set as a ball of radius centered in , that allows the safe exploration of the system and proved numerically in the next chapter.
4 Implementation
4.1 Inverted Pendulum
As a benchmark example consider a damped inverted pendulum system with mass , length , and friction coefficient . The states of the system are the pendulum angle and angular velocity . The system is disturbed by an additive statedependent disturbance . The description of dynamics given by [doya2000reinforcement]:
(12) 
All constants are assumed to be positive.
We would like to approximate this system with the discrete time system in state space so that , where is a sampling time. Notice that might not be equal to because there are infinite input signals which have the values at the sampling points , therefore, in general, they give rise to different output signals . As so, it is impossible to find an equivalent discretetime system such that for any input signal . Therefore, the best we can hope for is to find an approximation of the time derivative of . We can approximate as:
for small enough values of . The approximation associated with these substitutions is known as Forward Euler Approximation for state space models:
It is hereby important that the time step is small enough to actually allow to the learning agent to react on the changes, otherwise could be the position of ”synchronization” between state change and the reaction of the system. From the other side, a small time step does not allow to computationally proceed the algorithm, so this question under further consideration.
Since in the problem solve we are using programming tools, defining the set of states and the set of actions implies discretization over the intervals and , where is a circular state and should be in the interval is not larger than its period . The number of discretization steps impacts the convergence speed of the learning algorithm, due to correspondence to the number of evaluation of GP function during integration, and was chosen around for each state dimension so that we end up with evaluations in total for one step.
During the implementation of this technics has been used Matlab software with additional GPML toolbox for model Gaussian Process in a manner described [Rasmussen]. Note, that any further call to the GP function corresponds to one of the aspects, as inference or likelihood estimation through this toolbox[rasmussen2010gaussian], [rasmussen2016gpml]. Pseudocode for this problem solution is introduced in Algorithm 1, where apart from the data given above, it is required to define as the number of steps for time horizon considered, as the total number of steps to be done by system, as initial conservative hyperparameters, as variance and constants for polynomials, and define the function of the model, which is actually corresponds to (4.1) with additional bounds on total state space already incorporated.
The function Cost, used above, is described by Algorithm 2.
4.2 Results
The inverted pendulum system has been modeled as described above with physically meaningful state set bounds and . The statespace has been discretized with steps in each dimension and the action space bounds have been set as feasible. In addition, we form the safe set as restriction over , which corresponds to , and the initial target set as a ball with radius : , equal to around . The sampling time for the learning loop is chosen as . Initially, true disturbance introduced to the system is chosen to be and and graphically presented in Figure 1. This picture presents the real output of the system with the dependence of combination of the input state and , given the control input for reference. Hence, for having the possibility to describe such complex functions, after some empirical considerations, 5th degree Polynomial mean function has been chosen. As it was described above, covariance function is defined in accordance with Eq.(3). The set of hyperparameters, in this case, is formed by , where being the signal variance, being the measurement noise variance, and being the squared exponential’s characteristic length for the th state and are coefficients before polynomials. The hyperparameters are chosen as a solution of maximization of the marginal likelihood of the training data set and are thus sequentially recomputed for each new batch of data every 8 steps.
In what follows, the results from four learning iterations with every steps are presented. In this setting, exploration mechanics has not been incorporated. After each iteration, a new disturbance estimation, hyperparameters, and number of failures are carried out. The dynamics of the system is given in Figure 2, where in addition with the red line showed boundaries of safe set , so it is possible to define whether system attempts to fail. Evolution of expectation and variances of GP over the statespace are located on the Figures 3 and 4.
It is clear, that algorithm managed to learn disturbance and stay in the safe region. But, observing the variances, we can say that not all the system is identified and further exploration to ensuring safety needed. For this purpose has been introduced exploration set up. Significant numerical changes in mean function values were not noticed, but we see great improvements for the variance after 4 iterations, they are given in Figure 5. The same way safety constraints were satisfied after the second iteration done, but as it can be seen, more precise knowledge of the system is obtained, since even the order of variance become up to , which corresponds to significantly decreased uncertainty.
Additionally, here is presented comparative statistics of the system with incorporated exploration and without one. On the Figure 6 are marked states, visited during the learning process in both settings. With the green rectangle it is shown the boundaries of the safe set , in such a way we can quantitatively display accuracy of the learning algorithm. It is possible to note, that even if incorporation of exploration gives a control strategy more prone to failures, but this in future gives us less variance and, in fact, more confident knowledge about the system.
5 Conclusions and future work
Applying Bayesian learning methods to control tasks is a promising approach to overcome the strong dependence of traditional modelbased control methods on accurate models especially for nonlinear systems. Modelfree learning techniques have been extensively studied and proven their value in various applications. However, the problem of how to satisfy constraints during the learning process has not yet been addressed. Therefore, Reachability Analysis has been incorporated into the algorithm to ensure safety during the learning process as well as target achievement. By modeling the unknown parts of the statespace model as an unknown additive disturbance, this method provides a way to work with a stochastic system for further optimizing the control policy. Combining Bayesian learning and exploration analysis provides a way to regard safety learning to control a system with uncertain dynamics. Compared to the approach presented in [conf/cdc/AkametaluKFZGT14], the method has been used totally different reachability analysis technic [article] to safely define control policy and has been extended to incorporate exploration. We compared the approaches with exploration and without exploration, and find that both, policy learning and disturbance estimation, can be considerably improved by encouraging exploration.
However, the proposed algorithm needs improvements. In this paper the inverted pendulum system has been studied, which has the advantage of being easy to analyze and illustrate. But even for this system, in spite of all the work on optimization, real processing time does not let us apply it in realworld system. For reference, with time horizon, evaluation of one step on frequency processor takes from to seconds, depending on the size of the bath, and in the working regime, with it takes up to seconds, which shows us possibilities to apply it only in not realtime systems. These issues are due to the necessity of taking the invert of the matrices, size of which is dependent of the size of labeled data provided and also due to the usage of ”offtheshelf” optimizers able to work with stochastic functions. Thereby, for the future work it is proposed to work with finite state discretization [doya2000reinforcement], sparse matrices [Rasmussen]
and incorporation in advance learning algorithms, as Reinforcement learning
[conf/cdc/AkametaluKFZGT14]. But, speaking about the scalability of the model, convergence speed would be decreased considerably when the number of states increases [conf/cdc/AkametaluKFZGT14]. Secondly, the disturbance estimation could possibly be improved. The GP regression is implemented with batches of samples increasing with every step is done. The disturbance estimation might be improved considerably by employing a recursive method that takes all recorded samples into account the same way learning process is proceeding. It is not trivial how to implement GP regression in a recursive manner because the method relies on recomputing the covariances for each new input. Finally, the most significant improvement could be made by performing formal safety guarantees for the whole algorithm. This is, however, a very difficult matter because we come to relatively safe development of the process after some boundary of knowledge of the system, which is not trivially defined.In summary, remark that ideas from this paper could serve as a good starting point for future research. As pendulum, however, not really is the ”safetycritical applications”, later we consider to approach to pumpschedule problem as a strictly constrained problem, which is close to realworld task. Even though some adjustments should be made, i.e. engagement modern learning technologies, the approach of safe learning for control applications is really interesting and could overcome some of the limitations of traditional control theory.
6 Appendix
6.1 Preliminary results
The first result makes explicit in the probability quantity (8), the dependence on dynamics of the system (1) given as:
Lemma 6.1.
Given a control policy and a set ,
(13) 
where for any and
(14) 
Proof: By induction. For ,
(15) 
Suppose that (13) holds for step . By Bayes formula, the following chain of equalities holds:
(16) 
Thus, (13) holds for step .
Inspired by [article], we now define the cost function associated to the probability quantity (8) and hence to the Stochastic Invariance Problem. We introduce the following cost function which associates a real number to a triple by:
(17) 
All control policies are such that the cost function as in (17), is well–defined, due to properties explained in section 2.2. The following result establishes a formal functional relationship between and and hence between and the probability quantity (8).
Proposition 6.2.
Given a control policy and a safe set , for any :
(18) 
Proof: By induction. By definition of in (14),
(19) 
Hence, the statement holds for . By proceeding backwards, we suppose that (18) is true for step and we prove that (18) is true for step . By replacing Equation (14) into Equation (18), and by Equation (17), we have:
and hence the result follows.
The result above gives the way for rewriting the probability quantity (8) in terms of the cost function :
Proposition 6.3.
Given a control policy and a safe set ,
(20) 
By Proposition 6.3, it is possible to rewrite the Stochastic Invariance Problem, as follows.
Problem 6.4.
Given a finite time horizon and a safe set , compute:
(21) 
Problem 6.4, as reformulated above, highlights connections between the Stochastic Invariance Problem and optimal control problems. In accordance with Problem 2.1 we formulate optimal control problem for stochastic systems with integrated safety approach as Problem 3.1 from section 2.3. Similarly to Problem 6.4 we redefine the optimal control problem, just substituting a safe set for in the last step. Before giving the main proof, we need the following technical result.
Lemma 6.5.
Given a measurable subset of X, let be
such that for any and let be the class of feedback functions, i.e. . The optimal control policy , solving the following optimization problem:
(22) 
is such that for any ,
(23) 
almost everywhere with respect to Conversely, if satisfies (23) for any , then is the solution to the optimization problem (22).
Proof: For the sake of contradiction, suppose that there exists a measurable set , with non–zero Lebesgue measure, such that for any , inequality (23) is not true. Then, there exists a feedback control policy such that
Define the following control policy,
Then,
and hence is not the optimal control policy that solves problem (22). The second part of the statement is trivial.
6.2 Proof of Theorem 3.2
Proof: For , let be the optimal cost for the stage problem that starts at state and time , and ends at time , i.e. . For , we define Now we will show by induction that the functions are equal to the functions , as defined in (10), so that for we obtain the desired result. Assume that for some and all , we have that . Then, since , we have for all ,
completing the induction. The second equality holds by definition of in (17). In the third equality, we moved the supremum over inside the integral because of Lemma 6.5 and of the principle of optimality argument ( see e.g. [bertsekas2005dynamic] ). In the forth equality, we used the definition of , and in the fifth equality, we used the induction hypothesis. Finally, in the sixth equality, we converted the supremum over to a supremum over , using the fact that for any function of and , we have: