1 Introduction
Reinforcement Learning (RL) is a powerful tool for tackling stochastic processes without depending on a detailed model of the probability distributions underlying the state transitions. Indeed, most RL methods rely purely on observed data and on realizations of the stage cost assessing the system performance. RL methods seek to increase the closed-loop performance of the control policy deployed on the system as observations are collected. RL has drawn increasingly large attention thanks to its accomplishments, such as making it possible for robots to learn to walk or fly from experiments (Wang et al., 2012; Abbeel et al., 2007).
Most RL methods are based on learning the optimal control policy for the real system either directly or indirectly. Indirect methods typically rely on learning a good approximation of the optimal action-value function underlying the system. The optimal policy is then obtained indirectly as the minimizer of the value-function approximation over the inputs. Direct RL methods, when based on the policy gradient, seek to adjust the parameters of a given policy such that it yields the best closed-loop performance when deployed on the real system. An attractive advantage of direct RL methods over indirect ones is that they are based on formal necessary conditions of optimality for the closed-loop performance of the policy parameters, and therefore guarantee, for a large enough data set, the (possibly local) asymptotic optimality of the parameters (Sutton et al., 1999; Silver et al., 2014).
RL methods often rely on Deep Neural Networks (DNNs) to carry the policy approximation. Unfortunately, control policies based on DNNs provide limited opportunities for formal verification of the resulting policy, and for imposing hard constraints on the evolution of the state of the real system. The development of safe RL methods, which aims at tackling this issue, is currently an open field of research (Garcia and Fernandez, 2015). A novel approach towards providing formal safety certificates in the context of RL has recently been proposed in (Gros and Zanon, 2019, 2020; Zanon and Gros, 2019), where the policy approximation is based on robust Model Predictive Control (MPC) schemes rather than on unstructured function approximators like DNNs. The validity of this choice is discussed in detail in (Gros and Zanon, 2019). In (Gros and Zanon, 2020), methodologies to deploy direct RL techniques on MPC-based policy approximations are proposed. These methodologies are, however, restricted to continuous input spaces and therefore exclude integer decision variables, which are central in a number of applications.
In this paper, we propose an extension of the policy gradient techniques proposed in (Gros and Zanon, 2020) to mixed-integer problems. A mixed-integer MPC is used as a policy approximation, and a policy gradient method adjusts the MPC parameters for closed-loop performance. We detail how the actor-critic method can be deployed in this specific context. In particular, we propose an asymptotically exact hybrid stochastic-deterministic policy approach allowing for computing the policy gradient at a lower computational complexity than a full stochastic approach. We then propose a hybrid compatible advantage-function approximator tailored to our formulation. We finally detail how the mixed-integer MPC can be differentiated at a low computational cost, using principles from parametric Nonlinear Programming, in order to implement the actor-critic method. The proposed method is illustrated on a simple example, allowing for an unambiguous presentation of the results.
The paper is structured as follows. Section 2 provides background material on MDPs and RL. Section 3 presents the construction of a mixed-integer stochastic policy using a mixed-integer MPC scheme to support the policy approximation. Section 4 details an actor-critic method tailored to the proposed formulation, and how the policy gradient can be estimated. A compatible advantage-function approximation is proposed. Section 5 details how the mixed-integer MPC scheme can be efficiently differentiated. Section 6 proposes an illustrative example, and Section 7 provides some discussion.
2 Background
In the following, we will consider that the dynamics of the real system are described as a stochastic process on (possibly) continuous state-input spaces. We will furthermore consider (possibly) stochastic policies π, taking the form of probability densities:
π[a | s]  (1)
denoting the probability density of selecting a given input a when the system is in a given state s. Deterministic policies delivering a as a function of s will be labelled as:
a = π(s)  (2)
Any deterministic policy can be viewed as a stochastic one, having a Dirac function as its probability density (or a unit function for discrete inputs), i.e., π[a | s] = δ(a − π(s)).
We consider a stage cost function L(s, a) and a discount factor γ ∈ (0, 1); the performance of a policy π is assessed via the total expected cost:
J(π) = E_π[ Σ_{k=0}^∞ γ^k L(s_k, a_k) ]  (3)
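As a side illustration, the discounted objective (3) can be estimated by truncated Monte-Carlo rollouts. The sketch below is generic Python, with `policy`, `step` and `stage_cost` standing in for a user-supplied policy, state transition and stage cost; all names are illustrative and not part of the formulation above.

```python
import numpy as np

def estimate_return(policy, step, stage_cost, s0, gamma=0.9,
                    horizon=200, n_rollouts=500, rng=None):
    """Truncated Monte-Carlo estimate of J(pi) = E[ sum_k gamma^k L(s_k, a_k) ]."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s, rng)           # sample an input from the policy
            ret += disc * stage_cost(s, a)
            disc *= gamma
            s = step(s, a, rng)          # sample the next state
        total += ret
    return total / n_rollouts
```

For γ < 1 the truncation error decays as γ^horizon, so a moderate horizon already yields an accurate estimate.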
The optimal policy associated to the state transition, the stage cost and the discount factor is deterministic and given by:
π* = arg min_π J(π)  (4)
The value function V^π, action-value function Q^π and advantage function A^π associated to a given policy π are given by (Bertsekas, 1995; Bertsekas and Shreve, 1996; Bertsekas, 2007):
V^π(s) = E_π[ Σ_{k=0}^∞ γ^k L(s_k, a_k) | s_0 = s ]  (5a)
Q^π(s, a) = L(s, a) + γ E[ V^π(s_+) | s, a ]  (5b)
A^π(s, a) = Q^π(s, a) − V^π(s)  (5c)
where the expected value in (5b) is taken over the state transition s → s_+, and the one in (5a) is taken over the state transitions and the policy (1).
2.1 Stochastic policy gradient
In most cases, the optimal policy π* cannot be computed, either because the system is not exactly known or because solving (5) is too expensive. It is then useful to consider approximations π_θ of the optimal policy, parametrized by θ. The optimal parameters θ* are then given by:
θ* = arg min_θ J(π_θ)  (6)
The policy gradient ∇_θ J(π_θ) associated to the stochastic policy π_θ is then instrumental in finding θ* by taking gradient steps in θ. The policy gradient can be obtained using various actor-critic methods (Sutton and Barto, 1998; Sutton et al., 1999). In this paper, we will use the actor-critic formulation:
∇_θ J(π_θ) = E_{π_θ}[ ∇_θ log π_θ[a | s] A^{π_θ}(s, a) ]  (7)
for stochastic policies, and the actor-critic formulation:
∇_θ J(π_θ) = E[ ∇_θ π_θ(s) ∇_a Q^{π_θ}(s, a) |_{a = π_θ(s)} ]  (8)
for deterministic policies.
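For intuition, the stochastic formulation (7) can be approximated by sampling: average the score ∇_θ log π_θ[a | s], weighted by the advantage, over observed transitions. The sketch below assumes a simple linear-Gaussian policy a ∼ N(θs, σ²) purely for illustration; it is not the MPC-based policy constructed later in the paper.

```python
import numpy as np

def gaussian_score(theta, s, a, sigma):
    """d/dtheta of log N(a; theta*s, sigma^2): score of a linear-Gaussian policy."""
    return (a - theta * s) * s / sigma**2

def policy_gradient_estimate(theta, states, advantage, sigma=0.5, rng=None):
    """Sample-average estimate of (7): E[ score * advantage ]."""
    rng = np.random.default_rng(rng)
    grads = []
    for s in states:
        a = theta * s + sigma * rng.standard_normal()  # sample a ~ pi_theta(.|s)
        grads.append(gaussian_score(theta, s, a, sigma) * advantage(s, a))
    return float(np.mean(grads))
```

The estimate is unbiased but noisy; variance-reduction via the advantage (rather than the raw return) is precisely why actor-critic schemes are preferred.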
The value functions V^{π_θ}, Q^{π_θ} and A^{π_θ} associated to a given policy π_θ are typically evaluated via Temporal-Difference (TD) techniques (Sutton and Barto, 1998), and require that a certain amount of exploration is included in the deployment of the policy. For deterministic policies, the exploration can, e.g., be generated by including stochastic perturbations over the policy, while stochastic policies generate exploration by construction. Note that it is fairly common in RL to define the stochastic policy π_θ as an arbitrary density, e.g., the normal distribution, centered at a deterministic policy. We shall observe here that the deterministic policy gradient (8) is not suited as such for integer inputs, as the gradients do not exist on discrete input spaces. On continuous input spaces, the choice between the deterministic approach (8) and the stochastic approach (7) is typically motivated by computational aspects.
3 Mixed-integer optimization-based policy
In this paper, we will consider parametrized deterministic policies based on parametric optimization problems. In particular, we will focus on optimization problems resulting from a nominal mixed-integer MPC formulation. The results proposed in this paper extend to robust MPC, enabling the construction of safe Reinforcement Learning methods, but this case is omitted here for the sake of brevity.
3.1 Policy approximation based on mixed-integer MPC
The mixed-integer MPC scheme reads as:
min_{x, u, i}  T_θ(x_N) + Σ_{k=0}^{N−1} ℓ_θ(x_k, u_k, i_k)  (9a)
s.t.  x_{k+1} = f_θ(x_k, u_k, i_k),  x_0 = s,  (9b)
h_θ(x_k, u_k, i_k) ≤ 0,  (9c)
h_θ^N(x_N) ≤ 0,  (9d)
i_k ∈ {0, 1}^{n_i},  (9e)
where x = {x_0, …, x_N} are the predicted system trajectories, u = {u_0, …, u_{N−1}} the planned continuous inputs, and i = {i_0, …, i_{N−1}} the planned integer inputs. Without loss of generality, we consider binary integer inputs. Functions ℓ_θ, T_θ are the stage and terminal costs, functions h_θ are the stage constraints, and function h_θ^N is the terminal constraint.
For a given state s and parameters θ, the MPC scheme (9) delivers the continuous and integer input profiles
u^⋆(s, θ) = {u_0^⋆, …, u_{N−1}^⋆},  (10a)
i^⋆(s, θ) = {i_0^⋆, …, i_{N−1}^⋆},  (10b)
with u_k^⋆ ∈ R^{n_u} and i_k^⋆ ∈ {0, 1}^{n_i}. The MPC scheme (9) generates a parametrized deterministic policy
π_θ(s) = (u_0^⋆(s, θ), i_0^⋆(s, θ)),  (11)
where
u_0^⋆(s, θ),  (12a)
i_0^⋆(s, θ),  (12b)
are the first elements of the continuous and integer input sequences generated by (9). In the following, it will be useful to consider the MPC scheme (9) as a generic parametric mixed-integer NLP:
min_{x, u, i}  Φ_θ(x, u, i, s)  (13a)
s.t.  G_θ(x, u, i, s) = 0,  (13b)
H_θ(x, u, i, s) ≤ 0,  (13c)
i_k ∈ {0, 1}^{n_i},  k = 0, …, N−1,  (13d)
where function Φ_θ gathers the stage and terminal cost functions from (9a), function G_θ gathers the dynamic constraints and initial conditions (9b), and function H_θ gathers the stage and terminal constraints (9c)-(9d).
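To make the structure of (13) concrete, the toy sketch below solves a generic mixed-integer program by enumerating the binary variables and performing a crude grid search over the continuous variable in place of a smooth NLP solver; practical mixed-integer MPC implementations rely on branch-and-bound instead. All names are illustrative.

```python
from itertools import product
import numpy as np

def solve_mi_program(cost, n_int, u_grid):
    """min over i in {0,1}^n_int and u in u_grid of cost(u, i), by full
    enumeration of the binaries and a grid search over the continuous variable."""
    best_val, best_u, best_i = np.inf, None, None
    for i in product((0, 1), repeat=n_int):
        i = np.array(i)
        vals = np.array([cost(u, i) for u in u_grid])  # continuous subproblem, fixed i
        k = int(np.argmin(vals))
        if vals[k] < best_val:
            best_val, best_u, best_i = float(vals[k]), float(u_grid[k]), i
    return best_val, best_u, best_i
```

The enumeration grows as 2^{n_int}, which is exactly the combinatorial cost that branch-and-bound methods aim to avoid.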
4 Actor-critic method
In order to build actor-critic methods for (11), exploration is required (Sutton and Barto, 1998). When the input space is constrained and mixed-integer, the exploration becomes nontrivial to set up, as 1. it must retain the feasibility of the hard constraints (9c)-(9d), and 2. simple input disturbances are not possible for the integer part, since the integer inputs are locked on an integer grid. To address this issue, we will adopt a stochastic policy approach, well suited for the integer part, and consider its asymptotically equivalent deterministic counterpart on the continuous input space, well suited for computational efficiency.
4.1 MPC-based exploration
In order to generate exploration, we will build a stochastic policy (1) based on the deterministic policy (11), where a gathers the continuous inputs u and integer inputs i actually applied to the real system, i.e., a = (u, i). We will build (1) such that it generates exploration that respects the constraints (9c)-(9d) with unitary probability. We propose to build (1) such that it becomes naturally separable between the integer and continuous parts in the policy gradient computation. To that end, we consider a softmax approach to handle the integer part of the problem. More specifically, we consider the parametric mixed-integer NLP:
Φ^⋆(i, s, θ) = min_{x, u, i}  Φ_θ(x, u, i, s)  (14a)
s.t.  G_θ(x, u, i, s) = 0,  (14b)
H_θ(x, u, i, s) ≤ 0,  (14c)
i_0 = i,  (14d)
i_k ∈ {0, 1}^{n_i},  k = 1, …, N−1,  (14e)
derived from (13), where the first integer input i_0 is assigned to i via constraint (14d). We will consider that Φ^⋆(i, s, θ) takes infinite value when the selected integer input i is infeasible. Let us label I(s, θ) the feasible set of i for a given state s and MPC parameters θ, and i^⋆(i, s, θ) the integer profile solution of (14). By construction, Φ^⋆(i, s, θ) < ∞ when i ∈ I(s, θ). We then define the softmax stochastic integer policy distribution using Φ^⋆:
π_θ^i[i | s] = exp(−Φ^⋆(i, s, θ)/σ_i) / Σ_j exp(−Φ^⋆(j, s, θ)/σ_i)  (15)
where σ_i > 0 is a parameter adjusting the variance of π_θ^i, and the sum runs over all candidate first integer inputs j.
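The softmax policy (15) is straightforward to sample once the optimal costs Φ^⋆(i, s, θ) of (14) have been computed for each candidate first integer input; infeasible integers carry an infinite cost, as in the text. A minimal, illustrative sketch:

```python
import numpy as np

def softmax_integer_policy(phi_values, sigma, rng=None):
    """Sample an integer input k with P[k] proportional to exp(-phi(k)/sigma),
    where phi(k) is the optimal cost of the NLP with the first integer input
    fixed to candidate k (np.inf when infeasible). sigma tunes the variance."""
    phi = np.asarray(phi_values, dtype=float)
    logits = -phi / sigma
    logits -= np.max(logits[np.isfinite(logits)])  # numerical stabilisation
    p = np.exp(logits)                              # exp(-inf) = 0: zero mass
    p[~np.isfinite(phi)] = 0.0                      # infeasible candidates
    p /= p.sum()
    rng = np.random.default_rng(rng)
    return int(rng.choice(len(p), p=p)), p
```

Subtracting the largest finite logit before exponentiating avoids overflow when the costs are large, without changing the resulting distribution.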
In order to build the continuous part of the policy, we will consider the continuous part of the stochastic policy as conditioned on i, taking the form of a probability density:
π_θ^u[u | s, i]  (16)
which will be constructed from the parametric NLP:
min_{x, u}  Φ_θ(x, u, i^⋆(i, s, θ), s) + d^T u_0  (17a)
s.t.  G_θ(x, u, i^⋆(i, s, θ), s) = 0,  (17b)
H_θ(x, u, i^⋆(i, s, θ), s) ≤ 0,  (17c)
derived from (13), but where the integer input profile is entirely assigned to i^⋆(i, s, θ), and where d is a random vector chosen as d ∼ N(0, Σ). The random variable u in (16) will then be selected as:
u = u_0^⋆(i, s, θ, d),  (18)
the first element of the continuous input sequence solving (17).
As previously observed in (Gros and Zanon, 2020), while π_θ^u is easy to sample, it is in general difficult to evaluate. Because u is conditioned on s and i, the Kolmogorov definition of conditional probabilities entails that the overall stochastic policy (1) reads as the distribution:
π_θ[a | s] = π_θ^u[u | s, i] π_θ^i[i | s]  (19)
We establish next a straightforward but useful result concerning the stochastic policy (19).
Lemma 1. The samples a = (u, i) generated from the stochastic policy (19) are feasible for the constraints (9c)-(9d) with probability 1.
Indeed, because Φ^⋆(i, s, θ) = ∞ when i ∉ I(s, θ), policy (15) selects feasible integer inputs with probability 1. Furthermore, NLP (17) is feasible for all i ∈ I(s, θ) and all d, such that its solution satisfies constraints (13b)-(13c). As a result, the samples generated from (19) are guaranteed to be feasible.
The policy gradient associated to (19) can be computed using (7). Unfortunately, it has been observed that this approach is computationally expensive for continuous input spaces (Gros and Zanon, 2020) when the policy is restricted by nontrivial constraints. Hence, we now turn to detailing how the policy gradient associated to policy (19) can be efficiently computed.
4.2 Policy gradient
Using policy (19), the stochastic policy gradient (7) is separable between the continuous and integer parts and reads as:
∇_θ J(π_θ) = E_{π_θ}[ ( ∇_θ log π_θ^u[u | s, i] + ∇_θ log π_θ^i[i | s] ) A^{π_θ}(s, a) ]  (20)
where A^{π_θ} is the advantage function associated to the stochastic policy (19). Using (15), we then observe that the score function associated to the integer part of the policy is simply given by:
∇_θ log π_θ^i[i | s] = −(1/σ_i) ( ∇_θ Φ^⋆(i, s, θ) − Σ_j π_θ^i[j | s] ∇_θ Φ^⋆(j, s, θ) )  (21)
The computation of the policy gradient associated to the continuous part of the stochastic policy ought to be treated differently. Indeed, it has been observed in (Gros and Zanon, 2020) that deterministic policy gradient methods are computationally more effective than stochastic ones for policy approximations on problems having continuous input and state spaces. Defining the deterministic policy for the continuous inputs as
π_θ^u(s, i) = u_0^⋆(i, s, θ, 0),  (22)
where u_0^⋆(i, s, θ, 0) is the first element of the solution of (17) for d = 0, we consider the approximation (Silver et al., 2014)
E_{π_θ}[ ∇_θ log π_θ^u[u | s, i] A^{π_θ}(s, a) ] ≈ E[ ∇_θ π_θ^u(s, i) ∇_u A^{π_θ}(s, a) |_{u = π_θ^u(s, i)} ]  (23)
which is asymptotically exact for Σ → 0 under some technical but fairly unrestrictive assumptions. We can then use the asymptotically exact hybrid policy gradient
∇_θ J(π_θ) ≈ E[ ∇_θ π_θ^u(s, i) ∇_u A^{π_θ}(s, a) |_{u = π_θ^u(s, i)} + ∇_θ log π_θ^i[i | s] A^{π_θ}(s, a) ]  (24)
as a computationally effective policy gradient evaluation. The stochastic policy (16) is then deployed on the system and generates exploration, while the deterministic policy (22) is used to compute the policy gradient (24). We propose next a compatible advantage function approximator for (24), offering a systematic approximation of the advantage function A^{π_θ}.
4.3 Compatible advantage function approximation
We note that the advantage function approximation
A_w(s, a) = ∇_θ log π_θ[a | s]^T w  (25)
is compatible by construction (Silver et al., 2014) for the stochastic policy gradient (20), in the sense that
E_{π_θ}[ ∇_θ log π_θ[a | s] A_w(s, a) ] = ∇_θ J(π_θ)  (26)
holds if w is the solution of the Least-Squares problem
w^⋆ = arg min_w E_{π_θ}[ ( A^{π_θ}(s, a) − A_w(s, a) )^2 ]  (27)
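In practice, the Least-Squares problem (27) with the compatible approximator (25) is a linear regression of observed advantage samples on the score features ψ(s, a) = ∇_θ log π_θ[a | s]. A minimal sketch with illustrative names:

```python
import numpy as np

def fit_compatible_advantage(score_features, advantages):
    """Least-squares fit A(s,a) ~ psi(s,a)^T w, with psi the score features
    (one row per observed transition) and advantages the sampled targets."""
    Psi = np.asarray(score_features, dtype=float)
    A = np.asarray(advantages, dtype=float)
    w, *_ = np.linalg.lstsq(Psi, A, rcond=None)
    return w
```

In a full implementation the targets would themselves come from a critic (e.g. TD estimates of the advantage) rather than being observed exactly.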
Similarly, we seek a compatible advantage function approximation for the hybrid policy gradient (24). We propose the hybrid advantage function approximation, inspired from (Gros and Zanon, 2020):
A_w(s, a) = ( Δu^T M ∇_θ π_θ^u(s, i)^T + ∇_θ log π_θ^i[i | s]^T ) w  (28)
where we label Δu = u − π_θ^u(s, i) the exploration performed on the continuous part of the input space, and M is symmetric and invertible. We will show in the following proposition that for w and M adequately chosen, the advantage function approximation (28) is compatible with the policy gradient (24).
Proposition 1
The hybrid function approximation (28) is asymptotically compatible, i.e.,
lim_{Σ→0} E[ ∇_θ π_θ^u(s, i) ∇_u A_w(s, a) |_{u = π_θ^u(s, i)} + ∇_θ log π_θ^i[i | s] A_w(s, a) ] = ∇_θ J(π_θ)  (29)
holds for w solution of (27) and for M chosen according to (Gros and Zanon, 2020):
M = ( ∂u_0^⋆(i, s, θ, d)/∂d )^{-1}  (30)
evaluated at the solution of (17) for d = 0, where (17) satisfies the regularity assumptions of (Gros and Zanon, 2020, Proposition 1). These assumptions are technical but fairly unrestrictive; see (Gros and Zanon, 2020) for a complete discussion.
The proof delivered below is a sketch that follows the lines of the proof of Proposition 1 in Gros and Zanon (2020). We observe that the solution w^⋆ of (27) using (28) is given by:
(31)
Using a Taylor expansion of A^{π_θ} at u = π_θ^u(s, i), as proposed in (Gros and Zanon, 2020, Proposition 1), we observe that (31) becomes:
(32)
where r is the second-order remainder of the Taylor expansion of A^{π_θ}. Unlike (31), all terms in (32) are evaluated at u = π_θ^u(s, i). Following a similar argument as in (Gros and Zanon, 2020, Proposition 1), we obtain
(33a)
(33b)
(33c)
Equality (33b) holds from the Delta method, while equalities (33a), (33c) hold because
(34)
(35)
result from (30); see (Gros and Zanon, 2020). Hence
(36)
5 NLP sensitivities
In order to deploy the policy gradient techniques described above, one needs to compute the sensitivities ∇_θ Φ^⋆ and ∇_θ π_θ^u. Computing the score function (21) requires computing the sensitivity of the optimal cost of the NLP (14). This sensitivity exists almost everywhere and is given by:
∇_θ Φ^⋆(i, s, θ) = ∇_θ L_θ(y^⋆, λ^⋆, μ^⋆)  (37)
where y^⋆ is the primal solution of the NLP (14), gathering the continuous inputs and states of the NLP, λ^⋆, μ^⋆ are the dual variables associated to constraints (13b)-(13c), respectively, and L_θ is the Lagrange function associated to (14). The computation of ∇_θ π_θ^u is more involved. Consider:
R_τ(z, s, θ) = ( ∇_y L_θ(y, λ, μ),  G_θ,  diag(μ) H_θ + τ1 ) = 0  (38)
i.e., the primal-dual interior-point KKT conditions associated to (14) for a barrier parameter τ, with z gathering the primal-dual variables of the NLP (14), i.e., z = (y, λ, μ). Then, if the solution of the NLP (14) satisfies LICQ and SOSC (Nocedal and Wright, 2006), the sensitivity of the solution of the NLP (14) exists almost everywhere and can be computed via the Implicit Function Theorem, providing
∂z/∂θ = −(∂R_τ/∂z)^{-1} ∂R_τ/∂θ  (39)
see (Büskens and Maurer, 2001). Using (22), the sensitivity ∇_θ π_θ^u then reads as
∇_θ π_θ^u(s, i) = ∂u_0^⋆/∂θ  (40)
where u_0^⋆ is extracted from z.
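The Implicit Function Theorem step (39) amounts to one linear solve per parameter direction. A minimal sketch, with the Jacobians ∂R_τ/∂z and ∂R_τ/∂θ supplied by the user:

```python
import numpy as np

def implicit_sensitivity(dR_dz, dR_dtheta):
    """Sensitivity of the solution z(theta) of R(z, theta) = 0 via the Implicit
    Function Theorem: dz/dtheta = -(dR/dz)^{-1} dR/dtheta, valid when dR/dz is
    invertible (e.g. under LICQ and SOSC at the NLP solution)."""
    return -np.linalg.solve(np.asarray(dR_dz, dtype=float),
                            np.asarray(dR_dtheta, dtype=float))
```

For instance, the scalar residual R(z, θ) = z − 2θ has ∂R/∂z = 1 and ∂R/∂θ = −2, so the sensitivity is dz/dθ = 2, matching the explicit solution z(θ) = 2θ.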
6 Simulated example
For the sake of brevity, and in order to present results that are easy to interpret and verify, we propose to use a very low-dimensional example, allowing us to bypass the evaluation of the action-value function via Temporal-Difference techniques, and to isolate the discussions of this paper from questions regarding TD methods. We consider the linear, scalar dynamics:
(41)
with a scalar state s, a scalar continuous input u, a binary input i, and a process noise uniformly distributed on a given interval. We consider the baseline stage cost:
(42)
as the reference performance, built from scalar weights and from references for the state and continuous input. The MPC model is deterministic, given by:
(43)
where the model parameter is constant over the prediction horizon, but subject to adaptation via RL. The baseline cost imposes a high penalty on the state leaving a prescribed interval, and constitutes an exact relaxation of the corresponding constraint, see (Gros and Zanon, 2019). The MPC stage cost has the form (42). The MPC weights and references are subject to adaptation via RL.
The policy gradient (29) was implemented, where the advantage function estimation was computed from (27), using the approximator from (28). The true advantage function was evaluated via classic policy evaluation (Sutton and Barto, 1998) in order to deliver unambiguous results; on more complex examples, (27) would be evaluated via Temporal-Difference techniques. The evaluations of (29) and (27) were performed in a batch fashion, using 30 batches of 50 time steps each, all starting from the same deterministic initial condition. The MPC scheme used a terminal cost based on the Riccati matrix of the corresponding unconstrained control problem. A constant discount factor and a constant step size were used for adapting the parameters from the policy gradient, and the exploration parameters σ_i and Σ were kept constant throughout the learning.
The baseline cost weights and references were also used to initialize the corresponding MPC scheme parameters. Fig. 1 reports the trajectories of the system at the beginning and at the end of the learning process, showing how performance is gained by bringing the state trajectories into the penalty-free interval. Fig. 2 reports the policy for the continuous and integer inputs, showing how RL reshapes the MPC policy for a better closed-loop performance. Fig. 3 reports the estimated policy gradients via the compatible approximation (29) and directly via (24), showing the match predicted by Proposition 1. Fig. 4 reports the closed-loop performance of the MPC controller, and shows the performance gain obtained via the learning. Fig. 5 shows the evolution of the MPC parameters through the learning process.
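The batch procedure described above can be organized as the following generic skeleton (illustrative Python; `rollout` and `grad_estimate` stand in for the MPC-based policy rollout and the hybrid gradient estimator (24), which are not reproduced here):

```python
import numpy as np

def batch_policy_gradient(theta0, rollout, grad_estimate,
                          n_batches=30, steps=50, alpha=0.1):
    """Generic batch policy-gradient loop: for each batch, roll the current
    policy out for a fixed number of steps, estimate the gradient from the
    collected data, and take a gradient step on the parameters."""
    theta = float(theta0)
    history = [theta]
    for _ in range(n_batches):
        data = rollout(theta, steps)              # collect one batch of transitions
        theta -= alpha * grad_estimate(theta, data)
        history.append(theta)
    return theta, history
```

With a well-behaved gradient estimate, the iterates converge to a (possibly local) minimizer of the closed-loop cost, mirroring the parameter evolution reported in Fig. 5.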
7 Discussion & Conclusion
This paper proposed an actor-critic approach to compute the policy gradient associated to policy approximations based on mixed-integer MPC schemes. The methodology is generic and applicable to linear, nonlinear and robust approaches. The paper proposes a hybrid stochastic-deterministic policy approach to generate the exploration and evaluate the policy gradient, avoiding the heavy computational expense associated with using a stochastic policy approach on problems having continuous inputs and state constraints. A simple, compatible advantage function approximation is then proposed, tailored to our formulation and to MPC-based policy approximations. Some implementation details are provided, and the method is illustrated on a simple example, providing a clear picture of how the proposed method performs.
Future work will consider extensions to reduce the noise in the policy gradient estimation resulting from the choice of advantage function approximation, and will investigate techniques to integrate the stochastic policy and sensitivity computations with the branch-and-bound techniques used to solve the mixed-integer MPC problem. Future work will also investigate the potential of the approaches detailed here to offer computationally less expensive ways of solving the mixed-integer problem.
References and Notes
 Abbeel et al. (2007) Abbeel, P., Coates, A., Quigley, M., and Ng, A.Y. (2007). An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems 19. MIT Press.
 Bertsekas (2007) Bertsekas, D. (2007). Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 3rd edition.
 Bertsekas (1995) Bertsekas, D. (1995). Dynamic Programming and Optimal Control, volume 1 and 2. Athena Scientific, Belmont, MA.
 Bertsekas and Shreve (1996) Bertsekas, D. and Shreve, S. (1996). Stochastic Optimal Control: The Discrete Time Case. Athena Scientific, Belmont, MA.
 Büskens and Maurer (2001) Büskens, C. and Maurer, H. (2001). Online Optimization of Large Scale Systems, chapter Sensitivity Analysis and Real-Time Optimization of Parametric Nonlinear Programming Problems, 3–16. Springer Berlin Heidelberg, Berlin, Heidelberg.
 Gros and Zanon (2019) Gros, S. and Zanon, M. (2019). Data-Driven Economic NMPC using Reinforcement Learning. IEEE Transactions on Automatic Control. (in press).
 Gros and Zanon (2020) Gros, S. and Zanon, M. (2020). Safe Reinforcement Learning Based on Robust MPC and Policy Gradient Methods. IEEE Transactions on Automatic Control (submitted).

 Garcia and Fernandez (2015) Garcia, J. and Fernandez, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16, 1437–1480.
 Nocedal and Wright (2006) Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, 2nd edition.
 Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, ICML’14, I–387–I–395.
 Sutton and Barto (1998) Sutton, R.S. and Barto, A.G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.
 Sutton et al. (1999) Sutton, R.S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, 1057–1063. MIT Press, Cambridge, MA, USA.
 Wang et al. (2012) Wang, S., Chaovalitwongse, W., and Babuska, R. (2012). Machine learning algorithms in bipedal robot control. Trans. Sys. Man Cyber Part C, 42(5), 728–743.
 Zanon and Gros (2019) Zanon, M. and Gros, S. (2019). Safe Reinforcement Learning Using Robust MPC. IEEE Transactions on Automatic Control (submitted). https://arxiv.org/abs/1906.04005.