Reinforcement Learning (RL) is a powerful tool for tackling stochastic processes without depending on a detailed model of the probability distributions underlying the state transitions. Indeed, most RL methods rely purely on observed data and on realizations of the stage cost assessing the system performance. RL methods seek to increase the closed-loop performance of the control policy deployed on the system as observations are collected. RL has drawn increasing attention thanks to its accomplishments, such as making it possible for robots to learn to walk or fly from experiments (Wang et al., 2012; Abbeel et al., 2007).
Most RL methods are based on learning the optimal control policy for the real system either directly or indirectly. Indirect methods typically rely on learning a good approximation of the optimal action-value function underlying the system. The optimal policy is then indirectly obtained as the minimizer of the value-function approximation over the inputs. Direct RL methods, if based on the policy gradient, seek to adjust the parameters of a given policy such that it yields the best closed-loop performance when deployed on the real system. An attractive advantage of direct RL methods over indirect ones is that they are based on formal necessary conditions of optimality for the closed-loop performance of the policy, and therefore guarantee, for a large enough data set, the (possibly local) asymptotic optimality of the parameters (Sutton et al., 1999; Silver et al., 2014).
RL methods often rely on Deep Neural Networks (DNN) to carry the policy approximation. Unfortunately, control policies based on DNNs provide limited opportunities for formal verification of the resulting policy, and for imposing hard constraints on the evolution of the state of the real system. The development of safe RL methods, which aim at tackling this issue, is currently an open field of research (J. Garcia, 2013). A novel approach towards providing formal safety certificates in the context of RL has recently been proposed in (Gros and Zanon, 2019, 2020; Zanon and Gros, 2019), where the policy approximation is based on robust Model Predictive Control (MPC) schemes rather than unstructured function approximators like DNNs. The validity of this choice is discussed in detail in (Gros and Zanon, 2019). In (Gros and Zanon, 2020), methodologies to deploy direct RL techniques on MPC-based policy approximations are proposed. These methodologies are, however, restricted to continuous input spaces and therefore exclude integer decision variables, which are central in a number of applications.
In this paper, we propose an extension of the policy gradient techniques proposed in (Gros and Zanon, 2020) to mixed-integer problems. A mixed-integer MPC is used as a policy approximation, and a policy gradient method adjusts the MPC parameters for closed-loop performance. We detail how the actor-critic method can be deployed in this specific context. In particular, we propose an asymptotically exact hybrid stochastic-deterministic policy approach allowing for computing the policy gradient at a lower computational complexity than a full stochastic approach. We then propose a hybrid compatible advantage-function approximator tailored to our formulation. We finally detail how the mixed-integer MPC can be differentiated at a low computational cost, using principles from parametric Nonlinear Programming, in order to implement the actor-critic method. The proposed method is illustrated on a simple example, allowing for an unambiguous presentation of the results.
The paper is structured as follows. Section 2 provides background material on MDPs and RL. Section 3 presents the construction of a mixed-integer stochastic policy using a mixed-integer MPC scheme to support the policy approximation. Section 4 details an actor-critic method tailored to the proposed formulation, and how the policy gradient can be estimated. A compatible advantage function approximation is proposed. Section 5 details how the mixed-integer MPC scheme can be efficiently differentiated. Section 6 proposes an illustrative example, and Section 7 provides some discussions.
In the following, we will consider that the dynamics of the real system are described as a stochastic process on (possibly) continuous state-input spaces. We will furthermore consider (possibly) stochastic policies, taking the form of probability densities:
denoting the probability density of selecting a given input when the system is in a given state. Deterministic policies delivering the input as a function of the state will be labelled as:
Any deterministic policy can be viewed as a stochastic one, having a Dirac function as a probability density (or unit function for discrete inputs), i.e.,
We consider a stage cost function and a discount factor. The performance of a policy is assessed via the total expected cost:
The optimal policy associated to the state transition, the stage cost and the discount factor is deterministic and given by:
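As an illustration of the performance measure above, the expected discounted cost of a given policy can be estimated by averaging over simulated closed-loop trajectories. The sketch below is a minimal example under stated assumptions: the scalar dynamics, the quadratic stage cost, the linear feedback policy and all function names are hypothetical and not taken from this paper, and the infinite sum is truncated at a finite horizon.

```python
import random


def discounted_cost(stage_costs, gamma):
    """Return sum_k gamma**k * stage_costs[k] for one trajectory."""
    total = 0.0
    for k, cost in enumerate(stage_costs):
        total += (gamma ** k) * cost
    return total


def estimate_performance(policy, step, stage_cost, s0, gamma,
                         horizon=200, n_rollouts=100, seed=0):
    """Monte Carlo estimate of the expected discounted cost of `policy`."""
    rng = random.Random(seed)
    returns = []
    for _ in range(n_rollouts):
        s, costs = s0, []
        for _ in range(horizon):
            a = policy(s, rng)              # input selected by the policy
            costs.append(stage_cost(s, a))  # realization of the stage cost
            s = step(s, a, rng)             # stochastic state transition
        returns.append(discounted_cost(costs, gamma))
    return sum(returns) / len(returns)


# Hypothetical toy problem (illustration only, not the paper's example).
def toy_step(s, a, rng):
    return 0.9 * s + a + rng.uniform(-0.1, 0.1)


def toy_cost(s, a):
    return s ** 2 + 0.1 * a ** 2


def toy_policy(s, rng):
    return -0.5 * s  # deterministic linear feedback


J_hat = estimate_performance(toy_policy, toy_step, toy_cost, s0=1.0, gamma=0.9)
```

Only sampled transitions and stage-cost realizations enter the estimate, so no model of the transition probabilities is needed beyond the ability to simulate or observe the system.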
2.1 Stochastic policy gradient
In most cases, the optimal policy cannot be computed, either because the system is not exactly known or because solving (5) is too expensive. It is then useful to consider approximations of the optimal policy, parametrized by a vector of parameters. The optimal parameters are then given by:
The policy gradient associated to the stochastic policy is then instrumental in finding the optimal parameters by taking gradient steps in the parameter space. The policy gradient can be obtained using various actor-critic methods (Sutton and Barto, 1998; Sutton et al., 1999). In this paper, we will use the actor-critic formulation:
for stochastic policies, and the actor-critic formulation:
for deterministic policies.
The value functions associated to a given policy are typically evaluated via Temporal-Difference (TD) techniques (Sutton and Barto, 1998), and require that a certain amount of exploration is included in the deployment of the policy. For deterministic policies, the exploration can, e.g., be generated by including stochastic perturbations over the policy, while stochastic policies generate exploration by construction. Note that it is fairly common in RL to define the stochastic policy as an arbitrary density, e.g., the normal distribution, centered at a deterministic policy. We shall observe here that the deterministic policy gradient (8) is not suited as such for integer inputs, as the gradients do not exist on discrete input spaces. On continuous input spaces, the choice between the deterministic approach (8) or the stochastic approach (7) is typically motivated by computational aspects.
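The construction of a stochastic policy as a normal density centered at a deterministic policy can be sketched as follows. This is a minimal illustration, assuming a scalar input, a hypothetical linear feedback as the deterministic policy, and the usual score-function form of the stochastic policy gradient, in which the gradient is the expectation of the score multiplied by the advantage. All names are illustrative, and the advantage values are assumed to be supplied by a separate critic.

```python
import math
import random


def mean_policy(theta, s):
    """Hypothetical deterministic policy: linear state feedback."""
    return theta * s


def log_density(theta, s, a, sigma):
    """Log of a normal density centered at the deterministic policy."""
    err = a - mean_policy(theta, s)
    return -0.5 * (err / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def score(theta, s, a, sigma):
    """Gradient of log_density with respect to theta (the score function)."""
    return (a - mean_policy(theta, s)) * s / sigma ** 2


def sample_input(theta, s, sigma, rng):
    """Draw an exploratory input around the deterministic policy."""
    return rng.gauss(mean_policy(theta, s), sigma)


def stochastic_gradient_estimate(theta, sigma, samples):
    """Average of score * advantage over (state, input, advantage) samples."""
    terms = [score(theta, s, a, sigma) * adv for s, a, adv in samples]
    return sum(terms) / len(terms)
```

Such a density has no counterpart for integer inputs, since a perturbed integer input would in general leave the integer grid.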
3 Mixed-integer Optimization-based policy
In this paper, we will consider parametrized deterministic policies based on parametric optimization problems. In particular, we will focus on optimization problems resulting from a nominal mixed-integer MPC formulation. The results proposed in this paper extend to robust MPC, enabling the construction of safe Reinforcement Learning methods, but this case is omitted here for the sake of brevity.
3.1 Policy approximation based on mixed-integer MPC
The mixed-integer MPC scheme reads as:
where are the predicted system trajectories, the planned continuous inputs and the planned integer inputs. Without loss of generality, we consider binary integer inputs. Functions , are the stage and terminal costs. Functions are the stage constraints and function is the terminal constraint.
For a given state and parameters , the MPC scheme (9) delivers the continuous and integer input profiles
with and . The MPC scheme (9) generates a parametrized deterministic policy
where function gathers the stage and terminal cost functions from (9a), function gathers the dynamic constraints and initial conditions (9b), and function gathers the stage and terminal constraints (9c)-(9d).
4 Actor-critic method
In order to build actor-critic methods for (11), exploration is required (Sutton and Barto, 1998). When the input space is constrained and mixed-integer, the exploration becomes non-trivial to set up, as (i) it must retain the feasibility of the hard constraints (9c)-(9d) and (ii) simple input disturbances are not possible for the integer part, since the integer inputs are locked on an integer grid. To address this issue, we will adopt a stochastic policy approach, well suited for the integer part, and consider its asymptotically equivalent deterministic counterpart on the continuous input space, well suited for computational efficiency.
4.1 MPC-based exploration
In order to generate exploration, we will build a stochastic policy (1) based on the deterministic policy (11) where will gather the continuous inputs and integer inputs actually applied to the real system, i.e., . We will build (1) such that it generates exploration that is respecting the constraints (9c)-(9d) with unitary probability. We propose to build (1) such that it becomes naturally separable between the integer and continuous part in the policy gradient computation. To that end, we consider a softmax approach to handle the integer part of the problem. More specifically, we consider the parametric mixed-integer NLP:
derived from (13), where the first integer input is assigned to via constraint (14d). We will consider that takes infinite value when the selected integer input is infeasible. Let us label the feasible set of for a given state and MPC parameter , and the integer profile solution of (14). By construction when . We then define the softmax stochastic integer policy distribution using
where a scalar parameter adjusts the variance of the integer policy distribution. In order to build the continuous part of the policy, we will consider the continuous part of the stochastic policy as conditioned on the integer input, and taking the form of a probability density:
which will be constructed from the parametric NLP:
derived from (13), but where the integer input profile is entirely assigned, and where is a random vector chosen as . The random variable in (16) will then be selected as:
As previously observed in (Gros and Zanon, 2020), while the density (16) is easy to sample from, it is in general difficult to evaluate.
Because is conditioned on and, therefore, , the Kolmogorov definition of conditional probabilities entails that the overall stochastic policy (1) reads as the distribution:
We establish next a straightforward but useful result concerning the stochastic policy (19).
Because when , policy (15) selects feasible integer inputs with probability 1. Furthermore, NLP (17) is feasible for all and all , such that its solution satisfies constraints (13b)-(13c). As a result, the samples generated from (19) are guaranteed to be feasible. The policy gradient associated to (19) can be computed using (7). Unfortunately, it has been observed that this approach is computationally expensive for continuous input spaces (Gros and Zanon, 2020) when the policy is restricted by non-trivial constraints. Hence, we now turn to detailing how the policy gradient associated to policy (19) can be efficiently computed.
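The softmax integer policy (15) and its feasibility property can be sketched as follows. This is a minimal illustration, assuming that the probability of each integer input is proportional to the exponential of its negated optimal cost divided by the variance parameter, and that the optimal costs of NLP (14) for each candidate first integer input have already been computed by a solver. The cost values and function names are hypothetical.

```python
import math
import random


def softmax_integer_policy(costs, sigma):
    """Map {integer input: optimal cost} to probabilities.

    Infeasible inputs carry cost math.inf and receive probability 0.
    Assumes probabilities proportional to exp(-cost / sigma).
    """
    finite = [c for c in costs.values() if math.isfinite(c)]
    if not finite:
        raise ValueError("no feasible integer input")
    c_min = min(finite)  # shift the costs for numerical stability
    weights = {}
    for i, c in costs.items():
        weights[i] = math.exp(-(c - c_min) / sigma) if math.isfinite(c) else 0.0
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def sample_integer_input(probs, rng):
    """Draw one integer input according to `probs`."""
    inputs = list(probs)
    return rng.choices(inputs, weights=[probs[i] for i in inputs], k=1)[0]


# Hypothetical costs for the first binary input i in {0, 1}.
costs = {0: 1.2, 1: math.inf}  # i = 1 is infeasible in this state
probs = softmax_integer_policy(costs, sigma=0.5)
i0 = sample_integer_input(probs, random.Random(0))
```

Because infeasible integer inputs receive zero weight, every sample drawn this way is feasible. The continuous input would then be obtained by solving the perturbed NLP (17) for the sampled integer input, which is not shown here.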
4.2 Policy gradient
Using policy (19), the stochastic policy gradient is separable between the continuous and integer part and reads as:
The computation of the policy gradient associated to the continuous part of the stochastic policy ought to be treated differently. Indeed, it has been observed in (Gros and Zanon, 2020) that deterministic policy gradient methods are computationally more effective than stochastic ones for policy approximations on problems having continuous input and state spaces. Defining the deterministic policy for the continuous inputs as
which is asymptotically exact for under some technical but fairly unrestrictive assumptions. We can then use the asymptotically exact hybrid policy gradient
as a computationally effective policy gradient evaluation. The stochastic policy (16) is then deployed on the system and generates exploration, while the deterministic policy (22) is used to compute the policy gradient (24). We propose next a compatible advantage function approximator for (24), offering a systematic approximation of the advantage function .
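The score function of the integer part of the policy can be illustrated on a toy case. The sketch below assumes a softmax policy with probabilities proportional to the exponential of the negated cost divided by the variance parameter, restricted to feasible integer inputs, and a scalar policy parameter. The quadratic costs are hypothetical stand-ins for the optimal costs of NLP (14), whose parameter sensitivities would in practice come from the NLP solution.

```python
import math


def softmax_probs(costs, sigma):
    """Probabilities proportional to exp(-cost / sigma) (assumed form)."""
    c_min = min(costs)  # shift the costs for numerical stability
    weights = [math.exp(-(c - c_min) / sigma) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_score(costs, cost_grads, sigma, i):
    """d/dtheta of log pi(i), given each cost and its theta-derivative."""
    probs = softmax_probs(costs, sigma)
    mean_grad = sum(p * g for p, g in zip(probs, cost_grads))
    return -(cost_grads[i] - mean_grad) / sigma


# Hypothetical scalar-parameter costs for the binary input i in {0, 1}.
def toy_costs(theta):
    return [(theta - 1.0) ** 2, (theta + 0.5) ** 2]


def toy_cost_grads(theta):
    return [2.0 * (theta - 1.0), 2.0 * (theta + 0.5)]


theta, sigma = 0.2, 0.5
score_i1 = softmax_score(toy_costs(theta), toy_cost_grads(theta), sigma, i=1)
```

The score is the negated difference between the cost sensitivity of the selected integer input and the policy-weighted average sensitivity, scaled by the variance parameter, so only cost sensitivities are needed for the integer part.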
4.3 Compatible advantage function approximation
We note that the advantage function approximation
holds if is the solution of the Least-Squares problem
where we label the exploration performed on the continuous part of the input space , and is symmetric and . We will show in the following proposition that for and adequately chosen, the advantage function approximation (28) is compatible with the policy gradient (24).
The hybrid function approximation (28) is asymptotically compatible, i.e.,
evaluated at the solution of (17) for , where (17) satisfies the regularity assumptions of (Gros and Zanon, 2020, Proposition 1). These assumptions are technical but fairly unrestrictive, see (Gros and Zanon, 2020) for a complete discussion.
where is the second-order remainder of the Taylor expansion of . Unlike (31), all terms in (4.3) are evaluated at . Following a similar argumentation as in (Gros and Zanon, 2020, Proposition 1), we obtain
5 NLP sensitivities
In order to deploy the policy gradient techniques described above, one needs to compute the sensitivities and . Computing the score function (21) requires computing the sensitivity of the cost function of the NLP (14). This sensitivity exists almost everywhere and is given by:
where is the primal solution of the NLP (14), gathering the continuous inputs and states of the NLP, and the dual variables associated to constraints (13b)-(13c), respectively, and is the Lagrange function associated to (14). The computation of is more involved. Consider:
i.e., the primal-dual interior-point KKT conditions associated to (14) for a barrier parameter , and gathering the primal-dual variables of the NLP (14), i.e., . Then, if the solution of the NLP (14) satisfies LICQ and SOSC (Nocedal and Wright, 2006), the sensitivity of the solution of the NLP (14) exists almost everywhere and can be computed via the Implicit Function Theorem, providing
where is extracted from .
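The two sensitivity computations above can be illustrated on a toy problem. The sketch below is a minimal example under stated assumptions: it uses a small equality-constrained quadratic program rather than the MPC scheme, so there are no inequality constraints and no barrier term, and the sign convention of the Lagrange function is an assumption. The problem and all names are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


# Toy problem: min 0.5*(x1^2 + x2^2)  s.t.  x1 + x2 - theta = 0.
# KKT residual r(z, theta) with z = (x1, x2, lam):
#   r = [x1 + lam, x2 + lam, x1 + x2 - theta]
DR_DZ = [[1.0, 0.0, 1.0],
         [0.0, 1.0, 1.0],
         [1.0, 1.0, 0.0]]
DR_DTHETA = [0.0, 0.0, -1.0]


def kkt_solution(theta):
    """Solve the (linear) KKT system for z = (x1, x2, lam)."""
    return solve_linear(DR_DZ, [0.0, 0.0, theta])


def solution_sensitivity():
    """dz/dtheta = -(dr/dz)^{-1} dr/dtheta (Implicit Function Theorem)."""
    return solve_linear(DR_DZ, [-v for v in DR_DTHETA])


def cost_sensitivity(theta):
    """dPhi/dtheta = dL/dtheta, with L = f + lam*(x1 + x2 - theta)."""
    _, _, lam = kkt_solution(theta)
    return -lam


dz_dtheta = solution_sensitivity()   # approx. [0.5, 0.5, -0.5]
dphi_dtheta = cost_sensitivity(2.0)  # approx. 1.0
```

The linear system for the solution sensitivity reuses the KKT matrix already factorized by the solver, which is why the sensitivities come at a low additional cost once the NLP is solved.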
6 Simulated example
For the sake of brevity and in order to present results that are easy to interpret and verify, we propose to use a very low dimensional example, allowing us to bypass the evaluation of the action-value function via Temporal-Difference techniques, and isolate the discussions of this paper from questions regarding TD methods. We consider the linear, scalar dynamics:
where , and
is uniformly distributed in. We consider the baseline stage cost:
as the reference performance, where are scalar weight and are references for the state and continuous input. The MPC model is deterministic, given by:
where is constant, but subject to adaptation via RL. The baseline cost imposes a high penalty for , and constitutes an exact relaxation of the constraint , see (Gros and Zanon, 2019). The MPC stage cost has the form (42). The MPC parameters , and are subject to adaptation via RL.
The policy gradient (29) was implemented, where the advantage function estimation was computed from (27), using the approximator from (28). The true advantage function was evaluated via classic policy evaluation (Sutton and Barto, 1998) in order to deliver unambiguous results. On more complex examples (27) would be evaluated via Temporal-Difference techniques. The evaluations of (29) and (27) were performed in a batch fashion, using 30 batches of 50 time steps each, all starting from the deterministic initial condition . The MPC scheme had a horizon of time samples, and a terminal cost based on the Riccati matrix of the control problem with . A discount factor of was adopted. The step-size selected for adapting the parameters from the policy gradient was . The exploration parameters were chosen as , .
The parameters , , were adopted for the baseline cost. The MPC scheme parameters were initialized using the same values, and using . Fig. 1 reports the trajectories of the system at the beginning and end of the learning process, showing how performance is gained by bringing the state trajectories in the interval . Fig. 2 reports the policy for the continuous and integer inputs, showing how RL reshapes the MPC policy for a better closed-loop performance. Fig. 3 reports the estimated policy gradients via the compatible approximation (29) and directly via (24), showing a match predicted by Prop. 1. Fig. 4 reports the closed-loop performance of the MPC controller, calculated from , and shows the performance gain obtained via the learning. Fig. 5 shows the MPC parameter evolution through the learning process.
7 Discussion & Conclusion
This paper proposed an actor-critic approach to compute the policy gradient associated to policy approximations based on mixed-integer MPC schemes. The methodology is generic and applicable to linear, nonlinear and robust approaches. The paper proposed a hybrid stochastic-deterministic policy approach to generate the exploration and evaluate the policy gradient, avoiding the heavy computational expenses associated to using a stochastic policy approach on problems having continuous inputs and state constraints. A simple, compatible advantage function approximation was then proposed, tailored to our formulation and to MPC-based policy approximations. Some implementation details were provided, and the methods were illustrated on a simple example, providing a clear picture of how the proposed method performs.
Future work will consider extensions to reduce the noise in the policy gradient estimation resulting from the choice of advantage function approximation, and will investigate techniques to integrate the stochastic policy and sensitivity computations with the branch-and-bound techniques used to solve the mixed-integer MPC problem. Future work will also investigate the potential of using the approaches detailed here to offer computationally less expensive approaches to solve the mixed-integer problem.
References and Notes
- Abbeel et al. (2007) Abbeel, P., Coates, A., Quigley, M., and Ng, A.Y. (2007). An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems 19. MIT Press.
- Bertsekas (2007) Bertsekas, D. (2007). Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 3rd edition.
- Bertsekas (1995) Bertsekas, D. (1995). Dynamic Programming and Optimal Control, volumes 1 and 2. Athena Scientific, Belmont, MA.
- Bertsekas and Shreve (1996) Bertsekas, D. and Shreve, S. (1996). Stochastic Optimal Control: The Discrete Time Case. Athena Scientific, Belmont, MA.
- Büskens and Maurer (2001) Büskens, C. and Maurer, H. (2001). Online Optimization of Large Scale Systems, chapter Sensitivity Analysis and Real-Time Optimization of Parametric Nonlinear Programming Problems, 3–16. Springer Berlin Heidelberg, Berlin, Heidelberg.
- Gros and Zanon (2019) Gros, S. and Zanon, M. (2019). Data-Driven Economic NMPC using Reinforcement Learning. IEEE Transactions on Automatic Control. (in press).
- Gros and Zanon (2020) Gros, S. and Zanon, M. (2020). Safe Reinforcement Learning Based on Robust MPC and Policy Gradient Methods. IEEE Transactions on Automatic Control (submitted).
- J. Garcia (2013) J. Garcia, J.F. (2013). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16, 1437–1480.
- Nocedal and Wright (2006) Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, 2nd edition.
- Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, ICML’14, I–387–I–395.
- Sutton and Barto (1998) Sutton, R.S. and Barto, A.G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.
- Sutton et al. (1999) Sutton, R.S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, 1057–1063. MIT Press, Cambridge, MA, USA.
- Wang et al. (2012) Wang, S., Chaovalitwongse, W., and Babuska, R. (2012). Machine learning algorithms in bipedal robot control. Trans. Sys. Man Cyber Part C, 42(5), 728–743.
- Zanon and Gros (2019) Zanon, M. and Gros, S. (2019). Safe Reinforcement Learning Using Robust MPC. IEEE Transactions on Automatic Control (submitted). https://arxiv.org/abs/1906.04005.