
Integral Policy Iterations for Reinforcement Learning Problems in Continuous Time and Space

by Jae Young Lee et al.
University of Alberta

Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making problem, e.g., a reinforcement learning (RL) or optimal control problem, and has served as a foundation for developing RL methods. Motivated by integral PI (IPI) schemes in optimal control and by RL methods in continuous time and space (CTS), this paper proposes on-policy IPI to solve the general RL problem in CTS, with the environment modeled by an ordinary differential equation (ODE). In this continuous domain, we also propose four off-policy IPI methods---two are ideal PI forms that use advantage and Q-functions, respectively, and the other two are natural extensions of existing off-policy IPI schemes to our general RL framework. Compared to the IPI methods in optimal control, the proposed IPI schemes apply to more general situations and do not require an initial stabilizing policy to run; they are also strongly relevant to RL algorithms in CTS such as advantage updating, Q-learning, and value-gradient-based (VGB) greedy policy improvement. Our on-policy IPI is basically model-based but can be made partially model-free; each off-policy method is likewise either partially or completely model-free. The mathematical properties of the IPI methods---admissibility, monotone improvement, and convergence towards the optimal solution---are all rigorously proven, together with the equivalence of on- and off-policy IPI. Finally, the IPI methods are simulated with an inverted-pendulum model to support the theory and verify the performance.
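To illustrate the evaluation-improvement loop the abstract describes, the following is a minimal, hedged sketch of classic tabular policy iteration on a toy discrete-time MDP. This is the discrete foundation the paper generalizes, not the paper's integral PI itself (which operates in continuous time and space via an ODE model); the transition and reward arrays here are made-up illustrative values.

```python
import numpy as np

# Toy 2-state, 2-action MDP.
# P[s, a, s'] = transition probability; R[s, a] = expected reward.
gamma = 0.9
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0 under a0, a1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1 under a0, a1
])
R = np.array([
    [1.0, 0.0],                 # rewards in state 0 for a0, a1
    [0.0, 2.0],                 # rewards in state 1 for a0, a1
])

def evaluate(policy):
    """Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly."""
    n = P.shape[0]
    P_pi = P[np.arange(n), policy]    # (n, n) transitions under the policy
    r_pi = R[np.arange(n), policy]    # (n,) rewards under the policy
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def improve(v):
    """Greedy policy improvement: argmax over actions of Q(s, a)."""
    q = R + gamma * P @ v             # q[s, a] = R[s, a] + gamma * E[v(s')]
    return q.argmax(axis=1)

policy = np.zeros(2, dtype=int)       # start from an arbitrary policy
while True:
    v = evaluate(policy)
    new_policy = improve(v)
    if np.array_equal(new_policy, policy):
        break                         # policy is stable, hence optimal
    policy = new_policy
```

The loop alternates exact evaluation with greedy improvement until the policy stops changing; on a finite MDP this terminates at the optimal policy. The paper's IPI schemes replace the discrete Bellman evaluation step with integral equations along system trajectories.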

