q-Learning in Continuous Time

07/02/2022
by Yanwei Jia, et al.

We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation introduced by Wang et al. (2020). As the conventional (big) Q-function collapses in continuous time, we consider its first-order approximation and coin the term "(little) q-function". This function is related to the instantaneous advantage rate function as well as the Hamiltonian. We develop a "q-learning" theory around the q-function that is independent of time discretization. Given a stochastic policy, we jointly characterize the associated q-function and value function by martingale conditions of certain stochastic processes. We then apply the theory to devise different actor-critic algorithms for solving the underlying RL problems, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed explicitly. One of our algorithms interprets the well-known Q-learning algorithm SARSA, and another recovers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou (2021). Finally, we conduct simulation experiments to compare the performance of our algorithms with those of the PG-based algorithms in Jia and Zhou (2021) and time-discretized conventional Q-learning algorithms.
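To make the abstract's construction concrete: the little q-function is the first-order term in the small-time expansion of the conventional Q-function, roughly Q_Δt(t, x, a) ≈ J(t, x) + q(t, x, a)Δt, and the martingale characterization of (J, q) yields a TD-style error once trajectories are simulated on a grid. The sketch below illustrates one such on-policy update with a Gibbs (softmax) policy over a finite action grid; the feature maps, step sizes, toy dynamics, and the omission of the entropy term from the martingale increment are all simplifying assumptions made for illustration, not the paper's algorithm.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's code): a time-discretized
# on-policy actor-critic step for continuous-time q-learning. J and q are
# linear in hand-picked features; feat_J, feat_q, beta, gamma_temp, and dt
# are hypothetical choices for this toy example.

rng = np.random.default_rng(0)
dt = 0.01          # simulation step; the underlying theory is mesh-free
beta = 0.1         # discount rate
gamma_temp = 0.5   # entropy-regularization temperature
actions = np.linspace(-1.0, 1.0, 21)  # finite action grid for the Gibbs policy

def feat_J(t, x):              # features for the value function J(t, x)
    return np.array([1.0, t, x, x * x])

def feat_q(t, x, a):           # features for the little q-function q(t, x, a)
    return np.array([1.0, a, a * a, a * x, x, x * x])

theta = np.zeros(4)            # J(t, x) ~ theta . feat_J(t, x)
psi = np.zeros(6)              # q(t, x, a) ~ psi . feat_q(t, x, a)

def gibbs_policy(t, x):
    """Sample a ~ pi(.|x), where pi(a|x) is proportional to exp(q(t,x,a)/gamma_temp)."""
    logits = np.array([psi @ feat_q(t, x, a) for a in actions]) / gamma_temp
    p = np.exp(logits - logits.max())
    return rng.choice(actions, p=p / p.sum())

def q_learning_step(t, x, a, r, x_next, lr=0.05):
    """Semi-gradient update from a discretized martingale condition: the
    process exp(-beta*t) J(t, X_t) + integral of exp(-beta*s)(r_s - q_s) ds
    should be a martingale, so its one-step increment serves as the TD error.
    (The entropy term of the paper's full condition is omitted for brevity.)"""
    global theta, psi
    delta = (np.exp(-beta * dt) * (theta @ feat_J(t + dt, x_next))
             - theta @ feat_J(t, x)
             + (r - psi @ feat_q(t, x, a)) * dt)
    theta += lr * delta * feat_J(t, x)          # critic update for J
    psi += lr * delta * dt * feat_q(t, x, a)    # critic update for q
    # No separate actor update: the policy is the Gibbs measure of q itself.

# Toy usage: controlled diffusion dX = a dt + dW, running reward -(x^2 + a^2).
x, t = 0.0, 0.0
for _ in range(1000):
    a = gibbs_policy(t, x)
    x_next = x + a * dt + np.sqrt(dt) * rng.standard_normal()
    q_learning_step(t, x, a, -(x * x + a * a), x_next)
    x, t = x_next, t + dt
```

Because the policy is read off directly as the Gibbs measure of q, this sketch corresponds to the case where that density can be computed explicitly (here trivially, via a finite action grid); the paper's other algorithms handle the case where it cannot.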

Related research

11/22/2021 | Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms
We study policy gradient (PG) for reinforcement learning in continuous t...

01/28/2019 | Making Deep Q-learning methods robust to time discretization
Despite remarkable successes, Deep Reinforcement Learning (DRL) is not r...

06/28/2023 | Continuous-Time q-learning for McKean-Vlasov Control Problems
This paper studies the q-learning, recently coined as the continuous-tim...

11/24/2021 | A comment on stabilizing reinforcement learning
This is a short comment on the paper "Asymptotically Stable Adaptive-Opt...

08/15/2021 | Policy Evaluation and Temporal-Difference Learning in Continuous Time and Space: A Martingale Approach
We propose a unified framework to study policy evaluation (PE) and the a...

04/15/2021 | Predictor-Corrector (PC) Temporal Difference (TD) Learning (PCTD)
Using insight from numerical approximation of ODEs and the problem formu...

07/26/2019 | Large scale continuous-time mean-variance portfolio allocation via reinforcement learning
We propose to solve large scale Markowitz mean-variance (MV) portfolio a...
