
Q-Learning for Continuous Actions with Cross-Entropy Guided Policies
Off-policy reinforcement learning (RL) is an important class of methods ...

Entropic Policy Composition with Generalized Policy Improvement and Divergence Correction
Deep reinforcement learning (RL) algorithms have made great strides in r...

Actor-Expert: A Framework for Using Action-Value Methods in Continuous Action Spaces
Value-based approaches can be difficult to use in continuous action spac...

Learning to Represent Action Values as a Hypergraph on the Action Vertices
Action-value estimation is a critical component of many reinforcement le...

Randomized Policy Learning for Continuous State and Action MDPs
Deep reinforcement learning methods have achieved state-of-the-art resul...

Quantum reinforcement learning in continuous action space
Quantum mechanics has the potential to speed up machine learning algorit...

Action Robust Reinforcement Learning and Applications in Continuous Control
A policy is said to be robust if it maximizes the reward while consideri...
CAQL: Continuous Action Q-Learning
Value-based reinforcement learning (RL) methods like Q-learning have shown success in a variety of domains. One challenge in applying Q-learning to continuous-action RL problems, however, is the continuous action maximization (max-Q) required for optimal Bellman backup. In this work, we develop CAQL, a (class of) algorithm(s) for continuous-action Q-learning that can use several plug-and-play optimizers for the max-Q problem. Leveraging recent optimization results for deep neural networks, we show that max-Q can be solved optimally using mixed-integer programming (MIP). When the Q-function representation has sufficient power, MIP-based optimization gives rise to better policies and is more robust than approximate methods (e.g., gradient ascent, cross-entropy search). We further develop several techniques to accelerate inference in CAQL, which, despite their approximate nature, perform well. We compare CAQL with state-of-the-art RL algorithms on benchmark continuous-control problems that have different degrees of action constraints and show that CAQL outperforms policy-based methods in heavily constrained environments, often dramatically.
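To make the approximate max-Q optimizers the abstract contrasts with MIP more concrete, here is a minimal sketch of cross-entropy search over a box-constrained action space. This is an illustration only, not the paper's implementation: the function name `cross_entropy_max_q`, the hyperparameters, and the toy quadratic Q-function are all assumptions.

```python
import numpy as np

def cross_entropy_max_q(q_func, action_dim, low, high,
                        iters=20, pop_size=64, elite_frac=0.1, seed=0):
    """Approximate argmax_a Q(a) by cross-entropy search: sample actions
    from a Gaussian, keep the highest-scoring elites, refit the Gaussian
    to the elites, and repeat until the distribution concentrates."""
    rng = np.random.default_rng(seed)
    mu = (low + high) / 2.0            # start centered in the action box
    sigma = (high - low) / 2.0         # broad enough to cover the box
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop_size, action_dim))
        samples = np.clip(samples, low, high)          # respect action bounds
        scores = np.array([q_func(a) for a in samples])
        elites = samples[np.argsort(scores)[-n_elite:]]  # top-scoring actions
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-6              # avoid collapse to zero
    return mu

# Toy Q-function (an assumption for illustration) whose maximizer
# is known to be a = (0.3, -0.5):
target = np.array([0.3, -0.5])
q = lambda a: -np.sum((a - target) ** 2)
a_star = cross_entropy_max_q(q, action_dim=2,
                             low=np.array([-1.0, -1.0]),
                             high=np.array([1.0, 1.0]))
```

On a smooth Q-surface like this toy example the search converges close to the true maximizer, but as the abstract notes, such sampling-based optimizers offer no optimality guarantee, which is what motivates the MIP formulation when the Q-network admits one.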