Deep Reinforcement Learning with Relative Entropy Stochastic Search
Many reinforcement learning methods for continuous control tasks update a policy function by maximizing an approximated action-value function, or Q-function. However, the Q-function also depends on the policy, and this dependency often leads to unstable policy learning. To overcome this issue, we propose a method that does not greedily exploit the Q-function. Specifically, we upper-bound the Kullback-Leibler divergence between the new and the old policy while maximizing the Q-function. Furthermore, we lower-bound the entropy of the new policy to maintain its exploratory behavior. We show that with a Gaussian policy and a Q-function that is quadratic in the actions, the corresponding constrained optimization problem can be solved in closed form. In addition, we show that our method can be regarded as a variant of the well-known deterministic policy gradient method. Through experiments using a neural network as a function approximator, we evaluate the proposed method and show that it learns more stably than the deep deterministic policy gradient method and the continuous Q-learning method.
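The closed-form update described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the update takes the MORE-style form in which, for an old Gaussian policy with precision F = Σ⁻¹ and a quadratic model Q(a) = -½ aᵀA a + aᵀb, the new natural parameters are convex-like combinations weighted by the Lagrange multipliers η (KL bound) and ω (entropy bound). In the full method the multipliers are obtained by solving a dual problem; here they are passed in as fixed values for illustration.

```python
import numpy as np

def gaussian_closed_form_update(mu, Sigma, A, b, eta, omega):
    """Sketch of a closed-form Gaussian policy update under a KL upper
    bound and an entropy lower bound, assuming a quadratic Q-model
    Q(a) = -0.5 a^T A a + a^T b (MORE-style form; an assumption,
    not the paper's exact implementation).

    eta, omega: Lagrange multipliers of the KL and entropy constraints.
    In the full method they come from solving the dual problem; here
    they are supplied directly for illustration.
    """
    F = np.linalg.inv(Sigma)      # precision (natural parameter) of old policy
    f = F @ mu                    # linear natural parameter of old policy
    # Interpolate old natural parameters with the quadratic model's terms.
    F_new = (eta * F + A) / (eta + omega)
    f_new = (eta * f + b) / (eta + omega)
    Sigma_new = np.linalg.inv(F_new)
    mu_new = Sigma_new @ f_new
    return mu_new, Sigma_new

# 1-D illustration: old policy N(0, 1), model Q(a) = -0.5 a^2 + a.
mu_new, Sigma_new = gaussian_closed_form_update(
    mu=np.zeros(1), Sigma=np.eye(1),
    A=np.eye(1), b=np.ones(1),
    eta=1.0, omega=0.0,
)
# The mean shifts toward the model's optimum (a = 1) while the
# covariance shrinks only partially, tempered by eta.
```

Because the update interpolates natural parameters rather than maximizing Q(a) outright, a large η keeps the new policy close to the old one, which is exactly how the KL bound prevents greedy exploitation of an inaccurate Q-function.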