Hill Climbing on Value Estimates for Search-control in Dyna

06/18/2019
by Yangchen Pan, et al.

Dyna is an architecture for model-based reinforcement learning (RL) in which simulated experience from a model is used to update policies or value functions. A key component of Dyna is search-control, the mechanism for generating the states and actions from which the agent queries the model; this mechanism remains largely unexplored. In this work, we propose to generate such states using the trajectory obtained by Hill Climbing (HC) on the current estimate of the value function. This has the effect of propagating value from high-value regions and of preemptively updating value estimates for the regions the agent is likely to visit next. We derive a noisy stochastic projected gradient ascent algorithm for hill climbing and highlight a connection to Langevin dynamics. We demonstrate empirically on four classical domains that our algorithm, HC-Dyna, can obtain significant improvements in sample efficiency. We study the properties of different sampling distributions for search-control, and find that there appears to be a benefit specifically to using samples generated by climbing on the current value estimates from low-value to high-value regions.
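The search-control idea in the abstract can be sketched in a few lines: take noisy gradient ascent steps on a learned value estimate, project each step back onto the state space, and collect the visited states for model queries. The sketch below is illustrative only, not the paper's implementation; the gradient function, step sizes, noise scale, and box-shaped state space are all assumptions.

```python
import numpy as np

def hill_climb_states(grad_v, s0, n_steps=20, step_size=0.1,
                      noise_scale=0.01, bounds=(-1.0, 1.0)):
    """Generate search-control states by noisy projected gradient
    ascent on a value estimate (a sketch, not the paper's code).

    grad_v: function mapping a state to the gradient of the value
            estimate at that state (assumed differentiable).
    Returns the trajectory of states visited while climbing from
    low-value toward high-value regions.
    """
    lo, hi = bounds
    s = np.asarray(s0, dtype=float)
    trajectory = []
    for _ in range(n_steps):
        # Ascent step on the value estimate plus Gaussian noise,
        # echoing the Langevin-dynamics connection in the abstract.
        s = s + step_size * grad_v(s) + noise_scale * np.random.randn(*s.shape)
        # Project back onto the (assumed) box-shaped state space.
        s = np.clip(s, lo, hi)
        trajectory.append(s.copy())
    return trajectory

# Toy example: a quadratic value estimate peaked at the origin, so
# climbing should drive the state from the corner toward the peak.
traj = hill_climb_states(lambda s: -2.0 * s, s0=np.array([0.9, -0.8]))
```

In a Dyna loop, the states in `traj` would be pushed into the search-control queue and used as starting points for model-simulated updates, alongside states drawn from real experience.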

