Exploration and preference satisfaction trade-off in reward-free learning

by Noor Sajid, et al.

Biological agents have meaningful interactions with their environment despite the absence of a reward signal. In such instances, the agent can learn preferred modes of behaviour that lead to predictable states – necessary for survival. In this paper, we pursue the notion that this learnt behaviour can be a consequence of reward-free preference learning that ensures an appropriate trade-off between exploration and preference satisfaction. For this, we introduce a model-based Bayesian agent equipped with a preference learning mechanism (Pepper) using conjugate priors. These conjugate priors are used to augment the expected free energy planner for learning preferences over states (or outcomes) across time. Importantly, our approach enables the agent to learn preferences that encourage adaptive behaviour at test time. We illustrate this in the OpenAI Gym FrozenLake and the 3D mini-world environments – with and without volatility. Given a constant environment, these agents learn confident (i.e., precise) preferences and act to satisfy them. Conversely, in a volatile setting, perpetual preference uncertainty maintains exploratory behaviour. Our experiments suggest that learnable (reward-free) preferences entail a trade-off between exploration and preference satisfaction. Pepper offers a straightforward framework suitable for designing adaptive agents when reward functions cannot be predefined, as in real environments.
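
The role of conjugate priors in preference learning can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Dirichlet prior over a categorical preference distribution on four hypothetical states, and it stands in for "volatility" by simply cycling the visited state. The names `update_preferences`, `expected_preferences`, and `entropy` are invented for this illustration, and the expected free energy planner is omitted entirely.

```python
import numpy as np

def update_preferences(alpha, visited_state, lr=1.0):
    """Dirichlet-categorical conjugate update: add a pseudo-count
    for the state that was visited (assumed update rule)."""
    alpha = alpha.copy()
    alpha[visited_state] += lr
    return alpha

def expected_preferences(alpha):
    """Mean of the Dirichlet, read here as the learnt preference distribution."""
    return alpha / alpha.sum()

def entropy(p):
    """Shannon entropy in nats; lower means more precise preferences."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

n_states = 4                 # hypothetical state space
alpha0 = np.ones(n_states)   # flat prior: no initial preference

# Constant setting (assumption): the same state is visited every step.
alpha_const = alpha0.copy()
for _ in range(20):
    alpha_const = update_preferences(alpha_const, visited_state=2)

# Volatile setting (assumption): the visited state keeps changing.
alpha_vol = alpha0.copy()
for t in range(20):
    alpha_vol = update_preferences(alpha_vol, visited_state=t % n_states)

p_const = expected_preferences(alpha_const)
p_vol = expected_preferences(alpha_vol)

print("constant:", p_const, "entropy:", entropy(p_const))
print("volatile:", p_vol, "entropy:", entropy(p_vol))
```

Under these assumptions, repeated visits to one state concentrate the pseudo-counts and yield a low-entropy (precise) preference distribution, whereas evenly spread visits leave the distribution flat. This mirrors, in a very reduced form, the contrast between confident preferences in a constant environment and persistent preference uncertainty in a volatile one.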





Reward Uncertainty for Exploration in Preference-based Reinforcement Learning

Conveying complex objectives to reinforcement learning (RL) agents often...

Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning

In this paper we consider multi-objective reinforcement learning where t...

A mechanism to promote social behaviour in household load balancing

Reducing the peak energy consumption of households is essential for the ...

RL agents Implicitly Learning Human Preferences

In the real world, RL agents should be rewarded for fulfilling human pre...

Learning Personalized Thermal Preferences via Bayesian Active Learning with Unimodality Constraints

Thermal preferences vary from person to person and may change over time....

Deciding What to Learn: A Rate-Distortion Approach

Agents that learn to select optimal actions represent a prominent focus ...

Understanding the origin of information-seeking exploration in probabilistic objectives for control

The exploration-exploitation trade-off is central to the description of ...