
Maximizing the Total Reward via Reward Tweaking

by Chen Tessler et al.

In reinforcement learning, the discount factor γ controls the agent's effective planning horizon. Traditionally, this parameter was considered part of the MDP; however, as deep reinforcement learning algorithms tend to become unstable when the effective planning horizon is long, recent works refer to γ as a hyper-parameter. In this work, we focus on the finite-horizon setting and introduce reward tweaking. Reward tweaking learns a surrogate reward function r̃ for the discounted setting, which induces an optimal (undiscounted) return in the original finite-horizon task. Theoretically, we show that there exists a surrogate reward which leads to optimality in the original task and discuss the robustness of our approach. Additionally, we perform experiments in a high-dimensional continuous control task and show that reward tweaking guides the agent towards better long-horizon returns when it plans for short horizons using the tweaked reward.
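The core idea above can be illustrated with a tiny hand-built example (hypothetical, not the paper's algorithm: in the paper r̃ is learned, and the rollouts and reward values below are invented for illustration). With a short discount, an agent optimizing the original reward prefers a small immediate payoff over a larger delayed one; a surrogate reward that front-loads the delayed payoff makes discounted optimization recover the choice that is optimal under the undiscounted finite-horizon objective:

```python
# Toy deterministic finite-horizon illustration of reward tweaking.
# Two fixed rollouts of length H = 6; gamma is deliberately small so the
# effective planning horizon is much shorter than the episode.

gamma = 0.5  # short effective planning horizon

# Original per-step rewards (invented for this sketch):
# "patient" pays 10 at the last step, "myopic" pays 1 immediately.
rollouts = {
    "patient": [0, 0, 0, 0, 0, 10],
    "myopic":  [1, 0, 0, 0, 0, 0],
}

def ret(rewards, gamma=1.0):
    """Discounted return sum_t gamma^t * r_t (undiscounted when gamma=1)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# The true (undiscounted) objective prefers the patient rollout...
assert ret(rollouts["patient"]) > ret(rollouts["myopic"])        # 10 > 1
# ...but the gamma-discounted objective prefers the myopic one,
# since 10 * 0.5**5 = 0.3125 < 1.
assert ret(rollouts["patient"], gamma) < ret(rollouts["myopic"], gamma)

# A hand-crafted surrogate reward r~ that moves the delayed payoff to the
# first step. (Reward tweaking *learns* such a surrogate; this fixed choice
# only illustrates that one exists.)
tweaked = {
    "patient": [10, 0, 0, 0, 0, 0],
    "myopic":  [1, 0, 0, 0, 0, 0],
}
# Discounted optimization on r~ now recovers the undiscounted-optimal choice.
assert ret(tweaked["patient"], gamma) > ret(tweaked["myopic"], gamma)
```

This mirrors the paper's claim that a surrogate reward exists whose discounted optimum coincides with the undiscounted optimum of the original task, even when the agent plans over a short effective horizon.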
