Distribution Estimation in Discounted MDPs via a Transformation
Although the general deterministic reward function in MDPs takes three arguments (current state, action, and next state), it is often simplified to a function of two arguments (current state and action). The former is called a transition-based reward function, whereas the latter is called a state-based reward function. When the objective is a function of the expected cumulative reward only, this simplification is harmless. However, when the objective is risk-sensitive, that is, when it depends on the reward distribution, the simplification leads to incorrect objective values. This paper studies estimation of the distribution of the cumulative discounted reward in infinite-horizon MDPs with finite state and action spaces. First, taking the Value-at-Risk (VaR) objective as an example, we illustrate and analyze the error that the simplification above induces in the reward distribution. Next, we propose a transformation for MDPs that preserves the reward distribution and converts transition-based reward functions into deterministic state-based reward functions; the transformation applies whether the transition-based reward function is deterministic or stochastic. Lastly, we show how to estimate the reward distribution after applying the proposed transformation in different settings, provided that the distribution is approximately normal.
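To make the first claim concrete, consider a toy example (the two-state chain, the plus/minus-one rewards, and the lower-quantile convention for VaR are our own illustrative choices, not taken from the paper): a transition-based reward of +1 or -1 depending on the next state has expectation zero, so the simplified state-based reward is identically zero. The expected return is unchanged, but the return distribution, and hence VaR, is not. A minimal Python sketch (scipy is used only for the normal quantile):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
gamma, horizon, n_ep, alpha = 0.9, 200, 100_000, 0.05

# Transition-based reward: r(s, a, s') = +1 if s' == 0, else -1, with the
# next state equal to 0 or 1 with probability 1/2 each; the simplified
# state-based reward is therefore the expectation r(s, a) = 0.
disc = gamma ** np.arange(horizon)
rewards = np.where(rng.integers(0, 2, size=(n_ep, horizon)) == 0, 1.0, -1.0)
g_transition = rewards @ disc     # discounted returns under the true reward
g_simplified = np.zeros(n_ep)     # under the simplification, every return is 0

for name, g in [("transition-based", g_transition), ("simplified", g_simplified)]:
    print(f"{name:16s} mean {g.mean():+.4f}  empirical VaR_5% {np.quantile(g, alpha):+.4f}")

# Moment-based VaR estimate under a normal approximation of the return, in the
# spirit of the paper's final step (this is the standard normal-quantile
# estimate from the first two moments, not necessarily the paper's estimator).
mu, sigma = g_transition.mean(), g_transition.std()
print(f"normal-approx VaR_5%: {mu + sigma * norm.ppf(alpha):+.4f}")

Both reward models report the same mean, but the simplified model yields VaR_5% equal to zero while the true return has a strictly negative VaR_5%, close to the moment-based normal estimate printed on the last line.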
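The abstract does not spell out the transformation itself; the sketch below uses a standard construction with the same goal, and should be read as our own illustration under stated assumptions rather than the paper's definition. It augments the state with the most recent transition, so that the augmented state x = (s, a, s') determines the reward deterministically; the simulate helper, the randomly generated tables P and R, and all sizes are hypothetical.

import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n_s, n_a = 2, 2
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] (hypothetical MDP)
R = rng.normal(size=(n_s, n_a, n_s))               # transition-based reward r(s, a, s')
gamma, horizon, n_ep = 0.9, 60, 20_000

# Augmentation: the new state x = (s, a, s') records the last transition,
# so R_aug is a deterministic state-based function of x alone.
aug_states = list(product(range(n_s), range(n_a), range(n_s)))
idx = {x: i for i, x in enumerate(aug_states)}
R_aug = np.array([R[s, a, s2] for (s, a, s2) in aug_states])
# From x = (s, a, s'), choosing a' leads to (s', a', s'') with prob P[s', a', s''].
P_aug = np.zeros((len(aug_states), n_a, len(aug_states)))
for (s, a, s2), i in idx.items():
    for a2 in range(n_a):
        for s3 in range(n_s):
            P_aug[i, a2, idx[(s2, a2, s3)]] = P[s2, a2, s3]

def simulate(P, reward_fn, s0, gamma, horizon, n_ep, rng):
    # Monte Carlo discounted returns under a uniformly random policy.
    s = np.full(n_ep, s0)
    G = np.zeros(n_ep)
    for t in range(horizon):
        a = rng.integers(0, P.shape[1], size=n_ep)
        cdf = np.cumsum(P[s, a], axis=1)                  # per-episode CDF over s'
        s_next = (rng.random(n_ep)[:, None] > cdf).sum(axis=1)
        s_next = np.minimum(s_next, P.shape[2] - 1)       # guard float round-off
        G += gamma ** t * reward_fn(s, a, s_next)
        s = s_next
    return G

# Original chain: the reward depends on the full transition (s, a, s').
g_orig = simulate(P, lambda s, a, s2: R[s, a, s2], 0, gamma, horizon, n_ep, rng)
# Augmented chain: the reward depends on the entered augmented state only.
# Any initial x whose last component equals s0 = 0 works, e.g. (0, 0, 0).
g_aug = simulate(P_aug, lambda x, a, x2: R_aug[x2], idx[(0, 0, 0)],
                 gamma, horizon, n_ep, rng)
for q in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"q={q:.2f}: original {np.quantile(g_orig, q):+.3f}"
          f"  augmented {np.quantile(g_aug, q):+.3f}")

Up to Monte Carlo noise, the two columns of quantiles agree, which is the distribution-preservation property the abstract asks of a valid transformation; the price of this particular construction is the enlarged state space S x A x S.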