DeepAI AI Chat
Log In Sign Up

Offline Policy Optimization with Eligible Actions

by   Yao Liu, et al.

Offline policy optimization could have a large impact on many real-world decision-making problems, as online learning may be infeasible in many applications. Importance sampling and its variants are a commonly used type of estimator in offline policy evaluation, and such estimators typically do not require assumptions on the properties and representational capabilities of value function or decision process model function classes. In this paper, we identify an important overfitting phenomenon in optimizing the importance weighted return, in which it may be possible for the learned policy to essentially avoid making aligned decisions for part of the initial state space. We propose an algorithm to avoid this overfitting through a new per-state-neighborhood normalization constraint, and provide a theoretical justification of the proposed algorithm. We also show the limitations of previous attempts to this approach. We test our algorithm in a healthcare-inspired simulator, a logged dataset collected from real hospitals and continuous control tasks. These experiments show the proposed method yields less overfitting and better test performance compared to state-of-the-art batch reinforcement learning algorithms.


page 5

page 21


POPO: Pessimistic Offline Policy Optimization

Offline reinforcement learning (RL), also known as batch RL, aims to opt...

Overfitting and Optimization in Offline Policy Learning

We consider the task of policy learning from an offline dataset generate...

Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning

Offline Reinforcement Learning (RL) aims to learn policies from previous...

Offline Reinforcement Learning with Implicit Q-Learning

Offline reinforcement learning requires reconciling two conflicting aims...

Policy Optimization via Importance Sampling

Policy optimization is an effective reinforcement learning approach to s...

Deep Offline Reinforcement Learning for Real-World Treatment Optimization Applications

There is increasing interest in data-driven approaches for dynamically c...

Blind Decision Making: Reinforcement Learning with Delayed Observations

Reinforcement learning typically assumes that the state update from the ...