BATS: Best Action Trajectory Stitching

04/26/2022
by Ian Char, et al.

Offline reinforcement learning is the problem of learning a good policy from a fixed log of environment interactions. Past efforts in this area have largely adapted online reinforcement learning algorithms by adding constraints that keep the learned policy's actions close to the logged data. In this work, we explore an alternative approach: planning on the fixed dataset directly. Specifically, we introduce an algorithm that forms a tabular Markov Decision Process (MDP) over the logged data by adding new transitions to the dataset. The new transitions are found by using learned dynamics models to plan short trajectories between states. Since exact value iteration can be performed on the constructed MDP, it is easy to identify which trajectories are advantageous to add. Crucially, since most transitions in this MDP come from the logged data, trajectories from the MDP can be rolled out for long periods with confidence. We prove that this property allows one to derive upper and lower bounds on the value function, up to appropriate distance metrics. Finally, we demonstrate empirically that algorithms which uniformly constrain the learned policy to the entire dataset can exhibit unwanted behavior, and we show an example in which simply behavior cloning the optimal policy of the MDP created by our algorithm avoids this problem.
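
To make the stitching idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a deterministic tabular MDP stored as a plain dictionary, and the stitched transition is specified by hand, whereas the abstract describes proposing such transitions with learned dynamics models. The names `value_iteration`, `logged`, and `stitched`, and all state labels and rewards, are hypothetical.

```python
def value_iteration(edges, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Exact value iteration on a deterministic tabular MDP.

    edges maps each state to a list of (next_state, reward) pairs,
    one pair per available action. States with no outgoing edges are
    treated as terminal and keep a value of 0.
    """
    states = set(edges)
    for outs in edges.values():
        for nxt, _ in outs:
            states.add(nxt)

    values = {s: 0.0 for s in states}
    for _ in range(max_iters):
        delta = 0.0
        for s in states:
            outs = edges.get(s, [])
            if not outs:
                continue  # terminal state
            best = max(r + gamma * values[n] for n, r in outs)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    return values


# Two hypothetical logged trajectories: a0 -> a1 -> a2 earns nothing,
# while b0 -> b1 -> b2 earns a reward of 10 on its last step.
logged = {
    "a0": [("a1", 0.0)],
    "a1": [("a2", 0.0)],
    "b0": [("b1", 0.0)],
    "b1": [("b2", 10.0)],
}
before = value_iteration(logged)

# Assumed stitch: a short planned transition from a1 to b1 with reward 0.
# Here it is simply asserted to exist for the sake of the example.
stitched = {s: list(outs) for s, outs in logged.items()}
stitched["a1"].append(("b1", 0.0))
after = value_iteration(stitched)

print(before["a0"], after["a0"])
```

In this toy setup the value of `a0` is 0 before the stitch and roughly 8.1 afterwards (0.9 × 0.9 × 10), which illustrates how solving the tabular MDP exactly reveals whether a candidate transition is worth adding.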

