Online Apprenticeship Learning

02/13/2021
by Lior Shani, et al.

In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function. Instead, we observe trajectories sampled by an expert that acts according to some policy. The goal is to find a policy that matches the expert's performance on some predefined set of cost functions. We introduce an online variant of AL (Online Apprenticeship Learning; OAL), where the agent is expected to perform comparably to the expert while interacting with the environment. We show that the OAL problem can be effectively solved by combining two mirror-descent-based no-regret algorithms: one for policy optimization and another for learning the worst-case cost. To this end, we derive a convergent algorithm with O(√K) regret, where K is the number of interactions with the MDP, and an additional linear error term that depends on the number of expert trajectories available. Importantly, our algorithm avoids the need to solve an MDP at each iteration, making it more practical than prior AL methods. Finally, we implement a deep variant of our algorithm that shares some similarities with GAIL, but where the discriminator is replaced with the costs learned by the OAL problem. Our simulations demonstrate that our theoretically grounded approach outperforms the baselines.
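The interplay of the two mirror descent players can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a tabular, finite-horizon MDP whose transition tensor `P` is known, and an expert occupancy measure `d_expert` that is given exactly, whereas the online setting of the abstract learns from interaction and from a finite set of expert trajectories. All names (`occupancy`, `q_values`, `oal_sketch`, the step sizes) are hypothetical. The policy player takes an exponentiated-gradient (KL mirror descent) step using only policy evaluation, and the cost player takes a projected-gradient (Euclidean mirror ascent) step toward the cost that most separates the agent from the expert.

```python
import numpy as np

def occupancy(P, pi, mu0, H):
    """State-action occupancy d[h, s, a] of policy pi over horizon H.
    P has shape (S, A, S); pi has shape (H, S, A); mu0 has shape (S,)."""
    S, A, _ = P.shape
    d = np.zeros((H, S, A))
    state_dist = np.asarray(mu0, dtype=float).copy()
    for h in range(H):
        d[h] = state_dist[:, None] * pi[h]
        state_dist = np.einsum("sa,sat->t", d[h], P)
    return d

def q_values(P, pi, c, H):
    """Policy evaluation (not planning): Q[h, s, a] of pi under cost c."""
    S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)
    for h in reversed(range(H)):
        Q[h] = c[h] + P @ V_next
        V_next = (pi[h] * Q[h]).sum(axis=1)
    return Q

def oal_sketch(P, mu0, d_expert, H, K, eta_pi=0.5, eta_c=0.5):
    """Alternate one policy step and one cost step for K iterations.
    Assumption: known P and exact d_expert (a simplification)."""
    S, A, _ = P.shape
    logits = np.zeros((H, S, A))          # uniform initial policy
    pi = np.full((H, S, A), 1.0 / A)
    c = np.zeros((H, S, A))               # costs constrained to [0, 1]
    for _ in range(K):
        Q = q_values(P, pi, c, H)
        d_pi = occupancy(P, pi, mu0, H)
        # Policy player: pi_new(a|s) proportional to pi(a|s) * exp(-eta_pi * Q).
        logits = logits - eta_pi * Q
        logits = logits - logits.max(axis=2, keepdims=True)
        pi = np.exp(logits)
        pi = pi / pi.sum(axis=2, keepdims=True)
        # Cost player: ascend <c, d_pi - d_expert>, then project onto [0, 1].
        c = np.clip(c + eta_c * (d_pi - d_expert), 0.0, 1.0)
    return pi, c
```

Note that each iteration only evaluates the current policy under the current cost; no MDP is solved to optimality inside the loop, which is the property the abstract highlights.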


