A Reduction from Reinforcement Learning to No-Regret Online Learning

by Ching-An Cheng, et al.

We present a reduction from reinforcement learning (RL) to no-regret online learning based on the saddle-point formulation of RL, by which any online algorithm with sublinear regret can generate policies with provable performance guarantees. This new perspective decouples the RL problem into two parts: regret minimization and function approximation. The first part admits a standard online-learning analysis, and the second part can be quantified independently of the learning algorithm. Therefore, the proposed reduction can be used as a tool to systematically design new RL algorithms. We demonstrate this idea by devising a simple RL algorithm based on mirror descent and the generative-model oracle. For any γ-discounted tabular RL problem, with probability at least 1-δ, it learns an ϵ-optimal policy using at most Õ(|S||A| log(1/δ) / ((1-γ)^4 ϵ^2)) samples. Furthermore, this algorithm admits a direct extension to linearly parameterized function approximators for large-scale applications, with computation and sample complexities independent of |S| and |A|, though at the cost of potential approximation bias.
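To get a feel for how the stated sample bound scales, the leading term |S||A| log(1/δ) / ((1-γ)^4 ϵ^2) can be evaluated directly. The following is a minimal sketch, not the authors' code: the function name `sample_bound_scaling` and its parameter names are hypothetical, and the constants and polylogarithmic factors hidden by the Õ notation are dropped, so the result is only meaningful for comparing settings, not as an absolute sample count.

```python
import math


def sample_bound_scaling(num_states, num_actions, gamma, epsilon, delta):
    """Leading term |S||A| log(1/delta) / ((1-gamma)^4 epsilon^2).

    Hypothetical helper for illustration: constants and polylog factors
    hidden inside the O-tilde of the abstract's bound are omitted.
    """
    if num_states <= 0 or num_actions <= 0:
        raise ValueError("num_states and num_actions must be positive")
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")

    numerator = num_states * num_actions * math.log(1.0 / delta)
    denominator = (1.0 - gamma) ** 4 * epsilon ** 2
    return numerator / denominator


# Compare two accuracy targets on the same (made-up) problem size.
base = sample_bound_scaling(100, 4, gamma=0.9, epsilon=0.1, delta=0.05)
tighter = sample_bound_scaling(100, 4, gamma=0.9, epsilon=0.05, delta=0.05)
print(f"halving epsilon multiplies the leading term by {tighter / base:.1f}")
```

Under this leading term, halving ϵ quadruples the value and halving 1-γ multiplies it by sixteen, while the dependence on |S| and |A| is linear.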
