Sample-Efficient Reinforcement Learning for POMDPs with Linear Function Approximations

04/20/2022
by Qi Cai, et al.

Despite the success of reinforcement learning (RL) for Markov decision processes (MDPs) with function approximation, most RL algorithms readily fail when the agent has only partial observations of the state. Such a setting is often modeled as a partially observable Markov decision process (POMDP). Existing sample-efficient algorithms for POMDPs are restricted to the tabular setting, where the state and observation spaces are finite. In this paper, we make the first attempt at tackling the tension between function approximation and partial observability. Specifically, we focus on a class of undercomplete POMDPs with linear function approximations, which allows the state and observation spaces to be infinite. For such POMDPs, we show that the optimal policy and value function can be characterized by a sequence of finite-memory Bellman operators. We propose an RL algorithm that constructs optimistic estimators of these operators via reproducing kernel Hilbert space (RKHS) embedding. We prove that the proposed algorithm finds an ε-optimal policy with Õ(1/ε^2) episodes of exploration. Moreover, this sample complexity depends only polynomially on the intrinsic dimension of the POMDP and is independent of the size of the state and observation spaces. To the best of our knowledge, this is the first provably sample-efficient algorithm for POMDPs with function approximation.
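The abstract does not spell out the algorithm, so the sketch below is only a rough, hedged illustration of the generic RKHS-embedding idea it mentions: estimating a conditional mean embedding from samples via kernel ridge regression and using it to approximate the expectation of a value function given an observation. The Gaussian kernel, bandwidth, regularization, and the toy `value_fn` are assumptions for illustration only; they are not the paper's finite-memory Bellman operators or optimistic estimators.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gram matrix of the Gaussian (RBF) kernel between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def conditional_mean_operator(X, Y, reg=1e-2, bandwidth=1.0):
    """Empirical conditional mean embedding of Y given X (kernel ridge regression).

    Returns a function that, for a query point x and a test function f evaluated on
    the sampled Y's, approximates E[f(Y) | X = x]. This is the standard RKHS-embedding
    tool; the paper's operators are more structured than this generic estimator.
    """
    n = X.shape[0]
    K_xx = gaussian_kernel(X, X, bandwidth)
    W = np.linalg.solve(K_xx + n * reg * np.eye(n), np.eye(n))  # (K + n*reg*I)^{-1}

    def apply(x_query, f_values):
        k_x = gaussian_kernel(x_query[None, :], X, bandwidth)   # shape (1, n)
        weights = k_x @ W                                        # regression weights
        return (weights @ f_values).item()                       # ≈ E[f(Y) | X = x]

    return apply

# Toy usage: approximate E[V(s') | o] from (observation, next-state) samples.
rng = np.random.default_rng(0)
obs = rng.normal(size=(200, 3))                  # observation features
next_states = obs @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(200, 2))
value_fn = lambda s: np.sin(s[:, 0]) + s[:, 1]   # hypothetical value function

op = conditional_mean_operator(obs, next_states, reg=1e-2, bandwidth=1.0)
estimate = op(obs[0], value_fn(next_states))     # ≈ E[V(s') | o = obs[0]]
print(estimate)
```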
