Sample Complexity of Nonparametric Off-Policy Evaluation on Low-Dimensional Manifolds using Deep Networks
We consider the off-policy evaluation problem of reinforcement learning using deep neural networks. We analyze the deep fitted Q-evaluation method for estimating the expected cumulative reward of a target policy, when the data are generated from an unknown behavior policy. We show that, by choosing the network size appropriately, one can leverage the low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of dimensionality of the ambient representation. Specifically, we establish a sharp error bound for fitted Q-evaluation that depends on the intrinsic low dimension, the smoothness of the state-action space, and a function class-restricted χ²-divergence. Notably, the restricted χ²-divergence measures the mismatch between the behavior and target policies in the function space, which can be small even if the two policies are far apart in their tabular forms. Numerical experiments are provided to support our theoretical analysis.
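For readers unfamiliar with the estimator analyzed here, the following is a minimal sketch of fitted Q-evaluation (FQE) with a small neural network. It assumes a discrete action space; the names `target_policy` and `dataset`, the layer widths, and the training hyperparameters are illustrative placeholders, not the paper's prescribed architecture (in which the network size is chosen to match the intrinsic manifold dimension).

```python
# Sketch of fitted Q-evaluation: iteratively regress a fresh network onto
# Bellman targets computed under the target policy, then read off the value.
import torch
import torch.nn as nn

def fqe(dataset, target_policy, n_actions, gamma=0.99,
        n_iters=50, n_epochs=100, lr=1e-3):
    """dataset: tensors (states, actions, rewards, next_states) collected
    under the behavior policy; target_policy(s) returns action probabilities
    of shape (batch, n_actions)."""
    states, actions, rewards, next_states = dataset
    state_dim = states.shape[1]

    def make_net():
        # Hypothetical architecture for illustration only.
        return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                             nn.Linear(64, 64), nn.ReLU(),
                             nn.Linear(64, n_actions))

    q = make_net()  # current estimate Q_k
    for _ in range(n_iters):
        with torch.no_grad():
            # Bellman target under the target policy:
            # r + gamma * sum_{a'} pi(a'|s') Q_k(s', a')
            next_q = q(next_states)
            pi_next = target_policy(next_states)
            targets = rewards + gamma * (pi_next * next_q).sum(dim=1)
        q_next = make_net()  # regress Q_{k+1} onto the fixed targets
        opt = torch.optim.Adam(q_next.parameters(), lr=lr)
        for _ in range(n_epochs):
            pred = q_next(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            loss = ((pred - targets) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        q = q_next
    return q  # policy value is then E_{s ~ d0, a ~ pi}[Q(s, a)]
```

The returned network is averaged over initial states and target-policy actions to produce the value estimate whose error the paper bounds in terms of the intrinsic dimension and the restricted χ²-divergence.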