Stochastic Variance Reduction Methods for Policy Evaluation

by Simon S. Du, et al.

Policy evaluation, a crucial step in many reinforcement-learning procedures, estimates a value function that predicts the long-term value of each state under a given policy. In this paper, we focus on policy evaluation with linear function approximation over a fixed dataset. We first transform the empirical policy evaluation problem into a (quadratic) convex-concave saddle-point problem, and then present a primal-dual batch gradient method, as well as two stochastic variance reduction methods for solving the problem. The per-iteration cost of these algorithms scales linearly in both sample size and feature dimension. Moreover, they achieve linear convergence even when the saddle-point problem has only strong concavity in the dual variables but no strong convexity in the primal variables. Numerical experiments on benchmark problems demonstrate the effectiveness of our methods.
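The primal-dual batch gradient idea can be illustrated with a minimal sketch. The abstract does not spell out the objective, so the code assumes the commonly used saddle-point form of the regularized empirical MSPBE, min over theta and max over w of rho/2 ||theta||^2 - w^T A theta - (1/2 w^T C w - b^T w), with A, b, C built from sampled transitions. The function names, step sizes, and iteration count are hypothetical choices for illustration, not values taken from the paper.

```python
import numpy as np

def empirical_matrices(phi, phi_next, rewards, gamma):
    """Build empirical A, b, C from n transitions.

    phi, phi_next: (n, d) arrays of current/next state features (assumed layout).
    rewards: (n,) array of rewards; gamma: discount factor.
    """
    n = phi.shape[0]
    A = phi.T @ (phi - gamma * phi_next) / n
    b = phi.T @ rewards / n
    C = phi.T @ phi / n
    return A, b, C

def primal_dual_batch_gradient(A, b, C, rho=0.0,
                               step_theta=0.1, step_w=0.1, iters=3000):
    """Gradient descent on theta and gradient ascent on w for the assumed
    saddle-point objective. Step sizes are illustrative, not tuned."""
    d = A.shape[0]
    theta = np.zeros(d)  # primal variable (value-function weights)
    w = np.zeros(d)      # dual variable
    for _ in range(iters):
        grad_theta = rho * theta - A.T @ w   # gradient w.r.t. theta
        grad_w = b - A @ theta - C @ w       # gradient w.r.t. w
        # Simultaneous update: descend in theta, ascend in w.
        theta, w = theta - step_theta * grad_theta, w + step_w * grad_w
    return theta, w
```

With rho = 0 the primal objective has no strong convexity of its own, which is the regime the abstract highlights. In this sketch the iterates should still approach the solution of A theta = b whenever A is invertible and the step sizes are small enough. Each iteration uses only matrix-vector products with the precomputed A and C; computing the same gradients directly from the n samples would cost O(nd) per pass.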



Linear Convergence of the Primal-Dual Gradient Method for Convex-Concave Saddle Point Problems without Strong Convexity

We consider the convex-concave saddle point problem min_x max_y f(x) + y^T A x - g(y...

SVRG for Policy Evaluation with Fewer Gradient Evaluations

Stochastic variance-reduced gradient (SVRG) is an optimization method or...

Expected Sarsa(λ) with Control Variate for Variance Reduction

Off-policy learning is powerful for reinforcement learning. However, the...

A framework for bilevel optimization that enables stochastic and global variance reduction algorithms

Bilevel optimization, the problem of minimizing a value function which i...

Rejoinder: Learning Optimal Distributionally Robust Individualized Treatment Rules

We thank the opportunity offered by editors for this discussion and the ...

Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning

Temporal-Difference (TD) learning with nonlinear smooth function approxi...

Batch-iFDD for Representation Expansion in Large MDPs

Matching pursuit (MP) methods are a promising class of feature construct...