Sample-efficient Iterative Lower Bound Optimization of Deep Reactive Policies for Planning in Continuous MDPs

03/23/2022
by   Siow Meng Low, et al.
0

Recent advances in deep learning have enabled optimization of deep reactive policies (DRPs) for continuous MDP planning by encoding a parametric policy as a deep neural network and exploiting automatic differentiation in an end-to-end model-based gradient descent framework. This approach has proven effective for optimizing DRPs in nonlinear continuous MDPs, but it requires a large number of sampled trajectories to learn effectively and can suffer from high variance in solution quality. In this work, we revisit the overall model-based DRP objective and instead take a minorization-maximization perspective to iteratively optimize the DRP w.r.t. a locally tight lower-bounded objective. This novel formulation of DRP learning as iterative lower bound optimization (ILBO) is particularly appealing because (i) each step is structurally easier to optimize than the overall objective, (ii) it guarantees a monotonically improving objective under certain theoretical conditions, and (iii) it reuses samples between iterations thus lowering sample complexity. Empirical evaluation confirms that ILBO is significantly more sample-efficient than the state-of-the-art DRP planner and consistently produces better solution quality with lower variance. We additionally demonstrate that ILBO generalizes well to new problem instances (i.e., different initial states) without requiring retraining.

READ FULL TEXT

page 6

page 7

research
06/14/2021

RAPTOR: End-to-end Risk-Aware MDP Planning and Policy Learning by Backpropagation

Planning provides a framework for optimizing sequential decisions in com...
research
11/28/2019

Analysis of Lower Bounds for Simple Policy Iteration

Policy iteration is a family of algorithms that are used to find an opti...
research
10/14/2019

Bootstrapping the Expressivity with Model-based Planning

We compare the model-free reinforcement learning with the model-based ap...
research
07/10/2018

Algorithmic Framework for Model-based Reinforcement Learning with Theoretical Guarantees

While model-based reinforcement learning has empirically been shown to s...
research
06/06/2021

Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning

In the predict-then-optimize framework, the objective is to train a pred...
research
02/08/2019

Size Independent Neural Transfer for RDDL Planning

Neural planners for RDDL MDPs produce deep reactive policies in an offli...

Please sign up or login with your details

Forgot password? Click here to reset