Multi-Preference Actor Critic

04/05/2019
by Ishan Durugkar, et al.

Policy gradient algorithms typically combine discounted future rewards with an estimated value function to compute the direction and magnitude of parameter updates. However, for most Reinforcement Learning tasks, humans can provide additional insight to constrain the policy learning. We introduce a general method to incorporate multiple different feedback channels into a single policy gradient loss. In our formulation, the Multi-Preference Actor Critic (M-PAC), these different types of feedback are implemented as constraints on the policy. We use a Lagrangian relaxation to satisfy these constraints using gradient descent while learning a policy that maximizes rewards. Experiments in Atari and Pendulum verify that the constraints are respected and that they can accelerate the learning process.
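To make the idea of a Lagrangian relaxation concrete, here is a minimal sketch on a toy problem. It is not the M-PAC loss and is not taken from the paper: the quadratic reward, the single constraint, the function name `primal_dual_descent`, and the step size are all assumptions chosen for illustration. A one-parameter "policy" picks an action `a`, the reward prefers `a = 2`, and a hypothetical feedback constraint requires the action to stay within distance 1 of a preferred action 0. The constraint is folded into the loss with a multiplier `lam`, and both are updated by gradient steps.

```python
def primal_dual_descent(steps=2000, lr=0.05):
    """Toy Lagrangian relaxation solved by gradient descent-ascent.

    Hypothetical problem (an assumption, not from the paper):
        maximize   reward(a) = -(a - 2)^2
        subject to g(a) = a^2 - 1 <= 0
    Lagrangian to minimize over a and maximize over lam >= 0:
        L(a, lam) = (a - 2)^2 + lam * (a^2 - 1)
    """
    a = 0.0    # primal variable: the action chosen by the toy policy
    lam = 0.0  # Lagrange multiplier, kept non-negative by projection

    for _ in range(steps):
        violation = a * a - 1.0  # positive when the constraint is violated

        # Gradient of L with respect to a: reward term plus weighted constraint term.
        grad_a = 2.0 * (a - 2.0) + lam * 2.0 * a

        # Descent step on the primal variable.
        a_next = a - lr * grad_a

        # Ascent step on the multiplier, projected onto lam >= 0.
        lam = max(0.0, lam + lr * violation)
        a = a_next

    return a, lam


if __name__ == "__main__":
    a, lam = primal_dual_descent()
    print(f"action = {a:.4f}, multiplier = {lam:.4f}")
```

For this toy problem the KKT conditions give the constrained optimum analytically: a = 1 with multiplier 1, whereas the unconstrained optimum would be a = 2. The multiplier grows only while the constraint is violated, which is the mechanism that lets a single gradient-based loss trade off reward against constraint satisfaction.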


