Distributional Advantage Actor-Critic

06/10/2018
by   Shangda Li, et al.
Carnegie Mellon University

In traditional reinforcement learning, an agent maximizes the reward collected during its interaction with the environment by approximating the optimal policy through the estimation of value functions. Typically, given a state s and action a, the corresponding value is the expected discounted sum of rewards. The optimal action is then chosen to be the action a with the largest estimated value. However, recent work has shown both theoretical and experimental evidence of superior performance when the value function is replaced with a value distribution in the context of deep Q-learning [1]. In this paper, we develop a new algorithm that combines advantage actor-critic with a value distribution estimated by quantile regression. We evaluated this new algorithm, termed Distributional Advantage Actor-Critic (DA2C or QR-A2C), on a variety of tasks, and observed it to perform at least as well as baseline algorithms, outperforming the baseline on some tasks with smaller variance and increased stability.
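As a rough illustration of the quantile-regression critic the abstract describes, the sketch below computes a quantile Huber loss over a set of predicted return quantiles against a sampled Bellman target. The function name and the plain-list representation are illustrative, not the paper's actual implementation.

```python
# Hypothetical sketch of a quantile-regression loss for a distributional
# critic: N quantile estimates of the return vs. a scalar Bellman target.

def quantile_huber_loss(quantiles, target, kappa=1.0):
    """Average asymmetric Huber loss over N quantile estimates.

    quantiles: list of N predicted quantile values of the return distribution.
    target: a sampled one-step Bellman target (scalar).
    """
    n = len(quantiles)
    loss = 0.0
    for i, q in enumerate(quantiles):
        tau = (i + 0.5) / n            # midpoint quantile level for estimate i
        u = target - q                 # TD error for this quantile
        # Huber part: quadratic near zero, linear in the tails
        if abs(u) <= kappa:
            huber = 0.5 * u * u
        else:
            huber = kappa * (abs(u) - 0.5 * kappa)
        # Asymmetric weight penalizes over- and under-estimation differently
        loss += abs(tau - (1.0 if u < 0 else 0.0)) * huber
    return loss / n
```

Minimizing this loss pushes the i-th estimate toward the tau_i-quantile of the target distribution, which is what lets the critic represent a full return distribution rather than just its mean.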


1 Introduction

Reinforcement learning (RL) is composed of four elements: the states the agent can be in, the actions the agent can take, the reward observed as a result of an action, and the successor state after the action is taken. In real-world tasks, the states an agent can be in often consist of visual input, i.e., RGB images. Traditionally, to decrease the dimensionality of the input, relevant features are first extracted before being used as input states to the RL algorithm. This induces two major disadvantages: 1) every raw input needs to be preprocessed; 2) each task needs a feature-extraction algorithm tailored to it. A more general approach is to use the visual percept directly as the input state, as in many OpenAI Atari environments. Visually simple and fully observable environments, such as those in Atari, are known to be solvable with well-known RL algorithms [1]. However, complex real-world tasks, such as having a robot navigate a procedurally generated environment, are more involved. The three major challenges that arise from navigation in a first-person perspective are: 1) the substantial visual complexity arising from independently moving objects in the environment; 2) the sparsity of rewards coupled with the constantly changing view of the agent; 3) the procedurally generated environment differs on every episode, requiring the agent to explore and rapidly adapt to each new environment. This paper explores combinations of RL algorithms and techniques that best solve a navigation task from a first-person perspective in a visually complex, dynamic environment.

2 Methodology

2.1 Proposed Model

Due to the complexity of the problem, only deep models are considered. We use vanilla DQN and Dueling DQN with regular experience replay as baselines for the tasks. We do not expect decent performance from DQN with experience replay alone, especially on the Maze Navigation task, because of the challenges mentioned above: extremely sparse rewards, partially observable states, and changing start/goal states in each episode. In order for the network to learn the latent structure of the complex pixel inputs, the baseline DQN needs at least a convolutional neural network rather than simply stacked fully-connected layers. One drawback of a convolutional neural network is that a CNN is not rotation invariant, which may pose a problem in dynamic environments where both the agent's angle of view and the objects themselves constantly change. In this paper, we explore a few more advanced reinforcement learning algorithms and techniques that could enhance performance.
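The bootstrapped target that the DQN baselines regress toward can be sketched as follows; here target_net is a plain callable standing in for the target Q-network, not the actual model used in this work.

```python
# Minimal sketch of the one-step DQN target: r + gamma * max_a Q_target(s', a).
# target_net maps a state to a list of Q-values, one per discrete action.

def dqn_target(reward, next_state, done, target_net, gamma=0.99):
    """Bootstrapped target for the Q-learning regression loss."""
    if done:
        return reward                      # no future value past a terminal state
    return reward + gamma * max(target_net(next_state))
```

With the extremely sparse rewards described above, this target is almost always zero early in training, which is exactly why plain DQN is expected to struggle here.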

2.2 Hindsight Experience Replay

Hindsight Experience Replay (HER) is used to accelerate learning in environments with rewards so sparse that traditional DQN algorithms cannot learn them [3]. The idea of HER is to replay episodes with a different goal than the one the agent was originally trying to achieve, thereby obtaining a learning signal even when the episode never reaches the original goal state.
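The relabeling step at the core of HER can be sketched as below, assuming a trajectory is stored as (state, action, next_state) tuples; the function name and tuple layout are illustrative.

```python
# Schematic of HER's "final" goal-relabeling strategy: replay the episode as
# if the state actually reached at the end had been the goal all along.

def her_relabel(trajectory, reward_fn):
    """Return goal-augmented transitions with rewards recomputed for the
    achieved goal, so even a failed episode yields a learning signal."""
    new_goal = trajectory[-1][2]           # last next_state becomes the goal
    relabeled = []
    for state, action, next_state in trajectory:
        reward = reward_fn(next_state, new_goal)
        relabeled.append((state, action, new_goal, reward, next_state))
    return relabeled
```

Both the original and the relabeled transitions are then stored in the replay buffer, which is what supplies non-zero rewards under extreme sparsity.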

2.3 Actor-Critic Network

The actor-critic architecture for policy gradient is an essential element of many advanced algorithms, such as Hierarchical Actor-Critic and Deep Deterministic Policy Gradient. An actor-critic method maintains two separate neural networks: an actor, which learns a deterministic target policy, and a critic, which uses an approximation architecture and simulation to learn an action-value function approximating the actor's action-value function [3]. Actor-critic methods have more desirable convergence properties than critic-only methods, and converge faster than actor-only methods [4].
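The division of labor between the two networks can be illustrated with a deliberately simplified tabular one-step update, where the TD error computed by the critic serves as the learning signal for the actor; all names and the tabular representation are illustrative, not this paper's networks.

```python
# Schematic one-step actor-critic update. The critic does TD(0) on state
# values; the actor adjusts action preferences in proportion to the TD error,
# which plays the role of an advantage estimate.

def actor_critic_step(v, policy_pref, state, action, reward, next_state,
                      gamma=0.99, alpha_c=0.1, alpha_a=0.01, done=False):
    target = reward + (0.0 if done else gamma * v.get(next_state, 0.0))
    td_error = target - v.get(state, 0.0)
    v[state] = v.get(state, 0.0) + alpha_c * td_error              # critic step
    key = (state, action)
    policy_pref[key] = policy_pref.get(key, 0.0) + alpha_a * td_error  # actor step
    return td_error
```

The critic's value estimate reduces the variance of the actor's gradient, which is the source of the faster, more stable convergence cited above.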

2.4 Deep Deterministic Policy Gradient

DDPG is an off-policy actor-critic algorithm with experience replay, adapted to perform well in high-dimensional action spaces. More important for our purposes, DDPG has proved capable of learning directly from unprocessed pixels in simple environments, and in many cases the pixel-input variant even outperformed planning algorithms [5].
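One stabilizing ingredient of DDPG worth noting is its soft (Polyak-averaged) target-network update, sketched below with parameters as plain lists of floats for illustration.

```python
# Sketch of DDPG's soft target update: target <- tau*online + (1-tau)*target,
# applied elementwise, so the targets drift slowly toward the online network.

def soft_update(target_params, online_params, tau=0.001):
    return [tau * o + (1.0 - tau) * t
            for t, o in zip(target_params, online_params)]
```

Slowly moving targets keep the bootstrapped regression targets nearly stationary, which is important when learning from raw pixels.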

2.5 Hierarchical Actor-Critic

The Hierarchical Actor-Critic (HAC) algorithm enables the agent to explore at a high level while learning short policies at each level of the hierarchy. HAC has shown potential for accelerating learning in tasks with long time horizons, even in the presence of sparse rewards, which makes it well suited to the complex environment presented above [6]. HAC is an actor-critic algorithm based on deep deterministic policy gradient (DDPG) that uses Hindsight Experience Replay (HER) and Universal Value Function Approximators (UVFA). UVFA extends action-value functions to incorporate multiple goals, meaning that each goal corresponds to its own reward function. This makes the Q-function depend on both a state-action pair and a goal [3] [6].
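The UVFA idea of conditioning the Q-function on a goal can be made concrete with a small tabular stand-in, where the table is simply keyed on (state, action, goal) instead of (state, action); the class and its learning rate are illustrative.

```python
# Toy illustration of UVFA-style goal conditioning: the value table is
# indexed by (state, action, goal), so one function covers every goal.

class GoalConditionedQ:
    def __init__(self):
        self.table = {}

    def get(self, state, action, goal):
        return self.table.get((state, action, goal), 0.0)

    def update(self, state, action, goal, target, lr=0.1):
        """Move Q(s, a, g) a step toward the given bootstrapped target."""
        old = self.get(state, action, goal)
        self.table[(state, action, goal)] = old + lr * (target - old)
```

In HAC, each level of the hierarchy maintains such a goal-conditioned critic, and HER supplies the relabeled goals that make the sparse-reward updates informative.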

3 Preliminary Results [TODO]

3.1 Dataset

To simulate a complex environment, we use DeepMind Lab's 3D learning environment, based on the Quake III Arena engine [2]. DeepMind Lab is a first-person 3D game platform created by Beattie et al. in 2016, designed to study how autonomous artificial agents interact in large, partially observed, and visually diverse worlds. It contains a suite of tasks at three difficulty levels: the easiest is collecting different fruits (rewards) in a maze, the medium is navigating a maze, and the most challenging is laser tag while bouncing through space. The observation available to the agent consists of three parts: an RGBD (depth-included) image of the current field of vision, a reward, and optional velocity information. Due to the complex nature of training a pixels-to-actions autonomous agent to navigate in 3D space, this paper focuses mostly on the easy and medium tasks.

Easy task: "Stairway to Melon" and "Seek Avoid Arena"
There are two versions of this task, both of which require the agent to gather fruits in a static map. The goal is to gather positive-reward fruits while avoiding negative-reward ones.

Medium task: "Maze Navigation"
This task comes in two flavors: a static map and a procedurally generated random map. The agent is expected to find its way from a random start location to the goal location (which is fixed in the static map and random in the random map).

3.2 Initial Results

There are numerous environments available in DeepMind Lab. For simplicity, we chose to focus on the easiest environment, "seekavoid_arena_01".

Randomized Policy

The first policy we tried is a randomized policy, with equal probability assigned to every possible action in a given state. Unsurprisingly, the random policy did poorly in all environments we tried.

The policy received 0 reward in 1000 steps most of the time.
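This baseline amounts to the following one-liner; the action set here is a stand-in for the environment's discrete actions.

```python
import random

# The uniform-random baseline policy described above: the state is ignored
# and every action in the (stand-in) discrete action set is equally likely.

def random_policy(state, actions):
    return random.choice(actions)
```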

Linear model

The next model is a linear model with the state as input and the Q-values as output, as in the homework 2 writeup. We trained the model online for 1000 epochs; the results are reported below.
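The linear Q-model can be sketched as below, with the flattened pixel state mapped to one Q-value per action; the weight layout is illustrative, not the trained model's.

```python
# Minimal sketch of the linear baseline: Q(s, a) = w_a . s + b_a for each
# discrete action a, with the state flattened into a plain list of floats.

def linear_q(state, weights, biases):
    """Return one Q-value per action from a single linear layer."""
    return [sum(w_i * s_i for w_i, s_i in zip(w, state)) + b
            for w, b in zip(weights, biases)]
```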

Unfortunately, the linear model does not work as well as it did on CartPole or MountainCar, contrary to our expectations. In retrospect, the most probable explanation is that the inputs in CartPole/MountainCar are drastically different from those in seekavoid_arena_01. In the first two environments, the inputs are object attributes such as location and velocity, whereas in seekavoid_arena_01 the inputs are raw pixels, which are far more complex.
There are two possible ways to improve the linear model.

  1. Manual feature extraction.
    We may construct a feature-extraction model based on human expertise. In other words, we preprocess the raw input data, extract features with a human-defined model, and then feed the features to a linear model. However, this approach does not generalize to similar navigation tasks.

  2. Let the network learn the feature extraction.
    We can train a convolutional neural network to extract useful features from the video feed. Combined with the techniques listed in Section 2, we believe this would yield a significant performance improvement.
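The basic operation such a network would stack is the 2D convolution, shown here in a toy single-channel, pure-Python form; a real model would use a deep-learning library with learned multi-channel kernels.

```python
# Toy valid-mode 2D convolution (strictly, cross-correlation) of a
# single-channel image with one kernel, both given as 2D lists of floats.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]
```

Because the same kernel is slid over every spatial position, the features are translation equivariant, which is precisely what the linear model lacks on raw pixels (though, as noted in Section 2.1, not rotation invariant).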

References

[1] Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." AAAI. Vol. 16. 2016.

[2] Beattie, Charles, et al. "DeepMind Lab." arXiv preprint arXiv:1612.03801 (2016).

[3] Andrychowicz, Marcin, et al. "Hindsight experience replay." Advances in Neural Information Processing Systems. 2017.

[4] Konda, Vijay R., and John N. Tsitsiklis. "Actor-critic algorithms." Advances in neural information processing systems. 2000.

[5] Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).

[6] Levy, Andrew, Robert Platt, and Kate Saenko. "Hierarchical Actor-Critic." arXiv preprint arXiv:1712.00948 (2017).