Deep Reinforcement Learning with Attention for Slate Markov Decision Processes with High-Dimensional States and Actions

12/03/2015
by Peter Sunehag, et al.

Many real-world problems come with action spaces represented as feature vectors. Although high-dimensional control is a largely unsolved problem, there has recently been progress for modest dimensionalities. Here we report on a successful attempt at addressing problems of dimensionality as high as 2000, of a particular form. Motivated by important applications such as recommendation systems that do not fit the standard reinforcement learning frameworks, we introduce Slate Markov Decision Processes (slate-MDPs). A slate-MDP is an MDP with a combinatorial action space consisting of slates (tuples) of primitive actions, of which one is executed in an underlying MDP. The agent does not control the choice of this executed action, and the action might not even be from the slate, e.g., in recommendation systems where all recommendations can be ignored. We use deep Q-learning, based on feature representations of both the state and the action, to learn the value of whole slates. Unlike existing methods, we optimize for both the combinatorial and the sequential aspects of our tasks. The new agent's superiority over agents that ignore either the combinatorial or the sequential long-term value aspect is demonstrated on a range of environments with dynamics from a real-world recommendation system. Further, we use deep deterministic policy gradients to learn a policy that, for each position of the slate, guides attention towards the part of the action space in which the value is highest, and we evaluate only actions in this area. The attention is used within a sequentially greedy procedure that leverages submodularity. Finally, we show how introducing risk-seeking can dramatically improve the agent's performance and its ability to discover more far-reaching strategies.
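To make the two core ideas in the abstract concrete, here is a minimal NumPy sketch of (1) a Q-function that scores a whole slate from state and action feature vectors, and (2) a sequentially greedy slate builder that only evaluates candidate actions near an "attention" point per slate position. This is not the authors' implementation: the function names, the linear scorer (a stand-in for the deep Q-network), and the pre-computed attention points (a stand-in for the output of the deep deterministic policy gradient actor) are all assumptions made for illustration.

```python
"""Hypothetical sketch of slate scoring and attention-guided greedy slate
construction; names and the linear Q stand-in are illustrative assumptions."""
import numpy as np


def slate_q_value(state, slate, w):
    """Score a whole slate: a linear function of the concatenated state and
    slate features (stand-in for a deep Q-network over full slates)."""
    x = np.concatenate([state] + list(slate))
    return float(w @ x)


def nearest_candidates(proto_action, candidates, k):
    """Restrict evaluation to the k candidate actions closest to the
    attention point (proto-action) for this slate position."""
    dists = np.linalg.norm(candidates - proto_action, axis=1)
    return candidates[np.argsort(dists)[:k]]


def greedy_slate(state, candidates, attention_points, w, k=10):
    """Sequentially greedy construction: fill one position at a time, choosing
    the candidate that maximizes the value of the slate built so far (remaining
    positions padded with zeros), searching only near that position's
    attention point."""
    slate_size = len(attention_points)
    act_dim = candidates.shape[1]
    chosen = []
    for pos in range(slate_size):
        local = nearest_candidates(attention_points[pos], candidates, k)
        best_a, best_q = None, -np.inf
        for a in local:
            padded = chosen + [a] + [np.zeros(act_dim)] * (slate_size - pos - 1)
            q = slate_q_value(state, padded, w)
            if q > best_q:
                best_a, best_q = a, q
        chosen.append(best_a)
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    state_dim, act_dim, slate_size, n_items = 8, 4, 3, 500
    state = rng.normal(size=state_dim)
    candidates = rng.normal(size=(n_items, act_dim))       # item feature vectors
    attention = rng.normal(size=(slate_size, act_dim))     # hypothetical actor output
    w = rng.normal(size=state_dim + slate_size * act_dim)  # stand-in Q weights
    slate = greedy_slate(state, candidates, attention, w, k=20)
    print("greedy slate value:", slate_q_value(state, slate, w))
```

The sketch shows why the attention step matters: with the candidate set restricted to k nearest neighbours of each attention point, the greedy builder evaluates O(slate_size * k) slates instead of scoring every item at every position, which is what makes action spaces with thousands of feature dimensions and many items tractable in the setting the abstract describes.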

