1. Introduction
Deep reinforcement learning (RL) has achieved state-of-the-art performance across a variety of tasks Mnih et al. (2013); Silver et al. (2017). However, successful deep RL training requires large amounts of sample data. Various learning methods have been proposed to improve sample efficiency, such as model-based learning and the incorporation of Bayesian priors Gu et al. (2016); Spector and Belongie (2018).
The key insight of this paper is that we can significantly improve efficiency by leveraging the exchangeable structure inherent in many reinforcement learning problems. That is, for a state space that can be factored into sets of substates, presenting the factored state in a way that does not rely on a particular ordering of the substates can lead to a significant reduction in the search space.
In this work, we propose an attention mechanism as a means to leverage object exchangeability. We propose a mechanism that is permutation invariant in that it will produce the same output for any permutation of the items in the input set, and we show that this representation reduces the input search space by a factor of up to n!, where n is the number of exchangeable objects.
2. Background and Related Work
Deep RL is a class of methods for solving Markov Decision Processes (MDPs) using deep neural networks. Solving an MDP requires finding a policy that maps every state in a state space to an action so as to maximize total accumulated reward. Formally, we define an object in an MDP to be a subset of the state space that defines the state of a single entity in the problem environment. In an aircraft collision avoidance problem, for example, an object could be defined by the values associated with a single aircraft. It is well known that as the number of objects grows, the size of the MDP search space grows exponentially Robbel et al. (2016).
For MDPs with exchangeable objects, an optimal policy should provide the same action for any permutation of the input. When states are represented as ordered sets, as is common, this must be learned by the policy during training.
Many methods have instead been proposed to enforce this through permutation-invariant input representations. The Object-Oriented MDP (OO-MDP) framework uses object-class exchangeability to represent states in an order-invariant space for discrete domains Diuk et al. (2008). Approximately Optimal State Abstractions Abel et al. (2016) propose a theoretical approximation that extends OO-MDP to continuous domains. Object-Focused Q-learning Cobo et al. (2013) uses object classes to decompose the Q-function output space, though it does not address the input representation.
Deep Sets Zaheer et al. (2017) proposes a permutation-invariant abstraction method to produce input vectors from exchangeable sets. The proposed method produces a static mapping: each input object is weighted equally, regardless of its value, during the mapping.
Our method improves upon Deep Sets by introducing an attention mechanism that dynamically maps the inputs to the permutation-invariant space. Attention mechanisms are used in various deep learning tasks to dynamically filter the input to a downstream neural network, emphasizing the most important parts of the original input Xu et al. (2015); Luong et al. (2015); Jaderberg et al. (2015). We adapt a mechanism from recent work in natural language processing, which uses a dot-product neural layer to efficiently apply dynamic attention Vaswani et al. (2017).

3. Problem Formalization
Our objective is to propose an attention mechanism that will take sets of objects as inputs and produce abstractions such that the mapping is permutation invariant. This output can then be used as an input to the policy neural network in an RL problem (e.g., a deep Q-network or action policy). Our hypothesis is that learning using this abstract representation will be more sample efficient than learning on the original object set.
We propose the attention network architecture shown in fig. 1, which is a permutation-invariant implementation of dot-product attention. For a single input set of n objects, the object state vectors are individually passed through two feed-forward neural networks: a filter network and a value network. The scalar outputs of the filter network are concatenated into a single vector and the softmax operation is applied. These outputs are then multiplied element-wise by the concatenated outputs of the value network. In this way, the output of the filter network acts as the attention filter, weighting the inputs by importance prior to summation. The elements of the weighted vectors are then summed over the different objects, resulting in a single vector, which is used as the input to the policy neural network.

We can now define bounds on the sample-efficiency benefits of an invariant mapping. Consider a state space composed of n exchangeable objects, where each object can take on one of k unique values. Representing the states as ordered sets results in a state-space size given by the expression for permutations of k values taken n at a time. If all objects are exchangeable, there exists an abstraction that is permutation invariant. Since order does not matter, the size of this abstract state space is given by the expression for combinations of k values taken n at a time:
|S_ordered| / |S_abstract| = [k! / (k − n)!] / [k! / (n!(k − n)!)] = n!    (1)
Using the permutation-invariant representation therefore reduces the input space by a factor of n! compared to the ordered-set representation.
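The reduction factor can be verified numerically. The sketch below, assuming illustrative values n = 4 objects and k = 10 values per object, counts the ordered (permutation) and unordered (combination) state-space sizes and confirms their ratio is n!:

```python
from math import comb, factorial, perm

n, k = 4, 10              # 4 exchangeable objects, each taking one of 10 values
ordered = perm(k, n)      # ordered-set representation: k!/(k-n)! states
unordered = comb(k, n)    # permutation-invariant representation: C(k, n) states
assert ordered // unordered == factorial(n)   # reduction factor is exactly n!
print(ordered, unordered, ordered // unordered)  # → 5040 210 24
```

The factor grows rapidly: with ten exchangeable objects the ordered representation is 10! ≈ 3.6 million times larger than the invariant one.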
It can be shown that a mapping f is permutation invariant on all countable sets if and only if it can be decomposed using transformations φ and ρ, where φ and ρ are vector-valued functions, into the form Zaheer et al. (2017):
f(X) = ρ(∑_{x ∈ X} φ(x))    (2)
The proposed attention mechanism can be decomposed into the above form, proving that it is permutation invariant. For problems with multiple classes of exchangeable objects, a separate attention mechanism can be deployed for each class.
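The invariance can be checked directly on a small numerical sketch of the mechanism described above. The linear "networks" and dimensions below are illustrative stand-ins for the paper's feed-forward filter and value networks; the structure (per-object scalar filter score, softmax over the set, weighted sum of per-object value vectors) follows fig. 1:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_abstraction(X, W_filter, W_value):
    """Map a set of object state vectors (rows of X) to one
    permutation-invariant vector.

    Each object passes through a filter network (here a single linear
    layer producing a scalar) and a value network (a linear layer).
    Filter scores are softmax-normalized over the objects and used to
    weight the value outputs before summing over the set.
    """
    scores = X @ W_filter          # (n,) one scalar filter score per object
    weights = softmax(scores)      # attention weights over the set
    values = X @ W_value           # (n, d) one value vector per object
    return weights @ values        # (d,) weighted sum: order-independent

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))        # 5 exchangeable objects, 4 features each
W_f = rng.normal(size=4)           # toy filter-network weights
W_v = rng.normal(size=(4, 3))      # toy value-network weights

out = attention_abstraction(X, W_f, W_v)
out_perm = attention_abstraction(X[::-1], W_f, W_v)   # reversed object order
assert np.allclose(out, out_perm)  # same abstraction for any ordering
```

Because both the softmax normalization and the final sum range over the whole set, reordering the rows of X permutes the intermediate terms but leaves their sum unchanged, which is exactly the ρ(∑ φ(x)) decomposition of eq. (2).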
4. Experiments and Results
We conducted a series of experiments to validate the effectiveness of our proposed abstraction. The first two tasks are simple MDPs in which a scavenger agent navigates a continuous two-dimensional world to find food particles.
In the scavenger tasks, the state space consists of vectors containing the relative position of each of the n food particles as well as the ego position of the agent. The agent receives a positive reward upon reaching a food particle and a small negative reward at every other timestep. The episode terminates upon reaching a food particle or when the number of timesteps exceeds a limit. Scavenger Task 2 introduces poison particles in addition to the food particles (one poison for each food particle). If the agent reaches a poison particle, a negative reward is given and the episode terminates.
The third task is a convoy protection task with variable numbers of objects. The task requires a defender agent to protect a convoy that follows a predetermined path through a 2D environment. Attackers are spawned at the periphery of the environment during the episode, and the defender must block their attempts to approach the convoy. The state space is the space of vectors representing the state of each nonego object in the environment. The episode terminates when all convoy members either reach the goal position or are reached by an attacker.
For each task, we trained a set of policies with the attention mechanism as well as baseline policies that use a standard ordered set to represent the input space. Each policy was trained with Proximal Policy Optimization (PPO) Schulman et al. (2017), a policy-gradient algorithm.
For each scavenger task, we trained policies on tasks with one to five food particles. The baseline policies were unable to achieve optimal performance on tasks with more than two food particles in either scavenger task. The policy trained with our attention mechanism was able to learn an optimal policy in all cases with no increase in the number of required training samples. For the convoy task, the abstracted policy approached optimal behavior after approximately 2,500 epochs, whereas the baseline policy showed no improvement after 10,000 epochs, as shown in fig. 2.

These experiments demonstrate the effectiveness of the proposed approach in enhancing the scalability of the PPO policy-gradient learning algorithm. Together, they validate our hypothesis that leveraging object exchangeability in the input representation can improve the efficiency of deep reinforcement learning.
References
 Abel et al. (2016) David Abel, D. Ellis Hershkowitz, and Michael L. Littman. 2016. Near Optimal Behavior via Approximate State Abstraction. In International Conference on Machine Learning (ICML).
 Cobo et al. (2013) Luis C. Cobo, Charles L. Isbell, and Andrea L. Thomaz. 2013. Object Focused Q-learning for Autonomous Agents. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
 Diuk et al. (2008) Carlos Diuk, Andre Cohen, and Michael L. Littman. 2008. An Object-oriented Representation for Efficient Reinforcement Learning. In International Conference on Machine Learning (ICML).
 Gu et al. (2016) Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. 2016. Continuous Deep Q-learning with Model-based Acceleration. In International Conference on Machine Learning (ICML).
 Jaderberg et al. (2015) Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. 2015. Spatial transformer networks. In Advances in Neural Information Processing Systems (NIPS).
 Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. CoRR abs/1312.5602 (2013).
 Robbel et al. (2016) Philipp Robbel, Frans Oliehoek, and Mykel Kochenderfer. 2016. Exploiting Anonymity in Approximate Linear Programming: Scaling to Large Multiagent MDPs. In AAAI Conference on Artificial Intelligence (AAAI).
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347 (2017). arXiv:1707.06347 http://arxiv.org/abs/1707.06347
 Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. 2017. Mastering the game of Go without human knowledge. Nature 550 (2017), 354–359.
 Spector and Belongie (2018) Benjamin Spector and Serge J. Belongie. 2018. Sample-Efficient Reinforcement Learning through Transfer and Architectural Priors. CoRR abs/1801.02268 (2018). arXiv:1801.02268 http://arxiv.org/abs/1801.02268
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems (NIPS), I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). 5998–6008.
 Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML).
 Zaheer et al. (2017) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. 2017. Deep Sets. In Advances in Neural Information Processing Systems (NIPS).