Deep reinforcement learning (RL) has achieved state-of-the-art performance across a variety of tasks Mnih et al. (2013); Silver et al. (2017). However, successful deep RL training requires large amounts of sample data. Various learning methods have been proposed to improve sample efficiency, such as model-based learning and incorporation of Bayesian priors Gu et al. (2016); Spector and Belongie (2018).
The key insight of this paper is that we can significantly improve sample efficiency by leveraging the exchangeable structure inherent in many reinforcement learning problems. That is, for a state space that can be factored into sets of sub-states, presenting the factored state in a way that does not rely on a particular ordering of the sub-states can lead to a significant reduction in the search space.
In this work, we propose an attention mechanism as a means to leverage object exchangeability. The mechanism is permutation invariant, in that it produces the same output for any permutation of the items in the input set, and we show that this representation reduces the input search space by a factor of up to $n!$, where $n$ is the number of exchangeable objects.
2. Background and Related Work
Formally, we can define an object in an MDP to be a subset of the state space that defines the state of a single entity in the problem environment. In an aircraft collision avoidance problem, an object could be defined by the values associated with a single aircraft. It is well known that as the number of objects grows, the size of the MDP search space grows exponentially Robbel et al. (2016).
For MDPs with exchangeable objects, an optimal policy should provide the same action for any permutation of the input. When states are represented as ordered sets, as is common, this must be learned by the policy during training.
Several methods have instead been proposed to enforce this invariance directly through permutation-invariant input representations. The Object Oriented MDP (OO-MDP) framework uses object-class exchangeability to represent states in an order-invariant space for discrete spaces Diuk et al. (2008). Approximately Optimal State Abstractions Abel et al. (2016) proposes a theoretical approximation to extend OO-MDP to continuous domains. Object-Focused Q-learning Cobo et al. (2013) uses object classes to decompose the Q-function output space, though it does not address the input.
Deep Sets Zaheer et al. (2017) proposes a permutation-invariant abstraction method to produce input vectors from exchangeable sets. The proposed method produces a static mapping; that is, each input object is weighted equally, regardless of its value, during the mapping.
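As a minimal sketch of such a static mapping (not the Deep Sets reference implementation), the Python below sums a fixed per-object embedding over the set. The embedding `phi` and its features are hypothetical stand-ins for a learned network; the sketch only illustrates that every object contributes with equal weight and that reordering the input leaves the output unchanged.

```python
from itertools import permutations


def phi(obj):
    """Hypothetical per-object embedding; a learned network in practice."""
    x, y = obj
    return [x, y, x * y]


def sum_pool(objects):
    """Static, equally weighted pooling: element-wise sum of the embeddings."""
    embedded = [phi(o) for o in objects]
    return [sum(column) for column in zip(*embedded)]


objects = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]
pooled = sum_pool(objects)

# Every ordering of the same objects maps to the same pooled vector
# (up to floating-point rounding in the summation order).
invariant = all(
    all(abs(a - b) < 1e-9 for a, b in zip(sum_pool(list(p)), pooled))
    for p in permutations(objects)
)
```

Because the sum treats all embeddings alike, an object that is irrelevant to the current decision influences the pooled vector as much as a critical one.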
Our method improves upon Deep Sets by introducing an attention mechanism to dynamically map the inputs to the permutation-invariant space. Attention mechanisms are used in various deep learning tasks to dynamically filter the input to a downstream neural network, emphasizing the most important parts of the original input Xu et al. (2015); Luong et al. (2015); Jaderberg et al. (2015). We adapt a mechanism from recent work in natural language processing, which uses a dot-product neural layer to efficiently apply dynamic attention Vaswani et al. (2017).
3. Problem Formalization
Our objective is to propose an attention mechanism that will take sets of objects as inputs and produce abstractions such that the mapping is permutation invariant. This output can then be used as an input to the policy neural network in an RL problem (e.g. a deep Q-Network or action policy). Our hypothesis is that learning using this abstract representation will be more sample efficient than learning on the original object set.
We propose the attention network architecture shown in fig. 1, which is a permutation invariant implementation of dot-product attention. For a single input set $X = \{x_1, \dots, x_n\}$, the $n$ object state vectors $x_i$ are individually passed through feed-forward neural networks $f$ and $g$. The scalar outputs of the filter graph $f$ are concatenated into a single vector and the softmax operation is applied. These outputs are then multiplied element-wise by the concatenated outputs of the network $g$. In this way, the output of $f$ acts as the attention filter, weighting the inputs by importance prior to summation. The elements of the weighted vector are then summed over the different objects, resulting in a single vector $y$. This vector is then used as the input to the policy neural network.
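The pooling step described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the architecture of fig. 1: `filter_net` and `value_net` are hypothetical fixed functions standing in for the two learned feed-forward networks, and the two-dimensional object vectors are made up for the example.

```python
import math
from itertools import permutations


def filter_net(x):
    """Hypothetical scalar-output filter network (fixed weights for illustration)."""
    return math.tanh(0.8 * x[0] - 0.3 * x[1])


def value_net(x):
    """Hypothetical vector-output network applied to each object independently."""
    return [math.tanh(x[0] + x[1]), math.tanh(0.5 * x[0] - x[1])]


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(objects):
    """Weight each object's embedding by its softmax score, then sum over objects."""
    weights = softmax([filter_net(x) for x in objects])
    values = [value_net(x) for x in objects]
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]


objects = [(0.2, -1.0), (1.5, 0.3), (-0.7, 0.9)]
y = attention_pool(objects)

# The pooled vector should not depend on the order of the objects.
max_deviation = max(
    abs(a - b)
    for p in permutations(objects)
    for a, b in zip(attention_pool(list(p)), y)
)
```

Note that the pooled vector has the same length for any number of input objects, so a downstream policy network with a fixed input size can consume it unchanged.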
We can now define bounds on the sample efficiency benefits of an invariant mapping. Define a state space $S$ such that each state $s = \{s_1, \dots, s_n\}$, where $n$ is the number of objects. Let each object take on $m$ unique values. Representing the states as ordered sets of $s_i$ results in a state-space size that can be calculated from the expression for $n$-permutations of $m$ values:
$$|S| = \frac{m!}{(m-n)!}$$
If all objects are exchangeable, there exists an abstraction $\tilde{S}$ that is permutation invariant. Since the order does not matter, the size of this abstract state space can then be calculated from the expression for $n$-combinations of $m$ values:
$$|\tilde{S}| = \frac{m!}{n!\,(m-n)!}$$
Using a permutation invariant representation therefore reduces the input space by a factor of $|S| / |\tilde{S}| = n!$ compared to the ordered set representation.
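The size of this reduction can be checked numerically. The sketch below assumes, as the permutation and combination expressions do, that the objects take distinct values; the symbols `m` (number of values an object can take) and `n` (number of objects) and the choice `m = 10` are illustrative only.

```python
import math


def ordered_size(m, n):
    """Number of ordered arrangements of n distinct values drawn from m."""
    return math.perm(m, n)


def unordered_size(m, n):
    """Number of unordered selections of n distinct values drawn from m."""
    return math.comb(m, n)


m = 10  # hypothetical number of values an object can take
reduction = {
    n: ordered_size(m, n) // unordered_size(m, n) for n in range(1, 6)
}
```

For every `n` the ratio equals `math.factorial(n)`, independent of `m`, so five exchangeable objects already shrink the ordered input space by a factor of 120.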
It can be shown that a mapping $h$ is permutation invariant on all countable sets if and only if it can be decomposed using transformations $\phi$ and $\rho$, where $\phi$ and $\rho$ are any vector-valued functions, into the form Zaheer et al. (2017):
$$h(X) = \rho\left(\sum_{x \in X} \phi(x)\right)$$
The proposed attention mechanism can be decomposed into the above form, which proves that it is permutation invariant. For problems with multiple classes of exchangeable objects, a separate attention mechanism can be deployed for each class.
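One way to see this decomposition is sketched below under assumed notation: $f$ denotes the scalar-output filter network, $g$ the vector-valued network applied to each object, $X$ the input set, and $y$ the pooled output. These symbols are labels chosen for this sketch. Writing the softmax-weighted sum explicitly and collecting the numerator and the normalizer into a single summed vector gives the sum decomposition $\rho\left(\sum_{x} \phi(x)\right)$ of Zaheer et al. (2017):

```latex
\begin{aligned}
y &= \sum_{x \in X} \frac{e^{f(x)}}{\sum_{x' \in X} e^{f(x')}}\, g(x)
   = \frac{\sum_{x \in X} e^{f(x)}\, g(x)}{\sum_{x \in X} e^{f(x)}}, \\
\phi(x) &= \begin{bmatrix} e^{f(x)}\, g(x) \\ e^{f(x)} \end{bmatrix}, \qquad
\rho\!\left(\begin{bmatrix} a \\ b \end{bmatrix}\right) = \frac{a}{b}, \\
y &= \rho\!\left(\sum_{x \in X} \phi(x)\right).
\end{aligned}
```

Because $\phi$ is applied to each object independently and summation does not depend on order, $y$ is unchanged under any permutation of $X$.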
4. Experiments and Results
We conducted a series of experiments to validate the effectiveness of our proposed abstraction. The first two tasks are simple MDPs in which a scavenger agent navigates a continuous two-dimensional world to find food particles.
In the scavenger tasks, the state space contains $n$ vectors, where $n$ is the number of target objects. Each vector contains the relative position of one food particle as well as the ego position of the agent. The agent receives one reward upon reaching a food particle and a separate reward at every other time-step. The episode terminates upon reaching a food particle or when the number of time-steps exceeds a limit. Scavenger Task 2 introduces poison particles in addition to the food particles (one poison particle for each food particle). If the agent reaches a poison particle, a separate reward is given and the episode terminates.
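As a rough sketch of how such per-object vectors might be assembled, the Python below builds one vector per food particle. The function name, the field order (relative offset first, then ego position), and the definition of relative position as particle minus agent are all assumptions made for illustration; the text does not specify them.

```python
def build_object_vectors(agent_pos, food_positions):
    """One vector per food particle: relative offset followed by the ego position."""
    ax, ay = agent_pos
    return [(fx - ax, fy - ay, ax, ay) for fx, fy in food_positions]


agent = (1.0, 2.0)
food = [(4.0, 6.0), (0.0, 0.0), (1.0, 5.0)]
vectors = build_object_vectors(agent, food)

# Reordering the food particles only reorders the vectors; the set is unchanged.
same_set = set(vectors) == set(build_object_vectors(agent, list(reversed(food))))
```

A baseline that concatenates these vectors in list order sees a different input for each ordering, whereas a permutation-invariant mapping sees only the set.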
The third task is a convoy protection task with variable numbers of objects. The task requires a defender agent to protect a convoy that follows a predetermined path through a 2D environment. Attackers are spawned at the periphery of the environment during the episode, and the defender must block their attempts to approach the convoy. The state space is the space of vectors representing the state of each non-ego object in the environment. The episode terminates when all convoy members either reach the goal position or are reached by an attacker.
For each task, we trained a set of policies with the attention mechanism as well as baseline policies that use a standard ordered set to represent the input space. Each policy was trained with Proximal Policy Optimization (PPO) Schulman et al. (2017), a policy-gradient algorithm.
For each scavenger task, we trained a policy on tasks having one to five food particles. The baseline policies were unable to achieve optimal performance for tasks with more than two food particles in either scavenger task. The policy trained with our attention mechanism was able to learn an optimal policy for all cases with no increase in the number of required training samples. For the convoy task, the abstracted policy approached optimal behavior after approximately 2,500 epochs, whereas the baseline policy showed no improvement after 10,000 epochs, as shown in fig. 2.
These experiments demonstrate the effectiveness of the proposed approach in enhancing the scalability of the PPO policy-gradient learning algorithm. Together, these experiments validate our hypothesis that leveraging object exchangeability for input representation can improve the efficiency of deep reinforcement learning.
References
- Abel et al. (2016) David Abel, D. Ellis Hershkowitz, and Michael L. Littman. 2016. Near Optimal Behavior via Approximate State Abstraction. In International Conference on Machine Learning (ICML).
- Cobo et al. (2013) Luis C. Cobo, Charles L. Isbell, and Andrea L. Thomaz. 2013. Object Focused Q-learning for Autonomous Agents. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
- Diuk et al. (2008) Carlos Diuk, Andre Cohen, and Michael L. Littman. 2008. An Object-oriented Representation for Efficient Reinforcement Learning. In International Conference on Machine Learning (ICML).
- Gu et al. (2016) Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. 2016. Continuous Deep Q-learning with Model-based Acceleration. In International Conference on Machine Learning (ICML).
- Jaderberg et al. (2015) Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. 2015. Spatial transformer networks. In Advances in Neural Information Processing Systems (NIPS).
- Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
- Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. Nature 518 (2013), 529–533.
- Robbel et al. (2016) Philipp Robbel, Frans Oliehoek, and Mykel Kochenderfer. 2016. Exploiting Anonymity in Approximate Linear Programming: Scaling to Large Multiagent MDPs. In AAAI Conference on Artificial Intelligence (AAAI).
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347 (2017). arXiv:1707.06347 http://arxiv.org/abs/1707.06347
- Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. 2017. Mastering the game of Go without human knowledge. Nature 550 (2017), 354–359.
- Spector and Belongie (2018) Benjamin Spector and Serge J. Belongie. 2018. Sample-Efficient Reinforcement Learning through Transfer and Architectural Priors. CoRR abs/1801.02268 (2018). arXiv:1801.02268 http://arxiv.org/abs/1801.02268
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems (NIPS), I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). 5998–6008.
- Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML).
- Zaheer et al. (2017) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. 2017. Deep Sets. In Advances in Neural Information Processing Systems (NIPS).