Unsupervised reinforcement learning (RL) allows intelligent agents to learn various skills simultaneously without any extrinsic rewards related to specific tasks (Gupta et al., 2018; Eysenbach et al., 2018). Most unsupervised RL methods use a latent-conditioned policy to optimize an information-theoretic objective. The condition of the policy can be associated with a "goal", which is generated randomly, by a prior, or heuristically in order to explore new states in the environment. This approach helps the agent adapt quickly to tasks that require reaching certain goal states. The condition can also be viewed as a latent code of high-level skills or options. The agent is driven to learn distinct skills or options that are discriminable from their states or trajectories (Gregor et al., 2016; Achiam et al., 2018). These papers show that skills learned without supervision help the agent tackle challenging tasks with sparse reward, form an option set for hierarchical RL, and provide a good initialization for further training.
Ideally, these unsupervised skill discovery algorithms could be seamlessly transplanted to multi-agent reinforcement learning (MARL) environments. However, three problems immediately emerge. First, MARL by nature emphasizes interaction and coordination amongst the agents, which skills trained by individual agents do not take into account. How can we train the agents to autonomously focus on the skill of coordination, or their interaction patterns? Second, under the framework of centralized training and decentralized execution, which is adopted because of partial observability, the policies will inevitably converge to suboptimal points, whether with task-specific reward or with unsupervised surrogate reward (Mahajan et al., 2019). Finally, unlike the environments used in single-agent unsupervised RL, the multi-agent environment is highly unstable and volatile from the view of an individual agent. How can the agents retain a discriminable skill?
In this paper, we propose a novel algorithm, called multi-agent skill discovery (MASD), to address skill discovery at the level of coordination amongst multiple agents. Two key ideas underlie the design of MASD. First, we introduce a latent variable shared by all agents and maximize the mutual information between the latent variable and the whole set of states. Second, we place an "information bottleneck" on individual states: we adversarially minimize the mutual information between the latent and the states of any single agent, which forces the policies to learn skills at the higher level of coordination and interaction. For the implementation, we adopt MADDPG, an actor-critic algorithm with centralized training and decentralized execution, to optimize the surrogate objective derived from these two principles.
Our work makes three contributions. First, we propose a method for learning skills in multi-agent environments without supervision. Second, we show that, without the information bottleneck, empowerment degenerates and collapses onto a single agent, both in simple demonstrations and in multi-agent particle environments. Third, we demonstrate that MASD can learn a series of distinguishable coordination skills, and that initializing with good skills can outperform a baseline algorithm on a complex supervised task.
A partially observable Markov decision process (POMDP) is an appropriate model that covers many multi-agent Markov games. A POMDP is formally defined as a tuple whose first component is the set of all possible states of the environment. At each time step, each agent receives its own observation, which is generated from the internal state through an observation function. Following its policy, the agent chooses an action from its action space and sends it to the environment. The environment returns a new state according to the state transition probability distribution, which depends on the tuple of actions from all agents, and generates a scalar reward. After a finite number of steps, the episode terminates. In supervised and decentralized scenarios, the agents improve their policies to maximize their collective expected cumulative discounted reward.
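As a small illustration of the quantity the agents maximize, the discounted return of a single finite episode can be computed as in the sketch below. The function name `discounted_return` and the discount factor name `gamma` are illustrative choices, not notation recovered from the text.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * rewards[t] for one finite episode."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

The expectation in the objective would be estimated by averaging this value over sampled episodes.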
2.2 Mutual information and variational inference
The mutual information $I(X;Y)$ is a general measure of dependency between two random variables $X$ and $Y$. It is defined as the Kullback-Leibler divergence between the joint distribution $p(x,y)$ and the product of the two marginal distributions $p(x)p(y)$:
$$I(X;Y) = D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)\,p(y)\big) = \mathbb{E}_{p(x,y)}\left[\log\frac{p(x,y)}{p(x)\,p(y)}\right]. \tag{1}$$
An alternative expression, $I(X;Y) = H(X) - H(X\mid Y)$, implies that when the mutual information is high, the uncertainty of $X$ is largely reduced given $Y$. Therefore the mutual information can be interpreted as the empowerment of one variable over another. In unsupervised RL, a latent random variable $Z$ is introduced as a condition of the policies. We hope the latent variable can exert control over successive states or trajectories. It is straightforward to set the mutual information between $Z$ and the states $S$, written $I(Z;S)$, as the unsupervised objective. If $I(Z;S)$ is maximized, the behaviour of the agent changes consistently with the value of the latent code.
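To make the two expressions for mutual information concrete, the following sketch evaluates both the KL form and the entropy-difference form on small discrete joint distributions and shows that they agree. The function names and the two example distributions are illustrative assumptions, not part of the method.

```python
import math
from typing import Dict, Tuple

Joint = Dict[Tuple[int, int], float]  # maps (x, y) to p(x, y)


def marginals(joint: Joint):
    """Marginal distributions p(x) and p(y) of a discrete joint."""
    p_x: Dict[int, float] = {}
    p_y: Dict[int, float] = {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return p_x, p_y


def mi_as_kl(joint: Joint) -> float:
    """I(X;Y) = KL(p(x,y) || p(x) p(y)), in bits."""
    p_x, p_y = marginals(joint)
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)


def mi_as_entropy_drop(joint: Joint) -> float:
    """I(X;Y) = H(X) - H(X|Y), in bits."""
    p_x, p_y = marginals(joint)
    h_x = -sum(p * math.log2(p) for p in p_x.values() if p > 0)
    h_x_given_y = -sum(p * math.log2(p / p_y[y])
                       for (x, y), p in joint.items() if p > 0)
    return h_x - h_x_given_y


# Y is an exact copy of a fair bit X: knowing Y removes all uncertainty in X.
copied = {(0, 0): 0.5, (1, 1): 0.5}
# X and Y are independent fair bits: knowing Y says nothing about X.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

if __name__ == "__main__":
    print(mi_as_kl(copied), mi_as_entropy_drop(copied))
    print(mi_as_kl(independent), mi_as_entropy_drop(independent))
```

The copied bit attains the maximum of one bit, while the independent pair gives zero, which matches the reading of mutual information as a reduction in uncertainty.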
However, estimating and optimizing $I(Z;S)$ can be very challenging. By symmetry we have $I(Z;S) = H(Z) - H(Z\mid S)$. When the prior of $Z$ is fixed, maximizing $I(Z;S)$ is equivalent to maximizing the negative conditional entropy $-H(Z\mid S)$. Nevertheless, the posterior distribution $p(z\mid s)$ remains unknown, and we cannot compute it directly because the marginal distribution $p(s)$ is intractable. Fortunately, variational inference provides a variational lower bound of the objective (Blei et al., 2017):
$$I(Z;S) \ge H(Z) + \mathbb{E}_{z,s}\big[\log q_\phi(z\mid s)\big]. \tag{2}$$
In (2), $q_\phi(z\mid s)$ is an approximator of the true posterior $p(z\mid s)$, parameterized by $\phi$. The gap of this inequality is the expected KL divergence between $p(z\mid s)$ and $q_\phi(z\mid s)$, which means that the more precise the approximator is, the tighter the lower bound is.
In conclusion, we use the discriminator log-likelihood $\log q_\phi(z\mid s)$ as the pseudo reward to train the agent. Meanwhile, we train the approximator (which we call the discriminator below) with $(z, s)$ pairs stored in a replay memory.
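The relationship between the bound and its gap can be checked numerically. The sketch below, for a small discrete latent and state, evaluates the variational bound with the exact posterior and with a deliberately crude approximator, and compares the difference to the expected KL divergence. All names and the example distribution are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, Tuple

Pair = Tuple[int, int]  # (z, s)


def true_posterior(joint: Dict[Pair, float]) -> Dict[Pair, float]:
    """p(z | s) obtained from the joint p(z, s) by Bayes' rule."""
    p_s: Dict[int, float] = {}
    for (_, s), p in joint.items():
        p_s[s] = p_s.get(s, 0.0) + p
    return {(z, s): p / p_s[s] for (z, s), p in joint.items()}


def variational_bound(joint: Dict[Pair, float], q: Dict[Pair, float]) -> float:
    """H(Z) + E_{p(z,s)}[log q(z|s)] in nats; q maps (z, s) to q(z | s)."""
    p_z: Dict[int, float] = {}
    for (z, _), p in joint.items():
        p_z[z] = p_z.get(z, 0.0) + p
    h_z = -sum(p * math.log(p) for p in p_z.values() if p > 0)
    expected_log_q = sum(p * math.log(q[k]) for k, p in joint.items() if p > 0)
    return h_z + expected_log_q


def expected_kl(joint: Dict[Pair, float], q: Dict[Pair, float]) -> float:
    """E_{p(s)}[ KL(p(z|s) || q(z|s)) ] in nats."""
    posterior = true_posterior(joint)
    return sum(p * math.log(posterior[k] / q[k]) for k, p in joint.items() if p > 0)


# Hypothetical joint over a binary skill z and a binary state s.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
# With the exact posterior the bound is tight and equals I(Z;S).
exact = variational_bound(joint, true_posterior(joint))
# A cruder discriminator that is less confident than the true posterior.
crude_q = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.4, (1, 1): 0.6}
loose = variational_bound(joint, crude_q)

if __name__ == "__main__":
    print(exact, loose, exact - loose, expected_kl(joint, crude_q))
```

In this toy setting the difference between the two values coincides with the expected KL term, so improving the discriminator is what tightens the bound.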
2.3 Adversarial learning
In previous work such as DIAYN and VIC, the agent and the discriminator evolve together in a cooperative way (Eysenbach et al., 2018). However, in our work, we expect the agents to minimize a certain mutual information term (stated in the next section), which leads the agents and the discriminators to learn adversarially. An earlier work, Generative Adversarial Imitation Learning (GAIL), demonstrates the feasibility of adversarial training in RL (Ho and Ermon, 2016; Song et al., 2018). It lets two entities optimize a minimax objective under which the two opposing entities co-evolve.
3 Skill Discovery of Coordination
In this section, we propose a new method called multi-agent skill discovery (MASD). The algorithm is dedicated to the autonomous discovery of skills of coordinated agents. That is, the acquired skills are not tied to any single agent but reflect different patterns in the agents' coordinating behaviour, which is crucial in cooperative MARL tasks (Lowe et al., 2017).
Inspired by unsupervised skill discovery methods in single-agent RL, a straightforward approach is to condition the policy on a latent variable that is sampled once per episode and shared by all agents, and to maximize the mutual information between the latent and the overall state. However, this raises a tricky issue: unlike in the single-agent configuration, the overall state of the multi-agent environment is unknown to us in most cases. What we have is a set of observations retrieved from the agents distributed across a map. Therefore, we can only use the combination of all observations. In practice, this combination contains redundant information: for example, if two agents are mutually visible, both of their observation vectors include information about the other agent. For this reason, we extract the features necessary for learning from each full-length observation vector through a feature-extraction function. For convenience, we call the extracted features of an agent its "state", although they are not the actual state of the environment.
In summary, the collective objective of all agents is to maximize the mutual information $I(Z;S)$, where the latent $Z$ is interpreted as the skill. With a slight abuse of notation, we use $S$ to denote the set of extracted features. On the one hand, the sampled skill $z$ controls the set of states visited by multiple agents. On the other hand, the agents make the latent skill distinguishable from the states. The mutual information $I(Z;S)$ measures how faithfully the agents obey the instruction $z$. Nevertheless, this does not automatically imply that the latent variable controls the coordination patterns amongst the agents. Maximizing $I(Z;S)$ may result in degeneracy, in which the latent controls the state of only a single agent. This is partially due to the suboptimality trap of decentralized MARL (Mahajan et al., 2019). Furthermore, in a toy experiment, we demonstrate that maximizing $I(Z;S)$ can lead to multiple optimal policies, some of which are degenerate.
3.1 Enforcing the policies out of degeneracy
Intuitively, we do not want the latent to be clearly discriminable from the state of a single agent. Instead, the latent should be forced to exert its control over the relations between the agents. To this end, we reduce the mutual information $I(Z;S_i)$ between the latent and the state of every individual agent $i$, which gives the objective (3): maximize $I(Z;S)$ while minimizing every $I(Z;S_i)$.
The objective is optimized over the policy parameters of all agents. Its first term requires that the skill can be accurately inferred from the combination of all states, and its second term guarantees the opaqueness of $Z$ from individual agents. $I(Z;S)$ has a variational lower bound by (2), using a parameterized global discriminator.
However, $-I(Z;S_i)$ does not have a non-trivial lower bound: with local discriminators, the same variational argument yields only an upper bound on this term.
Therefore we feed the multi-agent policies with the pseudo reward (7), whose first term is computed from the global discriminator and whose second term is computed from the local discriminators.
Meanwhile, we reduce the entropy of the global discriminator and the local discriminators using rollout data. The local discriminators endeavor to distinguish the latent skill code from their own states, whilst the agents maintain a high entropy of the local posteriors to hide the latent from local states. Hence, as the agents learn to perform various skills, the entropy regularizer prevents the skills from degenerating into the behaviour of a single agent.
Multi-agent deep deterministic policy gradient (MADDPG) is an actor-critic MARL algorithm, composed of per-agent actors, each with its own policy, and centralized critics. MADDPG avoids the high variance of classical policy gradient methods and alleviates the difficulties brought by non-stationarity in multi-agent Q-learning. Therefore we choose MADDPG as the basic learning framework to optimize our proposed objective (3). At the same time, we train the global discriminator and the local discriminators with a supervised loss. The overall structure is depicted in Fig. 1. Note that the latent space can be either continuous or categorical. When $z$ is sampled from a uniform categorical distribution, the discriminator is trained with a categorical cross-entropy loss. When $z$ is sampled from a continuous uniform distribution, the discriminator is optimized with an $L_1$ or $L_2$ loss. The choice between $L_1$ and $L_2$ depends on our hypothesis about the distribution family of the posterior: the $L_1$ loss corresponds to a Laplacian distribution and the $L_2$ loss to a Gaussian distribution. Regardless of the latent space, the global discriminator has its own loss and each local discriminator has its own loss. Our method is summarized in Algorithm 1.
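The correspondence between loss and posterior family can be illustrated with a short sketch: the cross-entropy is the negative log-likelihood of a categorical posterior, and the $L_1$ and $L_2$ losses equal the negative log-likelihoods of Laplacian and Gaussian posteriors up to a scale and an additive constant. The function names, the unit scale parameters, and the example vectors are assumptions made for illustration; the discriminator networks themselves are not modeled.

```python
import math
from typing import Sequence


def categorical_cross_entropy(probs: Sequence[float], z_index: int) -> float:
    """-log q(z | s) when `probs` is the discriminator's distribution over skills."""
    return -math.log(probs[z_index])


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum(abs(p - t) for p, t in zip(pred, target))


def l2_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def laplace_nll(pred: Sequence[float], target: Sequence[float], scale: float = 1.0) -> float:
    """Negative log-likelihood of `target` under independent Laplace(pred, scale)."""
    return sum(abs(p - t) / scale + math.log(2.0 * scale) for p, t in zip(pred, target))


def gaussian_nll(pred: Sequence[float], target: Sequence[float], sigma: float = 1.0) -> float:
    """Negative log-likelihood of `target` under independent Normal(pred, sigma**2)."""
    return sum((p - t) ** 2 / (2.0 * sigma ** 2) + 0.5 * math.log(2.0 * math.pi * sigma ** 2)
               for p, t in zip(pred, target))


if __name__ == "__main__":
    pred, z = [0.2, -0.5], [0.0, 0.5]
    # Differences are constants that do not depend on the prediction.
    print(laplace_nll(pred, z) - l1_loss(pred, z))
    print(gaussian_nll(pred, z) - 0.5 * l2_loss(pred, z))
```

Because the remaining terms are constant in the prediction, minimizing the $L_1$ or $L_2$ loss gives the same minimizer as maximizing the corresponding likelihood with a fixed scale.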
In practice we use different variants of the pseudo reward. First, we multiply the second term in (7) by a weighting coefficient to balance global discriminability against local opaqueness. Second, we can replace the mean over agents with the minimum to emphasize the worst case across all agents.
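The variants can be sketched as a single function. The exact form of (7) is not reproduced here, so the sketch assumes one plausible reading: the first term is the global discriminator's log-likelihood of the skill, and the second term aggregates the local discriminators' negative log-likelihoods, so that the agents are rewarded when individual states reveal little about the skill. The names `masd_style_reward`, `beta`, and `aggregate` are hypothetical.

```python
from typing import Sequence


def masd_style_reward(
    global_log_q: float,
    local_log_qs: Sequence[float],
    beta: float = 1.0,
    aggregate: str = "mean",
) -> float:
    """Pseudo reward under the assumed form described above.

    global_log_q: log q(z | s) from the global discriminator.
    local_log_qs: log q_i(z | s_i) from each local discriminator.
    beta:         weighting coefficient on the second (local) term.
    aggregate:    "mean" over agents, or "min" for the worst case.
    """
    # Negative log-likelihood per agent: large when z is hidden from that agent.
    surprisals = [-log_q for log_q in local_log_qs]
    if aggregate == "mean":
        local_term = sum(surprisals) / len(surprisals)
    elif aggregate == "min":
        # The smallest surprisal belongs to the agent that reveals z the most.
        local_term = min(surprisals)
    else:
        raise ValueError("aggregate must be 'mean' or 'min'")
    return global_log_q + beta * local_term


if __name__ == "__main__":
    print(masd_style_reward(-0.1, [-0.7, -0.2], beta=1.0, aggregate="mean"))
    print(masd_style_reward(-0.1, [-0.7, -0.2], beta=1.0, aggregate="min"))
    print(masd_style_reward(-0.1, [-0.7, -0.2], beta=0.0))
```

Setting `beta` to zero recovers a reward that depends on the global discriminator only, which is the setting in which degeneracy is expected.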
4.1 Empowerment degeneracy: a toy example
Consider a one-step game in which two agents each receive a one-bit observation, drawn at random, and each take a one-bit action. The successor observation is then computed by a simple rule. Two typical examples of optimal solutions are listed in Table 1. Both solutions attain the maximum of $I(Z;S)$; however, in solution B the latent controls only the first agent's state. We call this phenomenon "empowerment degeneracy". In contrast, in solution A the latent determines the coordination pattern of the two agents: for one value of the latent the two agents always behave heterogeneously, and for another value they always behave homogeneously, which is precisely a skill of coordination. The key to solution A is minimizing the mutual information between the latent and each individual state, which inspires our proposed objective (3).
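The distinction between the two kinds of solution can be reproduced with a few lines of arithmetic. The sketch below does not implement the game dynamics or Table 1; it only builds two hypothetical joint distributions over a one-bit latent and two one-bit final states, one coordinated (the second state equals the first XOR the latent) and one degenerate (the first state copies the latent), and compares the global and local mutual information.

```python
import math
from typing import Dict, Sequence, Tuple

Outcome = Tuple[int, int, int]  # (z, s1, s2)


def mutual_information(joint: Dict[Outcome, float],
                       a_idx: Sequence[int], b_idx: Sequence[int]) -> float:
    """I(A;B) in bits, where A and B are the outcome coordinates at a_idx and b_idx."""
    p_a: Dict[tuple, float] = {}
    p_b: Dict[tuple, float] = {}
    p_ab: Dict[tuple, float] = {}
    for outcome, p in joint.items():
        a = tuple(outcome[i] for i in a_idx)
        b = tuple(outcome[i] for i in b_idx)
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
        p_ab[(a, b)] = p_ab.get((a, b), 0.0) + p
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in p_ab.items() if p > 0)


# Coordinated (in the spirit of solution A): z = 1 means the agents differ, z = 0 means they match.
coordinated = {(z, s1, s1 ^ z): 0.25 for z in (0, 1) for s1 in (0, 1)}
# Degenerate (in the spirit of solution B): agent 1 copies z, agent 2 is independent noise.
degenerate = {(z, z, s2): 0.25 for z in (0, 1) for s2 in (0, 1)}

if __name__ == "__main__":
    for name, joint in (("coordinated", coordinated), ("degenerate", degenerate)):
        print(name,
              mutual_information(joint, [0], [1, 2]),  # I(Z; S)
              mutual_information(joint, [0], [1]),     # I(Z; S_1)
              mutual_information(joint, [0], [2]))     # I(Z; S_2)
```

Both distributions carry one full bit about the latent in the joint state, but only the degenerate one exposes that bit through a single agent, which is why penalizing the local terms separates the two.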
We test MASD to demonstrate its ability to avoid empowerment degeneracy. The reward is the two-term pseudo reward, with a weighting coefficient on the second term. Results are presented in Fig. 2. Without the second term in the reward, the policies consistently fall into non-coordinating solutions like solution B. With the second term included, MASD succeeds in reaching solution A, except in several imperfect cases.
4.2 Visualization of learned skills
We visualize the learned skills in the OpenAI multi-agent particle environments used by Lowe et al. (2017). Specifically, we apply our method to the "simple spread" task, in which several agents are rewarded for covering all the landmarks while avoiding collisions. We use up to 30 discrete latent codes to explicitly represent the skill and convert each code to a one-hot vector. A curriculum approach similar to that of Achiam et al. (2018) is applied to overcome the training difficulty caused by a large latent space. In brief, we start with a handful of skills and enlarge the skill set whenever the discriminator's performance reaches a high threshold. During skill discovery we set the environment reward to zero, and Fig. 3 shows the trajectories of all agents after 10000 episodes of training. For a clear comparison, we fix the initial states of the environment at test time. The trajectory patterns of different skills differ markedly even though the environment provides no reward.
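A curriculum of this kind can be sketched as a simple update rule on the number of active skills. The threshold value and the growth rule used here are hypothetical placeholders (the multiplicative growth is only loosely modeled on curriculum schemes for option discovery), and `score` stands for whatever discriminator statistic is monitored.

```python
from typing import List, Sequence


def next_skill_count(current: int, max_skills: int, score: float,
                     threshold: float, growth: float = 1.5) -> int:
    """Enlarge the skill set only when the monitored score reaches the threshold."""
    if score < threshold:
        return current
    # Hypothetical growth rule: multiply, add one, and cap at the maximum.
    return min(max_skills, int(growth * current) + 1)


def skill_schedule(scores: Sequence[float], start: int,
                   max_skills: int, threshold: float) -> List[int]:
    """Number of active skills after each evaluation in `scores`."""
    counts = [start]
    for score in scores:
        counts.append(next_skill_count(counts[-1], max_skills, score, threshold))
    return counts


if __name__ == "__main__":
    # The second evaluation falls below the threshold, so the set is not enlarged there.
    print(skill_schedule([0.9, 0.5, 0.9, 0.9, 0.9], start=5, max_skills=30, threshold=0.85))
```

Skills would then be sampled uniformly from the currently active set and encoded as one-hot vectors of the maximum length, so that the network input size stays fixed as the set grows.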
To verify that the diversity of trajectory patterns emerges from the diversity of skills rather than from randomness in the environment or the policy, we add a random disturbance (specified relative to the world width) to the initial positions of the agents and repeat the experiment 100 times. We then compute some properties of all trajectories, shown in Fig. 3. The left part shows the distribution of the smaller two of the included angles between the three trajectories, while the right part shows the length distribution of the shorter two trajectories. Each skill has its own color. The upper two panels use the coefficient setting under which 30 skills are learned, while the lower two panels use a contrasting setting under which only 17 skills are learned. When the local term is taken into account, each skill clusters clearly in both the included angles and the lengths of the trajectories.
4.3 Local entropy of learned skills
[Figure caption, partially recovered: (a) … prediction error across all local discriminators. (b) Behaviour of learned policies. (c) Standard deviation of the last position of agent 1, calculated across 16 initial conditions; each dot represents the standard deviation of one skill. (d) The same as (c), …]
MASD aims to increase the opaqueness of $Z$ from local observations. This should in turn raise the local entropy of skills, that is, the diversity of an individual agent's states under a given skill. We find evidence for this in the "rendezvous" environment. In this environment, the agents are trained with the pseudo reward combined with a weak signal based on the distance from the central point. The weak signal encourages all agents to move towards the central point. As shown in Fig. 4, with the local term included, each skill is more diverse at the level of local states, which indirectly confirms that MASD skills focus on coordination rather than on individual patterns.
4.4 Learning with pretrained models
To examine the role of skills learned by MASD in a specific task, we apply our method to the "simple tag" task, a classical predator-and-prey multi-agent environment that is more complex than "simple spread". We find that models pretrained with MASD perform better than models with random initialization.
Specifically, the goal of the agents is to cooperate in pursuing a randomly moving prey that moves faster than they do. The reward has two parts: a goal-reached reward given when an agent hits the prey, and an auxiliary reward related to the distance between agent and prey. We use MADDPG to train a randomly initialized model and a model initialized with MASD separately. The reward curves of 5 seeds are plotted in Fig. 5. The models initialized with MASD converge to a reward that is 150 higher on average. We also remove the auxiliary reward to make the task more difficult; the reward curves of 5 seeds for this setting are likewise plotted in Fig. 5. Here, models initialized with MASD reach a reward of 700 on average, while randomly initialized models reach only 450 on average. The results suggest that MASD-pretrained models may gain a performance advantage through skill learning.
5 Related work
Reinforcement learning as probabilistic inference in a graphical model has been studied in prior work (Ziebart, 2010; Ziebart et al., 2008; Furmston and Barber, 2010; Levine, 2018). This framework leads to an augmented objective that maximizes entropy, which provides an alternative way to encourage exploration (Haarnoja et al., 2017, 2018b; Liu et al., 2017). In recent papers, a latent space is introduced to model the latent structure of the agent's policy explicitly (Houthooft et al., 2016; Igl et al., 2018; Haarnoja et al., 2018a; Hausman et al., 2018). Since mutual information can be viewed as a measure of empowerment (Mohamed and Rezende, 2015), by maximizing mutual information the agent can learn a set of diverse skills, while the skills encoded as a latent variable remain easy to infer from states or trajectories (Gregor et al., 2016; Achiam et al., 2018; Sharma et al., 2020). DIAYN demonstrates that skills learned without task-specific reward provide a good initialization for subsequent learning, serve as options in hierarchical reinforcement learning, or imitate an expert (Eysenbach et al., 2018). Regarding multi-agent reinforcement learning, Mahajan et al. (2019) adopt a latent policy to implement committed exploration in multi-agent Q-learning algorithms. However, to the best of our knowledge, our approach is the first to address unsupervised skill discovery of coordination in MARL.
Coordination of agents is crucial to MARL, especially in cooperative settings that require agents to reach a collective goal (Cao et al., 2012). Some methods concentrate on the credit or role assignment problem, decomposing the collective reward among the agents (Rashid et al., 2018; Foerster et al., 2018; Le et al., 2017). Other works focus on the mechanism of information exchange, i.e., learning communication protocols between the agents (Sukhbaatar et al., 2016). Instead of a dedicated differentiable communication channel, Lowe et al. (2017) proposed MADDPG, which uses centralized Q-functions that take all actions as input. However, in these methods the coordination patterns largely depend on the nature of the task goal, in cases where reaching the goal requires coordination. Since our paper deals with unsupervised multi-agent environments, the collective optimization objective must be carefully designed to incentivize the autonomous emergence of coordination.
Our method can be interpreted as an "information bottleneck" between the latent variable and the global state. In previous work, the information bottleneck is a regularization technique (Tishby and Zaslavsky, 2015; Peng et al., 2019). Generally speaking, the bottleneck improves generalization and pushes the intermediate representation to be less dependent on the input. In a similar spirit, DIAYN sets a bottleneck between the action $A$ and the skill $Z$ by minimizing $I(A;Z\mid S)$. This technique results in a maximum entropy policy (Eysenbach et al., 2018). According to our theoretical analysis and empirical results, the bottleneck in MASD also leads to more diverse policies with higher entropy.
In this work, we have developed MASD, an algorithm that allows multiple agents to learn various coordination skills without task-specific reward. We show that empowerment degeneracy occurs when maximizing the mutual information between the latent variable and the global state. To obtain skills at the level of coordination, we add a regularizer that increases the opaqueness of $Z$ in individual states $S_i$. Empirically, we demonstrate that our method successfully overcomes empowerment degeneracy while keeping different skills discriminable.
Reducing the mutual information between $Z$ and individual agent states $S_i$ introduces an adversarial term into our objective. As discussed in prior work (Arjovsky et al., 2017), adversarial objectives make training difficult; in our experiments, adversarial training also significantly slows down the skill learning process. To eliminate adversarial learning, one could investigate prior-based methods that learn coordination patterns autonomously, which requires a good representation of the relations among multiple agents. This research is left as future work.
- Variational option discovery algorithms. arXiv preprint arXiv:1807.10299. Cited by: §1, §4.2, §5.
- Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, pp. 214–223. Cited by: §6.
- Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877. Cited by: §2.2.
- An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics 9 (1), pp. 427–438. Cited by: §5.
- Diversity is all you need: learning skills without a reward function. arXiv preprint arXiv:1802.06070. Cited by: §1, §2.3, §5, §5.
- Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §5.
- Variational methods for reinforcement learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 241–248. Cited by: §5.
- Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §3.1.
- Variational intrinsic control. arXiv preprint arXiv:1611.07507. Cited by: §1, §5.
- Unsupervised meta-learning for reinforcement learning. arXiv preprint arXiv:1806.04640. Cited by: §1.
- Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808. Cited by: §5.
- Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1352–1361. Cited by: §5.
- Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §5.
- Learning an embedding space for transferable robot skills. In International Conference on Learning Representations (ICLR). Cited by: §5.
- Generative adversarial imitation learning. In Advances in neural information processing systems, pp. 4565–4573. Cited by: §2.3, §3.1.
- VIME: variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109–1117. Cited by: §5.
- Deep variational reinforcement learning for POMDPs. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 2117–2126. Cited by: §5.
- Coordinated multi-agent imitation learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1995–2003. Cited by: §5.
- Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909. Cited by: §5.
- Stein variational policy gradient. arXiv preprint arXiv:1704.02399. Cited by: §5.
- Multi-agent actor-critic for mixed cooperative-competitive environments. Neural Information Processing Systems (NIPS). Cited by: §3, §4.2, §5.
- MAVEN: multi-agent variational exploration. In Advances in Neural Information Processing Systems, pp. 7611–7622. Cited by: §1, §3, §5.
- Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pp. 2125–2133. Cited by: §5.
- Variational discriminator bottleneck: improving imitation learning, inverse RL, and GANs by constraining information flow. In International Conference on Learning Representations (ICLR). Cited by: §5.
- QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 4295–4304. Cited by: §5.
- Dynamics-aware unsupervised discovery of skills. In International Conference on Learning Representations (ICLR). Cited by: §5.
- Multi-agent generative adversarial imitation learning. In Advances in neural information processing systems, pp. 7461–7472. Cited by: §2.3.
- Learning multiagent communication with backpropagation. In Advances in neural information processing systems, pp. 2244–2252. Cited by: §5.
- Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5. Cited by: §5.
- Maximum entropy inverse reinforcement learning. In AAAI, Vol. 8, pp. 1433–1438. Cited by: §5.
- Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Ph.D. Thesis, Carnegie Mellon University, USA. Cited by: §5.