One of the most complex challenges in autonomous robotics concerns behaviors that require complex interaction between robots and humans, as well as among the robots themselves. Learning in these settings is particularly challenging because the learner cannot assume that the other robots/agents it interacts with follow a stable strategy. To keep complexity manageable, most research in this field is tested on simple problems, such as cooperative communication or predator-prey scenarios, in which the robot has to interact with only one or two types of other robots/agents.
In this paper we consider a more complex problem with a practical application: a VIP navigating a crowded public area. We assume that the bystanders are purposefully moving from landmark to landmark, following culture-specific crowd navigation protocols [Bölöni et al.2013] for giving way or asserting their right of way. The goal of a team of robot bodyguards is to physically protect the VIP from assault by appropriately positioning themselves relative to the current position of the VIP.
The goal of the bodyguards can be stated as the minimization of a metric of the threat to the VIP, a problem made challenging by the complexity of the environment. First, each bodyguard robot must adapt its position to the current position of the VIP. Second, its movement must also depend on the movement of the bystanders: if a bystander is heading towards the VIP, the robot needs to position itself to reduce the threat to the VIP. Finally, the robots need to coordinate with each other. This can be achieved through central planning, through explicit communication by exchanging messages, or through implicit communication, with each robot taking actions based on a world view that includes the other robots.
While it is possible to hand-engineer robot bodyguard behaviors [Bhatia et al.2016], it is of interest whether such behavior can be learned through reinforcement learning.
2 Problem Formulation
The reward structure of the robotic bodyguard team problem can be modeled as a cooperative Markov game where multiple agents are learning a policy to maximize their rewards. A multi-agent MDP for $N$ agents can be defined as a combination of a state space $\mathcal{S}$, action spaces $\mathcal{A}_1, \ldots, \mathcal{A}_N$, the transition probabilities that are defined as $\mathcal{T} : \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \rightarrow \mathcal{S}$, and the reward function for every agent $i$ that is defined as $r_i : \mathcal{S} \times \mathcal{A}_i \rightarrow \mathbb{R}$.
For the sake of simplicity, we are assuming that all the bodyguards have an identical state and action space. We are considering a finite horizon problem where each episode is terminated after $T$ steps.
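As a concrete illustration, the tuple above can be sketched as a minimal step-based interface. Everything in this sketch (the class name, the toy dynamics, and the toy rewards) is a didactic assumption, not the environment we actually use:

```python
# Minimal Markov-game sketch: N agents act jointly, each receives its own
# reward, and the episode ends after a finite horizon T. The transition and
# reward functions here are toy placeholders.
class MarkovGame:
    def __init__(self, n_agents, horizon):
        self.n = n_agents          # number of agents N
        self.horizon = horizon     # finite horizon T
        self.t = 0
        self.state = 0.0

    def step(self, actions):
        assert len(actions) == self.n
        self.t += 1
        # transition T(s, a_1, ..., a_N) -> s': toy placeholder dynamics
        self.state += sum(actions)
        # per-agent reward r_i(s, a_i): toy placeholder
        rewards = [-abs(a) for a in actions]
        done = self.t >= self.horizon   # episode terminates after T steps
        return self.state, rewards, done

game = MarkovGame(n_agents=3, horizon=25)
done = False
while not done:
    state, rewards, done = game.step([0.1, -0.1, 0.0])
```

The point of the interface is that `step` takes one action per agent and returns one reward per agent; this joint-action signature is what distinguishes a Markov game from a single-agent MDP.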
The environment that provides the rewards is a two-dimensional space, with the usual rules of physical movement, landmarks, line-of-sight and communication. We used the Multi-Agent Particle Environment (MPE) [Mordatch and Abbeel2017] to perform our experiments. MPE is a two-dimensional physically simulated environment that runs in discrete time with continuous state and action spaces. The environment consists of $N$ agents and $M$ landmarks possessing physical attributes such as location, velocity and mass. MPE also allows the agents to communicate via verbal utterances over the communication channel embedded in the environment.
The state of each agent comprises the physical state of all the entities in the environment and the verbal utterances of all the agents. Formally, the state of agent $i$ is defined as
$$ s_i = \left[ x_{1,i}, \ldots, x_{N+M,i}, c_1, \ldots, c_N \right] $$
where $x_{j,i}$ is the observation of entity $j$ from the perspective of agent $i$ and $c_j$ is the verbal utterance of agent $j$.
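Assembling such an observation vector can be sketched as follows; the function and attribute names are hypothetical, but the sketch follows the MPE convention of expressing entity positions relative to the observing agent:

```python
# Sketch: build agent i's observation from the physical state of every
# entity (positions taken relative to the observing agent) plus the
# verbal utterances c_1 .. c_N of all agents.
def observation(agent_pos, entity_positions, utterances):
    obs = []
    for (ex, ey) in entity_positions:
        # x_{j,i}: entity j's position from the perspective of agent i
        obs += [ex - agent_pos[0], ey - agent_pos[1]]
    obs += list(utterances)
    return obs
```

With 2-D positions and scalar utterances, the resulting vector has length $2(N+M) + N$, growing linearly with the number of entities.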
2.1 Reward Function
[Bhatia et al.2016] defined a metric $TL(\mathrm{VIP}, b, t)$ that quantifies the threat to the VIP from each crowd member $b$ at each timestep $t$. We extended the metric to form a reward function defined as
$$ r_i(t) = -\sum_{b \in B} TL(\mathrm{VIP}, b, t) \;-\; \mathbb{1}\!\left[ d(i, \mathrm{VIP}) < d_{\min} \right] \;-\; \mathbb{1}\!\left[ d(i, \mathrm{VIP}) > d_{\mathrm{safe}} \right] \;-\; c \cdot \mathbb{1}\!\left[ \text{agent } i \text{ uttered} \right] $$
where $d_{\min}$ is the minimum distance the agents have to maintain at every timestep, $d_{\mathrm{safe}}$ is the safe distance and $c$ is a small penalty to the bodyguard for an utterance.
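The per-bodyguard reward computation can be sketched as below. The functional shape, the default constants and the threat values are illustrative assumptions for exposition, not the values used in training:

```python
# Sketch of a per-bodyguard reward: negated total threat, penalties for
# violating the minimum / safe distances, and a small utterance penalty c.
# All constants and threat values here are illustrative assumptions.
def bodyguard_reward(threat_levels, dist_to_vip, uttered,
                     d_min=0.5, d_safe=2.0, c=0.01):
    reward = -sum(threat_levels)   # threat TL(VIP, b, t) from each bystander
    if dist_to_vip < d_min:        # too close: minimum-distance penalty
        reward -= 1.0
    if dist_to_vip > d_safe:       # too far to protect the VIP
        reward -= 1.0
    if uttered:                    # discourage unnecessary communication
        reward -= c
    return reward
```

Because every bodyguard is penalized by the same threat term, the reward is cooperative: reducing the threat to the VIP helps the whole team.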
The bodyguards were trained using multi-agent deep deterministic policy gradients (MADDPG) [Lowe et al.2017], a multi-agent extension of the DDPG algorithm described in [Lillicrap et al.2015]. MADDPG allows agents to have individual policies, but trains them with a centralized Q-function. The gradient of each policy $\mu_i$ with parameters $\theta_i$ is written as
$$ \nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim \mathcal{D}} \left[ \nabla_{\theta_i} \mu_i(a_i \mid o_i) \, \nabla_{a_i} Q_i^{\mu}(x, a_1, \ldots, a_N) \big|_{a_i = \mu_i(o_i)} \right] $$
where $Q_i^{\mu}(x, a_1, \ldots, a_N)$ is a centralized action-value function that takes the actions of all the agents, in addition to the state, to estimate the Q-value for agent $i$. The primary motivation behind MADDPG is that knowing the actions of all the other agents makes the environment stationary from the perspective of each agent, even as the other agents' policies change.
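This deterministic policy-gradient update can be illustrated with a dependency-free toy: a linear policy improved by ascending $\nabla_{\theta} \mu(o) \cdot \nabla_a Q$, with a hand-coded centralized critic standing in for the learned $Q_i$. Everything below is a didactic sketch, not the MADDPG implementation of [Lowe et al.2017]:

```python
# Toy MADDPG-style update for agent 0: a linear deterministic policy
# mu(o) = theta . o, and a centralized critic that sees all actions.
# The critic is hand-coded so dQ/da_i is known analytically.
def policy(theta, obs):
    return sum(t * o for t, o in zip(theta, obs))

def dq_da(state_mean, actions):
    # toy critic Q = -(sum(actions) - state_mean)^2, hence dQ/da_i:
    return -2.0 * (sum(actions) - state_mean)

def policy_gradient(theta, obs, state_mean, other_actions):
    # grad_theta J = (d mu / d theta) * dQ/da_i, evaluated at a_i = mu(o_i)
    a_i = policy(theta, obs)
    g = dq_da(state_mean, [a_i] + other_actions)
    return [g * o for o in obs]   # chain rule: d mu / d theta = obs

theta = [0.0, 0.0, 0.0]
obs = [1.0, 0.5, -0.5]
for _ in range(200):   # gradient ascent on J
    grad = policy_gradient(theta, obs, state_mean=2.0, other_actions=[0.0, 0.0])
    theta = [t + 0.05 * g for t, g in zip(theta, grad)]
```

Note that the gradient is taken only through agent 0's own action, while the other agents' actions enter the critic as fixed inputs; this is exactly what makes the critic "centralized" and the learning target stationary.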
The experiments were conducted in an environment created in MPE, simulating a crowded mall with 12 landmarks representing stores. The VIP follows its own policy of moving towards a goal landmark. 10 bystanders follow their own policies, which involve sequentially visiting randomly chosen landmarks. 3 bodyguards were deployed to protect the VIP from physical assault.
The bodyguard agents were trained over 10,000 episodes, each episode being limited to 25 steps, using both DDPG and MADDPG. After training, performance was measured both empirically (by observing the movement of the bodyguards) and by recording the cumulative threat over the episode as defined in [Bhatia et al.2016]. An example of the achieved behavior is shown in Figure 1. As the screenshot shows, the robots learned to surround and move with the VIP, concentrating on the sides where bystanders are closer. The empirical evaluation showed better performance with MADDPG compared to DDPG.
In this paper, we outlined a technique for training a multi-robot team of bodyguards collaborating towards the common goal of providing physical security to a VIP moving in a crowded environment. We trained the robot behaviors using MADDPG reinforcement learning, and we observed the emergence of recognizable collaborative bodyguard behavior without any explicit instruction on how to provide security.
- [Bhatia et al.2016] T. S. Bhatia, G. Solmaz, D. Turgut, and L. Bölöni. Controlling the movement of robotic bodyguards for maximal physical protection. In Proc. of the 29th International FLAIRS Conference, pages 380–385, May 2016.
- [Bölöni et al.2013] Ladislau Bölöni, S. Khan, and Saad Arif. Robots in crowds: being useful while staying out of trouble. In Proc. of Intelligent Robotic Systems Workshop (IRS-2013) at AAAI, pages 2–7, 2013.
- [Lillicrap et al.2015] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
- [Lowe et al.2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Neural Information Processing Systems (NIPS), 2017.
- [Mordatch and Abbeel2017] Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908, 2017.