GALOPP: Multi-Agent Deep Reinforcement Learning For Persistent Monitoring With Localization Constraints

09/14/2021, by Manav Mishra et al.

Persistently monitoring a region under localization and communication constraints is a challenging problem. In this paper, we consider a heterogeneous robotic system consisting of two types of agents – anchor agents that have accurate localization capability, and auxiliary agents that have low localization accuracy. Each auxiliary agent must be within the communication range of an anchor agent, directly or indirectly, to localize itself. The objective of the robotic team is to minimize the uncertainty in the environment through persistent monitoring. We propose a multi-agent deep reinforcement learning (MADRL) based architecture with graph attention called Graph Localized Proximal Policy Optimization (GALOPP), which incorporates the localization and communication constraints of the agents along with the persistent monitoring objective to determine motion policies for each agent. We evaluate the performance of GALOPP on three different custom-built environments. The results show that the agents are able to learn a stable policy and outperform greedy and random search baseline approaches.


I Introduction

The visibility-based Persistent Monitoring (PM) problem involves the continuous surveillance of a bounded environment by a team of robots [27, 25, 23, 5, 12, 6]. Several applications, such as search and rescue, border patrol, and critical infrastructure protection, require persistent monitoring to obtain timely information. We study the problem of planning trajectories for each agent in a multi-robot system to persistently monitor an environment. As the environment becomes increasingly complex, it becomes challenging to monitor it using deterministic coordination strategies for PM [6]. Therefore, there is a need to develop strategies that allow the agents to learn how to navigate in such environments. One such approach is to use Multi-Agent Deep Reinforcement Learning (MADRL) algorithms to determine the policies for the individual agents to navigate in the environment [2].

In this paper, we consider a scenario where a team of robots, each equipped with a limited field-of-view (FOV) sensor, is deployed to monitor an environment as shown in Figure 1. We assume the environment is GPS-denied and hence deploy two types of agents in the team – anchor and auxiliary agents. The anchor agents are assumed to carry expensive sensors, such as tactical-grade IMUs, UWB ranging, or cameras/LIDARs, and enough computational power to carry out onboard SLAM, providing accurate localization with very low position uncertainty. The auxiliary agents can localize only through relative localization. They use the notion of cooperative localization [28, 13, 22], communicating with the anchor agents directly or indirectly through other auxiliary agents, and hence have uncertainty in their positional beliefs. Further, the agents have a finite communication range, which restricts the auxiliary agents' motion [7]; consequently, the agents may not be able to monitor the complete region, since any communication disconnection from the anchor agents results in poor localization and degrades coverage accuracy. However, intermittent disconnection allows the agents to cover additional regions and then recover from the localization uncertainty upon reconnection. This conflicting objective of monitoring the complete area while periodically maintaining connectivity with the anchor agents makes the problem of determining persistent monitoring strategies for the agents challenging.

Fig. 1: Persistent monitoring using anchor and auxiliary agents with FOV, localization, and communication range constraints.

In this paper, we propose a MADRL algorithm with a graph-attention-based architecture, called Graph Localized Proximal Policy Optimization (GALOPP), to perform persistent monitoring with such heterogeneous agents subject to localization and communication constraints. As the communication constraints also affect the localization of the auxiliary agents, the graph-attention-based architecture effectively accounts for agent-to-agent connectivity as the agents perform surveillance of the environment.

For persistent monitoring, the environment is modelled as a two-dimensional discrete grid. Each cell in the grid is assigned a penalty. When a cell is within the sensing range of any agent, its penalty is reset to zero; otherwise, the penalty accumulates over time. Thus, the agents must learn motion strategies such that the net penalty accumulated over a time period is minimized, which corresponds to efficient persistent monitoring.
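To make the penalty dynamics concrete, the following is a minimal sketch of one update step, assuming a NumPy array of (positive) accumulated penalties and the decay rate and penalty cap listed in Table V; the function and variable names are illustrative and not the authors' implementation.

import numpy as np

def update_penalties(penalty, covered_mask, beta=1.0, max_penalty=400.0):
    """One time step of the grid penalty dynamics.

    penalty      : (m, n) array of accumulated penalties (stored positive here;
                   the paper stores them as negative rewards).
    covered_mask : (m, n) boolean array, True where a localized agent senses the cell.
    """
    return np.where(covered_mask, 0.0, np.minimum(penalty + beta, max_penalty))

# Example: one agent covers a 3x3 patch; after 5 steps the uncovered cells
# have accumulated a penalty of 5 while the covered cells stay at 0.
grid = np.zeros((10, 10))
covered = np.zeros((10, 10), dtype=bool)
covered[3:6, 3:6] = True
for _ in range(5):
    grid = update_penalties(grid, covered)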

The main contribution of this paper is GALOPP, a graph-attention-based MADRL architecture that determines policies for the agents while taking the localization and communication constraints and the robot heterogeneity into account. To validate the performance of GALOPP, we develop a custom simulation environment and compare GALOPP against random and greedy baseline strategies.

II Related Work

In the literature, persistent monitoring and cooperative localization have been addressed as individual topics of interest, and few works consider these two aspects jointly. In the mobile variant of the Art Gallery Problem (AGP) [17], the goal is to determine an upper bound on the minimum number of agents required to patrol paths/segments so as to minimize the time taken to cover the entire area. The objective of the Watchman Route Problem (WRP) [24] is to find the minimum-length trajectory for a watchman (robot) to cover every point inside an input polygon. In addition, there have been several deterministic variants of multi-robot planning algorithms for such environments [27, 25, 12, 4]. In general, deterministic visibility coverage problems are NP-hard, and these approaches provide approximations of the optimal solution. However, they determine static policies that do not take communication and localization constraints into account and hence cannot be directly used for the current problem.

MADRL-based approaches have been proposed for cooperative multi-agent tasks covering a wide spectrum of applications [1, 18, 15, 10, 21, 11]. In [1], inter-agent communication for self-interested agents is studied for cooperative path planning, but complete connectivity is assumed throughout, without factoring in localization constraints. In [18], the notion of MARL under partial observability was formalized; our scenario is also partially observable, as the agents have limited sensing range. In [2], a method is developed to find trajectories for each agent to continuously cover the area; however, the agents have no communication or localization constraints. In [11], a message-aware attention mechanism is proposed that assigns relative importance to the messages received for a coverage application. Compared to [11], persistent monitoring with localization constraints, as addressed in this paper, introduces additional complexity.

For determining coordination strategies, it is essential for robots to incorporate the positional beliefs of surrounding agents and obstacles. Several prior works describe methods to achieve cooperative localization under limited inter-vehicle connectivity [28, 13, 14, 16, 3]. We incorporate the idea of cooperative localization using a Kalman Filter to localize the auxiliary agents.

III Problem Statement

III-A Formal description of PM

We consider the PM problem on a 2D grid world $\mathcal{G}$ of size $m \times n$. Each grid cell $(i, j)$, with $1 \le i \le m$ and $1 \le j \le n$, has a reward $R_{ij}(t)$ associated with it at time $t$. When the cell is within the sensing range of an agent, $R_{ij}(t) = 0$; otherwise, the reward decays with a decay parameter $\beta$ until it reaches a minimum value of $-R_{\max}$. We consider negative rewards, as they represent the penalty on a cell for not being monitored. At time $t+1$, $R_{ij}(t+1) = \max\{R_{ij}(t) - \beta,\ -R_{\max}\}$ if cell $(i, j)$ is not monitored at time $t+1$; else $R_{ij}(t+1) = 0$ if $(i, j)$ was monitored at time $t+1$ [2]. As the rewards are modelled as penalties, the objective of the PM problem is to find a policy for the agents that minimizes the neglect time, which in turn minimizes the total penalty accumulated by the $N$ agents over a finite time horizon $T$. The optimal policy is given as

$$\pi^{*} = \arg\max_{\pi} \sum_{t=0}^{T} R^{\pi}(t), \qquad (1)$$

where $\pi^{*}$ is an optimal global joint-policy that dictates the actions of the agents in the multi-agent system, and $R^{\pi}(t) = \sum_{i,j} R_{ij}^{\pi}(t)$ is the reward obtained at time $t$ by following policy $\pi$.

III-B Localization for Persistent Monitoring

The grid consists of $N$ agents performing the monitoring task. Each agent observes a sub-grid of size $s \times s$ (with $s < \min(m, n)$) centred at its own position and has a communication range $\rho$. At every time step, a connectivity graph is generated between the agents: an edge is formed between agents $u$ and $v$ if they are separated by a distance within the communication range, i.e., $\lVert p_u - p_v \rVert \le \rho$, where $p_u$ and $p_v$ are the agent positions. The connectivity of any agent with an anchor agent is checked using the Depth-First Search (DFS) algorithm. Each agent estimates its position using a Kalman Filter (KF). As the anchor agents have high-end IMUs, their position uncertainty is negligible. The auxiliary agents can localize accurately only if they are connected to an anchor agent, either directly or indirectly ($n$-hop connection), using cooperative localization techniques [22].
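As a concrete illustration of this connectivity check, the following is a minimal sketch assuming Euclidean agent positions and a stack-based DFS; all names are illustrative and not taken from the authors' code.

import numpy as np

def build_graph(positions, comm_range):
    """Adjacency list: agents u, v are connected if ||p_u - p_v|| <= comm_range."""
    n = len(positions)
    adj = {u: [] for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if np.linalg.norm(np.asarray(positions[u]) - np.asarray(positions[v])) <= comm_range:
                adj[u].append(v)
                adj[v].append(u)
    return adj

def localized_agents(adj, anchor_ids):
    """DFS from every anchor; any agent reached (directly or via hops) is localized."""
    visited = set()
    stack = list(anchor_ids)
    while stack:
        u = stack.pop()
        if u in visited:
            continue
        visited.add(u)
        stack.extend(adj[u])
    return visited

# Example: agent 2 is localized only through auxiliary agent 1 (2 hops from anchor 0).
positions = [(0, 0), (15, 0), (30, 0)]
adj = build_graph(positions, comm_range=20)
print(localized_agents(adj, anchor_ids=[0]))  # {0, 1, 2}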

Each agent can observe $s \times s$ cells. As the anchor agents are always localized, they can update the rewards in the grid world $\mathcal{G}$, that is, set $R_{ij}(t) = 0$ for the cells they observe. The auxiliary agents that are connected to the anchor agents, either directly or indirectly, can also update the rewards. However, auxiliary agents that are disconnected from the anchor agents can observe the world but cannot update the rewards, due to the localization uncertainty associated with the growth of the vehicle's covariance. When the vehicle reconnects with the anchor vehicle network, its uncertainty reduces and it can then update the rewards for the cells it observes; the part of the world that the auxiliary agent observed during the disconnection is then considered. The position and covariance update mechanism is given in Appendix A.

An interesting aspect of solving Equation (1) to determine policies for the agents is that it does not explicitly assume that the communication graph is connected at all times. Although a strict connectivity constraint increases the global positional belief of the entire team, it reduces the team's ability to persistently monitor an arbitrary region. Intermittent connectivity between agents leads to better exploration of the area and allows more flexibility [8], [7]. Auxiliary agents, once disconnected, do not contribute to the net reward obtained by the team. Since the objective is to find a policy that maximizes the reward, the agents learn that connectivity increases the reward and hence that they should stay connected. Through the rewards, the connectivity constraints are implied indirectly rather than hard-coded into the agents' decision-making policies. The localization constraints are enforced through the connectivity graph.

IV Graph Localized PPO - GALOPP

IV-A Architectural overview

Fig. 2: (a) Schematic representation of the GALOPP architecture; each individual agent block represents an actor-critic model. (b) The mini map is an image of the entire environment resized to the local-map size; the local map is a slice of the environment centered around the agent. The mini map and local map are concatenated to form the input for each agent.

The GALOPP architecture, shown in Figure 2(a), consists of a multi-agent actor-critic model that implements Proximal Policy Optimization (PPO) [20] to determine individual agent trajectories. Each agent's observation space is the shared global reward map passed to it. The decentralized actor of each agent takes the generated embedding to learn the policy, while a centralized critic updates the overall value function of the environment. The model uses a Convolutional Neural Network (CNN) [9] to generate the individual embeddings, which are then augmented with each agent's positional mean $\mu_i$ and covariance $\Sigma_i$; this serves as the complete information about the agent's current state. A Graph Attention Network (GAT) [26] relays messages over the generated connectivity graph to realize inter-agent communication. The model is trained end-to-end for the PM problem. The local computation consists of updating the local map and updating the mean and covariance of the position; the central computation consists of computing the joint policy and updating the global map. The components of the GALOPP architecture are described in the subsections below.

IV-B Embedding extraction and message passing

The GALOPP model takes the shared global reward values of the 2D grid as input. The observation of an agent at time $t$ consists of the set of cells within its sensing range (termed the local map) and a compressed image of the current grid (termed the mini map) whose pixel values equal the penalties accumulated by the grid cells [2]. The mini map is resized to the shape of the agent's local map and concatenated with it to form a 2-channel image (shown in Figure 2(b)). This forms the sensing observation input for the model at time $t$. The CNN converts this observation into a low-dimensional feature vector $e_i$, termed the embedding vector. The positional mean $\mu_i$ and covariance matrix $\Sigma_i$ of agent $i$ are then flattened, and their elements are concatenated with $e_i$ to generate a new information vector $h_i$ (as represented in Figure 2(a)).

The agents are heterogeneous (anchor and auxiliary), and the localization information is a parameter that is aggregated in the graph network component of GALOPP. Since the anchor agents are superior with respect to localization capabilities, one would typically expect a weighted aggregation of the message embeddings. The aggregated information vector of an agent depends on its current position in the environment, the generated message embedding, and the localization status of each neighbouring agent. For this reason, an attention mechanism is used to compute the attention parameters in the GALOPP model.

The GAT is used to transfer the information vectors between agents connected in the communication graph. Agents $u$ and $v$ are connected if $\lVert p_u - p_v \rVert \le \rho$. Each agent takes a weighted average of the embeddings of its neighbouring agents. The attention parameter $\alpha_{uv}$ gives an implicit weight that assigns an attention value to each edge in the graph. The attention parameter is computed as [26]

$$\alpha_{uv} = \frac{\exp\big(\mathrm{LeakyReLU}\big(a^{\top}[\,W h_u \,\Vert\, W h_v\,]\big)\big)}{\sum_{k \in \mathcal{N}(u)} \exp\big(\mathrm{LeakyReLU}\big(a^{\top}[\,W h_u \,\Vert\, W h_k\,]\big)\big)}, \qquad (2)$$

where $h_v$ is the information vector of agent $v$; $\mathcal{N}(u)$ is the neighbourhood set of agent $u$; $W$ is the corresponding weight matrix for the inputs; $a$ is a single-layer feedforward function known as the attention mechanism; $K$ is the number of attention heads in the network, used to stabilize the training process and to capture complex parameter relations; and $\mathrm{LeakyReLU}$ is the activation function applied to the output of $a$. After the message passing, the aggregated information vector for each agent is given as

$$h_u' = \sigma\!\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{v \in \mathcal{N}(u)} \alpha_{uv}^{k}\, W^{k} h_v\right). \qquad (3)$$
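The following is a minimal single-head PyTorch sketch of the attention aggregation in Equations (2)-(3) (multi-head averaging and the output activation are omitted for brevity); the tensor names and toy sizes are illustrative and not the GALOPP implementation.

import torch
import torch.nn.functional as F

def gat_aggregate(h, adj, W, a):
    """Single-head GAT aggregation mirroring Eqs. (2)-(3).

    h   : (N, d_in) information vectors of the N agents.
    adj : (N, N) boolean adjacency of the communication graph (self-loops included).
    W   : (d_in, d_out) shared linear transform.
    a   : (2 * d_out,) attention vector of the single-layer feedforward scorer.
    """
    Wh = h @ W                                             # (N, d_out)
    N = Wh.shape[0]
    # Pairwise scores e_uv = LeakyReLU(a^T [Wh_u || Wh_v]).
    pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                       Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
    e = F.leaky_relu(pairs @ a)                            # (N, N)
    e = e.masked_fill(~adj, float('-inf'))                 # restrict to neighbours
    alpha = torch.softmax(e, dim=1)                        # Eq. (2)
    return alpha @ Wh                                      # Eq. (3), single head

# Example with 3 agents and 38-dimensional information vectors (as in Table II).
torch.manual_seed(0)
h = torch.randn(3, 38)
adj = torch.tensor([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=torch.bool)
out = gat_aggregate(h, adj, torch.randn(38, 38), torch.randn(2 * 38))
print(out.shape)  # torch.Size([3, 38])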

IV-C Multi-agent actor critic method using PPO

The goal of Proximal Policy Optimization (PPO) is to incorporate the trust-region policy constraint [19] into the objective function. Multi-agent PPO is preferred over other policy-gradient methods to avoid large policy updates and to achieve more stable learning in monitoring tasks. The decentralized actor of each agent in the multi-agent PPO takes the aggregated information vector $h_u'$ and generates the corresponding action probability distribution. The centralized critic estimates the value function of the environment to influence the policies of the individual actors. The shared reward for all agents is defined in Equation (1).

The multi-agent PPO algorithm [2] is modified for GALOPP. For a defined episode length $T$, each agent interacts with the environment to generate and collect the trajectory values in the form of states, actions, and rewards $(s_t, a_t, r_t)$. The stored values are then sampled iteratively to update the action probabilities and to fit the value function through back-propagation. The PPO gradient expression is given as

$$\hat{g} = \hat{\mathbb{E}}_t\!\left[\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \hat{A}_t\right], \qquad (4)$$

where the advantage estimate $\hat{A}_t$ is defined as the difference between the discounted sum of rewards ($\sum_{t' \ge t} \gamma^{t'-t} r_{t'}$) and the state-value estimate ($V(s_t)$), i.e., $\hat{A}_t = \sum_{t' \ge t} \gamma^{t'-t} r_{t'} - V(s_t)$. The clipped surrogate objective function for a single PPO agent is given by

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t\right)\right],$$

where $r_t(\theta)$ is the ratio of the action probability under the current policy to the action probability under the previous policy distribution for trainable parameters $\theta$, and the clip function limits the probability ratio to the trust-region interval $[1-\epsilon,\, 1+\epsilon]$. The modified multi-agent PPO objective function to be minimized in the GALOPP network is given as

$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N} L_i^{CLIP}(\theta). \qquad (5)$$
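The clipped surrogate term can be sketched in a few lines of PyTorch; the function below is illustrative (the names and the per-agent averaging comment are assumptions, not the authors' training code).

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective for one agent (negated for minimization).

    logp_new   : log-probabilities of the taken actions under the current policy.
    logp_old   : log-probabilities under the policy that collected the data (detached).
    advantages : advantage estimates, e.g. discounted return minus V(s).
    eps        : trust-region clip parameter.
    """
    ratio = torch.exp(logp_new - logp_old)                    # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    return -surrogate.mean()

# The multi-agent objective of Eq. (5) would then average this loss over the N agents:
# total_loss = sum(ppo_clip_loss(lp_new[i], lp_old[i], adv[i]) for i in range(N)) / N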

V Experiments and analysis

To evaluate the performance of GALOPP, we custom-built three environments: a 2-room map, a 4-room map, and an open-room map, as shown in Figure 3. The agents have a limited sensing range. We use the accumulated penalty metric for the evaluation and also evaluate the effect of the communication range on performance. The models in this paper were trained and tested using Python 3.6 on a server running the Ubuntu 20.04 LTS operating system, with an Intel(R) Core(TM) i9 CPU and an Nvidia GeForce RTX 3090 GPU (running on CUDA 11.1 drivers). The neural networks were written and trained using PyTorch 1.8 and dgl-cu111 (Deep Graph Library). Due to space restrictions, we provide some key results.

Training: For the 2-room map, training is carried out for 30000 episodes; for the 4-room map, 50000 episodes; and for the open-room map, 30000 episodes, with each episode 1000 time steps long. The penalties in the grid cells are updated with a decay rate of $\beta = 1$, and the maximum penalty a cell can have is $R_{\max} = 400$. The total reward at time $t$ is defined as $R(t) = \sum_{i,j} R_{ij}(t)$. For every training episode, the agents are initialized at random locations in the environment, but localized.

The GALOPP architecture input at time $t$ is the image representing the state of the grid, resized to the local-map size using OpenCV's INTER_AREA interpolation method and concatenated with the local visibility map of the agent, forming a 2-channel image. The action space has 5 actions: front, back, left, right, and stay; each movement action displaces the agent by one pixel. For testing the learned trajectories, we evaluate them for 100 episodes of 1000 time steps each in the respective environments. The reward for test episode $k$, denoted $G_k$, is the summation of the rewards over all time steps of the episode, i.e., $G_k = \sum_{t=0}^{T} R(t)$, and the final reward after $E$ episodes is the average over episodes, i.e., $\bar{G} = \frac{1}{E}\sum_{k=1}^{E} G_k$. $\bar{G}$ is used to evaluate the performance of the model.
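A minimal sketch of this observation construction, assuming a NumPy penalty grid and using OpenCV's INTER_AREA resize as stated above; the helper name, the zero-padding at the borders, and the toy sizes are illustrative.

import cv2
import numpy as np

def build_observation(global_penalty, agent_rc, local_size):
    """2-channel observation: agent-centred local map + resized global mini map.

    global_penalty : (m, n) array of current cell penalties.
    agent_rc       : (row, col) of the agent in the grid.
    local_size     : side length s of the s x s local map.
    """
    r, c = agent_rc
    half = local_size // 2
    # Pad so the crop is always s x s, even near the borders.
    padded = np.pad(global_penalty, half, mode='constant')
    local_map = padded[r:r + local_size, c:c + local_size]
    # Compress the whole grid to the local-map size (mini map).
    mini_map = cv2.resize(global_penalty.astype(np.float32),
                          (local_size, local_size),
                          interpolation=cv2.INTER_AREA)
    return np.stack([local_map, mini_map], axis=0)  # shape (2, s, s)

obs = build_observation(np.random.rand(40, 60), agent_rc=(10, 20), local_size=9)
print(obs.shape)  # (2, 9, 9)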

Evaluation: For the 2-room map, 3 agents (1 anchor and 2 auxiliary agents) with a communication range of 20 units are deployed. For the 4-room and open-room maps, we deploy 4 agents (2 anchor and 2 auxiliary agents) with a 20-unit communication range. Figures 4(a)-4(c) show the training curves for these environments. From the figures, we can see that training has saturated for the 2-room and open-room maps, while for the 4-room map the saturation point was being approached when training was stopped at 50000 episodes. The simulation parameters used in training and testing are provided in Appendix B.

Fig. 3: The (a) 2-room, (b) 4-room, and (c) open-room maps. The agents cannot move into black pixels, while the non-black regions need to be persistently monitored. As the anchor agents (red stars) and auxiliary agents (dark blue triangles) monitor, their trajectories are shown as fading white trails over the last 30 steps. The communication range between the agents is shown with red lines. (d)-(f) The trajectories of the anchor and auxiliary agents while monitoring.
Fig. 4: Comparison of the training curves (rolling average over 100 episodes with standard-error band) for the (a) 2-room, (b) 4-room, and (c) open-room environments.

Comparing the performance of GALOPP with non-RL baselines: Due to the localization constraints in the PM problem, the motions of the anchor agents and the auxiliary agents become coupled, so generating deterministic motion strategies for these heterogeneous agents becomes highly complex. Therefore, we compare the performance of GALOPP with 3 non-RL baselines: Random Search (RS), Random Search with Ensured Communication (RSEC), and Greedy Decentralized Search (GDCS). In RS, each agent randomly selects an action from the action space. In RSEC, every agent chooses a random action from among those that keep every auxiliary agent localized; thus the agents are always localized.

In GDCS, agents act independently and greedily. Given that an agent $a$ has an $s \times s$ visibility range, is currently located at position $p_a$, and $V_a$ denotes the set of grid cells that fall within the unobstructed line of sight of agent $a$, we define the frontier set $F_a$ as the set of cells that are just one step beyond agent $a$'s visibility range. Agent $a$ chooses an action that takes it towards the cell with the maximum penalty in $F_a$, without considering localization constraints (a greedy-selection sketch is given below). If all the grid cells in $F_a$ have the same penalty, then $a$ chooses a random action.
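The sketch below illustrates this greedy rule, assuming the frontier set has already been computed; the mapping of the five actions to pixel offsets is an assumption made for illustration, not the authors' implementation.

import random
import numpy as np

# Assumed action set matching the paper's five actions (one-pixel moves or stay).
ACTIONS = {'front': (-1, 0), 'back': (1, 0), 'left': (0, -1),
           'right': (0, 1), 'stay': (0, 0)}

def gdcs_action(agent_rc, frontier_cells, penalty):
    """Greedy Decentralized Search: move towards the highest-penalty frontier cell.

    agent_rc       : (row, col) position of the agent.
    frontier_cells : list of (row, col) cells one step beyond the visibility range.
    penalty        : (m, n) array of accumulated penalties.
    """
    penalties = [penalty[r, c] for r, c in frontier_cells]
    if len(set(penalties)) <= 1:                       # all equal -> random action
        return random.choice(list(ACTIONS))
    tr, tc = frontier_cells[int(np.argmax(penalties))]
    r, c = agent_rc
    # Pick the single-step action that most reduces the distance to the target cell.
    return min(ACTIONS, key=lambda a: (r + ACTIONS[a][0] - tr) ** 2
                                      + (c + ACTIONS[a][1] - tc) ** 2)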

Figure 5(a) compares the performance of our architecture with the baselines. In the 2-room map, GDCS and GALOPP always outperform the random baselines (RS and RSEC) by a significant margin. Comparing GDCS and GALOPP, GALOPP outperforms GDCS on average, with a much smaller standard deviation, showing consistent performance. The GDCS performance is highly susceptible to the initialization positions of the agents: GDCS performs well only when the agents are initialized such that most of the cells fall within their unobstructed line of sight. When the agents are initialized in unfavourable locations, such as the corners of one of the rooms, GDCS gets the agents stuck in sub-optimal trajectories, and some grid cells reach the maximum penalty. On the other hand, GALOPP adapts to random initialization positions and plans the trajectories accordingly. In the 2-room map, we observe that our algorithm settles into a formation where two of the agents position themselves in the two rooms while one agent monitors the corridor. This can be seen in Figure 3(a), where the faded cells show the trajectories followed by the agents over the last 30 steps. Figure 3(d) shows the actual areas where each agent was present; from this we can see that the anchor stayed in the middle region, while the two auxiliary agents monitored the two rooms.

Fig. 5: (a) Comparison of the performance of GALOPP with the non-RL baselines. The 2-room environment had 3 agents with 1 anchor, while the 4-room and open-room environments had 4 agents with 2 anchors. (b) Comparison of the mean unlocalization time (± std. deviation) of the auxiliary agents using GALOPP vs. the non-RL baselines in the 2-room, 4-room, and open-room environments.
Fig. 6: (a) Overview of how agents that are within communication range of one another update their global maps in the decentralized setting. The resultant global map is generated by taking the element-wise maximum value from the individual global maps of the agents. (b) Average reward comparison between the centralized and decentralized implementations.

In the 4-room map, GDCS performs poorly (see Figure 5(a)) due to the narrow passages that lead to the individual rooms: they obstruct the agents' view of the grid cells within the rooms, leaving a subset of the rooms neglected. This shows that GDCS does not adapt well to complex environments. GALOPP outperforms all the strategies, with the ability to adapt to complex environments. Even in this case, our algorithm learns trajectories that maintain a formation in which each of the 4 agents monitors a room and intermittently exits it to monitor the central corridor region, as shown in Figures 3(b) and 3(e). The anchor agents monitor two of the rooms and the central area, while the auxiliary agents monitor the other two rooms.

In the open-room map, GALOPP outperforms all the baselines (see Figure 5(a)). The agents follow a circular trajectory, with one anchor agent monitoring the central part while the rest circle around the environment. This can be seen prominently in Figure 3(f): the anchor agents occupy the middle region while the auxiliary agents follow a roundabout trajectory across the environment.

We also analyzed the effect of increasing the communication range over 20, 25, and 30 units. With a 20-unit communication range, the agents can communicate with each other most of the time. With communication ranges of 25 and 30 units, the agents are always connected and hence could not achieve better performance than with a communication range of 20. Therefore, we do not report those results here due to space limitations; they can be found in [gallop].

Figure 5(b) shows the average unlocalization time per episode (along with the standard deviation). We can see that the number of unlocalized time steps for all the strategies is very small compared to the complete mission time of 1000 steps. Since a communication range of 20 units is sufficient for the agents to communicate with each other, the auxiliary agents are almost always localized, resulting in persistent monitoring of the region.

Decentralized maps: The training of the agents was carried out in a centralized setting, where a fully shared global map was available to all agents even when they were not in communication range. A fully centralized map hampers the scalability of the system to larger environments. To circumvent this, we equip each agent with its own copy of the global map. Each agent continues to update its copy of the global map, and the monitoring awareness is propagated through inter-agent connectivity. Figure 6(a) shows a schematic representation of the decentralized map update. For a given network graph, the connected agents compare and aggregate their global maps at each time step by taking the element-wise maximum for each grid cell in the environment. In Figure 6(b), we compare the performance of the decentralized map with the centralized map for the 2-room, 4-room, and open-room maps using the same agent configuration as in Section V. From the figure, we can see that the decentralized map performance is, on average, comparable to that of its centralized counterpart.
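A minimal sketch of this element-wise-maximum map merge, assuming reward maps stored as non-positive values (0 = just observed, more negative = staler) and precomputed connected components; all names are illustrative.

import numpy as np

def merge_global_maps(reward_maps, connected_groups):
    """Decentralized map aggregation: within each connected group of agents,
    every agent keeps the element-wise maximum of the group's reward maps.

    reward_maps      : list of (m, n) arrays, one per agent.
    connected_groups : list of lists of agent indices that can communicate.
    """
    merged = [m.copy() for m in reward_maps]
    for group in connected_groups:
        group_max = np.maximum.reduce([reward_maps[i] for i in group])
        for i in group:
            merged[i] = group_max
    return merged

# Example: agents 0 and 1 are connected, agent 2 is isolated.
maps = [np.full((4, 4), -10.0), np.full((4, 4), -3.0), np.full((4, 4), -7.0)]
maps[0][0, 0] = 0.0                       # agent 0 just observed cell (0, 0)
merged = merge_global_maps(maps, [[0, 1], [2]])
print(merged[1][0, 0])                    # 0.0 -- propagated from agent 0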

VI Conclusion

This paper developed GALOPP, a MADRL approach with graph attention, for persistently monitoring a bounded region while taking communication, sensing, and localization constraints into account. The agents learnt the environment and adapted their trajectories to satisfy the conflicting objectives of coverage and localization based on the environmental conditions. The experiments show that agents using GALOPP outperform the greedy and random baseline strategies. This work can be further extended to study scalability to larger numbers of agents, robustness to varying environments, and resilience to agent failures.

References

  • [1] J. Blumenkamp and A. Prorok (Cambridge, MA, USA, 2020) The emergence of adversarial communication in multi-agent reinforcement learning. Conference on Robot Learning. Cited by: §II.
  • [2] J. Chen, A. Baskaran, Z. Zhang, and P. Tokekar (2020) Multi-agent reinforcement learning for persistent monitoring. arXiv preprint arXiv:2011.01129. Cited by: §I, §II, §III-A, §IV-B, §IV-C.
  • [3] L. Du, L. Chen, X. Hou, and Y. Chen (2019) Cooperative vehicle localization base on extended kalman filter in intelligent transportation system. In Wireless and Optical Communications Conference, pp. 1–5. Cited by: §II.
  • [4] E. Galceran and M. Carreras (2013) A survey on coverage path planning for robotics. Robotics and Autonomous systems 61 (12), pp. 1258–1276. Cited by: §II.
  • [5] S. K. K. Hari, S. Rathinam, S. Darbha, K. Kalyanam, S. G. Manyam, and D. Casbeer (2019) The generalized persistent monitoring problem. In American Control Conference, pp. 2783–2788. Cited by: §I.
  • [6] S. K. K. Hari, S. Rathinam, S. Darbha, K. Kalyanam, S. G. Manyam, and D. Casbeer (2021) Optimal uav route planning for persistent monitoring missions. IEEE Transactions on Robotics 37 (2). Cited by: §I.
  • [7] R. Khodayi-mehr, Y. Kantaros, and M. M. Zavlanos (2019) Distributed state estimation using intermittently connected robot networks. IEEE Transactions on Robotics 35 (3), pp. 709–724. Cited by: §I, §III-B.
  • [8] F. Klaesson, P. Nilsson, A. D. Ames, and R. M. Murray (2020) Intermittent connectivity for exploration in communication-constrained multi-agent systems. In ACM/IEEE International Conference on Cyber-Physical Systems, pp. 196–205. Cited by: §III-B.
  • [9] Y. LeCun, Y. Bengio, et al. (1995) Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361 (10), pp. 1995. Cited by: §IV-A.
  • [10] Q. Li, F. Gama, A. Ribeiro, and A. Prorok (2019) Graph neural networks for decentralized multi-robot path planning. IEEE International Conference on Intelligent Robots and Systems. Cited by: §II.
  • [11] Q. Li, W. Lin, Z. Liu, and A. Prorok (2021) Message-aware graph attention networks for large-scale multi-robot path planning. IEEE Robotics and Automation Letters 6 (3), pp. 5533–5540. Cited by: §II.
  • [12] X. Lin and C. G. Cassandras (2014) An optimal control approach to the multi-agent persistent monitoring problem in two-dimensional spaces. IEEE Transactions on Automatic Control 60 (6), pp. 1659–1664. Cited by: §I, §II.
  • [13] J. Liu, J. Pu, L. Sun, and Y. Zhang (2018) Multi-robot cooperative localization with range-only measurement by uwb. In Chinese Automation Congress, pp. 2809–2813. Cited by: §I, §II.
  • [14] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch (Long Beach, California, USA, 2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6382–6393. External Links: ISBN 9781510860964 Cited by: §II.
  • [15] D. Maravall, J. de Lope, and R. Domínguez (2013) Coordination of communication in robot teams by reinforcement learning. Robotics and Autonomous Systems 61 (7), pp. 661–666. Cited by: §II.
  • [16] A. I. Mourikis and S. I. Roumeliotis (2006) Performance analysis of multirobot cooperative localization. IEEE Transactions on robotics 22 (4), pp. 666–681. Cited by: §II.
  • [17] J. O'Rourke et al. (1987) Art gallery theorems and algorithms. Vol. 57, Oxford University Press, Oxford. Cited by: §II.
  • [18] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian (Sydney, Australia, 2017) Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In International Conference on Machine Learning, pp. 2681–2690. Cited by: §II.
  • [19] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (Lille, France, 2015) Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §IV-C.
  • [20] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §IV-A.
  • [21] R. Shah, Y. Jiang, J. Hart, and P. Stone (Las Vegas, 2020) Deep r-learning for continual area sweeping. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5542–5547. Cited by: §II.
  • [22] R. Sharma, R. W. Beard, C. N. Taylor, and S. Quebe (2011) Graph-based observability analysis of bearing-only cooperative localization. IEEE Transactions on Robotics 28 (2), pp. 522–529. Cited by: §I, §III-B.
  • [23] S. L. Smith, M. Schwager, and D. Rus (2011) Persistent monitoring of changing environments using a robot with limited range sensing. In IEEE International Conference on Robotics and Automation, pp. 5448–5455. Cited by: §I.
  • [24] X. Tan (2001) Fast computation of shortest watchman routes in simple polygons. Information Processing Letters 77 (1), pp. 27–33. Cited by: §II.
  • [25] P. Tokekar and V. Kumar (2015) Visibility-based persistent monitoring with robot teams. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3387–3394. Cited by: §I, §II.
  • [26] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (Vancouver, Canada, 2018) Graph Attention Networks. International Conference on Learning Representations. Cited by: §IV-A, §IV-B.
  • [27] J. Yu, S. Karaman, and D. Rus (2015) Persistent monitoring of events with stochastic arrivals at multiple stations. IEEE Transactions on Robotics 31 (3), pp. 521–535. Cited by: §I, §II.
  • [28] J. Zhu and S. S. Kia (2019) Cooperative localization under limited connectivity. IEEE Transactions on Robotics 35 (6), pp. 1523–1530. Cited by: §I, §II.

Appendix A Using a Kalman Filter for state estimation

In cooperative localization (CL), one way an auxiliary agent can localize itself is by observing its environment. In our setting, this observation is made by an auxiliary agent observing other agents. We assume the following: (1) all agents are equipped with sensors that provide accurate range and bearing information about other agents that are within the current agent's communication range, and (2) all agents are initialized with an accurate estimate of their initial position.

Fig. 7: Agent $u$ observing the relative position of agent $v$ with respect to itself.

To handle the uncertainty in its state estimate once an auxiliary agent has been disconnected from an anchor, we apply a Kalman Filter (KF) to update its state mean and covariance. The KF propagates the uncertainty in the position of the auxiliary agent as long as it is unlocalized; upon localization, the agent is made aware of its true location. The motion model of the auxiliary agent is given as:

$$x_{t+1} = A x_t + B u_t + \epsilon_t, \qquad (6)$$

where $x_t$ and $x_{t+1}$ are the positions of the agent at times $t$ and $t+1$, respectively, and $\epsilon_t$ is a random variable (representing the error of the IMU suite) drawn from a normal distribution with zero mean and covariance $R$. $u_t$ is the control input at time $t$; it takes a unit step in the corresponding direction for the actions up, down, left, and right, and is zero for no action. $A$ and $B$ are identity matrices.

The agents are equipped with range and bearing sensors that provide the relative distance and bearing of other agents within their observable range. Figure 7 shows an agent $u$ observing another agent $v$. The position of $u$ in global coordinates is $(x_u, y_u)$ and that of $v$ is $(x_v, y_v)$. Agent $u$ measures the relative distance and bearing of $v$ in polar coordinates $(r, \phi)$, where $r$ is the relative distance and $\phi$ is the relative bearing of $v$. This can be converted into Cartesian coordinates as $(r\cos\phi,\ r\sin\phi)$, which are the coordinates of $v$ relative to $u$. Given this information, the observation model for the agents can be stated as:

$$z_t = C x_t + \delta_t, \qquad (7)$$

where $z_t$ is the observation (in our setting, the position estimate obtained from the relative range-bearing measurement), $C$ is an identity matrix, $x_t$ is the current state estimate, and $\delta_t$ is the error in the observation obtained from the sensors, drawn from a normal distribution with zero mean and covariance $Q$.

A Kalman filter represents the state estimate by its moments: at time $t$, the estimated position (or belief) of the agent is given by the mean $\mu_t$ and the covariance $\Sigma_t$ (that is, at time $t$ the position of the agent can be estimated from a 2-D Gaussian distribution with mean $\mu_t$ and covariance $\Sigma_t$). $\mu_t$ and $\Sigma_t$ are updated at every time step by the Kalman Filter algorithm, which takes as input the mean and covariance of the previous time step ($\mu_{t-1}$ and $\Sigma_{t-1}$, respectively), the control input $u_t$, and the observation $z_t$ if there is another agent within communication range (indicated by the boolean variable gotObservation). Algorithm 1 summarizes the Kalman Filter algorithm.

if gotObservation = True then
       $\bar{\mu}_t = A\mu_{t-1} + B u_t$;  $\bar{\Sigma}_t = A\Sigma_{t-1}A^{\top} + R$
       $K_t = \bar{\Sigma}_t C^{\top}\big(C\bar{\Sigma}_t C^{\top} + Q\big)^{-1}$
       $\mu_t = \bar{\mu}_t + K_t\big(z_t - C\bar{\mu}_t\big)$;  $\Sigma_t = (I - K_t C)\bar{\Sigma}_t$
       return $\mu_t$, $\Sigma_t$
else
       return $A\mu_{t-1} + B u_t$,  $A\Sigma_{t-1}A^{\top} + R$
end if
Algorithm 1 KalmanFilter ($\mu_{t-1}$, $\Sigma_{t-1}$, $u_t$, $z_t$, gotObservation)
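For reference, a minimal NumPy sketch of this filter, assuming 2-D positions and the identity A, B, and C matrices of Equations (6) and (7); the noise covariances and all names are placeholders, not the values used in the experiments.

import numpy as np

def kalman_filter(mu_prev, sigma_prev, u, z=None, R=None, Q=None):
    """One step of the position KF from Algorithm 1 (A = B = C = I).

    mu_prev, sigma_prev : previous mean (2,) and covariance (2, 2).
    u                   : control input (one-pixel move or zero).
    z                   : observed position, or None if no agent is in range.
    R, Q                : motion-noise and observation-noise covariances.
    """
    R = np.eye(2) * 0.1 if R is None else R
    Q = np.eye(2) * 0.1 if Q is None else Q
    # Prediction using the motion model of Eq. (6).
    mu_bar = mu_prev + u
    sigma_bar = sigma_prev + R
    if z is None:                       # gotObservation == False
        return mu_bar, sigma_bar
    # Correction using the observation model of Eq. (7).
    K = sigma_bar @ np.linalg.inv(sigma_bar + Q)
    mu = mu_bar + K @ (z - mu_bar)
    sigma = (np.eye(2) - K) @ sigma_bar
    return mu, sigma

# Disconnected step (covariance grows), then a reconnection step with an observation.
mu, sigma = kalman_filter(np.array([5.0, 5.0]), np.eye(2) * 0.01, u=np.array([1.0, 0.0]))
mu, sigma = kalman_filter(mu, sigma, u=np.zeros(2), z=np.array([6.1, 5.0]))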

Appendix B Simulation parameters and model summary

The following subsections present the details of the various parameters used in the models.

B-A Model summary

The GALOPP architecture consists of four deep neural networks, as follows:

Embedding generator: This is a convolutional neural network that takes a 2-channel image (local map and mini map) as input and generates a 32-dimensional feature vector. We then append a 6-dimensional state vector to this feature vector to form a 38-dimensional vector that acts as the embedding for the graph attention network. The state vector is derived by flattening the covariance matrix $\Sigma_i$ of the agent and appending it to the position vector $\mu_i$. The model parameters are given in Table I.

Graph attention network: The embeddings generated by the embedding generator are passed through the graph attention network, which uses 3 attention heads to generate the aggregated embeddings for the actor networks of the individual agents. The network parameters are shown in Table II.

Actor network: The actor takes the embedding generated by the ConvNet and the aggregated information vector from the graph network as input and generates the probability distribution over the available actions. The parameters of the actor network are listed in Table III.

Critic network: The critic network takes the embeddings generated by the ConvNet for each agent and returns the state-value estimate for the current state. The model parameters for the critic model are given in Table IV.

Embedding generator (ConvNet)

ConvLayer1 (in-channels=2, out-channels=16, kernel-size=8, stride=4, padding=(1, 1))

ReLU activation
ConvLayer2 (in-channels=16, out-channels=32, kernel-size=4, stride=2, padding=(1, 1))
ReLU activation
ConvLayer3 (in-channels=32, out-channels=32, kernel-size=3, stride=1, padding=(1, 1))
ReLU activation
Flatten
Concatenate state vector

TABLE I: Parameters for embedding generator
Graph network
GATLayer (in-features=38, out-features=38, heads=3)

TABLE II: Parameters for graph attention network
Actor network
LinearLayer1 (in-features=38, out-features=500)
ReLU activation
LinearLayer2 (in-features=500, out-features=256)
ReLU activation
LinearLayer3 (in-features=256, out-features=5)
SoftMax

TABLE III: Parameters for actor network
Critic network
LinearLayer1 (in-features=38, out-features=500)
ReLU activation
LinearLayer2 (in-features=500, out-features=256)
ReLU activation
LinearLayer3 (in-features=256, out-features=1)

TABLE IV: Parameters for critic network
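A minimal PyTorch sketch of the embedding generator and actor head following Tables I and III. Since the input map size is not listed, an adaptive average pool is added here (an assumption) so that the flattened ConvNet output is exactly 32-dimensional; the 84x84 input in the example is likewise illustrative.

import torch
import torch.nn as nn

class EmbeddingGenerator(nn.Module):
    """ConvNet of Table I; the final adaptive pooling to 1x1 is an assumption made
    here so the flattened output is 32-dimensional for any input map size."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=8, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, obs, state_vec):
        # obs: (B, 2, H, W) local map + mini map; state_vec: (B, 6) position + flattened covariance.
        return torch.cat([self.conv(obs), state_vec], dim=-1)   # (B, 38)

actor = nn.Sequential(                  # Table III
    nn.Linear(38, 500), nn.ReLU(),
    nn.Linear(500, 256), nn.ReLU(),
    nn.Linear(256, 5), nn.Softmax(dim=-1),
)

emb = EmbeddingGenerator()(torch.randn(1, 2, 84, 84), torch.randn(1, 6))
print(actor(emb).shape)  # torch.Size([1, 5]) -- action probabilities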

B-B Simulation parameters

Table V states the values that were used for the simulations in this paper.

Parameter Value
Decay rate $\beta$ 1
Maximum penalty $R_{\max}$ 400
Length of episode 1000
Agent visibility range
Local map and Mini map size
$R$ (covariance matrix for error in IMU suite)
$Q$ (covariance matrix for uncertainty in sensors)
TABLE V: Simulation Parameters