Optimal sensing coverage is an active research branch. Solutions have been proposed in previous work, for instance, by solving a general locational optimization problem, by using Voronoi partitions [2, 3], potential field methods [4, 5], or scalar field mapping [6, 7]. In most of those works, the authors made assumptions about the mathematical model of the environment, such as the distribution model of the field or a predefined coverage path [8, 9]. In reality, however, it is very difficult to obtain an accurate model, because the relevant data are usually limited or unavailable.
Unmanned aerial vehicles (UAVs), or drones, have become popular in society, with a wide range of applications from the retail business to environmental monitoring. The ability to provide visual information at low cost and with high flexibility makes drones preferred equipment in tasks related to field coverage and monitoring, such as wildfire monitoring, or search and rescue. In such applications, a team of UAVs is usually deployed to increase the coverage range and the reliability of the mission. As with other multi-agent systems [12, 13], an important challenge in designing an autonomous team of UAVs for field coverage is dealing with the dynamic complexity of the interactions between the UAVs so that they can coordinate to accomplish a common team goal.
Model-free learning algorithms, such as reinforcement learning (RL), are a natural approach to address the aforementioned challenges concerning the required accurate mathematical model of the environment and the complex behaviors of the system. These algorithms allow each agent in the team to learn new behaviors, or to reach consensus with others, without depending on a model of the environment. Among them, RL is popular because it is generic enough to address a wide range of problems while remaining simple to implement.
Classic single-agent RL algorithms have been extensively researched in UAV applications. Previous papers focus on applying RL to UAV control to achieve desired trajectory tracking/following, or discuss using RL to improve performance in UAV applications. Multi-Agent Reinforcement Learning (MARL) is also an active field of research. In multi-agent systems (MAS), agents' behaviors cannot be fully designed a priori due to the complicated nature of the interactions; therefore, the ability to learn appropriate behaviors and interactions provides a huge advantage for the system. This is particularly beneficial when new agents are introduced or the environment changes. Recent publications address the possibility of applying MARL to a variety of applications, such as autonomous driving or traffic control.
In robotics, efforts have focused on robotic system coordination and collaboration [22], or multi-target observation. For robot path planning and control, most prior research addresses classic problems, such as navigation and collision avoidance, object carrying by robot teams, or pursuing prey/avoiding predators [26, 27]. Many other papers on multi-robotic systems even simplified the dynamic nature of the system so as to use individual agent learning, such as a classic RL algorithm or an actor-critic model. To the best of our knowledge, few available works have addressed the complexity of MARL in a multi-UAV system and its typical missions, such as optimal sensing coverage. In this paper, we propose how a MARL algorithm can be applied to solve an optimal coverage problem. We address two challenges in MARL: (1) the complex dynamics of the joint actions of the UAV team, which we solve using a game-theoretic correlated equilibrium, and (2) the huge dimensionality of the state space, which we tackle with an efficient space-reduced representation of the value function.
The remainder of the paper is organized as follows. Section II details the optimal field coverage problem formulation. In Section III, we discuss our approach to solving the problem and the design of the learning algorithm; the basics of MARL are also covered. We present our experimental results in Section IV with a comprehensive simulation, followed by an implementation with physical UAVs in a lab setting. Finally, Section V concludes the paper and lays out future work.
II. Problem Formulation
In a mission exploring a new environment, such as monitoring an oil spill or a wildfire area, there is growing interest in sending out a fleet of UAVs acting as a mobile sensor network, as it provides many advantages compared with traditional static monitoring methods. In such a mission, the UAV team needs to surround the field of interest to gather more information, for example, visual data. Suppose that we have a team of quadrotor-type UAVs (Figure 1). Each UAV is an independent decision maker, so the system is distributed. Each UAV can localize itself using an on-board localization system, such as GPS, and can exchange information with the other UAVs through communication links. Each UAV is equipped with an identical downward-facing camera that provides a square field of view (FOV). The camera of each UAV and its FOV form a pyramid with given half-angles (Figure 2). A point is covered by the FOV of a UAV if it satisfies the following conditions:
where the conditions relate the lateral-projected position of the point to the position and altitude of the UAV. The objective of the team is not only to provide full coverage of the field of interest with the UAVs' FOVs, but also to minimize overlap with the FOVs of the other UAVs to improve the efficiency of the team (e.g., minimizing overlap can increase the resolution of field coverage). A UAV covers the field by trying to put a section of it under its FOV. It can enlarge its FOV to cover a larger section by increasing its altitude according to (1); however, it may risk overlapping other UAVs' FOVs in doing so. Formally speaking, let us consider a field of arbitrary shape, and let the positions of the UAVs be given. Each UAV has a square FOV projected on the environment plane, and the areas under the FOVs of the UAVs combine into a covered region. The team has a cost function represented by:
where a weighting function measures the importance of a specific area. In a plain field of interest, this weight is constant.
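The geometric coverage condition in (1) can be sketched as follows; the half-angle theta, the UAV position (x, y, z), and the ground point (qx, qy) are our own notation for illustration, since the original symbols are not reproduced here:

```python
import math

# Illustrative sketch of the FOV condition (1); theta, (x, y, z) and
# (qx, qy) are assumed notation, not necessarily the paper's.
def in_fov(x, y, z, theta, qx, qy):
    """A ground point (qx, qy) lies under the square FOV of a downward
    camera at (x, y, z) with half-angle theta iff it is within the
    half-width z * tan(theta) of the FOV center along both axes."""
    half_width = z * math.tan(theta)
    return abs(qx - x) <= half_width and abs(qy - y) <= half_width
```

Raising the altitude z enlarges the half-width linearly, which is exactly the trade-off discussed above: a larger covered section at the price of a higher risk of overlapping other UAVs' FOVs.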
The problem can be solved using traditional methods, for example, Voronoi partitions [2, 3] or potential field methods [4, 5]. Most of these works propose model-based approaches, in which the authors make assumptions about the mathematical model of the environment, such as the shape of the target [8, 9]. In reality, however, it is very difficult to obtain an accurate model, because data about the environment are usually insufficient or unavailable. This can be problematic, as the system may fail if an incorrect model is used. On the other hand, learning algorithms such as RL, which rely only on data obtained directly from the system, are a natural option to address the problem.
III-A. Reinforcement Learning and Multi-Agent Reinforcement Learning
Classic RL frames learning as the process in which a decision maker, or agent, interacts with the environment. During the learning process, the agent selects the appropriate action when presented with a situation at each state, according to a policy, so as to maximize a numerical reward signal fed back from the environment that measures the performance of the agent. In MAS, the agents interact not only with the environment but also with the other agents in the system, making their interactions more complex. The state transition of the system is more complicated, resulting from a joint action containing the actions of all agents taken at a time step. Each agent must now also consider the states and actions of the other agents in order to coordinate and/or compete with them. Assuming that the environment has the Markov property, where the next state and reward of an agent depend only on the current state, the multi-agent learning model can be generalized as a Markov game, defined by the following elements:
N is the number of agents in the system.
S is the joint state space, S = S_1 × S_2 × ... × S_N, where S_i is the individual state space of agent i. At time step t, the individual state of agent i is denoted s_i^t, and the joint state is denoted s^t = (s_1^t, s_2^t, ..., s_N^t).
A is the joint action space, A = A_1 × A_2 × ... × A_N, where A_i is the individual action space of agent i. The joint action at time t is denoted a^t = (a_1^t, a_2^t, ..., a_N^t), while the individual action of agent i is denoted a_i^t.
P is the transition probability function: the probability of the system moving from joint state s^t to joint state s^{t+1} when the agents take joint action a^t, represented as P(s^{t+1} | s^t, a^t).
R_i is the individual reward function, which specifies the immediate reward of agent i for getting from state s^t at time step t to state s^{t+1} at time step t+1 after taking joint action a^t. In MARL, the team also has a global reward for achieving the team's objective.
The agents seek to optimize the expected reward over an episode by determining which actions will yield the highest return in the long run. In single-agent learning, a value function Q(s, a) helps quantify strategically how good it is for the agent to take action a in state s, by calculating the expected return obtained over an episode. In MARL, the state-action value function of each agent also depends on the joint state s and joint action a, represented as:
where the discount factor of the learning weights future rewards. This function is also called the Q-function. The state space, the action space, and the value function in MARL are clearly much larger than in individual RL, so MARL requires much more memory, which poses a huge challenge to the scalability of the problem.
III-B. Correlated Equilibrium
In order to accomplish the team's goal, the agents must reach consensus in selecting actions. The set of actions that they agree to choose is called a joint action. Such an agreement can be evaluated at an equilibrium, such as a Nash equilibrium (NE) or a correlated equilibrium (CE). Unlike NE, CE can be solved with the help of linear programming (LP). Inspired by prior work on correlated Q-learning, in this work we use a strategy that computes the optimal policy by finding the CE for the agents in the system. From the general problem of finding a CE in game theory, we formulate an LP to find the stable action for each agent as follows:
Here, the decision variables are the probabilities of each UAV selecting each of its actions at time t, given the actions of the other agents. Solving LPs has long been researched by the optimization community; in this work, we use a state-of-the-art solver from that community to solve the above LP.
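As an illustration of how such an LP can be posed, the sketch below finds a welfare-maximizing CE for a toy two-agent, two-action game with hypothetical payoffs. The paper solves the analogous LP over the UAVs' Q-values with CVX; here scipy.optimize.linprog stands in as the solver:

```python
import numpy as np
from scipy.optimize import linprog

# u[i][a0][a1]: payoff of agent i when agent 0 plays a0 and agent 1 plays a1.
# Hypothetical "chicken"-style payoffs, for illustration only.
u = np.array([[[6, 2], [7, 0]],    # agent 0
              [[6, 7], [2, 0]]])   # agent 1
n = 2  # actions per agent

# Variable: p(a0, a1), flattened. Objective: maximize total expected payoff.
c = -(u[0] + u[1]).ravel()

# CE constraints: no agent gains by deviating from a recommended action.
A_ub, b_ub = [], []
for a in range(n):
    for d in range(n):
        if a == d:
            continue
        row = np.zeros((n, n))          # agent 0 deviates from a to d
        row[a, :] = u[0][d, :] - u[0][a, :]
        A_ub.append(row.ravel()); b_ub.append(0.0)
        row = np.zeros((n, n))          # agent 1 deviates from a to d
        row[:, a] = u[1][:, d] - u[1][:, a]
        A_ub.append(row.ravel()); b_ub.append(0.0)

res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
              A_eq=[np.ones(n * n)], b_eq=[1.0], bounds=[(0, 1)] * (n * n))
p = res.x.reshape(n, n)  # correlated joint-action distribution
```

For these payoffs, the CE randomizes over three of the four joint actions and assigns zero probability to the mutually worst one, which is exactly the kind of coordination the UAV team needs when several joint actions are equally attractive.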
III-C. Learning Design
In this section, we design a MARL algorithm to solve the problem formulated in Section II. We assume that the system is fully observable. We also assume that the UAVs are identical, operate in the same environment, and have identical sets of states and actions.
The state space and action space of each agent are represented approximately as discrete finite sets, to guarantee the convergence of the RL algorithm. We consider the environment as a 3-D grid containing a finite set of cubes, with the center of each cube representing a discrete location in the environment. The state of a UAV is defined as its approximate position in the environment, given by the coordinates of the center of a cube at time step t. The objective in (2) now becomes:
where the first term counts the squares, or cells, of the field under the FOV of each UAV, and the second term is the total number of cells overlapping with the FOVs of other UAVs.
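One illustrative reading of this discrete objective is sketched below; the representation of FOVs as sets of grid cells, and the choice to count each overlapped cell once, are our assumptions:

```python
from collections import Counter

def team_objective(fovs):
    """fovs: one set of covered grid cells per UAV (cells of the field of
    interest under that UAV's FOV). Returns the number of distinct covered
    cells minus the number of cells covered by more than one UAV."""
    counts = Counter(cell for fov in fovs for cell in fov)
    covered = len(counts)                               # distinct cells covered
    overlapped = sum(1 for v in counts.values() if v > 1)  # cells shared by UAVs
    return covered - overlapped
```

For example, two UAVs covering {(0,0), (0,1)} and {(0,1), (1,1)} cover three distinct cells but share one, so the objective value is 2.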
To navigate, each UAV can take an action out of a set of six possible actions: heading North, West, South, or East in the lateral directions, or going Up or Down to change altitude. Note that if a UAV is in a state near the border of the environment and selects an action that would take it out of the space, it stays in its current state. The chosen action must belong to an optimal joint-action strategy resulting from (4). Note that when multiple equilibria exist, since each UAV is an independent agent, they may choose different equilibria, making their respective actions deviate from the optimal joint action to a sub-optimal one. To overcome this, we employ a mechanism called social conventions, in which the UAVs take turns carrying out an action. Each UAV is assigned a specific ranking order. When considering the optimal joint-action sets, the UAV with the higher order has priority to choose its action first and lets the subsequent one know its action. The other UAVs can then match their actions to the selected action. To ensure collision avoidance, lower-ranking UAVs cannot take an action that would lead to the newly occupied states of higher-ranking UAVs in the system. In this way, at each time step only one unique joint action is agreed upon among the UAVs.
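The social-convention tie-breaking can be sketched as follows; all names and the stay-still fallback are illustrative, since the paper does not give code for this step:

```python
def resolve_joint_action(optimal_joint_actions, ranks, positions, move):
    """Pick one joint action deterministically by social convention.
    optimal_joint_actions: list of tuples (one action per UAV) from (4).
    ranks: UAV indices, highest priority first.
    move(pos, a): successor cell of a UAV at `pos` taking action `a`."""
    claimed, chosen = set(), {}
    candidates = list(optimal_joint_actions)
    for i in ranks:
        # keep options whose move for UAV i does not enter a cell
        # a higher-ranked UAV has just claimed
        ok = [ja for ja in candidates
              if move(positions[i], ja[i]) not in claimed]
        if ok:                       # commit to the first feasible option
            chosen[i] = ok[0][i]
            claimed.add(move(positions[i], chosen[i]))
            candidates = [ja for ja in ok if ja[i] == chosen[i]]
        else:                        # every option collides: stay still
            chosen[i] = None
            claimed.add(positions[i])
    return tuple(chosen[i] for i in sorted(chosen))
```

Because every UAV filters the candidate set by the choices already announced, all agents end up with the same unique joint action, which is the point of the convention.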
Defining the reward in MARL is another open problem due to the dynamic nature of the system. In this paper, the individual reward that each agent receives can be considered as the total number of cells it covers, minus the cells overlapping with other agents. However, a global team goal helps the team accomplish the task more quickly and also speeds up convergence of the learning process. We define the global team reward as a function that weights the entire team's joint state and joint action at time step t toward achieving (5). An agent receives a reward only if the team's goal is reached:
where the threshold is an acceptable bound on the fraction of the field being covered. During the course of learning, the state-action value function of each agent at time t can be iteratively updated as in multi-agent Q-learning algorithms, similar to those proposed in [25, 30]:
where the learning rate and the discount rate parameterize the RL algorithm, and the equilibrium value term is derived from (4) at the next joint state.
III-D. Approximate Multi-Agent Q-learning
In MARL, each agent updates its value function with respect to the other agents' states and actions; therefore, the dimensions of the state and action variables grow exponentially as the number of agents in the system increases. This makes representing the value function a challenge. Considering the value function in (3), the space needed to store all possible state-action pairs grows accordingly.
Approaches have been proposed in the literature to tackle this problem: using graph theory to decompose the global Q-function into local functions concerning only a subset of the agents, reducing the dimension of the Q-table, or eliminating other agents to reduce the space. However, most previous approaches require additional steps to reduce the space, which may place more pressure on the already-intense computation time. In this work, we employ simple approximation techniques using state- and action-dependent basis functions:
The FSR scheme uses a column vector whose size is the sum of the dimensions of the state space; for example, for a 3-D grid, the vector length is the sum of the sizes of the three coordinate ranges. Each element of the vector is defined as follows:
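A minimal sketch of such a fixed sparse representation over the grid state; the dimension sizes and the block layout are assumed for illustration:

```python
def fsr_features(state, dims):
    """One-hot block per coordinate: a state (x, y, z) on a grid with
    dimension sizes `dims` maps to a binary vector of length sum(dims),
    with a single 1 inside each coordinate's block."""
    phi = [0.0] * sum(dims)
    offset = 0
    for coord, size in zip(state, dims):
        phi[offset + coord] = 1.0   # mark this coordinate's value
        offset += size
    return phi
```

Q(s, a) is then approximated as the dot product of this feature vector with a learned weight vector (one weight vector per action, in a common arrangement), so the number of parameters grows with the sum, not the product, of the dimension sizes.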
In the RBF scheme, we use a column vector of elements, each of which can be calculated as:
where the center and the radius define pre-defined basis functions that have the shape of a Gaussian bell.
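A sketch of the Gaussian radial basis features; the centers and the shared radius are assumed parameters chosen by the designer:

```python
import math

def rbf_features(state, centers, radius):
    """One Gaussian bump per pre-defined center; the feature decays with
    the squared distance between the state and the center."""
    return [math.exp(-sum((s - c) ** 2 for s, c in zip(state, center))
                     / (2.0 * radius ** 2))
            for center in centers]
```

Unlike the FSR one-hot features, these features are dense and smooth, so nearby grid states share similar feature vectors, which trades some accuracy for generalization.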
The parameter vectors in the FSR and RBF schemes are column vectors whose sizes are much smaller than the space required by the original Q-value function. For instance, if we deploy 3 agents in a discretized environment, the original joint Q-table contains an entry for every combination of joint states and joint actions, whereas the approximated parameter vectors in the FSR and RBF schemes require only a tiny fraction of that space.
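The savings can be checked with back-of-the-envelope arithmetic; the grid size, team size, and action count below are hypothetical, since the paper's exact figures are not reproduced here:

```python
# Hypothetical setup: 3 UAVs on a 10 x 10 x 5 grid, 6 actions each.
n_agents, dims, n_actions = 3, (10, 10, 5), 6
cells = 10 * 10 * 5                       # discrete states per UAV
joint_actions = n_actions ** n_agents     # 216 joint actions

# Tabular joint Q-function: one entry per (joint state, joint action) pair.
q_table = (cells ** n_agents) * joint_actions

# FSR-style linear approximation: one weight per feature per joint action,
# with sum(dims) features per UAV (a common arrangement; see text).
fsr_params = sum(dims) * n_agents * joint_actions
```

Under these assumed numbers the tabular representation needs on the order of 10^10 entries, while the FSR parameter vector needs only a few tens of thousands of weights, a reduction of roughly six orders of magnitude.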
We present our learning process as Algorithm 1. The algorithm requires a learning rate, a discount factor, and an exploration schedule. The learning process is divided into episodes, with the UAVs' states initialized arbitrarily in each episode. We use an epsilon-greedy policy with a large initial exploration rate to increase exploration in the early stages; it is diminished over time to focus on finding the optimal joint action according to (4). Each UAV evaluates its performance based on the global reward function in (6) and updates the approximated value function of its states and actions using the update law (11) in a distributed manner.
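A condensed sketch of one episode of this learning process is given below; the env, phi, ce_action, and ce_value interfaces are our assumptions, standing in for the environment, the basis functions, and the CE computation in (4):

```python
import random

def train_episode(env, theta, phi, ce_action, ce_value, eps,
                  alpha=0.1, gamma=0.9):
    """One episode of the sketched learning loop (illustrative only).
    theta: parameter vector of the linear Q approximation.
    phi(s, a): feature vector; ce_action / ce_value: CE policy and its
    expected value at a joint state, cf. (4)."""
    s = env.reset()
    done = False
    while not done:
        if random.random() < eps:
            a = env.random_joint_action()   # exploration
        else:
            a = ce_action(theta, s)         # CE joint action from (4)
        s_next, r, done = env.step(a)       # global reward per (6)
        # TD update on the linear approximation, cf. update law (11):
        feats = phi(s, a)
        q_sa = sum(t * f for t, f in zip(theta, feats))
        td = r + gamma * ce_value(theta, s_next) - q_sa
        theta = [t + alpha * td * f for t, f in zip(theta, feats)]
        s = s_next
    return theta
```

Each UAV can run this loop locally with its own copy of theta, matching the distributed update described above.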
IV. Experimental Results
We set up a simulation in the MATLAB environment to demonstrate the effectiveness of our proposed algorithm. Consider the environment as a discrete 3-D space, with a field of interest of unknown shape on a grid board (Figure 3). The system has a team of UAVs, each of which can take six possible actions to navigate: forward, backward, left, right, up, or down. Each UAV in the team receives a positive reward only if the team covers the whole field with no overlapping.
We implement the proposed Algorithm 1 with both approximation schemes, FSR and RBF, and compare their performance with a baseline algorithm. In the baseline algorithm, the agents seek to solve the problem by optimizing individual performance, that is, by maximizing their own coverage of the field while staying away from overlap to avoid a penalty for each overlapping square. For the proposed algorithm, both schemes use the same learning rate, discount rate, and greedy policy, with the exploration rate diminished over time. To find the CE for the agents in (4), we utilize the CVX optimization package for MATLAB.
Our MATLAB simulation shows that, with both the FSR and RBF schemes, after some training episodes the proposed algorithm allows the UAV team to organize into several optimal configurations that fully cover the field with no overlap, whereas the baseline algorithm fails in most episodes. Figure 3 shows, in 2-D, how the UAVs coordinated to cover the field in the last learning episode. Figure 4 shows different solutions of the three UAVs' FOV configuration with no overlap. For a clearer view, Figure 5 shows the UAV team and their FOVs in the 3-D environment in the last episode of the FSR scheme.
Figure 6 shows the number of steps per episode the team took to converge to an optimal solution. The baseline algorithm fails to converge, so it took the maximum number of steps (2000), while the two schemes using the proposed algorithm converged nicely. Interestingly, the RBF scheme took longer to converge than the FSR scheme. This is likely due to the difference in accuracy of the approximation techniques, the RBF scheme being less accurate.
In this section, we implement a lab-setting experiment with 2 UAVs covering a field of interest with similar specifications to the simulation, in an environment modeled as a discrete 3-D space. We use Parrot AR Drone 2.0 quadrotors and a motion capture system from Motion Analysis to provide state estimation. The UAVs are controlled by a simple PD position controller.
We carried out the experiment using the FSR scheme, with parameters similar to the simulation, but now with only 2 UAVs. Each UAV receives a positive reward if the team covers the whole field with no overlapping, and none otherwise. The learning rate, discount rate, and exploration rate were as in the simulation, with the exploration rate diminished over time. Similar to the simulation result, the UAV team accomplished the mission, with the two UAVs coordinating to cover the whole field without overlapping each other, as shown in Figure 7.
This paper proposed a MARL algorithm that can be applied to a team of UAVs to enable them to cooperatively learn to provide full coverage of an unknown field of interest, while minimizing the overlap among their fields of view. The complex dynamics of the joint actions of the UAV team were handled using a game-theoretic correlated equilibrium. The challenge of the huge-dimensional state space was tackled with FSR and RBF approximation techniques that significantly reduce the space required to store the variables. We also provided experimental results with both simulation and a physical implementation, showing that the UAVs can successfully learn to accomplish the task without the need for a mathematical model. In the future, we are interested in using deep learning to reduce computation time, especially in finding the CE. We will also consider more demanding applications where the field dynamics are present, such as wildfire monitoring.
-  J. Cortes, S. Martinez, T. Karatas, and F. Bullo, “Coverage control for mobile sensing networks,” IEEE Transactions on robotics and Automation, vol. 20, no. 2, pp. 243–255, 2004.
-  M. Schwager, D. Rus, and J.-J. Slotine, “Decentralized, adaptive coverage control for networked robots,” The International Journal of Robotics Research, vol. 28, no. 3, pp. 357–375, 2009.
-  A. Breitenmoser, M. Schwager, J.-C. Metzger, R. Siegwart, and D. Rus, “Voronoi coverage of non-convex environments with a group of networked robots,” in Robotics and Automation (ICRA), 2010 IEEE International Conference on. IEEE, 2010, pp. 4982–4989.
-  S. S. Ge and Y. J. Cui, “Dynamic motion planning for mobile robots using potential field method,” Autonomous robots, vol. 13, no. 3, pp. 207–222, 2002.
-  M. Schwager, B. J. Julian, M. Angermann, and D. Rus, “Eyes in the sky: Decentralized control for the deployment of robotic camera networks,” Proceedings of the IEEE, vol. 99, no. 9, pp. 1541–1561, 2011.
-  H. M. La and W. Sheng, “Distributed sensor fusion for scalar field mapping using mobile sensor networks,” IEEE Transactions on cybernetics, vol. 43, no. 2, pp. 766–778, 2013.
-  M. T. Nguyen, H. M. La, and K. A. Teague, “Collaborative and compressed mobile sensing for data collection in distributed robotic networks,” IEEE Transactions on Control of Network Systems, vol. PP, no. 99, pp. 1–1, 2017.
-  H. M. La, “Multi-robot swarm for cooperative scalar field mapping,” Handbook of Research on Design, Control, and Modeling of Swarm Robotics, p. 383, 2015.
-  H. M. La, W. Sheng, and J. Chen, “Cooperative and active sensing in mobile sensor networks for scalar field mapping,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 45, no. 1, pp. 1–12, 2015.
-  H. X. Pham, H. M. La, D. Feil-Seifer, and M. Deans, “A distributed control framework for a team of unmanned aerial vehicles for dynamic wildfire tracking,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept 2017, pp. 6648–6653.
-  T. Tomic, K. Schmid, P. Lutz, A. Domel, M. Kassecker, E. Mair, I. L. Grixa, F. Ruess, M. Suppa, and D. Burschka, “Toward a fully autonomous uav: Research platform for indoor and outdoor urban search and rescue,” IEEE robotics & automation magazine, vol. 19, no. 3, pp. 46–56, 2012.
-  T. Nguyen, H. M. La, T. D. Le, and M. Jafari, “Formation control and obstacle avoidance of multiple rectangular agents with limited communication ranges,” IEEE Transactions on Control of Network Systems, vol. 4, no. 4, pp. 680–691, Dec 2017.
-  H. M. La and W. Sheng, “Dynamic target tracking and observing in a mobile sensor network,” Robotics and Autonomous Systems, vol. 60, no. 7, pp. 996 – 1009, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0921889012000565
-  F. Muñoz, E. S. Espinoza Quesada, H. M. La, S. Salazar, S. Commuri, and L. R. Garcia Carrillo, “Adaptive consensus algorithms for real-time operation of multi-agent systems affected by switching network events,” International Journal of Robust and Nonlinear Control, vol. 27, no. 9, pp. 1566–1588, 2017, rnc.3687. [Online]. Available: http://dx.doi.org/10.1002/rnc.3687
-  R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
-  H. Bou-Ammar, H. Voos, and W. Ertel, “Controller design for quadrotor uavs using reinforcement learning,” in Control Applications (CCA), 2010 IEEE International Conference on. IEEE, 2010, pp. 2130–2135.
-  A. Faust, I. Palunko, P. Cruz, R. Fierro, and L. Tapia, “Learning swing-free trajectories for uavs with a suspended load,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 4902–4909.
-  L. Buşoniu, R. Babuška, and B. De Schutter, “Multi-agent reinforcement learning: An overview,” in Innovations in multi-agent systems and applications-1. Springer, 2010, pp. 183–221.
-  S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning for autonomous driving,” arXiv preprint arXiv:1610.03295, 2016.
-  B. Bakker, S. Whiteson, L. Kester, and F. C. Groen, “Traffic light control by multiagent reinforcement learning systems,” in Interactive Collaborative Information Systems. Springer, 2010, pp. 475–510.
-  J. Ma and S. Cameron, “Combining policy search with planning in multi-agent cooperation,” in Robot Soccer World Cup. Springer, 2008, pp. 532–543.
-  M. K. Helwa and A. P. Schoellig, “Multi-robot transfer learning: A dynamical system perspective,” arXiv preprint arXiv:1707.08689, 2017.
-  F. Fernandez and L. E. Parker, “Learning in large cooperative multi-robot domains,” 2001.
-  J. Hu and M. P. Wellman, “Nash q-learning for general-sum stochastic games,” Journal of Machine Learning Research, vol. 4, no. Nov, pp. 1039–1069, 2003.
-  A. K. Sadhu and A. Konar, “Improving the speed of convergence of multi-agent q-learning for cooperative task-planning by a robot-team,” Robotics and Autonomous Systems, vol. 92, pp. 66–80, 2017.
-  Y. Ishiwaka, T. Sato, and Y. Kakazu, “An approach to the pursuit problem on a heterogeneous multiagent system using reinforcement learning,” Robotics and Autonomous Systems, vol. 43, no. 4, pp. 245–256, 2003.
-  H. M. La, R. Lim, and W. Sheng, “Multirobot cooperative learning for predator avoidance,” IEEE Transactions on Control Systems Technology, vol. 23, no. 1, pp. 52–63, 2015.
-  S.-M. Hung and S. N. Givigi, “A q-learning approach to flocking with uavs in a stochastic environment,” IEEE transactions on cybernetics, vol. 47, no. 1, pp. 186–197, 2017.
-  A. A. Adepegba, M. S. Miah, and D. Spinello, “Multi-agent area coverage control using reinforcement learning.” in FLAIRS Conference, 2016, pp. 368–373.
-  A. Nowé, P. Vrancx, and Y.-M. De Hauwere, “Game theory and multi-agent reinforcement learning,” in Reinforcement Learning. Springer, 2012, pp. 441–470.
-  A. Greenwald, K. Hall, and R. Serrano, “Correlated q-learning,” in ICML, vol. 3, 2003, pp. 242–249.
-  C. H. Papadimitriou and T. Roughgarden, “Computing correlated equilibria in multi-player games,” Journal of the ACM (JACM), vol. 55, no. 3, p. 14, 2008.
-  L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement learning and dynamic programming using function approximators. CRC press, 2010, vol. 39.
-  L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,” IEEE Trans. Systems, Man, and Cybernetics, Part C, vol. 38, no. 2, pp. 156–172, 2008.
-  C. Guestrin, M. Lagoudakis, and R. Parr, “Coordinated reinforcement learning,” in ICML, vol. 2, 2002, pp. 227–234.
-  Z. Zhang, D. Zhao, J. Gao, D. Wang, and Y. Dai, “Fmrq—a multiagent reinforcement learning algorithm for fully cooperative tasks,” IEEE transactions on cybernetics, vol. 47, no. 6, pp. 1367–1379, 2017.
-  D. Borrajo, L. E. Parker, et al., “A reinforcement learning algorithm in cooperative multi-robot domains,” Journal of Intelligent and Robotic Systems, vol. 43, no. 2-4, pp. 161–174, 2005.
-  A. Geramifard, T. J. Walsh, S. Tellex, G. Chowdhary, N. Roy, J. P. How, et al., “A tutorial on linear function approximators for dynamic programming and reinforcement learning,” Foundations and Trends® in Machine Learning, vol. 6, no. 4, pp. 375–451, 2013.
-  M. Grant, S. Boyd, and Y. Ye, “Cvx: Matlab software for disciplined convex programming,” 2008.
-  “Motion analysis corporation.” [Online]. Available: https://www.motionanalysis.com/
-  H. X. Pham, H. M. La, D. Feil-Seifer, and L. V. Nguyen, “Autonomous uav navigation using reinforcement learning,” arXiv preprint arXiv:1801.05086, 2018.